The Challenge of Managing Access to New and Novel Forms of Data: An Application of UDC


1 The Challenge of Managing Access to New and Novel Forms of Data: An Application of UDC
Suzanne Barbalet, Discovery Team, UK Data Service, University of Essex
Nathan Cunningham, Associate Director Big Data, UK Data Service, University of Essex
International UDC Seminar 2017, London, September.

2 St Peters Square, Rome.

3 A working definition of Big Data

4 Defining New and Novel Forms of Data
(2013) OECD report on New Data for Understanding the Human Condition:
Category A: Data stemming from the transactions of government, for example, tax and social security systems.
Category B: Data describing official registration or licensing requirements.
Category C: Commercial transactions made by individuals and organisations.
Category D: Internet data, deriving from search and social networking activities.
Category E: Tracking data, monitoring the movement of individuals or physical objects subject to movement by humans.
Category F: Image data, particularly aerial and satellite images but including land-based video images.

5 Challenge 1: Managing New and Novel Forms of Data (NNfD)
The literature currently suggests that some of this data is unmanageable.
OECD Global Science Forum (2013) New Data for Understanding the Human Condition

6 Challenge 2: Topic Searches on the Web
Topic searching is an acknowledged challenge in the literature, and no less so for the discovery of data.
Tay, A. (2016) Managing Volume in Discovery Systems
Jacob, E. K. (2004) Classification and Categorization: A Difference that Makes a Difference
Our solution is interactive topic access.

7 Our approach
The ambiguity in Big Data (NNfD) has challenged us to change our approach to suit the nascent and unstructured nature of NNfD.
We can't do rigorous classification, but we can determine its topics.
We want to look at NNfD through the same lens as we use for our curated collection.

8 What is Different about Metadata for Data?
An index describes the content of an information source, but when the information source:
- is a measurement, not an idea, i.e. in its raw state it has numeric form
- can be dynamic, i.e. social science research captures attitudes and behaviour in a time frame
then the construction of an index is a complex task.

9 Data is a Special Type of Digital Resource
The content of studies will not necessarily relate semantically to the title of the study without knowledge of the research indicators and research question.
Not easy to index: an indexer asks what is being measured + what might be a useful measurement, i.e. both variables and research concepts are indexed.
BUT studies are not difficult to classify: clear subject information is the general rule, for example

10 For example: Subject
Title: Understanding the Importance of Work Histories in Determining Poverty in Old Age: Variables Derived from the English Longitudinal Study of Ageing,
Abstract: This study found that for the most part life-course events as measured here are not strongly associated with the chances of being on a low income in retirement. Topics covered include length of time spent in paid work and in marriage, the timing of retirement, the number of children and timing of childbirth, and whether ill-health as an adult or as a child had been experienced. The modelling looks at the influence of these factors alongside a range of other characteristics such as social class and educational attainment.
Variable concepts are complex.

11 Where does UDC fit in?
HASSET Thesaurus keywords + a classification code for each study in our collection allow us to address both problems:
- known item retrieval
- topic search
[See Tay, A. (2016) Managing Volume in Discovery Systems]
With legacy classification complete we can now:
- create bespoke subject categories interactively
AND
- retrieve specific studies to create automatically generated subject metadata
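The two retrieval modes on this slide can be sketched roughly as follows. This is a minimal illustration, not the UK Data Service implementation: the study records, identifiers and class codes are invented for the example, and matching on a class-code prefix is only an assumption about how a hierarchical notation might be queried.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    study_id: str      # hypothetical identifier
    title: str
    class_code: str    # illustrative class code, not the code assigned in the collection
    keywords: set = field(default_factory=set)


# Invented catalogue entries, for illustration only.
CATALOGUE = [
    Study("S1", "Diet and nutrition survey", "613.2", {"FOOD", "NUTRIENTS", "HEALTH"}),
    Study("S2", "Household food purchases", "613.2", {"FOOD", "CONSUMPTION", "INCOME"}),
    Study("S3", "Work histories and poverty in old age", "364.2", {"INCOME", "EMPLOYMENT"}),
]


def known_item(catalogue, study_id):
    """Known item retrieval: the user already knows which study they want."""
    for study in catalogue:
        if study.study_id == study_id:
            return study
    return None


def topic_search(catalogue, code_prefix, keyword=None):
    """Topic search: narrow by class code first, then optionally by keyword."""
    hits = [s for s in catalogue if s.class_code.startswith(code_prefix)]
    if keyword is not None:
        hits = [s for s in hits if keyword in s.keywords]
    return hits
```

Calling `topic_search(CATALOGUE, "613")` forms a bespoke subject category from the class code alone; adding `keyword="INCOME"` narrows it to the studies indexed with that term.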

12 Transactional or Dynamic Data: can it be managed?
How we curate data for the UK Data Service. We follow a policy of:
- authenticity
- reliability
- logical integrity
NNfD has to meet our rigorous standards, THUS NNfD does require linking with other data sources to ensure the provenance, reliability and integrity of the data.
Interactive topic access bridges the divide between NNfD and other forms of social and economic data in our collection.

13 NNfD can provide valuable research data
It is important that NNfD transactional data is accessed by the same set of thesaurus keywords as our traditional data.
Thesaurus keywords allocated to studies in the bespoke category are mined to derive metadata for a particular NNfD.
Our algorithm will exclude keywords for demographic variables (age, gender etc.) and geography keywords.
Outliers are removed.
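The mining step described on this slide might look something like the sketch below. The slide names the kinds of keyword that are excluded but not the actual lists, and it does not say how outliers are defined, so both are assumptions here: the exclusion sets are placeholders, and a keyword is treated as an outlier when it appears in fewer than a chosen share of the studies in the category. The sample studies and their keywords are invented.

```python
from collections import Counter

# Placeholder exclusion lists (assumed; the real lists are not given on the slide).
DEMOGRAPHIC_KEYWORDS = {"AGE", "GENDER"}
GEOGRAPHY_KEYWORDS = {"ENGLAND", "SCOTLAND"}


def derive_keywords(studies, min_share=0.3):
    """Count keywords across the studies in a bespoke category.

    studies   -- a list of keyword sets, one per study
    min_share -- assumed outlier rule: drop keywords used by fewer than
                 this share of the studies
    Returns (keyword, survey_count) pairs, most frequent first.
    """
    counts = Counter(kw for keywords in studies for kw in set(keywords))
    excluded = DEMOGRAPHIC_KEYWORDS | GEOGRAPHY_KEYWORDS
    threshold = min_share * len(studies)
    kept = {
        kw: n for kw, n in counts.items()
        if kw not in excluded and n >= threshold
    }
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))


# Invented sample: six studies in a diet-and-nutrition category.
SAMPLE_STUDIES = [
    {"FOOD", "AGE", "MILK"},
    {"FOOD", "ENGLAND", "MILK"},
    {"FOOD", "GENDER"},
    {"FOOD", "TEA"},
    {"FOOD"},
    {"FOOD"},
]
```

With the sample above, `derive_keywords(SAMPLE_STUDIES)` keeps FOOD and MILK, drops the demographic and geography terms, and drops TEA because it occurs in only one of the six studies.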

14 NNfD and Social Science Research
NNfD need not mean The End of Theory.
Anderson, C. (2008) The End of Theory: The Data Deluge Makes the Scientific Method Obsolete
NNfD offers researchers data that is not subject to the biases of survey data:
- interviewer bias
- sampling bias
- response bias

15 Case Study: Diet and Nutrition Transactional data will include such topics as dietary requirements. It is an important topic with considerable policy and research implications. We need to be able to link our metadata across to these NNfD from our aggregated and survey data to verify and supplement the primary (NNfD transactional) source.

16 Case Study: Diet and Nutrition
The collection is classified by UDC and indexed with HASSET keywords. This allows us to apply interactive topic access.
Because both are underpinned by the same RDF SKOS (graph) methodology, we can employ some big data technologies such as NoSQL tools and graph-based technologies.
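To illustrate why a shared graph model helps, here is a small sketch that stores SKOS-style statements as plain Python tuples and retrieves every resource indexed under a concept or any of its narrower concepts. `skos:broader`, `skos:prefLabel` and `dct:subject` are standard vocabulary terms, but the `ex:` concepts, the hierarchy between them and the resource names are invented for the example; a production system would presumably use a real triple store rather than a Python set.

```python
# Invented triples (subject, predicate, object); not taken from HASSET or UDC.
TRIPLES = {
    ("ex:food", "skos:prefLabel", "FOOD"),
    ("ex:dairy", "skos:prefLabel", "DAIRY PRODUCTS"),
    ("ex:milk", "skos:prefLabel", "MILK"),
    ("ex:dairy", "skos:broader", "ex:food"),
    ("ex:milk", "skos:broader", "ex:dairy"),
    ("ex:survey1", "dct:subject", "ex:milk"),        # a curated survey
    ("ex:loyaltycard1", "dct:subject", "ex:dairy"),  # a transactional source
    ("ex:survey2", "dct:subject", "ex:tea"),
}


def narrower_closure(triples, concept):
    """Return the concept plus everything reachable by following skos:broader downwards."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == "skos:broader" and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def resources_about(triples, concept):
    """All resources whose dct:subject is the concept or one of its narrower concepts."""
    concepts = narrower_closure(triples, concept)
    return sorted(s for s, p, o in triples if p == "dct:subject" and o in concepts)
```

Because the survey and the transactional source point at concepts in the same graph, one query such as `resources_about(TRIPLES, "ex:food")` returns both.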

17 Case Study: Diet and Nutrition
Row Labels                       Sum of Survey Count
ARTIFICIAL SWEETENERS            8
BEVERAGES                        17
BODY CIRCUMFERENCE MEASUREMENTS  2
BUTTER                           4
CARBOHYDRATES                    8
CEREAL PRODUCTS                  15
CEREALS                          7
CHEESE                           6
CHILD NUTRITION                  15
CHILD OBESITY                    3
CLINICAL TESTS AND MEASUREMENTS  9
COFFEE (BEVERAGE)                8
CONFECTIONERY                    17
CONSUMERS                        9
CONSUMPTION                      14
COOKING                          23
DAIRY PRODUCTS                   15
DIET (LIFESTYLE)                 5
DIET AND EXERCISE                26
DIETARY FIBRE                    6
EATING DISORDERS                 1
ECONOMIC ACTIVITY                20
EDIBLE FATS                      19
EMPLOYMENT                       17
ETHNIC GROUPS                    19
FISH (AS FOOD)                   13
FOOD                             41
FOOD ADDITIVES                   2
FOOD SUPPLEMENTS                 12
FROZEN FOODS                     6
FRUIT                            23
HEALTH                           24
HEALTH FOODS                     10
HEIGHT (PHYSIOLOGY)              24
INCOME                           13
IRON                             2
LEGUMES                          7
MALNUTRITION                     2
MEALS                            17
MEAT                             17
MEDICAL DIETS                    2
MILK                             20
MINERALS                         4
NUTRIENTS                        15
NUTS                             8
OBESITY                          4
ORGANIC FOODS                    9
PACKETED FOODS                   3
POTATOES                         5
POULTRY                          2
PRESERVED FOODS                  5
PROTEINS                         8
SALT                             13
SAVOURY SNACKS                   8
SLIMMING DIETS                   5
SOFT DRINKS                      7
SOFT FRUIT                       1
SPECIAL DIETS                    6
SUGAR                            19
TEA                              8
TINNED FOODS                     10
VEGETABLE OILS                   4
VEGETABLES                       22
VEGETARIANISM                    10
VITAMINS                         14
WEIGHT (PHYSIOLOGY)              21
WEIGHT CONTROL                   4

18 Case Study: Diet and Nutrition

19 Case Study: Diet and Nutrition
Transactional data sources include:
- Apps for fitness
- Diet diaries online
- Telemetry analytics from phones indicating activity such as walking, cycling, flying etc.
- Supermarket loyalty cards
- Credit/debit card administration records
- Dietary preferences from airline bookings, shopping habits, conference bookings etc.
- Subscription transactions to dieting clubs, societies etc.

20 Graph Store Approach

21

22 Workflow for NNfD

23 Reference Architecture of DSaaP
Open source because we can have meaningful common conversations with the community
Hadoop is..

24 Implementation Architecture of DSaaP
- Consumers and Producers
- Security Services
- Deposit Platform
- Discovery Platform
- Information Platform
- Repository
- Support And Maintenance
- Semantic Platform
- Access Platform
- Data Platform
- Preservation Platform

25 Five Safes for Securing Data

26 Operationalising 5 Safes at Scale
DSaaP Hybrid Service Instances:
- On premise Instance
- On premise Instance
- AWS Instance
Common Service Authentication (Kerberos)

27 Conclusions
- NNfD challenges us to generate metadata innovatively, and each form of NNfD poses separate management challenges.
- Data and metadata are intrinsically related, and this method allows us to use that relationship through a common RDF graph approach.
- Transactional or dynamic data calls for a dynamic approach to the generation of metadata.

28 And the future
- provide a simple but powerful approach to adding structured metadata to NNfD so that we can scale up as our collection grows
- support a research community which includes the international scientific community, commercial users of big data and citizens
- be a trusted source of information on the use of new and novel forms of data in developing impactful research

29 Thank you!

30 Questions
Suzanne Barbalet, Discovery Team, UK Data Service
Nathan Cunningham