Data Science, Investigations and Privacy Current Status, Challenges and Solutions


Starrett Consulting, Inc.

Agenda
  Global Privacy
  Big Data and Data Science Introduction: What are they?
  Data Science and Investigations: what's the problem?
  Data-Science Investigative Tools: the solution!
  A Unique Challenge: Automation vs. Human Decisions

(Note: examples, algorithms and technologies presented may be summarized or abbreviated for efficient presentation and to accommodate communication to a lay audience.)

Data Privacy
  Data Privacy: the right of individuals to keep their personal data from being misused or disclosed.
  Personal Data: information that identifies or relates to an identifiable individual.
  Sensitive Personal Data - Examples
    Personal: digital signatures, biometric data, fingerprints, passwords
    Demographic: birth date, marital status, race/ethnicity, health info
    Financial: credit card info, bank account info, earnings
    Government-issued: Social Security number, ID #, tax ID #, driver's license #, passport #

Data Protection Laws of the World
  [Figure: Data Protection Laws of the World. Source: DLA Piper]

Privacy Frameworks & Principles
  Privacy Frameworks
    Fair Information Practice Principles (FIPPs), 1973
    OECD Privacy Framework, 1980 (updated 2013)
    APEC Privacy Framework, 2005
    Generally Accepted Privacy Principles (GAPP), 2009
    Final FTC Privacy Framework, 2012
    General Data Protection Regulation (GDPR), 2016
  General Principles
    Governance
    Notice
    Choice & Consent
    Collection
    Use
    Access
    Retention & Disposal
    Quality & Accuracy
    Other

EU General Data Protection Regulation (GDPR)
  New 2016 EU data protection law
    Replaces EU Data Protection Directive 95/46/EC
    Applies to all companies handling the personal data of individuals in the EU
    Enforceable from May 25, 2018
  New rights for data subjects
    Right to data portability
    Right to erasure
  Key operational requirements
    Data Protection Officer (DPO)
    Breach notification
    Privacy Impact Assessment (PIA)
    Data subject consent
    Cross-border data transfers

Big Data and Data Science Introduction: The Need
  Use of data science is the single most important competitive differentiator for enterprises generally.
  Legal touches every aspect of business and life.
  90% or more of the information we need and use is in electronic form.

Big Data and Data Science Introduction: What are Big Data and Data Science?
  Big data: data that is "out of hand", i.e. too voluminous, complex or fast-moving for conventional methods to handle.
  Focus is on solutions found in data science:
    Data science is an interdisciplinary field about scientific methods to extract knowledge from data.
    It involves subjects in mathematics, statistics, information science, and computer science.
    Thus data science is contextual.

Big Data and Data Science Introduction: Realities
  Many technical verticals.
  A domain (legal) professional must be present to decide how the data-science vertical(s) are applied.
  Forensic in nature. PhD needed?
  Insights and leads generated require follow-up and corroboration.

Big Data and Data Science Introduction: Predictive Analytics Learning Progression
  Advanced Topics / Horizontal Areas
  Machine Learning
  Generalized Linear Models
  Regression and Multivariate Analysis
  Statistics
  Math for Modelers

Big Data and Data Science Introduction: The Data Science in Investigations "Golden Rule"
  Until proven otherwise, data science as used in investigations is a service, not a product!
  (It's not the car, it's the driver!)
  [Figure: Domain and Data Science]

Big Data and Data Science Introduction: Types of Data and Analysis
  Quantitative (numbers)
  Structured (e.g. columns and rows, log files)
  Unstructured (e.g. free-form text, NoSQL database)
  Qualitative (qualities, categories)

Big Data and Data Science Introduction: Data Science Bigger Picture
  [Figure: Unstructured, Structured, Qualitative, Numeric Aspects of Qualitative Features, Quantitative Approaches]

Big Data and Data Science Introduction: Where is identification of personal / sensitive data most challenging?

  Structured Data         Unstructured Data
  Metadata                Free-form Text
  Spreadsheets            Natural Language
  Relational Databases    NoSQL Database
  Key / Value Pairs       (Et cetera)
  (Et cetera)

  Identifying personal and sensitive data in structured data is much easier, so the FOCUS WILL BE HERE: unstructured data.

Data Science and Investigations: Unsupervised Learning
  Exploratory / investigative.
  Clustering, for example:
    K-means
    Hierarchical
  Document / text correlation.
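
The K-means clustering named above can be sketched in a few lines. This is a minimal illustration, not the presenter's tooling: the 2-D points are hypothetical stand-ins (in document clustering each point would be a document's term-weight vector), and the function name `k_means` and the naive "first k points" starting centroids are assumptions made for the example.

```python
from math import dist

def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then move the centroids."""
    centroids = list(points[:k])  # naive deterministic start: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignments, centroids

# Hypothetical data: two visually obvious groups
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

No labels are supplied; the groups emerge from the data alone, which is what makes the method exploratory.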

Data Science and Investigations: Unsupervised Learning - Document Clustering
  [Figure: documents grouped into Topic 1 and Topic 2, each divided into sub-topic clusters]
  Clusters are determined by common words, phrases, concepts, etc. found in the docs.

Data Science and Investigations: Unsupervised and Supervised Learning
  Regression (predictive analytics - numeric)
  [Figure: scatter plot "Salesperson Performance Records (6 months)"; axes: Sales Calls per Week and Sales per Month (Millions); two points marked "Outlier"]
  The regression line ("formula") is used in predictive analytics (compare: correlation).
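
The regression line and outliers described for the salesperson records can be sketched as follows. This is a minimal illustration under stated assumptions: the numbers are hypothetical (the slide's actual data points are not preserved), and the helper names `fit_line` and `flag_outliers` and the residual threshold are invented for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def flag_outliers(records, slope, intercept, threshold):
    """Return indexes of records whose distance from the line exceeds the threshold."""
    return [
        i for i, (x, y) in enumerate(records)
        if abs(y - (slope * x + intercept)) > threshold
    ]

# Hypothetical history: sales calls per week vs. sales per month (millions)
calls = [10, 20, 30, 40, 50]
sales = [1.0, 2.0, 3.0, 4.0, 5.0]
slope, intercept = fit_line(calls, sales)

# New records compared against the fitted line; the second is far below it
new_records = [(25, 2.6), (45, 1.0)]
flagged = flag_outliers(new_records, slope, intercept, threshold=1.0)
print(slope, intercept, flagged)
```

A flagged record is only a lead: in an investigation it still requires follow-up and corroboration.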

Data Science and Investigations: Supervised Learning
  Classification (predictive analytics - categorical)

  Training Data

  Skull shape  Eyebrows  Leg length  Hair length  Tail  Number of ears  Type
  Pointed      Yes       3 inches    Short        Yes   2               Dog
  Round        Yes       2 feet      Short        Yes   2               Dog
  Triangle     No        5 inches    Medium       Yes   2               Cat
  Triangle     No        5 inches    Long         Yes   2               Cat
  Round        Yes       1 foot      Long         Yes   2               Dog
  Triangle     Unk       4 inches    Short        Yes   2               Cat
  Triangle     No        5 inches    Short        Yes   2               Cat

  Predict?
  The same records are presented to the model with the Type column (Dog, Dog, Cat, Cat, Dog, Cat, Cat) held back; the model must fill in the "Predict?" column.

  Predictive Accuracy

  Skull shape  Eyebrows  Leg length  Hair length  Tail  Number of ears  Predict?  Type
  Pointed      Yes       3 inches    Short        Yes   2               Cat       Dog
  Round        Yes       2 feet      Short        Yes   2               Dog       Dog
  Triangle     No        5 inches    Medium       Yes   2               Cat       Cat
  Triangle     No        5 inches    Long         Yes   2               Cat       Cat
  Round        Yes       1 foot      Long         Yes   2               Dog       Dog
  Triangle     Unk       4 inches    Short        Yes   2               Dog       Cat
  Triangle     No        5 inches    Short        Yes   2               Cat       Cat
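
Predictive accuracy for the dog / cat table can be worked out directly from its "Predict?" and "Type" columns. The sketch below uses those seven values as they appear on the slide; the function name `accuracy` is an assumption made for the example.

```python
def accuracy(predicted, actual):
    """Fraction of records where the predicted class equals the true class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct, correct / len(actual)

# "Predict?" and "Type" columns from the Predictive Accuracy table
predicted = ["Cat", "Dog", "Cat", "Cat", "Dog", "Dog", "Cat"]
actual = ["Dog", "Dog", "Cat", "Cat", "Dog", "Cat", "Cat"]

correct, score = accuracy(predicted, actual)
print(f"{correct} of {len(actual)} correct ({score:.0%})")
```

Two of the seven predictions disagree with the true type (rows 1 and 6), so the model is right on five of seven records.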

Data Science and Investigations: Supervised Learning (Classification) Example
  Electronic Discovery and Predictive Coding
  Stages: Identification -> Processing -> Review -> Production

  Document population requiring review for relevancy to lawsuit: e.g. 1 million
  Sample: e.g. 50k
  Patterns found in the sample training data are used to classify documents in the population as relevant and non-relevant.

  Training Data (50k total documents in sample)
  Columns: Date, Metadata, Document Text, Relevant
  Relevant labels shown: Yes, No, Yes, Yes, No, No, Yes

Data Science and Investigations: Supervised Learning (Classification) Example
  Electronic Discovery and Predictive Coding
  1 million files -> Predictive Model (Classifier) -> Relevant / Non-relevant

Data Science and Investigations Example: Clustering and Classification
  Document population requiring review for relevancy to lawsuit: 1 million
  Cluster all 1 million documents: 100s of clusters
  Two clusters look interesting: CEO emails and CFO social media (SM)

  CEO emails: Vendors, Board, CEO's Husband
  CFO social media: CEO's Husband, Bank
  Assume emails and social media messages total [...]

Data Science and Investigations Example: Clustering and Classification
  Use CLUSTERS to help classify DOCs related to LEGAL ISSUES (compare the eDiscovery relevancy review, where docs were randomly selected and manually tagged by attorneys).
  Clusters: CEO emails to Board; CEO emails to Husband; CEO emails to Vendors; CFO SM with CEO's Husband; CFO SM with Bank
  Legal issues: Conspiracy, Fraud, Money Laundering, Contract Breach

  Training Data ([...] total documents from clusters)
  Columns: Date, Metadata, Document Text, Data Type
  Data Type labels shown: Fraud, Conspiracy, Money Laundering, Contract Breach, Fraud, Conspiracy, Contract Breach, Fraud, Money Laundering, Conspiracy, Money Laundering, Conspiracy, Contract Breach

  1 million files -> Predictive Model (Classifier) -> Contract Breach / Money Laundering / Conspiracy / Fraud / None of the above

Data Science and Investigations: Information Retrieval
  Diverse Data into NoSQL Database (Text Repository) to Search Engine
  Original files (fields vary by file type):
    Date, Author, Title, Last edit, Body
    Part Name, Part No., Price, Dept.
    To, From, CC, Sent, Subject, Body
    Date, URL, Title, Web Page Text
    Etc.
  Flow: Original Files -> Text Repository (NoSQL Database; format e.g. JSON, XML) -> Search Engine
  Compare SQL, which has a fixed, structured schema, with a diverse, schemaless, unstructured NoSQL database ("document database").

Data Science and Investigations: Information Retrieval
  Search Index -> Search

Data Science and Investigations: Information Retrieval / Relevancy Ranking
  Sparse Term Matrix to Inverted Index
  [Table: sparse term matrix with one row per document (six docs) and one column per term: apple, dog, cat, blue, simple, ran, time]

  Inverted index:

  Word     Doc
  apple    3, 6
  dog      1, 5
  cat      2
  blue     3
  simple   1, 5
  ran      4
  time     3, 6

  Common words (e.g. "to", "the", "a") and punctuation marks are often removed here. Other conversions, such as converting all characters to lower case or taking root versions of words, are also common. Compound words (phrases) and other additions can be included.
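
The inverted index shown above can be reproduced with a short sketch. The document texts below are hypothetical, chosen only so that the resulting index matches the slide's word-to-document table; the stop-word list, `tokenize`, `build_inverted_index` and `search` are names invented for this illustration, and no stemming is applied.

```python
import re
from collections import defaultdict

STOP_WORDS = {"to", "the", "a"}  # common words removed before indexing

def tokenize(text):
    """Lower-case the text, drop punctuation, and remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in sorted(set(tokenize(docs[doc_id]))):
            index[term].append(doc_id)
    return dict(index)

def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    postings = [set(index.get(term, [])) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical texts that yield the slide's index
docs = {
    1: "The simple dog.",
    2: "A cat",
    3: "A blue apple, time to time",
    4: "Ran",
    5: "Simple dog",
    6: "Apple time",
}
index = build_inverted_index(docs)
print(index)
print(search(index, "the simple dog"))
```

Looking a word up in the index is a single dictionary access, which is why search engines build this structure rather than scanning every document per query.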

Data Science and Investigations: Information Retrieval / Relevancy Ranking - TF-IDF
  Term Frequency: the more often a word appears in a document, the more important that word is to the document.
  Inverse Document Frequency: terms that appear frequently across all documents are unimportant, so their weight is reduced.
  TF-IDF: terms that appear often in a doc are important; those that appear often across the document collection are not. Each word receives an importance score.

Data Science and Investigations: Information Retrieval / Relevancy Ranking - TF-IDF
  [Table: TF-IDF scores, one row per document, with columns for Dog, Cat and other words]
  Search: documents returned in a search for "Dog" and "Cat" are sorted by relevancy. TF-IDF scores for terms in documents weight individual docs up or down.
  OR
  Classify: documents with similar word combinations can be grouped together. This approximates classifying like documents together. Remember the electronic discovery example?

Data Science and Investigations: Information Extraction - Named Entity Extraction
  Original files (fields vary by file type):
    Date, Author, Title, Last edit, Body
    Part Name, Part No., Price, Dept.
    To, From, CC, Sent, Subject, Body
    Date, URL, Title, Web Page Text
    Etc.
  Flow: Original Files -> Text Repository (NoSQL Database; format e.g. JSON, XML) -> Search Engine
  Information extraction occurs on the text repository, NOT on the original files or the search engine index.
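
The TF-IDF idea above can be made concrete with a small calculation. This is a sketch of one common variant (term frequency as a share of the document's words, multiplied by the natural log of total documents over documents containing the term); search engines use several differing formulas. The documents and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return {doc_id: {term: score}} using tf = count / doc length, idf = ln(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical collection: "dog" appears in every document, "cat" and "fish" do not
docs = {1: "dog dog cat", 2: "dog bird", 3: "dog fish fish"}
scores = tf_idf(docs)
for doc_id, terms in scores.items():
    print(doc_id, {term: round(score, 3) for term, score in terms.items()})
```

Because "dog" occurs in all three documents its score is zero everywhere, while "fish", frequent in one document and absent from the rest, receives the highest score.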

Data Science and Investigations: Information Extraction - Named Entity Types (examples)

  NE Type        Examples
  ORGANIZATION   ACFE, American Bar Association
  PERSON         Donald Trump, Hillary Clinton
  LOCATION       Mississippi River, Mt. Whitney
  DATE           [...], January 15th, 2013
  TIME           Four fifty p.m., 0200 hours
  MONEY          $43.15, 90,000 YEN
  FACILITY       Lincoln Memorial, U.S. Treasury Bldg.

  Some commercially available named entity tools have almost 1000 types of entities!

Data Science and Investigations: Information Extraction - Named Entity Extraction
  Raw Text from NoSQL Document Database (not search engine) -> Sentence Segmentation -> Tokenization -> Parts-of-Speech Tagging -> Entity Detection -> Entity Extraction

Data Science and Investigations: Information Extraction - Named Entity Extraction
  POS Tagging
    We/Pronoun  saw/Verb  the/Determiner  brown/Adjective  cats/Noun
    Noun phrases: "We" and "the brown cats"

Data Science and Investigations: Information Extraction - Named Entity Extraction
  Entity identification
  Machine learning determines that each named-entity type follows certain parts-of-speech patterns, for example:
    (PERSON = /N + /N + /N)
    (PERSON Donald/N J./N Trump/N)

Data Science and Investigations: Information Extraction - Keyphrase Extraction
  Extracts key words and word combinations.
  Often identified using TF-IDF-like methods.
  Useful in identifying concepts, important terms, topics, code words and other lingo.
  Also used in document classification and clustering.
  Machine learning techniques can be used to create keyphrase extraction tools.

Data Science and Investigations: Information Extraction - Other Available Information
  Categories
    Input: [...]
    Output: /news/art and entertainment/movies and tv/television/news/international news
  Concepts
    Input: "Natural language processing uses machine learning to analyze text."
    Output: Linguistics, Natural language processing, Machine learning
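
The parts-of-speech pattern for PERSON shown above, three consecutive tokens tagged /N, can be matched with a simple scan. This is only a toy sketch: the tags are written by hand rather than produced by a tagger, the pattern is fixed rather than learned, and the function name `find_person_candidates` is an assumption for the example.

```python
def find_person_candidates(tagged_tokens, pattern_length=3):
    """Return word sequences where exactly `pattern_length` consecutive tokens are tagged N."""
    candidates = []
    run = []
    # A sentinel with a non-noun tag flushes a run that reaches the end of the input
    for word, tag in list(tagged_tokens) + [("", "END")]:
        if tag == "N":
            run.append(word)
        else:
            if len(run) == pattern_length:
                candidates.append(" ".join(run))
            run = []
    return candidates

# Hand-tagged tokens (hypothetical tagger output): N = noun, V = verb, P = preposition
tagged = [
    ("Donald", "N"), ("J.", "N"), ("Trump", "N"),
    ("ran", "V"), ("for", "P"), ("president", "N"),
]
people = find_person_candidates(tagged)
print(people)
```

The single noun "president" does not match the three-noun pattern, so only the full name is returned as a candidate.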

Data Science and Investigations: Information Extraction - Other Available Information
  Emotion
    Input: "I love cities, but I hate the country"
    Output: "cities": joy, "country": anger
  Metadata
    Input: [...]
    Output:
      Author: Paul Starrett
      Title: A state-of-the-art investigations and consulting firm
      Publication date: March 1, 2016
  Semantic Roles
    Input: "In 2016, Trump ran for president"
    Output:
      Subject: Trump
      Action: ran
      Object: for president
  Sentiment
    Input: "Thank you and enjoy your trip!"
    Output: Positive sentiment (score: 0.81)

Data Science and Investigations: Information Extraction - Other Available Information
  Other information:
    Geospatial data: physical addresses are converted to GPS coordinates, so distances can be calculated.
    Topic modeling.
    Lexical analysis.
  Many of the above resources are developed using machine learning, just as named entity extraction is.
  The above information can be used in predictive models to identify sensitive / private data.

Data Science and Investigations: Graphs - Nodes and Edges
  Node 1 -- Edge (relationship) -- Node 2
  A node can be anything: Person, Location, Project, Concept, Association, Account, Document, etc.
  An edge can be anything: Ownership, Parent / child, Lawyer / client, Knows, Member, Married, etc.

Data Science and Investigations: Graphs - Nodes and Edges
  John Doe -- Owns -- 123 Main St.

Data Science and Investigations: Graphs - Exploratory and Predictive
  Exploratory uses
    For investigations, due diligence and to conduct research for predictive models.
  Predictive analytics and graphs
    Typically used inside enterprise / government infrastructure to identify threats.
    Think anomaly detection and machine learning.
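
The node-and-edge model above maps naturally onto a small adjacency structure. The sketch below stores the slide's "John Doe Owns 123 Main St." relationship; the `Graph` class, its method names, and the second edge ("Jane Roe") are hypothetical additions for illustration, not a graph database API.

```python
from collections import defaultdict

class Graph:
    """Directed graph whose edges carry a relationship label."""

    def __init__(self):
        self.edges = defaultdict(list)  # node -> list of (relationship, target)

    def add_edge(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def neighbors(self, node, relationship=None):
        """Targets reachable from `node`, optionally filtered by relationship."""
        return [
            target for rel, target in self.edges.get(node, [])
            if relationship is None or rel == relationship
        ]

graph = Graph()
graph.add_edge("John Doe", "Owns", "123 Main St.")  # from the slide
graph.add_edge("John Doe", "Knows", "Jane Roe")     # hypothetical extra edge

owned = graph.neighbors("John Doe", "Owns")
everything = graph.neighbors("John Doe")
print(owned, everything)
```

Exploratory use amounts to asking questions like these of the graph: what does this person own, and who are they connected to?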

Data Science and Investigations: Graphs (Example)
  [Figure: Trump organizations' links to advisors and auditors. Exploration only.]
  (Data courtesy of Bureau van Dijk. Visualization rendered in Polinode. Graph created by Starrett Consulting, Inc.)

Data Science and Investigations: Summary
  Use unsupervised methods to summarize and categorize data for focus and prioritization.
    Helps identify legal, regulatory and policy issues along with identifying supporting facts.
  Use supervised methods to identify certain information in other data.
    Helps mine other data to capture or identify known facts or issues (often as identified in unsupervised learning).

Data Science and Investigations: What's the Problem?
  How do we investigate without running afoul of privacy and compliance regulations?
  Why not apply certain data-science investigative tools to this problem?
  Certain tools are perfect for identifying and gathering personal and sensitive data! Hence, the solution.
  But wait, we're not done! Enter GDPR (stay tuned!)

Data-Science Investigative Tools as Solution
  Previous tools used to find personal and sensitive data
    Use information extraction to find data that is personal and sensitive.
    Use information retrieval (search technologies) to further refine personal and sensitive data.
    No reason clustering and graph databases cannot be used.

Data-Science Investigative Tools as Solution: Information Retrieval
  Diverse Data into NoSQL Database (Text Repository) to Search Engine
  Original files (fields vary by file type):
    Date, Author, Title, Last edit, Body
    Part Name, Part No., Price, Dept.
    To, From, CC, Sent, Subject, Body
    Date, URL, Title, Web Page Text
    Etc.
  Text Repository (NoSQL Database)
    Information extraction at document level: Categories, Concepts, Emotion, Entities, Keywords, Metadata, Keyphrases, Semantic Roles, Sentiment
    Refine with: Clustering? Search engine?

Data-Science Investigative Tools as Solution: Information Extraction - High-Level Flow
  Use EXTRACTED data from a document:
    Categories, Concepts, Emotion, Entities, Keywords, Metadata, Keyphrases, Semantic Roles, Sentiment
  To help identify DOCS containing personal / sensitive data:
    Health, Name / ID, Sex Life, Psychological, Location, Political opinion (etc.)
  (This process often involves active human review, i.e. deciding whether extracted data will classify a document as containing Health, Name / ID, Sex Life (etc.) data.)

Data-Science Investigative Tools as Solution: Supervised Learning
  Classification (predictive analytics - categorical)

  Predictive Accuracy

  Skull shape  Eyebrows  Leg length  Hair length  Tail  Number of ears  Predict?  Type
  Pointed      Yes       3 inches    Short        Yes   2               Cat       Dog
  Round        Yes       2 feet      Short        Yes   2               Dog       Dog
  Triangle     No        5 inches    Medium       Yes   2               Cat       Cat
  Triangle     No        5 inches    Long         Yes   2               Cat       Cat
  Round        Yes       1 foot      Long         Yes   2               Dog       Dog
  Triangle     Unk       4 inches    Short        Yes   2               Dog       Cat
  Triangle     No        5 inches    Short        Yes   2               Cat       Cat

  Remember this? Except now we go from two classes (dog / cat) to Health, Name / ID, Sex Life, Psychological, Location, Political opinion, etc.

Data-Science Investigative Tools as Solution: Supervised Learning (Classification) Example
  Classifying Personal / Sensitive Data
  Columns: Date, Metadata, Document Text, Data Type
  Data Type labels shown: Health, Location, Sex Life, Name / ID, Psychological, Health, Sex Life, Health, Religious Belief, Sex Life, Psychological, Location, Health

Data-Science Investigative Tools as Solution: GDPR data classification
  Files Stream -> Predictive Model (Classifier) -> Sensitive / Personal data classes: Health, Location, Sex Life, Name / ID, Psychological (etc.)
  This example is too coarse and generic to be a real-world example, but it communicates the basic concept. Major solution providers are nonetheless using this same basic concept for information-governance data classification.

A Unique Challenge - Automation vs. Human Decisions: GDPR (abbreviated!)
  Generally: individuals have the right not to be subject to a decision when:
    It is based on automated processing, and
    It produces a legal effect or a similarly significant effect on the individual.
  (Don't investigations fit this definition?)
  You must ensure that individuals can:
    Obtain human intervention.
    Express their point of view.
    Obtain an explanation of the decision and challenge it.

A Unique Challenge - Automation vs. Human Decisions
  GDPR and DPA: using personal data to profile
    Put safeguards in place against the risk that a damaging decision is taken without human intervention.
    Establish whether any of your processing operations amount to automated decision making.

A Unique Challenge - Automation vs. Human Decisions: GDPR (abbreviated!)
  Profiling: any form of automated processing to evaluate personal aspects of an individual in order to analyze / predict:
    Performance at work.
    Economic situation.
    Health.
    Personal preferences.
    Reliability.
    Behavior.
    Location.
    Movements.
  (Again, don't investigations fit this definition?)

A Unique Challenge - Automation vs. Human Decisions: GDPR (abbreviated!)
  When profiling, you are required to:
    Ensure processing is fair and transparent by providing meaningful information about the logic involved, as well as the significance and the envisaged consequences.
    Use appropriate mathematical or statistical procedures for the profiling.
    Implement appropriate technical and organizational measures to enable inaccuracies to be corrected and to minimize the risk of errors.
    Not profile a child or use special categories of data (exceptions apply).

A Unique Challenge - Automation vs. Human Decisions: Others
  Credit applications and adverse inference
    May need to explain automated decisions.
  Employment decisions (e.g. resume recommendations)
    Do algorithms or predictive models inadvertently discriminate?

A Unique Challenge - Automation vs. Human Decisions: Solutions!
  Keep any machine learning effort conventional and straightforward.
  Dog vs. wolf example.
  Intuitive assessments are key in interpretability.
  This starts at the outset of automation design.

A Unique Challenge - Automation vs. Human Decisions: Solutions!
  What factors (features) are used?
  Choice of machine learning algorithm.
  How is sampling done (if at all)?
  Details of predictive model testing and validation.
  Software outputs to logs that detail the decision process and can be interpreted in lay terms.
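
The last point, logging the decision process in lay terms, can be sketched as follows. This is a deliberately simple stand-in, not a trained predictive model: a keyword rule labels a document and writes a plain-language reason to a log. The keyword lists, the logger name and the function `classify_with_explanation` are all hypothetical.

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("decision_log")

# Hypothetical keyword lists; a real deployment would use a validated model
KEYWORDS = {
    "Health": {"diagnosis", "prescription"},
    "Location": {"address", "gps"},
}

def classify_with_explanation(doc_id, text):
    """Label a document and log, in plain language, which words drove the decision."""
    words = set(text.lower().split())
    hits = {label: sorted(words & terms) for label, terms in KEYWORDS.items()}
    hits = {label: matched for label, matched in hits.items() if matched}
    if hits:
        # Choose the label with the most matching keywords
        label = max(hits, key=lambda name: len(hits[name]))
        explanation = (
            f"Document {doc_id} labeled '{label}' because it contains: "
            + ", ".join(hits[label])
        )
    else:
        label = "None of the above"
        explanation = f"Document {doc_id} labeled '{label}': no listed keywords found"
    logger.info(explanation)
    return label, explanation

label, explanation = classify_with_explanation(
    "doc-1", "Patient diagnosis and prescription attached"
)
```

Because each decision is tied to named features, the log entry can be handed to a human reviewer or to the affected individual, which is the kind of explanation the GDPR provisions above call for.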

THE END! QUESTIONS?