BIG DATA: MORE DATA MORE INSIGHTS


1 How is the SEC using big data to enhance its analytical capabilities and what public data sets might be of interest for academic analysis? September 7, 2017 BIG DATA: MORE DATA MORE INSIGHTS

2 Mike Willis ASSISTANT DIRECTOR OFFICE OF STRUCTURED DISCLOSURE DIVISION OF ECONOMIC AND RISK ANALYSIS U.S. SECURITIES AND EXCHANGE COMMISSION

3 Disclaimer The Securities and Exchange Commission, as a matter of policy, disclaims responsibility for any private publication or statement by any of its employees. Therefore, the views expressed today are our own, and do not necessarily reflect the views of the Commission or the other members of the staff of the Commission.

4 Discussion Topics
- Know or infer? Now your choice
- How we use Big Data to enhance insights
- Potential Research Topics

5 Perspective What do you see?

6 Quick Refresher from Last Year: What are you using for analysis?
[Figure: General Electric 10-K (As Reported) versus General Electric per Data Aggregator A]

7 Quick Refresher from Last Year: What are you using for analysis?
In the structured XBRL form:
- Research & Development Expense: $5,466M
- Depreciation / Amortization Expense: $4,997M
- Interest and other financial charges: $5,025M
- Other revenue, total Other income of $4,005M
- Earnings (loss) from discontinued operations, net of taxes: ($954M)
[Figure: General Electric per Data Aggregator A]

8 Financial Statement and Notes Data Sets
The Financial Statement and Notes Data Sets provide the text and detailed numeric information from all financial statements and their notes. This data is extracted from exhibits to corporate financial reports filed with the Commission using eXtensible Business Reporting Language (XBRL).
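As a rough illustration of working with data of this kind, the sketch below joins a submissions table to a numeric-facts table and pulls one reported element across filers. The column names (adsh, cik, name, form, tag, ddate, uom, value), the tab-delimited layout, and all sample rows are assumptions made for this example; the data set documentation is the authority on the actual files and fields.

```python
import csv
import io

# Hypothetical in-memory stand-ins for a submissions file and a numeric-facts file.
# Layout and column names are assumed for illustration only.
SUB_TSV = (
    "adsh\tcik\tname\tform\n"
    "0000000001-17-000001\t1111\tEXAMPLE CO A\t10-K\n"
    "0000000002-17-000002\t2222\tEXAMPLE CO B\t10-K\n"
)
NUM_TSV = (
    "adsh\ttag\tddate\tuom\tvalue\n"
    "0000000001-17-000001\tResearchAndDevelopmentExpense\t20161231\tUSD\t5466000000\n"
    "0000000001-17-000001\tInterestExpense\t20161231\tUSD\t5025000000\n"
    "0000000002-17-000002\tResearchAndDevelopmentExpense\t20161231\tUSD\t120000000\n"
)


def load_rows(text):
    """Parse tab-delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def facts_for_tag(sub_rows, num_rows, tag):
    """Return (filer name, period end date, value) for every fact reported under `tag`."""
    names = {row["adsh"]: row["name"] for row in sub_rows}
    return [
        (names[row["adsh"]], row["ddate"], float(row["value"]))
        for row in num_rows
        if row["tag"] == tag
    ]


if __name__ == "__main__":
    subs = load_rows(SUB_TSV)
    nums = load_rows(NUM_TSV)
    for name, ddate, value in facts_for_tag(subs, nums, "ResearchAndDevelopmentExpense"):
        print(name, ddate, value)
```

With real files, `load_rows` would read from disk instead of a string; the join on the accession-number column is the part that carries over.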

9 Common Data Quality Errors
- Scaling errors
- Incomplete tagging
- Inappropriate extensions
- Inappropriate tagging
- Negative values
- Disclosures that are simply not structured
- Missing calculation links
Other Staff Observations and Guidance
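Two of the error types above, scaling errors and negative values, lend themselves to simple screening. The sketch below is one possible heuristic, not a description of any SEC tool: it flags a fact whose value differs from the prior period by roughly a factor of 1,000 or 1,000,000, and flags negative values on elements a reviewer expects to be positive. The element names, the tolerance, and the sample figures are assumptions for illustration.

```python
import math


def flag_scaling_suspects(current, prior, tolerance=0.25):
    """Flag tags whose current/prior ratio is near 10^3 or 10^6 (or the inverse).

    Returns (tag, power) pairs; a negative power means the current value looks
    too small by that power of ten relative to the prior period.
    """
    suspects = []
    for tag, cur in current.items():
        prev = prior.get(tag)
        if not cur or not prev:
            continue  # missing or zero values cannot be compared by ratio
        log_ratio = math.log10(abs(cur) / abs(prev))
        for power in (3, 6):
            if abs(abs(log_ratio) - power) <= tolerance:
                suspects.append((tag, power if log_ratio > 0 else -power))
                break
    return suspects


def flag_unexpected_negatives(facts, expected_positive):
    """Return tags reported as negative although the reviewer expects a positive value."""
    return sorted(tag for tag in expected_positive if facts.get(tag, 0) < 0)


if __name__ == "__main__":
    # Hypothetical values: R&D appears to be entered in thousands instead of dollars.
    prior = {"Revenues": 5_000_000_000, "ResearchAndDevelopmentExpense": 5_100_000}
    current = {
        "Revenues": 5_200_000_000,
        "ResearchAndDevelopmentExpense": 5_466,
        "InterestExpense": -5_025,
    }
    print(flag_scaling_suspects(current, prior))
    print(flag_unexpected_negatives(current, {"Revenues", "InterestExpense"}))
```

A ratio test like this only surfaces candidates for review; a genuine thousand-fold change in a small company's figures would be flagged as well.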

10 How it works
[Diagram labels: EDGAR; Data Filings; Text Filings; Structured Data; Fin Data Sets; Financial Statement Query Viewer; Inline Viewer; Renderer; Corporate Issuer Risk Assessment; Text Analytics; Research Analytics; Other Data, etc.; Public Users; SEC Staff]

11 What's Inside?
- Semantic map: maps or webs of words. The purpose of creating a map is to visually display the meaning-based connections between a word or phrase and a set of related words or concepts.
- MongoDB: an open-source database that uses a document-oriented data model.
- Hadoop: an open-source software framework for distributed storage and distributed processing of very large data sets on computer clusters.
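One simple way to approximate a semantic map is to count which words appear together in the same sentence and treat frequent co-occurrence as a connection. The sketch below does this with the standard library only; it is an illustrative stand-in, not the method used in the tools described here, and the sample sentences and stopword list are made up.

```python
import re
from collections import defaultdict
from itertools import combinations


def build_semantic_map(sentences, stopwords=frozenset()):
    """Count, for each pair of distinct words, how many sentences contain both."""
    cooccurrence = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sorted(
            {w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in stopwords}
        )
        for a, b in combinations(words, 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    return cooccurrence


def related(cooccurrence, word, top=3):
    """Return the `top` words most often seen with `word`, ties broken alphabetically."""
    neighbours = cooccurrence.get(word, {})
    return sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    sentences = [
        "Goodwill impairment charge recorded",
        "Impairment of goodwill and intangible assets",
        "Revenue growth offset impairment",
    ]
    semantic_map = build_semantic_map(sentences, stopwords={"of", "and"})
    print(related(semantic_map, "goodwill", top=2))
```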

12 Corporate Issuer Risk Assessment (CIRA)
- Analytical tool: provides detailed information on various aspects of a company's business activities and financial reporting environment
- Dashboard: enables the user to search, compare and analyze a variety of information about companies through a single intuitive visual interface
- Identify patterns: helps users assess the risks associated with financial reporting with more than 200 variables for thousands of SEC registrants across multiple years
- Approach: based on database approaches used by academic financial accountants and large sample evidence documented in academic literature
- Data sources: uses a variety of structured data

13 Financial Statement Query Viewer (FSQV)
- Intuitive, quick and easy-to-use web browser interface.
- Search and review filings and all facts across all filers in ways not previously possible.
- Search using various criteria (e.g., CIK, ticker, industry, filer status, country).
- Search by Fact (e.g., specific disclosure type and/or specific taxonomy element).
- Search by Text (e.g., any text within a narrative disclosure).
- Compare footnote narrative text differences between periods (e.g., red-line changes).
- Save all results and searches locally for further analysis and reuse.
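The red-line comparison of footnote text between periods can be illustrated with a word-level diff. The sketch below uses Python's standard difflib module and marks deletions as [-...-] and insertions as {+...+}; it shows the general idea only and is not how FSQV itself is implemented. The two footnote sentences are invented.

```python
import difflib


def redline(prior_text, current_text):
    """Return a word-level red-line: [-removed-] and {+added+} around changed runs."""
    prior_words = prior_text.split()
    current_words = current_text.split()
    matcher = difflib.SequenceMatcher(a=prior_words, b=current_words, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(prior_words[i1:i2])
        if op in ("delete", "replace"):
            out.append("[-" + " ".join(prior_words[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            out.append("{+" + " ".join(current_words[j1:j2]) + "+}")
    return " ".join(out)


if __name__ == "__main__":
    prior = "The Company recognizes revenue when goods are shipped."
    current = "The Company recognizes revenue when control of goods transfers."
    print(redline(prior, current))
```

Splitting on whitespace keeps punctuation attached to words, which is adequate for a sketch; a production comparison would likely normalize punctuation and align paragraphs first.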

14 Inline XBRL Viewer
- Single document: structure the actual filing rather than a separate copy (attachment) to the filing
- Familiar: view within financial statement browser view to review structured data
- Enhancing review: search and filter filing by keyword or concept (e.g., FASB references)
- Navigation: use Table of Contents to quickly jump to financial statements and footnotes
- Improving data quality: assist staff reviews (e.g., identify mislabeled or untagged information); eliminate the need to reconcile 2 different documents (HTML and XBRL)
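Because Inline XBRL embeds the tags in the human-readable document itself, the facts can be read straight out of the filing's markup. The sketch below parses a small, made-up XHTML fragment, applies the scale and sign attributes of each ix:nonFraction element, and filters by concept name. It assumes simply formatted numbers with comma separators and ignores the format attribute and other transformation rules, so treat it as an outline rather than a conforming processor.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

IX_NS = "http://www.xbrl.org/2013/inlineXBRL"

# Hypothetical fragment; the contexts, units, and wording are invented.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:ix="http://www.xbrl.org/2013/inlineXBRL">
  <body>
    <p>Research and development expense was $<ix:nonFraction
        name="us-gaap:ResearchAndDevelopmentExpense" contextRef="FY2016"
        unitRef="USD" scale="6" decimals="-6">5,466</ix:nonFraction> million.</p>
    <p>Loss from discontinued operations was $(<ix:nonFraction
        name="us-gaap:IncomeLossFromDiscontinuedOperationsNetOfTax"
        contextRef="FY2016" unitRef="USD" scale="6" decimals="-6"
        sign="-">954</ix:nonFraction>) million.</p>
  </body>
</html>"""


def extract_facts(xhtml):
    """Return (concept name, context, value) for each ix:nonFraction element."""
    root = ET.fromstring(xhtml)
    facts = []
    for el in root.iter("{%s}nonFraction" % IX_NS):
        text = "".join(el.itertext()).replace(",", "").strip()
        value = Decimal(text) * (Decimal(10) ** int(el.get("scale", "0")))
        if el.get("sign") == "-":
            value = -value  # the displayed number is unsigned; sign carries negativity
        facts.append((el.get("name"), el.get("contextRef"), value))
    return facts


def filter_by_concept(facts, keyword):
    """Keep facts whose concept name contains `keyword`, ignoring case."""
    return [fact for fact in facts if keyword.lower() in fact[0].lower()]


if __name__ == "__main__":
    for name, context, value in extract_facts(SAMPLE):
        print(name, context, value)
```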

15 Inline XBRL Viewer Video

16 So what... changes?
- Early adoption of a particular accounting standard
- Specific combination of disclosures that may reveal a risk pattern
- Compare disclosure and specific sector risk profiles across targeted filers
- Aggregate a specific disclosure across all filers for a target period/year
- Narrative sentiments that are misaligned with the numeric results and ratios
- Statistics or trends on a specific financial disclosure such as net deferred tax assets (liabilities) or income tax expense
- Data quality assessment and searching for issues such as incorrect tagging, use of inappropriate extensions, and scaling errors
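Aggregating one disclosure across filers and computing statistics on it reduces to a group-and-summarize step once the facts are structured. The sketch below groups a single element by sector and reports the filer count, total, and median. The record layout, sector labels, and values are hypothetical.

```python
from collections import defaultdict
from statistics import median


def summarize_by_sector(facts, tag):
    """Group the values reported under `tag` by sector and summarize each group."""
    by_sector = defaultdict(list)
    for fact in facts:
        if fact["tag"] == tag:
            by_sector[fact["sector"]].append(fact["value"])
    return {
        sector: {"filers": len(values), "total": sum(values), "median": median(values)}
        for sector, values in sorted(by_sector.items())
    }


if __name__ == "__main__":
    # Hypothetical facts for one fiscal year, in millions of dollars.
    facts = [
        {"filer": "A", "sector": "Manufacturing", "tag": "IncomeTaxExpenseBenefit", "value": 120},
        {"filer": "B", "sector": "Manufacturing", "tag": "IncomeTaxExpenseBenefit", "value": 80},
        {"filer": "C", "sector": "Manufacturing", "tag": "IncomeTaxExpenseBenefit", "value": 100},
        {"filer": "D", "sector": "Retail", "tag": "IncomeTaxExpenseBenefit", "value": 40},
        {"filer": "E", "sector": "Retail", "tag": "IncomeTaxExpenseBenefit", "value": 60},
        {"filer": "A", "sector": "Manufacturing", "tag": "Revenues", "value": 900},
    ]
    for sector, stats in summarize_by_sector(facts, "IncomeTaxExpenseBenefit").items():
        print(sector, stats)
```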

17 Potential Research Topics
- Roach assessment: Data quality (extensions, negative values, inappropriate element selection, etc.) vs. earnings quality
- Hey, I'm Special: Communication implications of extension rates
- What did you say?: Comparative sentiment analysis
- Definitions matter - boot?: Appropriateness of extensions
- Navigating disclosures: Disclosure modeling variances across comparable companies
- Judge a book by its cover: Presentation options and variances
- What not to wear: Presentation choices and options, best and worst practices
- Joe Friday vs Picasso: Facts vs storytelling, what do investors want?
- Does fashion matter: Trends in disclosure structures
- DIY Hacks: What else could we use this for?
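For the extension-rate topic, a starting measure is the share of distinct elements in each filing that are company-defined rather than drawn from a standard taxonomy. The sketch below assumes each fact carries an accession number, an element name, and a version field, and that a custom element is one whose version equals the filing's own accession number. That convention, the field names, and the sample rows are assumptions for this example and should be checked against the data set documentation.

```python
from collections import defaultdict


def extension_rate(rows):
    """Return, per filing, the fraction of distinct tags that are custom extensions.

    `rows` is an iterable of (adsh, tag, version) tuples. A tag is treated as
    custom when its version equals the filing's accession number (assumed).
    """
    all_tags = defaultdict(set)
    custom_tags = defaultdict(set)
    for adsh, tag, version in rows:
        all_tags[adsh].add(tag)
        if version == adsh:
            custom_tags[adsh].add(tag)
    return {adsh: len(custom_tags[adsh]) / len(tags) for adsh, tags in all_tags.items()}


if __name__ == "__main__":
    # Hypothetical rows: filing 0001 uses one custom element out of four.
    rows = [
        ("0001", "Revenues", "us-gaap/2017"),
        ("0001", "InterestExpense", "us-gaap/2017"),
        ("0001", "IncomeTaxExpenseBenefit", "us-gaap/2017"),
        ("0001", "SpecialRestructuringGain", "0001"),
        ("0002", "Revenues", "us-gaap/2017"),
        ("0002", "InterestExpense", "us-gaap/2017"),
    ]
    print(extension_rate(rows))
```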

18 Questions?

19 Thank you!

20 Appendix: Papers on Data Quality from Aggregators

The Quality of Interactive Data: XBRL Versus Compustat, Yahoo Finance, and Google Finance
Abstract: The issue of the quality of interactive data is of interest to all stakeholders in the evolution of XBRL since the adoption of XBRL is often premised on the value of making business data available to users in a standardized, shareable format. Proponents of XBRL claim that XBRL-tagged data obtained directly from the company or from a regulator's website such as the SEC's EDGAR, in contrast with data obtained from aggregators such as Compustat, are the closest and most accurate reflection of the company's intended communication in their official financial reports. However, to date, there has been no formal study of the similarities and differences between interactive data (i.e., XBRL-tagged data filed with the SEC) and data provided by aggregators. This study fills this gap by comparing interactive data with the data items reported by three prominent data aggregators or redistributors: Compustat, Google Finance, and Yahoo Finance. We find a significant rate of omission of more than 50% in the financial statement items provided by aggregators/redistributors compared with the interactive data available on the SEC's EDGAR website. For items that are not omitted, we find up to 4.8% (tracing from interactive data to aggregator data) and 8% (tracing from aggregator data to interactive data) mismatches, with approximately 56% of differences being greater than conventional materiality. The rate of mismatch differs by aggregator for the three financial statements studied: Balance Sheet, Income Statement, and the Statement of Cash Flows. The differences are most frequent in the Statement of Cash Flows (comparison between interactive data and aggregators) and the Income Statement (reverse comparison), but generally they tend to decline over the three year period studied.

Using XBRL to Conduct a Large-Scale Study of Discrepancies between the Accounting Numbers in Compustat and SEC 10-K Filings
Abstract: Compustat accounting database is frequently used for both research and decision-making. However, the accuracy and value of the information extracted from Compustat depend not only on the methods used to extract that information, but also on the validity of the data provided by Compustat. It has been documented (San Miguel 1977; Rosenberg and Houglet 1974; Yang et al. 2003; Tallapally et al. 2011, 2012; Boritz and No 2013) that information found in Compustat database differs from both the information found in other accounting databases and the information disclosed in corporate financial filings. In this study, we conduct the first large-scale comparison of Compustat and 10-K data. Specifically, we compare 30 accounting items for approximately 5,000 companies for the period from October 1, 2011 to September 30, [...]. We find that the values reported in Compustat significantly differ from the values reported in 10-K filings. We also find that the amount and magnitude of the original data alterations introduced by Compustat depend on the type of the accounting item and company characteristics such as industry and size.

Data differences XBRL versus Compustat
Abstract: Given their proprietary data standardization process, Compustat provides accounting data which may differ, to some degree, from accounting data provided by some individual companies via XBRL (eXtensible Business Reporting Language). In this regard, the purpose of this study is to analyze the extent of such differences, if any, between accounting data provided by Compustat and accounting data provided by (via) XBRL. The results suggest that differences exist and the reconciliation of such differences is not obvious.