GROW WITH BIG DATA. Third Eye Consulting Services & Solutions LLC.

1 GROW WITH BIG DATA Third Eye Consulting Services & Solutions LLC.

2 Crime Analysis & Predictions System (CAPS)

3 ORIGINALLY DEVELOPED FOR Public Safety & National Security team at led by Sanjay Jacob, Parul Bhandari & Mahesh Punyamurthula

4 CAPS Problem Definition
Public Governments around the world need to:
1. Do more while spending the least.
2. Better manage existing resources.
3. Be proactive in battling crime.
4. Be at the right place at the right time to beat crime with the lowest impact.
5. Know what to do, when and why.

5 CAPS Problem Definition
Other Challenges for Public Governments:
1. Lack of technical knowledge and resources.
2. Lack of management resources to manage, monitor and operate such systems.
3. Need to analyze disparate data sets spread across various systems and trapped in different formats.
4. Reliance on outdated infrastructure & systems, both stationary & mobile.

6 CAPS - Solution
Leverages Open Data initiatives by government bodies worldwide.
Based on Microsoft's Big Data technologies stack.
Capable of handling Big Data's Velocity, Volume and Veracity.
Easy to integrate, assemble and develop customized end-to-end solutions.
Analyzes various types of data feeds - real time streaming & static data.
Provides comprehensive analytical capabilities.
Predicts crime patterns for efficient deployment of public safety resources.

7 CAPS - Solution
CAPS is a system to analyze & detect crime hotspots & predict crime.
Collects data from various data sources - crime data from OpenData sites, US census data, social media, traffic & weather data etc.
Leverages Azure's Cloud and on premise technologies for backend processing & desktop based visualization tools.

8 BENEFITS FOR THE LOCAL POLICE
The police can use the system in two ways:
1. The system can alert that a crime is imminent (in the next 4 hours) based on any new traffic or weather events.
2. The police can run the system once a day and, based on the predictions, decide how to deploy resources (policemen) in each community/district.

9 TECHNICAL SECTION

10 TECHNOLOGIES USED
Azure HDInsight, MapReduce, Hive, Stream Analytics, Azure Queue, Azure Storage, SQL Azure, SQL Server, Power BI, PowerQ&A, PowerView, PowerMap

11 DATA COLLECTION LAYER
DATA COLLECTION
Static: OPEN DATA, CRIME DATA, CENSUS DATA, ANY OTHER DATA
Real Time: WEATHER DATA, TRAFFIC DATA, SOCIAL MEDIA DATA, ANY OTHER DATA
Real Time & Static: ENTERPRISE DATA, MACHINE DATA, INTERNET OF THINGS, ANY OTHER DATA
DATA PROCESSING LAYER (Cloud or On Premise)
PRESENTATION LAYER

12 ADDITIONAL DATA SOURCES
The system can be further enhanced to include additional data sources as available. For example:
Video Data
Images Data
Police Systems Data

13 DATA COLLECTION - Windows
Data Sources - For Chicago
Real-time tweet streams are ingested from Twitter using Search APIs.
Facebook data is ingested using Graph Search APIs.
Traffic data is ingested from MapQuest.
Weather data is ingested from Forecast.io.
Data feed ingestion is automated and captured using a custom C# code base.
Pre-Processor
Tweets are fed into the Stream Computing Layer for sentiment logic processing.
Facebook, traffic & weather data is parsed from JSON to CSV at run time.
All data is persisted on Azure Storage.
Analyzed & summarized data is persisted in SQL Azure.
Storage
Analyzed Twitter data is pushed to Windows Azure SQL.
Parsed Twitter/Facebook/Traffic/Weather data is persisted in Azure Storage in different containers.
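The JSON-to-CSV parsing step can be illustrated with a minimal sketch. The deck says this is done by custom C# code that is not shown, so the Python below is only an assumption about the general shape of that step. The function name `json_records_to_csv` and the field names `time`, `summary` and `temperature` are hypothetical and do not reflect the actual Forecast.io, MapQuest or Facebook payloads.

```python
import csv
import io
import json


def json_records_to_csv(json_text, columns):
    """Convert a JSON array of flat objects into CSV text.

    `columns` fixes the column order; keys missing from a record
    become empty cells, and keys not listed in `columns` are dropped.
    """
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({col: record.get(col, "") for col in columns})
    return out.getvalue()


# Hypothetical weather payload; the field names are illustrative only.
sample = (
    '[{"time": "2014-06-01T04:00", "summary": "Rain", "temperature": 61.5},'
    ' {"time": "2014-06-01T08:00", "summary": "Clear"}]'
)
csv_text = json_records_to_csv(sample, ["time", "summary", "temperature"])
```

Fixing the column list up front keeps every CSV file for a given feed in the same shape even when a record omits a field, which matters once the files are loaded into tables with a fixed schema.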

14 DATA PROCESSING LAYER - Windows
DATA COLLECTION LAYER
DATA PROCESSING LAYER: Windows Azure, Windows HDInsight, Stream Analytics, Azure Queue, Azure Storage, SQL Azure, SQL Server
PRESENTATION LAYER

15 DATA STORAGE & PROCESSING
SCHEDULER
Calls script on pre-set schedule to ingest data into Hive tables.
Checks periodically to ensure normal system operations.
Inserts data incrementally.
HIVE
Contains all data as per the table schemas.
Enables HiveQL execution when requests come in from PowerBI components.
SQL AZURE (loaded via Sqoop)
Processed & aggregated data is ingested into SQL Azure.
STORAGE
HDInsight blob storage provides a reliable and scalable solution.
All data is partitioned on dates.
HIVE Scheduled Jobs
Daily scripts to create tables and insert data, scheduled with cron jobs.
HIVE Tables
Have all data in full detail from all data sources.
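The combination of date partitioning and incremental inserts can be sketched as follows. The deck does not show its daily scripts, so everything here is an assumption: the table name `weather`, the partition column `dt`, the base path `/data` and both function names are hypothetical. The sketch only builds the HiveQL text that a daily job might submit; it does not talk to Hive.

```python
from datetime import date, timedelta


def partition_statement(table, day, base_path):
    """Build a HiveQL statement that registers one day's data as a partition."""
    dt = day.isoformat()
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}') LOCATION '{base_path}/{table}/dt={dt}'"
    )


def pending_days(last_loaded, today):
    """Days after `last_loaded` up to and including `today`, oldest first."""
    gap = (today - last_loaded).days
    return [last_loaded + timedelta(days=n) for n in range(1, gap + 1)]


# Hypothetical run: the last successful load was June 1, today is June 3.
statements = [
    partition_statement("weather", day, "/data")
    for day in pending_days(date(2014, 6, 1), date(2014, 6, 3))
]
```

Because each statement uses `IF NOT EXISTS`, rerunning the job after a partial failure would add only the partitions that are still missing.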

16 PRESENTATION LAYER - Windows
DATA COLLECTION LAYER
DATA PROCESSING LAYER
PRESENTATION LAYER: Power BI, PowerQ&A, PowerView, PowerMap, Power Query, PowerPivot, Windows 8 Apps, Mobile Apps

17 DATA PRESENTATION LAYER

18 DATA PRESENTATION LAYER
Excel 2013 is used as the platform and workbench for analyzing and mining data, using functionality that is familiar to most power users.
PowerPivot is the semantic layer that defines the relationships between data and calculated measures. Data is stored in memory as a columnar database for faster retrieval. The data model is saved as part of the Excel workbook, which makes sharing these reports very easy.
PowerMap provides an instant, overall picture of the trends happening across geographies over..
PowerView is a Silverlight add-in that provides powerful, interactive and intuitive dashboards and reports built on top of PowerPivot's data model. It enables slicing/dicing and drilling up/down at any level of data. It's very useful for identifying trends and root causes.

19 CLOUD MODEL - Windows
Inputs: Real time Data Sources, Static Data Sources
CLOUD BASED INFRASTRUCTURE
Cloud based data processing & transformations.
Cloud based real time & batch analytics.
Office 365's PowerBI components for adhoc analytics.
Enabled for Windows 8 based Mobile & Desktop Apps.
Components:
Data Collection Layer (C# custom code)
Data Processing Layer (Stream Computing Platform - Storm)
Analytics (Stream Analytics & MapReduce)
Message Queue Layer (Azure Event Hubs)
SQL Azure
HDFS & Blob Storage (Azure)
Machine Learning Algorithms (AzureML)
Analytics (HDInsight Hive)
Presentation Layer (Power BI)

20 HYBRID MODEL - Windows
Inputs: Real time Data Sources, Static Data Sources
CLOUD BASED INFRASTRUCTURE
Cloud based data processing & transformations.
Cloud based real time & batch analytics.
Enabled for Windows 8 based Mobile & Desktop Apps.
ON-PREMISE INFRA
PowerBI components for adhoc analytics.
SQL Server based.
Components:
Data Collection Layer (C# custom code)
Data Processing Layer (Azure Stream Analytics)
Analytics (Stream Analytics & MapReduce)
Message Queue Layer (Azure Event Hubs)
SQL Server
HDFS & Blob Storage (Azure)
Machine Learning Algorithms (AzureML)
Presentation Layer (Power BI)
Analytics (HDInsight Hive)

21 DATA SOURCES - For Chicago
Crime Data
  Description: Historic crime case data over years from present
  Source: Safety/Crimes-2001-to-present/ijzp-q8t2
Chicago districts
  Description: Chicago Police districts address information
  Source: ortal/clearpath/communities/districts
Chicago communities
  Description: Chicago community area mapping
  Source: in_chicago
Socio economic factors
  Description: Selected socio economic indicators like people below poverty, unemployment, per capita income for each community
  Source: Services/Census-Data-Selectedsocioeconomic-indicators-in-C/kn9c-c2s2
Twitter
  Description: Tweets about Chicago.
  Source: Twitter Streaming API
Facebook
  Description: Posts about Chicago.
  Source: Facebook Graph Search API
Weather
  Description: Chicago weather data
  Source: Forecast.io
Traffic
  Description: Chicago traffic details
  Source: MapQuest

22 ANALYTICS

23-25 CRIME ANALYTICS
Analyze Crime Levels
Filters (depending on data): Number of crimes, Crime Types, Location, Date & Time, Temperature, Residents
Graph Type: Line, Bar, Pie Chart, Table, Bubble
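The filter-then-chart pattern on these slides amounts to selecting rows and counting them by one dimension. The deck does this in Power BI; the Python below is only a rough sketch of the same idea, and the record layout (`type`, `community`, `date`) and the function name `count_crimes` are hypothetical.

```python
from collections import Counter


def count_crimes(records, group_by, **filters):
    """Count records per `group_by` value, keeping rows that match every filter."""
    selected = (
        r for r in records
        if all(r.get(key) == value for key, value in filters.items())
    )
    return Counter(r[group_by] for r in selected)


# Hypothetical rows; the field names and values are illustrative only.
crimes = [
    {"type": "Theft", "community": 8, "date": "2014-06-01"},
    {"type": "Theft", "community": 8, "date": "2014-06-02"},
    {"type": "Battery", "community": 8, "date": "2014-06-01"},
    {"type": "Theft", "community": 32, "date": "2014-06-01"},
]
by_type = count_crimes(crimes, "type", community=8)
```

The resulting counts per crime type are the kind of series a bar or pie chart would plot for the selected community.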

26 PREDICTIONS

27 FACTORS CONSIDERED FOR PREDICTING CRIME
Community
  Values: Community ID
  Comments: This is the key. The prediction is for a specific community for a specific date & time.
Date
  Values: Date
Time Period
  Values: 1: 12am-4am, 2: 4am-8am, 3: 8am-12pm, 4: 12pm-4pm, 5: 4pm-8pm, 6: 8pm-12am
  Comments: For convenience purposes, we have broken up a day into 6 time slots. We can change this based on the supporting data.
Weather
  Values: 1 - Normal, 2 - Abnormal, 3 - Extreme
  Comments: All weather conditions are categorized into these values. We picked suitable values for each of the weather types to get a good distribution.
Traffic Event
  Values: 1 - Normal, 2 - Abnormal, 3 - Extreme
  Comments: All traffic conditions are categorized into these values. We picked suitable values for each of the traffic types to get a good distribution.
Traffic Event Distance from Police Station
  Values: 1 - Near, 2 - Far, 3 - Very Far
  Comments: The assumption is that the farther away the event is from a police station, the higher the chances of a crime. We picked suitable values for each to get a good distribution.
Unemployment
  Values: Rate
  Comments: This is the unemployment rate in that precinct.
Number of police stations in District
  Values: Number
  Comments: Assuming that propensity for crime is inversely proportional to the number of police stations.
Crime
  Values: 1 - Theft, 2 - Assault, 3 - Burglary, 4 - Narcotics, 5 - Battery, 6 - None
  Comments: This is a placeholder category. This list can be anything that is (a) supported by the underlying data and (b) what law enforcement is interested in seeing.
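The categorical encodings in this table can be sketched in a few lines. The six four-hour time slots come directly from the slide. The distance thresholds do not: the deck only says suitable values were picked, so `near_km` and `far_km` below are assumed placeholders, and both function names are hypothetical.

```python
def time_slot(hour):
    """Map an hour of day (0-23) to the slide's slots 1-6, four hours each."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return hour // 4 + 1


def distance_category(km, near_km=1.0, far_km=3.0):
    """Return 1 (Near), 2 (Far) or 3 (Very Far).

    The thresholds are assumptions for illustration, not values from the source.
    """
    if km <= near_km:
        return 1
    if km <= far_km:
        return 2
    return 3


# A traffic event at 9pm, 2.4 km from the nearest station.
encoded = (time_slot(21), distance_category(2.4))
```

Changing the number of time slots, as the slide allows, would only require replacing the divisor in `time_slot` with a different slot length.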

28 PREDICTION MODEL
With the initial dataset, an initial prediction model is constructed. If any of the fields changes value, the model is retrained. Some fields change infrequently; others change daily (e.g. social media, weather & traffic events). The model is continuously updated with new data. The system periodically and automatically pulls in the latest field values from the appropriate sources. The model then runs against the new data to predict what kind of crime is likely to be committed in each community.
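The train, retrain and predict cycle described here can be sketched with a deliberately simple stand-in. The deck does not say which algorithm is used (other slides mention AzureML), so the frequency-count baseline below is a substitute chosen for illustration, not the system's model. The class name, the feature tuple layout and the sample data are all hypothetical.

```python
from collections import Counter, defaultdict

NONE = 6  # the "None" crime category from the factors slide


class FrequencyModel:
    """Baseline: predict the crime seen most often for a feature combination."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, rows):
        """Rebuild the counts from (features, crime) pairs; call again to retrain."""
        self.counts = defaultdict(Counter)
        for features, crime in rows:
            self.counts[features][crime] += 1

    def predict(self, features):
        """Return the most frequent crime code, or NONE if there is no history."""
        seen = self.counts.get(features)
        if not seen:
            return NONE
        return seen.most_common(1)[0][0]


# Features: (community id, time slot, weather, traffic); the values are made up.
history = [
    ((8, 6, 1, 1), 1),   # Theft
    ((8, 6, 1, 1), 1),   # Theft
    ((8, 6, 1, 1), 5),   # Battery
    ((32, 3, 2, 1), 3),  # Burglary
]
model = FrequencyModel()
model.train(history)
prediction = model.predict((8, 6, 1, 1))
```

Retraining when field values change corresponds to calling `train` again with the refreshed rows; a real deployment would replace this baseline with a trained classifier.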

29 CRIME PREDICTIONS
Predict Crime
Filters (depending on data): Number of crimes, Crime Types, Location, Date & Time, Temperature, Residents
Graph Type: Line, Bar, Pie Chart, Table, Bubble

30 CRIME PREDICTIONS
Predict Crime
Filters (depending on data): Crime Types, Location, Date, Time, Temperature, Traffic, Distance to Police Station, Weather

31 EXTENSIBILITY
The system is fully extensible and future proof.
Lessons learned, patterns detected and observations made for one city can be used and extended for other cities worldwide.
The backend infrastructure will also adjust accordingly.

32 SUMMARY
The Crime Analysis and Prediction System (CAPS):
Detects, analyzes & predicts crime.
Helps public governments battle crime better at lower cost.
Is based on Microsoft's Big Data technologies, both cloud and on premise.
Is built on the robust Azure platform, which can scale vertically & horizontally.
Is customizable & extensible to meet the needs of specific business use cases.

33 THANK YOU!