Big Data Initiatives in China: Opportunities and Challenges Joshua Zhexue Huang Distinguished Professor Director of Big Data Institute College of Computer Science and Software Engineering Shenzhen University
Agenda 1. Recent Development of Big Data in China 2. Key Initiatives, Challenges and Opportunities 3. Research and Applications at Big Data Institute, Shenzhen University
What is Big Data? Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them (Wikipedia). Big data often refers to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.
Big Data Term and Popularity The term "big data" was coined in 1998 by John R. Mashey, Chief Scientist of SGI. At that time the term referred to data sizes in gigabytes, which would put stress on infrastructure. On March 29, 2012, the Obama Administration announced the Big Data Research and Development Initiative, with $200 million to invest in big data, which made the term popular.
Recent Development of Big Data in China - China NSF funded key projects: (2010) Massive data mining on cloud computing; (2013) Big data oriented machine learning theory and methods; (2014) Challenging research problems in big data technology and applications (8 projects); (2015) Five projects on big data; (2016) More projects funded in information science and management areas
Recent Development of Big Data in China In August of 2012, the Chinese Academy of Sciences started a strategic pilot project (1.3 billion in 5 years): Sensing China oriented next generation information technologies. A subproject on big data: Research and development of key technologies for sea and cloud data systems
Recent Development of Big Data in China In 2016, the Ministry of Science and Technology of China started a special program on cloud computing and big data, which will accomplish 12 tasks in four areas with 400 million RMB: Cloud platform and big data infrastructure; Data-driven new software on the cloud service model; Big data analytics, applications and human-like intelligence; Cloud convergence of perceptual cognition and human-machine interaction
Recent Development of Big Data in China - Ministry of Education of China 85 universities set up a new major on data science and big data technology. Some major universities set up special schools, faculties and research institutes on data science and big data. Tsinghua University: Tsinghua-Qingdao Data Science Institute; Peking University: Beijing University Big Data Technology Inst; Fudan University: School of Data Science; Sun Yat-Sen University: School of Data and Computer Science; Shenzhen University: Big Data Institute
Recent Development of Big Data in China Local governments set up special organizations to promote big data Beijing: Beijing Institute of Big Data Research Guangdong Province: Big Data Bureau Shanghai: Shanghai Data Exchange Center Shenzhen: Shenzhen Research Institute of Big Data, Chinese University of Hong Kong (Shenzhen)
Recent Development of Big Data in China -Industry Big Internet Companies are the leaders in big data development and applications. They are also big data owners. Baidu, Alibaba, Tencent (BAT) All industry sectors are interested in big data Technology companies, e.g., Huawei, ZTE Telecommunications, e.g., China Mobile, China Unicom Banks and Insurance companies Manufacturing companies E-commerce companies Logistics service companies
Big Data Market in China [Chart: market size (in billions) and compound annual growth rate]
Big data: a national strategy A decision to implement a national strategy for big data was made at the Fifth Plenary Session of the 18th Central Committee of the CPC in October 2015. The 13th Five-Year Plan (2016-2020) further defined big data as a fundamental strategic resource to be developed and utilized. National big data centers and platforms will be established. Key technologies, hardware and software will be innovated and developed, including data collection, storage, cleansing, analysis, mining, visualization, security and privacy protection.
Implementation Measures The State Council issued the action outline to promote the development of big data in 2015. In January 2016, the National Development and Reform Commission issued a notice on organizing the implementation of major projects to promote the development of big data, supporting projects in four areas: Pilot projects on big data applications; Big data sharing; Big data infrastructure development; Big data standards and exchange systems
Initiatives to Develop an Innovation-Driven Economy in China: Encourage young people to start their own businesses and pursue innovation (mass entrepreneurship and innovation); Development of big data; Internet + action plan; Cloud computing service development; Internet of Things (including wireless Internet); Artificial Intelligence; Made in China 2025 (advanced manufacturing)
Directions Data science disciplines Key technology development Big data platforms Key applications Data resource development Data sharing and open data Human resource training for big data
Internet + Manufacturing [Diagram: AI; Manufacturing; Procurement; Design; Customer Service; Intelligent warehouse; Retail; Transportation]
Technological Challenges Storage: cloud storage; Communication: 4G, 5G; Processing: cleansing, integration; Analysis: capability, efficiency; Mining: methods, tools, platforms; Energy consumption
Application Challenges Lack of clear business requirements Lack of successful pilots Data availability and data sharing Data security and privacy ROI on big data applications Infrastructure Skills and human resources
Opportunities: Big Data Industry Chain Telecom Retail Finance Manufacturing Internet Smart Grid E-commerce Logistics Smart City
Shenzhen
China's first Special Economic Zone (SEZ). Neighboring Hong Kong. Area: 2,050 km². A major city in South China. Population (2014): 11 million. The fourth largest city by GDP in China. GDP per capita in USD: 25,038. GDP growth (2015): 8.9%. [Map label: Shenzhen University. Photos: Xichong Beach; Shenzhen Bay Bridge; Night View of Shennan Road East]
Shenzhen University A public university established in 1983. The fastest growing university among the top 100 universities in China. 26 schools (colleges), 57 undergraduate programs, 70 master's programs, 3 doctorate programs. 34,000 full-time students: 27,000 undergraduates, 6,000 postgraduates, 1,500 international students. [Photos: Lake Wenshan; South pavilion of the school library]
Big Data Institute, Shenzhen University Established in 2014 20 research staff 30 students Computer Science Building Three organizations International PhD students Institute Corridor
Faculty Members
Data Center
Internet + Manufacturing accumulates big data [Diagram: AI; Manufacturing; Procurement; Design; Customer Service; Intelligent warehouse; Retail; Transportation]
Research Problems: Challenge of the Big Data Matrix [Figure: a data matrix with millions of records (rows 1, 2, ..., n) and thousands of features (columns f1, f2, f3, ...)] Curse of dimensionality. 1. Mixed data 2. Noise/missing values 3. Correlation 4. Unbalance 5. Subspace property 6. Uninformative
Big Data Analytics Big data refers to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data.
MapReduce Programming (Divide-and-Conquer) [Diagram: the input file is partitioned into files distributed across worker nodes (Map); a master node coordinates the work; the partial results are combined (Reduce) into the output]
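The divide-and-conquer flow on this slide can be sketched in plain Python. This is a single-process illustration only, not Hadoop code: the word-count task and the names `partition`, `map_partition`, `reduce_counts`, and `run_mapreduce` are assumptions chosen for the example.

```python
from collections import Counter
from functools import reduce


def partition(records, num_nodes):
    """Split the input into num_nodes partitions, one per (simulated) worker node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_partition(lines):
    """Map step: a worker counts the words in its own partition independently."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


def run_mapreduce(records, num_nodes=3):
    """Partition the input, map each partition, then reduce the partial results."""
    partial_results = [map_partition(p) for p in partition(records, num_nodes)]
    return reduce(reduce_counts, partial_results, Counter())


lines = ["big data in China", "big data platforms", "data mining on cloud"]
word_counts = run_mapreduce(lines)
```

Because each partition is mapped without looking at the others, the map calls could in principle run on separate nodes; only the reduce step needs to see every partial result.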
MapReduce Iteration: K-means Pipeline implementation [Diagram: a pipeline of repeated Map (M) and Reduce (R) stages] Input data; Map process: assign objects to clusters; Reduce process: recompute cluster centers; Converged? If yes, output; otherwise run another Map and Reduce stage
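The iteration on this slide can be sketched as a loop in which every pass is one map step (assign objects to clusters) followed by one reduce step (recompute centers). This is a minimal single-process sketch in plain Python, not a Hadoop job; the function names, the squared-shift convergence test, and the `tol` and `max_iter` parameters are assumptions made for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(block, centers):
    """Map: for one data block, emit per-cluster (coordinate sums, count)."""
    partial = {}
    for p in block:
        k = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
        sums, n = partial.get(k, ([0.0] * len(p), 0))
        partial[k] = ([s + x for s, x in zip(sums, p)], n + 1)
    return partial


def reduce_centers(partials, centers):
    """Reduce: merge the partial sums and recompute every cluster center."""
    new_centers = []
    for k, old in enumerate(centers):
        sums, n = [0.0] * len(old), 0
        for part in partials:
            if k in part:
                sums = [a + b for a, b in zip(sums, part[k][0])]
                n += part[k][1]
        # Keep the old center if no object was assigned to this cluster.
        new_centers.append(tuple(s / n for s in sums) if n else tuple(old))
    return new_centers


def kmeans_mapreduce(blocks, centers, tol=1e-12, max_iter=100):
    """Repeat map and reduce until the centers stop moving."""
    iteration = 0
    for iteration in range(1, max_iter + 1):
        partials = [map_assign(b, centers) for b in blocks]
        new_centers = reduce_centers(partials, centers)
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, iteration


rng = random.Random(0)
points = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(200)]
points += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(200)]
blocks = [points[i::4] for i in range(4)]  # four simulated data blocks
centers, iterations = kmeans_mapreduce(blocks, [points[0], points[-1]])
```

Each pass needs a full map and a full reduce over all blocks, which is why the slide draws the algorithm as a long chain of M and R stages.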
MapReduce Limitation: Decision Tree It is difficult to implement recursive algorithms such as decision trees in MapReduce
Spark RDD Computing Model An RDD (Resilient Distributed Dataset) is treated here as a data matrix.
RDD Divide-and-Conquer
Asymptotic Ensemble Learning Framework
Randomization of Data Blocks Before randomization After randomization
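The effect of randomizing data blocks can be illustrated with a small sketch. It assumes a worst case in which records are stored in label order, so consecutive blocks are unrepresentative; the helper names and the shuffle-then-cut procedure are assumptions for illustration, not necessarily the randomization method used in the framework.

```python
import random


def split_into_blocks(records, num_blocks):
    """Cut the records into num_blocks consecutive blocks of near-equal size."""
    size = -(-len(records) // num_blocks)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def randomize_blocks(records, num_blocks, seed=0):
    """Shuffle a copy of the records, then cut it into blocks."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return split_into_blocks(shuffled, num_blocks)


def positive_rate(block):
    """Fraction of records in the block whose label is 1."""
    return sum(label for _, label in block) / len(block)


# Assumed input: 1000 (id, label) records stored in label order.
records = [(i, 0) for i in range(500)] + [(i, 1) for i in range(500, 1000)]

before = [positive_rate(b) for b in split_into_blocks(records, 10)]
after = [positive_rate(b) for b in randomize_blocks(records, 10)]
```

Before randomization every block holds a single class; after randomization each block has a class mix close to that of the whole data set, so a model learned from one block is a usable estimate of a model learned from all of it.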
Asymptotic Ensemble Learning Results Learning result from non-randomized data blocks
Advantages of Asymptotic Ensemble Learning Sampling without replacement. Sampling data blocks instead of records increases sampling efficiency. Learning from part of the data (10-20%) approaches the result learned from the whole data. Significantly reduces computation load. Scalability: learning from TB or PB data
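The idea of learning from sampled blocks until the result stabilizes can be sketched as follows. This is a toy illustration under stated assumptions, not the institute's framework: the nearest-centroid base learner, majority voting, the stopping rule (`eps`, `patience`), and the synthetic two-class data are all hypothetical choices made for the example.

```python
import random


def train_centroids(block):
    """Base learner: per-class mean of the feature vectors in one block."""
    sums, counts = {}, {}
    for x, y in block:
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, v in enumerate(x):
            acc[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict_one(model, x):
    """Predict the class whose centroid is nearest to x."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))


def ensemble_predict(models, x):
    """Majority vote over all base models."""
    votes = [predict_one(m, x) for m in models]
    return max(set(votes), key=votes.count)


def accuracy(models, data):
    return sum(ensemble_predict(models, x) == y for x, y in data) / len(data)


def asymptotic_ensemble(blocks, validation, eps=0.01, patience=2, seed=0):
    """Add one base model per sampled block until validation accuracy stops changing."""
    # Sample block indices without replacement.
    order = random.Random(seed).sample(range(len(blocks)), len(blocks))
    models, history, stable = [], [], 0
    for idx in order:
        models.append(train_centroids(blocks[idx]))
        history.append(accuracy(models, validation))
        if len(history) > 1 and abs(history[-1] - history[-2]) < eps:
            stable += 1
        else:
            stable = 0
        if stable >= patience:
            break
    return models, history


rng = random.Random(1)


def make_record():
    """Synthetic record: class 0 centered at (0, 0), class 1 at (4, 4)."""
    y = rng.randint(0, 1)
    return ([rng.gauss(4.0 * y, 1.0), rng.gauss(4.0 * y, 1.0)], y)


blocks = [[make_record() for _ in range(200)] for _ in range(20)]
validation = [make_record() for _ in range(400)]
models, history = asymptotic_ensemble(blocks, validation)
fraction_used = len(models) / len(blocks)
```

The loop reads whole blocks rather than individual records and stops as soon as adding another block no longer changes the ensemble's accuracy, so `fraction_used` reports how much of the data was actually needed for this toy problem.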
Integrated Big Data Analysis Platform
Key Technologies Workflow Engine Cloud Computing Engine Algorithm Library Big Data Analytics Open API Cloud Storage
Distributed Machine Learning Algorithm Libraries MapReduce: Clustering (K-Means, K-Modes, W-K-Means, EWKM); Classification (Decision Tree, Random Forests, LDA); Regression (Logistic Regression, Random Forest Regression); Association (FP-Growth). Spark: 1. Machine learning: MLlib 2. Graph analysis: GraphX 3. Data streams: DStream 4. Query: Spark SQL
Analytical Workflow
Manufacturing Big Data Application: Product batch quality problem monitoring system [Architecture diagram with three layers: Application Layer, Platform Layer, Data Layer. Components shown: Visualization; Impala (data analysis engine); Applications; Vis (data visualization engine); xxx xx (engine); Data analysis; R (data mining); Hive (data warehouse); Analytics; Storm (real-time stream computing); Spark (data stream processing); Data Warehouse; Data cleansing and integration; Central DB; Local quality data; Sqoop (data migration); ETL; Flume (data collection tool); Cluster Environment; Kettle (ETL tool); HDFS; Map/Reduce; Runtime System; Supl 1, Supl 2, Supl n; Fac 1, Fac 2, Fac n]
Integrated Big Data Analysis Platform: Application Showcase
Manufacturing Big Data Application: Product batch quality problem monitoring system. 10-year product quality monitoring period; 50M+ products monitored; 2015 Huawei President Award; 30,000+ factories; 1 PB+ data; 80%+ report accuracy; 100+ development team; 50+ products; 0% missing rate
Thank You!!! Questions?