Big data beyond Hadoop: How to integrate ALL your data. Kai Wähner (waehner.de), 4/26/13



2 Kai Wähner
Main tasks: Requirements Engineering, Enterprise Architecture Management, Business Process Management, Architecture and Development of Applications, Service-oriented Architecture, Integration of Legacy Applications, Cloud Computing, Big Data
Consulting, Developing, Coaching, Speaking, Writing
Contact: Blog, Social Networks (Xing, LinkedIn)

3 Key messages: You have to care about big data to be competitive in the future! You have to integrate different sources to get the most value out of it! Big data integration is no (longer) rocket science!

4 Agenda: Big data paradigm shift. Challenges of big data. Big data from a technology perspective. Integration with an open source framework. Integration with an open source suite.

5 Agenda: Big data paradigm shift. Challenges of big data. Big data from a technology perspective. Integration with an open source framework. Integration with an open source suite.

6 Why should you care about big data? "If you can't measure it, you can't manage it." - William Edwards Deming, American statistician, professor, author, lecturer and consultant

7 Why should you care about big data? → Silence the HiPPOs (highest-paid person's opinion)! → When you are able to interpret unimaginably large data streams, gut feeling is no longer justified!

8 What is big data? The Vs of big data: Volume (terabytes, petabytes), Velocity (real-time or near real-time), Value, Variety (social networks, blog posts, logs, sensors, etc.)

9 Big data tasks to solve before analysis:
Big Data Integration: land data in a big data cluster; implement or generate parallel processes.
Big Data Manipulation: simplify manipulation such as sort and filter; computationally expensive functions.
Big Data Quality & Governance: identify linkages and duplicates, validate big data; match components, execute basic quality features.
Big Data Project Management: place frameworks around big data projects; common repository, scheduling, monitoring.
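One of the quality tasks above, identifying duplicates, can be sketched in a few lines of plain Java. This is only an illustration: the record format ("id;payload" strings) and all names are made up for the example, not taken from any real tool.

```java
import java.util.*;

// Sketch of a basic quality step: flag records whose key was already seen.
// Records are modeled as "id;payload" strings purely for illustration.
public class DuplicateFinder {

    // Returns the records whose id occurred earlier in the list.
    static List<String> findDuplicates(List<String> records) {
        Set<String> seenIds = new HashSet<>();
        List<String> duplicates = new ArrayList<>();
        for (String record : records) {
            String id = record.split(";", 2)[0];
            if (!seenIds.add(id)) {      // add() returns false if id was seen before
                duplicates.add(record);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<String> records = List.of("1;alice", "2;bob", "1;alice (again)");
        System.out.println(findDuplicates(records)); // [1;alice (again)]
    }
}
```

Real big data quality tooling does this matching in parallel across a cluster, but the core idea is the same key-based lookup.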

10 Use case: Replacing ETL jobs. "The advantage of their new system is that they can now look at their data [from their log processing system] in any way they want: Nightly MapReduce jobs collect statistics about their mail system such as spam counts by domain, bytes transferred and number of logins. When they wanted to find out which part of the world their customers logged in from, a quick [ad hoc] MapReduce job was created and they had the answer within a few hours. Not really possible in your typical ETL system." http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

11 Use case: Risk management. Deduce customer defections. http://hkotadia.com/archives/5021

12 Use case: Flexible pricing. With revenue of almost USD 30 billion and a network of 800 locations, Macy's is considered the largest store operator in the USA. Daily price check analysis of its 10,000 articles in less than two hours. Whenever a neighboring competitor anywhere between New York and Los Angeles goes for aggressive price reductions, Macy's follows its example. If there is no market competitor, the prices remain unchanged. http://t-systems.com/about-t-systems/examples-of-successes-companies-analyze-big-data-in-record-time-l-t-systems/

13 Agenda: Big data paradigm shift. Challenges of big data. Big data from a technology perspective. Integration with an open source framework. Integration with an open source suite.

14 Limited big data experts (cartoon: "This is your company", with a single "Big Data Geek")

15 Big data tool selection (business perspective): Wanna buy a big data solution for your industry? Maybe a competitor has a big data solution which adds business value? The competitor will never publish it (rat race)!

16 Big data tool selection (technical perspective): Looking for your required big data product? One that supports your data from scratch? Good luck! :-)

17 How to solve these big data challenges?

18 Be no expert! Be simple! → "[Often] simple models and big data trump more elaborate [and complex] analytics approaches." → Often someone coming from outside an industry can spot a better way to use big data than an insider. (Erik Brynjolfsson / Lynn Wu) http://alfredopassos.tumblr.com/post/ /big-data-the-management-revolution-by-andrew-mcafee

19 Be creative! → Look at use cases of others (SMEs, but also large companies). → How can you do something similar with your data? → You have different data sources? Use them! Combine them! Play with them!

20 What is your big data process? 1) Do not begin with the data; think about business opportunities first. 2) Choose the right data (combine different data sources). 3) Use easy tooling. http://hbr.org/2012/10/making-advanced-analytics-work-for-you

21 Agenda: Big data paradigm shift. Challenges of big data. Big data from a technology perspective. Integration with an open source framework. Integration with an open source suite.

22 Technology perspective: How to process big data?

23 How to process big data? "The critical flaw in parallel ETL tools is the fact that the data is almost never local to the processing nodes. This means that every time a large job is run, the data has to first be read from the source, split N ways and then delivered to the individual nodes. Worse, if the partition key of the source doesn't match the partition key of the target, data has to be constantly exchanged among the nodes. In essence, parallel ETL treats the network as if it were a physical I/O subsystem. The network, which is always the slowest part of the process, becomes the weakest link in the performance chain." http://blog.syncsort.com/2012/08/parallel-etl-tools-are-dead

24 How to process big data? Slides: http:// big-data-0-hadoop-0-java Video: http:// Data-Hadoop-Java

25 How to process big data? Hadoop: the de facto standard for big data processing

26 How to process big data? "A big part of [the company's strategy] includes wiring SQL Server 2012 (formerly known by the codename 'Denali') to the Hadoop distributed computing platform, and bringing Hadoop to Windows Server and Azure." Even Microsoft (the .NET house) has relied on Hadoop since 2011.

27 What is Hadoop? Apache Hadoop, an open-source software library, is a framework that allows for the distributed processing of large data sets across clusters of commodity hardware using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.

28 Map, (Shuffle), Reduce: a simple example. Input: (very large) text files with lists of strings (weather records). We are interested in just some of the content: year and temperature. The MapReduce job has to compute the maximum temperature for every year. Example from the book "Hadoop: The Definitive Guide", 3rd Edition.
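The core logic of this example can be sketched without Hadoop at all. The following plain-Java sketch uses a simplified "year,temperature" record format (not the actual NCDC format from the book): the parsing loop plays the role of the map step, the per-year grouping stands in for the shuffle, and taking the maximum is the reduce step.

```java
import java.util.*;

// Plain-Java sketch of the map / shuffle / reduce steps, no Hadoop dependency.
// Input records are simplified to "year,temperature" strings for illustration.
public class MaxTemperature {

    // Map: parse each line into a (year, temperature) pair.
    // Shuffle: group values by year (a TreeMap keeps years sorted).
    // Reduce: keep only the maximum temperature per year.
    static Map<String, Integer> maxByYear(List<String> lines) {
        Map<String, Integer> result = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            String year = parts[0].trim();
            int temp = Integer.parseInt(parts[1].trim());
            result.merge(year, temp, Math::max);  // the reduce step
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = List.of("1949,111", "1949,78", "1950,0", "1950,22", "1950,-11");
        System.out.println(maxByYear(records)); // {1949=111, 1950=22}
    }
}
```

In real Hadoop the same logic is split into a Mapper and a Reducer class and runs in parallel across the cluster, with the framework doing the shuffle between them.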

29 How to process big data?

30 Hadoop alternatives (features included: few to many):
Apache Hadoop: MapReduce, HDFS.
Hadoop Distribution: + ecosystem, packaging, deployment tooling, support.
Big Data Suite: + tooling/modeling, code generation, scheduling, integration.

31 Agenda: Big data paradigm shift. Challenges of big data. Big data from a technology perspective. Integration with an open source framework. Integration with an open source suite.

32 Alternatives for systems integration (complexity of integration: low to high):
Integration Framework: connectivity, routing, transformation.
Enterprise Service Bus: + integration tooling, monitoring, support.
Integration Suite: + business process management, big data / MDM, registry / repository, rules engine, you name it.

33 Alternatives for systems integration: Integration Framework, Enterprise Service Bus, Integration Suite (complexity of integration: low to high)

34 More details about integration frameworks... http://waehner.de/blog/2012/12/20/showdown-integration-framework-spring-integration-apache-camel-vs-enterprise-service-bus-esb/

35 Enterprise Integration Patterns (EIP): Apache Camel implements the EIPs

36 Enterprise Integration Patterns (EIP)

37 Enterprise Integration Patterns (EIP)
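One of the best-known EIPs shown on these slides is the content-based router, which inspects each message and decides which channel receives it. The following is a minimal plain-Java sketch with no Camel dependency; all class and method names are illustrative, though the fluent when(...) style loosely echoes Camel's choice()/when() DSL.

```java
import java.util.*;
import java.util.function.*;

// Minimal content-based router: each predicate routes a message to a named channel.
public class ContentBasedRouter {
    private final List<Map.Entry<Predicate<String>, String>> routes = new ArrayList<>();
    private final String defaultChannel;

    ContentBasedRouter(String defaultChannel) { this.defaultChannel = defaultChannel; }

    // Fluent registration of a routing rule, checked in insertion order.
    ContentBasedRouter when(Predicate<String> predicate, String channel) {
        routes.add(Map.entry(predicate, channel));
        return this;
    }

    // Return the first matching channel, or the default channel.
    String route(String message) {
        for (Map.Entry<Predicate<String>, String> r : routes)
            if (r.getKey().test(message)) return r.getValue();
        return defaultChannel;
    }

    public static void main(String[] args) {
        ContentBasedRouter router = new ContentBasedRouter("otherOrders")
                .when(m -> m.contains("widget"), "widgetOrders")
                .when(m -> m.contains("gadget"), "gadgetOrders");
        System.out.println(router.route("order: 10 widgets")); // widgetOrders
    }
}
```

Camel provides exactly this pattern (and dozens more) out of the box, so in practice you configure routes rather than hand-roll them.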

38 Architecture http://java.dzone.com/articles/apache-camel-integration

39 Choose your required components: SQL, TCP, Netty, SMTP, Jetty, JMS, RMI, FTP, Lucene, JDBC, EJB, JMX, Bean-Validation, RSS, AMQP, MQ, IRC, Quartz, File, AWS-S3, HTTP, Atom, Akka, CXF, LDAP, Log, XSLT, custom components, and many many more

40 Choose your favorite DSL: Java, XML or Scala (the Scala DSL is not production-ready yet)

41 Deploy it wherever you need: standalone, application server, web container, Spring container, OSGi, cloud

42 Enterprise-ready: open source, scalability, error handling, transactions, monitoring, tooling, commercial support

43 Example: Camel integration route

44 Example: camel-hdfs component
// Producer
from("jms:myqueue").to("hdfs:///mydirectory/myfile.txt?valueType=text");
// Consumer
from("hdfs:///mydirectory/myfile.txt").to("file:target/reports/report.txt");

45 Live demo: Apache Camel in action...

46 Agenda: Big data paradigm shift. Challenges of big data. Big data from a technology perspective. Integration with an open source framework. Integration with an open source suite.

47 Alternatives for systems integration (complexity of integration: low to high):
Integration Framework: connectivity, routing, transformation.
Enterprise Service Bus: + integration tooling, monitoring, support.
Integration Suite: + business process management, big data / MDM, registry / repository, rules engine, you name it.

48 Alternatives for systems integration: Integration Framework, Enterprise Service Bus, Integration Suite (complexity of integration: low to high)

49 More details about ESBs and suites... http://waehner.de/blog/2013/01/23/spoilt-for-choice-how-to-choose-the-right-enterprise-service-bus-esb/

50 Vision: Democratize big data. Talend Open Studio for Big Data: improves efficiency of big data job design with a graphical interface; generates Hadoop code (e.g. Pig) and runs transforms inside Hadoop; native support for HDFS, Pig, HBase, HCatalog, Sqoop and Hive; an open source ecosystem; 100% open source under an Apache license; standards based.

51 Vision: Democratize big data. Talend Platform for Big Data: builds on Talend Open Studio for Big Data (an open source ecosystem); adds data quality, advanced scalability and management functions; MapReduce massively parallel data processing; shared repository and remote deployment; data quality and profiling; data cleansing; reporting and dashboards; commercial support, warranty/IP indemnity under a subscription license.

52 Talend Open Studio for Big Data

53 Live demo: Talend Open Studio for Big Data in action...

54 Did you get the key message?

55 Key messages: You have to care about big data to be competitive in the future! You have to integrate different sources to get the most value out of it! Big data integration is no (longer) rocket science!

56 Did you get the key message?

57 Thank you for your attention. Questions? LinkedIn