Theses. Elena Baralis, Tania Cerquitelli, Silvia Chiusano, Paolo Garza Luca Cagliero, Luigi Grimaudo Daniele Apiletti, Giulia Bruno, Alessandro Fiori

Size: px
Start display at page:

Download "Theses. Elena Baralis, Tania Cerquitelli, Silvia Chiusano, Paolo Garza Luca Cagliero, Luigi Grimaudo Daniele Apiletti, Giulia Bruno, Alessandro Fiori"

Transcription

1 Theses Elena Baralis, Tania Cerquitelli, Silvia Chiusano, Paolo Garza Luca Cagliero, Luigi Grimaudo Daniele Apiletti, Giulia Bruno, Alessandro Fiori Turin, January 2015

2 General information Duration: 6 months full time equivalent overall duration if part time Internal thesis cooperation on active research topic or research project good programming and analytical skills required supervised by a group member can work at home or in our lab (LAB5) External thesis (stage) supervised by external tutor More info on topics 2

3 Main Topics Big data and cloud-based data mining services and algorithms Database and data mining applications Text and social network mining Network traffic data analysis Clinical and biological data management Green/urban data mining 3

4 Big data and cloud-based data mining Study of innovative, parallel, and distributed data mining approaches for Pattern mining algorithms Clustering techniques Classification algorithms Summarization algorithms to efficiently gain interesting insights from huge data volume Design and development of novel cloud-based data mining services based on HADOOP and Spark frameworks MapReduce paradigm Exploitation of the cloud-based services for novel big data analytics applications (e.g., network traffic data, fraud detection, social networks) Analysis modules based on HADOOP and Spark Ecosystems European research project ONTIC 4

5 Data Mining Algorithms Itemset mining algorithms for time series data analysis design and development of novel itemset mining algorithms targeted to time series data analysis analysis of historical financial data (e.g., stock prices, stock indexes, credit card transactions) to plan trading and investment strategies to support fraud detection Integration of data mining algorithms into Rapid Miner Rapid Miner is an established Java-based machine learning tool integration in Rapid Miner of state-of-the-art algorithms for weighted itemset mining document analysis and summarization 5

6 Mining Advisor Design and implementation of an automatic system (Mining Advisor) to select for a dataset an optimal mining algorithm for a given analysis task based on innovative data characterization statistics definition/design of mining algorithms (i.e., access methods and mining primitives), possibly disk-based algorithm selection strategies exploiting a trade-off between accuracy and exploration time Different instances of a Mining advisor can be tailored to different data mining techniques (e.g., clustering algorithms, pattern discovering) Mining advisor Data characterization through metrics computation Algorithm selection Dataset under analysis Access methods Mining primitives 6

7 Network data analysis Analysis of very large wired network traffic captures wired traffic monitoring and characterization by exploiting data mining techniques (e.g., association rules, clustering, classification) distributed VLDB/cloud technologies to support efficient storage, retrieval and indexing of huge amounts of network data Analysis of huge wireless network traffic captures wireless traffic monitoring by exploiting itemset mining algorithms wireless traffic classification by exploiting association rules and probabilistic models European research projects: mplane and ONTIC 7

8 Text mining Text summarization identification of salient knowledge from news, research articles, blogs generation of sound and easy-to-read summaries of large document collections development of multi-lingual and automatic summarization systems development of cloud-based summarization systems targeted to the extraction of succinct summaries from big data collections Social and educational text mining Content curation systems allow users to build personalized and dynamically updated news reports Integration of a summarization system into a content curation platform Evaluation of system appreciation and feedback E-learning refers to the use of ICTs in education Development and integration of a summarization system in an e- learning context 8

9 Social network mining Social network analysis user behavior analysis by means of data mining techniques topic extraction and correlation analysis discovery of user communities, trends and deviations classification of web objects using user-generated content Social watching analysis of social messages during specific TV programs analysis of evolution of hashtags during program broadcast characterization of user groups and social interactions 9

10 Clinical and biological data management Physiological data analysis analyze physiological data collected during incremental tests (e.g., cardiopulmonary exercise testing) commonly used in clinical domain and in sport science improve the effectiveness of the reliability/training sessions predict the final values of crucial parameters reduce test duration and the physical effort for patients/athletes Clinical data analysis analyze data collected by the healthcare network of an Italian Health Care Center extract medical treatments (in terms of performed examinations, prescribed drugs) frequently done by patients identify deviation from expected medical treatments according to medical guidelines 10

11 Genomic Computing Next Generation Sequencing (NGS) is a new and highthroughput technology for DNA sequencing There is a need for new and effective data mining approaches to discover knowledge from NGS data Goals analysis of complex NGS data and development of innovative semantics-aware algorithms exploitation of bio-ontologies, gene/protein and genetic disorder libraries smart indexing and mining of large-scale NGS datasets compact data representation and efficient data access mining algorithms based on disk-based structures National research project [PRIN 2011]: GenData

12 Green data mining Joint analysis of Energy consumption logs of residential and public building heating systems and indoor climate conditions Data on the user thermal comfort perception of indoor climate conditions and user feedbacks Goals Suggest ready-to-implement energy efficient actions based on innovative and user-friendly indicators Discovering of interesting correlations among the large and heterogeneous amount of available data Regional research project: EDEN project 12

13 Green data mining Joint analysis of energy and water consumption data to efficiently support an intelligent building management system localization of network losses and leaks detection of abnormal consumption characterization of user consumption forecast of energy and water consumption Analysis of available bikes in the stations of a public bikesharing system to forecast critical situations (e.g., empty or full stations) to reschedule the bike redistribution process on the fly characterize the cyclic mobility patterns to support human mobility in urban areas 13

14 Data analysis for Smart Cities Mining urban data to increase the well-being of citizens by improving the efficiency, accessibility and functionality of provided services Analysis of data collected through sensor networks embedded in smart street furniture National research project S[M2]ART Analysis of air pollution data on urban area to detect possible critical conditions National research project: MIE Analysis on data un citizens urban safety and security 14

15 Data analysis for Smart Cities IOC (Intelligent Operations Center) - IBM Platform for data analytics Study IOC architecture, data flow and programming model Deploy IOC in a real application to efficiently support an intelligent transportation system an intelligent water management system 15

16 Biomedical IRCCS Laboratory Assistant Suite modular architecture to manage different kinds of raw experimental data, tracking several laboratory activities, integrate different resources and aid in performing a variety of analyses to extract knowledge related to tumors design of model-driven automated GUI generation development of infrastructural components (e.g., task scheduler, notification system, dashboard) Genome analysis innovative approaches to analyzing NGS data from human genomes analytical algorithms to identify genetic variants in tumors and in the blood of patients implementation and optimization of analytical algorithms for identification and classification of genetic variants in paired comparative tests development of data analysis pipelines in parallel and distributed environment Microarray data analysis study of class discovery algorithms (e.g., clustering, bi-clustering) identify robust gene markers by means of the integration of several classification methods analysis of gene expression values over the time on data derived from xenopatients 16

17 Virtual IRCCS Realization of virtual representation of life-science working environments based on 3D interactive models development of a prototypic application for user-friendly management of complex and hierarchical storage systems, by means of 3D realistic representation of the physical containers and their interactions Computer vision for sensor-based real-time tracking of laboratory activities development of a prototypic platform for automated monitoring of interactions between users, instruments and experimental materials virtual representation of objects and activities is also foreseen to provide intuitive and user-friendly GUIs Biological laboratories need Next Generation LIMS Computer vision to improve the efficiency of laboratory data-tracking procedures Virtual reality to improve graphical user interfaces usability in laboratory information management systems (LIMS) 17

18 Ooros External stages exploiting geo-located social interactions among people, places and businesses (e.g., checkins) business intelligence for social recommendations Analysis of a real-world dataset from hundreds of businesses in Turin, see Narus Lab Developing novel solutions to analyze network traffic data for security purposes (i.e., malware traffic detection, signature generation, anomaly detection), using machine learning and data mining techniques to find relevant patterns The thesis will be conducted in collaboration with NARUS, Inc. Sunnyvale, California in the context of the new laboratories that Narus is opening within the Politecnico di Torino Campus. 18