POLITECNICO DI TORINO


Master of Science Degree in Computer Engineering

Master of Science Thesis

PREDICTION AND VISUALIZATION OF PHYSIOLOGICAL SIGNALS IN INCREMENTAL EXERCISE TESTING

Advisors: Prof. Silvia Chiusano, Prof. Tania Cerquitelli
Candidate: Andrea Carolina Rosales Africano

October 2015

Contents

List of Figures
List of Tables
1. INTRODUCTION
2. DATA MINING
   2.1. Knowledge Discovery
   2.2. Regression
   2.3. Classification
3. PREDICTIVE MODELING
   3.1. Artificial Neural Network (ANN)
        3.1.1. Neural Network Topologies
        3.1.2. Training of ANN
        3.1.3. The Back Propagation Algorithm
   3.2. Support Vector Machine (SVM)
        3.2.1. Formal Explanation of SVM
4. FRAMEWORK
5. DATA PREPARATION
   5.1. Cardiopulmonary Exercise Testing
        5.1.1. Test Execution
   5.2. Considered Dataset
        5.2.1. Athlete's Information
   5.3. Data Processing
        5.3.1. Data Segmentation
        5.3.2. Min-Max Normalization
        5.3.3. Sampling
        5.3.4. Windowing
6. PREDICTION ANALYSIS
   6.1. Prediction Process
   6.2. Prediction Models
   6.3. Artificial Neural Network in RapidMiner
   6.4. Support Vector Machine in RapidMiner
   6.5. Prediction Validation
7. VISUALIZATION
   7.1. Graphics in Java
   7.2. Physiological Signals Visualization
   7.3. Prediction Visualization
8. EXPERIMENTS RESULTS
   8.1. Predictions at the End of the Test
        8.1.1. HRpeak Prediction
        8.1.2. VO2peak Prediction
   8.2. Predictions at the Next Step
        8.2.1. Multiple-Test Approach
        8.2.2. Single-Test Approach
   8.3. Predictions including the athlete's information
9. CONCLUSIONS
REFERENCES
BIBLIOGRAPHY

List of Figures

Figure 2.1. Data Mining Process [3]
Figure 2.2. The Knowledge Discovery Process [4]
Figure 2.3. Datasets for the Regression Model [9]
Figure 2.4. Training Set in a Classification Model [12]
Figure 2.5. Test Set in a Classification Model [12]
Figure 3.1. Neural Connections in Animals [15]
Figure 3.2. Neuron Function [16]
Figure 3.3. Feedforward Neural Network [16]
Figure 3.4. Recurrent Neural Network [16]
Figure 3.5. Many linear classifiers (hyperplanes) may separate the data [3]
Figure 3.6. Maximum separation hyperplanes [3]
Figure 4.1. Framework for Data Prediction and Visualization
Figure 5.1. Normalization Process of the Dataset
Figure 5.2. Normalization of Parameters
Figure 5.3. Windowing Process of the Dataset
Figure 5.4. Windowing Parameters
Figure 5.5. Dataset Format after Windowing
Figure 6.1. ANN Prediction Process with Multiple-Test Approach
Figure 6.2. Neural Net Operator Parameters
Figure 6.3. (a) ANN Prediction Process with Single-Test Approach
Figure 6.4. (b) ANN Prediction Process Single-Test Approach, Inner Loop Files
Figure 6.5. (c) ANN Prediction Process Single-Test Approach, Inner Validation
Figure 6.6. SVM Prediction Process with Multiple-Test Approach
Figure 6.7. Support Vector Machine Operator Parameters
Figure 6.8. (a) SVM Prediction Process with Single-Test Approach
Figure 6.9. (b) SVM Prediction Process Single-Test Approach, Inner Loop Files
Figure 6.10. (c) SVM Prediction Process Single-Test Approach, Inner Validation
Figure 7.1. Window Coordinates in Java [33]
Figure 7.2. Specification of Signal Data Points
Figure 7.3. Visualization of Signal Data Points
Figure 7.4. Timer Event for Data Update
Figure 7.5. Initial Visualization Window
Figure 7.6. Physiological Signal Selection
Figures 7.7-7.16. Visualization of the FIO2, FEO2, FECO2, FETCO2, FETO2, VE, IT, ET, HR and VO2 Signals
Figure 7.17. Visualization of the HR and VO2 Signals
Figure 7.18. Specification of Prediction Data Points
Figure 7.19. Visualization of Prediction Data Points
Figure 7.20. Data Update on File Modification
Figures 7.21-7.24. Prediction Visualization Steps
Figures 8.1-8.6. MAE and RMSE, Multiple-test HRpeak Prediction, Segments low, medium and high
Figures 8.7-8.12. MAE and RMSE, Multiple-test VO2peak Prediction, Segments low, medium and high
Figures 8.13-8.18. MAE and RMSE, Multiple-test HRnext Prediction, Segments low, medium and high
Figures 8.19-8.24. MAE and RMSE, Multiple-test VO2next Prediction, Segments low, medium and high
Figures 8.25-8.30. MAE and RMSE, Single-test HRnext Prediction, Segments low, medium and high
Figures 8.31-8.36. MAE and RMSE, Single-test VO2next Prediction, Segments low, medium and high
Figures 8.37-8.42. MAE and RMSE Comparison, Multiple-test HRpeak Prediction, Segments low, medium and high
Figures 8.43-8.48. MAE and RMSE Comparison, Multiple-test VO2peak Prediction, Segments low, medium and high
Figures 8.49-8.54. MAE and RMSE Comparison, Multiple-test HRnext Prediction, Segments low, medium and high
Figures 8.55-8.60. MAE and RMSE Comparison, Multiple-test VO2next Prediction, Segments low, medium and high
Figures 8.61-8.66. MAE and RMSE Comparison, Single-test HRnext Prediction, Segments low, medium and high
Figures 8.67-8.72. MAE and RMSE Comparison, Single-test VO2next Prediction, Segments low, medium and high

List of Tables

Table 5.1. Monitored Physiological Signals
Table 5.2. Dataset Format
Table 5.3. Dataset Characteristics
Table 5.4. Athlete's Additional Attributes
Table 5.5. Characteristics of each segment
Table 5.6. Dataset Format after Sampling
Table 6.1. Output Format after the Validation Process, VO2next Prediction
Table 6.2. MAE Calculation Format
Table 6.3. RMSE Calculation Format

1. INTRODUCTION

Cardiopulmonary exercise testing (CPET) is a methodology that has changed the approach to the functional evaluation of patients. It makes it possible to link performance and physiological parameters to the underlying metabolic bases, and it also provides highly reproducible descriptors of exercise capacity [1]. In recent years it has been used for diagnostic purposes in the functional evaluation of cardiac patients, in both clinical and research scenarios, and it has also been used to test normal subjects.

Incremental exercise testing is one of the different approaches to CPET. It is a procedure for determining submaximal and maximal physiological values such as VO2peak, the maximum rate of oxygen consumption measured during incremental exercise [2]. The practice consists of starting the test and gradually increasing the intensity (workload) over time. The incremental protocol can be modified in terms of the starting workload, the magnitude of the workload increment and the duration of each stage.

Although cardiopulmonary tests are non-invasive, they are physically demanding: the exercise stresses the subject's body systems by making them work faster and harder. This raises the possibility of decreasing the test duration without losing sensitive and important information that would be obtained from the complete test execution.

The main objective of this thesis is to study and develop a framework that allows forecasting physiological signals that have an important medical impact, and graphically visualizing the prediction results. The objective is to predict, during test execution, the final value or the next-step value of the heart rate (HR) and of the oxygen consumption (VO2), by evaluating several different physiological signals collected during the test. Using the framework, it is possible to progressively visualize, during the test execution, the monitored physiological signals and the prediction results along with the expected error. At each step of the prediction process, the graphical visualization is updated by including the new predicted signal with its corresponding calculated error. With the support of the graphical visualization, the cardiopulmonary response to the test is analyzed during the prediction process, and physicians can decide when to stop the test execution prematurely, consequently reducing the body stress.

The approach proposed in this thesis is based on using data mining techniques for prediction. Specifically, it makes it possible to obtain a suitable model for the currently monitored individual. Both the Artificial Neural Network (ANN) and the Support Vector Machine (SVM) methods have been selected to accomplish the prediction

process. Two different models are provided. The first model, also called single-test, is trained using only the measurements collected during the test currently in execution. It is strictly related to the current individual's response in the ongoing test. The second model, named multiple-test, is trained with a larger reference knowledge base that contains a set of previous tests. Both ANN and SVM perform the analysis on the two previously described models.

The prediction models allow forecasting the heart rate value at the test end (HRpeak) and at the next step (HRnext), and the oxygen consumption value at the test end (VO2peak) and at the next step (VO2next). The multiple-test model has been used to predict all four of the previous values, while the single-test model has been exploited to predict the HRnext and VO2next values. Since both HR and VO2 take continuous values, the prediction process consists of a regression task. In this thesis work, in addition to the physiological signals monitored during test execution, the physical information (i.e., gender, age, BMI and BSA) of the subject under testing is included in order to make the model more precise.

This thesis is organized as follows. Chapter 2 presents the theoretical bases of data mining and a description of the data mining process. Chapter 3 presents two techniques for predictive modeling; in particular, it describes Artificial Neural Networks and Support Vector Machines. Chapter 4 describes the framework for prediction and visualization of physiological signals that has been studied and developed in this work. Chapter 5 describes the process of data preparation, representing the first module of the framework; it describes the considered dataset and its specifications. Chapter 6 presents the prediction analysis process, the second module of the framework; it describes the prediction process using Artificial Neural Networks and Support Vector Machines. Chapter 7 presents the visualization module of the framework; it describes how the prediction results are visualized. Chapter 8 presents the experimental results of the prediction analysis process. Chapter 9 reports conclusions and possible future developments of this work.

2. DATA MINING

Data mining refers to the analysis of the large quantities of data that are stored in computers [3]. Modern computer systems accumulate great amounts of data from a wide variety of sources: from grocery and retail stores that process customers' purchases, to the medical field, where it is possible to include diagnoses of patients. Nowadays, even the Human Genome project and the NASA satellites store great amounts of bytes that are later processed [4]. Data mining is an analytic process that aims to explore data in search of consistent patterns and/or logical relationships between variables, and then to validate the results by applying them to new subsets of data [5].

Figure 2.1. Data Mining Process [3]

The Cross-Industry Standard Process for Data Mining (CRISP-DM, Figure 2.1) is divided into six main phases:

- Business Understanding: This phase includes the specification of the business objectives, the assessment of the current situation, the specification of the data mining goals and the development of a project plan.
- Data Understanding: It involves data requirements and can include initial data collection, description and exploration, and the verification of data quality.
- Data Preparation: This step, also known as Exploration, consists of preparing the data by cleaning it, transforming it or selecting subsets of records if necessary. This is because data may come from a number of different sources, which may not use the same formats, or which may include data that is not necessary or has a great number of variables.

When the data is prepared, it is examined in order to identify the most relevant variables and analyze the complexity of the models that can be considered in the next step.

- Modeling: Taking into account the results from the data preparation stage, modeling involves developing several models that may be used for data analysis.
- Evaluation: Once the different models are specified, it is necessary to choose the best one according to their performance in prediction (that is, producing stable results across samples) and, more importantly, according to the business objectives that were established in the first stage. It is possible that several satisfactory models are identified and thus selected.
- Deployment: The final step in the process consists of applying the model selected in the previous stage to new data, aiming to generate predictions or to estimate the expected outputs.

There is a continuous growth of applications in a vast range of areas such as: analysis of organic compounds, automatic abstracting, credit card fraud detection, electric load prediction, financial forecasting, product design, medical diagnosis and more. The applications can vary in a range of aims. For example:

- optimize the targeting of customers by mining the transaction data of a supermarket chain,
- detect fraud in a credit card company by analyzing its data warehouse,
- predict the audience share of television programs, making it possible to arrange show schedules so that market share is maximized,
- predict medical signals of patients according to previous results (the main topic of this thesis, which will be explored later on),
- among others.

This shows that the quantity of recorded data increases every day, leading to great advances in storage technology so that it is possible to store large amounts of data. It has come to be realized that all the collected data contains knowledge and information that could lead to important progress or discoveries, whether in the field of science, business or even economics [4]. The problem is that, given the great volume of data, most of it is only stored and never analyzed more than superficially. In the effort to turn data into useful information, data mining is also referred to as exploratory data analysis, and it has led to the emergence of a new research area, Knowledge Discovery, to which it relates [6].

2.1. Knowledge Discovery

Knowledge Discovery has been defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data [4]. The term was proposed to describe the process of extracting knowledge from data [6]. It is a process that includes data mining as a central part of the discovery stage.

Figure 2.2. The Knowledge Discovery Process [4]

Figure 2.2 shows the complete knowledge discovery process. It relates to Figure 2.1, as both share common stages (in particular, Data Preparation/Selection and Preprocessing). Starting from the data sources (which may differ from one another), data is integrated and placed into a data store. It can then be selected and preprocessed so that it fits a specific or standard format. This prepared data is analyzed by a data mining algorithm, which produces an output in the form of patterns; these are then interpreted in order to present new and potentially useful knowledge (relationships and patterns between data elements).

Data mining is used to extract useful information from large data archives. This information can be exposed directly in the form of relations between the variables of interest, or indirectly as functions and models that make it possible to predict, classify, or represent regularities in the distribution of the data.

According to [4], it is initially necessary to make a distinction between two types of data. When dealing with a dataset of examples (instances), each instance includes the values of several variables (attributes). The first type of data is called labelled; it refers to data with a selected attribute whose value will be predicted for instances that have not been seen yet, according to the usage of the given data. The second type of data is referred to as un-labelled, that is, data that does not have any selected attribute. With this division, the objective is to extract as much information as possible from the available data.

Once the type of data is distinguished, it is possible to divide the different applications into several data mining techniques such as clustering, association, numerical prediction (also named regression) and classification [3]. Cluster analysis takes ungrouped data and uses automatic techniques to put this data into groups. In association, the relationship of a particular item in a data transaction to the other items in the same transaction is used to predict patterns. In classification, the methods are intended for learning functions that map each item of the selected data into one of a predefined set of classes. The main idea of regression analysis is to discover the relationship between the dependent and independent variables.

The regression and classification techniques are further explained in Sections 2.2 and 2.3, respectively.

2.2. Regression

Regression, also called numerical prediction, relates to the labelled data type where the selected attribute is numerical, e.g. profit, sales, temperature or distance. It is a data mining function that predicts these numerical values [7]. In a data set to which a regression task will be applied, the target value (the labelled attribute or dependent variable) is known, that is, the value to be predicted; all the other attributes are used as the predictors (or independent variables, on which the prediction is based), and the complete set of information of each instance is known as a case.

For example: a data set contains information gathered on houses over a period of time. For each instance, several attributes are known, such as the house value, its age, square footage, number of rooms and number of floors, among others; a regression model that attempts to predict house values is applied to this data set. In this scenario, the house value would be the target attribute, the other attributes would be the predictors, and the data of each house in the data set would be a case.

Regression functions are used to determine relationships between the dependent variable and one or more of the independent variables [8]. In the model, a regression algorithm predicts or estimates the value of the dependent variable as a function of the independent variables for each case in the data set. These relationships are put together within a model that can then be applied to a different data set in which the target values are not known.

Regression models involve two different datasets (Figure 2.3), one for building the model (training set) and one for testing the model (test set). The test set is smaller than the training set because it only includes the specific cases to which the model will be applied; the training set, on the other hand, includes a larger number of cases carrying all the relationships that will be applied to the test cases in order to perform the prediction. Regression models are tested by computing statistical measurements that analyze the difference between the predicted values and the expected ones.
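As an illustration of this training/testing scheme (a minimal sketch, not code from the thesis), the following Python snippet fits a one-predictor linear regression on synthetic data by least squares and evaluates it on a held-out test set with the two error statistics used later in this thesis, MAE and RMSE. All data and names are invented for the example.

```python
import numpy as np

# Toy training and test sets: one predictor x, continuous target y.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 10, 80)
y_train = 2.0 * x_train + 1.0 + rng.normal(0, 0.5, 80)  # y = F(x; theta) + e
x_test = rng.uniform(0, 10, 20)
y_test = 2.0 * x_test + 1.0 + rng.normal(0, 0.5, 20)

# Training: estimate the parameters theta that minimize the residuals
# via least squares (design matrix [x, 1] -> slope and intercept).
A = np.column_stack([x_train, np.ones_like(x_train)])
theta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

# Testing: apply the fitted function to unseen cases and measure the error.
y_pred = theta[0] * x_test + theta[1]
mae = np.mean(np.abs(y_test - y_pred))
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
print(f"theta = {theta}, MAE = {mae:.3f}, RMSE = {rmse:.3f}")
```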

Figure 2.3. Datasets for the Regression Model [9]

The analysis of a regression aims to specify the values of the parameters of a function that lead the function to best fit the set of data relationships that is provided. In symbols, the relationship can be expressed as [7]:

y = F(x1, x2, ..., xn; θ1, θ2, ..., θn) + e

Regression is thus the process of estimating the value of a continuous target (y) as a function (F) of one or more predictors (x1, x2, ..., xn), a set of parameters (θ1, θ2, ..., θn), and a measure of error (e), also called the residual, that is the difference between the expected and the predicted value of the dependent variable; the regression parameters are known as regression coefficients. Training the regression model consists of finding the parameter values that minimize the residual. The outcome is applied to test data with known target values in order to compare the predicted values with the actual ones. In order to test the model, the test data must be compatible with the data used to build the model, that is, they must have been prepared in the same way. The Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE) are the most commonly used statistics for assessing the quality of a regression model; they are explained in detail in the prediction validation part of Chapter 6.

2.3. Classification

Classification relates to the labelled data type where the selected attribute is categorical, i.e. it must take one of a number of distinct values such as good or bad. It is a data mining function that assigns items in a collection to target categories or classes [10]. The classes are mutually exhaustive and exclusive categories, meaning that each object must be assigned to exactly one class (never to no class and never to more than one) [4]. Classification aims to correctly predict the target class for each case in the data. Class values are discrete and do not imply any order.

As in regression, a classification task includes a data set in which the class assignments are known. Each instance of the data set constitutes a case; the labelled attribute is considered the target and the remaining attributes constitute the predictors. For example: a data set contains many loan applicants over a period of time. For each instance, several attributes are known, such as the credit rating, employment history, home ownership or rental, and number and type of investments, among others; a classification model that attempts to predict credit risk is applied to this data set. In this scenario, the credit rating would be the target attribute, the other attributes would be the predictors, and the data of each applicant in the data set would be a case.

The simplest type of classification problem is known as binary classification, where the target attribute can only take one of two possible values; multiclass classification, instead, allows choosing among more than two values. As in regression, classification functions are used to determine relationships between the values of the target and the values of the predictors. In the model, different classification algorithms use different techniques for finding relationships. These relationships are put together within a model that can then be applied to a different data set in which the target values are not known.

An example of a classifier is the rule-based classifier. It uses prediction rules, expressed in the form of IF-THEN rules, to represent knowledge. The IF part includes a conjunction of conditions, and the THEN part predicts an attribute value for an item that satisfies the conditions. The accuracy of the predictions is measured as the percentage of prediction hits over the total number of predictions. A rule is considered accepted if the hit rate is considerably greater than the base occurrence of the predicted attribute value [11].

Classification models also include two different datasets, the training set and the test set. The training set is used in the learning phase, where the classification algorithms build the classifier. Figure 2.4 shows an example of a rule-based classifier and the construction, from training data, of the classification rules. The training set is made of database tuples and their associated attributes (class labels); each tuple is also known as a sample or data point. These are the bases for the classifier.

Figure 2.4. Training Set in a Classification Model [12]

Figure 2.5. Test Set in a Classification Model [12]

Figure 2.5 shows how the previously established classifier is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the classification rules can be applied to a test set or to the new data tuples. The test metrics used to assess the accuracy of the model predictions differ from those of the regression task. Classification testing criteria include [10]:

- Accuracy: It refers to the percentage of correct predictions when compared with the actual classifications in the test set.
- Confusion Matrix: It displays the number of correct and incorrect predictions when compared with the actual classifications in the test set.

- Precision: It refers to the fraction of positively classified instances that are relevant.
- Recall: It refers to the fraction of relevant instances that are positively classified.
- Receiver Operating Characteristic (ROC): It measures the impact of changes in the probability threshold, that is, the decision point used by the model for classification.

Regarding both regression and classification, it is possible to find different methods and algorithms to perform the predictions or classifications. These may differ among themselves, and it may be necessary to compare them in order to use the one(s) that fulfill the business goals. The following criteria can be used for comparing the different methods [12]:

- Accuracy: When referring to the accuracy of a classifier, it means how correctly the class label is predicted. On the other hand, the accuracy of a predictor refers to how well a specific predictor guesses the value of a predicted attribute in a new data set.
- Speed: When generating and using a given classifier or predictor, speed refers to the computational cost of performing these activities.
- Robustness: The ability of a classifier or a predictor to make correct predictions even if the new data does not present exactly the same characteristics as the training data set.
- Scalability: The efficiency in constructing the classifier or predictor as the amount of data increases.
- Interpretability: The extent to which the model built by the classifier or predictor can be understood.
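For reference, the accuracy, precision and recall criteria above can be written in terms of the confusion matrix counts (true positives TP, true negatives TN, false positives FP, false negatives FN). These are the standard definitions, stated here for clarity rather than taken from the thesis:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}
```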

3. PREDICTIVE MODELING

The prediction process that will be explained later in this thesis document relates directly to a regression task. For this reason, the algorithms explained in this section support numerical prediction, and they are used as the basis for the experiments described in Chapter 8.

3.1. Artificial Neural Network (ANN)

An artificial neural network (ANN), also called a neural network (NN), is a mathematical or computational model based on biological neural networks. It is an interconnected group of neurons that process information by means of a connectionist approach to computation [15]. The ANN technique is modeled after the learning processes of the cognitive system and the neurological functions of the brain. It is capable of predicting new observations based on previous observations, after performing the learning process from existing data.

An ANN is a system that adapts and changes its structure taking into account external or internal information that flows through the network during the learning phase. The neurons in the network work together to produce an output function. The network is robust and fault tolerant, that is, it can still produce an output even if some of the inner neurons malfunction.

Figure 3.1. Neural Connections in Animals [15]

Figure 3.1 shows a neuron and a neural connection. ANNs were modeled on the cognitive processes of the brain, which serve as the basis for the neuron function.

Figure 3.2. Neuron Function [16]

As shown in Figure 3.2, a neuron can be seen as a nonlinear, parameterized, bounded function [16]:

y = f(x1, x2, ..., xn; w1, w2, ..., wp)

where the {xi} are the variables and the {wj} are the parameters (or weights) of the neuron. Each neuron in the neural network has an activation number associated with it, and each connection between neurons has an associated weight. This simulates the actual functioning of the biological brain: the firing rate of a neuron and the strength of a synapse. The activation of each neuron depends on the activations of its inputs and the associated weights.

The neurons in the network are organized in layers. The number of layers and the number of neurons inside each layer depend on the nature of the investigated situation. For instance, Figure 3.3 shows a neural network with n inputs, a layer of Nc hidden neurons, and No output neurons.

3.1.1. Neural Network Topologies

Feedforward neural network

Figure 3.3. Feedforward Neural Network [16]

A feedforward neural network is a nonlinear function of its inputs, which is the composition of the functions of its neurons. Graphically it is a set of neurons connected together where the information flows strictly in the forward direction, from the input nodes to the output nodes (Figure 3.3). The neurons that perform the final computation are called output neurons; the remaining neurons, which perform intermediate computations, are called hidden neurons. In this type of topology there are no cycles or loops in the network. The data processing extends over multiple layers, but there are no connections between nodes in the same layer or in previous layers. Each processing element (neuron or node) receives its inputs from the outside world or from the previous layer.

Recurrent neural network

Figure 3.4. Recurrent Neural Network [16]

This type of network is the most general NN, and it presents feedback connections. In particular, it is possible to find at least one cycle in the network. A cycle represents a path that, when following the connections, leads back to the starting vertex (or neuron). In this neural network topology, time is explicitly taken into account, due to the fact that the output of a neuron cannot be a function of itself at the same instant of time. Therefore, a neuron becomes a function of its past value(s). Nowadays most neural network applications are implemented as digital systems, and this is why discrete-time systems are used for investigating recurrent networks. The name comes from the recurrent equations that mathematically describe discrete-time systems.

A delay is assigned to each connection in the network, in addition to a parameter or weight as in feedforward neural networks. This delay is an integer multiple of an

elementary time that is considered as a time unit. The sum of the delays of the edges of a cycle in the graph of the network must be nonzero. This type of network obeys a set of nonlinear discrete-time recurrent equations, including the set of the functions of its neurons and the time delays associated with the connections [16].

Figure 3.4 shows an example of this topology. The digits in each box represent the delays assigned to the connections (expressed as integer multiples of a time unit T). A cycle is present from neuron 3 back to itself through another neuron.

3.1.2. Training of ANN

When working with artificial neural networks, the first step consists of designing the specific architecture related to the situation or problem that will be investigated. This includes the definition of the number of layers and the number of neurons in each layer. When the size of the network has been selected, the network is subjected to training. This is an algorithmic procedure in which the parameters of the neurons are estimated so that the NN accurately completes the task that has been assigned to it. The NN has to be constructed in such a way that when a set of inputs is introduced, the wanted set of outputs is produced. One way to accomplish this is to explicitly set the weights of each connection in the network; another way is to train it by teaching it patterns and letting it change its weights according to some learning rules [15].

Supervised Training

In supervised training, also called associative training, the network is trained by providing it with inputs and matching output patterns. These input-output pairs are provided either by an external teacher, or by the system in which the NN is contained (self-supervised). The teacher is in charge of providing examples of input values and their corresponding output values.

Unsupervised Training

Unlike the supervised training process, in this case there is no initial set of groups into which the patterns are classified; on the contrary, the output neurons are trained to develop their own representation of the input stimuli. More specifically, the system is in charge of statistically discovering similarities between the elements of the dataset and translating them into output patterns.

3.1.3. The Back Propagation Algorithm

The back propagation algorithm is a method within the training process of an ANN. It is used in layered feedforward NNs. The neurons send their signals forward and the errors are propagated backwards. This algorithm uses supervised training, so it is provided with the input-output pairs that the network is to compute, and the difference between the actual value and the

expected value is calculated. The main idea is to diminish the error until the network learns the training information.

Initially a training sample is presented to the neural network. The output of the network is compared to the desired output of the sample, and the error is calculated in each output neuron. For each neuron, the expected output is calculated along with a scaling factor (how much lower or higher the output must be adjusted so that it matches the desired output); this constitutes the local error. After this, the weights of each neuron are adjusted so that the local error is reduced. This process is performed from the output layer back to the first hidden layer.

3.2. Support Vector Machine (SVM)

The support vector machine (SVM) is a training algorithm for learning classification and regression rules from data. SVMs arose from statistical learning theory; the goal is to solve the exact problem of study without involving a more difficult problem as an intermediate step [17]. Unlike artificial neural networks, SVMs were developed from the theory down to the implementation and experiments.

In SVMs, the learning process consists of estimating some unknown, nonlinear dependency y = f(x) between some high-dimensional input vector x and a scalar output y (or vector output y). Since there is no information about the underlying joint probability functions, the only available information is a training data set D = {(xi, yi) ∈ X × Y}, i = 1, ..., l, where l is the number of training data pairs and is equal to the size of the training data set D, and yi (also denoted as di) is the target value. Thus, support vector machines belong to the supervised learning techniques [18].

SVM methods generate input-output mapping functions from a set of labeled training data. SVMs, also known as kernel methods, belong to a set of generalized linear models that achieve classification or regression based on the value of a linear combination of features [3]. A kernel is a function that transforms the input data into a high-dimensional space where the problem is solved. Kernel functions can be linear or nonlinear [19].
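As examples (standard kernels, not a list taken from the thesis), commonly used kernel functions include the linear, polynomial and radial basis function (RBF) kernels, where c, d and σ are kernel parameters:

```latex
K_{\mathrm{linear}}(x, x') = x \cdot x', \qquad
K_{\mathrm{poly}}(x, x') = (x \cdot x' + c)^{d}, \qquad
K_{\mathrm{RBF}}(x, x') = \exp\!\left(-\frac{\lVert x - x' \rVert^{2}}{2\sigma^{2}}\right)
```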

Figure 3.5. Many linear classifiers (hyperplanes) may separate the data [3]

SVMs classify data by learning from historic cases, which are represented as data points. These data points may have more than two dimensions. The main idea is to find out whether it is possible to separate the data by an (n-1)-dimensional hyperplane (a decision plane that defines decision boundaries; it separates data points that have different class memberships), and to see if there is a maximum separation between the classes. A hyperplane is chosen so that the distance between it and the nearest data point is maximized. That is, the goal is to find the support vectors that define the separators giving the widest separation of classes. Figure 3.5 shows data separated by means of several linear classifiers or hyperplanes.

3.2.1. Formal Explanation of SVM

Consider data points in a training set of the following form:

{(x1, c1), (x2, c2), ..., (xn, cn)},

where ci is either 1 or -1 (denoting the class to which the data point belongs), and each data point xi is an n-dimensional real vector. This training data denotes the correct classification that the SVM should reproduce, by means of the dividing hyperplane w·x - b = 0, where w is the normal vector of the separating hyperplane and b is an offset parameter that allows the margin (or separation) to be increased; if it is not used, the hyperplane must pass through the origin, restricting the solution. The main point is to maximize the margin, so what matters most are the support vectors and the parallel hyperplanes that are closest to these support vectors in either class. These parallel hyperplanes are described by the equations w·x - b = 1 and w·x - b = -1.

Figure 3.6. Maximum separation hyperplanes [3]

If the training data are linearly separable, it is possible to select the hyperplanes so that there are no points between them, and then try to maximize their distance. By means of geometry, the distance between the two parallel hyperplanes is 2/||w||, so the goal is to minimize ||w||. In order to exclude data points from the margin, it is necessary to ensure that for all i either w·xi - b ≥ 1 or w·xi - b ≤ -1. This can be rewritten as ci(w·xi - b) ≥ 1, 1 ≤ i ≤ n.

When dealing with regression, the produced model depends only on a subset of the training data, due to the fact that the cost function for building the model does not take into consideration any training data that are close to the model prediction. In this case, SVM uses an epsilon-insensitive loss function. The main point is to find a continuous function such that the maximum number of data points lie within the epsilon-wide insensitivity range. Predictions that fall within this range are not interpreted as errors [19].
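In summary (the standard formulation, stated here for reference), the linearly separable classification case reduces to the quadratic optimization problem

```latex
\min_{w,\,b} \ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad c_i\,(w \cdot x_i - b) \ge 1, \qquad 1 \le i \le n,
```

while for regression the epsilon-insensitive loss, which treats errors smaller than ε as zero, is

```latex
L_{\varepsilon}\bigl(y, f(x)\bigr) = \max\bigl(0,\ \lvert y - f(x) \rvert - \varepsilon\bigr).
```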

4. FRAMEWORK

This thesis presents a framework that has been developed for the prediction and visualization of physiological signals. Figure 4.1 shows the three modules of the process: data preparation, prediction analysis and visualization.

Figure 4.1. Framework for Data Prediction and Visualization

Data Preparation: The data set preparation includes the information that will serve as input for the prediction process. In this thesis it contains the information of a set of athletes. However, it may be preprocessed in such a way that several different characteristics are included. Chapter 5 describes the characteristics of the dataset and the specifications for its preprocessing.

Prediction Analysis: Once the data set has been preprocessed, it is used as input for the prediction phase. In this case it consists of a regression task on physiological signals monitored in incremental exercise testing. Chapter 3 presents the theoretical foundation for the prediction, while Chapter 6 describes the specific algorithms, techniques and parameters that were applied in the actual experiments.

The prediction process outputs a series of result files. They follow a specific format and are used as input for the visualization step. The results depend on the specific prediction techniques, and they not only include the predicted values but also the measured error with respect to the actual values. Chapter 8 presents the output values for the experiments that were performed.

Visualization: The last step of the process consists of the visualization of the prediction results. A running program continuously reads from the result file and updates the graphics, which include the predicted signal values along with the positive and negative errors. The program also presents the actual physiological signal values in the running test. An example of prediction visualization can be seen in Chapter 7.

These modules are based on RapidMiner (Section 4.1) and Python, and they are described in detail in Chapters 5, 6 and 7.

4.1. RapidMiner

RapidMiner is a provider of software, solutions, and services for data mining, machine learning, and predictive analytics [13]. It has a suite of products that includes RapidMiner Studio, the main program used for developing the data preparation and prediction analysis modules of the framework. RapidMiner Studio is a downloadable GUI for machine learning, data mining, text mining, predictive analytics and business analytics. It provides a GUI to design an analytical pipeline. This GUI generates an XML (eXtensible Markup Language) file that defines the analytical process the user wants to apply to the data. This file is read by RapidMiner, which then runs the analyses automatically. While the analyses run, it is possible to interact with the GUI to control and inspect the running processes [14].
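The visualization module itself is written in Java (Chapter 7). Purely as an illustration of the read-and-update loop just described (a sketch, not the thesis implementation), the following Python snippet polls the result file for modifications; the file name and column layout are assumptions:

```python
import csv
import os
import time

RESULTS_FILE = "prediction_results.csv"  # hypothetical name; produced by the prediction module

def read_predictions(path):
    """Read the predicted values (and errors) written by the prediction module."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

last_mtime = 0.0
while True:
    mtime = os.path.getmtime(RESULTS_FILE)
    if mtime != last_mtime:              # the prediction module wrote a new step
        last_mtime = mtime
        rows = read_predictions(RESULTS_FILE)
        print(f"update: {len(rows)} predicted points")  # a real viewer would redraw the chart here
    time.sleep(1.0)                      # poll once per second
```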

5. DATA PREPARATION

This chapter gives an overview of cardiopulmonary exercise testing. It describes the process of data preparation, the considered dataset and its specifications. The data preparation consists of preprocessing the data in order to use it for the prediction analysis.

5.1. Cardiopulmonary Exercise Testing

The cardiopulmonary exercise test (CPET) is a highly sensitive, non-invasive stress test [20]. The exercise stresses the patient's body systems by making them work faster and harder. A CPET evaluates how well the heart, lungs, and muscles work individually, and how these body systems work as a unit. A CPET assesses how the cardiopulmonary system performs by measuring the amount of oxygen the body consumes, the amount of carbon dioxide it produces, the breathing pattern, and the electrocardiogram (ECG) while the subject rides a stationary bicycle or walks on a treadmill. In addition to evaluating multiple body systems, the CPET also makes it possible to monitor changes in a disease condition, the effect of medications on the body, and whether medical therapy improves the patient's condition.

5.1.1. Test Execution

The cardiopulmonary exercise test is performed on a stationary bicycle or a treadmill. During the test several pieces of equipment are used to monitor the body's response. These include:

- Face mask: It is used to monitor the oxygen consumed, the carbon dioxide produced and the breathing pattern.
- Electrocardiogram (ECG): It is used to monitor the heart rate and rhythm.
- Blood pressure cuff: It is used because the blood pressure is taken several times during the test.
- Pulse oximeter: It is used to measure the percentage of blood cells covered with oxygen.

Once the patient is on the bicycle/treadmill, he begins pedaling lightly in order to warm up. The resistance on the bicycle/treadmill becomes harder according to the protocol that has been selected for the test. The test continues until the person gives the maximum effort and can no longer continue. However, the test can be immediately stopped by the doctor if the patient shows certain symptoms.

The test protocol, denoted Wstep x tstep, means that every tstep seconds/minutes the workload is increased by Wstep Watts. It is determined by the purpose of the test and the functional capabilities of the patient [21]. Regarding the experiments directly involved with this thesis, the physiological signals monitored in the tests are the following:

Signal name | Abbreviation | Measurement Unit
Fraction of inspired oxygen | FIO2 | %
Fraction of expired oxygen | FEO2 | %
Fraction of expired carbon dioxide | FECO2 | %
Fraction of end-tidal oxygen | FETO2 | %
Fraction of end-tidal carbon dioxide | FETCO2 | %
Ventilation | VE | l/min
Inspiratory time | IT | sec
Expiratory time | ET | sec
Heart rate | HR | bpm
Oxygen consumption | VO2 | l/min

Table 5.1. Monitored Physiological Signals

5.2. Considered Dataset

The data set on which the experiments are performed includes data from athletes who were subjected to an incremental cardiopulmonary exercise test. The data set includes the test results of 236 athletes. The test protocol for the CPET was 50W x 2min, meaning that every 2 minutes the workload is incremented by 50 Watts.

TIME | LOAD | FIO2 | FEO2 | FECO2 | FETO2 | FETCO2 | VE | TI | TE | HR | VO2

Table 5.2. Dataset Format

Table 5.2 shows the format of each file within the data set. The TIME column references the time at which the measures were taken in each test; the LOAD column includes the workload that was set at that specific time; and the following columns relate to the monitored physiological signals from Table 5.1.

CHARACTERISTICS | Value
Number of Athletes | 236
Min Workload | 0
Max Workload | 500
Average Workload | 212,...
Standard Deviation Workload | 108,44097
Min VO2 | 0
Max VO2 | 6410,46475
Average VO2 | 2726,05261
Standard Deviation VO2 | 1176,47308
Min HR | 33
Max HR | 215
Average HR | 139,...
Standard Deviation HR | 34,...

Table 5.3. Dataset Characteristics

Table 5.3 shows the dataset characterization regarding the main signals that will be taken into account when performing the predictions (HR and VO2), and also the workload, since it will be used for segmenting the dataset.

5.2.1. Athlete's Information

In order to make the prediction process more precise, the prediction model will predict the cardiopulmonary response based on the physiological signals from Table 5.1, and it will also include the athlete's physical data.

TEST_ID | AGE | BMI | BSA | SEX

Table 5.4. Athlete's Additional Attributes

Table 5.4 shows the format of the file that contains the athlete's data:

- TEST_ID: Identifier of the athlete, in order to associate him with his specific results file.
- AGE: Age of the athlete.
- BMI: Body Mass Index of the athlete.
- BSA: Body Surface Area of the athlete.
- SEX: Gender of the athlete.

Before the data preprocessing begins, this file is matched to the files where the physiological signals of each athlete are stored, which follow the format shown in Table 5.2. This is performed by means of a Python script that includes the new information in the signals files. The resulting file consists of the initial file containing the physiological information (Table 5.2) with four new columns, one for each new attribute of the subject (age, BMI, BSA, sex).
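The matching script itself is not reproduced in this document; the following pandas sketch shows one way such a match could be done, with all file names and the example TEST_ID being assumptions:

```python
import pandas as pd

# Hypothetical file names; the thesis matches each athlete's attribute record
# (Table 5.4) to his signals file (Table 5.2) by means of a Python script.
signals = pd.read_csv("test_001_signals.csv")   # TIME, LOAD, FIO2, ..., HR, VO2
athletes = pd.read_csv("athletes.csv")          # TEST_ID, AGE, BMI, BSA, SEX

test_id = 1  # identifier linking this signals file to its athlete (assumed)
info = athletes.loc[athletes["TEST_ID"] == test_id].iloc[0]

# Append the four athlete attributes as constant columns of the signals file.
for col in ["AGE", "BMI", "BSA", "SEX"]:
    signals[col] = info[col]

signals.to_csv("test_001_signals_enriched.csv", index=False)
```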

5.3. Data Processing

5.3.1. Data Segmentation

In order to correctly apply the prediction process to the dataset, it has been divided into three segments according to the maximum workload that the athletes achieved at the end of each test. That is, the first segment includes the athletes who reached a maximum of at most 200 Watts at the test end; the second segment includes the athletes whose maximum workload at the test end was between 250 and 400 Watts; and the third group contains the ones in the range 450-500 Watts at the end of the test. Table 5.5 shows the characterization of each segment regarding the main signals that will be taken into account when performing the predictions (HR and VO2).

Segment | Low | Medium | High
Workload | [0, 200] | [250, 400] | [450, 500]
Number of Athletes | | |
Min Workload | | |
Max Workload | | |
Average Workload | 122,... | |
Standard Deviation Workload | 60,... | |
Min VO2 | | |
Max VO2 | 3335,... | |
Average VO2 | 1806,... | |
Standard Deviation VO2 | 792,... | |
Min HR | | |
Max HR | | |
Average HR | 119,... | |
Standard Deviation HR | 26,... | |

Table 5.5. Characteristics of each segment

5.3.2. Min-Max Normalization

A major problem when dealing with physiological signals whose values span different ranges is that, if the gaps between the ranges are large enough, the attributes with large values could obfuscate the smaller ones, leading to inconsistency in the results. To overcome this situation the normalization technique is used. Normalization is a preprocessing technique used to rescale attribute values to fit in a specific range. It is used to level the ranges when working with values that vary in size due to the units of their representation [22]. Min-max normalization rescales the attribute values so that they fit within given minimum (min) and maximum (max) values.

Figure 5.1. Normalization Process of the Dataset

Figure 5.1 shows the process of min-max normalization by means of the Normalize operator in RapidMiner. In particular, the input data (exa port) corresponds to each of the files that contain the results of the cardiopulmonary exercise tests; the output data (exa port) corresponds to the initial file with the selected attributes in normalized form. The parameters that were specified in the process were (Figure 5.2):

- Attribute_filter_type: subset - This option allows the selection of multiple attributes through a list.
- Attributes: The required attributes can be selected with this option. In this case the selected attributes are: fraction of inspired oxygen, fraction of expired oxygen, fraction of expired carbon dioxide, fraction of end-tidal oxygen, fraction of end-tidal carbon dioxide, ventilation, inspiratory time and expiratory time.
- Method: range_transformation - This method normalizes the attribute values in the specified range [min, max].
- Min: 0.0 - It specifies the minimum point of the range.
- Max: 1.0 - It specifies the maximum point of the range.

After applying the normalization operator, the dataset keeps the same format as in Table 5.2.
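For reference, this is the standard min-max transformation applied to each value v of a selected attribute A; with the parameters above (new range [0, 1]) it reduces to v' = (v - min_A)/(max_A - min_A):

```latex
v' = \frac{v - \min_A}{\max_A - \min_A}\,(\mathrm{max}_{\mathrm{new}} - \mathrm{min}_{\mathrm{new}}) + \mathrm{min}_{\mathrm{new}}
```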

Figure 5.2. Normalization of Parameters

5.3.3. Sampling

Once the normalization has been performed, a sliding time window is applied to the resulting dataset. This is done by means of a Python script that sets a sliding time window of fixed time intervals, which moves across the rows of the dataset, calculates the average of each attribute and returns one output row per window. The window size and the step size parameters are set to 20 seconds and 5 seconds respectively.
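The scripts themselves are not included in the thesis text; the following pandas sketch illustrates both this sliding-window averaging and the sampling step described next (num_samples = 3 rows per step_length = 30 s). The input file name and the exact column handling are assumptions:

```python
import pandas as pd

df = pd.read_csv("test_001_signals.csv")   # hypothetical name; format of Table 5.2

# Sliding-window averaging: 20 s window moved in 5 s steps, one output row per window.
WINDOW, STEP = 20, 5
rows = []
t = df["TIME"].min()
while t + WINDOW <= df["TIME"].max():
    win = df[(df["TIME"] >= t) & (df["TIME"] < t + WINDOW)]
    if len(win):
        avg = win.mean(numeric_only=True)   # average of every attribute in the window
        avg["COUNTS"] = len(win)            # number of rows that fell in the window
        rows.append(avg)
    t += STEP
windowed = pd.DataFrame(rows)

# Sampling: keep NUM_SAMPLES rows per STEP_LENGTH seconds of windowed data.
NUM_SAMPLES, STEP_LENGTH = 3, 30
windowed["STEP"] = (windowed["TIME"] // STEP_LENGTH).astype(int)
sampled = windowed.groupby("STEP").head(NUM_SAMPLES).reset_index(drop=True)
sampled["SAMPLE"] = sampled.groupby("STEP").cumcount()
```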

Sampling is the process of selecting a subset of examples from a given dataset [23]. In this case a Python script extracts, from a given dataset, num_samples rows for each time interval of length equal to step_length. num_samples is set to 3 and step_length is set to 30 seconds.

TEST_ID | STEP | SAMPLE | TIME | LOAD | COUNTS | ...

Table 5.6. Dataset Format after Sampling

Once the sampling process is applied, the dataset has the format of Table 5.6. The TEST_ID column is an identifier of the test; it increments by 1 as the number of athletes increments. STEP is the step number in the sampling; SAMPLE is the sample number of the row within a specific step; TIME corresponds to the average time within the specific window; LOAD corresponds to the average workload within the specific window; COUNTS corresponds to the number of rows that were included in the window; and the remaining columns include all the initial physiological signals.

5.3.4. Windowing

Once the preprocessing of each file has finished, the Windowing (Series Extension) operator in RapidMiner is used. Windowing makes it possible to take any time series data and transform it into a cross-sectional format [24]. It transforms a given dataset containing series data into a new dataset containing single-valued examples. Windows with a specified window and step size move across the series, and the attribute value lying horizon values after the window end is used as the label which should be predicted. When the process finishes, it is possible to apply any predictive modeling algorithm to the dataset to predict future values.

Figure 5.3. Windowing Process of the Dataset

Figure 5.3 shows the Windowing operator that is applied to each of the files of the dataset. In particular, the input data (exa port) corresponds to each of the files that contain the results of the cardiopulmonary exercise tests; the output data (exa port) corresponds to the final file that includes all the windowing. The parameters [25] that were specified in the process are, as shown in Figure 5.4:

- Window size: 3 - It determines how many attributes are created for the cross-sectional data. Each row of the original time series within the window width becomes a new attribute.
- Step size: 1 - It determines how the window is advanced.
- Horizon: 1 - It determines how far out to make the forecast, that is, the distance between the current sample in the time window and the value to be predicted.
- Label attribute: HR/VO2 - It selects the name of the attribute which should be used for creating the label values.
- Add incomplete windows: Checked - If checked, it creates windows for all examples.

The parameters were selected according to the documentation in [26].

Figure 5.4. Windowing Parameters

The windowing process must be run twice: the first time specifying the label attribute parameter as HR and the second time specifying it as VO2, or vice versa, as shown in Figure 5.4. The output file of the windowing process has the format shown in Figure 5.5. In particular, for each column in the input file (Table 5.6) three columns are created (since the window size was specified as 3), except for the TEST_ID column, which was selected as identifier and is placed as the final column. A label column is created, and its value depends on the physiological signal that will be predicted (HR or VO2).
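As an illustration of what the operator computes (a sketch, not the RapidMiner implementation), the following Python function reproduces the windowing transformation for the parameters above; the column-naming convention follows Figure 5.5:

```python
import pandas as pd

def windowing(df, attrs, label_attr, window=3, step=1, horizon=1):
    """Turn series rows into cross-sectional examples, as the RapidMiner
    Windowing operator does: each attribute within the window becomes a
    new column, and the label is the value `horizon` rows after the window."""
    examples = []
    for start in range(0, len(df) - window - horizon + 1, step):
        row = {}
        for a in attrs:
            for k in range(window):
                # oldest row in the window gets suffix -2, newest gets -0
                row[f"{a}-{window - 1 - k}"] = df[a].iloc[start + k]
        row["label"] = df[label_attr].iloc[start + window - 1 + horizon]
        examples.append(row)
    return pd.DataFrame(examples)

# e.g. run once per target, mirroring the two runs described above:
# windowing(df, attrs=["STEP", "SAMPLE", "TIME", "LOAD", "HR", "VO2"], label_attr="HR")
# windowing(df, attrs=["STEP", "SAMPLE", "TIME", "LOAD", "HR", "VO2"], label_attr="VO2")
```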

STEP-2 | STEP-1 | STEP-0 | SAMPLE-2 | SAMPLE-1 | SAMPLE-0 | ... | label | TEST-ID

Figure 5.5. Dataset Format after Windowing

The processes of normalization, sampling and windowing are performed on two different datasets. The first dataset does not include the athlete's body information (Section 5.2.1); the second dataset is similar to the first one, but it includes the athlete's attributes. This is done in order to compare the results and analyze the impact of including the new information on the prediction process.

6. PREDICTION ANALYSIS

This chapter presents the prediction analysis process. It describes the prediction process using Artificial Neural Networks and Support Vector Machines.

6.1. Prediction Process

Once the data has been preprocessed, the next step consists of proceeding with the signal prediction. The main signals involved in the prediction process are the heart rate (HR) and the oxygen consumption (VO2). Since the values of both of these signals are numeric, and since the predictions are continuous in time, the prediction process is a regression task. In order to perform this process, the ANN and SVM techniques are used, given that they exist as RapidMiner operators and that they have been studied in previous theses with good results.

The prediction process may take from a couple of hours up to at most four weeks, depending on the algorithm used and the specific approach. Each prediction experiment may be launched either directly in RapidMiner or as a background process launched by a Python script (in order to improve efficiency). The launching mode also depends on the specific approach used in the predictions.

This chapter presents the different approaches for predicting the signals, along with the specifications of each of the prediction techniques. It is important to remark that the prediction process is performed on two datasets in order to compare and analyze the results. The first dataset is based only on the athlete's physiological signals; the second dataset also includes the athlete's body information. The second dataset is analyzed only by means of the ANN technique, whereas the first one is examined by both ANN and SVM.

6.2. Prediction Models

The prediction of the physiological signal values that can be achieved in a new ongoing test takes place at each time at which the workload is increased. There are two values that can be predicted in relation to each of the signals: the HR and VO2 values reached at the end of the test (HRpeak and VO2peak, respectively), and the HR and VO2 values reached at the step following the prediction step (HRnext and VO2next, respectively).

According to the previous signal predictions, there are two types of prediction approaches that can be used, multiple-test and single-test. They differ in the reference knowledge base used for model training. Both prediction models can be built with either ANN or SVM. To obtain an accurate prediction of the signals, an appropriate model for the currently monitored patient should be selected.

6.2.1. Multiple-Test Approach

The multiple-test model is available to predict HRpeak and VO2peak, as well as HRnext and VO2next. This model is trained using a large reference knowledge base that contains a collection of previous tests that were run with the same protocol as the one the current patient is performing, and that reached a workload value at least equal to the workload of the current test at the current prediction step. The multiple-test approach, also named community-based, generates an enriched model because it takes into account the responses collected in previous tests, in particular those similar to the current patient's responses.

6.2.2. Single-Test Approach

On the other hand, the single-test model, also called individual-based, is trained using only the measurements collected during the test currently in execution. Thus, it is strongly tailored to the patient's response in the ongoing test. This approach supports HRnext and VO2next predictions only.

The prediction process follows the same stages for both approaches. Initially, a prediction model is created for each target physiological value. Then the prediction model is trained following the chosen approach (single-test or multiple-test), considering the physiological signals listed in Table 5.1. Finally, the created prediction model is used to predict the physiological signal values.

6.3. Artificial Neural Network in RapidMiner

The Neural Net operator in RapidMiner learns a model by means of a feed-forward neural network trained by a backpropagation algorithm (multi-layer perceptron) [27]. An ANN is a computational model based on the structure and the functional aspects of biological neural networks, more specifically an interconnected group of artificial neurons. It is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase.

In a feed-forward NN the connections between the units do not form any loops: the information only moves forward, from the input nodes to the output nodes. The backpropagation algorithm is a supervised learning method that compares the output values with the correct answers to compute the value of some predefined error function. This error is fed back through the network, and the algorithm adjusts the weight of each connection so as to decrease the value of the error function. The process is repeated until the network converges to a state where the error is small.

A multilayer perceptron (MLP) is a feed-forward ANN model that maps sets of input data onto a set of appropriate outputs. An MLP includes several layers of nodes in a directed graph, where each layer is fully connected to the next one. An MLP uses backpropagation to train the network; a minimal sketch of the weight update is given after Figure 6.2.

Figure 6.1. ANN Prediction Process with Multiple-Test Approach

Figure 6.2. Neural Net Operator Parameters
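As an illustration of the update rule just described, the sketch below applies one weight update with a learning rate and a momentum term, the role played by the learning_rate and momentum parameters listed next; it is a simplified sketch, not RapidMiner's implementation, and all names are illustrative.

```java
// Simplified sketch of one backpropagation weight update with momentum.
// Not RapidMiner's implementation; variable names are illustrative.
public class BackpropStep {

    /**
     * Standard gradient-descent update: the new delta combines the error
     * gradient (scaled by the learning rate) with a fraction of the
     * previous delta (the momentum term), which smooths the optimization
     * direction and helps the search escape shallow local optima.
     */
    static double updateWeight(double weight, double gradient,
                               double previousDelta, double learningRate,
                               double momentum) {
        double delta = -learningRate * gradient + momentum * previousDelta;
        // In a full implementation, delta would be stored as the
        // previousDelta of the next training cycle.
        return weight + delta;
    }

    public static void main(String[] args) {
        double w = 0.5;           // current connection weight
        double grad = 0.8;        // dE/dw from the backpropagated error
        double prevDelta = -0.05; // delta applied at the previous cycle
        // Same learning rate and momentum values as used in this thesis.
        double updated = updateWeight(w, grad, prevDelta, 0.3, 0.2);
        System.out.printf("updated weight = %.4f%n", updated);
        // delta = -0.3 * 0.8 + 0.2 * (-0.05) = -0.25, so the weight becomes 0.25
    }
}
```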

The Neural Net operator uses a sigmoid function as the activation function. The following parameters were selected according to [26] and are shown in Figure 6.2:

- Hidden_layers: 2. This parameter describes the name and the size of all hidden layers; it is used to define the structure of the neural network.
- Training_cycles: 100. It specifies the number of training cycles used for the neural network training, i.e., the number of times the back-propagation process is repeated.
- Learning_rate: 0.3. It determines how much the weights are changed at each step.
- Momentum: 0.2. This parameter adds a fraction of the previous weight update to the current one, which helps the optimization escape local optima and smooths the optimization direction.
- Error_epsilon: 1.0E-5. The optimization stops if the training error gets below this value.

Figure 6.1 shows the prediction process related to the multiple-test model (HRpeak, VO2peak, HRnext and VO2next). This process is launched by a Python script. Initially it reads the training and the testing datasets; the training set is used by the Neural Net operator and, when the learning phase finishes, the model is applied to the testing set. At the end, the prediction results are written to a file.

- Read CSV operator: both the training set and the testing set follow the CSV (comma-separated values) format, which is why this operator is used. It reads each file and outputs the file contents in tabular form along with the metadata.
- Replace Missing Values operator: after the windowing process, the datasets may end up with missing values. This operator replaces them with zero so that the Neural Net operator can work properly.
- Apply Model operator: this operator applies a trained model to a data set, in this case the output of the Neural Net operator and the testing set, respectively.
- Write CSV operator: once the model has been applied, the results are written to an output file in CSV format.

The prediction process related to the single-test approach (HRnext and VO2next) is shown in Figure 6.3, Figure 6.4 and Figure 6.5. Unlike the multiple-test approach, the single-test approach can be launched directly from RapidMiner. It loops through each of the files and performs the prediction individually. Initially each file (related to one athlete) is read and the windowing operator is applied to it (Section 5.3.4). The windowed data is then used by the Neural Net operator to perform the learning phase; once this is done, the model is applied as in the multiple-test approach.

Figure 6.3. (a) ANN Prediction Process with Single-Test Approach

Figure 6.4. (b) ANN Prediction Process Single-Test Approach Inner Loop Files

Figure 6.5. (c) ANN Prediction Process Single-Test Approach Inner Validation

- Loop Files operator: it runs its subprocess once for each file (each file corresponds to one athlete).
- Append operator: once the prediction has been performed for each athlete, this operator merges all the results into one final output file written in CSV format.
- Multiply operator: when looping, in the inner process, each file is copied so that it can be used both as a training set and as a testing set.
- Windowing operator: it performs the same process specified in Section 5.3.4 of this document.
- Validation operator: it encapsulates sliding windows of training and testing in order to estimate the performance of a prediction operator; it uses one window of examples for training and another window for testing.
- Performance operator: it delivers as output a list of performance values according to a list of selected performance criteria.

6.4. Support Vector Machine in RapidMiner

The SVM operator in RapidMiner is based on the Java libsvm library.

The SVM takes a set of input data and predicts, for each given input, which of two possible classes the input belongs to, making the SVM a non-probabilistic binary linear classifier [28]. Given a set of training examples, each marked as belonging to one of two classes, an SVM training algorithm builds a model that assigns new examples to one of the classes. An SVM model represents the examples as points in space, mapped so that the examples of the two categories are divided by a gap that is as wide as possible. New examples are mapped into that same space and predicted to belong to a class depending on which side of the gap they fall on.

More formally, an SVM constructs a hyperplane, or a set of hyperplanes, in a high- or infinite-dimensional space. A good separation is achieved by the hyperplane with the largest distance to the nearest training data points of any category, since the larger the margin, the lower the generalization error of the classifier. While the original problem may be stated in a finite-dimensional space, the sets to be separated are often not linearly separable in that space. For this reason, the original space is mapped into a considerably higher-dimensional space, where the separation becomes easier. So that the computational load remains reasonable, SVM schemes use mappings designed to guarantee that dot products can be computed easily in terms of the variables in the original space, by defining them through a kernel function K(x, y) selected according to the problem. The hyperplanes in the higher-dimensional space are defined as the sets of points whose inner product with a vector in that space is constant. A sketch of such a kernel function is given after Figure 6.6.

Figure 6.6. SVM Prediction Process with Multiple-Test Approach
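As an illustration of the kernel function K(x, y) mentioned above, the following plain-Java sketch computes the Radial Basis Function kernel used by the SVM operator in this work, K(x, y) = exp(-gamma * ||x - y||^2); it is a didactic sketch, not the libsvm implementation.

```java
// Sketch of the Radial Basis Function kernel used by the SVM operator:
// K(x, y) = exp(-gamma * ||x - y||^2). Inner products in the implicit
// high-dimensional space are computed directly from the original vectors,
// which is what keeps the computational load reasonable.
public class RbfKernel {

    static double k(double[] x, double[] y, double gamma) {
        double squaredDistance = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            squaredDistance += d * d;
        }
        return Math.exp(-gamma * squaredDistance);
    }

    public static void main(String[] args) {
        double[] a = {0.2, 0.5, 0.9};   // e.g. three normalized HR samples
        double[] b = {0.25, 0.55, 0.95};
        // Similar vectors give a kernel value close to 1;
        // more distant vectors give smaller values.
        System.out.println(k(a, b, 1.0));                         // about 0.99
        System.out.println(k(a, new double[]{0.9, 0.1, 0.3}, 1.0)); // about 0.36
    }
}
```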

Figure 6.7. Support Vector Machine Operator Parameters

The following parameters were selected according to [26] and are shown in Figure 6.7:

- Kernel_type: rbf. The Radial Basis Function kernel nonlinearly maps samples into a higher-dimensional space, so it can handle the case in which the relation between class labels and attributes is nonlinear.
- Gamma: 0.0. This parameter specifies the gamma of the rbf kernel function.
- C: It specifies the cost parameter of epsilon-SVR; it is the penalty parameter of the error term.
- Epsilon: It defines the tolerance of the termination criterion.

Figure 6.6 shows the prediction process related to the multiple-test model (HRpeak, VO2peak, HRnext and VO2next). Like the ANN prediction process, the SVM process is launched by a Python script. The phases and operators of this process are the same as in the ANN technique; they differ only in the SVM operator.

The prediction process related to the single-test approach (HRnext and VO2next) is shown in Figure 6.8, Figure 6.9 and Figure 6.10. Unlike the multiple-test approach, the single-test approach can be launched directly from RapidMiner. It behaves similarly to the corresponding process using the ANN technique: it

loops through each of the files and performs the prediction individually. The two processes differ in the validation phase, where the SVM operator replaces the ANN operator, and also after the windowing operator: in this case, the attributes related to the prediction are selected and the missing values are then replaced by zero.

Figure 6.8. (a) SVM Prediction Process with Single-Test Approach

Figure 6.9. (b) SVM Prediction Process Single-Test Approach Inner Loop Files

Figure 6.10. (c) SVM Prediction Process Single-Test Approach Inner Validation

- Select Attributes operator: it selects which attributes of the dataset are kept in order to proceed with the prediction. In this case the selected attribute may be either HR or VO2, depending on the signal to be predicted.

6.5. Prediction Validation

Once the predictions are performed, it is necessary to measure how accurate the results are. To this end, leave-one-out cross-validation is used. Cross-validation is a model evaluation method that consists in not using the entire data set when training a learner: part of the data is removed before the training phase begins and, once the training is done, the removed data is used to test the performance of the learnt model on new data [29].

Leave-one-out cross-validation, also known as N-fold cross-validation, divides the data set into as many parts as there are instances, each instance effectively forming a test set of one. N classifiers are generated, each from N-1 instances, and each is used to classify a single test instance. The predicted accuracy is the total number of correctly classified instances divided by the total number of instances [4]. The standard error is

SE = \sqrt{p(1-p)/N}

where p is the predicted accuracy and N is the total number of instances. In this method every single data point is used for testing exactly once, on as many trained models as there are data points [3]. Leave-one-out cross-validation, even though time consuming, benefits small datasets, where as much data as possible is needed to train the classifier. In these experiments, at each workload increment the subset of tests still running is selected from the dataset. Each time, a different test is picked from this subset and the remaining tests are used as the knowledge base to predict the considered values.

6.5.1. Mean Absolute Error (MAE)

The MAE is an error measure over the estimation period. It measures the average magnitude of the errors in a set of predictions, without considering their direction, and measures accuracy for continuous variables [30]:

MAE = \frac{1}{N} \sum_{i=1}^{N} |x_i - y_i|

where x_i is the actual value, y_i is the predicted value, and N is the number of non-missing data points. The MAE is the average, over the verification sample, of the absolute values of the differences between the predicted and the actual value of the signal in the test (absolute prediction error). This indicator is a linear score, meaning that all individual differences are weighted equally in the average.

Table 6.1. Output Format after the Validation Process (VO2next prediction). Columns: TEST_ID, VO2_next, prediction(VO2_next)

After the result file has been produced by the prediction process, it is possible to compare the results and calculate the error. Table 6.1 shows the format of the results file, which contains the information of all the predicted values for all the tests. In particular, the columns are:

- TEST_ID: The identifier of each test, related to each athlete.
- VO2_next: The actual value of the predicted signal. The column name varies as the prediction process targets different signals and prediction approaches.
- prediction(VO2_next): The predicted value of the specified signal. The column name varies as the prediction process targets different signals and prediction approaches.

This result file is the input for the computation of the MAE. A Python script goes through the results file computing the difference between the actual value and the predicted value of the specific signal, thus calculating the mean absolute error. When the script finishes, the results are compared to those obtained for the same predicted signal with the other prediction technique (ANN versus SVM).

Table 6.2 shows the output format once the MAE script has been run. It consists of a CSV file containing the following columns:

- Step: Number of the step in relation to the test duration.
- MAE: Mean absolute error of the n-th step.
- AVG_SIGNAL: Mean value of the target signal in the n-th step.
- AVG_SIGNAL+MAE: Average value of the target signal in the n-th step affected by positive uncertainty.
- AVG_SIGNAL-MAE: Average value of the target signal in the n-th step affected by negative uncertainty.
- PATIENTS: Number of patients that have arrived at the n-th step.

Table 6.2. MAE Calculation Format (columns: Step, MAE, AVG_SIGNAL, AVG_SIGNAL + MAE, AVG_SIGNAL - MAE, PATIENTS)

6.5.2. Root Mean Squared Error (RMSE)

The RMSE is another error measure over the estimation period that measures the average magnitude of the error. The differences between each predicted value and the corresponding actual value are squared and averaged over the sample; finally, the square root of the average is taken [30]:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2}

where x_i is the actual value, y_i is the predicted value, and N is the number of non-missing data points.

The RMSE is always greater than or equal to the MAE; the greater the difference between them, the greater the variance of the individual errors in the sample. If they are equal, all errors have the same magnitude. The RMSE usually takes precedence over other statistical measures and is more sensitive than them: since the errors are squared before being averaged, it gives a relatively high weight to large errors, and the squaring process gives disproportionate weight to very large errors [31].

Like the MAE, the RMSE calculation takes as input the result file of the prediction, which has the format of Table 6.1. A Python script goes through this file computing the difference between the actual value and the predicted value of the specific signal, thus calculating the root mean squared error. When the script finishes, the results are compared to those obtained for the same predicted signal with the other prediction technique (ANN versus SVM).

Table 6.3 shows the output format once the RMSE script has been run. It consists of a CSV file containing the following columns:

- Step: Number of the step in relation to the test duration.
- RMSE: Root mean squared error of the n-th step.
- AVG_SIGNAL: Mean value of the target signal in the n-th step.

- AVG_SIGNAL+RMSE: Average value of the target signal in the n-th step affected by positive uncertainty.
- AVG_SIGNAL-RMSE: Average value of the target signal in the n-th step affected by negative uncertainty.
- PATIENTS: Number of patients that have arrived at the n-th step.

Table 6.3. RMSE Calculation Format (columns: Step, RMSE, AVG_SIGNAL, AVG_SIGNAL + RMSE, AVG_SIGNAL - RMSE, PATIENTS)

A minimal sketch of the two error computations is given below.
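As a concrete reference for the two error measures, the following plain-Java sketch computes the MAE and the RMSE over a pair of actual/predicted series; the Python scripts used in this thesis follow the same logic, and the sample values are illustrative.

```java
// Sketch of the MAE and RMSE computations over a prediction result file.
// actual[i] and predicted[i] correspond to the VO2_next and
// prediction(VO2_next) columns of Table 6.1; values are illustrative.
public class PredictionErrors {

    /** MAE = (1/N) * sum(|x_i - y_i|): every error is weighted equally. */
    static double mae(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]);
        }
        return sum / actual.length;
    }

    /** RMSE = sqrt((1/N) * sum((x_i - y_i)^2)): large errors weigh more. */
    static double rmse(double[] actual, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            double d = actual[i] - predicted[i];
            sum += d * d;
        }
        return Math.sqrt(sum / actual.length);
    }

    public static void main(String[] args) {
        double[] actual    = {2100, 2250, 2400, 2600};
        double[] predicted = {2080, 2300, 2380, 2500};
        System.out.printf("MAE  = %.2f%n", mae(actual, predicted));   // 47.50
        System.out.printf("RMSE = %.2f%n", rmse(actual, predicted));  // about 57.66
        // RMSE >= MAE always holds; equality means all errors are equal.
    }
}
```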

7. VISUALIZATION

The final part of the complete process presented in this thesis consists of visualizing the physiological signals under study along with the prediction results. The results visualization allows a deeper understanding of the results and complements the prediction process with a graphical component: after each step of the prediction process the results are made visible, so that it is possible to observe the values as the prediction continues.

The visualization module was developed in the Java programming language using the NetBeans IDE. NetBeans IDE is a free, open-source, fully-featured Java IDE written completely in Java. It provides support for the latest Java technologies, allows fast code editing, makes project management easy and efficient, and provides rapid user interface development [32].

7.1. Graphics in Java

In order to plot 2D graphics in Java it is necessary to understand that the computer screen is made up of pixels (picture elements). Each window is a set of pixels, and each of these pixels has a pair of coordinates that works much like Cartesian coordinates. The main difference is that the upper-left corner is the origin point (unlike the lower-left corner of the Cartesian plane) [33]. Figure 7.1 shows an example for a window that is 640 pixels wide by 480 pixels tall.

Figure 7.1. Window Coordinates in Java [33]

In the Java graphing window the grid is numbered in the positive direction on the x-axis horizontally to the right, and in the positive direction on the y-axis vertically going down [34]. The visualization uses a framework in which a window is launched where the drawings are done, by means of the Graphics class in the java.awt package, which contains the main methods for graphing. The javax.swing.JFrame and javax.swing.JPanel classes were also used, along with several other Java packages.

7.2. Physiological Signals Visualization

The first part of the visualization includes only the physiological signals, as they are selected while the test is running. In order to plot the graphics, it is necessary to specify and draw each data point separately. Figure 7.2 shows a code snippet of how the data points are set according to the window description. Taking into consideration the grid organization of the Java graphing window, a set of variables must be specified in order to draw the data points properly:

- X_AXIS_FIRST_X_COORD: Constant that indicates the first x-axis value (left-to-right on the x-axis).
- Y_AXIS_SECOND_Y_COORD: Constant that indicates the first y-axis value (top-to-bottom on the y-axis).
- listpoints: List that stores the data points (x coordinate and y coordinate).
- xlenght: Constant that indicates the width between consecutive points on the x-axis.
- ylength: Constant that indicates the width between consecutive points on the y-axis.
- ycoordnumbers: Constant that indicates the number of divisions on the y-axis.
- ymaximum: Constant that indicates the maximum value a specific signal can reach.
- point_length: Constant that indicates the diameter of a data point circle.

The Point2D.Double constructor creates a new data point whose first parameter is the x-axis value and whose second parameter is the y-axis value, according to the Cartesian coordinates. In order to draw a line, the initial and final points must be specified; the variables initial and end hold these points, and they are set for each pair of consecutive points in the list. The variables initialvertical and endvertical specify the points used to draw a vertical line from the x-axis origin to the actual signal value on the y-axis. The g2.draw method draws a circle (Ellipse2D.Double) that marks the physiological signal value.

Figure 7.2. Specification of Signal Data Points

Once the data points are specified, it is necessary to draw them. Figure 7.3 shows how this is implemented: the g2.draw method is used to draw a line from the initial point to the final point.

Figure 7.3. Visualization of Signal Data Points

The data visualization is continuously updated, meaning that the file containing the signal values is read at regular intervals. Figure 7.4 shows a Timer event that triggers the actionPerformed method: if the file contains information, it is read again and the visualization window is repainted.

Figure 7.4. Timer Event for Data Update

Figure 7.5 shows the main visualization window. Initially it does not display any graphic, and it allows the user to select the signal(s) to be displayed. A minimal sketch of the drawing and update logic of Figures 7.2 to 7.4 is given after Figure 7.5.

Figure 7.5. Initial Visualization Window
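Since Figures 7.2 to 7.4 are code snippets, the following self-contained sketch reconstructs the same logic under the naming conventions described above: a JPanel that draws the stored data points as connected lines and circles, and a Swing Timer that periodically appends new data and repaints. It is an illustrative reconstruction, not the thesis code itself.

```java
import java.awt.Graphics;
import java.awt.Graphics2D;
import java.awt.geom.Ellipse2D;
import java.awt.geom.Line2D;
import java.awt.geom.Point2D;
import java.util.ArrayList;
import java.util.List;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.Timer;

// Illustrative reconstruction of the drawing logic of Figures 7.2 to 7.4.
public class SignalPanel extends JPanel {

    static final int X_AXIS_FIRST_X_COORD = 50;   // left edge of the plot area
    static final int Y_AXIS_SECOND_Y_COORD = 400; // bottom edge (y grows downward)
    static final int POINT_LENGTH = 6;            // diameter of a data point circle

    private final List<Point2D.Double> listPoints = new ArrayList<>();

    void addSample(double xPixel, double valuePixels) {
        // The y pixel coordinate is the bottom edge minus the value,
        // because the origin of a Java window is the upper-left corner.
        listPoints.add(new Point2D.Double(X_AXIS_FIRST_X_COORD + xPixel,
                Y_AXIS_SECOND_Y_COORD - valuePixels));
        repaint();
    }

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        Graphics2D g2 = (Graphics2D) g;
        // Connect each pair of consecutive points with a line.
        for (int i = 1; i < listPoints.size(); i++) {
            Point2D.Double initial = listPoints.get(i - 1);
            Point2D.Double end = listPoints.get(i);
            g2.draw(new Line2D.Double(initial, end));
        }
        // Draw a small circle on each data point.
        for (Point2D.Double p : listPoints) {
            g2.draw(new Ellipse2D.Double(p.x - POINT_LENGTH / 2.0,
                    p.y - POINT_LENGTH / 2.0, POINT_LENGTH, POINT_LENGTH));
        }
    }

    public static void main(String[] args) {
        JFrame frame = new JFrame("Signal visualization sketch");
        SignalPanel panel = new SignalPanel();
        frame.add(panel);
        frame.setSize(640, 480);
        frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        frame.setVisible(true);

        // Timer event: every second a new (simulated) sample is appended,
        // standing in for the periodic re-read of the signal file.
        final int[] step = {0};
        new Timer(1000, e -> {
            panel.addSample(step[0] * 20, 100 + 5 * step[0]);
            step[0]++;
        }).start();
    }
}
```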

Figure 7.6. Physiological Signal Selection

Figure 7.6 shows a code snippet of the selection of each signal. An ItemEvent is raised when a check box is toggled; the handler compares the event source to each of the signals that can be displayed. Once the source matches a signal, it checks whether the signal has been selected or deselected. If deselected, the signal visualization is removed from the window; on the contrary, if the physiological signal has been selected, a graphic container initializes the signal values and the graphic is made visible. A minimal sketch of this selection logic follows.

The physiological signals that can be presented are the ones specified in Table 5.1. Their graphics are shown from Figure 7.7 to Figure 7.16.
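The sketch below illustrates the selection logic of Figure 7.6, assuming hypothetical check box and panel names; the real handler compares the source against every signal of Table 5.1 in the same way.

```java
import java.awt.event.ItemEvent;
import java.awt.event.ItemListener;
import javax.swing.JCheckBox;
import javax.swing.JPanel;

// Illustrative sketch of the ItemEvent handling described for Figure 7.6.
// hrCheckBox and hrGraphPanel are hypothetical names.
public class SignalSelection implements ItemListener {

    private final JCheckBox hrCheckBox;
    private final JPanel hrGraphPanel;

    public SignalSelection(JCheckBox hrCheckBox, JPanel hrGraphPanel) {
        this.hrCheckBox = hrCheckBox;
        this.hrGraphPanel = hrGraphPanel;
        hrCheckBox.addItemListener(this);
    }

    @Override
    public void itemStateChanged(ItemEvent e) {
        // Compare the event source against each selectable signal;
        // the real handler repeats this test for every signal in Table 5.1.
        if (e.getSource() == hrCheckBox) {
            if (e.getStateChange() == ItemEvent.SELECTED) {
                hrGraphPanel.setVisible(true);   // initialize and show the graphic
            } else {
                hrGraphPanel.setVisible(false);  // remove the visualization
            }
        }
    }
}
```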

Figure 7.7. Visualization of the FIO2 Signal

Figure 7.8. Visualization of the FEO2 Signal

Figure 7.9. Visualization of the FECO2 Signal

Figure 7.10. Visualization of the FETCO2 Signal

Figure 7.11. Visualization of the FETO2 Signal

Figure 7.12. Visualization of the VE Signal

Figure 7.13. Visualization of the IT Signal

Figure 7.14. Visualization of the ET Signal

Figure 7.15. Visualization of the HR Signal

Figure 7.16. Visualization of the VO2 Signal

Figure 7.17. Visualization of the HR and VO2 Signals

When a signal is selected, its graphic is updated at regular intervals so that it can include the new incoming values while the test is in execution. A horizontal scrollbar is made visible to facilitate keeping track of the current and past values. It is also possible to select or deselect several signals in order to visualize them at the same time. Since not all the signal values are within the same range, and they do not share the same units, each selected signal is plotted separately. In this case a vertical scrollbar appears so that all the graphics can be observed, as shown in Figure 7.17. This example presents the visualization of the heart rate (bottom, red graphic) along with the oxygen consumption (top, blue graphic).

7.3. Prediction Visualization

The prediction visualization has been integrated with the RapidMiner prediction of the values. This means that each time a step concludes with a prediction, the results can be observed graphically instead of having to read the output file. The plotting is based on Section 7.2; the difference is that here a single graphic is shown, including the prediction results along with the errors, whereas in the former section each physiological signal could be visualized separately.

Figure 7.18 and Figure 7.19 show how the data points are specified and then made visible. As in Section 7.2, they are based on the same variables and constants, but in this case there are three different lines, each specified by initial and final points.

Figure 7.18. Specification of Prediction Data Points

Figure 7.19. Visualization of Prediction Data Points

Figure 7.20. Data Update on File Modification

Since the prediction process is continuous, it is necessary to update the graphic so that it includes the new data points. Unlike the data update in Section 7.2, in this case

the plot is updated each time the result file is modified, as shown in Figure 7.20. A comparison is made between the stored timestamp and the file's current timestamp; if they differ, the file is read again and the window updates the graphics. A minimal sketch of this update check is given below.

The plot of the results presents, at each step of the prediction process, the signal value along with the computed error results (positive uncertainty and negative uncertainty). It also specifies the approach that was used (multiple-test or single-test), the signal that was predicted (HRpeak, VO2peak, HRnext or VO2next), the regression technique (ANN or SVM) and the computed error (MAE or RMSE).

From Figure 7.21 to Figure 7.24 it is possible to see an example of a prediction process. This example is based on the multiple-test approach to predict VO2peak by means of the ANN technique, and the computed error is the MAE. The figures present the evolution of the prediction process from the first step to the last one.

The prediction visualization module has been integrated with the module presented in Section 7.2. This means that not only can the prediction results be made visible, but the physician can also analyze the physiological signal values while the test is in execution. Figure 7.23 and Figure 7.24 show how the prediction results can be visualized at the same time as the actual heart rate and oxygen consumption values. By this means, it is possible to analyze the predicted signal along with the cardiopulmonary response of the patient.
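The timestamp comparison of Figure 7.20 can be sketched as follows; resultFile and lastTimestamp mirror the description above, while the reload callback stands for re-reading the file and repainting the window. It is an illustrative sketch, not the thesis code.

```java
import java.io.File;
import javax.swing.Timer;

// Sketch of the update-on-modification check of Figure 7.20: a Swing Timer
// polls the result file and triggers a reload only when its timestamp changes.
public class FileWatcher {

    private final File resultFile;
    private long lastTimestamp;

    public FileWatcher(String path, Runnable reload) {
        this.resultFile = new File(path);
        this.lastTimestamp = resultFile.lastModified();
        // Poll every second; File.lastModified() returns 0 if the file
        // does not exist yet, so its first write also triggers a reload.
        new Timer(1000, e -> {
            long current = resultFile.lastModified();
            if (current != lastTimestamp) {     // the file was modified
                lastTimestamp = current;
                reload.run();                   // re-read the file and repaint
            }
        }).start();
    }
}
```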

Figure 7.21. Prediction Visualization Step 1

Figure 7.22. Prediction Visualization Step 2

Figure 7.23. Prediction Visualization Step 10

Figure 7.24. Prediction Visualization Step 20