CBR System for Leukemia Patients Diagnosis

Size: px
Start display at page:

Download "CBR System for Leukemia Patients Diagnosis"

Transcription

1 CBR System for Leukema Patents Dagnoss Juan F. De Paz, Sara Rodríguez, Javer Bao, Juan M. Corchado Departamento de Informátca y Automátca, Unversdad de Salamanca Plaza de la Merced s/n, 37008, Salamanca, España {fcofds,srg,baope,corchado}@usal.es Abstract. The use of computatonal methods s fundamental n cancer research. One of the possbltes s the use of Artfcal Intellgence technques. Several of these technques have been used to analyze expresson arrays. However, the new Exon arrays whch work wth a large amount of data requre novel solutons. Ths paper presents a Case-based reasonng (CBR) system for automatc classfcaton of leukema patents from Exon array data. The proposed CBR system ncorporates novel algorthms for data flterng and classfcaton. The system has been tested and the results obtaned are presented n ths paper. Keywords: Case-based Reasonng, ESOINN neural network, leukema classfcaton.. Introducton Durng recent years there have been great advances n the feld of Bomedcne []. The ncorporaton of computatonal and artfcal ntellgence technques to the feld of medcne has yelded remarkable progress n predctng and detectng dseases []. One of the areas of medcne whch s essental and requres the mplementaton of technques that facltate automatc data processng and extracton of knowledge s genomcs. Genomcs deals wth the study of genes, ther documentaton, ther structure and how they nteract [2]. We dstngush dfferent felds of study wthn the genome. One s transcrptome, whch deals wth the study of rbonuclec acd (RNA), and can be studed through expresson analyss [3]. Ths technque studes RNA chans thereby dentfyng the level of expresson for each gene studed. It conssts of hybrdzng a sample for a patent and colourng the cellular materal wth a specal dye. Ths offers dfferent levels of lumnescence that can be analyzed and represented as a data array. Tradtonally, methods and tools have been developed to work wth expresson arrays contanng about data ponts. The emergence of the Exon arrays [7], holds mportant potental for bomedcne. However, the Exon arrays requre novel tools and methods to work wth very large ( ) amounts of data Ths paper presents a CBR system that facltates the analyss and classfcaton of data from Exon arrays correspondng to patents wth leukema. Leukema, or blood cancer, s a dsease that has a sgnfcant potental for cure f detected early [4].

2 The relatonshp between the chromosomal alteratons and prognoss of leukema and lymphomas s well establshed. Recently, conventonal array-based expresson proflng has demonstrated that chromosomal alteratons are assocated wth dstnctve patterns of expresson. The four most mportant types of Leukema are acute and chronc myelogenous leukema (AML;CML) and acute and chronc lymphocytc Leukema(ALL; CLL). About new cases of both acute and chronc Leukema appear every year. About 2000 adult cases are dagnosed annually as acute myelogenc Leukema, 8000 as chronc lymphocytc Leukema, 500 as chronc myelogenous forms, and about 3500 as acute forms of lymphocytc Leukema. A study [8] shows that an estmated 9900 new cases of myeloma were dagnosed n USA n The system proposed n the context of ths work focuses on the detecton of carcnogenc patterns n the data from Exon arrays and s constructed from a CBR system that provdes a classfcaton technque based on prevous experences. An expresson analyss conssts bascally of three stages: normalzaton and flterng; clusterng and classfcaton; and extracton of knowledge. These stages can be automated and ncluded n a CBR system. The frst step s crtcal to acheve both a good normalzaton of data and an ntal flterng to reduce the dmensonalty of the data set wth whch to work [5]. Snce the problem at hand s workng wth hghdmensonal arrays, t s mportant to have a good pre-processng technque that facltates automatc decson-makng about the varables that wll be vtal for the classfcaton process [6]. For some tme now, we have been workng on the dentfcaton of technques to automate the reasonng cycle of several CBR systems appled to complex domans [20] [2] [25]. The obectve of ths work s to develop a CBR system that allows the dentfcaton of patents wth varous types of cancer. The model ams to mprove the cancer classfcaton based on mcroarray data. The system proposed n ths paper presents a new synthess that brngs several artfcal ntellgence subfelds together (flter technques, clusterng and artfcal neural networks). The retreval, reuse, revson and learnng stages of the CBR system use these technques to facltate the CBR adaptaton to the doman of bologcal dscovery wth mcroarray datasets. Specfcally, the system presented n ths paper uses a model whch takes advantage of two novel methods for analyzng Exon array data: a technque for flterng data, and a technque ESOINN [25] for clusterng. The frst method combnes varous flterng technques to dramatcally reduce the dmensonalty. The second one allows clusterng by ncorporatng both the dstrbuton process of the entre surface of classfcaton, and the separaton between groups wth low densty among them. The paper s structured as follows: Secton 2 and Secton 3 descrbe the proposed CBR model and how t s adapted to the problem under consderaton. Fnally, Secton 4 presents the results and conclusons obtaned after testng the model. 2. CBR System for Classfyng Exon Array Data The CBR developed tool receves data from the analyss of chps and s responsble for classfyng of ndvduals based on evdence and exstng data. Case-based

3 Reasonng s a type of reasonng based on the use of past experences [9]. The way cases are managed s known as the CBR cycle, and conssts of four sequental phases: retreve, reuse, revse and retan. 2.. Retreve The retreve phase starts when a new problem descrpton s receved. Contrary to what usually happens n the CBR, our case study s unque n that the number of varables s much greater than the number of cases. Ths leads to a change n the way the CBR functons so that nstead of recoverng cases at ths stage, mportant varables are retreved. In the case study, the number of cases s not the problem, rather the number of varables. For ths reason varables are retreved at ths stage and then, dependng on the dentfed varables, the other stages of the CBR are carred out. Ths phase wll be broken down nto 5 stages whch are descrbed below: 2... RMA The RMA (Robust Mult-array Average) [0] algorthm s frequently used for preprocessng Affymetrx mcroarray data. RMA conssts of three steps: () Background Correcton; () Quantle Normalzaton (the goal of whch s to make the dstrbuton of probe ntenstes the same for arrays); and () Expresson Calculaton Control and Errors Durng ths phase, all probes used for testng hybrdzaton are elmnated. These probes have no relevance at the tme when ndvduals are classfed, as there are no more than a few control ponts whch should contan the same values for all ndvduals. If they have dfferent values, the case should be dscarded. Therefore, the probes control wll not be useful n groupng ndvduals. On occason, some of the measures made durng hybrdzaton may be erroneous; not so wth the control varables. In ths case, the erroneous probes that were marked durng the mplementaton of the RMA must be elmnated Varablty Once both the control and the erroneous probes have been elmnated, the flterng begns. The frst stage s to remove the probes that have low varablty. Ths work s carred out accordng to the followng steps:. Calculate the standard devaton for each of the probes 2 () N x N Where N s the number of tems total, varable, x s the value of the probe for the ndvdual. 2. Standardze the above values s the average populaton for the

4 where z N N and x N N 2 (2) where z N(0, ) 3. Dscard of probes for whch the value of z meet the followng condton: z.0 gven that P ( z.0) Ths wll effect the removal of about 6% of the probes f the varable follows a normal dstrbuton Unform Dstrbuton Fnally, all remanng varables that follow a unform dstrbuton are elmnated. The varables that follow a unform dstrbuton wll not allow the separaton of ndvduals. Therefore, the varables that do not follow ths dstrbuton wll be really useful varables n the classfcaton of the cases. The contrast of assumptons followed s explaned below, usng the Kolmogorov-Smrnov [8] test as an example. H 0 : The data follow a unform dstrbuton; H : The analyzed data do not follow a unform dstrbuton. Statstcal contrast: D, max D D (3) where max 0 ) ( D F x D max F0( x ) wth as the pattern n n n n of entry, n the number of tems and F ( x ) the probablty of observng values less 0 than wth H 0 beng true. The value of statstcal contrast s compared to the next value: n the specal case of unform dstrbuton sgnfcance C D (4) k(n) C. 0. k( n) n 0.2 n and a level of Correlatons At the last stage of the flterng process, correlated varables are elmnated so that only the ndependent varables reman. To ths end, the lnear correlaton ndex of Pearson s calculated and the probes meetng the followng condton are elmnated. r x y (5)

5 beng: x x. N, x x xs u xs.. x x. r x y x x s the covarance between probes and. Where N s 2.2. Reuse Once fltered and standardzed, the probes produce a set of values x wth =... N, =... s where N s the total number of cases, s the number of end probes. The next step s to perform the clusterng of ndvduals based on ther proxmty accordng to ther probes. Snce the problem on whch ths study s based contaned no pror classfcaton wth whch tranng could take place, a technque of unsupervsed classfcaton was used. There s a wde range of possbltes. Some of these technques are artfcal neural networks such as SOM [] (self-organzng map), GNG [2] (Growng neural Gas) resultng from the unon of technques CHL [3] (compettve Hebban Learnng) and NG [4] (neural gas), GCS [2] (Growng Cell Structure), Growng Grd or the SOINN [5] (self-organzng ncremental neuronal network) or methods based on herarchcal clusterng [26]. Some of the methods, such as self-organzed Kohonen maps, set the number of clusters n the ntal phase of tranng when usng the algorthm of the k-means learnng method. The self-organzed maps have other varants of learnng methods that base ther behavour on methods smlar to the NG. They create a mesh that s adusted automatcally to a specfc area. The greatest dsadvantage, however, s that both the number of neurons that are dstrbuted over the surface and the degree of proxmty are set beforehand, resultng n the number remanng constant throughout the entre tranng process. Unlke the self-organzng maps based on meshes, Growng Grd or GCS do not set the number of neurons, or the degree of connectvty, but they do establsh the dmensonalty of each mesh. Ths complcates the separaton phase between groups once t s dstrbuted evenly across the surface. After analyzng dfferent technques and checkng the problems they mght present so that they mght be appled to the problem at hand, we have decded to use a varaton of neural network SOINN [5], called ESOINN [6] (Enhanced selforganzng ncremental neuronal network). Unlke the SOINN, ESOINN conssts of a sngle layer, so t s not necessary to determne the manner n whch the tranng of the frst layer changes to the second. Wth a sngle layer, ESOINN s able to ncorporate both the dstrbuton process along the surface and the separaton between low densty groups. It has characterstcs of a network CHL n the ntal phases, by whch t could be understood as a phase of competton, whle n a second phase, the network of nodes begns to expand ust as wth a NG. Ths process s conducted n an teratve way untl t reaches stablty. Only the changes n tranng phase are detaled below:. Update the weghts of neurons by followng a process smlar to the SOINN, but ntroducng a new defnton for the learnng rate n order to provde greater stablty for the model. Ths learnng rate has produced good results n other networks such as SOM [7].

6 W n M )( W ) W n ( M )( W ) wth N (6) Beng a ( a a n ( x), n ( x) 2 a a x 2 2 x 2. Delete the connectons wth hgher age. The ages are typfed and are removed those whose values are n the regon of reecton wth k>0. s If all nput patterns have been passed then a KS-Test [8] s carred out n order to determne f the densty dstrbuton for the neurons n each group follows a normal dstrbuton. If so then the learnng procedure s fnshed; otherwse the next pattern s processed. The value of chosen s Once, the clusters have been made, the new sample s classfed. Its assocaton s carred out bearng n mnd the smlarty of the new case wth the recovered varables n the frst phase. The smlarty measure used s as follows: s d( n, m) f ( x, x )* w (7) Where s s the total number varables, n and m the cases, n m w the value obtaned n the unform test and f the Mnkowsk [9] Dstance that s gven for the followng equaton. f x, y) p ( con x y p x p, y R (8) Ths dssmlarty measure weghs those probes that have a less unform dstrbuton, snce these varables don t allow a separaton 2.3. Revse/Retan The revson s carred out by an expert who determnes the correcton wth the group assgned by the system. If the assgnaton s consdered correct, then the retreve and reuse phases are carred out agan so that the system s ready for the next classfcaton. Nevertheless, the system provdes an automatc temporal revson consderng the retreved cases. The system calculates the percentage of cases that have already been accurately classfed among those retreved for the current problem. If the percentage of a class s greater than a threshold then the system establshes that the case has been successfully classfed. Ths decson has to be confrmed by the human expert. 3. Case Study In the case study presented n the framework of ths research are avalable 248 samples are avalable from analyses performed on patents ether through punctures n

7 marrow or blood samples. The am of the tests performed s to determne whether the system s able to classfy new patents based on the prevous cases analyzed and stored. Fgure shows a scheme of the bo-nspred model ntended to resolve the problem descrbed n Secton 2. The proposed model follows the procedures that are performed n medcal centres. As can be seen n Fgure, a prevous phase, external to the model, conssts of a set of tests whch allow us to obtan data from the chps and are carred out by the laboratory personnel. The chps are hybrdzed and explored by means of a scanner, obtanng nformaton on the markng of several genes based on the lumnescence. At that pont, the CBR-based model starts to process the data obtaned from the Exon arrays. Fg.. Proposed CBR model The retreve phase receves an array wth a patent s data as nput nformaton. The retreve step flters genes but never patents. The set of patents s represented n as D d,..., d }, where represents the patent and n represents the { t d R number of probes taken nto consderaton. As explaned n Secton 2., durng the retreve phase the data are normalzed by the RMA algorthm [0] and the dmensonalty s reduced bearng n mnd, above all, the varablty, dstrbuton and correlaton of probes. The result of ths phase reduces any nformaton not consdered meanngful to perform the classfcaton. The new set of patents s defned through s ' ' ' ' s varables D { d,..., } d R, s n. d t The reuse phase uses the nformaton obtaned n the prevous step to classfy the patent nto a leukema group. The data comng from the retrever phase conssts of a ' ' ' ' s group of patents D { d,..., } con d R, s n, each one characterzed by a set of meanngful attrbutes ) d t ( x,..., x s d, where x s the lumnescence value of the probe for the patent. In order to create clusters and consequently obtan patterns to classfy the new patent, the reuse phase mplements a novel neural network based on the ESOINN [6], secton 2.2. The network classfes the patents by takng nto account ther proxmty and ther densty, n such a way that the result provded s a set G where g g wth G { g,..., gr} r s. g D,

8 and r,. The set G s composed of a group of clusters, each of them contanng patents wth a smlar dsease. Once the clusters have been obtaned, the system can classfy the new patent by assgnng hm to one of the clusters. The new patent s defned as d ' t and hs membershp to a group s determned by a smlarty functon defned n (7). The result of the reuse phase s a group of clusters ' ' G { g,... g.,... gr} r s where g { ' g dt }. An expert from the Cancer Insttute s n charge of the revson process. Ths expert ' determnes f g g d } can be consdered as correct. In the retan phase the { ' t system learns from the new experence. If the classfcaton s consdered successful, then the patent s added to the memory case D d,..., d t, d }. { t 4. Results and Conclusons Ths paper has presented a CBR system whch allows automatc cancer dagnoss for patents usng data from Exon arrays. The model combnes technques for the reducton of the dmensonalty of the orgnal data set and a novel method of clusterng for classfyng patents. The CBR system presented n ths work focused on dentfyng the mportant varables for each of the varants of blood cancer so that patents can be classfed accordng to these varables. In the experments reported n ths paper, we worked wth a database of bone marrow cases from 248 adult patents wth fve types of leukaema, plus a group of 6 samples belongng to healthy persons (no leukemas. The retreve stage of the proposed CBR system presents a novel technque to reduce the dmensonalty of the data. The total number of varables selected n our experments was reduced to 883, whch ncreased the effcency of the cluster probe. In addton, the selected varables resulted n a classfcaton smlar to that already acheved by experts from the laboratory of the Insttute of Cancer. The error rates have remaned farly low especally for cases where the number of patents was hgh. To try to ncrease the reducton of the dmensonalty of the data we appled prncpal components (PCA) but ths reducton of the dmensonalty was not approprate n order to obtan a correct classfcaton of the patents. Fgure 2a shows the classfcaton performed for patents from groups CLL. As can be seen n Fgure 2a, represented n black, most of the people of the CLL group are together, concdng wth the prevous classfcaton gven by the experts at the Insttute of Cancer. Only a small porton of the ndvduals departed from the ntal classfcaton. In a smlar way we proceeded to evaluate the classfcaton for the rest of the groups. Fgure 2b shows the total number of patents from each group and the number of msclassfcatons. As can be seen n Fgure 2b, groups wth fewer patents are those wth a greater error rate. Once the valdty of the method of fltraton for selectng the most mportant varables for classfcaton s verfed, the next step n the evaluaton was to assess the functonng of the classfcaton process. The system was tested wth 5 new patents. The patents were assgned to the expected groups. Only one of the patents was msclassfed, beng assgned to an erroneous group. The

9 patent msclassfed belongs to the ALL class, whle the others ndvduals belong to the CML and CLL classes. Fg. 2. Classfcaton errors (a) numercal (b) percentage The fnal classfcaton was compared wth the data obtaned usng a dendogram [24] and PAM [23] (Parttonng Around Medods). The proporton of errors n every group was calculated and the Kurskal-Walls [22] test was appled to determnate f the medan of these proportons was equal. The results are shown n table. Table. Comparson of methods. * dfferent medan and = equal, (-) medan of column less than medan of row CBR Dendogram PAM CBR Dendogram *(-) PAM *(-) *(-) One of the great contrbutons of the model presented s the ablty to work wth data from Exon arrays because of ts great capacty for the selecton of sgnfcant varables. The proposed system allows the reducton of the dmensonalty based on the flterng of genes wth lttle varablty and those that do not allow a separaton of ndvduals due to the dstrbuton of data. It also presents a technque for clusterng based on the use of neural networks ESOINN. The results obtaned from emprcal studes are provded wth a tool that allows both the detecton of genes and those varables that are most mportant for the detecton of pathology, and the facltaton of a classfcaton and relable dagnoss, as shown by the results presented n ths paper. References. Shortlffe, E.H., Cmno, J.J.: Bomedcal Informatcs: Computer Applcatons n Health Care and Bomedcne, Sprnger, (2006). 2. Tsoka, S., Ouzouns C.: Recent developments and future drectons n computatonal genomcs. FEBS Letters, Vol. 480 (), (2000) Lander E.S. et al.: Intal sequencng and analyss of the human genome. Nature (200) 409, Rubntz, J.E., Hya, N., Zhou, Y., Hancock, M.L., Rvera,G.K., Pu C.: Lack of beneft of early detecton of relapse after completon of therapy for acute lymphoblastc leukema. Pedatrc Blood & Cancer, Vol. 44 (2), (2005). 38-4

10 5. Armstrong, N.J., van de, Wel M.A.: Mcroarray data analyss: From hypotheses to conclusons usng gene expresson data. Cellular Oncology, Vol. 26 (5-6), (2004) Quackenbush, J.: Computatonal analyss of mcroarray data. Nature Revew Genetcs, Vol. 2(6), (200) Affymetrx, GeneChp Human Exon.0 ST Array, 8. SEER (Survellance Epdemology and End Results), U.S. Natonal Cancer Insttute (2007) 9. Kolodner, J.: Case-Based Reasonng. Morgan Kaufmann (993). 0. Irzarry, R.A., Hobbs,B., Colln,F., Beazer-Barclay,Y.D., Antonells,K.J., Scherf,U. Speed,T.P.: Exploraton, Normalzaton, and Summares of Hgh densty Olgonucleotde Array Probe Level Data. Bostatstcs, Vol. 4, (2003) Kohonen, T.: Self-organzed formaton of topologcally correct feature maps. Bologcal Cybernetcs, (982) Frtzke, B.: A growng neural gas network learns topologes. Cambrdge MA: G. Tesauro, D. S. Touretzky, and T. K. Leen,, Advances n Neural Informaton Processng Systems 7, (995) Martnetz, T.: Compettve Hebban learnng rule forms perfectly topology preservng maps. Amsterdam : Sprnger,. ICANN'93: Internatonal Conference on Artfcal Neural Networks, (993) Martnetz, T., Schulten, K.: A neural-gas network learns topologes. T Kohonen, et al. Amsterdam: Artfcal Neural Networks, (99) Shen, F.: An algorthm for ncremental unsupervsed learnng and topology representaton. Tokyo: Ph.D. thess. Tokyo Insttute of Technology, (2006). 6. Furao, S, Ogura, T, Hasegawa, O.: An enhanced self-organzng ncremental neural network for onlne unsupervsed learnng. Neural Networks, Vol. 20, (2007), Corchado, J. M., Bao, J., De Paz Y., De Paz J. F. Integratng Case Plannng and RPTW Neuronal Networks to Construct an Intellgent Envronment for Health Care. Internatonal Journal of Computatonal Intellgence n Bonformatcs and System Bology, In Press. 8. Brunell, R.: Hstogram Analyss for Image Retreval. Pattern Recognton, Vol. 34, (200) Garepy, R., Pepe, W.D.: On the Level sets of a Dstance Functon n a Mnkowsk Space. Proceedngs of the Amercan Mathematcal Socety, Vol. 3 (), (972), Corchado, J.M., Corchado, E.S., Aken, J., Fyfe, C., Fdez-Rverola, F., Glez-Beda, M.: Maxmum Lkelhood Hebban Learnng Based Retreval Method for CBR Systems. In Proceedngs of the 5th Internatonal Conference on Case-Based Reasonng, (2003) Rverola, F., Díaz F., Corchado, J. M.: Gene-CBR: a case-based reasonng tool for cancer dagnoss usng mcroarray datasets. Computatonal Intellgence,. Vol. 22 (3-4), (2006), Kruskal, W, Walls, W.: Use of ranks n one-crteron varance analyss. Journal of Amercan Statstcs Assocaton (952) 23. Kaufman, L., Rousseeuw, P.J.: Fndng Groups n Data: An Introducton to Cluster Analyss. Wley, New York. (990). 24. Satou, N., Ne, M.: The neghbor-onng method: A new method for reconstructng phylogenetc trees. Mol. Bol. Vol. 4, (987) Arshad, N., Jursca, I.: Data mnng for case-based reasonng n hgh-dmensonal bologcal domans. Knowledge and Data Engneerng, IEEE Transactons. Vol. 7 (8), (2005) Sultan M., Wgle D., Cumbaa C.A., Mazarz M., Glasgow J., Tsao M., Jursca I. Bnary tree-structured vector quantzaton approach to clusterng and vsualzng mcroarray data. ISMB (2002) -9