Indexing and Retrieval of Degraded Handwritten Medical Forms


Huaigu Cao, Faisal Farooq and Venu Govindaraju
Center for Unified Biometrics and Sensors (CUBS)
Dept. of Computer Science and Engineering
University at Buffalo, Amherst, NY

Abstract

The tasks of indexing and retrieval are particularly challenging for the erroneous output of handwriting recognition (HR) systems. This paper proposes an approach to indexing and retrieving degraded documents with very low recognition rates. We present a modified version of the popular Vector Model in information retrieval (IR). Our model incorporates the top n candidates from an HR system into the scheme for calculating the term frequency (tf) and the inverse document frequency (idf). Standardized IR tests show that the proposed approach outperforms the retrieval of ordinary HR text in terms of mean average precision (MAP) and R-Precision.

1 Introduction

Decades of development in optical character recognition (OCR) techniques have made it possible to convert great volumes of documents into digital form such as text or structured forms of PDF and XML. In recent years, advances in information retrieval have provided powerful tools for indexing and searching large-scale online document databases. Although OCR has been successful in machine-printed document recognition and in handwriting recognition (HR) with small lexicons, unconstrained handwriting recognition with a large lexicon in general still has a very high error rate. Researchers [Croft et al., 1994; Beitzel et al., 2003] have shown that the performance of OCR text retrieval is badly affected when dealing with short or low-quality documents.

Our work is motivated by the objective of building a system for recognizing and retrieving New York State Pre-hospital Care Reports (PCR forms). In New York State, all patients who enter the Emergency Medical System (EMS) are tracked through their pre-hospital care to the emergency room using the PCR.
The PCR is used to gather vital patient information and serves several purposes: (i) it is used by the emergency room staff in providing appropriate care, (ii) it facilitates record keeping about the patient and the care provided, and (iii) it allows large regional EMS (such as WREMS, the Western Regional Emergency Medical Systems) to gain trend knowledge through macro analysis. Our goal is to automate the collection of data from the PCR and enable efficient maintenance and dissemination of information.

The task is quite challenging for several reasons: (i) handwritten responses are very loosely constrained in terms of writing style, format of response, and choice of text, (ii) medical lexicons are very large (about 5, entries), and (iii) forms are often very noisy due to irrepressible emergency situations. This makes the automatic transcription of the forms very difficult. Our dataset consists of over 6 scanned PCR forms containing medicine and healthcare information. Each PCR contains about 1 handwritten words on average, so the content is very short. The word recognition rate of the forms using the Word Model Recognizer (WMR) [Kim and Govindaraju, 1997] is as low as around 2%, as reported in [Milewski and Govindaraju, 2006].

The problem of indexing and retrieving documents with low recognition rates has recently been addressed by researchers. In [Mittendorf and Schauble, 1996; Ohta et al., 1997; Jing, 2002], different approaches to modeling typical recognition errors were proposed. In [Mittendorf and Schauble, 1996], a probabilistic model for common recognition errors was proposed and used to design the term-weighting scheme of information retrieval. An approach that generates candidate terms for each true search term and adds the retrieval results of the candidate terms to the final result was studied in [Ohta et al., 1997]. In [Jing, 2002], a language model that took common recognition errors into account was built.
This language model can then be used to approximate an uncorrupted version of a particular document, and it can be used for retrieval in a language modeling approach.

Due to low recognition accuracies, the task of retrieval and indexing has been elusive. Keyword spotting has been proposed as an alternative approach to indexing and retrieving handwritten documents [Manmatha and Croft, 1997; Rath and Manmatha, 2006; Zhang et al., 2004]. The idea is to search the document for a certain keyword by feature matching instead of recognition. Rath and Manmatha developed a word matching algorithm that compares word images using Dynamic Time Warping (DTW) [Rath et al., 2004]. They address a single-author problem by matching word images with each other to create equivalence classes. Each equivalence class consists of multiple instances of the same word.

We propose an approach based on building an index on the text of the top n (n > 1) word recognition candidates rather than on the ordinary HR text containing only the top-choice candidate. For retrieval, we introduce a variant of the popular Vector Model that uses a global probabilistic model of word recognition ranks to calculate TF and IDF. IR tests show that the proposed approach outperforms the standard Vector Model on OCR text, raising the Mean Average Precision (MAP) from 10.6% to 18.1% and the R-Precision from 10.0% to 22.1%.

The following sections are organized as follows. The Top-n-Candidate Vector Model is introduced in Section 2. Results of IR tests are presented and discussed in Section 3. Finally, Section 4 concludes the paper.

2 Multiple-candidate Vector Model

2.1 Benefit from multiple candidates

The existence of errors in HR text decreases the performance of IR: a query word present at a certain position of the document might, due to incorrect recognition, be absent from the recognized text, yet the count of occurrences of the query word is essential to the IR algorithm. On the one hand, as the results in Table 1 suggest, using the top n (n > 1) candidates in terms of word recognition ranks will improve the count of occurrences, because the chance of the correct choice being among the top n (n > 1) is significantly greater.

Table 1: Recognition rates of PCR forms for different n. Columns: n, Recognition rate (%).

On the other hand, using too many candidates, or many candidates with equal weights, introduces more false occurrences of terms, thus affecting precision adversely.

2.2 A Vector Model for multiple word candidates

Suppose WR is a word recognizer, I is a collection of document images, and O is the output of WR.
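As a minimal sketch of this setup, the recognizer output O can be pictured as one ranked candidate list per word image. The snippet below also computes the kind of top-n recognition rate that Table 1 reports. The names `RecognizedWord` and `top_n_rate` and the toy data are hypothetical and introduced only for illustration; they do not come from the paper or from any recognizer's API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RecognizedWord:
    """One word image: recognizer candidates in rank order, plus ground truth."""
    candidates: Sequence[str]  # candidates[0] is the top choice (rank 1)
    truth: str                 # available only for annotated test data


def top_n_rate(words: Sequence[RecognizedWord], n: int) -> float:
    """Fraction of word images whose true word is among the top n candidates."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.truth in w.candidates[:n])
    return hits / len(words)


# Toy data, invented for illustration only (not taken from the PCR corpus).
sample = [
    RecognizedWord(("pain", "pair", "rain"), "pain"),      # correct at rank 1
    RecognizedWord(("chest", "chart", "cheat"), "chart"),  # correct at rank 2
    RecognizedWord(("fever", "fewer", "never"), "never"),  # correct at rank 3
    RecognizedWord(("pulse", "purse", "false"), "nurse"),  # never correct
]
rates = {n: top_n_rate(sample, n) for n in (1, 2, 3)}
```

On this toy data the rate grows with n, which is the benefit described in Section 2.1. Every extra candidate also adds words that were never written, which is the precision cost noted there.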
In order to retrieve multiple-candidate recognized text using the Vector Model, we first need to define the formula for calculating the term frequency for WR or, more generally, for any OCR system. We denote it by OCR-tf. Here we assume unigrams and ignore the effects of stemming. The OCR-tf of a given term (word) $t_i$ in a document $d_j$ is defined as the expectation of the tf in the ground truth of document $d_j$:

$$
\begin{aligned}
tf^{ocr}_{i,j} &= E\{tf_{i,j} \mid O\} \\
&= E\left\{ \frac{freq_{i,j}}{\sum_l freq_{l,j}} \,\middle|\, O \right\} \\
&= \frac{E\{freq_{i,j} \mid O\}}{E\{\sum_l freq_{l,j} \mid O\}}
\end{aligned}
\tag{1}
$$

where $freq_{i,j}$ is the raw frequency of term $t_i$ in document $d_j$ (i.e., the number of times term $t_i$ is mentioned in the text of document $d_j$) and $\sum_l freq_{l,j}$ is a normalization factor that stands for the total number of terms in document $d_j$. In the second and third lines of Equation 1, $\sum_l freq_{l,j}$ is assumed to be independent of $tf_{i,j}$, so

$$
E\{freq_{i,j} \mid O\} = E\Big\{ tf_{i,j} \sum_l freq_{l,j} \,\Big|\, O \Big\} = E\{tf_{i,j} \mid O\}\; E\Big\{ \sum_l freq_{l,j} \,\Big|\, O \Big\}.
$$

Denote the set of words (or terms, for simplicity) in a document $d$ by $w(d)$; then

$$
E\{freq_{i,j} \mid O\} = \sum_{w \in w(d_j)} P(w = t_i). \tag{2}
$$

Therefore,

$$
tf^{ocr}_{i,j} = E\{tf_{i,j} \mid O\} = \frac{\sum_{w \in w(d_j)} P(w = t_i)}{\sum_{w \in w(d_j)} \sum_l P(w = t_l)}. \tag{3}
$$

Equation 3 is an analytical expression of $tf^{ocr}_{i,j}$. Given the probabilities $P(w = t_i)$ for every $t_i$, we can calculate the OCR-tf. However, most word recognizers do not (and are not necessarily required to) produce the posterior probability for each class. The only information readily available from the recognition output is the rank of each candidate word. The rank of a candidate is correlated with its posterior probability, so the probability can be estimated when only the rank is known. A technique that maps ranks to a normalized probability distribution by fitting a Zipfian distribution was presented in [Howe et al., 2005]. We present a novel method based on the following motivation.

Given an image $\eta$ of word $w$, the recognizer lexicon $L$, and a sequence of entries in $L$: $w_1, w_2, \ldots, w_n$ representing the recognition result, where $L = \{w_1, w_2, \ldots, w_n\}$ and the rank of term $w_i$ w.r.t. word image $\eta$ is $rank_\eta(w_i) = i$ $(1 \le i \le n)$, we want to calculate $\Pr(w = w_i \mid w \in L)$, denoted by $P(i)$, for each $i$. To simplify the analysis, assume that for any large lexicon $L$ the top-1 probability is intrinsic to the recognizer and is therefore a constant $p$. Define a sequence of nested lexicons

$$
L^{(1)} = L, \qquad L^{(k)} = L^{(k-1)} \setminus \{w_{k-1}\}, \quad 1 < k \le n;
$$

then we have

$$
\begin{aligned}
P(1) &= p \\
P(2) &= P(w \ne w_1 \mid w \in L)\, P(w = w_2 \mid w \in L^{(2)}) = (1-p)\,p \\
P(3) &= P(w \ne w_1 \wedge w \ne w_2 \mid w \in L)\, P(w = w_3 \mid w \in L^{(3)}) = (1-p)^2\,p \\
&\;\;\vdots
\end{aligned}
$$

and finally

$$
P(i) = (1-p)^{i-1}\,p, \quad i \ll n.
$$

In other words, $P(w = w_i \mid w \in L)$ is an exponential function of $i$ in this simplified scenario. To be more general, suppose $P(i) = e^{\theta(i)}$, where $\theta(i)$ is a possibly non-linear function of $i$. In our test, $\theta(i)$ is assumed to be a third-degree polynomial of $i$ and is estimated by fitting a polynomial to the logarithms of the normalized word recognition rate samples of each rank $i$. The word recognition rate sample at rank $i$ is calculated by counting the total number of true matches of rank-$i$ candidates and dividing by the total number of words recognized during the test. Figure 1 shows the word recognition rate samples of ranks 1 to 1 calculated from the recognition results of PCR forms and the estimated curve $P(i) = e^{\theta(i)}$.

Figure 1: Top-1 recognition rates and the estimated posterior probability curve $P(w = w_i \mid w \in L) = e^{\theta(i)}$. (Axes: rank vs. probability; legend: estimated prob-rank curve, prob-rank samples.)

While calculating the OCR-tf using Equation 3 with estimated probabilities, any probability below a certain threshold is omitted in order to suppress the number of false occurrences of terms. In our tests, any $P(i)$ smaller than $c_t P(1)$ is considered to be zero and hence omitted, where $c_t$ is a truncation parameter and takes the value of . When calculating the IDF of a term, only those documents that include the term recognized with $P(i) \ge c_t P(1)$ are taken into account. In other words, the IDF of a given term $t_i$, $idf^{ocr}_i$, is calculated by

$$
idf^{ocr}_i = \log \frac{|\{d_j\}|}{|\{d_j \mid \exists\, \eta \in d_j \ \text{s.t.}\ P(rank_\eta(t_i)) \ge c_t P(1)\}|}
$$

where $|\{d_j\}|$ denotes the total number of documents, $\eta$ denotes a word image, and $rank_\eta(t_i)$ denotes the word recognition rank of term $t_i$ w.r.t. word image $\eta$.

3 Test results and discussions

Our corpus contains 342 PCR forms with ground truth and the coordinates of each word. The test bed involves 21 queries and the annotation of relevance of the 342 forms to these queries. An example of the handwritten region in a PCR form is shown in Figure 2. Of all the 21 queries, thirteen contain 1 term each, six contain 2 terms each, and one contains 3 terms. The total number of relevant documents (with duplicates) is 379.

Figure 2: An example of NYS Pre-hospital Care Report form.

All the forms are first binarized and have their lines removed using the algorithm proposed in [Milewski and Govindaraju, 2006]. Then all the word images are extracted and fed into the word recognizer of [Kim and Govindaraju, 1997] with a lexicon of 445 English words. The top 1, 2, 5, and 1 word recognition rates are shown in Table 1. IR tests of all the 21 queries are performed on both the ordinary recognition text (with top-one output) and the recognition output of multiple candidates. The Vector Model with the standard tf-idf scheme is applied to the ordinary recognition text, and the Vector Model with the modified tf-idf scheme is applied to the multiple-candidate recognition text. All the IR tests are implemented on the basis of the Lemur Toolkit, a benchmark IR system designed by Carnegie Mellon University and the University of Massachusetts, Amherst. In order to get better performance, pseudo-feedback using the top 3 documents and a maximum of 1 terms is applied. For the purpose of comparison, the following naive tf-idf scheme is also tested:

$$
P(i) = \begin{cases} \dfrac{1}{S}, & 1 \le i \le S \\ 0, & \text{otherwise} \end{cases}
$$

where $S$ is a truncation threshold.

The interpolated average precisions at 11 different recall values are compared in Figures 3 and 4. The Mean Average Precisions (MAP) and R-Precisions obtained with ordinary recognition text, multi-candidate text with the naive tf-idf scheme, and the proposed tf-idf scheme are compared in Table 2. The performances shown in Figures 3 and 4 and Table 2 indicate that the proposed method outperforms the Vector Models using both the standard and the naive tf-idf schemes. A gain of 7.5% in MAP and 12.1% in R-Precision is obtained compared with the standard scheme for ordinary recognition text. We also learn from Table 2 that the scheme of exponential probability is better than the scheme of assigning equal weight to each candidate.

Figure 3: Performances without pseudo relevance feedback. (Axes: recall vs. average precision; curves: top n, top 1.)

Figure 4: Performances with pseudo relevance feedback. (Axes: recall vs. average precision; curves: top n, top 1.)

The effect of choosing different values of $c_t$ is tested and the result is shown in Figure 5. We can see that the best performance is obtained when $c_t \in \{ , .75\}$. A $c_t$ value as large as .2 or .3 does not involve enough top word candidates, whereas too small a $c_t$ value seems to involve too many irrelevant candidates, which brings about poor performance.

Figure 5: MAP values using different values of $c_t$. $c_t \in \{.2, .5, .75, , .2, .3\}$ are tested. (Axes: $c_t$ vs. MAP.)
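The weighting scheme of Section 2.2 can be sketched in a few lines of code. This is an illustration under stated assumptions, not the authors' implementation (which was built on the Lemur Toolkit). It uses the simplified geometric model P(i) = (1 - p)^(i - 1) p rather than the fitted third-degree polynomial. The values p = 0.5 and c_t = 0.3, the function names, and the toy documents are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

# A document is a list of word images; each word image is the recognizer's
# candidate list in rank order (index 0 holds the rank-1 candidate).
Document = List[Sequence[str]]


def geometric_rank_probs(p: float, n: int) -> List[float]:
    """P(i) = (1 - p)^(i - 1) * p for ranks i = 1..n (simplified model)."""
    return [(1.0 - p) ** (i - 1) * p for i in range(1, n + 1)]


def truncate(probs: Sequence[float], c_t: float) -> List[float]:
    """Set every P(i) below c_t * P(1) to zero."""
    floor = c_t * probs[0]
    return [q if q >= floor else 0.0 for q in probs]


def ocr_tf(doc: Document, probs: Sequence[float]) -> Dict[str, float]:
    """Expected term counts divided by the expected document length (Eq. 3)."""
    expected: Dict[str, float] = defaultdict(float)
    for candidates in doc:
        for cand, q in zip(candidates, probs):
            expected[cand] += q
    total = sum(expected.values())
    if total == 0.0:
        return {}
    return {t: v / total for t, v in expected.items() if v > 0.0}


def ocr_idf(docs: Sequence[Document], probs: Sequence[float]) -> Dict[str, float]:
    """log(N / number of documents holding the term at a retained rank)."""
    df: Dict[str, int] = defaultdict(int)
    for doc in docs:
        seen = {c for cands in doc for c, q in zip(cands, probs) if q > 0.0}
        for term in seen:
            df[term] += 1
    return {t: math.log(len(docs) / k) for t, k in df.items()}


# Hypothetical parameter values and toy documents, for illustration only.
probs = geometric_rank_probs(p=0.5, n=3)
kept = truncate(probs, c_t=0.3)  # rank 3 falls below 0.3 * P(1) and is dropped
d1: Document = [("pain", "pair", "rain"), ("chest", "chart", "cheat")]
d2: Document = [("pair", "pain", "rain")]
tf1 = ocr_tf(d1, kept)
idf = ocr_idf([d1, d2], kept)
```

With these assumed values the rank-3 candidates are truncated, so "rain" contributes to neither the tf nor the idf. "pain" occurs at a retained rank in both documents and therefore has an idf of zero.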
4 Conclusion

This paper proposed a modified IR Vector Model for degraded documents with low recognition rates that incorporates the ranks of the top word recognition candidates into the calculation of tf-idf. The proposed approach improved IR performance compared with the method using the standard Vector Model. Future work will focus on two directions. First, we will define and test a more general model that takes segmentation errors into account. Second, we will use a language model such as n-grams, which may raise the problem of learning the language model from multiple word recognition candidates. In addition, we are still investigating methods of improving recognition performance on PCR forms.

Table 2: MAP values of retrievals using different tf-idf schemes (with feedback and $c_t$ = ). Columns: TF-IDF scheme (Standard, S=3, S=5, S=1, Proposed); rows: MAP (%), R-Precision (%).

References

[Beitzel et al., 2003] Steven M. Beitzel, Eric C. Jensen, and David A. Grossman. A survey of retrieval strategies for OCR text collections. In Proceedings of the Symposium on Document Image Understanding Technologies, Greenbelt, Maryland, April 2003.

[Croft et al., 1994] W. B. Croft, S. M. Harding, K. Taghva, and J. Borsack. An evaluation of information retrieval accuracy with simulated OCR output. In Proceedings of the Symposium on Document Analysis and Information Retrieval, 1994.

[Howe et al., 2005] Nicholas R. Howe, Toni M. Rath, and R. Manmatha. Boosted decision trees for word recognition in handwritten document retrieval. In Proceedings of the SIGIR, 2005.

[Jing, 2002] Hongyan Jing. Using hidden Markov modeling to decompose human-written summaries. Computational Linguistics, 28(4), 2002.

[Kim and Govindaraju, 1997] G. Kim and V. Govindaraju. A lexicon driven approach to handwritten word recognition for real-time applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, April 1997.

[Manmatha and Croft, 1997] R. Manmatha and W. B. Croft. Word spotting: Indexing handwritten manuscripts. In Intelligent Multi-media Information Retrieval Collection. AAAI Press, May 1997.

[Milewski and Govindaraju, 2006] Robert Milewski and Venu Govindaraju. Extraction of handwritten text from carbon copy medical form images. In Document Analysis Systems, 2006.

[Mittendorf and Schauble, 1996] E. Mittendorf and P. Schauble. Measuring the effects of data corruption on information retrieval. In Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, 1996.

[Ohta et al., 1997] M. Ohta, A. Takasu, and J. Adachi. Retrieval methods for English text with misrecognized OCR characters. In Proceedings of the International Conference on Document Analysis and Recognition, 1997.

[Rath and Manmatha, 2006] T. M. Rath and R. Manmatha. Word spotting for historical documents. International Journal on Document Analysis and Recognition, 2006.

[Rath et al., 2004] Toni M. Rath, R. Manmatha, and Victor Lavrenko. A search engine for historical manuscript images. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004.

[Zhang et al., 2004] B. Zhang, S. N. Srihari, and C. Huang. Word image retrieval using binary features. In Document Recognition and Retrieval XI, SPIE, 2004.