Development of Automatic Keyword Extraction System from Digitally Accumulated Newspaper Articles on Disasters


2nd International Conference on Urban Disaster Reduction, November 27-29, 2007

Shosuke SATO 1*, Haruo HAYASHI 2, Norio MAKI 2, Munenari INOGUCHI 1

1 Graduate School of Informatics, Kyoto University, Uji, Kyoto, JAPAN
2 Disaster Prevention Research Institute, Kyoto University, Uji, Kyoto, JAPAN

Abstract

Newspaper articles reporting disasters are basic data for studying the social aspects of disasters, and they have become readily available in digital format through the Internet. In this study, we developed a method to extract keywords automatically from newspaper articles covering disasters. In our algorithm, newspaper articles are first chopped down into morphemes by ChaSen, a widely used morphological analysis package for Japanese text. Only those morphemes representing content words are then weighted by TFIDF (Term Frequency / Inverse Document Frequency), a popular keyword extraction index. Since the number of newspaper articles covering a disaster grows over time, we extended the TFIDF algorithm so that newly arrived news articles can be compared with the existing archive of articles on the same issue. The validity of the proposed keyword extraction method was investigated by applying it to an archive of 2,623 web news articles covering the 2004 Niigata-ken-Chuetsu Earthquake in Japan. The automatically extracted keywords for the first 100 hours after the earthquake consisted mainly of life saving and emergency response activities, those between 100 hours and 1,000 hours featured disaster relief activities, and those after 1,000 hours were mainly on recovery processes. These results clearly indicate that this method can be helpful for emergency responders who need to obtain disaster intelligence from a huge pile of text data as soon as possible.
Keywords: keyword extraction, disaster intelligence, open-source intelligence, text mining, the 2004 Niigata-ken-Chuetsu Earthquake disaster

* Corresponding author: Shosuke SATO, Graduate School of Informatics, Kyoto University, Gokasho, Uji, Kyoto, Japan, show@drs.dpri.kyoto-u.ac.jp

1. Introduction

Newspaper articles reporting disasters are basic data for studying the social aspects of disasters. Public information such as news is called OSINT (open-source intelligence), an information source that intelligence agencies around the world treat as important. Newspaper articles have become readily available in digital format through the Internet, where they continuously report hazard, damage, and response situations in the form of web news. Because web news is a network medium, disasters occurring anywhere in the country can be observed. Moreover, because it is a digital medium, it is easy to store in a database and to process, and it is used both to grasp the current situation of a disaster and to learn from past disasters. However, the amount of web news published after a disaster is huge, so reading all of it would require a great deal of time and labor. The goal of our research is to build a system which automatically and objectively extracts keywords from digital documents. The developed system aims to give a general view of the salient features of a huge amount of disaster web news articles by means of these keywords.

2. Sample

As a sample to which the keyword extraction method would be applied, web news reporting the 2004 Niigata-ken-Chuetsu Earthquake disaster was archived. Since it was a major disaster which occurred in Japan after the Internet had spread, web news focused on this disaster is a suitable experimental sample. The web news was delivered continuously over a long period of time, so we expected that many articles could be collected and analyzed. Web news articles relevant to the Niigata-ken-Chuetsu Earthquake disaster, delivered on the news pages of a portal site [1] after October 23, 2004, were collected. The collection period was the six months after the occurrence of the earthquake, yielding a total of 2,623 web news articles.
For each collected article, the time and date of update, publisher name (news provider name), title, and article text were stored in a database.

3. Developing the Keyword Extraction Method

The keyword extraction method presented in this paper is composed of existing techniques for general keyword extraction (morphological analysis and TFIDF) and a new term-weighting method that adds the concept of time. The problem is that a disaster never ends at a fixed point in time, so the number of disaster web news articles in the corpus increases continuously. Existing methods cannot extract keywords from such a corpus. For this reason, we developed a new method that can be applied to a corpus which grows over time.

3.1 Morphological Analysis

Morphological analysis is the method generally used to chop Japanese text down into morphemes. The sentence must first be divided into terms (morphemes) because Japanese text contains no spaces between words. Morphological analysis is a basic technology of natural language processing which divides a sentence written in natural language into units called morphemes and attaches part-of-speech information to each morpheme. A morpheme is the minimum character string that has a grammatical meaning. Applying morphological analysis with ChaSen [2] (morphological analysis software for Japanese text) to the collected web news yielded 15,211 kinds of morphemes (623,765 morphemes in total). Not all of the morphemes obtained by morphological analysis are suitable as keywords; particles, pronouns, and symbols, for example, carry no content meaning. Such terms are called stop words in the natural language processing field. It is appropriate to restrict keyword candidates to general nouns, verbs, adjectives, and adverbs. When stop words were removed, the 15,211 kinds of morphemes decreased to 14,19 kinds.

3.2 TFIDF Containing a Time Parameter

In this section, an index which weights keyword candidates is proposed for the purpose of choosing keywords. The TFIDF (Term Frequency / Inverse Document Frequency) index is often used to weight terms for keyword extraction. This weight is a statistical measure of how important a term is to a document in a collection or corpus.
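The part-of-speech filtering of Section 3.1 can be sketched as follows. This is a minimal illustration of ours, assuming the morphological analyzer (ChaSen in this paper) has already produced (morpheme, part-of-speech) pairs; the English POS labels and the helper name are illustrative stand-ins, not ChaSen's actual tag set.

```python
# Minimal sketch of stop-word removal (Section 3.1), assuming the
# morphological analyzer has already emitted (surface, part-of-speech) pairs.
# POS labels here are illustrative English stand-ins for ChaSen's tags.

CONTENT_POS = {"general noun", "verb", "adjective", "adverb"}

def remove_stop_words(morphemes):
    """Keep only morphemes whose part of speech can serve as a keyword candidate."""
    return [surface for surface, pos in morphemes if pos in CONTENT_POS]

morphemes = [
    ("earthquake", "general noun"),
    ("ga", "particle"),        # stop word: particle
    ("occur", "verb"),
    ("it", "pronoun"),         # stop word: pronoun
    (".", "symbol"),           # stop word: symbol
]
print(remove_stop_words(morphemes))  # ['earthquake', 'occur']
```

In practice the whitelist of retained parts of speech would be expressed with the analyzer's own tag names rather than these English placeholders.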
TFIDF is calculated as the product of TF (Term Frequency) and IDF (Inverse Document Frequency) as follows:

TFIDF(t_i, d_j) = TF(t_i, d_j) * IDF(t_i)   (1)

IDF(t_i) = log10( N / DF(t_i) )   (2)

where:
t_i: a term for which TFIDF(t_i, d_j) is calculated; i is a subscript serving as an identifier.
d_j: a document containing t_i; a book, paper, chapter, or paragraph, for example. In this research, it is a web news article; j is a subscript serving as an identifier.
N: the number of documents.

TF(t_i, d_j): term frequency, i.e., the number of times t_i appears in the given document d_j.
DF(t_i): the number of documents in which t_i appears.
IDF(t_i): the logarithm of the reciprocal of the document frequency ratio DF(t_i)/N.

Since N and DF(t_i) are constants, TFIDF(t_i, d_j) cannot weight terms in a corpus which grows as time progresses. We therefore modified the TFIDF index so that the weight can be recalculated whenever a document is added to the corpus:

Chronologically Incremental TFIDF(t_i, d_j) = TF(t_i, d_j) * IDF(t_i, d_j)   (3)

IDF(t_i, d_j) = log10( N_j / DF(t_i, d_j) )   (4)

where:
DF(t_i, d_j): the number of documents containing t_i at the time d_j appears.
N_j: the number of documents at the time d_j appears.

3.3 Residual Analysis

In this section, a technique for choosing unique terms from the latest web news articles was developed, based on comparing the terms newly added in recent articles with the terms in articles delivered earlier. The relationship between the summation of TF (ΣTF, X) and the summation of TFIDF (ΣTFIDF, Y) gave a satisfactory fit to the formula Y = aX^b (Fig. 1). The r-squared values of the fitting curves, computed from the calculated TF and TFIDF results after the first hours, fell in the range 0.9 to 0.99. Terms plotted near the curve are likely to follow an average appearance pattern. In contrast, terms plotted far from the curve are likely to be characteristic terms and can become keywords. We therefore expected that keyword candidates could be extracted by calculating the difference between the measured ΣTFIDF and the value estimated from the fitting curve, i.e., by residual analysis. Fig. 2 shows the result of residual analysis using the fitting curve made after 1,000 hours (t = 1,000 hours) against the measured data. Terms such as temporary house, housing damage assessment, and fund-raising indicated high residual values.
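The Chronologically Incremental TFIDF of Section 3.2 can be sketched as follows. This is a minimal illustration under our own naming, assuming each document is already reduced to a list of filtered terms and that documents arrive in chronological order; it is not the authors' implementation.

```python
import math
from collections import Counter

def chronological_incremental_tfidf(docs):
    """For each document d_j (arriving in order), weight each term t_i by
    TF(t_i, d_j) * log10(N_j / DF(t_i, d_j)), where N_j and DF(t_i, d_j) are
    counted over only the documents seen up to and including d_j."""
    df = Counter()   # DF(t_i, d_j): documents seen so far that contain t_i
    n = 0            # N_j: documents seen so far
    weights = []
    for doc in docs:
        n += 1
        for term in set(doc):
            df[term] += 1
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights

docs = [
    ["earthquake", "rescue", "rescue"],
    ["earthquake", "volunteer"],
    ["volunteer", "housing", "housing"],
]
w = chronological_incremental_tfidf(docs)
# "earthquake" has appeared in every document seen so far, so its IDF is 0:
print(w[1]["earthquake"])  # 0.0
```

Because the counts are cumulative, a term that keeps appearing in every new article drifts toward zero weight, which is exactly the behavior the residual analysis later exploits to separate ubiquitous terms from unique ones.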
Since that time was winter (December 4, 2004), it corresponds to the period when the affected area, one of the heaviest snowfall areas in Japan, was hit by snow coverage. It is also the time when the move towards recovery was actively pursued. In addition, some terms showed high negative residual values (Fig. 2); these are terms used frequently because of Japanese lexical characteristics, while niigata and chuetsu are part of the disaster's formal nomenclature. Therefore, a term with a high positive residual value is probably unique to that time, and a term with a high negative residual value is probably ubiquitous in the corpus. To compare the terms contained in the latest web news articles with the terms stored before, we decided to use the fitting curve at time t-Δt for the residual analysis, and the ΣTF and ΣTFIDF dataset at time t as the measured values (t is any time, Δt is any time period).

[Fig. 1 Relationship between ΣTF and ΣChronologically Incremental TFIDF: the curve fitted at 1,000 hours is Y = 3.85X^0.77 with R^2 = 0.98]

[Fig. 2 Result of residual analysis based on the relationship between ΣTF and ΣChronologically Incremental TFIDF at t = 1,000 hours: assessment, support, fund-raise, and elementary school student show high positive residuals; niigata and chuetsu show high negative residuals]

The flow of keyword extraction is as follows: 1) chop web news articles down into morphemes (terms) by morphological analysis, 2) calculate the TF and time-parameterized TFIDF indices to weight each term, and 3) apply residual analysis using the curve fitted to the ΣTF-ΣTFIDF relationship at time t-Δt and the ΣTF and ΣTFIDF dataset at time t. As a result, terms with high positive residual values are obtained as keywords unique to Δt, and terms with high negative residual values are obtained as keywords ubiquitous in the web news archive.

4. Application and Results

The result of applying the keyword extraction method to the archive of 2,623 web news articles covering the 2004 Niigata-ken-Chuetsu Earthquake disaster is shown in Fig. 3. Δt was set to 1 hour, 10 hours, 100 hours, and 1,000 hours for the periods 1-10 hours, 10-100 hours, 100-1,000 hours, and after 1,000 hours, respectively. In the first 10 hours, the terms that indicated high residual values were gas, telephone, fault, and death.
From 10 to 100 hours, rail appears because the bullet train derailed, along with gal and fault. From 100 to 1,000 hours, volunteer, temporary housing, and so on appear. After 1,000 hours, the unique keywords include ski and so on.
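The residual analysis at the heart of step 3 can be sketched as follows. This is a minimal illustration under assumptions of ours: we fit Y = aX^b by ordinary least squares in log-log space and rank terms by their residuals; the paper does not specify its fitting procedure, and the data below are invented for illustration.

```python
import math

def fit_power_law(points):
    """Fit Y = a * X^b by least squares on (log10 X, log10 Y).
    points: list of (x, y) pairs with x > 0, y > 0."""
    logs = [(math.log10(x), math.log10(y)) for x, y in points]
    n = len(logs)
    mx = sum(lx for lx, _ in logs) / n
    my = sum(ly for _, ly in logs) / n
    b = (sum((lx - mx) * (ly - my) for lx, ly in logs)
         / sum((lx - mx) ** 2 for lx, _ in logs))
    a = 10 ** (my - b * mx)
    return a, b

def residuals(terms, a, b):
    """Residual = measured ΣTFIDF minus the value the fitted curve predicts
    from ΣTF. High positive -> unique term; high negative -> ubiquitous term.
    terms: {term: (sum_tf, sum_tfidf)} measured at time t; curve fitted at t - Δt."""
    return {t: y - a * x ** b for t, (x, y) in terms.items()}

# Points lying exactly on Y = 2 * X^0.5 recover a = 2, b = 0.5 ...
curve_points = [(1, 2), (4, 4), (9, 6), (16, 8), (25, 10)]
a, b = fit_power_law(curve_points)

# ... so a term well above the curve gets a large positive residual (unique),
# and a term well below it gets a negative residual (ubiquitous).
measured = {"temporary house": (9, 20), "niigata": (25, 3)}
res = residuals(measured, a, b)
print(res["temporary house"] > 0, res["niigata"] < 0)  # True True
```

Fitting in log-log space turns the power law into a straight line, which keeps the estimate simple and stable even though ΣTF spans several orders of magnitude.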

[Fig. 3 Result of applying the developed keyword extraction method to the 2004 Niigata-ken-Chuetsu disaster web news archive: unique terms (e.g. gas, telephone, seismic intensity, fault, rail, volunteer, assessment) and ubiquitous terms (e.g. niigata, chuetsu) plotted by residual value against elapsed time after the earthquake]

Choosing the extracted keywords ranked 1st-10th at each time, the residual value trends of the terms related to three kinds of disaster response objectives (saving life, recovery of flow, and reconstruction of stock), based on a theoretical model of the disaster process [3][4], were plotted (Fig. 4). In the first 100 hours, most of the reported activities were search-and-rescue type operations, so keywords related to search and rescue are very characteristic of that particular period. The next phase, between 100 hours and 1,000 hours, can be characterized as restoration of lifeline systems, so keywords related to the restoration of daily life activities, including transportation services, are dominant. The period after 1,000 hours is heavily concerned with recovery processes; victims who lost their homes needed to find temporary housing or to move, and such keywords are very characteristic of that particular period. Thus, keywords closely reflecting the activities that are most salient in each period can be extracted.
These results demonstrate the validity of the proposed keyword extraction method.

[Fig. 4 Unique keywords with high positive residual values related to 3 kinds of disaster response objectives: Saving Life (telephone, death, dispatching, life-or-death), Recovery of Flow (volunteer, IC, rail, tunnel), and Reconstruction of Stock (assessment, support, change of the house), plotted by residual value against elapsed time after the earthquake]

These results clearly indicate that this method can be helpful for emergency responders who need to obtain disaster intelligence from a huge pile of text data as soon as possible.

References

[1] Yahoo! JAPAN (News contents)
[2] Yuji MATSUMOTO, Morphological Analysis System ChaSen, Information Processing, Vol. 41, No. 11, (2000)
[3] Haruo HAYASHI, Initiative Citizen Principle, Koyo Shobo, (2001)
[4] Haruo HAYASHI, Earthquake Disaster Reduction for Saving Life, Iwanami Shoten, (2003)