CHAPTER 23
VOICE RECOGNITION
MICHAEL J. MARDINI
AMIT MEHTA

Perhaps no new technology introduced into radiology has evoked more debate and raw emotion than the use of speech recognition technology for radiology reporting. As adoption of the technology has increased, the controversy has increased, with arguments both for and against its overall value to a radiology service. Among those that use the technology, approximately 20% strongly oppose it, 20% strongly favor it, and the remaining 60% fall somewhere in between. Regardless of who is right, there is no arguing the fact that successfully adopting speech technology has a positive effect on many of the areas in which a radiology service is measured. The reduced report turnaround time, decreased costs associated with producing a report, and overall decrease in confusion about report availability and distribution all support the adoption of this technology.

One of the largest issues has been the question of the effect on radiologists' productivity and workload. Does it slow the radiologist down to unacceptable production levels, as some claim, or is it just a matter of changing some work habits to allow the user to maintain and ultimately improve productivity, as others claim? Real results from sites using this technology appear to support the latter.

The underlying cause for successful adoption of this technology seems to be the belief that it is no longer enough just to interpret images and dictate findings. Now that we have the technology to do so, it is necessary to take the extra step to communicate all findings, not just critical ones, to clinicians and patients efficiently. With increased competition for business and the availability of images through Web distribution, it simply comes down to being clinically relevant to patient care. If the report is unavailable but the images are, the clinician down the street may feel that he can view and interpret them to his satisfaction and disregard the radiologist's report entirely. At what point does he start billing for the service?

HISTORY

The use of computers to recognize human speech has a long history within medicine. In the late 1980s, radiologists and other medical subspecialists started to use expensive dedicated hardware systems that employed specialized vocabulary to recognize reports dictated in a discrete speaking style. After several years of use, it was deemed that voice recognition technology was not mature enough to handle the high-volume demand for the transcription requirements of most radiology practices. The late 1980s and early 1990s saw limited usage and general nonacceptance of voice recognition technology for commercial applications. However, despite the lack of widespread use, development and distribution continued. By 1994, speech recognition systems in American English running on computers with increased processing power had progressed, and speech recognition engines had vastly improved. These developments led to accuracy rates that became acceptable for commercial applications, especially medicine.
It became apparent once development began that no specialty was better suited to voice recognition applications than radiology, for the following reasons:

- A strong need to reduce report turnaround time for improved service provided operational motivation.
- Financial pressures and a shortage of qualified medical transcriptionists presented a strong financial motivation.
- A vocabulary that was limited and predictable provided the ability to develop systems that were accurate enough for high-volume use.
- A defined number of stationary users provided an environment for a supportable application.
- High-volume, daily use created an opportunity for users to become proficient relatively quickly.

Currently, there are more than 900 radiology installations of various voice recognition systems throughout the United States and continued interest throughout the world. The technology has encountered both acceptance and rejection based on several factors. As the technology becomes more available as an option on picture archiving and communication system (PACS) and radiology information system (RIS) workstations, the barriers to adoption continue to recede.

ACCEPTANCE AND CONTINUED INTEREST

Speech recognition software packages offer many advantages that promote acceptance by radiologists. First, continuous speech recognition systems require limited amounts of learning and adaptation compared to other transcription systems and methods. The systems are designed to conform to people's most natural way of communicating and essentially do not require the user to alter this method short of speaking more clearly. Second, the ability of the software package to integrate almost seamlessly into existing radiology and hospital information workflows makes the transition easier. The biggest barriers are often resistance to change and fear of technology. As these are overcome, the benefits become more apparent and user acceptance increases. This, in turn, encourages developers to continue development.

Several factors drive interest in speech recognition. First, continued development coupled with increasing processing power leads to improved accuracy rates and the easier use of natural speech. Second, a shortage of medical transcriptionists is occurring in most medical markets. This forces healthcare institutions and practice groups to seek alternative strategies to foster growth and maintain services. Third, there is an immediate return on investment, with decreased operating costs and improved services. Fourth, when thoroughly analyzed, this technique does not require physicians to drastically change their practice in terms of transcription, a feature setting it far apart from competing technologies. Fifth, pressure to be clinically relevant by providing results quickly has increased dramatically as images become easily available through Web distribution. Finally, integration on the desktop with PACS and RIS systems to provide seamless workflow and improved reporting capabilities, such as the creation of multimedia and structured reports, will further attract attention.

RESISTANCE TO ADOPTION

Despite the bright future for this technology, several factors delay its widespread implementation. Physicians and people, in general, have an inherent resistance to change. As the next generation of medical practitioners who have been trained on computer systems begin to practice, we will see a change in the dissemination of new technologies.

Users who resist the technology claim that they are forced to do more work by editing their own reports. However, an analysis of the current dictation-to-transcriptionist method uncovers the same editing requirements, so why the complaints? The fact is that radiologists typically browse through a report returned from transcription 24 hours later. They may uncover glaring and obvious typographical errors but have a slim chance of correcting errors involving content because they may not remember what they said. For example, a user will most likely miss a mistake of a transcriptionist typing "left" instead of "right" unless "left" was misspelled. In contrast, a report dictated and edited immediately using voice recognition usually receives a more careful review of the resulting text, not simply because the software makes different types of mistakes but also because the content is still at the forefront of the radiologist's mind. This is a very important difference and often overlooked when evaluating this technology. Furthermore, when the time spent reviewing the transcribed reports is added to the time spent initially dictating them, it is unclear whether the total is really any different from that spent with voice recognition systems.

Over the next few years, as the technology continues to improve, workflow options increase through integration with RIS and PACS systems, and continued market pressures for improved service and decreased costs drive need, this technology will become ubiquitous and a routine part of a radiologist's tools.

CURRENT OFFERINGS

There are a number of sources for this technology. The most common are companies that have built stand-alone applications that include the components and workflow options needed for radiology reporting. The most prevalent of these in the market are Dictaphone's PowerScribe and Agfa's Talk Technology TalkStation. These 2 products account for over 90% of the systems in use for radiology today. Both of these companies are undergoing some changes that have hindered their ability to continue developing and supporting the product offerings and have created an opportunity for some of the newer players to enter the market, such as Lanier/Medquist's SpeechQ (Mount Laurel, NJ), Provox's VoxReports (Roanoke, VA), and Commissure's RadWhere (New York), which combines structured reporting with voice recognition.

Solutions may also be obtained through various RIS and PACS vendors. Most have simply integrated with one or both of the above products, while others have opted to integrate a speech engine such as IBM ViaVoice (Wizard Software, Pittsburgh, PA), Philips SpeechMagic (Vienna), or Dragon NaturallySpeaking (Scansoft, Burlington, MA) into their own reporting workflow application.

Regardless of how the application is purchased, there are 6 key components to any viable solution to bear in mind:

1. The core speech engine
2. The language model
3. The application and workflow
4. The interface and integration components
5. Implementation services
6. Ongoing support and customization

CORE SPEECH ENGINE

The speech engine is the starting point for any application driven by speech recognition. Think of the speech engine as a keyboard replacement. There are different types of speech engines. Some recognize only commands and can work on a number of platforms, from PDAs to a PC. Many are built for telephonic applications such as airline scheduling and customer service answering systems. The continuous speech recognition engine is the one used for dictation solutions in medicine. This type of speech recognition allows users to speak in their natural style and will also allow a properly designed application to be controlled by voice command. This is among the most advanced speech technology and among the most processing-power intensive.

Speech engines use a number of proven algorithms and models to recognize speech. Words are spoken into a microphone. Some microphones transmit the analog sound signal to the sound card of the computer, which then digitizes the signal for processing.
Others digitize the analog sound signal immediately within the microphone itself and then send the digital signal to the computer, typically via the universal serial bus (USB) port. The latter method seems more robust, as it eliminates the distortions caused to the analog signal from surrounding interference (common in a hospital setting) as it travels down the wire to reach the sound card. Digital signals are essentially unaffected by such interference. The speech engine listens to these digital signals and compares them to acoustic models stored in the software. The engine tries to find the best word match for the acoustic signal to recognize it and convert it to text.

Speech engines can typically handle a total vocabulary of more than 60,000 words. However, there are very few applications in which using more than 15,000 is necessary. Shakespeare's total works comprised fewer than 13,000 words. The average college graduate uses a vocabulary of fewer than 7,000 words in everyday conversation. More specifically, an analysis of more than 4 million radiology reports collected from more than 15 different hospitals produced fewer than 25,000 unique words, including proper nouns.

Nonetheless, building acoustic and language models is a time-consuming process that costs many person-hours to successfully complete. In fact, speech recognition technology has long been under development: it is the longest-running continuously funded research and development (R&D) effort in the history of IBM. Philips has made a significant R&D effort, too, and there are literally dozens of other development efforts continuing. Perhaps the biggest indicator that the time for speech recognition has come is that Microsoft has built its own speech development labs and is likely outspending all other engine developers combined to advance this technology for the human-machine interface.

LANGUAGE MODEL

The language model is a layer on top of the speech engine that allows for the recognition of specialty-specific terms and reporting structures. Application providers build these language models using programming tools that are included with the speech engine tool kits they have integrated into their applications. There are actually 3 layers to the language model.
The first typically contains a base vocabulary of 30,000 to 60,000 words that come from the manufacturer. These are from everyday English or "New York Times English." They include many medical terms as well. This layer also contains the acoustic models associated with those words. The acoustic models are what allow an engine to differentiate one spoken sound from another. Commercial off-the-shelf packages typically provide only this base language model.

The second layer is used to build specialty models that allow application providers to customize and weight the vocabulary for certain topics or subject matters. This layer is in addition to the base model. Hence, in most cases, radiology-specific applications still contain many nonradiology words that are also active in the vocabulary. However, the words in the customization layer will be more heavily weighted than those in the base model when the engine is working to recognize text, resulting in improved accuracy. The better the effort to build this layer is, the better the overall accuracy of the product will be. It is common to use in excess of 2 million reports to build a language model for medical reporting.

The final layer is the user-specific layer. This layer gets built and optimized as the user dictates and corrects words. If a user adds a word or dictates in a specific style, this layer will learn the style, resulting in improved accuracy. In most cases, 2 to 4 weeks of use will tune this layer for optimum accuracy. Some products speed this process with a built-in utility that allows a user to feed in up to 3 megabytes (MB) of previously dictated reports so that the system can learn new words and reporting style from them. A properly built language model is very important to a successful implementation of speech recognition.

APPLICATION COMPONENT

The application is the most important component of a solution. A functional application for radiology reporting comprises many things. These include a database, an interface engine, a desktop workflow application for the radiologist, administrative tools, voice file management tools, and distribution and integration capabilities. Figure 23.1 represents how data flows in a properly designed reporting application.

The database includes all the data elements necessary for receiving order data and allowing for the creation and distribution of a report and associated information. It should mirror many of the tables in the RIS database.
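
As a rough illustration of a reporting database that mirrors RIS order data, the sketch below uses Python's built-in sqlite3 module. The table names, column names, and status values are all hypothetical and chosen only for this example; a real product would mirror whatever tables its RIS actually exposes.

```python
import sqlite3

# Hypothetical two-table schema: orders received from the RIS, and the
# reports dictated against them. None of these names come from a real product.
SCHEMA = """
CREATE TABLE orders (
    accession          TEXT PRIMARY KEY,
    patient_name       TEXT NOT NULL,
    exam_description   TEXT NOT NULL,
    ordering_physician TEXT
);
CREATE TABLE reports (
    accession TEXT PRIMARY KEY REFERENCES orders(accession),
    status    TEXT NOT NULL CHECK (status IN ('preliminary', 'final')),
    body      TEXT NOT NULL
);
"""


def open_database(path=":memory:"):
    """Create the schema and turn on foreign-key enforcement."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def receive_order(conn, accession, patient_name, exam_description, physician):
    """Store an order so the system knows what needs to be dictated."""
    with conn:
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?, ?)",
            (accession, patient_name, exam_description, physician),
        )


def save_report(conn, accession, status, body):
    """Create or overwrite the report for an existing order."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO reports VALUES (?, ?, ?)",
            (accession, status, body),
        )


def undictated_orders(conn):
    """Return accession numbers that have no report yet, in sorted order."""
    rows = conn.execute(
        "SELECT o.accession FROM orders AS o "
        "LEFT JOIN reports AS r ON r.accession = o.accession "
        "WHERE r.accession IS NULL ORDER BY o.accession"
    )
    return [row[0] for row in rows]
```

Because the report table references the order table, a report cannot be saved against an accession number the system never received, which is one simple way to keep the two systems consistent.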
By integrating workstations into a network, a user can dictate at any PC and not be restricted to a particular workstation. The network link also allows integration with the RIS through the voice recognition server and ultimately into the hospital information system (HIS) and PACS.

FEATURES NEEDED FOR SPEECH REPORTING

Besides the ability to support speech recognition and communicate within the existing information technology (IT) infrastructure, there are a few basic features a speech reporting solution should have, as follows.

FIGURE 23.1 Diagram shows how data flows in a properly designed reporting application. HL7 indicates Health Level Seven; IP, Internet protocol; RIS, radiology information system.

SYSTEM SECURITY AND LOG-ON

The voice recognition software package must ensure that security is maintained by requiring each radiologist to enter a user identification code and password to sign on and begin dictation. The log-on also serves the speaker identification and verification function, which helps ensure high accuracy rates. In addition, most systems require the radiologist to enter the same password or another predefined password in order to sign off on the dictated report and allow its transfer to the RIS and HIS. This second password feature ensures that the user is the designated physician identified by the system as dictating the report. Further security measures include the inability to log on to multiple workstations within a facility, which ensures that user workstations are not left unattended. The system will also not allow an order to be open more than once at any one time.
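
The log-on rules above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular product works: the class and method names are hypothetical, and it assumes the same password is used for both log-on and sign-off (the text notes that some systems use a separate predefined password for signing).

```python
import hashlib
import hmac
import os


class ReportingSecurity:
    """Hypothetical sketch: one workstation per user, one user per open order."""

    def __init__(self):
        self._credentials = {}         # user_id -> (salt, password hash)
        self._active_workstation = {}  # user_id -> workstation name
        self._open_orders = {}         # accession number -> user_id

    @staticmethod
    def _hash(password, salt):
        # Salted PBKDF2 hash so that plain passwords are never stored.
        return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)

    def add_user(self, user_id, password):
        salt = os.urandom(16)
        self._credentials[user_id] = (salt, self._hash(password, salt))

    def _password_ok(self, user_id, password):
        if user_id not in self._credentials:
            return False
        salt, expected = self._credentials[user_id]
        return hmac.compare_digest(self._hash(password, salt), expected)

    def log_on(self, user_id, password, workstation):
        """Refuse bad passwords and a second simultaneous log-on."""
        if not self._password_ok(user_id, password):
            return False
        if user_id in self._active_workstation:
            return False
        self._active_workstation[user_id] = workstation
        return True

    def log_off(self, user_id):
        """End the session and release any orders the user still holds."""
        self._active_workstation.pop(user_id, None)
        held = [acc for acc, owner in self._open_orders.items() if owner == user_id]
        for accession in held:
            del self._open_orders[accession]

    def open_order(self, user_id, accession):
        """An order can be open in only one place at a time."""
        if user_id not in self._active_workstation:
            return False
        if accession in self._open_orders:
            return False
        self._open_orders[accession] = user_id
        return True

    def sign_report(self, user_id, password, accession):
        """Signing requires the password again and closes the order."""
        if self._open_orders.get(accession) != user_id:
            return False
        if not self._password_ok(user_id, password):
            return False
        del self._open_orders[accession]
        return True
```

Asking for the password a second time at sign-off is what ties the signed report to the physician who is actually at the workstation, even if the session was left open.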

STANDARD REPORTS

A speech recognition system should allow members of the radiology department to create predefined reports for individual radiologists or the institution. These predefined reports may be categorized for normal studies or commonly performed studies. For example, there is typically a predefined report for the normal chest radiograph, which would describe the normal cardiomediastinal silhouette and clear lungs. Standard reports should be easily called up either by voice command or mouse click. Some systems may also bring up standard reports based on the type of exam being interpreted.

TEMPLATES AND MACROS

As an extension of the standard report function, many systems allow the creation of standard reports with customizable templates and fields. The template capability gives the radiologist the flexibility to create a form with blank areas that get filled in during dictation. For items such as procedures, the radiologist is able to dictate the necessary components to fill in the blanks. This feature has great value when describing different dosages or instruments. Most systems contain a feature-filled macro/template creation/editing utility. Using this function during report generation greatly improves the efficiency of the radiologist and decreases the time required to dictate reports.

CUSTOMIZABLE FIELDS

Most packages permit custom definitions by the institution of multiple fields associated with a report. These fields may include ICD-9, current procedural terminology (CPT), Breast Imaging Reporting and Data System (BI-RADS), American College of Radiology (ACR)/National Electrical Manufacturers Association (NEMA) codes, or ACR pathology identifiers. The data entered into these fields may be shared with other information systems through, for example, the HIS-RIS integration interface. The integration of these fields into a speech recognition solution allows many collateral benefits. For example, an institution may generate a database that can be used for research and education purposes. By defining cases by ACR codes, trainees can retrospectively review cases with selected words. Employing ICD-9 codes greatly facilitates radiology billing services. The various uses of these fields are innumerable.
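
To make the idea of coded fields concrete, here is a minimal Python sketch of reports that carry institution-defined code fields and can be searched by them, for instance to assemble a teaching file. The class, the function names, and the code values are hypothetical placeholders; the code values in particular are not real ICD-9, ACR, or BI-RADS codes.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CodedReport:
    """A report plus any number of coded fields, keyed by coding system."""
    accession: str
    body: str
    codes: dict = field(default_factory=dict)  # e.g. {"ICD-9": ["..."]}

    def add_code(self, system, code):
        self.codes.setdefault(system, []).append(code)


def find_by_code(reports, system, code):
    """Return accession numbers of all reports tagged with the given code."""
    return [r.accession for r in reports if code in r.codes.get(system, [])]


def code_frequencies(reports, system):
    """Count how often each code of one system appears across reports."""
    counts = Counter()
    for report in reports:
        counts.update(report.codes.get(system, []))
    return counts
```

A usage example: tagging two reports with the placeholder ACR code "ACR-EXAMPLE-1" and then calling find_by_code(reports, "ACR", "ACR-EXAMPLE-1") returns both accession numbers, which is the kind of lookup a trainee case review or a billing export would build on.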

DESKTOP INTEGRATION

Desktop integration with PACS or RIS workstations has become increasingly important as the proliferation of PACS has advanced. There are many ways for a speech reporting solution to sit on a radiologist's desktop. In a nonintegrated environment there are 2 workstations, each with a mouse and a keyboard. These systems do not communicate, and the user must select and sign cases individually within each application. In a semi-integrated environment, there are still 2 separate workstations. However, they communicate such that order selection and signing happen just once, typically on the PACS but sometimes on the RIS. In a fully integrated environment, the applications reside on a single workstation and communicate almost transparently on the workstation. From a usability perspective, this is the ideal scenario. However, PACS and speech recognition are 2 of the most demanding applications for a PC to run. The author has not yet seen these 2 applications run simultaneously on a workstation and consistently perform optimally over an extended period of time. There are also issues with sharing screen real estate, although this can be overcome by adding more inexpensive color LCD displays. Currently, the most reliable integration is semi-integration. This allows for the use of context sharing and a single workflow point while also providing enough PC horsepower for both applications to run optimally.

RIS/HIS INTERFACE

Any system needs to provide links to the existing IT infrastructure, allowing seamless integration with the RIS, HIS, PACS, and the billing system. Most software packages incorporate back-end transparent interfaces that allow the speech recognition system to query the RIS for demographic data as well as report status. In addition, the system allows the upload of dictated reports to the RIS and ultimately to the HIS.
To achieve this task, most software packages contain Health Level Seven (HL7)-compliant application programming interfaces that use standard formats and protocols. Depending on the level of HL7 interface, most systems allow the radiologist to create a worklist by modality, date, or wildcard categories.

Here is an explanation of how HL7 messaging works and some sample messages related to a reporting system. The HL7 protocol is a simple messaging system in which all data are transferred as ASCII data, not unlike a simple text file. Each transaction consists of a message unit (file) that consists of segments (lines or rows) that consist of fields. The fields are formatted according to their HL7 data type. Each message, segment, and field is of variable length and uses ASCII characters for delimiters. An example portion of a message follows:

MSH ^~\& Radiology RDW Radiology ORU
PID Smith^James^T M

The first segment is a message header (MSH). It contains the delimiter characters to be used in the message (^~\&), the message type (ORU), and other message control information. The PID segment contains patient demographics, such as name and date of birth (DOB). Notice the format of the DOB. It is an HL7 data type called TS (time stamp). The TS is defined as YYYYMMDDHHMM[SS]. There are numerous HL7 data types, ranging from tightly restricted formats such as TS to very loose restrictions such as ST, which is any string data. Also notice that some fields contain no data. These are optional fields. For a more detailed description of HL7, refer to the HL7 Standard Specification (Version 2.1 or higher).

There are 2 types of communications that occur between a RIS and a speech reporting solution. The first is the sending of an order to the reporting system so that it is aware of what needs to be dictated. These are very similar to orders sent from an RIS to a PACS. The second is the sending of a result from the reporting system to the RIS. An example of an order message that an RIS might send is:

MSH ^~\& ORU
PID Doe^John^P M 1 Fairway Lane
OBR ^Head,left view ^Welby

This example contains basic patient data, exam type, accession number, and the ordering physician information. An example of a result message sent to the RIS is:

MSH ^~\& ORU
PID
ORC RE
OBR A F D12345
OBX 1 FT 201A2&BODY^PA & Lat. Chest FINDINGS: Comparison, 03/01/91.
OBX 2 FT 201A2&BODY^PA & Lat. Chest The patient is status post right mastectomy. The lungs are clear.
OBX 3 FT 201A2&IMP^PA & Lat. Chest No active cardiopulmonary disease. No interval change since 03/01/91.
DG ^ MALIG NEOPLASM BREAST- CENTRAL^ICD9^^ F
DG ^ BREAST DISORDER NOS^ICD9^^ F

Many variables and options can be a part of the messaging between an RIS and a reporting system. The above examples are the most basic of messaging that occurs.

IMPLEMENTATION

Once a particular solution has been identified, the next task is to prepare the site for implementation to ensure success. The institution must first decide what its objectives are and what a successful implementation should look like. This varies depending on differences among sites. Information technology infrastructure must be analyzed to ensure that the system can be deployed everywhere it is needed. A champion should be appointed from among the radiologists to ensure that his or her colleagues are capable and willing to adapt to using the system. Without exception, the single most important component of a successful implementation is a strong commitment from clinical leadership in the department. If the chairman and section heads are not fully committed to succeed, the chances of a successful implementation are small.

RADIOLOGIST TRAINING

The training of radiologists to use voice recognition software and workstations must occur prior to its wide-scale use in a department. Radiologists must first be able to navigate the basic functions of the operating system; second, they must be versed in the use of computer input devices such as a keyboard, mouse, microphone (whether head mounted or handheld), and possibly a bar code reader. Next the radiologist must become familiar with the software interface of the voice recognition package. Many packages offer navigation by voice, which requires the user to remember the names of each function, such as "Accept and sign" or "Save as preliminary." Other packages require mouse clicks to perform these same functions. Some allow a variety of methods and prompt the user with these commonly used commands to decrease the need to memorize commands immediately.

Usually, the champion of the transition is a key radiologist who facilitates learning and eases the implementation. Such an individual can be instrumental in promoting acceptance and training colleagues in the use of the new system. The vendor will typically schedule enough training for all users. Follow-up training some weeks after go-live is highly recommended.

TRAINING MATERIALS AND TOOLS

Providing users with a written review of the steps they have performed serves to reinforce computer-based training. Vendors should provide written material that is graphically intensive and easy to follow. It should be in a quick-reference format and easily accessed on the workstation.

TECHNICAL SUPPORT

All devices stop working at times. Whether the affected component is a mouse, a monitor, or the entire workstation, problems invariably occur at one time or another to a PC. This is obvious to all users and inherent in all computers despite precautions taken to prevent downtime. However, a reporting system plays a mission-critical role in the life of a radiology department. Downtime can cost thousands of dollars in radiologist time and can have a severe negative effect on service. In most cases, vendor helpdesk support is simply not enough to tend to the needs of a radiologist with a down workstation. Having in-department or hospital support staff that can respond to basic issues in real time is an expense but well worth the effort.
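
One basic issue that in-house support staff may need to check is whether an order or result message crossing the RIS interface is well formed. The sketch below is a minimal Python reader for the message, segment, and field structure described in the RIS/HIS INTERFACE section. It assumes the conventional HL7 version 2 layout, in which the character right after "MSH" separates fields (normally "|") and the next character separates components (normally "^"). The sample message and all function names are made up for illustration, and the sketch ignores repetition, escape sequences, and subcomponents.

```python
from datetime import datetime

# Illustrative result message with placeholder values (not taken from any real system).
SAMPLE_RESULT = "\r".join([
    r"MSH|^~\&|REPORTING|Radiology|RIS|Radiology|||ORU|MSG0001",
    "PID|||MRN0001||Doe^John^P||19500101|M",
    "OBX|1|FT|BODY||The lungs are clear.",
])


def parse_message(raw):
    """Split a message into (segment name, fields) pairs; each field is a list of components."""
    lines = [ln for ln in raw.replace("\n", "\r").split("\r") if ln.strip()]
    if not lines or not lines[0].startswith("MSH") or len(lines[0]) < 5:
        raise ValueError("message must start with an MSH segment")
    field_sep, component_sep = lines[0][3], lines[0][4]
    segments = []
    for ln in lines:
        parts = ln.split(field_sep)
        name, values = parts[0], parts[1:]
        if name == "MSH":
            # MSH-1 is the field separator itself and MSH-2 lists the other
            # delimiters, so both are kept as-is rather than split.
            fields = [[field_sep], [values[0]]]
            fields += [v.split(component_sep) for v in values[1:]]
        else:
            fields = [v.split(component_sep) for v in values]
        segments.append((name, fields))
    return segments


def get_field(segments, name, number):
    """Return field `number` (1-based) of the first segment called `name`."""
    for seg_name, fields in segments:
        if seg_name == name:
            return fields[number - 1] if number <= len(fields) else [""]
    raise KeyError(name)


def parse_ts(value):
    """Parse a TS value given as YYYYMMDD, YYYYMMDDHHMM, or YYYYMMDDHHMMSS."""
    formats = {8: "%Y%m%d", 12: "%Y%m%d%H%M", 14: "%Y%m%d%H%M%S"}
    return datetime.strptime(value, formats[len(value)])
```

With the sample above, get_field(parse_message(SAMPLE_RESULT), "PID", 5) yields the name components Doe, John, and P, and the empty strings between consecutive separators correspond to the optional fields that carry no data.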

OPERATIONS

From the initial thought of employing a voice recognition system through all stages of deployment, close attention to operational planning is necessary. This attention must be generated both departmentally and individually.

The first impressions of voice recognition systems are that they are slow and often cause significant delays with large volumes. It is always difficult to foster change in a department, especially when the ultimate benefits of improved report turnaround times and departmental cost savings seem to occur at the expense of the radiologists' time. To successfully overcome these hurdles, there must be a general understanding, especially by those who champion such an effort, of the impact a system shift of this magnitude will have on the radiologists' daily workflow. Only with acknowledgment of these gains by the designated leadership will it be feasible to achieve milestones and communicate the required steps to the members of the department. Beyond these operational issues, specific issues must be addressed during the integration of voice recognition into a department.

SPECIFIC DELIVERY

During the installation of a system, 4 operational issues should be considered prior to its full-time use in the practice. One first needs a firm grasp of the hardware and infrastructure requirements necessary to create the support systems that would allow seamless integration into the department. Second, individuals need to be identified who can assist radiologists with technical issues promptly and effectively. Third, timely operational checks with close follow-up are necessary to monitor that users are productive with the new system. By ensuring that individual users are effectively operating with the new system, enterprise benefits can be realized. The final operational requirement is a plan for the removal of the legacy dictation system so that penetration of the newer system is guaranteed.

OPERATIONAL CHECKS

For continued success in voice recognition system implementation, the department must make a commitment to guarantee continued support and use. First and foremost, technical support must be quick, knowledgeable, and courteous.
The time that lapses between the request for support and the response to a distressed user should be monitored because once fully implemented, a nonfunctioning voice recognition system is costly to both the productivity and the morale of the department. A workstation that crashes usually results in the complete cessation of workflow because of its absolute necessity, especially once 100% penetration has been achieved. Often, users will resort to using legacy dictation systems if they are available, but this further hampers the operational benefits of the system. Second, the support team must follow up on all outstanding events to prevent recurrence of common problems. There should be a well-documented contingency plan in the event of system failure and a means of communicating this to both radiologists and other users of the system.

REMOVAL OF OLDER DICTATION SYSTEMS

During the phase-in of the voice recognition system, a point is reached at which a well-organized and timely removal of legacy dictation systems must occur. The removal of these alternate dictation systems ensures both primary use of the voice recognition system and a commitment on the part of the department to the radiologists. It also demonstrates that a well-planned contingency structure is in place in the event of failure. Experience has demonstrated that every department has users who resist change and will continue to use legacy systems if they are available (and in some cases even if they are not readily available). However, the removal of older dictation systems once the voice recognition system is in place, coupled with encouragement from department heads and senior management, allows usage to steadily increase and goals to be reached.

COST SAVINGS

Several areas of cost savings are associated with the implementation of a voice recognition system. The most apparent in the radiology practice is the transcription cost. With the direct dictation and transcription of the report, the number of personnel who handle the transcription process can be decreased. The cost savings are realized not only in terms of salaries and benefits but also in the host of other costs associated with personnel. The indirect cost savings are realized through the use of computer systems. Although this may appear as an added cost due to the purchase of hardware, the actual savings come from the multiple uses of a desktop PC.
With the integration of the medical record to include radiology, pathology, and other image-based specialties, the practice of radiology is undergoing drastic changes in the availability of information to the radiologist. Housing the voice recognition system on a conventional desktop PC allows other information agents, including references, paging systems, and HIS-RIS applications, to be coupled to this system. Adding these services to a unifying system saves both time and physical space.

REPORT TURNAROUND

There is no arguing that report turnaround times decrease dramatically when using a speech recognition system. In most cases, times drop by more than 90%. If coupled with proper distribution methods, reports can be delivered to ordering physicians before the patient leaves the department or arrives home. Considering the level of service this represents and the fact that images are now available via the Web from most PACS, it becomes a necessity to implement this type of service to remain current and competitive.

CONCLUSION

There is no doubt that radiology must take the path that leads to improved service and more efficient delivery of results. The demands of more timely and improved patient care are greater than ever. The documentation process must be improved, and no technology has demonstrated itself to be more of an aid toward attaining this goal than speech recognition. Perhaps speech in combination with some clinical content and structured reporting will be the pinnacle of functional reporting systems. The technology has improved dramatically over the years. Vendors have learned and advanced their applications to be true workflow improvement tools. Now is the time to consider implementing speech recognition as a means to improve service and remain clinically relevant to patient care.