The practice of medicine demands accuracy. A small error in dosing can spell the difference between a life-saving cure and a life-ending mistake. That makes medical facilities’ adoption of OpenAI’s Whisper technology worrisome.
This article examines OpenAI's Whisper technology, its concerning error rates and hallucinations, and the legal implications for medical professionals who employ it. More importantly, it provides essential guidance for attorneys whose clients may be using or affected by this technology, particularly in medical malpractice and personal injury contexts. Understanding these risks is crucial as the intersection of AI and healthcare continues to expand, potentially affecting both patient care and legal liability.
Whisper
Whisper, a general-purpose speech recognition tool, was released in September 2022 and has been fraught with problems since that time. OpenAI touts Whisper’s ability to take audio files of up to 25 MB and provide either transcriptions or translations. Whisper was trained on about 680,000 hours of audio, with about a third of its training data being non-English, which OpenAI states makes it well suited to translating audio into English.[1] OpenAI’s paper introducing Whisper states that although the model was trained on a vast amount of data, it was not fine-tuned for any specific use. That same paper reports that Whisper’s zero-shot performance (that is, with no task-specific examples or fine-tuning) across many diverse datasets produced about 50% fewer errors than specialized models trained on LibriSpeech, an open-source speech dataset drawn from audiobooks. But is Whisper as good as OpenAI makes it sound? Probably not.
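For readers who want to see what this looks like in practice, the following is a minimal sketch of calling Whisper through OpenAI’s API using the official Python library. The file name is a placeholder, and the sketch is illustrative only, not a recommendation to use the tool.

```python
# Minimal sketch of calling Whisper through OpenAI's API (official "openai" Python library).
# "patient_visit.mp3" is a placeholder file name; the API accepts audio files up to ~25 MB.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# Transcription: speech in, text out in the original language.
with open("patient_visit.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)

# Translation: non-English speech in, English text out.
with open("patient_visit.mp3", "rb") as audio_file:
    translation = client.audio.translations.create(
        model="whisper-1",
        file=audio_file,
    )
print(translation.text)
```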
Whisper is available as an open-source tool that anyone can download and use for free from either OpenAI or HuggingFace. In a review of Whisper, Deepgram found “a median Word Error Rate (WER) of 53.4 in Whisper-v3 while Whisper-v2 only has a median WER of 12.7. In other words, Whisper-v3 makes 4x as many mistakes as Whisper-v2.” According to Sanchit Gandhi at HuggingFace, Whisper is “the most popular open-source speech recognition model.” Whisper-large-v3 has been downloaded more than 4.2 million times in a single month. Other versions have also been downloaded from HuggingFace in significant numbers (Whisper-large was downloaded over 200,000 times last month, and Whisper-large-v3-turbo nearly 700,000 times). Presumably, Whisper has also been downloaded a significant number of times directly from OpenAI; however, those numbers are not made public.
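For context, Word Error Rate measures the insertions, deletions, and substitutions needed to turn a model’s transcript into a trusted reference transcript, divided by the number of words in the reference, so a WER above 50 means roughly one error for every two words. The short sketch below, using the open-source jiwer package and invented sentences, shows how the metric is computed.

```python
# Word Error Rate (WER) sketch using the open-source jiwer package.
# The reference and hypothesis sentences are invented for illustration only.
import jiwer

reference = "the patient was prescribed ten milligrams of lisinopril daily"
hypothesis = "the patient was prescribed ten milligrams of hyperactivated antibiotics"

# WER = (substitutions + deletions + insertions) / number of words in the reference
error_rate = jiwer.wer(reference, hypothesis)
print(f"WER: {error_rate:.1%}")  # two substituted words out of nine, roughly 22%
```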
Whisper technology is being used in a variety of industries to generate text by translating and transcribing audio. Whisper has been integrated into some versions of ChatGPT and built into Oracle’s and Microsoft’s cloud computing platforms. The languages that OpenAI asserts Whisper supports for translation are those that had less than a 50% error rate in testing. Whisper has also been integrated into applications such as call centers and voice assistants. With a potential error rate approaching 40 or 50%, the use of Whisper could lead not only to frustration and misunderstanding for clients and consumers but also to more serious errors.
Whisper is also used in some closed captioning applications. With a high rate of errors, captions could be not only confusing and frustrating for deaf and hard-of-hearing users but even dangerous if those users receive incorrect information as a result.
OpenAI has stated that Whisper should not be used in “high-risk” applications. Nevertheless, hospitals and doctors are using it. The medical field has been using artificial intelligence for some time to help draft messages and patient notes, and professionals are leaning more heavily on AI. Applications such as HealthPath, Medcare, and Doct-Assist, among others, include Whisper in their platforms. One company, Nabla, downloaded Whisper and built its own clinical transcription tool on top of it. Over 30,000 clinicians and 40 health systems use Nabla’s Whisper-based tool, which has been used to transcribe an estimated 7 million medical visits. This is problematic.
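To illustrate what “building on top of” the open-source model involves, here is a minimal sketch using the openai-whisper Python package. The model size, file name, and use of timestamps are illustrative assumptions, not a description of Nabla’s actual system.

```python
# Minimal sketch of running the open-source Whisper model locally, the way a vendor
# might embed it in its own product. Requires "pip install openai-whisper" and ffmpeg.
# "clinic_visit.wav" is a placeholder file name.
import whisper

model = whisper.load_model("large-v3")         # downloads the model weights on first use
result = model.transcribe("clinic_visit.wav")  # returns a dict with "text" and "segments"

print(result["text"])
for segment in result["segments"]:
    # Segment timestamps make it easier to spot-check the transcript against the audio.
    print(f'[{segment["start"]:.1f}s - {segment["end"]:.1f}s] {segment["text"]}')
```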
A developer found hallucinations in nearly all of 26,000 transcripts created with Whisper. Computer scientists found 187 hallucinations in over 13,000 clear audio snippets. Professors Koenecke and Sloane, of Cornell University and the University of Virginia, respectively, found that nearly 40% of the hallucinations they identified were harmful or concerning.
The two researchers highlighted several mistakes:
Example 1: On the recording, a speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.”
Whisper’s transcription read: “He took a big piece of a cross, a teeny, small piece ... I’m sure he didn’t have a terror knife so he killed a number of people.”
Example 2: A speaker described “two other girls and one lady.”
Whisper invented their races, adding “two other girls and one lady, um, which were Black.”
Example 3: Whisper invented a non-existent medication called “hyperactivated antibiotics.”
Hallucinations appear to be most prevalent when recordings contain pauses or background noise.
In studying Whisper’s accuracy, a University of Michigan researcher found hallucinations in 8 out of 10 audio transcriptions of public meetings. Similarly, a machine learning engineer discovered hallucinations in about 50% of over 100 hours of analyzed Whisper transcriptions. While some of the errors are minor, others—as shown above—are more troubling.
HIPAA Confidentiality Requirements
Just as lawyers owe duties of confidentiality regarding their clients’ information, medical professionals and their staff owe duties of confidentiality to their patients. Under HIPAA, it is a violation of law to share patients’ confidential information, such as information about their health conditions and their private visits with their providers, without their consent. Using a medical transcription tool that exposes confidential information to others could breach those duties.
According to OpenAI, as of March 1, 2023, data sent to the OpenAI API (through which Whisper is available) is not used to train or improve OpenAI models unless the user explicitly opts in to sharing it. So long as users do not opt in to having their information used, which OpenAI says might allow models to improve for a user’s specific use case over time, this should go a long way toward satisfying confidentiality concerns under HIPAA.
Liability for Medical Professionals Using Whisper
Medical professionals using Whisper, or any application built on top of Whisper, may have some unexpected liability. Whisper is known to hallucinate, as discussed above. Some of these hallucinations include fake or non-existent medical treatments. It is prone to making up all or part of sentences, according to interviews with more than a dozen software engineers, developers and academic researchers. Whisper’s hallucinations also include racial and violent additions, as shown in the examples above.
OpenAI warned against using Whisper for high-risk applications, which will likely go a long way toward shielding OpenAI from liability for the most egregious errors. Indeed, OpenAI’s Whisper documentation warns against using it in “decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes.” However, this will not protect the computer scientists, developers, and users adopting this technology. Mistakes made in the medical field “could have ‘really grave consequences,’ particularly in hospital settings,” according to Alondra Nelson, who led the White House Office of Science and Technology Policy for the Biden administration until last year. Errors could lead to misdiagnosis, wrongful treatment, and emotional distress, among other harms.
Determining whether medical offices are using Whisper technology requires investigation. When receiving medical records or documentation, attorneys should explicitly inquire about the transcription tools used in their creation. This includes asking about both direct Whisper usage and any third-party applications that may be built on Whisper's architecture. Request written confirmation of the transcription tools employed, as many healthcare providers may be unaware that their software incorporates Whisper technology.
For medical records received during discovery, consider including specific interrogatories about AI transcription tool usage, including questions about quality control measures and error verification processes. Document any instances where Whisper or Whisper-based technology was used, as this information may become relevant if discrepancies or errors emerge in the medical documentation.
More Reason for a Uniform AI Law in the U.S.
The prevalence of such hallucinations has led experts, advocates, and former OpenAI employees to call for the federal government to consider AI regulations. According to an ABC News article, William Saunders, a research engineer who quit OpenAI over differences about the company’s direction, said: “This seems solvable if the company is willing to prioritize it. It’s problematic if you put this out there and people are overconfident about what it can do and integrate it into all these other systems.”
The U.S. has yet to enact a federal law that could regulate AI models, perhaps by sorting them into risk categories as the EU’s AI Act does, without strangling innovation. OpenAI is a well-funded leader in generative AI. It needs to correct this flaw or make its warnings more prominent.
Best Practices for Lawyers
Best practices for lawyers include, as per usual, verifying all output received from any AI tool. But these hallucinations could be even more perilous for lawyers who practice personal injury law.
In Summary:
Always verify all output from AI tools, and remind clients to do the same.
Consider advising clients to verify whether they are consenting to allow third-party services to receive audio, especially if they are seeking treatment for an injury that will be the subject of litigation.
Determine whether the source of the documents you receive from a medical office or opposing counsel uses or relies on Whisper or a similar technology. Remember that Whisper can be used as a basis for another technology.
Verify that you are not relying on Whisper technology or another similar transcription or translation tool—which, again, might form the basis of another application. You might need to do some digging to find out if Whisper or another such technology was used in your tool.
If you use any technology like Whisper, test the accuracy of its output within your own practice or hire a third party to conduct an accuracy test (see the sketch after this list for one way to run a simple spot-check).
For medical records received during discovery, consider including specific interrogatories about AI transcription tool usage, including questions about quality control measures and error verification processes.
Document any instances where Whisper, Whisper-based, or similar technology was used, as this information may become relevant if discrepancies or errors emerge in the medical documentation.
Remind clients to verify all output from AI tools, especially if they are in high-risk fields such as medicine.
Warn clients who work in precision-dense or high-risk fields to check whether any tools they are using (including but not limited to transcription or translation applications) rely on Whisper or similar error-prone technology.
Warn clients to verify the accuracy of the output they receive from transcription and translation tools, especially in critical uses.
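As one illustration of the accuracy testing suggested above, the sketch below compares a tool’s transcripts against human-verified reference transcripts and flags files whose Word Error Rate exceeds a chosen threshold. The directory names, the 10% threshold, and the use of the jiwer package are assumptions for the example; a real audit should be designed with a qualified expert.

```python
# Hypothetical accuracy spot-check: compare an AI tool's transcripts against
# human-verified reference transcripts and flag any file with a high Word Error Rate.
# Directory names and the 10% threshold are illustrative assumptions.
from pathlib import Path

import jiwer

THRESHOLD = 0.10  # flag anything worse than a 10% word error rate

reference_dir = Path("verified_transcripts")  # human-checked "ground truth" text files
machine_dir = Path("tool_transcripts")        # what the AI tool produced for the same audio

for reference_file in sorted(reference_dir.glob("*.txt")):
    machine_file = machine_dir / reference_file.name
    if not machine_file.exists():
        print(f"{reference_file.name}: no machine transcript to compare")
        continue
    error_rate = jiwer.wer(reference_file.read_text(), machine_file.read_text())
    flag = "REVIEW" if error_rate > THRESHOLD else "ok"
    print(f"{reference_file.name}: WER {error_rate:.1%} [{flag}]")
```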
Conclusion
The integration of OpenAI's Whisper into medical settings represents a concerning trend: technological adoption may be outpacing proper risk assessment. With documented error rates reaching 50% and numerous instances of dangerous hallucinations, the technology poses significant risks to patient care and creates potential liability exposure for healthcare providers. Understanding these risks is crucial for attorneys, particularly those practicing in medical malpractice and personal injury law.
The responsibility falls on legal professionals to guide their healthcare clients through the complex intersection of AI implementation, HIPAA compliance, and liability risk. By following the best practices outlined in this article and maintaining vigilant oversight of AI tool usage, attorneys can help their clients navigate these challenges while protecting patient safety and minimizing legal exposure. As AI continues to transform healthcare delivery, the legal profession must stay informed and proactive in addressing these emerging technological risks.
[1] OpenAI states that the following languages are supported for transcription and translation: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.
© 2024 Amy Swaner. All rights reserved. May use with attribution and link.