Real-world healthcare data is one of the most valuable resources for advancing medical research and patient care. Yet, to this day, much of the focus remains on structured data – information neatly organized into standardized formats such as tables and fields in databases. While this data is easier to access and extract, it is often incomplete, missing a major chunk of the patient’s journey and medical history. A vast, largely untapped resource exists in the form of unstructured data, which includes everything from clinical notes to imaging files.

A study conducted by Briya on aspirin intake in pregnant women, selected as Best Podium Research, found that over two-thirds of the critical information was sourced from free-text medical notes. One of the greatest challenges with text is its freeform format, which makes it difficult to analyze using traditional methods.

Free text appears in clinical notes, imaging summaries, and even in seemingly structured fields such as medication descriptions. What makes it “free” is that it isn’t confined to a predefined set of categories (like ICD-10 codes), but instead reflects the variability of natural language (for example, using “female” vs. “woman”). 

Its complexity arises from inconsistent grammar, varied syntax, and informal expressions. Spelling errors, typos, and practitioner-specific jargon further complicate data extraction, increasing the risk of misinterpretation or the loss of valuable clinical information. This task becomes even more difficult with longer and richer text. 

In short? What’s convenient for the doctor proves to be hard for the researcher. 

 

Unlocking the Free Text Treasure Trove with Large Language Models

Natural Language Processing (NLP) techniques have been used for decades to extract meaningful information from free text. Classical tasks include identifying named entities such as medications, conditions, and dates (Named Entity Recognition), categorizing text (text classification), capturing semantic meaning (word embeddings), and labelling each word with its grammatical role (part-of-speech tagging).

These NLP algorithms can be fine-tuned on domain-specific jargon, syntax, and contextual cues to ensure accurate interpretation, especially in fields like healthcare, where small differences in wording can have major implications (e.g., “prescribe aspirin” vs. “do not prescribe aspirin”).
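The aspirin example above is a negation problem, and a simplified sketch shows why it matters. The snippet below checks whether a negation cue precedes a concept, in the spirit of NegEx-style trigger lists; the cue list and scope rule are deliberately minimal and are assumptions for illustration, not Briya’s algorithm.

```python
import re

# A few common negation cues; production systems (e.g., NegEx)
# use much larger trigger lists and proper scope rules.
NEGATION_CUES = [r"\bdo not\b", r"\bdenies\b", r"\bno evidence of\b", r"\bwithout\b"]

def is_negated(sentence: str, concept: str) -> bool:
    """Return True if a negation cue appears before the concept mention."""
    s = sentence.lower()
    idx = s.find(concept.lower())
    if idx == -1:
        raise ValueError(f"concept {concept!r} not found in sentence")
    prefix = s[:idx]
    return any(re.search(cue, prefix) for cue in NEGATION_CUES)

print(is_negated("Prescribe aspirin 81 mg daily.", "aspirin"))             # False
print(is_negated("Do not prescribe aspirin due to bleeding.", "aspirin"))  # True
```

A system that extracts only the word “aspirin” without this check would record the opposite of the clinician’s intent, which is exactly the kind of misinterpretation described above.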

Large Language Models (LLMs) and NLP algorithms fine-tuned on massive amounts of data represent a paradigm shift in the scalable analysis of unstructured data. They play a major role in transforming free text into structured fields and tables, standardizing it (e.g., to match system codes like ICD-9), and enabling it to be queried and analyzed. Beyond extraction, these models also help ensure consistency and quality in textual data, leading to insights that are more robust and accurate.
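As a toy illustration of the “free text to structured, coded fields” step, the sketch below scans a note for condition mentions and emits structured rows with ICD-9 codes. The two-entry mapping is illustrative only; a real pipeline (LLM-based or otherwise) would standardize against full terminology services and handle context, negation, and abbreviations.

```python
import re

# Illustrative condition -> ICD-9 mapping; not a real terminology service.
ICD9_MAP = {
    "hypertension": "401.9",
    "diabetes": "250.00",
}

def extract_conditions(note: str) -> list[dict]:
    """Turn condition mentions in a free-text note into structured rows."""
    rows = []
    for term, code in ICD9_MAP.items():
        # Word boundaries avoid matching inside longer words.
        if re.search(rf"\b{term}\b", note, re.IGNORECASE):
            rows.append({"condition": term, "icd9": code})
    return rows

note = "Pt with long-standing hypertension; new diagnosis of diabetes."
print(extract_conditions(note))
```

The output is a table-ready list of rows, which is the form queries and analytics expect; the value of LLM-based extraction is doing this reliably when the mentions are implicit, misspelled, or buried in long narratives.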

Yet, despite the immense potential of LLMs to transform medical data research and analysis, significant hurdles remain, particularly around privacy regulations and limited access to computational resources. The challenge isn’t just training these models on sensitive data, but also enabling them to operate within secure environments, ideally with minimal resource requirements.

At Briya, we develop specialized language models for information extraction and question answering from free text, unlocking the full potential of large language models on sensitive medical data. Briya’s models run entirely within the hospital’s infrastructure, eliminating the need to move or duplicate sensitive data. This enables in-depth, state-of-the-art medical free-text analytics while ensuring full compliance with data privacy regulations and reducing dependency on external computing resources.

By bridging the gap between the complexity of medical free text and the demand for precise, harmonized data, Briya’s language models empower advanced clinical research, analytics, and informed decision-making.

In our next blog, we’ll dive deeper into how Briya’s language models not only extract valuable clinical information but, when developed with the appropriate data and methodology, can evolve into specialized medical experts capable of accurately answering complex medical questions, all while preserving patient privacy.

 

Maximizing Clinical Data Potential and Research Accuracy 

Imagine conducting research using only a fraction of the available data, compared to using nearly 100% of it. By relying solely on structured data sources, researchers overlook valuable insights hidden in doctors’ notes, imaging reports, and other unstructured formats. 

By tapping into free text, Briya’s NLP enhances data completeness and accurately extracts context-aware insights from unstructured data. It unlocks the full potential of clinical data across therapeutic areas and clinical sites, fueling more precise, scalable, and real-world-ready evidence generation.

Dr. Talia Tron