Generative AI has been one of the fastest-adopted technologies of all time. The launch of ChatGPT placed one of the most exciting evolutions of AI at the fingertips of everyone with access to the internet. A McKinsey study on the economic potential of generative AI estimated that Gen-AI could add between $2.6 trillion and $4.4 trillion in value annually, with a key focus on using Gen-AI to enhance productivity. Generative AI embedded into business applications has the potential to automate work activities that take up to 70% of employees’ time, with a major contribution coming from customer operations.
Organizations have up to 80% of their data captured in unstructured sources, spread across thousands or even millions of files. As these data sources grow, the number of hours a person needs to find meaningful insights becomes staggering. The sources are often not tagged, annotated, or organized in a way that makes searching or filtering for information quick and easy. Different formats and file types add to the complexity, and time is wasted searching massive document repositories for relevant details.
Using the suite of Azure AI services, OmniData built a solution that ingests these different data sources and catalogs them so that the required information is easy to search. A user-friendly web application was deployed and further integrated into D365 Customer Engagement. Integrating the AI solution into the interface for customer queries and responses speeds up resolution time and improves the customer experience.
Our client can now access the knowledge repositories that experts in the field have built up over the years. This knowledge is available for training and speeds up the upskilling of new staff.
At OmniData, we leverage the capabilities of Azure AI services, including Azure AI Search, Azure Document Intelligence, and Azure OpenAI, to give clients the ability to draw meaningful context from their document repositories by extracting, cataloging, and interacting with unstructured data.
Business Problem
Extracting meaningful information from clinical texts presents a significant challenge. Recognizing whether a published article relates to the research topics of the client’s different departments is key.
- Named Entity Recognition (NER) involves identifying specific entities, such as drug names or medical conditions, within the text.
- Relation extraction focuses on determining interactions between these identified entities.
- Document classification groups articles or other documents into relevant categories, enabling faster exploration and retrieval.
- Additionally, interacting with unstructured text document repositories using Question & Answer functionality enhances efficiency.
- Finally, research and Intellectual Property (IP) must be kept safe and secure throughout.
One of the largest medical journal publications databases, PubMed, has over 36 million citations for biomedical literature. Speeding up the process of finding content that can be explored for ideation and trial phases adds significant business value.
Our Solution Approach
OmniData’s engineers use a wide range of Azure AI services to build end-to-end solutions for clients. The key challenges to overcome are content fusion, converting unstructured information into semi-structured information, cataloging information for fast retrieval, fine-tuning the models to cater for company-specific domains, and finally providing an intuitive interaction layer.
1. Content Fusion
Content fusion is a common challenge in structured data solutions. The initial step involves combining all data sources into an environment that simplifies access to all information. Data Lake architectures loosen the restrictions on what types of data can be processed together. Azure Data Lake Gen2 storage enables the ingestion of many different file types into one container. Image, video, audio, and structured data can then be merged and served to various AI platforms such as Azure Document Intelligence (Azure DI) and Azure AI Search.
2. Creating Semi-structured data
Optical Character Recognition (OCR) solutions have significantly improved over the past few years, enabling the extraction of information from files such as PDFs. Azure DI is frequently used to convert documents into semi-structured formats, such as JSON files. These files contain metadata along with the content, providing additional insights and value to the final solution. In the biotechnology sector, this capability is particularly useful for extracting content from research articles for further processing.
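To make the idea of a semi-structured record concrete, the following is a minimal sketch in plain Python. It does not call Azure DI and does not reproduce its output schema; the class name, field names, and sample values are all hypothetical and only illustrate how extracted content and its metadata can travel together in one JSON file.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Dict, List


@dataclass
class ExtractedDocument:
    """Hypothetical container for OCR output: content plus metadata."""
    source_file: str
    page_count: int
    paragraphs: List[str] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def to_semi_structured(doc: ExtractedDocument) -> str:
    """Serialize the extracted document to a JSON string."""
    return json.dumps(asdict(doc), indent=2)


# Made-up sample values, used only to show the shape of the record.
doc = ExtractedDocument(
    source_file="article_001.pdf",
    page_count=2,
    paragraphs=["First extracted paragraph.", "Second extracted paragraph."],
    metadata={"language": "en", "document_type": "research_article"},
)

# Round-trip through JSON to confirm the record is self-describing.
record = json.loads(to_semi_structured(doc))
print(record["metadata"]["document_type"], len(record["paragraphs"]))
```

Keeping the metadata next to the content means later steps, such as indexing or classification, can filter on fields like the document type without re-reading the original file.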
3. Cataloging the Information
Azure AI Search offers services for cataloging and indexing various data sources. Similar to online search engines, these indexes enable rapid retrieval of relevant information from a company’s data repositories. Additionally, Azure AI Search provides functionality for filtering and screening personal data.
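The reason an index makes retrieval fast can be shown with a toy inverted index. This is only an illustrative sketch in plain Python, not how Azure AI Search is implemented or called; the function names, file names, and document texts are assumptions made up for the example.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map every token to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Made-up documents standing in for a company repository.
documents = {
    "a.pdf": "Aspirin reduces fever and inflammation",
    "b.pdf": "Clinical trial of aspirin for heart disease",
    "c.pdf": "Trial design for vaccine studies",
}
index = build_index(documents)
print(sorted(search(index, "aspirin trial")))
```

A query only touches the entries for its own tokens instead of scanning every file, which is why lookups stay quick as the repository grows.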
4. Fine Tuning
Fine-tuning different LLMs involves adjusting various parameters that can impact the accuracy of the output. Establishing a robust framework for optimization is critical. Our team of experts employs various techniques to achieve optimal results.
- Optimizing prompts: Comparing validated prompt-completion pairs with generated pairs provides a framework for finding the best way to ask a question. Similarity metrics, such as cosine similarity scores, help ensure that prompts generate relevant completions.
- Retrieval Augmented Generation (RAG): Incorporating external documents and data for context.
- Model Fine-tuning: Using the fine-tuning capabilities of Azure AI Services where possible to further tune the models for specific use cases.
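The prompt optimization step can be sketched as follows. This is a simplified illustration, not OmniData's actual framework: it assumes word-count vectors as a stand-in for the embedding vectors a real system would likely use, and the function names, prompt labels, and sample texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict


def term_vector(text: str) -> Counter:
    """Represent text as word counts (a stand-in for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_prompt(validated_completion: str, generated: Dict[str, str]) -> str:
    """Pick the prompt whose generated completion is closest to the validated one."""
    target = term_vector(validated_completion)
    return max(
        generated,
        key=lambda prompt: cosine_similarity(target, term_vector(generated[prompt])),
    )


# Made-up validated answer and completions from two candidate prompts.
validated = "Aspirin inhibits platelet aggregation"
generated = {
    "prompt A": "Aspirin inhibits platelet aggregation in patients",
    "prompt B": "The weather is sunny today",
}
print(best_prompt(validated, generated))
```

Scoring each candidate prompt against a validated completion gives a repeatable number to compare, so prompt wording can be tuned by measurement rather than by impression.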
At OmniData, we utilize Azure OpenAI services to leverage the capabilities of the latest LLMs. Furthermore, we employ domain-specific models like BioBERT, PubMedBERT, or BioGPT. These models provide different capabilities that are relevant to business problems. BERT models enable contextual understanding and topic extraction. This proves invaluable in biomedical research, facilitating entity extraction and document classification, thereby accelerating R&D.
GPT models like BioGPT enhance these capabilities by enabling chat-based interactions and question-answering. These models can also be fine-tuned for specific uses.
5. Intuitive Interfaces
Completing the AI solution involves creating an intuitive interface for interacting with data efficiently. Our team often uses the integration capabilities of Azure App Services to serve the results and provide the interaction point to the trained models. The web services provide the ability to deploy the interface directly into other operating systems for quick access to information.
6. Data Security
In an industry where knowledge, research, and discovery are the key assets for creating value, it’s crucial to ensure the security of this information. OmniData’s reference architecture has data privacy and security built in as part of the design.
The Azure services used do not share any data outside of the client’s tenant. That means that your data remains your data and none of it is used to train third-party models. This ensures strict confidentiality and prevents unauthorized access to sensitive and proprietary information.
Azure’s comprehensive security protocols, including encryption, access control, and policies, enhance data protection. These measures ensure a client can use these services without compromising the confidentiality of their data.
Conclusion
Advancements in Large Language Modeling have propelled NLP and NLU forward. The ability to process large volumes of text with AI systems while maintaining comprehension is critical to how R&D teams analyze documents. Incorporating these capabilities into team workflows will accelerate the design and development of new products and solutions, unlocking efficiencies at scale.