🔄 New Drug Approvals Scraper

This Python package automates the scraping, cleaning, and classification of new drug approval data from Drugs.com. It combines regex-based data cleaning with AI-driven classification using OpenAI's GPT-3.5 Turbo model, and its output can feed dynamic environments such as Dash applications.

🧹 Data Cleaning and Normalization

🧽 Cleaning Techniques

The scraper extracts and refines the data, handling variations in drug names, generics, and administration methods. The extract_generic_and_admin function uses regular expressions to isolate and sanitize these components so that they are uniform across the dataset.

Key Cleaning Operations:

Drug Names: Separating combined names and administration routes using custom regular expressions.
Generics: Clearing empty parentheses or irrelevant details enclosed within double parentheses.
Administration Methods: Filtering out non-essential text such as outdated drug names or initial prepositions.
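
The cleaning operations above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: it assumes raw titles look like "Brand (generic) administration", and the regular expressions, the return shape, and the made-up drug names are all hypothetical. Only the function name extract_generic_and_admin comes from this README.

```python
import re

# Assumed input shape: "<brand> (<generic>) <administration>".
# The generic group tolerates one level of nested parentheses.
_TITLE_RE = re.compile(
    r"^(?P<name>[^(]+?)\s*"
    r"\((?P<generic>(?:[^()]|\([^()]*\))*)\)\s*"
    r"(?P<admin>.*)$"
)


def extract_generic_and_admin(title):
    """Split a raw title into (name, generic, administration)."""
    match = _TITLE_RE.match(title.strip())
    if match is None:
        # No parenthesised generic found: return the title unchanged.
        return title.strip(), None, None

    name = match.group("name").strip()

    # Generics: drop details nested in a second pair of parentheses;
    # an empty "()" collapses to None.
    generic = re.sub(r"\([^()]*\)", "", match.group("generic")).strip()

    # Administration: drop a trailing "formerly <old name>" note,
    # then a leading preposition such as "for".
    admin = match.group("admin")
    admin = re.sub(r"\s*[-,]?\s*formerly\b.*$", "", admin, flags=re.IGNORECASE)
    admin = re.sub(r"^(?:for|in|as)\s+", "", admin.strip(), flags=re.IGNORECASE)

    return name, generic or None, admin.strip() or None


print(extract_generic_and_admin("Examplex (examplinib) Tablets"))
print(extract_generic_and_admin("Bar (baz (old detail)) for Injection"))
```

Titles whose administration part itself contains parentheses, or generics nested more than one level deep, would need a more careful parser than this sketch.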

🔧 Normalization

Normalization processes target company names that often appear with slight variations in formatting, punctuation, or presentation. Using the clean_company_name function, we standardize these names to reduce the complexity and ensure consistency across the dataset. Here’s how specific issues are addressed:

Removing Redundant Suffixes: Strips common corporate suffixes like "Inc.", "Ltd.", "Corp.", "Corporation", and others to maintain a clean, uniform database.
Unifying Abbreviations and Full Names: Converts abbreviations to their full forms and ensures that variations in company names are standardized to a single, consistent format.
Standardizing Collaboration Descriptions: Variations in the representation of collaboration between companies, such as "and", "&", "+", "/", are unified to "and" to maintain consistency in joint ventures or co-developed products.

Example Transformations:

Original Company Name	Normalized Company Name
Pfizer, Inc.	Pfizer
Pfizer Inc.	Pfizer
AMAG Pharmaceuticals, Inc.	AMAG Pharmaceuticals
AFT Pharmaceuticals Ltd.	AFT Pharmaceuticals
ALK-Abelló A/S	ALK-Abelló
GSK	GlaxoSmithKline
GlaxoSmithKline PLC	GlaxoSmithKline
Amgen, Inc.	Amgen
Daiichi Sankyo Company, Limited	Daiichi Sankyo
AstraZeneca and Daiichi Sankyo Company, Limited	AstraZeneca and Daiichi Sankyo
Bristol-Myers Squibb Company / Gilead Sciences, Inc.	Bristol-Myers Squibb Company and Gilead Sciences
Boehringer Ingelheim Pharmaceuticals, Inc. and Eli Lilly	Boehringer Ingelheim Pharmaceuticals and Eli Lilly

These methods enhance the reliability of data for subsequent analyses by ensuring that each entity is represented uniformly, reducing the number of unique company names from approximately 1000 to 700.
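
The normalization rules above can be approximated with a short sketch. It is an assumption-laden illustration rather than the package's real clean_company_name: the suffix list, the abbreviation map, and the rule that a separator only counts when surrounded by whitespace (so that "A/S" is not split) are all guesses chosen to reproduce the example table.

```python
import re

# Hypothetical suffix list; "Company" alone is deliberately absent so that
# "Bristol-Myers Squibb Company" is preserved as in the table above.
_SUFFIX_RE = re.compile(
    r",?\s+(?:Company,\s+Limited|Inc\.?|Ltd\.?|Limited|Corp\.?|Corporation|PLC|A/S)$",
    re.IGNORECASE,
)

# Collaboration separators count only when surrounded by whitespace,
# which keeps suffixes such as "A/S" intact until they are stripped.
_SEPARATOR_RE = re.compile(r"\s+(?:and|&|\+|/)\s+", re.IGNORECASE)

# Hypothetical abbreviation map with a single illustrative entry.
_ABBREVIATIONS = {"GSK": "GlaxoSmithKline"}


def _clean_single(name):
    """Strip trailing corporate suffixes, then expand known abbreviations."""
    name = name.strip()
    previous = None
    while previous != name:  # repeat in case suffixes are stacked
        previous = name
        name = _SUFFIX_RE.sub("", name).strip()
    return _ABBREVIATIONS.get(name, name)


def clean_company_name(raw):
    """Normalize one company name or a collaboration of several."""
    partners = _SEPARATOR_RE.split(raw)
    return " and ".join(_clean_single(p) for p in partners if p.strip())


print(clean_company_name("Pfizer, Inc."))
print(clean_company_name("Bristol-Myers Squibb Company / Gilead Sciences, Inc."))
```

Splitting on separators before stripping suffixes means each partner in a collaboration is cleaned independently, which is what makes "AstraZeneca and Daiichi Sankyo Company, Limited" come out as "AstraZeneca and Daiichi Sankyo".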

🏷️ Data Classification with AI

Using LangChain with OpenAI's GPT-3.5 Turbo, the scraper enriches the extracted data by classifying medications and their treatment categories. This simplifies complex medical information and supports trend analysis across disease treatments. Through LangChain, the model applies logical rules to each drug's detailed description and intended use in order to categorize it.
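
The general shape of such a classification step can be sketched without committing to any particular LangChain API. In the sketch below, the category list, the prompt wording, and every function name are hypothetical; `llm` is any callable that takes a prompt string and returns the model's reply as a string, standing in for whatever LangChain call the package actually makes to GPT-3.5 Turbo.

```python
# Hypothetical category list; the package's real categories are not shown here.
CATEGORIES = ("Oncology", "Cardiology", "Neurology", "Infectious Disease", "Other")

PROMPT_TEMPLATE = (
    "You are classifying newly approved drugs.\n"
    "Drug: {name}\n"
    "Description: {description}\n"
    "Answer with exactly one category from this list: {categories}."
)


def build_prompt(name, description):
    """Fill the prompt template for a single drug."""
    return PROMPT_TEMPLATE.format(
        name=name,
        description=description,
        categories=", ".join(CATEGORIES),
    )


def parse_category(reply):
    """Map a free-text model reply onto a known category, else 'Other'."""
    cleaned = reply.strip().strip(".").lower()
    for category in CATEGORIES:
        if category.lower() == cleaned:
            return category
    return "Other"


def classify_drug(name, description, llm):
    """Classify one drug using any prompt -> reply callable."""
    return parse_category(llm(build_prompt(name, description)))


# Usage with a stub in place of a real model call:
fake_llm = lambda prompt: "Oncology."
print(classify_drug("Examplex", "Treatment for solid tumours.", fake_llm))
```

Validating the reply against a fixed list keeps unexpected model output from leaking new category labels into the dataset.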

📦 Package Structure and Versatility

The scraper is structured as a Python package, enabling it to function independently or as part of larger systems:

scraper.py: Handles data collection logic.
classification.py: Manages AI-driven data categorization.
utils.py: Provides utility functions for data manipulation.
__init__.py: Marks the directory as a Python package for easy import.

🔄 Using the Scraper

To use the scraper, manage the OpenAI API key through one of the following methods:

Environment Variable: Store the API key in an environment variable which the scraper accesses.
Direct Specification: Directly pass the API key to the function when invoking it.
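
The two options can be combined in a small helper, sketched below. The helper name resolve_api_key and its parameters are hypothetical, and the variable name OPENAI_API_KEY is an assumption (it is the name conventionally used by OpenAI tooling); the package may read a different variable.

```python
import os


def resolve_api_key(api_key=None, env_var="OPENAI_API_KEY"):
    """Return the API key, preferring an explicit argument over the environment."""
    if api_key:
        return api_key  # direct specification wins
    key = os.environ.get(env_var)
    if not key:
        raise ValueError(
            f"No API key given and environment variable {env_var} is not set."
        )
    return key
```

Giving the explicit argument priority lets a caller override the environment for a single run without changing the shell configuration.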

The main function, scrape_new_drug_approvals_data, supports:

Standalone Execution: Updates a local CSV file with the latest drug approvals.
Integration with Dash: Returns a DataFrame directly, enabling real-time data refresh in Dash applications whenever invoked.

🌐 Example Integration

For a live example of how this scraper package is utilized within a Dash ecosystem to provide real-time updates on drug approvals, visit the New Drug Approvals Dashboard repository.

Tanguy9862/scraper-new-drug-approvals