the-pattern-platform

This is an NLP pipeline based on RedisGears. It is an evolution of the Cord19Project.

The purpose of this part is to provide an NLP pipeline that turns text into a knowledge graph ("it's about things, not strings") by matching text (terms) to the UMLS Metathesaurus (concepts). As input, the pipeline uses the CORD19 Kaggle competition dataset of medical articles.

Super Quick start using Docker

git checkout main
cd conf && sh launch_cluster_docker.sh

It will create a Docker network, then build and run RedisGraph and rgcluster in two separate containers. In another terminal run:

pip install gears-cli
sh cluster_pipeline.sh

It will register all steps, submit 25 articles to the cluster for processing, and run the matcher. There are a few sleep statements to allow the cluster to recover.

Check that RedisGraph instances were populated:

redis-cli -p 9001 -h 127.0.0.1 GRAPH.QUERY cord19medical "MATCH (n:entity) RETURN count(n) as entity_count"
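The same check can be scripted. The following is a minimal sketch, not part of the repository: the helper name count_entities is hypothetical, and it assumes the raw GRAPH.QUERY reply is a list of [header, rows, statistics], which is how RedisGraph answers when the command is sent through a generic client call such as redis-py's execute_command.

```python
def count_entities(client, graph="cord19medical"):
    """Return the number of :entity nodes in the given graph.

    `client` is any object with an execute_command(*args) method,
    for example a redis-py connection to the RedisGraph instance.
    """
    query = "MATCH (n:entity) RETURN count(n) as entity_count"
    reply = client.execute_command("GRAPH.QUERY", graph, query)
    # Assumed reply layout: [header, rows, statistics].
    # The count is the first column of the first row.
    rows = reply[1]
    return int(rows[0][0])


# Usage, assuming the redis-py package is installed:
#   import redis
#   print(count_entities(redis.Redis(host="127.0.0.1", port=9001)))
```

A count of zero right after submitting articles usually just means the pipeline has not finished yet, so it may be worth repeating the check after a short wait.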

Architecture

Uses RedisGears with KeyReader and StreamReader.

NLP Steps:

Identify the language with LangDetect (it should be English)
Split paragraphs into sentences using spaCy: spacy_sentences_streams.py
- This could be done differently, but the point was to use a large NLP library for processing
Spellcheck sentences using symspell_sentences_streamed.py
Match terms from sentences to UMLS concepts using a pre-built Aho-Corasick automaton: sentences_matcher_streamed.py
- To build your own, use aho_corasick_create_direct.py
  - You need to download and unpack umls-2019AB-metathesaurus.zip first
Populate RedisGraph with edges_to_graph_streamed.py from nodes (concepts) and edges (relationships between concepts; the assumption is that two concepts appearing in the same sentence are related). RedisGraph is a separate instance listening on port 9001.
Run set_debug_key.py if you want to see logging for each step
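The matching and edge-building steps above can be illustrated with a small standalone sketch. This is not the code from sentences_matcher_streamed.py or edges_to_graph_streamed.py: it is a pure-Python Aho-Corasick automaton written for illustration, the function names are hypothetical, and the concept identifiers are placeholders rather than real UMLS CUIs. It matches raw substrings without checking word boundaries.

```python
from collections import deque
from itertools import combinations


def build_automaton(terms):
    """Build an Aho-Corasick automaton from a {term: concept_id} mapping."""
    goto, fail, out = [{}], [0], [[]]
    # Phase 1: insert every term into a trie.
    for term, concept_id in terms.items():
        state = 0
        for ch in term:
            nxt = goto[state].get(ch)
            if nxt is None:
                goto.append({})
                fail.append(0)
                out.append([])
                nxt = len(goto) - 1
                goto[state][ch] = nxt
            state = nxt
        out[state].append((term, concept_id))
    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_concepts(sentence, automaton):
    """Return (term, concept_id) pairs for every term found in the sentence."""
    goto, fail, out = automaton
    state, found = 0, []
    for ch in sentence:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.extend(out[state])
    return found


def sentence_edges(concept_ids):
    """Relate every pair of distinct concepts that share a sentence."""
    return list(combinations(sorted(set(concept_ids)), 2))


# Placeholder identifiers, not real UMLS CUIs.
terms = {"fever": "CUI_FEVER", "cough": "CUI_COUGH", "dry cough": "CUI_DRY_COUGH"}
automaton = build_automaton(terms)
matches = find_concepts("patient has fever and dry cough", automaton)
edges = sentence_edges(concept_id for _, concept_id in matches)
```

Here matches contains all three terms, because "cough" is reported as a suffix of "dry cough", and edges holds the three pairs between them. Each pair would become one relationship in the graph under the same-sentence assumption.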

Quickstart

To run locally:

Compile RedisGears and Redis, then use conf/launch_cluster.sh to launch the Gears cluster, amending paths as needed
Start RedisGraph on port 9001 (or amend the ports in conf/database.ini and in edges_to_graph_streamed.py)
Install gears-cli (pip install -r requirements.txt) and run sh cluster_pipeline_streams.sh to register the functions
Populate the cluster with a sample of articles: python RedisIntakeRedisClusterSample.py (pass --nsamples n to increase the size of the sample)
- If the logs are not showing much activity, give the cluster a kick using lang_detect_gears_paragraphs_force.py. The actual command will look like: gears-cli run --host 127.0.0.1 --port 30001 lang_detect_gears_paragraphs_force.py --requirements requirements_gears_lang.txt
Validate that RedisGraph is populated with GRAPH.QUERY cord19medical "MATCH (n:entity) RETURN count(n) as entity_count"

Alternatively, use Docker to launch RedisGears and RedisGraph, and pass the commands from launch_cluster.sh via redis-cli -c.

If you want to create your own NLP processing step, lang_detect_gears_paragraphs_force.py is the simplest example of a KeyReader in batch mode. Start with batch mode and then create a registration for events. StreamReader is probably closer to production, but it is painful to debug.
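As a rough illustration of what a batch-mode KeyReader step can look like, here is a minimal sketch. It is not the content of lang_detect_gears_paragraphs_force.py: the word_count step and the paragraphs:* key pattern are hypothetical, and it assumes the matching keys hold plain strings. The step itself is an ordinary Python function, so it can be tried on a hand-written record before it is submitted with gears-cli run.

```python
def word_count(record):
    """Hypothetical step: count whitespace-separated tokens in one key's value.

    KeyReader passes each key as a dict with (at least) 'key' and 'value'.
    """
    text = record["value"] or ""
    return {"key": record["key"], "words": len(text.split())}


# GearsBuilder is injected by RedisGears when the script is submitted with
# `gears-cli run`; it does not exist in a normal Python interpreter.
try:
    GearsBuilder
except NameError:
    GearsBuilder = None

if GearsBuilder is not None:
    # Batch mode: process the keys that match the pattern right now.
    # 'paragraphs:*' is an assumed pattern, adjust it to your own keys.
    GearsBuilder("KeyReader").map(word_count).run("paragraphs:*")
    # Once the batch version works, the same pipeline can be turned into an
    # event-driven registration by ending it with .register(prefix=...)
    # instead of .run(...).
```

Keeping the step free of Gears-specific calls is a design choice that makes it possible to debug the logic locally first, which is where most mistakes tend to be.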

TODO

It's not ideal and most parts are hard-coded, but I hope it's useful enough for NLP data scientists. The overall architecture is still the same as in the original project.

Update the-pattern overall repository
Publish API server repository
Publish UI demo
Publish demo BERT based QA
Publish demo BERT based Summary
Create a docker deployment script for gears and RedisGraph
Add sentence splitter with https://github.com/mediacloud/sentence-splitter instead of spacy
Add redis cluster based debug flag (if execute('GET') then enable logs)
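The debug flag item could be sketched roughly as follows. This is only an illustration of the idea, not the logic of set_debug_key.py: the helper name make_debug_logger and the key name debug are hypothetical. The execute and log arguments stand for the functions of the same names that RedisGears makes available inside a script, and they are passed in explicitly so the helper can be exercised outside the cluster.

```python
def make_debug_logger(execute, log, key="debug"):
    """Return a function that logs only while the flag key is set.

    execute: callable that runs a Redis command, e.g. execute('GET', key)
    log:     callable that writes one message to the log
    key:     hypothetical name of the flag key
    """
    def debug(message):
        # GET returns None (falsy) when the key does not exist.
        if execute("GET", key):
            log(message)
            return True
        return False

    return debug
```

Setting the key turns logging on for every step that uses the helper, and deleting it turns logging off again without re-registering any functions.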

Update 01.01.2021

New way to run most of the pipeline: gears-cli run --host 127.0.0.1 --port 30001 gears_pipeline_sentence.py --requirements requirements_gears_pipeline.txt

applied-knowledge-systems/the-pattern-platform

Folders and files

Latest commit

History

Repository files navigation

the-pattern-platform

Super Quick start using Docker

Architecture

Quickstart

TODO

Update 01.01.2021

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages