Cecilia Schramm - Semi-supervised learning for data creation in the A-PROOF project

Overview

This repository contains the code used in the thesis "Using Semi-supervised Learning to Automatically Annotate Dutch Medical Notes for Patients’ Functioning Levels" by Cecilia Schramm, supervised by Dr. Piek Vossen of the VU Amsterdam, in partial fulfilment of the requirements for the degree of an MA in Linguistics.

This is part of the A-PROOF project and used data provided by the AUMC on their secure servers. The A-PROOF repository can be found at https://github.com/cltl/a-proof-zonmw and files that remained unchanged from their original files are not included in this project.

This thesis is based on and uses the A-PROOF ICF-domains Classification system for its experiments, created by Jenia Kim.

This thesis also uses code from the KeywordMatcher repository, though unchanged files from there or files that are needed in combination with the data for this project are also not included in this repository.

Project structure

thesis-project
└───a-proof-zonmw
│       │   predict_copy_binary.py
│       │   predict_copy.py
│       │   train_model_copy.py
│       │   config.json
└───KeywordMatcher
│       │   all_sents_inspection.ipynb
│       │   cs_eval_new3.ipynb
│       │   cs_inspection.ipynb
│       │   index_inspection.ipynb
│       │   matched_notes.ipynb
│       │   convert copy.json
│       │   get_relevant_sents_copy.py
│       │   get_relevant_sents.py
└───Baseline
│       │   bl_eval_cm.png
│       │   bl_eval_cm2.png
│       │   bl_test_cm.png
│       │   bl_test_cm2.png
│       │   eval_eval.ipynb
│       │   eval_test.ipynb
└───hq_data1
│       │   eval_ell_sh.ipynb
│       │   eval.ipynb
│       │   hq1_cm_test.png
│       │   hq1_cm.png
│       │   hq1_cm2_test.png
└───hq_data2 - lqhq_data3
│       │   eval.ipynb
│       │   [data name]_cm.png
│   .gitignore
│   LICENSE
│   README.md
│   requirements.tx
│   eval_best_model.ipynb
│   jenia_test.ipynb
│   test_separation_ell_sh.ipynb
│   test_separation_jenia.ipynb
│   tfidf_scores_final.ipynb

a-proof-zonmw predict_copy_binary.py: Using Cecilia Kuan's binary classifier for predictions

predict_copy.py: Using Jenia Kim's multi-label classification system for predictions, but adding confidence scores to the outcome

train_model_copy.py: Training Jenia Kim's model, either from the start or checkpoints

config.json: Configurations for my experiments

KeywordMatcher all_sents_inspection.ipynb: Inspecting the matched notes returned from KeywordMatcher and dividing their sentences into equal amounts per ICF category

cs_eval_new3.ipynb: Dividing the final data by confidence score into high or low quality data

cs_inspection.ipynb: Inspecting the confidence scores given to the data

index_inspection.ipynb: Inspecting the matched notes from KeywordMatcher by year and category

matched_notes.ipynb: Inspecting the matched notes from KeywordMatcher by amounts

convert copy.json: Configurations for my experiments

get_relevant_sents_copy.py: Get sentences matched with a keyword and category, no duplicates

get_relevant_sents.py: Get sentences matched with a keyword and category

Baseline eval_eval.ipynb: Evaluating the baseline on the development set

eval_test.ipynb: Evaluating the baseline on the test set

hq_data1 eval_ell_sh.ipynb: Evaluating ModelHQ1 on the test set

hq_data1 - eval.ipynb: Evaluating ModelHQ1 on the development set

hq_data2 - lqhq_data3 hq_data2 - lqhq_data3: Each of these folders includes one eval.ipynbfile, that evaluates the model on the development set, and one .png file that shows its confusion matrix as a picture

General eval_best_model.ipynb: Inspecting all models' performances

test_separation_ell_sh.ipynb: Removing all none's from the original test set, after running them through Cecilia Kuan's binary classifier

test_separation_jenia.ipynb: Removing all none's from the original development set, after running them through Cecilia Kuan's binary classifier

tfidf_scores_final.ipynb: Calculating 20 highest ranked TF-IDF words per category

Data

The path to the data used for this project on the secure AUMC servers is mnt/data/Users/A-PROOF/Cecilia_S/ All data used or necessary for this project can be found in this location and its subsequent folders within.

Requirements

All code must be run on the secure servers on the AUMC, where the necessary data for this thesis is stored and cannot be moved from. All requirements are stored there.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cecilia Schramm - Semi-supervised learning for data creation in the A-PROOF project

Overview

Project structure

Data

Requirements

About

Releases

Packages

Contributors 2

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Baseline		Baseline
KeywordMatcher		KeywordMatcher
a-proof-zonmw		a-proof-zonmw
hq_data1		hq_data1
hq_data2		hq_data2
hq_data3		hq_data3
lq_data1		lq_data1
lq_data2		lq_data2
lq_data3		lq_data3
lqhq_data3		lqhq_data3
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
all_results.pages		all_results.pages
all_results.png		all_results.png
eval_best_model.ipynb		eval_best_model.ipynb
requirements.txt		requirements.txt
test_separation_ell_sh.ipynb		test_separation_ell_sh.ipynb
test_separation_jenia.ipynb		test_separation_jenia.ipynb
tfidf_scores_final.ipynb		tfidf_scores_final.ipynb

License

cltl-students/Cecilia_Schramm_ICF_semi_supervised_learning

Folders and files

Latest commit

History

Repository files navigation

Cecilia Schramm - Semi-supervised learning for data creation in the A-PROOF project

Overview

Project structure

Data

Requirements

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages