A lexical complexity prediction (LCP) baseline for the Multilingual Lexical Simplification Pipeline (MLSP) 2024 Shared Task, modelled as a linear regression on log-frequency. For each language, the baseline is trained on the trial set, with the log-frequency of the target as the predictor (the minimum over its tokens if the target consists of multiple tokens). We use frequencies provided by the wordfreq package when possible. Since wordfreq uses an incompatible tokenization for Japanese and does not provide any data for Sinhala, we use TUBELEX-JA for Japanese and the word frequency list for Sinhala instead.
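The idea can be illustrated with a minimal, self-contained sketch. It is not the repository's implementation: the frequency table `TOY_FREQ`, the floor `MIN_FREQ` for unseen tokens, the base-10 logarithm, the whitespace tokenization, and the toy trial examples are all assumptions made for illustration. The actual baseline reads frequencies from wordfreq, TUBELEX-JA, or the Sinhala word frequency list.

```python
import math

# Hypothetical toy frequency table (relative frequencies), for illustration only.
TOY_FREQ = {"the": 5e-2, "cat": 1e-4, "feline": 1e-6, "hot": 2e-4, "dog": 3e-4}
MIN_FREQ = 1e-9  # assumed floor for unseen tokens, so that the logarithm is defined


def log_frequency(target: str) -> float:
    """Minimum log10 frequency over the whitespace-separated tokens of the target."""
    tokens = target.lower().split()
    return min(math.log10(TOY_FREQ.get(tok, MIN_FREQ)) for tok in tokens)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (single predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy "trial set" of (target, complexity) pairs; the values are invented.
trial = [("the", 0.0), ("cat", 0.2), ("feline", 0.6)]
slope, intercept = fit_line(
    [log_frequency(target) for target, _ in trial],
    [complexity for _, complexity in trial],
)


def predict(target: str) -> float:
    """Predicted complexity of a target from its (minimum) log-frequency."""
    return slope * log_frequency(target) + intercept
```

With these toy values the fitted slope is negative, so rarer targets receive higher predicted complexity, and a multi-token target such as "hot dog" is scored by its least frequent token.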
Note that the trained models and output of the baseline are already included in the repository. You can reproduce them by following the steps below.
- Install the Git submodules for MLSP_Data, Word-Frequency-List-for-Sinhala and tubelex:
    git submodule init && git submodule update
- Install the requirements:
    python -m pip install -r requirements.txt
- Run the baseline (both training and prediction):
    bash experiments.sh