
Using knowledge states of 100K+ students to predict whether a student will answer a question correctly.


Riiid-Team/AI-Learning



Welcome! We are the AI Learning Team 👾


Background

About Riiid Labs

“Riiid Labs, an AI solutions provider delivering creative disruption to the education market, empowers global education players to rethink traditional ways of learning leveraging AI. With a strong belief in equal opportunity in education, Riiid launched an AI tutor based on deep-learning algorithms in 2017 that attracted more than one million South Korean students. This year, the company released EdNet, the world’s largest open database for AI education containing more than 100 million student interactions.” Source

TOEIC (Test of English for International Communication)

The TOEIC Listening & Reading test is an objective test featuring 200 questions with a two-hour time limit: Listening (approximately 45 minutes, 100 questions) and Reading (75 minutes, 100 questions). Source

Riiid's TOEIC Platform

“Santa TOEIC is the AI-based web/mobile learning platform for the TOEIC. AI tutor provides a one-on-one curriculum, effectively increasing scores based on the essential questions and lectures for each user.” Source

Deliverables

Project Management

Data Dictionary

Original Dataset

train.csv

| Feature Name | Description |
| --- | --- |
| row_id (int64) | ID code for the row |
| timestamp (int64) | The time in milliseconds between this user interaction and the user's first event completion |
| user_id (int32) | ID code for the user |
| content_id (int16) | ID code for the user interaction |
| content_type_id (int8) | 0 if the event was a question prompted to the user, 1 if the event was watching a lecture |
| task_container_id (int16) | ID code for the batch of questions or lectures. -1 for lectures |
| user_answer (int8) | The user's answer to the question, if any. -1 for lectures |
| answered_correctly (int8) | Whether the user responded correctly, if any. -1 for lectures |
| prior_question_elapsed_time (float32) | The average time in milliseconds the user took to answer each question in the previous question bundle, ignoring any lectures in between. Null for a user's first question bundle or lecture |
| prior_question_had_explanation (bool) | Whether the user saw an explanation and the correct response(s) after answering the previous question bundle, ignoring any lectures in between. The value is shared across a question bundle and is null for a user's first question bundle or lecture. Typically the first several questions a user sees were part of an onboarding diagnostic test with no feedback |
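
As an illustration, a minimal sketch of loading train.csv with the dtypes listed above. The file path and the use of pandas' nullable boolean are assumptions; point the path at your local copy of the Kaggle data.

```python
import pandas as pd

# Dtypes as listed in the data dictionary above. "boolean" is pandas'
# nullable boolean, since the column contains nulls.
TRAIN_DTYPES = {
    "row_id": "int64",
    "timestamp": "int64",
    "user_id": "int32",
    "content_id": "int16",
    "content_type_id": "int8",
    "task_container_id": "int16",
    "user_answer": "int8",
    "answered_correctly": "int8",
    "prior_question_elapsed_time": "float32",
    "prior_question_had_explanation": "boolean",
}

train = pd.read_csv("train.csv", dtype=TRAIN_DTYPES)  # path is an assumption
```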

lectures.csv: metadata for the lectures watched by users as they progress in their education.

| Feature Name | Description |
| --- | --- |
| lecture_id | Foreign key for the train/test content_id column, when the content type is a lecture (1) |
| part | Top-level category code for the lecture |
| tag | One tag code for the lecture. The meaning of the tags will not be provided, but these codes are sufficient for clustering the lectures together |
| type_of | Brief description of the core purpose of the lecture |

questions.csv: metadata for the questions posed to users.

| Feature Name | Description |
| --- | --- |
| question_id | Foreign key for the train/test content_id column, when the content type is a question (0) |
| bundle_id | Code for which questions are served together |
| correct_answer | The answer to the question. It can be compared with the train user_answer column to check whether the user was right |
| part | The relevant section of the TOEIC test |
| tags | One or more detailed tag codes for the question. The meaning of the tags will not be provided, but these codes are sufficient for clustering the questions together |

Feature Engineering

| Feature Name | Description |
| --- | --- |
| question_had_explanation | Indicates whether a question had an explanation |
| user_acc_mean | The number of questions a user answered correctly divided by all questions they've answered |
| user_lectures_running_total | The running total of lectures a user has watched at a given timestamp |
| avg_user_q_time | The average amount of time a user spends on a question |
| mean_bundle_accuracy | The average accuracy for a bundle of questions |
| mean_content_accuracy | The average accuracy for a specific content_id |
| mean_tagcount_accuracy | The average accuracy for questions that share the same number of tags |
| mean_tags_accuracy | The average accuracy for questions that share the same tags |
| mean_task_accuracy | The average accuracy for a specific task_container_id |
| median_content_accuracy | The median accuracy for a specific content type |
| median_task_accuracy | The median accuracy for a task |
| q_time | The amount of time a user spent on the previous question |
| question_content_asked | The type of question asked |
| question_priortime_asked | The timestamp of the previous question prompted to the user |
| question_task_asked | The type of question asked in a bundle |
| question_timestamp_asked | The timestamp at which a question was prompted to the user |
| std_content_accuracy | The standard deviation of content accuracy |
| std_task_accuracy | The standard deviation of task accuracy |
| std_timestamp_accuracy | The standard deviation of accuracy for a given timestamp |
| skew_content_accuracy | The skewness of accuracy for a specific content type |
| skew_task_accuracy | The skewness of accuracy for a specific task |
| skew_timestamp_accuracy | The skewness of accuracy for a given timestamp |
| avg_user_q_time [scaled] | Scaled version of avg_user_q_time using MinMaxScaler; returned from the prep_riiid function |
| user_lectures_running_total [scaled] | Scaled version of user_lectures_running_total using MinMaxScaler; returned from the prep_riiid function |
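
For illustration, a hypothetical pandas sketch of how two of these features could be computed. It assumes `train` is the prepared 100K-user sample described below, with the columns from the train.csv dictionary.

```python
# Sort so running totals accumulate in interaction order.
train = train.sort_values(["user_id", "timestamp"])

# user_acc_mean: questions answered correctly divided by all questions answered.
questions = train[train["content_type_id"] == 0]
user_acc_mean = questions.groupby("user_id")["answered_correctly"].mean()

# user_lectures_running_total: cumulative count of lectures watched so far.
train["user_lectures_running_total"] = (
    (train["content_type_id"] == 1).groupby(train["user_id"]).cumsum()
)
```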

Initial Thoughts

  • Are questions with explanations answered correctly more often?
  • Do users with higher accuracy take longer to answer questions?
  • Are there questions that are more difficult to answer than others?
  • How long does the average user engage with the platform?
  • Does the number of lectures a user watches impact their accuracy?

Project Steps

1. Acquire

The data was acquired from Kaggle and is stored in three separate files: lectures.csv, questions.csv, and train.csv. The primary dataset is train.csv, which contains 100+ million user interactions from 390,000+ users. We used a random sample of 100K users for our analysis. The original 10 features describe the type of question, the time it took to answer, and whether the user's response was correct.
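
A sketch of that sampling step, assuming the full train.csv has been loaded into `train`; the seed is an assumption.

```python
# Draw 100K users at random and keep only their interactions.
user_ids = train["user_id"].drop_duplicates()
sampled_ids = user_ids.sample(n=100_000, random_state=42)
sample = train[train["user_id"].isin(sampled_ids)]
```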

2. Prepare

Missing Values

  • Filled missing boolean values in question_had_explanation with False. Missing values indicated that the question did not have an explanation or that the user viewed a lecture.
  • Filled missing values in prior_question_elapsed_time with 0. Missing values indicated that the user viewed a lecture before answering the first question in a bundle.
  • Dropped columns: lecture_id, tag, lecture_part, type_of, question_id, bundle_id, correct_answer, question_part, and tags.
  • Dropped rows that were lectures (answered_correctly = -1), as shown in the sketch below.
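
The bullets above might look like the following pandas sketch; `df` is assumed to be the sampled interactions merged with the lecture/question metadata, with column names following the data dictionaries.

```python
# Fill missing values as described above.
df["question_had_explanation"] = df["question_had_explanation"].fillna(False)
df["prior_question_elapsed_time"] = df["prior_question_elapsed_time"].fillna(0)

# Drop the metadata columns that are not used for modeling.
df = df.drop(columns=["lecture_id", "tag", "lecture_part", "type_of",
                      "question_id", "bundle_id", "correct_answer",
                      "question_part", "tags"])

# Drop rows that are lectures rather than questions.
df = df[df["answered_correctly"] != -1]
```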

Feature Engineering

  • Created new features using descriptive statistics for content, task, timestamp, and whether a question had an explanation.
  • Scaled timestamp from milliseconds to weeks to look at trends over time (see the sketch after this list).
  • Refer to the feature engineering data dictionary for more information.
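
For example, the millisecond-to-week conversion could look like this; the column name timestamp_weeks is an assumption.

```python
# Week index since the user's first event: milliseconds -> weeks.
MS_PER_WEEK = 1000 * 60 * 60 * 24 * 7
df["timestamp_weeks"] = df["timestamp"] // MS_PER_WEEK
```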

Preprocessing

  • Scaled mean_timestamp_accuracy, mean_priortime_accuracy, user_lectures_running_total, and avg_user_q_time using MinMaxScaler
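
A sketch of that scaling step. The validate and test splits are assumed to exist; the scaler is fit on train only so no information leaks across splits.

```python
from sklearn.preprocessing import MinMaxScaler

cols = ["mean_timestamp_accuracy", "mean_priortime_accuracy",
        "user_lectures_running_total", "avg_user_q_time"]
scaler = MinMaxScaler()
train[cols] = scaler.fit_transform(train[cols])      # fit on train only
validate[cols] = scaler.transform(validate[cols])
test[cols] = scaler.transform(test[cols])
```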

3. Explore

  • Used scatterplots, barplots, and histograms to visualize interactions between features and the target variable.
  • Performed hypothesis tests to find statistically significant relationships between features.

Hypotheses

Hypothesis – Answered correctly vs. Question had an explanation

Null hypothesis: Whether a student gets a question right is independent of whether a question had an explanation.
Alternative hypothesis: Whether a student gets a question right is dependent on whether a question had an explanation.
Test: Chi-Squared Test
Results: With a p-value less than alpha, we reject the null hypothesis.
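
A sketch of the chi-squared test with scipy; the same recipe applies to the next hypothesis with part in place of question_had_explanation. An alpha of 0.05 is an assumption.

```python
import pandas as pd
from scipy import stats

alpha = 0.05  # assumed significance level
observed = pd.crosstab(train["answered_correctly"],
                       train["question_had_explanation"])
chi2, p, dof, expected = stats.chi2_contingency(observed)
if p < alpha:
    print("Reject the null hypothesis: the variables are not independent.")
```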

Hypothesis – Answered correctly vs. Part

Null hypothesis: Whether a user answers a question correctly is independent of the type of question being asked.
Alternative hypothesis: Whether a user answers a question correctly is dependent upon the type of question being asked.
Test: Chi-Squared Test
Results: With a p-value less than alpha, we reject the null hypothesis.

Hypothesis – Number of lectures a user has watched vs. Average task accuracy

Null hypothesis: There is no linear relationship between the number of lectures a user has watched and their average task accuracy.
Alternative hypothesis: There is a linear relationship between the number of lectures a user has watched and their average task accuracy.
Test: Pearson Correlation Test
Results: With a p-value less than alpha, we reject the null hypothesis.

  • On average, as the number of lectures a user has seen increases, so does their task accuracy.
  • Viewing lectures may have a weak positive impact on user accuracy.
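
A sketch of the Pearson test above (the next hypothesis follows the same pattern). It assumes a one-row-per-user frame `users` with the two columns below and an alpha of 0.05.

```python
from scipy import stats

alpha = 0.05  # assumed significance level
r, p = stats.pearsonr(users["user_lectures_running_total"],
                      users["mean_task_accuracy"])
if p < alpha:
    print(f"Reject the null hypothesis (r = {r:.3f}).")
```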

Hypothesis – Average user question time vs. Average user accuracy

Null hypothesis: There is no linear relationship between the average time a user takes to answer a question and their average accuracy.
Alternative hypothesis: There is a linear relationship between the average time a user takes to answer a question and their average accuracy.
Test: Pearson Correlation Test
Results: With a p-value less than alpha, we reject the null hypothesis.

  • Users who take longer to answer questions tend to have lower overall accuracy, and vice versa.

Hypothesis – Average user accuracy vs. Average content accuracy

Null hypothesis: Users with above-average accuracy spend the same amount of time, or more, on questions with lower-than-average content accuracy as users with average or below-average accuracy.
Alternative hypothesis: Users with above-average accuracy spend less time on questions with lower-than-average content accuracy than users with average or below-average accuracy.
Test: Two-Sample One-Tailed T-Test
Results: With a p-value less than alpha, and t less than 0, we reject the null hypothesis.

  • Users with above-average accuracy answer questions (difficult and otherwise) more quickly than other users, which suggests they may be better prepared for the content.
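
A sketch of the one-tailed two-sample t-test. The group construction is an assumption: `low_acc_content` is a hypothetical frame of interactions on below-average-accuracy content, split at the mean user accuracy.

```python
from scipy import stats

alpha = 0.05  # assumed significance level
cutoff = low_acc_content["user_acc_mean"].mean()
above = low_acc_content.loc[low_acc_content["user_acc_mean"] > cutoff, "q_time"]
rest = low_acc_content.loc[low_acc_content["user_acc_mean"] <= cutoff, "q_time"]
t, p = stats.ttest_ind(above, rest, equal_var=False)
# One-tailed test: reject when t < 0 and half the two-tailed p beats alpha.
if (p / 2) < alpha and t < 0:
    print("Reject the null hypothesis: the above-average group is faster.")
```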

4. Model

First, we created a baseline against which to compare model performance. The baseline predicts the most common outcome in the training dataset, answered_correctly = 1, giving a baseline accuracy of 50%: always guessing the majority class is right half the time. The models evaluated on the train, validate, and test sets were (see the sketch after this list):

  • Logistic Regression
  • Random Forest
  • LGBM
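
A sketch of the modeling step; X_train, y_train, X_validate, and y_validate are assumed splits of the engineered features and the answered_correctly target, and the hyperparameters are defaults rather than the tuned values.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier

# Baseline: accuracy of always predicting the most common outcome.
baseline_acc = (y_train == y_train.mode()[0]).mean()
print(f"Baseline accuracy: {baseline_acc:.2f}")

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "LGBM": LGBMClassifier(random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_validate)[:, 1]
    print(f"{name}: validate AUC = {roc_auc_score(y_validate, proba):.3f}")
```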

Final Model

Our LGBM model performed the best, with an AUC score of .744. AUC is the area under the ROC curve, which summarizes the trade-off between true positives and false positives across classification thresholds. A true positive means our model predicted that a student would answer a question correctly and their response was correct; a false positive means our model predicted a correct response when the answer was actually incorrect. AUC ranges between 0 and 1, and the higher the score, the better the model separates correct from incorrect responses.

"Gradient boosting algorithm sequentially combines weak learners (decision tree) in way that each new tree fits to the residuals from the previous step so that the model improves. The final model aggregates the results from each step and a strong learner is achieved." Source

5. Conclusion

Key Findings

  • Performance decreases on incomplete sentences and narration.
  • Performance increases with explanations and visuals.
  • Lectures don't significantly impact performance.

What was the best model?

  • LightGBM: AUC score of .744
  • The LightGBM model surpassed the baseline by 0.24, a 47% improvement (the difference between the scores divided by the baseline).

Recommendations

  • Offer more study material for difficult subjects, and develop students' ability to understand English from context clues when photos aren't available.
  • Revise lectures to have a stronger impact over the course of students' studies.
  • Use the model to tailor content to each student's needs.

Expectation

  • Riiid's program will better prepare students for the TOEIC

Future Investigations

What are your next steps?

  • Use this predictive model on Riiid's other educational programs.
  • Explore more features and different classification models, such as XGBoost.
  • Improve model to predict new student performance more effectively.

How to Reproduce

All files are reproducible and available for download and use.

  • Read this README.md
  • Download the aquire.py, prepare.py, and Final-Report.ipynb files
  • Run Final-Report.ipynb

Contact Us

Daniella Bojado

Samuel Davila

Yongliang Shi

Christopher Logan Ortiz