Denikozub/Twitter-geolocation

Twitter users geolocation based on tweet texts


Data

The processed dataset contains 620k tweets with their corresponding coordinates.
Processing includes geocoding US cities with Nominatim and checking that each location lies within the country.
Train - test - val split: 80% - 10% - 10%, batch size = 64.
The task is to predict coordinates (latitude, longitude) from tweet texts.
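The split and loader setup above can be sketched in PyTorch as follows. The random tensors stand in for the processed dataset (real inputs would be tokenized tweets and geocoded coordinates), so all names and sizes here are illustrative assumptions:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-ins for the processed data: token ids (max_length=32) and (lat, lon) targets.
n = 1000
texts = torch.randint(0, 30522, (n, 32))   # hypothetical BERT vocabulary ids
coords = torch.rand(n, 2)                  # hypothetical (lat, lon) pairs

dataset = TensorDataset(texts, coords)

# 80% / 10% / 10% train - test - val split.
n_train = int(0.8 * n)
n_test = int(0.1 * n)
n_val = n - n_train - n_test
train_ds, test_ds, val_ds = random_split(dataset, [n_train, n_test, n_val])

# Batch size 64, as used for training.
train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
```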

Loss function

The haversine distance is used to estimate the distance between predicted and true coordinates.
It models the Earth as a sphere of fixed radius, its simplest representation.
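A minimal PyTorch sketch of a haversine loss over batches of (lat, lon) pairs in degrees. The Earth radius constant and the batch layout are assumptions, not taken from the repository code:

```python
import torch

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; assumed value

def haversine_loss(pred, target):
    """Mean great-circle distance in km between predicted and true
    coordinates, each of shape [batch, 2] as (lat, lon) in degrees."""
    pred = torch.deg2rad(pred)
    target = torch.deg2rad(target)
    dlat = target[:, 0] - pred[:, 0]
    dlon = target[:, 1] - pred[:, 1]
    a = (torch.sin(dlat / 2) ** 2
         + torch.cos(pred[:, 0]) * torch.cos(target[:, 0]) * torch.sin(dlon / 2) ** 2)
    # clamp guards against tiny floating-point overshoots outside [0, 1]
    d = 2 * EARTH_RADIUS_KM * torch.asin(torch.sqrt(a.clamp(0.0, 1.0)))
    return d.mean()
```

Unlike MSE on raw coordinates, this loss is measured in kilometers and correctly handles longitude wrap-around at the antimeridian.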

Models

Baseline model

  • Tweets are tokenized with the BERT tokenizer (max_length=32, truncation=True)
  • Only the BERT <CLS> token embedding is used
  • It is fed through two linear layers, followed by a linear regression head
  • Each layer uses batch normalization
  • ReLU is used as the activation function
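The regression head described above can be sketched as below. The hidden sizes (256 and 64) are hypothetical choices; the only fixed dimensions are BERT-base's 768-dimensional <CLS> embedding and the 2-dimensional (lat, lon) output:

```python
import torch
import torch.nn as nn

class BaselineRegressor(nn.Module):
    """Two linear layers with batch norm + ReLU on the <CLS> embedding,
    followed by a linear regression layer predicting (lat, lon)."""
    def __init__(self, hidden=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Linear(256, 64),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 2),  # regression to (lat, lon)
        )

    def forward(self, cls_emb):
        return self.net(cls_emb)

# Usage with a batch of <CLS> embeddings from a BERT encoder:
model = BaselineRegressor()
coords = model(torch.randn(64, 768))
```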

Autoencoder model

  • Used for dimensionality reduction
  • Denoising architecture (with a scalable noise factor)
  • BERT weights are frozen while training the AE
  • MSE loss is used for autoencoder training
  • Both the encoder and the decoder consist of two layers with ReLU activations
  • Encoder states are saved during training and reused in the regression model
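The denoising autoencoder above can be sketched as follows. The bottleneck size, hidden width, and noise_factor value are assumptions for illustration; the two-layer ReLU encoder/decoder, the scalable noise factor, and the MSE objective follow the description:

```python
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    """Two-layer encoder and decoder with ReLU; Gaussian noise scaled by
    noise_factor is added to the input during training only."""
    def __init__(self, dim=768, bottleneck=64, noise_factor=0.1):
        super().__init__()
        self.noise_factor = noise_factor
        self.encoder = nn.Sequential(
            nn.Linear(dim, 256), nn.ReLU(),
            nn.Linear(256, bottleneck), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x):
        noisy = x + self.noise_factor * torch.randn_like(x) if self.training else x
        z = self.encoder(noisy)           # encoder state reused downstream
        return self.decoder(z), z

# Training step on (frozen) BERT <CLS> embeddings, reconstructing the clean input:
ae = DenoisingAE()
x = torch.randn(64, 768)                  # stand-in for CLS embeddings
recon, code = ae(x)
loss = nn.MSELoss()(recon, x)
```

At inference the saved encoder replaces the full AE, feeding the low-dimensional code into the regression model.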