
NLP Models and Data Collection Discussion #8

Open
TyJK opened this issue May 26, 2017 · 6 comments

@TyJK
Owner

TyJK commented May 26, 2017

A Discussion on the Best NLP and Data Collection Approaches

This is a place where we hope to generate discussion, with both experts and non-experts, on how we plan to move forward in the immediate future toward a classification model for topic modeling and sentiment analysis. We've included data collection here, since none of this can proceed until we have some labelled data.

The scope of this discussion can include:

  • How we are labelling our data
  • How we are collecting/scraping this data
  • Our plans for topic modeling
  • Our plans for sentiment analysis
  • How we will be classifying the resulting models

A Brief Overview of Our Current Plan

  • Labelling: We're going to label our data by selecting blogs and websites (or sections of websites) that have a consistent sentiment and a coherent topic in line with our chosen topics (Full List). These will be collected.
  • Scraping: We're thinking of using Portia, Beautiful Soup or possibly Selenium. This aspect is still being discussed and we should have a final plan within the next few days.
  • Topic Modeling: Our current plan is to use the Doc2Vec algorithm (specifically the gensim Python library). Each topic would be used as a tag, in addition to a unique tag for each document (blog post/article). However, we're also looking into using labelled LDA for this stage.
  • Sentiment Analysis: This stage is fairly firmly decided: we'll use Doc2Vec, as it's the state of the art for this sort of task. However, we have not decided between general sentiment detection (one model across all topics), topic-specific sentiment analysis (a separate sentiment model for each topic), and a hybrid model. We will likely test all of the above and see what works best for our purposes.
  • Classification: Selecting the classification algorithm should be a fairly trivial matter. Based on our research, we suspect an SVM will perform best, or else Naive Bayes, but we'll try a broad range.

We welcome questions and suggestions on these topics, so please feel free to drop a comment.
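To make the scraping step above concrete, here is a minimal stand-in for the extraction stage using only Python's standard library. The thread mentions Portia, Beautiful Soup, and Selenium as candidates; this sketch only illustrates the idea of pulling post text out of page markup, and the sample HTML is invented.

```python
from html.parser import HTMLParser

class PostTextExtractor(HTMLParser):
    """Collects the text of <p> elements from a page.

    A toy stand-in for the extraction step; the real scraper
    (Portia, Beautiful Soup, or Selenium) would handle fetching,
    pagination, and site-specific markup.
    """

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph and data.strip():
            self.paragraphs.append(data.strip())

# Invented sample page, just to show the mechanics.
page = "<html><body><h1>Title</h1><p>First post.</p><p>Second post.</p></body></html>"
extractor = PostTextExtractor()
extractor.feed(page)
print(extractor.paragraphs)  # ['First post.', 'Second post.']
```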

@PSanni

PSanni commented May 29, 2017

Classification: What classification approach are you planning to use: document-based or sentence-based? Because, as you may know, if you take a sentence-based approach, you need a set of labelled sentences. :)

@TyJK
Owner Author

TyJK commented May 29, 2017

We would be looking at document-based classification. Each website would be assigned labels for sentiment and topic, and each post, comment, or entry (however it's organized) would be given a unique document ID, most likely assigned through simple enumeration. So when it comes to model construction, a given document has a unique ID but is also part of larger groups based on the other tags.
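The ID-plus-group-tags scheme described here can be sketched in plain Python. The site names, topics, and sentiment labels below are invented for illustration; the `{"tags": ..., "words": ...}` shape mirrors what gensim's `TaggedDocument` expects, without requiring gensim to run.

```python
# Each document gets a unique ID (simple enumeration) plus the
# sentiment and topic labels inherited from its source website.
# Site names and labels are made up for illustration.
labelled_sites = {
    "exampleblog.com": {"topic": "climate", "sentiment": "positive"},
    "othersite.org":   {"topic": "climate", "sentiment": "negative"},
}

raw_posts = [
    ("exampleblog.com", "Renewables had a record year."),
    ("othersite.org",   "The new policy is a disaster."),
    ("exampleblog.com", "Solar capacity keeps growing."),
]

tagged_documents = []
for doc_id, (site, text) in enumerate(raw_posts):
    labels = labelled_sites[site]
    tagged_documents.append({
        # Unique per-document tag plus the shared group tags.
        "tags": [f"DOC_{doc_id}", labels["topic"], labels["sentiment"]],
        "words": text.lower().split(),
    })

print(tagged_documents[0]["tags"])  # ['DOC_0', 'climate', 'positive']
```

Because every record carries `DOC_<n>`, documents stay distinct even when they share a topic or sentiment tag, which is exactly the property Doc2Vec needs.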

@PikioopSo

@TyJK, the document ID could be assigned via date data, so that you can do analysis across spans of time, but I wasn't quite sure what type of enumeration system you were going with.

@TyJK
Owner Author

TyJK commented May 31, 2017

@PiReel I was going to use a simple count. Doc2Vec only requires that documents be unique in order to keep them separate (all documents sharing the same tag are treated as one document). It probably doesn't matter for the number of documents we'll get, but enumerating linearly saves memory. Luckily, in my experiments so far the data naturally organizes by date, since that's usually how it's arranged in a site's archive. I'll have a few examples of test runs up later today.

@PSanni

PSanni commented Jun 1, 2017

Great. If we are using documents, then we need to select websites and topic content carefully, because there is a high chance of diverse information within the same content or website, and that can easily throw off the model. We could include a subjectivity classification step, so that we can use subjectivity to remove unhelpful sentences/information.

I'm not sure, but I think Word2Vec might be able to do this? I haven't tried it. Is anyone aware of that?
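The subjectivity-filtering idea suggested here can be illustrated with a toy lexicon-based score: rate each sentence by the fraction of its words found in a small subjective-word list, and drop sentences below a threshold. The lexicon and threshold below are invented; a real pipeline would use a trained subjectivity classifier or a resource like TextBlob.

```python
# Invented mini-lexicon of opinion-bearing words, for illustration only.
SUBJECTIVE_WORDS = {"great", "terrible", "love", "hate", "awful", "wonderful", "disaster"}

def subjectivity_score(sentence):
    """Fraction of words that appear in the subjective-word lexicon."""
    words = [w.strip(".,!?").lower() for w in sentence.split()]
    if not words:
        return 0.0
    return sum(w in SUBJECTIVE_WORDS for w in words) / len(words)

sentences = [
    "The report was published on Tuesday.",  # objective -> score 0.0
    "This policy is a terrible disaster!",   # subjective -> score 2/6
]

# Keep only sentences above an (arbitrary) subjectivity threshold.
kept = [s for s in sentences if subjectivity_score(s) > 0.2]
print(kept)  # ['This policy is a terrible disaster!']
```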

@TyJK
Owner Author

TyJK commented Jun 1, 2017

Word2Vec probably could do this, but I think we might need to use it as a secondary filter rather than a primary one. That is, I think we could run it on each website/video transcript after it had been scraped and cleaned, to make sure nothing outside the category got through, but I don't know how we could use it to help in the selection process itself. Hopefully people will be careful; we do mention it numerous times, but if you have any suggestions for making it clearer to people, I'm all ears.
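The secondary-filter idea can be sketched with cosine similarity: average the word vectors of a scraped document and compare the result against a topic direction, flagging documents that fall below a threshold. The tiny two-dimensional "vectors" below are hand-made stand-ins; in practice they would come from a trained Word2Vec model, and the threshold would be tuned on labelled data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hand-made 2-d "word vectors" standing in for a trained Word2Vec model;
# dimension 0 loosely encodes "climate-ness", dimension 1 "sports-ness".
word_vectors = {
    "solar":  [0.9, 0.1],
    "carbon": [0.8, 0.2],
    "goal":   [0.1, 0.9],
    "match":  [0.2, 0.8],
}
topic_vector = [1.0, 0.0]  # the hypothetical "climate" topic direction

def document_centroid(words):
    """Average the vectors of the in-vocabulary words of a document."""
    vecs = [word_vectors[w] for w in words if w in word_vectors]
    if not vecs:
        return [0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def on_topic(words, threshold=0.7):
    """Flag whether a scraped document stays close to the topic direction."""
    return cosine(document_centroid(words), topic_vector) >= threshold

print(on_topic(["solar", "carbon"]))  # True  -> keep
print(on_topic(["goal", "match"]))    # False -> flag for review
```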
