TyJK edited this page Mar 16, 2019 · 10 revisions

EchoBurst Wiki

This wiki will be the primary document discussing the direction of the project.

Table of Contents

The Road So Far

EchoBurst was my first foray into NLP, and machine learning more generally, and it has sat dormant for almost two years. Part of this has been a lack of time and skill on my part, and part of it has been a lack of belief in the project. Since 2017, it has become my position not only that many positions espoused by people online are factually or morally indefensible, but that a tool such as EchoBurst as it originally existed would likely do more to spread such positions than to impede them. A balanced, multi-sided perspective and false equivalence are not the same thing, and I felt the project was susceptible to confusing the two.

It's taken some time, restructuring, and reconsideration, but I'm confident that a heavier focus on fact checking, on top of echo chamber disruption, will resolve this philosophical concern. It will be the highest priority to ensure that while diverse opinions are highlighted, those opinions are as factually accurate and honest as we can verify. For opinions that don't meet that standard, we will instead give the user an opportunity to learn what is wrong with the claims made. In the many cases where such cut-and-dried distinctions cannot be made, a diverse set of perspectives will be offered instead.

Philosophy of EchoBurst

What we stand for: This project takes a firm stance against bigotry and science denial, but also firmly believes that the perceived divide in what people want to see in the world is greater than the actual divide. We therefore seek to encourage discourse without facilitating a culture of false equivalence between competing views, since not all views are equally true. The purpose of EchoBurst is to encourage civil, fact-based online media consumption and discussion on a variety of issues, to better expose people to competing worldviews and foster a culture conducive to discourse and policy that benefits humanity as a whole. This involves a myriad of sub-problems in the natural language processing domain that must come together in concert to reach this goal. We also believe that there is no such thing as being unbiased, but that you can be aware of your biases and work to correct for them.

What is EchoBurst? A set of Natural Language Processing tools integrated with a browser extension to help internet users find insightful commentary regarding the subjects and stories they read, as well as providing fact checking functionality. Additionally, such a tool may be usable as a research tool for political and social scientists.

Why are we doing this? The scope and complexity of information on the internet is such that people struggle to navigate it effectively. There is so much to sift through that it's easy to be taken in by false information, and easier still to find a comfortable echo chamber for your views. Natural language processing (NLP) offers a way to augment and encourage healthier online information consumption.

How will this be accomplished? Python-implemented natural language processing algorithms, along with substantial data and a web extension for the user to interact with.

Who should use this platform? People who believe they're discerning consumers of online information. Those who want to have their views challenged. Those who value the truth in what they consume. Really, anyone who browses the internet for news and information.

When? While much of the core ML functionality should meet a reasonably high standard before September 2019, integration with a proper interface will likely take substantially longer.

Where? Everywhere; the project will be fully open-sourced (once there is a project to speak of).

Structure and Outline

Proposed structure of EchoBurst

User: The user of the extension or app.

Browser: The interface for both input and output of the app.

User Self Stated Political Position: A survey on a myriad of topics the user has opinions on, rated on a Likert scale from very conservative to very liberal. This may grow in dimensionality to include authoritarianism and libertarianism as well, but for now will be a single liberal-conservative axis.

Topic Detection: Detects the topic of the conversation or article by training on political news and comments spanning many broad topics.

Event Detection: Detects what event is being discussed based on a large repository of historical and news events. The detection will work by condensing the article or comment to a few words that will then be entered into external search systems.

AllSides Search: Takes the detected event or topic and feeds it to the AllSides website as a search. 3-5 diverse (as defined by their political leaning scale) sources are then accessed and the full text parsed.
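The "3-5 diverse sources" selection step could be sketched roughly as below. This is a hedged illustration only: the tuple format, greedy strategy, and the numeric encoding of AllSides' 5-point leaning scale (-2 = Left through +2 = Right) are assumptions, not the project's actual design.

```python
# Hypothetical sketch: pick a politically diverse subset of search results
# using an assumed numeric form of AllSides' 5-point leaning scale
# (-2 = Left ... +2 = Right). Names and structures are illustrative.

def pick_diverse_sources(results, k=5):
    """Greedily select up to k results covering as many leanings as possible.

    `results` is a list of (source_name, leaning) tuples, assumed to be
    ordered by search relevance.
    """
    chosen, seen_leanings = [], set()
    # First pass: one result per distinct leaning, in relevance order.
    for name, leaning in results:
        if leaning not in seen_leanings:
            chosen.append((name, leaning))
            seen_leanings.add(leaning)
        if len(chosen) == k:
            return chosen
    # Second pass: top up with the most relevant remaining results.
    for item in results:
        if item not in chosen:
            chosen.append(item)
        if len(chosen) == k:
            break
    return chosen

results = [
    ("Outlet A", -2), ("Outlet B", -2), ("Outlet C", 0),
    ("Outlet D", 2), ("Outlet E", 1), ("Outlet F", -1),
]
print(pick_diverse_sources(results, k=5))
```

A greedy pass like this favors covering the leaning spectrum over raw relevance, which matches the stated goal of presenting diverse perspectives.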

Wikipedia: Takes the detected topic or historical event and feeds it to Wikipedia as a search. It returns the full text article of the most relevant hits.

Reliable Sources: Umbrella category for the returned news and Wikipedia results which act as data for text summarizations and fake news detection.

Text Summarization: Takes full articles as input and outputs the important points for easy consumption. May include a summary that combines all returned text.

Set of Diverse Summaries: The resulting summaries are then displayed for the user to read, along with the links to the original articles.

Fake News Flagging: Classifies selected text to determine if it's likely fake news. Fake news here is defined in two ways: largely/entirely fabricated, or cherry-picked perspective. Will display a warning/confidence rating if it suspects that it is one of these. Otherwise, if also not deemed toxic, the text is considered valuable discourse.

Toxicity Flagging: Classifies selected text to determine if it's likely toxic content/trolling. Will display a warning/confidence rating if it suspects that it is. Otherwise, if also not deemed fake news, the text is considered valuable discourse.

Valuable Discourse: Text that is determined to likely be insightful, meaning it is not vitriolic or blatantly false.
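The interaction between the two flagging modules and the valuable-discourse category can be made concrete with a small sketch. The scores and thresholds here are stand-ins: any real implementation would take these from the fake news and toxicity classifiers, and the threshold values are assumptions.

```python
# Illustrative sketch of the flagging logic described above. The classifier
# scores are stand-ins for the outputs of the fake news and toxicity models;
# the 0.5 thresholds are assumptions, not tuned values.

FAKE_THRESHOLD = 0.5
TOXIC_THRESHOLD = 0.5

def assess_text(fake_score, toxic_score):
    """Return a label plus any (warning, confidence) pairs, per the pipeline above."""
    warnings = []
    if fake_score >= FAKE_THRESHOLD:
        warnings.append(("fake news", fake_score))
    if toxic_score >= TOXIC_THRESHOLD:
        warnings.append(("toxic", toxic_score))
    # Only text that triggers neither warning counts as valuable discourse.
    label = "valuable discourse" if not warnings else "flagged"
    return label, warnings

print(assess_text(0.1, 0.2))  # neither flag fires
```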

Political Leaning Detection: Determines the political perspective of the valuable discourse text, ranging from conservative to liberal along the 5-point Likert scale used by AllSides.

User Self Stated Political Position: A survey on a myriad of topics the user has opinions on, rated on the same Likert scale.

Highlighted Confrontational Text: Text that has a political stance that is in opposition to the user's stance. This is determined based on the topic and the political leaning of the text, and then compared with the user's answers. These articles and posts are highlighted in the browser to bring them to the user's attention.
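The comparison between the text's (topic, leaning) pair and the user's survey answers might look something like the sketch below. The numeric scale (-2 very liberal through +2 very conservative), the gap threshold, and all names are illustrative assumptions.

```python
# Hedged sketch of deciding whether to highlight a piece of text as
# "confrontational" relative to the user's stated positions. The -2..2
# encoding of the 5-point Likert scale and the gap threshold are assumptions.

def should_highlight(user_positions, topic, text_leaning, min_gap=2):
    """Highlight text whose leaning opposes the user's stance on its topic.

    `user_positions` maps topic -> the user's Likert answer (-2..2).
    A gap of `min_gap` or more on the scale counts as opposing.
    """
    user_leaning = user_positions.get(topic)
    if user_leaning is None:
        return False  # no stated position on this topic, nothing to oppose
    return abs(text_leaning - user_leaning) >= min_gap

user = {"immigration": -2, "taxes": 1}
print(should_highlight(user, "immigration", 1))  # gap of 3 -> highlight
print(should_highlight(user, "taxes", 2))        # gap of 1 -> skip
```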

Machine Learning Components

Political Leaning Detection

Purpose

The first and most essential component, this will detect whether something is likely liberal, centrist or conservative in perspective. This will help to highlight comments and articles that conflict with the user's position. This is not done to change their mind, but more as a soft introduction to competing views that hopefully inspires the user to engage in more direct, respectful discussion.

Data

  • AllSides
  • Reddit and Twitter

First Steps

Taking a dataset of many different news organizations, along with the AllSides classification of their political leaning, we will try to classify news articles based only on the text provided.

Approach

The initial approach will be using word embeddings to transform the data, which will then be passed through the BERT pretrained model, with a final layer added for this specific task.

Eventual Goal

The model will have to be trained on comments as well as news articles, which will likely be much more difficult. We'll likely have to resort to scraping Reddit and using distant labels (r/The_Donald is conservative, r/Socialism is liberal, etc.). We'd like to be accurate enough that political leaning can be assessed even from just a few meaningful sentences.
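The distant labeling step described above could be sketched as follows. The subreddit-to-leaning mapping here is purely illustrative; building a defensible mapping would itself require care.

```python
# Sketch of distant labeling from subreddit of origin, as described above.
# The subreddit-to-label mapping is an illustrative assumption.

SUBREDDIT_LEANING = {
    "The_Donald": "conservative",
    "Conservative": "conservative",
    "Socialism": "liberal",
    "progressive": "liberal",
}

def distant_label(comments):
    """Turn (subreddit, text) pairs into (text, label) training examples,
    dropping comments from subreddits with no assumed leaning."""
    labeled = []
    for subreddit, text in comments:
        label = SUBREDDIT_LEANING.get(subreddit)
        if label is not None:
            labeled.append((text, label))
    return labeled

raw = [("Socialism", "Healthcare is a right."),
       ("aww", "Look at this puppy!"),
       ("The_Donald", "Build the wall.")]
print(distant_label(raw))
```

The obvious caveat with distant labels is noise: not every comment in a partisan subreddit shares that subreddit's leaning, so the labels are weak supervision at best.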

Topic Detection

Purpose

To detect the general topic of discussion in a news article or comment. Topic is the second dimension along with political leaning that together correspond with the stated positions survey. It also helps generate the appropriate searches for the fact checking functionality.

Data

  • AllSides
  • Reddit and Twitter

First Steps

We will take a dataset of many different news organizations, along with the AllSides topics (some of which might be merged together), and try to classify news articles based only on the text provided.

Approach

Topic detection will be conducted similarly to political leaning detection, though it has substantially more precedent in the literature. As such, while BERT will likely be used, first steps will include more extensive research on the advantages of different methodologies.

Eventual Goal

The model will have to be trained on comments as well as news articles, just like political leaning. It will be more difficult to get comments that fall into neat categories, so more brute force (key word search) or more distant (unsupervised topic modeling) methods may need to be employed. Once this is complete, we will ideally be able to detect the topic of even short sentences and determine if that matches with an existing category.

Text Summarization

Purpose

Text summarization will be used on news articles and established sources such as Wikipedia in order to give a concise overview from a variety of sources, which should make it easier to fact check. These sources will be searched for on their respective sites using the output from the topic or event detection models, with a diverse set of sources generated using AllSides' leaning score in the case of news article searches. Full links will also be provided, but summaries will allow people to quickly obtain a more balanced perspective without needing to read multiple full articles on the topic. This may also act as a 'foot in the door', where once people read the summary, they are more likely to investigate the full links on their own. This will be used primarily for news articles and topics/events detected from comments, not on comments themselves, as they are naturally very concise and unreliable.

Data

  • Wikipedia
  • AllSides

First Steps

To research the SOTA methods and approaches, and then generate a variety of summarizing models. Unlike in most summarization tasks, it's extremely important that the tone, phrasing and mannerisms of the text are preserved in the summary as much as possible. Once summarizing works, it may require some testing to determine which model does this best, as that can be a somewhat subjective criterion.

Approach

Unsupervised abstractive summarization using BERT, implemented in PyTorch.
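While the plan is BERT-based abstractive summarization, a minimal frequency-based *extractive* baseline makes the module's input/output shape concrete. This is entirely illustrative and is not the approach named above; keeping selected sentences in document order is one cheap way to preserve some of the source's tone.

```python
# Minimal extractive baseline: score sentences by the frequency of their
# words across the document, keep the top n in original order.
# Illustrative only; the project's planned approach is abstractive.
import re
from collections import Counter

def extractive_summary(text, n_sentences=2):
    """Return the n highest-scoring sentences, in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # Rank sentence indices by total word frequency, highest first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: -sum(freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())),
    )
    keep = sorted(ranked[:n_sentences])  # restore document order
    return " ".join(sentences[i] for i in keep)

demo = "Cats are great. Dogs are great too. Cats cats cats."
print(extractive_summary(demo, 1))  # -> "Cats cats cats."
```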

Eventual Goal

We would like to be able to accurately summarize the key points in news and Wikipedia articles, while preserving the tone of the original article. This may be something that can go even further in time by creating a meta summary consisting mostly of facts that are common to all or most of the news articles or sources.

Event Detection

Purpose

To find the specific event being discussed in the article or comment, so that other sources can be found to compare positions and conclusions. This is important for both fact-checking generally, as well as the fake news detection module.

Data

  • AllSides story IDs
  • Wikipedia
  • Potentially other news sources, as leaning is not a necessary component

First Steps

An assessment of the data and the literature to determine the best course of action. There are numerous challenges present that don't exist in other modules.

Approach

Unknown, as a lot depends on what sort of transformation of the text will give the best search results, something not entirely dependent on the quality or machinations of any machine learning algorithm. However, a likely approach is to take the title or headline in concert with the full text to generate a short sentence that will act as the query. The key challenge here is that it's difficult to give feedback to the model, as the task is partially subjective and, more importantly, the search happens in an external system.
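A crude, non-learned version of the headline-plus-text query generation described above might look like this. The stopword list, the term count, and the frequency heuristic are all assumptions made for illustration; a learned model would presumably do better.

```python
# Illustrative sketch: build a short search query from the headline's
# content words plus the most frequent content words from the body.
# Stopword list and term counts are assumptions.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "for", "is", "at"}

def build_query(headline, body, extra_terms=3):
    """Headline content words first, then top body words not already present."""
    head_words = [w for w in re.findall(r"[a-z']+", headline.lower())
                  if w not in STOPWORDS]
    body_freq = Counter(w for w in re.findall(r"[a-z']+", body.lower())
                        if w not in STOPWORDS and w not in head_words)
    return " ".join(head_words + [w for w, _ in body_freq.most_common(extra_terms)])

query = build_query(
    "Senate Votes on the Budget",
    "The budget vote passed the senate chamber after debate. "
    "The chamber debate ran long.",
)
print(query)
```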

Eventual Goal

We would like to at least get the right event in the top 3-5 results of the search output. This gives us a bit more leeway, allowing for some inaccuracy that the user can compensate for, though ideally the model would be accurate enough that this isn't required often.

Toxicity Detection

Purpose

To warn people of toxic comments so they can more easily avoid them, which will hopefully make online discourse less taxing. It will also serve as a filter for what constitutes a valuable comment which might be highlighted if other criteria are also met.

Data

  • Kaggle Toxicity dataset
  • Real world manually labeled data for the sake of validation

First Steps

This is one of the more straightforward tasks, as a high-quality dataset has already been created and others have proven that a high degree of accuracy is possible. Thus, the first step will simply be researching and implementing versions of established models.

Approach

Established effective models for the task, although a BERT pretrained model with an extra layer trained on this specific task will also be tested for comparison. It may be that even better, more generalizable results are possible with a fusion of approaches.

Goal

To accurately flag toxic and trolling comments so users don't waste time reading them.

Fake News Detection

Purpose

To flag links, articles, blogs and possibly comments as fake news, either in the sense of being entirely fabricated, or in the sense of cherry-picking points to paint a false narrative around events or facts.

Data

  • Fake news stance dataset
  • AllSides (for comparison against target stance)
  • Other datasets on the topic for classification based on linguistics rather than stance
  • Moonshot: Cluster various facts about the event or topic by source, on the premise that some sources over-report certain aspects while those with a different leaning under-report those aspects and over-report others. Then determine whether the target article conspicuously misses points from one side or the other.

First Steps

To establish a working definition of fake news that is as politically neutral as possible, as well as a methodology for detecting content that meets that definition. It is also necessary to determine whether it's even possible to flag comments as fake news in a reliable way.

Approach

The first and most basic approach would be using existing data and fact-checking organizations to make a list of dubious sources, including sites and social media pages. BERT language model classification can then be used to determine linguistic markers of fake news. Finally, stance detection can be used to contrast the stance of the target article with the collective/average stance of established news organizations. This will likely be the most difficult of all the modules, as fake news detection is a largely unsolved problem. It may be that what really halts the spread of fake news is simply the fact-checking functionality of the extension, allowing users to detect fake news themselves.

Goal

To warn people about misleading stories and posts, making users better informed and, to some extent, inoculating them against spreading fake news themselves.

Stages of Development

MVP

Components

  • Political Leaning Detection
  • Topic Detection
  • Event Detection
  • Very basic extension

Functionality

The MVP will primarily be useful when browsing news sites or blogs that don't have an established leaning, and for fact checking those articles against other articles and Wikipedia entries. Unlike future versions, this one will likely require users to specifically request an assessment, after which a request is made, the data processed, and results returned. These results will be presented as links along with the topic, headline, and the bias rating when possible. If the models make a mistake, users will also be able to flag it, and the data will be sent back.

Primary Challenges

By far the biggest challenge will be the creation of the extension itself. There is a lot of data processing and exchange between the browser and the models, which will have to run on separate servers. The largest ML challenge will likely be event detection; although the data present is well suited for straightforward training, the real challenge lies in turning a search query into the correct results while going through external systems.

Goals of Release/Testing

To determine interest in the project from users, to get feedback on the extension and its usability and to gather data to improve the models.

Core Components Complete

Components

  • Political Leaning Detection
  • Topic Detection
  • Event Detection
  • Text Summarization
  • Polished Extension

Functionality

At this point, the models will hopefully be improved, and the searches and results for fact checking should be more advanced. We will hopefully have also figured out how we want to target text for fact checking and highlighting, although highlighting will likely not yet be operational. This is mostly due to its dependence on two of the lower priority ML components, and because of the need to process things efficiently so servers don't get overloaded.

Primary Challenges

Setting up the architecture needed for the rest of the project's features, and ensuring that there isn't a huge burden on the user's system or our own. On the ML side, text summarization will need to be tweaked in order to ensure each summary has the same tone and perspective as the original article.

Goals of Release/Testing

To test the effectiveness of the text summarization for improving the fact-checkability of articles and comments. On top of UX and UI tests, we'll also be testing the server load and efficiency of the extension.

Full Release

Components

  • Political Leaning Detection
  • Topic Detection
  • Event Detection
  • Toxicity Detection
  • Fake News Flagging
  • Text Summarization
  • Full Extension

Functionality

Full functionality, including highlighting of valuable comments, plus fake news and toxicity detection on top of the established fact-checking functionality. The full release will also hopefully have some form of integration with interested news platforms.

Primary Challenges

Ensuring the extension is doing what we want it to do, in terms of how it's being used and how it's facilitating discourse. This will be based largely on user feedback and is less a technical issue than a design and sociological one. The main ML challenge will be fake news detection, which by its very nature is politically charged. It will require a thorough review and consensus from multiple people, as well as the fusing of multiple methods to achieve the desired results.

Goals of Release/Testing

To get it into the hands of as many people as possible, particularly those interested in a more nuanced discourse, and to encourage more people to push outside their information bubble in more direct ways such as direct correspondence with those they disagree with.