-
Notifications
You must be signed in to change notification settings - Fork 2
/
index.html
551 lines (463 loc) · 40.1 KB
/
index.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
<head>
<meta charset="utf-8">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
<!-- Fonts -->
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto:400,700">
<!-- Style sheets -->
<link rel="stylesheet" type="text/css" href="https://cdn.rawgit.com/bmabey/pyLDAvis/files/ldavis.v1.0.0.css">
<link rel="stylesheet" type="text/css" href="./webpg_assets/index.css">
<title>WOTUS Public Comments Analysis</title>
</head>
<body>
<h1>
Politics and Perplexity in Python<br>
<span class="subhead">An Analysis of Public Comments Regarding Changes to the 2015 Clean Water Rule</span>
</h1>
<hr>
<p>
This project applies Natural Language Processing (NLP) techniques to analyze publicly-submitted comments in response to proposed changes to the 2015 Clean Water Rule, namely around the re-definition of "Waters of the United States" (WOTUS). The scope of the project includes:
</p>
<ul>
<li><a href="#overview">Data Collection</a> done via a webscraping script to collect the text of all comments posted in the docket</li>
<li><a href="#eda">Exploratory Data Analysis</a> to visualize comment vocabulary and differences between supporting or opposing comments</li>
<li><a href="#topic">Topic Analysis</a> to understand the reasons commenters gave for their support or opposition to the rule change</li>
<li><a href="#similarity">Similarity Analysis</a> to find similar comments to a new or unseen one</li>
<li><a href="clustering">Clustering Analysis</a> to find structure in the un-labeled dataset, and potentially to act as a proxy for sentiment analysis</li>
<li><a href="sentiment">Sentiment Analysis</a> (performed after labeling a subset of comments) to train a classifier to identify comments as either supportive or opposing the rule change</li>
<li><a href="conclusions">Project Conclusions</a> to review key takeaways from each analysis</li>
</ul>
<!-- Page navigation
<ul>
<li><a href="#overview">Overview</a></li>
<li><a href="#eda">EDA</a></li>
<li><a href="#preprocessing">Data Pre-Processing</a></li>
<li><a href="#topic">Topic Analysis</a></li>
<li><a href="#similarity">Similarity Analysis</a></li>
<li><a href="clustering">Clustering Analysis</a></li>
<li><a href="sentiment">Sentiment Analysis</a></li>
<li><a href="conclusions">Conclusions</a></li>
</ul>
-->
<p></p>
<!-- PROJECT OVERVIEW SECTION -->
<h2><a name="overview"></a>Project Overview: Background and Challenges</h2>
<p>
In early 2017, President Trump signed an Executive Order<sup><a href="#footnote1">1</a></sup> requesting that agencies review a 2015 rule (the "Clean Water Rule") regarding the definition of "Waters of the United States". WOTUS establishes the scope of waters that fall under federal jurisdiction, meaning that they are regulated under the Clean Water Act (CWA). The agencies, including the Environmental Protection Agency (EPA) and the Department of the Army, were instructed to rescind or replace the rule, in accordance with law.
</p>
<p>
The agencies have since conducted a reevaluation and revision of the definition of WOTUS. This new rule was open for public comment for 60 days until April 15, 2019, and the revised rule took effect on December 23, 2019. The comments are available under "Public Submissions" for docket EPA-HQ-OW-2018-0149 on the <a target="blank" href="https://www.regulations.gov/docket?D=EPA-HQ-OW-2018-0149">regulations.gov web page</a>.
</p>
<p>
With some exceptions, the comments generally fell into one of two buckets: they were either supportive of the proposed re-definition or opposed to changes to the 2015 rule.
</p>
<p>
The challenges of this project fell into two broad categories - ones relating to data collection and ones associated with the nature of the dataset.
</p>
<h3>Challenge 1: Data Availability and Collection</h3>
<p>
Only a fraction of the total comments submitted during the comment period are actually publically available in the docket. The docket lists a total of over 625K comments received, however, there are only 11K comments available under "Public Submissions". The EPA and Army explained this discrepancy:
</p>
<blockquote>
Agencies review all submissions, however some agencies may choose to redact, or withhold, certain submissions (or portions thereof) such as those containing private or proprietary information, inappropriate language, or duplicate/near duplicate examples of a mass-mail campaign. This can result in discrepancies between this count and those displayed when conducting searches on the Public Submission document type.
</blockquote>
<p>
Second, the project relied on a web scraper<sup><a href="#footnote2">2</a></sup> to actually collect the text of the comments in the docket. The regulations.gov website has an API to request data, however, only organizations (not individuals) qualify for an API key. The text of approximately 3K of the available 11K comments were locked in attachments (PDF or image formats) and unavailable to the scraper. Even with an API key, it's unclear if the text of those comments would be accessible without additional software.
</p>
<p>
In light of this, the project had to assume the 8K sample of comments used in the analysis are generally representative of the public's views. That said, care was taken to not draw conclusions about the <em>proportions</em> of comments for or against the rule change. There could be thousands of members of an organization sending the same mass-mail form to support or oppose the rule change, but it would only show up once in the submissions.
</p>
<h3>Challenge 2: Nature of the Dataset</h3>
<p>
The dataset is comprised of unstructured, unlabeled text. The lack of labeled data restricted the analysis to unsupervised learning techniques at first. However, a sample of 1,200 comments were manually labeled to gauge the performance of clustering analysis (as a proxy for sentiment analysis) and to enable the supervised sentiment analysis. Of course, this further limited the size of the dataset and having more labeled observations potentially could have improved the performance of the models.
</p>
<p>
Finally, the overlap of the language used in many comments added to the challenge of finding distinct topics or clusters. Training unsupervised models to find context from a "bag of words" representation turned out to be quite difficult. Below are two example excerpts from comments - both mentioning agriculture and the need for clean water - but they are on opposing sides of the issue:
</p>
<table>
<thead>
<th scope="col">Example Supportive Comment</th>
<th scope="col">Example Opposing Comment</th>
</thead>
<tbody>
<tr>
<td>
"I support clean water. We need clear rules on how this is to be accomplished. I personally try hard to have as little erosion as possible. Terraces, grass waterways, contour farming, no-till, headlands, etc. As hard as I and other farmers try to conserve and keep waters clean, we need clear rules with common sense to provide us direction. Hopefully this is a step in that direction and we all can work together to feed the world."
</td>
<td>
"I stand for clean water! Clean and healthy waterways are key to western agriculture, recreation and tourist economies that support our communities. These wholesale changes to the Clean Water Act to limit its jurisdiction provide loopholes in the law and give polluters incentives to discharge dangerous pollutants into unprotected waterways. Please protect our water - the essence of life upon which we all depend!"
</td>
</tr>
</tbody>
</table>
<!-- EXPLORATORY DATA ANALYSIS SECTION -->
<h2><a name="eda"></a>Exploratory Data Analysis</h2>
<p>
Word clouds proved to be useful to get an overview of the most common words (or <i>bigrams</i> - two-word combinations - as in the analysis below) in the dataset. The first image was created using the entire training set, and provides a nice sanity check that the comments are generally about clean water. For example, if "immigration" showed up in the plot, it would raise concerns about the integrity of the collected comments. But as expected, the most common bigram was "clean water". A closer inspection of the smaller-sized bigrams also uncovers the more opinionated phrases, such as "polluting industries", "dirty water", "clear rule", or "common sense".
</p>
<img alt="Image showing the most common 2-word combinations from the full training set" src="./webpg_assets/wordcloud_all.png">
<p>
Word clouds were also generated for subsets of labeled data to compare the vocabulary differences between supportive comments and opposing comments. No surprises here either - the supporting comments include "farm", "support proposed", and "clear rule" while the opposing ones mention "protection", "keep clean", and "drinking water".
</p>
<p>
A look at the top-50 words in each set of comments also sheds light on the different vocabulary between the groups. The table below shows the shared words as well as the unique ones found in one set but not the other.
</p>
<table>
<thead>
<tr>
<th scope="col" colspan="2">Supportive Comments</th>
<th scope="col"><!-- Intentionally blank --></th>
<th scope="col" colspan="2">Opposing Comments</th>
</tr>
</thead>
<tbody>
<tr>
<td><img src="./webpg_assets/wordcloud_sup.png"></td>
<td><!-- Intentionally blank --></td>
<td><!-- Intentionally blank --></td>
<td><img src="./webpg_assets/wordcloud_opp.png"></td>
<td><!-- Intentionally blank --></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr class="trow-as-header">
<th colspan="2">Supportive Words Only</td>
<th>Shared Words</td>
<th colspan="2">Opposing Words Only</td>
</tr>
</thead>
<tbody>
<tr>
<!-- First half supportive words list -->
<td>
<ul>
<li>2015</li>
<li>agency</li>
<li>army</li>
<li>clarity</li>
<li>clear</li>
<li>common</li>
<li>corp</li>
<li>ditch</li>
<li>engineer</li>
<li>farm</li>
<li>farmer</li>
<li>federal</li>
<li>follow</li>
<li>help</li>
<li>know</li>
<li>land</li>
<li>landowner</li>
</ul>
</td>
<!-- Second half supportive words list -->
<td>
<ul>
<li>new</li>
<li>practice</li>
<li>property</li>
<li>provide</li>
<li>regulate</li>
<li>regulation</li>
<li>regulatory</li>
<li>resource</li>
<li>revise</li>
<li>right</li>
<li>sense</li>
<li>support</li>
<li>time</li>
<li>understand</li>
<li>work</li>
<li>wotus</li>
<li>write</li>
</ul>
</td>
<!-- Shared words list -->
<td>
<ul>
<li>act</li>
<li>clean</li>
<li>definition</li>
<li>environmental</li>
<li>epa</li>
<li>family</li>
<li>important</li>
<li>need</li>
<li>propose</li>
<li>protect</li>
<li>quality</li>
<li>rule</li>
<li>state</li>
<li>unite</li>
<li>water</li>
<li>waterway</li>
</ul>
</td>
<!-- First half opposing words list -->
<td>
<ul>
<li>allow</li>
<li>body</li>
<li>change</li>
<li>community</li>
<li>country</li>
<li>dirty</li>
<li>downstream</li>
<li>drink</li>
<li>environment</li>
<li>ephemeral</li>
<li>fish</li>
<li>flow</li>
<li>health</li>
<li>include</li>
<li>industry</li>
<li>lake</li>
<li>large</li>
</ul>
</td>
<!-- Second half opposing words list -->
<td>
<ul>
<li>live</li>
<li>million</li>
<li>oppose</li>
<li>people</li>
<li>place</li>
<li>pollute</li>
<li>pollution</li>
<li>proposal</li>
<li>protection</li>
<li>remove</li>
<li>river</li>
<li>science</li>
<li>small</li>
<li>source</li>
<li>stream</li>
<li>wetland</li>
<li>year</li>
</ul>
</td>
</tr>
</tbody>
</table>
<!-- DATA PRE-PROCESSING SECTION -->
<h2><a name="preprocessing"></a>Data Pre-Processing</h2>
<p>
In order to leverage the power of the machine learning models that underpinned the various analyses, it was necessary to convert the comments into a numeric representation. Several techniques (listed below) were compared, ranging from a simple "bag of word" count vectorization to more advanced word embedding models via the SpaCy library. This was an iterative process that used results from the clustering and sentiment analyses to determine which pre-processing method worked best with this dataset for a given task.
</p>
<ul>
<li>Count vectorizer</li>
<li>Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer</li>
<li>Hashing vectorizer using L2 norm</li>
<li>Hashing vectorizer using TF-IDF</li>
<li>Document vectors</li>
</ul>
<p>
All vectorizer methods used SpaCy's English language model tokenizer and lemmatizer and removed stop words. Also, any word that appeared in fewer than five comments were removed - this helped reduce potential noise from typographical errors, slang, and rare words. Finally, whether to include bigrams or not was tested as well (generally this did not improve results).
</p>
<p>
The final technique didn't use a "bag of words" approach, but instead created a 300-dimension "document vector" by averaging the word vector values from SpaCy's large English language model.
</p>
<!-- TOPIC ANALYSIS SECTION -->
<h2><a name="topic"></a> Topic Analysis Using Latent Dirichlet Allocation</h2>
<p> The following topics were generated using a Latent Dirichlet Allocation (LDA) model in Scikit-Learn on the entire training set. LDA is a statistical model to find the different mix of <code>n</code> topics that comprise the collection of comments. The user needs to supply the number of topics, which isn't always clear in an un-labeled dataset. Fortunately there's a <code>perplexity</code> metric, which allows you to compare models using a varying <code>n</code> number of topics to help determine the appropriate topic count for the dataset.</p>
<p><a href="https://en.wikipedia.org/wiki/Perplexity">Wikipedia</a> defines <code>perplexity</code> as:</p>
<blockquote>In information theory, <bold>perplexity</bold> is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample.</blockquote>
<p>Low perplexity scores are better, but there's a practical trade-off with the number of topics. They need to be interpretable to a human and different enough to have value in a practical application. This analysis employed the "elbow" method as guidance to find the appropriate number of topics to balance the score and, after introspection, the utility of the topics.</p>
<p>
<img alt="Line chart showing declining perplexity scores by number of topics" src="./webpg_assets/LDA_Perplexity_for_nTopics.png">
</p>
<p>In the above chart, the sharpest decline in perplexity happens from two to eight topics, with a flattening of the curve thereafter. This indicates that around eight topics is where the law of diminishing returns kicks in thus further improvements to perplexity are minimal. The results of a model set to eight topics were then analyzed to find themes within the top words by topic.</p>
<!-- Interactive LDA script -->
<div id="ldavis_el336141128775838403788339513"></div>
<script type="text/javascript" src="./webpg_assets/LDA_script.js"></script>
<h3>LDA Topic Summary</h3>
<p>
The comments in the dataset tend to share a lot of language, so the most frequent words extend across many of the topics. This makes it hard to distinguish the topics with top words alone.
</p>
<p>
However, reviewing exemplary comments per topic and adjusting the values for lambda with the relevance metric slider (in the right-hand chart) helped to identify the general theme for a selected topic. Below is a summary of the key words and themes per topic:
</p>
<div class="list_container">
<ol>
<li>
These comments make general requests to protect headwaters, intermittent and ephemeral streams, and waterways. They use science-based arguments that relate to definitions and exclusions in the rule change, like navigable waters, groundwater, surface water, and the connectedness of the water system in general. They also argue for wildlife and habitat
</li>
<li>
This topic leans more politically than others while making similar points about using the latest science to protect waterways as the first topic. It also references specific regions, like the South and West, that would be left vulnerable. There are requests to reconsider the "risky" proposal, and notes that the CWA is a bipartisan success story and should remain intact
</li>
<li>
This topic is a catch-all for supporters of the rule change. There are comments from farmers and ranchers who want "clean water and clear (or common-sense) rules" and landowners pushing for property rights. These are people who mention they are using responsible farming techniques on their land and want clarity on what waterways are federally regulated
</li>
<li>
These comments opposed the proposed "dirty water" rule by making scientific arguments, sometimes noting the environmental threats like pollution, increased flood risk, and health and safety issues when regulations are loosened and the additional cost of clean-ups. They request the EPA abandon the dangerous rollback of the CWA and focus on protecting waterways
</li>
<li>
This topic includes environmental arguments about the importance of headwater, the risk to downstream users when there's water issues, and that the EPA needs to continuously protect water using available science. It also notes that future generations will need to deal with any repercussions of the rule change
</li>
<li>
This topic says clean water is a right and necessity in life, we all need it to live, and the government and its bureaucrats are responsible for protecting it for its citizens. It also notes the proposed rule undermines the CWA and doesn't establish a universal definition of federally protected waterways. Finally, it mentions the need to look to the future to see what we're leaving to our children
</li>
<li>
This topic covers arguments that the rule change will diminish outdoor recreation experiences and can affect parks, nature, wildlife, coastal waters, and the health of communities. The comments note flourishing outdoor economies that are supported by clean waterways
</li>
<li>
Commenters saying "we need to do more, not less to protect clean water" and that the CWA is a "bedrock environmental law". They make arguments to protect temporary waterways and ephemeral streams, with mentions of Western states, and that the new rule will leave these waterways unprotected. It warns against polluting industries and that the rule change puts profits ahead of the environment
</li>
</ol>
</div>
<!-- SIMILARITY ANALYSIS SECTION -->
<h2><a name="similarity"></a>Finding Similar Comments via Non-Negative Matrix Factorization</h2>
<p>
An eight-topic non-negative matrix factorization (NMF) model was fit to the training data. The topics closely matched the ones the LDA model found, which was a nice cross-check of the results from that analysis, but the goal for this model was to find "similar" comments, not topics. This functionality could then be applied to perform either batch-labeling (find the similar comments to a known, labeled one), or to find similar comments to an unseen, new one.
</p>
<p>
Given how each comment was processed into a term-frequency vector, this analysis defined and calculated "similarity" using the cosine value between vectors. With a "bag of words" processing approach, the values in each comment vector align with how many times certain vocabulary words show up in that comment. Therefore, comments with similar vocabulary will have similar vectors, and will therefore 'point' in similar directions in hyperspace.
</p>
<p>
The <code>cosine</code> of the angle created between any two comment vectors can indicate how similar their vocabulary is - the smaller the angle between the vectors, the higher the value of the <code>cosine(angle)</code> is (recall that <code>cosine(0) = 1</code>, which is the upper bound of the function's range). Therefore, cosine values near one indicate similarity. Likewise, as the language between comments diverges, that angle becomes larger, and the cosine value moves closer to zero. Because NMF doesn't allow for negative values, the resulting <code>cosine</code> values are bounded between zero and one.
</p>
<p>
Two labeled comments (one supportive and one opposing) were randomly selected from the test set and run through the NMF pipeline. The tables below show the most similar comments in the training set as well as the <code>cosine</code> value between the test comment and those training comments. Note that personally identifying information about the commenter (such as their address) was removed from some comments.
</p>
<h3>Similar Training Comments vs. an Example Supportive Comment</h3>
<blockquote>
"As a farmer, I need to be able to look at waterways on my land and know immediately if it is or is not federally regulated. The proposed new water rule should make this so. I support the proposed Waters of the US rules."
</blockquote>
<div class="table_container">
<table>
<thead>
<tr>
<th scope="col">Cosine Value</th>
<th class="com_txt" scope="col">Comment Text</th>
</tr>
</thead>
<tbody>
<tr>
<td class="cosine">0.9956</td>
<td class="comment">As farmers, we care about clean, sustainable water that help our crops and livestock. I believe that if the rules are defined clearly and logically, and they will not inhibit producers from making sound decisions on their farm, then general support for the new proposed language will garner support.</td>
</tr>
<tr>
<td class="cosine">0.9933</td>
<td class="comment">As a Farm Bureau member I understand that the proposal for the revised definition of "waters of the United States," under the Clean Water Act, is more practical than the 2015 WOTUS Rule. It provides for a more clear, specific, and limited description of what land is subject to regulation. I own 150 acres of farm ground. If I had to hire engineering firms to determine whether of not permits were required to farm, I would not make enough income through renting or farming the land to recoup the cost. The proposed clean water rule will make it more practical for farmers themselves to understand and follow. I strongly support the proposed rule.</td>
</tr>
<tr>
<td class="cosine">0.9911</td>
<td class="comment">I support the new rule for Waters of the US listed as rule 4154. Thank you.</td>
</tr>
<tr>
<td class="cosine">0.9889</td>
<td class="comment">I support the proposed Waters of the U.S. rules. I thank the EPA and the Army Core of Engineers for making the rules more understandable for me and fellow farmers and ranchers. These clearer rules should make it easier for myself and other farmers and ranchers to continue to protect and improve or water resources. With these proposed rules I can look at my land and know what I can do and continue to use my land and try to improve water quality on my farm. I support the proposed rule and thank the EPA for taking this important step to clarify and simplify the rule. I support clean water clearer rules.</td>
</tr>
<tr>
<td class="cosine">0.9874</td>
<td class="comment">I support Clean Water and Clear Rules. As an agronomist who supports the industry of agriculture and is affiliated with the implementation of nutrient management; I would like to see rules that are easy to comprehend and implement. Water is a crucial resource to our society and I applaud the EPA and Army Corps of Engineers for taking a step toward creating rules that are more clear than what the past has provided. The new rules will stimulate economic growth, limit regulatory uncertainty, and respect the roles of Congress and our nation under the Constitution.</td>
</tr>
</tbody>
</table>
</div>
<h3>Similar Training Comments vs. an Example Opposing Comment</h3>
<blockquote>
"I am urging the EPA and Army Corps of Engineers to adopt the 2015 Clean Water Rule and abandon the revised WOTUS rule. The Clean Water Rule needs to protect ephemeral waters and wetlands, not limited to the "navigable waters" currently defined, since by nature water is ephemeral and is not contained in one area. Pollution in one area potentially pollutes unlimited areas. Not only is lack of protection a danger to human and animal health, it's also economically unfeasible, considering affects on drinking water. Additionally, loss of wetlands can take decades to restore."
</blockquote>
<div class="table_container">
<table>
<thead>
<tr>
<th scope="col">Cosine Value</th>
<th class="com_txt" scope="col">Comment Text</th>
</tr>
</thead>
<tbody>
<tr>
<td class="cosine">0.9801</td>
<td class="comment">We should be doing MORE to protect our waters, not less. It is preposterous that you want to redefine "water" to remove protections from many wetlands and small streams. Please reconsider this unsound rule and keep the Clean Water Act in tact, in order to protect the health of Missouri communities and the waterways that we love and enjoy. The reasons the change should NOT take place: Wetlands are an important part of flood control so the new definition will increase flooding risk in Missouri. The new definition will threaten drinking water for over 2 million Missourians. The new definition ignores importance of wetlands and small streams for recreation. The new definition ignores public will for strengthened protections. Nearly 1.5 million American wrote in to support of the Clean Water Rule (currently in place) in 2014 and 2017. While the comment period for the original Clean Water Rule was open for over 200 days, we are concerned that the agency is only providing a 60-day comment period for the rollback of the Clean Water Rule now. I expect the administration and the EPA to protect our clean water, and I do NOT support efforts to weaken existing protections. My beloved state of Missouri is already woefully behind on implementing the Clean Water Act, and I'm concerned that the new "Waters" definition would take Missouri even further back.</td>
</tr>
<tr>
<td class="cosine">0.9667</td>
<td class="comment">Clean water is vital to every living thing. It affects our ecology, our health, and our quality of life. Finalized in 2015, a Clean Water Rule within that act restored federal protections to half the nations streams, which help provide drinking water to one in every three Americans. This Rule also protects millions of acres of wetlands that provide wildlife habitat and keep pollutants out of Americas great waterways. The Rule is now under attack by the Trump administration through court suits and Executive Orders. The latest threat is a proposed narrowing of the definition of 'Waters of the United States' that would reduce by half the waters being regulated by the EPA concerning such matters as point source discharges; dredge and fill permits, and oil spill prevention programs. Clearly, we need more clean water, not less. I strongly oppose the proposed redefinition of "Waters of the United States" because it greatly reduces the scope of the Clean Water Act to protect wetlands and ensure safe water for all Americans.</td>
</tr>
<tr>
<td class="cosine">0.9623</td>
<td class="comment">Dear Acting Administrator Wheeler, The EPA's proposed water rule will drastically weaken Clean Water Act protections in New Mexico. As an arid state, over 90% of New Mexico's waters do not flow year round and are therefore at risk of losing protections under this proposal. Even our biggest river, the Rio Grande, runs dry in sections during the summer! Many iconic New Mexico Rivers, such as the Gila River, Santa Fe River, Rio Ruidoso, and the Rio Costilla are threatened. Hundreds of thousands of New Mexicans depend on ephemeral or intermittent streams (waters that flow for only part of the year) for drinking water. Consistent flow in our streams, wildlife habitat, recreational uses, and our agricultural traditions are all dependent on the water that flows from our headwater wetlands, many of which will be left unprotected if this rule is passed. Please abandon this unwise rule that will hurt our water and our communities.</td>
</tr>
<tr>
<td class="cosine">0.9621</td>
<td class="comment">I am writing to oppose President Trump's new water rule. This rule will impact clean water where people live, work, and play. The proposed rule would eliminate entire categories of waterways from the protections afforded by the Clean Water Act, threatening the water you use to drink, grow your food, fish, and recreate. Categories of excluded waterways include: interstate waters, ephemeral streams or other isolated waters, non-adjacent wetlands, ditches, upland waters, and groundwater. Please oppose this destructive water rule.</td>
</tr>
<tr>
<td class="cosine">0.9607</td>
<td class="comment">To Whom It May Concern, The Friends of Tennant Lake and Hovander Park, an all-volunteer group, is concerned about the proposal to revise the definition of the Waters of the U.S. and no longer follow the guidance of the Clean Water Rule of 2015. For decades, the Clean Water Act, passed in 1972, has been essential to protecting the Waters of the United States from pollution and degradation. These waters encompass more than navigable rivers and lakes. We believe Congress clearly intended the protection of drinking water, agricultural water and ecosystems, including groundwater and the ephemeral streams that feed into our shared waterways. Tennant Lake and Tennant Creek adjoin the Nooksack River in northwest Washington State. We have seen the vital role the Nooksack plays in protecting and nurturing our precious salmon. Weve also witnessed the flood protection the Tennant Lake wetland provides to the farmers and residents of Whatcom County where we live: In full flood, the River flows up Tennant Creek into Tennant Lake, refilling the wetland, which serves as a nursery for wildlife and waterfowl. Our group and other residents use these natural cycles to illustrate the benefits of Tennant Lake and other wetlands. Our lake also used for hunting and fishing serves as a teaching habitat and beloved local recreational site. Our groups worry is that Tennant Lake and Tennant Creek significant waters adjoining the Nooksack River - could lose the physical, chemical and biological protections currently afforded to wetlands, ground water and creeks by the Clean Water Rule of 2015 if that rule is suspended. Nearby Ferndale continues to expand towards Tennant Lake, and we rely on Clean Water Act protection to preserve the lake and its environs from pollution associated with careless development. Therefore, we ask that the U.S. EPA withdraw the Revised Definition of the Waters of the U.S. and allow the Clean Water Rule of 2015 to proceed.</td>
</tr>
</tbody>
</table>
</div>
<!-- CLUSTERING ANALYSIS SECTION -->
<h2><a name="clustering"></a>Clustering Analysis</h2>
<p>
The clustering analysis was performed before a sample of comments were labeled. The goal was to see if an underlying structure existed in the language used that separated supporting versus opposing comments. Unfortunately, while the results could loosely support two clusters, the clusters themselves didn't properly distinguish comments by their position. After analyzing the results of a two-cluster KMeans model, it seemed the comments were separated more by topic than position.
</p>
<p>
Not surprisingly, the pre-processing choices for how to convert the text into a term-document matrix impacted the clustering results. A KMeans Minibatch model was run for all vectorization techniques using a range of two to twelve clusters. The following metrics were used to compare them. The model with the best quantitative results also got a qualitative review to determine whether the clustering task successfully separated the comments.
</p>
<ul>
<li><span class="bold">Inertia:</span> the total distance of all points from their assigned cluster's centroid. This typically goes down as more clusters are added, and thus is generally used with the "elbow method" to determine the point where the trade-off of adding a new cluster isn't worth the reduction in inertia</li>
<li><span class="bold">Inertia Times the Number of Clusters:</span> adjusts inertia scores to penalize for more clusters. The minimum of this metric is generally the trade-off point where the reduction in inertia balances with the number of clusters</li>
<li><span class="bold">Silhouette Score:</span> a value ranging from -1 to 1 that indicates how separate and tightly packed the clusters are. The calculation compares distances of points to their assigned centroid versus to the next nearest cluster centroid. The more distinct and tight a cluster is, the closer the score is to 1, whereas the more amorphous and overlapping it is, the more it falls in the 0 to -1 range</li>
</ul>
<p>
There weren't any major takeaways from the inertia scores by method. No method showed a clear "elbow" and all showed a minimum inertia-times-clusters at two clusters, meaning the drop in inertia by adding a new cluster wasn't necessarily worth it. However, the silhouette scores distinctly showed that the count vectorizer model generated clusters that were generally more distinct and separate than the other methods. The chart below shows the silhouette score by pre-processing method for models ranging from two to twelve clusters.
</p>
<img alt="Line chart showing silhouette score by vectorization method - the count vectorizer in general out-scored all other methods." src="./webpg_assets/Clustering-AllSilhouetteScores.png">
<p>
The following charts show all the metrics from the count vectorizer models.
</p>
<img src="./webpg_assets/MiniBatchKMeans_Count_Inertia-Silhouette.png">
<p>
Once the quantitative metrics helped narrow down the field to a model using count vectorized data, the next step was to examine the results qualitatively and visually. First, a sample of comments were read from each cluster. This turned out to be the first indication that there was a lot of overlap in the two clusters - both clusters had comments from each side.
</p>
<p>
Second, two dimensionality reduction techniques were employed - Principal Component Analysis (PCA) and TSNE - to visualize the clusters in 2D. A sample of comments were then labeled and colored and included in the charts to show the ground truth of a comment's support or opposition. Below are the resulting charts.
</p>
<img alt="Summary Principal Component Analysis charts: (left) a scatter plot of clusters showing a lot of overlap of supportive and opposing comments in a non-distinct cluster blob, and (right) a bar chart of explained variance of top PCA features, with the first component explains ~28% and second explains ~7% of the variance." src="./webpg_assets/PCA-BOW.png">
<img alt="TSNE visualization showing blobby groups with some separation of ground truth supporting and opposing comments, but not clearly tied to the clusters." src="./webpg_assets/TSNE-BOW.png">
<p>
As noted above, the clustering analysis was not a home run to distinguish the position of a comment. The KMeans model and dimensionality reduction techniques uncovered some stucture within the vectorized data, but unfortunately, it didn't seem to hinge on the language being supportive of the rule change or opposed to it.
</p>
<!-- SENTIMENT ANALYSIS SECTION -->
<h2><a name="sentiment"></a>Sentiment Analysis</h2>
<p>TO COME</p>
<!-- CONCLUSIONS -->
<h2><a name="conclusions"></a>Project Conclusions</h2>
<h3> Topic Analysis</h3>
<p>
Overall, the LDA model found relevant topics spanning the dataset. These topics were also confirmed while reading through a sample of 1,200 comments to label them for the sentiment analysis task. There were a few one-off comments that made points outside the eight categories, but they seemed to be the exception over the norm.
</p>
<p>
The NMF model also proved useful as an efficient way to identify similar comments. This functionality could be extended to batch-label comments to broaden the labeled set, allowing for the use of more complex modeling options for the sentiment classifier. It was also useful as a pseudo k Nearest Neighbors identifier - given a new or unlabeled comment, it could find the most similar labeled ones to determine the more-likely side of the issue the commenter was on.
</p>
<h3>Clustering Analysis</h3>
<p>
The final clustering analysis employed a KMeans clustering algorithm with two clusters fit on training data pre-processed with the top performing (based on silhouette scores) count vectorizer. After inspecting a sample of comments in each cluster, then performing dimensionality reduction with Principal Component Analysis and TSNE to visualize the clusters against a sample of labeled data, the clusters proved to have a lot of overlap of opposing and supportive comments.
</p>
<p>
The visualizations showed there were groupings and structure in the data, but the model unfortunately wasn't able to distinguish the groups clearly in terms of <i>sentiment</i>. Instead, the clusters the model found appeared to be driven more by <i>topic</i>. This is somewhat attributable to the nature of a simple "bag of words" pre-processing technique like the count vectorizer.
</p>
<p>
As noted above, one of the challenges with this dataset was how much shared language occured between supporting and opposing comments - "I support the current rule change" vs. "I support the 2015 ruling", or "I want clean water for my grandkids" vs. "I want clean water and clear rules". The "bag of words" preprocessing technique wasn't able to preserve the context that the similar language occured in, and therefore put limitations on the clustering model's ability to be a proxy for sentiment analysis. Pre-processing with different vectorizers, limiting features, and using only bigrams didn't improve the clustering results.
</p>
<h3>Sentiment Analysis</h3>
<p>
TO COME
<!-- Count vectorizer outperforms because didn't normalize for length of comments - longer comments had more info? -->
</p>
<footer>
<p>
Notes:<br />
<a name="footnote1">1</a>: Executive Order 13778, signed on February 28, 2017, titled "Restoring the Rule of Law, Federalism, and Economic Growth by Reviewing the 'Waters of the United States' Rule"<br>
<a name="footnote2">2</a> The web scraper script built in ample wait times so as not to overload the website with requests
</p>
</footer>
</body>