
not working on google.gr #113

Open
Valve opened this issue Sep 22, 2015 · 15 comments

@Valve

Valve commented Sep 22, 2015

Not working on these URLs:

https://www.google.gr
http://google.gr
http://google.tn

@Valve
Author

Valve commented Sep 22, 2015

OK, it's not working on any Google domain. What am I missing?

@fblundun
Contributor

import com.snowplowanalytics.refererparser.Parser
val parser = new Parser()
val referer = parser.parse("https://www.google.gr", "http://www.example.com")
println(referer)

The above prints "{medium: search, source: Google, term: null}", which is the expected result for the Java library. What is going wrong for you, @Valve? Which language's version of the library are you using?

@Valve
Author

Valve commented Sep 23, 2015

@fblundun Oh sorry, I didn't know this had multiple language versions. I'm using the Ruby version.

Here is my output:

ruby -v
ruby 2.2.3p173 (2015-08-18 revision 51636) [x86_64-darwin14]
rails c
Loading development environment (Rails 4.2.4)
[1] pry(main)> RefererParser::Parser.new.parse('http://google.com')
=> {:known=>false, :uri=>"http://google.com"}

@fblundun
Contributor

Assigning to @kreynolds, the maintainer of the Ruby library...

@morrow95

morrow95 commented May 18, 2016

Just tested this on the PHP version, and none of the Google domains I tried return the search terms.

https://www.google.com/?gws_rd=ssl#safe=off&q=testing

medium: search
source: Google
terms:

@yalisassoon

@morrow95 Google doesn't provide keyword terms for searches done on HTTPS, which is now the vast majority of them: https://searchenginewatch.com/sew/news/2296351/goodbye-keyword-data-google-moves-entirely-to-secure-search

@morrow95

morrow95 commented May 18, 2016

@yalisassoon - I've been aware of that change for some time, but in my use case I have the actual URLs, such as the one I listed in the earlier post. As you can see in my example, the q= is given, so I would expect the parser to return 'testing' as the term(s).

For something like https://www.google.com/#safe=off&q=testing+one+two+three I would expect 'testing one two three' to be returned.

@morrow95

morrow95 commented May 19, 2016

For what it is worth... removing the '#safe=off' from the URLs I mentioned above makes the parser correctly return the search terms. It would appear the '#' is causing the parser to mishandle the parameters and preventing it from returning the terms (q).
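
A minimal illustration of that behaviour, using Python's standard urlparse directly rather than any of the referer-parser libraries (the URL is the one quoted above):

from urllib.parse import urlparse, parse_qs

url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
parsed = urlparse(url)
print(parse_qs(parsed.query))  # {'gws_rd': ['ssl']} -- no 'q' key here
print(parsed.fragment)         # 'safe=off&q=testing' -- the search term is stranded in the fragment

# Replacing the '#' with '&' (or stripping '#safe=off') moves q back into the query:
print(parse_qs(urlparse(url.replace('#', '&', 1)).query))
# {'gws_rd': ['ssl'], 'safe': ['off'], 'q': ['testing']}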

@morrow95

morrow95 commented Jul 6, 2016

Has anyone looked into this? You can try the URLs mentioned above yourself and see the same results.

@donspaulding
Contributor

donspaulding commented Jul 6, 2016

This is not a Ruby-specific problem. I would be surprised if any of the language bindings parsed this URL the way you're expecting.

The root of the problem is that everything after the # is the "fragment" portion of the URL. Even if the fragment is structured to look like the "query" portion of the URL, all of the language parsers will treat everything after that # as a single string which comprises the fragment.

This is a general problem with JavaScript-heavy webpages which want to represent the URL to the user without causing an actual page nav to take place in the browser. They're (ab)using the fragment as a place to store information on what would traditionally have been a page-load inducing document.href or form.submit() call.

The code to fix this would likely look like this (in Python):

the_url = "https://www.google.com/?gws_rd=ssl#safe=off&q=testing"
parsed_url = urlparse(the_url)
if 'google.com' in parsed_url.netloc and 'q=' in parsed_url.fragment:  #  <-- Or some other fragile heuristic
    new_url = the_url.replace('#', '&', 1)
    parsed_url = urlparse(new_url)
# continue parsing as before

I'm sure my bias is showing through in the comment above, but I'm a +0 on adding this capability into referer-parser. The biggest reason is that I don't believe there's a generic, widely-applicable heuristic that we could build into the referers.yaml file to detect and correct these types of abuses of the URL syntax.

But, I'm also sensitive to the concern of referer-parser punting on this, because it would effectively put the burden on users of our respective libraries to add a snippet as above any time they were dealing with a domain that pulled these kinds of shenanigans. So if we could determine how many domains do this, or find a way to encode the special-cases in referers.yaml, I'm probably easily swayed.

/cc @alexanderdean

@alexanderdean
Contributor

Thanks for finally bottoming this one out @donspaulding! Ouch, that's a pretty nasty behavior by Google and friends.

Another challenge with fixing this is that I'm pretty sure it will require the referer URI to be passed in to referer-parser as a string, because if you pass it in as a proglang-idiomatic Url class or equivalent, it's probably "too late" to fix it. This then leads to the unfortunate side effect that the library's behavior will likely be different depending on whether you supply a string or a proglang-idiomatic Url class.
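
A rough sketch of why the two entry points would diverge, in Python and with entirely hypothetical names (referer-parser has no such function):

from urllib.parse import urlparse

def _merge_fragment(referer):
    # Hypothetical pre-processing step, not part of referer-parser.
    if isinstance(referer, str):
        # String input: the '#' -> '&' swap from the earlier comment can happen before parsing.
        return urlparse(referer.replace('#', '&', 1))
    # Already-parsed input (urlparse's ParseResult): the query/fragment split has
    # already happened, so the fragment has to be stitched back onto the query by hand.
    query = '&'.join(part for part in (referer.query, referer.fragment) if part)
    return referer._replace(query=query, fragment='')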

@morrow95

morrow95 commented Jul 6, 2016

Yes, this is certainly not language-specific. As has been mentioned, this is a way to prevent or hinder easy 'decoding' of the URL to collect the search terms. I know Google does this now, but I'm not sure if anyone else has followed the practice since. I honestly never looked at the snowplow code before, but the responses above tell me snowplow uses the default URL parsing that each language provides (I had thought snowplow had its own implementation of URL parsing this whole time).

Looking at the referers.yml entry for Google:

Google:
  parameters:
    - q
    - query # For www.cnn.com (powered by Google)
    - Keywords # For gooofullsearch.com (powered by Google)
  domains:
    - www.google.com

you are already passing the needed parameters to determine the search terms, but of course these new-style URLs are not parsed as expected (the # makes everything after it a fragment rather than part of the query). The only way to get around this would be adding special parse cases, as donspaulding pointed out. I haven't looked into how uniform Google's URLs are (for example, whether #safe=off is always used), but it seems like any solution might open up false positives unless Google is strictly consistent in how they present the URL. While you could strip the #safe=off, that assumes Google doesn't use q= or query= in the query part of any of their non-search URLs, or else you would have false positives.
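
One possible shape for a less fragile heuristic, sketched in Python with hypothetical names (the parameter list is just the Google entry above, imagined as coming from referers.yml): only fold the fragment into the query when the fragment actually contains one of the configured search parameters, so ordinary anchors are left alone.

from urllib.parse import urlparse, parse_qs

GOOGLE_PARAMS = ('q', 'query', 'Keywords')  # from the referers.yml entry quoted above

def extract_terms(referer_url):
    parsed = urlparse(referer_url)
    params = parse_qs(parsed.query)
    fragment_params = parse_qs(parsed.fragment)
    # Only treat the fragment as extra query parameters if it carries a known search parameter.
    if any(key in fragment_params for key in GOOGLE_PARAMS):
        params.update(fragment_params)
    for key in GOOGLE_PARAMS:
        if key in params:
            return params[key][0]
    return None

print(extract_terms("https://www.google.com/?gws_rd=ssl#safe=off&q=testing"))  # testing
print(extract_terms("https://www.google.com/#some-anchor"))                    # None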

For what it is worth, my use case for snowplow involves having the full URL strings. I am essentially passing URL strings into it and collecting data from the results, such as which search engine was used, which search terms were used, and so on. I build my own data from those results, like viewing how often terms were searched, even across different engines. Given that Google is the most widely used search engine, this leaves quite a gap if you plan on using the results for any sort of data/reports as far as search terms go.

@alexanderdean
Contributor

Hey @morrow95 - thanks for this, that's a lot of helpful context. This was surprising to me:

As has been mentioned this is a way to prevent/hinder easy 'decoding' of the url to collect the search terms.

Do you have a source for search engines doing this deliberately to obfuscate the URI?

@morrow95

morrow95 commented Jul 6, 2016

I honestly do not. I haven't been paying as much attention to SEO-related information in the past few years as I used to; however, as someone pointed out earlier, Google specifically made this change not all that long ago. Frankly, and IMO, I believe this was their way of toning down SEO-related activities as well as pushing their own analytics platform onto site owners. Given that Google has pretty much always represented most of the search market, the goal has always been to 'figure out' what works and what doesn't with their algorithm to improve rankings, and I, and many others, see most of their recent changes as aimed at eradicating these possibilities. Without going on and on, I think 'hiding' the search terms was just another reason they did this, but it might also have to do with privacy and a number of other things. Who knows.

Anyway, I think if snowplow wants to extract this information from Google, and from any others who have adopted this practice, some conditionals and expressions are going to be needed in the code to handle it.

@morrow95

morrow95 commented Jul 6, 2016

For what it is worth, this is the first result in a quick search: http://adage.com/article/dataworks/google-hides-search-terms-publishers-marketers/244949/. If you want to read more, there is an abundance of information and opinions out there about it.
