
Intercepting Google CSE resources: automate Google searches with client-side generated URIs (for free)

Introduction

From an OSINT perspective, Google Search has long been an indispensable tool for collecting data about companies, sites, people, leaks, i.e., any kind of information relevant to countless investigation purposes. Although mostly used by analysts in targeted research, there are actors who would benefit from developing a fully automated discovery process that uses Google’s search engine as one of its most important data sources.

Nowadays, Google already offers the public a service that facilitates the development of automated searches, called the Custom Search JSON API. In order to use it, one needs to create their own Programmable Search Engine — a very useful Google service, created to help developers embed Google search boxes in their websites and improve the user experience with more focused searches — and must request an API key to consume Google’s JSON API. However, this API has free usage limits: after one hundred (100) queries in a day, you will be charged a fee of five (5) US dollars per thousand (1,000) queries — capped at ten thousand (10,000) queries a day if you do not use the restricted JSON API version — should you want to proceed with your automated data collection processes.
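
For context, this is roughly how the official Custom Search JSON API is consumed. The API key and engine ID below are placeholders, and every call of this kind counts against the quotas mentioned above:

```python
import requests

# Placeholder credentials: a real API key and Programmable Search Engine ID
# (cx) are required to use the official JSON API.
API_KEY = "YOUR_API_KEY"
CX = "YOUR_SEARCH_ENGINE_ID"

# Each request like this one counts against the 100-queries/day free quota.
response = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": CX, "q": "example query"},
    timeout=10,
)
for item in response.json().get("items", []):
    print(item["title"], item["link"])
```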

That’s where this article comes in. By exploring a client-side generated API URI, it was possible to consume Google’s API data without needing any personal CSE API key and, consequently, without being charged for queries, since the traditional JSON API methods are avoided.

In order to consume this observed Google CSE API, a Python proof-of-concept module — named csehook — was developed with the help of libraries such as Selenium Wire — a library that gives access to the underlying requests made by a browser — which was used to intercept Google CSE API URIs; Requests — an HTTP library — which was used to consume the content of those previously intercepted URIs; and other publicly available resources, which will be mentioned over the course of this article and in its References section.

Thought Process

Google’s CSE, now called Google’s Programmable Search Engine, is not news anymore. Already well known by web developers — who use it to embed Google Search iframes in their sites’ pages — by investigation actors — who want to search predefined, focused domains in order to collect particularly interesting data — and by other kinds of individuals and professionals, it is a useful, widely adopted public tool. It was first made to facilitate the embedding of Google Search boxes in sites and the use of more specific, personalized and focused search engines, but it happens to be an incredible tool for people who have a ton of research work to do.

Despite being truly helpful on its own, there is one thing in its bundle that is not so handy for people who depend on heavy automated tasks to do their jobs: the CSE JSON API limits. For this reason, attempts were made — out of curiosity — to find alternative paths around those obstacles.

All the demonstrations were made with a personal Programmable Search Engine, focused on searching for terms on Pastebin pages.

By observing how the client side of a Google CSE page interacts with Google’s backend resources, a couple of interesting behaviours were noticed whenever a query is made:

Even though the Ads interactions could be consumed and parsed to some extent, what really draws attention is behaviour 2. The figures below illustrate its response content.

Beginning of client-side generated Element URI response

End of client-side generated Element URI response

As illustrated, the response contains a call to a client-side JavaScript function, which receives a JSON object sent to the client from a Google server. This function parses the JSON object — which always contains up to ten (10) search results at a time (per page), across a maximum of ten (10) distinct pages — and displays the results in the CSE page being used.
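
As a rough sketch of how that wrapper can be handled, the snippet below strips a JavaScript callback call and parses the embedded JSON; the pattern is illustrative and may differ from the regex actually compiled by CSEHook:

```python
import json
import re

# Assumed response shape: "someCallback({...});", i.e. a JSON object wrapped
# in a single JavaScript function call. The real callback name varies per page.
CALLBACK_REGEX = re.compile(r"^[^(]*\((?P<payload>.*)\);?\s*$", re.DOTALL)

def extract_results_json(body: str) -> dict:
    """Strip the JavaScript callback wrapper and parse the embedded JSON."""
    match = CALLBACK_REGEX.search(body)
    if not match:
        raise ValueError("no JSON payload found in the CSE element response")
    return json.loads(match.group("payload"))
```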

Despite the frontend generating an individual URI for each performed query, it was noticed that those URIs could be reused to query different terms, i.e., it would be possible to automate the collection of data by intercepting the generated URIs, changing their query strings, and then requesting new results and parsing the collected content. In order to achieve this, CSEHook was developed.
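
The URI reuse itself is plain query-string manipulation. A minimal sketch, assuming the term lives in the q parameter and pages are selected through a start offset of ten results, as observed above:

```python
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse

def replace_query_term(cse_uri: str, new_term: str, page: int = 1) -> str:
    """Reuse an intercepted CSE element URI for a different search term/page."""
    parsed = urlparse(cse_uri)
    params = parse_qs(parsed.query)
    params["q"] = [new_term]                  # swap the searched term
    params["start"] = [str((page - 1) * 10)]  # 10 results per page, 10 pages max
    return urlunparse(parsed._replace(query=urlencode(params, doseq=True)))
```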

As the content to be consumed is the response of a dynamically generated URI — one that depends on the execution of client-side JavaScript to exist — it is useful to use a browser instance in order to generate those interesting URIs. The traditional Python solution to this kind of issue is Selenium; however, Selenium alone is not able to track the secondary network interactions that need to be intercepted. That is why Selenium Wire, an extended version of Selenium that monitors the requests made by the browser instance, was chosen to help catch those URIs.

A Chromedriver is needed so the Selenium Wire library can do its work. This driver should match the installed Google Chrome browser version (version used: 91.0.4472.114 for x86_64).
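
A minimal interception sketch with Selenium Wire is shown below; the CSE page URL is a placeholder for one's own engine, and the "/cse/element/" fragment is an assumption about what the client-side generated URIs look like, to be adjusted to what is actually observed in the intercepted traffic:

```python
import time

from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver  # pip install selenium-wire

# A chromedriver matching the installed Chrome version must be on the PATH.
CSE_PAGE = "https://cse.google.com/cse?cx=YOUR_ENGINE_ID"

options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
driver.get(CSE_PAGE + "&q=test")
time.sleep(5)  # give the page time to fire its background requests

# Selenium Wire records every request made by the page; keep only the
# client-side generated element API calls that carry the search results.
element_uris = [
    request.url for request in driver.requests if "/cse/element/" in request.url
]
driver.quit()
```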

Additionally, the Geonode proxy service was used in order to avoid Google’s detection systems and spread out the requests made to its resources. This was implemented because, during the first implementation tests, it was observed that those URIs had a specific limit on the number of requests that could be sent to them sequentially. Apart from limiting the number of requests made to those dynamic URIs — by re-intercepting those resources from time to time — it was also preferable to diversify the geolocation of the source IP addresses sending those requests.
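
A sketch of how such a proxy list might be pulled; the endpoint and response shape below reflect the public Geonode free proxy list API as observed at the time of writing and are not guaranteed to stay stable:

```python
import requests

# Assumed Geonode free proxy list endpoint; parameters and JSON layout may change.
GEONODE_URL = (
    "https://proxylist.geonode.com/api/proxy-list"
    "?limit=100&page=1&sort_by=lastChecked&sort_type=desc"
)

def fetch_proxies() -> list:
    """Return a list of 'ip:port' strings retrieved from Geonode."""
    data = requests.get(GEONODE_URL, timeout=15).json()
    return [f"{entry['ip']}:{entry['port']}" for entry in data.get("data", [])]
```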

Also, to avoid User-Agent pattern-based detection, a list of Google Chrome User-Agents was taken from tamimibrahim17’s repository. This list was used to randomly choose a User-Agent — and place it in the request headers — just before sending each request to Google’s resources.
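
Something along these lines, assuming the list is fetched straight from the repository (the PoC could just as well ship it locally):

```python
import random

import requests

# Raw version of the Chrome User-Agent list referenced in the References section.
UA_LIST_URL = (
    "https://raw.githubusercontent.com/tamimibrahim17/"
    "List-of-user-agents/master/Chrome.txt"
)

user_agents = [
    line.strip()
    for line in requests.get(UA_LIST_URL, timeout=15).text.splitlines()
    if line.strip().startswith("Mozilla")
]

def random_headers() -> dict:
    """Pick a random Chrome User-Agent for the next request."""
    return {"User-Agent": random.choice(user_agents)}
```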

Finally, to prove that it is possible to surpass the official Google CSE JSON API limitations with the approach described in this article, the Python library named English-Words was chosen so it could be demonstrated that the CSEHook proof of concept can effectively iterate through a whole set of English words in a relatively short time — i.e., searching lowercase English words in order to obtain results from the previously created Programmable Search Engine — without attracting the attention of Google’s detection systems.
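
Obtaining that word set is short work, although the english-words package API has changed across releases; the sketch below tries the newer accessor first and falls back to the older ready-made set:

```python
try:  # newer releases expose an accessor function
    from english_words import get_english_words_set
    words = get_english_words_set(["web2"], lower=True, alpha=True)
except ImportError:  # older releases bundled ready-made sets
    from english_words import english_words_lower_alpha_set as words

print(f"{len(words)} lowercase English words to iterate through")
# Each word would then be handed, one by one, to CSEHook's search method.
```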

All the previously mentioned libraries and resources can be found in the References section, along with links to their respective websites.

PoC Structure

To achieve the data collection goals cited above, the project is structured in the following way:

This structure can be better visualized in the project’s GitHub repository.

Configuration

The configuration file has the following variables in it:

Screenshot of config.py file

Wired Driver

This is the class that interacts with the Selenium Wire library. There is not much to detail: an instance of this class is used to interact with the Google Chrome browser so that the client-side generated URIs can be intercepted. The file structure is illustrated by the image below.

Class WireDriver that generates the browser instance and handles options

CSE Hook

Here is where the main logic is placed. Its explanation will be broken into different fragments in order to detail its functionality.

The class has three internal attributes that do not depend on its initialization:

Among the components initialized by the class, there are the following attributes:

CseHook class pre-initialization and initialized attributes

The next four methods listed inside the class are responsible for the following activities:

The first four methods present in CseHook class

The fifth method is called self._get_response, and its responsibility is to request an endpoint using the specified URI, headers and proxies. If the request raises a timeout or any other exception, it checks whether the number of retries — which is passed as an argument as well — has reached its limit. If this limit has been reached, it reconfigures self._proxy_list with a new proxy list retrieved from Geonode.

The fifth method present in CseHook class
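
A simplified sketch of that retry behaviour is shown below; the names and the retry ceiling are illustrative rather than the PoC's exact values, and proxy renewal is signalled back to the caller instead of mutating an attribute:

```python
import requests

def get_response(uri: str, headers: dict, proxies: dict,
                 retries: int, max_retries: int = 5):
    """Request a CSE element URI; ask for a proxy-list refresh on repeated failure.

    Returns a (response, refresh_proxies) tuple; response is None on failure.
    """
    try:
        return requests.get(uri, headers=headers, proxies=proxies, timeout=10), False
    except requests.RequestException:
        # Once the retry limit is reached, the caller should pull a fresh
        # proxy list from Geonode before trying again.
        return None, retries >= max_retries
```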

The sixth method is named self._search_page. This method receives the arguments URI, query and page and returns the JSON retrieved from the modified client-side generated URI response — i.e., the Google results to be collected for a specific query term.

This is the most complex method of the class, and the one that interacts with most of the other previously declared methods.

While the response has not been received from Google, it will: select a random User-Agent and set it in the request headers; choose a random proxy from self._proxy_list; build the proxies dictionary with the information of the chosen_proxy, so it can be used with the Requests library; try to get a response using the method self._get_response; and, if the response is not satisfactory, pop the chosen_proxy from self._proxy_list, increase the error_count by one and continue the loop. The error_count is the value passed as the retries argument of the self._get_response method.

If a response is received, the method checks whether the status_code of the response equals 403 (HTTPStatus.FORBIDDEN). If it does, it returns a dictionary with the key-value pair illustrated by the following image. If it does not, it finds the JSON inside the client-side generated URI response with the self._cse_regex compiled regex, extracts it and assigns it to api_json. After assigning, it returns api_json.

The sixth method present in CseHook class
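
Putting the previous sketches together, the loop described above might look roughly like the following; random_headers, get_response, replace_query_term, extract_results_json and fetch_proxies are the illustrative helpers from earlier in this article, not the PoC's exact methods:

```python
import random
from http import HTTPStatus

def search_page(cse_uri: str, query: str, page: int, proxy_list: list) -> dict:
    """Query one result page through an intercepted client-side generated URI."""
    uri = replace_query_term(cse_uri, query, page)
    error_count = 0
    while True:
        headers = random_headers()
        chosen_proxy = random.choice(proxy_list)
        proxies = {"http": f"http://{chosen_proxy}",
                   "https": f"http://{chosen_proxy}"}
        response, refresh = get_response(uri, headers, proxies, error_count)
        if response is None:
            proxy_list.remove(chosen_proxy)      # drop the failing proxy
            error_count += 1
            if refresh:
                proxy_list[:] = fetch_proxies()  # renew the Geonode list
                error_count = 0
            continue
        if response.status_code == HTTPStatus.FORBIDDEN:
            return {"error": "blocked by Google"}  # illustrative key-value pair
        return extract_results_json(response.text)
```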

The last method of the class is named self.search, which is the only public method of the class — and the one that is used by __main__. This method receives as arguments query — the term one wants to search using the client-side generated URIs — and renew_cse_uris — a boolean that determines whether the client-side generated URIs should be refreshed.

The responsibilities of this method are the following:

All this logic flow can be better visualized by looking at the following image.

The last method of the CseHook class
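
Without claiming to mirror the real implementation, the public method probably walks the result pages along these lines; intercept_cse_uris is a hypothetical stand-in for the Selenium Wire interception step, and search_page is the sketch above:

```python
import random

def search(query: str, renew_cse_uris: bool, state: dict) -> list:
    """Collect up to ten result pages for a query through intercepted CSE URIs.

    `state` holds the intercepted URIs and the proxy list; the real CseHook
    keeps these as instance attributes instead of a dictionary.
    """
    if renew_cse_uris or not state.get("cse_uris"):
        state["cse_uris"] = intercept_cse_uris()  # hypothetical helper
    pages = []
    for page in range(1, 11):  # up to ten pages per query
        cse_uri = random.choice(state["cse_uris"])
        page_json = search_page(cse_uri, query, page, state["proxy_list"])
        if page_json.get("error"):
            break
        pages.append(page_json)
    return pages
```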

Now that all the attributes and methods of CseHook have been detailed, only __main__ is left to explain.

Main

The __main__ file contains the instantiation of WiredDriver and CseHook, along with a few more things.

The following activities are present in this file:

The details can be better visualized by looking at the following image.

__main__ file content
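
As an assumption-laden illustration only (the actual file is the one shown in the screenshot above), the wiring could look something like this; the constructor and factory signatures are hypothetical, and words is the English word set from the earlier sketch:

```python
# Hypothetical wiring of the pieces; not the repository's exact __main__.
if __name__ == "__main__":
    driver = WiredDriver().browser          # assumed attribute exposing the driver
    cse = CseHook(driver)                   # assumed constructor signature
    for index, word in enumerate(sorted(words)):
        # Renew the intercepted URIs from time to time, as suggested earlier.
        results = cse.search(word, renew_cse_uris=(index % 100 == 0))
        print(word, len(results))
```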

Obtained Results

To prove that these intercepted client-side generated URIs could be explored at a large scale, the code was left running for 24 hours to see how many requests could be made, without interruptions or bans, within that time frame.

Surprisingly enough, even though Geonode proxies were used — and many of them were not even functioning correctly, which reduced the number of iterations/requests that could be made within the time frame — 15,706 requests were made to Google resources — all of them returning real and valid Google results — before the execution was interrupted. This means the PoC would iterate through all the lowercase English words in less than two days, given that the set contains 25,480 words. The following image illustrates it (zoom in to see the details better):

It also means that the approach surpassed by a wide margin the daily quota (10,000 queries/day maximum) allowed by the official Google CSE JSON API — the one without Restricted JSON — and exceeded the free usage limit of that API (100 queries/day maximum) by an even wider one.

Conclusion

The PoC demonstrated that, within a 24-hour time frame, 15,706 requests were made and successfully returned Google CSE page values, using the intercepted client-side generated URIs as a facilitator to obtain results in JSON format. With basic User-Agent randomization, frequent renewal of the client-side generated URIs and proxy rotation, one can avoid Google’s detection mechanisms and consume its data without subscribing to its CSE JSON API fees.

References

Selenium Wire: https://pypi.org/project/selenium-wire/

Requests: https://docs.python-requests.org/en/master/

User-Agents: https://github.com/tamimibrahim17/List-of-user-agents/blob/master/Chrome.txt

Geonode Proxies: https://geonode.com/free-proxy-list

Google CSE: https://programmablesearchengine.google.com/about/

English-Words: https://pypi.org/project/english-words/

Regex101: https://regex101.com/

Custom Search JSON API rules: https://developers.google.com/custom-search/v1/overview
