I’m teaching myself machine learning and A.I. When I started with TensorFlow for transfer learning, I needed some images for my training set. The images are easy to find with Google Image Search, but I found no way to download all of the results to my local machine. There are some add-ons for browsers such as Google Chrome and Firefox, but they don’t really work: they crash most of the time when I start downloading. So I had to write a small tool to do the job myself. In this blog post, I would like to show you how I made it with Python.
1. Use Google Custom Search API
Google provides the Google Custom Search API so that we can integrate its search engine into our apps. What we need is an API key and a Search Engine Id, both of which I already covered in a separate post.
1.2 Install Google API Python Client
Once you have an API key and a Search Engine Id for your Google Custom Search, install the Google API Python Client package. I’m using Windows and the Anaconda package manager. On other systems, just use standard pip to install the package.
conda install -c conda-forge google-api-python-client=1.6.2
1.3 Supported arguments
The Google Custom Search script supports 3 arguments: query text, destination folder and number of images.

    def main(argv):
        """Main"""
        gcs = GoogleCustomSearch()
        gcs.count, gcs.folder, gcs.query = gcs.parse_args(argv)
        gcs.search()

For example, to download the first 10 images found for “Darth Vader” to the folder C:\temp\StarWars\DarthVader, the syntax is
python gcse.py -q "darth vader" -f "C:\temp\StarWars\DarthVader" -c 10
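The parse_args method comes from the HDBase base class, which is not shown in this post. As a rough idea of what such a method could look like, here is a minimal sketch using the standard argparse module. The long option names, the default count and the assumption that argv is the list after the script name are mine and not taken from the original code.

```python
import argparse


def parse_args(argv):
    """Parse -q, -f and -c and return (count, folder, query).

    Hypothetical stand-in for HDBase.parse_args; argv is assumed to be
    the argument list without the script name (e.g. sys.argv[1:]).
    """
    parser = argparse.ArgumentParser(prog="gcse.py")
    parser.add_argument("-q", "--query", required=True, help="query text")
    parser.add_argument("-f", "--folder", required=True, help="destination folder")
    parser.add_argument("-c", "--count", type=int, default=10, help="number of images")
    args = parser.parse_args(argv)
    # Same order as the tuple unpacked in main().
    return args.count, args.folder, args.query
```

Called with the arguments from the example above, it returns the count as an integer and the two strings unchanged.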
The class GoogleCustomSearch is initialized with the API key and Search Engine Id obtained in the step above.

    class GoogleCustomSearch(HDBase):
        """Google Custom Search"""

        def __init__(self, usage_text=""):
            usage_text = "python gcse.py -q <query> -f <destination folder> -c 100"
            self.api_key = "AIzaSyDQ92Dx35mWmYWEmBdCqBQnkfgdxpCKF-w"
            self.search_engine_id = "003470263288780838160:ty47piyybua"
            HDBase.__init__(self, usage_text)

1.5 Search and download
After initializing GoogleCustomSearch with the correct API key and Search Engine Id, we can run the query page by page.

    from googleapiclient.discovery import build

    # Both functions below are methods of the GoogleCustomSearch class.
    def download_links(self, response):
        """Download files"""
        for item in response["items"]:
            if "pagemap" in item:
                page_map = item["pagemap"]
                if "cse_image" in page_map:
                    link = page_map["cse_image"]["src"]
                    self.download_link(link)

    def search(self):
        """Search"""
        page_size = 10  # results returned per request
        start = 1       # index of the first result of the current page
        service = build("customsearch", "v1", developerKey=self.api_key)
        while start < self.count:
            response = service.cse().list(
                q=self.query,
                cx=self.search_engine_id,
                start=start
            ).execute()
            self.download_links(response)
            # Move to the next page, or stop once the requested count is reached.
            if self.count - start < page_size:
                start += self.count - start
            else:
                start += page_size

The response comes back as JSON, so we can access it as key-value items. Just loop through the list, extract the link of each image and download it to your local folder.
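To make the paging and the link extraction easier to follow without calling the API, here is a small offline sketch. The function names page_starts and extract_image_links are my own, and the sample response in the usage note is made up. I also assume that cse_image may arrive either as a single object or as a list of objects, so the sketch accepts both shapes. It is a variation on the loop above, not a copy of it.

```python
def page_starts(count, page_size=10):
    """Return the 'start' index of every page needed to cover count results."""
    return list(range(1, count + 1, page_size))


def extract_image_links(response):
    """Collect the cse_image 'src' links from a Custom Search style response dict."""
    links = []
    for item in response.get("items", []):
        cse_image = item.get("pagemap", {}).get("cse_image")
        if cse_image is None:
            continue
        # Assumption: accept both a single dict and a list of dicts.
        entries = cse_image if isinstance(cse_image, list) else [cse_image]
        for entry in entries:
            if "src" in entry:
                links.append(entry["src"])
    return links
```

For 25 images, page_starts(25) gives the start values 1, 11 and 21, and extract_image_links simply skips every item that has no pagemap or no cse_image entry.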
However, the problem with Google Custom Search is that the images we get are not the same as the ones we see when we search in the browser, because the browser and the Google Custom Search Engine use different settings. Depending on these settings, Google returns different results. I don’t know how to configure them so that I receive the same results as in the browser, so Google Custom Search doesn’t give me what I want. In the next section, I will show you how to use Selenium to simulate browser behavior: automating the search, scrolling and collecting the links to the images.
2. Use Selenium with Firefox
Download the latest version of Firefox from its homepage and get the latest Gecko driver. Then copy the Gecko driver to the same folder as your Python file.
2.2 Install Selenium
Install the Selenium package.
conda install -c conda-forge selenium=3.4.2
2.3 Supported arguments
The Selenium script supports 4 arguments: query text, destination folder, number of images and file extensions.

    def main(argv):
        """Main"""
        gcs = Selenium()
        gcs.count, gcs.extension, gcs.folder, gcs.query = gcs.parse_args(argv)
        gcs.search()

For example, to download the first 10 JPEG images found for “Darth Vader” to the folder C:\temp\StarWars\DarthVader, the syntax is
python gsel.py -q "darth vader" -f "C:\temp\StarWars\DarthVader" -c 10 -e ".jpg;.jpeg"
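The post does not show how the -e value is applied, so the following is only a sketch of one way to do it. It assumes the extensions are separated by semicolons as in the command above and that a URL is accepted when the extension of its path is in that list. The function names parse_extensions and has_allowed_extension are hypothetical.

```python
import os
from urllib.parse import urlparse


def parse_extensions(value):
    """Turn a string such as '.jpg;.jpeg' into a set of lower-case extensions."""
    return {ext.strip().lower() for ext in value.split(";") if ext.strip()}


def has_allowed_extension(url, extensions):
    """Return True if the URL path ends with one of the allowed extensions."""
    if not extensions:
        # No filter given: accept every link.
        return True
    path = urlparse(url).path  # drops the query string and fragment
    return os.path.splitext(path)[1].lower() in extensions
```

With the value from the example, a link such as http://example.com/vader.JPG?w=100 passes the check, while a .png link is skipped.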
2.4 Search and download
We’ll use Selenium to simulate what happens in the browser: browse to Google Image Search, run the query, scroll down to load all images and click the “Show more results” button to see the full search result.

    import json
    import time

    from selenium import webdriver
    from selenium.common.exceptions import (
        ElementNotInteractableException, NoSuchElementException)

    # Method of the Selenium class.
    def search(self):
        """Search"""
        url = ("https://www.google.com/search?q=" + self.query
               + "&source=lnms&tbm=isch")
        # caps = webdriver.DesiredCapabilities().FIREFOX
        # caps["marionette"] = False
        # driver = webdriver.Firefox(capabilities=caps)
        driver = webdriver.Firefox()
        driver.get(url)
        self.count_downloaded = 0
        while self.count_downloaded < self.count:
            # Scroll down several times so that more results are loaded.
            for scroll in range(10):
                driver.execute_script("window.scrollBy(0,1000000)")
                time.sleep(0.2)
            time.sleep(0.5)
            # Each rg_meta div holds JSON metadata; "ou" is the original image URL.
            images = driver.find_elements_by_xpath("//div[@class='rg_meta']")
            for image in images:
                if self.count_downloaded >= self.count:
                    break
                image_url = json.loads(image.get_attribute("innerHTML"))["ou"]
                self.download_link(image_url)
            # find_element_by_xpath raises if the button is missing, so catch
            # that case as well as a button that cannot be clicked yet.
            try:
                driver.find_element_by_xpath("//input[@id='smb']").click()
            except (NoSuchElementException, ElementNotInteractableException):
                pass
        driver.quit()

When the code runs, Firefox is launched, goes to Google Image Search, runs the query, scrolls down and clicks the button automatically. At the end, the links to the images are extracted and the images are downloaded to the folder you defined. The download takes a while, depending on how fast your internet connection is.
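The download_link method is not shown in this post, so here is a rough sketch of the pieces around it, using only the standard library. All function names, the file naming scheme with a running index and the URL encoding of the query are my own assumptions, not the code from the repository.

```python
import json
import os
from urllib.parse import quote_plus, urlparse
from urllib.request import urlretrieve


def build_search_url(query):
    """Build the Google Image Search URL with the query URL-encoded."""
    return "https://www.google.com/search?q=" + quote_plus(query) + "&source=lnms&tbm=isch"


def original_url(meta_html):
    """Read the original image URL ('ou') from the JSON text of an rg_meta div."""
    return json.loads(meta_html)["ou"]


def target_path(folder, url, index):
    """Derive a local file name from the URL, prefixed with a running index."""
    name = os.path.basename(urlparse(url).path) or "image"
    return os.path.join(folder, "{:04d}_{}".format(index, name))


def download_link(url, folder, index):
    """Download one image into the folder and return the local path."""
    os.makedirs(folder, exist_ok=True)
    path = target_path(folder, url, index)
    urlretrieve(url, path)  # performs the network request
    return path
```

The running index keeps two images with the same file name from overwriting each other. A real tool would probably also want a timeout and some error handling around the download, because individual links can be dead.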
3. Source code
The full source code is available at Bitbucket: https://bitbucket.org/hintdesk/python-google-search-image
4.1 Update 11.10.2017
– Updated to the new geckodriver v0.19.0 https://github.com/mozilla/geckodriver/releases
– Updated the algorithm for getting original image URLs from Google Search