Image Web Scraping Using Python

Read Time: 4 min

In this post, you will learn how to collect images directly from Google, i.e. “image scraping”, using Python.

What is the most important ingredient when you want to solve a real-world problem using deep learning or machine learning?

Data, of course!

“Data collection” is a huge pain, especially when you need to train a deep learning model.

And if the data consists of images, it becomes even more critical: the more images you provide for training, the better the chances of good predictions, because the model will be able to generalize well.

If you create your own dataset, that is a different story. But what if you need to collect the data from the internet?

In that case you have two choices: either do it manually, or do it the smarter way, i.e. write a Python script to automate the process of downloading the images.

So, what do you prefer?

“Smarter way, right?”

What is web scraping?

Web scraping is the extraction or retrieval of data from a website. It is a powerful technique for reducing the mundane effort of collecting data manually.
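As a tiny illustration of the idea (separate from the image pipeline in this post), here is a minimal sketch that fetches a page and pulls out its title using only the standard library; the URL is just an example:

# Minimal web-scraping sketch: fetch a page and extract its <title> tag.
# 'https://example.com' is a placeholder URL used purely for illustration.
import re
import urllib.request

html = urllib.request.urlopen('https://example.com').read().decode('utf-8')
match = re.search(r'<title>(.*?)</title>', html, re.IGNORECASE | re.DOTALL)
print(match.group(1) if match else 'No title found')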

Steps for image scraping using Python

  • Install Google Chrome first.

(Skip this step if it is already installed.)

  • Identify your Chrome version.

To find the Chrome version on your system, open Chrome’s settings and check “About Chrome”:

(Screenshot: the Google Chrome version as shown under “About Chrome”.)
  • Download the ChromeDriver compatible with your Chrome version from the link below.

Link: https://chromedriver.chromium.org/downloads

  • Now install the Selenium Python package by running the following command in a Jupyter notebook.
!pip install selenium
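To verify the installation, you can print the package version. Note that the snippets in this post use the older Selenium 3 style API (executable_path, find_elements_by_css_selector), which was removed in Selenium 4, so a 3.x release such as 3.141.0 is assumed here:

# Quick sanity check: confirm Selenium is importable and see which version is installed.
# A Selenium 3.x release is assumed for the code in this post.
import selenium
print(selenium.__version__)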

What is Selenium?

Selenium is a tool for automating web applications, primarily for testing purposes. The Selenium Python package lets you drive web browser interaction from Python.

  • Once Selenium is installed, check that the web driver can start and close: start the web driver using the following code snippet, load the Google home page with wd.get(), and close the driver.
from selenium import webdriver

# Put the path to your ChromeDriver here
DRIVER_PATH = r'C:/Windows/chromedriver_win32/chromedriver.exe'

wd = webdriver.Chrome(executable_path=DRIVER_PATH)  # start Chrome
wd.get('https://google.com')                        # load the Google home page
wd.quit()                                           # close the browser

Once the driver starts, you should see a Chrome window open with a notice that it is being controlled by automated test software.

The steps above are just preliminary checks to make sure everything is working.
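Optionally, if you would rather not have a browser window pop up during scraping, Chrome can run headless. This is just a sketch using standard Chrome options (reusing the DRIVER_PATH defined above); the rest of the post does not depend on it:

# Optional: run Chrome headless, i.e. without opening a visible window
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # no visible browser window

wd = webdriver.Chrome(executable_path=DRIVER_PATH, options=options)
wd.get('https://google.com')
print(wd.title)  # the page title prints even without a visible window
wd.quit()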

How to scrape the images?

Now, just think: if you want to download an image from Google, how do you do it?

You first go to the Google page, type in your query, and then download the image.

The Python script does exactly the same.

Steps will be:

  • Start the web driver
  • Pass your Google query into the search_url format string
  • Collect all candidate results as image URLs
  • Check whether the required number of image URLs has been collected

Execute the following code for above tasks:

import time
from selenium import webdriver

def fetch_image_urls(query: str, max_links_to_fetch: int, wd: webdriver.Chrome, sleep_between_interactions: int = 1):
    def scroll_to_end(wd):
        # scroll to the bottom of the page so more thumbnails load
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)

    # build the google query
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # load the page
    wd.get(search_url.format(q=query))

    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # get all image thumbnail results
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)

        print(f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}")

        for img in thumbnail_results[results_start:number_results]:
            # try to click every thumbnail so that we can get the real image behind it
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # extract the URLs of the full-size images
            actual_images = wd.find_elements_by_css_selector('img.n3VNCb')
            for actual_image in actual_images:
                if actual_image.get_attribute('src') and 'http' in actual_image.get_attribute('src'):
                    image_urls.add(actual_image.get_attribute('src'))

            image_count = len(image_urls)

            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
        else:
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(30)
            # click the "Show more results" button if it is present
            load_more_button = wd.find_elements_by_css_selector(".mye4qd")
            if load_more_button:
                wd.execute_script("document.querySelector('.mye4qd').click();")

        # move the result start point further down
        results_start = len(thumbnail_results)

    return image_urls
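As a quick illustration of this function on its own (the query “dog” and the count of 10 are arbitrary examples):

# Example: fetch a handful of image URLs for a single query
wd = webdriver.Chrome(executable_path=DRIVER_PATH)
urls = fetch_image_urls('dog', 10, wd=wd, sleep_between_interactions=1)
wd.quit()
print(f"Collected {len(urls)} image URLs")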
  • Next, use the persist_image function to download every image from the collected URLs and store them in a folder.
import io
import os
import hashlib
import requests
from PIL import Image

def persist_image(folder_path: str, url: str):
    try:
        # download the raw image bytes
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return

    try:
        # decode the bytes and save as a JPEG named by a hash of its content
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')
        file_path = os.path.join(folder_path, hashlib.sha1(image_content).hexdigest()[:10] + '.jpg')
        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
  • Finally, use the search_and_download function, to which you pass the search keyword, the folder path where you want to store the images, and the number of images you want to download.
def search_and_download(search_term: str, driver_path: str, target_path=r'C:/Users/Desktop/Untitled Folder/', number_images=60):
    # one sub-folder per search term, e.g. "german shepherd" -> .../german_shepherd
    target_folder = os.path.join(target_path, '_'.join(search_term.lower().split(' ')))

    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    with webdriver.Chrome(executable_path=driver_path) as wd:
        res = fetch_image_urls(search_term, number_images, wd=wd, sleep_between_interactions=1)

    for elem in res:
        persist_image(target_folder, elem)

Now all you need to do is call the search_and_download function and pass your search query.

search_term = 'Dog'
search_and_download(search_term=search_term, driver_path=DRIVER_PATH)
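If you want to build a dataset with several categories, you can simply loop over search terms; the terms and image count here are only an example:

# Hypothetical example: collect images for a small cats-vs-dogs dataset
for term in ['dog', 'cat']:
    search_and_download(search_term=term, driver_path=DRIVER_PATH, number_images=30)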

You can download the code from the following GitHub link:

Note:

While downloading images, keep in mind that some of them are copyrighted. Make sure that the images you download do not violate any terms of service, and that your scraping does not negatively affect the website you are scraping from.