Web Scraping the Highest-Grossing Movies with Python and Selenium

Understanding Web Scraping

Web scraping refers to the technique of retrieving web pages, extracting valuable data, and organizing that data into a structured format for analysis.

Applications of Web Scraping

Web scraping can serve various purposes, including:

  • Price Comparison: Gathering data from e-commerce sites to analyze product prices.
  • Real Estate: Creating databases for property listings to support market research.
  • Social Media: Collecting trending topics from platforms to gauge public interest.
  • Lead Generation: Extracting contact details for potential outreach to customers.
  • Movie Reviews: Compiling information about films to determine popularity and ratings.

Is Web Scraping Allowed?

In a 2019 ruling in the case between hiQ Labs and LinkedIn, the US Court of Appeals for the Ninth Circuit held that scraping publicly accessible data that is not covered by copyright does not violate federal anti-hacking law.

While some websites permit scraping, others restrict it. To check a website's scraping policies, you can append “/robots.txt” to its URL. Additionally, it's wise to read the terms of service before scraping. Excessive requests may lead to an IP ban, but proxies can help circumvent this.
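
For example, Python's built-in urllib.robotparser module can check a site's robots.txt rules programmatically. A minimal sketch, using Box Office Mojo (the site scraped later in this article) purely as an example target:

from urllib import robotparser

# Point the parser at the site's robots.txt file and download it.
rp = robotparser.RobotFileParser()
rp.set_url('https://www.boxofficemojo.com/robots.txt')
rp.read()

# True means the rules allow this user agent to fetch the given page.
print(rp.can_fetch('*', 'https://www.boxofficemojo.com/chart/top_lifetime_gross/'))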

Roadmap for Scraping Movie Data

This article will guide you through a program that scrapes a webpage listing top-earning movies by analyzing the HTML and gathering necessary information. The following steps will be performed using Python and Selenium:

  1. Install Required Packages and Libraries.
  2. Set Up the Web Driver.
  3. Examine the HTML Structure of the Web Page.
  4. Overview of Selenium Functionality.
  5. Locate and Extract Data Elements.
  6. Create and Visualize a Data Frame.
  7. Export the Data Frame to a CSV File.

The Program

Objective: Extract data elements from a webpage and compile them into a dataset.

Install the Necessary Packages:

!pip install selenium

Import Required Libraries:

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd

Set Up the Web Driver:

!apt-get update
!apt install chromium-chromedriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome(options=chrome_options)

The web driver is essential for Selenium, functioning as a browser automation tool that interacts with web applications.

Examining the HTML Structure

Top 200 Movies HTML Structure

To inspect the data on this webpage, right-click anywhere on the page, choose Inspect to open the browser's developer tools, then activate the element-picker (arrow) icon and click on a title (e.g., Avatar).

Highlighted HTML for Movie Titles

The developer tools pane will highlight the element containing the title (e.g., Avatar). Moving one line up reveals the parent tag, which is what we need in order to extract all of the movie titles.

Extraction Code Example:

movies_names = driver.find_elements(By.XPATH, '//td[@class="a-text-left mojo-field-type-title"]/a[@class="a-link-normal"]')

movie_name_list = []
for movie in range(len(movies_names)):
    movie_name_list.append(movies_names[movie].text)

Overview of Selenium

Selenium is a versatile browser automation framework with Python bindings that enables dynamic web scraping and web automation. It offers features such as:

  • Multi-Browser Support
  • Language Compatibility
  • Dynamic Element Handling
  • Easy Element Identification
  • Performance Efficiency
  • Open Source and Portable

Using XPath, a language for locating elements in HTML, we can select data values based on tags and attributes.

XPath Syntax: //tagname[@attribute='value']

Examples:

  • Locate all divs with class "movie": //div[@class='movie']
  • Locate links with class "drama" inside such a div: //div[@class='movie']/a[@class='drama']
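
These XPath expressions can be passed directly to Selenium's find_elements call. A short sketch using the illustrative class names from the examples above (placeholders, not classes that exist on the Box Office Mojo page):

from selenium.webdriver.common.by import By

# 'movie' and 'drama' are the placeholder class names from the syntax
# examples above, not attributes found on the target site.
movie_divs = driver.find_elements(By.XPATH, "//div[@class='movie']")
drama_links = driver.find_elements(By.XPATH, "//div[@class='movie']/a[@class='drama']")
print(len(movie_divs), len(drama_links))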

Now, let’s extract essential data from the Box Office Mojo site, including Rank, Title, Lifetime Gross, and Year.

Extracting Data Elements

To extract the desired data, we will find all HTML elements associated with specific tags and classes, store the data in lists, and compile them into a data frame.

Launching the Browser:
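
Here the driver opens the page to be scraped. The exact address is assumed to be Box Office Mojo's top lifetime grosses chart, which matches the table described above:

# Navigate the headless browser to the chart page (URL assumed).
driver.get('https://www.boxofficemojo.com/chart/top_lifetime_gross/')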

Extracting Rankings:

movies_rankings = driver.find_elements(By.XPATH, "//td[@class='a-text-right mojo-header-column mojo-truncate mojo-field-type-rank']")
movie_rank_list = []
for movie in range(len(movies_rankings)):
    movie_rank_list.append(movies_rankings[movie].text)

Extracting Movie Names:

movies_names = driver.find_elements(By.XPATH, "//td[@class='a-text-left mojo-field-type-title']/a[@class='a-link-normal']")
movie_name_list = []
for movie in range(len(movies_names)):
    movie_name_list.append(movies_names[movie].text)

Extracting Release Years:

release_year = driver.find_elements(By.XPATH, "//td[@class='a-text-left mojo-field-type-year']/a[@class='a-link-normal']")
release_year_list = []
for year in range(len(release_year)):
    release_year_list.append(release_year[year].text)

Extracting Lifetime Gross Earnings:

lifetime_gross = driver.find_elements(By.XPATH, "//td[@class='a-text-right mojo-field-type-money']")
lifetime_gross_list = []
for i in range(len(lifetime_gross)):
    lifetime_gross_list.append(lifetime_gross[i].text)
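
Before combining these lists, it is worth confirming that they are all the same length, since zip() stops at the shortest list and mismatched lengths would silently drop rows. A quick check using the lists built above:

# zip() truncates to the shortest input, so unequal lengths would
# silently lose rows in the final data frame.
lengths = {
    'ranks': len(movie_rank_list),
    'names': len(movie_name_list),
    'years': len(release_year_list),
    'gross': len(lifetime_gross_list),
}
print(lengths)
assert len(set(lengths.values())) == 1, 'Scraped lists have different lengths'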

Creating and Displaying the Data Frame

Now we can combine our lists into a data frame and display the results:

data = list(zip(movie_rank_list, movie_name_list, release_year_list, lifetime_gross_list))
df = pd.DataFrame(data, columns=['Rank', 'Movie Name', 'Release Year', 'Lifetime Earnings'])
print(df.head(10))

Data Frame of Top Movies

Exporting to a CSV File

To save our data frame as a CSV file, we can execute:

df.to_csv('top_200_movies_with_lifetime_gross.csv', index=False)

You can read back the CSV file with:

top_movies = pd.read_csv('top_200_movies_with_lifetime_gross.csv', lineterminator='\n')

CSV File of Top Movies

Video Tutorial: Scraping Data using Beautiful Soup

This video tutorial demonstrates how to scrape data from IMDB's top movies using Beautiful Soup and Python.

Video Tutorial: Browser Automation & Web Scraping with Selenium

This video covers browser automation and web scraping using Selenium and Beautiful Soup.
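
As a rough illustration of how Beautiful Soup can complement the Selenium approach above, the page source already rendered by the driver can be handed to Beautiful Soup for parsing; the cell extraction below is generic rather than tied to any particular page layout:

from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

# Parse the HTML that the headless browser has already rendered.
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Illustrative only: collect the text of every table cell on the page.
cells = [td.get_text(strip=True) for td in soup.find_all('td')]
print(cells[:10])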

Thank you for reading! If you have any feedback or questions, feel free to leave a comment below.
