I am trying to scrape the SEC report page to pull some basic info on a number of tickers.
Here is an example URL for Apple - https://sec.report/CIK/0000320193
Within the page is a 'Company Details' table which includes basic info. I am essentially just trying to scrape the IRS Number, State of Incorporation and Address.
I am fine with just scraping this table and saving it into a pandas DataFrame. I am very new to web scraping, so I'm looking for some tips to make this work! Below is my code, but I don't know where to go once I extract the panel body. Thanks guys!
import requests
from bs4 import BeautifulSoup

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
page = session.get('https://sec.report/CIK/0000051143.html', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all(class_='panel-body')
Instead of BeautifulSoup, try the lxml package; for me it's easier to find elements with XPath expressions:
import requests
from lxml import html
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
page = session.get('https://sec.report/CIK/0000051143', headers=headers)
raw_html = html.fromstring(page.text)
irs = raw_html.xpath('//tr[./td[contains(text(),"IRS Number")]]/td[2]/text()')[0]
state_incorp = raw_html.xpath('//tr[./td[contains(text(),"State of Incorporation")]]/td[2]/text()')
address = raw_html.xpath('//tr[./td[contains(text(),"Business Address")]]/td[2]/text()')[0]
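To land these in a pandas DataFrame, as the question asks, here is a minimal sketch building on the three variables above (note that state_incorp is a list, since that XPath result wasn't indexed):
import pandas as pd

# Build a one-row DataFrame from the scraped fields
company_details = pd.DataFrame([{
    'IRS Number': irs,
    'State of Incorporation': state_incorp[0] if state_incorp else None,
    'Business Address': address,
}])
print(company_details)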
I would like to scrape the last odds in the archive from this page https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/ but I can't get it with requests. How can I get it without using Selenium?
To trigger the archive-odds request in the Developer Tools, I need to hover over the odds.
Code
url = "https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/14/"
headers = {
    "Referer": "https://www.betexplorer.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
}
Json = requests.get(url, headers=headers).json()
As the site is loaded by JavaScript, requests alone doesn't work. I used Selenium to load the page and extract the complete source code after everything has loaded.
Then I used BeautifulSoup to create a soup object and get the required data.
From the source code you can see that the data-bid attributes of the <tr> elements are what get passed to fetch the odds data.
I extracted all the data-bid values and passed them one by one to the URL you provided at the very end of your question.
This code will get all the odds data in JSON format:
import time
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
base_url = 'https://www.betexplorer.com/soccer/estonia/esiliiga/elva-flora-tallinn/Q9KlbwaJ/'
driver = webdriver.Chrome()
driver.get(base_url)
time.sleep(5)  # give the JavaScript-rendered table time to load
soup = BeautifulSoup(driver.page_source, 'html.parser')
t = soup.find('table', attrs={'id': 'sortable-1'})
trs = t.find('tbody').findAll('tr')
for i in trs:
    data_bid = i['data-bid']
    url = f"https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/{data_bid}/"
    headers = {"Referer": "https://www.betexplorer.com", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"}
    Json = requests.get(url, headers=headers).json()
    # Do what you wish to do with the JSON data here....
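One optional refinement: since the loop sends one request per data-bid with the same headers every time, a requests.Session set up once outside the loop reuses the connection. A minimal sketch of the same loop:
session = requests.Session()
session.headers.update({
    "Referer": "https://www.betexplorer.com",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36",
})
for tr in trs:
    # One archive-odds request per match row, identified by its data-bid
    odds = session.get(f"https://www.betexplorer.com/archive-odds/4l4ubxv464x0xc78lr/{tr['data-bid']}/").json()
    print(odds)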
I have code to collect all of the URLs from the "oddsportal" website for a page:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
source = requests.get("https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/",headers=headers)
soup = BeautifulSoup(source.text, 'html.parser')
main_div=soup.find("div",class_="main-menu2 main-menu-gray")
a_tag=main_div.find_all("a")
for i in a_tag:
    print(i['href'])
which returns these results:
/soccer/africa/africa-cup-of-nations/results/
/soccer/africa/africa-cup-of-nations-2019/results/
/soccer/africa/africa-cup-of-nations-2017/results/
/soccer/africa/africa-cup-of-nations-2015/results/
/soccer/africa/africa-cup-of-nations-2013/results/
/soccer/africa/africa-cup-of-nations-2012/results/
/soccer/africa/africa-cup-of-nations-2010/results/
/soccer/africa/africa-cup-of-nations-2008/results/
I would like the URLs to be returned as:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/
for all the parent URLs generated for the results.
I can see from inspect element that the URLs can be appended, under the div with id="pagination".
The data under id="pagination" is loaded dynamically, so requests alone won't see it.
However, you can get the table for all those pages (1-3) by sending a GET request to:
https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={timestamp}
where {page} corresponds to the page number (1-3) and {timestamp} is the current time.
You'll also need to add:
"Referer": "https://www.oddsportal.com/"
to your headers.
Also, use the lxml parser instead of html.parser to avoid a RecursionError.
import re
import requests
from datetime import datetime
from bs4 import BeautifulSoup
headers = {
    "User-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Referer": "https://www.oddsportal.com/",
}
with requests.Session() as session:
    session.headers.update(headers)
    for page in range(1, 4):
        response = session.get(
            f"https://fb.oddsportal.com/ajax-sport-country-tournament-archive/1/MN8PaiBs/X0/1/0/{page}/?_={datetime.now().timestamp()}"
        )
        # The endpoint wraps the table HTML in a JSON-like shell; pull out the html field
        table_data = re.search(r'{"html":"(.*)"}', response.text).group(1)
        soup = BeautifulSoup(table_data, "lxml")
        print(soup.prettify())
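If what you ultimately want is the absolute URLs rather than the relative hrefs, the standard library's urllib.parse.urljoin can build them; a minimal sketch against the soup from the loop above:
from urllib.parse import urljoin

# Resolve each relative href against the site root
for a in soup.find_all("a", href=True):
    print(urljoin("https://www.oddsportal.com/", a["href"]))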
After watching a video, I tried to fetch the price for an item from the amazon.de website using the BeautifulSoup API.
# My code
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Neues-Apple-iPhone-Pro-128-GB/dp/B08L5SNWD2/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=3UH87RWLLO40E&dchild=1&keywords=iphone+12+pro&qid=1605603669&sprefix=Iphone+12%2Caps%2C175&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzRjAxN0xWNTk0TVpYJmVuY3J5cHRlZElkPUEwNzE4ODIxMktCWlhJMVlHWDFNMyZlbmNyeXB0ZWRBZElkPUExMDMwODk2Tk5OVkdZRTJISDVMJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
#I tried other parsing methods too: 'html.parser', 'html5lib'. Not helpful
title = soup.find(id="productTitle").get_text()
price = soup.find(id='priceblock_ourprice')
print(title) #returns correct string from the URL above
print(price)
#returns None. Unexpected. Expecting the price with some extensions from <span id="priceblock_ourprice">
If anyone can spot what's wrong in my code, that would be really helpful.
Thanks in advance!
I cannot reproduce the None; the code works fine for me. I just added get_text() to the price and strip() to both variables to make the result a little cleaner.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Neues-Apple-iPhone-Pro-128-GB/dp/B08L5SNWD2/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=3UH87RWLLO40E&dchild=1&keywords=iphone+12+pro&qid=1605603669&sprefix=Iphone+12%2Caps%2C175&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzRjAxN0xWNTk0TVpYJmVuY3J5cHRlZElkPUEwNzE4ODIxMktCWlhJMVlHWDFNMyZlbmNyeXB0ZWRBZElkPUExMDMwODk2Tk5OVkdZRTJISDVMJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Cache-Control': 'no-cache'
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
#I tried other parsing methods too: 'html.parser', 'html5lib'. Not helpful
title = soup.find(id="productTitle").get_text().strip()
# to prevent the script from crashing when there isn't a price for the product
try:
    price = soup.find(id='priceblock_ourprice').get_text().strip()
    # trim the price string by slicing (note: the result is still a string, not a float)
    convertedPrice = price[:8]
except AttributeError:
    price = 'not loaded'
    convertedPrice = 'not loaded'
print(title) #returns correct string from the URL above
print(price)
print(convertedPrice)
Output
Neues Apple iPhone 12 Pro (128 GB) - Graphit
1.120,00 €
1.120,00
But
As @Chase mentioned, if the content is dynamically generated, you may give Selenium a try: it can handle the load with its waits. By adding a delay, you can wait until the page and its dynamically generated content have loaded, and then grab your information.
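For reference, a minimal sketch of such an explicit wait with Selenium, reusing the URL variable from above and assuming the price element keeps the priceblock_ourprice id:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get(URL)
# Wait up to 10 seconds for the price element to appear in the DOM
price_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "priceblock_ourprice"))
)
print(price_element.text)
driver.quit()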
I am trying to extract the value marked in the picture into a variable, but it seems that when it is within Vue components, bs4 is not searching the way I expect. Can anyone point me in the general direction as to how I could extract the value from this document in Python?
Code is found below the picture; thanks in advance.
import requests
from bs4 import BeautifulSoup
URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())
div_list = soup.find_all(class_='value')
print(div_list)
Since the page returns a JSON response, you don't need BeautifulSoup to parse it.
import requests

URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
response = requests.get(URL, headers=headers)
dict_of_response = response.json()  # parse the JSON body directly
obj_list = dict_of_response['data']['segments']
print(obj_list)
The obj_list variable now contains a list of dicts. Those dicts contain the data you want; now you only need to loop through the list and do what you want with the data.
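For example, a short sketch of that loop; the keys inside each segment ('metadata' and 'stats' here) are an assumption about the tracker.gg response shape, so print one segment first to confirm:
for segment in obj_list:
    # 'metadata' and 'stats' are assumed key names -- inspect the real response to verify
    print(segment.get('metadata'))
    print(segment.get('stats'))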
My current Python script scrapes 2 pages of the website in about one second. I want to make it go slower, like 25 seconds per page. How do I do that?
I tried the following Python script.
# Dependencies
from bs4 import BeautifulSoup
import requests
import pandas as pd
# Testing
linked = 'https://www.zillow.com/homes/for_sale/San-Francisco-CA/fsba,fsbo,fore,new_lt/house_type/20330_rid/globalrelevanceex_sort/37.859675,-122.285557,37.690612,-122.580815_rect/11_zm/{}_p/0_mmm/'
for link in [linked.format(page) for page in range(1,2)]:
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    headers = {'User-Agent': user_agent}
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)
What should I add to my script to make the web scraping go slower?
Just use time.sleep:
import requests
from time import sleep
from bs4 import BeautifulSoup

linked = 'https://www.zillow.com/homes/for_sale/San-Francisco-CA/fsba,fsbo,fore,new_lt/house_type/20330_rid/globalrelevanceex_sort/37.859675,-122.285557,37.690612,-122.580815_rect/11_zm/{}_p/0_mmm/'
for link in [linked.format(page) for page in range(1,2)]:
    sleep(25.0)  # pause 25 seconds before each request
    user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    headers = {'User-Agent': user_agent}
    response = requests.get(link, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)
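If a fixed 25-second pause feels too mechanical, you can also jitter the delay with random.uniform from the standard library, for example:
from random import uniform
from time import sleep

# Sleep a random 20-30 seconds instead of exactly 25
sleep(uniform(20.0, 30.0))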