Scraping Table With Python/BS4 - python

I'm trying to scrape the "Team Stats" table from http://www.pro-football-reference.com/boxscores/201602070den.htm with BS4 and Python 2.7, but I'm unable to get anywhere close to it:
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url)
soup = BeautifulSoup(page.text, "html5lib")
table=soup.findAll('table', {'id':"team_stats", "class":"stats_table"})
print table
I thought something like the above code would work, but no luck.

The problem in this case is that the "Team Stats" table is located inside a comment in the HTML source which you download with requests. Locate the comment and reparse it with BeautifulSoup into a "soup" object:
import requests
from bs4 import BeautifulSoup, NavigableString
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
soup = BeautifulSoup(page.content, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)
soup = BeautifulSoup(comment, "html5lib")
table = soup.find("table", id="team_stats")
print(table)
Alternatively, you can load the table into a pandas DataFrame, which is very convenient to work with:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from bs4 import NavigableString
url = 'http://www.pro-football-reference.com/boxscores/201602070den.htm'
page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'})
soup = BeautifulSoup(page.content, "html5lib")
comment = soup.find(text=lambda x: isinstance(x, NavigableString) and "team_stats" in x)
df = pd.read_html(comment)[0]
print(df)
Prints:
            Unnamed: 0            DEN            CAR
0          First Downs             11             21
1         Rush-Yds-TDs        28-90-1       27-118-1
2    Cmp-Att-Yd-TD-INT  13-23-141-0-1  18-41-265-0-1
3         Sacked-Yards           5-37           7-68
4       Net Pass Yards            104            197
5          Total Yards            194            315
6         Fumbles-Lost            3-1            4-3
7            Turnovers              2              4
8      Penalties-Yards           6-51         12-102
9     Third Down Conv.           1-14           3-15
10   Fourth Down Conv.            0-0            0-0
11  Time of Possession          27:13          32:47
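To experiment with the comment trick without hitting the site, here is a minimal offline sketch. It swaps BeautifulSoup for the standard library's html.parser so that it runs with nothing installed. The PAGE string, the all_team_stats wrapper id, and the names CommentCollector, TableRows and table_rows_from_comment are all made up for illustration, and only two of the table's rows are reproduced.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the downloaded page: the table exists only inside a comment.
PAGE = """
<html><body>
<div id="all_team_stats">
<!--
<table id="team_stats" class="stats_table">
<tr><th></th><th>DEN</th><th>CAR</th></tr>
<tr><th>First Downs</th><td>11</td><td>21</td></tr>
<tr><th>Total Yards</th><td>194</td><td>315</td></tr>
</table>
-->
</div>
</body></html>
"""


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment in a document."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def table_rows_from_comment(page, marker):
    # First pass: a normal tag search cannot see the table, so gather the comments.
    collector = CommentCollector()
    collector.feed(page)
    comment = next(c for c in collector.comments if marker in c)
    # Second pass: parse the comment body as HTML in its own right.
    table = TableRows()
    table.feed(comment)
    return table.rows


rows = table_rows_from_comment(PAGE, "team_stats")
for row in rows:
    print(row)
```

This prints three lists, the first being ['', 'DEN', 'CAR']. The two-pass structure is the same as in the answer: find the comment that mentions the table id, then parse that comment's text as a separate document.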

Related

How do I grab a title and link from a website?

I'm currently working on a web scraper on this website (https://www.allabolag.se). I would like to grab the title and link to every result on the page, and I'm currently stuck.
<a data-v-4565614c="" href="/5566435201/grenspecialisten-forvaltning-ab">Grenspecialisten Förvaltning AB</a>
This is an example from the website. I would like to grab the href and the link text (Grenspecialisten Förvaltning AB), since they contain the link and the title. How would I go about doing that?
The code I have currently looks like this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15'}
url = 'https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
questions = soup.findAll('div', {'class': 'tw-flex'})
for item in questions:
    title = item.find('a', {''}).text
    print(title)
Any help would be greatly appreciated!
Best regards :)
The results are embedded in the page as JSON, inside an attribute value. To decode them, you can use the following example:
import json
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15"
}
url = "https://www.allabolag.se/bransch/bygg-design-inredningsverksamhet/6/_/xv/BYGG-,%20DESIGN-%20&%20INREDNINGSVERKSAMHET/xv/JURIDIK,%20EKONOMI%20&%20KONSULTTJÄNSTER/xl/12/xb/AB/xe/4/xe/3"
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
# the JSON lives in the ":search-result-default" attribute of one element
data = json.loads(
    soup.find(attrs={":search-result-default": True})[":search-result-default"]
)
# uncomment to print all data:
# print(json.dumps(data, indent=4))
for result in data:
    print(
        "{:<50} {}".format(
            result["jurnamn"], "https://www.allabolag.se/" + result["linkTo"]
        )
    )
Prints:
Grenspecialisten Förvaltning AB                    https://www.allabolag.se/5566435201/grenspecialisten-forvaltning-ab
Peab Fastighetsutveckling Syd AB                   https://www.allabolag.se/5566998430/peab-fastighetsutveckling-syd-ab
BayWa r.e. Nordic AB                               https://www.allabolag.se/5569701377/baywa-re-nordic-ab
Kronetorp Park Projekt AB                          https://www.allabolag.se/5567196539/kronetorp-park-projekt-ab
SVENSKA HUSCOMPAGNIET AB                           https://www.allabolag.se/5568155583/svenska-huscompagniet-ab
Byggnadsaktiebolaget Gösta Bengtsson               https://www.allabolag.se/5561081869/byggnadsaktiebolaget-gosta-bengtsson
Tectum Byggnader AB                                https://www.allabolag.se/5562903582/tectum-byggnader-ab
Winthrop Engineering and Contracting AB            https://www.allabolag.se/5592128176/winthrop-engineering-and-contracting-ab
SPI Global Play AB                                 https://www.allabolag.se/5565082897/spi-global-play-ab
Trelleborg Offshore & Construction AB              https://www.allabolag.se/5560557711/trelleborg-offshore-construction-ab
M.J. Eriksson Entreprenad AB                       https://www.allabolag.se/5567043814/mj-eriksson-entreprenad-ab
Solix Group AB                                     https://www.allabolag.se/5569669574/solix-group-ab
Gripen Betongelement AB                            https://www.allabolag.se/5566646427/gripen-betongelement-ab
BLS Construction AB                                https://www.allabolag.se/5569814345/bls-construction-ab
We Construction AB                                 https://www.allabolag.se/5590705116/we-construction-ab
Helsingborgs Fasad & Kakel AB                      https://www.allabolag.se/5567814248/helsingborgs-fasad-kakel-ab
Gat & Kantsten Sverige AB                          https://www.allabolag.se/5566564919/gat-kantsten-sverige-ab
Bjärno Byggsystem AB                               https://www.allabolag.se/5566743190/bjarno-byggsystem-ab
Bosse Sandells Bygg Aktiebolag                     https://www.allabolag.se/5564391158/bosse-sandells-bygg-aktiebolag
Econet Vatten & Miljöteknik AB                     https://www.allabolag.se/5567388953/econet-vatten-miljoteknik-ab
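The decoding step above can be tried offline with a small sketch. It uses only the standard library (html.parser in place of BeautifulSoup), and the PAGE markup is a hypothetical miniature built from two of the records printed above; the real page's element names may differ. AttributeFinder and extract_results are illustrative names, not part of any library.

```python
import html
import json
from html.parser import HTMLParser
from urllib.parse import urljoin

# Two records copied from the output above; the surrounding markup is an assumption.
RECORDS = [
    {"jurnamn": "Grenspecialisten Förvaltning AB",
     "linkTo": "5566435201/grenspecialisten-forvaltning-ab"},
    {"jurnamn": "Solix Group AB", "linkTo": "5569669574/solix-group-ab"},
]
# JSON stored in an attribute has to be entity-escaped, hence html.escape.
PAGE = '<div><search :search-result-default="{}"></search></div>'.format(
    html.escape(json.dumps(RECORDS))
)


class AttributeFinder(HTMLParser):
    """Remember the value of the first attribute with the given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.value = None

    def handle_starttag(self, tag, attrs):
        if self.value is None:
            # The parser has already turned &quot; back into plain quotes.
            self.value = dict(attrs).get(self.name)


def extract_results(page, base="https://www.allabolag.se/"):
    finder = AttributeFinder(":search-result-default")
    finder.feed(page)
    data = json.loads(finder.value)
    # urljoin copes with linkTo values with or without a leading slash.
    return [(item["jurnamn"], urljoin(base, item["linkTo"])) for item in data]


results = extract_results(PAGE)
for name, link in results:
    print("{:<50} {}".format(name, link))
```

Using urljoin instead of string concatenation is a small design choice: it gives the same URL whether or not linkTo starts with a slash.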

Python web scrape numerical weather data

I am attempting to print the int value of the current outside air temperature (55).
Any chance for a tip on what I am doing wrong? (sorry not a lot of wisdom here!)
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import datetime as dt
#this is used at the end with plotting results to current hour
h = dt.datetime.now().hour
r = requests.get(
    'https://www.google.com/search?q=weather+duluth')
soup = BeautifulSoup(r.text, 'html.parser')
stuff = []
for item in soup.select('vk_bk sol-tmp'):
    item = int(item.contents[1].get_text(strip=True)[:-1])
    #print(item)#this is weather data
    stuff.append(item)
This is the web URL for weather and the current outdoor temperature is tied to the div class highlighted below.
If I attempt to print stuff I just get an empty list returned.
Adding a User-Agent header should give the expected result:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
r = requests.get('https://www.google.com/search?q=weather%20duluth', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
# the temperature is the text of the first span with class "wob_t"
print(soup.find("span", {"class": "wob_t"}).text)
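The snippet above yields the temperature as text, while the question asks for an int. A minimal sketch of that last step follows; parse_temperature is a hypothetical helper, and the sample strings are assumptions about what the scraped text might look like.

```python
import re


def parse_temperature(text):
    """Return the first integer in text such as '55', '55°F' or ' -3 °C '."""
    match = re.search(r"-?\d+", text)
    if match is None:
        raise ValueError("no number found in {!r}".format(text))
    return int(match.group())


samples = ["55", "55°F", " -3 °C "]
temperatures = [parse_temperature(s) for s in samples]
print(temperatures)  # [55, 55, -3]
```

Searching for digits is more forgiving than slicing off the last character, as the question's [:-1] does, because it still works when the unit suffix is absent or longer than one character. It only recognises the ASCII hyphen as a minus sign.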

Scrape table fields from html with specific class

So I want to build a simple scraper for google shopping and I encountered some problems.
This is the HTML from my request (to https://www.google.es/shopping/product/7541391777504770249/online), where I'm trying to query the highlighted div class sh-osd__total-price inside the div class sh-osd__offer-row:
My code is currently:
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
r = html_soup.findAll('tr', {'class': 'sh-osd__offer-row'}) #Returns empty
print(r)
r = html_soup.findAll('tr', {'class': 'sh-osd__total-price'}) #Returns empty
print(r)
Both results are empty; Beautiful Soup doesn't find anything.
Is there any way to find these two div classes with Beautiful Soup?
You need to add a User-Agent to the request headers:
from bs4 import BeautifulSoup
from requests import get
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'} #<-- added line
response = get(url, headers=headers) #<--- include here
html_soup = BeautifulSoup(response.text, 'html.parser')
r = html_soup.find_all('tr', {'class': 'sh-osd__offer-row'})
print(r)
r = html_soup.find_all('tr', {'class': 'sh-osd__total-price'})
print(r)
However, since the data sits in a <table> tag, you can let pandas do the hard work for you (read_html parses the HTML with lxml or BeautifulSoup under the hood). It returns a list of DataFrames, one for each <table> element on the page:
import pandas as pd
url = 'https://www.google.es/shopping/product/7541391777504770249/online'
dfs = pd.read_html(url)
print(dfs[-1])
Output:
Sellers Seller Rating ... Base Price Total Price
0 One Fragance No rating ... £30.95 +£8.76 delivery £39.71
1 eBay No rating ... £46.81 £46.81
2 Carethy.co.uk No rating ... £34.46 +£3.99 delivery £38.45
3 fruugo.co.uk No rating ... £36.95 +£9.30 delivery £46.25
4 cosmeticsmegastore.com/gb No rating ... £36.95 +£9.30 delivery £46.25
5 Perfumes Club UK No rating ... £30.39 +£5.99 delivery £36.38
[6 rows x 5 columns]
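A plausible reason the User-Agent fix matters is that HTTP clients announce themselves as scripts by default, and a server may answer such clients with different markup. The sketch below makes that visible without sending anything. It uses the standard library's urllib rather than requests (whose default User-Agent has the form python-requests/<version>), so treat it as an analogy and not as what requests does internally.

```python
import urllib.request

URL = "https://www.google.es/shopping/product/7541391777504770249/online"
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
)

# What urllib would announce if no header were supplied.
default_ua = dict(urllib.request.build_opener().addheaders)["User-agent"]

# A request object carrying the browser-like header; no network traffic happens here.
request = urllib.request.Request(URL, headers={"User-Agent": BROWSER_UA})

print("default:", default_ua)
print("custom: ", request.get_header("User-agent"))
```

Note that urllib stores header names capitalised as "User-agent", which is why the lookup uses that spelling.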

Web scraping twitter

I want to scrape a Twitter search page to download the tweets that match a specific search term. I can fetch only 20 tweets rather than all of them. Please help me fetch all the tweets recursively. Below is the code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
company_name = 'ABC'
url = 'https://twitter.com/search?q=%23%27%20%20%20' + company_name + '&src=typd&lang=en'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
req = requests.get(url, headers=headers)  # print(req)
data = req.text  # print(data)
# soup = BeautifulSoup(data, "lxml")  # print(soup)
soup = BeautifulSoup(data, "html.parser")  # print(soup)
tweets = [p.text for p in soup.findAll('p', class_='tweet-text')]
# print(tweets)
df = pd.DataFrame()
df['Tweet'] = tweets
print(df.head())
print(df.shape)

Beautifulsoup parsing error

I am trying to extract some information about an App on Google Play and BeautifulSoup doesn't seem to work.
The link is, for example:
https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts
My code:
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url)
html = r.content
soup = BeautifulSoup(html)
l = soup.find_all("div", { "class" : "document-subtitles"})
print len(l)
0 #How is this 0?! There is clearly a div with that class
I decided to go all in, didn't work either:
i = soup.select('html body.no-focus-outline.sidebar-visible.user-has-no-subscription div#wrapper.wrapper.wrapper-with-footer div#body-content.body-content div.outer-container div.inner-container div.main-content div div.details-wrapper.apps.square-cover.id-track-partial-impression.id-deep-link-item div.details-info div.info-container div.info-box-top')
print i
What am I doing wrong?
You need to pretend to be a real browser by supplying the User-Agent header:
import requests
from bs4 import BeautifulSoup
url = "https://play.google.com/store/apps/details?id=com.cimaxapp.weirdfacts"
r = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"
})
html = r.content
soup = BeautifulSoup(html, "html.parser")
title = soup.find(class_="id-app-title").get_text()
rating = soup.select_one(".document-subtitle .star-rating-non-editable-container")["aria-label"].strip()
print(title)
print(rating)
Prints the title and the current rating:
Weird Facts
Rated 4.3 stars out of five stars
To get the additional information field values, you can use the following generic function:
def get_info(soup, text):
    return soup.find("div", class_="title", text=lambda t: t and t.strip() == text).\
        find_next_sibling("div", class_="content").get_text(strip=True)
Then, if you do:
print(get_info(soup, "Size"))
print(get_info(soup, "Developer"))
You will see printed:
1.4M
Email email#here.com
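The title-then-next-sibling lookup in get_info can be illustrated offline. The sketch below is a stand-in, not the answer's function: it uses the standard library's xml.etree.ElementTree on a hypothetical, well-formed fragment, which real pages often are not. The class names title and content come from the answer; the details-section and meta-info wrappers and the second field are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the "additional information" block.
FRAGMENT = """
<div class="details-section">
  <div class="meta-info">
    <div class="title"> Size </div>
    <div class="content">1.4M</div>
  </div>
  <div class="meta-info">
    <div class="title">Example field</div>
    <div class="content">placeholder value</div>
  </div>
</div>
"""


def get_info(root, text):
    """Return the text of the 'content' div that follows the 'title' div labelled text."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            label = (child.text or "").strip()
            if child.get("class") != "title" or label != text:
                continue
            # Scan the later siblings for the first div with class "content".
            for sibling in children[index + 1:]:
                if sibling.tag == "div" and sibling.get("class") == "content":
                    return "".join(sibling.itertext()).strip()
    return None


root = ET.fromstring(FRAGMENT)
size = get_info(root, "Size")
missing = get_info(root, "Developer")
print(size)
print(missing)
```

Unlike the one-liner in the answer, which raises AttributeError when no matching title exists, this version returns None for a missing field, so the second print shows None.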
