Scraping with requests and BS4 - python

I'd like to get the content of the table on the following website into a pandas DataFrame: https://projects.fivethirtyeight.com/soccer-predictions/premier-league/
I'm quite new to BS, but I believe what I want would be something like:
import requests
from bs4 import BeautifulSoup
r = requests.get(url = "https://projects.fivethirtyeight.com/soccer-predictions/ligue-1/")
soup = BeautifulSoup(r.text, "html.parser")
#print(soup.prettify())
print(soup.find("div", {"class":"forecast-table"}))
But unfortunately this returns None. Any help and guidance would be amazing!
I believe that the bit I need to get is somewhere in here (not really sure though):
<div id="forecast-table-wrapper">
<table class="forecast-table" id="forecast-table">
<thead>
<tr class="desktop">
<th class="top nosort">
</th>
<th class="top bordered-right rating nosort drop-6" colspan="3">
Team rating
</th>
<th class="top nosort rating2" colspan="1">
</th>
<th class="top bordered-right nosort drop-1" colspan="5">
avg. simulated season
</th>
<th class="top bordered-right nosort show-1 drop-3" colspan="2">
avg. simulated season
</th>
<th class="top bordered nosort" colspan="4">
end-of-season probabilities
</th>
</tr>
<tr class="sep">
<th colspan="11">
</th>
</tr>

Since you're using pandas anyway, you can use its built-in table processing, like this:
pandas.read_html('https://projects.fivethirtyeight.com/soccer-predictions/premier-league/',
                 attrs={'class': 'forecast-table'},
                 header=1)
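
A note on the return value: read_html gives back a list of DataFrames, one per matching table, so you take the first element. A minimal sketch against an inline stand-in for the markup (the column names here are placeholders, not the real FiveThirtyEight headers):

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the page markup; the real table has many more columns.
html = """
<table class="forecast-table">
  <thead><tr><th>team</th><th>rating</th></tr></thead>
  <tbody>
    <tr><td>PSG</td><td>90.0</td></tr>
    <tr><td>Lyon</td><td>76.3</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per table matching attrs;
# take the first element to get the forecast table itself.
tables = pd.read_html(StringIO(html), attrs={'class': 'forecast-table'})
df = tables[0]
print(df)
```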

That's because you are searching for a div, but it's a table, so it should be:
print(soup.find("table", {"class":"forecast-table"}))

import requests
from bs4 import BeautifulSoup
r = requests.get('https://projects.fivethirtyeight.com/soccer-predictions/ligue-1/')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find_all('table', attrs={'class':'forecast-table'})
for i in table:
    tr = i.find_all('tr')
    for l in tr:
        print(l.text)
Output:
Team ratingavg. simulated seasonavg. simulated seasonend-of-season probabilities
teamspioff.def.WDLgoal diff.proj. pts.pts.relegatedrel.qualify for UCLmake UCLwin Ligue 1win league
PSG24 pts90.03.00.530.74.52.9+7897<1%>99%97%
Lyon14 pts76.32.10.719.69.19.3+2768<1%60%2%
Marseille13 pts71.12.00.918.38.311.4+1663<1%40%<1%
Lille19 pts63.71.70.916.78.612.6+9591%24%<1%
St Étienne15 pts62.71.60.914.710.912.4-1553%14%<1%
Montpellier16 pts64.01.50.713.912.411.7+2543%12%<1%
Nice11 pts62.01.60.913.510.014.5-7507%7%<1%
Monaco6 pts65.91.80.913.010.714.2+0508%7%<1%
Rennes8 pts63.41.60.813.010.514.5-3499%6%<1%
Bordeaux14 pts59.21.50.913.09.915.0-6498%5%<1%
Strasbourg12 pts59.21.51.012.610.814.6-2499%5%<1%
Angers11 pts60.41.50.912.610.215.2-54810%4%<1%
Toulouse13 pts58.21.50.911.912.014.1-104811%4%<1%
Dijon FCO10 pts57.71.61.112.28.517.3-124517%2%<1%
Caen10 pts55.61.41.010.812.414.8-104518%3%<1%
Nîmes10 pts54.91.51.110.711.615.6-134420%2%<1%
Reims10 pts55.31.30.910.312.315.4-144321%2%<1%
Nantes6 pts59.01.50.910.410.916.7-144225%1%<1%
Guingamp5 pts57.31.51.010.39.817.9-194130%<1%<1%
Amiens10 pts53.01.31.010.49.018.6-164031%<1%<1%
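
If you need the individual cell values rather than one concatenated string per row, you can collect each td's text separately — a sketch against a cut-down stand-in for the markup:

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the forecast-table markup.
html = """
<table class="forecast-table">
  <tr><td>PSG</td><td>90.0</td><td>97%</td></tr>
  <tr><td>Lyon</td><td>76.3</td><td>60%</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'forecast-table'})

# One list of cell strings per table row.
rows = [[td.get_text(strip=True) for td in tr.find_all('td')]
        for tr in table.find_all('tr')]
print(rows)
```

From here the list of lists can be handed straight to pandas.DataFrame if needed.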

Related

Python : Scrape each info in table without class using beautifulsoup4

I'm new to Python and I have a problem scraping, with beautifulsoup4, a table containing information about a book, because the tr and td elements of the table don't have class names.
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
here is the table in the website:
<table class="table table-striped">
<tr>
<th>
UPC
</th>
<td>
a897fe39b1053632
</td>
</tr>
<tr>
<th>
Product Type
</th>
<td>
Books
</td>
</tr>
<tr>
<th>
Price (excl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Price (incl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Tax
</th>
<td>
£0.00
</td>
</tr>
<tr>
<th>
Availability
</th>
<td>
In stock (22 available)
</td>
</tr>
<tr>
<th>
Number of reviews
</th>
<td>
0
</td>
</tr>
</table>
The only thing I've learned is selecting by class name, for example: book_price = soup.find('td', class_='book-price').
But in this situation I am blocked...
Is there something like "find and pair" the first th tag with the first td, the second th tag with the second td, and so on?
I have something like this:
import requests
from bs4 import BeautifulSoup
book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(book_url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('table').prettify()
table_infos = soup.find('table')
for info in table_infos.findAll('tr'):
    upc = ...
    price = ...
    tax = ...
thank you !
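
There is no built-in "pair" helper, but since each tr in this table holds exactly one th (the label) and one td (the value), you can walk the rows and build a dict keyed by the labels. A sketch against the table markup from the question (inlined here; with requests you would parse page.content the same way):

```python
from bs4 import BeautifulSoup

# The product-information table from the question, inlined for the example.
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Number of reviews</th><td>0</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# Each row pairs one <th> label with one <td> value,
# so the label text can serve as the dict key.
info = {row.th.get_text(strip=True): row.td.get_text(strip=True)
        for row in soup.find('table').find_all('tr')}

print(info['UPC'])
print(info['Product Type'])
```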

Cannot scrape web with many tables with python lxml

I am trying to scrape this site, but I am not getting any result; this works with other pages that contain only one simple table. Can you help me with the code?
import lxml
from lxml import html
import requests
import numpy as np
import pandas as pd
import urllib

def scrape_table(url):
    # Fetch the page that we're going to parse
    page = requests.get(url)
    tree = html.fromstring(page.content)
    # Using XPATH, fetch all table elements on the page
    #df = tree.xpath('//div[#id="main content"]/div[#id="style-1"]/table[#class="table"]/tbody')
    df = tree.xpath('//tr')
    #assert len(table) == 1
    #df = pd.read_html(lxml.etree.tostring(table[0], method='html'))[0]
    return df

symbol = 'AMZN'
#balance_sheet_url = 'https://finance.yahoo.com/quote/' + symbol + '?p=' + symbol
#df_balance_sheet = scrape_table(balance_sheet_url)
#df_balance_sheet.info()
#print(df_balance_sheet)
url = "https://www.macrotrends.net/stocks/charts/" + symbol + "/pe-ratio"
data = requests.request("GET", url)
url_completo = data.url
print(url_completo)
df_pe = scrape_table(url_completo)
Here is the page I am trying to scrape: https://www.macrotrends.net/stocks/charts/TMO/thermo-fisher-scientific/pe-ratio
<div id="style-1" style="background-color:#fff; height: 500px; overflow:auto; margin: 0px 0px 30px 0px; padding:0px 30px 20px 0px; border:1px solid #dfdfdf;">
<table class="table">
<thead>
<tr>
<th colspan="4" style="text-align:center;">Thermo Fisher Scientific PE Ratio Historical Data</th>
</tr>
</thead>
<thead>
<tr>
<th style="text-align:center;">Date</th>
<th style="text-align:center;">Stock Price</th>
<th style="text-align:center;">TTM Net EPS</th>
<th style="text-align:center;">PE Ratio</th>
</tr>
</thead>
<tbody><tr>
<td style="text-align:center;">2019-04-12</td>
<td style="text-align:center;">280.65</td>
<td style="text-align:center;"></td>
<td style="text-align:center;">38.71</td>
</tr><tr>
<td style="text-align:center;">2018-12-31</td>
<td style="text-align:center;">223.79</td>
<td style="text-align:center;">$7.25</td>
<td style="text-align:center;">30.87</td>
</tr><tr>
<td style="text-align:center;">2018-09-30</td>
<td style="text-align:center;">243.90</td>
<td style="text-align:center;">$6.33</td>
<td style="text-align:center;">38.53</td>
</tr><tr>
<td style="text-align:center;">2018-06-30</td>
<td style="text-align:center;">206.84</td>
<td style="text-align:center;">$5.92</td>
<td style="text-align:center;">34.94</td>
</tr>
</table>
</div>
You have not built your URLs correctly. This code will fetch two tables, one for Amazon and then one for Thermo Fisher Scientific.
import lxml
from lxml import html
import requests
import pandas as pd

pd.set_option('display.expand_frame_repr', False)

def scrape_table(url):
    # Fetch the page that we're going to parse
    page = requests.get(url)
    tree = html.fromstring(page.content)
    tables = tree.findall('.//*/table')
    df = pd.read_html(lxml.etree.tostring(tables[0], method='html'))[0]
    return df

for symbol in ['AMZN/amazon', 'TMO/thermo-fisher-scientific']:
    url = "https://www.macrotrends.net/stocks/charts/" + symbol + "/pe-ratio"
    data = requests.request("GET", url)
    url_completo = data.url
    print(url_completo)
    df_pe = scrape_table(url_completo)
    print(df_pe)
Outputs:
Amazon PE Ratio Historical Data
Date Stock Price TTM Net EPS PE Ratio
0 2019-04-12 1843.06 NaN 91.56
1 2018-12-31 1501.97 $20.13 74.61
2 2018-09-30 2003.00 $17.84 112.28
...
Thermo Fisher Scientific PE Ratio Historical Data
Date Stock Price TTM Net EPS PE Ratio
0 2019-04-12 280.65 NaN 38.71
1 2018-12-31 223.79 $7.25 30.87
2 2018-09-30 243.90 $6.33 38.53
...

BeautifulSoup find() returns odd data

I am using BeautifulSoup to get data off a website. I can find the data I want, but when I print it, it comes out as "-1". The value in the field is 32.27. Here is the code I'm using:
import requests
from BeautifulSoup import BeautifulSoup
import csv

symbols = {'451020'}
with open('industry_pe.csv', "ab") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow(['Industry','PE'])
    for s in symbols:
        try:
            url = 'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/industries.jhtml?tab=learn&industry='
            full = url + s
            response = requests.get(full)
            html = response.content
            soup = BeautifulSoup(html)
            for PE in soup.find("div", {"class": "sec-fundamentals"}):
                print PE
                #IndPE = PE.find("td")
                #print IndPE
When I print PE it returns this...
<h2>
Industry Fundamentals
<span>AS OF 03/08/2018</span>
</h2>
<table summary="" class="data-tbl">
<colgroup>
<col class="col1" />
<col class="col2" />
</colgroup>
<thead>
<tr>
<th scope="col"></th>
<th scope="col"></th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row" class="align-left"><a href="javascript:void(0);" onclick="javascript:openPopup('https://www.fidelity.com//webcontent/ap010098-etf-content/18.01/help/research/learn_er_glossary_3.shtml#priceearningsratio',420,450);return false;">P/E (Last Year GAAP Actual)</a></th>
<td>
32.27
</td>
</tr>
<tr>
<th scope="row" class="align-left"><a href="javascript:void(0);" onclick="javascript:openPopup('https://www.fidelity.com//webcontent/ap010098-etf-content/18.01/help/research/learn_er_glossary_3.shtml#priceearningsratio',420,450);return false;">P/E (This Year's Estimate)</a>.....
I want to get the value 32.27 from the td, but when I use the code I have commented out to get and print td, it gives me this:
-1
None
-1
<td>
32.27
</td>
-1
any ideas?
The find() method returns the first matching tag. Iterating over the contents of a tag gives you all of its children one by one.
So, to get the <td> tags in the table, you should first find the table and store it in a variable, and then iterate over all the td tags using find_all('td'):
table = soup.find("div", {"class": "sec-fundamentals"})
for row in table.find_all('td'):
    print(row.text.strip())
Partial Output:
32.27
34.80
$122.24B
$3.41
14.14%
15.88%
If you want only the first value, you can use this:
table = soup.find("div", {"class": "sec-fundamentals"})
value = table.find('td').text.strip()
print(value)
# 32.27

Beautifulsoup HTML table parsing--only able to get the last row?

I have a simple HTML table to parse, but somehow BeautifulSoup is only able to get me results from the last row. I'm wondering if anyone would take a look and see what's wrong. I have already created the rows object from the HTML table:
<table class='participants-table'>
<thead>
<tr>
<th data-field="name" class="sort-direction-toggle name">Name</th>
<th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
<th data-field="sector" class="sort-direction-toggle sector">Sector</th>
<th data-field="country" class="sort-direction-toggle country">Country</th>
<th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
</tr>
</thead>
<tbody>
<tr>
<th class='name'>Grontmij</th>
<td class='type'>Company</td>
<td class='sector'>General Industrials</td>
<td class='country'>Netherlands</td>
<td class='joined-on'>2000-09-20</td>
</tr>
<tr>
<th class='name'>Groupe Bial</th>
<td class='type'>Company</td>
<td class='sector'>Pharmaceuticals & Biotechnology</td>
<td class='country'>Portugal</td>
<td class='joined-on'>2004-02-19</td>
</tr>
</tbody>
</table>
I use the following codes to get the rows:
table=soup.find_all("table", class_="participants-table")
table1=table[0]
rows=table1.find_all('tr')
rows=rows[1:]
This gets:
rows=[<tr>
<th class="name">Grontmij</th>
<td class="type">Company</td>
<td class="sector">General Industrials</td>
<td class="country">Netherlands</td>
<td class="joined-on">2000-09-20</td>
</tr>, <tr>
<th class="name">Groupe Bial</th>
<td class="type">Company</td>
<td class="sector">Pharmaceuticals & Biotechnology</td>
<td class="country">Portugal</td>
<td class="joined-on">2004-02-19</td>
</tr>]
As expected, it looks like. However, if I continue:
for row in rows:
    cells = row.find_all('th')
I'm only able to get the last entry!
cells=[<th class="name">Groupe Bial</th>]
What is going on? This is my first time using beautifulsoup, and what I'd like to do is to export this table into CSV. Any help is greatly appreciated! Thanks
You need to extend if you want all the th tags in a single list; you just keep reassigning cells = row.find_all('th'), so when you print cells outside the loop you will only see what it was last assigned to, i.e. the last th in the last tr:
cells = []
for row in rows:
    cells.extend(row.find_all('th'))
Also since there is only one table you can just use find:
soup = BeautifulSoup(html)
table = soup.find("table", class_="participants-table")
If you want to skip the thead row you can use a css selector:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
rows = soup.select("table.participants-table thead ~ tr")
cells = [tr.th for tr in rows]
print(cells)
cells will give you:
[<th class="name">Grontmij</th>, <th class="name">Groupe Bial</th>]
To write the whole table to csv:
import csv

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")
with open("data.csv", "w") as out:
    wr = csv.writer(out)
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])
    for row in rows[1:]:
        wr.writerow([tag.text for tag in row.find_all()] + [row.th.a["href"]])
which for you sample will give you:
Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial

python beautiful soup extract data

I am parsing an HTML document using Beautiful Soup 4.0.
Here is an example of a table in the document:
<tr>
<td class="nob"></td>
<td class="">Time of price</td>
<td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
<td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
<td class="nob"></td>
</tr>
<tr>
<td class="nob"></td>
<td class="">Daily volume (units)</td>
<td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
<td class="nob"></td>
</tr>
I would like to extract 08/06/2012, 11:43:08, Daily volume (units), 0, etc.
This is my code to find specific table and all data of it
html = file("some_file.html")
soup = BeautifulSoup(html)
t = soup.find(id="ctnt-2308")
dat = [ map(str, row.findAll("td")) for row in t.findAll("tr") ]
I get a list of data that needs to be organized.
Any suggestions for doing this in a simple way?
Thank you
list(soup.stripped_strings)
will give you all the strings in that soup (with surrounding whitespace stripped).
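
Applied to the rows from the question (inlined here for a self-contained example), stripped_strings yields each piece of text with the surrounding whitespace removed and the empty cells skipped:

```python
from bs4 import BeautifulSoup

# The two rows from the question, inlined for the example.
html = """
<tr>
  <td class="nob"></td>
  <td class="">Time of price</td>
  <td class=" pullElement pullData-DE000BWB14W0.teFull">08/06/2012</td>
  <td class=" pullElement pullData-DE000BWB14W0.PriceTimeFull">11:43:08 </td>
  <td class="nob"></td>
</tr>
<tr>
  <td class="nob"></td>
  <td class="">Daily volume (units)</td>
  <td colspan="2" class=" pullElement pullData-DE000BWB14W0.EWXlume">0</td>
  <td class="nob"></td>
</tr>
"""
soup = BeautifulSoup(html, 'html.parser')

# stripped_strings walks the whole tree, yielding each non-empty
# text node with leading/trailing whitespace removed.
strings = list(soup.stripped_strings)
print(strings)
```

From there the flat list can be grouped into label/value pairs as needed.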
