BeautifulSoup not finding dates

I'm trying to scrape some data from here: https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly.
I'd like to get the dates in the first row (i.e. 31-Mar-21 31-Dec-20 30-Sep-20 30-Jun-20 31-Mar-20).
The problem is that when I try to get the date with bs4, it outputs nothing. I wrote this code:
import requests
from bs4 import BeautifulSoup
url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
a = soup.find('div', attrs={"class": "tables-container"})
date = a.find("time").text
When I execute it, I get nothing. Printing a shows that the time tag is there but contains no date:
<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>
Thanks.

The data is embedded within the page in JSON form. You can parse it like this:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
x = data["props"]["initialState"]["markets"]["financials"]["financial_tables"]
headers = x["income_interim_tables"][0]["headers"]
print(*headers, sep="\n")
Prints:
2021-03-31
2020-12-31
2020-09-30
2020-06-30
2020-03-31
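The question asks for dates like 31-Mar-21, while the embedded JSON holds ISO dates. Here is a minimal sketch of the conversion using only the standard library. It assumes the headers list has the values printed above (hard-coded here so it runs offline) and that the process uses an English or C locale, since %b is locale-dependent. The helper name to_site_format is made up for this example.

```python
from datetime import datetime

# ISO dates as printed by the answer above (hard-coded so this runs offline)
headers = ["2021-03-31", "2020-12-31", "2020-09-30", "2020-06-30", "2020-03-31"]


def to_site_format(iso_date):
    """Convert 'YYYY-MM-DD' to 'DD-Mon-YY', e.g. '31-Mar-21'."""
    # %b gives the abbreviated month name and depends on the current locale
    return datetime.strptime(iso_date, "%Y-%m-%d").strftime("%d-%b-%y")


formatted = [to_site_format(h) for h in headers]
print(*formatted, sep="\n")
```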

As I do not have enough reputation to comment:
The problem is that the scraped HTML does not contain the dates. The time tags are empty.
You need a way to scrape while pre-rendering the JavaScript which fills in the dates. This is a different topic which requires some headless browser or other approaches, e.g. https://www.scrapingbee.com/blog/scrapy-javascript/

Related

What is the proper syntax for .find() in bs4?

I am trying to scrape the bitcoin price off of coinbase and cannot find the proper syntax. When I run the program (without the line with question marks) I get the block of html that I need, but I don't know how to narrow down and retrieve the price itself. Any help appreciated, thanks.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
data = requests.get(url)
nicedata = data.text
soup = BeautifulSoup(nicedata, 'html.parser')
prettysoup = soup.prettify()
bitcoin = soup.find('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
price = bitcoin.find('???')
print(price)
The attached image contains the html

To get the text from an element:
price = bitcoin.text
However, this page has many <h4> elements with this class, and find() returns only the first one, whose text is Bitcoin, not the price shown in your image. You may need find_all() to get a list of all matching elements. You can then use an index [index] or a slice [start:end] to pick some of them, or a for loop to process every element in the list.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_h4 = soup.find_all('h4', {'class': 'Header__StyledHeader-sc-1q6y56a-0 hZxUBM TextElement__Spacer-sc-18l8wi5-0 hpeTzd'})
for h4 in all_h4:
    print(h4.text)
It can be easier to work with the data if you keep it in a list of lists, an array, or a DataFrame. To build a list of lists, it is easier to find the rows <tr> first and then search for <h4> inside every row.
import requests
from bs4 import BeautifulSoup
url = 'https://www.coinbase.com/charts'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
all_tr = soup.find_all('tr')
data = []
for tr in all_tr:
    row = []
    for h4 in tr.find_all('h4'):
        row.append(h4.text)
    if row:  # skip empty row
        data.append(row)
for row in data:
    print(row)
You don't need the class to get all the h4 elements.
BTW: This page uses JavaScript to append new rows as you scroll, but requests and BeautifulSoup can't run JavaScript, so if you need all the rows you may need Selenium to control a web browser that runs JavaScript.
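A small offline sketch of the difference between find(), find_all() and a CSS selector. The markup and the short class names below are made up to stand in for the real page, so treat it as an illustration only. It also shows one thing to watch for: a multi-class string passed as the class value matches only that exact attribute string, order included, while a CSS selector matches the classes in any order.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the real page; the class names are hypothetical.
html = """
<table>
  <tr><td><h4 class="Header hZxUBM Spacer">Bitcoin</h4></td>
      <td><h4 class="Spacer Header hZxUBM">$9,000.00</h4></td></tr>
  <tr><td><h4 class="Header hZxUBM Spacer">Ethereum</h4></td>
      <td><h4 class="Spacer Header hZxUBM">$200.00</h4></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first match, which here is the name, not the price
first = soup.find("h4", class_="Header")

# A multi-class string matches only the exact attribute value, order included
exact = soup.find_all("h4", {"class": "Header hZxUBM Spacer"})

# A CSS selector matches elements that carry all the classes, in any order
any_order = soup.select("h4.Header.hZxUBM.Spacer")

# One list per table row, built from the <h4> elements inside it
rows = [[h4.get_text(strip=True) for h4 in tr.find_all("h4")]
        for tr in soup.find_all("tr")]
for row in rows:
    print(row)
```

Generated class names such as hZxUBM tend to change when a site is redeployed, so selecting by tag and position within the row is often less brittle than selecting by the full class string.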

Extract items within </h> but without <h> from html

I have scraped a website that provides me with Lisbon zip codes. With BeautifulSoup I was able to get the zip codes within a class item. However, the zip codes themselves are still separated by other tags, and I have tried many things to extract all of them from there. Except for string manipulation, I couldn't make it work. I am new to web scraping and HTML, so sorry if this question is very basic.
This is my code:
from bs4 import BeautifulSoup as soup
from requests import get
url='https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
print(response.text)
html_soup = soup(response.text,'lxml')
type(html_soup)
zip_codes=html_soup.find_all('div', {'class' : 'rightc'})
And this is a snippet of the result, from which I would like to extract only the zip codes:
[<div class="rightc">1000-246<hr/> 1050-138<hr/> 1069-188<hr/> 1070-204<hr/> 1100-069<hr/> 1100-329<hr/> 1100-591<hr/> 1150-144<hr/> 1169-062<hr/> 1170-128<hr/> 1170-395<hr/> 1200-228<hr/> 1200-604<hr/> 1200-862<hr/> 1250-111<hr/> 1269-121<hr/> 1300-217<hr/> 1300-492<hr/> 1350-092<hr/> 1399-014<hr/> 1400-237<hr/> 1500-061<hr/> 1500-360<hr/> 1500-674<hr/> 1600-232<hr/> 1600-643<hr/> 1700-018<hr/> 1700-302<hr/> 1750-113<hr/> 1750-464<hr/> 1800-262<hr/> 1900-115<hr/> 1900-401<hr/> 1950-208<hr/> 1990-162<hr/> 1000-247<hr/> 1050-139<hr/> 1069-190<hr/> 1070-205<hr/> 1100-070<hr/> 1100-330</div>]

Your result zip_codes has the type bs4.element.ResultSet, which is a list of bs4.element.Tag objects. So zip_codes[0] is what you're interested in (the first tag found). Use the .text attribute to drop the <hr> tags. That leaves one long string of zip codes separated by spaces, which you can split into a list (two options below; option one is more Pythonic and faster).
from bs4 import BeautifulSoup as soup
from requests import get
url = 'https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
html_soup = soup(response.text,'lxml')
zip_codes = html_soup.find_all('div', {'class' : 'rightc'})
# option one
zips = zip_codes[0].text.split(' ')
print(zips[:8])
# option two (slower): walk the text nodes, which skips the <hr> tags
zips = []
for zc in zip_codes[0].stripped_strings:
    zips.append(zc)
print(zips[:8])
Output:
['1000-246', '1050-138', '1069-188', '1070-204', '1100-069', '1100-329', '1100-591', '1150-144']
['1000-246', '1050-138', '1069-188', '1070-204', '1100-069', '1100-329', '1100-591', '1150-144']
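Splitting .text on spaces relies on the page putting a space after each <hr/>. Here is a minimal sketch of a variant that does not depend on that, using get_text() with a separator. The HTML is a shortened, made-up version of the snippet in the question with some of the spaces removed on purpose.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real page; some spaces after <hr/> are removed on purpose
html = '<div class="rightc">1000-246<hr/>1050-138<hr/> 1069-188<hr/>1070-204</div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="rightc")

# .text simply concatenates the text nodes, so codes without a space in between fuse
fused = div.text.split(" ")

# get_text() strips each text node and joins them with the given separator
zips = div.get_text(separator=" ", strip=True).split()
print(zips)
```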

You can get the text and split it:
# htmlcontent holds the page HTML, e.g. response.text from the question
html_soup = BeautifulSoup(htmlcontent, 'lxml')
zip_codes = html_soup.find_all('div', {'class': 'rightc'})
print(zip_codes[0].text.split(' '))
Output (Python 2):
[u'1000-246', u'1050-138', u'1069-188', u'1070-204',.........]

Use regex to grab the codes
from bs4 import BeautifulSoup
import requests
import re
url = 'https://worldpostalcode.com/portugal/lisboa/'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")
element = soup.select_one('.codelist .rightc')
codes = re.findall(r"\d{4}-\d{3}",element.text)
for code in codes:
    print(code)

I would suggest replacing all the <hr> tags with some delimiter (e.g. #, $ or ,) before loading the page response into the soup. The job then becomes easy: once the page is in the soup, you can extract the zip codes as a list by selecting the class and splitting on the delimiter.
from bs4 import BeautifulSoup as soup
from requests import get
url = 'https://worldpostalcode.com/portugal/lisboa/'
response = get(url)
# replace every <hr> with a delimiter before parsing
html = response.text.replace('<hr>', '#')
html_soup = soup(html, 'lxml')
zip_codes = html_soup.find_all('div', {'class': 'rightc'})
# find_all() returns a list, so take the first div, then split and trim
zips = [z.strip() for z in zip_codes[0].text.split('#')]
Hope this helps! Cheers!
P.S.: Answer is open for improvements and comments.

Beautifulsoup scraping table from website with requests for pandas

I am trying to download the data on this website:
https://coinmunity.co/
in order to manipulate it later in Python or pandas.
I tried to load it directly into pandas via Requests, but it did not work, using this code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
table = soup.find_all('table')[0]
dfm = pd.read_html(str(table), header=0)
dfm = dfm[0].dropna(axis=0, thresh=4)
dfm.head()
In most of the things I tried, I could only get the info in the headers, which seems to be the only table this code sees on the page.
Seeing that this did not work, I tried to do the same scraping with Requests and BeautifulSoup, but it did not work either. This is my code:
import requests
from bs4 import BeautifulSoup
res = requests.get("https://coinmunity.co/")
soup = BeautifulSoup(res.content, 'lxml')
#table = soup.find_all('table')[0]
#table = soup.find_all('div', {'class':'inner-container'})
#table = soup.find_all('tbody', {'class':'_ngcontent-c0'})
#table = soup.find_all('table')[0].findAll('tr')
#table = soup.find_all('table')[0].find('tbody')#.find_all('tbody _ngcontent-c3=""')
table = soup.find_all('p', {'class':'stats change positiveSubscribers'})
The commented-out lines show everything I have tried, but nothing worked.
Is there any way to download that table for use in pandas/Python in the tidiest, easiest and quickest possible way?
Thank you

Since the content is loaded dynamically after the initial request is made, you won't be able to scrape this data with requests alone. Here's what I would do instead:
from selenium import webdriver
import pandas as pd
import time
from bs4 import BeautifulSoup
driver = webdriver.Firefox()
driver.implicitly_wait(10)
driver.get("https://coinmunity.co/")
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
results = []
for row in soup.find_all('tr')[2:]:
    data = row.find_all('td')
    name = data[1].find('a').text
    value = data[2].find('p').text
    # get the rest of the data you need about each coin here, then add it to the dictionary that you append to results
    results.append({'name': name, 'value': value})
df = pd.DataFrame(results)
df.head()
   name   value
0  NULS  14,005
1   VEN  84,486
2   EDO  20,052
3  CLUB   1,996
4   HSR   8,433
You will need to make sure that geckodriver is installed and that it is in your PATH. I just scraped the name of each coin and the value but getting the rest of the information should be easy.
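The scraped values are strings such as '14,005', so they need converting before any arithmetic. Here is a minimal sketch in plain Python, assuming the commas are thousands separators. The results list is hard-coded from the output above and to_int is a made-up helper name.

```python
# Rows as scraped above (hard-coded so this runs without a browser)
results = [
    {"name": "NULS", "value": "14,005"},
    {"name": "VEN", "value": "84,486"},
    {"name": "CLUB", "value": "1,996"},
]


def to_int(text):
    """Turn a string like '14,005' into the integer 14005."""
    # Assumption: commas are thousands separators, not decimal marks
    return int(text.replace(",", "").strip())


cleaned = [{"name": r["name"], "value": to_int(r["value"])} for r in results]
total = sum(r["value"] for r in cleaned)
print(cleaned, total)
```

Building the DataFrame from cleaned instead of results should give a numeric value column that can be sorted and summed directly.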

Using the BeautifulSoup find method to obtain data from a table row

I am writing a Python script using BeautifulSoup to scrape values from this webpage: https://uk-air.defra.gov.uk/latest/currentlevels
I want to use soup.find() to get values for "Hourly mean Nitrogen dioxide" and "Last updated" from the table row where the "Monitoring site" is "Edinburgh St Leonards".
As I am new to web scraping I am having a bit of trouble so would be grateful for any help on this.

Scrape all the HTML tables into a list of tables.
The table index may change, so you should not rely on a fixed row/column index.
Part of the following script looks up the column index of the data being searched for. It also prints the header name, so you know which data you are getting.
from bs4 import BeautifulSoup
import urllib.request
import re
with urllib.request.urlopen('https://uk-air.defra.gov.uk/latest/currentlevels?view=region') as response:
    htmlData = response.read()
soup = BeautifulSoup(htmlData, 'html5lib')
tables = soup.find_all('table', attrs={'class': 'current_levels_table'})
# what you want to check:
Iwant = ['nitrogen', 'update']
about = 'Edinburgh'
for table in tables:
    # map column number -> header name for the columns we are looking for
    index = {}
    table_head = table.find('thead')
    headrows = table_head.find_all('tr')
    measures = headrows[1].find_all('th')
    for colnum, measure in enumerate(measures):
        index.update({colnum: measure.text.strip() for wanted in Iwant if re.search(wanted, measure.text, re.IGNORECASE)})
    # get table content and look for Edinburgh
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cels = row.find_all('td')
        rowContent = [cel.text.strip().replace(u'\xa0', u' ').replace(u'\n Timeseries Graph', u'') for cel in cels if cel]
        if re.search(about, rowContent[0], re.IGNORECASE):
            for indexwanted, measurewanted in index.items():
                print(measurewanted, ':', rowContent[indexwanted])

Making use of the suggestion from d2718nis, you can do it in this way. Of course, many other ways would work too.
First, find the link that has the 'Edinburgh St Leonards' text in it. Then find the grandparent of that link element, which is a tr element. Now identify the td elements in the tr. When you examine the table you see that the columns you want are the 4th and 7th. Get those from all of the td elements as the (0-relative) 3rd and 6th. Finally, display the crude texts of these elements.
You will need to do something clever to extract properly readable strings from these results.
>>> import requests
>>> import bs4
>>> page = requests.get('https://uk-air.defra.gov.uk/latest/currentlevels', headers={'User-Agent': 'Not blank'}).content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> Edinburgh_link = soup.find_all('a',string='Edinburgh St Leonards')[0]
>>> Edinburgh_link
Edinburgh St Leonards
>>> Edinburgh_row = Edinburgh_link.find_parent('td').find_parent('tr')
>>> Edinburgh_columns = Edinburgh_row.find_all('td')
>>> Edinburgh_columns[3]
<td class="center"><span class="bg_low1 bold">20 (1 Low)</span></td>
>>> Edinburgh_columns[6]
<td>05/08/2017<br/>14:00:00</td>
>>> Edinburgh_columns[3].text
'20\xa0(1\xa0Low)'
>>> Edinburgh_columns[6].text
'05/08/201714:00:00'
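One possible way to turn those crude texts into usable values, as a sketch using only the standard library. The two input strings are copied from the session above. Reading "20 (1 Low)" as a reading followed by an index and a band, and the date as day/month/year, are assumptions about the site's format.

```python
import re
from datetime import datetime

# Raw cell texts copied from the session above
raw_level = "20\xa0(1\xa0Low)"
raw_updated = "05/08/201714:00:00"

# Replace the non-breaking spaces, then pick the parts out with a regex
level_text = raw_level.replace("\xa0", " ")
match = re.match(r"(\d+)\s*\((\d+)\s+(\w+)\)", level_text)
value = int(match.group(1))        # assumed to be the hourly mean reading
band_index = int(match.group(2))   # assumed to be the index number
band = match.group(3)              # assumed to be the band name

# The <br/> between date and time vanished in .text, so parse without a separator
# Assumption: the site prints dates as day/month/year
updated = datetime.strptime(raw_updated, "%d/%m/%Y%H:%M:%S")
print(value, band_index, band, updated.isoformat())
```

Calling get_text(" ") on the cell instead of .text should keep a space where the <br/> was, which makes the timestamp easier to read and parse.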

You can start with this:
import requests
from bs4 import BeautifulSoup
# Request the page, set headers to prevent 403 Forbidden
page = requests.get(
    url='https://uk-air.defra.gov.uk/latest/currentlevels',
    headers={'User-Agent': 'Not blank'})
# Get html from page
html = page.text
# BeautifulSoup object
soup = BeautifulSoup(html, 'html5lib')
for table in soup.find_all('table'):
    # Print all tables on the page
    print(table)

Parse html table with BeautifulSoup4 and Python 3

I am trying to scrape certain financial data from Yahoo Finance. Specifically in this case, a single revenue number (type: double)
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
searchurl = "http://finance.yahoo.com/q/ks?s=AAPL"
f = urlopen(searchurl)
html = f.read()
soup = BeautifulSoup(html, "html.parser")
revenue = soup.find("div", {"class": "yfnc_tabledata1", "id":"yui_3_9_1_8_1456172462911_38"})
print (revenue)
The view source inspection from Chrome looks like this:
I am trying to scrape the "234.99B" number, strip the "B", and convert it to a decimal. There is something wrong with my soup.find line. Where am I going wrong?

Locate the td element with the text Revenue (ttm): and get its next td sibling (string= is the current name of the older text= argument, available since bs4 4.4.0):
revenue = soup.find("td", string="Revenue (ttm):").find_next_sibling("td").text
print(revenue)
Prints 234.99B.
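The question also asks how to strip the "B" and convert the value to a decimal. Here is a minimal sketch using the standard decimal module. The suffix table and the function name parse_abbreviated are assumptions made for this example, not something Yahoo Finance documents.

```python
from decimal import Decimal

# Assumed meaning of the suffixes: thousand, million, billion, trillion
MULTIPLIERS = {
    "K": Decimal(10) ** 3,
    "M": Decimal(10) ** 6,
    "B": Decimal(10) ** 9,
    "T": Decimal(10) ** 12,
}


def parse_abbreviated(text):
    """Convert a string like '234.99B' to a Decimal number of units."""
    text = text.strip().replace(",", "")
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        return Decimal(text[:-1]) * MULTIPLIERS[suffix]
    return Decimal(text)  # no suffix: parse the number as it is


revenue = parse_abbreviated("234.99B")
print(revenue)
```

Decimal avoids the rounding surprises of float for money-like values. If you only need the 234.99 part, Decimal(revenue_text[:-1]) on its own is enough.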
