Python Webscraping with BeautifulSoup not displaying full content - python

I am trying to scrape all the text from a webpage which is embedded within the "td" tags that have a class="calendar__cell calendar__currency currency ". As of now my code only returns the first occurrence of this tag and class. How can I keep it iterating through the source code so that it returns all occurrences one by one? The webpage is forexfactory.com.
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.forexfactory.com/#detail=108867").text
soup = BeautifulSoup(source, 'lxml')
body = soup.find("body")
article = body.find("table", class_="calendar__table")
actual = article.find("td", class_="calendar__cell calendar__actual actual")
forecast = article.find("td", class_="calendar__cell calendar__forecast forecast").text
currency = article.find("td", class_="calendar__cell calendar__currency currency")
Tcurrency = currency.text
Tactual = actual.text
print(Tcurrency)

You have to use find_all() to get all elements, and then you can use a for-loop to iterate over them.
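As a minimal, self-contained illustration of the difference (using a made-up HTML snippet, not the actual ForexFactory markup):

```python
from bs4 import BeautifulSoup

# Made-up HTML with three cells sharing the same class
html = """
<table>
  <tr><td class="currency">CHF</td></tr>
  <tr><td class="currency">EUR</td></tr>
  <tr><td class="currency">GBP</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match
print(soup.find("td", class_="currency").text)  # CHF

# find_all() returns every match, which you can loop over
for td in soup.find_all("td", class_="currency"):
    print(td.text)
```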
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.forexfactory.com/#detail=108867")
soup = BeautifulSoup(r.text, 'lxml')

table = soup.find("table", class_="calendar__table")
for row in table.find_all('tr', class_='calendar__row--grey'):
    currency = row.find("td", class_="currency")
    #print(currency.prettify())  # before get_text()
    currency = currency.get_text(strip=True)

    actual = row.find("td", class_="actual")
    actual = actual.get_text(strip=True)

    forecast = row.find("td", class_="forecast")
    forecast = forecast.get_text(strip=True)

    print(currency, actual, forecast)
Result
CHF 96.4 94.6
EUR 0.8% 0.9%
GBP 43.7K 41.3K
EUR 1.35|1.3
USD -63.2B -69.2B
USD 0.0% 0.2%
USD 48.9 48.2
USD 1.2% 1.5%
BTW: I found that this page uses JavaScript to redirect the page, and in a browser I see a table with different values. But if I turn off JavaScript in the browser, then it shows me the data which I get with the Python code. BeautifulSoup and requests can't run JavaScript. If you need the data as seen in the browser, then you may need Selenium to control a web browser, which can run JavaScript.

Related

bs4 findAll not collecting all of the data from the other pages on the website

I'm trying to scrape a real estate website using BeautifulSoup.
I'm trying to get a list of rental prices for London. This works but only for the first page on the website. There are over 150 of them so I'm missing out on a lot of data. I would like to be able to collect all the prices from all the pages. Here is the code I'm using:
import requests
from bs4 import BeautifulSoup as soup
url = 'https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home'
response = requests.get(url)
response.status_code
data = soup(response.content, 'lxml')
prices = []
for line in data.findAll('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}):
    price = str(line).split('>')[2].split(' ')[0].replace('£', '').replace(',', '')
    price = int(price)
    prices.append(price)
Any idea as to why I can't collect the prices from all the pages using this script?
Extra question: is there a way to access the price using soup, i.e. without doing any list/string manipulation? When I call data.find('div', {'class': 'css-1e28vvi-PriceContainer e2uk8e7'}) I get a string of the following form: <div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price"><p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>
Any help would be much appreciated!
You can append &pn=<page number> parameter to the URL to get next pages:
import re
import requests
from bs4 import BeautifulSoup as soup
url = "https://www.zoopla.co.uk/to-rent/property/central-london/?beds_max=5&price_frequency=per_month&q=Central%20London&results_sort=newest_listings&search_source=home&pn="
prices = []
for page in range(1, 3):  # <-- increase number of pages here
    data = soup(requests.get(url + str(page)).content, "lxml")

    for line in data.findAll(
        "div", {"class": "css-1e28vvi-PriceContainer e2uk8e7"}
    ):
        price = line.get_text(strip=True)
        price = int(re.sub(r"[^\d]", "", price))
        prices.append(price)
        print(price)

print("-" * 80)
print(len(prices))
Prints:
...
1993
1993
--------------------------------------------------------------------------------
50
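Regarding the extra question: get_text() returns the tag's text without any manual splitting. A minimal sketch against the exact fragment quoted in the question:

```python
import re
from bs4 import BeautifulSoup

# The fragment shown in the question
html = ('<div class="css-1e28vvi-PriceContainer e2uk8e7" data-testid="listing-price">'
        '<p class="css-1o565rw-Text eczcs4p0" size="6">£3,012 pcm</p></div>')

tag = BeautifulSoup(html, "html.parser").find("div", class_="css-1e28vvi-PriceContainer")
text = tag.get_text(strip=True)          # '£3,012 pcm'
price = int(re.sub(r"[^\d]", "", text))  # drop everything but the digits
print(price)  # 3012
```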

BeautifulSoup limits to two decimals, and numbers that start with 0.00 result in "0.0"

The goal of this app is to print the prices of cryptocurrencies. It prints the prices of the 100 currencies on the first page of coinmarketcap; however, it only records up to two decimals and I don't know why. When the number starts with 0.00, the application only prints 0.0. Why?
from bs4 import BeautifulSoup
import requests
url = "https://coinmarketcap.com/"
result = requests.get(url).text
doc = BeautifulSoup(result, "html.parser")
tbody = doc.tbody
trs = tbody.contents
for tr in trs[:10]:
    price = tr.contents[4]
    price = price.text
The HTML changes here, so new code is in order:
for tr in trs[10:]:
    try:
        price = tr.contents
        price = str(price)
        price = price.split("$<!-- -->", 1)[1]
        price = price.split("</span></td", 1)[0]
        price = float(price)
    finally:
        pass
    print(price)
print()
I've tried using Selenium to extract the information, but I only want to use BeautifulSoup.
Your program says that the price is zero because that’s what the HTML says. When I visit https://coinmarketcap.com/ in a browser with scripting disabled, it tells me that eCash is worth “$0.00”. If I turn scripting back on and refresh the page, then it tells me that it’s worth “$0.0002358”.
Instead of parsing the HTML, you could try using the CoinMarketCap API.
That is because you are converting the price, which is a string, to a float. See below:
>>> float('0.00')
0.0
Also, I don't see any difference between 0.00 and 0.0. Both values are the same.
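The point can be reproduced without any scraping at all; a quick sketch of the truncation, and of how string formatting controls what is displayed:

```python
# float('0.00') and float('0.0') are the same number
x = float('0.00')
print(x)           # 0.0  (repr drops trailing zeros)
print(f"{x:.2f}")  # 0.00 (formatting restores them)

# A genuinely small price keeps its digits if you never truncate it
y = float('0.0002358')
print(f"{y:.7f}")  # 0.0002358
```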
from bs4 import BeautifulSoup
import requests
url = "https://coinmarketcap.com/"
result = requests.get(url).text
soup = BeautifulSoup(result, "lxml")
t = soup.find('table', class_='cmc-table').find('tbody')
trs = t.find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    name = tds[2].text.strip()
    price = tds[3].text.strip()
    print(f'{name}\t{price}')
Sample output where price is $0.00 - I did not convert the price to float
eCashXEC $0.00
SHIBA INUSHIB $0.00
BitTorrentBTT $0.00

BeautifulSoup not finding dates

I'm trying to scrape some data from here: https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly.
I'd like to get the dates in the first row (ie. 31-Mar-21 31-Dec-20 30-Sep-20 30-Jun-20 31-Mar-20).
The problem comes when I try to get the date, with bs4 it outputs nothing. I wrote this code:
import requests
from bs4 import BeautifulSoup

url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
a = soup.find('div', attrs={"class": "tables-container"})
date = a.find("time").text
When I execute it, it gives me nothing. Printing a, it can be seen that find() doesn't get the date:
<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>
Thanks.
The data is embedded within the page in JSON form. You can use this example how to parse it:
import json
import requests
from bs4 import BeautifulSoup
url = "https://www.reuters.com/companies/AMPF.MI/financials/income-statement-quarterly"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
x = data["props"]["initialState"]["markets"]["financials"]["financial_tables"]
headers = x["income_interim_tables"][0]["headers"]
print(*headers, sep="\n")
Prints:
2021-03-31
2020-12-31
2020-09-30
2020-06-30
2020-03-31
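The same pattern, pulling JSON out of a script tag, can be sketched against a made-up miniature page (the real page's #__NEXT_DATA__ payload is far larger, with the nested keys shown above):

```python
import json
from bs4 import BeautifulSoup

# Made-up miniature of a Next.js page; only the shape matters here
html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"headers": ["2021-03-31", "2020-12-31"]}}
</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# The script tag's sole child is the raw JSON text
data = json.loads(soup.select_one("#__NEXT_DATA__").contents[0])
print(data["props"]["headers"])
```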
As I do not have enough reputation to comment:
The problem is that the scraped HTML does not contain the dates. The time tags are empty.
You need a way to scrape while pre-rendering the JavaScript which fills in the dates. This is a different topic which requires some headless browser or other approaches, e.g. https://www.scrapingbee.com/blog/scrapy-javascript/

Get a <span> value using python web scrape

I am trying to get a product price using BeautifulSoup in Python.
But I keep getting errors, no matter what I try.
[Picture of the site I am trying to scrape]
I want to get the 19,90 value.
I have already written code to get all the product names, and now I need their prices.
import requests
from bs4 import BeautifulSoup
url = 'https://www.zattini.com.br/busca?nsCat=Natural&q=amaro&searchTermCapitalized=Amaro&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
price = soup.find('span', itemprop_='price')
print(price)
A less ideal approach is parsing out the JSON containing the prices:
import requests
import json
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.zattini.com.br/busca?nsCat=Natural&q=amaro&searchTermCapitalized=Amaro&page=1'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
scripts = [script.text for script in soup.select('script') if 'var freedom = freedom ||' in script.text]
pricesJson = scripts[0].split('"items":')[1].split(']')[0] + ']'
prices = [item['price'] for item in json.loads(pricesJson)]
names = [name.text for name in soup.select('#item-list [itemprop=name]')]
results = list(zip(names,prices))
df = pd.DataFrame(results)
print(df)
Sample output: a two-column DataFrame of names and prices.
span[itemprop='price'] is generated by JavaScript. The original value is stored in div[data-final-price] with a value like 1990, and you can format it to 19,90 with a regex.
import re
...
soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.select('div[data-final-price]')
for price in prices:
    price = re.sub(r'(\d\d$)', r',\1', price['data-final-price'])
    print(price)
Results:
19,90
134,89
29,90
119,90
104,90
59,90
....
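The regex step can be checked in isolation on plain strings (a quick sketch using values like the ones above):

```python
import re

# Insert a comma before the last two digits: 1990 -> 19,90
for raw in ("1990", "13489", "11990"):
    print(re.sub(r'(\d\d$)', r',\1', raw))
```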

Parse html table with BeautifulSoup4 and Python 3

I am trying to scrape certain financial data from Yahoo Finance. Specifically in this case, a single revenue number (type: double)
Here is my code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
searchurl = "http://finance.yahoo.com/q/ks?s=AAPL"
f = urlopen(searchurl)
html = f.read()
soup = BeautifulSoup(html, "html.parser")
revenue = soup.find("div", {"class": "yfnc_tabledata1", "id":"yui_3_9_1_8_1456172462911_38"})
print (revenue)
The view-source inspection from Chrome looks like this: [screenshot omitted]
I am trying to scrape the "234.99B" number, strip the "B", and convert it to a decimal. There is something wrong with my soup.find line; where am I going wrong?
Locate the td element with Revenue (ttm): text and get the next td sibling:
revenue = soup.find("td", text="Revenue (ttm):").find_next_sibling("td").text
print(revenue)
Prints 234.99B.
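The pattern generalizes to any label/value table. A self-contained sketch with made-up markup (the figures are illustrative, not live data; string= is the modern name for the text= argument):

```python
from bs4 import BeautifulSoup

# Made-up two-row table in the style of the old Yahoo key-statistics page
html = """
<table>
  <tr><td>Revenue (ttm):</td><td>234.99B</td></tr>
  <tr><td>Market Cap:</td><td>600.12B</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the label cell by its text, then step to the adjacent value cell
revenue = soup.find("td", string="Revenue (ttm):").find_next_sibling("td").text
print(revenue)  # 234.99B

# Strip the "B" suffix and convert to a number (in billions)
value = float(revenue.rstrip("B"))
print(value)  # 234.99
```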
