Scraping using BeautifulSoup, value is not clean - python

I'm trying to scrape a nutrition label (http://smartlabel.generalmills.com/41196891218), and I'm having a hard time getting a clean gram value for each category.
For example, this is how it comes out for fat:
('fat': '\n 1 g\n ',)
Is there any way to get something like ("fat": 1g) instead?
I just started learning bs4 yesterday, so any help will be appreciated!
My code is:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

def minenutrition1(link):
    driver = webdriver.Chrome()
    driver.get(link)
    # noticed there is an ad here, sleep until the page is fully loaded
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source)
    driver.quit()
    calories = soup.find_all("span", {"class": "header2"})[0].text
    fat = soup.find_all("span", {"class": "gram-value"})[0].text
    satfat = soup.find_all("span", {"class": "gram-value"})[1].text
    cholesterol = soup.find_all("span", {"class": "gram-value"})[3].text
    sodium = soup.find_all("span", {"class": "gram-value"})[4].text
    carb = soup.find_all("span", {"class": "gram-value"})[5].text
    Total_sugar = soup.find_all("span", {"class": "gram-value"})[7].text
    protein = soup.find_all("span", {"class": "gram-value"})[9].text
    name = soup.find_all('div', {'class': 'product-header-name header1'})[0].text
    upc = soup.find_all("div", {"class": "upc sub-header"})
    upc = upc[0].text

You get a normal string "\n 1 g\n ", so you can use string functions to clean it up.
Using "\n 1 g\n ".strip() you get "1 g".
So you can add .strip() at the end of this line:
fat = soup.find_all("span",{"class":"gram-value"})[0].text.strip()
or do it later
fat = fat.strip()
BeautifulSoup also has the method .get_text(strip=True), which you can use instead of .text:
fat = soup.find_all("span",{"class":"gram-value"})[0].get_text(strip=True)
Minimal working code:
I display fat between > and < to show whether there are spaces, tabs, or newlines around it.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
url = 'http://smartlabel.generalmills.com/41196891218'
driver = webdriver.Chrome()
#driver = webdriver.Firefox()
driver.get(url)
# noticed there is an ad here, sleep til page fully loaded.
time.sleep(1)
soup = BeautifulSoup(driver.page_source)
driver.quit()
items = soup.find_all("span", {"class": "gram-value"})
fat = items[0].text
print('>{}<'.format(fat))
fat = items[0].text.strip()
print('>{}<'.format(fat))
fat = items[0].get_text(strip=True)
print('>{}<'.format(fat))
Result:
>
1 g
<
>1 g<
>1 g<
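If you want the value without the inner space, like 1g in the question, you can go one step further and remove the space or pull the number and unit out with a regular expression. A small sketch of that idea (the raw string below is just the example value from the question):
import re

raw = '\n 1 g\n '                       # the example value from the question
print(raw.strip().replace(' ', ''))     # -> 1g

# or capture the number and the unit separately
match = re.search(r'([\d.]+)\s*([a-zA-Z]+)', raw)
if match:
    value, unit = match.groups()        # -> ('1', 'g')
    print(value, unit)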

For this, I would not use Selenium. Not that you can't, but you can get the underlying data straight away with requests. This is a little bit of a stretch since you are just beginning with BeautifulSoup, but if you open Dev Tools (Ctrl-Shift-I) and reload the page, you will notice the requests made in the right panel under Network -> XHR. There is a request to GetNutritionalDetails.
Within there, you'll see the request url, the request headers, and at the bottom the payload. You will also see it's a POST request (usually you'd use GET).
The data is within a list (<li> tags), so it's just a matter of getting all those tags, then iterating through each of them to pull out the data.
You can append that data to a list, and then turn that list into a table/dataframe with pandas.
Code:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'http://smartlabel.generalmills.com/GTIN/GetNutritionalDetails'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
payload = {
    'id': '41196891218',
    'servingSize': 'AS PACKAGED'}

response = requests.post(url, headers=headers, params=payload)
soup = BeautifulSoup(response.text, 'html.parser')

listItems = soup.find_all('li')

labels = []
gramValues = []
percValues = []
for each in listItems:
    label = each.find('label').text.strip()
    if label == 'Includes':
        label += ' Added Sugar'
    gram = each.find('span', {'class': 'gram-value'}).text.strip()
    if each.find('span', {'class': 'dv-result'}):
        perc = each.find('span', {'class': 'dv-result'}).text.strip()
    else:
        perc = ''
    labels.append(label)
    gramValues.append(gram)
    percValues.append(perc)

df = pd.DataFrame({
    'Label': labels,
    'Grams': gramValues,
    'Percent': percValues})
Output:
print (df)
Label Grams Percent
0 Total Fat 1 g 1 %
1 Saturated Fat 0 g 0 %
2 Trans Fat 0 g
3 Cholesterol 0 mg 0 %
4 Sodium 810 mg 35 %
5 Total Carbohydrate 17 g 6 %
6 Dietary Fiber 2 g 6 %
7 Total Sugar 2 g
8 Includes Added Sugar 2 g 3 %
9 Protein 4 g
10 Vitamin D 0 µg 0 %
11 Calcium 60 mg 4 %
12 Iron 1.2 mg 6 %
13 Potassium 0 mg 0 %
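If you need the amounts as numbers rather than strings like "1 g", you could split the Grams column into a numeric amount and a unit, for example (a small follow-up sketch on the dataframe built above; the regex assumes values shaped like "1 g", "810 mg", "1.2 mg"):
df[['Amount', 'Unit']] = df['Grams'].str.extract(r'([\d.]+)\s*(\S+)')
df['Amount'] = pd.to_numeric(df['Amount'])
print(df[['Label', 'Amount', 'Unit']])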

Related

How to scrape a table from a page and create a multi-column dataframe with python?

This website's database, https://aviation-safety.net/wikibase/, covers the years 1902 to 2022.
I am trying to scrape the table, narrative, probable cause, and classification for every accident in the years 2015 and 2016: https://aviation-safety.net/database/dblist.php?Year=2015. With the code below I am able to scrape the table only:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
import concurrent.futures
import itertools
from random import randint
from time import sleep

def scraping(year):
    headers = {
        'accept': '*/*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }
    url = f'https://aviation-safety.net/database/dblist.php?Year={year}&sorteer=datekey&page=1'
    #sleep(randint(1,3))
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    page_container = soup.find('div', {'class': 'pagenumbers'})
    pages = max([int(page['href'].split('=')[-1]) for page in page_container.find_all('a')])
    #info = []
    tl = []
    for page in range(1, pages+1):
        new_url = f'https://aviation-safety.net/database/dblist.php?Year={year}&lang=&page={page}'
        print(new_url)
        #sleep(randint(1,3))
        data = requests.get(new_url, headers=headers)
        soup = BeautifulSoup(data.text, 'html.parser')
        table = soup.find('table')
        for index, row in enumerate(table.find_all('tr')):
            if index == 0:
                continue
            link_ = 'https://aviation-safety.net/' + row.find('a')['href']
            #sleep(randint(1,3))
            new_page = requests.get(link_, headers=headers)
            new_soup = BeautifulSoup(new_page.text, 'lxml')
            table1 = new_soup.find('table')
            for i in table1.find_all('tr'):
                title = i.text
                tl.append(title)
    df = pd.DataFrame(tl)
    df.columns = ['status']
    df.to_csv(f'{year}_aviation-safety_new.csv', encoding='utf-8-sig', index=False)

if __name__ == "__main__":
    START = 2015
    STOP = 2016
    years = [year for year in range(START, STOP+1)]
    print(f'Scraping {len(years)} years of data')
    with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
        final_list = executor.map(scraping, years)
But the data is not organized: everything ends up stacked in a single column, instead of one column per field (Status, Time, and so on).
It looks like the values of tl are strings, e.g. 'Status:Accident investigation report completed and information captured'.
Converting the list of strings into a pd.DataFrame gets you a single column with all the values in the list.
If you want to use the "name" of the string, e.g. Status as a column header, you'll need to separate it from the rest of the text.
# maxsplit of 1 so we don't accidentally split up the values, e.g. time
title, text = title.split(":", maxsplit=1)
This looks like
('Status', 'Accident investigation report completed and information captured')
Now we create a dictionary
row_dict[title] = text
Giving us
{'Status': 'Accident investigation report completed and information captured'}
We will add to this same dictionary in the last loop
# old
for i in table1.find_all('tr'):
    title = i.text
    tl.append(title)

# new
row_dict = {}
for i in table1.find_all('tr'):
    title = i.text
    title, text = title.split(":", maxsplit=1)
    row_dict[title] = text
After we've gathered all the data from the page, i.e. completed the row_dict loop, we append it to tl.
row_dict = {}
for i in table1.find_all('tr'):
    title = i.text
    title, text = title.split(":", maxsplit=1)
    row_dict[title] = text
tl.append(row_dict)
All together now
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime
import re
import concurrent.futures
import itertools
from random import randint
from time import sleep

def scraping(year):
    headers = {
        'accept': '*/*',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
    }
    url = f'https://aviation-safety.net/database/dblist.php?Year={year}&sorteer=datekey&page=1'
    #sleep(randint(1,3))
    req = requests.get(url, headers=headers)
    soup = BeautifulSoup(req.text, 'html.parser')
    page_container = soup.find('div', {'class': 'pagenumbers'})
    pages = max([int(page['href'].split('=')[-1]) for page in page_container.find_all('a')])
    #info = []
    tl = []
    for page in range(1, pages+1):
        new_url = f'https://aviation-safety.net/database/dblist.php?Year={year}&lang=&page={page}'
        print(new_url)
        #sleep(randint(1,3))
        data = requests.get(new_url, headers=headers)
        soup = BeautifulSoup(data.text, 'html.parser')
        table = soup.find('table')
        for index, row in enumerate(table.find_all('tr')):
            if index == 0:
                continue
            link_ = 'https://aviation-safety.net/' + row.find('a')['href']
            #sleep(randint(1,3))
            new_page = requests.get(link_, headers=headers)
            new_soup = BeautifulSoup(new_page.text, 'lxml')
            table1 = new_soup.find('table')
            # make changes here!!!!!!!
            row_dict = {}
            for i in table1.find_all('tr'):
                title = i.text
                title, text = title.split(":", maxsplit=1)
                row_dict[title] = text
            tl.append(row_dict)
    df = pd.DataFrame(tl)
    df.to_csv(f'{year}_aviation-safety_new.csv', encoding='utf-8-sig', index=False)

if __name__ == "__main__":
    START = 2015
    STOP = 2016
    years = [year for year in range(START, STOP+1)]
    print(f'Scraping {len(years)} years of data')
    with concurrent.futures.ThreadPoolExecutor(max_workers=60) as executor:
        final_list = executor.map(scraping, years)
The read_html() method offers convenient access to such datasets.
>>> url = "https://web.archive.org/web/20221027040903/https://aviation-safety.net/database/dblist.php?Year=2015"
>>>
>>> dfs = pd.read_html(url)
>>>
>>> df = dfs[1].drop(columns="operator").dropna(axis=1, how="all")
>>> df["date"] = pd.to_datetime(df.date.str.replace("??-", "01-", regex=False), format="%d-%b-%Y")
>>> df.set_index("date")
type registration fat. location cat
date
2015-01-02 Saab 340B G-LGNL 0 Stornoway Ai... A1
2015-01-03 Antonov An-26B-100 RA-26082 0 Magadan-Soko... A1
2015-01-04 Fokker 50 5Y-SIB 0 Nairobi-Jomo... A1
2015-01-08 Bombardier Challenger 300 PR-YOU 0 São Paulo-Co... O1
2015-01-09 Cessna 208B Grand Caravan 8R-GAB 0 Matthews Rid... A2
... ... ... ... ... ..
2015-06-11 Eclipse 500 N508JA 0 Sacramento-E... A2
2015-06-11 Hawker 800XP N497AG 0 Port Harcour... A1
2015-06-12 Boeing 737-33A VH-NLK 0 near Kosrae Airpo... I2
2015-06-15 Antonov An-2R RA-84553 0 Tatsinsky di... A1
2015-06-16 Boeing 737-322 (WL) LY-FLB 0 Aktau Airpor... O1
[100 rows x 5 columns]
It's hard to control the user-agent header when calling read_html directly, so either use a cooperative site, or do a bit of extra work with requests or curl to obtain the html text beforehand, as in the sketch below.
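A minimal sketch of that extra step, fetching the page with requests under a browser-like User-Agent and then handing the text to read_html (the table index 1 follows the example above and may differ for other pages):
import requests
import pandas as pd

url = "https://aviation-safety.net/database/dblist.php?Year=2015"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

resp = requests.get(url, headers=headers)   # fetch the html ourselves
dfs = pd.read_html(resp.text)               # let pandas parse every table in it
df = dfs[1]                                 # index 1 follows the example above; it may differ
print(df.head())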

How to scrape a website while iterating over multiple pages

Trying to scrape this website using Python and BeautifulSoup:
https://www.leandjaya.com/katalog
I'm having some trouble navigating the multiple pages of the website and scraping each of them. The website has 11 pages, and I'm curious to know the best option to achieve this, like using a for loop and breaking out of it if the page doesn't exist.
This is my initial code. I have set a big number, 50, but that doesn't seem like a good option.
page = 1
while page != 50:
    url = f"https://www.leandjaya.com/katalog/ss/1/{page}/"
    main = requests.get(url)
    pmain = BeautifulSoup(main.text, 'lxml')
    page = page + 1
Sample output:
https://www.leandjaya.com/katalog/ss/1/1/
https://www.leandjaya.com/katalog/ss/1/2/
https://www.leandjaya.com/katalog/ss/1/3/
https://www.leandjaya.com/katalog/ss/1/<49>/
This is one way to extract that info and display it in a dataframe, based on an unknown number of pages with data:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

cars_list = []
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}

s = requests.Session()
s.headers.update(headers)

counter = 1
while True:
    try:
        print('page:', counter)
        url = f'https://www.leandjaya.com/katalog/ss/1/{counter}/'
        r = s.get(url)
        soup = bs(r.text, 'html.parser')
        cars_cards = soup.select('div.item')
        if len(cars_cards) < 1:
            print('all done, no cars left')
            break
        for car in cars_cards:
            car_name = car.select_one('div.item-title').get_text(strip=True)
            car_price = car.select_one('div.item-price').get_text(strip=True)
            cars_list.append((car_name, car_price))
        counter = counter + 1
    except Exception as e:
        print('all done')
        break

df = pd.DataFrame(cars_list, columns=['Car', 'Price'])
print(df)
Result:
page: 1
page: 2
page: 3
page: 4
page: 5
page: 6
page: 7
page: 8
page: 9
page: 10
page: 11
page: 12
all done, no cars left
Car Price
0 HONDA CRV 4X2 2.0 AT 2001 DP20jt
1 DUJUAL XPANDER 1.5 GLS 2018 MANUAL DP53jt
2 NISSAN JUKE 1.5 CVT 2011 MATIC DP33jt
3 Mitsubishi Xpander 1.5 Exceed Manual 2018 DP50jt
4 BMW X1 2.0 AT SDRIVE 2011 DP55jt
... ... ...
146 Daihatsu Sigra 1.2 R AT DP130jt
147 Daihatsu Xenia Xi 2010 DP85jt
148 Suzuki Mega Carry Pick Up 1.5 DP90jt
149 Honda Mobilio Tipe E Prestige DP150jt
150 Honda Freed Tipe S Rp. 170jtRp. 165jt
151 rows × 2 columns
The relevant documentation for the packages used above can be found at:
https://beautiful-soup-4.readthedocs.io/en/latest/index.html
https://requests.readthedocs.io/en/latest/
https://pandas.pydata.org/pandas-docs/stable/index.html

How to scrape all rows from a dynamic table in html with multiple displays using python

Here's the link for scraping: https://stockanalysis.com/stocks/
I'm trying to get all the rows of the table (6000+ rows), but I only get the first 500 results. I guess it has to do with the condition of how many rows to display.
I tried almost everything I could. I'm also a beginner in web scraping.
My code:
# Importing libraries
import numpy as np   # numerical computing library
import pandas as pd  # panel data library
import requests      # http requests library
from bs4 import BeautifulSoup

url = 'https://stockanalysis.com/stocks/'
headers = {'User-Agent': ' user agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36'}
r = requests.get(url, headers)
soup = BeautifulSoup(r.text, 'html')

league_table = soup.find('table', class_='symbol-table index')
col_df = ['Symbol', 'Company_name', 'Industry', 'Market_Cap']
for team in league_table.find_all('tbody'):
    # i = 1
    rows = team.find_all('tr')
    df = pd.DataFrame(np.zeros([len(rows), len(col_df)]))
    df.columns = col_df
    for i, row in enumerate(rows):
        s_symbol = row.find_all('td')[0].text
        s_company_name = row.find_all('td')[1].text
        s_industry = row.find_all('td')[2].text
        s_market_cap = row.find_all('td')[3].text
        df.iloc[i] = [s_symbol, s_company_name, s_industry, s_market_cap]

len(df)  # should be > 6000
What should I do?
Take a look down at the bottom of the html and you will see this:
<script id="__NEXT_DATA__" type="application/json">
Try using bs4 to find this tag and load the data from inside it; I think it contains everything you need.
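A minimal sketch of that idea (assuming the tag body is plain JSON; the exact path to the stock list inside it is shown in the fuller answer below):
import json
import requests
from bs4 import BeautifulSoup

r = requests.get('https://stockanalysis.com/stocks/')
soup = BeautifulSoup(r.text, 'html.parser')

script = soup.find('script', {'id': '__NEXT_DATA__'})
data = json.loads(script.string)   # the tag body is one big JSON document
print(list(data.keys()))           # inspect the keys to locate the stock list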
As stated, it's in the <script> tags. Pull it and read it in.
import requests
from bs4 import BeautifulSoup
import json
import re
import pandas as pd
url = 'https://stockanalysis.com/stocks/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = str(soup.find('script', {'id':'__NEXT_DATA__'}))
jsonStr = re.search('({.*})', jsonStr).group(0)
jsonData = json.loads(jsonStr)
df = pd.DataFrame(jsonData['props']['pageProps']['stocks'])
Output:
print(df)
s ... i
0 A ... Life Sciences Tools & Services
1 AA ... Metals & Mining
2 AAC ... Blank Check / SPAC
3 AACG ... Diversified Consumer Services
4 AACI ... Blank Check / SPAC
... ... ...
6033 ZWS ... Utilities-Regulated Water
6034 ZY ... Chemicals
6035 ZYME ... Biotechnology
6036 ZYNE ... Pharmaceuticals
6037 ZYXI ... Health Care Equipment & Supplies
[6038 rows x 4 columns]

How to get all products from a beautifulsoup page

I want to get all the products on this page:
nike.com.br/snkrs#estoque
My python code is this:
import requests
from bs4 import BeautifulSoup as bs4

produtos = []

def aviso():
    print("Started!")
    request = requests.get("https://www.nike.com.br/snkrs#estoque")
    soup = bs4(request.text, "html.parser")
    links = soup.find_all("a", class_="btn", text="Comprar")
    links_filtred = list(set(links))
    for link in links_filtred:
        if(produto not in produtos):
            request = requests.get(f"{link['href']}")
            soup = bs4(request.text, "html.parser")
            produto = soup.find("div", class_="nome-preco-produto").get_text()
            if(code_formated == ""):
                code_formated = "\u200b"
            print(f"Nome: {produto} Link: {link['href']}\n")
            produtos.append(link["href"])

aviso()
Guys, this code gets products from the page, but not all of them. I suspect the content is dynamic, but how can I get them all with requests and BeautifulSoup? I don't want to use Selenium or an automation library, and I don't want to have to change my code a lot because it's almost done. How do I do that?
DO NOT USE plain requests.get if you are making repeated requests to the same HOST.
Reason: a requests.Session reuses the underlying TCP connection, so repeated requests to the same host are significantly faster.
import requests
from bs4 import BeautifulSoup
import pandas as pd

def main(url):
    allin = []
    with requests.Session() as req:
        for page in range(1, 6):
            params = {
                'p': page,
                'demanda': 'true'
            }
            r = req.get(url, params=params)
            soup = BeautifulSoup(r.text, 'lxml')
            goal = [(x.find_next('h2').get_text(strip=True, separator=" "), x['href'])
                    for x in soup.select('.aspect-radio-box')]
            allin.extend(goal)
    df = pd.DataFrame(allin, columns=['Title', 'Url'])
    print(df)

main('https://www.nike.com.br/Snkrs/Feed')
Output:
Title Url
0 Dunk High x Fragment design Black https://www.nike.com.br/dunk-high-x-fragment-d...
1 Dunk Low Infantil (16-26) City Market https://www.nike.com.br/dunk-low-infantil-16-2...
2 ISPA Flow 2020 Desert Sand https://www.nike.com.br/ispa-flow-2020-153-169...
3 ISPA Flow 2020 Pure Platinum https://www.nike.com.br/ispa-flow-2020-153-169...
4 Nike iSPA Men's Lightweight Packable Jacket https://www.nike.com.br/nike-ispa-153-169-211-...
.. ... ...
115 Air Jordan 1 Mid Hyper Royal https://www.nike.com.br/air-jordan-1-mid-153-1...
116 Dunk High Orange Blaze https://www.nike.com.br/dunk-high-153-169-211-...
117 Air Jordan 5 Stealth https://www.nike.com.br/air-jordan-5-153-169-2...
118 Air Jordan 3 Midnight Navy https://www.nike.com.br/air-jordan-3-153-169-2...
119 Air Max 90 Bacon https://www.nike.com.br/air-max-90-153-169-211...
[120 rows x 2 columns]
To get the data you can send a request to:
https://www.nike.com.br/Snkrs/Estoque?p=<PAGE>&demanda=true
where you provide a page number between 1 and 5 to the p= parameter in the URL.
For example, to print the links, you can try:
import requests
from bs4 import BeautifulSoup
url = "https://www.nike.com.br/Snkrs/Estoque?p={page}&demanda=true"
for page in range(1, 6):
    response = requests.get(url.format(page=page))
    soup = BeautifulSoup(response.content, "html.parser")
    print(soup.find_all("a", class_="btn", text="Comprar"))

How to scrape hidden data elements using BeautifulSoup

Level2StockQuotes.com offers free real-time top of book quotes that I would like to capture in python using BeautifulSoup. The issue is even though I can see the actual data values in a browser inspector, I can't scrape these values into python.
BeautifulSoup returns all data rows with each data element blank. Pandas returns a dataframe with NaN for each data element.
import bs4 as bs
import urllib.request
import pandas as pd
symbol = 'AAPL'
url = 'https://markets.cboe.com/us/equities/market_statistics/book/'+ symbol + '/'
page = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(page,'lxml')
rows = soup.find_all('tr')
print(rows)
for tr in rows:
    td = tr.find_all('td')
    row = (i.text for i in td)
    print(row)
#using pandas to get dataframe
dfs = pd.read_html(url)
for df in dfs:
    print(df)
Can someone more experienced than I tell me how to pull this data?
Thanks!
The page is dynamic. You'll either need to use Selenium to simulate a browser and let the page render before grabbing the html, or you can get the data straight from the json XHR.
import requests
import pandas as pd
from pandas.io.json import json_normalize
url = 'https://markets.cboe.com/json/bzx/book/AAPL'
headers = {
    'Referer': 'https://markets.cboe.com/us/equities/market_statistics/book/AAPL/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest'}
jsonData = requests.get(url, headers=headers).json()
df_asks = pd.DataFrame(jsonData['data']['asks'], columns=['Shares','Price'] )
df_bids = pd.DataFrame(jsonData['data']['bids'], columns=['Shares','Price'] )
df_trades = pd.DataFrame(jsonData['data']['trades'], columns=['Time','Price','Shares','Time_ms'])
Output:
df_list = [df_asks, df_bids, df_trades]
for df in df_list:
print (df)
Shares Price
0 40 209.12
1 100 209.13
2 200 209.14
3 100 209.15
4 24 209.16
Shares Price
0 200 209.05
1 200 209.02
2 100 209.01
3 200 209.00
4 100 208.99
Time Price Shares Time_ms
0 10:45:57 300 209.0700 10:45:57.936000
1 10:45:57 300 209.0700 10:45:57.936000
2 10:45:55 29 209.1100 10:45:55.558000
3 10:45:52 45 209.0900 10:45:52.265000
4 10:45:52 50 209.0900 10:45:52.265000
5 10:45:52 5 209.0900 10:45:52.265000
6 10:45:51 100 209.1100 10:45:51.902000
7 10:45:48 100 209.1400 10:45:48.528000
8 10:45:48 100 209.1300 10:45:48.528000
9 10:45:48 200 209.1300 10:45:48.528000
