Extract the data-rich nodes of a webpage using BeautifulSoup in Python

I want to extract the data-rich nodes of a web page using BeautifulSoup in Python. Is there a way to count the frequency of the tags in the page?
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.in"
r = requests.get(url)
html = BeautifulSoup(r.content, "html.parser")
Now I want to count the frequency of all the tags in the parsed html.

Use a generator expression and collections.Counter to count the tags that are instances of bs4.element.Tag.
from collections import Counter
import requests
import bs4
from bs4 import BeautifulSoup
url = "http://www.amazon.in"
r = requests.get(url)
html = BeautifulSoup(r.content, "html.parser")
Counter(tag.name for tag in html.descendants if isinstance(tag, bs4.element.Tag))
Output
Counter({'div': 462, 'a': 448, 'span': 395, 'li': 288, 'br': 78, 'img': 60, 'td': 57, 'script': 48, 'ul': 39, 'option': 27, 'tr': 22, 'table': 17, 'meta': 13, 'map': 12, 'area': 12, 'link': 11, 'style': 10, 'p': 10, 'b': 9, 'h2': 7, 'strong': 5, 'input': 2, 'body': 1, 'title': 1, 'html': 1, 'header': 1, 'form': 1, 'head': 1, 'label': 1, 'select': 1})
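If you prefer find_all, note that passing True matches every tag, so the isinstance check can be dropped; a minimal equivalent sketch, reusing html and Counter from the snippet above:
Counter(tag.name for tag in html.find_all(True))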

Related

Python - BeautifulSoup - How to target the nth child and print the text

I'm trying to scrape the "Biggest Gainers" list of coins on https://coinmarketcap.com/
How do I access the nth child (Biggest Gainers) in the div class_ = 'sc-1rmt1nr-0 sc-1rmt1nr-2 iMyvIy'?
I managed to get the data from the "Trending" section but having trouble targeting the "Biggest Gainers" top 3 text items.
I get AttributeError: 'NoneType' object has no attribute 'p'
from bs4 import BeautifulSoup
import requests
source = requests.get('https://coinmarketcap.com/').text
soup = BeautifulSoup(source, 'lxml')
section = soup.find(class_='sc-1rmt1nr-0 sc-1rmt1nr-2 iMyvIy')
#List the top 3 Gainers
for top_gainers in section.find_all(class_='sc-16r8icm-0 sc-1uagfi2-0 bdEGog sc-1rmt1nr-1 eCWTbV')[1]:
    top_gainers = top_gainers.find(class_='sc-1eb5slv-0 iworPT')
    top_coins = top_gainers.p.text
    print(top_coins)
I would avoid those dynamic classes and instead use :-soup-contains() and combinators: first locate the desired block via its text, then use the combinators to specify the relationship of the final elements to extract info from.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
soup = bs(requests.get("https://coinmarketcap.com/").text, "lxml")
biggest_gainers = []
for i in soup.select(
    'div[color=text]:has(span:-soup-contains("Biggest Gainers")) > div ~ div'
):
    biggest_gainers.append(
        {
            "rank": int(i.select_one(".rank").text),
            "currency": i.select_one(".alias").text,
            "% change": f"{i.select_one('.icon-Caret-up').next_sibling}",
        }
    )
gainers = pd.DataFrame(biggest_gainers)
gainers
As mentioned by @QHarr, you should avoid dynamic identifiers. Similar to his approach, the selection comes via :-soup-contains() and the known text of the element:
soup.select('div:has(>div>span:-soup-contains("Biggest Gainers")) ~ div')
To extract the texts, I used stripped_strings and zipped them with the keys into a dict:
dict(zip(['rank','name','alias','change'],e.stripped_strings))
Example
from bs4 import BeautifulSoup
import requests
url = 'https://coinmarketcap.com/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = []
for e in soup.select('div:has(>div>span:-soup-contains("Biggest Gainers")) ~ div'):
    data.append(dict(zip(['rank','name','alias','change'],e.stripped_strings)))
Output
[{'rank': '1', 'name': 'Tenset', 'alias': '10SET', 'change': '1406.99'},
{'rank': '2', 'name': 'Burn To Earn', 'alias': 'BTE', 'change': '348.89'},
{'rank': '3', 'name': 'MetaCars', 'alias': 'MTC', 'change': '332.05'}]
You can use :nth-of-type to locate the "Biggest Gainers" parent div:
import requests
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://coinmarketcap.com/').text, 'html.parser')
bg = d.select_one('div:nth-of-type(2).sc-16r8icm-0.sc-1uagfi2-0.bdEGog.sc-1rmt1nr-1.eCWTbV')
data = [{'rank': i.select_one('span.rank').text,
         'name': i.select_one('p.sc-1eb5slv-0.iworPT').text,
         'change': i.select_one('span.sc-27sy12-0.gLZJFn').text}
        for i in bg.select('div.sc-1rmt1nr-0.sc-1rmt1nr-4.eQRTPY')]
Output:
[{'rank': '1', 'name': 'Tenset', 'change': '1308.72%'}, {'rank': '2', 'name': 'Burn To Earn', 'change': '421.82%'}, {'rank': '3', 'name': 'Aigang', 'change': '329.63%'}]

BeautifulSoup to get data from a nested table

I am trying to pull data from a nested table in HTML. I can get BeautifulSoup to find the other divs, but can't get it to see the table.
This is what I've got so far:
import requests
from bs4 import BeautifulSoup
url = ''
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
d = soup.find('td', class_='DT_Text DT_R_Align sorting_1')
I've tried numerous variations. Tried going after div classes, div ids, td classes. Nothing has worked.
Is there any way to get this data?
Thanks
I just checked the Network tab of Firefox's Dev Tools while loading the page and found that there's a GET request to https://s3.amazonaws.com/datafusion.web.generac.can/Data/Generac_MTN_co.json which contains the data inside the table. Your code can be modified to the following:
import requests
import json
response = requests.get("https://s3.amazonaws.com/datafusion.web.generac.can/Data/Generac_MTN_co.json")
data = json.loads(response.content)
print(data)
# {'currentdatetime': '1/29/2022 12:17:00 AM', 'outage': [{'Name': 'Alberta', 'OUT': 0, 'SRV': 0, 'SFIPS': '60'}, {'Name': 'British Columbia', 'OUT': 24, 'SRV': 0, 'SFIPS': '61'}, {'Name': 'Manitoba', 'OUT': 75, 'SRV': 0, 'SFIPS': '62'}, {'Name': 'New Brunswick', 'OUT': 283, 'SRV': 0, 'SFIPS': '63'}, {'Name': 'Nova Scotia', 'OUT': 12, 'SRV': 0, 'SFIPS': '65'}, {'Name': 'Ontario', 'OUT': 417, 'SRV': 0, 'SFIPS': '66'}, {'Name': 'Quebec', 'OUT': 342, 'SRV': 0, 'SFIPS': '68'}, {'Name': 'Saskatchewan', 'OUT': 0, 'SRV': 0, 'SFIPS': '69'}, {'Name': 'Newfoundland', 'OUT': 0, 'SRV': 0, 'SFIPS': '64'}, {'Name': 'Prince Edward Island', 'OUT': 2, 'SRV': 0, 'SFIPS': '67'}]}
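If you want that JSON as a table, the outage list shown above maps straight onto rows; a minimal sketch with pandas, reusing data from above (assuming the endpoint keeps returning the same structure):
import pandas as pd
# each outage record carries Name/OUT/SRV/SFIPS keys, so the list converts directly
df = pd.DataFrame(data['outage'])
print(df[['Name', 'OUT']])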

Data scrape Python BeautifulSoup code does not loop

I am trying to scrape data. Somehow the loop doesn't work correctly; it loops just once. I want to scrape all the names of the goods and the prices.
The goods are inside "td", e.g. "Sendok Semen 7 Bulat", and the prices are inside "div", e.g. "8.500".
Here is my code :
import requests
from bs4 import BeautifulSoup
url = 'https://www.ralali.com/search/semen'
res = requests.get(url)
html = BeautifulSoup(res.content,"html.parser")
#divs = html.find_all('div', class_ = "col-md-12 col-xs-12")
divs = html.findAll('div', class_ = "row d-block")
cnt = 0
for div in divs:
    cnt += 1
    #print(div, end="\n"*2)
    price = div.find('span', class_ = 'float-right')
    print(price.text.strip())
print(cnt)
Any help will be appreciated.
Thanks
What happens?
Somehow the loop doesn't work correctly. It loops just once.
It is not the loop that is misbehaving; it is rather the way you are selecting things: html.findAll('div', class_ = "row d-block") finds only one <div> that matches your criteria.
How to fix?
Make your selection more specific: what you really want to iterate over are the <tr> elements in the table. I often use CSS selectors, and the following will get the correct selection, so just replace your html.findAll('div', class_ = "row d-block"). Note: in new code use find_all() instead of findAll(); it is the newer syntax:
html.select('.d-block tbody tr')
Example
Will give you a well structured list of dicts:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ralali.com/search/semen'
res = requests.get(url)
html = BeautifulSoup(res.content,"html.parser")
data = []
for row in html.select('.d-block tbody tr'):
    data.append(
        dict(
            zip(['pos','name','currency','price'],list(row.stripped_strings))
        )
    )
data
Output
[{'pos': '1',
'name': 'Sendok Semen 7 Bulat',
'currency': 'Rp',
'price': '8.500'},
{'pos': '2',
'name': 'Sendok Semen 8 Bulat Gagang Kayu',
'currency': 'Rp',
'price': '10.000'},
{'pos': '3', 'name': 'SEMEN', 'currency': 'Rp', 'price': '10.000'},
{'pos': '4',
'name': 'Sendok Semen 8 Gagang Kayu SWARDFISH',
'currency': 'Rp',
'price': '10.000'},...]
But Be Aware
It will only get you the "Top 10 - List Of Popular Semen Prices In Ralali", not all goods and prices on the page; that is something you should clarify in your question.
Getting more data from all products
Option#1
Use the API provided by the website and iterate over its page parameter:
import requests
url = 'https://rarasearch.ralali.com/v2/search/item?q=semen'
res = requests.get(url)
data = []
# the endpoint returns 20 items per page, so derive the page count from total_item
for p in range(1, round(res.json()['total_item']/20)):
    url = f'https://rarasearch.ralali.com/v2/search/item?q=semen&p={p}'
    res = requests.get(url)
    data.extend(res.json()['items'])
print(data)
Output:
[{'id': 114797,
'name': 'TIGA RODA Semen NON semen putih',
'image': 'assets/img/Libraries/114797_TIGA_RODA_Semen_NON_semen_putih_1_UrwztohXHo9u1yRY_1625473149.png',
'alias': 'tiga-roda-semen-non-semen-putih-157561001',
'vendor_id': 21156,
'vendor_alias': 'prokonstruksi',
'rating': '5.00',
'vendor_status': 'A',
'vendor_name': 'Pro Konstruksi',
'vendor_location': 'Palembang',
'price': '101500.00',
'discount': 0,
'discount_percentage': 0,
'free_ongkir_lokal': 0,
'free_ongkir_nusantara': 1,
'is_stock_available': 1,
'minimum_order': 1,
'maximum_order': 999999999,
'unit_type': 'unit',
'ss_type': 0,
'is_open': 'Y',
'wholesale_price': []},
{'id': 268711,
'name': 'Sendok Semen Ukuran 6',
'image': 'assets/img/Libraries/268711_Sendok-Semen-Ukuran-6_HCLcQq6TUh5IiEPZ_1553521818.jpeg',
'alias': 'Sendok-Semen-Ukuran-6',
'vendor_id': 305459,
'vendor_alias': 'distributorbangunan',
'rating': None,
'vendor_status': 'A',
'vendor_name': 'Distributor Bangunan',
'vendor_location': 'Bandung',
'price': '11000.00',
'discount': 0,
'discount_percentage': 0,
'free_ongkir_lokal': 0,
'free_ongkir_nusantara': 0,
'is_stock_available': 1,
'minimum_order': 1,
'maximum_order': 999999999,
'unit_type': 'Unit',
'ss_type': 0,
'is_open': 'Y',
'wholesale_price': []},...]
Option#2
Use Selenium: scroll to the bottom of the page to load all products, push the driver.page_source into your soup, and start selecting, as sketched below.
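A minimal sketch of that idea (the scroll loop and the fixed sleep are assumptions; the page may need different timing):
import time
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('https://www.ralali.com/search/semen')
# keep scrolling until the page height stops growing, i.e. no more products load
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # crude wait for lazy-loaded content; tune as needed
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
html = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()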

HTML request for Biwenger in Python

I'm trying to scrape the data from Biwenger with an HTML request, but the response returns different data than if the URL is opened in Chrome.
Here is my code:
import requests
shots_url = "https://biwenger.as.com/user/naranjas-4537694"
response = requests.get(shots_url)
response.raise_for_status() # raise exception if invalid response
print(response.text)
I don't get any error; however, the request returns different data than what the URL shows in the browser, plus this message:
<!doctype html><meta charset=utf-8><title>Biwenger</title><base href=/ ><meta...<div class=body><p>Looks like the browser you're using is not compatible with Biwenger :(<p>We recommend using <a href=http://www.google.com/chrome/ target=_blank>Google Chrome</a>...</script>
Any idea what code I can use to get the right data?
If you require any more information please let me know. Thank you everyone.
The data is loaded dynamically via JavaScript/JSON. When you open the Firefox/Chrome developer tools' Network tab, you will see where the page is making requests.
This example will get the information about user players:
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup
user_data_url = 'https://biwenger.as.com/api/v2/user/4537694?fields=*,account(id),players(id,owner),lineups(round,points,count,position),league(id,name,competition,mode,scoreID),market,seasons,offers,lastPositions'
all_data_url = 'https://cf.biwenger.com/api/v2/competitions/la-liga/data?lang=en&score=1&callback=jsonp_xxx' # <--- check #αԋɱҽԃ αмєяιcαη answer, it's possible to do it without callback= parameter
response = requests.get(all_data_url)
data = json.loads( re.findall(r'jsonp_xxx\((.*)\)', response.text)[0] )
user_data = requests.get(user_data_url).json()
# pprint(user_data) # <-- uncomment this to see user data
# pprint(data) # <-- uncomment this to see data about all players
for p in user_data['data']['players']:
    pprint(data['data']['players'][str(p['id'])])
    print('-' * 80)
Prints:
{'fantasyPrice': 22000000,
'fitness': [10, 2, 2, 2, -2],
'id': 599,
'name': 'Pedro León',
'playedAway': 8,
'playedHome': 8,
'points': 38,
'pointsAway': 16,
'pointsHome': 22,
'pointsLastSeason': 16,
'position': 3,
'price': 1400000,
'priceIncrement': 60000,
'slug': 'pedro-leon',
'status': 'ok',
'teamID': 76}
--------------------------------------------------------------------------------
{'fantasyPrice': 9000000,
'fitness': [None, 'injured', 'doubt', None, 2],
'id': 1093,
'name': 'Javi López',
'playedAway': 4,
'playedHome': 2,
'points': 10,
'pointsAway': 6,
'pointsHome': 4,
'pointsLastSeason': 77,
'position': 2,
'price': 210000,
'priceIncrement': 0,
'slug': 'javier-lopez',
'status': 'ok',
'teamID': 7}
--------------------------------------------------------------------------------
... and so on.
import requests
import csv
r = requests.get(
    "https://cf.biwenger.com/api/v2/competitions/la-liga/data?lang=en&score=1").json()
data = []
for k, v in r['data']['players'].items():
    data.append(v.values())
with open('output.csv', 'w', newline="", encoding="UTF-8") as f:
    writer = csv.writer(f)
    writer.writerow(v.keys())  # header row from the last player dict; all players share the same keys
    writer.writerows(data)
Output: output.csv with one row per player.
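If you'd rather skip the manual csv handling, pandas can flatten the same players dict; a sketch reusing r from above (column order may differ from the csv version):
import pandas as pd
# r['data']['players'] maps player id -> player dict, so build columns per id and transpose
pd.DataFrame(r['data']['players']).T.to_csv('output.csv', index=False)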

Scraping a website that uses JavaScript

I'm trying to scrape this page: http://stats.nba.com/playerGameLogs.html?PlayerID=2544&pageNo=1&rowsPerPage=100
I want to get the table into a pandas DataFrame. I've tried BeautifulSoup, and it's obvious that won't work. I tried to use selenium, but I'm not having luck with it: it at least opens the browser and shows the correct output, but Firefox force-closes afterwards. I'm hoping someone has a better solution before I continue down the selenium path. I'd also prefer not to have to physically open a browser, as I would be doing this for thousands of pages.
There is no need to scrape the HTML or to use a high-level selenium approach.
Simulate the underlying XHR request(s) that go to the server and return the JSON data used to fill the table on the page.
Here's an example using requests:
import requests
url = 'http://stats.nba.com/stats/playergamelog'
params = {
    'Season': '2013-14',
    'SeasonType': 'Regular Season',
    'LeagueID': '00',
    'PlayerID': '2544',
    'pageNo': '1',
    'rowsPerPage': '100'
}
response = requests.post(url, data=params)
print(response.json())
Prints the JSON structure containing the player game logs:
{u'parameters': {u'LeagueID': u'00',
u'PlayerID': 2544,
u'Season': u'2013-14',
u'SeasonType': u'Regular Season'},
u'resource': u'playergamelog',
u'resultSets': [{u'headers': [u'SEASON_ID',
u'Player_ID',
u'Game_ID',
u'GAME_DATE',
u'MATCHUP',
u'WL',
u'MIN',
u'FGM',
u'FGA',
u'FG_PCT',
u'FG3M',
u'FG3A',
u'FG3_PCT',
u'FTM',
u'FTA',
u'FT_PCT',
u'OREB',
u'DREB',
u'REB',
u'AST',
u'STL',
u'BLK',
u'TOV',
u'PF',
u'PTS',
u'PLUS_MINUS',
u'VIDEO_AVAILABLE'],
u'name': u'PlayerGameLog',
u'rowSet': [[u'22013',
2544,
u'0021301192',
u'APR 12, 2014',
u'MIA # ATL',
u'L',
37,
10,
22,
0.455,
3,
7,
0.429,
4,
8,
0.5,
3,
5,
8,
5,
0,
1,
3,
2,
27,
-13,
1],
[u'22013',
2544,
u'0021301180',
u'APR 11, 2014',
u'MIA vs. IND',
u'W',
35,
11,
20,
0.55,
2,
4,
0.5,
12,
13,
0.923,
1,
5,
6,
1,
1,
1,
2,
1,
36,
13,
1],
[u'22013',
2544,
u'0021301167',
u'APR 09, 2014',
u'MIA # MEM',
u'L',
41,
14,
23,
0.609,
3,
5,
0.6,
6,
7,
0.857,
1,
5,
6,
5,
2,
0,
5,
1,
37,
-8,
1],
...
}
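Since the goal was a pandas DataFrame, note that each entry in resultSets pairs headers with rowSet, which maps straight onto one; a minimal sketch building on the response above (assuming the endpoint still answers as shown):
import requests
import pandas as pd
url = 'http://stats.nba.com/stats/playergamelog'
params = {
    'Season': '2013-14',
    'SeasonType': 'Regular Season',
    'LeagueID': '00',
    'PlayerID': '2544',
    'pageNo': '1',
    'rowsPerPage': '100'
}
result = requests.post(url, data=params).json()['resultSets'][0]
# 'headers' holds the column names and 'rowSet' the rows, as in the JSON above
df = pd.DataFrame(result['rowSet'], columns=result['headers'])
print(df[['GAME_DATE', 'MATCHUP', 'PTS']].head())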
An alternative solution would be to use an NBA API; see several options here:
https://stackoverflow.com/questions/57106/anyone-know-of-an-nfl-or-nba-api
