I'm trying to scrape data from Biwenger with an HTML request, but the response returns different data than what I see when the URL is opened in Chrome.
Here is my code:
import requests
shots_url = "https://biwenger.as.com/user/naranjas-4537694"
response = requests.get(shots_url)
response.raise_for_status() # raise exception if invalid response
print(response.text)
I don't get any error, but the response contains different data than the page shows in the browser, along with this message:
<!doctype html><meta charset=utf-8><title>Biwenger</title><base href=/ ><meta...<div class=body><p>Looks like the browser you're using is not compatible with Biwenger :(<p>We recommend using <a href=http://www.google.com/chrome/ target=_blank>Google Chrome</a>...</script>
Any idea what code I can use to get the right data?
If you require any more information please let me know. Thank you everyone.
The data is loaded dynamically via JavaScript/JSON. When you open the Firefox/Chrome developer tools (Network tab), you will see where the page is making its requests.
This example will get the information about user players:
import re
import json
import requests
from pprint import pprint

user_data_url = 'https://biwenger.as.com/api/v2/user/4537694?fields=*,account(id),players(id,owner),lineups(round,points,count,position),league(id,name,competition,mode,scoreID),market,seasons,offers,lastPositions'
all_data_url = 'https://cf.biwenger.com/api/v2/competitions/la-liga/data?lang=en&score=1&callback=jsonp_xxx' # <--- check αԋɱҽԃ αмєяιcαη's answer, it's possible to do it without the callback= parameter

response = requests.get(all_data_url)
# strip the JSONP wrapper jsonp_xxx(...) to get plain JSON
data = json.loads(re.findall(r'jsonp_xxx\((.*)\)', response.text)[0])
user_data = requests.get(user_data_url).json()

# pprint(user_data) # <-- uncomment this to see user data
# pprint(data)      # <-- uncomment this to see data about all players

for p in user_data['data']['players']:
    pprint(data['data']['players'][str(p['id'])])
    print('-' * 80)
Prints:
{'fantasyPrice': 22000000,
'fitness': [10, 2, 2, 2, -2],
'id': 599,
'name': 'Pedro León',
'playedAway': 8,
'playedHome': 8,
'points': 38,
'pointsAway': 16,
'pointsHome': 22,
'pointsLastSeason': 16,
'position': 3,
'price': 1400000,
'priceIncrement': 60000,
'slug': 'pedro-leon',
'status': 'ok',
'teamID': 76}
--------------------------------------------------------------------------------
{'fantasyPrice': 9000000,
'fitness': [None, 'injured', 'doubt', None, 2],
'id': 1093,
'name': 'Javi López',
'playedAway': 4,
'playedHome': 2,
'points': 10,
'pointsAway': 6,
'pointsHome': 4,
'pointsLastSeason': 77,
'position': 2,
'price': 210000,
'priceIncrement': 0,
'slug': 'javier-lopez',
'status': 'ok',
'teamID': 7}
--------------------------------------------------------------------------------
... and so on.
import requests
import csv

r = requests.get(
    "https://cf.biwenger.com/api/v2/competitions/la-liga/data?lang=en&score=1").json()

data = []
for k, v in r['data']['players'].items():
    data.append(v.values())

with open('output.csv', 'w', newline="", encoding="UTF-8") as f:
    writer = csv.writer(f)
    # v still holds the last player dict from the loop above; assuming all
    # players share the same keys, they serve as the header row
    writer.writerow(v.keys())
    writer.writerows(data)
Output: a CSV file (output.csv) with one row per player.
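A slightly safer variant (a sketch, assuming every player dict shares the same keys, which the snippet above also relies on) uses csv.DictWriter, so the header no longer depends on the loop's last value:
import csv
import requests

r = requests.get("https://cf.biwenger.com/api/v2/competitions/la-liga/data?lang=en&score=1").json()
players = list(r['data']['players'].values())

with open('output.csv', 'w', newline='', encoding='UTF-8') as f:
    writer = csv.DictWriter(f, fieldnames=list(players[0].keys()))
    writer.writeheader()
    writer.writerows(players)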
I am trying to scrape data, but somehow the loop doesn't work correctly: it loops just once. I want to scrape all the names of the goods and their prices.
The names are inside "td" elements, e.g. "Sendok Semen 7 Bulat", and the prices are inside "div" elements, e.g. "8.500".
Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ralali.com/search/semen'
res = requests.get(url)
html = BeautifulSoup(res.content,"html.parser")
#divs = html.find_all('div', class_ = "col-md-12 col-xs-12")
divs = html.findAll('div', class_ = "row d-block")
cnt = 0
for div in divs:
    cnt += 1
    #print(div, end="\n"*2)
    price = div.find('span', class_ = 'float-right')
    print(price.text.strip())
print(cnt)
Any help will be appreciated.
Thanks
What happens?
Somehow the loop doesn't work correctly. It loops just once.
It is not the loop that isn't working correctly; it is rather the way you are selecting things. html.findAll('div', class_ = "row d-block") will find only one <div> that matches your criteria, so there is only one element to loop over.
How to fix?
Make your selection more specific, because what you really want to iterate over are the <tr> elements in the table. I often use CSS selectors, and the following will get the correct selection, so just replace your html.findAll('div', class_ = "row d-block") with it. Note: in new code, use find_all() instead of findAll(); it is the newer syntax:
html.select('.d-block tbody tr')
Example
Will give you a well structured list of dicts:
import requests
from bs4 import BeautifulSoup
url = 'https://www.ralali.com/search/semen'
res = requests.get(url)
html = BeautifulSoup(res.content,"html.parser")
data = []
for row in html.select('.d-block tbody tr'):
    data.append(
        dict(
            zip(['pos','name','currency','price'], list(row.stripped_strings))
        )
    )
print(data)
Output
[{'pos': '1',
'name': 'Sendok Semen 7 Bulat',
'currency': 'Rp',
'price': '8.500'},
{'pos': '2',
'name': 'Sendok Semen 8 Bulat Gagang Kayu',
'currency': 'Rp',
'price': '10.000'},
{'pos': '3', 'name': 'SEMEN', 'currency': 'Rp', 'price': '10.000'},
{'pos': '4',
'name': 'Sendok Semen 8 Gagang Kayu SWARDFISH',
'currency': 'Rp',
'price': '10.000'},...]
But Be Aware
It will just get you the "Top 10 - List Of Popular Semen Prices In Ralali", not all goods and prices on the page --> that is something you should clarify in your question.
Getting more data from all products
Option#1
Use the API provided by the website and iterate over its page parameter:
import math
import requests

url = 'https://rarasearch.ralali.com/v2/search/item?q=semen'
res = requests.get(url)

data = []
# 20 items per page; ceil (rather than round) so the final partial page is not skipped
for p in range(1, math.ceil(res.json()['total_item'] / 20) + 1):
    url = f'https://rarasearch.ralali.com/v2/search/item?q=semen&p={p}'
    res = requests.get(url)
    data.extend(res.json()['items'])
print(data)
Output:
[{'id': 114797,
'name': 'TIGA RODA Semen NON semen putih',
'image': 'assets/img/Libraries/114797_TIGA_RODA_Semen_NON_semen_putih_1_UrwztohXHo9u1yRY_1625473149.png',
'alias': 'tiga-roda-semen-non-semen-putih-157561001',
'vendor_id': 21156,
'vendor_alias': 'prokonstruksi',
'rating': '5.00',
'vendor_status': 'A',
'vendor_name': 'Pro Konstruksi',
'vendor_location': 'Palembang',
'price': '101500.00',
'discount': 0,
'discount_percentage': 0,
'free_ongkir_lokal': 0,
'free_ongkir_nusantara': 1,
'is_stock_available': 1,
'minimum_order': 1,
'maximum_order': 999999999,
'unit_type': 'unit',
'ss_type': 0,
'is_open': 'Y',
'wholesale_price': []},
{'id': 268711,
'name': 'Sendok Semen Ukuran 6',
'image': 'assets/img/Libraries/268711_Sendok-Semen-Ukuran-6_HCLcQq6TUh5IiEPZ_1553521818.jpeg',
'alias': 'Sendok-Semen-Ukuran-6',
'vendor_id': 305459,
'vendor_alias': 'distributorbangunan',
'rating': None,
'vendor_status': 'A',
'vendor_name': 'Distributor Bangunan',
'vendor_location': 'Bandung',
'price': '11000.00',
'discount': 0,
'discount_percentage': 0,
'free_ongkir_lokal': 0,
'free_ongkir_nusantara': 0,
'is_stock_available': 1,
'minimum_order': 1,
'maximum_order': 999999999,
'unit_type': 'Unit',
'ss_type': 0,
'is_open': 'Y',
'wholesale_price': []},...]
Option#2
Use selenium, scroll to the bottom of the page to load all products, push the driver.page_source to your soup and start selecting, ... (a sketch of this follows below).
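A minimal sketch of that selenium approach, assuming Chrome and the selenium package are installed (the 2-second pause is an assumption about how quickly new products load; the loop stops once the page height no longer grows):
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.ralali.com/search/semen')

# keep scrolling until the page height stops growing
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)  # give the page time to load more products
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

html = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# then select as before, e.g. html.select('.d-block tbody tr')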
From this Tag:
<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000">Sat 13 Aug 2011</div>
I want to extract the "Sat 13 Aug 2011" using BeautifulSoup (bs4).
My current Code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.premierleague.com/match/7468'
j = requests.get(url)
soup = BeautifulSoup(j.content, "lxml")
containedDateTag_string = soup.find_all('div', class_="matchDate renderMatchDateContainer")
print (containedDateTag_string)
When I run it, the printed output does not contain "Sat 13 Aug 2011"; it is simply stored and printed as:
[<div class="matchDate renderMatchDateContainer" data-kickoff="1313244000000"></div>]
Is there a way that I can get this string to be displayed? I have also tried parsing further into the tag with .next_sibling and .text, but both display "[]" rather than the desired string, which is why I reverted to trying just the 'div' to see if I could at least get the text to display.
Scraping the content via .page_source with selenium/ChromeDriver is the way to go here, since the date text is generated by JavaScript:
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.premierleague.com/match/7468"
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
Then you can do your .find the way you were doing:
>>> soup.find('div', {'class':"matchDate renderMatchDateContainer"}).text
'Sat 13 Aug 2011'
A batteries-included solution with selenium itself (note that find_element_by_css_selector was removed in Selenium 4; the By locator below works in current versions):
>>> from selenium.webdriver.common.by import By
>>> driver.find_element(By.CSS_SELECTOR, "div.matchDate.renderMatchDateContainer").text
'Sat 13 Aug 2011'
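Incidentally, the data-kickoff attribute is present even in the plain requests response (it shows up in the question's printed output) and holds the kickoff time in epoch milliseconds, so a small sketch can recover the date without a browser at all:
from datetime import datetime, timezone

millis = 1313244000000  # value of data-kickoff from the question's tag
print(datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime('%a %d %b %Y'))
# Sat 13 Aug 2011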
Without selenium - using requests and the site's own API instead - it would look something like this (sure, you could grab a bunch of other data about each game, but here's just the code for the date part):
import requests
from time import sleep
def scraper(match_id):
    headers = {
        "Origin": "https://www.premierleague.com",
        "Referer": "https://www.premierleague.com/match/%d" % match_id
    }
    api_endpoint = "https://footballapi.pulselive.com/football/broadcasting-schedule/fixtures/%d" % match_id
    r = requests.get(api_endpoint, headers=headers)
    if r.status_code != 200:
        return None
    else:
        data = r.json()
        # this will return something like this:
        # {'broadcasters': [],
        #  'fixture': {'attendance': 25700,
        #              'clock': {'label': "90 +4'00", 'secs': 5640},
        #              'gameweek': {'gameweek': 1, 'id': 744},
        #              'ground': {'city': 'London', 'id': 16, 'name': 'Craven Cottage'},
        #              'id': 7468,
        #              'kickoff': {'completeness': 3,
        #                          'gmtOffset': 1.0,
        #                          'label': 'Sat 13 Aug 2011, 15:00 BST',
        #                          'millis': 1313244000000},
        #              'neutralGround': False,
        #              'outcome': 'D',
        #              'phase': 'F',
        #              'replay': False,
        #              'status': 'C',
        #              'teams': [{'score': 0,
        #                         'team': {'club': {'abbr': 'FUL',
        #                                           'id': 34,
        #                                           'name': 'Fulham'},
        #                                  'id': 34,
        #                                  'name': 'Fulham',
        #                                  'shortName': 'Fulham',
        #                                  'teamType': 'FIRST'}},
        #                        {'score': 0,
        #                         'team': {'club': {'abbr': 'AVL',
        #                                           'id': 2,
        #                                           'name': 'Aston Villa'},
        #                                  'id': 2,
        #                                  'name': 'Aston Villa',
        #                                  'shortName': 'Aston Villa',
        #                                  'teamType': 'FIRST'}}]}}
        return data

match_id = 7468
json_blob = scraper(match_id)
if json_blob is not None:
    date = json_blob['fixture']['kickoff']['label']
    print(date)
You need the header with those two parameters to get the data.
So if you had a bunch of match_ids, you could just loop through them with this function:
for match_id in range(7000, 8000, 1):
    json_blob = scraper(match_id)
    if json_blob is not None:
        date = json_blob['fixture']['kickoff']['label']
        print(date)
    sleep(1)
I'm doing some scraping and looking at pages like this one (https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897), but I have not been able to fully retrieve the JSON content. I have tried using both of the following sets of code, but each returns an incomplete JSON object:
url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s'%(string_use_user, string_use_workout)
print(url)
response = urlopen(url)
try:
    reader = codecs.getreader("utf-8")
    print(reader(response))
    jsonresponse = json.load(reader(response))
    print(jsonresponse)
and similarly, using the requests library instead of urllib also fails to retrieve the full JSON:
url = 'https://www.endomondo.com/rest/v1/users/%s/workouts/%s'%(string_use_user, string_use_workout)
print("using this url %s"%url)
r = requests.get(url)
try:
    print(r.json())
    jsonresponse = r.json()  # json.loads(response.read())
In both cases I get about 1/4 of the JSON. For example, in this case:
https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897
I received:
{'feed_id': 281475471235835, 'id': 526622897, 'duration': 4082.0, 'local_start_time': '2015-05-21T09:30:45.000+02:00', 'calories': 1073.0, 'tagged_users': [], 'altitude_max': 69.9523, 'sport': 0, 'distance': 11.115419387817383, 'altitude_min': 14.9908, 'include_in_stats': True, 'hydration': 0.545339, 'start_time': '2015-05-21T07:30:45.000Z', 'ascent': 137.162, 'is_live': False, 'pb_count': 2, 'playlist': [], 'is_peptalk_allowed': False,
'weather': {'wind_speed': 11, 'temperature': 12, 'wind_direction': 13, 'type': 3, 'humidity': 81}, 'speed_max': 24.8596,
'author': {'name': 'gfdgfd', 'id': 20261627, 'last_name': 'gdsgsk', 'gender': 0, 'expand': 'abs', 'picture': {'url': 'https://www.endomondo.com/resources/gfx/picture/18511427/thumbnail.jpg'}, 'first_name': 'gdsgds', 'viewer_friendship': 1, 'is_premium': False},
'sharing': [{'share_time': '2015-05-21T08:45:19.000Z', 'type': 0, 'share_id': 1635690786663532}],
'show_map': 0, 'pictures': [], 'hashtags': [], 'descent': 150.621, 'speed_avg': 9.80291763746756, 'expand': 'full', 'show_workout': 0, 'points': {'expand': 'ref', 'id': 2199549878449}}
I am not receiving the long arrays within the data. I am also not even recovering all of the non-array data.
I ran the original page through a JSON validator, and it's fine. Similarly, I ran the JSON I do receive through a validator, and it's also fine - it doesn't show any signs of missing things unless I compare with the original.
I would appreciate any advice about how to troubleshoot this. Thanks.
Looks like this API is doing some User-Agent sniffing and only sending the complete content for what it considers to be actual web browsers.
Once you set a User-Agent header with the UA string of a common browser, you get the full response:
>>> UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'
>>> url = 'https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897'
>>> r = requests.get(url, headers={'User-Agent': UA})
>>>
>>> print(len(r.content))
96412
See the requests docs for more details on setting custom headers.
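If you are making many requests against the same API, a requests.Session keeps the header in one place (a small sketch reusing the workout URL from above):
import requests

UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:36.0) Gecko/20100101 Firefox/36.0'

session = requests.Session()
session.headers.update({'User-Agent': UA})  # sent with every request made through this session

r = session.get('https://www.endomondo.com/rest/v1/users/20261627/workouts/526622897')
print(len(r.content))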
Extract the data-rich nodes of a web page using BeautifulSoup in Python. Is there a way to count the frequency of the tags in the page?
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.in"
r = requests.get(url)
html = BeautifulSoup(r.content)
Now I want to count the frequency of all the tags in html obtained.
Use a generator expression with collections.Counter to count tags, i.e. objects that are instances of bs4.element.Tag.
from collections import Counter
import requests
import bs4
from bs4 import BeautifulSoup

url = "http://www.amazon.in"
r = requests.get(url)
html = BeautifulSoup(r.content, "html.parser")
Counter(tag.name for tag in html.descendants if isinstance(tag, bs4.element.Tag))
output
Counter({'div': 462, 'a': 448, 'span': 395, 'li': 288, 'br': 78, 'img': 60, 'td': 57, 'script': 48, 'ul': 39, 'option': 27, 'tr': 22, 'table': 17, 'meta': 13, 'map': 12, 'area': 12, 'link': 11, 'style': 10, 'p': 10, 'b': 9, 'h2': 7, 'strong': 5, 'input': 2, 'body': 1, 'title': 1, 'html': 1, 'header': 1, 'form': 1, 'head': 1, 'label': 1, 'select': 1})
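If you only need the most frequent tags, Counter also provides most_common() (the numbers below simply mirror the output above):
counts = Counter(tag.name for tag in html.descendants if isinstance(tag, bs4.element.Tag))  # html from the snippet above
print(counts.most_common(3))  # [('div', 462), ('a', 448), ('span', 395)]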
I am working with the requests module in Python to grab certain fields from a JSON response.
import requests

response = requests.get('http://api.appannie.com/v1/accounts/1000/apps/mysuperapp/sales?break_down=application+iap&start_date=2013-10-01&end_date=2013-10-02',
                        auth=('username', 'password'))
data = response.json()
print(data)
This works in python, as the response is the following:
{'prev_page': None, 'currency': 'USD', 'next_page': None, 'sales_list': [{'revenue': {'ad': '0.00', 'iap': {'refunds': '0.00', 'sales': '0.00', 'promotions': '0.00'}, 'app': {'refunds': '0.00', 'updates': '0.00', 'downloads': '0.00', 'promotions': '0.00'}},
'units': {'iap': {'refunds': 0, 'sales': 0, 'promotions': 0}, 'app': {'refunds': 0, 'updates': 0, 'downloads': 2000, 'promotions': 0}}, 'country': 'all', 'date': 'all'}], 'iap_sales': [], 'page_num': 1, 'code': 200, 'page_index': 0}
The question is: how do I parse this to get the downloads number within the 'app' block, namely the 2000 value?
After response.json() is called, data is already a dictionary; otherwise response.json() would have raised an exception. Therefore you can access it just like any other dictionary - there is no need to (and you cannot) call json.loads() on it again, since json.loads() expects a string:
import requests

response = requests.get('http://api.appannie.com/v1/accounts/1000/apps/mysuperapp/sales?break_down=application+iap&start_date=2013-10-01&end_date=2013-10-02',
                        auth=('username', 'password'))
data = response.json()  # data is a dictionary now
sales_list = data.get('sales_list')
for sales in sales_list:
    print(sales['revenue']['app'])
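To get the specific downloads figure the question asks about: in the sample response, the 2000 sits under 'units', not 'revenue', so (continuing with data from the snippet above):
print(data['sales_list'][0]['units']['app']['downloads'])  # -> 2000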
You can use json.loads:
import json
import requests
response = requests.get(...)
json_data = json.loads(response.text)
This converts a given string into a dictionary which allows you to access your JSON data easily within your code.