I am trying to scrape the content of this site: http://www.whoscored.com/Matches/824609/Live . If I view the HTML in Chrome under the Network tab, I can see everything I want to scrape. But if I run the script below, a lot of data is missing from the result: JSON data describing every specific event during the game. Why is the result different when I inspect the content in Chrome and when I scrape the site?
import requests
from bs4 import BeautifulSoup

url = 'http://www.whoscored.com/Matches/824609/Live'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
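Most likely the match events are fetched by JavaScript after the initial page load, so the raw HTML that requests receives never contains them. A quick way to confirm is to search the response text for a value you saw in Chrome's Network tab; the marker string below is an illustrative guess, not a confirmed variable name:

import requests

url = 'http://www.whoscored.com/Matches/824609/Live'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
html = requests.get(url, headers=headers).text

# If a value that is visible in Chrome is absent from the raw HTML,
# it is being loaded by JavaScript after the initial request.
print('matchCentreData' in html)  # 'matchCentreData' is an assumed marker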
This code works and returns the single-digit number that I want, but it's slow: it takes a good 10 seconds to complete. I will be running this 4 times for my use case, so that's 40 seconds wasted on every run.
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get('https://warframe.market/items/ivara_prime_blueprint')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
price_element = soup.find('div', {'class': 'row order-row--Alcph'})
price2 = price_element.find('div', {'class': 'order-row__price--hn3HU'})
price = price2.text
print(int(price))
driver.close()
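Most of those 10 seconds are browser startup, so if all four lookups happen in the same run, one option is to start the browser once and reuse it. A minimal sketch, assuming the four item URLs are known up front (item_urls is a hypothetical placeholder list):

from selenium import webdriver
from bs4 import BeautifulSoup

# Hypothetical list standing in for your four lookups
item_urls = ['https://warframe.market/items/ivara_prime_blueprint']

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)  # pay the startup cost only once
try:
    for url in item_urls:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        row = soup.find('div', {'class': 'row order-row--Alcph'})
        price = row.find('div', {'class': 'order-row__price--hn3HU'}).text
        print(int(price))
finally:
    driver.quit()  # quit() shuts the browser down fully, unlike close()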
This code, on the other hand, does not work: it returns None.
import requests
from bs4 import BeautifulSoup

url = 'https://warframe.market/items/ivara_prime_blueprint'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
price_element = soup.find('div', {'class': 'row order-row--Alcph'})
price2 = price_element.find('div', {'class': 'order-row__price--hn3HU'})
price = price2.text
print(int(price))
My first thought was to add a user agent, but it still did not work. When I print(soup) it gives me HTML, but when I parse it further it starts giving me None, even though it's the same command as in the Selenium example.
The data is loaded dynamically within a <script> tag, so BeautifulSoup doesn't see it (it doesn't render JavaScript).
As an example, to get the data, you can use:
import json
import requests
from bs4 import BeautifulSoup
url = "https://warframe.market/items/ivara_prime_blueprint"
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
script_tag = soup.select_one("#application-state")
json_data = json.loads(script_tag.string)
# Uncomment the line below to see all the data
# from pprint import pprint
# pprint(json_data)
for data in json_data["payload"]["orders"]:
    print(data["user"]["ingame_name"])
Prints:
Rogue_Monarch
Rappei
KentKoes
Tenno61189
spinifer14
Andyfr0nt
hollowberzinho
You can access the data as a dict and access the keys/values.
I'd recommend an online tool to view all the JSON, since it's quite large.
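The same dict also carries the order prices. A follow-up sketch, assuming each order in the payload exposes "platinum" and "order_type" keys (verify them against your own dump of json_data before relying on this):

# Assumes the json_data dict from the example above.
# The "platinum" and "order_type" keys are assumptions -- check your dump.
sell_prices = [
    order["platinum"]
    for order in json_data["payload"]["orders"]
    if order.get("order_type") == "sell"
]
if sell_prices:
    print("Cheapest sell offer:", min(sell_prices))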
So I am trying to scrape the CPI report from the Indian government website
https://fcainfoweb.nic.in/pmsver2/reports/report_menu_web.aspx
I am using this approach:
When we load this website, it asks for multiple options to select. After selecting the options and hitting the Get Data button, we are redirected to the report page.
Here, I copied my cookie and session details, which I used in the Python script below to retrieve information. That works fine.
Now, I want to fully automate this task, which will require:
Price report -> Daily prices
date selection
getting the data in code
But the issue is that the pages redirect and even the options in the selectors change, so how do I scrape this?
I have the script below, where I've supplied a prefetched cookie & session as parameters and am able to get data.
import requests
#from fake_useragent import UserAgent
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
from pprint import pprint
# https://fcainfoweb.nic.in/reports/Report_Menu_Web.aspx
# report link = https://fcainfoweb.nic.in/reports/Report_daily1_Web_New.aspx
#url = 'https://fcainfoweb.nic.in/reports/Report_daily1_Web_New.aspx'
#url = 'https://fcainfoweb.nic.in/reports/Reportdaily9.aspx'
# "Cookie": "ASP.NET_SessionId=n3npgkgb2wpy3sup45ze024y; BNI_persistence=XIlVKPHMyFvRq0HtLj7pmqXxmRx7y7byO_ia3T0PrBLraaAiDz2RxPPPWpXCo2y2SGMfsbBJx4Pe4wWpm_C-OA=="}
#headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
#ua = UserAgent()
#"Cookie":"ASP.NET_SessionId=dkk2h2003kzfamcypczrfaru; BNI_persistence=XIlVKPHMyFvRq0HtLj7pmqXxmRx7y7byO_ia3T0PrBLraaAiDz2RxPPPWpXCo2y2SGMfsbBJx4Pe4wWpm_C-OA==; _ga=GA1.3.654717034.1651138144; _gid=GA1.3.1558736990.1651468427; _gat_gtag_UA_106490103_3=1"
#res = requests.get('https://fcainfoweb.nic.in/reports/Daily_Average_Report_Data_Commoditywise_Percentage_Variation.aspx',headers=head)
head = {'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Cookie": "ASP.NET_SessionId=n3npgkgb2wpy3sup45ze024y; BNI_persistence=XIlVKPHMyFvRq0HtLj7pmqXxmRx7y7byO_ia3T0PrBLraaAiDz2RxPPPWpXCo2y2SGMfsbBJx4Pe4wWpm_C-OA=="}
u = '''https://fcainfoweb.nic.in/Reports/Report_Menu_Web.aspx'''
res = requests.get(u,headers=head)
print(res.headers)
print(res.text)
print(res.cookies)
with open('resp.html','w') as f:
    f.writelines(res.text)
soup = BeautifulSoup(res.text, 'lxml')
#pprint(soup)
tab = soup.find_all('table')
cnt = 1
htab = pd.read_html(res.text)[1]
fn = "data_{0}.xlsx".format(cnt)
htab.to_excel(fn)
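Since this is an ASP.NET WebForms page, one way to automate the whole flow without hardcoding cookies is to let a requests.Session carry the cookies and replay the form postback, including the hidden __VIEWSTATE and __EVENTVALIDATION fields that WebForms requires. A sketch under those assumptions; the form field names below are hypothetical placeholders and must be read from the actual page source:

import requests
from bs4 import BeautifulSoup

session = requests.Session()  # stores ASP.NET_SessionId and other cookies automatically
session.headers['User-Agent'] = 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'

url = 'https://fcainfoweb.nic.in/Reports/Report_Menu_Web.aspx'
soup = BeautifulSoup(session.get(url).text, 'lxml')

# WebForms rejects postbacks that omit these hidden fields.
form = {
    '__VIEWSTATE': soup.find('input', {'name': '__VIEWSTATE'})['value'],
    '__EVENTVALIDATION': soup.find('input', {'name': '__EVENTVALIDATION'})['value'],
    # Hypothetical names -- inspect the real <select>/<input> names and substitute.
    'ddl_report_type': 'Price report',
    'ddl_report_option': 'Daily Prices',
    'txt_date': '01/05/2022',
    'btn_getdata': 'Get Data',
}
response = session.post(url, data=form)
print(response.status_code)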
I tried to scrape a certain piece of information from a webpage but failed miserably. The text I wish to grab is available in the page source, but I still can't fetch it. The site address is in the script below. I'm after the portion shown on the page as Not Rated.
Relevant html:
<div class="subtext">
Not Rated
<span class="ghost">|</span> <time datetime="PT188M">
3h 8min
</time>
<span class="ghost">|</span>
Drama,
Musical,
Romance
<span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a> </div>
I've tried with:
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
I get None using the script above.
Expected output:
Not Rated
How can I get the rating from that webpage?
If you want to get the correct version of the HTML page, specify the Accept-Language HTTP header:
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    s.headers['Accept-Language'] = 'en-US,en;q=0.5'  # <-- specify this too!
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
Prints:
Not Rated
There is a better way to get info from the page. If you dump the HTML content returned by the request,
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    with open("response.html", "w", encoding=r.encoding) as file:
        file.write(r.text)
you will find an element <script type="application/ld+json"> which contains all the information about the movie.
Then you simply get the element's text, parse it as JSON, and use the JSON to extract the info you want.
Here is a working example:
import json
import requests
from bs4 import BeautifulSoup

link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"

with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    # Find the <script type="application/ld+json"> element and get its content
    movie_data = soup.find("script", attrs={"type": "application/ld+json"}).next
    movie_data = json.loads(movie_data)  # parse the data as JSON
    content_rating = movie_data["contentRating"]  # get the rating
    print(content_rating)
IMDb is one of those webpages that makes it incredibly easy to do web scraping, and I love it. What they do to make it easy for scrapers is put a script at the top of the HTML that contains the whole movie object in JSON format.
So to get all the relevant information and organize it, you simply need to get the content of that single script tag and convert it to JSON; then you can ask for specific information as you would with a dictionary.
import requests
import json
from bs4 import BeautifulSoup
#This part is basically the same as yours
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
r = requests.get(link)
soup = BeautifulSoup(r.content,"lxml")
#Why not get the whole json element of the movie?
script = soup.find('script', {"type" : "application/ld+json"})
element = json.loads(script.text)
print(element['contentRating'])
#Outputs "Not Rated"
# You can also inspect the rest of the json, it has all the relevant information inside
#Just -> print(json.dumps(element, indent=2))
Note:
Headers and session are not necessary in this example.
How can I scrape the latitude and longitude shown at the top of Google search results using Beautiful Soup?
Here is the code to do it with bs4:
from requests import get
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',}
response = get("https://www.google.com/search?q=latitude+longitude+of+75270+postal+code+paris+france",headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
a = soup.find("div", class_= "Z0LcW").text
print(a)
Please provide more detail in future questions, since we don't want to do the pre-work of creating a solution.
You will have to grab this container:
<div class="HwtpBd gsrt PZPZlf" data-attrid="kc:/location/location:coordinates" aria-level="3" role="heading"><div class="Z0LcW XcVN5d">48.8573° N, 2.3370° E</div><div></div></div>
BS4
# BeautifulSoup stuff
import requests
from bs4 import BeautifulSoup

# Make the request
url = "https://www.google.com/search?q=latitude+longitude+of+75270+postal+code+paris+france&rlz=1C1CHBF_deDE740DE740&oq=latitude+longitude+of+75270+postal+code+paris+france&aqs=chrome..69i57.4020j0j8&sourceid=chrome&ie=UTF-8"
response = requests.get(url)
# Get the raw HTML
html = response.text
# Parse it into an HTML document
soup = BeautifulSoup(html, 'html.parser')
# Grab the container and its content
target_container = soup.find("div", {"class": "Z0LcW XcVN5d"}).text
Then you have a string inside the div returned.
...assuming Google doesn't change the class names randomly. I tried five refreshes and the class name didn't change, but who knows.
Make sure you're using a user agent (you can also use the Python fake-useragent library).
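For instance, a minimal sketch with the fake-useragent package (assuming it is installed via pip install fake-useragent):

from fake_useragent import UserAgent

ua = UserAgent()
# ua.random returns a randomly chosen real-world User-Agent string
headers = {'User-Agent': ua.random}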
Code that grabs the location from Google Search results:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=latitude longitude of 75270 postal code paris france',
headers=headers).text
soup = BeautifulSoup(html, 'lxml')
location = soup.select_one('.XcVN5d').text
print(location)
Output:
48.8573° N, 2.3370° E
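If you need numeric values rather than the display string, the printed text can be parsed. A sketch assuming the "degrees plus hemisphere" format shown in the output above:

# Parse "48.8573° N, 2.3370° E" into signed floats; assumes the
# degree-and-hemisphere format shown in the output above.
def parse_coordinates(text):
    lat_part, lon_part = [part.strip() for part in text.split(',')]
    lat_value, lat_hem = lat_part.replace('°', '').split()
    lon_value, lon_hem = lon_part.replace('°', '').split()
    lat = float(lat_value) * (1 if lat_hem == 'N' else -1)
    lon = float(lon_value) * (1 if lon_hem == 'E' else -1)
    return lat, lon

print(parse_coordinates('48.8573° N, 2.3370° E'))  # (48.8573, 2.337)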
So I'm trying to extract the current EUR/USD price from a website using Python urllib, but the website does not send the same HTML it sends to Chrome. The first part of the HTML is the same as in Chrome, but it does not include the EUR/USD value. Can I somehow bypass this?
Here's the code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

while True:
    req = Request('https://www.strategystocks.co.uk/currencies-market.html', headers={"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'})
    html = urlopen(req).read()
    soup = BeautifulSoup(html, "html.parser")
    print(soup)
    buy = float(soup.find("span", class_="buyPrice").text)
    sell = float(soup.find("span", class_="sellPrice").text)
    print("Buy", buy)
    print("Sell", sell)
The data is loaded via JavaScript, but you can simulate the Ajax request with the requests library:
import requests
url = 'https://marketools.plus500.com/Feeds/UpdateTable?instsIds=2&isUseSentiments=true'
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
data = requests.get(url, headers=headers).json()
# print(data) # <-- uncomment this to print all data
print('Buy =',data['Feeds'][0]['B'])
print('Sell =',data['Feeds'][0]['S'])
Prints:
Buy = 1.08411
Sell = 1.08403
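To mirror the while True loop from the question without hammering the endpoint, a hedged polling sketch (the 5-second interval is an arbitrary choice):

import time
import requests

url = 'https://marketools.plus500.com/Feeds/UpdateTable?instsIds=2&isUseSentiments=true'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}

while True:
    data = requests.get(url, headers=headers).json()
    print('Buy =', data['Feeds'][0]['B'], '| Sell =', data['Feeds'][0]['S'])
    time.sleep(5)  # arbitrary pause to be polite to the server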