I would like to scrape the content of the following website:
http://financials.morningstar.com/ratios/r.html?t=AMD
There, under Key Ratios, I would like to click on the "Growth" button and then scrape the data in Python.
How can I do that?
You can solve it with requests + BeautifulSoup. The page sends an asynchronous GET request to the http://financials.morningstar.com/financials/getKeyStatPart.html endpoint, which you need to simulate. The Growth table is located inside the div with id="tab-growth":
from bs4 import BeautifulSoup
import requests
url = 'http://financials.morningstar.com/ratios/r.html?t=AMD'
keystat_url = 'http://financials.morningstar.com/financials/getKeyStatPart.html'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}

    # visit the target url first so the session picks up any cookies
    session.get(url)

    params = {
        'callback': '',
        't': 'XNAS:AMD',
        'region': 'usa',
        'culture': 'en-US',
        'cur': '',
        'order': 'asc',
        '_': '1426047023943'
    }
    response = session.get(keystat_url, params=params)

    # get the HTML part from the JSON response
    soup = BeautifulSoup(response.json()['componentData'], 'html.parser')

    # grab the data
    for row in soup.select('div#tab-growth table tr'):
        print(row.text)
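If you want the rows in a structured form rather than just printed, you can collect them into a dict. A minimal sketch, assuming the first cell of each table row holds the metric label and the remaining cells hold the period values:

# build {label: [values...]} from the growth table; the cell layout is an assumption
growth = {}
for row in soup.select('div#tab-growth table tr'):
    cells = [c.get_text(strip=True) for c in row.find_all(['th', 'td'])]
    if cells:
        growth[cells[0]] = cells[1:]
print(growth)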
I want to get the last element of the dictionary (pasted below) into a variable; it's in another dictionary under "offers", and I have no clue how to extract it.
html = s.get(url=url, headers=headers, verify=False, timeout=15)
soup = BeautifulSoup(html.text, 'html.parser')
products = soup.find_all('script', {'type': "application/ld+json"})
{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"} (...)
As mentioned, extract the contents via BeautifulSoup, then decode the string with json.loads():
import json
products = '{"#context":"http://schema.org","#type":"Product","aggregateRating":{"#type":"AggregateRating","bestRating":5,"ratingValue":"4.8","ratingCount":11,"worstRating":3,"reviewCount":5},"brand":{"#type":"Brand","name":"New Balance"},"color":"white/red/biały","image":["https://img01.ztat.net/3"],"itemCondition":"http://schema.org/NewCondition","manufacturer":"New Balance","name":"550 UNISEX - Sneakersy niskie - white/red","offers":[{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110001000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"},{"#type":"Offer","availability":"http://schema.org/OutOfStock","price":"489","priceCurrency":"PLN","sku":"NE215O06U-A110002000","url":"/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html"}]}'
products = json.loads(products)
To get the last element (dict) in offers:
products['offers'][-1]
Output:
{'@type': 'Offer',
'availability': 'http://schema.org/OutOfStock',
'price': '489',
'priceCurrency': 'PLN',
'sku': 'NE215O06U-A110002000',
'url': '/new-balance-550-unisex-sneakersy-niskie-whitered-ne215o06u-a11.html'}
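From there, individual fields are plain dictionary lookups. For example, to get the last offer's price as a number (assuming the price is always a numeric string, as in the snippet above):

last_offer = products['offers'][-1]
price = float(last_offer['price'])  # '489' -> 489.0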
Example
In your special case you also have to replace('&quot;', '"') first, because the quotes inside the script tag are HTML-escaped:
from bs4 import BeautifulSoup
import requests, json
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
"X-Amzn-Trace-Id": "Root=1-61acac03-6279b8a6274777eb44d81aae",
"X-Client-Data": "CJW2yQEIpLbJAQjEtskBCKmdygEIuevKAQjr8ssBCOaEzAEItoXMAQjLicwBCKyOzAEI3I7MARiOnssB" }
html = requests.get('https://www.zalando.de/new-balance-550-unisex-sneaker-low-whitered-ne215o06u-a11.html', headers=headers)
soup = BeautifulSoup(html.content, 'lxml')
jsonData = json.loads(soup.select_one('script[type="application/ld+json"]').text.replace('&quot;', '"'))
jsonData['offers'][-1]
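Note that some pages embed more than one application/ld+json script, in which case select_one may grab the wrong one. A hedged sketch that loops over all of them and keeps the Product entry (key names taken from the snippet above):

for script in soup.select('script[type="application/ld+json"]'):
    data = json.loads(script.text.replace('&quot;', '"'))
    # skip non-dict payloads and non-Product entries
    if isinstance(data, dict) and data.get('@type') == 'Product':
        print(data['offers'][-1])
        break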
I am trying to scrape search results from the website: https://promedmail.org/promed-posts/
I have tried BeautifulSoup, MechanicalSoup and mechanize, but so far I have been unable to scrape the search results.
import re
from mechanize import Browser, urlopen

browser = Browser()
browser.set_handle_robots(False)
browser.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
browser.open("https://promedmail.org/promed-posts")

for form in browser.forms():
    if form.attrs['id'] == 'full_search':
        browser.form = form
        break

browser['search'] = 'US'
response = browser.submit()
content = response.read()
The content does not show the search results for 'US'. Any idea what I am doing wrong here?
As you mention bs4, you can mimic the POST request the page makes. Extract the JSON item which contains the HTML the page would have been updated with (containing the results), parse that into a BeautifulSoup object, then reconstruct the results table as a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

headers = {'user-agent': 'Mozilla/5.0'}

data = {
    'action': 'get_promed_search_content',
    'query[0][name]': 'kwby1',
    'query[0][value]': 'summary',
    'query[1][name]': 'search',
    'query[1][value]': 'US',
    'query[2][name]': 'date1',
    # 'query[2][value]': '',
    'query[3][name]': 'date2',
    # 'query[3][value]': '',
    'query[4][name]': 'feed_id',
    'query[4][value]': '1'
}

r = requests.post('https://promedmail.org/wp-admin/admin-ajax.php', headers=headers, data=data).json()
soup = bs(r['results'], 'lxml')

df = pd.DataFrame([(i.find_next(text=True),
                    i.a.text,
                    f"https://promedmail.org/promed-post/?id={i.a['id'].replace('id', '')}") for i in soup.select('li')],
                  columns=['Date', 'Title', 'Link'])
print(df)
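If you want to keep the results, the DataFrame can be written straight to disk, e.g. (the filename here is just an example):

df.to_csv('promed_results.csv', index=False)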
I am trying to get the date shown below (01/19/2021), and I would like to get the "19" into a Python variable:
<span class="grayItalic">
Received: 01/19/2021
</span>
Here is the piece of code that isn't working:
date = soup.find('span', {'class': 'grayItalic'}).get_text()
converted_date = int(date[13:14])
print(date)
I get this error: 'NoneType' object has no attribute 'get_text'. Could anyone help?
Try this with headers:
import requests
from bs4 import BeautifulSoup
url = "https://iapps.courts.state.ny.us/nyscef/DocumentList?docketId=npvulMdOYzFDYIAomW_PLUS_elw==&display=all"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content,'html.parser')
date = soup.find('span', {'class': 'grayItalic'}).get_text().strip()
converted_date = int(date.split("/")[-2])
print(converted_date)
print(date)
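Note that soup.find returns None when nothing matches, which is exactly what produced your 'NoneType' object has no attribute 'get_text' error, so a guard is worth adding. A minimal sketch:

span = soup.find('span', {'class': 'grayItalic'})
if span is not None:
    date = span.get_text().strip()
    print(int(date.split("/")[-2]))  # -> 19
else:
    print('span not found - the request was probably blocked')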
import dateutil.parser
from bs4 import BeautifulSoup
html_doc=""""<span class="grayItalic">
Received: 01/19/2021
</span>"""
soup=BeautifulSoup(html_doc,'html.parser')
date_ = soup.find('span', {'class': 'grayItalic'}).get_text()
dateutil.parser.parse(date_,fuzzy=True)
Output:
datetime.datetime(2021, 1, 19, 0, 0)
date_ outputs '\n Received: 01/19/2021\n'. You've used string slicing; instead you can use dateutil.parser, which will return a datetime.datetime object for you.
In this case I've assumed you just need the date. If you need the text too, you can use fuzzy_with_tokens=True.
If the fuzzy_with_tokens option is True, it returns a tuple: the first element is a datetime.datetime object, the second a tuple containing the fuzzy tokens.
dateutil.parser.parse(date_,fuzzy_with_tokens=True)
(datetime.datetime(2021, 1, 19, 0, 0), (' Received: ', ' '))
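Since you only want the "19", the parsed datetime already exposes it as an attribute:

day = dateutil.parser.parse(date_, fuzzy=True).day
print(day)  # -> 19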
I couldn't load the URL using the requests or urllib module. I guess the website is blocking automated connection requests. So I opened the webpage in a browser, saved the source code in a file named page.html, and ran the BeautifulSoup operations on that. It seems that worked.
html = open("page.html")
soup = BeautifulSoup(html, 'html.parser')
date_span = soup.find('span', {'class': 'grayItalic'})
if date_span is not None:
print(str(date_span.text).strip().replace("Received: ", ""))
# output: 04/25/2019
I tried scraping the source code with the requests library as follows, but it didn't work (probably the webpage is blocking the request). See if it works on your machine.
url = "..."
headers = {
'Access-Control-Allow-Origin': '*',
'Access-Control-Allow-Methods': 'GET',
'Access-Control-Allow-Headers': 'Content-Type',
'Access-Control-Max-Age': '3600',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
}
response = requests.get(url, headers=headers)
html = response.content
print(html)
I have tried to scrape data from http://www.educationboardresults.gov.bd/ with Python and BeautifulSoup.
First, the website requires a form to be filled in. After the form is submitted, the website provides the results. I have attached two images here.
Before Submitting Form: https://prnt.sc/w4lo7i
After Submission: https://prnt.sc/w4lqd0
I have tried the following code:
import requests
from bs4 import BeautifulSoup as bs

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
}

headers = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}

with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd'
    r = s.get(url, headers=headers)
    soup = bs(r.content, 'html5lib')

    # scraping and bypassing the captcha
    alltable = soup.findAll('td')
    captcha = alltable[56].text.split('+')
    for digit in captcha:
        value_one, value_two = int(captcha[0]), int(captcha[1])
    resultdata['value_s'] = value_one + value_two
    r = s.post(url, data=resultdata, headers=headers)
When printing r.content it shows the first page's code, but I want to scrape the second (results) page.
Thanks in advance.
You are making the POST request to the wrong url. Moreover, you are supposed to add the two captcha numbers together and send the sum as value_s. The code below locates the captcha with a :has() pseudo CSS selector, so make sure you are using bs4 version 4.7.0 or later. Try the following:
import requests
from bs4 import BeautifulSoup

link = 'http://www.educationboardresults.gov.bd/'
result_url = 'http://www.educationboardresults.gov.bd/result.php'

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': 2012,
    'board': 'chittagong',
    'roll': 102275,
    'reg': 626948,
    'button2': 'Submit',
}

def get_number(s, link):
    # fetch the form page and sum the two captcha numbers shown next to the value_s field
    r = s.get(link)
    soup = BeautifulSoup(r.text, "html5lib")
    num = 0
    captcha_numbers = soup.select_one("tr:has(> td > #value_s) > td + td").text.split("+")
    for i in captcha_numbers:
        num += int(i)
    return num

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
    resultdata['value_s'] = get_number(s, link)
    r = s.post(result_url, data=resultdata)
    print(r.text)
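If you want more than the raw HTML, the response can be parsed the same way as the form page. A minimal sketch that just dumps every table cell of the result page (its exact layout is an assumption):

result_soup = BeautifulSoup(r.text, "html5lib")
for td in result_soup.select('td'):
    print(td.get_text(strip=True))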
I was also trying this; here is my version:
import requests
from bs4 import BeautifulSoup as bs

resultdata = {
    'sr': '3',
    'et': '2',
    'exam': 'ssc',
    'year': "2012",
    'board': 'chittagong',
    'roll': "102275",
    'reg': "626948",
}

headers = {
    'user-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'cookie': 'PHPSESSID=24vp2g7ll9utu1p2ob5bniq263; tcount_unique_eb_log=1',
    'Origin': 'http://www.educationboardresults.gov.bd',
    'Referer': 'http://www.educationboardresults.gov.bd/',
    'Request URL': 'http://www.educationboardresults.gov.bd/result.php'
}

with requests.Session() as s:
    url = 'http://www.educationboardresults.gov.bd/index.php'
    r = s.get(url, headers=headers)
    soup = bs(r.content, 'lxml')
    # print(soup.prettify())

    # scraping and bypassing the captcha
    alltable = soup.findAll('td')
    captcha = alltable[56].text.split('+')
    print(captcha)
    value_one, value_two = int(captcha[0]), int(captcha[1])
    print(value_one, value_two)
    resultdata['value_s'] = value_one + value_two
    resultdata['button2'] = 'Submit'
    print(resultdata)

    r = s.post("http://www.educationboardresults.gov.bd/result.php", data=resultdata, headers=headers)
    soup = bs(r.content, 'lxml')
    print(soup.prettify())
Please help me handle authorization through a script. The problem is that I cannot automatically feed the invbf_session_id obtained from the GET request into the script.
import pprint
import requests
import re
import shelve
import bs4

def scanning_posts():
    print('------------------------enter begin--------------------')
    url = 'http://forum.saransk.ru/'
    html = requests.get(url)
    pprint.pprint(html.headers)

    rawCookie = html.headers['Set-Cookie']
    cookie = re.search(r"invbf_session_id=(.*?);", rawCookie).group(1)
    pprint.pprint(cookie)  # cookie != zzzzzzzzzzzzzzzzzzzzzz

    html = requests.get(url)
    soup = bs4.BeautifulSoup(html.text)
    loginForm = soup.find('form', {'id': 'login'})
    hiddenAuthKey = soup.find('input', {'name': 'auth_key'})['value']

    authData = {
        'ips_username': 'xxxxxx',
        'ips_password': 'yyyyyy',
        'auth_key': hiddenAuthKey,
        'rememberMe': 1,
        'referer': 'http://forum.saransk.ru/'
    }

    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36',
        'Referer': 'http://forum.saransk.ru/forum/'
    }

    # pprint.pprint(authData)
    print('\tlogin: ', authData['ips_username'])

    cookie = dict(invbf_session_id='zzzzzzzzzzzzzzzzzzzzzz')
    req = requests.get(url, params=authData, cookies=cookie, headers=header)

    soup = bs4.BeautifulSoup(req.text)
    signLinkNotLogged = soup.find('a', {'id': 'sign_in'})
    if signLinkNotLogged:
        print('------------------------enter failed--------------------')
    else:
        print('------------------------enter successful--------------------')

scanning_posts()
After running, the script displays the wrong invbf_session_id value, as seen in FF Firebug; accordingly, authorization fails.
If I copy the invbf_session_id value from FF Firebug and paste it into the script, authorization succeeds.
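For what it's worth, requests can carry the invbf_session_id cookie for you if every request goes through a single requests.Session object; each bare requests.get call in the script above starts with an empty cookie jar, which is why the session id never matches. A minimal sketch under that assumption, reusing the form fields from the script:

import requests
import bs4

url = 'http://forum.saransk.ru/'
with requests.Session() as session:
    # the first GET stores invbf_session_id in session.cookies automatically
    first = session.get(url)
    soup = bs4.BeautifulSoup(first.text, 'html.parser')
    auth_key = soup.find('input', {'name': 'auth_key'})['value']
    authData = {
        'ips_username': 'xxxxxx',
        'ips_password': 'yyyyyy',
        'auth_key': auth_key,
        'rememberMe': 1,
        'referer': url,
    }
    # the same cookie jar is reused here, so the session id stays consistent
    response = session.get(url, params=authData)
    print(session.cookies.get('invbf_session_id'))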