How do I Webscrape a website that uses iframes?

How do I Webscrape a website that uses iframes? - python

I am trying to scrape this website 'https://swimming.org.nz/results.html'. In the form that comes up, I am filling in only the Age column as 8 to 8. I am using the following code to scrape the table as suggested elsewhere in StackOverflow. I am unable to get the table. How to get all the tables for this age group 8 to 8.
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("https://swimming.org.nz/results.html")
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("x-MS_FIELD_AGE.FROM.L").attrs["src"]
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
for row in soup.select("x-form-text x-form-field"):
print("\t".join([e.text for e in row.select("th, td")]))

You will see that it is not necessary using BeautifulSoup if you' ve look at the developer tools on your browser. Sending request is below and response type is xml. You don't need any scraping tool. You can get all of that data changing the StartRowIndexand MaximumRowCount.
import requests
url = "https://connect.swimming.org.nz/snz-wrap-public/pages/pageRequestHandler?tunnelTarget=tableData%2F%3F&data_file=MS.COMP.RESULTS&dict_file=MS.COMP.RESULTS&doGet=true"
payload="StartRowIndex=0&MaximumRowCount=100&sort=BY-DSND%20COMP.DATE%20BY-DSND%20STAGE&dir=ASC&tid=extTable1620108707767_4352538&selectCriteria=GET-LIST%20CMS_TABLE_19483_65507_076_184811_&extraColumns=%3CColumns%20DynamicLinkRoot%3D%22https%3A%2F%2Fconnect.swimming.org.nz%3A443%2Fsnz-wrap-public%2Fworkflows%2F%22%3E%3CColumn%3E%3CColumnName%3EExpander%3C%2FColumnName%3E%3CField%3EFRAGMENT_DISPLAY.SPLITS%3C%2FField%3E%3CShowInExpander%3Etrue%3C%2FShowInExpander%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BMEMBER.FORE1%7D%20%7BMEMBER.SURNAME%7D%3C%2FFieldExpression%3E%3CField%3EEXPRESSION_FIELD_1%3C%2FField%3E%3CColumnName%3EName%2520%3C%2FColumnName%3E%3CWidth%3E130%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXGENDER%3C%2FField%3E%3CColumnName%3EGender%3C%2FColumnName%3E%3CWidth%3E50%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EENTRANT.AGE%3C%2FField%3E%3CColumnName%3EAge%3C%2FColumnName%3E%3CWidth%3E35%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BXCATEGORY2%7D%3C%2FFieldExpression%3E%3CField%3ECATEGORY2.NUM%24%24SNZ%3C%2FField%3E%3CColumnName%3EDistance%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY1%3C%2FField%3E%3CColumnName%3EStroke%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CFieldExpression%3E%7BTIME%24%24SNZ%7D%3C%2FFieldExpression%3E%3CField%3ERESULT.TIME.MILLISECONDS%3C%2FField%3E%3CColumnName%3ETime%2520%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.POINTS%24%24SNZ%3C%2FField%3E%3CColumnName%3EFINA%2520Points%3C%2FColumnName%3E%3CWidth%3E85%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.YEAR%24%24SNZ%3C%2FField%3E%3CColumnName%3EPoints%2520Year%3C%2FColumnName%3E%3CWidth%3E80%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3E%24DATE%24COMP.DATE%3C%2FField%3E%3CColumnName%3EDate%3C%2FColumnName%3E%3CWidth%3E70%3C%2FWidth%3E%3CAlign%3Eright%3C%2FAlign%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXEVENT.CODE%3C%2FField%3E%3CColumnName%3EMeet%3C%2FColumnName%3E%3CWidth%3E190%3C%2FWidth%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EPARAMETER1%3C%2FField%3E%3CColumnName%3ECourse%3C%2FColumnName%3E%3CWidth%3E50%3C%2FWidth%3E%3C%2FColumn%3E%3C%2FColumns%3E&extraColumnsDownload=%3CDownloadColumns%20DynamicLinkRoot%3D%22https%3A%2F%2Fconnect.swimming.org.nz%3A443%2Fsnz-wrap-public%2Fworkflows%2F%22%3E%3CColumn%3E%3CField%3EXGENDER%3C%2FField%3E%3CColumnName%3EGender%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EENTRANT.AGE%3C%2FField%3E%3CColumnName%3EAge%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY2%3C%2FField%3E%3CColumnName%3EDistance%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXCATEGORY1%3C%2FField%3E%3CColumnName%3EStroke%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3ETIME%24%24SNZ%3C%2FField%3E%3CColumnName%3ETime%2520%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.POINTS%24%24SNZ%3C%2FField%3E%3CColumnName%3EFINA%2520Points%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EFINA.YEAR%24%24SNZ%3C%2FField%3E%3CColumnName%3EPoints%2520Year%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3E%24DATE%24COMP.DATE%3C%2FField%3E%3CColumnName%3EDate%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EXEVENT.CODE%3C%2FField%3E%3CColumnName%3EMeet%3C%2FColumnName%3E%3C%2FColumn%3E%3CColumn%3E%3CField%3EPARAMETER1%3C%2FField%3E%3CColumnName%3ECourse%3C%2FColumnName%3E%3C%2FColumn%3E%3C%2FDownloadColumns%3E"
headers = {
'Connection': 'keep-alive',
'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"',
'accept': '*/*',
'x-requested-with': 'XMLHttpRequest',
'accept-language': 'tr-TR,tr;q=0.9,en-US;q=0.8,en;q=0.7,ru;q=0.6',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Origin': 'https://connect.swimming.org.nz',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://connect.swimming.org.nz/snz-wrap-public/workflows/COMP.RESULTS.FIND',
'Cookie': 'JSESSIONID=93F2FEA63BA41ECB2505E2D1CD76374D; _ga=GA1.3.1735786808.1620106921; _gid=GA1.3.1806138988.1620106921'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)

Related

Scraping data through Api from json

I would like to limit the data they receive to the first 8 links on the website. As shown in the picture, there is no data available beyond the 8th link, as seen in the CSV file. How can I apply this limit so that they only receive data from the first 8 links? The website link is https://www.linkedin.com/learning/search?keywords=data%20science,
JSON API
CSV File
Code part
import requests
import pandas as pd
url = "https://www.linkedin.com/learning-api/searchV2?keywords=data%20science&q=keywords&searchRequestId=RW4AuZRJT22%2BUeXnsZJGQA%3D%3D"
payload={}
headers = {
'authority': 'www.linkedin.com',
'accept': 'application/vnd.linkedin.normalized+json+2.1',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,pt;q=0.7',
'cookie': 'bscookie="v=1&202108281231498ed9b977-a15a-4647-83ff-d0ef12adfbfbAQFdf9p_GSaBPrFkmyztJ8zyOnqVND-D"; li_theme=light; li_theme_set=app; li_sugr=4752e3dd-9232-4bb9-9dbb-b29c1a127f77; bcookie="v=2&9fb3a4d0-1139-4e2b-89ba-e5374eeb9735"; aam_uuid=08800810176251362264578372297522883472; _gcl_au=1.1.240501668.1664707206; li_rm=AQELLfU3ZqmMhAAAAYQ_tPjGK8ONpN3EEUxH1P4M6Czq5fk6EXaEXSzKwoNSXoSZ7KgO5uSTE9iZ30fuhs6ju1rLH1VgXYyRM3nNuiTQEx1k2ca6SR0Hk1d5-NBafeE0zv65QetFY5Yrx2ufzRlfEXUkJJSoO9Z2o7MeuX-3Go7P4dI-m5HQM7VOKLiK_TD-ZWzj_OkdkR75K31QKGq8bxPLa0JpkGUzhDIVGWzl6vqkcl6BJEK2s-keIZjsiH5MZ9sbLXEVOxLg4vD21TTJBNshE6zaiWrSnxx_PEm44eDPqjvXRMVWFeX7VZfIe2KFshWXLRc4SY8hAQINymU; visit=v=1&M; G_ENABLED_IDPS=google; JSESSIONID="ajax:7673827752327651374"; timezone=Asia/Karachi; _guid=0f0d3402-80be-4bef-9baf-18d281f68921; mbox=session^#965dfb20b29e4f2688eedcf643d2e5ab^#1671620169|PC^#965dfb20b29e4f2688eedcf643d2e5ab.38_0^#1687170309; __ssid=db28305b-28da-4f8b-ad3a-54dea10b9eb9; dfpfpt=da2e5dde482a41b09cf7178ba1bcec7e; g_state={"i_l":0}; liap=true; li_at=AQEDATKxuC8DTVh9AAABhaytidQAAAGGZN5q6E0AdHv14xrDnsngkfFuMyIIbGYccHR15UrPQ8rb3qpS0_-mpCFm9pXQkoNYGdk87LiGVIqiw4oXuJ9tqflCEOev71_L83JoJ-fkbOfZwdG0RICtuIHn; AnalyticsSyncHistory=AQKUIualgILMBgAAAYZHP2t3mvejt25dMqUMRmrpyhaQMe1cucNiAMliFNRUf4cu4aKnZ1z1kQ_FGeqFr2m04Q; lms_ads=AQEr9ksNAL4kugAAAYZHP2z8QK26stPkoXe2TgJZW3Fnrl4dCzbC2DtithS1-zp5Ve85QwxzRhPvP9okaC0kbu40FYX7EqIk; lms_analytics=AQEr9ksNAL4kugAAAYZHP2z8QK26stPkoXe2TgJZW3Fnrl4dCzbC2DtithS1-zp5Ve85QwxzRhPvP9okaC0kbu40FYX7EqIk; fid=AQGWcXnO5AffyAAAAYZRr6tph6cekZ9ZD66e1xdHhumlVvJ3cKYzZLwfK-I3nJyeRyLQs3LRnowKjQ; lil-lang=en_US; lang=v=2&lang=en-us; _dd_l=1; _dd=ff90da3c-aa07-4491-9106-b226eba1c09c; AMCVS_14215E3D5995C57C0A495C55%40AdobeOrg=1; AMCV_14215E3D5995C57C0A495C55%40AdobeOrg=-637568504%7CMCIDTS%7C19403%7CMCMID%7C09349215808923073694559483836331055195%7CMCAAMLH-1677084815%7C3%7CMCAAMB-1677084815%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1676487215s%7CNONE%7CMCCIDH%7C1076847823%7CvVersion%7C5.1.1; s_cc=true; UserMatchHistory=AQJJ3j-efkcQeQAAAYZWAETxBE44VVBGzo_i-gr5nEGPOK85mS3kDScLdGC24_GeNx-GEeCNDrPOjkQde_MGT4iPc7vJV4sT_nPL8Tv4WMTLarIEliLYPkCvou8zFlb3dFNkbXZjVV_KTVeDvUSJ5WJTeStLNXmzV3_EV5mI9dbSRpoTFlJ94vi_zxcCmnLTaGAYGQAdymMv4SbaMgtnt3QcY8Zj9-hnwxdsIEmJloq47_QTP7sfl-SG-vw8xvhl9KYb0ZPKCnQ6ioJhu3G4cFpKJiSUbULkYMADSo0; lidc="b=VB23:s=V:r=V:a=V:p=V:g=4060:u=105:x=1:i=1676480108:t=1676566269:v=2:sig=AQEz2UktgVcQuJwMoVRgKgnUuKtCEm9C"; s_sq=%5B%5BB%5D%5D; gpv_pn=www.linkedin.com%2Flearning%2Fsearch; s_ips=615; s_plt=7.03; s_pltp=www.linkedin.com%2Flearning%2Fsearch; s_tp=6116; s_ppv=www.linkedin.com%2Flearning%2Fsearch%2C47%2C10%2C2859%2C7%2C18; s_tslv=1676480356388',
'csrf-token': 'ajax:7673827752327651374',
'referer': 'https://www.linkedin.com/learning/search?keywords=data%20science',
'sec-ch-ua': '"Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
'x-li-lang': 'en_US',
'x-li-page-instance': 'urn:li:page:d_learning_search;gNOg2MJoSqWv2XNAh4ukiQ==',
'x-li-pem-metadata': 'Learning Exp - Search=search',
'x-li-track': '{"clientVersion":"1.1.2236","mpVersion":"1.1.2236","osName":"web","timezoneOffset":5,"timezone":"Asia/Karachi","mpName":"learning-web","displayDensity":1,"displayWidth":1366,"displayHeight":768}',
'x-lil-intl-library': 'en_US',
'x-restli-protocol-version': '2.0.0'
}
res = requests.request("GET", url, headers=headers, data=payload).json()
product=[]
items=res['included']
for item in items:
try:
title=item['headline']['title']['text']
except:
title=''
try:
url='https://www.linkedin.com/learning/'+item['slug']
except:
url=''
try:
rating=item['rating']['ratingCount']
except:
rating=''
wev={
'title':title,
'instructor':name,
'review':rating,
'url':url
}
product.append(wev)
df=pd.DataFrame(product)
df.to_csv('learning.csv')

To filter the rows that contain empty columns, specifically those with an empty title column, you can simply add the following code:
df=pd.DataFrame(product)
filter = df["title"] != ""
dfNew = df[filter]
dfNew.to_csv('learning.csv')
The entire code will be:
import requests
import pandas as pd
url = "https://www.linkedin.com/learning-api/searchV2?keywords=data%20science&q=keywords&searchRequestId=RW4AuZRJT22%2BUeXnsZJGQA%3D%3D"
payload={}
headers = {
'authority': 'www.linkedin.com',
'accept': 'application/vnd.linkedin.normalized+json+2.1',
'accept-language': 'en-GB,en-US;q=0.9,en;q=0.8,pt;q=0.7',
'cookie': 'bscookie="v=1&202108281231498ed9b977-a15a-4647-83ff-d0ef12adfbfbAQFdf9p_GSaBPrFkmyztJ8zyOnqVND-D"; li_theme=light; li_theme_set=app; li_sugr=4752e3dd-9232-4bb9-9dbb-b29c1a127f77; bcookie="v=2&9fb3a4d0-1139-4e2b-89ba-e5374eeb9735"; aam_uuid=08800810176251362264578372297522883472; _gcl_au=1.1.240501668.1664707206; li_rm=AQELLfU3ZqmMhAAAAYQ_tPjGK8ONpN3EEUxH1P4M6Czq5fk6EXaEXSzKwoNSXoSZ7KgO5uSTE9iZ30fuhs6ju1rLH1VgXYyRM3nNuiTQEx1k2ca6SR0Hk1d5-NBafeE0zv65QetFY5Yrx2ufzRlfEXUkJJSoO9Z2o7MeuX-3Go7P4dI-m5HQM7VOKLiK_TD-ZWzj_OkdkR75K31QKGq8bxPLa0JpkGUzhDIVGWzl6vqkcl6BJEK2s-keIZjsiH5MZ9sbLXEVOxLg4vD21TTJBNshE6zaiWrSnxx_PEm44eDPqjvXRMVWFeX7VZfIe2KFshWXLRc4SY8hAQINymU; visit=v=1&M; G_ENABLED_IDPS=google; JSESSIONID="ajax:7673827752327651374"; timezone=Asia/Karachi; _guid=0f0d3402-80be-4bef-9baf-18d281f68921; mbox=session^#965dfb20b29e4f2688eedcf643d2e5ab^#1671620169|PC^#965dfb20b29e4f2688eedcf643d2e5ab.38_0^#1687170309; __ssid=db28305b-28da-4f8b-ad3a-54dea10b9eb9; dfpfpt=da2e5dde482a41b09cf7178ba1bcec7e; g_state={"i_l":0}; liap=true; li_at=AQEDATKxuC8DTVh9AAABhaytidQAAAGGZN5q6E0AdHv14xrDnsngkfFuMyIIbGYccHR15UrPQ8rb3qpS0_-mpCFm9pXQkoNYGdk87LiGVIqiw4oXuJ9tqflCEOev71_L83JoJ-fkbOfZwdG0RICtuIHn; AnalyticsSyncHistory=AQKUIualgILMBgAAAYZHP2t3mvejt25dMqUMRmrpyhaQMe1cucNiAMliFNRUf4cu4aKnZ1z1kQ_FGeqFr2m04Q; lms_ads=AQEr9ksNAL4kugAAAYZHP2z8QK26stPkoXe2TgJZW3Fnrl4dCzbC2DtithS1-zp5Ve85QwxzRhPvP9okaC0kbu40FYX7EqIk; lms_analytics=AQEr9ksNAL4kugAAAYZHP2z8QK26stPkoXe2TgJZW3Fnrl4dCzbC2DtithS1-zp5Ve85QwxzRhPvP9okaC0kbu40FYX7EqIk; fid=AQGWcXnO5AffyAAAAYZRr6tph6cekZ9ZD66e1xdHhumlVvJ3cKYzZLwfK-I3nJyeRyLQs3LRnowKjQ; lil-lang=en_US; lang=v=2&lang=en-us; _dd_l=1; _dd=ff90da3c-aa07-4491-9106-b226eba1c09c; AMCVS_14215E3D5995C57C0A495C55%40AdobeOrg=1; AMCV_14215E3D5995C57C0A495C55%40AdobeOrg=-637568504%7CMCIDTS%7C19403%7CMCMID%7C09349215808923073694559483836331055195%7CMCAAMLH-1677084815%7C3%7CMCAAMB-1677084815%7CRKhpRz8krg2tLO6pguXWp5olkAcUniQYPHaMWWgdJ3xzPWQmdj0y%7CMCOPTOUT-1676487215s%7CNONE%7CMCCIDH%7C1076847823%7CvVersion%7C5.1.1; s_cc=true; UserMatchHistory=AQJJ3j-efkcQeQAAAYZWAETxBE44VVBGzo_i-gr5nEGPOK85mS3kDScLdGC24_GeNx-GEeCNDrPOjkQde_MGT4iPc7vJV4sT_nPL8Tv4WMTLarIEliLYPkCvou8zFlb3dFNkbXZjVV_KTVeDvUSJ5WJTeStLNXmzV3_EV5mI9dbSRpoTFlJ94vi_zxcCmnLTaGAYGQAdymMv4SbaMgtnt3QcY8Zj9-hnwxdsIEmJloq47_QTP7sfl-SG-vw8xvhl9KYb0ZPKCnQ6ioJhu3G4cFpKJiSUbULkYMADSo0; lidc="b=VB23:s=V:r=V:a=V:p=V:g=4060:u=105:x=1:i=1676480108:t=1676566269:v=2:sig=AQEz2UktgVcQuJwMoVRgKgnUuKtCEm9C"; s_sq=%5B%5BB%5D%5D; gpv_pn=www.linkedin.com%2Flearning%2Fsearch; s_ips=615; s_plt=7.03; s_pltp=www.linkedin.com%2Flearning%2Fsearch; s_tp=6116; s_ppv=www.linkedin.com%2Flearning%2Fsearch%2C47%2C10%2C2859%2C7%2C18; s_tslv=1676480356388',
'csrf-token': 'ajax:7673827752327651374',
'referer': 'https://www.linkedin.com/learning/search?keywords=data%20science',
'sec-ch-ua': '"Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"Windows"',
'sec-fetch-dest': 'empty',
'sec-fetch-mode': 'cors',
'sec-fetch-site': 'same-origin',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
'x-li-lang': 'en_US',
'x-li-page-instance': 'urn:li:page:d_learning_search;gNOg2MJoSqWv2XNAh4ukiQ==',
'x-li-pem-metadata': 'Learning Exp - Search=search',
'x-li-track': '{"clientVersion":"1.1.2236","mpVersion":"1.1.2236","osName":"web","timezoneOffset":5,"timezone":"Asia/Karachi","mpName":"learning-web","displayDensity":1,"displayWidth":1366,"displayHeight":768}',
'x-lil-intl-library': 'en_US',
'x-restli-protocol-version': '2.0.0'
}
res = requests.request("GET", url, headers=headers, data=payload).json()
product=[]
items=res['included']
for item in items:
try:
title=item['headline']['title']['text']
except:
title=''
try:
url='https://www.linkedin.com/learning/'+item['slug']
except:
url=''
try:
rating=item['rating']['ratingCount']
except:
rating=''
name = item.get("description", {}).get("text", "")
wev={
'title':title,
'instructor':name,
'review':rating,
'url':url
}
product.append(wev)
df=pd.DataFrame(product)
filter = df["title"] != ""
dfNew = df[filter]
dfNew.to_csv('learning.csv')
However, this is solution works because the web is structured. For complex/irregular websites I prefer to use scrapy as we use in my job.

Python requests POST gives different response content than browser

I'm writing a Python script to scrape a table from this site (this is public information about ocean tide levels).
One of the stations I'd like to scrape is Punta del Este, code 83.0, in any given day. But my scripts returns a different table than the browser even when the POST request seems to have the same input.
When I fill the form in my browser, the headers and data sent to the server are these:
So I wrote my script to make a POST request as it follows:
url = 'https://www.ambiente.gub.uy/SIH-JSF/paginas/sdh/consultaHDMCApublic.xhtml'
s = requests.Session()
r = s.get(url, verify=False)
soupGet = BeautifulSoup(r.content, 'lxml')
#JSESSIONID = s.cookies['JSESSIONID']
javax_faces_ViewState = soupGet.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
headersSih = {
'Accept': 'application/xml, text/xml, */*; q=0.01',
'Accept-Language': 'gzip, deflate, br',
'Accept-Language': 'es-ES,es;q=0.6',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
# 'Cookie': 'JSESSIONID=FBE5ZdMQVFrgQ-P6K_yTc1bw.dinaguasihappproduccion',
'Faces-Request': 'partial/ajax',
'Origin': 'https://www.ambiente.gub.uy',
'Referer': url,
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'Sec-GPC': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
}
ini_date = datetime.strftime(fecha0 , '%d/%m/%Y %H:%M')
end_date = datetime.strftime(fecha0 + timedelta(days=1), '%d/%m/%Y %H:%M')
codigo = 830
dataSih = {
'javax.faces.partial.ajax': 'true',
'javax.faces.source': 'formConsultaHorario:j_idt64',
'javax.faces.partial.execute': '#all',
'javax.faces.partial.render': 'formConsultaHorario:pnlhorarioConsulta',
'formConsultaHorario:j_idt64': 'formConsultaHorario:j_idt64',
'formConsultaHorario': 'formConsultaHorario',
'formConsultaHorario:estacion_focus': '',
'formConsultaHorario:estacion_input': codigo,
'formConsultaHorario:fechaDesde_input': ini_date,
'formConsultaHorario:fechaHasta_input': end_date,
'formConsultaHorario:variables_focus': '',
'formConsultaHorario:variables_input': '26', # Variable: H,Nivel
'formConsultaHorario:fcal_focus': '',
'formConsultaHorario:fcal_input': '7', # Tipo calculo: Ingresado
'formConsultaHorario:ptiempo_focus': '',
'formConsultaHorario:ptiempo_input': '2', #Paso de tiempo: Escala horaria
'javax.faces.ViewState': javax_faces_ViewState,
}
page = s.post(url, headers=headersSih, data=dataSih)
However, when I do it via browser I get a table full of data, while python request returns (page.text) a table saying "No data was found".
Is there something I'm missing? I've tried changing a lots of stuff but nothing seems to do the trick.

Maybe on this website javascript loads the data. Requests dont activate it. If you want to get data from there use Selenium

python requests not returning json data

I would like to get the json data from for instance https://app.weathercloud.net/d0838117883#current using python requests module.
I tried:
import re
import requests
device='0838117883'
URL='https://app.weathercloud.net'
URL1=URL+'/d'+device
URL2=URL+'/device/stats'
headers={'Content-Type':'text/plain; charset=UTF-8',
'Referer':URL1,
'User-Agent':'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/48.0.2564.82 Chrome/48.0.2564.82 Safari/537.36',
'Accept':'application/json, text/javascript,*/*'}
with requests.Session() as s:
#get html from URL1 in order to get the CSRF token
page = s.get(URL1)
CSRF=re.findall('WEATHERCLOUD_CSRF_TOKEN:"(.*)"},',page.text)[0]
#create parameters for URL2, in order to get the json file
params={'code':device,'WEATHERCLOUD_CSRF_TOKEN':CSRF}
page_stats=requests.get(URL2,params=params,headers=headers)
print(page_stats.url)
print(page_stats) #<Response [200]>
print(page_stats.text) #empty
print(page_stats.json()) #error
But the page_stats is empty.
How can I get the stats data from weathercloud?

Inspecting the page with DevTools, you'll find a useful endpoint:
https://app.weathercloud.net/device/stats
You can "replicate" the original web request made by your browser with requests library:
import requests
cookies = {
'PHPSESSID': '************************',
'WEATHERCLOUD_CSRF_TOKEN':'***********************',
'_ga': '**********',
'_gid': '**********',
'__gads': 'ID=**********',
'WeathercloudCookieAgreed': 'true',
'_gat': '1',
'WEATHERCLOUD_RECENT_ED3C8': '*****************',
}
headers = {
'Connection': 'keep-alive',
'sec-ch-ua': '^\\^Google',
'Accept': 'application/json, text/javascript, */*; q=0.01',
'X-Requested-With': 'XMLHttpRequest',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36',
'sec-ch-ua-platform': '^\\^Windows^\\^',
'Sec-Fetch-Site': 'same-origin',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Dest': 'empty',
'Referer': 'https://app.weathercloud.net/d0838117883',
'Accept-Language': 'it-IT,it;q=0.9,en-US;q=0.8,en;q=0.7,es;q=0.6',
}
params = (
('code', '0838117883'),
('WEATHERCLOUD_CSRF_TOKEN', '****************'),
)
response = requests.get('https://app.weathercloud.net/device/stats', headers=headers, params=params, cookies=cookies)
# Serializing json
json_object = json.loads(response.text)
json Output:
{'last_update': 1632842172,
'bar_current': [1632842172, 1006.2],
'bar_day_max': [1632794772, 1013.4],
'bar_day_min': [1632845772, 1006.2],
'bar_month_max': [1632220572, 1028],
'bar_month_min': [1632715572, 997.3],
'bar_year_max': [1614418512, 1038.1],
'bar_year_min': [1615434432, 988.1],
'wdir_current': [1632842172, 180],
..............}
That's it.

Python POST to check a box on a webpage

I am trying to scrape a webpage that posts prices for the Mexico power market. The webpage has checkboxes that need to be checked for the file with prices to show up. Once I get the relevant box checked, I want to pull the links on the page and check if the particular file I am looking for is posted. I am getting stuck in the first part where I get the checkbox selected using requests.post. I used fiddler to track the changes when I post and passed those arguments in through requests.post.
I was expecting to be able to parse out all the 'href' links in the response but I didn't get any. Any help in redirecting me toward a solution would be greatly appreciated.
Below is the relevant portion of the code I am using:
data{
"ctl00$ContentPlaceHolder1$toolkit":"ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal",
"_EVENTTARGET": "ctl00$ContentPlaceHolder1$treePrincipal",
"__EVENTARGUMENT":{"commandName":"Check","index":"0:0:0:0"},
"__VIEWSTATE": "/verylongstringhere",
"__VIEWSTATEGENERATOR":"6B88769A",
"__EVENTVALIDATION":"/wEdAAPhpIpHlL5kdIfX6MRCtKcRwfFVx5pEsE3np13JV2opXVEvSNmVO1vU+umjph0Dtwe41EcPKcg0qvxOp6m6pWTIV4q0ZOXSBrDwJTrxjo3dZg==",
"ctl00_ContentPlaceHolder1_treePrincipal_ClientState":{"expandedNodes":[],"collapsedNodes":
[],"logEntries":[],"selectedNodes":[],"checkedNodes":["0","0:0","0:0:0","0:0:0:0"],"scrollPosition":0},
"ctl00_ContentPlaceHolder1_ListViewNodos_ClientState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_ClientState":"",
"ctl00$ContentPlaceHolder1$NotifAvisos$hiddenState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_XmlPanel_ClientState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_TitleMenu_ClientState":"",
"__ASYNCPOST":"true"
}
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'Content-Length': '26255',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': '_ga=GA1.3.1966843891.1571403663; _gid=GA1.3.1095695800.1571665852',
'Host': 'www.cenace.gob.mx',
'Origin': 'https://www.cenace.gob.mx',
'Referer': 'https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)
Chrome/77.0.3865.120 Safari/537.36',
'X-MicrosoftAjax': 'Delta=true',
'X-Requested-With': 'XMLHttpRequest'
}
url ="https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx"
r= requests.post(url,data=data, headers=headers, verify=False)
This is what Fiddler showed on the Post:enter image description here

Maybe you have incorrect __EVENTVALIDATION or __VIEWSTATE fields. You can get the initial page & scrape all the inputs with the initial values.
The following code grabs the input on the first requests, edit them like you did & then send the POST request scraping all the href values :
import requests
import json
from bs4 import BeautifulSoup
base_url = "https://www.cenace.gob.mx"
url = "{}/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx".format(base_url)
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
payload = dict([
(t['name'],t.get('value',''))
for t in soup.select("input")
if t.has_attr('name')
])
payload['ctl00$ContentPlaceHolder1$toolkit'] = 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal'
payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$treePrincipal'
payload['__ASYNCPOST'] = 'true'
payload['__EVENTARGUMENT']= json.dumps({
"commandName":"Check",
"index":"0:1:1:0"
})
payload['ctl00_ContentPlaceHolder1_treePrincipal_ClientState'] = json.dumps({
"expandedNodes":[], "collapsedNodes":[],
"logEntries":[], "selectedNodes":[],
"checkedNodes":["0","0:1","0:1:1","0:1:1:0"],
"scrollPosition":0
})
r = requests.post(url, data = payload, headers= {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"
})
soup = BeautifulSoup(r.text, "html.parser")
print([
"{}/{}".format(base_url, t["href"])
for t in soup.findAll('a')
if not t["href"].startswith('javascript')
])

Scraping with BS4 but HTML gets messed up when parsed

I'm having trouble scraping a website using BeautifulSoup4 and Python3. I'm using dryscrape to get the HTML since it requires JavaScript to be enabled in order to be shown (but as far as I know it's never used in the page itself).
This is my code:
from bs4 import BeautifulSoup
import dryscrape
productUrl = "https://www.mercadona.es/detall_producte.php?id=32009"
session = dryscrape.Session()
session.visit(productUrl)
response = session.body()
soup = BeautifulSoup(response, "lxml")
container1 = soup.find("div","contenido").find("dl").find_all("dt")
container3 = soup.find("div","contenido").find_all("td")
Now I want to read container3 content, but:
type(container3)
Returns:
bs4.element.ResultSet
which is the same as type(container1), but it's length it's 0!
So I wanted to know what was I getting to container3 before looking for my <td> tag, so I wrote it to a file.
container3 = soup.find("div","contenido")
soup_file.write(container3.prettify())
And, here is the link to that file: https://pastebin.com/xc22fefJ
It gets all messed up just before the table I want to scrape. I can't understand why, looking at the URL source code from Firefox everything looks fine.

Here's the updated solution:
url = 'https://www.mercadona.es/detall_producte.php?id=32009'
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Host': 'www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
s = requests.session()
r = s.get(url, headers = rh)
The response to this gives you the Please enable JavaScript to view the page content. message. However, it also contains the necessary hidden data sent by browser using javascript, which can be seen from the network tab of the developer's tools.
TS015fc057_id: 3
TS015fc057_cr: a57705c08e49ba7d51954bea1cc9bfce:jlnk:l8MH0eul:1700810263
TS015fc057_76: 0
TS015fc057_86: 0
TS015fc057_md: 1
TS015fc057_rf: 0
TS015fc057_ct: 0
TS015fc057_pd: 0
Of these, the second one (the long string) is generated by javascript. We can use a library like js2py to run the code, which will return the required string to be passed in the request.
soup = BeautifulSoup(r.content, 'lxml')
script = soup.find_all('script')[1].text
js_code = re.search(r'.*(function challenge.*crc;).*', script, re.DOTALL).groups()[0] + '} challenge();'
js_code = js_code.replace('document.forms[0].elements[1].value=', 'return ')
hidden_inputs = soup.find_all('input')
hidden_inputs[1]['value'] = js2py.eval_js(js_code)
fd = {i['name']: i['value'] for i in hidden_inputs}
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Referer': 'https://www.mercadona.es/detall_producte.php?id=32009',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Content-Length': '188',
'Content-Type': 'application/x-www-form-urlencoded',
'Cache-Control': 'max-age=0',
'Host': 'www.mercadona.es',
'Origin': 'https://www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
# NOTE: the next one is a POST request, as opposed to the GET request sent before
r = s.post(url, headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
And here's the result:
>>> len(soup.find('div', 'contenido').find_all('td'))
70
>>> len(soup.find('div', 'contenido').find('dl').find_all('dt'))
8
EDIT
Apparently, the javascript code needs to be run only once. The resulting data can be used for more than one request, like this:
for i in range(32007, 32011):
r = s.post(url[:-5] + str(i), headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
print(soup.find_all('dd')[1].text)
Result:
Manzana y plátano 120 g
Manzana y plátano 720g (6x120) g
Fresa Plátano 120 g
Fresa Plátano 720g (6x120g)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do I Webscrape a website that uses iframes? - python

Related

Scraping data through Api from json

Python requests POST gives different response content than browser

python requests not returning json data

Python POST to check a box on a webpage

Scraping with BS4 but HTML gets messed up when parsed

Categories

Resources