I am attempting to automate the process of passing text from a website to a tool in order to get the estimated reading level of the text. However, when I pass the URL-encoded text via a POST request, I get a 400 Bad Request error.
article = 'The quick brown fox jumps over the lazy dog.'
headers = ({'Host': 'auto-ilr.ll.mit.edu',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'DNT': '1',
'Referer': 'https://auto-ilr.ll.mit.edu/instant/',
'Connection': 'keep-alive'})
s = requests.Session()
#s.mount('https://', SSLAdapter())
s.mount('https://', MyAdapter())
try:
postdata = urllib.parse.urlencode({'Language': 'English', 'Text': article})
soup = s.post('https://auto-ilr.ll.mit.edu/instant/summary3', data=postdata, headers = headers, verify=False)
I'm not sure what the difference is, but in a few cases the request has gone through and the final soup variable ended up with the text from the site. Even then, that text showed that the site had not correctly processed the text I included.
You are missing something simple: you don't have to encode the data, requests does it for you:
article = 'The quick brown fox jumps over the lazy dog.'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Firefox/45.0',
'Referer': 'https://auto-ilr.ll.mit.edu/instant/'
}
postdata = {'Language': 'English', 'Text': article}
s = requests.Session()
soup = s.post('https://auto-ilr.ll.mit.edu/instant/summary3', data=postdata, headers = headers, verify=False)
print(soup.status_code)
Also, you don't have to send all the headers; sometimes just 'User-Agent' or 'Referer' is enough.
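To see the difference between the two calls without contacting the server, the requests can be prepared offline. This is a minimal sketch, assuming only the URL and form fields from the question; the shortened string body is an illustration. As far as I can tell, requests only adds the form Content-Type header when data is a dict (or list of pairs), which is one possible reason the pre-encoded string was rejected.

```python
import requests

article = 'The quick brown fox jumps over the lazy dog.'
url = 'https://auto-ilr.ll.mit.edu/instant/summary3'

# Passing a dict: requests form-encodes it and sets the Content-Type header.
from_dict = requests.Request(
    'POST', url, data={'Language': 'English', 'Text': article}
).prepare()

# Passing a pre-encoded string: the body is sent as-is and no
# Content-Type header is added on our behalf.
from_str = requests.Request(
    'POST', url, data='Language=English&Text=The+quick+brown+fox'
).prepare()

print(from_dict.body)
print(from_dict.headers.get('Content-Type'))
print(from_str.headers.get('Content-Type'))
```

Nothing is sent over the network here; prepare() only builds the request that a session would send.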
I'm writing a Python script to scrape a table from this site (this is public information about ocean tide levels).
One of the stations I'd like to scrape is Punta del Este, code 83.0, on any given day. But my script returns a different table than the browser does, even when the POST request seems to have the same input.
When I fill the form in my browser, the headers and data sent to the server are these:
So I wrote my script to make a POST request as follows:
url = 'https://www.ambiente.gub.uy/SIH-JSF/paginas/sdh/consultaHDMCApublic.xhtml'
s = requests.Session()
r = s.get(url, verify=False)
soupGet = BeautifulSoup(r.content, 'lxml')
#JSESSIONID = s.cookies['JSESSIONID']
javax_faces_ViewState = soupGet.find("input", {"type": "hidden", "name":"javax.faces.ViewState"})['value']
headersSih = {
'Accept': 'application/xml, text/xml, */*; q=0.01',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'es-ES,es;q=0.6',
'Connection': 'keep-alive',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
# 'Cookie': 'JSESSIONID=FBE5ZdMQVFrgQ-P6K_yTc1bw.dinaguasihappproduccion',
'Faces-Request': 'partial/ajax',
'Origin': 'https://www.ambiente.gub.uy',
'Referer': url,
'Sec-Fetch-Dest': 'empty',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'Sec-GPC': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
}
ini_date = datetime.strftime(fecha0 , '%d/%m/%Y %H:%M')
end_date = datetime.strftime(fecha0 + timedelta(days=1), '%d/%m/%Y %H:%M')
codigo = 830
dataSih = {
'javax.faces.partial.ajax': 'true',
'javax.faces.source': 'formConsultaHorario:j_idt64',
'javax.faces.partial.execute': '#all',
'javax.faces.partial.render': 'formConsultaHorario:pnlhorarioConsulta',
'formConsultaHorario:j_idt64': 'formConsultaHorario:j_idt64',
'formConsultaHorario': 'formConsultaHorario',
'formConsultaHorario:estacion_focus': '',
'formConsultaHorario:estacion_input': codigo,
'formConsultaHorario:fechaDesde_input': ini_date,
'formConsultaHorario:fechaHasta_input': end_date,
'formConsultaHorario:variables_focus': '',
'formConsultaHorario:variables_input': '26', # Variable: H,Nivel
'formConsultaHorario:fcal_focus': '',
'formConsultaHorario:fcal_input': '7', # Tipo calculo: Ingresado
'formConsultaHorario:ptiempo_focus': '',
'formConsultaHorario:ptiempo_input': '2', #Paso de tiempo: Escala horaria
'javax.faces.ViewState': javax_faces_ViewState,
}
page = s.post(url, headers=headersSih, data=dataSih)
However, when I do it via the browser I get a table full of data, while the Python request returns (in page.text) a table saying "No data was found".
Is there something I'm missing? I've tried changing a lot of things, but nothing seems to do the trick.
Maybe JavaScript loads the data on this website, and requests doesn't run JavaScript. If you want to get the data from there, use Selenium.
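Before switching tools, it may also be worth comparing the form data the browser sent with what the script sends, field by field. The sketch below is only an illustration: diff_payloads is a hypothetical helper, and the two sample dicts are invented values, not the real captured request.

```python
def diff_payloads(browser, script):
    """Return {field: (browser_value, script_value)} for every mismatch."""
    missing = object()
    diffs = {}
    for key in sorted(set(browser) | set(script)):
        b = browser.get(key, missing)
        s = script.get(key, missing)
        # Form values travel as text, so compare their string forms.
        if b is missing or s is missing or str(b) != str(s):
            diffs[key] = (
                None if b is missing else b,
                None if s is missing else s,
            )
    return diffs


# Invented sample values, for illustration only.
browser_data = {
    'formConsultaHorario:estacion_input': '83.0',
    'formConsultaHorario:variables_input': '26',
}
script_data = {
    'formConsultaHorario:estacion_input': 830,
    'formConsultaHorario:variables_input': '26',
}

print(diff_payloads(browser_data, script_data))
```

Any field that shows up in the result is a candidate for why the server answers differently.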
I want to scrape data from the shipping page of our company's logistics portal, which is based on ASP. I watched a lot of tutorials on the Internet about the BeautifulSoup and Requests libraries, but it isn't working as expected for me.
The login url is:
https://portal-vesta.sequoialog.com.br/tms/LoginPortal.aspx
I wrote the code as a bash script first, and my login attempt worked, returning this message:
69|dataItem||<script type="text/javascript">window.location="about:blank"</script>|32|pageRedirect||/tms/HomePortal.aspx?Usu=1070rpr|
But in Python, it returns this message:
b'69|dataItem||<script type="text/javascript">window.location="about:blank"</script>|21|pageRedirect||/TMS/LoginPortal.aspx|'
My code is:
with Session() as s:
page = s.get(urls[0])
get_ev_vs(page)
payload = "Ajax=UpdatePanel1%7CButton1&Button1=ENTRAR&__ASYNCPOST=true&__EVENTARGUMENT=&__EVENTTARGET=&__EVENTVALIDATION={}&__LASTFOCUS=&__VIEWSTATE={}&txtSenha={}&txtUsuario={}&txtUsuarioSolicita=".format(ev, vs, password, user)
head = {
'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
'Cache-Control': 'no-cache',
'X-MicrosoftAjax': 'Delta=true',
'sec-ch-ua-mobile': '?0',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Accept': '*/*'
}
response = s.post(urls[0], data=payload, headers=head)
payload = {}
headers = {
'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
'sec-ch-ua-mobile': '?0',
'Upgrade-Insecure-Requests': '1',
'DNT': '1',
'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Cookie': 'ASP.NET_SessionId=' + s.cookies["ASP.NET_SessionId"]
}
open_page = s.get(urls[1], data=payload, headers=headers)
print(open_page.text)
def get_ev_vs(page):
soup = BeautifulSoup(page.text, 'html.parser')
global vs, ev
vs = soup.select_one('#__VIEWSTATE')['value']
ev = soup.select_one('#__EVENTVALIDATION')['value']
import requests
from bs4 import BeautifulSoup
import re
from urllib.parse import urljoin
target = ['__VIEWSTATE', '__EVENTVALIDATION']
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0",
}
def main(url):
with requests.Session() as req:
req.headers.update(headers)
r = req.get(url)
soup = BeautifulSoup(r.text, 'lxml')
values = [soup.select_one('#{}'.format(x))['value'] for x in target]
data = {
"Ajax": "UpdatePanel1|Button1",
"__LASTFOCUS": "",
"__EVENTTARGET": "",
"__EVENTARGUMENT": "",
"__VIEWSTATE": values[0],
"__EVENTVALIDATION": values[1],
"txtUsuario": "aaa", #username should be here
"txtSenha": "aaa", #password too!
"txtUsuarioSolicita": "",
"__ASYNCPOST": "true",
"Button1": "ENTRAR"
}
r = req.post(url, data=data)
match = re.search(r'ScriptPath\|([^|]+db)\|', r.text).group(1)
final = urljoin(url, match)
r = req.get(final)
print(r.text)
main('https://portal-vesta.sequoialog.com.br/tms/LoginPortal.aspx')
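The two responses quoted in the question appear to follow a length|type|id|content| layout: the script block is 69 characters long and the redirect paths are 32 and 21 characters. Under that assumption, a small parser makes it easy to check where the server is redirecting after the POST. parse_delta and redirect_target are hypothetical helper names, and this is a sketch rather than a full implementation of the format.

```python
def parse_delta(text):
    """Split a delta response into (type, id, content) tuples."""
    parts = []
    pos = 0
    while pos < len(text):
        fields = []
        # Read the three '|'-terminated header fields: length, type, id.
        for _ in range(3):
            bar = text.index('|', pos)
            fields.append(text[pos:bar])
            pos = bar + 1
        length = int(fields[0])
        # The content may itself contain '|', so slice by length.
        content = text[pos:pos + length]
        pos += length + 1  # skip the content and its trailing '|'
        parts.append((fields[1], fields[2], content))
    return parts


def redirect_target(text):
    """Return the pageRedirect path, or None if there is none."""
    for kind, _ident, content in parse_delta(text):
        if kind == 'pageRedirect':
            return content
    return None


failed = '69|dataItem||<script type="text/javascript">window.location="about:blank"</script>|21|pageRedirect||/TMS/LoginPortal.aspx|'
print(redirect_target(failed))
```

A redirect back to LoginPortal.aspx suggests the login was rejected, while the bash response in the question redirects to HomePortal.aspx.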
I'm trying to submit the query form on every ad posted on https://immoweb.be using requests. Whenever a request is sent, a unique x-xsrf-token is generated and attached to the headers of the POST request. So, every few hours, a 419 error occurs because the token has expired. I'm using https://curl.trillworks.com/ to create the headers and payload for the Python script.
import requests
header1 = {
'authority': 'www.immoweb.be',
'accept': 'application/json, text/plain, */*',
'x-xsrf-token': 'value here',
'x-requested-with': 'XMLHttpRequest',
'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.128 Safari/537.36',
'content-type': 'application/json;charset=UTF-8',
'origin': 'https://www.immoweb.be',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.immoweb.be/en/classified/apartment/for-sale/merksem/2170/9284079?searchId=607a09e4e4532',
'accept-language': 'en-US,en;q=0.9',
'cookie': 'Cookie data here',
}
data1 = '{"firstName":"Name here","lastName":"Name here","email":"myemail#gmail.com","classifiedId":9284079,"customerIds":[add here],"phone":"","message":"I am interested in your property. Can you give me some more information in order to plan a possible visit? Thank you.","isUnloggedUserInfoRemembered":false,"sendMeACopy":false}'
response = requests.post('https://www.immoweb.be/nl/email/request', headers=header1, data=data1)
print (response.status_code)
print (response.cookies)
You can get a new X-XSRF-TOKEN from the cookie, for example:
import requests
from urllib.parse import unquote
url = "https://www.immoweb.be"
headers = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
json_data = {
"classifiedId": 9234221, # <-- change this
"customerIds": [305309], # <-- change this
"email": "sample#example.com",
"firstName": "xxx",
"isUnloggedUserInfoRemembered": False,
"lastName": "xxx",
"message": "I am interested in your property. Can you give me some more information in order to plan a possible visit? Thank you.",
"phone": "",
"sendMeACopy": False,
}
with requests.Session() as s:
# this is only for getting first "XSRF-TOKEN"
s.get(url)
token = unquote(s.cookies["XSRF-TOKEN"])
# use this token to post request:
headers["X-Requested-With"] = "XMLHttpRequest"
headers["X-XSRF-TOKEN"] = token
status = s.post(
"https://www.immoweb.be/en/email/request",
headers=headers,
json=json_data,
).json()
print(status)
# new XSRF-TOKEN is here:
token = unquote(s.cookies["XSRF-TOKEN"])
headers["X-XSRF-TOKEN"] = token
# repeat
# ...
Prints:
{'success': True}
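The token refresh step can be wrapped in a small helper so it is repeated before every POST. This is a sketch: xsrf_headers is a hypothetical name, and the cookie value below is made up. It accepts anything that supports key lookup, so it should work with s.cookies as well as a plain dict.

```python
from urllib.parse import unquote


def xsrf_headers(cookies, base_headers=None):
    """Build request headers carrying the current XSRF token."""
    headers = dict(base_headers or {})
    headers["X-Requested-With"] = "XMLHttpRequest"
    # The cookie value is percent-encoded, so decode it before
    # sending it back in the header.
    headers["X-XSRF-TOKEN"] = unquote(cookies["XSRF-TOKEN"])
    return headers


# Made-up cookie value, for illustration only.
sample_cookies = {"XSRF-TOKEN": "eyJpdiI6IkFCQyJ9%3D%3D"}
print(xsrf_headers(sample_cookies, {"User-Agent": "example"}))
```

Calling it again after each response picks up whatever token the server last set in the cookie.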
I am trying to scrape a webpage that posts prices for the Mexico power market. The webpage has checkboxes that need to be checked for the file with prices to show up. Once I get the relevant box checked, I want to pull the links on the page and check if the particular file I am looking for is posted. I am getting stuck in the first part where I get the checkbox selected using requests.post. I used fiddler to track the changes when I post and passed those arguments in through requests.post.
I was expecting to be able to parse out all the 'href' links in the response but I didn't get any. Any help in redirecting me toward a solution would be greatly appreciated.
Below is the relevant portion of the code I am using:
data = {
"ctl00$ContentPlaceHolder1$toolkit":"ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal",
"__EVENTTARGET": "ctl00$ContentPlaceHolder1$treePrincipal",
"__EVENTARGUMENT":{"commandName":"Check","index":"0:0:0:0"},
"__VIEWSTATE": "/verylongstringhere",
"__VIEWSTATEGENERATOR":"6B88769A",
"__EVENTVALIDATION":"/wEdAAPhpIpHlL5kdIfX6MRCtKcRwfFVx5pEsE3np13JV2opXVEvSNmVO1vU+umjph0Dtwe41EcPKcg0qvxOp6m6pWTIV4q0ZOXSBrDwJTrxjo3dZg==",
"ctl00_ContentPlaceHolder1_treePrincipal_ClientState":{"expandedNodes":[],"collapsedNodes":
[],"logEntries":[],"selectedNodes":[],"checkedNodes":["0","0:0","0:0:0","0:0:0:0"],"scrollPosition":0},
"ctl00_ContentPlaceHolder1_ListViewNodos_ClientState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_ClientState":"",
"ctl00$ContentPlaceHolder1$NotifAvisos$hiddenState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_XmlPanel_ClientState":"",
"ctl00_ContentPlaceHolder1_NotifAvisos_TitleMenu_ClientState":"",
"__ASYNCPOST":"true"
}
headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'Content-Length': '26255',
'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie': '_ga=GA1.3.1966843891.1571403663; _gid=GA1.3.1095695800.1571665852',
'Host': 'www.cenace.gob.mx',
'Origin': 'https://www.cenace.gob.mx',
'Referer': 'https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx',
'Sec-Fetch-Mode': 'cors',
'Sec-Fetch-Site': 'same-origin',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
'X-MicrosoftAjax': 'Delta=true',
'X-Requested-With': 'XMLHttpRequest'
}
url ="https://www.cenace.gob.mx/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx"
r= requests.post(url,data=data, headers=headers, verify=False)
This is what Fiddler showed on the POST: [screenshot not available]
Maybe your __EVENTVALIDATION or __VIEWSTATE fields are incorrect. You can get the initial page and scrape all the inputs with their initial values.
The following code grabs the inputs from the first request, edits them as you did, then sends the POST request and scrapes all the href values:
import requests
import json
from bs4 import BeautifulSoup
base_url = "https://www.cenace.gob.mx"
url = "{}/SIM/VISTA/REPORTES/PreEnergiaSisMEM.aspx".format(base_url)
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
payload = dict([
(t['name'],t.get('value',''))
for t in soup.select("input")
if t.has_attr('name')
])
payload['ctl00$ContentPlaceHolder1$toolkit'] = 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$treePrincipal'
payload['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$treePrincipal'
payload['__ASYNCPOST'] = 'true'
payload['__EVENTARGUMENT']= json.dumps({
"commandName":"Check",
"index":"0:1:1:0"
})
payload['ctl00_ContentPlaceHolder1_treePrincipal_ClientState'] = json.dumps({
"expandedNodes":[], "collapsedNodes":[],
"logEntries":[], "selectedNodes":[],
"checkedNodes":["0","0:1","0:1:1","0:1:1:0"],
"scrollPosition":0
})
r = requests.post(url, data = payload, headers= {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"
})
soup = BeautifulSoup(r.text, "html.parser")
print([
"{}/{}".format(base_url, t["href"])
for t in soup.findAll('a')
if not t["href"].startswith('javascript')
])
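One detail of the code above is that the nested values are passed through json.dumps before being posted. The sketch below uses only the standard library to show why that matters: with doseq=True, urlencode treats a dict value as a sequence of its keys. As far as I know, requests flattens iterable form values in a similar way, but treat that as an assumption.

```python
import json
from urllib.parse import parse_qs, urlencode

event_argument = {"commandName": "Check", "index": "0:1:1:0"}

# A nested dict is iterated as a sequence of its keys, so the values are lost.
raw = urlencode({"__EVENTARGUMENT": event_argument}, doseq=True)

# Serializing first keeps the whole object as one JSON string.
serialized = urlencode(
    {"__EVENTARGUMENT": json.dumps(event_argument)}, doseq=True
)

print(raw)
print(serialized)

# Decoding the form body gives back the original object.
decoded = json.loads(parse_qs(serialized)["__EVENTARGUMENT"][0])
print(decoded == event_argument)
```

The same reasoning applies to the ClientState field, which is also a JSON object on the wire.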
I'm having trouble scraping a website using BeautifulSoup4 and Python3. I'm using dryscrape to get the HTML since it requires JavaScript to be enabled in order to be shown (but as far as I know it's never used in the page itself).
This is my code:
from bs4 import BeautifulSoup
import dryscrape
productUrl = "https://www.mercadona.es/detall_producte.php?id=32009"
session = dryscrape.Session()
session.visit(productUrl)
response = session.body()
soup = BeautifulSoup(response, "lxml")
container1 = soup.find("div","contenido").find("dl").find_all("dt")
container3 = soup.find("div","contenido").find_all("td")
Now I want to read container3 content, but:
type(container3)
Returns:
bs4.element.ResultSet
which is the same as type(container1), but its length is 0!
So I wanted to know what container3 contained before looking for my <td> tags, so I wrote it to a file.
container3 = soup.find("div","contenido")
soup_file.write(container3.prettify())
And, here is the link to that file: https://pastebin.com/xc22fefJ
It gets all messed up just before the table I want to scrape. I can't understand why; looking at the page source in Firefox, everything looks fine.
Here's the updated solution:
import re
import js2py
import requests
from bs4 import BeautifulSoup

url = 'https://www.mercadona.es/detall_producte.php?id=32009'
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Host': 'www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
s = requests.session()
r = s.get(url, headers = rh)
The response to this gives you the "Please enable JavaScript to view the page content." message. However, it also contains the hidden data that the browser sends using JavaScript, which can be seen in the network tab of the developer tools.
TS015fc057_id: 3
TS015fc057_cr: a57705c08e49ba7d51954bea1cc9bfce:jlnk:l8MH0eul:1700810263
TS015fc057_76: 0
TS015fc057_86: 0
TS015fc057_md: 1
TS015fc057_rf: 0
TS015fc057_ct: 0
TS015fc057_pd: 0
Of these, the second one (the long string) is generated by javascript. We can use a library like js2py to run the code, which will return the required string to be passed in the request.
soup = BeautifulSoup(r.content, 'lxml')
script = soup.find_all('script')[1].text
js_code = re.search(r'.*(function challenge.*crc;).*', script, re.DOTALL).groups()[0] + '} challenge();'
js_code = js_code.replace('document.forms[0].elements[1].value=', 'return ')
hidden_inputs = soup.find_all('input')
hidden_inputs[1]['value'] = js2py.eval_js(js_code)
fd = {i['name']: i['value'] for i in hidden_inputs}
rh = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
'Referer': 'https://www.mercadona.es/detall_producte.php?id=32009',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.9',
'Connection': 'keep-alive',
'Content-Length': '188',
'Content-Type': 'application/x-www-form-urlencoded',
'Cache-Control': 'max-age=0',
'Host': 'www.mercadona.es',
'Origin': 'https://www.mercadona.es',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
# NOTE: the next one is a POST request, as opposed to the GET request sent before
r = s.post(url, headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
And here's the result:
>>> len(soup.find('div', 'contenido').find_all('td'))
70
>>> len(soup.find('div', 'contenido').find('dl').find_all('dt'))
8
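To make the string rewriting step easier to follow, here is the same regex and replacement applied to a stand-in script. The sample_script text is invented for illustration and is not the page's real code; nothing here runs the JavaScript, it only shows how the snippet is turned into a function that returns the value instead of writing it into the form.

```python
import re

# Invented stand-in for the challenge script, for illustration only.
sample_script = (
    'var a = 1;\n'
    'function challenge() {\n'
    '  var crc = "abc123";\n'
    '  document.forms[0].elements[1].value=crc;\n'
    '}\n'
    'window.onload = challenge;'
)

# Capture from "function challenge" up to the last "crc;".
body = re.search(
    r'.*(function challenge.*crc;).*', sample_script, re.DOTALL
).groups()[0]

# Close the function and call it, then return the value
# rather than assigning it to the form field.
js_code = body + '} challenge();'
js_code = js_code.replace('document.forms[0].elements[1].value=', 'return ')

print(js_code)
```

The resulting string is what gets handed to js2py.eval_js in the answer's code.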
EDIT
Apparently, the javascript code needs to be run only once. The resulting data can be used for more than one request, like this:
for i in range(32007, 32011):
r = s.post(url[:-5] + str(i), headers = rh, data = fd)
soup = BeautifulSoup(r.content, 'lxml')
print(soup.find_all('dd')[1].text)
Result:
Manzana y plátano 120 g
Manzana y plátano 720g (6x120) g
Fresa Plátano 120 g
Fresa Plátano 720g (6x120g)
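The slice url[:-5] works because the id in the URL is exactly five characters long. A sketch that rebuilds the query string instead, so ids of any length work, could look like this; with_product_id is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_product_id(url, product_id):
    """Return url with its 'id' query parameter set to product_id."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["id"] = [str(product_id)]
    new_query = urlencode(query, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


url = 'https://www.mercadona.es/detall_producte.php?id=32009'
for i in range(32007, 32011):
    print(with_product_id(url, i))
```

Each generated URL can then be passed to s.post in place of url[:-5] + str(i).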