POST URL Encoded vs Line-based text data via Python Requests - python

I'm trying to scrape some data from a website and I can't get the POST to work, it acts as though I didn't give it the input data ("appnote").
When I examine the POST data it looks relatively the same except that the actual webform's POST is called "URL Encoded" and lists each form input, whereas mine is labeled "Line-based text data".
Here's my code; the content field (appnote) and the Search button (Search) are the most relevant pieces I need:
import requests
import cookielib
jar = cookielib.CookieJar()
url = 'http://www.vivotek.com/faq/'
headers = {'content-type': 'application/x-www-form-urlencoded'}
post_data = {#'__EVENTTARGET':'',
#'__EVENTARGUMENT':'',
'__LASTFOCUS':'',
'__VIEWSTATE':'',
'__VIEWSTATEGENERATOR':'',
'__VIEWSTATEENCRYPTED':'',
'__PREVIOUSPAGE':'',
'__EVENTVALIDATION':'',
'ctl00$HeaderUc1$LanguageDDLUc1$ddlLanguage':'en',
'ctl00$ContentPlaceHolder1$CategoryDDLUc1$DropDownList1':'-1',
'ctl00$ContentPlaceHolder1$ProductDDLUc1$DropDownList1':'-1',
'ctl00$ContentPlaceHolder1$Content':'appnote',
'ctl00$ContentPlaceHolder1$Search':'Search'
}
response = requests.get(url, cookies=jar)
response = requests.post(url, cookies=jar, data=post_data, headers=headers)
print(response.text)
Links to images of what I'm talking about in Wireshark:
Wireshark Form
Wireshark Line
I also tried it using wget with the same results.

The main problem is that you are not setting the important hidden field values, like __VIEWSTATE.
For this to work using requests, you need to parse the page html and get the appropriate input values.
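As an aside on the "URL Encoded" versus "Line-based text data" labels: Wireshark picks its dissector from the Content-Type it sees, and requests only form-encodes the body (and sets that header) for you when data= is given a dict or list of tuples. A minimal sketch of the two cases, using httpbin.org as a stand-in URL:
import requests

# Dict passed to data=: requests urlencodes the body and sets
# Content-Type: application/x-www-form-urlencoded, so Wireshark shows a form.
requests.post('http://httpbin.org/post', data={'q': 'appnote'})

# Raw string passed to data=: the body is sent as-is with no Content-Type,
# so Wireshark typically falls back to a line-based text view.
requests.post('http://httpbin.org/post', data='q=appnote')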
Here's the solution using BeautifulSoup HTML parser and requests:
from bs4 import BeautifulSoup
import requests
url = 'http://www.vivotek.com/faq/'
query = 'appnote'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}
session = requests.Session()
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
post_data = {'__EVENTTARGET':'',
'__EVENTARGUMENT':'',
'__LASTFOCUS':'',
'__VIEWSTATE': soup.find('input', id='__VIEWSTATE')['value'],
'__VIEWSTATEGENERATOR': soup.find('input', id='__VIEWSTATEGENERATOR')['value'],
'__VIEWSTATEENCRYPTED': '',
'__PREVIOUSPAGE': soup.find('input', id='__PREVIOUSPAGE')['value'],
'__EVENTVALIDATION': soup.find('input', id='__EVENTVALIDATION')['value'],
'ctl00$HeaderUc1$LanguageDDLUc1$ddlLanguage': 'en',
'ctl00$ContentPlaceHolder1$CategoryDDLUc1$DropDownList1': '-1',
'ctl00$ContentPlaceHolder1$ProductDDLUc1$DropDownList1': '-1',
'ctl00$ContentPlaceHolder1$Content': query,
'ctl00$ContentPlaceHolder1$Search': 'Search'
}
response = session.post(url, data=post_data, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
for item in soup.select('a#ArticleShowLink'):
    print(item.text.strip())
Prints the specific results for the appnote query:
How to troubleshoot when you can't watch video streaming?
Recording performance benchmarking tool
...

Related

Logging into Facebook with requests Python 2020

Hello, I would like to make a bot that automatically logs into Facebook and makes a post in a specific group. I think I will use selenium to make the post, which will be easy, so I am just asking for help with the first part. I have problems because some of the form data shown in the network tab of the developer tools is hidden and not displayed in the website's html, and I don't know how to find it. Here is my code so far:
import requests
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36'
}
data = {
'email': '----------',
'pass': '---------',
'timezone': '-60',
'locale': 'pl_PL',
'next': 'https://www.facebook.com/',
'login_source': 'login_bluebar',
'prefill_contact_point': '512250794',
'prefill_source': 'browser_onload',
'prefill_type': 'password',
'skstamp': 'eyJoYXNoIjoiYThiN2EyOTMwNTJhZTUzODg0YjZiNWNlOWQ1NzZjZjUiLCJoYXNoMiI6IjQ3ZWI4M2U1ZjVmYTQxMTQ4MDIxYWVlZTgzNTk3ZWJmIiwicm91bmRzIjo1LCJzZWVkIjoiYjU0NWE4MzczOTgwYTZhODViZjUzYmE3ZmM0OWIyOWYiLCJzZWVkMiI6IjdiNTU0NzBjM2M5NjlhMTY3YmZkZmIwZjE5ODlmNDdhIiwidGltZV90YWtlbiI6ODA3OTAsInN1cmZhY2UiOiJsb2dpbiJ9'
}
with requests.Session() as s:
    url = 'https://www.facebook.com/'
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    data['jazoest'] = soup.find('input', attrs={'name': 'jazoest'})['value']
    data['lsd'] = soup.find('input', attrs={'name': 'lsd'})['value']
    data['lgnrnd'] = soup.find('input', attrs={'name': 'lgnrnd'})['value']
    data['lgndim'] = soup.find('input', attrs={'name': 'lgndim'})['value']
    data['ab_test_data'] = soup.find('input', attrs={'name': 'ab_test_data'})['value']
    data['lgnjs'] = soup.find('input', attrs={'name': 'lgnjs'})['value']
    data['guid'] = soup.find('input', attrs={'name': 'guid'})['value']
    data['lgndim'] = soup.find('input', attrs={'name': 'lgndim'})['value']
    r = s.post(url, data=data, headers=headers)
    print(r.content)
I would be very happy if someone could help me with it. Is there a better way to do such things in 2020? Yes, I know there have been posts about logging into Facebook with requests and bs4, but they are from 2018 and I think Facebook has changed a lot since then; for example, some headers disappear or change their name each time I log in.
You can use the Facebook API, which is available at developers.facebook.com. Rather than using a third-party library, you could post directly to the group using the API (see here for more details).
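A hedged sketch of what that could look like with requests; the group id, token, and API version below are placeholders, and the token needs the relevant group permissions set up via developers.facebook.com (not tested against the current Graph API):
import requests

GROUP_ID = "1234567890"       # placeholder group id
ACCESS_TOKEN = "EAAB..."      # placeholder access token from developers.facebook.com

# Post a message to the group's feed edge of the Graph API.
resp = requests.post(
    f"https://graph.facebook.com/v6.0/{GROUP_ID}/feed",
    data={"message": "Hello from the API", "access_token": ACCESS_TOKEN},
)
print(resp.status_code, resp.json())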

Can't scrape names from next pages using requests

I'm trying to parse names spread across multiple pages of a website using a python script. With my current attempt I can get the names from its landing page. However, I can't figure out how to fetch the names from the next pages as well using requests and BeautifulSoup.
website link
My attempt so far:
import requests
from bs4 import BeautifulSoup
url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"
with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    for elem in soup.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)
I've tried to modify my script to get only the content from the second page, to make sure it is working when there is a next page button involved, but unfortunately it still fetches data from the first page:
import requests
from bs4 import BeautifulSoup
url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"
with requests.Session() as s:
    r = s.get(url)
    soup = BeautifulSoup(r.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}
    payload['__EVENTARGUMENT'] = 'Page$Next'
    payload.pop('btnClose')
    payload.pop('btnMapClose')
    res = s.post(url,data=payload,headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        'X-Requested-With':'XMLHttpRequest',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
    })
    sauce = BeautifulSoup(res.text,"lxml")
    for elem in sauce.select("table#gvContractors tr:has([id*='_lblName'])"):
        name = elem.select_one("span[id*='_lblName']").get_text(strip=True)
        print(name)
Navigating to the next page is performed via a POST request that carries the __VIEWSTATE cursor.
Here is how you can do it with requests:
Make a GET request to the first page;
Parse the required data and the __VIEWSTATE cursor;
Prepare a POST request for the next page with the received cursor;
Run it, then parse all the data and the new cursor for the next page.
I won't provide any code here, because it would amount to writing almost the entire crawler.
==== Added ====
You almost have it, but there are two important things you missed.
It is necessary to send headers with the first GET request. If no headers are sent, we get back broken tokens (easy to spot visually: they don't end with ==).
We need to add __ASYNCPOST to the payload we send. (Interestingly, it is not the boolean True but the string 'true'.)
Here's the code. I removed bs4 and switched to lxml (I don't like bs4; it is very slow). We know exactly which data we need to send, so let's parse only those few inputs.
import re
import requests
from lxml import etree

def get_nextpage_tokens(response_body):
    """ Parse tokens from XMLHttpRequest response for making next request to next page and create payload """
    try:
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATE'] = re.search(r'__VIEWSTATE\|([^\|]+)', response_body).group(1)
        payload['__VIEWSTATEGENERATOR'] = re.search(r'__VIEWSTATEGENERATOR\|([^\|]+)', response_body).group(1)
        payload['__EVENTVALIDATION'] = re.search(r'__EVENTVALIDATION\|([^\|]+)', response_body).group(1)
        payload['__ASYNCPOST'] = 'true'
        return payload
    except AttributeError:
        # One of the re.search calls found nothing - no tokens in this response
        return None

if __name__ == '__main__':
    url = "https://proximity.niceic.com/mainform.aspx?PostCode=YO95"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36',
        'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
        'Referer': 'https://proximity.niceic.com/mainform.aspx?PostCode=YO95',
    }
    with requests.Session() as s:
        page_num = 1
        r = s.get(url, headers=headers)
        parser = etree.HTMLParser()
        tree = etree.fromstring(r.text, parser)
        # Creating payload
        payload = dict()
        payload['ToolkitScriptManager1'] = 'UpdatePanel1|gvContractors'
        payload['__EVENTTARGET'] = 'gvContractors'
        payload['__EVENTARGUMENT'] = 'Page$Next'
        payload['__VIEWSTATE'] = tree.xpath("//input[@name='__VIEWSTATE']/@value")[0]
        payload['__VIEWSTATEENCRYPTED'] = ''
        payload['__VIEWSTATEGENERATOR'] = tree.xpath("//input[@name='__VIEWSTATEGENERATOR']/@value")[0]
        payload['__EVENTVALIDATION'] = tree.xpath("//input[@name='__EVENTVALIDATION']/@value")[0]
        payload['__ASYNCPOST'] = 'true'
        headers['X-Requested-With'] = 'XMLHttpRequest'
        while True:
            page_num += 1
            res = s.post(url, data=payload, headers=headers)
            print(f'page {page_num} data: {res.text}')  # FIXME: Parse data
            payload = get_nextpage_tokens(res.text)  # Creating payload for next page
            if not payload:
                # Break if we got no tokens - maybe it was the last page (it must be checked)
                break
Important
The response is not well-formed HTML, so you have to deal with that: cut out the table fragment or handle it some other way. Good luck!
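As a hedged illustration of that last point, the hypothetical helper below leans on lxml's forgiving HTML parser to chew through the pipe-delimited async body and pull out the spans whose ids contain _lblName (the same pattern the selectors above rely on):
from lxml import etree

def parse_names(response_body):
    # Hypothetical helper: the UpdatePanel HTML fragment is embedded in the
    # pipe-delimited async body, and the lenient HTML parser still finds it.
    tree = etree.fromstring(response_body, etree.HTMLParser())
    return [t.strip() for t in tree.xpath("//span[contains(@id, '_lblName')]/text()")]
Something like parse_names(res.text) inside the while loop could then replace the FIXME line above.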

requests-html not finding page element

So I'm trying to navigate to this url: https://www.instacart.com/store/wegmans/search_v3/horizon%201%25
and scrape data from the div with the class item-name item-row. There are two main problems though: the first is that instacart.com requires a login before you can get to that url, and the second is that most of the page is generated with javascript.
I believe I've solved the first problem because my session.post(...) gets a 200 response code. I'm also pretty sure that r.html.render() is supposed to solve the second problem by rendering the javascript-generated html before I scrape it. Unfortunately, the last line in my code only returns an empty list, despite the fact that selenium had no problem getting this element. Does anyone know why this isn't working?
from requests_html import HTMLSession
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
session = HTMLSession()
res1 = session.get('http://www.instacart.com', headers=headers)
soup = BeautifulSoup(res1.content, 'html.parser')
token = soup.find('meta', {'name': 'csrf-token'}).get('content')
data = {"user": {"email": "alexanderjbusch#gmail.com", "password": "password"},
"authenticity_token": token}
response = session.post('https://www.instacart.com/accounts/login', headers=headers, data=data)
print(response)
r = session.get("https://www.instacart.com/store/wegmans/search_v3/horizon%201%25", headers=headers)
r.html.render()
print(r.html.xpath("//div[@class='item-name item-row']"))
After logging in using the requests module and BeautifulSoup, you can make use of the link I've already suggested in the comment and parse the required data from the json it returns. The following script should get you the name, quantity, price and a link to each product. You can only get 21 products using the script below; there is a pagination option within this json content, and you can get all of the products by playing around with that pagination.
import json
import requests
from bs4 import BeautifulSoup
baseurl = 'https://www.instacart.com/store/'
data_url = "https://www.instacart.com/v3/retailers/159/module_data/dynamic_item_lists/cart_starters/storefront_canonical?origin_source_type=store_root_department&tracking.page_view_id=b974d56d-eaa4-4ce2-9474-ada4723fc7dc&source=web&cache_key=df535d-6863-f-1cd&per=30"
data = {"user": {"email": "alexanderjbusch#gmail.com", "password": "password"},
"authenticity_token": ""}
headers = {
'user-agent':'Mozilla/5.0',
'x-requested-with': 'XMLHttpRequest'
}
with requests.Session() as s:
    res = s.get('https://www.instacart.com/',headers={'user-agent':'Mozilla/5.0'})
    soup = BeautifulSoup(res.text, 'lxml')
    token = soup.select_one("[name='csrf-token']").get('content')
    data["authenticity_token"] = token
    s.post("https://www.instacart.com/accounts/login",json=data,headers=headers)
    resp = s.get(data_url, headers=headers)
    for item in resp.json()['module_data']['items']:
        name = item['name']
        quantity = item['size']
        price = item['pricing']['price']
        product_page = baseurl + item['click_action']['data']['container']['path']
        print(f'{name}\n{quantity}\n{price}\n{product_page}\n')
Partial output:
SB Whole Milk
1 gal
$3.90
https://www.instacart.com/store/items/item_147511418
Banana
At $0.69/lb
$0.26
https://www.instacart.com/store/items/item_147559922
Yellow Onion
At $1.14/lb
$0.82
https://www.instacart.com/store/items/item_147560764
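On the pagination point above, a heavily hedged sketch for poking at the json body before writing a loop; the 'pagination' key name is a guess and needs checking against the real payload (resp is the response object from the snippet above):
body = resp.json()
print(list(body.keys()))                   # look for a pagination block or cursor here
print(body.get('pagination'))              # guessed key name - may not exist
print(len(body['module_data']['items']))   # how many items this page actually returned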

Requests login into website only getting 403 error

I am trying to log into www.ebay-kleinanzeigen.de using the requests library, but every time I try to post my data (on the register page it's the same as on the login page) I get a 403 error.
Here is the code for the register function:
import requests
from bs4 import BeautifulSoup
session = requests.Session()
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
headers = {'user-agent': user_agent, 'Referer': 'https://www.ebay-kleinanzeigen.de'}
with requests.Session() as c:
    url = 'https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html'
    c.headers = headers
    hp = c.get(url, headers=headers)
    soup = BeautifulSoup(hp.content, 'html.parser')
    crsf = soup.find('input', {'name': '_csrf'})['value']
    print(crsf)
    payload = dict(email='test.email@emailzz1.de', password='test123', passwordConfirmation='test123',
                   _marketingOptIn='on', _crsf=crsf)
    page = c.post(url, data=payload, headers=headers)
    print(page.text)
    print(page.url)
    print(page.status_code)
Is the problem that I need some more headers? Aren't a user-agent and a referrer enough?
I have tried adding all requested headers, but then I am getting no response.
I have managed to create a script that successfully completes the register form you're trying to fill in, using the mechanicalsoup library. Note that you will have to manually check your email account for the message they send you to complete registration.
I realise this doesn't actually answer the question of why the requests-based POST returned a 403 Forbidden error, but it does complete your task without running into that error.
import mechanicalsoup
browser = mechanicalsoup.StatefulBrowser()
browser.open("https://www.ebay-kleinanzeigen.de/m-benutzer-anmeldung.html")
browser.select_form('#registration-form')
browser.get_current_form().print_summary()
browser["email"] = "mailuser#emailprovider.com"
browser["password"] = "testSO12345"
browser["passwordConfirmation"] = "testSO12345"
response = browser.submit_selected()
rsp_code = response.status_code
#print(response.text)
print("Response code:",rsp_code)
if(rsp_code == 200):
    print("Success! Opening a local debug copy of the page... (no CSS formatting)")
    browser.launch_browser()
else:
    print("Failure!")

Scraping AJAX page with requests

I would like to scrape the results of this booking flow.
By looking at the network tab I've found out that the data is retrieved with an AJAX GET at this URL:
https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&data=24/02/2019&portoP=3&portoA=5&form_url=ticket_s1_2
I've built the URL, passing the parameters as follows:
import urllib.parse
params = urllib.parse.urlencode({
'data': '24/02/2019',
'portoP': '3' ,
'portoA': '5',
'form_url': 'ticket_s1_2',
})
and make the request:
caremar_timetable_url = "https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&"
print(f"https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&{params}")
headers = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.3'}
res = requests.get(caremar_timetable_url,headers=headers, params=params)
soup = BeautifulSoup(res.text,'html.parser')
print(soup.text)
Output
https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&data=24%2F02%2F2019&portoP=7&portoA=1&form_url=ticket_s1_2
Non è stato possibile procedere con l'acquisto del biglietto online. Si prega di riprovare
The response is an error message from the site which says it can't complete the booking. If I copy and paste the URL I created in the browser I get an unstyled HTML page with the data I need.
Why is this and how can I overcome it?
Data seems to come back with requests
import requests
from bs4 import BeautifulSoup as bs
url = 'https://shop.caremar.it/main_acquista_1_corse_00_ajax.asp?l=it&data=27/02/2019&portoP=1&portoA=4&form_url=ticket_s1_2'
res = requests.get(url)
soup = bs(res.content, 'lxml')
print(soup.select_one('html'))
