Not able to replicate AJAX using Python Requests

I am trying to replicate an AJAX request from a web page (https://droughtmonitor.unl.edu/Data/DataTables.aspx). The AJAX call is triggered when values are selected from the dropdowns.
I am sending the following request with Python, but I am not able to see the same response as in the Network tab of the browser.
import bs4
import requests
import lxml

ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')

headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
req_data = {'area': '00064', 'statstype': '1'}
resp = ses.post(url, data=req_data, headers=headers_dict)
soup = bs4.BeautifulSoup(resp.content, 'lxml')
print(soup)

You need to add a couple of things to your request to get an answer from the server.
You need to convert your dict to JSON so it is sent as a string, not as form data.
You also need to declare the type of the request data by setting the request header Content-Type: application/json; charset=utf-8.
With those changes I was able to request the correct data.
import bs4
import json
import requests

ses = requests.Session()
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')

headers_dict = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
                'Content-Type': 'application/json; charset=utf-8'}
url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'

# Serialize the payload to a JSON string instead of sending it as form data.
req_data = json.dumps({'area': '00037', 'statstype': '1'})
resp = ses.post(url, data=req_data, headers=headers_dict)
soup = bs4.BeautifulSoup(resp.content, 'lxml')
print(soup)
Quite a tricky problem I must say.

From the requests documentation:
Instead of encoding the dict yourself, you can also pass it directly
using the json parameter (added in version 2.4.2) and it will be
encoded automatically:
>>> url = 'https://api.github.com/some/endpoint'
>>> payload = {'some': 'data'}
>>> r = requests.post(url, json=payload)
Then, to get the output, call r.json() and you will get the data you are looking for.
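Applied to the endpoint from the question, a minimal sketch could look like this (the payload values are the ones from the question; the exact shape of the returned JSON is an assumption, so inspect resp.json() before relying on any keys):
import requests

ses = requests.Session()
# Visit the page first so the session picks up any cookies it needs.
ses.get('https://droughtmonitor.unl.edu/Data/DataTables.aspx')

url = 'https://droughtmonitor.unl.edu/Ajax2018.aspx/ReturnTabularDMAreaPercent_urban'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

# json= serializes the dict and sets Content-Type: application/json automatically.
resp = ses.post(url, json={'area': '00037', 'statstype': '1'}, headers=headers)
print(resp.json())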

Related

Python Webscraping Vue Components

I am trying to extract the value marked in the picture into a variable, but it seems that when it is inside Vue components, bs4 does not search the way I expect. Can anyone point me in the general direction as to how I could extract the value from this document in Python?
Code is below the picture; thanks in advance.
import requests
from bs4 import BeautifulSoup

URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify())
div_list = soup.findAll({"class": 'value'})
print(div_list)
Since the page returns a JSON response, you don't need BeautifulSoup to parse it.
import requests
import json

URL = 'https://api.tracker.gg/api/v2/rocket-league/standard/profile/steam/76561198060134880'
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36'}

response = requests.get(URL, headers=headers)
# Parse the JSON body into a dict (response.json() would also work).
dict_of_response = json.loads(response.text)
obj_list = dict_of_response['data']['segments']
print(obj_list)
The obj_list variable now contains a list of dicts. Those dicts contain the data you want, so you only need to loop through the list and do whatever you want with the data.
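For example, a sketch of such a loop (the 'metadata' and 'stats' keys inside each segment are an assumption here; print one element of obj_list to see the real structure):
# Loop over the segments and pull out whatever fields you need.
for segment in obj_list:
    metadata = segment.get('metadata', {})
    stats = segment.get('stats', {})
    print(metadata.get('name'), '->', list(stats.keys()))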

Logging in to douban.com with Python displays b''

I want to log in to 'douban.com' with a Python requests session:
import requests

url = 'https://www.douban.com/'
logurl = 'https://accounts.douban.com/passport/login_popup'
data = {'username': 'abc@gmail.com',
        'password': 'abcdef'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}

se = requests.session()
request = se.post(logurl, data=data, headers=headers)
request1 = se.get(url)
print(request1.content)
This displays b'', and I can't tell whether the login worked or not.
You're getting an empty response, meaning the request is not working properly. You will want to debug the request further by looking into your response; check the Requests library documentation. request1.status_code and request1.headers might interest you.
The b'' is just the bytes literal prefix (see the Python 3 documentation on bytes literals): request1.content is a bytes object, and an empty one prints as b''.
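A minimal debugging sketch along those lines (the URLs and form fields are taken from the question; whether douban.com accepts this login data at all is a separate issue):
import requests

se = requests.session()
resp = se.post('https://accounts.douban.com/passport/login_popup',
               data={'username': 'abc@gmail.com', 'password': 'abcdef'},
               headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'})

# Inspect what actually came back before assuming the login worked.
print(resp.status_code)   # e.g. 200, 302, 403 ...
print(resp.headers)       # server headers, any cookies being set, etc.
print(resp.text[:500])    # decoded text instead of a raw bytes literal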

How to push keys (values) with requests module python3?

I'm trying to push a value into the search box of amazon.com.
I'm using requests rather than Selenium (which has a send-keys option).
I've identified the XPath of the search box, and now want to push a value into it, i.e. the char "a" or "apple" or any other string, and then collect the results.
However, when pushing the data with the requests post method I get an error.
Here is my code:
import requests
from lxml import html

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
page = requests.get('https://www.amazon.com/', headers=headers)
response_code = page.status_code
if response_code == 200:
    htmlText = page.text
    tree = html.fromstring(page.content)
    search_box = tree.xpath('//input[@id="twotabsearchtextbox"]')
    pushing_keys = requests.post(search_box, 'a')
    print(search_box)
However I get this error code:
requests.exceptions.MissingSchema: Invalid URL "[<InputElement 20b94374a98 name='field-keywords' type='text'>]": No schema supplied. Perhaps you meant http://[<InputElement 20b94374a98 name='field-keywords' type='text'>]?
How do I correctly push any char in the search box with requests?
Thanks
Try using this approach:
import requests

base_url = 'https://www.amazon.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
page = requests.get(base_url, headers=headers)
response_code = page.status_code
if response_code == 200:
    key_word_to_search = 'bean bags'
    # Amazon's search is a plain GET request with the keyword in the 'k' parameter.
    pushing_keys = requests.get(f'{base_url}/s/ref=nb_sb_noss', headers=headers, params={'k': key_word_to_search})
    print(pushing_keys.content)
The search box uses a GET request, so there is no form to POST to.
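As a quick sanity check, you can print the final URL of the response to confirm that requests encoded the keyword into the query string (this only shows the URL that was requested, nothing site-specific):
# requests percent-encodes the params dict into the query string.
print(pushing_keys.url)
# e.g. https://www.amazon.com/s/ref=nb_sb_noss?k=bean+bags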

Workaround for blocked GET requests in Python

I'm trying to retrieve and process the results of a web search using requests and beautifulsoup.
I've written some simple code to do the job, and it returns successfully (status = 200), but the content of the response is just an error message, "We're sorry for any inconvenience, but the site is currently unavailable.", and has been the same for the last several days. Searching within Firefox returns results without issue, however. I've run the code using a URL for the UK-based site and it works without issue, so I wonder if the US site is set up to block attempts to scrape web searches.
Are there ways to mask the fact that I'm retrieving search results from within Python (e.g. masquerading as a standard search within Firefox), or some other workaround to allow access to the search results?
Code included for reference below:
import pandas as pd
from requests import get
import bs4 as bs
import re

# works
# baseURL = 'https://www.autotrader.co.uk/car-search?sort=sponsored&radius=1500&postcode=ky119sb&onesearchad=Used&onesearchad=Nearly%20New&onesearchad=New&make=TOYOTA&model=VERSO&year-from=1990&year-to=2017&minimum-mileage=0&maximum-mileage=200000&body-type=MPV&fuel-type=Diesel&minimum-badge-engine-size=1.6&maximum-badge-engine-size=4.5&maximum-seats=8'
# doesn't work
baseURL = 'https://www.autotrader.com/cars-for-sale/Certified+Cars/cars+under+50000/Jeep/Grand+Cherokee/Seattle+WA-98101?extColorsSimple=BURGUNDY%2CRED%2CWHITE&maxMileage=45000&makeCodeList=JEEP&listingTypes=CERTIFIED%2CUSED&interiorColorsSimple=BEIGE%2CBROWN%2CBURGUNDY%2CTAN&searchRadius=0&modelCodeList=JEEPGRAND&trimCodeList=JEEPGRAND%7CSRT%2CJEEPGRAND%7CSRT8&zip=98101&maxPrice=50000&startYear=2015&marketExtension=true&sortBy=derivedpriceDESC&numRecords=25&firstRecord=0'

a = get(baseURL)
soup = bs.BeautifulSoup(a.content, 'html.parser')
info = soup.find_all('div', class_='information-container')
price = soup.find_all('div', class_='vehicle-price')

d = []
for idx, i in enumerate(info):
    ii = i.find_next('ul').find_all('li')
    year_ = ii[0].text
    miles = re.sub(r"[^0-9.]", "", ii[2].text)
    engine = ii[3].text
    hp = re.sub(r"[^\d.]", "", ii[4].text)
    p = re.sub(r"[^\d.]", "", price[idx].text)
    d.append([year_, miles, engine, hp, p])

df = pd.DataFrame(d, columns=['year', 'miles', 'engine', 'hp', 'price'])
By default, Requests sends a unique user agent when making requests.
>>> r = requests.get('https://google.com')
>>> r.request.headers
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
It is possible that the website you are using is trying to avoid scrapers by denying any request with a user agent of python-requests.
To get around this, you can change your user agent when sending a request. Since it's working on your browser, simply copy your browser user agent (you can Google it, or record a request to a webpage and copy your user agent like that). For me, it's Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36 (what a mouthful), so I'd set my user agent like this:
>>> headers = {
... 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
... }
and then send the request with the new headers (the new headers are added to the default headers, they don't replace them unless they have the same name):
>>> r = requests.get('https://google.com', headers=headers) # Using the custom headers we defined above
>>> r.request.headers
{'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
Now we can see that the request was sent with our preferred headers, and hopefully the site won't be able to tell the difference between Requests and a browser.
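Applied to the code in the question, the change is small. A sketch (same baseURL as above; whether autotrader.com then serves real results still depends on their bot detection):
from requests import get

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
# Same request as before, now identifying as a regular browser.
a = get(baseURL, headers=headers)
print(a.status_code, len(a.content))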

Python POST with different page numbers returns the same content

I use the following code to scrape the page "http://gs.amac.org.cn:10080/amac-infodisc/res/pof/manager/index.html". I post the data in JSON format, and it responds fine with JSON content. The strange thing is that it always responds with the same content, no matter what "page" number or "size" is sent. Anyone interested in this question can try changing the "page" number in "postdata" and see that the same "id" is returned.
import urllib2
import urllib
import json
import random
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.10551.6 Safari/537.36",
"Content-Type": "application/json"}
# http://gs.amac.org.cn:10080/amac-infodisc/res/pof/manager/index.html
# change the "page" number here, response the same "id"
postdata = {"rand":random.random(),"page":10,"size":20}
url = "http://gs.amac.org.cn:10080/amac-infodisc/api/pof/manager"
postdata = json.dumps(postdata)
req = urllib2.Request(url,data=postdata,headers=headers)
response = json.load(urllib2.urlopen(req,timeout=30))
print response["content"][0]["id"]
The problem is that the page's arguments are not sent as POST data, but rather as query-string arguments. Changing the argument type fixes the problem:
import requests
import random

page = 1
size = 20
rand = random.random()
url = 'http://gs.amac.org.cn:10080/amac-infodisc/api/pof/manager?rand={}&page={}&size={}'.format(
    rand, page, size
)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.10551.6 Safari/537.36",
    "Content-Type": "application/json"
}
print(requests.post(url, json={}, headers=headers).json()['content'][0]['id'])
This prints 101000000409 (101000000138 for page 0).
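As a side note, instead of formatting the query string by hand, requests can build it from a dict via the params argument; this is standard requests behaviour, not specific to this site:
import requests
import random

url = 'http://gs.amac.org.cn:10080/amac-infodisc/api/pof/manager'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.10551.6 Safari/537.36",
           "Content-Type": "application/json"}

# The query string is built and percent-encoded from the params dict;
# the body is still an empty JSON object, as in the answer above.
params = {'rand': random.random(), 'page': 1, 'size': 20}
resp = requests.post(url, params=params, json={}, headers=headers)
print(resp.json()['content'][0]['id'])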
