When I run the code below, it returns an empty string:
import requests

url = 'http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=5&columns[0][data]=box_art&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=title&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=year&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=true&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=genre&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=rating&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][regex]=false&columns[5][data]=available&columns[5][name]=&columns[5][searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][regex]=false&columns[6][data]=director&columns[6][name]=&columns[6][searchable]=true&columns[6][orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][data]=cast&columns[7][name]=&columns[7][searchable]=true&columns[7][orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&order[0][column]=5&order[0][dir]=desc&start=0&length=25&search[value]=sherlock&search[regex]=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1451768717982'
print(requests.get(url).text)
but if I put the URL in my browser, it shows me the JSON. I did notice while debugging that the browser must have the Tamper Data plugin installed to view the JSON; without the plugin, a blank page appears. So my theory is that it has something to do with how the HTTP request is being handled, but I'm not sure where to go from here.
Any help would be great.
You need to open up a session, visit the main page to get the cookies set and then make the XHR request to "processing_us.php":
url = "http://www.allflicks.net/wp-content/themes/responsive/processing/processing_us.php?draw=5&columns[0][data]=box_art&columns[0][name]=&columns[0][searchable]=true&columns[0][orderable]=false&columns[0][search][value]=&columns[0][search][regex]=false&columns[1][data]=title&columns[1][name]=&columns[1][searchable]=true&columns[1][orderable]=true&columns[1][search][value]=&columns[1][search][regex]=false&columns[2][data]=year&columns[2][name]=&columns[2][searchable]=true&columns[2][orderable]=true&columns[2][search][value]=&columns[2][search][regex]=false&columns[3][data]=genre&columns[3][name]=&columns[3][searchable]=true&columns[3][orderable]=true&columns[3][search][value]=&columns[3][search][regex]=false&columns[4][data]=rating&columns[4][name]=&columns[4][searchable]=true&columns[4][orderable]=true&columns[4][search][value]=&columns[4][search][regex]=false&columns[5][data]=available&columns[5][name]=&columns[5][searchable]=true&columns[5][orderable]=true&columns[5][search][value]=&columns[5][search][regex]=false&columns[6][data]=director&columns[6][name]=&columns[6][searchable]=true&columns[6][orderable]=true&columns[6][search][value]=&columns[6][search][regex]=false&columns[7][data]=cast&columns[7][name]=&columns[7][searchable]=true&columns[7][orderable]=true&columns[7][search][value]=&columns[7][search][regex]=false&order[0][column]=5&order[0][dir]=desc&start=0&length=25&search[value]=sherlock&search[regex]=false&movies=true&shows=true&documentaries=true&rating=netflix&_=1451768717982"
with requests.Session() as session:
session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}
session.get("http://www.allflicks.net/")
response = session.get(url, headers={"Accept" : "application/json, text/javascript, */*; q=0.01",
"X-Requested-With": "XMLHttpRequest",
"Referer": "http://www.allflicks.net/",
"Host": "www.allflicks.net"})
print(response.json())
I understand what the Location header does in an HTTP response. Accessing a site with Chrome shows Location among the response headers, but accessing it with Python requests does not return that header:
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36',
    'accept': '*/*',
    'accept-language': 'en-US,en;q=0.9,ru-RU;q=0.8,ru;q=0.7,uk;q=0.6,en-GB;q=0.5',
}
response = requests.get('https://ec.ef.com.cn/partner/englishcenters', headers=headers)
print(response.headers)
Does this matter for Scrapy, and how do I get that info? I'm guessing it might be a signal the site could use for anti-scraping.
What you see in your screenshot is a response with HTTP code 302, which usually makes clients (including Python Requests) automatically redirect to the URL specified in the Location header.

If you enter the URL you shared (https://ec.ef.com.cn/partner/englishcenters) in your browser, you'll see that you get redirected to another URL. The same behaviour occurs in your Python code: if you print response.url, it should show the URL you've been redirected to.
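If you want to inspect the original 302 and its Location header from Python, you can tell requests not to follow the redirect. A minimal sketch, using the URL from the question:

import requests

# allow_redirects=False returns the raw 302 instead of following it,
# so the Location header is visible on the response itself.
response = requests.get('https://ec.ef.com.cn/partner/englishcenters',
                        allow_redirects=False)

print(response.status_code)              # expect 302
print(response.headers.get('Location'))  # the URL the browser would be sent to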
I am trying to scrape Crunchbase to find the total funding amount for certain companies. Here is a link to an example.

At first I tried just using Beautiful Soup, but I keep getting an error saying:

Access to this page has been denied because we believe you are using automation tools to browse the website.

So I looked up how to fake a browser visit and changed my code, but I still get the same error. What am I doing wrong?
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
print(response.content)
All in all, your code looks great! It appears that the website you are trying to scrape requires more complete headers than the ones you are sending. The following code should solve your issue:
import requests
from bs4 import BeautifulSoup as BS
url = 'https://www.crunchbase.com/organization/incube-labs'
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0", "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language": "en-US,en;q=0.5", "Accept-Encoding": "gzip, deflate", "DNT": "1", "Connection": "close", "Upgrade-Insecure-Requests": "1"}
response = requests.get(url, headers=headers)
print(response.content)
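Since BeautifulSoup is already imported but unused above, a quick follow-up sketch parses the response and sanity-checks that real page HTML came back rather than an access-denied page:

soup = BS(response.content, 'html.parser')
print(soup.title)  # should show the page's <title>, not a block-page message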
The website I am trying to log in to is https://realitysportsonline.com/RSOLanding.aspx. I can't get the login to work, since the process is a little different from a typical site with a dedicated login page. I don't get any errors, but the login doesn't take effect, so the request for the main page redirects to the homepage.
import requests
url = "https://realitysportsonline.com/RSOLanding.aspx"
main = "https://realitysportsonline.com/SetLineup_Contracts.aspx?leagueId=3000&viewingTeam=1"
data = {"username": "", "password": "", "vc_btn3 vc_btn3-size-md vc_btn3-shape-rounded vc_btn3-style-3d vc_btn3-color-danger" : "Log In"}
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
          'Referer': 'https://realitysportsonline.com/RSOLanding.aspx',
          'Host': 'realitysportsonline.com',
          'Connection': 'keep-alive',
          'Accept-Language': 'en-US,en;q=0.5',
          'Accept-Encoding': 'gzip, deflate, br',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'}
s = requests.session()
s.get(url)
r = s.post(url, data, headers=header)
page = requests.get(main)
First of all, you create a session, but then, assuming your POST request worked, you request an authorised page without using that session.
You need to make the request with the s object you created like so:
page = s.get(main)
However, there were also a few issues with your POST request. You were making a request to the home page instead of the /Login route. You were also missing the Content-Type header.
import requests

# POST the credentials to the /Login service route, not the home page
url = "https://realitysportsonline.com/Services/AccountService.svc/Login"
main = "https://realitysportsonline.com/LeagueSetup.aspx?create=true"

payload = {"username": "", "password": ""}
headers = {
    'Content-Type': "text/json",
    'Cache-Control': "no-cache"
}

s = requests.session()
response = s.post(url, json=payload, headers=headers)

# Reuse the same session so the login cookies are sent with this request
page = s.get(main)
P.S. Your main request URL redirects to the homepage even with a valid session (at least for me).
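Continuing from the snippet above, a quick way to check whether the login took effect and where the follow-up request actually landed:

# Continuing from the snippet above.
print(response.status_code)  # the login POST should return 200
print(page.url)              # the URL the GET for `main` ended up at
print(page.history)          # any 302 redirects requests followed on the way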
I am trying to scrape the following website: https://wwwapps.ncmedboard.org/Clients/NCBOM/Public/LicenseeInformationResults.aspx
In order to get each page to scrape, I need to first conduct a search on this .aspx page by inputting a first name and last name and initiating the search.
Using resources on the internet, I put together the following HTTP POST request:
import requests
from bs4 import BeautifulSoup

url = 'https://wwwapps.ncmedboard.org/Clients/NCBOM/Public/LicenseeInformationResults.aspx'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en;q=0.9,en-US;q=0.8,zh-TW;q=0.7,zh;q=0.6,zh-CN;q=0.5'
}
session = requests.session()
response = session.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
soup = BeautifulSoup(response.content, 'html.parser')
form_data = {
    '__VIEWSTATE': soup.find('input', {'name': '__VIEWSTATE'}).get('value'),
    '__VIEWSTATEGENERATOR': soup.find('input', {'name': '__VIEWSTATEGENERATOR'}).get('value'),
    'waLastName': 'Smith',
    'waFirstName': 'John',
    '__EVENTTARGET': 'btnNext'
}
f = session.post(url, data=form_data, headers=headers)
soup = BeautifulSoup(f.content, 'html.parser')
for a in soup.find_all('a', href=True):
print("Found the URL:" + a['href'])
It doesn't seem like the POST has any effect: when you look at the HTML after the POST request, it doesn't show the results page. Any pointers on why this is the case?
Thanks!
You may have to set the ASP.NET session cookie, which is generated anew for each session. In your website's case (https://wwwapps.ncmedboard.org) it is ASP.NET_SessionId=(the session id value the website provides).

If CSRF is not validated correctly, this may be enough to bypass it.
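A minimal sketch of carrying that cookie with requests: the first GET normally stores ASP.NET_SessionId on the session automatically, and it can also be set by hand (the id below is a placeholder, not a real value):

import requests

session = requests.Session()

# The first GET should receive Set-Cookie: ASP.NET_SessionId=... and
# store it on the session, so later POSTs send it back automatically.
session.get('https://wwwapps.ncmedboard.org/Clients/NCBOM/Public/LicenseeInformationResults.aspx')
print(session.cookies.get('ASP.NET_SessionId'))

# Or set it manually; 'your-session-id' is a placeholder, not a real id.
session.cookies.set('ASP.NET_SessionId', 'your-session-id',
                    domain='wwwapps.ncmedboard.org')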
https://open.spotify.com/search/results/cheval is the link that triggers various intermediary requests, one being the attempted request below.
When running the following request in Postman (Chrome plugin), response cookies (13) are shown, but they do not seem to exist when the same request runs in Python (response.cookies is empty). I have also tried using a session, but with the same result.

Update: although these cookies were retrieved after using Selenium (to log in / solve the captcha and transfer the login cookies to the session for the following request), it's still unknown which variable(s) are required for the target cookies to be returned with that request.
How can those response cookies be retrieved (if at all) with Python?
url = "https://api.spotify.com/v1/search"
querystring = {"type":"album,artist,playlist,track","q":"cheval*","decorate_restrictions":"true","best_match":"true","limit":"50","anonymous":"false","market":"from_token"}
headers = {
    'access-control-request-method': "GET",
    'origin': "https://open.spotify.com",
    'x-devtools-emulate-network-conditions-client-id': "0959BC056CD6303CAEC3E2E5D7796B72",
    'user-agent': "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36",
    'access-control-request-headers': "authorization",
    'accept': "*/*",
    'accept-encoding': "gzip, deflate, br",
    'accept-language': "en-US,en;q=0.9",
    'cache-control': "no-cache",
    'postman-token': "253b0e50-7ef1-759a-f7f4-b09ede65e462"
}
response = requests.request("OPTIONS", url, headers=headers, params=querystring)
print(response.text)