Can't download a file from a website using Python

I'm a web-scraping newbie and I'm having trouble downloading some Excel files from a website, no matter which variant of the .get methods I try. I've been able to parse the HTML easily and get the URL of every link on the page, but I'm not experienced enough to understand why on earth I cannot download the files (cookies, sessions, etc.; no idea).
Here is the website:
https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9
If you scroll down you'll find the five Excel file links, none of which I've been able to download (just search for id="AutoDownload").
When I try to use the requests .get method and save the file using

import requests

requests.Session()
res = requests.get(url).content
with open(filename) as f:
    f.write(res.content)
I get an error that res is a bytes object, and when I view res as a variable, the output is:

b'<html><head><title>Request Rejected</title></head><body>The requested URL was rejected. Please consult with your administrator.<br><br>Your support ID is: 11190392837244519859</body></html>'
Been trying for a while now, would really appreciate any help. Thanks a lot.

So I finally came up with a solution using only requests and the standard Python HTML parser.
From what I found, the "Request Rejected" error is generally difficult to trace back to a precise cause. In this case, it was due to the absence of a user agent in the HTTP request.
import requests
from html.parser import HTMLParser

# Custom parser to retrieve the links
link_urls = []

class AutoDownloadLinksHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and ('id', 'AutoDownload') in attrs:
            href = [attr[1] for attr in attrs if attr[0] == 'href'][0]
            link_urls.append(href)

# Get the links to the files
url = 'https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
links_page = requests.get(url, headers=headers)
AutoDownloadLinksHTMLParser().feed(links_page.content.decode('utf-8'))

# Download the files
host = 'https://mlcu.org.eg'
for i, link_url in enumerate(link_urls):
    file_content = requests.get(host + link_url, headers=headers).content
    with open('file' + str(i) + '.xls', 'wb') as f:
        f.write(file_content)

In order to download the files, you need to set the "User-Agent" field in the header of your Python request. This can be done by passing a dict to the get function:

file = session.get(url, headers=my_headers)

Apparently, this host does not respond to requests coming from Python that carry the default User-Agent:

'User-Agent': 'python-requests/2.24.0'

With this in mind, if you pass another value for that field in the header of your request, for example one from Firefox (see below), the host thinks the request comes from a Firefox user and will respond with the actual file.
Here is the full version of the code:
import requests

my_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
    'Connection': 'keep-alive'
}

session = requests.session()
file = session.get(url, headers=my_headers)
with open(filename, 'wb') as f:
    f.write(file.content)
The latest Firefox user agent worked for me, but you can find many more possible values for that field here.

If you're not experienced enough to set all the correct parameters manually in your HTTP requests so as to avoid the "Request Rejected" error (for my part, I wouldn't be able to), I would advise you to use a higher-level approach such as Selenium.
Selenium can automate actions performed by a browser installed on your computer, such as downloading files (which is why it's used to automate tests on web apps as well as for web scraping). The idea is that the HTTP requests generated by the browser are better than the ones you can write by hand.
Here is a tutorial that does what you're trying to do using Selenium; a minimal sketch of the idea follows.
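For illustration, here is a minimal sketch of that approach, assuming Selenium 4+ with Chrome installed; the download directory and the final wait are placeholders to adjust for your machine:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
# Hypothetical download directory; change it to suit your machine.
options.add_experimental_option("prefs", {"download.default_directory": "/tmp/downloads"})

driver = webdriver.Chrome(options=options)
driver.get("https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9")

# Click each link with id="AutoDownload"; the browser performs the
# downloads itself, with its usual headers and cookies.
for link in driver.find_elements(By.CSS_SELECTOR, 'a[id="AutoDownload"]'):
    link.click()

time.sleep(30)  # crude wait for the downloads to finish before quitting
driver.quit()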

Related

Trouble collecting different property ids from a webpage using the requests module

After clicking the button labeled 11.331 Treffer at the top-right corner of the filter on this webpage, I can see the results displayed on that page. I've created a script using the requests module to fetch the ID numbers of different properties from that page.
However, when I run the script, I get json.decoder.JSONDecodeError. If I copy the cookies from dev tools and paste them directly into the headers, I get the results accordingly.
I don't wish to copy cookies from dev tools every time I run the script, so I used Selenium to collect cookies from the landing page and supply them within the headers to get the desired result, but I still get the same error.
This is what I'm trying:
import time
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

start_url = 'https://www.immobilienscout24.de/'
link = 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?pagenumber=1'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
    'referer': 'https://www.immobilienscout24.de/Suche/de/nordrhein-westfalen/wohnung-kaufen?enteredFrom=one_step_search',
    'accept': 'application/json; charset=utf-8',
    'x-requested-with': 'XMLHttpRequest'
}

def get_cookies():
    with webdriver.Chrome() as driver:
        driver.get(start_url)
        time.sleep(10)
        cookiejar = {c['name']: c['value'] for c in driver.get_cookies()}
    return cookiejar

cookies = get_cookies()
cookie_string = "; ".join([f"{item}={val}" for item, val in cookies.items()])

with requests.Session() as s:
    s.headers.update(headers)
    s.headers['cookie'] = cookie_string
    res = s.get(link)
    container = res.json()['searchResponseModel']['resultlist.resultlist']['resultlistEntries'][0]['resultlistEntry']
    for item in container:
        try:
            project_id = item['#id']
        except KeyError:
            project_id = ""
        print(project_id)
How can I scrape the property IDs from that webpage using the requests module?
EDIT:
The existence of the following portion within the cookies is crucial; without it, the script probably produces the error I mentioned. However, Selenium failed to include that portion within the cookies it collected.
reese84=3:/qdGO9he7ld4/8a35vlw8g==:+/xBfAtVPRKHBSJgzngTQw1ywoViUvmVKLws+f8Y6edDgM+3s0Xzo17NvfgPrx9Z/suRy7hee5xcEgo85V3LdGsIop9/29g1ib1JQ0pO3UHWrtn81MseS6G8KE6AF4SrWZ2t8eTr1SEogUmCkB1HNSqXT88sAZaEi+XSzUyAGqikVjEcLX9TeI+KN37QNr9Sl+oTaOPchSgS/IowPj83zvT471Ewabg8CAc6q8I9AJ8Zb9FfLqePweCM+QFKIw+ZUp5GR4TXxZVcWdipbIEAyv3kj2x9Xs1K1k+8aXmy9VES6rFvW1xOsAjLmXbg6REPBye+QcAgPUh/x79mBWktcWC/uQ5L2W2dBLBS4eM2+bpEBw5EHMfjq9bk9hnmmZuxPGALLKASeXBt5lUUwx7x+wtGcjyvB9ZSE6gI2VxFLYqncYmhKqoNzgwQY8wRThaEraiJF/039/vVMa2G3S38iwniiOGHsOxq6VTdnWJGgvJqUmpWfXzz6XQXWL2xcykAoj7LMqHF2tC0DQyInUmZ3T7zjPBV7mEMgZkDn0z272E=:qQHyFe1/pp8/BS4RHAtxftttcOYJH4oqG1mW0+aNXF4=;
I think another part of your problem is that the link is not JSON; it's an HTML document. Part of the HTML document contains JavaScript that sets a JS variable to a JSON object. You can't get that with res.json().
In theory, you could use Selenium to go to the link and grab the contents of the IS24.resultList variable by executing JavaScript like this:
driver.get(link)
time.sleep(10)
result_list = json.loads(driver.execute_script("return window.IS24.resultList"))
In practice, I think they're really serious about blocking bots, and I suspect convincing them you're not a bot might take more than spoofing a cookie. When I visit via Selenium, I don't even get the reCAPTCHA option that I get when visiting through a regular browser session in incognito mode.

I can't log in to Instagram using requests

I found How would I log into Instagram using BeautifulSoup4 and Requests, and how would I determine it on my own?, but this code:
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime

link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'

time = int(datetime.now().timestamp())
payload = {
    'username': 'login',
    'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:your_password',
    'queryParams': {},
    'optIntoOneTap': 'false'
}

with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
gives me an error with the csrf token:

line 21, in <module>
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
IndexError: list index out of range

Other posts on Stack Overflow don't work for me either, and I don't want to use Selenium.
TL;DR
Add a user-agent to your get request header on line 20:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Long answer
If we look at the error message you posted, we can start to dissect what's gone wrong. Line 21 is attempting to find a csrf_token attribute on the Instagram login page.
Diagnostics
We can see from the error message that the list index is out of range, which in this case means that the list returned by re.findall (docs) is empty. This means one of the following:
Your regex is wrong
The HTML returned by your get request (docs) r = s.get(link) on line 20 doesn't contain a csrf_token attribute
The attribute doesn't exist in the source HTML
If we visit the page and look at its HTML source, we can see that a csrf_token attribute is indeed present, on line 261:

<script type="text/javascript">window._sharedData = {"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>

Note: I have excluded the rest of the code for brevity.
Now that we know it's present on the page, we can write the HTML you're actually receiving via your get request to a local file and inspect it:

r = s.get(link)
with open("csrf.html", "w") as f:
    f.write(r.text)
If you open that file and do a Ctrl+F for csrf_token, it's not present. This likely means that Instagram detected that you're accessing the page via a scraper and returned a modified version of the page.
The fix
In order to fix this, you need to add a user agent to your request header, which essentially 'tricks' the page into thinking you're accessing it via a browser. This can be done by changing:
r = s.get(link)
to something like this:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Note, this is a random user agent from here.
Notes
I appreciate that you don't want to use Selenium for your task, but you might find that the more dynamic interactions you want to perform, the harder it is to achieve them with static scraping libraries like the requests module. Here are some good resources for learning Selenium in Python:
Selenium docs
Python Selenium Tutorial #1 - Web Scraping, Bots & Testing

403 Forbidden Error when scraping a site, user-agents already used and updated. Any ideas?

As the title above states, I am getting a 403 error. The URLs generated are valid; I can print them and then open them in my browser just fine.
I've got a user agent; it's the exact same one my browser sends when accessing the page I want to scrape, pulled straight from Chrome DevTools. I've tried using sessions instead of a straight request, I've tried using urllib, and I've tried using a generic requests.get.
Here's the code I'm using that 403s. Same result with requests.get, etc.
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36'}
session = requests.Session()
req = session.get(URL, headers=headers)
So yeah, I assume I'm not creating the user agent right, and that's how it can tell I am scraping. But I'm not sure what I'm missing, or how to find that out.
I copied all the headers from DevTools and started removing them one by one, and I found that the request needs only Accept-Language; it needs neither User-Agent nor a Session.
import requests

url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'

headers = {
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
}

r = requests.get(url, headers=headers)
data = r.json()
print(data['docs'][0]['name'])
Result:
The Elder Scrolls V: Skyrim Special Edition Steam Key GLOBAL
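That elimination process is easy to script. Here's a hypothetical helper that repeats the "remove one header at a time" experiment; the header set is an example, not necessarily the asker's exact one:

import requests

url = 'https://www.g2a.com/lucene/search/filter?&search=The+Elder+Scrolls+V:+Skyrim&currency=nzd&cc=NZD'
# Start from the full header set copied out of DevTools, then drop one at a time.
full_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36',
    'Accept-Language': 'en-US;q=0.7,en;q=0.3',
    'Accept': '*/*',
}

for name in full_headers:
    trimmed = {k: v for k, v in full_headers.items() if k != name}
    status = requests.get(url, headers=trimmed).status_code
    # A 403 here means the dropped header is one the server actually checks.
    print('without %s: %s' % (name, status))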

How to download a webpage in Python when the URL includes GET parameters

I have a URL such as
http://www.example-url.com/content?param1=1&param2=2
in particular I am testing it on
http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json
How do I get the content of such a URL, so that the GET parameters are taken into account as well?
How can I save it to a file?
How can I access multiple URLs like this, either in parallel or asynchronously (saving to a file in an on-response callback, like in JavaScript)?
I have tried

import urllib
urllib.urlretrieve("http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json", "file.json")

but I get the content of http://ws.parlament.ch/votes/councillors instead of the JSON I want.
You can use urllib, but there are other libraries I know of that make it a lot easier in different situations. For example, if you also want user authentication, you can use Requests.
For this situation you can use httplib2; here is a clean, small piece of code which takes the GET parameters into consideration (source).
import httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://example.org/", "GET")
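To also cover the "save it to a file" part of the question with this approach: the content returned above is a bytes string, so it can be written out directly (a small addition to the snippet, with file.json as a placeholder name):

# content is raw bytes, so open the file in binary mode
with open("file.json", "wb") as f:
    f.write(content)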
It seems that you need to set the user agent of the connection, otherwise the server will refuse to give you the data. I also use urllib2.Request() instead of the standard urlretrieve() and/or urlopen(), mostly because it allows GET and POST requests and lets the programmer set the user agent.
import urllib2, json

user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
header = {'User-Agent': user_agent}
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"

request = urllib2.Request(fullurl, headers=header)
data = urllib2.urlopen(request)
print json.loads(data.read())
Some extra information about headers in Python.
If you want to keep using httplib2, here is the code for that one:
import httplib2

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"

http = httplib2.Http(".cache")
response, content = http.request(fullurl, "GET", headers=header)
print content
The content printed by my last example can be parsed with json.loads and then saved to a file with json.dump(data, f), where f is an open file object (note the order: the object comes first, then the file).
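The asker's last question, fetching several such URLs in parallel, isn't covered by either answer. A sketch using the standard library's concurrent.futures together with requests (Python 3; the URL list and filenames are placeholders):

import concurrent.futures
import requests

urls = [
    "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json",
    # ...add more URLs of the same form here
]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

def fetch(url, filename):
    r = requests.get(url, headers=headers)
    with open(filename, "wb") as f:
        f.write(r.content)
    return filename

# Each download runs in its own thread; results are handled as they
# complete, much like a JavaScript on-response callback.
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(fetch, url, "file%d.json" % i) for i, url in enumerate(urls)]
    for fut in concurrent.futures.as_completed(futures):
        print("saved", fut.result())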

Python getting HTML content via 'requests' returns partial response

I'm reading a website's content using the following three lines. I used an example domain that's up for sale, which doesn't have much content.
url = "http://localbusiness.com/"
response = requests.get(url)
html = response.text
It returns the following HTML, yet the website contains more HTML when you check through view source. Am I doing something wrong here?
Python version 2.7
<html><head></head><body><!-- vbe --></body></html>
Try setting a User-Agent:
import requests

url = "http://localbusiness.com/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
    'Content-Type': 'text/html',
}

response = requests.get(url, headers=headers)
html = response.text
The default User-Agent set by requests is 'python-requests/2.8.1'. Try to simulate that the request is coming from a browser and not a script.
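As a side note, you can check the exact default your installed version sends; requests exposes it as requests.utils.default_user_agent():

import requests

# Prints e.g. 'python-requests/2.8.1', depending on the installed version
print(requests.utils.default_user_agent())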
@Jason answered it correctly, so I am extending his answer with the reason.
Why it happens
Some DOM elements are changed through Ajax calls and JavaScript code, so they will not be seen in the response of your call (although that's not the case here, since you are already using view source (Ctrl+U) to compare, not inspect element)
Some sites use the user agent to detect the nature of the user (desktop or mobile) and tailor the response accordingly (the probable case here)
Other alternatives
You can use the mechanize module of Python to mimic a browser to fool a website (comes in handy when the site is using some sort of authentication cookies). A small tutorial
Use Selenium to actually drive a browser (see the sketch below)
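A minimal sketch of the Selenium alternative, assuming Selenium 4+ and Chrome: the browser executes the page's JavaScript, so page_source reflects the fully rendered DOM.

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://localbusiness.com/")
html = driver.page_source  # the full HTML after scripts have run
driver.quit()
print(html)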
