I have a URL such as
http://www.example-url.com/content?param1=1&param2=2
In particular I am testing it on
http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json
How do I get the content of such a URL, so that the GET parameters are taken into account as well?
How can I save it to file?
How can I access multiple URLs like this, either in parallel or asynchronously (saving to file in a response-received callback, like in JavaScript)?
I have tried
import urllib
urllib.urlretrieve("http://ws.parlament.ch/votes/councillors?concillorNumberFilter=2565&format=json", "file.json")
but I am getting the content of http://ws.parlament.ch/votes/councillors instead of the JSON I want.
You can use urllib, but there are other libraries I know of that make it a lot easier in different situations. For example, if you also want user authentication handled, you can use Requests.
For this situation you can use httplib2, for example. Here is a small, clean piece of code that handles the GET request (source).
import httplib2
h = httplib2.Http(".cache")
(resp_headers, content) = h.request("http://example.org/", "GET")
It seems that you need to set the user agent of the connection, otherwise the server refuses to give you the data. I also use urllib2.Request() instead of the standard urlretrieve() and/or urlopen(), mostly because it supports both GET and POST requests and lets the programmer set the user agent.
import urllib2, json
user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
header = { 'User-Agent' : user_agent }
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"
response = urllib2.Request(fullurl, headers=header)
data = urllib2.urlopen(response)
print json.loads(data.read())
Some extra information about headers in Python.
If you want to keep using httplib2, here is the code for that:
import httplib2
header = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)' }
fullurl = "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json"
http = httplib2.Http(".cache")
response, content = http.request(fullurl, "GET", headers=header)
print content
The data printed by my last example can be saved to a file with json.dump(data, file_object) (the parsed object comes first, then the open file handle), after parsing the content with json.loads().
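To also cover the save-to-file and multiple-URLs parts of the question, here is a minimal sketch built on the urllib2 example above (the second URL and the output file names are only illustrative):
import json
import urllib2
from multiprocessing.dummy import Pool  # thread pool from the standard library

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}
urls = [
    "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2565&format=json",
    "http://ws.parlament.ch/votes/councillors?councillorNumberFilter=2566&format=json",  # illustrative second URL
]

def fetch_and_save(numbered_url):
    index, url = numbered_url
    request = urllib2.Request(url, headers=header)
    data = json.loads(urllib2.urlopen(request).read())
    # json.dump takes the object first, then the open file handle
    with open("councillors_%d.json" % index, "w") as f:
        json.dump(data, f)

pool = Pool(4)  # fetch up to 4 URLs in parallel
pool.map(fetch_and_save, enumerate(urls))
pool.close()
pool.join()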
Related
I am trying to send an HTTP GET request to a certain website, for example https://www.united.com, but it gets stuck with no response.
Here is the code:
from urllib.request import urlopen
url = 'https://www.united.com'
resp = urlopen(url, timeout=10)
Every time, it times out. But the same code works for other URLs, for example https://www.aa.com.
So I wonder what is behind https://www.united.com that keeps me from getting the HTTP request through. Thank you!
Update:
Adding a request header still doesn't work for this site:
from urllib.request import urlopen, Request

url = 'https://www.united.com'
req = Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36'
    }
)
resp = urlopen(req, timeout=3)
The server at united.com may be responding only to certain user-agent strings or request headers and blocking everything else. You have to send headers or a user-agent string that their server allows. This varies from website to website: sites that want to add more security to their applications can be very specific about which user agents may access a given resource.
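Purely as a hedged illustration (united.com may also block at the bot-detection or TLS level, in which case no header set will help), a fuller browser-like header set with urllib.request would look roughly like this:
from urllib.request import urlopen, Request

url = 'https://www.united.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
}

req = Request(url, headers=headers)
with urlopen(req, timeout=10) as resp:  # will still raise on timeout if the site keeps blocking
    print(resp.status, len(resp.read()))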
I'm a web-scraping newbie and I'm having issues using every .get method imaginable to download some Excel files from a website. I have been able to easily parse the HTML to get the URLs for every link on the page, but I'm not experienced enough to understand why on earth I cannot download the file (cookies, sessions, etc., no idea).
Here is the website:
https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9
If you scroll down you'll find the five Excel file links, none of which I've been able to download (just search for id="AutoDownload").
When I try to use the requests .get method, and save the file using
import requests
requests.Session()
res = requests.get(url).content
with open(filename) as f:
    f.write(res.content)
I get an error that res is a bytes object and when I view res as a variable, the output is:
b'<html><head><title>Request Rejected</title></head><body>The requested URL was rejected.
Please consult with your administrator.<br><br>Your support ID is: 11190392837244519859</body></html>
Been trying for a while now, would really appreciate any help. Thanks a lot.
So I finally came up with a solution using only requests and the standard Python HTML parser.
From what I found, the "Request Rejected" error is generally difficult to trace back to a precise cause. In this case, it was due to the absence of a user agent in the HTTP request.
import requests
from html.parser import HTMLParser

# Custom parser to retrieve the links
link_urls = []

class AutoDownloadLinksHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and [attr for attr in attrs if attr == ('id', 'AutoDownload')]:
            href = [attr[1] for attr in attrs if attr[0] == 'href'][0]
            link_urls.append(href)

# Get the links to the files
url = 'https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
links_page = requests.get(url, headers=headers)
AutoDownloadLinksHTMLParser().feed(links_page.content.decode('utf-8'))

# Download the files
host = 'https://mlcu.org.eg'
for i, link_url in enumerate(link_urls):
    file_content = requests.get(host + link_url, headers=headers).content
    with open('file' + str(i) + '.xls', 'wb+') as f:
        f.write(file_content)
In order to download the files, you need to set the "User-Agent" field in the headers of your Python request. This can be done by passing a dict to the get function:
file = session.get(url,headers=my_headers)
Apparently, this host does not respond to requests coming from Python that carry the default User-Agent:
'User-Agent': 'python-requests/2.24.0'
With this in mind, if you pass another value for that field in the header of your request, for example one from Firefox (see below), the host thinks the request comes from a Firefox user and will respond with the actual file.
Here is the full version of the code:
import requests

my_headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Encoding': 'gzip, deflate',
    'Accept': '*/*',
    'Connection': 'keep-alive'
}

session = requests.session()
file = session.get(url, headers=my_headers)
with open(filename, 'wb') as f:
    f.write(file.content)
The latest Firefox user agent worked for me but you can find many more possible values for that field here.
If you're not experienced enough to set all the correct parameters manually in your HTTP requests so as to avoid the "Request Rejected" error (for my part, I wouldn't be), I would advise you to use a higher-level approach such as Selenium.
Selenium can automate actions performed by a browser installed on your computer, such as downloading files (which is why it is used to automate tests on web apps as well as to do web scraping). The idea is that the HTTP requests generated by a real browser are better than the ones you can write by hand.
Here is a tutorial for doing what you're trying to do using Selenium.
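For completeness, here is a rough Selenium sketch of that idea (assuming Selenium 4 with Chrome and a matching chromedriver; the download directory and the final wait are illustrative choices, and the id="AutoDownload" selector comes from the page described in the question):
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

# Illustrative Chrome setup: fixed download directory, no download prompt
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {
    "download.default_directory": "/tmp/mlcu_downloads",
    "download.prompt_for_download": False,
})

driver = webdriver.Chrome(options=options)
driver.get('https://mlcu.org.eg/ar/3118/%D9%82%D9%88%D8%A7%D8%A6%D9%85-%D9%85%D8%AC%D9%84%D8%B3-%D8%A7%D9%84%D8%A7%D9%85%D9%86-%D8%B0%D8%A7%D8%AA-%D8%A7%D9%84%D8%B5%D9%84%D8%A9')

# Click each id="AutoDownload" link; the browser supplies cookies, headers
# and user agent by itself, so there is nothing to forge by hand
for link in driver.find_elements(By.ID, "AutoDownload"):
    link.click()

time.sleep(30)  # crude wait so the downloads can finish before the browser quits
driver.quit()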
I'm currently working on scraping some HTML files from an electronic medical system that I use for work. I currently have a Python bot that logs into the system and is able to download and send faxes for me, but there are some pages I want my bot to quickly grab as soon as it is logged in, before it starts sending faxes. These pages are basic HTML with extremely predictable URLs, and I have tested that I can manually call the pages from my browser, so once I do get my session established it should be easy work.
The website is: https://kinnser.net/
Login URL: https://kinnser.net/login.cfm
second URL: https://kinnser.net/AM/Message/inbox.cfm
import requests
import json
import logging
from requests.auth import HTTPBasicAuth
from lxml import html
#This URL will be the URL that your login form points to with the "action" tag.
POST_LOGIN_URL = 'https://kinnser.net/loginlogic.cfm'
#This URL is the page you actually want to pull down with requests.
REQUEST_URL = 'https://kinnser.net/AM/Message/inbox.cfm'
#username-input-name is the "name" tag associated with the username input field of the login form.
#password-input-name is the "name" tag associated with the password input field of the login form.
payload = {
'username': 'XXXXXXXX',
'password': 'XXXXXXXXX'}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36'}
with requests.Session() as session:
    post = session.post(POST_LOGIN_URL, data=payload, headers=headers)
    print(post)
    r = session.get(REQUEST_URL)
    print(r.text)  # or whatever else you want to do with the request data!
I played around with the username and password fields by setting them equal to the inputs' name/ID, but that wouldn't work. So I tried this script on the old EMR we used, just to confirm it wasn't broken, and it did indeed work perfectly. Then I began to play around with the headers in my request, and it was still no dice. I'm not sure if my login is simply failing or if they're detecting that I'm a bot and serving me the login page over and over again, but I have spent about 10 hours trying to research a solution and I've hit a wall with my project.
If anyone sees any mistakes in my code or has a workable solution, please feel free to suggest it. Thanks for the help, and hopefully I'll soon grow to understand more about RESTful web services.
Think the HTML might actually be in post.text?
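For example, a quick check along those lines (a sketch continuing the snippet above, right after the login POST):
print(post.status_code)   # many login endpoints return 200 even when the login fails
print(post.url)           # were we redirected back to login.cfm?
print(post.text[:500])    # peek at the HTML the login POST actually returned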
edit:
try the request with these headers:
...
user_agent_str = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
+ "AppleWebKit/537.36 (KHTML, like Gecko) " \
+ "Chrome/78.0.3904.97 " \
+ "Safari/537.36"
content_type_str = "application/json"
headers = {
"user-agent": user_agent_str,
"content-type": content_type_str
}
...
Another edit:
I'm not sure whether requests already handles this, but data=payload sends the payload as form data, not JSON; if the endpoint expects JSON (matching the content-type above), you might try passing json=payload (or json.dumps(payload)) instead.
I would suggest trying out these two things.
kinnser.net/loginlogic.cfm: from the network calls it looks like this is the POST URL.
Change 'Username' to 'username' and 'Password' to 'password' and try.
Since I don't have access to a username and password I cannot verify this, but these two things might be causing the problem.
I'm reading a website's content using the following short snippet. I used an example domain that is for sale and doesn't have much content.
import requests
url = "http://localbusiness.com/"
response = requests.get(url)
html = response.text
It returns the following HTML, although the website contains more HTML when you check it through View Source. Am I doing something wrong here?
Python version 2.7
<html><head></head><body><!-- vbe --></body></html>
Try setting a User-Agent:
import requests
url = "http://localbusiness.com/"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36',
'Content-Type': 'text/html',
}
response = requests.get(url, headers=headers)
html = response.text
The default User-Agent set by requests is 'User-Agent': 'python-requests/2.8.1'. Try to simulate that the request is coming from a browser and not a script.
@jason answered it correctly, so I am extending his answer with the reason.
Why it happens
Some DOM elements are modified by Ajax calls and JavaScript code, so they will not be seen in the response of your call (although that's not the case here, as you are already using View Source (Ctrl+U) to compare, not the element inspector).
Some sites use the user agent to determine the nature of the user (e.g. desktop or mobile) and tailor the response accordingly (which is the probable case here).
Other alternatives
You can use the mechanize module of Python to mimic a browser and fool a website (it comes in handy when the site uses some sort of authentication cookies). A small tutorial; a minimal sketch also follows after this list.
Use Selenium to actually drive a browser.
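As an illustration of the mechanize alternative, here is a minimal, untested sketch (it reuses the user-agent string from the answer above):
import mechanize

browser = mechanize.Browser()
browser.set_handle_robots(False)   # do not let robots.txt abort the request
browser.addheaders = [('User-Agent',
                       'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                       '(KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36')]

response = browser.open("http://localbusiness.com/")
html = response.read()
print len(html)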
I'm building a Django-based website where some data is dynamically loaded using Ajax from a user-specified URL. For this I'm using urllib2 and later BeautifulSoup. I came across a strange thing with Walmart links. Take a look:
import urllib2
url_to_parse = 'http://www.walmart.com/ip/JVC-HARX300-High-Quality-Full-Size-Headphone/13241375'
# 1 - read the url without user-agent string
opened_url = urllib2.urlopen(url_to_parse)
print len(opened_url.read())
# prints 309316
# 2 - read the url with user-agent string
headers = { 'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0' }
req = urllib2.Request(url_to_parse, '', headers)
opened_url = urllib2.urlopen(req)
print len(opened_url.read())
# prints 0
My question is: why does #2 print zero? I use the user-agent method to deal with other websites (like Amazon).
Wget is able to get the page content with no problems, by the way.
Your problem is not the User-Agent, it is your data parameter.
From the docs:
data may be a string specifying additional data to send to the server,
or None if no such data is needed.
It seems Walmart does not like your empty string: any non-None data argument makes urllib2 send a POST request instead of a GET, and that POST apparently comes back empty. Change your call to this:
req = urllib2.Request(url_to_parse, None, headers)
Now both ways print the same value.
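To see that this really is a GET-versus-POST issue, here is a small sketch (not part of the original answer) using Request.get_method():
import urllib2

url_to_parse = 'http://www.walmart.com/ip/JVC-HARX300-High-Quality-Full-Size-Headphone/13241375'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0'}

# data=None keeps the request a GET; any string, even '', turns it into a POST
print urllib2.Request(url_to_parse, None, headers).get_method()  # prints GET
print urllib2.Request(url_to_parse, '', headers).get_method()    # prints POST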