I want to send a Referer header while retrieving data from the web, but in my Python 2 script the call request.add_header('Referer', 'https://www.python.org') has no effect.
The contents of my Url.txt:
https://www.python.org/about/
https://stackoverflow.com/questions
https://docs.python.org/2.7/
This is my code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import urllib2
import threading
import time
import requests
max_thread = 5
urllist = open("Url.txt").readlines()
def url_connect(url):
    try:
        request = urllib2.Request(url)
        request.add_header('Referer', 'https://www.python.org')
        request.add_header('User-agent', 'Mozilla/5.0')
        goo = re.findall('<title>(.*?)</title>', urllib2.urlopen(url.replace(' ','')).read())[0]
        print '\n' + goo.decode("utf-8")
        with open('SaveMyDataFile.txt', 'ab') as f:
            f.write(goo + "\n")
    except Exception as Errors:
        pass
for i in urllist:
    i = i.strip()
    if i.startswith("http"):
        while threading.activeCount() >= max_thread:
            time.sleep(0.1)
        threading.Thread(target=url_connect, args=(i,)).start()
It looks to me like the problem is in your call to urlopen: you call it with the url string and not with the request.
From https://docs.python.org/2/library/urllib2.html#urllib2.urlopen
Open the URL url, which can be either a string or a Request object.
You need to pass urllib2.urlopen() the Request object that you just built; you're currently not doing anything with it.
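A minimal sketch of that fix, written for Python 3, where urllib2.Request and urllib2.urlopen live in urllib.request. The helper names build_request and fetch are hypothetical and only illustrate the idea: the headers are attached to the Request object, and that same object is what gets passed to urlopen.

```python
from urllib.request import Request, urlopen


def build_request(url):
    """Build a Request that carries the Referer and User-Agent headers."""
    request = Request(url.strip())
    request.add_header('Referer', 'https://www.python.org')
    request.add_header('User-Agent', 'Mozilla/5.0')
    return request


def fetch(url, timeout=10):
    """Open the Request object (not the bare URL string) so the headers are sent."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

In the Python 2 script from the question, the equivalent change is to call urllib2.urlopen(request) instead of urllib2.urlopen(url.replace(' ','')).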
I have a curl POST to do for Elasticsearch:
curl -XPOST "http://localhost:9200/index/name" --data-binary "@file.json"
How can I do this in a Python shell? I need to loop over many JSON files, so I want to be able to do this in a for loop.
import glob
import os
import requests
def index_data(path):
    item = []
    for filename in glob.glob(path):
        item.append(filename[55:81]+'.json')
    return item
def send_post(url, datafiles):
    r = requests.post(url, data=file(datafiles,'rb').read())
    data = r.text
    return data
def main():
    url = 'http://localhost:9200/index/name'
    metpath = r'C:\pathtofiledirectory\*.json'
    jsonfiles = index_data(metpath)
    send_post(url, jsonfiles)
if __name__ == "__main__":
    main()
I tried the code above, but it gives me a TypeError:
TypeError: coercing to Unicode: need string or buffer, list found
You can use the requests HTTP client:
import requests
files = ['file.json', 'file1.json', 'file2.json', 'file3.json', 'file4.json']
for item in files:
    req = requests.post('http://localhost:9200/index/name', data=file(item,'rb').read())
    print req.text
Based on your edit, you would need to pass one filename at a time instead of the whole list:
for item in jsonfiles:
    send_post(url, item)
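The same loop can be sketched for Python 3, where the file() builtin no longer exists. To stay dependency-free this version swaps requests for urllib.request from the standard library. The helper names build_post and post_all and the Content-Type header are assumptions made for illustration, not part of the original answers.

```python
import glob
from urllib.request import Request, urlopen


def build_post(url, path):
    """Read one JSON file and wrap its bytes in a POST request."""
    with open(path, 'rb') as handle:
        body = handle.read()
    # Assumption: the server expects a JSON content type.
    headers = {'Content-Type': 'application/json'}
    return Request(url, data=body, headers=headers, method='POST')


def post_all(url, pattern):
    """POST every file matching the glob pattern, one request per file."""
    responses = {}
    for path in sorted(glob.glob(pattern)):
        with urlopen(build_post(url, path)) as response:
            responses[path] = response.read().decode('utf-8')
    return responses
```

The TypeError in the question came from handing the whole list to a function that opens a single file; here each path is opened individually inside the loop.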
Use the requests library.
import requests
r = requests.post(url, data=data)
It is as simple as that.
How would I parse a JSON API response with Python?
I currently have this:
import urllib.request
import json
url = 'https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty'
def response(url):
    with urllib.request.urlopen(url) as response:
        return response.read()
res = response(url)
print(json.loads(res))
I'm getting this error:
TypeError: the JSON object must be str, not 'bytes'
What is the Pythonic way to deal with JSON APIs?
Version 1: (do a pip install requests before running the script)
import requests
r = requests.get(url='https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
print(r.json())
Version 2: (do a pip install wget before running the script)
import wget
fs = wget.download(url='https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty')
with open(fs, 'r') as f:
    content = f.read()
print(content)
You can use the Python 3 standard library:
import urllib.request
import json
url = 'http://www.reddit.com/r/all/top/.json'
req = urllib.request.Request(url)
# read the response
r = urllib.request.urlopen(req).read()
cont = json.loads(r.decode('utf-8'))
counter = 0
# parse the JSON
for item in cont['data']['children']:
    counter += 1
    print("Title:", item['data']['title'], "\nComments:", item['data']['num_comments'])
    print("----")
# pretty-print the whole document
#print (json.dumps(cont, indent=4, sort_keys=True))
print("Number of titles: ", counter)
The output will look like this:
...
Title: Maybe we shouldn't let grandma decide things anymore.
Comments: 2018
----
Title: Carrie Fisher and Her Stunt Double Sunbathing on the Set of Return of The Jedi, 1982
Comments: 880
----
Title: fidget spinner
Comments: 1537
----
Number of titles: 25
I would usually use the requests package with the json package. The following code should be suitable for your needs:
import requests
import json
url = 'https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty'
r = requests.get(url)
print(json.loads(r.content))
Output
[11008076,
11006915,
11008202,
....,
10997668,
10999859,
11001695]
The only thing missing in the original question is a call to the decode method on the bytes returned by read() (and even then, not for every Python 3 version: json.loads accepts bytes directly from Python 3.6 onward). It's a shame no one pointed that out and everyone jumped on a third-party library.
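As a small illustration of that point, the missing step can be sketched as below. The helper name parse_json_bytes and the sample payload are hypothetical. The bytes literal stands in for what response.read() returns, and UTF-8 is assumed as the encoding.

```python
import json


def parse_json_bytes(raw, encoding='utf-8'):
    """Decode the raw response bytes to str before handing them to json.loads."""
    return json.loads(raw.decode(encoding))


# Stand-in for the bytes that response.read() would return.
sample = b'[11008076, 11006915, 11008202]'
story_ids = parse_json_bytes(sample)
print(story_ids)
```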
Using only the standard library, for the simplest of use cases:
import json
from urllib.request import urlopen
def get(url, object_hook=None):
    with urlopen(url) as resource: # 'with' is important to close the resource after use
        return json.load(resource, object_hook=object_hook)
Simple use case:
data = get('http://url') # '{ "id": 1, "$key": 13213654 }'
print(data['id']) # 1
print(data['$key']) # 13213654
Or, if you prefer, but riskier:
from types import SimpleNamespace
data = get('http://url', lambda o: SimpleNamespace(**o)) # '{ "id": 1, "$key": 13213654 }'
print(data.id) # 1
# print(data.$key) # invalid syntax, so it is left commented out
# though you can still do
print(data.__dict__['$key'])
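The object_hook behaviour can be tried without a network call by feeding json.loads a string directly. This is only a sketch: the helper name to_namespace and the nested "owner" field are made up for illustration.

```python
import json
from types import SimpleNamespace


def to_namespace(text):
    """Parse JSON text, turning every JSON object into a SimpleNamespace."""
    return json.loads(text, object_hook=lambda obj: SimpleNamespace(**obj))


data = to_namespace('{"id": 1, "$key": 13213654, "owner": {"name": "ada"}}')
print(data.id)               # attribute access works for valid identifiers
print(data.owner.name)       # nested objects are converted as well
print(getattr(data, '$key')) # keys that are not identifiers need getattr
```

The hook is applied to nested objects too, which is why data.owner.name works. The risk mentioned above is that any key that is not a valid Python identifier can only be reached through getattr or __dict__.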
With Python 3
import requests
import json
url = 'http://IP-Address:8088/ws/v1/cluster/scheduler'
r = requests.get(url)
data = json.loads(r.content.decode())
I am using the json library and trying to export a page feed to a CSV file. I have tried many ways to get the result, but every time the code executes it reports that the object is not JSON serializable. Facebook now requires an auth code, which I have and have used, so the connection string will differ; however, if you use a page with public privacy settings you should still be able to get a result from the code below.
The code is as follows:
import urllib3
import json
import requests
#from pprint import pprint
import csv
from urllib.request import urlopen
page_id = "abcd" # username or id
api_endpoint = "https://graph.facebook.com"
fb_graph_url = api_endpoint+"/"+page_id
try:
    #api_request = urllib3.Requests(fb_graph_url)
    #http = urllib3.PoolManager()
    #api_response = http.request('GET', fb_graph_url)
    api_response = requests.get(fb_graph_url)
    try:
        #print (list.sort(json.loads(api_response.read())))
        obj = open('data', 'w')
        # write(json_dat)
        f = api_response.content
        obj.write(json.dumps(f))
        obj.close()
    except Exception as ee:
        print(ee)
except Exception as e:
    print( e)
I have tried many approaches without success. I hope someone can help.
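api_response.content is the raw body of the API response (bytes in Python 3), not a parsed Python object, so json.dumps cannot serialize it.
Try either writing the body as text (the file is opened in text mode, so use .text rather than .content):
f = api_response.text
obj.write(f)
Or parsing it first and dumping the result:
f = api_response.json()
obj.write(json.dumps(f))
requests.get(fb_graph_url).content
is a bytes object in Python 3. Using json.dumps on it won't work: the function expects JSON-compatible Python objects such as dictionaries, lists, strings and numbers.
If the request already returns JSON, just write it to the file.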
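A minimal sketch of both points, using only the standard library. The bytes literal is a hypothetical stand-in for api_response.content, and save_json_body is an illustrative helper, not part of requests.

```python
import json


def save_json_body(body, path):
    """Decode a raw JSON body, parse it, and write it out pretty-printed."""
    payload = json.loads(body.decode('utf-8'))
    with open(path, 'w', encoding='utf-8') as out:
        json.dump(payload, out, indent=2)
    return payload


# Hypothetical stand-in for api_response.content
sample_body = b'{"id": "123", "name": "abcd"}'

# json.dumps rejects bytes; this is the "not JSON serializable" error.
try:
    json.dumps(sample_body)
    dumps_error = None
except TypeError as err:
    dumps_error = err
    print('json.dumps failed:', err)
```

Calling save_json_body(sample_body, 'data') would write the parsed document to a file named data, which mirrors the second option above.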
Below is a code snippet of my api.py module
# -*- coding: utf-8 -*-
from urllib2 import urlopen
from urllib2 import Request
class API:
    def call_api(self, url, post_data=None, header=None):
        is_post_request = True if (post_data and header) else False
        response = None
        try:
            if is_post_request:
                url = Request(url = url, data = post_data, headers = header)
            # Calling api
            api_response = urlopen(url)
            response = api_response.read()
        except Exception as err:
            response = err
        return response
I am trying to mock urllib2.urlopen in a unit test of the module above. I have written:
# -*- coding: utf-8 -*-
# test_api.py
from unittest import TestCase
import mock
from api import API
class TestAPI(TestCase):
    @mock.patch('urllib2.Request')
    @mock.patch('urllib2.urlopen')
    def test_call_api(self, urlopen, Request):
        urlopen.read.return_value = 'mocked'
        Request.get_host.return_value = 'google.com'
        Request.type.return_value = 'https'
        Request.data = {}
        _api = API()
        assert _api.call_api('https://google.com') == 'mocked'
When I run the unit test, I get an exception:
<urlopen error unknown url type: <MagicMock name='Request().get_type()' id='159846220'>>
What am I missing? Please help me out.
You are patching the wrong things: take a look at Where to patch in the mock documentation.
In api.py, with
from urllib2 import urlopen
from urllib2 import Request
you create local references to urlopen and Request in your module. With mock.patch('urllib2.urlopen') you patch the original reference and leave the one in api.py untouched.
So replacing your patches with
@mock.patch('api.Request')
@mock.patch('api.urlopen')
should fix your issue... but it is not enough.
In your original test, api.Request is never used, but the real urllib2.urlopen() builds a Request using the patched urllib2.Request: that is why Request().get_type() is a MagicMock.
For a complete fix you should rewrite the test. First the code:
@mock.patch('api.urlopen', autospec=True)
def test_call_api(self, urlopen):
    urlopen.return_value.read.return_value = 'mocked'
    _api = API()
    self.assertEqual(_api.call_api('https://google.com'), 'mocked')
    urlopen.assert_called_with('https://google.com')
Now the clarification. In your test you never call Request() because you pass only the first parameter, so I've removed the useless patch. Moreover, you are patching the urlopen function, not an object returned by it, which means the read() method you want to mock is a method of the object returned by the urlopen() call.
Finally, I added a check on the urlopen call, plus autospec=True, which is always good practice.
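The difference between urlopen.read.return_value and urlopen.return_value.read.return_value can be seen in isolation with Python 3's unittest.mock. To keep the sketch self-contained it passes the mock in as an argument rather than using mock.patch, so it shows only the return_value chain, not the patch target. The call_api function here is a simplified, hypothetical stand-in for API.call_api.

```python
from unittest import mock
from urllib.request import urlopen


def call_api(url, opener=urlopen):
    """Simplified stand-in: open the URL and return the body."""
    return opener(url).read()


# Correct: configure read() on the object that calling urlopen returns.
fake_urlopen = mock.create_autospec(urlopen)
fake_urlopen.return_value.read.return_value = b'mocked'
good_result = call_api('https://google.com', opener=fake_urlopen)

# Incorrect: this configures fake.read(), which call_api never calls.
wrong_urlopen = mock.MagicMock()
wrong_urlopen.read.return_value = b'mocked'
bad_result = call_api('https://google.com', opener=wrong_urlopen)

print(good_result)                              # the configured bytes
print(isinstance(bad_result, mock.MagicMock))   # an unconfigured mock came back
```

With mock.patch('api.urlopen', autospec=True) the mock handed to the test behaves like fake_urlopen above.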
How can I modify my script to skip a URL if the connection times out or is invalid/404?
#!/usr/bin/python
#parser.py: Downloads Bibles and parses all data within <article> tags.
__author__ = "Cody Bouche"
__copyright__ = "Copyright 2012 Digital Bible Society"
from BeautifulSoup import BeautifulSoup
import lxml.html as html
import urlparse
import os, sys
import urllib2
import re
print ("downloading and parsing Bibles...")
root = html.parse(open('links.html'))
for link in root.findall('//a'):
    url = link.get('href')
    name = urlparse.urlparse(url).path.split('/')[-1]
    dirname = urlparse.urlparse(url).path.split('.')[-1]
    f = urllib2.urlopen(url)
    s = f.read()
    if (os.path.isdir(dirname) == 0):
        os.mkdir(dirname)
    soup = BeautifulSoup(s)
    articleTag = soup.html.body.article
    converted = str(articleTag)
    full_path = os.path.join(dirname, name)
    open(full_path, 'wb').write(converted)
    print(name)
print("DOWNLOADS COMPLETE!")
To apply a timeout to your request, pass the timeout argument in your call to urlopen. From the docs:
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
Refer to this guide's section on how to handle exceptions with urllib2. Actually I found the whole guide very useful.
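HTTP 408 is the status code a server returns for a request timeout, and 404 means the page was not found; a timeout triggered by the timeout argument itself arrives as a URLError without a code attribute (or as socket.timeout). Wrapping it up, to handle these exceptions you would:
try:
    response = urlopen(req, timeout=3) # 3 seconds; the second positional argument is data, so pass timeout by keyword
except URLError, e:
    if hasattr(e, 'code'):
        if e.code==408:
            print 'Timeout ', e.code
        if e.code==404:
            print 'File Not Found ', e.code
    # etc etc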
Try putting your urlopen line inside a try/except statement. Look this up:
docs.python.org/tutorial/errors.html section 8.3
Look at the different exceptions, and when you encounter one just move on to the next iteration of the loop using the continue statement.
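Putting the two answers together, a skip-on-failure loop might look like the following Python 3 sketch (urllib2's urlopen and URLError are found in urllib.request and urllib.error there). The helper names fetch_or_none and download_all, the opener parameter, and the 3-second default are assumptions made for illustration.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_or_none(url, timeout=3, opener=urlopen):
    """Return the response body, or None if the URL should be skipped."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered with an error status such as 404 or 408.
        print('HTTP error', err.code, 'for', url)
    except URLError as err:
        # No usable answer: DNS failure, refused connection, connect timeout.
        print('Connection failed for', url, '-', err.reason)
    except (socket.timeout, ValueError) as err:
        # Read timeout, or a string that is not a valid URL.
        print('Skipping', url, '-', err)
    return None


def download_all(urls, opener=urlopen):
    """Fetch every URL, skipping the ones that fail."""
    pages = {}
    for url in urls:
        body = fetch_or_none(url, opener=opener)
        if body is None:
            continue  # move on to the next link
        pages[url] = body
    return pages
```

HTTPError is caught before URLError because it is a subclass of it. The opener parameter exists only so the function can be exercised without a network connection.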