Reading URL socket backwards in Python - python

I'm attempting to pull information from a log file posted online and read through the output. The only information I really need is at the end of the file. These files are pretty big, and storing the entire socket output in a variable and reading through it consumes a lot of memory. Is there a way to read the socket from bottom to top?
What I currently have:
import urllib

socket = urllib.urlopen(urlString)
OUTPUT = socket.read()
socket.close()

OUTPUT = OUTPUT.split("\n")
for line in OUTPUT:
    if "xxxx" in line:
        print line
I am using Python 2.7. I pretty much want to read about 30 lines from the very end of the socket's output.

What you want in this use case is an HTTP Range request. Here is a tutorial I located:
http://stuff-things.net/2015/05/13/web-scale-http-tail/
I should clarify: the advantage of getting the size with a HEAD request, then doing a Range request, is that you do not have to transfer all the content. You mentioned the files are pretty big, so this is going to be the best solution :)
edit: added this code below...
Here is a demo (simplified) of that blog article, but translated into Python. Please note this will not work with all HTTP servers! More comments inline:
"""
illustration of how to 'tail' a file using http. this will not work on all
webservers! if you need an http server to test with you can try the
rangehttpserver module:
$ pip install requests
$ pip install rangehttpserver
$ python -m RangeHTTPServer
"""
import requests
TAIL_SIZE = 1024
url = 'http://localhost:8000/lorem-ipsum.txt'
response = requests.head(url)
# not all servers return content-length in head, for some reason
assert 'content-length' in response.headers, 'Content length unknown- out of luck!'
# check the resource length and construct a request header for that range
full_length = int(response.headers['content-length'])
assert full_length > TAIL_SIZE
headers = {
    'range': 'bytes={}-{}'.format(full_length - TAIL_SIZE, full_length)
}
# Make a get request, with the range header
response = requests.get(url, headers=headers)
assert 'accept-ranges' in response.headers, 'Accept-ranges response header missing'
assert response.headers['accept-ranges'] == 'bytes'
assert len(response.text) == TAIL_SIZE
# Otherwise you get the entire file
response = requests.get(url)
assert len(response.text) == full_length
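Since the question is on Python 2.7 with only the standard library, here is a minimal sketch of the same HEAD-then-Range idea using urllib2. It assumes the server reports Content-Length on HEAD and honors Range requests (if it ignores Range, you simply get the whole body back); urlString is the URL variable from the question:

import urllib2

TAIL_SIZE = 1024  # roughly enough bytes to cover the last ~30 lines

# HEAD request to learn the total size without downloading the body
head_request = urllib2.Request(urlString)
head_request.get_method = lambda: 'HEAD'
full_length = int(urllib2.urlopen(head_request).info()['Content-Length'])

# GET only the final TAIL_SIZE bytes
tail_request = urllib2.Request(urlString)
tail_request.add_header('Range', 'bytes={}-'.format(max(full_length - TAIL_SIZE, 0)))
tail = urllib2.urlopen(tail_request).read()

for line in tail.split("\n"):
    if "xxxx" in line:
        print line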

Related

Simplify a streamed request.get and JSON response decode

I have been working on some code that will grab emergency incident information from a service called PulsePoint. It works with software built into computer-controlled dispatch centers.
This is an app that empowers citizen heroes who are CPR trained to help before a first responder arrives on scene. I'm merely using it to get other emergency incidents.
I reverse-engineered their app, as they have no documentation on how to make your own requests. Because of this I have knowingly left in the API key and auth info, since it's in plain text in the Android manifest file anyway.
I will definitely make a Python module eventually for interfacing with this service; for now it's just messy.
Anyhow, sorry for that long boring intro.
My real question is: how can I simplify this function so that it looks and runs a bit cleaner while making a timed request and returning a JSON object that can be accessed through subscripts?
import requests, time, json

def getjsonobject(agency):
    startsecond = time.strftime("%S")
    url = REDACTED
    body = []
    currentagency = requests.get(url=url, verify=False, stream=True,
                                 auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                                 timeout=13)
    for chunk in currentagency.iter_content(1024):
        body.append(chunk)
        if int(startsecond) + 5 < int(time.strftime("%S")):  # Shitty internet proof, with timeout above
            raise Exception("Server sent too much data")
    jsonstringforagency = str(b''.join(body))[2:][:-1]  # Removes characters that wrap the response body so that the next line doesn't error
    currentagencyjson = json.loads(jsonstringforagency)  # Loads response as decodable JSON
    return currentagencyjson

currentincidents = getjsonobject("lafdw")
for inci in currentincidents["incidents"]["active"]:
    print(inci["FullDisplayAddress"])
Requests handles acquiring the body data, checking for JSON, and parsing the JSON for you automatically, and since you're passing the timeout argument I don't think you need separate timeout handling. Requests also handles constructing the URL for GET requests, so you can put your query information into a dictionary, which is much nicer. Combining those changes and removing the unused imports gives you this:
import requests

params = dict(both=1,
              minimal=1,
              apikey=REDACTED)
url = REDACTED

def getjsonobject(agency):
    myParams = dict(params, agency=agency)
    return requests.get(url, verify=False, params=myParams, stream=True,
                        auth=requests.auth.HTTPBasicAuth(REDACTED, REDACTED),
                        timeout=13).json()
Which gives the same output for me.
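If you do want to surface a slow or unreachable server explicitly, here is a minimal sketch of catching the exception requests raises when the 13-second timeout is exceeded (the message text is just illustrative):

import requests

try:
    currentincidents = getjsonobject("lafdw")
except requests.exceptions.Timeout:
    print("PulsePoint did not respond within 13 seconds")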

Python - urllib3 get text from docx using tika server

I am using Python 3, urllib3 and tika-server-1.13 in order to get text from different types of files. This is my Python code:
def get_text(self, input_file_path, text_output_path, content_type):
    global config
    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type
    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }
    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub(r"[\[][^\]]+[\]]", "", data)
            final_text = re.sub(r"(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True
This code works for HTML files, but when I try to parse text from docx files it doesn't work.
I get back HTTP error code 422 (Unprocessable Entity) from the server.
Using the tika-server documentation, I tried curl to check whether it works with it:
curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
and it worked.
The tika-server docs say:
422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc
This is the correct mime type (I also checked it with Tika's detect system), it's supported, and the file is not encrypted.
I believe this is related to how I upload the file to the Tika server. What am I doing wrong?
You're not uploading the data in the same way. --data-binary in curl simply uploads the binary data as it is, with no encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, you're preventing urllib3 from setting that Content-Type header on the request itself, so Tika can't make sense of what it receives. Either stop overriding headers['Content-Type'], or simply pass body=input_file.read() instead of fields.
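A minimal sketch of the body= variant, assuming the same pool, headers, and retry loop as in the question; the point is that the raw bytes go out exactly as curl --data-binary sends them, with the Content-Type header describing them:

with open(input_file_path, "rb") as input_file:
    # no fields=, so urllib3 does no multipart encoding and keeps our Content-Type header
    response = self.pool.request('PUT', '/tika',
                                 headers=headers,
                                 body=input_file.read())
if response.status == 200:
    text = response.data.decode('utf-8')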
I believe you can make this much easier by using the tika-python module with Client Only Mode.
If you still insist on rolling your own client, maybe there are some clues in that module's source code showing how it handles all these different mime types... if you're having a problem with *.docx you will probably have issues with others.
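For illustration, here is a minimal sketch of tika-python in client-only mode, talking to the already running tika-server (the endpoint is an assumption based on the default port 9998; check the module's README for the exact current API):

import tika
tika.TikaClientOnly = True  # client-only mode: don't spawn a local Tika server
from tika import parser

# serverEndpoint is assumed to be the default http://localhost:9998
parsed = parser.from_file('test.docx', 'http://localhost:9998')
print(parsed["content"])   # extracted text
print(parsed["metadata"])  # document metadata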

How can I post using Python urllib in html input type submit [duplicate]

I'm trying to create a super-simplistic Virtual In / Out Board using wx/Python. I've got the following code in place for one of my requests to the server where I'll be storing the data:
import urllib
import urllib2

data = urllib.urlencode({'q': 'Status'})
u = urllib2.urlopen('http://myserver/inout-tracker', data)
for line in u.readlines():
    print line
Nothing special going on there. The problem I'm having is that, based on how I read the docs, this should perform a POST request because I've provided the data parameter, but that's not happening. I have this code in the index for that URL:
if (!isset($_POST['q'])) { die ('No action specified'); }
echo $_POST['q'];
And every time I run my Python app I get the 'No action specified' text printed to my console. I'm going to try to implement it using Request objects, as I've seen a few demos that include those, but I'm wondering if anyone can explain why I don't get a POST request with this code. Thanks!
-- EDITED --
This code does work and POSTs to my web page properly:
import urllib
import httplib

data = urllib.urlencode({'q': 'Status'})
h = httplib.HTTPConnection('myserver:8080')
headers = {"Content-type": "application/x-www-form-urlencoded",
           "Accept": "text/plain"}
h.request('POST', '/inout-tracker/index.php', data, headers)
r = h.getresponse()
print r.read()
I am still unsure why the urllib2 library doesn't POST when I provide the data parameter; to me the docs indicate that it should.
u = urllib2.urlopen('http://myserver/inout-tracker', data)
h.request('POST', '/inout-tracker/index.php', data, headers)
Using the path /inout-tracker without a trailing / doesn't fetch index.php. Instead the server will issue a 302 redirect to the version with the trailing /.
Doing a 302 will typically cause clients to convert a POST to a GET request.
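In other words, posting to the explicit path avoids the redirect, so the POST body survives. A minimal sketch against the same server as in the question:

import urllib
import urllib2

data = urllib.urlencode({'q': 'Status'})
# hit index.php (or at least the trailing slash) directly, so no 302 downgrades the POST to a GET
u = urllib2.urlopen('http://myserver/inout-tracker/index.php', data)
print u.read()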

Http POST Curl in python

I'm having trouble understanding how to issue an HTTP POST request using curl from inside of Python.
I'm trying to post to the Facebook Open Graph. Here is the example they give, which I'd like to replicate exactly in Python.
curl -F 'access_token=...' \
-F 'message=Hello, Arjun. I like this new API.' \
https://graph.facebook.com/arjun/feed
Can anyone help me understand this?
You can use httplib to POST with Python, or the higher-level urllib / urllib2:
import urllib
params = {}
params['access_token'] = '*****'
params['message'] = 'Hello, Arjun. I like this new API.'
params = urllib.urlencode(params)
f = urllib.urlopen("https://graph.facebook.com/arjun/feed", params)
print f.read()
There is also a Facebook-specific higher-level library for Python that does all the POSTing for you.
https://github.com/pythonforfacebook/facebook-sdk/
https://github.com/facebook/python-sdk
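For illustration, a minimal sketch using the facebook-sdk package linked above; the access token is a placeholder and the exact call is an assumption, so check the SDK's own docs:

import facebook  # pip install facebook-sdk

graph = facebook.GraphAPI('your-access-token')  # placeholder token
# post to the authenticated user's feed, the equivalent of the curl example
graph.put_object(parent_object='me', connection_name='feed',
                 message='Hello, Arjun. I like this new API.')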
Why do you use curl in the first place?
Python has extensive libraries for Facebook and built-in libraries for web requests; calling another program and capturing its output is unnecessary.
That said,
First, from the Python docs:
data may be a string specifying additional data to send to the server,
or None if no such data is needed. Currently HTTP requests are the
only ones that use data; the HTTP request will be a POST instead of a
GET when the data parameter is provided. data should be a buffer in
the standard application/x-www-form-urlencoded format. The
urllib.urlencode() function takes a mapping or sequence of 2-tuples
and returns a string in this format. urllib2 module sends HTTP/1.1
requests with Connection:close header included.
So,
import urllib2, urllib

parameters = {}
parameters['token'] = 'sdfsdb23424'
parameters['message'] = 'Hello world'
target = 'http://www.target.net/work'
parameters = urllib.urlencode(parameters)
handler = urllib2.urlopen(target, parameters)

while True:
    if handler.code < 400:
        print 'done'
        # call your job
        break
    elif handler.code >= 400:
        print 'bad request or error'
        # failed
        break

Trouble with pycurl.POSTFIELDS

I'm familiar with CURL in PHP but am using it for the first time in Python with pycurl.
I keep getting the error:
Exception Type: error
Exception Value: (2, '')
I have no idea what this could mean. Here is my code:
import urllib
import StringIO
import pycurl

data = {'cmd': '_notify-synch',
        'tx': str(request.GET.get('tx')),
        'at': paypal_pdt_test
        }
post = urllib.urlencode(data)

b = StringIO.StringIO()
ch = pycurl.Curl()
ch.setopt(pycurl.URL, 'https://www.sandbox.paypal.com/cgi-bin/webscr')
ch.setopt(pycurl.POST, 1)
ch.setopt(pycurl.POSTFIELDS, post)
ch.setopt(pycurl.WRITEFUNCTION, b.write)
ch.perform()
ch.close()
The error is referring to the line ch.setopt(pycurl.POSTFIELDS, post)
I do it like that:
post_params = [
    ('ASYNCPOST', True),
    ('PREVIOUSPAGE', 'yahoo.com'),
    ('EVENTID', 5),
]
resp_data = urllib.urlencode(post_params)
mycurl.setopt(pycurl.POSTFIELDS, resp_data)
mycurl.setopt(pycurl.POST, 1)
...
mycurl.perform()
I know this is an old post but I've just spent my morning trying to track down this same error. It turns out that there's a bug in pycurl that was fixed in 7.16.2.1 that caused setopt() to break on 64-bit machines.
It would appear that your pycurl installation (or curl library) is damaged somehow. From the curl error codes documentation:
CURLE_FAILED_INIT (2)
Very early initialization code failed. This is likely to be an internal error or problem.
You will possibly need to re-install or recompile curl or pycurl.
However, to do a simple POST request like you're doing, you can actually use Python's urllib instead of cURL:
import urllib
postdata = urllib.urlencode(data)
resp = urllib.urlopen('https://www.sandbox.paypal.com/cgi-bin/webscr', data=postdata)
# resp is a file-like object, which means you can iterate it,
# or read the whole thing into a string
output = resp.read()
# resp.code returns the HTTP response code
print resp.code # 200
# resp has other useful data, .info() returns a httplib.HTTPMessage
http_message = resp.info()
print http_message['content-length'] # '1536' or the like
print http_message.type # 'text/html' or the like
print http_message.typeheader # 'text/html; charset=UTF-8' or the like
# Make sure to close
resp.close()
To open an https:// URL, you may need to install PyOpenSSL:
http://pypi.python.org/pypi/pyOpenSSL
Some distributions include this; others provide it as an extra package right through your favorite package manager.
Edit: Have you called pycurl.global_init() yet? I still recommend urllib/urllib2 where possible, as your script will be more easily moved to other systems.
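For reference, a minimal sketch of explicit global initialization in pycurl; this normally happens implicitly, so treat it as a diagnostic step rather than a required fix:

import pycurl

# explicitly initialize libcurl's global state before creating any handles
pycurl.global_init(pycurl.GLOBAL_ALL)

ch = pycurl.Curl()
# ... setopt() calls and perform() as in the question ...
ch.close()

# release libcurl's global state once all handles are closed
pycurl.global_cleanup()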
