I am using python3, urllib3 and tika-server-1.13 in order to get text from different types of files. This is my python code:
def get_text(self, input_file_path, text_output_path, content_type):
    global config
    headers = util.make_headers()
    mime_type = ContentType.get_mime_type(content_type)
    if mime_type != '':
        headers['Content-Type'] = mime_type
    with open(input_file_path, "rb") as input_file:
        fields = {
            'file': (os.path.basename(input_file_path), input_file.read(), mime_type)
        }
    retry_count = 0
    while retry_count < int(config.get("Tika", "RetriesCount")):
        response = self.pool.request('PUT', '/tika', headers=headers, fields=fields)
        if response.status == 200:
            data = response.data.decode('utf-8')
            text = re.sub(r"[\[][^\]]+[\]]", "", data)
            final_text = re.sub(r"(\n(\t\r )*\n)+", "\n\n", text)
            with open(text_output_path, "w+") as output_file:
                output_file.write(final_text)
            break
        else:
            if retry_count == (int(config.get("Tika", "RetriesCount")) - 1):
                return False
            retry_count += 1
    return True
This code works for HTML files, but when I try to parse text from docx files it doesn't work.
I get back HTTP error code 422 from the server: Unprocessable Entity.
Using the tika-server documentation I've tried using curl to check if it works with it:
curl -X PUT --data-binary @test.docx http://localhost:9998/tika --header "Content-type: application/vnd.openxmlformats-officedocument.wordprocessingml.document"
and it worked.
At the tika server docs:
422 Unprocessable Entity - Unsupported mime-type, encrypted document & etc
This is the correct mime-type(also checked it with tika's detect system), it's supported and the file is not encrypted.
I believe this is related to how I upload the file to the tika server. What am I doing wrong?
You're not uploading the data in the same way. --data-binary in curl simply uploads the binary data as it is. No encoding. In urllib3, using fields causes urllib3 to generate a multipart/form-data message. On top of that, you're preventing urllib3 from properly setting that header on the request so Tika can understand it. Either stop updating headers['Content-Type'] or simply pass body=input_file.read().
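To see the difference on the wire, you can compare the raw bytes against what urllib3 builds when fields is given. A minimal sketch using urllib3's encode_multipart_formdata helper (the same machinery the fields parameter uses internally); the payload bytes here are a stand-in, not a real docx:

```python
from urllib3.filepost import encode_multipart_formdata

# stand-in for the bytes of a .docx file
data = b'PK\x03\x04 fake docx bytes'

# what fields= sends: the bytes wrapped in a multipart/form-data envelope
body, content_type = encode_multipart_formdata({
    'file': ('test.docx', data,
             'application/vnd.openxmlformats-officedocument.wordprocessingml.document')
})
print(content_type)  # multipart/form-data; boundary=...
print(data in body)  # True: the payload sits between boundary/header lines
print(body == data)  # False: Tika receives the envelope, not a bare .docx

# what curl --data-binary (and body=) sends: the bytes, untouched
# response = self.pool.request('PUT', '/tika', headers=headers, body=data)
```

The 422 comes from Tika trying to parse the multipart envelope itself as a docx file.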
I believe you can make this much easier by using the tika-python module with Client Only Mode.
If you still insist on rolling your own client, maybe there are some clues in the source code for this module showing how it handles all these different mime types... if you're having a problem with *.docx you will probably have issues with others.
I am trying to download a video that has been uploaded to the cloud, and I am using an API to extract the data.
The Python script seems to download the file fine, but when I open the video it throws this error:
I have tried using different players (VLC, Windows Media Player, etc.) to play the video but have not had any luck. Can someone please help?
if res.status_code == 200:
    body = res.json()
    for meeting in body["meetings"]:
        try:
            password = requests.get(
                f"{root}meetings/{meeting['uuid']}/recordings/settings?access_token={token}").json()["password"]
            url = f"https://api.zoom.us/v2/meetings/{meeting['uuid']}/recordings/settings?access_token={token}"
            res = requests.patch(
                url,
                data=json.dumps({"password": ""}),
                headers=sess_headers)
        except:
            pass
        topic = meeting["topic"]
        try:
            os.makedirs("downloads")
        except:
            pass
        for i, recording in enumerate(meeting["recording_files"]):
            # os.makedirs(topic)
            download_url = recording["download_url"]
            name = recording["recording_start"] + \
                "-" + meeting["topic"]
            ext = recording["file_type"]
            filename = f"{name}.{ext}"
            path = f'./downloads/{filename}'.replace(":", ".")
            res = requests.get(download_url, headers=sess_headers)
            with open(Path(path), 'wb') as f:
                f.write(res.content)
else:
    print(res.text)
One possible problem is this: after each res = requests.get(...) you should insert the line res.raise_for_status().
This checks that the status code was 200.
By default, requests doesn't raise anything when the status code is not 200, so your res.content may be an error response body instead of the video.
With res.raise_for_status(), requests throws an error for any non-200 status code, saving you from silently writing bad data to disk.
But a status code of 200 doesn't necessarily mean there was no error. Some servers respond with HTML containing an error description and status code 200.
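As a local illustration (simulating a failed response instead of hitting the network), raise_for_status turns a silently bad status into an exception:

```python
import requests

# simulate a response with a bad status code, without any network call
res = requests.models.Response()
res.status_code = 404

try:
    res.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    downloaded_ok = True
except requests.HTTPError:
    downloaded_ok = False

print(downloaded_ok)  # False: the error surfaces instead of bad bytes being saved
```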
Another possible problem could be that the download URL is missing an authorization token, in which case you need to provide it through headers. So instead of the last requests.get(...) put this code:

res = requests.get(download_url, headers={
    **sess_headers, 'Authorization': 'Bearer ' + token})
Also check what content type the resulting response has: after the last res = requests.get(...), do this:

print('headers:', res.headers)

and check what is in there. Specifically look at the Content-Type field. It should be some binary type like application/octet-stream or video/mp4, but definitely not a text format like application/json or text/html; a text file is definitely not a video file. If it is text/html, try renaming the file to test.html and opening it in a browser to see what's there; the server probably responded with some error inside that HTML.
Also visually compare, in some viewer, the content of the two files: the one downloaded by the script and one downloaded by some other downloader (e.g. a browser). Maybe there is some obvious problem visible by eye.
Also, the file size should be quite big for a video. If it is around 50 KB, then possibly some bad data is inside.
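These sanity checks can be bundled into a small helper. A hypothetical sketch: the function name and the 100 KB threshold are arbitrary choices, not part of any API:

```python
def looks_like_video(headers, content):
    """Heuristic sanity checks before trusting a downloaded 'video'."""
    ctype = headers.get('Content-Type', '')
    # a text response is almost certainly an error page, not media
    if ctype.startswith(('text/', 'application/json')):
        return False
    # real recordings should be far larger than a few kilobytes
    return len(content) > 100 * 1024

print(looks_like_video({'Content-Type': 'video/mp4'}, b'\x00' * 200_000))      # True
print(looks_like_video({'Content-Type': 'text/html'}, b'<html>error</html>'))  # False
```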
UPDATE:
The following solution finally worked: replacing the last requests.get(...) with the line:
res = requests.get(download_url + '?access_token=' + token, headers=sess_headers)
I am communicating with an API using http.client in Python 3.6.2.
In order to upload a file it requires a three stage process.
I have managed to talk successfully using POST methods and the server returns data as I expect.
However, the stage that requires the actual file to be uploaded is a PUT method - and I cannot figure out how to syntax the code to include a pointer to the actual file on my storage - the file is an mp4 video file.
Here is a snippet of the code with my noob annotations :)
#define connection as HTTPS and define URL
uploadstep2 = http.client.HTTPSConnection("grabyo-prod.s3-accelerate.amazonaws.com")
#define headers
headers = {
    'accept': "application/json",
    'content-type': "application/x-www-form-urlencoded"
}
#define the structure of the request and send it.
#Here it is a PUT request to the unique URL as defined above with the correct file and headers.
uploadstep2.request("PUT", myUniqueUploadUrl, body="C:\Test.mp4", headers=headers)
#get the response from the server
uploadstep2response = uploadstep2.getresponse()
#read the data from the response and put to a usable variable
step2responsedata = uploadstep2response.read()
The response I am getting back at this stage is an
"Error 400 Bad Request - Could not obtain the file information."
I am certain this relates to the body="C:\Test.mp4" section of the code.
Can you please advise how I can correctly reference a file within the PUT method?
Thanks in advance
uploadstep2.request("PUT", myUniqueUploadUrl, body="C:\Test.mp4", headers=headers)
will put the actual string "C:\Test.mp4" in the body of your request, not the content of the file named "C:\Test.mp4" as you expect.
You need to open the file and read its content, then pass that as the body. Or stream it, but AFAIK http.client does not support that, and since your file seems to be a video, it is potentially huge and would use plenty of RAM for no good reason.
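A runnable sketch of that point, using a throwaway local server in place of the real S3 URL (the handler, the temp file, and the fake video bytes are all stand-ins):

```python
import http.client
import http.server
import os
import tempfile
import threading

received = {}

class PutHandler(http.server.BaseHTTPRequestHandler):
    def do_PUT(self):
        # capture exactly what the client uploaded
        length = int(self.headers['Content-Length'])
        received['body'] = self.rfile.read(length)
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(('127.0.0.1', 0), PutHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# a stand-in for C:\Test.mp4
path = os.path.join(tempfile.mkdtemp(), 'Test.mp4')
with open(path, 'wb') as f:
    f.write(b'\x00\x00\x00\x18ftypmp42 fake video bytes')

conn = http.client.HTTPConnection('127.0.0.1', server.server_address[1])
with open(path, 'rb') as f:
    # pass the file's bytes as the body, not the path string
    conn.request('PUT', '/upload', body=f.read(),
                 headers={'content-type': 'video/mp4'})
resp = conn.getresponse()
resp.read()
print(resp.status)  # 200

with open(path, 'rb') as f:
    print(received['body'] == f.read())  # True: the server got the file content
server.shutdown()
```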
My suggestion would be to use requests, which is a way better lib to do this kind of things:
import requests
with open(r'C:\Test.mp4', 'rb') as finput:
    response = requests.put('https://grabyo-prod.s3-accelerate.amazonaws.com/youruploadpath', data=finput)
print(response.json())
I do not know if it is useful for you, but you can try to send a POST request with the requests module:

import requests

url = ""
data = {'title': 'metadata', 'timeDuration': 120}
with open('/path/your_file.mp3', 'rb') as mp3_f:
    files = {'messageFile': mp3_f}
    # extra fields go in data=; json= is silently ignored when files= is present
    req = requests.post(url, files=files, data=data)
print(req.status_code)
print(req.content)

Hope it helps.
I'm attempting to pull information from a log file posted online and read through the output. The only information I really need is at the end of the file. These files are pretty big, and storing the entire socket output in a variable and reading through it consumes a lot of memory. Is there a way to read the socket from bottom to top?
What I currently have:
socket = urllib.urlopen(urlString)
OUTPUT = socket.read()
socket.close()
OUTPUT = OUTPUT.split("\n")
for line in OUTPUT:
    if "xxxx" in line:
        print line
I am using Python 2.7. I pretty much want to read about 30 lines from the very end of the socket's output.
What you want in this use case is an HTTP Range request. Here is a tutorial I located:
http://stuff-things.net/2015/05/13/web-scale-http-tail/
I should clarify: the advantage of getting the size with a Head request, then doing a Range request, is that you do not have to transfer all the content. You mentioned you have pretty big file resources, so this is going to be the best solution :)
edit: added this code below...
Here is a demo (simplified) of that blog article, but translated into Python. Please note this will not work with all HTTP servers! More comments inline:
"""
illustration of how to 'tail' a file using http. this will not work on all
webservers! if you need an http server to test with you can try the
rangehttpserver module:

$ pip install requests
$ pip install rangehttpserver
$ python -m RangeHTTPServer
"""
import requests

TAIL_SIZE = 1024

url = 'http://localhost:8000/lorem-ipsum.txt'
response = requests.head(url)

# not all servers return content-length in head, for some reason
assert 'content-length' in response.headers, 'Content length unknown - out of luck!'

# check the resource length and construct a request header for that range
full_length = int(response.headers['content-length'])
assert full_length > TAIL_SIZE
headers = {
    'range': 'bytes={}-{}'.format(full_length - TAIL_SIZE, full_length)
}

# make a get request with the range header
response = requests.get(url, headers=headers)
assert 'accept-ranges' in response.headers, 'Accept-ranges response header missing'
assert response.headers['accept-ranges'] == 'bytes'
assert len(response.text) == TAIL_SIZE

# otherwise you get the entire file
response = requests.get(url)
assert len(response.text) == full_length
I want to use Python to post an attachment to JIRA via the JIRA REST API, without any other packages that need to be installed.
I noticed that "This resource expects a multipart post." and I tried it, but maybe my method was wrong; I failed.
I just want to know how I can do the following cmd via Python urllib2:
curl -D- -u admin:admin -X POST -H "X-Atlassian-Token: nocheck" -F "file=@myfile.txt" /rest/api/2/issue/TEST-123/attachments
And I don't want to use subprocess.popen.
The key to getting this to work was in setting up the multipart-encoded files:
import requests
# Setup authentication credentials
credentials = requests.auth.HTTPBasicAuth('USERNAME','PASSWORD')
# JIRA required header (as per documentation)
headers = { 'X-Atlassian-Token': 'no-check' }
# Setup multipart-encoded file
files = [ ('file', ('file.txt', open('/path/to/file.txt','rb'), 'text/plain')) ]
# (OPTIONAL) Multiple multipart-encoded files
files = [
('file', ('file.txt', open('/path/to/file.txt','rb'), 'text/plain')),
('file', ('picture.jpg', open('/path/to/picture.jpg', 'rb'), 'image/jpeg')),
('file', ('app.exe', open('/path/to/app.exe','rb'), 'application/octet-stream'))
]
# Please note that all entries are called 'file'.
# Also, you should always open your files in binary mode when using requests.
# Run request
r = requests.post(url, auth=credentials, files=files, headers=headers)
https://2.python-requests.org/en/master/user/advanced/#post-multiple-multipart-encoded-files
As in the official documentation, we need to open the file in binary mode and then upload it. I hope this small piece of code helps you :)

from jira import JIRA

# Server authentication
username = "XXXXXX"
password = "XXXXXX"
options = {'server': 'https://jira.example.com'}  # placeholder server URL
jira = JIRA(options, basic_auth=(str(username), str(password)))

# Get an instance of the ticket
issue = jira.issue('PROJ-1')

# Upload the file
with open('/some/path/attachment.txt', 'rb') as f:
    jira.add_attachment(issue=issue, attachment=f)

https://jira.readthedocs.io/en/master/examples.html#attachments
You can use the jira-python package.
Install it like this:
pip install jira-python
To add attachments, use the add_attachment method of the jira.client.JIRA class:
add_attachment(*args, **kwargs)

Attach an attachment to an issue and returns a Resource for it. The client will not attempt to open or validate the attachment; it expects a file-like object to be ready for its use. The user is still responsible for tidying up (e.g., closing the file, killing the socket, etc.)

Parameters:
issue – the issue to attach the attachment to
attachment – file-like object to attach to the issue, also works if it is a string with the filename
filename – optional name for the attached file. If omitted, the file object's name attribute is used. If you acquired the file-like object by any other method than open(), make sure that a name is specified in one way or the other.
You can find out more information and examples in the official documentation
Sorry for my unclear question.
Thanks to How to POST attachment to JIRA using REST API?, I have already resolved it:
boundary = '----------%s' % ''.join(random.sample('0123456789abcdef', 15))
parts = []
parts.append('--%s' % boundary)
parts.append('Content-Disposition: form-data; name="file"; filename="%s"' % fpath)
parts.append('Content-Type: %s' % 'text/plain')
parts.append('')
parts.append(open(fpath, 'r').read())
parts.append('--%s--' % boundary)
parts.append('')
body = '\r\n'.join(parts)
url = deepcopy(self.topurl)
url += "/rest/api/2/issue/%s/attachments" % str(jrIssueId)
req = urllib2.Request(url, body)
req.add_header("Content-Type", "multipart/form-data; boundary=%s" % boundary)
req.add_header("X-Atlassian-Token", "nocheck")
res = urllib2.urlopen(req)
print res.getcode()
assert res.getcode() in range(200,207), "Error to attachFile " + jrIssueId
return res.read()
I'm having trouble understanding how to issue an HTTP POST request using curl from inside Python.
I'm trying to post to the Facebook open graph. Here is the example they give, which I'd like to replicate exactly in Python.
curl -F 'access_token=...' \
-F 'message=Hello, Arjun. I like this new API.' \
https://graph.facebook.com/arjun/feed
Can anyone help me understand this?
You can use httplib to POST with Python, or the higher level urllib:
import urllib
params = {}
params['access_token'] = '*****'
params['message'] = 'Hello, Arjun. I like this new API.'
params = urllib.urlencode(params)
f = urllib.urlopen("https://graph.facebook.com/arjun/feed", params)
print f.read()
There is also a Facebook specific higher level library for Python that does all the POST-ing for you.
https://github.com/pythonforfacebook/facebook-sdk/
https://github.com/facebook/python-sdk
Why do you use curl in the first place? Python has extensive libraries for Facebook and includes libraries for web requests; calling another program and parsing its output is unnecessary.
That said,
First from Python Doc
data may be a string specifying additional data to send to the server,
or None if no such data is needed. Currently HTTP requests are the
only ones that use data; the HTTP request will be a POST instead of a
GET when the data parameter is provided. data should be a buffer in
the standard application/x-www-form-urlencoded format. The
urllib.urlencode() function takes a mapping or sequence of 2-tuples
and returns a string in this format. urllib2 module sends HTTP/1.1
requests with Connection:close header included.
So,
import urllib2, urllib
parameters = {}
parameters['token'] = 'sdfsdb23424'
parameters['message'] = 'Hello world'
target = 'http://www.target.net/work'
parameters = urllib.urlencode(parameters)
handler = urllib2.urlopen(target, parameters)
if handler.code < 400:
    print 'done'
    # call your job
else:
    print 'bad request or error'
    # failed