Reading multiple page PDF file using Google Cloud Vision - python

I am currently trying to read a multiple-page PDF file using the Google Cloud Vision API. I am only able to read the first page of the PDF, and I am also getting an error on one line of my code. I have attached pieces of my code below. How can I solve this error and read the whole PDF instead of just one page?

Assuming that the Vision API JSON responses in your GCS bucket are correct, you need to take the whole response instead of only the first element of response['responses'], and loop through it to get the annotation for each page. See the code below:
blob_list = list(bucket.list_blobs(prefix=prefix))
print('Output files:')
for obj in blob_list:
    print(obj.name)

for blob in blob_list[1:]:
    json_string = blob.download_as_string()
    response = json.loads(json_string)
    pages_response = response['responses']  # get the complete response
    for page in pages_response:  # loop through all pages
        annotation = page['fullTextAnnotation']
        print('Full text:\n')
        print(annotation['text'])
        print('END OF PAGE')
        print('##########################')
I used a Google Vision sample file (gs://cloud-samples-data/vision/document_understanding/custom_0773375000.pdf) and processed 3 pages with it.
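For reference, here is a minimal sketch of the asynchronous OCR request that produces those JSON output files in GCS, assuming a recent google-cloud-vision release; the bucket URIs are placeholders, and batch_size controls how many pages end up in each output JSON (and therefore in each response['responses'] list):

from google.cloud import vision

client = vision.ImageAnnotatorClient()

feature = vision.Feature(type_=vision.Feature.Type.DOCUMENT_TEXT_DETECTION)
input_config = vision.InputConfig(
    gcs_source=vision.GcsSource(uri='gs://your-bucket/your-file.pdf'),  # placeholder
    mime_type='application/pdf')
output_config = vision.OutputConfig(
    gcs_destination=vision.GcsDestination(uri='gs://your-bucket/output/'),  # placeholder
    batch_size=2)  # pages per output JSON file

async_request = vision.AsyncAnnotateFileRequest(
    features=[feature], input_config=input_config, output_config=output_config)

operation = client.async_batch_annotate_files(requests=[async_request])
operation.result(timeout=420)  # wait for the OCR operation to finish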

Related

Uploading image string to Google Drive using pydrive

I need to upload an image string (like the one you get from requests.get(url).content) to Google Drive using the PyDrive package. I checked a similar question, but the accepted answer there was to save it in a temporary file on a local drive and then upload that.
However, I cannot do that because of local storage and permission restrictions.
The previously accepted answer was to use SetContentString(image_string.decode('utf-8')), since SetContentString requires a parameter of type str, not bytes.
However, the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte came up, as noted in the comments on that answer.
Is there any way to do this without using a temporary file: using PIL/BytesIO/anything that can convert the content so it can be uploaded correctly as a string, or somehow loading it with PIL as an image and uploading it with SetContentFile()?
A basic example of what I'm trying to do is:
img_content = requests.get('https://i.imgur.com/A5gIh7W.jpeg').content
file = drive.CreateFile({...})
file.SetContentString(img_content.decode('utf-8'))
file.Upload()
The pydrive documentation (Upload and update file content) says the following:
Managing file content is as easy as managing file metadata. You can set file content with either SetContentFile(filename) or SetContentString(content) and call Upload() just as you did to upload or update file metadata.
I searched for a method to directly upload binary data to Google Drive with pydrive, but I couldn't find one, so it seems there is no such method. Instead, in this answer I propose uploading the binary data using the requests module. In this case, the access token is retrieved from pydrive's authorization flow. The sample script is as follows.
Sample script:
from pydrive.auth import GoogleAuth
import io
import json
import requests

url = 'https://i.imgur.com/A5gIh7W.jpeg'  # Please set the direct link of the image file.
filename = 'sample file'  # Please set the filename on Google Drive.
folder_id = 'root'  # Please set the folder ID. The file is put into this folder.

gauth = GoogleAuth()
gauth.LocalWebserverAuth()

metadata = {
    "name": filename,
    "parents": [folder_id]
}
files = {
    'data': ('metadata', json.dumps(metadata), 'application/json'),
    'file': io.BytesIO(requests.get(url).content)
}
r = requests.post(
    "https://www.googleapis.com/upload/drive/v3/files?uploadType=multipart",
    headers={"Authorization": "Bearer " + gauth.credentials.access_token},
    files=files
)
print(r.text)
Note:
This script assumes that your URL is a direct link to the image file. Please be careful about this.
In this case, uploadType=multipart is used. The official documentation says the following (Ref):
Use this upload type to quickly transfer a small file (5 MB or less) and metadata that describes the file, in a single request. To perform a multipart upload, refer to Perform a multipart upload.
When you want to upload large files, please use a resumable upload instead (Ref).
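As a rough illustration, a minimal sketch of such a resumable upload using the same requests approach and the gauth and metadata objects from the script above (this assumes the whole file is sent in a single PUT; very large files can be sent in chunks instead):

session_resp = requests.post(
    "https://www.googleapis.com/upload/drive/v3/files?uploadType=resumable",
    headers={
        "Authorization": "Bearer " + gauth.credentials.access_token,
        "Content-Type": "application/json; charset=UTF-8",
    },
    data=json.dumps(metadata),
)
upload_url = session_resp.headers["Location"]  # session URI for the file data

r = requests.put(upload_url, data=requests.get(url).content)  # upload the bytes in one request
print(r.text)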
References:
Upload and update file content of pydrive
Upload file data of Drive API

Issue using Dropbox API in Python to access shared link and download images

So, as the title says, I am trying to download images from a public shared link, because when I try to download them with curl/wget or the download button, the file gets corrupted. There are 7000 images, about 2GB in total.
Here is my code
import dropbox

dbx = dropbox.Dropbox("token")
dbx.users_get_current_account()

url = "https://www.dropbox.com/sh/shared_link?dl=0"
shared_link = dropbox.files.SharedLink(url=url)
result = dbx.files_list_folder(path="", shared_link=shared_link)
file_list = []

def process_entries(entries):
    for entry in entries:
        if isinstance(entry, dropbox.files.FileMetadata):
            link = dbx.sharing_get_shared_link_file(url, path="/" + entry.name)
            print(entry.name)
            print(link)
            file_list.append([entry.name])

process_entries(result.entries)

while result.has_more:
    result = dbx.files_list_folder_continue(result.cursor)
    process_entries(result.entries)

print(len(file_list))
The list correctly shows a length of 7000 images. However, either I am doing something wrong, or I just can't figure out how to download those files. Any ideas?
The sharing_get_shared_link_file method returns a tuple of (dropbox.sharing.SharedLinkMetadata, requests.models.Response). The first item is the metadata, and the second item is the response object where you can access the actual file content.
Your code is currently just saving that tuple to one variable. If you want to download the file data, you can do something like this instead:
metadata, res = dbx.sharing_get_shared_link_file(url, path="/" + entry.name)
with open(entry.name, "wb") as f:
    f.write(res.content)
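For example, a minimal sketch of how that could be folded into the process_entries function from the question (dbx, url, and file_list are the same objects defined there; files are written to the current directory):

def process_entries(entries):
    for entry in entries:
        if isinstance(entry, dropbox.files.FileMetadata):
            # Fetch the file content via the shared link and save it locally
            metadata, res = dbx.sharing_get_shared_link_file(url, path="/" + entry.name)
            with open(entry.name, "wb") as f:
                f.write(res.content)
            file_list.append(entry.name)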

How to download a file using Python's requests module

I need to open the page automatically and download the file returned by the server
I have simple code to open the page and download the content. I am also pulling the headers so I know the name of the returned file. Below is the code:
downloadPageRequest = self.reqSession.get(self.url_file, stream=True)
headers = downloadPageRequest.headers
if 'content-disposition' in headers:
    file_name = re.findall("filename=(.+)", headers['content-disposition'])
That's what I got; it returns a list with the filename, but now I am stuck and have no idea how to open and go through the returned Excel file.
This has to be done using requests, which is why I cannot use any other method (e.g. selenium).
I will be thankful for your support.
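A minimal, self-contained sketch of one way to continue from that snippet, assuming the response body is the Excel file itself and that pandas (with openpyxl) is available to read it; the URL and fallback filename below are placeholders:

import re

import pandas as pd
import requests

session = requests.Session()  # stands in for self.reqSession from the question
url_file = "https://example.com/report.xlsx"  # placeholder for self.url_file
resp = session.get(url_file, stream=True)

file_name = "download.xlsx"  # fallback if the header is missing
if 'content-disposition' in resp.headers:
    match = re.findall("filename=(.+)", resp.headers['content-disposition'])
    if match:
        file_name = match[0].strip('"')

# Save the raw bytes to disk, then load the workbook into a DataFrame
with open(file_name, "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)

df = pd.read_excel(file_name)  # requires openpyxl for .xlsx files
print(df.head())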

Uploading files with google app engine

I've been following the Google App Engine tutorials for uploading an image.
I've set up a simple HTML page for uploading a file using the "file" input type, and the name of the element is "file".
The form enctype is multipart/form-data and the method is 'post'.
Following the example, I would store the image data as a blob and store it with my object
myfile = MyFile()
myfile.filedata = db.Blob(self.request.get('file'))
myfile.filename = self.request.get('filename')
myfile.put()
But when I look at what's stored in filedata via the datastore viewer, it isn't binary image data, but instead just:
236 bytes, SHA-1 = 63540e4ca6dba45e9ff3553787f7f23d330b7791
even though the image I uploaded is definitely larger than 236 bytes.
Should the snippet from above retrieve the entire image and put it in a blob?
It seems like all that's being stored in the blob is the request header.
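For context, here is a minimal sketch of the handler this question describes, using the legacy webapp and db APIs from the old GAE tutorial; the MyFile model fields and the 'filename' form field are assumptions based on the snippet above:

from google.appengine.ext import db, webapp

class MyFile(db.Model):
    filename = db.StringProperty()
    filedata = db.BlobProperty()

class UploadHandler(webapp.RequestHandler):
    def post(self):
        myfile = MyFile()
        # With enctype="multipart/form-data", self.request.get('file') should
        # return the raw bytes of the uploaded file rather than just a header.
        myfile.filedata = db.Blob(self.request.get('file'))
        myfile.filename = self.request.get('filename')
        myfile.put()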

In Python, is there a way I can download all or some of the image files (e.g. JPG/PNG) from a Google Images search result?

Is there a way I can download all or some of the image files (e.g. JPG/PNG) from a Google Images search result?
I can use the following code to download one image whose URL I already know:
import urllib.request
file = "Facts.jpg" # file to be written to
url = "http://www.compassion.com/Images/Hunger-Facts.jpg"
response = urllib.request.urlopen(url)
fh = open(file, "wb") #open the file for writing
fh.write(response.read()) # read from request while writing to file
To download multiple images, it has been suggested that I define a function and use that function to repeat the task for each image url that I would like to write to disk:
def image_request(url, file):
    response = urllib.request.urlopen(url)
    fh = open(file, "wb")  # open the file for writing
    fh.write(response.read())
And then loop over a list of urls with:
for i, url in enumerate(urllist):
    image_request(url, str(i) + ".jpg")
However, what I really want to do is download all or some of the image files (e.g. JPG/PNG) from my own Google Images search result without necessarily having a list of the image URLs beforehand.
P.S.
Please note that I am a complete beginner and would favour an answer that breaks down the broad steps to achieve this over one that gets bogged down in specific code. Thanks.
You can use the Google API like this, where BLUE and DOG are your search parameters:
https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=BLUE%20DOG
There is a developer guide about this here:
https://developers.google.com/image-search/v1/jsondevguide
You need to parse this JSON format before you can use the links directly.
Here's a start to your JSON parsing:
import json
j = json.loads('{"one" : "1", "two" : "2", "three" : "3"}')
print(j['two'])
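Putting the pieces together, a minimal sketch of fetching that (legacy) search API response and reusing the image_request function from the question; the responseData/results/url fields follow the JSON shape described in the linked developer guide, and the service itself may no longer be available:

import json
import urllib.request

def image_request(url, file):
    response = urllib.request.urlopen(url)
    with open(file, "wb") as fh:  # open the file for writing
        fh.write(response.read())

search_url = "https://ajax.googleapis.com/ajax/services/search/images?v=1.0&q=BLUE%20DOG"
with urllib.request.urlopen(search_url) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# Each result entry carries the direct image URL in its "url" field
results = (data.get("responseData") or {}).get("results", [])
for i, result in enumerate(results):
    image_request(result["url"], str(i) + ".jpg")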
