Downloading file retrieved from MongoDB using Flask - python

I'm building a web application for uploading and downloading files to and from MongoDB using Flask. First I search a particular collection in the MongoDB database for a matching string; if any document matches, I need to create a dynamic URL (clickable from the search page) to download the file using its ObjectId. Once I click the dynamic URL, it should retrieve the file stored in MongoDB for that ObjectId and download it. I tried setting response.headers['Content-Type'] and response.headers["Content-Dispostion"] to the original values, but for some reason the download is not working as expected.
route.py
import pymongo
from bson.objectid import ObjectId
from flask import make_response, render_template

@app.route('/download/<fileId>', methods=['GET', 'POST'])
def download(fileId):
    connection = pymongo.MongoClient()
    # get a handle to the test database
    db = connection.test
    uploads = db.uploads
    try:
        query = {'_id': ObjectId(fileId)}
        cursor = uploads.find(query)
        for doc in cursor:
            fileName = doc['fileName']
            response = make_response(doc['binFile'])
            response.headers['Content-Type'] = doc['fileType']
            response.headers['Content-Dispostion'] = "attachment; filename=" + fileName
            print(response.headers)
            return response
    except Exception as e:
        return render_template('Unsuccessful.html')
What should I do so that I can download the file (retrieval from MongoDB works as expected) with the same file name and data as I uploaded earlier?
Below is the log from a recent run.
The file (in this case "Big Data Workflows presentation 1.pptx") retrieved from MongoDB downloads with the ObjectId as its file name, even though I'm setting the file name to the original one.
Please let me know if I'm missing any detail; I'll update the post accordingly.
Thanks in advance,

Thank you @Bartek Jablonski for your input.
Finally I made this work by tweaking the code a little and creating a new collection in MongoDB (I got lucky this time, I guess).
@app.route('/download/<fileId>', methods=['GET', 'POST'])
def download(fileId):
    connection = pymongo.MongoClient()
    # get a handle to the nrdc database
    db = connection.nrdc
    uploads = db.uploads
    try:
        query = {'_id': ObjectId(fileId)}
        cursor = uploads.find(query)
        for doc in cursor:
            fileName = doc['fileName']
            response = make_response(doc['binFile'])
            response.headers['Content-Type'] = doc['fileType']
            response.headers["Content-Disposition"] = "attachment; filename=\"%s\"" % fileName
            return response
    except Exception as e:
        # self.errorList.append("No results found." + type(e))
        return False
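For reference, here is a minimal sketch of the same route using Flask's send_file helper, which builds the Content-Disposition header for you. It assumes the nrdc.uploads collection and the fileName / fileType / binFile fields used above; download_name is the Flask 2.x name of the parameter (attachment_filename on older versions).
import io

import pymongo
from bson.objectid import ObjectId
from flask import Flask, render_template, send_file

app = Flask(__name__)

@app.route('/download/<fileId>')
def download(fileId):
    db = pymongo.MongoClient().nrdc
    # Look up the single document for this ObjectId
    doc = db.uploads.find_one({'_id': ObjectId(fileId)})
    if doc is None:
        return render_template('Unsuccessful.html')
    # send_file streams the stored bytes and sets Content-Disposition
    # from download_name (attachment_filename on Flask < 2.0)
    return send_file(io.BytesIO(doc['binFile']),
                     mimetype=doc['fileType'],
                     as_attachment=True,
                     download_name=doc['fileName'])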

Related

<Response [400]> when sending an API GET request with a JSON body to an API deployed on Google App Engine

I'm working on a school assignment where I've created a simple API to read and add data to a Firebase database. We've tried it locally and it works fine. Now I'm trying to deploy it to Google App Engine and I've encountered a very annoying issue. If I set the argument inside the function (as is now commented out), I get a valid response. But if I try to send a JSON body and parse it with RequestParser, I get a <Response 400>. Does anyone know why this works locally but not once it's deployed to Google App Engine? If anything is unclear, I'll be happy to clarify!
def get(self):
    parser = reqparse.RequestParser()
    parser.add_argument("name", type=str, help="name of recipe or ingredient is required", required=True)
    args = parser.parse_args()
    # args = dict({"name": "Carbonara"})
    if args["name"] == 'all':
        hej = getCollection()
        return {'data': hej}, 200
    recipes = getRecipes(args["name"])
    if recipes != None:
        return recipes, 200
    result = recipeContains(args["name"])
    if result != []:
        return result, 200
    return "No results"
response = requests.get("https://cohesive-photon-346611.ew.r.appspot.com/recipes", {"name":"Carbonara"})
response = requests.get("https://cohesive-photon-346611.ew.r.appspot.com/recipes")

AWS Textract multipage PDF only extracts 1st page for Form and Table extraction

I am using AWS Textract for form and table extraction with the following code. For some PDFs it extracts forms from all the pages, but for others it extracts only the first page. When using the Textract user interface it extracts all the pages. What could be the reason for this?
I am using the following code, which is available from AWS.
import time

import boto3

def create_client(access_key, secret_key):
    return boto3.client('textract', region_name='us-east-2',
                        aws_access_key_id=access_key,
                        aws_secret_access_key=secret_key)

def isJobComplete(jobId):
    client = create_client(access_key, secret_key)
    response = client.get_document_analysis(JobId=jobId)
    status = response["JobStatus"]
    print("Job status: {}".format(status))
    while status == "IN_PROGRESS":
        time.sleep(2)
        response = client.get_document_analysis(JobId=jobId)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
    return status

def getJobResults(jobId):
    client = create_client(access_key, secret_key)
    response = client.get_document_analysis(JobId=jobId)
    return response
Edited:
It looks like it's related to the response size; the size is almost fixed.
Can anyone help me with this?
Found the solution...
There is a parameter called NextToken. From the current response you can take the NextToken value, pass it as a parameter to get_document_analysis, and iterate until NextToken is no longer returned. You will get the responses in batches.
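A minimal sketch of that pagination, reusing create_client and the credential variables from the code above (everything else is illustrative):
def getAllJobResults(jobId):
    """Collect every page of the Textract analysis by following NextToken."""
    client = create_client(access_key, secret_key)
    pages = []
    next_token = None
    while True:
        if next_token:
            response = client.get_document_analysis(JobId=jobId, NextToken=next_token)
        else:
            response = client.get_document_analysis(JobId=jobId)
        pages.append(response)
        next_token = response.get("NextToken")
        if not next_token:
            break
    return pages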

Google Drive API: How to download files from Google Drive?

import json

import requests

access_token = ''
session = requests.Session()

r = session.request('get', 'https://www.googleapis.com/drive/v3/files?access_token=%s' % access_token)
response_text = str(r.content, encoding='utf-8')
files_list = json.loads(response_text).get('files')
files_id_list = []
for item in files_list:
    files_id_list.append(item.get('id'))
for item in files_id_list:
    file_r = session.request('get', 'https://www.googleapis.com/drive/v3/files/%s?alt=media&access_token=%s' % (item, access_token))
    print(file_r.content)
I use the above code and Google shows:
We're sorry ...
... but your computer or network may be sending automated queries. To protect our users, we can't process your request right now.
I don't know whether this method simply can't be used for downloading, or where the problem is.
The reason you are getting this error is that you are requesting the data in a loop, which sends many requests to Google's server in quick succession, and hence the error:
We're sorry ... ... but your computer or network may be sending automated queries
Also, the access_token should not be placed in the request URL; we should put it in the Authorization header. You can try this out on the OAuth Playground site (oauthplayground).
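A minimal sketch of the header-based version of the code above, passing the token with the standard Authorization: Bearer scheme instead of in the URL:
import requests

access_token = ''
session = requests.Session()
session.headers.update({'Authorization': 'Bearer %s' % access_token})

# List files, then fetch each one's content with alt=media
r = session.get('https://www.googleapis.com/drive/v3/files')
for item in r.json().get('files', []):
    file_r = session.get(
        'https://www.googleapis.com/drive/v3/files/%s?alt=media' % item['id'])
    print(file_r.content)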

Python Downloading an S3 file without knowing key name

I am writing a Python script that runs a query through Athena, outputs the result to S3, and downloads it to my computer. I am able to run my query through Athena and output the result to S3. The next step, which I can't seem to figure out, is how to download it to my computer without knowing the key name.
Is there a way to look up the object key within my Python script after outputting it to Athena?
What I have completed:
import time

import boto3

client = boto3.client('athena')

# Output location and DB
s3_output = 's3_output_here'
database = 'database_here'

# Function to run Athena query
def run_query(query, database, s3_output):
    while True:
        try:
            response = client.start_query_execution(
                QueryString=query,
                QueryExecutionContext={
                    'Database': database
                },
                ResultConfiguration={
                    'OutputLocation': s3_output,
                }
            )
            return response
        except client.exceptions.TooManyRequestsException as e:
            print('Too many requests, trying again after sleep')
            time.sleep(100)

# Our SQL Query
query = """
SELECT *
FROM test
"""

print("Running query to Athena...")
res = run_query(query, database, s3_output)
I understand how to download a file with this code:
try:
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'KEY_HERE')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
So how can I read the key name after running my first completed code?
You can get the key using the get_key method provided by the boto library. This is how I download things from S3:
with open("path/aws-credentials.json") as f:
data= json.load(f)
conn = boto.connect_s3(data["accessKeyId"], data["secretAccessKey"])
bucket = conn.get_bucket('your_bucket')
file_path = bucket.get_key('path/to/s3/file')
file_path.get_contents_to_filename('path/on/local/computer/filename')
You can hardcode your credentials into the code if you are just testing something out, but if you are planning on putting this into production, it's best to store your credentials externally in something like a json file.
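Another option, sketched under the assumption that s3_output from the question points at something like s3://your-bucket-name/athena-results/ (both names are placeholders): Athena normally writes the result file under the configured OutputLocation using the QueryExecutionId as the object name, so the key can be derived from the response of start_query_execution instead of being looked up.
import boto3

BUCKET_NAME = 'your-bucket-name'   # placeholder bucket

res = run_query(query, database, s3_output)
execution_id = res['QueryExecutionId']

# Athena usually names the result object <prefix>/<QueryExecutionId>.csv
key = 'athena-results/{}.csv'.format(execution_id)

# In practice, poll get_query_execution until the state is SUCCEEDED before downloading
boto3.resource('s3').Bucket(BUCKET_NAME).download_file(key, 'results.csv')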

Web service in Python to get the schema information of a table in BigQuery

I wrote a web service in Python that accepts the dataset name and the table name as inputs in the URL, passes them to BigQuery, and returns the table's fields as output.
I am using a Python client for BigQuery and accessing the schema information like this, but I am not able to get the result I expect.
It returns: "Invalid dataset ID "publicdata:samples". Dataset IDs must be alphanumeric (plus underscores, dashes, and colons) and must be at most 1024 characters long."
import web
from bigquery import get_client

urls = (
    '/GetFields(.*)', 'get_Fields'
)
app = web.application(urls, globals())

class get_Fields:
    def GET(self, r):
        datasetname = web.input().datasetName
        tablename = web.input().tableName
        # BigQuery project id as listed in the Google Developers Console.
        project_id = 'din-1085'
        # Service account email address as listed in the Google Developers Console.
        service_account = '101363222700-epqo6lmkl67j6u1qafha9dke0pmcck3@developer.gserviceaccount.com'
        # PKCS12 or PEM key provided by Google.
        key = 'Digin-d2421e7da9.p12'
        client = get_client(project_id, service_account=service_account,
                            private_key_file=key, readonly=True)
        # Submit an async query.
        job_id, _results = client.get_table_schema(datasetname, tablename)
        # Retrieve the results.
        return results

if __name__ == "__main__":
    app.run()
This is the data that I pass:
http://localhost:8080/GetFields?datasetName=publicdata:samples&tableName=shakespeare
Dataset name: publicdata:samples
Table name: shakespeare
Expected Output:
word
word_count
corpus
corpus_date
I finally made it work by changing this line.
From:
# Submit an async query.
job_id, _results = client.get_table_schema(datasetname,tablename)
To:
# Submit an async query.
results = client.get_table_schema(datasetname,tablename)
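With that change, the handler can return just the field names. A small sketch, assuming get_table_schema returns the schema as a list of field dicts with a 'name' key (the way the BigQuery API represents a schema):
import json

def schema_field_names(client, datasetname, tablename):
    # Each schema entry is a dict with 'name', 'type' and 'mode';
    # keep only the names, e.g. word, word_count, corpus, corpus_date.
    schema = client.get_table_schema(datasetname, tablename)
    return [field['name'] for field in schema]

# Inside GET this becomes:
#     return json.dumps(schema_field_names(client, datasetname, tablename))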
