Storing files into MongoDB using Flask, pymongo and GridFS? - python

I have a very specific question. I'm writing an app that takes in an integer and a file in Flask and stores them as a key-value pair in MongoDB. I wrote the Mongo part with the help of a friend who works for MongoDB, and that part of the code works fine. What I want to know now is how exactly I'm supposed to take the files I receive from Flask and put them into MongoDB.
TL;DR: I am currently writing files to disk; how can I store them in MongoDB using functions that I already know are working but didn't write myself?
The tutorial I followed for taking in files is at http://runnable.com/UiPcaBXaxGNYAAAL/how-to-upload-a-file-to-the-server-in-flask-for-python
The code I'm talking about is hosted on my GitHub: https://github.com/DavidAwad/SpaceShare/blob/master/init.py
If you look at line 56, I want to put the file into MongoDB right there and then remove it from disk. I'm aware, however, that this is hideously inefficient, so if there's a way to write directly in and out of MongoDB without ever touching disk, I'm all ears.
The code I'm specifically interested in is this:
# put files in mongodb
def put_file(file_location, room_number):
    db_conn = get_db()
    gfs = gridfs.GridFS(db_conn)
    with open(file_location, "rb") as f:  # binary mode: GridFS stores raw bytes
        gfs.put(f, room=room_number)

# read files from mongodb
def read_file(output_location, room_number):
    db_conn = get_db()
    gfs = gridfs.GridFS(db_conn)
    _id = db_conn.fs.files.find_one(dict(room=room_number))['_id']
    # return gfs.get(_id).read()
    with open(output_location, 'wb') as f:  # binary mode: GridFS returns bytes
        f.write(gfs.get(_id).read())
... code code code
# make the filename safe, remove unsupported chars
filename = secure_filename(file.filename)
# move the file to our uploads folder
file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
put_file(app.config['UPLOAD_FOLDER'], space)  # PUT IN MONGODB HERE? IS THIS HOW I DO THAT???
# remove the file from disk as we don't need it anymore after the database insert
os.unlink(os.path.join(app.config['UPLOAD_FOLDER'], filename))
# maybe redirect user to the uploaded_file route, which will show the uploaded file
# maybe redirect user to the uploaded_file route, which will show the uploaded file.

Look at Flask-PyMongo (https://flask-pymongo.readthedocs.org/en/latest/), and more specifically at two functions: save_file and send_file.
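For instance, a minimal sketch of how those two helpers could be wired into a Flask app (the database name, URI and routes here are illustrative, not taken from the question):
from flask import Flask, request
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/spaceshare"  # assumed local MongoDB
mongo = PyMongo(app)

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["file"]
    # save_file streams the upload straight into GridFS, so nothing is written to disk
    mongo.save_file(f.filename, f)
    return "stored"

@app.route("/file/<filename>")
def fetch(filename):
    # send_file reads the file back out of GridFS and returns it as a response
    return mongo.send_file(filename)
This avoids the save-to-disk-then-unlink round trip from the question entirely.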

Related

REPL.it JSON Files

So recently I've been using Repl.it to host my Python code, but whenever I'm offline, any information stored in the JSON file is rolled back after a bit of time. Now I know this is a Repl-specific problem after doing some research, but is there any way I can fix it? My code itself is quite a few lines long, so I would rather not use a completely different storage method.
To successfully store data in JSON files on replit.com, it's important to load and dump them the correct way.
An example of storing data in a JSON file:
import json

with open("sample.json", "r") as file:
    sample = json.load(file)

sample["item"] = "Value"

with open("sample.json", "w") as file:
    json.dump(sample, file)
Let me know if you've already followed these steps.

pymongo: insert_one() is running but isn't adding anything to the MongoDB database?

I'm trying to upload a .txt file to a MongoDB database collection using PyCharm, but nothing appears in the collection. Here's the script I'm using at the moment:
from pymongo import MongoClient

client = MongoClient()
db = client.memorizer_data  # use a database called "memorizer_data"
collection = db.english     # and inside that DB, a collection called "english"

with open('7_1_1.txt', 'r') as f:
    text = f.read()  # read the txt file

name = '7_1_1.txt'
# build a document to be inserted
text_file_doc = {"file_name": name, "contents": text}
# insert the document into the "english" collection
collection.insert_one(text_file_doc)
PyCharm runs the script with no errors. I've also tried printing the acknowledged attribute just to see what comes up:
result = collection.insert_one(text_file_doc)
print(result.acknowledged)
Which is giving me True. I wasn't sure if I was actually connecting to my database, so I tried db.list_collection_names(), and my collection 'english' is in the list, so as far as I can tell I am connected to it.
I'm a newbie to MongoDB, so I realize I've probably gone about things the wrong way. At the moment I'm just trying to get the script working for a single .txt file before uploading everything my project uses to the db.
What makes you think there's nothing in the collection? Two ways to check:
In your pymongo code, add a final debug line:
print(collection.find_one())
Or, in the MongoDB shell:
use memorizer_data
db.english.findOne()

Retrieve the '_id' of a GridFS document by its 'filename'

I am currently working on a project in which I must retrieve a document uploaded to a MongoDB database using GridFS and store it in my local directory.
Up to now I have written these lines of code:
if not fs.exists({'filename': 'my_file.txt'}):
    CRAWLED_FILE = os.path.join(SAVING_FOLDER, 'new_file.txt')
else:
    file = fs.find_one({'filename': 'my_file.txt'})
    CRAWLED_FILE = os.path.join(SAVING_FOLDER, 'new_file.txt')
    with open(CRAWLED_FILE, 'wb') as f:
        f.write(file.read())
        f.close()
I believe that find_one doesn't let me write the content of the file stored in the database into a new file. f.write(file.read()) writes into the file just created (new_file.txt) the directory in which new_file.txt is stored! So I end up with a txt completely different from the one I uploaded to the database, and the only line in it is: E:\\my_folder\\sub_folder\\my_file.txt
It's kind of weird; I don't even know why it's happening.
I thought it could work if I used the fs.get(ObjectId(ID)) method, which, according to the official documentation of PyMongo and GridFS, provides a file-like interface for reading. However, I only know the name of the txt saved in the database; I have no clue what its ObjectId is, and I can't keep a list or dict of all my documents' IDs since that wouldn't be worthwhile. I have checked many posts here on StackOverflow, and everyone suggests using subscripting. Basically you create a cursor using fs.find() and then iterate over it, for example like this:
for x in fs.find({'filename': 'my_file.txt'}):
    ID = x['_id']
See, many answers here suggest I do the above, but the only problem is that the Cursor object is not subscriptable, and I have no clue how to resolve this issue.
I must find a way to get the document's '_id' given its filename, so I can later use it with fs.get(ObjectId(ID)).
Hope you can help me, thank you a lot!
Matteo
You can just access it like this:
ID = x._id
But "_" is a protected member in Python, so I was looking around for other solutions (could not find much). For getting just the ID, you could do:
for ID in fs.find({'filename': 'my_file.txt'}).distinct('_id'):
    # do something with ID
    ...
Since that only gets the IDs, you would probably need to do:
query = fs.find({'filename': 'my_file.txt'}).limit(1)  # equivalent to find_one
content = next(query, None)  # iterate the GridOutCursor; yields either one GridOut or None
if content:
    ID = content._id
    ...
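Putting it together, a minimal sketch (reusing fs and SAVING_FOLDER from the question; error handling omitted) that writes the stored content to a local file once the _id is known:
import os

grid_out = fs.get(ID)  # ID obtained from one of the lookups above
with open(os.path.join(SAVING_FOLDER, 'new_file.txt'), 'wb') as f:
    f.write(grid_out.read())  # GridOut is file-like, so read() returns the stored bytes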

Migrate CSV files from GCS to PostgreSQL

I'm trying to migrate CSV files, which have been exported from BigQuery, from Google Cloud Storage (GCS) to a PostgreSQL Cloud SQL instance using a Python script.
I was hoping to use the Google API but found this in the documentation:
Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.
As an alternative, I could use the psycopg2 library and stream the rows of the CSV file into the SQL instance. I can do this in three ways:
Line by line: read each line, submit the insert command, and then commit.
Batch stream: read each line, submit the insert commands, and commit after every 10 or 100 lines, etc. (a sketch of this approach follows this list).
The entire CSV: read each line, submit the insert commands, and only commit at the end of the document.
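To make the batch-stream option concrete, here is a minimal sketch using psycopg2.extras.execute_batch; the connection string, table and column names are placeholders, not taken from the question:
import csv
import psycopg2
from psycopg2.extras import execute_batch

conn = psycopg2.connect("host=localhost dbname=mydb user=me")  # placeholder connection
cur = conn.cursor()

with open('export.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= 1000:  # flush and commit every 1000 rows
            execute_batch(cur, "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", batch)
            conn.commit()
            batch = []
    if batch:  # flush the remainder
        execute_batch(cur, "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)", batch)
        conn.commit()

cur.close()
conn.close()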
My concern is that these CSV files could contain millions of rows, and running this process with any of the three options mentioned above seems like a bad idea to me.
What alternatives do I have?
Essentially I have some raw data in BigQuery on which we do some preprocessing before exporting to GCS in preparation for importing to the PostgreSQL instance.
I need to export this preprocessed data from BigQuery to the PostgreSQL instance.
This is not a duplicate of this question, as I'm looking for a solution that exports data from BigQuery to the PostgreSQL instance, whether via GCS or directly.
You can do the import process with Cloud Dataflow, as suggested by @GrahamPolley. It's true that this solution involves some extra work (getting familiar with Dataflow, setting everything up, etc.). Even with the extra work, this would be the preferred solution for your situation. However, other solutions are available, and I'll explain one of them below.
To set up a migration process with Dataflow, this tutorial about exporting BigQuery to Google Datastore is a good example.
Alternative solution to Cloud Dataflow
Cloud SQL for PostgreSQL doesn't support importing from a .CSV but it does support .SQL files.
The file type for the specified uri.
SQL: The file contains SQL statements.
CSV: The file contains CSV data.
Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.
A direct solution would be to convert the .CSV files to .SQL with some tool (Google doesn't provide one that I know of, but there are many online) and then import them into PostgreSQL.
If you want to implement this solution in a more "programmatic" way, I would suggest using Cloud Functions. Here is an example of how I would try to do it (a skeleton of such a function is sketched after this list):
Set up a Cloud Function that triggers when a file is uploaded to a Cloud Storage bucket
Code the function to get the uploaded file and check if it's a .CSV. If it is, use a csv-to-sql API (example of API here) to convert the file to .SQL
Store the new file in Cloud Storage
Import to the PostgreSQL
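A minimal skeleton of such a function, assuming a 1st-gen background Cloud Function triggered by a GCS finalize event; the staging bucket name and the convert_csv_to_sql helper are hypothetical placeholders for whichever csv-to-sql tool you choose:
from google.cloud import storage

def on_csv_upload(event, context):
    """Runs when an object is finalized in the source bucket."""
    name = event['name']
    if not name.lower().endswith('.csv'):
        return  # ignore non-CSV uploads

    client = storage.Client()
    csv_text = client.bucket(event['bucket']).blob(name).download_as_text()

    sql_text = convert_csv_to_sql(csv_text)  # hypothetical helper wrapping a csv-to-sql tool

    out_blob = client.bucket('my-sql-staging-bucket').blob(name.rsplit('.', 1)[0] + '.sql')
    out_blob.upload_from_string(sql_text)
    # the resulting .SQL file can then be imported, e.g. with `gcloud sql import sql ...`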
Before you begin, you should make sure:
The database and table you are importing into already exist on your Cloud SQL instance.
CSV file format requirements: CSV files must have one line for each row of data and comma-separated fields.
Then, you can import data into a Cloud SQL instance from a CSV file in a GCS bucket with the following gcloud steps:
Describe the instance you are importing to:
gcloud sql instances describe [INSTANCE_NAME]
Copy the serviceAccountEmailAddress field.
Add the service account to the bucket ACL as a writer:
gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:W gs://[BUCKET_NAME]
Add the service account to the import file as a reader:
gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:R gs://[BUCKET_NAME]/[IMPORT_FILE_NAME]
Import the file:
gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/[FILE_NAME] \
--database=[DATABASE_NAME] --table=[TABLE_NAME]
If you do not need to retain the permissions provided by the ACL you set previously, remove the ACL:
gsutil acl ch -d [SERVICE_ACCOUNT_ADDRESS] gs://[BUCKET_NAME]
I found that the psycopg2 module has copy_from(), which allows loading an entire CSV file instead of streaming the rows individually.
The downside of using this method is that the CSV file still needs to be downloaded from GCS and stored locally.
Here are the details of using psycopg2's copy_from(). (From here)
import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
with open('user_accounts.csv', 'r') as f:
    # Notice that we don't need the `csv` module.
    next(f)  # Skip the header row.
    cur.copy_from(f, 'users', sep=',')
conn.commit()
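If the file lives in GCS, a minimal sketch of the download step using the google-cloud-storage client (the bucket and object names here are illustrative) before running copy_from:
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-export-bucket').blob('exports/user_accounts.csv')  # illustrative names
blob.download_to_filename('user_accounts.csv')  # local copy that copy_from can then read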
You could just use a class to make the text you are pulling from the internet behave like a file. I have used this several times.
import io
import sys

class IteratorFile(io.TextIOBase):
    """ given an iterator which yields strings,
        return a file-like object for reading those strings """

    def __init__(self, obj):
        # build a pipe-delimited line template, e.g. "{}|{}" for 2-tuples
        template = "{}|" * len(obj[0])
        self._it = (template[:-1].format(*x) for x in obj)
        self._f = io.StringIO()

    def read(self, length=sys.maxsize):
        try:
            while self._f.tell() < length:
                self._f.write(next(self._it) + "\n")
        except StopIteration:
            # soak up StopIteration. this block is not necessary because
            # of finally, but just to be explicit
            pass
        except Exception as e:
            print("uncaught exception: {}".format(e))
        finally:
            self._f.seek(0)
            data = self._f.read(length)
            # save the remainder for the next read
            remainder = self._f.read()
            self._f.seek(0)
            self._f.truncate(0)
            self._f.write(remainder)
            return data

    def readline(self):
        return next(self._it)
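A minimal usage sketch (the table, columns, and rows here are illustrative): because the class emits pipe-delimited lines, it can be handed straight to copy_from without writing anything to disk:
import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()

rows = [("alice", "alice@example.com"), ("bob", "bob@example.com")]  # e.g. rows parsed from the CSV
cur.copy_from(IteratorFile(rows), 'users', sep='|', columns=('name', 'email'))
conn.commit()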

Why can't I return an image from a function without saving it into a file first?

I'm new to Python and web2py and am currently facing a strange problem.
I have images (pictures) stored in my database (SQL Server) which I'm trying to display on my website. I have managed to access the database and bring the information to the web page, but I'm having trouble with the pictures.
Finally, after a long time of trying, I was able to pinpoint my problem, as can be seen in the following code of my "download" function. If I try to read the image from the database and return it, the web browser just gets stuck and shows nothing. If I save the image to a temporary static file, then read it and THEN return the result, everything works fine.
My only guess is that it has something to do with scope of the variables, but I'm not sure. Here's my code:
def download():
    cursor = erpDbCnxn.cursor()
    cursor.execute("SELECT Picture FROM KMS_Parts WHERE PartNo = '%s'" % request.args(0))
    row = cursor.fetchone()
    if row:
        dbpic = row.Picture
        f = open("C:/Users/Udi/Desktop/tmp.jpg", "wb")  ## This is just for the test
        f.write(dbpic)                                  ## This is just for the test
        f.close()                                       ## This is just for the test
        f = open("C:/Users/Udi/Desktop/tmp.jpg", "rb")  ## This is just for the test
        filepic = f.read()                              ## This is just for the test
        f.close()                                       ## This is just for the test
        return filepic  ## Returning dbpic doesn't work
    else:
        return
In web2py, you cannot simply return a file object but must instead use response.stream:
import cStringIO
file = cStringIO.StringIO(row.Picture)
return response.stream(file)
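On Python 3, where the cStringIO module no longer exists, the equivalent would presumably wrap the raw bytes in io.BytesIO (a minimal sketch, assuming row.Picture holds the image bytes):
import io

file = io.BytesIO(row.Picture)  # wrap the raw image bytes in a file-like object
return response.stream(file)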
You might also consider using the web2py DAL and taking advantage of its built-in functionality for storing and retrieving files (see here and here).
Just for whoever wants to know the final result:
It looks like this was resolved in one of the subsequent versions of web2py.
Currently I'm just doing "return dbpic" and it works fine.
