Migrate CSV from GCS to PostgreSQL - Python

I'm trying to migrate CSV files from Google Cloud Storage (GCS), which have been exported from BigQuery, to a PostgreSQL Google Cloud SQL instance using a Python script.
I was hoping to use the Google API but found this in the documentation:
Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.
As an alternative I could use the psycopg2 library and stream the rows of the CSV file into the SQL instance. I can do this in three ways:
Line by line: read each line, then submit the insert command and commit.
Batch stream: read each line, submit the insert commands, and commit after every 10 or 100 lines, etc. (see the sketch after this list).
The entire CSV: read each line, submit the insert commands, and only commit at the end of the document.
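To make the batch option concrete, here is a minimal sketch of what I mean, assuming a hypothetical users(id, name) table and local connection details; it is only meant to illustrate the approach:

import csv
import psycopg2

conn = psycopg2.connect("host=localhost dbname=test user=test")  # assumed connection details
cur = conn.cursor()

BATCH_SIZE = 100  # commit every 100 rows

with open('export.csv') as f:  # a CSV file exported from BigQuery
    reader = csv.reader(f)
    next(reader)  # skip the header row
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            cur.executemany("INSERT INTO users (id, name) VALUES (%s, %s)", batch)
            conn.commit()
            batch = []
    if batch:  # flush the remaining rows
        cur.executemany("INSERT INTO users (id, name) VALUES (%s, %s)", batch)
        conn.commit()

conn.close()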
My concern is that these CSV files could contain millions of rows, and running this process with any of the three options mentioned above seems like a bad idea to me.
What alternatives do I have?
Essentially I have some raw data in BigQuery on which we do some preprocessing before exporting to GCS in preparation for importing to the PostgreSQL instance.
I need to export this preprocessed data from BigQuery to the PostgreSQL instance.
This is not a duplicate of this question, as I'm preferably looking for a solution which exports data from BigQuery to the PostgreSQL instance, whether via GCS or directly.

You can do the import process with Cloud Dataflow as suggested by @GrahamPolley. It's true that this solution involves some extra work (getting familiar with Dataflow, setting everything up, etc.). Even with the extra work, this would be the preferred solution for your situation. However, other solutions are available, and I'll explain one of them below.
To set up a migration process with Dataflow, this tutorial about exporting BigQuery to Google Datastore is a good example.
Alternative solution to Cloud Dataflow
Cloud SQL for PostgreSQL doesn't support importing from a .CSV but it does support .SQL files.
The file type for the specified uri.
SQL: The file contains SQL statements.
CSV: The file contains CSV data.
Importing CSV data using the Cloud SQL Admin API is not supported for PostgreSQL instances.
A direct solution would be to convert the .CSV files to .SQL with some tool (Google doesn't provide one that I know of, but there are many online) and then import them into PostgreSQL.
If you want to implement this solution in a more "programmatic" way, I would suggest using Cloud Functions. Here is an example of how I would try to do it (a sketch follows these steps):
Set up a Cloud Function that triggers when a file is uploaded to a Cloud Storage bucket
Code the function to get the uploaded file and check if it's a .CSV. If it is, use a csv-to-sql API (example of API here) to convert the file to .SQL
Store the new file in Cloud Storage
Import to the PostgreSQL
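A minimal sketch of such a function, assuming the standard GCS-triggered background function signature. The naive csv_to_sql() converter and the target table name are my own assumptions (as suggested above, you could call an external csv-to-sql API instead):

import csv
import io

from google.cloud import storage


def csv_to_sql(csv_text, table):
    """Very naive CSV -> SQL conversion: one INSERT per row.
    In practice you would use a proper converter with real escaping/typing."""
    rows = csv.reader(io.StringIO(csv_text))
    header = next(rows)
    statements = []
    for row in rows:
        values = ", ".join("'{}'".format(v.replace("'", "''")) for v in row)
        statements.append("INSERT INTO {} ({}) VALUES ({});".format(table, ", ".join(header), values))
    return "\n".join(statements)


def on_csv_upload(event, context):
    """Background Cloud Function triggered when a file is finalized in a GCS bucket."""
    bucket_name = event['bucket']
    file_name = event['name']
    if not file_name.lower().endswith('.csv'):
        return  # only handle CSV uploads

    client = storage.Client()
    bucket = client.bucket(bucket_name)
    csv_text = bucket.blob(file_name).download_as_text()

    # Convert and store the .SQL file next to the original, ready for Cloud SQL import.
    sql_text = csv_to_sql(csv_text, table='my_table')  # assumed target table name
    bucket.blob(file_name[:-4] + '.sql').upload_from_string(sql_text)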

Before you begin, you should make sure:
The database and table you are importing into must already exist on your Cloud SQL instance.
CSV file format requirements: CSV files must have one line for each row of data and comma-separated fields.
Then, you can import data into the Cloud SQL instance from a CSV file in a GCS bucket by following these gcloud steps:
Describe the instance you are importing into:
gcloud sql instances describe [INSTANCE_NAME]
Copy the serviceAccountEmailAddress field.
Add the service account to the bucket ACL as a writer:
gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:W gs://[BUCKET_NAME]
Add the service account to the import file as a reader:
gsutil acl ch -u [SERVICE_ACCOUNT_ADDRESS]:R gs://[BUCKET_NAME]/[IMPORT_FILE_NAME]
Import the file:
gcloud sql import csv [INSTANCE_NAME] gs://[BUCKET_NAME]/[FILE_NAME] \
--database=[DATABASE_NAME] --table=[TABLE_NAME]
If you do not need to retain the permissions provided by the ACL you set previously, remove the ACL:
gsutil acl ch -d [SERVICE_ACCOUNT_ADDRESS] gs://[BUCKET_NAME]
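Since the goal is a Python script, the same steps can be wrapped with subprocess. A minimal sketch, assuming gcloud and gsutil are installed and authenticated, and with placeholder names you would replace:

import json
import subprocess


def import_csv(instance, bucket, file_name, database, table):
    # Look up the instance's service account (describe + copy serviceAccountEmailAddress).
    desc = subprocess.run(
        ["gcloud", "sql", "instances", "describe", instance, "--format=json"],
        capture_output=True, text=True, check=True)
    service_account = json.loads(desc.stdout)["serviceAccountEmailAddress"]

    # Grant the service account access to the bucket and the import file.
    subprocess.run(["gsutil", "acl", "ch", "-u", f"{service_account}:W", f"gs://{bucket}"], check=True)
    subprocess.run(["gsutil", "acl", "ch", "-u", f"{service_account}:R", f"gs://{bucket}/{file_name}"], check=True)

    # Run the import itself.
    subprocess.run(["gcloud", "sql", "import", "csv", instance, f"gs://{bucket}/{file_name}",
                    f"--database={database}", f"--table={table}"], check=True)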

I found that the psycopg2 module has copy_from(), which allows loading an entire CSV file instead of streaming the rows individually.
The downside of this method is that the CSV file still needs to be downloaded from GCS and stored locally.
Here are the details of using psycopg2's copy_from(). (From here)
import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
with open('user_accounts.csv', 'r') as f:
    # Notice that we don't need the `csv` module.
    next(f)  # Skip the header row.
    cur.copy_from(f, 'users', sep=',')
conn.commit()
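To tie this together with GCS, here is a minimal sketch that first downloads the exported file with the google-cloud-storage client and then loads it with copy_from(); the bucket, object, table and connection details are assumptions to replace with your own:

import psycopg2
from google.cloud import storage

# Download the exported CSV from GCS to a local temporary file.
client = storage.Client()
bucket = client.bucket('my-export-bucket')  # assumed bucket name
bucket.blob('exports/users.csv').download_to_filename('/tmp/users.csv')  # assumed object name

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")  # assumed connection
cur = conn.cursor()
with open('/tmp/users.csv', 'r') as f:
    next(f)  # skip the header row
    cur.copy_from(f, 'users', sep=',')  # assumed target table
conn.commit()
conn.close()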

You could just use a class to make the text you are pulling from the internet behave like a file. I have used this several times.
import io
import sys


class IteratorFile(io.TextIOBase):
    """Given an iterable of row tuples, return a file-like object that
    yields those rows as pipe-separated lines for reading."""

    def __init__(self, obj):
        # Build a "{}|{}|..." template from the width of the first row.
        template = "{}|" * len(obj[0])
        self._it = (template[:-1].format(*x) for x in obj)
        self._f = io.StringIO()  # io.cStringIO does not exist in Python 3

    def read(self, length=sys.maxsize):
        try:
            while self._f.tell() < length:
                self._f.write(next(self._it) + "\n")
        except StopIteration:
            # Soak up StopIteration: the iterator is exhausted. This block is not
            # strictly necessary because of finally, but it is explicit.
            pass
        except Exception as e:
            print("uncaught exception: {}".format(e))
        finally:
            self._f.seek(0)
            data = self._f.read(length)
            # Save the remainder for the next read.
            remainder = self._f.read()
            self._f.seek(0)
            self._f.truncate(0)
            self._f.write(remainder)
            return data

    def readline(self):
        return next(self._it)
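A hedged usage example (the table name, columns and connection details are made up): the IteratorFile can be handed straight to copy_from(), so the rows never touch disk:

import psycopg2

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")  # assumed connection
cur = conn.cursor()

rows = [(1, 'alice'), (2, 'bob')]  # e.g. rows fetched from BigQuery
f = IteratorFile(rows)             # wrap the rows as a file-like object
cur.copy_from(f, 'users', sep='|', columns=('id', 'name'))  # pipe separator matches the template above
conn.commit()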

Related

pymongo: insert_one() is running but isn't adding anything to mongodb database?

I'm trying to upload a .txt file to a MongoDB database collection using PyCharm, but nothing appears inside the collection. Here's the script I'm using at the moment:
from pymongo import MongoClient

client = MongoClient()
db = client.memorizer_data  # use a database called "memorizer_data"
collection = db.english  # and inside that DB, a collection called "english"

with open('7_1_1.txt', 'r') as f:
    text = f.read()  # read the txt file

name = '7_1_1.txt'
# build a document to be inserted
text_file_doc = {"file_name": name, "contents": text}
# insert the document into the "english" collection
collection.insert_one(text_file_doc)
PyCharm gets through the script with no errors, I've also tried printing the acknowledged attribute just to see what comes up:
result = collection.insert_one(text_file_doc)
print(result.acknowledged)
Which is giving me True. I wasn't sure if I was actually connecting to my database, so I tried db.list_collection_names() and my collection 'english' is in the list, so as far as I can tell I am connecting with it?
I'm a newbie to MongoDB so I realize I've probably gone about things the wrong way. At the moment I'm just trying to get the script working for a single .txt file before uploading everything my project is using to the db.
What makes you think there's nothing in the collection? Here are two ways to check:
In your pymongo code, add a final debug line:
print(collection.find_one())
Or, in the mongodb shell:
use memorizer_data
db.english.findOne()

Is postgres COPY tablename FROM STDIN with csv at risk of SQL injection?

I am using Python and psycopg2.
If I run the code below, the user-provided CSV file will be opened and read, and the contents of the CSV file will be transferred to the database.
I want to know if the code is at risk of SQL injection when unexpected words or symbols are contained in the CSV file.
import psycopg2

conn_config = dict(port="5432", dbname="test", password="test")
with psycopg2.connect(**conn_config) as conn:  # psycopg2.connect, not psycopg2.connection
    with conn.cursor() as cur:
        with open("test.csv") as f:
            cur.copy_expert(sql="COPY test FROM STDIN", file=f)
I read some of the psycopg2 and Postgres documentation, but I did not find an answer.
Please note that English is not my native language, and I may make some confusing mistakes.
The command simply copies the data into the table. No part of the copied data can be interpreted as an SQL command, so SQL injection is out of the question. The rigid CSV format is an additional safeguard: if a row contains extra (redundant) fields, the command will simply fail. The only risk of the operation is ending up with strange contents in the table.
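A quick way to convince yourself, as a throwaway sketch against a local test database (the connection details and a single-column text table named test are assumptions):

import io
import psycopg2

conn = psycopg2.connect("host=localhost dbname=test user=test")  # assumed local test database
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS test (value text)")

# A line that *looks* like an injection attempt.
payload = io.StringIO("'); DROP TABLE test; --\n")
cur.copy_expert(sql="COPY test FROM STDIN", file=payload)
conn.commit()

cur.execute("SELECT value FROM test")
print(cur.fetchall())  # the payload is stored as plain text; the table still exists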

How to Import a SQL file to Python

I'm attempting to import a .sql file that already has tables into Python. However, it doesn't seem to import what I had hoped. The only things I've seen so far are how to create a new .sql file with a table, but I'm looking to have an already completed .sql file imported into Python. So far, I've written this:
# Python code to demonstrate SQL to fetch data.
# importing the module
import sqlite3
# connect with the myTable database
connection = sqlite3.connect("CEM3_Slice_20180622.sql")
# cursor object
crsr = connection.cursor()
# execute the command to fetch all the data from the table emp
crsr.execute("SELECT * FROM 'Trade Details'")
# store all the fetched data in the ans variable
ans = crsr.fetchall()
# loop to print all the data
for i in ans:
    print(i)
However, it keeps claiming that the Trade Details table, which is a table inside the file I've connected it to, does not exist. Nowhere I've looked shows me how to do this with an already created file and table, so please don't just redirect me to an answer about that
As suggested by Rakesh above, you create a connection to the DB, not to the .sql file. The .sql file contains SQL scripts to rebuild the DB from which it was generated.
After creating the connection, you can implement the following:
cursor = connection.cursor()  # cursor object
with open('CEM3_Slice_20180622.sql', 'r') as f:  # Not sure if the 'r' is necessary, but recommended.
    cursor.executescript(f.read())
Documentation on executescript found here
To read the file into pandas DataFrame:
import pandas as pd
df = pd.read_sql('SELECT * FROM table LIMIT 10', connection)
There are two possibilities:
Your file is not in the correct format and therefore cannot be opened.
The SQLite file can exist anywhere on disk, e.g. /Users/Username/Desktop/my_db.sqlite. This means that you have to tell Python exactly where your file is; otherwise it will look inside the script's directory, see that there is no file with that name, and therefore create a new file with the provided filename.
sqlite3.connect expects the full path to your database file, or ':memory:' to create a database that exists only in RAM. You don't pass it a SQL file. E.g.
connection = sqlite3.connect('example.db')
You can then read the contents of CEM3_Slice_20180622.sql as you would a normal file and execute the SQL commands against the database.

Storing files into MongoDB using Flask, pymongo and GridFS?

I have a very specific question. I'm writing an app that takes in an integer and a file in Flask and stores them as a key-value pair in MongoDB. I wrote the Mongo part with the help of a friend who works for Mongo, and that part of the code was working fine. I now want to know how EXACTLY I'm supposed to take the files I receive from Flask and put them into MongoDB.
TL;DR: I am writing files to disk; how can I store them inside MongoDB using functions that I already know are working, but that I didn't write myself?
The tutorial I found for taking in files is at http://runnable.com/UiPcaBXaxGNYAAAL/how-to-upload-a-file-to-the-server-in-flask-for-python
The code I'm talking about is hosted on my GitHub: https://github.com/DavidAwad/SpaceShare/blob/master/init.py
If you look at line 56, I want to put the file into MongoDB right there and then remove it from disk. I'm aware, however, that this is hideously inefficient, so if there's a way to write directly in and out of MongoDB, I'm all ears.
The code I'm specifically interested in is this.
# put files in mongodb
def put_file(file_location, room_number):
    db_conn = get_db()
    gfs = gridfs.GridFS(db_conn)
    with open(file_location, "r") as f:
        gfs.put(f, room=room_number)

# read files from mongodb
def read_file(output_location, room_number):
    db_conn = get_db()
    gfs = gridfs.GridFS(db_conn)
    _id = db_conn.fs.files.find_one(dict(room=room_number))['_id']
    # return gfs.get(_id).read()
    with open(output_location, 'w') as f:
        f.write(gfs.get(_id).read())

... code code code

# make the filename safe, remove unsupported chars
filename = secure_filename(file.filename)
# move the file to our uploads folder
file.save(os.path.join(app.config['UPLOAD_FOLDER'], filename))
put_file(app.config['UPLOAD_FOLDER'], space)  # PUT IN MONGODB HERE? IS THIS HOW I DO THAT???
# remove the file from disk as we don't need it anymore after the database insert
os.unlink(os.path.join(app.config['UPLOAD_FOLDER'], filename))
# maybe redirect the user to the uploaded_file route, which will show the uploaded file
# maybe redirect user to the uploaded_file route, which will show the uploaded file.
Look at Flask-PyMongo (https://flask-pymongo.readthedocs.org/en/latest/) and specifically at two functions: send_file and save_file.
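A minimal sketch of what that could look like with Flask-PyMongo, assuming an app configured with a MONGO_URI and a form field named 'file' (the route names and URI are assumptions):

from flask import Flask, request
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/spaceshare"  # assumed URI
mongo = PyMongo(app)

@app.route("/upload", methods=["POST"])
def upload():
    uploaded = request.files["file"]
    # save_file streams the upload straight into GridFS; no temporary file on disk
    mongo.save_file(uploaded.filename, uploaded)
    return "stored", 201

@app.route("/files/<filename>")
def get_file(filename):
    # send_file reads the file back out of GridFS and returns it as a response
    return mongo.send_file(filename)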

flask/python: best way to import and parse a sqlite file

At my work we frequently work with sqlite files to perform troubleshooting. I want to create a web page, possibly in flask, that allows users to upload a .sqlite file and automatically have simple, pre-defined queries run.
What is the best way within a Flask application to import a .sqlite file, run queries on it, and then set itself up to repeat the process?
The best way to use an SQLite file with specific queries is the sqlite3 package. Just:
import sqlite3
db = sqlite3.connect('PATH TO FILE')
result = db.execute(query, args)
...
First of all, you need to upload that file to the server. To do so, you can start by reading this: http://flask.pocoo.org/docs/patterns/fileuploads/
Then, you can connect to that .sqlite file like this, and then execute queries:
import sqlite3
connection = sqlite3.connect('/path/to/your/sqlite_file')
cursor = connection.cursor()
cursor.execute('my query')
cursor.fetchall() # If you used a select statement
# OR
connection.commit() # If you inserted data, for example
