write scraped binary file to blob without first writing it to disk - python

I use the requests library to retrieve a binary file from a website. I now want to store it in MySQL as a BLOB. I don't want to take the intermediate step of writing the file to disk. What is the best way to do this?
At present, I am using base64 to encode the binary file so that MySQL will accept it, as in this suggestion. Is this the best strategy, or is there a way that permits me to skip the encoding (and the subsequent decoding when I retrieve the file)?
Minimal example:
import base64
import pymysql
import requests

myPDF = requests.get("https://arxiv.org/pdf/2004.00627.pdf")
myPDF_encoded = base64.b64encode(myPDF.content)

conn = pymysql.connect(
    host="127.0.0.1",
    user=user,
    passwd=password,
    db="myDB")
cur = conn.cursor()
insertLine = "INSERT INTO myDB (PDF) VALUES (%s)"
cur.execute(insertLine, (myPDF_encoded,))  # parameters are passed as a tuple
conn.commit()
Many posts speak to the general problem of writing a binary file to a BLOB, but as best I can tell, all start from the assumption that the file is to be read from disk.

Much better solution for modern versions of MySQL: skip the base64 encoding and send the binary data directly, either by using a _binary %s placeholder in the query or simply by adding the binary_prefix=True option when setting up the pymysql connection. For example,
import pymysql
import requests

myPDF = requests.get("https://arxiv.org/pdf/2004.00627.pdf")

conn = pymysql.connect(
    host="127.0.0.1",
    user=user,
    passwd=password,
    db="myDB",
    binary_prefix=True)
cur = conn.cursor()
insertLine = "INSERT INTO myDB (PDF) VALUES (%s)"
cur.execute(insertLine, (myPDF.content,))  # insert the raw bytes, not the Response object
conn.commit()
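Retrieval needs no decoding step either: pymysql returns a BLOB column as a bytes object. A minimal sketch, assuming the same table and connection as above:
cur.execute("SELECT PDF FROM myDB LIMIT 1")
pdf_bytes = cur.fetchone()[0]  # already bytes, no base64 decoding required
with open("retrieved.pdf", "wb") as f:  # only if you do want a file on disk
    f.write(pdf_bytes)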

Related

Python: Connect to Oracle DB using Java keystore jceks file

I want to connect to an Oracle DB using the ojdbc6.jar driver. I have read about JayDeBeApi and pyodbc.
The following code works for me:
import os
import jpype
import jaydebeapi

JHOME = jpype.getDefaultJVMPath()
jpype.startJVM(JHOME, f'-Djava.class.path={os.getcwd()}/ojdbc6.jar')
con = jaydebeapi.connect(
    'oracle.jdbc.driver.OracleDriver',
    f"jdbc:oracle:thin:{db_user}/{db_password}@{db_host}:{db_port}/{db_name}")
cur = con.cursor()
cur.execute(f'SELECT * FROM {schema}.{table} WHERE ROWNUM=1')
r = cur.fetchall()
print(r[0])
cur.close()
con.close()
I can't store a plain-text password, so I make use of a Java keystore file, my_keystore.jceks, with the stored password.
password_alias = "my_alias"
jsec_file_path = "jceks://hdfs/path/to/my_keystore.jceks"
Is there any way to use the existing .jceks file to read the password? All solutions using ojdbc6.jar are acceptable.
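One possible approach, not from the original post, is to load the keystore through the standard java.security.KeyStore API, which jpype already exposes. This is a sketch under assumptions: it assumes the .jceks file is available on the local filesystem (the jceks://hdfs scheme is a Hadoop credential-provider URI and would additionally need the Hadoop jars on the classpath) and that the keystore password is known (for Hadoop-created stores it may come from HADOOP_CREDSTORE_PASSWORD). The path, alias, and password below are hypothetical placeholders:
import jpype
import jpype.imports

if not jpype.isJVMStarted():  # skip if the JVM is already up for JayDeBeApi
    jpype.startJVM(jpype.getDefaultJVMPath())

from java.io import FileInputStream
from java.security import KeyStore

keystore_path = "/local/path/to/my_keystore.jceks"  # hypothetical local copy
store_pass = jpype.JArray(jpype.JChar)(list("keystore_secret"))  # hypothetical
password_alias = "my_alias"

ks = KeyStore.getInstance("JCEKS")
ks.load(FileInputStream(keystore_path), store_pass)
key = ks.getKey(password_alias, store_pass)  # stored secret comes back as a SecretKey
db_password = bytes(key.getEncoded()).decode("utf-8")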

How do you select values from a SQL column in Python

I have a column called REQUIREDCOLUMNS in a SQL database which contains the names of the columns I need to select in my Python script below.
Excerpt of Current Code:
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

db = mongo_client.get_database(asqldb_row.SCHEMA_NAME)
coll = db.get_collection(asqldb_row.TABLE_NAME)
table = list(coll.find())
root = json_normalize(table)
The REQUIREDCOLUMNS column in SQL contains the values reportId, siteId, price, location.
So instead of explicitly typing:
print(root[["reportId","siteId","price","location"]])
Is there a way to do print(root[REQUIREDCOLUMNS])?
Note: I'm already connected to the SQL database in my Python script.
You will have to use cursors whether you are using mysql.connector or pymysql; the syntax is almost identical for both. Below I will show the mysql.connector version.
import mysql.connector

db = mysql.connector.connect(
    host="localhost",
    user="root",
    passwd=" ",
    database=" "
)
cursor = db.cursor()
sql = "select REQUIREDCOLUMNS from table_name"
cursor.execute(sql)
# fetchall() returns a list of tuples, so unpack the first element of each row;
# this gives ["reportId", "siteId", "price", "location"]
required_cols = [row[0] for row in cursor.fetchall()]
cols_as_string = ','.join(required_cols)
new_sql = 'select ' + cols_as_string + ' from table_name'
cursor.execute(new_sql)
result = cursor.fetchall()
This should work; I intentionally split the logic across several lines to make it easier to follow. The syntax could be slightly different for pymysql.
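To answer the question as asked: once required_cols is a plain list, it can also be used to index the pandas DataFrame from the excerpt directly, with no second SQL statement:
# required_cols == ["reportId", "siteId", "price", "location"]
# If REQUIREDCOLUMNS instead holds a single comma-separated string,
# split it first: required_cols = required_cols[0].split(",")
print(root[required_cols])  # same as root[["reportId", "siteId", "price", "location"]]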

Parallelize data import from mongodb python

How can I import data from MongoDB in parallel? One approach would be to count the documents in the collection, say there are 1000, split them into batches of 100, fetch the batches in parallel, and then combine them again so that all 1000 are loaded.
Below is the code I use to import data from MongoDB into Python.
import pandas as pd
from pymongo import MongoClient

def _connect_mongo(host, port, username, password, db):
    """A util for making a connection to mongo."""
    if username and password:
        mongo_uri = 'mongodb://%s:%s@%s:%s/%s' % (username, password, host, port, db)
        conn = MongoClient(mongo_uri)
    else:
        conn = MongoClient(host, port)
    return conn[db]

def read_mongo(db, collection, query={}, host='localhost', port=27017,
               username=None, password=None, no_id=True):
    """Read from Mongo and store into a DataFrame."""
    # Connect to MongoDB
    db = _connect_mongo(host=host, port=port, username=username, password=password, db=db)
    # Make a query to the specific DB and collection
    cursor = db[collection].find(query)
    # Expand the cursor and construct the DataFrame
    df = pd.DataFrame(list(cursor))
    # Delete the _id column
    if no_id:
        del df['_id']
    return df
As I said in my comment, have you tried optimizing your database by using indexes? If the database is slow, I don't think parallelizing will improve it. If you still want to go parallel, call read_mongo from multiple threads.
For indexes you should check https://docs.mongodb.com/manual/indexes/
There's nothing code-related to add here; you just need to understand your database better.
As for the code, Python offers concurrency (threads) and parallelism (the multiprocessing package). You'd need to make your program call read_mongo with your already defined/split queries, as in the sketch below.
There are many examples out there. I'd try the indexes first, because they will help with the parallel stuff afterwards.
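As a concrete illustration of the threaded route, here is a minimal sketch. The database and collection names (mydb, mycoll) are illustrative, and it assumes a local MongoDB with roughly 1000 documents as in the question; it fetches fixed-size slices with skip/limit in a thread pool and concatenates them:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor
from pymongo import MongoClient

def read_batch(skip, limit, host='localhost', port=27017,
               db='mydb', collection='mycoll'):
    """Fetch one slice of the collection as a DataFrame."""
    conn = MongoClient(host, port)
    cursor = conn[db][collection].find().skip(skip).limit(limit)
    return pd.DataFrame(list(cursor))

total = MongoClient('localhost', 27017)['mydb']['mycoll'].count_documents({})
batch = 100
with ThreadPoolExecutor(max_workers=4) as pool:
    frames = pool.map(lambda s: read_batch(s, batch), range(0, total, batch))
df = pd.concat(frames, ignore_index=True)
Note that skip() gets slower the deeper it pages, so on large collections splitting on an indexed field (e.g. _id ranges) scales better, which is one more reason to sort out the indexes first.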

How to Convert MYSQL Latin-1 GEOMETRY field to UTF8 to get it to work with Django?

Trying to cast the field back onto itself throws the following error.
UPDATE <table> SET geo_field = CONVERT(CAST(CONVERT(geo_field USING latin1) AS BINARY) USING utf8);
[Err] 1416 - Cannot get geometry object from data you send to the GEOMETRY field
I'm trying to use Django 1.9's dumpdata to export JSON, and it keeps choking on the Latin-1 chars.
I'm using the mysql.gis backend.
Trying to use a raw cursor in Python didn't work either.
def convert_latin_utf8(badfields, table, host, user, passwd, db):
    import MySQLdb
    con = MySQLdb.connect(host=host, user=user, passwd=passwd, db=db)
    cur = con.cursor()
    cur.execute("SELECT * FROM `{0}`;".format(table))
    for item in cur.fetchall():
        for field in badfields:  # badfields holds the indexes of the affected columns
            data = item[field].decode('latin1').encode('utf8')
            print(data)
I'm stuck. Any help would be greatly appreciated.
The paths field had the wrong model type set. Changing it to
paths = models.PolygonField()
Worked like a CHARM!
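For context, a minimal sketch of what the corrected model looks like with the mysql.gis backend (the model name is illustrative; the point is using the GeoDjango field type from django.contrib.gis rather than a plain text field):
from django.contrib.gis.db import models

class Route(models.Model):
    # A real geometry field: Django handles the column as geometry,
    # so serialization no longer tries to treat it as Latin-1 text
    paths = models.PolygonField()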

How to read non-english characters from database in Python?

I am trying to read entries from a MySQL db in Python using pymysql. The entries in the database are in regional languages.
e.g. 74 погода is one such entry
I have written code like this:
import pymysql

# "pass" is a reserved word in Python, so use a different variable name for the password
conn = pymysql.connect(host=ip, user=user, password=password, db=db, charset="utf8")
cur = conn.cursor()
cur.execute("select val from my_table")
r = cur.fetchone()
>>> r
('74 ??????',)
>>> r[0].encode("utf-8").strip()
'74 \xd0\xbf\xd0\xbe\xd0\xb3\xd0\xbe\xd0\xb4\xd0\xb0'
Here I am not getting the data as it is present in the database.
This is because MySQLdb normally tries to encode everything to latin-1. This can be fixed by executing the following commands right after you've established the connection:
db.set_character_set('utf8')
dbc.execute('SET NAMES utf8;')
dbc.execute('SET CHARACTER SET utf8;')
dbc.execute('SET character_set_connection=utf8;')
db is the result of MySQLdb.connect(), and dbc is the result of db.cursor().
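Since the question itself uses pymysql rather than MySQLdb, the equivalent fix there is to pass the connection charset, as the question already attempts; utf8mb4 is the safer choice on modern MySQL, since MySQL's utf8 is only a 3-byte subset. A sketch with illustrative credentials:
import pymysql

conn = pymysql.connect(host="127.0.0.1", user=user, password=password,
                       db=db, charset="utf8mb4")
cur = conn.cursor()
cur.execute("select val from my_table")
print(cur.fetchone())  # ('74 погода',) if the data was stored correctly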
