multithreading to load data into sqlite db - python

I'm downloading data from an API and storing it in a SQLite db. I want to implement the process using multithreading. Can someone please help me with how to implement it?
I found a library but I'm getting an error; the code is below.
import sqlite3
import os
import pandas as pd
from sodapy import Socrata
import concurrent.futures
dbPath = 'folder where db exists'
dbName = 'db file name'
## Setup connection & cursor with the DB
dbConn = sqlite3.connect(os.path.join(dbPath, dbName), check_same_thread=False)
## Setup the API and bring in the data
client = Socrata("health.data.ny.gov", None)
## Define all the countys to be used in threading
countys = [all 62 countys in New York]
varDict = dict.fromkeys(countys, {})
strDataList = ['test_date', 'LoadDate']
intDataList = ['new_positives', 'cumulative_number_of_positives', 'total_number_of_tests', 'cumulative_number_of_tests']
def getData(county):
    ## Check if table exists
    print("Processing ", county)
    varDict[county]['dbCurs'] = dbConn.cursor()
    varDict[county]['select'] = varDict[county]['dbCurs'].execute('SELECT name FROM sqlite_master WHERE type="table" AND name=?', (county,) )
    if not len(varDict[county]['select'].fetchall()):
        createTable(county)
    whereClause = 'county="'+county+'"'
    varDict[county]['results'] = client.get("xdss-u53e", where=whereClause)
    varDict[county]['data'] = pd.DataFrame.from_records(varDict[county]['results'])
    varDict[county]['data'].drop(['county'], axis=1, inplace=True)
    varDict[county]['data']['LoadDate'] = pd.to_datetime('now')
    varDict[county]['data'][strDataList] = varDict[county]['data'][strDataList].astype(str)
    varDict[county]['data']['test_date'] = varDict[county]['data']['test_date'].apply(lambda x: x[:10])
    varDict[county]['data'][intDataList] = varDict[county]['data'][intDataList].astype(int)
    varDict[county]['data'] = varDict[county]['data'].values.tolist()
    ## Insert values into SQLite
    varDict[county]['sqlQuery'] = 'INSERT INTO ['+county+'] VALUES (?,?,?,?,?,?)'
    varDict[county]['dbCurs'].executemany(varDict[county]['sqlQuery'], varDict[county]['data'])
    dbConn.commit()
    # for i in dbCurs.execute('SELECT * FROM albany'):
    #     print(i)

def createTable(county):
    sqlQuery = 'CREATE TABLE ['+county+'] ( [Test Date] TEXT, [New Positives] INTEGER NOT NULL, [Cumulative Number of Positives] INTEGER NOT NULL, [Total Number of Tests Performed] INTEGER NOT NULL, [Cumulative Number of Tests Performed] INTEGER NOT NULL, [Load date] TEXT NOT NULL, PRIMARY KEY([Test Date]))'
    varDict[county]['dbCurs'].execute(sqlQuery)

# for _ in countys:
#     getData(_)
# x = countys[:5]

with concurrent.futures.ThreadPoolExecutor() as executor:
    # results = [executor.submit(getData, y) for y in x]
    executor.map(getData, countys)
getData is the function that brings in the data county by county and loads it into the db. countys is a list of all the counties. I am able to do it synchronously but would like to implement multithreading.
The for loop to do it synchronously (which works) is
for _ in countys:
    getData(_)
The error message is
ProgrammingError: SQLite objects created in a thread can only be used in that same thread. The object was created in thread id 8016 and this is thread id 19844.

You might find this useful:
sqlite3.connect(":memory:", check_same_thread=False)

Related

Slow SQLite update with Python

I have a SQLite database that I've built, and it gets both added to and updated on a weekly basis. The issue is that the update seems to take a very long time (roughly 2 hours without the transactions table). I'm hoping there is a faster way to do this. What the script does is read from a CSV and update the database line by line through a loop.
An example data entry would be:
JohnDoe123 018238e1f5092c66d896906bfbcf9abf5abe978975a8852eb3a78871e16b4268
The Code that I use is
#updates reported table
def update_sha(conn, sha, ID, op):
    sql_update_reported = 'UPDATE reported SET sha = ? WHERE ID = ? AND operator = ?'
    sql_update_blocked = 'UPDATE blocked SET sha = ? WHERE ID = ? AND operator = ?'
    sql_update_trans = 'UPDATE transactions SET sha = ? WHERE ID = ? AND operator = ?'
    data = (sha, ID, op)
    cur = conn.cursor()
    cur.execute(sql_update_reported, data)
    cur.execute(sql_update_blocked, data)
    cur.execute(sql_update_trans, data)
    conn.commit()
def Count(conn):
    #Creates a dataframe with the Excel sheet information and ensures the
    #IDs are read as strings
    df = pd.DataFrame()
    df = pd.read_excel("Count.xlsx", engine='openpyxl', converters={'ID': str})
    #Runs through the DataFrame once for reported
    for i in df.index:
        ID = df['ID'][i]
        Sha = df['Sha'][i]
        op = df['op'][i]
        print(i)
        with conn:
            update_sha(conn, Sha, ID, op)

if __name__ == '__main__':
    conn = create_connection(database)
    print("Updating Now..")
    Count(conn)
    conn.close()
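An approach not taken in the original post, for comparison: since committing after every row is usually the dominant cost, the same three UPDATEs can be batched with executemany inside a single transaction. A hedged sketch that reuses the table and column names from the question (not tested against the real database):
import sqlite3
import pandas as pd

def update_all(conn, df):
    # one (sha, ID, op) parameter tuple per spreadsheet row
    params = list(zip(df['Sha'], df['ID'], df['op']))
    with conn:  # a single transaction: one commit for the whole batch
        for table in ('reported', 'blocked', 'transactions'):
            conn.executemany(
                'UPDATE ' + table + ' SET sha = ? WHERE ID = ? AND operator = ?',
                params,
            )

# usage sketch:
# conn = sqlite3.connect(database)
# df = pd.read_excel("Count.xlsx", engine='openpyxl', converters={'ID': str})
# update_all(conn, df)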

sqlalchemy MSSQL+pyodbc schema none

I'm trying to connect to SQL Server 2019 via SQLAlchemy. I'm using both mssql+pyodbc and msql+pyodbc_mssql, but in both cases it cannot connect; it always returns default_schema_name not defined.
I have already checked the database, the user schema definition, and everything else.
Example:
import urllib.parse
from sqlalchemy import create_engine
server = 'server'
database = 'db'
username = 'user'
password = 'pass'
#cnxn = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Trusted_Connection=yes'
cnxn = 'DSN=SQL Server;SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Trusted_Connection=yes'
params = urllib.parse.quote_plus(cnxn)
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
cnxn = engine.connect()
return None, dialect.default_schema_name
AttributeError: 'MSDialect_pyodbc' object has no attribute 'default_schema_name'
TIA.....
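As an aside that is not part of the original thread: a minimal sketch of the commented-out DRIVER-based connection string (using the server, database, username and password variables defined above), with Trusted_Connection=yes dropped since SQL credentials are being supplied; like the answer below, it names the ODBC driver explicitly instead of relying on a DSN.
import urllib.parse
from sqlalchemy import create_engine, text

params = urllib.parse.quote_plus(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=' + server + ';DATABASE=' + database + ';'
    'UID=' + username + ';PWD=' + password
)
engine = create_engine('mssql+pyodbc:///?odbc_connect=' + params)
with engine.connect() as cnxn:
    print(cnxn.execute(text('SELECT 1')).scalar())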
Hopefully the following provides enough for a minimum viable sample. I'm using it in a larger script to move 12m rows 3x a day, and for that reason I've included an example of chunking that I pinched from elsewhere.
import math
import pandas as pd
import sqlalchemy as sql

#Set up enterprise DB connection
# Enterprise DB to be used
DRIVER = "ODBC Driver 17 for SQL Server"
USERNAME = "SQLUsername"
PSSWD = "SQLPassword"
SERVERNAME = "SERVERNAME01"
INSTANCENAME = r"\SQL_01"
DB = "DATABASE_Name"
TABLE = "Table_Name"

#Set up SQL database connection variable / path
#I have included this as an example that can be used to chunk data up
conn_executemany = sql.create_engine(
    f"mssql+pyodbc://{USERNAME}:{PSSWD}@{SERVERNAME}{INSTANCENAME}/{DB}?driver={DRIVER}", fast_executemany=True
)

#Used for SQL Loading from Pandas DF
def chunker(seq, size):
    return (seq[pos : pos + size] for pos in range(0, len(seq), size))

#Used for SQL Loading from Pandas DF
def insert_with_progress(df, engine, table="", schema="dbo"):
    con = engine.connect()
    # Replace table
    #engine.execute(f"DROP TABLE IF EXISTS {schema}.{table};") #This only works for SQL Server 2016 or greater
    try:
        engine.execute("DROP TABLE Temp_WeatherGrids;")
    except Exception:
        print("Unable to drop temp table")
    try:
        engine.execute("CREATE TABLE [dbo].[Temp_WeatherGrids]([col_01] [int] NULL,[Location] [int] NULL,[DateTime] [datetime] NULL,[Counts] [real] NULL) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];")
    except Exception:
        print("Unable to create temp table")
    # Insert with progress
    SQL_SERVER_CHUNK_LIMIT = 250000
    chunksize = math.floor(SQL_SERVER_CHUNK_LIMIT / len(df.columns))
    for chunk in chunker(df, chunksize):
        chunk.to_sql(
            name=table,
            con=con,
            if_exists="append",
            index=False
        )

if __name__ == '__main__':
    # initialise data. Example - make your own dataframe. DateTime should be pandas datetime objects.
    data = {'Col_01': [0, 1, 2, 3],
            'Location': ['Bar', 'Pub', 'Brewery', 'Bottleshop'],
            'DateTime': ["1/1/2018", "1/1/2019", "1/1/2020", "1/1/2021"],
            'Counts': [1, 2, 3, 4]}
    # Create DataFrame
    df = pd.DataFrame(data)
    df['DateTime'] = pd.to_datetime(df['DateTime'])  # convert the example date strings to datetimes, as noted above
    insert_with_progress(df, conn_executemany, table=TABLE)
    del [df]

Store and retrieve pickled python objects to/from snowflake

As per the question title, I am trying to store pickled Python objects in Snowflake, to get them back again at a later date. Help on this would be much appreciated:
Snowflake table definition:
CREATE OR REPLACE TABLE <db>.<schema>.TESTING_MEMORY (
    MODEL_DATETIME DATETIME,
    SCALARS VARIANT
);
Python code:
import numpy as np
import pandas as pd
import pickle
from datetime import datetime
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas
from sklearn.preprocessing import StandardScaler
def create_snowflake_connection():
    conn = snowflake.connector.connect(
        user='<username>',
        account='<account>',
        password='<password>',
        warehouse='<wh>',
        database='<db>',
        role='<role>',
        schema='<schema>'
    )
    return conn

memory = {}

np.random.seed(78)
df = pd.DataFrame({
    'x1': np.random.normal(0, 2, 10000),
    'x2': np.random.normal(5, 3, 10000),
    'x3': np.random.normal(-5, 5, 10000)
})
scaler = StandardScaler()
scaler.fit(df)
scaled_df = scaler.transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=['x1', 'x2', 'x3'])

memory['SCALARS'] = pickle.dumps(scaler)

ctx = create_snowflake_connection()

# Write to Snowflake
db_dat = pd.DataFrame([list(memory.values())], columns=list(memory.keys()))
db_dat.insert(0, 'MODEL_DATETIME', datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))
success, nchunks, nrows, _ = write_pandas(conn=ctx, df=db_dat, table_name='TESTING_MEMORY')

# Retrieve from Snowflake
cur = ctx.cursor()
sql = """
SELECT hex_encode(SCALARS)
FROM <db>.<schema>.TESTING_MEMORY
QUALIFY ROW_NUMBER() OVER (ORDER BY MODEL_DATETIME DESC) = 1
"""
cur.execute(sql)
returned = cur.fetch_pandas_all()

cur.close()
ctx.close()
Seems like you're trying to put a Python bytes object into a Snowflake VARIANT, which won't work for you.
This answer is kind of similar to what the other answer here suggests, except that rather than using a VARCHAR field to store base64-encoded binary, it uses a BINARY type instead; base64 encoding is around 30% larger than binary, from what I've read somewhere.
Create the table with binary data type:
create or replace table testdb.public.test_table (obj binary);
Hex encode the pickled object, write it, read it back and call a method on it:
import pickle
import snowflake.connector

# This is the object we're going to store in Snowflake as binary
class PickleMe:
    def __init__(self, first_name, last_name):
        self.first_name = first_name
        self.last_name = last_name

    def say_hello(self):
        print(f'Hi there, {self.first_name} {self.last_name}')

# Create the object and store it as hex in the 'hex_person' variable
person = PickleMe('John', 'Doe')
hex_person = pickle.dumps(person).hex()

with snowflake.connector.connect(
        user="username",
        password="password",
        account="snowflake_account_deets",
        warehouse="warehouse_name",
) as con:
    # Write pickled object into table as binary
    con.cursor().execute(f"INSERT INTO testdb.public.test_table values(to_binary('{hex_person}', 'HEX'))")
    # Now get the object back and put it into the 'obj' variable
    (obj,) = con.cursor().execute("select obj from testdb.public.test_table").fetchone()
    # Deserialise object and call method on it
    person_obj = pickle.loads(obj, encoding='HEX')
    person_obj.say_hello()
The output of the above is
Hi there, John Doe
There is probably a better way to do this (disclaimer: I am new to Python), but this seems to work and is based off the answer here: How can I pickle a python object into a csv file?
1. Change the SQL table definition
CREATE OR REPLACE TABLE db.schema.TESTING_MEMORY (
    MODEL_DATETIME DATETIME,
    SCALARS VARCHAR
);
2. Changes to Python code - general
import base64
3. Changes to Python code (write to Snowflake section above)
# Write to snowflake
db_dat = pd.DataFrame([list(memory.values())], columns=list(memory.keys()))
db_dat.insert(0, 'MODEL_DATETIME', datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"))
pickled_columns = ['SCALARS']
for column in pickled_columns:
    b64_bytes = base64.b64encode(db_dat[column].values[0])
    db_dat[column] = b64_bytes.decode('utf8')
success, nchunks, nrows, _ = write_pandas(conn=ctx, df=db_dat, table_name='TESTING_MEMORY')
4. Changes to Python code - retrieve from Snowflake
cur = ctx.cursor()
sql = """
SELECT *
FROM db.schema.TESTING_MEMORY
QUALIFY ROW_NUMBER() OVER (ORDER BY MODEL_DATETIME DESC) = 1
"""
cur.execute(sql)
returned = cur.fetch_pandas_all()
for column in pickled_columns:
    returned[column] = base64.b64decode(returned[column].values[0])
new_dict = returned.to_dict('list')
for key, val in new_dict.items():
    new_dict[key] = val[0]
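Not part of the original answer, but to close the loop: after the block above, new_dict['SCALARS'] holds the raw pickled bytes, so the fitted scaler can be rebuilt and reused, for example:
scaler_restored = pickle.loads(new_dict['SCALARS'])
print(scaler_restored.transform(df)[:3])   # applies the same scaling as the original fitted scaler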

fast insert (on conflict) many rows to postges-DB with python

I want to write messages from a websocket to a postgres-DB running on a Raspberry Pi.
The average rate from the websocket is about 30 messages/second, but peaks reach up to 250 messages/second.
I implemented a Python program to receive the messages and write them to the database with the SQLAlchemy ORM. After each message I first check whether the same primary key already exists and then do an update or an insert, and afterwards I always commit, so it gets very slow: I can write at most about 30 messages/second to the database. In peak times this is a problem.
So I tested several approaches to speed things up.
This is my best approach:
I first build all the single queries (with psycopg2), then join them together and send the complete query string to the database to execute at once --> this speeds it up to 580 messages/second.
Create the table for the test data:
CREATE TABLE transactions (
id int NOT NULL PRIMARY KEY,
name varchar(255),
description varchar(255),
country_name varchar(255),
city_name varchar(255),
cost varchar(255),
currency varchar(255),
created_at DATE,
billing_type varchar(255),
language varchar(255),
operating_system varchar(255)
);
example copied from https://medium.com/technology-nineleaps/mysql-sqlalchemy-performance-b123584eb833
Python test script:
import random
import time
from faker import Faker
import psycopg2
from psycopg2.extensions import AsIs

"""psycopg2"""
psycopg2_conn = {'host': '192.168.176.101',
                 'dbname': 'test',
                 'user': 'blabla',
                 'password': 'blabla'}
connection_psycopg2 = psycopg2.connect(**psycopg2_conn)

myFactory = Faker()

def random_data():
    billing_type_list = ['cheque', 'cash', 'credit', 'debit', 'e-wallet']
    language = ['English', 'Bengali', 'Kannada']
    operating_system = 'linux'
    random_dic = {}
    for i in range(0, 300):
        id = int(i)
        name = myFactory.name()
        description = myFactory.text()
        country_name = myFactory.country()
        city_name = myFactory.city()
        cost = str(myFactory.random_digit_not_null())
        currency = myFactory.currency_code()
        created_at = myFactory.date_time_between(start_date="-30y", end_date="now", tzinfo=None)
        billing_type = random.choice(billing_type_list)
        language = random.choice(language)
        operating_system = operating_system
        random_dic[id] = {}
        for xname in ['id', 'description', 'country_name', 'city_name', 'cost', 'currency',
                      'created_at', 'billing_type', 'language', 'operating_system']:
            random_dic[id][xname] = locals()[xname]
        print(id)
    return random_dic

def single_insert_on_conflict_psycopg2(idic, icur):
    cur = icur
    columns = idic.keys()
    columns_with_excludephrase = ['EXCLUDED.{}'.format(column) for column in columns]
    values = [idic[column] for column in columns]
    insert_statement = """
    insert into transactions (%s) values %s
    ON CONFLICT ON CONSTRAINT transactions_pkey
    DO UPDATE SET (%s) = (%s)
    """
    #insert_statement = 'insert into transactions (%s) values %s'
    print(','.join(columns))
    print(','.join(columns_with_excludephrase))
    print(tuple(values))
    xquery = cur.mogrify(insert_statement, (
        AsIs(','.join(columns)),
        tuple(values),
        AsIs(','.join(columns)),
        AsIs(','.join(columns_with_excludephrase))
    ))
    print(xquery)
    return xquery

def complete_run_psycopg2(random_dic):
    querylist = []
    starttime = time.time()
    cur = connection_psycopg2.cursor()
    for key in random_dic:
        print(key)
        query = single_insert_on_conflict_psycopg2(idic=random_dic[key],
                                                   icur=cur)
        querylist.append(query.decode("utf-8"))
    complete_query = ';'.join(tuple(querylist))
    cur.execute(complete_query)
    connection_psycopg2.commit()
    cur.close()
    endtime = time.time()
    xduration = endtime - starttime
    write_sec = len(random_dic) / xduration
    print('complete Duration:{}'.format(xduration))
    print('writes per second:{}'.format(write_sec))
    return write_sec

def main():
    random_dic = random_data()
    complete_run_psycopg2(random_dic)
    return

if __name__ == '__main__':
    main()
Now my question: is this a proper approach? Are there any hints I didn’t consider?
First, you cannot insert column names like that. I would use .format to inject the column names, and then use %s for the values.
SQL = 'INSERT INTO transactions ({}) VALUES (%s,%s,%s,%s,%s,%s)'.format(','.join(columns))
db.Pcursor().execute(SQL, value1, value2, value3)
Second, you will get better speed if you use async processes.
Fortunately for you, I wrote a gevent-based async library for psycopg2 that you can use. It makes the process far easier; it is async, threaded and pooled.
Python Postgres psycopg2 ThreadedConnectionPool exhausted
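Not from the original answers, but as another common batching approach with plain psycopg2: execute_values sends many rows in one INSERT ... ON CONFLICT statement. A hedged sketch reusing the transactions table and columns from the question (untested here):
import psycopg2
from psycopg2.extras import execute_values

def upsert_rows(conn, rows):
    # rows: list of dicts keyed by the transactions column names below
    columns = ['id', 'name', 'description', 'country_name', 'city_name', 'cost',
               'currency', 'created_at', 'billing_type', 'language', 'operating_system']
    update_clause = ', '.join('{c} = EXCLUDED.{c}'.format(c=c) for c in columns if c != 'id')
    query = ('INSERT INTO transactions ({cols}) VALUES %s '
             'ON CONFLICT (id) DO UPDATE SET {upd}').format(cols=', '.join(columns), upd=update_clause)
    values = [tuple(row[c] for c in columns) for row in rows]
    with conn:                       # one transaction, one commit for the whole batch
        with conn.cursor() as cur:
            execute_values(cur, query, values, page_size=1000)

# usage sketch: upsert_rows(connection_psycopg2, rows), where each row dict has the columns above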

Rows inserted in mysql table after application started not picked by the application

I have a use case in which I have to read rows having status = 0 from MySQL.
Table schema:
CREATE TABLE IF NOT EXISTS in_out_analytics(
id INT AUTO_INCREMENT PRIMARY KEY,
file_name VARCHAR(255),
start_time BIGINT,
end_time BIGINT,
duration INT,
in_count INT,
out_count INT,
status INT
)
I am using this below code to read data from mysql.
persistance.py
import mysql
import mysql.connector
import conf

class DatabaseManager(object):
    # class-level var storing the db connection
    connection = None

    def __init__(self):
        self.ip = conf.db_ip
        self.user_name = conf.db_user
        self.password = conf.db_password
        self.db_name = conf.db_name
        # Initialize database only one time in application
        if not DatabaseManager.connection:
            self.connect()
        self.cursor = DatabaseManager.connection.cursor()
        self.create_schema()

    def connect(self):
        try:
            DatabaseManager.connection = mysql.connector.connect(
                host=self.ip,
                database=self.db_name,
                user=self.user_name,
                password=self.password
            )
            print(f"Successfully connected to { self.ip } ")
        except mysql.connector.Error as e:
            print(str(e))

    def create_schema(self):
        # Create database
        # sql = f"CREATE DATABASE { self.db_name} IF NOT EXIST"
        # self.cursor.execute(sql)
        # Create table
        sql = """
        CREATE TABLE IF NOT EXISTS in_out_analytics(
            id INT AUTO_INCREMENT PRIMARY KEY,
            file_name VARCHAR(255),
            start_time BIGINT,
            end_time BIGINT,
            duration INT,
            in_count INT,
            out_count INT,
            status INT
        )"""
        self.cursor.execute(sql)

    def read_unprocessed_rows(self):
        sql = "SELECT id, start_time, end_time FROM in_out_analytics WHERE status=0;"
        self.cursor.execute(sql)
        result_set = self.cursor.fetchall()
        rows = []
        for row in result_set:
            id = row[0]
            start_time = row[1]
            end_time = row[2]
            details = {
                'id': id,
                'start_time': start_time,
                'end_time': end_time
            }
            rows.append(details)
        return rows
test.py
import time
from persistance import DatabaseManager

if __name__ == "__main__":
    # Rows which are inserted after application is started do not get processed if
    # 'DatabaseManager' is defined here
    # dm = DatabaseManager()
    while True:
        # Rows which are inserted after application is started do get processed if
        # 'DatabaseManager' is defined here
        dm = DatabaseManager()
        unprocessed_rows = dm.read_unprocessed_rows()
        print(f"unprocessed_rows: { unprocessed_rows }")
        time.sleep(2)
Problem:
When I define the database object dm = DatabaseManager() above the while loop, any new row inserted after the application has started does not get processed, whereas if I define dm = DatabaseManager() inside the while loop, rows inserted after the application has started do get processed.
What is the problem with the above code?
Ideally, we should create only one DatabaseManager object, since this class opens a connection to MySQL, and a single database connection per application should be the ideal case.
Making an assumption here, as I cannot test it myself.
tl;dr: Add DatabaseManager.connection.commit() to your read_unprocessed_rows
When you execute your SELECT statement, a transaction is created implicitly, using the default isolation level REPEATABLE READ. That creates a snapshot of the database at that point in time and all consecutive reads in that transaction will read from the snapshot established during the first read. The effects of different isolation levels are described here. To refresh the snapshot in REPEATABLE READ, you can commit your current transaction before executing the next statement.
So, when you instantiate your DatabaseManager inside your loop, each SELECT starts a new transaction on a new connection, hence has a fresh snapshot every time. When instantiating your DatabaseManager outside the loop, the transaction created by the first SELECT keeps the same snapshot for all consecutive SELECTs, and updates from outside that transaction remain invisible.
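To make the tl;dr concrete, a minimal sketch of read_unprocessed_rows with the commit added, so that each call ends its read transaction and the next SELECT sees rows inserted in the meantime (the rest of the method is unchanged from the question):
    def read_unprocessed_rows(self):
        sql = "SELECT id, start_time, end_time FROM in_out_analytics WHERE status=0;"
        self.cursor.execute(sql)
        result_set = self.cursor.fetchall()
        # End the implicit REPEATABLE READ transaction so the next call starts a fresh snapshot
        DatabaseManager.connection.commit()
        rows = []
        for row in result_set:
            rows.append({'id': row[0], 'start_time': row[1], 'end_time': row[2]})
        return rows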
