Today I started to learn Postgres, and I was trying to do the same thing I do to load dataframes into my Oracle DB.
So, for example, I have a df that contains 70k records and 10 columns. My code for this is the following:
from sqlalchemy import create_engine
conn = create_engine('postgresql://'+data['user']+':'+data['password']+'@'+data['host']+':'+data['port_db']+'/'+data['dbname'])
df.to_sql('first_posgress', conn)
This code is pretty much the same as what I use for my Oracle tables, but in this case it takes a long time to complete. So I was wondering if there is a better way to do this, or if Postgres is just slower in general.
I found some examples on SO and Google, but most of them are focused on creating the table, not inserting a df.
If it is possible for you to use psycopg2 instead of SQLAlchemy, you can transform your df into a CSV and then use cursor.copy_from() to copy the CSV into the db.
import io
import psycopg2

# Dump the DataFrame to an in-memory CSV buffer. Skip the header and index,
# because copy_from expects raw data rows only.
output = io.StringIO()
df.to_csv(output, sep=",", header=False, index=False)
output.seek(0)

# psycopg2 connection/cursor (reusing the credentials from the question):
con = psycopg2.connect(dbname=data['dbname'], user=data['user'],
                       password=data['password'], host=data['host'],
                       port=data['port_db'])
cursor = con.cursor()

cursor.copy_from(
    output,
    'first_posgress',            # target table
    sep=",",
    columns=tuple(df.columns)
)
con.commit()
(I don't know if there is a similar function in SQLAlchemy that is faster too.)
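On that note, pandas' to_sql accepts a method= callable, and the pandas documentation shows a COPY-based insertion method for PostgreSQL. A sketch along those lines, assuming psycopg2 is the driver behind the SQLAlchemy engine created in the question:

import csv
from io import StringIO

def psql_insert_copy(table, conn, keys, data_iter):
    # Insertion callable for DataFrame.to_sql: stream rows through
    # PostgreSQL COPY instead of row-by-row INSERTs.
    dbapi_conn = conn.connection  # raw psycopg2 connection behind the SQLAlchemy one
    with dbapi_conn.cursor() as cur:
        buf = StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)

        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

df.to_sql('first_posgress', conn, method=psql_insert_copy, index=False)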
Psycopg2 Cursor Documentation
This blogpost contains more information!
Hopefully this is useful for you!
I am trying to read a table from a SQLite database on Kaggle using Dask.
Link to the DB: https://www.kaggle.com/datasets/marcilonsilvacunha/amostracnpj?select=amostraCNPJ.sqlite
Some of the tables in this database are really large, and I want to test how Dask can handle them.
I wrote the following code for one of the tables in the smaller SQLite database:
import dask.dataframe as ddf
import sqlite3
# Read sqlite query results into a pandas DataFrame
con = sqlite3.connect("/kaggle/input/amostraCNPJ.sqlite")
df = ddf.read_sql_table('cnpj_dados_cadastrais_pj', con, index_col='cnpj')
# Verify that result of SQL query is stored in the dataframe
print(df.head())
This gives an error:
AttributeError: 'sqlite3.Connection' object has no attribute '_instantiate_plugins'
Any help would be appreciated, as this is the first time I have used Dask to read SQLite.
As the docstring states, you should not pass a connection object to Dask. You need to pass a SQLAlchemy-compatible connection string:
df = ddf.read_sql_table('cnpj_dados_cadastrais_pj',
                        'sqlite:////kaggle/input/amostraCNPJ.sqlite',
                        index_col='cnpj')
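For completeness, a self-contained sketch of the same call; the only assumption is that the file path from the question is correct. The four slashes come from the sqlite:/// prefix followed by an absolute path that itself starts with /:

import dask.dataframe as ddf

df = ddf.read_sql_table(
    'cnpj_dados_cadastrais_pj',
    'sqlite:////kaggle/input/amostraCNPJ.sqlite',  # sqlite:/// + absolute path
    index_col='cnpj',
)
print(df.head())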
In the following Python code I can successfully connect to an MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' dataframe method to_sql(...). But when I use pyspark.pandas instead, the to_sql(...) method fails, stating that no such method is supported. I know the pandas API on Spark has reached about 97% coverage, but I was wondering if there is an alternate method of achieving the same while still using ODBC.
Question: In the following code sample, how can we use an ODBC connection for pyspark.pandas to connect to the Azure SQL DB and load a dataframe into a SQL table?
import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime
data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
.......
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
Ref: Pandas API on Spark and this
UPDATE: The data file is about 6.5 GB with 150 columns and 15 million records. Therefore pandas cannot handle it and, as expected, it gives an OOM (out of memory) error.
I noticed you were appending the data to the table, so this workaround came to mind.
Break the pyspark.pandas DataFrame into chunks, convert each chunk to pandas, and append each chunk from there.
n = len(data_df) // 20                                   # chunk size (roughly 20 chunks)
list_dfs = [data_df.iloc[i:i + n] for i in range(0, len(data_df), n)]

for chunk in list_dfs:
    pdf = chunk.to_pandas()                              # pandas-on-Spark -> plain pandas
    pdf.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)
As per the official pyspark.pandas documentation from Apache Spark, there is no method in this module that can load a pandas-on-Spark DataFrame into a SQL table.
Please see all provided methods here.
As an alternative approach, there are some similar asks mentioned in these SO threads. This might be helpful.
How to write to a Spark SQL table from a Panda data frame using PySpark?
How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook
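Along the lines of those threads, a hedged sketch (not ODBC): convert the pandas-on-Spark frame to a native Spark DataFrame and use Spark's built-in JDBC writer. The server, database, credentials, and the availability of a SQL Server JDBC driver on the cluster are assumptions/placeholders here:

# Write via Spark's JDBC writer instead of ODBC/to_sql.
sdf = data_df.to_spark()   # pandas-on-Spark -> native Spark DataFrame

jdbc_url = (
    "jdbc:sqlserver://<server>.database.windows.net:1433;"
    "database=<database>;encrypt=true;loginTimeout=30"
)

(sdf.write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "CustomerOrderTable")
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())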
I'm writing Python code that creates a SQLite database and does some calculations for massive tables. To begin with, the reason I'm doing it in SQLite through Python is memory: my data is huge and will hit a memory error if run in, say, pandas, and if chunked it will take ages, mostly because pandas is slow with merges and group-bys, etc.
So my issue now is that at some point I want to calculate the exponential of one column in a table (sample code below), but it seems that SQLite doesn't have an EXP function.
I could write the data to a dataframe and then use numpy to calculate the EXP, but that defeats the whole point that pushed me towards DBs, and adds the extra time of reading/writing back and forth between the DB and Python.
So my question is this: is there a way around this to calculate the exponential within the database? I've read that I can create the function within sqlite3 in Python, but I have no idea how. If you know how, or can direct me to where I can find relevant info, I would be thankful. Thanks.
Here is a sample of my code where I'm trying to do the calculation. Note that here I'm just providing a sample where the table comes directly from a CSV, but in my process it's actually created within the DB after lots of merges and group-bys:
import sqlite3
import pandas as pd

# set path and file names
folderPath = 'C:\\SCP\\'
inputDemandFile = 'demandFile.csv'

# set connection to database
conn = sqlite3.connect(folderPath + dataBaseName)
cur = conn.cursor()

# read demand file into the db
inputDemand = pd.read_csv(folderPath + inputDemandFile)
inputDemand.to_sql('inputDemand', conn, if_exists='replace', index=False)

# create new table and calculate EXP
cur.execute('CREATE TABLE demand_exp AS SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand FROM inputDemand;')
I've read that I can create the function within sqlite3 in Python, but I have no idea how.
That's conn.create_function():
https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function
>>> import math
>>> conn.create_function('EXP', 1, math.exp)
>>> cur.execute('select EXP(1)')
>>> cur.fetchone()
(2.718281828459045,)
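Applied to the code in the question, a minimal sketch (reusing folderPath, dataBaseName, and the inputDemand table from above) registers the function on the connection before the CREATE TABLE runs:

import math
import sqlite3

conn = sqlite3.connect(folderPath + dataBaseName)
conn.create_function('EXP', 1, math.exp)   # register EXP before it appears in any SQL
cur = conn.cursor()

cur.execute('CREATE TABLE demand_exp AS '
            'SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand '
            'FROM inputDemand;')
conn.commit()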
I need to upload a table I modified to my Oracle database. I exported the table as a pandas dataframe, modified it, and now want to upload it back to the DB.
I am trying to do this using the df.to_sql function as follows:
import sqlalchemy as sa
import pandas as pd
engine = sa.create_engine('oracle://"IP_address_of_server"/"serviceDB"')
df.to_sql("table_name",engine, if_exists='replace', chunksize = None)
I always get this error: DatabaseError: (cx_Oracle.DatabaseError) ORA-12505: TNS:listener does not currently know of SID given in connect descriptor (Background on this error at: http://sqlalche.me/e/4xp6).
I am not an expert in this, so I could not understand what the problem is, especially since the IP address I am giving is the right one.
Could anyone help? Thanks a lot!
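One hedged sketch, assuming the target is a service name rather than an SID (ORA-12505 typically means the value after the slash was interpreted as an SID), with placeholder credentials, host, and port: the connection string can name the service explicitly.

import sqlalchemy as sa

# Placeholder user/password/host/port/service name; adjust to your environment.
engine = sa.create_engine(
    'oracle+cx_oracle://user:password@IP_address_of_server:1521/?service_name=serviceDB'
)
df.to_sql('table_name', engine, if_exists='replace', chunksize=None)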
I would like to get some clarity on a question I was pretty sure was already clear to me. Is there any way to create a table, using psycopg2 or any other Python Postgres database adapter, with the name corresponding to a .csv file and (probably most important) with the columns that are specified in the .csv file?
I'll leave you to look at the psycopg2 library properly; this is off the top of my head (I haven't had to use it for a while, but IIRC the documentation is ample).
The steps are:
Read column names from CSV file
Create "CREATE TABLE whatever" ( ... )
Maybe INSERT data
import csv
import os.path

my_csv_file = '/home/somewhere/file.csv'
# table name = CSV file name without directory or extension
table_name = os.path.splitext(os.path.split(my_csv_file)[1])[0]
# first row of the CSV = column names
cols = next(csv.reader(open(my_csv_file)))
You can go from there...
Create the SQL query (possibly using a templating engine for the fields) and then issue the INSERT if need be.
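A fuller sketch of those steps, assuming a reachable Postgres database (the DSN below is a placeholder) and that every column can simply be created as TEXT:

import csv
import os.path
import psycopg2
from psycopg2 import sql

my_csv_file = '/home/somewhere/file.csv'
table_name = os.path.splitext(os.path.split(my_csv_file)[1])[0]

with open(my_csv_file, newline='') as f:
    reader = csv.reader(f)
    cols = next(reader)        # header row -> column names
    rows = list(reader)        # remaining rows -> data

conn = psycopg2.connect('dbname=mydb user=me')   # placeholder DSN
with conn, conn.cursor() as cur:
    # CREATE TABLE <file name> (<col1> TEXT, <col2> TEXT, ...),
    # quoting identifiers safely via psycopg2.sql
    create_stmt = sql.SQL('CREATE TABLE {} ({})').format(
        sql.Identifier(table_name),
        sql.SQL(', ').join(sql.SQL('{} TEXT').format(sql.Identifier(c)) for c in cols),
    )
    cur.execute(create_stmt)

    # optionally insert the data as well
    insert_stmt = sql.SQL('INSERT INTO {} VALUES ({})').format(
        sql.Identifier(table_name),
        sql.SQL(', ').join(sql.Placeholder() * len(cols)),
    )
    cur.executemany(insert_stmt, rows)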