Pandas DataFrame CSV to SQL with 'multi' method generates error - python

I am moving a lot of data from local CSV files into an Azure-based SQL database.
I am using SQLAlchemy and ODBC Driver 17.
The chunk size is 5,000.
Everything is fine if I don't switch on the 'multi' method in the final to_sql call.
The dataframe is a 9-column dataframe read from CSV.
The error message I get when switching on the 'multi' method is:
"('The SQL contains -20536 parameter markers, but 45000 parameters were supplied', 'HY000')"
The 45,000 is presumably the 9 columns times the 5,000 chunk size, which makes sense, but why the SQL contains -20536 parameter markers is giving me a big headache.
Thanks so much. My code looks like:
import pyodbc
import urllib
from sqlalchemy import create_engine,Table,MetaData
import pandas as pd
from datetime import datetime
params = urllib.parse.quote_plus("DRIVER={ODBC Driver 17 for SQL Server};SERVER=tcp:oac-data1.database.windows.net,1433;DATABASE=OAC Analytics;UID=xxxxxx;PWD=xxxxx")
chunk = 5000
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn = engine.connect()
DF_DFS = pd.read_csv('xxxxx\Fact-Qikids-DFS.csv', header=0)
DF_DFS = DF_DFS[['Campus ID', 'Date', 'Age Group', 'Room #', 'Booking type', 'Absence', 'Attendances', 'Fees Charged', 'Version']]
DF_DFS.to_sql('QikKids-DFS-Parsed', con=conn, if_exists='append', index=False, chunksize=chunk, method='multi')

I found out that an alternative is to switch on fast_executemany.
The only thing to watch out for is that it requires the index to be on.
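For what it's worth, the negative number looks like a signed 16-bit overflow (45,000 − 65,536 = −20,536), and SQL Server also caps a single statement at roughly 2,100 parameters, so with 9 columns the 'multi' method only works with a much smaller chunk size. A rough sketch of both workarounds, reusing the engine, dataframe and params from the question and assuming a reasonably recent SQLAlchemy (the chunksize of 200 is just an illustrative value):

# Option 1: keep method='multi', but keep columns x chunksize under SQL Server's
# ~2,100-parameters-per-statement limit (9 columns -> chunksize of at most ~230).
DF_DFS.to_sql('QikKids-DFS-Parsed', con=conn, if_exists='append',
              index=False, chunksize=200, method='multi')

# Option 2: drop method='multi' and let pyodbc batch the inserts instead.
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params,
                       fast_executemany=True)
DF_DFS.to_sql('QikKids-DFS-Parsed', con=engine, if_exists='append', chunksize=chunk)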

Related

Faster way to read from AWS table into Dataframe?

I tried this; the job was too large and the kernel restarted:
import pandas as pd
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor
cursor = connect(s3_staging_dir='s3://db',
                 region_name='us-east-1',
                 cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM database.table").as_pandas()
print(df.shape)
I also tried this; again the job was too large and the kernel restarted:
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://db', region_name='us-east-1')
df_isp_20220204 = pd.read_sql("SELECT * FROM database.table", conn)
The table is somewhat large: around 50 million records and 3 fields. I think the problem is that I can't load it all into RAM, and it's also very slow! Is there a more efficient way to do this kind of thing?
I have my AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. I think you need wr.redshift.copy if you use these credentials. I tried to read through some documentation, but I couldn't figure out a way to get wr.redshift.copy to read from a table and load it into a dataframe.
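If holding all 50 million rows in memory is the bottleneck, one option is to stream the result in chunks and reduce each chunk as it arrives instead of materialising the whole table. A rough sketch using pd.read_sql with chunksize on the pyathena connection from the question (the per-chunk processing step is a placeholder you would replace with your own aggregation or filter):

import pandas as pd
from pyathena import connect

conn = connect(s3_staging_dir='s3://db', region_name='us-east-1')

pieces = []
for chunk in pd.read_sql("SELECT * FROM database.table", conn, chunksize=500000):
    # Placeholder: keep only what you actually need from each chunk,
    # e.g. an aggregate, a filter, or a subset of columns.
    pieces.append(chunk.describe())

print(len(pieces))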

How to use ODBC connection for pyspark.pandas

In my following Python code I can successfully connect to MS Azure SQL DB using an ODBC connection, and can load data into an Azure SQL table using pandas' dataframe method to_sql(...). But when I use pyspark.pandas instead, the to_sql(...) method fails, stating that no such method is supported. I know the pandas API on Spark has reached about 97% coverage, but I was wondering if there is an alternate method of achieving the same while still using ODBC.
Question: In the following code sample, how can we use an ODBC connection for pyspark.pandas to connect to Azure SQL DB and load a dataframe into a SQL table?
import sqlalchemy as sq
#import pandas as pd
import pyspark.pandas as ps
import datetime
data_df = ps.read_csv('/dbfs/FileStore/tables/myDataFile.csv', low_memory=False, quotechar='"', header='infer')
.......
data_df.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False,
               dtype={'OrderID': sq.VARCHAR(10),
                      'Name': sq.VARCHAR(50),
                      'OrderDate': sq.DATETIME()})
Ref: Pandas API on Spark and this
UPDATE: The data file is about 6.5 GB with 150 columns and 15 million records. Therefore pandas cannot handle it and, as expected, it gives an OOM (out of memory) error.
I noticed you were appending the data to the table, so this workaround came to mind.
Break the pyspark.pandas dataframe into chunks, convert each chunk to pandas, and append the chunk from there.
import numpy as np

list_dfs = np.array_split(data_df, 20)  # break it into 20 chunks
for chunk in list_dfs:
    pdf = chunk.to_pandas()
    pdf.to_sql(name='CustomerOrderTable', con=engine, if_exists='append', index=False)
As per the official pyspark.pandas documentation from Apache Spark, there is no method in this module that can load a DataFrame into a SQL table.
Please see all the provided methods here.
As an alternative approach, there are some similar questions in these SO threads, which might be helpful (a rough JDBC sketch follows the links):
How to write to a Spark SQL table from a Panda data frame using PySpark?
How can I convert a pyspark.sql.dataframe.DataFrame back to a sql table in databricks notebook
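Both of those threads ultimately write through Spark's JDBC writer rather than ODBC, so a hedged sketch of that route looks roughly like the following. The server, database and credential values are placeholders, and the Microsoft SQL Server JDBC driver is assumed to be available on the cluster:

jdbc_url = "jdbc:sqlserver://<your-server>.database.windows.net:1433;databaseName=<your-db>"

(data_df.to_spark()              # pyspark.pandas -> plain Spark DataFrame
        .write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "CustomerOrderTable")
        .option("user", "<user>")
        .option("password", "<password>")
        .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
        .mode("append")
        .save())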

Using Chunksize and Dask to process 8GB Redshift table in Pandas with Missingno

I have successfully connected Python to a Redshift table from my Jupyter Notebook.
I sampled 1 day of data (176,707 rows) and ran a function using Missingno to assess how much data is missing and where. No problem there.
Here's the code so far (redacted for security)...
#IMPORT NECESSARY PACKAGES FOR REDSHIFT CONNECTION AND DATA VIZ
import psycopg2
from getpass import getpass
from pandas import read_sql
import seaborn as sns
import missingno as msno

#PASSWORD INPUT PROMPT
pwd = getpass('password')

#REDSHIFT CREDENTIALS
config = {'dbname': 'abcxyz',
          'user': 'abcxyz',
          'pwd': pwd,
          'host': 'abcxyz.redshift.amazonaws.com',
          'port': 'xxxx'}

#CONNECTION UDF USING REDSHIFT CREDS AS DEFINED ABOVE
def create_conn(*args, **kwargs):
    config = kwargs['config']
    try:
        con = psycopg2.connect(dbname=config['dbname'], host=config['host'],
                               port=config['port'], user=config['user'],
                               password=config['pwd'])
        return con
    except Exception as err:
        print(err)

#DEFINE CONNECTION
con = create_conn(config=config)

#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from schema.table where date = '2020-06-07'", con=con)

# MISSINGNO VIZ
msno.bar(df, labels=True, figsize=(50, 20))
This produces the missingno bar chart, which is exactly what I want to see.
However, I need to perform this task on a subset of the entire table, not just one day.
I ran...
SELECT "table", size, tbl_rows FROM SVV_TABLE_INFO
...and I can see that the table is 9 GB in total with 32.5M rows, although the sample whose data completeness I need to assess is 11M rows.
So far I have identified 2 options for retrieving a larger dataset than the ~18k rows from my initial attempt.
These are:
1) Using chunksize
2) Using Dask
Using Chunksize
I replaced the necessary line of code with this:
#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from derived.page_views where column_name = 'something'", con=con, chunksize=100000)
This still took several hours to run on a MacBook Pro 2.2 GHz Intel Core i7 with 16 GB RAM and gave memory warnings toward the end of the task.
When it was complete I wasn't able to view the chunks anyway and the kernel disconnected, meaning the data held in memory was lost and I'd essentially wasted a morning.
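With chunksize set, read_sql returns an iterator of DataFrames rather than a single frame, so the chunks have to be consumed as they stream in. A rough sketch of that pattern, where the per-chunk null-count aggregation is purely illustrative and not part of the original attempt:

null_counts = None
total_rows = 0
for chunk in read_sql("select * from derived.page_views where column_name = 'something'",
                      con=con, chunksize=100000):
    counts = chunk.isnull().sum()
    null_counts = counts if null_counts is None else null_counts + counts
    total_rows += len(chunk)

print(null_counts / total_rows)  # fraction of missing values per column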
My question is:
Assuming this is not an entirely foolish endeavour, would Dask be a better approach? If so, how could I perform this task using Dask?
The Dask documentation gives this example:
df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
                       npartitions=10, index_col='id')  # doctest: +SKIP
But I don't understand how I could apply this to my scenario, where I have connected to a Redshift table in order to retrieve the data.
Any help gratefully received.
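For what it's worth, Dask's read_sql_table takes a SQLAlchemy-style connection URI rather than an existing psycopg2 connection, and Redshift speaks the PostgreSQL protocol, so a hedged sketch might look like the following. The URI pieces map to the config dict above, index_col is a hypothetical indexed (preferably numeric) column you would substitute, and the exact signature varies a bit between Dask versions:

import dask.dataframe as dd

uri = f"postgresql+psycopg2://{config['user']}:{config['pwd']}@{config['host']}:{config['port']}/{config['dbname']}"

ddf = dd.read_sql_table('page_views', uri,
                        schema='derived',        # table lives in the derived schema
                        index_col='event_id',    # hypothetical indexed numeric column
                        npartitions=100)

# Work per partition; only the small per-partition results are collected.
null_fraction = ddf.isnull().mean().compute()
print(null_fraction)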

How to avoid 'database disk image malformed' error while loading large json file in python / pandas?

I am trying to read a table from an SQLite database (about 4 GB in size). Each cell of the table is JSON (a few cells contain large JSON documents).
The query works fine when I execute it in DB Browser, but in Python it gives an error: 'Database disk image is malformed'.
I have tried different tables and the problem persists. The number of rows fetched by the query is about 5,000; however, each cell may itself hold a long JSON-structured string (of about 10,000 lines).
I have already tried working with replicas of the database and with other databases. I tried the following as well, in the DB:
Pragma integrity check;
Pragma temp_store = 2; -- to force data into RAM
The problem seems to be linked more with pandas/Python than with the actual DB:
import sqlite3
import pandas as pd

conn = sqlite3.connect(db)
sql = """
      select a.Topic, a.Timestamp, a.SessionId, a.ContextMask, b.BuildUUID, a.BuildId, a.LayerId,
             a.Payload
      from MessageTable a
      inner join BuildTable b
          on a.BuildId = b.BuildId
      where a.Topic = ('Engine/Sensors/SensorData')
          and b.BuildUUID = :job
      """
cur = conn.cursor()
cur.execute(sql, {"job": '06c95a97-40c7-49b7-ad1b-0d439d412464'})
sensordf = pd.DataFrame(data=cur.fetchall(),
                        columns=['Topic', 'Timestamp_epoch', 'SessionId', 'ContextMask',
                                 'BuildUUID', 'BuildId', 'LayerId', 'Payload'])
I expect the output to be a pandas dataframe with the last column containing JSON values in each cell. I can then write further scripts to parse the JSON and extract more data.
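One thing worth trying, given how large the individual JSON cells are, is to let pandas run the parameterised query and pull the rows in small chunks instead of one big fetchall(). A minimal sketch over the same connection and query (the chunk size of 500 is just an illustrative value, and the column names will come from the query rather than an explicit list):

chunks = []
for chunk in pd.read_sql_query(sql, conn,
                               params={"job": '06c95a97-40c7-49b7-ad1b-0d439d412464'},
                               chunksize=500):
    chunks.append(chunk)

sensordf = pd.concat(chunks, ignore_index=True)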

Accessing large datasets with Python 3.6, psycopg2 and pandas

I am trying to pull a 1.7 GB file into a pandas dataframe from a Greenplum Postgres data source. The psycopg2 driver takes 8 or so minutes to load. Using the pandas "chunksize" parameter does not help, as the psycopg2 driver selects all the data into memory and then hands it off to pandas, using well over 2 GB of RAM.
To get around this, I'm trying to use a named cursor, but all the examples I've found loop through row by row, and that just seems slow. But the main problem appears to be that my SQL just stops working in the named query for some unknown reason.
Goals
Load the data as quickly as possible without doing any "unnatural acts".
Use SQLAlchemy if possible (used for consistency).
Have the results in a pandas dataframe for fast in-memory processing (alternatives?).
Have a "pythonic" (elegant) solution. I'd love to do this with a context manager but haven't gotten that far yet.
# Named Cursor Chunky Access Test
import pandas as pd
import psycopg2
import psycopg2.extras

# Connect to database - works
conn_chunky = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)

# Open named cursor - appears to work
cursor_chunky = conn_chunky.cursor(
    'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
cursor_chunky.itersize = 100000

# This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
result = cursor_chunky.execute(sql_query)

# result returns None (normal behavior) but result is not iterable
df = pd.DataFrame(result.fetchall())
The pandas call returns AttributeError: 'NoneType' object has no attribute 'fetchall'. The failure seems to be due to the named cursor being used. I have tried fetchone, fetchmany, etc. Note that the goal here is to let the server chunk and serve up the data in large chunks so that there is a balance of bandwidth and CPU usage. Looping through df = df.append(row) is just plain fugly.
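For reference, psycopg2's cursor.execute() always returns None; with a named (server-side) cursor the rows are fetched from the cursor itself. A hedged sketch of the usual batched pattern, where the batch size and the column handling are illustrative:

cursor_chunky.execute(sql_query)

frames = []
while True:
    rows = cursor_chunky.fetchmany(100000)   # server-side fetch in large batches
    if not rows:
        break
    frames.append(pd.DataFrame(rows, columns=[col.name for col in cursor_chunky.description]))

df = pd.concat(frames, ignore_index=True)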
See related questions (not the same issue):
Streaming data from Postgres into Python
psycopg2 leaking memory after large query
Added standard client-side chunking code per request:
nrows = 3652504
size = nrows // 1000
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
    if first_loop:
        df = dfx
        first_loop = False
    else:
        df = df.append(dfx, ignore_index=True)
UPDATE:
# Chunked access
import time
import pandas as pd
from sqlalchemy import create_engine

start = time.time()
engine = create_engine(conn_str)
size = 10**4
df = pd.concat((x for x in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size)),
               ignore_index=True)
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')
OLD answer:
I'd try reading the data from PostgreSQL using the built-in pandas method read_sql():
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user@localhost:5432/dbname')
df = pd.read_sql(sql_query, engine)
