I tried this; the job was too large and the kernel restarted:
import pandas as pd
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor
cursor = connect(s3_staging_dir='s3://db',
                 region_name='us-east-1',
                 cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM database.table").as_pandas()
print(df.shape)
I also tried this; again the job was too large and the kernel restarted:
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://db', region_name='us-east-1')
df_isp_20220204 = pd.read_sql("SELECT * FROM database.table", conn)
The table is fairly large: around 50 million records and 3 fields. I think the problem is that it can't fit into RAM, and it's also very slow. Is there a more efficient way to do this kind of thing?
I have my AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. I think you need wr.redshift.copy if you use these credentials. I tried to read through some documentation, but I couldn't figure out a way to get wr.redshift.copy to read from a table and load it into a dataframe.
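One thing I'm considering trying, but haven't verified yet, is awswrangler's chunked Athena reader, which as far as I can tell streams the result back in pieces instead of building one giant dataframe. A rough sketch (the chunk size is an arbitrary guess, and the database/table names are the same placeholders as above):

import awswrangler as wr

# Sketch only: read the Athena table in chunks rather than all at once.
# chunksize=500_000 is an arbitrary value to tune against available RAM.
dfs = wr.athena.read_sql_query(
    "SELECT * FROM table",
    database="database",
    s3_output="s3://db",       # same staging bucket as above
    chunksize=500_000,
)

for chunk in dfs:
    # process or persist each chunk, then let it go out of scope
    print(chunk.shape)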
Related
I am trying to create a dataframe using the data in a Redshift table.
But I am getting a "Memory error" because the data I am fetching is huge in volume.
How do I solve this issue? (I found that chunking is one option. How do I implement chunking?) Is there any other library useful for such situations?
The following is an example code
import pandas as pd
import psycopg2
conn = psycopg2.connect(host=host_name, user=usr, port=pt, password=pwd, dbname=DB)
sql_query = "SELECT * FROM Table_Name"
df = pd.read_sql_query(sql_query, conn)
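From my reading of the pandas docs, chunking might look roughly like the sketch below (the 100,000 chunk size is an arbitrary guess), but I have not confirmed that it avoids the memory error:

# Sketch: iterate the result in chunks instead of loading one huge DataFrame.
rows = 0
for chunk in pd.read_sql_query(sql_query, conn, chunksize=100_000):
    # do the per-chunk work here (aggregate, filter, write to disk, ...)
    rows += len(chunk)
print(rows)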
Today I started to learn Postgres and I was trying to do the same thing that I do to load dataframes into my Oracle db.
So, for example, I have a df that contains 70k records and 10 columns. My code for this is the following:
from sqlalchemy import create_engine
conn = create_engine('postgresql://'+data['user']+':'+data['password']+'@'+data['host']+':'+data['port_db']+'/'+data['dbname'])
df.to_sql('first_posgress', conn)
This code is pretty much the same as what I use for my Oracle tables, but in this case it takes a long time to complete. So I was wondering if there is a better way to do this, or whether Postgres is just slower in general.
I found some examples on SO and Google, but most are focused on creating the table, not inserting a df.
If it is possible for you to use psycopg2 instead of SQLAlchemy, you can turn your df into a CSV and then use cursor.copy_from() to copy the CSV into the db.
import io

output = io.StringIO()
# write without header and index so the rows match columns=tuple(df.columns)
df.to_csv(output, sep=",", header=False, index=False)
output.seek(0)

# cursor is a psycopg2 cursor:
cursor.copy_from(
    output,
    target_table,  # e.g. 'first_posgress'
    sep=",",
    columns=tuple(df.columns)
)
con.commit()  # con is the psycopg2 connection
(I don't know if there is a similar function in SQLAlchemy that is faster too.)
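If you do want to stay with SQLAlchemy and to_sql, pandas also accepts a custom insertion callable via the method argument; a sketch (untested here) that routes each batch through COPY on the underlying psycopg2 connection could look like this:

import csv
import io

def psql_copy(table, conn, keys, data_iter):
    # Callable for DataFrame.to_sql(method=...); conn is the SQLAlchemy
    # connection pandas hands in, conn.connection the raw psycopg2 one.
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        buf = io.StringIO()
        csv.writer(buf).writerows(data_iter)
        buf.seek(0)
        columns = ', '.join('"{}"'.format(k) for k in keys)
        table_name = '{}.{}'.format(table.schema, table.name) if table.schema else table.name
        cur.copy_expert('COPY {} ({}) FROM STDIN WITH CSV'.format(table_name, columns), buf)

df.to_sql('first_posgress', engine, index=False, method=psql_copy)

(Here engine is the create_engine(...) object from the question.)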
Psycopg2 Cursor Documentation
This blogpost contains more information!
Hopefully this is useful for you !
I have successfully connected Python to a redshift table with my Jupyter Notebook.
I sampled 1 day of data (176,707 rows) and ran a Missingno function to assess how much data is missing and where. No problem there.
Here's the code so far (redacted for security)...
#IMPORT NECESSARY PACKAGES FOR REDSHIFT CONNECTION AND DATA VIZ
import psycopg2
from getpass import getpass
from pandas import read_sql
import seaborn as sns
import missingno as msno
#PASSWORD INPUT PROMPT
pwd = getpass('password')
#REDSHIFT CREDENTIALS
config = {'dbname': 'abcxyz',
          'user': 'abcxyz',
          'pwd': pwd,
          'host': 'abcxyz.redshift.amazonaws.com',
          'port': 'xxxx'}
#CONNECTION UDF USING REDSHIFT CREDS AS DEFINED ABOVE
def create_conn(*args, **kwargs):
    config = kwargs['config']
    try:
        con = psycopg2.connect(dbname=config['dbname'], host=config['host'],
                               port=config['port'], user=config['user'],
                               password=config['pwd'])
        return con
    except Exception as err:
        print(err)
#DEFINE CONNECTION
con = create_conn(config=config)
#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from schema.table where date = '2020-06-07'", con=con)
# MISSINGNO VIZ
msno.bar(df, labels=True, figsize=(50, 20))
This produces the missingno bar chart, which is exactly what I want to see.
However, I need to perform this task on a subset of the entire table, not just one day.
I ran...
SELECT "table", size, tbl_rows FROM SVV_TABLE_INFO
...and I can see that the table has a total size of 9 GB and 32.5M rows, although the subset whose data completeness I need to assess is 11M rows.
So far I have identified 2 options for retrieving a larger dataset than the single day (~177k rows) from my initial attempt.
These are:
1) Using chunksize
2) Using Dask
Using Chunksize
I replaced the necessary line of code with this:
#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from derived.page_views where column_name = 'something'", con=con, chunksize=100000)
This still took several hours to run on a MacBook Pro 2.2 GHz Intel Core i7 with 16 GB RAM and gave memory warnings toward the end of the task.
When it completed I wasn't able to view the chunks anyway, and the kernel disconnected, meaning the data held in memory was lost and I'd essentially wasted a morning.
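One thing I have since realised (please correct me if this is wrong): with chunksize set, read_sql returns an iterator of DataFrames rather than a single DataFrame, which is presumably why I couldn't view the chunks afterwards. My rough understanding of the chunked loop, accumulating missing-value counts per chunk so the full result never sits in memory (same placeholder filter as above):

# Sketch: aggregate missing-value counts chunk by chunk instead of keeping the rows.
missing_counts = None
total_rows = 0
for chunk in read_sql("select * from derived.page_views where column_name = 'something'",
                      con=con, chunksize=100000):
    counts = chunk.isna().sum()
    missing_counts = counts if missing_counts is None else missing_counts + counts
    total_rows += len(chunk)
print(missing_counts / total_rows)  # fraction of missing values per column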
My question is:
Assuming this is not an entirely foolish endeavour, would Dask be a better approach? If so, how could I perform this task using Dask?
The Dask documentation gives this example:
df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
                       npartitions=10, index_col='id')  # doctest: +SKIP
But I don't understand how I could apply this to my scenario, where I have connected to a Redshift table in order to retrieve the data.
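My best guess, based on the read_sql_table docs, is something like the sketch below, but I don't know if it's right. I'm assuming Redshift will accept a postgresql+psycopg2 SQLAlchemy URI since it speaks the Postgres protocol, and that there is a numeric or datetime column Dask can split on as index_col; the credentials, port, and column name are placeholders.

import dask.dataframe as dd

# Placeholders: user, password, host, port, dbname and the index column.
uri = 'postgresql+psycopg2://abcxyz:password@abcxyz.redshift.amazonaws.com:5439/abcxyz'

ddf = dd.read_sql_table('page_views', uri,
                        schema='derived',
                        index_col='event_id',   # hypothetical indexed numeric column
                        npartitions=100)

# Partitions load lazily, so this reduction shouldn't hold all 11M rows at once.
missing = ddf.isna().sum().compute()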
Any help gratefully received.
I am moving a lot of data from local CSV files into an Azure-based SQL database.
I am using SQLAlchemy and ODBC Driver 17.
Chunk size is 5,000.
Everything is fine if I don't switch on the multi method in the final DF to_sql.
The dataframe is a 9-column dataframe read from CSV.
The error message I get when switching on the multi method is:
"('The SQL contains -20536 parameter markers, but 45000 parameters were supplied', 'HY000')"
The 45,000 is probably the 9 columns times the 5,000 chunk size, which makes sense. But why the SQL contains -20536 parameter markers is what's giving me a big headache.
Thanks so much. My code looks like:
import pyodbc
import urllib
from sqlalchemy import create_engine,Table,MetaData
import pandas as pd
from datetime import datetime
params = urllib.parse.quote_plus("DRIVER={ODBC Driver 17 for SQL Server};SERVER=tcp:oac-data1.database.windows.net,1433;DATABASE=OAC Analytics;UID=xxxxxx;PWD=xxxxx")
chunk = 5000
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
conn=engine.connect()
DF_DFS = pd.read_csv('xxxxx\Fact-Qikids-DFS.csv', header=0)
DF_DFS = DF_DFS[['Campus ID','Date','Age Group','Room #','Booking type','Absence','Attendances','Fees Charged','Version']]
DF_DFS.to_sql('QikKids-DFS-Parsed', con=conn, if_exists='append', index=False, chunksize=chunk, method='multi')
I found out that an alternative is to switch on fast_executemany.
The only thing to watch out for is that it requires the index to be on.
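For anyone else hitting this, my understanding (not fully verified) is that with method='multi' every cell becomes a bound parameter, SQL Server caps a single statement at 2,100 parameters, and the 45,000 supplied appears to overflow a signed 16-bit counter (45,000 - 65,536 = -20,536), which is where the negative number comes from. A sketch of the fast_executemany alternative, which keeps one parameter marker per column and lets pyodbc batch the rows:

# Sketch: enable pyodbc's fast_executemany on the engine instead of method='multi'.
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params,
                       fast_executemany=True)

DF_DFS.to_sql('QikKids-DFS-Parsed', con=engine, if_exists='append',
              chunksize=chunk)   # no method='multi'; index handling per the note above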
I'm trying to read a bunch of large CSV files (multiple files) from Google Storage. I use the Dask distributed library for parallel computation, but the problem I'm facing here is that even though I specify the blocksize (100 MB), I'm not sure how to read partition by partition and save each one to my Postgres database without overloading my memory.
from dask.distributed import Client
from dask.diagnostics import ProgressBar
client = Client(processes=False)
import dask.dataframe as dd
def read_csv_gcs():
    with ProgressBar():
        df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
        pd = df.compute(scheduler='threads')
        return pd
def write_df_to_db(df):
    try:
        from sqlalchemy import create_engine
        engine = create_engine('postgresql://usr:pass@localhost:5432/sampledb')
        df.to_sql('sampletable', engine, if_exists='replace', index=False)
    except Exception as e:
        print(e)
        pass
pd = read_csv_gcs()
write_df_to_db(pd)
The above code is my basic implementation, but as said I would like to read it in chunk and update the db. Something like
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
for chunk in df:
    write_it_to_db(chunk)
Is it possible to do this in Dask? Or should I go with pandas's chunksize, iterate, and then save each chunk to the DB (but then I lose the parallel computation)?
Can someone shed some light?
This line
df.compute(scheduler='threads')
says: load the data in chunks in worker threads, and concatenate them all into a single in-memory dataframe, df. This is not what you wanted. You wanted to insert the chunks as they come and then drop them from memory.
You probably wanted to use map_partitions
df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)
df.map_partitions(write_it_to_db).compute()
or use df.to_delayed().
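A sketch of the to_delayed() variant (assuming write_it_to_db takes a pandas DataFrame, as in the question): each partition becomes a delayed object and the writes only run when dask.compute is called, so no partition needs to stay in memory after it has been written.

import dask

df = dd.read_csv('gs://mybucket/renish/*.csv', blocksize=100e6)

# One delayed pandas DataFrame per partition; nothing is loaded until compute().
writes = [dask.delayed(write_it_to_db)(part) for part in df.to_delayed()]
dask.compute(*writes)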
Note that, depending on your SQL driver, you might not be able to get parallelism this way, and if not, the pandas iter-chunk method would have worked just as well.