I am trying to create a DataFrame from the data in a Redshift table.
But I am getting a "Memory error" because the data I am fetching is huge.
How can I solve this issue? (I found that chunking is one option. How do I implement chunking?) Is there any other library that is useful in such situations?
The following is example code:
import pandas as pd
import psycopg2
conn = psycopg2.connect(host=host_name, user=usr, port=pt, password=pwd, dbname=DB)
sql_query = "SELECT * FROM Table_Name"
df = pd.read_sql_query(sql_query, conn)
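One way to implement chunking, sketched under assumptions: passing chunksize to pd.read_sql_query makes it return an iterator of DataFrames instead of one large frame, so each piece can be reduced or written out before the next one is fetched. The example below uses an in-memory SQLite table as a stand-in for the Redshift connection; the table schema and the helper summarize_in_chunks are made up for illustration. Note that with psycopg2 an ordinary cursor may still transfer the whole result set to the client, so chunksize alone may not be enough there.

```python
import sqlite3

import pandas as pd


def summarize_in_chunks(conn, sql_query, chunksize):
    """Read a query result chunk by chunk and keep only running totals."""
    total_rows = 0
    total_value = 0
    # With chunksize set, read_sql_query yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_sql_query(sql_query, conn, chunksize=chunksize):
        total_rows += len(chunk)
        total_value += int(chunk["value"].sum())
    return total_rows, total_value


# Stand-in data source: an in-memory SQLite table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, value INTEGER)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)", [(i, i * 2) for i in range(1000)]
)
conn.commit()

rows, value_sum = summarize_in_chunks(conn, "SELECT * FROM table_name", chunksize=250)
print(rows, value_sum)
```

The important design choice is that no chunk is kept after it has been processed; concatenating all chunks back into one DataFrame would need as much memory as the original call.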
I tried this; the job was too large and the kernel restarted:
import pandas as pd
from pyathena import connect
from pyathena.pandas.cursor import PandasCursor
cursor = connect(s3_staging_dir='s3://db',
                 region_name='us-east-1',
                 cursor_class=PandasCursor).cursor()
df = cursor.execute("SELECT * FROM database.table").as_pandas()
print(df.shape)
I also tried this; again the job was too large and the kernel restarted:
from pyathena import connect
import pandas as pd
conn = connect(s3_staging_dir='s3://db', region_name='us-east-1')
df_isp_20220204 = pd.read_sql("SELECT * FROM database.table", conn)
The table is somewhat large: around 50 million records and 3 fields. I think the problem is that I can't load it into RAM, and it's also very slow! Is there a more efficient way to do this kind of thing?
I have my AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN. I think you need wr.redshift.copy if you use these credentials. I tried to read through some documentation, but I couldn't figure out a way to get wr.redshift.copy to read from a table and load into a dataframe.
Today I started to learn Postgres, and I was trying to do the same thing that I do to load DataFrames into my Oracle DB.
For example, I have a df that contains 70k records and 10 columns. My code for this is the following:
from sqlalchemy import create_engine
conn = create_engine('postgresql://'+data['user']+':'+data['password']+'@'+data['host']+':'+data['port_db']+'/'+data['dbname'])
df.to_sql('first_posgress', conn)
This code is much the same as what I use for my Oracle tables, but in this case it takes a long time to complete. So I was wondering whether there is a better way to do this, or whether Postgres is simply slower in general.
I found some examples on SO and Google, but they mostly focus on creating the table, not on inserting a df.
If it is possible for you to use psycopg2 instead of SQLAlchemy, you can write your df to an in-memory CSV and then use cursor.copy_from() to copy the CSV into the DB.
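import io
output = io.StringIO()
# Omit the header and index so the buffer contains only data rows
df.to_csv(output, sep=",", header=False, index=False)
output.seek(0)
# cursor is a psycopg2 cursor
cursor.copy_from(
    output,
    target_table,  # 'first_posgress'
    sep=",",
    columns=tuple(df.columns)
)
con.commit()  # con is the psycopg2 connection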
(I don't know whether SQLAlchemy has a similar function that is just as fast.)
Psycopg2 Cursor Documentation
This blogpost contains more information!
Hopefully this is useful for you !
I want to download data from SQL via Python. But instead of downloading the whole dataset, I only need specific variables.
I am restricted to using only read_sql with pyodbc.
My code is the following:
# call from SQL
import pandas as pd
import pyodbc
conn = pyodbc.connect(r"""DRIVER={SQL Server};
Server=BXTS131133.eu.rabonet.com\LWID_LAB_03;
Database=CORP_Modelling;
Trusted_connection=yes;""")
SQL1 = 'SELECT * FROM [CORP_Modelling].[LDM_Freeze_1].[JointObligorMonthly]'
Nevertheless, suppose that I want to download only a few variables/attributes. For example, from the table specified in SQL1 I only want to download:
var_to_download = ['MeasurementPeriodID', 'JointObligorID' ]
I cannot understand how I can modify the above code in order to download only these variables.
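One possible approach is to replace the * in the query with an explicit column list built from var_to_download. The sketch below assumes the column names come from a trusted source (they are interpolated into the SQL text); build_select is a hypothetical helper, not part of pandas or pyodbc.

```python
def build_select(columns, table):
    """Build a SELECT statement that fetches only the given columns."""
    # Bracket-quote each identifier (SQL Server style) and escape any closing bracket.
    quoted = ", ".join("[" + col.replace("]", "]]") + "]" for col in columns)
    return f"SELECT {quoted} FROM {table}"


var_to_download = ["MeasurementPeriodID", "JointObligorID"]
table = "[CORP_Modelling].[LDM_Freeze_1].[JointObligorMonthly]"
SQL1 = build_select(var_to_download, table)
print(SQL1)
```

The resulting string can then be passed to pandas in the usual way, for example pd.read_sql(SQL1, conn), so only the two requested columns are transferred.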
I have an SQL file stored locally on my PC. I want to open and read it using the pandas library. Here is what I have tried:
import pandas as pd
import sqlite3
my_file = r'C:\Users\me\Downloads\database.sql'
#I am creating an empty database
conn = sqlite3.connect(r'C:\Users\test\Downloads\test.db')
#I am reading my file
df = pd.read_sql(my_file, conn)
However, I am receiving the following error:
DatabaseError: Execution failed on sql 'C:\Users\me\Downloads\database.sql': near "C": syntax error
Try moving the file to the D: drive.
Sometimes Python is not granted access to read or write on C:.
That may be the issue.
You can also try an alternative method using cursors:
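cur = conn.cursor()
cur.execute("SELECT * FROM my_table")  # my_table is a placeholder table name
r = cur.fetchall()
r would then contain a list of tuples, one per row of your dataset.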
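The error message suggests that pd.read_sql tried to run the file path itself as SQL: its first argument is expected to be a query or a table name, not a file. If database.sql is a dump containing CREATE TABLE and INSERT statements in SQLite-compatible syntax (an assumption, since the file's contents are not shown), one possible approach is to execute the script first and then query the tables it creates. A minimal sketch, with a throwaway script and a hypothetical table people standing in for the real file:

```python
import os
import sqlite3
import tempfile


def load_sql_script(conn, path):
    """Execute every statement in a .sql file and return the table names now present."""
    with open(path, encoding="utf-8") as f:
        conn.executescript(f.read())
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# Throwaway script standing in for database.sql (hypothetical contents).
script = """
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO people (name) VALUES ('Ada'), ('Grace');
"""
with tempfile.TemporaryDirectory() as tmp:
    sql_path = os.path.join(tmp, "database.sql")
    with open(sql_path, "w", encoding="utf-8") as f:
        f.write(script)
    conn = sqlite3.connect(":memory:")
    tables = load_sql_script(conn, sql_path)

people = conn.execute("SELECT id, name FROM people ORDER BY id").fetchall()
print(tables, people)
```

Once the script has been loaded, a call such as pd.read_sql("SELECT * FROM people", conn) should return the table as a DataFrame. A dump written for another database engine may need its syntax adjusted before SQLite accepts it.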
I am trying to pull a 1.7 GB file into a pandas DataFrame from a Greenplum Postgres data source. The psycopg2 driver takes 8 or so minutes to load. Using the pandas "chunksize" parameter does not help, as the psycopg2 driver selects all data into memory and then hands it off to pandas, using a lot more than 2 GB of RAM.
To get around this, I'm trying to use a named cursor, but all the examples I've found then loop through row by row, and that just seems slow. But the main problem appears to be that my SQL just stops working in the named query for some unknown reason.
Goals
load the data as quickly as possible without doing any "unnatural acts"
use SQLAlchemy if possible - used for consistency
have the results in a pandas dataframe for fast in-memory processing (alternatives?)
Have a "pythonic" (elegant) solution. I'd love to do this with a context manager but haven't gotten that far yet.
# Named Cursor Chunky Access Test
import pandas as pd
import psycopg2
import psycopg2.extras
# Connect to database - works
conn_chunky = psycopg2.connect(
    database=database, user=username, password=password, host=hostname)
# Open named cursor - appears to work
cursor_chunky = conn_chunky.cursor(
    'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
cursor_chunky.itersize = 100000
# This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
result = cursor_chunky.execute(sql_query)
# result returns None (normal behavior) but result is not iterable
df = pd.DataFrame(result.fetchall())
The pandas call raises AttributeError: 'NoneType' object has no attribute 'fetchall'. The failure seems to be due to the named cursor being used. I have tried fetchone, fetchmany, etc. Note that the goal here is to let the server chunk and serve up the data in large chunks, so that there is a balance of bandwidth and CPU usage. Looping through a df = df.append(row) is just plain fugly.
See related questions (not the same issue):
Streaming data from Postgres into Python
psycopg2 leaking memory after large query
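A note on the AttributeError: in psycopg2, cursor.execute() returns None, so the fetch methods have to be called on the cursor object itself rather than on the return value. The sketch below shows chunked fetching with fetchmany on a generic DB-API cursor. It uses sqlite3 as a stand-in because it needs no server; iter_chunks is a hypothetical helper, and with a psycopg2 named cursor the same loop would be expected to pull each chunk from the server as it is requested.

```python
import sqlite3


def iter_chunks(cursor, size):
    """Yield lists of rows from an already-executed DB-API cursor, `size` rows at a time."""
    while True:
        rows = cursor.fetchmany(size)
        if not rows:
            break
        yield rows


# Stand-in database with ten rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

cursor = conn.cursor()
cursor.execute("SELECT n FROM t ORDER BY n")  # do not rely on the return value
chunk_sizes = [len(rows) for rows in iter_chunks(cursor, 4)]
print(chunk_sizes)
```

Each yielded list could be wrapped in a DataFrame and the pieces combined with a single pd.concat call at the end, which avoids appending row by row.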
Added standard client-side chunking code per request:
nrows = 3652504
size = nrows // 1000  # chunksize must be an integer
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
    if first_loop:
        df = dfx
        first_loop = False
    else:
        df = df.append(dfx, ignore_index=True)
UPDATE:
# Chunked access
import time
import pandas as pd
from sqlalchemy import create_engine
start = time.time()
engine = create_engine(conn_str)
size = 10**4
# Concatenate the chunks yielded by read_sql into a single DataFrame
df = pd.concat(pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size),
               ignore_index=True)
elapsed = time.time() - start
print('time:', elapsed / 60, 'minutes or', elapsed, 'seconds')
OLD answer:
I'd try to read the data from PostgreSQL using the built-in pandas function read_sql():
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://user@localhost:5432/dbname')
df = pd.read_sql(sql_query, engine)