The dataframe is huge (7-8 million rows). I tried to_sql with chunksize=5000, but it never finished. I am currently using:
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

# engine = create_engine(URL(...))  # connection details not shown in the question
df.to_sql(snowflake_table, engine, if_exists='replace', index=False, index_label=None, chunksize=20000)
What are other optimal solutions for storing data into SF from Pandas DF? Or what am I doing wrong here? The DF is usually of size 7-10 million rows.
The least painful way I can imagine is to dump the file to S3 and have Snowpipe load it into Snowflake automatically. With that set up you don't have to execute any copy command or make any Snowflake calls at all.
Refer to Snowflake documentation for details on how to set up Snowpipe for S3. In short you need to create a stage, a target table, a file format (I guess you already have these things in place though) and a pipe. Then set up SQS notifications for your bucket that the pipe will listen to.
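A rough sketch of that one-time setup, driven from the Python connector; every object name, the bucket URL and the credentials here are placeholders, and a storage integration is generally preferable to inline credentials:

import snowflake.connector

conn = snowflake.connector.connect(
    account='my_account', user='my_user', password='my_password',
    database='my_db', schema='my_schema', warehouse='my_wh',
)
cur = conn.cursor()

# file format matching the CSV dumps
cur.execute("CREATE FILE FORMAT IF NOT EXISTS my_csv_format TYPE = CSV SKIP_HEADER = 1")

# external stage pointing at the S3 prefix the files will land in
cur.execute("""
    CREATE STAGE IF NOT EXISTS my_s3_stage
    URL = 's3://mybucket/test/'
    CREDENTIALS = (AWS_KEY_ID = 'my_key_id' AWS_SECRET_KEY = 'my_secret_key')
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
""")

# the pipe Snowpipe runs for every new-file notification
cur.execute("""
    CREATE PIPE IF NOT EXISTS my_pipe AUTO_INGEST = TRUE AS
    COPY INTO my_table FROM @my_s3_stage
    FILE_FORMAT = (FORMAT_NAME = 'my_csv_format')
""")

# SHOW PIPES lists the notification_channel (an SQS ARN) to attach to the bucket's event notifications
conn.close()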
Snowflake suggests having files sized around 10-100 MB, so it is likely a good idea to split the file.
import numpy as np
import s3fs

# set up credentials (s3fs is built on boto, hence this is AWS specific)
fs = s3fs.S3FileSystem(key=key, secret=secret)

# number of files to split into
n_chunks = 2

# loop over the dataframe and dump it chunk by chunk to S3
# (you likely want to expand the file naming logic to avoid overwriting existing files)
for f_name, chunks in enumerate(np.array_split(np.arange(df.shape[0]), n_chunks)):
    bytes_to_write = df.iloc[chunks].to_csv(index=False).encode()
    with fs.open('s3://mybucket/test/dummy_{}.csv'.format(f_name), 'wb') as f:
        f.write(bytes_to_write)
For reference, I tried this with a 7M-row dataframe split into 5 files of around 40 MB each. It took around 3 minutes and 40 seconds from starting to split the dataframe until all rows had arrived in Snowflake.
The optimal way, as ilja-everila pointed out, is "COPY INTO ...". Since Snowflake requires the CSV to be staged in cloud storage before loading, I was hesitant to do it, but it seems like that is the only option, given that the performance is 5-10 minutes for 6.5 million records.
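For reference, a rough sketch of that PUT-then-COPY flow driven from the Python connector; the table name, file path and connection details are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    account='my_account', user='my_user', password='my_password',
    database='my_db', schema='my_schema', warehouse='my_wh',
)
cur = conn.cursor()

# upload the local CSV chunks to the table's internal stage
cur.execute("PUT file:///tmp/chunks/dummy_*.csv @%MY_TABLE AUTO_COMPRESS = TRUE")

# bulk load everything that was staged
cur.execute("""
    COPY INTO MY_TABLE
    FROM @%MY_TABLE
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")
conn.close()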
If you are using SQLAlchemy, could you also add paramstyle=qmark in the connection parameters so that data is bound server-side? This is also referenced here: https://github.com/snowflakedb/snowflake-connector-python/issues/37#issuecomment-365503841
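For example (a sketch; paramstyle is a connection parameter of the Python connector, and passing it through connect_args is the usual SQLAlchemy route, although this may vary with the connector version):

from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

engine = create_engine(
    URL(account='my_account', user='my_user', password='my_password',
        database='my_db', schema='my_schema', warehouse='my_wh'),
    connect_args={'paramstyle': 'qmark'},  # server-side binding instead of the default pyformat
)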
After this change, if you feel it is appropriate, it may be a good idea to do a performance comparison between the SQLAlchemy approach and the bulk-load approach of writing the large DF to files and using COPY INTO to load the files into the Snowflake table.
pandas does an 'insert into ...' with multiple values behind the scenes. Snowflake has a restriction of up to 16,384 records per ingestion statement. Please change your chunksize to 16384.
Snowflake provides the write_pandas and pd_writer helper functions to manage that:
from snowflake.connector.pandas_tools import pd_writer
df.to_sql(snowflake_table, engine, index=False, method=pd_writer)
# ^ here
The pd_writer() function uses write_pandas():
write_pandas(): Writes a Pandas DataFrame to a table in a Snowflake database
To write the data to the table, the function saves the data to Parquet files, uses the PUT command to upload these files to a temporary stage, and uses the COPY INTO command to copy the data from the files to the table.
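A minimal sketch of calling write_pandas() directly on a connector connection; the connection details and table name are placeholders, and depending on the connector version the target table may need to exist already:

import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(
    account='my_account', user='my_user', password='my_password',
    database='my_db', schema='my_schema', warehouse='my_wh',
)
# returns whether the load succeeded plus the chunk and row counts
success, n_chunks, n_rows, _ = write_pandas(conn, df, 'MY_TABLE')
print(success, n_chunks, n_rows)
conn.close()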
Related
I am currently in the process of getting the data from my stakeholder, who has a database from which he is going to extract it as a CSV file.
From there he will upload it to a shared drive, and I will download the data and use it locally as the source to import into a pandas dataframe.
The approximate size will be 40 million rows. I was wondering whether the data can be exported as a single CSV file from the SQL database and whether that CSV can be used as a source for a pandas dataframe, or whether it should be in chunks, as I am not sure what the row limitation of a CSV file is.
I don't think RAM and processing should be an issue at this time.
Your help is much appreciated. Cheers!
If you can't connect directly to the database, you might need the .db file. I'm not sure a csv will even be able to handle more than a million or so rows.
"as I am not sure what the row limitation of csv file is"
There is no such limit inherent to the CSV format, if you understand CSV as the format defined by RFC 4180, which stipulates that a CSV file is
file = [header CRLF] record *(CRLF record) [CRLF]
where [...] denotes an optional part, CRLF denotes a carriage return followed by a line feed (\r\n), and *(...) denotes a part repeated zero or more times.
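In practice the limit is your memory rather than the file format. If the 40-million-row export is too big to load comfortably in one go, pandas can read it in pieces; a minimal sketch (the file name and chunk size are placeholders):

import pandas as pd

parts = []
for chunk in pd.read_csv('export.csv', chunksize=1_000_000):
    # filter or aggregate each piece here so only the reduced result is kept in memory
    parts.append(chunk)
df = pd.concat(parts, ignore_index=True)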
I currently store the data I scraped from a website in one .csv file per product. Since it is quite a popular website, I ended up with more than 30,000 CSV files that I need to merge into one. I'm not really an expert in pandas, but my first reaction was to rely on the concat() function. That is, my code looks like this:
import pandas as pd

df = pd.DataFrame(columns=["product_id", "price"])
for file in onlyfiles:
    df1 = pd.read_csv(file)
    df = pd.concat([df, df1])
where onlyfiles represents the list of files in the directory where all my CSVs are stored. It works, but it starts to slow down as the number of dataframes increases, and it is obviously not the most efficient way to achieve this goal. Does anybody have an idea of a more efficient method to use here?
Thank you for your help.
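As an aside, the usual fix for this particular slowdown is to stop growing the dataframe inside the loop: read each file into a list and concatenate once at the end. A minimal sketch, assuming onlyfiles is the list of CSV paths:

import pandas as pd

frames = [pd.read_csv(file) for file in onlyfiles]
df = pd.concat(frames, ignore_index=True)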
You need to start storing your data in an SQL database; CSV files are not databases.
You might want to look into Postgresql as SQLite may not have all of the features you need. You should be able to set up SQL code that dumps data into a single database from a CSV file. I have an automated process that pulls CSV data into a database, regularly.
You can interact with Postgres with the Psycopg2 library in python. Another thing you may want to consider is using Pandasql, which allows you to manipulate your Pandas data frames with SQL code. I always import Pandasql when working with Pandas dataframes.
Here is an example of my Postgres CSV file data import:
--Data Import Query
COPY stock_data(date, ticker, industry, open, high, low, close, adj_close, volume, dor)
FROM 'C:\Users\storageplace\Desktop\username\company_data\stock_data\stockdata.csv'
DELIMITER ','
CSV HEADER;
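If you would rather drive the same COPY from Python, psycopg2's copy_expert can stream the file for you; a rough sketch with placeholder connection details:

import psycopg2

conn = psycopg2.connect(host='localhost', dbname='mydb', user='me', password='secret')
with conn, conn.cursor() as cur, open('stockdata.csv') as f:
    # COPY ... FROM STDIN streams the file through the client connection
    cur.copy_expert(
        "COPY stock_data(date, ticker, industry, open, high, low, close, adj_close, volume, dor) "
        "FROM STDIN WITH (FORMAT csv, HEADER true)",
        f,
    )
conn.close()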
I am currently using SQLAlchemy to write a pandas dataframe to a PostgreSQL database on an AWS server. My code looks like this:
engine = create_engine(
    'postgresql://{}:{}@{}:{}/{}'.format(ModelData.user, ModelData.password, ModelData.host,
                                         ModelData.port, ModelData.database), echo=True)
with open(file, 'rb') as f:
    df = pickle.load(f)
df.to_sql(table_name, engine, method='multi', if_exists='replace', index=False, chunksize=1000)
The table I am writing has about 900,000 rows and 500 columns. It takes quite a long time to complete; sometimes I will wait all day and it still will not be done. Is there a faster way to write this data? To reiterate, this post is about speed, not about getting the code to execute. Any help would be appreciated!
Note: The machine I'm using has 32 GB of RAM, i7 processor, 1 TB storage, and a GPU so I don't think it's my machine.
Have you played with the chunksize parameter?
With a dataset that large you may need to write it out to a tab-delimited file, transfer it to your EC2 instance, and then use the \copy command in psql to get it done in a reasonable amount of time.
Since it is RDS instead of EC2, I checked, and it looks like RDS PostgreSQL supports COPY from S3: https://www.postgresql.org/docs/current/sql-copy.html
This will mean some work setting up the table yourself, but the time savings may make it worth it. If you can make a small subset of your DF and have DataFrame.to_sql() use it to create the table for you, you can then copy your tab-separated values file to S3 and COPY into it using the utility.
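A rough sketch of that split of work, reusing df, engine and table_name from the question and assuming a psycopg2-backed engine and a tab-separated dump (the file name is a placeholder):

# 1) let pandas create the (empty) table with the right columns and dtypes
df.head(0).to_sql(table_name, engine, if_exists='replace', index=False)

# 2) stream the full tab-separated dump with COPY instead of millions of INSERTs
raw = engine.raw_connection()
try:
    with raw.cursor() as cur, open('data.tsv') as f:
        cur.copy_expert(
            "COPY {} FROM STDIN WITH (FORMAT csv, DELIMITER E'\\t', HEADER true)".format(table_name),
            f,
        )
    raw.commit()
finally:
    raw.close()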
I have fetched data from a CSV file, and it is held and manipulated in my Dask dataframe. From there I need to write the data into a database table. I have not really come across any solutions for this. Pandas has built-in functionality for this with its to_sql function, so I am unsure whether I need to convert to Pandas first. I currently think that converting the Dask dataframe to Pandas will cause it to be loaded fully into memory, which may defeat the purpose of using Dask in the first place.
What would be the best and fastest approach to write a Dask dataframe to a database table?
Assuming you have your Dask dataframe as df, you just need to do this:
df.to_sql(table, schema=schema, uri=conn_str, if_exists="append", index=False)
I've found this is easily the quickest method for Dask dataframes.
I have no problem with @kfk's answer, as I also investigated that, but my solution was as follows.
I drop the Dask dataframe to a CSV, and from there pick the CSV up with a Golang application that shoves the data into Mongo using multi-threading. For 4.5 million rows, the speed went from 38 minutes using "load local infile" to 2 minutes using the multi-threaded app.
pandas.to_sql() is not the fastest way to load data into a database. to_sql() pushes the rows through the database driver as INSERT statements, which is a lot slower than the database's built-in bulk-load method.
You can load data from a csv file in MySQL like this:
LOAD DATA INFILE 'some_file.csv'
INTO TABLE some_mysql_table
FIELDS TERMINATED BY ';'
So what I would do is this:
import dask.dataframe as dd
from sqlalchemy import create_engine
# 1) create a single csv file from the dask dataframe
df = dd.read_csv('2014-*.csv')
df.to_csv("some_file.csv", single_file=True)  # without single_file=True, dask writes one file per partition

# 2) load the file with the database's bulk loader
sql = """LOAD DATA INFILE 'some_file.csv'
         INTO TABLE some_mysql_table
         FIELDS TERMINATED BY ';'"""
engine = create_engine("mysql://user:password@server")
engine.execute(sql)
You can easily wrap the above into a function and use it instead of to_sql.
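For example, a sketch of such a wrapper (placeholder credentials; LOAD DATA INFILE needs the file to be readable by the MySQL server itself, otherwise LOAD DATA LOCAL INFILE with local_infile enabled is the alternative):

from sqlalchemy import create_engine, text

def bulk_load_csv(engine, csv_path, table, sep=','):
    # build the bulk-load statement and run it in a single transaction
    sql = text(
        "LOAD DATA INFILE '{}' INTO TABLE {} FIELDS TERMINATED BY '{}'".format(csv_path, table, sep)
    )
    with engine.begin() as conn:
        conn.execute(sql)

engine = create_engine("mysql://user:password@server/dbname")
bulk_load_csv(engine, "some_file.csv", "some_mysql_table", sep=";")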
I want to append about 700 million rows and 2 columns to a database. Using the code below:
import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
index_start = 1
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize, header=None,
                      names=['screen', 'user'], sep='\t', iterator=True, encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1
    count = j * chunksize
    print(count)
    print(j)
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large data sets and it only takes 1 minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
#load data(600 million rows * 2 columns) into database
#def count(screen):
#return count of distinct list of users for a given set of screens
Essentially, I am returning the number of screens for a given set of users. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is so much faster?
If your goal is to import data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
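For example, for the distinct-user count per set of screens sketched in the question's pseudocode, something like this (a sketch; it assumes the imported table is called screen-user with columns screen and user):

import sqlite3

conn = sqlite3.connect('screen-user.db')
# index the lookup column once so the counts below do not scan the whole table
conn.execute('CREATE INDEX IF NOT EXISTS idx_screen ON "screen-user"(screen)')
conn.commit()

def count_users(screens):
    # count distinct users for a given set of screens
    placeholders = ','.join('?' * len(screens))
    sql = 'SELECT COUNT(DISTINCT user) FROM "screen-user" WHERE screen IN ({})'.format(placeholders)
    return conn.execute(sql, screens).fetchone()[0]

print(count_users(['screen_a', 'screen_b']))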
As @John Zwinck has already said, you should probably use the native RDBMS tools for loading that amount of data.
First of all, I think SQLite is not a proper tool/DB for 700 million rows, especially if you want to join/merge the data afterwards.
Depending on what kind of processing you want to do with your data after loading, I would either use free MySQL or, if you can afford a cluster, Apache Spark SQL and parallelize the processing of your data across multiple cluster nodes.
For loading your data into a MySQL DB you can and should use the native LOAD DATA tool.
Here is a great article showing how to optimize the data load process for MySQL (for different MySQL versions, MySQL options, and MySQL storage engines: MyISAM and InnoDB, etc.).
Conclusion: use the native DB tools for loading large amounts of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and you want to process (join/merge/filter/etc.) your data after loading.