How do I get a Dask dataframe into a MySQL table? - python

I have fetched data from a CSV file, and it is held and manipulated in my Dask dataframe. From there I need to write the data into a database table. I have not really come across any solutions for this. Pandas has built-in functionality for this with its to_sql function, so I am unsure whether I need to convert to Pandas first. I currently think that converting the Dask dataframe to Pandas will cause it to be loaded fully into memory, which may defeat the purpose of using Dask in the first place.
What would be the best and fastest approach to write a Dask dataframe to a database table?

Assuming you have a Dask dataframe as df, you just need to do this:
df.to_sql(table, schema=schema, uri=conn_str, if_exists="append", index=False)
I've found this is easily the quickest method for Dask dataframes.

I have no problem with @kfk's answer, as I also investigated that, but my solution was as follows.
I dump the Dask dataframe to a CSV, and from there pick the CSV up with a Golang application that shoves the data into Mongo using multi-threading. For 4.5 million rows, the speed went from 38 minutes using "load local infile" to 2 minutes using the multi-threaded app.

pandas.to_sql() is not the fastest way to load data into a database. to_sql() inserts rows through the database driver connection, which is a lot slower than the built-in bulk load method.
You can load data from a CSV file into MySQL like this:
LOAD DATA INFILE 'some_file.csv'
INTO TABLE some_mysql_table
FIELDS TERMINATED BY ';'
So what I would do is this:
import dask.dataframe as dd
from sqlalchemy import create_engine
# 1) write a single csv file
df = dd.read_csv('2014-*.csv')
df.to_csv("some_file.csv", single_file=True)
# 2) bulk-load the file
sql = """LOAD DATA INFILE 'some_file.csv'
INTO TABLE some_mysql_table
FIELDS TERMINATED BY ','"""
engine = create_engine("mysql://user:password@server")
engine.execute(sql)
You can easily wrap the above into a function and use it instead of to_sql.
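For instance, a minimal sketch of such a wrapper (the connection string, file path and table name are placeholders; the MySQL server is assumed to be able to read the CSV file from that path):
import dask.dataframe as dd
from sqlalchemy import create_engine, text

def dask_to_mysql(df, csv_path, table, conn_str):
    # 1) dump the Dask dataframe to a single CSV file
    df.to_csv(csv_path, single_file=True, index=False)
    # 2) let MySQL bulk-load it, which avoids row-by-row inserts
    sql = text(
        f"LOAD DATA INFILE '{csv_path}' "
        f"INTO TABLE {table} "
        "FIELDS TERMINATED BY ',' "
        "IGNORE 1 LINES"
    )
    engine = create_engine(conn_str)
    with engine.begin() as conn:
        conn.execute(sql)

# usage (placeholders):
# dask_to_mysql(dd.read_csv('2014-*.csv'), '/tmp/some_file.csv', 'some_mysql_table', 'mysql://user:password@server/db')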

Related

What is the most efficient way to concatenate thousands of dataframes in Python?

I currently store some data that I have scraped from a website into a .csv file for each product of the website. Since it is quite a popular website, I obtained more than 30,000 CSVs that I need to merge into one. I'm not really an expert in pandas, but my first reaction was to rely on the concat() function. That is, my code looks like this:
df = pd.DataFrame(columns=["product_id", "price"])
for file in onlyfiles:
    df1 = pd.read_csv(file)
    df = pd.concat([df, df1])
where onlyfiles holds the CSV files in the directory where all my dataframes are stored. It works, but it slows down more and more as the number of dataframes increases, and it is obviously not the most efficient way to achieve this goal. Does anybody have an idea of a more efficient method to use here?
Thank you for your help.
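For what it's worth, a common way to avoid the quadratic cost of growing the dataframe inside the loop is to collect the pieces in a list and concatenate once at the end; a minimal sketch, assuming onlyfiles is the list of CSV paths:
import pandas as pd

# read each CSV once and keep the pieces in a list
frames = [pd.read_csv(file) for file in onlyfiles]

# a single concat avoids re-copying the accumulated data on every iteration
df = pd.concat(frames, ignore_index=True)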
You need to start storing your data in an SQL database; CSV files are not databases.
You might want to look into PostgreSQL, as SQLite may not have all of the features you need. You should be able to set up SQL code that dumps data from a CSV file into a single database. I have an automated process that regularly pulls CSV data into a database.
You can interact with Postgres through the psycopg2 library in Python. Another thing you may want to consider is using pandasql, which allows you to manipulate your Pandas dataframes with SQL code. I always import pandasql when working with Pandas dataframes.
Here is an example of my Postgres CSV file data import:
--Data Import Query
COPY stock_data(date, ticker, industry, open, high, low, close, adj_close, volume, dor)
FROM 'C:\Users\storageplace\Desktop\username\company_data\stock_data\stockdata.csv'
DELIMITER ','
CSV HEADER;
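If you prefer to drive that import from Python, a rough sketch using psycopg2's copy_expert (connection details and the file path are placeholders):
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="mydb", user="user", password="password")
with conn, conn.cursor() as cur, open("stockdata.csv") as f:
    # COPY ... FROM STDIN streams the file through the client,
    # so it also works when the CSV is not on the database server itself
    cur.copy_expert(
        "COPY stock_data(date, ticker, industry, open, high, low, close, adj_close, volume, dor) "
        "FROM STDIN WITH (FORMAT csv, HEADER true, DELIMITER ',')",
        f,
    )
conn.close()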

How to efficiently load mixed-type pandas DataFrame into an Oracle DB

Happy new year everyone!
I'm currently struggling with ETL performance issues as I'm trying to write larger Pandas DataFrames (1-2 million rows, 150 columns) into an Oracle database. Even for just 1000 rows, Pandas' default to_sql() method runs well over 2 minutes (see code snippet below).
My strong hypothesis is that these performance issues are in some way related to the underlying data types (mostly strings). I ran the same job on 1000 rows of random strings (benchmark: 3 min) and 1000 rows of large random floats (benchmark: 15 seconds).
def _save(self, data: pd.DataFrame):
    engine = sqlalchemy.create_engine(self._load_args['con'])
    table_name = self._load_args["table_name"]
    if self._load_args.get("schema", None) is not None:
        table_name = self._load_args['schema'] + "." + table_name
    with engine.connect() as conn:
        data.to_sql(
            name=table_name,
            con=conn,
            if_exists='replace',
            index=False,
            method=None,  # the Oracle dialect does not support multi-row inserts (method='multi')
        )
    return
Does anyone here have experience in efficiently loading mixed data into an Oracle database using Python?
Any hints, code snippets and/or API recommendations are very much appreciated.
Cheers,
As said in your question, you are not able to use method='multi' with your DB flavor. This is the key reason inserts are so slow, as the data goes in row by row.
Using SQL*Loader as suggested by @GordThompson may be the fastest route for a relatively wide/big table. Example on setting up SQL*Loader.
Another option to consider is cx_Oracle. See Speed up to_sql() when writing Pandas DataFrame to Oracle database using SqlAlchemy and cx_Oracle
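As a rough illustration of the cx_Oracle route (table name, credentials and DSN are placeholders, and the dataframe is assumed to be free of NaNs):
import cx_Oracle
import pandas as pd

def bulk_insert(df: pd.DataFrame, table: str, user: str, password: str, dsn: str) -> None:
    cols = ", ".join(df.columns)
    binds = ", ".join(f":{i + 1}" for i in range(len(df.columns)))
    sql = f"INSERT INTO {table} ({cols}) VALUES ({binds})"
    rows = list(df.itertuples(index=False, name=None))
    with cx_Oracle.connect(user=user, password=password, dsn=dsn) as conn:
        cur = conn.cursor()
        # executemany sends the whole batch in far fewer round trips than row-by-row inserts
        cur.executemany(sql, rows)
        conn.commit()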

Optimal way to store data from Pandas to Snowflake

The dataframe is huge (7-8 million rows). Tried to_sql with chunksize = 5000 but it never finished.
Using,
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
df.to_sql(snowflake_table , engine, if_exists='replace', index=False, index_label=None, chunksize=20000)
What are other optimal solutions for storing data into SF from Pandas DF? Or what am I doing wrong here? The DF is usually of size 7-10 million rows.
The least painful way I can imagine is to dump the file to S3 and have Snowpipe load it into Snowflake automatically. With that set up you don't have to execute any copy command or make any Snowflake calls at all.
Refer to Snowflake documentation for details on how to set up Snowpipe for S3. In short you need to create a stage, a target table, a file format (I guess you already have these things in place though) and a pipe. Then set up SQS notifications for your bucket that the pipe will listen to.
Snowflake suggests having files sized around 10-100 MB, so it is likely a good idea to split the file.
import numpy as np
import s3fs

# set up credentials (s3fs is built on boto, hence this is AWS specific)
fs = s3fs.S3FileSystem(key=key, secret=secret)

# number of files to split into
n_chunks = 2

# loop over the dataframe and dump it chunk by chunk to S3
# (you likely want to expand the file naming logic to avoid overwriting existing files)
for f_name, chunks in enumerate(np.array_split(np.arange(df.shape[0]), n_chunks)):
    bytes_to_write = df.iloc[chunks].to_csv(index=False).encode()
    with fs.open('s3://mybucket/test/dummy_{}.csv'.format(f_name), 'wb') as f:
        f.write(bytes_to_write)
For reference, I tried this with a 7M-row dataframe split into 5 files of around 40 MB each. It took around 3 minutes and 40 seconds from starting to split the dataframe until all rows had arrived in Snowflake.
The optimal way, as ilja-everila pointed out, is “COPY INTO ...”. Since Snowflake requires the CSV to be staged in cloud storage before the transformation, I was hesitant to do it, but it seems like that is the only option, given that the performance is 5-10 minutes for 6.5 million records.
When using SQLAlchemy, could you also add, in the connection parameters, paramstyle=qmark, which binds the data. This is also referenced here: https://github.com/snowflakedb/snowflake-connector-python/issues/37#issuecomment-365503841
After this change, if you feel it is appropriate, it may be a good idea to do a performance comparison between the SQLAlchemy approach and the bulk-load approach of writing the large DF to files and using COPY INTO to load the files into the Snowflake table.
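A rough sketch of that bulk-load path with the plain Snowflake connector (account, credentials, the local file path and the table name are all placeholders):
import snowflake.connector

conn = snowflake.connector.connect(account="my_account", user="user", password="password",
                                   warehouse="my_wh", database="my_db", schema="my_schema")
cur = conn.cursor()
# upload the local CSV to the table's internal stage, then bulk-copy it into the table
cur.execute("PUT file:///tmp/my_df.csv @%MY_TABLE AUTO_COMPRESS=TRUE")
cur.execute("COPY INTO MY_TABLE FROM @%MY_TABLE FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)")
conn.close()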
pandas does an 'insert into ...' with multiple values behind the scenes. Snowflake has a restriction of up to 16,384 records per ingestion. Please change your chunksize to 16384.
Snowflake provides the write_pandas and pd_writer helper functions to manage that:
from snowflake.connector.pandas_tools import pd_writer
df.to_sql(snowflake_table, engine, index=False, method=pd_writer)
# ^ here
The pd_writer() function uses write_pandas():
write_pandas(): Writes a Pandas DataFrame to a table in a Snowflake database
To write the data to the table, the function saves the data to Parquet files, uses the PUT command to upload these files to a temporary stage, and uses the COPY INTO command to copy the data from the files to the table.
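A minimal sketch of using write_pandas directly with the connector (connection parameters and the table name are placeholders; the target table is assumed to exist already):
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

conn = snowflake.connector.connect(account="my_account", user="user", password="password",
                                   warehouse="my_wh", database="my_db", schema="my_schema")
# write_pandas handles the Parquet dump, PUT and COPY INTO steps for you
success, num_chunks, num_rows, _ = write_pandas(conn, df, "SNOWFLAKE_TABLE")
print(success, num_chunks, num_rows)
conn.close()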

what's the best way to clean CSV and load to mysql

Please let me know what's the best way to clean a CSV and load it to MySQL.
I am working on loading a couple of different CSVs into a MySQL database, but the CSVs have some anomalies.
Note: I am using pandas read_csv to load into a df and to_sql to load into MySQL.
I am trying to remove characters like '$' and ',' from the CSV.
I get the data into a dataframe with pd.read_csv, and within the dataframe doing df[col].replace('$','') does not work on some values; I am unable to find out why. There is no error as such, it simply does not remove these characters.
Also, the intention is to remove these special characters so accurate data types can be found using the SQLAlchemy function below.
for col in df.columns:
    df[col] = df[col].replace('$', '')
    df[col] = df[col].replace(',', '')
For finding the datatypes I am using SQLAlchemy, as per below:
pandas to_sql all columns as nvarchar
For a string column, you should use .str (and pass regex=False so the '$' is treated literally rather than as a regex anchor).
Try:
df[col] = df[col].str.replace('$', '', regex=False)
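A small sketch putting that together for every text column, with regex disabled so '$' is treated literally (the numeric-conversion step is an assumption about this data):
import pandas as pd

for col in df.select_dtypes(include="object").columns:
    # strip currency symbols and thousands separators as literal characters, not regex
    df[col] = df[col].str.replace("$", "", regex=False).str.replace(",", "", regex=False)
    # only switch to a numeric dtype if every non-null value parses cleanly
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().sum() == df[col].notna().sum():
        df[col] = converted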

How to reduce time it takes to append to SQL database in python

I want to append about 700 million rows and 2 columns to a database, using the code below:
import pandas as pd
from sqlalchemy import create_engine

disk_engine = create_engine('sqlite:///screen-user.db')
chunksize = 1000000
j = 0
index_start = 1
for df in pd.read_csv('C:/Users/xxx/Desktop/jjj.tsv', chunksize=chunksize, header=None,
                      names=['screen', 'user'], sep='\t', iterator=True, encoding='utf-8'):
    df.to_sql('data', disk_engine, if_exists='append')
    j += 1
    count = j * chunksize
    print(count)
    print(j)
It is taking a really long time (I estimate it would take days). Is there a more efficient way to do this? In R, I have been using the data.table package to load large data sets and it only takes 1 minute. Is there a similar package in Python? As a tangential point, I also want to physically store this file on my Desktop. Right now, I am assuming 'data' is being stored as a temporary file. How would I do this?
Also assuming I load the data into a database, I want the queries to execute in a minute or less. Here is some pseudocode of what I want to do using Python + SQL:
#load data(600 million rows * 2 columns) into database
#def count(screen):
#return count of distinct list of users for a given set of screens
Essentially, I am returning the number of screens for a given set of users. Is the data too big for this task? I also want to merge this table with another table. Is there a reason why the fread function in R is much faster?
If your goal is to import data from your TSV file into SQLite, you should try the native import functionality in SQLite itself. Just open the sqlite console program and do something like this:
sqlite> .separator "\t"
sqlite> .import C:/Users/xxx/Desktop/jjj.tsv screen-user
Don't forget to build appropriate indexes before doing any queries.
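For example, a hedged sketch of adding such indexes from Python once the import is done (the column names screen and user follow the question and may need adjusting to match the imported table):
import sqlite3

conn = sqlite3.connect("screen-user.db")
# index the columns used for filtering/grouping before running the distinct-count queries
conn.execute('CREATE INDEX IF NOT EXISTS idx_screen ON "screen-user"(screen)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_user ON "screen-user"(user)')
conn.commit()
conn.close()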
As @John Zwinck has already said, you should probably use the native RDBMS tools for loading such an amount of data.
First of all, I think SQLite is not the proper tool/DB for 700 million rows, especially if you want to join/merge this data afterwards.
Depending on what kind of processing you want to do with your data after loading, I would either use the free MySQL or, if you can afford having a cluster, Apache Spark SQL, and parallelize the processing of your data on multiple cluster nodes.
For loading your data into a MySQL DB you can and should use the native LOAD DATA tool.
Here is a great article showing how to optimize the data load process for MySQL (for different MySQL versions, MySQL options, and MySQL storage engines: MyISAM and InnoDB, etc.).
Conclusion: use the native DB tools for loading large amounts of CSV/TSV data efficiently instead of pandas, especially if your data doesn't fit into memory and if you want to process (join/merge/filter/etc.) your data after loading.
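For the TSV in this question, a hedged sketch of what that could look like from Python (the connection string, table and column names are placeholders; LOAD DATA LOCAL requires local_infile to be enabled on both client and server):
from sqlalchemy import create_engine, text

engine = create_engine("mysql://user:password@server/mydb", connect_args={"local_infile": 1})
sql = text(
    "LOAD DATA LOCAL INFILE 'C:/Users/xxx/Desktop/jjj.tsv' "
    "INTO TABLE screen_user "
    "FIELDS TERMINATED BY '\\t' "
    "(screen, user)"
)
with engine.begin() as conn:
    conn.execute(sql)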
