I have multiple .csv.gz files (each larger than 10 GB) that need to be parsed; several input rows are combined to produce one row for insertion. The approach I'm taking is as follows:
read .csv.gz file
save soon-to-be-inserted rows into a buffer
if there is enough data in the buffer, perform multirow insertion to database table
Snowflake limits the maximum number of expressions in a list to 16,384. I've been running this for about a day, but the insertion speed is very slow. I am using SQLAlchemy to do this:
url = "snowflake://<my snowflake url>"
engine = create_engine(url)
savedvalues = []
with pd.read_csv(datapath, header=0, chunksize=10**6) as reader:
for chunk in reader:
for index, row in chunk.iterrows():
"""
<parsing data>
"""
savedvalues.append(<parsed values>)
if(len(savedvalues) > 16384):
stmt = mytable.insert().values(savedvalues)
with engine.connect() as conn:
conn.execute(stmt)
savedvalues = []
Is there a faster way to insert data into Snowflake database tables?
I'm looking into the COPY INTO <table> operation, but I'm not sure whether it is truly faster than what I'm doing right now.
Any suggestions would be much appreciated!
Here is an article describing a Python multithreaded approach to bulk loading into Snowflake: Zero to Snowflake: Multi-Threaded Bulk Loading with Python. Also note that, to optimize the number of parallel operations for a load, Snowflake recommends compressed data files roughly 100-250 MB (or larger) in size.
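To give an idea of what that looks like, here is a minimal sketch of the PUT + COPY INTO route using the snowflake-connector-python package. The connection parameters, the parsed_chunks/ directory of pre-parsed compressed CSVs, and the table name mytable are placeholders, not details from the question:

import glob
import snowflake.connector

# Assumed connection details -- replace with your own account/warehouse/db.
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()

# Upload the pre-parsed, gzipped CSV chunks to the table's internal stage.
for path in glob.glob("parsed_chunks/*.csv.gz"):
    cur.execute(f"PUT file://{path} @%mytable AUTO_COMPRESS=FALSE")

# Load everything from the stage in one server-side, parallel operation.
cur.execute("""
    COPY INTO mytable
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
""")
conn.close()

The per-file PUT uploads are what the linked article parallelizes across threads; COPY INTO itself runs in parallel on the warehouse side, which is why the 100-250 MB file-size guidance matters.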
There are many ways to export MySQL tables to CSV using Python, but I want to know the best way of doing it.
Currently I am exporting around 50 tables to CSV, which takes around 47 minutes and uses more than 16 GB of memory.
The code is:
import os
import urllib.parse

import pandas as pd
from sqlalchemy import create_engine
from tqdm import tqdm

sqlEngine = create_engine(
    f'mysql+pymysql://{MYSQL_READER_USERNAME}:%s@{MYSQL_READER_HOST}/{MYSQL_DB_NAME}'
    % urllib.parse.quote(f'{MYSQL_READER_PASSWORD}'),
    pool_recycle=3600)

def export_table(name, download_location):
    table = pd.read_sql(f'select /*+ MAX_EXECUTION_TIME(100000000) */ * from {name}', sqlEngine)
    table.to_csv(os.path.join(download_location, name + '.csv'), index=False)

tables = ['table1', ... , 'table50']

for table in tqdm(tables):
    print(f'\t => \t Storing {table}')
    export_table(table, store_dir)
I have seen several methods to write to CSV, such as:
using a cursor
using the pyodbc library
pandas' read_sql method
Is there any other method or technique, and which one is best for reducing memory usage or execution time?
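For reference, here is a chunked variant I'm considering (the chunk size is arbitrary, and stream_results support for the pymysql driver is an assumption); it should cap memory per table, though I don't know if it's the fastest option:

def export_table_chunked(name, download_location, chunk_rows=100_000):
    out_path = os.path.join(download_location, name + '.csv')
    first = True
    # stream_results asks SQLAlchemy/pymysql for a server-side cursor so the
    # whole result set is not buffered in memory at once
    with sqlEngine.connect().execution_options(stream_results=True) as connection:
        for chunk in pd.read_sql(f'select * from {name}', connection, chunksize=chunk_rows):
            chunk.to_csv(out_path, index=False, header=first, mode='w' if first else 'a')
            first = False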
I'm currently building an ETL on a Google Cloud based VM (Windows Server 2019 - 4 vCPUs) to execute the following process:
Extract some tables from a MySQL replica db
Adjust data types for Google BigQuery conformities
Upload the data to BigQuery using Python's pandas_gbq library
To illustrate, here are some parts of the actual code (Python, iterator over one table):
while True:
    # GENERATES A MYSQL QUERY BASED ON THE COLUMNS AND THEIR
    # RESPECTIVE TYPES, USING A DICTIONARY TO CONVERT
    # MYSQL D_TYPES TO PYTHON D_TYPES
    sql_query = gen_query(cols_dict=col_types, table=table,
                          pr_key=p_key, offset=offset)

    cursor = cnx.cursor(buffered=True)
    cursor.execute(sql_query)

    if cursor.rowcount == 0:
        break

    num_fields = len(cursor.description)
    field_names = [i[0] for i in cursor.description]
    records = cursor.fetchall()

    df = pd.DataFrame(records, columns=field_names)
    offset += len(df.index)
    print('Ok, df structured')

    # CHECK FOR DATETIME COLUMNS
    col_parse_date = []
    for column in field_names:
        if col_types[column] == 'datetime64':
            try:
                df[column] = df[column].astype(col_types[column])
                col_parse_date.append(column)
            except (ValueError, TypeError):
                df[column] = df[column].astype(str)
                for i in to_bgq:
                    if i['name'] == column:
                        i['type'] = 'STRING'

    # UPLOAD DATAFRAME TO GOOGLE BIGQUERY
    df.to_csv('carga_etl.csv', float_format='%.2f',
              index=False, sep='|')
    print('Ok, csv recorded')

    df = ''
    df = pd.read_csv('carga_etl.csv', sep='|')
    print('Ok, csv read')

    df.to_gbq(destination_table='tr.{}'.format(table),
              project_id='iugu-bi', if_exists='append', table_schema=to_bgq)
The logic is based on a query generator: it reads the MySQL table schema and adjusts it to BigQuery formats (e.g. BLOB to STRING, int(n) to INTEGER, etc.), queries the full results (paginated with an OFFSET, 500K rows per page), and saves each page in a dataframe before uploading it to the new database.
Well, the ETL does its job, and I'm currently migrating my tables to the cloud. However, I'm worried that I'm underutilizing my resources, given the gaps in network traffic. Here is the network report (bytes/sec) from my VM's reporting section:
[Figure: VM network bytes report]
According to that report, my in/out network traffic peaks at 2-3 MB/s, which is really low compared to the roughly 1 GB/s available if I use the machine to download something from my browser, for example.
My point is, what am I doing wrong here? Is there any way to increase my MySQL query/fetch speed and my upload speed to BigQuery?
I understand that you are transforming datetime64 into a compatible BigQuery data type; correct me if I am wrong.
I have a few recommendations:
You can use Dataflow, as it is an ETL product optimized for performance.
Depending on your overall use case, and if you are using Cloud SQL for MySQL, you can use BigQuery federated queries (see the sketch after this list).
Again depending on your use case, you could take a MySQL dump and upload the data to GCS or directly to BigQuery.
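As a rough illustration of the federated-query option, here is a minimal sketch using the google-cloud-bigquery client. The connection ID, project, and table names are placeholders (a Cloud SQL connection resource must already exist in BigQuery):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project id

# EXTERNAL_QUERY runs the inner SELECT on the Cloud SQL (MySQL) instance,
# so rows move server-to-server instead of through the VM.
sql = """
CREATE OR REPLACE TABLE `my-project.tr.my_table` AS
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT * FROM my_table;'
);
"""
client.query(sql).result()  # wait for the job to finish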
I know there are other threads that explain how to deal with big files and pandas, but I don't actually have memory problems; I just want to open a lot of Excel files to get a handful of rows from each (sometimes even only one), and sometimes I don't even need all the columns.
I've seen people in other threads propose usecols and nrows, but it appears that pandas still loads the entire sheet and then keeps only the selected rows and columns. To make sure, I wrote this:
start = time.time()
couples2015 = pd.read_excel(fileInput)
total = time.time() - start
#Reloading file, with only some lines and cols
start = time.time()
couples2015 = pd.read_excel(fileInput, header=4, usecols=0, nrows=10)
total = time.time() - start
In both cases it took about 55 seconds to load.
And that's only for a 50 MB file, but I have to load and extract a lot of files, from 50 MB up to 500 MB (sometimes even up to 1 GB).
Is there a way to extract some rows and columns without loading the whole file?
If not, would building a database from my Excel files and using read_sql_table() be faster?
Thanks!
[edit: moreover, each file has several sheets, but I often want only one or two. Even if I use sheet_name=0, it seems that pandas still opens and loads every sheet, as the time is almost the same…]
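For what it's worth, here is a minimal sketch of a workaround I'm considering with openpyxl's read-only mode (the file, sheet name, and row/column ranges are just placeholders); I'm not sure whether it really avoids parsing the whole sheet:

from openpyxl import load_workbook

# read_only=True streams the sheet instead of building the full in-memory model
wb = load_workbook("couples2015.xlsx", read_only=True, data_only=True)
ws = wb["Sheet1"]  # only this sheet is opened for reading

# pull only rows 5-15 of the first column, ignoring everything else
rows = list(ws.iter_rows(min_row=5, max_row=15, min_col=1, max_col=1, values_only=True))
wb.close()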
If using Excel for Windows, consider directly querying workbooks with the Jet/ACE SQL engine (Windows .dll files) via an ODBC connection to the installed Excel driver. Doing so, each sheet serves as a database table, typical SQL semantics (JOIN, UNION, WHERE, GROUP BY) are available, and results can be read with pandas.read_sql.
Adjust the SQL statement below with your actual columns, sheet, and ranges.
import pyodbc
import pandas as pd
strfile = r"C:\Path\To\Workbook.xlsx"

conn = pyodbc.connect(r'Driver={{Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)}};DBQ={};'
                      .format(strfile), autocommit=True)
strSQL = """SELECT Col1, Col2, Col3
FROM [Sheet1$A4:C10]
"""
df = pd.read_sql(strSQL, conn)
conn.close()
For data without headers, consider an inline Excel query that specifies no headers and data starting on first row of specified range.
strSQL = """SELECT F1, F2, F3
FROM [Excel 12.0 Xml;HDR=NO;IMEX=1;Database=C:\Path\To\Same\Workbook.xlsx].[Sheet$A6:L10000]
WHERE F2 = 'Some Value';
"""
By the way, if your last row is unknown, simply give it a very large number. The query engine selects only the used rows.
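To illustrate the JOIN support mentioned above, here is a hedged sketch that joins two sheets of the same workbook, reusing conn from the first snippet (before conn.close()); the sheet and column names are made up:

strSQL = """SELECT t1.ID, t1.Col1, t2.ColX
            FROM [Sheet1$] t1
            INNER JOIN [Sheet2$] t2
              ON t1.ID = t2.ID
            WHERE t2.ColX IS NOT NULL
         """
df_joined = pd.read_sql(strSQL, conn)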
I'm writing Python code that creates a SQLite database and does some calculations on massive tables. The reason I'm doing it in SQLite through Python is memory: my data is huge and would hit a memory error if processed in, say, pandas, and if chunked it would take ages, mainly because pandas is slow with merges and group-bys, etc.
My issue is that at some point I want to calculate the exponential of one column in a table (sample code below), but it seems SQLite doesn't have an EXP function.
I could write the data to a dataframe and use numpy to calculate the exponential, but that defeats the whole point that pushed me towards a DB: avoiding the extra time of reading/writing back and forth between the DB and Python.
So my question is: is there a way to calculate the exponential within the database? I've read that I can create the function within sqlite3 in Python, but I have no idea how. If you know how, or can point me to relevant info, I'd be thankful.
A sample of my code where I'm trying to do the calculation; note I'm just providing a sample where the table comes directly from a CSV, but in my process it's actually created within the DB after lots of merges and group-bys:
import sqlite3
import pandas as pd

#set path and file names
folderPath = 'C:\\SCP\\'
inputDemandFile = 'demandFile.csv'
dataBaseName = 'myDatabase.db'  # placeholder; defined elsewhere in the full script

#set connection to database
conn = sqlite3.connect(folderPath + dataBaseName)
cur = conn.cursor()

#read demand file into db
inputDemand = pd.read_csv(folderPath + inputDemandFile)
inputDemand.to_sql('inputDemand', conn, if_exists='replace', index=False)

#create new table and calculate EXP
cur.execute('CREATE TABLE demand_exp AS SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand FROM inputDemand;')
"i've read that i can create the function within sqlite3 in python, but i have no idea how."
That's conn.create_function():
https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.create_function
>>> import math
>>> conn.create_function('EXP', 1, math.exp)
>>> cur.execute('select EXP(1)')
>>> cur.fetchone()
(2.718281828459045,)
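Applied to the script in the question (same conn and cur), registering the function once before the CREATE TABLE statement should be all that's needed; a minimal sketch:

import math

# register a 1-argument SQL function named EXP, backed by Python's math.exp
conn.create_function('EXP', 1, math.exp)

# the original statement now works as written
cur.execute('CREATE TABLE demand_exp AS '
            'SELECT from_zone_id, to_zone_id, EXP(demand) AS EXP_Demand '
            'FROM inputDemand;')
conn.commit()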
I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of inserting multiple records, I want Glue to try to update a record when it notices a field has changed; each record has a unique id. Is this possible?
I followed a similar approach to the one suggested as the 2nd option by Yuriy: get the existing data as well as the new data, do some processing to merge the two of them, and write with overwrite mode. The following code should give you an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

#get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database = src_db, table_name = src_tbl)
src_df = src_data.toDF()

#get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database = dst_db, table_name = dst_tbl)
dst_df = dst_data.toDF()

#Now merge the two data frames and drop exact duplicate rows
#(resolving rows that share an id but have updated fields needs extra logic)
merged_df = dst_df.union(src_df).dropDuplicates()

#Finally save data to destination with OVERWRITE mode
merged_df.write.format('jdbc').options(url = dest_jdbc_url,
                                       user = dest_user_name,
                                       password = dest_password,
                                       dbtable = dest_tbl).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift you could use postactions to implement a Redshift merge operation. However, it's not possible for other JDBC sinks (afaik).
Alternatively, in your ETL script you can load the existing data from the database to filter out existing records before saving. However, if your DB table is big, the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replacing the existing staging data) and then make a call to the DB via an API to copy only the new records into the final table, as sketched below.
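A minimal sketch of that staging-table idea, assuming a MySQL RDS target; the table names, host, and the new_batch_df frame holding the day's records are placeholders:

import pymysql

# 1) in the Glue job: overwrite the staging table with the new batch
new_batch_df.write.format('jdbc').options(url=dest_jdbc_url,
                                          user=dest_user_name,
                                          password=dest_password,
                                          dbtable='my_table_staging').mode('overwrite').save()

# 2) then copy only records that are not yet in the final table
conn = pymysql.connect(host='my-rds-host', user=dest_user_name,
                       password=dest_password, database='my_db')
with conn.cursor() as cur:
    cur.execute("""
        INSERT INTO my_table (id, col_a, col_b)
        SELECT s.id, s.col_a, s.col_b
        FROM my_table_staging s
        WHERE NOT EXISTS (SELECT 1 FROM my_table t WHERE t.id = s.id);
    """)
conn.commit()
conn.close()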
I have used INSERT INTO table ... ON DUPLICATE KEY for UPSERTs into an Aurora RDS instance running the MySQL engine. Maybe this can serve as a reference for your use case. We cannot do it through JDBC since only APPEND, OVERWRITE, and ERROR modes are currently supported.
I am not sure which RDS database engine you are using; the following is an example of MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO TABLE ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
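For illustration, a minimal sketch of that kind of MySQL upsert from Python (the table, columns, and credentials are made up; it is not tied to the Glue job itself):

import pymysql

rows = [(1, 'alice', 'active'), (2, 'bob', 'inactive')]  # (id, name, status)

conn = pymysql.connect(host='my-rds-host', user='my_user',
                       password='my_password', database='my_db')
with conn.cursor() as cur:
    # insert new ids, update name/status for ids that already exist
    cur.executemany("""
        INSERT INTO my_table (id, name, status)
        VALUES (%s, %s, %s)
        ON DUPLICATE KEY UPDATE
            name = VALUES(name),
            status = VALUES(status);
    """, rows)
conn.commit()
conn.close()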