Efficiently insert a massive number of rows with psycopg2 - Python

I need to efficiently insert about 500k (give or take 100k) rows of data into my PostgreSQL database. After a generous amount of googling, I've arrived at this solution, averaging about 150 (wall-clock) seconds.
def db_insert_spectrum(curs, visual_data, recording_id):
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
    """
    # Mass-insertion technique
    # visual_data is a 2D array (an n x 63 matrix)
    values_list = []
    for rowIndex, rowData in enumerate(visual_data):
        for colIndex, colData in enumerate(rowData):  # colData is the value
            value = [(rowIndex, colIndex, colData, recording_id)]
            values_list.append(value)
    psycopg2.extras.execute_batch(curs, sql, values_list, page_size=1000)
Is there a faster way?

Based on the answers given here, COPY is the fastest method. COPY reads from a file or file-like object.
Since memory I/O is many orders of magnitude faster than disk I/O, it is faster to write the data to a StringIO file-like object than to write to an actual file.
The psycopg docs show an example of calling copy_from with a StringIO as input.
Therefore, you could use something like:
try:
    # Python 2
    from cStringIO import StringIO
except ImportError:
    # Python 3
    from io import StringIO

def db_insert_spectrum(curs, visual_data, recording_id):
    f = StringIO()
    # visual_data is a 2D array (an n x 63 matrix)
    for rowIndex, rowData in enumerate(visual_data):
        items = []
        for colIndex, colData in enumerate(rowData):
            value = (rowIndex, colIndex, colData, recording_id)
            items.append('\t'.join(map(str, value)) + '\n')
        f.writelines(items)
    f.seek(0)
    curs.copy_from(f, 'spectrums', columns=('row', 'col', 'value', 'recording_id'))
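A minimal usage sketch (the connection string is a placeholder; visual_data and recording_id come from the question):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
with conn, conn.cursor() as curs:
    # leaving the `with conn` block cleanly commits the transaction
    # (it does not close the connection)
    db_insert_spectrum(curs, visual_data, recording_id)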

I don't know whether .execute_batch can accept a generator, but can you try something like:
def db_insert_spectrum(curs, visual_data, recording_id):
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
    """
    # each item is a one-element list so that the tuple fills the single %s
    # placeholder, matching the parameter shape used in the question
    data_gen = ([(rIdx, cIdx, value, recording_id)]
                for rIdx, cData in enumerate(visual_data)
                for cIdx, value in enumerate(cData))
    psycopg2.extras.execute_batch(curs, sql, data_gen, page_size=1000)
It might be faster.
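Another option worth trying (a sketch, not benchmarked here and not part of the original answers) is psycopg2.extras.execute_values, which is designed for exactly this kind of "VALUES %s" template and typically outperforms execute_batch because it packs many rows into each statement:

import psycopg2.extras

def db_insert_spectrum(curs, visual_data, recording_id):
    sql = """
        INSERT INTO spectrums (row, col, value, recording_id)
        VALUES %s
    """
    # flatten the matrix into one (row, col, value, recording_id) tuple per cell
    values = [(rowIndex, colIndex, colData, recording_id)
              for rowIndex, rowData in enumerate(visual_data)
              for colIndex, colData in enumerate(rowData)]
    psycopg2.extras.execute_values(curs, sql, values, page_size=1000)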

Related

PyTables: Change Table expectedrows parameter dynamically

The PyTables optimization tips suggest setting the expectedrows parameter when creating a new table with File.create_table().
However, I couldn't find any information about changing this parameter later. That would be useful, because my table is not static: it will grow over time and I want to keep using it on an ongoing basis.
Alternatively, is it possible to create a new table (with new settings) and fill it with data from other, already existing tables?
Or, more generally, what is the best way to handle this issue?
I am not aware of a way to access the value of expectedrows or change it after creating a table. However, it's "relatively easy" to read a table and copy the data to a new table (either in the same file or another file). (Note: if you create a new table and delete the old one, you will want to run ptrepack as described in the PyTables optimization tips you mentioned above.)
Simple example below:
import tables as tb
import numpy as np

with tb.File('SO_71267946.h5', 'w') as h5f:
    arr_dt = [('i', int), ('x', float), ('y', float)]
    arr = np.empty(dtype=arr_dt, shape=10,)
    arr['i'] = [i for i in range(10)]
    arr['x'] = [2.*x for x in range(10)]
    arr['y'] = [4.*y for y in range(10)]
    ex_tbl = h5f.create_table('/', 'Example', obj=arr, expectedrows=1_000)
    print(ex_tbl.chunkshape)

    # create more data to add more rows to the table
    arr = np.empty(dtype=arr_dt, shape=20,)
    arr['i'] = [i for i in range(10, 30)]
    arr['x'] = [2.*x for x in range(10, 30)]
    arr['y'] = [4.*y for y in range(10, 30)]
    ex_tbl.append(arr)

    # Copy to a new table in the same file:
    xfer = h5f.root.Example.read()
    ex_tbl2 = h5f.create_table('/', 'Example2', obj=xfer, expectedrows=1_000_000)
    print(ex_tbl2.chunkshape)

# Copy to a new table in a new file:
with tb.File('SO_71267946.h5', 'r') as h5r, \
     tb.File('SO_71267946_2.h5', 'w') as h5w:
    xfer = h5r.root.Example.read()
    ex_tbl2 = h5w.create_table('/', 'Example2', obj=xfer, expectedrows=1_000_000)
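If the goal is for the copy to replace the original table in the same file, you can then drop the old table and rename the new one. A minimal sketch, reusing the file from the example above (remember to run ptrepack on the file afterwards to reclaim the space freed by the deleted table):

import tables as tb

with tb.File('SO_71267946.h5', 'a') as h5f:
    h5f.remove_node('/', 'Example')                            # drop the original table
    h5f.rename_node('/', newname='Example', name='Example2')   # give the copy its old name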
The table below shows the chunkshape calculated for different values of expectedrows. (chunkshape is the number of rows read from a Table in a single I/O operation.)
expectedrows       chunkshape
10_000             (3276,)
100_000            (3276,)
1_000_000          (6553,)
10_000_000         (13107,)
1_000_000_000      (52428,)

Fastest way to send dataframe to Redis

I have a dataframe that contains 2 columns. For each row, I simply want to issue a Redis SET where the first value in the dataframe row is the key and the second value is the value. I've done some research and I think I found the fastest way of doing this, via iterables:
def send_to_redis(df, r):
    df['bin_subscriber'] = df.apply(lambda row: uuid.UUID(row.subscriber).bytes, axis=1)
    df['bin_total_score'] = df.apply(lambda row: struct.pack('B', round(row.total_score)), axis=1)
    df = df[['bin_subscriber', 'bin_total_score']]
    with r.pipeline() as pipe:
        index = 0
        for subscriber, total_score in zip(df['bin_subscriber'], df['bin_total_score']):
            pipe.set(subscriber, total_score)  # queue on the pipeline, not the connection
            if (index + 1) % 2000 == 0:
                pipe.execute()
            index += 1
        pipe.execute()  # flush whatever is left in the pipeline
With this, I can send about 400-500k sets to Redis per minute. We may end up processing up to 300 million, which at this rate would take half a day or so. Doable, but not ideal. Note that in the outer wrapper I am downloading .parquet files from S3 one at a time and pulling them into pandas via an io.BytesIO buffer.
def process_file(s3_resource, r, bucket, key):
    buffer = io.BytesIO()
    s3_object = s3_resource.Object(bucket, key)
    s3_object.download_fileobj(buffer)
    send_to_redis(
        pandas.read_parquet(buffer, columns=['subscriber', 'total_score']), r)

def main():
    args = get_args()
    s3_resource = boto3.resource('s3')
    r = redis.Redis()
    file_prefix = get_prefix(args)
    s3_keys = [
        item.key for item in
        s3_resource.Bucket(args.bucket).objects.filter(Prefix=file_prefix)
        if item.key.endswith('.parquet')
    ]
    for key in s3_keys:
        process_file(s3_resource, r, args.bucket, key)
Is there a way to send this data to Redis without the use of iteration? Is it possible to send an entire blob of data to Redis and have Redis set the key and value for every 1st and 2nd value of the data blob? I imagine that would be slightly faster.
The original parquet that I am pulling into pandas is created via PySpark. I've tried using the Spark-Redis plugin, which is extremely fast, but I'm not sure how to convert my data to the above binary format within a Spark dataframe itself, and I don't like how the column name is added as a string to every single value; it doesn't seem to be configurable. Every Redis object having that label seems very space inefficient.
Any suggestions would be greatly appreciated!
Try Redis Mass Insertion and bulk import using redis-cli --pipe:

1. Create a new text file, input.txt, containing one Redis command per line:

SET Key0 Value0
SET Key1 Value1
...
SET Keyn Valuen

2. Use redis-mass.py (see below) to convert it to the Redis protocol and pipe it into redis-cli:

python redis-mass.py input.txt | redis-cli --pipe
redis-mass.py, from GitHub:

#!/usr/bin/env python
"""
redis-mass.py
~~~~~~~~~~~~~
Prepares a newline-separated file of Redis commands for mass insertion.
:copyright: (c) 2015 by Tim Simmons.
:license: BSD, see LICENSE for more details.
"""
import sys

def proto(line):
    result = "*%s\r\n$%s\r\n%s\r\n" % (str(len(line)), str(len(line[0])), line[0])
    for arg in line[1:]:
        result += "$%s\r\n%s\r\n" % (str(len(arg)), arg)
    return result

if __name__ == "__main__":
    try:
        filename = sys.argv[1]
        f = open(filename, 'r')
    except IndexError:
        f = sys.stdin.readlines()

    for line in f:
        # emit the raw protocol; end='' avoids injecting extra newlines into it
        print(proto(line.rstrip().split(' ')), end='')
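To tie this back to the dataframe in the question, input.txt could be generated straight from the two columns. A sketch (it assumes the subscriber IDs and scores can be written as plain, space-free strings, unlike the packed binary values used in the question, since redis-mass.py splits each line on spaces):

def df_to_redis_commands(df, path='input.txt'):
    # one SET command per dataframe row
    with open(path, 'w') as out:
        for subscriber, total_score in zip(df['subscriber'], df['total_score']):
            out.write('SET %s %s\n' % (subscriber, total_score))

The resulting file can then be piped in with python redis-mass.py input.txt | redis-cli --pipe as above.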

Load a pandas table to DynamoDB

I am trying to load a big pandas table to DynamoDB.
I have tried the for-loop method as follows:
for k in range(1000):
    trans = {}
    trans['Director'] = DL_dt['director_name'][k]
    trans['Language'] = DL_dt['original_language'][k]
    print("add :", DL_dt['director_name'][k], DL_dt['original_language'][k])
    table.put_item(Item=trans)
It works, but it's very time consuming.
Is there a faster way to load it? (An equivalent of to_sql for SQL databases.)
I've found the BatchWriteItem operation, but I am not sure it works and I don't know exactly how to use it.
Thanks a lot.
You can iterate over the dataframe rows, transform each row to JSON and then convert it to a dict using json.loads; this also avoids the numpy data type errors.
You can try this:
import json
from decimal import Decimal

DL_dt = DL_dt.rename(columns={
    'director_name': 'Director',
    'original_language': 'Language'
})

with table.batch_writer() as batch:
    for index, row in DL_dt.iterrows():
        batch.put_item(Item=json.loads(row.to_json(), parse_float=Decimal))
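The batch_writer() snippets in these answers assume table is a boto3 DynamoDB Table resource; a minimal setup sketch (the table name is a placeholder):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('your-table-name')  # hypothetical table name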
I did this using awswrangler. It was a fairly simple process; the only tricky bit was handling pandas floats, so I converted them to decimals before loading the data in.
from decimal import Decimal

import awswrangler as wr

def float_to_decimal(num):
    return Decimal(str(num))

def pandas_to_dynamodb(df):
    df = df.fillna(0)
    # convert any floats to decimals
    for i in df.columns:
        datatype = df[i].dtype
        if datatype == 'float64':
            df[i] = df[i].apply(float_to_decimal)
    # write to dynamodb
    wr.dynamodb.put_df(df=df, table_name='table-name')

pandas_to_dynamodb(df)
Batch writer docs here.
Try this:
with table.batch_writer() as batch:
    for k in range(1000):
        trans = {}
        trans['Director'] = DL_dt['director_name'][k]
        trans['Language'] = DL_dt['original_language'][k]
        print("add :", DL_dt['director_name'][k], DL_dt['original_language'][k])
        batch.put_item(Item=trans)

How to estimate dataframe real size in pyspark?

How to determine a dataframe size?
Right now I estimate the real size of a dataframe as follows:
headers_size = len([key for key in df.first().asDict()])
rows_size = df.rdd.map(lambda row: len([value for key, value in row.asDict().items()])).sum()
total_size = headers_size + rows_size
It is too slow and I'm looking for a better way.
Currently I am using the approach below, but I'm not sure if it is the best way:
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)
df.count()
On the Spark web UI, under the Storage tab, you can check the size, which is displayed in MB; then I unpersist to clear the memory:
df.unpersist()
A nice post from Tamas Szuromi: http://metricbrew.com/how-to-estimate-rdd-or-dataframe-real-size-in-pyspark/
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.
    It will convert each Python object into a Java object via Pyrolite, whether
    the RDD is serialized in batch or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

JavaObj = _to_java_object_rdd(df.rdd)
nbytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
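nbytes is the JVM-side estimate of the deserialized in-memory size in bytes, so it can be reported directly, for example:

print('estimated size: %.1f MiB' % (nbytes / 1024.0 ** 2))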

HDFStore: table.select and RAM usage

I am trying to select random rows from an HDFStore table of about 1 GB. RAM usage explodes when I ask for about 50 random rows.
I am using pandas 0.11-dev, Python 2.7, linux64.
In this first case, the RAM usage fits the size of the chunk:
with pd.get_store("train.h5",'r') as train:
    for chunk in train.select('train',chunksize=50):
        pass
In this second case, it seems like the whole table is loaded into RAM
r=random.choice(400000,size=40,replace=False)
train.select('train',pd.Term("index",r))
In this last case, RAM usage fits the equivalent chunk size
r=random.choice(400000,size=30,replace=False)
train.select('train',pd.Term("index",r))
I am puzzled, why moving from 30 to 40 random rows induces such a dramatic increase in RAM usage.
Note the table has been indexed when created such that index=range(nrows(table)) using the following code:
def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})
Thanks for the insight.
EDIT TO ANSWER Zelazny7
Here's the file I used to write Train.csv to train.h5. I wrote this using elements of Zelazny7's code from How to trouble-shoot HDFStore Exception: cannot find the correct atom type
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer

def object_max_len(x):
    if x.dtype != 'object':
        return
    else:
        return len(max(x.fillna(''), key=lambda x: len(str(x))))

def txtfile2dtypes(infile, sep="\t", header=0, chunksize=50000):
    max_len = pd.read_table(infile, header=header, sep=sep, nrows=5).apply(object_max_len).max()
    dtypes0 = pd.read_table(infile, header=header, sep=sep, nrows=5).dtypes
    for chunk in pd.read_table(infile, header=header, sep=sep, chunksize=chunksize):
        max_len = max((pd.DataFrame(chunk.apply(object_max_len)).max(), max_len))
        for i, k in enumerate(zip(dtypes0[:], chunk.dtypes)):
            if (k[0] != k[1]) and (k[1] == 'object'):
                dtypes0[i] = k[1]
    # as of pandas-0.11 nan requires a float64 dtype
    dtypes0.values[dtypes0 == np.int64] = np.dtype('float64')
    return max_len, dtypes0

def txtfile2hdfstore(infile, storefile, table_name, sep="\t", header=0, chunksize=50000):
    max_len, dtypes0 = txtfile2dtypes(infile, sep, header, chunksize)
    with pd.get_store(storefile, 'w') as store:
        for i, chunk in enumerate(pd.read_table(infile, header=header, sep=sep, chunksize=chunksize, dtype=dict(dtypes0))):
            chunk.index = range(chunksize*(i), chunksize*(i+1))[:chunk.shape[0]]
            store.append(table_name, chunk, min_itemsize={'values': max_len})
Applied as
txtfile2hdfstore('Train.csv','train.h5','train',sep=',')
This is a known issue, see the reference here: https://github.com/pydata/pandas/pull/2755
Essentially, the query is turned into a numexpr expression for evaluation. There is an issue where I can't pass a lot of "or" conditions to numexpr (it depends on the total length of the generated expression).
So I just limit the expression that we pass to numexpr. If it exceeds a certain number of "or" conditions, then the query is done as a filter rather than an in-kernel selection. Basically this means the table is read and then reindexed.
This is on my enhancements list: https://github.com/pydata/pandas/issues/2391 (17).
As a workaround, just split your query up into multiple ones and concat the results. This should be much faster, and use a constant amount of memory.
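A minimal sketch of that workaround, following the pandas 0.11-era API used in the question (the split into 40 chunks is just an illustrative choice):

import numpy as np
import pandas as pd

r = np.random.choice(400000, size=1000, replace=False)
with pd.get_store("train.h5", 'r') as train:
    # issue one small in-kernel query per chunk of indices, then stitch them together
    parts = [train.select('train', pd.Term("index", chunk))
             for chunk in np.array_split(r, 40)]
sample = pd.concat(parts)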
