What is the quickest way to insert a pandas DataFrame into mongodb using PyMongo?
Attempts
db.myCollection.insert(df.to_dict())
gave an error
InvalidDocument: documents must have only string keys, the key was
Timestamp('2013-11-23 13:31:00', tz=None)
db.myCollection.insert(df.to_json())
gave an error
TypeError: 'str' object does not support item assignment
db.myCollection.insert({id: df.to_json()})
gave an error
InvalidDocument: documents must have only string keys, key was <built-in function id>
df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount 150 non-null values
price 150 non-null values
tid 150 non-null values
dtypes: float64(2), int64(1)
Here is the quickest way: use the insert_many method from PyMongo 3 together with the 'records' orientation of the to_dict method.
db.collection.insert_many(df.to_dict('records'))
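For context, here is a fuller sketch of that approach (connection details and names are illustrative); since the DataFrame in the question keeps its timestamps in a DatetimeIndex, reset_index() first so they are stored as a regular field rather than dropped:
import pandas as pd
from pymongo import MongoClient

client = MongoClient('localhost', 27017)   # adjust host/port as needed
db = client['mydb']                        # hypothetical database name

# df is the DataFrame from the question; reset_index() keeps the DatetimeIndex
# as a column (named 'index' unless the index has a name)
records = df.reset_index().to_dict('records')
db.myCollection.insert_many(records)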
I doubt there is a method that is both quick and simple. If you don't worry about data conversion, you can do
>>> import json
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
A
1 2013-11-23 21:14:34.118531
>>> records = json.loads(df.T.to_json()).values()
>>> db.myCollection.insert(records)
But if you try to load the data back, you'll get:
>>> df = read_mongo(db, 'myCollection')
>>> df
A
0 1385241274118531000
>>> df.dtypes
A int64
dtype: object
so you'll have to convert column 'A' back to datetimes, as well as any fields in your DataFrame that are not int, float, or str. For this example:
>>> df['A'] = pd.to_datetime(df['A'])
>>> df
A
0 2013-11-23 21:14:34.118531
odo can do it using
odo(df, db.myCollection)
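For context, a minimal sketch of what that call looks like end to end (database and collection names are illustrative; odo is a separate third-party package):
import pandas as pd
from odo import odo              # third-party package: pip install odo
from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['mydb']              # hypothetical database name

odo(df, db.myCollection)         # df is the DataFrame to store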
If your dataframe has missing data (i.e. None, NaN) and you don't want null key values in your documents:
db.insert_many(df.to_dict("records")) will insert keys with null values. If you don't want those empty keys in your documents, you can use a modified version of the pandas .to_dict("records") code, shown below:
from pandas.core.common import _maybe_box_datetimelike  # private helper from older pandas versions

my_list = [
    dict((k, _maybe_box_datetimelike(v))
         for k, v in zip(df.columns, row)
         if v != None and v == v)  # skip None and NaN values
    for row in df.values
]
db.insert_many(my_list)
The if v != None and v == v checks I've added make sure the value is not None or NaN before putting it in the row's dictionary. Now your .insert_many will only include keys that have values in the documents (and no null data types).
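On recent pandas versions the private helper _maybe_box_datetimelike is no longer importable; here is a sketch of the same idea using only public APIs (pd.notna covers None, NaN and NaT), with db.myCollection standing in for your collection:
import pandas as pd

# One dict per row, skipping keys whose value is missing (None/NaN/NaT)
records = [
    {k: v for k, v in row.items() if pd.notna(v)}
    for row in df.to_dict('records')
]
db.myCollection.insert_many(records)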
I think there are cool ideas in this question. In my case I have spent more time taking care of the movement of large dataframes. In those cases pandas tends to give you the option of a chunksize (see for example pandas.DataFrame.to_sql). So I think I can contribute here by adding the function I am using in this direction.
import pandas as pd
from pymongo import MongoClient

def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    Take a DataFrame and create a collection in MongoDB (you should
    provide the database name, the collection name, the server where the
    MongoDB database is hosted and the port to connect to it).

    Parameters / Input
        my_df:           the DataFrame to send to MongoDB
        database_name:   database name
        collection_name: collection name (to create)
        server:          the server where the MongoDB database is hosted
                         Example: server = 'XXX.XXX.XX.XX'
        mongodb_port:    the port where the database is operating
                         Example: mongodb_port = 27017
        chunk_size:      the number of records that will be sent at the
                         same time to the database. Default is 100.

    Output
        Prints "Done" when finished.

    FUTURE modifications:
    1. Write to SQL
    2. Write to csv

    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    # Connect
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]

    # Destroy the existing collection content
    collection.delete_many({})
    # my_df = my_df.drop_duplicates(subset=None, keep='last')  # to avoid repetitions

    # Insert chunks of the dataframe
    my_list = my_df.to_dict('records')
    l = len(my_list)
    steps = list(range(chunk_size, l, chunk_size)) + [l]
    i = 0
    for j in steps:
        print(j)
        collection.insert_many(my_list[i:j])  # fill the collection
        i = j
    print('Done')
    return
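A call might then look like this (all names are placeholders):
write_df_to_mongoDB(my_df,
                    database_name='mydatabasename',
                    collection_name='mycollectionname',
                    server='localhost',
                    mongodb_port=27017,
                    chunk_size=100)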
I use the following to insert the dataframe into a collection in the database.
df.reset_index(inplace=True)
data_dict = df.to_dict("records")
myCollection.insert_many(data_dict)
How about this:
db.myCollection.insert({id: df.to_json()})
where id is a unique string for that df (a variable you define, not the built-in id function).
Just make string keys!
import json

dfData = json.dumps(df.to_dict('records'))
saveData = {'_id': 'a8e42ed79f9dae1cefe8781760231ec0', 'df': dfData}
res = collection.insert_one(saveData)   # collection is a pymongo Collection, e.g. db.myCollection

##### load dfData
data = collection.find_one({'_id': 'a8e42ed79f9dae1cefe8781760231ec0'}).get('df')
dfData = json.loads(data)
df = pd.DataFrame.from_dict(dfData)
If you want to send several documents at once:
db.myCollection.insert_many(df.apply(lambda x: x.to_dict(), axis=1).to_list())
If you want to make sure that you're not raising InvalidDocument errors, then something like the following is a good idea. This is because mongo does not recognize types such as np.int64, np.float64, etc.
import numpy as np
from pymongo import MongoClient

client = MongoClient()
db = client.test
col = db.col

def createDocsFromDF(df, collection=None, insertToDB=False):
    docs = []
    for i in range(len(df)):
        doc = {c: df[c][i] for c in df.columns if c != 'index'}
        for key, val in doc.items():
            # we have to do this, because mongo does not recognize these np. types
            if type(val) == np.int64:
                doc[key] = int(val)
            if type(val) == np.float64:
                doc[key] = float(val)
            if type(val) == np.bool_:
                doc[key] = bool(val)
        docs.append(doc)
    if insertToDB and collection:
        collection.insert_many(docs)
    return docs
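Usage could then look like this (using the client, db, and col objects defined above):
# Build the documents and insert them into the collection in one call
docs = createDocsFromDF(df, collection=col, insertToDB=True)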
For upserts this worked.
for r in df2.to_dict(orient="records"):
    db['utest-pd'].update_one({'a': r['a']}, {'$set': r}, upsert=True)
This does it one record at a time; a bulk upsert did not seem able to use a different filter value for each record.
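If the per-record round trips become a bottleneck, one option (a sketch, not part of the answer above) is PyMongo's bulk_write with UpdateOne(..., upsert=True), which sends the same per-record filters in a single batch:
from pymongo import UpdateOne

ops = [
    UpdateOne({'a': r['a']}, {'$set': r}, upsert=True)
    for r in df2.to_dict(orient='records')
]
db['utest-pd'].bulk_write(ops)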
Related
I have the following AWS bucket schema:
In my Python code, it returns a list of the buckets with their dates.
I need to keep only the most up-to-date of the two main buckets:
I am just starting with Python; this is my code:
str_of_ints = [7100, 7144]
for get_in_scenarioid in str_of_ints:
    resultado = s3.list_objects(Bucket=source, Delimiter='/', Prefix=get_in_scenarioid + '/')
    #print(resultado)
    sub_prefix = [val['Prefix'] for val in resultado['CommonPrefixes']]
    for get_in_sub_prefix in sub_prefix:
        resultado2 = s3.list_objects(Bucket=source, Delimiter='/', Prefix=get_in_sub_prefix)  # +'/')
        #print(resultado2)
        get_key_and_last_modified = [val['Key'] for val in resultado2['Contents']] + int([val['LastModified'].strftime('%Y-%m-%d %H:%M:%S') for val in resultado2['Contents']])
        print(get_key_and_last_modified)
I would recommend converting your array into a pandas DataFrame and using groupby:
import pandas as pd
df = pd.DataFrame([["a",1],["a",2],["a",3],["b",2],["b",4]], columns=["lbl","val"])
df.groupby(['lbl'], sort=False)['val'].max()
lbl
a 3
b 4
In your case you would also have to split your label into two parts first; it's better to keep them in separate columns.
Update:
Once you split your label into bucket and sub_bucket, you can return the max values like this:
dfg = df.groupby("main_bucket")
dfm = dfg.max()
res = dfm.reset_index()
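For the splitting step mentioned above, something like this would work, assuming the label uses a fixed separator such as '_' between the main bucket and the sub-bucket (adjust to your actual naming scheme):
# Split 'lbl' into two new columns, then take the max per main bucket
df[['main_bucket', 'sub_bucket']] = df['lbl'].str.split('_', n=1, expand=True)
res = df.groupby('main_bucket', as_index=False)['val'].max()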
I have a dataframe that contains 2 columns. For each row, I simply want to create a Redis SET where the first value of the dataframe is the key and the second value is the value of the Redis SET. I've done some research and I think I found the fastest way of doing this via iterables:
def send_to_redis(df, r):
    df['bin_subscriber'] = df.apply(lambda row: uuid.UUID(row.subscriber).bytes, axis=1)
    df['bin_total_score'] = df.apply(lambda row: struct.pack('B', round(row.total_score)), axis=1)
    df = df[['bin_subscriber', 'bin_total_score']]
    with r.pipeline() as pipe:
        index = 0
        for subscriber, total_score in zip(df['bin_subscriber'], df['bin_total_score']):
            r.set(subscriber, total_score)
            if (index + 1) % 2000 == 0:
                pipe.execute()
            index += 1
With this, I can send about 400-500k sets to Redis per minute. We may end up processing up to 300 million which at this rate would take half a day or so. Doable but not ideal. Note that in the outer wrapper I am downloading .parquet files from s3 one at a time and pulling into Pandas via IO bytes.
def process_file(s3_resource, r, bucket, key):
    buffer = io.BytesIO()
    s3_object = s3_resource.Object(bucket, key)
    s3_object.download_fileobj(buffer)
    send_to_redis(
        pandas.read_parquet(buffer, columns=['subscriber', 'total_score']), r)

def main():
    args = get_args()
    s3_resource = boto3.resource('s3')
    r = redis.Redis()
    file_prefix = get_prefix(args)
    s3_keys = [
        item.key for item in
        s3_resource.Bucket(args.bucket).objects.filter(Prefix=file_prefix)
        if item.key.endswith('.parquet')
    ]
    for key in s3_keys:
        process_file(s3_resource, r, args.bucket, key)
Is there a way to send this data to Redis without the use of iteration? Is it possible to send an entire blob of data to Redis and have Redis set the key and value for every 1st and 2nd value of the data blob? I imagine that would be slightly faster.
The original parquet that I am pulling into pandas is created via PySpark. I've tried the Spark-Redis plugin, which is extremely fast, but I'm not sure how to convert my data to the above binary format within a Spark dataframe itself. I also don't like how the column name is added as a string to every single value, and it doesn't seem to be configurable; every Redis object carrying that label seems very space inefficient.
Any suggestions would be greatly appreciated!
Try Redis Mass Insertion and redis bulk import using --pipe:
Create a new text file input.txt containing the Redis commands:
SET Key0 Value0
SET Key1 Value1
...
SET KeyN ValueN
Use redis-mass.py (see below) to insert it into Redis:
python redis-mass.py input.txt | redis-cli --pipe
redis-mass.py, from GitHub:
#!/usr/bin/env python
"""
redis-mass.py
~~~~~~~~~~~~~

Prepares a newline-separated file of Redis commands for mass insertion.

:copyright: (c) 2015 by Tim Simmons.
:license: BSD, see LICENSE for more details.
"""
import sys

def proto(line):
    result = "*%s\r\n$%s\r\n%s\r\n" % (str(len(line)), str(len(line[0])), line[0])
    for arg in line[1:]:
        result += "$%s\r\n%s\r\n" % (str(len(arg)), arg)
    return result

if __name__ == "__main__":
    try:
        filename = sys.argv[1]
        f = open(filename, 'r')
    except IndexError:
        f = sys.stdin.readlines()

    for line in f:
        print(proto(line.rstrip().split(' ')))
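To tie this back to the DataFrame in the question, the input file can be generated straight from the original columns (a sketch that assumes plain string/number values rather than the packed binary above, since the file is line-oriented text):
# Write one SET command per DataFrame row; df has 'subscriber' and 'total_score' columns
with open('input.txt', 'w') as f:
    for key, value in zip(df['subscriber'], df['total_score']):
        f.write('SET {} {}\n'.format(key, value))
The resulting file can then be piped through redis-mass.py and redis-cli --pipe as shown above.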
I am trying to load a big pandas table into DynamoDB.
I have tried the for-loop method as follows:
for k in range(1000):
    trans = {}
    trans['Director'] = DL_dt['director_name'][k]
    trans['Language'] = DL_dt['original_language'][k]
    print("add :", DL_dt['director_name'][k], DL_dt['original_language'][k])
    table.put_item(Item=trans)
It works, but it's very time-consuming.
Is there a faster way to load it? (An equivalent of to_sql for SQL databases.)
I've found the BatchWriteItem operation, but I am not sure it works and I don't know exactly how to use it.
Thanks a lot.
You can iterate over the dataframe rows, transform each row to JSON, and then convert it back to a dict using json.loads; this also avoids the NumPy data type errors.
you can try this:
import json
from decimal import Decimal
DL_dt = DL_dt.rename(columns={
    'director_name': 'Director',
    'original_language': 'Language'
})

with table.batch_writer() as batch:
    for index, row in DL_dt.iterrows():
        batch.put_item(Item=json.loads(row.to_json(), parse_float=Decimal))
I did this using AWS Wrangler (the awswrangler package). It was a fairly simple process; the only tricky bit was handling pandas floats, so I converted them to Decimals before loading the data in.
from decimal import Decimal

import awswrangler as wr

def float_to_decimal(num):
    return Decimal(str(num))

def pandas_to_dynamodb(df):
    df = df.fillna(0)
    # convert any floats to decimals
    for i in df.columns:
        datatype = df[i].dtype
        if datatype == 'float64':
            df[i] = df[i].apply(float_to_decimal)
    # write to dynamodb
    wr.dynamodb.put_df(df=df, table_name='table-name')

pandas_to_dynamodb(df)
Batch writer docs here.
Try this:
with table.batch_writer() as batch:
    for k in range(1000):
        trans = {}
        trans['Director'] = DL_dt['director_name'][k]
        trans['Language'] = DL_dt['original_language'][k]
        print("add :", DL_dt['director_name'][k], DL_dt['original_language'][k])
        batch.put_item(Item=trans)
I am having a problem (I think memory related) when trying to do an arcpy.Statistics_analysis on an approximately 40 million row table. I am trying to count the number of non-null values in various columns of the table per category (e.g. there are x non-null values in column 1 for category A). After this, I need to join the statistics results to the input table.
Is there a way of doing this using numpy (or something else)?
The code I currently have is like this:
arcpy.Statistics_analysis(input_layer, output_layer, "'Column1' COUNT; 'Column2' COUNT; 'Column3' COUNT", "Categories")
I am very much a novice with arcpy/numpy so any help much appreciated!
You can convert a table to a NumPy array using the function arcpy.da.TableToNumPyArray, and then convert that array to a pandas.DataFrame object.
Here is an example (I assume you are working with a feature class because you use the term null values; if you work with a shapefile you will need to change the code, as null values are not supported there and are replaced with a single-space string ' '):
import arcpy
import pandas as pd

# Change these values
gdb_path = 'path/to/your/geodatabase.gdb'
table_name = 'your_table_name'
cat_field = 'Categorie'
fields = ['Column1', 'Column2', 'Column3', 'Column4']

# Do not change
null_value = -9999
input_table = gdb_path + '\\' + table_name

# Convert to pandas DataFrame
array = arcpy.da.TableToNumPyArray(input_table,
                                   [cat_field] + fields,
                                   skip_nulls=False,
                                   null_value=null_value)
df = pd.DataFrame(array)

# Count number of non null values
not_null_count = {field: {cat: 0 for cat in df[cat_field].unique()}
                  for field in fields}

for cat in df[cat_field].unique():
    _df = df.loc[df[cat_field] == cat]
    len_cat = len(_df)
    for field in fields:
        try:  # If your field contains integers or floats
            null_count = _df[field].value_counts()[int(null_value)]
        except IndexError:  # If it contains text (strings)
            null_count = _df[field].value_counts()[str(null_value)]
        except KeyError:  # There is no null value
            null_count = 0
        not_null_count[field][cat] = len_cat - null_count
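As an aside, the same per-category counts can be computed more directly in pandas by treating the sentinel as missing and using count(), which ignores NaN (a sketch; it assumes the sentinel only appears where values were actually null):
import numpy as np

# Replace the numeric or text sentinel with NaN, then count non-nulls per category
not_null_df = (df.replace([null_value, str(null_value)], np.nan)
                 .groupby(cat_field)[fields]
                 .count())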
Concerning joining the results to the input table: without more information it's hard to give an exact answer that will meet your expectations (there are multiple columns, so it's unclear which value you want to add).
EDIT:
Here is some additional code following your clarifications:
# Create a copy of the table
copy_name = ''  # name of the copied table
copy_path = gdb_path + '\\' + copy_name
arcpy.Copy_management(input_table, copy_path)

# Divide the copied data by the summary counts
# This step doesn't need the dict (not_null_count) converted to a table
with arcpy.da.UpdateCursor(copy_path, [cat_field] + fields) as cur:
    for row in cur:
        category = row[0]
        for i, fld in enumerate(fields):
            row[i + 1] /= not_null_count[fld][category]
        cur.updateRow(row)

# Save the summary table as a csv file (if needed)
df_summary = pd.DataFrame(not_null_count)
df_summary.index.name = 'Food Area'  # Or any name
df_summary.to_csv('path/to/file.csv')  # Change path

# Summary to ArcMap table (also if needed)
arcpy.TableToTable_conversion('path/to/file.csv',
                              gdb_path,
                              'name_of_your_new_table')
I am trying to read the values of certain columns from database files (MS Access files), but only if certain conditions are met.
I have 26 different MS Access files representing the databases for 26 different years.
import pyodbc
import pandas as pd
import numpy as np

k = 1993 + np.arange(24)

for i in k:
    print(i)
    DBfile = r'D:\PMIS1993_2016' + '\\' + str(i) + '\\pmismzxpdata_' + str(i) + '.mdb'
    print(DBfile)
    conn = pyodbc.connect('DRIVER={Microsoft Access Driver (*.mdb)};DBQ=' + DBfile)
    cur = conn.cursor()
    qry = "SELECT JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY, JCP_PCC_PATCHES_QTY FROM PMIS_JCP_RATINGS WHERE BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'"
    dataf = pd.read_sql(qry, conn)
    print(dataf)
    D = list(dataf.values[0])
    print(D)
    conn.close()
Here I am trying to read the values of JCP_FAILED_JNTS_CRACKS_QTY, JCP_FAILURES_QTY, JCP_SHATTERED_SLABS_QTY, JCP_LONGITUDE_CRACKS_QTY and JCP_PCC_PATCHES_QTY when BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'.
However, not every year meets the conditions BEG_REF_MARKER_NBR = '0342' and BEG_REF_MARKER_DISP LIKE '0.5' and RATING_CYCLE_CODE = 'P'.
So I would like to skip the years which do not meet these conditions, something like an if/else that indicates which years do not satisfy them.
If you have any help or ideas, I would really appreciate it.
Isaac
You can use the .empty attribute:
In [11]: pd.DataFrame().empty # This DataFrame has no rows
Out[11]: True
e.g. to skip the empty dataf DataFrames:
if not dataf.empty:
    D = list(dataf.values[0])
    print(D)