I have a dataframe that contains 2 columns. For each row, I simply want to issue a Redis SET where the first column is the key and the 2nd column is the value. I've done research and I think I found the fastest way of doing this via iterables:
def send_to_redis(df, r):
    df['bin_subscriber'] = df.apply(lambda row: uuid.UUID(row.subscriber).bytes, axis=1)
    df['bin_total_score'] = df.apply(lambda row: struct.pack('B', round(row.total_score)), axis=1)
    df = df[['bin_subscriber', 'bin_total_score']]
    with r.pipeline() as pipe:
        index = 0
        for subscriber, total_score in zip(df['bin_subscriber'], df['bin_total_score']):
            r.set(subscriber, total_score)
            if (index + 1) % 2000 == 0:
                pipe.execute()
            index += 1
With this, I can send about 400-500k sets to Redis per minute. We may end up processing up to 300 million which at this rate would take half a day or so. Doable but not ideal. Note that in the outer wrapper I am downloading .parquet files from s3 one at a time and pulling into Pandas via IO bytes.
def process_file(s3_resource, r, bucket, key):
    buffer = io.BytesIO()
    s3_object = s3_resource.Object(bucket, key)
    s3_object.download_fileobj(buffer)
    send_to_redis(
        pandas.read_parquet(buffer, columns=['subscriber', 'total_score']), r)

def main():
    args = get_args()
    s3_resource = boto3.resource('s3')
    r = redis.Redis()
    file_prefix = get_prefix(args)
    s3_keys = [
        item.key for item in
        s3_resource.Bucket(args.bucket).objects.filter(Prefix=file_prefix)
        if item.key.endswith('.parquet')
    ]
    for key in s3_keys:
        process_file(s3_resource, r, args.bucket, key)
Is there a way to send this data to Redis without the use of iteration? Is it possible to send an entire blob of data to Redis and have Redis set the key and value for every 1st and 2nd value of the data blob? I imagine that would be slightly faster.
The original parquet that I am pulling into Pandas is created via PySpark. I've tried the Spark-Redis plugin, which is extremely fast, but I'm not sure how to convert my data to the above binary format within a Spark dataframe itself. I also don't like that the column name is added as a string to every single value, which doesn't seem to be configurable. Having that label on every Redis object seems very space inefficient.
Any suggestions would be greatly appreciated!
Try Redis Mass Insertion and Redis bulk import using --pipe:
Create a new text file input.txt containing the Redis commands:
SET Key0 Value0
SET Key1 Value1
...
SET Keyn Valuen
Use redis-mass.py (see below) to insert into Redis:
python redis-mass.py input.txt | redis-cli --pipe
redis-mass.py, from GitHub:
#!/usr/bin/env python
"""
    redis-mass.py
    ~~~~~~~~~~~~~
    Prepares a newline-separated file of Redis commands for mass insertion.
    :copyright: (c) 2015 by Tim Simmons.
    :license: BSD, see LICENSE for more details.
"""
import sys

def proto(line):
    result = "*%s\r\n$%s\r\n%s\r\n" % (str(len(line)), str(len(line[0])), line[0])
    for arg in line[1:]:
        result += "$%s\r\n%s\r\n" % (str(len(arg)), arg)
    return result

if __name__ == "__main__":
    try:
        filename = sys.argv[1]
        f = open(filename, 'r')
    except IndexError:
        f = sys.stdin.readlines()
    for line in f:
        # Write without an extra newline; every argument already ends with \r\n.
        sys.stdout.write(proto(line.rstrip().split(' ')))
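Since the keys and values in the question are raw bytes (UUID bytes and a packed score), a bytes-based variant of proto may be easier to work with than the text version. The following is only a minimal sketch and is not part of redis-mass.py; encode_command is a hypothetical helper name, the UUID and score are made-up sample values, and it assumes the same protocol layout that proto produces above.

```python
import struct
import uuid


def encode_command(*args):
    """Encode one Redis command in the same wire format as proto(), but as bytes."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        if isinstance(arg, str):
            arg = arg.encode("utf-8")
        # The length prefix counts bytes, which matters for binary keys and values.
        parts.append(b"$%d\r\n%s\r\n" % (len(arg), arg))
    return b"".join(parts)


# Text arguments, equivalent to the line "SET Key0 Value0".
payload = encode_command("SET", "Key0", "Value0")

# Binary arguments like the ones built in the question (sample values only).
key = uuid.UUID("12345678123456781234567812345678").bytes
binary_payload = encode_command(b"SET", key, struct.pack("B", 42))
```

The resulting bytes could be written to sys.stdout.buffer and piped into redis-cli --pipe in the same way as the text file above.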
First of all, thanks to this community for all the advice; it's really appreciated!
This is my first venture into parallel processing. I have been looking into Dask on my own, but I am having trouble actually coding it, and to be honest I am really lost.
In one of my projects, I want to request URLs and retrieve observation data (from meteorological stations) from XML files.
For each URL, I run several steps: retrieve the data from the URL, parse the XML into a dataframe, apply a filter, and store the data in a MySQL database.
So I need to loop these steps over thousands of URLs (stations).
I wrote sequential code, and it takes 300 s to finish, which is really too long and not efficient.
Since the same processing is applied to each station, I think I can speed up the computation, but I don't know where to start. I used delayed from dask, but I don't think it's the best approach.
This is my code so far:
First I have some functions.
def xml_to_dataframe(ood_xml):
    tmp_file = wget.download(ood_xml)
    prstree = ETree.parse(tmp_file)
    root = prstree.getroot()
    ################ Section to retrieve data for one station and apply parameter
    all_obs = []
    for obs in root.iter('observations'):
        ood_observation = []
        for n, param in enumerate(list_parameters):
            x=obs.find(variable_to_check).text
            ood_observation.append(x)
        all_obs.append(ood_observation)
    return(pd.DataFrame(all_obs, columns=list_parameters))

def filter_criteria(df,threshold,criteria):
    if criteria in df.columns:
        result = []
        for index, row in df.iterrows():
            if pd.to_numeric(row[criteria],errors='coerce') >= threshold:
                result.append(index)
        return result
    else:
        #print(criteria + ' parameter does not exist for this station !!! ')
        return([])

def get_and_filter_data(filename,criteria,threshold):
    try:
        xmlToDf = xml_to_dataframe(filename)
        final_df = xmlToDf.loc[filter_criteria(xmlToDf,threshold,criteria)]
        # some MySQL connection and instructions....
    except:
        pass
And then the main code I want to parallelise:
criteria = 'temperature'
threshold = 22
filenames = ['url1.html', 'url2.html', 'url3.html']
for file in filenames:
    get_and_filter_data(file,criteria,threshold)
Do you have any advice on how to do this?
Many thanks for your help!
Guillaume
Not 100% sure this is what you are after, but one way is via delayed:
from dask import delayed, compute
delayeds = [delayed(get_and_filter_data)(file,criteria,threshold) for file in filenames]
results = compute(delayeds)
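Because each station is mostly waiting on a download and a database write, a plain thread pool from the standard library is another option worth trying. This is a different technique from dask.delayed, shown only as a minimal sketch: process_station is a hypothetical stand-in for get_and_filter_data, and run_all and max_workers=8 are illustrative names and values, not something from the question.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_station(url, criteria, threshold):
    """Hypothetical stand-in for get_and_filter_data; returns a marker string."""
    return f"{url}:{criteria}>={threshold}"


def run_all(urls, criteria, threshold, max_workers=8):
    """Run process_station for every URL in a thread pool and collect the results."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(process_station, url, criteria, threshold): url
            for url in urls
        }
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # Keep going if one station fails, but remember why it failed.
                results[url] = exc
    return results


results = run_all(["url1.html", "url2.html", "url3.html"], "temperature", 22)
```

Storing the exception per URL, rather than a bare except: pass, makes it possible to see afterwards which stations failed.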
I have a pandas dataframe with strings that I'm using to query an API and return the results.
I'm trying to call the API using a function and .apply and then save the results from the api call into a csv file. The problem is that I'm trying to do 10000+ requests and my kernel/notebook crashes. Basically I'm trying to do a big operation and I'm guessing I'm running out of memory. So I'm trying to think of a way I can do these api calls and save the results and not have it all crash. My version with .apply works with a small amount of data but not once it gets larger.
So my notebook code currently looks something like this.
df = pd.read_csv('bigstringlist.csv')
df = df.loc[0:3000]
My function looks something like this.
def api_fetch_func(address):
    sleep(.2)
    API_PRIVATE = 'awewaefawefawef'
    encoded = urllib.parse.quote(address)
    query ='https://apitocall' + str(encoded) + \
        '.json?limit=1&key=' \
        + API_PRIVATE
    response = requests.get(query)
    while True:
        try:
            jsonResponse = response.json()
            break
        except:
            response = requests.get(query)
    try:
        return jsonResponse['results']
    except:
        return
    else:
        return
Then I'm calling the function like so
df['response_col'] = df['string_col'].apply(api_fetch_func)
Something tells me that .apply isn't the right thing to do here. Would it be better if I just pushed the API responses into an array or another dataframe?
Should I just use .iterrows to loop over the list of strings and call the function? Something tells me .apply tries to jam too much into memory and that's why this doesn't work.
So I was going to try
results = []
for index, row in df.iterrows():
    # call API
    # push results to array
Or is there another way to do this?
If it's a memory issue, what I'd do is write the API calling function as a generator with the yield statement. Then, you can loop through the api_fetch_function generator and save smaller data frames for the csv files rather than holding everything in memory in one go.
for idx, response in api_fetch_generator():
    if idx % 500 == 0:
        df = create_df() # create a fresh df as you did above with 'string_col'.
    df['response_col'] = df['string_col'].apply(response)
    if (idx % 500 == 0) and idx != 0:
        # Save the df using idx to control the file name
        # (integer division keeps the batch number free of a ".0" suffix)
        df.to_csv(f"response_batch_{idx // 500}.csv")
# Combine the csv's after everything is saved.
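To make the batching idea concrete, here is a minimal self-contained sketch using only the standard library. It is not the code from the question: fetch is a hypothetical stand-in for the real API call, and fetch_all, save_in_batches and the batch size are illustrative names and values.

```python
import csv
import os
import tempfile


def fetch(address):
    """Hypothetical stand-in for the real API call."""
    return address.upper()


def fetch_all(addresses):
    """Yield (address, response) pairs one at a time instead of building a list."""
    for address in addresses:
        yield address, fetch(address)


def save_in_batches(addresses, out_dir, batch_size=500):
    """Write responses to numbered CSV files, holding at most one batch in memory."""
    paths, batch = [], []

    def flush():
        path = os.path.join(out_dir, f"response_batch_{len(paths)}.csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["string_col", "response_col"])
            writer.writerows(batch)
        paths.append(path)
        batch.clear()

    for row in fetch_all(addresses):
        batch.append(row)
        if len(batch) == batch_size:
            flush()
    if batch:
        # Write the final, partial batch.
        flush()
    return paths


out_dir = tempfile.mkdtemp()
paths = save_in_batches([f"addr{i}" for i in range(5)], out_dir, batch_size=2)
```

If the notebook crashes partway through, the batches already written survive, so only the remaining addresses need to be fetched again.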
So I have these given functions:
def make_event_df(match_id, path):
    '''
    Function for making event dataframe.

    Argument:
    match_id -- int, the required match id for which event data will be constructed.
    path -- str, path to .json file containing event data.

    Returns:
    df -- pandas dataframe, the event dataframe for the particular match.
    '''
    ## read in the json file
    event_json = json.load(open(path, encoding='utf-8'))
    ## normalize the json data
    df = json_normalize(event_json, sep='_')
    return df

def full_season_events(comp_name, match_df, match_ids, path):
    '''
    Function to make event dataframe for a full season.

    Arguments:
    comp_name -- str, competition name + season name
    match_df -- pandas dataframe, containing match-data
    match_ids -- list, list of match id.
    path -- str, path to directory where .json file is listed.
            e.g. '../input/Statsbomb/data/events'

    Returns:
    event_df -- pandas dataframe, containing event data for the whole season.
    '''
    ## init an empty dataframe
    event_df = pd.DataFrame()
    for match_id in tqdm(match_ids, desc=f'Making Event Data For {comp_name}'):
        ## .json file
        temp_path = path + f'/{match_id}.json'
        temp_df = make_event_df(match_id, temp_path)
        event_df = pd.concat([event_df, temp_df], sort=True)
    return event_df
Now I am running this piece of code to get the dataframe:
comp_id = 11
season_id = 1
path = f'../input/Statsbomb/data/matches/{comp_id}/{season_id}.json'
match_df = get_matches(comp_id, season_id, path)
comp_name = match_df['competition_name'].unique()[0] + '-' + match_df['season_name'].unique()[0]
match_ids = list(match_df['match_id'].unique())
path = f'../input/Statsbomb/data/events'
event_df = full_season_events(comp_name, match_df, match_ids, path)
The above code snippet is giving me this output:
Making Event Data For La Liga-2017/2018: 100%|██████████| 36/36 [00:29<00:00, 1.20it/s]
How can I use multiprocessing to make this faster, i.e. how can I use the match_ids in full_season_events() to grab the data from the JSON files more quickly? I am very new to joblib and multiprocessing. Can someone tell me what changes I have to make in these functions to get the required results?
You don't need joblib here, just plain multiprocessing will do.
I'm using imap_unordered since it's faster than imap or map, but doesn't retain order (each worker can receive and submit jobs out of order). Not retaining order doesn't seem to matter since you're sort=Trueing anyway.
Because I'm using imap_unordered, there's that need for additional jobs finagling; there's no istarmap_unordered which would unpack parameters, so we need to do it ourselves.
If you have many match_ids, things can be sped up with e.g. chunksize=10 to imap_unordered; it means each worker process will be fed 10 jobs at a time, and they will also return 10 jobs at a time. It's faster since less time is spent in process synchronization and serialization, but on the other hand the TQDM progress bar will update less often.
As usual, the code below is dry-coded and might not work OOTB.
import multiprocessing

def make_event_df(job):
    # Unpack parameters from job tuple
    match_id, path = job
    with open(path) as f:
        event_json = json.load(f)
    # Return the match id (if required) and the result.
    return (match_id, json_normalize(event_json, sep="_"))

def full_season_events(comp_name, match_df, match_ids, path):
    event_df = pd.DataFrame()
    with multiprocessing.Pool() as p:
        # Generate job tuples
        jobs = [(match_id, path + f"/{match_id}.json") for match_id in match_ids]
        # Run & get results from multiprocessing generator
        for match_id, temp_df in tqdm(
            p.imap_unordered(make_event_df, jobs),
            total=len(jobs),
            desc=f"Making Event Data For {comp_name}",
        ):
            event_df = pd.concat([event_df, temp_df], sort=True)
    return event_df
I'm trying to read a Retrosheet event file into Spark. The event file is structured as follows.
id,TEX201403310
version,2
info,visteam,PHI
info,hometeam,TEX
info,site,ARL02
info,date,2014/03/31
info,number,0
info,starttime,1:07PM
info,daynight,day
info,usedh,true
info,umphome,joycj901
info,attendance,49031
start,reveb001,"Ben Revere",0,1,8
start,rollj001,"Jimmy Rollins",0,2,6
start,utlec001,"Chase Utley",0,3,4
start,howar001,"Ryan Howard",0,4,3
start,byrdm001,"Marlon Byrd",0,5,9
id,TEX201404010
version,2
info,visteam,PHI
info,hometeam,TEX
As you can see, the events loop back for each game.
I've read the file into an RDD, and then via a second for loop added a key for each game, which appears to work. But I was hoping to get some feedback on whether there is a cleaner way to do this using Spark methods.
logFile = '2014TEX.EVA'
event_data = (sc
              .textFile(logFile)
              .collect())
idKey = 0
newevent_list = []
for line in event_data:
    if line.startswith('id'):
        idKey += 1
        newevent_list.append((idKey,line))
    else:
        newevent_list.append((idKey,line))
event_data = sc.parallelize(newevent_list)
PySpark has supported Hadoop input formats since version 1.1. You can use the textinputformat.record.delimiter option to set a custom record delimiter, as below:
from operator import itemgetter

retrosheet = sc.newAPIHadoopFile(
    '/path/to/retrosheet/file',
    'org.apache.hadoop.mapreduce.lib.input.TextInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'org.apache.hadoop.io.Text',
    conf={'textinputformat.record.delimiter': '\nid,'}
)
(retrosheet
    .filter(itemgetter(1))
    .values()
    .filter(lambda x: x)
    .map(lambda v: (
        v if v.startswith('id') else 'id,{0}'.format(v)).splitlines()))
Since Spark 2.4 you can also read the data into a DataFrame using the text reader:
spark.read.option("lineSep", '\nid,').text('/path/to/retrosheet/file')
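To see what the '\nid,' delimiter does to the file without starting Spark, the same splitting can be emulated in plain Python. This is only an illustrative sketch: split_games is a hypothetical helper, SAMPLE is a shortened copy of the data in the question, and nothing here is a Spark API.

```python
# Shortened sample in the same layout as the event file from the question.
SAMPLE = (
    "id,TEX201403310\n"
    "version,2\n"
    "info,visteam,PHI\n"
    "id,TEX201404010\n"
    "version,2\n"
    "info,visteam,PHI\n"
)


def split_games(text, delimiter="\nid,"):
    """Return one list of lines per game, splitting on the record delimiter."""
    games = []
    for record in text.split(delimiter):
        if not record:
            continue
        # Every record except the first has lost its leading 'id,' to the split.
        if not record.startswith("id,"):
            record = "id," + record
        games.append(record.splitlines())
    return games


games = split_games(SAMPLE)
```

This mirrors the map step above: the delimiter is consumed by the split, so 'id,' has to be put back on every record except the first.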
What is the quickest way to insert a pandas DataFrame into mongodb using PyMongo?
Attempts
db.myCollection.insert(df.to_dict())
gave an error
InvalidDocument: documents must have only string keys, the key was
Timestamp('2013-11-23 13:31:00', tz=None)
db.myCollection.insert(df.to_json())
gave an error
TypeError: 'str' object does not support item assignment
db.myCollection.insert({id: df.to_json()})
gave an error
InvalidDocument: documents must have only string a keys, key was <built-in function id>
df
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 150 entries, 2013-11-23 13:31:26 to 2013-11-23 13:24:07
Data columns (total 3 columns):
amount 150 non-null values
price 150 non-null values
tid 150 non-null values
dtypes: float64(2), int64(1)
This is the quickest way: use the insert_many method from PyMongo 3 with the 'records' argument of the to_dict method.
db.collection.insert_many(df.to_dict('records'))
I doubt there is a method that is both the quickest and simple. If you don't mind data conversion, you can do
>>> import json
>>> df = pd.DataFrame.from_dict({'A': {1: datetime.datetime.now()}})
>>> df
A
1 2013-11-23 21:14:34.118531
>>> records = json.loads(df.T.to_json()).values()
>>> db.myCollection.insert(records)
But in case you try to load data back, you'll get:
>>> df = read_mongo(db, 'myCollection')
>>> df
A
0 1385241274118531000
>>> df.dtypes
A int64
dtype: object
so you'll have to convert the 'A' column back to datetimes, as well as any fields in your DataFrame that are not int, float or str. For this example:
>>> df['A'] = pd.to_datetime(df['A'])
>>> df
A
0 2013-11-23 21:14:34.118531
odo can do it using
odo(df, db.myCollection)
If your dataframe has missing data (i.e. None, NaN) and you don't want null key values in your documents:
db.insert_many(df.to_dict("records")) will insert keys with null values. If you don't want the empty key values in your documents, you can use a modified version of the pandas .to_dict("records") code below:
from pandas.core.common import _maybe_box_datetimelike
my_list = [dict((k, _maybe_box_datetimelike(v)) for k, v in zip(df.columns, row) if v != None and v == v) for row in df.values]
db.insert_many(my_list)
The condition if v != None and v == v, which I've added, checks that the value is neither None nor NaN before it is put in the row's dictionary. Now your .insert_many will only include keys that have values (and no null data types).
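The same filtering can also be applied to records that have already been produced, without relying on pandas internals. The following is a minimal sketch under that assumption: drop_missing is a hypothetical helper, and the sample records stand in for the output of df.to_dict("records").

```python
import math


def drop_missing(records):
    """Remove keys whose value is None or NaN from each record dict."""
    cleaned = []
    for record in records:
        cleaned.append({
            key: value
            for key, value in record.items()
            # NaN is the only value that is not equal to itself.
            if value is not None and value == value
        })
    return cleaned


# Sample records standing in for df.to_dict("records").
records = [
    {"amount": 1.5, "price": math.nan, "tid": 7},
    {"amount": None, "price": 2.0, "tid": 8},
]
cleaned = drop_missing(records)
```

The cleaned list could then be passed to insert_many in place of the raw records.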
I think there are cool ideas in this question. In my case I have spent more time on moving large dataframes around. In those cases pandas tends to give you a chunksize option (for example in pandas.DataFrame.to_sql). So I think I can contribute here by adding the function I use for this.
def write_df_to_mongoDB(my_df,
                        database_name='mydatabasename',
                        collection_name='mycollectionname',
                        server='localhost',
                        mongodb_port=27017,
                        chunk_size=100):
    """
    Take a dataframe and write it to a collection in MongoDB (you should
    provide the database name, the collection name, and the server and port
    of the database).

    ---------------------------------------------------------------------------
    Parameters / Input
        my_df: the dataframe to send to MongoDB
        database_name: database name
        collection_name: collection name (to create)
        server: the server where the MongoDB database is hosted
            Example: server = 'XXX.XXX.XX.XX'
        mongodb_port: the port where the database is operating
            For example: mongodb_port = 27017
        chunk_size: the number of records that will be sent to the database
            at the same time. Default is 100.

    Output
        When finished will print "Done"
    ----------------------------------------------------------------------------
    FUTURE modifications.
    1. Write to SQL
    2. Write to csv
    ----------------------------------------------------------------------------
    30/11/2017: Rafael Valero-Fernandez. Documentation
    """
    # To connect
    # import os
    # import pandas as pd
    # import pymongo
    # from pymongo import MongoClient
    client = MongoClient(server, int(mongodb_port))
    db = client[database_name]
    collection = db[collection_name]
    # To write
    collection.delete_many({})  # Empty the collection
    #aux_df=aux_df.drop_duplicates(subset=None, keep='last') # To avoid repetitions
    my_list = my_df.to_dict('records')
    l = len(my_list)
    # Chunk boundaries: chunk_size, 2 * chunk_size, ..., then the end of the list
    steps = list(range(chunk_size, l, chunk_size))
    steps.append(l)
    # Insert chunks of the dataframe
    i = 0
    for j in steps:
        print(j)
        collection.insert_many(my_list[i:j])  # fill the collection
        i = j
    print('Done')
    return
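The chunking step on its own can be written as a small generator, which may be easier to test separately from the database. This is a minimal sketch, not the function above: iter_chunks is a hypothetical helper and the 250 sample documents are made up.

```python
def iter_chunks(items, chunk_size=100):
    """Yield consecutive slices of items, each at most chunk_size long."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


# Made-up documents standing in for my_df.to_dict('records').
docs = [{"n": i} for i in range(250)]
sizes = [len(chunk) for chunk in iter_chunks(docs, chunk_size=100)]
# In the function above, each chunk would be passed to collection.insert_many(chunk).
```

One difference from the loop above: an empty list yields no chunks at all, so insert_many would never be called with an empty batch.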
I use the following part to insert the dataframe to a collection in the database.
df.reset_index(inplace=True)
data_dict = df.to_dict("records")
myCollection.insert_many(data_dict)
How about this:
db.myCollection.insert({id: df.to_json()})
where id is a unique string for that df
Just make string keys!
import json
dfData = json.dumps(df.to_dict('records'))
savaData = {'_id': 'a8e42ed79f9dae1cefe8781760231ec0', 'df': dfData}
res = client.insert_one(savaData)
##### load dfData
data = client.find_one({'_id': 'a8e42ed79f9dae1cefe8781760231ec0'}).get('df')
dfData = json.loads(data)
df = pd.DataFrame.from_dict(dfData)
If you want to send several at one time:
db.myCollection.insert_many(df.apply(lambda x: x.to_dict(), axis=1).to_list())
If you want to make sure that you're not raising InvalidDocument errors, then something like the following is a good idea. This is because mongo does not recognize types such as np.int64, np.float64, etc.
import numpy as np
from pymongo import MongoClient

client = MongoClient()
db = client.test
col = db.col

def createDocsFromDF(df, collection = None, insertToDB=False):
    docs = []
    fields = [col for col in df.columns]
    for i in range(len(df)):
        doc = {col:df[col][i] for col in df.columns if col != 'index'}
        for key, val in doc.items():
            # we have to do this, because mongo does not recognize these np. types
            if type(val) == np.int64:
                doc[key] = int(val)
            if type(val) == np.float64:
                doc[key] = float(val)
            if type(val) == np.bool_:
                doc[key] = bool(val)
        docs.append(doc)
    # Compare with None: pymongo collections do not support truth-value testing.
    if insertToDB and collection is not None:
        # Insert into the collection that was passed in, not db.collection.
        collection.insert_many(docs)
    return docs
For upserts this worked.
for r in df2.to_dict(orient="records"):
    db['utest-pd'].update_one({'a':r['a']},{'$set':r}, upsert=True)
It does one record at a time, but update_many did not seem able to work with a different filter value for each record.