Parallel Processing using Multiprocessing in Python - python

I'm new to doing parallel processing in Python. I have a large dataframe with names and the list of countries that the person lived in. A sample dataframe is this:
I have a chunk of code that takes in this dataframe and splits the countries to separate columns. The code is this:
def split_country(data):
    d_list = []
    for index, row in data.iterrows():
        for value in str(row['Country']).split(','):
            d_list.append({'Name': row['Name'],
                           'value': value})
    data = data.append(d_list, ignore_index=True)
    data = data.groupby('Name')['value'].value_counts()
    data = data.unstack(level=-1).fillna(0)
    return data
The final output is something like this:
I'm trying to parallelize the above process by passing my dataframe (df) using the following:
import multiprocessing as mp
result = []
pool = mp.Pool(mp.cpu_count())
result.append(pool.map(split_country, [row for row in df]))
But the processing does not stop, even with a toy dataset like the one above. I'm completely new to this, so I would appreciate any help.

multiprocessing is probably not required here. Using pandas vectorized methods will be sufficient to quickly produce the desired result.
For a test DataFrame with 1M rows, the following code took 1.54 seconds.
First, use pandas.DataFrame.explode on the column of lists
If the column is strings, first use ast.literal_eval to convert them to list type
df.countries = df.countries.apply(ast.literal_eval)
If the data is read from a CSV file, use df = pd.read_csv('test.csv', converters={'countries': literal_eval})
For this question, it's better to use pandas.get_dummies to get a count of each country per name, then pandas.DataFrame.groupby on 'name', and aggregate with .sum
import pandas as pd
from ast import literal_eval
# sample data
data = {'name': ['John', 'Jack', 'James'], 'countries': [['USA', 'UK'], ['China', 'UK'], ['Canada', 'USA']]}
# create the dataframe
df = pd.DataFrame(data)
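# if the countries column holds strings such as "['USA', 'UK']", uncomment the next line to evaluate them to lists
# (calling literal_eval on values that are already lists raises a ValueError)
# df.countries = df.countries.apply(literal_eval)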
# explode the lists
df = df.explode('countries')
# use get_dummies and groupby name and sum
df_counts = pd.get_dummies(df, columns=['countries'], prefix_sep='', prefix='').groupby('name', as_index=False).sum()
# display(df_counts)
    name  Canada  China  UK  USA
0   Jack       0      1   1    0
1  James       1      0   0    1
2   John       0      0   1    1
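If the column holds plain comma-separated strings, as the split(',') call in the question suggests, a shorter route may be Series.str.get_dummies. The sketch below assumes hypothetical column names 'Name' and 'Country' and invented sample values. Note that str.get_dummies marks presence (0 or 1) per row, so a country repeated inside a single string is counted once.

```python
import pandas as pd

# Hypothetical sample: 'Country' holds comma-separated strings
df = pd.DataFrame({
    'Name': ['John', 'Jack', 'James'],
    'Country': ['USA,UK', 'China,UK', 'Canada,USA'],
})

# One 0/1 indicator column per country, then sum the indicators per name
counts = df['Country'].str.get_dummies(sep=',').groupby(df['Name']).sum()
print(counts)
```

If the strings contain spaces after the commas, they would need to be stripped first, otherwise ' UK' and 'UK' end up as separate columns.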


Combining Successive Pandas Dataframes in One Master Dataframe via a Loop

I'm trying to loop through a series of tickers cleaning the associated dataframes then combining the individual ticker dataframes into one large dataframe with columns named for each ticker. The following code enables me to loop through unique tickers and name the columns of each ticker's dataframe after the specific ticker:
import pandas as pd
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
However, I don't know how to create a master dataframe where I add each new ticker to the master dataframe. With that in mind, I'd like to align each new ticker's data using the datetime index. So, if tkr1 has data for 6/25/22, 6/26/22, 6/27/22, and tkr2 has data for 6/26/22, and 6/27/22, the combined dataframe would show all three dates but would produce a NaN for ticker 2 on 6/25/22 since there is no data for that ticker on that date.
When not in a loop looking to append each successive ticker to a larger dataframe (as per above), the following code does what I'd like. But it doesn't work when looping and adding new ticker data for each successive loop (or I don't know how to make it work in the confines of a loop).
combined = pd.concat((df1, df2, df3,...,dfn), axis=1)
Many thanks in advance.
You should only create the master DataFrame after the loop. Appending to the master DataFrame in each iteration via pandas.concat is slow, because every call copies all the data collected so far into a new DataFrame.
Instead, read each ticker DataFrame, clean it, and append it to a list that stores all the ticker DataFrames. After the loop, create the master DataFrame from the whole list with a single pandas.concat call:
import pandas as pd
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f1.index = f1['Date']
    keep = ['Col1', 'Col2']
    f2 = f1[keep]
    f2.columns = [tkr + 'Col1', tkr + 'Col2']
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv')
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
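To illustrate the date alignment asked about in the question, here is a minimal sketch with two small in-memory frames standing in for the cleaned ticker data (the values are invented for illustration). With axis=1, pandas.concat takes the union of the indexes by default and fills the gaps with NaN.

```python
import pandas as pd

# Hypothetical stand-ins for two cleaned ticker DataFrames
tkr1 = pd.DataFrame(
    {'tkr1Col1': [1.0, 2.0, 3.0]},
    index=pd.to_datetime(['2022-06-25', '2022-06-26', '2022-06-27']),
)
tkr2 = pd.DataFrame(
    {'tkr2Col1': [20.0, 30.0]},
    index=pd.to_datetime(['2022-06-26', '2022-06-27']),
)

# Outer join on the datetime index: tkr2 has no row for 2022-06-25,
# so its column holds NaN on that date
master_df = pd.concat([tkr1, tkr2], axis=1)
print(master_df)
```

Passing join='inner' instead would keep only the dates present in every frame.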
As a suggestion here is a cleaner way of defining your clean_func using DataFrame.set_index and DataFrame.add_prefix.
def clean_func(tkr, f1):
    f1['Date'] = pd.to_datetime(f1['Date'])
    f2 = f1.set_index('Date')[['Col1', 'Col2']].add_prefix(tkr)
    return f2
Or, if you prefer, you can parse the Date column as datetime and set it as the index directly in the pd.read_csv call by specifying the index_col and parse_dates parameters (the two can be used together):
import pandas as pd
def clean_func(tkr, f1):
    f2 = f1[['Col1', 'Col2']].add_prefix(tkr)
    return f2

tkrs = ['tkr1', 'tkr2', 'tkr3']
dfs_list = []
for tkr in tkrs:
    df1 = pd.read_csv(f'C:\\path\\{tkr}.csv', index_col='Date', parse_dates=['Date'])
    df2 = clean_func(tkr, df1)
    dfs_list.append(df2)
master_df = pd.concat(dfs_list, axis=1)
Before the loop create an empty df with:
combined = pd.DataFrame()
Then within the loop (after loading df1 - see code above):
combined = pd.concat((combined, clean_func(tkr, df1)), axis=1)
If you get:
TypeError: concat() got multiple values for argument 'axis'
Make sure your parentheses are correct per above.
With the code above, you can skip the original step:
df2 = clean_func(tkr,df1)
since it is embedded in the concat call. Alternatively, you could keep the df2 step and use:
combined = pd.concat((combined,df2), axis=1)
Just make sure the dataframes are wrapped together in a single tuple or list, because concat expects the objects to combine as its first argument.
Same answer as GC123 but here is a full example which mimics reading from separate files and concatenating them
import pandas as pd
import io
fake_file_1 = io.StringIO("""
fruit,store,quantity,unit_price
apple,fancy-grocers,2,9.25
pear,fancy-grocers,3,100
banana,fancy-grocers,1,256
""")
fake_file_2 = io.StringIO("""
fruit,store,quantity,unit_price
banana,bargain-grocers,667,0.01
apple,bargain-grocers,170,0.15
pear,bargain-grocers,281,0.45
""")
fake_files = [fake_file_1,fake_file_2]
combined = pd.DataFrame()
for fake_file in fake_files:
    df = pd.read_csv(fake_file)
    df = df.set_index('fruit')
    combined = pd.concat((combined, df), axis=1)
print(combined)
This method is slightly more efficient:
combined = []
for fake_file in fake_files:
    combined.append(pd.read_csv(fake_file).set_index('fruit'))
combined = pd.concat(combined, axis=1)
print(combined)
Output:
                store  quantity  unit_price            store  quantity  unit_price
fruit
apple   fancy-grocers         2        9.25  bargain-grocers       170        0.15
pear    fancy-grocers         3      100.00  bargain-grocers       281        0.45
banana  fancy-grocers         1      256.00  bargain-grocers       667        0.01

.replace codes will not replace column with new column in python

I am trying to read a column in python, and create a new column using python.
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
I tried this, but it will not create a new column no matter what I do.
It seems like you made a silly mistake
import pandas as pd
df = pd.read_csv (r'C:\Users\User\Documents\Research\seqadv.csv')
print (df)
df = pd.DataFrame(data={'WT_RESIDUE':['']}) # Why do you have this line?
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df['WT_RESIDUE'].replace(codes)
df.to_csv (r'C:\Users\User\Documents\Research\output.csv')
Try removing the line marked with the comment. It replaces the DataFrame you just read from the CSV with a new one-row DataFrame whose only WT_RESIDUE value is an empty string, so .replace(codes) has nothing to match.
Consider a sample of the provided input.
We can use the map function to look up each value of the existing column in the dict and store the corresponding value in a new column. Unlike replace, map returns NaN for values that have no key in the dict.
df = pd.DataFrame({
    'WT_RESIDUE': ['ALA', "REMARK", 'VAL', "LYS"]
})
codes = {'ALA':'A', 'ARG':'R', 'ASN':'N', 'ASP':'D', 'CYS':'C', 'GLU':'E', 'GLN':'Q', 'GLY':'G', 'HIS':'H', 'ILE':'I', 'LEU':'L', 'LYS':'K', 'MET':'M', 'PHE':'F', 'PRO':'P', 'SER':'S', 'THR':'T', 'TRP':'W', 'TYR':'Y', 'VAL':'V'}
df['MUTATION_CODE'] = df.WT_RESIDUE.map(codes)
Input
  WT_RESIDUE
0        ALA
1     REMARK
2        VAL
3        LYS
Output
  WT_RESIDUE MUTATION_CODE
0        ALA             A
1     REMARK           NaN
2        VAL             V
3        LYS             K
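For comparison, here is a small sketch of how map and replace treat a value that is missing from the dict, plus one way to keep the original value when using map. The codes dict is abbreviated for brevity, and the variable names mapped, replaced, and kept are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'WT_RESIDUE': ['ALA', 'REMARK', 'VAL', 'LYS']})
codes = {'ALA': 'A', 'VAL': 'V', 'LYS': 'K'}  # abbreviated for the example

# map: values without a key in the dict become NaN
mapped = df['WT_RESIDUE'].map(codes)

# replace: values without a key in the dict are left unchanged
replaced = df['WT_RESIDUE'].replace(codes)

# map, then fall back to the original value where no code was found
kept = mapped.fillna(df['WT_RESIDUE'])

print(pd.DataFrame({'mapped': mapped, 'replaced': replaced, 'kept': kept}))
```

Which behaviour is preferable depends on whether unmapped rows such as REMARK should stand out as missing or pass through untouched.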

Dask : NotImplementedError: `df.column.cat.codes` with unknown categories is not supported

I used this code to create a product id column in a dataframe:
df = df.assign(id=(df['PROD_NAME']).astype('category').cat.codes)
This code works fine with pandas. The line creates an id for each PROD_NAME value.
My issue is that I want to use Dask, which lets me manage several clients and handle memory issues.
I get the following error message:
NotImplementedError: `df.column.cat.codes` with unknown categories is not supported. Please use `column.cat.as_known()` or `df.categorize()` beforehand to ensure known categories
How can I create this new column, then?
This is an old post, but being the first that comes up when searching for this error, it could use an answer:
TL;DR:
Run this sequence on your Dask dataframe:
ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_known()
ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
out_df = ddf.compute()
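Per Dask's documentation, categorical columns in Dask can be converted between "known categoricals" and "unknown categoricals". Here the categories must be known, because .cat.codes reads the category mapping from the column metadata. If PROD_NAME is not categorical yet, cast it with .astype('category') first; in Dask that cast yields unknown categories, which is why the as_known() call is needed.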
import pandas as pd
from dask import dataframe as dd
# Show the pandas workflow
>>> d = pd.Series(['A','B','D'], dtype='category').to_frame(name="PROD_NAME")
>>> d = d.assign(id=(d["PROD_NAME"]).astype('category').cat.codes)
>>> d
  PROD_NAME  id
0         A   0
1         B   1
2         D   2
# Now, in Dask:
>>> ddf = dd.from_pandas(d, npartitions=1)
>>> ddf
Dask DataFrame Structure:
                     PROD_NAME
npartitions=1
0              category[known]
2                          ...
Dask Name: from_pandas, 1 tasks
# The conversion to Dask dataframe already created a "known categorical", but
# let's convert it to "unknown" (notice the .compute() is not used):
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_unknown()
>>> ddf
Dask DataFrame Structure:
                       PROD_NAME
npartitions=1
0              category[unknown]
2                            ...
Dask Name: assign, 3 tasks
# Now, let's convert it back to "known", then create the new column using .assign()
# and call .compute() to create output dataframe:
>>> ddf["PROD_NAME"] = ddf["PROD_NAME"].cat.as_known()
>>> ddf = ddf.assign(id=(ddf["PROD_NAME"].cat.codes))
>>> out_df = ddf.compute()
>>> out_df
  PROD_NAME  id
0         A   0
1         B   1
2         D   2

Save data frame from inside for loop

I have a function that takes in a dataframe and returns a (reduced) dataframe, e.g. like this:
def transforming_data(dataframe, col_1, col_2, normalized=True):
    ''' takes in dataframe, groups col_1 according to col_2 and returns dataframe
    '''
    df = dataframe[col_1].groupby(dataframe[col_2]).value_counts(normalize=normalized).unstack(fill_value=0)
    return df
For the following code, this gives me:
import pandas as pd
import numpy as np
np.random.seed(12)
def transforming_data(df, col_1, col_2, normalized=True):
    ''' takes in df, groups col_1 according to col_2 and returns df '''
    # use the df argument rather than the global dataframe
    df = df[col_1].groupby(df[col_2]).value_counts(normalize=normalized).unstack(fill_value=0)
    return df

numrows = 1000
dataframe = pd.DataFrame({'Numerical': np.random.randn(numrows),
                          'Category': np.random.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
                          'Response 1': np.random.choice(['Yes', 'Maybe', 'No', 'Don\'t know'], numrows),
                          'Response 2': np.random.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows)})
test = transforming_data(dataframe, 'Response 1', 'Category')
print(test)
# Output
# Response 1  Don't know     Maybe        No       Yes
# Category
# Anaconda      0.275229  0.232416  0.217125  0.275229
# Elephant      0.220588  0.270588  0.255882  0.252941
# Panda         0.258258  0.222222  0.273273  0.246246
So far, so good.
Now I want to use the function transforming_data inside a for loop for every column in dataframe (as I have lots of columns, not just two) and save the resulting dataframe to a new dataframe, e.g. test_response_1 and test_response_2 for this example.
Can someone point me in the right direction - i.e. how to implement the loop correctly?
So far, I am using something like this - but cannot figure out how to save the data frame
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    # here, I need to save temp_df outside of the loop but don't know how to
Thanks a lot for pointers and help. (Note: the most similar question I found does not talk about actually saving the data frame, so it doesn't help me with this.)
If you want to save (in memory) all of the temp_df's from your loop, you can append them to a list that you can then index afterwards:
temp_dfs = []
for column in dataframe.columns.tolist():  # you don't actually need the tolist() method here
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs.append(temp_df)
If you would rather access these temp_df's by the name of the column that was used to transform them, you could store each one in a dictionary, using the column as the key:
temp_dfs = {}
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_dfs[column] = temp_df
If by "save" you meant "write to disk", then you can use one of the many to_<file_format>() methods that pandas provides:
for column in dataframe.columns.tolist():
    temp_df = transforming_data(dataframe, column, 'Category')
    temp_df.to_csv('temp_df{}.csv'.format(column))
Here's the to_csv() docs.
The simplest solution would be to save the resulting dataframes in a list. Assuming that all the columns you want to loop over have the text Response in their name:
result_dframes = []
for col_name in dataframe.filter(like='Response').columns:
    result_dframe = transforming_data(dataframe, col_name, 'Category')
    result_dframes.append(result_dframe)
Alternatively, you can obtain exactly the same result with a list comprehension instead of a for-loop:
result_dframes = [
    transforming_data(dataframe, col_name, 'Category')
    for col_name in dataframe.filter(like='Response')
]
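The dictionary idea and the Response filter from the answers above can be combined in a dict comprehension, so that each result can be looked up by column name. The following is only a sketch: it uses a smaller random sample than the question, a NumPy Generator instead of np.random.seed, and the name results is made up for illustration.

```python
import numpy as np
import pandas as pd

def transforming_data(df, col_1, col_2, normalized=True):
    """Share of each col_1 value within each col_2 group."""
    return df[col_1].groupby(df[col_2]).value_counts(normalize=normalized).unstack(fill_value=0)

rng = np.random.default_rng(12)
numrows = 200
dataframe = pd.DataFrame({
    'Numerical': rng.standard_normal(numrows),
    'Category': rng.choice(['Panda', 'Elephant', 'Anaconda'], numrows),
    'Response 1': rng.choice(['Yes', 'Maybe', 'No', "Don't know"], numrows),
    'Response 2': rng.choice(['Very Much', 'Much', 'A bit', 'Not at all'], numrows),
})

# One result DataFrame per Response column, keyed by the column name
results = {
    col: transforming_data(dataframe, col, 'Category')
    for col in dataframe.filter(like='Response').columns
}

print(results['Response 1'])
```

Each result is then available as results['Response 1'], results['Response 2'], and so on, which avoids creating separately named variables such as test_response_1.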

Dictionary in Pandas DataFrame, how to split the columns

I have a DataFrame that consists of one column ('Vals') which is a dictionary. The DataFrame looks more or less like this:
In[215]: fff
Out[213]:
                                             Vals
0   {u'TradeId': u'JP32767', u'TradeSourceNam...
1   {u'TradeId': u'UUJ2X16', u'TradeSourceNam...
2   {u'TradeId': u'JJ35A12', u'TradeSourceNam...
When looking at an individual row the dictionary looks like this:
In[220]: fff['Vals'][100]
Out[218]:
{u'BrdsTraderBookCode': u'dffH',
 u'Measures': [{u'AssetName': u'Ie0',
                u'DefinitionId': u'6dbb',
                u'MeasureValues': [{u'Amount': -18.64}],
                u'ReportingCurrency': u'USD',
                u'ValuationId': u'669bb'}],
 u'SnapshotId': 12739,
 u'TradeId': u'17304M',
 u'TradeLegId': u'31827',
 u'TradeSourceName': u'xxxeee',
 u'TradeVersion': 1}
How can I split the columns and create a new DataFrame, so that I get one column with TradeId and another one with MeasureValues?
Try this:
l = []
# Series.iteritems() was removed in pandas 2.0; items() is the replacement
for idx, row in fff['Vals'].items():
    temp_df = pd.DataFrame(row['Measures'][0]['MeasureValues'])
    temp_df['TradeId'] = row['TradeId']
    l.append(temp_df)
pd.concat(l, axis=0)
Here's a way to get TradeId and MeasureValues (using twice your sample row above to illustrate the iteration):
new_df = pd.DataFrame()
for idx, data in fff.iterrows():
    vals = data['Vals']  # the dictionary stored in this row (.ix was removed in pandas 1.0)
    d = dict(vals['Measures'][0]['MeasureValues'][0])
    d['TradeId'] = vals['TradeId']
    new_df = pd.concat([new_df, pd.DataFrame.from_dict(d, orient='index').T])
   Amount TradeId
0  -18.64  17304M
0  -18.64  17304M
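Both answers read only the first entry of Measures. If a trade can carry several measures or several measure values, one option is to flatten everything into a list of plain dicts and build the DataFrame once at the end. This is a sketch under the assumption that every dictionary has the same shape as the sample row; the two sample rows and their values are invented.

```python
import pandas as pd

# Hypothetical data shaped like the 'Vals' column in the question
fff = pd.DataFrame({'Vals': [
    {'TradeId': '17304M',
     'Measures': [{'AssetName': 'Ie0', 'MeasureValues': [{'Amount': -18.64}]}]},
    {'TradeId': 'JP32767',
     'Measures': [{'AssetName': 'Ie1', 'MeasureValues': [{'Amount': 5.0}, {'Amount': 7.5}]}]},
]})

# One output row per measure value, tagged with the TradeId it belongs to
rows = [
    {'TradeId': vals['TradeId'], **measure_value}
    for vals in fff['Vals']
    for measure in vals['Measures']
    for measure_value in measure['MeasureValues']
]
new_df = pd.DataFrame(rows)
print(new_df)
```

Building the list first and calling the DataFrame constructor once also avoids the repeated pd.concat inside the loop.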
