Python most efficient way to dictionary mapping in pandas dataframe

Python most efficient way to dictionary mapping in pandas dataframe - python

I have a dictionary of dictionaries and each contains a mapping for each column of my dataframe.
My goal is to find the most efficient way to perform mapping for my dataframe with 1 row and 300 columns.
My dataframe is randomly sampled from range(mapping_size); and my dictionaries map values from range(mapping_size) to random.randint(mapping_size+1,mapping_size*2).
I can see from the answer provided by jpp that map is possibly the most efficient way to go but I am looking for something which is even faster than map. Can you think of any? I am happy if the data structure of the input is something else instead of pandas dataframe.
Here is the code for setting up the question and results using map and replace:
# import packages
import random
import pandas as pd
import numpy as np
import timeit
# specify paramters
ncol = 300 # number of columns
nrow = 1 #number of rows
mapping_size = 10 # length of each dictionary
# create a dictionary of dictionaries for mapping
mapping_dict = {}
random.seed(123)
for idx1 in range(ncol):
# create empty dictionary
mapping_dict['col_' + str(idx1)] = {}
for inx2 in range(mapping_size):
# create dictionary of length mapping_size and maps value from range(mapping_size) to random.randint(mapping_size +1 ,mapping_size*2)
mapping_dict['col_' + str(idx1)][inx2+1] = random.randint(mapping_size+1,mapping_size*2)
# Create a dataframe with values sampled from range(mapping_size)
d={}
random.seed(123)
for idx1 in range(ncol):
d['col_' + str(idx1)] = np.random.choice(range(mapping_size),nrow)
df = pd.DataFrame(data=d)
Results using map and replace:
%%timeit -n 20
df.replace(mapping_dict) #296 ms
%%timeit -n 20
for key in mapping_dict.keys():
df[key] = df[key].map(mapping_dict[key]).fillna(df[key]) #221ms
%%timeit -n 20
for key in mapping_dict.keys():
df[key] = df[key].map(mapping_dict[key]) #181ms

Just use pandas without python for iteration.
# runtime ~ 1s (1000rows)
# creat a map_serials with multi_index
df_dict = pd.DataFrame(mapping_dict)
obj_dict = df_dict.T.stack()
# obj_dict
# col_0 1 10
# 2 14
# 3 11
# Length: 3000, dtype: int64
# convert df to map_serials's index, df can have more then 1 row
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index
result = obj_dict[idx]
# handle null values
cond = result.isnull()
result[cond] = pd.Series(result[cond].index.values).str[1].values
# transform to reslut DataFrame
df_result = pd.DataFrame(result.values.reshape(df.shape))
df_result.columns = df.columns
df_result

Related

Find minima and maxima of DataFrame by chronological order

I have a pandas data frame where I extract minima and extrema values. It work good so far, but the problem is how can I place them by Date (chronological order) into a list? They are separated into two list and I only want one price values list with them being in chronological order
import pandas as pd
import numpy as np
import yfinance
from scipy.signal import argrelextrema
import matplotlib.dates as mpl_dates
def extract_data():
ticker = 'GBPJPY=X'
ticker = yfinance.Ticker(ticker)
start_date = '2022-09-25'
end_date = '2022-10-08'
df = ticker.history(interval='1h', start=start_date, end=end_date)
df['Date'] = pd.to_datetime(df.index)
df['Date'] = df['Date'].apply(mpl_dates.date2num)
df = df.loc[:, ['Date', 'Open', 'High', 'Low', 'Close']]
# Call function to find Min-Max Extrema
find_extrema(df)
def find_extrema(df):
n = 10 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.Close.values, np.less_equal,
order=n)[0]]['Close']
df['max'] = df.iloc[argrelextrema(df.Close.values, np.greater_equal,
order=n)[0]]['Close']
min_values_list = []
max_values_list = []
# Add min value to list
for item in df['min']:
check_NaN = np.isnan(item) # check if values is empty
if check_NaN == True:
pass
else:
min_values_list.append(item)
# Add max values to list
for item in df['max']:
check_NaN = np.isnan(item) # check if values is empty
if check_NaN == True:
pass
else:
max_values_list.append(item)
print(f"Min: {min_values_list}")
print(f"Max: {max_values_list}")
extract_data()

Option 1
First, use df.to_numpy to convert columns min and max to a np.array.
Get rid of all the NaN values by selecting from the array using np.logical_or applied to a boolean mask (created with np.isnan).
arr = df[['min','max']].to_numpy()
value_list = arr[np.logical_not(np.isnan(arr))].tolist()
print(value_list)
[159.7030029296875,
154.8979949951172,
160.7830047607422,
165.43800354003906,
149.55799865722656,
162.80499267578125,
156.6529998779297,
164.31900024414062,
156.125,
153.13499450683594,
161.3520050048828,
156.9340057373047,
162.52200317382812,
155.7740020751953,
160.98500061035156,
161.83700561523438]
Option 2
Rather more cumbersome:
n = 10
# get the indices for `min` and `max` in two arrays
_min = argrelextrema(df.Close.values, np.less_equal, order=n)[0]
_max = argrelextrema(df.Close.values, np.greater_equal, order=n)[0]
# create columns (assuming you need this for other purposes as well)
df['min'] = df.iloc[_min]['Close']
df['max'] = df.iloc[_max]['Close']
# create lists for `min` and `max`
min_values_list = df['min'].dropna().tolist()
max_values_list = df['max'].dropna().tolist()
# join the lists
value_list2 = min_values_list + max_values_list
value_idxs = _min.tolist() + _max.tolist()
# finally, sort `value_list2` based on `value_idxs`
value_list2 = [x for _, x in sorted(zip(value_idxs, value_list2))]
# check if result is the same:
value_list2 == value_list
# True

Assuming that you have max and min columns, what about something like this?
df['max_or_min'] = np.where(df['max'].notna(), df['max'], df['min'])
min_max_values = df['max_or_min'].dropna().values.tolist()

I have multiple lists and I want to filter by the most current

I have the following bucket AWS schema:
In my python code, it returns a list of the buckets with their dates.
I need to stick with the most up-to-date of the two main buckets:
I am starting in Python, this is my code:
str_of_ints = [7100, 7144]
for get_in_scenarioid in str_of_ints:
resultado = s3.list_objects(Bucket=source,Delimiter='/',Prefix=get_in_scenarioid +'/')
#print(resultado)
sub_prefix = [val['Prefix'] for val in resultado['CommonPrefixes']]
for get_in_sub_prefix in sub_prefix:
resultado2 = s3.list_objects(Bucket=source,Delimiter='/',Prefix=get_in_sub_prefix) # +'/')
#print(resultado2)
get_key_and_last_modified = [val['Key'] for val in resultado2['Contents']] + int([val['LastModified'].strftime('%Y-%m-%d %H:%M:%S') for val in resultado2['Contents']])
print(get_key_and_last_modified)

I would recommend to convert your array into pandas DataFrame and to use group by:
import pandas as pd
df = pd.DataFrame([["a",1],["a",2],["a",3],["b",2],["b",4]], columns=["lbl","val"])
df.groupby(['lbl'], sort=False)['val'].max()
lbl
a 3
b 4
In your case you would also have to split your label into 2 parts first, better keep in separate column.
Update:
Once you split your lable into bucket and sub_bucket, you can return max values like this:
dfg = df.groupby("main_bucket")
dfm = dfg.max()
res = dfm.reset_index()

Get min and max values from pandas dataframes within a dictionary in Python

I have a dictionary (pollution) with one key I wish to ignore (chemical_start_time) and all other keys having values that are pandas dataframes.
I want to get the maximum value present in any of the dataframes and the minumum non-zero value.
I believe the following code does exactly this, but I'm looking for the most efficient or "pythonic" way of doing this
import numpy as np
max_pols = []
min_pols = []
for key, df in pollution.items():
if key != 'chemical_start_time':
max_pols.append(max(df.max()))
min_pols.append(np.nanmin(df[df > 0].min()))
max_pol = max(max_pols)
min_pol = min(min_pols)

One possible solution for improve performance is use numpy.ravel for 1d array from all values of DataFrame and then use np.min (if possible missing values np.nanmin) and np.max:
df1 = pd.DataFrame({
'C':[7,8,9,4,2,3],
'D':[10,3,5,-7,10,0],
'E':[5,-3,6,9,2,4],
})
df2 = pd.DataFrame({
'A':[73,8,9,4,2,3],
'D':[1,3,52,-7,1,0],
'E':[53,-33,63,9,2,4],
})
pollution = {'a':df1, 'b':df2, 'chemical_start_time':pd.DataFrame([100])}
max_pols = []
min_pols = []
for key, df in pollution.items():
if key != 'chemical_start_time':
v = df.values.ravel()
max_pols.append(np.max(v))
min_pols.append(np.min(v[v > 0]))
max_pol = np.max(max_pols)
min_pol = np.min(min_pols)
print (max_pol)
73
print (min_pol)
1

Also you can use:
max_pols.append(df.max().max())
min_pols.append(df[df > 0].min().min())

Combine all relevant dataframes into one:
frames = pd.concat([frame for key, frame in pollution.items() if key != 'chemical_start_time'])
Then get the max, min values:
max_pol = frames.max().max()
min_pol = frames[frames > 0].min().min()

How do I import CSV to Pandas df where data is organized by an index column with a parent/child relationship?

I have GBs of data in this text format:
1,'Acct01','Freds Autoshop'
2,'3-way-Cntrl','Y'
1000,576,686,837
1001,683,170,775
1,'Acct02','Daves Tacos'
2,'centrifugal','N'
1000,334,787,143
1001,749,132,987
The first column indicates the row content and is an index series that repeats for each Account (Acct01, Acct02...). Rows with index values (1,2) are one-to-one associated with each account (Parent). I would like to flatten this data into a dataframe that associates the Account level data (index = 1,2) with it's associated series data (1000, 10001, 1002, 1003...) the child data in a flat df.
Desired df:
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1000,576,686,837
'Acct01','Freds Autoshop','3-way-Cntrl','Y',1001,683,170,775
'Acct02','Daves Tacos',2,'centrifugal','N',1000,334,787,143
'Acct02','Daves Tacos',2,'centrifugal','N',1001,749,132,987
I've been able to do this in a very mechanical, very slow row-by-row process:
import pandas as pd
import numpy as np
import time
file = 'C:\\PythonData\\AcctData.txt'
t0 = time.time()
pdata = [] # Parse data
acct = [] # Account Data
row = {} #Assembly Container
#Set dataframe columns
df = pd.DataFrame(columns=['Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT03'])
# open the file and read through it line by line
with open(file, 'r') as f:
for line in f:
#Strip each line
pdata = [x.strip() for x in line.split(',')]
#Use the index to parse data into either acct[] for use on the rows with counter > 2
indx = int(pdata[0])
if indx == 1:
acct.clear()
acct.append(pdata[1])
acct.append(pdata[2])
elif indx == 2:
acct.append(pdata[1])
acct.append(pdata[2])
else:
row.clear()
row['Account'] = acct[0]
row['Name'] = acct[1]
row['Type'] = acct[2]
row['Flag'] = acct[3]
row['Counter'] = pdata[0]
row['CNT01'] = pdata[1]
row['CNT02'] = pdata[2]
row['CNT03'] = pdata[3]
if indx > 2:
#data.append(row)
df = df.append(row, ignore_index=True)
t1 = time.time()
totalTimeDf = t1-t0
TTDf = '%.3f'%(totalTimeDf)
print(TTDf + " Seconds to Complete df: " + i_filepath)
print(df)
Result:
0.018 Seconds to Complete df: C:\PythonData\AcctData.txt
Account Name Type Flag Counter CNT01 CNT02 CNT03
0 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1000 576 686 837
1 'Acct01' 'Freds Autoshop' '3-way-Cntrl' 'Y' 1001 683 170 775
2 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1000 334 787 143
3 'Acct02' 'Daves Tacos' 'centrifugal' 'N' 1001 749 132 987
This works but is tragically slow. I suspect there is a very easy pythonic way to import and organize to a df. It appears an OrderDict will properly organize the data as follows:
import csv
from collections import OrderedDict
od = OrderedDict()
file_name = 'C:\\PythonData\\AcctData.txt'
try:
csvfile = open(file_name, 'rt')
except:
print("File not found")
csvReader = csv.reader(csvfile, delimiter=",")
for row in csvReader:
key = row[0]
od.setdefault(key,[]).append(row)
od
Result:
OrderedDict([('1',
[['1', "'Acct01'", "'Freds Autoshop'"],
['1', "'Acct02'", "'Daves Tacos'"]]),
('2',
[['2', "'3-way-Cntrl'", "'Y'"],
['2', "'centrifugal'", "'N'"]]),
('1000',
[['1000', '576', '686', '837'], ['1000', '334', '787', '143']]),
('1001',
[['1001', '683', '170', '775'], ['1001', '749', '132', '987']])])
From the OrderDict I haven't been able to figure out how to combine keys 1,2 and associate with acct specific series of keys (1000, 1001) then append into a df. How do I go from OrderedDict to df while flattening the Parent/Child data? Or, is there a better way to process this data?

I'm not sure if it's the fastes or the pythonic way, but I believe a pandas aproach might do, since you need to iterate for every 4 rows in a weird real specific way:
first importing libraries to work with:
import pandas as pd
import numpy as np
since I didn't have a file to load, I just recreated it as an array (this part you'll have to do some work, or simply load it to a pandas' DataFrame with 4 columns will be fine [like next step]):
data = [[1,'Acct01','Freds Autoshop'],
[2,'3-way-Cntrl','Y' ],
[1000,576,686,837 ],
[1001,683,170,775 ],
[1002,333,44,885 ],
[1003,611183,12,1 ],
[1,'Acct02','Daves Tacos' ],
[2,'centrifugal','N' ],
[1000,334,787,143 ] ,
[1001,749,132,987],
[1,'Acct03','Norah Jones' ],
[2,'undertaker','N' ],
[1000,323,1,3 ] ,
[1001,311,2,111 ] ,
[1002,95,112,4]]
Created a dataframe with the above data + created new columns with numpy's nans (faster than panda's) as placeholders.
df = pd.DataFrame(data)
df['4']= np.nan
df['5']= np.nan
df['6']= np.nan
df['7']= np.nan
df['8']= np.nan
df.columns = ['idx','Account','Name','Type','Flag','Counter','CNT01','CNT02','CNT3']
Making a new df that will get everytime "AcctXXXX" apears and how many rows bellow until the next parent.
# Getting the unique "Acct" and their index position into an array
acct_idx_pos = np.array([df[df['Account'].str.contains('Acct').fillna(False)]['Account'].values, df[df['Account'].str.contains('Acct').fillna(False)].index.values])
# Making a df with the transposed array
df_pos = pd.DataFrame(acct_idx_pos.T, columns=['Acct', 'Position'])
# Shifting the values into a new column and filling the last value (nan) with the df length
df_pos['End_position'] = df_pos['Position'].shift(-1)
df_pos['End_position'][-1:] = len(df)
# Making the column we want, that is the number of loops we'll go
df_pos['Position_length'] = df_pos['End_position'] - df_pos['Position']
A custom function that uses a dummy Dataframe and concatenates temporary ones (will be used later)
def concatenate_loop_dfs(df_temp, df_full, axis=0):
"""
to avoid retyping the same line of code for every df.
the parameters should be the temporary df created at each loop and the concatenated DF that will contain all
values which must first be initialized (outside the loop) as df_name = pd.DataFrame(). """
if df_full.empty:
df_full = df_temp
else:
df_full = pd.concat([df_full, df_temp], axis=axis)
return df_full
Created a function that will loop to fill each row and drop duplicated rows:
# a complicated loop function
def shorthen_df(df, num_iterations):
# to not delete original df
dataframe = df.copy()
# for the slicing, we need to start at the first row.
curr_row = 1
# fill current row's nan values with values from next row
dataframe.iloc[curr_row-1:curr_row:,3] = dataframe.iloc[curr_row:curr_row+1:,1].values
dataframe.iloc[curr_row-1:curr_row:,4] = dataframe.iloc[curr_row:curr_row+1:,2].values
dataframe.iloc[curr_row-1:curr_row:,5] = dataframe.iloc[curr_row+1:curr_row+2:,0].values
dataframe.iloc[curr_row-1:curr_row:,6] = dataframe.iloc[curr_row+1:curr_row+2:,1].values
dataframe.iloc[curr_row-1:curr_row:,7] = dataframe.iloc[curr_row+1:curr_row+2:,2].values
dataframe.iloc[curr_row-1:curr_row:,8] = dataframe.iloc[curr_row+1:curr_row+2:,3].values
# the "num_iterations-2" is because the first two lines are filled and not replaced
# as the next ones will be. So this will vary correctly to each "account"
for i in range(1, num_iterations-2):
# Replaces next row with values from previous row
dataframe.iloc[curr_row+(i-1):curr_row+i:] = dataframe.iloc[curr_row+(i-2):curr_row+(i-1):].values
dataframe.iloc[curr_row+(i-1):curr_row+i:,5] = dataframe.iloc[curr_row+i+1:curr_row+i+2:,0].values
dataframe.iloc[curr_row+(i-1):curr_row+i:,6] = dataframe.iloc[curr_row+i+1:curr_row+i+2:,1].values
dataframe.iloc[curr_row+(i-1):curr_row+i:,7] = dataframe.iloc[curr_row+i+1:curr_row+i+2:,2].values
dataframe.iloc[curr_row+(i-1):curr_row+i:,8] = dataframe.iloc[curr_row+i+1:curr_row+i+2:,3].values
# last 2 rows of df
dataframe = dataframe[0:len(dataframe)-2]
return dataframe
Finally, creating the dummy DF that will concat all "Acct" and loop for each one with it's position, using both functions above.
df_final= pd.DataFrame()
for start, end, iterations in zip(df_pos.Position.values, df_pos.End_position.values, df_pos.Position_length.values):
df2 = df[start:end]
df_temp = shorthen_df(df2, iterations)
df_final = concatenate_loop_dfs(df_temp, df_final)
# Dropping first/unnecessary columns
df_final.drop('idx', axis=1, inplace=True)
# resetting index
df_final.reset_index(inplace=True, drop=True)
df_final
returns
Account Name Type Flag Counter CNT01 CNT02 CNT3
0 Acct01 Freds Autoshop 3-way-Cntrl Y 1000.0 576 686 837
1 Acct01 Freds Autoshop 3-way-Cntrl Y 1001.0 683 170 775
2 Acct01 Freds Autoshop 3-way-Cntrl Y 1002.0 333 44 885
3 Acct01 Freds Autoshop 3-way-Cntrl Y 1003.0 611183 12 1
4 Acct02 Daves Tacos centrifugal N 1000.0 334 787 143
5 Acct02 Daves Tacos centrifugal N 1001.0 749 132 987
6 Acct03 Norah Jones undertaker N 1000.0 323 1 3
7 Acct03 Norah Jones undertaker N 1001.0 311 2 111
8 Acct03 Norah Jones undertaker N 1002.0 95 112 4

How can I operate on grouped arrays from nested pandas DataFrames?

I have a series of nested pandas DataFrames containing several (hundreds) of arrays and I would like to average each variable across different nesting levels.
The variable mydatadf contains a very simple representative example of my actual data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
mydata = dict()
participant = ['participantA', 'participantB']
for p in participant:
ses = dict()
session = ['ses_1', 'ses_2']
for s in session:
series = dict()
set = ['s_1', 's_2', 's_3']
for se in set:
reps = dict()
rep = ['r_1', 'r_2', 'r_3', 'r_4', 'r_5']
for r in rep:
vars = dict()
vars = {'var1': np.sin(np.random.rand(1000)*2),
'var2': np.sin(np.random.rand(1000)*2)}
varsdf = pd.DataFrame(data=vars)
reps[r] = vars
series[se] = reps
ses[s] = series
mydata[p] = ses
mydatadf = pd.DataFrame(mydata)
How could I effectively average (for example) var1 across the nesting levels reps, series, ses and/or participant?
Eventually, I would like to plot all var1 objects and highlight with different colours averaged data across any desired nesting level.
for p in mydatadf.keys():
for ses in mydatadf[p].keys():
for set in mydatadf[p][ses].keys():
for rep in mydatadf[p][ses][set].keys():
data = mydatadf[p][ses][set][rep]['var1']
plt.plot(data)
plt.show()

You can always flatten the dataframe and do standard groupby operations (I don't know if it is optimal, but it works):
df = pd.io.json.json_normalize(mydata) #this will give a nested dataframe
df_flat = pd.DataFrame(df.T.index.str.split('.').tolist()).assign(values=df.T.values)
df_flat.head(3)
>> 0 1 2 3 4 \
0 participantA ses_1 s_1 r_1 var1
1 participantA ses_1 s_1 r_1 var2
2 participantA ses_1 s_1 r_2 var1
values
0 [0.7267196257553268, 0.9822775511169437, 0.991...
1 [0.6633676714415264, 0.2823588336690545, 0.977...
2 [0.2211576389168905, 0.9399581790280525, 0.645...
Edit: to groupby and apply a function (say, mean):
# in this case I choose column 4, corresponding to 'var'.
# You can change the name of the column using df_flat.columns.rename
# note that I use np.hstack as you are dealing with a an array of arrays
column = 4
df_flat.groupby(column)['Values'].apply(lambda x: np.hstack(x).mean())
>> 4
var1 0.707803
var2 0.707821
Name: Values, dtype: float64

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python most efficient way to dictionary mapping in pandas dataframe - python

Related

Find minima and maxima of DataFrame by chronological order

I have multiple lists and I want to filter by the most current

Get min and max values from pandas dataframes within a dictionary in Python

How do I import CSV to Pandas df where data is organized by an index column with a parent/child relationship?

How can I operate on grouped arrays from nested pandas DataFrames?

Categories

Resources