Merging pandas dataframes on dictionary - python

I have two dataframes that are related via a hierarchical dictionary.
In[0]: import pandas as pd
d = {'levelA_1': ['sublevel_1', 'sublevel_2'],
     'levelA_2': ['sublevel_3', 'sublevel_4'],
     'levelA_3': ['sublevel_5', 'sublevel_6']}
datA = pd.DataFrame({'A': {'levelA_1': 4, 'levelA_2': 2, 'levelA_3': 2},
                     'B': {'levelA_1': 1, 'levelA_2': 3, 'levelA_3': 5},
                     'C': {'levelA_1': 2, 'levelA_2': 4, 'levelA_3': 6}})
datB = pd.DataFrame({'A': {'sublevel_1': 4, 'sublevel_2': 1, 'sublevel_3': 3, 'sublevel_4': 4},
                     'B': {'sublevel_1': 1, 'sublevel_2': 3, 'sublevel_3': 4, 'sublevel_4': 8},
                     'C': {'sublevel_1': 2, 'sublevel_2': 6, 'sublevel_3': 13, 'sublevel_4': 6}})
In[1]: datA
Out[1]:
          A  B  C
levelA_1  4  1  2
levelA_2  2  3  4
levelA_3  2  5  6

In[2]: datB
Out[2]:
            A  B   C
sublevel_1  4  1   2
sublevel_2  1  3   6
sublevel_3  3  4  13
sublevel_4  4  8   6
In[3]: x = 3
The first dataframe (datA) provides values for the keys of d, and the other (datB) provides values for the values of d.
Furthermore, I have a base value x. I want to multiply datA by x and then multiply each element of datB by the corresponding datA value (looked up via the dict).
So, for example, I want to get the following result for one cell:
x = 3
3 * datB['B']['sublevel_3'] * datA['B']['levelA_2']
# res = 3*4*3 = 36
Desired output for dataframe:
             A   B    C
sublevel_1  48   3   12
sublevel_2  12   9   36
sublevel_3  18  36  156
sublevel_4  24  72   72
Is there a better way than to loop through each cell?

IIUC
datA['New'] = datA.reset_index()['index'].map(d).values
# map the dict to build the connection between datA and datB
New_datA = datA.set_index(list('ABC'), append=True).New.apply(pd.Series).stack().reset_index(list('ABC'))
# make datA and datB share the same index, so we can do dataframe arithmetic
New_datA = New_datA.set_index(0)
datB * New_datA * 3
# you can add dropna at the end to remove the NaN rows
Out[95]:
               A     B      C
sublevel_1  48.0   3.0   12.0
sublevel_2  12.0   9.0   36.0
sublevel_3  18.0  36.0  156.0
sublevel_4  24.0  72.0   72.0
sublevel_5   NaN   NaN    NaN
sublevel_6   NaN   NaN    NaN
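An alternative that avoids the NaN rows is to invert the dictionary into a sublevel-to-level map and align datA onto datB's index before multiplying. This is only a sketch of the same idea; the name sub_to_level is mine, not from the question:
sub_to_level = {sub: level for level, subs in d.items() for sub in subs}
# pull the matching datA row for every row of datB, then multiply elementwise
aligned = datA[['A', 'B', 'C']].reindex(datB.index.map(sub_to_level)).set_axis(datB.index)
x * datB * aligned
Because aligned carries datB's index, the multiplication lines up row by row and only the sublevels that actually appear in datB are produced.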

Related

Expanding just last row of groupby

I need to get the expanding mean, grouped by name.
I already have this code:
import pandas as pd

data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'number': [1, 3, 5, 7, 9, 11, 13, 15]
}
df = pd.DataFrame(data)
df['mean_number'] = df.groupby('name')['number'].apply(
    lambda s: s.expanding().mean().shift()
)
P.S.: I use .shift() so that the mean does not include the current row.
This results in:
id name number mean_number
0 1 A 1 NaN
1 2 B 3 NaN
2 3 A 5 1.0
3 4 B 7 3.0
4 5 A 9 3.0
5 6 B 11 5.0
6 7 A 13 5.0
7 8 B 15 7.0
This works, but I only need the last result of each group:
id name number mean_number
6 7 A 13 5.0
7 8 B 15 7.0
Is it possible to compute the mean for only these last rows? On a very large dataset it takes a long time to compute the value for every row and then keep only the last one per group.
If you only need the final mean number for each group, you can take the sum and count per group and calculate the values from those:
groups = df.groupby('name').agg(name=("name", "first"), s=("number", "sum"), c=("number", "count")).set_index("name")
groups
s c
name
A 28 4
B 36 4
Then you can use .tail() to get the last row for each group
tail = df.groupby('name').tail(1).set_index("name")
tail
id number
name
A 7 13
B 8 15
Calculate the mean like this
(groups.s - tail.number) / (groups.c - 1)
name
A 5.0
B 7.0
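If you prefer a single pass over the groups, the same numbers fall out of one named aggregation (just a sketch; the column names total, n and last are mine):
g = df.groupby('name')['number'].agg(total='sum', n='count', last='last')
(g['total'] - g['last']) / (g['n'] - 1)

name
A    5.0
B    7.0
dtype: float64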

FillNaN with multiple conditions and using n-1 and n+2 values with Pandas

I have the following data frame:
import numpy as np
import pandas as pd

d = {'T': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'Val1': [10, np.NaN, 14, np.NaN, np.NaN, np.NaN, 20, np.NaN, np.NaN, 30]}
df = pd.DataFrame(data=d)
T Val1
1 10.0
2 NaN
3 14.0
4 NaN
5 NaN
6 NaN
7 20.0
8 NaN
9 NaN
10 30.0
I want to fill the NaN with different values depending on certain conditions:
If the value V is NaN and the next value V+1 and previous value V-1 are not NaN, then V=np.mean([V+1, V-1])
If the values V and V+1 are NaN and if V-1 and V+2 are not NaN then I want to fill them following this formula: V=np.cbrt([(V-1)*(V-1)*(V+2)]) AND V+1=np.cbrt([(V-1)*(V+2)*(V+2)])
Other NaN should be removed
So the wanted dataframe should look like:
T Val1
1 10.0
2 12.0
3 14.0
7 20.0
8 22.89
9 26.20
10 30.0
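For reference, the expected values in rows 8 and 9 follow directly from the cube-root formulas above, using the neighbouring known values 20 and 30 (with numpy imported as np, as above):
np.cbrt(20 * 20 * 30)   # ≈ 22.89 (row 8)
np.cbrt(20 * 30 * 30)   # ≈ 26.2  (row 9)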
I was able to do the V=np.mean([V+1, V-1]) by the following command:
df1 = pd.concat([df.ffill(), df.bfill()]).groupby(level=0).mean()
T Val1
1 10.0
2 12.0
3 14.0
4 17.0
5 17.0
6 17.0
7 20.0
8 25.0
9 25.0
10 30.0
But I don't know how to incorporate the different conditions.
I tried using np.select() but I can't find a way to reference the previous and following values in the conditions.
Thanks a lot
You can use:
def condition_2(a, b):  # a = V-1, b = V+2
    return np.cbrt(a * a * b)

def condition_3(a, b):  # a = V-2, b = V+1
    return np.cbrt(a * b * b)

d = {'T': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'Val1': [10, np.NaN, 14, np.NaN, np.NaN, np.NaN, 20, np.NaN, np.NaN, 30]}
df = pd.DataFrame(data=d)

cond_1 = df['Val1'].isnull() & df['Val1'].shift(1).notna() & df['Val1'].shift(-1).notna()
cond_2 = df['Val1'].isnull() & df['Val1'].shift(1).notna() & df['Val1'].shift(-1).isnull() & df['Val1'].shift(-2).notna()
cond_3 = df['Val1'].isnull() & df['Val1'].shift(-1).notna() & df['Val1'].shift(1).isnull() & df['Val1'].shift(2).notna()

df['Val1'] = np.where(cond_1, (df['Val1'].shift(1) + df['Val1'].shift(-1)) / 2, df['Val1'])
df['Val1'] = np.where(cond_2, condition_2(df['Val1'].shift(1), df['Val1'].shift(-2)), df['Val1'])
df['Val1'] = np.where(cond_3, condition_3(df['Val1'].shift(2), df['Val1'].shift(-1)), df['Val1'])
df.dropna(subset=['Val1'], inplace=True)
Output:
T Val1
0 1 10.000000
1 2 12.000000
2 3 14.000000
6 7 20.000000
7 8 22.894285
8 9 26.207414
9 10 30.000000
Here's one solution using np.split and a custom function: split on the non-NaN values and iterate over each split to decide whether to drop the NaNs or fill them:
def nan2notna(arr1, arr2):
    mask = pd.isna(arr1)
    if len(arr1[mask]) > 2:
        return arr1[~mask]
    else:
        if len(arr1[mask]) == 2:
            arr1[mask] = [np.cbrt(arr1.iloc[0] * arr1.iloc[0] * arr2.iloc[0]),
                          np.cbrt(arr1.iloc[0] * arr2.iloc[0] * arr2.iloc[0])]
        elif len(arr1[mask]) == 1:
            arr1[mask] = np.mean([arr1.iloc[0], arr2.iloc[0]])
        else:
            pass
        return arr1

splits = np.split(df['Val1'], np.where(pd.notna(df['Val1']))[0])[1:]
out = (df.merge(pd.concat([nan2notna(arr1, arr2) for (arr1, arr2) in zip(splits, splits[1:] + [None])]).to_frame(),
                left_index=True, right_index=True)
         .drop(columns='Val1_x')
         .rename(columns={'Val1_y': 'Val1'})
         .round(2))
Output:
T Val1
0 1 10.00
1 2 12.00
2 3 14.00
6 7 20.00
7 8 22.89
8 9 26.21
9 10 30.00

How to substitute values in a column in a dataframe based on its column name, values in another column and index range?

I have a dataframe with these characteristics (the indexes are float values):
import pandas as pd

d = {'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
     'D': ['one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two']}
# float index as shown below
df = pd.DataFrame(data=d, index=[50.0, 50.2, 50.4, 50.6, 50.8, 51.0, 51.2, 51.4, 51.6, 51.8])
df
A B C D
50.0 1 1 1 one
50.2 2 2 2 one
50.4 3 3 3 one
50.6 4 4 4 one
50.8 5 5 5 one
51.0 6 6 6 two
51.2 7 7 7 two
51.4 8 8 8 two
51.6 9 9 9 two
51.8 10 10 10 two
And a list of offsets with these values (they are also floats):
offsets = [[0.4, 0.6, 0.8], [0.2, 0.4, 0.6]]
I need to iterate over columns A, B and C of my dataframe, grouped by the categorical values in column D, and replace the last values of each column with nan according to how their indexes relate to the offsets in my list, resulting in a dataframe like this:
A B C D
50.0 1 1 1 one
50.2 2 2 nan one
50.4 3 nan nan one
50.6 nan nan nan one
50.8 nan nan nan one
51.0 6 6 6 two
51.2 7 7 7 two
51.4 8 8 nan two
51.6 9 nan nan two
51.8 nan nan nan two
The offset says how many values, counted from the bottom up, must be set to nan. For example: offsets[0][0] = 0.4, so for column A when D == 'one' the two bottom values must be set to nan (rows 4 and 3; 50.8 - 0.4 = 50.4, and the row at 50.4 doesn't change). For A when D == 'two', offsets[1][0] = 0.2, so one bottom value must be set to nan (row 9; 51.8 - 0.2 = 51.6, and the row at 51.6 doesn't change). offsets[0][1] = 0.6, so for column B when D == 'one' the three bottom values must be set to nan (rows 4, 3 and 2; 50.8 - 0.6 = 50.2, and the row at 50.2 doesn't change). For B when D == 'two', offsets[1][1] = 0.4, so two bottom values must be set to nan (rows 9 and 8; 51.8 - 0.4 = 51.4, and the row at 51.4 doesn't change). Column C works the same way.
Any idea how to do this? A quick comment - I want to replace these values in the dataframe itself, without creating a new one.
One approach is to use apply to set the last values of each column to NaN:
import pandas as pd

# toy data
df = pd.DataFrame(data={'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'B': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'C': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                        'D': ['one', 'one', 'one', 'one', 'one', 'two', 'two', 'two', 'two', 'two']})
offsets = [2, 3, 4]
offset_lookup = dict(zip(df.columns[:3], offsets))

def funny_shift(x, ofs=None):
    """This function shifts each column by the given offset in the ofs parameter."""
    for column, offset in ofs.items():
        x.loc[x.index[-1 * offset:], column] = None
    return x

df.loc[:, ["A", "B", "C"]] = df.groupby("D").apply(funny_shift, ofs=offset_lookup)
print(df)
Output
A B C D
0 1.0 1.0 1.0 one
1 2.0 2.0 NaN one
2 3.0 NaN NaN one
3 NaN NaN NaN one
4 NaN NaN NaN one
5 6.0 6.0 6.0 two
6 7.0 7.0 NaN two
7 8.0 NaN NaN two
8 NaN NaN NaN two
9 NaN NaN NaN two
UPDATE
If you have multiple updates per group, you could do:
offsets = [[2, 3, 4], [1, 2, 3]]
offset_lookup = (dict(zip(df.columns[:3], offset)) for offset in offsets)

def funny_shift(x, ofs=None):
    """This function shifts each column by the given offset in the ofs parameter."""
    current = next(ofs)
    for column, offset in current.items():
        x.loc[x.index[-1 * offset:], column] = None
    return x

df.loc[:, ["A", "B", "C"]] = df.groupby("D").apply(funny_shift, ofs=offset_lookup)
print(df)
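If you would rather apply the original float offsets directly (instead of converting them to row counts first), a sketch along these lines should work on the float-indexed frame from the question; numpy is assumed imported as np, and the rounding only guards against floating-point noise in the subtraction:
offsets = [[0.4, 0.6, 0.8], [0.2, 0.4, 0.6]]
for (group, sub), offs in zip(df.groupby('D', sort=False), offsets):
    bottom = sub.index.max()
    for col, off in zip(['A', 'B', 'C'], offs):
        # blank out the rows of this group whose index lies above (bottom - offset)
        cutoff = round(bottom - off, 6)
        df.loc[(df['D'] == group) & (df.index > cutoff), col] = np.nan
This updates df in place, which is what the question asked for.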

Python: How to replace missing values column wise by median

I have a dataframe as follows
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1.45, 2.33, np.nan], 'C': [4, 5, 6], 'D': [4.55, 7.36, np.nan]})
I want to replace the missing values (i.e. np.nan) in a generic way. For this I have created a function as follows:
def treat_mis_value_nu(df):
    df_nu = df.select_dtypes(include=['number'])
    lst_null_col = df_nu.columns[df_nu.isnull().any()].tolist()
    if len(lst_null_col) > 0:
        for i in lst_null_col:
            if df_nu[i].isnull().sum() / len(df_nu[i]) > 0.10:
                df_final_nu = df_nu.drop([i], axis=1)
            else:
                df_final_nu = df_nu[i].fillna(df_nu[i].median(), inplace=True)
    return df_final_nu
When I apply this function as follows
df_final = treat_mis_value_nu(df)
I am getting a dataframe as follows
A B C
0 1 1.0 4
1 2 2.0 5
2 3 NaN 6
So it has actually removed column D correctly, but failed to remove column B.
I know there have been discussions on this topic in the past (here). Still, am I missing something?
Use:
df = pd.DataFrame({'A': [1, 2, 3, 5, 7], 'B': [1.45, 2.33, np.nan, np.nan, np.nan],
                   'C': [4, 5, 6, 8, 7], 'D': [4.55, 7.36, np.nan, 9, 10],
                   'E': list('abcde')})
print (df)
A B C D E
0 1 1.45 4 4.55 a
1 2 2.33 5 7.36 b
2 3 NaN 6 NaN c
3 5 NaN 8 9.00 d
4 7 NaN 7 10.00 e
def treat_mis_value_nu(df):
    # get only numeric columns
    df_nu = df.select_dtypes(include=['number'])
    # keep only the columns that contain NaNs
    df_nu = df_nu.loc[:, df_nu.isnull().any()]
    # columns to remove; isnull().mean() is the same as sum()/len()
    cols_to_drop = df_nu.columns[df_nu.isnull().mean() > 0.30]
    # fill missing values of the remaining columns and drop those above the threshold
    return df.fillna(df_nu.median()).drop(cols_to_drop, axis=1)
print (treat_mis_value_nu(df))
A C D E
0 1 4 4.55 a
1 2 5 7.36 b
2 3 6 8.18 c
3 5 8 9.00 d
4 7 7 10.00 e
I would recommend looking at the sklearn Imputer transformer. I don't think it can drop columns, but it can definitely fill them in a 'generic way' - for example, filling in missing values with the median of the relevant column.
You could use it as such:
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy='median')
num_df = df.values
names = df.columns.values
df_final = pd.DataFrame(imputer.fit_transform(num_df), columns=names)
If you have additional transformations you would like to make, you could consider building a transformation Pipeline, or even write your own transformers for bespoke tasks.
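Note that in recent versions of scikit-learn (0.22 and later) Imputer has been removed; SimpleImputer from sklearn.impute is the replacement. A minimal sketch of the same idea:
from sklearn.impute import SimpleImputer
import pandas as pd

imputer = SimpleImputer(strategy='median')
df_final = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)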

Filling Pandas columns with lists of unequal lengths

I am having trouble filling Pandas dataframes with values from lists of unequal lengths.
nx_lists_into_df is a list of numpy arrays.
I get the following error:
ValueError: Length of values does not match length of index
The code is below:
from itertools import zip_longest
from numpy import array
import pandas as pd

# Column headers
df_cols = ["f1", "f2"]

# Create one dataframe for each sheet
df1 = pd.DataFrame(columns=df_cols)
df2 = pd.DataFrame(columns=df_cols)

# Create list of dataframes to iterate through
df_list = [df1, df2]

# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
                     array([2, 5, 6, 8])],
                    [array([0, 1, 2, 6, 7]),
                     array([3, 4, 5, 8])]]

# Loop through each sheet (i.e. each round of k folds)
for df, test_index_list in zip_longest(df_list, nx_lists_into_df):
    counter = -1
    # Loop through each column in that sheet (i.e. each fold)
    for col in df_cols:
        print(col)
        counter += 1
        # Add 1 to each index value to start indexing at 1
        df[col] = test_index_list[counter] + 1
Thank you for your help.
Edit: This is how the result should hopefully look:
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
We'll leverage pd.Series to attach an appropriate index, which allows us to use the pd.DataFrame constructor without it complaining about unequal lengths.
df1, df2 = (
    pd.DataFrame(dict(zip(df_cols, map(pd.Series, d))))
    for d in nx_lists_into_df
)
print(df1)
f1 f2
0 0 2.0
1 1 5.0
2 3 6.0
3 4 8.0
4 7 NaN
print(df2)
f1 f2
0 0 3.0
1 1 4.0
2 2 5.0
3 6 8.0
4 7 NaN
Setup
from numpy import array
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
array([2, 5, 6, 8])],
[array([0, 1, 2, 6, 7]),
array([3, 4, 5, 8])]]
# Column headers
df_cols = ["f1","f2"]
You could predefine the size of your DataFrames (by setting the index range to the length of the longest column you want to add [or any size bigger than the longest column]) like so:
df1 = pd.DataFrame(columns=df_cols, index=range(5))
df2 = pd.DataFrame(columns=df_cols, index=range(5))
print(df1)
f1 f2
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
(df2 is the same)
The DataFrame will be filled with NaNs automatically.
Then you use .loc to access each entry separately like so:
for x in range(len(nx_lists_into_df)):
    for col_idx, y in enumerate(nx_lists_into_df[x]):
        df_list[x].loc[range(len(y)), df_cols[col_idx]] = y
print(df1)
f1 f2
0 0 2
1 1 5
2 3 6
3 4 8
4 7 NaN
print(df2)
f1 f2
0 0 3
1 1 4
2 2 5
3 6 8
4 7 NaN
The first loop iterates over the first dimension of your array (or the number of DataFrames you want to create).
The second loop iterates over the column values for the DataFrame, where y are the values for the current column and df_cols[col_idx] is the respective column (f1 or f2).
Since the row & col indices are the same size as y, you don't get the length mismatch.
Also check out the enumerate(iterable, start=0) function to get around those "counter" variables.
Hope this helps.
If I understand correctly, this is possible via pd.concat.
But see @pir's solution for an extendable version.
# Lists to be put into the dataframes
nx_lists_into_df = [[array([0, 1, 3, 4, 7]),
array([2, 5, 6, 8])],
[array([0, 1, 2, 6, 7]),
array([3, 4, 5, 8])]]
df1 = pd.concat([pd.DataFrame({'A': nx_lists_into_df[0][0]}),
pd.DataFrame({'B': nx_lists_into_df[0][1]})],
axis=1)
# A B
# 0 0 2.0
# 1 1 5.0
# 2 3 6.0
# 3 4 8.0
# 4 7 NaN
df2 = pd.concat([pd.DataFrame({'C': nx_lists_into_df[1][0]}),
pd.DataFrame({'D': nx_lists_into_df[1][1]})],
axis=1)
# C D
# 0 0 3.0
# 1 1 4.0
# 2 2 5.0
# 3 6 8.0
# 4 7 NaN
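A compact variant of the same concat idea that also produces the f1/f2 headers from the question (just a sketch under the setup above):
df1, df2 = (
    pd.concat([pd.Series(a) for a in arrays], axis=1, keys=df_cols)
    for arrays in nx_lists_into_df
)
The result matches the frames shown in the first answer, with the shorter column NaN-padded.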
