Flatten DataFrame by group with columns creation in Pandas

Flatten DataFrame by group with columns creation in Pandas - python

I have the following pandas DataFrame
Id_household Age_Father Age_child
0 1 30 2
1 1 30 4
2 1 30 4
3 1 30 1
4 2 27 4
5 3 40 14
6 3 40 18
and I want to achieve the following result
Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
Id_household
1 30 1 2.0 4.0 4.0
2 27 4 NaN NaN NaN
3 40 14 18.0 NaN NaN
I tried stacking with multi-index renaming, but I am not very happy with it and I am not able to make everything work properly.

Use this:
df_out = df.set_index([df.groupby('Id_household').cumcount()+1,
'Id_household',
'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()
Output:
Id_household Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
0 1 30 2.0 4.0 4.0 1.0
1 2 27 4.0 NaN NaN NaN
2 3 40 14.0 18.0 NaN NaN

Related

Save dictionary to Pandas dataframe with keys as columns and merge indices

I know there are already lots of posts on how to convert a pandas dict to a dataframe, however I could not find one discussing the issue I have.
My dictionary looks as follows:
[Out 23]:
{'atmosphere': 0
2 5
9 4
15 1
26 5
29 5
... ..
2621 4
6419 3
[6934 rows x 1 columns],
'communication': 0
13 1
15 1
26 1
2621 2
3119 5
... ..
6419 4
6532 1
[714 rows x 1 columns]
Now, what I want is to create a dataframe out of this dictionary, where the 'atmosphere' and 'communication' are the columns, and the indices of both items are merged, so that the dataframe looks as follows:
index atmosphere commmunication
2 5
9 4
13 1
15 1 1
26 5 1
29 5
2621 4 2
3119 5
6419 3 4
6532 1
I already tried pd.DataFrame.from_dict, but it saves all values in one row.
Any help is much appreciated!

Use concat with DataFrame.droplevel for remove second level 0 from MultiIndex in columns:
d = {'atmosphere':pd.DataFrame({0: {2: 5, 9: 4, 15: 1, 26: 5, 29: 5,
2621: 4, 6419: 3}}),
'communication':pd.DataFrame({0: {13: 1, 15: 1, 26: 1, 2621: 2,
3119: 5, 6419: 4, 6532: 1}})}
print (d['atmosphere'])
0
2 5
9 4
15 1
26 5
29 5
2621 4
6419 3
print (d['communication'])
0
13 1
15 1
26 1
2621 2
3119 5
6419 4
6532 1
df = pd.concat(d, axis=1).droplevel(1, axis=1)
print (df)
atmosphere communication
2 5.0 NaN
9 4.0 NaN
13 NaN 1.0
15 1.0 1.0
26 5.0 1.0
29 5.0 NaN
2621 4.0 2.0
3119 NaN 5.0
6419 3.0 4.0
6532 NaN 1.0
Alternative solution:
df = pd.concat({k: v[0] for k, v in d.items()}, axis=1)

You can use pandas.concat on the values and set_axis with the dictionary keys:
out = pd.concat(d.values(), axis=1).set_axis(d, axis=1)
output:
atmosphere communication
2 5.0 NaN
9 4.0 NaN
13 NaN 1.0
15 1.0 1.0
26 5.0 1.0
29 5.0 NaN
2621 4.0 2.0
3119 NaN 5.0
6419 3.0 4.0
6532 NaN 1.0

Pandas: Merging and comparing dataframes

I've got 3 Dataframes I would like to merge or join by "label" and then being able to compare all columns
Examples of df are below:
df1
Label,col1,col2,col3
NF1,1,1,6
NF2,3,2,8
NF3,4,5,4
NF4,5,7,2
NF5,6,2,2
df2
Label,col1,col1,col3
NF1,8,4,5
NF2,4,7,8
NF3,9,7,8
df3
Label,col1,col1,col3
NF1,2,8,8
NF2,6,2,0
NF3,2,2,5
NF4,2,4,9
NF5,2,5,8
and what ill like to see is similar to
Label,df1_col1,df2_col1,df_col1,df1_col2,df2_col2,df3_col2,df1_col3,df2_col3,df_col3
NF1,1,8,2,1,4,8,6,5,8
NF2,3,4,6,2,7,2,8,8,0
NF3,4,9,2,5,7,2,4,8,5
NF4,5,,2,7,,4,2,,9
NF5,6,,2,2,,5,2,,8
but I'm to suggestions on how to make the comparisons more readable.
Thanks!

Use concat with list of DataFrames, add parameter keys for prefixes and sorting by columns names:
dfs = [df1, df2, df3]
k = ('df1','df2','df3')
df = (pd.concat([x.set_index('Label') for x in dfs], axis=1, keys=k)
.sort_index(axis=1, level=1)
.rename_axis('Label')
.reset_index())
df.columns = df.columns.map('_'.join).str.strip('_')
print (df)
Label df1_col1 df2_col1 df3_col1 df2_col1.1 df3_col1.1 df1_col2 \
0 NF1 1 8.0 2 4.0 8 1
1 NF2 3 4.0 6 7.0 2 2
2 NF3 4 9.0 2 7.0 2 5
3 NF4 5 NaN 2 NaN 4 7
4 NF5 6 NaN 2 NaN 5 2
df1_col3 df2_col3 df3_col3
0 6 5.0 8
1 8 8.0 0
2 4 8.0 5
3 2 NaN 9
4 2 NaN 8

You can use df.merge:
In [1965]: res = df1.merge(df2, on='Label', how='left', suffixes=('_df1', '_df2')).merge(df3, on='Label', how='left').rename(columns={'col1': 'col1_df3','col2':'col2_df3','col3':'col3_df3'})
In [1975]: res = res.reindex(sorted(res.columns), axis=1)
In [1976]: res
Out[1965]:
Label col1_df1 col1_df2 col1_df3 col2_df1 col2_df2 col2_df3 col3_df1 col3_df2 col3_df3
0 NF1 1 8.00 2 1 4.00 8 6 5.00 8
1 NF2 3 4.00 6 2 7.00 2 8 8.00 0
2 NF3 4 9.00 2 5 7.00 2 4 8.00 5
3 NF4 5 nan 2 7 nan 4 2 nan 9
4 NF5 6 nan 2 2 nan 5 2 nan 8

We can use Pandas' join method, by setting the Label column as the index and joining the dataframes :
dfs = [df1,df2,df3]
keys = ['df1','df2','df3']
#set Label as index
df1, *others = [frame.set_index("Label").add_prefix(f"{prefix}_")
for frame,prefix in zip(dfs,keys)]
#join df1 with others
outcome = df1.join(others,how='outer').rename_axis(index='Label').reset_index()
outcome
Label df1_col1 df1_col2 df1_col3 df2_col1 df2_col2 df2_col3 df3_col1 df3_col2 df3_col3
0 NF1 1 1 6 8.0 4.0 5.0 2 8 8
1 NF2 3 2 8 4.0 7.0 8.0 6 2 0
2 NF3 4 5 4 9.0 7.0 8.0 2 2 5
3 NF4 5 7 2 NaN NaN NaN 2 4 9
4 NF5 6 2 2 NaN NaN NaN 2 5 8

opposite of df.diff() in pandas

I have searched the forums in search of a cleaner way to create a new column in a dataframe that is the sum of the row with the previous row- the opposite of the .diff() function which takes the difference.
this is how I'm currently solving the problem
df = pd.DataFrame ({'c':['dd','ee','ff', 'gg', 'hh'], 'd':[1,2,3,4,5]}
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.

You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN

df.cumsum()
Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44

If you cannot use rolling, due to multindex or else, you can try using .cumsum(), and then .diff(-2) to sub the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally to get an inverse of .diff(n) you should be able to do .cumsum().diff(n+1). The issue is that that you will get n+1 first results as NaNs

Pandas: rolling count if within a loop

In my data frame I want to create a column '5D_Peak' as a rolling max, and then another column with rolling count of historical data that's close to the peak. I wonder if there is an easier way to simply or ideally vectorise the calculation.
This is my codes in a plain but complicated way:
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,2,4],[4,5,2],[3,5,8],[1,8,6],[5,2,8],[1,4,10],[3,5,9],[1,4,7],[1,4,6]], columns=list('ABC'))
df['5D_Peak']=df['C'].rolling(window=5,center=False).max()
for i in range(5,len(df.A)):
val=0
for j in range(i-5,i):
if df.loc[j,'C']>df.loc[i,'5D_Peak']-2 and df.loc[j,'C']<df.loc[i,'5D_Peak']+2:
val+=1
df.loc[i,'5D_Close_to_Peak_Count']=val
This is the output I want:
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 NaN
5 1 4 10 10.0 0.0
6 3 5 9 10.0 1.0
7 1 4 7 10.0 2.0
8 1 4 6 10.0 2.0

I believe this is what you want. You can set the two values below:
'''the window within which to search "close-to_peak" values'''
lkp_rng = 5
'''how close is close?'''
closeness_measure = 2
'''function to count the number of "close-to_peak" values in the lkp_rng'''
fc = lambda x: np.count_nonzero(np.where(x >= x.max()- closeness_measure))
'''apply fc to the coulmn you choose'''
df['5D_Close_to_Peak_Count'] = df['C'].rolling(window=lkp_range,center=False).apply(fc)
df.head(10)
A B C 5D_Peak 5D_Close_to_Peak_Count
0 1 2 4 NaN NaN
1 4 5 2 NaN NaN
2 3 5 8 NaN NaN
3 1 8 6 NaN NaN
4 5 2 8 8.0 3.0
5 1 4 10 10.0 3.0
6 3 5 9 10.0 3.0
7 1 4 7 10.0 3.0
8 1 4 6 10.0 2.0
I am guessing what you mean by "historical data".

pandas.DataFrame set all string values to nan

I have a pandas.DataFrame that contain string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN ?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN

You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, which ever is sooner. So don't make this a long term solution if you use it.

Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN

You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Flatten DataFrame by group with columns creation in Pandas - python

Related

Save dictionary to Pandas dataframe with keys as columns and merge indices

Pandas: Merging and comparing dataframes

opposite of df.diff() in pandas

Pandas: rolling count if within a loop

pandas.DataFrame set all string values to nan

Categories

Resources