Trying to unstack dataframe with multiple empty columns (NaN) - python

I currently have code that turns this:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 40 3 150 5
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
Into this:
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
And here you can check the code:
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO(""" A B C D E F G H I J
1.1.1 amba 50 1 131 4 40 3 150 5
2.2.2 erto 50 7 40 8 150 8 131 2
3.3.3 gema 131 2 150 5 40 1 50 3"""), sep=r"\s+")
print(df1)
df2 = (pd.concat([df1.drop(columns=["C","D","E","F","G","H"]).rename(columns={"I":"key","J":"val"}),
                  df1.drop(columns=["C","D","E","F","I","J"]).rename(columns={"G":"key","H":"val"}),
                  df1.drop(columns=["C","D","G","H","I","J"]).rename(columns={"E":"key","F":"val"}),
                  df1.drop(columns=["E","F","G","H","I","J"]).rename(columns={"C":"key","D":"val"}),
                 ])
       .rename(columns={"A":"ID","B":"User"})
       .set_index(["ID","User","key"])
       .unstack(2)
       .reset_index()
      )
# flatten the columns..
df2.columns = [c[1] if c[0]=="val" else c[0] for c in df2.columns.to_flat_index()]
df2
The program works correctly if the key columns have unique values, but it fails if there are duplicate values. The issue I have is that my actual dataframe has rows with 30 columns, others with 60, others with 63, etc., so the program detects the empty values as duplicates and fails.
Please check this example:
A B C D E F G H I J
0 1.1.1 amba 50 1 131 4 NaN NaN NaN NaN
1 2.2.2 erto 50 7 40 8 150.0 8.0 131.0 2.0
2 3.3.3 gema 131 2 150 5 40.0 1.0 50.0 3.0
And I would like to get something like this:
ID User 40 50 131 150
0 1.1.1 amba 1 4
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
If I try to unstack this, I get the error "Index contains duplicate entries, cannot reshape". I have been reading that df.drop_duplicates, pivot_table, etc. could help in this situation, but I cannot make any of them work with my current code. Any idea how to fix this? Thanks.

The idea is to convert the first 2 columns to a MultiIndex, then use concat to pair the even-positioned key columns with the odd-positioned value columns selected by DataFrame.iloc, reshape with DataFrame.stack, and remove the third, unnecessary MultiIndex level with reset_index:
df = df.set_index(['A','B'])
# even-positioned columns (C, E, G, I) hold the keys, odd-positioned (D, F, H, J) the values
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
               axis=1, keys=('key','val'))
Last, add the key column to the MultiIndex with DataFrame.set_index, reshape with Series.unstack, convert the MultiIndex to columns with reset_index, rename the columns, and finally remove the column levels name with DataFrame.rename_axis:
df = (df.set_index('key', append=True)['val']
        .unstack()
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba 3 1 4 5
1 2.2.2 erto 8 7 2 8
2 3.3.3 gema 1 3 2 5
It also works well for the second example, because the missing rows are removed by stack; a rename is added to convert the column names to int where possible:
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
               axis=1, keys=('key','val'))
print (df)
key val
A B
1.1.1 amba 50.0 1.0
amba 131.0 4.0
2.2.2 erto 50.0 7.0
erto 40.0 8.0
erto 150.0 8.0
erto 131.0 2.0
3.3.3 gema 131.0 2.0
gema 150.0 5.0
gema 40.0 1.0
gema 50.0 3.0
df = (df.set_index('key', append=True)['val']
        .unstack()
        .rename(columns=int)
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
ID User 40 50 131 150
0 1.1.1 amba NaN 1.0 4.0 NaN
1 2.2.2 erto 8.0 7.0 2.0 8.0
2 3.3.3 gema 1.0 3.0 2.0 5.0
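If the float display with NaN is not wanted and the integer look of the desired table is the goal (an assumption about the expected output), pandas' nullable Int64 dtype keeps integers next to missing values; a minimal sketch:
# Assumption: integer output preferred; Int64 is the nullable integer dtype,
# so missing cells print as <NA> instead of forcing the column to float
val_cols = [c for c in df.columns if c not in ('ID', 'User')]
df[val_cols] = df[val_cols].astype('Int64')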
EDIT1: Added a helper column with a counter to avoid duplicates:
print (df)
A B C D E F G H I J
0 1.1.1 amba 50 1 50 4 40 3 150 5 <- E=50
1 2.2.2 erto 50 7 40 8 150 8 131 2
2 3.3.3 gema 131 2 150 5 40 1 50 3
df = df.set_index(['A','B'])
df = pd.concat([df.iloc[:, ::2].stack().reset_index(level=2, drop=True),
                df.iloc[:, 1::2].stack().reset_index(level=2, drop=True)],
               axis=1, keys=('key','val'))
df['g'] = df.groupby(['A','B','key']).cumcount()
print (df)
key val g
A B
1.1.1 amba 50 1 0
amba 50 4 1
amba 40 3 0
amba 150 5 0
2.2.2 erto 50 7 0
erto 40 8 0
erto 150 8 0
erto 131 2 0
3.3.3 gema 131 2 0
gema 150 5 0
gema 40 1 0
gema 50 3 0
df = (df.set_index(['g','key'], append=True)['val']
        .unstack()
        .reset_index()
        .rename(columns={"A":"ID","B":"User"})
        .rename_axis(None, axis=1))
print (df)
ID User g 40 50 131 150
0 1.1.1 amba 0 3.0 1.0 NaN 5.0
1 1.1.1 amba 1 NaN 4.0 NaN NaN
2 2.2.2 erto 0 8.0 7.0 2.0 8.0
3 3.3.3 gema 0 1.0 3.0 2.0 5.0
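If you would rather end with a single row per ID, the pivot_table mentioned in the question also works; a hedged sketch applied to the long key/val frame from the concat step above (before the counter and unstack), where aggfunc='sum' is an assumption and 'first' would keep only the first duplicate:
# collapse duplicate (A, B, key) entries by aggregating their values
df_agg = (df.reset_index()
            .pivot_table(index=['A','B'], columns='key', values='val', aggfunc='sum')
            .reset_index()
            .rename(columns={"A":"ID","B":"User"})
            .rename_axis(None, axis=1))
print (df_agg)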

What you are trying to do seems too complex. May I suggest a simpler solution which just converts each row to a dictionary of the desired result and then binds them back together:
pd.DataFrame(list(map(lambda row: {'ID': row['A'], 'User': row['B'], row['C']: row['D'],
                                   row['E']: row['F'], row['G']: row['H'], row['I']: row['J']},
                      df1.to_dict('records'))))
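With the second (NaN) example, though, missing keys become NaN column names; a hedged variant of the same row-to-dictionary idea that simply skips missing pairs:
# Sketch: same idea, but key/value pairs whose key is NaN are dropped,
# so ragged rows from the second example survive
pairs = [('C', 'D'), ('E', 'F'), ('G', 'H'), ('I', 'J')]
records = []
for row in df1.to_dict('records'):
    rec = {'ID': row['A'], 'User': row['B']}
    rec.update({row[k]: row[v] for k, v in pairs if pd.notna(row[k])})
    records.append(rec)
pd.DataFrame(records)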

Related

Downsample non timeseries pandas dataframe

I have a data frame like below,
Name = ['A','A','A','A','A','A','B','B','B','B','B','B','B']
Id = ['10','10','10','10','10','10','20','20','20','20','20','20','20']
Depth_Feet = ['69.1','70.5','71.4','72.8','73.2','74.2','208.0','209.2','210.2','211.0','211.2','211.7','212.5']
Val = ['2','3.1','1.1','2.1','6.0','1.1','1.2','1.3','3.1','2.9','5.0','6.1','3.2']
d = {'Name':Name,'Id':Id,'Depth_Feet':Depth_Feet,'Val':Val}
df = pd.DataFrame(d)
print (df.head(20))
Depth_Feet Id Name Val
0 69.1 10 A 2
1 70.5 10 A 3.1
2 71.4 10 A 1.1
3 72.8 10 A 2.1
4 73.2 10 A 6.0
5 74.2 10 A 1.1
6 208.0 20 B 1.2
7 209.2 20 B 1.3
8 210.2 20 B 3.1
9 211.0 20 B 2.9
10 211.2 20 B 5.0
11 211.7 20 B 6.1
12 212.5 20 B 3.2
I want to reduce the size of data frame by Depth_Feet column (let's say every 2 feet).
Desired output is
Depth_Feet Id Name Val
0 69.1 10 A 2
1 71.4 10 A 1.1
2 73.2 10 A 6.0
3 208.0 20 B 1.2
4 210.2 20 B 3.1
5 212.5 20 B 3.2
I have tried a few options like round and groupby etc., but I'm not able to get the result I want.
If you need every 2nd row per group:
df1 = df[df.groupby('Name').cumcount() % 2 == 0]
print (df1)
Name Id Depth_Feet Val
0 A 10 69.1 2
2 A 10 71.4 1.1
4 A 10 73.2 6.0
6 B 20 208.0 1.2
8 B 20 210.2 3.1
10 B 20 211.2 5.0
12 B 20 212.5 3.2
If you need to resample in steps of 2 per group, convert the values to a TimedeltaIndex:
df2 = (df.set_index(pd.to_timedelta(df.Depth_Feet.astype(float), unit='D'))
         .groupby('Name')
         .resample('2D')
         .first()
         .reset_index(drop=True))
print (df2)
Name Id Depth_Feet Val
0 A 10 69.1 2
1 A 10 71.4 1.1
2 A 10 73.2 6.0
3 B 20 208.0 1.2
4 B 20 210.2 3.1
5 B 20 212.5 3.2
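A hedged alternative without the TimedeltaIndex detour, assuming the 2-foot bins should start at each group's shallowest depth:
# Bucket each Name's depths into 2-foot bins anchored at the group's first depth,
# then keep the first row of every bin
depth = df['Depth_Feet'].astype(float)
start = depth.groupby(df['Name']).transform('min')
bins = ((depth - start) // 2).rename('bin')
df3 = (df.groupby(['Name', bins])
         .first()
         .reset_index()
         .drop(columns='bin'))
print (df3)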

Flatten DataFrame by group with columns creation in Pandas

I have the following pandas DataFrame
Id_household Age_Father Age_child
0 1 30 2
1 1 30 4
2 1 30 4
3 1 30 1
4 2 27 4
5 3 40 14
6 3 40 18
and I want to achieve the following result
Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
Id_household
1 30 1 2.0 4.0 4.0
2 27 4 NaN NaN NaN
3 40 14 18.0 NaN NaN
I tried stacking with multi-index renaming, but I am not very happy with it and I am not able to make everything work properly.
Use this:
df_out = df.set_index([df.groupby('Id_household').cumcount()+1,
                       'Id_household',
                       'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()
Output:
Id_household Age_Father Age_child_1 Age_child_2 Age_child_3 Age_child_4
0 1 30 2.0 4.0 4.0 1.0
1 2 27 4.0 NaN NaN NaN
2 3 40 14.0 18.0 NaN NaN
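If the children should also appear in ascending age order, as in the desired table (an assumption about the intent), sort within each household before numbering; a sketch:
# Hedged tweak: order children by age per household, then number them
df_sorted = df.sort_values(['Id_household', 'Age_child'])
df_out = df_sorted.set_index([df_sorted.groupby('Id_household').cumcount()+1,
                              'Id_household',
                              'Age_Father']).unstack(0)
df_out.columns = [f'{i}_{j}' for i, j in df_out.columns]
df_out.reset_index()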

opposite of df.diff() in pandas

I have searched the forums for a cleaner way to create a new column in a dataframe that is the sum of the row with the previous row, the opposite of the .diff() function, which takes the difference.
this is how I'm currently solving the problem
df = pd.DataFrame({'c':['dd','ee','ff','gg','hh'], 'd':[1,2,3,4,5]})
df['e']= df['d'].shift(-1)
df['f'] = df['d'] + df['e']
Your ideas are appreciated.
You can use rolling with a window size of 2 and sum:
df['f'] = df['d'].rolling(2).sum().shift(-1)
c d f
0 dd 1 3.0
1 ee 2 5.0
2 ff 3 7.0
3 gg 4 9.0
4 hh 5 NaN
The opposite of df.diff() is df.cumsum(), which undoes the differencing. Example:
data = {'a':[1,6,3,9,5], 'b':[13,1,2,5,23]}
df = pd.DataFrame(data)
df =
a b
0 1 13
1 6 1
2 3 2
3 9 5
4 5 23
df.diff()
a b
0 NaN NaN
1 5.0 -12.0
2 -3.0 1.0
3 6.0 3.0
4 -4.0 18.0
df.cumsum()
a b
0 1 13
1 7 14
2 10 16
3 19 21
4 24 44
If you cannot use rolling, due to a MultiIndex or similar, you can try using .cumsum() and then .diff(2) to subtract the .cumsum() result from two positions before.
data = {'a':[1,6,3,9,5,30, 101, 8]}
df = pd.DataFrame(data)
df['opp_diff'] = df['a'].cumsum().diff(2)
a opp_diff
0 1 NaN
1 6 NaN
2 3 9.0
3 9 12.0
4 5 14.0
5 30 35.0
6 101 131.0
7 8 109.0
Generally, to get an inverse of .diff(n) you should be able to do .cumsum().diff(n+1), since cumsum[i] - cumsum[i-(n+1)] = a[i] + a[i-1] + ... + a[i-n]. The catch is that the first n+1 results will be NaN.
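A quick sanity check of that rule against rolling (here n = 1):
# cumsum().diff(2) matches rolling(2).sum() once past the warm-up NaNs
s = pd.Series([1, 6, 3, 9, 5, 30, 101, 8])
via_cumsum = s.cumsum().diff(2)   # two leading NaNs
via_rolling = s.rolling(2).sum()  # one leading NaN
print(via_cumsum.iloc[2:].equals(via_rolling.iloc[2:]))  # True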

Assign values from pandas.quantile

I am just trying to get the quantiles of a dataframe assigned onto another dataframe, like:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
the result is
0 NaN
...
5758 NaN
Name: pc, Length: 5759, dtype: float64
Any idea why? dataframe['row'] has plenty of values.
It is expected, because the indices differ, so the Series created by quantile does not align with the original DataFrame and you get NaNs:
#indices 0,1,2...6
dataframe = pd.DataFrame({'row':[2,0,8,1,7,4,5]})
print (dataframe)
row
0 2
1 0
2 8
3 1
4 7
5 4
6 5
#indices 0.1, 0.5, 0.7
print (dataframe['row'].quantile([.1,.5,.7]))
0.1 0.6
0.5 4.0
0.7 5.4
Name: row, dtype: float64
#not align
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7])
print (dataframe)
row pc
0 2 NaN
1 0 NaN
2 8 NaN
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
If you want to create a DataFrame of the quantiles, add rename_axis + reset_index:
df = dataframe['row'].quantile([.1,.5,.7]).rename_axis('a').reset_index(name='b')
print (df)
a b
0 0.1 0.6
1 0.5 4.0
2 0.7 5.4
But if some indices are the same (I think this is not what you want, it is only for a better explanation), add reset_index for the default indices 0,1,2:
print (dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True))
0 0.6
1 4.0
2 5.4
Name: row, dtype: float64
The first 3 rows are aligned, because the indices 0,1,2 are the same in the Series and the DataFrame:
dataframe['pc'] = dataframe['row'].quantile([.1,.5,.7]).reset_index(drop=True)
print (dataframe)
row pc
0 2 0.6
1 0 4.0
2 8 5.4
3 1 NaN
4 7 NaN
5 4 NaN
6 5 NaN
EDIT:
For multiple columns you need DataFrame.quantile; it also excludes non-numeric columns:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7,8,9,4,2,3],
                   'D':[1,3,5,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
df1 = df.quantile([.1,.2,.3,.4])
print (df1)
B C D E
0.1 4.0 2.5 0.5 2.5
0.2 4.0 3.0 1.0 3.0
0.3 4.0 3.5 1.0 3.5
0.4 4.0 4.0 1.0 4.0
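One caveat, assuming a recent pandas (2.x): DataFrame.quantile no longer drops non-numeric columns silently and raises instead, so pass numeric_only explicitly for mixed frames:
# Assumption: pandas >= 2.0, where the numeric_only default changed to False
df1 = df.quantile([.1,.2,.3,.4], numeric_only=True)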

pandas.DataFrame set all string values to nan

I have a pandas.DataFrame that contain string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN ?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, which ever is sooner. So don't make this a long term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
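A common idiom for the same column-wise coercion, possibly what that comment refers to (shown as a sketch):
# apply forwards the keyword argument to pd.to_numeric for each column
df = df.apply(pd.to_numeric, errors='coerce')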
