pandas groupby to flat DataFrame

I would like to convert groupby result to a flat DataFrame.
import pandas as pd

df1 = pd.DataFrame({"x": ["A", "B", "C", "A", "B", "B"],
                    "y": [1, 2, 3, 4, 5, 6]})
g1 = df1.groupby(["x"]).max().reset_index()
print(g1)
The expected output DataFrame looks like this:
x y1 y2 y3
0 A 1 4 0
1 B 2 5 6
2 C 3 0 0
If a value does not exist, default to 0.

Try groupby.agg to collect each group's values into a list, expand the lists into columns with pd.Series, then add_prefix, fillna, and reset_index, like the following (this yields columns y0, y1, y2):
g1 = df1.groupby('x')['y'].agg(list).apply(pd.Series).add_prefix('y').fillna(0).reset_index()
print(g1)
Or, if you care about the column names, shift the positional labels with rename and the slick 1 .__add__ (which adds 1 to each column label, so the columns become 1, 2, 3 before the prefix):
g1 = df1.groupby('x')['y'].agg(list).apply(pd.Series).rename(1 .__add__, axis=1).add_prefix('y').fillna(0).reset_index()
Output:
x y1 y2 y3
0 A 1.0 4.0 0.0
1 B 2.0 5.0 6.0
2 C 3.0 0.0 0.0
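fillna introduces float columns above; if an integer dtype matters, a cast afterwards restores it. A minimal sketch, assuming g1 from the snippet above:
int_cols = [c for c in g1.columns if c.startswith('y')]
g1[int_cols] = g1[int_cols].astype(int)
print(g1)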

We can use pivot_table with the 'x' column as the index. groupby.cumcount on 'x' enumerates the rows within each group, so the positional y values become the new columns 1, 2, 3, and fill_value=0 sets the default for missing entries (the benefit of fill_value over fillna is that no NaN is introduced, so the dtype does not change to float).
Lastly, add_prefix renames the columns and reset_index matches the desired output:
out = (df1.pivot_table(index='x',
                       columns=df1.groupby('x').cumcount() + 1,
                       values='y',
                       fill_value=0)
          .add_prefix('y')
          .reset_index())
out:
x y1 y2 y3
0 A 1 4 0
1 B 2 5 6
2 C 3 0 0
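For reference, the same reshape can also be written with set_index plus unstack; this is a sketch, not from the original answers, assuming df1 as defined in the question:
out = (df1.set_index(['x', df1.groupby('x').cumcount() + 1])['y']
          .unstack(fill_value=0)
          .add_prefix('y')
          .reset_index())
print(out)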

Related

.agg Sum Converting NaN to 0

I am trying to bin a Pandas DataFrame into three-day windows. I have two columns, A and B, which I want to sum within each window. The code I wrote for the task,
df = df.groupby(df.index // 3).agg({'A': 'sum', 'B': 'sum'})
converts NaN values to zero when summing, but I would like them to remain NaN, as my data has actual non-NaN zero values.
For example, if I had this df:
df = pd.DataFrame([
    [np.nan, np.nan],
    [np.nan, 3],
    [np.nan, np.nan],
    [2, 0],
    [4, 0],
    [0, 0]
], columns=['A', 'B'])
Index  A    B
0      NaN  NaN
1      NaN  3
2      NaN  NaN
3      2    0
4      4    0
5      0    0
I would like the new df to be:
Index A B
0 NaN 3
1 6 0
But my current code outputs:
Index A B
0 0 3
1 6 0
df.groupby(df.index // 3)[['A', 'B']].sum(min_count=1)
The above snippet produces the desired sample output: with min_count=1, a group's sum needs at least one non-NaN value, so an all-NaN group stays NaN instead of becoming 0.
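As a standalone illustration of min_count (a minimal sketch, separate from the question's data):
import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan])
print(s.sum())             # 0.0 -- by default an all-NaN sum collapses to 0
print(s.sum(min_count=1))  # nan -- at least one non-NaN value is required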
Another option, keeping NaN for A while skipping it for B (this matches the desired output):
df.groupby(df.index // 3).agg({'A': lambda x: x.sum(skipna=False),
                               'B': lambda x: x.sum(skipna=True)})
Or, if both columns should propagate NaN, try this:
df.groupby(df.index // 3).agg({'A': lambda x: x.sum(skipna=False),
                               'B': lambda x: x.sum(skipna=False)})
Out[282]:
A B
0 NaN NaN
1 6.0 0.0

Largest elementwise difference between all rows in dataframe

Given is the following dataframe:
c1 c2 c3 c4
code
x 1 2 1 1
y 3 2 2 1
z 2 0 4 1
For any row in this dataframe I want to calculate the largest elementwise absolute difference between this row and all other rows of this dataframe and put it into a new dataframe:
x y z
code
x 0 2 3
y 2 0 2
z 3 2 0
(the result is, of course, a symmetric matrix with main diagonal = 0, so it would be sufficient to compute just the upper or lower triangular half).
So for instance the maximum elementwise difference between rows x and y is 2 (from column c1: abs(3 - 1) = 2).
What I got so far:
df = pd.DataFrame(data={'code': ['x', 'y', 'z'], 'c1': [1, 3, 2],
                        'c2': [2, 2, 0], 'c3': [1, 2, 4], 'c4': [1, 1, 1]})
df.set_index('code', inplace=True)
df1 = pd.DataFrame()
for row in df.iterrows():
    df1.append((df - row[1]).abs().max(1), ignore_index=True)
When run interactively, this already looks close to what I need, but the new df1 is still empty afterwards:
>>> for row in df.iterrows(): df1.append((df-row[1]).abs().max(1),ignore_index=True)
...
x y z
0 0.0 2.0 3.0
x y z
0 2.0 0.0 2.0
x y z
0 3.0 2.0 0.0
>>> df1
Empty DataFrame
Columns: []
Index: []
Questions:
How do I get the results into the new dataframe df1 (with the correct index x, y, ...)?
This is only an MCVE; in reality, df has about 700 rows, so I am not sure iterrows scales well. I have a feeling that the apply method would come in handy here, but I couldn't figure it out. Is there a more idiomatic / pandas-like way to do this without explicitly iterating over the rows?
You can use NumPy and feed an array to the pd.DataFrame constructor. For a small number of rows, as in your data, this should be efficient.
import numpy as np

A = df.values
# broadcast to an n x n x m cube of pairwise differences, then reduce over columns
res = pd.DataFrame(np.abs(A - A[:, None]).max(2),
                   index=df.index, columns=df.index.values)
print(res)
x y z
code
x 0 2 3
y 2 0 2
z 3 2 0
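For larger inputs (the question mentions ~700 rows), the same matrix is the pairwise Chebyshev distance, which scipy can compute without materialising the full n x n x m broadcast cube. A sketch, assuming scipy is available and df as above:
from scipy.spatial.distance import cdist

# Chebyshev distance = max elementwise absolute difference between two rows
res = pd.DataFrame(cdist(df.values, df.values, metric='chebyshev'),
                   index=df.index, columns=df.index)
print(res)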
If you want your code to produce correct output, assign the value returned by append back to df1 (append returns a new frame rather than modifying df1 in place):
for row in df.iterrows():
    df1 = df1.append((df - row[1]).abs().max(1), ignore_index=True)
df1.index = df.index
print(df1)
   x    y    z
x  0.0  2.0  3.0
y  2.0  0.0  2.0
z  3.0  2.0  0.0
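Note that DataFrame.append was removed in pandas 2.0; under that assumption, the same loop can be written with a list of Series fed to the DataFrame constructor instead (a sketch using the df defined in the question):
rows = [(df - row).abs().max(axis=1) for _, row in df.iterrows()]
df1 = pd.DataFrame(rows, index=df.index)
print(df1)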

Pandas - combine two columns

I have 2 columns, which we'll call x and y. I want to create a new column called xy:
x  y  xy
1     1
2     2
   4  4
   8  8
There shouldn't be any conflicting values, but if there are, y takes precedence. If it makes the solution easier, you can assume that x will always be NaN where y has a value.
It could be quite simple if your example is accurate:
df = df.fillna(0)  # if the blanks are NaN, you need this line first
df['xy'] = df['x'] + df['y']
Note that if the blanks are empty strings, your columns are object (string) dtype, not numeric; coerce them first:
df = df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['xy'] = df.sum(1)
More options, joining the values as strings:
df['xy'] = df[['x', 'y']].astype(str).apply(''.join, 1)
Out[655]:
0 1.0
1 2.0
2
3 4.0
4 8.0
dtype: object
You can also use NumPy:
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 2, np.nan, np.nan],
                   'y': [np.nan, np.nan, 4, 8]})
arr = df.values
# pick the single non-NaN value from each row (assumes exactly one per row)
df['xy'] = arr[~np.isnan(arr)].astype(int)
print(df)
x y xy
0 1.0 NaN 1
1 2.0 NaN 2
2 NaN 4.0 4
3 NaN 8.0 8
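The "y takes precedence" rule also maps directly onto combine_first, which keeps values from one series and fills its NaNs from another. A sketch, not one of the original answers:
import pandas as pd
import numpy as np

df = pd.DataFrame({'x': [1, 2, np.nan, np.nan],
                   'y': [np.nan, np.nan, 4, 8]})
# values from y win; NaNs in y are filled from x
df['xy'] = df['y'].combine_first(df['x'])
print(df)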

Pandas pivot on column

my CSV looks like:
"a","b","c","d"
1, "x", 1, 1
1, "y", 2, 2
and I want to convert it based on column "b" to
"a", "x_c", "y_c", "x_d", "y_d"
1, 1, 2, 1, 2
I've tried it with pivot and unstack. Is there a shortcut in pandas?
EDIT: I have multiple columns, therefore I need to append a suffix/prefix.
Use pivot_table:
import numpy as np

df = df.pivot_table(index='a', columns='b', values=['c', 'd'], aggfunc=np.mean)
# flatten the MultiIndex columns into b_value names, e.g. ('c', 'x') -> 'x_c'
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[1], x[0]))
df = df.reset_index()
print(df)
a x_c y_c x_d y_d
0 1 1 2 1 2
If there are duplicate (a, b) pairs, aggfunc aggregates them:
print (df)
a b c d
0 1 x 1 1 <-duplicates for 1, x
1 1 y 2 2
2 1 x 4 2 <-duplicates for 1, x
3 2 y 2 3
df = df.pivot_table(index='a',columns='b', values=['c', 'd'], aggfunc=np.mean)
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[1], x[0]))
df = df.reset_index()
print (df)
a x_c y_c x_d y_d
0 1 2.5 2.0 1.5 2.0 <-x_c, x_d aggregated mean
1 2 NaN 2.0 NaN 3.0
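The MultiIndex flattening can also be written as an f-string comprehension, and passing the string 'mean' avoids the numpy import; a minor variant sketch of the same answer:
df = df.pivot_table(index='a', columns='b', values=['c', 'd'], aggfunc='mean')
# each column label is a (value, b) tuple, e.g. ('c', 'x') -> 'x_c'
df.columns = [f'{b}_{v}' for v, b in df.columns]
df = df.reset_index()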

Creating dataframe from a dictionary where entries have different lengths

Say I have a dictionary with 10 key-value pairs. Each entry holds a numpy array. However, the length of the array is not the same for all of them.
How can I create a dataframe where each column holds a different entry?
When I try:
pd.DataFrame(my_dict)
I get:
ValueError: arrays must all be the same length
Any way to overcome this? I am happy to have Pandas use NaN to pad those columns for the shorter entries.
In Python 3.x:
import pandas as pd
import numpy as np
d = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in d.items() ]))
Out[7]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
In Python 2.x:
replace d.items() with d.iteritems().
Here's a simple way to do that:
In[20]: my_dict = dict( A = np.array([1,2]), B = np.array([1,2,3,4]) )
In[21]: df = pd.DataFrame.from_dict(my_dict, orient='index')
In[22]: df
Out[22]:
0 1 2 3
A 1 2 NaN NaN
B 1 2 3 4
In[23]: df.transpose()
Out[23]:
A B
0 1 1
1 2 2
2 NaN 3
3 NaN 4
A way of tidying up your syntax, while still doing essentially the same thing as the other answers, is below:
>>> mydict = {'one': [1,2,3], 2: [4,5,6,7], 3: 8}
>>> dict_df = pd.DataFrame({ key:pd.Series(value) for key, value in mydict.items() })
>>> dict_df
one 2 3
0 1.0 4 8.0
1 2.0 5 NaN
2 3.0 6 NaN
3 NaN 7 NaN
A similar syntax exists for lists, too:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame([ pd.Series(value) for value in mylist ])
>>> list_df
0 1 2
0 1.0 2.0 3.0
1 4.0 5.0 NaN
2 6.0 NaN NaN
Another syntax for lists is:
>>> mylist = [ [1,2,3], [4,5], 6 ]
>>> list_df = pd.DataFrame({ i:pd.Series(value) for i, value in enumerate(mylist) })
>>> list_df
0 1 2
0 1 4.0 6.0
1 2 5.0 NaN
2 3 NaN NaN
You may additionally have to transpose the result and/or change the column data types (float, integer, etc).
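For example, restoring an integer dtype after the NaN padding can use pandas' nullable Int64 (a sketch, assuming pandas >= 1.0 for the nullable dtype):
import pandas as pd

mydict = {'one': [1, 2, 3], 2: [4, 5, 6, 7], 3: 8}
dict_df = pd.DataFrame({key: pd.Series(value) for key, value in mydict.items()}).astype('Int64')
print(dict_df)  # padded cells show as <NA>, other cells stay integers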
Use pandas.DataFrame and pandas.concat
The following code creates a list of DataFrames with pandas.DataFrame from a dict of uneven arrays, and then concatenates them together in a list comprehension.
This is a way to create a DataFrame of arrays that are not equal in length.
For equal-length arrays, use df = pd.DataFrame({'x1': x1, 'x2': x2, 'x3': x3}).
import pandas as pd
import numpy as np
# create the uneven arrays
mu, sigma = 200, 25
np.random.seed(365)
x1 = mu + sigma * np.random.randn(10, 1)
x2 = mu + sigma * np.random.randn(15, 1)
x3 = mu + sigma * np.random.randn(20, 1)
data = {'x1': x1, 'x2': x2, 'x3': x3}
# create the dataframe
df = pd.concat([pd.DataFrame(v, columns=[k]) for k, v in data.items()], axis=1)
Use pandas.DataFrame and itertools.zip_longest
For iterables of uneven length, zip_longest fills missing values with the fillvalue.
The zip generator needs to be unpacked, because the DataFrame constructor won't unpack it.
from itertools import zip_longest
# zip all the values together
zl = list(zip_longest(*data.values()))
# create dataframe
df = pd.DataFrame(zl, columns=data.keys())
Plot:
df.plot(marker='o', figsize=[10, 5])
DataFrame:
x1 x2 x3
0 232.06900 235.92577 173.19476
1 176.94349 209.26802 186.09590
2 194.18474 168.36006 194.36712
3 196.55705 238.79899 218.33316
4 249.25695 167.91326 191.62559
5 215.25377 214.85430 230.95119
6 232.68784 240.30358 196.72593
7 212.43409 201.15896 187.96484
8 188.97014 187.59007 164.78436
9 196.82937 252.67682 196.47132
10 NaN 223.32571 208.43823
11 NaN 209.50658 209.83761
12 NaN 215.27461 249.06087
13 NaN 210.52486 158.65781
14 NaN 193.53504 199.10456
15 NaN NaN 186.19700
16 NaN NaN 223.02479
17 NaN NaN 185.68525
18 NaN NaN 213.41414
19 NaN NaN 271.75376
While this does not directly answer the OP's question, I found it to be an excellent solution for my case of unequal arrays, and I'd like to share it. It is from the pandas documentation:
In [31]: d = {'one' : Series([1., 2., 3.], index=['a', 'b', 'c']),
....: 'two' : Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
....:
In [32]: df = DataFrame(d)
In [33]: df
Out[33]:
one two
a 1 1
b 2 2
c 3 3
d NaN 4
You can also use pd.concat along axis=1 with a list of pd.Series objects:
import pandas as pd, numpy as np
d = {'A': np.array([1,2]), 'B': np.array([1,2,3,4])}
res = pd.concat([pd.Series(v, name=k) for k, v in d.items()], axis=1)
print(res)
A B
0 1.0 1
1 2.0 2
2 NaN 3
3 NaN 4
Both of the following lines work perfectly (here d is the dictionary):
pd.DataFrame.from_dict(d, orient='index').transpose()             # A
pd.DataFrame(dict([(k, pd.Series(v)) for k, v in d.items()]))     # B (better)
But with %timeit in Jupyter, I got a ratio of about 4x speed for B vs A, which is quite impressive, especially when working with a huge data set (mainly one with a big number of columns/features).
If you don't want NaN to show and you have exactly two lengths, appending a 'space' in each remaining cell also works:
import pandas as pd

long = [6, 4, 7, 3]
short = [5, 6]
# pad the short list with blanks so both columns have the same length
for n in range(len(long) - len(short)):
    short.append(' ')
df = pd.DataFrame({'A': long, 'B': short})
# write the result to an Excel file in the working directory
datatoexcel = pd.ExcelWriter('example1.xlsx', engine='xlsxwriter')
df.to_excel(datatoexcel, sheet_name='Sheet1')
datatoexcel.save()
A B
0 6 5
1 4 6
2 7
3 3
If you have more than two lengths of entries, it is advisable to write a function that uses a similar method, as sketched below.
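A generic padding helper for that case might look like this (a sketch; it pads every list with ' ' up to the longest length, matching the approach above):
import pandas as pd

def pad_lists(columns, pad=' '):
    # extend each list with the pad value so all columns are equal length
    longest = max(len(v) for v in columns.values())
    return {k: list(v) + [pad] * (longest - len(v)) for k, v in columns.items()}

df = pd.DataFrame(pad_lists({'A': [6, 4, 7, 3], 'B': [5, 6], 'C': [9]}))
print(df)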
