Getting pandas groupby.shift() results with groupby vars as cols / index? - python

Given this trivial dataset
df = pd.DataFrame({'one': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'two': ['c', 'c', 'c', 'c', 'd', 'd'],
                   'three': [1, 2, 3, 4, 5, 6]})
grouping on one / two and applying .max() returns me a Series indexed on the groupby vars, as expected...
df.groupby(['one', 'two'])['three'].max()
output:
one  two
a    c      3
b    c      4
     d      6
Name: three, dtype: int64
...in my case I want to shift() my records, by group. But for some reason, when I apply .shift() to the groupby object, my results don't include the groupby variables:
df.groupby(['one', 'two'])['three'].shift()
output:
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
Name: three, dtype: float64
Is there a way to preserve those groupby variables in the results, as either columns or a multi-indexed Series (as in .max())? Thanks!

It is the difference between max and shift - max aggregates values (returns an aggregated Series) while shift does not - it returns a Series of the same size.
So it is possible to append the output as a new column:
df['shifted'] = df.groupby(['one', 'two'])['three'].shift()
Theoretically it is possible to use agg, but it returns an error in pandas 0.20.3:
df1 = df.groupby(['one', 'two'])['three'].agg(['max', lambda x: x.shift()])
print (df1)
ValueError: Function does not reduce
One possible solution is transform, if you need max together with shift:
g = df.groupby(['one', 'two'])['three']
df['max'] = g.transform('max')
df['shifted'] = g.shift()
print (df)
one three two max shifted
0 a 1 c 3 NaN
1 a 2 c 3 1.0
2 a 3 c 3 2.0
3 b 4 c 4 NaN
4 b 5 d 6 NaN
5 b 6 d 6 5.0
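If you also want the group keys back as an index (as in .max()), one option is to set them as the index before grouping, so the shifted Series keeps them as a MultiIndex. A minimal sketch, assuming the original df from the question:
# shift within each (one, two) group; the MultiIndex is preserved on the result
shifted = (df.set_index(['one', 'two'])
             .groupby(level=['one', 'two'])['three']
             .shift())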

As Jez explained, shift returns a Series that keeps the same length as the dataframe; if you try to aggregate with it the way you would with max(), you will get the error
Function does not reduce
df.assign(shifted=df.groupby(['one', 'two'])['three'].shift()).set_index(['one','two'])
Out[57]:
         three  shifted
one two
a   c        1      NaN
    c        2      1.0
    c        3      2.0
b   c        4      NaN
    d        5      NaN
    d        6      5.0
Using max as the key, you can also shift and then slice the shifted values at the rows where the value equals the group max:
df.groupby(['one', 'two'])['three'].apply(lambda x : x.shift()[x==x.max()])
Out[58]:
one  two
a    c    2    2.0
b    c    3    NaN
     d    5    5.0
Name: three, dtype: float64

Related

How to get average between first row and current row per each group in data frame?

I have a data frame like this:
id  value
a   2
a   4
a   3
a   5
b   1
b   4
b   3
c   1
c   nan
c   5
The resulting data frame should contain a new column 'average', whose values are computed as follows:
group by id
the first row of 'average' in each group is equal to its corresponding value in 'value'
every other row of 'average' in a group is equal to the mean of all previous rows of 'value' (excluding the current value)
The resulting data frame must be:
id  value  average
a   2      2
a   4      2
a   3      3
a   5      3
b   1      1
b   4      1
b   3      2.5
c   1      1
c   nan    1
c   5      1
You can group the dataframe by id, calculate the expanding mean of the value column for each group, shift it, and assign it back to the original dataframe. Once you have it, you just need to ffill on axis=1 over the value and average columns to fill in the first value of each group:
out = (df
       .assign(average=df
                       .groupby(['id'])['value']
                       .transform(lambda x: x.expanding().mean().shift(1))
               )
       )
out[['value', 'average']] = out[['value', 'average']].ffill(axis=1)
OUTPUT:
id value average
0 a 2.0 2.0
1 a 4.0 2.0
2 a 3.0 3.0
3 a 5.0 3.0
4 b 1.0 1.0
5 b 4.0 1.0
6 b 3.0 2.5
7 c 1.0 1.0
8 c NaN 1.0
9 c 5.0 1.0
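To make the core step explicit, a small sketch on one group's values (2, 4, 3, 5): expanding().mean() includes the current row, and shift(1) pushes the result down one step so each row only sees the rows above it.
import pandas as pd

v = pd.Series([2, 4, 3, 5])
print(v.expanding().mean())           # 2.0, 3.0, 3.0, 3.5
print(v.expanding().mean().shift(1))  # NaN, 2.0, 3.0, 3.0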
Here is a solution which, I think, satisfies the requirements. Here, the first row in a group of ids is simply passing its value to the average column. For every other row, we take the average where the index is smaller than the current index.
You may want to specify how you want to handle NaN values. Below, they are simply skipped when averaging.
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['a', 2],
    ['a', 4],
    ['a', 3],
    ['a', 5],
    ['b', 1],
    ['b', 4],
    ['b', 3],
    ['c', 1],
    ['c', np.nan],
    ['c', 5]
], columns=['id', 'value'])

id_groups = df.groupby(['id'])
id_level_frames = []
for group, frame in id_groups:
    # Reset the index for each id-level frame
    frame = frame.reset_index()
    for index, row in frame.iterrows():
        # If this is the first row of the group, the average is the value itself
        if index == 0:
            frame.at[index, 'average'] = row['value']
        else:
            # Mean of all earlier rows in the group; np.nanmean ignores NaN values
            earlier_rows = frame[frame.index < index]
            frame.at[index, 'average'] = np.nanmean(earlier_rows['value'])
    id_level_frames.append(frame)

final_df = pd.concat(id_level_frames)

Locating and filling with a corresponding value using pandas

I'm trying to locate a null value in a corresponding column and fill it with an existing corresponding value.
Eg:
a 1
b 3
b null
d null
d 4
so I want to fill the null for b with 3 and d with 4.
The code I tried is as follows:
for x in df['Item_Identifier'].unique():
    if df.loc[df['Item_Identifier' == x,'Item_Weight']].isnull:
        df.loc[df['Item_Identifier','Item_Weight']].fill(df.loc[df['Item_Identifier','Item_Weight']] is not 'null')
Assuming you want to fill the missing values per group with the nearest value.
You can groupby+ffill/bfill:
(df.replace({'null': float('nan')})
   .groupby(df['col1']).ffill()
   .groupby(df['col1']).bfill()
   .convert_dtypes()
)
output:
col1 col2
0 a 1
1 b 3
2 b 3
3 d 4
4 d 4
used input:
df = pd.DataFrame({'col1': ['a', 'b', 'b', 'd', 'd'],
                   'col2': [1, 3, 'null', 'null', 4]})
This worked for me:
df.groupby('col1')['col2'].ffill().bfill()
You can also look at the limit parameter of bfill and ffill:
df['col2'].replace('null', np.nan).ffill(limit=1).bfill(limit=1)
Output:
0 1.0
1 3.0
2 3.0
3 4.0
4 4.0
Name: col2, dtype: float64
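Another option (a sketch, assuming the same used input, numpy imported as np, and at least one real value per col1 group) is to fill each group's missing values with the group's first non-null value via transform('first'):
# turn the 'null' strings into real NaN, then fill per col1 group
col2 = df['col2'].replace('null', np.nan)
df['col2'] = col2.fillna(col2.groupby(df['col1']).transform('first'))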

When to use .count() and .value_counts() in Pandas?

I am learning pandas. I'm not sure when to use the .count() function and when to use .value_counts().
count() is used to count the number of non-NA/null observations across the given axis. It works with non-floating type data as well.
Now as an example create a dataframe df
df = pd.DataFrame({"A":[10, 8, 12, None, 5, 3],
"B":[-1, None, 6, 4, None, 3],
"C":["Shreyas", "Aman", "Apoorv", np.nan, "Kunal", "Ayush"]})
Find the count of non-NA values in each column (axis=0):
df.count(axis = 0)
Output:
A 5
B 4
C 5
dtype: int64
Find the number of non-NA/null values in each row (axis=1):
df.count(axis = 1)
Output:
0 3
1 2
2 3
3 1
4 2
5 3
dtype: int64
The value_counts() function returns a Series containing counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring one. NA values are excluded by default.
So for the example shown below
s = pd.Series([3, 1, 2, 3, 4, np.nan])
s.value_counts()
The output would be:
3.0 2
4.0 1
2.0 1
1.0 1
dtype: int64
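Since NA values are excluded by default, the dropna parameter of value_counts() can be used to include them; a quick sketch with the same s:
s.value_counts(dropna=False)  # NaN now appears as its own row with count 1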
value_counts() aggregates the data and counts each unique value. You can achieve the same by using groupby, which is a broader function for aggregating data in pandas.
count() simply returns the number of non-NaN/null values in the column (Series) you apply it on.
df = pd.DataFrame({'Id':['A', 'B', 'B', 'C', 'D', 'E', 'F', 'F'],
                   'Value':[10, 20, 15, 5, 35, 20, 10, 25]})
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 C 5
4 D 35
5 E 20
6 F 10
7 F 25
# Value counts
df['Id'].value_counts()
F 2
B 2
C 1
A 1
D 1
E 1
Name: Id, dtype: int64
# Same operation but with groupby
df.groupby('Id')['Id'].count()
Id
A 1
B 2
C 1
D 1
E 1
F 2
Name: Id, dtype: int64
# Count()
df['Id'].count()
8
Example with NaN values and count:
print(df)
Id Value
0 A 10
1 B 20
2 B 15
3 NaN 5
4 D 35
5 E 20
6 F 10
7 F 25
df['Id'].count()
7
count() returns the total number of non-null values in the series.
value_counts() returns a series of the number of times each unique non-null value appears, sorted from most to least frequent.
As usual, an example is the best way to convey this:
ser = pd.Series(list('aaaabbbccdef'))
ser
>
0 a
1 a
2 a
3 a
4 b
5 b
6 b
7 c
8 c
9 d
10 e
11 f
dtype: object
ser.count()
>
12
ser.value_counts()
>
a 4
b 3
c 2
f 1
d 1
e 1
dtype: int64
Note that a dataframe has the count() method, which returns a Series of the count() (scalar) value for each column in the df. A DataFrame-level value_counts() (which counts unique rows) only exists in pandas 1.1.0 and later.
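A minimal sketch of both DataFrame-level methods, assuming pandas 1.1 or later for DataFrame.value_counts():
import pandas as pd

df = pd.DataFrame({'x': ['a', 'a', 'b'],
                   'y': [1, 1, 2]})

# One non-null count per column, returned as a Series indexed by column name
print(df.count())

# Counts of unique rows, returned as a Series with a MultiIndex built from the columns
print(df.value_counts())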

Get list of DataFrame column names for non-float columns

I am trying to get a list of column names from a DataFrame corresponding to columns that aren't of type float. Right now I have
categorical = (df.dtypes.values != np.dtype('float64'))
which gives me a boolean array indicating which columns are not float, but this is not exactly what I'm looking for. Specifically, I would like a list of the column names that correspond to the True values in my boolean array.
Use boolean indexing with df.columns:
categorical = df.columns[(df.dtypes.values != np.dtype('float64'))]
Or get difference of columns selected by select_dtypes:
categorical = df.columns.difference(df.select_dtypes('float64').columns)
Sample:
df = pd.DataFrame({'A':list('abcdef'),
                   'B':[4,5,4,5,5,4],
                   'C':[7.,8,9,4,2,3],
                   'D':[1,3,5.,7,1,0],
                   'E':[5,3,6,9,2,4],
                   'F':list('aaabbb')})
print (df)
A B C D E F
0 a 4 7.0 1.0 5 a
1 b 5 8.0 3.0 3 a
2 c 4 9.0 5.0 6 a
3 d 5 4.0 7.0 9 b
4 e 5 2.0 1.0 2 b
5 f 4 3.0 0.0 4 b
categorical = df.columns.difference(df.select_dtypes('float64').columns)
print (categorical)
Index(['A', 'B', 'E', 'F'], dtype='object')
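A shorter variant of the same select_dtypes idea (a sketch, assuming only float64 columns should be excluded):
categorical = df.select_dtypes(exclude='float64').columns.tolist()
print(categorical)  # ['A', 'B', 'E', 'F']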

pandas - Going from aggregated format to long format

If I wanted to go from a long format to a grouped, aggregated format, I would simply do:
s = pd.DataFrame(['a','a','a','a','b','b','c'], columns=['value'])
s.groupby('value').size()
value
a 4
b 2
c 1
dtype: int64
Now if I wanted to revert that aggregation and go from a grouped format to a long format, how would I go about doing that? I guess I could loop through the grouped series and repeat 'a' 4 times and 'b' 2 times etc.
Is there a better way to do this in pandas or any other Python package?
Thankful for any hints
Perhaps .transform can help with this:
s.set_index('value', drop=False, inplace=True)
s['size'] = s.groupby(level='value', as_index=False).transform('size')
s.reset_index(inplace=True, drop=True)
s
yielding:
value size
0 a 4
1 a 4
2 a 4
3 a 4
4 b 2
5 b 2
6 c 1
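The same result can also be had without touching the index, a sketch assuming the original s from the question:
s['size'] = s.groupby('value')['value'].transform('size')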
Another and rather simple approach is to use np.repeat (assuming s2 is the aggregated series):
In [17]: np.repeat(s2.index.values, s2.values)
Out[17]: array(['a', 'a', 'a', 'a', 'b', 'b', 'c'], dtype=object)
In [18]: pd.DataFrame(np.repeat(s2.index.values, s2.values), columns=['value'])
Out[18]:
value
0 a
1 a
2 a
3 a
4 b
5 b
6 c
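A closely related sketch using the pandas Index.repeat method instead of np.repeat (again assuming s2 is the aggregated Series):
# repeat each index label by its count and rebuild the long-format frame
pd.DataFrame({'value': s2.index.repeat(s2.to_numpy())})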
There might be something cleaner, but here's an approach. First, store your groupby results in a dataframe and rename the columns.
agg = s.groupby('value').size().reset_index()
agg.columns = ['key', 'count']
Then, build a frame with columns that track the count for each letter.
counts = agg['count'].apply(lambda x: pd.Series([0] * x))
counts['key'] = agg['key']
In [107]: counts
Out[107]:
0 1 2 3 key
0 0 0 0 0 a
1 0 0 NaN NaN b
2 0 NaN NaN NaN c
Finally, this can be melted and nulls dropped to get your desired frame.
In [108]: pd.melt(counts, id_vars='key').dropna()[['key']]
Out[108]:
key
0 a
1 b
2 c
3 a
4 b
6 a
9 a
