I'm trying to locate null values in one column and fill each of them with an existing value from another row of the same group.
E.g.:
a 1
b 3
b null
d null
d 4
So I want to fill the null for b with 3 and for d with 4.
The code I tried is as follows:
for x in df['Item_Identifier'].unique():
    if df.loc[df['Item_Identifier' == x,'Item_Weight']].isnull:
        df.loc[df['Item_Identifier','Item_Weight']].fill(df.loc[df['Item_Identifier','Item_Weight']] is not 'null')
Assuming you want to fill the missing values per group with the nearest value.
You can use groupby + ffill/bfill:
(df.replace({'null': float('nan')})
   .groupby(df['col1']).ffill()
   .groupby(df['col1']).bfill()
   .convert_dtypes()
)
Output:
col1 col2
0 a 1
1 b 3
2 b 3
3 d 4
4 d 4
Used input:
df = pd.DataFrame({'col1': ['a', 'b', 'b', 'd', 'd'],
                   'col2': [1, 3, 'null', 'null', 4]})
This worked for me:
df.groupby('col1')['col2'].ffill().bfill()
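Note that with the sample data above, the 'null' strings must be converted to real NaN first, and the back-fill should also be done per group so values cannot leak across groups; a minimal end-to-end sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'b', 'b', 'd', 'd'],
                   'col2': [1, 3, 'null', 'null', 4]})

# Turn the 'null' strings into real missing values
df['col2'] = df['col2'].replace('null', np.nan)

# Forward-fill, then back-fill, each within its own group
df['col2'] = df.groupby('col1')['col2'].ffill()
df['col2'] = df.groupby('col1')['col2'].bfill()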
You can also look at the limit parameter of bfill and ffill:
df['col2'].replace('null', np.nan).ffill(limit=1).bfill(limit=1)
Output:
0 1.0
1 3.0
2 3.0
3 4.0
4 4.0
Name: col2, dtype: float64
I have a data frame like this:
id  value
a   2
a   4
a   3
a   5
b   1
b   4
b   3
c   1
c   nan
c   5
The resulting data frame should contain a new column, 'average', whose values are computed as follows:
group by id;
the first row of 'average' in each group is equal to its corresponding 'value';
every other row of 'average' in a group is equal to the mean of all previous rows of 'value' (excluding the current value).
The resulting data frame must be:
id  value  average
a   2      2
a   4      2
a   3      3
a   5      3
b   1      1
b   4      1
b   3      2.5
c   1      1
c   nan    1
c   5      1
You can group the dataframe by id, calculate the expanding mean of the value column within each group, and shift it by one before assigning it back to the original dataframe. Once you have that, ffill along axis=1 over the value and average columns to fill in the first value of each group:
out = (df
       .assign(average=df
               .groupby(['id'])['value']
               .transform(lambda x: x.expanding().mean().shift(1))
               )
       )
out[['value', 'average']] = out[['value', 'average']].ffill(axis=1)
Output:
id value average
0 a 2.0 2.0
1 a 4.0 2.0
2 a 3.0 3.0
3 a 5.0 3.0
4 b 1.0 1.0
5 b 4.0 1.0
6 b 3.0 2.5
7 c 1.0 1.0
8 c NaN 1.0
9 c 5.0 1.0
Here is a solution which, I think, satisfies the requirements: the first row in each group of ids simply passes its value to the average column, and every other row takes the average of the rows whose index is smaller than the current index.
You may want to specify how you want to handle the NaN values. Below, they are skipped when computing the mean, so they are effectively ignored.
import numpy as np
import pandas as pd

df = pd.DataFrame([
    ['a', 2],
    ['a', 4],
    ['a', 3],
    ['a', 5],
    ['b', 1],
    ['b', 4],
    ['b', 3],
    ['c', 1],
    ['c', np.nan],
    ['c', 5]
], columns=['id', 'value'])

id_level_frames = []
for group, frame in df.groupby('id'):
    # Reset the index for each id-level frame
    frame = frame.reset_index(drop=True)
    for index, row in frame.iterrows():
        # The first row passes its value straight through
        if index == 0:
            frame.at[index, 'average'] = row['value']
        else:
            # Mean of all earlier rows; pandas' mean skips NaN values
            earlier_rows = frame[frame.index < index]
            frame.at[index, 'average'] = earlier_rows['value'].mean()
    id_level_frames.append(frame)
final_df = pd.concat(id_level_frames)
I have the following df. I'd like to group by customer and then count and sum at the same time; I'd also like to add conditional grouping.
Is there any way to achieve this?
customer product score
A a 1
A b 2
A c 3
B a 4
B a 5
B b 6
My desired result is the following:
customer count sum count(product=a) sum(product=a)
A 3 6 1 1
B 3 15 2 9
My attempt is as follows:
grouped = df.groupby('customer')
grouped.agg({"product": "count", "score": "sum"})
Thanks
Let us try crosstab:
s = pd.crosstab(df['customer'], df['product'], df['score'], margins=True, aggfunc=['sum', 'count']).drop('All')
Out[76]:
sum count
product a b c All a b c All
customer
A 1.0 2.0 3.0 6 1.0 1.0 1.0 3
B 9.0 6.0 NaN 15 2.0 1.0 NaN 3
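To get exactly the four columns from the desired output, you can select them from the crosstab result; a minimal sketch, assuming the s computed above (the column labels are illustrative):

result = pd.DataFrame({
    'count': s[('count', 'All')],
    'sum': s[('sum', 'All')],
    'count(product=a)': s[('count', 'a')],
    'sum(product=a)': s[('sum', 'a')],
}).reset_index()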
import pandas as pd
df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'], 'product': ['a', 'b', 'c', 'a', 'a', 'b'], 'score':[1, 2, 3, 4, 5, 6]})
df = df[df['product'] == 'a']
grouped = df.groupby('customer')
grouped = grouped.agg({"product": "count", "score": "sum"}).reset_index()
print(grouped)
Output:
customer product score
0 A 1 1
1 B 2 9
Then merge this dataframe with the unfiltered grouped dataframe.
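A sketch of that merge step, starting again from the unfiltered df (the suffixes and final column names are illustrative, not from the original answer):

import pandas as pd

df = pd.DataFrame({'customer': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'product': ['a', 'b', 'c', 'a', 'a', 'b'],
                   'score': [1, 2, 3, 4, 5, 6]})

# Aggregate over all rows, and over the product == 'a' rows only
overall = df.groupby('customer').agg({"product": "count", "score": "sum"}).reset_index()
a_only = df[df['product'] == 'a'].groupby('customer').agg({"product": "count", "score": "sum"}).reset_index()

# Merge the two aggregates side by side and rename to match the desired output
merged = overall.merge(a_only, on='customer', suffixes=('', '_a'))
merged.columns = ['customer', 'count', 'sum', 'count(product=a)', 'sum(product=a)']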
Let's say data looks like:
df = pd.DataFrame({'Group' : ['A', 'B', 'A', 'B', 'C'],
                   'Value' : [1, 4, 3, 2, 3]})
Group Value
0 A 1
1 B 4
2 A 3
3 B 2
4 C 3
Normally, when grouping by "Group" and taking the sum, I would get:
df.groupby(by="Group").agg(["sum"])
      Value
        sum
Group
A         4
B         6
C         3
Is there a way to get "Group A" vs "non-Group A", so something like:
df.groupby(by="Group A vs non-Group A").agg(["sum"])
      Value
        sum
Group
A         4
non-A     9
Thanks everyone!
Use groupby and replace
In [566]: df.groupby(
              df.Group.eq('A').replace({True: 'A', False: 'non-A'})
          )['Value'].sum().reset_index()
Out[566]:
Group Value
0 A 4
1 non-A 9
Details
In [567]: df.Group.eq('A').replace({True: 'A', False: 'non-A'})
Out[567]:
0 A
1 non-A
2 A
3 non-A
4 non-A
Name: Group, dtype: object
In [568]: df.groupby(df.Group.eq('A').replace({True: 'A', False: 'non-A'}))['Value'].sum()
Out[568]:
Group
A 4
non-A 9
Name: Value, dtype: int64
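A variant of the same idea (not from the original answer) builds the grouping key with np.where instead of replace:

import numpy as np

# Label each row 'A' or 'non-A', then group by that array
df.groupby(np.where(df['Group'].eq('A'), 'A', 'non-A'))['Value'].sum()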
Given this trivial dataset
df = pd.DataFrame({'one': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'two': ['c', 'c', 'c', 'c', 'd', 'd'],
                   'three': [1, 2, 3, 4, 5, 6]})
grouping on one / two and applying .max() returns me a Series indexed on the groupby vars, as expected...
df.groupby(['one', 'two'])['three'].max()
Output:
one two
a c 3
b c 4
d 6
Name: three, dtype: int64
...in my case I want to shift() my records, by group. But for some reason, when I apply .shift() to the groupby object, my results don't include the groupby variables:
df.groupby(['one', 'two'])['three'].shift()
Output:
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
Name: three, dtype: float64
Is there a way to preserve those groupby variables in the results, as either columns or a multi-indexed Series (as in .max())? Thanks!
It is the difference between max and shift: max aggregates values (it returns an aggregated Series), while shift does not; it returns a Series of the same size.
So it is possible to append the output as a new column:
df['shifted'] = df.groupby(['one', 'two'])['three'].shift()
Theoretically it is possible to use agg, but it returns an error in pandas 0.20.3:
df1 = df.groupby(['one', 'two'])['three'].agg(['max', lambda x: x.shift()])
print (df1)
ValueError: Function does not reduce
One possible solution is transform, if you need max together with shift:
g = df.groupby(['one', 'two'])['three']
df['max'] = g.transform('max')
df['shifted'] = g.shift()
print (df)
one three two max shifted
0 a 1 c 3 NaN
1 a 2 c 3 1.0
2 a 3 c 3 2.0
3 b 4 c 4 NaN
4 b 5 d 6 NaN
5 b 6 d 6 5.0
As Jez explained, shift returns a Series keeping the same length as the dataframe; if you try to aggregate it like max(), you will get the error
Function does not reduce
df.assign(shifted=df.groupby(['one', 'two'])['three'].shift()).set_index(['one','two'])
Out[57]:
three shifted
one two
a c 1 NaN
c 2 1.0
c 3 2.0
b c 4 NaN
d 5 NaN
d 6 5.0
Using max as the key, slice the shifted values at the rows where value equals the group max:
df.groupby(['one', 'two'])['three'].apply(lambda x : x.shift()[x==x.max()])
Out[58]:
one two
a c 2 2.0
b c 3 NaN
d 5 5.0
Name: three, dtype: float64
If I wanted to go from a long format to a grouped, aggregated format, I would simply do:
s = pd.DataFrame(['a','a','a','a','b','b','c'], columns=['value'])
s.groupby('value').size()
value
a 4
b 2
c 1
dtype: int64
Now if I wanted to revert that aggregation and go from a grouped format back to a long format, how would I go about doing that? I guess I could loop through the grouped series and repeat 'a' 4 times, 'b' 2 times, etc.
Is there a better way to do this in pandas or any other Python package?
Thankful for any hints
Perhaps .transform can help with this:
s.set_index('value', drop=False, inplace=True)
s['size'] = s.groupby(level='value')['value'].transform('size')
s.reset_index(inplace=True, drop=True)
s
yielding:
value size
0 a 4
1 a 4
2 a 4
3 a 4
4 b 2
5 b 2
6 c 1
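On newer pandas versions the index round-trip isn't needed; the same idea in a simpler form (a variant, not part of the original answer):

s = pd.DataFrame(['a', 'a', 'a', 'a', 'b', 'b', 'c'], columns=['value'])
# Broadcast each group's size back onto its rows
s['size'] = s.groupby('value')['value'].transform('size')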
Another, rather simple approach is to use np.repeat (assuming s2 is the aggregated series):
In [17]: np.repeat(s2.index.values, s2.values)
Out[17]: array(['a', 'a', 'a', 'a', 'b', 'b', 'c'], dtype=object)
In [18]: pd.DataFrame(np.repeat(s2.index.values, s2.values), columns=['value'])
Out[18]:
value
0 a
1 a
2 a
3 a
4 b
5 b
6 c
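A similar one-liner that stays inside pandas uses Index.repeat (again assuming s2 is the aggregated series):

# Repeat each index label by its count, then wrap it back into a frame
pd.DataFrame({'value': s2.index.repeat(s2.values)})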
There might be something cleaner, but here's an approach. First, store your groupby results in a dataframe and rename the columns.
agg = s.groupby('value').size().reset_index()
agg.columns = ['key', 'count']
Then, build a frame with columns that track the count for each letter.
counts = agg['count'].apply(lambda x: pd.Series([0] * x))
counts['key'] = agg['key']
In [107]: counts
Out[107]:
0 1 2 3 key
0 0 0 0 0 a
1 0 0 NaN NaN b
2 0 NaN NaN NaN c
Finally, this can be melted and the nulls dropped to get your desired frame.
In [108]: pd.melt(counts, id_vars='key').dropna()[['key']]
Out[108]:
key
0 a
1 b
2 c
3 a
4 b
6 a
9 a