Normalizing data in a Pandas GroupBy dataframe using a reference group - python

I have a Pandas dataframe resulting from a groupby() operation. This dataframe has a two-level index (year, month). How can I normalize a column relative to the corresponding month in a specific year?
My dataframe looks like the following:
            value
year month
2000 1       1234
     2       4567
2001 1       2345
     2       5678
2002 1       3456
     2       6789
I would like the resulting dataframe to have each value divided by the corresponding monthly value in 2002, thus expressing all values relative to 2002 levels. This would make the values for 2002 equal to 1.0 for both months.
What is the most efficient way of doing this? I appreciate any help!

Use DataFrame.div with a level argument.
df.div(df.xs(2002), level=1, axis=0)
               value
year month
2000 1      0.357060
     2      0.672706
2001 1      0.678530
     2      0.836353
2002 1      1.000000
     2      1.000000
Where df.xs(2002) gives:
       value
month
1       3456
2       6789
The division is aligned on the month level (level=1) of the row index, so every year's value is divided by the 2002 value for the same month.
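For completeness, here is a minimal runnable sketch that reconstructs the example frame and applies the division (the values are taken from the question):
import pandas as pd

# rebuild the two-level (year, month) frame from the question
df = pd.DataFrame(
    {'value': [1234, 4567, 2345, 5678, 3456, 6789]},
    index=pd.MultiIndex.from_product([[2000, 2001, 2002], [1, 2]],
                                     names=['year', 'month']))

# df.xs(2002) is the 2002 cross-section, indexed by month alone;
# level=1 aligns the divisor on the 'month' level of df's index
normalized = df.div(df.xs(2002), level=1, axis=0)
print(normalized)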

Related

Comparing previous row values in Pandas DataFrame in different column

My input:
first = pd.Series([0, 1680, 5000, 14999, 17000])
last = pd.Series([4999, 7501, 10000, 16777, 21387])
dd = pd.concat([first, last], axis=1)
I am trying to compare each value in the first column (e.g. 1680) with the previous row's range, spanning from the previous row's value in the first column to its value in the second column (e.g. from 0 to 4999). So in my condition the value 1680 falls in the previous row's range 0 to 4999; likewise the 3rd value in the first column, 5000, falls in the previous row's range 1680 to 7501, but the other values (e.g. 14999, 17000) do not fall in the range of their previous rows.
My expected output is something like this:
[1680], [5000], i.e. show only the values that satisfy my condition.
I tried diff(), as in dd[0].diff().gt(dd[1]), and also reshape/shift, but without much success.
Use shift and between to compare a row with the previous one:
>>> dd[0].loc[dd[0].between(dd[0].shift(), dd[1].shift())]
1    1680
2    5000
Name: 0, dtype: int64
Details of shift:
>>> pd.concat([dd[0], dd.shift()], axis=1)
       0        0        1
0      0      NaN      NaN
1   1680      0.0   4999.0
2   5000   1680.0   7501.0
3  14999   5000.0  10000.0
4  17000  14999.0  16777.0
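One detail worth knowing: Series.between is inclusive on both ends by default. If a strict comparison is needed, newer pandas versions (1.3+, if I recall correctly) accept a string inclusive argument:
# between() is inclusive by default; pass inclusive='neither'
# to exclude both boundaries (pandas >= 1.3)
dd[0].loc[dd[0].between(dd[0].shift(), dd[1].shift(), inclusive='neither')]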

Column value from first df to another df based on condition

I have an original df with an "average" column, which holds the average value computed per country. Now I have new_df, where I want to add those average values from df based on country.
df
id  country  value  average
1   USA      3      2
2   UK       5      5
3   France   2      2
4   USA      1      2

new_df
country  average
USA      2
Italy    NaN
I had a solution that worked, but there is a problem when new_df contains a country for which I have not computed the average yet. In that case I just want to fill in NaN.
Can you please recommend a solution?
Thanks
If you need to add the average column to new_df, use DataFrame.merge with DataFrame.drop_duplicates:
new_df.merge(df.drop_duplicates('country')[['country','average']], on='country', how='left')
If you instead need to aggregate the mean:
new_df.join(df.groupby('country')['average'].mean(), on='country')
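A runnable sketch with the frames from the question (the left merge keeps every row of new_df, so Italy simply gets NaN):
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'country': ['USA', 'UK', 'France', 'USA'],
                   'value': [3, 5, 2, 1],
                   'average': [2, 5, 2, 2]})
new_df = pd.DataFrame({'country': ['USA', 'Italy']})

# Italy has no match in df, so its average becomes NaN
out = new_df.merge(df.drop_duplicates('country')[['country', 'average']],
                   on='country', how='left')
print(out)  # USA -> 2.0, Italy -> NaN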

How does (DataFrame - Groupby) match rows?

I can't figure out how (DataFrame - Groupby) works.
Specifically, given the following dataframe:
df = pd.DataFrame([['usera',1,100],['usera',5,130],['userc',1,100],['userd',5,100]])
df.columns = ['id','date','sum']
      id  date  sum
0  usera     1  100
1  usera     5  130
2  userc     1  100
3  userd     5  100
Running the code below:
df['shift'] = df['date'] - df.groupby(['id'])['date'].shift(1)
returns:
      id  date  sum  shift
0  usera     1  100    NaN
1  usera     5  130    4.0
2  userc     1  100    NaN
3  userd     5  100    NaN
How did pandas know that I meant for it to match by the id column? It doesn't even appear in df['date'].
Let us dissect the command df['shift'] = df['date'] - df.groupby(['id'])['date'].shift(1).
1. df['shift'] assigns a new column "shift" to the dataframe.
2. df['date'] returns a Series built from the date column of the dataframe:
0    1
1    5
2    1
3    5
Name: date, dtype: int64
3. In df.groupby(['id'])['date'].shift(1), groupby(['id']) creates a groupby object. From that groupby object we select the date column and shift each value down by one row within its group using shift(1). This, too, is a Series:
df.groupby(['id'])['date'].shift(1)
0    NaN
1    1.0
2    NaN
3    NaN
Name: date, dtype: float64
4. The Series obtained in step 3 is subtracted (element-wise) from the Series obtained in step 2. Because both Series carry the original row index, pandas aligns them row by row; that is how the grouping by id carries over into the subtraction. The result is assigned to the df['shift'] column.
df['date'] - df.groupby(['id'])['date'].shift(1)
0    NaN
1    4.0
2    NaN
3    NaN
Name: date, dtype: float64
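Putting the steps together in one runnable snippet (a sketch of the same walkthrough):
import pandas as pd

df = pd.DataFrame([['usera', 1, 100], ['usera', 5, 130],
                   ['userc', 1, 100], ['userd', 5, 100]],
                  columns=['id', 'date', 'sum'])

# the groupby result keeps the original row index, so the subtraction
# below aligns row by row on that shared index
shifted = df.groupby(['id'])['date'].shift(1)
df['shift'] = df['date'] - shifted
print(df)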
I don't know exactly what you are trying to do, but the groupby() method is useful if you have several identical keys in a column (like your usera) and you want to calculate, for example, the sum(), mean(), max(), etc. of all columns or of just one specific column.
E.g. df.groupby(['id'])['sum'].sum() groups your usera rows, selects just the sum column and sums it over all usera rows, giving 230. Using .mean() instead would output 115, and so on. It does the same for every other unique id in your id column; in the example above it outputs one column with three rows (usera, userc, userd).
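As a quick illustration of the aggregations mentioned above (the usera numbers are the ones from the answer):
print(df.groupby(['id'])['sum'].sum())   # usera -> 230, userc -> 100, userd -> 100
print(df.groupby(['id'])['sum'].mean())  # usera -> 115.0, userc -> 100.0, userd -> 100.0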
Greetz, miGa

How to compare two rows and, when they differ, create another dataframe containing those two rows

Check the ['esn'] column of df1. Whenever a difference is found between two consecutive rows, produce another dataframe, df2. df2 should contain only the before-change and after-change rows.
>>> df1 = pd.DataFrame([[2014,1],[2015,1],[2016,1],[2017,2],[2018,2]], columns=['year','esn'])
>>> df1
   year  esn
0  2014    1
1  2015    1
2  2016    1
3  2017    2
4  2018    2
>>> df2  # the new dataframe I intend to create
   year  esn
0  2016    1
1  2017    2
I can't produce the above result for df2. Thanks in advance for your help.
Create a boolean mask by comparing the values with the values shifted by one, using ne (not equal) and backfilling the first missing value; similarly compare with the values shifted by -1, forward-filling the last missing value. Chain the two masks with | (bitwise OR) and filter with boolean indexing:
mask = df1['esn'].ne(df1['esn'].shift().bfill()) | df1['esn'].ne(df1['esn'].shift(-1).ffill())
df2 = df1[mask]
print(df2)
   year  esn
2  2016    1
3  2017    2
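To see how the chained mask works, here is a sketch showing the two intermediate masks on the question's data:
import pandas as pd

df1 = pd.DataFrame([[2014, 1], [2015, 1], [2016, 1], [2017, 2], [2018, 2]],
                   columns=['year', 'esn'])

# differs from the previous row (first NaN backfilled, so row 0 never triggers alone)
m1 = df1['esn'].ne(df1['esn'].shift().bfill())    # [False, False, False, True, False]
# differs from the next row (last NaN forward-filled)
m2 = df1['esn'].ne(df1['esn'].shift(-1).ffill())  # [False, False, True, False, False]

df2 = df1[m1 | m2]  # keeps 2016 (last row before the change) and 2017 (first after)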

Pandas groupby for zero values

I have data like this in a csv file
Symbol  Action  Year
AAPL    Buy     2001
AAPL    Buy     2001
BAC     Sell    2002
BAC     Sell    2002
I am able to read it and group it like this:
df.groupby(['Symbol','Year']).count()
I get:
             Action
Symbol Year
AAPL   2001       2
BAC    2002       2
I desire this (order does not matter):
             Action
Symbol Year
AAPL   2001       2
AAPL   2002       0
BAC    2001       0
BAC    2002       2
I want to know if it's possible to also count the zero occurrences.
You can use this:
df = df.groupby(['Symbol','Year']).count().unstack(fill_value=0).stack()
print(df)
Output:
             Action
Symbol Year
AAPL   2001       2
       2002       0
BAC    2001       0
       2002       2
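The trick is the intermediate wide frame: unstack pivots Year out into columns and fill_value=0 fills the holes; stack then simply restores the long shape. A sketch of that middle step, starting from the original df:
wide = df.groupby(['Symbol', 'Year']).count().unstack(fill_value=0)
print(wide)
#        Action
# Year     2001 2002
# Symbol
# AAPL        2    0
# BAC         0    2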
You can use pivot_table with unstack:
print(df.pivot_table(index='Symbol',
                     columns='Year',
                     values='Action',
                     fill_value=0,
                     aggfunc='count').unstack())
Year  Symbol
2001  AAPL      2
      BAC       0
2002  AAPL      0
      BAC       2
dtype: int64
If you need the output as a DataFrame, use to_frame:
print(df.pivot_table(index='Symbol',
                     columns='Year',
                     values='Action',
                     fill_value=0,
                     aggfunc='count').unstack()
        .to_frame()
        .rename(columns={0:'Action'}))
             Action
Year Symbol
2001 AAPL         2
     BAC          0
2002 AAPL         0
     BAC          2
Datatype category
Maybe this feature didn't exist back when this thread was opened; however, the "category" datatype can help here:
# create a dataframe with one combination of a,b missing
df = pd.DataFrame({"a":[0,1,1], "b": [0,1,0]})
df = df.astype({"a":"category", "b":"category"})
print(df)
The dataframe looks like this:
   a  b
0  0  0
1  1  1
2  1  0
And now, grouping by a and b
print(df.groupby(["a","b"]).size())
yields:
a  b
0  0    1
   1    0
1  0    1
   1    1
dtype: int64
Note the 0 in the rightmost column. This behavior is also documented in the pandas user guide (search the page for "groupby").
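One version-related caveat: on recent pandas releases (2.1+, to my knowledge) grouping by categorical columns warns that the default of the observed parameter will change, so it is safer to pass it explicitly:
# observed=False keeps the zero-count category combinations
# (newer pandas emits a FutureWarning if it is left implicit)
print(df.groupby(["a", "b"], observed=False).size())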
If you want to do this without using pivot_table, you can try the approach below:
midx = pd.MultiIndex.from_product([df['Symbol'].unique(), df['Year'].unique()],
                                  names=['Symbol', 'Year'])
df_grouped_by = df.groupby(['Symbol', 'Year']).count()  # the grouped frame from the question
df_grouped_by = df_grouped_by.reindex(midx, fill_value=0)
What we are essentially doing above is creating a MultiIndex of all the possible values, the Cartesian product of the two columns' unique values, and then using that MultiIndex to fill zeroes into our grouped dataframe.
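With the sample data this should give the following (a quick sanity check; it assumes the unique values appear in the order shown in the question):
print(df_grouped_by)
#              Action
# Symbol Year
# AAPL   2001       2
#        2002       0
# BAC    2001       0
#        2002       2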
Step 1: Create a dataframe that stores the count of each non-zero class in the column counts
count_df = df.groupby(['Symbol','Year']).size().reset_index(name='counts')
Step 2: Now use pivot_table to get the desired dataframe with counts for both existing and non-existing classes.
df_final = pd.pivot_table(count_df,
                          index=['Symbol', 'Year'],
                          values='counts',
                          fill_value=0,
                          dropna=False,
                          aggfunc='sum')
Now the values of the counts can be extracted as a list with the command
list(df_final['counts'])
All the answers above focus on groupby or pivot_table. However, as is well described in this article and in this question, this is a beautiful case for pandas' crosstab function:
import pandas as pd
df = pd.DataFrame({
    "Symbol": 2*['AAPL', 'BAC'],
    "Action": 2*['Buy', 'Sell'],
    "Year": 2*[2001, 2002]
})
pd.crosstab(df["Symbol"], df["Year"]).stack()
yielding:
Symbol  Year
AAPL    2001    2
        2002    0
BAC     2001    0
        2002    2
dtype: int64
