Pandas DataFrame: Fill NA values based on group mean

Pandas DataFrame: Fill NA values based on group mean - python

I would like to update the NA values of a Pandas DataFrame column with the values in a groupby object.
Let's illustrate with an example:
We have the following DataFrame columns:
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
|--------|-------|-----|-------------|
We're simply measuring temperature multiple times a day for many months. Now, let's assume that for some of our records, the temperature reading failed and we have a NA.
|--------|-------|-----|-------------|
| row_id | Month | Day | Temperature |
|--------|-------|-----|-------------|
| 1 | 1 | 1 | 14.3 |
| 2 | 1 | 1 | 14.8 |
| 3 | 1 | 2 | 13.1 |
| 4 | 1 | 2 | NA |
| 5 | 1 | 3 | 14.8 |
| 6 | 1 | 4 | NA |
|--------|-------|-----|-------------|
We could just use panda's .fillna(), however we want to be a little more sophisticated. Since there are multiple readings per day (there could be 100's per day), we'd like to take the daily average and use that as our fill value.
we can get the daily averages with a simple groupby:
avg_temp_by_month_day = df.groupby(['month'])['day'].mean()
Which gives us the means for each day by month. The question is, how best to fill the NA values with the groupby values?
We could use an apply(),
df['temperature'] = df.apply(
lambda row: avg_temp_by_month_day.loc[r['month'], r['day']] if pd.isna(r['temperature']) else r['temperature'],
axis=1
)
however this is really slow (1M+ records).
Is there a vectorized approach, perhaps using np.where(), or maybe creating another Series and merging.
What's the a more efficient way to perform this operation?
Thank you!

I'm not sure if this is the fastest, however instead of taking ~1 hour for apply, it takes ~20 sec for +1M records. The below code has been updated to work on 1 or many columns.
local_avg_cols = ['temperature'] # can work with multiple columns
# Create groupby's to get local averages
local_averages = df.groupby(['month', 'day'])[local_avg_cols].mean()
# Convert to DataFrame and prepare for merge
local_averages = pd.DataFrame(local_averages, columns=local_avg_cols).reset_index()
# Merge into original dataframe
df = df.merge(local_averages, on=['month', 'day'], how='left', suffixes=('', '_avg'))
# Now overwrite na values with values from new '_avg' col
for col in local_avg_cols:
df[col] = df[col].mask(df[col].isna(), df[col+'_avg'])
# Drop new avg cols
df = df.drop(columns=[col+'_avg' for col in local_avg_cols])
If anyone finds a more efficient way to do this, (efficient in processing time, or in just readability), I'll unmark this answer and mark yours. Thank you!

I'm guessing what speeds down your process are two things. First, you don't need to convert your groupby to a dataframe. Second, you don't need the for loop.
from pandas import DataFrame
from numpy import nan
# Populating the dataset
df = {"Month": [1] * 6,
"Day": [1, 1, 2, 2, 3, 4],
"Temperature": [14.3, 14.8, 13.1, nan, 14.8, nan]}
# Creating the dataframe
df = pd.DataFrame(df, columns=df.keys())
local_averages = df.groupby(['Month', 'Day'])['Temperature'].mean()
df = df.merge(local_averages, on=['Month', 'Day'], how='left', suffixes=('', '_avg'))
# Filling the missing values of the Temperature column with what is available in Temperature_avg
df.Temperature.fillna(df.Temperature_avg, inplace=True)
df.drop(columns="Temperature_avg", inplace=True)
Groupby is a resource heavy process so make the most out of it when you use it. Furthermore, as you already know loops are not a good idea when it comes to dataframes. Additionally, if you have a large data you may want to avoid creating extra variables from it. I may put the groupby into the merge if my data has 1m rows and many columns.

Related

How to aggregate in pandas with some conditions?

I want to aggregate my data in this way:
df.groupby('date').agg({ 'user_id','nunique',
'user_id':'nunique' ONLY WHERE purchase_flag==1})
date | user_id | purchase_flag
4-1-2020 | 1 | 1
4-1-2020 | 1 | 1 (purchased second time but still same unique user on that day)
4-1-2020 | 2 | 0
In this case I want the output to looks like:
date | total_users | total_users_who_purchased
4-1-2020 | 2 | 1
How can I best achieve this?

Try this by creating helper column in your dataframe to indicate users who purchased first then groupby and aggregate on that helper column:
df["user_id_purchased"] = df["user_id"].where(df["purchase_flag"].astype(bool))
df_output = df.groupby("date", as_index=False).agg(
total_users=("user_id", "nunique"),
total_users_who_purchased=("user_id_purchased", "nunique"),
)
Output:
date total_users total_users_who_purchased
0 4-1-2020 2 1

I think that one way to achieve this goal is using .loc
df.loc[ (df["purchase_flag"]==1)].user_id.nunique
Implementation to get your output:
details = { 'date' : ['4-1-2020'],
'total_users' : df.user_id.nunique(),
'total_users_who_purchased' :
df.loc(df["purchase_flag"]==1)].user_id.nunique()}
df2 = pd.DataFrame(details)
df2

Find out if values in dataframe are between values in other dataframe

I'm new to pandas and i'm trying to understand if there is a method to find out, if two values from one row in df1 are between two values from one row in df2.
Basically my df1 looks like this:
start | value | end
1 | TEST | 5
2 | TEST | 3
...
and my df2 looks like this:
start | value | end
2 | TEST2 | 10
3 | TEST2 | 4
...
Right now i've got it working with two loops:
for row in df1.iterrows():
for row2 in df2.iterrows():
if row2[1]["start"] >= row[1]["start"] and row2[1]["end"] <= row[1]["end"]:
print(row2)
but this doesn't feel like it's the pandas way to me.
What I'm expecting is that row number 2 from df2 is getting printed because 3 > 1 and 4 < 5, i.e.:
3 | TEST2 | 4
Is there a method to do this in the pandas kind of working?

You could use a cross merge to get all combinations of df1 and df2 rows, and filter using classical comparisons. Finally, get the indices and slice:
idx = (df1.merge(df2.reset_index(), suffixes=('1', '2'), how='cross')
.query('(start2 > start1) & (end2 < end1)')
['index'].unique()
)
df2.loc[idx]
NB. I am using unique here to ensure that a row is selected only once, even if there are several matches
output:
start value end
1 3 TEST2 4

Pandas Merge two tables with the second tables' one column transposed

Table 1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
Table 2
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
Result Table
Brief Explanation on how the Result table needs to be created:
I have two data frames and I want to merge them based on a df_id. But the date column from second table should be transposed into the resultant table.
The date columns for the result table will be a range between the min date and max date from the second table
The column values for the dates in the result table will be from the data column of the second table.
Also the test column from the second table will only take its value of the latest date for the result table
I hope this is clear. Any suggestion or help regarding this will be greatly appreciated.
I have tried using pivot on the second table and then trying to merge the pivoted second table df1 but its not working. I do not know how to get only one row for the latest value of test.
Note: I am trying to solve this problem using vectorization and do not want to serially parse through each row

You need to pivot your df2 into two separate table as we need data and test values and then merge both resulting pivot table with df1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','03-05-2021','05-05-2021'],'data':[12,13,9,16],'test':['g','h','i','j']})
test_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['test'])
data_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['data'])
max_test = test_piv['test'].ffill(axis=1).iloc[:,-1].rename('test')
final = df1.merge(data_piv['data'],left_on=df1.df1_id, right_index=True, how='left')
final = final.merge(max_test,left_on=df1.df1_id, right_index=True, how='left')
and hence your resulting final dataframe as below
| | df1_id | col1 | col2 | 01-05-2021 | 03-05-2021 | 05-05-2021 | test |
|---:|---------:|:-------|:-------|-------------:|-------------:|-------------:|:-------|
| 0 | 1 | a | d | 12 | 9 | 16 | j |
| 1 | 2 | b | e | nan | 13 | nan | h |
| 2 | 3 | c | f | nan | nan | nan | nan |

Here is the solution for the question:
I first sort df2 based of df1_id and date to ensure that table entries are in order.
Then I drop duplicates based on df_id and select the last row to ensure I have the latest values for test and test2
Then I pivot df2 to get the corresponding date as column and data as its value
Then I merge the table with df2_pivoted to combine the latest values of test and test2
Then I merge with df1 to get the resultant table
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
df2=df2.sort_values(by=['df1_id','date'])
df2_latest_vals = df2.drop_duplicates(subset=['df1_id'],keep='last')
df2_pivoted = df2.pivot_table(index=['df1_id'],columns=['date'],values=['data'])
df2_pivoted = df2_pivoted.droplevel(0,axis=1).reset_index()
df2_pivoted = pd.merge(df2_pivoted,df2_latest_vals,on='df1_id')
df2_pivoted = df2_pivoted.drop(columns=['date','data'])
result = pd.merge(df1,df2_pivoted,on='df1_id',how='left')
result
Note: I have not been able to figure out how to get the entire date range between 01-05-2021 and 05-05-2021 and show the empty values as NaN. If anyone can help please edit the answer

Add column of row numbers for each group of successively increasing dates

I have DataFrame which has column with Date and other columns with some values and, let's say, first 100 rows are in order according to the date, and from 101 till 200 again the same Dates, only different values, and so on. I would like to add a column which count rows from 1 to 100, and start again from 1 when the date repeat.
Example
Date | Value | RowNum
2000-01-01 | 2 | 1
2000-02-01 | 10 | 2
.
.
.
2003-12-01 | 11 | 100
2000-01-01 | 32 | 1
2000-02-01 | 14 | 2
.
.
.
2003-12-01 | 4 | 100
I need this to pivot this table where columns are dates, values are values and RowNum will be index.
Thank You for help.

If the exact same dates repeat, your problem becomes a very simple cumsum and cumcount problem:
m = df.Date.eq(df.at[df.index[0], 'Date']).cumsum()
df['RowNum'] = df.groupby(m).cumcount() + 1
If not, you can check the diff:
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
m = df['Date'].diff().dt.total_seconds().fillna(-1).lt(0).cumsum()
df['RowNum'] = df.groupby(m).cumcount() + 1
Or, similarly, by converting the underlying NumPy array to float and then diffing:
s = pd.Series(df['Date'].values.astype(float), index=df.index)
df['RowNum'] = df.groupby(s.fillna(-1).lt(0).cumsum()).cumcount() + 1

Explanation
Create a new column and iterate through the data frame and simply use %100 of the index column. This will work just fine if you exactly have 100 same dates as you mentioned above.
Code
df[RowNum] = 1
for i, row in df.iterrows():
RowNum_val = i%100
df.set_value(i,'RowNum',RowNum_val)
Resources
https://www.geeksforgeeks.org/python-pandas-dataframe-set_value/
https://www.tutorialspoint.com/python_pandas/python_pandas_iteration.htm

Pandas groupby function returns NaN values

I have a list of people with fields unique_id, sex, born_at (birthday) and I’m trying to group by sex and age bins, and count the rows in each segment.
Can’t figure out why I keep getting NaN or 0 as the output for each segment.
Here’s the latest approach I've taken...
Data sample:
|---------------------|------------------|------------------|
| unique_id | sex | born_at |
|---------------------|------------------|------------------|
| 1 | M | 1963-08-04 |
|---------------------|------------------|------------------|
| 2 | F | 1972-03-22 |
|---------------------|------------------|------------------|
| 3 | M | 1982-02-10 |
|---------------------|------------------|------------------|
| 4 | M | 1989-05-02 |
|---------------------|------------------|------------------|
| 5 | F | 1974-01-09 |
|---------------------|------------------|------------------|
Code:
df[‘num_people’]=1
breakpoints = [18,25,35,45,55,65]
df[[‘sex’,’born_at’,’num_people’]].groupby([‘sex’,pd.cut(df.born_at.dt.year, bins=breakpoints)]).agg(‘count’)
I’ve tried summing as the agg type, removing NaNs from the data series, pivot_table using the same pd.cut function but no luck. Guessing there’s also probably a better way to do this that doesn’t involve creating a column of 1s.
Desired output would be something like this...
The extra born_at column isn't necessary in the output and I'd also like the age bins to be 18 to 24, 25 to 34, etc. instead of 18 to 25, 25 to 35, etc. but I'm not sure how to specify that either.

I think you missed the calculation of the current age. The ranges you define for splitting the bithday years only make sense when you use them for calculating the current age (or all grouped cells will be nan or zero respectively because the lowest value in your sample is 1963 and the right-most maximum is 65). So first of all you want to calculate the age:
datetime.now().year-df.birthday.dt.year
This information then can be used to group the data (which are previously grouped by gender):
df.groupby(['gender', pandas.cut(datetime.now().year-df.birthday.dt.year, bins=breakpoints)]).agg('count')
In order to get rid of the nan cells you simply do a fillna(0) like this:
df.groupby(['gender', pandas.cut(datetime.now().year-df.birthday.dt.year, bins=breakpoints)]).agg('count').fillna(0).rename(columns={'birthday':'count'})

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas DataFrame: Fill NA values based on group mean - python

Related

How to aggregate in pandas with some conditions?

Find out if values in dataframe are between values in other dataframe

Pandas Merge two tables with the second tables' one column transposed

Add column of row numbers for each group of successively increasing dates

Pandas groupby function returns NaN values

Categories

Resources