Merge variable number of rows in pandas dataframe - python

I'm new to pandas and working with dataframes. I have a rather simple problem that I think should have a straightforward solution, but it is not clear to me (and I do not know pandas that well).
I have many occurrences of rows with the same index in my dataframe:
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 0.0 0.0
2018-05-16 06:43:00 NaN 0.0 65.0
2018-05-16 06:43:00 NaN 7.0 0.0
And I would like to merge them to get this, a row which contains all the information available at a given time index:
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 7.0 65.0
2018-05-16 06:43:00 NaN 0.0 65.0
2018-05-16 06:43:00 NaN 7.0 0.0
Afterwards I would drop all rows which contain NaN in any column to get:
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 7.0 65.0
The problem is that in the same dataframe I have duplicates with less information, maybe only Carbs or Insulin.
Glucose Insulin Carbs
Hour
2018-05-19 06:15:00 NaN 1.5 0.0
2018-05-19 06:15:00 229.0 0.0 0.0
I already know the indices of these entries:
bad_indices = _df[_df.Glucose.isnull()].index
What I would like to know is whether there is a nice, Pythonic way to do such a task (for both the two-row and three-row cases): maybe a pandas built-in method, or at least something semi-standard and readable, because I don't want to write ugly (and easily breakable) code with explicit handling for each case.

You can replace 0 with NaN and then take the first non-NaN value per group:
df = df.mask(df == 0).groupby(level=0).first()
print(df)
Glucose Insulin Carbs
Hour
2018-05-16 06:43:00 156.0 7.0 65.0
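
For a quick self-contained check, here is a minimal sketch of that approach on the sample data above, followed by the dropna step the asker planned; the column and index names are taken from the sample:
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2018-05-16 06:43:00'] * 3)
df = pd.DataFrame({'Glucose': [156.0, np.nan, np.nan],
                   'Insulin': [0.0, 0.0, 7.0],
                   'Carbs': [0.0, 65.0, 0.0]},
                  index=pd.Index(idx, name='Hour'))

# 0 -> NaN, then keep the first non-NaN value per column within each timestamp
merged = df.mask(df == 0).groupby(level=0).first()
merged = merged.dropna()   # drop any rows that still contain NaN
print(merged)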

Related

Convert summary data (cumulative cases) to daily cases pandas

I have case data presented as a time series. The values are cumulative, summed up to each day; what can I use to turn them into daily case counts?
My dataframe in pandas:
data sum_cases (cumulative)
0 2020-05-02 4.0
1 2020-05-03 21.0
2 2020-05-04 37.0
3 2020-05-05 51.0
I want them to look like this:
data sum_cases(cumulative) daily_cases
0 2020-05-02 4.0 4.0
1 2020-05-03 21.0 17.0
2 2020-05-04 37.0 16.0
3 2020-05-05 51.0 14.0
If indeed your DF has the data in date order, then you might be able to get away with:
df['daily_cases'] = df['sum_cases'] - df['sum_cases'].shift(fill_value=0)
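A quick sketch of this on the sample frame above, assuming the column is literally named sum_cases and the rows are already in date order:
import pandas as pd

df = pd.DataFrame({'data': pd.to_datetime(['2020-05-02', '2020-05-03',
                                           '2020-05-04', '2020-05-05']),
                   'sum_cases': [4.0, 21.0, 37.0, 51.0]})

# difference from the previous row; fill_value=0 keeps the first row's own value
df['daily_cases'] = df['sum_cases'] - df['sum_cases'].shift(fill_value=0)
print(df)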

Incomplete filling when upsampling with `agg` for multiple columns (pandas resample)

I found this behavior of resample to be confusing after working on a related question. Here are some time series data at 5 minute intervals but with missing rows (code to construct at end):
user value total
2020-01-01 09:00:00 fred 1 1
2020-01-01 09:05:00 fred 13 1
2020-01-01 09:15:00 fred 27 3
2020-01-01 09:30:00 fred 40 12
2020-01-01 09:35:00 fred 15 12
2020-01-01 10:00:00 fred 19 16
I want to fill in the missing times, using a different method for each column: for user and total I want to do a forward fill, while for value I want to fill in with zeroes.
One approach I found was to resample, and then fill in the missing data after the fact:
resampled = df.resample('5T').asfreq()
resampled['user'].ffill(inplace=True)
resampled['total'].ffill(inplace=True)
resampled['value'].fillna(0, inplace=True)
Which gives correct expected output:
user value total
2020-01-01 09:00:00 fred 1.0 1.0
2020-01-01 09:05:00 fred 13.0 1.0
2020-01-01 09:10:00 fred 0.0 1.0
2020-01-01 09:15:00 fred 27.0 3.0
2020-01-01 09:20:00 fred 0.0 3.0
2020-01-01 09:25:00 fred 0.0 3.0
2020-01-01 09:30:00 fred 40.0 12.0
2020-01-01 09:35:00 fred 15.0 12.0
2020-01-01 09:40:00 fred 0.0 12.0
2020-01-01 09:45:00 fred 0.0 12.0
2020-01-01 09:50:00 fred 0.0 12.0
2020-01-01 09:55:00 fred 0.0 12.0
2020-01-01 10:00:00 fred 19.0 16.0
I thought one would be able to use agg to specify what to do by column. I try to do the following:
resampled = df.resample('5T').agg({'user': 'ffill',
                                   'value': 'sum',
                                   'total': 'ffill'})
I find this clearer and simpler, but it doesn't give the expected output. The sum works, but the forward fill does not:
user value total
2020-01-01 09:00:00 fred 1 1.0
2020-01-01 09:05:00 fred 13 1.0
2020-01-01 09:10:00 NaN 0 NaN
2020-01-01 09:15:00 fred 27 3.0
2020-01-01 09:20:00 NaN 0 NaN
2020-01-01 09:25:00 NaN 0 NaN
2020-01-01 09:30:00 fred 40 12.0
2020-01-01 09:35:00 fred 15 12.0
2020-01-01 09:40:00 NaN 0 NaN
2020-01-01 09:45:00 NaN 0 NaN
2020-01-01 09:50:00 NaN 0 NaN
2020-01-01 09:55:00 NaN 0 NaN
2020-01-01 10:00:00 fred 19 16.0
Can someone explain this output, and if there is a way to achieve the expected output using agg? It seems odd that the forward fill doesn't work here, but if I were to just do resampled = df.resample('5T').ffill(), that would work for every column (but is undesired here as it would do so for the value column as well). The closest I have come is to individually run resampling for each column and apply the function I want:
resampled = pd.DataFrame()
d = {'user': 'ffill',
     'value': 'sum',
     'total': 'ffill'}
for k, v in d.items():
    resampled[k] = df[k].resample('5T').apply(v)
This works, but feels silly given that it adds extra iteration and uses the dictionary I am trying to pass to agg! I have looked at a few posts on agg and apply but can't seem to explain what is happening here:
Losing String column when using resample and aggregation with pandas
resample multiple columns with pandas
pandas groupby with agg not working on multiple columns
Pandas named aggregation not working with resample agg
I have also tried using groupby with a pd.Grouper and using the pd.NamedAgg class, with no luck.
Example data:
import pandas as pd
dates = ['01-01-2020 9:00', '01-01-2020 9:05', '01-01-2020 9:15',
         '01-01-2020 9:30', '01-01-2020 9:35', '01-01-2020 10:00']
dates = pd.to_datetime(dates)
df = pd.DataFrame({'user': ['fred'] * len(dates),
                   'value': [1, 13, 27, 40, 15, 19],
                   'total': [1, 1, 3, 12, 12, 16]},
                  index=dates)

Fill the 0's with the average of the previous 3 months' values using Python

My data set has values like
date quantity
01/04/2018 35
01/05/2018 33
01/06/2018 75
01/07/2018 0
01/08/2018 70
01/09/2018 0
01/10/2018 66
Code I tried:
df['rollmean3'] = df['quantity'].rolling(3).mean()
output:
2018-04-01 35.0 NaN
2018-05-01 33.0 NaN
2018-06-01 75.0 47.666667
2018-07-01 0.0 36.000000
2018-08-01 70.0 48.333333
2018-09-01 0.0 23.333333
2018-10-01 66.0 45.333333
EXPECTED OUTPUT:
I need the output to take the average of 35, 33 and 75 and fill it in place of the first 0.0 value, and for the next zero it should calculate the average of the previous three values (including the newly filled one) and use that.
2018-04-01 35.0
2018-05-01 33.0
2018-06-01 75.0
2018-07-01 0.0 47.666667
2018-08-01 70.0
2018-09-01 0.0 64.22222 # average of (75, 47.6667 and 70)
2018-10-01 66.0
That is how the output should look.
Unfortunately there does not seem to be a vectorized solution for this in Pandas. You'll need to iterate the rows and fill in the missing values one by one. This will be slow; if you need to speed it up you can JIT compile your code using Numba.
Like John Zwinck said, there's no vectorized solution in pandas for this.
You'll have to use something like .iterrows(), like this:
for i, row in df.iterrows():
    if row['quantity'] == 0:
        df.loc[i, 'quantity'] = df['quantity'].iloc[(i-3):i].mean()
Or even with recursion, if you prefer:
def fill_recursively(column: pd.Series, window_size: int = 3):
    if 0 in column.values:
        idx = column.tolist().index(0)
        column[idx] = column[(idx - window_size):idx].mean()
        column = fill_recursively(column)
    return column
You can verify that fill_recursively(df['quantity']) returns the desired result (just make sure that it has the dtype float, otherwise it will be rounded to the nearest integer).
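Following up on the Numba suggestion above, a minimal sketch of what the JIT-compiled loop might look like; the function name and the float conversion are assumptions, not part of the original answers:
import numpy as np
from numba import njit

@njit
def fill_zeros_with_prev_mean(values, window=3):
    # values: 1-D float array; each zero is replaced by the mean of the
    # previous `window` entries (which may themselves already be filled)
    out = values.copy()
    for i in range(window, len(out)):
        if out[i] == 0:
            out[i] = out[i - window:i].mean()
    return out

df['quantity'] = fill_zeros_with_prev_mean(df['quantity'].to_numpy(dtype=np.float64))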

Python Stacked bar chart from DF with index dates?

I have created a data frame in python using pandas that has the following output with date being the index:
Date Daily Anger Daily Haha Daily Like Daily Love Daily Sad Daily WoW
2019-08-31 1 2.0 132.0 8.0 0.0 5.0
2019-09-30 0 1.0 41.0 4.0 0.0 0.0
2019-10-31 15 1.0 117.0 4.0 0.0 2.0
2019-11-30 0 3.0 84.0 4.0 0.0 4.0
2019-12-31 2 17.0 98.0 20.0 5.0 7.0
I'm trying to get these values into a stacked bar chart where the x-axis is the date and the y-axis is the total value across these metrics.
I've spent the last couple of hours trying to get this to work with Google, with no success. Could anyone help me?
If Date is a column, use the x parameter in DataFrame.plot.bar:
df.plot.bar(x='Date', stacked=True)
If Date is the DatetimeIndex, use only the stacked parameter:
df.plot.bar(stacked=True)
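For completeness, a small end-to-end sketch, assuming matplotlib is available and Date is the DatetimeIndex as in the sample above:
import matplotlib.pyplot as plt

ax = df.plot.bar(stacked=True)   # one bar per date, one segment per reaction column
ax.set_xlabel('Date')
ax.set_ylabel('Total reactions')
plt.tight_layout()
plt.show()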

Get a time of first transaction for each day from stock data

Recently I got a CSV file with transactions our company made on different markets/instruments. My data set consists of more than 500k rows.
Here is a sample of my data, without the columns that are irrelevant at the moment:
Market Price Quantity
Time
2019-01-01 09:42:16 Share 180.00 5.0
2019-01-01 09:44:59 Share 180.00 10.0
2019-01-01 09:46:24 Share 180.00 6.0
2019-01-01 09:47:21 Share 180.00 5.0
2019-01-01 09:52:19 Share 180.00 10.0
2019-01-01 09:52:19 Share 180.00 5.0
2019-01-01 09:52:19 Share 180.00 5.0
2019-01-01 09:57:37 Share 180.01 10.0
2019-01-02 10:03:43 Share 235.00 10.0
2019-01-02 10:04:11 Share 235.00 10.0
2019-01-02 10:04:19 Share 235.00 10.0
... ... ... ...
2019-05-13 10:06:44 Share 233.00 10.0
2019-05-13 10:11:45 Share 233.00 10.0
2019-05-13 10:11:45 Share 233.00 10.0
2019-05-13 10:11:49 Share 234.00 10.0
2019-05-13 10:11:49 Share 234.00 10.0
2019-05-13 10:11:54 Share 233.00 10.0
2019-05-14 09:50:56 Share 230.00 10.0
2019-05-14 09:53:31 Share 229.00 10.0
2019-05-14 09:53:55 Share 229.00 5.0
2019-05-14 09:53:59 Share 229.00 3.0
2019-05-14 09:54:01 Share 229.00 2.0
2019-05-14 09:54:07 Share 229.00 3.0
2019-05-14 09:54:16 Share 229.00 2.0
I already converted the Time column to pandas datetime.
Although I was able to obtain some of the desired statistics, I got stuck on finding the time of the first and last transaction for each day.
Expected OUTPUT:
2019-03-12 08:43:23 Share(name) 248 10
2019-03-12 16:48:21 Share(name) 250 20
Well, I don't have problems getting this in Excel, but considering the quickly growing amount of data I would rather use pandas and Python for this purpose.
I am assuming that some combination of the groupby and resample methods could be the solution, but I have no idea how to apply them correctly to my dataframe.
Any thoughts and comments will be appreciated.
Thanks to Ben Pap, I got the result using:
dbs.groupby(dbs.index.date).apply(lambda x: x.iloc[np.r_[0:1,-1:0]])
Here is another question I came up with: what function am I supposed to use to get the maximum of the first-transaction times? In other words, on which day does the market start the latest?
df.groupby(df['Time'].dt.day).apply(lambda x: x.iloc[np.r_[0:1, -1:0]])
This will give you the first and last of each day as long as your dates are ordered.
Option 1:
groupby followed by apply
new_df = (df.groupby(df.index.floor('D'))
            .apply(lambda x: x.iloc[[0, -1]])
            .reset_index(level=0, drop=True)
         )
new_df
Option 2:
groupby followed by agg and stack
new_df = (df.reset_index().groupby(df.index.floor('D'))
            .agg(['first', 'last'])
            .stack(level=1)
            .reset_index(drop=True)
            .set_index('Time')
         )
Output:
Market Price Quantity
Time
2019-01-01 09:42:16 Share 180.00 5.0
2019-01-01 09:57:37 Share 180.01 10.0
2019-01-02 10:03:43 Share 235.00 10.0
2019-01-02 10:04:19 Share 235.00 10.0
2019-05-13 10:06:44 Share 233.00 10.0
2019-05-13 10:11:54 Share 233.00 10.0
2019-05-14 09:50:56 Share 230.00 10.0
2019-05-14 09:54:16 Share 229.00 2.0
In any case, you may want to do drop_duplicates afterwards in case there are days with only one transaction.
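For example, assuming new_df from either option above:
# days with a single transaction yield the same row twice (first == last),
# so remove exact duplicate rows
new_df = new_df.drop_duplicates()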
If you have your index in datetime format you can use the method resample():
df['Datetime'] = df.index
df.resample('D').agg(['first', 'last']).stack().set_index('Datetime')
Result:
Market Price Quantity
Datetime
2019-01-01 09:42:16 Share 180.00 5.0
2019-01-01 09:57:37 Share 180.01 10.0
2019-01-02 10:03:43 Share 235.00 10.0
2019-01-02 10:04:19 Share 235.00 10.0
