I have a dataframe with the fields last_payout and amount. I need to sum amount for each month and plot the result.
df[['last_payout','amount']].dtypes
last_payout datetime64[ns]
amount float64
dtype: object
-
df[['last_payout','amount']].head
<bound method NDFrame.head of last_payout amount
0 2017-02-14 11:00:06 23401.0
1 2017-02-14 11:00:06 1444.0
2 2017-02-14 11:00:06 0.0
3 2017-02-14 11:00:06 0.0
4 2017-02-14 11:00:06 290083.0
I used the code from jezrael's answer to plot the number of transactions per month.
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.to_period('M')
.value_counts()
.sort_index()
.plot(kind="bar")
)
(figure: bar chart of the number of transactions per month)
How do I sum amount for each month and plot the output? How should I extend the code above to do this?
I tried adding .sum() but didn't succeed.
PeriodIndex solution:
Group by month periods created with to_period and aggregate with sum:
df['amount'].groupby(df['last_payout'].dt.to_period('M')).sum().plot(kind='bar')
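For the sample data in the Setup section below, the aggregation underneath this plot (everything before .plot) should look roughly like:
last_payout
2017-02     23401.0
2017-03      1444.0
2017-04    290083.0
Freq: M, Name: amount, dtype: float64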
DatetimeIndex solutions:
Use resample by month end (M) or month start (MS) and aggregate with sum:
s = df.resample('M', on='last_payout')['amount'].sum()
#alternative
#s = df.groupby(pd.Grouper(freq='M', key='last_payout'))['amount'].sum()
print (s)
last_payout
2017-02-28 23401.0
2017-03-31 1444.0
2017-04-30 290083.0
Freq: M, Name: amount, dtype: float64
Or:
s = df.resample('MS', on='last_payout')['amount'].sum()
#s = df.groupby(pd.Grouper(freq='MS', key='last_payout'))['amount'].sum()
print (s)
last_payout
2017-02-01 23401.0
2017-03-01 1444.0
2017-04-01 290083.0
Freq: MS, Name: amount, dtype: float64
Then it is necessary to format the x labels:
ax = s.plot(kind='bar')
ax.set_xticklabels(s.index.strftime('%Y-%m'))
Setup:
import pandas as pd
temp=u"""last_payout,amount
2017-02-14 11:00:06,23401.0
2017-03-14 11:00:06,1444.0
2017-03-14 11:00:06,0.0
2017-04-14 11:00:06,0.0
2017-04-14 11:00:06,290083.0"""
#after testing, replace pd.compat.StringIO(temp) with 'filename.csv'
#note: in recent pandas versions pd.compat.StringIO was removed; use io.StringIO instead
df = pd.read_csv(pd.compat.StringIO(temp), parse_dates=[0])
print (df)
last_payout amount
0 2017-02-14 11:00:06 23401.0
1 2017-03-14 11:00:06 1444.0
2 2017-03-14 11:00:06 0.0
3 2017-04-14 11:00:06 0.0
4 2017-04-14 11:00:06 290083.0
You can group by month-start ('MS') using resample:
df.set_index('last_payout').resample('MS').sum().plot(kind='bar')
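If you also want readable x tick labels here, the strftime trick from the answer above applies in the same way; a minimal sketch under the same setup:
s = df.set_index('last_payout')['amount'].resample('MS').sum()
ax = s.plot(kind='bar')
ax.set_xticklabels(s.index.strftime('%Y-%m'))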
I have a data-frame formatted like so (I simplified it for the sake of my explanation):
Date_1      Date_2      Date_3
2017-02-14  2017-02-09  2017-02-10
2018-07-16  2019-07-22  2018-07-16
2014-10-10  2017-10-10  2017-10-10
I would like to create a new column that shows the average difference between my date columns. Specifically, I would like it to calculate the difference between Date_1 & Date_2, Date_2 & Date_3, and Date_1 & Date_3. In row # 1 that would equal mean(5 + 1 + 4) = 3.33.
The data frame would look something like this:
Date_1      Date_2      Date_3      Average_Difference
2017-02-14  2017-02-09  2017-02-10  3.33
2018-07-16  2019-07-22  2018-07-16  mean(6+6+0) = 4
2014-10-10  2017-10-10  2017-10-10  0
Do let me know if further explanation is needed.
Edit: I should also add that my actual, un-simplified dataframe has more than just three date columns, so I am trying to think of an answer that is scalable.
Interesting problem. Since you're taking the diffs of several items in each row, itertools.combinations(iterable, N) will help. It returns all possible N-length combinations of the items in iterable. So we can use that for each row, diff each combination, take the absolute value (since some diffs may be negative depending on column order), and compute the mean:
import itertools as it
import numpy as np

date_cols = df.filter(like='Date_').columns
df[date_cols] = df[date_cols].apply(pd.to_datetime)  # convert the columns to dates
df['Average_Difference'] = df[date_cols].apply(lambda row: np.mean(abs(np.diff(list(it.combinations([date.dayofyear for date in row], 2)))[:, 0])), axis=1)
Output:
>>> df
Date_1 Date_2 Date_3 Average_Difference
0 2017-02-14 2017-02-09 2017-02-10 3.333333
1 2018-07-16 2019-07-22 2018-07-16 4.000000
2 2014-10-10 2017-10-10 2017-10-10 0.000000
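For reference, a quick look at what it.combinations returns (a small illustration with three placeholder values):
import itertools as it
list(it.combinations(['a', 'b', 'c'], 2))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]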
I have a DataFrame like this:
date time value
0 2019-04-18 07:00:10 100.8
1 2019-04-18 07:00:20 95.6
2 2019-04-18 07:00:30 87.6
3 2019-04-18 07:00:40 94.2
The DataFrame contains a value recorded every 10 seconds for the entire year 2019. I need to calculate the standard deviation and mean/average of value for each hour of each date, and create two new columns for them. I first tried separating the hour for each value like:
df["hour"] = df["time"].astype(str).str[:2]
Then I have tried to calculate standard deviation by:
df["std"] = df.groupby("hour").median().index.get_level_values('value').stack().std()
But that won't work; could I have some advice on the problem?
We can split the time column on the delimiter :, take the hour component with str[0], and finally group the dataframe on date along with the hour component, aggregating column value with mean and std:
hr = df['time'].str.split(':', n=1).str[0]
df.groupby(['date', hr])['value'].agg(['mean', 'std'])
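With the four sample rows above, this should produce a single (date, hour) group, roughly:
                  mean       std
date       time
2019-04-18 07    94.55  5.434151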
If you want to broadcast the aggregated values to original dataframe, then we need to use transform instead of agg:
g = df.groupby(['date', df['time'].str.split(':', n=1).str[0]])['value']
df['mean'], df['std'] = g.transform('mean'), g.transform('std')
date time value mean std
0 2019-04-18 07:00:10 100.8 94.55 5.434151
1 2019-04-18 07:00:20 95.6 94.55 5.434151
2 2019-04-18 07:00:30 87.6 94.55 5.434151
3 2019-04-18 07:00:40 94.2 94.55 5.434151
I have synthesized some data. Start by generating a true datetime column, groupby() the hour, use describe() to get mean & std, and merge() the result back onto the original data frame.
d = pd.date_range("1-Jan-2019", "31-Dec-2019", freq="10S")
df = pd.DataFrame({"datetime":d, "value":np.random.uniform(70,90,len(d))})
df = df.assign(date=df.datetime.dt.strftime("%Y-%m-%d"),
time=df.datetime.dt.strftime("%H:%M:%S"))
# create a datetime column - better than manipulating strings
df["datetime"] = pd.to_datetime(df.date + " " + df.time)
# calc mean & std by hour
dfh = (df.groupby(df.datetime.dt.hour, as_index=False)
.apply(lambda dfa: dfa.describe().T.loc[:,["mean","std"]].reset_index(drop=True))
.droplevel(1)
)
# merge mean & std by hour back
df.merge(dfh, left_on=df.datetime.dt.hour, right_index=True).drop(columns="key_0")
datetime value mean std
0 2019-01-01 00:00:00 86.014209 80.043364 5.777724
1 2019-01-01 00:00:10 77.241141 80.043364 5.777724
2 2019-01-01 00:00:20 71.650739 80.043364 5.777724
3 2019-01-01 00:00:30 71.066332 80.043364 5.777724
4 2019-01-01 00:00:40 77.203291 80.043364 5.777724
... ... ... ... ...
3144955 2019-12-30 23:59:10 89.577237 80.009751 5.773007
3144956 2019-12-30 23:59:20 82.154883 80.009751 5.773007
3144957 2019-12-30 23:59:30 82.131952 80.009751 5.773007
3144958 2019-12-30 23:59:40 85.346724 80.009751 5.773007
3144959 2019-12-30 23:59:50 78.122761 80.009751 5.773007
I have a df column with dates and hours / minutes:
0 2019-09-13 06:00:00
1 2019-09-13 06:05:00
2 2019-09-13 06:10:00
3 2019-09-13 06:15:00
4 2019-09-13 06:20:00
Name: Date, dtype: datetime64[ns]
I need to count how many days the dataframe contains.
I tried it like this:
sample_length = len(df.groupby(df['Date'].dt.date).first())
and
sample_length = len(df.groupby(df['Date'].dt.date))
But the number I get seems wrong. Do you know another method of counting the days?
Consider the sample dates:
sample = pd.date_range('2019-09-12 06:00:00', periods=50, freq='4h')
df = pd.DataFrame({'date': sample})
date
0 2019-09-12 06:00:00
1 2019-09-12 10:00:00
2 2019-09-12 14:00:00
3 2019-09-12 18:00:00
4 2019-09-12 22:00:00
5 2019-09-13 02:00:00
6 2019-09-13 06:00:00
...
47 2019-09-20 02:00:00
48 2019-09-20 06:00:00
49 2019-09-20 10:00:00
Use DataFrame.groupby to group the dataframe on df['date'].dt.date and aggregate with GroupBy.size:
count = df.groupby(df['date'].dt.date).size()
# print(count)
date
2019-09-12 5
2019-09-13 6
2019-09-14 6
2019-09-15 6
2019-09-16 6
2019-09-17 6
2019-09-18 6
2019-09-19 6
2019-09-20 3
dtype: int64
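If what you actually need is just the number of distinct days (rather than the row count per day), a short follow-up on the same df is:
df['date'].dt.date.nunique()
# 9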
I'm not completely sure what you want to do here. Do you want to count the number of unique days (Monday/Tuesday/...), monthly dates (1-31 ish), yearly dates (1-365), or unique dates (unique days since the dawn of time)?
From a pandas series, you can use {series}.value_counts() to get the number of entries for each unique value, or simply get all unique values with {series}.unique()
import pandas as pd
df = pd.DataFrame(pd.DatetimeIndex(['2016-10-08 07:34:13', '2015-11-15 06:12:48',
'2015-01-24 10:11:04', '2015-03-26 16:23:53',
'2017-04-01 00:38:21', '2015-03-19 03:47:54',
'2015-12-30 07:32:32', '2015-11-10 20:39:36',
'2015-06-24 05:48:09', '2015-03-19 16:05:19'],
dtype='datetime64[ns]', freq=None), columns = ["date"])
days (Monday/Tuesday/...):
df.date.dt.dayofweek.value_counts()
monthly dates (1-31 ish):
df.date.dt.day.value_counts()
yearly dates (1-365):
df.date.dt.dayofyear.value_counts()
unique dates (unique days since the dawn of time):
df.date.dt.date.value_counts()
To get the number of unique entries from any of the above, simply add .shape[0]
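For example, the number of unique calendar dates in the sample frame above (where 2015-03-19 appears twice) would be:
df.date.dt.date.value_counts().shape[0]
# 9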
In order to calculate the total number of unique dates in the given time series data example we can use:
print(len(pd.to_datetime(df['Date']).dt.date.unique()))
import pandas as pd
df = pd.DataFrame({'Date': ['2019-09-13 06:00:00',
'2019-09-13 06:05:00',
'2019-09-13 06:10:00',
'2019-09-13 06:15:00',
'2019-09-13 06:20:00']
},
dtype = 'datetime64[ns]'
)
df = df.set_index('Date')
_count_of_days = df.resample('D').first().shape[0]
print(_count_of_days)
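Note that resample('D') emits one bin for every calendar day between the first and last timestamp, including days with no data, so this counts the days spanned. To count only the days that actually occur in the index, something like the following should work (for the five sample rows both give 1):
_count_of_days_with_data = df.index.normalize().nunique()
print(_count_of_days_with_data)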
I have a series that is within a multiindex, that I would like to change.
Say I have the following Series named ser:
gbd_wijk_naam gbd_buurt_naam cluster_id weging_datum_weging
Centrale Markt Ecowijk 119617.877|488566.830 2017-05-07 20.248457
2017-05-21 23.558438
2017-05-28 40.910273
2017-06-18 14.142136
2017-07-09 15.652476
...
Westindische Buurt Postjeskade e.o. 118620.633|486116.648 2019-11-17 17.029386
2019-12-01 21.530015
2019-12-08 15.491933
2019-12-15 22.896061
2019-12-22 13.228757
In the end, I want to do this for all indexes, but for now let's focus on just one.
I'm taking the first index, (Centrale Markt, Ecowijk, 119617.877|488566.830). This returns the following series:
weging_datum_weging
2017-05-07 20.248457
2017-05-21 23.558438
2017-05-28 40.910273
2017-06-18 14.142136
2017-07-09 15.652476
2017-07-23 44.067607
2017-07-30 17.464249
2017-08-20 20.000000
2017-08-27 30.184594
2017-09-03 19.104973
2017-09-10 17.175564
2017-09-17 15.968719
2017-09-24 38.415531
2017-10-29 18.708287
2017-11-05 18.574176
2017-11-12 21.095023
2017-12-10 21.794495
2019-01-06 42.966652
2019-01-20 13.038405
2019-01-27 29.483345
2019-02-17 16.278821
2019-02-24 15.968719
2019-03-03 31.583124
2019-03-10 19.748418
2019-04-28 18.574176
2019-05-12 17.029386
2019-05-19 20.976177
2019-06-23 20.493902
2019-07-14 15.329710
2019-09-22 34.537485
2019-09-29 17.320508
2019-10-06 16.431677
2019-10-27 10.246951
2019-11-17 16.733201
2019-11-24 29.567957
Name: weging_netto_gewicht, dtype: float64
With shape (35,)
I want to replace all values at this index with those of an interpolated series that I make through:
_ = ser.loc[('Centrale Markt', 'Ecowijk', '119617.877|488566.830')]
upsampled = _.resample('D')
interpolated = upsampled.interpolate(method='linear')
This series has shape (932,).
I'm able to change the series through:
x = ser.loc[('Centrale Markt', 'Ecowijk', '119617.877|488566.830')]
x = x.reindex(interpolated.index)
x.update(interpolated)
Giving me
weging_datum_weging
2017-05-07 20.248457
2017-05-08 20.484884
2017-05-09 20.721311
2017-05-10 20.957738
2017-05-11 21.194166
...
2019-11-20 22.233810
2019-11-21 24.067347
2019-11-22 25.900884
2019-11-23 27.734420
2019-11-24 29.567957
Freq: D, Name: weging_netto_gewicht, Length: 932, dtype: float64
What I can't seem to figure out is how to put x back into ser at index ('Centrale Markt', 'Ecowijk', '119617.877|488566.830')
When I try to do it for all the indices, for example:
for idx, df_select in ser2.groupby(level=[0,1,2]):
_ = ser.loc[idx]
upsampled = _.resample('D')
interpolated = upsampled.interpolate(method='linear')
ser.loc[idx] = ser.loc[idx].reindex(interpolated.index)
ser.loc[idx].update(interpolated)
interpolated is generated as it should be, but the second part does not update ser.
I have it working now in this way:
for index, value in interpolated.items():
new_df = new_df.append(
{'gbd_wijk_naam': idx[0], \
'gbd_buurt_naam': idx[1],\
'cluster_id': idx[2],\
'weging_datum_weging': index,\
'weging_netto_gewicht': value}, ignore_index=True)
Where it appends the row to a new df and that df gets grouped in the same way later again. This is super slow though. How can we speed this up?
Resample works when the index is either DatetimeIndex, TimedeltaIndex or PeriodIndex, but not with a multi-index as you have.
It is possible to set the timestamp column as the index, group by the other columns and resample/interpolate.
Using the following data for illustration:
gbd_wijk_naam gbd_buurt_naam cluster_id weging_datum_weging
Centrale Markt Ecowijk 119617.877|488566.830 2017-05-07 20.248457
2017-05-21 23.558438
2017-05-28 40.910273
give the series a name & reset_index
df = series.rename('val').reset_index()
ensure datetime column has the right type
df.weging_datum_weging = pd.to_datetime(df.weging_datum_weging)
set index, groupby other cols, resample & interpolate
(df.set_index('weging_datum_weging')
.groupby(['gbd_wijk_naam', 'gbd_buurt_naam', 'cluster_id'])
.val.apply(lambda s: s.resample('D').interpolate('linear')))
produces the output:
gbd_wijk_naam gbd_buurt_naam cluster_id weging_datum_weging
Centrale Markt Ecowijk 119617.877|488566.830 2017-05-07 20.248457
2017-05-08 20.484884
2017-05-09 20.721311
2017-05-10 20.957739
2017-05-11 21.194166
...
2017-07-05 15.364792
2017-07-06 15.436713
2017-07-07 15.508634
2017-07-08 15.580555
2017-07-09 15.652476
Name: val, Length: 64, dtype: float64
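Because the result keeps the three grouping levels plus the resampled dates as its index (it is merely renamed to val here), one way to get it back into the original series is simply to reassign it; a sketch reusing the code above:
ser = (df.set_index('weging_datum_weging')
         .groupby(['gbd_wijk_naam', 'gbd_buurt_naam', 'cluster_id'])
         .val.apply(lambda s: s.resample('D').interpolate('linear'))
         .rename('weging_netto_gewicht'))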
I have an observational data set which contains weather information. Each column contains a specific field, with date and time in two separate columns. The time column contains hourly times like 0000, 0600, ... up to 2300. What I am trying to do is filter the data set based on a certain time frame, for example between 0000 UTC and 0600 UTC. When I read the data file into a pandas data frame, by default the time column is read as float. When I try to convert it into a datetime object, it produces a format which I am unable to work with. A code example is given below:
import pandas as pd
import datetime as dt
df = pd.read_excel("test.xlsx")
df.head()
which produces the following result:
tdate itime moonph speed ... qnh windir maxtemp mintemp
0 01-Jan-17 1000.0 NM7 5 ... $1,011.60 60.0 $32.60 $22.80
1 01-Jan-17 1000.0 NM7 2 ... $1,015.40 999.0 $32.60 $22.80
2 01-Jan-17 1030.0 NM7 4 ... $1,015.10 60.0 $32.60 $22.80
3 01-Jan-17 1100.0 NM7 3 ... $1,014.80 999.0 $32.60 $22.80
4 01-Jan-17 1130.0 NM7 5 ... $1,014.60 270.0 $32.60 $22.80
Then I extracted the time column with following line:
df["time"] = df.itime
df["time"]
0 1000.0
1 1000.0
2 1030.0
3 1100.0
4 1130.0
5 1200.0
6 1230.0
7 1300.0
8 1330.0
.
.
3261 2130.0
3262 2130.0
3263 600.0
3264 630.0
3265 730.0
3266 800.0
3267 830.0
3268 1900.0
3269 1930.0
3270 2000.0
Name: time, Length: 3279, dtype: float64
Then I tried to convert the time column to datetime object:
df["time"] = pd.to_datetime(df.itime)
which produced the following result:
df["time"]
0 1970-01-01 00:00:00.000001000
1 1970-01-01 00:00:00.000001000
2 1970-01-01 00:00:00.000001030
3 1970-01-01 00:00:00.000001100
It appears that it has successfully converted the data to datetime objects. However, it treated the numeric hour value as nanoseconds since the epoch, which makes filtering difficult for me.
The final data format I would like to get is either:
1970-01-01 06:00:00
or
06:00
Any help is appreciated.
When you read the Excel file, specify the dtype of the itime column as str:
df = pd.read_excel("test.xlsx", dtype={'itime':str})
then you will have a time column of strings looking like:
df = pd.DataFrame({'itime':['2300', '0100', '0500', '1000']})
Then specify the format and convert to time:
df['Time'] = pd.to_datetime(df['itime'], format='%H%M').dt.time
itime Time
0 2300 23:00:00
1 0100 01:00:00
2 0500 05:00:00
3 1000 10:00:00
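Since the original goal was filtering by time of day, the resulting Time column of datetime.time objects can be compared against time bounds directly; a sketch assuming the df above:
import datetime as dt
mask = (df['Time'] >= dt.time(0, 0)) & (df['Time'] <= dt.time(6, 0))
df[mask]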
Just an add-on to Chris' answer: if you are unable to convert because there is no leading zero, apply the following to the dataframe.
df['itime'] = df['itime'].apply(lambda x: x.zfill(4))
This is needed because the original format does not always have four digits with a leading zero, e.g. 945 instead of 0945.
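A tiny illustration of the padding with made-up values:
s = pd.Series(['945', '2300'])
s.apply(lambda x: x.zfill(4))
# 0    0945
# 1    2300
# dtype: object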
Try
df["time"] = pd.to_datetime(df.itime).dt.strftime('%Y-%m-%d %H:%M:%S')
or
df["time"] = pd.to_datetime(df.itime).dt.strftime('%H:%M:%S')
for the first and second output formats you want, respectively.
Best!