I would love some guidance on how to compare the same dates across different years. I have daily mean temperature data for all March days from 1997-2018, and my goal is to see the mean temperature of each day over my time period. My df is simple; the head and tail look like the following:
IndexType = Datetime
Date temp
1997-03-01 6.00
1997-03-02 6.22
1997-03-03 6.03
1997-03-04 4.41
1997-03-05 5.29
Date temp
2018-03-27 -2.44
2018-03-28 -1.01
2018-03-29 -1.08
2018-03-30 -0.53
2018-03-31 -0.11
I imagine the goal could be either 1) a dataframe with days as the index and years as columns, or 2) a Series with days as the index and the average daily temperature over 1997-2018.
My code:
df = pd.read_csv(file, sep=';', skiprows=9, usecols=[0, 1, 2, 3], parse_dates=[['Datum', 'Tid (UTC)']], index_col=0)
print(df.head())
df.columns = ['temp']
df.index.names = ['Date']
df_mar = df.loc[df.index.month == 3]
df_mar = df_mar.resample('D').mean().round(2)
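Both target shapes can be sketched with pivot_table / groupby on the index's day and year components (illustrative stand-in data, not the real 1997-2018 series):

```python
import pandas as pd

# Hypothetical March readings for two years (stand-in for the 1997-2018 data)
idx = pd.to_datetime(['1997-03-01', '1997-03-02', '1998-03-01', '1998-03-02'])
df_mar = pd.DataFrame({'temp': [6.00, 6.22, 5.10, 4.90]}, index=idx)

# Goal 1: days as index, years as columns
wide = df_mar.pivot_table(values='temp', index=df_mar.index.day,
                          columns=df_mar.index.year)

# Goal 2: mean temperature per calendar day across all years
daily_mean = df_mar.groupby(df_mar.index.day)['temp'].mean()
```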
You can use groupby to see lots of comparisons. Not sure if that's exactly what you're looking for?
Make sure your date column is a Timestamp.
import pandas as pd
df = df.reset_index(drop=False)
df['Date'] = pd.to_datetime(df['Date'])
I'll initialize a dataframe to practice on:
import datetime
import random
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(0, 100000)]
df = pd.DataFrame({'date':date_list, 'temp':[random.randint(-30, 100) for x in range(100000)]})
march = df[df['date'].dt.month == 3]
g = march.groupby(march['date'].dt.day).agg({'temp':['max', 'min', 'mean']})
Alternatively, you can do this across your whole dataframe, not just March:
df.groupby(df['date'].dt.month).agg({'temp':['max', 'min', 'mean', 'nunique']})
temp
max min mean nunique
date
1 100 -30 34.999765 131
2 100 -30 35.167485 131
3 100 -30 35.660215 131
4 100 -30 34.436264 131
5 100 -30 35.424371 131
6 100 -30 35.086253 131
7 100 -30 35.188133 131
8 100 -30 34.772781 131
9 100 -30 34.839173 131
10 100 -30 35.248528 131
11 100 -30 34.666302 131
12 100 -30 34.575583 131
This is what my data looks like:
month total_mobile_subscription
0 1997-01 414000
1 1997-02 423000
2 1997-03 431000
3 1997-04 479000
4 1997-05 510000
.. ... ...
279 2020-04 9056300
280 2020-05 8928800
281 2020-06 8860000
282 2020-07 8768500
283 2020-08 8659000
[284 rows x 2 columns]
Basically, I'm trying to change this into a dataset indexed by year, with the value being the mean of the total mobile subscriptions for each year.
I am not sure what to do, as I am still learning.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'year': ['1997-01', '1997-02', '1997-03', '1998-01', '1998-02', '1998-03'],
'sale': [500, 1000, 1500, 2000, 1000, 400]
})
# Strip the month suffix; zfill pads 1 -> '01' so '-10'..'-12' also match
for a in range(1, 13):
    x = '-' + str(a).zfill(2)
    df['year'] = df['year'].str.replace(x, '', regex=False)
df2 = df.groupby('year')['sale'].mean()
print(df2)
Convert values of month to datetimes and aggregate by years:
y = pd.to_datetime(df['month']).dt.year.rename('year')
df1 = df.groupby(y)['total_mobile_subscription'].mean().reset_index()
Or convert the column in place first:
df['month'] = pd.to_datetime(df['month'])
df1 = (df.groupby(df['month'].dt.year.rename('year'))['total_mobile_subscription']
         .mean().reset_index())
Or aggregate by first 4 values of month column:
df2 = (df.groupby(df['month'].str[:4].rename('year'))['total_mobile_subscription']
.mean().reset_index())
I have a column of Call Duration formatted as mm.ss and I would like to convert it to all seconds.
It looks like this:
CallDuration
25 29.02
183 5.40
213 3.02
290 10.27
304 2.00
...
4649990 13.02
4650067 5.33
4650192 19.47
4650197 3.44
4650204 14.15
In Excel I would split the column at the ".", multiply the minutes column by 60, and then add it to the seconds column for my total seconds. I feel like this should be much easier with pandas/Python, but I cannot figure it out.
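The Excel recipe just described translates almost literally into pandas string operations; a sketch with made-up mm.ss strings:

```python
import pandas as pd

# Hypothetical mm.ss durations stored as strings
s = pd.Series(['29.02', '5.40', '3.02'])

# Split at the '.', then minutes * 60 + seconds
parts = s.str.split('.', expand=True).astype(int)
total_seconds = parts[0] * 60 + parts[1]
```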
I tried using pd.to_timedelta, but that did not give me what I need - I can't figure out how to tell it how the time is formatted. When I pass 'm' it treats the digits after the "." as fractional minutes rather than seconds, so the results are wrong:
pd.to_timedelta(post_group['CallDuration'],'m')
25 0 days 00:29:01.200000
183 0 days 00:05:24
213 0 days 00:03:01.200000
290 0 days 00:10:16.200000
304 0 days 00:02:00
...
4649990 0 days 00:13:01.200000
4650067 0 days 00:05:19.800000
4650192 0 days 00:19:28.200000
4650197 0 days 00:03:26.400000
4650204 0 days 00:14:09
Name: CallDuration, Length: 52394, dtype: timedelta64[ns]
I tried doing it this way, but now I can't get the 'sec' column to convert to an integer because there are blanks, and it won't fill the blanks...
post_duration = post_group['CallDuration'].str.split(".",expand=True)
post_duration.columns = ["min","sec"]
post_duration['min'] = post_duration['min'].astype(int)
post_duration['min'] = 60*post_duration['min']
post_duration.loc['Total', 'min'] = post_duration['min'].sum()
post_duration
min sec
25 1740.0 02
183 300.0 4
213 180.0 02
290 600.0 27
304 120.0 None
... ... ...
4650067 300.0 33
4650192 1140.0 47
4650197 180.0 44
4650204 840.0 15
Total 24902700.0 NaN
post_duration2 = post_group['CallDuration'].str.split(".",expand=True)
post_duration2.columns = ["min","sec"]
post_duration2['sec'].astype(float).astype('Int64')
post_duration2.fillna(0)
post_duration2.loc['Total', 'sec'] = post_duration2['sec'].sum()
post_duration2
TypeError: object cannot be converted to an IntegerDtype
Perhaps there's a more efficient way, but I would still convert to a timedelta format then use apply with the Timedelta.total_seconds() method to get the column in seconds.
import pandas as pd
pd.to_timedelta(post_group['CallDuration'], 'm').apply(pd.Timedelta.total_seconds)
You can find more info on the attributes and methods you can call on timedeltas in the pandas timedelta documentation.
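As a minimal illustration of that combination (made-up values - and note it assumes the column really holds decimal minutes, not mm.ss):

```python
import pandas as pd

# Durations in decimal minutes
s = pd.Series([2.0, 5.5])

# Convert to timedelta, then take total seconds via the .dt accessor
secs = pd.to_timedelta(s, 'm').dt.total_seconds()
```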
import pandas as pd
import numpy as np
import datetime
def convert_to_seconds(col_data):
    col_data = pd.to_datetime(col_data, format="%M:%S")
    # The above line adds 1900-01-01 as a date to the time, so subtract it back out
    col_data = col_data - datetime.datetime(1900, 1, 1)
    return col_data.dt.total_seconds()
df = pd.DataFrame({'CallDuration':['2:02',
'5:50',
np.nan,
'3:02']})
df['CallDuration'] = convert_to_seconds(df['CallDuration'])
Here's the result:
CallDuration
0 122.0
1 350.0
2 NaN
3 182.0
You can also use the above code to convert HH:MM strings to total seconds (as floats), but only if the number of hours is less than 24.
And if you want to convert multiple columns in your dataframe replace
df['CallDuration'] = convert_to_seconds(df['CallDuration'])
with
new_df = df.apply(lambda col: convert_to_seconds(col) if col.name in colnames_list else col)
I want to convert a column of hh:mm values to minutes. I tried the following code, but it raises "AttributeError: 'Series' object has no attribute 'split'". The data is in the following format. I also have NaN values in the dataset; the plan is to compute the median of the values and then fill the NaN rows with that median.
02:32
02:14
02:31
02:15
02:28
02:15
02:22
02:16
02:22
02:14
I have tried this so far
s = dataset['Enroute_time_(hh mm)']
hours, minutes = s.split(':')
int(hours) * 60 + int(minutes)
I suggest you avoid row-wise calculations. You can use a vectorised approach with Pandas / NumPy:
df = pd.DataFrame({'time': ['02:32', '02:14', '02:31', '02:15', '02:28', '02:15',
'02:22', '02:16', '02:22', '02:14', np.nan]})
values = df['time'].fillna('00:00').str.split(':', expand=True).astype(int)
factors = np.array([60, 1])
df['mins'] = (values * factors).sum(1)
print(df)
time mins
0 02:32 152
1 02:14 134
2 02:31 151
3 02:15 135
4 02:28 148
5 02:15 135
6 02:22 142
7 02:16 136
8 02:22 142
9 02:14 134
10 NaN 0
If you want to use split you will need to use the str accessor, i.e. s.str.split(':').
However I think that in this case it makes more sense to use apply:
df = pd.DataFrame({'Enroute_time_(hh mm)': ['02:32', '02:14', '02:31',
'02:15', '02:28', '02:15',
'02:22', '02:16', '02:22', '02:14']})
def convert_to_minutes(value):
    hours, minutes = value.split(':')
    return int(hours) * 60 + int(minutes)
df['Enroute_time_(hh mm)'] = df['Enroute_time_(hh mm)'].apply(convert_to_minutes)
print(df)
# Enroute_time_(hh mm)
# 0 152
# 1 134
# 2 151
# 3 135
# 4 148
# 5 135
# 6 142
# 7 136
# 8 142
# 9 134
I understand that you have a DataFrame column holding timedeltas as strings. You want to extract the total minutes of the deltas, and then fill the NaN values with the median of the total minutes.
import pandas as pd
df = pd.DataFrame(
{'hhmm' : ['02:32',
'02:14',
'02:31',
'02:15',
'02:28',
'02:15',
'02:22',
'02:16',
'02:22',
'02:14']})
Your timedeltas are not timedeltas - they are strings - so you need to convert them first.
df.hhmm = pd.to_datetime(df.hhmm, format='%H:%M')
df.hhmm = pd.to_timedelta(df.hhmm - pd.Timestamp('1900-01-01'))
This gives you the following values (Note the dtype: timedelta64[ns] here)
0 02:32:00
1 02:14:00
2 02:31:00
3 02:15:00
4 02:28:00
5 02:15:00
6 02:22:00
7 02:16:00
8 02:22:00
9 02:14:00
Name: hhmm, dtype: timedelta64[ns]
Now that you have true timedeltas, you can use some cool functions like total_seconds() and then calculate the minutes.
df.hhmm.dt.total_seconds() / 60
If that is not what you wanted, you can also use the following.
df.hhmm.dt.components.minutes
This gives you the minutes from the HH:MM string as if you would have split it.
Finally, compute the minutes and fill the NaN values with their median (filling the timedelta column itself with a float would fail):
mins = df.hhmm.dt.total_seconds() / 60
mins = mins.fillna(mins.median())
or, using the components accessor:
mins = df.hhmm.dt.components.minutes
mins = mins.fillna(mins.median())
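An end-to-end sketch with an actual NaN included (made-up values), showing how the missing entry propagates and where the median fill lands:

```python
import pandas as pd
import numpy as np

# Hypothetical hh:mm strings with a missing value
s = pd.Series(['02:30', '02:10', np.nan])

# Parse, subtract the implicit 1900-01-01 date, convert to minutes (NaN -> NaT -> NaN)
td = pd.to_datetime(s, format='%H:%M') - pd.Timestamp('1900-01-01')
mins = td.dt.total_seconds() / 60

# Fill the missing entry with the median of the observed minutes
mins = mins.fillna(mins.median())
```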
I have CSV time series data with one row per day: a date and a cumulative sale. Similar to this:
01-01-2010 12:10:10 50.00
01-02-2010 12:10:10 80.00
01-03-2010 12:10:10 110.00
.
. for each day of 2010
.
01-01-2011 12:10:10 2311.00
01-02-2011 12:10:10 2345.00
01-03-2011 12:10:10 2445.00
.
. for each day of 2011
.
and so on.
I am looking to get the monthly sale (max - min) for each month in each year. So for the past 5 years I will have 5 Jan values (max - min), 5 Feb values (max - min), and so on.
Once I have those, I next get the 5-year average for Jan, the 5-year average for Feb, and so on.
Right now I do this by slicing the original df by [year/month] and then averaging over the specific month of the year.
I would like to use the time series resample() approach, but I am currently stuck at telling pandas to take the monthly (max - min) for each month in the past 10 years from today, and then chain in a .mean().
Any advice on an efficient way to do this with resample() would be appreciated.
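The slice-and-average approach described above could be sketched like this (made-up monotone data standing in for the cumulative sales):

```python
import pandas as pd
import numpy as np

idx = pd.date_range('2016-01-01', '2017-12-31')
# Monotone counter as a stand-in for cumulative sales
df = pd.DataFrame({'sale': np.arange(idx.size)}, index=idx)

# (max - min) of January for each year via partial-string slicing,
# then the average across years
jan_ranges = [df.loc[f'{y}-01', 'sale'].max() - df.loc[f'{y}-01', 'sale'].min()
              for y in (2016, 2017)]
jan_avg = sum(jan_ranges) / len(jan_ranges)
```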
It would probably look something like this (note: this example does not use cumulative sale values). The key here is to perform a df.groupby() passing dt.year and dt.month.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'date': pd.date_range(start='2016-01-01',end='2017-12-31'),
'sale': np.random.randint(100,200, size = 365*2+1)
})
# Per-month last, first and size (real cumulative, sorted data makes last/first stand in for max/min)
dfg = df.groupby([df.date.dt.year,df.date.dt.month])['sale'].agg(['last','first','size'])
# Assign new cols (diff and avg) and drop last, first and size
dfg = dfg.assign(diff = dfg['last'] - dfg['first'])
dfg = dfg.assign(avg = dfg['diff'] / dfg['size']).drop(['last','first','size'], axis=1)
# Rename index cols
dfg.index = dfg.index.rename(['Year','Month'])
print(dfg.head(6))
Returns:
diff avg
Year Month
2016 1 -56 -1.806452
2 -17 -0.586207
3 30 0.967742
4 34 1.133333
5 46 1.483871
6 2 0.066667
You can do it with two resamples:
First resample to months (M) and take the diff (max() - min()).
Then resample to 5 years (5AS), group by month and take the mean().
E.g.:
In []:
date_range = pd.date_range(start='2008-01-01',end='2017-12-31')
df = pd.DataFrame({'sale': np.random.randint(100, 200, size=date_range.size)},
index=date_range)
In []:
df1 = df.resample('M').apply(lambda g: g.max()-g.min())
df1.resample('5AS').apply(lambda g: g.groupby(g.index.month).mean()).unstack()
Out[]:
sale
1 2 3 4 5 6 7 8 9 10 11 12
2008-01-01 95.4 90.2 95.2 95.4 93.2 93.8 91.8 95.6 93.4 93.4 94.2 93.8
2013-01-01 93.2 96.4 92.8 96.4 92.6 93.0 93.2 92.6 91.2 93.2 91.8 92.2
import pandas as pd
import numpy as np
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
In this data frame, I am interested in creating a field called 'year_month' such that each value looks like so:
datetime.date(df['year'][0], df['month'][0], 1).strftime("%Y%m")
I'm stuck on how to apply this operation to the entire data frame and would appreciate any help.
Join both columns converted to strings, adding zfill for the months:
df['new'] = df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
Or add new column day by assign, convert columns to_datetime and last strftime:
df['new'] = pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
If multiple columns in DataFrame:
df['new'] = pd.to_datetime(df.assign(day=1)[['day','month','year']]).dt.strftime("%Y%m")
print (df)
month year new
0 1 2018 201801
1 2 2018 201802
2 3 2018 201803
3 4 2018 201804
4 5 2018 201805
5 6 2018 201806
6 7 2018 201807
7 8 2018 201808
8 9 2018 201809
9 10 2018 201810
10 11 2018 201811
11 12 2018 201812
Timings:
df = pd.DataFrame({'year': np.repeat(2018,12), 'month': range(1,13)})
df = pd.concat([df] * 1000, ignore_index=True)
In [212]: %timeit pd.to_datetime(df.assign(day=1)).dt.strftime("%Y%m")
10 loops, best of 3: 74.1 ms per loop
In [213]: %timeit df['year'].astype(str) + df['month'].astype(str).str.zfill(2)
10 loops, best of 3: 41.3 ms per loop
One way would be to create the datetime objects directly from the source data:
import pandas as pd
import numpy as np
from datetime import date
df = pd.DataFrame({'date': [date(i, j, 1) for i, j
                            in zip(np.repeat(2018,12), range(1,13))]})
# date
# 0 2018-01-01
# 1 2018-02-01
# 2 2018-03-01
# 3 2018-04-01
# 4 2018-05-01
# 5 2018-06-01
# 6 2018-07-01
# 7 2018-08-01
# 8 2018-09-01
# 9 2018-10-01
# 10 2018-11-01
# 11 2018-12-01
You could use an apply function, accessing the columns by name so the result does not depend on column order:
from datetime import date
df['year_month'] = df.apply(lambda row: date(row['year'], row['month'], 1).strftime("%Y%m"), axis=1)