Say I have a dataset at daily scale, but not all days have valid data; in other words, some days are missing. I want to compute the summer season mean from the dataset, but exclude any month that has fewer than 20 days of valid data.
How do I achieve this (in a Pythonic fashion)?
Say my dataframe (df) is like this:
DATE VAR
1900-01-01 123
1900-01-02 456
1900-01-10 789
...
I know how to compute the count:
df_count = df.resample('MS').count()
I also know how to compute the summer season mean:
df_summer = df.resample('Q-NOV').mean()
You can use df_count to filter out the months that have fewer than 20 days of valid data, and then compute the summer season mean with your formula.
df_count = df['VAR'].resample('MS').count()
# Months with at least 20 valid days, as monthly periods
relevant_month = df_count[df_count >= 20].index.to_period('M')
# Keep the daily rows that fall in those months, then average by season
df_summer = df[df.index.to_period('M').isin(relevant_month)].resample('Q-NOV').mean()
I assume the dates are stored in a DatetimeIndex. If they are stored in a regular column instead, change df.index.to_period('M') to df.columnName.dt.to_period('M') (and set that column as the index before resampling).
I also don't know the format of your time column (date or datetime), so you might need to convert it with pd.to_datetime first. This is just the general idea.
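To make the idea concrete, here is a minimal sketch on synthetic data. The column name VAR comes from the question; the year 1900 sample, the gap in July, and the choice of June to August as "summer" are assumptions for illustration. It counts valid days per month with groupby/transform instead of resample, which avoids having to match daily dates against month-start labels.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for 1900 (an assumption, standing in for the real data).
idx = pd.date_range("1900-01-01", "1900-12-31", freq="D")
df = pd.DataFrame({"VAR": np.arange(len(idx), dtype=float)}, index=idx)

# Knock out most of July so that it has fewer than 20 valid days.
df = df.drop(df.loc["1900-07-05":"1900-07-25"].index)

# Number of valid (non-NaN) days per calendar month, broadcast back to each row.
month = df.index.to_period("M")
days_in_month = df["VAR"].groupby(month).transform("count")

# Keep only rows that belong to months with at least 20 valid days.
df_valid = df[days_in_month >= 20]

# Summer (June to August, an assumption) mean per year, from the surviving months only.
summer = df_valid[df_valid.index.month.isin([6, 7, 8])]
df_summer = summer.groupby(summer.index.year)["VAR"].mean()
print(df_summer)
```

In this sample July is dropped entirely, so the 1900 summer mean is computed from June and August only.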
Related
I'm working with a dataframe that has daily information (measured data) across 30 years for different variables. I am trying to groupby days of the year, and then find a mean across 30 years. How do I go about this? This is what the dataframe looks like
I tried to group by day after checking the type of YYYYMMDD (it's int64). Now the dataframe looks like this; it has just added new columns for Day, Month and Year.
I'm a bit stuck on how to calculate means from here. I would need to somehow group all Jan 1sts, Jan 2nds, etc. over the 30 years and average them.
You can groupby with month and day:
df.index = pd.to_datetime(df.index)
( df.groupby([df.index.month, df.index.day]).mean().reset_index().
rename({'level_0':'month', 'level_1':'day'}, axis=1))
or, if you would rather have a plain integer index (0, 1, ..., 365) than month and day labels, set as_index=False:
df.groupby([df.index.month, df.index.day], as_index=False).mean()
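As a rough, self-contained illustration of the month/day grouping, here is a sketch on made-up data. The column name temp and the 1990 to 1993 date range are assumptions, not the asker's real data; naming the index levels explicitly is just an alternative to renaming level_0 and level_1 afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical daily data covering 1990-1993 (includes the leap year 1992).
idx = pd.date_range("1990-01-01", "1993-12-31", freq="D")
df = pd.DataFrame({"temp": np.arange(len(idx), dtype=float)}, index=idx)

# Group by (month, day) and name the two index levels explicitly.
daily_mean = df.groupby([df.index.month, df.index.day]).mean()
daily_mean.index.names = ["month", "day"]

# Flatten to ordinary columns if a plain table is preferred.
daily_mean = daily_mean.reset_index()
print(daily_mean.head())
```

The result has 366 rows because Feb 29 forms its own group, averaged over the leap years only.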
Could someone give me a tip on how to use pandas groupby to find similar "days" in a time series dataset?
For example, my data (averaged daily values) is a building's electrical power and weather data. I am trying to see whether pandas groupby can be used to find "days" that are similar, in both electrical power usage and weather, to one specific date in the time stamp: July 25th 2019.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/bbartling/Data/master/stackoverflow_groupby_question.csv', parse_dates=True)
df['Date']=pd.to_datetime(df['Date'], utc=True)
df.set_index('Date', inplace=True)
df_daily_avg = df.resample('D').mean()
What I am trying to find is like the top 10 or 15 most similar days in this dataset to the averaged temperature on that day of July 25th which is:
july_25_temp_avg = df_daily_avg.loc['2019-07-25'].Temperature_C
22.047916666666676
And averaged building power which is:
july_25_power_avg = df_daily_avg.loc['2019-07-25'].kW
52.658333333333324
If I use groupby, as in something like the code below, it strips away the time stamp index.
july25_most_similar = df_daily_avg.groupby(['kW','Temperature_C'],as_index=False).Temperature_C.mean()
returns where it seems like most similar days are on the bottom:
kW Temperature_C
0 9.316667 17.256250
1 9.433333 14.979167
2 9.616667 13.933333
3 9.683333 19.822917
4 10.116667 24.606250
... ... ...
360 58.741667 21.816667
361 61.250000 23.839583
362 61.633333 25.204167
363 62.483333 25.970833
364 63.808333 25.300000
Any tips greatly appreciated to return the timestamp/days that are most similar to July 25th Temperature & Power.
Also, if it is possible to use more criteria than just Temperature_C, could you post an additional answer that uses more weather data? For example, the averaged power on July 25th plus more weather data (beyond just Temperature_C), such as Wind_Speed_m_s, Relative_Humidity, Temperature_C, Pressure_mbar and DHI_DNI?
I think I would take this approach:
indx = df_daily_avg.sub(df_daily_avg.loc['2019-07-25']).abs()\
.sort_values(['Temperature_C', 'kW']).head(10).index.normalize()
df[df.index.normalize().isin(indx)]
Take the difference from the target day and its absolute value, then take the top ten days sorted on 'Temperature_C' and 'kW' (or on some other metric that ranks the two).
Then take that index, normalize it to dates, and determine which rows in the original dataframe match the retrieved index.
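One possible "metric that ranks the two" is a distance over standardised columns, which also extends to extra weather variables. The sketch below is only an illustration: the data are random (an assumption standing in for the CSV), and the choice of Euclidean distance on z-scores is mine, not something the answer above prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic daily averages (an assumption; the real data comes from the CSV above).
rng = np.random.default_rng(0)
idx = pd.date_range("2019-01-01", "2019-12-31", freq="D")
df_daily_avg = pd.DataFrame(
    {"kW": rng.uniform(10, 60, len(idx)),
     "Temperature_C": rng.uniform(-5, 30, len(idx)),
     "Relative_Humidity": rng.uniform(20, 90, len(idx))},
    index=idx,
)

target_day = pd.Timestamp("2019-07-25")
features = ["kW", "Temperature_C", "Relative_Humidity"]  # add more columns as needed

# Standardise each column so that no single unit dominates the distance.
z = (df_daily_avg[features] - df_daily_avg[features].mean()) / df_daily_avg[features].std()

# Euclidean distance of every day to the target day in standardised space.
distance = np.sqrt(((z - z.loc[target_day]) ** 2).sum(axis=1))

# Ten closest days, excluding the target day itself; the DatetimeIndex is preserved.
most_similar = distance.drop(target_day).nsmallest(10)
print(most_similar)
```

Because no groupby is involved, the timestamps stay in the index, so most_similar.index can be used directly to look up the matching days.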
I am trying to forecast daily profit using time series analysis, but daily profit is not only recorded unevenly, but some of the data is missing.
Raw Data:
Date Revenue
2020/1/19 10$
2020/1/20 7$
2020/1/25 14$
2020/1/29 18$
2020/2/1 12$
2020/2/2 17$
2020/2/9 28$
The above table is an example of the kind of data I have. Profit is not recorded daily, so the dates between 2020/1/20 and 2020/1/24 do not exist. Not only that, say the profit recorded during the period between 2020/2/3 and 2020/2/8 went missing in the database. I would like to recover this missing data and use time series analysis to predict the profit from 2020/2/9 onward.
My approach was to first aggregate the profit every 6 days, since I have to recover the profit between 2020/2/3 and 2020/2/8. So my cleaned data will look something like this:
Date Revenue
2020/1/16 ~ 2020/1/21 17$
2020/1/22 ~ 2020/1/27 14$
2020/1/28 ~ 2020/2/2 47$
2020/2/3 ~ 2020/2/8 ? (to predict)
After applying this to a time series model, I would like to further predict the profit from 2020/2/9 onward.
This is my general idea, but as a beginner at Python and the pandas library, I have trouble executing it. Could you please show me how to aggregate the profit every 6 days so that the data looks like the above table?
The easiest way is to use the pandas resample function.
Provided you have a DatetimeIndex, resampling to aggregate profits every 6 days is as simple as your_dataframe.resample('6D').sum()
You can do all sorts of resampling (end of month, end of quarter, beginning of week, every hour, minute, second, ...). Check the full documentation if you're interested: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html?highlight=resample#pandas.DataFrame.resample
I suggest using a combination of .rolling, pd.date_range, and .reindex
Say your DataFrame is df, with proper datetime indexing:
df = pd.DataFrame([['2020/1/19',10],
['2020/1/20',7],
['2020/1/25',14],
['2020/1/29',18],
['2020/2/1',12],
['2020/2/2',17],
['2020/2/9',28]],columns=['Date','Revenue'])
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date',inplace=True)
The first step is to 'fill in' the missing days with dummy, zero revenue. We can use pd.date_range to get an index with evenly spaced dates from 2020/1/16 to 2020/2/8, and then .reindex to bring this into the main df DataFrame:
evenly_spaced_idx = pd.date_range(start='2020/1/16',end='2020/2/8',freq='1d')
df = df.reindex(evenly_spaced_idx, fill_value=0)
Now we can take a rolling sum over each 6-day window. We're not interested in every day's six-day revenue total, though, only every 6th day's:
summary_df = df.rolling('6d').sum().iloc[5::6, :].copy()
The last thing with summary_df is just to format it the way you'd like so that it clearly states the date range which each row refers to.
# Each 6-day window ends on the index date, so it starts 5 days earlier
summary_df['Start Date'] = summary_df.index-pd.Timedelta('5d')
summary_df['End Date'] = summary_df.index
summary_df.reset_index(drop=True,inplace=True)
You can use resample for this.
Make sure to have the "Date" column as datetime type.
>>> df = pd.DataFrame([["2020/1/19" ,10],
... ["2020/1/20" ,7],
... ["2020/1/25" ,14],
... ["2020/1/29" ,18],
... ["2020/2/1" ,12],
... ["2020/2/2" ,17],
... ["2020/2/9" ,28]], columns=['Date', 'Revenue'])
>>> df['Date'] = pd.to_datetime(df.Date)
For pandas < 1.1.0 (the base argument was deprecated in 1.1.0 and removed in 2.0)
>>> df.set_index('Date').resample('6D', base=3).sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
For pandas >= 1.1.0
>>> df.set_index('Date').resample('6D', origin='2020-01-16').sum()
Revenue
Date
2020-01-16 17
2020-01-22 14
2020-01-28 47
2020-02-03 0
2020-02-09 28
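To get from these bins to the "start ~ end" layout in the question, something along these lines might work. This is a sketch assuming pandas >= 1.1.0 (for origin); the Period column name and the zero-padded date format are my own choices for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Date": ["2020/1/19", "2020/1/20", "2020/1/25", "2020/1/29",
              "2020/2/1", "2020/2/2", "2020/2/9"],
     "Revenue": [10, 7, 14, 18, 12, 17, 28]}
)
df["Date"] = pd.to_datetime(df["Date"])

# 6-day bins anchored at 2020-01-16 (pandas >= 1.1.0 for the origin argument).
binned = df.set_index("Date").resample("6D", origin="2020-01-16").sum()

# Each bin is labelled by its first day; the last day is 5 days later.
start = binned.index
end = start + pd.Timedelta(days=5)
binned["Period"] = start.strftime("%Y/%m/%d") + " ~ " + end.strftime("%Y/%m/%d")
print(binned.reset_index(drop=True)[["Period", "Revenue"]])
```

Note that the empty 2020/02/03 bin shows up as 0 rather than as missing. If it should be treated as unknown (to be predicted), .sum(min_count=1) returns NaN for bins with no rows.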
I'm trying to calculate statistical measures based on a range of hours and/or days.
Meaning, I have a CSV file that is something like this:
TRANSACTION_URL START_TIME END_TIME SIZE FLAG
www.google.com 20170113093210 20170113093210 150 1
www.cnet.com 20170113114510 20170113093210 150 2
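START_TIME and END_TIME are in yyyyMMddHHmmss format (24-hour clock).
I'm first converting it to yyyy-MM-dd HH:mm:ss format by using the following code:
# 'HH' is the 24-hour field; 'hh' would be the 12-hour clock and cannot parse hours such as 14
from_pattern = 'yyyyMMddHHmmss'
to_pattern = 'yyyy-MM-dd HH:mm:ss'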
log_df = log_df.withColumn('START_TIME', from_unixtime(unix_timestamp(
log_df['START_TIME'].cast(StringType()), from_pattern), to_pattern).cast(TimestampType()))
And afterward, I would like to use groupBy() in order to calculate, for example, the mean of the SIZE column, based on the transaction TIME frame.
For example, I would like to do something like:
for all transactions that are between 09:00 to 11:00
calculate SIZE mean
for all transactions that are between 14:00 to 16:00
calculate SIZE mean
And also:
for all transactions that are in a WEEKEND date
calculate SIZE mean
for all transactions that are NOT in a WEEKEND date
calculate SIZE mean
I DO know how to use groupBy for a 'default' configuration, such as calculating statistical measures for SIZE column, based on FLAG column values. I'm using something like:
log_df.cache().groupBy('FLAG').agg(mean('SIZE').alias("Mean"), stddev('SIZE').alias("Stddev")).\
withColumn("Variance", pow(col("Stddev"), 2)).show(3, False)
So, my questions are:
How to achieve such grouping and calculating, for a range of hours? (1st pseudo code example)
How to achieve such grouping and calculating, by dates? (2nd pseudo code example)
Is there any python package that can receive yy-MM-dd and return true if it's a weekend date?
Thanks
Let's assume you have a function encode_dates which receives the date and returns a sequence of encodings for all the time periods you are interested in. For example, for Tuesday 9-12 it would return Seq("9-11","10-12","11-13","weekday"). This would be a regular Scala function (unrelated to Spark).
Now you can make it a UDF, add its result as a column, and explode that column so that each row is copied once per time period. Then all you need to do is add this column to the groupBy.
So it would look something like this:
val encodeUDF = udf(encode_dates _)
log_df.cache()
  .withColumn("timePeriod", explode(encodeUDF($"START_TIME", $"END_TIME")))
  .groupBy("FLAG", "timePeriod")
  .agg(mean("SIZE").alias("Mean"), stddev("SIZE").alias("Stddev"))
  .withColumn("Variance", pow(col("Stddev"), 2)).show(3, false)
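On the last question (a Python way to tell whether a date is a weekend), the standard library's datetime is enough. The following is a plain-Python sketch of the kind of logic an encode_dates-style function could use; the function names is_weekend and hour_bucket, the bucket boundaries, and treating Saturday/Sunday as the weekend are all assumptions for illustration, and wiring this into a Spark UDF is left out.

```python
from datetime import datetime

def is_weekend(ts: datetime) -> bool:
    """Return True for Saturday or Sunday (weekday() is 5 or 6)."""
    return ts.weekday() >= 5

def hour_bucket(ts: datetime, buckets=((9, 11), (14, 16))):
    """Return a label such as '09-11' if ts falls in a bucket, else None."""
    for start, end in buckets:
        if start <= ts.hour < end:
            return f"{start:02d}-{end:02d}"
    return None

# First START_TIME value from the sample CSV in the question.
start_time = datetime.strptime("20170113093210", "%Y%m%d%H%M%S")
print(is_weekend(start_time), hour_bucket(start_time))
```

Each bucket here is half-open (start inclusive, end exclusive), so a transaction at exactly 11:00 does not count as "between 09:00 and 11:00"; adjust the comparison if the end should be inclusive.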
I am trying to use pandas to compute daily climatology. My code is:
import random
import pandas as pd
dates = pd.date_range('1950-01-01', '1953-12-31', freq='D')
rand_data = [int(1000*random.random()) for i in range(len(dates))]
cum_data = pd.Series(rand_data, index=dates)
cum_data.to_csv('test.csv', sep="\t")
cum_data is the Series containing daily values from 1st Jan 1950 to 31st Dec 1953. I want to create a new vector of length 365 with the first element containing the average of rand_data for January 1st for 1950, 1951, 1952 and 1953. And so on for the second element...
Any suggestions how I can do this using pandas?
You can groupby the day of the year, and then calculate the mean for these groups:
cum_data.groupby(cum_data.index.dayofyear).mean()
However, you have to be aware of leap years: from March 1st onward, the day of year of a given calendar date is one higher in a leap year, which causes problems with this approach. As an alternative, you can also group by the month and the day:
In [13]: cum_data.groupby([cum_data.index.month, cum_data.index.day]).mean()
Out[13]:
1 1 462.25
2 631.00
3 615.50
4 496.00
...
12 28 378.25
29 427.75
30 528.50
31 678.50
Length: 366, dtype: float64
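To see the leap-year problem with dayofyear concretely, here is a small check over the same 1950 to 1953 date range; the constant series of ones is just a placeholder so that group sizes can be counted.

```python
import pandas as pd

# The same calendar date maps to different day-of-year values around a leap year.
print(pd.Timestamp("1951-03-01").dayofyear)  # 60
print(pd.Timestamp("1952-03-01").dayofyear)  # 61 (1952 is a leap year)

dates = pd.date_range("1950-01-01", "1953-12-31", freq="D")
by_doy = pd.Series(1, index=dates).groupby(dates.dayofyear).count()

# Day 366 only exists in 1952, so that group has a single member.
print(by_doy.loc[[365, 366]])
```

So with dayofyear, group 60 mixes March 1st of the non-leap years with Feb 29 of 1952, whereas the (month, day) grouping keeps calendar dates together.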
Hoping it can be of any help, I want to post my solution to get a climatology series with the same index and length of the original time series.
I use joris' solution to get a "model climatology" of 365/366 elements, then I build my desired series taking values from this model climatology and time index from my original time series.
This way, things like leap years are automatically taken care of.
#I start with my time series named 'serData'.
#I apply joris' solution to it, getting a 'model climatology' of length 365 or 366.
serClimModel = serData.groupby([serData.index.month, serData.index.day]).mean()
#Now I build the climatology series, taking values from serClimModel depending on the index of serData.
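#The (month, day) pairs of serData are passed as a MultiIndex to look up the matching climatology values.
serClimatology = serClimModel.reindex(pd.MultiIndex.from_arrays([serData.index.month, serData.index.day]))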
#Now serClimatology has a time index like this: [1,1] ... [12,31].
#So, as a final step, I take as time index the one of serData.
serClimatology.index = serData.index
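A possibly shorter route to the same result is groupby(...).transform("mean"), which returns the group mean already aligned to the original time index. The sketch below uses a made-up serData (an assumption) and also shows the anomaly series one would typically compute next.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for serData.
dates = pd.date_range("1950-01-01", "1953-12-31", freq="D")
serData = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# transform("mean") returns the group mean aligned to the original index,
# so no re-indexing step is needed afterwards.
serClimatology = serData.groupby(
    [serData.index.month, serData.index.day]
).transform("mean")

# Anomalies relative to the daily climatology.
serAnomaly = serData - serClimatology
```

Feb 29 only occurs once in this sample, so its climatology value is just the observation itself and its anomaly is zero.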
@joris Thanks. Your answer was just what I needed to use pandas to calculate daily climatologies, but you stopped short of the final step: re-mapping the (month, day) index back to an index of day of the year for all years, including leap years, i.e. 1 through 366. So I thought I'd share my solution for other users. 1950 through 1953 is 4 years with one leap year, 1952. Note that since random values are used, each run will give different results.
...
from datetime import date
doy = []
doy_mean = []
doy_size = []
for name, group in cum_data.groupby([cum_data.index.month, cum_data.index.day]):
    (mo, dy) = name
    # Note: can use any leap year here.
    yrday = (date(1952, mo, dy)).timetuple().tm_yday
    doy.append(yrday)
    doy_mean.append(group.mean())
    doy_size.append(group.count())
    # Note: useful climatology stats are also available via group.describe(), returned as a Series
    #desc = group.describe()
    # desc["mean"], desc["min"], desc["max"], std, quartiles, etc.
# we lose the counts here.
new_cum_data = pd.Series(doy_mean, index=doy)
print(new_cum_data.loc[366])
>> 634.5
pd_dict = {}
pd_dict["mean"] = doy_mean
pd_dict["size"] = doy_size
cum_data_df = pd.DataFrame(data=pd_dict, index=doy)
print(cum_data_df.loc[366])
>> mean 634.5
>> size 4.0
>> Name: 366, dtype: float64
# and just to check Feb 29
print(cum_data_df.loc[60])
>> mean 343
>> size 1
>> Name: 60, dtype: float64
Grouping by month and day is a good solution. However, the straightforward approach of groupby(dayofyear) is still possible if you use xarray.CFTimeIndex instead of pandas.DatetimeIndex, i.e.:
Delete Feb 29 by using
cum_data=cum_data[~((cum_data.index.month==2) & (cum_data.index.day==29))]
Replace the index of the above data with an xarray.CFTimeIndex, i.e.,
index = xarray.cftime_range('1950-01-01', '1953-12-31', freq='D', calendar = 'noleap')
index = index[~((index.month==2)&(index.day==29))]
cum_data.index=index
Now, for both non-leap and leap years, the 60th dayofyear is March 1st, and the total number of dayofyear values is 365. groupby(dayofyear) then correctly computes the climatological daily mean.
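If adding xarray and cftime is not an option, a similar 365-day "no-leap" day of year can be approximated in plain pandas. This is a swapped-in alternative, not the CFTimeIndex approach itself: it drops Feb 29 and shifts the later days of leap years back by one. The variable names and the synthetic data are assumptions.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1950-01-01", "1953-12-31", freq="D")
cum_data = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# Drop Feb 29 so every year has exactly 365 days.
is_feb29 = (cum_data.index.month == 2) & (cum_data.index.day == 29)
no_leap = cum_data[~is_feb29]

# In leap years, days after February are shifted back by one so that
# March 1st is always day 60.
idx = no_leap.index
shift = (idx.is_leap_year & (idx.month > 2)).astype(int)
noleap_doy = idx.dayofyear.to_numpy() - shift

climatology = no_leap.groupby(noleap_doy).mean()
print(len(climatology))
```

Every one of the 365 groups then contains exactly one value per year, at the cost of discarding the Feb 29 observations.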