I have a data frame composed as follows, containing daily precipitation data from 1971 to 2017:
df.dtypes
Out[28]:
date datetime64[ns]
mm float64
dtype: object
df.head()
Out[29]:
date mm
0 1971-01-01 07:00:00 2.2
1 1971-01-02 07:00:00 2.0
2 1971-01-03 07:00:00 3.0
3 1971-01-04 07:00:00 0.0
4 1971-01-05 07:00:00 0.0
I want to ask the user which time frame they're interested in filtering the available data to, therefore:
from datetime import datetime

start_date_entry = input("Input the start date of your analysis (YYYY-MM-DD): ")
year, month, day = map(int, start_date_entry.split('-'))
start_date_form = datetime(year, month, day, 7, 0, 0)
end_date_entry = input("Input the end date of your analysis (YYYY-MM-DD): ")
year, month, day = map(int, end_date_entry.split('-'))
end_date_form = datetime(year, month, day, 7, 0, 0)
Both start_date_form and end_date_form have the following format:
type(start_date_form)
Out[31]: datetime.datetime
print(start_date_form)
2010-01-01 07:00:00
Now, I want to filter my data frame with the new dates:
df['date']=pd.date_range(start=(start_date_form), end=(end_date_form),freq='D')
I get the following error:
Length of values (1827) does not match length of index (17167)
I am new to Python and programming, so I would appreciate some help on this. Thanks.
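The error comes from assigning a 1827-element date_range to a column of a 17167-row frame. Filtering with a boolean mask keeps the row count consistent; a minimal sketch, using made-up stand-in data for the 1971-2017 frame:

```python
import pandas as pd
from datetime import datetime

# made-up stand-in for the 1971-2017 precipitation frame
df = pd.DataFrame({'date': pd.date_range('1971-01-01 07:00:00', periods=100, freq='D'),
                   'mm': 0.0})

start_date_form = datetime(1971, 1, 10, 7, 0, 0)
end_date_form = datetime(1971, 1, 20, 7, 0, 0)

# keep only rows whose timestamp falls inside the requested window
mask = (df['date'] >= start_date_form) & (df['date'] <= end_date_form)
filtered = df.loc[mask]
```

The mask compares every row against the two bounds and `df.loc[mask]` selects the matching rows, so no length mismatch can occur.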
I was trying to convert a date column into week and year columns, but I ran into a problem with the numbering of the weeks:
The issue is that 2018-01-01 shows as Week 1, 2018-12-24 as Week 52, and 2018-12-31 as Week 1 as well, so I end up with two entries for Week 1. I want to take 2018-01-08 as my Week 1 and ignore 2018-01-01 altogether, which makes 2018-12-24 Week 51 and 2018-12-31 Week 52. How may I do so?
sym_2018 = pd.read_csv('/content/2018_symptoms_dataset.csv')
sym_2019 = pd.read_csv('/content/2019_weekly_symptoms_dataset.csv')
df3 = pd.concat([sym_2018, sym_2019]) # Combine both sets to make the 2018-2019 set.
# Convert the values of the date column to datetime
df3['Date'] = pd.to_datetime(df3['date'])
# Getting week value
df3['Week'] = df3['Date'].dt.isocalendar().week # Convert date to week and add a column Week.
df3['Year'] = df3['Date'].dt.isocalendar().year # Convert date to year and add a column Year.
I think the output is correct:
df3['Date'] = pd.to_datetime(df3['Date'])
df3['Week'] = df3['Date'].dt.isocalendar().week
df3['Year'] = df3['Date'].dt.isocalendar().year
print (df3)
Date Week Year
0 2018-01-01 1 2018 <- first week in 2018 start 2018-01-01
1 2018-01-07 1 2018 <- first week in 2018 end 2018-01-07
2 2018-01-08 2 2018 <- second week in 2018
3 2018-12-31 1 2019 <- first week in 2019 start 2018-12-31
4 2019-01-06 1 2019 <- first week in 2019 end 2019-01-06
5 2019-01-07 2 2019 <- second week in 2019
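If you really do want 2018-01-08 counted as Week 1 and 2018-01-01 dropped, one sketch is to number weeks from a chosen anchor date instead of the ISO calendar (the anchor Monday 2018-01-08 here comes from the question; adjust as needed):

```python
import pandas as pd

df3 = pd.DataFrame({'Date': pd.to_datetime(
    ['2018-01-01', '2018-01-08', '2018-12-24', '2018-12-31'])})

# Count weeks from the chosen anchor Monday rather than using ISO weeks
anchor = pd.Timestamp('2018-01-08')
df3['Week'] = (df3['Date'] - anchor).dt.days // 7 + 1

# Drop anything before the anchor (2018-01-01 becomes Week 0 here)
df3 = df3[df3['Week'] >= 1]
```

With this numbering 2018-01-08 is Week 1, 2018-12-24 is Week 51 and 2018-12-31 is Week 52, matching the desired output.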
I'm using the function resample to change the daily data to be a monthly data of a pandas dataframe. Reading the documentation I found that I could define the rule='M' or rule='MS'. The first is "calendar month end" and the second is "calendar month begin". What is the difference between the two?
The difference is the date that is used as the index of the resampled groups.
Here is an example:
date = pd.Series([0, 1, 2],
                 index=pd.to_datetime(['2022-01-15',
                                       '2022-01-20',
                                       '2022-02-15']))
2022-01-15 0
2022-01-20 1
2022-02-15 2
dtype: int64
# resampling MS:
date.resample('MS').mean()
2022-01-01 0.5
2022-02-01 2.0
Freq: MS, dtype: float64
# resampling M:
date.resample('M').mean()
2022-01-31 0.5
2022-02-28 2.0
Freq: M, dtype: float64
Note the difference in the dates of the index: for 'MS' the group labels are always the first day of the month, for 'M' the last day. The aggregated values themselves are identical.
I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Here 'date' is Week/Year.
So I tried the following code:
df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But it returns this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But the first day of the first week of 2020 is 2019-12-29, considering Sunday as the first day of the week. How can I get 2019-12-29 as the first day of the first week of 2020, instead of 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: my original answer doesn't work for the input 1/2023, and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer to add a custom function.
Here is a way with a custom function:
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
    # Sundays that start week 0 and week 1 of the given year
    date0 = datetime.strptime('0/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date1 = datetime.strptime('1/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date = datetime.strptime(date + '/0', "%U/%Y/%w")
    # If weeks 0 and 1 resolve to the same Sunday, the year starts on a Sunday
    # and %U already numbers the weeks as desired; otherwise shift back one week.
    return date if date0 == date1 else date - timedelta(weeks=1)

df['new_date'] = df['date'].apply(get_first_day)
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
You'll want to use the ISO week parsing directives, e.g.:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
You can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]
   UsageDate            CustID1  CustID2  ...  CustIDn
0  2018-01-01 00:00:00  1.095
1  2018-01-01 01:00:00  1.129
2  2018-01-01 02:00:00  1.165
3  2018-01-01 04:00:00  1.697
.
.
m  2018-01-31 23:00:00  1.835                          (m, n)
The dataframe (df) has m rows and n columns. m is an hourly time-series index which starts at the first hour of the month and ends at the last hour of the month.
The columns are the customers which are almost 100,000.
The values at each cell of Dataframe are energy consumption values.
For every customer, I need to calculate:
1) Mean of every hour's usage - basically the average of the 1st hour of every day in the month, the 2nd hour of every day in the month, etc.
2) Total usage for every customer
3) Top 3 usage hours - for a customer x, it can be "2018-01-01 01:00:00", "2018-01-11 05:00:00" and "2018-01-21 17:00:00"
4) Bottom 3 usage hours - similar explanation as above
5) Mean usage for every customer in the month
My main point of trouble is how to aggregate data both for every customer and the hour of day, or day together.
For summation of usage for every customer, I tried:
df_temp = pd.DataFrame(columns=["TotalUsage"])
for col in df.columns:
    df_temp[col, "TotalUsage"] = df[col].apply.sum()
However, this and many version of this which I tried are not helping me solve the problem.
Please help me with an approach and how to think about such problems.
Also, since the dataframe is large, it would be helpful if we can talk about Computational Complexity and how can we decrease computation time.
This looks like a job for pandas.groupby.
(I didn't test the code because I didn't have a good sample dataset from which to work. If there are errors, let me know.)
For some of your requirements, you'll need to add a column with the hour:
df['hour']=df['UsageDate'].dt.hour
1) Mean by hour.
mean_by_hour=df.groupby('hour').mean()
2) Summation by user.
sum_by_users = df.sum(numeric_only=True)
3) Top 3 usage hours by customer. I don't quite understand your desired output here; you might be asking too many different questions in one question. If you want the hour and not the value, I think you may have to iterate through the columns. Adding an example would help.
4) Same comment as 3).
5) Mean by customer.
mean_by_cust = df.mean(numeric_only=True)
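For 3) and 4), one sketch (with made-up data and column names): `nlargest`/`nsmallest` on each customer column keep the timestamp index, so the selected index values are the usage hours directly, without iterating row by row:

```python
import pandas as pd
import numpy as np

# made-up hourly usage for two customers over three days
rng = pd.date_range('2018-01-01', periods=72, freq='h')
df = pd.DataFrame(np.random.rand(72, 2), index=rng, columns=['CustID1', 'CustID2'])

# top/bottom 3 usage hours per customer: the index of the selected rows is the hours
top3 = {col: df[col].nlargest(3).index.tolist() for col in df.columns}
bottom3 = {col: df[col].nsmallest(3).index.tolist() for col in df.columns}
```

`nlargest(3)` is O(m) per column rather than a full O(m log m) sort, which matters with ~100,000 customer columns.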
I am not sure if this is all the information you are looking for but it will point you in the right direction:
import pandas as pd
import numpy as np
# sample data for 3 days
np.random.seed(1)
data = pd.DataFrame(pd.date_range('2018-01-01', periods= 72, freq='H'), columns=['UsageDate'])
data2 = pd.DataFrame(np.random.rand(72,5), columns=[f'ID_{i}' for i in range(5)])
df = data.join([data2])
# print('Sample Data:')
# print(df.head())
# print()
# mean of every month and hour per year
# groupby year month hour then find the mean of every hour in a given year and month
mean_data = df.groupby([df['UsageDate'].dt.year, df['UsageDate'].dt.month, df['UsageDate'].dt.hour]).mean()
mean_data.index.names = ['UsageDate_year', 'UsageDate_month', 'UsageDate_hour']
# print('Mean Data:')
# print(mean_data.head())
# print()
# use set_index with max and head
top_3_Usage_hours = df.set_index('UsageDate').max(1).sort_values(ascending=False).head(3)
# print('Top 3:')
# print(top_3_Usage_hours)
# print()
# use set_index with min and tail
bottom_3_Usage_hours = df.set_index('UsageDate').min(1).sort_values(ascending=False).tail(3)
# print('Bottom 3:')
# print(bottom_3_Usage_hours)
Out:
Sample Data:
UsageDate ID_0 ID_1 ID_2 ID_3 ID_4
0 2018-01-01 00:00:00 0.417022 0.720324 0.000114 0.302333 0.146756
1 2018-01-01 01:00:00 0.092339 0.186260 0.345561 0.396767 0.538817
2 2018-01-01 02:00:00 0.419195 0.685220 0.204452 0.878117 0.027388
3 2018-01-01 03:00:00 0.670468 0.417305 0.558690 0.140387 0.198101
4 2018-01-01 04:00:00 0.800745 0.968262 0.313424 0.692323 0.876389
Mean Data:
ID_0 ID_1 ID_2 \
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.250716 0.546475 0.202093
1 0.414400 0.264330 0.535928
2 0.335119 0.877191 0.380688
3 0.577429 0.599707 0.524876
4 0.702336 0.654344 0.376141
ID_3 ID_4
UsageDate_year UsageDate_month UsageDate_hour
2018 1 0 0.244185 0.598238
1 0.400003 0.578867
2 0.623516 0.477579
3 0.429835 0.510685
4 0.503908 0.595140
Top 3:
UsageDate
2018-01-01 21:00:00 0.997323
2018-01-03 23:00:00 0.990472
2018-01-01 08:00:00 0.988861
dtype: float64
Bottom 3:
UsageDate
2018-01-01 19:00:00 0.002870
2018-01-03 02:00:00 0.000402
2018-01-01 00:00:00 0.000114
dtype: float64
For the top and bottom 3, if you instead want the hours with the smallest sum across customers:
df.set_index('UsageDate').sum(1).sort_values(ascending=False).tail(3)
So, I have StartDateTime and EndDateTime columns in my dataframe, and I want to produce a new dataframe with a row for each date in the datetime range, along with the number of hours of that date that fall inside the range.
In [11]: sessions = pd.DataFrame({'Start':['2018-01-01 13:00:00','2018-03-01 16:30:00'],
'End':['2018-01-03 07:00:00','2018-03-02 06:00:00'],'User':['Dan','Fred']})
In [12]: sessions
Out[12]:
Start End User
0 2018-01-01 13:00:00 2018-01-03 07:00:00 Dan
1 2018-03-01 16:30:00 2018-03-02 06:00:00 Fred
Desired dataframe:
Date        Hours  User
2018-01-01  11     Dan
2018-01-02  24     Dan
2018-01-03  7      Dan
2018-03-01  7.5    Fred
2018-03-02  6      Fred
I've seen a lot of examples that just produced a dataframe for each date in the date range (e.g. Expanding pandas data frame with date range in columns)
but nothing with the additional field of hours per date included in the range.
I don't know if it's the cleanest solution, but it seems to work.
In [13]: sessions = pd.DataFrame({'Start':['2018-01-01 13:00:00','2018-03-01 16:30:00'],
'End':['2018-01-03 07:00:00','2018-03-02 06:00:00'],'User':['Dan','Fred']})
Convert Start and End to datetime:
In [14]: sessions['Start']=pd.to_datetime(sessions['Start'])
sessions['End']=pd.to_datetime(sessions['End'])
Create a row for each date in the range:
In [15]: dailyUsage = pd.concat([pd.DataFrame({'Date': pd.date_range(row.Start.date(), row.End.date(), freq='D'),
                                               'Start': row.Start,
                                               'User': row.User,
                                               'End': row.End},
                                              columns=['Date', 'Start', 'User', 'End'])
                                 for i, row in sessions.iterrows()], ignore_index=True)
A function to calculate the hours on a date, based on the start datetime, end datetime, and the specific date:
In [16]: def calcDuration(x):
             date = x['Date']
             startDate = x['Start']
             endDate = x['End']
             # starts and stops on the same day
             if endDate.date() == startDate.date():
                 return (endDate - startDate).seconds / 3600
             # this is the start date
             if date.to_pydatetime().date() == startDate.date():
                 return 24 - startDate.hour - startDate.minute / 60
             # this is the end date
             if date.to_pydatetime().date() == endDate.date():
                 return endDate.hour + endDate.minute / 60
             # this is an interior date
             return 24

Calculate the hours for each date:
In [17]: dailyUsage['hours'] = dailyUsage.apply(calcDuration, axis=1)
In [18]: dailyUsage.drop(['Start', 'End'], axis=1).head()
Out[18]:
        Date  User  hours
0 2018-01-01   Dan   11.0
1 2018-01-02   Dan   24.0
2 2018-01-03   Dan    7.0
3 2018-03-01  Fred    7.5
4 2018-03-02  Fred    6.0
Something like this would work as well, if you don't mind integer hours only (this assumes a df with one row per hour of each session):
df['date'] = df['Date'].dt.date
gb = df.groupby(['date', 'User'])['Date'].size()
print(gb)
date User
2018-01-01 Dan 11
2018-01-02 Dan 24
2018-01-03 Dan 8
2018-03-01 Fred 8
2018-03-02 Fred 6
Name: Date, dtype: int64
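A sketch of building the hourly-expanded frame that the groupby above assumes, reusing the sample sessions from the question:

```python
import pandas as pd

sessions = pd.DataFrame({'Start': ['2018-01-01 13:00:00', '2018-03-01 16:30:00'],
                         'End': ['2018-01-03 07:00:00', '2018-03-02 06:00:00'],
                         'User': ['Dan', 'Fred']})
sessions['Start'] = pd.to_datetime(sessions['Start'])
sessions['End'] = pd.to_datetime(sessions['End'])

# one row per hour of each session
df = pd.concat([pd.DataFrame({'Date': pd.date_range(row.Start, row.End, freq='h'),
                              'User': row.User})
                for row in sessions.itertuples()], ignore_index=True)

# count the hourly rows falling on each calendar date
df['date'] = df['Date'].dt.date
gb = df.groupby(['date', 'User'])['Date'].size()
```

Since `size()` counts whole hourly rows, Fred's 16:30 start yields 8 on 2018-03-01 rather than 7.5, which is the "integers only" caveat above.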