I was trying to convert a date column into week columns, but I ran into a problem with how the weeks are indexed:
The problem is that 2018-01-01 shows as Week 1, 2018-12-24 as Week 52, and 2018-12-31 as Week 1 as well! This way I end up with two entries for Week 1, while I want to take 2018-01-08 as my Week 1 and ignore 2018-01-01 altogether.
That would make 2018-12-24 Week 51 and 2018-12-31 Week 52. How can I do that?
sym_2018 = pd.read_csv('/content/2018_symptoms_dataset.csv')
sym_2019 = pd.read_csv('/content/2019_weekly_symptoms_dataset.csv')
df3 = sym_2018.append(sym_2019) # Add both sets to make 2018-2019 set.
# Converting values of the date column to datetime
df3['Date'] = pd.to_datetime(df3['date'])
# Getting week value
df3['Week'] = df3['Date'].dt.isocalendar().week # Convert date to week and add a column Week.
df3['Year'] = df3['Date'].dt.isocalendar().year # Convert date to year and add a column Year.
Image showing dataframe:
I think the output is correct:
df3['Date'] = pd.to_datetime(df3['Date'])
df3['Week'] = df3['Date'].dt.isocalendar().week
df3['Year'] = df3['Date'].dt.isocalendar().year
print (df3)
Date Week Year
0 2018-01-01 1 2018 <- first week in 2018 start 2018-01-01
1 2018-01-07 1 2018 <- first week in 2018 end 2018-01-07
2 2018-01-08 2 2018 <- second week in 2018
3 2018-12-31 1 2019 <- first week in 2019 start 2018-12-31
4 2019-01-06 1 2019 <- first week in 2019 end 2019-01-06
5 2019-01-07 2 2019 <- second week in 2019
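Not in the original answer, but one way to get the numbering the question asks for (2018-01-08 as Week 1, 2018-12-24 as Week 51, 2018-12-31 as Week 52) is to drop dates before the chosen start and count 7-day blocks from it. A minimal sketch with dummy data:

```python
import pandas as pd

df3 = pd.DataFrame({'Date': pd.to_datetime(
    ['2018-01-01', '2018-01-08', '2018-12-24', '2018-12-31'])})

start = pd.Timestamp('2018-01-08')
df3 = df3[df3['Date'] >= start].copy()            # ignore 2018-01-01 altogether
df3['Week'] = (df3['Date'] - start).dt.days // 7 + 1  # 2018-01-08 -> Week 1
```

This sidesteps isocalendar entirely, so the numbering no longer wraps at the ISO year boundary.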
I am working on a time series data frame. The df is as follows:
0 2019-01-01 Contact Tuesday False January 04:00:00.118000 1
1 2019-01-01 Contact Tuesday False January 04:00:00.483000 1
2 2019-01-01 Contact Tuesday False January 08:00:00.162000 1
3 2019-01-01 Contact Tuesday False January 08:00:00.426000 1
4 2019-01-01 Contact Tuesday False January 08:00:00.564000 1
To get this df I have done other transformations above; hence, this is not a direct load.
I am trying to convert the second-to-last column, e.g. 04:00:00.118000, to 04:00:00.
What is the quickest way to achieve this?
If your entries in the second to last column are of type datetime.time, you could use the following:
df[name] = df[name].apply(lambda t: t.replace(microsecond=0))
where name is the name of your second to last column. If they are of type str, then you could use this instead:
df[name] = df[name].apply(lambda t: t.split('.')[0])
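A small self-contained demo of the datetime.time variant (the column name t is made up for the example):

```python
import datetime
import pandas as pd

df = pd.DataFrame({'t': [datetime.time(4, 0, 0, 118000),
                         datetime.time(8, 0, 0, 483000)]})
# strip microseconds from each time value
df['t'] = df['t'].apply(lambda t: t.replace(microsecond=0))
```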
Try this; if you have object-type data, it should work.
Sample data mimicking yours:
>>> df
date col1
0 January 04:00:00.118000 1
1 January 04:00:00.483000 1
2 January 08:00:00.162000 1
3 January 08:00:00.426000 1
>>> df.dtypes
date object
col1 int64
dtype: object
Solution
>>> df['date'] = df['date'].str.split(".").str[0]
>>> df
date col1
0 January 04:00:00 1
1 January 04:00:00 1
2 January 08:00:00 1
3 January 08:00:00 1
I have a df
date
2021-03-12
2021-03-17
...
2022-05-21
2022-08-17
I am trying to add a column year_week, but my year week starts at 2021-06-28, the Monday of the week containing the first day of July.
I tried:
from datetime import datetime, timedelta

df['date'] = pd.to_datetime(df['date'])
df['year_week'] = (df['date'] - timedelta(days=datetime(2021, 6, 24)
                                          .timetuple().tm_yday)).dt.isocalendar().week
I played around with the timedelta days values so that the 2021-06-28 has a value of 1.
But then I got problems with earlier dates and with dates exceeding my start date + 1 year:
2021-03-12 has a value of 38
2022-08-17 has a value of 8
So it looks like the valid period is from 2021-06-28 + 1 year.
date year_week
2021-03-12 38 # LY38
2021-03-17 39 # LY39
2021-06-28 1 # correct
...
2022-05-21 47 # correct
2022-08-17 8 # NY8
Is there a way to get around this? As I am aggregating the data by year week, I get incorrect results due to the past and upcoming dates. I would want either negative values for the days before 2021-06-28, or LY38 denoting that it is week 38 of the last year, and accordingly year weeks of 52+ or NY8 denoting that this is the 8th week of the next year.
Here is a way (I added two dates more than a year away). You take the isocalendar of the difference between the date column and the day-of-year of your specific start date. Then you can select the different scenarios depending on the year offset from that start date, using np.select for the different result formats.
#dummy dataframe
df = pd.DataFrame(
{'date': ['2020-03-12', '2021-03-12', '2021-03-17', '2021-06-28',
'2022-05-21', '2022-08-17', '2023-08-17']
}
)
# define start date
d = pd.to_datetime('2021-6-24')
# subtract the start date's day-of-year from each date
s = (pd.to_datetime(df['date']) - pd.Timedelta(days=d.day_of_year)
).dt.isocalendar()
# get the difference in year
m = (s['year'].astype('int32') - d.year)
# all condition of result depending on year difference
conds = [m.eq(0), m.eq(-1), m.eq(1), m.lt(-1), m.gt(1)]
choices = ['', 'LY','NY',(m+1).astype(str)+'LY', '+'+(m-1).astype(str)+'NY']
# create the column
df['res'] = np.select(conds, choices) + s['week'].astype(str)
print(df)
date res
0 2020-03-12 -1LY38
1 2021-03-12 LY38
2 2021-03-17 LY39
3 2021-06-28 1
4 2022-05-21 47
5 2022-08-17 NY8
6 2023-08-17 +1NY8
I think pandas period_range can be of some help:
pd.Series(pd.period_range("6/28/2017", freq="W", periods=n_weeks))  # n_weeks: the number of weeks you want
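To connect this to the question: a PeriodIndex built from the custom start date can be used to look up a date's position, which serves as a 0-based custom week number. A sketch (the dates and the 60-period length are arbitrary choices for illustration):

```python
import pandas as pd

# weekly periods starting from the week containing 2021-06-28
idx = pd.period_range('2021-06-28', freq='W', periods=60)

# position of a date's week within this custom range (0-based)
pos = idx.get_loc(pd.Period('2021-07-05', freq='W'))
week_no = pos + 1  # 1-based custom week number
```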
I have a csv-file: https://data.rivm.nl/covid-19/COVID-19_aantallen_gemeente_per_dag.csv
I want to use it to provide insight into the corona deaths per week.
df = pd.read_csv("covid.csv", error_bad_lines=False, sep=";")
df = df.loc[df['Deceased'] > 0]
df["Date_of_publication"] = pd.to_datetime(df["Date_of_publication"])
df["Week"] = df["Date_of_publication"].dt.isocalendar().week
df["Year"] = df["Date_of_publication"].dt.year
df = df[["Week", "Year", "Municipality_name", "Deceased"]]
df = df.groupby(by=["Week", "Year", "Municipality_name"]).agg({"Deceased" : "sum"})
df = df.sort_values(by=["Year", "Week"])
print(df)
Everything seems to be working fine except for the first 3 days of 2021. The first 3 days of 2021 are part of the last week (53) of 2020: http://week-number.net/calendar-with-week-numbers-2021.html.
When I print the dataframe this is the result:
53 2021 Winterswijk 1
Woudenberg 1
Zaanstad 1
Zeist 2
Zutphen 1
So basically what I'm looking for is a way where this line returns the year of the week number and not the year of the date:
df["Year"] = df["Date_of_publication"].dt.year
You can use dt.isocalendar().year to set up df["Year"]:
df["Year"] = df["Date_of_publication"].dt.isocalendar().year
You will get year 2020 for the date 2021-01-01, but back to year 2021 for 2021-01-04.
This is just like how you used dt.isocalendar().week to set up df["Week"]. Since both come from the same (year, week, day) tuple returned by dt.isocalendar(), they will always be in sync.
Demo
date_s = pd.Series(pd.date_range(start='2021-01-01', periods=5, freq='1D'))
date_s
0 2021-01-01
1 2021-01-02
2 2021-01-03
3 2021-01-04
4 2021-01-05
date_s.dt.isocalendar()
year week day
0 2020 53 5
1 2020 53 6
2 2020 53 7
3 2021 1 1
4 2021 1 2
You can simply subtract the two dates and then divide the days attribute of the timedelta object by 7.
For example, this gives the time elapsed in the current year so far:
time_delta = (dt.datetime.today() - dt.datetime(2021, 1, 1))
The output is a datetime timedelta object
datetime.timedelta(days=75, seconds=84904, microseconds=144959)
For your problem, you'd do something like this, subtracting a reference start date and integer-dividing the day count by 7:
weeks = (df["Date_of_publication"] - start_date).dt.days // 7  # start_date: a pd.Timestamp reference
The output is the number of whole weeks between start_date and each publication date.
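As a runnable sketch of this subtract-and-divide approach, with made-up dates and a hypothetical reference date:

```python
import pandas as pd

df = pd.DataFrame({'Date_of_publication': pd.to_datetime(
    ['2021-01-01', '2021-01-10', '2021-03-17'])})

start = pd.Timestamp('2021-01-01')  # hypothetical reference date
# whole weeks elapsed between the reference date and each publication date
df['weeks_since_start'] = (df['Date_of_publication'] - start).dt.days // 7
```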
So, I have StartDateTime and EndDateTime columns in my dataframe, and I want to produce a new dataframe with a row for each date in the datetime range, but I also want the number of hours of that date that are included in the date range.
In [11]: sessions = pd.DataFrame({'Start':['2018-01-01 13:00:00','2018-03-01 16:30:00'],
'End':['2018-01-03 07:00:00','2018-03-02 06:00:00'],'User':['Dan','Fred']})
In [12]: sessions
Out[12]:
Start End User
0 2018-01-01 13:00:00 2018-01-03 07:00:00 Dan
1 2018-03-01 16:30:00 2018-03-02 06:00:00 Fred
Desired dataframe:
Date Hours User
2018-01-01 11 Dan
2018-01-02 24 Dan
2018-01-03 7 Dan
2018-03-01 7.5 Fred
2018-03-02 6 Fred
I've seen a lot of examples that just produced a dataframe for each date in the date range (e.g. Expanding pandas data frame with date range in columns)
but nothing with the additional field of hours per date included in the range.
I don't know if it's the cleanest solution, but it seems to work.
In [13]: sessions = pd.DataFrame({'Start':['2018-01-01 13:00:00','2018-03-01 16:30:00'],
'End':['2018-01-03 07:00:00','2018-03-02 06:00:00'],'User':['Dan','Fred']})
convert Start and End to Datetime
In [14]: sessions['Start']=pd.to_datetime(sessions['Start'])
sessions['End']=pd.to_datetime(sessions['End'])
create a row for each date in range
In [15]: dailyUsage = pd.concat([pd.DataFrame({'Date':
pd.date_range(pd.to_datetime(row.Start).date(), row.End.date(), freq='D'),'Start':row.Start,
'User': row.User,
'End': row.End}, columns=['Date', 'Start','User', 'End'])
for i, row in sessions.iterrows()], ignore_index=True)
function to calculate the hours on a date, based on the start datetime, end datetime, and the specific date
In [16]: def calcDuration(x):
             date = x['Date']
             startDate = x['Start']
             endDate = x['End']
             # starts and stops on the same day
             if endDate.date() == startDate.date():
                 return (endDate - startDate).seconds / 3600
             # this is the start date
             if (date.to_pydatetime().date() - startDate.date()).days == 0:
                 return 24 - startDate.hour
             # this is the end date
             if (date.to_pydatetime().date() - endDate.date()).days == 0:
                 return endDate.hour
             # this is an interior date
             else:
                 return 24
calculate hours for each date
In [17]: dailyUsage['hours'] = dailyUsage.apply(calcDuration, axis=1)
In [18]: dailyUsage.drop(['Start','End'], axis=1).head()
Out [18]:
Date User hours
0 2018-01-01 Dan 11
1 2018-01-02 Dan 24
2 2018-01-03 Dan 7
3 2018-03-01 Fred 8
4 2018-03-02 Fred 6
Something like this would work as well, if you don't mind integer hours only:
df['date'] = df['Date'].dt.date
gb = df.groupby(['date', 'User'])['Date'].size()
print(gb)
date User
2018-01-01 Dan 11
2018-01-02 Dan 24
2018-01-03 Dan 8
2018-03-01 Fred 8
2018-03-02 Fred 6
Name: Date, dtype: int64
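If fractional hours matter (e.g. Fred's 7.5 hours on 2018-03-01 from the 16:30 start), a further sketch, not from either answer above, is to clip each calendar day to the session interval and measure the overlap:

```python
import pandas as pd

sessions = pd.DataFrame({
    'Start': pd.to_datetime(['2018-01-01 13:00:00', '2018-03-01 16:30:00']),
    'End': pd.to_datetime(['2018-01-03 07:00:00', '2018-03-02 06:00:00']),
    'User': ['Dan', 'Fred']})

rows = []
for _, r in sessions.iterrows():
    for day in pd.date_range(r.Start.normalize(), r.End.normalize(), freq='D'):
        # overlap of the calendar day [day, day + 1 day) with [Start, End]
        lo = max(day, r.Start)
        hi = min(day + pd.Timedelta(days=1), r.End)
        rows.append({'Date': day.date(), 'User': r.User,
                     'Hours': (hi - lo).total_seconds() / 3600})

daily = pd.DataFrame(rows)
```

This keeps minutes intact instead of rounding to whole hours via .hour.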
I want to convert dates into quarters. I've used:
x['quarter'] = x['date'].dt.quarter
date quarter
0 2013-1-1 1
But, it also repeats the same for the next year.
date quarter
366 2014-1-1 1
Instead of the 1, I want the (expected result) quarter to be 5.
date quarter
366 2014-1-1 5
...
date quarter
731 2015-1-1 9
You can use a simple mathematical operation:
starting_year = 2013
df['quarter'] = df.year.dt.quarter + (df.year.dt.year - starting_year)*4
year quarter
0 2013-01-01 1
0 2014-01-01 5
0 2015-01-01 9
0 2016-01-01 13
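A self-contained check of the same formula (using a column named date instead of year, which is assumed here for clarity):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2013-01-01', '2014-01-01', '2015-07-01'])})

starting_year = 2013
# quarter within the year, offset by 4 for each year after the start
df['quarter'] = df['date'].dt.quarter + (df['date'].dt.year - starting_year) * 4
```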