Combining Columns in Isoweek Object - python

I have a pandas dataframe that contains a year and week column:
year week
2018 18
2019 17
2019 17
I'm trying to combine the year and week columns into a new 'isoweek' column using the isoweek library. I can't seem to figure out how to properly loop through the rows to create the object column. If I do something like:
df['isoweek'] = Week(df['year'],df['week'])
isoweek chokes on the vectorization. I've tried creating a basic list and appending it to my dataframe, like so:
obj_list = []
for i in range(500):
    year = df['year'][i]
    week = df['week'][i]
    w = Week(year,week)
    obj_list.append(w)
df['isoweek'] = obj_list
But I end up with a simple tuple in the column.
The goal is to be able to use some of the isoweek library's operations to calculate date differences, like:
df['isoweek'] - 4
>isoweek.Week(2019, 34)
Is it even possible to store an object like this in a dataframe column? If so, how does one go about it?

As an alternative, you can use the built in method for datetime:
df['week_start'] = pd.to_datetime(df['year'].astype(str), format='%Y') + pd.to_timedelta(df['week'].mul(7).astype(str) + ' days')
# Output:
week year week_start
0 18 2018 2018-05-07
1 17 2019 2019-04-30
2 17 2019 2019-04-30
Calculating time differences is pretty straightforward here:
# Choose 7 weeks
n_weeks = pd.to_timedelta(7, unit='W')
# Adding is simple
df['week_start'] + n_weeks
# Output
0 2018-06-25
1 2019-06-18
2 2019-06-18
For more on this, read: Pandas: How to create a datetime object from Week and Year?
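Note that adding week * 7 days to January 1 counts weeks from the start of the year, which is not the same as the ISO week (2018 week 18 starts on 2018-04-30 in ISO terms). If ISO weeks are what you need, here is a minimal sketch using the standard library's date.fromisocalendar (Python 3.8+). The column name iso_monday is just an illustrative choice, not something from the question.

```python
from datetime import date

import pandas as pd

df = pd.DataFrame({"year": [2018, 2019, 2019], "week": [18, 17, 17]})

# date.fromisocalendar(year, week, day): day=1 is the Monday of that ISO week.
# int() turns the NumPy integers stored by pandas into plain Python ints.
df["iso_monday"] = pd.to_datetime(
    [date.fromisocalendar(int(y), int(w), 1) for y, w in zip(df["year"], df["week"])]
)

# Week arithmetic is then ordinary Timedelta arithmetic.
four_weeks_earlier = df["iso_monday"] - pd.to_timedelta(4, unit="W")

print(df)
print(four_weeks_earlier)
```

Because the column is a regular datetime64 column, everything under the .dt accessor keeps working on it.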

Potentially you could do this
First, set up the example dataframe
import pandas as pd
from isoweek import Week
df = pd.DataFrame({'year': [2018, 2019, 2019],
                   'week': [18, 17, 17]})
Loop through the dataframe, adding the isoweek to a list
ls_isoweek = []
for row in df.itertuples():
    ls_isoweek.append(Week(row[1],row[2]))
The list looks like this
[isoweek.Week(2018, 18), isoweek.Week(2019, 17), isoweek.Week(2019, 17)]
This list can be accessed thusly
ls_isoweek[0] - 4
Produces this output
isoweek.Week(2018, 14)
However, the list can also be added back to the dataframe if you wish
df['isoweek'] = ls_isoweek
You can then do things like ...
df['isoweek_minus_4'] = df['isoweek'].apply(lambda x: x-4)
This adds a column of Week objects, each four weeks earlier than the original.

A little late, but if anyone else is still looking to use a solution of this form as I was, you could use lambda functions along with apply. For the dataframe below (with int64 dtypes),
year week
0 2018 18
1 2019 17
2 2019 17
Now we use isoweek to appropriately parse the data,
from isoweek import Week
df.apply(lambda row : Week(row["year"],row["week"]),axis=1)
This produces the output,
0 (2018, 18)
1 (2019, 17)
2 (2019, 17)
dtype: object
You could also identify the (week,year) with a datetime object by combining this approach with this answer https://stackoverflow.com/a/7687085.
df.apply(lambda row : Week(int(row["year"]),int(row["week"])).monday(),axis=1)
The int appears a little redundant there, but pandas by default uses int64 which doesn't appear to function with isoweek correctly. This produces the output,
0 2018-04-30
1 2019-04-22
2 2019-04-22
dtype: object
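Once you have the Mondays as real datetimes, you can probably do the "minus 4 weeks" part without isoweek at all. A rough sketch, assuming pandas 1.1 or newer (for dt.isocalendar()); the Mondays are hard-coded from the output above and the names iso_year / iso_week are my own:

```python
import pandas as pd

# Mondays of the ISO weeks (2018, 18), (2019, 17), (2019, 17)
mondays = pd.Series(pd.to_datetime(["2018-04-30", "2019-04-22", "2019-04-22"]))

# Shift back four weeks, then read the ISO year and week back out.
shifted = mondays - pd.Timedelta(weeks=4)
iso = shifted.dt.isocalendar()  # DataFrame with 'year', 'week' and 'day' columns

result = pd.DataFrame({
    "iso_year": iso["year"].astype(int),
    "iso_week": iso["week"].astype(int),
})
print(result)
```

The first row comes out as week 14 of 2018, the same week that Week(2018, 18) - 4 gives in the other answer.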

Related

Filter rows by column date values [duplicate]

So my code is as follows:
df['Dates'][df['Dates'].index.month == 11]
I was doing a test to see if I could filter the months so it only shows November dates, but this did not work. It gives me the following error: AttributeError: 'Int64Index' object has no attribute 'month'.
If I do
print type(df['Dates'][0])
then I get class 'pandas.tslib.Timestamp', which leads me to believe that the types of objects stored in the dataframe are Timestamp objects. (I'm not sure where the 'Int64Index' is coming from... for the error before)
What I want to do is this: The dataframe column contains dates from the early 2000's to present in the following format: dd/mm/yyyy. I want to filter for dates only between November 15 and March 15, independent of the YEAR. What is the easiest way to do this?
Thanks.
Here is df['Dates'] (with indices):
0 2006-01-01
1 2006-01-02
2 2006-01-03
3 2006-01-04
4 2006-01-05
5 2006-01-06
6 2006-01-07
7 2006-01-08
8 2006-01-09
9 2006-01-10
10 2006-01-11
11 2006-01-12
12 2006-01-13
13 2006-01-14
14 2006-01-15
...
Using pd.to_datetime & dt accessor
The accepted answer is not the "pandas" way to approach this problem.
To select only rows with month 11, use the dt accessor:
# df['Date'] = pd.to_datetime(df['Date']) -- if column is not datetime yet
df = df[df['Date'].dt.month == 11]
The same works for days or years: substitute dt.month with dt.day or dt.year.
There are many more accessors; here are a few:
dt.quarter
dt.isocalendar().week (dt.week was deprecated and later removed from pandas)
dt.weekday
dt.day_name()
dt.is_month_end
dt.is_month_start
dt.is_year_end
dt.is_year_start
For a complete list see the documentation
Map an anonymous function to calculate the month on to the series and compare it to 11 for nov.
That will give you a boolean mask. You can then use that mask to filter your dataframe.
nov_mask = df['Dates'].map(lambda x: x.month) == 11
df[nov_mask]
I don't think there is a straightforward way to filter the way you want while ignoring the year, so try this. The reference range below spans a leap year so that 02-29 is included.
nov_mar_series = pd.Series(pd.date_range("2015-11-15", "2016-03-15"))
# create month-day strings without the year
nov_mar_no_year = nov_mar_series.map(lambda x: x.strftime("%m-%d"))
# add a yearless month-day column to the dataframe
df["no_year"] = df['Dates'].map(lambda x: x.strftime("%m-%d"))
no_year_mask = df['no_year'].isin(nov_mar_no_year)
df[no_year_mask]
Your code has two issues. First, the column reference needs to come after the filtering condition. Second, .month is available directly on a DatetimeIndex, but on a column it has to go through the .dt accessor. One of the following should work (the first only if the index is a DatetimeIndex):
df[df.index.month == 11]['Dates']
df[df['Dates'].dt.month == 11]['Dates']
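For the "November 15 to March 15, any year" part, another option is to encode month and day as one number and compare against the two endpoints. This is only a sketch on made-up sample dates; month_day and in_season are names I picked for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Dates": pd.to_datetime([
    "2006-01-01", "2006-03-15", "2006-03-16",
    "2006-11-14", "2006-11-15", "2006-07-04",
])})

# Encode month and day as MMDD, e.g. November 15 -> 1115, March 15 -> 315.
month_day = df["Dates"].dt.month * 100 + df["Dates"].dt.day

# The window wraps around New Year, so the two halves are joined with OR.
in_season = (month_day >= 1115) | (month_day <= 315)

print(df[in_season])
```

Since the comparison never looks at the year, February 29 is handled without any special casing.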

Create a new column in a dataframe that shows Day of the Week from an already existing dd/mm/yy column? Python

I have a dataframe that contains a column with dates e.g. 24/07/15 etc
Is there a way to create a new column into the dataframe that displays all the days of the week corresponding to the already existing 'Date' column?
I want the output to appear as:
[Date][DayOfTheWeek]
This might work:
If you want day name:
In [1405]: df
Out[1405]:
dates
0 24/07/15
1 25/07/15
2 26/07/15
In [1406]: df['dates'] = pd.to_datetime(df['dates'], dayfirst=True) # dayfirst=True makes the dd/mm/yy order explicit.
In [1408]: df['dow'] = df['dates'].dt.day_name()
In [1409]: df
Out[1409]:
dates dow
0 2015-07-24 Friday
1 2015-07-25 Saturday
2 2015-07-26 Sunday
If you want the day of the month as a number (for the weekday number, Monday=0, use dt.dayofweek instead):
In [1410]: df['dow'] = df['dates'].dt.day
In [1411]: df
Out[1411]:
dates dow
0 2015-07-24 24
1 2015-07-25 25
2 2015-07-26 26
I would try the apply function, so something like this:
def extractDayOfWeek(dateString):
    ...
df['DayOfWeek'] = df.apply(lambda x: extractDayOfWeek(x['Date']), axis=1)
The idea is that you map over every row, extract the 'Date' column, and apply your own function to create a new column named 'DayOfWeek'.
Depending on the type of your Date column, you may need to convert it first:
df['Date']=pd.to_datetime(df['Date'], format="%d/%m/%y")
df['weekday'] = df['Date'].dt.dayofweek
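Putting the pieces from the answers together, a small end-to-end sketch might look like this. The sample dates are the ones from the first answer; the column name weekday_number is only an illustrative choice.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["24/07/15", "25/07/15", "26/07/15"]})

# Parse dd/mm/yy explicitly so day and month cannot be swapped.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

# Weekday as a name, and as a number (Monday=0 ... Sunday=6).
df["DayOfTheWeek"] = df["Date"].dt.day_name()
df["weekday_number"] = df["Date"].dt.dayofweek

print(df)
```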

get subset dataframe by date

I have the following subset with a starting date (DD/MM/YYYY) and Amount
Start Date Amount
1 01/01/2013 20
2 02/05/2007 10
3 01/05/2004 15
4 01/06/2014 20
5 17/08/2008 21
I'd like to create a subset of this dataframe containing only the rows where the Start Date day is 01:
Start Date Amount
1 01/01/2013 20
3 01/05/2004 15
4 01/06/2014 20
I've tried to loop through the table and use the index but couldn't find a suitable way to iterate through a dataframe rows.
Assuming your dates are already datetimes, the following should work. If they are strings, convert them with to_datetime: df['Start Date'] = pd.to_datetime(df['Start Date']); you may also need to pass dayfirst=True. If you imported the data using read_csv, you could have done this at import time: df = pd.read_csv('data.csv', parse_dates=[n], dayfirst=True), where n is the 0-based index of the date column, so pass parse_dates=[0] if it is the first column.
One method is to apply a lambda to the column and use the boolean Series it returns as a mask:
In [19]:
df[df['Start Date'].apply(lambda x: x.day == 1)]
Out[19]:
Start Date Amount
index
1 2013-01-01 20
3 2004-05-01 15
4 2014-06-01 20
Not sure if there is a built in method that doesn't involve setting this to be the index which will convert it into a timeseries index.
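For what it's worth, the .dt accessor seems to cover this without apply and without touching the index. A minimal sketch using the data from the question (first_of_month is just a name I chose):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Start Date": ["01/01/2013", "02/05/2007", "01/05/2004", "01/06/2014", "17/08/2008"],
        "Amount": [20, 10, 15, 20, 21],
    },
    index=[1, 2, 3, 4, 5],
)

# The strings are day-first, so give the format explicitly.
df["Start Date"] = pd.to_datetime(df["Start Date"], format="%d/%m/%Y")

# Boolean mask: keep rows whose day-of-month is 1.
first_of_month = df[df["Start Date"].dt.day == 1]
print(first_of_month)
```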

Extracting just Month and Year separately from Pandas Datetime column

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.
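To make the to_period idea concrete, here is a small sketch on a few made-up arrival dates. It groups by the monthly period and also converts the period back to a timestamp at the start of the month; counts and month_start are names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "ArrivalDate": pd.to_datetime(["2012-12-31", "2012-12-29", "2013-01-02"])
})

# Monthly period: keeps year and month, drops the day.
df["month_year"] = df["ArrivalDate"].dt.to_period("M")

# Periods can be used directly as group keys.
counts = df.groupby("month_year").size()

# If a real datetime is needed again, a period converts to its start timestamp.
df["month_start"] = df["month_year"].dt.to_timestamp()

print(df)
print(counts)
```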
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.Timestamp.now()  # pandas.tslib is no longer available in current pandas
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)
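In case the monthcalendar expression looks cryptic: calendar.monthcalendar returns one list per week, Monday first, with 0 for days outside the month, so the largest of the first five entries in the last week is the last Monday-to-Friday day. A standalone sketch of just that step (the helper name last_weekday_of_month is hypothetical, and holidays are not considered):

```python
import calendar
import datetime


def last_weekday_of_month(day):
    """Return the last Monday-to-Friday date in the month containing `day`."""
    weeks = calendar.monthcalendar(day.year, day.month)
    # weeks[-1] is the final week row; [:5] keeps the Monday..Friday slots.
    last_weekday = max(weeks[-1][:5])
    return datetime.date(day.year, day.month, last_weekday)


print(last_weekday_of_month(datetime.date(2012, 12, 29)))
print(last_weekday_of_month(datetime.date(2014, 8, 5)))
```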
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to convert the column to datetime first (I usually forget):
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: Adding a column with 'year-month' pairs:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the year from a date such as '2018-03-04':
df['Year'] = pd.DatetimeIndex(df['date']).year
Assigning to df['Year'] creates a new column. To extract the month instead, just use .month.
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
@KieranPC's solution is the correct approach for Pandas, but is not easily extensible to arbitrary attributes. For this, you can use getattr within a generator expression and combine the results using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
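One caveat, as far as I can tell: recent pandas releases no longer expose weekofyear on .dt, so the attribute list above may need adjusting there. A hedged sketch of getting the same week numbers through dt.isocalendar() (pandas 1.1+), using the same three dates:

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(["2012-12-31", "2012-12-29", "2012-12-30"]),
    name="ArrivalDate",
)

# isocalendar() returns a DataFrame with 'year', 'week' and 'day' columns.
iso = dates.dt.isocalendar()
week = iso["week"].astype(int)

print(week)
```

Note that the ISO year can differ from the calendar year: 2012-12-31 falls in week 1 of ISO year 2013.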
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There are two steps to extract the year for the whole dataframe without using apply.
Step 1
Convert the column to datetime:
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step 2
Extract the year or the month using pd.DatetimeIndex:
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me. I didn't think pandas would interpret the resulting string as a date, but when I did the plot, it knew very well what I wanted and the year_month strings were ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think the proper input here should be a string. With a 'YYYY-MM-DD' string, dropping the last three characters leaves 'YYYY-MM':
df['ArrivalDate'].astype(str).apply(lambda x: x[:-3])
