I have a pandas dataframe and want to drop all rows with a start date before 2019 or after 2020. Of course I can just iterate over it, check the condition, and drop the row by index if the condition is False. For example:
for index, row in df.iterrows():
    # extract year from date format YYYY-MM-DD
    year = int(row['START_DATE'][:4])
    # remove all dates before and after 2019/2020
    if not (year >= 2019 and year <= 2020):
        df = df.drop(index)
But my goal is to write more efficient code, and that is where I am stuck. I came up with the following line:
df = df.drop(df[(int(df.START_DATE[:4]) < 2019) & (int(df.START_DATE[:4]) > 2020)].index)
but I get a TypeError: cannot convert the series to <class 'int'> and don't know how to convert the values to an int in this short statement.
First, ensure that the START_DATE column has a datetime dtype (via pd.to_datetime). Then filter the rows by your condition. ~ is the element-wise NOT operator in pandas.
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df = df[~((df["START_DATE"].dt.year < 2019) | (df["START_DATE"].dt.year > 2020))]
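The same filter can also be written positively with Series.between, which is inclusive on both ends by default. This is a minimal sketch, assuming a small made-up frame: the START_DATE column comes from the question, while the VAL column and the sample values are illustrative only.

```python
import pandas as pd

# Hypothetical sample data; START_DATE holds strings in YYYY-MM-DD format.
df = pd.DataFrame({
    "START_DATE": ["2018-12-31", "2019-06-15", "2020-12-31", "2021-01-01"],
    "VAL": [1, 2, 3, 4],
})

# Parse the column once, then keep rows whose year lies in [2019, 2020].
years = pd.to_datetime(df["START_DATE"]).dt.year
filtered = df[years.between(2019, 2020)]
print(filtered)
```

Selecting the rows to keep avoids both the explicit loop and the call to drop.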
Use pd.to_datetime to check whether the date falls within your range, then extract the year:
>>> df
   START_DATE  VAL
0  2018-12-31    1
1  2019-12-31    2
2  2020-12-31    3
3  2021-12-31    4
>>> df.loc[pd.to_datetime(df['START_DATE']).between('2019-01-01', '2020-12-31')] \
...     .assign(START_DATE=df['START_DATE'].str[:4].astype(int))
   START_DATE  VAL
1        2019    2
2        2020    3
I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, that is, add three (month_num) more dates for each case and remove the original ones.
original dataframe:
   Case        Date
0     1  2010-01-01
1     2  2011-04-01
2     3  2012-08-01
after populating dates:
   Case        Date
0     1  2010-02-01
1     1  2010-03-01
2     1  2010-04-01
3     2  2011-05-01
4     2  2011-06-01
5     2  2011-07-01
6     3  2012-09-01
7     3  2012-10-01
8     3  2012-11-01
I tried declaring an empty dataframe with the same column names and data types, then used a for loop over Case and month_num to add rows to the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns = ['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
    for m in range(1, month_num+1):
        temp = df.loc[df['Case']==c]
        temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
        df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code works; however, when the original dataframe and month_num become large, it takes a very long time to run. Is there a better way to do what I need? Thanks a lot!!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate the list once.
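The collect-then-concatenate idea could look roughly like the sketch below. It reuses the question's df and month_num; the names frames and shifted are introduced here for illustration. It also shifts the whole frame once per month offset instead of looping over each case, which is an additional simplification not taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Case": [1, 2, 3],
    "Date": pd.to_datetime(["2010-01-01", "2011-04-01", "2012-08-01"]),
})
month_num = 3

# Build one shifted copy of the whole frame per month offset and collect them.
frames = []
for m in range(1, month_num + 1):
    shifted = df.copy()
    shifted["Date"] = shifted["Date"] + pd.DateOffset(months=m)
    frames.append(shifted)

# Concatenate once, then order the rows by case and date.
df_new = (pd.concat(frames)
            .sort_values(["Case", "Date"])
            .reset_index(drop=True))
print(df_new)
```

Because pd.concat is called a single time, the intermediate result is not copied again on every iteration.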
Given your input data this is what worked on my notebook:
df2 = pd.DataFrame()
df2['Date'] = df['Date'].apply(lambda x: pd.date_range(start=x, periods=3, freq='M')).explode()
df3 = pd.merge_asof(df2, df, on='Date')
df3['Date'] = df3['Date'] + pd.DateOffset(days=1)
df3[['Case', 'Date']]
We create df2 and populate its 'Date' column with the needed dates derived from the original df.
Then df3 is the result of a merge_asof between df2 and df (to populate the 'Case' column).
Finally, we offset the resulting column by 1 day, which turns each month-end date into the first day of the following month.
I am a beginner in Python and I am trying to change column names that currently represent the week number, to something easier to digest. I wanted to change them to show the date of the week commencing but I am having issues with converting the types.
I have a table that looks similar to the following:
import pandas as pd
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
I tried to use df.melt to get a column with the week numbers, then convert them to datetime and obtain the week commencing from that, but I have not been successful.
df = df.melt(id_vars=['Owner'])
df['variable'] = pd.to_datetime(df['variable'], format = %U)
This is as far as I have gotten as I have not been able to obtain the week number as a datetime type to then use it to get the week commencing.
After this, I was going to then transform the dataframe back to its original shape and have the newly obtained week commencing date times as the column headers again.
Can anyone advise me on what I am doing wrong, or alternatively is there a better way to do this?
Any help would be greatly appreciated!
First add the Index column to id_vars in melt, so that variable contains only the week values. Then convert those values to floats, then integers, then strings, so they can be parsed as week numbers:
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
   Index Owner  32.0  33.0  34.0
0      0  John     1     2     3
df = df.melt(id_vars=['Index','Owner'])
s = df['variable'].astype(float).astype(int).astype(str) + '-0-2021'
print (s)
0    32-0-2021
1    33-0-2021
2    34-0-2021
Name: variable, dtype: object
#https://stackoverflow.com/a/17087427/2901002
df['variable'] = pd.to_datetime(s, format = '%W-%w-%Y')
print (df)
   Index Owner   variable  value
0      0  John 2021-08-15      1
1      0  John 2021-08-22      2
2      0  John 2021-08-29      3
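The dates above are Sundays, because the weekday field in the string is 0. If "week commencing" should mean the Monday of each week, one option is to set the weekday field to 1 instead. This is a sketch under the same assumptions as the answer above (year 2021, weeks counted with %W, where Monday is the first day of the week); the names weeks and week_commencing are illustrative.

```python
import pandas as pd

# Week labels as they appear in the column headers of the question.
weeks = pd.Series(["32.0", "33.0", "34.0"])

# %W counts weeks starting on Monday; a %w value of 1 selects that Monday.
# The year 2021 is an assumption carried over from the answer.
s = weeks.astype(float).astype(int).astype(str) + "-1-2021"
week_commencing = pd.to_datetime(s, format="%W-%w-%Y")
print(week_commencing)
```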
EDIT:
To get back the original DataFrame layout (with week numbers as column names), use DataFrame.pivot:
df1 = (df.pivot(index=['Index','Owner'], columns='variable', values='value')
         .rename_axis(None, axis=1))
df1.columns = df1.columns.strftime('%W')
df1 = df1.reset_index()
print (df1)
   Index Owner  32  33  34
0      0  John   1   2   3
One solution to convert a week number to a date is to use a timedelta. For example, you may have:
from datetime import timedelta, datetime
week_number = 5
# 4 January 2021 is the first Monday of 2021
first_monday_of_the_year = datetime(2021, 1, 4)
week_date = first_monday_of_the_year + timedelta(weeks=week_number)
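If the week numbers follow ISO 8601, the standard library can do this lookup directly, without hard-coding the first Monday. This is a minimal sketch: whether the data actually uses ISO week numbering is an assumption, and week 32 is taken from the question's column names.

```python
from datetime import date

# Assumption: week numbers follow ISO 8601 (weeks start on Monday,
# and week 1 is the week containing the first Thursday of the year).
year, week_number = 2021, 32

# date.fromisocalendar requires Python 3.8+; the last argument is the
# ISO weekday, where 1 means Monday.
week_commencing = date.fromisocalendar(year, week_number, 1)
print(week_commencing)  # 2021-08-09
```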
I have a data frame with a column indicating the number of months. I would like to create a new column, starting from an initial date, let’s say 2015-01-01 and add all the months to this initial date. For example, if the month column has values [0, 1, 2, …,72], then I would like to have a column called Date of the form [2015-01-01,2015-02-01,2015-03-01,…].
How could I achieve this?
Use offsets.DateOffset and add to datetime:
df = pd.DataFrame({'n': [0,1,2,72]})
start = '2015-01-01'
df['new'] = pd.to_datetime(start) + df['n'].apply(lambda x: pd.offsets.DateOffset(months=x))
print (df)
    n        new
0   0 2015-01-01
1   1 2015-02-01
2   2 2015-03-01
3  72 2021-01-01
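For a long column, a lookup into a precomputed range of month starts avoids creating one offset object per row. This is a sketch assuming that start is the first day of a month and that n holds non-negative integers; the name month_starts is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"n": [0, 1, 2, 72]})
start = "2015-01-01"

# One month-start timestamp for every offset from 0 up to the largest n.
# freq="MS" means "month start", so this relies on start being the 1st.
month_starts = pd.date_range(start, periods=int(df["n"].max()) + 1, freq="MS")

# Look up each row's date by position in the precomputed range.
df["new"] = month_starts[df["n"].to_numpy()]
print(df)
```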
I have a pandas DataFrame called new in which the YearMonth column holds dates in the format YYYY-MM. I want to drop rows based on a condition: the date is later than "2020-05". I tried using this:
new = new.drop(new[new.YearMonth>'2020-05'].index)
but it's not working; it displays a syntax error of "invalid token".
Here is a sample DataFrame:
>>> new = pd.DataFrame({
...     'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']
... })
>>> print(new)
  YearMonth
0   2014-09
1   2014-10
2   2020-09
3   2021-09
The expected DataFrame after the drop should be:
  YearMonth
0   2014-09
1   2014-10
Just convert the column to datetime, then to a monthly period, and subset it.
import pandas as pd
new['YearMonth'] = pd.to_datetime(new['YearMonth']).dt.to_period('M')
new = new[~(new['YearMonth'] > '2020-05')]
I think you want boolean indexing, changing > to <= to keep the rows instead of dropping them. Comparing by month periods works nicely:
new = pd.DataFrame({
    'YearMonth': pd.to_datetime(['2014-09', '2014-10', '2020-09', '2021-09']).to_period('M')
})
print (new)
  YearMonth
0   2014-09
1   2014-10
2   2020-09
3   2021-09
df = new[new.YearMonth <= pd.Period('2020-05', freq='M')]
print (df)
  YearMonth
0   2014-09
1   2014-10
In recent versions of pandas, comparing against a string also works:
df = new[new.YearMonth <= '2020-05']
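If the column is left as plain strings, no conversion is strictly needed, because zero-padded YYYY-MM strings sort in the same order as the dates they represent. This is a small sketch using the question's sample data; the name kept is illustrative.

```python
import pandas as pd

# YearMonth is kept as plain strings, exactly as in the question.
new = pd.DataFrame({"YearMonth": ["2014-09", "2014-10", "2020-09", "2021-09"]})

# Zero-padded YYYY-MM strings compare lexicographically in date order,
# so an ordinary string comparison selects the rows to keep.
kept = new[new["YearMonth"] <= "2020-05"]
print(kept)
```

This only holds while every value is zero-padded and in the same format; converting to periods is the safer choice for mixed input.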
I have a dataframe with a column named 'date'. The first few entries look like this:
0    02.01.2013
1    03.01.2013
2    05.01.2013
3    06.01.2013
4    15.01.2013
Now I want to use pandas to filter out all the rows whose date is not in, for example, 2014.
I looked through tutorials and found the following:
mask = transactions['date'][9]==4
trans=transactions[mask]
but that does not work since
transactions['date'][9]
gives me the 9th data entry but not the 9th digit of the date.
Can someone help a newbie along?
df
         date
0  02.01.2013
1  03.01.2013
2  05.01.2013
3  06.01.2014
4  15.01.2014
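Convert the column to datetime using pd.to_datetime, passing the day-first format explicitly so that values such as 15.01.2014 parse correctly, and test the dt.year attribute -
m = pd.to_datetime(df.date, format='%d.%m.%Y').dt.year != 2014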
m
0     True
1     True
2     True
3    False
4    False
Name: date, dtype: bool
Use the mask to filter on df -
df = df[m]
If the date column is the index, you'd instead need to convert df.index -
m = pd.to_datetime(df.index, format='%d.%m.%Y').year != 2014
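To keep only the 2014 rows rather than remove them, the comparison can be flipped to ==. The sketch below also shows the per-string slicing the question was reaching for, which needs the .str accessor. It assumes the dates are strings in DD.MM.YYYY format; the names by_string, parsed and by_year are illustrative.

```python
import pandas as pd

transactions = pd.DataFrame(
    {"date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.01.2014", "15.01.2014"]}
)

# .str slices each string individually, whereas transactions['date'][9]
# selects the row labelled 9.
by_string = transactions[transactions["date"].str[-4:] == "2014"]

# Equivalent filter on parsed dates, with the day-first format made explicit.
parsed = pd.to_datetime(transactions["date"], format="%d.%m.%Y")
by_year = transactions[parsed.dt.year == 2014]
print(by_year)
```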