Drop rows based on date condition - python

I have a pandas DataFrame called new in which the YearMonth column holds dates in YYYY-MM format. I want to drop every row whose date is later than "2020-05". I tried this:
new = new.drop(new[new.YearMonth>'2020-05'].index)
but it does not work; it fails with a syntax error: "invalid token".
Here is a sample DataFrame:
>>> new = pd.DataFrame({
'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']
})
>>> print(new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
The expected DataFrame after the drop should be:
YearMonth
0 2014-09
1 2014-10

Convert the column to datetime, turn it into monthly periods, then subset:
import pandas as pd
new['YearMonth']=pd.to_datetime(new['YearMonth']).dt.to_period('M')
new=new[~(new['YearMonth']>'2020-05')]

I think you want boolean indexing, changing > to <= so that the rows to keep are selected. Comparing monthly periods works nicely:
new = pd.DataFrame({
'YearMonth': pd.to_datetime(['2014-09', '2014-10', '2020-09', '2021-09']).to_period('M')
})
print (new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
df = new[new.YearMonth <= pd.Period('2020-05', freq='M')]
print (df)
YearMonth
0 2014-09
1 2014-10
In recent versions of pandas, comparing against a string also works:
df = new[new.YearMonth <= '2020-05']
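The filtering discussed above can be sketched in a self-contained form. The names kept_str, periods, cutoff and kept are illustrative and do not come from the original posts. Because zero-padded YYYY-MM strings sort the same way lexically and chronologically, a plain string comparison and a monthly-period comparison should select the same rows:

```python
import pandas as pd

new = pd.DataFrame({'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']})

# Option 1: compare the zero-padded strings directly.
kept_str = new[new['YearMonth'] <= '2020-05']

# Option 2: parse to monthly periods and compare against a Period cutoff.
periods = pd.PeriodIndex(new['YearMonth'], freq='M')
cutoff = pd.Period('2020-05', freq='M')
kept = new[periods <= cutoff]

print(kept)
```

The period version is the safer choice if the strings might not be zero-padded, since it compares actual months rather than characters.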

Related

How do I convert an object to a week number in datetime

I am a beginner in Python and I am trying to change column names that currently represent the week number, to something easier to digest. I wanted to change them to show the date of the week commencing but I am having issues with converting the types.
I have a table that looks similar to the following:
import pandas as pd
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
I tried to use df.melt to get a column with the week numbers, then convert them to datetime and obtain the week commencing from that, but I have not been successful.
df = df.melt(id_vars=['Owner'])
df['variable'] = pd.to_datetime(df['variable'], format = '%U')
This is as far as I have gotten as I have not been able to obtain the week number as a datetime type to then use it to get the week commencing.
After this, I was going to then transform the dataframe back to its original shape and have the newly obtained week commencing date times as the column headers again.
Can anyone advise me on what I am doing wrong, or alternatively is there a better way to do this?
Any help would be greatly appreciated!
Include the Index column in id_vars when melting, so that variable holds only the week values. Then convert those values to floats, integers and finally strings, so they can be parsed as week numbers:
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
Index Owner 32.0 33.0 34.0
0 0 John 1 2 3
df = df.melt(id_vars=['Index','Owner'])
s = df['variable'].astype(float).astype(int).astype(str) + '-0-2021'
print (s)
0 32-0-2021
1 33-0-2021
2 34-0-2021
Name: variable, dtype: object
#https://stackoverflow.com/a/17087427/2901002
df['variable'] = pd.to_datetime(s, format = '%W-%w-%Y')
print (df)
Index Owner variable value
0 0 John 2021-08-15 1
1 0 John 2021-08-22 2
2 0 John 2021-08-29 3
EDIT:
To get back the original layout (week numbers as column names), use DataFrame.pivot:
df1 = (df.pivot(index=['Index','Owner'], columns='variable', values='value')
.rename_axis(None, axis=1))
df1.columns = df1.columns.strftime('%W')
df1 = df1.reset_index()
print (df1)
Index Owner 32 33 34
0 0 John 1 2 3
One solution to convert a week number to a date is to use a timedelta. For example you may have
from datetime import timedelta, datetime
week_number = 5
first_monday_of_the_year = datetime(2021, 1, 4)  # 4 January 2021 is the first Monday of 2021
week_date = first_monday_of_the_year + timedelta(weeks=week_number)
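As a rough sketch of the "week commencing" idea, the column labels can also be renamed directly, without melting. This assumes the weeks belong to 2021 (as in the answer above), that weeks are numbered in the %W convention (week 1 starts on the first Monday of the year), and that week_commencing and week_cols are hypothetical helper names introduced here:

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame([[0, 'John', 1, 2, 3]],
                  columns=['Index', 'Owner', '32.0', '33.0', '34.0'])

def week_commencing(label, year=2021):
    """Return the Monday of the given %W week as a YYYY-MM-DD string."""
    week = int(float(label))  # '32.0' -> 32
    # %W = week of year (Monday first), %w = weekday with 1 meaning Monday.
    monday = datetime.strptime(f'{year}-{week}-1', '%Y-%W-%w')
    return monday.strftime('%Y-%m-%d')

week_cols = ['32.0', '33.0', '34.0']
renamed = df.rename(columns={c: week_commencing(c) for c in week_cols})
print(renamed.columns.tolist())
```

If the week numbers follow ISO 8601 instead, the resulting dates can differ by a week, so the numbering convention is worth checking against the data source.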

Throw out rows of same ID that are close in time

I have a pandas dataframe that includes a column with an ID called VIN and a column with a date. If the same VIN has multiple rows with dates that are less than 2 months apart, I would like to throw out the later dates. Here's a minimal example:
rng = pd.date_range('2015-02-24', periods=5, freq='M')
df = pd.DataFrame({ 'Date': rng, 'ID': ['ABD','ABD','CDE','CDE','FEK'] })
df.head()
Here I would like to throw out row 1 and 3.
You can use .groupby() on column ID and get the difference between 2 dates with .diff() and check whether it is less than 2 months by comparing with np.timedelta64(2, 'M'). Then filter by .loc on the boolean mask of negation of the condition.
mask = df.groupby('ID')['Date'].diff() < np.timedelta64(2, 'M')
df_filtered = df.loc[~mask]
Result:
print(df_filtered)
Date ID
0 2015-02-28 ABD
2 2015-04-30 CDE
4 2015-06-30 FEK
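A calendar month has no fixed length, so a month-unit np.timedelta64 may not be accepted by every pandas version. The sketch below is a variant of the approach above that approximates "2 months" as 60 days; that threshold, and the names threshold and too_close, are assumptions made here rather than part of the original answer. It also assumes the rows are already sorted by date within each ID:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-02-28', '2015-03-31', '2015-04-30',
                            '2015-05-31', '2015-06-30']),
    'ID': ['ABD', 'ABD', 'CDE', 'CDE', 'FEK'],
})

# Assumption: treat "2 months" as a fixed 60-day window.
threshold = pd.Timedelta(days=60)

# diff() is NaT for the first row of each ID, and NaT < threshold is False,
# so the first row of every group is always kept.
too_close = df.groupby('ID')['Date'].diff() < threshold
df_filtered = df.loc[~too_close]

print(df_filtered)
```

Note that diff() compares each row with the immediately preceding row of the same ID, not with the last row that was kept, so a long run of closely spaced dates would drop everything after the first.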

Pandas: How to sort dataframe rows by date of one column

So I have two different data-frame and I concatenated both. All columns are the same; however, the date column has all sorts of different dates in the M/D/YR format.
dataframe dates get shuffled around later in the sequence
Is there a way to keep the whole dataframe itself and just sort the rows based on the dates in the date column. I also want to keep the format that date is in.
so basically
date people
6/8/2015 1
7/10/2018 2
6/5/2015 0
gets converted into:
date people
6/5/2015 0
6/8/2015 1
7/10/2018 2
Thank you!
PS: I've tried the options in the other post on this but it does not work
Trying to elaborate on what can be done:
Initialize or merge the DataFrame and convert the column to datetime type:
df= pd.DataFrame({'people':[1,2,0],'date': ['6/8/2015','7/10/2018','6/5/2015',]})
df.date=pd.to_datetime(df.date,format="%m/%d/%Y")
print(df)
Output:
date people
0 2015-06-08 1
1 2018-07-10 2
2 2015-06-05 0
Sort by date:
df=df.sort_values('date')
print(df)
Output:
date people
2 2015-06-05 0
0 2015-06-08 1
1 2018-07-10 2
Convert back to the original string format:
df['date']=df['date'].dt.strftime('%m/%d/%Y')
print(df)
Output:
date people
2 06/05/2015 0
0 06/08/2015 1
1 07/10/2018 2
Try changing the 'date' column to pandas Datetime and then sort
import pandas as pd
df= pd.DataFrame({'people':[1,1,1,2],'date':
['4/12/1961','5/5/1961','7/21/1961','8/6/1961']})
df['date'] =pd.to_datetime(df.date)
df.sort_values(by='date')
Output:
date people
1961-04-12 1
1961-05-05 1
1961-07-21 1
1961-08-06 2
To get back the initial format:
df['date']=df['date'].dt.strftime('%m/%d/%y')
Output:
date people
04/12/61 1
05/05/61 1
07/21/61 1
08/06/61 2
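If the goal is to keep the date strings exactly as they are (including the missing zero padding), one possible sketch is to sort with a key function instead of converting the column. This assumes pandas 1.1 or later, where sort_values accepts a key argument; the name sorted_df is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'],
                   'people': [1, 2, 0]})

# Parse the strings only for ordering; the stored values are left untouched.
sorted_df = df.sort_values(
    'date',
    key=lambda s: pd.to_datetime(s, format='%m/%d/%Y'),
)

print(sorted_df)
```

Adding .reset_index(drop=True) afterwards renumbers the rows if the original index is not needed.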
Why not simply:
dataset[SortBy["date"]]
Can you share what you tried, or how your data is structured?
In case you need to sort in reversed order do:
dataset[SortBy["date"]][Reverse]

pandas filter entries by date

I have a DataFrame with a column named 'date'. The first few entries look like this:
0 02.01.2013
1 03.01.2013
2 05.01.2013
3 06.01.2013
4 15.01.2013
Now I want to use pandas to filter out all the rows whose date is not in, for example, 2014.
I looked through tutorials and found the following:
mask = transactions['date'][9]==4
trans=transactions[mask]
but that does not work, since
transactions['date'][9]
gives me the 9th data entry, not the 9th character of the date.
Can someone help a newb along?
df
date
0 02.01.2013
1 03.01.2013
2 05.01.2013
3 06.01.2014
4 15.01.2014
Convert the column to datetime using pd.to_datetime (passing the day-first format explicitly, since the dates are DD.MM.YYYY), and test the dt.year attribute -
m = pd.to_datetime(df.date, format='%d.%m.%Y').dt.year != 2014
m
0 True
1 True
2 True
3 False
4 False
Name: date, dtype: bool
Use the mask to filter on df -
df = df[m]
If the date column is the index, you'd instead need to convert df.index -
m = pd.to_datetime(df.index, format='%d.%m.%Y').year != 2014
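Put together as a runnable sketch (the names parsed, only_2014 and not_2014 are illustrative), the mask can be used in either direction, depending on whether the 2014 rows are the ones to keep or the ones to drop:

```python
import pandas as pd

df = pd.DataFrame({'date': ['02.01.2013', '03.01.2013', '05.01.2013',
                            '06.01.2014', '15.01.2014']})

# The strings are day.month.year, so state the format explicitly.
parsed = pd.to_datetime(df['date'], format='%d.%m.%Y')

only_2014 = df[parsed.dt.year == 2014]   # keep rows dated in 2014
not_2014 = df[parsed.dt.year != 2014]    # keep everything else

print(only_2014)
```

The original string column is left unchanged here; only the temporary parsed series is used to build the masks.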

Create new column based on another column for a multi-index Panda dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index pandas DataFrame where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where, for each month-end date, I look forward one month and take the values from another column ('some_var'); the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0,5)
date = [pd.Timestamp(2017,1,31)+pd.offsets.MonthEnd(i) for i in [0,1]]
my_data = []
for d in date:
    for i in id:
        my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop(('2017-02-28',3), level=None, inplace=True)
df
# The desired output....
list1 = df.loc['2017-01-31'].index.labels[1].tolist()
list2 = df.loc['2017-02-28'].index.labels[1].tolist()
common = list(set(list1) & set(list2))
for i in common:
    df.loc[('2017-01-31', i)]['new_var'] = df.loc[('2017-02-28', i)]['some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create an integer column representing the date, subtract one from it (to shift it by one month), and then left-merge the values back onto the original DataFrame.
Out[28]:
               some_var
date       id
2017-01-31 0   0.736003
           1   0.248275
           2   0.844170
           3   0.671364
           4   0.034331
2017-02-28 0   0.051586
           1   0.894579
           2   0.136740
           4   0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
               some_var   new_var
date       id
2017-01-31 0   0.736003  0.051586
           1   0.248275  0.894579
           2   0.844170  0.136740
           3   0.671364       NaN
           4   0.034331  0.902409
2017-02-28 0   0.051586       NaN
           1   0.894579       NaN
           2   0.136740       NaN
           4   0.902409       NaN
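For anyone who wants to run the shift-and-merge idea end to end, here is a minimal sketch. It uses fixed, made-up values instead of random ones so the result is easy to check, and the names flat, shifted and out are illustrative:

```python
import pandas as pd

jan, feb = pd.Timestamp('2017-01-31'), pd.Timestamp('2017-02-28')

# Made-up data: id 3 is missing in February, as in the question.
idx = pd.MultiIndex.from_tuples(
    [(jan, i) for i in range(5)] + [(feb, i) for i in (0, 1, 2, 4)],
    names=['date', 'id'],
)
df = pd.DataFrame(
    {'some_var': [10.0, 11.0, 12.0, 13.0, 14.0, 20.0, 21.0, 22.0, 24.0]},
    index=idx,
)

flat = df.reset_index()
# Number the dates 0, 1, ... in chronological order.
flat['n_group'] = flat.groupby('date').ngroup()

# Copy of the values, relabelled so month n+1 lines up with month n.
shifted = flat[['n_group', 'id', 'some_var']].rename(columns={'some_var': 'new_var'})
shifted['n_group'] -= 1

out = (flat.merge(shifted, on=['n_group', 'id'], how='left')
           .set_index(['date', 'id'])
           .drop(columns='n_group'))
print(out)
```

This treats "one month forward" as "the next date present in the data", so a gap in the month-end dates would pair non-adjacent months; that may or may not be what is wanted.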
