I have a dataframe with a column named 'date'. The first few entries look like this:
0 02.01.2013
1 03.01.2013
2 05.01.2013
3 06.01.2013
4 15.01.2013
Now I want to use pandas to filter out all the rows whose date is not in, for example, 2014.
I looked through tutorials and found the following:
mask = transactions['date'][9]==4
trans=transactions[mask]
but that does not work, since
transactions['date'][9]
gives me the entry at row 9 of the column, not the character at position 9 of each date.
Can someone help a newbie along?
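The attempt above indexes the Series itself, so [9] selects a row rather than a character. A minimal sketch of per-string indexing with the .str accessor, assuming the column holds strings in dd.mm.yyyy form (the sample frame here is made up to mirror the question):

```python
import pandas as pd

# Hypothetical sample data in dd.mm.yyyy form, mirroring the question.
transactions = pd.DataFrame(
    {"date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.01.2014", "15.01.2014"]}
)

# .str[9] takes the character at position 9 of every string,
# whereas transactions["date"][9] would look up the row labelled 9.
mask = transactions["date"].str[9] == "4"

# Comparing the last four characters is more robust than a single digit.
mask_year = transactions["date"].str[-4:] == "2014"

trans = transactions[mask_year]
print(trans)
```

Note the comparison is against the string "4", not the integer 4, because .str returns characters.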
df
date
0 02.01.2013
1 03.01.2013
2 05.01.2013
3 06.01.2014
4 15.01.2014
Convert the column to datetime using pd.to_datetime, and test the dt.year attribute -
m = pd.to_datetime(df.date).dt.year != 2014
m
0 True
1 True
2 True
3 False
4 False
Name: date, dtype: bool
Use the mask to filter on df -
df = df[m]
If the datetime column is the index, you'd instead need to convert the df.index -
m = pd.to_datetime(df.index).year != 2014
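If the goal is to keep only the 2014 rows, the comparison can be flipped to ==. A small sketch, assuming the strings are day-first (dd.mm.yyyy); the explicit format is my addition so that pandas does not have to guess whether 02.01.2013 means 2 January or 1 February:

```python
import pandas as pd

df = pd.DataFrame(
    {"date": ["02.01.2013", "03.01.2013", "05.01.2013", "06.01.2014", "15.01.2014"]}
)

# An explicit format avoids day/month ambiguity in strings such as 02.01.2013.
parsed = pd.to_datetime(df["date"], format="%d.%m.%Y")

# Keep only the rows whose year is 2014 (the inverse of the != mask).
only_2014 = df[parsed.dt.year == 2014]
print(only_2014)
```

The original string column is left untouched; only the temporary parsed Series is used for the test.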
Let's take this sample dataframe:
df = pd.DataFrame({'ID':[1,1,2,2,3],'Date_min':["2021-01-01","2021-01-20","2021-01-28","2021-01-01","2021-01-02"],'Date_max':["2021-01-23","2021-12-01","2021-09-01","2021-01-15","2021-01-09"]})
# name the unit explicitly: pandas 2.x rejects the unit-less 'datetime64'
df["Date_min"] = df["Date_min"].astype('datetime64[ns]')
df["Date_max"] = df["Date_max"].astype('datetime64[ns]')
ID Date_min Date_max
0 1 2021-01-01 2021-01-23
1 1 2021-01-20 2021-12-01
2 2 2021-01-28 2021-09-01
3 2 2021-01-01 2021-01-15
4 3 2021-01-02 2021-01-09
I would like to check, for each ID, whether there are overlapping date ranges. I can use a loop-based solution like the following one, but it is inefficient and therefore quite slow on a really big dataframe:
L_output = []
for index, row in df.iterrows():
    if len(df[(df["ID"] == row["ID"]) & (df["Date_min"] <= row["Date_min"]) &
              (df["Date_max"] >= row["Date_min"])].index) > 1:
        print("overlapping date ranges for ID %d" % row["ID"])
        L_output.append(row["ID"])
Output :
overlapping date ranges for ID 1
Do you know a better way to check that ID 1 has overlapping date ranges?
Expected output :
[1]
Try:
Create a column "Dates" that contains a list of dates from "Date_min" to "Date_max" for each row
Explode the "Dates" column
Get the duplicated rows
df["Dates"] = df.apply(lambda row: pd.date_range(row["Date_min"], row["Date_max"]), axis=1)
df = df.explode("Dates").drop(["Date_min", "Date_max"], axis=1)
#if you want all the ID and Dates that are duplicated/overlap
>>> df[df.duplicated()]
ID Dates
1 1 2021-01-20
1 1 2021-01-21
1 1 2021-01-22
1 1 2021-01-23
#if you just want a count of overlapping dates per ID
>>> df.groupby("ID").agg(lambda x: x.duplicated().sum())
Dates
ID
1 4
2 0
3 0
You can transform your datetime objects into timestamps. Then construct pd.Interval objects and iterate over a generator of all possible pairs of intervals for each ID:
from itertools import combinations
import pandas as pd
def group_has_overlap(group):
    timestamps = group[["Date_min", "Date_max"]].values.tolist()
    for t1, t2 in combinations(timestamps, 2):
        i1 = pd.Interval(t1[0], t1[1])
        i2 = pd.Interval(t2[0], t2[1])
        if i1.overlaps(i2):
            return True
    return False

for ID, group in df.groupby("ID"):
    print(ID, group_has_overlap(group))
Output is :
1 True
2 False
3 False
Set the index to an IntervalIndex, and use groupby to get your overlapping IDs:
(df.set_index(pd.IntervalIndex
                .from_arrays(df.Date_min,
                             df.Date_max,
                             closed='both'))
   .groupby('ID')
   .apply(lambda df: df.index.is_overlapping)
)
ID
1 True
2 False
3 False
dtype: bool
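Another option that avoids building intervals or comparing every pair is to sort by start date and compare each start with the latest end date seen so far for the same ID. This is only a sketch of that sort-and-running-maximum idea, not one of the methods above; it assumes closed ranges (touching endpoints count as overlapping), and the names prev_max_end and overlapping_ids are mine:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "Date_min": pd.to_datetime(
        ["2021-01-01", "2021-01-20", "2021-01-28", "2021-01-01", "2021-01-02"]),
    "Date_max": pd.to_datetime(
        ["2021-01-23", "2021-12-01", "2021-09-01", "2021-01-15", "2021-01-09"]),
})

# Sort so that, within each ID, ranges appear in order of their start date.
s = df.sort_values(["ID", "Date_min"])

# Latest end date among the earlier rows of the same ID (NaT for the first row).
prev_max_end = s.groupby("ID")["Date_max"].transform(lambda x: x.cummax().shift())

# A range overlaps an earlier one if it starts on or before that running end date.
# Comparisons against NaT are False, so the first row of each ID never matches.
overlaps = s["Date_min"] <= prev_max_end

overlapping_ids = s.loc[overlaps, "ID"].unique().tolist()
print(overlapping_ids)
```

On the sample data this yields the list of IDs directly, which matches the expected output format in the question.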
I have a pandas dataframe and want to drop all rows with a start date earlier than 2019 or later than 2020. Of course I can just iterate over it, check the condition, and drop the row by index if the condition is False. For example:
for index, row in df.iterrows():
    # extract year from date format YYYY-MM-DD
    year = int(row['START_DATE'][:4])
    # remove all dates before and after 2019/2020
    if not (year >= 2019 and year <= 2020):
        df = df.drop(index)
But my goal is to write more efficient code, and that is where I am stuck. I came up with the following line:
df = df.drop(df[(int(df.START_DATE[:4]) < 2019) & (int(df.START_DATE[:4]) > 2020)].index)
but I get a TypeError: cannot convert the series to <class 'int'> and don't know how to convert the values to an int in this short statement.
First ensure that the START_DATE column has a datetime dtype. Then filter on your condition. ~ is the element-wise NOT operator in pandas.
df["START_DATE"] = pd.to_datetime(df["START_DATE"])
df = df[~((df["START_DATE"].dt.year < 2019) | (df["START_DATE"].dt.year > 2020))]
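The same filter can also be written positively with Series.between, which some readers may find easier to follow than a negated OR. A minimal sketch, assuming START_DATE is stored as YYYY-MM-DD strings; the sample values are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; START_DATE is stored as YYYY-MM-DD strings.
df = pd.DataFrame({
    "START_DATE": ["2018-12-31", "2019-12-31", "2020-12-31", "2021-12-31"],
    "VAL": [1, 2, 3, 4],
})

years = pd.to_datetime(df["START_DATE"]).dt.year

# Series.between is inclusive on both ends by default, so this keeps 2019 and 2020.
kept = df[years.between(2019, 2020)]
print(kept)
```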
Use pd.to_datetime to check whether the date falls within your range, then extract the year:
>>> df
START_DATE VAL
0 2018-12-31 1
1 2019-12-31 2
2 2020-12-31 3
3 2021-12-31 4
>>> df.loc[pd.to_datetime(df['START_DATE']).between('2019', '2021')] \
.assign(START_DATE=df['START_DATE'].str[:4].astype(int))
START_DATE VAL
1 2019 2
2 2020 3
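As for the TypeError in the question: int() expects a single value, and df.START_DATE[:4] slices the first four rows rather than the first four characters. A sketch of an element-wise version that stays close to the original one-liner, assuming START_DATE holds YYYY-MM-DD strings (same sample frame as just above); note it combines the conditions with | because no year is both before 2019 and after 2020:

```python
import pandas as pd

df = pd.DataFrame({
    "START_DATE": ["2018-12-31", "2019-12-31", "2020-12-31", "2021-12-31"],
    "VAL": [1, 2, 3, 4],
})

# df.START_DATE[:4] slices the first four ROWS, and int() cannot take a Series.
# The .str accessor slices each string; astype(int) converts element-wise.
year = df["START_DATE"].str[:4].astype(int)

# Rows to drop: before 2019 OR after 2020 (with & the selection is always empty).
to_drop = df[(year < 2019) | (year > 2020)].index
df = df.drop(to_drop)
print(df)
```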
I have a pandas DataFrame called new in which the YearMonth column holds dates in the format YYYY-MM. I want to drop the rows whose date is later than "2020-05". I tried this:
new = new.drop(new[new.YearMonth>'2020-05'].index)
but it is not working; it displays a syntax error: "invalid token".
Here is a sample DataFrame:
>>> new = pd.DataFrame({
...     'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']
... })
>>> print(new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
The expected DataFrame after the drop should be:
YearMonth
0 2014-09
1 2014-10
Just convert to datetime, then convert that to a monthly period and subset on it.
import pandas as pd
new['YearMonth']=pd.to_datetime(new['YearMonth']).dt.to_period('M')
new=new[~(new['YearMonth']>'2020-05')]
I think you want boolean indexing with > changed to <=. Comparing monthly periods works nicely:
new = pd.DataFrame({
    'YearMonth': pd.to_datetime(['2014-09', '2014-10', '2020-09', '2021-09']).to_period('M')
})
print (new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
df = new[new.YearMonth <= pd.Period('2020-05', freq='M')]
print (df)
YearMonth
0 2014-09
1 2014-10
In recent versions of pandas, comparing the period column against a string also works:
df = new[new.YearMonth <= '2020-05']
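If the column is left as plain strings, no conversion is strictly needed. The sketch below assumes every value is a zero-padded YYYY-MM string, in which case lexicographic order and chronological order coincide; it would not be safe for values like "2020-5":

```python
import pandas as pd

new = pd.DataFrame({"YearMonth": ["2014-09", "2014-10", "2020-09", "2021-09"]})

# Zero-padded YYYY-MM strings sort lexicographically in chronological order,
# so an ordinary string comparison is enough here.
kept = new[new["YearMonth"] <= "2020-05"]
print(kept)
```

This also suggests the comparison in the question was valid as written, so the "invalid token" error probably came from somewhere else in the script.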
I want to check the column 'esn' in df1. Whenever the value differs between two consecutive rows, those rows should go into another dataframe, df2, so that df2 contains only the row before the change and the row after it.
>>> df1 = pd.DataFrame([[2014,1],[2015,1],[2016,1],[2017,2],[2018,2]],columns=['year','esn'])
>>> df1
year esn
0 2014 1
1 2015 1
2 2016 1
3 2017 2
4 2018 2
>>> df2 # new dataframe intended to create
year esn
0 2016 1
1 2017 2
I can't produce the above result in df2. Thanks in advance for your help.
Create a boolean mask by comparing the column with its shifted values using ne (not equal), back-filling the first missing value. Do the same with a shift of -1, forward-filling the last missing value. Chain both with | (bitwise OR) and filter by boolean indexing:
mask = df1['esn'].ne(df1['esn'].shift().bfill()) | df1['esn'].ne(df1['esn'].shift(-1).ffill())
df2 = df1[mask]
print (df2)
year esn
2 2016 1
3 2017 2
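The result above keeps the original row labels (2 and 3), while the question shows 0 and 1. A sketch of a variant that renumbers the index, assuming esn is numeric so that diff() applies; the name changed is mine:

```python
import pandas as pd

df1 = pd.DataFrame(
    [[2014, 1], [2015, 1], [2016, 1], [2017, 2], [2018, 2]],
    columns=["year", "esn"],
)

# True where esn differs from the previous row; the first row has no
# predecessor, so its NaN difference is treated as "no change".
changed = df1["esn"].diff().fillna(0).ne(0)

# Keep each changed row plus the row just before it.
mask = changed | changed.shift(-1, fill_value=False)

# reset_index(drop=True) renumbers the result 0, 1, ... as in the question.
df2 = df1[mask].reset_index(drop=True)
print(df2)
```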
I have two different dataframes and I concatenated them. All columns are the same; however, the date column contains all sorts of different dates in M/D/YR format.
The dates in the dataframe get shuffled around later in the sequence.
Is there a way to keep the whole dataframe and just sort the rows by the dates in the date column? I also want to keep the format the date is in.
So basically
date people
6/8/2015 1
7/10/2018 2
6/5/2015 0
gets converted into:
date people
6/5/2015 0
6/8/2015 1
7/10/2018 2
Thank you!
PS: I've tried the options in the other post on this, but they do not work.
Trying to elaborate on what can be done:
Initialize/merge the dataframe and convert the column to datetime type
df= pd.DataFrame({'people':[1,2,0],'date': ['6/8/2015','7/10/2018','6/5/2015',]})
df.date=pd.to_datetime(df.date,format="%m/%d/%Y")
print(df)
Output:
date people
0 2015-06-08 1
1 2018-07-10 2
2 2015-06-05 0
Sort on the basis of date
df=df.sort_values('date')
print(df)
Output:
date people
2 2015-06-05 0
0 2015-06-08 1
1 2018-07-10 2
Convert back to the M/D/YR format (note that strftime zero-pads the month and day):
df['date']=df['date'].dt.strftime('%m/%d/%Y')
print(df)
Output:
date people
2 06/05/2015 0
0 06/08/2015 1
1 07/10/2018 2
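If the strings must stay exactly as they were (6/5/2015 rather than 06/05/2015), one possibility is to sort by a parsed key without converting the column at all. A sketch assuming pandas 1.1 or later, where sort_values accepts a key argument; the name df_sorted is mine:

```python
import pandas as pd

df = pd.DataFrame({"people": [1, 2, 0], "date": ["6/8/2015", "7/10/2018", "6/5/2015"]})

# key= (pandas 1.1+) sorts by the parsed dates while the column itself
# keeps its original strings, including the missing zero padding.
df_sorted = df.sort_values(
    "date", key=lambda s: pd.to_datetime(s, format="%m/%d/%Y")
)
print(df_sorted)
```

The key function receives the whole column as a Series and must return something of the same length, which pd.to_datetime does.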
Try changing the 'date' column to pandas Datetime and then sort
import pandas as pd
df = pd.DataFrame({'people': [1, 1, 1, 2],
                   'date': ['4/12/1961', '5/5/1961', '7/21/1961', '8/6/1961']})
df['date'] =pd.to_datetime(df.date)
df.sort_values(by='date')
Output:
date people
1961-04-12 1
1961-05-05 1
1961-07-21 1
1961-08-06 2
To get back the initial format:
df['date']=df['date'].dt.strftime('%m/%d/%y')
Output:
date people
04/12/61 1
05/05/61 1
07/21/61 1
08/06/61 2
Why not simply:
dataset[SortBy["date"]]
Can you show what you tried, or what your data structure looks like?
If you need to sort in reverse order, use:
dataset[SortBy["date"]][Reverse]