Pandas Dataframe keep rows where date is between two dates (separate columns) - python

I have a dataframe that looks similar to this:
Price From To
300€ 2020-01-01 2020-01-07
250€ 2020-01-04 2020-01-08
150€ 2020-02-01 2020-02-04
350€ 2020-02-04 2020-02-08
And then I have a list of dates, for example: dates = ['2020-01-03', '2020-02-04']
I would like to keep only the rows of the dataframe where the dates are in between the From column and the To column.
So, after transformation I would have the following dataframe.
Price From To
300€ 2020-01-01 2020-01-07
150€ 2020-02-01 2020-02-04
350€ 2020-02-04 2020-02-08
First I thought of using a lambda with apply, but I figured that would not be very efficient because my dataset is very large. Is there a simpler way to do this with pandas?
The result should be contained in one single dataframe.

Let's try with numpy broadcasting:
import numpy as np
# From and To must already be datetime64 columns
x, y = df[['From', 'To']].values.T
a = np.array(['2020-01-03', '2020-02-04'], dtype=np.datetime64)
# broadcast every interval against every date; keep rows where any date falls inside
mask = ((x[:, None] <= a) & (y[:, None] >= a)).any(1)
df[mask]
Price From To
0 300€ 2020-01-01 2020-01-07
2 150€ 2020-02-01 2020-02-04
3 350€ 2020-02-04 2020-02-08
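A self-contained version of the broadcasting approach, assuming the From/To columns arrive as strings and need converting first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Price': ['300€', '250€', '150€', '350€'],
    'From': ['2020-01-01', '2020-01-04', '2020-02-01', '2020-02-04'],
    'To':   ['2020-01-07', '2020-01-08', '2020-02-04', '2020-02-08'],
})
# The comparisons only work on datetimes, so convert the string columns first
df['From'] = pd.to_datetime(df['From'])
df['To'] = pd.to_datetime(df['To'])

x, y = df[['From', 'To']].values.T
a = np.array(['2020-01-03', '2020-02-04'], dtype=np.datetime64)
# (n_rows, 1) vs (n_dates,) broadcasts to an (n_rows, n_dates) grid;
# any(1) keeps a row if at least one date falls inside its interval
mask = ((x[:, None] <= a) & (y[:, None] >= a)).any(1)
result = df[mask]
```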

One option is with Pandas IntervalIndex:
dates = ['2020-01-03', '2020-02-04']
dates = pd.to_datetime(dates)
intervals = pd.IntervalIndex.from_arrays(df.From, df.To, closed='both')
df.iloc[intervals.get_indexer_for(dates)] # repeated positions can be dropped with pd.unique
Price From To
0 300€ 2020-01-01 2020-01-07
2 150€ 2020-02-01 2020-02-04
3 350€ 2020-02-04 2020-02-08
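A runnable version of the IntervalIndex approach on the example data:

```python
import pandas as pd

df = pd.DataFrame({
    'Price': ['300€', '250€', '150€', '350€'],
    'From': pd.to_datetime(['2020-01-01', '2020-01-04', '2020-02-01', '2020-02-04']),
    'To':   pd.to_datetime(['2020-01-07', '2020-01-08', '2020-02-04', '2020-02-08']),
})
dates = pd.to_datetime(['2020-01-03', '2020-02-04'])

# closed='both' makes each interval include both endpoints
intervals = pd.IntervalIndex.from_arrays(df.From, df.To, closed='both')
# get_indexer_for returns one position per match, so a date that falls
# inside two overlapping intervals yields both rows
result = df.iloc[intervals.get_indexer_for(dates)]
```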

How to reverse a pandas series by index without changing the order of values

I have a pandas series that looks like:
Date
2020-01-02 74.573036
2020-01-03 73.848030
2020-01-06 74.436470
2020-01-07 74.086395
2020-01-08 75.278160
2020-01-09 76.877136
2020-01-10 77.050926
2020-01-13 78.697075
2020-01-14 77.634407
2020-01-15 77.301704
2020-01-16 78.270020
2020-01-17 79.136551
2020-01-21 78.600250
2020-01-22 78.880821
2020-01-23 79.260696
2020-01-24 79.032265
2020-01-27 76.708298
2020-01-28 78.878326
2020-01-29 80.529434
2020-01-30 80.412743
2020-01-31 76.847343
2020-02-03 76.636299
2020-02-04 79.166336
2020-02-05 79.811897
2020-02-06 80.745445
2020-02-07 79.647896
2020-02-10 80.026192
2020-02-11 79.543365
2020-02-12 81.432350
2020-02-13 80.852463
2020-02-14 80.872375
2020-02-18 79.391556
2020-02-19 80.541367
2020-02-20 79.715088
2020-02-21 77.910744
2020-02-24 74.209946
2020-02-25 71.696297
2020-02-26 72.833664
2020-02-27 68.072662
2020-02-28 68.032837
How do I reverse the whole series so that the latest date is in the first row, without changing which value goes with which date? (Each index and value should stick together.)
Let df be your data (a series):
df = df.to_frame().reset_index()
# reverse only the Date column; the values stay in place
df['Date'] = df.Date.values[::-1]
ds is your pandas series. To reverse the date index while keeping each value with its date (index and values stick together), you can do:
ds = ds[::-1]
# [::-1] reverses the whole series; a specific date range can also be reversed:
ds['2020-01-07':'2020-01-02':-1]
To reverse the date index but keep the data values in the same location you can do:
ds.index = ds.index.values[::-1]
To just reverse the data values but not the date index:
# update() aligns on the index, so wrap the reversed values in a Series with the original index
ds.update(pd.Series(ds.values[::-1], index=ds.index))
# or simply rebuild the series with the values reversed
ds = pd.Series(ds.values[::-1], ds.index)
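A small sketch, on the first three rows of the series above, contrasting the three variants:

```python
import pandas as pd

ds = pd.Series([74.573036, 73.848030, 74.436470],
               index=pd.to_datetime(['2020-01-02', '2020-01-03', '2020-01-06']))

# Reverse rows, keeping each date paired with its value
rev = ds[::-1]

# Reverse only the index, leaving the values in place
ds_idx = ds.copy()
ds_idx.index = ds_idx.index.values[::-1]

# Reverse only the values, leaving the index in place
ds_val = pd.Series(ds.values[::-1], ds.index)
```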

Pandas groupby keep rows according to ranking

I have this dataframe:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-01 0.603989 S2B-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
I would like to group by the date column, keeping rows according to a ranking/importance of my choosing from the source column. For example, my ranking is L8-SR > S2B-SR > GP6_r, meaning that for all rows with the same date, keep the row where source == 'L8-SR'; if none contains 'L8-SR', keep the row where source == 'S2B-SR', and so on. How can I accomplish that with a pandas groupby?
Output should look like this:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 S2A-SR
2 2020-03-01 0.657453 L8-SR
3 2020-03-11 0.717264 S2B-SR
4 2020-04-02 0.737118 L8-SR
Let's try category dtype and drop_duplicates:
orders = ['L8-SR','S2B-SR','GP6_r']
df.source = df.source.astype('category')
# set_categories returns a new Series, so assign it back.
# Note: sources not listed in `orders` (here S2A-SR) become NaN and sort last.
df.source = df.source.cat.set_categories(orders, ordered=True)
df.sort_values(['date','source']).drop_duplicates(['date'])
Output:
date value source
0 2020-02-14 0.438767 L8-SR
1 2020-02-15 0.422867 NaN
2 2020-03-01 0.657453 L8-SR
4 2020-03-11 0.717264 S2B-SR
5 2020-04-02 0.737118 L8-SR
To keep the labels of unranked sources such as S2A-SR, append them to the end of orders.
Try the code below for the groupby operation; for ordering afterwards you can apply sort_values:
# Import pandas library
import pandas as pd
# Data dictionary containing the data shown in the table
pandasdata_dict = {'date': ['2020-02-14', '2020-02-15', '2020-03-01', '2020-03-01', '2020-03-11', '2020-04-02'],
                   'value': [0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
                   'source': ['L8-SR', 'S2A-SR', 'L8-SR', 'S2B-SR', 'S2B-SR', 'L8-SR']}
# Convert the dictionary to a data frame
df = pd.DataFrame(pandasdata_dict)
# Display the data frame
df
# Convert the date field to datetime
df['date'] = pd.to_datetime(df['date'])
# Once conversion is done, group by the date field; this yields a GroupBy
# object that still needs an aggregation or filter to pick rows
df.groupby([df['date'].dt.date])
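The groupby above stops at a GroupBy object; a minimal sketch of one way to complete the rank-based selection, where the rank dict is an illustrative mapping built from the ranking in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-02-14', '2020-02-15', '2020-03-01',
                            '2020-03-01', '2020-03-11', '2020-04-02']),
    'value': [0.438767, 0.422867, 0.657453, 0.603989, 0.717264, 0.737118],
    'source': ['L8-SR', 'S2A-SR', 'L8-SR', 'S2B-SR', 'S2B-SR', 'L8-SR'],
})

# Map each source to its rank; sources not in the mapping sort last
rank = {'L8-SR': 0, 'S2B-SR': 1, 'GP6_r': 2}
order = df['source'].map(rank).fillna(len(rank))
# Sort each date's rows by rank, then keep the first (best-ranked) row per date
result = (df.assign(_rank=order)
            .sort_values(['date', '_rank'])
            .drop_duplicates('date')
            .drop(columns='_rank'))
```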

Convert date format to 'Month-Day-Year'

I have a column of dates with 2 million rows. The format is 'Year-Month-Day', e.g. '2019-11-28'. Each time I load the file I have to convert the column (which takes a long time) with:
pd.to_datetime(df['old_date'])
I would like to rearrange the order to 'Month-Day-Year' so that I wouldn't have to change the format of the column each time I load it.
I tried doing:
df_1['new_date'] = df_1['old_date'].dt.month+'-'+df_1['old_date'].dt.day+'-'+df_1['old_date'].dt.year
But I received the following error: 'unknown type str32'
Could anyone help me?
Thanks!
You could use pandas.Series.dt.strftime (documentation) to change the format of your dates. (Your attempt raised an error because .dt.month, .dt.day and .dt.year return integers, which can't be concatenated with '-' strings.) In the code below I start from a column with your old-format dates and create a new column with this method:
import pandas as pd
df = pd.DataFrame({'old format': pd.date_range(start = '2020-01-01', end = '2020-06-30', freq = 'd')})
df['new format'] = df['old format'].dt.strftime('%m-%d-%Y')
Output:
old format new format
0 2020-01-01 01-01-2020
1 2020-01-02 01-02-2020
2 2020-01-03 01-03-2020
3 2020-01-04 01-04-2020
4 2020-01-05 01-05-2020
5 2020-01-06 01-06-2020
6 2020-01-07 01-07-2020
7 2020-01-08 01-08-2020
8 2020-01-09 01-09-2020
9 2020-01-10 01-10-2020
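Note that strftime produces plain strings, so reading them back later still requires parsing. A small sketch, assuming the '%m-%d-%Y' format above, showing that an explicit format avoids per-element inference (which is what makes to_datetime slow on millions of rows):

```python
import pandas as pd

s = pd.Series(['01-01-2020', '11-28-2019'])
# An explicit format skips format inference and is much faster on large columns
parsed = pd.to_datetime(s, format='%m-%d-%Y')
```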

How to make a pandas series whose index is every day of 2020

I would like to make an empty pandas series with a date index which is every day of 2020. That is 01-01-2020, 02-01-2020 etc.
Although this looks very simple, I couldn't find out how to do it.
Use date_range:
range_2020 = pd.date_range("2020-01-01", "2020-12-31", freq="D")
pd.DataFrame(range(366), index=range_2020)
The output is:
0
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
...
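Since the question asks for an empty series rather than a frame, a minimal sketch of the same idea with pd.Series:

```python
import pandas as pd

days_2020 = pd.date_range('2020-01-01', '2020-12-31', freq='D')
# An empty (all-NaN) series indexed by every day of 2020;
# 2020 is a leap year, so there are 366 entries
s = pd.Series(index=days_2020, dtype='float64')
```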

Pandas: use array index all values

I want to select all rows with a particular index. My DataFrame look like this:
>>> df
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
Selecting a single value of the first index (Patient) works:
>>> df.loc[1]
Code
Patient Date
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
But selecting multiple values of the first index (Patient) does not:
>>> df.loc[[1, 2]]
Code
Patient Date
1 2003-01-12 00:00:00 a
2 2001-1-17 22:00:00 z
However, I would like to get the entire dataframe (as I would get with [1, 1, 1, 2], i.e. the original dataframe).
When using a single index it works fine. For example:
>>> df.reset_index().set_index("Patient").loc[[1, 2]]
Date Code
Patient
1 2003-01-12 00:00:00 a
2003-02-13 00:00:00 b
2003-02-14 00:00:00 ba
2 2001-1-17 22:00:00 z
2002-1-21 00:00:00 d
2003-1-21 00:00:00 a
2005-12-1 00:00:00 ba
TL;DR Why do I have to repeat the index when using multiple indexes but not when I use a single index?
EDIT: Apparently it can be done like this:
>>> df.loc[df.index.get_level_values("Patient").isin([1, 2])]
But this seems quite dirty to me. Is this the way - or is any other, better, way possible?
As of pandas 0.14, the recommended way is:
df.loc[([1, 2],), :]
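A self-contained sketch of that selection on a small MultiIndex frame mirroring the example data:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, '2003-01-12'), (1, '2003-02-13'), (1, '2003-02-14'),
     (2, '2001-01-17'), (2, '2002-01-21'), (2, '2003-01-21'), (2, '2005-12-01')],
    names=['Patient', 'Date'])
df = pd.DataFrame({'Code': ['a', 'b', 'ba', 'z', 'd', 'a', 'ba']}, index=idx)

# A list inside a tuple selects every second-level row for those first-level labels
result = df.loc[([1, 2],), :]
```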
