Efficient pandas between() for any of multiple values - python

I have a pandas DataFrame looking like this
data = [["2020-01-01", "2020-01-01"], ["2020-01-02", "2020-01-04"], ["2020-01-05", "2020-01-06"]]
df = pd.DataFrame(data, columns=["START", "END"]).astype({"END": "datetime64[ns]" , "START": "datetime64[ns]"})
       START        END
0 2020-01-01 2020-01-01
1 2020-01-02 2020-01-04
2 2020-01-05 2020-01-06
and a Series/numpy array of datetime64[ns] like this:
timestamps = pd.Series(["2020-01-02", "2020-01-03"], dtype="datetime64[ns]")
For every row of df I want to know if there is at least one value in timestamps which lies between START and END.
I can do the following
df["START"].apply(lambda x: (timestamps >= x).any()) & df["END"].apply(lambda x: (timestamps <= x).any())
resulting in [False, True, False], but is there a more performant or built-in way without using df.apply?
EDIT:
Actually, my solution using apply is incorrect, because if we had
timestamps = pd.Series(["2019-01-01", "2021-01-01"], dtype="datetime64[ns]")
the output would be [True, True, True], which is obviously wrong. The accepted answer, however, does produce a correct result.

We can broadcast the values in the START and END columns against timestamps to create a 2-D boolean mask, then reduce the mask along axis=1:
t = timestamps.values
((df['START'].values[:, None] <= t) & (df['END'].values[:, None] >= t)).any(1)
array([False, True, False])
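To make the broadcasting explicit, here is a self-contained sketch of the accepted approach; the intermediate mask is named mask purely for illustration:

```python
import pandas as pd

# Sample data from the question
data = [["2020-01-01", "2020-01-01"], ["2020-01-02", "2020-01-04"], ["2020-01-05", "2020-01-06"]]
df = pd.DataFrame(data, columns=["START", "END"]).astype({"START": "datetime64[ns]", "END": "datetime64[ns]"})
timestamps = pd.Series(["2020-01-02", "2020-01-03"], dtype="datetime64[ns]")

t = timestamps.values
# Shape (3, 1) against shape (2,) broadcasts to a (3, 2) mask:
# one row per interval, one column per timestamp
mask = (df["START"].values[:, None] <= t) & (df["END"].values[:, None] >= t)
result = mask.any(axis=1)  # True where the interval contains at least one timestamp
```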

Use IntervalIndex.overlaps: if possible, create an Interval from the minimal and maximal values of timestamps:
s = pd.IntervalIndex.from_arrays(df['START'],
                                 df['END'],
                                 closed='both')
i = pd.Interval(timestamps.min(), timestamps.max(), closed='both')
out = s.overlaps(i)
print(out)
[False  True False]
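One caveat worth noting: overlaps tests each row against the single span [timestamps.min(), timestamps.max()], so when the timestamps have gaps it can report True for a row that contains no actual timestamp. A sketch using the timestamps from the EDIT and the sample df from the question:

```python
import pandas as pd

data = [["2020-01-01", "2020-01-01"], ["2020-01-02", "2020-01-04"], ["2020-01-05", "2020-01-06"]]
df = pd.DataFrame(data, columns=["START", "END"]).astype({"START": "datetime64[ns]", "END": "datetime64[ns]"})
timestamps = pd.Series(["2019-01-01", "2021-01-01"], dtype="datetime64[ns]")

# Interval approach: every row overlaps the span [2019-01-01, 2021-01-01]
s = pd.IntervalIndex.from_arrays(df["START"], df["END"], closed="both")
i = pd.Interval(timestamps.min(), timestamps.max(), closed="both")
span = s.overlaps(i)

# Broadcast approach: checks each individual timestamp, not the span
t = timestamps.values
exact = ((df["START"].values[:, None] <= t) & (df["END"].values[:, None] >= t)).any(1)
```

Here span is [True, True, True] while exact is [False, False, False]; the two only agree when the min-to-max span of timestamps has no relevant gaps.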

Related

How to perform boolean AND on matrix using vector in Pandas?

I would like to perform a boolean AND operation on a matrix using a vector. For example, given:
matrix = pd.DataFrame([[True, False], [True, False], [True, False]], columns=["A", "B"])
vector = pd.Series([False, False, False])
The result would be column-wise boolean AND, like:
result = pd.DataFrame([[False, False], [False, False], [False, False]], columns=["A", "B"])
I was able to achieve that using a loop, but I'm wondering - is there a more elegant way to do that?
I would drop down to numpy; that way you can avoid the loop and broadcast over the correct axis, then reconstruct the DataFrame:
import pandas as pd
pd.DataFrame(matrix.to_numpy() & vector.to_numpy()[:, None],
             columns=matrix.columns,
             index=matrix.index)
A B
0 False False
1 False False
2 False False
Alternatively you can transpose the DataFrame, allowing a simple & comparison, and then transpose the result back. This might get slow for large DataFrames:
(matrix.T & vector).T
Use:
matrix.apply(lambda x: x & vector)
Output
A B
0 False False
1 False False
2 False False
It is implied here that the axis parameter of the apply function is 0 for columnwise apply.
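The three approaches above can be cross-checked against each other; a self-contained sketch using the question's matrix and vector:

```python
import pandas as pd

matrix = pd.DataFrame([[True, False], [True, False], [True, False]], columns=["A", "B"])
vector = pd.Series([False, False, False])

# numpy broadcasting: a trailing axis makes the vector AND row-wise
r1 = pd.DataFrame(matrix.to_numpy() & vector.to_numpy()[:, None],
                  columns=matrix.columns, index=matrix.index)

# transpose so the Series aligns with the columns, then transpose back
r2 = (matrix.T & vector).T

# column-wise apply: each column is ANDed with the vector by index
r3 = matrix.apply(lambda x: x & vector)

print(r1)
```

All three produce the same all-False frame for this input; the numpy version avoids per-column Python-level calls, which matters on wide frames.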

Pandas Dataframe - for each row, return count of other rows with overlapping dates

I've got a dataframe with projects, start dates, and end dates. For each row I would like to return the number of other projects in process when the project started. How do you nest loops when using df.apply()? I've tried using a for loop but my dataframe is large and it takes way too long.
import datetime as dt
import pandas as pd

data = {'project': ['A', 'B', 'C'],
        'pr_start_date': [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
        'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)]}
df = pd.DataFrame(data)

def cons_overlap(start):
    overlaps = 0
    for i in df.index:
        other_start = df.loc[i, 'pr_start_date']
        other_end = df.loc[i, 'pr_end_date']
        if (start > other_start) & (start < other_end):
            overlaps += 1
    return overlaps

df['overlap'] = df.apply(lambda row: cons_overlap(row['pr_start_date']), axis=1)
This is the output I'm looking for:
pr pr_start_date pr_end_date overlap
0 A 2018-09-01 2019-06-15 0
1 B 2019-04-01 2019-12-01 1
2 C 2019-06-08 2019-08-01 2
I suggest you take advantage of numpy broadcasting:
ends = df.pr_start_date.values < df.pr_end_date.values[:, None]
starts = df.pr_start_date.values > df.pr_start_date.values[:, None]
df['overlap'] = (ends & starts).sum(0)
print(df)
Output
project pr_start_date pr_end_date overlap
0 A 2018-09-01 2019-06-15 0
1 B 2019-04-01 2019-12-01 1
2 C 2019-06-08 2019-08-01 2
Both ends and starts are 3x3 matrices that are True where the condition is met:
# ends
[[ True True True]
[ True True True]
[ True True True]]
# starts
[[False True True]
[False False True]
[False False False]]
Then take the intersection with the logical & and sum down the columns (sum(0)).
It should be faster than your for loop.
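Put together, a runnable version of the broadcast approach with the sample data from the question:

```python
import datetime as dt
import pandas as pd

data = {'project': ['A', 'B', 'C'],
        'pr_start_date': [dt.datetime(2018, 9, 1), dt.datetime(2019, 4, 1), dt.datetime(2019, 6, 8)],
        'pr_end_date': [dt.datetime(2019, 6, 15), dt.datetime(2019, 12, 1), dt.datetime(2019, 8, 1)]}
df = pd.DataFrame(data)

# ends[i, j]: project j started before project i ended
ends = df.pr_start_date.values < df.pr_end_date.values[:, None]
# starts[i, j]: project j started after project i started
starts = df.pr_start_date.values > df.pr_start_date.values[:, None]
# column j counts the projects already running when j started
df['overlap'] = (ends & starts).sum(0)
```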
I assume the rows are sorted by start date and, for each row, count the previously started projects that have not yet completed. df.index.get_loc(r.name) yields the positional index of the row being processed.
df["overlap"] = df.apply(lambda r: df.loc[:df.index.get_loc(r.name), "pr_end_date"].gt(r["pr_start_date"]).sum() - 1, axis=1)

Get a boolean vector from comparing columns using conditions

Given a matrix with column 0 and 1 that needs to be compared:
>>> import pandas as pd
>>> df = pd.DataFrame([[3,4,'a'],[3,5,'b'],[9,2,'a']])
>>> df
0 1 2
0 3 4 a
1 3 5 b
2 9 2 a
The goal is to compare the first and the second columns such that the result tells whether a certain condition holds, e.g. whether the values in column df[0] are lower than those in df[1]. The desired output would look as such:
[True, True, False]
I've tried to use np.where with a condition, but it seems like it returns the values that fit the condition instead:
>>> import numpy as np
>>> np.where(df[0] < df[1], df[0], df[1])
array([3, 3, 2])
I could do this, but I'm sure there's a simpler way to use numpy or pandas function to get the boolean vector:
[row[0] < row[1] for idx, row in df.iterrows()]
Is that what you want?
import numpy as np
df = np.array([[3, 4, 'a'], [3, 5, 'b'], [9, 2, 'a']])
df[0, :] == df[1, :]
# output: array([ True, False, False], dtype=bool)
df[0, :] < df[1, :]
# output: array([False,  True,  True], dtype=bool)
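Note the snippet above builds a string array and compares rows 0 and 1; for the column comparison the question asks about, pandas compares two columns element-wise directly. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[3, 4, 'a'], [3, 5, 'b'], [9, 2, 'a']])

# Comparing two columns yields a boolean Series, no iterrows needed
result = df[0] < df[1]
print(result.tolist())  # [True, True, False]
```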

Select DataFrame data base on series value

I have a pandas' DataFrame and when I perform an operation on the dataframe, I get back a series. How can I use that series to select out only records where I find a match?
Right now I'm appending the column onto the DataFrame and doing a query against the dataframe then dropping the column. I really do not like this solution though, so I'm hoping I can get a better solution.
data = [[1,2,3], [1,3,4], [3,4,5]]
columns = ['a', 'b', 'c']
df = pd.DataFrame(data, columns=columns)
series = df.myoperation()
df['myoperation'] = series
res = df[df['myoperation'] == True]
del res['myoperation']
The series object will produce a 1-1 match, so index item 1 will match item 1 in the dataframe object.
Above is my hacky code to get it done, but I'm afraid that when the dataframe has many columns or a lot more data than this simple example, it will be slow.
Thank you
You can use boolean indexing if series is a boolean Series with the same index and length as df:
series = pd.Series([True, False, True], index=df.index)
res = df[series]
print (res)
a b c
0 1 2 3
2 3 4 5
It also works with a boolean list or numpy array; only the length has to match df:
L = [True, False, True]
res = df[L]
print (res)
a b c
0 1 2 3
2 3 4 5
arr = np.array([True, False, True])
res = df[arr]
print (res)
a b c
0 1 2 3
2 3 4 5
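Putting it together with the question's df and a hypothetical stand-in for df.myoperation() (any expression returning a boolean Series aligned with df.index works):

```python
import pandas as pd

data = [[1, 2, 3], [1, 3, 4], [3, 4, 5]]
df = pd.DataFrame(data, columns=['a', 'b', 'c'])

# Hypothetical stand-in for df.myoperation()
mask = df['c'].isin([3, 5])

# Boolean indexing: no temporary column, no del needed
res = df[mask]
print(res)
```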

selecting pandas rows: why isin() and not i in mylist?

I have a dataframe with a multi-index and need to select only the rows where the first index is not in a list. This works:
df = df.iloc[~(df.index.get_level_values(0).isin(mylist))]
This doesn't:
df = df.iloc[(df.index.get_level_values(0) not in mylist)]
I get an error about the truth value of the array.
Why? What does it mean? Is it documented in the official docs?
Say, you have a dataframe df as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(30).reshape((6,5)))
tuples = [(i//2, i%2) for i in range(6)]
df.index = pd.MultiIndex.from_tuples(tuples)
print(df)
            0         1         2         3         4
0 0  0.623671  0.335741  0.035219  0.902372  0.349697
  1  0.487387  0.325101  0.361753  0.935972  0.425735
1 0  0.147836  0.599608  0.888232  0.712804  0.604688
  1  0.156712  0.286682  0.680316  0.104996  0.389507
2 0  0.212923  0.580401  0.024150  0.712987  0.803497
  1  0.804538  0.035597  0.611101  0.328159  0.140793
df.index.get_level_values(0) will return an array: Int64Index([0, 0, 1, 1, 2, 2], dtype='int64')
The error says that by using in operator it is not clear whether you want to check all elements in that array are in the list, or any element in that array is in the list. You are comparing the array against the whole list. What you want is the element-wise comparison and in does not do that. Even if it was clear, it would return a single value. If you try df.index.get_level_values(0).isin([0,1]), on the other hand, it will return an array of boolean values: array([ True, True, True, True, False, False], dtype=bool) so it will check first whether 0 is in the list, whether second 0 is in the list, whether 1 is in the list... And then those boolean values will be used to slice the dataframe (i.e. show me only the rows where the array has True value).
In [12]: df.iloc[[True, True, True, True, False, False]]
Out[12]:
            0         1         2         3         4
0 0  0.623671  0.335741  0.035219  0.902372  0.349697
  1  0.487387  0.325101  0.361753  0.935972  0.425735
1 0  0.147836  0.599608  0.888232  0.712804  0.604688
  1  0.156712  0.286682  0.680316  0.104996  0.389507
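The ambiguity is easy to reproduce; a sketch with the df above (rebuilt here so it runs standalone):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(30).reshape(6, 5))
df.index = pd.MultiIndex.from_tuples([(i // 2, i % 2) for i in range(6)])
mylist = [0, 1]

# isin: element-wise membership test, returns a boolean mask
mask = df.index.get_level_values(0).isin(mylist)

# `in` needs a single truth value for the whole array -> ValueError
try:
    df.index.get_level_values(0) in mylist
except ValueError as e:
    print("ValueError:", e)
```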
