label intervals based on other intervals in pandas [duplicate] - python

This question already has answers here:
Add/fill pandas column based on range in rows from another dataframe
(3 answers)
Closed 3 years ago.
I have two dataframes, a and b.
b has a datetime index, while a has Start and End datetime columns.
I need to set 'Label' to True for all rows of b whose index falls within any [Start, End] interval from a.
Right now I'm doing:
for _, r in a.iterrows():
    b.loc[np.logical_and(b.index >= r.Start,
                         b.index <= r.End), 'Label'] = True
but this is extremely slow when b is large.
How can I optimize this code?
MVCE:
b = pd.DataFrame(index=[pd.Timestamp('2017-01-01'), pd.Timestamp('2018-01-01')],
                 columns=['Label'])
a = pd.DataFrame.from_dict([{'Start': pd.Timestamp('2018-01-01'),
                             'End': pd.Timestamp('2020-01-01')}])
EDIT:
the solution at
Add/fill pandas column based on range in rows from another dataframe
does not work for me (they use range to fill the intervals, while we are working on datetimes).

Here's one solution using apply -
Dummy CSV data
Date,Start,End
01-08-2019,01-02-2019,01-10-2019
01-08-2019,01-02-2020,01-10-2020
Code
df = pd.read_csv('dummy.csv').apply(pd.to_datetime)
df.apply(lambda r: r['Start'] < r['Date'] < r['End'], axis=1)
Result
0 True
1 False
dtype: bool

How about doing something like this?
def func(date):
    # date is a single timestamp from b.index
    mask = (a['Start'] <= date) & (a['End'] >= date)
    return mask.any()

b['Label'] = b.index.to_series().apply(func)
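Another way to speed up the loop in the question is to build the combined mask in one pass and assign it once, instead of writing through .loc on every iteration; this assumes a is much smaller than b. A sketch using the MVCE:

```python
import numpy as np
import pandas as pd

# the question's MVCE
b = pd.DataFrame(index=[pd.Timestamp('2017-01-01'), pd.Timestamp('2018-01-01')],
                 columns=['Label'])
a = pd.DataFrame([{'Start': pd.Timestamp('2018-01-01'),
                   'End': pd.Timestamp('2020-01-01')}])

# one vectorized comparison over b per row of a, OR-ed together,
# then a single column assignment instead of repeated .loc writes
mask = np.logical_or.reduce([(b.index >= s) & (b.index <= e)
                             for s, e in zip(a['Start'], a['End'])])
b['Label'] = mask
print(b['Label'].tolist())  # [False, True]
```

The loop over a remains, but each iteration is a vectorized operation over b, and b is only written once at the end.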

Related

Pandas Styler conditional formatting ( red highlight) on last two rows of a dataframe based off column value [duplicate]

I've been trying to print a Pandas dataframe to HTML with entire rows highlighted when the value of one specific column for that row is over a threshold. I've looked through the Pandas Styler slicing docs and tried to adapt the highlight_max function for this use, but I seem to be failing miserably; if I try, say, to replace the is_max with a check for whether a given row's value is above said threshold (e.g., something like
is_x = df['column_name'] >= threshold
), it isn't apparent how to properly pass such a thing or what to return.
I've also tried to simply define it elsewhere using df.loc, but that hasn't worked too well either.
Another concern also came up: If I drop that column (currently the criterion) afterwards, will the styling still hold? I am wondering if a df.loc would prevent such a thing from being a problem.
This solution allows for you to pass a column label or a list of column labels to highlight the entire row if that value in the column(s) exceeds the threshold.
import pandas as pd
import numpy as np
np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
               axis=1)
df.iloc[0, 2] = np.nan
def highlight_greaterthan(s, threshold, column):
    is_max = pd.Series(data=False, index=s.index)
    is_max[column] = s.loc[column] >= threshold
    return ['background-color: yellow' if is_max.any() else '' for v in is_max]

df.style.apply(highlight_greaterthan, threshold=1.0, column=['C', 'B'], axis=1)
Output:
Or for one column
df.style.apply(highlight_greaterthan, threshold=1.0, column='E', axis=1)
Here is a simpler approach:
Assume you have a 100 x 10 dataframe, df. Also assume you want to highlight all the rows corresponding to a column, say "duration", greater than 5.
You first need to define a function that highlights the cells. The real trick is that you need to return a row, not a single cell. For example:
def highlight(s):
    if s.duration > 5:
        return ['background-color: yellow'] * len(s)
    else:
        return ['background-color: white'] * len(s)
Note that the return value must be a list of 10 entries (one per column). This is the key part.
Now you can apply this to the dataframe style as:
df.style.apply(highlight, axis=1)
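As a quick sanity check of this approach, here is the highlight function applied to a small made-up frame (the column names and the threshold of 5 are just for illustration):

```python
import pandas as pd

# made-up sample frame with a 'duration' column
df = pd.DataFrame({'duration': [3, 7, 2], 'value': [10, 20, 30]})

def highlight(s):
    # return one CSS string per column of the row
    if s.duration > 5:
        return ['background-color: yellow'] * len(s)
    else:
        return ['background-color: white'] * len(s)

# df.style.apply(highlight, axis=1) then renders the duration-7 row in yellow
print(highlight(df.iloc[1]))  # ['background-color: yellow', 'background-color: yellow']
```

Because the styling is computed row by row at render time, each row gets exactly one list of CSS strings, matching the column count.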
Assume you have the following dataframe and you want to highlight the rows where id is greater than 3 to red
id char date
0 0 s 2022-01-01
1 1 t 2022-02-01
2 2 y 2022-03-01
3 3 l 2022-04-01
4 4 e 2022-05-01
5 5 r 2022-06-01
You can try Styler.set_properties with pandas.IndexSlice
# Subset your original dataframe with condition
df_ = df[df['id'].gt(3)]
# Pass the subset dataframe index and column to pd.IndexSlice
slice_ = pd.IndexSlice[df_.index, df_.columns]
s = df.style.set_properties(**{'background-color': 'red'}, subset=slice_)
s.to_html('test.html')
You can also try Styler.apply with axis=None which passes the whole dataframe.
def styler(df):
    color = 'background-color: {}'.format
    mask = pd.concat([df['id'].gt(3)] * df.shape[1], axis=1)
    style = np.where(mask, color('red'), color('green'))
    return style
s = df.style.apply(styler, axis=None)

Pandas DataFrame: multiply values in a column, based on condition [duplicate]

This question already has an answer here:
Pandas: update a column with an if statement
(1 answer)
Closed 3 years ago.
Hi, I have a DataFrame column like the following:
dataframe['BETA'], which has float numbers between 0 and 100.
I need all the numbers to have the same number of digits. Example:
dataframe['BETA']:
[0] 0.11 to [0] 110
[1] 1.54 to [1] 154
[2] 22.1 to [2] 221
I tried to change them one by one, but it's a super inefficient process:
for i in range(len(df_ld)):
    nbeta = df_ld['BETA'][i]
    if nbeta < 1:
        df_ld.loc[i, 'BETA'] = nbeta * 1000
    if (nbeta >= 1) and (nbeta <= 10):
        df_ld.loc[i, 'BETA'] = nbeta * 100
    if (nbeta > 10) and (nbeta <= 100):
        df_ld.loc[i, 'BETA'] = nbeta * 10
Note: the dataframe has more than 80k rows.
Please help!
Edited: Solution
numpy.select
import numpy as np
x = df_ld['BETA']
condlist = [x < 1, (x >= 1) & (x < 10), (x >= 10) & (x < 100)]
choicelist = [x * 1000, x * 100, x * 10]
output = np.select(condlist, choicelist)
df_ld.insert(4, 'BETA3', output, True)
Thank you!
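As a check, the np.select solution run on the three sample values from the question gives the expected scaling (the round is only there to hide float rounding noise):

```python
import numpy as np
import pandas as pd

df_ld = pd.DataFrame({'BETA': [0.11, 1.54, 22.1]})
x = df_ld['BETA']
condlist = [x < 1, (x >= 1) & (x < 10), (x >= 10) & (x < 100)]
choicelist = [x * 1000, x * 100, x * 10]
# each value is scaled by the factor matching its magnitude, in one vectorized call
df_ld['BETA3'] = np.select(condlist, choicelist)
print(df_ld['BETA3'].round(0).tolist())  # [110.0, 154.0, 221.0]
```

np.select evaluates all choices but picks, per element, the one belonging to the first true condition, so no Python-level loop over the 80k rows is needed.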
Try this.
I'm guessing your dataframe is called df_ld and your target column is df_ld['BETA'].
def multiply(column):
    newcol = []
    for item in column:
        # elif keeps a rescaled value from matching (and being appended) twice
        if item < 1:
            newcol.append(item * 1000)
        elif item <= 10:
            newcol.append(item * 100)
        elif item <= 100:
            newcol.append(item * 10)
    return newcol

# apply function and create new column
df_ld['newcol'] = multiply(df_ld['BETA'])

Keep upper n rows of a pandas dataframe based on condition

How would I delete all rows from a dataframe that come after a certain condition is fulfilled? As an example, I have the following dataframe:
import pandas as pd
xEnd=1
yEnd=2
df = pd.DataFrame({'x':[1,1,1,2,2,2], 'y':[1,2,3,3,4,3], 'id':[0,1,2,3,4,5]})
How would I get a dataframe that drops the last 4 rows and keeps the upper 2, given that in row 2 the condition x == xEnd and y == yEnd is fulfilled?
EDIT: I should have mentioned that the dataframe is not necessarily ascending. It could also be descending, and I would still like to get the upper rows.
To slice your dataframe until the first time a condition across 2 series are satisfied, first calculate the required index and then slice via iloc.
You can calculate the index via set_index, isin and np.ndarray.argmax:
idx = df.set_index(['x', 'y']).index.isin([(xEnd, yEnd)]).argmax()
res = df.iloc[:idx+1]
print(res)
x y id
0 1 1 0
1 1 2 1
If you need better performance, see Efficiently return the index of the first value satisfying condition in array.
I'm not 100% sure I understand correctly, but you can filter your dataframe like this:
df[(df.x <= xEnd) & (df.y <= yEnd)]
this yields the dataframe:
id x y
0 0 1 1
1 1 1 2
If x and y are not strictly increasing and you want what is above the row that satisfies the condition:
df[df.index <= df[(df.x == xEnd) & (df.y == yEnd)].index[0]]
df = df.iloc[0:yEnd, :]
This selects just the first two rows with all columns and puts them in a new dataframe.
Or you can reuse the same variable name.
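Because argmax returns the first position where a mask is True regardless of sort order, the slice-by-index idea also covers the descending case from the EDIT. A sketch using a plain boolean mask (the reversed frame here is made up for illustration):

```python
import pandas as pd

xEnd = 1
yEnd = 2
# descending variant of the question's example frame
df = pd.DataFrame({'x': [2, 2, 2, 1, 1, 1],
                   'y': [3, 4, 3, 3, 2, 1],
                   'id': [5, 4, 3, 2, 1, 0]})

# first position where both conditions hold, independent of sort order
idx = ((df['x'] == xEnd) & (df['y'] == yEnd)).to_numpy().argmax()
res = df.iloc[:idx + 1]
print(res['id'].tolist())  # [5, 4, 3, 2, 1]
```

Everything up to and including the first matching row is kept; the rows after it are dropped.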

Python, Pandas: Reindex/Slice DataFrame with duplicate Index values

Let's consider a DataFrame that contains 1 row of 2 values per each day of the month of Jan 2010:
from datetime import datetime as dt
date_range = pd.date_range(dt(2010, 1, 1), dt(2010, 1, 31), freq='1D')
df = pd.DataFrame(data=np.random.rand(len(date_range), 2), index=date_range)
and another timeserie with sparser data and duplicate index values:
observations = pd.DataFrame(data=np.random.rand(7, 2),
                            index=(dt(2010, 1, 12), dt(2010, 1, 18), dt(2010, 1, 20),
                                   dt(2010, 1, 20), dt(2010, 1, 22), dt(2010, 1, 22),
                                   dt(2010, 1, 28)))
I split the first DataFrame df into a list of 5 DataFrames, each of them containing 1 week worth of data from the original: df_weeks = [g for n, g in df.groupby(pd.TimeGrouper('W'))]
Now I would like to split the data of the second DataFrame by the same 5 weeks. i.e. that would mean in that specific case ending up with a variable obs_weeks containing 5 DataFrames spanning the same time range as df_weeks , 2 of them being empty.
I tried using reindex such as in this question: Python, Pandas: Use the GroupBy.groups description to apply it to another grouping
and Periods:
p1 = [x.to_period() for x in list(df.groupby(pd.TimeGrouper('W')).groups.keys())]
p1 = sorted(p1)
dfs = []
for p in p1:
    dff = observations.truncate(p.start_time, p.end_time)
    dfs.append(dff)
(see this question: Python, Pandas: Boolean Indexing Comparing DateTimeIndex to Period)
The problem is that if some values in the index of observations are duplicates (and this is the case), none of those methods work. I also tried changing the index of observations to a normal column and slicing on that column, but I got an error message as well.
You can achieve this by doing a simple filter:
p1 = [x.to_period() for x in list(df.groupby(pd.TimeGrouper('W')).groups.keys())]
p1 = sorted(p1)
dfs = []
for p in p1:
    # .ix is deprecated; .loc with a boolean mask does the same job
    dff = observations.loc[
        (observations.index >= p.start_time) &
        (observations.index < p.end_time)]
    dfs.append(dff)
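On current pandas versions TimeGrouper no longer exists; pd.Grouper together with .loc gives the same weekly split. A sketch of the full flow under the question's setup (weekly frequency 'W' assumed, as in the question):

```python
from datetime import datetime as dt

import numpy as np
import pandas as pd

date_range = pd.date_range(dt(2010, 1, 1), dt(2010, 1, 31), freq='1D')
df = pd.DataFrame(np.random.rand(len(date_range), 2), index=date_range)
observations = pd.DataFrame(np.random.rand(7, 2),
                            index=[dt(2010, 1, 12), dt(2010, 1, 18), dt(2010, 1, 20),
                                   dt(2010, 1, 20), dt(2010, 1, 22), dt(2010, 1, 22),
                                   dt(2010, 1, 28)])

# weekly periods covering df; duplicate index values in observations are harmless
# because we only compare timestamps against the period boundaries
periods = sorted(k.to_period('W') for k in df.groupby(pd.Grouper(freq='W')).groups)
obs_weeks = [observations.loc[(observations.index >= p.start_time) &
                              (observations.index < p.end_time)]
             for p in periods]
print(len(obs_weeks), sum(w.empty for w in obs_weeks))  # 5 2
```

This yields 5 DataFrames spanning the same weeks as df_weeks, with the 2 weeks that contain no observations coming back empty, as the question requires.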

Pandas Python - Create dummy variables for multiple conditions

I have a pandas dataframe with a column that indicates which hour of the day a particular action was performed. So df['hour'] is many rows each with a value from 0 to 23.
I am trying to create dummy variables for things like 'is_morning', for example:
if df['hour'] >= 5 and < 12 then return 1, else return 0
A for loop doesn't work given the size of the data set, and I've tried some other stuff like
df['is_morning'] = df['hour'] >= 5 and < 12
Any suggestions?
You can just do:
df['is_morning'] = (df['hour'] >= 5) & (df['hour'] < 12)
i.e. wrap each condition in parentheses, and use &, which is an and operation that works across the whole vector/column.
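If integer 0/1 dummies are wanted rather than booleans, the same mask can be cast with astype(int). A small sketch with made-up hours:

```python
import pandas as pd

df = pd.DataFrame({'hour': [3, 8, 11, 12, 23]})
# build the boolean mask, then cast to 0/1 integers for a dummy variable
df['is_morning'] = ((df['hour'] >= 5) & (df['hour'] < 12)).astype(int)
print(df['is_morning'].tolist())  # [0, 1, 1, 0, 0]
```

The whole column is computed in one vectorized pass, so it scales to large datasets where a for loop would not.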
