pandas groupby return a boolean vector - python

I have a time series dataset where I would like to group the data and compare each value both to another cell in the same row and to the previous value within the group.
The code below returns a boolean vector for the whole DataFrame, but if I try to group it, I get a DataFrame with apply() and an error with agg or transform.
Sample data frame
df = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2, 1, 2, 1],
                   'target': [100, 100, 100, 100, 10, 10, 10, 10, 50],
                   'val': [90, 80, 70, 4, 120, 6, 60, 8, 50]})
df
   group  target  val
0      1     100   90
1      1     100   80
2      1     100   70
3      2     100    4
4      2      10  120
5      2      10    6
6      1      10   60
7      2      10    8
8      1      50   50
Here is my attempt at a function
def spike(df):
    high = df['val'] > df['target'] + 25
    rising = df['val'] > df['val'].shift()
    return high & rising

print(spike(df))
print(df.groupby('group').apply(spike))
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 True
7 False
8 False
dtype: bool
           0      1      2      6      8
group
1      False  False  False  False  False
2      False   True  False  False   True
That is my output; I was trying to get the second output to look like the first, except that row 6 should be False.

You are overthinking it:
shift = df.groupby('group')['val'].shift()
df['val'].gt(df['target']+25) & df['val'].gt(shift)
Output:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
dtype: bool
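If you prefer to keep a named function, here is a minimal sketch (assuming a pandas version where groupby(...).apply with group_keys=False concatenates the per-group results back onto the original index):

def spike(g):
    # 'high' compares each row to its own target; 'rising' compares to the
    # previous row *within the group*, because shift() runs per group here
    high = g['val'] > g['target'] + 25
    rising = g['val'] > g['val'].shift()
    return high & rising

print(df.groupby('group', group_keys=False).apply(spike).sort_index())

Because shift() then operates inside each group, row 6 compares 60 to the previous group-1 value 70 rather than to row 5, so it comes out False as desired.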

Related

Fill rows in df.column between rows if number of rows between is less than something

I have a df where I want to fill the rows in column values with True if the number of rows between True values in column values is less than two.
counter  values
      1    True
      2   False
      3   False
      4    True
      5   False
      6    True
      7    True
      8   False
      9    True
     10   False
     11   False
The result I want is like the df below:
counter  values
      1    True
      2   False
      3   False
      4    True
      5    True
      6    True
      7    True
      8    True
      9    True
     10   False
     11   False
You can make groups that each start with a True; if a group has 2 items or fewer, replace it with True. Then compute the boolean OR with the original column:
N = 2
fill = df['values'].groupby(df['values'].cumsum()).transform(lambda g: len(g) <= N)
df['values'] = df['values'] | fill  # or: df['values'] |= fill
Output (shown here as a new column values2 for clarity):
counter values values2
0 1 True True
1 2 False False
2 3 False False
3 4 True True
4 5 False True
5 6 True True
6 7 True True
7 8 False True
8 9 True True
9 10 False False
10 11 False False
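To see why the grouping works, here is a minimal sketch (an aside, assuming the same df as above) that prints the intermediate group labels produced by cumsum():

import pandas as pd

df = pd.DataFrame({'counter': range(1, 12),
                   'values': [True, False, False, True, False, True,
                              True, False, True, False, False]})

# Each True starts a new group: the cumulative sum increments on every True,
# so the Falses that follow a True share its label.
print(df['values'].cumsum().tolist())
# [1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5]

Groups 2, 3 and 4 have at most two members, so their rows are filled with True; groups 1 and 5 are longer and keep their original values.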
Another option, which works only in the particular case N=2, is to check whether the rows immediately before and after are both True:
df['values'] = df['values'] | (df['values'].shift() & df['values'].shift(-1))

KeyError: False when I enter conditions in Pandas

I'm getting the KeyError: False when I run this line:
df['Eligible'] = df[('DeliveryOnTime' == "On-time") | ('DeliveryOnTime' == "Early")]
I've been trying to find a way to express this condition using np.where and .loc as well, but neither works. I'm open to other ideas on how to fill the new column Eligible using data from DeliveryOnTime.
I've tried these:
np.where
df['Eligible'] = np.where((df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early"), 1, 1)
.loc()
df['Eligible'] = df.loc[(df['DeliveryOnTime'] == "On-time") & (df['DeliveryOnTime'] == "Early"), 'Total Orders'].sum()
Sample Data:
data = {'ID': [1, 1, 1, 2, 2, 3, 4, 5, 5],
        'DeliveryOnTime': ["On-time", "Late", "Early", "On-time", "On-time", "Late", "Early", "Early", "On-time"],
        }
df = pd.DataFrame(data)
#For the sake of example data, the count of `DeliveryOnTime` will be the total number of orders.
df['Total Orders'] = df['DeliveryOnTime'].count()
The right syntax is:
df['Eligible'] = (df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early")
# OR
df['Eligible'] = df['DeliveryOnTime'].isin(["On-time", "Early"])
Output:
>>> df
ID DeliveryOnTime Total Orders Eligible
0 1 On-time 9 True
1 1 Late 9 False
2 1 Early 9 True
3 2 On-time 9 True
4 2 On-time 9 True
5 3 Late 9 False
6 4 Early 9 True
7 5 Early 9 True
8 5 On-time 9 True
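As an aside (not part of the original answers): the np.where attempt above fails for a different reason, since both branches are 1, so every row gets the same value. If a 1/0 flag is wanted instead of a boolean, a corrected sketch would be:

import numpy as np

# np.where(condition, value_if_true, value_if_false)
df['Eligible'] = np.where(df['DeliveryOnTime'].isin(["On-time", "Early"]), 1, 0)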
The df references are misplaced: the comparisons need df['DeliveryOnTime'], not the bare string 'DeliveryOnTime'.
Please try:
df['Eligible'] = (df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early")
Output:
>>> df
   ID DeliveryOnTime  Eligible
0   1        On-time      True
1   1           Late     False
2   1          Early      True
3   2        On-time      True
4   2        On-time      True
5   3           Late     False
6   4          Early      True
7   5          Early      True
8   5        On-time      True
You cannot reference the columns as bare strings. Have a look at this solution, which flags all rows that are either 'On-time' or 'Early' and then counts them per ID:
df["eligible"] = df.DeliveryOnTime.isin(['On-time', 'Early'])
df['TotalOrders'] = df['eligible'].groupby(df['ID']).transform('sum')
df
   ID DeliveryOnTime  eligible  TotalOrders
0   1        On-time      True            2
1   1           Late     False            2
2   1          Early      True            2
3   2        On-time      True            2
4   2        On-time      True            2
5   3           Late     False            0
6   4          Early      True            1
7   5          Early      True            2
8   5        On-time      True            2

How to split Pandas DataFrame based on number switch

I am looking for a way to split a pandas dataframe based on a number switch from 4.2 to 4.19 in one column.
I cannot use the diff() method, as the difference (0.01) also occurs in the diff column when the data changes from 4.19 to 4.18. Moreover, splitting on a particular number (e.g. 4.2 or 4.19) does not work, as the column contains multiples of these numbers (e.g. 4.2 appears about 5 times).
The data looks like this
4.1999
4.1999
4.2
4.1999
4.1975
4.2
4.19
4.1931
4.192
4.1911
4.1902
4.1896
4.189
4.1883
Is there a way to split such dataframe when the numbers change from 4.2 to 4.19 using pandas or any other python method?
Thank you very much in advance.
Sincerely,
Cindino
Use shift to construct your mask with split points:
split_points = df.column.eq(4.2) & df.column.shift(-1).eq(4.19)
Or:
split_points = df.column.shift(1).eq(4.2) & df.column.eq(4.19)
Depending on where you want the split to occur.
Example of use:
df = pd.DataFrame({'col': [4.1999,4.1999,4.2,4.1999,4.1975,4.2,4.19,4.1931,4.192,4.1911,4.1902,4.1896,4.189,4.1883]})
df['split'] = df['col'].shift(1).eq(4.2) & df['col'].eq(4.19)
df['group'] = df['split'].cumsum()
df
output:
col split group
0 4.1999 False 0
1 4.1999 False 0
2 4.2000 False 0
3 4.1999 False 0
4 4.1975 False 0
5 4.2000 False 0
6 4.1900 True 1
7 4.1931 False 1
8 4.1920 False 1
9 4.1911 False 1
10 4.1902 False 1
11 4.1896 False 1
12 4.1890 False 1
13 4.1883 False 1
You can then access the subframes using groupby:
list(df.groupby('group'))
[(0,
col split group
0 4.1999 False 0
1 4.1999 False 0
2 4.2000 False 0
3 4.1999 False 0
4 4.1975 False 0
5 4.2000 False 0),
(1,
col split group
6 4.1900 True 1
7 4.1931 False 1
8 4.1920 False 1
9 4.1911 False 1
10 4.1902 False 1
11 4.1896 False 1
12 4.1890 False 1
13 4.1883 False 1)]
Or directly:
list(df.groupby((df['col'].shift(1).eq(4.2) & df['col'].eq(4.19)).cumsum()))
output:
[(0,
col
0 4.1999
1 4.1999
2 4.2000
3 4.1999
4 4.1975
5 4.2000),
(1,
col
6 4.1900
7 4.1931
8 4.1920
9 4.1911
10 4.1902
11 4.1896
12 4.1890
13 4.1883)]
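One caveat (an addition, not part of the original answer): eq relies on exact float equality, which happens to work here because the literal 4.2 matches the stored value exactly. If the values came from computation rather than literals, a tolerance-based check is safer, for example with numpy.isclose:

import numpy as np

# Same split rule as above, but tolerant of floating-point noise
prev_is_42 = np.isclose(df['col'].shift(1), 4.2)
curr_is_419 = np.isclose(df['col'], 4.19)
subframes = list(df.groupby((prev_is_42 & curr_is_419).cumsum()))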

How can I select and index the highest value in each group of a Pandas dataframe?

I have a dataframe with multiple columns, each combination of columns describing one experiment (e.g. multiple super-labels, and for each super-label multiple episodes with different numbers of timesteps). I want to set the last timestep in each episode for all experiments to True, but I can't figure out how to do this. I have tried three different approaches, all using .loc and 1) .max().index, 2) .idxmax(), and 3) .tail(1).index, but they all fail (the first two with, for me, incomprehensible exceptions, and the last one being wrong).
This is my minimal example:
import numpy as np
import pandas as pd
np.random.seed(4)
def gen(t):
    results = []
    for episode_id, episode in enumerate(range(np.random.randint(2, 4))):
        for i in range(np.random.randint(2, 6)):
            results.append(
                {
                    "episode": episode_id,
                    "timestep": i,
                    "t": t,
                }
            )
    return pd.DataFrame(results)

df = pd.concat([gen("a"), gen("b")])
base_groups = ["t", "episode"]
df["last_timestep"] = False
print("Expected:")
print(df.groupby(base_groups).timestep.max())
#df.loc[df.groupby(base_groups).timestep.max().index, "last_timestep"] = True
#df.loc[df.groupby(base_groups).timestep.idxmax(), "last_timestep"] = True
df.loc[df.groupby(base_groups).tail(1).index, "last_timestep"] = True
print("Is:")
print(df[df.last_timestep])
The output of df.groupby(base_groups).timestep.max() is exactly what I expect, the correct rows are selected:
Expected:
t episode
a 0 3
1 4
b 0 2
1 1
2 4
But when filtering the dataframe, this is what I get:
Is:
episode timestep t last_timestep
2 0 2 a True
3 0 3 a True
4 1 0 a True
8 1 4 a True
2 0 2 b True
3 1 0 b True
4 1 1 b True
8 2 3 b True
9 2 4 b True
The rows at positions 0, 2, 5 and 7 of this output should not be selected.
Use GroupBy.transform to broadcast the per-group maximum to every row, then compare with the timestep column:
df["last_timestep"] = df.groupby(base_groups)['timestep'].transform('max').eq(df['timestep'])
print (df)
episode timestep t last_timestep
0 0 0 a False
1 0 1 a False
2 0 2 a False
3 0 3 a True
4 1 0 a False
5 1 1 a False
6 1 2 a False
7 1 3 a False
8 1 4 a True
0 0 0 b False
1 0 1 b False
2 0 2 b True
3 1 0 b False
4 1 1 b True
5 2 0 b False
6 2 1 b False
7 2 2 b False
8 2 3 b False
9 2 4 b True
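As a side note (an addition to the original answer): the .idxmax() and .tail(1).index attempts misbehave because pd.concat([gen("a"), gen("b")]) leaves duplicate index labels, so .loc with those labels hits more rows than intended. A sketch of making the idxmax approach work by deduplicating the index first:

df = pd.concat([gen("a"), gen("b")], ignore_index=True)  # unique row labels
df["last_timestep"] = False
# idxmax returns one row label per group; with a unique index, .loc sets
# exactly those rows
df.loc[df.groupby(base_groups)["timestep"].idxmax(), "last_timestep"] = True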

List columns in data frame that have missing values as '?'

List the names of the columns of a data frame, along with the count of missing values, when missing values are coded as '?', using pandas and numpy.
import numpy as np
import pandas as pd
bridgeall = pd.read_excel('bridge.xlsx',sheet_name='Sheet1')
#print(bridgeall)
bridge_sep = bridgeall.iloc[:, 0].str.split(',', n=-1, expand=True)
bridge_sep.columns = ['IDENTIF', 'RIVER', 'LOCATION', 'ERECTED', 'PURPOSE', 'LENGTH', 'LANES',
                      'CLEAR-G', 'T-OR-D', 'MATERIAL', 'SPAN', 'REL-L', 'TYPE']
print(bridge_sep)
Data: I am posting a snippet. It's actually [107 rows x 13 columns].
IDENTIF RIVER LOCATION ERECTED ... MATERIAL SPAN REL-L TYPE
0 E2 A ? CRAFTS ... WOOD SHORT ? WOOD
1 E3 A 39 CRAFTS ... WOOD ? S WOOD
2 E5 A ? CRAFTS ... WOOD SHORT S WOOD
Output required:
LOCATION 2
SPAN 1
REL-L 1
Compare all values with eq (==) and count the matches with sum (True is counted as 1), then remove the zero counts by boolean indexing:
s = df.eq('?').sum()
s = s[s != 0]
print (s)
LOCATION 2
SPAN 1
REL-L 1
dtype: int64
Finally, to get a DataFrame, add reset_index:
df1 = s.reset_index()
df1.columns = ['names','count']
print (df1)
names count
0 LOCATION 2
1 SPAN 1
2 REL-L 1
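A more compact variant (an aside; same result, assuming the Series s from above):

df1 = s.rename_axis('names').reset_index(name='count')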
EDIT:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print (df)
0 1 2 3 4
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#compare with same length Series
#same index values like index/columns of DataFrame
s = pd.Series(np.arange(5))
print (s)
0 0
1 1
2 2
3 3
4 4
dtype: int32
#compare columns
print (df.eq(s, axis=0))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True True False False False
3 False False False False False
4 True False False False True
#compare rows
print (df.eq(s, axis=1))
0 1 2 3 4
0 False False False False False
1 True False True False False
2 False False False False False
3 False False False False False
4 False True False True True
If your DataFrame is named df, try (df == '?').sum()
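A related approach (a sketch, not from the answers above) is to convert the '?' markers to real missing values first, which also simplifies any downstream handling:

import numpy as np

# Turn '?' into NaN, then use the regular missing-value machinery
s = df.replace('?', np.nan).isna().sum()
print(s[s > 0])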
