KeyError: False when I enter conditions in Pandas

I'm getting the KeyError: False when I run this line:
df['Eligible'] = df[('DeliveryOnTime' == "On-time") | ('DeliveryOnTime' == "Early")]
I've been trying to find a way to execute this condition using np.where and .loc as well, but neither works. I'm open to other ideas on how to apply the condition to the new column Eligible using data from DeliveryOnTime.
I've tried these:
np.where
df['Eligible'] = np.where((df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early"), 1, 1)
.loc
df['Eligible'] = df.loc[(df['DeliveryOnTime'] == "On-time") & (df['DeliveryOnTime'] == "Early"), 'Total Orders'].sum()
Sample Data:
data = {'ID': [1, 1, 1, 2, 2, 3, 4, 5, 5],
'DeliveryOnTime': ["On-time", "Late", "Early", "On-time", "On-time", "Late", "Early", "Early", "On-time"],
}
df = pd.DataFrame(data)
#For the sake of example data, the count of `DeliveryOnTime` will be the total number of orders.
df['Total Orders'] = df['DeliveryOnTime'].count()

In your first line, 'DeliveryOnTime' == "On-time" compares two string literals, so the whole expression evaluates to the single boolean False, and df[False] looks up a column named False, hence KeyError: False. The right syntax is:
df['Eligible'] = (df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early")
# OR
df['Eligible'] = df['DeliveryOnTime'].isin(["On-time", "Early"])
Output:
>>> df
ID DeliveryOnTime Total Orders Eligible
0 1 On-time 9 True
1 1 Late 9 False
2 1 Early 9 True
3 2 On-time 9 True
4 2 On-time 9 True
5 3 Late 9 False
6 4 Early 9 True
7 5 Early 9 True
8 5 On-time 9 True
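If you want 1/0 instead of True/False, as in your np.where attempt, cast the boolean mask; a small sketch of the same idea:
df['Eligible'] = df['DeliveryOnTime'].isin(["On-time", "Early"]).astype(int)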

The df references are missing inside the brackets.
Please try:
df['Eligible'] = (df['DeliveryOnTime'] == "On-time") | (df['DeliveryOnTime'] == "Early")
Output:
>>> df
ID DeliveryOnTime Eligible
0 1 On-time True
1 1 Late False
2 1 Early True
3 2 On-time True
4 2 On-time True
5 3 Late False
6 4 Early True
7 5 Early True
8 5 On-time True

You cannot reference the columns as bare strings like that. Have a look at this solution, which flags all rows that are either 'On-time' or 'Early' and then counts the flagged rows per ID (summing a boolean column counts its True values):
df["eligible"] = df.DeliveryOnTime.isin(['On-time', 'Early'])
df['TotalOrders'] = df['eligible'].groupby(df['ID']).transform('sum')
df
ID DeliveryOnTime eligible TotalOrders
0 1 On-time True 2
1 1 Late False 2
2 1 Early True 2
3 2 On-time True 2
4 2 On-time True 2
5 3 Late False 0
6 4 Early True 1
7 5 Early True 2
8 5 On-time True 2

How can I select and index the highest value in each group of a Pandas dataframe?

I have a dataframe with multiple columns, each combination of columns describing one experiment (e.g. multiple super-labels, and for each super-label multiple episodes with different numbers of timesteps). I want to set the last timestep in each episode for all experiments to True, but I can't figure out how to do this. I have tried three different approaches, all using .loc: 1) .max().index, 2) .idxmax(), and 3) .tail(1).index, but they all fail (the first two with exceptions I don't understand, and the last one with wrong results).
This is my minimal example:
import numpy as np
import pandas as pd
np.random.seed(4)
def gen(t):
    results = []
    for episode_id, episode in enumerate(range(np.random.randint(2, 4))):
        for i in range(np.random.randint(2, 6)):
            results.append(
                {
                    "episode": episode_id,
                    "timestep": i,
                    "t": t,
                }
            )
    return pd.DataFrame(results)
df = pd.concat([gen("a"), gen("b")])
base_groups = ["t", "episode"]
df["last_timestep"] = False
print("Expected:")
print(df.groupby(base_groups).timestep.max())
#df.loc[df.groupby(base_groups).timestep.max().index, "last_timestep"] = True
#df.loc[df.groupby(base_groups).timestep.idxmax(), "last_timestep"] = True
df.loc[df.groupby(base_groups).tail(1).index, "last_timestep"] = True
print("Is:")
print(df[df.last_timestep])
The output of df.groupby(base_groups).timestep.max() is exactly what I expect; the correct rows are selected:
Expected:
t  episode
a  0          3
   1          4
b  0          2
   1          1
   2          4
But when filtering the dataframe, this is what I get:
Is:
episode timestep t last_timestep
2 0 2 a True
3 0 3 a True
4 1 0 a True
8 1 4 a True
2 0 2 b True
3 1 0 b True
4 1 1 b True
8 2 3 b True
9 2 4 b True
The rows 0, 2, 5 and 7 should not be selected.
Use GroupBy.transform to broadcast the per-group maximum back to the original rows, then compare it with the timestep column:
df["last_timestep"] = df.groupby(base_groups)['timestep'].transform('max').eq(df['timestep'])
print (df)
episode timestep t last_timestep
0 0 0 a False
1 0 1 a False
2 0 2 a False
3 0 3 a True
4 1 0 a False
5 1 1 a False
6 1 2 a False
7 1 3 a False
8 1 4 a True
0 0 0 b False
1 0 1 b False
2 0 2 b True
3 1 0 b False
4 1 1 b True
5 2 0 b False
6 2 1 b False
7 2 2 b False
8 2 3 b False
9 2 4 b True
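For completeness, the reason the .tail(1).index attempt over-selects: pd.concat keeps each frame's original index, so labels such as 2, 3, 4 and 8 occur in both the "a" and "b" parts, and .loc on a duplicated label matches every row carrying that label. A minimal sketch of a fix along the original lines (it assumes, as holds for the generated data, that timestep increases within each episode): rebuild the frame with unique labels and the .tail(1).index approach works as intended:
df = pd.concat([gen("a"), gen("b")], ignore_index=True)  # unique row labels
df["last_timestep"] = False
df.loc[df.groupby(base_groups).tail(1).index, "last_timestep"] = True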

return any value larger than three for columns with similar names

I have a dataframe similar to this
df_col1 df_col2 df_col3 df_col4 id name
3 4 5 2 1 a
2 3 2 1 2 d
2 1 1 2 3 x
This dataframe is very large.
If I want to return any value larger than 3 from the columns whose names start with df_, is there any way I can make the code run faster? My current code is
df.filter(like='df_') > 3
this runs very slowly; is it possible to make it faster?
PS: I want to get the values > 3, not True or False
You can also do this:
In [616]: cols = df.columns[df.columns.str.contains('df_')]
In [617]: df[cols].gt(3)
Out[617]:
df_col1 df_col2 df_col3 df_col4
0 False True True False
1 False False False False
2 False False False False
Use:
a = np.unique(df.filter(like='df_'))
print (a)
[1 2 3 4 5]
v = a[a > 3].tolist()
print (v)
[4, 5]
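If you also need to know where those values sit, a sketch that combines the same filter with a mask and stack (not part of the answers above) keeps the row and column positions:
vals = df.filter(like='df_')
print(vals[vals > 3].stack())
which for the sample data would print something like:
0  df_col2    4.0
   df_col3    5.0
dtype: float64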

pandas groupby return a boolean vector

I have a time series dataset where I would like to group the data and compare each value both to another cell in the same row and to the previous value.
The code below returns the expected vector when run against the whole dataframe, but when I group it I get a dataframe from apply() and an error from agg or transform.
Sample data frame
df = pd.DataFrame({ 'group': [1, 1, 1, 2,2,2,1,2, 1], 'target': [100,100,100,100,10,10,10,10,50],'val' :[90,80,70,4,120,6,60,8, 50] })
df
group target val
0 1 100 90
1 1 100 80
2 1 100 70
3 2 100 4
4 2 10 120
5 2 10 6
6 1 10 60
7 2 10 8
8 1 50 50
Here is my attempt at a function
def spike(df):
    high = df['val'] > df['target'] + 25
    rising = df['val'] > df['val'].shift()
    return high & rising
print(spike(df))
print( df.groupby('group').apply(spike))
Output
0 False
1 False
2 False
3 False
4 True
5 False
6 True
7 False
8 False
dtype: bool
0 1 2 6 8
group
1 False False False False False
2 False True False False True
This is my output. I was trying to get the second output to look like the first, except that row 6 should be False (within group 1 its previous value is 70, which is higher than 60).
You are overthinking it:
shift = df.groupby('group')['val'].shift()
df['val'].gt(df['target']+25) & df['val'].gt(shift)
Output:
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 False
8 False
dtype: bool
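Your original spike function can also be applied per group if you keep the index flat; a sketch using group_keys=False so that apply does not prepend the group label to the index:
print(df.groupby('group', group_keys=False).apply(spike).sort_index())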

List columns in data frame that have missing values as '?'

List the names of the columns of a data frame, along with the count of missing values in each, when missing values are coded as '?', using pandas and numpy.
import numpy as np
import pandas as pd
bridgeall = pd.read_excel('bridge.xlsx',sheet_name='Sheet1')
#print(bridgeall)
bridge_sep = bridgeall.iloc[:, 0].str.split(',', expand=True)
bridge_sep.columns = ['IDENTIF','RIVER', 'LOCATION', 'ERECTED', 'PURPOSE', 'LENGTH', 'LANES','CLEAR-G', 'T-OR-D',
'MATERIAL', 'SPAN', 'REL-L', 'TYPE']
print(bridge_sep)
Data: I am posting a snippet; it's actually [107 rows x 13 columns].
IDENTIF RIVER LOCATION ERECTED ... MATERIAL SPAN REL-L TYPE
0 E2 A ? CRAFTS ... WOOD SHORT ? WOOD
1 E3 A 39 CRAFTS ... WOOD ? S WOOD
2 E5 A ? CRAFTS ... WOOD SHORT S WOOD
Output required:
LOCATION 2
SPAN 1
REL-L 1
Compare all values with eq (==) and count the matches with sum (True is treated as 1), then drop the zero counts with boolean indexing:
s = df.eq('?').sum()
s = s[s != 0]
print (s)
LOCATION 2
SPAN 1
REL-L 1
dtype: int64
Finally, to get a DataFrame, add reset_index:
df1 = s.reset_index()
df1.columns = ['names','count']
print (df1)
names count
0 LOCATION 2
1 SPAN 1
2 REL-L 1
EDIT:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(5,5)))
print (df)
0 1 2 3 4
0 8 8 3 7 7
1 0 4 2 5 2
2 2 2 1 0 8
3 4 0 9 6 2
4 4 1 5 3 4
#compare with same length Series
#same index values like index/columns of DataFrame
s = pd.Series(np.arange(5))
print (s)
0 0
1 1
2 2
3 3
4 4
dtype: int32
#compare columns
print (df.eq(s, axis=0))
0 1 2 3 4
0 False False False False False
1 False False False False False
2 True True False False False
3 False False False False False
4 True False False False True
#compare rows
print (df.eq(s, axis=1))
0 1 2 3 4
0 False False False False False
1 True False True False False
2 False False False False False
3 False False False False False
4 False True False True True
If your DataFrame is named df, try (df == '?').sum()
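A variant of the same idea, sketched here rather than taken from the answers above, is to convert '?' into real missing values first so that pandas' NaN-aware tooling applies:
df = df.replace('?', np.nan)  # '?' becomes a genuine missing value
s = df.isna().sum()
print(s[s > 0])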

Using pandas group operations

I'm trying to better understand pandas' group operations.
As an example, let's say I have a dataframe which has a list of sets played in tennis matches.
tennis_sets = pd.DataFrame({
    'date': ['27/05/13', '27/05/13', '28/05/13', '28/05/13',
             '28/05/13', '29/05/13', '29/05/13'],
    'player_A': [6, 6, 2, 6, 7, 6, 6],
    'player_B': [4, 3, 6, 7, 6, 1, 0],
})
Resulting in
date player_A player_B
0 27/05/13 6 4
1 27/05/13 6 3
2 28/05/13 2 6
3 28/05/13 6 7
4 28/05/13 7 6
5 29/05/13 6 1
6 29/05/13 6 0
I'd like to determine the overall score for each match played on a given day. This should look like
date player_A player_B
0 27/05/13 2 0
1 28/05/13 1 2
2 29/05/13 2 0
So, I could do this by creating a new numpy array and iterating as follows:
matches = tennis_sets.groupby('date')
scores = np.zeros((len(matches), 2))
for i, (_, match) in enumerate(matches):
    a, b = match.player_A, match.player_B
    scores[i] = np.c_[sum(a > b), sum(b > a)]
I could then reattach this new scores array to the dates. However, it seems unlikely that this is the preferred way of doing things.
To create a new dataframe with each date and match score as above, is there a better way I can achieve this using pandas' api?
To answer your question, yes there are ways to do this in pandas. There may be a more elegant solution, but here's a quick one which uses pandas groupby to perform a sum over the dataframe grouped by date:
In [13]: tennis_sets
Out[13]:
date player_A player_B
0 27/05/13 6 4
1 27/05/13 6 3
2 28/05/13 2 6
3 28/05/13 6 7
4 28/05/13 7 6
5 29/05/13 6 1
6 29/05/13 6 0
In [14]: tennis_sets["pA_wins"] = tennis_sets["player_A"] > tennis_sets["player_B"]
In [15]: tennis_sets["pB_wins"] = tennis_sets["player_B"] > tennis_sets["player_A"]
In [18]: tennis_sets
Out[18]:
date player_A player_B pA_wins pB_wins
0 27/05/13 6 4 True False
1 27/05/13 6 3 True False
2 28/05/13 2 6 False True
3 28/05/13 6 7 False True
4 28/05/13 7 6 True False
5 29/05/13 6 1 True False
6 29/05/13 6 0 True False
In [21]: matches = tennis_sets.groupby("date").sum()
In [22]: matches[["pA_wins", "pB_wins"]]
Out[22]:
pA_wins pB_wins
date
27/05/13 2 0
28/05/13 1 2
29/05/13 2 0
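The same computation can also be written as a single chain with assign, which leaves the original frame untouched; this is a sketch of the identical logic, not the code above:
wins = (tennis_sets
        .assign(pA_wins=tennis_sets.player_A > tennis_sets.player_B,
                pB_wins=tennis_sets.player_B > tennis_sets.player_A)
        .groupby('date')[['pA_wins', 'pB_wins']]
        .sum()
        .reset_index())
print(wins)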
