So I have a dataframe like the one below.
import pandas as pd

dff = pd.DataFrame({'id': [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3],
                    'categ': ['A','A','A','B','C','A','A','A','B','C','A','A','A','B','C'],
                    'cost': [3,1,1,3,10,1,2,3,4,10,2,2,2,4,13]})
dff
id categ cost
0 1 A 3
1 1 A 1
2 1 A 1
3 1 B 3
4 1 C 10
5 2 A 1
6 2 A 2
7 2 A 3
8 2 B 4
9 2 C 10
10 3 A 2
11 3 A 2
12 3 A 2
13 3 B 4
14 3 C 13
Now I want to group by 'id' and create a new column that is True if, within each group, the sum of category A equals 50% of the cost of C and the sum of category B equals 30% of it, and False otherwise. My desired output is below.
new
id
1 True
2 False
3 False
I have tried some things but I can't make it work. Any idea how to get my desired output? Thanks
Try pivoting the data frame first and then check whether columns A, B and C satisfy the condition:
import numpy as np
dff.pivot_table('cost', 'id', 'categ', aggfunc='sum')\
   .assign(new=lambda df: np.isclose(df.A, 0.5 * df.C) & np.isclose(df.B, 0.3 * df.C))
categ A B C new
id
1 5 3 10 True
2 6 4 10 False
3 6 4 13 False
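If you'd rather stay with groupby, a roughly equivalent sketch (same check, using unstack to get the wide A/B/C layout):
wide = dff.groupby(['id', 'categ'])['cost'].sum().unstack()
wide['new'] = np.isclose(wide['A'], 0.5 * wide['C']) & np.isclose(wide['B'], 0.3 * wide['C'])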
Try pd.crosstab with normalize, then apply a little math. Since A = 0.5*C and B = 0.3*C imply a row total of 1.8*C, the normalized shares must be 0.5/1.8, 0.3/1.8 and 1/1.8.
Note: here we cannot test exact equality because of floating-point arithmetic; we need np.isclose.
s = pd.crosstab(dff['id'], dff['categ'], dff['cost'], aggfunc='sum', normalize='index')
s['new'] = np.isclose(s.to_numpy(), [0.5/1.8, 0.3/1.8, 1/1.8], atol=0.0001).all(1)
s
Out[341]:
categ A B C new
id
1 0.277778 0.166667 0.555556 True
2 0.300000 0.200000 0.500000 False
3 0.260870 0.173913 0.565217 False
I'm hoping to use a conditional statement to create a new column, but I'm unsure of the best way to proceed.
Using the data below, I essentially have various Items that each carry a specific Direction at a given point in time. I want to use ID to pick the correct Direction, i.e. match the values in Items and ID to determine the correct Direction.
I've manually inserted this below as Main Direction, but I will need to automate it. I then want to apply a conditional statement to X: if Main Direction == 'Up', add 10; if == 'Down', subtract 10.
import pandas as pd
df = pd.DataFrame({
'Time' : [1,1,1,1,2,2,2,2,3,3,3,3],
'ID' : ['A','A','A','A','B','B','B','B','A','A','A','A'],
'Items' : ['A','B','A','A','B','A','A','B','A','A','B','B'],
'Direction' : ['Up','Down','Up','Up','Down','Up','Up','Down','Up','Up','Down','Down'],
'Main Direction' : ['Up','Up','Up','Up','Down','Down','Down','Down','Up','Up','Up','Up'],
'X' : [1,2,3,4,6,7,8,9,3,4,5,6],
})
df['Dist'] = [df['X'] + 10 if x == 'Up' else df['X'] -10 for x in df['Main Direction']]
intended output:
Time ID Items Direction Main Direction X Dist
0 1 A A Up Up 1 11
1 1 A B Down Up 2 12
2 1 A A Up Up 3 13
3 1 A A Up Up 4 14
4 2 B B Down Down 6 -4
5 2 B A Up Down 7 -3
6 2 B A Up Down 8 -2
7 2 B B Down Down 9 -1
8 3 A A Up Up 3 13
9 3 A A Up Up 4 14
10 3 A B Down Up 5 15
11 3 A B Down Up 6 16
Try with np.where:
df['X'] = np.where(df['Main Direction'].eq('Up'), df['X'] + 10, df['X'] - 10)
df
Time ID Items Direction Main Direction X
0 1 A A Up Up 11
1 1 A B Down Up 12
2 1 A A Up Up 13
3 1 A A Up Up 14
4 2 B B Down Down -4
5 2 B A Up Down -3
6 2 B A Up Down -2
7 2 B B Down Down -1
8 3 A A Up Up 13
9 3 A A Up Up 14
10 3 A B Down Up 15
11 3 A B Down Up 16
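If you want to keep X intact and produce the Dist column from the intended output instead, point the same np.where at a new column:
df['Dist'] = np.where(df['Main Direction'].eq('Up'), df['X'] + 10, df['X'] - 10)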
Here is how to derive the Main Direction column:
df['testDirec'] = np.where(df['ID'] == df['Items'], df['Direction'], None)
df['testDirec'] = df['testDirec'].ffill()
This gives the same values as Main Direction:
Time ID Items Direction Main Direction X testDirec
0 1 A A Up Up 1 Up
1 1 A B Down Up 2 Up
2 1 A A Up Up 3 Up
3 1 A A Up Up 4 Up
4 2 B B Down Down 6 Down
5 2 B A Up Down 7 Down
6 2 B A Up Down 8 Down
7 2 B B Down Down 9 Down
8 3 A A Up Up 3 Up
9 3 A A Up Up 4 Up
10 3 A B Down Up 5 Up
11 3 A B Down Up 6 Up
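A hypothetical alternative that avoids ffill, assuming every Time group contains at least one row where ID matches Items: take the Direction from those matching rows and map it back onto each Time group.
# Direction of the first ID == Items row per Time group (assumes one exists per group)
main = df.loc[df['ID'] == df['Items']].groupby('Time')['Direction'].first()
df['testDirec2'] = df['Time'].map(main)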
You can use any custom function with apply. If you're operating row-wise (e.g. you need more than one column), call it with axis=1 and it will pass each row; otherwise you can just do df['my_col'].apply(lambda x: do_something_to_x(x)) and it will pass the values of the column. For your example:
def calc_dist(row):
    if row['Main Direction'] == 'Up':
        return row['X'] + 10
    else:
        return row['X'] - 10
df['Dist'] = df.apply(calc_dist, axis=1)
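For large frames, row-wise apply can be slow; a vectorized sketch of the same logic, treating the direction as a signed offset:
import numpy as np

df['Dist'] = df['X'] + np.where(df['Main Direction'].eq('Up'), 10, -10)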
Say I have an incomplete dataset in a Pandas DataFrame such as:
incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
                        'x': [1,2,3] + [1,2,3,4,5] + [1,2,3,4],
                        'y': [3,None,7] + [1,4,7,None,None] + [4,None,2,1]})
And also a DataFrame with fitting parameters that I could use to fill holes:
fitTable = pd.DataFrame({'slope': [2,3,-1],
                         'intercept': [1,-2,5]},
                        index=['A','B','C'])
I would like to achieve the following using y=x*slope+intercept for the None entries only:
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
One way I envisioned is by using join and drop:
incData = incData.join(fitTable, on='comp')
incData.loc[incData['y'].isnull(), 'y'] = incData[incData['y'].isnull()]['x'] * \
    incData[incData['y'].isnull()]['slope'] + \
    incData[incData['y'].isnull()]['intercept']
incData.drop(['slope', 'intercept'], axis=1, inplace=True)
However, that does not seem very efficient, because it adds and removes columns. It seems I am making this too complicated; am I overlooking a simpler, more direct solution? Something more like this non-functional code:
incData.loc[incData['y'].isnull(), 'y'] = incData[incData['y'].isnull()]['x'] * \
    fitTable[incData[incData['y'].isnull()]['comp']]['slope'] + \
    fitTable[incData[incData['y'].isnull()]['comp']]['intercept']
I am pretty new to Pandas, so I sometimes get a bit mixed up with the strict indexing rules...
You can use map on the 'comp' column once you mask the null values in 'y', like:
mask = incData['y'].isna()
incData.loc[mask, 'y'] = incData.loc[mask, 'x'] * \
    incData.loc[mask, 'comp'].map(fitTable['slope']) + \
    incData.loc[mask, 'comp'].map(fitTable['intercept'])
And your non-functional code, I guess, would become something like:
incData.loc[mask, 'y'] = incData.loc[mask, 'x'] * \
    fitTable.loc[incData.loc[mask, 'comp'], 'slope'].to_numpy() + \
    fitTable.loc[incData.loc[mask, 'comp'], 'intercept'].to_numpy()
IIUC:
na = pd.isna(incData['y'])
incData.loc[na, 'y'] = incData[na].apply(
    lambda row: row['x'] * fitTable.loc[row['comp'], 'slope']
                + fitTable.loc[row['comp'], 'intercept'],
    axis=1)
incData
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
merge is another option
# merge two dataframe together on comp
m = incData.merge(fitTable, left_on='comp', right_index=True)
# y = mx+b
m['y'] = m['x']*m['slope']+m['intercept']
comp x y slope intercept
0 A 1 3 2 1
1 A 2 5 2 1
2 A 3 7 2 1
3 B 1 1 3 -2
4 B 2 4 3 -2
5 B 3 7 3 -2
6 B 4 10 3 -2
7 B 5 13 3 -2
8 C 1 4 -1 5
9 C 2 3 -1 5
10 C 3 2 -1 5
11 C 4 1 -1 5
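Note that m['y'] = m['x']*m['slope'] + m['intercept'] recomputes y for every row; it matches the desired output here only because the known values happen to lie exactly on the fit lines. To patch just the missing entries, a fillna variant (a sketch):
m['y'] = m['y'].fillna(m['x'] * m['slope'] + m['intercept'])
incData = m.drop(columns=['slope', 'intercept'])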
I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for the pattern "ABD followed by CDE without event B in between".
For example, the output for this df will be:
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can occur multiple times for a single Id, and I want to find the list of all such Ids and their respective counts (if possible).
Here's a vectorized one with some scaling trickery, leveraging convolution to find the required pattern -
# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
mask = id_ar>0
df1 = df[mask].copy()  # copy() avoids a SettingWithCopyWarning on the next assignment
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
The convolution part might be a bit tricky. The idea is to use id_ar, which has values 1, 2 and 3 corresponding to the strings 'ABD', 'B' and 'CDE'. We are looking for 1 followed by 3, so convolving with the kernel [9,1] yields 1*1 + 3*9 = 28 as the convolution sum for a window that has 'ABD' and then 'CDE'. Hence, we look for a conv. sum of 28 for a match. For 'ABD' followed by 'B' and then 'CDE', the conv. sum would be different and hence be filtered out.
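A tiny demonstration of those convolution sums, using the same 1/2/3 coding as above:
import numpy as np

# 1 = 'ABD', 2 = 'B', 3 = 'CDE'
np.convolve([1, 3], [9, 1], 'same')     # array([ 9, 28]) -- ABD directly before CDE hits 28
np.convolve([1, 2, 3], [9, 1], 'same')  # array([ 9, 19, 29]) -- a B in between, so no 28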
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered o/p (look at column Pattern for the presence of the reqd. pattern) :
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final o/p :
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64
I used a solution based on the assumption that anything other than ABD, CDE and B is irrelevant to our solution, so I get rid of those rows first with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column back by one position (note this doesn't have to be one step in units of SeqNo).
Then I check every row of the new df for whether Event == 'ABD' and Event_1_Step == 'CDE', meaning there wasn't a B in between, though possibly other events such as A or C, or even nothing at all. This gets me a boolean for every occurrence of such a sequence; summing them up gives the count.
Finally, I have to make sure all of this is done at Id level, so I use .groupby.
IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If not, please sort it first.
import pandas as pd

df = pd.read_csv("path/to/file.csv")
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])].copy()
df2.loc[:, "Event_1_Step"] = df2["Event"].shift(-1)
df2.loc[:, "SeqNo_1_Step"] = df2["SeqNo"].shift(-1)

for id_, id_df in df2.groupby("Id"):
    print(id_)  # set a counter object here per Id to track the count per Id
    id_df = id_df[id_df.apply(lambda x: x["Event"] == "ABD" and x["Event_1_Step"] == "CDE", axis=1)]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id_) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])
You could use this:
s = (pd.Series(np.select([df['Event'] == 'ABD', df['Event'] == 'B', df['Id'] != df['Id'].shift()],
                         [True, False, False], default=np.nan))
       .ffill()
       .fillna(False)
       .astype(bool))

corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
np.select creates a series which is True where Event == 'ABD' and False where Event == 'B' or at the start of a new Id. By forward filling with ffill, every row then records whether ABD or B came last. You can then check whether it is True where the value is CDE, and use GroupBy to check whether it is True for any value per Id.
Which for
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
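Since the question also asks for the respective counts, replacing max with sum tallies every CDE that completes the pattern per Id (a small extension of the code above):
corr.groupby(df['Id']).sum()
# Id: 1 -> 1, 2 -> 0, 3 -> 0, 4 -> 1, 5 -> 0 for the sample above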
I have a data frame where there are several groups of numeric series where the values are cumulative. Consider the following:
df = pd.DataFrame({'Cat': ['A','A','A','A','B','B','B','B'],
                   'Indicator': [1,2,3,4,1,2,3,4],
                   'Cumulative1': [1,3,6,7,2,4,6,9],
                   'Cumulative2': [1,3,4,6,1,5,7,12]})
In [74]:df
Out[74]:
Cat Cumulative1 Cumulative2 Indicator
0 A 1 1 1
1 A 3 3 2
2 A 6 4 3
3 A 7 6 4
4 B 2 1 1
5 B 4 5 2
6 B 6 7 3
7 B 9 12 4
I need to create discrete series for Cumulative1 and Cumulative2, with the starting point being the earliest entry in 'Indicator'.
My approach is to use diff():
In[82]: df['Discrete1'] = df.groupby('Cat')['Cumulative1'].diff()
Out[82]: df
Cat Cumulative1 Cumulative2 Indicator Discrete1
0 A 1 1 1 NaN
1 A 3 3 2 2.0
2 A 6 4 3 3.0
3 A 7 6 4 1.0
4 B 2 1 1 NaN
5 B 4 5 2 2.0
6 B 6 7 3 2.0
7 B 9 12 4 3.0
I have 3 questions:
How do I avoid the NaNs in an elegant/Pythonic way? The correct values are to be found in the original cumulative series.
Secondly, how do I elegantly apply this computation to all series, say -
cols = ['Cumulative1', 'Cumulative2']
Thirdly, I have a lot of data that needs this computation -- is this the most efficient way?
You do not want to avoid NaNs, you want to fill them with the start values from the "cumulative" column:
df['Discrete1'] = df['Discrete1'].combine_first(df['Cumulative1'])
To apply the operation to all (or select) columns, broadcast it to all columns of interest:
sources = ['Cumulative1', 'Cumulative2']  # a list (not a tuple), so column selection on the groupby works on newer pandas
targets = ["Discrete" + x[len('Cumulative'):] for x in sources]
df[targets] = df.groupby('Cat')[sources].diff()
You still have to fill the NaNs in a loop:
for s, t in zip(sources, targets):
    df[t] = df[t].combine_first(df[s])
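A loop-free variant (a sketch): fillna with a DataFrame aligns on index and column names, so the per-group diff can be patched in one call before assigning it to the target columns.
sources = ['Cumulative1', 'Cumulative2']
targets = ['Discrete1', 'Discrete2']
# diff per group, then fill each group's leading NaN with the original cumulative value
df[targets] = df.groupby('Cat')[sources].diff().fillna(df[sources]).to_numpy()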
Hi, I am dealing with some data using pandas.
I am facing a problem, but here I'll try to simplify it.
Suppose I have a dataset looks like this:
# Incidents Place Month
0 3 A 1
1 5 B 1
2 2 C 2
3 2 B 2
4 6 C 3
5 3 A 1
So I want to sum the # of incidents by place, that is, I want to have a result like
P #
A 6 (3+3)
B 7 (5+2)
C 8 (2+6)
stored in a pandas DataFrame. I don't care about other columns at this point.
Next question: if I now want to use the data in the Month column as well, I'd like to have a result that looks like
P M #
A 1 6(3+3)
B 1 5
B 2 2
C 2 2
C 3 6
How can I achieve these results in pandas? I have tried groupby and some other functions but I cannot quite get there...
Any help is appreciated!
You can do it this way:
In [35]: df
Out[35]:
# Incidents Place Month
0 3 A 1
1 5 B 1
2 2 C 2
3 2 B 2
4 6 C 3
5 3 A 1
In [36]: df.groupby('Place')['# Incidents'].sum().reset_index()
Out[36]:
Place # Incidents
0 A 6
1 B 7
2 C 8
In [37]: df.groupby(['Place', 'Month'])['# Incidents'].sum().reset_index()
Out[37]:
Place Month # Incidents
0 A 1 6
1 B 1 5
2 B 2 2
3 C 2 2
4 C 3 6
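Equivalently, passing as_index=False avoids the separate reset_index call:
df.groupby(['Place', 'Month'], as_index=False)['# Incidents'].sum()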
The pandas groupby documentation has lots of examples.