I'm hoping to use a conditional statement to create a new column but I'm unsure on the best way to proceed.
Using the data below, I have various Items, each with a specific Direction at a given point in time. I want to use the ID to pick the correct Direction, i.e. find the rows where Items matches ID and take their Direction.
I've manually inserted this below as Main Direction but will need to automate it. I then want to apply a conditional statement to X: if Main Direction == 'Up' then add 10, if Main Direction == 'Down' then subtract 10.
import pandas as pd
df = pd.DataFrame({
'Time' : [1,1,1,1,2,2,2,2,3,3,3,3],
'ID' : ['A','A','A','A','B','B','B','B','A','A','A','A'],
'Items' : ['A','B','A','A','B','A','A','B','A','A','B','B'],
'Direction' : ['Up','Down','Up','Up','Down','Up','Up','Down','Up','Up','Down','Down'],
'Main Direction' : ['Up','Up','Up','Up','Down','Down','Down','Down','Up','Up','Up','Up'],
'X' : [1,2,3,4,6,7,8,9,3,4,5,6],
})
df['Dist'] = [df['X'] + 10 if x == 'Up' else df['X'] -10 for x in df['Main Direction']]
intended output:
Time ID Items Direction Main Direction X Dist
0 1 A A Up Up 1 11
1 1 A B Down Up 2 12
2 1 A A Up Up 3 13
3 1 A A Up Up 4 14
4 2 B B Down Down 6 -4
5 2 B A Up Down 7 -3
6 2 B A Up Down 8 -2
7 2 B B Down Down 9 -1
8 3 A A Up Up 3 13
9 3 A A Up Up 4 14
10 3 A B Down Up 5 15
11 3 A B Down Up 6 16
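Try np.where. Note that the line below overwrites X in place; assign the result to df['Dist'] instead if you want to keep the original column.
import numpy as np
df.X=np.where(df['Main Direction'].eq('Up'),df.X+10,df.X-10)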
df
Time ID Items Direction Main Direction X
0 1 A A Up Up 11
1 1 A B Down Up 12
2 1 A A Up Up 13
3 1 A A Up Up 14
4 2 B B Down Down -4
5 2 B A Up Down -3
6 2 B A Up Down -2
7 2 B B Down Down -1
8 3 A A Up Up 13
9 3 A A Up Up 14
10 3 A B Down Up 15
11 3 A B Down Up 16
Here is how to do the Main direction column
df['testDirec'] = np.where(df['ID'] == df['Items'],df['Direction'],None)
df['testDirec'] = df['testDirec'].ffill()
Gives the same as Main Direction
Time ID Items Direction Main Direction X testDirec
0 1 A A Up Up 1 Up
1 1 A B Down Up 2 Up
2 1 A A Up Up 3 Up
3 1 A A Up Up 4 Up
4 2 B B Down Down 6 Down
5 2 B A Up Down 7 Down
6 2 B A Up Down 8 Down
7 2 B B Down Down 9 Down
8 3 A A Up Up 3 Up
9 3 A A Up Up 4 Up
10 3 A B Down Up 5 Up
11 3 A B Down Up 6 Up
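The two steps discussed in this thread can be combined. The following is a minimal sketch, assuming (as in the sample data) that the first row of every Time block has Items equal to ID, so that forward filling never carries a direction over from the previous block. The variable name main is illustrative and not from the original posts.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Time': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'A', 'A', 'A', 'A'],
    'Items': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A', 'B', 'B'],
    'Direction': ['Up', 'Down', 'Up', 'Up', 'Down', 'Up', 'Up', 'Down',
                  'Up', 'Up', 'Down', 'Down'],
    'X': [1, 2, 3, 4, 6, 7, 8, 9, 3, 4, 5, 6],
})

# Keep Direction only where the item matches the ID (NaN elsewhere),
# then carry the last matching direction forward.
main = df['Direction'].where(df['ID'].eq(df['Items'])).ffill()
df['Main Direction'] = main

# Add 10 where the main direction is 'Up', subtract 10 otherwise.
df['Dist'] = np.where(main.eq('Up'), df['X'] + 10, df['X'] - 10)
print(df)
```

If the first row of a block might not match its ID, the forward fill would need to be restricted per block, for example with a groupby on Time.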
You can use any custom function with apply. If you're operating row-wise (e.g. need more than one column) you'll call this with axis=1, and it will pass a row. Otherwise you can just do df['my_col'].apply(lambda x: do_something_to_x(x)) and it will pass values of the col. For your example:
def calc_dist(row):
    if row['Main Direction'] == 'Up':
        return row['X'] + 10
    else:
        return row['X'] - 10
df['Dist'] = df.apply(calc_dist, axis=1)
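For a simple lookup like this one, the row-wise function can also be replaced by a mapping from direction to offset. This is only a sketch on a reduced version of the data; the dictionary name offsets is an assumption for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Main Direction': ['Up', 'Up', 'Down', 'Down'],
    'X': [1, 2, 6, 7],
})

# Translate each direction into the amount to add, then add it column-wise.
offsets = {'Up': 10, 'Down': -10}
df['Dist'] = df['X'] + df['Main Direction'].map(offsets)
print(df)
```

Any direction missing from the dictionary would map to NaN, so this variant suits cases where the set of labels is known in advance.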
Related
Let's say I have the following df -
data={'Location':[1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4]}
df = pd.DataFrame(data=data)
df
Location
0 1
1 1
2 1
3 2
4 2
5 2
6 2
7 2
8 2
9 3
10 3
11 3
12 3
13 3
14 3
15 4
16 4
17 4
In addition, I have the following dict:
Unlock={
1:"A",
2:"B",
3:"C",
4:"D",
5:"E",
6:"F",
7:"G",
8:"H",
9:"I",
10:"J"
}
I'd like to create another column that will randomly select a string from the 'Unlock' dict based on the condition that Location<=Unlock. So for example - for Location 2 some rows will get 'A' and some rows will get 'B'.
I've tried to do the following but with no luck (I'm getting an error) -
df['Name']=np.select(df['Location']<=Unlock,np.random.choice(Unlock,size=len(df))
Thanks in advance for your help!
You can convert your dictionary values to a list, and randomly select the values of a subset of this list: only up to Location number of elements.
With Python versions >= 3.7, dict maintains insertion order. For lower versions - see below.
lst = list(Unlock.values())
df['Name'] = df['Location'].transform(lambda loc: np.random.choice(lst[:loc]))
Example output:
Location Name
0 1 A
1 1 A
2 1 A
3 2 B
4 2 B
5 2 B
6 2 B
7 2 A
8 2 B
9 3 B
10 3 B
11 3 C
12 3 C
13 3 C
14 3 B
15 4 A
16 4 C
17 4 D
If you are using an older version of Python, you can build a list of dictionary values, sorted by key:
lst = [value for key, value in sorted(Unlock.items())]
For a vectorized method, multiply Location by a random value in (0, 1] and take the ceiling, then map the result through your dictionary.
This gives an equiprobable key between 1 and the current Location (inclusive):
import numpy as np
df['random'] = (np.ceil(df['Location'].mul(1-np.random.random(size=len(df))))
.astype(int).map(Unlock)
)
output (reproducible with np.random.seed(0)):
Location random
0 1 A
1 1 A
2 1 A
3 2 B
4 2 A
5 2 B
6 2 A
7 2 B
8 2 B
9 3 B
10 3 C
11 3 B
12 3 B
13 3 C
14 3 A
15 4 A
16 4 A
17 4 D
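A similar vectorized draw can be sketched with NumPy's Generator API, which accepts an array as the upper bound. This assumes, as in the question, that the dictionary keys are the consecutive integers starting at 1; the names rng and keys are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Location': [1, 1, 1, 2, 2, 2, 2, 2, 2,
                                3, 3, 3, 3, 3, 3, 4, 4, 4]})
Unlock = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E",
          6: "F", 7: "G", 8: "H", 9: "I", 10: "J"}

rng = np.random.default_rng(0)  # seeded only to make the run repeatable

# One integer per row, drawn uniformly from 1..Location (upper bound included).
keys = rng.integers(1, df['Location'].to_numpy(), endpoint=True)
df['Name'] = pd.Series(keys, index=df.index).map(Unlock)
print(df)
```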
So I have a dataframe like the one below.
dff = pd.DataFrame({'id':[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3], 'categ':['A','A','A','B','C','A','A','A','B','C','A','A','A','B','C'],'cost':[3,1,1,3,10,1,2,3,4,10,2,2,2,4,13] })
dff
id categ cost
0 1 A 3
1 1 A 1
2 1 A 1
3 1 B 3
4 1 C 10
5 2 A 1
6 2 A 2
7 2 A 3
8 2 B 4
9 2 C 10
10 3 A 2
11 3 A 2
12 3 A 2
13 3 B 4
14 3 C 13
Now I want to group by 'id' and create a new column that is True if the summed cost of category A equals 50% of the cost of C and the summed cost of category B equals 30% of the cost of C, and False otherwise. My desired output is the one below.
new
id
1 True
2 False
3 False
I have tried a few things but I can't make it work. Any idea on how to get my desired output? Thanks
Try pivot data frame first and then check if columns A, B, C satisfy the condition:
import numpy as np
dff.pivot_table('cost', 'id', 'categ', aggfunc='sum')\
.assign(new = lambda df: np.isclose(df.A, 0.5 * df.C) & np.isclose(df.B, 0.3 * df.C))
categ A B C new
id
1 5 3 10 True
2 6 4 10 False
3 6 4 13 False
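Try pd.crosstab with normalize, then apply a little bit of math.
Note: we cannot test for exact equality here because the values are floats, so we need np.isclose. If A = 0.5*C and B = 0.3*C, the row total is 1.8*C, so the normalized shares must be 0.5/1.8, 0.3/1.8 and 1/1.8.
s = pd.crosstab(dff['id'], dff['categ'], dff['cost'],aggfunc='sum',normalize = 'index')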
s['new'] = np.isclose(s.values.tolist(),[0.5/1.8,0.3/1.8,1/1.8],atol=0.0001).all(1)
s
Out[341]:
categ A B C new
id
1 0.277778 0.166667 0.555556 True
2 0.300000 0.200000 0.500000 False
3 0.260870 0.173913 0.565217 False
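The same check can also be sketched with a plain groupby followed by unstack, which avoids normalizing the values. This is an illustration under the assumption that the rule is "sum of A is 50% of C and sum of B is 30% of C"; the names sums and result are not from the original posts.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    'categ': ['A', 'A', 'A', 'B', 'C'] * 3,
    'cost': [3, 1, 1, 3, 10, 1, 2, 3, 4, 10, 2, 2, 2, 4, 13],
})

# Total cost per id and category, with one column per category.
sums = dff.groupby(['id', 'categ'])['cost'].sum().unstack(fill_value=0)

# Compare with a tolerance because 0.5 * C and 0.3 * C are floats.
new = np.isclose(sums['A'], 0.5 * sums['C']) & np.isclose(sums['B'], 0.3 * sums['C'])
result = pd.DataFrame({'new': new}, index=sums.index)
print(result)
```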
Given the df below, I want to add a column 'C' that holds the larger of the constant 15 and column 'A' for the rows where B == 't', and the unchanged value of 'A' otherwise.
testdf = pd.DataFrame({"A":[20, 16, 7, 3, 8],"B":['t','t','t','t','f']})
testdf
A B
0 20 t
1 16 t
2 7 t
3 3 t
4 8 f
I tried this:
testdf.loc[testdf['B']=='t', 'C'] = max(15,(testdf.loc[testdf['B']=='t','A']))
And desired output is:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
Could you help me to get the output? Thank you!
Use np.where with clip:
testdf['C'] = np.where(testdf['B'].eq('t'),
                       testdf['A'].clip(15), testdf['A'])
Or similarly with series.where:
testdf['C'] = (testdf['A'].clip(15)
.where(testdf['B'].eq('t'), testdf['A'])
)
output:
A B C
0 20 t 20
1 16 t 16
2 7 t 15
3 3 t 15
4 8 f 8
You could also use the update method:
testdf['C'] = testdf['A']
A B C
0 20 t 20
1 16 t 16
2 7 t 7
3 3 t 3
4 8 f 8
values = testdf.A[testdf.B.eq('t')].clip(15)
values
Out[16]:
0 20
1 16
2 15
3 15
Name: A, dtype: int64
testdf.update(values.rename('C'))
A B C
0 20 t 20.0
1 16 t 16.0
2 7 t 15.0
3 3 t 15.0
4 8 f 8.0
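Another way to express the same idea is Series.mask, which replaces values only where a condition holds. The sketch below assumes the sample frame from the question; unlike the update route it should leave 'C' as an integer column, because no missing values are introduced along the way.

```python
import pandas as pd

testdf = pd.DataFrame({"A": [20, 16, 7, 3, 8],
                       "B": ['t', 't', 't', 't', 'f']})

# Where B == 't', use A raised to at least 15; elsewhere keep A unchanged.
is_t = testdf['B'].eq('t')
testdf['C'] = testdf['A'].mask(is_t, testdf['A'].clip(lower=15))
print(testdf)
```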
To apply a function to each individual value of a dataframe column you can use
df['column'] =df['column'].apply(lambda x: anyFunc(x))
Here x receives the values of the column one at a time and passes each one to the function, where you can manipulate it and return the result.
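As a small illustration of element-wise apply, the sketch below raises every value of a column to at least 15. The function name floor_at_15 and the column name A_floored are hypothetical; note that the function can be passed directly, without wrapping it in a lambda.

```python
import pandas as pd

testdf = pd.DataFrame({"A": [20, 16, 7, 3, 8],
                       "B": ['t', 't', 't', 't', 'f']})

def floor_at_15(value):
    """Return the value itself, or 15 if the value is smaller."""
    return max(15, value)

# apply calls floor_at_15 once per element of column A.
testdf['A_floored'] = testdf['A'].apply(floor_at_15)
print(testdf)
```

This version ignores column B; a condition on a second column needs either a row-wise apply with axis=1 or one of the vectorized forms shown earlier.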
I have a df as
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
I am looking for a pattern "ABD followed by CDE without having event B in between them "
For example, The output of this df will be :
Id Event SeqNo
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
This pattern can occur multiple times for a single ID, and I want to find the list of all those IDs and their respective counts (if possible).
Here's a vectorized one with some scaling trickery and leveraging convolution to find the required pattern -
# Get the col in context and scale it to the three strings to form an ID array
a = df['Event']
id_ar = (a=='ABD') + 2*(a=='B') + 3*(a=='CDE')
# Mask of those specific strings and hence extract the corresponding masked df
mask = id_ar>0
df1 = df[mask].copy()  # copy so that adding a column below does not write to a view of df
# Get pattern col with 1s at places with the pattern found, 0s elsewhere
df1['Pattern'] = (np.convolve(id_ar[mask],[9,1],'same')==28).astype(int)
# Groupby Id col and sum the pattern col for final output
out = df1.groupby(['Id'])['Pattern'].sum()
That convolution part might be a bit tricky. The idea is to use id_ar, which holds the values 1, 2 and 3 for the strings 'ABD', 'B' and 'CDE' respectively. We are looking for a 1 immediately followed by a 3 in the filtered array, so convolving with the kernel [9,1] gives 1*1 + 3*9 = 28 as the convolution sum for the window that has 'ABD' and then 'CDE'. Hence, we look for a convolution sum of 28 to find a match. For the case of 'ABD' followed by 'B' and then 'CDE', the convolution sums are different, so it is filtered out.
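To see the arithmetic on its own, here is a small sketch on a hand-made code array (1 = 'ABD', 2 = 'B', 3 = 'CDE'). The array codes is made up for illustration; with the kernel [9, 1] and mode 'same', each output position holds 9 times the current code plus the previous code.

```python
import numpy as np

# B, ABD, CDE, B, ABD, B, CDE  (already filtered to the three relevant events)
codes = np.array([2, 1, 3, 2, 1, 2, 3])

# conv[n] = 9 * codes[n] + codes[n - 1]
conv = np.convolve(codes, [9, 1], 'same')

# 28 = 9 * 3 + 1 can only arise from 'CDE' directly preceded by 'ABD'.
matches = np.flatnonzero(conv == 28)
print(conv)
print(matches)  # position of the 'CDE' that closes a valid pattern
```

The last 'CDE' in this array is preceded by 'B', which gives 9 * 3 + 2 = 29 and is therefore not counted.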
Sample run -
1) Input dataframe :
In [377]: df
Out[377]:
Id Event SeqNo
0 1 A 1
1 1 B 2
2 1 C 3
3 1 ABD 4
4 1 B 5
5 1 C 6
6 1 A 7
7 1 CDE 8
8 1 D 9
9 1 B 10
10 1 ABD 11
11 1 D 12
12 1 B 13
13 2 A 1
14 2 B 2
15 2 C 3
16 2 ABD 4
17 2 A 5
18 2 C 6
19 2 A 7
20 2 CDE 8
21 2 D 9
22 2 B 10
23 2 ABD 11
24 2 D 12
25 2 B 13
26 2 CDE 14
27 2 A 15
2) Intermediate filtered o/p (look at column Pattern for the presence of the reqd. pattern) :
In [380]: df1
Out[380]:
Id Event SeqNo Pattern
1 1 B 2 0
3 1 ABD 4 0
4 1 B 5 0
7 1 CDE 8 0
9 1 B 10 0
10 1 ABD 11 0
12 1 B 13 0
14 2 B 2 0
16 2 ABD 4 0
20 2 CDE 8 1
22 2 B 10 0
23 2 ABD 11 0
25 2 B 13 0
26 2 CDE 14 0
3) Final o/p :
In [381]: out
Out[381]:
Id
1 0
2 1
Name: Pattern, dtype: int64
I used a solution based on the assumption that anything other than ABD, CDE and B is irrelevant to our solution. So I get rid of those events first with a filtering operation.
Then, what I want to know is whether there is an ABD followed by a CDE without a B in between. I shift the Event column by one step in time (note this doesn't have to be one step in units of SeqNo).
Then I check, for every row of the new df, whether Event == ABD and Event_1_Step == CDE, meaning that there wasn't a B in between, but possibly other stuff like A or C or even nothing. This gets me a list of booleans, one for every time I have a sequence like that. If I sum them up, I get the count.
Finally, I have to make sure these are all done at Id level, so I use .groupby.
IMPORTANT: This solution assumes that your df is sorted by Id first and then by SeqNo. If not, please sort it first.
import pandas as pd
df = pd.read_csv("path/to/file.csv")
# Keep only the events that matter for the pattern (copy to avoid writing to a view)
df2 = df[df["Event"].isin(["ABD", "CDE", "B"])].copy()
# Next relevant event and its SeqNo, aligned with the current row
df2["Event_1_Step"] = df2["Event"].shift(-1)
df2["SeqNo_1_Step"] = df2["SeqNo"].shift(-1)
for id, id_df in df2.groupby("Id"):
    print(id) # Set a counter object here per Id to track count per id
    id_df = id_df[id_df.apply(lambda x: x["Event"] == "ABD" and x["Event_1_Step"] == "CDE", axis=1)]
    for row_id, row in id_df.iterrows():
        print(df[(df["Id"] == id) & df["SeqNo"].between(row["SeqNo"], row["SeqNo_1_Step"])])
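The counting step can also be written without explicit loops. The sketch below is one possible variant, not the answer's own code: it shifts within each Id so that the last event of one Id is never paired with the first event of the next, and it uses a small made-up frame. The names events, next_event, hits and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Id':    [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
    'Event': ['ABD', 'A', 'CDE', 'ABD', 'B', 'CDE',
              'ABD', 'CDE', 'ABD', 'A', 'CDE',
              'ABD', 'B', 'CDE'],
})
df['SeqNo'] = df.groupby('Id').cumcount() + 1

# Drop everything except the three events that decide the pattern.
events = df[df['Event'].isin(['ABD', 'CDE', 'B'])]

# Next relevant event within the same Id (NaN at the end of each Id).
next_event = events.groupby('Id')['Event'].shift(-1)

# A hit is an ABD whose next relevant event is CDE, i.e. no B in between.
hits = events['Event'].eq('ABD') & next_event.eq('CDE')
counts = hits.groupby(events['Id']).sum()
print(counts)
```

Ids that contain none of the three events do not appear in counts at all; reindexing on the unique Ids of df with a fill value of 0 would add them back.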
You could use this:
s = (pd.Series(
np.select([df['Event'] == 'ABD', df['Event'] =='B', df['Id'] != df['Id'].shift()],
[True, False, False], default=np.nan))
.ffill()
.fillna(False)
.astype(bool))
corr = (df['Event'] == "CDE") & s
corr.groupby(df['Id']).max()
np.select creates a column that is True where Event == 'ABD' and False where Event == 'B' or at the start of a new Id; every other row is NaN. Forward filling with ffill then tells you, for every row, whether ABD or B was seen last. You can then check whether that state is True on the rows where the event is CDE, and use groupby to check whether this holds for any row per Id.
Which for
Id Event SeqNo
1 A 1
1 B 2
1 C 3
1 ABD 4
1 A 5
1 C 6
1 A 7
1 CDE 8
1 D 9
1 B 10
1 ABD 11
1 D 12
1 B 13
1 CDE 14
1 A 15
2 B 16
3 ABD 17
3 B 18
3 CDE 19
4 ABD 20
4 CDE 21
5 CDE 22
Outputs:
Id
1 True
2 False
3 False
4 True
5 False
I am sure this is an easy fix, but I haven't been able to find the exact solution to my problem. My data set has a column called 'LANE' which contains unique values. I want to add rows for each 'LANE' based on a range of numbers (which would be 0 to 12). As a result each 'LANE' would have 13 rows with a new column 'NUMBER' ranging from 0 up to and including 12.
Example:
Input
LANE
a
b
Output
LANE NUMBER
a 0
a 1
a 2
a 3
a 4
a 5
a 6
a 7
a 8
a 9
a 10
a 11
a 12
b 0
b 1
b 2
b 3
b 4
b 5
b 6
b 7
b 8
b 9
b 10
b 11
b 12
I am currently trying different forms of:
num = 0
while num <= 12:
    for x in df['LANE']:
        df['NUMBER'] = num
    num += 1
The problem with this loop is, I still have one record for each lane and the 'NUMBER' column only has the value 12.
Comprehension
For loops are the natural and naive way to produce Cartesian products. Comprehensions allow us to embed this more succinctly.
pd.DataFrame(
    [[l, n] for l in df.LANE for n in range(13)],  # range(13) yields 0 through 12
    columns=['LANE', 'NUMBER']
)
LANE NUMBER
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 a 6
7 a 7
8 a 8
9 a 9
10 a 10
11 a 11
12 a 12
13 b 0
14 b 1
15 b 2
16 b 3
17 b 4
18 b 5
19 b 6
20 b 7
21 b 8
22 b 9
23 b 10
24 b 11
25 b 12
itertools.product
This logic is almost identical to the Comprehension solution, but it uses itertools' built-in product function. product is an iterator that pops out each combination one at a time. I force the result by unpacking with the splat *, like so: [*product(a, b)]. Ultimately, it is a list of tuples that gets passed to the pd.DataFrame constructor in the same way as in the Comprehension solution above.
from itertools import product
pd.DataFrame([*product(df.LANE, range(13))], columns=['LANE', 'NUMBER'])
groupby/cumcount and repeat
I don't like this answer but it provides some perspective on the simplicity of the other answers.
I use repeat to replicate each index value 13 times. I use this repeated index in a loc call, which returns a dataframe sliced with the passed index. I then use groupby's cumcount to count each position within the group and add that as a new column.
df.loc[df.index.repeat(13)].assign(NUMBER=lambda d: d.groupby('LANE').cumcount())
LANE NUMBER
0 a 0
0 a 1
0 a 2
0 a 3
0 a 4
0 a 5
0 a 6
0 a 7
0 a 8
0 a 9
0 a 10
0 a 11
0 a 12
1 b 0
1 b 1
1 b 2
1 b 3
1 b 4
1 b 5
1 b 6
1 b 7
1 b 8
1 b 9
1 b 10
1 b 11
1 b 12
Another approach using pandas as below:
# First approach, one-liner (13 rows per lane, numbered 0 through 12)
df = pd.DataFrame({'Lane': ['a'] * 13 + ['b'] * 13,
                   'Number': list(range(13)) * 2})
# Second approach
df = pd.DataFrame({'Lane': ['a'] * 13 + ['b'] * 13})
df['Number'] = df.groupby('Lane').cumcount()
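When the lanes come from an existing frame rather than being typed out, the Cartesian product can also be built with pd.MultiIndex.from_product. This is a sketch assuming a frame with a unique 'LANE' column as in the question; the names lanes and out are illustrative.

```python
import pandas as pd

lanes = pd.DataFrame({'LANE': ['a', 'b']})

# Every lane paired with every number from 0 to 12 inclusive.
out = (pd.MultiIndex
       .from_product([lanes['LANE'], range(13)], names=['LANE', 'NUMBER'])
       .to_frame(index=False))
print(out)
```

The result has one row per lane and number, in lane order, with a fresh 0-based index.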