This is a follow-up to this question: determine the coordinates where two pandas time series cross, and how many times the time series cross
I have two series in my pandas DataFrame and would like to know where they intersect.
A B
0 1 0.5
1 2 3.0
2 3 1.0
3 4 1.0
4 5 6.0
With this code, we can create a third column that will contain True every time the two series intersect:
import numpy as np

df['difference'] = df.A - df.B
# a crossing happened wherever the sign of the difference changes between rows
df['cross'] = np.sign(df.difference.shift(1)) != np.sign(df.difference)
# the first row compares against NaN and always registers True, hence the -1
np.sum(df.cross) - 1
Now, instead of a simple True or False, I would like to know in which direction the intersection took place. For example: from 1 to 2, it intersected upwards, from 2 to 3 downwards, from 3 to 4 no intersections, from 4 to 5 upwards.
A B Cross_direction
0 1 0.5 None
1 2 3.0 Upwards
2 3 1.0 Downwards
3 4 1.0 None
4 5 6.0 Upwards
In pseudo-code, it should be like this:
cross_directions = [none, none, ...]  # one entry per row
for item in df['difference']:
    if item > 0 and next_item < 0:
        cross_directions.append("up")
    elif item < 0 and next_item > 0:
        cross_directions.append("down")
The problem is that next_item is not available with this syntax (in the original syntax we obtain the neighboring row using .shift(1)), and it takes a lot of code.
Should I look into implementing the code above using something that can group the loop by 2 items at a time? Or is there a simpler and more elegant solution like the one from the previous question?
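As an aside, the "next item" from the pseudo-code can be materialized without a loop by shifting in the opposite direction; a minimal sketch, assuming the df['difference'] column from above:

# shift(-1) aligns each row with the value of the row after it
next_item = df['difference'].shift(-1)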
You can use numpy.select.
The code below should work for you:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [0.5, 3, 1, 1, 6]})
df['Diff'] = df.A - df.B
df['Cross'] = np.select(
    [(df.Diff < 0) & (df.Diff.shift() > 0),
     (df.Diff > 0) & (df.Diff.shift() < 0)],
    ['Up', 'Down'],
    'None')
# Output dataframe
A B Diff Cross
0 1 0.5 0.5 None
1 2 3.0 -1.0 Up
2 3 1.0 2.0 Down
3 4 1.0 3.0 None
4 5 6.0 -1.0 Up
My very lousy and redundant solution:
dataframe['difference'] = dataframe['A'] - dataframe['B']
# sign of the current difference, and sign of the previous one
dataframe['temporary_a'] = np.array(dataframe.difference) > 0
dataframe['temporary_b'] = np.array(dataframe.difference.shift(1)) < 0

cross_directions = []
for index, row in dataframe.iterrows():
    if not row['temporary_a'] and not row['temporary_b']:
        cross_directions.append("up")
    elif row['temporary_a'] and row['temporary_b']:
        cross_directions.append("down")
    else:
        cross_directions.append("not")

dataframe['cross_direction'] = cross_directions
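For reference, the same temporary-column logic can be collapsed into one vectorized call; a sketch along the lines of the np.select answer above, reusing the columns just built:

import numpy as np

a = dataframe['temporary_a']
b = dataframe['temporary_b']
# same branches as the loop: neither flag -> "up", both flags -> "down"
dataframe['cross_direction'] = np.select([~a & ~b, a & b], ['up', 'down'], 'not')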
I have a dataframe:
data = {'x': [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3],
        'y': [1, 4, 5, 2, 6, 7, 8, 3, 9, 10, 11, 12, 13],
        'z': [1, 1, 1, 2, 2, 6, 7, 3, 3, 9, 10, 3, 12],
        'a': ['Parent', 'Node', 'Node', 'Parent', 'Node', 'Node', 'Node',
              'Parent', 'Standalone', 'Node', 'Node', 'Node', 'Node']}
df = pd.DataFrame(data)
Columns:
x represents the id of the parent;
y represents the individual id of the parent, standalone, or node;
z represents the id of the group under which the row should fall.
My objective is to update column z where z != x, so that the value of each node is aligned either to its parent or to a standalone.
The output column should look like output = [1,1,1,2,2,2,2,3,3,9,9,3,3]
I am trying to use the code below:
df.z = np.where((df.z.values != df.x.values) & (df.a != 'Standalone'),
                df[df['y'] == df.z.values]['z'], df.z.values)
but I am receiving this error:
ValueError: operands could not be broadcast together with shapes (13,) (3,) (13,)
Any lead on this would be helpful. I am also open to using the .apply method.
The tricky part is Standalone. So group by x to focus on each group, then define a function do_by_check_standalone to apply to each group, checking whether 'Standalone' appears in the group's a column:
def do_by_check_standalone(df):
    if df['a'].isin(['Standalone']).any():  # need to process
        # track all of each Standalone's children
        track_parent = {i: set([i]) for i in df[df['a'] == 'Standalone']['y']}
        for k in track_parent:
            # find all children transitively, until the set stops growing
            s_len = -1
            e_len = 0
            while s_len != e_len:
                s_len = e_len
                track_parent[k].update(list(df[df['z'].isin(track_parent[k])]['y']))
                e_len = len(track_parent[k])
        # track_parent should be: {9: {9, 10, 11}}
        # loop over each row
        for idx, row in df.iterrows():
            if row['a'] == 'Node':  # child: need to find whether its parent is 'Parent' or 'Standalone'
                isStandalone = False
                # if its parent is a 'Standalone', assign the Standalone's value (i.e. k)
                for k in track_parent:
                    if row['z'] in track_parent[k]:
                        df.loc[idx, 'output'] = k
                        isStandalone = True
                        break
                if not isStandalone:  # otherwise, set to 'x' (i.e. df.name, which we grouped by)
                    df.loc[idx, 'output'] = df.name
            else:  # Parent or Standalone: simply fill 'x'
                df.loc[idx, 'output'] = df.name
    else:  # no Standalone here: simply fill 'x'
        df.loc[:, 'output'] = df.name  # x
    return df
The df.output column should fit your need (with some type conversion).
>>> df.groupby('x').apply(do_by_check_standalone)
x y z a output
0 1 1 1 Parent 1.0
1 1 4 1 Node 1.0
2 1 5 1 Node 1.0
3 2 2 2 Parent 2.0
4 2 6 2 Node 2.0
5 2 7 6 Node 2.0
6 2 8 7 Node 2.0
7 3 3 3 Parent 3.0
8 3 9 3 Standalone 3.0
9 3 10 9 Node 9.0
10 3 11 10 Node 9.0
11 3 12 3 Node 3.0
12 3 13 12 Node 3.0
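As a follow-up to the type-conversion remark, the float output column can be cast back to int; a small sketch, assuming the frames above:

out = df.groupby('x').apply(do_by_check_standalone)
out['output'] = out['output'].astype(int)  # 1.0 -> 1, 9.0 -> 9, etc.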
I have a parking lot with cars of different models (nr), and the cars are so closely packed that in order for one to get out, some others might need to be moved first. A little like a 15-puzzle, except that I can take one or more cars out of the parking lot. Ordered_car_List contains the cars that will be picked up today, and they need to be taken out of the parking lot while moving as few non-ordered cars as possible. There are more columns in this DataFrame, but this is the part I can't figure out.
I have a program that works well for small sets of data, but it seems that this is not the way of pandas :-)
I have this:
cars = pd.DataFrame({'x': [1,1,1,1,1,2,2,2,2],
'y': [1,2,3,4,5,1,2,3,4],
'order_number':[6,6,7,6,7,9,9,10,12]})
cars['order_number_no_dublicates_down'] = None
Ordered_car_List = [6,9,9,10,28]
i = 0
while i < len(cars):
    temp_val = cars.at[i, 'order_number']
    if temp_val in Ordered_car_List:
        cars.at[i, 'order_number_no_dublicates_down'] = temp_val
        Ordered_car_List.remove(temp_val)
    i += 1
If I use cars.apply(lambda ...), how can I change Ordered_car_List in each iteration?
Is there another approach that I can take?
I found the page below, and it made me want to go faster. The lambda approach is in the middle when it comes to speed, but it is still so much faster than what I am doing now.
https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
Updating cars
We can vectorize this based on two counters:
cumcount() to cumulatively count each unique value in cars['order_number']
collections.Counter() to count each unique value in Ordered_car_List
from collections import Counter

cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
# order_number cumcount maxcount
# 0 6 1 1
# 1 6 2 1
# 2 7 1 0
# 3 6 3 1
# 4 7 2 0
# 5 9 1 2
# 6 9 2 2
# 7 10 1 1
# 8 12 1 0
So then we only want to keep cars['order_number'] where cumcount <= maxcount:
either use DataFrame.loc[]
cars.loc[cumcount <= maxcount, 'nodup'] = cars['order_number']
or Series.where()
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
or Series.mask() with the condition inverted
cars['nodup'] = cars['order_number'].mask(cumcount > maxcount)
Updating Ordered_car_List
The final Ordered_car_List is a Counter() difference:
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
# [6, 9, 9, 10]
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Final output
cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
# x y order_number nodup
# 0 1 1 6 6.0
# 1 1 2 6 NaN
# 2 1 3 7 NaN
# 3 1 4 6 NaN
# 4 1 5 7 NaN
# 5 2 1 9 9.0
# 6 2 2 9 9.0
# 7 2 3 10 10.0
# 8 2 4 12 NaN
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
Ordered_car_List = list(Counter(Ordered_car_List) - Counter(Used_car_List))
# [28]
Timings
Note that your loop is still very fast with small data, but the vectorized counter approach just scales much better.
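A minimal timeit sketch to compare the two on larger data (the loop version paraphrases the question's while loop; sizes are illustrative and exact numbers depend on hardware):

import timeit
from collections import Counter

import numpy as np
import pandas as pd

n = 100_000
cars = pd.DataFrame({'order_number': np.random.randint(0, 50, n)})
Ordered_car_List = [6, 9, 9, 10, 28]

def loop_version():
    # paraphrase of the question's while loop
    result = [None] * len(cars)
    remaining = list(Ordered_car_List)
    for i in range(len(cars)):
        temp_val = cars.at[i, 'order_number']
        if temp_val in remaining:
            result[i] = temp_val
            remaining.remove(temp_val)
    return result

def counter_version():
    cumcount = cars.groupby('order_number').cumcount().add(1)
    maxcount = cars['order_number'].map(Counter(Ordered_car_List))
    return cars['order_number'].where(cumcount <= maxcount)

print(timeit.timeit(loop_version, number=3))
print(timeit.timeit(counter_version, number=3))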
I have a data frame that looks like this:
import numpy as np
import pandas as pd

data_dict = {'factor_1': np.random.randint(1, 5, 10),
             'factor_2': np.random.randint(1, 5, 10),
             'multi': np.random.rand(10),
             'output': np.nan}
df = pd.DataFrame(data_dict)
I'm getting stuck implementing this comparison:
If the factor_1 and factor_2 values match, then output = 2 * multi (here 2 is a kind of base value). Continue scanning the next rows.
If the factor_1 and factor_2 values don't match, then output = -2. Scan the next row(s).
If the factor values still don't match up to row R, then assign the output values -2^2, -2^3, ..., -2^R respectively.
When the factor values match again at row R+1, then assign output = 2^(R+1) * multi.
Repeat the process.
The end result will look like the output shown in the answers below.
This solution does not use recursion:
import numpy as np
import pandas as pd

# sample data
np.random.seed(1)
data_dict = {'factor_1': np.random.randint(1, 5, 10),
             'factor_2': np.random.randint(1, 5, 10),
             'multi': np.random.rand(10),
             'output': np.nan}
df = pd.DataFrame(data_dict)

# mask is True on the rows where the factors do not match
mask = (df['factor_1'] != df['factor_2'])
# R counts how deep we are into the current run of mismatches:
# total mismatch count minus its value at the last match, forward-filled
df['R'] = mask.cumsum() - mask.cumsum().where(~mask).ffill().fillna(0)
# matched rows get 2 * multi; mismatched rows get -2**R
df['output'] = np.where(df['R'] == 0, df['multi'] * 2, -2**df['R'])
factor_1 factor_2 multi output R
0 2 1 0.419195 -2.000000 1.0
1 4 2 0.685220 -4.000000 2.0
2 1 1 0.204452 0.408904 0.0
3 1 4 0.878117 -2.000000 1.0
4 4 2 0.027388 -4.000000 2.0
5 2 1 0.670468 -8.000000 3.0
6 4 3 0.417305 -16.000000 4.0
7 2 2 0.558690 1.117380 0.0
8 4 3 0.140387 -2.000000 1.0
9 1 1 0.198101 0.396203 0.0
The solution I present is maybe a little bit harder to read, but I think it works as you wanted. It combines:
numpy.where() to make a column based on a condition,
pandas.DataFrame.shift() and pandas.DataFrame.cumsum() to label groups of consecutive similar values, and
pandas.DataFrame.rank() to construct the vector of powers applied to the previously made df['output'] column.
The code follows.
# matched rows start as -2 * multi, mismatched rows as 2
df['output'] = np.where(df.factor_1 == df.factor_2, -2 * df.multi, 2)
# group by the output value plus a run counter, so each run of
# consecutive equal values forms its own group
group = ['output', (df.output != df.output.shift()).cumsum()]
# within each run, rank gives the powers 1, 2, ..., R; raising output to
# that power and flipping the sign yields 2*multi on matches and -2, -4, ...
# on mismatch runs
df['output'] = (-1) * df.output.pow(df.groupby(group).output.rank('first'))
A plain Python loop over the dict columns gives the same result; data_dict['output'] just needs to be a real array first (in the sample data it is a scalar NaN):
data_dict['output'] = np.zeros(len(data_dict['multi']))

flag = False
cols = ('factor_1', 'factor_2', 'multi')
z = zip(*[data_dict[col] for col in cols])
for i, (f1, f2, multi) in enumerate(z):
    if f1 == f2:
        output = 2 * multi
        flag = False
    else:
        if flag:
            output *= 2
        else:
            output = -2
        flag = True
    data_dict['output'][i] = output
The tricky part is the flag variable, which tells you whether the previous row had a match or not.
I have a very large pandas dataset, and at some point I need to use the following function:
def proc_trader(data):
    data['_seq'] = np.nan
    # mark every ending of a roundtrip with its index
    data.loc[data.cumq == 0, 'tag'] = np.arange(1, (data.cumq == 0).sum() + 1)
    # backfill the roundtrip index until the previous roundtrip;
    # then fill the rest with 0s (roundtrip incomplete for most recent trades)
    data['_seq'] = data['tag'].bfill().fillna(0)
    return data['_seq']
# btw, why on earth does this function return a dataframe instead of the series `data['_seq']`??
and I use apply:
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader)
Obviously, I cannot share the data here, but do you see a bottleneck in my code? Could it be the arange thing? There are many name-productid combinations in the data.
Minimal Working Example:
import pandas as pd
import numpy as np
reshaped = pd.DataFrame({'trader': ['a', 'a', 'a', 'a', 'a', 'a', 'a'],
                         'stock': ['a', 'a', 'a', 'a', 'a', 'a', 'b'],
                         'day': [0, 1, 2, 4, 5, 10, 1],
                         'delta': [10, -10, 15, -10, -5, 5, 0],
                         'out': [1, 1, 2, 2, 2, 0, 1]})
reshaped.sort_values(by=['trader', 'stock','day'], inplace=True)
reshaped['cumq']=reshaped.groupby(['trader', 'stock']).delta.transform('cumsum')
reshaped['_spell']=reshaped.groupby(['trader','stock'])[['cumq']].apply(proc_trader).reset_index()['_seq']
Nothing really fancy here, just tweaked in a couple of places. There is really no need to put it in a function, so I didn't. On this tiny sample data, it's about twice as fast as the original.
reshaped.sort_values(by=['trader', 'stock', 'day'], inplace=True)
reshaped['cumq'] = reshaped.groupby(['trader', 'stock']).delta.cumsum()
# mark the end of each roundtrip with 1, then number them cumulatively
reshaped.loc[reshaped.cumq == 0, '_spell'] = 1
reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])['_spell'].cumsum()
# backfill the roundtrip number; incomplete roundtrips become 0
reshaped['_spell'] = reshaped.groupby(['trader', 'stock'])['_spell'].bfill().fillna(0)
Result:
day delta out stock trader cumq _spell
0 0 10 1 a a 10 1.0
1 1 -10 1 a a 0 1.0
2 2 15 2 a a 15 2.0
3 4 -10 2 a a 5 2.0
4 5 -5 2 a a 0 2.0
5 10 5 0 a a 5 0.0
6 1 0 1 b a 0 1.0
I would like to use pandas df.apply, but only for certain rows.
As an example, I want to do something like this, but my actual issue is a little more complicated:
import pandas as pd
import math
z = pd.DataFrame({'a':[4.0,5.0,6.0,7.0,8.0],'b':[6.0,0,5.0,0,1.0]})
z.where(z['b'] != 0, z['a'] / z['b'].apply(lambda l: math.log(l)), 0)
What I want in this example is the value in 'a' divided by the log of the value in 'b' for each row, and for rows where 'b' is 0, I simply want to return 0.
The other answers are excellent, but I thought I'd add one other approach that can be faster in some circumstances: using broadcasting and masking to achieve the same result.
import numpy as np

mask = (z['b'] != 0)
z_valid = z[mask]
z['c'] = 0  # default value for the rows where b == 0
z.loc[mask, 'c'] = z_valid['a'] / np.log(z_valid['b'])
Especially with very large dataframes, this approach will generally be faster than solutions based on apply().
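A rough timeit sketch of that claim (sizes and values are illustrative; exact numbers depend on your machine):

import math
import timeit

import numpy as np
import pandas as pd

n = 10_000
big = pd.DataFrame({'a': np.random.rand(n) + 1.0,
                    'b': np.random.choice([0.0, 2.0, 3.0, 4.0, 5.0], n)})

def masked():
    m = big['b'] != 0
    out = pd.Series(0.0, index=big.index)
    out[m] = big.loc[m, 'a'] / np.log(big.loc[m, 'b'])
    return out

def with_apply():
    return big.apply(lambda r: 0.0 if r['b'] == 0 else r['a'] / math.log(r['b']),
                     axis=1)

print(timeit.timeit(masked, number=20))
print(timeit.timeit(with_apply, number=20))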
You can just use an if statement in a lambda function:
z['c'] = z.apply(lambda row: 0 if row['b'] in (0, 1) else row['a'] / math.log(row['b']), axis=1)
I also excluded 1, because log(1) is zero, which would otherwise cause a division by zero.
Output:
a b c
0 4 6 2.232443
1 5 0 0.000000
2 6 5 3.728010
3 7 0 0.000000
4 8 1 0.000000
Hope this helps. It is easy and readable:
z['c'] = z['b'].apply(lambda x: 0 if x == 0 else math.log(x))
You can use a lambda with a conditional to return 0 if the input value is 0 and skip the whole where clause:
z['c'] = z.apply(lambda x: math.log(x.b) if x.b > 0 else 0, axis=1)
You also have to assign the results to a new column (z['c']).
Use np.where(), which divides a by the log of the value in b where the condition is met and returns 0 otherwise. Note that np.where evaluates both branches, so the row with b == 1 ends up as inf (log(1) is zero), as the output below shows:
import numpy as np
z['c'] = np.where(z['b'] != 0, z['a'] / np.log(z['b']), 0)
Output:
a b c
0 4.0 6.0 2.232443
1 5.0 0.0 0.000000
2 6.0 5.0 3.728010
3 7.0 0.0 0.000000
4 8.0 1.0 inf
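If the inf for b == 1 is unwanted, one small sketch (my addition, not part of the original answer) is to exclude log(b) == 0 from the condition as well:

# note: np.where still evaluates both branches, so np.log(0) will still
# emit a RuntimeWarning; wrap in np.errstate(divide='ignore') to silence it
z['c'] = np.where((z['b'] != 0) & (z['b'] != 1), z['a'] / np.log(z['b']), 0)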