I have a list of nodes in a pandas data frame that look like this:
row col
...
36 182 240
35 182 241
34 182 242
33 182 243
58 183 220
32 183 244
31 183 245
30 183 246
29 183 247
...
The grid of these nodes looks like this:
My code labels each face of the X-ed cells so that if it is connected to the adjacent X-ed cell it gets labeled as 0, and if it is not connected (open) it gets labeled as 1. The code is not working properly along some of the edges:
df["Front Face"] = 1
df["Back Face"] = 1
df["Right Face"] = 1
df["Left Face"] = 1
df = df.sort_values(by=['row','col'], ascending=True)
df = df.reset_index(drop=True)
for ix1 in df.index:
try:
if df["col"][ix1] == df["col"][ix1 + 1] - 1:
df["Right Face"][ix1] = 0
df["Left Face"][ix1 + 1] = 0
if df["col"][ix1] == df["col"][ix1 - 1] + 1:
df["Left Face"][ix1] = 0
df["Right Face"][ix1 - 1] = 0
except:
pass
df= df.sort_values(by=['col','row'], ascending=True)
df= df.reset_index(drop=True)
for ix2 in df.index:
try:
if df["row"][ix2] == df["row"][ix2 + 1] - 1:
df["Back Face"][ix2] = 0
df["Front Face"][ix2 + 1] = 0
if df["row"][ix2] == df["row"][ix2 - 1] + 1:
df["Front Face"][ix2] = 0
df["Back Face"][ix2 - 1] = 0
except:
pass
This is part of the output, with cells 182,243 and 183,244 each missing one label:
row col Front Face Back Face Right Face Left Face
36 182 240 1 1 0 0
35 182 241 1 1 0 0
34 182 242 1 1 0 0
33 182 243 1 0 1 0
58 183 220 1 0 1 1
32 183 244 0 1 0 1
31 183 245 1 1 0 0
30 183 246 1 1 0 0
29 183 247 1 1 0 0
I circled the problematic cells in the picture here:
I assume every row in your df marks an occupied position and you want to mark the adjacent cells as Front, Back, Left or Right.
If so, you can do this in a vectorized way, though I have to admit I struggled a lot with getting the indexes and the numpy broadcasting to work right.
import numpy as np
import pandas as pd

# A random 5 x 10 grid with ~10% of the cells marked as "occupied"
n, m = 5, 10
count = int(n * m * 0.1)
df = pd.DataFrame({
    'row': np.random.randint(n, size=count),
    'col': np.random.randint(m, size=count)
}).drop_duplicates()
And let's build the result data frame:
from itertools import product

# Every row in `df` marks an occupied position
result = df.set_index(['row', 'col']).assign(Occupied=True)

# Now expand `result` into the full matrix
idx = list(product(range(n), range(m)))
result = result.reindex(idx, fill_value=False)

# Every face is open (1) initially
for col in ['Front Face', 'Back Face', 'Right Face', 'Left Face']:
    result[col] = 1

# Now start to build out a list of blocked cells
occupied = result.query('Occupied').index.to_frame().to_numpy()
valid_index = result.index
faces = {
    'Front Face': [-1, 0],
    'Back Face': [1, 0],
    'Left Face': [0, -1],
    'Right Face': [0, 1]
}
for face, offset in faces.items():
    blocked = valid_index.intersection([tuple(i) for i in occupied + offset])
    result.loc[blocked, face] = 0
To illustrate the result, let's build a helper function:
from IPython.display import display

def illustrate(result):
    display_df = result['Occupied'].map({True: 'x', False: ''}).reset_index()
    display_df = display_df.pivot(index='row', columns='col', values='Occupied')
    is_open = result[['Front Face', 'Back Face', 'Right Face', 'Left Face']].all(axis=1)
    style_df = (
        is_open.map({
            True: 'background-color: white',
            False: 'background-color: rgba(255,0,0,0.2)'
        })
        .unstack()
    )
    display(display_df.style.apply(lambda _: style_df, axis=None))

illustrate(result)
Result (the cells in red have a 0 on any of the faces):
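If you only need the face labels for the occupied cells themselves (i.e. a frame shaped like the original df), you could join the result back onto the occupied positions. A minimal sketch, assuming the result frame built above:
faces_df = (
    result.query('Occupied')          # keep only the occupied cells
          .drop(columns='Occupied')
          .reset_index()              # back to plain row/col columns
)
print(faces_df.head())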
I have a dataframe with a column, 'col', containing both positive and negative numbers. I would like to run a ranking separately on the positive and the negative numbers, with 0 excluded so it doesn't mess up the ranking. My issue is that my code below is updating the 'col' column. I must be keeping a reference to it, but I'm not sure where?
import random
import numpy as np
import pandas as pd

data = {'col': [random.randint(-1000, 1000) for _ in range(100)]}
df = pd.DataFrame(data)
pos_idx = np.where(df.col > 0)[0]
neg_idx = np.where(df.col < 0)[0]
p = df[df.col > 0].col.values
n = df[df.col < 0].col.values
p_rank = np.round(p.argsort().argsort() / (len(p) - 1) * 100, 1)
n_rank = np.round((n * -1).argsort().argsort() / (len(n) - 1) * 100, 1)
pc = df.col.values
pc[pc > 0] = p_rank
pc[pc < 0] = n_rank
df['ranking'] = pc
One way to do it is to avoid mutating the original dataframe by replacing this line in your code:
pc = df.col.values
with:
pc = df.copy().col.values
So that:
print(df)
# Output
col ranking
0 -492 49
1 884 93
2 -355 36
3 741 77
4 -210 24
.. ... ...
95 564 57
96 683 63
97 -129 18
98 -413 44
99 810 81
[100 rows x 2 columns]
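For context on why the original code mutated df: in pandas versions without copy-on-write, Series.values on a single-dtype column can return a view into the dataframe's underlying array, so writing into that array writes straight back into the dataframe. A minimal sketch of the effect (behavior may differ under newer copy-on-write defaults):
import pandas as pd

df_demo = pd.DataFrame({'col': [-2, 3, 5]})
arr = df_demo.col.values   # may be a view of df_demo's data, not a copy
arr[arr > 0] = 0           # an in-place write through that view...
print(df_demo)             # ...can show 'col' changed to [-2, 0, 0]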
I was able to figure it out on my own: I created a new column of zeros, then used .loc to update the values at their respective index locations.
df['ranking'] = 0
df.loc[df.col > 0, 'ranking'] = p_rank
df.loc[df.col < 0, 'ranking'] = n_rank
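As a side note, pandas' built-in rank could compute a similar percentile ranking directly, avoiding the manual argsort work. This is only a sketch of the same intent (positives ranked ascending, negatives ranked by magnitude, zeros left at 0); note that rank(pct=True) scales by rank/n rather than the (rank-1)/(n-1) used above, so the numbers will differ slightly:
df['ranking'] = 0.0
pos = df.col > 0
neg = df.col < 0
# Percentile rank of the positives (ascending) and of the negatives by magnitude
df.loc[pos, 'ranking'] = df.loc[pos, 'col'].rank(pct=True).mul(100).round(1)
df.loc[neg, 'ranking'] = df.loc[neg, 'col'].mul(-1).rank(pct=True).mul(100).round(1)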
I have two data frames df1 and df2
no plan current flag
0 abc1 249 30 Y/U
1 abc2 249 30 N
2 abc3 249 30 Y/D
and
plan offer
0 149 20
1 249 30
2 349 40
I want to add an extra column to df1 such that, if df1['flag'] == 'Y/U', it looks up the next higher number in df2['offer'] compared to df1['current']. Similarly, if the flag is 'Y/D' it looks up the next lower number. (Keep the current value if the flag is 'N'.)
Expected output:
no plan current flag Pos
0 abc1 249 30 Y/U 40
1 abc2 249 30 N 30
2 abc3 249 30 Y/D 20
I tried to do it using apply.
df1['pos'] = (df1.apply(lambda x: next((z for (y, z) in zip(df2['plan'], df2['offer'])
if y > x['plan'] if z > x['current']), None), axis=1))
But it is giving the result as if every case were 'Y/U'.
You can achieve the desired result without using plan; just use a sorted list of the offers:
offers = df2['offer'].sort_values().tolist()

def assign_pos(row, offers):
    index = offers.index(row['current'])
    if row['flag'] == "N":
        row['pos'] = row['current']
    elif row['flag'] == 'Y/U':
        row['pos'] = offers[index + 1]
    elif row['flag'] == 'Y/D':
        row['pos'] = offers[index - 1]
    return row

df1 = df1.apply(assign_pos, args=[offers], axis=1)
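A variant of the same idea that does not require current to be present in offers is to bisect the sorted list instead of calling .index(). This is a sketch, not part of the original answer; it falls back to the current value when there is no higher or lower offer:
from bisect import bisect_left, bisect_right

def assign_pos_bisect(row, offers):
    cur = row['current']
    if row['flag'] == 'Y/U':
        i = bisect_right(offers, cur)      # first offer strictly greater than cur
        row['pos'] = offers[i] if i < len(offers) else cur
    elif row['flag'] == 'Y/D':
        i = bisect_left(offers, cur) - 1   # last offer strictly smaller than cur
        row['pos'] = offers[i] if i >= 0 else cur
    else:                                  # flag == 'N'
        row['pos'] = cur
    return row

df1 = df1.apply(assign_pos_bisect, args=[offers], axis=1)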
How to accumulate values skipping rows if the accumulated result of those rows exceeds a certain threshold?
threshold = 120
Col1
---
100
5
90
5
8
Expected output:
Acumm_with_condition
---
100
105 (100+5)
NaN (105+90 > threshold, skip )
110 (105+5)
118 (110+8)
Though it's not entirely vectorized, you can use a loop: calculate the cumsum, check whether it has exceeded the threshold, and if it has, set the first value that breaks the threshold to 0 and restart the loop.
import numpy as np

def thresholded_cumsum(df, column, threshold=np.inf, dropped_value_fill=None):
    s = df[column].copy().to_numpy()
    dropped_value_mask = np.zeros_like(s, dtype=bool)

    cur_cumsum = s.cumsum()
    cur_mask = cur_cumsum > threshold

    while cur_mask.any():
        first_above_thresh_idx = np.nonzero(cur_mask)[0][0]

        # Drop the value out of s, note the position of this value within the mask
        s[first_above_thresh_idx] = 0
        dropped_value_mask[first_above_thresh_idx] = True

        # Recalculate the cumsum & threshold mask now that we've dropped the value
        cur_cumsum = s.cumsum()
        cur_mask = cur_cumsum > threshold

    if dropped_value_fill is not None:
        cur_cumsum[dropped_value_mask] = dropped_value_fill

    return cur_cumsum
Usage:
df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120)
print(df)
col1 thresh_cumsum
0 100 100
1 5 105
2 90 105
3 5 110
4 8 118
I've included an extra parameter here, dropped_value_fill; this is essentially a value you can use to annotate your output so you know which values were intentionally dropped for violating the threshold.
With dropped_value_fill=-1:
df["thresh_cumsum"] = thresholded_cumsum(df, "col1", threshold=120, dropped_value_fill=-1)
print(df)
col1 thresh_cumsum
0 100 100
1 5 105
2 90 -1
3 5 110
4 8 118
I ended up using:
import math
import numpy as np

def accumulate_under_threshold(values, threshold, skipped_row_value):
    output = []
    accumulated = 0
    for i, val in enumerate(values):
        if val + accumulated <= threshold:
            accumulated = val + accumulated
            output.append(accumulated)
        else:
            output.append(math.nan)

        # If nothing further can fit under the threshold, fill the rest and stop
        if values[i:].min() > (threshold - accumulated):
            output.extend([skipped_row_value] * (len(values) - 1 - i))
            break

    return np.array(output)

df['acumm_with_condition'] = accumulate_under_threshold(df['Col1'].values, 120, math.nan)
I wish to select only the rows that have observations across multiple years. For example, suppose
mlIndx = pd.MultiIndex.from_tuples([('x', 0,),('x',1),('z', 0), ('y', 1),('t', 0),('t', 1)])
df = pd.DataFrame(np.random.randint(0,100,(6,2)), columns = ['a','b'], index=mlIndx)
In [18]: df
Out[18]:
a b
x 0 6 1
1 63 88
z 0 69 54
y 1 27 27
t 0 98 12
1 69 31
My desired output is
Out[19]:
a b
x 0 6 1
1 63 88
t 0 98 12
1 69 31
My current solution is blunt, so something that can scale up more easily would be great. You can assume a sorted index.
df.reset_index(level=0, inplace=True)
df[df.level_0.duplicated() | df.level_0.duplicated(keep='last')]
Out[30]:
level_0 a b
0 x 6 1
1 x 63 88
0 t 98 12
1 t 69 31
You can figure this out with groupby (on the first level of the index) + transform, and then use boolean indexing to keep those rows:
df[df.groupby(level=0).a.transform('size').gt(1)]
a b
x 0 67 83
1 2 34
t 0 18 87
1 63 20
Details
Output of the groupby -
df.groupby(level=0).a.transform('size')
x 0 2
1 2
z 0 1
y 1 1
t 0 2
1 2
Name: a, dtype: int64
Filtering from here is straightforward, just find those rows with size > 1.
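The same mask can also be built from the index level values directly, without referring to column a; a small sketch of the equivalent filter:
lvl = df.index.get_level_values(0)
counts = lvl.value_counts()
df[lvl.isin(counts[counts > 1].index)]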
Use the groupby filter. You can pass a function that returns a boolean to filter:
df.groupby(level=0).filter(lambda x: len(x) > 1)
a b
x 0 7 33
1 31 43
t 0 71 18
1 68 72
I've spent my fair share of time focused on speed. Not all solutions need to be the fastest, but since the subject has come up, I'll offer what I think should be a fast solution. It is my intent to keep future readers informed.
Results of Time Test
res.plot(loglog=True)
res.div(res.min(1), 0).T
10 30 100 300 1000 3000
cs 4.425970 4.643234 5.422120 3.768960 3.912819 3.937120
wen 2.617455 4.288538 6.694974 18.489803 57.416648 148.860403
jp 6.644870 21.444406 67.315362 208.024627 569.421257 1525.943062
pir 6.043569 10.358355 26.099766 63.531397 165.032540 404.254033
pir_pd_factorize 1.153351 1.132094 1.141539 1.191434 1.000000 1.000000
pir_np_unique 1.058743 1.000000 1.000000 1.000000 1.021489 1.188738
pir_best_of 1.000000 1.006871 1.030610 1.086425 1.068483 1.025837
Simulation Details
def pir_pd_factorize(df):
    f, u = pd.factorize(df.index.get_level_values(0))
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_np_unique(df):
    u, f = np.unique(df.index.get_level_values(0), return_inverse=True)
    m = np.bincount(f)[f] > 1
    return df[m]

def pir_best_of(df):
    if len(df) > 1000:
        return pir_pd_factorize(df)
    else:
        return pir_np_unique(df)

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
from timeit import timeit

res = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000],
    columns='cs wen jp pir pir_pd_factorize pir_np_unique pir_best_of'.split(),
    dtype=float
)

np.random.seed([3, 1415])
for i in res.index:
    d = pd.DataFrame(
        dict(a=range(i)),
        pd.MultiIndex.from_arrays([
            np.random.randint(i // 4 * 3, size=i),
            range(i)
        ])
    )
    for j in res.columns:
        stmt = f'{j}(d)'
        setp = f'from __main__ import d, {j}'
        res.at[i, j] = timeit(stmt, setp, number=100)
Just a new way
s=df.a.count(level=0)
df.loc[s[s>1].index.tolist()]
Out[12]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
And if you want to keep using duplicated:
s=df.index.get_level_values(level=0)
df.loc[s[s.duplicated()].tolist()]
Out[18]:
a b
x 0 1 31
1 70 29
t 0 42 26
1 96 29
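Note that count(level=0) relies on the level argument of Series.count, which has been removed in newer pandas versions; a groupby-based equivalent (an assumption about recent pandas, not part of the original answer) would be:
s = df.a.groupby(level=0).count()
df.loc[s[s > 1].index.tolist()]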
I'm not convinced groupby is necessary:
df = df.sort_index()
df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]
# a b
# x 0 16 3
# 1 97 36
# t 0 9 18
# 1 37 30
Some benchmarking:
df = pd.concat([df] * 10000).sort_index()

def cs(df):
    return df[df.groupby(level=0).a.transform('size').gt(1)]

def pir(df):
    return df.groupby(level=0).filter(lambda x: len(x) > 1)

def wen(df):
    s = df.a.count(level=0)
    return df.loc[s[s > 1].index.tolist()]

def jp(df):
    return df.loc[[i for i in df.index.get_level_values(0).unique() if len(df.loc[i]) > 1]]

%timeit cs(df)   # 19.5 ms
%timeit pir(df)  # 33.8 ms
%timeit wen(df)  # 17.0 ms
%timeit jp(df)   # 22.3 ms
In my problem I have two data frames, mydataframe1 and mydataframe2, as below.
mydataframe1
Out[13]:
Start End Remove
50 60 1
61 105 0
106 150 1
151 160 0
161 180 1
181 200 0
201 400 1
mydataframe2
Out[14]:
Start End
55 100
105 140
151 154
155 185
220 240
From mydataframe2 I would like to remove the rows whose Start-End interval is contained (even partially) in any of the "Remove"=1 intervals of mydataframe1. In other words, there should not be any intersection between the intervals of mydataframe2 and any of the "Remove"=1 intervals in mydataframe1.
In this case mydataframe2 becomes
mydataframe2
Out[15]:
Start End
151 154
You could use pd.IntervalIndex for intersections
Get rows to be removed
In [313]: dfr = df1.query('Remove == 1')
Construct an IntervalIndex from the ranges to be removed
In [314]: s1 = pd.IntervalIndex.from_arrays(dfr.Start, dfr.End, 'both')
Construct an IntervalIndex from the ranges to be tested
In [315]: s2 = pd.IntervalIndex.from_arrays(df2.Start, df2.End, 'both')
Select the rows of df2 whose intervals are not in the s1 ranges
In [316]: df2.loc[[x not in s1 for x in s2]]
Out[316]:
Start End
2 151 154
Details
In [320]: df1
Out[320]:
Start End Remove
0 50 60 1
1 61 105 0
2 106 150 1
3 151 160 0
4 161 180 1
5 181 200 0
6 201 400 1
In [321]: df2
Out[321]:
Start End
0 55 100
1 105 140
2 151 154
3 155 185
4 220 240
In [322]: dfr
Out[322]:
Start End Remove
0 50 60 1
2 106 150 1
4 161 180 1
6 201 400 1
IntervalIndex details
In [323]: s1
Out[323]:
IntervalIndex([[50, 60], [106, 150], [161, 180], [201, 400]],
              closed='both',
              dtype='interval[int64]')
In [324]: s2
Out[324]:
IntervalIndex([[55, 100], [105, 140], [151, 154], [155, 185], [220, 240]],
              closed='both',
              dtype='interval[int64]')
In [326]: [x not in s1 for x in s2]
Out[326]: [False, False, True, False, False]
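If the element-wise in checks behave differently on your pandas version, an explicit test via IntervalIndex.overlaps (available since pandas 0.24; this is a sketch, not part of the original answer) should give the same selection here:
# Keep rows of df2 whose interval overlaps none of the Remove == 1 intervals
keep = [not s1.overlaps(iv).any() for iv in s2]
df2.loc[keep]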
We can use the midpoint/half-length overlap test (as in a medial- or length-oriented tree): two intervals overlap when the distance between their midpoints is less than the sum of their half-lengths. With s = Start + End (twice the midpoint) and d = End - Start (twice the half-length), the condition becomes |s1 - s2| < d1 + d2:
In [143]: d1 = d1.assign(s=d1.Start+d1.End, d=d1.End-d1.Start)
In [144]: d2 = d2.assign(s=d2.Start+d2.End, d=d2.End-d2.Start)
In [145]: d1
Out[145]:
Start End Remove d s
0 50 60 1 10 110
1 61 105 0 44 166
2 106 150 1 44 256
3 151 160 0 9 311
4 161 180 1 19 341
5 181 200 0 19 381
6 201 400 1 199 601
In [146]: d2
Out[146]:
Start End d s
0 55 100 45 155
1 105 140 35 245
2 151 154 3 305
3 155 185 30 340
4 220 240 20 460
now we can check for overlapping intervals and filter:
In [148]: d2[~d2[['s','d']]\
...: .apply(lambda x: ((d1.loc[d1.Remove==1, 's'] - x.s).abs() <
...: d1.loc[d1.Remove==1, 'd'] +x.d).any(),
...: axis=1)]\
...: .drop(['s','d'], 1)
...:
Out[148]:
Start End
2 151 154
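The same filter can be written with plain Start/End comparisons, using the standard condition that closed intervals [a, b] and [c, d] intersect iff a <= d and c <= b. A sketch (not part of the answer above), reusing d1 and d2 as defined there:
rem = d1.loc[d1.Remove == 1, ['Start', 'End']].to_numpy()
overlapping = d2.apply(
    lambda r: ((rem[:, 0] <= r.End) & (r.Start <= rem[:, 1])).any(),
    axis=1
)
d2.loc[~overlapping, ['Start', 'End']]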
I think that this should work:
mydataframe2[mydataframe2.Start.isin(mydataframe1[mydataframe1.Remove != 0].Start)]
Breaking it down:
# This filter will remove anything which has Remove not 0
filter_non_remove = mydataframe1.Remove != 0
# This provides a valid Sequence of Start values
valid_starts = mydataframe1[mydataframe1.Remove != 0].Start
# Another filter, that checks whether the Start
# value is in the valid_starts Sequence
is_df2_valid = mydataframe2.Start.isin(valid_starts)
# Final applied filter
output = mydataframe2[is_df2_valid]
You can get all the values covered by the intervals marked Remove = 1, then check that the Start-End interval of each row in mydataframe2 does not fall in that set. The first part builds the set of all values falling within the Start/End intervals where Remove = 1.
start_end_remove = mydataframe1[mydataframe1['Remove'] == 1][['Start', 'End']].to_numpy()
remove_ranges = set([])
for x in start_end_remove:
    remove_ranges.update(np.arange(x[0], x[1] + 1))
Next you can evaluate mydataframe2 against this set of range values: if the Start/End interval of a row overlaps the set, the row is flagged for removal in a new column and then dropped. A function is defined to check whether there is any overlap; it is applied to each row of mydataframe2, and the overlapping rows are removed.
def evaluate_in_range(x, remove_ranges):
    s = x[0]
    e = x[1]
    eval_range = set(np.arange(s, e + 1))
    if len(eval_range.intersection(remove_ranges)) > 0:
        return 1
    else:
        return 0

mydataframe2['Remove'] = mydataframe2[['Start', 'End']].apply(
    lambda x: evaluate_in_range(x, remove_ranges), axis=1)
mydataframe2.drop(mydataframe2[mydataframe2['Remove'] == 1].index, inplace=True)
How about this:
mydataframe1['key']=1
mydataframe2['key']=1
df3 = mydataframe2.merge(mydataframe1, on="key")
df3['s_gt_s'] = df3.Start_y > df3.Start_x
df3['s_lt_e'] = df3.Start_y < df3.End_x
df3['e_gt_s'] = df3.End_y > df3.Start_x
df3['e_lt_e'] = df3.End_y < df3.End_x
df3['s_in'] = df3.s_gt_s & df3.s_lt_e
df3['e_in'] = df3.e_gt_s & df3.e_lt_e
df3['overlaps'] = df3.s_in | df3.e_in
my_new_dataframe = df3[df3.overlaps & (df3.Remove == 1)][['End_x', 'Start_x']].drop_duplicates()