I am trying to calculate the sum of sales for stores in the same neighborhood based on their geographic coordinates. I have sample data:
data={'ID':['1','2','3','4'],'SALE':[100,120,110,95],'X':[23,22,21,24],'Y':[44,45,41,46],'X_MIN':[22,21,20,23],'Y_MIN':[43,44,40,45],'X_MAX':[24,23,22,25],'Y_MAX':[45,46,42,47]}
ID  SALE  X   Y   X_MIN  Y_MIN  X_MAX  Y_MAX
1   100   23  44  22     43     24     45
2   120   22  45  21     44     23     46
3   110   21  41  20     40     22     42
4   95    24  46  23     45     25     47
X and Y are the coordinates of the store. The X and Y columns with the MIN and MAX suffixes define the area each store covers. For each row, I want to sum the sales of all stores whose coordinates fall within that row's boundaries. I expect results like the table below: SUM for ID 1 equals 220 because the coordinates (X and Y) of both ID 1 and ID 2 are within the MIN and MAX limits of store 1, while for ID 4 only the store itself lies within its boundaries, so the sum of sales equals 95.
final={'ID':['1','2','3','4'],'SUM':[220,220,110,95]}
ID  SUM
1   220
2   220
3   110
4   95
What I've tried:
data['SUM'] = data.apply(lambda x: data['SALE'].sum(data[(data['X'] >= x['X_MIN'])&(data['X'] <= x['X_MAX'])&(data['Y'] >= x['Y_MIN'])&(data['Y'] <= x['Y_MAX'])]),axis=1)
Unfortunately the code does not work and I am getting the following error:
TypeError: unhashable type: 'DataFrame'
I am asking for help in solving this problem.
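If you put the summation at the end, your solution works. In your attempt the filtered DataFrame is passed as the first positional argument of sum(), which pandas treats as the axis, hence the TypeError: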
data['SUM'] = data.apply(lambda x: (data['SALE'][(data['X'] >= x['X_MIN'])&(data['X'] <= x['X_MAX'])&(data['Y'] >= x['Y_MIN'])&(data['Y'] <= x['Y_MAX'])]).sum(),axis=1)
###output of data['SUM']:
###0 220
###1 220
###2 110
###3 95
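For larger frames, the row-wise apply could be replaced by a pairwise comparison with NumPy broadcasting. The following is only a minimal sketch: it assumes the sample data from the question is loaded into a DataFrame (called df here purely for illustration) and that the number of stores is small enough for an n-by-n boolean matrix to fit in memory.

```python
import numpy as np
import pandas as pd

# Sample data from the question, loaded into a DataFrame (the name df is an assumption)
df = pd.DataFrame({
    "ID": ["1", "2", "3", "4"],
    "SALE": [100, 120, 110, 95],
    "X": [23, 22, 21, 24], "Y": [44, 45, 41, 46],
    "X_MIN": [22, 21, 20, 23], "Y_MIN": [43, 44, 40, 45],
    "X_MAX": [24, 23, 22, 25], "Y_MAX": [45, 46, 42, 47],
})

x = df["X"].to_numpy()
y = df["Y"].to_numpy()
sales = df["SALE"].to_numpy()

# inside[i, j] is True when store j lies within the bounding box of store i
inside = (
    (x[None, :] >= df["X_MIN"].to_numpy()[:, None])
    & (x[None, :] <= df["X_MAX"].to_numpy()[:, None])
    & (y[None, :] >= df["Y_MIN"].to_numpy()[:, None])
    & (y[None, :] <= df["Y_MAX"].to_numpy()[:, None])
)

# For each store i, add up the sales of every store j that falls inside its box
df["SUM"] = (inside * sales[None, :]).sum(axis=1)
print(df[["ID", "SUM"]])
```

On the sample data this gives the same SUM column as the apply version; the trade-off is memory that grows quadratically with the number of rows.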
High D_HIGH D_HIGH_H
33 46.57 0 0L
0 69.93 42 42H
1 86.44 68 68H
34 56.58 83 83L
35 67.12 125 125L
2 117.91 158 158H
36 94.51 186 186L
3 120.45 245 245H
4 123.28 254 254H
37 83.20 286 286L
In column D_HIGH_H each value ends with either L or H.
If there are two consecutive H rows, the one with the highest value in the High column has to be kept and the other dropped.
If there are two consecutive L rows, the one with the lowest value in the High column has to be kept and the other dropped.
If the sequence is H,L,H,L then no changes are to be made.
Output I want is as follows:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
I tried various options using list and map, but they did not work out. I also tried groupby but could not get to a working solution.
Here's one way:
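# label each run of consecutive identical letters
g = ((l := df['D_HIGH_H'].str[-1]) != l.shift()).cumsum()
def f(x):
    # within a run of H keep the highest High, within a run of L the lowest
    if (x['D_HIGH_H'].str[-1] == 'H').any():
        return x.nlargest(1, 'High')
    return x.nsmallest(1, 'High')
df.groupby(g, as_index=False).apply(f)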
Output:
High D_HIGH D_HIGH_H
0 33 46.57 0 0L
1 1 86.44 68 68H
2 34 56.58 83 83L
3 2 117.91 158 158H
4 36 94.51 186 186L
5 4 123.28 254 254H
6 37 83.20 286 286L
You can use extract to get the letter, then compute a custom group and groupby.apply with a function that depends on the letter:
# extract letter
s = df['D_HIGH_H'].str.extract(r'(\D)$', expand=False)
# group by successive letters
# get the idxmin/idxmax depending on the type of letter
keep = (df['High']
        .groupby([s, s.ne(s.shift()).cumsum()], sort=False)
        .apply(lambda x: x.idxmin() if x.name[0] == 'L' else x.idxmax())
        .tolist()
        )
out = df.loc[keep]
Output:
High D_HIGH D_HIGH_H
33 46.57 0 0L
1 86.44 68 68H
34 56.58 83 83L
2 117.91 158 158H
36 94.51 186 186L
4 123.28 254 254H
37 83.20 286 286L
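A variation on the same idea avoids apply by negating the L rows, so that a single idxmax picks the highest High in an H run and the lowest High in an L run. This is a sketch under the assumption that the index labels are unique, as in the sample; the names letter, run_id and score are introduced here for illustration.

```python
import pandas as pd

# Sample data from the question, with its original index labels
df = pd.DataFrame(
    {
        "High": [46.57, 69.93, 86.44, 56.58, 67.12, 117.91, 94.51, 120.45, 123.28, 83.20],
        "D_HIGH": [0, 42, 68, 83, 125, 158, 186, 245, 254, 286],
        "D_HIGH_H": ["0L", "42H", "68H", "83L", "125L", "158H", "186L", "245H", "254H", "286L"],
    },
    index=[33, 0, 1, 34, 35, 2, 36, 3, 4, 37],
)

letter = df["D_HIGH_H"].str[-1]
# A new run starts whenever the letter differs from the previous row
run_id = letter.ne(letter.shift()).cumsum()

# Keep High as-is for H rows and negate it for L rows,
# so the maximum score is the highest H or the lowest L
score = df["High"].where(letter.eq("H"), -df["High"])

keep = score.groupby(run_id).idxmax()
out = df.loc[keep.to_numpy()]
print(out)
```

Because the run ids increase with row position, the kept rows come out in their original order.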
I'm trying to calculate the cumulative sum over a specific subset of this dataset, and I want the changes to be reflected in the original dataset. The table below illustrates how I want to calculate Offset.
#OFFSET
from tqdm import tqdm
min = data.exit_block.min()
max = data.exit_block.max()
temp = 0
data['Offset'] = 0  # create the column before assigning into it
for i in tqdm(range(min,min+10)):
    # total size of the rows that entered before block i and have not exited yet
    offset = data.loc[(data["exit_block"] >= i) & (data["entry_block"] < i)]['size'].sum()
    data.loc[data["entry_block"] == i ,'Offset'] = data[data['entry_block']==i]['size'].cumsum() + offset
    print(len(data.loc[(data["exit_block"] >= i) & (data["entry_block"] < i)]['size']))
    print(offset)
    print(data[data['entry_block']==i]['size'].cumsum().head() )
    print(data[data['entry_block']==i]['size'].head())
    break
In the code above I'm creating a dataset B from the original dataset and trying to perform the cumulative sum operation on the original dataset using the values derived from dataset B.
Index  Entry_block  Exit_block  Size  Offset
1      10           20          10    10
2      11           20          150   160
3      18           20          100   260
4      19           21          40    300
5      20           21          120   120
6      20           21          180   300
7      20           21          210   510
8      20           21          90    600
9      20           21          450   1050
I have formed the bins using the pandas.cut function. Now, in order to perform smoothing by bin boundaries, I calculate the minimum and maximum value of each bin using groupby.
Minimum values
date births with noise
bin
A 1959-01-31 23 19.921049
B 1959-01-02 27 25.921175
C 1959-01-01 30 32.064698
D 1959-01-08 35 38.507170
E 1959-01-05 41 45.022163
F 1959-01-13 47 51.821755
G 1959-03-27 56 59.416700
H 1959-09-23 73 70.140119
Maximum values-
date births with noise
bin
A 1959-07-12 30 25.161292
B 1959-12-11 35 31.738422
C 1959-12-27 42 38.447807
D 1959-12-20 48 44.919703
E 1959-12-31 56 51.274550
F 1959-12-30 59 57.515927
G 1959-11-05 68 63.970382
H 1959-09-23 73 70.140119
Now I want to replace the values in my original dataframe. If the value is less than the mean (of its bin) then it is replaced with the min value (for that bin), and if it is greater than the mean then it is replaced with the max value.
My dataframe looks like this-
date births with noise bin smooth_val_mean
0 1959-01-01 35 36.964692 C 35.461173
1 1959-01-02 32 29.861393 B 29.592061
2 1959-01-03 30 27.268515 B 29.592061
3 1959-01-04 31 31.513148 B 29.592061
4 1959-01-05 44 46.194690 E 47.850101
How should I do this using pandas/numpy?
Let's try this function:
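import numpy as np
def thresh(col):
    # df_mean, df_min and df_max are assumed to be the per-bin groupby results, indexed by bin
    means = df['bin'].replace(df_mean[col])
    mins = df['bin'].replace(df_min[col])
    maxs = df['bin'].replace(df_max[col])
    # +1 where the value is above its bin mean, -1 where it is below
    signs = np.sign(df[col] - means)
    df[f'{col}_smooth'] = np.select((signs==1, signs==-1), (maxs, mins), means)
for col in ['with noise']:
    thresh(col)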
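The same smoothing rule can also be sketched without separate min, max and mean frames, by letting groupby(...).transform broadcast the per-bin statistics back onto the rows. The small frame below is made-up illustrative data, not the births dataset, and the column names value and smooth are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data only: two bins with three values each
df = pd.DataFrame({
    "value": [28, 30, 35, 30, 35, 42],
    "bin": ["B", "B", "B", "C", "C", "C"],
})

grouped = df.groupby("bin")["value"]
bin_mean = grouped.transform("mean")  # per-bin mean, aligned with the rows of df
bin_min = grouped.transform("min")
bin_max = grouped.transform("max")

# Above the bin mean -> bin maximum, below -> bin minimum, equal -> the mean itself
df["smooth"] = np.where(
    df["value"] > bin_mean,
    bin_max,
    np.where(df["value"] < bin_mean, bin_min, bin_mean),
)
print(df)
```

transform keeps the original row order and index, so no lookup by bin label is needed afterwards.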
I want to train a binary classification ML model with some data that I have; something like this:
df
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
0 20 89 62 23 3 74
1 51 64 19 2 83 0
0 14 58 2 71 31 48
1 32 28 2 30 92 91
1 51 36 51 66 15 14
...
My target (y) depends on three characteristics from two groups. However, my data is imbalanced: counting the values of y shows that I have more zeros than ones, in a ratio of about 2.68. I correct this by looping over each row and randomly swapping values from group 1 to group 2 and vice versa, like this:
for index,row in df.iterrows():
    choice = np.random.choice([0,1])
    if row['y'] != choice:
        df.loc[index, 'y'] = choice
        for column in df.columns[1:]:
            key = column.replace('g1', 'g2') if 'g1' in column else column.replace('g2', 'g1')
            df.loc[index, column] = row[key]
Doing this reduces the ratio to no more than 1.3, so I was wondering if there is a more direct approach using pandas methods.
Does anyone have an idea how to accomplish this?
Leaving aside whether swapping columns actually solves the class imbalance, I would swap the whole data set and randomly choose, row by row, between the original and the swapped version:
# Step 1: swap the columns
df1 = pd.concat((df.filter(regex='[^(_g1)]$'),
df.filter(regex='_g1$')),
axis=1)
# Step 2: rename the columns
df1.columns = df.columns
# random choice
np.random.seed(1)
is_original = np.random.choice([True,False], size=len(df))
# concat to make new dataset
pd.concat((df[is_original],df1[~is_original]))
Output:
y ch1_g1 ch2_g1 ch3_g1 ch1_g2 ch2_g2 ch3_g2
2 0 14 58 2 71 31 48
3 1 32 28 2 30 92 91
0 0 23 3 74 20 89 62
1 1 2 83 0 51 64 19
4 1 66 15 14 51 36 51
Notice that the rows with index 0, 1 and 4 have g1 swapped with g2.
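If the goal is to reproduce the loop from the question exactly (draw a random label per row, and swap the two groups whenever the drawn label differs from y), a vectorized sketch could look like the following. The use of numpy's default_rng with seed 1 and the names g1, g2, choice and swap are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question
df = pd.DataFrame({
    "y": [0, 1, 0, 1, 1],
    "ch1_g1": [20, 51, 14, 32, 51],
    "ch2_g1": [89, 64, 58, 28, 36],
    "ch3_g1": [62, 19, 2, 2, 51],
    "ch1_g2": [23, 2, 71, 30, 66],
    "ch2_g2": [3, 83, 31, 92, 15],
    "ch3_g2": [74, 0, 48, 91, 14],
})

g1 = [c for c in df.columns if c.endswith("_g1")]
g2 = [c.replace("_g1", "_g2") for c in g1]  # matching group 2 columns, same order

rng = np.random.default_rng(1)
choice = rng.integers(0, 2, size=len(df))  # random 0/1 label per row
swap = df["y"].to_numpy() != choice        # rows whose label changes get swapped

out = df.copy()
out["y"] = choice
# Take group 2 values for swapped rows and the original values otherwise, and vice versa
out[g1] = np.where(swap[:, None], df[g2].to_numpy(), df[g1].to_numpy())
out[g2] = np.where(swap[:, None], df[g1].to_numpy(), df[g2].to_numpy())
print(out)
```

Reading everything from the untouched df and writing into the copy out avoids overwriting one group before the other has been read.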
This might be a basic question, but I have not been able to find a solution. I have two dataframes with identical rows and columns, called Volumes and Prices, which look like this:
Volumes
Index ProductA ProductB ProductC ProductD Limit
0 100 300 400 78 100
1 110 370 20 30 100
2 90 320 200 121 100
3 150 320 410 99 100
....
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 90 0
1 51 110 29 99 0
2 49 120 25 88 0
3 51 110 22 96 0
....
I want to assign 0 to every cell of the Prices dataframe whose corresponding Volume is less than the value in the Limit column,
so the ideal output would be
Prices
Index ProductA ProductB ProductC ProductD Limit
0 50 110 30 0 0
1 51 110 0 0 0
2 0 120 25 88 0
3 51 110 22 0 0
....
I tried
import pandas as pd
import numpy as np
d_price = {'ProductA' : [50, 51, 49, 51], 'ProductB' : [110,110,120,110],
'ProductC' : [30,29,25,22],'ProductD' : [90,99,88,96], 'Limit': [0]*4}
d_volume = {'ProductA' : [100,110,90,150], 'ProductB' : [300,370,320,320],
'ProductC' : [400,20,200,410],'ProductD' : [78,30,121,99], 'Limit': [100]*4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)
Prices[Volumes > Volumes.Limit]=0
but I do not obtain any changes to the Prices dataframe. Obviously I'm having a hard time understanding boolean slicing; any help would be great.
The problem is in
Prices[Volumes > Volumes.Limit]=0
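Comparing a DataFrame with a Series aligns the Series index with the DataFrame's column labels, not its rows, so no row is ever compared with its own Limit. Since Limit varies by row, compare row by row instead, for example with apply. Also note that you want to zero the prices where the volume is below the limit, so the comparison should be <:
Prices[Volumes.apply(lambda x : x<x.Limit, axis=1)]=0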
You can use a boolean mask to solve this problem. I am not an expert either, but this solution does what you want:
# True where the volume reaches the row's limit; axis=0 aligns Limit with the rows
test = Volumes.loc[:,'ProductA':'ProductD'].ge(Volumes.Limit, axis=0)
final = Prices[test].fillna(0)
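A compact alternative, sketched here on the sample data from the question, is to build the row-aligned comparison with DataFrame.lt(..., axis=0) and pass it to DataFrame.mask. The names below_limit and result are introduced only for this example.

```python
import pandas as pd

d_price = {'ProductA': [50, 51, 49, 51], 'ProductB': [110, 110, 120, 110],
           'ProductC': [30, 29, 25, 22], 'ProductD': [90, 99, 88, 96], 'Limit': [0] * 4}
d_volume = {'ProductA': [100, 110, 90, 150], 'ProductB': [300, 370, 320, 320],
            'ProductC': [400, 20, 200, 410], 'ProductD': [78, 30, 121, 99], 'Limit': [100] * 4}
Prices = pd.DataFrame(d_price)
Volumes = pd.DataFrame(d_volume)

# axis=0 aligns the Limit Series with the row index, so every row is
# compared against its own limit
below_limit = Volumes.lt(Volumes['Limit'], axis=0)

# mask replaces the cells where the condition is True and keeps the rest
result = Prices.mask(below_limit, 0)
print(result)
```

The Limit column of Volumes is never below itself, so the Limit column of Prices is left untouched.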