pandas find close values on the same line - python

I would like to find and extract all text items that sit close together on the same line, where the distance between them is < 10 (x2 - x < 10), from a pandas dataframe. x, y, x2, y2 are the coordinates of the bounding box that contains the text. The texts can differ each time (string, float, int, ...).
In my example, I want to extract 'Amount VAT' at idx 70 and 71: they are on the same line, and the distance 'VAT'[x] - 'Amount'[x2] is < 10.
    line  text    x     y     x2    y2
29  11    Amount  2184  1140  2311  1166
51  14    Amount  1532  1450  1660  1476
66  15    Amount  1893  1500  2021  1527
70  16    Amount  1893  1551  2022  1578
71  16    VAT     2031  1550  2121  1578
Final result must be:
    line  text    x     y     x2    y2
70  16    Amount  1893  1551  2022  1578
71  16    VAT     2031  1550  2121  1578
The extraction should also work for 2 or more texts on the same line with (x2 - x < 10). Another expected result, with 3 values:
    line  text    x     y     x2    y2
5   16    Total   1755  1551  1884  1578
8   16    Amount  1893  1551  2022  1578
20  16    VAT     2031  1550  2121  1578
I found a way to flag rows that share a line:
same_line = find_labels['line'].map(find_labels['line'].value_counts() > 1)
and I tried to find the near values (x2 - x < 10), but I don't know how to do this. I tried a loop and .cov(), but neither works...
Can someone help me? Thanks for your help.

Assuming VAT and Amount are both indexed by the same line value, I would do this:
import pandas as pd

# set the index to line
df.set_index('line', inplace=True)
# split the table into the 2 parts to work on
amount_df = df[df['text'] == 'Amount']
vat_df = df[df['text'] == 'VAT']
# join the 2 tables (on the index) to get everything on one row
df2 = amount_df.join(vat_df, how='inner', lsuffix='amount', rsuffix='vat')
# do the math: keep pairs whose horizontal gap is under 10
condition = df2['xvat'] - df2['x2amount'] < 10
df2 = df2[condition]
# build one combined row spanning both bounding boxes
df2['text'] = 'Total'
df2['x'] = df2[['xamount', 'xvat']].min(axis=1)
df2['y'] = df2[['yamount', 'yvat']].min(axis=1)
df2['x2'] = df2[['x2amount', 'x2vat']].max(axis=1)
df2['y2'] = df2[['y2amount', 'y2vat']].max(axis=1)
df = pd.concat([df, df2[['text', 'x', 'y', 'x2', 'y2']]])
Not quite exactly what you asked for (you wanted the original rows back, this builds a combined row instead), but you get the idea.
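For the general case of 2 or more boxes per line, a shift-based sketch (my own, untested beyond the sample above; df is the dataframe from the question, with line still a column):
df = df.sort_values(['line', 'x'])
# gap between each box and the previous box on the same line
gap = df['x'] - df.groupby('line')['x2'].shift()
# start a new group at each line start (NaN gap) or whenever the gap is >= 10
group_id = (gap.isna() | (gap >= 10)).cumsum()
# keep only groups that contain at least 2 boxes
result = df[group_id.map(group_id.value_counts()) > 1]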

Related

Loop through decimal that is object type

     Adress            Rooms  m2   Price  Floor
196  Skanstes 29a      5      325  2800   24/24
12   Ausekļa 4         5      195  2660   7/7
7    Antonijas 17A     3      86   2200   6/6
31   Blaumaņa 16       4      136  1800   4/6
186  Rūpniecības 21k2  5      160  1700   7/7
233  Vesetas 24        4      133  1700   10/10
187  Rūpniecības 34    5      157  1600   3/6
91   Elizabetes 31а    8      203  1600   1/5
35   Blaumaņa 9        3      90   1600   3/5
60   Cēsu 9            3      133  1550   6/7
I have a data set on which I want to test the theory that the higher the floor, the more expensive the property's rent price. The dtypes are:
Adress object
Rooms int64
m2 int64
Price int64
Floor object
dtype: object
To be honest I am stuck; I'm not even sure how to start with this. Is there any way I can loop through the first number and compare it to the second? For example, if 24 == 24 then it goes into a new category 'Top Floor', and similarly for 'Mid Floor' and 'Ground Floor' categories.
I got this far:
df_sorted = df.sort_values("Price", ascending=False)
print(df_sorted.head(10))
for e in df_sorted['Floor']:
    parts = e.split('/')
    print(parts)
but the second part is not working:
if parts[0] == parts[-1]:
    return "Top Floor"
if parts[0] == "1":
    return "Bottom Floor"
else:
    "Mid Floor"
First solution, using three categories as suggested in the question, then grouping by category to compare the mean price:
def floor_to_categories(floor_str):
    num1, num2 = floor_str.split("/")
    if num1 == num2:
        return "Top"
    elif num1 == "1":
        return "Bottom"
    return "Middle"

df["FloorCategories"] = df.Floor.apply(floor_to_categories)
df.groupby("FloorCategories").Price.mean()
Second solution, continuous instead of discrete: convert the floor into a float from 0 to 1, then apply the Pearson correlation between the price and the new floor float:
def floor_to_float(floor_str):
    num1, num2 = [float(num) for num in floor_str.split("/")]
    return num1 / num2

df["FloorFloat"] = df.Floor.apply(floor_to_float)
df[["Price", "FloorFloat"]].corr()
If the floor is stored as a string you can use the following function:
def split_floors(floor):
    if floor.split('/')[0] == '1':
        return 'Bottom'
    if floor.split('/')[0] == floor.split('/')[1]:
        return 'Top Floor'
    else:
        return 'Mid Floor'
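Applied the same way as in the first answer (the FloorCategory column name is just for illustration):
df['FloorCategory'] = df['Floor'].apply(split_floors)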

Sum of the column values if the rows meet the conditions

I am trying to calculate the sum of sales for stores in the same neighborhood, based on their geographic coordinates. I have this sample data:
data = pd.DataFrame({'ID': ['1', '2', '3', '4'], 'SALE': [100, 120, 110, 95],
                     'X': [23, 22, 21, 24], 'Y': [44, 45, 41, 46],
                     'X_MIN': [22, 21, 20, 23], 'Y_MIN': [43, 44, 40, 45],
                     'X_MAX': [24, 23, 22, 25], 'Y_MAX': [45, 46, 42, 47]})
ID  SALE  X   Y   X_MIN  Y_MIN  X_MAX  Y_MAX
1   100   23  44  22     43     24     45
2   120   22  45  21     44     23     46
3   110   21  41  20     40     22     42
4   95    24  46  23     45     25     47
X and Y are the coordinates of the store; the MIN and MAX columns describe the area it covers. For each row, I want to sum the sales of all stores whose coordinates fall within that store's boundaries. I expect results like the table below: SUM for ID 1 equals 220 because the coordinates (X, Y) of both ID 1 and ID 2 lie within the MIN and MAX limits of store ID 1, while for ID 4 only the store itself lies within its boundaries, so its sum of sales is 95.
final={'ID':['1','2','3','4'],'SUM':[220,220,110,95]}
ID  SUM
1   220
2   220
3   110
4   95
What I've tried:
data['SUM'] = data.apply(lambda x: data['SALE'].sum(data[(data['X'] >= x['X_MIN'])&(data['X'] <= x['X_MAX'])&(data['Y'] >= x['Y_MIN'])&(data['Y'] <= x['Y_MAX'])]),axis=1)
Unfortunately the code does not work and I am getting the following error:
TypeError: unhashable type: 'DataFrame'
I am asking for help in solving this problem.
Your solution works if you put the summation at the end. (As written, the filtered DataFrame is passed as an argument to sum(), where pandas tries to interpret it as the axis, hence the TypeError.)
data['SUM'] = data.apply(lambda x: (data['SALE'][(data['X'] >= x['X_MIN'])&(data['X'] <= x['X_MAX'])&(data['Y'] >= x['Y_MIN'])&(data['Y'] <= x['Y_MAX'])]).sum(),axis=1)
###output of data['SUM']:
###0 220
###1 220
###2 110
###3 95
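As a side note, the same sums can be computed without apply using NumPy broadcasting; a sketch under the same sample data (the inside matrix name is just for illustration):
import numpy as np

x, y = data['X'].to_numpy(), data['Y'].to_numpy()
# inside[i, j] is True when store j lies within store i's bounding box
inside = ((x >= data['X_MIN'].to_numpy()[:, None]) &
          (x <= data['X_MAX'].to_numpy()[:, None]) &
          (y >= data['Y_MIN'].to_numpy()[:, None]) &
          (y <= data['Y_MAX'].to_numpy()[:, None]))
# matrix product of the 0/1 mask with the sales vector gives the sums
data['SUM'] = inside.astype(int) @ data['SALE'].to_numpy()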

Make loop iterate through range but limited to input

I'm new to Python, so I apologize in advance if the question is too easy.
I'm trying to build a simulation that finds the optimization point in a dataframe. This is what I have so far:
import random
import pandas as pd
import math
import numpy as np

loops = int(input('Q of simulations: '))
cost = 175
sell_price = 250
sale_price = 250 / 2
# order = 1000
simulation = 0
profit = 0
rows = []
order = range(1000, 3000)
ordenes = []
for i in order:
    ordenes.append(i)
for i in ordenes:
    demand = math.trunc(1000 + random.random() * (2001))
    if demand >= i:
        profit = (sell_price - cost) * i
        rows.append([simulation, demand, i, profit, (demand - i)])
    else:
        profit = (sell_price - cost) * demand - (i - demand) * (sale_price - cost)
        rows.append([simulation, demand, i, profit, (demand - i)])
DataFrame = pd.DataFrame(rows, columns=['#Simulation', 'Demand', 'Order', 'Utility', 'Shortage'])
print(DataFrame)
DataFrame.loc[DataFrame['Utility'].idxmax()]
The current output (for any number entered at the input) is:
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 0 1392 1001 75075.0 391
2 0 1042 1002 75150.0 40
3 0 1457 1003 75225.0 454
4 0 1930 1004 75300.0 926
... ... ... ... ... ...
1995 0 1823 2995 195325.0 -1172
1996 0 2186 2996 204450.0 -810
1997 0 1384 2997 184450.0 -1613
1998 0 1795 2998 194775.0 -1203
1999 0 1611 2999 190225.0 -1388
[2000 rows x 5 columns]
#Simulation 0.0
Demand 2922.0
Order 2989.0
Utility 222500.0
Shortage -67.0
Name: 1989, dtype: float64
Desired output (entering 5 at the input):
#Simulation Demand Order Utility Shortage
0 0 2067 1000 75000.0 1067
1 1 1392 1001 75075.0 391
2 2 1042 1002 75150.0 40
3 3 1457 1003 75225.0 454
4 4 1930 1004 75300.0 926
[5 rows x 5 columns]
#Simulation 4.0
Demand 1930.0
Order 1004.0
Utility 75300.0
Shortage 926.0
Name: 1989, dtype: float64
I really don't know how to make this happen. I've tried everything that comes to mind, but the outcome either fails on the 'order' column or comes out as shown above.
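A minimal sketch of one way to get the desired shape, reusing the imports and constants from the snippet above and assuming each simulation tests one order size starting at 1000 (my own reading of the desired output, where #Simulation advances with each row and the loop runs exactly loops times):
rows = []
for simulation in range(loops):
    i = 1000 + simulation  # one order size per simulation, as in the desired output
    demand = math.trunc(1000 + random.random() * 2001)
    if demand >= i:
        profit = (sell_price - cost) * i
    else:
        profit = (sell_price - cost) * demand - (i - demand) * (sale_price - cost)
    rows.append([simulation, demand, i, profit, demand - i])
df = pd.DataFrame(rows, columns=['#Simulation', 'Demand', 'Order', 'Utility', 'Shortage'])
print(df)
print(df.loc[df['Utility'].idxmax()])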

Subtract from each cell in pandas dataframe based on value

I have a df like this -- it's a dataframe and all values are floats:
data=np.random.randint(3000,size=(10,1))
data=pd.DataFrame(data)
For each value, if it's between 570 and 1140, I want to subtract 570.
If it's over 1140, I want to subtract 1140 from the value. I wrote this function to do that.
def AdjustTimes(val):
    if val > 570 and val < 1140:
        val = val - 570
    elif val > 1140:
        val = val - 1140
Based on another question I tried to apply it using data.applymap(AdjustTimes). I got no error, but the function does not seem to have been applied.
Setup
data
0
0 1863
1 2490
2 2650
3 2321
4 822
5 82
6 2192
7 722
8 2537
9 874
First, let's create masks for each of your conditions. One pandas-idiomatic approach uses between to build the mask for the first condition -
m1 = data.loc[:, 0].between(570, 1140, inclusive=True)
Or, you can do this with a couple of logical operators -
m1 = data.loc[:, 0].ge(570) & data.loc[:, 0].le(1140)
And,
m2 = data.loc[:, 0].gt(1140)
Now, to perform replacement, you have a couple of options.
Option 1
Use loc to index and subtract -
data.loc[m1, 0] -= 570
data.loc[m2, 0] -= 1140
data
0
0 723
1 1350
2 1510
3 1181
4 252
5 82
6 1052
7 152
8 1397
9 304
Equivalent version for a pd.Series -
m1 = data.ge(570) & data.le(1140)
m2 = data.gt(1140)
data.loc[m1] -= 570
data.loc[m2] -= 1140
Option 2
You can also do this with np.where (but it'd be a bit less efficient).
v = data.loc[:, 0]
data.loc[:, 0] = np.where(m1, v - 570, np.where(m2, v - 1140, v))
Here, m1 and m2 are the masks computed from before.
data
0
0 723
1 1350
2 1510
3 1181
4 252
5 82
6 1052
7 152
8 1397
9 304
Equivalent pd.Series code -
data[:] = np.where(m1, data - 570, np.where(m2, data - 1140, data))
Could you try something like:
data=np.random.randint(3000,size=(10,1))
data=pd.DataFrame(data)
data = data - 570*((data > 570) & (data < 1140)) - 1140*(data > 1140)
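This works because the comparisons yield boolean frames that act as 0/1 under multiplication, so each cell is reduced by at most one offset. An equivalent form uses DataFrame.mask (a sketch on the same data; both masks are computed from the original values to avoid double subtraction):
adjusted = data.mask((data > 570) & (data < 1140), data - 570)
adjusted = adjusted.mask(data > 1140, data - 1140)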
The applymap method is designed to generate a new dataframe rather than modify an existing one (and the function it calls should return a value for the new cell rather than mutating its argument). You don't show the line where you actually call applymap, but I suspect it's just data.applymap(AdjustTimes) on its own. If you change your code to the following it should work fine:
def AdjustTimes(val):
    if val >= 1140:
        return val - 1140
    elif val >= 570:
        return val - 570
    return val

data = data.applymap(AdjustTimes)
(I've also cleaned up the if statements to be a little faster, handled the case where val == 1140, which your original code wouldn't adjust, and made sure values below 570 are returned unchanged rather than mapped to None.)

Iterating over pandas rows to get minimum

Here is my dataframe:
Date        cell  tumor_size(mm)
25/10/2015  113   51
22/10/2015  222   50
22/10/2015  883   45
20/10/2015  334   35
19/10/2015  564   47
19/10/2015  123   56
22/10/2014  345   36
13/12/2013  456   44
What I want to do is compare the sizes of tumors detected on different days. Take cell 222 as an example: I want to compare its size to other cells, but only those detected on earlier days. E.g. I will not compare it with cell 883, because they were detected on the same day, nor with cell 113, because it was detected later.
As my dataset is very large, I have to iterate over the rows. Explained in a non-pythonic way:
for the cell 222:
get_size_distance (absolute value):
(50 - 35 = 15), (50 - 47 = 3), (50 - 56 = 6), (50 - 36 = 14), (50 - 44 = 6)
get_minimum = 3; I got this value when comparing with cell 564, so I will name it as the pair for cell 222.
Then do the same for cell 883.
The resulting output should look like this:
Date        cell  tumor_size(mm)  pair  size_difference
25/10/2015  113   51              222   1
22/10/2015  222   50              123   6
22/10/2015  883   45              456   1
20/10/2015  334   35              345   1
19/10/2015  564   47              456   3
19/10/2015  123   56              456   12
22/10/2014  345   36              456   8
13/12/2013  456   44              NaN   NaN
I would really appreciate your help.
It's not pretty, but I believe it does the trick:
import pandas as pd
from datetime import datetime

a = pd.read_clipboard()
# Cut off the last row since it had a faulty date. You can skip this.
df = a.copy().iloc[:-1]
# Convert to dates and order just in case (not really needed, I guess).
df['Date'] = df.Date.apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))
df = df.sort_values('Date', ascending=False)
# Rename the column so attribute access (row.tumor_size) works.
df = df.rename(columns={"tumor_size(mm)": 'tumor_size'})
# These will be our lists of pairs and size differences.
pairs = []
diffs = []
# Loop over all unique dates.
for date in df.Date.unique():
    # Only take dates earlier than the current date.
    compare_df = df.loc[df.Date < date].copy()
    # Loop over each cell for this date and find the minimum.
    for row in df.loc[df.Date == date].itertuples():
        # If no earlier cells are available, use NaNs.
        if compare_df.empty:
            pairs.append(float('nan'))
            diffs.append(float('nan'))
        # Otherwise take the lowest absolute difference and fill it in.
        else:
            compare_df['size_diff'] = abs(compare_df.tumor_size - row.tumor_size)
            row_of_interest = compare_df.loc[compare_df.size_diff == compare_df.size_diff.min()]
            pairs.append(row_of_interest.cell.values[0])
            diffs.append(row_of_interest.size_diff.values[0])
df['pair'] = pairs
df['size_difference'] = diffs
returns:
Date cell tumor_size pair size_difference
0 2015-10-25 113 51 222.0 1.0
1 2015-10-22 222 50 564.0 3.0
2 2015-10-22 883 45 564.0 2.0
3 2015-10-20 334 35 345.0 1.0
4 2015-10-19 564 47 345.0 11.0
5 2015-10-19 123 56 345.0 20.0
6 2014-10-22 345 36 NaN NaN
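For reference, a more vectorized sketch (my own, needs pandas 1.2+ for how='cross'; it trades the explicit loop for O(n²) memory via a cross join, so it only helps for moderate sizes; column names follow the question):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
# pair every cell with every candidate detected on an earlier date
pairs = df.merge(df, how='cross', suffixes=('', '_pair'))
pairs = pairs[pairs['Date_pair'] < pairs['Date']]
pairs['size_difference'] = (pairs['tumor_size(mm)'] - pairs['tumor_size(mm)_pair']).abs()
# for each cell keep the candidate with the smallest difference
best = pairs.sort_values('size_difference').groupby('cell', as_index=False).first()
out = df.merge(best[['cell', 'cell_pair', 'size_difference']], on='cell', how='left')
out = out.rename(columns={'cell_pair': 'pair'})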
