I have lots of files that contain x, y, yerr columns. I read them, save the data, and apply a change to the x values; then I would like to set a limit on the new x values (newxval) that I will use afterwards:
# assumes: import numpy as np; import pandas as pd; from lmfit.models import GaussianModel
for key, value in files_data.items():
    file_short_name = key
    D_value_sale = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        data.columns = ["x", "y"]
    D = D_value_sale
    b = 111
    c = 222
    data["newx"] = -c * (((data.x * (1 / (1 + D))) - b) / b)
    data["newy"] = (data.y - data.y.min()) / (data.y.max() - data.y.min())
    w = data[(data.newx < 20000) & (data.newx > 8000)]
    dfx = w["newx"]
    dfy = w["newy"]
    peak = GaussianModel()
    # offset and model are assumed to be defined elsewhere (e.g. a ConstantModel and peak + offset)
    pars = offset.make_params(c=np.median(dfy))
    pars += peak.guess(dfy, x=dfx, amplitude=-0.5)
    result = model.fit(dfy, pars, x=dfx)
If I'm understanding correctly what you're asking, this is what you could do:
for key, value in files_data.items():
    file_short_name = key
    # main = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        # Here you should define what happens in case
        # the data isn't what you expected it to be
        pass
    data["newx"] = data.x + 1  # Perform whatever transformation you need
    # data["newy"] = data.y * 1.01234  # Etc.
    # Then you can limit the data by the newx column:
    data = data[(data.newx < upper_limit) & (data.newx > lower_limit)]
What you're doing won't work if you want to preserve the relationship between columns: when you assign the data columns to their own variables xval, yval and error, you implicitly "lose" their relationship.
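A minimal sketch of that point (toy numbers, column names just for illustration): filter the whole DataFrame first, then pull the columns out, and the rows stay aligned through the shared index.
import pandas as pd

# Toy data standing in for one file's newx/newy columns
data = pd.DataFrame({"newx": [5000, 9000, 15000, 25000],
                     "newy": [0.10, 0.40, 0.80, 0.20]})

# Filter the DataFrame as a whole, so every column is cut the same way
w = data[(data.newx > 8000) & (data.newx < 20000)]

dfx = w["newx"]  # still row-aligned with dfy via the shared index
dfy = w["newy"]
print(dfx.tolist(), dfy.tolist())  # [9000, 15000] [0.4, 0.8]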
I'll open with the same caveat of "if I'm understanding you correctly": the crux of what you're looking for is the boolean array you have created to apply your limits:
data = data[(data[0] >= xlim[0]) & (data[0] <= xlim[1])]
This boolean array can be saved and applied to any array of the same shape.
idx = (data[0] >= xlim[0]) & (data[0] <= xlim[1])
filtered_data = data[0][idx]
filtered_newxval = newxval[idx]
By way of a more complete and independent example, see the code below, where this concept is applied to multidimensional arrays and pandas DataFrames:
import numpy as np
import pandas as pd
np.random.seed(42)
x = np.random.randint(0, 20, 10)
y = np.random.randint(0, 20, 10)
print("x", x)
# >>> x [ 6 19 14 10 7 6 18 10 10 3]
print("y", y)
# >>> y [ 7 2 1 11 5 1 0 11 11 16]
xmin = 3
xmax = 17
idx = (x >= xmin) & (x <= xmax)
data = np.vstack((x, y))
print("filtered_data:\n", data[:, idx])
# >>> filtered_data:
# [[ 6 14 10 7 6 10 10 3]
# [ 7 1 11 5 1 11 11 16]]
df = pd.DataFrame({"x": x, "y": y})
df["xnew"] = df["x"] * 2
print(df[idx])
# >>> x y xnew
# >>> 0 6 7 12
# >>> 2 14 1 28
# >>> 3 10 11 20
# >>> 4 7 5 14
# >>> 5 6 1 12
# >>> 7 10 11 20
# >>> 8 10 11 20
# >>> 9 3 16 6
For example, I have a dataframe (df) whose target column is df['Z']. I also have two other columns, df['X'] and df['Y']. All of this data comes from real-world data collection.
How can I build an equation for Z in Python, i.e. fit Z as a function of X and Y:
> 1. Z = f(X)
> 2. Z = f(X,Y)
Here's how you do that:
def function(x, y):
    return x + y + 4  # Obviously the function can be more complex

df["Z"] = function(df["A"], df["B"])
Example
data = {'A': [x for x in range(5)], 'B': [x for x in range(6, 11)]}
df = pd.DataFrame(data)

def function(x, y):
    return x + y + 4

df["Z"] = function(df["A"], df["B"])
print(df)
Output:
A B Z
0 0 6 10
1 1 7 12
2 2 8 14
3 3 9 16
4 4 10 18
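If the goal is instead to estimate f from the collected data rather than apply a known formula, one possible sketch (assuming, purely for illustration, a linear form Z = a*X + b*Y + c) uses scipy.optimize.curve_fit:
import pandas as pd
from scipy.optimize import curve_fit

# Toy data; in practice these would be the collected X, Y, Z measurements
df = pd.DataFrame({'X': [0, 1, 2, 3, 4],
                   'Y': [6, 3, 8, 5, 10],
                   'Z': [4.1, 4.4, 9.0, 9.6, 14.1]})

def f(xy, a, b, c):
    # assumed model form: Z = a*X + b*Y + c (not derived from the data)
    x, y = xy
    return a * x + b * y + c

params, _ = curve_fit(f, (df['X'], df['Y']), df['Z'])
a, b, c = params
df['Z_fit'] = f((df['X'], df['Y']), a, b, c)
print(a, b, c)
print(df)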
I am writing a simulation and I am trying to append the result of EACH ITERATION into a dataframe that keeps track of all the iterations.
While everything works fine with collecting the results, I cannot find a way to append the results into a new column each time. I have been banging my head on this issue for a while now and cannot unblock the problem.
I have built a simplified version of what I am doing to best explain my issue:
import simpy
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
import pandas as pd
###dataframe for the simulation
df = pd.DataFrame({'Id' : ['1183', '1187']})
df['average_demand'] = [7426,989]
df['lead_time'] = [1.5, 1.5]
df['sale_price'] = [1.98, 2.01]
df['buy_price'] = [0.11, 0.23]
df['beg_inventory'] = [1544,674]
df['margin'] = df['sale_price'] - df['buy_price']
df['holding_cost'] = 0.2/12
df['aggregate_order_placement_cost'] = 1000
df['review_time'] = 0
df['periods'] = 30
#df['cap_ts'] = 1.5
df['min_ts'] = 1
df['low_demand'] = [300, 30]#,3000,350,220,40,42,40,10,25,240]
df['high_demand'] = [1000, 130]#,12000,700,500,100,90,210,135,200,800]
df['low_sd'] = [160,30]#,3400,100,90,10,5,50,26,45,170]
df['high_sd'] = [400,90]#,5500,200,160,60,50,100,78,113,300]
cap_ts = 0
big_df = pd.DataFrame(df)
for i in df.index:
    for cap_ts in range(1, 12, 1):
        def warehouse_run(env, df):
            df['inventory'] = df['beg_inventory']
            df['balance'] = 0.0
            df['quantity_on_order'] = 0
            df['count_order_placed'] = 0
            df['commands_on_order'] = 0
            df['demand'] = 0
            df['safety_stock'] = 0
            df['stockout_occurence'] = 0
            df['inventory_position'] = 0
            while True:
                interarrival = generate_interarrival()
                yield env.timeout(interarrival)
                df['balance'] -= df['inventory'] * df['holding_cost'] * interarrival
                df['demand'] = generate_demand()
                if df['demand'].loc[i] < df['inventory'].loc[i]:
                    df['balance'] += df['sale_price'] * df['demand']
                    df['inventory'] -= df['demand']
                    print('{:.2f} sold {}'.format(env.now, df['demand'].loc[i]))
                else:
                    df['balance'] += df['sale_price'] * df['inventory']
                    df['inventory'] = 0
                    df['stockout_occurence'] += 1
                    print('{:.2f} demand {} but inventory {}'.format(env.now, df['demand'].loc[i], df['inventory'].loc[i]))
                    print('{:.2f} sold {} (nb stockout)'.format(env.now, df['stockout_occurence'].loc[i]))
                if df['demand'].loc[i] > df['inventory'].loc[i]:
                    env.process(handle_order(env, df))
                    df['count_order_placed'] += 1
                    print("inventory", df['inventory'].loc[i])
                    print("number of orders placed", df['count_order_placed'].loc[i])

        def handle_order(env, df):
            df['quantity_ordered'] = cap_ts * df['average_demand']
            df['quantity_on_order'] += df['quantity_ordered']
            df['commands_on_order'] += 1
            print("{:.2f} placed order for {}".format(env.now, df['quantity_ordered'].loc[i]))
            df['balance'] -= df['buy_price'] * df['quantity_ordered'] + df['aggregate_order_placement_cost']
            yield env.timeout(df['lead_time'].loc[i], 0)
            df['inventory'] += df['quantity_ordered']
            df['quantity_on_order'] -= df['quantity_ordered']
            df['commands_on_order'] -= 1
            print('{:.2f} receive order, {} in inventory'.format(env.now, df['inventory'].loc[i]))

        # number of orders per month
        def generate_interarrival():
            return np.random.exponential(1. / 1)

        # quantity of demand per month
        def generate_demand():
            return np.random.randint(df['low_demand'].loc[i], df['high_demand'].loc[i])

        def generate_standard_deviation():
            return np.random.randint(df['low_sd'].loc[i], df['high_sd'].loc[i])

        obs_time = []
        inventory_level = []
        demand_level = []
        safety_stock_level = []
        inventory_position_level = []

        def observe(env, df):
            while True:
                obs_time.append(env.now)
                inventory_level.append(df['inventory'].loc[i])
                demand_level.append(df['demand'].loc[i])
                safety_stock_level.append(df['safety_stock'].loc[i])
                inventory_position_level.append(df['inventory_position'].loc[i])
                yield env.timeout(0.1)

        np.random.seed(0)
        env = simpy.Environment()
        env.process(warehouse_run(env, df))
        env.process(observe(env, df))
        # RUN FOR 12 MONTHS
        env.run(until=36.0)
        recap = pd.DataFrame(df.loc[i])
        recap = recap.transpose()
        # big_df.append(recap)
        big_df['Iteration {}'.format(i)] = recap
        print(recap)
So in this code, the issue is appending the results contained in recap to big_df. Ideally, at the end of the simulation, big_df should contain 24 columns, one column of results for each iteration of the simulation. Any help on this would be greatly appreciated, thank you.
UPDATE: thanks to wnsfan40 I have been able to get a df that concatenates the result for each iteration, but big_df resets at each iteration and does not keep appending each new df.
expected output looks kind of like that:
Id result_columns
0 11198 x
1 11198 x
2 11198 x
3 11198 x
4 11198 x
5 11198 x
6 11198 x
7 11198 x
8 11198 x
9 11198 x
10 11198 x
11 11198 x
12 11187 y
13 11187 y
14 11187 y
15 11187 y
16 11187 y
17 11187 y
18 11187 y
19 11187 y
20 11187 y
21 11187 y
22 11187 y
23 11187 y
where the result columns are a shorthand for all the columns containing results about each row.
Assign df's index as the index of big_df when it is initialized, using
big_df = pd.DataFrame(index = df.index)
Try changing from append to assigning a column value, such as
big_df['Iteration {}'.format(i)] = recap
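A minimal, self-contained sketch of that idea (toy numbers, not the full simulation): build each iteration's recap, tag it, and collect the pieces so big_df grows instead of being overwritten. This stacks one row per iteration, matching the 24-row expected output; transpose or pivot afterwards if you really want 24 columns.
import pandas as pd

# Toy stand-in for the simulation DataFrame
df = pd.DataFrame({'Id': ['1183', '1187'], 'beg_inventory': [1544, 674]})

all_recaps = []                            # one entry per (Id, cap_ts) iteration
for i in df.index:
    for cap_ts in range(1, 13):            # 12 iterations per Id
        recap = df.loc[[i]].copy()         # one-row snapshot for this Id
        recap['cap_ts'] = cap_ts           # remember which iteration produced it
        recap['result'] = cap_ts * 100     # placeholder for the simulation output
        all_recaps.append(recap)

big_df = pd.concat(all_recaps, ignore_index=True)
print(big_df.shape)                        # (24, 4): one row per iteration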
There are a few questions on this, but I'm getting stuck. I have a df that contains coordinates for various scatter points. I want to generate a radius around one of these points and return the points that are within this radius for each point in time. Using the df below, I want to return a df that contains all the points within the radius around A for each point in time.
import pandas as pd
df = pd.DataFrame({
'Time' : ['09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.1','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2','09:00:00.2'],
'Label' : ['A','B','C','D','E','A','B','C','D','E'],
'X' : [8,4,3,8,7,7,3,3,4,6],
'Y' : [3,3,3,4,3,2,1,5,4,2],
})
x_data = (df.groupby(['Time'])['X'].apply(list))
y_data = (df.groupby(['Time'])['Y'].apply(list))
AX_data = (df.loc[df['Label'] == 'A']['X'])
AY_data = (df.loc[df['Label'] == 'A']['Y'])
def countPoints(df, center_x, center_y, x, y, radius):
    '''
    Count number of points within radius for label A
    '''
    # Determine square distance
    square_dist = (center_x - x) ** 2 + (center_y - y) ** 2
    # Return df of rows within radius
    df = df[square_dist <= radius ** 2].copy()
    return df
df = countPoints(df, AX_data, AY_data, x_data, y_data, radius = 1)
Intended Output:
Time Label X Y
0 09:00:00.1 A 8 3
1 09:00:00.1 D 8 4
2 09:00:00.1 E 7 3
3 09:00:00.2 A 7 2
4 09:00:00.2 E 6 2
Here's my take on it using np.linalg.norm:
def calc_dist(gp, a_label, r=1):
    dist_df = gp[['X', 'Y']] - gp.loc[gp.Label.eq(a_label), ['X', 'Y']].values
    dist_arr = np.linalg.norm(dist_df, axis=1)
    return gp[dist_arr <= r]
df_A = df.groupby('Time').apply(calc_dist, a_label='A', r=1).reset_index(drop=True)
Out[2159]:
Time Label X Y
0 09:00:00.1 A 8 3
1 09:00:00.1 D 8 4
2 09:00:00.1 E 7 3
3 09:00:00.2 A 7 2
4 09:00:00.2 E 6 2
Method 2: use df.where to keep only the A rows, ffill/bfill within each Time group to broadcast A's coordinates to every row of that group, then apply the distance threshold as a single boolean mask:
df1 = df.where(df.Label.eq('A')).groupby(df.Time).apply(lambda x: x.ffill().bfill())
m = np.linalg.norm(df[['X', 'Y']] - df1[['X', 'Y']], axis=1) <= 1
df_A = df[m]
Out[2262]:
Time Label X Y
0 09:00:00.1 A 8 3
3 09:00:00.1 D 8 4
4 09:00:00.1 E 7 3
5 09:00:00.2 A 7 2
9 09:00:00.2 E 6 2
I have an optimization problem where I am trying to maximize the sum of column Z by picking one row per unique value of column X, subject to the constraint that the sum of column Y over the picked rows must be less than or equal to (in this example) 23.
For example, I have this sample data:
X Y Z
1 9 25
1 7 20
1 5 5
2 9 20
2 7 10
2 5 5
3 9 10
3 7 5
3 5 5
The result should look like this:
X Y Z
1 9 25
2 9 20
3 5 5
This is a replica of Set up linear programming optimization in R using LpSolve?, which has a solution, but I need the same in Python.
For those who want some help getting started with pulp in Python, http://ojs.pythonpapers.org/index.php/tppm/article/view/111 is a useful reference.
The GitHub repo https://github.com/coin-or/pulp/tree/master/doc/KPyCon2009 can be handy as well.
Below is the Python code for the dummy problem asked:
import pandas as pd
import pulp
X=[1,1,1,2,2,2,3,3,3]
Y=[9,7,5,9,7,5,9,7,5]
Z=[25,20,5,20,10,5,10,5,5]
df = pd.DataFrame({'X':X,'Y':Y,'Z':Z})
allx = df['X'].unique()
possible_values = [(w,b) for w in allx for b in range(1,4)]
x = pulp.LpVariable.dicts('arr', (allx, range(1, 4)),
                          lowBound=0,
                          upBound=1,
                          cat=pulp.LpInteger)
model = pulp.LpProblem("Optim", pulp.LpMaximize)
model += sum([x[w][b]*df[df['X']==w].reset_index()['Z'][b-1] for (w,b) in possible_values])
model += sum([x[w][b]*df[df['X']==w].reset_index()['Y'][b-1] for (w,b) in possible_values]) <= 23, \
    "Maximum_number_of_Y"
for value in allx:
    model += sum([x[w][b] for (w, b) in possible_values if w == value]) >= 1

for value in allx:
    model += sum([x[w][b] for (w, b) in possible_values if w == value]) <= 1
## View the model definition
model
model.solve()
print("The choosen rows are out of a total of %s:"%len(possible_values))
for v in model.variables():
print v.name, "=", v.varValue
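As a small follow-up sketch (it assumes model, x, possible_values and df from the code above have already been built and solved), the chosen rows can be mapped back onto the original data and the objective value checked:
# Rows whose binary variable ended up at 1 (use > 0.5 to be safe with float values)
chosen = [(w, b) for (w, b) in possible_values if x[w][b].varValue > 0.5]
rows = pd.concat(df[df['X'] == w].reset_index(drop=True).iloc[[b - 1]]
                 for (w, b) in chosen)
print(rows[['X', 'Y', 'Z']])
print("objective:", pulp.value(model.objective))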
For the solution in R:
d = data.frame(x=c(1,1,1,2,2,2,3,3,3), y=c(9,7,5,9,7,5,9,7,5), z=c(25,20,5,20,10,5,10,5,3))
library(lpSolve)
all.x <- unique(d$x)
d[lp(direction = "max",
     objective.in = d$z,
     const.mat = rbind(outer(all.x, d$x, "=="), d$y),
     const.dir = rep(c("==", "<="), c(length(all.x), 1)),
     const.rhs = rep(c(1, 23), c(length(all.x), 1)),
     all.bin = TRUE)$solution == 1, ]
I'm trying to bin a sample of observations into n discrete groups, then combine these groups until each subgroup has a minimum of 6 members. So far, I've generated bins and grouped my DataFrame into them:
# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()
1 4
2 1
3 2
4 3
5 2
6 8
7 7
8 6
9 19
10 12
11 13
12 12
13 7
14 12
15 12
16 2
17 3
18 6
19 3
21 1
So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.
You can do this:
df = pd.DataFrame(np.random.random_integers(1, 200, 135), columns=['heights'])
# (np.random.random_integers is deprecated in newer NumPy; np.random.randint(1, 201, 135) is equivalent)
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()

def f(vals, max_size):
    total = 0
    group = 1
    for v in vals:
        total += v
        if total <= max_size:
            yield group
        else:
            group += 1
            total = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])
And if you run print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()) you will see that the cumulative sums are grouped as expected.
Also, if you want to group the original values, you can do something like:
dat = np.random.random_integers(0, 200, 135)  # overridden below by a fixed sample so the example is reproducible
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
                83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
                1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
                56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
                143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
                46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
                184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
                83,155,161,29,197,143,122,72,60])
df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
grp = df.heights.groupby(bins)

m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f(x):
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.apply(f)
print(df.groupby(g).size())
# another way of doing the same, just a matter of taste
m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())