For example, I have a dataframe (df) whose target column is df['Z'], plus two other columns, df['X'] and df['Y']. All of this data comes from real-world data collection.
How can I express Z as the following functions in Python (i.e. fit Z as a function of X and Y)?
> 1. Z = f(X)
> 2. Z = f(X,Y)
Here is how you can do that:

def function(x, y):
    return x + y + 4  # obviously the function can be more complex

df["Z"] = function(df["A"], df["B"])
Example:

import pandas as pd

data = {'A': [x for x in range(5)], 'B': [x for x in range(6, 11)]}
df = pd.DataFrame(data)

def function(x, y):
    return x + y + 4

df["Z"] = function(df["A"], df["B"])
print(df)
Output:
A B Z
0 0 6 10
1 1 7 12
2 2 8 14
3 3 9 16
4 4 10 18
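Note that if you don't know f in advance and actually want to fit it from the X, Y, Z data, you need a model-fitting routine rather than a plain function application. A minimal sketch using scipy.optimize.curve_fit, assuming a linear model Z = a*X + b*Y + c (the model form and the Z_fit column are illustrative placeholders, not something from the question):

import numpy as np
from scipy.optimize import curve_fit

# assumed model form: Z = a*X + b*Y + c
def model(xy, a, b, c):
    x, y = xy
    return a * x + b * y + c

xdata = np.vstack([df["X"].to_numpy(), df["Y"].to_numpy()])
popt, pcov = curve_fit(model, xdata, df["Z"].to_numpy())
df["Z_fit"] = model(xdata, *popt)  # fitted values from the estimated coefficients

For the one-variable case Z = f(X), the same call works with a single array as xdata.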
I have lots of files containing x, y, yerr columns. I read each one, apply a transformation to the x values, and then I would like to limit the range of the new x values (newx) that I use afterwards:
import numpy as np
import pandas as pd
from lmfit.models import ConstantModel, GaussianModel

for key, value in files_data.items():
    file_short_name = key
    D_value_sale = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        data.columns = ["x", "y"]
    D = D_value_sale
    b = 111
    c = 222
    data["newx"] = -c * (((data.x * (1 / (1 + D))) - b) / b)
    data["newy"] = (data.y - data.y.min()) / (data.y.max() - data.y.min())
    w = data[(data.newx < 20000) & (data.newx > 8000)]
    dfx = w["newx"]
    dfy = w["newy"]
    offset = ConstantModel()  # provides the constant c used below
    peak = GaussianModel()
    model = offset + peak  # composite model used for the fit
    pars = offset.make_params(c=np.median(dfy))
    pars += peak.guess(dfy, x=dfx, amplitude=-0.5)
    result = model.fit(dfy, pars, x=dfx)
If I'm understanding correctly what you are asking, this is what you could do:
for key, value in files_data.items():
    file_short_name = key
    # main = value[1]
    data = pd.DataFrame(value[0])
    if data.shape[1] == 3:
        data.columns = ["x", "y", "yerr"]
    else:
        # Here you should define what happens in case
        # the data isn't what you expected it to be
        pass
    data["newx"] = data.x + 1  # Perform whatever transformation you need
    # data["newy"] = data.y * 1.01234  # Etc.
    # Then you can limit the newx column (assign the result, or the filter is lost):
    data = data[(data.newx < upper_limit) & (data.newx > lower_limit)]
What you're doing won't work if you want to preserve the relationship between columns. When you assign the data columns to their own variables xval, yval and error, you implicitly "lose" their relationship.
I'll open with the same caveat of "if I'm understanding you correctly": the crux of what you are looking for is the boolean array that you have created to apply your limits:
data = data[(data[0] >= xlim[0]) & (data[0] <= xlim[1])]
This boolean array can be saved and applied to any array of the same shape.
idx = (data[0] >= xlim[0]) & (data[0] <= xlim[1])
filtered_data = data[0][idx]
filtered_newxval = newxval[idx]
By way of a more complete, self-contained example, see the code below, where this concept is applied to multidimensional arrays and pandas dataframes.
import numpy as np
import pandas as pd
np.random.seed(42)
x = np.random.randint(0, 20, 10)
y = np.random.randint(0, 20, 10)
print("x", x)
# >>> x [ 6 19 14 10 7 6 18 10 10 3]
print("y", y)
# >>> y [ 7 2 1 11 5 1 0 11 11 16]
xmin = 3
xmax = 17
idx = (x >= xmin) & (x <= xmax)
data = np.vstack((x, y))
print("filtered_data:\n", data[:, idx])
# >>> filtered_data:
# [[ 6 14 10 7 6 10 10 3]
# [ 7 1 11 5 1 11 11 16]]
df = pd.DataFrame({"x": x, "y": y})
df["xnew"] = df["x"] * 2
print(df[idx])
# >>> x y xnew
# >>> 0 6 7 12
# >>> 2 14 1 28
# >>> 3 10 11 20
# >>> 4 7 5 14
# >>> 5 6 1 12
# >>> 7 10 11 20
# >>> 8 10 11 20
# >>> 9 3 16 6
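In pandas specifically, the same mask can also be built with Series.between, which is inclusive on both ends by default (equivalent to the >= and <= comparisons above):

mask = df["x"].between(xmin, xmax)
print(df[mask])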
So I made this dataframe:

import numpy as np
import pandas as pd

alp = "abcdefghijklmnopqrstuvwxyz0123456789"
s = "carl"
for i in s:
    alp = alp.replace(i, "")
jaa = s + alp
x = list(jaa)
array = np.array(x)
re = np.reshape(array, (6, 6))
dt = pd.DataFrame(re)
dt.columns = [1, 2, 3, 4, 5, 6]
dt.index = [1, 2, 3, 4, 5, 6]
dt
1 2 3 4 5 6
1 c a r l b d
2 e f g h i j
3 k m n o p q
4 s t u v w x
5 y z 0 1 2 3
6 4 5 6 7 8 9
I want to search for a value and print its row (index) and column.
For example, for 'h' the output I want is 2,4.
Is there any way to get that output?
row, col = np.where(dt == "h")
print(dt.index[row[0]], dt.columns[col[0]])
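Note that np.where returns every match, so if the value can occur more than once you can iterate over all the hits:

rows, cols = np.where(dt == "h")
for r, c in zip(rows, cols):
    print(dt.index[r], dt.columns[c])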
I have two dataframes:
1) A list of suppliers and their lat, long coordinates:

sup_essential = pd.DataFrame({'supplier': ['A', 'B', 'C'],
                              'coords': [(51.1235, -0.3453), (52.1245, -0.3423), (53.1235, -1.4553)]})

2) A list of stores and their lat, long coordinates:

stores_essential = pd.DataFrame({'storekey': [1, 2, 3],
                                 'coords': [(54.1235, -0.6553), (49.1245, -1.3423), (50.1235, -1.8553)]})
I want to create an output table that has: store, store_coordinates, supplier, supplier_coordinates, distance for every combination of store and supplier.
I currently have:
test = []
for row in sup_essential.iterrows():
    for row in stores_essential.iterrows():
        r = sup_essential['supplier'], stores_essential['storeKey']
        test.append(r)

But this just gives me repeats of all the values.
Source DFs
In [105]: sup
Out[105]:
coords supplier
0 (51.1235, -0.3453) A
1 (52.1245, -0.3423) B
2 (53.1235, -1.4553) C
In [106]: stores
Out[106]:
coords storekey
0 (54.1235, -0.6553) 1
1 (49.1245, -1.3423) 2
2 (50.1235, -1.8553) 3
Solution:
from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('haversine')

# cross join via a dummy key, then drop the helper column
m = pd.merge(sup.assign(x=0), stores.assign(x=0), on='x', suffixes=['1', '2']).drop('x', axis=1)

# split the (lat, lon) tuples into separate numeric columns
d1 = sup[['coords']].assign(lat=sup.coords.str[0], lon=sup.coords.str[1]).drop('coords', axis=1)
d2 = stores[['coords']].assign(lat=stores.coords.str[0], lon=stores.coords.str[1]).drop('coords', axis=1)

# haversine works on (lat, lon) in radians; scale by the Earth's radius in km
m['dist_km'] = np.ravel(dist.pairwise(np.radians(d1), np.radians(d2)) * 6367)
Result:
In [135]: m
Out[135]:
coords1 supplier coords2 storekey dist_km
0 (51.1235, -0.3453) A (54.1235, -0.6553) 1 334.029670
1 (51.1235, -0.3453) A (49.1245, -1.3423) 2 233.213416
2 (51.1235, -0.3453) A (50.1235, -1.8553) 3 153.880680
3 (52.1245, -0.3423) B (54.1235, -0.6553) 1 223.116901
4 (52.1245, -0.3423) B (49.1245, -1.3423) 2 340.738587
5 (52.1245, -0.3423) B (50.1235, -1.8553) 3 246.116984
6 (53.1235, -1.4553) C (54.1235, -0.6553) 1 122.997130
7 (53.1235, -1.4553) C (49.1245, -1.3423) 2 444.459052
8 (53.1235, -1.4553) C (50.1235, -1.8553) 3 334.514028
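As a side note, on pandas 1.2+ the dummy-key merge can be replaced by a native cross join, which produces the same store/supplier combinations:

m = pd.merge(sup, stores, how='cross', suffixes=['1', '2'])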
I'm trying to bin a sample of observations into n discrete groups, then combine these groups until each subgroup has a minimum of 6 members. So far, I've generated bins and grouped my DataFrame into them:

# df is a DataFrame containing 135 measurements
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
grp.size()
1 4
2 1
3 2
4 3
5 2
6 8
7 7
8 6
9 19
10 12
11 13
12 12
13 7
14 12
15 12
16 2
17 3
18 6
19 3
21 1
So I can see that I need to combine groups 1 - 3, 3 - 5, and 16 - 21, while leaving the others intact, but I don't know how to do this programmatically.
You can do this:
df = pd.DataFrame(np.random.randint(1, 201, 135), columns=['heights'])
bins = np.linspace(df.heights.min(), df.heights.max(), 21)
grp = df.groupby(np.digitize(df.heights, bins))
sizes = grp.size()
def f(vals, max_size):
    total = 0
    group = 1
    for v in vals:
        total += v
        if total <= max_size:
            yield group
        else:
            group += 1
            total = v
            yield group

# I've changed 6 to 30 for the example because I don't have your original dataset
grp.size().groupby([g for g in f(sizes, 30)])
And if you do print(grp.size().groupby([g for g in f(sizes, 30)]).cumsum()) you will see that the cumulative sums are grouped as expected.
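To see the merged bin sizes themselves, aggregate the grouped sizes with sum:

merged_sizes = grp.size().groupby([g for g in f(sizes, 30)]).sum()
print(merged_sizes)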
Also if you want to group the original values you can do something like:
# dat = np.random.randint(0, 201, 135)  # the fixed sample below was generated like this
dat = np.array([78,116,146,111,147,78,14,91,196,92,163,144,107,182,58,89,77,134,
83,126,94,70,121,175,174,88,90,42,93,131,91,175,135,8,142,166,
1,112,25,34,119,13,95,182,178,200,97,8,60,189,49,94,191,81,
56,131,30,107,16,48,58,65,78,8,0,11,45,179,151,130,35,64,
143,33,49,25,139,20,53,55,20,3,63,119,153,14,81,93,62,162,
46,29,84,4,186,66,90,174,55,48,172,83,173,167,66,4,197,175,
184,20,23,161,70,153,173,127,51,186,114,27,177,96,93,105,169,158,
83,155,161,29,197,143,122,72,60])
df = pd.DataFrame({'heights': dat})
bins = np.digitize(dat, np.linspace(0, 200, 21))
grp = df.heights.groupby(bins)

m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f(x):
    global c, s
    res = pd.Series([c] * x.size, index=x.index)
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.apply(f)
print(df.groupby(g).size())
# another way of doing the same, just a matter of taste
m = 15  # you should put 6 here, the minimum
s = 0
c = 1

def f2(x):
    global c, s
    res = [c] * x.size  # here is the main difference with f
    s += x.size
    if s > m:
        s = 0
        c += 1
    return res

g = grp.transform(f2)  # call it this way
print(df.groupby(g).size())
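For what it's worth, the same merging can be done without global state by walking the bin sizes once and building a bin-to-group mapping. A small sketch on the same data; merge_small_bins is a hypothetical helper name, not something from the original answer:

def merge_small_bins(sizes, min_size=6):
    # map each original bin label to a merged group id,
    # starting a new group once the running count reaches min_size
    mapping = {}
    group, running = 1, 0
    for bin_label, size in sizes.items():
        mapping[bin_label] = group
        running += size
        if running >= min_size:
            group += 1
            running = 0
    return mapping

bin_ids = np.digitize(df.heights, np.linspace(0, 200, 21))
sizes = pd.Series(bin_ids).value_counts().sort_index()
g = pd.Series(bin_ids).map(merge_small_bins(sizes, min_size=6))
print(df.groupby(g.values).size())
# caveat: the trailing group can still end up under min_size;
# merge it into its predecessor if that matters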