how to create new column and store value using equation - python

The column size has multiple values. For each x in column size, y (the predicted value) is calculated. How do I display the existing column size together with the predicted value y?
for x in df['size']:
    y = (0.7118 * x) + 1.1691

Why don't you just do
df["Y"] = 0.7118 * df["size"] + 1.1691
or
df["Y"] = df["size"].mul(0.7118).add(1.1691)

You can try something like this:
df['y'] = df['size'].apply(lambda x: 0.7118*x + 1.1691)
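For reference, here is a minimal self-contained sketch (the size values are made up for illustration) showing that the vectorized expression and the apply version produce the same column:
import pandas as pd

df = pd.DataFrame({'size': [1.0, 2.5, 4.0]})   # hypothetical sample data

df['Y'] = 0.7118 * df['size'] + 1.1691                            # vectorized
df['Y_apply'] = df['size'].apply(lambda x: 0.7118 * x + 1.1691)   # per-element
print(df)
The two new columns are identical; the vectorized form is faster on large frames.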

Related

'Oversampling' Cartesian data in a dataframe without a for loop?

I have 3D data in a pandas dataframe that I would like to 'oversample'/smooth by replacing the value at each x,y point with the average value of all the points that are within 5 units of that point. I can do it using a for loop like this (starting with a dataframe with three columns X, Y, Z):
import pandas as pd

Z_OS = []
X_OS = []
Y_OS = []
for index, row in df.iterrows():
    # mean of Z over all points within 5 units of this point in both X and Y
    Z_OS += [df[(df['X'] > row['X'] - 5) & (df['X'] < row['X'] + 5) &
                (df['Y'] > row['Y'] - 5) & (df['Y'] < row['Y'] + 5)]['Z'].mean()]
    X_OS += [row['X']]
    Y_OS += [row['Y']]
data = {
    'X': X_OS,
    'Y': Y_OS,
    'Z': Z_OS
}
OSdf = pd.DataFrame.from_dict(data)
but this method is very slow for large datasets and feels very 'unpythonic'. How could I do this without for loops? Is it possible via complex use of the groupby function?
xy = df[['X', 'Y']]
df['smoothed Z'] = df[['Z']].apply(
    lambda row: df['Z'][(xy - xy.loc[row.name]).abs().lt(5).all(1)].mean(),
    axis=1
)
Here I used df[['Z']] to get the column 'Z' as a data frame. We need the index of a row, i.e. row.name, when we apply a function to this column.
.abs().lt(5).all(1) reads as: absolute values which are all less than 5 along the row.
Update
The code below does the same thing, but seems more consistent as it addresses the index directly:
df['smoothed Z'] = df.index.to_series().apply(
    lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(1), 'Z'].mean()
)
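To illustrate, here is a minimal run of this approach on made-up coordinates (using the X, Y, Z column names from the question):
import pandas as pd

# hypothetical sample points: two clusters more than 5 units apart
df = pd.DataFrame({'X': [0, 1, 10, 11],
                   'Y': [0, 1, 10, 11],
                   'Z': [1.0, 2.0, 3.0, 4.0]})

xy = df[['X', 'Y']]
df['smoothed Z'] = df.index.to_series().apply(
    lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(1), 'Z'].mean()
)
print(df)
# the first two points average to 1.5, the last two to 3.5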
If your rows are already ordered and a fixed-size window of rows (rather than a spatial radius) is acceptable, there is also the built-in rolling mean:
df['column_name'].rolling(rolling_window).mean()
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
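A quick sketch of the rolling approach, with an assumed window size of 3 (the window size and data are placeholders):
import pandas as pd

df = pd.DataFrame({'Z': [1.0, 2.0, 3.0, 4.0, 5.0]})
# centered 3-row window; min_periods=1 keeps the edge rows from becoming NaN
df['Z_rolling'] = df['Z'].rolling(3, center=True, min_periods=1).mean()
print(df)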

Pandas - if X float in column is greater than Y, find difference between X and Y and multiply by .25

I suspect the solution is quite simple, but I have been unable to figure it out. Essentially, what I want to do is query a column of float dtype to see whether each value is >= 100.00. If it is, I want to take the value x and compute: ((x - 100)*.25) + 100 = new value (replacing the original values in place, preferably).
The data looks something like:
Some columns here    A percentage stored as float
foobar               84.85
foobar               15.95
fuubahr              102.25
The result of the operation described above would be:
Some columns here    A percentage stored as float
foobar               84.85
foobar               15.95
fuubahr              100.5625
Thanks!
A list comprehension is an easy solution for this:
dataframe["A percentage stored as float"] = [
    ((x - 100) * .25) + 100 if x >= 100 else x
    for x in dataframe["A percentage stored as float"]
]
What it does: it loops through each row of the column, checks whether the value meets the if condition and applies the calculation if it does; if the condition is not met, it keeps the original value.
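Using the sample values from the question (and assuming the frame is called dataframe, as above), a quick check looks like this:
import pandas as pd

dataframe = pd.DataFrame({
    "Some columns here": ["foobar", "foobar", "fuubahr"],
    "A percentage stored as float": [84.85, 15.95, 102.25],
})
dataframe["A percentage stored as float"] = [
    ((x - 100) * .25) + 100 if x >= 100 else x
    for x in dataframe["A percentage stored as float"]
]
print(dataframe)
# 84.85 and 15.95 are unchanged; 102.25 becomes 100.5625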

getting column name using iloc in dataframe

Is there a way to get the column name as a value using iloc or other functions?
I have a for loop here:
for i in range(0,18):
    coef, pval = pearsonr(x.iloc[:,i], y)
    print('pval of ', x.iloc[?,i], ' and allStar: ', pval)
where I want to print 'pval of column_name and allStar: pval'.
Is there a value I can replace ? with so that it fetches the column name for each of the columns? Or do I have to use another function?
If x is your dataframe, try converting the column name to a column index:
col_idx = x.columns.get_loc('column_name')
This index can then be passed to the iloc method.
The short answer to your direct question is to use x.columns:
for i in range(0,18):
    coef, pval = pearsonr(x.iloc[:,i], y)
    print('pval of ', x.columns[i], ' and allStar: ', pval)
A cleaner approach would be to simply iterate over the columns:
for c in x.columns:
    coef, pval = pearsonr(x[c], y)
    print('pval of ', c, ' and allStar: ', pval)
Bonus notes (mainly to avoid the loop...):
To get the correlation coefficients (and not the pvalues just yet) of each column with y, you can simply use corrwith:
r = x.corrwith(pd.Series(y), axis=0)
To obtain the pvalues that correspond to those Pearson coefficients, you can simply calculate them directly, as follows:
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2) # n == len(y)
p = 2*dist.cdf(-abs(r)) # <= the pvalues!
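Putting both bonus notes together, a minimal self-contained sketch (with made-up data; the column names a, b, c are just placeholders) might look like this:
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(0)                       # made-up example data
x = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])
y = rng.normal(size=50)

r = x.corrwith(pd.Series(y))                         # Pearson r of each column with y
n = len(y)
dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
p = 2 * dist.cdf(-abs(r))                            # two-sided p-values, one per column
print(pd.DataFrame({'r': r, 'p': p}))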

How to cut one pandas column in python, and make the other column equal in length?

I have a dataframe consisting of a few columns, among these are X, Y and Z coordinates. These columns are all of equal length. If you imagine a cylinder, what I am looking to do is cut a part of the cylinder so that it gets shorter. I have managed to figure out how to cut column Z with respect to the actual values in column Z:
cutted_z = [i for i in df["Z"] if i >= 0 and i <= 1000]
How can one cut X Y Z equally with respect to the actual numerical values in Z?
The problem with my current solution is that it leaves the length of the X and Y columns, as well as the rest of the columns in my dataframe, untouched, meaning X and Y now have more rows than Z.
You could just do:
mask = (df['Z'] >= 0) & (df['Z'] <= 1000) #creates a mask to filter dataframe on
df = df[mask]
This will give you a slice of the dataframe for the Z value range you wanted.
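Because the boolean mask is applied to the whole frame, X, Y, Z and any other columns stay aligned and get shortened together. A small sketch with assumed values:
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6], 'Z': [-10, 500, 2000]})
mask = (df['Z'] >= 0) & (df['Z'] <= 1000)
df = df[mask]
print(df)   # only the middle row survives, in every column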

Converting hex to negative int in Python

I want to convert the hex values in column x into the correct negative int, as seen in column "true", but instead I get the result in column y:
   x      y  true
fdf1  65009  -527
I tried this (I know it's not correct)
df["y"] = df["x"].apply(int,base=16)
and from this link I know this function:
def s16(value):
    return -(value & 0x8000) | (value & 0x7fff)

a = s16(int('fdf1', 16))
print(a)
can convert a single value into the correct one, but how do you apply it to make a new column in a pandas DataFrame?
Use a lambda function:
df["y"] = df["x"].apply(lambda x: s16(int(x, base=16)))
Or change the function for cleaner code:
def s16(value):
    value = int(value, base=16)
    return -(value & 0x8000) | (value & 0x7fff)

df["y"] = df["x"].apply(s16)
print(df)
      x    y  true
0  fdf1 -527  -527
The easiest way is to convert it to an integer and reinterpret it as a 16-bit integer by using .astype:
import numpy as np
df["y"] = df["x"].apply(lambda x: int(x, base=16)).astype(np.int16)
The dtype of column y will be int16, so any operation done on this column with other int16's will keep the values between -32768 and 32767.
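For example, using the value from the question (the 'true' column is included only for comparison):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['fdf1'], 'true': [-527]})
df['y'] = df['x'].apply(lambda x: int(x, base=16)).astype(np.int16)
print(df)   # y is -527, matching 'true'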
