I have a DataFrame with two columns in Python. I want to look up a value in the first column and obtain the corresponding value from the second column. Sometimes the lookup value matches a row exactly, but it can also fall between two rows.
I have this example dataframe:
    x    y
0   0    0
1  10  100
2  20  200
I want to find the value of y given a value of x. For example, if I look up the value 10, I should obtain 100. But if I look up 15, I need to interpolate between the two surrounding values of y. Is there a function to do this?
numpy.interp is probably the simplest way here for linear interpolation:
def interpolate(xval, df, xcol, ycol):
    # Compute the linear interpolation at xval, where df is a DataFrame,
    # df[xcol] holds the x coordinates and df[ycol] the y coordinates.
    # df[xcol] is expected to be sorted in ascending order.
    return np.interp(xval, df[xcol], df[ycol])
With your example data it gives:
>>> interpolate(10, df, 'x', 'y')
100.0
>>> interpolate(15, df, 'x', 'y')
150.0
You can even directly do:
>>> np.interp([10, 15], df.x, df.y)
array([100., 150.])
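One caveat worth keeping in mind: np.interp assumes the x values are already sorted in ascending order and will silently return wrong results otherwise. If your frame might be unsorted, a minimal guard (using the same df as above) is:

df_sorted = df.sort_values('x')   # np.interp expects ascending x
np.interp([10, 15], df_sorted.x, df_sorted.y)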
You can have a look at the interpolate method provided by pandas (doc), although I'm not sure it answers your question directly: it fills NaN values within an existing column rather than looking up an arbitrary x.
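For completeness, here is one way it could be used for this lookup (a small sketch, assuming the example frame above with x = 0, 10, 20 and y = 0, 100, 200): index the series by x, insert the query point as NaN, and let interpolate(method='index') fill it.

s = df.set_index('x')['y']                # y values indexed by x
s = s.reindex(s.index.union([15]))        # add the query point as NaN
s = s.interpolate(method='index')         # linear interpolation along the index
s.loc[15]                                 # -> 150.0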
You can do it with interp1d from the scipy module. Several types of interpolation are possible: ‘linear’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’... You can find the full list on the doc page.
The interpolation process can be summarised as three steps:
Split your data between missing and non-missing values. I use isna (doc)
Create the interpolation function using the data without missing values. I use interp1d (doc)
Interpolate (predict the missing values). Just call the function created in step 2 on the missing data (column x).
Here is the code:
# Import modules
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
# Data
df = pd.DataFrame(
    [[0, 0],
     [10, 100],
     [11, np.nan],
     [15, np.nan],
     [17, np.nan],
     [20, 200]],
    columns=["x", "y"])
print(df)
# x y
# 0 0 0.0
# 1 10 100.0
# 2 11 NaN
# 3 15 NaN
# 4 17 NaN
# 5 20 200.0
# Split data in training (not NaN values) and missing (NaN values)
missing = df.isna().any(axis=1)
df_training = df[~missing]
df_missing = df[missing].reset_index(drop=True)
# Create function that interpolate missing value (from our training values)
f = interp1d(df_training.x, df_training.y)
# Interpolate the missing values
df_missing["y"] = f(df_missing.x)
print(df_missing)
# x y
# 0 11 110.0
# 1 15 150.0
# 2 17 170.0
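The same three steps work for the other interpolation types listed above; only the kind argument changes. (Note that 'cubic' requires at least four non-missing points, so with this small training set 'quadratic' is as far as you can go.)

f_quad = interp1d(df_training.x, df_training.y, kind="quadratic")
df_missing["y"] = f_quad(df_missing.x)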
You can find other work on the topic at this link.
import numpy as np
import pandas as pd
GFG_dict = {'2019': [10, 20, 30, 40],
            '2020': [20, 30, 40, 50],
            '2021': [30, np.nan, np.nan, 60],
            '2022': [40, 50, 60, 70]}
gfg = pd.DataFrame(GFG_dict)
gfg['mean1'] = gfg.mean(axis = 1)
gfg['mean2'] = gfg.loc[:,'2019':'2022'].sum(axis = 1)/4
gfg
I want to get the average as in column mean2. How do I get the same result using .mean(axis=1)?
When you compute mean2 you take the sum (which skips NaN values) and then always divide by 4 (the count including NaNs), whereas .mean(axis=1) divides by the count of non-NaN values only, which is why mean1 comes out larger for rows with missing data.
You can achieve the same result by applying fillna(0) before taking the mean (slicing to the year columns so the mean1/mean2 helper columns added above are not averaged in):
gfg['mean1'] = gfg.loc[:, '2019':'2022'].fillna(0).mean(axis=1)
NOTE: this does not fill the NaN values in your DataFrame; fillna(0) only affects the mean calculation, so your original data is not modified.
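As a quick sanity check, here is a self-contained version (values taken from the question's GFG_dict; the slice keeps the helper columns out of the average):

import numpy as np
import pandas as pd

gfg = pd.DataFrame({'2019': [10, 20, 30, 40],
                    '2020': [20, 30, 40, 50],
                    '2021': [30, np.nan, np.nan, 60],
                    '2022': [40, 50, 60, 70]})
years = gfg.loc[:, '2019':'2022']
gfg['mean1'] = years.fillna(0).mean(axis=1)
gfg['mean2'] = years.sum(axis=1) / 4
print(gfg[['mean1', 'mean2']])
#    mean1  mean2
# 0   25.0   25.0
# 1   25.0   25.0
# 2   32.5   32.5
# 3   55.0   55.0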
I am doing some computation on a dataset using loops. Based on a random event, I compute some float numbers, which means I don't know in advance how many floats I will obtain. I want to collect these results in some kind of list and then save them to a DataFrame column, one column per loop iteration, so that I can compare them.
example:
for y in range(1, 10):
    for x in range(1, 100):
        if x > random_number and x < y:   # random_number: some randomly drawn threshold
            result = 2 * x
I want to save all the results in DataFrame columns keyed by the (x, y) combination: for example, the results for x=1, y=2 in one column, then x=2, y=2 in another column, etc. The result lists are not all the same size, so I guess I'll use fillna.
I know that I can create an empty DataFrame with the maximum index and then fill it result by result, but I think there's a better way to do it!
Thanks in advance.
You want to take advantage of the efficiency that numpy and pandas give you. If you use numpy.where, you can set the value to NaN where the condition is False and apply your formula where it is True:
import numpy as np
import pandas as pd
np.random.seed(0) # so you can reproduce my result, you can remove this in practice
x = list(range(10))
y = list(range(1, 11))
random_nums = 10 * np.random.random(10)
df = pd.DataFrame({'x' : x, 'y': y})
# the first argument is your if condition
df['new_col'] = np.where((df['x'] > random_nums) & (df['x'] < df['y']), 2*df['x'], np.nan)
print(df)
Here, random_nums generates an entire np.ndarray of random numbers to compare with. This gives
x y new_col
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 5 6 NaN
6 6 7 12.0
7 7 8 NaN
8 8 9 NaN
9 9 10 18.0
This is especially faster if your formula (here, 2*x) is relatively quick to compute.
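If you genuinely need a result for every (x, y) combination (as the nested loops suggest), one option along the same lines is to build the full grid first and then apply the same vectorised condition; the ranges and the random threshold below are just placeholders for whatever your real loop uses. Reusing the imports above:

grid = pd.DataFrame([(x, y) for y in range(1, 10) for x in range(1, 100)],
                    columns=['x', 'y'])
random_nums = 10 * np.random.random(len(grid))
grid['result'] = np.where((grid['x'] > random_nums) & (grid['x'] < grid['y']),
                          2 * grid['x'], np.nan)
# one column per y value, indexed by x, NaN where the condition failed
table = grid.pivot(index='x', columns='y', values='result')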
I'm working on a hotel booking dataset. Within the data frame there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows with missing values, but since the share of missing values is not that small, I now want to use random sampling imputation to replace them in proportion to the existing values.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
Results:
The first 3 rows were NaN but are now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE:
Thanks to ti7, who helped me solve the problem:
new_agent = hotel['agent'].dropna()       # a Series of just the available values
n_null = hotel['agent'].isnull().sum()    # number of missing entries
# sample the available values with repetition and use them to fill the gaps
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values
.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
What you probably want is to generate a new Series of random values drawn from your current Series (you know how many you need from the number of missing entries) and use that to fill the missing values.
get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign
quick code
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
df["agent"].isna().sum(), # get the same number of values as are missing
replace=True # repeat values
).values # throw out the index
demo
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
agent b
0 1.0 3
1 2.0 4
2 NaN 5
3 NaN 6
4 10.0 7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
agent b
0 1.0 3
1 2.0 4
2 10.0 5
3 2.0 6
4 10.0 7
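If the imputation needs to be reproducible, .sample() also accepts a random_state, so the same values are drawn every run:

df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True, random_state=0).values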
I'm new to Pandas and I'm wondering if there's a better way to accomplish the following.
Set up:
import pandas as pd
import numpy as np
x = np.arange(0, 1, .01)
y = np.random.binomial(10, x, 100)
bins = 50
df = pd.DataFrame({'x':x, 'y':y})
print(df.head())
x y
0 -1 1
1 38 1
2 56 0
3 42 0
4 41 0
I would like to group the x values into equal size bins, and for each bin take the average value of both x and y.
my_bins = pd.cut(x, bins=20)
data = df[['x', 'y']].groupby(my_bins).agg(['mean', 'size'])
print(data.head())
x y
mean size mean size
age
(-1.101, 4.05] -1.000000 87990 0.768428 87990
(4.05, 9.1] NaN 0 NaN 0
(9.1, 14.15] NaN 0 NaN 0
(14.15, 19.2] 18.512286 1872 0.493590 1872
(19.2, 24.25] 22.768022 8906 0.496968 8906
Well that works. But from here, how do I plot x's mean vs y's mean? I know I can do something like
data.columns = data.columns.droplevel() # remove the multiple levels that were created
data.columns = ['x_mean', 'x_size', 'y_mean', 'y_size'] # manually set new column names
data.plot.scatter(x='x_mean', y='y_mean') # plot
But this feels wrong and clunky as I have to drop the column levels (which removes useful structure from my data) and I have to manually rename the columns. Is there a better way?
You can specify the x and y parameters to point at the multi-level columns by using tuples:
data.plot.scatter(x=('x', 'mean'), y=('y', 'mean'))
This way, you don't need to rename the columns in order to plot it.
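If you do want flat names later (for example to reference the columns elsewhere), a small alternative to typing each name by hand is to join the two levels; this produces the same 'x_mean', 'x_size', ... names built manually above:

data.columns = ['_'.join(col) for col in data.columns]
data.plot.scatter(x='x_mean', y='y_mean')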
I would like to fill gaps in a column of my DataFrame using a cubic spline. If I were to export to a list, I could use SciPy's interp1d function and apply it to the missing values.
Is there a way to use this function inside pandas?
Most numpy/scipy functions only require their arguments to be "array_like", and interp1d is no exception. Fortunately both Series and DataFrame are "array_like", so we don't need to leave pandas:
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
df = pd.DataFrame([np.arange(1, 6), [1, 8, 27, np.nan, 125]]).T
In [5]: df
Out[5]:
0 1
0 1 1
1 2 8
2 3 27
3 4 NaN
4 5 125
df2 = df.dropna() # interpolate on the non nan
f = interp1d(df2[0], df2[1], kind='cubic')
#f(4) == array(63.9999999999992)
df[1] = df[0].apply(f)
In [10]: df
Out[10]:
0 1
0 1 1
1 2 8
2 3 27
3 4 64
4 5 125
Note: I couldn't think of an example off the top of my head for passing a DataFrame in as the second argument (y)... but that ought to work too.
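To make that note concrete, here is a minimal sketch (the column names a and b are made up for illustration): interp1d accepts a 2-D y, so several columns can share one interpolator as long as axis=0 lines the rows up with x.

df3 = pd.DataFrame({'x': [1, 2, 3, 5],
                    'a': [1., 8., 27., 125.],
                    'b': [2., 16., 54., 250.]})
g = interp1d(df3['x'], df3[['a', 'b']], kind='cubic', axis=0)
g(4)  # -> array([ 64., 128.]) (approximately)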