I have this table:
df = pd.DataFrame({'A': [0, 1.5, 2.1, 2.9, 4], 'B': [1.5, 2.05, 3, 4, 5]})
There are two problems in here, a gap and an overlap, which I would like to detect automatically using python pandas. Can someone help me? Thanks. The desired output is:
df = pd.DataFrame({'A': [0, 1.5, 2.1, 2.9, 4], 'B': [1.5, 2.05, 3, 4, 5], 'Validate': [np.nan, np.nan, 'gap', 'over', np.nan]})
IIUC:
s = df.B.shift() - df.A
df['Validate'] = np.select([s > 0, s < 0], ['over', 'gap'], default=np.nan)
Output:
A B Validate
0 0.0 1.50 nan
1 1.5 2.05 nan
2 2.1 3.00 gap
3 2.9 4.00 over
4 4.0 5.00 nan
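Note that the lowercase nan in the output above is actually the string 'nan': with string choices, np.select promotes the float default np.nan to a string. A possible cleanup, if real missing values are wanted:
df['Validate'] = df['Validate'].replace('nan', np.nan)  # turn the 'nan' placeholders back into real NaN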
We can use np.sign from numpy:
df['Validate'] = np.sign(df.B.shift().sub(df.A)).map({1: 'over', -1: 'gap'})
df
Out[150]:
A B Validate
0 0.0 1.50 NaN
1 1.5 2.05 NaN
2 2.1 3.00 gap
3 2.9 4.00 over
4 4.0 5.00 NaN
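Note that np.sign returns 0 where consecutive intervals meet exactly (the previous B equals the next A, as in row 1); 0 has no entry in the map, so those rows correctly stay NaN.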
When there are identical x_values, pivot_table takes their mean (because aggfunc='mean' by default).
For instance:
d = pd.DataFrame(data={
    'x_values': [13.4, 13.08, 12.73, 12., 33., 23., 12.],
    'y_values': [1.54, 1.47, 1., 2., 4., 4., 3.],
    'experiment': ['e', 'e', 'e', 'f', 'f', 'f', 'f']})
print(pd.pivot_table(d, index='x_values',
                     columns='experiment', values='y_values', sort=False))
returns:
experiment e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.5
33.00 NaN 4.0
23.00 NaN 4.0
As you can see, a new value appears in f (2.5, which is the mean of 2.0 and 3.0).
But I want to keep the rows as they were in my DataFrame:
experiment e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.0
33.00 NaN 4.0
23.00 NaN 4.0
12.00 NaN 3.0
How can I do it? I have tried playing with aggfunc=list followed by an explode, but in that case the order is lost. Thanks!
Here's my solution. You don't really want to pivot on x_values (because there are dupes). So add a new unique column (id_col) and pivot on both x_values and id_col. Then you will have to do some cleanup:
(d
 .assign(id_col=range(len(d)))
 .pivot(index=['x_values', 'id_col'], columns='experiment')
 .reset_index()
 .drop(columns='id_col')
 .set_index('x_values')
)
Here's the output:
y_values
experiment e f
x_values
12.00 NaN 2.0
12.00 NaN 3.0
12.73 1.00 NaN
13.08 1.47 NaN
13.40 1.54 NaN
23.00 NaN 4.0
33.00 NaN 4.0
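If the original row order matters (as in the desired output above), a variant of the same idea, sketched here assuming pandas >= 1.1 (where pivot accepts a list of index columns), can sort on the helper column before dropping it; passing values='y_values' also avoids the extra column level seen above:
(d
 .assign(id_col=range(len(d)))   # unique helper column, as above
 .pivot(index=['x_values', 'id_col'], columns='experiment', values='y_values')
 .sort_index(level='id_col')     # restore the original row order
 .reset_index(level='id_col', drop=True)  # drop the helper level
)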
A workaround would be to select the data for each unique experiment value and then concat these selections:
pd.concat([d.loc[d.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c})
           for c in d.experiment.unique()])
Result:
x_values e f
0 13.40 1.54 NaN
1 13.08 1.47 NaN
2 12.73 1.00 NaN
3 12.00 NaN 2.0
4 33.00 NaN 4.0
5 23.00 NaN 4.0
6 12.00 NaN 3.0
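Append .set_index('x_values') to this result if you want x_values as the index, matching the desired output.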
You could also just assign new variables and fill them according to boolean masks:
df = pd.DataFrame(
    data={
        'x_values': [13.4, 13.08, 12.73, 12., 33., 23., 12.],
        'y_values': [1.54, 1.47, 1., 2., 4., 4., 3.],
        'experiment': ['e', 'e', 'e', 'f', 'f', 'f', 'f']
    }
)
df['e'] = df.loc[df['experiment'] == 'e', 'y_values']
df['f'] = df.loc[df['experiment'] == 'f', 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final
-------------------------------------------------
e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.0
33.00 NaN 4.0
23.00 NaN 4.0
12.00 NaN 3.0
-------------------------------------------------
If the experiment column has more than two distinct values, you can iterate over all unique values:
for experiment in df['experiment'].unique():
    df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final
which results in the desired output.
This approach appears to be more efficient than the one provided by @Stef, at the cost of more lines of code.
from time import time

first_approach = []
for i in range(1000):
    start = time()
    pd.concat([df.loc[df.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c})
               for c in df.experiment.unique()]).set_index(['x_values'])
    first_approach.append(time() - start)

second_approach = []
for i in range(1000):
    start = time()
    for experiment in df['experiment'].unique():
        df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
    df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
    second_approach.append(time() - start)

print(f'Average Time First Approach:\t{sum(first_approach)/len(first_approach):.5f}')
print(f'Average Time Second Approach:\t{sum(second_approach)/len(second_approach):.5f}')
--------------------------------------------
Average Time First Approach: 0.00403
Average Time Second Approach: 0.00205
--------------------------------------------
I have a DataFrame that looks like this:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the NaN values. I have tried (df.ffill() + df.bfill()) / 2, but that does not yield the desired output, as it computes all fill values from the original column at once rather than sequentially. I have tried interpolate, but it doesn't work well for non-linear data.
I have seen this answer but did not fully understand it and not sure if it would work.
Update on the computation of the values
I want every NaN value to be the mean of the previous and next non-NaN value. In case there is more than one NaN in sequence, I want to replace them one at a time, recomputing the mean at each step. E.g., given 1, np.nan, np.nan, 4, the first NaN becomes the mean of 1 and 4 (2.5), yielding 1, 2.5, np.nan, 4; the second NaN then becomes the mean of 2.5 and 4, giving 1, 2.5, 3.25, 4.
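(Equivalently, for a run of k consecutive NaNs between known values a and b, the j-th filled value works out to b + (a - b) / 2**j; with a=1, b=4 this gives 4 - 3/2 = 2.5 and 4 - 3/4 = 3.25, matching the example above.)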
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
Inspired by the answer from @ye olde noobe (thanks!):
I've optimized it to make it ≃ 100x faster (times comparison below):
def custom_fillna(s: pd.Series):
    for i in range(len(s)):
        if pd.isna(s[i]):
            last_valid_number = s[s[:i].last_valid_index()] if s[:i].last_valid_index() is not None else 0
            next_valid_number = s[s[i:].first_valid_index()] if s[i:].first_valid_index() is not None else 0
            s[i] = (last_valid_number + next_valid_number) / 2
custom_fillna(df['a'])
df
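To fill every column the same way, the per-column call can simply be repeated, e.g.:
for col in df.columns:
    custom_fillna(df[col])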
Times comparison: (benchmark plot omitted)
Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, like the last row on column a, 0 is used as a replacement):
import pandas as pd

def fill_dynamically(s: pd.Series):
    for i in range(len(s)):
        s[i] = (
            (0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
            (0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
        ) / 2
Use like this for the full dataframe:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
In case you have other columns and don't want to apply it to the whole dataframe, you can of course use it on a single column, like this:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
fill_dynamically(df['a'])
In this case, df looks like this:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0
I have a dataframe that looks like this:
userId movie1 movie2 movie3 movie4 score
0 4.1 2.1 1.0 NaN 2
1 3.1 1.1 3.4 1.4 1
2 2.8 NaN 1.7 NaN 3
3 NaN 5.0 NaN 2.3 4
4 NaN NaN NaN NaN 1
5 2.3 NaN 2.0 4.0 1
I want to subtract the movie scores from each movie so the output would look like this:
userId movie1 movie2 movie3 movie4 score
0 2.1 0.1 -1.0 NaN 2
1 2.1 0.1 2.4 0.4 1
2 -0.2 NaN -2.3 NaN 3
3 NaN 1.0 NaN -1.7 4
4 NaN NaN NaN NaN 1
5 1.3 NaN 1.0 3.0 1
The actual dataframe has thousands of movies, and the movies are referenced by name, so I'm trying to find a solution that works with that.
I should also have mentioned that the movies are not listed in order like ["movie1", "movie2", "movie3"]; they are listed by their titles instead, like ["Star Wars", "Harry Potter", "Lord of the Rings"]. The dataset could be changed, so I won't know what the last movie in the list is.
Use df.filter to identify the movie columns and then subtract the score array from these columns:
In [35]: x = df.filter(like='movie', axis=1).columns.tolist()
In [36]: df[x] = df.filter(like='movie', axis=1) - df.score.values[:, None]
In [37]: df
Out[37]:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
EDIT: when the movie column names are arbitrary, select all columns except 'userId' and 'score':
x = df.columns[~df.columns.isin(['userId', 'score'])]
df[x] = df[x] - df.score.values[:, None]
A possible solution:
import numpy as np
import pandas as pd
df = pd.DataFrame()
df['userId'] = [0 , 1 , 2 , 3 , 4 , 5 ]
df['movie1'] = [4.1 , 3.1, 2.8 , np.nan, np.nan, 2.3 ]
df['movie2'] = [2.1 , 1.1, np.nan, 5.0 , np.nan, np.nan]
df['movie3'] = [1.0 , 3.4, 1.7 , np.nan, np.nan, 2.0 ]
df['movie4'] = [np.nan, 1.4, np.nan, 2.3 , np.nan, 4.0 ]
df['score'] = [2, 1, 3, 4, 5, 6]
print('before = ', df)
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.iloc[:,-1].values, axis='rows')
print('after = ', df)
It should return
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
You can use NumPy broadcasting to subtract here.
v = df.loc[:, 'movie1':'movie4'].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, 'movie1':'movie4'] = out
df
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
If you don't know column names use pd.Index.difference here.
cols = df.columns.difference(['userId', 'score'])
# Every column name is extracted except for 'userId' and 'score'
cols
# Index(['movie1', 'movie2', 'movie3', 'movie4'], dtype='object')
Now, just replace 'movie1':'movie4' with cols.
v = df.loc[:, cols].to_numpy()
s = df['score'].to_numpy()
out = v - s[:, None]
df.loc[:, cols] = out
Solution without using .apply():
df.iloc[:, 1:5] = (
    df.iloc[:, 1:5]
    - df['score'].values.reshape(-1, 1)
)
You can select the columns with iloc if the names of the columns are unknown and use the sub function from pandas to avoid converting to numpy or using apply. I'm assuming value [2,'movie3'] is a typo in your expected output.
df.iloc[:,1:-1] = df.iloc[:,1:-1].sub(df.score, axis=0)
df
Out:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 1
5 5 1.3 NaN 1.0 3.0 1
df.loc[:, "movie1":"movie4"] = df.loc[:, "movie1":"movie4"].apply(
    lambda x: x - df["score"]
)
print(df)
Prints:
userId movie1 movie2 movie3 movie4 score
0 0 2.1 0.1 -1.0 NaN 2
1 1 2.1 0.1 2.4 0.4 1
2 2 -0.2 NaN -1.3 NaN 3
3 3 NaN 1.0 NaN -1.7 4
4 4 NaN NaN NaN NaN 5
5 5 -3.7 NaN -4.0 -2.0 6
When I try to use fillna to replace the NaNs in a column with its mean, the column changes from float64 to object, showing:
bound method Series.mean of 0 NaN\n1
Here is the code:
mean = df['texture_mean'].mean
df['texture_mean'] = df['texture_mean'].fillna(mean)
You cannot use mean = df['texture_mean'].mean: without parentheses this assigns the bound method itself instead of calling it, and fillna then inserts that method object into the column, which is why the dtype becomes object. The following code will work -
df = pd.DataFrame({'texture_mean': [2, 4, None, 6, 1, None], 'A': [1, 2, 3, 4, 5, None]})  # Example
df
A texture_mean
0 1.0 2.0
1 2.0 4.0
2 3.0 NaN
3 4.0 6.0
4 5.0 1.0
5 NaN NaN
df['texture_mean']=df['texture_mean'].fillna(df['texture_mean'].mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 NaN 3.25
If you want to replace the NaNs in every column with that column's respective mean, just do this -
df=df.fillna(df.mean())
df
A texture_mean
0 1.0 2.00
1 2.0 4.00
2 3.0 3.25
3 4.0 6.00
4 5.0 1.00
5 3.0 3.25
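Note that if the frame also contained non-numeric columns, recent pandas (2.0+) would need df = df.fillna(df.mean(numeric_only=True)), since DataFrame.mean no longer skips non-numeric columns silently.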
Let me know if this is what you want.
I have a dataframe with multiple columns and rows.
For all columns, each row's value should equal 0.5 times this row's value plus 0.5 times the previous row's value.
I currently have a loop set up which works, but I feel there is a better way without using a loop. Does anyone have any thoughts?
df_output = df_input.copy()
for i in range(1, df_input.shape[0]):
    try:
        df_output.iloc[[i]] = (df_input.iloc[[i-1]] * (1/2)).values + (df_input.iloc[[i]] * (1/2)).values
    except:
        pass
Do you mean something like this?
First, create some test data:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 20, [5, 3]), columns=['A', 'B', 'C'])
A B C
0 6 19 14
1 10 7 6
2 18 10 10
3 3 7 2
4 1 11 5
Your requested function:
(df*.5).rolling(2).sum()
A B C
0 NaN NaN NaN
1 8.0 13.0 10.0
2 14.0 8.5 8.0
3 10.5 8.5 6.0
4 2.0 9.0 3.5
EDIT:
for an unbalanced (weighted) mean you can define an auxiliary function:
def weighted_mean(arr):
    return sum(arr * [.25, .75])
df.rolling(2).apply(weighted_mean, raw=True)
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
EDIT2:
...and if the weights should be settable at runtime:
def weighted_mean(arr, weights=[.5, .5]):
    return sum(arr * weights / sum(weights))
Without an additional argument, it defaults to the balanced mean:
df.rolling(2).apply(weighted_mean, raw=True)
A B C
0 NaN NaN NaN
1 8.0 13.0 10.0
2 14.0 8.5 8.0
3 10.5 8.5 6.0
4 2.0 9.0 3.5
An unbalanced mean:
df.rolling(2).apply(weighted_mean, raw=True, args=[[.25, .75]])
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
The division by sum(weights) means the weights are not restricted to fractions of one; any ratio works:
df.rolling(2).apply(weighted_mean, raw=True, args=[[1, 3]])
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
df.rolling(window=2, min_periods=1).apply(lambda x: 0.5 * x[0] + 0.5 * x[1] if len(x) > 1 else x[0], raw=True)
This will do the same operation for all columns.
Explanation: for each rolling window, x holds the pair [this_col[i-1], this_col[i]] as a raw ndarray (raw=True makes the positional x[0]/x[1] indexing reliable), so the custom arithmetic is straightforward; min_periods=1 lets the first row, which has no predecessor, pass through unchanged.
Something like below?
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 1)), columns=['a'])
df["cumsum_a"] = 0.5 * df["a"].cumsum() + 0.5 * df["a"]