How can I delete consecutive duplicate rows in a pandas DataFrame so that at most two consecutive rows keep the same value? In the example below, three of the five consecutive rows with the value 4 should be removed.
Consider the following code:
import pandas as pd
df = pd.DataFrame({
    'rating': [4, 4, 3.5, 15, 5, 4, 4, 4, 4, 4]
})
print(df)
rating
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
7 4.0
8 4.0
9 4.0
I would like to get the following output, with three of the five consecutive rows containing the value 4 removed (keeping only the first two of each run):
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
First create a new group id each time the value changes, then use GroupBy.head to keep the first two rows of each group:
new_df = df.groupby(df['rating'].ne(df['rating'].shift()).cumsum()).head(2)
print(new_df)
rating
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
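To see how the grouper works, it helps to print the intermediate group ids (a quick illustration):
groups = df['rating'].ne(df['rating'].shift()).cumsum()
print(groups.tolist())
# [1, 1, 2, 3, 4, 5, 5, 5, 5, 5] -> a new id starts whenever the value changes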
Or use GroupBy.cumcount as a per-group counter and filter rows with boolean indexing:
# keep rows whose position within the consecutive group is less than 2 (counting from 0)
df = df[df.groupby(df['rating'].ne(df['rating'].shift()).cumsum()).cumcount().lt(2)]
print(df)
rating
0 4.0
1 4.0
2 3.5
3 15.0
4 5.0
5 4.0
6 4.0
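Printing the counter first (computed on the original df) makes the filter easy to follow (a quick illustration):
counter = df.groupby(df['rating'].ne(df['rating'].shift()).cumsum()).cumcount()
print(counter.tolist())
# [0, 1, 0, 0, 0, 0, 1, 2, 3, 4] -> .lt(2) keeps only the first two rows of each run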
Related
I have a DataFrame with 15 columns and 5000 rows.
In the DataFrame there are 4 columns that contain NaN values. I would like to replace the values with the median.
As there are several columns, I would like to do this via a for-loop.
These are the column numbers: 1, 5, 8, 9.
Each column's NaN values should be replaced with that column's own median.
I tried:
for i in [1, 5, 8, 9]:
    df[i] = df[i].fillna(df[i].transform('median'))
No need for a loop; use a vectorized approach:
out = df.fillna(df.median())
Or, to limit it to specific column names:
cols = [1, 5, 8, 9]
# or automatic selection of the columns that contain NaNs
# cols = df.columns[df.isna().any()]
out = df.fillna(df[cols].median())
Or, with positional indices:
col_pos = [1, 5, 8, 9]
out = df.fillna(df.iloc[:, col_pos].median())
Output:
0 1 2 3 4 5 6 7 8 9
0 9 7.0 1 3.0 5.0 7 3 6.0 6.0 7
1 9 1.0 9 6.0 4.5 3 8 4.0 1.0 4
2 5 3.5 3 1.0 4.0 4 4 3.5 3.0 8
3 4 6.0 9 3.0 3.0 2 1 2.0 1.0 3
4 4 1.0 1 3.0 7.0 8 4 3.0 5.0 6
Example input used:
0 1 2 3 4 5 6 7 8 9
0 9 7.0 1 3.0 5.0 7 3 6.0 6.0 7
1 9 1.0 9 6.0 NaN 3 8 4.0 1.0 4
2 5 NaN 3 1.0 4.0 4 4 NaN NaN 8
3 4 6.0 9 3.0 3.0 2 1 2.0 1.0 3
4 4 1.0 1 NaN 7.0 8 4 3.0 5.0 6
You can simply do:
df[[1,5,8,9]] = df[[1,5,8,9]].fillna(df[[1,5,8,9]].median())
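For completeness, the original loop idea also works once plain Series.median is used inside fillna (a minimal corrected sketch):
for i in [1, 5, 8, 9]:
    df[i] = df[i].fillna(df[i].median())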
I have a dataframe df
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, 1, 2, np.nan, 2, np.nan, np.nan],
                   'B': [10, np.nan, np.nan, 5, np.nan, np.nan, 7],
                   'C': [1, 1, 2, 2, 3, 3, 3]})
which looks like:
A B C
0 NaN 10.0 1
1 1.0 NaN 1
2 2.0 NaN 2
3 NaN 5.0 2
4 2.0 NaN 3
5 NaN NaN 3
6 NaN 7.0 3
I want to replace all the NaN values in columns A and B with values from other records in the same group, as given by column C.
My expected output is:
A B C
0 1.0 10.0 1
1 1.0 10.0 1
2 2.0 5.0 2
3 2.0 5.0 2
4 2.0 7.0 3
5 2.0 7.0 3
6 2.0 7.0 3
How can I do the same with a pandas DataFrame?
Use GroupBy.apply with forward and back filling of missing values:
df[['A','B']] = df.groupby('C')[['A','B']].apply(lambda x: x.ffill().bfill())
print(df)
A B C
0 1.0 10.0 1
1 1.0 10.0 1
2 2.0 5.0 2
3 2.0 5.0 2
4 2.0 7.0 3
5 2.0 7.0 3
6 2.0 7.0 3
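A variant without apply, which should be equivalent here, uses the group-wise fill methods directly:
filled = df.groupby('C')[['A', 'B']].ffill()  # forward fill within each group
df[['A', 'B']] = filled.groupby(df['C']).bfill()  # then back fill within each group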
Given a data frame with missing values at the end of a column, e.g.:
df = pd.DataFrame({'a': [np.nan, 1, 2, np.nan, np.nan, 5, np.nan, np.nan]}, index=[0, 1, 2, 3, 4, 5, 6, 7])
a
0 NaN
1 1.0
2 2.0
3 NaN
4 NaN
5 5.0
6 NaN
7 NaN
Using the 'index' interpolation method:
df.interpolate(method='index')
returns the data frame with the trailing missing values forward-filled:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 5.0
7 5.0
Is there a way to turn off that behaviour and leave the trailing missing values as NaN:
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
One option is the limit and limit_direction parameters:
df = df.interpolate(method='index', limit=1, limit_direction='backward')
print(df)
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
EDIT: If you want to fill only the NaNs between valid values, use the limit_area parameter (new in pandas 0.23.0):
df = df.interpolate(method='index', limit_area='inside')
print(df)
a
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
5 5.0
6 NaN
7 NaN
Do you mean that the trailing NaNs (one or more) should remain? How about this:
find the position of the last valid value, split there, interpolate the first part, and concatenate the rest back.
# index position of the last non-NaN entry in the flattened values
valargmax = np.max(np.where(df.notna().values.flatten())[0])
r = pd.concat([df[0:(valargmax + 1)].interpolate(method='index'), df[(valargmax + 1):]])
print(r)
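An arguably simpler way to find the split point is Series.last_valid_index (same idea, a small sketch):
pos = df['a'].last_valid_index()  # label of the last non-NaN row
r = pd.concat([df.loc[:pos].interpolate(method='index'), df.loc[pos + 1:]])
print(r)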
I want to get the min value of a column by comparing the value in the current row with the values in the following 2 rows. I know this can be done by creating 2 columns with shift(-1) and shift(-2) and taking the min of each row, but is there a better way if I extend the range from 2 rows to n rows?
For example, in the dataset below:
df = pd.DataFrame([12, 11, 4, 15, 6], columns=['score'])
>>> df
score
0 12
1 11
2 4
3 15
4 6
Create new columns prv_score_1 and prv_score_2 with the shifted values:
>>> df['prv_score_1'] = df['score'].shift(-1)
>>> df['prv_score_2'] = df['score'].shift(-2)
>>> df
score prv_score_1 prv_score_2
0 12 11.0 4.0
1 11 4.0 15.0
2 4 15.0 6.0
3 15 6.0 NaN
4 6 NaN NaN
Create a Minimum column holding the minimum of each row:
>>> df['Minimum'] = df.min(1)
>>> df
score prv_score_1 prv_score_2 Minimum
0 12 11.0 4.0 4.0
1 11 4.0 15.0 4.0
2 4 15.0 6.0 4.0
3 15 6.0 NaN 6.0
4 6 NaN NaN 6.0
Is there a better way?
You need a rolling min with window 3 on the reversed column, i.e.:
df['new'] = df['score'][::-1].rolling(3,min_periods=1).min()[::-1]
score new
0 12.0 4.0
1 11.0 4.0
2 4.0 4.0
3 15.0 6.0
4 6.0 6.0
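Since the question asks about extending the range to n rows, the window just becomes n + 1 (the current row plus the n following rows); a small sketch:
n = 2  # number of following rows to compare against
df['new'] = df['score'][::-1].rolling(n + 1, min_periods=1).min()[::-1]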
You can check the rolling function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
In your case, this will do the trick:
df[::-1].rolling(3, min_periods=1).min()[::-1]
You can achieve this with rolling.min(). For example, with window size 2, use:
df.rolling(2).min()
Then change 2 to n for the general case. Note that rolling windows look backward over previous rows; for the look-ahead version in the question you still need the [::-1] reversal shown above.
Let's say I have a pandas DataFrame:
x = [[1, 2, 8, 7, 9], [1, 3, 5.6, 4.5, 4], [2, 3, 4.5, 5, 5]]
df = pd.DataFrame(x, columns=['id1', 'id2', 'val1', 'val2', 'val3'])
id1 id2 val1 val2 val3
1 2 8.0 7.0 9
1 3 5.6 4.5 4
2 3 4.5 5.0 5
I want val1, val2, and val3 in one column, with id1 and id2 as grouping variables. I can use this extremely convoluted code:
dfT = df.iloc[:, 2:].T.reset_index(drop=True)
n_points = dfT.shape[0]
final = pd.DataFrame()
for i in range(0, df.shape[0]):
    data = np.asarray([[df.loc[i, 'id1']] * n_points,
                       [df.loc[i, 'id2']] * n_points,
                       dfT.iloc[:, i].values]).T
    temp = pd.DataFrame(data, columns=['id1', 'id2', 'val'])
    final = pd.concat([final, temp], axis=0)
to get my dataframe into the correct format:
id1 id2 val
0 1.0 2.0 8.0
1 1.0 2.0 7.0
2 1.0 2.0 9.0
0 1.0 3.0 5.6
1 1.0 3.0 4.5
2 1.0 3.0 4.0
0 2.0 3.0 4.5
1 2.0 3.0 5.0
2 2.0 3.0 5.0
but there must be a more efficient way of doing this, since on a large dataframe this takes way too long.
Suggestions?
You can use melt and then drop the variable column:
print (pd.melt(df, id_vars=['id1','id2'], value_name='val')
.drop('variable', axis=1))
id1 id2 val
0 1 2 8.0
1 1 3 5.6
2 2 3 4.5
3 1 2 7.0
4 1 3 4.5
5 2 3 5.0
6 1 2 9.0
7 1 3 4.0
8 2 3 5.0
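If you want the rows grouped by the id pair, as in the desired output, append a sort (a minor variant):
out = (pd.melt(df, id_vars=['id1', 'id2'], value_name='val')
         .drop('variable', axis=1)
         .sort_values(['id1', 'id2'], kind='stable')  # stable sort keeps val1..val3 order within each pair
         .reset_index(drop=True))
print(out)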
Another solution with set_index and stack:
print (df.set_index(['id1','id2'])
.stack()
.reset_index(level=2, drop=True)
.reset_index(name='val'))
id1 id2 val
0 1 2 8.0
1 1 2 7.0
2 1 2 9.0
3 1 3 5.6
4 1 3 4.5
5 1 3 4.0
6 2 3 4.5
7 2 3 5.0
8 2 3 5.0
There's an even simpler option using lreshape (though it's not officially documented):
pd.lreshape(df, {'val': ['val1', 'val2', 'val3']}).sort_values(['id1', 'id2'])