I want to delete values based on their relative rank within their column. Specifically, I want to isolate the X highest and X lowest values within several columns. So if X=2 and my dataframe looks like this:
ID Val1 Val2 Val3
001 2 8 14
002 10 15 8
003 3 1 20
004 11 11 7
005 14 4 19
The output should look like this:
ID Val1 Val2 Val3
001 2 NaN NaN
002 NaN 15 8
003 3 1 20
004 11 11 7
005 14 4 19
I know that I can make a sub-table to isolate the high and low rank using:
df = df.sort_values('Column Name')  # 'sort' was removed from pandas; sort_values is the current method
df2 = df.head(X)  # OR: df.tail(X)
And I figure I clear these sub-tables of the values from other columns using:
df2['Other Column'] = np.nan
df2['Other Column B'] = np.nan
Then merge the sub-tables back together in a way that replaces NaN values when there is data in one of the tables. I tried:
df2.update(df3) # df3 is a sub-table made the same way as df2 using a different column
which only updated rows already present in df2.
I tried:
out = pd.merge(df2, df3, how='outer')
which gave me separate rows when a row appeared in both df2 and df3.
I tried:
out = df2.combine_first(df3)
which in some cases overwrote numerical values with NaN values, making it unsuitable.
There must be a way to do this: I want the original dataframe with NaN plugged in wherever a value is not among the X highest or X lowest values in its column.
Interesting question. You can get the index of each value within the sorted values of its column (the mask DataFrame below), and then keep only the values whose index falls within your defined boundary.
In [98]:
print(df)
Val1 Val2 Val3
ID
1 2 8 14
2 10 15 8
3 3 1 20
4 11 11 7
5 14 4 19
In [99]:
mask = df.apply(lambda x: np.searchsorted(sorted(x), x))  # index of each value in its sorted column
print(mask)
Val1 Val2 Val3
ID
1 0 2 2
2 2 4 1
3 1 0 4
4 3 3 0
5 4 1 3
In [100]:
print((mask <= 1) | (mask >= (len(mask) - 2)))
Val1 Val2 Val3
ID
1 True False False
2 False True True
3 True True True
4 True True True
5 True True True
In [101]:
print(df.where((mask <= 1) | (mask >= (len(mask) - 2))))
Val1 Val2 Val3
ID
1 2 NaN NaN
2 NaN 15 8
3 3 1 20
4 11 11 7
5 14 4 19
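For reference, the same masking generalizes over any cutoff with DataFrame.rank; a minimal sketch, assuming df holds only the value columns and x (an illustrative name) is the number of values to keep at each end:
import pandas as pd

x = 2
ranks = df.rank(method='first')  # 1-based rank of each value within its column
keep = (ranks <= x) | (ranks > len(df) - x)  # lowest x or highest x
print(df.where(keep))  # everything else becomes NaN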
I would like to convert closest values of a column (col2 in the below) to the same value (say the largest one). Suppose the following dataframe:
df = pd.DataFrame({"col1":[0,1,2,3,4,5,6],"col2":[1,5,6,10,12,14,17]})
col1 col2
0 0 1
1 1 5
2 2 6
3 3 10
4 4 12
5 5 14
6 6 17
Given column col2 and a closeness threshold of 2: the difference between 5 and 6 is within the threshold, so both become the same value, i.e. 6. Values 1 and 17 are far from the rest of the values in col2, so they stay unchanged. The differences between 10, 12 and 14 are within the threshold, so change them all to 14. (Why I need this: when converting an image to text using pytesseract.image_to_data, the top coordinates of the text differ slightly, and I want to fix those coordinates so they share the same value.)
The final output given col2 and closeness threshold of 2 will be:
col1 col2
0 0 1
1 1 6
2 2 6
3 3 14
4 4 14
5 5 14
6 6 17
Your help is much appreciated!
If values are sorted like in sample data use:
df['col2'] = df['col2'].mask(df['col2'].diff(-1).abs().le(2)).bfill()  # hide values whose gap to the next row is <= 2, then backfill from below
print(df)
col1 col2
0 0 1.0
1 1 6.0
2 2 6.0
3 3 14.0
4 4 14.0
5 5 14.0
6 6 17.0
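If the values are not pre-sorted, one possible generalization is to sort first, start a new cluster wherever the gap to the previous value exceeds the threshold, and broadcast each cluster's maximum back; a sketch under that assumption:
import pandas as pd

df = pd.DataFrame({"col1": [0, 1, 2, 3, 4, 5, 6],
                   "col2": [1, 5, 6, 10, 12, 14, 17]})

s = df['col2'].sort_values()
cluster = s.diff().gt(2).cumsum()  # new cluster when the gap exceeds the threshold
df['col2'] = s.groupby(cluster).transform('max')  # assignment aligns on the index
print(df)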
Please help.
My first table looks like this:
id val1 val2
0 4 30
1 5 NaN
2 3 10
3 2 8
4 3 NaN
My second table looks like this:
id val1 val2_estimate
0 1 8
1 2 12
2 3 13
3 4 16
4 5 22
I want to replace the NaN in the 1st table with estimated values from the val2_estimate column of the 2nd table where the val1 values match. val1 in the 2nd table is unique. The end result needs to look like this:
id val1 val2
0 4 30
1 5 22
2 3 10
3 2 8
4 3 13
I want to replace NaN values only.
Use merge to get the corresponding estimate from df2 for each row of df, then use fillna:
df['val2'] = df['val2'].fillna(
    df.merge(df2, on=['val1'], how='left')['val2_estimate'])
df
id val1 val2
0 0 4 30.0
1 1 5 22.0
2 2 3 10.0
3 3 2 8.0
4 4 3 13.0
Many ways to skin a cat, this is one of them.
Use fillna with map from a pd.Series created using set_index:
df['val2'] = df['val2'].fillna(df['val1'].map(df2.set_index('val1')['val2_estimate']))
df
Output:
val1 val2
id
0 4 30.0
1 5 22.0
2 3 10.0
3 2 8.0
4 3 13.0
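For a self-contained run, a sketch that rebuilds both tables from the sample data above (the names df and df2 follow the answers; the id column is left as the default index):
import numpy as np
import pandas as pd

df = pd.DataFrame({'val1': [4, 5, 3, 2, 3],
                   'val2': [30, np.nan, 10, 8, np.nan]})
df2 = pd.DataFrame({'val1': [1, 2, 3, 4, 5],
                    'val2_estimate': [8, 12, 13, 16, 22]})

# map builds a val1 -> val2_estimate lookup; fillna only touches the NaNs
df['val2'] = df['val2'].fillna(df['val1'].map(df2.set_index('val1')['val2_estimate']))
print(df)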
I have a data frame like this:
df
col1 col2
1 A
2 A
3 B
4 C
5 C
6 C
7 B
8 B
9 A
Now we can see that there are consecutive runs of A, B and C. I want to keep only the rows where a run starts; the other values of the same run should become NaN.
The final data frame I am looking for will look like,
df
col1 col2
1 A
2 NA
3 B
4 C
5 NA
6 NA
7 B
8 NA
9 A
I can do it using a for loop and comparisons, but the execution time would be high. I am looking for a pythonic way to do it, perhaps some pandas shortcut.
Compare each value to the previous one with Series.shift, and set the repeats to missing with Series.where or numpy.where:
df['col2'] = df['col2'].where(df['col2'].ne(df['col2'].shift()))
#alternative
#df['col2'] = np.where(df['col2'].ne(df['col2'].shift()), df['col2'], np.nan)
Or use DataFrame.loc with the condition inverted by ~:
df.loc[~df['col2'].ne(df['col2'].shift()), 'col2'] = np.nan
Or, thanks to @Daniel Mesejo, use eq for ==:
df.loc[df['col2'].eq(df['col2'].shift()), 'col2'] = np.nan
print(df)
col1 col2
0 1 A
1 2 NaN
2 3 B
3 4 C
4 5 NaN
5 6 NaN
6 7 B
7 8 NaN
8 9 A
Detail:
print(df['col2'].ne(df['col2'].shift()))
0 True
1 False
2 True
3 True
4 False
5 False
6 True
7 False
8 True
Name: col2, dtype: bool
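The same shift comparison is also the usual trick for labeling the runs themselves, in case a run id is ever needed; a one-line sketch (run_id is an illustrative name):
run_id = df['col2'].ne(df['col2'].shift()).cumsum()  # 1, 1, 2, 3, 3, 3, 4, 4, 5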
I'm currently learning Python and pandas (this question is based on a previous post, but with an additional query); at the moment I have 2 columns containing numeric sequences (ascending and/or descending), as described below:
Col 1: (col1 numeric increment and/or decrement = 1)
1
2
3
5
7
8
9
Col 2: (Col2 numeric increment and/or decrement = 4)
113
109
105
90
94
98
102
I need to extract the numeric ranges from both columns and print them according to where a sequence break occurs in either of those 2 columns. The result should be as follows:
1,3,105,113
5,5,90,90
7,9,94,102
I already received a very useful way to do it using Python's pandas library from @MaxU, where it generates the numeric ranges based on the breaks detected in both columns, using a criterion of col1 and col2 increasing and/or decreasing by 1:
How can I extract numeric ranges from 2 columns and print the range from both columns as tuples?
The only difference in this case is that the increment/decrement criterion is different for each column.
Try this:
In [42]: df
Out[42]:
Col1 Col2
0 1 113
1 2 109
2 3 105
3 5 90
4 7 94
5 8 98
6 9 102
In [43]: df.groupby(df.diff().abs().ne([1,4]).any(axis=1).cumsum()).agg(['min','max'])
Out[43]:
Col1 Col2
min max min max
1 1 3 105 113
2 5 5 90 90
3 7 9 94 102
Explanation: our goal is to group the rows whose increment/decrement is [1, 4] for Col1 and Col2 respectively:
In [44]: df.diff().abs()
Out[44]:
Col1 Col2
0 NaN NaN
1 1.0 4.0
2 1.0 4.0
3 2.0 15.0
4 2.0 4.0
5 1.0 4.0
6 1.0 4.0
In [45]: df.diff().abs().ne([1,4])
Out[45]:
Col1 Col2
0 True True
1 False False
2 False False
3 True True
4 True False
5 False False
6 False False
In [46]: df.diff().abs().ne([1,4]).any(axis=1)
Out[46]:
0 True
1 False
2 False
3 True
4 True
5 False
6 False
dtype: bool
In [47]: df.diff().abs().ne([1,4]).any(axis=1).cumsum()
Out[47]:
0 1
1 1
2 1
3 2
4 3
5 3
6 3
dtype: int32
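Since the question asked for printed lines such as 1,3,105,113, a sketch that flattens the aggregated frame into that form (out is an illustrative name):
out = df.groupby(df.diff().abs().ne([1, 4]).any(axis=1).cumsum()).agg(['min', 'max'])
for _, row in out.iterrows():
    # column order is Col1 min, Col1 max, Col2 min, Col2 max
    print(','.join(str(v) for v in row))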
Suppose I have a dataframe that looks like this:
group level
0 1 10
1 1 10
2 1 11
3 2 5
4 2 5
5 3 9
6 3 9
7 3 9
8 3 8
The desired output is this:
group level
0 1 10
5 3 9
Namely, this is the logic: look inside each group; if there is more than 1 distinct value present in the level column, return the first row of that group. For example, no row from group 2 is selected, because the only value present in the level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I attempted was combining groupby statements with creating sets from the entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby and using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None
df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to make any huge changes to get the last row instead.
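A minimal sketch of that last-row variant (get_last_val is an illustrative name):
def get_last_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].last_valid_index()]
    else:
        return None

df.groupby('group').apply(get_last_val).dropna()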
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
group level col1 col2
0 1 10 19 21
1 1 10 18 24
2 1 11 14 23
3 2 5 14 26
4 2 5 10 22
5 3 9 13 27
6 3 9 16 20
7 3 9 18 26
8 3 8 11 2
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None
df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
group level col1 col2
group
1 1 10 19 21
3 3 9 13 27
This would be simpler:
In [121]:
print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
level
group
1 10
3 9
For each group, if any of the values differ from the first value, aggregate that group into its first value; otherwise, aggregate it to NaN. Finally, dropna() removes the all-identical groups.
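For comparison, a vectorized sketch without apply, assuming the same df: keep the groups whose level column has more than one distinct value, then take each group's first (or last) row.
# flag rows belonging to groups with more than one distinct 'level' value
mixed = df.groupby('group')['level'].transform('nunique') > 1
first_rows = df[mixed].groupby('group').head(1)  # first row per qualifying group
last_rows = df[mixed].groupby('group').tail(1)   # last row per qualifying group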