Retain indexes from a dataframe based on indexes of another dataframe - python

Suppose that I have two dataframes A and B indexed from 0 to 10. I remove a couple of duplicate rows from A so that the indexes 7 and 9 are removed. So now A.index will be [0,1,2,3,4,5,6,8,10].
Now I want to retain exactly the rows in B that have these same indexes; at the moment B's index still runs from 0 to 10. In other words, given that both frames started with the same index, and having dropped a few indexes from A, how do I keep the subset of B's rows whose indexes exactly correspond to the retained rows of A?

I believe you can select by loc:
import pandas as pd
import numpy as np

A = pd.DataFrame({'col': [5, 8, 4, 0, 6, 2, 1, 8, 3, 4, 9]})
B = pd.DataFrame({'col': np.arange(10, 21)})
# print(A)
# print(B)
A1 = A.drop_duplicates('col')
print(A1)
    col
0     5
1     8
2     4
3     0
4     6
5     2
6     1
8     3
10    9
B1 = B.loc[A1.index]
print(B1)
    col
0    10
1    11
2    12
3    13
4    14
5    15
6    16
8    18
10   20
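Note that B.loc[A1.index] raises a KeyError if any retained label is missing from B. If that can happen in your data (it cannot in this example), a safer variant, as a sketch, intersects the indexes first:
# keep only the labels that actually exist in B
B1 = B.loc[A1.index.intersection(B.index)]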

Related

How to fill missing values in a few columns at the same time

I need to fill missing values in a few columns. I wrote this to do it one column at a time:
df2['A'].fillna(df1['A'].mean(), inplace=True)
df2['B'].fillna(df1['B'].mean(), inplace=True)
df2['C'].fillna(df1['C'].mean(), inplace=True)
Any other ways I can fill them all in one line of code?
You can use a single instruction:
cols = ['A', 'B', 'C']
df[cols] = df[cols].fillna(df[cols].mean())
Or, to apply it to all numeric columns, use select_dtypes:
cols = df.select_dtypes('number').columns
df[cols] = df[cols].fillna(df[cols].mean())
Note: I strongly discourage using the inplace parameter; there are long-standing plans to remove it from pandas.
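As a quick self-contained check of the column-subset fill above (df1 and df2 here are hypothetical stand-ins for your frames):
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df2 = pd.DataFrame({'A': [1, np.nan], 'B': [np.nan, 2], 'C': [3, np.nan]})

cols = ['A', 'B', 'C']
# df1[cols].mean() is a Series indexed by column name, so fillna aligns it per column
df2[cols] = df2[cols].fillna(df1[cols].mean())
print(df2)  # the NaNs become A=2.0, B=5.0, C=8.0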
Alternatively, loop over the columns:
for c in df2.columns:
    df2[c] = df2[c].fillna(df1[c].mean())
There are a few options for working with NaNs in a DataFrame. I'll explain some of them...
Given this example df:
   A    B    C
0  1    5    10
1  2    nan  11
2  nan  nan  12
3  4    8    nan
4  nan  9    14
Example 1: fill all columns with mean
df = df.fillna(df.mean())
Result:
   A        B        C
0  1        5        10
1  2        7.33333  11
2  2.33333  7.33333  12
3  4        8        11.75
4  2.33333  9        14
Example 2: fill some columns with median
df[["A","B"]] = df[["A","B"]].fillna(df.median())
Result:
   A  B  C
0  1  5  10
1  2  8  11
2  2  8  12
3  4  8  nan
4  2  9  14
Example 3: fill all columns using ffill()
Explanation: Missing values are replaced with the most recent available value in the same column. So, the value of the preceding row in the same column is used to fill in the blanks.
df = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas
Result:
   A  B  C
0  1  5  10
1  2  5  11
2  2  5  12
3  4  8  12
4  4  9  14
Example 4: fill all columns using bfill()
Explanation: Missing values are replaced with the next non-missing value below them in the same column, so values propagate from the bottom to the top. Basically, you're replacing each missing value with the next known non-missing value.
df = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas
Result:
   A    B  C
0  1    5  10
1  2    8  11
2  4    8  12
3  4    8  14
4  nan  9  14
If you want to DROP (not fill) the missing values, you can do this:
Option 1: remove rows with one or more missing values
df = df.dropna(how="any")
Result:
   A  B  C
0  1  5  10
Option 2: remove rows with all missing values
df = df.dropna(how="all")

How to select data greater than a threshold 3 times in a row in a dataframe - pandas

I want to select residual data only where the values pass the threshold 3 times in a row; my threshold is 3. I have attached the CSV data at the link below. What I currently have is the filter below, but I still need the time criterion: consecutive data are rows that pass the threshold and are sequential in time.
df[df.residual_value >= 3]
Data csv
IIUC, you want to filter the rows that are greater than or equal to 3, but only when 3 consecutive rows match the criterion. You can use rolling + min:
processing:
df[df['col'].rolling(window=3).min().shift(-2).ge(3)]
example dataset:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({'col': np.random.randint(0, 10, 100)})
>>> df.head(15)
    col
0     5
1     0
2     3
3     3
4     7
5     9
6     3
7     5
8     2
9     4
10    7
11    6
12    8
13    8
14    1
output:
    col
2     3
3     3
4     7
5     9
9     4
10    7
11    6
...
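Note that this mask keeps only the rows that start a qualifying window; row 6 above belongs to the window starting at row 4 but is not selected. If you instead want every row that belongs to at least one qualifying window of 3, one sketch along the same lines ORs the start mask with its shifts:
# True at i when rows i, i+1, i+2 all pass the threshold
start = df['col'].rolling(window=3).min().shift(-2).ge(3)
# extend the flag to the second and third row of each qualifying window
mask = start | start.shift(1, fill_value=False) | start.shift(2, fill_value=False)
df[mask]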

Make changes in one Column while keeping the dataframe unchanged

I would like to make changes in 'Column C' of the dataframe (df) while keeping the other values in the dataframe unchanged, using Python.
Condition: if any value in 'Column C' is < 5 & > 15, change it to NaN.
Current Dataframe:
Index Column A Column B Column C
0 6 12 15
1 8 8 2
2 10 14 6
3 9 16 3
4 4 3 7
5 2 18 7
Expected Dataframe Output:
Index Column A Column B Column C
0 6 12 NaN
1 8 8 NaN
2 10 14 6
3 9 16 NaN
4 4 3 7
5 2 18 7
I tried to use the df.apply() method as shown below but it messed up my indexes. Could you please help?
df.loc[df['Column C'].apply(lambda x:np.nan if x <5 and x>15 else x)]
This should work. In general, only use pd.Series.apply where a calculation cannot be vectorised. In addition, you require an or (|) statement, not and, since no value can simultaneously be below 5 and above 15:
df.loc[(df['Column C'] < 5) | (df['Column C'] > 15), 'Column C'] = np.nan
(If 15 itself should also become NaN, as your expected output suggests, use >= 15 in the second condition.)
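An equivalent one-liner, as a sketch, uses Series.where, which by default replaces everything outside the condition with NaN:
# keep values in [5, 15]; everything else becomes NaN
df['Column C'] = df['Column C'].where(df['Column C'].between(5, 15))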

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, use it to look up the value in the same column but in the next row, and add that value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually the first row should be scanned, 4 is identified as the largest number, then 3 is found as the number within the same column but in the next index. Similarly for the row 2, 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argmax:
i = np.arange(len(frame))
j = np.argmax(frame.values, axis=1)           # column index of each row's max
frame['next'] = frame.shift(-1).values[i, j]  # that column's value, one row down
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
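If you instead want the looked-up value to land on the following row, as in the expected output at the top of the question, a small variation of the same idea shifts the placement down by one:
vals = frame[list('abcd')].values              # original columns only
j = np.argmax(vals, axis=1)                    # column of each row's max
below = vals[np.arange(1, len(vals)), j[:-1]]  # value directly below each max
frame['Next'] = np.concatenate(([np.nan], below))
# Next is now [NaN, 3.0, 11.0]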

Select particular rows from inside groups in pandas dataframe

Suppose I have a dataframe that looks like this:
group level
0 1 10
1 1 10
2 1 11
3 2 5
4 2 5
5 3 9
6 3 9
7 3 9
8 3 8
The desired output is this:
group level
0 1 10
5 3 9
Namely, this is the logic: look inside each group, if there is more than 1 distinct value present in the level column, return the first row in that group. For example, no row from group 2 is selected, because the only value present in the level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I attempted was combining groupby statements with building sets from the entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby and using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None

df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to make any huge changes to get the last row instead.
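Concretely, the last-row variant only swaps the index lookup; a sketch with a hypothetical get_last_val:
def get_last_val(group):
    if len(group['level'].unique()) >= 2:
        return group['level'].loc[group['level'].last_valid_index()]
    return None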
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
group level col1 col2
0 1 10 19 21
1 1 10 18 24
2 1 11 14 23
3 2 5 14 26
4 2 5 10 22
5 3 9 13 27
6 3 9 16 20
7 3 9 18 26
8 3 8 11 2
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None

df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
group level col1 col2
group
1 1 10 19 21
3 3 9 13 27
This would be simpler:
In [121]:
print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
level
group
1 10
3 9
For each group, if any of the values are not the same as the first value, aggregate that group into its first value; otherwise, aggregate it to nan.
Finally, dropna().
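Another compact way to express the same logic, as a sketch, flags groups with more than one distinct level via transform('nunique') and then picks the first (or last) row per group:
# True for rows whose group has more than one distinct level
m = df.groupby('group')['level'].transform('nunique') > 1
df[m].groupby('group').head(1)  # first row of each qualifying group
df[m].groupby('group').tail(1)  # last row instead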
