Suppose we have a pandas DataFrame f defined as follows. I am trying to create a mask to select all rows with value 'a' or 'b' in column 'xx' (I would like to select rows 0, 1, 3, and 4).
import pandas as pd

f = pd.DataFrame([['a', 'b', 'c', 'a', 'b', 'c'], ['1', '2', '3', '4', '5', '6']])
f = f.transpose()
f.columns = ['xx', 'yy']
f
xx yy
0 a 1
1 b 2
2 c 3
3 a 4
4 b 5
5 c 6
Is there any elegant way to do this in pandas?
I know that to select all rows with f.xx == 'a', we can do f[f.xx == 'a'], but I have not figured out how to select rows where f.xx is either 'a' or 'b'. Thanks.
You could use isin:
print(f[f["xx"].isin(["a", "b"])])
Which will give you:
xx yy
0 a 1
1 b 2
3 a 4
4 b 5
If you really wanted a mask, you could combine the comparisons with the OR operator |:
mask = (f["xx"] == "a") | (f["xx"] == "b")
print(f[mask])
Which will give you the same output:
xx yy
0 a 1
1 b 2
3 a 4
4 b 5
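As a side note (not part of the original answers), the same selection can also be written with DataFrame.query:
print(f.query("xx in ['a', 'b']"))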
I'm trying to drop rows from a df where certain conditions are met. Using the code below, I'm grouping values using column C. For each unique group, I want to drop ALL rows where A is less than 1 AND B is greater than 100. Both conditions have to occur on the same row, though. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
'A' : [1,0,1,0,1,0,0,1,0,1],
'B' : [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
'C' : ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))  # raises TypeError: filter expects a scalar bool, not a Series
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition, then filter the original column C by Series.isin in boolean indexing with an inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
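Unpacking that one-liner with the same logic: the inner loc call collects the C values of the offending rows, and isin then flags every row belonging to those groups:
offending = df.loc[df['A'].lt(1) & df['B'].gt(100), 'C']
print(offending.tolist())  # ['e'] - the groups to drop
df1 = df[~df['C'].isin(offending)]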
Another idea is to use GroupBy.transform with 'any' to test whether at least one row in each group matches:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
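As a minimal illustration of what the transform line does on this data: the row-wise condition is True only on row 5, and transform('any') broadcasts that result back to every row of group 'e':
bad_row = df['A'].lt(1) & df['B'].gt(100)              # True only on row 5
bad_group = bad_row.groupby(df['C']).transform('any')  # True on all 'e' rows
print(df[~bad_group])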
Your solution is possible if you reduce the per-group mask to a scalar with any and negate it with not (or, by De Morgan's law, require all rows to satisfy the inverted condition); on a large DataFrame it will be slow:
df1 = df.groupby(['C']).filter(lambda g: not (g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print(df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
My goal is to get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match column A from df_1 with column B from df_2. Columns A and B repeat their content after reaching 4. The order matters here; because of that, the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = pd.merge_asof(
    left=df_1, right=df_2,
    left_on='idx', right_on='idx',      # align on position
    left_by='A', right_by='B',          # only rows where A equals B are candidates
    direction='backward', tolerance=2,  # nearest earlier (or equal) match, at most 2 positions back
).dropna().drop(labels='idx', axis='columns').reset_index(drop=True)
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
left_on=['idx', 'A'], right_on=['idx', 'B'])
I have a df,
name_id name
1 a
2 b
2 b
3 c
3 c
3 c
Now I want to group by name_id and assign -1 to rows in groups whose length is 1 (i.e. fewer than 2 rows). My current approach:
one_occurrence_indices = df.groupby('name_id').filter(lambda x: len(x) == 1).index.tolist()
for index in one_occurrence_indices:
    df.loc[index, 'name_id'] = -1
I am wondering what the best way to do this is, so that the resulting df is:
name_id name
-1 a
2 b
2 b
3 c
3 c
3 c
Use transform with loc:
df.loc[df.groupby('name_id')['name_id'].transform('size') == 1, 'name_id'] = -1
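To see why this works, transform('size') returns a Series aligned with the original index, giving every row the size of its group (shown for the sample df):
print(df.groupby('name_id')['name_id'].transform('size'))
0    1
1    2
2    2
3    3
4    3
5    3
Name: name_id, dtype: int64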
An alternative is numpy.where:
import numpy as np

df['name_id'] = np.where(df.groupby('name_id')['name_id'].transform('size') == 1,
                         -1, df['name_id'])
print(df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Also, if you want to test for duplicates, use duplicated:
df['name_id'] = np.where(df.duplicated('name_id', keep=False), df['name_id'], -1)
Use multiplication by a -1/1 flag (note this assumes the ids are positive):
df.name_id *= (df.groupby('name_id').name.transform(len) == 1).map({True: -1, False: 1})
df
Out[50]:
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
Using pd.Series.mask:
lens = df.groupby('name_id')['name'].transform(len)
df['name_id'].mask(lens < 2, -1, inplace=True)
print(df)
name_id name
0 -1 a
1 2 b
2 2 b
3 3 c
4 3 c
5 3 c
I have two Series (df1 and df2) of equal length, which need to be combined into one DataFrame column as follows. Each index has only one value or no value, but never two values, so there are no duplicates (e.g. if df1 has the value 'A' at index 0, then df2 is empty at index 0, and vice versa).
df1 =  c1        df2 =  c2
0      A         0
1      B         1
2                2      C
3      D         3
4      E         4
5                5      F
6                6
7      G         7
The result I want is this:
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
I have tried .concat, .append and .union, but these do not produce the desired result. What is the correct approach then?
You can try this; it works here because the gaps are empty strings, so concatenating with '' is a no-op:
df1['new'] = df1['c1'] + df2['c2']
For an in-place solution, I recommend pd.Series.replace:
df1['c1'].replace('', df2['c2'], inplace=True)
print(df1)
c1
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
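If the gaps are NaN rather than empty strings (an assumption; the printed frames don't show which), Series.combine_first is a natural alternative: it keeps the caller's values and fills the missing ones from the argument:
df1['c1'] = df1['c1'].combine_first(df2['c2'])  # sketch assuming NaN gaps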
Suppose I create a pandas DataFrame with two columns, one of which contains some numbers and the other contains letters. Like this:
import pandas as pd
from pprint import pprint
df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
pprint(df)
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Now say that I want to make a third column (c) whose value is equal to the last value of a when b was equal to x. Where no 'x' has been encountered in b yet, the value in c should default to 0.
The procedure should produce pretty much the following result:
last_a = 0
c = []
for i, b in enumerate(df['b']):
    if b == 'x':
        last_a = df.iloc[i]['a']
    c += [last_a]
df['c'] = c
pprint(df)
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
Is there a more elegant way to accomplish this either with or without pandas?
In [140]: df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['y','x','y','x','y', 'y']})
In [141]: df
Out[141]:
a b
0 1 y
1 2 x
2 3 y
3 4 x
4 5 y
5 6 y
Find out where column 'b' == 'x', then take the value of column 'a' on those rows; assigning that selection leaves NaN everywhere else:
In [142]: df['c'] = df.loc[df['b'] == 'x', 'a']
Fill the rest of the values forward, then fill the remaining holes with 0:
In [143]: df['c'] = df['c'].ffill().fillna(0)
In [144]: df
Out[144]:
a b c
0 1 y 0
1 2 x 2
2 3 y 2
3 4 x 4
4 5 y 4
5 6 y 4
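A compact variant of the same idea (not from the original answer) uses Series.where to blank out the non-'x' rows before forward-filling:
df['c'] = df['a'].where(df['b'] == 'x').ffill().fillna(0).astype(int)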