Python - Filter DataFrame by Sub DataFrame

I have a DataFrame like this:
A B C
1 1 2 3
2 4 5 6
3 7 8 9
And I want to filter it by a DataFrame that holds a subset of its columns:
A C
1 4 6
2 7 9
In the end, I want to get this output:
A B C
2 4 5 6
3 7 8 9
How can I do this?

In [92]: d1.merge(d2)
Out[92]:
A B C
0 4 5 6
1 7 8 9
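Note that merge returns a fresh 0-based index, whereas the desired output in the question keeps the original labels 2 and 3. A minimal sketch of one way to keep them, assuming the frames are named d1 and d2 as in the answer above and that the rows of d2 are unique. The helper name filter_by_sub and the temporary column name are made up for illustration:

```python
import pandas as pd

# Frames rebuilt from the question; the index labels are taken from its printout.
d1 = pd.DataFrame({"A": [1, 4, 7], "B": [2, 5, 8], "C": [3, 6, 9]}, index=[1, 2, 3])
d2 = pd.DataFrame({"A": [4, 7], "C": [6, 9]}, index=[1, 2])

def filter_by_sub(full, sub):
    """Keep the rows of `full` whose values match a row of `sub`, preserving the index."""
    idx_name = "_orig_index"  # hypothetical temp column; must not clash with real columns
    # Move the index into a column so it survives the merge, then restore it.
    merged = (
        full.rename_axis(idx_name)
        .reset_index()
        .merge(sub, on=list(sub.columns))
    )
    return merged.set_index(idx_name).rename_axis(None)

result = filter_by_sub(d1, d2)
print(result)
```

If the 0-based index is acceptable, the plain d1.merge(d2) is shorter.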

Related

Update values in dataframe based on dictionary and condition

I have a dataframe and a dictionary that contains some of the columns of the dataframe and some values. I want to update the dataframe based on the dictionary values, and pick the higher value.
>>> df1
a b c d e f
0 4 2 6 2 8 1
1 3 6 7 7 8 5
2 2 1 1 6 8 7
3 1 2 7 3 3 1
4 1 7 2 6 7 6
5 4 8 8 2 2 1
and the dictionary is
compare = {'a':4, 'c':7, 'e':3}
So I want to check the values in columns ['a','c','e'] and replace with the value in the dictionary, if it is higher.
What I have tried is this:
comp = pd.DataFrame(pd.Series(compare).reindex(df1.columns).fillna(0)).T
df1[df1.columns] = df1.apply(lambda x: np.where(x>comp, x, comp)[0] ,axis=1)
Expected output:
>>>df1
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
Another possible solution, based on numpy:
cols = list(compare.keys())
df[cols] = np.maximum(df[cols].values, np.array(list(compare.values())))
Output:
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
limits = df.columns.map(compare).to_series(index=df.columns)
new = df.mask(df < limits, limits, axis=1)
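The first line obtains a Series whose index is the columns of df and whose values come from the dictionary (NaN where a column has no entry).
The second line checks whether the frame's values are less than the "limits"; if so, it puts in the limit, otherwise it keeps the value as is,
to get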
>>> new
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
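Since the goal is to raise each listed column to a minimum value, Series.clip with a lower bound is another option. This is only a sketch, assuming the frame is called df1 as in the question; the name clipped is illustrative:

```python
import pandas as pd

# Data copied from the question.
df1 = pd.DataFrame({
    "a": [4, 3, 2, 1, 1, 4],
    "b": [2, 6, 1, 2, 7, 8],
    "c": [6, 7, 1, 7, 2, 8],
    "d": [2, 7, 6, 3, 6, 2],
    "e": [8, 8, 8, 3, 7, 2],
    "f": [1, 5, 7, 1, 6, 1],
})
compare = {"a": 4, "c": 7, "e": 3}

# Clip each listed column from below by its dictionary value.
# assign returns a new frame, so df1 itself is left unchanged.
clipped = df1.assign(
    **{col: df1[col].clip(lower=floor) for col, floor in compare.items()}
)
print(clipped)
```

Columns that are not in the dictionary are passed through untouched.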

How to repeat a pandas data frame column x amount of times?

If I have a Pandas Dataframe like this:
A
1 8
2 9
3 7
4 2
How do I repeat it x number of times? For example, if I wanted to repeat it 3 times I would get something like this:
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
Use concat:
n = 3
pd.concat([df] * (n+1), axis=1, ignore_index=True)
0 1 2 3
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
If you want the columns renamed, use rename:
(pd.concat([df] * (n+1), axis=1, ignore_index=True)
.rename(lambda x: chr(ord('A')+x), axis=1))
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
You can use Numpy to repeat the values and reconstruct the dataframe.
n = 3
pd.DataFrame(np.tile(df.values, n + 1), columns = df.columns.tolist()+list('BCD'))
A B C D
0 8 8 8 8
1 9 9 9 9
2 7 7 7 7
3 2 2 2 2
You can use concat like @coldspeed did.
Or you can set them manually.
df['B'] = df.A
df['C'] = df.A
df['D'] = df.A
print(df)
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
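The answers above either hardcode the new names ('BCD') or compute them with chr. A small sketch that wraps the concat approach in a function and names the columns from string.ascii_uppercase, so it is limited to 26 columns. The helper name repeat_column is made up for illustration:

```python
import string
import pandas as pd

df = pd.DataFrame({"A": [8, 9, 7, 2]}, index=[1, 2, 3, 4])

def repeat_column(frame, n):
    """Return the first column of `frame` plus n copies, named A, B, C, ..."""
    total = n + 1  # the original column plus n repeats
    if total > len(string.ascii_uppercase):
        raise ValueError("this sketch only names up to 26 columns")
    out = pd.concat([frame.iloc[:, 0]] * total, axis=1)
    out.columns = list(string.ascii_uppercase[:total])
    return out

repeated = repeat_column(df, 3)
print(repeated)
```

The original index (1 to 4) is kept, because concat along axis=1 aligns on it.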

How to concatenate two (float) columns into one column quickly and efficiently in a pandas DataFrame?

I want to get a new column by concatenating two columns (float or int), as shown below.
Does anyone have a better idea?
I think my approach is too complex:
a=pandas.Series([1,3,5,7,9])
b=pandas.Series([2,4,6,8,10])
c=pandas.Series([3,5,6,5,10])
abc=pandas.DataFrame({'a':a, 'b':b, 'c':c})
abc
a b c
0 1 2 3
1 3 4 5
2 5 6 6
3 7 8 5
4 9 10 10
abc['new']=pandas.Series(map(str,abc.iloc[:,0])).str.cat(pandas.Series(map(str,abc.iloc[:,1])), sep='::')
abc
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
Use astype to convert to str:
# if you need to select columns by position with iloc
abc['new'] = abc.iloc[:,0].astype(str) + '::' + abc.iloc[:,1].astype(str)
print (abc)
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
# if you need to select by column names
abc['new'] = abc['a'].astype(str) + '::' + abc['b'].astype(str)
print (abc)
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
Solution with str.cat:
abc['new'] = abc['a'].astype(str).str.cat(abc['b'].astype(str), sep='::')
print (abc)
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
You can also do something like this using map
abc['d'] = abc['a'].map(str) +'::'+ abc['b'].map(str)
print(abc)
output:
a b c d
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
How about using apply?
abc['new'] = abc.apply(lambda x: '{}::{}'.format(x['a'],x['b']), axis=1)
It is a simple one-liner this way.
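The same idea extends to more than two columns. A minimal sketch, assuming the frame abc from the question; the helper name join_columns is made up for illustration:

```python
import pandas as pd

abc = pd.DataFrame({
    "a": [1, 3, 5, 7, 9],
    "b": [2, 4, 6, 8, 10],
    "c": [3, 5, 6, 5, 10],
})

def join_columns(frame, cols, sep="::"):
    """Concatenate the string forms of `cols` row by row, separated by `sep`."""
    parts = [frame[c].astype(str) for c in cols]
    result = parts[0]
    for part in parts[1:]:
        # Series.str.cat joins two string Series element-wise.
        result = result.str.cat(part, sep=sep)
    return result

abc["new"] = join_columns(abc, ["a", "b"])
print(abc)
```

With float columns, astype(str) keeps the decimal point, so 1.0 becomes '1.0' rather than '1'.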

Shorter notation for columns in pandas DataFrame

Take a random DataFrame:
df = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'])
Pandas allows defining new columns in two ways:
df['c'] = df.a + df.b
df['c'] = df['a'] + df['b']
As the DataFrame name gets longer, this notation becomes less readable.
And then there's the query function:
df.query('a > b')
It returns the rows of df that match the condition.
Is there a way to run something like DataFrame.query() but for operations on the frame?
Function DataFrame.eval() does exactly this:
df.eval('c = a + b')
And warning-free assignment:
df.eval('c = a + b', inplace=True)
More generally, pandas.eval():
The following arithmetic operations are supported: +, -, *, /, **, %,
// (python engine only) along with the following boolean operations: |
(or), & (and), and ~ (not). Additionally, the 'pandas' parser allows
the use of and, or, and not with the same semantics as the
corresponding bitwise operators.
Pandas docs say that eval supports only Python expression statements (e.g., a == b), but pandas silently supports abs(a - b) and maybe other statements. The rest throw an error. For example:
df.eval('del(a)')
raises NotImplementedError: 'Delete' nodes are not implemented.
Here's a way using assign and add:
df.assign(c=df.a.add(df.b))
a b c
0 0.086468 0.978044 1.064512
1 0.270727 0.789762 1.060489
2 0.150097 0.662430 0.812527
Note: assign returns a copy of your dataframe, so the original data is not modified. You'll need to assign the result to a different variable or back to df.
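assign also accepts callables, which avoids repeating the frame's name inside the expression. A short sketch, in which the frame name and the column names c and diff are illustrative only:

```python
import pandas as pd

my_long_dataframe_name = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.5, 4.0, 1.0]})

# Each lambda receives the frame being assigned to, so the long name
# appears only once.
result = my_long_dataframe_name.assign(
    c=lambda d: d.a + d.b,
    diff=lambda d: (d.a - d.b).abs(),
)
print(result)
```

As with the answer above, the original frame is left untouched and the new columns live only in result.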
Consider the dataframe named my_obnoxiously_long_dataframe_name
np.random.seed([3,1415])
my_obnoxiously_long_dataframe_name = pd.DataFrame(
np.random.randint(10, size=(10, 10)),
columns=list('ABCDEFGHIJ')
)
my_obnoxiously_long_dataframe_name
A B C D E F G H I J
0 0 2 7 3 8 7 0 6 8 6
1 0 2 0 4 9 7 3 2 4 3
2 3 6 7 7 4 5 3 7 5 9
3 8 7 6 4 7 6 2 6 6 5
4 2 8 7 5 8 4 7 6 1 5
5 2 8 2 4 7 6 9 4 2 4
6 6 3 8 3 9 8 0 4 3 0
7 4 1 5 8 6 0 8 7 4 6
8 3 5 8 5 1 5 1 4 3 9
9 5 5 7 0 3 2 5 8 8 9
If you want cleaner code, create a temporary variable with a shorter name:
d_ = my_obnoxiously_long_dataframe_name
d_['K'] = abs(d_.J - d_.D)
d_['L'] = d_.A + d_.B
del d_
my_obnoxiously_long_dataframe_name
A B C D E F G H I J K L
0 0 2 7 3 8 7 0 6 8 6 3 2
1 0 2 0 4 9 7 3 2 4 3 1 2
2 3 6 7 7 4 5 3 7 5 9 2 9
3 8 7 6 4 7 6 2 6 6 5 1 15
4 2 8 7 5 8 4 7 6 1 5 0 10
5 2 8 2 4 7 6 9 4 2 4 0 10
6 6 3 8 3 9 8 0 4 3 0 3 9
7 4 1 5 8 6 0 8 7 4 6 2 5
8 3 5 8 5 1 5 1 4 3 9 4 8
9 5 5 7 0 3 2 5 8 8 9 9 10

sort dataframe by position in group then by that group

Consider the dataframe df:
df = pd.DataFrame(dict(
A=list('aaaaabbbbccc'),
B=range(12)
))
print(df)
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
I want to sort the dataframe such that, if I grouped by column 'A', I'd pull the first position from each group, then cycle back and get the second position from each group if any are remaining, and so on.
I'd expect the results to look like this:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
You can use cumcount to number the rows within each group first, then sort_values, and finally reindex df by the index of the resulting Series cum:
cum = df.groupby('A')['B'].cumcount().sort_values()
print (cum)
0 0
5 0
9 0
1 1
6 1
10 1
2 2
7 2
11 2
3 3
8 3
4 4
dtype: int64
print (df.reindex(cum.index))
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
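One caveat: sort_values uses quicksort by default, which does not promise to keep tied rows in their original order. A sketch of a variant that asks for a stable sort explicitly, assuming the same df as in the question; the names pos, order and interleaved are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"A": list("aaaaabbbbccc"), "B": range(12)})

# Position of each row within its group: 0, 1, 2, ...
pos = df.groupby("A").cumcount().to_numpy()

# A stable argsort keeps rows with the same position in their original order.
order = pos.argsort(kind="stable")

# iloc takes positions, so this works for any index, not only the default one.
interleaved = df.iloc[order]
print(interleaved)
```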
Here's a NumPy approach -
def approach1(g, v):
    # Inputs : 1D arrays of groupby and value columns
    id_arr2 = np.ones(v.size,dtype=int)
    sf = np.flatnonzero(g[1:] != g[:-1])+1
    id_arr2[sf[0]] = -sf[0]+1
    id_arr2[sf[1:]] = sf[:-1] - sf[1:]+1
    return id_arr2.cumsum().argsort(kind='mergesort')
Sample run -
In [246]: df
Out[246]:
A B
0 a 0
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
8 b 8
9 c 9
10 c 10
11 c 11
In [247]: df.iloc[approach1(df.A.values, df.B.values)]
Out[247]:
A B
0 a 0
5 b 5
9 c 9
1 a 1
6 b 6
10 c 10
2 a 2
7 b 7
11 c 11
3 a 3
8 b 8
4 a 4
Or, when df has the default 0-based integer index, using df.reindex from @jezrael's post:
df.reindex(approach1(df.A.values, df.B.values))
