Shorter notation for columns in pandas DataFrame - python

Take a random DataFrame:
df = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'])
Pandas allows defining new columns in two ways:
df['c'] = df.a + df.b
df['c'] = df['a'] + df['b']
As the DataFrame name gets longer, this notation becomes less readable.
And then there's the query function:
df.query('a > b')
It returns the rows of df that match the condition.
Is there a way to run something like DataFrame.query() but for operations on the frame?

Function DataFrame.eval() does exactly this:
df.eval('c = a + b')
And in-place assignment (warning-free in older pandas versions; without inplace=True, current versions return a new DataFrame):
df.eval('c = a + b', inplace=True)
More generally, pandas.eval():
The following arithmetic operations are supported: +, -, *, /, **, %,
// (python engine only) along with the following boolean operations: |
(or), & (and), and ~ (not). Additionally, the 'pandas' parser allows
the use of and, or, and not with the same semantics as the
corresponding bitwise operators.
The pandas docs say that eval supports only Python expression statements (e.g., a == b), but it also accepts function calls such as abs(a - b) and possibly a few other constructs. Unsupported constructs raise an error. For example:
df.eval('del(a)')
raises NotImplementedError: 'Delete' nodes are not implemented.
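As a rough sketch of the two eval forms discussed in this answer, the snippet below uses a small frame with fixed values (an assumption for illustration, in place of the random data) so the results are easy to check by hand:

```python
import pandas as pd

# Fixed values, assumed for illustration instead of np.random.rand
df = pd.DataFrame({'a': [1.0, 4.0, 2.0], 'b': [3.0, 1.0, 2.0]})

# Without inplace=True, eval returns a new DataFrame and leaves df unchanged
out = df.eval('c = a + b')

# With inplace=True, the new column is added to df itself
df.eval('d = abs(a - b)', inplace=True)

print(out)
print(df)
```

Here out carries the extra column c, while df only gains d.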

Here's a way using assign and add:
df.assign(c=df.a.add(df.b))
a b c
0 0.086468 0.978044 1.064512
1 0.270727 0.789762 1.060489
2 0.150097 0.662430 0.812527
Note: assign returns a new DataFrame, so the original data is left unchanged. You'll need to assign the result to a different variable or back to df.
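A minimal sketch of that copy behaviour, with made-up values. The second call passes a callable to assign, which receives the frame itself and so avoids repeating a long DataFrame name; the names with_c and with_c2 are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# assign returns a new DataFrame; df itself keeps only columns a and b
with_c = df.assign(c=df.a.add(df.b))

# A callable is evaluated against the frame, so its name is not repeated
with_c2 = df.assign(c=lambda d: d.a + d.b)

print(df.columns.tolist())
print(with_c)
```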

Consider the dataframe named my_obnoxiously_long_dataframe_name:
np.random.seed([3,1415])
my_obnoxiously_long_dataframe_name = pd.DataFrame(
    np.random.randint(10, size=(10, 10)),
    columns=list('ABCDEFGHIJ')
)
my_obnoxiously_long_dataframe_name
A B C D E F G H I J
0 0 2 7 3 8 7 0 6 8 6
1 0 2 0 4 9 7 3 2 4 3
2 3 6 7 7 4 5 3 7 5 9
3 8 7 6 4 7 6 2 6 6 5
4 2 8 7 5 8 4 7 6 1 5
5 2 8 2 4 7 6 9 4 2 4
6 6 3 8 3 9 8 0 4 3 0
7 4 1 5 8 6 0 8 7 4 6
8 3 5 8 5 1 5 1 4 3 9
9 5 5 7 0 3 2 5 8 8 9
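If you want cleaner code, create a shorter temporary name for it. The short name refers to the same object, so columns added through it show up on the original DataFrame: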
d_ = my_obnoxiously_long_dataframe_name
d_['K'] = abs(d_.J - d_.D)
d_['L'] = d_.A + d_.B
del d_
my_obnoxiously_long_dataframe_name
A B C D E F G H I J K L
0 0 2 7 3 8 7 0 6 8 6 3 2
1 0 2 0 4 9 7 3 2 4 3 1 2
2 3 6 7 7 4 5 3 7 5 9 2 9
3 8 7 6 4 7 6 2 6 6 5 1 15
4 2 8 7 5 8 4 7 6 1 5 0 10
5 2 8 2 4 7 6 9 4 2 4 0 10
6 6 3 8 3 9 8 0 4 3 0 3 9
7 4 1 5 8 6 0 8 7 4 6 2 5
8 3 5 8 5 1 5 1 4 3 9 4 8
9 5 5 7 0 3 2 5 8 8 9 9 10
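The reason this works can be sketched on a tiny frame (values assumed for illustration): the short name is an alias for the same object, not a copy, and del removes only the name.

```python
import pandas as pd

my_obnoxiously_long_dataframe_name = pd.DataFrame({'A': [0, 3, 8], 'B': [2, 6, 7]})

d_ = my_obnoxiously_long_dataframe_name   # alias, not a copy
d_['L'] = d_.A + d_.B                      # modifies the shared object
same_object = d_ is my_obnoxiously_long_dataframe_name
del d_                                     # removes the name only

print(my_obnoxiously_long_dataframe_name)
```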

Related

Update values in dataframe based on dictionary and condition

I have a dataframe and a dictionary that contains some of the columns of the dataframe and some values. I want to update the dataframe based on the dictionary values, and pick the higher value.
>>> df1
a b c d e f
0 4 2 6 2 8 1
1 3 6 7 7 8 5
2 2 1 1 6 8 7
3 1 2 7 3 3 1
4 1 7 2 6 7 6
5 4 8 8 2 2 1
and the dictionary is
compare = {'a':4, 'c':7, 'e':3}
So I want to check the values in columns ['a','c','e'] and replace with the value in the dictionary, if it is higher.
What I have tried is this:
comp = pd.DataFrame(pd.Series(compare).reindex(df1.columns).fillna(0)).T
df1[df1.columns] = df1.apply(lambda x: np.where(x>comp, x, comp)[0] ,axis=1)
Expected output:
>>>df1
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
Another possible solution, based on numpy:
cols = list(compare.keys())
df[cols] = np.maximum(df[cols].values, np.array(list(compare.values())))
Output:
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
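For reference, here is a self-contained sketch of the NumPy approach using the question's data. It assumes the frame is called df1 as in the question; np.maximum broadcasts the three dictionary values across every row of the selected columns.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'a': [4, 3, 2, 1, 1, 4],
    'b': [2, 6, 1, 2, 7, 8],
    'c': [6, 7, 1, 7, 2, 8],
    'd': [2, 7, 6, 3, 6, 2],
    'e': [8, 8, 8, 3, 7, 2],
    'f': [1, 5, 7, 1, 6, 1],
})
compare = {'a': 4, 'c': 7, 'e': 3}

cols = list(compare.keys())
# Each row of df1[cols] is compared element-wise with [4, 7, 3]
df1[cols] = np.maximum(df1[cols].to_numpy(), np.array(list(compare.values())))

print(df1)
```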
limits = df.columns.map(compare).to_series(index=df.columns)
new = df.mask(df < limits, limits, axis=1)
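This first obtains a Series whose index is the columns of df and whose values come from the dictionary (NaN for columns without an entry).
It then checks whether the frame's values are less than the limits; if so, the limit is used, otherwise the value is kept as is (comparisons against NaN are False, so those columns are untouched),
to get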
>>> new
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1

How to repeat a pandas data frame column x amount of times?

If I have a Pandas Dataframe like this:
A
1 8
2 9
3 7
4 2
How do I repeat it x number of times? For example, if I wanted to repeat it 3 times I would get something like this:
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
Use concat:
n = 3
pd.concat([df] * (n+1), axis=1, ignore_index=True)
0 1 2 3
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
If you want the columns renamed, use rename:
(pd.concat([df] * (n+1), axis=1, ignore_index=True)
.rename(lambda x: chr(ord('A')+x), axis=1))
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
You can use NumPy to repeat the values and reconstruct the dataframe. Note that building from df.values resets the index to 0..3.
n = 3
pd.DataFrame(np.tile(df.values, n + 1), columns = df.columns.tolist()+list('BCD'))
A B C D
0 8 8 8 8
1 9 9 9 9
2 7 7 7 7
3 2 2 2 2
You can use concat like @coldspeed did.
Or you can set them manually.
df['B'] = df.A
df['C'] = df.A
df['D'] = df.A
print(df)
A B C D
1 8 8 8 8
2 9 9 9 9
3 7 7 7 7
4 2 2 2 2
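The manual assignments generalize to any number of copies with a short loop. This is only a sketch: the single-letter column names are derived from the copy's position, which assumes n is small enough to stay within the alphabet.

```python
import pandas as pd

df = pd.DataFrame({'A': [8, 9, 7, 2]}, index=[1, 2, 3, 4])
n = 3  # number of extra copies (assumed small enough for single-letter names)

for k in range(1, n + 1):
    # 'B', 'C', 'D', ... derived from the position of the copy
    df[chr(ord('A') + k)] = df['A']

print(df)
```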

How to delete rows having same value in more than 3 columns

I have the DataFrame below.
A B C D E F G
1 4 9 4 6 9 8
2 2 2 2 2 5 9
2 2 2 2 2 2 2
2 6 9 5 4 4 5
2 8 1 9 5 8 9
2 2 2 5 6 3 6
I need the output below:
A B C D E F G
1 4 9 4 6 9 8
2 6 9 5 4 4 5
2 8 1 9 5 8 9
2 2 2 5 6 3 6
That is, rows in which the same value appears in more than three columns should be deleted.
The second and third rows have the same value in 5 and 7 columns respectively, so those rows need to be deleted.
Could anyone please help me?
Here's a naïve Pandas loop via pd.DataFrame.apply and pd.Series.value_counts:
def max_count(s):
    return s.value_counts().values[0]
res = df[df.apply(max_count, axis=1).le(3)]
print(res)
A B C D E F G
0 1 4 9 4 6 9 8
3 2 6 9 5 4 4 5
4 2 8 1 9 5 8 9
5 2 2 2 5 6 3 6
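To see what the helper computes, here is a self-contained sketch on the question's data that also prints the per-row maximum count (the intermediate name counts is an addition for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 4, 9, 4, 6, 9, 8],
     [2, 2, 2, 2, 2, 5, 9],
     [2, 2, 2, 2, 2, 2, 2],
     [2, 6, 9, 5, 4, 4, 5],
     [2, 8, 1, 9, 5, 8, 9],
     [2, 2, 2, 5, 6, 3, 6]],
    columns=list('ABCDEFG'),
)

def max_count(s):
    # value_counts sorts by frequency, so the first entry is the largest count
    return s.value_counts().values[0]

counts = df.apply(max_count, axis=1)   # how often the most frequent value occurs per row
res = df[counts.le(3)]                 # keep rows where no value occurs more than 3 times

print(counts.tolist())
print(res)
```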
Approach #1
For a dataframe of non-negative ints, here's a vectorized one with bincount -
# https://stackoverflow.com/a/46256361/ @Divakar
def bincount2D_vectorized(a):
    N = a.max()+1
    a_offs = a + np.arange(a.shape[0])[:,None]*N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0]*N).reshape(-1,N)
out = df[(bincount2D_vectorized(df.values)<=3).all(1)]
Sample output -
In [563]: df[(bincount2D_vectorized(df.values)<=3).all(1)]
Out[563]:
A B C D E F G
0 1 4 9 4 6 9 8
3 2 6 9 5 4 4 5
4 2 8 1 9 5 8 9
5 2 2 2 5 6 3 6
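You can use a set, which keeps only unique values. If a row has exactly one value repeated 3 times and no other duplicates, then
len(set(row)) == len(row) - 2.
Note that the set size reflects all duplicates together, so a row with several repeated values (such as 2 2 2 5 6 3 6) also falls below this threshold.
Iterate over the dataframe to find those rows and store their indexes.
indexes_to_remove = []
for index, row in df.iterrows():
    if len(set(row)) < len(row) - 2:
        indexes_to_remove.append(index)
Then you can remove them safely, for example with df.drop(indexes_to_remove).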
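A sketch of the removal step on the question's data follows. On this data the set test also flags the last row, because the set size shrinks with every duplicate rather than with the most frequent value alone. The Counter-based variant is a swapped-in alternative, not part of the answer, that counts only the most frequent value; the names by_set and by_count are hypothetical.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame(
    [[1, 4, 9, 4, 6, 9, 8],
     [2, 2, 2, 2, 2, 5, 9],
     [2, 2, 2, 2, 2, 2, 2],
     [2, 6, 9, 5, 4, 4, 5],
     [2, 8, 1, 9, 5, 8, 9],
     [2, 2, 2, 5, 6, 3, 6]],
    columns=list('ABCDEFG'),
)

# Set-based test from the answer: flags rows with few distinct values
by_set = [i for i, row in df.iterrows() if len(set(row)) < len(row) - 2]

# Alternative: flag rows whose most frequent value occurs more than 3 times
by_count = [i for i, row in df.iterrows()
            if Counter(row).most_common(1)[0][1] > 3]

res = df.drop(by_count)   # drop returns a new frame without those index labels

print(by_set, by_count)
print(res)
```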

How to concatenate two (float) columns into one column quickly and efficiently in a pandas dataframe?

I want to get a new column by concatenating two columns (float or int), as shown below.
Does anyone have a better idea?
I think my approach is too complex.
a=pandas.Series([1,3,5,7,9])
b=pandas.Series([2,4,6,8,10])
c=pandas.Series([3,5,6,5,10])
abc=pandas.DataFrame({'a':a, 'b':b, 'c':c})
abc
a b c
0 1 2 3
1 3 4 5
2 5 6 6
3 7 8 5
4 9 10 10
abc['new']=pandas.Series(map(str,abc.iloc[:,0])).str.cat(pandas.Series(map(str,abc.iloc[:,1])), sep='::')
abc
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
Use astype to convert to str:
# if you need to select columns by position with iloc
abc['new'] = abc.iloc[:,0].astype(str) + '::' + abc.iloc[:,1].astype(str)
print (abc)
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
# if you need to select by column names
abc['new'] = abc['a'].astype(str) + '::' + abc['b'].astype(str)
print (abc)
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
Solution with str.cat:
abc['new'] = abc['a'].astype(str).str.cat(abc['b'].astype(str), sep='::')
print (abc)
a b c new
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
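Since the question mentions float columns, here is a small sketch with one int and one float column (values assumed for illustration). It shows that astype(str) keeps the float's decimal part in the joined text, and that the + and str.cat forms give the same result; the column name new2 is hypothetical.

```python
import pandas as pd

abc = pd.DataFrame({'a': [1, 3, 5], 'b': [2.5, 4.0, 6.25]})

# Plain string addition after converting both columns to str
abc['new'] = abc['a'].astype(str) + '::' + abc['b'].astype(str)

# Equivalent form with str.cat
abc['new2'] = abc['a'].astype(str).str.cat(abc['b'].astype(str), sep='::')

print(abc)
```

If the float's trailing ".0" is unwanted, the column would need to be formatted or cast before joining.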
You can also do something like this using map
abc['d'] = abc['a'].map(str) +'::'+ abc['b'].map(str)
print(abc)
output:
a b c d
0 1 2 3 1::2
1 3 4 5 3::4
2 5 6 6 5::6
3 7 8 5 7::8
4 9 10 10 9::10
How about using apply?
abc['new'] = abc.apply(lambda x: '{}::{}'.format(x['a'],x['b']), axis=1)
It is a simple one-liner this way.

How do I put a series (such as the result of a pandas groupby.apply(f)) into a new column of the dataframe?

I have a dataframe that I want to calculate statistics on (value_count, mode, mean, etc.) and then put the result in a new column. My current solution is O(n**2) or so, and I'm sure there is a faster, obvious method that I'm overlooking.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(100, 10)),
                  columns=list('abcdefghij'))
df['result'] = 0
groups = df.groupby([df.i, df.j])
for g in groups:
    icol_eq = df.i == g[0][0]
    jcol_eq = df.j == g[0][1]
    i_and_j = icol_eq & jcol_eq
    # .loc avoids chained assignment, which may not write back to df
    df.loc[i_and_j, 'result'] = len(g[1])
The above works, but is extremely slow for large dataframes.
I tried
df['result'] = df.groupby([df.i, df.j]).apply(len)
but it doesn't seem to work.
Nor does
def f(g):
    g['result'] = len(g)
    return g
df.groupby([df.i, df.j]).apply(f)
Nor can I merge the resulting series of a df.groupby.apply(lambda x: len(x)).
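You want to use transform on a single column of the groupby (on the whole frame it returns one result per column, which cannot be assigned to a single new column):
In [98]:
df['result'] = df.groupby(['i', 'j'])['a'].transform(len)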
df
Out[98]:
a b c d e f g h i j result
0 6 1 3 0 1 1 4 2 8 6 6
1 1 3 9 7 5 5 3 5 4 4 1
2 1 5 0 1 8 1 4 7 3 9 1
3 6 8 6 4 6 0 8 0 6 5 6
4 7 9 7 2 8 9 9 6 0 6 7
5 3 5 5 7 2 7 7 3 2 8 3
6 5 0 4 7 5 7 5 7 9 1 5
7 3 2 5 4 3 6 8 4 2 0 3
8 2 3 0 4 8 5 7 9 7 2 2
9 1 1 3 2 3 5 6 6 5 6 1
10 3 0 2 7 1 8 1 3 5 4 3
....
On a single column of the groupby, transform returns a Series with an index aligned to your original df, so you can then add it as a column.
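A minimal sketch of that alignment, on a small frame with fixed values (assumed for illustration) so the group sizes can be checked by hand. It selects one column before calling transform, and also shows the built-in 'size' aggregation name as an alternative to len; the column name result2 is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'i': [1, 1, 2, 2, 1],
    'j': [0, 0, 0, 1, 0],
    'a': [5, 6, 7, 8, 9],
})

# One value per original row: the size of that row's (i, j) group
df['result'] = df.groupby(['i', 'j'])['a'].transform(len)

# Same idea using the built-in 'size' aggregation name
df['result2'] = df.groupby(['i', 'j'])['a'].transform('size')

print(df)
```

Rows 0, 1 and 4 share the group (1, 0), so each of them gets 3; the other two rows are alone in their groups and get 1.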
