I have a dataframe and a dictionary whose keys are some of the dataframe's columns. I want to update the dataframe from the dictionary, keeping the higher of the two values.
>>> df1
a b c d e f
0 4 2 6 2 8 1
1 3 6 7 7 8 5
2 2 1 1 6 8 7
3 1 2 7 3 3 1
4 1 7 2 6 7 6
5 4 8 8 2 2 1
and the dictionary is
compare = {'a':4, 'c':7, 'e':3}
So I want to check the values in columns ['a','c','e'] and replace each one with the dictionary value if the dictionary value is higher.
What I have tried is this:
comp = pd.DataFrame(pd.Series(compare).reindex(df1.columns).fillna(0)).T
df1[df1.columns] = df1.apply(lambda x: np.where(x > comp, x, comp)[0], axis=1)
Expected output:
>>> df1
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
A possible solution, based on numpy:
cols = list(compare.keys())
df1[cols] = np.maximum(df1[cols].values, np.array(list(compare.values())))
Output:
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
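As an aside (my own sketch, not one of the original answers), DataFrame.clip can express the same per-column floor by aligning a Series of lower bounds to the columns:
# clip each selected column from below by its dictionary value
df1[cols] = df1[cols].clip(lower=pd.Series(compare), axis=1)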
Or with mask: first build a Series whose index is the frame's columns and whose values come from the dictionary, then replace every value that is below its "limit" with that limit, leaving the rest as is:
limits = df1.columns.map(compare).to_series(index=df1.columns)
new = df1.mask(df1 < limits, limits, axis=1)
to get
>>> new
a b c d e f
0 4 2 7 2 8 1
1 4 6 7 7 8 5
2 4 1 7 6 8 7
3 4 2 7 3 3 1
4 4 7 7 6 7 6
5 4 8 8 2 3 1
I have a dataframe as below
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
I want to multiply every 3rd column, starting from the 2nd column, in the last 2 rows by 5 to get the output below. How do I accomplish this?
A B C D E F G H I
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 2 3 4 5 6 7 8 9
1 10 3 4 25 6 7 40 9
1 10 3 4 25 6 7 40 9
I am able to select the cells I need with df.iloc[-2:,1::3], which results in the frame below, but I am not able to proceed further.
B E H
2 5 8
2 5 8
I know that I can select the same cells with loc instead of iloc, and then the calculation is straightforward, but I am not able to figure it out.
The column names and cell values cannot be used, since these change (the df here is just dummy data).
You can assign back to the same selection of rows/columns like:
df.iloc[-2:,1::3] = df.iloc[-2:,1::3].mul(5)
#alternative
#df.iloc[-2:,1::3] = df.iloc[-2:,1::3] * 5
print (df)
A B C D E F G H I
0 1 2 3 4 5 6 7 8 9
1 1 2 3 4 5 6 7 8 9
2 1 2 3 4 5 6 7 8 9
3 1 2 3 4 5 6 7 8 9
4 1 2 3 4 5 6 7 8 9
5 1 10 3 4 25 6 7 40 9
6 1 10 3 4 25 6 7 40 9
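As a side note (an assumption of mine, not from the original answer), the augmented form does the same thing in one step, since .iloc supports item assignment; a minimal self-contained sketch:
import numpy as np
import pandas as pd

# dummy frame matching the question: 7 identical rows of 1..9
df = pd.DataFrame(np.tile(np.arange(1, 10), (7, 1)), columns=list('ABCDEFGHI'))

# multiply every 3rd column, starting from the 2nd, in the last 2 rows
df.iloc[-2:, 1::3] *= 5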
I have the DataFrame below.
A B C D E F G
1 4 9 4 6 9 8
2 2 2 2 2 5 9
2 2 2 2 2 2 2
2 6 9 5 4 4 5
2 8 1 9 5 8 9
2 2 2 5 6 3 6
I need output as below:
A B C D E F G
1 4 9 4 6 9 8
2 6 9 5 4 4 5
2 8 1 9 5 8 9
2 2 2 5 6 3 6
It means rows in which more than three columns hold the same value should be deleted.
We can see that the second and third rows have 5 and 7 columns with the same value, respectively. We need to delete those rows.
Could anyone please help me?
Here's a naïve Pandas loop via pd.DataFrame.apply and pd.Series.value_counts:
def max_count(s):
    # value_counts sorts descending, so the first entry is the
    # highest multiplicity of any value in the row
    return s.value_counts().values[0]

res = df[df.apply(max_count, axis=1).le(3)]
print(res)
A B C D E F G
0 1 4 9 4 6 9 8
3 2 6 9 5 4 4 5
4 2 8 1 9 5 8 9
5 2 2 2 5 6 3 6
Approach #1
For a dataframe of ints, here's a vectorized one with bincount -
# https://stackoverflow.com/a/46256361/ #Divakar
def bincount2D_vectorized(a):
    # per-row bincount of a 2D array of non-negative ints:
    # offset each row into its own range of bins, then bincount once
    N = a.max() + 1
    a_offs = a + np.arange(a.shape[0])[:, None] * N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0] * N).reshape(-1, N)
out = df[(bincount2D_vectorized(df.values)<=3).all(1)]
Sample output -
In [563]: df[(bincount2D_vectorized(df.values)<=3).all(1)]
Out[563]:
A B C D E F G
0 1 4 9 4 6 9 8
3 2 6 9 5 4 4 5
4 2 8 1 9 5 8 9
5 2 2 2 5 6 3 6
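A pandas-only variant of the same per-row idea (my own sketch, not from the original answers) computes each row's value counts in one apply and filters on the largest count:
res = df[df.apply(pd.Series.value_counts, axis=1).max(axis=1).le(3)]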
You can use a set, which keeps only unique values: if a row has 3 equal values and no other duplicates, then
len(set(row)) == len(row) - 2.
Be careful, though: the set length lumps together duplicates of different values (a row with three 2s and two 6s loses three elements, not two), so to test whether any single value occurs more than three times it is safer to count occurrences explicitly. Iterate over the dataframe to find those rows and store their indexes:
from collections import Counter

indexes_to_remove = []
for index, row in df.iterrows():
    # multiplicity of the row's most frequent value
    if max(Counter(row).values()) > 3:
        indexes_to_remove.append(index)
Then you can remove them safely.
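For completeness, a one-line sketch of that removal step (assuming the list built above):
df = df.drop(indexes_to_remove)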
Actually I thought this should be very easy. I have a pandas data frame with, let's say, 100 columns, and I want a subset containing columns 0:30 and 77:99.
What I've done so far is:
df_1 = df.iloc[:,0:30]
df_2 = df.iloc[:,77:99]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
Is there an easier way?
Use numpy.r_ to concatenate the indices:
df2 = df.iloc[:, np.r_[0:30, 77:99]]
Sample:
df = pd.DataFrame(np.random.randint(10, size=(5,15)))
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 6 2 9 5 4 6 9 9 7 9 6 6 1 0 6
1 5 6 7 0 7 8 7 9 4 8 1 2 0 8 5
2 5 6 1 6 7 6 1 5 5 4 6 3 2 3 0
3 4 3 1 3 3 8 3 6 7 1 8 6 2 1 8
4 3 8 2 3 7 3 6 4 4 6 2 6 9 4 9
df2 = df.iloc[:, np.r_[0:3, 7:9]]
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
The concat approach from the question gives the same result:
df_1 = df.iloc[:,0:3]
df_2 = df.iloc[:,7:9]
df2 = pd.concat([df_1, df_2], axis=1, join_axes=[df_1.index])
print (df2)
0 1 2 7 8
0 6 2 9 9 7
1 5 6 7 9 4
2 5 6 1 5 5
3 4 3 1 6 7
4 3 8 2 4 4
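As an aside (not from the original answer), plain Python ranges do the same job without numpy:
df2 = df.iloc[:, list(range(0, 3)) + list(range(7, 9))]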
Take a random DataFrame:
df = pd.DataFrame(np.random.rand(3, 2), columns=['a', 'b'])
Pandas allows defining new columns in two ways:
df['c'] = df.a + df.b
df['c'] = df['a'] + df['b']
As the DataFrame name gets longer, this notation becomes less readable.
And then there's the query function:
df.query('a > b')
It returns the rows of the df that match the condition.
Is there a way to run something like DataFrame.query() but for operations on the frame?
Function DataFrame.eval() does exactly this:
df.eval('c = a + b')
And warning-free assignment:
df.eval('c = a + b', inplace=True)
More generally, pandas.eval():
The following arithmetic operations are supported: +, -, *, /, **, %,
// (python engine only) along with the following boolean operations: |
(or), & (and), and ~ (not). Additionally, the 'pandas' parser allows
the use of and, or, and not with the same semantics as the
corresponding bitwise operators.
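For instance, a small sketch combining those operators (the pandas parser accepts both spellings):
df.eval('(a > 0.5) and not (b > 0.5)')  # same as (a > 0.5) & ~(b > 0.5)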
Pandas docs say that eval supports only Python expressions (e.g., a == b), yet pandas silently supports abs(a - b) and possibly other expressions too. The rest throw an error. For example:
df.eval('del(a)')
returns NotImplementedError: 'Delete' nodes are not implemented.
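Relatedly (an aside, not part of the original answer), eval expressions can also reference local Python variables via the @ prefix:
factor = 2
df.eval('c = a + b * @factor', inplace=True)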
Here's a way using assign and add:
df.assign(c=df.a.add(df.b))
a b c
0 0.086468 0.978044 1.064512
1 0.270727 0.789762 1.060489
2 0.150097 0.662430 0.812527
Note: assign creates a copy of your dataframe, so you aren't mutating the original data; you'll need to reassign the result to a different variable or back to df.
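For example:
df = df.assign(c=df.a.add(df.b))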
Consider the dataframe named my_obnoxiously_long_dataframe_name
np.random.seed([3,1415])
my_obnoxiously_long_dataframe_name = pd.DataFrame(
np.random.randint(10, size=(10, 10)),
columns=list('ABCDEFGHIJ')
)
my_obnoxiously_long_dataframe_name
A B C D E F G H I J
0 0 2 7 3 8 7 0 6 8 6
1 0 2 0 4 9 7 3 2 4 3
2 3 6 7 7 4 5 3 7 5 9
3 8 7 6 4 7 6 2 6 6 5
4 2 8 7 5 8 4 7 6 1 5
5 2 8 2 4 7 6 9 4 2 4
6 6 3 8 3 9 8 0 4 3 0
7 4 1 5 8 6 0 8 7 4 6
8 3 5 8 5 1 5 1 4 3 9
9 5 5 7 0 3 2 5 8 8 9
If you want cleaner code, bind the frame to a shorter temporary name. Since d_ is just another reference to the same object, assignments through it mutate the original frame, and del d_ afterwards merely removes the extra name:
d_ = my_obnoxiously_long_dataframe_name
d_['K'] = abs(d_.J - d_.D)
d_['L'] = d_.A + d_.B
del d_
my_obnoxiously_long_dataframe_name
A B C D E F G H I J K L
0 0 2 7 3 8 7 0 6 8 6 3 2
1 0 2 0 4 9 7 3 2 4 3 1 2
2 3 6 7 7 4 5 3 7 5 9 2 9
3 8 7 6 4 7 6 2 6 6 5 1 15
4 2 8 7 5 8 4 7 6 1 5 0 10
5 2 8 2 4 7 6 9 4 2 4 0 10
6 6 3 8 3 9 8 0 4 3 0 3 9
7 4 1 5 8 6 0 8 7 4 6 2 5
8 3 5 8 5 1 5 1 4 3 9 4 8
9 5 5 7 0 3 2 5 8 8 9 9 10
I have a dataframe that I want to calculate statistics on (value counts, mode, mean, etc.) and then put the result in a new column. My current solution is O(n**2) or so, and I'm sure there is a faster, obvious method that I'm overlooking.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(100, 10)),
columns = list('abcdefghij'))
df['result'] = 0
groups = df.groupby([df.i, df.j])
for g in groups:
    icol_eq = df.i == g[0][0]
    jcol_eq = df.j == g[0][1]
    i_and_j = icol_eq & jcol_eq
    df['result'][i_and_j] = len(g[1])
The above works, but is extremely slow for large dataframes.
I tried
df['result'] = df.groupby([df.i, df.j]).apply(len)
but it doesn't seem to work.
Nor does
def f(g):
    g['result'] = len(g)
    return g
df.groupby([df.i, df.j]).apply(f)
Nor can I figure out how to merge back the Series that df.groupby(...).apply(lambda x: len(x)) returns.
You want to use transform:
In [98]:
df['result'] = df.groupby([df.i, df.j]).transform(len)
df
Out[98]:
a b c d e f g h i j result
0 6 1 3 0 1 1 4 2 8 6 6
1 1 3 9 7 5 5 3 5 4 4 1
2 1 5 0 1 8 1 4 7 3 9 1
3 6 8 6 4 6 0 8 0 6 5 6
4 7 9 7 2 8 9 9 6 0 6 7
5 3 5 5 7 2 7 7 3 2 8 3
6 5 0 4 7 5 7 5 7 9 1 5
7 3 2 5 4 3 6 8 4 2 0 3
8 2 3 0 4 8 5 7 9 7 2 2
9 1 1 3 2 3 5 6 6 5 6 1
10 3 0 2 7 1 8 1 3 5 4 3
....
transform returns a result whose index is aligned to your original df, so you can then add it as a column.
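As a side note (my own addition, not part of the original answer): recent pandas versions can ask for the group size directly, which avoids transforming every column:
df['result'] = df.groupby(['i', 'j'])['i'].transform('size')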