Filter in pandas using operators - python

I have three variables:
a = 'col1'
b = '=='
c = 2
and a pandas DataFrame:
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]}, dtype=object)
I want to filter for col1 == 2, so I wrote
df.query("@a @b @c")
which throws the error below:
File "<unknown>", line 1
__pd_eval_local_a __pd_eval_local_b __pd_eval_local_c
^
SyntaxError: invalid syntax
Can someone help me achieve this using these three variables?
Thanks,

In [214]: df
Out[214]:
col1 col2 col3
0 0 10 aaa
1 1 11 bbb
2 2 12 ccc
In [215]: a='col1'; b='=='; c=2 # <--- `c` is `int`
In [216]: df.query(a + b + '@c')
Out[216]:
col1 col2 col3
2 2 12 ccc
In [217]: a='col3'; b='=='; c='aaa' # <--- `c` is `str`
In [218]: df.query(a + b + '@c')
Out[218]:
col1 col2 col3
0 0 10 aaa
It will also work with datetime dtypes, as .query() takes care of the dtype conversion:
In [226]: df['col4'] = pd.date_range('2017-01-01', freq='99D', periods=len(df))
In [227]: df
Out[227]:
col1 col2 col3 col4
0 0 10 aaa 2017-01-01
1 1 11 bbb 2017-04-10
2 2 12 ccc 2017-07-18
In [228]: a='col4'; b='=='; c='2017-01-01'
In [229]: df.query(a + b + '@c')
Out[229]:
col1 col2 col3 col4
0 0 10 aaa 2017-01-01
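
As a compact restatement of the same pattern (my own sketch, not part of the answer above): build the column-and-operator part of the expression as text with an f-string and pass the value in with @, so .query() handles quoting and dtypes:
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
a, b, c = 'col1', '==', 2

# the column name and operator become part of the expression text;
# the value is referenced as a local variable with @
print(df.query(f"{a} {b} @c"))
#   col1  col2
# 2     2    12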

You can do it without query, if that's not a requirement, by building the expression as a string and using eval:
a = 'col1'
b = '=='
c = '2'
instruction = "df[\"" + a + "\"]" + b + c
# instruction is now 'df["col1"]==2'
df[eval(instruction)]
   col1 col2
2     2   12
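
A related sketch that avoids eval entirely (my own addition, not part of the answer above): map the operator string to a function from the standard operator module and build a boolean mask:
import operator
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]})
a, b, c = 'col1', '==', 2

# map the textual operator to the matching comparison function
ops = {'==': operator.eq, '!=': operator.ne,
       '<': operator.lt, '<=': operator.le,
       '>': operator.gt, '>=': operator.ge}

print(df[ops[b](df[a], c)])   # rows where col1 == 2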

Related

Pandas get column value based on row value [duplicate]

This question already has answers at: Lookup Values by Corresponding Column Header in Pandas 1.2.0 or newer.
I have the following dataframe:
df = pd.DataFrame(data={'flag': ['col3', 'col2', 'col2'],
                        'col1': [1, 3, 2],
                        'col2': [5, 2, 4],
                        'col3': [6, 3, 6],
                        'col4': [0, 4, 4]},
                  index=pd.Series(['A', 'B', 'C'], name='index'))
index  flag  col1  col2  col3  col4
A      col3     1     5     6     0
B      col2     3     2     3     4
C      col2     2     4     6     4
For each row, I want to get the value from the column whose name equals the flag:
index  flag  col1  col2  col3  col4  col_val
A      col3     1     5     6     0        6
B      col2     3     2     3     4        2
C      col2     2     4     6     4        4
– Index A has a flag of col3. So col_val should be 6 because df['col3'] for that row is 6.
– Index B has a flag of col2. So col_val should be 2 because df['col2'] for that row is 2.
– Index C has a flag of col2. So col_val should be 4 because df['col2'] for that row is 4.
Per this page:
import numpy as np  # np.arange is used below
# factorize encodes each flag as an integer position into the array of unique flags
idx, cols = pd.factorize(df['flag'])
df['COl_VAL'] = df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
Output:
>>> df
flag col1 col2 col3 col4 COl_VAL
index
A col3 1 5 6 0 6
B col2 3 2 3 4 2
C col2 2 4 6 4 4
The docs have an example that you can adapt; the solution below is just another option.
What it does is flip the dataframe into a MultiIndex dataframe with pivot, select the relevant columns, and trim it to the non-null values:
cols = [(ent, ent) for ent in df.flag.unique()]
(df.assign(col_val=df.pivot(index=None, columns='flag')
             .loc(axis=1)[cols]
             .sum(1))
)
flag col1 col2 col3 col4 col_val
index
A col3 1 5 6 0 6.0
B col2 3 2 3 4 2.0
C col2 2 4 6 4 4.0
Try this: build a boolean mask of where each column name matches the row's flag, forward-fill it across the row, and read the result off the last column:
cond = ([df.columns.values[1:]] * df.shape[0]) == df.flag.values.reshape(-1,1)
df1 = df.set_index('flag', append=True)
df1.join(df1.where(cond).ffill(axis=1).col4.rename('res')).reset_index('flag')
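
For completeness, a simpler (if slower) row-wise sketch of the same lookup; the DataFrame construction is taken from the question:
import pandas as pd

df = pd.DataFrame(data={'flag': ['col3', 'col2', 'col2'],
                        'col1': [1, 3, 2],
                        'col2': [5, 2, 4],
                        'col3': [6, 3, 6],
                        'col4': [0, 4, 4]},
                  index=pd.Series(['A', 'B', 'C'], name='index'))

# for each row, pick the value in the column named by that row's flag
df['col_val'] = df.apply(lambda row: row[row['flag']], axis=1)
print(df)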

Python: Transform or Melt and Groupby a Dataframe

I have something like this:
df =
col1 col2 col3
0 B C A
1 E D G
2 NaN F B
EDIT: I need to convert it into this:
result =
  Name    location
0    B  col1, col3
1    C        col2
2    A        col3
3    E        col1
4    D        col2
5    G        col3
6    F        col2
Essentially I want a "location" telling me which column each "Name" is in. Thank you in advance.
Try melt and dropna:
>>> df.melt(var_name='location').dropna().groupby('value', sort=False, as_index=False).agg(', '.join)
value location
0 B col1, col3
1 E col1
2 C col2
3 D col2
4 F col2
5 A col3
6 G col3
The groupby and agg then combine the locations for each name.
Or an alternative with stack():
new = df.stack().reset_index().drop('level_0', axis=1).dropna()
new.columns = ['location', 'name']
prints:
  location name
0     col1    B
1     col2    C
2     col3    A
3     col1    E
4     col2    D
5     col3    G
6     col2    F
7     col3    B
EDIT:
To get your updated output you could use a groupby along with join():
new.groupby('name').agg({'location': lambda x: ', '.join(list(x))}).reset_index()
Which gives you:
  name    location
0    A        col3
1    B  col1, col3
2    C        col2
3    D        col2
4    E        col1
5    F        col2
6    G        col3
Try using melt to convert the columns to rows and give them a column name, then dropna to remove the NaN values:
df = df.melt(var_name="location", value_name="Name").dropna()
You can use pandas.melt and pandas.groupby.agg:
df = df.melt(var_name="location", value_name="Name").dropna()
new_df = df.groupby("Name", as_index=False).agg(",".join)
print(new_df)
Output:
Name location
0 A col3
1 B col1,col3
2 C col2
3 D col2
4 E col1
5 F col2
6 G col3
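
For a fully self-contained run of the last approach (the DataFrame construction below is my assumption, reproducing the values shown in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['B', 'E', np.nan],
                   'col2': ['C', 'D', 'F'],
                   'col3': ['A', 'G', 'B']})

melted = df.melt(var_name='location', value_name='Name').dropna()
result = melted.groupby('Name', as_index=False).agg(','.join)
print(result)   # matches the output shown above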

Fetch row data from known index in pandas

df1:
col1 col2
0 a 5
1 b 2
2 c 1
df2:
col1
0 qa0
1 qa1
2 qa2
3 qa3
4 qa4
5 qa5
final output:
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
Basically, in df1, col2 stores index positions for rows of another dataframe. I have to fetch that data from df2 and append it to df1, but I don't know how to fetch data by index number.
Use Series.map by another Series:
df1['col3'] = df1['col2'].map(df2['col1'])
Or use DataFrame.join with rename column:
df1 = df1.join(df2.rename(columns={'col1':'col3'})['col3'], on='col2')
print (df1)
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
You can use iloc to get the data and then to_numpy for the raw values:
df1["col3"] = df2.iloc[df1.col2].to_numpy()
df1
col1 col2 col3
0 a 5 qa5
1 b 2 qa2
2 c 1 qa1
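
A short note on why dropping the index matters here (my interpretation, not part of the original answer): the rows picked from df2 keep df2's index (5, 2, 1), so a plain assignment would align on df1's 0/1/2 index instead of matching by position. Selecting the column first also keeps the result one-dimensional:
# positional lookup keeps df2's original index (5, 2, 1) ...
picked = df2['col1'].iloc[df1['col2']]
# ... so drop it and assign by position
df1['col3'] = picked.to_numpy()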

How to multiply numeric elements of a dataframe by a scalar? [duplicate]

How to multiply all the numeric values in the data frame by a constant without having to specify column names explicitly? Example:
In [13]: df = pd.DataFrame({'col1': ['A','B','C'], 'col2':[1,2,3], 'col3': [30, 10,20]})
In [14]: df
Out[14]:
col1 col2 col3
0 A 1 30
1 B 2 10
2 C 3 20
I tried df.multiply but it affects the string values as well by concatenating them several times.
In [15]: df.multiply(3)
Out[15]:
col1 col2 col3
0 AAA 3 90
1 BBB 6 30
2 CCC 9 60
Is there a way to preserve the string values intact while multiplying only the numeric values by a constant?
You can use select_dtypes(), either including the number dtype or excluding all columns of object and datetime64 dtypes:
Demo:
In [162]: df
Out[162]:
col1 col2 col3 date
0 A 1 30 2016-01-01
1 B 2 10 2016-01-02
2 C 3 20 2016-01-03
In [163]: df.dtypes
Out[163]:
col1 object
col2 int64
col3 int64
date datetime64[ns]
dtype: object
In [164]: df.select_dtypes(exclude=['object', 'datetime']) * 3
Out[164]:
col2 col3
0 3 90
1 6 30
2 9 60
or a much better solution (c) ayhan:
df[df.select_dtypes(include=['number']).columns] *= 3
From docs:
To select all numeric types use the numpy dtype numpy.number
The other answer shows how to multiply only the numeric columns. Here's how to write the result back into the original dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'], 'col2': [1, 2, 3], 'col3': [30, 10, 20]})
s = df.select_dtypes(include=[np.number]) * 3
df[s.columns] = s
print (df)
col1 col2 col3
0 A 3 90
1 B 6 30
2 C 9 60
One way would be to get the dtypes, match them against object and datetime dtypes and exclude them with a mask, like so (the original used df.ix, which has since been removed from pandas; df.loc works the same way here) -
df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3
Sample run -
In [273]: df
Out[273]:
col1 col2 col3
0 A 1 30
1 B 2 10
2 C 3 20
In [274]: df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3
In [275]: df
Out[275]:
col1 col2 col3
0 A 3 90
1 B 6 30
2 C 9 60
This should work even over mixed types within columns but is likely slow over large dataframes.
def mul(x, y):
    try:
        return pd.to_numeric(x) * y
    except:
        # fall back to the original value for non-numeric cells
        return x

df.applymap(lambda x: mul(x, 3))
A simple solution using assign() and select_dtypes():
df.assign(**(df.select_dtypes('number') * 3))
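
Pulling the pieces together, a minimal self-contained sketch of the in-place variant (assuming only pandas):
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C'],
                   'col2': [1, 2, 3],
                   'col3': [30, 10, 20]})

# multiply only the numeric columns, leaving the strings untouched
num_cols = df.select_dtypes(include='number').columns
df[num_cols] *= 3
print(df)
#   col1  col2  col3
# 0    A     3    90
# 1    B     6    30
# 2    C     9    60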

How to do group by and take count of unique and count of some value as aggregate on same column in python pandas?

My question is related to my previous question, but it's different, so I am asking a new one.
In the question above, see the answer from @jezrael.
df = pd.DataFrame({'col1':[1,1,1],
'col2':[4,4,6],
'col3':[7,7,9],
'col4':[3,3,5]})
print (df)
col1 col2 col3 col4
0 1 4 7 3
1 1 4 7 3
2 1 6 9 5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
col4 col3 result_col
col1 col2
1 4 1 2 2.0
6 1 1 1.0
Now I also want to take a count of a specific value in col4. Say I want the count of col4 == 3 in the same query:
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'}) ... + count(col4 == 3)
How do I do this in the same query? I have tried the below but am not getting a solution.
df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique','col4':'x: lambda x[x == 7].count()'})
Do some preprocessing by including col4 == 3 as a column ahead of time, then aggregate:
df.assign(result_col=df.col4.eq(3).astype(int)).groupby(
['col1', 'col2']
).agg(dict(col3='size', col4='nunique', result_col='sum'))
col3 result_col col4
col1 col2
1 4 2 2 1
6 1 0 1
old answers
g = df.groupby(['col1', 'col2'])
g.agg({'col3':'size','col4': 'nunique'}).assign(
result_col=g.col4.apply(lambda x: x.eq(3).sum()))
col3 col4 result_col
col1 col2
1 4 2 1 2
6 1 1 0
slightly rearranged
g = df.groupby(['col1', 'col2'])
final_df = g.agg({'col3':'size','col4': 'nunique'})
final_df.insert(1, 'result_col', g.col4.apply(lambda x: x.eq(3).sum()))
final_df
col3 result_col col4
col1 col2
1 4 2 2 1
6 1 0 1
I think you need agg with a list of functions in the dict for column col4.
If you need to count the value 3, the simplest way is to sum the True values of x == 3:
df1 = df.groupby(['col1','col2']).agg({'col3': 'size',
                                       'col4': ['nunique', lambda x: (x == 3).sum()]})
df1 = df1.rename(columns={'<lambda>': 'count_3'})
df1.columns = ['{}_{}'.format(x[0], x[1]) for x in df1.columns]
print (df1)
col4_nunique col4_count_3 col3_size
col1 col2
1 4 1 2 2
6 1 0 1
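
On recent pandas versions (0.25+), named aggregation gives the same result without the column-renaming step; this is a sketch of my own, not from the original answers:
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 1],
                   'col2': [4, 4, 6],
                   'col3': [7, 7, 9],
                   'col4': [3, 3, 5]})

out = (df.groupby(['col1', 'col2'])
         .agg(col3_size=('col3', 'size'),
              col4_nunique=('col4', 'nunique'),
              col4_count_3=('col4', lambda x: (x == 3).sum())))
print(out)
#            col3_size  col4_nunique  col4_count_3
# col1 col2
# 1    4            2             1             2
#      6            1             1             0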
