How can I multiply all the numeric values in a data frame by a constant without having to specify column names explicitly? Example:
In [13]: df = pd.DataFrame({'col1': ['A','B','C'], 'col2':[1,2,3], 'col3': [30, 10,20]})
In [14]: df
Out[14]:
col1 col2 col3
0 A 1 30
1 B 2 10
2 C 3 20
I tried df.multiply, but it affects the string values as well, repeating them several times:
In [15]: df.multiply(3)
Out[15]:
col1 col2 col3
0 AAA 3 90
1 BBB 6 30
2 CCC 9 60
Is there a way to preserve the string values intact while multiplying only the numeric values by a constant?
You can use select_dtypes(), either including the number dtype or excluding all columns of object and datetime64 dtype:
Demo:
In [162]: df
Out[162]:
col1 col2 col3 date
0 A 1 30 2016-01-01
1 B 2 10 2016-01-02
2 C 3 20 2016-01-03
In [163]: df.dtypes
Out[163]:
col1 object
col2 int64
col3 int64
date datetime64[ns]
dtype: object
In [164]: df.select_dtypes(exclude=['object', 'datetime']) * 3
Out[164]:
col2 col3
0 3 90
1 6 30
2 9 60
or a much better solution (credit: ayhan):
df[df.select_dtypes(include=['number']).columns] *= 3
From the docs:
To select all numeric types use the numpy dtype numpy.number
The other answer shows how to multiply only the numeric columns. Here's how to write the result back into the original DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['A','B','C'], 'col2': [1,2,3], 'col3': [30, 10, 20]})
s = df.select_dtypes(include=[np.number]) * 3
df[s.columns] = s
print(df)
col1 col2 col3
0 A 3 90
1 B 6 30
2 C 9 60
One way would be to get the dtypes, match them against object and datetime dtypes and exclude them with a mask, like so -
df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3
Sample run -
In [273]: df
Out[273]:
col1 col2 col3
0 A 1 30
1 B 2 10
2 C 3 20
In [274]: df.loc[:, ~np.in1d(df.dtypes, ['object', 'datetime'])] *= 3
In [275]: df
Out[275]:
col1 col2 col3
0 A 3 90
1 B 6 30
2 C 9 60
This should work even over mixed types within a column, but it is likely to be slow over large dataframes.
def mul(x, y):
    try:
        return pd.to_numeric(x) * y
    except (ValueError, TypeError):
        return x

df.applymap(lambda x: mul(x, 3))
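Note: in pandas 2.1+, DataFrame.applymap is deprecated in favour of the equivalent DataFrame.map, so df.map(lambda x: mul(x, 3)) is the forward-compatible spelling.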
A simple solution using assign() and select_dtypes():
df.assign(**(df.select_dtypes('number') * 3))
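Note that assign() returns a new DataFrame rather than modifying df in place. A minimal runnable sketch on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'col1': ['A','B','C'], 'col2': [1,2,3], 'col3': [30, 10, 20]})
out = df.assign(**(df.select_dtypes('number') * 3))  # df itself is unchanged
print(out)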
I'm trying to replace values in a Pandas data frame, based on certain criteria on multiple columns. For a single column criteria this can be done very elegantly with a dictionary (e.g. Remap values in pandas column with a dict):
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2}, 'col2': {0:10, 1:20, 2:20}})
rdict = {1:'a', 2:'b'}
df2 = df.replace({"col1": rdict})
Input df:
col1 col2
0 1 10
1 1 20
2 2 20
Resulting df2:
col1 col2
0 a 10
1 a 20
2 b 20
I'm trying to extend this to criteria over multiple columns (e.g. where col1==1, col2==10 -> replace). For a single criteria this can be done like:
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'c'
Which results in a df3:
col1 col2
0 c 10
1 1 20
2 2 20
My real life problem has a large number of criteria, which would involve a large number of df3.loc[((criteria1)&(criteria2)), column] = value calls, which is far less elegant than the replacement using a dictionary as a "lookup table". Is it possible to extend the elegant solution (df2 = df.replace({"col1": rdict})) to a setup where values in one column are replaced based on criteria over multiple columns?
An example of what I'm trying to achieve (although in my real life case the number of criteria is a lot larger):
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
df3=df.copy()
df3.loc[((df['col1']==1)&(df['col2']==10)), 'col1'] = 'a'
df3.loc[((df['col1']==1)&(df['col2']==20)), 'col1'] = 'b'
df3.loc[((df['col1']==2)&(df['col2']==10)), 'col1'] = 'c'
df3.loc[((df['col1']==2)&(df['col2']==20)), 'col1'] = 'd'
Input df:
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
Resulting df3:
col1 col2
0 a 10
1 b 20
2 c 10
3 d 20
We can use merge.
Suppose your df looks like
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2, 4:2, 5:1}, 'col2': {0:10, 1:20, 2:10, 3:20, 4: 20, 5:10}})
col1 col2
0 1 10
1 1 20
2 2 10
3 2 20
4 2 20
5 1 10
And your conditional replacement can be represented as another dataframe:
df_replace
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
(As the OP (Bart) pointed out, you can save this table in a CSV file.)
Then you can use
df = df.merge(df_replace, on=["col1", "col2"], how="left")
col1 col2 val
0 1 10 a
1 1 20 b
2 2 10 c
3 2 20 d
4 2 20 d
5 1 10 a
Then you just need to drop col1 and rename val to col1.
As MaxU pointed out, there could be rows that do not get replaced, resulting in NaN. We can use a line like
df["val"] = df["val"].combine_first(df["col1"])
to fill in values from col1 where the merge produced NaN.
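Putting those pieces together, a minimal end-to-end sketch (assuming the df and df_replace shown above):
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 2, 2, 1], 'col2': [10, 20, 10, 20, 20, 10]})
df_replace = pd.DataFrame({'col1': [1, 1, 2, 2], 'col2': [10, 20, 10, 20],
                           'val': ['a', 'b', 'c', 'd']})

df = df.merge(df_replace, on=['col1', 'col2'], how='left')
df['val'] = df['val'].combine_first(df['col1'])  # keep col1 where no match
df = df.drop(columns='col1').rename(columns={'val': 'col1'})[['col1', 'col2']]
print(df)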
Demo:
Source DF:
In [120]: df
Out[120]:
col1 col2
0 1 10
1 1 10
2 1 20
3 1 20
4 2 10
5 2 20
6 3 30
Conditions & Replacements DF:
In [121]: cond
Out[121]:
col1 col2 repl
1 1 20 b
2 2 10 c
0 1 10 a
3 2 20 d
Solution:
In [121]: res = df.merge(cond, how='left')
yields:
In [122]: res
Out[122]:
col1 col2 repl
0 1 10 a
1 1 10 a
2 1 20 b
3 1 20 b
4 2 10 c
5 2 20 d
6 3 30 NaN # <-- NOTE
In [123]: res['col1'] = res.pop('repl').fillna(res['col1'])
In [124]: res
Out[124]:
col1 col2
0 a 10
1 a 10
2 b 20
3 b 20
4 c 10
5 d 20
6 3 30
This method is likely to be faster than the pandas-level alternatives, as it relies on plain numpy arrays and a dictionary lookup.
import pandas as pd
df = pd.DataFrame({'col1': {0:1, 1:1, 2:2, 3:2}, 'col2': {0:10, 1:20, 2:10, 3:20}})
rdict = {(1, 10): 'a', (1, 20): 'b', (2, 10): 'c', (2, 20): 'd'}
df['col1'] = list(map(rdict.get, [(x[0], x[1]) for x in df[['col1', 'col2']].values]))
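Note that rdict.get returns None for combinations missing from the dictionary; a variant (my addition, not from the answer) that keeps the original value instead:
df['col1'] = [rdict.get((c1, c2), c1) for c1, c2 in zip(df['col1'], df['col2'])]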
I have a huge dataset with thousands of rows and hundreds of columns. One of these columns contains a string somewhere; I know this because I am getting an error. All my columns are supposed to be float values, yet one of them has a str somewhere.
How can I loop through a particular column using Pandas and print only the rows that are of type str? I want to find out what the string(s) are so I can convert them to their numerical equivalents.
Using applymap with type
df = pd.DataFrame({'C1': [1,2,3,'4'], 'C2': [10, 20, '3',40]})
df.applymap(type)==str
Out[73]:
C1 C2
0 False False
1 False False
2 False True
3 True False
Here you can see which cells hold a str.
Then we use np.where to locate them:
np.where((df.applymap(type)==str))
Out[75]: (array([2, 3], dtype=int64), array([1, 0], dtype=int64))
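To pull out the offending cells themselves rather than raw positions, a small sketch (my addition) reusing the same mask:
import numpy as np

mask = df.applymap(type) == str
rows, cols = np.where(mask)
for r, c in zip(rows, cols):
    print(df.index[r], df.columns[c], repr(df.iat[r, c]))  # e.g. 2 C2 '3'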
If your goal is to convert everything to numerical values, then you can use this approach:
Sample DF:
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3)).add_prefix('col')
In [127]: df.loc[0,'col0'] = 'XXX'
In [128]: df
Out[128]:
col0 col1 col2
0 XXX 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [129]: df.dtypes
Out[129]:
col0 object
col1 int32
col2 int32
dtype: object
Solution:
In [130]: df.loc[:, df.dtypes.eq('object')] = df.loc[:, df.dtypes.eq('object')].apply(pd.to_numeric, errors='coerce')
In [131]: df
Out[131]:
col0 col1 col2
0 NaN 1 2
1 3.0 4 5
2 6.0 7 8
3 9.0 10 11
4 12.0 13 14
In [132]: df.dtypes
Out[132]:
col0 float64
col1 int32
col2 int32
dtype: object
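If you also want to see which raw values failed to parse, a small sketch (my addition) run against the original, pre-coercion frame:
s = df['col0']  # the object column, before coercion
bad = s[pd.to_numeric(s, errors='coerce').isna() & s.notna()]
print(bad)  # 0    XXX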
I'm trying to find the mean of values in different rows, grouped by similarities in other columns. Example:
In [14]: pd.DataFrame({'col1':[1,2,1,2], 'col2':['A','C','A','B'], 'col3':[1, 5, 6, 9]})
Out[14]:
col1 col2 col3
0 1 A 1
1 2 C 5
2 1 A 6
3 2 B 9
What I would like is to add a column with the means of col3, for all rows where the combination of col1 and col2 match. Desired output:
Out[14]:
col1 col2 col3 mean
0 1 A 1 3.5
1 2 C 5 5
2 1 A 6 3.5
3 2 B 9 9
I have tried several things with groupby in combination with apply but couldn't get proper results.
It's a transform, my man:
df['mean'] = df.groupby(['col1','col2']).col3.transform('mean')
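A minimal runnable sketch on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'col1': [1,2,1,2], 'col2': ['A','C','A','B'], 'col3': [1, 5, 6, 9]})
df['mean'] = df.groupby(['col1', 'col2'])['col3'].transform('mean')
print(df)  # mean column: 3.5, 5.0, 3.5, 9.0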
I have Three variables as
a='col1'
b='=='
c=2
I have Pandas dataframe as
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': [10, 11, 12]},dtype=object)
And I wanted to filter for col1=2, So I wrote
df.query("#a #b #c")
Which is throwing below error
File "<unknown>", line 1
__pd_eval_local_a __pd_eval_local_b __pd_eval_local_c
^
SyntaxError: invalid syntax
Can some one help me how to achieve this using these three variables?
Thanks,
In [214]: df
Out[214]:
col1 col2 col3
0 0 10 aaa
1 1 11 bbb
2 2 12 ccc
In [215]: a='col1'; b='=='; c=2 # <--- `c` is `int`
In [216]: df.query(a + b + '@c')
Out[216]:
col1 col2 col3
2 2 12 ccc
In [217]: a='col3'; b='=='; c='aaa' # <--- `c` is `str`
In [218]: df.query(a + b + '@c')
Out[218]:
col1 col2 col3
0 0 10 aaa
It will also work with datetime dtypes, as .query() takes care of the dtype conversion:
In [226]: df['col4'] = pd.date_range('2017-01-01', freq='99D', periods=len(df))
In [227]: df
Out[227]:
col1 col2 col3 col4
0 0 10 aaa 2017-01-01
1 1 11 bbb 2017-04-10
2 2 12 ccc 2017-07-18
In [228]: a='col4'; b='=='; c='2017-01-01'
In [229]: df.query(a + b + '@c')
Out[229]:
col1 col2 col3 col4
0 0 10 aaa 2017-01-01
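On Python 3.6+, an f-string (my phrasing, not from the answer) builds the same expression a bit more readably:
a, b, c = 'col1', '==', 2
df.query(f"{a} {b} @c")  # rows where col1 == 2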
You can do it without query, if that's not a requirement, using eval:
a='col1'
b='=='
c='2'
instruction ="df[\"" + a + "\"]" + b + c
>>>> 'df["col1"]==2'
df[eval(instruction)]
>>>>
col1 col2
2 2 12
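If the set of possible operators is known in advance, a sketch (my addition) with an explicit operator mapping avoids eval entirely:
import operator

ops = {'==': operator.eq, '!=': operator.ne, '<': operator.lt, '>': operator.gt}
a, b, c = 'col1', '==', 2
df[ops[b](df[a], c)]  # same result, no string evaluation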
Say I have two tables, an original and a reference:
import pandas as pd

print("Original Data Frame")
# Create a dataframe
oldcols = {'col1':['a','a','b','b'], 'col2':['c','d','c','d'], 'col3':[1,2,3,4]}
a = pd.DataFrame(oldcols)
print("Original Table:")
print(a)
print("Reference Table:")
b = pd.DataFrame({'col1':['x','x'], 'col2':['c','d'], 'col3':[10,20]})
print(b)
Where the tables look like this:
Original Data Frame
Original Table:
col1 col2 col3
0 a c 1
1 a d 2
2 b c 3
3 b d 4
Reference Table:
col1 col2 col3
0 x c 10
1 x d 20
Now I want to add to the third column (col3) of the original table (a) the value from the reference table (b), taken from the row where the second columns of the two tables match. So the first row of table a should have the value 10 added to its third column, because the row of table b where col2 is 'c' has a value of 10 in col3. Make sense? Here's some code that does that:
col3 = []
for ix, row in a.iterrows():
    col3 += [row[2] + b[b['col2'] == row[1]]['col3']]
a['col3'] = col3
print("Output Table:")
print(a)
Yielding the following output:
Output Table:
col1 col2 col3
0 a c [11]
1 a d [22]
2 b c [13]
3 b d [24]
My question is, is there a more elegant way to do this? Also, the results in 'col3' should not be lists. Solutions using numpy are also welcome.
I did not quite understand your description of what you are trying to do, but the output you have shown can be generated by first merging the two data frames and then applying some simple operations:
>>> df = a.merge(b.filter(['col2', 'col3']), how='left',
left_on='col2', right_on='col2', suffixes=('', '_'))
>>> df
col1 col2 col3 col3_
0 a c 1 10
1 b c 3 10
2 a d 2 20
3 b d 4 20
[4 rows x 4 columns]
>>> df['col3_'] = df['col3_'].fillna(0)  # in case there are no matches
>>> df.col3 += df.col3_
>>> df
col1 col2 col3 col3_
0 a c 11 10
1 b c 13 10
2 a d 22 20
3 b d 24 20
[4 rows x 4 columns]
>>> df.drop('col3_', axis=1, inplace=True)
>>> df
col1 col2 col3
0 a c 11
1 b c 13
2 a d 22
3 b d 24
[4 rows x 3 columns]
If values in col2 of b are not unique, you probably also need to aggregate them first, with something like:
>>> b.groupby('col2', as_index=False)['col3'].aggregate('sum')
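If col2 values in b are unique (or once they have been aggregated as above), a map-based variant (my addition, not from the answer) avoids the merge bookkeeping entirely:
>>> a['col3'] += a['col2'].map(b.set_index('col2')['col3']).fillna(0)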