This might be a simple question but I tried searching for the answer and I can't seem to find it.
I have a pandas dataframe and I want to change the value of some cells to a tuple.
So if I have this:
Col0 Col1 Col2
3 a 6
7 b 8
I would like to change all the "a" values to a tuple:
Col0 Col1 Col2
3 (4,5) 6
7 b 8
I tried doing this:
df.loc[df["Col1"] == "a"] = (4,5,)
but it obviously didn't work. How can I do this?
Let's try boolean indexing with loc to update the cells in Col1 that contain 'a':
m = df['Col1'].eq('a')
df.loc[m, 'Col1'] = pd.Series([(4, 5)]*m.sum(), index=m[m].index)
Alternatively, you can try .reindex with the fill_value parameter set to the tuple (4, 5):
m = df['Col1'].eq('a')
df['Col1'] = df.loc[~m, 'Col1'].reindex(m.index, fill_value=(4, 5))
Col0 Col1 Col2
0 3 (4, 5) 6
1 7 b 8
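If the index-alignment trick feels heavy, a map over the column is another option, since Series.map stores a returned tuple as a single cell value. A minimal sketch, assuming the same df as above:
# map 'a' to the tuple; every other value passes through unchanged
df['Col1'] = df['Col1'].map(lambda v: (4, 5) if v == 'a' else v)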
Maybe something like this?
import pandas as pd
df = pd.DataFrame(data = {'Col0': [3,7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
df.at[0, 'Col1'] = (4, 5)  # .at replaces set_value, which was removed in pandas 1.0
Or, if you don't know where 'a' is (and in case there is more than one 'a' per column), you can do it with a loop:
import pandas as pd
import numpy as np
df = pd.DataFrame(data = {'Col0': [3,7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
# find the position of a in 'Col1'
where_a = np.where(df['Col1'] == 'a')[0]
# replace a with tuple (4, 5)
for x in where_a:
    df.at[x, 'Col1'] = (4, 5)  # again, .at instead of the removed set_value
Col0 Col1 Col2
0 3 (4, 5) 6
1 7 b 8
import pandas as pd
import numpy as np
dict_1 = {'col1':range(0,21),'col3':0}
dict_2 = {'col2':range(0,41,4)}
df = pd.DataFrame(dict_1)
df_2 = pd.DataFrame(dict_2)
The goal is to compare every value in df['col1'] against df_2['col2'], and for each matching row update col3 to 1 (or some value other than what it is now).
In this case I am looking for something of the sort:
df = pd.DataFrame({'col1': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                   'col3': [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]})
One option is isin + np.where
df['col3'] = np.where(df['col1'].isin(df_2['col2']), 1, 0)
df.head():
col1 col3
0 0 1
1 1 0
2 2 0
3 3 0
4 4 1
Edit: to include col4, use a left merge, then np.where to set col3 based on where col4 (the values from df_2) is NaN:
Given
df = pd.DataFrame({'col1': range(0, 21), 'col3': 0})
df_2 = pd.DataFrame({'col2': [0, 4, 8, 12, 16],
'col4': ['a', 'b', 'c', 'd', 'e']})
df = df.merge(
    df_2, left_on='col1', right_on='col2', how='left'
).drop(columns='col2')
# test col4 for NaN before filling it, otherwise isna() never matches
df['col3'] = np.where(df['col4'].isna(), 0, 1)
df['col4'] = df['col4'].fillna('')
df.head():
col1 col3 col4
0 0 1 a
1 1 0
2 2 0
3 3 0
4 4 1 b
You can do it directly with pandas:
df['col3'] = df['col1'].isin(df_2['col2']).astype(int)
I'm trying to add up numbers from one column only when another column matches a given string.
For instance, add the ints in col2 when col1 is 'A'.
import pandas as pd
d = {'col1': ['A', 'B', 'A', 'B'], 'col2': [1, 2, 1, 2]}
df = pd.DataFrame(data=d)
df
Use DataFrame.loc + GroupBy.transform:
df.loc[df.col1.eq('A'), 'col2'] = df.groupby('col1')['col2'].transform('sum')
print(df)
Or, keeping the group sum where col1 is 'A' and the original col2 elsewhere:
new_df = df.assign(col2=df.groupby('col1')['col2'].transform('sum')
                         .where(df.col1.eq('A'), df['col2']))
Output
col1 col2
0 A 2
1 B 2
2 A 2
3 B 2
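If you only need the total rather than writing it back into the frame, a boolean mask plus sum is enough (a sketch, starting from the original df):
# sum col2 only on the rows where col1 equals 'A'
total = df.loc[df['col1'].eq('A'), 'col2'].sum()
print(total)  # 1 + 1 = 2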
I'm converting code from R to Python and am looking for some help with mutating a new column based on other columns, using dfply syntax/piping
In this example, I want to subtract 2 from col1 if col2 is 'c', otherwise add 4
import pandas as pd
import numpy as np
from dfply import *
col1 = [1,2,3,4,5]
col2 = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(data = {'col1': col1, 'col2': col2})
in R I would do:
df_new <- df %>%
mutate(newCol = ifelse(col2 == 'c', col1 - 2, col1 + 4))
but Python doesn't seem to like this:
new_df = (df >>
mutate(newCol = np.where(X.col2 == 'c', X.col1 - 2, X.col1 + 4)))
I get an error of "invalid __array_struct__"
Note that this works fine:
new_df = (df >>
mutate(newCol = X.col1 - 2))
I will use an apply/lambda function. X is the dataframe row, and axis=1 means the lambda function is applied per row.
df['newCol'] = df.apply(lambda X: X.col1 - 2 if X.col2 == 'c' else X.col1 + 4, axis=1)
df
col1 col2 newCol
0 1 a 5
1 2 b 6
2 3 c 1
3 4 d 8
4 5 e 9
The Python equivalent here would be an inline if/else expression (the conditional, or ternary, operator):
ifelse(col2 == 'c', col1 - 2, col1 + 4)
Would then become
col1 - 2 if col2 == 'c' else col1 + 4
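To stay vectorized instead of going row by row with apply, plain np.where outside the dfply pipe works as well. A minimal sketch, assuming the df from the question:
import numpy as np
# evaluates both branches, then picks per element based on the condition
df['newCol'] = np.where(df['col2'] == 'c', df['col1'] - 2, df['col1'] + 4)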
How do I get all the column names where the values in the column are 'f' or 't' into an array?
df['FTI'].value_counts()
Instead of this single 'FTI' I need an array of the matching columns. Is it possible?
Reproducible example:
df = pd.DataFrame({'col1':[1,2,3], 'col2':['f', 'f', 'f'], 'col3': ['t','t','t'], 'col4':['d','d','d']})
col1 col2 col3 col4
0 1 f t d
1 2 f t d
2 3 f t d
Using eq and all:
>>> s = (df.eq('t') | df.eq('f')).all()
col1 False
col2 True
col3 True
col4 False
dtype: bool
To get the names:
>>> s[s].index.values
array(['col2', 'col3'], dtype=object)
To get the (1-based) column positions:
>>> np.flatnonzero(s) + 1
array([2, 3])
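And if you want the matching columns themselves rather than their names, the same boolean mask can index the columns directly (a small sketch):
>>> df.loc[:, s]
  col2 col3
0    f    t
1    f    t
2    f    t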
Yes, it is possible. Here is one way.
You can get the columns like this:
cols = []
for col in df.columns:
    # isin works on any dtype; .str.contains would raise on the numeric col1
    # and would also match 'f' or 't' as substrings of longer strings
    if df[col].isin(['f', 't']).any():
        cols.append(col)
Then you can just use this for the frequencies:
f = pd.Series(dtype='int64')
for col in cols:
    f = pd.concat([f, df[col].value_counts()])
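A loop-free sketch of the same idea (assuming cols from above): apply value_counts to just those columns in one go.
counts = df[cols].apply(pd.Series.value_counts)
which gives one column of counts per matching column, something like:
   col2  col3
f   3.0   NaN
t   NaN   3.0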
Here is a very simple dataframe:
df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
I'm trying to remove rows where there are duplicate values (e.g., row 3)
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.
If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
col1 col2
0 1 1
1 2 3
2 3 3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
col1 col2
1 2 3
What about:
In [43]: df = pd.DataFrame({'col1' :[1,2,3],
'col2' :[1,3,3] })
In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
col1 col2
1 2 3
[1 rows x 2 columns]
We want to remove rows whose values show up in all columns; in other words, the values are all equal, so their minimums and maximums are equal. This method works on a DataFrame with any number of columns. Applying the above, we remove rows 0 and 2.
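Another way to express the same row-wise check (a sketch, not from the answer above) is to count the distinct values per row with nunique:
# keep rows whose values are not all identical
df[df.nunique(axis=1) > 1]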
Any row with all the same values will have zero as its standard deviation. One way to filter them is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [1, 3, 3, np.nan]})
>>> df.loc[df.std(axis=1, skipna=False) > 0]
   col1  col2
1   2.0   3.0