Replace a cell's value in pandas with a tuple - python

This might be a simple question but I tried searching for the answer and I can't seem to find it.
I have a pandas dataframe and I want to change the value of some cells to a tuple.
So if I have this:
Col0 Col1 Col2
3 a 6
7 b 8
I would like to change all the "a" values to a tuple:
Col0 Col1 Col2
3 (4,5) 6
7 b 8
I tried doing this:
df.loc[df["Col1"] == "a"] = (4,5,)
but it obviously didn't work. How can I do this?

Let's try boolean indexing with loc to update the cells in Col1 that contain 'a':
m = df['Col1'].eq('a')
df.loc[m, 'Col1'] = pd.Series([(4, 5)]*m.sum(), index=m[m].index)
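For a self-contained run of that approach (rebuilding the sample frame so the snippet stands alone):
import pandas as pd

df = pd.DataFrame({'Col0': [3, 7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
m = df['Col1'].eq('a')
# wrapping the tuples in a Series makes pandas assign each tuple as a single
# object instead of unpacking it across the selection
df.loc[m, 'Col1'] = pd.Series([(4, 5)] * m.sum(), index=m[m].index)
print(df)
#    Col0    Col1  Col2
# 0     3  (4, 5)     6
# 1     7       b     8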
Alternatively, you can try .reindex with the fill_value parameter set to the tuple (4, 5):
m = df['Col1'].eq('a')
df['Col1'] = df.loc[~m, 'Col1'].reindex(m.index, fill_value=(4, 5))
Col0 Col1 Col2
0 3 (4, 5) 6
1 7 b 8
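A third option, if readability matters more than vectorization, is a conditional map (a sketch, starting again from the original frame):
# replace 'a' with the tuple, leave every other value untouched
df['Col1'] = df['Col1'].map(lambda v: (4, 5) if v == 'a' else v)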

Maybe something like this, setting the cell directly with .at (the modern replacement for the removed set_value)?
import pandas as pd
df = pd.DataFrame(data={'Col0': [3, 7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
df.at[0, 'Col1'] = (4, 5)
Or, if you don't know where 'a' is (and in case there is more than one 'a' per column), you can do it with a loop:
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'Col0': [3, 7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
# find the positions of 'a' in Col1
where_a = np.where(df['Col1'] == 'a')[0]
# replace 'a' with the tuple (4, 5); np.where returns positions,
# so df.index[x] translates each position into a row label for .at
for x in where_a:
    df.at[df.index[x], 'Col1'] = (4, 5)
Col0 Col1 Col2
0 3 (4, 5) 6
1 7 b 8

Related

How can I update values of a column in a dataframe based on its existence in another dataframe

import pandas as pd
import numpy as np
from numpy.random import randint
from IPython.display import display, HTML
dict_1 = {'col1':range(0,21),'col3':0}
dict_2 = {'col2':range(0,41,4)}
df = pd.DataFrame(dict_1)
df_2 = pd.DataFrame(dict_2)
So the goal is to compare every value in df['col1'] against df_2['col2'], and for each row where they match, update col3 to 1 (or some value other than what it is now).
In this case I am looking for something of the sort:
df = pd.DataFrame({'col1': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                   'col3': [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]})
One option is isin + np.where
df['col3'] = np.where(df['col1'].isin(df_2['col2']), 1, 0)
df.head():
col1 col3
0 0 1
1 1 0
2 2 0
3 3 0
4 4 1
Edit: to include col4, do a left merge, then use np.where to set col3 based on where col4 (the values from df_2) is NaN:
Given
df = pd.DataFrame({'col1': range(0, 21), 'col3': 0})
df_2 = pd.DataFrame({'col2': [0, 4, 8, 12, 16],
                     'col4': ['a', 'b', 'c', 'd', 'e']})
df = df.merge(
    df_2, left_on='col1', right_on='col2', how='left'
).drop(columns='col2')
# flag the matches before filling the NaNs, otherwise isna() finds nothing
df['col3'] = np.where(df['col4'].isna(), 0, 1)
df['col4'] = df['col4'].fillna('')
df.head():
col1 col3 col4
0 0 1 a
1 1 0
2 2 0
3 3 0
4 4 1 b
You can do it directly with pandas:
df['col3'] = df['col1'].isin(df_2['col2']).astype(int)
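isin returns a boolean Series, and astype(int) simply maps True/False to 1/0; a sketch of the intermediate step, using the frames defined above:
matches = df['col1'].isin(df_2['col2'])  # True where col1 appears in df_2['col2']
df['col3'] = matches.astype(int)         # True -> 1, False -> 0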

How do I add from one column only when string matches in another?

Trying to add up numbers from one column only when another column matches a string.
For instance, add the ints in col2 when col1 is 'A'.
import pandas as pd
d = {'col1': ['A', 'B', 'A', 'B'], 'col2': [1, 2, 1, 2]}
df = pd.DataFrame(data=d)
df
Use DataFrame.loc + GroupBy.transform
df.loc[df.col1.eq('A'), 'col2'] = df.groupby('col1')['col2'].transform('sum')
print(df)
Or:
new_df = df.assign(col2=df.groupby('col1')['col2'].transform('sum')
                         .where(df.col1.eq('A'), df.col2))
Output
col1 col2
0 A 2
1 B 2
2 A 2
3 B 2
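If all you need is the total itself, rather than writing it back into the frame, a plain mask-and-sum is enough (a minimal sketch):
total = df.loc[df.col1.eq('A'), 'col2'].sum()  # sum col2 where col1 == 'A'
print(total)  # 2 for the original sample data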

Python equivalent to dplyr's ifelse

I'm converting code from R to Python and am looking for some help with mutating a new column based on other columns, using dfply syntax/piping
In this example, I want to subtract 2 from col1 if col2 is 'c', otherwise add 4
import pandas as pd
import numpy as np
from dfply import *
col1 = [1,2,3,4,5]
col2 = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(data = {'col1': col1, 'col2': col2})
in R I would do:
df_new <- df %>%
  mutate(newCol = ifelse(col2 == 'c', col1 - 2, col1 + 4))
but Python doesn't seem to like this:
new_df = (df >>
          mutate(newCol = np.where(X.col2 == 'c', X.col1 - 2, X.col1 + 4)))
I get an error of "invalid __array_struct__"
Note that this works fine:
new_df = (df >>
          mutate(newCol = X.col1 - 2))
I will use apply with a lambda function. Here X is a row of the dataframe, and axis=1 means the lambda is applied to each row.
df['newCol'] = df.apply(lambda X: X.col1 - 2 if X.col2 == 'c' else X.col1 + 4, axis=1)
df
col1 col2 newCol
0 1 a 5
1 2 b 6
2 3 c 1
3 4 d 8
4 5 e 9
The Python equivalent here would be an inline if/else (conditional) expression:
ifelse(col2 == 'c', col1 - 2, col1 + 4)
Would then become
col1 - 2 if col2 == 'c' else col1 + 4
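A conditional expression works on scalars (for example inside apply). For a vectorized equivalent of dplyr's ifelse in plain pandas, outside dfply's piping, np.where is the usual tool (a sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': ['a', 'b', 'c', 'd', 'e']})
# subtract 2 where col2 == 'c', otherwise add 4
df['newCol'] = np.where(df['col2'] == 'c', df['col1'] - 2, df['col1'] + 4)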

Pandas dataframe get columns names and value_counts

How do I get into an array all the column names whose values are 'f' or 't'?
df['FTI'].value_counts()
Instead of the hard-coded 'FTI' I need an array of the matching columns. Is that possible?
Reproducible example:
df = pd.DataFrame({'col1':[1,2,3], 'col2':['f', 'f', 'f'], 'col3': ['t','t','t'], 'col4':['d','d','d']})
col1 col2 col3 col4
0 1 f t d
1 2 f t d
2 3 f t d
Then, using eq and all:
>>> s = (df.eq('t') | df.eq('f')).all()
col1 False
col2 True
col3 True
col4 False
dtype: bool
To get the names:
>>> s[s].index.values
array(['col2', 'col3'], dtype=object)
To get the (1-based) positions:
>>> np.flatnonzero(s) + 1
array([2, 3])
Yes, it is possible. Here is one way.
You can get the columns like this:
cols = []
for col in df.columns:
    # isin handles non-string columns, where .str.contains would raise on col1
    if df[col].isin(['f', 't']).any():
        cols.append(col)
Then you can just use this for the frequencies:
f = pd.Series(dtype='int64')
for col in cols:
    f = pd.concat([f, df[col].value_counts()])
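As a more compact sketch of the same idea, apply can run value_counts over just those columns in one call, yielding a small DataFrame of counts per column:
counts = df[cols].apply(lambda s: s.value_counts())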

Remove rows where values appear in all columns in Pandas

Here is a very simple dataframe:
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [1, 3, 3]})
I'm trying to remove rows where there are duplicate values (e.g., row 3)
This doesn't work,
df = df[(df.col1 != 3 & df.col2 != 3)]
and the documentation is pretty clear about why, which makes sense.
But I still don't know how to delete that row.
Does anyone have any ideas? Thanks. Monica.
If I understand your question correctly, I think you were close.
Starting from your data:
In [20]: df
Out[20]:
col1 col2
0 1 1
1 2 3
2 3 3
And doing this:
In [21]: df = df[df['col1'] != df['col2']]
Returns:
In [22]: df
Out[22]:
col1 col2
1 2 3
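The same filter can also be written with DataFrame.query, if you prefer the expression syntax (a sketch for this two-column case):
df = df.query('col1 != col2')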
What about:
In [43]: df = pd.DataFrame({'col1': [1, 2, 3],
                            'col2': [1, 3, 3]})
In [44]: df[df.max(axis=1) != df.min(axis=1)]
Out[44]:
col1 col2
1 2 3
[1 rows x 2 columns]
We want to remove rows whose values show up in all columns; in other words, rows where the values are all equal, so their minimum and maximum are equal. This method works on a DataFrame with any number of columns. Applying the above removes rows 0 and 2.
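An equivalent check (a sketch on the same frame) counts the distinct values per row:
df[df.nunique(axis=1) > 1]  # keep rows with more than one distinct value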
Any row whose values are all the same will have a standard deviation of zero. One way to filter these rows out is as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1': [1, 2, 3, np.nan],
                   'col2': [1, 3, 3, np.nan]})
>>> df.loc[df.std(axis=1, skipna=False) > 0]
col1 col2
1 2.0 3.0
