I'm converting code from R to Python and am looking for some help with mutating a new column based on other columns, using dfply syntax/piping
In this example, I want to subtract 2 from col1 if col2 is 'c', otherwise add 4
import pandas as pd
import numpy as np
from dfply import *
col1 = [1,2,3,4,5]
col2 = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(data = {'col1': col1, 'col2': col2})
In R I would do:
df_new <- df %>%
mutate(newCol = ifelse(col2 == 'c', col1 - 2, col1 + 4))
but Python doesn't seem to like this:
new_df = (df >>
mutate(newCol = np.where(X.col2 == 'c', X.col1 - 2, X.col1 + 4)))
I get an error of "invalid __array_struct__"
Note that this works fine:
new_df = (df >>
mutate(newCol = X.col1 - 2))
I will use an apply/lambda function. Here X is a row of the dataframe, and axis=1 means the lambda is applied to each row (not each column).
df['newCol'] = df.apply(lambda X: X.col1 - 2 if X.col2 == 'c' else X.col1 + 4, axis=1)
df
col1 col2 newCol
0 1 a 5
1 2 b 6
2 3 c 1
3 4 d 8
4 5 e 9
The Python equivalent here would be an inline if/else expression (the ternary operator):
ifelse(col2 == 'c', col1 - 2, col1 + 4)
Would then become
col1 - 2 if col2 == 'c' else col1 + 4
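If you don't need to stay inside the dfply pipe, a plain vectorized np.where on the real columns sidesteps the error, which seems to come from np.where receiving dfply's lazy X expressions instead of actual Series. A minimal sketch, assuming the same df as above:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': ['a', 'b', 'c', 'd', 'e']})
# np.where sees real Series here, so both branches evaluate element-wise
df['newCol'] = np.where(df['col2'] == 'c', df['col1'] - 2, df['col1'] + 4)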
import pandas as pd
import numpy as np
from numpy.random import randint
from IPython.display import display, HTML
dict_1 = {'col1':range(0,21),'col3':0}
dict_2 = {'col2':range(0,41,4)}
df = pd.DataFrame(dict_1)
df_2 = pd.DataFrame(dict_2)
So the goal is to compare each value in df['col1'] against df_2['col2'], and for every row where there is a match I need to update col3 to 1 (or some value other than what it is now).
In this case I am looking for something of the sort:
df = pd.DataFrame({'col1': [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], 'col3': [1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,1]})
One option is isin + np.where
df['col3'] = np.where(df['col1'].isin(df_2['col2']), 1, 0)
df.head():
col1 col3
0 0 1
1 1 0
2 2 0
3 3 0
4 4 1
Edit to include col4 with left merge and use np.where to set values based on where col4 (values from df2) is nan:
Given
df = pd.DataFrame({'col1': range(0, 21), 'col3': 0})
df_2 = pd.DataFrame({'col2': [0, 4, 8, 12, 16],
'col4': ['a', 'b', 'c', 'd', 'e']})
df = df.merge(
    df_2, left_on='col1', right_on='col2', how='left'
).drop(columns='col2')
# set col3 from the NaNs before filling them, otherwise isna() finds nothing
df['col3'] = np.where(df['col4'].isna(), 0, 1)
df['col4'] = df['col4'].fillna('')
df.head():
col1 col3 col4
0 0 1 a
1 1 0
2 2 0
3 3 0
4 4 1 b
You can do it directly with pandas:
df['col3'] = df['col1'].isin(df_2['col2']).astype(int)
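For reference, a small end-to-end sketch of that line with the frames from the question; isin returns a boolean Series and astype(int) turns True/False into 1/0:
import pandas as pd
df = pd.DataFrame({'col1': range(0, 21), 'col3': 0})
df_2 = pd.DataFrame({'col2': range(0, 41, 4)})
# True where col1 appears in df_2['col2'], then cast the booleans to 1/0
df['col3'] = df['col1'].isin(df_2['col2']).astype(int)
print(df.head())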
This might be a simple question but I tried searching for the answer and I can't seem to find it.
I have a pandas dataframe and I want to change the value of some cells to a tuple.
So if I have this:
Col0 Col1 Col2
3 a 6
7 b 8
I would like to change all the "a" values to a tuple:
Col0 Col1 Col2
3 (4,5) 6
7 b 8
I tried doing this:
df.loc[df["Col1"] == "a"] = (4,5,)
but it obviously didn't work. How can I do this?
Let's try boolean indexing with loc to update the values of the cells in Col1 that contain 'a':
m = df['Col1'].eq('a')
df.loc[m, 'Col1'] = pd.Series([(4, 5)]*m.sum(), index=m[m].index)
Alternatively you can try .reindex with fill_value parameter set to tuple (4, 5):
m = df['Col1'].eq('a')
df['Col1'] = df.loc[~m, 'Col1'].reindex(m.index, fill_value=(4, 5))
Col0 Col1 Col2
0 3 (4, 5) 6
1 7 b 8
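Another option (a sketch, not taken from the answers above) is to map the column element-wise with a plain Python function, which avoids pandas trying to broadcast the tuple during assignment:
import pandas as pd
df = pd.DataFrame({'Col0': [3, 7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
# replace every 'a' in Col1 with the tuple (4, 5); other values pass through unchanged
df['Col1'] = df['Col1'].apply(lambda v: (4, 5) if v == 'a' else v)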
Maybe something like this?
import pandas as pd
df = pd.DataFrame(data = {'Col0': [3,7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
df.at[0, 'Col1'] = (4, 5)  # .at sets a single cell (set_value was removed in newer pandas)
Or, if you don't know where 'a' is (and in case there is more than one 'a' per column), you can do it with a loop:
import pandas as pd
import numpy as np
df = pd.DataFrame(data = {'Col0': [3,7], 'Col1': ['a', 'b'], 'Col2': [6, 8]})
# find the position of a in 'Col1'
where_a = np.where(df['Col1'] == 'a')[0]
# replace a with tuple (4, 5)
for x in where_a:
    df.at[x, 'Col1'] = (4, 5)
Col0 Col1 Col2
0 3 (4, 5) 6
1 7 b 8
Trying to add up numbers from one column only when another column matches a given string.
For instance, add the ints in col2 when col1 is 'A'.
import pandas as pd
d = {'col1': ['A', 'B', 'A', 'B'], 'col2': [1, 2, 1, 2]}
df = pd.DataFrame(data=d)
df
Use DataFrame.loc + GroupBy.transform
df.loc[df.col1.eq('A'), 'col2'] = df.groupby('col1')['col2'].transform('sum')
print(df)
Or
new_df = df.assign(col2=df['col2'].where(df.col1.ne('A'),
                                         df.groupby('col1')['col2'].transform('sum')))
Output
col1 col2
0 A 2
1 B 2
2 A 2
3 B 2
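If the goal is only the single total rather than writing it back into the frame, a boolean mask plus .sum() is enough; a minimal sketch with the same data:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'B', 'A', 'B'], 'col2': [1, 2, 1, 2]})
# sum col2 only over the rows where col1 equals 'A' (1 + 1 = 2)
total = df.loc[df['col1'].eq('A'), 'col2'].sum()
print(total)  # 2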
I have a data frame like this
col1 col2
[A, B] 1
[A, C] 2
I would like to separate col1 into two columns, and I would like the output in this form:
col1_A col1_B col2
A B 1
A C 2
I have tried df['col1'].str.rsplit(',', n=2, expand=True),
but it showed TypeError: list indices must be integers or slices, not str
join + pop
df = df.join(pd.DataFrame(df.pop('col1').values.tolist(),
columns=['col1_A', 'col1_B']))
print(df)
col2 col1_A col1_B
0 1 A B
1 2 A C
It's good practice to try and avoid pd.Series.apply, which often amounts to a Python-level loop with additional overhead.
You can use apply:
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
df['col1_A'] = df['col1'].apply(lambda x: x[0])
df['col1_B'] = df['col1'].apply(lambda x: x[1])
del df['col1']
df = df[df.columns[[1,2,0]]]
print(df)
col1_A col1_B col2
0 A B 1
1 A C 2
You can do this:
df_expanded = df['col1'].apply(pd.Series).rename(
    columns=lambda x: 'col1_' + str(x))
df_expanded
col1_0 col1_1
0 A B
1 A C
Adding these columns to the original dataframe:
pd.concat([df_expanded, df], axis=1).drop('col1', axis=1)
col1_0 col1_1 col2
0 A B 1
1 A C 2
If the columns need to be named after the values in the first row:
df_expanded.columns = ['col1_' + value
for value in df_expanded.iloc[0,:].values.tolist()]
col1_A col1_B
0 A B
1 A C
Zip the values with the column names and use insert to get the right position.
for ind, (k, v) in enumerate(zip(zip(*df.pop('col1').tolist()), ['col1_A', 'col1_B'])):
    df.insert(ind, v, k)
Full example
import pandas as pd
df = pd.DataFrame({
"col1": [['A', 'B'], ['A', 'C']],
"col2": [1, 2],
})
for ind, (k, v) in enumerate(zip(zip(*df.pop('col1').tolist()), ['col1_A', 'col1_B'])):
    df.insert(ind, v, k)
print(df)
Returns:
col1_A col1_B col2
0 A B 1
1 A C 2
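A related sketch (not one of the answers above): build the expanded columns from the list values and concatenate them back, which also avoids apply:
import pandas as pd
df = pd.DataFrame({
    "col1": [['A', 'B'], ['A', 'C']],
    "col2": [1, 2],
})
# turn the list column into its own two-column frame, keeping the original index
expanded = pd.DataFrame(df['col1'].tolist(), columns=['col1_A', 'col1_B'], index=df.index)
df = pd.concat([expanded, df.drop(columns='col1')], axis=1)
print(df)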
I need to change individual elements in a DataFrame. I tried doing something like this, but it doesn't work:
for index, row in df.iterrows():
    if df.at[row, index] == 'something':
        df.at[row, index] = df.at[row, index] + 'add a string'
    else:
        df.at[row, index] = df.at[row, index] + 'add a value'
How can I do that?
If you need to modify all columns in the DataFrame, use numpy.where together with the DataFrame constructor, because where returns a numpy array:
df = pd.DataFrame(np.where(df == 'something', df + 'add a string', df + 'add a value'),
index=df.index,
columns=df.columns)
If only one column col needs to change:
df['col'] = np.where(df['col'] == 'something',
df['col'] + 'add a string',
df['col'] + 'add a value')
Sample:
df = pd.DataFrame({'col': ['a', 'b', 'a'], 'col1': ['a', 'b', 'b']})
print (df)
col col1
0 a a
1 b b
2 a b
df = pd.DataFrame(np.where(df == 'a', df + 'add a string', df + 'add a value'),
index=df.index,
columns=df.columns)
print (df)
col col1
0 aadd a string aadd a string
1 badd a value badd a value
2 aadd a string badd a value
df['col'] = np.where(df['col'] == 'a',
df['col'] + 'add a string',
df['col'] + 'add a value')
print (df)
col col1
0 aadd a string a
1 badd a value b
2 aadd a string b
You can use .loc and apply a function like this:
import pandas as pd
D = pd.DataFrame({'A': ['a', 'b', 3,7,'b','a'], 'B': ['a', 'b', 3,7,'b','a']})
D.loc[D.index % 2 == 0, 'A'] = D.loc[D.index % 2 == 0, 'A'].apply(lambda s: s + 'x' if isinstance(s, str) else s + 1)
D.loc[D.index[2:5], 'B'] = D.loc[D.index[2:5], 'B'].apply(lambda s: s + 'y' if isinstance(s, str) else s - 1)
The first example appends 'x' to each string, or alternatively adds 1 to each non-string, in column A for every even index.
The second example appends 'y' to each string, or alternatively subtracts 1 from each non-string, in column B for the indices 2, 3 and 4.
Original Frame:
A B
0 a a
1 b b
2 3 3
3 7 7
4 b b
5 a a
Modified Frame:
A B
0 ax a
1 b b
2 4 2
3 7 6
4 bx by
5 a a
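If the positional selection isn't needed and the whole column should get the same string-vs-number treatment, the same idea can be written with Series.map; a sketch assuming the frame above:
import pandas as pd
D = pd.DataFrame({'A': ['a', 'b', 3, 7, 'b', 'a'], 'B': ['a', 'b', 3, 7, 'b', 'a']})
# append 'x' to strings and add 1 to numbers, for every element of column A
D['A'] = D['A'].map(lambda s: s + 'x' if isinstance(s, str) else s + 1)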