Replacing cell values with column header value - python

I've got a dataframe:
   a-line  abstract  ...  zippered
0       0         0  ...         0
1       0         1  ...         0
2       0         0  ...         1
Wherever a cell's value is 1, I need to replace it with the column header name.
df.dtypes returns Length: 1000, dtype: object
I have tried df.apply(lambda x: x.astype(object).replace(1, x.name))
but get TypeError: Invalid "to_replace" type: 'int'
Other attempts:
df.apply(lambda x: x.astype(object).replace(str(1), x.name)) raises TypeError: Invalid "to_replace" type: 'str'
df.apply(lambda x: x.astype(str).replace(str(1), x.name)) raises TypeError: Invalid "to_replace" type: 'str'

The key idea in all three solutions below is to loop through the columns. The first method uses replace:
for col in df:
    df[col] = df[col].replace(1, df[col].name)
Alternatively, per your attempt to apply a lambda:
for col in df_new:
    df_new[col] = df_new[col].astype(str).apply(lambda x: x.replace('1', df_new[col].name))
Finally, this one uses np.where:
for col in df_new:
    df_new[col] = np.where(df_new[col] == 1, df_new[col].name, df_new[col])
Output for all three:
   a-line  abstract  ...  zippered
0       0         0  ...         0
1       0  abstract  ...         0
2       0         0  ...  zippered

You might consider playing with this idea:
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [0, 1, 0]],
                  columns=["a", "b", "c"])
df = pd.DataFrame(np.where(df == 1, df.columns, df),
                  columns=df.columns)
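For the small example above this should produce something along these lines (zeros are kept, ones become the column name):
   a  b  c
0  0  0  0
1  0  b  0
2  0  0  c
3  0  b  0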
UPDATE: Timing
@David Erickson's solution is perfectly fine, but you can avoid the loop, which pays off in particular when you have many columns.
Generate the data:
import pandas as pd
import numpy as np
n = 1_000
columns = ["{:04d}".format(i) for i in range(n)]
df = pd.DataFrame(np.random.randint(0, high=2, size=(4,n)),
columns=columns)
# we test against the same dataframe
df_bk = df.copy()
David's solution #1
%%timeit -n10
for col in df:
    df[col] = df[col].replace(1, df[col].name)
1.01 s ± 35.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #2
%%timeit -n10
df = df_bk.copy()
for col in df:
    df[col] = df[col].astype(str).apply(lambda x: x.replace('1', df[col].name))
890 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
David's solution #3
%%timeit -n10
for col in df:
    df[col] = np.where(df[col] == 1, df[col].name, df[col])
886 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Avoiding loops
%%timeit -n10
df = df_bk.copy()
df = pd.DataFrame(np.where(df==1, df.columns, df),
columns=df.columns)
455 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Related

How to create a new column in pandas based off boolean values from other columns?

With Pandas, I am using a data frame with a column that shows one's job and I wish to add another column that gives a 1 or a 0 based on whether the person is a manager. The actual data is far longer, so is there a way to use boolean logic to not have to put in the 1 or 0 manually?
Below is the desired output:
import numpy as np
import pandas as pd
job=pd.Series({'David':'manager','Keith':'player', 'Bob':'coach', 'Rick':'manger'})
is_manager=pd.Series({'David':'1','Keith':'0', 'Bob':'0', 'Rick':'1'})
data=pd.DataFrame({'job':job,'is_manager':is_manager})
print(data)
Don't know how efficient this is but it works.
import numpy as np
import pandas as pd
job=pd.Series({'David':'manager','Keith':'player', 'Bob':'coach', 'Rick':'manager'})
data=pd.DataFrame({'job':job})
data['is_manager'] = data.apply(lambda row: 1 if row['job'] == 'manager' else 0, axis=1)
print(data)
Compare the column with Series.eq and then convert the boolean mask to 0/1 by casting to integer, with Series.view, or with numpy.where:
data=pd.DataFrame({'job':job})
data['is_manager'] = data['job'].eq('manager').astype(int)
data['is_manager'] = data['job'].eq('manager').view('i1')
data['is_manager'] = np.where(data['job'].eq('manager'), 1, 0)
print(data)
           job  is_manager
David  manager           1
Keith   player           0
Bob      coach           0
Rick    manger           0
Performance:
# 40k rows
data = pd.concat([data] * 10000, ignore_index=True)
print (data)
In [234]: %timeit data['is_manager'] = data['job'].eq('manager').astype(int)
2.93 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [235]: %timeit data['is_manager'] = data['job'].eq('manager').view('i1')
2.96 ms ± 161 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [236]: %timeit data['is_manager'] = np.where(data['job'].eq('manager'), 1, 0)
2.89 ms ± 192 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [237]: %timeit data['is_manager'] = data.apply(lambda row: 1 if row['job'] == 'manager' else 0, axis=1)
340 ms ± 8.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
This is a possible solution:
import numpy as np
import pandas as pd
job=pd.Series({'David':'manager','Keith':'player', 'Bob':'coach', 'Rick':'manger'})
data = pd.DataFrame({'job':job})
is_manager = data == "manager"
is_manager = is_manager.rename(columns={"job": "is_manager"})
data = data.join(is_manager)
data['is_manager'] = data.apply(lambda row: 1 if row['is_manager'] == True else 0, axis=1)
print(data)

How to apply changes to subset dataframe to source dataframe

I'm trying to determine and flag duplicate 'Sample' values in a dataframe using groupby with lambda:
rdtRows["DuplicateSample"] = False
rdtRowsSampleGrouped = rdtRows.groupby( ['Sample']).filter(lambda x: len(x) > 1)
rdtRowsSampleGrouped["DuplicateSample"] = True
# How to get flag changes made on rdtRowsSampleGrouped to apply to rdtRows??
How do I get the "DuplicateSample" flag set on rdtRowsSampleGrouped to apply to the source rdtRows data? I'm stumped :(
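For what the question literally asks (writing a flag computed on a filtered subset back to the source frame), a minimal sketch is to reuse the subset's index, assuming rdtRows has a 'Sample' column; the answers below avoid the intermediate subset altogether:
import pandas as pd

rdtRows = pd.DataFrame({'Sample': ['A', 'B', 'B', 'C', 'C', 'C']})  # hypothetical data
rdtRows['DuplicateSample'] = False

# filter() keeps the original index, so it can be used to write back to the source
dup_index = rdtRows.groupby('Sample').filter(lambda x: len(x) > 1).index
rdtRows.loc[dup_index, 'DuplicateSample'] = True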
Use GroupBy.transform with GroupBy.size:
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
Or use Series.duplicated with keep=False if you need a faster solution:
df['DuplicateSample'] = df['Sample'].duplicated(keep=False)
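For reference, keep=False flags every occurrence of a repeated value rather than only the later ones; a quick sketch:
s = pd.Series([1, 2, 2, 3])
s.duplicated()            # default keep='first' -> False, False, True, False
s.duplicated(keep=False)  # -> False, True, True, False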
Performance on some sample data (real performance will differ, depending on the number of rows and the number of duplicated values):
np.random.seed(2020)
N = 100000
df = pd.DataFrame({'Sample': np.random.randint(100000, size=N)})
In [51]: %timeit df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform('size') > 1
17 ms ± 50 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [52]: %timeit df['DuplicateSample1'] = df['Sample'].duplicated(keep=False)
3.73 ms ± 40 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@Stef's solution is unfortunately about 2700 times slower than the duplicated solution:
In [53]: %timeit df['DuplicateSample2'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
10.2 s ± 517 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use transform:
import pandas as pd
df = pd.DataFrame({'Sample': [1,2,2,3,4,4]})
df['DuplicateSample'] = df.groupby('Sample')['Sample'].transform(lambda x: len(x)>1)
Result:
   Sample  DuplicateSample
0       1            False
1       2             True
2       2             True
3       3            False
4       4             True
5       4             True

how to flatten array in pandas dataframe

Assuming I have a pandas dataframe such as
df_p = pd.DataFrame(
{'name_array':
[[20130101, 320903902, 239032902],
[20130101, 3253453, 239032902],
[65756, 4342452, 32425432523]],
'name': ['a', 'a', 'c']} )
I want to extract a Series that contains the flattened arrays of each row, preserving the order.
The expected result is a pandas.core.series.Series
This question is not a duplicate because my expected output is a pandas Series, and not a dataframe.
The solutions using melt are slower than the OP's original method, which they shared in the answer here, especially after the speedup from my comment on that answer.
I created a larger dataframe to test on:
df = pd.DataFrame({'name_array': np.random.rand(1000, 3).tolist()})
And timing the two solutions using melt on this dataframe yield:
In [16]: %timeit pd.melt(df.name_array.apply(pd.Series).reset_index(), id_vars=['index'],value_name='name_array').drop('variable', axis=1).sort_values('index')
173 ms ± 5.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [17]: %timeit df['name_array'].apply(lambda x: pd.Series([i for i in x])).melt().drop('variable', axis=1)['value']
175 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
The OP's method with the speedup I suggested in the comments:
In [18]: %timeit pd.Series(np.concatenate(df['name_array']))
18 ms ± 887 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And finally, the fastest solution as provided here but modified to provide a series instead of dataframe output:
In [14]: from itertools import chain
In [15]: %timeit pd.Series(list(chain.from_iterable(df['name_array'])))
402 µs ± 4.68 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This last method is faster than melt() by more than two orders of magnitude (about 430×) and faster than np.concatenate() by about 45×.
This is the solution I've figured out. Don't know if there are more efficient ways.
df_p = pd.DataFrame(
{'name_array':
[[20130101, 320903902, 239032902],
[20130101, 3253453, 239032902],
[65756, 4342452, 32425432523]],
'name': ['a', 'a', 'c']} )
data = pd.DataFrame( {'column':np.concatenate(df_p['name_array'].values)} )['column']
output:
0       20130101
1      320903902
2      239032902
3       20130101
4        3253453
5      239032902
6          65756
7        4342452
8    32425432523
Name: column, dtype: int64
You can use pd.melt:
pd.melt(df_p.name_array.apply(pd.Series).reset_index(),
id_vars=['index'],
value_name='name_array') \
.drop('variable', axis=1) \
.sort_values('index')
OUTPUT:
index   name_array
    0     20130101
    0    320903902
    0    239032902
    1     20130101
    1      3253453
    1    239032902
    2        65756
    2      4342452
    2  32425432523
You can flatten the column's list of lists and then create a Series from that, in this way:
pd.Series([element for row in df_p.name_array for element in row])

Pandas: Replace a string with 'other' if it is not present in a list of strings

I have the following data frame, df, with column 'Class'
Class
0 Individual
1 Group
2 A
3 B
4 C
5 D
6 Group
I would like to replace everything apart from Group and Individual with 'Other', so the final data frame is
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
The dataframe is huge, with over 600 K rows. What is the best way to optimally look for values other than 'Group' and 'Individual' and replace them with 'Other'?
I have seen examples for replace, such as:
df['Class'] = df['Class'].replace({'A':'Other', 'B':'Other'})
but since I have far too many unique values, I cannot do this individually. I would rather just exclude the subset 'Group' and 'Individual'.
I think you need:
df['Class'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
print (df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
Another solution (slower):
m = (df['Class'] == 'Individual') | (df['Class'] == 'Group')
df['Class'] = np.where(m, df['Class'], 'Other')
Another solution:
df['Class'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
Performance (on real data it depends on the number of replacements):
#[700000 rows x 1 columns]
df = pd.concat([df] * 100000, ignore_index=True)
#print (df)
In [208]: %timeit df['Class1'] = np.where(df['Class'].isin(['Individual','Group']), df['Class'], 'Other')
25.9 ms ± 485 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [209]: %timeit df['Class2'] = np.where((df['Class'] == 'Individual') | (df['Class'] == 'Group'), df['Class'], 'Other')
120 ms ± 6.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [210]: %timeit df['Class3'] = df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
95.7 ms ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %timeit df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
97.8 ms ± 6.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Another approach could be:
df.loc[~df['Class'].isin(['Individual', 'Group']), 'Class'] = 'Other'
You can do it this way, for example:
get the list of unique items: classes = df['Class'].unique()
remove your known classes from that list (drop 'Individual' and 'Group')
then select all the remaining rows: df[df['Class'].isin(classes)]
and replace their class values: df.loc[df['Class'].isin(classes), 'Class'] = 'Other'
Sorry for this pseudo-pseudo code, but the principle is the same.
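A minimal runnable sketch of that idea (the other_classes name and the sample frame are just for illustration):
import pandas as pd

df = pd.DataFrame({'Class': ['Individual', 'Group', 'A', 'B', 'C', 'D', 'Group']})

# unique values minus the classes we want to keep
other_classes = [c for c in df['Class'].unique() if c not in ('Individual', 'Group')]

# replace everything in that list with 'Other'
df.loc[df['Class'].isin(other_classes), 'Class'] = 'Other'
print(df)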
You can use pd.Series.where:
df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other', inplace=True)
print(df)
Class
0 Individual
1 Group
2 Other
3 Other
4 Other
5 Other
6 Group
This should be efficient versus map + fillna:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit df['Class'].where(df['Class'].isin(['Individual', 'Group']), 'Other')
# 60.3 ms per loop
%timeit df['Class'].map({'Individual':'Individual', 'Group':'Group'}).fillna('Other')
# 133 ms per loop
Another way, using apply:
df['Class'] = df['Class'].apply(lambda cl: cl if cl in ["Individual", "Group"] else "Other")

Pandas Dataframe Find Rows Where all Columns Equal

I have a dataframe that has characters in it - I want a boolean result by row that tells me if all columns for that row have the same value.
For example, I have
df = [ a b c d
0 'C' 'C' 'C' 'C'
1 'C' 'C' 'A' 'A'
2 'A' 'A' 'A' 'A' ]
and I want the result to be
0 True
1 False
2 True
I've tried .all but it seems I can only check if all are equal to one letter. The only other way I can think of doing it is by doing a unique on each row and seeing if that equals 1? Thanks in advance.
I think the cleanest way is to check all columns against the first column using eq:
In [11]: df
Out[11]:
a b c d
0 C C C C
1 C C A A
2 A A A A
In [12]: df.iloc[:, 0]
Out[12]:
0 C
1 C
2 A
Name: a, dtype: object
In [13]: df.eq(df.iloc[:, 0], axis=0)
Out[13]:
a b c d
0 True True True True
1 True True False False
2 True True True True
Now you can use all (if they are all equal to the first item, they are all equal):
In [14]: df.eq(df.iloc[:, 0], axis=0).all(1)
Out[14]:
0 True
1 False
2 True
dtype: bool
Same solution in numpy for better performance: compare the array against its first column and check whether all values in each row are True:
a = df.values
b = (a == a[:, [0]]).all(axis=1)
print (b)
[ True False True]
And if you need a Series:
s = pd.Series(b, index=df.index)
Comparing solutions:
data = [[10,10,10],[12,12,12],[10,12,10]]
df = pd.DataFrame(data,columns=['Col1','Col2','Col3'])
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
#jez - numpy array
In [14]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
141 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#jez - Series
In [15]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
...: pd.Series(b, index=df.index)
169 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Andy Hayden
In [16]: %%timeit
...: df.eq(df.iloc[:, 0], axis=0).all(axis=1)
2.22 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Wen1
In [17]: %%timeit
...: list(map(lambda x : len(set(x))==1,df.values))
56.8 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#K.-Michael Aye
In [18]: %%timeit
...: df.apply(lambda x: len(set(x)) == 1, axis=1)
686 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Wen2
In [19]: %%timeit
...: df.nunique(1).eq(1)
2.87 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
nunique: new in version 0.20.0. (Based on the timing benchmark from Jez: if performance is not important you can use this one.)
df.nunique(axis = 1).eq(1)
Out[308]:
0 True
1 False
2 True
dtype: bool
Or you can use map with set:
list(map(lambda x : len(set(x))==1,df.values))
df = pd.DataFrame.from_dict({'a':'C C A'.split(),
'b':'C C A'.split(),
'c':'C A A'.split(),
'd':'C A A'.split()})
df.apply(lambda x: len(set(x)) == 1, axis=1)
0 True
1 False
2 True
dtype: bool
Explanation: set(x) has only 1 element if all elements of the row are the same. The axis=1 option applies the function over rows instead of columns.
You can use nunique(axis=1) so the results (added to a new column) can be obtained by:
df['unique'] = df.nunique(axis=1) == 1
The answer by @yo-and-ben-w uses eq(1) but I think == 1 is easier to read.
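Both forms give the same boolean Series, for instance:
df['unique'] = df.nunique(axis=1) == 1     # comparison operator
df['unique'] = df.nunique(axis=1).eq(1)    # equivalent method form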
