Python: String matching on a pandas column of lists

Python: String matching on a pandas column of lists - python

What is the best way to do string matching on a column of lists?
E.g. I have a dataset:
import numpy as np
import pandas as pd
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':xrange(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in xrange(3)]})
df
L id
0 [tackle, apple, grapple] 0
1 [tackle, snapple, satchel] 1
2 [satchel, satchel, tackle] 2
And I want to return the rows where any item in L matches a string, e.g. 'grap' should return row 0, and 'sat' should return rows 1:2.

Let's use this:
np.random.seed(123)
list_items = ['apple', 'grapple', 'tackle', 'satchel', 'snapple']
df = pd.DataFrame({'id':range(3), 'L':[np.random.choice(list_items, 3).tolist() for _ in range(3)]})
df
L id
0 [tackle, snapple, tackle] 0
1 [grapple, satchel, tackle] 1
2 [satchel, grapple, grapple] 2
Use any and apply:
df[df.L.apply(lambda x: any('grap' in s for s in x))]
Output:
L id
1 [grapple, satchel, tackle] 1
2 [satchel, grapple, grapple] 2
Timings:
%timeit df.L.apply(lambda x: any('grap' in s for s in x))
10000 loops, best of 3: 194 µs per loop
%timeit df.L.apply(lambda i: ','.join(i)).str.contains('grap')
1000 loops, best of 3: 481 µs per loop
%timeit df.L.str.join(', ').str.contains('grap')
1000 loops, best of 3: 529 µs per loop

df[df.L.apply(lambda i: ','.join(i)).str.contains('yourstring')]

Related

How to parse values from existing dataframe to new column for each row [duplicate]

This question is same to this posted earlier. I want to concatenate three columns instead of concatenating two columns:
Here is the combining two columns:
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)
df
bar foo new combined
0 1 a apple a_1
1 2 b banana b_2
2 3 c pear c_3
I want to combine three columns with this command but it is not working, any idea?
df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:
cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.
In[17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']
In[17]:df
Out[18]:
bar foo new combined
0 1 a apple 1_a_apple
1 2 b banana 2_b_banana
2 3 c pear 3_c_pear

If you have even more columns you want to combine, using the Series method str.cat might be handy:
df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")
Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).

Just wanted to make a time comparison for both solutions (for 30K rows DF):
In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
In [2]: big = pd.concat([df] * 10**4, ignore_index=True)
In [3]: big.shape
Out[3]: (30000, 3)
In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop
In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop
a few more options:
In [6]: %timeit big.ix[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop
In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop

Possibly the fastest solution is to operate in plain Python:
Series(
map(
'_'.join,
df.values.tolist()
# when non-string columns are present:
# df.values.astype(str).tolist()
),
index=df.index
)
Comparison against #MaxU answer (using the big data frame which has both numeric and string columns):
%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Comparison against #derchambers answer (using their df data frame where all columns are strings):
from functools import reduce
def reduce_join(df, columns):
slist = [df[x] for x in columns]
return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])
def list_map(df, columns):
return Series(
map(
'_'.join,
df[columns].values.tolist()
),
index=df.index
)
%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The answer given by #allen is reasonably generic but can lack in performance for larger dataframes:
Reduce does a lot better:
from functools import reduce
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def reduce_join(df, columns):
assert len(columns) > 1
slist = [df[x].astype(str) for x in columns]
return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])
def apply_join(df, columns):
assert len(columns) > 1
return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)
# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)
# profile
%timeit df1 = reduce_join(df, list('1234')) # 733 ms
%timeit df2 = apply_join(df, list('1234')) # 8.84 s

I think you are missing one %s
df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

First convert the columns to str. Then use the .T.agg('_'.join) function to concatenate them. More info can be gotten here
# Initialize columns
cols_concat = ['first_name', 'second_name']
# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')
# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)
If you concatenate with string('_') please you convert the column to string which you want and after you can concatenate the dataframe.

df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']
X= x is any delimiter (eg: space) by which you want to separate two merged column.

If you have a list of columns you want to concatenate and maybe you'd like to use some separator, here's what you can do
def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
df[new_col_name] = df[cols_to_concat[0]]
for col in cols_to_concat[1:]:
df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)
This should be faster than apply and takes an arbitrary number of columns to concatenate.

#derchambers I found one more solution:
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def eval_join(df, columns):
sum_elements = [f"df['{col}']" for col in columns]
to_eval = "+ '_' + ".join(sum_elements)
return eval(to_eval)
#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms

You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):
def concat_cols(df, cols_to_concat, new_col_name, separator):
df[new_col_name] = ''
for i, col in enumerate(cols_to_concat):
df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
return df
Sample usage:
test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')

following to #Allen response
If you need to chain such operation with other dataframe transformation, use assign:
df.assign(
combined = lambda x: x[cols].apply(
lambda row: "_".join(row.values.astype(str)), axis=1
)
)

Considering that one is combining three columns, one would need three format specifiers, '%s_%s_%s', not just two '%s_%s'. The following will do the work
df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.
columns = ['foo', 'bar', 'new']
df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
This last one is more convenient, as one can simply change or add the column names in the list - it will require less changes.

Reference a row from another Pandas DataFrame

Suppose I have 2 DataFrames:
DataFrame 1
A B
a 1
b 2
c 3
d 4
DataFrame2:
C D
a c
b a
a b
The goal is to add a column to DataFrame 2 ('E').
C D E
a c (1-3=-2)
b a (2-1=1)
a b (1-2=-1)
If this were excel, a formula could be something similar to "=vlookup(A1,DataFrame1,2)-vlookup(B1,DataFrame1,2)". Any idea what this formula looks like in Python?
Thanks!

A Pandas Series can be thought of as a mapping from its index to its values.
Here, we wish to use the first DataFrame, df1 as a mapping from column A to column B. So the natural thing to do is to convert df1 into a Series:
s = df1.set_index('A')['B']
# A
# a 0
# b 1
# c 2
# d 3
# Name: B, dtype: int64
Now we can use the Series.map method to "lookup" values in a Series based on s:
import pandas as pd
df1 = pd.DataFrame({'A':list('abcd'), 'B':[1,2,3,4]})
df2 = pd.DataFrame({'C':list('aba'), 'D':list('cab')})
s = df1.set_index('A')['B']
df2['E'] = df2['C'].map(s) - df2['D'].map(s)
print(df2)
yields
C D E
0 a c -2
1 b a 1
2 a b -1

You can do something like this:
#set column A as index, so you can index it
df1 = df1.set_index('A')
df2['E'] = df1.loc[df2.C, 'B'].values - df1.loc[df2.D, 'B'].values
And the result is:
C D E
0 a c -2
1 b a 1
2 a b -1
Hope it helps :)

Option 1
Using replace and eval with assign
df2.assign(E=df2.replace(df1.A.values, df1.B).eval('C - D'))
C D E
0 a c -2
1 b a 1
2 a b -1
I like this answer for it's succinctness.
I use replace with two iterables, nameley df1.A that specifies what to replace and df1.B that specifies what to replace with.
I use eval to elegantly perform the differencing of the new found C less D.
I use assign to create a copy of df2 with a new column named E that has the values from the steps above.
I could have used a dictionary instead dict(zip(df1.A, df1.B))
df2.assign(E=df2.replace(dict(zip(df1.A, df1.B))).eval('C - D'))
C D E
0 a c -2
1 b a 1
2 a b -1
PROJECT/kill
numpy + pd.factorize
base = df1.A.values
vals = df1.B.values
refs = df2.values.ravel()
f, u = pd.factorize(np.append(base, refs))
look = vals[f[base.size:]]
df2.assign(E=look[::2] - look[1::2])
C D E
0 a c -2
1 b a 1
2 a b -1
Timing
Among the pure pandas #unutbu's answer is the clear winner. While my overly verbose numpy solution only improves by about 40ish%
Let's use these functions for the numpy versions. Note using_F_order is #unutbu's contribution.
def using_numpy(df1, df2):
base = df1.A.values
vals = df1.B.values
refs = df2.values.ravel()
f, u = pd.factorize(np.append(base, refs))
look = vals[f[base.size:]]
return df2.assign(E=look[::2] - look[1::2])
def using_F_order(df1, df2):
base = df1.A.values
vals = df1.B.values
refs = df2.values.ravel(order='F')
f, u = pd.factorize(np.append(base, refs))
look = vals[f[base.size:]].reshape(-1, 2, order='F')
return df2.assign(E=look[:, 0]-look[:, 1])
small data
%timeit df2.assign(E=df2.replace(df1.A.values, df1.B).eval('C - D'))
%timeit df2.assign(E=df2.replace(dict(zip(df1.A, df1.B))).eval('C - D'))
%timeit df2.assign(E=(lambda s: df2['C'].map(s) - df2['D'].map(s))(df1.set_index('A')['B']))
%timeit using_numpy(df1, df2)
%timeit using_F_order(df1, df2)
100 loops, best of 3: 2.31 ms per loop
100 loops, best of 3: 2.44 ms per loop
1000 loops, best of 3: 1.25 ms per loop
1000 loops, best of 3: 436 µs per loop
1000 loops, best of 3: 424 µs per loop
large data
from string import ascii_lowercase, ascii_uppercase
import pandas as pd
import numpy as np
upper = np.array(list(ascii_uppercase))
lower = np.array(list(ascii_lowercase))
ch = np.core.defchararray.add(upper[:, None], lower).ravel()
np.random.seed([3,1415])
n = 100000
df1 = pd.DataFrame(dict(A=ch, B=np.arange(ch.size)))
df2 = pd.DataFrame(dict(C=np.random.choice(ch, n), D=np.random.choice(ch, n)))
%timeit df2.assign(E=df2.replace(df1.A.values, df1.B).eval('C - D'))
%timeit df2.assign(E=df2.replace(dict(zip(df1.A, df1.B))).eval('C - D'))
%timeit df2.assign(E=(lambda s: df2['C'].map(s) - df2['D'].map(s))(df1.set_index('A')['B']))
%timeit using_numpy(df1, df2)
%timeit using_F_order(df1, df2)
1 loop, best of 3: 11.1 s per loop
1 loop, best of 3: 10.6 s per loop
100 loops, best of 3: 17.7 ms per loop
100 loops, best of 3: 10.9 ms per loop
100 loops, best of 3: 9.11 ms per loop

Here's a very simple way to achieve this:
newdf = df2.replace(['a','b','c','d'],[1,2,3,4])
df2['E'] = newdf['C'] - newdf['D']
df2
I hope this helps !

How to concatenate multiple column values into a single column in Pandas dataframe

This question is same to this posted earlier. I want to concatenate three columns instead of concatenating two columns:
Here is the combining two columns:
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined']=df.apply(lambda x:'%s_%s' % (x['foo'],x['bar']),axis=1)
df
bar foo new combined
0 1 a apple a_1
1 2 b banana b_2
2 3 c pear c_3
I want to combine three columns with this command but it is not working, any idea?
df['combined']=df.apply(lambda x:'%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

Another solution using DataFrame.apply(), with slightly less typing and more scalable when you want to join more columns:
cols = ['foo', 'bar', 'new']
df['combined'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)

You can use string concatenation to combine columns, with or without delimiters. You do have to convert the type on non-string columns.
In[17]: df['combined'] = df['bar'].astype(str) + '_' + df['foo'] + '_' + df['new']
In[17]:df
Out[18]:
bar foo new combined
0 1 a apple 1_a_apple
1 2 b banana 2_b_banana
2 3 c pear 3_c_pear

If you have even more columns you want to combine, using the Series method str.cat might be handy:
df["combined"] = df["foo"].str.cat(df[["bar", "new"]].astype(str), sep="_")
Basically, you select the first column (if it is not already of type str, you need to append .astype(str)), to which you append the other columns (separated by an optional separator character).

Just wanted to make a time comparison for both solutions (for 30K rows DF):
In [1]: df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
In [2]: big = pd.concat([df] * 10**4, ignore_index=True)
In [3]: big.shape
Out[3]: (30000, 3)
In [4]: %timeit big.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)
1 loop, best of 3: 881 ms per loop
In [5]: %timeit big['bar'].astype(str)+'_'+big['foo']+'_'+big['new']
10 loops, best of 3: 44.2 ms per loop
a few more options:
In [6]: %timeit big.ix[:, :-1].astype(str).add('_').sum(axis=1).str.cat(big.new)
10 loops, best of 3: 72.2 ms per loop
In [11]: %timeit big.astype(str).add('_').sum(axis=1).str[:-1]
10 loops, best of 3: 82.3 ms per loop

Possibly the fastest solution is to operate in plain Python:
Series(
map(
'_'.join,
df.values.tolist()
# when non-string columns are present:
# df.values.astype(str).tolist()
),
index=df.index
)
Comparison against #MaxU answer (using the big data frame which has both numeric and string columns):
%timeit big['bar'].astype(str) + '_' + big['foo'] + '_' + big['new']
# 29.4 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit Series(map('_'.join, big.values.astype(str).tolist()), index=big.index)
# 27.4 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Comparison against #derchambers answer (using their df data frame where all columns are strings):
from functools import reduce
def reduce_join(df, columns):
slist = [df[x] for x in columns]
return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])
def list_map(df, columns):
return Series(
map(
'_'.join,
df[columns].values.tolist()
),
index=df.index
)
%timeit df1 = reduce_join(df, list('1234'))
# 602 ms ± 39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df2 = list_map(df, list('1234'))
# 351 ms ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The answer given by #allen is reasonably generic but can lack in performance for larger dataframes:
Reduce does a lot better:
from functools import reduce
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def reduce_join(df, columns):
assert len(columns) > 1
slist = [df[x].astype(str) for x in columns]
return reduce(lambda x, y: x + '_' + y, slist[1:], slist[0])
def apply_join(df, columns):
assert len(columns) > 1
return df[columns].apply(lambda row:'_'.join(row.values.astype(str)), axis=1)
# ensure outputs are equal
df1 = reduce_join(df, list('1234'))
df2 = apply_join(df, list('1234'))
assert df1.equals(df2)
# profile
%timeit df1 = reduce_join(df, list('1234')) # 733 ms
%timeit df2 = apply_join(df, list('1234')) # 8.84 s

I think you are missing one %s
df['combined']=df.apply(lambda x:'%s_%s_%s' % (x['bar'],x['foo'],x['new']),axis=1)

First convert the columns to str. Then use the .T.agg('_'.join) function to concatenate them. More info can be gotten here
# Initialize columns
cols_concat = ['first_name', 'second_name']
# Convert them to type str
df[cols_concat] = df[cols_concat].astype('str')
# Then concatenate them as follows
df['new_col'] = df[cols_concat].T.agg('_'.join)

df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3], 'new':['apple', 'banana', 'pear']})
df['combined'] = df['foo'].astype(str)+'_'+df['bar'].astype(str)
If you concatenate with string('_') please you convert the column to string which you want and after you can concatenate the dataframe.

df['New_column_name'] = df['Column1'].map(str) + 'X' + df['Steps']
X= x is any delimiter (eg: space) by which you want to separate two merged column.

If you have a list of columns you want to concatenate and maybe you'd like to use some separator, here's what you can do
def concat_columns(df, cols_to_concat, new_col_name, sep=" "):
df[new_col_name] = df[cols_to_concat[0]]
for col in cols_to_concat[1:]:
df[new_col_name] = df[new_col_name].astype(str) + sep + df[col].astype(str)
This should be faster than apply and takes an arbitrary number of columns to concatenate.

#derchambers I found one more solution:
import pandas as pd
# make data
df = pd.DataFrame(index=range(1_000_000))
df['1'] = 'CO'
df['2'] = 'BOB'
df['3'] = '01'
df['4'] = 'BILL'
def eval_join(df, columns):
sum_elements = [f"df['{col}']" for col in columns]
to_eval = "+ '_' + ".join(sum_elements)
return eval(to_eval)
#profile
%timeit df3 = eval_join(df, list('1234')) # 504 ms

You could create a function which would make the implementation neater (esp. if you're using this functionality multiple times throughout an implementation):
def concat_cols(df, cols_to_concat, new_col_name, separator):
df[new_col_name] = ''
for i, col in enumerate(cols_to_concat):
df[new_col_name] += ('' if i == 0 else separator) + df[col].astype(str)
return df
Sample usage:
test = pd.DataFrame(data=[[1,2,3], [4,5,6], [7,8,9]], columns=['a', 'b', 'c'])
test = concat_cols(test, ['a', 'b', 'c'], 'concat_col', '_')

following to #Allen response
If you need to chain such operation with other dataframe transformation, use assign:
df.assign(
combined = lambda x: x[cols].apply(
lambda row: "_".join(row.values.astype(str)), axis=1
)
)

Considering that one is combining three columns, one would need three format specifiers, '%s_%s_%s', not just two '%s_%s'. The following will do the work
df['combined'] = df.apply(lambda x: '%s_%s_%s' % (x['foo'], x['bar'], x['new']), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
Alternatively, if one wants to create a separate list to store the columns that one wants to combine, the following will do the work.
columns = ['foo', 'bar', 'new']
df['combined'] = df.apply(lambda x: '_'.join([str(x[i]) for i in columns]), axis=1)
[Out]:
foo bar new combined
0 a 1 apple a_1_apple
1 b 2 banana b_2_banana
2 c 3 pear c_3_pear
This last one is more convenient, as one can simply change or add the column names in the list - it will require less changes.

how to convert a data frame with a list in the value to a big data frame with the each level as a single column value in python?

I have a data frame look like below:
mydata = [{'col_A' : 'A', 'col_B': [1,2,3]},
{'col_A' : 'B', 'col_B': [7,8]}]
pd.DataFrame(mydata)
col_A col_B
A [1, 2, 3]
B [7, 8]
How to split the value in the list and create a data frame that look like this:
col_A col_B
A 1
A 2
A 3
B 7
B 8

Try this:
pd.DataFrame([{'col_A':row['col_A'], 'col_B':val}
for ind, row in df.iterrows()
for val in row['col_B']])
You might also be able to do something clever with the apply() function, but off the top of my head, I can think of how.

Here is a solution using apply:
df['col_B'].apply(pd.Series).set_index(df['col_A']).stack().reset_index(level=0)
col_A 0
0 A 1
1 A 2
2 A 3
3 B 7
4 B 8

If your DataFrame is big, the fastest is use DataFrame constructor with stack and double reset_index:
print pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack()
.reset_index(drop=True, level=1).reset_index().rename(columns={0:'col_B'})
Testing:
import pandas as pd
mydata = [{'col_A' : 'A', 'col_B': [1,2,3]},
{'col_A' : 'B', 'col_B': [7,8]}]
df = pd.DataFrame(mydata)
print df
df = pd.concat([df]*1000).reset_index(drop=True)
print pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index(drop=True, level=1).reset_index().rename(columns={0:'col_B'})
print pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
print df['col_B'].apply(pd.Series).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
print pd.DataFrame([{'col_A':row['col_A'], 'col_B':val} for ind, row in df.iterrows() for val in row['col_B']])
Timing:
In [1657]: %timeit pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
100 loops, best of 3: 4.01 ms per loop
In [1658]: %timeit pd.DataFrame(x for x in df['col_B']).set_index(df['col_A']).stack().reset_index(drop=True, level=1).reset_index().rename(columns={0:'col_B'})
100 loops, best of 3: 3.09 ms per loop
In [1659]: %timeit pd.DataFrame([{'col_A':row['col_A'], 'col_B':val} for ind, row in df.iterrows() for val in row['col_B']])
10 loops, best of 3: 153 ms per loop
In [1660]: %timeit df['col_B'].apply(pd.Series).set_index(df['col_A']).stack().reset_index().drop('level_1', axis=1).rename(columns={0:'col_B'})
1 loops, best of 3: 357 ms per loop

Return max of zero or value for a pandas DataFrame column

I want to replace negative values in a pandas DataFrame column with zero.
Is there a more concise way to construct this expression?
df['value'][df['value'] < 0] = 0

You could use the clip method:
import pandas as pd
import numpy as np
df = pd.DataFrame({'value': np.arange(-5,5)})
df['value'] = df['value'].clip(0, None)
print(df)
yields
value
0 0
1 0
2 0
3 0
4 0
5 0
6 1
7 2
8 3
9 4

Another possibility is numpy.maximum(). This is more straight-forward to read in my opinion.
import pandas as pd
import numpy as np
df['value'] = np.maximum(df.value, 0)
It's also significantly faster than all other methods.
df_orig = pd.DataFrame({'value': np.arange(-1000000, 1000000)})
df = df_orig.copy()
%timeit df['value'] = np.maximum(df.value, 0)
# 100 loops, best of 3: 8.36 ms per loop
df = df_orig.copy()
%timeit df['value'] = np.where(df.value < 0, 0, df.value)
# 100 loops, best of 3: 10.1 ms per loop
df = df_orig.copy()
%timeit df['value'] = df.value.clip(0, None)
# 100 loops, best of 3: 14.1 ms per loop
df = df_orig.copy()
%timeit df['value'] = df.value.clip_lower(0)
# 100 loops, best of 3: 14.2 ms per loop
df = df_orig.copy()
%timeit df.loc[df.value < 0, 'value'] = 0
# 10 loops, best of 3: 62.7 ms per loop
(notebook)

Here is the canonical way of doing it, while not necessarily more concise, is more flexible (in that you can apply this to arbitrary columns)
In [39]: df = DataFrame(randn(5,1),columns=['value'])
In [40]: df
Out[40]:
value
0 0.092232
1 -0.472784
2 -1.857964
3 -0.014385
4 0.301531
In [41]: df.loc[df['value']<0,'value'] = 0
In [42]: df
Out[42]:
value
0 0.092232
1 0.000000
2 0.000000
3 0.000000
4 0.301531

Or where to check:
>>> import pandas as pd,numpy as np
>>> df = pd.DataFrame(np.random.randn(5,1),columns=['value'])
>>> df
value
0 1.193313
1 -1.011003
2 -0.399778
3 -0.736607
4 -0.629540
>>> df['value']=df['value'].where(df['value']>0,0)
>>> df
value
0 1.193313
1 0.000000
2 0.000000
3 0.000000
4 0.000000
>>>

For completeness, np.where is also a possibility, which is faster than most answers here. The np.maximum answer is the best approach though, as it's faster and more concise than this.
df['value'] = np.where(df.value < 0, 0, df.value)

Let's take only values greater than zero, leaving those which are negative as NaN (works with frames not with series), then impute.
df[df > 0].fillna(0)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python: String matching on a pandas column of lists - python

df[df.L.apply(lambda i: ','.join(i)).str.contains('yourstring')]

Related

How to parse values from existing dataframe to new column for each row [duplicate]

Reference a row from another Pandas DataFrame

How to concatenate multiple column values into a single column in Pandas dataframe

how to convert a data frame with a list in the value to a big data frame with the each level as a single column value in python?

Return max of zero or value for a pandas DataFrame column

Categories

Resources