I am wondering if there is a fast way to merge two pandas tables by regular expression in Python.
For example:
table A
col1 col2
1 apple_3dollars_5
2 apple_2dollar_4
1 orange_5dollar_3
1 apple_1dollar_3
table B
col1 col2
good (apple|oragne)_\dollars_5
bad .*_1dollar_.*
ok oragne_\ddollar_\d
Output:
col1 col2 col3
1 apple_3dollars_5 good
1 orange_5dollar_3 ok
1 apple_1dollar_3 bad
This is just an example; instead of merging on one column that matches exactly, I want to join by a regular expression. Thank you!
First of all, fix the regexes in the B DataFrame:
In [222]: B
Out[222]:
col1 col2
0 good (apple|oragne)_\ddollars_5
1 bad .*_1dollar_.*
2 ok orange_\ddollar_\d
Now we can prepare the following variables:
In [223]: to_repl = B.col2.values.tolist()
In [224]: vals = B.col1.values.tolist()
In [225]: to_repl
Out[225]: ['(apple|oragne)_\\ddollars_5', '.*_1dollar_.*', 'orange_\\ddollar_\\d']
In [226]: vals
Out[226]: ['good', 'bad', 'ok']
Finally we can use them in the replace function:
In [227]: A['col3'] = A['col2'].replace(to_repl, vals, regex=True)
In [228]: A
Out[228]:
col1 col2 col3
0 1 apple_3dollars_5 good
1 2 apple_2dollar_4 apple_2dollar_4
2 1 orange_5dollar_3 ok
3 1 apple_1dollar_3 bad
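One caveat worth noting (my own addition, not part of the original answer): rows that match none of the patterns keep their original col2 value, as row 1 above shows. If you would rather flag those rows with NaN, a small post-processing step along these lines should work:
matched = A['col2'].replace(to_repl, vals, regex=True)
A['col3'] = matched.where(matched.isin(vals))  # NaN wherever no pattern matched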
I took the idea from https://python.tutorialink.com/can-i-perform-a-left-join-merge-between-two-dataframes-using-regular-expressions-with-pandas/ and improved it a little: the original data can have more than one column, and we can now do a real left join (merge) with regex!
import pandas as pd
d = {'extra_colum1': ['x', 'y', 'z', 'w'],'field': ['ab', 'a', 'cd', 'e'], 'extra_colum2': ['x', 'y', 'z', 'w']}
df = pd.DataFrame(data=d)
df_dict = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['destination'])
df_dict['field'] = '.*' + df_dict['destination'] + '.*'
df_dict.columns=['destination','field']
This gives us the dataframe and the regex lookup table.
def merge_regex(df, df_dict, how, field):
    import re

    df_dict = df_dict.drop_duplicates()
    # collect every (pattern row, data row) pair whose values match
    idx = [(i, j)
           for i, r in enumerate(df_dict[field])
           for j, v in enumerate(df[field])
           if re.match(r, v)]
    df_dict_idx, df_idx = zip(*idx)
    # pair each matched destination with the concrete field value it matched
    t = df_dict.iloc[list(df_dict_idx), 0].reset_index(drop=True)
    t1 = df.iloc[list(df_idx), df.columns.get_loc(field)].reset_index(drop=True)
    df_dict_translated = pd.concat([t, t1], axis=1)
    # a regular merge on the concrete values now gives the regex join
    data = pd.merge(
        df,
        df_dict_translated,
        how=how,
        left_on=field,
        right_on=field
    )
    return data.drop_duplicates()
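A quick usage sketch (assuming the df and df_dict built above, and that a plain left join is wanted):
result = merge_regex(df, df_dict, how='left', field='field')
print(result)
With how='left', rows of df whose field value matches no pattern come back with NaN in the destination column.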
Similar to #MaxU, I use .replace, but I replace the column of values you want to merge on with the regex strings they match. A small warning: this can cause issues such as a non-unique index if your text matches more than one regex pattern. So, using your dataframe A and #MaxU's fixed regexes for dataframe B:
A['joinCol'] = A.col2.replace(B.col2, B.col2, regex=True)
B = B.rename(columns={'col2': 'joinCol'})  # the join columns should have the same name
C = A.join(B.set_index('joinCol'), on='joinCol')
If you want, you can then drop that join column:
C = C.drop('joinCol', axis=1)
This may be an easy process, but it has been confusing me. Let's say I have the following df:
import pandas as pd
# creating a DataFrame
df = pd.DataFrame({'col1': ['A', 'B', 'B', 'C'],
                   'col2': ['B', 'A', 'B', 'H'],
                   'col3': ['', '', '', ''],
                   'val':  ['', '', '', '']})
print(df)
col1 col2 col3 val
0 A B
1 B A
2 B B
3 C H
Now I want to identify empty columns and drop them. Here is what I am doing:
empty_cols = [col for col in df if df[col].isnull().all()]
print(empty_cols)
[]
False
which returns [] and False. Am I making a mistake somewhere here? Thanks!
You could try this
empty = []
for col in df.columns:
    if df[col].sum() == "":
        empty.append(col)
Alternatively, as a one-liner:
empty = [col for col in df if df[col].sum() == ""]
If by "empty" you mean a string with no characters like '', then you can check for that instead of null values:
empty_cols = [col for col in df if df[col].eq('').all()]
I guess you need to check for '' instead of isnull(), because null is None in Python, but here you have empty strings, and those are not None.
P.S. My code to detect and drop:
empty = [df[col].name for col in df if not df[col].all()]
df = df.drop(empty, axis=1)
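For what it's worth, a common alternative (my own sketch, not from the answers above) is to treat empty strings as missing values and let pandas drop the all-empty columns in one go:
import numpy as np
df = df.replace('', np.nan).dropna(axis=1, how='all')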
I currently have a DataFrame like such:
col1 col2 col3
0 0 1 ['a', 'b', 'c']
1 2 3 ['d', 'e', 'f']
2 4 5 ['g', 'h', 'i']
what I want to do is select the rows where a certain value is contained in the lists of col3. For example, the code I would initially run is:
df.loc['a' in df['col3']]
but I get the following error:
KeyError: False
I've taken a look at this question: KeyError: False in pandas dataframe, but it doesn't quite answer my question. I've tried the suggested solutions in the answers and they didn't help.
How would I go about this issue? Thanks.
Use a list comprehension to test each list:
df1 = df[['a' in x for x in df['col3']]]
print (df1)
col1 col2 col3
0 0 1 [a, b, c]
Or use Series.map:
df1 = df[df['col3'].map(lambda x: 'a' in x)]
#alternative
#df1 = df[df['col3'].apply(lambda x: 'a' in x)]
Or create a DataFrame from the lists and test with DataFrame.eq and DataFrame.any:
df1 = df[pd.DataFrame(df['col3'].tolist()).eq('a').any(axis=1)]
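For completeness, a minimal self-contained reproduction (the column values are assumed from the question) that you can paste and run:
import pandas as pd

df = pd.DataFrame({'col1': [0, 2, 4],
                   'col2': [1, 3, 5],
                   'col3': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']]})
df1 = df[['a' in x for x in df['col3']]]
print(df1)  # keeps only row 0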
Use:
df = df[df['col3'].map(lambda x: 'a' in x)]
I would like to count the number of cells within each row that contain a particular character string; cells that have the string more than once should be counted only once.
I can count the number of cells across a row that equal a given value, but when I extend this logic to use str.contains, I run into issues, as shown below:
d = {'col1': ["a#", "b","c#"], 'col2': ["a", "b","c#"]}
df = pd.DataFrame(d)
# can correctly count across rows using equality
thisworks = (df == "a#").sum(axis=1)
# can count down a column using str.contains
thisworks1 = df['col1'].str.contains('#').sum()
# but str.contains cannot be used on a DataFrame, so what is the alternative?
thisdoesnt = (df.str.contains('#')).sum(axis=1)
Output should be a series showing the number of cells in each row that contain the given character string.
str.contains is a Series method. To apply it to the whole dataframe you need either agg or apply, such as:
df.agg(lambda x: x.str.contains('#')).sum(1)
Out[2358]:
0 1
1 0
2 2
dtype: int64
If you don't like agg or apply, you can use np.char.find to work directly on the underlying NumPy array of df:
(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)
Out[2360]: array([1, 0, 2])
Passing it to a Series (or a column of df):
pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)
Out[2361]:
0 1
1 0
2 2
dtype: int32
A solution using df.apply:
df = pd.DataFrame({'col1': ["a#", "b", "c#"],
                   'col2': ["a", "b", "c#"]})
df
col1 col2
0 a# a
1 b b
2 c# c#
df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)
col1 col2 sum
0 a# a 1
1 b b 0
2 c# c# 2
Something like this should work:
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
df['col2'].str.contains('#', regex=False).astype(int)
df
# col1 col2 totals
# 0 # # 2
# 1 0 # 1
It should generalize to as many columns as you want.
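If there are many columns, you can build the same total without spelling each one out; a small sketch in the same spirit (the list of string columns is an assumption):
cols = ['col1', 'col2']  # the string columns to scan
df['totals'] = sum(df[c].str.contains('#', regex=False).astype(int) for c in cols)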
I have a Pandas DataFrame with two columns; each row contains a list of elements. I'm trying to find the set difference between the two columns for each row using the pandas apply method.
My df for example
A B
0 ['a','b','c'] ['a']
1 ['e', 'f', 'g'] ['f', 'g']
So it should look like this:
df.apply(set_diff_func, axis=1)
What I'm trying to achieve:
0 ['b','c']
1 ['e']
I can do it using iterrows, but I've read that it's better to use apply when possible.
How about
df.apply(lambda row: list(set(row['A']) - set(row['B'])), axis = 1)
or
(df['A'].apply(set) - df['B'].apply(set)).apply(list)
Here's the function you need; you can change the names of the columns with the col1 and col2 arguments by passing them to the args option of apply:
def set_diff_func(row, col1, col2):
    return list(set(row[col1]).difference(set(row[col2])))
This should return the required result:
>>> dataset = pd.DataFrame(
[{'A':['a','b','c'], 'B':['a']},
{'A':['e', 'f', 'g'] , 'B':['f', 'g']}])
>>> dataset.apply(set_diff_func, axis=1, args=['A','B'])
0 [c, b]
1 [e]
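If the frames are large, a plain list comprehension over the two columns (my own sketch, not from the answers above) tends to be faster than a row-wise apply:
dataset['diff'] = [list(set(a) - set(b)) for a, b in zip(dataset['A'], dataset['B'])]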
I need to apply a list to a pandas dataframe column by column. The operation to be performed is string concatenation. More specifically:
Inputs I have:
df = pd.DataFrame([['a', 'b', 'c'], ['d', 'e', 'f']], columns=['Col1', 'Col2', 'Col3'])
lt = ['Prod1', 'Prod2', 'Prod3']
which results in:
>>>df
Col1 Col2 Col3
0 a b c
1 d e f
>>>lt
['Prod1', 'Prod2', 'Prod3']
Moreover, the length of lt will always equal the number of columns of df.
What I would like to have is a dataframe of this sort:
res = pd.DataFrame([['Prod1a', 'Prod2b', 'Prod3c'], ['Prod1d', 'Prod2e', 'Prod3f']],
columns=['Col1', 'Col2', 'Col3'])
which gives:
>>>res
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
Until now, I've been able to solve the problem by looping through rows and columns, but I suspect there's a more elegant way to do it (maybe something like apply). Does anyone have suggestions? Thanks!
You can perform broadcasted string concatenation:
lt + df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
You can also use NumPy's np.char.add function:
import numpy as np

df[:] = np.char.add(lt, df.values.astype(str))
df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
Thirdly, there is the list comprehension option.
df[:] = [[i + v for i, v in zip(lt, V)] for V in df.values.tolist()]
df
Col1 Col2 Col3
0 Prod1a Prod2b Prod3c
1 Prod1d Prod2e Prod3f
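If the prefixes ever come keyed by column name rather than by position (an assumption beyond the original question), you can align them explicitly instead:
prefixes = pd.Series({'Col1': 'Prod1', 'Col2': 'Prod2', 'Col3': 'Prod3'})
res = df.radd(prefixes, axis=1)  # computes prefix + value, aligned on column names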