Applying regex between two dataframes efficiently

Applying regex between two dataframes efficiently - python

I'm having a bit of a performance issue while trying to match two specific words in two dataframes. I need to return a 1 for every row containing a word and else a 0. The function I wrote looks as follows:
def matchWords(row):
row = row[0].upper()
for x in df_X.Names:
if re.search("\\b" + x + "\\b", row):
return 1
return 0
This function is called from a lambda and although it works fine, it takes quite a long time to run. I have allready applied multithreading in an effort to increase the speed but I want it faster. Is there a way to maybe precompile the df_X.Names or does anybody have another tip to get this faster / more efficient?
Thanks in advance for any help!

IIUC you need str.contains, multiple words can be join by | (or). Last use numpy.where:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'d': {0: 'wa', 1: 'rs', 2: 'qn'},
'e': {0: 'i', 1: 'r', 2: 't'},
'f': {0: 'a', 1: 's', 2: 'f'}})
print df1
d e f
0 wa i a
1 rs r s
2 qn t f
df = pd.DataFrame({'a': {0: 'wa ug dh', 1: 'rs sd qn', 2: 'ga mf rn'},
'c': {0: 'i', 1: 'r', 2: 't'},
'b': {0: 'a', 1: 's', 2: 'f'}})
print df
a b c
0 wa ug dh a i
1 rs sd qn s r
2 ga mf rn f t
Join values from column d with separator |:
words = "|".join(df1.d.tolist())
print words
wa|rs|qn
print df.a.str.contains(words)
0 True
1 True
2 False
Name: a, dtype: bool
print np.where(df.a.str.contains(words), 1, 0)
[1 1 0]
df['new'] = np.where(df.a.str.contains(words), 1, 0)
print df
a b c new
0 wa ug dh a i 1
1 rs sd qn s r 1
2 ga mf rn f t 0

Related

TypeError: unsupported operand type(s) for &: 'str' and 'bool' for DF filtering

I am trying to filter my dataframe such that when I create a new columnoutput, it displays the "medium" rating. My dataframe has str values, so I convert them to numbers based on a ranking system I have and then I filter out the maximum and minimum rating per row.
I am running into this error:
TypeError: unsupported operand type(s) for &: 'str' and 'bool'
I've created a data frame that pulls str values from my csv file:
df = pdf.read_csv('csv path', usecols=['rating1','rating2','rating3'])
And my dataframe looks like this:
rating1 rating2 rating3
0 D D C
1 C B A
2 B B B
I need it to look like this
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3
I have a mapping dictionary that converts the values to numbers.
ranking = {
'D': 1, 'C':2, 'B': 3, 'A' : 4
}
Below you can find the code I use to determine the "medium rating". Basically, if all the ratings are the same, you can pull the minimum rating. If two of the ratings are the same, pull in the lowest rating. If the three ratings differ, filter out the max rating and the min rating.
if df == df.loc[(['rating1'] == df['rating2'] & df['rating1'] == df['rating3'])]:
df['mediumrating'] = df.replace(ranking).min(axis=1)
elif df == df.loc[(['rating1'] == df['rating2'] | df['rating1'] == df['rating3'] | df['rating2'] == df['rating3'])]:
df['mediumrating'] = df.replace(ranking).min(axis=1)
else:
df['mediumrating'] == df.loc[(df.replace(ranking) > df.replace(ranking).min(axis=1) & df.replace(ranking)
Any help on my formatting or process would be welcomed!!

Use np.where:
For the condition, use df.nunique applied to axis=1 and check if the result equals either 1 (all values are the same) or 2 (two different values) with Series.isin.
If True, we need df.min along axis=1.
If False (all unique values), we need df.median along axis=1.
Finally, use astype to turn resulting floats into integers.
import pandas as pd
import numpy as np
data = {'rating1': {0: 'D', 1: 'C', 2: 'B'},
'rating2': {0: 'D', 1: 'B', 2: 'B'},
'rating3': {0: 'C', 1: 'A', 2: 'B'}}
df = pd.DataFrame(data)
ranking = {'D': 1, 'C':2, 'B': 3, 'A' : 4}
df['mediumrating'] = np.where(df.replace(ranking).nunique(axis=1).isin([1,2]),
df.replace(ranking).min(axis=1),
df.replace(ranking).median(axis=1)).astype(int)
print(df)
rating1 rating2 rating3 mediumrating
0 D D C 1
1 C B A 3
2 B B B 3

Took to sec to understand what you really meant by filter. Here is some code that should be self explanatory and should achieve what you're looking for:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['D', 'D', 'C'], ['C', 'B', 'A'], ['B', 'B', 'B']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['rating1', 'rating2', 'rating3'])
# dictionary that maps the rating to a number
rating_map = {'D': 1, 'C': 2, 'B': 3, 'A': 4}
def rating_to_number(rating1, rating2, rating3):
if rating1 == rating2 and rating2 == rating3:
return rating_map[rating1]
elif rating1 == rating2 or rating1 == rating3 or rating2 == rating3:
return min(rating_map[rating1], rating_map[rating2], rating_map[rating3])
else:
return rating_map[sorted([rating1, rating2, rating3])[1]]
# create a new column based on the values of the other columns such that the new column has the value of therating_to_number function applied to the other columns
df['mediumrating'] = df.apply(lambda x: rating_to_number(x['rating1'], x['rating2'], x['rating3']), axis=1)
print(df)
This prints out:
rating1 rating2 rating3 mediumrating
0 D D C 2
1 C B A 3
2 B B B 3
Edit: updated rating_to_number based on your updated question

Filter out data with best combinations within two columns in pandas dataframe

I have a dataframe which looks like the below sample data
df = pd.DataFrame({'col_1':['1','1','1','2','2','3','3','4','4','4','5'],
'col_2':['A','B','C','A','B','C','D','D','A','A','B'],
'col_3':['256','546','985','573','265','731','968','592','364','657','953']})
print(df)
col_1 col_2 col_3
0 1 A 256
1 1 B 546
2 1 C 985
3 2 A 573
4 2 B 265
5 3 C 731
6 3 D 968
7 4 D 592
8 4 A 364
9 4 A 657
10 5 B 953
I want filter out this data and find out the best combinations of two columns covering each value at least once.
For example in the above data the first combination for 'col_1' and 'col_2' is [1,A] and when we search further the next available combination is [1,B] but it's is giving me only one new value i.e. 'B' as '1' is already covered in the first combination. So we should search further before finalizing the second combination. On searching further we can find a better combination i.e. [2,B] giving us both the new values. This way we can search for further combinations. Whichever combination we pick 'col_3' value should come as it is. The expected output is:
col_1 col_2 col_3
0 1 A 256
1 2 B 265
2 3 C 731
3 4 D 592
4 5 B 953
Whatever I tried to filter out this data didn't work for me.
Can anyone provide a solution or guide me in the right direction?

I think that the best way to achieve this is with an iterative algorithm.
First, you need to extract columns, then you define a get_score function to understand how much value a combination has, given an history of combinations already set as valuable.
Last, you append either combinations that are good (score == MAX_SCORE) or the last combination, as in your example (i == last_index).
In the end you will obtain a list of combinations (or tuples), that you can easily cast back to a DataFrame.
# define max score of a combination
MAX_SCORE = 2
# extract columns as lists
x, y = df.col_1.to_list(), df.col_2.to_list()
# function to compute score
def get_score(combination, old_combinations):
x, y = combination
if not old_combinations:
return MAX_SCORE
# split a list of two-sized tuples in two lists
xs, ys = [*zip(*old_combinations)]
score = MAX_SCORE
if x in xs:
score -= 1
if y in ys:
score -= 1
return score
# get list of pairs
iterator = list(zip(x, y))
last_index = len(iterator) - 1
combinations = []
# algorithm
for i, combination in enumerate(iterator):
score = get_score( combination, combinations )
if score == MAX_SCORE or i == last_index:
combinations.append( combination )
# result
combinations # [(1, 'A'), (2, 'B'), (3, 'C'), (4, 'D'), (5, 'B')]
new_df = pd.DataFrame(combinations, columns=["col_1", "col_2"])
print(new_df)
This prints the following:
col_1 col_2
0 1 A
1 2 B
2 3 C
3 4 D
4 5 B
EDIT:
In general, if you have N columns but want to calculate the scores only on M columns (with N <= M), you can generalize the algorithm as follows:
# the columns to use when calculating scores
COLUMNS = ["col_1", "col_2"]
# if no columns is provided, use every column
if not COLUMNS:
COLUMNS = df.columns.to_list()
# check non-empty columns
assert COLUMNS
# check valid columns
assert all(col in df.columns for col in COLUMNS)
MAX_SCORE = len(COLUMNS)
def get_score(combination, combinations) -> int:
if not combinations:
return MAX_SCORE
score = MAX_SCORE
for i, value in enumerate(combination):
if value in [c[i] for c in combinations]:
score -= 1
return score
last_index = len(df) - 1
combinations = []
rows = []
for i, row in df.iterrows():
combination = row[COLUMNS].values.tolist()
score = get_score(combination, combinations)
if score == MAX_SCORE or i == last_index:
combinations.append(combination)
rows.append(row)
new_df = pd.DataFrame(rows).reset_index(drop=True)
print(new_df)
This prints the following:
col_1 col_2 col_3
0 1 A 256
1 2 B 265
2 3 C 731
3 4 D 592
4 5 B 953

Iteratively adding pairs while using groupby to count the available unused options, commented inline for clarity. Superior to other iterative approaches as the loop has finishes in max df['col_2'].nunique() iterations, possibly faster.
df = pd.DataFrame({
'col_1': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 4, 10: 5},
'col_2': {0: 'A', 1: 'B', 2: 'C', 3: 'A', 4: 'B', 5: 'C', 6: 'D', 7: 'D', 8: 'A', 9: 'A', 10: 'B'},
})
pairs = pd.DataFrame(columns=df.columns)
# Iterate and get all col_1, col_2 pair such that the combination is optimal
# - col_2 not yet in pairs
# - col_1 associated uniquely (or minimally) to that col_2, once previous combinations were removed
while len(df[~df['col_2'].isin(pairs['col_2'])]):
options = df[~df['col_2'].isin(pairs['col_2'])].groupby('col_1')['col_2'].unique()
n_options = options.str.len()
if n_options.eq(1).any():
add = options.index[n_options.eq(1)]
else:
add = [n_options.idxmin()]
pairs = pairs.append(options[add].reset_index().explode('col_2'), ignore_index=True)
# Now all possible col_2 are in pairs, add missing col_1 with any col_2
pairs = pairs.append(df[~df['col_1'].isin(pairs['col_1'])].drop_duplicates('col_1'), ignore_index=True)
print(pairs)
This returns the following:
col_1 col_2
0 5 B
1 2 A
2 1 C
3 4 D
4 3 C

df = df.pivot(columns='col_2', values='col_1')
desired_df = []
last_uniques = []
for col in df.columns:
uniques = df[col].dropna().unique()
desired = list(set(uniques) - set(last_uniques))
desired_df.append({'col_1': desired[0], 'col_2': col})
last_uniques.append(desired[0])
for col in df.columns:
uniques = list(set(df[col].dropna()) - set(last_uniques))
if uniques:
desired_df.append({'col_1': uniques[0], 'col_2': col})
last_uniques.append(uniques[0])
df_final = pd.DataFrame(data=desired_df)
OUTPUT:
col_1 col_2
0 1 A
1 2 B
2 3 C
3 4 D
4 5 B

You can first add iteratively as required − your desired output leaves no other way than to use a loop:
>>> use = df.iloc[[0]]
>>> while True:
... mask = df['col_1'].isin(use['col_1']) | df['col_2'].isin(use['col_2'])
... if mask.all():
... break
... use = use.append(df[~mask].iloc[0], ignore_index=True)
...
>>> use
col_1 col_2 col_3
0 1 A 256
1 2 B 265
2 3 C 731
3 4 D 592
Then we need to add in rows with missing values in either column:
>>> miss_1 = df.groupby('col_1', as_index=False).first()
>>> miss_2 = df.groupby('col_2', as_index=False).first()
>>> use = use.append(miss_1[~miss_1['col_1'].isin(use['col_1'])], ignore_index=True)
>>> use = use.append(miss_2[~miss_2['col_2'].isin(use['col_2'])], ignore_index=True)
>>> use
col_1 col_2 col_3
0 1 A 256
1 2 B 265
2 3 C 731
3 4 D 592
4 5 B 953

Thanks everyone for your help.
Here's my working solution which is giving the output as expected:
import pandas as pd
df = pd.DataFrame({'col_1': ['1', '1', '1', '2', '2', '3', '3', '4', '4', '4', '5', '5', '5', '5', '5', '5'],
'col_2': ['A', 'B', 'C', 'A', 'B', 'C', 'D', 'D', 'A', 'A', 'B', 'E', 'F', 'G', 'H', 'I'],
'col_3': ['256', '546', '985', '573', '265', '731', '968', '592', '364', '657', '953', '476', '835',
'683', '572', '903']})
df = df.dropna()
# the columns to use when calculating scores
COLUMNS = ["col_1", "col_2"]
col_1_set = set()
col_2_set = set()
best_combinations = []
rows = []
ignored_rows = []
for i, row in df.iterrows():
combination = row[COLUMNS].values.tolist()
# Let's check whether we have picked col_1 value or not yet
if combination[0] in col_1_set:
# We have already picked col_1 value
# Let's check the col_2 value status
if combination[1] in col_2_set:
# col_2 value has also been picked
# Let's ignore this combination and continue checking further
continue
else:
# col_2 value is not yet picked but col_1 value has already been picked
# Since it's not the best combination we will check it later again
ignored_rows.append(row)
elif combination[1] in col_2_set:
# This col_1 value is new for us.
# But col_2 value has already been picked
# Since it's not the best combination we will check it later again
ignored_rows.append(row)
else:
# col_value has also not yet picked.
# Best combination found, Let's store it
best_combinations.append(combination)
col_1_set.add(combination[0])
col_2_set.add(combination[1])
rows.append(row)
# Now Let's check for the ignored combinations
for ignored_row in ignored_rows:
ignored_combination = ignored_row[COLUMNS].values.tolist()
if not (ignored_combination[0] in col_1_set) or not (ignored_combination[1] in col_2_set):
# Either of the elements within the above combination is missed by us
# Let's add it
best_combinations.append(ignored_combination)
col_1_set.add(ignored_combination[0])
col_2_set.add(ignored_combination[1])
rows.append(ignored_row)
df_filtered = pd.DataFrame(rows).reset_index(drop=True)
print(df_filtered)
Output:
col_1 col_2 col_3
0 1 A 256
1 2 B 265
2 3 C 731
3 4 D 592
4 5 E 476
5 5 F 835
6 5 G 683
7 5 H 572
8 5 I 903

Putting values of specific column ranges in other existing columns of the same dataframe

I have received some survey data that, in a simplified manner, looks similar to the following:
Q1 C1 I11 I12 I13 Q2 C2 I21 I22 I23 Q3 C3 I31 I32 I33
0 test1 a b c d test2 e f g h test3 i j k l
In the end, I have reshaped the data to the preferred structure by executing the following code:
df = pd.DataFrame({'Q1': {0: 'test1'}, 'C1': {0: 'a'}, 'I11': {0: 'b'}, 'I12': {0: 'c'}, 'I13': {0: 'd'}, 'Q2': {0: 'test2'}, 'C2': {0: 'e'}, 'I21': {0: 'f'}, 'I22': {0: 'g'}, 'I23': {0: 'h'}, 'Q3': {
0: 'test3'}, 'C3': {0: 'i'}, 'I31': {0: 'j'}, 'I32': {0: 'k'}, 'I33': {0: 'l'}})
header_list = ['Q', 'CA', 'IA1', 'IA2', 'IA3']
df1 = df.iloc[:,0:5]
df2 = df.iloc[:,5:10]
df3 = df.iloc[:,10:15]
for x in df1, df2, df3:
x.columns = header_list
final = pd.concat([df1, df2, df3])
print(final)
Output:
Q CA IA1 IA2 IA3
0 test1 a b c d
0 test2 e f g h
0 test3 i j k l
Although this works, I was wondering if there is a more efficient way of obtaining an equivalent result (instead of creating subset dataframes as above). Essentially, the values of the columns after the 5th one (i.e. "I13") should be placed under the first 5 accordingly. In this simplified version, this yields 3 rows, since there are only 3 subsets, but this would obviously become more cumbersome with the above code if this becomes larger.
Thanks in advance!
PS: I am still new to Python and programming

You can try reshape:
pd.DataFrame(df.values.reshape(-1,5),
columns=['Q','CA','IA1','IA2','IA3'])
Output:
Q CA IA1 IA2 IA3
0 test1 a b c d
1 test2 e f g h
2 test3 i j k l

Try this method if you want to use pandas operations only and don't want to keep changing data types -
lst = list(df.columns)
n=5
new_cols = ['Q', 'CA', 'IA1','IA2','IA3']
#break the column list into groups of n = 3 in this case
chunks = [lst[i:i + n] for i in range(0, len(lst), n)]
#concatenate the list of dataframes over axis = 0after renaming columns of each
pd.concat([df[i].set_axis(new_cols, axis=1) for i in chunks], axis=0, ignore_index=True)
Q CA IA1 IA2 IA3
0 test1 a b c d
1 test2 e f g h
2 test3 i j k l

How to do exact string match while filtering from pandas dataframe

I have a dataframe as
df
indx pids
A 181718,
B 31718,
C 1718,
D 1235,3456
E 890654,
I want to return a row that matches 1718 exactly.
I tried to do this but as expected it returns rows where the 1718 is subset as well:
group_df = df.loc[df['pids'].astype(str).str.contains('{},'.format(1718)), 'pids']
indx pids
A 181718,
B 31718,
C 1718,
When I try to do something like this, it returns empty:
cham_geom = df.loc[df['pids'] == '1718', 'pids']
Expected output:
indx pids
C 1718,
Can anyone help me with it?

you can try with:
df[df.pids.replace('\D','',regex=True).eq('1718')]
indx pids
2 C 1718,
'\D' : Any character that is not a numeric digit from 0 to 9.
EDIT
Considering the below df:
indx pids
0 A 181718,
1 B 31718,
2 C 1718,
3 D 1235,3456
4 E 890654,
5 F 3220,1718
executing:
df[df.pids.str.split(",").apply(lambda x: '1718' in x)]
#if not comma only:-> df[df.pids.str.split("\D").apply(lambda x: '1718' in x)]
Gives:
indx pids
2 C 1718,
5 F 3220,1718

there is a method isin that matches and returns a dataframe containing True for matched and false for not matching.
Consider the following example
>>> found = df.isin(["1718,"])==True
>>> df[found].head(3)
this will show the first 3 values matched with 1718
or if you want to match it with only 1 value then you can do so
>>> found = df.pids == "1718,"
>>> df[found].head(3)

Use str.contains with a negative lookbehind, to ensure there are no other digits before '1718'
Sample Data
import pandas as pd
d = {'indx': {0: 'A', 1: 'B', 2: 'C', 3: 'D', 4: 'E', 5: 'F', 6: 'G'},
'pids': {0: '181718,', 1: '31718,', 2: '1718,', 3: '1235,3456', 4: '890654,', 5: '1231,1718', 6: '1231, 1718'}}
df = pd.DataFrame(d)
Code:
df.loc[df.pids.str.contains('(?<![0-9])1718')]
Output:
indx pids
2 C 1718,
5 F 1231,1718
6 G 1231, 1718

Python Pandas - Combining Multiple Columns into one Staggered Column

How do you combine multiple columns into one staggered column? For example, if I have data:
Column 1 Column 2
0 A E
1 B F
2 C G
3 D H
And I want it in the form:
Column 1
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
What is a good, vectorized pythonic way to go about doing this? I could probably do some sort of df.apply() hack but I'm betting there is a better way. The application is putting multiple dimensions of time series data into a single stream for ML applications.

First stack the columns and then drop the multiindex:
df.stack().reset_index(drop=True)
Out:
0 A
1 E
2 B
3 F
4 C
5 G
6 D
7 H
dtype: object

To get a dataframe:
pd.DataFrame(df.values.reshape(-1, 1), columns=['Column 1'])
For a series answering OP question:
pd.Series(df.values.flatten(), name='Column 1')
For a series timing tests:
pd.Series(get_df(n).values.flatten(), name='Column 1')
Timing
code
def get_df(n=1):
df = pd.DataFrame({'Column 2': {0: 'E', 1: 'F', 2: 'G', 3: 'H'},
'Column 1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}})
return pd.concat([df for _ in range(n)])
Given Sample
Given Sample * 10,000
Given Sample * 1,000,000

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Applying regex between two dataframes efficiently - python

Related

TypeError: unsupported operand type(s) for &: 'str' and 'bool' for DF filtering

Filter out data with best combinations within two columns in pandas dataframe

Putting values of specific column ranges in other existing columns of the same dataframe

How to do exact string match while filtering from pandas dataframe

Python Pandas - Combining Multiple Columns into one Staggered Column

Categories

Resources