I have the following df and function (see below). I might be overcomplicating this. A fresh set of eyes would be deeply appreciated.
df:
Site Name Plan Unique ID Atlas Placement ID
Affectv we11080301 11087207850894
Mashable we14880202 11087208009031
Alphr uk10790301 11087208005229
Alphr uk19350201 11087208005228
The goal is to:
Iterate first through df['Plan Unique ID'] and search for a specific value (we_match or uk_match); if there is a match,
check that the string value is greater than a certain value in that group (we12720203 or uk11350200).
If the value is greater, add that we or uk value to a new column df['Consolidated ID'].
If the value is lower or there is no match, search df['Atlas Placement ID'] with new_id_search.
If there is a match, add that to df['Consolidated ID'].
If not, return 0 to df['Consolidated ID'].
The current problem is that it returns an empty column.
def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):
if type(df['Plan Unique ID']) is str:
we_match = re.search(we_search, df['Plan Unique ID'])
if we_match:
if we_match > "we12720203":
return we_match.group(0)
else:
uk_match = re.search(uk_search, df['Plan Unique ID'])
if uk_match:
if uk_match > "uk11350200":
return uk_match.group(0)
else:
match_new = re.search(new_id_search, df['Atlas Placement ID'])
if match_new:
return match_new.group(0)
return 0
mediaplan_df['Consolidated ID'] = mediaplan_df.apply(placement_extract, axis=1)
Edit: Cleaned the formula
I modified gzl's function in the following way (see below): first, check whether df1 contains a 14-digit number. If so, add that.
The next step, ideally, would be to grab the column MediaPlanUnique from df2 and turn it into a series, filtered_placements:
we11080301
we12880304
we14880202
uk19350201
uk11560205
uk11560305
And then check whether any of the values in filtered_placements are present in df['Plan Unique ID']. If there is a match, add df['Plan Unique ID'] to our end column, df['ConsolidatedID'].
The current problem is that it results in all 0. I think it's because the comparison is being done as 1 to 1 (first result of new_match vs first result of filtered_placements) rather than 1 to many (first result of new_match vs all results of filtered_placements).
Any ideas?
def placement_extract(df="mediaplan_df", new_id_search="[a-zA-Z]{2}\d{8}", old_id_search= "(\d{14})"):
if type(df['PlacementID']) is str:
old_match = re.search(old_id_search, df['PlacementID'])
if old_match:
return old_match.group(0)
else:
if type(df['Plan Unique ID']) is str:
if type(filtered_placements) is str:
new_match = re.search(new_id_search, df['Plan Unique ID'])
if new_match:
if filtered_placements.str.contains(new_match.group(0)):
return new_match.group(0)
return 0
mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)
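One way to fix the 1-to-many comparison is to test the matched ID against all values of filtered_placements rather than against the Series object itself. A minimal sketch, assuming filtered_placements is a pandas Series of strings; the valid_ids set is my own helper, built once outside the function so the lookup stays cheap:

import re

valid_ids = set(filtered_placements.dropna())  # build the lookup once, outside apply()

def placement_extract(row, new_id_search=r"[a-zA-Z]{2}\d{8}", old_id_search=r"(\d{14})"):
    placement = row['PlacementID']
    plan_id = row['Plan Unique ID']
    if isinstance(placement, str):
        old_match = re.search(old_id_search, placement)
        if old_match:
            return old_match.group(0)
    if isinstance(plan_id, str):
        new_match = re.search(new_id_search, plan_id)
        # 1-to-many check: is the matched ID among *any* of the filtered placements?
        if new_match and new_match.group(0) in valid_ids:
            return new_match.group(0)
    return 0

mediaplan_df['ConsolidatedID'] = mediaplan_df.apply(placement_extract, axis=1)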
I would recommend not using such complicated nested if statements. As Phil pointed out, each check is mutually exclusive, so you can check 'we' and 'uk' at the same indentation level inside the if statement and then fall back to the default process.
def placement_extract(df="mediaplan_df", we_search="we\d{8}", uk_search="uk\d{8}", new_id_search= "(\d{14})"):
if type(df['Plan Unique ID']) is str:
we_match = re.search(we_search, df['Plan Unique ID'])
if we_match:
if we_match.group(0) > "we12720203":
return we_match.group(0)
uk_match = re.search(uk_search, df['Plan Unique ID'])
if uk_match:
if uk_match.group(0) > "uk11350200":
return uk_match.group(0)
match_new = re.search(new_id_search, df['Atlas Placement ID'])
if match_new:
return match_new.group(0)
return 0
Test:
In [37]: df.apply(placement_extract, axis=1)
Out[37]:
0 11087207850894
1 we14880202
2 11087208005229
3 uk19350201
dtype: object
I've reorganised the logic and also simplified the regex operations to show another way to approach it. The reorganisation wasn't strictly necessary for the answer, but as you asked for another opinion / way of approaching it, I thought this might help you in future:
# Inline comments to explain the main changes.
# Inline comments to explain the main changes.
def placement_extract(row, we_search="we12720203", uk_search="uk11350200"):
    # Extracted to a shorter temp variable
    plan_id = row["Plan Unique ID"]
    # Using parentheses to get two separate groups - code and numeric -
    # means the match only has to be done once
    result = re.match("(we|uk)(.+)", plan_id)
    if result:
        code, numeric = result.groups()
    else:
        code = None
    # We can get away with these simple tests because the regex guarantees
    # that a matched string starts with either "we" or "uk"
    if code == "we" and plan_id > we_search:
        return_val = plan_id
    elif code == "uk" and plan_id > uk_search:
        return_val = plan_id
    else:
        # This column looked like it was used as the fallback whatever happened,
        # so there's no need to check it against a regex. The Atlas Placement ID
        # is the default whenever the row fails the prefix check OR the "greater than" test
        return_val = row["Atlas Placement ID"]
    # A single return statement is often easier to debug
    return return_val
Then use it in an apply statement (also look into assign):
$ mediaplan_df["Consolidated ID"] = mediaplan_df.apply(placement_extract, axis=1)
$ mediaplan_df
>
Site Name Plan Unique ID Atlas Placement ID Consolidated ID
0 Affectv we11080301 11087207850894 11087207850894
1 Mashable we14880202 11087208009031 we14880202
2 Alphr uk10790301 11087208005229 11087208005229
3 Alphr uk19350201 11087208005228 uk19350201
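Since assign was mentioned: the same column can also be added with assign. A minimal sketch (the dict unpacking is only needed because the column name contains a space):

mediaplan_df = mediaplan_df.assign(
    **{"Consolidated ID": mediaplan_df.apply(placement_extract, axis=1)}
)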
Related
In the example below, how do I specify 'mansion' under 'h_type' and find its highest price?
(That is, how do I prevent it from finding the highest price across the whole dataset, which might also include 'apartment'?)
i.e.:
df=pd.DataFrame({'h_type':[aparment,mansion,....],'h_price':[..., ...,...]})
if df.loc[df['h_type']=='mansion']: ##<= do not work,
aidMax = priceSr.idxmax()
if not isnan(aidMax):
amaxSr = df.loc[aidMax]
if amost is None:
amost = amaxSr.copy()
else:
if float(amaxSr['h_price']) > float(amost['h_price']):
amost = amaxSr.copy()
amost = amost.to_frame().transpose()
print(amost, '\n==========')
TL;DR:
This can be a one-liner:
max_price = df[df["h_type"] == "mansion"]["h_price"].max()
Explanation
A little bit of explanation:
df[df["h_type"] == "mansion"]
This piece selects all the rows whose "h_type" column equals "mansion".
df[df["h_type"] == "mansion"]["h_price"]
Then, we access the "h_price" column of those rows.
Finally,
df[df["h_type"] == "mansion"]["h_price"].max()
returns the maximum value of that column (among the selected rows).
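If, as the original loop with idxmax suggests, you want the whole row of the most expensive mansion rather than just the price, a short sketch along the same lines (assuming 'h_price' is numeric):

mansions = df[df["h_type"] == "mansion"]
most_expensive = df.loc[mansions["h_price"].idxmax()]  # the full row with the highest mansion price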
I have the script below, which aims to provide a "merge based on a partial match" functionality, since this is not possible with the normal .merge() function, to the best of my knowledge.
It works and returns the desired result, but unfortunately it is incredibly slow, to the point of being almost unusable where I need it.
I've been looking at other Stack Overflow posts that cover similar problems, but haven't yet been able to find a faster solution.
Any thoughts on how this could be accomplished would be appreciated!
import pandas as pd
df1 = pd.DataFrame([ 'https://wwww.example.com/hi', 'https://wwww.example.com/tri', 'https://wwww.example.com/bi', 'https://wwww.example.com/hihibi' ]
,columns = ['pages']
)
df2 = pd.DataFrame(['hi','bi','geo']
,columns = ['ngrams']
)
def join_on_partial_match(full_values=None, matching_criteria=None):
# Changing columns name with index number
full_values.columns.values[0] = "full"
matching_criteria.columns.values[0] = "ngram_match"
# Creating matching column so all rows match on join
full_values['join'] = 1
matching_criteria['join'] = 1
dfFull = full_values.merge(matching_criteria, on='join').drop('join', axis=1)
# Dropping the 'join' column we created to join the 2 tables
matching_criteria = matching_criteria.drop('join', axis=1)
# identifying matching and returning bool values based on whether match exists
dfFull['match'] = dfFull.apply(lambda x: x.full.find(x.ngram_match), axis=1).ge(0)
# filtering dataset to only 'True' rows
final = dfFull[dfFull['match'] == True]
final = final.drop('match', axis=1)
return final
join = join_on_partial_match(full_values=df1,matching_criteria=df2)
print(join)
>> full ngram_match
0 https://wwww.example.com/hi hi
7 https://wwww.example.com/bi bi
9 https://wwww.example.com/hihibi hi
10 https://wwww.example.com/hihibi bi
For anyone who is interested - I ended up figuring out two ways to do this.
The first returns all matches (i.e., it duplicates the input value and pairs it with every partial match).
The second returns only the first match.
Both are extremely fast. I ended up using a pretty simple masking script:
def partial_match_join_all_matches_returned(full_values=None, matching_criteria=None):
"""The partial_match_join_first_match_returned() function takes two series objects and returns a dataframe with all matching values (duplicating the full value).
Args:
full_values = None: This is the series that contains the full values for matching pair.
partial_values = None: This is the series that contains the partial values for matching pair.
Returns:
A dataframe with 2 columns - 'full' and 'match'.
"""
start_join1 = time.time()
matching_criteria = matching_criteria.to_frame("match")
full_values = full_values.to_frame("full")
full_values = full_values.drop_duplicates()
output=[]
for n in matching_criteria['match']:
mask = full_values['full'].str.contains(n, case=False, na=False)
df = full_values[mask]
df_copy = df.copy()
df_copy['match'] = n
# df = df.loc[n, 'match']
output.append(df_copy)
final = pd.concat(output)
end_join1 = (time.time() - start_join1)
end_join1 = str(round(end_join1, 2))
len_join1 = len(final)
return final
def partial_match_join_first_match_returned(full_values=None, matching_criteria=None):
"""The partial_match_join_first_match_returned() function takes two series objects and returns a dataframe with the first matching value.
Args:
full_values = None: This is the series that contains the full values for matching pair.
partial_values = None: This is the series that contains the partial values for matching pair.
Returns:
A dataframe with 2 columns - 'full' and 'match'.
"""
start_singlejoin = time.time()
matching_criteria = matching_criteria.to_frame("match")
full_values = full_values.to_frame("full").drop_duplicates()
output=[]
for n in matching_criteria['match']:
mask = full_values['full'].str.contains(n, case=False, na=False)
df = full_values[mask]
df_copy = df.copy()
df_copy['match'] = n
        output.append(df_copy)
    final = pd.concat(output)
    # keep only the first match for each URL
    final = final.drop_duplicates(subset=['full'])
end_singlejoin = (time.time() - start_singlejoin)
end_singlejoin = str(round(end_singlejoin, 2))
len_singlejoin = len(final)
return final
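As an aside, if only the first matching ngram per URL is needed, the loop can be replaced by a single vectorized pass; a minimal sketch, assuming df1 and df2 as defined at the top of the question (re.escape is added in case any ngram contains regex metacharacters):

import re
import pandas as pd

# Build one alternation pattern from all ngrams, e.g. "(hi|bi|geo)"
pattern = "(" + "|".join(re.escape(n) for n in df2["ngrams"]) + ")"

# str.extract returns the first match per row, or NaN when nothing matches
matches = df1["pages"].str.extract(pattern, expand=False)
result = pd.DataFrame({"full": df1["pages"], "match": matches}).dropna(subset=["match"])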
So I have a dataframe called reactions_drugs
and I want to create a table called new_r_d where I keep track of how often I see a symptom for a given medication, like
Here is the code I have, but I am running into errors such as "Unable to coerce to Series, length must be 3 given 0".
new_r_d = pd.DataFrame(columns = ['drugname', 'reaction', 'count']
for i in range(len(reactions_drugs)):
name = reactions_drugs.drugname[i]
drug_rec_act = reactions_drugs.drug_rec_act[i]
for rec in drug_rec_act:
row = new_r_d.loc[(new_r_d['drugname'] == name) & (new_r_d['reaction'] == rec)]
if row == []:
# create new row
new_r_d.append({'drugname': name, 'reaction': rec, 'count': 1})
else:
new_r_d.at[row,'count'] += 1
Assuming the rows in your current reactions column (drug_rec_act) each contain one string enclosed in a list, you can convert the values in that column to lists of strings (by splitting each string on the comma delimiter) and then use the explode() function and value_counts() to get your desired result:
df['drug_rec_act'] = df['drug_rec_act'].apply(lambda x: x[0].split(','))
df_long = df.explode('drug_rec_act')
result = df_long.groupby('drugname')['drug_rec_act'].value_counts().reset_index(name='count')
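A minimal end-to-end sketch of how this might look, assuming reactions_drugs has the shape described above (a drugname column plus a drug_rec_act column holding single-element lists of comma-separated reaction strings; the sample values here are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "drugname": ["drug_a", "drug_b"],
    "drug_rec_act": [["nausea,headache,nausea"], ["rash"]],
})

df["drug_rec_act"] = df["drug_rec_act"].apply(lambda x: x[0].split(","))
df_long = df.explode("drug_rec_act")
new_r_d = df_long.groupby("drugname")["drug_rec_act"].value_counts().reset_index(name="count")
print(new_r_d)  # e.g. drug_a/nausea gets count 2, the other combinations count 1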
I have a pandas dataframe of 7000 rows; below is a sample.
I need to fill in the missing branch_type column; the missing info is available in other rows. For the first row, I search the dataframe's ['link_name'] for B-A and use its root_type as the branch_type.
After the extraction I want to delete the row I extracted the root_type from, to get an output like this:
I tried the code below, but it doesn't work properly:
count = 0
missing = 0
errored_links=[]
for i,j in bmx.iterrows():
try:
spn = bmx[bmx.link_name ==j.link_reverse_name].root_type.values[0]
index_t = bmx[bmx.link_name ==j.link_reverse_name].root_type.index[0]
bmx.drop(bmx.index[index_t],inplace=True)
count+=1
bmx.at[i,'branch_type']=spn
except:
bmx.at[i,'branch_type']='missing'
missing+=1
errored_links.append(j)
print('Iterations: ',count)
print('Missing: ', missing)
Build up a list of indices to be removed, do the job, and only after iterating over all rows drop the unneeded ones. Don't use if/else inside the loop: simply set everything to "missing" at the start and then overwrite the rows that do have a branch type.
bmx=pd.DataFrame({'link_name':["A-B","C-D","B-A","D-C"],
'root_type':["type1", "type2", "type6", "type1"],
'branch_type':["","","",""],
'link_reverse_name':["B-A","D-C","A-B","C-D"]},
columns=['link_name','root_type','branch_type','link_reverse_name'])
bmx["branch_type"]="missing" #set all to be missing by start, get rid of ifs :)
to_remove = []
for i, j in bmx.iterrows():
    if i in to_remove:
        continue  # just skip if we already marked the row for removal
    reverse = bmx[bmx.link_name == j.link_reverse_name]
    if not reverse.empty:
        # write back through the frame: iterrows() yields copies, so
        # assigning to j.branch_type would not change bmx itself
        bmx.at[i, 'branch_type'] = reverse.root_type.values[0]
        to_remove.append(reverse.index[0])  # mark the reverse row for removal
bmx.drop(to_remove, inplace=True)
print(bmx)
We get the desired output:
link_name root_type branch_type link_reverse_name
0 A-B type1 type6 B-A
1 C-D type2 type1 D-C
Of course I'm assuming that all entries are unique, otherwise this will produce some duplicates. I left out the columns that aren't relevant to the problem, for simplicity.
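As an aside, the lookup itself can also be done without an explicit loop; a minimal sketch, assuming link_name values are unique and every pair is present in both directions (it keeps, somewhat arbitrarily, the lexicographically smaller link_name of each pair):

import pandas as pd

bmx = pd.DataFrame({'link_name': ["A-B", "C-D", "B-A", "D-C"],
                    'root_type': ["type1", "type2", "type6", "type1"],
                    'link_reverse_name': ["B-A", "D-C", "A-B", "C-D"]})

# Map each row's reverse link to that link's root_type
lookup = bmx.set_index('link_name')['root_type']
bmx['branch_type'] = bmx['link_reverse_name'].map(lookup).fillna('missing')

# Keep only one direction of each pair
result = bmx[bmx['link_name'] < bmx['link_reverse_name']]
print(result)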
I have a data frame df1. "transactions" column has an array of int.
id transactions
1 [1,2,3]
2 [2,3]
data frame df2. "items" column has an array of int.
items cost
[1,2] 2.0
[2] 1.0
[2,4] 4.0
I need to check whether all elements of items are in each transaction; if so, sum up the costs.
Expected Result
id transaction score
1 [1,2,3] 3.0
2 [2,3] 1.0
I did the following
#cross join
-----------
def cartesian_product_simplified(left, right):
la, lb = len(left), len(right)
ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
return pd.DataFrame(
np.column_stack([left.values[ia2.ravel()],
right.values[ib2.ravel()]]))
out=cartesian_product_simplified(df1,df2)
#column names assigning
out.columns=['id', 'transactions', 'cost', 'items']
#converting panda series to list
t=out["transactions"].tolist()
item=out["items"].tolist()
#check list present in another list
-------------------------------------
def check(trans,itm):
out_list=list()
for row in trans:
ret =np.all(np.in1d(itm, row))
out_list.append(ret)
return out_list
if true: group and sum
-----------------------
a=check(t,item)
for i in a:
if(i):
print(out.groupby(['id','transactions']))['cost'].sum()
else:
print("no")
Throws TypeError: 'NoneType' object is not subscriptable.
I am new to Python and don't know how to put all these together. How do I group by and sum the cost when all the items of one list are in another list?
The simplest way is just to check all items for all transactions:
# df1 and df2 are initialized
def sum_score(transaction):
score = 0
for _, row in df2.iterrows():
if all(item in transaction for item in row["items"]):
score += row["cost"]
return score
df1["score"] = df1["transactions"].map(sum_score)
This will be extremely slow at scale. If that is a problem, we shouldn't iterate over every item; instead we preselect only the possible candidates. If you have enough memory, it can be done like this: for each item we remember all the row numbers in df2 where it appears, so for each transaction we collect its items, gather all the possibly matching rows, and check only those.
import collections
# df1 and df2 are initialized
def get_sum_score_precalculated_func(items_cost_df):
# create a dict of possible indexes to search for an item
    items_search_dict = collections.defaultdict(set)
    for i, (_, row) in enumerate(items_cost_df.iterrows()):
for item in row["items"]:
items_search_dict[item].add(i)
def sum_score(transaction):
possible_indexes = set()
for i in transaction:
            possible_indexes |= items_search_dict[i]
score = 0
for i in possible_indexes:
row = items_cost_df.iloc[i]
if all(item in transaction for item in row["items"]):
score += row["cost"]
return score
return sum_score
df1["score"] = df1["transactions"].map(get_sum_score_precalculated_func(df2))
Here I use:
set, which is an unordered collection of unique values (it helps to combine the possible line numbers and avoid double counting).
collections.defaultdict, which is a usual dict except that accessing a missing key fills it with the given default (an empty set in my case). That helps avoid writing if x not in my_dict: my_dict[x] = set(). I also use a so-called "closure", which means the sum_score function keeps access to items_cost_df and items_search_dict (which were in scope where sum_score was declared) even after get_sum_score_precalculated_func has returned.
That should be much faster if the items are fairly unique and can be found in only a few lines of df2.
If many of your transactions are identical, you would be better off calculating the score for each unique transaction first and then just joining the result.
# Note: for unique() and merge() to work, the transaction values must be
# hashable (e.g. tuples rather than lists)
transaction_score = []
for transaction in df1["transactions"].unique():
    score = sum_score(transaction)
    transaction_score.append([transaction, score])
transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions", "score"])
df1 = df1.merge(transaction_score, on="transactions", how="left")
Here I use sum_score from the first code example.
P.S. The Python error message should include a line number, which helps a lot in understanding the problem.
# convert df_1 to dictionary for iteration
df_1_dict = dict(zip(df_1["id"], df_1["transactions"]))
# convert df_2 to list for iteration as there is no unique column
df_2_list = df_2.values.tolist()
# iterate through each combination to find a valid one
new_data = []
for rows in df_2_list:
items = rows[0]
costs = rows[1]
for key, value in df_1_dict.items():
# find common items in both
common = set(value).intersection(set(items))
        # only count this row if every one of its items is in the transaction
if len(common) == len(items):
new_row = {"id": key, "transactions": value, "costs": costs}
new_data.append(new_row)
merged_df = pd.DataFrame(new_data)
merged_df = merged_df[["id", "transactions", "costs"]]
# group the data by id to get total cost for each id
merged_df = (
merged_df
.groupby(["id"])
.agg({"costs": "sum"})
.reset_index()
)