Finding relationships between values based on their name in Python with pandas - python

I want to build relationships between values by their Name, based on the rules below:
1- I have a CSV file (with more than 100,000 rows) that consists of lots of values. Some examples:
Name:
A02-father
A03-father
A04-father
A05-father
A07-father
A08-father
A09-father
A17-father
A18-father
A20-father
A02-SA-A03-SA
A02-SA-A04-SA
A03-SA-A02-SA
A03-SA-A05-SA
A03-SA-A17-SA
A04-SA-A02-SA
A04-SA-A09-SA
A05-SA-A03-SA
A09-SA-A04-SA
A09-SA-A20-SA
A17-SA-A03-SA
A17-SA-A18-SA
A18-SA-A17-SA
A20-SA-A09-SA
A05-NA
B02-Father
B04-Father
B06-Father
B02-SA-B04-SA
B04-SA-BO2-SA
B04-SA-B06-SA
B06-SA-B04-SA
B06-NA
2- I have another CSV file which tells me from which value I should start. In this case the values are
A03-father, B02-father, etc. They don't have any influence on each other and each has a separate path to follow, so every path starts from its own start point.
father.csv
A03-father
B02-father
....
3- Based on the naming I want to build the relationships. Since A03-father has been determined as a father, I should check for any value that starts with A03 (all of them are A03's babies).
Likewise, since B02 is a father, we check for any value that starts with B02 (B02-SA-B04-SA).
4- If I find A03-SA-A02-SA, this is A03's baby.
If I find A03-SA-A05-SA, this is A03's baby.
If I find A03-SA-A17-SA, this is A03's baby.
After that I must check any node that starts with A02, A05, or A17:
As you see, A02-father exists, so it is a father too, and now we search for any string that starts with A02 and does not contain A03, which has already been detected as a father (it must be ignored).
This must be repeated until the end of the values in the CSV file.
In other words, I should match the path based on the name (regex) and keep going forward until the end of the path.
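To make the prefix idea concrete, here is a minimal sketch of extracting the leading node id with a regex (the helper name and the exact pattern are my own illustration, not part of the code below):
import re

# Hypothetical helper: pull the leading node id (e.g. "A03") out of a name
def prefix(name):
    m = re.match(r'^([A-Za-z]+\d+)', name)
    return m.group(1) if m else ''

print(prefix('A03-SA-A02-SA'))  # A03
print(prefix('B02-father'))     # B02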
The expected result:
Father Baby
A03-father A03-SA-A02-SA
A03-father A03-SA-A05-SA
A03-father A03-SA-A17-SA
A02-father A02-SA-A04-SA
A05-father A05-NA
A17-father A17-SA-A18-SA
A04-father A04-SA-A09-SA
A02-father A02-SA-A04-SA
A09-father A09-SA-A20-SA
B02-father B02-SA-B04-SA
B04-father B04-SA-B06-SA
B06-father B06-NA
I have coded it as below with pandas:
import pandas as pd
import numpy as np
import re

# Read the file which consists of all values
df = pd.read_csv("C:\\total.csv")
# Read the file which tells me who is a father
Fa = pd.read_csv("C:\\Father.csv")
# Get the first part of each father's name, e.g. A03
Fa['sub'] = Fa['Name'].str.extract(r'(\w+\s*)', expand=False)

r2 = []
# Scan the whole file for anything which starts with that prefix and is not a father
for f in Fa['sub']:
    baby = df[df['Name'].str.startswith(f) & ~df['Name'].str.contains('Father')]
    baby['sub'] = baby['Name'].str.extract(r'(\w+\s*)', expand=False)
    r1 = pd.merge(Fa, baby, on='sub', suffixes=('_f', '_c'))
    r2.append(r1)
out_df = pd.concat(r2)
out_df = out_df.replace(np.nan, '', regex=True)
# find A0-N-A2-M and A0-N-A4-M
out_df.to_csv('C:\\child1.csv')
# Scan the whole file again for anything which starts with the second id in child1 (A02, A04, ...)
# The regex skips the first two dash-separated tokens and captures the third (e.g. A02 from A03-SA-A02-SA)
out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3 = out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
    # I want to exclude A03, which has already been detected as a father
    l = ['A03']
    regstr = '|'.join(l)
    baby1 = df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)]
    baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)
    r3 = pd.merge(baby3, baby1, left_on='baby2', right_on='sub', suffixes=('_f', '_c'))
    r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')
I want to put the code below in a loop that goes through the file, follows the naming process, and detects the remaining fathers and babies until it finishes. However, this code is not generic and doesn't produce the exact result I expected.
My question is: how do I make that loop?
I should walk along the path and also keep regstr updated for every string.
# Scan the whole file again for anything which starts with the second id in child1 (A02, A04, ...)
out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3 = out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
    # I want to exclude A03, which has already been detected as a father
    l = ['A03']
    regstr = '|'.join(l)
    baby1 = df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)]
    baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)
    r3 = pd.merge(baby3, baby1, left_on='baby2', right_on='sub', suffixes=('_f', '_c'))
    r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')

Start with import collections (it will be needed soon).
I assume that you have already read the df and Fa DataFrames.
The first part of my code creates a children Series (index: parent,
value: child):
isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]
key = []; val = []
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)
Print children to see the result.
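For the sample data it starts roughly like this (an excerpt written out by hand from the rows above):
A02    A02-SA-A03-SA
A02    A02-SA-A04-SA
A03    A03-SA-A02-SA
A03    A03-SA-A05-SA
A03    A03-SA-A17-SA
...
dtype: object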
The second part creates the actual result, starting from each
starting point in Fa:
nodes = collections.deque()
father = []; baby = []  # Containers for source data
# Loop for each starting point
for startNode in Fa.Name.str.split('-', expand=True)[0]:
    nodes.append(startNode)
    while nodes:
        node = nodes.popleft()  # Take node name from the queue
        # Children of this node
        myChildren = children[children.index == node]
        # Process children (ind - father, val - child)
        for ind, val in myChildren.items():
            parts = val.split('-')  # Parts of child name
            # Child "actual" name (if it exists)
            val_2 = parts[2] if len(parts) >= 3 else ''
            if val_2 not in father:  # val_2 not "visited" before
                # Add father / child name to containers
                father.append(ind)
                baby.append(val)
                if len(val_2) > 0:
                    nodes.append(val_2)  # Add to the queue, to be processed later
        # Drop rows for "node" from "children" (if any exist)
        if (children.index == node).sum() > 0:
            children.drop(node, inplace=True)
# Convert to a DataFrame
result = pd.DataFrame({'Father': father, 'Baby': baby})
result.Father += '-father'  # Add "-father" to "bare" names
I added -father with a lower-case "f", but I don't think this detail is very significant.
The result, for your data sample, is:
Father Baby
0 A03-father A03-SA-A02-SA
1 A03-father A03-SA-A05-SA
2 A03-father A03-SA-A17-SA
3 A02-father A02-SA-A04-SA
4 A05-father A05-NA
5 A17-father A17-SA-A18-SA
6 A04-father A04-SA-A09-SA
7 A09-father A09-SA-A20-SA
8 B02-father B02-SA-B04-SA
9 B04-father B04-SA-B06-SA
10 B06-father B06-NA
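If you need the same CSV artifact as in the question, a possible final step (the output path is just an example):
result.to_csv('C:\\result.csv', index=False)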
And two remarks concerning your data sample:
You wrote B04-SA-BO2-SA, with a capital O (a letter) instead of a 0
(zero). I corrected it in my source data.
Row A02-father A02-SA-A04-SA in your expected result is doubled.
I assume it should occur only once.

Commented inline
def find(data, from_pos=0):
    fathers = {}
    skip = []
    for x in data[from_pos:]:
        tks = x.split("-")
        # Is it a father?
        if tks[1].lower() == "father":
            fathers[tks[0]] = x
        else:
            if tks[0] in fathers and tks[-2] not in skip:
                print(fathers[tks[0]], x)
                # Skip this father appearing as a child later
                skip.append(tks[0])
Testcase:
data = [
'A0-Father',
'A0-N-A2-M',
'A0-N-A4-M',
'A2-Father',
'A2-M-A0-N',
'A2-N-A8-M',
'A8-father',
'A8-M-A11-N',
'A8-M-A2-N']
find(data, from_pos=0)
Output:
A0-Father A0-N-A2-M
A0-Father A0-N-A4-M
A2-Father A2-N-A8-M
A8-father A8-M-A11-N
Edit 1:
Start with some data for testing:
data = [
'A02-father',
'A03-father',
'A04-father',
'A05-father',
'A07-father',
'A08-father',
'A09-father',
'A17-father',
'A18-father',
'A20-father',
'A02-SA-A03-SA',
'A02-SA-A04-SA',
'A03-SA-A02-SA',
'A03-SA-A05-SA',
'A03-SA-A17-SA',
'A04-SA-A02-SA',
'A04-SA-A09-SA',
'A05-SA-A03-SA',
'A09-SA-A04-SA',
'A09-SA-A20-SA',
'A17-SA-A03-SA',
'A17-SA-A18-SA',
'A18-SA-A17-SA',
'A20-SA-A09-SA',
'A05-NA',
]
father = [
'A03-father',
]
First, let us build a data structure so that manipulation is easy and relationship lookups are fast, since you have a lot of data:
def make_data_structure(data):
    all_fathers, all_relations = {}, {}
    for x in data:
        tks = x.split("-")
        if tks[1].lower() == "father":
            all_fathers[tks[0]] = x
        else:
            if len(tks) == 2:
                tks.extend(['NA', 'NA'])
            if tks[0] in all_relations:
                all_relations[tks[0]][0].append(tks[-2])
                all_relations[tks[0]][1].append(x)
            else:
                all_relations[tks[0]] = [[tks[-2]], [x]]
    return all_fathers, all_relations

all_fathers, all_relations = make_data_structure(data)
all_fathers, all_relations
Output:
{'A02': 'A02-father',
'A03': 'A03-father',
'A04': 'A04-father',
'A05': 'A05-father',
'A07': 'A07-father',
'A08': 'A08-father',
'A09': 'A09-father',
'A17': 'A17-father',
'A18': 'A18-father',
'A20': 'A20-father'},
{'A02': [['A03', 'A04'], ['A02-SA-A03-SA', 'A02-SA-A04-SA']],
'A03': [['A02', 'A05', 'A17'],
['A03-SA-A02-SA', 'A03-SA-A05-SA', 'A03-SA-A17-SA']],
'A04': [['A02', 'A09'], ['A04-SA-A02-SA', 'A04-SA-A09-SA']],
'A05': [['A03', 'NA'], ['A05-SA-A03-SA', 'A05-NA']],
'A09': [['A04', 'A20'], ['A09-SA-A04-SA', 'A09-SA-A20-SA']],
'A17': [['A03', 'A18'], ['A17-SA-A03-SA', 'A17-SA-A18-SA']],
'A18': [['A17'], ['A18-SA-A17-SA']],
'A20': [['A09'], ['A20-SA-A09-SA']]}
As you can see, all_fathers holds all the parents and, most importantly, all_relations holds the father-child relationships, indexed by the father for fast lookups.
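For example, looking up A03's direct relations is then a single dict access (the variable names here are mine):
neighbours, rows = all_relations['A03']
print(neighbours)  # ['A02', 'A05', 'A17']
print(rows)        # ['A03-SA-A02-SA', 'A03-SA-A05-SA', 'A03-SA-A17-SA']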
Now let's do the actual parsing of the relationships:
def find(all_fathers, all_relations, from_father):
    fathers = [from_father]
    skip = []
    while True:
        if len(fathers) == 0:
            break
        current_father = fathers[0]
        fathers = fathers[1:]
        for i in range(len(all_relations[current_father][0])):
            if all_relations[current_father][0][i] not in skip:
                print(all_fathers[current_father], all_relations[current_father][1][i])
                if all_relations[current_father][0][i] != 'NA':
                    fathers.append(all_relations[current_father][0][i])
        skip.append(current_father)

for x in father:
    find(all_fathers, all_relations, x.split("-")[0])
Output:
A03-father A03-SA-A02-SA
A03-father A03-SA-A05-SA
A03-father A03-SA-A17-SA
A02-father A02-SA-A04-SA
A05-father A05-NA
A17-father A17-SA-A18-SA
A04-father A04-SA-A09-SA
A09-father A09-SA-A20-SA
Edit 2:
New test cases. (You will have to load the values in father.csv into a list called father; see the sketch below.)
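A hedged way to do that with pandas, assuming father.csv has a Name column as in the question:
import pandas as pd
father = pd.read_csv("father.csv")["Name"].tolist()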
data = [
'A02-father',
'A03-father',
'A04-father',
'A05-father',
'A07-father',
'A08-father',
'A09-father',
'A17-father',
'A18-father',
'A20-father',
'A02-SA-A03-SA',
'A02-SA-A04-SA',
'A03-SA-A02-SA',
'A03-SA-A05-SA',
'A03-SA-A17-SA',
'A04-SA-A02-SA',
'A04-SA-A09-SA',
'A05-SA-A03-SA',
'A09-SA-A04-SA',
'A09-SA-A20-SA',
'A17-SA-A03-SA',
'A17-SA-A18-SA',
'A18-SA-A17-SA',
'A20-SA-A09-SA',
'A05-NA',
'B02-Father',
'B04-Father',
'B06-Father',
'B02-SA-B04-SA',
'B04-SA-B02-SA',
'B04-SA-B06-SA',
'B06-SA-B04-SA',
'B06-NA',
]
father = [
'A03-father',
'B02-father'
]
# Rebuild the lookup structures from the new data before searching
all_fathers, all_relations = make_data_structure(data)
for x in father:
    find(all_fathers, all_relations, x.split("-")[0])
Output:
A03-father A03-SA-A02-SA
A03-father A03-SA-A05-SA
A03-father A03-SA-A17-SA
A02-father A02-SA-A04-SA
A05-father A05-NA
A17-father A17-SA-A18-SA
A04-father A04-SA-A09-SA
A09-father A09-SA-A20-SA
B02-Father B02-SA-B04-SA
B04-Father B04-SA-B06-SA
B06-Father B06-NA

Related

Trying to create a streamlit app that uses user-provided URLs to scrape and return a downloadable df

I'm trying to use this create_df() function in Streamlit to gather a list of user-provided URLs called "recipes" and loop through each URL to return a df I've labeled "res" towards the end of the function. I've tried several approaches with the Streamlit syntax but I just cannot get this to work as I'm getting this error message:
recipe_scrapers._exceptions.WebsiteNotImplementedError: recipe-scrapers exception: Website (h) not supported.
Have a look at my entire repo here. The main.py script works just fine once you've installed all requirements locally, but when I run the same logic with Streamlit syntax in the streamlit.py script I get the above error. Once you run streamlit run streamlit.py in your terminal and look at the UI I've created, it should be quite clear what I'm aiming at: providing the user with a CSV of all ingredients in the recipe URLs they provided, as a convenient grocery shopping list.
Any help would be greatly appreciated!
def create_df(recipes):
    """
    Description:
        Creates one df with all recipes and their ingredients
    Arguments:
        * recipes: list of recipe URLs provided by user
    Comments:
        Note that ingredients with qualitative amounts e.g., "scheutje melk", "snufje zout" have been omitted from the ingredient list
    """
    df_list = []
    for recipe in recipes:
        scraper = scrape_me(recipe)
        recipe_details = replace_measurement_symbols(scraper.ingredients())
        recipe_name = recipe.split("https://www.hellofresh.nl/recipes/", 1)[1]
        recipe_name = recipe_name.rsplit('-', 1)[0]
        print("Processing data for " + recipe_name + " recipe.")
        for ingredient in recipe_details:
            try:
                df_temp = pd.DataFrame(columns=['Ingredients', 'Measurement'])
                df_temp[str(recipe_name)] = recipe_name
                ing_1 = ingredient.split("2 * ", 1)[1]
                ing_1 = ing_1.split(" ", 2)
                item = ing_1[2]
                measurement = ing_1[1]
                quantity = float(ing_1[0]) * 2
                df_temp.loc[len(df_temp)] = [item, measurement, quantity]
                df_list.append(df_temp)
            except (ValueError, IndexError) as e:
                pass
    df = pd.concat(df_list)
    print("Renaming duplicate ingredients e.g., Kruimige aardappelen, Voorgekookte halve kriel met schil -> Aardappelen")
    ingredient_dict = {
        'Aardappelen': ('Dunne frieten', 'Half kruimige aardappelen', 'Voorgekookte halve kriel met schil',
                        'Kruimige aardappelen', 'Roodschillige aardappelen', 'Opperdoezer Ronde aardappelen'),
        'Ui': ('Rode ui',),  # trailing comma keeps this a 1-tuple rather than a plain string
        'Kipfilet': ('Kipfilet met tuinkruiden en knoflook',),
        'Kipworst': ('Gekruide kipworst',),
        'Kipgehakt': ('Gemengd gekruid gehakt', 'Kipgehakt met Mexicaanse kruiden', 'Half-om-halfgehakt met Italiaanse kruiden',
                      'Kipgehakt met tuinkruiden'),
        'Kipshoarma': ('Kalkoenshoarma',)
    }
    reverse_label_ing = {x: k for k, v in ingredient_dict.items() for x in v}
    df["Ingredients"].replace(reverse_label_ing, inplace=True)
    print("Assigning ingredient categories")
    category_dict = {
        'brood': ('Biologisch wit rozenbroodje', 'Bladerdeeg', 'Briochebroodje', 'Wit platbrood'),
        'granen': ('Basmatirijst', 'Bulgur', 'Casarecce', 'Cashewstukjes',
                   'Gesneden snijbonen', 'Jasmijnrijst', 'Linzen', 'Maïs in blik',
                   'Parelcouscous', 'Penne', 'Rigatoni', 'Rode kidneybonen',
                   'Spaghetti', 'Witte tortilla'),
        'groenten': ('Aardappelen', 'Aubergine', 'Bosui', 'Broccoli',
                     'Champignons', 'Citroen', 'Gele wortel', 'Gesneden rodekool',
                     'Groene paprika', 'Groentemix van paprika, prei, gele wortel en courgette',
                     'IJsbergsla', 'Kumato tomaat', 'Limoen', 'Little gem',
                     'Paprika', 'Portobello', 'Prei', 'Pruimtomaat',
                     'Radicchio en ijsbergsla', 'Rode cherrytomaten', 'Rode paprika', 'Rode peper',
                     'Rode puntpaprika', 'Rode ui', 'Rucola', 'Rucola en veldsla', 'Rucolamelange',
                     'Semi-gedroogde tomatenmix', 'Sjalot', 'Sperziebonen', 'Spinazie', 'Tomaat',
                     'Turkse groene peper', 'Veldsla', 'Vers basilicum', 'Verse bieslook',
                     'Verse bladpeterselie', 'Verse koriander', 'Verse krulpeterselie', 'Wortel', 'Zoete aardappel'),
        'kruiden': ('Aïoli', 'Bloem', 'Bruine suiker', 'Cranberrychutney', 'Extra vierge olijfolie',
                    'Extra vierge olijfolie met truffelaroma', 'Fles olijfolie', 'Gedroogde laos',
                    'Gedroogde oregano', 'Gemalen kaneel', 'Gemalen komijnzaad', 'Gemalen korianderzaad',
                    'Gemalen kurkuma', 'Gerookt paprikapoeder', 'Groene currykruiden', 'Groentebouillon',
                    'Groentebouillonblokje', 'Honing', 'Italiaanse kruiden', 'Kippenbouillonblokje', 'Knoflookteen',
                    'Kokosmelk', 'Koreaanse kruidenmix', 'Mayonaise', 'Mexicaanse kruiden', 'Midden-Oosterse kruidenmix',
                    'Mosterd', 'Nootmuskaat', 'Olijfolie', 'Panko paneermeel', 'Paprikapoeder', 'Passata',
                    'Pikante uienchutney', 'Runderbouillonblokje', 'Sambal', 'Sesamzaad', 'Siciliaanse kruidenmix',
                    'Sojasaus', 'Suiker', 'Sumak', 'Surinaamse kruiden', 'Tomatenblokjes', 'Tomatenblokjes met ui',
                    'Truffeltapenade', 'Ui', 'Verse gember', 'Visbouillon', 'Witte balsamicoazijn', 'Wittewijnazijn',
                    'Zonnebloemolie', 'Zwarte balsamicoazijn'),
        'vlees': ('Gekruide runderburger', 'Half-om-half gehaktballetjes met Spaanse kruiden', 'Kipfilethaasjes', 'Kipfiletstukjes',
                  'Kipgehaktballetjes met Italiaanse kruiden', 'Kippendijreepjes', 'Kipshoarma', 'Kipworst', 'Spekblokjes',
                  'Vegetarische döner kebab', 'Vegetarische kaasschnitzel', 'Vegetarische schnitzel'),
        'zuivel': ('Ei', 'Geraspte belegen kaas', 'Geraspte cheddar', 'Geraspte grana padano', 'Geraspte oude kaas',
                   'Geraspte pecorino', 'Karnemelk', 'Kruidenroomkaas', 'Labne', 'Melk', 'Mozzarella',
                   'Parmigiano reggiano', 'Roomboter', 'Slagroom', 'Volle yoghurt')
    }
    reverse_label_cat = {x: k for k, v in category_dict.items() for x in v}
    df["Category"] = df["Ingredients"].map(reverse_label_cat)
    col = "Category"
    first_col = df.pop(col)
    df.insert(0, col, first_col)
    df = df.sort_values(['Category', 'Ingredients'], ascending=[True, True])
    print("Merging ingredients by row across all recipe columns using justify()")
    gp_cols = ['Ingredients', 'Measurement']
    oth_cols = df.columns.difference(gp_cols)
    arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False).apply(
        lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, axis=0, side='up')))
    # Reconstruct DataFrame
    # Remove entirely NaN rows based on the non-grouping columns
    res = (pd.DataFrame(arr, columns=df.columns)
           .dropna(how='all', subset=oth_cols, axis=0))
    res = res.fillna(0)
    res['Total'] = res.drop(['Ingredients', 'Measurement'], axis=1).sum(axis=1)
    res = res[res['Total'] != 0]  # To drop rows that are being duplicated with 0 for some reason; will check later
    print("Processing complete!")
    return res
Your function create_df needs a list as an argument, but st.text_input always returns a string.
In your streamlit.py, replace df_download = create_df(recs) with df_download = create_df([recs]). But if you need to handle multiple URLs, you should use str.split like this:
def create_df(recipes):
    recipes = recipes.split(",")  # <--- add this line to make a list from the user input
    ### rest of the code ###

if download:
    df_download = create_df(recs)
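For context, a minimal sketch of the calling side (the widget labels and file names here are illustrative, not taken from the repo):
import streamlit as st

recs = st.text_input("Enter recipe URLs, separated by commas")
if st.button("Create shopping list"):
    # Pass a list, not the raw string
    df_download = create_df(recs.split(","))
    st.download_button("Download CSV", df_download.to_csv(index=False), file_name="shopping_list.csv")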

Want to add the left-out strings to the matched strings

Below is my example code:
from fuzzywuzzy import fuzz
import json
from itertools import zip_longest

synonyms = open("synonyms.json", "r")
synonyms = json.loads(synonyms.read())
vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11th Generation)", "hard drive", "ddr 8gb",
               "something1", "something2", "something3", "HT (100W) DDR4-2400"]
buyer_data = ["i7 processor 12 generation", "corei7:latest technology"]
vendor = []
buyer = []
for item, value in synonyms.items():
    for k, k2 in zip_longest(vendor_data, buyer_data):
        for v in value:
            if fuzz.token_set_ratio(k, v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item + " " + k)
            else:
                # didn't get the "something" strings here!
                if fuzz.token_set_ratio(k2, v) > 70:
                    if item in k2:
                        buyer.append(k2)
                    else:
                        buyer.append(item + " " + k2)
vendor = list(set(vendor))
buyer = list(set(buyer))
vendor, buyer
Note: the "something" strings can be anything, like "battery" or "display", etc.
synonyms.json:
{
    "processor": ["corei5", "core", "corei7", "i5", "i7", "ryzen5", "i5 processor", "i7 processor", "processor i5", "processor i7", "core generation", "core gen"],
    "ram": ["DDR4", "memory", "DDR3", "DDR", "DDR 8gb", "DDR 8 gb", "DDR 16gb", "DDR 16 gb", "DDR 32gb", "DDR 32 gb", "DDR4-"],
    "ssd": ["solid state drive", "solid drive"],
    "hdd": ["Hard Drive"]
}
What do I need?
I want to add all the "something" strings to the vendor list dynamically.
NOTE: the "something" strings can be anything in the future.
I want to add the "something" strings, i.e. the values that did not match with fuzz > 70, to the vendor list. Basically, I want to add the left-out data as well.
For example, like below.
Current output:
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state']
Expected output below:
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state',
'something1',
'something2',
'something3'] # "something" strings need to be added to the vendor list dynamically
What silly mistake am I making? Thank you.
Here's my attempt:
from fuzzywuzzy import process, fuzz

synonyms = {'processor': ['corei5', 'core', 'corei7', 'i5', 'i7', 'ryzen5', 'i5 processor', 'i7 processor', 'processor i5', 'processor i7', 'core generation', 'core gen'], 'ram': ['DDR4', 'memory', 'DDR3', 'DDR', 'DDR 8gb', 'DDR 8 gb', 'DDR 16gb', 'DDR 16 gb', 'DDR 32gb', 'DDR 32 gb', 'DDR4-'], 'ssd': ['solid state drive', 'solid drive'], 'hdd': ['Hard Drive']}
vendor_data = ['i7 processor', 'solid state', 'Corei5 :1135G7 (11th Generation)', 'hard drive', 'ddr 8gb', 'something1', 'something2', 'something3', 'HT (100W) DDR4-2400']
buyer_data = ['i7 processor 12 generation', 'corei7:latest technology']

def find_synonym(s: str, min_score: int = 60):
    results = process.extractBests(s, choices=synonyms, score_cutoff=min_score)
    if not results:
        return None
    return results[0][-1]

def process_data(l: list, min_score: int = 60):
    matches = []
    no_matches = []
    for item in l:
        syn = find_synonym(item, min_score=min_score)
        if syn is not None:
            new_item = f'{syn} {item}' if syn not in item else item
            matches.append(new_item)
        elif any(fuzz.partial_ratio(s, item) >= min_score for s in synonyms.keys()):
            # one of the synonyms is already in the item string
            matches.append(item)
        else:
            no_matches.append(item)
    return matches, no_matches
For process_data(vendor_data) we get:
(['i7 processor',
'ssd solid state',
'processor Corei5 :1135G7 (11th Generation)',
'hdd hard drive',
'ram ddr 8gb',
'ram HT (100W) DDR4-2400'],
['something1', 'something2', 'something3'])
And for process_data(buyer_data):
(['i7 processor 12 generation', 'processor corei7:latest technology'], [])
I had to lower the cut-off score to 60 to also get a result for ddr 8gb. The process_data function returns two lists: one with matches against words from the synonyms dict and one with the items that had no match. If you want exactly the output you listed in your question, just concatenate the two lists:
matches, no_matches = process_data(vendor_data)
matches + no_matches # ['i7 processor', 'ssd solid state', 'processor Corei5 :1135G7 (11th Generation)', 'hdd hard drive', 'ram ddr 8gb', 'ram HT (100W) DDR4-2400', 'something1', 'something2', 'something3']
I have tried to come up with a decent answer (certainly not the cleanest one):
import json
from itertools import zip_longest
from fuzzywuzzy import fuzz

synonyms = open("synonyms.json", "r")
synonyms = json.loads(synonyms.read())
vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11thGeneration)", "hard drive", "ddr 8gb", "something1",
               "something2",
               "something3", "HT (100W) DDR4-2400"]
buyer_data = ["i7 processor 12 generation", "corei7:latest technology"]
vendor = []
buyer = []
for k, k2 in zip_longest(vendor_data, buyer_data):
    has_matched = False
    for item, value in synonyms.items():
        for v in value:
            if fuzz.token_set_ratio(k, v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item + " " + k)
                if has_matched or k2 is None:
                    break
                else:
                    has_matched = True
            if fuzz.token_set_ratio(k2, v) > 70:
                if item in k2:
                    buyer.append(k2)
                else:
                    buyer.append(item + " " + k2)
                if has_matched or k is None:
                    break
                else:
                    has_matched = True
        else:
            continue  # match not found
        break  # match is found
    else:  # only evaluates on normal loop end
        # Only "something" strings
        # do something with the new input values
        continue
vendor = list(set(vendor))
buyer = list(set(buyer))
I hope you can achieve what you want with this code. Check the docs if you don't know what a for/else loop does. TL;DR: the else clause executes when the loop terminates normally (not via break). Note that I put the synonyms loop inside the data loop, because we can't know for certain which synonym group an entry belongs to, and sometimes the vendor entry is a processor while the buyer entry is memory. Also note that I have assumed an item can't match more than once. If it could, you would need a more advanced check (for example, keep a counter and break when it reaches 2).
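As a quick standalone illustration of that for/else behaviour:
for n in [1, 3, 5]:
    if n % 2 == 0:
        break
else:
    # Runs only because the loop finished without hitting break
    print("no even number found")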
EDIT:
I took another look at the question and came up with maybe a better answer:
v_dict = dict()
for spec in vendor_data[:]:
    for item, choices in synonyms.items():
        if process.extractOne(spec, choices)[1] > 70:  # don't forget to import process from fuzzywuzzy
            v_dict[spec] = item
            break
    else:
        v_dict[spec] = "Something new"
This code matches the strings to the correct type, for example {'i7 processor': 'processor', 'solid state': 'ssd', 'Corei5 :1135G7 (11thGeneration)': 'processor', 'hard drive': 'ssd', 'ddr 8gb': 'ram', 'something1': 'Something new', 'something2': 'Something new', 'something3': 'Something new', 'HT (100W) DDR4-2400': 'ram'}. You can replace "Something new" with whatever you like. You could also do v_dict[spec] = 0 (on a match) and v_dict[spec] = 1 (on no match), and then sort the dict:
it = iter(v_dict.values())
print(sorted(v_dict.keys(), key=lambda x: next(it)))
This would give the wanted result (more or less): all the recognised items first, then all the unrecognised items. You could do more advanced sorting on this dict if you want. I think this code gives you enough flexibility to reach your goal.
If I understand correctly, what you are trying to do is match keywords specified by a customer and/or vendor against a predefined database of keywords you have.
First, I would highly recommend using a reversed mapping of the synonyms, so lookups are faster, especially as the dataset grows.
Second, considering the fuzzywuzzy API, it looks like you simply want the best match, so extractOne is a solid choice for that.
Now, extractOne returns the best match and a score:
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
I would split the algorithm into two:
A generic part that simply gets the best match, which should always exist (even if it's not a great one)
A filter, where you could adjust the sensitivity of the algorithm, based on different criteria of your application. This sensitivity threshold should set the minimal match quality. If you're below this threshold, just use "untagged" for the category for example.
Here is the final code, which I think is very simple and easy to understand and expand:
import json
from fuzzywuzzy import process

def load_synonyms():
    with open('synonyms.json') as fin:
        synonyms = json.load(fin)
    # Reversing the map makes it much easier to look up
    reversed_synonyms = {}
    for key, values in synonyms.items():
        for value in values:
            reversed_synonyms[value] = key
    return reversed_synonyms

def load_vendor_data():
    return [
        "i7 processor",
        "solid state",
        "Corei5 :1135G7 (11thGeneration)",
        "hard drive",
        "ddr 8gb",
        "something1",
        "something2",
        "something3",
        "HT (100W) DDR4-2400"
    ]

def load_customer_data():
    return [
        "i7 processor 12 generation",
        "corei7:latest technology"
    ]

def get_tag(keyword, synonyms):
    THRESHOLD = 80
    DEFAULT = 'general'
    tag, score = process.extractOne(keyword, synonyms.keys())
    return synonyms[tag] if score > THRESHOLD else DEFAULT

def main():
    synonyms = load_synonyms()
    customer_data = load_customer_data()
    vendor_data = load_vendor_data()
    data = customer_data + vendor_data
    tags_dict = {keyword: get_tag(keyword, synonyms) for keyword in data}
    print(json.dumps(tags_dict, indent=4))

if __name__ == '__main__':
    main()
When running with the specified inputs, the output is:
{
    "i7 processor 12 generation": "processor",
    "corei7:latest technology": "processor",
    "i7 processor": "processor",
    "solid state": "ssd",
    "Corei5 :1135G7 (11thGeneration)": "processor",
    "hard drive": "hdd",
    "ddr 8gb": "ram",
    "something1": "general",
    "something2": "general",
    "something3": "general",
    "HT (100W) DDR4-2400": "ram"
}

Merge two txt files together by one common column in Python

How do I read in two tab-delimited .txt files and map them together by one common column?
For example, from these two files create a mapping of gene to pathway:
First file, pathway.txt
Pathway Protein
Binding and Uptake of Ligands by Scavenger Receptors P69905
Erythrocytes take up carbon dioxide and release oxygen P69905
Metabolism P69905
Amyloids P02647
Metabolism P02647
Hemostasis P68871
Second file, gene.txt
Gene Protein
Fabp3 P11404
HBA1 P69905
APOA1 P02647
Hbb-b1 P02088
HBB P68871
Hba P01942
The output would be like this:
Gene Protein Pathway
Fabp3 P11404
HBA1 P69905 Binding and Uptake of Ligands by Scavenger Receptors, Erythrocytes take up carbon dioxide and release oxygen, Metabolism
APOA1 P02647 Amyloids, Metabolism
Hbb-b1 P02088
HBB P68871 Hemostasis
Hba P01942
Leave the pathway blank if no pathway corresponds to the gene, based on the protein id information.
UPDATE:
import pandas as pd
file1 = pd.read_csv("gene.csv")
file2 = pd.read_csv("pathway.csv")
output = pd.concat([file1, file2]).fillna(" ")
output = output[["Gene", "Protein"] + list(output.columns[1:-1])]
output.to_csv("mapping of gene to pathway.csv", index=False)
This only concatenates the files, which is not what I expected.
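For what it's worth, a sketch of a pandas approach (assuming both files are tab-delimited with the headers shown above; the file names and output path are illustrative):
import pandas as pd

pathway = pd.read_csv("pathway.txt", sep="\t")
gene = pd.read_csv("gene.txt", sep="\t")

# Collapse the pathways for each protein into one comma-separated string
grouped = pathway.groupby("Protein")["Pathway"].apply(", ".join).reset_index()

# Left-merge so genes without any pathway keep a blank entry
out = gene.merge(grouped, on="Protein", how="left").fillna("")
out.to_csv("mapping_of_gene_to_pathway.txt", sep="\t", index=False)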
>>> from collections import defaultdict
>>> my_dict = defaultdict()
>>> f = open('pathway.txt')
>>> for x in f:
...     x = x.strip().split()
...     value, key = " ".join(x[:-1]), x[-1]
...     if my_dict.get(key, 0) == 0:
...         my_dict[key] = [value]
...     else: my_dict[key].append(value)
...
>>> my_dict
defaultdict(None, {'P68871': ['Hemostasis'], 'Protein': ['Pathway'], 'P69905': ['Binding', 'Erythrocytes', 'Metabolism'], 'P02647': ['Amyloids', 'Metabolism']})
>>> f1 = open('gene.txt')
>>> for x in f1:
...     value, key = x.strip().split()
...     if my_dict.get(key, 0) == 0:
...         print("{:<15}{:<15}".format(value, key))
...     else: print("{:<15}{:<15}{}".format(value, key, ", ".join(my_dict[key])))
...
Gene Protein Pathway
Fabp3 P11404
HBA1 P69905 Binding and Uptake of Ligands by Scavenger Receptors, Erythrocytes take up carbon dioxide and release oxygen, Metabolism
APOA1 P02647 Amyloids, Metabolism
Hbb-b1 P02088
HBB P68871 Hemostasis
Hba P01942
class Protein:
    def __init__(self, protein, pathway=None, gene=""):
        self.protein = protein
        self.pathways = []
        self.gene = gene
        if pathway is not None:
            self.pathways.append(pathway)

    def __str__(self):
        return "%s\t%s\t%s" % (
            self.gene,
            self.protein,
            ", ".join([p for p in self.pathways]))

# protein -> pathway map
proteins = {}

# get the pathways
f1 = open("pathways.txt")
for line in f1.readlines()[1:]:
    tokens = line.split()
    pathway = " ".join(tokens[:-1])
    protein = tokens[-1]
    if protein in proteins:
        p = proteins[protein]
        p.pathways.append(pathway)
    else:
        p = Protein(protein=protein, pathway=pathway)
        proteins[protein] = p

# get the genes
f2 = open("genes.txt")
for line in f2.readlines()[1:]:
    gene, protein = line.split()
    if protein in proteins:
        p = proteins[protein]
        p.gene = gene
    else:
        p = Protein(protein=protein, gene=gene)
        proteins[protein] = p

# print the results
print("Gene\tProtein\tPathway")
for protein in proteins.values():
    print(protein)

Add edge between nodes if list contains part of another list

I'm trying to add edges between nodes.
I have a text file which I have put into a list.
The list contains (Title, Rating) tuples:
[('"$weepstake$" (1979) {(#1.2)}', '10.0'),
('"\'Til Death Do Us Part" (2006) {Pilot(#1.0)}', '3.7'),
('"\'Conversations with My Wife\'" (2010)', '4.2'),
('"\'Da Kink in My Hair" (2007)', '4.2').....much more here ]
I want to create nodes labeled with all the titles, and when two titles have the same rating I want to create an edge between them, so that in the end all titles with rating 10.0 end up together in one network, and so on.
My code so far:
import networkx as nx
import string
import csv
import pprint
import re

def printStuff(labels, dG):
    for index, node in enumerate(dG.nodes()):
        print('%s:%d\n' % (labels[index], dG.node[node]['count']))

str1 = titleList
# print(str1)
get_user_info = titleList1
dG = nx.DiGraph()
for i, word in enumerate(str1):
    try:
        next_word = str1[i]
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
        if not dG.has_node(next_word):
            dG.add_node(next_word)
            dG.node[next_word]['count'] = 0
        if not dG.has_edge(word, next_word):
            dG.add_edge(word, next_word, weight=0)
        else:
            dG.edge[word][next_word]['weight'] += 1
    except IndexError:
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
    except:
        raise
printStuff(titleList, dG)
Output:
10.0:1
10.0:1
3.7:1
10.0:1
3.7:1
4.2:1
10.0:1
3.7:1
4.2:1
4.2:1
And for edges:
for edge in dG.edges():
    print('%s:%d\n' % (edge, dG.edge[edge[0]][edge[1]]['weight']))
Output:
(('"\'Conversations with My Wife\'" (2010)', '4.2'), ('"\'Conversations with My Wife\'" (2010)', '4.2')):0
(('"\'Da Kink in My Hair" (2007)', '4.2'), ('"\'Da Kink in My Hair" (2007)', '4.2')):0
(('"$weepstake$" (1979) {(#1.2)}', '10.0'), ('"$weepstake$" (1979) {(#1.2)}', '10.0')):0
(('"\'Til Death Do Us Part" (2006) {Pilot (#1.0)}', '3.7'), ('"\'Til Death Do Us Part" (2006) {Pilot (#1.0)}', '3.7')):0
How about this:
data = [('"$weepstake$" (1979) {(#1.2)}', '10.0'),
('"\'Til Death Do Us Part" (2006) {Pilot(#1.0)}', '3.7'),
('"\'Conversations with My Wife\'" (2010)', '4.2'),
('"\'Da Kink in My Hair" (2007)', '4.2')]
import networkx as nx
G = nx.Graph()
G.add_edges_from(data)
nx.draw(G)
If you want a count of edges for a given score:
len(G.edges('4.2'))
2
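Since each rating becomes a node in this graph, the titles sharing a rating are simply that rating node's neighbours; a small usage sketch:
# Titles that share the rating '4.2'
print(list(G.neighbors('4.2')))
# ['"\'Conversations with My Wife\'" (2010)', '"\'Da Kink in My Hair" (2007)']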

Shortest distance algorithm Python

I wanted to create a simple breadth-first search algorithm which returns the shortest path.
An actor information dictionary maps an actor to the list of movies the actor appears in:
actor_info = { "act1" : ["movieC", "movieA"], "act2" : ["movieA", "movieB"],
"act3" :["movieA", "movieB"], "act4" : ["movieC", "movieD"],
"act5" : ["movieD", "movieB"], "act6" : ["movieE"],
"act7" : ["movieG", "movieE"], "act8" : ["movieD", "movieF"],
"KevinBacon" : ["movieF"], "act10" : ["movieG"], "act11" : ["movieG"] }
The inverse of this maps movies to the list of actors appearing in them:
movie_info = {'movieB': ['act2', 'act3', 'act5'], 'movieC': ['act1', 'act4'],
'movieA': ['act1', 'act2', 'act3'], 'movieF': ['KevinBacon', 'act8'],
'movieG': ['act7', 'act10', 'act11'], 'movieD': ['act8', 'act4', 'act5'],
'movieE': ['act6', 'act7']}
So for a call
shortest_distance("act1", "KevinBacon", actor_info, movie_info)
I should get 3, since act1 appears in movieC with act4, who appears in movieD with act8, who appears in movieF with KevinBacon. So the shortest distance is 3.
So far I have this:
def shortest_distance(actA, actB, actor_info, movie_info):
    '''Return the number of movies required to connect actA and actB.
    If there's no connection, return -1.'''
    # We keep 2 lists of actors:
    # 1. The actors that we have already investigated.
    # 2. The actors that need to be investigated because we have found a
    #    connection beginning at actA. This list must be ordered, since we
    #    want to investigate actors in the order we discover them.
    #    -- Each time we put an actor in this list, we also store
    #       her distance from actA.
    investigated = []
    to_investigate = [actA]
    distance = 0
    while actB not in to_investigate and to_investigate != []:
        for actor in to_investigate:
            to_investigate.remove(actor)
            investigated.append(actor)
            for movie in actor_info[actor]:
                for co_star in movie_info[movie]:
                    if co_star not in investigated and co_star not in to_investigate:
                        to_investigate.append(co_star)
    ....
    ....
    return d
I can't figure out the appropriate way to keep track of the distances discovered on each iteration of the code. Also, the code seems to be very inefficient time-wise.
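A common way to keep track of distances in a breadth-first search is to enqueue (actor, distance) pairs; here is a minimal sketch of that idea (my own illustration, separate from the answers below):
from collections import deque

def shortest_distance(actA, actB, actor_info, movie_info):
    # BFS over actors; each queue entry carries its movie-distance from actA
    seen = {actA}
    queue = deque([(actA, 0)])
    while queue:
        actor, dist = queue.popleft()
        if actor == actB:
            return dist
        for movie in actor_info[actor]:
            for co_star in movie_info[movie]:
                if co_star not in seen:
                    seen.add(co_star)
                    queue.append((co_star, dist + 1))
    return -1  # no connection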
First create one graph out of this to connect all the nodes, then run the shortest-path code (there are more efficient graph libraries you could use instead of the function below, but this one is elegant), and then count the movie names along the shortest path.
for i in movie_info:
    actor_info[i] = movie_info[i]

def find_shortest_path(graph, start, end, path=[]):
    path = path + [start]
    if start == end:
        return path
    if start not in graph:
        return None
    shortest = None
    for node in graph[start]:
        if node not in path:
            newpath = find_shortest_path(graph, node, end, path)
            if newpath:
                if not shortest or len(newpath) < len(shortest):
                    shortest = newpath
    return shortest

L = find_shortest_path(actor_info, 'act1', 'act2')
print(len([i for i in L if i in movie_info]))
find_shortest_path Source: http://www.python.org/doc/essays/graphs/
This looks like it works. It keeps track of a current set of movies. For each step, it looks at all of the one-step-away movies which haven't already been considered ("seen").
actor_info = { "act1" : ["movieC", "movieA"], "act2" : ["movieA", "movieB"],
"act3" :["movieA", "movieB"], "act4" : ["movieC", "movieD"],
"act5" : ["movieD", "movieB"], "act6" : ["movieE"],
"act7" : ["movieG", "movieE"], "act8" : ["movieD", "movieF"],
"KevinBacon" : ["movieF"], "act10" : ["movieG"], "act11" : ["movieG"] }
movie_info = {'movieB': ['act2', 'act3', 'act5'], 'movieC': ['act1', 'act4'],
'movieA': ['act1', 'act2', 'act3'], 'movieF': ['KevinBacon', 'act8'],
'movieG': ['act7', 'act10', 'act11'], 'movieD': ['act8', 'act4', 'act5'],
'movieE': ['act6', 'act7']}
def shortest_distance(actA, actB, actor_info, movie_info):
    if actA not in actor_info:
        return -1  # "infinity"
    if actB not in actor_info:
        return -1  # "infinity"
    if actA == actB:
        return 0
    dist = 1
    movies = set(actor_info[actA])
    end_movies = set(actor_info[actB])
    if movies & end_movies:
        return dist
    seen = movies.copy()
    print("All movies with", actA, seen)
    while 1:
        dist += 1
        next_step = set()
        for movie in movies:
            for actor in movie_info[movie]:
                next_step.update(actor_info[actor])
        print("Movies with actors from those movies", next_step)
        movies = next_step - seen
        print("New movies with actors from those movies", movies)
        if not movies:
            return -1  # "Infinity"
        # Has actorB been in any of those movies?
        if movies & end_movies:
            return dist
        # Update the set of seen movies, so I don't visit them again
        seen.update(movies)

if __name__ == "__main__":
    print(shortest_distance("act1", "KevinBacon", actor_info, movie_info))
The output is
All movies with act1 set(['movieC', 'movieA'])
Movies with actors from those movies set(['movieB', 'movieC', 'movieA', 'movieD'])
New movies with actors from those movies set(['movieB', 'movieD'])
Movies with actors from those movies set(['movieB', 'movieC', 'movieA', 'movieF', 'movieD'])
New movies with actors from those movies set(['movieF'])
3
Here's a version which returns the list of movies making up the minimum connection (None for no connection, and an empty list if actA and actB are the same):
def connect(links, movie):
    chain = []
    while movie is not None:
        chain.append(movie)
        movie = links[movie]
    return chain

def shortest_distance(actA, actB, actor_info, movie_info):
    if actA not in actor_info:
        return None  # "infinity"
    if actB not in actor_info:
        return None  # "infinity"
    if actA == actB:
        return []
    # {x: y} means that x is one link outwards from y
    links = {}
    # Start from the destination and work backward
    for movie in actor_info[actB]:
        links[movie] = None
    dist = 1
    movies = list(links.keys())  # list() so links can be mutated while we iterate
    while 1:
        new_movies = []
        for movie in movies:
            for actor in movie_info[movie]:
                if actor == actA:
                    return connect(links, movie)
                for other_movie in actor_info[actor]:
                    if other_movie not in links:
                        links[other_movie] = movie
                        new_movies.append(other_movie)
        if not new_movies:
            return None  # Infinity
        movies = new_movies

if __name__ == "__main__":
    dist = shortest_distance("act1", "KevinBacon", actor_info, movie_info)
    if dist is None:
        print("Not connected")
    else:
        print("The Kevin Bacon Number for act1 is", len(dist))
        print("Movies are:", ", ".join(dist))
Here's the output:
The Kevin Bacon Number for act1 is 3
Movies are: movieC, movieD, movieF
