This question already has answers here:
Generating natural schedule for a sports league
(2 answers)
Closed 5 years ago.
I would like to write a league fixture generator in Python, but I'm stuck. Here are the details:
There is a dynamic list of teams like teams = ["Team1", "Team2", "Team3", "Team4"].
How can I generate a fixture_weekx list from the teams list? For example:
fixture_week1 = ["Team1", "Team2", "Team3", "Team4"]
fixture_week2 = ["Team1", "Team3", "Team2", "Team4"]
fixture_week2 = ["Team1", "Team4", "Team2", "Team3"]
# Return matches:
fixture_week4 = ["Team2", "Team1", "Team4", "Team3"]
fixture_week5 = ["Team3", "Team1", "Team4", "Team2"]
fixture_week6 = ["Team4", "Team1", "Team3", "Team2"]
Any idea?
Fixture scheduling is a well-known problem. This is a Python implementation of the algorithm given at: http://en.wikipedia.org/wiki/Round-robin_tournament
# generation code - for cut and paste
import operator

def fixtures(teams):
    if len(teams) % 2:
        teams.append('Day off')  # if team number is odd - use 'Day off' as a fake team
    rotation = list(teams)  # copy the list
    fixtures = []
    for i in range(0, len(teams) - 1):
        fixtures.append(rotation)
        rotation = [rotation[0]] + [rotation[-1]] + rotation[1:-1]
    return fixtures
# demo code
teams = ["Team1", "Team2", "Team3", "Team4", "Team5"]

# for one match each - use this block only
matches = fixtures(teams)
for f in matches:
    print zip(*[iter(f)]*2)

# if you want return matches
reverse_teams = [list(x) for x in zip(teams[1::2], teams[::2])]
reverse_teams = reduce(operator.add, reverse_teams)  # swap team1 with team2, and so on ...
# then run the fixtures again
matches = fixtures(reverse_teams)
print "return matches"
for f in matches:
    print f
This generates output:
[('Team1', 'Day off'), ('Team2', 'Team5'), ('Team3', 'Team4')]
[('Team1', 'Team5'), ('Day off', 'Team4'), ('Team2', 'Team3')]
[('Team1', 'Team4'), ('Team5', 'Team3'), ('Day off', 'Team2')]
[('Team1', 'Team3'), ('Team4', 'Team2'), ('Team5', 'Day off')]
[('Team1', 'Team2'), ('Team3', 'Day off'), ('Team4', 'Team5')]
I wanted to comment that the code from @MariaZverina doesn't quite work. I tried it as is, but I didn't get the right pairings. The modification I made below works with her code. The difference is that I do a rainbow-style pairing of each fixture by zipping the first half of the fixture f with the reversed second half.
# demo code
teams = ["Team1", "Team2", "Team3", "Team4", "Team5"]

# for one match each - use this block only
matches = fixtures(teams)
for f in matches:
    # This is where the difference is.
    # I implemented "rainbow" style pairing from each fixture f
    # In other words:
    # [(f[0], f[n-1]), (f[1], f[n-2]), ..., (f[n/2-1], f[n/2])],
    # where n is the length of f
    n = len(f)
    print zip(f[0:n/2], reversed(f[n/2:n]))
The code from @MariaZverina didn't work for me either, so I implemented this code using the round-robin tournament algorithm as well:
teams = ["Team1", "Team2", "Team3", "Team4", "Team5", "Team6"]
if len(teams) % 2:
teams.append('Day off')
n = len(teams)
matchs = []
fixtures = []
return_matchs = []
for fixture in range(1, n):
for i in range(n/2):
matchs.append((teams[i], teams[n - 1 - i]))
return_matchs.append((teams[n - 1 - i], teams[i]))
teams.insert(1, teams.pop())
fixtures.insert(len(fixtures)/2, matchs)
fixtures.append(return_matchs)
matchs = []
return_matchs = []
for fixture in fixtures:
print fixture
Output:
[('Team1', 'Team6'), ('Team2', 'Team5'), ('Team3', 'Team4')]
[('Team1', 'Team5'), ('Team6', 'Team4'), ('Team2', 'Team3')]
[('Team1', 'Team4'), ('Team5', 'Team3'), ('Team6', 'Team2')]
[('Team1', 'Team3'), ('Team4', 'Team2'), ('Team5', 'Team6')]
[('Team1', 'Team2'), ('Team3', 'Team6'), ('Team4', 'Team5')]
[('Team6', 'Team1'), ('Team5', 'Team2'), ('Team4', 'Team3')]
[('Team5', 'Team1'), ('Team4', 'Team6'), ('Team3', 'Team2')]
[('Team4', 'Team1'), ('Team3', 'Team5'), ('Team2', 'Team6')]
[('Team3', 'Team1'), ('Team2', 'Team4'), ('Team6', 'Team5')]
[('Team2', 'Team1'), ('Team6', 'Team3'), ('Team5', 'Team4')]
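For completeness, here is a rough Python 3 sketch that combines the Wikipedia rotation with the rainbow pairing discussed in this thread (a sketch under those assumptions, not any answer's exact code):

def round_robin(teams):
    """Return each round as a list of (home, away) pairs, followed by the return fixtures."""
    teams = list(teams)
    if len(teams) % 2:
        teams.append('Day off')  # pad odd-sized leagues with a bye
    n = len(teams)
    rounds = []
    for _ in range(n - 1):
        # rainbow pairing: first half against the reversed second half
        rounds.append(list(zip(teams[:n // 2], reversed(teams[n // 2:]))))
        # rotate everyone except the first team
        teams = [teams[0]] + [teams[-1]] + teams[1:-1]
    # return matches: the same pairings with home/away swapped
    rounds += [[(b, a) for a, b in rnd] for rnd in rounds]
    return rounds

for rnd in round_robin(["Team1", "Team2", "Team3", "Team4"]):
    print(rnd)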
Related
I'm trying to use this create_df() function in Streamlit to gather a list of user-provided URLs called "recipes" and loop through each URL to return a df I've labeled "res" towards the end of the function. I've tried several approaches with the Streamlit syntax but I just cannot get this to work as I'm getting this error message:
recipe_scrapers._exceptions.WebsiteNotImplementedError: recipe-scrapers exception: Website (h) not supported.
Have a look at my entire repo here. The main.py script works just fine once you've installed all requirements locally, but when I try running the same script with Streamlit syntax in the streamlit.py script I get the above error. Once you run streamlit run streamlit.py in your terminal and have a look at the UI I've created, it should be quite clear what I'm aiming at, which is providing the user with a CSV of all ingredients in the recipe URLs they provided, for a convenient grocery shopping list.
Any help would be greatly appreciated!
def create_df(recipes):
    """
    Description:
        Creates one df with all recipes and their ingredients
    Arguments:
        * recipes: list of recipe URLs provided by user
    Comments:
        Note that ingredients with qualitative amounts e.g., "scheutje melk", "snufje zout" have been omitted from the ingredient list
    """
    df_list = []

    for recipe in recipes:
        scraper = scrape_me(recipe)
        recipe_details = replace_measurement_symbols(scraper.ingredients())

        recipe_name = recipe.split("https://www.hellofresh.nl/recipes/", 1)[1]
        recipe_name = recipe_name.rsplit('-', 1)[0]
        print("Processing data for " + recipe_name + " recipe.")

        for ingredient in recipe_details:
            try:
                df_temp = pd.DataFrame(columns=['Ingredients', 'Measurement'])
                df_temp[str(recipe_name)] = recipe_name

                ing_1 = ingredient.split("2 * ", 1)[1]
                ing_1 = ing_1.split(" ", 2)

                item = ing_1[2]
                measurement = ing_1[1]
                quantity = float(ing_1[0]) * 2

                df_temp.loc[len(df_temp)] = [item, measurement, quantity]
                df_list.append(df_temp)
            except (ValueError, IndexError) as e:
                pass

    df = pd.concat(df_list)

    print("Renaming duplicate ingredients e.g., Kruimige aardappelen, Voorgekookte halve kriel met schil -> Aardappelen")
    # Note: single-element entries need a trailing comma to be tuples;
    # a bare ('Rode ui') is just a string and would be iterated per character below
    ingredient_dict = {
        'Aardappelen': ('Dunne frieten', 'Half kruimige aardappelen', 'Voorgekookte halve kriel met schil',
                        'Kruimige aardappelen', 'Roodschillige aardappelen', 'Opperdoezer Ronde aardappelen'),
        'Ui': ('Rode ui',),
        'Kipfilet': ('Kipfilet met tuinkruiden en knoflook',),
        'Kipworst': ('Gekruide kipworst',),
        'Kipgehakt': ('Gemengd gekruid gehakt', 'Kipgehakt met Mexicaanse kruiden', 'Half-om-halfgehakt met Italiaanse kruiden',
                      'Kipgehakt met tuinkruiden'),
        'Kipshoarma': ('Kalkoenshoarma',)
    }
    reverse_label_ing = {x: k for k, v in ingredient_dict.items() for x in v}
    df["Ingredients"].replace(reverse_label_ing, inplace=True)

    print("Assigning ingredient categories")
    category_dict = {
        'brood': ('Biologisch wit rozenbroodje', 'Bladerdeeg', 'Briochebroodje', 'Wit platbrood'),
        'granen': ('Basmatirijst', 'Bulgur', 'Casarecce', 'Cashewstukjes',
                   'Gesneden snijbonen', 'Jasmijnrijst', 'Linzen', 'Maïs in blik',
                   'Parelcouscous', 'Penne', 'Rigatoni', 'Rode kidneybonen',
                   'Spaghetti', 'Witte tortilla'),
        'groenten': ('Aardappelen', 'Aubergine', 'Bosui', 'Broccoli',
                     'Champignons', 'Citroen', 'Gele wortel', 'Gesneden rodekool',
                     'Groene paprika', 'Groentemix van paprika, prei, gele wortel en courgette',
                     'IJsbergsla', 'Kumato tomaat', 'Limoen', 'Little gem',
                     'Paprika', 'Portobello', 'Prei', 'Pruimtomaat',
                     'Radicchio en ijsbergsla', 'Rode cherrytomaten', 'Rode paprika', 'Rode peper',
                     'Rode puntpaprika', 'Rode ui', 'Rucola', 'Rucola en veldsla', 'Rucolamelange',
                     'Semi-gedroogde tomatenmix', 'Sjalot', 'Sperziebonen', 'Spinazie', 'Tomaat',
                     'Turkse groene peper', 'Veldsla', 'Vers basilicum', 'Verse bieslook',
                     'Verse bladpeterselie', 'Verse koriander', 'Verse krulpeterselie', 'Wortel', 'Zoete aardappel'),
        'kruiden': ('Aïoli', 'Bloem', 'Bruine suiker', 'Cranberrychutney', 'Extra vierge olijfolie',
                    'Extra vierge olijfolie met truffelaroma', 'Fles olijfolie', 'Gedroogde laos',
                    'Gedroogde oregano', 'Gemalen kaneel', 'Gemalen komijnzaad', 'Gemalen korianderzaad',
                    'Gemalen kurkuma', 'Gerookt paprikapoeder', 'Groene currykruiden', 'Groentebouillon',
                    'Groentebouillonblokje', 'Honing', 'Italiaanse kruiden', 'Kippenbouillonblokje', 'Knoflookteen',
                    'Kokosmelk', 'Koreaanse kruidenmix', 'Mayonaise', 'Mexicaanse kruiden', 'Midden-Oosterse kruidenmix',
                    'Mosterd', 'Nootmuskaat', 'Olijfolie', 'Panko paneermeel', 'Paprikapoeder', 'Passata',
                    'Pikante uienchutney', 'Runderbouillonblokje', 'Sambal', 'Sesamzaad', 'Siciliaanse kruidenmix',
                    'Sojasaus', 'Suiker', 'Sumak', 'Surinaamse kruiden', 'Tomatenblokjes', 'Tomatenblokjes met ui',
                    'Truffeltapenade', 'Ui', 'Verse gember', 'Visbouillon', 'Witte balsamicoazijn', 'Wittewijnazijn',
                    'Zonnebloemolie', 'Zwarte balsamicoazijn'),
        'vlees': ('Gekruide runderburger', 'Half-om-half gehaktballetjes met Spaanse kruiden', 'Kipfilethaasjes', 'Kipfiletstukjes',
                  'Kipgehaktballetjes met Italiaanse kruiden', 'Kippendijreepjes', 'Kipshoarma', 'Kipworst', 'Spekblokjes',
                  'Vegetarische döner kebab', 'Vegetarische kaasschnitzel', 'Vegetarische schnitzel'),
        'zuivel': ('Ei', 'Geraspte belegen kaas', 'Geraspte cheddar', 'Geraspte grana padano', 'Geraspte oude kaas',
                   'Geraspte pecorino', 'Karnemelk', 'Kruidenroomkaas', 'Labne', 'Melk', 'Mozzarella',
                   'Parmigiano reggiano', 'Roomboter', 'Slagroom', 'Volle yoghurt')
    }
    reverse_label_cat = {x: k for k, v in category_dict.items() for x in v}
    df["Category"] = df["Ingredients"].map(reverse_label_cat)
    col = "Category"
    first_col = df.pop(col)
    df.insert(0, col, first_col)
    df = df.sort_values(['Category', 'Ingredients'], ascending=[True, True])

    print("Merging ingredients by row across all recipe columns using justify()")
    gp_cols = ['Ingredients', 'Measurement']
    oth_cols = df.columns.difference(gp_cols)
    arr = np.vstack(df.groupby(gp_cols, sort=False, dropna=False)
                      .apply(lambda gp: justify(gp.to_numpy(), invalid_val=np.NaN, axis=0, side='up')))

    # Reconstruct DataFrame
    # Remove entirely NaN rows based on the non-grouping columns
    res = (pd.DataFrame(arr, columns=df.columns)
             .dropna(how='all', subset=oth_cols, axis=0))

    res = res.fillna(0)
    res['Total'] = res.drop(['Ingredients', 'Measurement'], axis=1).sum(axis=1)
    res = res[res['Total'] != 0]  # To drop rows that are being duplicated with 0 for some reason; will check later

    print("Processing complete!")

    return res
Your function create_df needs a list as an argument, but st.text_input always returns a string.
In your streamlit.py, replace df_download = create_df(recs) with df_download = create_df([recs]). But if you need to handle multiple URLs, you should use str.split like this:
def create_df(recipes):
    recipes = recipes.split(",")  # <--- add this line to make a list from the user input
    ### rest of the code ###

if download:
    df_download = create_df(recs)
Below is my example code:
from fuzzywuzzy import fuzz
import json
from itertools import zip_longest
synonyms = open("synonyms.json","r")
synonyms = json.loads(synonyms.read())
vendor_data = ["i7 processor","solid state","Corei5 :1135G7 (11th
Generation)","hard
drive","ddr 8gb","something1", "something2",
"something3","HT (100W) DDR4-2400"]
buyer_data = ["i7 processor 12 generation","corei7:latest technology"]
vendor = []
buyer = []
for item, value in synonyms.items():
    for k, k2 in zip_longest(vendor_data, buyer_data):
        for v in value:
            if fuzz.token_set_ratio(k, v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item + " " + k)
            else:
                # didn't get only "something" strings here!
                if fuzz.token_set_ratio(k2, v) > 70:
                    if item in k2:
                        buyer.append(k2)
                    else:
                        buyer.append(item + " " + k2)

vendor = list(set(vendor))
buyer = list(set(buyer))
vendor, buyer
Note: "something" string can be anything like "battery" or "display"etc
synonyms json
{
    "processor": ["corei5", "core", "corei7", "i5", "i7", "ryzen5", "i5 processor", "i7 processor", "processor i5", "processor i7", "core generation", "core gen"],
    "ram": ["DDR4", "memory", "DDR3", "DDR", "DDR 8gb", "DDR 8 gb", "DDR 16gb", "DDR 16 gb", "DDR 32gb", "DDR 32 gb", "DDR4-"],
    "ssd": ["solid state drive", "solid drive"],
    "hdd": ["Hard Drive"]
}
What do I need?
I want to add all the "something" strings to the vendor list dynamically.
! NOTE -- the "something" strings can be anything in the future.
I want to add the "something" strings, i.e. the values that did not reach a match with fuzz > 70, to the vendor array; basically I want to include the left-out data as well.
for example like below:
current output
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state']
expected output below
['processor Corei5 :1135G7 (11th Generation)',
'i7 processor',
'ram HT (100W) DDR4-2400',
'ram ddr 8gb',
'hdd hard drive',
'ssd solid state',
'something1',
 'something2',
 'something3'] # "something" strings need to be added to the vendor list dynamically.
What silly mistake am I making? Thank you.
Here's my attempt:
from fuzzywuzzy import process, fuzz
synonyms = {'processor': ['corei5', 'core', 'corei7', 'i5', 'i7', 'ryzen5', 'i5 processor', 'i7 processor', 'processor i5', 'processor i7', 'core generation', 'core gen'], 'ram': ['DDR4', 'memory', 'DDR3', 'DDR', 'DDR 8gb', 'DDR 8 gb', 'DDR 16gb', 'DDR 16 gb', 'DDR 32gb', 'DDR 32 gb', 'DDR4-'], 'ssd': ['solid state drive', 'solid drive'], 'hdd': ['Hard Drive']}
vendor_data = ['i7 processor', 'solid state', 'Corei5 :1135G7 (11th Generation)', 'hard drive', 'ddr 8gb', 'something1', 'something2', 'something3', 'HT (100W) DDR4-2400']
buyer_data = ['i7 processor 12 generation', 'corei7:latest technology']
def find_synonym(s: str, min_score: int = 60):
    results = process.extractBests(s, choices=synonyms, score_cutoff=min_score)
    if not results:
        return None
    return results[0][-1]

def process_data(l: list, min_score: int = 60):
    matches = []
    no_matches = []
    for item in l:
        syn = find_synonym(item, min_score=min_score)
        if syn is not None:
            new_item = f'{syn} {item}' if syn not in item else item
            matches.append(new_item)
        elif any(fuzz.partial_ratio(s, item) >= min_score for s in synonyms.keys()):
            # one of the synonyms is already in the item string
            matches.append(item)
        else:
            no_matches.append(item)
    return matches, no_matches
For process_data(vendor_data) we get:
(['i7 processor',
'ssd solid state',
'processor Corei5 :1135G7 (11th Generation)',
'hdd hard drive',
'ram ddr 8gb',
'ram HT (100W) DDR4-2400'],
['something1', 'something2', 'something3'])
And for process_data(buyer_data):
(['i7 processor 12 generation', 'processor corei7:latest technology'], [])
I had to lower the cut-off score to 60 to also get results for ddr 8gb. The process_data function returns 2 lists: One with matches with words from the synonyms dict and one with items without matches. If you want exactly the output you listed in your question, just concatenate the two lists like this:
matches, no_matches = process_data(vendor_data)
matches + no_matches # ['i7 processor', 'ssd solid state', 'processor Corei5 :1135G7 (11th Generation)', 'hdd hard drive', 'ram ddr 8gb', 'ram HT (100W) DDR4-2400', 'something1', 'something2', 'something3']
I have tried to come up with a decent answer (certainly not the cleanest one):
import json
from itertools import zip_longest
from fuzzywuzzy import fuzz
synonyms = open("synonyms.json", "r")
synonyms = json.loads(synonyms.read())
vendor_data = ["i7 processor", "solid state", "Corei5 :1135G7 (11thGeneration)", "hard drive", "ddr 8gb", "something1",
"something2",
"something3", "HT (100W) DDR4-2400"]
buyer_data = ["i7 processor 12 generation", "corei7:latest technology"]
vendor = []
buyer = []
for k, k2 in zip_longest(vendor_data, buyer_data):
    has_matched = False
    for item, value in synonyms.items():
        for v in value:
            if fuzz.token_set_ratio(k, v) > 70:
                if item in k:
                    vendor.append(k)
                else:
                    vendor.append(item + " " + k)
                if has_matched or k2 is None:
                    break
                else:
                    has_matched = True
            if fuzz.token_set_ratio(k2, v) > 70:
                if item in k2:
                    buyer.append(k2)
                else:
                    buyer.append(item + " " + k2)
                if has_matched or k is None:
                    break
                else:
                    has_matched = True
        else:
            continue  # match not found
        break  # match is found
    else:  # only evaluates on normal loop end
        # Only "something" strings
        # do something with the new input values
        continue
vendor = list(set(vendor))
buyer = list(set(buyer))
I hope you can achieve what you want with this code. Check the docs if you don't know what a for-else loop does. TL;DR: the else clause executes when the loop terminates normally (not with a break). Note that I put the synonyms loop inside the data loop. This is because we can't know for certain which synonym group the data belongs to; sometimes the vendor data entry is a processor while the buyer entry is memory. Also note that I have assumed an item can't match more than once. If it could, you would need a more advanced check (just keep a counter and break when the counter equals 2, for example).
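For reference, a minimal sketch of that for-else behavior, on hypothetical values:

for n in [1, 3, 5]:
    if n % 2 == 0:
        print("found an even number")
        break
else:
    # runs only because the loop finished without hitting break
    print("no even number found")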
EDIT:
I took another look at the question and came up with maybe a better answer:
v_dict = dict()

for spec in vendor_data[:]:
    for item, choices in synonyms.items():
        if process.extractOne(spec, choices)[1] > 70:  # don't forget to import process from fuzzywuzzy
            v_dict[spec] = item
            break
    else:
        v_dict[spec] = "Something new"
This code matches the strings to the correct type, for example {'i7 processor': 'processor', 'solid state': 'ssd', 'Corei5 :1135G7 (11thGeneration)': 'processor', 'hard drive': 'ssd', 'ddr 8gb': 'ram', 'something1': 'Something new', 'something2': 'Something new', 'something3': 'Something new', 'HT (100W) DDR4-2400': 'ram'}. You can replace "Something new" with whatever you like. You could also do: v_dict[spec] = 0 (on a match) and v_dict[spec] = 1 (on no match). You could then sort the dict ->
it = iter(v_dict.values())
print(sorted(v_dict.keys(), key=lambda x: next(it)))
Which would give the wanted results (more or less), all the recognised items will be first, and then all the unrecognised items. You could do some more advanced sorting on this dict if you want. I think this code gives you enough flexibility to reach your goal.
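As an aside, a simpler equivalent of that iter/next trick (a sketch against the same v_dict, relying on Python 3.7+ dict ordering) is to sort the keys by their mapped value directly:

# dict lookup in the key replaces the iter/next trick
print(sorted(v_dict, key=v_dict.get))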
If I understand correctly, what you are trying to do is match keywords specified by a customer and/or vendor against a predefined database of keywords you have.
First, I would highly recommend using a reversed mapping of the synonyms, so lookups are faster, especially as the dataset grows.
Second, considering the fuzzywuzzy API, it looks like you simply want the best match, so extractOne is a solid choice for that.
Now, extractOne returns the best match and a score:
>>> process.extractOne("cowboys", choices)
("Dallas Cowboys", 90)
I would split the algorithm into two:
A generic part that simply gets the best match, which should always exist (even if it's not a great one)
A filter, where you could adjust the sensitivity of the algorithm, based on different criteria of your application. This sensitivity threshold should set the minimal match quality. If you're below this threshold, just use "untagged" for the category for example.
Here is the final code, which I think is very simple and easy to understand and expand:
import json
from fuzzywuzzy import process

def load_synonyms():
    with open('synonyms.json') as fin:
        synonyms = json.load(fin)

    # Reversing the map makes it much easier to look up
    reversed_synonyms = {}
    for key, values in synonyms.items():
        for value in values:
            reversed_synonyms[value] = key

    return reversed_synonyms

def load_vendor_data():
    return [
        "i7 processor",
        "solid state",
        "Corei5 :1135G7 (11thGeneration)",
        "hard drive",
        "ddr 8gb",
        "something1",
        "something2",
        "something3",
        "HT (100W) DDR4-2400"
    ]

def load_customer_data():
    return [
        "i7 processor 12 generation",
        "corei7:latest technology"
    ]

def get_tag(keyword, synonyms):
    THRESHOLD = 80
    DEFAULT = 'general'
    tag, score = process.extractOne(keyword, synonyms.keys())
    return synonyms[tag] if score > THRESHOLD else DEFAULT

def main():
    synonyms = load_synonyms()
    customer_data = load_customer_data()
    vendor_data = load_vendor_data()
    data = customer_data + vendor_data

    tags_dict = {keyword: get_tag(keyword, synonyms) for keyword in data}
    print(json.dumps(tags_dict, indent=4))

if __name__ == '__main__':
    main()
When running with the specified inputs, the output is:
{
    "i7 processor 12 generation": "processor",
    "corei7:latest technology": "processor",
    "i7 processor": "processor",
    "solid state": "ssd",
    "Corei5 :1135G7 (11thGeneration)": "processor",
    "hard drive": "hdd",
    "ddr 8gb": "ram",
    "something1": "general",
    "something2": "general",
    "something3": "general",
    "HT (100W) DDR4-2400": "ram"
}
I want to find stand-alone or successively connected nouns in a text. I put together the code below, but it is neither efficient nor pythonic. Does anybody have a more pythonic way of finding these nouns with spaCy?
The code below builds a dict with all tokens and then runs through them to find stand-alone or connected PROPN or NOUN tokens until the for-loop runs out of range. It returns a list of the collected items.
def extract_unnamed_ents(doc):
    """Takes a string and returns a list of all successively connected nouns or proper nouns"""
    nlp_doc = nlp(doc)

    token_list = []
    for token in nlp_doc:
        token_dict = {}
        token_dict['lemma'] = token.lemma_
        token_dict['pos'] = token.pos_
        token_dict['tag'] = token.tag_
        token_list.append(token_dict)

    ents = []
    k = 0
    for i in range(len(token_list)):
        try:
            if token_list[k]['pos'] == 'PROPN' or token_list[k]['pos'] == 'NOUN':
                ent = token_list[k]['lemma']
                if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
                    ent = ent + ' ' + token_list[k+1]['lemma']
                    k += 1
                    if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
                        ent = ent + ' ' + token_list[k+1]['lemma']
                        k += 1
                        if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
                            ent = ent + ' ' + token_list[k+1]['lemma']
                            k += 1
                            if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
                                ent = ent + ' ' + token_list[k+1]['lemma']
                                k += 1
                if ent not in ents:
                    ents.append(ent)
        except IndexError:
            pass
        k += 1
    return ents
Test:
extract_unnamed_ents('Chancellor Angela Merkel and some of her ministers will discuss at a cabinet '
                     "retreat next week ways to avert driving bans in major cities after Germany's "
                     'top administrative court in February allowed local authorities to bar '
                     'heavily polluting diesel cars.')
Out:
['Chancellor Angela Merkel',
'minister',
'cabinet retreat',
'week way',
'ban',
'city',
'Germany',
'court',
'February',
'authority',
'diesel car']
spaCy has a way of doing this, but I'm not sure it gives you exactly what you are after:
import spacy
text = """Chancellor Angela Merkel and some of her ministers will discuss
at a cabinet retreat next week ways to avert driving bans in
major cities after Germany's top administrative court
in February allowed local authorities to bar heavily
polluting diesel cars.
""".replace('\n', ' ')
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([i.text for i in doc.noun_chunks])
gives
['Chancellor Angela Merkel', 'her ministers', 'a cabinet retreat', 'ways', 'driving bans', 'major cities', "Germany's top administrative court", 'February', 'local authorities', 'heavily polluting diesel cars']
Here, however, the i.lemma_ line doesn't really give you what you want (I think this might be fixed by this recent PR).
Since it isn't quite what you are after you could use itertools.groupby like so
import itertools

out = []
for i, j in itertools.groupby(doc, key=lambda i: i.pos_):
    if i not in ("PROPN", "NOUN"):
        continue
    out.append(' '.join(k.lemma_ for k in j))
print(out)
gives
['Chancellor Angela Merkel', 'minister', 'cabinet retreat', 'week way', 'ban', 'city', 'Germany', 'court', 'February', 'authority', 'diesel car']
This should give you exactly the same output as your function (the output is slightly different here but I believe this is due to different spacy versions).
If you are feeling really adventurous you could use a list comprehension
out = [' '.join(k.lemma_ for k in j)
       for i, j in itertools.groupby(doc, key=lambda i: i.pos_)
       if i in ("PROPN", "NOUN")]
Note that I see slightly different results with different spaCy versions. The output above is from spacy-2.1.8.
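If you do want the noun_chunks segmentation but with lemmas, a possible sketch (assuming the same doc as above) lemmatizes each chunk token by token:

# join the lemma of every token in each noun chunk
print([' '.join(tok.lemma_ for tok in chunk) for chunk in doc.noun_chunks])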
I have the following code:
import os
import pprint

file_path = input("Please, enter the path to the file: ")
if os.path.exists(file_path):
    worker_dict = {}
    k = 1
    for line in open(file_path, 'r'):
        split_line = line.split()
        worker = 'worker{}'.format(k)
        worker_name = '{}_{}'.format(worker, 'name')
        worker_yob = '{}_{}'.format(worker, 'yob')
        worker_job = '{}_{}'.format(worker, 'job')
        worker_salary = '{}_{}'.format(worker, 'salary')
        worker_dict[worker_name] = ' '.join(split_line[0:2])
        worker_dict[worker_yob] = ' '.join(split_line[2:3])
        worker_dict[worker_job] = ' '.join(split_line[3:4])
        worker_dict[worker_salary] = ' '.join(split_line[4:5])
        k += 1
else:
    print('Error: Invalid file path')
File:
John Snow 1967 CEO 3400$
Adam Brown 1954 engineer 1200$
Output from worker_dict:
{
'worker1_job': 'CEO',
'worker1_name': 'John Snow',
'worker1_salary': '3400$',
'worker1_yob': '1967',
'worker2_job': 'engineer',
'worker2_name': 'Adam Brown',
'worker2_salary': '1200$',
'worker2_yob': '1954',
}
And I want to sort the data by worker name and after that by salary. My idea was to create a separate list with salaries and worker names to sort, but I have problems filling it. Maybe there is a more elegant way to solve my problem?
import os
import pprint

file_path = input("Please, enter the path to the file: ")
if os.path.exists(file_path):
    worker_dict = {}
    k = 1
    with open(file_path, 'r') as file:
        content = file.read().splitlines()
    res = []
    for i in content:
        val = i.split()
        name = [" ".join([val[0], val[1]]), ]  # concatenate first name and last name
        i = name + val[2:]  # prepend name
        res.append(i)  # append modified value to new list
    res.sort(key=lambda x: x[3])  # sort by salary
    print res
    res.sort(key=lambda x: x[0])  # sort by name
    print res
Output:
[['Adam Brown', '1954', 'engineer', '1200$'], ['John Snow', '1967', 'CEO', '3400$']]
[['Adam Brown', '1954', 'engineer', '1200$'], ['John Snow', '1967', 'CEO', '3400$']]
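If you want a single combined ordering (name first, then salary for workers with equal names) rather than two separate sorts, a tuple key is a sketch of that, using the res list built above:

# one sort pass: primary key name, secondary key salary
res.sort(key=lambda x: (x[0], x[3]))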
d = {
    'worker1_job': 'CEO',
    'worker1_name': 'John Snow',
    'worker1_salary': '3400$',
    'worker1_yob': '1967',
    'worker2_job': 'engineer',
    'worker2_name': 'Adam Brown',
    'worker2_salary': '1200$',
    'worker2_yob': '1954',
}

from itertools import zip_longest

# re-group:
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

# re-order:
res = []
for group in list(grouper(d.values(), 4)):
    reorder = [1, 2, 0, 3]
    res.append([group[i] for i in reorder])

# sort:
res.sort(key=lambda x: (x[1], x[2]))
output:
[['Adam Brown', '1200$', 'engineer', '1954'],
['John Snow', '3400$', 'CEO', '1967']]
Grouper is defined and explained in the itertools recipes. I've grouped your dictionary by the records pertaining to each worker and returned them as a reordered list of lists. As lists, I sort them by name and salary. This solution is modular: it distinctly groups, re-orders and sorts.
I recommend storing the workers in a different format, for example .csv; then you could use csv.DictReader and put them into a list of dictionaries (this would also allow you to use jobs, names, etc. with more than one word, like "tomb raider").
Note that you have to convert the year of birth and salary to ints or floats to sort them correctly, otherwise they would get sorted lexicographically as in a real world dictionary (book) because they are strings, e.g.:
>>> sorted(['100', '11', '1001'])
['100', '1001', '11']
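Converting in the sort key avoids that:
>>> sorted(['100', '11', '1001'], key=int)
['11', '100', '1001']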
To sort the list of dicts you can use operator.itemgetter as the key argument of sorted, instead of a lambda function, and just pass the desired key to itemgetter.
The k variable is useless, because it's just the len of the list.
The .csv file:
"name","year of birth","job","salary"
John Snow,1967,CEO,3400$
Adam Brown,1954,engineer,1200$
Lara Croft,1984,tomb raider,5600$
The .py file:
import os
import csv
from operator import itemgetter
from pprint import pprint

file_path = input('Please, enter the path to the file: ')

if os.path.exists(file_path):
    with open(file_path, 'r', newline='') as f:
        worker_list = list(csv.DictReader(f))

    for worker in worker_list:
        worker['salary'] = int(worker['salary'].strip('$'))
        worker['year of birth'] = int(worker['year of birth'])

    pprint(worker_list)
    pprint(sorted(worker_list, key=itemgetter('name')))
    pprint(sorted(worker_list, key=itemgetter('salary')))
    pprint(sorted(worker_list, key=itemgetter('year of birth')))
You still need some error handling in case an int conversion fails, or you can just let the program crash.
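A minimal sketch of that error handling, reusing the conversion loop above:

for worker in worker_list:
    try:
        worker['salary'] = int(worker['salary'].strip('$'))
        worker['year of birth'] = int(worker['year of birth'])
    except ValueError:
        # skip (or log) malformed rows instead of crashing
        print('Skipping malformed row:', worker)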
I'm trying to add edges between nodes.
I have a text file which I have put into a list.
The first list contains this:
Title , Rating
[('"$weepstake$" (1979) {(#1.2)}', '10.0'),
('"\'Til Death Do Us Part" (2006) {Pilot(#1.0)}', '3.7'),
('"\'Conversations with My Wife\'" (2010)', '4.2'),
('"\'Da Kink in My Hair" (2007)', '4.2').....much more here ]
I want to create nodes labeled with all the titles and, when two titles have the same rating, create an edge between them, so that in the end I get all titles with rating 10.0 together in one network, and so on.
My code so far:
import networkx as nx
import string
from sys import maxint
import csv
import pprint
import re

def printStuff(labels, dG):
    for index, node in enumerate(dG.nodes()):
        print '%s:%d\n' % (labels[index], dG.node[node]['count'])

str1 = titleList
#print str1
get_user_info = titleList1

dG = nx.DiGraph()
for i, word in enumerate(str1):
    try:
        next_word = str1[i]
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
        if not dG.has_node(next_word):
            dG.add_node(next_word)
            dG.node[next_word]['count'] = 0
        if not dG.has_edge(word, next_word):
            dG.add_edge(word, next_word, weight=0)
        else:
            dG.edge[word][next_word]['weight'] += 1
    except IndexError:
        if not dG.has_node(word):
            dG.add_node(word)
            dG.node[word]['count'] = 1
        else:
            dG.node[word]['count'] += 1
    except:
        raise

printStuff(titleList, dG)
Output:
10.0:1
10.0:1
3.7:1
10.0:1
3.7:1
4.2:1
10.0:1
3.7:1
4.2:1
4.2:1
And for edges:
for edge in dG.edges():
    print '%s:%d\n' % (edge, dG.edge[edge[0]][edge[1]]['weight'])
Output:
(('"\'Conversations with My Wife\'" (2010)', '4.2'), ('"\'Conversations with My Wife\'" (2010)', '4.2')):0
(('"\'Da Kink in My Hair" (2007)', '4.2'), ('"\'Da Kink in My Hair" (2007)', '4.2')):0
(('"$weepstake$" (1979) {(#1.2)}', '10.0'), ('"$weepstake$" (1979) {(#1.2)}', '10.0')):0
(('"\'Til Death Do Us Part" (2006) {Pilot (#1.0)}', '3.7'), ('"\'Til Death Do Us Part" (2006) {Pilot (#1.0)}', '3.7')):0
How about this:
data = [('"$weepstake$" (1979) {(#1.2)}', '10.0'),
('"\'Til Death Do Us Part" (2006) {Pilot(#1.0)}', '3.7'),
('"\'Conversations with My Wife\'" (2010)', '4.2'),
('"\'Da Kink in My Hair" (2007)', '4.2')]
import networkx as nx
G = nx.Graph()
G.add_edges_from(data)
nx.draw(G)
If you want a count of the edges for a rating:
len(G.edges('4.2'))
2
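And since each rating is itself a node connected to its titles, listing the titles that share a rating is just a neighbor lookup (a sketch against the same G):

# titles with rating 4.2 are the nodes adjacent to the '4.2' node
print(list(G.neighbors('4.2')))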