I have two sets of data. The first (A) is a list of equipment with sophisticated names. The second (B) is a list of broader equipment categories, into which I have to group the first list using string comparisons. I'm aware this won't be perfect.
For each entity in List A, I'd like to compute a Levenshtein-based similarity score against each entity in List B. The record in List B with the highest score will be the group to which I assign that data point.
I'm very rusty in Python, and am playing around with FuzzyWuzzy to get the similarity between two string values. However, I can't quite figure out how to iterate through each list to produce what I need.
I presumed I'd just create a list for each data set and write a pretty basic loop over each, but like I said, I'm a little rusty and not having any luck.
Any help would be greatly appreciated! If there is another package that will allow me to do this (not FuzzyWuzzy), I'm glad to take suggestions.
It looks like the process.extractOne function is what you're looking for. A simple use case is something like:
from fuzzywuzzy import process
from collections import defaultdict

complicated_names = ['leather couch', 'left-handed screwdriver', 'tomato peeler']
generic_names = ['couch', 'screwdriver', 'peeler']

group = defaultdict(list)
for name in complicated_names:
    # extractOne returns a (match, score) tuple; [0] is the matched string
    group[process.extractOne(name, generic_names)[0]].append(name)
defaultdict is a dictionary that supplies a default value (here, a new empty list) for any key that is missing.
We loop over all the complicated names, use fuzzywuzzy to find the closest match, and then add the name to the list associated with that match.
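For the sample data above, the resulting grouping looks like this (exact scores vary between fuzzywuzzy versions, but these matches are the clear winners):
print(dict(group))
# {'couch': ['leather couch'],
#  'screwdriver': ['left-handed screwdriver'],
#  'peeler': ['tomato peeler']}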
I'm new to Python and trying to learn how to use dictionaries, but honestly I don't see the point: you're limited to pairs of two, whereas if I just make a list of tuples I get far more flexibility.
In the code below I made a list of superheroes, where you can look one up by Name (Batman), Identity (Bruce Wayne), or Universe (DC).
You can't do that with a dictionary (you're limited to pairs of two), so why would I ever need a dictionary?
Superheroes = [
    ('Batman', 'Bruce Wayne', 'DC'),
    ('Spiderman', 'Peter Parker', 'Marvel'),
    ('Superman', 'Clark Kent', 'DC'),
    ('Ironman', 'Tony Stark', 'Marvel'),
    ('Green Arrow', 'Oliver Queen', 'DC'),
]

user_selection = input()
for (name, identity, universe) in Superheroes:
    if name == user_selection or identity == user_selection or universe == user_selection:
        print('Hero:' + name + '\nSecret Identity:' + identity + '\nUniverse:' + universe)
    else:
        continue
Most uses of dictionaries don't require searching for a match in both the keys and values. You design your data structure so that the item you normally need to look up is the dictionary key.
For instance, if you have data with ingredients in recipes, you would almost always be looking it up by the dish that you're cooking. So you make that the key and you can get all the ingredients with a single lookup instead of searching the entire list.
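As a quick illustration (the dishes and ingredients here are made up):
recipes = {
    'pancakes': ['flour', 'eggs', 'milk'],
    'omelette': ['eggs', 'butter', 'cheese'],
}
print(recipes['pancakes'])  # one O(1) lookup instead of scanning a list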
If you occasionally need to find something in the values, you can still iterate through the dictionary using dict.items(). If you need to look up by different components frequently, you can make multiple dictionaries that all refer to the same values using different keys.
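Applied to your superhero data, that last idea might look like this minimal sketch:
heroes = [('Batman', 'Bruce Wayne', 'DC'), ('Spiderman', 'Peter Parker', 'Marvel')]

# Several indexes over the same records, each keyed by a different component
by_name = {hero[0]: hero for hero in heroes}
by_identity = {hero[1]: hero for hero in heroes}

print(by_name['Batman'])            # ('Batman', 'Bruce Wayne', 'DC')
print(by_identity['Peter Parker'])  # ('Spiderman', 'Peter Parker', 'Marvel')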
I have been working on a problem which involves sorting a large data set of shop orders, extracting shop and user information based on some parameters. Mostly this has involved creating dictionaries by iterating through the data set with a for loop and appending to a list, like this:
from collections import defaultdict

sshop = defaultdict(list)
for i in range(df_subset.shape[0]):
    orderid, sid, userid, time = df_subset.iloc[i]
    sshop[sid].append(userid)
sData = dict(sshop)
# CREATES DICTIONARY OF UNIQUE SHOPS WITH USER DATA AS THE VALUE
shops = df_subset['shopid'].unique()

shops_dict = defaultdict(list)
for shop in shops:
    shops_dict[shop].append(sData[shop])
shops_dict = dict(shops_dict)
shops_dict looks like this at this point:
{10009: [[196962305]],
 10051: [[2854032, 48600461]],
 10061: [[168750452, 194819216, 130633421, 62464559]]}
To get to the final stages I have had to repeat lines of code similar to these a couple of times. What seems to happen every time I do this is that the values in the dictionaries gain an extra set of square brackets.
This is one of my final dictionaries:
{10159: [[[1577562540.0, 1577736960.0, 1577737080.0]],
         [[1577651880.0, 1577652000.0, 1577652960.0]]],
 10208: [[[1577651040.0, 1577651580.0, 1577797080.0]]]}
I don't entirely understand why this is happening, aside from believing it has something to do with using defaultdict(list) and then converting that into a dictionary with dict().
These extra brackets, aside from being a little confusing, appear to be causing some problems for accessing the data with certain functions. I understand that there need to be two sets of square brackets in total: one set that encases all the values for a dictionary key, and another inside it for each specific set of values within that key.
My first question would be: is it possible to remove a specific set of square brackets from a dictionary like that?
My second question would be: if not, is there a better way of creating new dictionaries out of the data from an older one without using defaultdict(list) and accumulating all those extra square brackets?
Any help much appreciated!
Thanks :)!
In the second loop, use extend instead of append: append adds sData[shop], which is itself a list, as a single nested element, while extend adds its items individually.
for shop in shops:
    shops_dict[shop].extend(sData[shop])
shops_dict = dict(shops_dict)
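A quick standalone illustration of the difference:
inner = [1, 2]

appended = []
appended.append(inner)  # nests the whole list: [[1, 2]]

extended = []
extended.extend(inner)  # adds its elements: [1, 2]

print(appended, extended)  # [[1, 2]] [1, 2]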
I want to find the 20 most common names, and their frequency, in a country.
Let's say I have lists of all residents' first names in 100 cities. Each list might contain a lot of names; let's say we're talking about 100 lists, each with 1,000 strings.
What is the most efficient method to get the 20 most common names, and their frequencies, in the entire country?
This is the direction I began with, assuming I have each city in a text file in the same directory:
Use the pandas and collections modules for this.
Iterate through each city .txt file, reading it into a string. Then turn it into a Counter (from the collections module), and then into a DataFrame (using to_dict).
Union each DataFrame with the previous one.
Then, group by and count(*) on the DataFrame.
But I'm thinking this method might not work, as the DataFrame can get too big.
Would like to hear any advice on that. Thank you.
Here is some sample code:
import os
from collections import Counter

cities = [i for i in os.listdir(".") if i.endswith(".txt")]

d = Counter()
for file in cities:
    with open(file) as f:
        # Adjust the line below to match how names are delimited in your files
        data = f.read().split(",")
        d.update(Counter(data))

out = d.most_common(20)
print(out)
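Note that Counter.update also accepts a plain iterable directly, so d.update(data) would count the names without building an intermediate Counter first.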
You can also use the NLTK library; I was using the code below for a similar purpose.
from nltk import FreqDist

# `text` should be the flat list of name tokens across all cities
fd = FreqDist(text)
top_20 = fd.most_common(20)  # it's done, you got the top 20 tokens :)
When I want to assign only a single value to each key, I always use a dictionary. For example:
{'Monday': 1, 'Tuesday': 2, ..., 'Friday': 5, ...}
But I want to assign many values to one key, like for example:
Monday: Jogging, Swimming, Skating
Tuesday: School, Work, Dinner, Cinema
...
Friday: Doctor
Is there any built-in structure or a simple way to do something like this in Python?
My idea: I was thinking about a dictionary which holds a day as its key and a list as its value, but maybe there is a better solution.
A dictionary whose values are lists is perfectly fine, and in fact very common.
You might also want to consider an extension to that: a collections.defaultdict(list). This will create a new empty list the first time you access any missing key, so you can write code like this:
d[day].append(activity)
… instead of this:
if day not in d:
    d[day] = []
d[day].append(activity)
The down-side of a defaultdict is that you no longer have a way to detect that a key is missing in your lookup code, because it will automatically create a new one. If that matters, use a regular dict together with the setdefault method:
d.setdefault(day, []).append(activity)
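Putting it together with the example data from the question (a minimal sketch):
from collections import defaultdict

activities = [('Monday', 'Jogging'), ('Monday', 'Swimming'),
              ('Monday', 'Skating'), ('Friday', 'Doctor')]

d = defaultdict(list)
for day, activity in activities:
    d[day].append(activity)

print(d['Monday'])  # ['Jogging', 'Swimming', 'Skating']
print(d['Friday'])  # ['Doctor']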
You could wrap either of these solutions up in a "MultiDict" class that encapsulates the fact that it's a dictionary of lists, but the dictionary-of-lists idea is such a common idiom that it really isn't necessary to hide it.
I have a dict that has unix epoch timestamps for keys, like so:
lookup_dict = {
    1357899: {},  # some dict of data
    1357910: {},  # some other dict of data
}
Except, you know, millions and millions and millions of entries. I'd like to subset this dict, over and over again. Ideally, I'd love to be able to write something like I can in R, like:
lookup_value = 1357900
dict_subset = lookup_dict[key >= lookup_value]
# dict_subset now contains {1357910: {}}
But I confess I can't find any actual proof that this is something Python can do without, one way or another, iterating over every entry. If I understand Python correctly (and I might not), key lookup of the form key in dict uses hashing, and is thus very fast; is there any way to do a binary search on dict keys?
To do this without iterating, you're going to need the keys in sorted order. Then you just need to do a binary search for the first one >= lookup_value, instead of checking each one for >= lookup_value.
If you're willing to use a third-party library, there are plenty out there. The first two that spring to mind are bintrees (which uses a red-black tree, like C++, Java, etc.) and blist (which uses a B+Tree). For example, with bintrees, it's as simple as this:
dict_subset = lookup_dict[lookup_value:]
And this will be as efficient as you'd hope—basically, it adds a single O(log N) search on top of whatever the cost of using that subset. (Of course usually what you want to do with that subset is iterate the whole thing, which ends up being O(N) anyway… but maybe you're doing something different, or maybe the subset is only 10 keys out of 1000000.)
Of course there is a tradeoff. Random access to a tree-based mapping is O(log N) instead of "usually O(1)". Also, your keys obviously need to be fully ordered, instead of hashable (and that's a lot harder to detect automatically and raise nice error messages on).
If you want to build this yourself, you can. You don't even necessarily need a tree; just a sorted list of keys alongside a dict. You can maintain the list with the bisect module in the stdlib, as JonClements suggested. You may want to wrap up bisect to make a sorted list object—or, better, get one of the recipes on ActiveState or PyPI to do it for you. You can then wrap the sorted list and the dict together into a single object, so you don't accidentally update one without updating the other. And then you can extend the interface to be as nice as bintrees, if you want.
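Here is a minimal sketch of that do-it-yourself approach (the helper name subset_from is my own):
from bisect import bisect_left

sorted_keys = sorted(lookup_dict)  # maintain this alongside the dict

def subset_from(d, keys, lookup_value):
    # O(log N) search for the first key >= lookup_value;
    # only the matching tail is then copied
    start = bisect_left(keys, lookup_value)
    return {k: d[k] for k in keys[start:]}

dict_subset = subset_from(lookup_dict, sorted_keys, 1357900)
# {1357910: {}}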
The following dict comprehension will work:
some_time_to_filter_for = 1357900  # e.g. the lookup value from the question
# Create a new sub-dictionary
sub_dict = {key: val for key, val in lookup_dict.items()
if key >= some_time_to_filter_for}
Basically, we just iterate through all the keys in your dictionary and, given a time to filter for, take every key that is greater than or equal to that value and place it into the new dictionary.
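Note that, unlike the sorted-keys approach above, this comprehension scans every key, so each subset costs O(N); for one-off filters on a moderately sized dict that is usually fine.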