itertools.groupby() cannot generate the correct result - python

Code:
import itertools
first_letter = lambda x: x[0]
names = ['Alan', 'Adam', 'Wes', 'Albert', 'Steven']
for letter, name in itertools.groupby(names, first_letter):
    print(letter, list(name))
Output:
A ['Alan', 'Adam']
W ['Wes']
A ['Albert']
S ['Steven']
I want to group by the first letter, but it doesn't seem to work: 'A' shows up as two separate groups. What's wrong here?

As you would expect from any function in itertools, groupby operates on sequences of elements that share a common key. You have to remember that an iterator can be any source of sequential data, possibly one that doesn't store its own elements the way a list does.
What this means is that if the data is not already grouped within the iterator, groupby won't work the way you expect. Put another way, groupby starts a new group whenever the key changes, regardless of whether that key has already appeared earlier in the sequence.
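To make the "new group whenever the key changes" behaviour concrete, here is a minimal sketch. The list of letters is just an illustration (it is the first letters of the names in the question), and with no key function groupby uses each element as its own key:

```python
from itertools import groupby

# groupby only merges *adjacent* items with equal keys.
letters = ["A", "A", "W", "A", "S"]
runs = [(key, list(group)) for key, group in groupby(letters)]
print(runs)
# [('A', ['A', 'A']), ('W', ['W']), ('A', ['A']), ('S', ['S'])]
```

The third "A" is separated from the first two by "W", so it ends up in a group of its own.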
Probably the easiest way to pre-group the data in your case is to sort it. Lists can be sorted in-place:
names = ['Alan', 'Adam', 'Wes', 'Albert', 'Steven']
names.sort()
for letter, name in itertools.groupby(names, first_letter):
    print(letter, list(name))
A similar result could be obtained by distributing your list into a dictionary. I use collections.defaultdict below because it makes adding new elements easier. You could use a regular dictionary just as easily:
import collections
grouped = collections.defaultdict(list)
for name in names:
    grouped[name[0]].append(name)
for letter, group in grouped.items():
    print(letter, group)
In either case, the point is that you can't expect groupby to do exactly what you want with the order of elements in your raw data.
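If you want the groups but would rather not reorder the names alphabetically within each group, one option is to sort with the same key that groupby uses. This is only a sketch of that idea; the by_letter name is made up for the example:

```python
from itertools import groupby

names = ['Alan', 'Adam', 'Wes', 'Albert', 'Steven']
first_letter = lambda x: x[0]

# sorted() is stable: names with the same first letter keep their original order.
by_letter = {
    letter: list(group)
    for letter, group in groupby(sorted(names, key=first_letter), key=first_letter)
}
print(by_letter)
# {'A': ['Alan', 'Adam', 'Albert'], 'S': ['Steven'], 'W': ['Wes']}
```

Sorting and grouping on the same key guarantees that every element of a group is adjacent by the time groupby sees it.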

Related

Python pick a random value from hashmap that has a list as value?

I have a defaultdict(list) hashmap, potential_terms:
potential_terms = {9: ['leather'], 10: ['type', 'polyester'], 13: ['hello', 'bye']}
What I want to output is the 2 values (words) with the lowest keys. 'leather' is definitely the first output, but 'type' and 'polyester' both have k=10. When the key is the same, I want a random choice of either 'type' or 'polyester'.
What I did is:
out=[v for k,v in sorted(potential_terms.items(), key=lambda x:(x[0],random.choice(x[1])))][:2]
but when I print out I get :
[['leather'], ['type', 'polyester']]
My guess is that the problem is the second part of the lambda function: random.choice(x[1]). Any ideas on how to make it work as expected, outputting either 'type' or 'polyester'?
Thanks
EDIT: See Karl's answer and comment as to why this solution isn't correct for OP's problem.
I leave it here because it does demonstrate what OP originally got wrong.
key= doesn't transform the data itself; it only tells sorted how to order it.
You want to apply choice to v when selecting it in the comprehension, like so:
out=[random.choice(v) for k,v in sorted(potential_terms.items())[:2]]
(I also moved the [:2] inside, to shorten the list before the comprehension)
Output:
['leather', 'type']
OR
['leather', 'polyester']
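A quick way to convince yourself that key= only affects ordering is a small experiment. The pairs list here is made-up data in the same shape as potential_terms.items():

```python
pairs = [(10, ['type', 'polyester']), (9, ['leather'])]

# The key function only decides the order; the items themselves come back unchanged.
ordered = sorted(pairs, key=lambda item: item[0])
print(ordered)
# [(9, ['leather']), (10, ['type', 'polyester'])]
```

Whatever the key function returns is used for comparisons and then thrown away, which is why random.choice inside the key never shows up in the result.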
You have (with some extra formatting to highlight the structure):
out = [
    v
    for k, v in sorted(
        potential_terms.items(),
        key=lambda x: (x[0], random.choice(x[1]))
    )
][:2]
This means (reading from the inside out): sort the items according to the key, breaking ties using a random choice from the value list. Extract the values (which are lists) from those sorted items into a list (of lists). Finally, get the first two items of that list of lists.
This doesn't match the problem description, and is also somewhat nonsensical: since the keys are, well, keys, there cannot be duplicates, and thus there cannot be ties to break.
What we wanted: sort the items according to the key, then put all the contents of those individual lists next to each other to make a flattened list of strings, but randomizing the order within each sublist (i.e., shuffling those sublists). Then, get the first two items of that list of strings.
Thus, applying the technique from the link, and shuffling the sublists "inline" as they are discovered by the comprehension:
out = [
    term
    for k, v in sorted(
        potential_terms.items(),
        key=lambda x: x[0]  # this is not actually necessary now,
        # since the natural sort order of the items will work.
    )
    for term in random.sample(v, len(v))
][:2]
Please also see https://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/ to understand how the list flattening and result ordering works in a two-level comprehension like this.
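For reference, here is one way the same idea could be wrapped up as a self-contained function. The name lowest_terms and its parameters are made up for this sketch, and it uses itertools.islice instead of slicing so that it stops flattening once it has enough terms:

```python
import random
from itertools import islice

def lowest_terms(terms_by_key, n, rng=random):
    """Return the first n terms in key order, shuffling ties within each key."""
    flattened = (
        term
        for _, terms in sorted(terms_by_key.items())
        for term in rng.sample(terms, len(terms))  # shuffled copy of each sublist
    )
    return list(islice(flattened, n))

potential_terms = {9: ['leather'], 10: ['type', 'polyester'], 13: ['hello', 'bye']}
out = lowest_terms(potential_terms, 2)
print(out)  # ['leather', 'type'] or ['leather', 'polyester']
```

Passing a seeded random.Random instance as rng should make the choice reproducible, which can be handy in tests.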
A simpler alternative to building out in a single expression is:
d = list(potential_terms.values())
which stores all the values as:
[['leather'], ['type', 'polyester'], ['hello', 'bye']]
You can access 'leather' as d[0][0] and the list ['type', 'polyester'] as d[1]. Now we just call random.shuffle(d[1]) and use d[1][0],
which gets us a random word, either 'type' or 'polyester'. Note that this relies on the keys having been inserted into the dictionary in ascending order.
The final code should look like this:
import random
potential_terms = {9: ['leather'], 10: ['type', 'polyester'], 13: ['hello', 'bye']}
d = list(potential_terms.values())
random.shuffle(d[1])
c = []
c.append(d[0][0])
c.append(d[1][0])
Which gives the desired output,
either ['leather', 'polyester'] or ['leather', 'type'].

Django filter by condition

I have a queryset that I want to paginate through alphabetically.
employees = Employee.nodes.order_by('name')
I want to compare the first letter of each employee's name, name[0], to the letter that I am iterating on, but I don't know how to filter based on a condition applied to an attribute.
employees_by_letter = []
for letter in alphabet:
    employees_by_this_letter = employees.filter(name[0].lower()=letter)
    employees_by_letter.append(employees_by_this_letter)
"""error -- SyntaxError: keyword can't be an expression"""
I suppose I could iterate through each employee object and append a value for their first letter... but there has to be a better way.
Well, this is Python, and in Python keyword argument names must be identifiers. You cannot put an arbitrary expression in their place.
Django does, however, offer ways to do more advanced filtering. In your case, you want to use the __istartswith lookup:
employees_by_letter = [
    employees.filter(name__istartswith=letter)
    for letter in alphabet
]
This list comprehension builds a list that contains, for every letter in alphabet, the corresponding queryset.
Note, however, that since you will eventually fetch every element, I would actually recommend fetching everything once and grouping in Python (with groupby, for example).
Like:
from django.db.models.functions import Lower
from itertools import groupby
employees_by_letter = {
    k: list(v)
    for k, v in groupby(
        Employee.annotate(lowname=Lower('name')).nodes.order_by('lowname'),
        lambda x: x.lowname[:1]
    )
}
This constructs a dictionary whose keys are the lowercase first letters (or an empty string, if there are employees with an empty name), each mapping to a list of Employee instances: the employees whose name starts with that letter. That means we perform only a single query on the database. Django querysets are lazy, however, so if you plan to actually fetch only a few of the querysets, the former approach can be more efficient.
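The grouping step itself does not depend on the database, so it can be tried out in plain Python. In this sketch the Employee namedtuple and the sample names are hypothetical stand-ins for the real model and query results:

```python
from collections import namedtuple
from itertools import groupby

# Hypothetical stand-in for the Employee model; only the name attribute matters here.
Employee = namedtuple("Employee", ["name"])
employees = [Employee("bob"), Employee("Alice"), Employee("adam"), Employee("")]

def first_letter(employee):
    # Slicing with [:1] returns "" for an empty name instead of raising IndexError.
    return employee.name[:1].lower()

employees_by_letter = {
    letter: [e.name for e in group]
    for letter, group in groupby(sorted(employees, key=first_letter), key=first_letter)
}
print(employees_by_letter)
# {'': [''], 'a': ['Alice', 'adam'], 'b': ['bob']}
```

The data has to arrive ordered by the same key that groupby uses, which is what ordering by the lowercased name achieves in the queryset version.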

Custom Sort Complicated Strings in Python

I have a list of filenames conforming to the pattern: s[num][alpha1][alpha2].ext
I need to sort, first by the number, then by alpha1, then by alpha2. The last two aren't alphabetical, however, but rather should reflect a custom ordering.
I've created two lists representing the ordering for alpha1 and alpha2, like so:
alpha1Order = ["Fizz", "Buzz", "Ipsum", "Dolor", "Lorem"]
alpha2Order = ["Sit", "Amet", "Test"]
What's the best way to proceed? My first thought was to tokenize (somehow), splitting each filename into its component parts (s, num, alpha1, alpha2), and then sort, but I wasn't quite sure how to perform such a complicated sort. Using a key function seemed clunky, as this sort didn't seem to lend itself to a simple ordering.
Once tokenized, your data is perfectly orderable with a key function. Just return the index of the alpha1Order and alpha2Order lists for the value. Replace them with dictionaries to make the lookup easier:
alpha1Order = {token: i for i, token in enumerate(alpha1Order)}
alpha2Order = {token: i for i, token in enumerate(alpha2Order)}
def keyfunction(filename):
    num, alpha1, alpha2 = tokenize(filename)
    return int(num), alpha1Order[alpha1], alpha2Order[alpha2]
This returns a tuple to sort on. Python sorts on the first value, orders anything that has the same int(num) value by the second entry, and uses the third to break ties between items that are equal on the first two entries.
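The tokenize function is left undefined above. One possible sketch, assuming filenames look like "s" followed by digits, one alpha1 token, one alpha2 token and an extension (for example s2BuzzSit.txt), builds a regular expression from the two ordering lists. The sample filenames and the alpha1_rank / alpha2_rank names are made up for the example:

```python
import re

alpha1_order = ["Fizz", "Buzz", "Ipsum", "Dolor", "Lorem"]
alpha2_order = ["Sit", "Amet", "Test"]
alpha1_rank = {token: i for i, token in enumerate(alpha1_order)}
alpha2_rank = {token: i for i, token in enumerate(alpha2_order)}

# Assumed filename layout: "s" + digits + alpha1 + alpha2 + "." + extension
pattern = re.compile(
    r"s(\d+)(%s)(%s)\.\w+$" % ("|".join(alpha1_order), "|".join(alpha2_order))
)

def tokenize(filename):
    match = pattern.match(filename)
    if match is None:
        raise ValueError("unexpected filename: %r" % filename)
    return match.groups()

def keyfunction(filename):
    num, alpha1, alpha2 = tokenize(filename)
    return int(num), alpha1_rank[alpha1], alpha2_rank[alpha2]

files = ["s10FizzSit.txt", "s2LoremTest.txt", "s2BuzzAmet.txt", "s2BuzzSit.txt"]
print(sorted(files, key=keyfunction))
# ['s2BuzzSit.txt', 's2BuzzAmet.txt', 's2LoremTest.txt', 's10FizzSit.txt']
```

Converting num with int() matters: compared as strings, "10" would sort before "2".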

Split list into chunks by condition

I have a list like:
["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
that I want to split into two groups whose elements are equal once the number is removed:
"asdf-1-bhd", "asdf-2-bhd", "asdf-3-bhd"
"uuu-2-ggg", "uuu-1-ggg"
I have been using itertools.groupby with
for key, group in itertools.groupby(elements, key= lambda x : removeIndexNumber(x)):
but this does not work when the elements to be grouped are not consecutive.
I have thought about using list comprehensions, but this seems impossible since the number of groups is not fixed.
tl;dr:
I want to group stuff, two problems:
I don't know the number of chunks I will obtain.
The elements that will be grouped into a chunk might not be consecutive.
Why don't you think about it a bit differently? You can map everything into a dict:
import re
from collections import defaultdict
regex = re.compile(r'([a-z]+-)\d(-[a-z]+)')
t = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
maps = defaultdict(list)
for x in t:
    parts = regex.match(x).groups()
    maps[parts[0] + parts[1]].append(x)
print(list(maps.values()))
Output:
[['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd'], ['uuu-2-ggg', 'uuu-1-ggg']]
This is really fast because you don't have to compare one thing to another.
Edit:
On Thinking differently
Your original approach was to iterate through each item and compare them to one another. This is overcomplicated and unnecessary.
Let's consider what my code does. First it gets the stripped down version:
"asdf-1-bhd" -> "asdf--bhd"
"uuu-2-ggg" -> "uuu--ggg"
"asdf-2-bhd" -> "asdf--bhd"
"uuu-1-ggg" -> "uuu--ggg"
"asdf-3-bhd" -> "asdf--bhd"
You can already start to see the groups, and we haven't compared anything yet!
We now do a sort of reverse mapping. Each stripped value on the right becomes a key, and every original item on the left is appended to the list stored under its key:
'asdf--bhd' -> ['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd']
'uuu--ggg' -> ['uuu-2-ggg', 'uuu-1-ggg']
And there we have our groups, defined by their common computed value (key). This will work for any number of elements and groups.
Ok, simple solution (it must be too late over here):
Use itertools.groupby, but sort the list first.
As for the example given above:
elements = ["asdf-1-bhd","uuu-2-ggg","asdf-2-bhd","uuu-1-ggg","asdf-3-bhd"]
elements.sort(key=removeIndexNumber)
for key, group in itertools.groupby(elements, key=removeIndexNumber):
    for element in group:
        pass  # do stuff
As you can see, the condition for sorting is the same as for grouping. That way, the elements that will eventually have to be grouped are first put into consecutive order. Once this has been done, itertools.groupby can work properly.
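The removeIndexNumber helper is never shown in the question. A plausible version, assuming the number always sits between two hyphens, could look like the sketch below; the implementation is a guess, not the asker's actual function:

```python
import itertools
import re

def removeIndexNumber(element):
    # Hypothetical helper: drop the digits between the two hyphens,
    # e.g. "asdf-1-bhd" -> "asdf--bhd".
    return re.sub(r"-\d+-", "--", element)

elements = ["asdf-1-bhd", "uuu-2-ggg", "asdf-2-bhd", "uuu-1-ggg", "asdf-3-bhd"]
ordered = sorted(elements, key=removeIndexNumber)
groups = [
    list(group)
    for _, group in itertools.groupby(ordered, key=removeIndexNumber)
]
print(groups)
# [['asdf-1-bhd', 'asdf-2-bhd', 'asdf-3-bhd'], ['uuu-2-ggg', 'uuu-1-ggg']]
```

Because the sort is stable, elements within each group keep the order they had in the original list.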

Sorting dictionary list-values based on time

I'm pretty new to python (couple weeks into it) and I'm having some trouble wrapping my head around data structures. What I've done so far is extract text line-by-line from a .txt file and store them into a dictionary with the key as animal, for example.
database = {
'dog': ['apple', 'dog', '2012-06-12-08-12-59'],
'cat': [
['orange', 'cat', '2012-06-11-18-33-12'],
['blue', 'cat', '2012-06-13-03-23-48']
],
'frog': ['kiwi', 'frog', '2012-06-12-17-12-44'],
'cow': [
['pear', 'ant', '2012-06-12-14-02-30'],
['plum', 'cow', '2012-06-12-23-27-14']
]
}
# year-month-day-hour-min-sec
The goal is that, when I print my dictionary out, it prints by animal type, with the newest dates first.
What's the best way to go about sorting this data by time? I'm on Python 2.7. What I'm thinking is
for each key:
grab the list (or list of lists) --> get the 3rd entry --> '-'.split it, --> then maybe try the sorted(parameters)
I'm just not really sure how to go about this...
Walk through the elements of your dictionary. For each value, run sorted on your list of lists, and tell the sorting algorithm to use the third field of the list as the "key" element. This key element is what is used to compare values to other elements in the list in order to ascertain sort order. To tell sorted which element of your lists to sort with, use operator.itemgetter to specify the third element.
Since your timestamps are rigidly structured and each character in the timestamp is more temporally significant than the next one, you can sort them naturally, like strings - you don't need to convert them to times.
# Dictionary stored in d
from operator import itemgetter
# Iterate over the elements of the dictionary; below, by
# calling items(), k gets the key value of an entry and
# v gets the value of that entry
for k, v in d.items():
    if v and isinstance(v[0], list):
        v.sort(key=itemgetter(2))  # Start with 0, so third element is 2
If your dates are all in the format year-month-day-hour-min-sec, e.g. 2012-06-12-23-27-14, I think the splitting step is unnecessary; just compare them as strings.
>>> '2012-06-12-23-27-14' > '2012-06-12-14-02-30'
True
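If you want to double-check that claim, you can compare the string ordering against a real datetime ordering. This sketch uses three timestamps from the question and assumes every field is zero-padded, as in the example data:

```python
from datetime import datetime

stamps = ['2012-06-12-23-27-14', '2012-06-11-18-33-12', '2012-06-13-03-23-48']
fmt = '%Y-%m-%d-%H-%M-%S'

# Sorting as plain strings gives the same order as sorting as datetimes.
as_strings = sorted(stamps)
as_datetimes = sorted(stamps, key=lambda s: datetime.strptime(s, fmt))
print(as_strings == as_datetimes)  # True
```

The equivalence only holds while every field is zero-padded to a fixed width; a value like 2012-6-12 would break the string comparison.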
Firstly, you'll probably want each key,value item in the dict to be of a similar type. At the moment some of them (eg: database['dog'] ) are a list of strings (a line) and some (eg: database['cat']) are a list of lines. If you get them all into list of lines format (even if there's only one item in the list of lines) it will be much easier.
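As a rough sketch of that normalisation step (the as_lines helper is a made-up name, and only two of the entries from the question are reproduced), you could wrap single lines in a list and then sort every value newest first:

```python
database = {
    'dog': ['apple', 'dog', '2012-06-12-08-12-59'],
    'cat': [
        ['orange', 'cat', '2012-06-11-18-33-12'],
        ['blue', 'cat', '2012-06-13-03-23-48'],
    ],
}

def as_lines(value):
    # A single line is a list of strings; wrap it so every value is a list of lines.
    if value and isinstance(value[0], list):
        return list(value)  # copy, so sorting does not touch the original
    return [value]

normalized = {animal: as_lines(value) for animal, value in database.items()}
# Newest first, as the question asks: sort on the timestamp in descending order.
for animal in sorted(normalized):
    normalized[animal].sort(key=lambda line: line[2], reverse=True)
    print(animal, normalized[animal])
```

With every value in the same shape, the sorting code no longer needs a special case for single-line entries.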
Then, one (old) way would be to make a comparison function for those lines. This will be easy since your dates are already in a format that's directly (string) comparable. To compare two lines, you want to compare the 3rd (2nd index) item in them:
def compare_line_by_date(x, y):
    return cmp(x[2], y[2])
Finally you can get the lines for a particular key sorted by telling the sorted builtin to use your compare_line_by_date function:
sorted(database['cat'],compare_line_by_date)
The above is suitable (but slow, and will disappear in python 3) for arbitrarily complex comparison/sorting functions. There are other ways to do your particular sort, for example by using the key parameter of sorted:
def key_for_line(line):
    return line[2]
sorted(database['cat'],key=key_for_line)
Using keys for sorting is much faster than cmp because the key function only needs to be run once per item in the list to be sorted, instead of every time two items are compared (which usually happens far more often than there are items in the list). The idea of a key is basically to boil each list item down into something that can be compared naturally, like a string or a number. In the example above we boiled the line down into just the date, which is then compared.
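To see the difference in call counts, here is a small experiment. It is written for Python 3, where cmp() no longer exists, so the comparison function is adapted with functools.cmp_to_key; the sample lines and the calls counter are made up for the demonstration:

```python
from functools import cmp_to_key

lines = [['x', 'cat', '2012-06-%02d-00-00-00' % day] for day in (5, 3, 9, 1, 7, 2, 8, 4, 6)]
calls = {'key': 0, 'cmp': 0}

def key_for_line(line):
    calls['key'] += 1
    return line[2]

def compare_line_by_date(x, y):
    calls['cmp'] += 1
    # Python 3 has no cmp(); this expression is the usual replacement.
    return (x[2] > y[2]) - (x[2] < y[2])

by_key = sorted(lines, key=key_for_line)
by_cmp = sorted(lines, key=cmp_to_key(compare_line_by_date))
print(by_key == by_cmp)   # True
print(calls)              # 'key' equals len(lines); 'cmp' depends on the input order
```

Both approaches produce the same ordering; the key function is called exactly once per line, while the number of comparison calls grows faster than the list does.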
Disclaimer: I haven't tested any of the code in this answer... but it should work!
