Using Python Higher Order Functions to Manipulate Lists - python

I've made this list; each item is a string that contains commas (in some cases) and a colon (always):
dinner = [
'cake,peas,cheese : No',
'duck,broccoli,onions : Maybe',
'motor oil : Definitely Not',
'pizza : Damn Right',
'ice cream : Maybe',
'bologna : No',
'potatoes,bacon,carrots,water: Yes',
'rats,hats : Definitely Not',
'seltzer : Yes',
'sleeping,whining,spitting : No Way',
'marmalade : No'
]
I would like to create a new list from the one above as follows:
['cake : No',
'peas : No',
'cheese : No',
'duck : Maybe',
'broccoli : Maybe',
'onions : Maybe',
'motor oil : Definitely Not',
'pizza : Damn Right',
'ice cream : Maybe',
'bologna : No',
'potatoes : Yes',
'bacon : Yes',
'carrots : Yes',
'water : Yes',
'rats : Definitely Not',
'hats : Definitely Not',
'seltzer : Yes',
'sleeping : No Way',
'whining : No Way',
'spitting : No Way',
'marmalade : No']
But I'd like to know if, and how, it's possible to do so in a line or two of efficient code that relies primarily on Python's higher-order functions. Here is my attempt so far:
reduce(lambda x,y: x + y, (map(lambda x: x.split(':')[0].strip().split(','), dinner)))
...produces this:
['cake',
'peas',
'cheese',
'duck',
'broccoli',
'onions',
'motor oil',
'pizza',
'ice cream',
'bologna',
'potatoes',
'bacon',
'carrots',
'water',
'rats',
'hats',
'seltzer',
'sleeping',
'whining',
'spitting',
'marmalade']
...but I'm struggling to append the part of each string after the colon back onto each item.

I would create a dict using zip, map and itertools.repeat:
from itertools import repeat
data = ({k.strip(): v.strip() for _k, _v in map(lambda x: x.split(":"), dinner)
for k, v in zip(_k.split(","), repeat(_v))})
from pprint import pprint as pp
pp(data)
Output:
{'bacon': 'Yes',
'bologna': 'No',
'broccoli': 'Maybe',
'cake': 'No',
'carrots': 'Yes',
'cheese': 'No',
'duck': 'Maybe',
'hats': 'Definitely Not',
'ice cream': 'Maybe',
'marmalade': 'No',
'motor oil': 'Definitely Not',
'onions': 'Maybe',
'peas': 'No',
'pizza': 'Damn Right',
'potatoes': 'Yes',
'rats': 'Definitely Not',
'seltzer': 'Yes',
'sleeping': 'No Way',
'spitting': 'No Way',
'water': 'Yes',
'whining': 'No Way'}
Or using the dict constructor:
from itertools import repeat
data = dict(map(str.strip, t) for _k, _v in map(lambda x: x.split(":"), dinner)
for t in zip(_k.split(","), repeat(_v)))
from pprint import pprint as pp
pp(data)
If you really want a list of strings, we can do something similar using itertools.chain and joining the substrings:
from itertools import repeat, chain
data = chain.from_iterable(map(":".join, zip(_k.split(","), repeat(_v)))
for _k, _v in map(lambda x: x.split(":"), dinner))
from pprint import pprint as pp
pp(list(data))
Output:
['cake: No',
'peas: No',
'cheese : No',
'duck: Maybe',
'broccoli: Maybe',
'onions : Maybe',
'motor oil : Definitely Not',
'pizza : Damn Right',
'ice cream : Maybe',
'bologna : No',
'potatoes: Yes',
'bacon: Yes',
'carrots: Yes',
'water: Yes',
'rats: Definitely Not',
'hats : Definitely Not',
'seltzer : Yes',
'sleeping: No Way',
'whining: No Way',
'spitting : No Way',
'marmalade : No']
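If the exact 'item : verdict' spacing from the question is needed, the same chain.from_iterable idea can be combined with strip. This is only a minimal sketch: the helper name expand is made up for illustration, and the dinner list is shortened to three entries.

```python
from itertools import chain

# Shortened sample of the list from the question.
dinner = [
    'cake,peas,cheese : No',
    'potatoes,bacon,carrots,water: Yes',
    'motor oil : Definitely Not',
]

def expand(entry):
    """Turn one 'a,b,c : verdict' entry into ['a : verdict', 'b : verdict', ...].

    `expand` is a hypothetical helper name, not part of any library.
    """
    # partition splits on the first colon only and always returns three parts.
    items, _, verdict = entry.partition(':')
    return [f"{item.strip()} : {verdict.strip()}" for item in items.split(',')]

# map applies the helper to every entry; chain.from_iterable flattens the result.
result = list(chain.from_iterable(map(expand, dinner)))
print(result)
```

Stripping both halves means the inconsistent spacing around the colon in the input no longer shows up in the output.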

This assumes you really need a list of strings rather than a dictionary, which looks like the better data structure.
Simply by using comprehensions you can do this:
>>> [[x+':'+y for x in i.split(',')]
... for i, y in map(lambda l: map(str.strip, l.split(':')), dinner)]
[['cake:No', 'peas:No', 'cheese:No'],
['duck:Maybe', 'broccoli:Maybe', 'onions:Maybe'],
['motor oil:Definitely Not'],
...
['marmalade:No']]
Now just add up the lists:
>>> from functools import reduce  # reduce is not a builtin in Python 3
>>> from operator import add
>>> reduce(add, ([x+':'+y for x in i.split(',')]
... for i, y in map(lambda l: map(str.strip, l.split(':')), dinner)), [])
['cake:No',
'peas:No',
'cheese:No',
'duck:Maybe',
...
'marmalade:No']
Or just flatten the list:
>>> [a for i, y in map(lambda l: map(str.strip, l.split(':')), dinner)
... for a in (x+':'+y for x in i.split(','))]
['cake:No',
'peas:No',
'cheese:No',
'duck:Maybe',
...
'marmalade:No']

This may work:
def processList(aList):
    finalList = []
    for aListEntry in aList:
        aListEntry_entries = aListEntry.split(':')
        aListEntry_list = aListEntry_entries[0].split(',')
        for aListEntry_list_entry in aListEntry_list:
            finalList.append(aListEntry_list_entry.strip() + ' : ' + aListEntry_entries[1].strip())
    return finalList

List comprehensions are generally preferred in Python because many people find them more legible.
The code below demonstrates two kinds of list comprehension nesting: the first chains the operations, the second produces one list from two nested loops.
If you make your data more consistent by adding a space after 'carrots,water', you can get rid of the two .strip() calls ;)
dinner = [
'cake,peas,cheese : No',
'duck,broccoli,onions : Maybe',
'motor oil : Definitely Not',
'pizza : Damn Right',
'ice cream : Maybe',
'bologna : No',
'potatoes,bacon,carrots,water : Yes',
'rats,hats : Definitely Not',
'seltzer : Yes',
'sleeping,whining,spitting : No Way',
'marmalade : No'
]
prefs = [(pref, items.split(',')) for items, pref in [it.split(" : ") for it in dinner]]
[" : ".join([item, pref]) for pref, items in prefs for item in items]

Related

Removing phrases in reverse order from a List

I have two lists.
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list
If I look closely, L1 has 'not worry' and 'good very' which are exact reversed repetitions of 'worry not' and 'very good'.
I need to remove such reversed elements from the list. Similarly, in L2, 'happy be always' is the reverse of 'always be happy' and should be removed as well.
The final output I'm looking for is:
L1 = ['worry not', 'be happy', 'very good', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend']
I tried one solution
[[max(zip(map(set, map(str.split, group)), group))[1]] for group in L1]
But it is not giving the correct output.
Should I write separate functions for removing reversed bigrams and trigrams, or is there a faster, Pythonic way of doing this? I'll have to run it on 10K+ strings.
You can do it with list comprehensions if you iterate over the list from the end
lst = L1[::-1] # L2[::-1]
x = [s for i, s in enumerate(lst) if ' '.join(s.split()[::-1]) not in lst[i+1:]][::-1]
# L1: ['worry not', 'be happy', 'very good', 'full stop']
# L2: ['take into account', 'always be happy', 'stay safe friend']
You can use an index set and add both direct and reversed n-grams to it:
index = set()
res = []
for x in L1:
    a = tuple(x.split())
    b = tuple(reversed(a))
    if a in index or b in index:
        continue
    index.add(a)
    index.add(b)
    res.append(x)
print(res)
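The index-set idea above can be wrapped in a function so the same code serves bigrams, trigrams, or any n-gram length. This is a rough sketch; the name drop_reversed is an assumption chosen for illustration.

```python
def drop_reversed(phrases):
    """Keep the first occurrence of each phrase; drop later exact or reversed repeats.

    `drop_reversed` is a hypothetical name used only for this sketch.
    """
    seen = set()   # word tuples already kept
    kept = []      # phrases in their original order
    for phrase in phrases:
        words = tuple(phrase.split())
        # Skip the phrase if it, or its word-by-word reversal, was kept earlier.
        if words in seen or words[::-1] in seen:
            continue
        seen.add(words)
        kept.append(phrase)
    return kept

L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop']
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always']
print(drop_reversed(L1))
print(drop_reversed(L2))
```

Set lookups take constant time on average, so the whole pass is roughly linear in the number of phrases, which should be comfortable for 10K+ strings.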
Using a set of tuples is the way to deal with this:
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list
for list_ in L1, L2:
    s = set()
    for e in list_:
        t = tuple(e.split())
        if t[::-1] not in s:
            s.add(t)
    print([' '.join(e) for e in s])
Output:
['be happy', 'worry not', 'very good', 'full stop']
['always be happy', 'stay safe friend', 'take into account']
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] #trigrams list
def solution(lst):
    res = []
    for item in lst:
        if " ".join(item.split()[::-1]) not in res:
            res.append(item)
    return res
print(solution(L2))
My solution iterates over each element of the list, splits it into a list of words and sorts it, then does the same for each later element; if the two sorted lists match, the later element is removed. Note that sorting treats any reordering of the same words as a duplicate, not only an exact reversal.
Here is my code:
L1 = ['worry not', 'be happy', 'very good', 'not worry', 'good very', 'full stop'] # bigrams list
L2 = ['take into account', 'always be happy', 'stay safe friend', 'happy be always'] # trigrams list
def remove_duplicates(L):
    for idx_i, l_i in enumerate(L):
        aux_i = l_i.split()
        aux_i.sort()
        for idx_j, l_j in enumerate(L[idx_i+1:]):
            aux_j = l_j.split()
            aux_j.sort()
            if aux_i == aux_j:
                L.pop(idx_i + idx_j + 1)
    print(L)
remove_duplicates(L1)
remove_duplicates(L2)
The output is what you're looking for:
>>> remove_duplicates(L1)
['worry not', 'be happy', 'very good', 'full stop']
>>> remove_duplicates(L2)
['take into account', 'always be happy', 'stay safe friend']
Hope this works for you
This is a possible solution (the complexity is linear with respect to the number of strings):
from collections import defaultdict
from operator import itemgetter
d = defaultdict(list)
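for s in L2:
    words = tuple(s.split())
    # Use the larger of the word tuple and its reverse as a canonical key,
    # so a phrase and its reversal land in the same bucket.
    d[max(words, words[::-1])].append(s)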
result = list(map(itemgetter(0), d.values()))
Here are the results:
['worry not', 'be happy', 'very good', 'full stop']
['take into account', 'always be happy', 'stay safe friend']

How to sort a dictionary by value

I am trying to sort a dictionary by value, which is a timestamp in the format H:MM:SS (e.g. "0:41:42"), but the code below doesn't work as expected:
album_len = {
'The Piper At The Gates Of Dawn': '0:41:50',
'A Saucerful of Secrets': '0:39:23',
'More': '0:44:53', 'Division Bell': '1:05:52',
'The Wall': '1:17:46',
'Dark side of the moon': '0:45:18',
'Wish you were here': '0:44:17',
'Animals': '0:41:42'
}
album_len = OrderedDict(sorted(album_len.items()))
This is the output I get:
OrderedDict([
('A Saucerful of Secrets', '0:39:23'),
('Animals', '0:41:42'),
('Dark side of the moon', '0:45:18'),
('Division Bell', '1:05:52'),
('More', '0:44:53'),
('The Piper At The Gates Of Dawn', '0:41:50'),
('The Wall', '1:17:46'),
('Wish you were here', '0:44:17')])
It's not supposed to be like that. The first element I expected to see is ('The Wall', '1:17:46'), the longest one.
How do I get the elements sorted the way I intended?
Try converting each value to a datetime and using that as the key:
from collections import OrderedDict
from datetime import datetime
def convert_to_datetime(val):
    return datetime.strptime(val, "%H:%M:%S")
album_len = {'The Piper At The Gates Of Dawn': '0:41:50',
'A Saucerful of Secrets': '0:39:23', 'More': '0:44:53',
'Division Bell': '1:05:52', 'The Wall': '1:17:46',
'Dark side of the moon': '0:45:18',
'Wish you were here': '0:44:17', 'Animals': '0:41:42'}
album_len = OrderedDict(
sorted(album_len.items(), key=lambda i: convert_to_datetime(i[1]))
)
print(album_len)
Output:
OrderedDict([('A Saucerful of Secrets', '0:39:23'), ('Animals', '0:41:42'),
('The Piper At The Gates Of Dawn', '0:41:50'),
('Wish you were here', '0:44:17'), ('More', '0:44:53'),
('Dark side of the moon', '0:45:18'), ('Division Bell', '1:05:52'),
('The Wall', '1:17:46')])
Or in descending order with reverse set to True:
album_len = OrderedDict(
sorted(
album_len.items(),
key=lambda i: convert_to_datetime(i[1]),
reverse=True
)
)
Output:
OrderedDict([('The Wall', '1:17:46'), ('Division Bell', '1:05:52'),
('Dark side of the moon', '0:45:18'), ('More', '0:44:53'),
('Wish you were here', '0:44:17'),
('The Piper At The Gates Of Dawn', '0:41:50'),
('Animals', '0:41:42'), ('A Saucerful of Secrets', '0:39:23')])
Edit: If only insertion order needs to be maintained and OrderedDict-specific methods like move_to_end are not going to be used, then a regular Python dict also works here on Python 3.7+.
Ascending:
album_len = dict(
sorted(album_len.items(), key=lambda i: convert_to_datetime(i[1]))
)
Descending:
album_len = dict(
sorted(album_len.items(), key=lambda i: convert_to_datetime(i[1]),
reverse=True)
)
This is a duplicate of the question "How do I sort a dictionary by value?"
>>> dict(sorted(album_len.items(), key=lambda item: item[1]))
{'A Saucerful of Secrets': '0:39:23',
'Animals': '0:41:42',
'The Piper At The Gates Of Dawn': '0:41:50',
'Wish you were here': '0:44:17',
'More': '0:44:53',
'Dark side of the moon': '0:45:18',
'Division Bell': '1:05:52',
'The Wall': '1:17:46'}
Note: this time format already sorts correctly as a plain string, so you don't need to convert to datetime.
As @DarrylG points out in the comments, that holds only while the duration does not exceed 9:59:59, unless the hours are padded with a leading zero.
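One way to sidestep the 9:59:59 caveat is to sort on the total number of seconds instead of on the string. The sketch below assumes every value has exactly three colon-separated integer fields; the helper name to_seconds is made up, and the album dictionary is shortened.

```python
def to_seconds(duration):
    """Convert an 'H:MM:SS' string to a number of seconds.

    `to_seconds` is a hypothetical helper; it assumes exactly three integer fields.
    """
    hours, minutes, seconds = map(int, duration.split(':'))
    return hours * 3600 + minutes * 60 + seconds

# Shortened sample of the dictionary from the question.
album_len = {'The Wall': '1:17:46', 'Animals': '0:41:42', 'More': '0:44:53'}

# Longest album first; dict keeps insertion order on Python 3.7+.
by_length = dict(
    sorted(album_len.items(), key=lambda item: to_seconds(item[1]), reverse=True)
)
print(by_length)

# Plain string comparison goes wrong once the hours reach two digits.
print('10:00:00' < '9:59:59')                          # string order
print(to_seconds('10:00:00') < to_seconds('9:59:59'))  # numeric order
```

Unlike the strptime approach with %H, an integer key also copes with durations of 24 hours or more.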

Filtering web articles by keywords inside of a loop

I wrote a function to scrape web articles, but I want to adapt it so that it checks whether the article is relevant to me (based on a list of keywords) and ignores it if it isn't. I've found several ways to check if a string is inside another string, but somehow I can't get them to work inside a for-loop. Here's a light example of the function:
combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
caps = []
for i in combos:
    for j in my_favorites:
        if j not in i:
            continue
    caps.append(i.upper())
print(caps)
I want to skip to the next iteration of the loop if at least one of my favorite fruits are not included. But all the strings in the list are getting through the filter:
['APPLE AND PEAR', 'PEAR AND BANANA', 'APPLE AND PEACH', 'BANANA AND KIWI', 'PEACH AND ORANGE']
Can someone please explain my failure in understanding here?
I find regular expressions to be the best way to filter text, especially when the input is a vast dataset. Below, I use Python's built-in re module to compile the required pattern, then use the pattern's search method to keep the items in which it occurs anywhere in the string (match would only test the start of each string).
import re
combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
regex_pattern = "|".join(my_favorites)
r = re.compile(regex_pattern)
filtered_list = filter(r.search, combos)
caps = [item.upper() for item in filtered_list]
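When building a pattern from a keyword list like this, it matters whether the compiled pattern is applied with match or with search. The following is a small sketch of the difference; the test string 'pear and apple' is made up for illustration, and re.escape is added as a precaution in case a keyword contains regex metacharacters.

```python
import re

my_favorites = ['apple', 'peach']

# Escape each keyword so characters such as '.' or '+' are taken literally.
pattern = re.compile('|'.join(re.escape(word) for word in my_favorites))

text = 'pear and apple'  # hypothetical input: the keyword is not at the start

at_start = pattern.match(text)    # only tries to match at position 0
anywhere = pattern.search(text)   # scans the whole string

print(at_start)          # no match object, because the text starts with 'pear'
print(anywhere.group())  # the keyword that was found
```

With the sample combos every favorite happens to sit at the start of its string, which is why match and search give the same result there.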
You need to move caps.append(i.upper()) into an else branch, and break after the first match so that an item containing several favorites is only added once.
combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
caps = []
for i in combos:
    for j in my_favorites:
        if j not in i:
            continue
        else:
            caps.append(i.upper())
            break  # stop at the first favorite found in this item
print(caps)
Your code appends the upper-cased combos item regardless of whether any keyword is present in it.
The continue statement only affects the inner loop, so you iterate over the whole my_favorites list and, once that finishes, append the upper-cased i to caps.
The below code achieves what you want:
combos = ['apple and pear', 'pear and banana', 'apple and peach', 'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']
caps = []
for i in combos:
    if any([fav in i for fav in my_favorites]):
        caps.append(i.upper())
print(caps)
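The question can be read two ways: keep an item if it mentions any favorite, or only if it mentions all of them. The sketch below shows both readings side by side using any and all with generator expressions; the variable names with_any and with_all are made up for illustration.

```python
combos = ['apple and pear', 'pear and banana', 'apple and peach',
          'banana and kiwi', 'peach and orange']
my_favorites = ['apple', 'peach']

# Keep an item when at least one favorite occurs in it.
with_any = [c.upper() for c in combos if any(fav in c for fav in my_favorites)]

# Keep an item only when every favorite occurs in it.
with_all = [c.upper() for c in combos if all(fav in c for fav in my_favorites)]

print(with_any)
print(with_all)
```

Both any and all stop as soon as the answer is known, so passing a generator expression avoids building an intermediate list for every item.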

Split individual strings in a list Python

How do I split individual strings in a list?
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
Return:
print(data)
('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
One approach, using join and split:
items = ' '.join(data)
terms = items.split(' ')
print(terms)
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']
The idea here is to generate a single string containing all space-separated terms. Then, all we need is a single call to the non regex version of split to get the output.
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
data = [i.split(' ') for i in data]
data=sum(data, [])
print(tuple(data))
#('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
You can use itertools.chain for that like:
Code:
it.chain.from_iterable(i.split() for i in data)
Test Code:
import itertools as it
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
print(list(it.chain.from_iterable(i.split() for i in data)))
Results:
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']
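The answers above use both split(' ') and split(). The two behave differently when the input has repeated spaces, as this short sketch suggests; the sample data with a doubled space is made up for illustration.

```python
from itertools import chain

# Hypothetical input: note the two spaces after 'Esperanza'.
data = ('Esperanza  Ice Cream', 'Gregory Johnson')

# split(' ') splits on every single space, so a doubled space yields an empty string.
with_space = list(chain.from_iterable(s.split(' ') for s in data))

# split() with no argument splits on runs of whitespace and drops empty strings.
no_arg = list(chain.from_iterable(s.split() for s in data))

print(with_space)
print(no_arg)

# Convert to a tuple if the result should have the same type as the input.
print(tuple(no_arg))
```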

A regular expression using a list of words

I'm using Python.
I have some strings :
'1 banana', '100 g of sugar', '1 cup of flour'
I need to distinguish the food from the quantity.
I have an array of quantities type
quantities = ['g', 'cup', 'kg', 'L']
altern = '|'.join(quantities)
and so with using a regular expression I would like to get for example for '1 cup of flour' : 'flour' and '1 cup of', for '1 banana' : '1' and 'banana'
I have written this regexp to match the quantity part of the strings above :
\d{1,3}\s<altern>?\s?(\bof\b)?
but I'm very unsure about this ...particularly on how to introduce the altern variable in the regular expression.
I think your amounts are units, so I took the liberty to fix this misnomer. I propose to use named grouping to ease understanding the output.
import re
units = [ 'g', 'cup', 'kg', 'L' ]
anyUnitRE = '|'.join(units)
inputs = [ '1 banana', '100 g of sugar', '1 cup of flour' ]
for text in inputs:
    m = re.match(
        r'(?P<amount>\d{1,3})\s*'
        r'(?P<unit>(' + anyUnitRE + r')?)\s*'
        r'(?P<preposition>(of)?)\s*'
        r'(?P<name>.*)', text)
    print(m and m.groupdict())
The output will be something like this (the key order may differ between Python versions):
{'preposition': '', 'amount': '1', 'name': 'banana', 'unit': ''}
{'preposition': 'of', 'amount': '100', 'name': 'sugar', 'unit': 'g'}
{'preposition': 'of', 'amount': '1', 'name': 'flour', 'unit': 'cup'}
So you can do something like this:
if m.groupdict()['name'] == 'sugar':
…
amount = int(m.groupdict()['amount'])
unit = m.groupdict()['unit']
I think you can use this:
"(.*?) (\w*)$"
And get \1 for first part and \2 for second part.
And for a better regex:
"^((?=.*of)((.*of)(.*)))|((?!.*of)(\d+)(.*))$"
And get \3 and \6 for first part and \4 and \7 for second part.
You can try this code:
import re
lst = ['1 banana', '100 g of sugar', '1 cup of flour']
quantities = ['g', 'cup', 'kg', 'L']
altern = '|'.join(quantities)
r = r'(\d{1,3})\s*((?:%s)?s?(?:\s*\bof\b)?\s*\S+)'%(altern)
for x in lst:
    print(re.findall(r, x))
Output:
[('1', 'banana')]
[('100', 'g of sugar')]
[('1', 'cup of flour')]
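On the original question of how to get the altern variable into the pattern: one option is to build the pattern string by concatenation, escaping each unit and adding a word boundary so that a unit such as 'g' is not matched at the start of a word like 'grapes'. The sketch below is only one possible approach; the function name split_ingredient and the group names quantity and food are made up for illustration.

```python
import re

quantities = ['g', 'cup', 'kg', 'L']

# Escape each unit and join them into one alternation.
altern = '|'.join(re.escape(unit) for unit in quantities)

# quantity = number, optional unit (followed by a word boundary), optional 'of'
# food     = everything after the next run of whitespace
pattern = re.compile(
    r'(?P<quantity>\d+(?:\s+(?:' + altern + r')\b)?(?:\s+of\b)?)'
    r'\s+(?P<food>.+)'
)

def split_ingredient(text):
    """Return (quantity, food), or None when the text does not start with a number.

    `split_ingredient` is a hypothetical helper name.
    """
    m = pattern.match(text)
    return (m.group('quantity'), m.group('food')) if m else None

for text in ['1 banana', '100 g of sugar', '1 cup of flour', '2 grapes']:
    print(split_ingredient(text))
```

The \b after the unit alternation is what keeps '2 grapes' from being read as 2 g of "rapes".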
Why do you want to do this with regular expressions? You can use Python's string splitting functions instead:
def qsplit(a):
    """Return a [quantity, ingredient] pair"""
    if not a:
        return None
    if not a[0] in "0123456789":
        return ["0", a]
    if " of " in a:
        return a.split(" of ", 1)
    return a.split(None, 1)
