A regular expression using a list of words - python

I'm using Python.
I have some strings:
'1 banana', '100 g of sugar', '1 cup of flour'
I need to separate the food from the quantity.
I have a list of quantity types:
quantities = ['g', 'cup', 'kg', 'L']
altern = '|'.join(quantities)
Using a regular expression, I would like to get, for example, 'flour' and '1 cup of' from '1 cup of flour', and '1' and 'banana' from '1 banana'.
I have written this regexp to match the quantity part of the strings above:
\d{1,3}\s<altern>?\s?(\bof\b)?
but I'm very unsure about it, particularly about how to introduce the altern variable into the regular expression.

I think what you call quantities are really units, so I took the liberty of fixing this misnomer. I suggest using named groups to make the output easier to understand.
import re
units = ['g', 'cup', 'kg', 'L']
anyUnitRE = '|'.join(units)
inputs = ['1 banana', '100 g of sugar', '1 cup of flour']
for text in inputs:  # 'text' avoids shadowing the built-in input()
    m = re.match(
        r'(?P<amount>\d{1,3})\s*'
        r'(?P<unit>(' + anyUnitRE + r')?)\s*'
        r'(?P<preposition>(of)?)\s*'
        r'(?P<name>.*)', text)
    print(m and m.groupdict())
The output will be something like this:
{'amount': '1', 'unit': '', 'preposition': '', 'name': 'banana'}
{'amount': '100', 'unit': 'g', 'preposition': 'of', 'name': 'sugar'}
{'amount': '1', 'unit': 'cup', 'preposition': 'of', 'name': 'flour'}
So you can do something like this:
if m.groupdict()['name'] == 'sugar':
…
amount = int(m.groupdict()['amount'])
unit = m.groupdict()['unit']
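One caveat with joining the units directly: the pattern above would read the leading 'g' of a word such as 'grapes' as a unit. A minimal sketch of one way around that, assuming the units are plain words: escape each unit with re.escape and require a word boundary after it. The names unit_alternation, pattern and parse are illustrative and not part of the answer above.

```python
import re

units = ['g', 'cup', 'kg', 'L']

# re.escape guards against units that contain regex metacharacters;
# longest-first ordering is a common precaution for alternations.
unit_alternation = '|'.join(
    re.escape(u) for u in sorted(units, key=len, reverse=True)
)

pattern = re.compile(
    r'(?P<amount>\d{1,3})\s*'
    + r'(?:(?P<unit>' + unit_alternation + r')\b\s*)?'  # unit must end at a word boundary
    + r'(?:of\b\s*)?'                                   # optional preposition
    + r'(?P<name>.*)'
)

def parse(text):
    """Return (amount, unit, name); unit is None when no known unit is present."""
    m = pattern.match(text)
    if m is None:
        return None
    return m.group('amount'), m.group('unit'), m.group('name')

for text in ['1 banana', '100 g of sugar', '1 cup of flour', '2 grapes']:
    print(parse(text))
```

Here the unit group is optional as a whole, so a missing unit shows up as None rather than as an empty string.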

I think you can use this:
"(.*?) (\w*)$"
And get \1 for first part and \2 for second part.
And for a better regex:
"^((?=.*of)((.*of)(.*)))|((?!.*of)(\d+)(.*))$"
And get \3 and \6 for first part and \4 and \7 for second part.
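As a rough check of the first pattern, the sketch below runs it with re.match; the helper name split_last_word is made up for illustration. The lazy (.*?) grows only until a single space and one final word remain, so everything before the last word lands in group 1.

```python
import re

# First pattern from above: lazy prefix, one space, then the last word.
LAST_WORD = re.compile(r'(.*?) (\w*)$')

def split_last_word(text):
    """Return (everything before the last word, last word), or None if no match."""
    m = LAST_WORD.match(text)
    return m.groups() if m else None

for text in ['1 banana', '100 g of sugar', '1 cup of flour']:
    print(split_last_word(text))
```

Note that this treats only the final word as the food, so a two-word ingredient would be split in the wrong place.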

You can try this code:
import re
lst = ['1 banana', '100 g of sugar', '1 cup of flour']
quantities = ['g', 'cup', 'kg', 'L']
altern = '|'.join(quantities)
r = r'(\d{1,3})\s*((?:%s)?s?(?:\s*\bof\b)?\s*\S+)'%(altern)
for x in lst:
    print(re.findall(r, x))
Output:
[('1', 'banana')]
[('100', 'g of sugar')]
[('1', 'cup of flour')]

Why do you want to do this with regular expressions? You can use Python's string splitting functions instead:
def qsplit(a):
    """Return a [quantity, ingredient] pair, or None for an empty string."""
    if not a:
        return None
    if a[0] not in "0123456789":
        return ["0", a]
    if " of " in a:
        return a.split(" of ", 1)
    return a.split(None, 1)
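A variant of the same idea using str.partition, sketched here under the assumption that a quantity always starts with a digit. The name split_quantity is hypothetical; unlike qsplit it always returns a tuple and uses an empty string when no quantity is found.

```python
def split_quantity(text):
    """Return (quantity, ingredient); quantity is '' without a leading digit."""
    text = text.strip()
    if not text or not text[0].isdigit():
        return ('', text)
    # Prefer splitting on ' of ' so the unit stays with the quantity.
    head, sep, tail = text.partition(' of ')
    if sep:
        return (head, tail)
    # Otherwise split on the first space only.
    head, _, tail = text.partition(' ')
    return (head, tail)

for text in ['1 banana', '100 g of sugar', '1 cup of flour', 'banana']:
    print(split_quantity(text))
```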

Related

From a list of string, get a list of dictionaries

I have a list of strings containing the ingredients of Spanish recipes and their quantities, and I would like to get a list of dictionaries, splitting each entry into ingredient, unit and quantity.
This is the list:
ingredients=[
'50',
'ccs',
'aceite',
'1',
'hoja',
'laurel',
'\n',
'1',
'cabeza',
'ajos',
'1',
'vaso',
'vino',
'1,5',
'kilos',
'conejo',
'\n',
...]
I would like to get a dict like this:
my_dic=[
{"name":"aceite" ,"qt":50 ,"unit": "ccs"},
{"name":"laurel" ,"qt":1 ,"unit": "hoja"},
{"name":"ajos" ,"qt":1 ,"unit": "cabeza"},
{"name":"vino" ,"qt":1 ,"unit": "vaso"},
{"name":"conejo" ,"qt":1,5 ,"unit": "kilos"},
...]
I have been trying things but it was all a disaster.
Any ideas?
Thanks in advance!!
So first, you want to remove the newlines from your original list:
ingredients = [i for i in ingredients if i != '\n']
Then, each ingredient name is every third element in the ingredients list, starting from the third element. Likewise for the unit and the quantity, starting from the second and first elements, respectively:
names = ingredients[2::3]
units = ingredients[1::3]
qts = ingredients[::3]
Then, iterate through these lists and construct the data structure you specified (which is not actually a dict but a list of dicts):
my_list = []
for i in range(len(names)):
    my_dict = {"name": names[i], "qt": qts[i], "unit": units[i]}
    my_list.append(my_dict)
There are a lot of ways to compress all of the above, but I have written it for comprehensibility.
This doesn't produce a dictionary, but it does give you the output that you specify in the question:
# Strip out the \n values (can possibly do this with a .strip() in the input stage)
ingredients = [value for value in ingredients if value != '\n']
labels = ['qt', 'unit', 'name']
my_dic = [dict(zip(labels, ingredients[i:i+3])) for i in range(0, len(ingredients), 3)]
my_dic contains:
[{'qt': '50', 'unit': 'ccs', 'name': 'aceite'},
{'qt': '1', 'unit': 'hoja', 'name': 'laurel'},
{'qt': '1', 'unit': 'cabeza', 'name': 'ajos'},
{'qt': '1', 'unit': 'vaso', 'name': 'vino'},
{'qt': '1,5', 'unit': 'kilos', 'name': 'conejo'}]
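You can clean your list with filter to remove the '\n' entries and then zip() the result with itself to group the items in threes. This works because, in Python 3, filter returns a single iterator, so zip(l, l, l) pulls three consecutive items for each tuple (in Python 2, where filter returns a list, it would not). This makes a quick two-liner:
l = filter(lambda w: w != '\n', ingredients)
result = [{'name': name, 'qt': qt, 'unit': unit}
          for qt, unit, name in zip(l, l, l)]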
result:
[{'name': 'aceite', 'qt': '50', 'unit': 'ccs'},
{'name': 'laurel', 'qt': '1', 'unit': 'hoja'},
{'name': 'ajos', 'qt': '1', 'unit': 'cabeza'},
{'name': 'vino', 'qt': '1', 'unit': 'vaso'},
{'name': 'conejo', 'qt': '1,5', 'unit': 'kilos'}]
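The zip(l, l, l) call above depends on l being one shared iterator rather than a plain list. A small sketch that makes this explicit with iter(); the function name in_threes is only illustrative:

```python
def in_threes(items):
    """Group consecutive items into 3-tuples, dropping an incomplete tail."""
    it = iter(items)  # one shared iterator; each zip argument advances it
    return list(zip(it, it, it))

flat = ['50', 'ccs', 'aceite', '1', 'hoja', 'laurel']
print(in_threes(flat))
```

Because zip stops at the shortest input, a trailing group with fewer than three items is silently discarded.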
How about:
ingredients = list(filter(lambda a: a != '\n', ingredients))
ing_organized = []
for i in range(0, len(ingredients), 3):
    curr_dict = {"name": ingredients[i+2], "qt": ingredients[i], "unit": ingredients[i+1]}
    ing_organized.append(curr_dict)
I just removed '\n' elements from the list as they didn't seem to have meaning.
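The answers above leave qt as a string, while the question shows numbers. A hedged sketch of a conversion step, assuming quantities are written either as whole numbers or with a decimal comma as in '1,5'; parse_qt is a hypothetical helper, not part of any answer:

```python
def parse_qt(text):
    """Convert '50' to 50 and '1,5' to 1.5 (a decimal comma is assumed)."""
    value = float(text.replace(',', '.'))
    return int(value) if value.is_integer() else value

# Shortened sample in the same shape as the list in the question.
ingredients = ['50', 'ccs', 'aceite', '\n', '1,5', 'kilos', 'conejo', '\n']
cleaned = [item for item in ingredients if item != '\n']

result = [
    {'name': cleaned[i + 2], 'qt': parse_qt(cleaned[i]), 'unit': cleaned[i + 1]}
    for i in range(0, len(cleaned), 3)
]
print(result)
```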

Split individual strings in a list Python

How do I split individual strings in a list?
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
Desired result:
print(data)
('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
One approach, using join and split:
items = ' '.join(data)
terms = items.split(' ')
print(terms)
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']
The idea here is to generate a single string containing all space-separated terms. Then, all we need is a single call to the non regex version of split to get the output.
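Note that split(' ') splits on every single space, so repeated spaces produce empty strings, whereas split() with no argument splits on any run of whitespace. A small illustration with made-up input:

```python
text = 'Gregory  Johnson'  # two spaces between the words

by_single_space = text.split(' ')  # keeps an empty string between the spaces
by_whitespace = text.split()       # collapses the run of spaces

print(by_single_space)
print(by_whitespace)
```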
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
data = [i.split(' ') for i in data]
data=sum(data, [])
print(tuple(data))
#('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
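A nested list comprehension gives the same flattening without sum(data, []), which builds a new list at every step. A sketch using the same data:

```python
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')

# Outer loop walks the phrases, inner loop walks the words of each phrase.
words = [word for phrase in data for word in phrase.split()]

print(tuple(words))
```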
You can use itertools.chain for that like:
Code:
it.chain.from_iterable(i.split() for i in data)
Test Code:
import itertools as it
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
print(list(it.chain.from_iterable(i.split() for i in data)))
Results:
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']

Splitting a string after certain characters?

I will be given a string, and I need to split it every time that it has an "|", "/", "." or "_"
How can I do this fast? I know how to use the command split, but is there any way to give more than 1 split condition to it? For example, if the input given was
Hello test|multiple|36.strings/just36/testing
I want the output to give:
"['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']"
Use a regex and the re module:
>>> import re
>>> s='You/can_split|multiple'
>>> re.split(r'[/_|.]', s)
['You', 'can', 'split', 'multiple']
In this case, [/_|.] will split on any of those characters.
Or, you can use a list comprehension to insert a single (perhaps multiple character) delimiter and then split on that:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s]).split('-><-')
['You', 'can', 'split', 'multiple']
With the added example:
>>> s2="Hello test|multiple|36.strings/just36/testing"
Method 1:
>>> re.split(r'[/_|.]', s2)
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
Method 2:
>>> ''.join(['-><-' if c in '/_|.' else c for c in s2]).split('-><-')
['Hello test', 'multiple', '36', 'strings', 'just36', 'testing']
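If the set of delimiters is not fixed in advance, the character class can be built from a string of delimiters. This is a sketch only; split_on_any is an illustrative name, and re.escape is used so that characters such as '|' or '.' are taken literally.

```python
import re

def split_on_any(text, delimiters):
    """Split text on any single character contained in delimiters."""
    char_class = '[' + ''.join(re.escape(d) for d in delimiters) + ']'
    return re.split(char_class, text)

s2 = "Hello test|multiple|36.strings/just36/testing"
print(split_on_any(s2, '|/._'))
```

Two adjacent delimiters produce an empty string in the result, which may or may not be what you want.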
Use groupby:
from itertools import groupby
s = 'You/can_split|multiple'
separators = set('/_|.')
result = [''.join(group) for k, group in groupby(s, key=lambda x: x not in separators) if k]
print(result)
Output
['You', 'can', 'split', 'multiple']

re.search with strings: difference between use cases of re.search() with strings

I have the following code:
import re
l=['fang', 'yi', 'ke', 'da', 'xue', 'xue', 'bao', '=', 'journal', 'of', 'southern', 'medical', 'university', '2015/feb']
t=[l[13]]
t2=['2015/Feb']
wl1=['2015/Feb']
for i in t:
    print(type(i))
    print(type(wl1[0]))
    r=re.search(r'^%s$' %i, wl1[0])
    if r:
        print('yes')
for i in t2:
    print(type(i))
    print(type(wl1[0]))
    r2=re.search(r'^%s$' %i, wl1[0])
    if r2:
        print('yes')
Could anyone explain to me why the two strings do not match in the first loop? In the second loop they do.
Your input value is lowercase:
>>> l=['fang', 'yi', 'ke', 'da', 'xue', 'xue', 'bao', '=', 'journal', 'of', 'southern', 'medical', 'university', '2015/feb']
>>> t=[l[13]]
>>> t[0]
'2015/feb'
while you are trying to match against a value with the F uppercased:
>>> wl1=['2015/Feb']
>>> wl1[0]
'2015/Feb'
As such the regular expression ^2015/feb$ won't match, while in your second example you generated the expression ^2015/Feb$ instead.
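If the comparison is meant to ignore case, one option is the re.IGNORECASE flag; re.escape additionally keeps characters in the interpolated value from being read as regex syntax. A sketch, with matches_ignoring_case as an illustrative name:

```python
import re

def matches_ignoring_case(value, target):
    """True if target equals value when letter case is ignored."""
    return re.fullmatch(re.escape(value), target, flags=re.IGNORECASE) is not None

print(matches_ignoring_case('2015/feb', '2015/Feb'))
print(matches_ignoring_case('2015/feb', '2015/Mar'))
```

For a plain equality test like this one, comparing value.lower() with target.lower() would do the same job without a regex.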

Product code looks like abcd2343, how to split by letters and numbers?

I have a list of product codes in a text file, on each line is the product code that looks like:
abcd2343 abw34324 abc3243-23A
So it is letters followed by numbers and other characters.
I want to split on the first occurrence of a number.
import re
s='abcd2343 abw34324 abc3243-23A'
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A']
Or, if you want to split on the first occurrence of a digit:
re.findall(r'\d*\D+', s)
> ['abcd', '2343 abw', '34324 abc', '3243-', '23A']
\d+ matches 1-or-more digits.
\d*\D+ matches 0-or-more digits followed by 1-or-more non-digits.
\d+|\D+ matches 1-or-more digits or 1-or-more non-digits.
Consult the docs for more about Python's regex syntax.
re.split(pat, s) will split the string s using pat as the delimiter. If pat begins and ends with parentheses (so as to be a "capturing group"), then re.split will return the substrings matched by pat as well. For instance, compare:
re.split(r'\d+', s)
> ['abcd', ' abw', ' abc', '-', 'A'] # <-- just the non-matching parts
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A'] # <-- both the non-matching parts and the captured groups
In contrast, re.findall(pat, s) returns only the parts of s that match pat:
re.findall(r'\d+', s)
> ['2343', '34324', '3243', '23']
Thus, if s ends with a digit, you could avoid ending with an empty string by using re.findall(r'\d+|\D+', s) instead of re.split(r'(\d+)', s):
s='abcd2343 abw34324 abc3243-23A 123'
re.split(r'(\d+)', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123', '']
re.findall(r'\d+|\D+', s)
> ['abcd', '2343', ' abw', '34324', ' abc', '3243', '-', '23', 'A ', '123']
This function handles float and negative numbers as well.
def separate_number_chars(s):
    res = re.split(r'([-+]?\d+\.\d+)|([-+]?\d+)', s.strip())
    res_f = [r.strip() for r in res if r is not None and r.strip() != '']
    return res_f
For example:
utils.separate_number_chars('-12.1grams')
> ['-12.1', 'grams']
import re
m = re.match(r"(?P<letters>[a-zA-Z]+)(?P<the_rest>.+)$",input)
m.group('letters')
m.group('the_rest')
This covers your corner case of abc3243-23A and will output abc for the letters group and 3243-23A for the_rest
Since you said they are all on individual lines you'll obviously need to put a line at a time in input
def firstIntIndex(string):
    result = -1
    for k in range(0, len(string)):
        if (bool(re.match(r'\d', string[k]))):
            result = k
            break
    return result
To partition on the first digit:
parts = re.split(r'(\d.*)', 'abcd2343') # => ['abcd', '2343', '']
parts = re.split(r'(\d.*)', 'abc3243-23A') # => ['abc', '3243-23A', '']
So the two parts are always parts[0] and parts[1].
Of course, you can apply this to multiple codes:
>>> s = "abcd2343 abw34324 abc3243-23A"
>>> results = [re.split('(\d.*)', pcode) for pcode in s.split(' ')]
>>> results
[['abcd', '2343', ''], ['abw', '34324', ''], ['abc', '3243-23A', '']]
If each code is in an individual line then instead of s.split( ) use s.splitlines().
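The trailing empty string from re.split can be avoided by locating the first digit with re.search and slicing. This is a sketch under the assumption that each code is a single token; split_at_first_digit is an illustrative name.

```python
import re

def split_at_first_digit(code):
    """Return (prefix, rest), where rest starts at the first digit in code."""
    m = re.search(r'\d', code)
    if m is None:
        return code, ''  # no digit at all: everything is prefix
    return code[:m.start()], code[m.start():]

s = "abcd2343 abw34324 abc3243-23A"
print([split_at_first_digit(pcode) for pcode in s.split()])
```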
Try this code, which splits once at the first space that is followed by a digit:
import re
text = "MARIA APARECIDA 99223-2000 / 98450-8026"
parts = re.split(r' (?=\d)',text, 1)
print(parts)
Output:
['MARIA APARECIDA', '99223-2000 / 98450-8026']
