How do I split individual strings in a list?
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
Desired output:
print(data)
('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
One approach, using join and split:
items = ' '.join(data)
terms = items.split(' ')
print(terms)
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']
The idea here is to build a single string containing all of the space-separated terms. Then a single call to the non-regex version of split produces the output.
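Note that calling split() with no argument splits on any run of whitespace, so the join/split approach can be written a little more robustly as:
items = ' '.join(data)
terms = items.split()  # no separator argument: splits on any whitespace run
print(terms)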
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
data = [i.split(' ') for i in data]
data = sum(data, [])
print(tuple(data))
#('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
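For reference, sum(data, []) concatenates the sublists one by one starting from an empty list, which flattens the list of lists but is quadratic on long inputs; a nested list comprehension does the same flattening in a single pass:
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
flat = [word for item in data for word in item.split(' ')]
print(tuple(flat))
#('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')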
You can use itertools.chain for that, like this:
Code:
it.chain.from_iterable(i.split() for i in data)
Test Code:
import itertools as it
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
print(list(it.chain.from_iterable(i.split() for i in data)))
Results:
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']
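And since the desired result in the question is a tuple, the chained iterator can be passed straight to tuple() (continuing the test code above):
print(tuple(it.chain.from_iterable(i.split() for i in data)))
# ('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')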
Related
I want to filter only the elements that contain a single word and make a new array from them.
How could I do this in Python?
Array:
['somewhat', 'all', 'dictator', 'was called', 'was', 'main director', 'in']
NewArray should be:
['somewhat', 'all', 'dictator', 'was', 'in']
Try this:
a= ['somewhat', 'all', 'dictator', 'was called', 'was', 'main director', 'in']
print([i for i in a if " " not in i])
Output:
['somewhat', 'all', 'dictator', 'was', 'in']
Filter the list with a list comprehension:
old_list = ['somewhat', 'all', 'dictator', 'was called', 'was', 'main director', 'in']
new_list = [x for x in old_list if len(x.split()) == 1]
Returns:
['somewhat', 'all', 'dictator', 'was', 'in']
Using re.match and filter
import re
MATCH_SINGLE_WORD = re.compile(r"^\w+$")
inp = ['somewhat', 'all', 'dictator', 'was called', 'was', 'main director', 'in']
out = filter(MATCH_SINGLE_WORD.match, inp)
print(list(out))  # if you need to print; otherwise, out is a lazy iterator that can be traversed (once) later
This solution also handles entries whose words are separated by \n or \t, not just by a single space character.
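For example (a quick check of the pattern, not part of the original answer):
import re
MATCH_SINGLE_WORD = re.compile(r"^\w+$")
print(MATCH_SINGLE_WORD.match("was called"))   # None: space is not a word character
print(MATCH_SINGLE_WORD.match("was\tcalled"))  # None: neither is a tab
print(MATCH_SINGLE_WORD.match("dictator"))     # <re.Match object; span=(0, 8), match='dictator'>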
If you also want to handle leading and trailing whitespace:
import re
from operator import methodcaller
MATCH_SINGLE_WORD = re.compile(r"^\w+$")
inp = ['somewhat', 'all', 'dictator', 'was called', 'was', 'main director', 'in']
out = filter(MATCH_SINGLE_WORD.match, map(methodcaller("strip"), inp))
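As before, out is a lazy filter object, so it has to be consumed to see the results:
print(list(out))
# ['somewhat', 'all', 'dictator', 'was', 'in']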
An update to my previous post, with some changes:
Say that I have 100 tweets.
In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach type (drink or food) and an id-number (each item has a unique id) for each extraction.
I already have a lexicon with names, type and id-number:
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}
Tweet example:
After various processing of "tweet_1" I have these sentences:
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']
My requested output (it can be a type other than a list):
["tweet_id_1",
[[["dr pepper"], ["drink", "d_124"]],
[["coca cola"], ["drink", "d_234"]],
[["banana split"], ["food", "f_567"]],
[["ice cream"], ["food", "f_789"]]],
"tweet_id_1",,
[[["coca cola"], ["drink", "d_234"]],
[["banana"], ["food", "f_456"]]]]
It's important that the output should NOT extract unigrams that are also part of longer ngrams (n>1); that is, it should NOT look like this:
["tweet_id_1",
[[["dr pepper"], ["drink", "d_124"]],
[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana split"], ["food", "f_567"]],
[["banana"], ["food", "f_456"]],
[["ice cream"], ["food", "f_789"]],
[["cream"], ["food", "f_678"]]],
"tweet_id_1",
[[["coca cola"], ["drink", "d_234"]],
[["cola"], ["drink", "d_345"]],
[["banana"], ["food", "f_456"]]]]
Ideally, I would like to be able to run my sentences through various nltk filters like lemmatize() and pos_tag() BEFORE the extraction, to get an output like the following. But with this regexp solution, if I do that, all the words are split into unigrams, or it generates one unigram and one bigram from the string "coca cola", which would produce the output that I did not want (as in the example above).
The ideal output (again the type of the output is not important):
["tweet_id_1",
[[[("dr pepper", "NN")], ["drink", "d_124"]],
[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana split", "NN")], ["food", "f_567"]],
[[("ice cream", "NN")], ["food", "f_789"]]],
"tweet_id_1",
[[[("coca cola", "NN")], ["drink", "d_234"]],
[[("banana", "NN")], ["food", "f_456"]]]]
This may not be the most efficient solution, but it will definitely get you started:
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']

lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []
for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            sentence = sentence.replace(lex, '')
print(chunks)
Output
[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
Explanation
lexicon_list = list(lexicon.keys()) collects the phrases that need to be searched for, and the sort orders them by word count so that longer chunks are found (and removed) first.
The output is a list of dicts, where each dict has list values.
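To see why the sort matters, here is the sorted search list (assuming Python 3.7+, where dicts preserve insertion order): every multi-word phrase precedes every single word, so 'coca cola' is matched and removed from the sentence before 'cola' is ever tried.
print(lexicon_list)
# ['dr pepper', 'coca cola', 'banana split', 'ice cream', 'cola', 'banana', 'cream']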
Unfortunately I cannot comment due to my low reputation, but Vivek's answer can be improved by 1) using a regex, 2) including pos_tag tokens such as NN, and 3) using a dictionary structure that lets you select a tweet's results by its id:
import re
import nltk
from collections import OrderedDict
tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream', 'coca cola and banana is not a good combo']}
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}
lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

# a compiled regex will be much faster than the "in" operator
pattern = "(" + "|".join(lexicon_list) + ")"
pattern = re.compile(pattern)

# here we build the dictionary of our phrases and their tagged equivalents
lexicon_pos_tag = {word: nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# if you train a model that recognizes e.g. "banana split" as ("banana split", "NN")
# rather than ("banana", "NN") and ("split", "NN"), you could use the following:
# lexicon_pos_tag = {word: nltk.pos_tag(word) for word in lexicon_list}

# chunks will register the tweets as the keys
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)
Finally, the output is:
OrderedDict([('tweet_1',
[OrderedDict([('dr pepper',
[[('dr', 'NN'), ('pepper', 'NN')],
['drink', 'd_123']]),
('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana split',
[[('banana', 'NN'), ('split', 'NN')],
['food', 'f_567']]),
('ice cream',
[[('ice', 'NN'), ('cream', 'NN')],
['food', 'f_789']])]),
OrderedDict([('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana',
[[('banana', 'NN')], ['food', 'f_456']])])])])
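Since the result is keyed by tweet id, a particular match can be looked up directly, for example:
first_sentence = chunks["tweet_1"][0]
print(first_sentence["dr pepper"])
# [[('dr', 'NN'), ('pepper', 'NN')], ['drink', 'd_123']]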
I would use a for loop to filter.
Use an if statement to find the string among the keys. If you wish to include unigrams as well, delete
len(key.split()) > 1
If you wish to include only unigrams, change it to:
len(key.split()) == 1
filtered_list = ['tweet_id_1']
for k, v in lexicon.items():
    for s in sentences:
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))
print(filtered_list)
I am doing some operations on a file and converting its lines to a list:
l1 = [
    ['john', 'age', '23', 'has', 'scored', '90,'],
    ['sam', 'scored', '70,', 'and', 'age', 'is', '19,'],
    ['rick', 'failed', 'as', 'he', 'scored', '20,'],
]
As you can see, not all the lists/lines look alike, but one thing is always common: an integer that follows the word 'scored'.
Is there a way I can sort on the basis of the integer that follows the keyword 'scored'?
Earlier I had tried the following with similar lists and it worked, but it won't help with the list above:
sorted(l1, key=lambda l: int(l[4].rstrip(',')), reverse=True)
Here's one way:
>>> a_list = [['john', 'age', '23', 'has', 'scored', '90,'],
['sam', 'scored', '70,', 'and', 'age', 'is', '19,'],
['rick', 'failed', 'as', 'he', 'scored', '20,']]
>>> sorted(a_list, key = lambda l: int(l[l.index('scored') + 1].strip(",")), reverse=True)
[['john', 'age', '23', 'has', 'scored', '90,'],
['sam', 'scored', '70,', 'and', 'age', 'is', '19,'],
['rick', 'failed', 'as', 'he', 'scored', '20,']]
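If some lines might lack the 'scored' keyword or a numeric value after it, a slightly more defensive key function can be sketched (my addition, not part of the original answer):
def score(words):
    """Return the integer that follows 'scored', or -1 if it is missing or malformed."""
    try:
        return int(words[words.index('scored') + 1].strip(','))
    except (ValueError, IndexError):
        return -1  # lines without a usable score sort last

print(sorted(a_list, key=score, reverse=True))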
I'm using Python.
I have some strings:
'1 banana', '100 g of sugar', '1 cup of flour'
I need to distinguish the food from the quantity.
I have an array of quantity types:
quantities = ['g', 'cup', 'kg', 'L']
altern = '|'.join(quantities)
and so, using a regular expression, I would like to get, for example, 'flour' and '1 cup of' for '1 cup of flour', and '1' and 'banana' for '1 banana'.
I have written this regexp to match the quantity part of the strings above:
\d{1,3}\s<altern>?\s?(\bof\b)?
but I'm very unsure about this, particularly about how to introduce the altern variable into the regular expression.
I think your quantities are really units, so I took the liberty of fixing this misnomer. I propose using named groups to make the output easier to understand.
import re

units = ['g', 'cup', 'kg', 'L']
anyUnitRE = '|'.join(units)

inputs = ['1 banana', '100 g of sugar', '1 cup of flour']
for input in inputs:
    m = re.match(
        r'(?P<amount>\d{1,3})\s*'
        r'(?P<unit>(' + anyUnitRE + r')?)\s*'
        r'(?P<preposition>(of)?)\s*'
        r'(?P<name>.*)', input)
    print(m and m.groupdict())
The output will be something like this:
{'preposition': '', 'amount': '1', 'name': 'banana', 'unit': ''}
{'preposition': 'of', 'amount': '100', 'name': 'sugar', 'unit': 'g'}
{'preposition': 'of', 'amount': '1', 'name': 'flour', 'unit': 'cup'}
So you can do something like this:
if m.groupdict()['name'] == 'sugar':
    …

amount = int(m.groupdict()['amount'])
unit = m.groupdict()['unit']
I think you can use this:
"(.*?) (\w*)$"
And get \1 for first part and \2 for second part.
And for a better regex:
"^((?=.*of)((.*of)(.*)))|((?!.*of)(\d+)(.*))$"
And get \3 and \6 for first part and \4 and \7 for second part.
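A quick check of the first pattern (my test, using re.match against each whole string):
import re
for s in ['1 banana', '100 g of sugar', '1 cup of flour']:
    m = re.match(r'(.*?) (\w*)$', s)
    print(m.group(1), '|', m.group(2))
# 1 | banana
# 100 g of | sugar
# 1 cup of | flour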
You can try this code:
import re
lst = ['1 banana', '100 g of sugar', '1 cup of flour']
quantities = ['g', 'cup', 'kg', 'L']
altern = '|'.join(quantities)
r = r'(\d{1,3})\s*((?:%s)?s?(?:\s*\bof\b)?\s*\S+)'%(altern)
for x in lst:
    print(re.findall(r, x))
Output:
[('1', 'banana')]
[('100', 'g of sugar')]
[('1', 'cup of flour')]
Why do you want to do this with regular expressions? You can use Python's string splitting functions instead:
def qsplit(a):
    """Return the quantity and the ingredient as a pair."""
    if not a:
        return None
    if a[0] not in "0123456789":
        return ["0", a]
    if " of " in a:
        return a.split(" of ", 1)
    return a.split(None, 1)
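For the three example strings this gives:
for s in ['1 banana', '100 g of sugar', '1 cup of flour']:
    print(qsplit(s))
# ['1', 'banana']
# ['100 g', 'sugar']
# ['1 cup', 'flour']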
I have the following code:
import re

l = ['fang', 'yi', 'ke', 'da', 'xue', 'xue', 'bao', '=', 'journal', 'of', 'southern', 'medical', 'university', '2015/feb']
t = [l[13]]
t2 = ['2015/Feb']
wl1 = ['2015/Feb']

for i in t:
    print(type(i))
    print(type(wl1[0]))
    r = re.search(r'^%s$' % i, wl1[0])
    if r:
        print('yes')

for i in t2:
    print(type(i))
    print(type(wl1[0]))
    r2 = re.search(r'^%s$' % i, wl1[0])
    if r2:
        print('yes')
Could anyone explain to me why the two strings do not match in the first loop, but do in the second?
Your input value is lowercase:
>>> l=['fang', 'yi', 'ke', 'da', 'xue', 'xue', 'bao', '=', 'journal', 'of', 'southern', 'medical', 'university', '2015/feb']
>>> t=[l[13]]
>>> t[0]
'2015/feb'
while you are trying to match against a value with the F uppercased:
>>> wl1=['2015/Feb']
>>> wl1[0]
'2015/Feb'
As such, the regular expression ^2015/feb$ won't match, while in your second example you generated the expression ^2015/Feb$ instead.
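If a case-insensitive match is what you actually want, passing re.IGNORECASE is one option (my suggestion, beyond the scope of the original question):
import re
print(re.search(r'^2015/feb$', '2015/Feb', re.IGNORECASE))
# <re.Match object; span=(0, 8), match='2015/Feb'>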