n-grams from text in python

An update to my previous post, with some changes:
Say that I have 100 tweets.
In those tweets, I need to extract: 1) food names, and 2) beverage names. I also need to attach type (drink or food) and an id-number (each item has a unique id) for each extraction.
I already have a lexicon with names, type and id-number:
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}
Tweet example:
After various processing of "tweet_1" I have these sentences:
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']
My requested output (it can be a type other than a list):
["tweet_id_1",
    [[["dr pepper"], ["drink", "d_123"]],
     [["coca cola"], ["drink", "d_234"]],
     [["banana split"], ["food", "f_567"]],
     [["ice cream"], ["food", "f_789"]]],
 "tweet_id_1",
    [[["coca cola"], ["drink", "d_234"]],
     [["banana"], ["food", "f_456"]]]]
It's important that the output should NOT extract unigrams that are part of n-grams (n>1), i.e. it should NOT look like this:
["tweet_id_1",
    [[["dr pepper"], ["drink", "d_123"]],
     [["coca cola"], ["drink", "d_234"]],
     [["cola"], ["drink", "d_345"]],
     [["banana split"], ["food", "f_567"]],
     [["banana"], ["food", "f_456"]],
     [["ice cream"], ["food", "f_789"]],
     [["cream"], ["food", "f_678"]]],
 "tweet_id_1",
    [[["coca cola"], ["drink", "d_234"]],
     [["cola"], ["drink", "d_345"]],
     [["banana"], ["food", "f_456"]]]]
Ideally, I would like to be able to run my sentences through various nltk filters like lemmatize() and pos_tag() BEFORE the extraction, to get an output like the following. But with the regexp solution, if I do that, all the words are first split into unigrams, or the string "coca cola" generates one unigram and one bigram, which would produce the unwanted output shown above.
The ideal output (again, the exact type of the output is not important):
["tweet_id_1",
    [[[("dr pepper", "NN")], ["drink", "d_123"]],
     [[("coca cola", "NN")], ["drink", "d_234"]],
     [[("banana split", "NN")], ["food", "f_567"]],
     [[("ice cream", "NN")], ["food", "f_789"]]],
 "tweet_id_1",
    [[[("coca cola", "NN")], ["drink", "d_234"]],
     [[("banana", "NN")], ["food", "f_456"]]]]

This may not be the most efficient solution, but it will definitely get you started:
sentences = [
    'dr pepper is better than coca cola and suits banana split with ice cream',
    'coca cola and banana is not a good combo']
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
# sort by word count, longest phrases first, so bigger chunks are matched first
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

chunks = []
for sentence in sentences:
    for lex in lexicon_list:
        if lex in sentence:
            chunks.append({lex: list(lexicon[lex].values())})
            # blank out the matched phrase so its unigrams can't match again
            sentence = sentence.replace(lex, '')
print(chunks)
Output
[{'dr pepper': ['drink', 'd_123']}, {'coca cola': ['drink', 'd_234']}, {'banana split': ['food', 'f_567']}, {'ice cream': ['food', 'f_789']}, {'coca cola': ['drink', 'd_234']}, {'banana': ['food', 'f_456']}]
Explanation
lexicon_list = list(lexicon.keys()) takes the phrases that need to be searched, and the sort() call orders them by word count, longest first, so that bigger chunks like "banana split" are found before their constituent unigrams like "banana". Each match is then blanked out of the sentence, which prevents its unigrams from matching afterwards.
The output is a list of dicts, where each dict has list values.
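One caveat worth flagging (my addition, not part of the answer above): the plain in check matches raw substrings, so 'cola' would also fire inside an unrelated word such as 'colander'. A minimal sketch using word-boundary regexes instead, keeping the same longest-first idea:
import re

# longest phrases first, same ordering as above
lexicon_list = sorted(lexicon, key=lambda s: len(s.split()), reverse=True)

chunks = []
for sentence in sentences:
    for lex in lexicon_list:
        # \b anchors restrict matches to whole words/phrases
        if re.search(r'\b' + re.escape(lex) + r'\b', sentence):
            chunks.append({lex: list(lexicon[lex].values())})
            # substitute a space so surrounding word boundaries survive
            sentence = re.sub(r'\b' + re.escape(lex) + r'\b', ' ', sentence)
print(chunks)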

Unfortunately I cannot comment due to my low reputation, but Vivek's answer can be improved through 1) a regex, 2) including pos_tag tokens such as NN, and 3) a dictionary structure in which you can select a tweet's results by tweet id:
import re
import nltk
from collections import OrderedDict

tweets = {"tweet_1": ['dr pepper is better than coca cola and suits banana split with ice cream',
                      'coca cola and banana is not a good combo']}
lexicon = {
    'dr pepper': {'type': 'drink', 'id': 'd_123'},
    'coca cola': {'type': 'drink', 'id': 'd_234'},
    'cola': {'type': 'drink', 'id': 'd_345'},
    'banana': {'type': 'food', 'id': 'f_456'},
    'banana split': {'type': 'food', 'id': 'f_567'},
    'cream': {'type': 'food', 'id': 'f_678'},
    'ice cream': {'type': 'food', 'id': 'f_789'}}

lexicon_list = list(lexicon.keys())
lexicon_list.sort(key=lambda s: len(s.split()), reverse=True)

# a compiled regex will be much faster than the "in" operator
# (this assumes the lexicon phrases contain no regex metacharacters;
# otherwise wrap each one in re.escape())
pattern = "(" + "|".join(lexicon_list) + ")"
pattern = re.compile(pattern)

# here we make the dictionary of our phrases and their tagged equivalents
lexicon_pos_tag = {word: nltk.pos_tag(nltk.word_tokenize(word)) for word in lexicon_list}
# if you train a model that recognizes e.g. "banana split" as ("banana split", "NN")
# rather than as ("banana", "NN") and ("split", "NN"), you could use the following:
# lexicon_pos_tag = {word: nltk.pos_tag([word]) for word in lexicon_list}

# chunks will register the tweets as the keys
chunks = OrderedDict()
for tweet in tweets:
    chunks[tweet] = []
    for sentence in tweets[tweet]:
        temp = OrderedDict()
        for word in pattern.findall(sentence):
            temp[word] = [lexicon_pos_tag[word], [lexicon[word]["type"], lexicon[word]["id"]]]
        chunks[tweet].append(temp)
Finally, the output is:
OrderedDict([('tweet_1',
[OrderedDict([('dr pepper',
[[('dr', 'NN'), ('pepper', 'NN')],
['drink', 'd_123']]),
('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana split',
[[('banana', 'NN'), ('split', 'NN')],
['food', 'f_567']]),
('ice cream',
[[('ice', 'NN'), ('cream', 'NN')],
['food', 'f_789']])]),
OrderedDict([('coca cola',
[[('coca', 'NN'), ('cola', 'NN')],
['drink', 'd_234']]),
('banana',
[[('banana', 'NN')], ['food', 'f_456']])])])])
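As a side note (my own suggestion, not part of this answer): if the goal is the asker's ideal output, where a whole phrase gets a single tag such as ("banana split", "NN"), nltk's MWETokenizer can merge known multi-word expressions back into single tokens after word tokenization. A minimal sketch, assuming the usual nltk data (punkt, averaged_perceptron_tagger) is available:
import nltk
from nltk.tokenize import MWETokenizer

# build the tokenizer from the multi-word lexicon entries, re-joining with a space
mwe_tokenizer = MWETokenizer(
    [tuple(phrase.split()) for phrase in lexicon if ' ' in phrase],
    separator=' ')

sentence = 'dr pepper is better than coca cola and suits banana split with ice cream'
tokens = mwe_tokenizer.tokenize(nltk.word_tokenize(sentence))
# 'dr pepper', 'coca cola', 'banana split' and 'ice cream' are now single tokens,
# so pos_tag sees each phrase as one unit
print(nltk.pos_tag(tokens))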

I would use a for loop to filter.
Use if statements to find the string in the keys. If you wish to include unigrams, delete the condition
len(k.split()) > 1
If you wish to include only unigrams, change it to:
len(k.split()) == 1
filtered_list = ['tweet_id_1']
for k, v in lexicon.items():
    for s in sentences:
        if k in s and len(k.split()) > 1:
            filtered_list.extend((k, v))
print(filtered_list)
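With the sample sentences and lexicon from the question (and Python 3.7+ insertion-ordered dicts), this prints roughly:
['tweet_id_1', 'dr pepper', {'type': 'drink', 'id': 'd_123'}, 'coca cola', {'type': 'drink', 'id': 'd_234'}, 'coca cola', {'type': 'drink', 'id': 'd_234'}, 'banana split', {'type': 'food', 'id': 'f_567'}, 'ice cream', {'type': 'food', 'id': 'f_789'}]
Note that 'coca cola' appears twice because it occurs in both sentences.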

Related

Remove dictionaries from a list of dictionaries under certain conditions

I have a very large list of dictionaries that looks like this (I show a simplified version):
list_of_dicts:
[{'ID': 1234,
  'Name': 'Bobby',
  'Animal': 'Dog',
  'About': [{'ID': 5678, 'Food': 'Dog Food'}]},
 {'ID': 5678, 'Food': 'Dog Food'},
 {'ID': 91011,
  'Name': 'Jack',
  'Animal': 'Bird',
  'About': [{'ID': 1996, 'Food': 'Seeds'}]},
 {'ID': 1996, 'Food': 'Seeds'},
 {'ID': 2007,
  'Name': 'Bean',
  'Animal': 'Cat',
  'About': [{'ID': 2008, 'Food': 'Fish'}]},
 {'ID': 2008, 'Food': 'Fish'}]
I'd like to remove the dictionaries whose IDs are equal to the IDs nested in the 'About' entries. For example, 'ID' 2008 is already nested in an 'About' value, therefore I'd like to remove that dictionary.
I have some code that can do this, and for this specific example it works. However, the amount of data that I have is much larger, and the remove() function does not seem to remove all the entries unless I run it a couple of times.
Any suggestions on how I can do this better?
My code:
nested_ids = [5678, 1996, 2008]
for i in list_of_dicts:
    if i['ID'] in nested_ids:
        list_of_dicts.remove(i)
Desired output:
[{'ID': 1234,
  'Name': 'Bobby',
  'Animal': 'Dog',
  'About': [{'ID': 5678, 'Food': 'Dog Food'}]},
 {'ID': 91011,
  'Name': 'Jack',
  'Animal': 'Bird',
  'About': [{'ID': 1996, 'Food': 'Seeds'}]},
 {'ID': 2007,
  'Name': 'Bean',
  'Animal': 'Cat',
  'About': [{'ID': 2008, 'Food': 'Fish'}]}]
You can use a list comprehension:
cleaned_list = [d for d in list_of_dicts if d['ID'] not in nested_ids]
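If the nested IDs aren't known up front (the question hardcodes them), they can be collected first. A small sketch, assuming every nested entry lives under an 'About' key as in the sample data:
nested_ids = {about['ID'] for d in list_of_dicts for about in d.get('About', [])}
cleaned_list = [d for d in list_of_dicts if d['ID'] not in nested_ids]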
It is happening because we are modifying the list while iterating over it. To avoid that, we can copy the required values to a new list as follows:
filtered_dicts = []
nested_ids = [5678, 1996, 2008]
for curr in list_of_dicts:
    if curr['ID'] not in nested_ids:
        filtered_dicts.append(curr)
The problem is that when you remove a member of a list you change the indexes of everything after it, so items get skipped. You can avoid this by iterating over the list in reverse order:
for i in list_of_dicts[::-1]:
    if i['ID'] in nested_ids:
        list_of_dicts.remove(i)

What is the most efficient way to create nested dictionaries in Python?

I currently have over 10k elements in my list of dictionaries, which looks like:
cars = [{'model': 'Ford', 'year': 2010},
        {'model': 'BMW', 'year': 2019},
        ...]
And I have a second list of dictionaries:
car_owners = [{'model': 'BMW', 'name': 'Sam', 'age': 34},
              {'model': 'BMW', 'name': 'Taylor', 'age': 34},
              .....]
However, I want to join the two together into something like:
combined = [{'model': 'BMW',
             'year': 2019,
             'owners': [{'name': 'Sam', 'age': 34}, ...]
             }]
What is the best way to combine them? For the moment I am using a for loop, but I feel like there are more efficient ways of dealing with this.
** This is just a fake example of the data; the one I have is a lot more complex, but it helps give the idea of what I want to achieve.
Iterate over the first list, creating a dict keyed by model; then, for each entry in the second list, look for the same key (model) and update the first dict if it is found:
cars = [{'model': 'Ford', 'year': 2010}, {'model': 'BMW', 'year': 2019}]
car_owners = [{'model': 'BMW', 'name': 'Sam', 'age': 34}, {'model': 'Ford', 'name': 'Taylor', 'age': 34}]

dd = {x['model']: x for x in cars}
for item in car_owners:
    key = item['model']
    if key in dd:
        del item['model']
        dd[key].update({'car_owners': item})
    else:
        dd[key] = item
print(list(dd.values()))
OUTPUT:
[{'model': 'BMW', 'year': 2019, 'car_owners': {'name': 'Sam', 'age': 34}},
 {'model': 'Ford', 'year': 2010, 'car_owners': {'name': 'Taylor', 'age': 34}}]
Really, what you want performance-wise is to have dictionaries with the model as the key. That way, you have O(1) lookup and can quickly get the requested element (instead of looping each time in order to find the car with model x).
If you're starting off with lists, I'd first create dictionaries, and then everything is O(1) from there on out.
models_to_cars = {car['model']: car for car in cars}

models_to_owners = {}
for car_owner in car_owners:
    models_to_owners.setdefault(car_owner['model'], []).append(car_owner)

combined = [{
    **car,
    'owners': models_to_owners.get(model, [])
} for model, car in models_to_cars.items()]
Then you'd have
combined = [{'model': 'BMW',
             'year': 2019,
             'owners': [{'name': 'Sam', 'age': 34}, ...]
             }]
as you wanted

Split individual strings in a list Python

How do I split individual strings in a list?
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
Return:
print(data)
('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
One approach, using join and split:
items = ' '.join(data)
terms = items.split(' ')
print(terms)
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']
The idea here is to generate a single string containing all space-separated terms. Then, all we need is a single call to the non-regex version of split to get the output.
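Since data is a tuple and the expected result is shown as a tuple, the same idea collapses into one line:
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
print(tuple(' '.join(data).split()))
# ('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')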
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
data = [i.split(' ') for i in data]
data = sum(data, [])  # flatten the list of lists
print(tuple(data))
# ('Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill')
You can use itertools.chain for that, like:
Code:
it.chain.from_iterable(i.split() for i in data)
Test Code:
import itertools as it
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
print(list(it.chain.from_iterable(i.split() for i in data)))
Results:
['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']
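For completeness (my addition): a single nested comprehension gives the same result without the intermediate joined string or the quadratic sum() trick:
data = ('Esperanza Ice Cream', 'Gregory Johnson', 'Brandies bar and grill')
print([word for item in data for word in item.split()])
# ['Esperanza', 'Ice', 'Cream', 'Gregory', 'Johnson', 'Brandies', 'bar', 'and', 'grill']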

Creating a list of dictionaries from separate lists

I honestly expected this to have been asked previously, but after 30 minutes of searching I haven't had any luck.
Say we have multiple lists, each of the same length, each one containing a different type of data about something. We would like to turn this into a list of dictionaries with the data type as the key.
input:
data = [['tom', 'jim', 'mark'], ['Toronto', 'New York', 'Paris'], [1990, 2000, 2000]]
data_types = ['name', 'place', 'year']
output:
travels = [{'name': 'tom', 'place': 'Toronto', 'year': 1990},
           {'name': 'jim', 'place': 'New York', 'year': 2000},
           {'name': 'mark', 'place': 'Paris', 'year': 2000}]
This is fairly easy to do with index-based iteration:
travels = []
for d_index in range(len(data[0])):
    travel = {}
    for dt_index in range(len(data_types)):
        travel[data_types[dt_index]] = data[dt_index][d_index]
    travels.append(travel)
But this is 2017! There has to be a more concise way to do this! We have map, flatmap, reduce, list comprehensions, numpy, lodash, zip. Except I can't seem to compose these cleanly into this particular transformation. Any ideas?
You can use a list comprehension with zip after transposing your dataset:
>>> [dict(zip(data_types, x)) for x in zip(*data)]
[{'place': 'Toronto', 'name': 'tom', 'year': 1990},
{'place': 'New York', 'name': 'jim', 'year': 2000},
{'place': 'Paris', 'name': 'mark', 'year': 2000}]
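To see why this works (a brief illustration, not part of the original answer): zip(*data) transposes the data so that each resulting tuple holds one person's values, and dict(zip(data_types, x)) then pairs each value with its key:
data = [['tom', 'jim', 'mark'], ['Toronto', 'New York', 'Paris'], [1990, 2000, 2000]]
data_types = ['name', 'place', 'year']

# zip(*data) pairs up the i-th element of every sub-list
print(list(zip(*data)))
# [('tom', 'Toronto', 1990), ('jim', 'New York', 2000), ('mark', 'Paris', 2000)]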

Extracting duplicates from an array of dictionaries

Hi, I have an array of dicts that looks like this:
books = [
    {'Serial Number': '3333', 'size': '500', 'Book': 'The Hobbit'},
    {'Serial Number': '2222', 'size': '100', 'Book': 'Lord of the Rings'},
    {'Serial Number': '1111', 'size': '200', 'Book': '39 Steps'},
    {'Serial Number': '3333', 'size': '600', 'Book': '100 Dalmations'},
    {'Serial Number': '2222', 'size': '800', 'Book': 'Woman in Black'},
    {'Serial Number': '6666', 'size': '1000', 'Book': 'The Hunt for Red October'},
]
I need to create a separate array of dicts that looks like this based on duplicate serial numbers:
duplicates = [
    '3333', [{'Book': 'The Hobbit'}, {'Book': '100 Dalmations'}],
    '2222', [{'Book': 'Lord of the Rings'}, {'Book': 'Woman in Black'}]
]
Is there an easy way to do this using a built-in function? If not, what's the best way to achieve this?
The most Pythonic way I can think of:
from collections import defaultdict

res = defaultdict(list)
for d in books:
    res[d.pop('Serial Number')].append(d)  # note: pop() removes the key from the dicts in books

print({k: v for k, v in res.items() if len(v) > 1})
Output:
{'2222': [{'Book': 'Lord of the Rings', 'size': '100'},
{'Book': 'Woman in Black', 'size': '800'}],
'3333': [{'Book': 'The Hobbit', 'size': '500'},
{'Book': '100 Dalmations', 'size': '600'}]}
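If mutating books is undesirable, or if you want exactly the {'Book': ...} shape sketched in the question, a small variation on the same defaultdict idea (my own sketch):
from collections import defaultdict

groups = defaultdict(list)
for d in books:
    groups[d['Serial Number']].append({'Book': d['Book']})  # leaves books untouched

duplicates = {sn: titles for sn, titles in groups.items() if len(titles) > 1}
print(duplicates)
# {'3333': [{'Book': 'The Hobbit'}, {'Book': '100 Dalmations'}],
#  '2222': [{'Book': 'Lord of the Rings'}, {'Book': 'Woman in Black'}]}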
