how to get combinations of substrings in python

I want to get combinations of substrings like an example below:
original string: “This is my pen”
expected output list: [“This is my pen”, “This is mypen”, “This ismy pen”, “Thisis my pen”, “This ismypen”, “Thisismy pen”, “Thisismypen”]
As you can see in the example, I’d like to get all substring combinations by removing a white space character(s) while keeping the order of the sequence.
I’ve tried to use the strip(idx) function and from itertools import combinations, but it was hard to keep the order of the original sentence and get all possible cases at the same time.
Any basic ideas will be welcomed! Thank you very much.
I’m a newbie to programming so please let me know if I need to write more details. Thanks a lot.

Try this:
import itertools
s = "This is my pen"
s_list = s.split()
s_len = len(s_list)
Then:
r = ["".join(itertools.chain(*zip(s_list, v+("",))))
for v in itertools.product([" ", ""], repeat=s_len-1)]
This results in r having the following value:
['This is my pen',
'This is mypen',
'This ismy pen',
'This ismypen',
'Thisis my pen',
'Thisis mypen',
'Thisismy pen',
'Thisismypen']
If you just want to iterate over the values, you can avoid creating the top-level list as follows:
for u in ("".join(itertools.chain(*zip(s_list, v+("",))))
for v in itertools.product([" ", ""], repeat=s_len-1)):
print(u)
which produces:
This is my pen
This is mypen
This ismy pen
This ismypen
Thisis my pen
Thisis mypen
Thisismy pen
Thisismypen
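For readers who find the zip/chain interleaving hard to follow, here is a more explicit sketch of the same idea (the names words, seps, pieces and variants are mine, not from the answer above): for each of the len(words) - 1 gaps, pick either " " or "" as the separator, then glue the sentence back together.

import itertools

s = "This is my pen"
words = s.split()

variants = []
for seps in itertools.product([" ", ""], repeat=len(words) - 1):
    pieces = [words[0]]
    for sep, word in zip(seps, words[1:]):
        pieces.append(sep + word)   # prepend the chosen separator to each following word
    variants.append("".join(pieces))

print(variants)   # the same eight strings as r above

Both versions enumerate the same 2**(len(words) - 1) combinations; the comprehension above is just a more compact way of writing this loop.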

Related

Python - counting frequency of a non-ngram sequence in a list of strings efficiently

As I've stated in the title, I'm trying to calculate the frequency of phrases from a given list of sequences that appear in a list of strings. The problem is that the words in a phrase do not have to appear next to one another; there may be one or more words in between.
Example:
Sequence: ('able', 'help', 'number') in a sentence "Please call us, we may be able to help, our phone number is 1234"
I remove the stopwords (NLTK stopwords), remove punctuation, lowercase all letters and tokenize the sentence, so the processed sequence looks like ['please', 'call', 'us', 'able', 'help', 'phone', 'number', '1234']. I have about 30,000 sequences varying in length from 1 (single words) to 3, and I'm searching in almost 6,000 short sentences. My current approach is presented below:
from collections import Counter
from tqdm import tqdm
import itertools
import nltk

# Get term frequency per sentence
def get_bow(sen, vocab):
    vector = [0] * len(vocab)
    tokenized_sentence = nltk.word_tokenize(sen)
    combined_sentence = list(itertools.chain.from_iterable(
        [itertools.combinations(tokenized_sentence, 1),
         itertools.combinations(tokenized_sentence, 2),
         itertools.combinations(tokenized_sentence, 3)]))
    for el in combined_sentence:
        if el in vocab:
            cnt = combined_sentence.count(el)
            idx = vocab.index(el)
            vector[idx] = cnt
    return vector

sentence_vectors = []
for sentence in tqdm(text_list):
    sentence_vectors.append(get_bow(sentence, phrase_list))
phrase_list is a list of tuples with the sequences, and text_list is a list of strings. Currently the frequency calculation takes over an hour, and I'm trying to find a more efficient way to get the list of frequencies associated with the given terms. I've also tried using sklearn's CountVectorizer, but there's a problem with processing sequences with gaps, so they're not counted at all.
I'd be grateful if anyone would try to give me some insight about how to make my script more efficient. Thanks in advance!
EDIT:
Example of phrase_list: [('able',), ('able', 'us', 'software'), ('able', 'back'), ('printer', 'holidays'), ('printer', 'information')]
Example of text_list: ['able add printer mac still working advise calling support team mon fri excluding bank holidays would able look', 'absolutely one cat coyote peterson', 'accurate customs checks cause delays also causing issues expected delivery dates changing', 'add super mario flair fridge desk supermario dworld bowsersfury magnet set available platinum points shipping costs mynintendo reward get', 'additional information though pass comments team thanks']
Expected output: [2, 0, 0, 1, 0] - a vector with the occurrence count of each phrase; the order of values should be the same as in phrase_list. My code returns a vector of phrase occurrences per sentence, because I was trying to implement something like a bag-of-words.
There are many aspects that could be made faster, but here is the main problem:
combined_sentence = list(itertools.chain.from_iterable(
    [itertools.combinations(tokenized_sentence, 1),
     itertools.combinations(tokenized_sentence, 2),
     itertools.combinations(tokenized_sentence, 3)]))
You generate all potential combinations of 1,2 or 3 words of the sentence. This is always bad, no matter what you want to do.
Sentence: "Master Yoda about sentence structure care does not."
If you really want to treat this sentence as if it contained "Yoda does not", you should still not generate all combinations. There are much faster ways, but I will only spend time on this if that is indeed your goal.
If you want to treat this sentence as one that does NOT contain "Yoda does not", then I think you can figure out yourself how to speed up your code. Maybe look here.
I hope this helped. Let me know in case you need option 1.
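If option 1 is what you are after (count a phrase whenever its words appear in order inside a sentence, with gaps allowed), here is a sketch of one way to avoid generating combinations entirely. It assumes phrase_list and text_list are the example lists shown above, and count_in_order_matches is my own helper name; the value it returns is the number of in-order matches, which is what the combinations-based code was counting.

def count_in_order_matches(tokens, phrase):
    # dp[j] = number of ways to match the first j words of `phrase`
    # as an in-order (gapped) subsequence of the tokens seen so far
    dp = [0] * (len(phrase) + 1)
    dp[0] = 1
    for tok in tokens:
        for j in range(len(phrase), 0, -1):   # go backwards so a token only extends earlier matches
            if phrase[j - 1] == tok:
                dp[j] += dp[j - 1]
    return dp[-1]

counts = [0] * len(phrase_list)
for sentence in text_list:
    tokens = sentence.split()
    token_set = set(tokens)
    for i, phrase in enumerate(phrase_list):
        if all(w in token_set for w in phrase):     # cheap pre-filter before the DP
            counts[i] += count_in_order_matches(tokens, phrase)

print(counts)   # [2, 0, 0, 1, 0] for the example data above

This does O(len(tokens) * len(phrase)) work per phrase instead of materialising every 1-, 2- and 3-word combination of the sentence.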

How do I match exact strings from a list to a larger string taking white spaces into account?

I have a large list of strings and I want to check whether each string occurs in a larger string. The list consists of single-word strings as well as multi-word strings. To do so I have written the following code:
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"
emptylist = []
for i in example_text.split():
    res = [ele for ele in example_list if ele in i]
    emptylist.append(res)
However, the problem here is that 'pain' is also added to emptylist, which it should not be, as I only want something from example_list to be added if it exactly matches the text. I also tried using sets:
word_set = set(example_list)
phrase_set = set(example_text.split())
word_set.intersection(phrase_set)
This however chops up 'morning sickness' into 'morning' and 'sickness'. Does anyone know the correct way to tackle this problem?
Nice examples have already been provided in this post by members.
I made the matching text a little more challenging, so that 'pain' occurs more than once. I also aimed for a little more information about where each match starts. I ended up with the following code.
I worked on the following sentence.
"The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
import re
from collections import defaultdict
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has not only kneepain but headache and arm pain, stomach pain and sickness"
TruthFalseDict = defaultdict(list)
for i in example_list:
    MatchedTruths = re.finditer(r'\b%s\b' % i, example_text)
    if MatchedTruths:
        for j in MatchedTruths:
            TruthFalseDict[i].append(j.start())

print(dict(TruthFalseDict))
The above gives me the following output.
{'pain': [55, 69], 'headache': [38], 'sickness': [78]}
Using PyParsing:
import pyparsing as pp
example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache morning sickness"
list_of_matches = []
for word in example_list:
    rule = pp.OneOrMore(pp.Keyword(word))
    for t, s, e in rule.scanString(example_text):
        if t:
            list_of_matches.append(t[0])

print(list_of_matches)
Which yields:
['headache', 'sickness', 'morning sickness']
You should be able to use a regex using word boundaries
>>> import re
>>> [word for word in example_list if re.search(r'\b{}\b'.format(word), example_text)]
['headache']
This will not match 'pain' in 'kneepain' since that does not begin with a word boundary. But it would properly match substrings that contained whitespace.
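A small caveat about the regex approaches above: if example_list could ever contain entries with regex metacharacters (parentheses, dots, plus signs and so on), escaping each entry keeps the pattern valid while preserving the word-boundary behaviour. A minimal variation of the snippet above:

import re

example_list = ['pain', 'chestpain', 'headache', 'sickness', 'morning sickness']
example_text = "The patient has kneepain as wel as a headache"

# re.escape guards against an entry such as 'pain (chronic)' breaking the pattern
matches = [word for word in example_list
           if re.search(r'\b{}\b'.format(re.escape(word)), example_text)]
print(matches)   # ['headache']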

ordering text by the agreement with keywords

I am trying to order a number of short paragraphs by their agreement with a list of keywords. This is used to provide a user with the text ordered by interest.
Let's assume I already have the list of keywords, hopefully reflecting the user's interest. I thought this was a fairly standard procedure and expected to find a Python package for it, but so far my Google search has not been very successful.
I can easily come up with a brute force solution myself, but I was wondering whether somebody knows an efficient way to do this?
EDIT:
Ok here is an example:
keywords = ['cats', 'food', 'Miau']
text1 = 'This is text about dogs'
text2 = 'This is text about food'
text3 = 'This is text about cat food'
I need a procedure which leads to the order text3, text2, text1
thanks
This is the simplest thing I can think of:
import string

input = open('document.txt', 'r')
text = input.read()

# strip punctuation (Python 2 idiom)
table = string.maketrans("", "")
text = text.translate(table, string.punctuation)
wordlist = text.split()

list_of_keywords = ['cats', 'food', 'Miau']   # the keywords from the question
agreement_cnt = 0
for word in list_of_keywords:
    agreement_cnt += wordlist.count(word)
I got the punctuation-removal bit from here: Best way to strip punctuation from a string in Python.
Something like this might be a good starting point:
>>> keywords = ['cats', 'food', 'Miau']
>>> text1 = 'This is a text about food fed to cats'
>>> matched_word_count = len(set(text1.split()).intersection(set(keywords)))
>>> print matched_word_count
2
If you want to correct for capitalization or capture word forms (i.e. 'cat' instead of 'cats'), there's obviously more to consider, though.
Taking the above and capturing match counts for a list of different strings, and then sorting the results to find the "best" match, should be relatively simple.
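As a rough illustration of that last point, a sketch of ranking the question's three texts by keyword overlap might look like the following (whole-word, case-insensitive matching only; as noted above, matching 'cat' against the keyword 'cats' would additionally need stemming or some other word-form normalisation, so text3 and text2 tie here):

keywords = ['cats', 'food', 'Miau']
texts = ['This is text about dogs',
         'This is text about food',
         'This is text about cat food']

def keyword_overlap(text, keywords):
    # number of distinct keywords that appear as whole words in the text
    words = set(text.lower().split())
    return len(words & {k.lower() for k in keywords})

ranked = sorted(texts, key=lambda t: keyword_overlap(t, keywords), reverse=True)
print(ranked)   # texts with the most keyword hits first; ties keep their original order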

is it possible to create an if statement dynamically using list elements and OR?

I'm trying to adapt the code I wrote below to work with a dynamic list of required values rather than with a string, as it works at present:
required_word = "duck"
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
["Hello", "duck"]]
sentences_not_containing_required_words = []
for sentence in sentences:
if required_word not in sentence:
sentences_not_containing_required_words.append(sentence)
print sentences_not_containing_required_words
Say, for example, I had two required words (only one of which is actually required); I could do this:
required_words = ["dog", "fox"]
sentences = [["the", "quick", "brown", "fox", "jump", "over", "lazy", "dog"],
["Hello", "duck"]]
sentences_not_containing_required_words = []
for sentence in sentences:
if (required_words[0] not in sentence) or (required_words[1]not in sentence):
sentences_not_containing_required_words.append(sentence)
print sentences_not_containing_required_words
>>> [['Hello', 'duck']]
However, what I need is for someone to steer me towards a method that can deal with a list of varying size (number of items) and satisfy the if statement if any of the list's items are not in the list named 'sentence'. Being quite new to Python, I'm stumped and don't know how to phrase the question any better. Do I need to come up with a different approach?
Thanks in advance!
(Note that the real code will do something more complicated than printing sentences_not_containing_required_words.)
You can construct this list pretty easily with a combination of a list comprehension and the any() built-in function:
non_matches = [s for s in sentences if not any(w in s for w in required_words)]
This will iterate over the list sentences while constructing a new list, and only include sentences where none of the words from required_words are present.
If you are going to end up with longer lists of sentences, you may consider using a generator expression instead to minimize memory footprint:
non_matches = (s for s in sentences if not any(w in s for w in required_words))
for s in non_matches:
    # do stuff
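One note on semantics: not any(...) keeps only the sentences that contain none of the required words, whereas the question's original or condition keeps a sentence whenever at least one required word is missing. If that original behaviour is what is wanted, all() is the direct translation (both variants return [['Hello', 'duck']] for the example data):

non_matches = [s for s in sentences if not all(w in s for w in required_words)]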

How to find collocations in text, python

How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
NLTK has a bigrams function that returns word pairs:
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur more often than expected based on the frequency of the individual words. Any ideas how to put that into code?
Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
...     for sent in nltk.sent_tokenize(sentences.lower()):
...         for word in nltk.word_tokenize(sent):
...             yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list
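Since nltk.collocations.BigramCollocationFinder is mentioned above but not shown, here is a minimal sketch of its typical usage, following the pattern from the NLTK documentation; the repeated toy sentence is only a stand-in for a real corpus, which should be much larger:

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
tokens = ("mary had a little lamb , little lamb , little lamb . " * 10).split()

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)                      # ignore bigrams seen fewer than 2 times
print(finder.nbest(bigram_measures.pmi, 5))      # top 5 bigrams by pointwise mutual information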
Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
from itertools import izip

words = ["more", "is", "said", "than", "done", "is", "said"]

words_iter = iter(words)
next(words_iter, None)  # advance by one so izip pairs each word with its successor

count = {}
for bigram in izip(words, words_iter):
    count[bigram] = count.get(bigram, 0) + 1

print sorted(((c, b) for b, c in count.iteritems()), reverse=True)
(words_iter is introduced to avoid copying the whole list of words, as you would do with izip(words, words[1:]).)
import itertools
from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)  # offset by one so zip pairs each word with the word that follows it
freq = Counter(zip(words, nextword))
print(freq)
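The question asks for bigrams that occur often relative to the frequencies of their individual words; a rough hand-rolled score along those lines (not one of the exact statistics NLTK provides) could look like this, building on the Counter version above:

from collections import Counter

words = ['more', 'is', 'said', 'than', 'done', 'is', 'said']

unigram_counts = Counter(words)
bigram_counts = Counter(zip(words, words[1:]))

# score each bigram by its count relative to the counts of its two words
scores = {bg: cnt / (unigram_counts[bg[0]] * unigram_counts[bg[1]])
          for bg, cnt in bigram_counts.items()}

for bg, s in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(bg, round(s, 3))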
A collocation is a sequence of tokens that is better treated as a single token when parsing; e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc.) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".
What are you actually trying to achieve? What code have you written so far?
I agree with Tim McNamara on using nltk and the problems with unicode. However, I like the Text class a lot - there is a hack that you can use to get the collocations as a list; I discovered it by looking at the source code. Apparently, whenever you invoke the collocations method it saves the result as a class variable!
import nltk
def tokenize(sentences):
    for sent in nltk.sent_tokenize(sentences.lower()):
        for word in nltk.word_tokenize(sent):
            yield word

text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
collocations = [" ".join(el) for el in list(text._collocations)]
Enjoy!
