Python: Replace all substring occurrences with regular expressions

I would like to replace all substring occurrences with regular expressions. The original sentences would be like:
mystring = "Carl's house is big. He is asking 1M for that(the house)."
Now let's suppose I have two substrings I would like to bold. I bold the words by adding ** at the beginning and at the end of the substring. The 2 substrings are:
substring1 = "house", so bolded it would be "**house**"
substring2 = "the house", so bolded it would be "**the house**"
At the end I want the original sentence like this:
mystring = "Carl's **house** is big. He is asking 1M for that(**the house**)."
The main problem is that, as I have several substrings to replace, they can overlap, as in the example above. If I process the longest substring first, I get this:
Carl's **house** is big. He is asking 1M for that(**the **house****).
On the other hand, if I analyze the shortest substring first, I am getting this:
Carl's **house** is big. He is asking 1M for that(the **house**).
It seems I will need to replace from the longest substring to the shortest, but I wonder how to make sure text already covered by the first replacement is not matched again by the second. Also remember that a substring can appear several times in the string.
Note: assume the string ** will never occur in the original string, so we can use it to bold our words.

You can search for all of the strings at once, so that the fact that one is a substring of another doesn't matter:
re.sub(r"(house|the house)", r"**\1**", mystring)

You could use a group that is non-capturing and optional. If you look at the regex pattern (?P<repl>(?:the )?house), the (?:the )? part says that there might be a "the " in the string; if it is present, it is included in the match. This way, you let the re library optimize the way it matches. Here is the complete example:
>>> data = "Carl's house is big. He is asking 1M for that(the house)."
>>> re.sub(r'(?P<repl>(?:the )?house)', r'**\g<repl>**', data)
"Carl's **house** is big. He is asking 1M for that(**the house**)."
Note: \g<repl> is used to insert the whole string matched by the group named repl.

You could do two passes:
First: Go through from longest to shortest and replace with something like:
'the house': 'AA_THE_HOUSE'
'house': 'BB_HOUSE'
Second: Go through replace like:
'AA_THE_HOUSE': '**the house**'
'BB_HOUSE': '**house**'
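A minimal sketch of that two-pass idea (the placeholder tokens are made up; any strings that cannot occur in the text will do):
s = "Carl's house is big. He is asking 1M for that(the house)."
placeholders = {"the house": "AA_THE_HOUSE", "house": "BB_HOUSE"}
# First pass: longest substring first, so "the house" is protected
# before the shorter "house" is considered.
for original in sorted(placeholders, key=len, reverse=True):
    s = s.replace(original, placeholders[original])
# Second pass: swap each placeholder for the bolded original.
for original, placeholder in placeholders.items():
    s = s.replace(placeholder, "**" + original + "**")
print(s)  # Carl's **house** is big. He is asking 1M for that(**the house**).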

Replace the strings with some unique placeholder values, then replace the placeholders with the original strings enclosed in ** to make them bold.
For example, replace:
'the house' with 'temp_the_house'
'house' with 'temp_house'
then 'temp_house' with '**house**'
and 'temp_the_house' with '**the house**'
This should work fine. You can automate this by using two lists.

Related

An efficient way to find elements of a list that contain substrings from another list

list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
Assume I have two string lists; I want to use a new list to store the matched pairs of those two lists, just like below:
list3=[["happy new year","happy"],["happy new year","new"],["a happy story","happy"],["hold on","hold"]]
which means I need to get all pairs of strings in one list with their substrings in another list.
Actually this is about some ancient Chinese script data. The first list contains names of people from the 10th to 13th century, and the second list contains titles of all the poems of that period. Ancient Chinese people often recorded their social relations in the titles of their works. For example, someone may write a poem titled "For my friend Wang Anshi". In this case, the person "Wang Anshi" in the first list should be matched with this title. There are also cases like "For my friend Wang Anshi and Su Shi" which contain more than one person in the title. So basically it is a huge job involving 30,000 people and 160,000 poems.
Following is my code:
list3 = []
for i in list1:
    for j in list2:
        if str(i).count(str(j)) > 0:
            list3.append([i, j])
I use str(i) because Python otherwise takes my Chinese strings as floats. And this code does work, but it is much too slow. I must figure out another way to do that. Thanks!
Use a regular expression to do the searching, via the re module. A regular expression engine can work out matching elements in a search through text much better than a nested for loop can.
I'm going to use better variable names here to make it clearer which list goes where; titles are the poem titles you are searching through, and names the things you are trying to match. matches are the (title, name) pairs you want to produce:
import re
titles = ["happy new year", "game over", "a happy story", "hold on"]
names = ["happy", "new", "hold"]
by_reverse_length = sorted(names, key=len, reverse=True)
pattern = "|".join(map(re.escape, by_reverse_length))
any_name = re.compile("({})".format(pattern))
matches = []
for title in titles:
    for match in any_name.finditer(title):
        matches.append((title, match.group()))
The above produces your required output:
>>> matches
[('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]
The names are sorted by length, in reverse, so that longer names are found before shorter with the same prefix; e.g. Hollander is found before Holland is found before Holl.
The pattern string is created from your names to form a ...|...|... alternation; any one of those alternatives can match, but the regex engine will prefer those listed earlier in the sequence over those put later, hence the need to reverse sort by length. The (...) parentheses around the whole pattern of names tell the regular expression engine to capture that part of the text in a group. The match.group() call in the loop can then extract the matched text.
The re.escape() function call is there to prevent 'meta characters' in the names, characters with special meaning such as ^, $, (, ), etc, from being interpreted as their special regular expression meanings.
The re.finditer() function (and the method on compiled patterns) then finds non-overlapping matches in order from left to right, so it'll never match a shorter substring inside text already consumed by a longer match, and it gives us the opportunity to extract the match object for each. This gives you more options if you want to know about starting positions of the matches and other metadata as well, should you want those. Otherwise, re.findall() could also be used here.
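For instance (a small illustration, not part of the original answer), the match objects expose the starting positions:
>>> for title in titles:
...     for match in any_name.finditer(title):
...         print(title, match.start(), match.group())
happy new year 0 happy
happy new year 6 new
a happy story 2 happy
hold on 0 hold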
If you are going to use the above on text with Western alphabets and not on Chinese, then you probably also want to add word boundary markers, \b:
any_name = re.compile(r"\b({})\b".format(pattern))
otherwise substrings that are part of a larger word can be matched. Since Chinese has no word boundary characters (such as spaces and punctuation), you don't want to use \b in such texts.
If the lists are longer, it might be worth building a sort of "index" of the sentences a given word appears in. Creating the index takes about as long as finding the first word from list2 in all the sentences in list1 (it has to loop over all the words in all the sentences), and once created, you can get the sentences containing a word much faster in O(1).
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
import collections
index = collections.defaultdict(list)
for sentence in list1:
    for word in sentence.split():
        index[word].append(sentence)
res = [[sentence, word] for word in list2 for sentence in index[word]]
Result:
[['happy new year', 'happy'],
 ['a happy story', 'happy'],
 ['happy new year', 'new'],
 ['hold on', 'hold']]
This uses str.split to split the words at spaces, but if the sentences are more complex, e.g. if they contain punctuation, you might use a regular expression with word boundaries \b instead, and possibly normalize the sentences (e.g. convert to lowercase or apply a stemmer, not sure if this is applicable to Chinese, though).
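For example, a sketch of that regex-based variant (assuming Western-style text, since \w+ keys off word characters; lowercasing is one possible normalization):
import re
import collections

index = collections.defaultdict(list)
for sentence in list1:
    # \w+ picks out runs of word characters, so punctuation is skipped;
    # lower() normalizes the case of the index keys.
    for word in re.findall(r"\w+", sentence.lower()):
        index[word].append(sentence)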
This can be done quite easily in an absolutely straightforward way.
Option A: Finding "all" possible combinations: To find all strings in one list that contain substrings from another list, loop over all your strings of your list1 (the strings to assess) and for each element check whether it contains a substring of list2:
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
[(string, substring) for string in list1 for substring in list2 if substring in string]
>>> [('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]
(I do think the title of your question is a bit misleading, though, as you are not only asking for elements of a list that contain a substring of another list, but as per your code example you are looking for 'all possible combinations'.)
Thus option B: finding "any" combination. Much simpler and faster: if you really only need what the question title says, you can improve performance by finding only the 'any' matches:
[string for string in list1 if any(substring in string for substring in list2)]
Option B will also allow you to improve performance. In case the lists are very long, you can run B first, create a subset (only strings that will actually produce a match with a substring), and then expand again to catch 'all' instead of any.
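A sketch of that subset-then-expand idea:
# Step 1 ("any"): keep only the strings that match at least one substring.
candidates = [s for s in list1 if any(sub in s for sub in list2)]
# Step 2 ("all"): expand the reduced list into every (string, substring) pair.
pairs = [(s, sub) for s in candidates for sub in list2 if sub in s]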

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy; all characters are included until the next space)
without (since it comes after coding; the number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)
You can solve it using a regex, e.g. like this:
import re
expected_output = re.findall('(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) takes your key_words list and creates a non-capturing group with all the words inside it.
\s+? adds a lazy quantifier, so it consumes the spaces after any of the former occurrences up to the next character that isn't a space.
([^\s]+) captures the text right after your key_words until the next space is found.
Note: in case you're running this many times, e.g. inside a loop, you ought to use re.compile on the regex string beforehand in order to improve performance.
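For example, a compiled version of the same pattern:
pattern = re.compile(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)))
expected_output = pattern.findall(text_string)  # reuse `pattern` across calls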
We will use re module of Python to split your strings based on whitespaces.
Then, the idea is to go over each word and check whether that word is part of your keywords. If it is, we set take_it to True, so that on the next iteration of the loop the word will be added to taken, which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']

search and count specific phrases with special characters in text files

I have a list of search phrases where some are single words, some are multiple words, some have a hyphen in between them, and others may have both parentheses and hyphens. I'm trying to process a directory of text files and search for 100+ of these phrases, and then count occurrences.
It seems like the code below works in Python 2.7.x until it hits the hyphenated search phrases. I observed some unexpected counts in some text files for at least one of the hyphenated search phrases.
kwlist = ['phraseone', 'phrase two', 'phrase-three', 'phrase four (a-b-c) abc', 'phrase five abc', 'phrase-six abc abc']
for kws in kwlist:
    s_str = kws
    kw = re.findall(r"\b" + s_str + r"\b", ltxt)
    count = 0
    for c in kw:
        if c == s_str:
            count += 1
    output.write(str(count))
Is there a better way to handle the range of phrases in the search, or any improvements I can make to my algorithm?
You could achieve this with what I would call a pythonic one-liner.
We don't need to bother with a regex, as we can use the built-in .count() method, which, per the documentation:
string.count(s, sub[, start[, end]])
Return the number of (non-overlapping) occurrences of substring sub in string s[start:end]. Defaults for start and end and interpretation of negative values are the same as for slices.
So all we need to do is sum up the occurrences of each keyword in kwlist in the string ltxt. This can be done with a list-comprehension:
output.write(str(sum([ltxt.count(kws) for kws in kwlist])))
Update
As pointed out in @voiDnyx's comment, the above solution writes the sum of all the counts, not the count for each individual keyword.
If you want the individual counts outputted, you can just write each one from the list:
counts = [ltxt.count(kws) for kws in kwlist]
for cnt in counts:
    output.write(str(cnt))
This will work, but if you wanted to get silly and put it all in one-line, you could potentially do:
[output.write(str(ltxt.count(kws))) for kws in kwlist]
It's up to you, hope this helps! :)
If you need to match word boundaries, then yes, the only way to do so would be to use \b in a regex. This doesn't mean that you can't still do it in one line:
[output.write(str(len(re.findall(r'\b' + re.escape(kws) + r'\b', ltxt)))) for kws in kwlist]
Note how the re.escape is necessary, as the keyword may contain special characters.
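If you'd rather not use a list comprehension purely for its side effects, a plain loop (a sketch assuming ltxt and output exist as in the question) reads more clearly:
for kws in kwlist:
    # Count whole-phrase matches, escaping any special characters first.
    count = len(re.findall(r'\b' + re.escape(kws) + r'\b', ltxt))
    output.write(str(count))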

Removing numbers from strings

So, I am working with a text file, and I am doing the following operations on its string:
def string_operations(string):
    1) lowercase
    2) remove integers from string
    3) remove symbols
    4) stemming
After this, I am still left with strings like:
durham 28x23
I see the flaw in my approach but would like to know if there is a good, fast way to identify whether there is a numeric value attached to the string.
So in the above example, I want the output to be
durham
Another example:
21st ammendment
Should give:
ammendment
So how do I deal with this stuff?
If your requirement is, "remove any terms that start with a digit", you could do something like this:
def removeNumerics(s):
    return ' '.join([term for term in s.split() if not term[0].isdigit()])
This splits the string on whitespace and then joins with a space all the terms that do not start with a number.
And it works like this:
>>> removeNumerics('21st amendment')
'amendment'
>>> removeNumerics('durham 28x23')
'durham'
If this isn't what you're looking for, maybe show some explicit examples in your question (showing both the initial string and your desired result).

Finding various string repeats in python in next 10 characters

So I'm working on a problem where I have to find various string repeats after encountering an initial string, say we take ACTGAC so the data file has sequences that look like:
AAACTGACACCATCGATCAGAACCTGA
So in that string, once we find ACTGAC, I need to analyze the next 10 characters for the string repeats, which follow some rules. I have the rules coded, but can anyone show me how, once I find the string I need, I can make a substring of the next ten characters to analyze? I know the str.partition function can do that once I find the string, and then [1:10] can get the next ten characters.
Thanks!
You almost have it already (but note that indexes start counting from zero in Python).
The partition method will split a string into head, separator, tail, based on the first occurrence of separator.
So you just need to take a slice of the first ten characters of the tail:
>>> data = 'AAACTGACACCATCGATCAGAACCTGA'
>>> head, sep, tail = data.partition('ACTGAC')
>>> tail[:10]
'ACCATCGATC'
Python allows you to leave out the start-index in slices (it defaults to zero, the start of the string), and also the end-index (it defaults to the length of the string).
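For instance, continuing the session above:
>>> tail[:10]      # start-index omitted, defaults to 0
'ACCATCGATC'
>>> tail[10:]      # end-index omitted, defaults to len(tail)
'AGAACCTGA'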
Note that you could also do the whole operation in one line, like this:
>>> data.partition('ACTGAC')[2][:10]
'ACCATCGATC'
So, based on marcog's answer in Find all occurrences of a substring in Python, I propose:
>>> import re
>>> data = 'AAACTGACACCATCGATCAGAACCTGAACTGACTGACAAA'
>>> sep = 'ACTGAC'
>>> [data[m.start()+len(sep):][:10] for m in re.finditer('(?=%s)'%sep, data)]
['ACCATCGATC', 'TGACAAA', 'AAA']
