Say I have some arbitrary string such as
hello hello hello I am I am I am your string string string string of strings
Can I somehow find repetitive substrings delimited by spaces (see edit)? In this case they would be 'hello', 'I am' and 'string'.
I have been wondering about this for some time, but I still cannot find any real solution.
I have also read some articles on this topic and came across suffix trees, but can they help me even though I need to find every repetition, e.g. with a repetition count higher than two?
If so, is there some library for Python that can handle suffix trees and perform operations on them?
Edit: I am sorry I was not clear enough. Just to make it clear: I am looking for repetitive substrings, that is, sequences in the string that, in terms of regular expressions, could be replaced by + or {} quantifiers. If I had to turn the listed string into a regular expression, I would write
(hello ){3}(I am ){3}your (string ){4}of strings
To find two or more characters that repeat two or more times, each delimited by spaces, use:
(.{2,}?)(?:\s+\1)+
Here's a working example with your test string: http://bit.ly/17cKX62
EDIT: made the quantifier in the capture group reluctant by adding ?, so it matches the shortest possible string (i.e. it now matches "string" rather than "string string")
EDIT 2: added a required space delimiter for cleaner results
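A quick check in Python, using the question's sample string, confirms what the pattern captures:

```python
import re

text = ("hello hello hello I am I am I am "
        "your string string string string of strings")

# group 1 captures the repeated unit; (?:\s+\1)+ demands at least one repetition
matches = re.findall(r'(.{2,}?)(?:\s+\1)+', text)
# matches == ['hello', 'I am', 'string']
```

Because findall returns the capture group, each repeated unit appears once per run of repetitions.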
I'm working on a small project and have need for Regular Expressions that accept strings that contain each character in a given alphabet at least once.
So for the alphabet {J, K, L} I would need a RegEx that accepts strings containing J one or more times AND K one or more times, AND L one or more times, in any order, with any amount of duplicate characters before, after, or in-between.
I'm pretty inexperienced with RegEx and so have trouble finding "lateral thinking" solutions to many problems. My first approach to this was therefore pretty brute-force: I took each possible "base" string, for example,
JKL, JLK, KJL, KLJ, LKJ, LJK
and allowed any string that could be built up from one of those starting points. However, the resulting regular expression* (despite working) ends up being very long and containing a lot of redundancy. Not to mention this approach becomes completely untenable once the alphabet has more than a handful of characters.
I spent a few hours trying to find a more elegant approach, but I have yet to find one that still accepts every possible string. Is there a method or technique I could be using to get this done in a way that's more elegant and scalable (to larger alphabets)?
*For reference, my regular expression for the listed example:
((J|K|L)*J(J|K|L)*K(J|K|L)*L(J|K|L)*)|
((J|K|L)*J(J|K|L)*L(J|K|L)*K(J|K|L)*)|
((J|K|L)*K(J|K|L)*J(J|K|L)*L(J|K|L)*)|
((J|K|L)*K(J|K|L)*L(J|K|L)*J(J|K|L)*)|
((J|K|L)*L(J|K|L)*J(J|K|L)*K(J|K|L)*)|
((J|K|L)*L(J|K|L)*K(J|K|L)*J(J|K|L)*)
This is a typical use-case for a lookahead. You can simply use ^(?=[^J]*J)(?=[^K]*K)(?=[^L]*L) to check all your conditions. If your string also must contain only these characters, you can append [JKL]+$ to it.
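In Python, a quick sanity check of the combined pattern (the sample strings here are my own) might look like:

```python
import re

# each lookahead requires one letter of the alphabet; [JKL]+$ restricts
# the string to only those letters
pattern = re.compile(r'^(?=[^J]*J)(?=[^K]*K)(?=[^L]*L)[JKL]+$')

ok = bool(pattern.match('KLJJK'))       # contains J, K and L -> accepted
missing = bool(pattern.match('JJKK'))   # no L -> rejected
foreign = bool(pattern.match('JKLX'))   # X is outside the alphabet -> rejected
```

Adding one lookahead per letter keeps the pattern linear in the alphabet size, unlike the permutation-based approach.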
If using regex is not a requirement you could also check for the characters individually:
text = ...
alphabet = 'JKL'
assert all(character in text for character in alphabet)
Or if you do not want to allow characters that are not in the alphabet:
assert set(alphabet) == set(text)
I'm a beginner to both Python and to this forum, so please excuse any vague descriptions or mistakes.
I have a problem regarding reading/writing to a file. What I'm trying to do is read a text from a file, find the words that occur more than once, mark them as repeated_word, and then write the original text to another file, but with the repeated words marked with star signs around them.
I find it difficult to understand how I can compare just the words (without punctuation etc.) but still write the words in their original context to the file.
Some have recommended I use regex, but I don't know how to use it. Another approach is to iterate through the text string, tokenizing and normalizing it by going through each character, and then making some kind of object or element out of each word.
I am thankful to anyone who might have ideas on how to solve this. The main problem is not how to find which words are repeated, but how to mark them and then write them to the file in their context. Some help with the coding would be much appreciated, thanks.
EDIT
I have updated the code with what I've come up with so far. If there is anything you would consider "bad coding", please comment on it.
To explain the Whitelist class: the assignment has two parts, one where I am supposed to mark the words, and one regarding a whitelist containing words that are "allowed repetitions" and shall therefore not be marked.
I have read heaps of stuff about regex but I still can't get my head around how to use it.
Basically, you need to do two things: find which words are repeated, and then transform each of these words into something else (namely, the original word with some marker around it). Since there's no way to know which words are repeated without going through the entire file, you will need to make two passes.
For the first pass, all you need to do is extract the words from the text and count how many times each one occurs. In order to determine what the words are, you can use a regular expression. A good starting point might be
regex = re.compile(r"[\w']+")
The function re.compile creates a regular expression from a string. This regular expression matches any sequence of one or more word characters (\w) or apostrophes, so it will catch contractions but not punctuation, and I think in many "normal" English texts this should capture all the words.
Once you have created the regular expression object, you can use its finditer method to iterate over all matches of this regular expression in your text.
for match in regex.finditer(text):
    word = match.group()
You can use the Counter class to count how many times each word occurs. (I leave the implementation as an exercise. :-P The documentation should be quite helpful.)
After you've gotten a count of how many times each word occurs, you will have to pick out those whose counts are 2 or more, and come up with some way to identify them in the input text. I think a regular expression will also help you here. Specifically, you can create a regular expression object which will match any of a selected set of words, by compiling the string consisting of the words joined by |.
regex = re.compile('|'.join(words))
where words is a list or set or some iterable. Since you're new to Python, let's not get too fancy (although one can); just code up a way to go through your Counter or whatever and create a list of all words which have a count of 2 or more, then create the regular expression as I showed you.
Once you have that, you'll probably benefit from the sub method, which takes a string and replaces all matches of the regular expression in it with some other text. In your case, the replacement text will be the original word with asterisks around it, so you can do this:
new_text = regex.sub(r'*\g<0>*', text)
In a regular expression replacement, \g<0> refers to whatever was matched by the regex. (Note that sub takes the replacement first and the string to process second.)
Finally, you can write new_text to a file.
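Putting those pieces together, a minimal two-pass sketch (the sample text, the lowercasing, and the \b word boundaries are my own additions) could look like:

```python
import re
from collections import Counter

text = "The cat saw the dog, and the dog saw the cat."

# pass 1: extract words and count them (case-insensitively)
word_re = re.compile(r"[\w']+")
counts = Counter(m.group().lower() for m in word_re.finditer(text))

# keep the words that occur two or more times
repeated = [w for w, n in counts.items() if n >= 2]

# pass 2: \b stops the alternation matching inside longer words;
# re.escape guards against regex metacharacters in a word
mark_re = re.compile(r'\b(?:' + '|'.join(map(re.escape, repeated)) + r')\b',
                     re.IGNORECASE)
new_text = mark_re.sub(r'*\g<0>*', text)
```

Here new_text becomes "*The* *cat* *saw* *the* *dog*, and *the* *dog* *saw* *the* *cat*." with punctuation left untouched.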
If you know that the text only contains alphabetic characters, it may be easier to just ignore characters that are outside of a-z than to try to remove all the punctuation.
Here is one way to remove all characters that are not a-z or space:
file = ''.join(c for c in file if 97 <= ord(c) <= 122 or c == ' ')
This works because ord() returns the ASCII code for a given character, and ASCII 97-122 represent a-z (in lowercase).
Then you'll want to split that into words, which you can accomplish like this:
words = file.split()
If you pass this to the Counter data structure it will count the occurrences of each word.
counter = Counter(file.split())
Then counter.items() will contain a mapping from word to number of occurrences.
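Putting those steps together on a made-up sample string:

```python
from collections import Counter

text = "the cat sat on the mat, and the cat slept."

# keep only a-z (ASCII 97-122) and spaces; punctuation is dropped
cleaned = ''.join(c for c in text if 97 <= ord(c) <= 122 or c == ' ')
counter = Counter(cleaned.split())
```

After this, counter maps each word to its number of occurrences, e.g. counter['the'] is 3.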
OK. I presume that this is a homework assignment, so I'm not going to give you a complete solution. But, you really need to do a number of things.
The first is to read the input file into memory. Then split it into its component words (tokenize it), probably contained in a list, suitably cleaned up to remove stray punctuation. You seem to be well on your way to doing that, but I would recommend you look at the split() and strip() methods available for strings.
You need to consider whether you want the count to be case sensitive or not, and so you might want to convert each word in the list to (say) lowercase to keep this consistent. So you could do this with a for loop and the string lower() method, but a list-comprehension is probably better.
You then need to go through the list of words and count how many times each one appears. If you check out collections.Counter, you will find that this does the heavy lifting for you; alternatively, you will need to build a dictionary which has the words as keys and their counts as values. (You might also want to check out the collections.defaultdict class here as well.)
Finally, you need to go through the text you've read from the file and for each word it contains which has more than one match (i.e. the count in the dictionary or counter is > 1) mark it appropriately. Regular expressions are designed to do exactly this sort of thing. So I recommend you look at the re library.
Having done that, you simply then write the result to a file, which is simple enough.
Finally, with respect to your file operations (reading and writing) I would recommend you consider replacing the try ... except construct with a with ... as one.
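For example, instead of wrapping open() and close() in try ... except, a with block closes the file automatically, even if an exception occurs. A small sketch (the file path here is a throwaway temporary one):

```python
import os
import tempfile

# hypothetical output path; in your assignment this would be your own filename
path = os.path.join(tempfile.mkdtemp(), 'marked.txt')

# the file is closed automatically when the with block exits
with open(path, 'w') as outfile:
    outfile.write('*the* cat sat on *the* mat')

with open(path) as infile:
    text = infile.read()
```

This reads back exactly what was written, with no explicit close() calls needed.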
I'm trying to take "Regex substring one mismatch in any location of string" and scale it up to a big-data situation where I can:
Match all instances of long substrings such as SSQPSPSQSSQPSS (allowing only one possible mismatch within the substring) against a much larger string such as SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS.
In reality, my substrings and the strings I match them against are hundreds and sometimes even thousands of letters long, and I wish to incorporate the possibility of mismatches.
How can I scale the regex notation of "Regex substring one mismatch in any location of string" to solve my big-data problem? Is there an efficient way to go about this?
You may try this:
>>> s = "SSPQQQQPSSSSQQQSSQPSPSQSSQPSSQPPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQPSPSQSSQPSSSSQSPSSQSSQPSS"
>>> re.findall(r'(?=(SSQPSPSQSSQPSS|[A-Z]SQPSPSQSSQPSS|S[A-Z]QPSPSQSSQPSS|SS[A-Z]PSPSQSSQPSS))', s)
['SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS', 'SSQPSPSQSSQPSS']
Likewise, add alternatives to the pattern replacing each of the remaining characters with [A-Z].
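Writing out one alternative per position by hand doesn't scale to substrings hundreds of letters long, but the pattern can be generated. A sketch of that idea (the function name is my own; [A-Z] is the mismatch class from above, demonstrated here on a short made-up example rather than the full protein-like string):

```python
import re

def one_mismatch_pattern(sub):
    # the exact substring, plus one alternative per position with that
    # letter replaced by the [A-Z] wildcard
    alts = [re.escape(sub)]
    for i in range(len(sub)):
        alts.append(re.escape(sub[:i]) + '[A-Z]' + re.escape(sub[i + 1:]))
    # wrapping the alternation in a capturing lookahead finds overlapping matches
    return '(?=(' + '|'.join(alts) + '))'

matches = re.findall(one_mismatch_pattern('ABC'), 'ABCXBCABD')
# matches == ['ABC', 'XBC', 'ABD']
```

The generated pattern is O(n^2) characters for an n-letter substring, so for truly huge inputs a dedicated approximate-matching algorithm (e.g. bitap) may be a better fit than regex.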
I have a string like this:
<name:john student male age=23 subject=\computer\sience_{20092973}>
I am confused by the ":" and "=" separators.
I want to parse this string
into a list like this:
name:john
job:student
sex:male
age:23
subject:{20092973}
That is, parsing a string into specific named fields (name, job, sex, etc.) in Python.
I have already searched, but I can't find anything. Sorry.
How can I do this?
Thank you.
It's generally a good idea to give more than one example of the strings you're trying to parse. But I'll take a guess. It looks like your format is pretty simple, and primarily whitespace-separated. It's simple enough that using regular expressions should work, like this, where line_to_parse is the string you want to parse:
import re
matchval = re.match(r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})", line_to_parse)
matchgroups = matchval.groups()
Now matchgroups will be a tuple of the values you want. It should be trivial for you to take those and get them into the desired format.
If you want to do many of these, it may be worth compiling the regular expression; take a look at the re documentation for more on this.
As for the way the expression works: I won't go into regular expressions in general (that's what the re docs are for) but in this case, we want to get a bunch of strings that don't have any whitespace in them, and have whitespace between them, and we want to do something odd with the subject, ignoring all the text except the part between { and }.
Each "(...)" in the expression saves whatever is inside it as a group. Each "\S+" stands for one or more ("+") characters that aren't whitespace ("\S"), so "(\S+)" will match and save a string of length at least one that has no whitespace in it. Each "\s+" does the opposite: it has no parentheses around it, so it doesn't save what it matches, and it matches one or more ("+") whitespace characters ("\s"). This suffices for most of what we want.
At the end, though, we need to deal with the subject. "[...]" allows us to list multiple types of characters, and "[^...]" is special: it matches anything that isn't listed. "{", like "[", "(", and so on, needs to be escaped with "\" to be treated literally, so in the end "[^\{]*" matches zero or more ("*") characters that aren't "{" ("[^\{]"). Since "*" and "+" are "greedy", and will try to match as much as they can while still letting the expression match, we now only need to deal with the last part. From what I've talked about before, it should be pretty clear what "(\{\S+\})" does.
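To get from the matched groups to the name:value form the question asks for, one option (the keys list is my own labeling of the five groups) is:

```python
import re

line_to_parse = r"<name:john student male age=23 subject=\computer\sience_{20092973}>"
matchval = re.match(r"<name:(\S+)\s+(\S+)\s+(\S+)\s+age=(\S+)\s+subject=[^\{]*(\{\S+\})",
                    line_to_parse)

# label each captured group to build the desired name:value mapping
keys = ['name', 'job', 'sex', 'age', 'subject']
parsed = dict(zip(keys, matchval.groups()))
```

From parsed you can print each pair as "name:john", "job:student", and so on.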
There is a big string and I need to find all substrings containing exactly N words (if it is possible).
For example:
big_string = "The most elegant way to find n words in String with the particular word"
N = 2
find_sub(big_string, 'find', N=2) # => ['way to find n words']
I've tried to solve it with regular expressions, but it turned out to be more complex than I expected at first. Is there an elegant solution around that I've just overlooked?
Update:
By "word" we mean anything separated by \b.
The N parameter indicates how many words there should be on each side of 'find'.
For your specific example (if we use the "word" definition of regular expressions, i.e. anything containing letters, digits and underscores) the regex would look like this:
r'(?:\w+\W+){2}find(?:\W+\w+){2}'
\w matches one of said word characters. \W matches any other character. I think it's obvious where in the pattern your parameters go. You can use the pattern with re.search or re.findall.
The issue arises if there are fewer than the desired number of words around your query (i.e. if it's too close to one end of the string). But you should be able to get away with:
r'(?:\w+\W+){0,2}find(?:\W+\w+){0,2}'
thanks to the greediness of repetition. Note that in any case, if you want multiple results, matches can never overlap. So if you use the first pattern, you will only get the first match if two occurrences of find are too close to each other, whereas with the second, you won't get n words before the second find (the ones that were already consumed will be missing). In particular, if two occurrences of find are closer together than n, so that the second find is already part of the first match, then you can't get the second match at all.
If you want to treat a word as anything that is not a white-space character, the approach looks similar:
r'(?:\S+\s+){0,2}find(?:\s+\S+){0,2}'
For anything else you will have to come up with the character classes yourself, I guess.
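Wrapping the \w-based pattern in the find_sub function from the question (a sketch; the signature and the N parameter come from the question, and {0,N} handles words near the ends of the string):

```python
import re

def find_sub(big_string, word, N=2):
    # up to N words (with separators) on each side of the target word;
    # re.escape protects against metacharacters in the search word
    pattern = r'(?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d}' % (N, re.escape(word), N)
    return re.findall(pattern, big_string)

result = find_sub("The most elegant way to find n words in String with the particular word",
                  'find', N=2)
# result == ['way to find n words']
```

This reproduces the expected output from the question for N=2.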