For example, I would like to do something like this in lua:
s = "Hey\n There And Yea\n"
print(s.rfind("\n", 0, 5))
I've tried making this in lua with the string.find function:
local s = "Hey\n There And Yea\n"
local _, p = s:find(".*\n", -5)
print(p)
But these aren't producing the same results. What am I doing wrong, and how can I fix this so that it behaves the same as rfind?
Lua has a little-known function, string.reverse, that reverses all the characters of a string. It is rarely needed, but it is handy for searching a string from the right.
So to implement rfind, search for the reversed pattern inside the reversed subject string, then do a little arithmetic to convert the result back into an offset in the original string.
Here is the code that mimics Python rfind:
function rfind(subject, tofind, startIdx, endIdx)
    startIdx = startIdx or 0
    endIdx = endIdx or #subject
    subject = subject:sub(startIdx+1, endIdx):reverse()
    tofind = tofind:reverse()
    -- plain find (4th argument true), so magic characters in tofind are matched literally
    local idx = subject:find(tofind, 1, true)
    return idx and #subject - #tofind - idx + startIdx + 1 or -1
end
print(rfind("Hello World", "H")) --> 0
print(rfind("Hello World", "l")) --> 9
print(rfind("foo foo foo", "foo")) --> 8
print(rfind("Hello World", "Toto")) --> -1
print(rfind("Hello World", "l", 1, 4)) --> 3
Note that this version of rfind uses Python's index convention: indices start at 0 and -1 is returned if the string is not found. It would be more consistent with Lua to use 1-based indices and to return nil when there is no match. The modification would be trivial.
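For comparison, the same reverse-and-search idea can be sketched in Python, where the result is easy to check against the built-in str.rfind. The function name rfind_via_reverse is made up for this illustration, and the sketch assumes non-negative start and end values.

```python
def rfind_via_reverse(subject, tofind, start=0, end=None):
    """Mimic str.rfind by searching the reversed needle in the reversed window."""
    if end is None:
        end = len(subject)
    # Cut out the search window first, then reverse it.
    window = subject[start:end][::-1]
    idx = window.find(tofind[::-1])
    if idx == -1:
        return -1
    # Convert the offset in the reversed window back to an offset in subject.
    return start + len(window) - len(tofind) - idx


for args in [("Hello World", "l"), ("foo foo foo", "foo"), ("Hello World", "l", 1, 4)]:
    # Both columns should agree.
    print(rfind_via_reverse(*args), args[0].rfind(*args[1:]))
```

The arithmetic is the same as in the Lua version, minus the +1 that Lua needs because its find returns 1-based positions.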
The pattern I have written will only work for single-char substrings like the one the asker used as a test case. Skip ahead to the next bold header to see that answer, or read on for an explanation of some of the things they did wrong with their attempt. Skip to the very final bold header for a general, inefficient solution for multi-char substrings.
I have tried to recreate the output of Python's mystring.rfind with Lua's mystring:find, but it only works for single-character substrings. Later I will show a function that handles all cases, though it is a fairly inefficient loop.
As a recap (to address what you're doing wrong), let's talk about mystringvar:find("pattern", index), which is sugar for string.find(mystringvar, "pattern", index). It returns the start and end indices of the match.
The optional index sets where the search starts, not where it ends. A negative index counts back from the end of the string, so the search covers only the tail (an index of -1 searches only the last character, -2 the last two). This is not the desired behavior.
Instead of trying to use the index to create a substring, you should create a substring like this:
mystringvar:sub(start, end) will extract and return the substring from start to end (1 indexed, inclusive end). So to recreate Python's 0-5 (0 indexed, exclusive end), use 1-5.
Now note that these methods can be chained into string:sub(x, y):find("") but I will break it up for ease of reading. Without further ado, I present you:
The answer
local s = "Hey\n There And Yea\n"
local substr = s:sub(1,5)
local start, fin = substr:find("\n[^\n]-$")
print(start, ",", fin)
I had a few half measure solutions, but to make sure what I was writing would work for multiple substring instances (the 1-5 substring only contains 1), I tested with the substring and the whole string. Observe:
output with sub(1, 5): 4 , 5
output with sub(1, 19) (the whole length): 19 , 19
These both correctly report the beginning of the rightmost match, but note that the "fin" index runs to the end of the string; I will explain why in a second. Keep in mind that these are Lua's 1-based indices, so 4 here corresponds to Python's 3. I hope this is fine, because rfind only returns the starting index anyway, so this should be an appropriate replacement.
Let's reread the code to see how it works:
sub I've already explained
There is no longer a need for index in string.find
Alright, what's this pattern "\n[^\n]-$"?
$ - anchor to the end of the string
[^x] - match "not x"
- - as few matches as possible (even 0) of the previous character or set (in this case, [^\n]). This means it still works if the string ends with your substring.
It begins with \n, so all together it means: "Find me a line break followed by no other line breaks, up to the end of the string." This means that even though your substring only contains 1 instance of \n, if you were to use this on a string with multiple instances, you would still get the highest index, as rfind does.
Note that the indices string.find returns always span the whole match, even if part of the pattern is wrapped in a capture group (()), so it would be in vain to wrap the \n in a group. As a consequence, I cannot stop the end anchor $ from extending the fin variable to the end of the string.
I hope this works well for you.
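For readers coming from the Python side, here is a rough sketch of the same "last newline, anchored at the end" idea using Python's re module, checked against str.rfind. The helper name last_newline is made up for this example, and the indices are 0-based, unlike the Lua output above.

```python
import re

s = "Hey\n There And Yea\n"


def last_newline(text):
    # A newline, then any run of non-newline characters, then the very end.
    # \Z is used instead of $ because in Python $ also matches just before
    # a trailing newline, which would let an earlier newline match.
    m = re.search(r"\n[^\n]*\Z", text)
    return m.start() if m else -1


# Each pair should agree.
print(last_newline(s[:5]), s.rfind("\n", 0, 5))
print(last_newline(s), s.rfind("\n"))
```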
Function to do this for substrings of any length
I will not be explaining this one.
function string.rfind(str, substr, plain) --plain is included for you to pass to find if you wish to ignore patterns
    assert(substr ~= "") --An empty substring would cause an endless loop. Bad!
    local plain = plain or false --default plain to false if not included
    local index = 0
    --[[
    Watch closely... we continually shift the starting point after each found index until nothing is left.
    At that point, we find the difference between the original string's length and the new string's length, to see how many characters we cut out.
    ]]--
    while true do
        local new_start, _ = string.find(str, substr, index, plain) --index will continually push up the string to after whenever the last index was.
        if new_start == nil then --no match is found
            if index == 0 then return nil end --if no match is found and the index was never changed, return nil (there was no match)
            return #str - #str:sub(index) --if no match is found and we have some index, do math.
        end
        --print("new start", new_start)
        index = new_start + 1 --ok, there was some kind of match. set our index to whatever that was, and add 1 so that we don't get stuck in a loop of rematching the start of our substring.
    end
end
If you'd like to see my entire "test suite" for this...
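Since the function above comes without an explanation, here is a minimal Python sketch of the same forward-scanning idea: keep searching just past the previous hit and remember the last position that matched. The name rfind_by_scanning is invented for this illustration, and it returns -1 on failure to follow Python's convention rather than nil.

```python
def rfind_by_scanning(text, needle):
    """Return the 0-based index of the last occurrence of needle, or -1."""
    if not needle:
        # Mirrors the assert in the Lua version: an empty needle is rejected.
        raise ValueError("needle must not be empty")
    last = -1
    pos = text.find(needle)
    while pos != -1:
        last = pos
        # Restart one character after the previous hit so overlapping
        # occurrences are still seen and the loop always advances.
        pos = text.find(needle, pos + 1)
    return last


print(rfind_by_scanning("foo foo foo", "foo"), "foo foo foo".rfind("foo"))
```

This does one forward search per occurrence, which is why it is slower than a true right-to-left search on strings with many matches.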
Related
I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!
I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string, but making it a raw string means we
# don't need to escape the backslashes.
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
"Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start."
How do you know when to replace in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with expected output.
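If the spacer and tail lengths vary from call to call, the forward case can be wrapped in a small function. This is only a sketch: the name highlight and its parameters are made up here, and re.escape is added on the assumption that to_find might one day contain regex metacharacters. The reverse case is left out because its expected output is not specified.

```python
import re


def highlight(sequence, to_find, spacer_len, tail_len):
    """Wrap to_find, the next spacer_len chars and the next tail_len chars in spans."""
    # re.escape keeps to_find literal; the two .{n} groups match by length only.
    pattern = f'({re.escape(to_find)})(.{{{spacer_len}}})(.{{{tail_len}}})'
    repl = (r'<span color="blue">\1</span>'
            r'<span>\2</span>'
            r'<span color="green">\3</span>')
    return re.sub(pattern, repl, sequence)


print(highlight('AATCGATCGTATATCTGCGTAGACTCTGTGCATGC', 'AATCGA', 1, 4))
```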
so I have this list:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
the task is to count the number of strings that have XML tags. what I have so far (that works) is this:
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in range(len(tokens)):
    if tokens[i].find('>') >1:
        print(tokens[i])
        count += 1
        print(count)
    else:
        count += 0
what puzzles me is that I'm inclined to use the following line for the if statement
if tokens[i].find('>') == True:
but it won't work.
what's the optimal way of writing this loop, in your opinion?
many thanks!
alex.
One issue I see with your approach is that it might capture false positives (e.g. "gree>ting"), so checking only for a closing angle bracket is not enough.
If your definition of "contains a tag" simply means checking whether the string contains a < followed by some characters, then another >, you could use a regular expression (keeping this in mind in case you were thinking about something more complex).
This, combined with the compact list comprehension proposed by @aws_apprentice in the comments, gives us:
import re
regex = "<.+>"
count = sum([1 if re.search(regex, t) else 0 for t in tokens])
print(count) #done!
Explanation:
This one-liner is called a list comprehension, and it builds a list of ones and zeros. For each string t in tokens, if the string contains a tag, 1 is added to the new list, otherwise 0. re.search checks whether the string (or a substring of it) matches the given regex.
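A slightly stricter variant, as a sketch: the pattern below does not allow another < or > inside the tag, and a generator expression avoids building the intermediate list. The names TAG and count_tagged are invented for this example, and this is still a rough check rather than real XML parsing.

```python
import re

# "<", then one or more characters that are not angle brackets, then ">".
TAG = re.compile(r"<[^<>]+>")


def count_tagged(tokens):
    """Count the strings that contain something shaped like <tag>."""
    return sum(1 for t in tokens if TAG.search(t))


print(count_tagged(['<greeting>', 'Hello World!', '</greeting>']))
```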
The following approach checks for the opening < at the start of the string and also checks for > at the end of the string.
In [4]: tokens = ['<greeting>', 'Hello World!', '</greeting>']
In [5]: sum([1 if i.startswith('<') and i.endswith('>') else 0 for i in tokens])
Out[5]: 2
Anis R.'s answer should work fine, but this is a non-regex alternative (and not as elegant; in fact I would call it clumsy).
This code just looks at the beginning and end of each list element for angle brackets. I'm a novice to the extreme, but I think range(len(tokens)) is redundant, and the loop can be simplified like this as well.
tokens = ['<greeting>', 'Hello World!', '</greeting>']
count = 0
for i in tokens:
    if i[0].find('<') == 0 and i[-1].find('>') != -1:
        print(i)
        count += 1
print(count)
str.find() returns an index position, not a boolean, as others have noted, so your if statement must reflect that. A .find() with no result returns -1. As you can see, for the first angle bracket, checking for an index of 0 will work as long as your data follows the scheme in your example list. The second if component is a negative check (using !=), since it looks at the last character of the list item. I don't think you could use a positive check there since, again, .find() returns an index position and your data presumably has variable lengths. I'm sure you could turn that check into a positive one by adding more code, but the shortcut seems satisfactory in your case to me. The only time it wouldn't work is if your list components can look like '<greeting> Hello'
Happy to be corrected by others, that's why I'm here.
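To make the point about find() concrete, here is a small sketch showing why comparing its result with True does not behave like a membership test, and how the in operator expresses the intent directly. It uses only the list from the question.

```python
tokens = ['<greeting>', 'Hello World!', '</greeting>']

for t in tokens:
    idx = t.find('>')  # index of the first '>', or -1 if there is none
    # idx == True only holds when idx happens to be 1, so it is not a
    # "found it" test; the `in` operator is.
    print(repr(t), idx, idx == True, '>' in t)

# Counting with a membership test instead of an index comparison.
count = sum(1 for t in tokens if '>' in t)
print(count)
```

Note that -1 is truthy in Python, so a bare `if t.find('>'):` would also be wrong: it is true when nothing is found and false when the match is at index 0.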
How does SpaCy keep track of character and token offsets during tokenization?
In SpaCy, there's a Span object that keeps the start and end offset of the token/span https://spacy.io/api/span#init
There's a _recalculate_indices method that seems to retrieve token_by_start and token_by_end, but that looks like all the recalculation is doing.
When looking at extraneous spaces, it's doing some smart alignment of the spans.
Does it recalculate after every regex execution, or does it keep track of the characters' movement? Does it do a span search after the regexes have run?
Summary:
During tokenization, this is the part that keeps track of offset and character.
Simple answer: It goes character by character in the string.
TL;DR is at the bottom.
Explained chunk by chunk:
It takes in the string to be tokenized and starts iterating through it one letter/space at a time.
It is a simple for loop on the string where uc is the current character in the string.
for uc in string:
It first checks whether the current character is whitespace and compares that with the current in_ws flag. If the two are the same, it skips the block below and goes straight to i += 1.
in_ws decides whether anything needs processing at this character. They want to act on spaces as well as on other characters, so they can't just track isspace() and operate only on False. Instead, in_ws starts out as the result of string[0].isspace(), so the comparison for the first character is always equal. The loop therefore just increases i (discussed later) and moves to the next uc until it reaches a uc whose isspace() differs from in_ws. In practice this lets it run through multiple spaces after having treated the first space, or through multiple characters until it reaches the next space boundary.
if uc.isspace() != in_ws:
It will continue to go through characters until it reaches the next boundary, keeping the index of the current character as i.
It tracks two index values: start and i. start is the start of the potential token that it is on, and i is the ending character it is looking at. When the script starts, start will be 0. After a cycle of this, start will be the index of the last space plus 1 which would make it the first letter of the current word.
It checks first if start is less than i which is used to know if it should attempt to check the cache and tokenize the current character sequence. This will make sense further down.
if start < i:
span is the word that is currently being looked at for tokenization. It is the string sliced by the start index value through the i index value.
span = string[start:i]
It is then taking the hash of the word (start through i) and checking the cache dictionary to see if that word has been processed already. If it has not it will call the _tokenize method on that portion of the string.
key = hash_string(span)
cache_hit = self._try_cache(key, doc)
if not cache_hit:
    self._tokenize(doc, span, key)
Next it checks to see if the current character uc is an exact space. If it is, it resets start to be i + 1 where i is the index of the current character.
if uc == ' ':
    doc.c[doc.length - 1].spacy = True
    start = i + 1
If the character is not a space, it sets start to be the current character's index. It then reverses in_ws, indicating it is a character.
else:
    start = i
in_ws = not in_ws
And then it increases i += 1 and loops to the next character.
i += 1
TL;DR
So all of that said, it keeps track of the character in the string that it is on using i and it keeps the start of the word using start. start is reset to the current character at the end of processing for a word, and then after the spaces it is set to the last space plus one (the start of the next word).
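To see the start/i bookkeeping in isolation, here is a stripped-down Python sketch of the loop described above. It is not spaCy's code: there is no cache, no doc object and no _tokenize call, the handling of the final chunk after the loop is my own assumption, and the function name whitespace_spans is invented for this example. It only records the (start, end) character offsets each chunk would have.

```python
def whitespace_spans(text):
    """Return (start, end, chunk) triples using the start/i/in_ws bookkeeping."""
    spans = []
    if not text:
        return spans
    i = 0       # index of the character currently being looked at
    start = 0   # index where the current chunk began
    in_ws = text[0].isspace()
    for uc in text:
        if uc.isspace() != in_ws:
            # A boundary between whitespace and non-whitespace was crossed.
            if start < i:
                spans.append((start, i, text[start:i]))
            if uc == ' ':
                # A single space is absorbed; the next chunk starts after it.
                start = i + 1
            else:
                start = i
            in_ws = not in_ws
        i += 1
    # Whatever is left after the last boundary is the final chunk.
    if start < i:
        spans.append((start, i, text[start:i]))
    return spans


print(whitespace_spans("Hello world"))
print(whitespace_spans("Hello  world"))
```

With two spaces between the words, the second space shows up as its own chunk, which matches the description of start being set to the last space plus one.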
This is my first SO post, so go easy! I have a script that counts how many matches occur in a string named postIdent for the substring ff. Based on this it then iterates over postIdent and extracts all of the data following it, like so:
substring = 'ff'
global occurences
occurences = postIdent.count(substring)
x = 0
while x <= occurences:
    for i in postIdent.split("ff"):
        rawData = i
        required_Id = rawData[-8:]
        x += 1
To explain further, if we take the string "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff", it is clear there are 3 instances of ff. I need to get the 8 preceding characters at every instance of the substring ff, so for the first instance this would be 909a9090.
With the rawData, I essentially need to offset the variable required_Id by -1 when I get the data out of the split() method, as I am currently getting the last 8 characters of the current string, not the string I have just split. Another way of doing it could be to pass the current required_Id to the next iteration, but I've not been able to do this.
The split method gets everything after the matching string ff.
Using the partition method can get me the data I need, but does not allow me to iterate over the string in the same way.
Get the last 8 characters of each split using a slice operation in a list comprehension:
s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print([x[-8:] for x in s.split('ff') if x])
# ['909a9090', '90434390', 'sdfs9000']
Not a difficult problem, but tricky for a beginner.
If you split the string on 'ff' then you appear to want the eight characters at the end of every substring but the last. The last eight characters of string s can be obtained using s[-8:]. All but the last element of a sequence x can similarly be obtained with the expression x[:-1].
Putting both those together, we get
subject = '090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff'
for x in subject.split('ff')[:-1]:
    print(x[-8:])
This should print
909a9090
90434390
sdfs9000
I wouldn't do this with split myself; I'd use str.find. This code isn't fancy, but it's pretty easy to understand:
fullstr = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
search = "ff"
found = None  # offset of the next match
last = 0      # where the next search starts
l = 8         # number of preceding characters to extract
print(fullstr)
while True:
    found = fullstr.find(search, last)
    if found == -1:
        break
    # clamp at 0 so a match within the first l characters doesn't give a negative slice start
    preceding = fullstr[max(0, found - l):found]
    print("At position {} found preceding characters '{}' ".format(found, preceding))
    last = found + len(search)
Overall I like Austin's answer more; it's a lot more elegant.
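Another option, sketched here as an alternative to the answers above, is a regex with a capture group for the eight characters in front of each marker. The function name preceding_ids and its parameters are made up for this example. Unlike the slicing versions, it skips any marker that has fewer than width characters available in front of it.

```python
import re


def preceding_ids(text, marker="ff", width=8):
    """Return the `width` characters found directly before each marker."""
    # re.escape keeps the marker literal; the group captures exactly `width` chars.
    pattern = r"(.{%d})%s" % (width, re.escape(marker))
    return re.findall(pattern, text)


s = "090fd0909a9090ff90493090434390ff90904210412419ghfsdfs9000ff"
print(preceding_ids(s))
```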
I want to use a regex to find a substring, followed by a variable number of characters, followed by any of several substrings.
an re.findall of
"ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
should give me:
['ATGTCAGGTAA', 'ATGTCAGGTAAGCTTAG', 'ATGTCAGGTAAGCTTAGGGCTTTAG']
I have tried all of the following without success:
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
re.findall('(ATG.*TAA)|(ATG.*TAG)', string2)
re.findall('ATG.*(TAA|TAG)', string2)
re.findall('ATG.*((TAA)|(TAG))', string2)
re.findall('ATG.*(TAA)|(TAG)', string2)
re.findall('ATG.*(TAA)|ATG.*(TAG)', string2)
re.findall('(ATG.*)(TAA)|(ATG.*)(TAG)', string2)
re.findall('(ATG.*)TAA|(ATG.*)TAG', string2)
What am I missing here?
This is not super-easy, because a) you want overlapping matches, and b) you want greedy and non-greedy and everything in between.
As long as the strings are fairly short, you can check every substring:
import re
s = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
p = re.compile(r'ATG.*TA[GA]$')
for start in range(len(s)-5):  # a match is at least 6 letters long
    for end in range(start+6, len(s)+1):  # +1 so a match ending at the end of s is found too
        if p.match(s, pos=start, endpos=end):
            print(s[start:end])
This prints:
ATGTCAGGTAA
ATGTCAGGTAAGCTTAG
ATGTCAGGTAAGCTTAGGGCTTTAG
Since you appear to work with DNA sequences or something like that, make sure to check out Biopython, too.
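For this particular shape of pattern (a fixed start, anything in between, one of a few fixed endings), a plain index scan gives the same substrings without a regex. This is a sketch under that assumption only. The function name and its default arguments are invented for this example, and it does not care about reading frames or anything else biological.

```python
def start_to_stop_substrings(seq, start="ATG", stops=("TAA", "TAG")):
    """List every substring that begins with `start` and ends with one of `stops`."""
    results = []
    pos = seq.find(start)
    while pos != -1:
        # Try every position after the start marker as the beginning of a stop.
        for end in range(pos + len(start), len(seq) - 2):
            if seq[end:end + 3] in stops:
                results.append(seq[pos:end + 3])
        # Look for the next start marker, allowing overlaps.
        pos = seq.find(start, pos + 1)
    return results


print(start_to_stop_substrings("ATGTCAGGTAAGCTTAGGGCTTTAGGATT"))
```

The inner range assumes three-letter stops, as in the question; other lengths would need the bounds adjusted.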
I like the accepted answer just fine :-) That is, I'm adding this for info, not looking for points.
If you have heavy need for this, trying a match on O(N^2) pairs of indices may soon become unbearably slow. One improvement is to use the .search() method to "leap" directly to the only starting indices that can possibly pay off. So the following does that.
It also uses the .fullmatch() method so that you don't have to artificially change the "natural" regexp (e.g., in your example, no need to add a trailing $ to the regexp - and, indeed, in the following code doing so would no longer work as intended). Note that .fullmatch() was added in Python 3.4, so this code also requires Python 3!
Finally, this intends to generalize the re module's finditer() function/method. While you don't need match objects (you just want strings), they're far more generally applicable, and returning a generator is often friendlier than returning a list too.
So, no, this doesn't do exactly what you want, but does things from which you can get what you want, in Python 3, faster:
def finditer_overlap(regexp, string):
    start = 0
    n = len(string)
    while start <= n:
        # don't know whether regexp will find shortest or
        # longest match, but _will_ find leftmost match
        m = regexp.search(string, start)
        if m is None:
            return
        start = m.start()
        for finish in range(start, n+1):
            m = regexp.fullmatch(string, start, finish)
            if m is not None:
                yield m
        start += 1
Then, e.g.,
import re
string2 = "ATGTCAGGTAAGCTTAGGGCTTTAGGATT"
pat = re.compile("ATG.*(TAA|TAG)")
for match in finditer_overlap(pat, string2):
    print(match.group())
prints what you wanted in your example. The other ways you tried to write a regexp should also work. In this example it's faster because the second time around the outer loop start is 1, and regexp.search(string, 1) fails to find another match, so the generator exits at once (so skips checking O(N^2) other index pairs).