How does spaCy keep track of character and token offsets during tokenization?

In spaCy, there's a Span object that keeps the start and end offset of the token/span: https://spacy.io/api/span#init
There's a _recalculate_indices method that seems to retrieve token_by_start and token_by_end, but that looks like all the recalculation is doing.
When looking at extraneous spaces, it does some smart alignment of the spans.
Does it recalculate after every regex execution? Does it keep track of the characters' movement? Does it do a span search after the regexes have executed?

Summary:
During tokenization, this is the part that keeps track of offset and character.
Simple answer: It goes character by character in the string.
TL;DR is at the bottom.
Explained chunk by chunk:
It takes in the string to be tokenized and starts iterating through it one letter/space at a time.
It is a simple for loop on the string where uc is the current character in the string.
for uc in string:
It first checks whether the current character is whitespace and compares that with in_ws, which records whether the loop is currently inside a run of whitespace. If the two are the same, no boundary has been crossed, so it jumps down and does i += 1.
in_ws is used to decide whether anything should be processed. The tokenizer needs to act on whitespace as well as on other characters, so it can't just track isspace() and operate only on False. Instead, in_ws starts out as the result of string[0].isspace(), and each character is compared against it. If string[0] is whitespace, the first comparison evaluates the same and therefore skips down, increases i (discussed later) and moves on to the next uc, until it reaches a uc whose isspace() differs. In practice this lets it run through multiple spaces after having treated the first one, or through multiple characters until it reaches the next whitespace boundary.
if uc.isspace() != in_ws:
It will continue to go through characters until it reaches the next boundary, keeping the index of the current character as i.
It tracks two index values: start and i. start is the start of the potential token that it is on, and i is the ending character it is looking at. When the script starts, start will be 0. After a cycle of this, start will be the index of the last space plus 1 which would make it the first letter of the current word.
It checks first if start is less than i which is used to know if it should attempt to check the cache and tokenize the current character sequence. This will make sense further down.
if start < i:
span is the word that is currently being looked at for tokenization. It is the string sliced by the start index value through the i index value.
span = string[start:i]
It is then taking the hash of the word (start through i) and checking the cache dictionary to see if that word has been processed already. If it has not it will call the _tokenize method on that portion of the string.
key = hash_string(span)
cache_hit = self._try_cache(key, doc)
if not cache_hit:
self._tokenize(doc, span, key)
Next it checks to see if the current character uc is an exact space. If it is, it resets start to be i + 1 where i is the index of the current character.
if uc == ' ':
doc.c[doc.length - 1].spacy = True
start = i + 1
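If the character is not an exact space (for example a letter, or another whitespace character such as a tab or newline), it sets start to the current character's index. After either branch it flips in_ws, because a boundary between whitespace and non-whitespace has just been crossed.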
else:
start = i
in_ws = not in_ws
And then it increases i += 1 and loops to the next character.
i += 1
TL;DR
So all of that said, it keeps track of the character in the string that it is on using i, and it keeps the start of the current chunk using start. start is reset at every boundary between whitespace and non-whitespace: to i + 1 when the boundary character is a single space (the space is recorded on the previous token and the next word starts right after it), and to i otherwise.
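The walkthrough above can be condensed into a small pure-Python sketch. This is not spaCy's code: whitespace_chunks is a hypothetical helper that only reproduces the start / i / in_ws bookkeeping and records the character offsets of each chunk, where the real tokenizer would consult its cache or call _tokenize. The final flush after the loop is an assumption added so the last chunk is not lost.

```python
def whitespace_chunks(text):
    """Return (start, end, chunk) triples using the start / i / in_ws bookkeeping."""
    chunks = []
    if not text:
        return chunks
    start = 0
    in_ws = text[0].isspace()
    for i, uc in enumerate(text):
        # Only act when we cross a whitespace / non-whitespace boundary.
        if uc.isspace() != in_ws:
            if start < i:
                # The real tokenizer would hash this slice and check its cache here.
                chunks.append((start, i, text[start:i]))
            if uc == " ":
                # A single space is skipped: the next chunk starts after it.
                start = i + 1
            else:
                start = i
            in_ws = not in_ws
    # Assumption: flush whatever is left once the loop ends.
    if start < len(text):
        chunks.append((start, len(text), text[start:]))
    return chunks


print(whitespace_chunks("Hello  world"))
# [(0, 5, 'Hello'), (6, 7, ' '), (7, 12, 'world')]
```

With two spaces between the words, the first space is absorbed at the boundary and the second one comes out as its own chunk, which is the "extraneous spaces" alignment the question asks about.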

Related

Replace a substring with defined region and follow up variable region in Python

I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!

I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be a f-string. But making it a raw string means we
# don't need to escape the slashes
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
How do you know when to replace it in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with expected output.
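For the forward case, the pattern from this answer can be wrapped in a small function. This is only a sketch: highlight and its parameters spacer_len and tail_len are hypothetical names, and re.escape is added on the assumption that to_find should always be matched literally, even if it ever contains regex metacharacters.

```python
import re


def highlight(sequence, to_find, spacer_len, tail_len):
    """Wrap to_find, the next spacer_len chars and the next tail_len chars in spans."""
    pattern = re.compile(
        f"({re.escape(to_find)})(.{{{spacer_len}}})(.{{{tail_len}}})"
    )
    repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
    return pattern.sub(repl, sequence)


print(highlight("AATCGATCGTATATCTGCGTAGACTCTGTGCATGC", "AATCGA", 1, 4))
```

Passing the lengths rather than the spacer and tail strings matches the question's remark that both are "variable based on number and not on the character".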

How would I implement a rfind in Lua with the arguments?

For example, I would like to do something like this in lua:
s = "Hey\n There And Yea\n"
print(s.rfind("\n", 0, 5))
I've tried making this in lua with the string.find function:
local s = "Hey\n There And Yea\n"
local _, p = s:find(".*\n", -5)
print(p)
But these aren't producing the same results. What am I doing wrong, and how can I fix this to make it behave the same as rfind?

Lua has a little-known function, string.reverse, that reverses all characters of a string. While it is rarely needed, it can be used to search a string in reverse.
So to implement rfind, you search for the reversed pattern inside the reversed original string, and then do some arithmetic to obtain the offset in the original string.
Here is the code that mimics Python rfind:
function rfind(subject, tofind, startIdx, endIdx)
    startIdx = startIdx or 0
    endIdx = endIdx or #subject
    subject = subject:sub(startIdx+1, endIdx):reverse()
    tofind = tofind:reverse()
    local idx = subject:find(tofind)
    return idx and #subject - #tofind - idx + startIdx + 1 or -1
end
print(rfind("Hello World", "H")) --> 0
print(rfind("Hello World", "l")) --> 9
print(rfind("foo foo foo", "foo")) --> 8
print(rfind("Hello World", "Toto")) --> -1
print(rfind("Hello World", "l", 1, 4)) --> 3
Note that this version of rfind uses the Python index convention, starting at 0 and returning -1 if the string is not found. It would be more consistent in Lua to use 1-based indices and to return nil when there is no match. The modification would be trivial.
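For comparison, here is what Python's built-in str.rfind returns for the same inputs as the Lua test calls above, plus the call from the question. It is included only as a reference to check a Lua port against; the cases list is just a convenience for this sketch.

```python
# (subject, substring, optional start, optional end) -> str.rfind result
cases = [
    (("Hello World", "H"), 0),
    (("Hello World", "l"), 9),
    (("foo foo foo", "foo"), 8),
    (("Hello World", "Toto"), -1),
    (("Hello World", "l", 1, 4), 3),
    (("Hey\n There And Yea\n", "\n", 0, 5), 3),  # the call from the question
]

for (subject, *args), expected in cases:
    result = subject.rfind(*args)
    print(repr(subject), args, "->", result)
    assert result == expected
```

The start and end arguments behave like slice bounds: 0-indexed, with an exclusive end.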

The pattern I have written only works for single-character substrings, like the one the asker used as a test case. Skip ahead to "The answer" to see it, or read on for an explanation of what went wrong in the original attempt. Skip to the final header for a general, inefficient solution for multi-character substrings.
I have tried to recreate the output of Python's mystring.rfind with Lua's mystring:find, but it only works for single-character substrings. Later I will show a function that handles all cases but relies on a fairly crude loop.
As a recap (to address what you're doing wrong), let's talk about mystringvar:find("pattern", index), sugar for string.find(mystringvar, "pattern", index). This will return start, stop indexes.
The optional Index sets the start, not the end, but a negative index will count backwards from the 'right minus index' to end of string (an index of -1 will only evaluate the last character, -2 the last 2). This is not the desired behavior.
Instead of trying to use the index to create a substring, you should create a substring like this:
mystringvar:sub(start, end) will extract and return the substring from start to end (1 indexed, inclusive end). So to recreate Python's 0-5 (0 indexed, exclusive end), use 1-5.
Now note that these methods can be chained into string:sub(x, y):find("") but I will break it up for ease of reading. Without further ado, I present you:
The answer
local s = "Hey\n There And Yea\n"
local substr = s:sub(1,5)
local start, fin = substr:find("\n[^\n]-$")
print(start, ",", fin)
I had a few half measure solutions, but to make sure what I was writing would work for multiple substring instances (the 1-5 substring only contains 1), I tested with the substring and the whole string. Observe:
output with sub(1, 5): 4 , 5
output with sub(1, 19) (the whole length): 19 , 19
These both correctly report the beginning of the rightmost substring, but note that the "fin" index goes to the end of the sentence, I will explain in a second. I hope this is fine because rfind only returns the starting index anyway, so this should be an appropriate replacement.
Let's reread the code to see how it works:
sub I've already explained
There is no longer a need for index in string.find
Alright, what's this pattern "\n[^\n]-$"?
$ - anchor to the end of the string
[^x] - match "not x"
- - as few matches as possible (even 0) of the previous character or set (in this case, [^\n]). This means that if a string ends with your substring, it will still work.
It begins with \n, so all together it means: "Find me a line break, but followed by no other line breaks, up to the end of the sentence." This means that even though your substring only contains 1 instance of \n, if you were to use this function on a string with multiple substrings, you would still get the highest index, as rfind does.
Note that string.find does not conform to pattern groups (()), so it would be vain to wrap the \n in a group. As a consequence, I cannot stop end-anchoring $ from extending the fin variable to the end of the sentence.
I hope this works well for you.
Function to do this for substrings of any length
I will not be explaining this one.
function string.rfind(str, substr, plain) --plain is included for you to pass to find if you wish to ignore patterns
    assert(substr ~= "") --An empty substring would cause an endless loop. Bad!
    local plain = plain or false --default plain to false if not included
    local index = 0
    --[[
    Watch closely... we continually shift the starting point after each found index until nothing is left.
    At that point, we find the difference between the original string's length and the new string's length, to see how many characters we cut out.
    ]]--
    while true do
        local new_start, _ = string.find(str, substr, index, plain) --index will continually push up the string to after whenever the last index was.
        if new_start == nil then --no match is found
            if index == 0 then return nil end --if no match is found and the index was never changed, return nil (there was no match)
            return #str - #str:sub(index) --if no match is found and we have some index, do math.
        end
        --print("new start", new_start)
        index = new_start + 1 --ok, there was some kind of match. set our index to whatever that was, and add 1 so that we don't get stuck in a loop of rematching the start of our substring.
    end
end
If you'd like to see my entire "test suite" for this...

Optimal way to "stamp" string into desired string

So, I was looking for an algorithm for the following problem:
You are given a desired string s, and a stamp t. t is also a string. Let the beginning string be len(s)*"?".
Is it possible to use the stamp to transform the beginning string into the string s using the stamp? The whole stamp must fit inside the beginning string (the stamp's borders may not exceed the ?????... string's borders).
Print the number of stamps required and print the left border of the stamp for each stamping.
Example:
AABCACA (desired result)
ABCA (stamp)
Solution:
3
1 4 2
explanation: ??????? → ABCA??? → ABCABCA → AABCACA.
My solution:
If the stamp's first letter is not the desired string's first letter, the task is not possible. The same goes for the last letter. If the stamp doesn't have all the letters in the desired string, the task is impossible.
My algorithm goes like this: try to find the stamp in the desired string. If it is found, delete it and replace it with question marks. Mark down the left border of the stamp. Do this as long as you can.
Then look for the stamp's contiguous subarrays of size len(stamp)-1. If you find any of those, delete them and replace with question marks. Mark down the left border of the stamp.
Then look for the stamp's contiguous subarrays of size len(stamp)-2. If you find any of those, delete them and replace with question marks. Mark down the left border of the stamp. Do that until you are finished. There you have the answer.
The problems
I'm not sure what is wrong with my code as it can't seem to pass some test cases. There is probably a logical error.
import sys
desiredString = input()
stamp = input()
stampingSpots = []
if (len(set(desiredString)) != len(set(stamp)) or stamp[0] != desiredString[0] or stamp[-1] != desiredString[-1]):
    print("-1")
    sys.exit()
def searchAndReplace(stringToFind, fix): #Search for stringToFind and replace it with len(stringToFind)*"?". Fix is used to fix the position.
    global desiredString
    for x in range(0, len(desiredString)-len(stringToFind)+1):
        if desiredString[x:x+len(stringToFind)] == stringToFind:
            stampingSpots.append(x+1-fix) #Mark down the stamping spot
            firstPart = desiredString[:x]
            firstPart += len(stringToFind)*"?"
            firstPart += desiredString[len(firstPart):]
            desiredString = firstPart
            return True
    return False
while(searchAndReplace(stamp,0)): #Search for the full stamp in desiredString
    searchAndReplace(stamp,0)
length = len(stamp)-1
while(length > 0):
    for firstPart in range(0, len(stamp)-length+1):
        secondPart = firstPart+length
        while(searchAndReplace(stamp[firstPart:secondPart], firstPart)):
            searchAndReplace(stamp[firstPart:secondPart], firstPart)
        if len(stampingSpots) > 10*len(desiredString): #Too much output, not possible
            print("-1")
            sys.exit()
    length -= 1
print(len(stampingSpots))
for i in reversed(stampingSpots):
    print(i, end = " ")

The algorithm you describe is fundamentally flawed. The results it produces simply don't correspond to things the stamp can actually do. For example, with stamp AB and string AAA, it will try to stamp beyond the borders of the string to apply the final A. It will also try to use the AB and BC substrings of the stamp ABC directly next to each other for the string ABBC, but no actual application of the stamp can do that.
The stamp cannot be used to apply arbitrary substrings of the stamp string. It can be used to stamp over previous stamp applications, but your algorithm doesn't consider the full complexity of overstamping. Also, even if you could stamp arbitrary substrings of the stamp string, you haven't proven your algorithm minimizes stamp applications.
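One commonly used way to respect these constraints is to work backwards from the finished string: repeatedly find a window where the whole stamp fits (treating already-erased cells as wildcards), erase it, and reverse the recorded order at the end. The sketch below is not from this thread; stamp_positions is a hypothetical name, and it only aims to return some valid stamping order with in-bounds, full-stamp placements. It makes no claim that the number of stamps is minimal.

```python
def stamp_positions(target, stamp):
    """Return 1-based left borders in stamping order, or None if impossible."""
    n, m = len(target), len(stamp)
    if m > n:
        return None
    work = list(target)
    used = [False] * (n - m + 1)  # each window is un-stamped at most once
    order = []                    # windows in un-stamping (reverse) order
    remaining = n                 # cells not yet erased to '?'
    progress = True
    while remaining and progress:
        progress = False
        for left in range(n - m + 1):
            if used[left]:
                continue
            window = work[left:left + m]
            # The full stamp must fit; erased cells act as wildcards.
            fits = all(c == "?" or c == s for c, s in zip(window, stamp))
            fresh = sum(c != "?" for c in window)
            if fits and fresh:
                work[left:left + m] = "?" * m
                remaining -= fresh
                used[left] = True
                order.append(left + 1)
                progress = True
    return order[::-1] if remaining == 0 else None


print(stamp_positions("AABCACA", "ABCA"))  # [1, 4, 2]
print(stamp_positions("AAA", "AB"))        # None
```

Because every placement is a full stamp inside the string, both failure modes described above (stamping past the border, and using bare substrings of the stamp) are ruled out by construction.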

We can use divide and conquer: let f(s) represent the minimum number of stamps required to generate string s, where "*" is a wildcard. Then:
Greedily pick the part of the string that is the largest match for the stamp.
Set that part to wildcards and pass the parts to its left and right to f.
For example:
AABCACA (desired result)
ABCA (stamp)
f(AABCACA)
^^^^
ABCA (match)
= 1 + f(A****) + f(****CA)
=> f(A****)
^^^^
ABCA (match)
=> f(****CA)
^^^^
ABCA
Total 3

In Python I'm trying to parse a string where the element and the index plus one returns the desired index

Here is my code challenge and my code. I'm stuck and not sure why it's not working properly.
- Write a function named plaintext that takes a single parameter: a string encoded in this format: before each character of the message, add a digit and a series of other characters. The digit should correspond to the number of characters that will precede the message's actual, meaningful character. It should return the decoded word in string form.
""" my pseudocode:
#convert string to a list
#enumerate list
#parse string where the element and the index plus one returns the desired index
#return decoded message of desired indexes """
encoded_message = "0h2ake1zy"
#encoded_message ="2xwz"
#encoded_message = "0u2zyi2467"
def plaintext(string):
while(True):
#encoded_message = raw_input("enter encoded message:")
for index, character in enumerate(list(encoded_message)):
character = int(character)
decoded_msg = index + character + 1
print decoded_msg

You need to iterate over the string's characters, and in each iteration skip the specified number of characters and take the following one:
def plaintext(s):
    res = ''
    i = 0
    while i < len(s):
        # Skip the number of chars specified
        i += int(s[i])
        # Take the letter after them
        i += 1
        res += s[i]
        # Move on to the next position
        i += 1
    return res
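To check the approach, here is a compact restatement of the same skip-then-take loop, run on the three sample strings from the question. The local name skip is introduced only for readability, and like the original it assumes well-formed input where every count is a single digit.

```python
def plaintext(encoded):
    """Decode a string of repeated <digit><digit filler chars><letter> groups."""
    decoded = []
    i = 0
    while i < len(encoded):
        skip = int(encoded[i])                  # how many filler chars follow the digit
        decoded.append(encoded[i + skip + 1])   # the meaningful character
        i += skip + 2                           # jump to the next group's digit
    return "".join(decoded)


print(plaintext("0h2ake1zy"))   # hey
print(plaintext("2xwz"))        # z
print(plaintext("0u2zyi2467"))  # ui7
```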

Here are some hints.
First decide what looping construct you want to use. Python offers choices: iterate over individual characters, loop over the indices of the characters, while loop. You certainly don't want both a while and a for loop.
You're going to be processing the string in groups, "0h", then "2ake", then "1zy" to take your first example string. What is the condition that will cause you to exit the loop?
Now, look at your line decoded_msg = index + character + 1. To construct the decoded string, you want to index into the string itself, based on the digit's value. So, this line should contain something like, encoded_message[x] for some x, that you have to figure out using the digit.
Also, you'll want to accumulate characters as you go along. So you'll need to begin the loop with an empty result string decoded_msg="" and add a character to it decoded_msg += ... for each iteration of the loop.
I hope this helps a little more than just giving the answer.

How to understand this code, to split an array in Python?

I'm a bit confused about a line in Python:
We use Python and a custom function to split a line: we want what is between quotes to be a single entry in the array.
The line is, for example:
"La Jolla Bank, FSB",La Jolla,CA,32423,19-Feb-10,24-Feb-10
So "La Jolla Bank, FSB" should be a single entry in the array.
I'm not sure I understand this code:
The first char is a quote '"', so the variable "quote" is set to its inverse, so set to "TRUE".
Then we check the comma, AND if quote is set to its inverse, so if quote is TRUE, which is the case when we are inside the quotes.
We cut it with current="", and this is what I don't understand: we are still between the quotes, so normally we should not cut it now! Edit: so "and not quote" means "quote is False", and not "the opposite of", thanks!
Code:
def mysplit (string):
    quote = False
    retval = []
    current = ""
    for char in string:
        if char == '"':
            quote = not quote
        elif char == ',' and not quote: #the first coma is still in the quotes, and quote is set to TRUE, so we should not cut current here...
            retval.append(current)
            current = ""
        else:
            current += char
    retval.append(current)
    return retval

You're viewing it as though both if char == '"' and elif char == ',' and not quote were run.
However the if statement explicitly makes it so that only one will run.
Either, quote will be inverted OR the current value will get cut.
In the case where the current char is ", then the logic will be called to invert the quote flag. But the logic to cut the string will not run.
In the case where the current char is ,, then the logic for inverting the flag will NOT run, but the logic to cut the string will if the quote flag is not set.

That is initializing current to the empty string, wiping out whatever it may have been set to before.
As long as you are not inside quotes (ie. quote is False), when you see a ,, you have hit the end of the field. Whatever you have accumulated into current is the content of that field, so append it to retval and reset current to the empty string, ready for the next field.
That said, this looks like you're dealing with a .csv input. There is a csv module that can deal with this for you.
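As a minimal sketch of the csv suggestion, the standard library's csv.reader accepts any iterable of lines, so a single line can be wrapped in a list. The variable names here are illustrative only.

```python
import csv

line = '"La Jolla Bank, FSB",La Jolla,CA,32423,19-Feb-10,24-Feb-10'

# csv.reader yields one list of fields per input line.
fields = next(csv.reader([line]))
print(fields)
# ['La Jolla Bank, FSB', 'La Jolla', 'CA', '32423', '19-Feb-10', '24-Feb-10']
```

The quoted field keeps its comma and loses its surrounding quotes, which is the same result mysplit produces for this line.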

current is reset to empty because in the case where you have encountered ',' and you are not under "" quotes you should interpret that as an end of a "token".
This is definitely not pythonic, for char in string makes me cringe and whoever wrote this code should have used regex.

What you're looking at is a condensed version of a Finite State Machine, used by most language parsing programs.
Let's see if I can't annotate it:
def mysplit (string):
    # We start out at the beginning of the string NOT in between quotes
    quote = False
    # Hold each element that we split out
    retval = []
    # This variable holds whatever the current item we're interested in is
    # e.g: If we're in a quote, then it's everything (including commas)
    # otherwise it's everything UP UNTIL the next comma
    current = ""
    # Scan the string character by character
    for char in string:
        # We hit a quote, so turn on QUOTE SCANNING MODE!!!
        # If we're in quote scanning mode, turn it off
        if char == '"':
            quote = not quote
        # We hit a comma, and we're not in quote scanning mode
        elif char == ',' and not quote:
            # We got what we want, let's put it in the return value
            # and then reset our current item to nothing so we can prepare for the next item.
            retval.append(current)
            current = ""
        else:
            # Nothing special, let's just keep building up our current item
            current += char
    # We're done with all the characters, let's put together whatever we were working on when we ran out of characters
    retval.append(current)
    # Return it!
    return retval

This is not the best code for splitting but it is pretty straight forward
current = ""
# First you set current to the empty string. The following line
# loops through the string to be split and pulls characters out of it
# one by one, setting 'char' to the value of the next character.
for char in string:
    # The following code checks whether we are currently inside a quote;
    # otherwise it adds the current character to the 'current' variable.
    if char == '"':
        quote = not quote
    elif char == ',' and not quote:
        retval.append(current)
        ### If we see the comma, append whatever has accumulated in current to the
        ### return result, then reset current to let the next word accumulate.
        current = "" #why do we cut current here?
    else:
        current += char
### After the last char is seen, we still have leftover characters in current, which
### we can just shove into the final result.
retval.append(current)
return retval
Here is an example run:
Let string be 'a,bbb,ccc'
Step  char  current  retval
1     a     a        {}
2     ,              {a}      ### Current is reset
3     b     b        {a}
4     b     bb       {a}
5     b     bbb      {a}
6     ,              {a,bbb}  ### Current is reset
and so on

OK you aren't quite there!
1. the first char is a quote '"', so the variable "quote" is set to its inverse, so set to "TRUE".
Good! So quote was set to the inverse of whatever it was previously. At the beginning of the program it was False, so when " is seen, it becomes True. But vice versa, if it was True and a quote is seen, it becomes False.
In other words, this line of the program changes quote from whatever it was before that line. It is called 'toggling'.
2. then we check the comma, AND if quote is set to its inverse, so if quote is TRUE, which is the case when we are inside the quotes.
This isn't quite right. not quote means "only if quote is false". This has nothing to do with whether it is 'set to its inverse'. No variable can be equal to its own inverse! it is like saying X=True and X=False - obviously nonsense.
quote is always either True or False - and nothing else!
3. we cut it with current="", and this is where I don't understand: we are still between the quotes, so normally we should not cut it now!
So hopefully you can see now that, you are not between the quotes if you reach this line. the not quote ensures that you don't cut inside a quote, because not quote really means just that - not in a quote!
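To see the toggle in action, this small sketch walks a shortened version of the line from the question and records, for each comma, whether the "and not quote" test would cut there. trace_commas is a hypothetical helper written for illustration, not part of the original function.

```python
def trace_commas(line):
    """For each comma, report whether mysplit-style logic would split on it."""
    quote = False
    events = []
    for index, char in enumerate(line):
        if char == '"':
            quote = not quote  # toggle: False -> True -> False ...
        elif char == ',':
            # 'not quote' is True only when we are outside the quotes.
            events.append((index, "split" if not quote else "keep"))
    return events


print(trace_commas('"La Jolla Bank, FSB",La Jolla,CA'))
# [(14, 'keep'), (20, 'split'), (29, 'split')]
```

The comma at index 14 sits between the two quote characters, so quote is True there and the cut is skipped; the later commas are reached after the closing quote has toggled it back to False.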
