Implement a tokeniser in Python

I am trying to implement a tokeniser in Python (without using NLTK libraries) that splits a string into words using blank spaces. Example usage is:
>>> tokens = tokenise1("A (small, simple) example")
>>> tokens
['A', '(small,', 'simple)', 'example']
I can get some of the way using regular expressions, but my return value includes whitespace, which I don't want. How do I get the correct return value as per the example usage?
What I have so far is:
import re

def tokenise1(string):
    return re.split(r'(\S+)', string)
and it returns:
['', 'A', ' ', '(small,', ' ', 'simple)', ' ', 'example', '']
so I need to get rid of the whitespace in the return value.

The output contains the whitespace because you are splitting on the non-space runs and capturing them with (); the spaces are what is left between the matches. Instead you can split on the whitespace itself:
re.split(r'\s+', string)
['A', '(small,', 'simple)', 'example']
\s+ matches one or more whitespace characters.
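One caveat worth noting: if the input starts with whitespace, re.split(r'\s+', ...) will still give a leading empty string. A minimal sketch of an alternative that avoids this, using re.findall to pull out the tokens directly:
import re

def tokenise1(string):
    # Collect every run of non-whitespace characters; leading or
    # trailing spaces can never produce empty tokens this way.
    return re.findall(r'\S+', string)

tokenise1("A (small, simple) example")
# ['A', '(small,', 'simple)', 'example']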

Related

TCL multi-character split in Python?

I am splitting a text file using this tcl proc:
proc mcsplit "str splitStr {mc {\x00}}" {
return [split [string map [list $splitStr $mc] $str] $mc] }
# mcsplit --
# Splits a string based using another string
# Arguments:
# str string to split into pieces
# splitStr substring
# mc magic character that must not exist in the original string.
# Defaults to the NULL character. Must be a single character.
# Results:
# Returns a list of strings
The split command splits a string based on each character that is in the split string. This version handles the split string as a combined string, splitting the string into its constituent parts.
My objective is to do the same using Python. Has anyone here done this before?
It's not very clear from your question whether Python's split behaviour is what you need. If you need to split at each occurrence of a multiple-character string, Python's regular split will do the job:
>>> 'this is a test'.split('es')
['this is a t', 't']
If, however, you want to split at any occurrence of multiple individual characters, you'll need to use re.split:
>>> import re
>>> re.split(r'[es]', 'this is a test')
['thi', ' i', ' a t', '', 't']
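For completeness, a rough Python counterpart of the Tcl mcsplit proc, assuming the goal is simply splitting on the whole substring (Python's str.split already treats the separator as one combined string, so no magic-character trick is needed):
def mcsplit(s, split_str):
    # str.split splits on the full substring, not on each character.
    return s.split(split_str)

mcsplit('this is a test', 'es')
# ['this is a t', 't']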

Replace escape sequence characters in a string in Python 3.x

I have used the following code to replace the escape sequence characters in a string. I first split on \n and then used re.sub(), but I still don't know what I am missing; the code is not working according to my expectations. I am a newbie at Python, so please don't judge if there are optimisation problems. Here is my code:
#import sys
import re

String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i = 0
for oneString in splitString:
    #oneString = oneString.replace(r'^(.?)*(\\[^n])+(.?)*$', "")
    oneString = re.sub(r'^(.?)*(\\[^n])+(.?)*$', "", oneString)
    print(oneString)
    replacedStrings.insert(i, oneString)
    i += 1
print(replacedStrings)
My aim here is: I need only the values (without the escape sequences) as the split strings.
My approach here is:
1. I split the string by \n, which gives me a list of separate strings.
2. Then I checked each string against the regex; if the regex matches, the matched substring is replaced with "".
3. Then I pushed those strings into a collection, thinking that it would store the replaced strings in a new list.
So basically, I am through with 1 and 2, but currently I am stuck at 3. Following is my output:
1
2
3
4
['1\r\r\t\r', '2\r\r', '3\r\r\r\r', '\r', '\r4', '\r']
You might find it easier to use re.findall here with the simple pattern \S+:
input = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'\S+', input)
print(output)
['1', '2', '3', '4']
This approach will isolate and match any islands of one or more non-whitespace characters.
Edit:
Based on your new input data, we can try matching on the pattern [^\r\n\t]+:
input = "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r"
output = re.findall(r'[^\r\n\t]+', input)
print(output)
['jkahdjkah ', 'A: B', 'A : B', '4']
re.sub isn't really the right tool for the job here. split or re.findall are what would be on the table, because you want to repeatedly match/isolate certain parts of your text. re.sub is useful for taking a string and transforming it into something else; it can be used to extract text, but it does not work so well for multiple matches.
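To make that distinction concrete, here is a small illustration (the sample text is made up): re.findall extracts the pieces you want to keep, while re.sub rewrites the string as a whole.
import re

text = "1\r\r\t\r\n2\r\r\n3"
re.findall(r'\S+', text)     # extraction: ['1', '2', '3']
re.sub(r'\s+', ' ', text)    # transformation: '1 2 3'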
You were almost there; I would just use str.strip() to remove the multiple \r and \n at the start and the end of the strings:
String = "1\r\r\t\r\n2\r\r\n3\r\r\r\r\n\r\n\r4\n\r"
splitString = String.split('\n')
replacedStrings = []
i=0
for oneString in splitString:
s = oneString.strip()
if s != '':
print(s)
replacedStrings.append(s)
print(replacedStrings)
The output will look like
1
2
3
4
['1', '2', '3', '4']
For "jkahdjkah \r\r\t\r\nA: B\r\r\nA : B\r\r\r\r\n\r\n\r4\n\r", the output will be ['jkahdjkah', 'A: B', 'A : B', '4']
I have found one more method that seems to work fine. It might not be as optimised as the other answers, but it's just another way:
import re
splitString = []
String = "jhgdf\r\r\t\r\nA : B\r\r\nA : B\r\r\r\r\n\r\n\rA: B\n\r"
splitString = re.compile('[\r\t\n]+').split(String)
if "" in splitString:
splitString.remove("")
print(splitString)
I added it here so that people going through the same trouble as me might want to look over this approach too.
Following is the Output that I have got after using the above code:
['jhgdf', 'A : B', 'A : B', 'A: B']

Python detect character surrounded by spaces

Anyone know how I can find the character in the center that is surrounded by spaces?
1 + 1
I'd like to be able to separate out the + in the middle to use in an if/else statement.
Sorry if I'm not too clear, I'm a Python beginner.
I think you are looking for something like the split() method which will split on white space by default.
Suppose we have a string s
s = "1 + 1"
chunks = s.split()
print(chunks[1]) # Will print '+'
This regular expression will detect a single character surrounded by spaces, if the character is a plus, minus, multiplication or division sign: r' ([-+*/]) '. Note the spaces inside the quotes, and that the hyphen comes first in the set so it is not treated as a range. The parentheses "capture" the character in the middle. If you need to recognise a different set of characters, change the set inside the brackets.
If you haven't dealt with regular expressions before, read up on the re module. They are very useful for simple text processing. The two relevant features here are "character classes" (the square brackets in my example) and "capturing parentheses" (the round parens).
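Putting those two features together, a minimal sketch of the suggestion above might look like this:
import re

m = re.search(r' ([-+*/]) ', '1 + 1')
if m:
    print(m.group(1))   # '+'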
You can use regex:
import re

s = "1 + 1"
a = re.compile(r' (?P<sym>.) ')
a.search(s).group('sym')
import re

def find_between(string, start_=' ', end_=' '):
    re_str = r'{}([-+*/%^]){}'.format(start_, end_)
    try:
        return re.search(re_str, string).group(1)
    except AttributeError:
        return None

print(find_between('9 * 5', ' ', ' '))
Not knowing how many spaces separate your central character, I'd use the following:
s = '1 +   1'
middle = filter(None, s.split(' '))[1]
print middle # +
The split(' ') works as in the solution provided by Zac, but if more than a single space separates the parts, the returned list will contain a bunch of '' elements, which we can get rid of with the filter(None, ...) function.
Then it's just a matter of extracting your second element.
Check it in action at https://eval.in/636622
If we look at it step by step, here is how it all works in a Python console:
>>> s = '1 +   1'
>>> s.split(' ')
['1', '+', '', '', '1']
>>> filter(None, s.split(' '))
['1', '+', '1']
>>> filter(None, s.split(' '))[1]
'+'
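One caveat: the snippets above are Python 2. In Python 3, filter() returns an iterator, so indexing it directly raises a TypeError; wrapping it in list() (or using a comprehension) restores the behaviour:
s = '1 +   1'
middle = list(filter(None, s.split(' ')))[1]   # '+'
# or, equivalently, without filter:
middle = [part for part in s.split(' ') if part][1]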

String split with default delimiter vs user defined delimiter

I tried a simple example with string split, but get some unexpected behavior. Here is the sample code:
def split_string(source, splitlist):
    for delim in splitlist:
        source = source.replace(delim, ' ')
    return source.split(' ')
out = split_string("This is a test-of the,string separation-code!", " ,!-")
print out
>>> ['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code', '']
As you can see, I get an extra empty string at the end of the list when I use a space as the delimiter argument for split(). However, if I don't pass any argument to split(), I get no empty string at the end of the output list.
From what I read in the Python docs, the default behaviour of split() is to split on whitespace. So why does explicitly passing ' ' as the delimiter create an empty string at the end of the output list?
The docs:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
That may happen if you have multiple spaces separating two words, or a trailing space; in your example the final '!' is replaced by a space, so the string ends with a space and split(' ') produces a trailing empty string.
For example,
'a    b'.split(' ') will return ['a', '', '', '', 'b']
But I would suggest you use split from the re module. Check the example below:
import re
print re.split(r'[\s,; !]+', 'a b !!!!!!! , hello ;;;;; world')
When we run the above piece, it outputs ['a', 'b', 'hello', 'world']
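Applied back to the original split_string function, one possible fix is simply to call split() with no argument after doing the replacements, so runs of spaces collapse and the trailing empty string disappears (a sketch, not the only way):
def split_string(source, splitlist):
    for delim in splitlist:
        source = source.replace(delim, ' ')
    # No separator argument: whitespace runs collapse and
    # leading/trailing empties are dropped.
    return source.split()

split_string("This is a test-of the,string separation-code!", " ,!-")
# ['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code']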

Python - defining string split delimiter?

How can I define string delimiters for splitting in the most efficient way, i.e. without needing to use many ifs?
I have strings that need to be split strictly into two-element lists. The problem is that those strings contain different symbols by which I can split them. For example:
'Hello: test1'. This one has the split delimiter ': '. Another example would be:
'Hello - test1'. So this one would be ' - '. The split delimiter could also be ' -' or '- '. So if I know all variations of the delimiters, how can I define them most efficiently?
First I did something like this:
strings = ['Hello - test', 'Hello- test', 'Hello -test']
for s in strings:
    delim = ' - '
    if len(s.split('- ', 1)) == 2:
        delim = '- '
    elif len(s.split(' -', 1)) == 2:
        delim = ' -'
    print s.split(delim, 1)[1]
But then I got new strings that had other, unexpected delimiters. Going this way I would have to add even more ifs to check for other delimiters like ': '. So I wondered if there is a better way to define them (it is no problem if I need to add new delimiters to some kind of list later on). Maybe regex would help, or some other tool?
Put all the delimiters inside the re.split function, as below, using the alternation operator |.
re.split(r': | - | -|- ', string)
Add maxsplit=1 if you want to split only once.
re.split(r': | - | -|- ', string, maxsplit=1)
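For example, with the sample strings from the question (just a quick illustrative check):
import re

re.split(r': | - | -|- ', 'Hello: test1', maxsplit=1)
# ['Hello', 'test1']
re.split(r': | - | -|- ', 'Hello - test1', maxsplit=1)
# ['Hello', 'test1']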
You can use the split function of the re module
>>> strings = ['Hello1 - test1', 'Hello2- test2', 'Hello3 -test3', 'Hello4 :test4', 'Hello5 : test5']
>>> for s in strings:
... re.split(" *[:-] *",s)
...
['Hello1', 'test1']
['Hello2', 'test2']
['Hello3', 'test3']
['Hello4', 'test4']
['Hello5', 'test5']
Between the [] you put all the possible delimiters. The * indicates that spaces may appear before or after.
\s*[:-]\s*
You can split by this. Use re.split(r"\s*[:-]\s*", string). See the demo:
https://regex101.com/r/nL5yL3/14
You should use this form if your delimiters can have multiple spaces around them.
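For instance, a quick sketch of that pattern on a string with extra spaces around the dash:
import re

re.split(r"\s*[:-]\s*", "Hello   -   test1", maxsplit=1)
# ['Hello', 'test1']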
This isn't the best way, but if you want to avoid using re for some (or no) reason, this is what I would do:
>>> strings = ['Hello - test', 'Hello- test', 'Hello -test', 'Hello : test']
>>> delims = [':', '-'] # all possible delimiters; don't worry about spaces.
>>>
>>> for string in strings:
...     delim = next((d for d in delims if d in string), None) # finds the first delimiter in delims that's present in the string (if there is one)
...     if not delim:
...         continue # No delimiter! (I don't know how you want to handle this possibility; this code will simply skip the string altogether.)
...     print [s.strip() for s in string.split(delim, 1)] # assuming you want them in list form.
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
This uses Python's native .split() to break the string at the delimiter, and then .strip() to trim the whitespace off the results, if there is any. I've used next to find the appropriate delimiter, but there are plenty of things you can swap that out with (especially if you like for blocks).
If you're certain that each string will contain at least one of the delimiters (preferably exactly one), then you can shave it down to this:
## with strings and delims defined...
>>> for string in strings:
...     delim = next(d for d in delims if d in string) # raises StopIteration at this line if there is no delimiter in the string.
...     print [s.strip() for s in string.split(delim, 1)]
I'm not sure if this is the most elegant solution, but it uses fewer if blocks, and you won't have to import anything to do it.
