I am splitting a text file using this Tcl proc:
proc mcsplit "str splitStr {mc {\x00}}" {
    return [split [string map [list $splitStr $mc] $str] $mc]
}
# mcsplit --
# Splits a string using another (multi-character) string as the separator.
# Arguments:
# str string to split into pieces
# splitStr the separator substring
# mc magic character that must not exist in the original string.
# Defaults to the NULL character. Must be a single character.
# Results:
# Returns a list of strings
The built-in split command splits a string at every character that appears in splitStr. This version treats splitStr as one combined string and splits the input into its constituent parts.
My objective is to do the same in Python. Has anyone here done this before?
It's not very clear from your question which split behavior you need. If you want to split at each occurrence of a multi-character string, Python's regular str.split will do the job:
>>> 'this is a test'.split('es')
['this is a t', 't']
If, however, you want to split at any occurrence of multiple individual characters, you'll need to use re.split:
>>> import re
>>> re.split(r'[es]', 'this is a test')
['thi', ' i', ' a t', '', 't']
>>>
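As the examples above show, Python's str.split already treats the separator as a whole substring, so a direct analogue of the Tcl proc needs no magic character at all. A minimal sketch (the function name mcsplit just mirrors the Tcl proc):

```python
def mcsplit(s, split_str):
    # Unlike Tcl's split, Python's str.split treats the separator
    # as one multi-character substring, so no placeholder character
    # needs to be substituted in first.
    return s.split(split_str)

print(mcsplit('this is a test', 'es'))  # ['this is a t', 't']
```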
Is there any way to convert a list containing unicode strings to a proper list without using eval() or ast.literal_eval() in Python?
For example:
"[u'hello', u'hi']"
to
['hello', 'hi']
Could this be what you are looking for? (Note: the two-argument form of str.translate is Python 2 only; in Python 3, str.translate takes a mapping.)
a = "[u'hello', u'hi']".translate(None, "[]u' ")
a = a.split(',')
print(a)  # ['hello', 'hi']
This fails when a string itself contains a 'u', so instead you can go with:
a = "[u'hello', u'hi', u'uyou']".translate(None, "[]' ")
a = [item[1:] for item in a.split(',')]
It depends a bit on the formatting of your input:
it only contains "strings",
there is always a u in front of each string,
each string is inside single quotation marks ',
there is no whitespace before each , but exactly one space after,
there are no whitespaces before or after the [ and ].
Then you could simply strip the leading [u' and the trailing '] (for convenience I just slice off the first three and the last two characters), then split at ', u' and you're done:
>>> s = "[u'hello', u'hi']"
>>> s[3:-2].split("', u'")
['hello', 'hi']
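If the formatting is less predictable (extra spaces, or strings without the u prefix), extracting the quoted contents with a regex is a bit more robust; a sketch using re.findall:

```python
import re

s = "[u'hello', u'hi', u'uyou']"
# Capture whatever sits between each pair of single quotes.
items = re.findall(r"'([^']*)'", s)
print(items)  # ['hello', 'hi', 'uyou']
```

This keeps working even if the u prefixes are missing or the spacing around commas varies.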
I have a string that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]; the date is digits only, in the form YYYYMMDD (I don't think that matters).
I am trying to use re. I have a working version:
re.split(r'(\d+)', test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results:
>>> re.split(r'(^\d+)', test)
['', '20170125', 'NBCNightlyNews']
>>> re.split(r'^(\d+)', test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
What re.split(r'(^\d+)', test) actually does is split your test string at any occurrence of a run of one or more digits.
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result at the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get the empty string that precedes the first run of digits.
To avoid that you can use filter (this is Python 2; in Python 3, wrap the call in list(), since filter returns an iterator):
>>> print filter(None, re.split(r'(\d+)', test))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile(r'(\d+)(\w+)')
result = re.match(pattern, test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not to fix the regex usage. I'm with the others: you shouldn't use a regex for this simple task when you already know the format won't change and the date is a fixed size and always comes first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split(r'(\d+)', test)
['test', '20170125', 'NBCNightlyNews']
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search(r'^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']
I am trying to implement a tokeniser in python (without using NLTK libraries) that splits a string into words using blank spaces. Example usage is:
>>> tokens = tokenise1("A (small, simple) example")
>>> tokens
['A', '(small,', 'simple)', 'example']
I can get some of the way using regular expressions, but my return value includes whitespace, which I don't want. How do I get the correct return value as per the example usage?
What I have so far is:
import re

def tokenise1(string):
    return re.split(r'(\S+)', string)
and it returns:
['', 'A', ' ', '(small,', ' ', 'simple)', ' ', 'example', '']
so I need to get rid of the whitespace in the return value.
The output contains the spaces because you capture them with the group (). Instead you can split like:
re.split(r'\s+', string)
['A', '(small,', 'simple)', 'example']
\s+ matches one or more whitespace characters.
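An equivalent approach is to match the tokens instead of the separators: re.findall with the same \S+ pattern returns just the non-whitespace runs, with no empty strings at the edges.

```python
import re

def tokenise1(string):
    # Find runs of non-whitespace rather than splitting on whitespace,
    # so no empty strings appear at the start or end.
    return re.findall(r'\S+', string)

print(tokenise1('A (small, simple) example'))
# ['A', '(small,', 'simple)', 'example']
```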
I tried a simple example with string split, but get some unexpected behavior. Here is the sample code:
def split_string(source, splitlist):
    for delim in splitlist:
        source = source.replace(delim, ' ')
    return source.split(' ')
out = split_string("This is a test-of the,string separation-code!", " ,!-")
print out
>>> ['This', 'is', 'a', 'test', 'of', 'the', 'string', 'separation', 'code', '']
As you can see, I get an extra empty string at the end of the list when I use a space as the delimiter argument for split(). However, if I don't pass any argument to split(), there is no empty string at the end of the output list.
From what I read in the Python docs, with no argument split() splits on runs of whitespace. So why does explicitly passing ' ' as the delimiter create an empty string at the end of the output list?
The docs:
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
That may happen if you have multiple spaces separating two words.
For example,
'a    b'.split(' ') will return ['a', '', '', '', 'b']
But I would suggest using split from the re module. Check the example below:
import re
print re.split(r'[\s,; !]+', 'a b !!!!!!! , hello ;;;;; world')
When we run the above piece, it outputs ['a', 'b', 'hello', 'world']
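The difference between the two splitting algorithms is easy to see side by side:

```python
s = '  a  b  '
# sep omitted (None): runs of whitespace act as one separator,
# and leading/trailing whitespace produces no empty strings.
print(s.split())     # ['a', 'b']
# sep=' ': every single space is a cut point, so consecutive and
# edge spaces each yield an empty string.
print(s.split(' '))  # ['', '', 'a', '', 'b', '', '']
```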
How could I define string delimiters for splitting in the most efficient way? I mean, without needing many ifs, etc.
I have strings that need to be split strictly into two-element lists. The problem is that those strings use different symbols as the split point. For example:
'Hello: test1'. This one has the delimiter ': '. The other example would be:
'Hello - test1'. So this one would be ' - '. The delimiter could also be ' -' or '- '. So, if I know all variations of the delimiters, how could I define them most efficiently?
First I did something like this:
strings = ['Hello - test', 'Hello- test', 'Hello -test']
for s in strings:
    delim = ' - '
    if len(s.split('- ', 1)) == 2:
        delim = '- '
    elif len(s.split(' -', 1)) == 2:
        delim = ' -'
    print s.split(delim, 1)[1]
But then I got new strings that had other, unexpected delimiters. Going this way I would have to add even more ifs to check for other delimiters like ': '. So I wondered whether there is a better way to define them (it is no problem if I need to add new delimiters to some kind of list later on). Maybe a regex would help, or some other tool?
Put all the delimiters inside the re.split pattern, like below, joined with the alternation operator |.
re.split(r': | - | -|- ', string)
Add maxsplit=1 if you want to split only once.
re.split(r': | - | -|- ', string, maxsplit=1)
You can use the split function of the re module
>>> strings = ['Hello1 - test1', 'Hello2- test2', 'Hello3 -test3', 'Hello4 :test4', 'Hello5 : test5']
>>> for s in strings:
... re.split(" *[:-] *",s)
...
['Hello1', 'test1']
['Hello2', 'test2']
['Hello3', 'test3']
['Hello4', 'test4']
['Hello5', 'test5']
Between the [] you put all the possible delimiter characters. The surrounding * means that any number of spaces may appear before or after.
\s*[:-]\s*
You can split by this: use re.split(r"\s*[:-]\s*", string). See the demo:
https://regex101.com/r/nL5yL3/14
You should use this if the delimiters may be surrounded by a variable amount of whitespace.
This isn't the best way, but if you want to avoid using re for some (or no) reason, this is what I would do:
>>> strings = ['Hello - test', 'Hello- test', 'Hello -test', 'Hello : test']
>>> delims = [':', '-'] # all possible delimiters; don't worry about spaces.
>>>
>>> for string in strings:
... delim = next((d for d in delims if d in string), None) # finds the first delimiter in delims that's present in the string (if there is one)
... if not delim:
... continue # No delimiter! (I don't know how you want to handle this possibility; this code will simply skip the string all together.)
... print [s.strip() for s in string.split(delim, 1)] # assuming you want them in list form.
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
['Hello', 'test']
This uses Python's native .split() to break the string at the delimiter, and then .strip() to trim the white space off the results, if there is any. I've used next to find the appropriate delimiter, but there are plenty of things you can swap that out with (especially if you like for blocks).
If you're certain that each string will contain at least one of the delimiters (preferably exactly one), then you can shave it down to this:
## with strings and delims defined...
>>> for string in strings:
... delim = next(d for d in delims if d in string) # raises StopIteration at this line if there is no delimiter in the string.
... print [s.strip() for s in string.split(delim, 1)]
I'm not sure if this is the most elegant solution, but it uses fewer if blocks, and you won't have to import anything to do it.
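Along the same lines, str.partition is another import-free option: it splits at most once and tells you, via its middle return value, whether the delimiter was found at all. A sketch reusing the delims-list idea from the answer above (split_once is my name for it):

```python
def split_once(s, delims=(':', '-')):
    # Try each delimiter in turn; str.partition splits at most once.
    for d in delims:
        head, sep, tail = s.partition(d)
        if sep:  # a non-empty sep means the delimiter was found
            return [head.strip(), tail.strip()]
    return [s]  # no delimiter present: return the string unchanged

print(split_once('Hello - test'))  # ['Hello', 'test']
print(split_once('Hello :test'))   # ['Hello', 'test']
```

This avoids both re and the StopIteration concern, at the cost of checking delimiters one by one.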