Python Regex Simple Split - Empty at first index


I have a string that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]; the date has its formatting stripped and consists only of digits in YYYYMMDD order (I don't think that matters).
I am trying to use re. I have a working version:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?

Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
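If it helps, the fixed-width slice can be wrapped in a small helper that also checks that the prefix really is a YYYYMMDD date. This is only a minimal sketch: the name split_date_show is hypothetical, and validating with datetime.strptime is an assumption about what you might want, not part of the slicing approach itself.

```python
from datetime import datetime

def split_date_show(name):
    """Split '[YYYYMMDD][show]' into (date, show) using fixed-width slicing."""
    date_part, show = name[:8], name[8:]
    # Hypothetical extra check: raises ValueError if the prefix is not a valid date.
    datetime.strptime(date_part, "%Y%m%d")
    return date_part, show

print(split_date_show('20170125NBCNightlyNews'))
```

If you do not need the validation, dropping the strptime line leaves just the two slices.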

What you are actually doing with re.split('(\d+)', test) is splitting the test string on every run of one or more digits (with the ^ anchor, only a run at the very start of the string qualifies).
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']

You're getting an empty string at the beginning because your input string starts with digits and you're splitting on digits. The empty string is whatever comes before the first run of digits.
To avoid that you can use filter (wrapped in list() on Python 3, where filter returns an iterator):
>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']

Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile(r'(\d+)(\w+)')
result = pattern.match(test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not to fix the regex usage. I'm with the others: you shouldn't use a regex for this simple task when you already know the format won't change and the date is fixed-size and always comes first. Just use string indexing.

From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']
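To see the "same relative indices" point from the documentation in action, here is a small sketch. The sample strings other than the ones already shown are made up for illustration. With a capturing group in the separator, the captured text lands at the odd indices, whether or not the string starts or ends with a separator.

```python
import re

pattern = r'(\d+)'
samples = ['20170125NBCNightlyNews', 'test20170125NBCNightlyNews', 'a1b2']

for s in samples:
    parts = re.split(pattern, s)
    # Odd indices hold the captured separators, even indices the text between them.
    print(parts, '-> separators:', parts[1::2])
```

The leading (or trailing) empty string is what keeps that odd/even layout stable.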

If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']

Related

How to remove everything before certain character in Python

I'm new to Python and am struggling with a certain task:
I have a string that could have anything in it, but it always "ends" the same way.
It can be just a filename, a complete path, or just a random string, ending with a version number.
Example:
C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1
string-anotherstring-15.1R7-S8.1
string-anotherstring.andanother-15.1R7-S8.1
What always is the same (looking from the end) is that if you reach the second dot and go 2 characters in front of it, you always match the part that I'm interested in.
Cutting everything after a certain string was "easy," and I solved it myself - that's why the string ends with the version now :)
Is there a way to tell Python, "look for the second dot from the end of the string, go 2 characters in front of it, and delete everything before that," so that I get the version as a string?
Happy for any pointers in the right direction.
Thanks

If you want the version number, can you use the hyphen (-) to split the string? Or do you need to depend on the dots only?
Please see below use of rsplit and join which can help you.
>>> a = 'string-anotherstring.andanother-15.1R7-S8.1'
>>> a.rsplit('-')
['string', 'anotherstring.andanother', '15.1R7', 'S8.1']
>>> a.rsplit('-')[-2:] #Get everything from second last to the end
['15.1R7', 'S8.1']
>>> '-'.join(a.rsplit('-')[-2:]) #Get everything from second last to the end, and join them with a hyphen
'15.1R7-S8.1'
>>>
To rely on the dots instead, use the same approach:
>>> a
'string-anotherstring.andanother-15.1R7-S8.1'
>>> data = a.rsplit('.')
>>> [data[-3][-2:]]
['15']
>>> [data[-3][-2:]] + data[-2:]
['15', '1R7-S8', '1']
>>> '.'.join([data[-3][-2:]] + data[-2:])
'15.1R7-S8.1'
>>>

You can build a regex from the end mark of a line using the anchor $.
Using your own description, use the regex:
(\d\d\.[^.]*)\.[^.]*$
If you want the last characters from the end included, just move the capturing parenthesis:
(\d\d\.[^.]*\.[^.]*)$
Explanation:
(\d\d\.[^.]*\.[^.]*)$
\d\d    # two digits
\.      # a literal '.'
[^.]*   # anything OTHER THAN a '.'
\.      # a literal '.'
[^.]*   # anything OTHER THAN a '.'
$       # end of line
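As a rough sketch of how the second pattern could be applied from Python, using the three example strings from the question (the names version_re, samples and versions are just illustrative):

```python
import re

# Two digits, a dot, non-dot characters, a dot, non-dot characters, end of string.
version_re = re.compile(r'(\d\d\.[^.]*\.[^.]*)$')

samples = [
    r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1',
    'string-anotherstring-15.1R7-S8.1',
    'string-anotherstring.andanother-15.1R7-S8.1',
]

versions = [version_re.search(s).group(1) for s in samples]
print(versions)
```

Each of the three inputs yields '15.1R7-S8.1'. Note that search returns None when nothing matches, so real code would probably want to check for that before calling group.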

Assuming I understand the question correctly, two ways to do this come to mind.
I'm including both for completeness, since I might have misunderstood. I think the split/parts solution is cleaner, particularly when the 'certain character' is a dot.
>>> msg = r'C:\Users\abc\Desktop\string-anotherstring-15.1R7-S8.1'
>>> re.search(r'.*(..\..*)', msg).group(1)
'S8.1'
>>> parts = msg.split('.')
>>> ".".join((parts[-2][-2:], parts[-1]))
'S8.1'

For your example, you can split the string on the separator '-' and then join the last two elements with the same separator, like so:
txt = "string-anotherstring-15.1R7-S8.1"
x = txt.split("-")
y = "-".join(x[-2:])  # re-insert the hyphen that split() removed
print(y) # outputs 15.1R7-S8.1

How to escape null characters .i.e [' '] while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']

I'm no Python expert, but maybe you could just remove the empty strings from your list? (On Python 3, wrap filter in list() to get a list back.)
str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))

Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
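A minimal sketch of the named-group variant follows; the group names start and end are arbitrary choices made here for illustration, not anything required by the file format.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Same pattern as above, but with named groups (names are hypothetical).
m = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)
time_info = m.groupdict() if m else {}
print(time_info)
```

This prints {'start': '20111007T084734', 'end': '20111008T023142'}, and the empty-dict fallback covers file names that do not match.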

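If the timestamps are always after the second _ then you can use str.split. Drop the extension with rsplit rather than strip(".txt"), because str.strip removes any of the given characters from both ends instead of removing a suffix:
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.rsplit(".", 1)[0].split("_", 2)[-1].split("-")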
['20111007T084734', '20111008T023142']

Since this came up on Google and for completeness, try using re.findall as an alternative!
This does require a little rethinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
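For the file names in the question, a findall version might look like the sketch below. Both patterns are assumptions about the timestamp layout (8 digits, a 'T', 6 digits) based only on the two sample names shown.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Describe what a timestamp looks like instead of what separates them.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)

# Lookaround variant: digits/T preceded by '_' or '-' and followed by '-' or '.'.
time_info_lookaround = re.findall(r'(?<=[_-])[0-9T]+(?=[-.])', f)
print(time_info_lookaround)
```

Both calls return ['20111007T084734', '20111008T023142'] with no empty strings, because findall only reports what matched.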

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Splitting string in Python starting from the first numeric character

I need to process strings in Python in the following way:
I have strings with '>=', '=', '<=', '<' or '>' in front of them, for example:
'>=1_2_3'
'<2_3_2'
what I want to achieve is splitting the strings to obtain, respectively:
'>=', '1_2_3'
'<', '2_3_2'
Basically, I need to split them at the first numeric character.
Is there a way to achieve this with regular expressions, without iterating over the string and checking whether each character is a number or a '_'?
Thank you.

This will do:
re.split(r'(^[^\d]+)', string)[1:]
Example:
>>> re.split(r'(^[^\d]+)', '>=1_2_3')[1:]
['>=', '1_2_3']
>>> re.split(r'(^[^\d]+)', '<2_3_2')[1:]
['<', '2_3_2']

import re
strings = ['>=1_2_3', '<2_3_2']
for s in strings:
    # Group 1: everything before the first digit; group 2: the first digit and the rest.
    mat = re.match(r'([^\d]*)(\d.*)', s)
    print(mat.groups())
Outputs:
('>=', '1_2_3')
('<', '2_3_2')
This just groups everything up until the first digit in one group, then that first digit and everything after into a second.
You can access the individual groups with mat.group(1), mat.group(2)

You can split using this regex:
(?<=[<>=])(?=\d)
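A short sketch of how that zero-width pattern could be used from Python. One assumption to flag: re.split only splits on empty matches from Python 3.7 onward, so on older versions this pattern would leave the string unsplit.

```python
import re

# Split at the position between an operator character and the first digit.
# Requires Python 3.7+, where re.split can split on zero-width matches.
boundary = re.compile(r'(?<=[<>=])(?=\d)')

results = [boundary.split(s) for s in ['>=1_2_3', '<2_3_2']]
print(results)
```

Because nothing is consumed by the lookarounds, there is no empty string to trim and no captured separator to rejoin.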

There's probably a better way but you can split with a capture then join the second two elements:
values = re.split(r'(\d)', '>=1_2_3', maxsplit = 1)
values = [values[0], values[1] + values[2]]

Checking and removing extra symbols

I'm interested in removing extra symbols from strings in Python.
What would be the most efficient and Pythonic way to do that? Is there some grammar module?
My first idea would be to locate the most deeply nested text and work outward to the left and right, counting the opening and closing symbols. Then I would remove the last symbol counted for whichever kind there are too many of.
An example would be this string
text = "(This (is an example)"
You can clearly see that the first parenthesis is not balanced by another one, so I want to delete it.
text = "This (is an example)"
The solution has to be independent of the position of the parentheses.
Another example could be:
text = "(This (is another example) )) (to) explain) the question"
That would become :
text = "(This (is another example) ) (to) explain the question"

Had to break this into an answer for formatting. Check Python's regular expression module.
If I'm understanding what you are asking, look at re.sub. You can use a regular expression to find the characters you'd like to remove and replace them with an empty string.
Suppose we want to remove all instances of '.', '&', and '*'.
>>> import re
>>> s = "abc&def.ghi**jkl&"
>>> re.sub('[\.\&\*]', '', s)
'abcdefghijkl'
If the pattern to be matched is larger, you can use re.compile and pass that as the first argument to sub.
>>> r = re.compile('[\.\&\*]')
>>> re.sub(r, '', s)
'abcdefghijkl'
Hope this helps.
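Note that re.sub as shown removes every occurrence of a character. For the unbalanced-parenthesis examples in the question, one possible approach (a sketch, not the only way) is a single stack-based pass that records which parentheses have no partner and drops only those. The function name remove_unbalanced and its parameters are hypothetical.

```python
def remove_unbalanced(text, open_ch='(', close_ch=')'):
    """Return text with unmatched opening/closing symbols removed."""
    unmatched = set()   # indices of symbols that have no partner
    stack = []          # indices of currently open symbols
    for i, ch in enumerate(text):
        if ch == open_ch:
            stack.append(i)
        elif ch == close_ch:
            if stack:
                stack.pop()        # closes the most recent open symbol
            else:
                unmatched.add(i)   # nothing left to close
    unmatched.update(stack)        # openers that were never closed
    return ''.join(ch for i, ch in enumerate(text) if i not in unmatched)

print(remove_unbalanced("(This (is an example)"))
print(remove_unbalanced("(This (is another example) )) (to) explain) the question"))
```

On the two strings from the question this produces "This (is an example)" and "(This (is another example) ) (to) explain the question". When several removals would balance the string equally well, this sketch simply drops the closers that arrive with nothing open and the openers still pending at the end.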

Matching two almost similar string (python)

In a file I can have either of the following two string formats:
::WORD1::WORD2= ANYTHING
::WORD3::WORD4::WORD5= ANYTHING2
This is the regex I came up with:
::(\w+)(?:::(\w+))?::(\w+)=(.*)
regex.findall(..)
[(u'WORD1', u'', u'WORD2', u' ANYTHING'),
(u'WORD3', u'WORD4', u'WORD5', u' ANYTHING2')]
My first question is: why do I get this empty u'' when matching the first string?
My second question is: is there an easier way to write this regex? The two strings are very similar, except that sometimes I have this extra ::WORD5.
My last question is: most of the time I have only a word between the ::, which is why \w+ is enough, but sometimes I get things like 2-WORD2 or 3-2-WORD2, where a - appears. How can I add it to the \w+?

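For the last question, put the hyphen in a character class together with \w:
[\w\-]+
Explanation: \w matches any word character (letters, digits and underscore), and \- adds a literal hyphen to the set.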

Captured groups are always included in re.findall results, even if they don't match anything. That's why you get an empty string. If you just want to get what's between the delimiters, try split instead of findall:
a = '::WORD1::WORD2= ANYTHING'
b = '::WORD3::WORD4::WORD5= ANYTHING2'
print(re.split(r'::|= ', a)[1:]) # ['WORD1', 'WORD2', 'ANYTHING']
print(re.split(r'::|= ', b)[1:]) # ['WORD3', 'WORD4', 'WORD5', 'ANYTHING2']
In response to the comments: if "ANYTHING" could be, well, anything, it's easier to use string functions rather than regexps:
x, y = a.split('= ', 1)
results = x.split('::')[1:] + [y]

Based on thg435's answer, you can just split on the "=" and then do exactly the same thing, something like:
left,right = a.split('=', 1)
answer = left.split('::')[1:] + [right]

For your last question, you can use something like this (it accepts letters, digits and "-"):
[a-zA-Z0-9\-]+
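As a rough illustration of how such a character class slots into the regex from the question, here is a sketch using [\w-]+ in place of \w+. The two sample lines are invented for the example, following the 2-WORD2 and 3-2-WORD2 shapes mentioned in the question.

```python
import re

# The question's pattern with \w+ widened to also accept '-'.
pattern = re.compile(r'::([\w-]+)(?:::([\w-]+))?::([\w-]+)=(.*)')

lines = [
    '::WORD1::2-WORD2= ANYTHING',
    '::WORD3::3-2-WORD4::WORD5= ANYTHING2',
]

matches = [pattern.findall(line) for line in lines]
print(matches)
```

The optional middle group still shows up as an empty string when it does not participate in the match, exactly as in the original findall output.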
