I have a file with lines in the following format:
d 55 r:100:10000
I would like to find that 55 and parse it to an int. How can I do this? I would like the solution to tolerate a variable amount of whitespace: there might be more or fewer spaces between the fields, but the number will always be between d and r.
That's easy:
number = int(line.split()[1])
If you actually need to check whether it's between d and r, then use
import re
number = int(re.search(r"d\s+(\d+)\s+r", line).group(1))
The str.split() method by default splits on arbitrary-width whitespace:
>>> 'd 55 r:100:10000'.split()
['d', '55', 'r:100:10000']
Picking out just the middle number then becomes a simple select:
>>> int('d 55 r:100:10000'.split()[1])
55
Quoting the documentation:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
>>> line = 'd 55 r:100:10000'
>>> int(line.split()[1]) # By default splits by any whitespace
55
Or, to be more efficient, you can limit it to two splits (the second element of the result still holds the integer):
>>> int(line.split(None, 2)[1]) # None means split by any whitespace
55
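As a minimal sketch of the regex variant above, the check for d and r can be wrapped in a helper that returns None instead of raising when a line does not fit the format. The function name `number_between_d_and_r` and the leading `^\s*` anchor are assumptions for illustration, not part of the original answers.

```python
import re

# Hypothetical helper: the name and the leading-anchor choice are assumptions.
D_R_NUMBER = re.compile(r'^\s*d\s+(\d+)\s+r')

def number_between_d_and_r(line):
    """Return the integer between 'd' and 'r', or None if the line does not match."""
    match = D_R_NUMBER.search(line)
    # \s+ absorbs any run of spaces or tabs, so the spacing may vary freely.
    return int(match.group(1)) if match else None

print(number_between_d_and_r('d 55 r:100:10000'))
print(number_between_d_and_r('d     55    r:100:10000'))
print(number_between_d_and_r('x 55 r:100:10000'))
```

Returning None keeps the caller in control of how malformed lines are handled, whereas `int(line.split()[1])` raises on short or non-numeric lines.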
Related
I am creating a function which takes in a street plus house number and possibly an addition and returns the house number. I could simply extract the integers, but the problem is that there can be an integer in the street name itself. For instance:
my_string = 'Hendrik 4e laan 18 bis'
In this case, I would like to return:
street_name = 'Hendrik 4e laan', street_number = 18, street_number_addition = 'bis'
I cannot simply split the string on spaces and take the last integer ([x for x in my_string.split() if x.isdigit()][-1]), because the street number addition might be attached to the street number (e.g. 18bis or 18b).
Hence how do I get a list of [4, 18] such that I can simply take the last item?
This works for me:
>>> import re
>>> my_string = 'Hendrik 4e laan 18 bis'
>>> re.findall( r'(\d+)', my_string )
['4', '18']
>>>
If you know that you only want the last occurrence, then another option, instead of finding all of them (which you could do using findall - see lenik's answer) is to search for a sequence of digits after which there are no more digits before the end of the string:
import re
my_string = 'Hendrik 4e laan 18 bis'
match = re.search(r'(\d+)\D*$', my_string)
if match:
    number = int(match.group(1))  # gives here: number = 18
else:
    print("no digits found")
Explanation of regular expression:
(\d+) - one or more digits (with grouping parentheses, so that we can use .group(1) to extract them)
\D* - zero or more characters that are not digits
$ - the end of the string
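Building on the "last run of digits" idea, here is a hedged sketch that returns all three parts the question asks for. The function name `split_address` and the exact pattern are assumptions for illustration; it treats the last run of digits as the house number and everything after it as the addition.

```python
import re

# Hypothetical pattern: lazy street name, last run of digits, non-digit addition.
ADDRESS = re.compile(r'^(.*?)\s*(\d+)\s*(\D*)$')

def split_address(text):
    """Return (street_name, street_number, addition), or None if there are no digits."""
    match = ADDRESS.match(text)
    if match is None:
        return None
    street, number, addition = match.groups()
    # The addition cannot contain digits, so the captured number is the last one.
    return street, int(number), addition.strip()

print(split_address('Hendrik 4e laan 18 bis'))
print(split_address('Hendrik 4e laan 18bis'))
```

Because the addition is matched with `\D*`, an addition that itself contains a digit would not be handled by this sketch.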
Looking for an elegant way to:
Split a string based on a separator
Instead of discarding the separator, make it part of the split chunks.
For instance I do have date and time data like:
D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30
Sometimes there's a leading D and sometimes not (however, I always want it to be part of the first chunk), the time has no leading or trailing zeros, and the timezone only sometimes contains ':'. The point is that it is necessary to split on the characters 'D', 'T' and '+' because the segments might not have the same length. If they did, it would be easier to just split by index. I want to split on multiple characters such as T and + and keep them as part of the data, like:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
I know a nicer way would be to clean the data first and normalize all rows to follow the same pattern, but I'm just curious how to do it as it is.
For now, my ugly solution looks like:
[i+j for _, i in enumerate(['D','T','TZ']) for __, j in enumerate('D2018-4-21T3:55+6'.replace('T',' ').replace('D', ' ').replace('+', ' +').split()) if _ == __]
Use a regular expression
Reference:
https://docs.python.org/3/library/re.html
(...)
Matches whatever regular expression is inside the parentheses, and
indicates the start and end of a group; the contents of a group can be
retrieved after a match has been performed, and can be matched later
in the string with the \number special sequence, described below. To
match the literals '(' or ')', use \( or \), or enclose them inside a
character class: [(], [)].
import re
a = '''D2018-4-21T3:55+6
2018-4-4T3:15+6
D2018-11-21T12:45+6:30'''
b = a.splitlines()
for i in b:
    m = re.search(r'^D?(.*)([T].*?)([-+].*)$', i)
    if m:
        print(["D%s" % m.group(1), m.group(2), "TZ%s" % m.group(3)])
Result:
['D2018-4-21', 'T3:55', 'TZ+6']
['D2018-4-4', 'T3:15', 'TZ+6']
['D2018-11-21', 'T12:45', 'TZ+6:30']
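An alternative that stays closer to "split but keep the separator" is to split on a zero-width lookahead, so the T and + remain attached to the chunk they introduce. This is only a sketch: the function name `split_keep_markers` is hypothetical, it assumes Python 3.7 or later (where re.split accepts patterns that match the empty string), and it assumes the timezone always starts with '+'.

```python
import re

def split_keep_markers(stamp):
    """Split before 'T' and '+' without consuming them, then normalise the prefixes."""
    # (?=[T+]) matches the empty position just before a 'T' or '+'.
    date, time, zone = re.split(r'(?=[T+])', stamp)
    if not date.startswith('D'):
        date = 'D' + date  # the question wants the D present even when missing
    return [date, time, 'TZ' + zone]

for row in ['D2018-4-21T3:55+6', '2018-4-4T3:15+6', 'D2018-11-21T12:45+6:30']:
    print(split_keep_markers(row))
```

A negative offset such as -6 would need extra care, since '-' also appears inside the date.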
I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts: the digits and the name. The format will always be [date][show]; the date is stripped of formatting and is digits only, in YYYYMMDD order (I don't think that matters).
I am trying to use re. I have a working version:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
What re.split('(^\d+)', test) actually does is split your test string on every occurrence of a run of one or more digits (here anchored to the start of the string).
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result in the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get an empty string which is before first set of digits.
To avoid that you can use filter:
>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile(r'(\d+)(\w+)')  # raw string avoids invalid-escape warnings
result = pattern.match(test)
result.groups()[0] # for the date part
result.groups()[1] # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split('(\d+)',test)
['test', '20170125', 'NBCNightlyNews']
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']
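The documented behaviour can be checked directly, and matching the whole string avoids the empty element altogether. The following is a small sketch; the helper name `split_date_and_show` is an assumption, and it uses re.fullmatch rather than the re.search call shown above.

```python
import re

# A capturing separator at the very start or end leaves an empty string there.
print(re.split(r'(\d+)', '20170125NBCNightlyNews'))  # ['', '20170125', 'NBCNightlyNews']
print(re.split(r'(\d+)', 'abc123'))                  # ['abc', '123', '']

def split_date_and_show(text):
    """Return (date, show) for strings shaped like digits followed by a name."""
    match = re.fullmatch(r'(\d+)(.+)', text)
    # None signals that the input did not start with digits followed by a name.
    return match.groups() if match else None

print(split_date_and_show('20170125NBCNightlyNews'))
```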
I need to do a string compare to see if 2 strings are equal, like:
>>> x = 'a1h3c'
>>> x == 'a__c'
True
independent of the 3 characters in middle of the string.
You need to use anchors.
>>> import re
>>> x = 'a1h3c'
>>> pattern = re.compile(r'^a.*c$')
>>> pattern.match(x) is not None
True
This checks that the first and last characters are a and c. It does not care which characters, or how many, appear in the middle.
If you want to check that exactly three characters are present in the middle, you could use this:
>>> pattern = re.compile(r'^a...c$')
>>> pattern.match(x) is not None
True
Note that the end-of-line anchor $ is important: without $, a...c would match afoocbarbuz.
Your problem could be solved with string indexing, but if you want an intro to regex, here ya go.
import re
your_match_object = re.match(pattern, string)
The pattern in your case would be
pattern = re.compile("a...c") # the dot denotes any char but a newline
From here, you can see if your string fits this pattern with
print(pattern.match("a1h3c") is not None)
https://docs.python.org/2/howto/regex.html
https://docs.python.org/2/library/re.html#search-vs-match
if str1[0] == str2[0]:
    # do something.
You can repeat this statement as many times as you like.
This is indexing. Here we get the first character; to get the last character, use [-1].
I'll also mention that, with indexing, the string can be of any size, as long as you know the position relative to the beginning or the end of the string.
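The two approaches in this thread can be put side by side. This is a sketch under the assumption that "equal" means: starts with a, ends with c, and has exactly three characters in between. Both function names are hypothetical.

```python
import re

def matches_by_index(text, first='a', last='c', middle_len=3):
    """Indexing version: check the total length and the two outer characters."""
    return len(text) == middle_len + 2 and text[0] == first and text[-1] == last

def matches_by_regex(text):
    """Regex version: fullmatch anchors both ends, so no ^ or $ is needed."""
    return re.fullmatch(r'a.{3}c', text) is not None

print(matches_by_index('a1h3c'), matches_by_regex('a1h3c'))
print(matches_by_index('afoocbarbuz'), matches_by_regex('afoocbarbuz'))
```

One difference: the dot does not match a newline by default, so the regex version rejects a newline in the middle while the indexing version accepts any character there.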
I have a string such as the one below:
26 (passengers:22 crew:4)
or
32 (passengers:? crew: ?)
What I'm looking to do is split up the string so that just the numbers representing the passengers and crew are extracted. If a value is a question mark, I'd like it to be replaced by "".
I'm aware I can use string.replace("?", "") to replace the ? however how do I go about extracting the numeric characters for crew or passengers respectively? The numbers may vary from two digits to three so I can't slice the last few characters off the string or at a specific interval.
Thanks in advance
A regular expression to match those would be:
r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)'
with some extra whitespace tolerance thrown in.
Results:
>>> import re
>>> numbers = re.compile(r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)')
>>> numbers.search('26 (passengers:22 crew:4)').groups()
('22', '4')
>>> numbers.search('32 (passengers:? crew: ?)').groups()
('?', '?')
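To cover the "replace ? with an empty string" part of the question, the captured groups can be post-processed. This is a minimal sketch: the function name `passengers_and_crew` is hypothetical, and the pattern is a simplified variant of the one above that accepts any number of digits.

```python
import re

# Simplified, assumed pattern: each field is either a run of digits or a literal '?'.
COUNTS = re.compile(r'passengers:\s*(\d+|\?)\s+crew:\s*(\d+|\?)')

def passengers_and_crew(text):
    """Return (passengers, crew) as strings, with '?' mapped to ''; None if no match."""
    match = COUNTS.search(text)
    if match is None:
        return None
    return tuple('' if value == '?' else value for value in match.groups())

print(passengers_and_crew('26 (passengers:22 crew:4)'))
print(passengers_and_crew('32 (passengers:? crew: ?)'))
```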