splitting string in Python (2.7) - python

I have a string such as the one below:
26 (passengers:22 crew:4)
or
32 (passengers:? crew: ?)
What I'm looking to do is split up the string so that just the numbers representing the number of passengers and crew are extracted. If it's a question mark, I'd like it to be replaced by "".
I'm aware I can use string.replace("?", "") to replace the ?, but how do I go about extracting the numeric characters for crew or passengers respectively? The numbers may vary from two digits to three, so I can't slice the last few characters off the string or split at a fixed position.
Thanks in advance

A regular expression to match those would be:
r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)'
with some extra whitespace tolerance thrown in.
Results:
>>> import re
>>> numbers = re.compile(r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)')
>>> numbers.search('26 (passengers:22 crew:4)').groups()
('22', '4')
>>> numbers.search('32 (passengers:? crew: ?)').groups()
('?', '?')
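To get the empty string the question asks for in place of ?, the captured groups can be post-processed. A minimal sketch, assuming a helper named parse_counts (a hypothetical name, not from the answer) and assuming that returning None for input without a match is acceptable:

```python
import re

# Same pattern as above, with a single \s+ between the two fields.
COUNTS = re.compile(r'\(\s*passengers:\s*(\d{1,3}|\?)\s+crew:\s*(\d{1,3}|\?)\s*\)')

def parse_counts(text):
    """Return (passengers, crew) as strings, with '?' replaced by ''.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    match = COUNTS.search(text)
    if match is None:
        return None
    return tuple('' if value == '?' else value for value in match.groups())

print(parse_counts('26 (passengers:22 crew:4)'))   # ('22', '4')
print(parse_counts('32 (passengers:? crew: ?)'))   # ('', '')
```

Checking for None before calling .groups() avoids an AttributeError on lines that do not contain the parenthesised counts.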

Related

How to get the float values after particular symbol using regex in python?

I am new to Python and have been using regex for matching. Now I am facing a small issue: I have a string str = "vans>=20.09 and stands == 'four'" and I want the values after the comparison operators. The pattern I wrote extracts int and string values fine, but it does not extract float values. What is the best pattern so that the regex extracts all kinds of values (int, float, string)?
My code:
import re
str = "vans>=20.09 and stands == 'four'"
rx = re.compile(r"""(?P<key>\w+)\s*[<>=]+\s*'?(?P<value>\w+)'?""")
result = {m.group('key'): m.group('value') for m in rx.finditer(str)}
which gives:
{'vans': '20', 'stands': 'four'}
Expected Output:
{'vans': '20.09', 'stands': 'four'}
You can extend the second \w group with an \. to include dots.
rx = re.compile(r"""(?P<key>\w+)\s*[<>=]+\s*'?(?P<value>[\w\.]+)'?""")
This should work, but note that strings like 12.34.56 will also be matched as a value.
The original pattern is also loose about the comparison operators: [<>=]+ accepts any run of those characters. The following restricts the operator to the two-character forms ==, !=, <= and >=. One caveat: for a number with no digit after the decimal point, only the part before the point is captured.
rx = re.compile(r"""(?P<key>\w+)\s*[<>!=]=\s*'?(?P<value>(\w|[+-]?\d*(\.\d)?\d*)+)'?""")
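As an alternative sketch (not the answerer's pattern), the value can be matched as either a signed number with an optional fractional part or a plain word, and the operator as one of ==, !=, <=, >=, < or >. The variable names here are illustrative assumptions:

```python
import re

text = "vans>=20.09 and stands == 'four'"

rx = re.compile(
    r"(?P<key>\w+)\s*"                     # name on the left-hand side
    r"(?:[<>!=]=|[<>])\s*"                 # ==, !=, <=, >= or a bare < or >
    r"'?(?P<value>[+-]?\d+(?:\.\d+)?|\w+)'?"  # number (int or float) or a word
)

result = {m.group('key'): m.group('value') for m in rx.finditer(text)}
print(result)  # {'vans': '20.09', 'stands': 'four'}
```

Listing the number alternative before \w+ matters: \w+ on its own would stop at the decimal point and capture only 20.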

how to lift the data with regex in python that's between two semicolons?

I got a set of lines in a file that's separated by semicolons like this:
8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;
What I want is the whole message up until 10=000; and the value of 7202 which would be asdf:asdf.
I got this:
(^.*000;)
which, as far as I understand regex, should get me the whole line up to 10=000;. Which is great. But if I do this:
(^.*000;)(7202=.*;)
regex101.com says it won't match anything.
I don't know why adding that second group invalidates the whole expression.
Any help on this would be great.
Thanks
Answer for first version of question
"I am trying to use regex with python to lift out my data from 7202=, so I want to get the asdf:asdf."
If I understand correctly, your goal is to find the data that is between 7202= and ;. In that case:
>>> import re
>>> line = "8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;"
>>> re.search('7202=([^;]*);', line).group(1)
'asdf:asdf'
The regex is 7202=([^;]*);. This matches:
The literal string 7202=
Any characters that follow, up to but excluding the first semicolon: ([^;]*). Because this is in parentheses, it is captured as group 1.
The literal character ;
Answer for second version of question
"What I want is the whole message up until 10=000; and the value of 7202 which would be asdf:asdf."
>>> import re
>>> line = "8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;"
>>> r = re.search('.*7202=([^;]*);.*10=000;', line)
>>> r.group(0), r.group(1)
('8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;', 'asdf:asdf')
The regex is .*7202=([^;]*);.*10=000;. This matches:
Anything up to and including 7202=: .*7202=
Any characters that follow, up to but excluding the first semicolon: ([^;]*). Because this is in parentheses, it is captured as group 1.
Any characters that follow starting with ; and ending with 10=000;: ;.*10=000;
The value of the whole match string is available as r.group(0). The value of group 1 is available as r.group(1). Thus the single match object r lets us get both strings.
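The second pattern in the question, (^.*000;)(7202=.*;), cannot match because the second group has to start immediately after the text matched by the first, and in this line 7202= comes before 10=000;, not after it. If a regex is not required, the same two pieces can also be pulled out with plain string methods. A sketch, assuming every non-empty field has the form tag=value and that 10=000; occurs in the line (the names fields, end_marker and head are illustrative):

```python
line = ("8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;"
        "52=20130624-04:10:00.843;43=Y;98=0;10=000;"
        "Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;")

# Split into tag=value fields; the trailing ';' leaves an empty item, which is skipped.
# split('=', 1) splits only on the first '=', so values may contain '=' themselves.
fields = dict(item.split('=', 1) for item in line.split(';') if item)

# Everything up to and including the 10=000; field.
end_marker = '10=000;'
head = line[:line.index(end_marker) + len(end_marker)]

print(fields['7202'])  # asdf:asdf
print(head)
```

The dictionary also gives access to any other tag by name, which the single-purpose regex does not.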

Python Regex Simple Split - Empty at first index

I have a String that looks like
test = '20170125NBCNightlyNews'
I am trying to split it into two parts, the digits and the name. The format will always be [date][show]; the date is stripped of formatting and is digits only, in the order YYYYMMDD (I don't think that matters).
I am trying to use re. I have a working version:
re.split('(\d+)',test)
Simple enough, this gives me the values I need in a list.
['', '20170125', 'NBCNightlyNews']
However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.
I also tried telling it to match the beginning of the string as well, and got the same results.
>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>
Does anyone have any input as to why this is there / how I can avoid the empty string?
Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse a re.split to parse this data:
test[:8], test[8:]
Will split your strings just fine.
What you are actually doing with re.split('(^\d+)', test) is splitting your test string on every occurrence of the pattern, here a run of one or more digits at the start of the string.
So, if you have
test = '20170125NBCNightlyNews'
This is happening:
20170125 NBCNightlyNews
^^^^^^^^
The string is split into three parts, everything before the number, the number itself and everything after the number.
Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.
re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']
re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
You're getting an empty result at the beginning because your input string starts with digits and you're splitting on digits. Hence the first element is the empty string that precedes the first run of digits.
To avoid that you can use filter:
>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']
Why re.split when you can just match and get the groups?...
import re
test = '20170125NBCNightlyNews'
pattern = re.compile(r'(\d+)(\w+)')  # raw string keeps \d and \w intact
result = pattern.match(test)
result.group(1) # for the date part
result.group(2) # for the show name
I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
From the documentation:
If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.
So if you have:
test = 'test20170125NBCNightlyNews'
The indexes would remain unaffected:
>>> re.split(r'(\d+)', test)
['test', '20170125', 'NBCNightlyNews']
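The point of the documented behaviour can be seen by comparing both inputs: with one capturing group, the separators always land at the odd indices of the result, whether or not the string starts with a match. A small sketch (the loop and the variable names are illustrative):

```python
import re

results = {}
for text in ('20170125NBCNightlyNews', 'test20170125NBCNightlyNews'):
    parts = re.split(r'(\d+)', text)
    # parts[1::2] picks the odd indices, i.e. the captured separators.
    results[text] = (parts, parts[1::2])
    print(parts, parts[1::2])
```

In both cases the slice parts[1::2] yields the digit run, which is why the leading empty string is kept rather than dropped.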
If the date is always 8 digits long, I would access the substrings directly (without using regex):
>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']
If the length of the date might vary, I would use:
>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']

How can I split a string with no delimiters but with fixed number of decimal places - python

What is the best way to split the following string into 6 float values. The number of decimal points will always be six.
x=' 2C 6s 0.043315-143.954801 17.872676 31.277358-18.149649114.553363'
The output should read:
y=[0.043315, -143.954801, 17.872676, 31.277358, 18.149649, 114.553363]
Assuming that you want -18.149649 rather than 18.149649, since that would be consistent with the input, I suggest using a regex in combination with the .findall() function as follows:
import re
regex = '(-?[0-9]{1,}\.[0-9]{6})'
x = ' 2C 6s 0.043315-143.954801 17.872676 31.277358-18.149649114.553363'
out = re.findall(regex, x)
print(out)
Giving:
['0.043315', '-143.954801', '17.872676', '31.277358', '-18.149649', '114.553363']
Update due to comment:
You could replace [0-9] with \d, which is equivalent for ASCII input such as this since \d matches a digit.
This should do the trick.
re.findall(r'\-?[0-9]+\.[0-9]{6}', string)
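Since the question asks for float values rather than strings, the matches can be converted in the same step. A minimal sketch, assuming the sign of -18.149649 should be kept as discussed above (the name values is illustrative):

```python
import re

x = ' 2C 6s 0.043315-143.954801 17.872676 31.277358-18.149649114.553363'

# Optional minus sign, one or more digits, a dot, then exactly six decimals.
values = [float(v) for v in re.findall(r'-?\d+\.\d{6}', x)]
print(values)
```

The fixed {6} after the dot is what separates the run-together numbers: the match ends after six decimals, so 114.553363 starts a new match.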

Jython output formats, adding symbol at N:th character

I have a problem that is probably very easy to solve. I have a script that takes numbers from various places, does math with them and then prints the results as strings.
This is a sample
type("c", KEY_CTRL)
LeInput = Env.getClipboard().strip() #Takes stuff from clipboard
LeInput = LeInput.replace("-","") #Quick replace
Variable = int(LeInput) + 5 #Simple math operation
StringOut = str(Variable) #Converts it to string
popup(StringOut) #shows result for the amazed user
What I want to do is add the "-" signs back in the form XXXX-XX-XX, but I have no idea how to do this with regex or otherwise. The only solution I have is dividing by 10^N to split the number into smaller and smaller integers. As an example:
integer division 543442/100 = 5434 gives the first part, 5434; I would then repeat the process until I have split it enough times to get 5434-42 or whatever.
So how do I insert any symbol at the N:th character?
OK, so here is the Jython solution based on the answer from Tenub
import re
strOut = re.sub(r'^(\d{4})(.{2})(.{2})', r'\1-\2-\3', strIn)
This can be worth noting when doing Regex with Jython:
The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character
string containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
Here is a working example
http://regex101.com/r/oN2wF1
In that case you could do a replace with the following:
(\d{4})(\d{2})(\d+)
to
$1-$2-$3
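In Python or Jython code, the $1-$2-$3 replacement shown on regex101 is written as \1-\2-\3. Because the positions are fixed, plain slicing also works without any regex. A sketch of both, assuming the input is a string of digits at least seven characters long (add_dashes is a hypothetical helper name):

```python
import re

def add_dashes(digits):
    """Insert '-' after the 4th and 6th characters: XXXX-XX-XX.

    Hypothetical helper for illustration; no validation of the input.
    """
    return '-'.join((digits[:4], digits[4:6], digits[6:]))

sliced = add_dashes('20140315')

# Same result with re.sub, using \1 \2 \3 for the captured groups.
formatted = re.sub(r'^(\d{4})(\d{2})(\d+)$', r'\1-\2-\3', '20140315')

print(sliced)     # 2014-03-15
print(formatted)  # 2014-03-15
```

The slicing version answers the general question of inserting a symbol at the N:th character; the regex version additionally checks that the input consists only of digits, and leaves the string unchanged when it does not.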
