Removing non integers from a grep obtained string w/ Python and Bash - python

I am using grep to grab the text out of a file:
NELECT = 44.0000 total number of electrons,
and I need to save the number as a variable. I have tried a handful of methods I have found here such as using filters and findall. For some reason I can only get it to separate one zero.
So far the code looks like this:
wd=os.getcwd()
electrons=str(os.system("grep 'NELECT' "+wd+"/OUTCAR"))
VBM=(re.findall('\d+', electrons))
print VBM
And in return I get ['0'].

The result of os.system is the exit status of the command, not the output of the command -- see https://docs.python.org/3/library/os.html#os.system
$ cat OUTCAR
NELECT = 44.0000 total number of electrons,
$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> result = os.system("grep 'NELECT' "+os.getcwd()+"/OUTCAR")
NELECT = 44.0000 total number of electrons,
>>> result
0
The "NELECT" line was just printed by grep to stdout, but not captured in the result variable. To capture the output, use subprocess.check_output:
>>> from subprocess import check_output
>>> result2 = check_output(["grep", "NELECT", os.getcwd()+"/OUTCAR"])
>>> result2
'NELECT = 44.0000 total number of electrons,\n'
>>> import re
>>> re.findall(r'\d+', result2)
['44', '0000']
Or, don't call out to grep, read the file yourself
>>> import os
>>> import re
>>> with open(os.getcwd() + "/OUTCAR") as f:
...     for line in f:
...         if "NELECT" in line:
...             digits = re.findall(r'\d+', line)
...             break
...
>>> digits
['44', '0000']
Or, maybe don't use a regular expression:
>>> words = line.split()
>>> words[2]
'44.0000'
>>> int(float(words[2]))
44
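The file-reading and split approaches above can be combined into one small function. This is only a sketch in Python 3: read_nelect is a hypothetical helper name, the temporary OUTCAR file exists purely for the demonstration, and the code assumes the line always has the layout "NELECT = <number> ...".

```python
import os
import tempfile


def read_nelect(path):
    """Return the NELECT value from the first matching line, or None."""
    with open(path) as f:
        for line in f:
            if "NELECT" in line:
                # Assumed layout: NELECT = <number> ...
                return float(line.split()[2])
    return None


# Demonstration with a throwaway file that mimics the sample line.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "OUTCAR")
    with open(path, "w") as f:
        f.write("NELECT = 44.0000 total number of electrons,\n")
    nelect = read_nelect(path)

print(int(nelect))
```

Returning a float first and converting with int() afterwards keeps the function usable if the value ever has a non-zero fractional part.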

Are you sure that electrons actually contains the grep output? For me this regex returns a list with two elements, ['44', '0000'], which is the expected behavior. So most probably there is something wrong with the grep call.
Your regex won't retrieve the whole 44.0000, because \d+ matches only contiguous runs of digits, not the dot. To get the whole number use something like \b\d+\.\d+\b, which means: any word (\b marks a word boundary; the dot must be escaped as \. because an unescaped . in a regex matches any character) that contains at least one digit, a dot and at least one more digit. If the fractional part is optional, use something like \b(\d+(?:\.\d+)?)\b ((?:...) creates a group that is not captured, so your output will still be a single-element list).
Note that re.findall returns a list of string matches. To retrieve the number from the first match, use float(VBM[0]).
Edit: I forgot to add: avoid using the print statement. It works oddly with tuples and is completely removed in Python 3. Python 2 support ends in 2020, so it's better to prepare. You can replace the print statement with the Python 3 print function by adding from __future__ import print_function at the beginning of the file.
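The optional-fraction pattern from this answer can be tried directly on the sample line. A minimal sketch in Python 3; the variable names are illustrative only.

```python
import re

line = "NELECT = 44.0000 total number of electrons,"

# Digits with an optional fractional part; the inner (?:...) group is not captured.
matches = re.findall(r"\b(\d+(?:\.\d+)?)\b", line)
nelect = float(matches[0])

print(matches)
print(int(nelect))
```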

Related

Python automatically converting some strings to raw strings?

Python seems to be automatically converting strings (not just input) into raw strings. Can somebody explain what is happening here?
Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = '\stest'
>>> s
'\\stest'
# looks like a raw string
>>> print(s)
\stest
>>> s = '\ntest'
>>> s
'\ntest'
# this one doesn't
>>> s = '\n test'
>>> s
'\n test'
>>> s = r'\n test'
>>> s
'\\n test'
>>> print(s)
\n test
The question marked as a duplicate for this one seems to be useful, but then I do not understand why
>>> s = '\n test'
>>> s
'\n test'
>>> repr(s)
"'\\n test'"
does not get two backslashes when called, and does when repr() is called on it.
\n is a valid escape sequence and '\n' is a length 1 string (a newline character). In contrast, \s is an invalid escape sequence, so Python assumes that what you wanted there was a two-character string: a backslash character plus an s character.
>>> len('\s')
2
What you saw on terminal output was just the usual representation for such a length 2 string. Note that the correct way to create the string which Python gave you back here would have been with r'\s' or with '\\s'.
>>> r'\s' == '\\s' == '\s'
True
This behavior is deprecated: invalid escape sequences have raised a DeprecationWarning since Python 3.6, and a future version of Python is expected to turn them into a syntax error.
Since you're using v3.7.1, you could enable warnings if you want to be informed about such uses of deprecated features:
$ python -Wall
>>> '\s'
<stdin>:1: DeprecationWarning: invalid escape sequence \s
'\\s'
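The difference between a recognised escape, a raw string and an escaped backslash can be checked by comparing lengths. A small sketch; the variable names are illustrative, and it deliberately avoids invalid escapes such as \s so that it runs without warnings.

```python
newline = "\n"     # one character: a line feed
raw = r"\n"        # two characters: a backslash and the letter n
escaped = "\\n"    # the same two characters, written with an escaped backslash

print(len(newline), len(raw), len(escaped))
print(raw == escaped)

# repr() shows how the string would be written as a literal, quotes included.
print(repr(newline))
print(len(repr(newline)))
```

The interactive prompt echoes repr(value), which is why a backslash inside a string is displayed doubled there while print() shows it once.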
As for your subsequent question after the edit:
>>> s = '\n test'
>>> s # this prints the repr(s)
'\n test'
>>> repr(s) # this prints the repr(repr(s))
"'\\n test'"

How to extract a part of url from dictionary value in Python?

I have a dictionary where the key's value is
https://service-dmn1-region.com/info 4169 description
I'm interested in fetching dmn1-region from that URL part and print 4169 description as it is. So I intend to print result as:
dmn1-region :4169 description
Do you think this is possible without a complex regular expression? The script is in Python, and I tried this:
import re
print re.sub('https://','',dictionary[key])
This just removes https:// part and shows result as service-dmn1-region.com/info 4169 description . But I'm not sure how to achieve the above intended way.
key-value pairs from dictionary looks like-
dictionary = {'service': 'https://service-dmn1-region.com/info 4169 description',
              'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription',
              'service2': 'https://dmn1-region-service2.com/info'}
Any insights and help very much appreciated.
Given the information and the fact that you don't want to use regular expressions, you could do something like this:
dictionary = {'service': 'https://service-dmn1-region.com/info 4169 description',
              'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription'}

def extract(key, s):
    info = '/info'
    service = key + '-'
    return s[s.find('service') + len(service):s.find('.com')], s[s.find(info) + len(info):].strip()

for key, value in dictionary.items():
    region, info = extract(key, value)
    print('{0}:{1}'.format(region, info))
Output
dmn2-region2:5123 someDescription
dmn1-region:4169 description
Note that the urls are the values of the dictionary and not the keys.
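Another regex-free variant is to let the standard library parse the URL instead of searching for '.com' and '/info' by hand. This is a sketch in Python 3 under some assumptions: the URL and the description are separated by a single space, the region is the first host label with the key and a hyphen removed from its start, and extract_region is a hypothetical name.

```python
from urllib.parse import urlparse


def extract_region(key, value):
    """Split 'URL description' and pull the region out of the host name."""
    url, _, rest = value.partition(" ")
    host = urlparse(url).hostname or ""
    label = host.split(".")[0]          # e.g. 'service-dmn1-region'
    prefix = key + "-"
    # Drop the 'key-' prefix when present; otherwise keep the whole label.
    region = label[len(prefix):] if label.startswith(prefix) else label
    return region, rest.strip()


dictionary = {
    'service': 'https://service-dmn1-region.com/info 4169 description',
    'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription',
}

for key, value in dictionary.items():
    region, info = extract_region(key, value)
    print('{0} :{1}'.format(region, info))
```

Values that do not follow the 'key-region' host layout, such as the service2 entry from the question, come back with the full first host label and an empty description.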
I'd use something like:
import re
for k, v in dictionary.items():  # .iteritems() for py2
    print(re.sub(r"^.*?{}-([^.]+).*?(\d+)\s(.*?)$".format(k), r"\1 :\2 \3", v))
dmn1-region :4169 description
dmn2-region2 :5123 someDescription
for values of the type https://service-dmn1-region.com/info 4169 description
you could just match on ^[^-]+-([^.]+)[^\s]+ (.*)$
[harald@localhost ~]$ python3
Python 3.6.6 (default, Jul 19 2018, 14:25:17)
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> val = 'https://service-dmn1-region.com/info 4169 description'
>>> res = re.match(r'^[^-]+-([^.]+)[^\s]+ (.*)$', val)
>>> res.group(1)
'dmn1-region'
>>> res.group(2)
'4169 description'
where ^[^-]+ matches, from the start of the input (the initial ^), anything that isn't a hyphen ([^-]+), so https://service.
Next you specify that exactly one hyphen must follow (^[^-]+-) and that you wish to capture
everything after it that isn't a dot, with ([^.]+). (As you may have guessed by now, starting a pattern written inside brackets [] with ^ negates it.)
This leads to ^[^-]+-([^.]+). Next you wish to ignore everything up to the next whitespace, since that separates the other values in the string, so you add a pattern matching anything that is not whitespace (\s): an extra [^\s]+, giving ^[^-]+-([^.]+)[^\s]+
You then follow this with the whitespace separator (if you expect more than one whitespace character you could use \s+ instead of a literal space) and add a final catch-all capture pattern (.*), which captures 4169 description (the dot stands for any character here) up to the end of the input $, leading you to ^[^-]+-([^.]+)[^\s]+ (.*)$.
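The pattern built up above can be applied to all three sample values from the question. A sketch in Python 3; summarize is a hypothetical helper, and it returns None for values that do not have the "URL, space, description" shape.

```python
import re

PATTERN = re.compile(r'^[^-]+-([^.]+)[^\s]+ (.*)$')


def summarize(value):
    """Return 'region :description', or None when the value does not match."""
    match = PATTERN.match(value)
    if match is None:
        return None
    return '{0} :{1}'.format(match.group(1), match.group(2))


values = [
    'https://service-dmn1-region.com/info 4169 description',
    'https://service1-dmn2-region2.com/info 5123 someDescription',
    'https://dmn1-region-service2.com/info',   # no description part
]

results = [summarize(v) for v in values]
for result in results:
    print(result)
```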

Suppress print newline in python 3 str.format

I am using Python 3 and am trying to use my print statements with the str.format.
e.g:
print ('{0:3d} {1:6d} {2:10s} '.format (count1,count2,string1))
When I try to use the end='' to suppress the subsequent newline, this is ignored. A newline always happens.
How do I suppress the subsequent newline?
Source:
int1= 1
int2 = 999
string1 = 'qwerty'
print ( '{0:3d} {1:6d} {2:10s} '.format (int1,int2,string1))
print ('newline')
print ( '{0:3d} {1:6d} {2:10s} '.format (int1,int2,string1,end=''))
print ('newline')
Python 3.4.0 (default, Apr 11 2014, 13:05:11)
[GCC 4.8.2] on linux
Type "copyright", "credits" or "license()" for more information.
1 999 qwerty
newline
1 999 qwerty
newline
Your problem is that you have the end='' argument being passed to the format function, not to the print function.
Change this line:
print ( '{0:3d} {1:6d} {2:10s} '.format (int1,int2,string1,end=''))
To this:
print ( '{0:3d} {1:6d} {2:10s} '.format (int1,int2,string1), end='')
By the way, you should also give PEP8 a read. It defines standards for Python coding styles, that you really should try to follow, unless you're working with a group of people that have agreed on some other style standards. In particular, your spacing is a bit weird around function calls - you shouldn't have spaces between function names and the argument parentheses, or between the parentheses and the first argument. I wrote my suggested solution to your problem in a way that maintains your current style, but it really should look more like this:
print('{0:3d} {1:6d} {2:10s} '.format(int1, int2, string1), end='')
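A short sketch of why the original call ran without any error: str.format silently ignores keyword arguments that the format string never references, so end='' was accepted and discarded. The helper name format_row is illustrative only.

```python
def format_row(count1, count2, text):
    """Build the fixed-width row used in the question."""
    return '{0:3d} {1:6d} {2:10s} '.format(count1, count2, text)


row = format_row(1, 999, 'qwerty')

# end='' belongs to print(), so the next print continues on the same line.
print(row, end='')
print('newline')

# Passing end='' to format() instead changes nothing in the result.
same_row = '{0:3d} {1:6d} {2:10s} '.format(1, 999, 'qwerty', end='')
print(same_row == row)
```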

How should I remove one string from the start of another given that I know the longer string matches case-insensitively?

Suppose I have a workflow that involves examining the start of a long string (LS, say) to see whether it begins with a shorter string SS. If it does, I chop off the matching part of LS and do something with the remaining part. Otherwise, I do something else. (The specific case that prompted this question was a parsing library.)
def do_thing(LS, SS):
    if (LS.startswith(SS)):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()
This is straightforward. Now, though, suppose that I want to do the same thing but this time I want the strings to be matched case-insensitively. It is possible to test whether "LS.startswith(SS) but case-insensitively". But how should I determine how much of LS to "chop off" when I pass it in to action_on_match()? It isn't sufficient to just use len(SS) as it was before, because if I'm uppercasing or lowercasing or casefolding things, then the length of the matching prefix of LS might not be what I expect: changing the case of a string can change its length. It is important that the part of LS passed to action_on_match() be exactly what the program received as input (after the cutoff point, of course).
Answerers have suggested using lower() and preserving the use of len(SS), but this will not work:
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> def action_on_match (s): return "Match: %s" % s
...
>>> def action_on_no_match (): return "No match"
...
>>> def do_thing (LS, SS):
...     if LS.lower().startswith(SS.lower()):
...         return action_on_match(LS[len(SS):])
...     else:
...         return action_on_no_match()
...
>>> do_thing('i\u0307asdf', '\u0130')
'Match: \u0307asdf'
>>>
Here we expect to see 'Match: asdf', but there is an extra character.
Simple enough:
def do_thing(LS, SS):
    if LS.lower().startswith(SS.lower()):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()
All I'm doing is lower-casing both LS and SS and then comparing them. This will be much slower than a regex solution for very long strings, as it has to convert the entire string to lowercase first.
A regex solution would look like this:
import re
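def do_thing(LS, SS):
    # re.escape keeps regex metacharacters in SS from being interpreted;
    # re.match already anchors at the start of LS
    if re.match(re.escape(SS), LS, re.I):
        action_on_match(LS[len(SS):])
    else:
        action_on_no_match()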
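One way to avoid relying on len(SS) in the regex variant is to cut at the end of the match itself, since the match offsets are measured on the original LS. This is a sketch, not a complete answer to the question: strip_prefix_ignore_case is a hypothetical name, and re.IGNORECASE compares character by character, so case mappings that change the number of characters are not handled by it.

```python
import re


def strip_prefix_ignore_case(ls, ss):
    """Return ls without a leading case-insensitive match of ss, or None."""
    match = re.match(re.escape(ss), ls, re.IGNORECASE)
    if match is None:
        return None
    # match.end() is an index into ls, so the remainder is exactly the input text.
    return ls[match.end():]


print(strip_prefix_ignore_case("SELECT * FROM t", "select "))
print(strip_prefix_ignore_case("UPDATE t", "select "))
```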
Performance
For short strings (len(LS) == 8 characters) over 1000000 iterations:
lower() method: 0.86s (winner)
re method: 1.91s
For long strings (len(LS) == 600 characters) over 1000000 iterations:
lower() method: 2.54s
re method: 1.96s (winner)
Unicode combining characters
For unicode combining characters, the data needs to be normalised first. This means converting any precomposed character into its component parts. You will find for example:
>>> from unicodedata import normalize
>>> '\u0130' == 'I\u0307'
False
>>> normalize("NFD", '\u0130') == normalize("NFD", 'I\u0307')
True
You will need to perform this normalisation process on your inputs:
SS = normalize("NFD", SS)
LS = normalize("NFD", LS)
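The effect of normalisation, and the length change that caused the extra character in the question, can be seen in a few lines. A minimal sketch in Python 3; the variable names are illustrative.

```python
from unicodedata import normalize

composed = '\u0130'       # LATIN CAPITAL LETTER I WITH DOT ABOVE
decomposed = 'I\u0307'    # capital I followed by COMBINING DOT ABOVE

# Different code point sequences, equal only after normalisation.
print(composed == decomposed)
print(normalize("NFD", composed) == normalize("NFD", decomposed))

# Lowercasing the single precomposed character yields two code points.
print(len(composed), len(composed.lower()))
```

After NFD normalisation both inputs use the decomposed form, so lengths measured on one string line up with positions in the other.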
Just use str.lower, the length of "FOO" is going to be the same as "foo".lower():
LS.lower().startswith(SS.lower())
def do_thing(ls, ss):
    if ls.lower().startswith(ss.lower()):
        action_on_match(ls[len(ss):])
    else:
        action_on_no_match()

Python regular expression gives unexpected result

I'm trying to create an svn pre-commit hook, but can't get my regular expression to work as expected. It should print False for messages that do not look like "DEV-5 | some message". Why do I get True here?
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> p = re.compile("^\[[A-Z]+-[0-9]+\] | .+$", re.DOTALL)
>>> message = "test message"
>>> match = p.search(message)
>>> bool(match)
True
>>> p = re.compile("^[A-Z]+-[0-9]+ \| .+$", re.DOTALL)
>>> print p.search("test message")
None
>>> print p.search("DEV-5 | some message")
<_sre.SRE_Match object at 0x800eb78b8>
you don't need \[ and \]
you need to escape |
The culprit is the trailing " | .+$" which is matching ' message' as an alternative to the first regex. As Roman pointed out you meant to match literal '|' so you have to escape it as '\|'.
To see what was being matched, you can do:
print match.group()
' message'
By the way, a faster non-regex way to handle only lines containing a vertical bar would use line.split('|'):
for line in ...:
    parts = line.split('|', 1)
    if len(parts) == 1:
        continue
    (code, mesg) = parts
I haven't run the code, but I suspect that the part after the alternative (|) in your regexp matches any nonempty string starting with a space, in this case it's " message".
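The difference between the unescaped and the escaped bar can be reproduced side by side. A sketch in Python 3 using the two patterns from the question; the variable names are illustrative.

```python
import re

# Unescaped |: two alternatives, the second one being " .+$".
unescaped = re.compile(r"^\[[A-Z]+-[0-9]+\] | .+$", re.DOTALL)
# Escaped \|: a literal vertical bar inside a single pattern.
escaped = re.compile(r"^[A-Z]+-[0-9]+ \| .+$", re.DOTALL)

accidental = unescaped.search("test message")
print(repr(accidental.group()))

print(escaped.search("test message"))
print(escaped.search("DEV-5 | some message").group())
```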
