How to extract a part of url from dictionary value in Python? - python

I have a dictionary where the key's value is
https://service-dmn1-region.com/info 4169 description
I'm interested in fetching dmn1-region from that URL part and print 4169 description as it is. So I intend to print result as:
dmn1-region :4169 description
Do you think it's possible without a complex regular expression? The script is in Python, and I tried this:
import re
print re.sub('https://','',dictionary[key])
This just removes https:// part and shows result as service-dmn1-region.com/info 4169 description . But I'm not sure how to achieve the above intended way.
key-value pairs from dictionary looks like-
dictionary = {'service': 'https://service-dmn1-region.com/info 4169 description',
              'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription',
              'service2': 'https://dmn1-region-service2.com/info'}
Any insights and help very much appreciated.

Given the information and the fact that you don't want to use regular expressions, you could do something like this:
dictionary = {'service': 'https://service-dmn1-region.com/info 4169 description',
              'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription'}
def extract(key, s):
    info = '/info'
    service = key + '-'
    return s[s.find('service') + len(service):s.find('.com')], s[s.find(info) + len(info):].strip()
for key, value in dictionary.items():
    region, info = extract(key, value)
    print('{0}:{1}'.format(region, info))
Output
dmn2-region2:5123 someDescription
dmn1-region:4169 description
Note that the urls are the values of the dictionary and not the keys.
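A variation on the same idea, as a rough sketch only: let urllib.parse.urlsplit pick out the host name instead of searching for '.com' by hand. It assumes each value is a URL followed by a space and free text, and that the host name starts with the dictionary key plus a hyphen; the names extract and results are just for illustration.

```python
from urllib.parse import urlsplit

dictionary = {
    'service': 'https://service-dmn1-region.com/info 4169 description',
    'service1': 'https://service1-dmn2-region2.com/info 5123 someDescription',
}

def extract(key, value):
    # Split the URL from the trailing text at the first space.
    url, _, rest = value.partition(' ')
    # Host name without its last dotted part, e.g. 'service-dmn1-region'.
    host = urlsplit(url).hostname.rsplit('.', 1)[0]
    prefix = key + '-'
    # Drop the 'service-' style prefix only when it is actually there.
    region = host[len(prefix):] if host.startswith(prefix) else host
    return region, rest.strip()

results = {key: extract(key, value) for key, value in dictionary.items()}
for region, info in results.values():
    print('{0} :{1}'.format(region, info))
```

With the two sample entries this prints dmn1-region :4169 description and dmn2-region2 :5123 someDescription.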

I'd use something like:
import re
for k, v in dictionary.items():  # .iteritems() for py2
    print(re.sub(r"^.*?{}-([^.]+).*?(\d+)\s(.*?)$".format(k), r"\1 :\2 \3", v))
dmn1-region :4169 description
dmn2-region2 :5123 someDescription

for values of the type https://service-dmn1-region.com/info 4169 description
you could just match on ^[^-]+-([^.]+)[^\s]+ (.*)$
[harald#localhost ~]$ python3
Python 3.6.6 (default, Jul 19 2018, 14:25:17)
[GCC 8.1.1 20180712 (Red Hat 8.1.1-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> val = 'https://service-dmn1-region.com/info 4169 description'
>>> res = re.match(r'^[^-]+-([^.]+)[^\s]+ (.*)$', val)
>>> res.group(1)
'dmn1-region'
>>> res.group(2)
'4169 description'
Here ^[^-]+ matches, from the start of the input (the initial ^), one or more characters that are not a hyphen ([^-]+), so it matches https://service.
Next you specify that a hyphen must follow, giving ^[^-]+-, and that you wish to capture everything after it that isn't a dot with ([^.]+). (As you may have guessed by now, starting a bracketed character class [] with ^ negates it.)
That leads to ^[^-]+-([^.]+). Next you want to ignore everything up to the next whitespace, since whitespace separates the URL from the other values in the string, so you add a pattern for anything that is not whitespace (\s), the extra [^\s]+, giving ^[^-]+-([^.]+)[^\s]+.
You then follow that with the whitespace separator (if you expect more than one whitespace character you could use \s+ instead of a literal space) and add a final catch-all capture group (.*), which captures 4169 description (the dot stands for any character except a newline here) up to the end of the input, $, leading to ^[^-]+-([^.]+)[^\s]+ (.*)$.
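To try the pattern on more than one value, something like the following sketch might help. It uses \s+ for the separator so several spaces are tolerated, and returns None when a value has no trailing text (as with the 'service2' entry in the question). The names PATTERN and split_value are made up for this example.

```python
import re

# Same pattern as above, with \s+ so several separating spaces are tolerated.
PATTERN = re.compile(r'^[^-]+-([^.]+)[^\s]+\s+(.*)$')

def split_value(value):
    match = PATTERN.match(value)
    if match is None:
        return None  # e.g. a value with no trailing description
    return match.group(1), match.group(2)

print(split_value('https://service-dmn1-region.com/info 4169 description'))
# ('dmn1-region', '4169 description')
print(split_value('https://dmn1-region-service2.com/info'))
# None
```

Checking for None before calling group() avoids an AttributeError on values that don't fit the expected shape.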

Related

Removing non integers from a grep obtained string w/ Python and Bash

I am using grep to grab the text out of a file:
NELECT = 44.0000 total number of electrons,
and I need to save the number as a variable. I have tried a handful of methods I have found here such as using filters and findall. For some reason I can only get it to separate one zero.
So far the code looks like this:
wd=os.getcwd()
electrons=str(os.system("grep 'NELECT' "+wd+"/OUTCAR"))
VBM=(re.findall('\d+', electrons))
print VBM
And in return I get ['0'].
The result of os.system is the exit status of the command, not the output of the command -- see https://docs.python.org/3/library/os.html#os.system
$ cat OUTCAR
NELECT = 44.0000 total number of electrons,
$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> result = os.system("grep 'NELECT' "+os.getcwd()+"/OUTCAR")
NELECT = 44.0000 total number of electrons,
>>> result
0
The "NELECT" line was just printed by grep to stdout, but not captured in the result variable
>>> from subprocess import check_output
>>> result2 = check_output(["grep", "NELECT", os.getcwd()+"/OUTCAR"])
>>> result2
'NELECT = 44.0000 total number of electrons,\n'
>>> import re
>>> re.findall(r'\d+', result2)
['44', '0000']
Or, don't call out to grep, read the file yourself
>>> import os
>>> import re
>>> with open(os.getcwd() + "/OUTCAR") as f:
...     for line in f:
...         if "NELECT" in line:
...             digits = re.findall(r'\d+', line)
...             break
...
>>> digits
['44', '0000']
Or, maybe don't use a regular expression:
>>> words = line.split()
>>> words[2]
'44.0000'
>>> int(float(words[2]))
44
Are you sure that electrons contains the output you expect? For me this regex returns a list with two elements, ['44', '0000'], and that's the expected behavior. So most probably there is something wrong with the grep call.
Your regex won't retrieve the whole 44.0000, as \d+ catches only continuous runs of digits, not the dot. To get the whole number use something like \b\d+\.\d+\b, which means: any word (\b marks a word boundary; the dot must be escaped because an unescaped . in a regex matches any character) that contains at least one digit, a dot, and at least one more digit. If the fractional part is optional, use something like \b(\d+(?:\.\d+)?)\b ((?:...) creates a group that is not captured, so your output will still be a single-element list).
Note that re.findall will return list of string matches. To retrieve number from first match: float(VBM[0])
Edit. Forgot to add: avoid using print statement, it works oddly with tuples and is completely removed in Python 3. Python 2 support ends in 2020 so it's better to prepare. You can replace print statement with Python 3 print function by adding from __future__ import print_function at the file beginning.
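A minimal sketch of the optional-fraction pattern described above, applied to the sample line from the question rather than to real grep output. The variable names are only for illustration.

```python
import re

line = 'NELECT = 44.0000 total number of electrons,\n'

# Optional fractional part; the non-capturing group keeps findall's output flat.
number = re.compile(r'\b\d+(?:\.\d+)?\b')

matches = number.findall(line)
# Guard against an empty list before indexing into it.
nelect = float(matches[0]) if matches else None
print(matches, nelect)
# ['44.0000'] 44.0
```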

how to lift the data with regex in python that's between two semicolons?

I got a set of lines in a file that's separated by semicolons like this:
8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;
What I want is the whole message up until 10=000; and the value of 7202 which would be asdf:asdf.
I got this:
(^.*000;)
which according to regex should get me the whole line until 10=000;. Which is great. But if I do this:
(^.*000;)(7202=.*;)
regex101.com says it won't match anything.
I don't know why adding that second group invalidates the whole expression.
Any help on this would be great.
Thanks
Answer for first version of question
"I am trying to use regex with python to lift out my data from 7202=, so I want to get the asdf:asdf."
If I understand correctly, your goal is to find the data that is between 7202= and ;. In that case:
>>> import re
>>> line = "8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;"
>>> re.search('7202=([^;]*);', line).group(1)
'asdf:asdf'
The regex is 7202=([^;]*);. This matches:
The literal string 7202=
Any characters that follow, up to but excluding the first semicolon: ([^;]*). Because this is in parentheses, it is captured as group 1.
The literal character ;
Answer for second version of question
"What I want is the whole message up until 10=000; and the value of 7202 which would be asdf:asdf."
>>> import re
>>> line = "8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;"
>>> r = re.search('.*7202=([^;]*);.*10=000;', line)
>>> r.group(0), r.group(1)
('8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;52=20130624-04:10:00.843;43=Y;98=0;10=000;', 'asdf:asdf')
The regex is .*7202=([^;]*);.*10=000;. This matches:
Anything up to and including 7202=: .*7202=
Any characters that follow, up to but excluding the first semicolon: ([^;]*). Because this is in parentheses, it is captured as group 1.
Any characters that follow starting with ; and ending with 10=000;: ;.*10=000;
The value of the whole match string is available as r.group(0). The value of group 1 is available as r.group(1). Thus the single match object r lets us get both strings.
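If a regex is not a hard requirement, the same two pieces can also be pulled out with plain string methods. This is only a sketch: it assumes the text 10=000; appears exactly once as a whole field, and the names END, message and fields are made up here.

```python
line = ("8=FIX.4.2;9=159;35=A;56=MBT;34=1;7202=asdf:asdf;"
        "52=20130624-04:10:00.843;43=Y;98=0;10=000;"
        "Timestamp=Fri July 25 1958 16:12:52:112545;MsgDirection=1;")

END = "10=000;"

# Everything up to and including the first occurrence of 10=000;
head, found, _ = line.partition(END)
message = head + found

# Split the kept fields into tag -> value; only the first '=' separates them.
fields = dict(item.split("=", 1) for item in message.split(";") if item)

print(message)
print(fields.get("7202"))
# asdf:asdf
```

Using fields.get means a missing 7202 tag yields None instead of raising an exception.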

Python Regular Expression Extracting 'name= ....'

I'm using a Python script to read data from our corporate instance of JIRA. There is a value that is returned as a string and I need to figure out how to extract one bit of info from it. What I need is the 'name= ....' and I just need the numbers from that result.
<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']
I just need the 2016.2.4 portion of it. This number will not always be the same either.
Any thoughts as how to do this with RE? I'm new to regular expressions and would appreciate any help.
A simple regular expression can do the trick: name=([0-9.]+).
The primary part of the regex is ([0-9.]+) which will search for any digit (0-9) or period (.) in succession (+).
Now, to use this:
import re
pattern = re.compile('name=([0-9.]+)')
string = '''<class 'list'>: ['com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]']'''
matches = pattern.search(string)
# Only assign the value if a match is found
name_value = '' if not matches else matches.group(1)
Use a capturing group to extract the version name:
>>> import re
>>> s = 'com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,startDate=2016-05-26T08:50:57.273-07:00,endDate=2016-06-08T20:59:00.000-07:00,completeDate=2016-06-09T07:34:41.899-07:00,sequence=30943]'
>>> re.search(r"name=([0-9.]+)", s).group(1)
'2016.2.4'
where ([0-9.]+) matches one or more digits or dots; the parentheses define a capturing group.
A non-regex option would involve some splitting by ,, = and -:
>>> l = [item.split("=") for item in s.split(",")]
>>> next(value[1] for value in l if value[0] == "name").split(" - ")[0]
'2016.2.4'
This, of course, needs testing and error handling.
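As a rough sketch of the error handling mentioned above: since the JIRA field arrives as a list of strings, a small helper can apply the regex to each entry and return None where no name= version is found. The helper name sprint_versions is hypothetical, and the sample string is shortened from the one in the question.

```python
import re

NAME_VERSION = re.compile(r'name=([0-9.]+)')

def sprint_versions(sprints):
    """Return the leading version of each sprint string, or None if absent."""
    versions = []
    for sprint in sprints:
        match = NAME_VERSION.search(sprint)
        versions.append(match.group(1) if match else None)
    return versions

sprints = [
    # Shortened copy of the sample value from the question.
    'com.atlassian.greenhopper.service.sprint.Sprint#6f68eefa[id=30943,'
    'rapidViewId=10468,state=CLOSED,name=2016.2.4 - XXXXXXXXXX,'
    'startDate=2016-05-26T08:50:57.273-07:00,sequence=30943]',
    'no sprint information here',
]
print(sprint_versions(sprints))
# ['2016.2.4', None]
```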

Filter a string for valid MySQL Column name

I have been scraping data from a site.
I have this list scraped
[' ', '*One child under 12 years old stays free using existing bedding.', '24 hour front desk', 'Bar / Lounge', 'Business centre', 'Concierge', 'Dry cleaning / laundry service', ...
This is scraped so far and more (about 20) would be scraped too.
I want to create a column in my Table for every entry in List by getting its first 20 characters.
Here is how I filter these entries to make a valid MySQL column name.
column_name = column_to_create[:20].replace(" ","_").replace("/","_").replace("*","_").replace("-","_").replace("$","_").replace("&","_").replace(".","_")
I know it does not cover many invalid characters.
How can I filter to get a valid column name? Is there a shorter solution, perhaps a regex?
Use this Regex:
column_name = re.sub(r'[-/*$&.\s]+','_',column_to_create[:20])
Demo:
>>> import re
>>> st = "replace/ these**characters---all$$of&them....with_"
>>> re.sub(r'[-/*$&.\s]+','_',st)
'replace_these_characters_all_of_them_with_'
Also, if there is any other character you want to replace with _, just add that character inside the square brackets in the regex. Say, for example, you also need to replace #. Then the call would become re.sub(r'[-/*$&.\s#]+','_',column_to_create[:20]).
Python has a translate capability you can use to easily change one character into another, or delete characters. I use it something like this in Python 2 (string.maketrans and string.letters do not exist in Python 3); after the import, the first 3 lines are setup and the 4th line is actually using it.
import string
norm = string.maketrans(' _,','---') # space underscore comma to dash
keep = "-#'$%{}[]~#().&^+=/\/:"
toss = string.translate(norm,norm,string.letters+string.digits+keep)
toName = toName.translate(norm,toss)

Python, how do I parse key=value list ignoring what is inside parentheses?

Suppose I have a string like this:
"key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
I would like to get a dictionary corresponding to the above, where the value for key3 is the string
"(key3.1=value3.1;key3.2=value3.2)"
and eventually the corresponding sub-dictionary.
I know how to split the string at the semicolons, but how can I tell the parser to ignore the semicolon between parentheses?
This includes potentially nested parentheses.
Currently I am using an ad-hoc routine that looks for pairs of matching parentheses, "clears" its content, gets split positions and applies them to the original string, but this does not appear very elegant, there must be some prepackaged pythonic way to do this.
If anyone is interested, here is the code I am currently using:
def pparams(parameters, sep=';', defs='=', brc='()'):
    '''
    unpackages parameter string to struct
    for example, pippo(a=21;b=35;c=pluto(h=zzz;y=mmm);d=2d3f) becomes:
    a: '21'
    b: '35'
    c.fn: 'pluto'
    c.h='zzz'
    d: '2d3f'
    fn_: 'pippo'
    '''
    ob=strfind(parameters,brc[0])
    dp=strfind(parameters,defs)
    out={}
    if len(ob)>0:
        if ob[0]<dp[0]:
            #opening function
            out['fn_']=parameters[:ob[0]]
            parameters=parameters[(ob[0]+1):-1]
    if len(dp)>0:
        temp=smart_tokenize(parameters,sep,brc);
        for v in temp:
            defp=strfind(v,defs)
            pname=v[:defp[0]]
            pval=v[1+defp[0]:]
            if len(strfind(pval,brc[0]))>0:
                out[pname]=pparams(pval,sep,defs,brc);
            else:
                out[pname]=pval
    else:
        out['fn_']=parameters
    return out
def smart_tokenize( instr, sep=';', brc='()' ):
    '''
    tokenize string ignoring separators contained within brc
    '''
    tstr=instr;
    ob=strfind(instr,brc[0])
    while len(ob)>0:
        cb=findclsbrc(tstr,ob[0])
        tstr=tstr[:ob[0]]+'?'*(cb-ob[0]+1)+tstr[cb+1:]
        ob=strfind(tstr,brc[1])
    sepp=[-1]+strfind(tstr,sep)+[len(instr)+1]
    out=[]
    for i in range(1,len(sepp)):
        out.append(instr[(sepp[i-1]+1):(sepp[i])])
    return out
def findclsbrc(instr, brc_pos, brc='()'):
    '''
    given a string containing an opening bracket, finds the
    corresponding closing bracket
    '''
    tstr=instr[brc_pos:]
    o=strfind(tstr,brc[0])
    c=strfind(tstr,brc[1])
    p=o+c
    p.sort()
    s1=[1 if v in o else 0 for v in p]
    s2=[-1 if v in c else 0 for v in p]
    s=[s1v+s2v for s1v,s2v in zip(s1,s2)]
    s=[sum(s[:i+1]) for i in range(len(s))] #cumsum
    return p[s.index(0)]+brc_pos
def strfind(instr, substr):
    '''
    returns starting position of each occurrence of substr within instr
    '''
    i=0
    out=[]
    while i<=len(instr):
        try:
            p=instr[i:].index(substr)
            out.append(i+p)
            i+=p+1
        except:
            i=len(instr)+1
    return out
If you want to build a real parser, use one of the Python parsing libraries, like PLY or PyParsing. If you figure such a full-fledged library is overkill for the task at hand, go for some hack like the one you already have. I'm pretty sure there is no clean few-line solution without an external library.
Expanding on Sven Marnach's answer, here's an example of a pyparsing grammar that should work for you:
from pyparsing import (ZeroOrMore, Word, printables, Forward,
                       Group, Suppress, Dict)
collection = Forward()
simple_value = Word(printables, excludeChars='()=;')
key = simple_value
inner_collection = Suppress('(') + collection + Suppress(')')
value = simple_value ^ inner_collection
key_and_value = Group(key + Suppress('=') + value)
collection << Dict(key_and_value + ZeroOrMore(Suppress(';') + key_and_value))
coll = collection.parseString(
    "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)")
print(coll['key1'])  # value1
print(coll['key2'])  # value2
print(coll['key3']['key3.1'])  # value3.1
You could use a regex to capture the groups:
>>> import re
>>> s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
>>> r = re.compile(r'(\w+)=(\w+|\([^)]+\));?')
>>> dict(r.findall(s))
This regex says:
(\w+)        # Find and capture a group with 1 or more word characters (letters, digits, underscores)
=            # Followed by the literal character '='
(\w+         # Followed by a group with 1 or more word characters
|\([^)]+\)   # or a group that starts with an open paren (parens are escaped as '\(' and '\)'), followed by anything up until a closed paren, which terminates the alternate grouping
);?          # optionally this grouping might be followed by a semicolon.
Gotta say, kind of a strange grammar. You should consider using a more standard format. If you need guidance choosing one maybe ask another question. Good luck!
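For comparison, one library-free way to ignore separators inside parentheses, including nested ones, is to walk the string once and keep a depth counter. This is only a sketch under the assumption that the parentheses are balanced; the function names split_top_level and parse are made up for this example.

```python
def split_top_level(text, sep=';', brackets='()'):
    """Split text on sep, ignoring separators nested inside brackets."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == brackets[0]:
            depth += 1
        elif ch == brackets[1]:
            depth -= 1
        if ch == sep and depth == 0:
            # A separator outside all brackets ends the current item.
            parts.append(''.join(current))
            current = []
        else:
            current.append(ch)
    parts.append(''.join(current))
    return parts

def parse(text):
    """Build a dict; parenthesised values are parsed recursively."""
    result = {}
    for item in split_top_level(text):
        key, _, value = item.partition('=')
        if value.startswith('(') and value.endswith(')'):
            value = parse(value[1:-1])
        result[key] = value
    return result

s = "key1=value1;key2=value2;key3=(key3.1=value3.1;key3.2=value3.2)"
print(split_top_level(s))
# ['key1=value1', 'key2=value2', 'key3=(key3.1=value3.1;key3.2=value3.2)']
print(parse(s))
# {'key1': 'value1', 'key2': 'value2', 'key3': {'key3.1': 'value3.1', 'key3.2': 'value3.2'}}
```

Unbalanced input is not detected here; a real parser would want to raise an error when depth goes negative or ends above zero.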
