Python Parser-Regular Expression - python

I have two strings in Python:
String1 = "1.451E1^^http://www.test.org/Schema#double"
String2 = "http://www.test.org/m3-lite#AirTemperature"
From String1 I want to extract the number 1.451E1, i.e. everything from the start of the string up to the ^ symbol.
From String2 I want to extract the field AirTemperature, i.e. everything after the # symbol up to the end of the string.
Can anyone help me with the regular expressions for the parser?

If your strings have such clear separators, maybe a simple split is enough?
value = String1.split("^^")[0]
measurement = String2.split("#")[-1]
If regular expressions are really what you want, ^([0-9E.]+)\^ and #(\w+)$ are an ok start.
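As a minimal sketch, both approaches can be tried side by side on the two strings from the question. The lowercase variable names are illustrative only; the patterns are the two suggested above.

```python
import re

string1 = "1.451E1^^http://www.test.org/Schema#double"
string2 = "http://www.test.org/m3-lite#AirTemperature"

# Approach 1: plain string splitting on the known separators.
value_split = string1.split("^^")[0]
measurement_split = string2.split("#")[-1]

# Approach 2: the regular expressions suggested above.
value_match = re.search(r"^([0-9E.]+)\^", string1)
measurement_match = re.search(r"#(\w+)$", string2)

# re.search returns None when nothing matches, so guard before calling group().
value_re = value_match.group(1) if value_match else None
measurement_re = measurement_match.group(1) if measurement_match else None

print(value_split, value_re)              # 1.451E1 1.451E1
print(measurement_split, measurement_re)  # AirTemperature AirTemperature
```

Note that the character class [0-9E.] would not match a signed number or a negative exponent such as 1.5E-3; adding + and - to the class would be one way to cover those.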

Related

String substitution by regular expression while excluding quoted strings

I searched a bit but couldn't find any questions addressing my problem. Sorry if my question is repetitive. I'm trying to edit Python code, say to add spaces around all -/+/= operators that don't have whitespace on either side.
string = 'new_str=str+"this is a quoted string-having some operators+=- within the code."'
I would use '([^\s])(=|\+|-)([^\s])' to find such operators. The problem is that I want to exclude matches inside the quoted string. Is there any way to do this with a regular expression substitution?
The output I'm trying to get is:
edited_string = 'new_str = str + "this is a quoted string-having some operators+=- within the code."'
This example is just to help to understand the issue. I'm looking for an answer working on general cases.
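You can do it in two steps: first add a space before the operators that have no space before them, then add a space after the operators that have no space after them. Note that this only handles + and =, and it leaves the quoted text alone here only because the + and = inside the quotes are adjacent to other operators:
import re
string = 'new_str=str+"this is a quoted string-having some operators+=- within the code."'
# Step 1: insert a space before an operator that is not preceded by whitespace.
new_string = re.sub(r"(?<!\s)(\+|=)[^+=-]", r" \g<0>", string)
# Step 2: insert a space after an operator that is not followed by whitespace.
new_string = re.sub(r"(\+|=)(?=[^\s=-])", r"\g<0> ", new_string)
print(new_string)
# new_str = str + "this is a quoted string-having some operators+=- within the code."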
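For the general case, one common technique is to match quoted strings and operators in a single alternation and decide in a replacement function what to do with each match. The following is only a sketch under simplifying assumptions: the names TOKEN and space_operators are made up for illustration, only single-line '...' and "..." literals are recognised, and compound operators such as += or a unary minus would also be split and spaced.

```python
import re

# Alternative 1 captures a whole quoted string; alternative 2 captures an
# operator together with any whitespace already around it.
TOKEN = re.compile(r'''("(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*')|\s*([-+=])\s*''')

def space_operators(code):
    def repl(match):
        if match.group(1) is not None:
            return match.group(1)       # quoted string: leave untouched
        return f" {match.group(2)} "    # operator: exactly one space on each side
    return TOKEN.sub(repl, code)

string = 'new_str=str+"this is a quoted string-having some operators+=- within the code."'
print(space_operators(string))
# new_str = str + "this is a quoted string-having some operators+=- within the code."
```

Because the quoted-string alternative is tried first at each position, operators inside a string literal are consumed as part of the literal and never reach the operator branch.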

Check if a list of partial strings is within a single string?

Hopefully the same question hasn't already been answered (I looked but could not find).
I have a list of partial strings:
date_parts = ['/Year', '/Month', '/Day',....etc. ]
and I have a string.
E.g.
string1 = "Tag01/Source 01/Start/Year"
or
string1 = "Tag01/Source 01/Volume"
What is the most efficient way, apart from using a for loop, to check if any of the date_parts strings are contained within the string?
For info, string1 in reality is actually another list of many strings and I would like to remove any of these strings that contain a string within the date_parts list.
Compile a regex from the partial strings. Use re.escape() in case they contain characters that have a special meaning in regex syntax.
import re
date_parts = ['/Year', '/Month', '/Day']
pattern = re.compile('|'.join(re.escape(s) for s in date_parts))
Then use re.search() to see if it matches.
string1 = "Tag01/Source 01/Start/Year"
re.search(pattern, string1)
The regex engine is probably faster than a native Python loop.
For your particular use case, consider concatenating all the strings, like
all_string = '\n'.join(strings+[''])
Then you can do them all at once in a single call to the regex engine.
pattern = '|'.join(f'.*{re.escape(s)}.*\n' for s in date_parts)
strings = re.sub(pattern, '', all_string).split('\n')[:-1]
Of course, this assumes that none of your strings has a '\n'. You could pick some other character that's not in your strings to join and split on if necessary. '\f', for example, should be pretty rare. Here's how you might do it with '#'.
all_string = '#'.join(strings+[''])
pattern = '|'.join(f'[^#]*{re.escape(s)}[^#]*#' for s in date_parts)
strings = re.sub(pattern, '', all_string).split('#')[:-1]
If that's still not fast enough, you could try a faster regex engine, like rure.
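Before reaching for the join-and-substitute trick, the compiled pattern can also be applied to each string in a list directly. This is a small sketch; the contents of strings are made-up sample data in the style of the question.

```python
import re

date_parts = ['/Year', '/Month', '/Day']
pattern = re.compile('|'.join(re.escape(s) for s in date_parts))

# Hypothetical sample data modelled on the strings in the question.
strings = [
    "Tag01/Source 01/Start/Year",
    "Tag01/Source 01/Volume",
    "Tag02/Source 02/End/Month",
]

# Keep only the strings in which none of the date parts occur.
kept = [s for s in strings if not pattern.search(s)]
print(kept)  # ['Tag01/Source 01/Volume']
```

Whether this or the single-call version is faster depends on the data, so it is worth timing both on a realistic list.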
You can use the any function with a generator expression. It is more concise than an explicit for loop and stops as soon as one part matches.
For one string, you can test like this:
any(p in string1 for p in date_parts)
If strings is a list of many strings you want to check, you could do this:
unmatched = [s for s in strings if not any(p in s for p in date_parts)]
or
unmatched = [s for s in strings if all(p not in s for p in date_parts)]

Return a string of country codes from an argument that is a string of prices

So here's the question:
Write a function that will return a string of country codes from an argument that is a string of prices (containing dollar amounts following the country codes). Your function will take as an argument a string of prices like the following: "US$40, AU$89, JP$200". In this example, the function would return the string "US, AU, JP".
Hint: You may want to break the original string into a list, manipulate the individual elements, then make it into a string again.
Example:
testEqual(get_country_codes("NZ$300, KR$1200, DK$5"), "NZ, KR, DK")
As of now, I'm clueless as to how to separate the $ and the numbers. I'm very lost.
I would advise reading up on regular expressions:
https://docs.python.org/2/library/re.html
If you use re.findall, it returns a list of all matches, and with a pattern like ([A-Z]{2})\$ you can find all the two-letter uppercase codes that are followed by a dollar sign.
After that you can just create a string from the resulting list.
Let me know if that is not clear
def test(string):
    # Split into items like "NZ$300", then keep the part before the "$".
    return ", ".join(item.split("$")[0] for item in string.split(", "))
string = "NZ$300, KR$1200, DK$5"
print(test(string))
Use a regular expression pattern and join the matches into a string. (\w{2})\$ matches exactly 2 word characters followed by a $.
import re
def get_country_codes(string):
    # findall returns only the captured group, i.e. the two-character code.
    matches = re.findall(r"(\w{2})\$", string)
    return ", ".join(matches)

Retrieve part of string, variable length

I'm trying to learn how to use Regular Expressions with Python. I want to retrieve an ID number (in parentheses) in the end from a string that looks like this:
"This is a string of variable length (561401)"
The ID number (561401 in this example) can be of variable length, as can the text.
"This is another string of variable length (99521199)"
My code fails:
import re
import selenium
# [Code omitted here, I use selenium to navigate a web page]
result = driver.find_element_by_class_name("class_name")
print(result.text)  # [This correctly prints the whole string "This is a text of variable length (561401)"]
id = re.findall("??????", result.text)  # [Not sure what to do here]
print(id)
This should work for your example:
(?<=\()[0-9]*
(?<=...) is a lookbehind: it requires the given text to appear immediately before the match but does not include it in the match. In this case, I used \(. ( is a special character, so it has to be escaped with \. [0-9] matches any single digit. The * means zero or more repetitions of the preceding element, so [0-9]* matches as many consecutive digits as there are.
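A short sketch of how that pattern could be applied with re.search, using the sample string from the question. The variable names are illustrative, and + is used in place of * so that an empty pair of parentheses does not produce an empty match.

```python
import re

text = "This is a string of variable length (561401)"

# (?<=\() requires a "(" right before the match without including it.
match = re.search(r"(?<=\()[0-9]+", text)

# re.search returns None when there is no match, so guard before group().
id_number = match.group() if match else None
print(id_number)  # 561401
```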
Solved this thanks to Kaz's link, very useful:
http://regex101.com/
id = re.findall(r"(\d+)", result.text)
print(id[0])
You can use this simple solution:
>>> original_string = "This is a string of variable length (561401)"
>>> str1 = original_string.replace("(", " ")
>>> str1
'This is a string of variable length  561401)'
>>> str2 = str1.replace(")", " ")
>>> str2
'This is a string of variable length  561401 '
>>> [int(s) for s in str2.split() if s.isdigit()]
[561401]
First, I replace the parentheses with spaces, and then I search the new string for integers.
There is no real need for regular expressions here. If the ID is always at the end and always in parentheses, you can split the string, take the last element, and remove the parentheses by slicing ([1:-1]). Regexes are relatively expensive in time.
line = "This is another string of variable length (99521199)"
print(line.split()[-1][1:-1])
If you did want to use regular expressions I would do this:
import re
line = "This is another string of variable length (99521199)"
id_match = re.match(r'.*\((\d+)\)', line)
if id_match:
    print(id_match.group(1))

python: regular expressions, how to match a string of undefined length which has a structure and finishes with a specific group

I need to create a regexp to match strings like this 999-123-222-...-22
The string may optionally end with &Ns=(any number). So valid strings for me are:
999-123-222-...-22
999-123-222-...-22&Ns=12
999-123-222-...-22&Ns=12
And the following is not valid:
999-123-222-...-22&N=1
I have been testing this for several hours already but have not managed to solve it. I really need some help.
Not sure if you want to literally match 999-123-222-...-22 or if that can be any sequence of numbers/dashes. Here are two different regexes:
^[\d-]+(&Ns=\d+)?$
^999-123-222-\.\.\.-22(&Ns=\d+)?$
The key idea is the (&Ns=\d+)?$ part, which matches an optional &Ns=<digits>, and is anchored to the end of the string with $.
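In Python, the first of these patterns could be used roughly as follows. This sketch assumes that the "..." in the question stands for further groups of digits; the names PATTERN and is_valid and the test strings are illustrative.

```python
import re

# Assumption: "..." in the question stands for further groups of digits.
PATTERN = re.compile(r"[\d-]+(&Ns=\d+)?")

def is_valid(s):
    # fullmatch anchors at both ends, so ^ and $ are not needed in the pattern.
    return PATTERN.fullmatch(s) is not None

for s in ["999-123-222-22", "999-123-222-22&Ns=12", "999-123-222-22&N=1"]:
    print(s, is_valid(s))
```

The first two strings are accepted and the third is rejected, because &N=1 is neither part of the digit-and-dash run nor a valid &Ns= suffix.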
If you just want to allow the strings 999-123-222-...-22 and 999-123-222-...-22&Ns=12, you are better off using a plain string comparison.
If you want to allow any numbers between the dashes, you can use the regex:
^(\d+-){3}[.]{3}-\d+(&Ns=\d+)?$
If each number must have exactly 3 digits and the last number exactly 2 digits, you can use:
^(\d{3}-){3}[.]{3}-\d{2}(&Ns=\d{2})?$
This looks like a phone number with extension information.
Why not make things simpler for yourself (and anyone who has to read this later) and split the input rather than use a complicated regex?
s = '999-123-222-...-22&Ns=12'
parts = s.split('&Ns=')  # splits on '&Ns=' and removes it
If the piece before the "&" is a phone number, you could do another split and get the area code etc into separate fields, like so:
phone_parts = parts[0].split('-') # breaks up the digit string and removes the '-'
area_code = phone_parts[0]
The portion found after the optional '&Ns=' can be checked with the string method isdigit, which returns True if all characters in the string are digits and there is at least one character, and False otherwise.
if len(parts) > 1:
    extra_digits_ok = parts[1].isdigit()
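Put together, the split-based approach might look like the sketch below. The function name parse_code and its return shape are made up for illustration, and it again assumes that the "..." in the question stands for further groups of digits.

```python
def parse_code(s):
    """Return (digit_groups, ns) for a valid string, or None if it is invalid."""
    # partition returns (head, separator, tail); separator is '' if not found.
    head, sep, tail = s.partition('&Ns=')
    groups = head.split('-')
    # Every dash-separated group must consist of digits only.
    if not all(g.isdigit() for g in groups):
        return None
    # If the '&Ns=' suffix is present, it must be followed by digits only.
    if sep and not tail.isdigit():
        return None
    return groups, (int(tail) if sep else None)

print(parse_code('999-123-222-22&Ns=12'))  # (['999', '123', '222', '22'], 12)
print(parse_code('999-123-222-22'))        # (['999', '123', '222', '22'], None)
print(parse_code('999-123-222-22&N=1'))    # None
```

Using partition instead of split avoids the length check on the result, since it always returns three parts.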
