What is the RE to match the list? - python

I want to know how to construct the regular expression to extract the list.
Here is my string:
audit = "{## audit_filter = ['hostname.*','service.*'] ##}"
Here is my expression:
AUDIT_FILTER_RE = r'([.*])'
And here is my search statement:
audit_filter = re.search(AUDIT_FILTER_RE, audit).group(1)
I want to extract everything inside the square brackets including the brackets. '[...]'
Expected Output:
['hostname.*','service.*']

import re
audit = "{## audit_filter = ['hostname.*','service.*'] ##}"
print(eval(re.findall(r"\[.*\]", audit)[0]))  # ['hostname.*', 'service.*']
findall returns a list of string matches. In your case, there should only be one, so I'm retrieving the string at index 0, which is a string representation of a list. Then, I use eval(...) to convert that string representation of a list to an actual list. Just beware:
If there are no matches, findall(...)[0] raises an IndexError (list index out of range).
Don't use eval() if you ever expect input coming from another source (i.e. input that is not yours), because eval() executes arbitrary code and that would be a security issue.
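If you want to avoid both caveats, here is a minimal sketch that swaps in re.search and ast.literal_eval. The helper name extract_list is made up for illustration, and it assumes the bracketed text is a plain Python list literal:

```python
import ast
import re

audit = "{## audit_filter = ['hostname.*','service.*'] ##}"

def extract_list(text):
    """Return the first bracketed list in text as a Python list, or None."""
    match = re.search(r"\[.*?\]", text)
    if match is None:
        # no brackets at all: report that instead of raising IndexError
        return None
    # literal_eval only accepts Python literals, so it does not run arbitrary code
    return ast.literal_eval(match.group(0))

print(extract_list(audit))  # ['hostname.*', 'service.*']
```

If you only need the bracketed text as a string, match.group(0) already gives you that without any evaluation.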

Use r"\[(.*?)\]"
Ex:
import re
audit = "{## audit_filter = ['hostname.*'] ##}"
print(re.findall(r"\[(.*?)\]", audit))
Output:
["'hostname.*'"]

Related

regex matching and get into a python list

I have the following saved as a string in a variable:
window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]
How do I extract the values of each element?
Goal would be to have something like this:
articleCondition = 'new'
categoryNr = '12345'
...
In Python there are many ways to get a value out of a string: you can use a regex, the built-in eval function, and probably other ways that I don't know of.
Method 1
value = 'window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'
value = value.split('=')[1]
data = eval(value)[0]
articleCondition = data['articleCondition']
Method 2
using regex
import re
re.findall(r'"articleCondition":"(\w*)"', value)
With regex you can be more creative and write a more general pattern.
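As a rough sketch of what a more general pattern could look like, the following captures every "key":"value" pair at once. It assumes that keys and values never contain double quotes, which holds for the sample string but not for JSON in general:

```python
import re

value = 'window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'

# Two groups per match, so findall returns (key, value) tuples that dict() accepts
pairs = dict(re.findall(r'"([^"]+)":"([^"]*)"', value))

print(pairs["articleCondition"])  # New
print(pairs["categoryNr"])  # 12345
```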
You have a list of dictionaries. Use the dictionary key to get the value.
Ex:
dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]
print(dataLayer[0]["articleCondition"])
print(dataLayer[0]["categoryNr"])
Output:
New
12345
Use json. Your string is:
>>> s = 'window.dataLayer=[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'
You can get the right-hand side of the = with a split:
>>> s.split('=')[1]
'[{"articleCondition":"New","categoryNr":"12345","sellerCustomerNr":"88888888","articleStatus":"Open"}]'
Then parse it with the json module:
>>> import json
>>> t = json.loads(s.split('=')[1])
>>> t[0]['articleCondition']
'New'
Please note that this works because you have double quotes in the RHS. Single quotes are not allowed in JSON.
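One thing to watch: split('=')[1] cuts the payload short if a value itself contains an =. A small sketch using str.partition, which only splits on the first =, is below. The "note" key is a hypothetical addition to show the difference:

```python
import json

# "note" is a made-up field whose value contains '='
s = 'window.dataLayer=[{"articleCondition":"New","note":"a=b"}]'

# partition splits on the first '=' only, so '=' inside the JSON payload is preserved
_, _, payload = s.partition("=")
record = json.loads(payload)[0]

print(record["articleCondition"])  # New
print(record["note"])  # a=b
```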

Excluding a specific string of characters in a str()-function

A small issue I've encountered during coding.
I'm looking to print out the name of a .txt file.
For example, the file is named: verdata_florida.txt, or verdata_newyork.txt
How can I exclude .txt and verdata_, but keep the string between? It must work for any number of characters, but .txt and verdata_ must be excluded.
This is where I am so far, I've already defined filename to be input()
print("Average TAM at", str(filename[8:**????**]), "is higher than ")
3 ways of doing it:
using str.split twice:
>>> "verdata_florida.txt".split("_")[1].split(".")[0]
'florida'
using str.partition twice (you won't get an exception if the format doesn't match, and probably faster too):
>>> "verdata_florida.txt".partition("_")[2].partition(".")[0]
'florida'
using re, keeping only center part:
>>> import re
>>> re.sub(r".*_(.*)\..*", r"\1", "verdata_florida.txt")
'florida'
All of the above need tuning if _ and . can appear multiple times (should the longest or the shortest match be kept?).
EDIT: In your case, though, prefixes & suffixes seem fixed. In that case, just use str.replace twice:
>>> "verdata_florida.txt".replace("verdata_","").replace(".txt","")
'florida'
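Note that str.replace removes the text wherever it occurs, not only at the ends. If you are on Python 3.9 or newer, str.removeprefix and str.removesuffix only touch the start and the end. A short sketch, where the second file name is a hypothetical awkward case:

```python
# str.removeprefix and str.removesuffix need Python 3.9 or newer
filename = "verdata_florida.txt"
place = filename.removeprefix("verdata_").removesuffix(".txt")
print(place)  # florida

# Hypothetical awkward name: replace() also strips the ".txt" in the middle
tricky = "verdata_my.txt_backup.txt"
print(tricky.replace("verdata_", "").replace(".txt", ""))  # my_backup
print(tricky.removeprefix("verdata_").removesuffix(".txt"))  # my.txt_backup
```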
Assuming you want to split on the first _ and the last ., you can use slicing with the index and rindex methods. They search for the first (index) or last (rindex) occurrence of the substring passed as the argument and return its position. If the substring is not found, they raise a ValueError. If you want the search but not the ValueError, you can use find and rfind instead, which do the same thing but return -1 when there is no match.
s = 'verdata_new_hampshire.txt'
s_trunc = s[s.index('_') + 1: s.rindex('.')] # or s[s.find('_') + 1: s.rfind('.')]
print(s_trunc) # new_hampshire
Of course, if you are always going to exclude verdata_ and .txt you could always hardcode the slice as well.
print(s[8:-4]) # new_hampshire
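Be careful with the find/rfind variant: a -1 result is still a valid slice index, so a missing _ or . gives you a wrong substring instead of an error. A minimal sketch of a guarded version follows. The helper name strip_affixes is made up for illustration:

```python
def strip_affixes(name):
    """Return the text between the first '_' and the last '.', or None if either is missing."""
    start = name.find("_")
    end = name.rfind(".")
    # find/rfind return -1 on failure; without this check the slice would
    # silently use -1 as an index and return the wrong substring
    if start == -1 or end == -1 or end <= start:
        return None
    return name[start + 1:end]

print(strip_affixes("verdata_new_hampshire.txt"))  # new_hampshire
print(strip_affixes("verdata"))  # None
```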
You can leverage str.split() on strings. For example:
s = 'verdata_florida.txt'
s.split('verdata_')
# ['', 'florida.txt']
s.split('verdata_')[1]
# 'florida.txt'
s.split('verdata_')[1].split('.txt')
# ['florida', '']
s.split('verdata_')[1].split('.txt')[0]
# 'florida'
You can just split the string by dot and underscore like this:
filename = "verdata_prague.txt"
name = filename.split(".")[0]  # verdata_prague
name = name.split("_")[1]  # prague
or use the replace method:
filename = "verdata_prague.txt"
name = filename.replace(".txt", "")  # verdata_prague
name = name.replace("verdata_", "")  # prague

Searching for content between non-tag strings

I'm using Python to try to pull data from this old code, and the content of interest is not between neat HTML tags but rather between strings of characters including punctuation and letters. Rather than getting each piece of content though I'm getting everything between the first instance of the initial string and the last instance of the final bounding string. For example:
>>> q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
>>> start1 = '"text:"'
>>> end1 = '",body'
>>> print(q[q.find(start1)+len(start1):q.rfind(end1)])
content_of_interest_1",body, code code "text:":content_of_interest_2
I'm instead looking to get out each instance of content bounded by start1 and end1, i.e.:
content_of_interest_1, content_of_interest_2
How can I re-phrase my code to get each instance of string-bounded content rather than all bounded content as above?
Use q.find (instead of rfind) to locate end1 for the first substring, and rfind for the last one:
>>> q[q.find(start1)+len(start1):q.find(end1)]
'content_of_interest_1'
>>> q[q.rfind(start1)+len(start1):q.rfind(end1)]
':content_of_interest_2'
But find only gives you the index of the first occurrence of the start and end strings. A more suitable approach for such tasks is a regular expression:
>>> import re
>>> re.findall(r':"(.*?)"',q)
['content_of_interest_1', ':content_of_interest_2']
You can use a regular expression with a positive lookbehind:
import re
re.findall(r'(?<="text:"):?\w+', q)
#['content_of_interest_1', ':content_of_interest_2']
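If you would rather stay with str.find, the same idea can be written as a loop that keeps searching from where the previous match ended. This is only a sketch, and find_all_between is a made-up helper name:

```python
def find_all_between(text, start, end):
    """Collect every substring of text enclosed by start and end, left to right."""
    results = []
    pos = 0
    while True:
        begin = text.find(start, pos)
        if begin == -1:
            break  # no further start marker
        begin += len(start)
        stop = text.find(end, begin)
        if stop == -1:
            break  # start marker without a closing end marker
        results.append(text[begin:stop])
        pos = stop + len(end)  # continue after this match
    return results

q = '"text:"content_of_interest_1",body, code code "text:":content_of_interest_2",body'
print(find_all_between(q, '"text:"', '",body'))
# ['content_of_interest_1', ':content_of_interest_2']
```

The leading colon in the second result comes from the sample string itself, which has an extra : after the second "text:".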

how to split brackets using python abcd[00451.00]

I have tried the code below to split the string, but it does not work:
import re
s = "abcd[00451.00]"
print(str(s).strip('[]'))
I need only the number in decimal format, 00451.00, as output, but I get abcd[00451.00 instead.
If you know for sure that there will be exactly one opening and one closing bracket, you can do
s = "abcd[00451.00]"
print(s[s.index("[") + 1:s.rindex("]")])
# 00451.00
str.index returns the index of the first [ in the string, whereas str.rindex returns the index of the last ]. The string is then sliced based on those indexes.
If you want to convert that to a floating point number, you can use the float function, like this
print(float(s[s.index("[") + 1:s.rindex("]")]))
# 451.0
You should use re.search:
import re
s = "abcd[00451.00]"
print(re.search(r'\[([^\]]+)\]', s).group(1))
# 00451.00
You can first split on the '[' and then strip any ']' chars from each of the resulting parts:
[p.strip(']') for p in s.split('[')]
# ['abcd', '00451.00']
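If the input might not contain brackets at all, a small wrapper can return None instead of raising an error. This is just a sketch: bracketed_number is a made-up name, and it assumes the brackets hold an unsigned decimal number:

```python
import re

def bracketed_number(text):
    """Return the number inside the first [...] in text as a float, or None."""
    match = re.search(r"\[(\d+(?:\.\d+)?)\]", text)
    # float() drops the leading zeros; use match.group(1) to keep the original text
    return float(match.group(1)) if match else None

print(bracketed_number("abcd[00451.00]"))  # 451.0
print(bracketed_number("abcd"))  # None
```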

programmatically find and replace content dynamically in a string in python

I need to find and replace patterns in a string with dynamically generated content.
Let's say I want to find all strings within '' in the string and double them.
A string like:
my 'cat' is 'white' should become my 'catcat' is 'whitewhite'
Any match could also appear twice in the string.
thank you
Make use of the power of regular expressions. In this particular case:
import re
s = "my 'cat' is 'white'"
print(re.sub(r"'([^']+)'", r"'\1\1'", s))  # prints my 'catcat' is 'whitewhite'
\1 refers to the first group in the regex (called $1 in some other implementations).
It's also pretty easy to do it without regex in your case:
s = "my 'cat' is 'white'".split("'")
# the parts between the ' are at the 1, 3, 5 .. index
print(s[1::2])
# replace them with new elements
s[1::2] = [x+x for x in s[1::2]]
# join that stuff back together
print("'".join(s))
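Since the question asks for dynamically generated content, it may help to know that re.sub also accepts a function as the replacement. The function gets each match object and returns the text to put in its place. A minimal sketch, with double_quoted as a made-up name:

```python
import re

def double_quoted(match):
    """Build the replacement for one match: the quoted text repeated twice."""
    word = match.group(1)
    return "'" + word * 2 + "'"

s = "my 'cat' is 'white'"
print(re.sub(r"'([^']+)'", double_quoted, s))  # my 'catcat' is 'whitewhite'
```

Any logic can go inside the function (lookups, counters, formatting), which a fixed replacement string like r"'\1\1'" cannot express.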
