create a list by literally using a string - python

For some reason I have a text file describing lists of regular expressions:
RegexRemove = [ 'OC1.*','OC2.*','-UC.*','EG[0-9]{4,6}.*','_t[0-9]{0,2}\.[0-9]{0,2}$' ]
RegexReplace = [ ['LA.*','LA'],['IF.*', 'IF'],['BH.*', 'BH'],['DP.*', 'DP'] ]
I would like to read these lines in as strings and convert them to the lists described in the text file.
The lines look like source code defining the lists, but they are part of a bigger text file that cannot be read and interpreted as Python.
I tried to convert them by replacing and splitting the strings, but I always run into trouble, since commas are used as delimiters for split and are part of the regular expressions, too.
Can I read in only the lines containing "Regex" and convert them to the lists described there by using some fancy functions?

Extract the wanted lines (apparently you've already done this), split them on the '=' character, and then pass the second part to ast.literal_eval():
>>> import ast
>>> s = "[ 'OC1.*','OC2.*','-UC.*','EG[0-9]{4,6}.*','_t[0-9]{0,2}\.[0-9]{0,2}$' ]"
>>> ast.literal_eval(s)
['OC1.*', 'OC2.*', '-UC.*', 'EG[0-9]{4,6}.*', '_t[0-9]{0,2}\\.[0-9]{0,2}$']
>>>
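The steps above (pick the "Regex" lines, split on '=', evaluate the right-hand side) can be sketched as follows. This is only an illustration: the sample text is a shortened, made-up stand-in for the bigger file, and parse_regex_lists is a hypothetical helper name, not something from the question.

```python
import ast

# Hypothetical stand-in for the bigger text file; only the "Regex" lines matter.
text = """\
some unrelated header line
RegexRemove = [ 'OC1.*','OC2.*','-UC.*','EG[0-9]{4,6}.*' ]
another line, with commas, that is not Python
RegexReplace = [ ['LA.*','LA'],['IF.*', 'IF'],['BH.*', 'BH'],['DP.*', 'DP'] ]
"""

def parse_regex_lists(text):
    """Return {name: list} for every line of the form 'Regex... = [...]'."""
    result = {}
    for line in text.splitlines():
        if not line.startswith("Regex") or "=" not in line:
            continue
        # partition() splits on the first '=' only, so a '=' inside a
        # pattern would stay part of the value.
        name, _, value = line.partition("=")
        result[name.strip()] = ast.literal_eval(value.strip())
    return result

lists = parse_regex_lists(text)
print(lists["RegexRemove"])
print(lists["RegexReplace"][0])
```

Keying the result by the name on the left of '=' keeps the two lists apart without hard-coding their names.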

You can use eval to parse a string as a Python object (only do this with input you trust; ast.literal_eval is the safer choice for plain literals):
items = eval('[1,2,4]')
print(type(items),len(items), items) # output: <class 'list'> 3 [1, 2, 4]
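For comparison, here is a small sketch of the same parse done with ast.literal_eval, which accepts literals only. The second input string is an invented example of something that is not a literal.

```python
import ast

# A plain literal is parsed just as eval would parse it.
items = ast.literal_eval('[1,2,4]')
print(type(items), len(items), items)

# Anything that is not a literal (here: a function call) is rejected
# instead of being executed.
try:
    ast.literal_eval('len([1, 2, 4])')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```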

Related

What is the RE to match the list?

I want to know how to construct the regular expression to extract the list.
Here is my string:
audit = "{## audit_filter = ['hostname.*','service.*'] ##}"
Here is my expression:
AUDIT_FILTER_RE = r'([.*])'
And here is my search statement:
audit_filter = re.search(AUDIT_FILTER_RE, audit).group(1)
I want to extract everything inside the square brackets including the brackets. '[...]'
Expected Output:
['hostname.*','service.*']
import re
audit = "{## audit_filter = ['hostname.*','service.*'] ##}"
print(eval(re.findall(r"\[.*\]", audit)[0]))  # ['hostname.*', 'service.*']
findall returns a list of string matches. In your case, there should only be one, so I'm retrieving the string at index 0, which is a string representation of a list. Then, I use eval(...) to convert that string representation of a list to an actual list. Just beware:
If there are no matches, ...findall...[0] will throw a list index out of range error
Don't use eval() if you ever expect input coming from another source (i.e. input that is not yours) because that would be a security issue.
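Both caveats can be handled together. The sketch below swaps in re.search and ast.literal_eval for findall and eval; extract_list is a hypothetical helper name, and returning None when nothing matches is just one possible choice.

```python
import ast
import re

audit = "{## audit_filter = ['hostname.*','service.*'] ##}"

def extract_list(text):
    """Return the bracketed list found in text, or None if there is none."""
    match = re.search(r"\[.*\]", text)
    if match is None:
        # No brackets found: avoid the IndexError that findall(...)[0] would raise.
        return None
    # literal_eval only accepts literals, so arbitrary code is not executed.
    return ast.literal_eval(match.group(0))

print(extract_list(audit))
print(extract_list("no list here"))
```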
Use r"\[(.*?)\]"
Ex:
import re
audit = "{## audit_filter = ['hostname.*'] ##}"
print(re.findall(r"\[(.*?)\]", audit))
Output:
["'hostname.*'"]

How to escape null characters, i.e. [' '], while using regex split function? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']
I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = list(filter(None, str_list))  # list() is needed on Python 3, where filter() returns an iterator
Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
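A minimal sketch of the named-group variant; the group names start and end are chosen here for illustration only.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Same pattern as above, but with named groups instead of positional ones.
match = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)
if match:
    time_info = match.groupdict()
    print(time_info['start'], time_info['end'])
```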
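If the timestamps are always after the second _ then you can use str.rsplit and str.split. (Avoid strip(".txt") here: str.strip removes any of the characters '.', 't' and 'x' from both ends rather than the literal suffix.)
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.rsplit(".", 1)[0].split("_", 2)[-1].split("-")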
['20111007T084734', '20111008T023142']
Since this came up on google and for completeness, try using re.findall as an alternative!
This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer and it doesn't use re.split(). It does solve the underlying issue, though: your list of matches suddenly has zero-length strings in it and you don't want that.
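One way this could look for the file names in the question. The pattern is an assumption about the timestamp format (eight digits, a T, six digits) and uses a lookbehind so that only text following '_' or '-' is matched.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Match the timestamps themselves instead of splitting on what surrounds them.
time_info = re.findall(r'(?<=[_-])\d{8}T\d{6}', f)
print(time_info)
```

Because findall returns only what matched, there are no empty strings to filter out afterwards.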
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

String separator

I have a program that takes text input and stores a list of words that were input, but without duplicates. I need to store these in a file so I convert it into a string and join it with a comma between each word.
Now if there is a comma near a word then it would break. I therefore need a string to join the items of a list that is not part of any of the items.
For example if an item was "dog" the string og couldn't be used so the program would know this and add another letter on to make it a unique set of letters.
I then concatenate these strings to recreate the inputted text but it only works if the string I'm splitting them with is not part of the words.
I use ## now as it is unlikely that will be in the inputted text but I would like it to be perfect.
Consider using an existing serialization library that will convert your objects to string for you, without you having to invent an algorithm yourself. For instance, json:
>>> import json
>>> my_strings = ["foo", 'ba"r', "ba,z", "qu'x", "zo##rt"]
>>> s = json.dumps(my_strings)
>>> s
'["foo", "ba\\"r", "ba,z", "qu\'x", "zo##rt"]'
>>> type(s)
<class 'str'>
>>> result = json.loads(s)
>>> result
['foo', 'ba"r', 'ba,z', "qu'x", 'zo##rt']
>>> type(result)
<class 'list'>
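Since the goal is to store the words in a file, the same idea can be sketched with json.dump and json.load. The word list and the temporary file path are made up for the example.

```python
import json
import os
import tempfile

words = ["dog", "ba,z", "zo##rt"]

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "words.json")

# Write the list to the file; json takes care of quoting and escaping.
with open(path, "w", encoding="utf-8") as fh:
    json.dump(words, fh)

# Read it back: commas and '##' inside the words survive unchanged.
with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored)
```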

How to convert a multiline string into a list of lines?

In Sikuli I get a multiline string from the clipboard like this:
Names = App.getClipboard();
So Names =
#corazona
#Pebleo00
#cofriasd
«paflio
and I have used this regex to delete the first character of each line if it is outside the \x00-\x7F range, is not a word character, or is a digit:
import re
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", Names)
So now Names =
corazona
Pebleo00
cofriasd
paflio
But, I am having trouble with the second regex that converts "Names" into the items of a sequence. I would like to convert "Names" into...
'corazona', 'Pebleo00', 'cofriasd', 'paflio'
or
'corazona', 'Pebleo00', 'cofriasd', 'paflio',
So sikuli can then recognize it as a List (I've found that Sikuli is able to recognize it even with those last "comma" and "space" in the end) by using...
NamesAsList = eval(Names)
How could I do this in Python? Is it necessary to use a regex, or is there another way to do this in Python?
I have already done this but using .Net regex, I just don't know how to do it in python, I have googled it with no result.
This is how I did it using .Net regex
Text to find:
(.*[^$])(\r\n|\z)
Replace with:
'$1',%" "%
Thanks in advance.
A couple of one liners. Your question isn't completely clear - but I am assuming - you want to split a given string delimited by 'newline' and then generate a list of strings by removing the first character if it's not alpha numeric. Here's how I'd go about it
import re
r = re.compile(r'^[^a-zA-Z0-9]')  # match a leading character that is not alphanumeric
s = '#abc\ndef\nghi'
l = [r.sub('', x) for x in s.splitlines()]  # ['abc', 'def', 'ghi']
# join this list with comma (if that's required else you got the list already)
','.join(l)
Hope that's what you want.
If Names is a string before you "convert" it, in which each name is separated by a newline ('\n'), then this will work:
NamesAsList = Names.split('\n')
See this question for other options.
You could use splitlines()
import re
clipBoard = App.getClipboard();
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", clipBoard)
# Replace the end of a line with a comma.
singleNames = ', '.join(Names.splitlines())
print(singleNames)
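If an actual list is the goal, the result of splitlines() can be used directly, with no join or eval step. In this sketch a hard-coded string stands in for App.getClipboard(), which is only available inside Sikuli.

```python
import re

# Stand-in for App.getClipboard(); the sample text is taken from the question.
clipboard = "#corazona\n#Pebleo00\n#cofriasd\n«paflio"

# Drop the unwanted first character of each line, as in the question.
names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", clipboard)

# splitlines() already returns a list of the individual names.
names_as_list = names.splitlines()
print(names_as_list)
```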

programmatically find and replace content dynamically in a string in python

I need to find and replace patterns in a string with dynamically generated content.
Let's say I want to find all strings within '' in the string and double each of them.
A string like:
my 'cat' is 'white' should become my 'catcat' is 'whitewhite'
Any match could also appear twice in the string.
Thank you.
Make use of the power of regular expressions. In this particular case:
import re
s = "my 'cat' is 'white'"
print(re.sub("'([^']+)'", r"'\1\1'", s))  # prints my 'catcat' is 'whitewhite'
\1 refers to the first group in the regex (called $1 in some other implementations).
It's also pretty easy to do it without regex in your case:
s = "my 'cat' is 'white'".split("'")
# the parts between the ' are at the 1, 3, 5 .. index
print(s[1::2])
# replace them with new elements
s[1::2] = [x+x for x in s[1::2]]
# join that stuff back together
print("'".join(s))
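When the replacement has to be computed rather than written as a fixed template, re.sub also accepts a function in place of the replacement string. A minimal sketch, where double is a hypothetical callback name and doubling stands in for whatever content needs to be generated:

```python
import re

def double(match):
    # match.group(1) is the text between the quotes; build the new
    # content from it and put the quotes back.
    return "'" + match.group(1) * 2 + "'"

s = "my 'cat' is 'white'"
result = re.sub(r"'([^']+)'", double, s)
print(result)
```

The callback is called once per match, so repeated occurrences of the same word are handled as well.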
