Getting a list with new line characters - python

I have a string as below:
s = 'hello\n this is\n a forum\n'
And I am using a regular expression to get each single/double occurrence of each character. I want to generate a list as li = [ 'h','e','ll','o','\n','t','h'....]
I used
pattern = re.compile(r'(.)\1?' , re.IGNORECASE)
newList = [m.group() for m in pattern.finditer(s)]
print newList
But this gave me newList= [ 'h','e','ll','o','t','h'....] , here I could not get the '\n' new line character stored. How should I change my patter to get the '\n' in my list?

Use re.DOTALL flag like this
pattern = re.compile(r'(.)\1?' , re.IGNORECASE | re.DOTALL)
Quoting from the docs,
Make the '.' special character match any character at all, including a
newline; without this flag, '.' will match anything except a newline.
I would have preferred itertools.groupby to do this
from itertools import groupby
print ["".join(grp) for char, grp in groupby(s)]

Related

How can I remove a specific character from multi line string using regex in python

I have a multiline string which looks like this:
st = '''emp:firstinfo\n
:secondinfo\n
thirdinfo
'''
print(st)
What I am trying to do is to skip the second ':' from my string, and get an output which looks like this:
'''emp:firstinfo\n
secondinfo\n
thirdinfo
'''
simply put if it starts with a ':' I'm trying to ignore it.
Here's what I've done:
mat_obj = re.match(r'(.*)\n*([^:](.*))\n*(.*)' , st)
print(mat_obj.group())
Clearly, I don't see my mistake but could anyone please help me telling where I am getting it wrong?
You may use re.sub with this regex:
>>> print (re.sub(r'([^:\n]*:[^:\n]*\n)\s*:(.+)', r'\1\2', st))
emp:firstinfo
secondinfo
thirdinfo
RegEx Demo
RegEx Details:
(: Start 1st capture group
[^:\n]*: Match 0 or more of any character that is not : and newline
:: Match a colon
[^:\n]*: Match 0 or more of any character that is not : and newline
\n: Match a new line
): End 1st capture group
\s*: Match 0 or more whitespaces
:: Match a colon
(.+): Match 1 or more of any characters (except newlines) in 2nd capture group
\1\2: Is used in replacement to put back substring captured in groups 1 and 2.
You can use sub instead, just don't capture the undesired part.
(.*\n)[^:]*:(.*\n)(.*)
Replace by
\1\2\3
Regex Demo
import re
regex = r"(.*\n)[^:]*:(.*\n)(.*)"
test_str = ("emp:firstinfo\\n\n"
" :secondinfo\\n\n"
" thirdinfo")
subst = "\\1\\2\\3"
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
#import regex library
import re
#remove character in a String and replace with empty string.
text = "The film Pulp Fiction was released in year 1994"
result = re.sub(r"[a-z]", "", text)
print(result)

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend"\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
Code demo
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add add it back at every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)/2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
print('\n\nNew string:')
print('\\x' + '\\x'.join(match))
#for elem in match: # match gives you a list of strings with the hex values
# print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
See code in use here
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x$1"
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE)
if result:
print (result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexidecimal digit.
([a-f\d]{2}) Capture any two hexidecimal digits into capture group 1.

Python Regex: remove underscores and dashes except if they are in a dict of string substitutions

I'm pre-processing a string. I have a dictionary of 10k string substitutions (e. g. "John Lennon": "john_lennon"). I want to replace all other punctuation with a space.
The problem is some of these string substitutions contain underscores or hyphens, so I want to replace punctuation (except full stops) with spaces unless the word is contained in the keys of this dict. I also want to do it in one Regex expression since the text corpus is quite large and this could be a bottleneck.
So far, I have:
import re
input_str = "John Lennon: a musician, artist and activist."
multi_words = dict((re.escape(k), v) for k, v in multi_words.items())
pattern = re.compile("|".join(multi_words.keys()))
output_str = pattern.sub(lambda m: multi_words[re.escape(m.group(0))], input_str)
This replaces all strings using the keys in a dict. Now I just need to also remove punctuation in the same pass. This should return "john_lennon a musician artist and activist."
You could do it by adding one more alternative to the constructed regex which matches a single punctuation character. When the match is processed, a match not in the dictionary can be replaced with a space, using dictionary's get method. Here, I use [,:;_-] but you probably want to replace other characters.
Note: I moved the call to re.escape into the construction of the regex to avoid having to call it on every match.
import re
input_str = "John Lennon: a musician, artist and activist."
pattern = re.compile(("|".join(map(re.escape, multi_words.keys())) + "|[,:;_-]+")
output_str = pattern.sub(lambda m: multi_words.get(m.group(0), ' '), input_str)
You could handle the punctuation you would like to remove like the entries in your dictionary:
pattern = re.compile("|".join(multi_words.keys()) + r'|_|-')
and
multiwords['_'] = ' '
multiwords['-'] = ' '
Then these occurrences are treated like your key words.
But let me remind you that your code only works for a certain set of regular expressions. If you have the pattern foo.*bar in your keys and that matches a string like foo123bar, you will not find the corresponding value to the key by passing foo123bar through re.escape() and then searching in your multiword dictionary for it.
I think the whole escaping you do should be removed and the code should be commented to make clear that only fixed strings are allowed as keys, not complex regular expressions matching variable inputs.
You can add punctuations (excluding full stop) in a character set as part of the items to match, and then handle punctuations and dict keys separately in the substitution function:
import re
import string
punctuation = string.punctuation.replace('.', '')
pattern = re.compile("|".join(multi_words.keys())+
"|[{}]".format(re.escape(punctuation)))
def func(m):
m = m.group(0)
print(m, re.escape(m))
if m in string.punctuation:
return ''
return multi_words[re.escape(m)]
output_str = pattern.sub(func , input_str)
print(output_str)
# john_lennon a musician artist and activist.
You may use a regex like (?:alt1|alt2...|altN)|([^\w\s.]+) and check if Group 1 (that is, any punctuation other than .) was matched. If yes, replace with an empty string:
pattern = re.compile(r"(?:{})|([^\w\s.]+)".format("|".join(multi_words.keys())))
output_str = pattern.sub(lambda m: "" if m.group(1) else multi_words[re.escape(m.group(0))], input_str)
See the Python demo.
A note about _: if you need to remove it as well, use r"(?:{})|([^\w\s.]+|_+)" because [^\w\s.] (matching any char other than word, whitespace and . chars) does not match an underscore (a word char) and you need to add it as a separate alternative.
Note on Unicode: if you deal with Unicode strings, in Python 2.x, pass re.U or re.UNICODE modifier flag to the re.compile() method.

How do i extract a list of elements encased in quotation marks bounded by <> and delimited by commas - python, regex?

Given a string like this:
ORTH < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",
With regex, how do I get a tuple that looks like the following:
('ORTH', ['cali.ber,kl','calf','done'])
I've been doing it as such:
txt = '''ORTH < "cali.ber,kl", 'calf' , "done" >,'''
e1 = txt.partition(" ")[0]
vs = re.search(r"<([A-Za-z0-9_]+)>", txt)
v = vs.group(1)
v1 = [i[1:-1] for i in vs.strip().strip("<>").split(",")]
print v1
But i'm getting none for re.search().group(1). How should it be done to get the desired output?
The reason you don't get a match is that your regex doesn't match:
r"<([A-Za-z0-9_]+)>" is missing comma, quotation marks and the space character, which all can occur inside the < > according to your sample.
This one would match:
re.search(r"< ([A-Za-z0-9_.,\"' ]+) >", txt)
What also may trip you up is the fact that the list of names is delimited by comma, which itself can be part of the values, unescaped.
That means you can't just split that string by ',', but instead need to consider the two different quotation characters(' and " ) in order to separate the fields.
So I'd use this approach:
Use re.match to split the string into PREFIX < NAMES > parts, and discard the rest.
Use re.findall() to split the names into fields according to quotation marks
Edit:
1) According to your first comment, your data can also contain a preamble before the prefix that contains newlines. The default behavior for . is to match everything except newlines.
From the Python re docs:
re.DOTALL
Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
So you need to construct that regex with the re.DOTALL flag. You do this by compiling it first and passing the ORed flags:
re.compile(pattern, flags=re.DOTALL)
2) If you include the space character before PREFIX in the regex, it will only match for data that actually contains that space - but not anymore for your first piece of example data. So I use .*?([A-Z\.]*)... to cover both cases. The ? is for non-greedy matching, so it matches the shortest possible match instead of the longest.
3) To cover PREFIX.FOO just extend the pattern for the prefix to ([A-Z\.]*) by including the . character and escaping it.
Updated example covering all the cases you mentioned:
import re
TEST_VALUES = [
"""ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel",""",
"""calf_n1 := n_-_c_le & n_-_pn_le &\n [ ORTH.FOO < "cali.ber,kl", 'calf' , "done" >,\nLKEYS.KEYREL.PRED "_calf_n_1_rel","""
]
EXPECTED = ('ORTH.FOO', ['cali.ber,kl','calf','done'])
pattern = re.compile(r'.*?([A-Z\.]*) < (.*) >.*', flags=re.DOTALL)
for value in TEST_VALUES:
prefix, names_str = pattern.match(value).groups()
names = re.findall('[\'"](.*?)["\']', names_str)
result = prefix, names
assert(result == EXPECTED)
print result

Regular expression to replace with XML node

I'm using Python to write a regular expression for replacing parts of the string with a XML node.
The source string looks like:
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
And the result string should be like:
Hello
<replace name="str1"> this is to replace </replace>
<replace name="str2"> this is to replace </replace>
Can anyone help me?
What makes your problem a little bit tricky is that you want to match inside of a multiline string. You need to use the re.MULTILINE flag to make that work.
Then, you need to match some groups inside your source string, and use those groups in the final output. Here is code that works to solve your problem:
import re
s_pat = "^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
def mksub(m):
return '<replace name="%s">%s</replace>' % m.groups()
s_output = re.sub(pat, mksub, s_input)
The only tricky part is the regular expression pattern. Let's look at it in detail.
^ matches the start of a string. With re.MULTILINE, this matches the start of a line within a multiline string; in other words, it matches right after a newline in the string.
\s* matches optional whitespace.
REPLACE matches the literal string "REPLACE".
\( matches the literal string "(".
( begins a "match group".
[^)] means "match any character but a ")".
+ means "match one or more of the preceding pattern.
) closes a "match group".
\) matches the literal string ")"
(.*) is another match group containing ".*".
$ matches the end of a string. With re.MULTILINE, this matches the end of a line within a multiline string; in other words, it matches a newline character in the string.
. matches any character, and * means to match zero or more of the preceding pattern. Thus .* matches anything, up to the end of the line.
So, our pattern has two "match groups". When you run re.sub() it will make a "match object" which will be passed to mksub(). The match object has a method, .groups(), that returns the matched substrings as a tuple, and that gets substituted in to make the replacement text.
EDIT: You actually don't need to use a replacement function. You can put the special string \1 inside the replacement text, and it will be replaced by the contents of match group 1. (Match groups count from 1; the special match group 0 corresponds the the entire string matched by the pattern.) The only tricky part of the \1 string is that \ is special in strings. In a normal string, to get a \, you need to put two backslashes in a row, like so: "\\1" But you can use a Python "raw string" to conveniently write the replacement pattern. Doing so you get this:
import re
s_pat = "^\s*REPLACE\(([^)]+)\)(.*)$"
pat = re.compile(s_pat, re.MULTILINE)
s_repl = r'<replace name="\1">\2</replace>'
s_input = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
s_output = re.sub(pat, s_repl, s_input)
Here is an excellent tutorial on how to write regular expressions in Python.
Here is a solution using pyparsing. I know you specifically asked about a regex solution, but if your requirements change, you might find it easier to expand a pyparsing parser. Or a pyparsing prototype solution might give you a little more insight into the problem leading toward a regex or other final implementation.
src = """\
Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace
"""
from pyparsing import Suppress, Word, alphas, alphanums, restOfLine
LPAR,RPAR = map(Suppress,"()")
ident = Word(alphas, alphanums)
replExpr = "REPLACE" + LPAR + ident("name") + RPAR + restOfLine("body")
replExpr.setParseAction(
lambda toks : '<replace name="%(name)s">%(body)s </replace>' % toks
)
print replExpr.transformString(src)
In this case, you create the expression to be matched with pyparsing, define a parse action to do the text conversion, and then call transformString to scan through the input source to find all the matches, apply the parse action to each match, and return the resulting output. The parse action serves a similar function to mksub in #steveha's solution.
In addition to the parse action, pyparsing also supports naming individual elements of the expression - I used "name" and "body" to label the two parts of interest, which are represented in the re solution as groups 1 and 2. You can name groups in an re, the corresponding re would look like:
s_pat = "^\s*REPLACE\((?P<name>[^)]+)\)(?P<body>.*)$"
Unfortunately, to access these groups by name, you have to invoke the group() method on the re match object, you can't directly do the named string interpolation as in my lambda parse action. But this is Python, right? We can wrap that callable with a class that will give us dict-like access to the groups by name:
class CallableDict(object):
def __init__(self,fn):
self.fn = fn
def __getitem__(self,name):
return self.fn(name)
def mksub(m):
return '<replace name="%(name)s">%(body)s</replace>' % CallableDict(m.group)
s_output = re.sub(pat, mksub, s_input)
Using CallableDict, the string interpolation in mksub can now call m.group for each field, by making it look like we are retrieving the ['name'] and ['body'] elements of a dict.
Maybe like this ?
import re
mystr = """Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
prog = re.compile(r'REPLACE\((.*?)\)\s(.*)')
for line in mystr.split("\n"):
print prog.sub(r'< replace name="\1" > \2',line)
Something like this should work:
import re,sys
f = open( sys.argv[1], 'r' )
for i in f:
g = re.match( r'REPLACE\((.*)\)(.*)', i )
if g is None:
print i
else:
print '<replace name=\"%s\">%s</replace>' % (g.group(1),g.group(2))
f.close()
import re
a="""Hello
REPLACE(str1) this is to replace
REPLACE(str2) this is to replace"""
regex = re.compile(r"^REPLACE\(([^)]+)\)\s+(.*)$", re.MULTILINE)
b=re.sub(regex, r'< replace name="\1" > \2 < /replace >', a)
print b
will do the replace in one line.

Categories