How to use join and regex? - python

I'm trying to add \n after the quotation mark (") and space.
The closest that I could find is re.sub however it remove certain characters.
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
q = re.sub(r'[\d\w]" ', '\n', line)
print(q)
Output:
Type: "SecurityInciden\nRowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F2\n
Looking for a solution without any character being remove.

Your attempted regex [\d\w]" is almost fine but has some little short comings. You don't need to write \d with \w in a character set as that is redundant as \w already contains \d within it. Since \w alone is enough to represent an alphabet or digit or underscore, hence no need to enclose it in character set [] hence you can just write \w and your updated regex becomes \w".
But now if you match this regex and substitute it with \n it will match a literal alphabet t then " and a space and it will be replaced by \n which is why you are getting this output,
SecurityInciden\nRowID
You need to capture the matched string in group1 and while substituting, you need to use it while substituting so that doesn't get replaced hence you should use \1\n as replacement instead of just \n
Try this updated regex,
(\w" )
And replace it by \1\n
Demo1
If you notice, there is an extra space at the end of line in the first line and if you don't want that space there, you can take that space out of those capturing parenthesis and use this regex,
(\w")
^ space here
Demo2
Here is a sample python code,
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
q = re.sub(r'(\w") ', r'\1\n', line)
print(q)
Output,
Type: "SecurityIncident"
RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"

Try this:
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
pattern = re.compile('(\w+): (".+?"\s?)', re.IGNORECASE)
q = re.sub(pattern, r'\g<1>: \g<2>\n', line)
print(repr(q))
It should give you following resutls:
Type: "SecurityIncident" \nRowID:
"FB013B06-B04C-4FEB-A5A5-3B858F910F29"\n

In your regex you are removing the t from incident because you are matching it and not using it in the replacement.
Another option to get your result might be to split on a double quote followed by a whitespace when preceded with a word character using a positive lookbehind.
Then join the result back together using a newline.
(?<=\w)"
Regex demo | Python demo
For example:
import re
line = 'Type: "SecurityIncident" RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"'
print("\n".join(re.split(r'(?<=\w)" ', line)))
Result
Type: "SecurityIncident
RowID: "FB013B06-B04C-4FEB-A5A5-3B858F910F29"

Related

How to remove all non-alphanumerical characters except when part of a word [duplicate]

I want to be able to remove all punctuation and single quotes ' from a string, unless the single quote ' is in the middle of a word.
At this point I have the following code:
with open('test.txt','r') as f:
for line in f:
line = line.lower()
line = re.sub('[^a-z\ \']+', " ", line)
print line
if there happens to be a line in test.txt like:
Here is some stuff. 'Now there are quotes.' Now there's not.
The result I want is:
here is some stuff now there are quotes now there's not
But the result I get is:
here is some stuff 'now there are quotes' now there's not
How can I remove the single quotes ' from a string if they're at the beginning or end of the word but not in the middle? Thanks for the help!
Split the string, use strip() on each word to remove leading and trailing characters on it, then join it all back together.
>>> s = "'here is some stuff 'now there are quotes' now there's not'"
>>> print(' '.join(w.strip("'") for w in s.split()).lower())
here is some stuff now there are quotes now there's not
Using regular expressions, you could first remove 's that don't follow a letter, then remove 's that don't precede a letter (thus only keeping ones that both follow and precede a letter):
line = "Here is some stuff. 'Now there are quotes.' Now there's not."
print re.sub(r"'([^A-Za-z])", r"\1", re.sub(r"([^A-Za-z])'", r"\1", line))
# Here is some stuff. Now there are quotes. Now there's not.
Probably more efficient to do it #TigerhawkT3's way. Though they produce different results if you have something like 'this'. If you want to remove that second ' too, then the regular expressions method is probably the simplest you can do.
Here's another solution using regular expressions with lookarounds.
This method will preserve any whitespace your string may have.
import re
rgx = re.compile(r"(?<!\w)\'|\'(?!\w)")
# Regex explanation:
# (?<!\w)\' match any quote not preceded by a word
# | or
# \'(?!\w) match any quote not followed by a word
s = "'here is some stuff 'now there are quotes' now there's not'"
print rgx.sub('', s) # here is some stuff now there are quotes now there's not
If a word is a sequence of 1+ letters, digits and underscores that can be matched with \w+ you may use
re.sub(r"(?!\b'\b)'", "", text)
See the regex demo. Here, ' is matched when it is not preceded nor followed with letters/digits/_.
Or, if words are strictly linguistic words that only consist of letters, use
re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) # ASCII only
re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) # any Unicode letter support
See Demo #2 (ASCII only letters) and Demo #3 (see last line in the demo text). Here, ' is only matched if it is not preceded nor followed with a letter (ASCII or any).
Python demo:
import re
text = "'text... 'some quotes', there's none'. three 'four' can't, '2'4', '_'_', 'l'école'"
print( re.sub(r"(?!\b'\b)'", "", text) )
# => text... some quotes, there's none. three four can't, 2'4, _'_, l'école
print( re.sub(r"'(?!(?<=[a-zA-Z]')[a-zA-Z])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, lécole
print( re.sub(r"'(?!(?<=[^\W\d_]')[^\W\d_])", "", text) )
# => text... some quotes, there's none. three four can't, 24, __, l'école
Here is complete solution to remove whatever you don't want in a string:
def istext (text):
ok = 0
for x in text: ok += x.isalnum()
return ok>0
def stripit (text, ofwhat):
for x in ofwhat: text = text.strip(x)
return text
def purge (text, notwanted="'\"!#$%&/()=?*+-.,;:_<>|\\[]{}"):
text = text.splitlines()
text = [" ".join([stripit(word, notwanted) for word in line.split() if istext(word)]) for line in text]
return "\n".join(text)
>>> print purge("'Nice, .to, see! you. Isn't it?'")
Nice to see you Isn't it
Note: this will kill all whitespaces too and transform them to space or remove them completely.

Match everything except a pattern and replace matched with string

I want to use python in order to manipulate a string I have.
Basically, I want to prepend"\x" before every hex byte except the bytes that already have "\x" prepended to them.
My original string looks like this:
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
And I want to create the following string from it:
mystr = r"\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00"
I thought of using regular expressions to match everything except /\x../g and replace every match with "\x". Sadly, I struggled with it a lot without any success. Moreover, I'm not sure that using regex is the best approach to solve such case.
Regex: (?:\\x)?([0-9A-Z]{2}) Substitution: \\x$1
Details:
(?:) Non-capturing group
? Matches between zero and one time, match string \x if it exists.
() Capturing group
[] Match a single character present in the list 0-9 and A-Z
{n} Matches exactly n times
\\x String \x
$1 Group 1.
Python code:
import re
text = R'30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00'
text = re.sub(R'(?:\\x)?([0-9A-Z]{2})', R'\\x\1', text)
print(text)
Output:
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
Code demo
You don't need regex for this. You can use simple string manipulation. First remove all of the "\x" from your string. Then add add it back at every 2 characters.
replaced = mystr.replace(r"\x", "")
newstr = "".join([r"\x" + replaced[i*2:(i+1)*2] for i in range(len(replaced)/2)])
Output:
>>> print(newstr)
\x30\x33\x62\x37\x61\x31\x31\x90\x01\x0A\x90\x02\x14\x6F\x6D\x6D\x61\x6E\x64\x90\x01\x06\x90\x02\x0F\x52\x65\x6C\x61\x74\x90\x01\x02\x90\x02\x50\x65\x6D\x31\x90\x00
You can get a list with your values to manipulate as you wish, with an even simpler re pattern
mystr = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
import re
pat = r'([a-fA-F0-9]{2})'
match = re.findall(pat, mystr)
if match:
print('\n\nNew string:')
print('\\x' + '\\x'.join(match))
#for elem in match: # match gives you a list of strings with the hex values
# print('\\x{}'.format(elem), end='')
print('\n\nOriginal string:')
print(mystr)
This can be done without replacing existing \x by using a combination of positive lookbehinds and negative lookaheads.
(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})
Usage
See code in use here
import re
regex = r"(?!(?<=\\x)|(?<=\\x[a-f\d]))([a-f\d]{2})"
test_str = r"30336237613131\x90\x01\x0A\x90\x02\x146F6D6D616E64\x90\x01\x06\x90\x02\x0F52656C6174\x90\x01\x02\x90\x02\x50656D31\x90\x00"
subst = r"\\x$1"
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE)
if result:
print (result)
Explanation
(?!(?<=\\x)|(?<=\\x[a-f\d])) Negative lookahead ensuring either of the following doesn't match.
(?<=\\x) Positive lookbehind ensuring what precedes is \x.
(?<=\\x[a-f\d]) Positive lookbehind ensuring what precedes is \x followed by a hexidecimal digit.
([a-f\d]{2}) Capture any two hexidecimal digits into capture group 1.

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that comes two lines below it in one capture (i can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, its supposed to be a sequence of aminoacids that make up a protein.
Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
... title, sequence = match.groups()
... title = title.strip()
... sequence = rx_blanks.sub("",sequence)
... print "Title:",title
... print "Sequence:",sequence
... print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (all characters are allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline. This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).
The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)
If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
with open(path) as sequence_file:
title = sequence_file.readline() # read 1st line
aminoacid_sequence = sequence_file.read() # read the rest
# some cleanup, if necessary
title = title.strip() # remove trailing white spaces and newline
aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
return title, aminoacid_sequence
find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
#NOTE can be sorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
print 'Name: %s\nSequence:%s' % (m[0], m[1])
It can sometimes be comfortable to specify the flag directly inside the string, as an inline-flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
My preference.
lineIter= iter(aFile)
for line in lineIter:
if line.startswith( ">" ):
someVaryingText= line
break
assert len( lineIter.next().strip() ) == 0
acids= []
for line in lineIter:
if len(line.strip()) == 0:
break
acids.append( line )
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

REGEX (python) match or return a string after '?', but in a new line, til the end of that line

Here us what I'm trying to do... I have a string structured like this:
stringparts.bst? (carriage return)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99 (carriage return)
SPAM /198975/
I need it to match or return this:
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
What RegEx will do the trick?
I have tried this, but to no avail :(
bst\?(.*)\n
Thanks in advc
I tried this. Assuming the newline is only one character.
>>> s
'stringparts.bst?\n765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchks
yttsutcuan99\nSPAM /198975/'
>>> m = re.match('.*bst\?\s(.+)\s', s)
>>> print m.group(1)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
Your regex will match everything between the bst? and the first newline which is nothing. I think you want to match everything between the first two newlines.
bst\?\n(.*)\n
will work, but you could also use
\n(.*)\n
although it may not work for some other more specific cases
This is more robust against different kinds of line breaks, and works if you have a whole list of such strings. The $ and ^ represent the beginning and end of a line, but not the actual line break character (hence the \s+ sequence).
import re
BST_RE = re.compile(
r"bst\?.*$\s+^(.*)$",
re.MULTILINE
)
INPUT_STR = r"""
stringparts.bst?
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
SPAM /198975/
stringparts.bst?
another
SPAM /.../
"""
occurrences = BST_RE.findall(INPUT_STR)
for occurrence in occurrences:
print occurrence
This pattern allows additional whitespace before the \n:
r'bst\?\s*\n(.*?)\s*\n'
If you don't expect any whitespace within the string to be captured, you could use a simpler one, where \s+ consumes whitespace, including the \n, and (\S+) captures all the consecutive non-whitespace:
r'bst\?\s+(\S+)'

Regex replace with negative look ahead in Python

I am trying to delete the single quotes surrounding regular text. For example, given the list:
alist = ["'ABC'", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I would like to turn "'ABC'" into "ABC", but keep other quotes, that is:
alist = ["ABC", '(-inf-0.5]', '(4800-20800]', "'\\'(4.5-inf)\\''", "'\\'(2.75-3.25]\\''"]
I tried to use look-head as below:
fixRepeatedQuotes = lambda text: re.sub(r'(?<!\\\'?)\'(?!\\)', r'', text)
print [fixRepeatedQuotes(str) for str in alist]
but received error message:
sre_constants.error: look-behind requires fixed-width pattern.
Any other workaround? Thanks a lot in advance!
Try should work:
result = re.sub("""(?s)(?:')([^'"]+)(?:')""", r"\1", subject)
explanation
"""
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
( # Match the regular expression below and capture its match into backreference number 1
[^'"] # Match a single character NOT present in the list “'"” from this character class (aka any character matches except a single and double quote)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
(?: # Match the regular expression below
' # Match the character “'” literally (but the ? makes it a non-capturing group)
)
"""
re.sub accepts a function as the replace text. Therefore,
re.sub(r"'([A-Za-z]+)'", lambda match: match.group(), "'ABC'")
yields
"ABC"

Categories