Removing lines from a text file using python and regular expressions

I have some text files, and I want to remove all lines that begin with the asterisk (“*”).
Made-up example:
words
*remove me
words
words
*remove me
My current code fails. It follows below:
import re
program = open(program_path, "r")
program_contents = program.readlines()
program.close()
new_contents = []
pattern = r"[^*.]"
for line in program_contents:
    match = re.findall(pattern, line, re.DOTALL)
    if match.group(0):
        new_contents.append(re.sub(pattern, "", line, re.DOTALL))
    else:
        new_contents.append(line)
print new_contents
This produces ['', '', '', '', '', '', '', '', '', '', '*', ''], which is no good.
I’m very much a python novice, but I’m eager to learn. And I’ll eventually bundle this into a function (right now I’m just trying to figure it out in an ipython notebook).
Thanks for the help!

Your regular expression seems to be incorrect:
[^*.]
This means: match any character that isn't * or .. Inside a bracket expression, a leading ^ negates the class, and the * and . after it are treated as literal characters. So the . in your expression matches a literal . character, not a wildcard.
This is why you get "*" for lines starting with *: you're replacing every character but *! You would also keep any . present in the original string. Since the other lines contain neither * nor ., all of their characters are replaced.
If you want to match lines beginning with *:
^\*.*
What might be easier is something like this:
pat = re.compile("^[^*]")
for line in contents:
    if re.search(pat, line):
        new_contents.append(line)
This code just keeps any line that does not start with *.
In the pattern ^[^*], the first ^ matches the start of the string. The expression [^*] matches any character but *. So together this pattern matches any starting character of a string that isn't *.
It is a good habit to think about what you actually need when using regular expressions. Do you simply need to assert something about a string, do you need to change or remove characters in a string, or do you need to match substrings?
In terms of python, you need to think about what each function is giving you and what you need to do with it. Sometimes, as in my example, you only need to know that a match was found. Sometimes you might need to do something with the match.
Sometimes re.sub isn't the fastest or the best approach. Why bother going through each line and replacing all of the characters, when you can just skip that line in total? There's no sense in making an empty string when you're filtering.
Most importantly: Do I really need a regex? (Here you don't!)
You don't really need a regular expression here. Since you know the size and position of your delimiter you can simply check like this:
if line[0] != "*":
This will be faster than a regex. They're very powerful tools and can be neat puzzles to figure out, but for delimiters with fixed width and position, you don't really need them. A regex is much more expensive than an approach making use of this information.
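To make the regex-free check concrete, here is a minimal sketch of the whole filter. It uses str.startswith instead of line[0], because line[0] raises IndexError on an empty string while startswith simply returns False. The name drop_starred_lines and the sample list are made up for illustration.

```python
def drop_starred_lines(lines):
    """Keep only the lines that do not begin with an asterisk."""
    return [line for line in lines if not line.startswith("*")]


# Sample data standing in for the result of program.readlines().
sample = ["words\n", "*remove me\n", "words\n", "words\n", "*remove me\n"]
kept = drop_starred_lines(sample)
print(kept)
```

The same list comprehension works directly on an open file object, since iterating over a file yields its lines.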

You don't want to use a [^...] negative character class; you are matching all characters except for the * or . characters now.
* is a meta character, you want to escape that to \*. The . 'match any character' syntax needs a multiplier to match more than one. Don't use re.DOTALL here; you are operating on a line-by-line basis but don't want to erase the newline.
There is no need to test first; if there is nothing to replace the original line is returned.
pattern = r"^\*.*"
for line in program_contents:
    new_contents.append(re.sub(pattern, "", line))
Demo:
>>> import re
>>> program_contents = '''\
... words
... *remove me
... words
... words
... *remove me
... '''.splitlines(True)
>>> new_contents = []
>>> pattern = r"^\*.*"
>>> for line in program_contents:
...     new_contents.append(re.sub(pattern, "", line))
...
>>> new_contents
['words\n', '\n', 'words\n', 'words\n', '\n']

You can do:
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))
Example:
txt='''\
words
*remove me
words
words
*remove me '''
import StringIO
f=StringIO.StringIO(txt)
import re
print '\n'.join(re.findall(r'^[^*].*$', ''.join(f), re.M))
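The example above is Python 2. Here's a rough Python 3 equivalent, where StringIO lives in the io module and print is a function. One deliberate change, which is my own and not part of the answer: the class is [^*\n] rather than [^*]. With [^*], an empty line directly before a starred line lets the class consume the newline, and the starred line then ends up in the match. The trade-off is that empty lines are dropped as well.

```python
import io
import re

txt = '''\
words
*remove me
words
words
*remove me '''

f = io.StringIO(txt)  # stands in for an open text file

# re.M makes ^ and $ match at the start and end of every line.
kept = re.findall(r'^[^*\n].*$', f.read(), re.M)
print('\n'.join(kept))
```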

Related

Find something that does not match a pattern at the beginning of line

I am using regex in Python to find something at the beginning of line that does not match pattern "SCENE" and before colon. The text looks like this
SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.
I need to find AQW, CER, DYU, EOI in this case.
I have tried
findall(r"^(?!SCENE)[^:]*", text, re.M)
I get AQW and CER, but I get ddd\nDYU instead of DYU, and ddd\nd\nEOI instead of EOI.
How could I get exactly AQW,CER,DYU,EOI?

You can try this to break your string into substrings and search within those:
import re
line = "SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n."
lines = re.split("\\n([A-Z])", line)
lines = [a+b for a,b in zip(lines[1::2], lines[2::2])]
for line in lines:
    if re.match(r"^(?!SCENE)[^:]*", line):
        print(line.split(":")[0])
result is:
AQW
CER
DYU
EOI
I assume this answer is not the best in terms of performance.

This could probably be simplified more, and I'm assuming that \n in your example string is a literal newline character.
This should match all of your use cases. It starts by matching a run of capital letters before a : that does not end in SCENE (the negative lookbehind). It then matches every following character that is not directly followed by a newline plus another run of capitals and a : (the negative lookahead). The final . is there because the last character of each block is directly followed by exactly what the negative lookahead rejects, so the repeated group cannot consume it; there is probably a way around that.
findall( r"([A-Z]+(?<!SCENE):(?:[\s\S](?!\n[A-Z]+:))+.)", text )
https://regex101.com/r/NwdUcR/2
EDIT: I realize the above may not be exactly what you're looking for. If you're looking to match just the letters before the colon you can use this:
findall( r"([A-Z]+(?<!SCENE)):", text )

I use
findall(r"\n(?!SCENE)(.+?):", text)
which works. The point is that I did not realize I could use parentheses to select what I would like to get back in the result.
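To illustrate the point about parentheses, here's a small sketch comparing the same pattern with and without a capture group, using the string from the question. Note that the pattern requires a preceding newline, so a label on the very first line would not be found; that is fine here because the first line is a SCENE line.

```python
import re

text = ("SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\n"
        "DYU:ddddd\nddd\nd\nEOI:ddd\n.")

# Without a group, findall returns each whole match, including "\n" and ":".
whole = re.findall(r"\n(?!SCENE).+?:", text)

# With one group, findall returns only what the parentheses captured.
labels = re.findall(r"\n(?!SCENE)(.+?):", text)

print(whole)
print(labels)
```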

You don't really need regex in this case.
Here is a solution using plain and simple str.split().
s = 'SCENE:xxxxxxdd\nAQW:xxxxxdd\nSCENE:xxxxxdf\nCER:dddd.ddd\nddd\nDYU:ddddd\nddd\nd\nEOI:ddd\n.'
matches = [m.split('\n')[-1] for m in s.split(':') if 'SCENE' not in m]
>>> matches
['AQW', 'CER', 'DYU', 'EOI', '.']
If you want to exclude the last '.', you can use matches = [m.split('\n')[-1] for m in s.split(':') if (('SCENE' not in m) and (m[-1] != '.'))]
or simply matches = matches[:-1]

How to parse parameters from text?

I have a text that looks like:
ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c,
,second d), third, fourth)
The engine can be different (instead of CollapsingMergeTree there can be a different word: ReplacingMergeTree, SummingMergeTree...), but the text is always in the format ENGINE = word (). There can be spaces around the "=" sign, but they are not mandatory.
Inside the parentheses are several parameters, usually a single word followed by a comma, but some parameters are themselves in parentheses, like the second one in the example above.
Line breaks could be anywhere. Line can end with comma, parenthesis or anything else.
I need to extract n parameters (I don't know how many in advance). In example above, there are 4 parameters:
first = first_param
second = (second_a, second_b, second_c, second_d) [extract with parenthesis]
third = third
fourth = fourth
How to do that with python (regex or anything else)?

You'd probably want to use a proper parser (and so look up how to hand-roll a parser for a simple language) for whatever language that is, but since what little you show here looks Python-compatible you could just parse it as if it were Python using the ast module (from the standard library) and then manipulate the result.
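As a rough idea of what a hand-rolled approach could look like (this is a simple character scanner, not the ast route, and not a full parser), the sketch below splits the argument list on commas that sit at the top nesting level. It assumes balanced parentheses and no quoted strings containing commas or parentheses. The function name split_engine_params is made up, and the sample text has the question's apparent typos ("second d", the doubled comma) corrected.

```python
import re


def split_engine_params(text):
    """Return (engine_name, top_level_params) for 'ENGINE = Name ( ... )'."""
    match = re.search(r"ENGINE\s*=\s*(\w+)\s*\(", text)
    if match is None:
        raise ValueError("no 'ENGINE = name (' clause found")

    params, current, depth = [], [], 1
    for char in text[match.end():]:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth == 0:  # closing parenthesis of the ENGINE clause
                break
        if char == "," and depth == 1:
            # A comma outside any nested parentheses ends the parameter.
            params.append("".join(current))
            current = []
        else:
            current.append(char)
    params.append("".join(current))

    # Remove all whitespace, including line breaks, from each parameter.
    return match.group(1), [re.sub(r"\s+", "", p) for p in params]


text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''

name, params = split_engine_params(text)
print(name)
print(params)
```

Because the engine name is captured with \w+ and the spaces around "=" are optional, the same function should handle ReplacingMergeTree, SummingMergeTree and so on without changes.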

I came up with a regex solution for your problem. I tried to keep the regex pattern as 'generic' as I could, because I don't know if there will always be newlines and whitespace in your text, which means the pattern selects a lot of whitespace, which is then removed afterwards.
# Import the module for regular expressions
import re
# Text to search. I corrected it a bit, because your example said "second d" and second_c was followed by two commas. I am assuming those were typos.
text = '''ENGINE = CollapsingMergeTree (
first_param
,(
second_a
,second_b, second_c
,second_d), third, fourth)'''
# Regex search pattern. re.S means that ., which represents any character, also matches \n (newlines)
pattern = re.compile(r'ENGINE = CollapsingMergeTree \((.*?),\((.*?)\),(.*?), (.*?)\)', re.S)
# Apply the pattern to the text and save the match in the variable 'result'. result[0] would return the whole matched text.
# The items you want are sub-expressions enclosed in parentheses (); they can be accessed with result[1] and above
result = re.match(pattern, text)
# result[1] holds everything after the parenthesis following CollapsingMergeTree up to the comma before the inner parenthesis, including whitespace and newlines. re.sub replaces all whitespace, including newlines, with nothing
first = re.sub(r'\s', '', result[1])
# result[2] holds second_a to second_d, again with whitespace and newlines, which re.sub removes
second = re.sub(r'\s', '', result[2])
third = re.sub(r'\s', '', result[3])
fourth = re.sub(r'\s', '', result[4])
print(first)
print(second)
print(third)
print(fourth)
OUTPUT:
first_param
second_a,second_b,second_c,second_d
third
fourth
Regex explanation:
\ = Escapes a special character, that is, a character the regex engine would otherwise interpret to mean something special.
\( = Escape parentheses
() = Mark the expression in the parentheses as a sub-group. See result[1] and so on.
. = Matches any character (including newline, because of re.S)
* = Matches 0 or more occurrences of preceding expression.
? = Matches 0 or 1 occurrence of preceding expression.
NOTE: *? combined is called a non-greedy repetition, meaning the preceding expression is matched as few times as possible while still letting the rest of the pattern match, instead of as many times as possible.
I am no expert, but I hope I got the explanations right.
I hope this helps.
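To show what greedy versus non-greedy repetition actually does, here's a tiny sketch on a made-up string (not the original ENGINE text):

```python
import re

text = "(first, second), third, fourth)"

# Greedy: .* grabs as much as possible, so the match runs to the LAST ")".
greedy = re.match(r"\((.*)\)", text).group(1)

# Non-greedy: .*? grabs as little as possible, so it stops at the FIRST ")".
lazy = re.match(r"\((.*?)\)", text).group(1)

print(greedy)
print(lazy)
```

This is why the pattern above uses (.*?) everywhere: each group stops at the first place where the following literal (a comma or a parenthesis) can match.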

Python regex extract string between two braces, including new lines [duplicate]

I'm having a bit of trouble getting a Python regex to work when matching against text that spans multiple lines. The example text is ('\n' is a newline)
some Varying TEXT\n
\n
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF\n
[more of the above, ending with a newline]\n
[yep, there is a variable number of lines here]\n
\n
(repeat the above a few hundred times).
I'd like to capture two things: the 'some_Varying_TEXT' part, and all of the lines of uppercase text that comes two lines below it in one capture (i can strip out the newline characters later).
I've tried with a few approaches:
re.compile(r"^>(\w+)$$([.$]+)^$", re.MULTILINE) # try to capture both parts
re.compile(r"(^[^>][\w\s]+)$", re.MULTILINE|re.DOTALL) # just textlines
and a lot of variations hereof with no luck. The last one seems to match the lines of text one by one, which is not what I really want. I can catch the first part, no problem, but I can't seem to catch the 4-5 lines of uppercase text.
I'd like match.group(1) to be some_Varying_Text and group(2) to be line1+line2+line3+etc until the empty line is encountered.
If anyone's curious, it's supposed to be a sequence of amino acids that make up a protein.

Try this:
re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)
I think your biggest problem is that you're expecting the ^ and $ anchors to match linefeeds, but they don't. In multiline mode, ^ matches the position immediately following a newline and $ matches the position immediately preceding a newline.
Be aware, too, that a newline can consist of a linefeed (\n), a carriage-return (\r), or a carriage-return+linefeed (\r\n). If you aren't certain that your target text uses only linefeeds, you should use this more inclusive version of the regex:
re.compile(r"^(.+)(?:\n|\r\n?)((?:(?:\n|\r\n?).+)+)", re.MULTILINE)
BTW, you don't want to use the DOTALL modifier here; you're relying on the fact that the dot matches everything except newlines.
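Here's a quick sketch of the first pattern applied to some made-up input (a title line, a blank line, then a few lines of capitals). The sample text and the variable names are only for illustration.

```python
import re

text = "title one\n\nAAAA\nBBBB\n\ntitle two\n\nCCCC\nDDDD\n"

# Without DOTALL, "." stops at newlines, so (.+) is exactly one line and
# each repetition of (?:\n.+) adds exactly one more non-empty line.
block_re = re.compile(r"^(.+)\n((?:\n.+)+)", re.MULTILINE)

# Group 2 still contains the newlines between the lines; strip them here.
records = [(title, body.replace("\n", ""))
           for title, body in block_re.findall(text)]
print(records)
```

The blank line after each block is what ends the second group, because .+ needs at least one character on the line.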

This will work:
>>> import re
>>> rx_sequence=re.compile(r"^(.+?)\n\n((?:[A-Z]+\n)+)",re.MULTILINE)
>>> rx_blanks=re.compile(r"\W+") # to remove blanks and newlines
>>> text="""Some varying text1
...
... AAABBBBBBCCCCCCDDDDDDD
... EEEEEEEFFFFFFFFGGGGGGG
... HHHHHHIIIIIJJJJJJJKKKK
...
... Some varying text 2
...
... LLLLLMMMMMMNNNNNNNOOOO
... PPPPPPPQQQQQQRRRRRRSSS
... TTTTTUUUUUVVVVVVWWWWWW
... """
>>> for match in rx_sequence.finditer(text):
...     title, sequence = match.groups()
...     title = title.strip()
...     sequence = rx_blanks.sub("",sequence)
...     print "Title:",title
...     print "Sequence:",sequence
...     print
...
Title: Some varying text1
Sequence: AAABBBBBBCCCCCCDDDDDDDEEEEEEEFFFFFFFFGGGGGGGHHHHHHIIIIIJJJJJJJKKKK
Title: Some varying text 2
Sequence: LLLLLMMMMMMNNNNNNNOOOOPPPPPPPQQQQQQRRRRRRSSSTTTTTUUUUUVVVVVVWWWWWW
Some explanation about this regular expression might be useful: ^(.+?)\n\n((?:[A-Z]+\n)+)
The first character (^) means "starting at the beginning of a line". Be aware that it does not match the newline itself (same for $: it means "just before a newline", but it does not match the newline itself).
Then (.+?)\n\n means "match as few characters as possible (any character except a newline is allowed) until you reach two newlines". The result (without the newlines) is put in the first group.
[A-Z]+\n means "match as many upper case letters as possible until you reach a newline". This defines what I will call a textline.
((?:textline)+) means match one or more textlines but do not put each line in a group. Instead, put all the textlines in one group.
You could add a final \n in the regular expression if you want to enforce a double newline at the end.
Also, if you are not sure about what type of newline you will get (\n or \r or \r\n) then just fix the regular expression by replacing every occurrence of \n by (?:\n|\r\n?).

The following is a regular expression matching a multiline block of text:
import re
result = re.findall('(startText)(.+)((?:\n.+)+)(endText)',input)

If each file only has one sequence of aminoacids, I wouldn't use regular expressions at all. Just something like this:
def read_amino_acid_sequence(path):
    with open(path) as sequence_file:
        title = sequence_file.readline() # read 1st line
        aminoacid_sequence = sequence_file.read() # read the rest
    # some cleanup, if necessary
    title = title.strip() # remove trailing white spaces and newline
    aminoacid_sequence = aminoacid_sequence.replace(" ","").replace("\n","")
    return title, aminoacid_sequence

find:
^>([^\n\r]+)[\n\r]([A-Z\n\r]+)
\1 = some_varying_text
\2 = lines of all CAPS
Edit (proof that this works):
text = """> some_Varying_TEXT
DSJFKDAFJKDAFJDSAKFJADSFLKDLAFKDSAF
GATACAACATAGGATACA
GGGGGAAAAAAAATTTTTTTTT
CCCCAAAA
> some_Varying_TEXT2
DJASDFHKJFHKSDHF
HHASGDFTERYTERE
GAGAGAGAGAG
PPPPPAAAAAAAAAAAAAAAP
"""
import re
regex = re.compile(r'^>([^\n\r]+)[\n\r]([A-Z\n\r]+)', re.MULTILINE)
matches = [m.groups() for m in regex.finditer(text)]
#NOTE can be shorter with matches = re.findall(pattern, text, re.MULTILINE)
for m in matches:
    print 'Name: %s\nSequence:%s' % (m[0], m[1])

It can sometimes be comfortable to specify the flag directly inside the string, as an inline-flag:
"(?m)^A complete line$".
For example in unit tests, with assertRaisesRegex. That way, you don't need to import re, or compile your regex before calling the assert.
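A small sketch of the difference the inline flag makes, on a made-up three-line string. Here re is imported only to call re.search directly; the point is that the pattern string itself carries the flag.

```python
import re

text = "first line\nA complete line\nlast line"

# Without the flag, ^ and $ only anchor at the start and end of the string.
without_flag = re.search("^A complete line$", text)

# (?m) at the start of the pattern turns on MULTILINE for the whole pattern.
with_flag = re.search("(?m)^A complete line$", text)

print(without_flag)
print(with_flag.group(0))
```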

My preference.
lineIter= iter(aFile)
for line in lineIter:
    if line.startswith( ">" ):
        someVaryingText= line
        break
assert len( next(lineIter).strip() ) == 0
acids= []
for line in lineIter:
    if len(line.strip()) == 0:
        break
    acids.append( line )
At this point you have someVaryingText as a string, and the acids as a list of strings.
You can do "".join( acids ) to make a single string.
I find this less frustrating (and more flexible) than multiline regexes.

REGEX (python) match or return a string after '?', but in a new line, til the end of that line

Here is what I'm trying to do... I have a string structured like this:
stringparts.bst? (carriage return)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99 (carriage return)
SPAM /198975/
I need it to match or return this:
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
What RegEx will do the trick?
I have tried this, but to no avail :(
bst\?(.*)\n
Thanks in advance

I tried this. Assuming the newline is only one character.
>>> s
'stringparts.bst?\n765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\nSPAM /198975/'
>>> m = re.match('.*bst\?\s(.+)\s', s)
>>> print m.group(1)
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99

Your regex will match everything between the bst? and the first newline which is nothing. I think you want to match everything between the first two newlines.
bst\?\n(.*)\n
will work, but you could also use
\n(.*)\n
although it may not work for some other more specific cases

This is more robust against different kinds of line breaks, and works if you have a whole list of such strings. The ^ and $ represent the beginning and end of a line, but not the actual line break character (hence the \s+ sequence).
import re
BST_RE = re.compile(
r"bst\?.*$\s+^(.*)$",
re.MULTILINE
)
INPUT_STR = r"""
stringparts.bst?
765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99
SPAM /198975/
stringparts.bst?
another
SPAM /.../
"""
occurrences = BST_RE.findall(INPUT_STR)
for occurrence in occurrences:
    print occurrence

This pattern allows additional whitespace before the \n:
r'bst\?\s*\n(.*?)\s*\n'
If you don't expect any whitespace within the string to be captured, you could use a simpler one, where \s+ consumes whitespace, including the \n, and (\S+) captures all the consecutive non-whitespace:
r'bst\?\s+(\S+)'
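Here's a short sketch that tries both patterns on the string from the question. The variable names are just for illustration.

```python
import re

s = ('stringparts.bst?\n'
     '765945559287eghc1bg60aa26e4c9ccf8ac425725622f65a6lsa6ahskchksyttsutcuan99\n'
     'SPAM /198975/')

# Tolerates stray whitespace before each line break.
tolerant = re.search(r'bst\?\s*\n(.*?)\s*\n', s).group(1)

# Simpler: skip any whitespace, then capture the next run of non-whitespace.
simple = re.search(r'bst\?\s+(\S+)', s).group(1)

print(tolerant)
print(simple)
```

Both should print the same long token for this input; they only differ when the captured line itself contains spaces.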

Print the line between specific pattern

I want to print the lines between specific string, my string is as follows:
my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
...
... '''
In this string I want to search for "main" and print it as:
##start/file1/file/images/graphs/main
##start/info/version/version/info/main
How can I do this?
I tried to find the lines between two ##start and search for main.

Try something like:
def get_mains(my_string):
    section = ''
    for line in my_string.split('\n'):
        if line[0:7] == "##start":
            section = line
            continue
        if 'main' in line:
            yield '/'.join([section, line])
for main in get_mains(my_string):
    print main

There is a way to do this with regular expressions (regex for short), using Python's re module.
Basically, regex is this whole language for searching through a string for certain patterns. If I have the string 'Hello, World', it would match the regex pattern 'llo, Wor', because it contains an ell followed by an ell followed by an o followed by a comma and a space and a capital double-you and so on. On the surface it just looks like a substring test. The real power of regex comes with special characters. If I have the string 'Hello, World' again, it also matches the pattern 'Hello, \w\w\w\w\w', because \w is a special character that stands for any letter in the alphabet (plus a few extras). So 'Hello, Bobby', 'Hello, World', 'Hello, kitty' all match the pattern 'Hello, \w\w\w\w\w', because \w can stand in for any letter. There are many more of these 'special characters' and they are all very useful. To actually answer your question,
I constructed a pattern that matches
##start\textICareAbout
file_I_don't_care
file_I_don't_care
file_I_care_about\main
which is
r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)')
The leading r makes the string a raw string (so we don't have to double the backslash in \n). Then, everything in parentheses becomes a group. Groups are pieces of text that we want to be able to recall later. There are two groups. The first one is (##start{line}), the second one is (.*main). The first group matches anything that starts with ##start and continues for a whole line, so lines like
##start/file1 or ##start/new
The second group matches lines that end in main, because .* matches every character except newlines. In between the two groups there is {line}*?, which means 'match anything that is a complete line, and match as few of them as needed'. So tying it all together, we have:
match anything that starts with ##start, then we match any number of lines, and then we match any line that ends in main.
import re
# define my_string here
pattern = re.compile(r'(##start{line}){line}*?(.*main)'.format(line=r'(?:.*\n)'))
for match in pattern.findall(my_string):
    string = match[0][:-1] # don't want the trailing \n
    string += '/'
    string += match[1]
    print string
For your example, it outputs
##start/file1/file/images/graphs/main
##start/new/version/info/main
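Note that the second line pairs ##start/new with a main line that belongs to the ##start/info/version section, which is not what the question asked for. That happens because the in-between lines are allowed to be ##start lines themselves. One possible variant (a sketch, written for Python 3) forbids that with a negative lookahead, so a match can never run past the next section header:

```python
import re

my_string = '''
##start/file1
file/images/graphs/main
file/images/graphs
file/graphs
##start/new
new/pattern/symbol
new/pattern/
##start/info/version
version/info/main
version/info/minor
##start
values/key
values
'''

# A complete line that does not itself begin a new ##start section.
body_line = r'(?:(?!##start).*\n)'
pattern = re.compile(
    r'(##start.*)\n{body}*?((?!##start).*main)'.format(body=body_line))

# Each match is a (header, main_line) tuple; join the two with a slash.
results = ['/'.join(match) for match in pattern.findall(my_string)]
for result in results:
    print(result)
```

With this change a section that has no main line (like ##start/new) simply produces no match.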
So regex is pretty cool, and other languages have it too. It is a very powerful tool, and it is well worth learning how to use it.
Also just a side note, I use the .format function, because I think it looks much cleaner and easier to read, so
'hello{line}world'.format(line=r'(?:.*\n)') just becomes evaluated to 'hello(?:.*\n)world', and it would match
hello
Any Text Here. Anything at all. (just for one line)
world
