Iterating through regular expressions in Python - python

I have a basic knowledge of python (completed one class) and I'm unsure of how to tackle this next script. I have two files, one is a newick tree - looks like this, but much larger:
(((1:0.01671793,2:0.01627631):0.00455274,(3:0.02781576,4:0.05606947):0.02619237):0.08529440,5:0.16755623);
The second file is a tab delimited text file that looks like this but is much larger:
1 \t Human
2 \t Chimp
3 \t Mouse
4 \t Rat
5 \t Fish
I want to replace the sequence ID numbers (only those followed by colons) in the newick file with the species names in the text file to create
(((Human:0.01671793,Chimp:0.01627631):0.00455274,(Mouse:0.02781576,Rat:0.05606947):0.02619237):0.08529440,Fish:0.16755623);
My pseudocode (after opening both files) would look something like
for line in txtfile:
if line[0] matches \(\d*\ in newick:
replace that \d* with line[2]
Any suggestions would be greatly appreciated!

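This can be done by defining a callback function that is run on every match of a regexp such as (?<=[(,])\d+(?=:), which matches a run of digits that follows ( or , and precedes a colon. (The pattern \(\d*: would only catch IDs that come directly after an opening parenthesis.)
Here's an (unrelated) example from https://docs.python.org/2/library/re.html#text-munging that illustrates how the callback function is used together with re.sub(), which performs the regexp substitution: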
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'
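Applied to the data in the question, the same callback idea might look like the sketch below. It assumes the mapping file has already been read into a string (mapping_text stands in for the file contents), that sequence IDs always follow ( or , in the tree, and that unknown IDs should be left unchanged. The names load_names and rename_ids are illustrative, not part of any library.

```python
import re

newick = ("(((1:0.01671793,2:0.01627631):0.00455274,"
          "(3:0.02781576,4:0.05606947):0.02619237):0.08529440,5:0.16755623);")
# Stand-in for the contents of the tab-delimited file (assumption).
mapping_text = "1 \t Human\n2 \t Chimp\n3 \t Mouse\n4 \t Rat\n5 \t Fish\n"

def load_names(text):
    """Build {id: name} from tab-delimited lines (hypothetical helper)."""
    names = {}
    for line in text.splitlines():
        if line.strip():
            seq_id, name = line.split("\t", 1)
            names[seq_id.strip()] = name.strip()
    return names

# Digits that follow "(" or "," and are followed by ":" are sequence IDs.
ID_PATTERN = re.compile(r"(?<=[(,])\d+(?=:)")

def rename_ids(tree, names):
    # The callback receives each match; unknown IDs are left as they are.
    return ID_PATTERN.sub(lambda m: names.get(m.group(0), m.group(0)), tree)

renamed = rename_ids(newick, load_names(mapping_text))
print(renamed)
```

Because each ID is matched as a whole token, an ID such as 11 is never partially rewritten when 1 is also in the mapping.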

You can also do it using findall:
import re
s = "(((1:0.01671793,2:0.01627631):0.00455274,(3:0.02781576,4:0.05606947):0.02619237):0.08529440,5:0.16755623)"
rep = {'1':'Human',
'2':'Chimp',
'3':'Mouse',
'4':'Rat',
'5':'Fish'}
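# Caveat: str.replace rewrites every occurrence, so '1:' would also hit the tail of '11:';
# this is only safe while no ID is a suffix of another ID.
for i in re.findall(r'(\d+:)', s):
    s = s.replace(i, rep[i[:-1]]+':')
>>> print(s)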
(((Human:0.01671793,Chimp:0.01627631):0.00455274,(Mouse:0.02781576,Rat:0.05606947):0.02619237):0.08529440,Fish:0.16755623)

Related

Split string if separator is not in-between two characters

I want to write a script that reads from a csv file and splits each line by comma except any commas in-between two specific characters.
In the below code snippet I would like to split line by commas except the commas in-between two $s.
line = "$abc,def$,$ghi$,$jkl,mno$"
output = line.split(',')
for o in output:
    print(o)
How do I write output = line.split(',') so that I get the following terminal output?
~$ python script.py
$abc,def$
$ghi$
$jkl,mno$
You can do this with a regular expression:
In re, (?<=\$) matches a position immediately after a $.
Similarly, (?=\$) matches a position immediately before a $.
So to split only on a comma that has a $ on both sides you can use:
expression = r"(?<=\$),(?=\$)"
Full program:
import re
expression = r"(?<=\$),(?=\$)"
print(re.split(expression, "$abc,def$,$ghi$,$jkl,mno$"))
One solution (maybe not the most elegant, but it works) is to replace the string $,$ with something like $,,$ and then split on ,,. Something like this:
output = line.replace('$,$','$,,$').split(',,')
Using a regex as mousetail suggested is the more elegant and robust solution, but it requires knowing regex (not that anyone KNOWS regex).
Try regular expressions:
import re
line = "$abc,def$,$ghi$,$jkl,mno$"
output = re.findall(r"\$(.*?)\$", line)
for o in output:
    print('$'+o+'$')
$abc,def$
$ghi$
$jkl,mno$
First, you can identify a character that is not used in that line:
c = chr(max(map(ord, line)) + 1)
Then, you can proceed as follows:
line.replace('$,$', f'${c}$').split(c)
Here is your example:
>>> line = '$abc,def$,$ghi$,$jkl,mno$'
>>> c = chr(max(map(ord, line)) + 1)
>>> result = line.replace('$,$', f'${c}$').split(c)
>>> print(*result, sep='\n')
$abc,def$
$ghi$
$jkl,mno$
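Since the data comes from a CSV file, another option is to let the standard csv module do the splitting by treating $ as the quote character. This is a minimal sketch, assuming every protected field is wrapped in a pair of $ the way a quoted CSV field would be; the sample line is held in an in-memory buffer instead of a real file.

```python
import csv
import io

data = "$abc,def$,$ghi$,$jkl,mno$\n"

# quotechar="$" tells the reader that commas between two $ belong to the field.
rows = list(csv.reader(io.StringIO(data), delimiter=",", quotechar="$"))
fields = rows[0]
print(fields)

# Re-wrap in $ if the delimiters should be kept, as in the question's output.
wrapped = ["$" + field + "$" for field in fields]
print(*wrapped, sep="\n")
```

Note that the reader strips the surrounding $ from each field, which is why the last step adds them back.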

How to count sentences taking into account the occurrence of ellipses

I've written the following script to count the number of sentences in a text file:
import re
filepath = 'sample_text_with_ellipsis.txt'
with open(filepath, 'r') as f:
    read_data = f.read()
sentences = re.split(r'[.{1}!?]+', read_data.replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
However, if I run it on a sample_text_with_ellipsis.txt with the following content:
Wait for it... awesome!
I get sentence_count = 2 instead of 1, because it does not ignore the ellipsis (i.e., the "...").
What I tried to do in the regex is to make it match only one occurrence of a period through .{1}, but this apparently doesn't work the way I intended it. How can I get the regex to ignore ellipses?
Splitting sentences with a regex like this is not enough. See Python split text on sentences to see how NLTK can be leveraged for this.
To answer your question: you call a sequence of three dots an ellipsis, so you need a pattern that does not split on a dot that is adjacent to another dot:
[!?]+|(?<!\.)\.(?!\.)
The . is moved out of the character class, since quantifiers cannot be used inside one, and only a . that is not adjacent to other dots is matched.
[!?]+ - 1 or more ! or ?
| - or
(?<!\.)\.(?!\.) - a dot that is neither preceded ((?<!\.)) nor followed ((?!\.)) by a dot.
Python demo:
import re
sentences = re.split(r'[!?]+|(?<!\.)\.(?!\.)', "Wait for it... awesome!".replace('\n',''))
sentences = sentences[:-1]
sentence_count = len(sentences)
print(sentence_count) # => 1
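The demo drops the last piece with [:-1], which assumes the text ends with a terminator. A slightly more defensive sketch, under the assumption that any non-blank piece counts as a sentence, filters out empty pieces instead. The name count_sentences is illustrative only.

```python
import re

# Same pattern as above: runs of ! or ?, or a dot with no dot on either side.
SENTENCE_END = re.compile(r'[!?]+|(?<!\.)\.(?!\.)')

def count_sentences(text):
    """Count non-blank pieces between sentence terminators."""
    pieces = SENTENCE_END.split(text.replace('\n', ' '))
    return sum(1 for piece in pieces if piece.strip())

print(count_sentences("Wait for it... awesome!"))
print(count_sentences("One. Two? Three"))
```

This keeps a final sentence that has no closing punctuation, which the [:-1] slice would discard.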
Following Wiktor's suggestion to use NLTK, I also came up with the following alternative solution:
import nltk
read_data="Wait for it... awesome!"
sentence_count = len(nltk.tokenize.sent_tokenize(read_data))
This yields a sentence count of 1 as expected.

Using regex to extract information from string

I am trying to write a regex in Python to extract some information from a string.
Given:
"Only in Api_git/Api/folder A: new.txt"
I would like to print:
Folder Path: Api_git/Api/folder A
Filename: new.txt
After having a look at some examples on the re manual page, I'm still a bit stuck.
This is what I've tried so far
m = re.match(r"(Only in ?P<folder_path>\w+:?P<filename>\w+)","Only in Api_git/Api/folder A: new.txt")
print m.group('folder_path')
print m.group('filename')
Can anybody point me in the right direction??
Get the matched group from index 1 and 2 using capturing groups.
^Only in ([^:]*): (.*)$
sample code:
import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = u"Only in Api_git/Api/folder A: new.txt"
re.findall(p, test_str)
If you want to print in the below format then try with substitution.
Folder Path: Api_git/Api/folder A
Filename: new.txt
sample code:
import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = u"Only in Api_git/Api/folder A: new.txt"
# Python's re uses \1 and \2 for group references in the replacement, not $1 and $2.
subst = u"Folder Path: \\1\nFilename: \\2"
result = re.sub(p, subst, test_str)
Your pattern: (Only in ?P<folder_path>\w+:?P<filename>\w+) has a few flaws in it.
The ?P construct is only valid as the first bit inside a parenthesized expression, so we need this:
(Only in (?P<folder_path>\w+):(?P<filename>\w+))
The \w character class only covers letters, digits, and underscores. It won't match /, ., or a space, for example. We need to use a different character class that more closely aligns with requirements. In fact, we can just use ., the class of nearly all characters:
(Only in (?P<folder_path>.+):(?P<filename>.+))
The colon has a space after it in your example text. We need to match it:
(Only in (?P<folder_path>.+): (?P<filename>.+))
The outermost parentheses are not needed. They aren't wrong, just not needed:
Only in (?P<folder_path>.+): (?P<filename>.+)
It is often convenient to provide the regular expression separate from the call to the regular expression engine. This is easily accomplished by creating a new variable, for example:
regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
... # several lines later
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")
The above is purely for the convenience of the programmer: it neither saves nor squanders time or memory space. There is, however, a technique that can save some of the time involved in regular expressions: compiling.
Consider this code segment:
regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
for line in input_file:
    m = re.match(regex, line)
    ...
For each iteration of the loop, the re module must look up (or build) the compiled form of the regular expression before applying it to the line variable. The module keeps a cache of recently used patterns, so the cost is usually small, but it still lets us separate the interpretation from the application; we can interpret once but apply several times:
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
for line in input_file:
    m = re.match(regex, line)
    ...
Now, your original program should look like this:
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))
However, I'm a fan of using comments to explain regular expressions. My version, including some general cleanup, looks like this:
import re
regex = re.compile(r'''(?x)      # Verbose
    Only\ in\                    # Literal match
    (?P<folder_path>.+)          # match longest sequence of anything, and put in 'folder_path'
    :\                           # Literal match
    (?P<filename>.+)             # match longest sequence of anything and put in 'filename'
''')
with open('diff.out') as input_file:
    for line in input_file:
        m = re.match(regex, line)
        if m:
            print(m.group('folder_path'))
            print(m.group('filename'))
It really depends on how constrained the input is. If every line looks like the example, this will do the trick:
^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$
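To produce the exact two-line report the question asks for, the named groups can be fed into a format string. The sketch below assumes the input line follows the "Only in <folder>: <file>" shape and returns None otherwise; describe is a hypothetical helper name.

```python
import re

LINE_PATTERN = re.compile(r"Only in (?P<folder_path>.+): (?P<filename>.+)")

def describe(line):
    """Return the two-line report, or None if the line does not match."""
    m = LINE_PATTERN.match(line)
    if m is None:
        return None
    return "Folder Path: {}\nFilename: {}".format(
        m.group("folder_path"), m.group("filename"))

report = describe("Only in Api_git/Api/folder A: new.txt")
print(report)
```

Checking for None before calling group() avoids an AttributeError on lines that do not match.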

How do I efficiently replace the last line in a string?

I have a multi-line string, and I would like to replace the last line of the string with a different line. How do I most efficiently do this?
Split on the last linebreak and attach a new line:
new = old.rstrip('\n').rsplit('\n', 1)[0] + '\nNew line to be added with line break included.'
This first removes any trailing newline after the last line, splits once on the last newline in the text, takes everything before that last newline, and concatenates the result with a new newline and text.
Demo:
>>> old = '''The quick
... brown fox jumps
... over the lazy
... dog
... '''
>>> old.rstrip('\n').rsplit('\n', 1)[0] + '\nhorse and rider'
'The quick\nbrown fox jumps\nover the lazy\nhorse and rider'
This presumes that your lines are separated by \n characters; reading text files in text mode gives you such data on any platform.
If you are dealing with data with different line endings, adjust accordingly. In such cases os.linesep can come in useful.
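One edge case worth noting: with rsplit(...)[0], a string that has only one line keeps that line and gets the new one appended. A small sketch using str.rpartition handles that case too, assuming \n line endings and that a single trailing run of newlines should be dropped. The function name replace_last_line is illustrative.

```python
def replace_last_line(text, new_line):
    """Replace the final line of text with new_line (assumes '\\n' endings)."""
    body = text.rstrip("\n")
    # rpartition returns ('', '', body) when there is no newline at all,
    # so a single-line string is replaced outright.
    head, sep, _last = body.rpartition("\n")
    return head + sep + new_line

old = "The quick\nbrown fox jumps\nover the lazy\ndog\n"
print(replace_last_line(old, "horse and rider"))
print(replace_last_line("only line", "replacement"))
```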
I would suggest this approach:
>>> x = """
... test1
... test2
... test3"""
>>> print("\n".join(x.splitlines()[:-1]+["something else"]))
test1
test2
something else
>>>
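You can accomplish this using a simple regular expression.
import re
new_string = re.sub(r'[^\n]*\n?$', "replacement", "existing\nstring", count=1)
It matches the last line (everything after the final line break, plus an optional trailing \n) and replaces it with the replacement string. The count=1 argument matters on Python 3.7 and later, where the pattern would otherwise match a second time as an empty string at the very end and insert the replacement twice.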

Deleting Non-ASCII *lines* from a file?

Is there a way I can remove non-ascii lines (not characters) from a file? So given something like this:
Line 1 (full ASCII character set)
Line 2 (contains unicode characters)
Line 3 (full ASCII)
Line 4 (contains unicode characters)
I want:
Line 1
Line 3
I know I can use iconv to remove non-ASCII characters, but I want to delete any line that contains non-ASCII characters. Is there a utility/pythonic way to do this?
If you want to eliminate lines that contain any non-ascii characters:
def ascii_lines(iterable):
    for line in iterable:
        if all(ord(ch) < 128 for ch in line):
            yield line
f = open('somefile.txt')
for line in ascii_lines(f):
    print(line)
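On Python 3.7 or later, the per-character check can be replaced by str.isascii(). The sketch below is one way to write the same generator under that assumption; it runs on an in-memory list of sample lines instead of a real file, and the sample strings are made up for illustration.

```python
def ascii_lines(lines):
    """Yield only the lines made up entirely of ASCII characters."""
    for line in lines:
        if line.isascii():  # str.isascii() needs Python 3.7+
            yield line

# Hypothetical sample input standing in for the lines of a file.
sample = ["Line 1\n", "Line 2 caf\u00e9\n", "Line 3\n", "Line 4 \u2713\n"]
kept = list(ascii_lines(sample))
print("".join(kept), end="")
```

With a real file, passing the open file object to ascii_lines works the same way, since iterating a text file yields its lines.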
Given a string like the following (Python 2 byte string):
>>> s = "asd\n\xaa\xfa\xaf\nqwe"
>>> print s
asd
╙З╞
qwe
You may simply filter it by your criteria:
>>> s = filter(lambda x: ord(x) < 128, s)
>>> s
'asd\n\nqwe'
>>> print s
asd
qwe
You can also achieve the same result by decoding to unicode and ignoring errors:
>>> str(s.decode('ascii', 'ignore'))
'asd\n\nqwe'
To remove empty lines I'd use re.sub('\n+', '\n', s).
In practice you'll want to do something with the data, and need to parse it further. If your file test looks like
http://example.com dog
http://example.com/å%20ä%20ö/ foo
http://google.com bar
A pyparsing script would remove the bad lines like so
from pyparsing import *
ParserElement.setDefaultWhitespaceChars(" \t")
EOL = LineEnd()
ascii = u''.join(unichr(x) for x in xrange(33,127))
words = Word(ascii)
good_line = Group(ZeroOrMore(words) + EOL)
bad_line = SkipTo(EOL,include=True)
blocks = good_line | bad_line.suppress()
grammar = ZeroOrMore(blocks) + StringEnd()
P = grammar.parseFile("test")
print P
Which would give as output:
[['http://example.com', 'dog', '\n'], ['http://google.com', 'bar']]
The advantage over the other methods (which work fine, and answer the question) is that you now have a nice parse tree to further manipulate the data. The idea is to write a grammar, not a parser, for any task that has the potential to become more complicated than when it first started.
for line in fin:
    try:
        line.encode('ascii')  # raises if the line has any non-ASCII character
    except UnicodeError:
        continue
    fout.write(line)
LC_ALL=C grep -v $'[^\t\r -~]'
grep -v prints all lines that don't match the pattern. LC_ALL=C sets the locale to "C". $'[^\t\r -~]' is a pattern that, in the C locale, means "contains a character that is not a horizontal tab, a carriage return, a space, or a printable ASCII character". ($'...' is a Bash notation: it's equivalent to '...', except that it processes backslash-escapes like \t and \r. [^...] is a "negative character class", meaning "any character that isn't listed in ...". Inside a character class, - matches a range: in this case, the range from space to tilde. The C locale is necessary to make sense of this "range".)
