Regular Expression to find valid words in file - python

I need to write a function get_specified_words(filename) to get a list of lowercase words from a text file. All of the following conditions must be applied:
Include all lower-case character sequences, including those that contain a - or ' character and those that end with a ' character.
Exclude words that end with a -.
The function must only process lines between the start and end marker lines.
Use this regular expression to extract the words from each relevant line of a file: valid_line_words = re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line)
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading. That is, your open file call should look like open(filename, encoding='utf-8'). This will be especially helpful if your operating system doesn't set Python's default encoding to UTF-8.
The sample text file testing.txt contains this:
That are after the start and should be dumped.
So should that
and that
and yes, that
*** START OF SYNTHETIC TEST CASE ***
Toby's code was rather "interesting", it had the following issues: short,
meaningless identifiers such as n1 and n; deep, complicated nesting;
a doc-string drought; very long, rambling and unfocused functions; not
enough spacing between functions; inconsistent spacing before and
after operators, just like this here. Boy was he going to get a low
style mark.... Let's hope he asks his friend Bob to help him bring his code
up to an acceptable level.
*** END OF SYNTHETIC TEST CASE ***
This is after the end and should be ignored too.
Have a nice day.
Here's my code:
import re

def stripped_lines(lines):
    for line in lines:
        stripped_line = line.rstrip('\n')
        yield stripped_line

def lines_from_file(fname):
    with open(fname, 'rt') as flines:
        for line in stripped_lines(flines):
            yield line

def is_marker_line(line, start='***', end='***'):
    min_len = len(start) + len(end)
    if len(line) < min_len:
        return False
    return line.startswith(start) and line.endswith(end)

def advance_past_next_marker(lines):
    for line in lines:
        if is_marker_line(line):
            break

def lines_before_next_marker(lines):
    valid_lines = []
    for line in lines:
        if is_marker_line(line):
            break
        valid_lines.append(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+", line))
    for content_line in valid_lines:
        yield content_line

def lines_between_markers(lines):
    it = iter(lines)
    advance_past_next_marker(it)
    for line in lines_before_next_marker(it):
        yield line

def words(lines):
    text = '\n'.join(lines).lower().split()
    return text

def get_valid_words(fname):
    return words(lines_between_markers(lines_from_file(fname)))

# This must be executed
filename = "valid.txt"
all_words = get_valid_words(filename)
print(filename, "loaded ok.")
print("{} valid words found.".format(len(all_words)))
print("word list:")
print("\n".join(all_words))
Here's my output:
File "C:/Users/jj.py", line 45, in <module>
text = '\n'.join(lines).lower().split()
builtins.TypeError: sequence item 0: expected str instance, list found
Here's the expected output:
valid.txt loaded ok.
73 valid words found.
word list:
toby's
code
was
rather
interesting
it
had
the
following
issues
short
meaningless
identifiers
such
as
n
and
n
deep
complicated
nesting
a
doc-string
drought
very
long
rambling
and
unfocused
functions
not
enough
spacing
between
functions
inconsistent
spacing
before
and
after
operators
just
like
this
here
boy
was
he
going
to
get
a
low
style
mark
let's
hope
he
asks
his
friend
bob
to
help
him
bring
his
code
up
to
an
acceptable
level
I need help with getting my code to work. Any help is appreciated.

lines_between_markers(lines_from_file(fname))
gives you a sequence of lists of valid words, so you just need to flatten it:
def words(lines):
    words_list = [w for line in lines for w in line]
    return words_list
Does the trick.
But I think you should review the design of your program: lines_between_markers should only yield lines between markers, yet it does more. The regexp should be used on the result of this function, not inside it.
What you didn't do:
Ensure that the line string is lower case before using the regular expression.
Use the optional encoding parameter when opening files for reading. That is, your open file call should look like open(filename, encoding='utf-8').
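Putting the three fixes together, a complete get_specified_words could look like this. It's a minimal sketch against the stated requirements, keeping the question's regex; the marker test is simplified to lines that start and end with *** (adjust if your markers differ):

import re

def get_specified_words(filename):
    """Return the valid lowercase words between the marker lines."""
    words = []
    in_section = False
    with open(filename, encoding='utf-8') as f:
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('***') and line.endswith('***'):
                if in_section:
                    break          # second marker: stop processing
                in_section = True  # first marker: start processing
                continue
            if in_section:
                # Lowercase first, then extract, as the spec requires
                words.extend(re.findall("[a-z]+[-'][a-z]+|[a-z]+[']?|[a-z]+",
                                        line.lower()))
    return words

On the sample testing.txt this should produce the 73 words shown in the expected output.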


Return value in a quite nested for-loop

I want nested loops to test whether all elements match the condition and, if they all do, return True. Example:
There's a given text file: file.txt, which includes lines of this pattern:
aaa:bb3:3
fff:cc3:4
Letters, semicolon, alphanumeric, semicolon, integer, newline.
Generally, I want to test whether all lines match this pattern. However, in this function I would like to check whether the first column includes only letters.
def opener(file):
    #Opens a file and creates a list of lines
    fi=open(file).read().splitlines()
    import string
    res = True
    for i in fi:
        #Checks whether any characters in the first column is not a letter
        if any(j not in string.ascii_letters for j in i.split(':')[0]):
            res = False
        else:
            continue
    return res
However, the function returns False even if all characters in the first column are letters. I would like to ask you for the explanation, too.
Your file contains a newline after its last line, so your code also checks the empty line after your last data line, which does not fulfill your test; that is why you get False no matter the input:
aaa:bb3:3
fff:cc3:4
empty line that does not start with only letters
You can fix it by special-treating empty lines: allow them if they occur at the end of the file; an empty line in between filled ones still returns False:
with open("t.txt","w") as f:
    f.write("""aaa:bb3:3
fff:cc3:4
""")

import string

def opener(file):
    letters = string.ascii_letters
    # Opens a file and creates a list of lines
    with open(file) as fi:
        res = True
        empty_line_found = False
        for i in fi:
            if i.strip(): # only check line if not empty
                if empty_line_found: # we had an empty line and now a filled line: error
                    return False
                #Checks whether any characters in the first column is not a letter
                if any(j not in letters for j in i.strip().split(':')[0]):
                    return False # immediately exit - no need to test the rest of the file
            else:
                empty_line_found = True
        return res # or True

print(opener("t.txt"))
Output:
True
If you use
# example with a file that contains an empty line between data lines - NOT ok
with open("t.txt","w") as f:
    f.write("""aaa:bb3:3

fff:cc3:4
""")
or
# example for file that contains empty line after data - which is ok
with open("t.txt","w") as f:
    f.write("""aaa:bb3:3
ff2f:cc3:4
""")
you get: False
Colonoscopy
ASCII, and UNICODE, both define character 0x3A as COLON. This character looks like two dots, one over the other: :
ASCII, and UNICODE, both define character 0x3B as SEMICOLON. This character looks like a dot over a comma: ;
You were consistent in your use of the colon in your example (fff:cc3:4), and you were consistent in your use of the word semicolon in your descriptive text: Letters, semicolon, alphanumeric, semicolon, integer, newline.
I'm going to assume you meant colon (':') since that is the character you typed. If not, you should change it to a semicolon (';') everywhere necessary.
Your Code
Here is your code, for reference:
def opener(file):
    #Opens a file and creates a list of lines
    fi=open(file).read().splitlines()
    import string
    res = True
    for i in fi:
        #Checks whether any characters in the first column is not a letter
        if any(j not in string.ascii_letters for j in i.split(':')[0]):
            res = False
        else:
            continue
    return res
Your Problem
The problem you asked about was the function always returning false. The example you gave included a blank line between the first example and the second. I would caution you to watch out for spaces or tabs in those blank lines. You can fix this by explicitly catching blank lines and skipping over them:
for i in fi:
    if i.isspace():
        # skip blank lines
        continue
Some Other Problems
Now here are some other things you might not have noticed:
You provided a nice comment in your function. That should have been a docstring:
def opener(file):
    """ Opens a file and creates a list of lines.
    """
You import string in the middle of your function. Don't do that. Move the import
up to the top of the module:
import string  # at top of file

def opener(file):  # Not at top of file
You opened the file with open() and never closed it. This is exactly why the with keyword was added to python:
with open(file) as infile:
    fi = infile.read().splitlines()
You opened the file, read its entire contents into memory, then split it into lines
discarding the newlines at the end. All so that you could split it by colons and ignore
everything but the first field.
It would have been simpler to just call readlines() on the file:
with open(file) as infile:
    fi = infile.readlines()
res = True
for i in fi:
It would have been even easier and even simpler to just iterate on the file directly:
with open(file) as infile:
    res = True
    for i in infile:
It seems like you are building up towards checking the entire format you gave at the beginning. I suspect a regular expression would be (1) easier to write and maintain; (2) easier to understand later; and (3) faster to execute. Both now, for this simple case, and later when you have more rules in place:
import logging
import re

# One concrete pattern for the described format (assumed here):
# letters, colon, alphanumerics, colon, integer
valid_line = r'^[A-Za-z]+:[A-Za-z0-9]+:[0-9]+$'

def opener(file):
    bad_lines = 0
    with open(file) as infile:
        for line in infile:
            if line.isspace():
                continue
            if re.match(valid_line, line):
                continue
            logging.warning(f"Bad line: {line}")
            bad_lines += 1
    return bad_lines == 0
Your names are bad. Your function includes the names file, fi, i, j, and res. The only one that barely makes sense is file.
Considering that you are asking people to read your code and help you find a problem, please, please use better names. If you just replaced those names with file (same), infile, line, ch, and result the code gets more readable. If you restructured the code using standard Python best practices, like with, it gets even more readable. (And has fewer bugs!)

Python: losing nucleotides from fasta file to dictionary

I am trying to write code to extract the longest ORF from a fasta file. It is from the Coursera Genomics Data Science course.
The file is a practice file: "dna.example.fasta"
Data is here: https://d396qusza40orc.cloudfront.net/genpython/data_sets/dna.example.fasta
Part of my code is below, to extract reading frame 2 (starting from the second position of a sequence; e.g. for the sequence ATTGGG, reading frame 2 is TTGGG):
#!/usr/bin/python
import sys
import getopt

o, a = getopt.getopt(sys.argv[1:], 'h')
opts = dict()
for k, v in o:
    opts[k] = v
    if '-h' in k:
        print "--help\n"
if len(a) < 0:
    print "missing fasta file\n"

f = open(a[0], "r")
seq = dict()
for line in f:
    line = line.strip()
    if line.startswith(">"):
        name = line.split()[0]
        seq[name] = ''
    else:
        seq[name] = seq[name] + line[1:]

k = seq[">gi|142022655|gb|EQ086233.1|323"]
print len(k)
The length of this particular sequence should be 4804 bp, and by using this sequence alone I could get the correct answer.
However, with the code, in the dictionary this particular sequence becomes only 4736 bp.
I am new to python, so I cannot wrap my head around where those missing bp went.
Thank you,
Xio
Take another look at your data file
An example of some of the lines:
>gi|142022655|gb|EQ086233.1|43 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TCGGGCGAAGGCGGCAGCAAGTCGTCCACGCGCAGCGCGGCACCGCGGGCCTCTGCCGTGCGCTGCTTGG
CCATGGCCTCCAGCGCACCGATCGGATCAAAGCCGCTGAAGCCTTCGCGCATCAGGCGGCCATAGTTGGC
Notice how each sequence line starts with a nucleotide in its first position.
Your addition line seq[name] = seq[name] + line[1:] is adding everything on that line after the first character, excluding the first (Python indices are zero-based). It turns out your missing number of nucleotides equals the number of lines it took to record that sequence, because you're losing the first character of every line.
The revised way is seq[name] = seq[name] + line, which simply adds the line without losing that first character.
The quickest way to find these kinds of debugging errors is to either use a formal debugger, or add a bunch of print statements to your code and test with a small portion of the file -- something whose output you can inspect and check for yourself. A short file with maybe 50 nucleotides instead of 5000 is much easier to evaluate by hand and make sure the code is doing what you want. That's what I did to come up with the answer to the problem in about 5 minutes.
Also for future reference, please mention the version of Python you are using beforehand. There are quite a few differences between Python 2 (the one you're using) and Python 3.
I did some additional testing with your code, and if you get any extra characters at the end, they might be whitespace. Make sure you use the .strip() method on each line before adding it to your string, which clears whitespace.
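Putting the fix and the .strip() advice together, the parsing loop might look like this (a sketch in the question's Python 2 style; f is the open file handle as in your code):

seq = dict()
for line in f:
    line = line.strip()                  # clear trailing whitespace/newline
    if line.startswith(">"):
        name = line.split()[0]           # header line: start a new entry
        seq[name] = ''
    else:
        seq[name] = seq[name] + line     # keep the whole line, no slicing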
Addressing your comment:
To start from the 2nd position on the first line of the sequence only, and then use the full lines up to the next header, you can take advantage of the file's linear format and just add one more clause to your if statement, an elif. It tests whether we're on the first line of the sequence: if so, use the characters starting from the second; on any other line, use the whole line.
if line.startswith(">"):
    name = line.split()[0]
    seq[name] = ''
# If it's the first line in the series, then the dict's value
# will be an empty string, so this elif means "If we're at the
# start of the series..."
elif seq[name] == '':
    seq[name] = seq[name] + line[1:]
else:
    seq[name] = seq[name] + line
This adaptation will start from the 2nd nucleotide of the sequence without losing the first character of every other line.

Having problems with strings and arrays

I want to read a text file and copy text that is in between '~~~~~~~~~~~~~' into an array. However, I'm new in Python and this is as far as I got:
with open("textfile.txt", "r", encoding='utf8') as f:
    searchlines = f.readlines()

a = [0]
b = 0
for i, line in enumerate(searchlines):
    if '~~~~~~~~~~~~~' in line:
        b = b + 1
    if '~~~~~~~~~~~~~' not in line:
        if 's1mb4d' in line:
            break
        a.insert(b, line)
This is what I envisioned:
First I read all the lines of the text file,
then I declare 'a' as an array in which text should be added,
then I declare 'b' because I need it as an index. The number of lines in between the '~~~~~~~~~~~~~' separators is not always the same, which is why I use 'b': so I can keep putting lines of text into one array index until a new '~~~~~~~~~~~~~' is found.
I check for '~~~~~~~~~~~~~', if found I increase 'b' so I can start adding lines of text into a new array index.
The text file ends with 's1mb4d', so once its found, the program ends.
And if '~~~~~~~~~~~~~' is not found in the line, I add text to the array.
But things didn't go well. Only 1 line of the entire text between those '~~~~~~~~~~~~~' separators is being copied to each array index.
Here is an example of the text file:
~~~~~~~~~~~~~
Text123asdasd
asdasdjfjfjf
~~~~~~~~~~~~~
123abc
321bca
gjjgfkk
~~~~~~~~~~~~~
You could use a regular expression; give this a try:
import re

input_text = ['Text123asdasd asdasdjfjfjf', '~~~~~~~~~~~~~', '123abc 321bca gjjgfkk', '~~~~~~~~~~~~~']
a = []
for line in input_text:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
What it does: it reads line by line and collects every run of characters other than '~'. If a line consists only of '~' it is ignored; every line with text is appended to your a list afterwards.
And just because we can, a one-liner (excluding the import and the source list, of course):
import re
lines = ['Text123asdasd asdasdjfjfjf','~~~~~~~~~~~~~','123abc 321bca gjjgfkk','~~~~~~~~~~~~~']
a = [re.findall(r'[^\~]+', line) for line in lines if len(re.findall(r'[^\~]+', line)) != 0]
In Python the solution to a large class of problems is often to find the right function in the standard library. Here you should try using split instead; it should be way easier.
If I understand your goal correctly, you can do it like this:
joined_lines = ''.join(searchlines)
result = joined_lines.split('~~~~~~~~~~~~~')
The first line joins your list of lines into a single string, and the second one cuts that big string every time it encounters the separator sequence.
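For instance, with the sample file from the question, the whole thing could look like this (a sketch; assumes the same textfile.txt as above):

with open("textfile.txt", "r", encoding='utf8') as f:
    searchlines = f.readlines()

joined_lines = ''.join(searchlines)
# drop the empty/whitespace-only pieces left around the separators
result = [part.strip() for part in joined_lines.split('~~~~~~~~~~~~~') if part.strip()]
print(result)
# ['Text123asdasd\nasdasdjfjfjf', '123abc\n321bca\ngjjgfkk']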
I tried to clean it up to the best of my knowledge; try this and let me know if it works. We can work together on this! :)
with open("textfile.txt", "r", encoding='utf8') as f:
    searchlines = f.readlines()

a = []
currentline = ''
for i, line in enumerate(searchlines):
    currentline += line
    if '~~~~~~~~~~~~~' in line:
        a.append(currentline)
        currentline = ''  # start collecting the next block
    elif 's1mb4d' in line:
        break
Some notes:
You can use elif for your break condition.
append will automatically add the new entry to the end of the array.
currentline keeps accumulating text from each line until it hits 's1mb4d' or the ~~~ separator, which I think is what you want.
import re

s = ['']
with open('path\\to\\sample.txt') as f:
    for l in f:
        a = l.strip().split("\n")
        s += a

a = []
for line in s:
    my_text = re.findall(r'[^\~]+', line)
    if len(my_text) != 0:
        a.append(my_text)
print a
>>> [['Text123asdasd asdasdjfjfjf'], ['123abc 321bca gjjgfkk']]
If you're willing to impose/accept the constraint that the separator should be exactly 13 ~ characters (actually '\n%s\n' % ( '~' * 13) to be specific) ...
then you could accomplish this for relatively normal sized files using just
#!/usr/bin/python
## (Should be #!/usr/bin/env python; but StackOverflow's syntax highlighter?)

separator = '\n%s\n' % ('~' * 13)
with open('somefile.txt') as f:
    results = f.read().split(separator)
# Use your results, a list of the strings separated by these separators.
Note that '~' * 13 is a way, in Python, of constructing a string by repeating some smaller string thirteen times. 'xx%sxx' % 'YY' is a way to "interpolate" one string into another. Of course you could just paste the thirteen ~ characters into your source code ... but I would consider constructing the string as shown to make it clear that the length is part of the string's specification --- that this is part of your file format requirements ... and that any other number of ~ characters won't be sufficient.
If you really want any line of any number of ~ characters to serve as a separator, then you'll want to use the split() function from the regular expressions module rather than the .split() method provided by the built-in string objects, as shown in the sketch below.
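Something like this might do it (a sketch; same hypothetical somefile.txt as above):

import re

with open('somefile.txt') as f:
    text = f.read()

# (?m) makes ^ match at the start of every line, so this splits on any
# line consisting solely of one or more '~' characters
results = re.split(r'(?m)^~+\n', text)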
Note that the earlier f.read().split(separator) snippet will return all of the text between your separator lines, including any newlines they include. There are other snippets of code which can filter those out. For example, given our previous results:
# ... refine results by filtering out newlines (replacing them with spaces)
results = [' '.join(each.split('\n')) for each in results]
(You could also use the .replace() string method; but I prefer the join/split combination). In this case we're using a list comprehension (a feature of Python) to iterate over each item in our results (which we're arbitrarily naming each), performing our transformation on it; the resulting list is bound back to the name results. I highly recommend learning and getting comfortable with list comprehensions if you're going to learn Python. They're commonly used, and can be a bit exotic compared to the syntax of many other programming and scripting languages.
This should work on MS Windows as well as Unix (and Unix-like) systems because of how Python handles "universal newlines." To use these examples under Python 3 you might have to work a little on the encodings and string types. (I didn't need to for my Python3.6 installed under MacOS X using Homebrew ... but just be forewarned).

Check for iambic pentameter?

I am kind of stuck on a question that I have to do regarding iambic pentameters, but because it is long, I'll try to simplify it.
So I need to get some words and their stress patterns from a text file that look somewhat like this:
if, 0
music,10
be,1
the,0
food,1
of,0
love,1
play,0
on,1
hello,01
world,1
And from the file, you can assume there will be many more words for different sentences. I am trying to get sentences from a text file that contains multiple sentences, and to see whether each sentence (ignoring punctuation and case) is an iambic pentameter.
For example if the text file contains this:
If music be the food of love play on
hello world
The first sentence will be assigned from the stress dictionary like this: 0101010101, and the second is obviously not a pentameter (011). I would like it so that it only prints sentences which are iambic pentameters.
Sorry if this is a convoluted or messy question.
This is what I have so far:
import string

dict = {}
sentence = open('sentences.txt')
stress = open('stress.txt')

for some in stress:
    word, number = some.split(',')
    dict[word] = number

for line in sentence:
    one = line.split()
I don't think you are building your dictionary of stresses correctly. It's crucial to remember to get rid of the implicit \n character from lines as you read them in, as well as strip any whitespace from words after you've split on the comma. As things stand, the line if, 0 will be split to ['if', ' 0\n'] which isn't what you want.
So to create your dictionary of stresses you could do something like this:
stress_dict = {}
with open('stress.txt', 'r') as f:
    for line in f:
        word_stress = line.strip().split(',')
        word = word_stress[0].strip().lower()
        stress = word_stress[1].strip()
        stress_dict[word] = stress
For the actual checking, the answer by @khelwood is a good way, but I'd take extra care to handle the \n character as you read in the lines and also make sure that all the characters in the line are lowercase (like in your dictionary).
Define a function is_iambic_pentameter to check whether a sentence is an iambic pentameter (returning True/False) and then check each line in sentences.txt:
def is_iambic_pentameter(line):
    line_stresses = [stress_dict[word] for word in line.split()]
    line_stresses = ''.join(line_stresses)
    return line_stresses == '0101010101'

with open('sentences.txt', 'r') as f:
    for line in f:
        line = line.rstrip()
        line = line.lower()
        if is_iambic_pentameter(line):
            print line
As an aside, you might be interested in NLTK, a natural language processing library for Python. Some Internet searching finds that people have written Haiku generators and other scripts for evaluating poetic forms using the library.
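For instance, NLTK's copy of the CMU Pronouncing Dictionary can generate stress patterns, which could stand in for a hand-written stress.txt. A sketch (assumes nltk is installed and the cmudict corpus has been downloaded; note cmudict also uses 2 for secondary stress, which you'd need to map before comparing against '0101010101'):

from nltk.corpus import cmudict  # requires: pip install nltk; nltk.download('cmudict')

pronunciations = cmudict.dict()  # word -> list of pronunciations

def stress_pattern(word):
    """Stress digits (0/1/2) of the first listed pronunciation."""
    phones = pronunciations[word.lower()][0]
    return ''.join(ch for phone in phones for ch in phone if ch.isdigit())

print(stress_pattern('music'))  # '10', matching the music,10 line above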
I wouldn't have thought iambic pentameter was that clear cut: always some words end up getting stressed or unstressed in order to fit the rhythm. But anyway. Something like this:
for line in sentences:
    words = line.split()
    stresspattern = ''.join([dict[word] for word in words])
    if stresspattern == '0101010101':
        print line
By the way, it's generally a bad idea to be calling your dictionary 'dict', since you're hiding the dict type.
Here's what the complete code could look like:
#!/usr/bin/env python3
def is_iambic_pentameter(words, word_stress_pattern):
    """Whether words are a line of iambic pentameter.

    word_stress_pattern is a callable that given a word returns
    its stress pattern
    """
    return ''.join(map(word_stress_pattern, words)) == '01'*5

# create 'word -> stress pattern' mapping, to implement word_stress_pattern(word)
with open('stress.txt') as stress_file:
    word_stress_pattern = dict(map(str.strip, line.split(','))
                               for line in stress_file).__getitem__

# print lines that use iambic pentameter
with open('sentences.txt') as file:
    for line in file:
        if is_iambic_pentameter(line.casefold().split(), word_stress_pattern):
            print(line, end='')

Remove linebreak at specific position in textfile

I have a large textfile, which has linebreaks at column 80 due to console width. Many of the lines in the textfile are not 80 characters long, and are not affected by the linebreak. In pseudocode, this is what I want:
Iterate through lines in file
If line matches this regex pattern: ^(.{80})\n(.+)
Replace this line with a new string consisting of match.group(1) and match.group(2). Just remove the linebreak from this line.
If line doesn't match the regex, skip!
Maybe I don't need regex to do this?
f = open("file")
for line in f:
    if len(line) == 81:
        n = f.next()
        line = line.rstrip() + n
    print line.rstrip()
f.close()
Here's some code which should do the trick:
def remove_linebreaks(textfile, position=81):
    """
    textfile : a file opened in 'r' mode
    position : the length of a line whose trailing \n must be removed
    return a string with those \n removed
    """
    fixed_lines = []
    for line in textfile:
        if len(line) == position:
            line = line[:-1]  # drop the trailing '\n'
        fixed_lines.append(line)
    return ''.join(fixed_lines)
Note that compared to your pseudo code, this will merge any number of consecutive folded lines.
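Usage could look like this (the file name is hypothetical):

with open("folded.txt") as textfile:
    print(remove_linebreaks(textfile))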
Consider this.
def merge_lines(line_iter):
    buffer = ''
    for line in line_iter:
        if len(line) <= 80:
            yield buffer + line
            buffer = ''
        else:
            buffer += line[:-1]  # remove '\n'

with open('myFile', 'r') as source:
    with open('copy of myFile', 'w') as destination:
        for line in merge_lines(source):
            destination.write(line)
I find that an explicit generator function makes it much easier to test and debug the essential logic of the script without having to create mock filesystems or do lots of fancy setup and teardown for testing.
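For instance, you can exercise merge_lines with a plain list of strings, no files involved (the test lines are made up):

lines = ['x' * 80 + '\n', 'tail\n', 'short\n']
print(list(merge_lines(lines)))
# ['xxx...xxxtail\n', 'short\n'] -- the folded line is merged with its continuation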
Here is an example of how to use regular expressions to achieve this. Regular expressions aren't the best solution everywhere, and in this case I think not using them is more efficient. Anyway, here is the solution (note the re.M flag, so that ^ matches at the start of every line rather than only at the start of the string):
text = re.sub(r'(?<=^.{80})\n', '', text, flags=re.M)
You can also use your original regular expression when you call re.sub with a callable:
text = re.sub(r'^(.{80})\n(.+)', lambda m: m.group(1)+m.group(2), text, flags=re.M)
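A quick check of the idea (the long line is made-up test data):

import re

text = 'a' * 80 + '\n' + 'continuation\n' + 'short line\n'
fixed = re.sub(r'(?<=^.{80})\n', '', text, flags=re.M)
print(fixed)
# the 80-character line is joined with 'continuation'; 'short line' is untouched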
