How to populate a dictionary from a file while disregarding the header? - python

I have this function that populates a dictionary from another file. I'm having difficulty skipping the header of the txt file, whose lines begin with ;;;. The keys of the dictionary are supposed to be the words, which are in uppercase, and what follows each word are its phonemes. I'm not sure which part of my code is wrong :S
def read_file(file):
    """ (file open for reading) -> pronunciation dictionary
    Read file, which is in the format of the CMU Pronouncing
    Dictionary, and return the pronunciation dictionary.
    """
    line = file.readline()
    while line != ';;;':
        line = file.readline()
    pronun_dict = {}
    line = file.readline()
    while line != '':
        word = line.isupper()
        phoneme = line.strip()
        if phoneme not in pronun_dict:
            pronun_dict[phoneme] = [word]
        line = file.readline()
    return pronun_dict
http://gyazo.com/31b414c39cc907bc917f7a1129f4019d
the above link is a screenshot of what the text file looks like!

The condition in while line != ';;;': is true for every line that is not exactly ';;;'. I assume the header lines contain much more than that, and readline() also keeps the trailing newline, so in practice the condition never becomes false and the loop reads past the whole file. Try, instead, while line.startswith(';;;'):. As long as this condition is met, the next line in the file is assigned to the variable line, so the loop keeps going until it reaches a line that doesn't begin with ;;;.
http://www.tutorialspoint.com/python/string_startswith.htm
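Putting that together, here is a minimal sketch of how the whole function could look. It assumes each data line is a word followed by its phonemes separated by whitespace (I can only guess this from the description, since the file is only shown as a screenshot), and the sample text fed to it is made up for illustration.

```python
import io

def read_file(file):
    """Return {word: [phonemes]} from a file in CMU-dictionary-like format."""
    pronun_dict = {}
    for line in file:
        line = line.strip()
        # Skip blank lines and header lines, which begin with ';;;'
        if not line or line.startswith(';;;'):
            continue
        # Assumption: first token is the word, the rest are its phonemes
        word, *phonemes = line.split()
        pronun_dict[word] = phonemes
    return pronun_dict

# Made-up sample standing in for the real file
sample = io.StringIO(";;; some header\n;;; more header\nBOOK  B UH1 K\nCAT  K AE1 T\n")
result = read_file(sample)
print(result)
```

Filtering inside a single for loop avoids mixing two separate readline() loops, so there is no separate header-skipping step to get wrong.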

Related

how can i read each word of a line from a file separately?

The problem I came across is that I need to read each word of a line one by one and repeat that for the whole file. The words are separated from each other by the # sign, e.g.
2016/2017#Southeast_Kootenay#Mount_Baker_Secondary#STANDARD#COURSE_MARKS#99.0#71.0#88.0#49.0
After that I need to assign each value to the appropriate element of a class, for example:
school_years would be 2016/2017, district_name would be Southeast_Kootenay, etc.
The thing is that I have no clue how to do it. I managed to extract the first word from the file but couldn't do it for the whole line, let alone the whole file. This is the code I used:
def word_return():
    for lines in file:
        for word in lines.split('#'):
            return word
any kind of help would be appreciated
You're returning a single word. Remove the last for loop and return the entire list, like this, if you want to get only the first line:
(Assuming file is a list of lines resulting from file = open("file.txt", "r").readlines())
def word_return():
    for line in open("yourFile.txt", "r").readlines():
        # strip the trailing newline before splitting on '#'
        return line.strip().split('#')
If you want to return a list that will contain a list for each line, check the following:
def word_return():
    allLines = []
    for line in open("yourFile.txt", "r").readlines():
        # strip the trailing newline before splitting on '#'
        allLines.append(line.strip().split('#'))
    return allLines
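For the second half of the question (assigning each value to an element of a class), something along these lines might work. This is only a sketch: school_years and district_name come from the question, while SchoolRecord, school_name, remaining and read_records are names I made up, and the in-memory sample stands in for the real file.

```python
import io
from dataclasses import dataclass, field
from typing import List

@dataclass
class SchoolRecord:
    school_years: str
    district_name: str
    school_name: str  # hypothetical field name
    remaining: List[str] = field(default_factory=list)  # fields not named in the question

def read_records(handle):
    """Build one SchoolRecord per '#'-separated line."""
    records = []
    for line in handle:
        fields = line.strip().split("#")
        if len(fields) < 3:
            continue  # skip blank or incomplete lines
        records.append(SchoolRecord(fields[0], fields[1], fields[2], fields[3:]))
    return records

sample = io.StringIO(
    "2016/2017#Southeast_Kootenay#Mount_Baker_Secondary#STANDARD#COURSE_MARKS#99.0#71.0#88.0#49.0\n"
)
records = read_records(sample)
print(records[0].school_years, records[0].district_name)
```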

Random substitution

I have a txt file and a dictionary, where the keys are adjectives and the values are their synonyms. I need to replace the common adjectives from the dictionary that occur in a given txt file with their synonyms, chosen randomly, and save both versions (with changed and unchanged adjectives) line by line in a new file (task3_edited_text). My code:
#get an English text as an additional input
filename_eng = sys.argv[2]
infile_eng = open(filename_eng, "r")
task3_edited_text = open("task3_edited_text.txt", "w")
#necessary for random choice
import random
#look for adjectives in English text
#line by line
for line in infile_eng:
    task3_edited_text.write(line)
    line_list = line.split()
    #for each word in line
    for word in line_list:
        #if we find common adjectives, change them into synonym, randomly
        if word in dict.keys(dictionary):
            word.replace(word, str(random.choice(list(dictionary.values()))))
        else:
            pass
    task3_edited_text.write(line)
The problem is that in the output the adjectives are not replaced by their synonyms.
line_list = line.split()
...
task3_edited_text.write(line)
The issue is that you try to modify line_list, which you created from line. However, line_list is simply a list made from copying values generated from line ; modifying it doesn't change line in the slightest. So writing line to the file writes the unmodified line to the file, and doesn't take your changes into account.
You probably want to generate a line_to_write from line_list and write that to the file instead, like so:
line_to_write = " ".join(line_list) + "\n"  # split() dropped the newline
task3_edited_text.write(line_to_write)
Also, line_list isn't even modified in your code: the loop variable word just refers to each element in turn, and since strings are immutable, nothing you do with word changes the list. Moreover, replace returns a new string and doesn't modify the string you call it on. You probably want to modify line_list via the index of the elements, like so:
for idx, word in enumerate(line_list):
    #if we find common adjectives, change them into synonym, randomly
    if word in dict.keys(dictionary):
        line_list[idx] = word.replace(word, str(random.choice(list(dictionary.values()))))
    else:
        pass
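As a rough, self-contained sketch of the whole substitution step: this assumes each dictionary value is a list of synonyms for its key (the question doesn't show the dictionary, so that is a guess), in which case the random choice should be made from dictionary[word] rather than from all values. The function name, the rng parameter and the sample data are made up for illustration.

```python
import random

def substitute_adjectives(line, synonyms, rng=random):
    """Return line with every known adjective replaced by a random synonym."""
    words = line.split()
    for idx, word in enumerate(words):
        if word in synonyms:
            # Assumption: synonyms[word] is a list of candidate replacements
            words[idx] = rng.choice(synonyms[word])
    return " ".join(words)

# Hypothetical data for illustration
synonyms = {"big": ["large", "huge"], "nice": ["pleasant"]}
original = "a big dog and a nice cat"
edited = substitute_adjectives(original, synonyms)
print(original)
print(edited)
```

Note that line.split() splits on whitespace only, so a word with punctuation attached (such as "big,") would not match a dictionary key.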

Splitting a line from a file into different lists

My program is supposed to take input from the user and read the file with that name. The file that is read gets saved into a dictionary called portfolio, and from there all I have to do is sort each line in the portfolio into keys and values.
Here's my code.
portfolio = {}
portfolio = file_read() #Reads the file through a function
if file_empty(portfolio) == True or None: #nevermind this, it works
    print "The file was not found."
else:
    print "The file has successfully been loaded"
    for line in portfolio:
        elements = line.strip().split(",") #separate lists by comma
        print elements[0] #using this to check
        print elements[1] #if it works at all
All this does is print the first letter of the first line, which is S. Apparently elements[1] is supposed to be the second letter, but the index is out of range. Please enlighten me as to what might be wrong.
Thank you.
It looks like file_read() is reading the file into a string.
Then for line in portfolio: is iterating through each character in that string.
Then elements = line.strip().split(",") will give you a list containing one character, so trying to get elements[1] is past the bounds of the list.
If you want to read the whole contents of the file into a string called portfolio, you can iterate through each line in the string using
for line in portfolio.split('\n'):
    ...
But the more usual way of iterating through lines in a file would be
with open(filename,'r') as inputfile:
    for line in inputfile:
        ....
Got it to work with this code:
for line in minfil :
    line = line.strip()
    elements = line.split(",")
    portfolio[str(elements[0])] = [(int(elements[1]),float(elements[2]), str(elements[3]))]
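For anyone landing here later, a self-contained version of that loop might look like the sketch below. The column meanings (name, quantity, price, currency) and the sample rows are my own guesses for illustration; the thread only shows that the second field is an int, the third a float and the fourth a string.

```python
import io

def load_portfolio(handle):
    """Parse 'name,int,float,str' lines into a dictionary keyed by name."""
    portfolio = {}
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Hypothetical column names; the file layout is an assumption
        name, quantity, price, currency = line.split(",")
        portfolio[name] = (int(quantity), float(price), currency)
    return portfolio

# Made-up sample standing in for the real file
sample = io.StringIO("STOCK_A,10,12.5,USD\nSTOCK_B,3,7.25,EUR\n")
portfolio = load_portfolio(sample)
print(portfolio["STOCK_A"])
```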

How to use Python to find a line number in a document and insert data underneath it

Hi I already have the search function sorted out:
def searchconfig():
    config1 = open("config.php", "r")
    b='//cats'
    for num, line in enumerate(config1,0):
        if b in line:
            connum = num + 1
            return connum
    config1.close()
This will return the line number of //cats. I then need to take the data underneath it, put it in a temporary document, append new data under the //cats line, and then append the data in the temporary document back to the original. How would I do this? I know that I would have to use 'a' instead of 'r' when opening the document, but I do not know how to utilise the line number.
I think, the easiest way would be to read the whole file into a list of strings, work on that list and write it back afterwards.
# Read all lines of the file into a list of strings (without line endings)
with open("config.php", "r") as file:
    lines = file.read().splitlines()
# This gets the index of the first line containing '//cats'
# Note that it raises a StopIteration exception if no such line exists...
linenum = next(num for (num, line) in enumerate(lines) if '//cats' in line)
# Insert a line after the line containing '//cats'
lines.insert(linenum+1, 'This is a new line...')
# Alternatively, you could replace the line following '//cats' instead:
# lines[linenum+1] = 'New line content...'
# Write back the file (in fact this creates a new file with new content)
# Note that you need to append the line delimiter '\n' to every line explicitly
with open("config.php", "w") as file:
    file.writelines(line + '\n' for line in lines)
Using "a" as the mode for open would only let you append at the end of the file.
You could use "r+" for a combined read/write mode, but then you could only overwrite parts of the file; there is no simple way to insert new lines in the middle of the file using this mode.
You could do it like this. I am creating a new file in this example as it is usually safer.
with open('my_file.php') as my_php_file:
    # line still ends with '\n', so NEWCONTENT lands on the following line
    add_new_content = ['%sNEWCONTENT' % line if '//cat' in line
                       else line.strip('\n')
                       for line in my_php_file.readlines()]
with open('my_new_file.php', 'w+') as my_new_php_file:
    for line in add_new_content:
        my_new_php_file.write(line + '\n')
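The read, modify and write-back idea can also be wrapped in a small function. This is just a sketch under my own assumptions: insert_after_marker is a made-up name, and the demo runs against a throwaway temporary file rather than a real config.php.

```python
import os
import tempfile

def insert_after_marker(path, marker, new_line):
    """Insert new_line after the first line containing marker.

    Returns True if the marker was found, False otherwise.
    """
    with open(path, "r") as f:
        lines = f.read().splitlines()
    for idx, line in enumerate(lines):
        if marker in line:
            lines.insert(idx + 1, new_line)
            break
    else:
        return False  # marker not found, leave the file untouched
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return True

# Demo on a temporary file with made-up contents
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.php")
    with open(path, "w") as f:
        f.write("<?php\n//cats\n$old = 1;\n")
    found = insert_after_marker(path, "//cats", "$new = 2;")
    with open(path) as f:
        result = f.read()
print(result)
```

Because the whole file is held in a list, no temporary document is needed for the lines that follow the marker.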

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line (aka the header) of this FASTA file, and now I want this second program to start reading and printing from the sequence onward.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA) # skip heading record (works in Python 2.6+ and 3)
    for line in FASTA:
        line = line.strip()
        print(line)
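Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file: (in Python 2 the file iterator uses a hidden read-ahead buffer, so the two do not combine well).
You should show your script. To read from the second line, try something like this: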
f=open("file")
f.readline()
for line in f:
    print(line)
f.close()
You might be interested in checking Biopython's handling of FASTA files (from its source):
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).
    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.
    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
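Since the docstring in the question says the function "returns the DNA sequence", here is a sketch of a variant that skips header lines the same way but returns the sequence as one string instead of printing it. The name read_seq and the short sample record are made up for illustration, and it assumes all non-header lines belong to one sequence.

```python
import io

def read_seq(handle):
    """Return the concatenated sequence lines of a FASTA file, skipping headers."""
    parts = []
    for line in handle:
        if line.startswith(">"):
            continue  # header line, not part of the sequence
        parts.append(line.strip())
    return "".join(parts)

# Made-up miniature FASTA record
fasta = io.StringIO(">gi|example| hypothetical header\nGACGGCTT\nTTAGATTT\nTCGCTTGC\n")
sequence = read_seq(fasta)
print(sequence)
```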
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end".
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
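A quick sketch of those slice forms on a small list (the list contents are made up purely for illustration):

```python
# Hypothetical stand-in for the result of f.readlines()
lines = ["header\n", "first\n", "second\n", "third\n"]

body = lines[1:]             # everything from index 1 to the end
all_but_last = lines[:-1]    # everything except the last element
every_other = lines[::2]     # step of 2, starting from index 0
reversed_lines = lines[::-1] # negative step walks backward

print(body)
```

Each slice builds a new list, so the original lines list is left unchanged.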
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
