I'm writing a function that takes in a fasta file that may have multiple sequences and returns a dictionary with the accession number as the key and the full length header as the value. When I run the function, I get an empty dictionary, but when I run the same code outside of a function, I get the desired dictionary. I assume this means that somewhere in the function my dictionary is being overwritten with nothing, because if it had to do with my regex pattern the code wouldn't work outside the function either, but I can't seem to find where the dictionary might be getting reset. I tried doing some searching for a solution, but the only questions I found with the same problem had obvious problems (no return value or one of the input variables was overwritten).
Here is my function code and output (assume re is imported):
def getheader(file):
    '''
    compiles headers from a fasta file into a dictionary

    Parameters
    ----------
    file : file
        file that contains the sequences with headers and has not been read

    Returns
    -------
    dictionary with acc no as key and full header as value
    '''
    head_dict = {}
    for line in file:
        if line.startswith(">"):
            acc = re.search(r">([^\s]+)", line).group(1)
            header = line.rstrip()
            head_dict[acc] = header
    return head_dict
file = open("seq1.txt", "r")
head_dict = getheader(file)
print(head_dict)
output
{}
Here is the input/output when I run it outside of a function:
import re
file = open("seq1.txt", "r")
head_dict = {}
for line in file:
    if line.startswith(">"):
        key = re.search(r">([^\s]+)", line).group(1)
        value = line.rstrip()
        head_dict[key] = value
print(head_dict)
output
{'AF12345': '>AF12345 test sequence 1'}
where seq1.txt is the following without the quotes, and the txt file has an actual line break instead of "\n" (I just wasn't sure how to format it correctly here).
">AF12345 test sequence 1\nCGATATTCCCATGCGGTTTATTTATGCAAAACTGTGACGTTCGCTTGA"
The above is where my code is now. Prior to this I had a function that created two dictionaries and returned a specific one based on an input argument. One dictionary was the accession number and sequences, and the other was this dictionary I'm trying to create now. I did not have any problem getting the first dictionary returned. So I decided to split the function and have one for each dictionary despite it seeming repetitive. I've also tried moving the head_dict[key] = value to outside the conditional statement, and got the same problem. I've tried changing the variable names from key to acc and from value to header, but still got the same result (you can see in my example outside the function that the variables were originally key and value). I just tried making the empty dictionary an argument so that it is initialized outside of the function, but I still get an empty dictionary returned. I'm not sure what to try now. Thanks in advance!
Note: I'm sure this can be solved more efficiently with another library, but this is for a class and the instructor is very against using other libraries. We have to learn how to do it ourselves first.
Edit: Now I'm frustrated. I rewrote the code to add in creating the input file for people to use to run my code and help me with it, but it works. I'm sorry to have wasted everyone's time. Here is what I rewrote. If you notice a difference between this and what I posted above, please let me know.
import re
def getheader(file):
    '''
    compiles headers from a fasta file into a dictionary

    Parameters
    ----------
    file : file
        file that contains the sequences with headers and has not been read

    Returns
    -------
    dictionary with acc no as key and full header as value
    '''
    head_dict = {}
    for line in file:
        if line.startswith(">"):
            acc = re.search(r">([^\s]+)", line).group(1)
            header = line.rstrip()
            head_dict[acc] = header
    return head_dict
file = open("bookishbubs_test1.txt", "w")
file.write(">AF12345 test sequence 1\nCGATATTCCCATGCGGTTTATTTATGCAAAACTGTGACGTTCGCTTGA")
file.close()
file = open("bookishbubs_test1.txt", "r")
headers = getheader(file)
print(headers)
file.close()
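One thing I've since wondered about (just a guess on my part): a file object can only be iterated over once, so if anything read the file before the function call, the loop inside the function would see no lines and return an empty dictionary. A quick demonstration of that effect:

file = open("bookishbubs_test1.txt", "r")
file.read()               # anything that consumes the file first...
print(getheader(file))    # ...makes the loop see no lines: prints {}
file.seek(0)              # rewinding the file fixes it
print(getheader(file))    # prints {'AF12345': '>AF12345 test sequence 1'}
file.close()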
So I have two files: one YAML file that contains Tibetan words and their meanings, and one CSV file that contains only a word and its POS tag, as below:
yaml file:
ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།
ད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།
ད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།
ད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།
csv file:
ད་ཆུ PART
ད་གདོད DET
Desired output:
ད་ཆུ PART དངུལ་ཆུ་ཡི་མིང་གཞན།
ད་གདོད DET ད་གཟོད་དང་དོན་འདྲ།
Any idea how to match each word from the CSV file against the YAML file and add the corresponding meaning to the CSV line?
From a functional point of view, you have:
a dictionary, meaning here a key: value thing
a list of words to search in that dictionary, and that will produce a record
If everything can fit in memory, you can first read the yaml file to produce a Python dictionary, and then read the words file, one line at a time and use the above dictionary to generate the expected line. If the yaml file is too large, you could use the dbm (or shelve) module as an on disk dictionary.
As you have not shown any code, I cannot tailor an answer to it. I can just say that you can process the second file as plain text and read it one line at a time. For the first one, you can either look for a yaml module on PyPI, or, if the syntax is always as simple as the lines you have shown, process it as text one line at a time as well and use split to extract the key and the value.
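A minimal sketch of that split-based idea (assuming every YAML line really is word: meaning with no nesting; the file names dict.yaml and input.csv are placeholders):

# Sketch only: assumes each YAML line is "word: meaning" with no nesting.
trans_dict = {}
with open('dict.yaml', encoding='utf-8') as f:
    for line in f:
        line = line.strip()
        if line:
            word, meaning = line.split(':', 1)
            trans_dict[word.strip()] = meaning.strip()

with open('input.csv', encoding='utf-8') as f:
    for line in f:
        line = line.rstrip('\n')
        if line:
            word = line.split()[0]  # first whitespace-separated token is the word
            print(line, trans_dict.get(word, ''))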
Assuming your files are called dict.yaml and input.csv.
You can start by turning the yaml file into a dictionary with
import yaml
with open('dict.yaml', 'r') as file:
    trans_dict = yaml.safe_load(file)
Which should give you
>>> trans_dict
{'ད་གདོད': 'ད་གཟོད་དང་དོན་འདྲ།',
'ད་ཆུ': 'དངུལ་ཆུ་ཡི་མིང་གཞན།',
'ད་ཕྲུག': 'དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན།',
'ད་བེར': 'སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།',
'ད་མེ་དུམ་མེ': 'དམ་དུམ་ལ་ལྟོས།'}
Then, you can iterate over the lines in the CSV and use the dictionary to get the definition:
outputs = []
with open('input.csv', 'r') as file:
    for line in file:
        term = line.strip()
        definition = trans_dict.get(term)
        outputs.append(
            term if definition is None
            else f"{term} {definition}"
        )
From here, your outputs variable should contain ['ད་ཆུ དངུལ་ཆུ་ཡི་མིང་གཞན།', 'ད་གདོད ད་གཟོད་དང་དོན་འདྲ།']. If you optionally wanted to write this out to a file, you could do
with open('output.txt', 'w') as file:
    file.write('\n'.join(outputs))
If you had more tokens on each line of the CSV (unclear from your post), you could iterate over those tokens within a line, but you'd be able to apply basically the same approach.
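For example, if each line is word POS-tag, you could use just the first whitespace-separated token as the lookup key and keep the rest of the line as-is (a sketch under that assumption):

outputs = []
with open('input.csv', 'r') as file:
    for line in file:
        line = line.rstrip('\n')
        if line:
            term = line.split()[0]  # first token is the word itself
            definition = trans_dict.get(term)
            outputs.append(line if definition is None else f"{line} {definition}")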
The easiest solution that came to my mind would be iterating over all lines in the YAML-file and checking if the word is inside the CSV-file:
YAML_LINES = "ད་གདོད: ད་གཟོད་དང་དོན་འདྲ།\nད་ཆུ: དངུལ་ཆུ་ཡི་མིང་གཞན\nད་ཕྲུག: དྭ་ཕྲུག་གི་འབྲི་ཚུལ་གཞན\nད་བེར: སྒྲིབ་བྱེད་དང་རླུང་འགོག་བྱེད་ཀྱི་གླེགས་བུ་ལེབ་མོའི་མིང་།\nད་མེ་དུམ་མེ: དམ་དུམ་ལ་ལྟོས།".split("\n")
CSV_LINES = "ད་ཆུ\nད་གདོད".split("\n")
for line in YAML_LINES:
    word, meaning = line.split(": ")
    if word in CSV_LINES:
        output = word + " " + meaning
        print(output)
The YAML_LINES and CSV_LINES lists are only to provide a quick and dirty example.
I've built a Python script to randomly create sentences using data from the Princeton English WordNet, following diagrams provided by Gödel, Escher, Bach. Calling python GEB.py produces a list of nonsensical sentences in English, such as:
resurgent inaesthetic cost.
the bryophytic fingernail.
aversive fortieth peach.
the asterismal hide.
the flour who translate gown which take_a_dare a punch through applewood whom the renewed request enfeoff.
an lobeliaceous freighter beside tuna.
And saves them to gibberish.txt. This script works fine.
Another script (translator.py) takes gibberish.txt and, through the py-googletrans Python module, tries to translate those random sentences into Portuguese:
from googletrans import Translator
import json

tradutor = Translator()

with open('data.json') as dataFile:
    data = json.load(dataFile)

def buscaLocal(keyword):
    if keyword in data:
        print(keyword + data[keyword])
    else:
        buscaAPI(keyword)

def buscaAPI(keyword):
    result = tradutor.translate(keyword, dest="pt")
    data.update({keyword: result.text})
    with open('data.json', 'w') as fp:
        json.dump(data, fp)
    print(keyword + result.text)

keyword = open('/home/user/gibberish.txt', 'r').readline()
buscaLocal(keyword)
Currently the second script outputs only the translation of the first sentence in gibberish.txt. Something like:
resurgent inaesthetic cost.
aumento de custos inestético.
I have tried to use readlines() instead of readline(), but I get the following error:
Traceback (most recent call last):
File "main.py", line 28, in <module>
buscaLocal(keyword)
File "main.py", line 11, in buscaLocal
if keyword in data:
TypeError: unhashable type: 'list'
I've read similar questions about this error here, but it is not clear to me what I should use in order to read the whole list of sentences contained in gibberish.txt (new sentences begin on a new line).
How can I read the whole list of sentences contained in gibberish.txt? How should I adapt the code in translator.py to achieve that? I am sorry if the question is a bit confusing; I can edit it if necessary. I am a Python newbie and I would appreciate it if someone could help me out.
Let's start with what you're doing to the file object. You open a file, get a single line from it, and then don't close it. A better way to do it would be to process the entire file and then close it. This is generally done with a with block, which will close the file even if an error occurs:
with open('gibberish.txt') as f:
    # do stuff to f
Aside from the material benefits, this will make the interface clearer, since f is no longer a throwaway object. You have three easy options for processing the entire file:
Use readline in a loop since it will only read one line at a time. You will have to strip off the newline characters manually and terminate the loop when '' appears:
while True:
    line = f.readline()
    if not line:
        break
    keyword = line.rstrip()
    buscaLocal(keyword)
This loop can take many forms, one of which is shown here.
Use readlines to read in all the lines in the file at once into a list of strings:
for line in f.readlines():
    keyword = line.rstrip()
    buscaLocal(keyword)
This is much cleaner than the previous option, since you don't need to check for loop termination manually, but it has the disadvantage of loading the entire file all at once, which the readline loop does not.
This brings us to the third option.
Python files are iterable objects. You can have the cleanliness of the readlines approach with the memory savings of readline:
for line in f:
    buscaLocal(line.rstrip())
This approach can be simulated using readline with the more arcane two-argument form of iter, which calls f.readline until it returns the sentinel '':
for line in iter(f.readline, ''):
    buscaLocal(line.rstrip())
As a side point, I would make some modifications to your functions:
def buscaLocal(keyword):
    if keyword not in data:
        buscaAPI(keyword)
    print(keyword + data[keyword])

def buscaAPI(keyword):
    # Make your function do one thing. In this case, do a lookup.
    # Printing is not the task for this function.
    result = tradutor.translate(keyword, dest="pt")
    # No need to do a complicated update with a whole new
    # dict object when you can do a simple assignment.
    data[keyword] = result.text

...

# Avoid rewriting the file every time you get a new word.
# Do it once at the very end.
with open('data.json', 'w') as fp:
    json.dump(data, fp)
If you are using the readline() function, remember that it only returns a single line, so you have to use a loop to go through all of the lines in the text file. readlines(), on the other hand, reads the full file at once but returns the lines in a list. The list type is unhashable and cannot be used as a key in a dict, which is why the if keyword in data: line raises this error: keyword there is a list of all of the lines. A simple for loop solves the problem.
text_lines = open('/home/user/gibberish.txt', 'r').readlines()
for line in text_lines:
    buscaLocal(line.rstrip())  # strip the trailing newline before the lookup
This loop will iterate through all of the lines in the list, and there will be no error accessing the dict, since each key will be a string.
I have a log file that is updated every few milliseconds; however, the information is currently saved with four (4) different delimiters. The log files contain millions of lines, so performing the operation in Excel is not an option.
What I have left to work on resembles:
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
I would like these set to:
Sequence,Status,Report,Header,Profile
3433,true,223313,,xxxx
0323,true,43838,The,xxxx
5323,true,6541998,,xxxx
Meaning that I would need to create a header from all of the portions with the equals ("=") symbol following them. All of the other operations on the file are taken care of, and this will be used to perform a comparative check between files and replace or append fields. As I am new to Python, I only need assistance with this portion of the program I am writing.
Thank you all in advance!
You can try this.
First of all, I used the csv library to reduce the work of placing commas and quotes.
import csv
Then I made a function that takes a single line from your log file and outputs a dictionary with the fields passed in the header. If the current line doesn't have a particular field from the header, it stays filled with an empty string.
def convert_to_dict(line, header):
    d = {}
    for cell in header:
        d[cell] = ''
    row = line.strip().split(';')
    for cell in row:
        if cell:
            key, value = cell.split('=')
            d[key] = value
    return d
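For example, run against the first log line with the full header, it fills the missing Header field with an empty string (a quick check of my own):

header = ['Header', 'Profile', 'Report', 'Sequence', 'Status']
line = 'Sequence=3433;Status=true;Report=223313;Profile=xxxx;\n'
print(convert_to_dict(line, header))
# {'Header': '', 'Profile': 'xxxx', 'Report': '223313', 'Sequence': '3433', 'Status': 'true'}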
Since the header and the number of fields can vary between your files, I made a function extracting them. For this, I employed a set, a collection of unique elements, but also unordered. So I converted to a list and used the sorted function. Don't forget that seek(0) call, to rewind the file!
def extract_fields(logfile):
    fields = set()
    for line in logfile:
        row = line.strip().split(';')
        for cell in row:
            if cell:
                key, value = cell.split('=')
                fields.add(key)
    logfile.seek(0)
    return sorted(list(fields))
Lastly, I made the main piece of code, which opens both the log file for reading and the csv file for writing. Then it extracts and writes the header, and writes each converted line.
if __name__ == '__main__':
    with open('report.log', 'r') as logfile:
        with open('report.csv', 'wb') as csvfile:
            csvwriter = csv.writer(csvfile)
            header = extract_fields(logfile)
            csvwriter.writerow(header)
            for line in logfile:
                d = convert_to_dict(line, header)
                csvwriter.writerow([d[cell] for cell in header])
These are the files I used as an example:
report.log
Sequence=3433;Status=true;Report=223313;Profile=xxxx;
Sequence=0323;Status=true;Header=The;Report=43838;Profile=xxxx;
Sequence=5323;Status=true;Report=6541998;Profile=xxxx;
report.csv
Header,Profile,Report,Sequence,Status
,xxxx,223313,3433,true
The,xxxx,43838,0323,true
,xxxx,6541998,5323,true
I hope it helps! :D
EDIT: I added support for different headers.
I have a problem in Python. I want to create a function that prints a file from the user to a new file (example.txt).
The old file is like this:
{'a':1,'b':2...}
and I want the new file like:
a 1,b 2(the next line)
The function I made runs, but it doesn't show anything in the new file. Can someone help me, please?
def printing(file):
    infile=open(file,'r')
    outfile=open('example.txt','w')
    dict={}
    file=dict.values()
    for key,values in file:
        print key
        print values
    outfile.write(str(dict))
    infile.close()
    outfile.close()
This creates a new empty dictionary:
dict={}
dict is not a good name for a variable as it shadows the built-in type dict and could be confusing.
This makes the name file point at the values in the dictionary:
file=dict.values()
file will be empty because dict was empty.
This iterates over pairs of values in file.
for key,values in file:
As file is empty nothing will happen. However if file weren't empty, the values in it would have to be pairs of values to unpack them into key, values.
This converts dict to a string and writes it to the outfile:
outfile.write(str(dict))
Note that write only accepts a str, so the explicit str(...) call is actually required here; unlike print, write will not convert the object for you. And since dict is empty, this line just writes the literal text {} to example.txt.
You don't actually do anything with infile.
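For reference, here is a corrected sketch of what I think the function intends, assuming the old file really contains a dict literal such as {'a': 1, 'b': 2} (ast.literal_eval safely parses such a literal; the output format is my guess from your example):

import ast

def printing(file):
    with open(file) as infile, open('example.txt', 'w') as outfile:
        # Parse the dict literal stored in the input file
        data = ast.literal_eval(infile.read())
        # Write the pairs in "a 1,b 2" form
        outfile.write(','.join('%s %s' % (k, v) for k, v in data.items()))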
You can use the re module (regular expressions) to achieve what you need. A solution could look like the one below; of course, you can customize it to fit your needs. Hope this helps.
import re

def printing(file):
    outfile = open('example.txt', 'a')
    with open(file, 'r') as f:
        for line in f:
            new_string = re.sub(r'[^a-zA-Z0-9\n\.]', ' ', line)
            outfile.write(new_string)
    outfile.close()  # close the output file once all lines are written

printing('output.txt')
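As a quick check of what that substitution produces (my own example), a line like {'a':1,'b':2} comes out with every non-alphanumeric character replaced by a space:

import re

print(re.sub(r'[^a-zA-Z0-9\n\.]', ' ', "{'a':1,'b':2}"))
# output: "  a  1  b  2 " (braces, quotes, colons and commas become spaces)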
I am a beginner in Python. I am now trying to figure out why the second 'for' loop doesn't work in the following script. I mean that I only get the result of the first 'for' loop, and nothing from the second one. I copied and pasted my script and the data CSV below.
It would be helpful if you could tell me why it behaves this way and how to make the second 'for' loop work as well.
My SCRIPT:
import csv

file = "data.csv"
fh = open(file, 'rb')
read = csv.DictReader(fh)

for e in read:
    print(e['a'])

for e in read:
    print(e['b'])
"data.csv":
a,b,c
tree,bough,trunk
animal,leg,trunk
fish,fin,body
The csv reader is an iterator over the file. Once you go through it once, you read to the end of the file, so there is no more to read. If you need to go through it again, you can seek to the beginning of the file:
fh.seek(0)
This will reset the file to the beginning so you can read it again. Depending on the code, it may also be necessary to skip the field name header:
next(fh)
This is necessary for your code, since the DictReader consumed that line the first time around to determine the field names, and it's not going to do that again. It may not be necessary for other uses of csv.
If the file isn't too big and you need to do several things with the data, you could also just read the whole thing into a list:
data = list(read)
Then you can do what you want with data.
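Applied to your script, the seek-and-skip approach could look like this (a sketch; the print calls assume Python 3, where the file should also be opened in text mode rather than 'rb'):

import csv

fh = open("data.csv", 'r')   # text mode rather than 'rb' on Python 3
read = csv.DictReader(fh)
for e in read:
    print(e['a'])

fh.seek(0)   # rewind the underlying file
next(fh)     # skip the header row; the DictReader already knows the field names
for e in read:
    print(e['b'])
fh.close()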
I have created a small function which takes the path of a CSV file, reads it, and returns a list of dicts all at once; then you can loop through the list very easily:
def read_csv_data(path):
    """
    Reads CSV from the given path and returns a list of dicts with the mapping
    """
    data = csv.reader(open(path))
    # Read the column names from the first line of the file
    fields = next(data)  # next(data) works on both Python 2 and 3, unlike data.next()
    data_lines = []
    for row in data:
        items = dict(zip(fields, row))
        data_lines.append(items)
    return data_lines
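A quick usage example (assuming the data.csv from the question):

rows = read_csv_data("data.csv")
for row in rows:
    print(row['a'], row['b'])

Note that csv.DictReader gives you essentially the same mapping behaviour out of the box; list(csv.DictReader(open(path))) is a one-line equivalent.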
Regards