I'm trying to read a .fasta file as a dictionary and extract the header and sequence separately. There are several headers and sequences in the file.
An example below:
header = CMP12
sequence = agcgtmmnngucnncttsckkld
But when I try to read a FASTA file using the function read_f and test it using print(dict.keys()), it fails.
def read_f(fasta):
    '''Read a file from a FASTA format'''
    dictionary = {}
    with open(fasta) as file:
        text = file.readlines()
    print(text)
    name = ''
    seq = ''
    #Create blocks of fasta text for each sequence, EXCEPT the last one
    for line in text:
        if line[0] == '>':
            dictionary[name] = seq
            name = line[1:].strip()
            seq = ''
        else:
            seq = seq + line.strip()
        yield name, seq
fasta = "sample.prot.fasta"
dict = read_f(fasta)
print(dict.keys())
This is the error I get:
'generator' object has no attribute 'keys'
Using the yield keyword means that when you call read_f, the function body is not executed. Instead, a generator is returned, and you have to iterate this generator to get the elements the function yields.
In concrete terms, since read_f yields (name, sequence) pairs, replacing dict = read_f(fasta) with dict = dict(read_f(fasta)) builds a dictionary from those pairs and should do the job. (Also consider a variable name other than dict, which shadows the built-in.)
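For example, a minimal sketch of consuming the generator (assuming sample.prot.fasta exists and read_f is defined as above):

fasta = "sample.prot.fasta"
records = dict(read_f(fasta))  # dict() iterates the generator; the last pair yielded for each name wins, which is the full sequence
print(records.keys())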
As Iguananaut already mentioned, Biopython helps you out here (it requires the biopython package to be installed).
See Biopython "sequence file to dictionary"
from Bio import SeqIO
fasta= "sample.prot.fasta"
seq_record_dict = SeqIO.to_dict(SeqIO.parse(fasta, "fasta"))
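The result is a plain dictionary keyed by record ID, so the keys() check from the question works as expected. A quick usage sketch (assuming the file contains a CMP12 record):

print(seq_record_dict.keys())
print(seq_record_dict["CMP12"].seq)  # the sequence of one record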
I have a script that appends lines from a text file to a list. I then use ''.join(mylist) to convert the list to a str so I can query a DynamoDB table for that str. This seems to work until I query the table, where I notice I am getting empty responses. After printing out each str, I notice they are being printed vertically, one character per line. How can I format the string properly so my calls to DynamoDB are successful?
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamo = boto3.resource('dynamodb')
table = dynamo.Table('mytable')
s3.Bucket('instances').download_file('MissingInstances.txt')

with open('MissingInstances.txt', 'r') as f:
    for line in f:
        missing_instances = []
        missing_instances.append(line)
        unscanned = ''.join(missing_instances)

for i in unscanned:
    print(i)
    response = table.query(KeyConditionExpression=Key('EC2').eq(i))
    items = response['Items']
    print(items)
Contents of MissingInstances.txt:
i-xxxxxx
i-yyyyyy
i-zzzzzz
etc etc
Output of print(i):
i
-
x
x
x
x
x
i
-
y
y
y
y
y
etc etc
Output of print(items):
[]
[]
[]
etc etc
Desired output:
i-xxxxxx
i-yyyyyy
etc etc
Your problem isn't actually with the print function, but with how you are iterating in your for loops. I've annotated your code below, added a tip to save you some time, and included some code to get you over this hurdle.
Here is your code, with annotations of what's happening:
#import libraries, prepare the data
import boto3
from boto3.dynamodb.conditions import Key, Attr

dynamo = boto3.resource('dynamodb')
table = dynamo.Table('mytable')
s3.Bucket('instances').download_file('MissingInstances.txt')

#Opens the text file that has the name of an instance and a newline character per line
with open('MissingInstances.txt', 'r') as f:
    #For each line in the text file
    for line in f:
        #(For each line) Create an empty list called missing_instances
        missing_instances = []
        #Append this line to the empty list
        missing_instances.append(line)
        #Join all the current values of the list into one string
        #(There is only one value because you have been overwriting the list every loop)
        unscanned = ''.join(missing_instances)
At this point in the code, you have looped through and overwritten missing_instances on every iteration, so you are left with only the last instance.
#This should print the whole list of missing_instances
>>>print(*missing_instances)
i-cccccc
#This should print the whole unscanned string
>>>print(unscanned)
i-cccccc
Next, you loop through unscanned:
#For each letter in the string unscanned
for i in unscanned:
    #Print the letter
    print(i)
    #Query using the letter (The rest of this won't work for obvious reasons)
    response = table.query(KeyConditionExpression=Key('EC2').eq(i))
    items = response['Items']
    print(items)
You don't need to join the list to convert it to a string. You said:
I have a script that appends to a list from a text file. I then use
''.join(mylist) to convert to type str so I can query a DynamoDB table
for the said str
For example:
If you have this list:
missing_instances = ['i-xxxxxx','i-yyyyyy','i-zzzzzz']
You can see it's datatype is list:
>>>print(type(missing_instances))
<class 'list'>
But if you are looking at an element of that list (eg. the first element), the element's data type is str:
>>>print(type(missing_instances[0]))
<class 'str'>
This code loops through the text file and queries each line to the database:
#import libraries, prepare the data
import boto3
from boto3.dynamodb.conditions import Key, Attr
dynamo = boto3.resource('dynamodb')
table = dynamo.Table('mytable')
s3.Bucket('instances').download_file('MissingInstances.txt')
#Open the text file
with open('MissingInstances.txt', 'r') as f:
#Create a new list
missing_instances = []
#Loop through lines in the text file
for line in f:
#Append each line to the missing_instances list, removing the newlines
missing_instances.append(line.rstrip())
#CHECKS
#Print the whole list of missing_instances, each element on a new line
print(*missing_instances, sep='\n')
#Print the data type of missing_instances
print(type(missing_instances))
#Print the data type of the first element of missing_instances
print(type(missing_instances[0]))
#Loop through the list missing_instances
#For each string element of missing_instances
for i in missing_instances:
#Print the element
print(i)
#Query the element
response = table.query(KeyConditionExpression=Key('EC2').eq(i))
#Save the response
items = response['Items']
#Print the response
print(items)
#For good measure, close the text file
f.close()
Try stripping off the newline characters before appending the lines to the list.
For example:
missing_instances.append(line.rstrip())
Print automatically introduces a new line on each call. It does not work like Java's System.out#print(String). For example, when I run this, I get this:
for c in 'adf':
    print(c)

a
d
f
This is because in Python, strings are iterable, character by character.
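If you do want print to behave like Java's System.out.print, you can suppress the trailing newline with the end parameter:

for c in 'adf':
    print(c, end='')  # prints adf on one line
print()  # finish the line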
I'm not sure what your code is in fact trying to do; I'm not familiar with this Boto3 library. But let's say the part i-xxxxxx is decomposed into i and xxxxxx, which I term the_id and other_stuff. Then:

for the_id in ids:
    print(f'{the_id}-{other_stuff}')
I am trying to copy values of data separated by ':' from text files. I have 50+ text files containing data in this form:
Type: Assume
Number: 123456
Name: Assume
Phone Number: 000-000
Email Address: any#gmail.com
Mailing Address: Assume
I am trying to get the data values in this format in a CSV from multiple text files:
Type Number Name Phone email Mailing Address
Assume 123456 Assume 000-000 any#gmail.com Assume
Here is the code:
import re
import csv

file_h = open("out.csv","a")
csv_writer = csv.writer(file_h)

def writeHeading(file_content):
    list_of_headings = []
    for row in file_content:
        key = str(row.split(":")[0]).strip()
        list_of_headings.append(key)
    csv_writer.writerow(tuple(list_of_headings))

def writeContents(file_content):
    list_of_data = ['Number']
    for row in file_content:
        value = str(row.split(":")[1]).strip()
        list_of_data.append(value)
    csv_writer.writerow(tuple(list_of_data))

def convert_txt_csv(filename):
    file_content = open(filename,"r").readlines()
    return file_content

list_of_files = ["10002.txt","10003.txt","10004.txt"]

# for writing heading once
file_content = convert_txt_csv(list_of_files[0])
writeHeading(file_content)

# for writing contents
for file in list_of_files:
    file_content = convert_txt_csv(file)
    writeContents(file_content)

file_h.close()
Here is the error I get:
Traceback (most recent call last):
  File "Magnet.py", line 37, in <module>
    writeContents(file_content)
  File "Magnet.py", line 20, in writeContents
    value = str(row.split(":")[1]).strip()
IndexError: list index out of range
Your code probably encounters a blank line at the end of the first file, or any line that doesn't have a : in it, so when you try to split it into key/value parts, the [1] index fails because the split didn't produce a list of the expected length. You can fix that easily by checking whether there is a colon on the current line, i.e.:
for row in file_content:
    if ":" not in row:  # or you can do the split and check len() of the result
        continue
    key = row.split(":")[0].strip()
    list_of_headings.append(key)
But... while the task you're attempting looks extremely simple, keep in mind that your approach assumes all the files are alike, with the same number of key: value combinations in the same order.
You'd be much better off storing your parsed data in a dict and then using csv.DictWriter() to do your bidding, as sketched below.
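A minimal sketch of that approach (untested, assuming every useful line is a key: value pair and the first colon is the separator):

import csv

list_of_files = ["10002.txt", "10003.txt", "10004.txt"]

#parse each file into a dict of {heading: value}
rows = []
for filename in list_of_files:
    record = {}
    with open(filename) as f:
        for line in f:
            if ":" not in line:  # skip blank or malformed lines
                continue
            key, _, value = line.partition(":")  # split on the first colon only
            record[key.strip()] = value.strip()
    rows.append(record)

#write one header row plus one data row per file
with open("out.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)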
I'm a total noob to Python and need some help with my code.
The code is meant to take Input.txt [http://pastebin.com/bMdjrqFE], split it into separate Pokemon (in a list), and then split each Pokemon into separate values, which I use to reformat the data and write it to Output.txt.
However, when I run the program, only the last Pokemon gets output, 386 times. [http://pastebin.com/wkHzvvgE]
Here's my code:
f = open("Input.txt", "r")  #opens the file (input.txt)
nf = open("Output.txt", "w")  #opens the file (output.txt)
pokeData = []
for line in f:
    #print "%r" % line
    pokeData.append(line)

num = 0
tab = """ """
newl = """NEWL
"""
slash = "/"

while num != 386:
    current = pokeData
    current.append(line)
    print current[num]
    for tab in current:
        words = tab.split()
        print words
        for newl in words:
            nf.write('%s:{num:%s,species:"%s",types:["%s","%s"],baseStats:{hp:%s,atk:%s,def:%s,spa:%s,spd:%s,spe:%s},abilities:{0:"%s"},{1:"%s"},heightm:%s,weightkg:%s,color:"Who cares",eggGroups:["%s"],["%s"]},\n' % (str(words[2]).lower(),str(words[1]),str(words[2]),str(words[3]),str(words[4]),str(words[5]),str(words[6]),str(words[7]),str(words[8]),str(words[9]),str(words[10]),str(words[12]).replace("_"," "),str(words[12]),str(words[14]),str(words[15]),str(words[16]),str(words[16])))
    num = num + 1

nf.close()
f.close()
There are quite a few problems with your program, starting with the file reading.
To read the lines of a file to an array you can use file.readlines().
So instead of
f = open("Input.txt", "r")  #opens the file (input.txt)
pokeData = []
for line in f:
    #print "%r" % line
    pokeData.append(line)
You can just do this
pokeData = open("Input.txt", "r").readlines() # This will return each line within an array.
Next, you are misunderstanding the uses of for and while.
A for loop in Python is designed to iterate through an array or list, as shown below. I don't know what you were trying to do with for newl in words: a for loop creates a new variable and then iterates through an array, setting that variable to each element in turn. Refer below.
array = ["one", "two", "three"]
for i in array:  # i is created
    print(i)
The output will be:
one
two
three
So to fix a lot of this code, you can replace the whole while loop with something like this (the code below assumes your input file is formatted so that all the words are separated by tabs):
for line in pokeData:
    words = line.split(tab)  # Split the line by tabs
    nf.write('your very long and complicated string')
Other helpers
The formatted string that you write to the output file looks very similar to the JSON format. There is a built-in Python module called json that can convert a native Python dict to a JSON string. This will probably make things a lot easier for you, but either way works.
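For instance, a rough sketch of the dict-plus-json idea (the field names and values here are assumptions based on your format string):

import json

#hypothetical record built from one split line ("words" in your code)
entry = {
    "num": 1,
    "species": "Bulbasaur",
    "types": ["Grass", "Poison"],
    "baseStats": {"hp": 45, "atk": 49, "def": 49, "spa": 65, "spd": 65, "spe": 45},
}
print(json.dumps(entry))  # one JSON object per Pokemon, no manual formatting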
Hope this helps
I know I am missing the obvious here, but I have the following Python code, in which I am trying to:
1. Take a specified JSON file containing multiple strings as an input.
2. Start at line 1 and look for the key value of "content_text".
3. Add the key value to a new dictionary and write said dictionary to a new file.
4. Repeat 1-3 on additional JSON files.
import json

def OpenJsonFileAndPullData(JsonFileName, JsonOutputFileName):
    output_file = open(JsonOutputFileName, 'w')
    result = []
    with open(JsonFileName, 'r') as InputFile:
        for line in InputFile:
            Item = json.loads(line)
            my_dict = {}
            print item
            my_dict['Post Content'] = item.get('content_text')
            my_dict['Type of Post'] = item.get('content_type')
            print my_dict
            result.append(my_dict)
    json.dumps(result, output_file)

OpenJsonFileAndPullData('MyInput.json', 'MyOutput.txt')
However, when run I receive this error:
AttributeError: 'str' object has no attribute 'get'
Python is case-sensitive.
Item = json.loads(line) # variable "Item"
my_dict['Post Content'] = item.get('content_text') # another variable "item"
By the way, why don't you load the whole file as JSON at once?
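Also note that json.dumps(result, output_file) does not write anything to the file; json.dump (without the s) is the variant that takes a file object. A minimal Python 3 sketch of the whole function with the naming fixed (assuming one JSON object per line of input):

import json

def open_json_file_and_pull_data(json_file_name, json_output_file_name):
    result = []
    with open(json_file_name, 'r') as input_file:
        for line in input_file:
            item = json.loads(line)  # one consistent, lowercase name
            result.append({
                'Post Content': item.get('content_text'),
                'Type of Post': item.get('content_type'),
            })
    with open(json_output_file_name, 'w') as output_file:
        json.dump(result, output_file)  # dump, not dumps, writes to a file

open_json_file_and_pull_data('MyInput.json', 'MyOutput.txt')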
I need to extract some FASTA sequences from the "goodProteins.fasta" file (first input) using the ID list files present in a separate folder (second input).
The format of the FASTA sequence file is:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
>1_12259
DJHFDSQWERTYUHKDJKDJOKK......
>1_12260
ADKKHDFHJQWERTYUHKDJKDJOKK......
and the format of one of the ID files is:
1_12258
1_12256
1_12257
I'm using the following script:
from Bio import SeqIO
import glob

def process(wanted_file, result_file):
    fasta_file = "goodProteins.fasta"  # First input (FASTA sequences)
    wanted = set()
    with open(wanted_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                wanted.add(line)
    fasta_sequences = SeqIO.parse(open(fasta_file), 'fasta')
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")

listFilesArr = glob.glob("My_folder\*txt")  # takes all .txt files as
                                            # second input in My_folder
for wanted_file in listFilesArr:
    result_file = wanted_file[0:-4] + ".fasta"
    process(wanted_file, result_file)
It should extract the FASTA sequences based on the order of the IDs listed in the ID file, and the desired output would be:
>1_12258
QWERTYUHKDJKDJOKK......
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
but I get:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
That is, in my final output the headers are sorted in ascending order, but I want them in exactly the same order as listed in the ID files. I'm not sure how to do it; please help.
I think the root cause of the ordering problem is that wanted is a set, and sets are unordered. Since you want the sequence IDs in the wanted_files to determine the ordering, you'd need to store them in something that preserves order, like a list.
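For example, a minimal, order-preserving tweak of your process() (untested; it still indexes the records in a dict for lookup, a point expanded on below):

from Bio import SeqIO

def process(wanted_file, result_file):
    fasta_file = "goodProteins.fasta"
    #a list preserves the order of the ids as they appear in the wanted file
    with open(wanted_file) as f:
        wanted = [line.strip() for line in f if line.strip()]
    #index the matching records by id so they can be written in wanted order
    wanted_set = set(wanted)
    by_id = {seq.id: seq for seq in SeqIO.parse(fasta_file, 'fasta')
             if seq.id in wanted_set}
    with open(result_file, "w") as out:
        for seq_id in wanted:
            if seq_id in by_id:
                SeqIO.write(by_id[seq_id], out, "fasta")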
Alternatively, you can process each line of the wanted_file as it's read. A problem with that approach is that it could require reading through the "goodProteins.fasta" file many times, perhaps once for each line of the wanted_file if its contents aren't in a sorted order.
To avoid that, the entire file can be read into a memory-resident dictionary keyed by sequence ID, once, using the SeqIO.to_dict() function, and then reused for each wanted_file. You say the file is 50-60 MB, but that isn't too much for most of today's hardware.
Anyway, here's code that attempts to do this. To avoid global variables there's a Process class that reads in the "goodProteins.fasta" file and converts it into a dictionary when an instance of it is created. Instances are callable and reusable, meaning that the same process object can be used with each of the wanted_files without repeatedly reading the sequences file.
Note that the code is untested because I don't have the data files or the Bio module installed on my system — but hopefully it's close enough to help.
from Bio import SeqIO
import glob

class Process(object):
    def __init__(self, fasta_file_name):
        # read entire fasta file into memory as a dictionary indexed by ID
        with open(fasta_file_name, "rU") as fasta_file:
            self.fasta_sequences = SeqIO.to_dict(
                SeqIO.parse(fasta_file, 'fasta'))

    def __call__(self, wanted_file_name, results_file_name):
        with open(wanted_file_name, "rU") as wanted, \
             open(results_file_name, "w") as results:
            for seq_id in (line.strip() for line in wanted):
                if seq_id:
                    SeqIO.write(self.fasta_sequences[seq_id], results, "fasta")

process = Process("goodProteins.fasta")  # create process object

# process each wanted file using it
for wanted_file_name in glob.glob(r"My_folder\*.txt"):
    results_file_name = wanted_file_name[:-4] + ".fasta"
    process(wanted_file_name, results_file_name)