Rename fasta file according to a dataframe in python

Hello, I have a huge file such as:
>Seq1.1
AAAGGAGAATAGA
>Seq2.2
AGGAGCTTCTCAC
>Seq3.1
CGTACTACGAGA
>Seq5.2
CGAGATATA
>Seq3.1
CGTACTACGAGA
>Seq2
AGGAGAT
and a dataframe such as :
tab
query New_query
Seq1.1 Seq1.1
Seq2.2 Seq2.2
Seq3.1 Seq3.1_0
Seq5.2 Seq5.2_3
Seq3.1 Seq3.1_1
The idea is to rename each >Seqname header according to tab.
For each Seqname, if tab['query'] != tab['New_query'], rename the Seqname to tab['New_query'].
PS: Not all >Seqname headers are present in tab; when one is missing, I leave it unchanged.
I should then get a new fasta file such as:
>Seq1.1
AAAGGAGAATAGA
>Seq2.2
AGGAGCTTCTCAC
>Seq3.1_0
CGTACTACGAGA
>Seq5.2_3
CGAGATATA
>Seq3.1_1
CGTACTACGAGA
>Seq2
AGGAGAT
I tried this code:
from Bio import SeqIO
records = SeqIO.parse("My_fasta_file.aa", 'fasta')
for record in records:
    subtab = tab[tab['query'] == record.id]
    subtab = subtab.drop_duplicates(subset="New_query", keep="first")
    if subtab.empty:  # the seq was not in the tab, so I do not rename the sequence
        continue
    if subtab['query'].iloc[0] != subtab['New_query'].iloc[0]:
        # take the scalar value, not the whole Series
        record.id = subtab['New_query'].iloc[0]
        record.description = subtab['New_query'].iloc[0]
It works, but it takes too much time...

You can create a mapper dictionary from the dataframe and then read the fasta file line by line, substituting the lines which start with >:
mapper = tab.set_index('query').to_dict()['New_query']
with open('My_fasta_file.aa', 'r') as f_in, open('output.txt', 'w') as f_out:
    for line in map(str.strip, f_in):
        if line.startswith('>'):
            # look up the header name; keep it unchanged if it is not in the mapper
            v = line.split('>')[-1]
            line = '>{}'.format(mapper.get(v, v))
        print(line, file=f_out)
This creates output.txt (a dictionary keeps only the last New_query for a repeated query, so both Seq3.1 records become Seq3.1_1):
>Seq1.1
AAAGGAGAATAGA
>Seq2.2
AGGAGCTTCTCAC
>Seq3.1_1
CGTACTACGAGA
>Seq5.2_3
CGAGATATA
>Seq3.1_1
CGTACTACGAGA
>Seq2
AGGAGAT
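Because a dictionary keeps one value per key, the repeated Seq3.1 header cannot receive Seq3.1_0 and then Seq3.1_1 as in the expected output. Here is a minimal sketch of one way around this, assuming the rows of the table appear in the same order as the matching headers in the fasta file: keep a queue of new names per query and consume one name per occurrence. The helper names build_queues and rename_headers are made up for illustration.

```python
from collections import defaultdict, deque

def build_queues(pairs):
    """Map each query name to its new names, in the order the table lists them."""
    queues = defaultdict(deque)
    for query, new_query in pairs:
        queues[query].append(new_query)
    return queues

def rename_headers(lines, queues):
    """Yield FASTA lines, renaming headers by consuming one new name per occurrence."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name = line[1:]
            pending = queues.get(name)
            if pending:  # name is in the table and still has unused new names
                line = ">" + pending.popleft()
        yield line

# With a DataFrame, the pairs would be: zip(tab["query"], tab["New_query"])
pairs = [("Seq1.1", "Seq1.1"), ("Seq2.2", "Seq2.2"), ("Seq3.1", "Seq3.1_0"),
         ("Seq5.2", "Seq5.2_3"), ("Seq3.1", "Seq3.1_1")]
fasta = [">Seq1.1", "AAAGGAGAATAGA", ">Seq3.1", "CGTACTACGAGA",
         ">Seq3.1", "CGTACTACGAGA", ">Seq2", "AGGAGAT"]
renamed = list(rename_headers(fasta, build_queues(pairs)))
```

Headers that are not in the table, or that occur more often than the table lists them, are left unchanged.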

The solution by @Andrej using a dictionary is indeed the way to go. Since you are already using biopython, below is a way to do it with biopython, which has the advantage that it parses fasta files properly.
Your data.frame is:
import pandas as pd
from Bio import SeqIO
tab = pd.DataFrame({'query':['Seq1.1','Seq2.2','Seq3.1','Seq5.2','Seq3.1'],
                    'New_query':['Seq1.1','Seq2.2','Seq3.1_0','Seq5.2_3','Seq3.1_1']})
Same dictionary as Andrej:
mapper = tab.set_index('query').to_dict()['New_query']
Then, similar to what you have done, we just change the header (by updating id and description, thanks to @Chris_Rands):
records = list(SeqIO.parse("My_fasta_file.aa", "fasta"))
for i in records:
    i.id = mapper.get(i.id, i.id)
    i.description = mapper.get(i.description, i.description)
Now write the file:
with open("new.fasta", "w") as output_handle:
    SeqIO.write(records, output_handle, "fasta")

Related

Python script using json.load to compare two files and replace strings

I have a JSON file like this: [{"ID": "12345", "Name":"John"}, {"ID":"45321", "Name":"Max"}...] called myclass.json. I used json.load library to get "ID" and "Name" values.
I have another .txt file with the content below. File name is list.txt:
Student,12345,Age 14
Student,45321,Age 15
.
.
.
I'm trying to create a script in python that compares the two files line by line and replaces the student ID with the student's name in the list.txt file, so the new file would be:
Student,John,Age 14
Student,Max,Age 15
.
.
Any ideas?
My code so far:
import json
with open('/myclass.json') as f:
    data = json.load(f)
for key in data:
    x = key['Name']
    z = key['ID']
with open('/myclass.json', 'r') as file1:
    with open('/list.txt', 'r+') as file2:
        for line in file2:
            x = z
Try this:
import json
import csv
with open('myclass.json') as f:
    data = json.load(f)
with open('list.txt', 'r', newline='') as f:
    reader = csv.reader(f)
    rows = list(reader)
def get_name(id_):
    # return the name for the given ID, or None if the ID is unknown
    for item in data:
        if item['ID'] == id_:
            return item["Name"]
with open('list.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in rows:
        name = get_name(id_=row[1])
        if name:
            row[1] = name
    writer.writerows(rows)
Keep in mind that this script technically does not replace the items in the list.txt file one by one; it reads the entire file in, then overwrites list.txt entirely and constructs it from scratch. I suggest making a backup of list.txt or giving the new txt file a different name, in case the program crashes on some unexpected input.
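To reduce the risk of losing list.txt if the program crashes halfway, one option is to write the result to a temporary file and swap it into place only once writing has finished. The sketch below is an illustration under assumptions, not the answer's method: the function name replace_ids is made up, and it also swaps the linear search for a dictionary lookup. If an error occurs mid-way, the original file is left untouched, although the temporary file is left behind.

```python
import csv
import json
import os
import tempfile

def replace_ids(json_path, list_path):
    """Rewrite list_path, swapping each ID in the second column for the matching name."""
    with open(json_path) as f:
        # one dictionary lookup per row instead of scanning the JSON list
        names = {item["ID"]: item["Name"] for item in json.load(f)}
    directory = os.path.dirname(os.path.abspath(list_path))
    # write to a temporary file in the same directory first
    with open(list_path, newline="") as src, tempfile.NamedTemporaryFile(
            "w", newline="", dir=directory, delete=False) as tmp:
        writer = csv.writer(tmp, lineterminator="\n")
        for row in csv.reader(src):
            if len(row) > 1:
                row[1] = names.get(row[1], row[1])  # unknown IDs stay as they are
            writer.writerow(row)
    # replace the original only after everything was written successfully
    os.replace(tmp.name, list_path)
```

A call would look like replace_ids('myclass.json', 'list.txt').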
One option is to open each file separately in the appropriate mode, collecting the lines whose ID values match between the two files in a list:
import json
with open('myclass.json','r') as f_in:
    data = json.load(f_in)
j = 0
lis = []
# assumes list.txt lists the IDs in the same order as myclass.json;
# lines whose ID does not match the current JSON entry are dropped
with open('list.txt', 'r') as f_in:
    for line in f_in:
        if j < len(data) and data[j]['ID'] == line.split(',')[1]:
            s = line.replace(line.split(',')[1], data[j]['Name'])
            lis.append(s)
            j += 1
with open('list.txt', 'w') as f_out:
    for i in lis:
        f_out.write(i)

Update Txt file in python

I have a text file with names and results. If the name already exists, only the result should be updated. I tried with this code and many others, but without success.
The content of the text file looks like this:
Ann, 200
Buddy, 10
Mark, 180
Luis, 100
PS: I started 2 weeks ago, so don't judge my bad code.
from os import rename
def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file:
        if username in line:
            splitted = line.split(",")
            splitted[1] = score
            joined = "".join(splitted)
            new_file.write(joined)
        new_file.write(line)
    file.close()
    new_file.close()
maks = updatescore("Buddy", "200")
print(maks)
I would suggest reading the csv into a dictionary and just updating the one value.
import csv
d = {}
with open('test.txt', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        key, value = row
        d[key] = value
d['Buddy'] = 200
with open('test2.txt', 'w', newline='') as f:
    writer = csv.writer(f)
    for key, value in d.items():
        writer.writerow([key, value])
The main problem is that your for loop always writes line to the new file; nothing tells it not to do that when a score is being replaced. All that was needed was an else statement below the if statement:
from os import rename
def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file:
        if username in line:
            splitted = line.split(",")
            splitted[1] = score
            print(splitted)
            joined = ", ".join(splitted)
            print(joined)
            new_file.write(joined + '\n')
        else:
            new_file.write(line)
    file.close()
    new_file.close()
maks = updatescore("Buddy", "200")
print(maks)
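You can try this: it adds the username if it doesn't exist, else it updates the score. Be aware that it overwrites the line in place, so it only works cleanly when the new line has exactly the same length as the old one.
def updatescore(username, score):
    with open("mynewscores.txt", "r+") as file:
        line = file.readline()
        while line:
            if username in line:
                # jump back to the start of this line and overwrite it
                file.seek(file.tell() - len(line))
                file.write(f"{username}, {score}")
                return
            line = file.readline()
        # username not found: append it at the end of the file
        file.write(f"\n{username}, {score}")
maks = updatescore("Buddy", "300")
maks = updatescore("Mario", "50")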
You have new_file.write(joined) inside the if block, which is good, but you also have new_file.write(line) outside the if block.
Because it is outside the if block, both the fixed and the original line are written to the file. They also end up on the same line: splitted[1] originally held the score together with the trailing \n newline character, and replacing it with score drops that newline.
You also want to add the comma back: joined = ','.join(splitted), since the commas were removed by line.split(',').
I got the result you seem to be expecting when I put in both these fixes.
Next time you should include what you are expecting as output and what you are giving as input. It might be helpful if you also include the error or result you actually got.
Welcome to Python BTW
Removed issues from your code:
def updatescore(username, score):
    file = open("mynewscores.txt", "r")
    new_file = open("mynewscores2.txt", "w")
    for line in file.readlines():
        splitted = line.split(",")
        if username == splitted[0].strip():
            splitted[1] = str(score)
            # the old score carried the trailing newline, so add it back
            joined = ",".join(splitted) + "\n"
            new_file.write(joined)
        else:
            new_file.write(line)
    file.close()
    new_file.close()
I believe this is the simplest and most straightforward way of doing things.
Code:
import csv
def update_score(name: str, score: int) -> None:
    with open('../resources/name_data.csv', newline='') as file_obj:
        reader = csv.reader(file_obj)
        data_dict = dict(curr_row for curr_row in reader)
    data_dict[name] = score
    with open('../out/name_data_out.csv', 'w', newline='') as file_obj:
        writer = csv.writer(file_obj)
        writer.writerows(data_dict.items())
update_score('Buddy', 200)
Input file:
Ann,200
Buddy,10
Mark,180
Luis,100
Output file:
Ann,200
Buddy,200
Mark,180
Luis,100

Save output from biopython object into a file?

Here I have code written to extract the "locus_tag" of a gene using its "id". How can I save the output into a file in a tab-separated format? The code is adapted and modified from https://www.biostars.org/p/110284/
from Bio import SeqIO
foo = open("geneid.txt")
lines = foo.read().splitlines()
genbank_file = open("example.gbk")
for record in SeqIO.parse(genbank_file, "genbank"):
    for f in record.features:
        if f.type == "CDS" and "protein_id" in f.qualifiers:
            protein_id = f.qualifiers["protein_id"][0]
            if protein_id in lines:
                print(f.qualifiers["protein_id"][0], f.qualifiers["locus_tag"][0])
Try adding something like this, but you will need to make certain the indentation is correct with the code that you have already written.
with open(your_outputFileName, 'w') as outputFile:
    string = '\t'.join([f.qualifiers['protein_id'][0],f.qualifiers['locus_tag'][0]])
    outputFile.write(string + '\n')
You should also consider opening your initial file using "with". This will automatically close the file when you are done with it; otherwise, be certain to close the file yourself (e.g., foo.close()).
# open the output file once, before the loop, so earlier matches are not overwritten
with open('your_outputFileName', 'w') as outputFile:
    for record in SeqIO.parse(genbank_file, 'genbank'):
        for f in record.features:
            if f.type == 'CDS' and 'protein_id' in f.qualifiers:
                protein_id = f.qualifiers['protein_id'][0]
                if protein_id in lines:
                    string = '\t'.join([protein_id, f.qualifiers['locus_tag'][0]]) + '\n'
                    outputFile.write(string)
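As an alternative to joining the fields by hand, the standard csv module can write tab-separated rows. This is only a sketch: the function name write_tsv and the sample pairs are made up for illustration, and in the real script the pairs would be collected from f.qualifiers inside the loop.

```python
import csv

def write_tsv(pairs, path):
    """Write (protein_id, locus_tag) pairs to path, one tab-separated row per pair."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(pairs)

# hypothetical pairs; real ones would come from f.qualifiers in the GenBank loop
pairs = [("protein_A", "locus_0001"), ("protein_B", "locus_0002")]
# write_tsv(pairs, "locus_tags.tsv")
```

Collecting the pairs in a list first also means the output file is opened only once.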

How do I remove lines from a file and move them to another file?

Ok so the file contains:
apple,bot,cheese,-999
tea,fire,water,1
water,mountain,care,-999
I want to check whether each line in file 1 has -999 at the end. If it does, the line should be removed; the lines that do not should be transferred into a new file. So far my function has:
def clean(filename,cleanfile,value,position):
    filename.readline()
    for line in filename:
        if line[position] != value:
            cleanfile.write(line)
Value is -999 and position is 3. I opened my files in my main and passed them to the function; the problem is that the new file is empty.
You can use the csv module to take care of the details of splitting and joining the comma-separated values.
import csv
def clean(filename,cleanfile,value,position):
    # filename and cleanfile are paths here; value must be a string such as "-999"
    with open(filename) as reader_fp, open(cleanfile, 'w') as writer_fp:
        reader = csv.reader(reader_fp)
        writer = csv.writer(writer_fp)
        for row in reader:
            if row[position] != value:
                writer.writerow(row)
Try this:
def clean(filename,cleanfile,value,position):
    for lines in filename.readlines():
        line = lines.strip().split(",")
        if line[position] != value:
            cleanfile.write(",".join(line) + "\n")
clean(open("readFrom.txt", "r"), open("writeTo.txt", "w"), "-999", 3)
If you know that the value is always at the end of each line, you can try:
def clean(file1, file2, value):
    for line in file1:
        if line.strip().split(",")[-1] != value:
            file2.write(line)
    file1.close()
    file2.close()
clean(open("readFrom.txt", "r"), open("writeTo.txt", "w"), "-999")
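The title asks about moving the removed lines to another file, while the answers above only write the lines that are kept. A minimal sketch of splitting the lines into both groups in one pass follows; the function name split_rows is an assumption made for illustration, and each returned list can then be written to its own file.

```python
def split_rows(lines, value, position):
    """Return (kept, removed): removed holds the lines whose field at position equals value."""
    kept, removed = [], []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) > position and fields[position] == value:
            removed.append(line)
        else:
            kept.append(line)
    return kept, removed

lines = ["apple,bot,cheese,-999\n", "tea,fire,water,1\n", "water,mountain,care,-999\n"]
kept, removed = split_rows(lines, "-999", 3)
```

Note that value is compared as a string, so it has to be passed as "-999" rather than the number -999.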

Using python to read txt files and answer questions

a01:01-24-2011:s1
a03:01-24-2011:s2
a02:01-24-2011:s2
a03:02-02-2011:s2
a03:03-02-2011:s1
a02:04-19-2011:s2
a01:05-14-2011:s2
a02:06-11-2011:s2
a03:07-12-2011:s1
a01:08-19-2011:s1
a03:09-19-2011:s1
a03:10-19-2011:s2
a03:11-19-2011:s1
a03:12-19-2011:s2
I have this list of data as a txt file, in the format animal name : date : location.
I have to read this txt file to answer questions.
So far I have
text_file=open("animal data.txt", "r") #open the text file and reads it.
I know how to read one line, but since there are multiple lines here I'm not sure how I can read every line in the txt.
Use a for loop.
text_file = open("animal data.txt","r")
for line in text_file:
    line = line.split(":")
    #Code for what you want to do with each element in the line
text_file.close()
Since you know the format of this file, you can shorten it even more than the other answers:
with open('animal data.txt', 'r') as f:
    for line in f:
        animal_name, date, location = line.strip().split(':')
        # You now have three variables (animal_name, date, and location)
        # This loop will happen once for each line of the file
        # For example, the first time through will have data like:
        # animal_name == 'a01'
        # date == '01-24-2011'
        # location == 's1'
Or, if you want to keep a database of the information you get from the file to answer your questions, you can do something like this:
animal_names, dates, locations = [], [], []
with open('animal data.txt', 'r') as f:
    for line in f:
        animal_name, date, location = line.strip().split(':')
        animal_names.append(animal_name)
        dates.append(date)
        locations.append(location)
# Here, you have access to the three lists of data from the file
# For example:
# animal_names[0] == 'a01'
# dates[0] == '01-24-2011'
# locations[0] == 's1'
You can use a with statement to open the file, so that it is closed properly even if something fails while reading.
with open('data.txt', 'r') as f_in:
    for line in f_in:
        line = line.strip()  # remove all whitespace at start and end
        field = line.split(':')
        # field[0] = animal name
        # field[1] = date
        # field[2] = location
You are not closing the file. It is better to use the with statement to ensure the file gets closed.
with open("animal data.txt","r") as file:
    for line in file:
        line = line.split(":")
        # Code for what you want to do with each element in the line
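Once each line is split into its three fields, answering questions is mostly a matter of counting or filtering. The question does not say what the actual questions are, so the two below are hypothetical, and the function name parse_records is made up; the sample lines are the first five from the data above.

```python
from collections import Counter

def parse_records(lines):
    """Split each 'animal:date:location' line into a tuple, skipping blank lines."""
    return [tuple(line.strip().split(":")) for line in lines if line.strip()]

sample = ["a01:01-24-2011:s1\n", "a03:01-24-2011:s2\n", "a02:01-24-2011:s2\n",
          "a03:02-02-2011:s2\n", "a03:03-02-2011:s1\n"]
records = parse_records(sample)

# hypothetical question 1: how many times was each animal recorded?
visits_per_animal = Counter(animal for animal, _date, _location in records)

# hypothetical question 2: how many times was animal a03 recorded at location s2?
a03_at_s2 = sum(1 for animal, _date, location in records
                if animal == "a03" and location == "s2")
```

With the real file, the list passed to parse_records would simply be the open file object, since iterating over it yields one line at a time.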
