I am working with amino acid sequences using the Biopython parser. The format is FASTA, so you can imagine each record as a string of letters preceded by its id, as shown below. My problem is that I have a huge amount of data, and even after trying to parallelize with joblib, the estimated running time of this simple code is 400 hours.
Basically I have a file that contains a series of ids that I have to remove (ids_to_drop) from the original dataset (original_dataset), to create a new file (new_dataset) that contains all the ids of the original dataset except the ids_to_drop.
I've tried everything I can think of, but I don't know how else to do it and I'm stuck right now. Thanks so much!
def file_without_ids_to_remove(seq):
    with open(new_output, "a") as f, open(ids_to_drop, "r") as r:  # output file, file of ids to remove
        remove = r.read().split("\n")
        if seq.id not in remove:
            SeqIO.write(seq, f, "fasta")

Parallel(n_jobs=10)(delayed(file_without_ids_to_remove)(seq) for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')))
To be clear this is an example of the data (sequence.id + sequence):
WP_051064487.1
MSSAAQTPEATSDVSDANAKQAEALRVASVNVNGIRASYRKGMAEWLAPRQVDILCLQEVRAPDEVVDGF
LADDWHIVHAEAEAKGRAGVLIASRKDSLAPDATRIGIGEEYFATAGRWVEADYTIGENAKKLTVISAYV
HSGEVGTQRQEDKYRFLDTMLERMAELAEQSDYALIVGDLNVGHTELDIKNWKGNVKNAGFLPEERAYFD
KFFGGGDTPGGLGWKDVQRELAGPVNGPYTWWSQRGQAFDNDTGWRIDYHMATPELFARAGNAVVDRAPS
YAERWSDHAPLLVDYTIR
UPDATE: I tried the following way after the suggestion and it works.

with open(new_dataset, "w") as filtered:
    [SeqIO.write(seq, filtered, "fasta") for seq in tqdm.tqdm(SeqIO.parse(original_dataset, 'fasta')) if seq.id not in ids_to_remove]
This looks like a simple file filter operation. Turn the ids to remove into a set one time, and then just read/filter/write the original dataset. Sets are optimized for fast lookup. This operation will be I/O bound and would not benefit from parallelization.
with open("ids-to-remove") as f:
ids_to_remove = {seq_id_line.strip() for seq_id_line in f}
# just in case there are blank lines
if "" in ids_to_remove:
ids_to_remove.remove("")
with open("original-data-set") as orig, open("filtered-data-set", "w") as filtered:
filtered.writelines(line for line in orig if line.split()[0] not in ids_to_remove)
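If you'd rather keep whole FASTA records together instead of filtering raw lines, the same set-based idea works with Biopython directly. A minimal sketch, assuming the placeholder file names above and that the ids in the drop file match seq.id exactly:

from Bio import SeqIO

# build the lookup set once
with open("ids-to-remove") as f:
    ids_to_remove = {line.strip() for line in f if line.strip()}

# single pass: parse, test membership in O(1), write the records we keep
with open("filtered-data-set", "w") as filtered:
    for seq in SeqIO.parse("original-data-set", "fasta"):
        if seq.id not in ids_to_remove:
            SeqIO.write(seq, filtered, "fasta")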
I want to know the number of header lines my csv file contains (between 0 and ~50). The file itself is huge (so it's mandatory not to read the complete file for this) and contains numerical data.
I know that csv.Sniffer has a has_header() function, but that can only detect 1 header.
One idea I had is to repeatedly call the has_header function (supposing it detects the first header) and count the calls. I am sure, though, that there is a much smarter way.
Googling was kind of a pain, since no matter what you search, if it includes "count" and "csv" at some point, you get all the "count rows in csv" results :D
Clarification:
By number of headers I mean the number of rows containing information which is not data. There is no general rule for the headers (they could be text, floats, or white space) and there may be just a single line of text. The data itself, however, is only floats. For me this was super clear because I've been working with these files for a long time, but I forgot this isn't the normal case.
I hoped there was an easy and smart built-in function in NumPy or Pandas, but it doesn't seem so.
Inspired by the comments so far, I think my best bet is to
read 100 lines
count number of separators in each line
determine most common number of separators per line
working backwards from the end of those 100 lines, find the first line with a different number of separators, or whose fields aren't floats. That line is the last header line (a rough sketch of this idea follows).
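A rough sketch of that plan, assuming a comma-separated file, a hypothetical file name, and the test "every field parses as a float" for data lines (adjust both to your format; untested):

import csv
from collections import Counter
from itertools import islice

def count_header_lines(filename, sample_size=100, delimiter=","):
    # read at most sample_size lines
    with open(filename, "r", encoding="utf-8") as handle:
        sample = list(islice(csv.reader(handle, delimiter=delimiter), sample_size))
    if not sample:
        return 0

    # the most common field count in the sample is assumed to be the data shape
    data_width = Counter(len(fields) for fields in sample).most_common(1)[0][0]

    def looks_like_data(fields):
        if len(fields) != data_width:
            return False
        try:
            for field in fields:
                float(field)
            return True
        except ValueError:
            return False

    # walk backwards from the end of the sample; the first non-data line we meet
    # is the last header line, and its 1-based position is the header count
    for index in range(len(sample) - 1, -1, -1):
        if not looks_like_data(sample[index]):
            return index + 1
    return 0

print(count_header_lines("your_file.csv"))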
Here's a sketch for finding the first line which matches a particular criterion. For demo purposes, I use the criterion "there are empty fields":
import csv

with open(filename, "r", encoding="utf-8") as handle:
    for lineno, fields in enumerate(csv.reader(handle), 1):
        if "" in fields:
            print(lineno - 1)
            break
You'd update it to look for something which makes sense for your data, like perhaps "the third and eighth fields contain numbers":
try:
    float(fields[2])
    float(fields[7])
    print(lineno - 1)
    break
except ValueError:
    continue
(Notice how the list fields is indexed starting at zero, so the first field is fields[0] and the third is fields[2].) Or perhaps you want a more sophisticated model, where the first line contains no empty fields, successive lines contain more and more empty fields, and then the first data line contains fewer empty fields:
maxempty = 0
for lineno, fields in enumerate(csv.reader(handle), 1):
    empty = fields.count("")
    if empty > maxempty:
        maxempty = empty
    elif empty < maxempty:
        print(lineno - 1)
        break
We simply print the line number of the last header line, since your question asks how many there are. Perhaps printing or returning the number of the first data line would make more sense in some scenarios.
This code doesn't use Pandas at all, just the regular csv module from the Python standard library. It stops reading when you hit break so it doesn't matter for performance how many lines there are after that (though if you need to experiment or debug, maybe create a smaller file with only, say, the first 200 lines of your real file).
Use re.search to look for lines that have 2 or more letters in a row. Two is used instead of one so that scientific notation (e.g., 1.0e5) is not counted as a header.
# In the shell, create a test file:
# printf "foo,bar\nbaz,bletch\n1e4,2.0\n2E5,2\n" > in_file.csv
import re

num_header_lines = 0
with open('in_file.csv') as handle:
    for line in handle:
        if re.search('[A-Za-z]{2,}', line):
            # count the header line
            num_header_lines += 1
        else:
            break
print(num_header_lines)
# 2
Well, I think that you could get the first line of the csv file and then split it on ",". That will give you a list with all the headers in it. Now you can just count them with len.
Try this:
import pandas as pd
df = pd.read_csv('your_file.csv', index_col=0)
num_rows, num_cols = df.shape
Since I see you're worried about file size, breaking the file into chunks would work:
chunk_size = 10000
df = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                 low_memory=False)
I think you might get a variable number of rows if you read the df chunk by chunk, but if you're only interested in the number of columns this works easily.
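For instance, pulling just the first chunk already gives you the column count without touching the rest of the file (a small sketch; df is the TextFileReader returned by the chunked read_csv call above):

first_chunk = next(iter(df))      # reads only the first chunk_size rows
num_cols = first_chunk.shape[1]
print(num_cols)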
You could also look into dask.dataframe
This only reads the first line of the csv file:

import csv

with open('ornek.csv', newline='') as f:
    reader = csv.reader(f)
    row1 = next(reader)
    sizeOfHeader = len(row1)
I am a newbie in the Python world and in bioinformatics. I am dealing with an almost 50 GB structured file that I need to write out, so I would like to get some good tips from you.
The file goes like this. (it's actually called FASTQ_format)
#Machinename:~:Team1:atcatg 1st line.
atatgacatgacatgaca 2nd line.
+ 3rd line.
asldjfwe!##$#%$ 4th line.
These four lines are repeated in that order; each group of 4 lines belongs together as one "team" (record).
And I have nearly 30 candidate DNA sequences, e.g. atgcat, tttagc.
What I am doing is run each candidate DNA sequence against the huge file to find whether the candidate is similar to the team DNA sequence, meaning one mismatch is allowed (e.g. taaaaa = aaaata). If they are similar or the same, I use a dictionary to store them to write out later: the key is the candidate DNA sequence, and the value is a list holding the 4 lines of the record, in line order.
So what I have done is:
def myfunction(str1, str2):
    # returns True if the two sequences are similar (at most one mismatch allowed)
    ...

f = open('hugefile')
diction = {}
mylist = ['candidate dna sequence 1', 'dna2', 'dna3', 'dna4']  # ... nearly 30 candidate sequences

while True:
    line = f.readline()
    if not line:
        break
    if "machine name" in line:
        team_dna = line.split(':')[-1]
        record = [line, f.readline(), f.readline(), f.readline()]  # the 4 lines of this team, in line order
        for candidate_dna in mylist:
            if myfunction(candidate_dna, team_dna) == True:
                if candidate_dna not in diction.keys():
                    diction[candidate_dna] = []
                    diction[candidate_dna].extend(record)
                else:  # chances are some of the same team dna are repeated
                    diction[candidate_dna].extend(record)
f.close()

wf = open('hugefile' + ".out", 'w')
for candidate_dna in mylist:  # dna1, dna2, dna3, ...
    wf.write(''.join(diction[candidate_dna]) + '\n')
wf.close()
My function doesn't use any global variables (I think I am happy with my function), whereas the dictionary variable is a global variable that takes in all the data and creates lots of list instances. The code is simple but so slow, and a big pain for both CPU and memory. I do use PyPy, though.
So, any tips on how to write it out while keeping line order?
I suggest opening input and output files simultaneously and writing to the output as you step through the input. As it is now, you are reading 50GB into memory and then writing it out. That is both slow and unnecessary.
IN PSEUDOCODE:
with open(hugefile) as fin, open(hugefile + ".out", 'w') as fout:
    for line in fin:
        if "machine name" in line:
            # read the following 4 lines from fin as a record
            # process that record
            # write the record to fout
            # the input record is no longer needed -- allow it to be garbage collected...
As I have outlined it, each 4-line record is written as it is encountered and then disposed of. If you need to refer to diction.keys() for previous records, keep only the minimum necessary as a set() to cut down the total size of the in-memory data.
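A possible concrete version of that pseudocode, as a sketch only: is_similar() stands in for the OP's one-mismatch comparison, the file name is a placeholder, and matching records are written in input order rather than grouped by candidate:

def is_similar(a, b):
    # allow at most one mismatch between two equal-length sequences
    return len(a) == len(b) and sum(1 for x, y in zip(a, b) if x != y) <= 1

candidates = ['atgcat', 'tttagc']  # the ~30 candidate sequences

with open('hugefile') as fin, open('hugefile.out', 'w') as fout:
    while True:
        header = fin.readline()
        if not header:
            break
        seq = fin.readline()
        plus = fin.readline()
        qual = fin.readline()
        team_dna = header.rstrip().split(':')[-1]
        if any(is_similar(c, team_dna) for c in candidates):
            # write the whole 4-line record immediately; nothing accumulates in memory
            fout.writelines([header, seq, plus, qual])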
I have a file of sequence information, so the file will be structured like this,
[SEQUENCE ID]
atgctagctagatcga
[SEQUENCE ID]
agatcgatggctagatc
What I've been doing is comparing between files to see what sequences IDs are shared, which is simple enough, but now I want to pull out the actual sequence associated with the ID. The files I'm using are huge (10 GB+) so using a dictionary or anything that would involve reading all the lines into the system memory is out.
Basically what the code is intended to do is if the sequence ID from file 1 isn't found in file 2, then return the line after the sequence ID from file 1. Any tips?
So you only need line N and line N+1? In this case read the file in chunks of two lines. Then you always have access to both the sequence ID and the sequence.
from itertools import izip

with open('data.txt', 'r') as f:
    for line1, line2 in izip(*(iter(f),) * 2):
        print line1, line2
Short answer: you will have to use a third party Python library to keep one of the data sequences searchable in better than O(n).
If they are not sorted, you will have to sort at least one of the files. Think of it this way:
For each sequence ID I get from file 1, checking whether it is present in file 2 would mean reading the whole of file 2 again, which is far less feasible than reading each file once.
Better than sorting, it would be useful to have a data structure that can hold the sorted data on disk in a way that provides fast searches and can still grow. That would take care of the sorting as well: in a first step, all you'd have to do is read the entries in file 2 and insert them into this growing, sorted, disk-persisted data structure.
While you could certainly roll your own data structure to do this, I'd suggest using ZODB (Zope's object-oriented database) with a BTree container, and turning your "2 lines of data" into a minimal object for the task.
Assuming the [SEQUENCE ID]s do fit in memory, and that the bulk of your data is actually on the sequence lines (unlike the examples provided), you also have the option of parsing one file (file2 in your question) and recording not only each [SEQUENCE ID] but also its file position. This approach would let you proceed without breaking much of your current workflow (like having to learn about a database):
def get_indexes(filename):
    # map each [SEQUENCE ID] to its byte offset in the file
    with open(filename, "rt") as file:
        sequences = {}
        while True:
            position = file.tell()
            id = file.readline()
            if not id:
                break
            sequences[id.strip()] = position
            # skip the corresponding data line:
            file.readline()
    return sequences

def fetcher(filename1, filename2, sequences):
    with open(filename1, "rt") as file1, open(filename2, "rt") as file2:
        while True:
            id = file1.readline()
            data = file1.readline()
            if not id:
                break
            id = id.strip()
            if id in sequences:
                # position file2 at the identifier:
                file2.seek(sequences[id])
                # throw away the id line:
                file2.readline()
                data = file2.readline()
            yield id, data

if __name__ == "__main__":
    sequences = get_indexes("/data/file2")
    for id, data in fetcher("/data/file1", "/data/file2", sequences):
        print "%s\n%s" % (id, data)
I'm somewhat new to Python. I'm trying to sort a list of strings and integers. The list contains some symbols that need to be filtered out (i.e. ro!ad should end up as road). Also, the items are all on one line, separated by spaces. I need to use 2 arguments: one for the input file and one for the output file. The result should be sorted with the numbers first and then the words (without the special characters), each on its own line. I've been looking at loads of list functions but am having some trouble putting this together, as I've never had to do anything like this. Any takers?
So far I have the basic stuff
#!/usr/bin/python
import sys

try:
    infilename = sys.argv[1]
    #outfilename = sys.argv[2]
except:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)

ifile = open(infilename, 'r')
#ofile = open(outfilename, 'w')
data = ifile.readlines()
r = sorted(data, key=lambda item: (int(item.partition(' ')[0])
                                   if item[0].isdigit() else float('inf'), item))
ifile.close()
print '\n'.join(r)
#ofile.writelines(r)
#ofile.close()
The output shows exactly what was in the file, exactly as the file is written, and not sorted at all. The goal is to take a file (arg1.txt), sort it, and make a new file (arg2.txt), both passed as command-line arguments. I used print in this case to speed up editing, but I need to have it write to a file. That's why the output-file parts are commented out, but feel free to tell me I'm stupid if I screwed that up, too! Thanks for any help!
When you have an issue like this, it's usually a good idea to check your data at various points throughout the program to make sure it looks the way you want it to. The issue here seems to be in the way you're reading in the file.
data = ifile.readlines()
is going to read in the entire file as a list of lines. But since all the entries you want to sort are on one line, this list will only have one entry. When you try to sort the list, you're passing a list of length 1, which is going to just return the same list regardless of what your key function is. Try changing the line to
data = ifile.readlines()[0].split()
You may not even need the key function any more since numbers are placed before letters by default. I don't see anything in your code to remove special characters though.
Since they are all on the same line, you don't really need readlines:

with open('some.txt') as f:
    data = f.read()  # now data = "item 1 item2 etc..."
you can use re to filter out unwanted characters
import re
data = "ro!ad"
fixed_data = re.sub("[!?#$]","",data)
partition may be overkill:
data = "hello 23frank sam wilbur"
my_list = data.split() # ["hello","23frank","sam","wilbur"]
print sorted(my_list)
However, you will need to do more to force the numbers to sort first, maybe something like:
numbers = [x for x in my_list if x[0].isdigit()]
strings = [x for x in my_list if not x[0].isdigit()]
sorted_list = sorted(numbers, key=lambda x: int(re.sub("[^0-9]", "", x))) + sorted(strings)
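Putting those pieces together with the command-line arguments from the question, a possible end-to-end version could look like this (a sketch, untested; the character class used to strip symbols is an assumption):

#!/usr/bin/python
import re
import sys

try:
    infilename, outfilename = sys.argv[1], sys.argv[2]
except IndexError:
    print "Usage: ", sys.argv[0], "infile outfile"; sys.exit(1)

with open(infilename) as ifile:
    words = ifile.read().split()  # everything is on one line, space separated

# strip anything that is not a letter or digit, e.g. "ro!ad" -> "road"
cleaned = [re.sub("[^A-Za-z0-9]", "", w) for w in words]

numbers = sorted((w for w in cleaned if w.isdigit()), key=int)
strings = sorted(w for w in cleaned if not w.isdigit())

with open(outfilename, 'w') as ofile:
    ofile.write('\n'.join(numbers + strings) + '\n')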
Also, they are all on one line separated by a space.
So your file contains a single line?
data = ifile.readlines()
This makes data into a list of the lines in your file. All 1 of them.
r = sorted(...)
This makes r the sorted version of that list.
To get the words from the line, you can .read() the entire file as a single string, and .split() it (by default, it splits on whitespace).
I have a certain check to perform, and if the check is satisfied, I want the result to be printed. Below is the code:
import string
import codecs
import sys

y = sys.argv[1]
list_1 = []
f = 1.0
x = 0.05

write_in = open("new_file.txt", "w")
write_in_1 = open("new_file_1.txt", "w")

ligand_file = open(y, "r")  # open the receptor.txt file
ligand_lines = ligand_file.readlines()  # read all the lines into the array
ligand_lines = map(string.strip, ligand_lines)  # remove the newline character from all the pdb file names
ligand_file.close()

ligand_file = open("unique_count_c_from_ac.txt", "r")  # open the unique_count_c_from_ac.txt file
ligand_lines_1 = ligand_file.readlines()  # read all the lines into the array
ligand_lines_1 = map(string.strip, ligand_lines_1)  # remove the newline character
ligand_file.close()

s = []
for i in ligand_lines:
    for j in ligand_lines_1:
        j = j.split()
        if i == j[1]:
            print j
The above code works fine, but when I print j it prints like ['351', '342'], whereas I am expecting to get 351 342 (with one space in between). Since it is more of a Python question, I have not included the input files (basically they are just numbers).
Can anyone help me?
Cheers,
Chavanak
To convert a list of strings to a single string with spaces in between the list's items, use ' '.join(seq).
>>> ' '.join(['1','2','3'])
'1 2 3'
You can replace ' ' with whatever string you want in between the items.
Mark Rushakoff seems to have solved your immediate problem, but there are some other improvements that could be made to your code.
Always use context managers (with open(filename, mode) as f:) for opening files rather than relying on close getting called manually.
You rarely need to read a whole file into memory. Looping over some_file.readlines() can be replaced with looping over some_file directly.
For example, you could have used map(string.strip, ligand_file) or, better yet, [line.strip() for line in ligand_file].
Don't choose names that include the type of the object they refer to. This information can be found in other ways.
For example, the code you posted can be simplified to something along the lines of:
import sys
from contextlib import nested

some_real_name = sys.argv[1]
other_file = "unique_count_c_from_ac.txt"

with nested(open(some_real_name, "r"), open(other_file, "r")) as (ligand_1, ligand_2):
    for line_1 in ligand_1:
        # take care of the trailing newline
        line_1 = line_1.strip()
        # rewind the second file so it can be scanned again for each line of the first
        ligand_2.seek(0)
        for line_2 in ligand_2:
            line_2 = line_2.strip()
            numbers = line_2.split()
            if line_1 == numbers[1]:
                # if the second number from this line matches the number that is
                # in the user's file, print all the numbers from this line
                print ' '.join(numbers)
which is more reliable and I believe more easily read.
Note that the algorithmic performance of this is far from ideal because of the nested loops. Depending on your needs, this could potentially be improved, but I don't know exactly what data you need to extract, so I can't tell you whether it can be.
The time this takes currently in my code and yours is O(nmq), where n is the number of lines in one file, m is the number of lines in the other, and q is the length of lines in unique_count_c_from_ac.txt. If two of these are fixed/small, then you have linear performance. If two can grow arbitrarily (I sort of imagine n and m can?), then you could look into improving your algorithm, probably using sets or dicts.
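For instance, if you only ever compare against the second number on each line, a set of the values from the user's file makes the lookup O(1) per line, so the whole thing becomes roughly O(n + m). A sketch along those lines, reusing the file names from the code above:

import sys

# collect the values from the user's file once
with open(sys.argv[1]) as user_file:
    wanted = set(line.strip() for line in user_file)

# single pass over the second file, constant-time membership test per line
with open("unique_count_c_from_ac.txt") as data_file:
    for line in data_file:
        numbers = line.split()
        if len(numbers) > 1 and numbers[1] in wanted:
            print ' '.join(numbers)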