I have a dictionary with 400,000 items in it, whose keys are DNA names and values are DNA sequences.
I want to divide the dictionary into 40 text files with 10,000 items in each of the files.
Here is my code:
record_dict   # my DNA dictionary
keys_in_dict  # the list of the keys
for keys in keys_in_dict:
    outhandle = open("D:\\Research\\Transcriptome_sequences\\input{0}.fasta".format(?????), "w")
What should I put in place of (?????)? How do I finish this loop?
UPDATE:
Hey fellows,
Thank you for your help. Now I can make multiple files from a dictionary. However, when I tried to make multiple files directly from the original file instead of making a dictionary first, I had problems. The code only generates one file with the first item in it. What did I do wrong? Here is my code:
from Bio import SeqIO

handle = open("D:/Research/Transcriptome_sequences/differentially_expressed_genes.fasta", "rU")
filesize = 100  # number of entries per file
filenum = 0
itemcount = 0
for record in SeqIO.parse(handle, "fasta"):
    if not itemcount % filesize:
        outhandle = open("D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum), "w")
        SeqIO.write(record, outhandle, "fasta")
        filenum += 1
        itemcount += 1
outhandle.close()
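The likely problem, given the indentation shown: the SeqIO.write call and the counter updates only run inside the if block, so itemcount stops at 1 after the first record and no further file is ever opened. A minimal corrected sketch (same paths and variables, with the write and the counters moved to the loop level):

from Bio import SeqIO

handle = open("D:/Research/Transcriptome_sequences/differentially_expressed_genes.fasta", "rU")
filesize = 100  # number of entries per file
filenum = 0
outhandle = None

for itemcount, record in enumerate(SeqIO.parse(handle, "fasta")):
    if not itemcount % filesize:
        if outhandle:
            outhandle.close()
        outhandle = open("D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum), "w")
        filenum += 1
    SeqIO.write(record, outhandle, "fasta")  # write every record, not only the first

outhandle.close()
handle.close()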
n = 10000
sections = (record_dict.items()[i:i+n] for i in xrange(0, len(record_dict), n))
for ind, sec in enumerate(sections):
    with open("D:/Research/Transcriptome_sequences/input{0}.fasta".format(ind), "w") as f1:
        for k, v in sec:
            f1.write("{} {}\n".format(k, v))
It will not be the fastest solution, but I think the most straightforward way is to keep track of lines and open a new file every 10,000 iterations through the loop.
I assume you are writing out fasta or something.
Otherwise, you could slice the list [:10000] beforehand and generate a chunk of output to write all at once with one command, which would be much faster (see the sketch after the EDIT below). Even as it is, you might want to build up the string by concatenating through the loop and then write that one large string out with a single .write call for each file.
itemcount = 0
filesize = 10000
filenum = 0
filehandle = ""
for keys in keys_in_dict:
    # check if it is time to open a new file,
    # whenever itemcount/filesize has no remainder
    if not itemcount % filesize:
        if filehandle:
            filehandle.close()
        filenum += 1
        PathToFile = "D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum)
        filehandle = open(PathToFile, 'w')
    filehandle.write(">{0}\n{1}\n".format(keys, record_dict[keys]))
    itemcount += 1
filehandle.close()
EDIT: Here is a more efficient way to do it (time-wise, not memory-wise), writing only once per file (40 writes total) instead of once per entry (400,000 writes). As always, check your output, especially making sure that the first and last sequences are included and that the last file is written properly.
filesize = 10  # number of entries per file (use 10000 for the full data set)
filenum = 0
filehandle = ""
OutString = ""
for itemcount, keys in enumerate(keys_in_dict):
    # check if it is time to open a new file,
    # whenever itemcount/filesize has no remainder
    if not itemcount % filesize:
        if filehandle:
            filehandle.write(OutString)
            filehandle.close()
            OutString = ""
        filenum += 1
        PathToFile = "D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum)
        filehandle = open(PathToFile, 'w')
    OutString += ">{0}\n{1}\n".format(keys, record_dict[keys])
# write whatever is still buffered to the last file
filehandle.write(OutString)
filehandle.close()
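For completeness, here is a hedged sketch of the slicing idea mentioned at the top of this answer: cut keys_in_dict into 10,000-key slices up front and emit each file with a single .write call (same variable names and fasta formatting as above):

filesize = 10000
for filenum, start in enumerate(range(0, len(keys_in_dict), filesize), 1):
    chunk = keys_in_dict[start:start + filesize]  # slice the key list beforehand
    OutString = "".join(">{0}\n{1}\n".format(k, record_dict[k]) for k in chunk)
    with open("D:/Research/Transcriptome_sequences/input{0}.fasta".format(filenum), 'w') as filehandle:
        filehandle.write(OutString)  # one write per file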
Making use of the built-in function itertools.tee could solve this elegantly.
import itertools

for (idx, keys2) in enumerate(itertools.tee(keys_in_dict, 40)):
    with open('filename_prefix_%02d.fasta' % idx, 'w') as fout:
        for key in keys2:
            fout.write(...)
Quoted from the doc for your reference:
itertools.tee(iterable[, n=2]): Return n independent iterators from a single iterable.
Once tee() has made a split, the original iterable should not be used anywhere else; otherwise, the iterable could get advanced without the tee objects being informed.
This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee().
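If the goal is 40 consecutive chunks of keys rather than 40 independent copies of the whole list, itertools.islice is the closer fit; a minimal sketch reusing the filename pattern above (the fasta-style write is an assumption, since the original leaves fout.write(...) open):

from itertools import islice

keys_iter = iter(keys_in_dict)
for idx in range(40):
    chunk = list(islice(keys_iter, 10000))  # the next 10,000 keys
    if not chunk:
        break
    with open('filename_prefix_%02d.fasta' % idx, 'w') as fout:
        for key in chunk:
            fout.write(">{0}\n{1}\n".format(key, record_dict[key]))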
Context: I have a file with ~44 million rows. Each is an individual with US address, so there's a "ZIP Code" field. File is txt, pipe-delimited.
Due to its size, I cannot (at least on my machine) use Pandas to analyze it. So a basic question I have is: how many records (rows) are there for each distinct ZIP code? I took the following steps, but I wonder if there's a faster, more Pythonic way to do this (it seems like there is, I just don't know it).
Step 1: Create a set for ZIP values from file:
output = set()
with open(filename) as f:
    for line in f:
        output.add(line.split('|')[8])  # 9th item in the split string is the "ZIP" value

zip_list = list(output)  # list has length 45,292
Step 2: Created a "0" list, same length as first list:
zero_zip = [0]*len(zip_list)
Step 3: Create a dictionary (with all zero values) from those two lists:
zip_dict = dict(zip(zip_list, zero_zip))
Step 4: Lastly I ran through the file again, this time updating the dict I just created:
with open(filename) as f:
    next(f)  # skip first line, which contains headers
    for line in f:
        zip_dict[line.split('|')[8]] += 1
I got the end result but wondering if there's a simpler way. Thanks all.
Creating zip_dict by hand can be replaced with a defaultdict. Since you run through every line in the file anyway, you don't need to do it twice; you can just keep a running count.
from collections import defaultdict

d = defaultdict(int)
with open(filename) as f:
    for line in f:
        parts = line.split('|')
        d[parts[8]] += 1
This is simple using the built-in Counter class.
from collections import Counter

with open(filename) as f:
    c = Counter(line.split('|')[8] for line in f)
print(c)
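If you also want the heaviest ZIP codes first, Counter provides most_common(); a small usage sketch on the counter built above:

for zip_code, count in c.most_common(10):  # the ten most frequent ZIP codes
    print(zip_code, count)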
I have a file which has integers in the first two columns.
File Name : file.txt
col_a,col_b
1001021,1010045
2001021,2010045
3001021,3010045
4001021,4010045 and so on
Now, using Python, I get a variable var_a = 2002000.
How do I find the range within which this var_a lies in "file.txt"?
Expected Output : 2001021,2010045
I have tried the following:
With open("file.txt","r") as a:
a_line = a.readlines()
for line in a_line:
line_sp = line.split(',')
if var_a < line_sp[0] and var_a > line_sp[1]:
print ('%r, %r', %(line_sp[0], line_sp[1])
Since the file has more than a million records, this makes it time consuming. Is there any better way to do the same without a for loop?
Since the file has more than a million records, this makes it time consuming. Is there any better way to do the same without a for loop?
Unfortunately you have to iterate over all records in the file, and the only way you can achieve that is some kind of for loop. So the complexity of this task will always be at least O(n).
It is better to read your file line by line (not all into memory) and store its content as range objects, so you can look up multiple numbers. Ranges are stored quite efficiently, and you only have to read your file once to check more than one number.
Since Python 3.7, dictionaries are insertion ordered, so if your file is sorted you will only iterate the dictionary until the first range that contains the number; for numbers that are not in any range, you iterate the whole dictionary.
Create file:
fn = "n.txt"
with open(fn, "w") as f:
f.write("""1001021,1010045
2001021,2010045
3001021,3010045
garbage
4001021,4010045""")
Process file:
fn = "n.txt"
# read in
data = {}
with open(fn) as f:
for nr,line in enumerate(f):
line = line.strip()
if line:
try:
start,stop = map(int, line.split(","))
data[nr] = range(start,stop+1)
except ValueError as e:
pass # print(f"Bad data ({e}) in line {nr}")
look_for_nums = [800, 1001021, 3001039, 4010043, 9999999]

for look_for in look_for_nums:
    items_checked = 0
    for nr, rng in data.items():
        items_checked += 1
        if look_for in rng:
            print(f"Found {look_for} in line {nr} in range: {rng.start},{rng.stop-1}", end=" ")
            break
    else:
        print(f"{look_for} not found", end=" ")
    print(f"after {items_checked} checks")
Output:
800 not found after 4 checks
Found 1001021 in line 0 in range: 1001021,1010045 after 1 checks
Found 3001039 in line 2 in range: 3001021,3010045 after 3 checks
Found 4010043 in line 5 in range: 4001021,4010045 after 4 checks
9999999 not found after 4 checks
There are better ways to store such a ranges file, e.g. in a tree-like data structure; research k-d trees to get even faster results if you need them. They partition the ranges in a smarter way, so you do not need a linear search to find the right bucket.
This answer to "Data Structure to store Integer Range, Query the ranges and modify the ranges" provides more things to research.
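If the ranges in your file are sorted and non-overlapping, as in the sample data, a binary search over the start values is a simple middle ground before reaching for trees; here is a hedged sketch with the bisect module, reusing the data dict built above (the find_range helper is my own naming):

import bisect

# sort the ranges by their start value (cheap, done once)
ranges = sorted(data.values(), key=lambda r: r.start)
starts = [r.start for r in ranges]

def find_range(num):
    i = bisect.bisect_right(starts, num) - 1  # last range starting at or before num
    if i >= 0 and num in ranges[i]:
        return ranges[i]
    return None  # num is not inside any range

print(find_range(3001039))  # range(3001021, 3010046)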
Assuming each line in the file has the correct format, you can do something like the following.
var_a = 2002000

with open("file.txt") as file:
    for l in file:
        a, b = map(int, l.split(',', 1))  # each line must have only two comma-separated numbers
        if a < var_a < b:
            print(l)  # use the line as you want
            break     # if you need only the first occurrence, break the loop now
Note that you'll have to do additional verifications/workarounds if the file format is not guaranteed.
Obviously you have to iterate through all the lines (in the worst case). But we don't load all the lines into memory at once. So as soon as the answer is found, the rest of the file is ignored without reading (assuming you are looking only for the first match).
I am trying to use python to find four-line blocks of interest in two separate files then print out some of those lines in controlled order. Below are the two input files and an example of the desired output file. Note that the DNA sequence in the Input.fasta is different than the DNA sequence in Input.fastq because the .fasta file has been read corrected.
Input.fasta
>read1
AAAGGCTGT
>read2
AGTCTTTAT
>read3
CGTGCCGCT
Input.fastq
@read1
AAATGCTGT
+
'(''%$'))
@read2
AGTCTCTAT
+
&---+2010
@read3
AGTGTCGCT
+
0-23;:677
DesiredOutput.fastq
@read1
AAAGGCTGT
+
'(''%$'))
@read2
AGTCTTTAT
+
&---+2010
@read3
CGTGCCGCT
+
0-23;:677
Basically I need the sequence lines "AAAGGCTGT", "AGTCTTTAT", and "CGTGCCGCT" from Input.fasta and all other lines from Input.fastq. This allows the restoration of quality information to a read-corrected .fasta file.
Here is my closest failed attempt:
fastq = open(Input.fastq, "r")
fasta = open(Input.fasta, "r")
ReadIDs = []
IDs = []
with fastq as fq:
for line in fq:
if "read" in line:
ReadIDs.append(line)
print(line.strip())
for ID in ReadIDs:
IDs.append(ID[1:6])
with fasta as fa:
for line in fa:
if any(string in line for string in IDs):
print(next(fa).strip())
next(fq)
print(next(fq).strip())
print(next(fq).strip())
I think I am running into trouble by trying to nest "with" calls to two different files in the same loop. This prints the desired lines for read1 correctly but does not continue to iterate through the remaining lines, and it throws "ValueError: I/O operation on closed file".
I suggest you use Biopython; it will save you a lot of trouble, as it provides nice parsers for these file formats that handle not only the standard cases but also, for example, multi-line fasta.
Here is an implementation that replaces the fastq sequence lines with the corresponding fasta sequence lines:
from Bio import SeqIO

fasta_dict = {record.id: record.seq for record in
              SeqIO.parse('Input.fasta', 'fasta')}

def yield_records():
    for record in SeqIO.parse('Input.fastq', 'fastq'):
        record.seq = fasta_dict[record.id]
        yield record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')
If you don't want to use the headers but just rely on the order, the solution is even simpler and more memory efficient (just make sure the order and number of records are the same): there is no need to define the dictionary first, just iterate over the records together:
fasta_records = SeqIO.parse('Input.fasta', 'fasta')
fastq_records = SeqIO.parse('Input.fastq', 'fastq')

def yield_records():
    for fasta_record, fastq_record in zip(fasta_records, fastq_records):
        fastq_record.seq = fasta_record.seq
        yield fastq_record

SeqIO.write(yield_records(), 'DesiredOutput.fastq', 'fastq')  # write out as before
## Open the files (and close them after the 'with' block ends)
with open("Input.fastq", "r") as fq, open("Input.fasta", "r") as fa:
    ## Read in the Input.fastq file and save its content to a list
    fastq = fq.readlines()
    ## Do the same for the Input.fasta file
    fasta = fa.readlines()

## For every read: 4 lines in the fastq, 2 lines in the fasta
for i in range(len(fastq) // 4):
    print(fastq[4 * i], end="")      ## header line from the fastq
    print(fasta[2 * i + 1], end="")  ## corrected sequence line from the fasta
    print(fastq[4 * i + 2], end="")  ## the '+' separator line
    print(fastq[4 * i + 3], end="")  ## quality line from the fastq
I like the Biopython solution by @Chris_Rands better for small files, but here is a solution that only uses the batteries included with Python and is memory efficient. It assumes the fasta and fastq files to contain the same number of reads in the same order.
with open('Input.fasta') as fasta, open('Input.fastq') as fastq, open('DesiredOutput.fastq', 'w') as fo:
    for i, line in enumerate(fastq):
        if i % 4 == 1:
            # replace the fastq sequence line with the corrected one from the fasta
            for j in range(2):  # skip the fasta header, keep the sequence line
                line = fasta.readline()
        print(line, end='', file=fo)
Is there a way to do this? Say I have a file that's a list of names that goes like this:
Alfred
Bill
Donald
How could I insert the third name, "Charlie", at line x (in this case 3), and automatically move all the following names down one line? I've seen other questions like this, but they didn't get helpful answers. Can it be done, preferably with either a method or a loop?
This is a way of doing the trick.
with open("path_to_file", "r") as f:
contents = f.readlines()
contents.insert(index, value)
with open("path_to_file", "w") as f:
contents = "".join(contents)
f.write(contents)
Here index and value are the line number and value of your choice, with lines starting from 0.
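For the example in the question, the two variables plugged into the snippet above would be (0-based indexing, and the trailing newline matters):

index = 2            # "Charlie" becomes the third line
value = "Charlie\n"  # keep the newline so "Donald" stays on its own line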
If you want to search a file for a substring and add new text on the next line, an elegant way to do it is the following:
import os, fileinput

old = "A"
new = "B"

for line in fileinput.FileInput(file_path, inplace=True):
    if old in line:
        line += new + os.linesep
    print(line, end="")
There is a combination of techniques which I found useful in solving this issue:
with open(file, 'r+') as fd:
    contents = fd.readlines()
    contents.insert(index, new_string)  # new_string should end in a newline
    fd.seek(0)  # readlines consumes the iterator, so we need to start over
    fd.writelines(contents)  # no need to truncate as we are increasing filesize
In our particular application, we wanted to add it after a certain string:
with open(file, 'r+') as fd:
    contents = fd.readlines()
    if match_string in contents[-1]:  # handle last line to prevent IndexError
        contents.append(insert_string)
    else:
        for index, line in enumerate(contents):
            if match_string in line and insert_string not in contents[index + 1]:
                contents.insert(index + 1, insert_string)
                break
    fd.seek(0)
    fd.writelines(contents)
If you want it to insert the string after every instance of the match, instead of just the first, remove the else: (and properly unindent) and the break.
Note also that the and insert_string not in contents[index + 1]: prevents it from adding more than one copy after the match_string, so it's safe to run repeatedly.
You can just read the data into a list and insert the new record where you want.
names = []
with open('names.txt', 'r+') as fd:
    for line in fd:
        names.append(line.split(' ')[-1].strip())
    names.insert(2, "Charlie")  # element 2 will be the 3rd entry in your list
    fd.seek(0)
    fd.truncate()
    for i in xrange(len(names)):
        fd.write("%d. %s\n" % (i + 1, names[i]))
The accepted answer has to load the whole file into memory, which doesn't work nicely for large files. The following solution writes the file contents with the new data inserted into the right line to a temporary file in the same directory (so on the same file system), only reading small chunks from the source file at a time. It then overwrites the source file with the contents of the temporary file in an efficient way (Python 3.8+).
from pathlib import Path
from shutil import copyfile
from tempfile import NamedTemporaryFile

sourcefile = Path("/path/to/source").resolve()
insert_lineno = 152  # The line to insert the new data into.
insert_data = "..."  # Some string to insert.

with sourcefile.open(mode="r") as source:
    destination = NamedTemporaryFile(mode="w", dir=str(sourcefile.parent))
    lineno = 1
    while lineno < insert_lineno:
        destination.file.write(source.readline())
        lineno += 1

    # Insert the new data.
    destination.file.write(insert_data)

    # Write the rest in chunks.
    while True:
        data = source.read(1024)
        if not data:
            break
        destination.file.write(data)

# Finish writing data.
destination.flush()

# Overwrite the original file's contents with that of the temporary file.
# This uses a memory-optimised copy operation starting from Python 3.8.
copyfile(destination.name, str(sourcefile))

# Delete the temporary file.
destination.close()
EDIT 2020-09-08: I just found an answer on Code Review that does something similar to the above with more explanation; it might be useful to some.
You don't show us what the output should look like, so one possible interpretation is that you want this as the output:
Alfred
Bill
Charlie
Donald
(Insert Charlie, then add 1 to all subsequent lines.) Here's one possible solution:
def insert_line(input_stream, pos, new_name, output_stream):
    inserted = False
    for line in input_stream:
        number, name = parse_line(line)
        if number == pos:
            print >> output_stream, format_line(number, new_name)
            inserted = True
        print >> output_stream, format_line(number if not inserted else (number + 1), name)

def parse_line(line):
    number_str, name = line.strip().split()
    return (get_number(number_str), name)

def get_number(number_str):
    return int(number_str.split('.')[0])

def format_line(number, name):
    return add_dot(number) + ' ' + name

def add_dot(number):
    return str(number) + '.'

input_stream = open('input.txt', 'r')
output_stream = open('output.txt', 'w')
insert_line(input_stream, 3, 'Charlie', output_stream)
input_stream.close()
output_stream.close()
Parse the file into a Python list using file.readlines() or file.read().split('\n').
Identify the position where you have to insert a new line, according to your criteria.
Insert a new list element there using list.insert().
Write the result back to the file (a sketch of these steps follows below).
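A minimal sketch of those four steps for the names example (the file name and the criterion of inserting before "Donald" are only illustrations):

with open('names.txt') as f:
    lines = f.readlines()                      # 1. parse the file into a list

position = lines.index('Donald\n')             # 2. find the position by your own criteria
lines.insert(position, 'Charlie\n')            # 3. insert the new element there

with open('names.txt', 'w') as f:
    f.writelines(lines)                        # 4. write the result back to the file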
location_of_line = 0

with open(filename, 'r') as file_you_want_to_read:
    # read the lines of the file into a list
    contents = file_you_want_to_read.readlines()

# find the location of the line you want to insert after
for index, line in enumerate(contents):
    if line.startswith('whatever you are looking for'):
        location_of_line = index

# now you have a list of every line in that file
contents.insert(location_of_line, "whatever you want to append to middle of file")

with open(filename, 'w') as file_to_write_to:
    file_to_write_to.writelines(contents)
That is how I ended up getting whatever data I want inserted into the middle of the file.
This is just pseudo code, as I was having a hard time finding a clear explanation of what is going on.
Essentially you read the file in its entirety into a list, insert the lines you want into that list, and then rewrite the same file.
I am sure there are better ways to do this, and it may not be efficient, but it makes more sense to me at least; I hope it makes sense to someone else.
A simple but not efficient way is to read the whole content, change it and then rewrite it:
line_index = 3
lines = None

with open('file.txt', 'r') as file_handler:
    lines = file_handler.readlines()

lines.insert(line_index, 'Charlie\n')

with open('file.txt', 'w') as file_handler:
    file_handler.writelines(lines)
I am writing this in order to reuse/correct martincho's answer (the accepted one).
IMPORTANT: this code loads the whole file into RAM and then rewrites the content back to the file.
The variables index and value may be whatever you desire, but make sure value is a string ending with '\n' if you don't want it to mess with the existing data.
with open("path_to_file", "r+") as f:
# Read the content into a variable
contents = f.readlines()
contents.insert(index, value)
# Reset the reader's location (in bytes)
f.seek(0)
# Rewrite the content to the file
f.writelines(contents)
See the Python docs about the file.seek method.
Below is a slightly awkward solution for the special case in which you are creating the original file yourself and happen to know the insertion location (e.g. you know ahead of time that you will need to insert a line with an additional name before the third line, but won't know the name until after you've fetched and written the rest of the names). Reading, storing and then re-writing the entire contents of the file as described in other answers is, I think, more elegant than this option, but may be undesirable for large files.
You can leave a buffer of invisible null characters ('\0') at the insertion location to be overwritten later:
num_names = 1_000_000 # Enough data to make storing in a list unideal
max_len = 20 # The maximum allowed length of the inserted line
line_to_insert = 2 # The third line is at index 2 (0-based indexing)
with open(filename, 'w+') as file:
for i in range(line_to_insert):
name = get_name(i) # Returns 'Alfred' for i = 0, etc.
file.write(F'{i + 1}. {name}\n')
insert_position = file.tell() # Position to jump back to for insertion
file.write('\0' * max_len + '\n') # Buffer will show up as a blank line
for i in range(line_to_insert, num_names):
name = get_name(i)
file.write(F'{i + 2}. {name}\n') # Line numbering now bumped up by 1.
# Later, once you have the name to insert...
with open(filename, 'r+') as file: # Must use 'r+' to write to middle of file
file.seek(insert_position) # Move stream to the insertion line
name = get_bonus_name() # This lucky winner jumps up to 3rd place
new_line = F'{line_to_insert + 1}. {name}'
file.write(new_line[:max_len]) # Slice so you don't overwrite next line
Unfortunately there is no way to delete-without-replacement any excess null characters that did not get overwritten (or in general any characters anywhere in the middle of a file), unless you then re-write everything that follows. But the null characters will not affect how your file looks to a human (they have zero width).
I routinely use PowerShell to split larger text or csv files into smaller files for quicker processing. However, I have a few files that come over in an unusual format. These are basically print files dumped to a text file. Each record starts with a single line that starts with a 1, and there is nothing else on that line.
What I need to be able to do is split a file based on the number of statements. So, basically, if I want to split the file into chunks of 3000 statements, I would go down until I see the 3001st occurrence of a 1 in position 1 and copy everything before that to the new file. I can run this from Windows, Linux, or OS X, so pretty much anything is open for the split.
Any ideas would be greatly appreciated.
Maybe try recognizing records by the fact that there is a '1' plus a newline?
with open(input_file, 'r') as f:
    my_string = f.read()

my_list = my_string.split('\n1\n')
This separates the records into a list, assuming the file has the following format:
1
....
....
1
....
....
....
You can then output each element in the list to a separate file.
for x, record in enumerate(my_list):
    with open(str(x) + '.txt', 'w') as out:
        out.write(record)
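Since the question actually asks for chunks of 3000 statements rather than one file per record, the same my_list can also be re-joined in groups; a hedged sketch (the chunk size and output names are assumptions):

chunk = 3000
# split() consumed the '1' marker lines, so put them back on every record after the first
records = [my_list[0]] + ['1\n' + rec for rec in my_list[1:]]

for n in range(0, len(records), chunk):
    with open('chunk{0}.txt'.format(n // chunk), 'w') as out:
        out.write('\n'.join(r.rstrip('\n') for r in records[n:n + chunk]) + '\n')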
To avoid loading the whole file in memory, you could define a function that generates records incrementally and then use itertools' grouper recipe to write every 3000 records to a new file:
#!/usr/bin/env python3
from itertools import zip_longest

with open('input.txt') as input_file:
    files = zip_longest(*[generate_records(input_file)]*3000, fillvalue=())
    for n, records in enumerate(files):
        with open('output{n}.txt'.format(n=n), 'w') as output_file:
            output_file.writelines(line for r in records for line in r)
where generate_records() yields one record at a time, and each record is itself an iterator over lines in the input file:
from itertools import chain

def generate_records(input_file, start='1\n', eof=[]):
    def record(yield_start=True):
        if yield_start:
            yield start
        for line in input_file:
            if line == start:  # start new record
                break
            yield line
        else:  # EOF
            eof.append(True)

    # the first record may include lines before the first 1\n
    yield chain(record(yield_start=False),
                record())
    while not eof:
        yield record()
generate_records() is a generator that yields generators, like itertools.groupby() does.
For performance reasons, you could read/write chunks of multiple lines at once.