I have this huge (61GB) FASTQ file of which I want to create a random subset, but which I cannot load into memory. The problem with FASTQs is that every four lines belong together, otherwise I would just create a list of random integers and only write the lines at these integers to my subset file.
So far, I have this:
import random
num = []
while len(num) < 50000000:
    ran = random.randint(0,27000000)
    if (ran%4 == 0) and (ran not in num):
        num.append(ran)
num = sorted(num)
fastq = open("all.fastq", "r", 4)
subset = open("sub.fastq", "w")
for i,line in enumerate(fastq):
    for ran in num:
        if ran == i:
            subset.append(line)
I have no idea how to reach the next three lines in the file before going to the next random integer. Can someone help me?
Iterate over the file in chunks of four lines.
Take a random sample from that iterator.
The idea is that you can sample from a generator without random access, by iterating through it and choosing (or not) each element in turn.
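A minimal sketch of those two steps, assuming the file is well-formed FASTQ (exactly four lines per record) and that the sampled records fit in memory even though the whole file does not. The names fastq_records, reservoir_sample and write_subset are made up for this illustration; the sampling step is standard reservoir sampling.

```python
import random
from itertools import islice

def fastq_records(handle):
    """Yield one FASTQ record at a time as a list of four lines."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return  # end of file (a trailing partial record is dropped)
        yield record

def reservoir_sample(iterable, k, rng=random):
    """Return up to k items chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for index, item in enumerate(iterable):
        if index < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            # Keep the new item with probability k / (index + 1).
            slot = rng.randrange(index + 1)
            if slot < k:
                reservoir[slot] = item
    return reservoir

def write_subset(src_path, dst_path, k):
    """Write k randomly chosen records from src_path to dst_path."""
    with open(src_path) as fastq, open(dst_path, "w") as subset:
        for record in reservoir_sample(fastq_records(fastq), k):
            subset.writelines(record)
```

Calling write_subset("all.fastq", "sub.fastq", 100000) would read the input once from start to finish. Note that the records come out in reservoir order, not file order.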
You could try this:
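import random
# Pick record start lines (multiples of 4). // keeps the arguments integers
# on Python 3, and the set removes duplicate picks.
starts = set(random.randint(0, 27000000 // 4) * 4 for i in range(50000000 // 4))
num = sorted(starts, reverse=True)  # smallest start is last, so pop() is cheap
lines_to_write = 0
with open("all.fastq", "r") as fastq:
    with open("sub.fastq", "w") as subset:
        for i, line in enumerate(fastq):
            if num and i == num[-1]:
                num.pop()
                lines_to_write = 4
            if lines_to_write > 0:
                lines_to_write -= 1
                subset.write(line)
            elif not num:
                # stop only after the last record has been written in full
                break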
file = open('funny_file.txt', 'r')
list = []
for i in file:
    list.append(int(i[:-1]))
print(len([num for num in list for i in range(100) if num == 3**i]))
The result is correct, but is there an easier way to do this?
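Assuming the largest exponent is 100 (from your range(100), although strictly that stops at 3^99), you can do the check with a single modulo operation. First calculate 3^100 = 515377520732011331036461129765621272702107522001. Because 3 is prime, the only positive divisors of this number are the powers of 3 up to 3^100, so dividing it by any of them leaves a remainder of 0, and dividing it by any other positive integer does not. So you could write:
def check_power_of_three(n):
    # n must be positive: 0 would raise ZeroDivisionError,
    # and negative powers of 3 would also divide evenly
    return n > 0 and 515377520732011331036461129765621272702107522001 % n == 0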
Now, about your code: I don't see why you use two loops when you can achieve the same thing with just one.
file = open('funny_file.txt', 'r')
numList = [] # I changed the name to avoid conflicts with the `list` type
powerOfThree = []
for i in file:
    n = int(i[:-1])
    numList.append(n)
    if check_power_of_three(n):
        powerOfThree.append(n)
print(len(powerOfThree))
# Also remember to close() the file
file.close()
Outputs:
1
^ Which is 27
I'm assuming by 'square of the 3', you mean powers of three.
There are some things going on in your print statement that I would recommend revising. Generally, when you are interested in achieving a particular result, it makes sense to create a separate function that encapsulates that functionality. In this case, it could be a function that tells us whether its input is a power of three. Using the logarithm with base three, you should be able to come up with an implementation of that function.
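One possible shape for such a function, as a sketch only: the name is_power_of_three is made up here, and the input is assumed to be an integer. Because floating-point logarithms are not exact, the logarithm is used only to guess the exponent, and the guess is then confirmed with exact integer arithmetic.

```python
import math

def is_power_of_three(n):
    """Return True if the integer n equals 3**k for some k >= 0."""
    if n < 1:
        return False  # zero and negative numbers are never powers of three
    # The logarithm gives a candidate exponent; rounding absorbs float error.
    exponent = round(math.log(n, 3))
    # Confirm the candidate exactly, since the log itself is approximate.
    return 3 ** exponent == n

print(is_power_of_three(27))  # True
print(is_power_of_three(28))  # False
```

With this helper, the count from the question could be written as sum(1 for num in numbers if is_power_of_three(num)), where numbers is the list read from the file.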
One way to optimize the code is to skip the computation for numbers that have already been checked, like below:
file = open('funny_file.txt', 'r')
list = []
for i in file:
    list.append(int(i[:-1]))
powerof_3_nums = set()  # sets give fast membership tests
not_power_of_3 = set()
count = 0
for num in list:
    if num in not_power_of_3:
        continue
    if num not in powerof_3_nums:
        # first time we see this number: do the expensive check once
        if any(num == 3**i for i in range(100)):
            powerof_3_nums.add(num)
        else:
            not_power_of_3.add(num)
            continue
    count += 1  # repeated powers of three are still counted, as in the question
print(count)
I can't seem to find anything about this, as I can't really word it correctly. Basically, when I bubbleSort my list, it sorts the negatives from the highest down to the lowest and then reads in the positives. Let's say the list is [1.6397, -2.0215, -0.4933, -3.4167]; it will sort as [-0.4933, -2.0215, -3.4167, 1.6397].
My algorithm is:
def bubbleSort(array):
    n=len(array)
    for i in range(n):
        for j in range(0, n-i-1):
            while array[j]>array[j+1]:
                array[j], array[j+1] = array[j+1], array[j]
I want this to read like
[-3.4167, -2.0215, -0.4933, 1.6397]
thanks in advance
additional:
example=[]
with open('ex.txt') as f:
    for l in f:
        l=l.strip()
        example.append(l)
examplesearch=input('select array to sort')
if (examplesearch == ex1):
    bubbleSort(example)
    print(example)
Since you are reading from a file, the numbers you read are strings, so Python uses string comparison (as is evident from the result you are getting).
You have to convert to float first:
example = []
with open('ex.txt') as f:
    for l in f:
        l = float(l.strip())
        example.append(l)
BTW, you don't need to pre-create the list:
with open('ex.txt') as f:
    # 'if number.strip()' skips blank lines, such as an empty line at the end of the file
    output = [float(number.strip()) for number in f if number.strip()]
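To see why string comparison produces exactly the order reported in the question, here is a small illustration using the four values from the question (the variable names are made up). Strings are compared character by character, and "-" sorts before any digit, so all the negative values come first, ordered by the digit that follows the minus sign.

```python
# The values as they arrive when a file is read line by line: plain strings.
values = ["1.6397", "-2.0215", "-0.4933", "-3.4167"]

as_strings = sorted(values)                   # character-by-character comparison
as_floats = sorted(float(v) for v in values)  # numeric comparison

print(as_strings)  # ['-0.4933', '-2.0215', '-3.4167', '1.6397']
print(as_floats)   # [-3.4167, -2.0215, -0.4933, 1.6397]
```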
Hello, I am a Python beginner and I have a simple question.
I've been asked to write a generator to traverse a txt file; each row in the file holds the 3 coordinates of a point (x, y, z).
How can I return 5 points (5 lines) each time next() is called?
Here is my code; I can only generate one line at a time.
Thanks a lot!
import itertools
def gen_points(name):
    f=open(name)
    for l in f:
        clean_string=l.rstrip()
        x,y,z=clean_string.split()
        x= float(x)
        y= float(y)
        z= float(z)
        yield x,y,z
    f.close()
file_name=r"D:\master ppt\Spatial Analysis\data\22.txt"
a=gen_points(file_name)
g=itertools.cycle(a)
print(next(g))
Just wait until you have five items in your list of triplets, and yield that instead:
def gen_points(name):
    with open(name) as f:
        five_items = []
        for l in f:
            five_items.append(tuple(map(float,l.split())))
            if len(five_items) == 5:
                yield five_items
                # create another empty object
                five_items = []
        if five_items:
            yield five_items
Also yield at the end of the loop if the list is not empty, to avoid losing the last elements when the number of elements isn't divisible by 5.
Aside: clean_string=l.rstrip() was useless since split already takes care of linefeeds and whatnot.
You don't have to yield immediately, so hold onto the output and then yield it later:
## Adding batchsize so you can change it on the fly if you need to
def gen_points(name, batchsize = 5):
    ## The "with" statement is better practice
    with open(name,'r') as f:
        ## A container to hold the output
        output = list()
        for l in f:
            clean_string=l.rstrip()
            x,y,z=clean_string.split()
            x= float(x)
            y= float(y)
            z= float(z)
            ## Store instead of yielding
            output.append([x,y,z])
            ## Check if we have enough lines
            if len(output) >= batchsize:
                yield output
                ## Clear list for next batch
                output = list()
        ## In case you have an "odd" number of points (i.e.- 23 points in this situation)
        if output: yield output
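As an alternative to collecting items by hand, the batching can be delegated to itertools.islice. This is a sketch rather than a drop-in replacement: the names parse_point and batched_points are made up, and the function takes any iterable of text lines (an open file works) instead of a file name.

```python
from itertools import islice

def parse_point(line):
    """Turn a line such as '1.0 2.0 3.0' into a tuple of floats."""
    return tuple(map(float, line.split()))

def batched_points(lines, batchsize=5):
    """Yield lists of up to `batchsize` points from an iterable of text lines."""
    # Lazy generator: lines are parsed only as batches are requested.
    points = (parse_point(line) for line in lines if line.strip())
    while True:
        batch = list(islice(points, batchsize))
        if not batch:
            return  # input exhausted
        yield batch  # the final batch may be shorter than batchsize
```

Used as "with open(file_name) as f: g = batched_points(f)", each next(g) returns the next five points, and the last call returns whatever is left over.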
I'm using random.randint to generate a random number, and then assigning that number to a variable. Then I want to print the line with the number I assigned to the variable, but I keep getting the error:
list index out of range
Here's what I tried:
f = open(filename. txt)
lines = f.readlines()
rand_line = random. randint(1,10)
print lines[rand_line]
You want to use random.choice:
import random
with open(filename) as f:
    lines = f.readlines()
print(random.choice(lines))
To get a random line without loading the whole file in memory you can use Reservoir sampling (with sample size of 1):
from random import randrange
def get_random_line(afile, default=None):
    """Return a random line from the file (or default)."""
    line = default
    for i, aline in enumerate(afile, start=1):
        if randrange(i) == 0: # random int [0..i)
            line = aline
    return line
with open('filename.txt') as f:
    print(get_random_line(f))
This algorithm runs in O(n) time using O(1) additional space.
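Each line ends up selected with the same probability: line i is picked at step i with probability 1/i and survives every later step j with probability 1 - 1/j, which multiplies out to 1/n for a file of n lines. The sketch below checks this empirically on a four-line in-memory file. The rng parameter and the pick_counts helper are additions made for this illustration (they allow a seeded, repeatable run) and are not part of the answer's function.

```python
import io
import random
from collections import Counter

def get_random_line(afile, rng=random):
    """Same single-pass selection as above, with an injectable random source."""
    line = None
    for i, aline in enumerate(afile, start=1):
        if rng.randrange(i) == 0:  # true with probability 1/i
            line = aline
    return line

def pick_counts(text, trials, seed=0):
    """Count how often each line of `text` is selected over many trials."""
    rng = random.Random(seed)
    return Counter(get_random_line(io.StringIO(text), rng) for _ in range(trials))

counts = pick_counts("a\nb\nc\nd\n", trials=4000)
for line, count in sorted(counts.items()):
    print(repr(line), count)  # each count should be close to 1000
```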
This code is correct, assuming that you meant to pass a string to the open function and that there is no space after the dot.
However, be careful with indexing in Python: it starts at 0, not 1, and ends at len(your_list)-1.
Using random.choice is better, but if you want to follow your idea it would rather be:
import random
with open('name.txt') as f:
    lines = f.readlines()
random_int = random.randint(0,len(lines)-1)
print(lines[random_int])
Since randint includes both boundaries, the upper bound must be len(lines)-1.
import random
f = open("filename.txt")
lines = f.readlines()
rand_line = random.randint(0, (len(lines) - 1)) # https://docs.python.org/2/library/random.html#random.randint
print(lines[rand_line])
You can edit your code so that it runs without an error.
import random
f = open("filename.txt")
lines = f.readlines()
rand_line = random.randint(0,len(lines)-1) # this should make it work
print(lines[rand_line])
This way the index is not out of range.
I am trying to write a function that takes a file name as an argument, and returns the sum of the numbers in the file. Here is what I have done so far:
def sum_digits (filename):
    """
    >>> sum_digits("digits.txt")
    434
    """
    myfile = open(filename, "r")
    newfile = myfile.read()
    sum = 0
    while newfile.isdigit():
        sum += newfile%10
        newfile = newfile/10
    return sum
if __name__=="__main__":
    import doctest
    doctest.testmod(verbose=True)
But this code is not working, and I don't know how to fix it. Any ideas?
You need to split your text to get a list of numbers, then iterate over that adding them up:
nums = newfile.split()
for num in nums:
    sum += int(num)
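Put together, the whole function might look like the sketch below. It assumes the file contains integers separated by whitespace (spaces or newlines), which the question does not state explicitly; the name sum_numbers is chosen here so that the built-in sum is not shadowed by a local variable.

```python
def sum_numbers(filename):
    """Return the sum of the whitespace-separated integers in a file."""
    with open(filename, "r") as myfile:  # the file is closed automatically
        # split() with no arguments handles spaces, tabs and newlines alike
        return sum(int(num) for num in myfile.read().split())
```

If the file held single digits to be added one by one instead, the split step would need to change, for example by iterating over the characters and keeping those for which isdigit() is true.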