Read data from file into different arrays in Python

I'm creating a simple script for Blender, and I need a little help getting some data from a file I created earlier via Python.
The file is structured like this:
name first morph
values -1.0000 1.0000
data 35 0.026703 0.115768 -0.068769
data 36 -0.049349 0.015188 -0.029470
data 37 -0.042880 -0.045805 -0.039931
data 38 0.000000 0.115775 -0.068780
name second morph
values -0.6000 1.2000
data 03 0.037259 -0.046251 -0.020062
data 04 -0.010330 -0.046106 -0.019890
…
etc., 2k+ more lines ;p
What I need is a loop that reads this data line by line and puts the values into three different arrays, names[], values[], and data[], depending on the first word of each line.
Done manually, it would look like this:
names.append('first morph')
values.append( (-1.0000,1.0000))
data.append((35, 0.026703, 0.115768, -0.068769))
data.append((36, -0.049349, 0.015188, -0.029470))
data.append((37, -0.042880, -0.045805, -0.039931))
data.append((38, 0.000000, 0.115775, -0.068780))
names.append('second morph')
values.append( (-0.6000,1.2000))
…
I don't know why my attempts at that kind of 'for line in file:' loop produce more errors than complete data; they keep going out of range or failing to pick up the proper values.
Please help me automate this process instead of writing each line manually, since I've already exported the needed parameters to a file.

names = []
values = []
data = []

with open('yourfile') as lines:
    for line in lines:
        first, rest = line.split(' ', 1)
        if first == 'name':
            names.append(rest.strip())  # strip the trailing newline
        elif first == 'values':
            floats = map(float, rest.split())
            values.append(tuple(floats))
        elif first == 'data':
            int_str, floats_str = rest.split(' ', 1)
            floats = map(float, floats_str.split())
            data.append((int(int_str),) + tuple(floats))
Why do you need it like this? How will you know where the next name starts in your data and values lists?
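If you do need to keep track of which values and data belong to which name, here is a minimal sketch of the same parsing loop that groups everything per morph (my own variation on the code above; morphs is a name introduced for illustration):

morphs = []
with open('yourfile') as lines:
    for line in lines:
        first, rest = line.split(' ', 1)
        if first == 'name':
            # A 'name' line starts a new record.
            morphs.append({'name': rest.strip(), 'values': None, 'data': []})
        elif first == 'values':
            morphs[-1]['values'] = tuple(map(float, rest.split()))
        elif first == 'data':
            int_str, floats_str = rest.split(' ', 1)
            morphs[-1]['data'].append(
                (int(int_str),) + tuple(map(float, floats_str.split())))

# morphs[0]['name'] -> 'first morph'
# morphs[0]['data'][0] -> (35, 0.026703, 0.115768, -0.068769)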

Here is a simple Python "for line in" solution... afterwards you can just import the generated processed.py...
fp = open("data1.txt", "r")
data = fp.readlines()
fp1 = open("processed.py", "w")
fp1.write("names = []\nvalues = []\ndata = []\n")
for line in data:
    s = ""
    if "name" in line:
        s = "names.append('" + line[5:].strip() + "')"
    elif "values" in line:
        s = "values.append((" + ", ".join(line[7:].strip().split(" ")) + "))"
    elif "data" in line:
        s = "data.append((" + ", ".join(line[5:].strip().split(" ")) + "))"
    fp1.write(s + "\n")
fp1.close()
fp.close()
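Once processed.py exists, one way to get at the generated lists (a sketch, assuming processed.py sits in the working directory and is importable):

import processed

print(processed.names[0])    # 'first morph'
print(processed.values[0])   # (-1.0, 1.0)
print(processed.data[0])     # (35, 0.026703, 0.115768, -0.068769)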

Related

How to read strings from a list (a txt file) and print them out as ints, strings, and floats?

I've tried literally everything to make this work. What I'm trying to do is take a file, assign each line to a variable, and then set the type of each variable. Instead, characters like [ and ' are being read as if they were line entries, and I don't know what to do. I also have lists inside the file that I need to save.
My error is:
ValueError: invalid literal for int() with base 10: '['
My code is:
def load_data():
    f = open(name+".txt",'r')
    enter = str(f.readlines()).rstrip('\n')
    print(enter)
    y = enter[0]
    hp = enter[1]
    coins = enter[2]
    status = enter[3]
    y2 = enter[4]
    y3 = enter[5]
    energy = enter[6]
    stamina = enter[7]
    item1 = enter[8]
    item2 = enter[9]
    item3 = enter[10]
    equipped = enter[11]
    firstime = enter[12]
    armorpoint1 = enter[13]
    armorpoint2 = enter[14]
    armorpoints = enter[15]
    upgradepoint1 = enter[16]
    upgradepoint2 = enter[17]
    firstime3 = enter[18]
    firstime4 = enter[19]
    part2 = enter[20]
    receptionist = enter[21]
    unlocklist = enter[22]
    armorlist = enter[23]
    heal1 = enter[24]
    heal2 = enter[25]
    heal3 = enter[26]
    unlocked = enter[27]
    unlocked2 = enter[28]
    float(int(y))
    int(hp)
    int(coins)
    str(status)
    float(int(y2))
    float(int(y3))
    int(energy)
    int(stamina)
    str(item1)
    str(item2)
    str(item3)
    str(equipped)
    int(firstime)
    int(armorpoint1)
    int(armorpoint2)
    int(armorpoints)
    int(upgradepoint1)
    int(upgradepoint2)
    int(firstime3)
    int(firstime4)
    list(unlocklist)
    list(armorlist)
    int(heal1)
    int(heal2)
    int(heal3)
    f.close()
SAMPLE FILE:
35.0
110
140
Sharpshooter
31.5
33
11
13
Slimer Gun
empty
empty
Protective Clothes
0
3
15
0
3
15
0
1
False
False
['Slime Slicer', 'Slimer Gun']
['Casual Clothes', 'Protective clothes']
4
3
-1
{'Protective Clothes': True}
{'Slimer Gun': True}
The .readlines() function returns a list, each item containing a separate line. In order to strip the newline from each of the lines, you can use a list comprehension:
f = open("data.txt", "r")
lines = [line.strip() for line in f.readlines()]
You can then proceed to cast each item in the list separately, as you have, or try to somehow automatically infer the type in a loop. This would be easier if you formatted the example file more like a configuration file. This thread has some relevant answers:
Best way to retrieve variable values from a text file?
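If you'd rather infer the types automatically in a loop instead of casting each line by hand, one rough sketch (my own approach, not from the linked thread) is to try each conversion in turn:

def infer(text):
    # Try int first, then float; fall back to the original string.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

with open("data.txt") as f:
    values = [infer(line.strip()) for line in f]
# values[0] -> 35.0, values[1] -> 110, values[3] -> 'Sharpshooter'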
I think it's better if you read the file this way: first read it and strip trailing whitespace, then split it into lines. Then you can assign a variable to each line (note that you also need to assign the result of a type conversion back to the variable; int(hp) on its own does nothing).
For the lists, you may need a function to extract the list from the string. However, if you aren't expecting security breaches, then using eval() should be fine.
def load_data():
    f = open(name+".txt",'r')
    content = f.read().rstrip()
    lines = content.split("\n")
    y = float(lines[0])  # int() would fail on "35.0", so cast straight to float
    hp = int(lines[1])
    coins = int(lines[2])
    status = lines[3]
    # (etc)
    unlocklist = eval(lines[22])
    armorlist = eval(lines[23])
    f.close()
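One caveat worth adding (my note, not part of the original answer): ast.literal_eval from the standard library parses the same list/dict literals as eval() but refuses to run arbitrary code, so it is the safer choice here:

import ast

# Safely parse the literal lines from the sample file.
unlocklist = ast.literal_eval("['Slime Slicer', 'Slimer Gun']")
unlocked = ast.literal_eval("{'Protective Clothes': True}")
print(unlocklist[1])                   # Slimer Gun
print(unlocked['Protective Clothes'])  # True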

Looking for the first line of data with Python

I have a data file that looks like this, which I read in as a list of lines.
############################################################
# Tool
# File: test
#
# mass: mass in GeV
# spectrum: from 1 to 100 GeV
###########################################################
# mass (GeV) spectrum (1-100 GeV)
10 0.2822771608053263
20 0.8697454394829301
30 1.430461657476815
40 1.9349004472432392
50 2.3876849629827412
60 2.796620869276766
70 3.1726347734996727
80 3.5235401505002244
90 3.8513847250834106
100 4.157478780924807
To read the data, I would normally have to count how many lines come before the first set of numbers and then loop through the file from there. In this file it's 8 lines:
spectrum=[]
mass=[]
with open ('test.in') as m:
    test=m.readlines()
for i in range(8,len(test)):
    single_line=test[i].split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
Let's say I didn't want to open the file to check how many lines there are between the intro comments and the first line of data points. How would I make Python automatically start at the first line of data points and go through to the end of the file?
This is a general solution, but it should work in your specific case: for each line, check whether it starts with a number. Pseudo-code (see the runnable sketch below):
for line in test:
    if line.split()[0].isdigit():
        DoStuffWithData
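A minimal runnable version of that idea, assuming the whitespace-separated two-column format from the question (the isdigit() check conveniently skips the # header lines too):

spectrum = []
mass = []
with open('test.in') as m:
    for line in m:
        parts = line.split()
        # Keep only lines whose first token is a number.
        if not parts or not parts[0].isdigit():
            continue
        mass.append(float(parts[0]))
        spectrum.append(float(parts[1]))
print(mass[0], spectrum[0])  # 10.0 0.2822771608053263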
spectrum=[]
mass=[]
with open ('test.in') as m:
    test=m.readlines()
for line in test:
    if line[0] == '#':
        continue
    single_line=line.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
You can filter out all lines that start with # using a regex or the string startswith method:
import re

spectrum=[]
mass=[]
with open ('test.in') as m:
    test = [i for i in m.readlines() if not re.match("^#.*", i)]
for i in test:
    single_line = i.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
OR
spectrum = []
mass = []
with open('test.in') as m:
    test = [i for i in m.readlines() if not i.startswith("#")]
for i in test:
    single_line = i.split('\t')
    mass.append(float(single_line[0]))
    spectrum.append(float(single_line[1]))
This will filter out all the lines that start with #.
pseudo code:
for r in m:
    if r.startswith('#'):
        continue
    spt = r.split('\t')
    if len(spt) < 2:
        continue
    ## todo: .....

Python: How to split txt file data into csv according to user input

So I am reading in a .txt file that is largely similar to this: TTACGATATACGA etc., but contains thousands of characters. I can read in a file and output it as a CSV according to user input that decides the characters per column and the number of columns; however, it writes a new file each time.
Ideally I would like a format like this per file:
User enters 4 and 3.
Output: TCAG, TGCT, TACG,
My current output is this:
TCAGTGCTTACG
I have tried looking at string splitting, but I can't seem to get it to work.
Here is what I've written thus far; apologies if it's poor:
import os

#user input for parameters
user_input_character = int(input("Enter how many characters you'd like per column"))
user_input_column = int(input("Enter how many columns you'd like"))
character_per_column = user_input_character
columns_per_entry = user_input_column
characters_to_read = int((character_per_column * columns_per_entry))
print("Total characters: " + str(characters_to_read))

#counts used to set letters to be taken into intake
index_start = 0
index_finish = characters_to_read
count = 1

#open the file to be read
lines = []
test_file = open("dna.txt", "r")
for line in test_file:
    line = line.strip()
    if not line:
        continue
    lines.append(',')

#read the file and take note of its size for index purposes
read_file = test_file.read()
file_size = len(read_file)
print(file_size)

i = 1
index = 0
#use loop to make more than one file output
while(index < 50):
    #print count used to measure progress for testing
    print('the count is', count)
    count += 1
    index += characters_to_read
    print('index: ', index)
    #intake only uses letters from index count per file
    intake = read_file[index_start:index_finish]
    print(intake)
    index_start += characters_to_read
    index_finish += characters_to_read
    #output a txt file with the letters from intake as an individually numbered file
    text_file_output = open("Output%i.csv" % i, 'w')
    i += 1
    text_file_output.write(intake)
    text_file_output.close()
    #define path to print to console for file saving
    path = os.path.abspath("Output%i")
    directory = os.path.dirname(path)
    print(path)
test_file.close()
Here's a simple way to split your DNA data into rows consisting of columns and chunks of specified sizes. It assumes that the DNA data is in a single string with no white space characters (spaces, tabs, newlines, etc).
To test this code, I create some fake data using the random module.
from random import seed, choice

seed(42)

# Make some random DNA data
num = 66
data = ''.join([choice('ACGT') for _ in range(num)])
print(data, '\n')

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
row = []
for i in range(0, len(data), chunksize):
    chunk = data[i:i+chunksize]
    row.append(chunk)
    if len(row) == cols:
        print(' '.join(row))
        row = []
if row:
    print(' '.join(row))
output
AAGCCCAATAAACCACTCTGACTGGCCGAATAGGGATATAGGCAACGACATGTGCGGCGACCCTTG
AAGC CCAA TAAA
CCAC TCTG ACTG
GCCG AATA GGGA
TATA GGCA ACGA
CATG TGCG GCGA
CCCT TG
On my old 2GHz 32 bit machine, running Python 3.6.0, this code can process and save to disk around 100000 chars per second (that includes the time taken to generate the random data).
Here's a version of the above code that handles spaces and blank lines in the input data. It reads the input data from a file and writes the output to a CSV file.
Firstly, here's the code I used to create some fake test data, which I saved to "dnatest.txt".
from random import seed, choice, randrange

seed(123)

# Make some random DNA data containing spaces
pool = 'ACGT' * 5 + ' '
for _ in range(15):
    # Choose a random line length
    size = randrange(50, 70)
    data = ''.join([choice(pool) for _ in range(size)])
    print(data)
    # Randomly add a blank line
    if randrange(5) < 2:
        print()
Here's the file it created:
dnatest.txt
AGCATCACCGGCCAGCGTCACGTAGAGGTCGAAACCGTATCCGATGT AGG
ACC TTACTAC CGTACGGCAGGAGGAGGG TATTACAC CT TCTCACGAGCAAGGAATA
ATTGATGGCACAGC AAGATCCGCTA CCGATTG CAACCA CATACGAT CGACCAGATGG
ACAGAACAGATCTTGGGAATGGAACAGGAGAGAGTGTGGGCCACATTAAAGTGATAAT ATTT
TCTGTCGTGGGGCACCAAACCATGCTAATGCACGACTGGGT GAGGGTTGAGAGCCTACTATCCTCAG
TCGATCGAGATGACCCTCCTATCGCAACAGCTGTCAGTGTCCAGAG ACGTCGC CA
TAGGTCTGGAAAC GCACTCCCCTC GGAATAGTCTACACGAGTCCATTATGTC
GATCTGACTATGGGGACCATAACGGCTATGCGACCATGGACTGGTTCGAG
GATTCCCGTTCTACAT CACCTT ACCTCTGATAA CGACTGGTTCGA GGGTCTC CC
AAA CGTCTATTATGTCATAACGTAACTCTGC CGTAGTTTGATCAAACGTACAGCCACCAC
TGAAGC CGCCTCGAACCGCGTCCGACCCTGGGGAGCCTGGGGCCCAGCA
CCTTAGC ACTGCGA AGCTACACCCCACGAGTAATTTG T CTATCGT CCG
GCCTCGTTTCCTTGTGAAATTAT ATGGT C AGTCTTCAATCAA CACCTA CTAATAA
GTGCTAGC CCGGGGATCTTGTCCTGGTCCA GGTC AT AATCCGTGCTCAAATTACATGGCTT
TTAGTAATGAGTTCGGGC GCGCCCTCAAAGTTGGTCTAGAAGCGCGCAGTTTTCCTTAGGT
Here's the code that processes that data:
# Input & output file names
iname = 'dnatest.txt'
oname = 'dnatest.csv'

# Read the data and eliminate all whitespace
with open(iname) as f:
    data = ''.join(f.read().split())

# Split the data into chunks, columns and rows
chunksize, cols = 4, 3
with open(oname, 'w') as f:
    row = []
    for i in range(0, len(data), chunksize):
        chunk = data[i:i+chunksize]
        row.append(chunk)
        if len(row) == cols:
            f.write(', '.join(row) + '\n')
            row = []
    if row:
        f.write(', '.join(row) + '\n')
And here's the file it creates:
dnatest.csv
AGCA, TCAC, CGGC
CAGC, GTCA, CGTA
GAGG, TCGA, AACC
GTAT, CCGA, TGTA
GGAC, CTTA, CTAC
CGTA, CGGC, AGGA
GGAG, GGTA, TTAC
ACCT, TCTC, ACGA
GCAA, GGAA, TAAT
TGAT, GGCA, CAGC
AAGA, TCCG, CTAC
CGAT, TGCA, ACCA
CATA, CGAT, CGAC
CAGA, TGGA, CAGA
ACAG, ATCT, TGGG
AATG, GAAC, AGGA
GAGA, GTGT, GGGC
CACA, TTAA, AGTG
ATAA, TATT, TTCT
GTCG, TGGG, GCAC
CAAA, CCAT, GCTA
ATGC, ACGA, CTGG
GTGA, GGGT, TGAG
AGCC, TACT, ATCC
TCAG, TCGA, TCGA
GATG, ACCC, TCCT
ATCG, CAAC, AGCT
GTCA, GTGT, CCAG
AGAC, GTCG, CCAT
AGGT, CTGG, AAAC
GCAC, TCCC, CTCG
GAAT, AGTC, TACA
CGAG, TCCA, TTAT
GTCG, ATCT, GACT
ATGG, GGAC, CATA
ACGG, CTAT, GCGA
CCAT, GGAC, TGGT
TCGA, GGAT, TCCC
GTTC, TACA, TCAC
CTTA, CCTC, TGAT
AACG, ACTG, GTTC
GAGG, GTCT, CCCA
AACG, TCTA, TTAT
GTCA, TAAC, GTAA
CTCT, GCCG, TAGT
TTGA, TCAA, ACGT
ACAG, CCAC, CACT
GAAG, CCGC, CTCG
AACC, GCGT, CCGA
CCCT, GGGG, AGCC
TGGG, GCCC, AGCA
CCTT, AGCA, CTGC
GAAG, CTAC, ACCC
CACG, AGTA, ATTT
GTCT, ATCG, TCCG
GCCT, CGTT, TCCT
TGTG, AAAT, TATA
TGGT, CAGT, CTTC
AATC, AACA, CCTA
CTAA, TAAG, TGCT
AGCC, CGGG, GATC
TTGT, CCTG, GTCC
AGGT, CATA, ATCC
GTGC, TCAA, ATTA
CATG, GCTT, TTAG
TAAT, GAGT, TCGG
GCGC, GCCC, TCAA
AGTT, GGTC, TAGA
AGCG, CGCA, GTTT
TCCT, TAGG, T

Double if conditional in the line.startswith strategy

I have a data.dat file with this format:
REAL PART
FREQ 1.6 5.4 2.1 13.15 13.15 17.71
FREQ 51.64 51.64 82.11 133.15 133.15 167.71
.
.
.
IMAGINARY PART
FREQ 51.64 51.64 82.12 132.15 129.15 161.71
FREQ 5.64 51.64 83.09 131.15 120.15 160.7
.
.
.
REAL PART
FREQ 1.6 5.4 2.1 13.15 15.15 17.71
FREQ 51.64 57.64 82.11 183.15 133.15 167.71
.
.
.
IMAGINARY PART
FREQ 53.64 53.64 81.12 132.15 129.15 161.71
FREQ 5.64 55.64 83.09 131.15 120.15 160.7
REAL and IMAGINARY PART blocks appear repeatedly throughout the file.
Within each REAL PART block, I would like to split every line that starts with FREQ. I have managed to: 1) split the lines and extract the FREQ values, 2) append the results to a list of lists, and 3) create a final list, All_frequencies:
import itertools

FREQ = []
fname = 'data.dat'
f = open(fname, 'r')
for line in f:
    if line.startswith(' FREQ'):
        FREQS = line.split()
        FREQ.append(FREQS)
print 'Final FREQ = ', FREQ
All_frequencies = list(itertools.chain.from_iterable(FREQ))
print 'All_frequencies = ', All_frequencies
The problem with this code is that it also extracts the IMAGINARY PART values of FREQ. Only the REAL PART values of FREQ would have to be extracted.
I have tried to make something like:
if line.startswith('REAL PART'):
    if line.startswith('IMAGINARY PART'):
        code...
or:
if line.startswith(' REAL') and line.startswith(' FREQ'):
    code...
But this does not work. I would appreciate it if you could help me.
It appears, based on the sample data in the question, that lines starting with 'REAL' or 'IMAGINARY' don't have any data on them; they just mark the beginning of a block. If that's the case (and you don't go changing the question again), you just need to keep track of which block you're in. You can also use yield instead of building up an ever-larger list of frequencies, as long as this code is in a function.
import itertools

def read_real_parts(fname):
    f = open(fname, 'r')
    real_part = False
    for line in f:
        if line.startswith(' REAL'):
            real_part = True
        elif line.startswith(' IMAGINARY'):
            real_part = False
        elif line.startswith(' FREQ') and real_part:
            FREQS = line.split()
            yield FREQS

FREQ = read_real_parts('data.dat') #this gives you a generator
All_frequencies = list(itertools.chain.from_iterable(FREQ)) #then convert to list
Think of this as a state machine having two states. In one state, when the program has read a line with REAL at the beginning it goes into the REAL state and aggregates values. When it reads a line with IMAGINARY it goes into the alternate state and ignores values.
REAL, IMAGINARY = 1, 2
FREQ = []
fname = 'data.dat'
f = open(fname)
state = None
for line in f:
    line = line.strip()
    if not line: continue
    if line.startswith('REAL'):
        state = REAL
        continue
    elif line.startswith('IMAGINARY'):
        state = IMAGINARY
        continue
    else:
        pass
    if state == IMAGINARY:
        continue
    freqs = line.split()[1:]
    FREQ.extend(freqs)
I assume that you want only the numeric values; hence the [1:] at the end of the assignment to freqs near the end of the script.
Using your data file, without the ellipsis lines, produces the following result in FREQ:
['1.6', '5.4', '2.1', '13.15', '13.15', '17.71', '51.64', '51.64', '82.11', '133.15', '133.15', '167.71', '1.6', '5.4', '2.1', '13.15', '15.15', '17.71', '51.64', '57.64', '82.11', '183.15', '133.15', '167.71']
You would need to keep track of which part you are looking at, so you can use a flag to do this:
section = None #will change to either "real" or "imag"
for line in f:
    if line.startswith("IMAGINARY PART"):
        section = "imag"
    elif line.startswith('REAL PART'):
        section = "real"
    else:
        freqs = line.split()
        if section == "real":
            FREQ.append(freqs)
        #elif section == "imag":
        #    IMAG_FREQ.append(freqs)
By the way, instead of appending to FREQ and then needing itertools.chain.from_iterable, you might consider just extending FREQ, as sketched below.
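A quick illustration of the difference (my example, not part of the answer above):

freqs = ['1.6', '5.4', '2.1']  # tokens from one FREQ line

FREQ = []
FREQ.append(freqs)  # [['1.6', '5.4', '2.1']] - nested, needs chain.from_iterable
FREQ = []
FREQ.extend(freqs)  # ['1.6', '5.4', '2.1']  - already flat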
We start with a flag set to False. If we find a line that contains "REAL", we set the flag to True and start copying the data below the REAL part, until we find a line that contains IMAGINARY, which sets the flag back to False; we then skip lines until another "REAL" is found (and the flag turns back to True).
Using the flag concept in a simple way:
with open('this.txt', 'r') as content:
    my_lines = content.readlines()

f = open('another.txt', 'w')
my_real_flag = False
for line in my_lines:
    if "REAL" in line:
        my_real_flag = True
    elif "IMAGINARY" in line:
        my_real_flag = False
    if my_real_flag:
        #do code here because we found real frequencies
        f.write(line)
    else:
        continue #because my_real_flag isn't true, we must have found an IMAGINARY line
f.close()
this.txt looks like this:
REAL
1
2
3
IMAGINARY
4
5
6
REAL
1
2
3
IMAGINARY
4
5
6
another.txt ends up looking like this:
REAL
1
2
3
REAL
1
2
3
Original answer that only works when there is one REAL section
If the file is "small" enough to be read as an entire string and there is only one instance of "IMAGINARY PART", you can do this:
file_str = file_str.split("IMAGINARY PART")[0]
which would get you everything above the "IMAGINARY PART" line.
You can then apply the rest of your code to this file_str string, which contains only the real part.
To elaborate, file_str is a str obtained as follows:
with open('data.dat', 'r') as my_data:
    file_str = my_data.read()
the "with" block is referenced all over stack exchange, so there may be a better explanation for it than mine. I intuitively think about it as "open a file named 'data.dat' with the ability to only read it and name it as the variable my_data. once its opened, read the entirety of the file into a str, file_str, using my_data.read(), then close 'data.dat' "
now you have a str, and you can apply all the applicable str functions to it.
if "IMAGINARY PART" happens frequently throughout the file or the file is too big, Tadgh's suggestion of a flag a break works well.
for line in f:
    if "IMAGINARY PART" not in line:
        pass  #do stuff with the line
    else:
        f.close()
        break

How to create an index to parse a big text file

I have two files A and B in FASTQ format, which are basically several hundred million lines of text organized in groups of 4 lines starting with a # as follows:
#120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB
I need to compare the
5:1101:1156:2031#0/
part between files A and B and write the groups of 4 lines in file B that match to a new file. I have a piece of Python code that does this, but it only works for small files, as it scans through all the #-lines of file B for every #-line in file A, and both files contain hundreds of millions of lines.
Someone suggested that I should create an index for file B; I have googled around without success and would be very grateful if someone could point out how to do this or let me know of a tutorial so I can learn. Thanks.
==EDIT==
In theory each group of 4 lines should only exist once in each file. Would breaking out of the parsing after each match increase the speed enough, or do I need a different algorithm altogether?
An index is just a shortened version of the information you are working with. In this case, you will want the "key" - the text between the first colon(':') on the #-line and the final slash('/') near the end - as well as some kind of value.
Since the "value" in this case is the entire contents of the 4-line block, and since our index is going to store a separate entry for each block, we would be storing the entire file in memory if we used the actual value in the index.
Instead, let's use the file position of the beginning of the 4-line block. That way, you can move to that file position, print 4 lines, and stop. Total cost is the 4 or 8 or however many bytes it takes to store an integer file position, instead of however-many bytes of actual genome data.
Here is some code that does the job, but also does a lot of validation and checking. You might want to throw stuff away that you don't use.
import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates- use first occurrence.
            index[key] = pos
    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert(s.startswith('#'))
    (machine_name, rest) = s.split(':', 1)
    # Per wikipedia, this changes in different variants of FASTQ format:
    (key, rest) = rest.split('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    f = open(path)
    i = 0
    # Note: iterating a file is incompatible with fh.tell(). Fake it.
    pos = offset = 0
    for line in f:
        offset += len(line)
        lx = i % 4
        i += 1
        if lx == 0:  # #machine: key
            key = extract_key(line)
            len1 = len2 = 0
            data = [ line ]
        elif lx == 1:
            data.append(line)
            len1 = len(line)
        elif lx == 2:  # +machine: key or something
            assert(line.startswith('+'))
            data.append(line)
        else:  # lx == 3 : quality data
            data.append(line)
            len2 = len(line)
            if len2 != len1:
                error("Data length mismatch at line "
                        + str(i-2)
                        + " (len: " + str(len1) + ") and line "
                        + str(i)
                        + " (len: " + str(len2) + ")\n")
            #print "Yielding #%i: %s" % (pos, key)
            yield key, pos, data
            pos = offset
    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i))

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)
    return results

def write_matches(inpath, matches, outpath):
    rf = open(inpath)
    wf = open(outpath, 'w')
    for m in matches:
        rf.seek(m)
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
    rf.close()
    wf.close()

#import pdb; pdb.set_trace()
index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [ index[k] for k in matches ]
write_matches('afile.fastq', posns, 'outfile.fastq')
Note that this code goes back to the first file to get the blocks of data. If your data is identical between files, you would be able to copy the block from the second file when a match occurs.
Note also that depending on what you are trying to extract, you may want to change the order of the output blocks, and you may want to make sure that the keys are unique, or perhaps make sure the keys are not unique but are repeated in the order they match. That's up to you - I'm not sure what you're doing with the data.
These guys claim to parse files of a few GB using a dedicated library (Biopython's SeqIO); see http://www.biostars.org/p/15113/
fastq_parser = SeqIO.parse(fastq_filename, "fastq")
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")
A better approach IMO would be to parse it once and load the data into a database (e.g. MySQL) instead of that output_file, and later run the queries there.
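For illustration, here is a minimal sketch of that idea using the standard-library sqlite3 in place of MySQL (the file names and the key format follow the question; everything else is my assumption):

import sqlite3

conn = sqlite3.connect('fastq_index.db')
conn.execute('CREATE TABLE IF NOT EXISTS reads (key TEXT PRIMARY KEY, pos INTEGER)')

def load_keys(path):
    # Store each 4-line group's key and byte offset from file A.
    with open(path) as f:
        pos = 0
        for i, line in enumerate(f):
            if i % 4 == 0:  # the #-header line of each group
                key = line.split(':', 1)[1].rsplit('/', 1)[0]
                conn.execute('INSERT OR IGNORE INTO reads VALUES (?, ?)', (key, pos))
            pos += len(line)
    conn.commit()

load_keys('afile.fastq')
# Later, while scanning file B, look up each key:
row = conn.execute('SELECT pos FROM reads WHERE key = ?',
                   ('5:1101:1156:2031#0',)).fetchone()
if row is not None:
    print('match in file A at byte offset', row[0])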
