Read a tabular dataset from a text file in Python

I have many text files with the following format,
%header
%header
table
.
.
.
another table
.
.
.
If I didn't have the second table, I could read the file with a simple command such as:
numpy.loadtxt(file_name, skiprows=2, dtype=float, usecols=(0, 1))
Is there an easy way to read the first table without having to read the file line by line, something like numpy.loadtxt?

Use numpy.genfromtxt and set max_rows according to info from the header.
As an example, I created the following data file:
# nrows=10
# nrows=15
1
2
3
4
5
6
7
8
9
10
.
.
.
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
.
.
.
The following oversimplified code reads the two tables from the file (of course you can enhance it to meet your needs):
f = open('filename.txt')

# read the header and find the number of rows to read for each table:
p = f.tell()
l = f.readline()
tabrows = []
while l.strip().startswith('#'):
    if 'nrows' in l:
        tabrows.append(int(l.split('=')[1]))
    p = f.tell()
    l = f.readline()
f.seek(p)

# read all tables, assuming each table is followed by three lines with a dot:
import numpy as np
tables = []
skipdots = 0
ndotsafter = 3
for nrows in tabrows:
    tables.append(np.genfromtxt(f, skip_header=skipdots, max_rows=nrows))
    skipdots = ndotsafter
f.close()
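If you only ever need the first table and can read its length from the header yourself, a single call is enough. A minimal sketch, assuming the two header lines and the 10-row first table from the example file above:

import numpy as np

# skip the two '#' header lines, then stop after the first table's rows;
# the row count (10) is taken from the example header above and would
# normally be parsed out of the '# nrows=...' line first
first_table = np.genfromtxt('filename.txt', skip_header=2, max_rows=10)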

Related

How to separate columns of *.csv file from one to more (Python, BioPython, Pandas)?

I'm still quite a beginner, and my friend (who's not answering so far) provided me with code for downloading a genomic sequence from Ensembl.org and writing it to a *.csv file using dictionaries. Unfortunately, the file contains only one column and 89870 rows, and I'm not sure how to fix it. Fixing it would ease my work with counting, because the file acts weird when doing plots. I don't know where the mistake could be. Here's the code:
from Bio.SeqIO.FastaIO import FastaIterator

record_ids = []
records = []
with open("equus_cds.fa") as handle:
    for record in FastaIterator(handle):
        record_ids.append(record.id)
        records.append(record)

data_cds = {}
for record in records:
    data_cds[record.id] = {'A': 0, 'G': 0, 'C': 0, 'T': 0, 'N': 0}
    for letter in str(record.seq):
        data_cds[record.id][letter] += 1

import csv
with open('data_cds.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(['ID', 'A', 'G', 'C', 'T', 'N'])
    for key, values in data_cds.items():
        writer.writerow([key, values['A'], values['G'], values['C'], values['T'], values['N']])

with open("data_cds.csv") as file:
    print(file.readline())
    for lines in file.readlines():
        print(lines)
The output shows a scrolling table of contents but it's a bit shifted:
ID A G C T N
ENSECAT00000046986.1 67 64 83 71 0
ENSECAT00000031957.1 81 83 75 85 0
etc. etc., imagine over 80 thousand lines.
Then I would like to count the sum of all "N's" (it's not always zero) and I have no idea how to do it with this format...
Thanks in advance!
EDIT: I've downloaded the sequence from here: http://ftp.ensembl.org/pub/release-103/fasta/equus_caballus/cds/, unzipped it:
import gzip

handle = gzip.open('file1.fa.gz')
with open('equus_cds.fa', 'wb') as out:
    for line in handle:
        out.write(line)
And then the code I've posted follows. The *.csv file always contains the name of a specific gene (ID - ENSECAT000... etc.) and then the nitrogen bases (A, T, G, C) and also unknown bases (N). This whole file has 8k lines but only one column; I would like to have it properly separated (each base in its own column, if possible), because then it would be easier to count how many of each base are in the whole file (how many Ns, to be specific).
The reason I want to know this is that when I'm making a plot, I'm comparing two sequences, cds (coding sequences) and cDNA (complementary DNA), and after subtracting N the plot acts weird: cds gets bigger than cDNA, and that's nonsense. Here's the code for the plot:
data1 = pd.read_csv ("data_cds.csv", delimiter="\t")
data1['x'] = (data1['G'] + data1['C'] - data1['N']) / (data1['A'] + data1['G'] + data1['C'] + data1['T'] - data1['N'])
data1['x'].plot.hist(bins=2000)
plt.xlim([0, 1])
plt.xlabel("cds GC percentage")
plt.title("Equus caballus", style="italic")
I'm analysing mammals for my thesis, I'm not encountering this problem with every species but it's still enough. I hope my question is more understandable now.
EDIT 2:
I'm either really bad at maths, or it's too late at night here, or the file is acting weird... How come the sums of N bases are different?
>>> df['N'].sum()
3504.0
>>> df['cds_wo_N'] = df["A"] + df["G"] + df["C"] + df["T"] - df["N"]
>>> df['cds_wo_N'].sum()
88748562.0
>>> df['cds_w_N'] = df["A"] + df["G"] + df["C"] + df["T"] + df["N"]
>>> df['cds_w_N'].sum()
88755570.0
>>> df['N_subt'] = df['cds_w_N'] - df['cds_wo_N']
>>> df['N_subt'].sum()
7008.0
SeqIO has a to_dict function. If you use that in combination with collections.Counter, you can write your code more succinctly. We'll also put everything into a pandas.DataFrame directly instead of going through the intermediate step of writing out a CSV file.
from collections import Counter
from Bio import SeqIO
import pandas as pd
import matplotlib.pyplot as plt
record_dict = SeqIO.to_dict(SeqIO.parse("Equus_caballus.EquCab3.0.cds.all.fa", "fasta"))
record_dict = {record_id: Counter(record_seq) for record_id, record_seq in record_dict.items()}
df = pd.DataFrame.from_dict(record_dict, orient='index')
Note that Counter only counts the letters that actually occur in a sequence, so records without any unknown bases end up with NaN in the N column (we fill those with zero further down). Our dataframe looks like:
                       A   G   C   T    N
ENSECAT00000046986.1  67  64  83  71  NaN
ENSECAT00000031957.1  81  83  75  85  NaN
ENSECAT00000038711.1  85  59  82  59  NaN
ENSECAT00000058645.1  74  66  82  78  NaN
ENSECAT00000058952.1  69  63  82  71  NaN
...
We can now easily filter out only the records which have unknown bases with df[df['N'].notnull()]
                        A     G     C    T     N
ENSECAT00000016113.2  155   264   245  135    20
ENSECAT00000048238.2  274   247   166  196    20
ENSECAT00000052603.2  370   280   283  374  1000
ENSECAT00000074965.1  654  1081   545  586    20
ENSECAT00000049830.1  177   486   458  194    20
...
ENSECAT00000029115.3   94   191   167   92    20
ENSECAT00000050439.2  734  1358  1296  717    20
ENSECAT00000058713.2  728  1353  1294  715    20
ENSECAT00000046294.1  694  1362  1341  729    20
ENSECAT00000064068.1  248   501   539  330    20
Or count the total number of N bases with df['N'].sum():
3504
We can now calculate the GC percentage
df = df.fillna(0) # replace the NaNs with zero
df['cds GC percentage'] = (df['G'] + df['C'] - df['N']) / (df['A'] + df['G'] + df['C'] + df['T'] - df['N'])
df['cds GC percentage'] looks like:
                      cds GC percentage
ENSECAT00000046986.1           0.515789
ENSECAT00000031957.1           0.487654
ENSECAT00000038711.1           0.494737
ENSECAT00000058645.1           0.493333
ENSECAT00000058952.1           0.508772
...
And the plot now looks as follows:
df['cds GC percentage'].plot.hist(bins=2000)
plt.xlim([0, 1])
plt.xlabel("cds GC percentage")
plt.title("Equus caballus", style="italic");
Edit
Regarding your latest update: N is not part of the A+G+C+T sum (each base is counted in exactly one bucket), so subtracting N removes it twice. That is also why your two totals differ by 7008 = 2 × 3504 rather than by 3504. Define df['cds_wo_N'] without the subtraction:
df['cds_wo_N'] = df["A"] + df["G"] + df["C"] + df["T"]
The script you have is creating a TAB delimited output file, not a comma separated one. If you remove the delimiter='\t' parameter, it will default to a comma.
Secondly, you appear to be getting extra blank rows. These are removed by adding the newline='' parameter when opening the output file. This is specified in the documentation.
from Bio.SeqIO.FastaIO import FastaIterator
import csv

record_ids = []
records = []
with open("equus_cds.fa") as handle:
    for record in FastaIterator(handle):
        record_ids.append(record.id)
        records.append(record)

data_cds = {}
for record in records:
    data_cds[record.id] = {'A': 0, 'G': 0, 'C': 0, 'T': 0, 'N': 0}
    for letter in str(record.seq):
        data_cds[record.id][letter] += 1

# newline='' prevents the extra blank rows; no delimiter argument means
# the writer defaults to commas
with open('data_cds.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['ID', 'A', 'G', 'C', 'T', 'N'])
    for key, values in data_cds.items():
        writer.writerow([key, values['A'], values['G'], values['C'], values['T'], values['N']])

with open("data_cds.csv") as file:
    for line in file:
        print(line)
This should then produce something like:
ID,A,G,C,T,N
ENSECAT00000046986.1,67,64,83,71,0
ENSECAT00000031957.1,81,83,75,85,0
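If you switch the writer to commas, remember to drop the delimiter="\t" argument from the reading side of your plot code too, otherwise pandas will again see everything as one column. A minimal sketch of the adjusted read:

import pandas as pd

# comma is pandas' default separator, so no delimiter argument is needed
data1 = pd.read_csv("data_cds.csv")
print(data1.head())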
You can decompress your .gz file with Python as follows:
import shutil
import gzip

with gzip.open('Equus_caballus.EquCab3.0.cds.all.fa.gz', 'rb') as f_in, \
        open('equus_cds.fa', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)
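As a side note, decompressing to disk is not strictly necessary: SeqIO.parse accepts any text handle, so (on Python 3, where gzip.open supports text mode) you can read the archive directly. A minimal sketch:

import gzip
from Bio import SeqIO

# 'rt' opens the gzip stream in text mode, which SeqIO.parse expects
with gzip.open('Equus_caballus.EquCab3.0.cds.all.fa.gz', 'rt') as handle:
    for record in SeqIO.parse(handle, 'fasta'):
        print(record.id)  # process each record without an intermediate file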

How can you keep track of revisions in a csv file with a Python program?

I have a CSV file where each row has an ID followed by several attributes. Initially, my task was to find the IDs with matching attributes and put them together as a family, then output them to another CSV document with every relationship printed on its own row.
The basic outline for the CSV file looks like this:
ID SIZE SPEED RANK
123 10 20 2
567 15 30 1
890 10 20 2
321 20 10 3
295 15 30 1
The basic outline for the python module looks like this:
import csv

FAMILIES = {}
ATTRIBUTES = ['ID', 'SIZE', 'SPEED', 'RANK']

with open('data.csv', 'rb') as f:
    data = csv.DictReader(f)
    for row in data:
        fam_id = str(tuple([row[field_name] for field_name in ATTRIBUTES]))
        id = row['ID']
        FAMILIES.setdefault(fam_id, [])
        FAMILIES[fam_id].append(id)

output = []
for fam_id, node_arr in FAMILIES.items():
    for from_item in node_arr:
        for to_item in node_arr:
            if from_item != to_item:
                # append takes a single argument, so collect the row in a list
                output.append([fam_id, from_item, to_item])

def write_array_to_csv(arr):
    with open('hdd_output_temp.csv', 'wb') as w:
        writer = csv.writer(w)
        writer.writerows(arr)

if __name__ == "__main__":
    write_array_to_csv(output)
Which would print into a CSV like this:
('10,20,2') 123 890
('10,20,2') 890 123
('15,30,1') 567 295
('15,30,1') 295 567
Now, my question is: if I were to go into the original CSV file and make some revisions, how could I alter the code to detect all the updated relationships? I would like to put all the added relationships into FAMILIES2 and all the broken relationships into FAMILIES3. So if a new ID '589' were added that matched the '20,10,3' family, and '890' were updated to have different attributes of '10,20,1',
I would like FAMILIES 2 to be able to output:
('20,10,3') 321 589
('20,10,3') 589 321
And FAMILIES3 to output:
('10,20,2') 123 890
('10,20,2') 890 123
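One way to detect added and broken relationships is to rebuild the family map from both versions of the file and take set differences of the relationship pairs. A minimal sketch, assuming the old and new versions are available under the hypothetical names data_old.csv and data_new.csv:

import csv

# family key: the attributes only, since rows with different IDs should match
KEY_FIELDS = ['SIZE', 'SPEED', 'RANK']

def read_families(path):
    # map each attribute tuple to the set of IDs that share it
    families = {}
    with open(path, 'rb') as f:
        for row in csv.DictReader(f):
            key = tuple(row[name] for name in KEY_FIELDS)
            families.setdefault(key, set()).add(row['ID'])
    return families

def relationships(families):
    # all ordered (fam_id, from_id, to_id) pairs within each family
    pairs = set()
    for fam_id, ids in families.items():
        for a in ids:
            for b in ids:
                if a != b:
                    pairs.add((fam_id, a, b))
    return pairs

old_pairs = relationships(read_families('data_old.csv'))
new_pairs = relationships(read_families('data_new.csv'))

families2 = new_pairs - old_pairs  # added relationships
families3 = old_pairs - new_pairs  # broken relationships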

python csv module read data from header

I have a file in the following format:
# Data set number 1
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
4010 3 5 1001 2010 3355 107 2039
# Data set number 2
#
# Number of lines 4010
# Max number of column 3 is 5
# Blahblah
# More blahblah
1 2 1 110
2 2 5 20 21 465 417 38
2 1 2 33 46 17
......
I want to read the data set number, the number of lines, and the max number of column 3, and store them. I searched and found that the csv module can read headers, but can it read these numbers out of the header lines? What I did was:
nnn = linecache.getline(filename, 1)
nnnn = nnn.split()[4]
number = linecache.getline(filename, 3)
number2 = number.split()[4]
mmm = linecache.getline(filename, 5)
mmmm = mmm.split()[7]
mmmmm = int(mmmm)
max_nb = range(mmmmm)
n_data = int(nnnn)
n_frame = range(n_data)
singleframe = natoms + 6
Like this. How can I read those numbers and store them using the csv module? I skip the 6 header lines using 'singleframe', but I'm also curious whether the csv module can read the 6 header lines. Thanks
You don't really have a CSV file; you have a proprietary format instead. Just parse it directly, using regular expressions to quickly extract your desired data:
import re

set_number = re.compile(r'Data set number (\d+)')
patterns = {
    'line_count': re.compile(r'Number of lines (\d+)'),
    'max_num': re.compile(r'Max number of column 3 is (\d+)'),
}

with open(filename, 'r') as infh:
    results = {}
    set_numbers = []
    for line in infh:
        if not line.startswith('#'):
            # skip non-comment lines
            continue
        set_match = set_number.search(line)
        if set_match:
            set_numbers.append(int(set_match.group(1)))
        else:
            for name, pattern in patterns.items():
                match = pattern.search(line)
                if match:
                    results[name] = int(match.group(1))
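For the sample file above, this leaves set_numbers as [1, 2] and results as {'line_count': 4010, 'max_num': 5}; since every data set repeats the same header values, the last occurrence of each pattern wins. If the values can differ per data set, collect a list of dicts, one per '# Data set number' line, instead of a single results dict.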
Do not use the linecache module. It reads whole files into memory, and it is really only intended for access to Python source files: whenever a traceback needs to be printed, this module caches the source files involved in the current stack. You'd only use it for smaller files from which you need random lines, repeatedly.

Extracting BLAST output columns in CSV form with python

I have a csv file in excel which contains the output from a BLAST search in the following format:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 1 hits found
Cryptocephalus ctg7180000094003 79.59 637 110 9 38 655 1300 1935 1.00E-125 444
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 4 hits found
Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051
Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780
Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532
Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340
I have imported this as a csv into python using
import csv

with open('BLASToutput.csv', 'rU') as csvfile:
    contents = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in contents:
        table = ', '.join(row)
What I now want to be able to do is extract columns of data as a list. My overall aim is to count all the matches which have over 98% identity (the third column).
The issue is that, since this is not in the typical CSV format, there are no headers at the top, so I can't extract a column based on its header. I was thinking that if I could extract the third column as a list, I could then use normal list tools in Python to pick out just the numbers I want, but I have never used Python's csv module and I'm struggling to find an appropriate command. Other questions on SO are similar but don't refer to my specific case where there are no headers and there are empty cells. If you could help me I would be very grateful!
The data file is not really in CSV format: it has comments, and its delimiter is not a single character but runs of formatted spaces.
Since your overall aim is
to count all the matches which have over 98% identity (the third column).
and the data file content is well formed, you can use normal file parsing approach:
import re

with open('BLASToutput.csv') as f:
    # read the file line by line
    for line in f:
        # skip comments (or maybe keep them as they are)
        if line.startswith('#'):
            # print line
            continue
        # split fields
        fields = re.split(r' +', line)
        # check if the 3rd field is greater than 98%
        if float(fields[2]) > 98:
            # output the matched line
            print line
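To get the actual count rather than printing the matching lines, the same loop can increment a counter; a minimal variation under the same assumptions about the file layout:

import re

count = 0
with open('BLASToutput.csv') as f:
    for line in f:
        # skip comments and blank lines
        if line.startswith('#') or not line.strip():
            continue
        fields = re.split(r' +', line)
        # the 3rd column holds the % identity
        if float(fields[2]) > 98:
            count += 1
print count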
I managed to find one way, based on:
Python: split files using multiple split delimiters
import csv

csvfile = open("SANDoubleSuperMatrix.csv", "rU")

# let the Sniffer deduce the dialect from a sample, then rewind
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)

reader = csv.reader(csvfile, dialect)
identity = []
for line in reader:
    identity.append(line[2])
print identity
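Note that identity ends up as a list of strings; to apply the 98% cutoff, you still have to convert to float, and it is worth guarding against comment or malformed rows that may have slipped through. A hedged sketch:

over_98 = 0
for value in identity:
    try:
        if float(value) > 98:
            over_98 += 1
    except ValueError:
        pass  # comment rows and other non-numeric cells
print over_98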

Reading a specific row & columns of data in a text file using Python 2.7

I am new to Python and need to extract data from a text file. I have a text file below:
UNHOLTZ-DICKIE CORPORATION
CALIBRATION DATA
789 3456
222 455
333 5
344 67788
12 6789
2456 56656
And I want to read it on the shell as two columns of data only:
789 3456
222 455
333 5
344 67788
12 6789
2456 56656
Here's a Python program that reads a file and outputs the 3rd and subsequent lines (i.e. drops the first 2 lines). That's all I can deduce that you want from your short explanation.
# read the whole file
file = open("input.file", 'r')
lines = file.readlines()
file.close()

# Skip first 2 lines, output the rest to stdout
count = 0
for line in lines:
    count += 1
    if count > 2:
        print line,
If you have numpy installed then this is a one-liner:
col1, col2 = numpy.genfromtxt("myfile.txt", skip_header=2, unpack=True)
where myfile.txt is your data file.
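Spelled out with the import and applied to the sample file above, that looks like:

import numpy as np

# unpack=True transposes the result so each column comes back as its own array
col1, col2 = np.genfromtxt("myfile.txt", skip_header=2, unpack=True)
print col1  # array([  789.,   222.,   333.,   344.,    12.,  2456.])
print col2  # array([ 3456.,   455.,     5., 67788.,  6789., 56656.])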
