Splitting a CSV at repeated headers in pandas - Python

I have a CSV file with a few header rows. After them comes sample (1) data, then another header and sample (2) data, and so on. The number of samples (and headers) varies from file to file.
The file looks like this:
[header]
InfoMap : 4214
InfoSample:3122
Content:, ,22dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Ana 22 785 a6659
Ana 22 785 a658141
Ana 22 785 csd449
Ben 23 756 a6659
Ben 23 756 a658141
Charlie 44 733 c658141
[header]
InfoMap : 423421
InfoSample:315
Content, ,562dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Cris 82 7835 a6659
Cris 82 7485 a658141
Cris 82 7485 csd449
MATT 53 268 a6659
MATT 53 268 a658141
Dan 42 885 c658141
What I'm trying to do:
I need to split each header together with its samples into a new file, so in the case above I should get 2 files:
file1:
[header]
InfoMap : 4214
InfoSample:3122
Content:, ,22dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Ana 22 785 a6659
Ana 22 785 a658141
Ana 22 785 csd449
Ben 23 756 a6659
Ben 23 756 a658141
Charlie 44 733 c658141
file2:
[header]
InfoMap : 423421
InfoSample:315
Content, ,562dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Cris 82 7835 a6659
Cris 82 7485 a658141
Cris 82 7485 csd449
MATT 53 268 a6659
MATT 53 268 a658141
Dan 42 885 c658141
How can I do this in the simplest way, in pandas or core Python? As I said, the numbers of headers and samples are not constant.
What I tried with a for loop:
Look for [header] in each line.
Save the index numbers of all the [header] lines.
With open(), write each segment out to a new file.
The problem was that I couldn't read the file as a CSV: the extra header rows force everything into a one-column DataFrame, and lines were read in a weird way because my files have mixed samples.
I'm looking for a better concept. Maybe pandas has some functions I don't know about. If not, I'll keep going with my approach and try to do it this way.
I'm not necessarily looking for a ready-made solution, but some hints or concepts.

Here is pseudocode-style sample code following the logic given in the comments, assuming that "InfoMap" marks the starting point of a header:
import csv

def write_data_list_to_csv(data_list, file_name):
    # write one accumulated block out as its own CSV file
    with open(file_name, 'w', newline='') as out:
        csv.writer(out).writerows(data_list)

data_list = []
count = 0
with open('YourData.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        if row and "InfoMap" in row[0]:
            count += 1
            if count > 1:
                # a new block starts here, so flush the previous one first
                write_data_list_to_csv(data_list, "file" + str(count - 1) + ".csv")
                # print(data_list)
                data_list = []
        data_list.append(row)

# flush the final block
if data_list:
    write_data_list_to_csv(data_list, "file" + str(count) + ".csv")
    # print(data_list)
You can uncomment the print(data_list) lines to see what data_list contains at each flush.

Another pseudocode (or formerly buggy real code) variant. The only difference is that you don't accumulate the data in a list; rows are written straight through:
import csv

file_no = 0
write_file = open(f"sub_file_{file_no}.csv", "w", newline='')
writer = csv.writer(write_file)
with open("input.csv", newline='') as in_file:
    reader = csv.reader(in_file)
    for row in reader:
        if row and row[0] == "[header]":
            # at the start of a new block: close the old file, increment the count, open a new one
            write_file.close()
            file_no += 1
            write_file = open(f"sub_file_{file_no}.csv", "w", newline='')
            writer = csv.writer(write_file)
        # just pass the current row into the currently open file
        writer.writerow(row)
write_file.close()
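Since the question also asks whether pandas has something for this: not directly, but you can read each physical line as one value, tag the lines with a running block counter via cumsum, and group on it. A minimal sketch, assuming the file really starts with a [header] line and that file1.csv, file2.csv, ... are acceptable output names:

import pandas as pd

with open('YourData.csv') as f:
    lines = pd.Series(f.read().splitlines())

# each '[header]' line bumps the counter, so lines get labelled 1, 1, ..., 2, 2, ...
block_id = (lines == '[header]').cumsum()

for i, block in lines.groupby(block_id):
    # write the raw lines back out untouched (csv quoting would mangle them)
    with open(f'file{i}.csv', 'w') as out:
        out.write('\n'.join(block) + '\n')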

Related

How to separate columns of *.csv file from one to more (Python, BioPython, Pandas)?

I'm still quite a beginner, and a friend (who isn't answering so far) provided me with code for downloading a genomic sequence from Ensembl.org and writing it to a *.csv file using dictionaries. Unfortunately, the resulting file contains only one column and 89870 rows, and I'm not sure how to fix it. Fixing it would ease my counting work, because the file acts weird when I'm doing plots. I don't know where the mistake could be. Here's the code:
from Bio.SeqIO.FastaIO import FastaIterator
import csv

record_ids = []
records = []
with open("equus_cds.fa") as handle:
    for record in FastaIterator(handle):
        record_ids.append(record.id)
        records.append(record)

data_cds = {}
for record in records:
    data_cds[record.id] = {'A': 0, 'G': 0, 'C': 0, 'T': 0, 'N': 0}
    for letter in str(record.seq):
        data_cds[record.id][letter] += 1

with open('data_cds.csv', 'w') as csvfile:
    writer = csv.writer(csvfile, delimiter="\t")
    writer.writerow(['ID', 'A', 'G', 'C', 'T', 'N'])
    for key, values in data_cds.items():
        writer.writerow([key, values['A'], values['G'], values['C'], values['T'], values['N']])

with open("data_cds.csv") as file:
    print(file.readline())
    for lines in file.readlines():
        print(lines)
The output shows the contents scrolling past as a table, but it's a bit shifted:
ID A G C T N
ENSECAT00000046986.1 67 64 83 71 0
ENSECAT00000031957.1 81 83 75 85 0
etc. etc., imagine over 80 thousand lines.
Then I would like to count the sum of all the "N"s (it's not always zero), and I have no idea how to do that with this format...
Thanks in advance!
EDIT: I've downloaded the sequence from here: http://ftp.ensembl.org/pub/release-103/fasta/equus_caballus/cds/ and unzipped it:
import gzip

handle = gzip.open('file1.fa.gz')
with open('equus_cds.fa', 'wb') as out:
    for line in handle:
        out.write(line)
And then the code I've posted follows. The *.csv file always contains the name of a specific gene (ID - ENSECAT000... etc.) followed by the nitrogen bases (A, T, G, C) and also unknown bases (N). This whole file then has 8k lines but only one column; I would like to have it properly separated (each base in its own column, if possible), because then it would be easier to count how many of each base are in the whole file (how many Ns, to be specific).
The reason I want to know this: when making a plot I'm comparing two sequences, cds (coding sequences) and cDNA (complementary DNA), and after subtracting N the plot acts weird: cds gets bigger than cDNA, which is nonsense. Here's the code for the plot:
import pandas as pd
import matplotlib.pyplot as plt

data1 = pd.read_csv("data_cds.csv", delimiter="\t")
data1['x'] = (data1['G'] + data1['C'] - data1['N']) / (data1['A'] + data1['G'] + data1['C'] + data1['T'] - data1['N'])
data1['x'].plot.hist(bins=2000)
plt.xlim([0, 1])
plt.xlabel("cds GC percentage")
plt.title("Equus caballus", style="italic")
I'm analysing mammals for my thesis, I'm not encountering this problem with every species but it's still enough. I hope my question is more understandable now.
EDIT 2:
I'm either really bad at maths, or it's too late at night here, or the file acts weird... How come the sums of N bases are different?
df['N'].sum()
3504.0
df['cds_wo_N'] = df["A"]+df["G"]+df["C"]+df["T"]-df["N"]
df['cds_wo_N'].sum()
88748562.0
df['cds_w_N'] = df["A"]+df["G"]+df["C"]+df["T"]+df["N"]
df['cds_w_N'].sum()
88755570.0
df['N_subt'] = df['cds_w_N']-df['cds_wo_N']
df['N_subt'].sum()
7008.0
SeqIO has a to_dict method. If you use that in combination with collections.Counter you can write your code more succinctly. We'll also put everything in a pandas.DataFrame directly and not go through the intermediate step of writing out a CSV file.
from collections import Counter
from Bio import SeqIO
import pandas as pd
import matplotlib.pyplot as plt
record_dict = SeqIO.to_dict(SeqIO.parse("Equus_caballus.EquCab3.0.cds.all.fa", "fasta"))
record_dict = {record_id: Counter(record_seq) for record_id, record_seq in record_dict.items()}
df = pd.DataFrame.from_dict(record_dict, orient='index')
Our dataframe looks like:

                         A   G   C   T    N
ENSECAT00000046986.1    67  64  83  71  NaN
ENSECAT00000031957.1    81  83  75  85  NaN
ENSECAT00000038711.1    85  59  82  59  NaN
ENSECAT00000058645.1    74  66  82  78  NaN
ENSECAT00000058952.1    69  63  82  71  NaN
...
We can now easily filter out only the records which have unknown bases with df[df['N'].notnull()]:

                          A     G     C    T     N
ENSECAT00000016113.2    155   264   245  135    20
ENSECAT00000048238.2    274   247   166  196    20
ENSECAT00000052603.2    370   280   283  374  1000
ENSECAT00000074965.1    654  1081   545  586    20
ENSECAT00000049830.1    177   486   458  194    20
...                     ...   ...   ...  ...   ...
ENSECAT00000029115.3     94   191   167   92    20
ENSECAT00000050439.2    734  1358  1296  717    20
ENSECAT00000058713.2    728  1353  1294  715    20
ENSECAT00000046294.1    694  1362  1341  729    20
ENSECAT00000064068.1    248   501   539  330    20
Or count the total number of N bases with df['N'].sum():
3504
We can now calculate the GC percentage
df = df.fillna(0) # replace the NaNs with zero
df['cds GC percentage'] = (df['G'] + df['C'] - df['N']) / (df['A'] + df['G'] + df['C'] + df['T'] - df['N'])
df['cds GC percentage'] looks like:

ENSECAT00000046986.1    0.515789
ENSECAT00000031957.1    0.487654
ENSECAT00000038711.1    0.494737
ENSECAT00000058645.1    0.493333
ENSECAT00000058952.1    0.508772
...
And the plot now looks as follows:
df['cds GC percentage'].plot.hist(bins=2000)
plt.xlim([0, 1])
plt.xlabel("cds GC percentage")
plt.title("Equus caballus", style="italic");
Edit
Regarding your latest update: define df['cds_wo_N'] without subtracting N:
df['cds_wo_N'] = df["A"]+df["G"]+df["C"]+df["T"]
Your two sums differ by exactly twice the N count because N is subtracted in one and added in the other: cds_w_N - cds_wo_N = (A+G+C+T+N) - (A+G+C+T-N) = 2N, and indeed 7008 = 2 × 3504.
The script you have is creating a TAB delimited output file, not a comma separated one. If you remove the delimiter='\t' parameter, it will default to a comma.
Secondly, you appear to be getting extra blank rows. These are removed by adding the newline='' parameter when opening the output file. This is specified in the documentation.
from Bio.SeqIO.FastaIO import FastaIterator
import csv

record_ids = []
records = []
with open("equus_cds.fa") as handle:
    for record in FastaIterator(handle):
        record_ids.append(record.id)
        records.append(record)

data_cds = {}
for record in records:
    data_cds[record.id] = {'A': 0, 'G': 0, 'C': 0, 'T': 0, 'N': 0}
    for letter in str(record.seq):
        data_cds[record.id][letter] += 1

with open('data_cds.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)  # default comma delimiter
    writer.writerow(['ID', 'A', 'G', 'C', 'T', 'N'])
    for key, values in data_cds.items():
        writer.writerow([key, values['A'], values['G'], values['C'], values['T'], values['N']])

with open("data_cds.csv") as file:
    for line in file:
        print(line, end='')
This should then produce something like:
ID,A,G,C,T,N
ENSECAT00000046986.1,67,64,83,71,0
ENSECAT00000031957.1,81,83,75,85,0
You can decompress your .gz file with Python as follows:
import shutil
import gzip

with gzip.open('Equus_caballus.EquCab3.0.cds.all.fa.gz', 'rb') as f_in, \
        open('equus_cds.fa', 'wb') as f_out:
    shutil.copyfileobj(f_in, f_out)

Changing row data when it doesn't match a different row

I ran into an issue while trying to work with a CSV file. Basically, I have a CSV file which contains names and IDs. The header looks something like this:
New ID | name | ID that needs to be changed | name |
In column[0], the New ID column, there are numbers from 1 to 980; in column[3], ID that needs to be changed, there are 714 entries. What I need to accomplish is to create column[4]: for each row, take the name in column[3], scan the whole of column[1] to see if that name is there, and if it is, copy the matching New ID from column[0] into column[4].
So far I got this:
import csv

input = open('tbr.csv', "rb")
output = open('sortedTbr.csv', "wb")
reader = csv.reader(input)
writer = csv.writer(output)

for row in input:
    writer.writerow(row)
    print row

input.close
output.close
Which doesn't do much: it writes every single letter into its own column in the CSV...
There are 3 problems here:
First, you don't specify the delimiter; the csv parser cannot autodetect it (the code below assumes TAB).
Second, you create the reader but then scan the raw input file instead, which explains why the written CSV has as many cells as there are letters (you iterate over each row as a string instead of a list).
Third, when you close your handles you don't actually call close(), you just access the method reference. Add () to call the methods (a classical mistake; everyone gets caught once in a while).
Here's my fixed version for your "extended" question. You need 2 passes: one to read column 1 fully, and one to check column 3 against it. I use a dict to store the values and relate each name to its ID.
The code below runs in Python 2.7 as-is, and in Python 3.4 provided you comment/uncomment the indicated lines.
import csv

# Python 2 only; remove these 2 lines if using Python 3:
input_handle = open('tbr.csv', "r")  # don't use the name input: it shadows a builtin
output = open('sortedTbr.csv', "wb")
# uncomment the 2 lines below if you're using Python 3:
#input_handle = open('tbr.csv', "r", newline='')
#output = open('sortedTbr.csv', "w", newline='')

reader = csv.reader(input_handle, delimiter='\t')
writer = csv.writer(output, delimiter='\t')

title = next(reader)        # skip the title line
title.append("ID2")         # add the new column title
db = dict()
input_rows = list(reader)   # read the file once
input_handle.close()        # actually calls close!

# first pass: build the relation name => New ID
for row in input_rows:
    db[row[1]] = row[0]

writer.writerow(title)

# second pass: look each column-3 name up and append the matching ID (or empty)
for row in input_rows:
    row.append(db.get(row[3], ""))
    writer.writerow(row)

output.close()
I used this as tbr.csv (it should really be .tsv, since the separator is TAB):
New ID name ID that needs to be changed name
492 abboui jaouad jordan 438 abboui jaouad jordan
22 abrazone nelli 536 abrazone nelli
493 abuladze damirs 736 abuladze damirs
275 afanasjeva ludmila 472 afanasjeva oksana
494 afanasjeva oksana 578 afanasjevs viktors
54 afanasjevs viktors 354 aksinovichs andrejs
166 aksinovichs andrejs 488 aksinovichs german
495 aksinovichs german 462 aleksandra
and got this as output (note the added ID2 column):
New ID name ID that needs to be changed name ID2
492 abboui jaouad jordan 438 abboui jaouad jordan 492
22 abrazone nelli 536 abrazone nelli 22
493 abuladze damirs 736 abuladze damirs 493
275 afanasjeva ludmila 472 afanasjeva oksana 494
494 afanasjeva oksana 578 afanasjevs viktors 54
54 afanasjevs viktors 354 aksinovichs andrejs 166
166 aksinovichs andrejs 488 aksinovichs german 495
495 aksinovichs german 462 aleksandra
I would say it works. Don't hesitate to accept the answer :)

How can you keep track of revisions in a csv file with a Python program?

I have a CSV file where each row has an ID followed by several attributes. Initially, my task was to find the IDs with matching attributes and group them together as a family, then output them to another CSV document with every relationship printed on its own row.
The basic outline for the CSV file looks like this:
ID SIZE SPEED RANK
123 10 20 2
567 15 30 1
890 10 20 2
321 20 10 3
295 15 30 1
The basic outline of the Python module looks like this:
import csv

FAMILIES = {}
ATTRIBUTES = ['ID', 'SIZE', 'SPEED', 'RANK']

with open('data.csv', 'rb') as f:
    data = csv.DictReader(f)
    for row in data:
        # the family key is built from the attributes only, excluding the ID itself
        fam_id = str(tuple([row[field_name] for field_name in ATTRIBUTES[1:]]))
        row_id = row['ID']
        FAMILIES.setdefault(fam_id, [])
        FAMILIES[fam_id].append(row_id)

output = []
for fam_id, node_arr in FAMILIES.items():
    for from_item in node_arr:
        for to_item in node_arr:
            if from_item != to_item:
                output.append([fam_id, from_item, to_item])  # append one row (list), not 3 arguments

def write_array_to_csv(arr):
    with open('hdd_output_temp.csv', 'wb') as w:
        writer = csv.writer(w)
        writer.writerows(arr)

if __name__ == "__main__":
    write_array_to_csv(output)
Which would print into a CSV like this:
('10,20,2') 123 890
('10,20,2') 890 123
('15,30,1') 567 295
('15,30,1') 295 567
Now, my question: if I were to go into the original CSV file and make some revisions, how could I alter the code to detect all the updated relationships? I would like to put all the added relationships into FAMILIES2 and all the broken relationships into FAMILIES3. So if a new ID '589' were added that matched the '20,10,3' family, and '890' were updated to have different attributes '10,20,1',
I would like FAMILIES2 to be able to output:
('20,10,3') 321 589
('20,10,3') 589 321
And FAMILIES3 to output:
('10,20,2') 123 890
('10,20,2') 890 123
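One way to approach this: treat each run's relationships as a set of (family, from, to) tuples, keep the previous CSV (or the previous output) around, and diff the two sets. Below is a minimal Python 3 sketch under that assumption; the file names data_old.csv / data_new.csv and the helper load_relationships are placeholders, not part of the original code:

import csv

FAMILY_FIELDS = ['SIZE', 'SPEED', 'RANK']

def load_relationships(path):
    # build family -> [ids], then expand into (family, from_id, to_id) tuples
    families = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            key = str(tuple(row[field] for field in FAMILY_FIELDS))
            families.setdefault(key, []).append(row['ID'])
    relationships = set()
    for key, ids in families.items():
        for from_item in ids:
            for to_item in ids:
                if from_item != to_item:
                    relationships.add((key, from_item, to_item))
    return relationships

old_rels = load_relationships('data_old.csv')  # snapshot before the revision
new_rels = load_relationships('data_new.csv')  # snapshot after the revision

families2 = new_rels - old_rels   # added relationships
families3 = old_rels - new_rels   # broken relationships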

Extracting BLAST output columns in CSV form with python

I have a CSV file in Excel which contains the output from a BLAST search in the following format:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 1 hits found
Cryptocephalus ctg7180000094003 79.59 637 110 9 38 655 1300 1935 1.00E-125 444
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 4 hits found
Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051
Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780
Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532
Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340
I have imported this as a CSV into Python using:
import csv

with open('BLASToutput.csv', 'rU') as csvfile:
    contents = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in contents:
        table = ', '.join(row)
What I now want is to be able to extract columns of data as lists. My overall aim is to count all the matches which have over 98% identity (the third column).
The issue is that, since this is not in the typical CSV format, there are no headers at the top, so I can't extract a column based on its header. I was thinking that if I could extract the third column as a list, I could then use normal list tools in Python to pull out just the numbers I want, but I have never used Python's csv module and I'm struggling to find an appropriate command. Other questions on SO are similar but don't cover my specific case, where there are no headers and there are empty cells. If you could help me I would be very grateful!
The data file is not really in CSV format: it has comment lines, and its delimiter is not a single character but formatted runs of spaces.
Since your overall aim is
to count all the matches which have over 98% identity (the third column).
and the data file content is well formed, you can use a normal file-parsing approach:
import re

with open('BLASToutput.csv') as f:
    # read the file line by line
    for line in f:
        # skip comments and blank lines
        if line.startswith('#') or not line.strip():
            # print line
            continue
        # split the fields on runs of spaces
        fields = re.split(r' +', line)
        # check whether the 3rd field is greater than 98%
        if float(fields[2]) > 98:
            # output the matched line
            print line
I managed to find one way based on:
Python: split files using mutliple split delimiters
import csv

csvfile = open("SANDoubleSuperMatrix.csv", "rU")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)

identity = []
for line in reader:
    identity.append(line[2])

print identity
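Either way, once the third column is collected in identity, the stated aim (counting the matches with over 98% identity) only needs one more loop. A minimal sketch, assuming identity holds the column values as strings; the try/except skips fields that aren't numbers:

count = 0
for value in identity:
    try:
        if float(value) > 98:
            count += 1
    except ValueError:
        pass  # skip non-numeric fields (e.g. from comment lines)
print(count)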

Organising two column data

Hi, I have two-column data stored in a file called "Cv_0.out"; the columns are separated by two spaces:
12 454
232 123
879 2354
12312 23423
8794 1237
3245 34
I would like to sort this data in ascending order based only on the right-hand column values, while keeping the pairs of values together, i.e. reordering the left-hand values along with them. I would like to get the following:
3245 34
232 123
12 454
8794 1237
879 2354
12312 23423
I have tried the following so far:
import sys,csv
import operator
reader = csv.reader(open('Cv_0.out'),delimiter=' ')
sort = sorted(reader, key=lambda row: int(row[0]))
print sort
Any help would be really appreciated
Your input file can be dealt with even without the csv module:
with open("input") as f:
lines = (map(int,x.strip().split()) for x in f)
newLines = sorted(lines, key=lambda row: row[1])
print "\n".join(str(x)+ " " + str(y) for x,y in newLines)
IMO, the problem was using row[0] instead of row[1]: you want to sort on the second column.
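For completeness, the original csv-based attempt can also be made to work with two small changes: pass skipinitialspace=True so the double space doesn't produce empty fields, and sort on row[1] rather than row[0]. A minimal sketch (written in Python 3; blank lines are filtered out as a precaution):

import csv

with open('Cv_0.out') as f:
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    rows = [row for row in reader if row]  # drop any blank lines

# sort on the right-hand column, compared as integers
ordered = sorted(rows, key=lambda row: int(row[1]))
for left, right in ordered:
    print(left, right)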
