Extracting BLAST output columns in CSV form with python

Extracting BLAST output columns in CSV form with python - python

I have a csv file in excel which contains the output from a BLAST search in the following format:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 1 hits found
Cryptocephalus ctg7180000094003 79.59 637 110 9 38 655 1300 1935 1.00E-125 444
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 4 hits found
Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051
Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780
Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532
Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340
I have imported this as a csv into python using
with open('BLASToutput.csv', 'rU') as csvfile:
contents = csv.reader(csvfile, delimiter=' ', quotechar='|')
for row in contents:
table = ', '.join(row)
What I now want to be able to do is extract columns of data as a list. My overall aim is to count all the matches which have over 98% identity (the third column).
The issue is that, since this is not in the typical csv format, there are no headers at the top so I cant extract a column based on its header. I was thinking if I could extract the third column as a list I can then use normal list tools in python to extract just the numbers I want but I have never used pythons csv module and I'm struggling to find an appropriate command. Other questions on SO are similar but dont refer to my specific case where there are no headers and empty cells. If you could help me I would be very grateful!

The data file is not that like in CSV format. It has comments, and its delimiter is not single character, but formatted spaces.
Since your overall aim is
to count all the matches which have over 98% identity (the third column).
and the data file content is well formed, you can use normal file parsing approach:
import re
with open('BLASToutput.csv') as f:
# read the file line by line
for line in f:
# skip comments (or maybe leave as it is)
if line.startswith('#'):
# print line
continue
# split fields
fields = re.split(r' +', line)
# check if the 3rd field is greater than 98%
if float(fields[2]) > 98:
# output the matched line
print line

I managed to find one way based on:
Python: split files using mutliple split delimiters
import csv
csvfile = open("SANDoubleSuperMatrix.csv", "rU")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
identity = []
for line in reader:
identity.append(line[2])
print identity

Related

Match index function from excel in pandas

There is a match index function in Excel that i use to match if the elements are present in the required column
=iferror(INDEX($B$2:$F$8,MATCH($J4,$B$2:$B$8,0),MATCH(K$3,$B$1:$F$1,0)),0)
This is the function i am using right now and it is yielding me good results but I want to implement it in python.
brand N Z None
Honor 63 96 190
Tecno 0 695 763
from this table I want
brand L N Z
Honor 0 63 96
Tecno 0 0 695
It should compare both the column and index and give the appropriate value
i have tried the lookup function in python but that gives me the
ValueError: Row labels must have same size as column labels

What you basically do with your excel formula, is creating something like a pivot table, you can also do that with pandas. E.g. like this:
# Define the columns and brands, you like to have in your result table
# along with the dataframe in variable df it's the only input
columns_query=['L', 'N', 'Z']
brands_query=['Honor', 'Tecno', 'Bar']
# no begin processing by selecting the columns
# which should be shown and are actually present
# add the brand, even if it was not selected
columns_present= {col for col in set(columns_query) if col in df.columns}
columns_present.add('brand')
# select the brands in question and take the
# info in columns we identified for these brands
# from this generate a "flat" list-like data
# structure using melt
# it contains records containing
# (brand, column-name and cell-value)
flat= df.loc[df['brand'].isin(brands_query), columns_present].melt(id_vars='brand')
# if you also want to see the columns and brands,
# for which you have no data in your original df
# you can use the following lines (if you don't
# need them, just skip the following lines until
# the next comment)
# the code just generates data points for the
# columns and rows, which would otherwise not be
# displayed and fills them wit NaN (the pandas
# equivalent for None)
columns_missing= set(columns_query).difference(columns_present)
brands_missing= set(brands_query).difference(df['brand'].unique())
num_dummies= max(len(brands_missing), len(columns_missing))
dummy_records= {
'brand': list(brands_missing) + [brands_query[0]] * (num_dummies - len(brands_missing)),
'variable': list(columns_missing) + [columns_query[0]] * (num_dummies - len(columns_missing)),
'value': [np.NaN] * num_dummies
}
dummy_records= pd.DataFrame(dummy_records)
flat= pd.concat([flat, dummy_records], axis='index', ignore_index=True)
# we get the result by the following line:
flat.set_index(['brand', 'variable']).unstack(level=-1)
For my testdata, this outputs:
value
variable L N Z
brand
Bar NaN NaN NaN
Honor NaN 63.0 96.0
Tecno NaN 0.0 695.0
The testdata is (note, that above we don't see col None and row Foo, but we see row Bar and column L, which are actually not present in the testdata, but were "queried"):
brand N Z None
0 Honor 63 96 190
1 Tecno 0 695 763
2 Foo 8 111 231
You can generate this testdata using:
import pandas as pd
import numpy as np
import io
raw=\
"""brand N Z None
Honor 63 96 190
Tecno 0 695 763
Foo 8 111 231"""
df= pd.read_csv(io.StringIO(raw), sep='\s+')
Note: the result as shown in the output is a regular pandas dataframe. So in case you plan to write the data back to a excel sheet, there should be no problem (pandas provides methods to read/write dataframes to/from excel-files).

Do you need to use Pandas for this action. You can do it with simple python as well. Read from one text file and print out matched and processed fields.
Basic file reading in Python goes like this. Where datafile.csv is your file. This reads all the lines in one file and prints out right result. First you need to save your file in .csv format so there is a separator between fields ','.
import csv # use csv
print('brand L N Z') # print new header
with open('datafile.csv', newline='') as csvfile:
spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
next(spamreader, None) # skip old header
for row in spamreader:
# You need to add Excel Match etc... logic here.
print(row[0], 0, row[1], row[2]) # print output
Input file:
brand,N,Z,None
Honor,63,96,190
Tecno,0,695,763
Prints out:
brand L N Z
Honor 0 63 96
Tecno 0 0 695
(I am not familiar with Excel Match-function so you may need to add some logic to above Python script to get logic working with all your data.)

How can I read *.csv files that have numbers with commas using pandas?

I want to read a *.csv file that have numbers with commas.
For example,
File.csv
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201 # The last value is 1201, not 201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117 # The last value is 1117, not 117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175 # The last value is 10175, not 175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697 # The last value is 1697, not 697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272 # The last value is 1272, not 272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
...
2014/07/10,12:05:00,'10195,'10300,'10155,'10290,219,271 # The last value is 219271, not 271
2014/07/09,12:04:00,'10345,'10360,'10185,'10194,235,711 # The last value is 235711, not 711
2014/07/08,12:03:00,'10339,'10420,'10301,'10348,232,050 # The last value is 242050, not 050
It actually has 7 columns, but the values of the last column sometimes have commas and pandas take them as extra columns.
My questions is, if there are any methods with which I can make pandas takes only the first 6 commas and ignore the rest commas when it reads columns, or if there are any methods to delete commas after the 6th commas(I'm sorry, but I can't think of any functions to do that.)
Thank you for reading this :)

You can do all of it in Python without having to save the data into a new file. The idea is to clean the data and put in a dictionary-like format for pandas to grab it and turn it into a dataframe. The following should constitute a decent starting point:
from collections import defaultdict
from collections import OrderedDict
import pandas as pd
# Import the data
data = open('prices.csv').readlines()
# Split on the first 6 commas
data = [x.strip().replace("'","").split(",",6) for x in data]
# Get the headers
headers = [x.strip() for x in data[0]]
# Get the remaining of the data
remainings = [list(map(lambda y: y.replace(",",""), x)) for x in data[1:]]
# Create a dictionary-like container
output = defaultdict(list)
# Loop through the data and save the rows accordingly
for n, header in enumerate(headers):
for row in remainings:
output[header].append(row[n])
# Save it in an ordered dictionary to maintain the order of columns
output = OrderedDict((k,output.get(k)) for k in headers)
# Convert your raw data into a pandas dataframe
df = pd.DataFrame(output)
# Print it
print(df)
This yields:
Date Time Open High Low Close Volume
0 2016/11/09 12:10:00 4355 4358 4346 4351 1201
1 2016/11/09 12:09:00 4361 4362 4353 4355 1117
2 2016/11/09 12:08:00 4364 4374 4359 4360 10175
3 2016/11/09 12:07:00 4371 4376 4360 4365 590
4 2016/11/09 12:06:00 4359 4372 4358 4369 420
5 2016/11/09 12:05:00 4365 4367 4356 4359 542
6 2016/11/09 12:04:00 4379 1380 4360 4365 1697
7 2016/11/09 12:03:00 4394 4396 4376 4381 1272
8 2016/11/09 12:02:00 4391 4399 4390 4393 524
The starting file (prices.csv) is the following:
Date, Time, Open, High, Low, Close, Volume
2016/11/09,12:10:00,'4355,'4358,'4346,'4351,1,201
2016/11/09,12:09:00,'4361,'4362,'4353,'4355,1,117
2016/11/09,12:08:00,'4364,'4374,'4359,'4360,10,175
2016/11/09,12:07:00,'4371,'4376,'4360,'4365,590
2016/11/09,12:06:00,'4359,'4372,'4358,'4369,420
2016/11/09,12:05:00,'4365,'4367,'4356,'4359,542
2016/11/09,12:04:00,'4379,'1380,'4360,'4365,1,697
2016/11/09,12:03:00,'4394,'4396,'4376,'4381,1,272
2016/11/09,12:02:00,'4391,'4399,'4390,'4393,524
I hope this helps.

I guess pandas cant handle it so I would do a pre-processing with Perl to generate a new cvs and work on it.
Using Perl split can help you in this situation
perl -pne '$_ = join("|", split(/,/, $_, 7) )' < input.csv > output.csv
Then you can use the usual read_cvs on the output file with the seperator as |

One more way to solve your problem.
import re
import pandas as pd
l1 =[]
with open('/home/yusuf/Desktop/c1') as f:
headers = map(lambda x: x.strip(), f.readline().strip('\n').split(','))
for a in f.readlines():
b = re.findall("(.*?),(.*?),'(.*?),'(.*?),'(.*?),'(.*?),(.*)",a)
l1.append(list(b[0]))
df = pd.DataFrame(data=l1, columns=headers)
df['Volume'] = df['Volume'].apply(lambda x: x.replace(",",""))
df
Output:
Regex Demo:
https://regex101.com/r/o1zxtO/2

I'm pretty sure pandas can't handle that, but you can easily fix the final column. An approach in Python
with open('yourfile.csv') as csv, open('newcsv.csv','w') as result:
for line in csv:
columns = line.split(',')
if len(columns) > COLUMNAMOUNT:
columns[COLUMNAMOUNT-1] += ''.join(columns[COLUMNAMOUNT:])
result.write(','.join(columns[COLUMNAMOUNT-1]))
Now you can load the new csv in to pandas. Other solutions can be AWK or even shell scripting.

Changing row data in case it doesn't match different row

I ran into an issue while was trying to work with a CSV file. Basically, I have a CSV file which contains names and ID's. Header looks something similar to this:
New ID | name | ID that needs to be changed | name |
In the column[0], New ID column, there are numbers from 1 to 980. in the column[3], ID that needs to be changed, there are 714.What I really need to accomplish is to create column[4], which will store ID from column[1] in case name in column[1] is to be found in column[3]. I need to come up with a fucntion which will pick 1 name from column[1], scan whole column[3] to see if that name is there and if it is, ID from columnp[0] is copied to column[4]
So far I got this:
import csv
input = open('tbr.csv', "rb")
output = open('sortedTbr.csv', "wb")
reader = csv.reader(input)
writer = csv.writer(output)
for row in input:
writer.writerow(row)
print row
input.close
output.close
Which doesn't do much. It writes every single letter into a new column in a csv...

3 problems here:
first you don't specify the delimiter, I assume it's pipe. csv parser cannot autodetect the delimiter.
second, you create the reader but scan the raw input file instead,
which explains that when you write the csv back, it creates as many cells as there are letters (iterates over row as string type instead of list)
third, when you close your handles, you actually don't call close but just access to the method reference. Add () to call the methods (classical mistake, everyone gets caught once in a while)
Here's my fixed version for your "extended" question. You need 2 passes, one to read fully column 1 and the other one to check. I use a dict to store values and make a relation between name and ID
My code runs in Python 2.7 only but runs in Python 3.4 provided you comment/uncomment the indicated lines
import csv
# python 2 only, remove if using python 3:
input_handle = open('tbr.csv', "r") # don't use input: reserved kw
output = open('sortedTbr.csv', "wb")
# uncomment 2 lines below if you're using python 3
#input_handle = open('tbr.csv', "r", newline='') # don't use input: reserved kw
#output = open('sortedTbr.csv', "w", newline='')
reader = csv.reader(input_handle,delimiter='\t')
writer = csv.writer(output,delimiter='\t')
title = next(reader) # skip title line
title.append("ID2") # add column title
db = dict()
input_rows = list(reader) # read file once
input_handle.close() # actually calls close!
# first pass
for row in input_rows:
db[row[1]] = row[0] # relation: name => id
writer.writerow(title)
# second pass
for row in input_rows:
row.append(db.get(row[3],""))
writer.writerow(row)
output.close()
I used this as tbr.csv (should be .tsv since separator is TAB)
New ID name ID that needs to be changed name
492 abboui jaouad jordan 438 abboui jaouad jordan
22 abrazone nelli 536 abrazone nelli
493 abuladze damirs 736 abuladze damirs
275 afanasjeva ludmila 472 afanasjeva oksana
494 afanasjeva oksana 578 afanasjevs viktors
54 afanasjevs viktors 354 aksinovichs andrejs
166 aksinovichs andrejs 488 aksinovichs german
495 aksinovichs german 462 aleksandra
got this in output: note: added one column
New ID name ID that needs to be changed name ID2
492 abboui jaouad jordan 438 abboui jaouad jordan 492
22 abrazone nelli 536 abrazone nelli 22
493 abuladze damirs 736 abuladze damirs 493
275 afanasjeva ludmila 472 afanasjeva oksana 494
494 afanasjeva oksana 578 afanasjevs viktors 54
54 afanasjevs viktors 354 aksinovichs andrejs 166
166 aksinovichs andrejs 488 aksinovichs german 495
495 aksinovichs german 462 aleksandra
I would say it works. Don't hesitate to accept the answer :)

Splitting information in a specific column python?

Here is an example of what my file looks like:
Type Variant_class ACC_NUM dbsnp genomic_coordinates_hg18 genomic_coordinates_hg19 HGVS_cdna HGVS_protein gene disease sequence_context_hg18 sequence_context_hg19 codon_change codon_number intron_number site location location_reference_point author journal vol page year pmid entrezid sift_score sift_prediction mutpred_score
1 DM CM920001 rs1800433 null chr12:9232351:- NM_000014.4 NP_000005.2:p.C972Y A2M Chronicobstructivepulmonarydisease null CACAAAATCTTCTCCAGATGCCCTATGGCT[G/A]TGGAGAGCAGAATATGGTCCTCTTTGCTCC TGT-TAT 972 null null 2 null Poller HUMGENET 88 313 1992 1370808 2 0 DAMAGING 0.594315245478036
1 DM CM004784 rs74315453 null chr22:43089410:- NM_017436.4 NP_059132.1:p.M183K A4GALT Pksynthasedeficiency(pphenotype) null TGCTCTCCGACGCCTCCAGGATCGCACTCA[T/A]GTGGAAGTTCGGCGGCATCTACCTGGACAC ATG-AAG 183 null null 2 null Steffensen JBC 275 16723 2000 10747952 53947 0 DAMAGING 0.787878787878788
1 DM CM1210274 null null chr22:43089327:- NM_017436.4 NP_059132.1:p.Q211E A4GALT NORpolyagglutination null CTGCGGAACCTGACCAACGTGCTGGGCACC[C/G]AGTCCCGCTACGTCCTCAACGGCGCGTTCC CAG-GAG 211 null null null null Suchanowska JBC 287 38220 2012 22965229 53947 0.79 TOLERATED null
What I want to do is split the information in column 13 by the - mark. In my example file above, this column contains the data ATG-AAG and CAG-GAG. I would like to separate it with a tab separation.
I've tried my code below:
with open('disease_mut_split2.txt') as inf:
with open('disease_mut_splitfinal.txt', 'w') as outf:
for line in inf:
outf.write('\t'.join(line.split('-')))
However, this also splits the - in the 6 column, which I do not want. Is there any way I can specify the column to split with the code I have?

If you know it's always going to be at column 13, just use a slice:
'{}\t{}'.format(line[:13], line[14:])
Alternatively, if you always know it's going to be the first thing you can limit the # of splits:
>>> x = 'this has - a few - dashes - in it'
>>> x.split('-', maxsplit=1)
['this has ', ' a few - dashes - in it']
If by column you mean that your data is a csv file (tab separated files work the same way), then Python's csv module will aid you:
with open('infile.txt') as f, open('outfile.txt', 'w') as outfile:
reader = csv.reader(f, delimiter='\t')
writer = csv.writer(outfile, delimiter='\t')
writer.writerow(next(reader, None)) # Write out the header row
for row in reader:
# Note: Python lists begin with [0],
# so the 13th column will have an index of 12
row[12] = row[12].replace('-', ' ')
writer.writerow(row)

Assuming what you're doing is in fact parsing/formatting a csv file Wayne Werner's csv module approach is probably the most robust way to solve this.
As an alternative, you might consider using re.sub from the re module. The exact regex to use will depend on the data. If, for instance that column is always 3 nucleotides, - and 3 nucleotides, something like this might work:
re.sub(r'(?<=[ACTG]{3})-(?=[ACTG]{3})', '\t', line))
The regex uses lookbehind and lookahead to replace a - between two sets of 3 nucleotides, so assuming that sort of pattern doesn't appear elsewhere in your file should work well.
EDIT: Changed to re.sub For some reason the original code just had me in a split mindset!

Reading a specific row & columns of data in a text file using Python 2.7

I am new to Python and need to extract data from a text file. I have a text file below:
UNHOLTZ-DICKIE CORPORATION
CALIBRATION DATA
789 3456
222 455
333 5
344 67788
12 6789
2456 56656
And I want to read it on the shell as two columns of data only:
789 3456
222 455
333 5
344 67788
12 6789
2456 56656

Here's a Python program that reads a file and outputs the 3rd... lines (drops the first 2 lines). That's all I can deduce that you want given your short explanation.
# read the whole file
file = open("input.file", 'r')
lines = file.readlines()
file.close()
# Skip first 2 lines, output the rest to stdout
count = 0
for line in lines:
count +=1
if count > 2:
print line,

If you have numpy installed then this is a one-liner:
col1,col2 = numpy.genfromtxt("myfile.txt",skiprows=2,unpack=True)
where myfile.txt is your data file.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting BLAST output columns in CSV form with python - python

Related

Match index function from excel in pandas

How can I read *.csv files that have numbers with commas using pandas?

Changing row data in case it doesn't match different row

Splitting information in a specific column python?

Reading a specific row & columns of data in a text file using Python 2.7

Categories

Resources