Changing row data in case it doesn't match different row - python

I ran into an issue while trying to work with a CSV file. Basically, I have a CSV file which contains names and IDs. The header looks something like this:
New ID | name | ID that needs to be changed | name |
In column[0], the New ID column, there are numbers from 1 to 980. In column[2], the ID that needs to be changed, there are 714. What I really need to accomplish is to create column[4], which will store the ID from column[0] in case the name in column[1] is found in column[3]. I need to come up with a function which will pick one name from column[1], scan the whole of column[3] to see if that name is there and, if it is, copy the ID from column[0] to column[4].
So far I got this:
import csv
input = open('tbr.csv', "rb")
output = open('sortedTbr.csv', "wb")
reader = csv.reader(input)
writer = csv.writer(output)
for row in input:
    writer.writerow(row)
    print row
input.close
output.close
Which doesn't do much. It writes every single letter into a new column in a csv...

There are 3 problems here:
First, you don't specify the delimiter. Your header suggests a pipe; my test file below uses TAB. The csv reader does not auto-detect the delimiter.
Second, you create the reader but then iterate over the raw input file instead, which explains why the written CSV has as many cells as there are letters: each row is a string instead of a list, and writerow iterates over its characters.
Third, when you close your handles, you don't actually call close, you just reference the method. Add () to call it (a classic mistake, everyone gets caught once in a while).
Here's my fixed version for your "extended" question. You need 2 passes: one to read column 1 fully and another to check column 3 against it. I use a dict to store the relation between name and ID.
The code runs in Python 2.7 as written, and in Python 3.4 provided you comment/uncomment the indicated lines.
import csv
# python 2 only, remove the 2 lines below if using python 3:
input_handle = open('tbr.csv', "r")  # don't name it input: that shadows the built-in
output = open('sortedTbr.csv', "wb")
# uncomment the 2 lines below if you're using python 3
#input_handle = open('tbr.csv', "r", newline='')
#output = open('sortedTbr.csv', "w", newline='')
reader = csv.reader(input_handle, delimiter='\t')
writer = csv.writer(output, delimiter='\t')
title = next(reader)  # read the title line
title.append("ID2")  # add column title
db = dict()
input_rows = list(reader)  # read file once
input_handle.close()  # actually calls close!
# first pass
for row in input_rows:
    db[row[1]] = row[0]  # relation: name => id
writer.writerow(title)
# second pass
for row in input_rows:
    row.append(db.get(row[3], ""))
    writer.writerow(row)
output.close()
I used this as tbr.csv (should be .tsv since separator is TAB)
New ID  name                  ID that needs to be changed  name
492     abboui jaouad jordan  438                          abboui jaouad jordan
22      abrazone nelli        536                          abrazone nelli
493     abuladze damirs       736                          abuladze damirs
275     afanasjeva ludmila    472                          afanasjeva oksana
494     afanasjeva oksana     578                          afanasjevs viktors
54      afanasjevs viktors    354                          aksinovichs andrejs
166     aksinovichs andrejs   488                          aksinovichs german
495     aksinovichs german    462                          aleksandra
and got this output (note the added ID2 column):
New ID  name                  ID that needs to be changed  name                  ID2
492     abboui jaouad jordan  438                          abboui jaouad jordan  492
22      abrazone nelli        536                          abrazone nelli        22
493     abuladze damirs       736                          abuladze damirs       493
275     afanasjeva ludmila    472                          afanasjeva oksana     494
494     afanasjeva oksana     578                          afanasjevs viktors    54
54      afanasjevs viktors    354                          aksinovichs andrejs   166
166     aksinovichs andrejs   488                          aksinovichs german    495
495     aksinovichs german    462                          aleksandra
I would say it works. Don't hesitate to accept the answer :)
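For readers on Python 3 only, the same two-pass idea can be sketched with the lookup separated from the file handling. This is a minimal sketch, not the answer's original code: the function names add_matching_id and convert are made up for illustration, and the TAB delimiter is an assumption carried over from the test file above.

```python
import csv
import io


def add_matching_id(rows):
    """Append to each row the ID (column 0) whose name (column 1) equals the row's column 3."""
    # First pass: map every name in column 1 to its ID in column 0.
    name_to_id = {row[1]: row[0] for row in rows}
    # Second pass: look up the column-3 name; use "" when there is no match.
    return [row + [name_to_id.get(row[3], "")] for row in rows]


def convert(in_file, out_file, delimiter="\t"):
    """Read delimited rows from in_file and write them to out_file with an extra ID2 column."""
    reader = csv.reader(in_file, delimiter=delimiter)
    writer = csv.writer(out_file, delimiter=delimiter, lineterminator="\n")
    header = next(reader)
    writer.writerow(header + ["ID2"])
    writer.writerows(add_matching_id(list(reader)))


# Small in-memory demonstration (hypothetical, shortened data).
sample = io.StringIO(
    "New ID\tname\tID that needs to be changed\tname\n"
    "275\tafanasjeva ludmila\t472\tafanasjeva oksana\n"
    "494\tafanasjeva oksana\t578\taleksandra\n"
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

With real files, opening both inside `with open(..., newline="")` blocks closes them automatically, which avoids the forgotten `close()` call altogether.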

Related

Splitting csv by few rows in Pandas

I have a CSV file with a header of a few rows, followed by the data for sample (1); then comes another header and the data for sample (2). The number of samples (and headers) is not constant across files.
DF looks like this:
[header]
InfoMap : 4214
InfoSample:3122
Content:, ,22dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Ana 22 785 a6659
Ana 22 785 a658141
Ana 22 785 csd449
Ben 23 756 a6659
Ben 23 756 a658141
Charlie 44 733 c658141
[header]
InfoMap : 423421
InfoSample:315
Content, ,562dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Cris 82 7835 a6659
Cris 82 7485 a658141
Cris 82 7485 csd449
MATT 53 268 a6659
MATT 53 268 a658141
Dan 42 885 c658141
What I tried to do:
I need to split each header, together with its sample, into a new file. So in the case above I should get 2 files:
file1:
[header]
InfoMap : 4214
InfoSample:3122
Content:, ,22dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Ana 22 785 a6659
Ana 22 785 a658141
Ana 22 785 csd449
Ben 23 756 a6659
Ben 23 756 a658141
Charlie 44 733 c658141
file2:
[header]
InfoMap : 423421
InfoSample:315
Content, ,562dmm
Sample_name, Sample_id, Sample_phone, Sample_project
Cris 82 7835 a6659
Cris 82 7485 a658141
Cris 82 7485 csd449
MATT 53 268 a6659
MATT 53 268 a658141
Dan 42 885 c658141
How can I do it in the simplest way in pandas or core Python? As I said, the numbers of headers and samples are not constant.
I tried it with a for loop:
Looking for [header] in each line
Saving the index numbers of all [header] lines
With "open", trying to save all the sections to new files
The problem was: I can't read it as CSV because it came out as a one-column dataframe (because of the headers), and lines were read in a weird way because I have files with mixed samples.
I'm looking for a better concept. Maybe pandas has some functions I don't know about. If not, I'll keep going with my own approach.
I'm not necessarily looking for a ready-made solution, but some hints or concepts.
Here is sample pseudocode following the logic I gave in the comment, assuming that "InfoMap" is the starting point of the header:
import csv

dataList = []
count = 0
with open('YourData.csv', newline='') as File:
    reader = csv.reader(File)
    for row in reader:
        if row and "InfoMap" in row[0]:
            count += 1
            if count > 1:
                # a new header starts: flush the previous section first
                #fileName = "file" + str(count - 1)
                #WriteDataListToCSV(dataList, fileName) create a function that can write dataList into csv
                #print(dataList)
                dataList = []
            dataList.append(row)
        else:
            dataList.append(row)
# flush the last section
#fileName = "file" + str(count)
#WriteDataListToCSV(dataList, fileName)
#print(dataList)
You can uncomment the print statements to see what dataList contains
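The list-based idea can also be sketched as a small, self-contained function that splits on the "[header]" marker line itself rather than on "InfoMap". This is only a sketch under assumptions: the names split_sections and write_sections and the output naming scheme file1.csv, file2.csv, ... are hypothetical, and the lines are treated as plain text instead of being parsed as CSV.

```python
def split_sections(lines, marker="[header]"):
    """Group lines into sections, starting a new section at every marker line."""
    sections = []
    for line in lines:
        # Open a new section on a marker line, or for any lines before the first marker.
        if line.strip() == marker or not sections:
            sections.append([])
        sections[-1].append(line)
    return sections


def write_sections(sections, prefix="file"):
    """Write each section to prefix1.csv, prefix2.csv, ... (hypothetical naming)."""
    for number, section in enumerate(sections, start=1):
        with open(f"{prefix}{number}.csv", "w") as out:
            out.writelines(section)


# Shortened sample in the shape of the question's data.
sample = [
    "[header]\n", "InfoMap : 4214\n", "Ana 22 785 a6659\n",
    "[header]\n", "InfoMap : 423421\n", "Cris 82 7835 a6659\n",
]
sections = split_sections(sample)
print(len(sections))  # number of sections found
```

Calling write_sections(split_sections(open_file)) on a real file object would then produce one output file per section.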
Another sketch. The only variation is that the data is not collected into a list; each line is written straight to the currently open output file.
file_no = 0
write_file = None
with open("input.csv") as in_file:
    for line in in_file:
        if line.strip() == "[header]":
            # At the start of a new section. Close the old file, increment count, open new
            if write_file is not None:
                write_file.close()
            file_no += 1
            write_file = open(f"sub_file_{file_no}.csv", "w")
        # Just pass the current line into the currently open file
        if write_file is not None:
            write_file.write(line)
if write_file is not None:
    write_file.close()

How can you keep track of revisions in a csv file with a Python program?

I have a CSV file where each row has an ID followed by several attributes. Initially, my task was to find the IDs with matching attributes and put them together as a family. Then I output them to another CSV document, with every relationship printed on its own row.
The basic outline for the CSV file looks like this:
ID SIZE SPEED RANK
123 10 20 2
567 15 30 1
890 10 20 2
321 20 10 3
295 15 30 1
The basic outline for the python module looks like this:
import csv

FAMILIES = {}
ATTRIBUTES = ['SIZE', 'SPEED', 'RANK']  # ID is left out so rows with equal attributes share a key
# Python 2 file modes; in Python 3 use 'r' and 'w' with newline=''
with open('data.csv', 'rb') as f:
    data = csv.DictReader(f)
    for row in data:
        fam_id = str(tuple([row[field_name] for field_name in ATTRIBUTES]))
        id = row['ID']
        FAMILIES.setdefault(fam_id, [])
        FAMILIES[fam_id].append(id)
output = []
for fam_id, node_arr in FAMILIES.items():
    for from_item in node_arr:
        for to_item in node_arr:
            if from_item != to_item:
                output.append((fam_id, from_item, to_item))  # append takes a single argument
def write_array_to_csv(arr):
    with open('hdd_output_temp.csv', 'wb') as w:
        writer = csv.writer(w)
        writer.writerows(arr)
if __name__ == "__main__":
    write_array_to_csv(output)
Which would print into a CSV like this:
('10,20,2') 123 890
('10,20,2') 890 123
('15,30,1') 567 295
('15,30,1') 295 567
Now, my question is: if I were to go into the original CSV file and make some revisions, how could I alter the code to detect all the updated relationships? I would like to put all the added relationships into FAMILIES2 and all the broken relationships into FAMILIES3. So if a new ID '589' were added that matched the '20,10,3' family and '890' were updated to have different attributes of '10,20,1',
I would like FAMILIES2 to be able to output:
('20,10,3') 321 589
('20,10,3') 589 321
And FAMILIES3 to output:
('10,20,2') 123 890
('10,20,2') 890 123
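One possible way to get both lists is to build the relationships for the old and the new version of the data as sets and take the set differences. The sketch below assumes a copy of the previous data is still available and that rows are already parsed into (ID, SIZE, SPEED, RANK) tuples of strings; the function names relationships and diff_relationships are made up for illustration.

```python
def relationships(rows):
    """Return the set of (family, from_id, to_id) triples for rows sharing all attributes."""
    families = {}
    for row_id, *attributes in rows:
        families.setdefault(",".join(attributes), []).append(row_id)
    return {
        (family, a, b)
        for family, members in families.items()
        for a in members
        for b in members
        if a != b
    }


def diff_relationships(old_rows, new_rows):
    """Return (added, broken) relationship sets between two versions of the data."""
    old, new = relationships(old_rows), relationships(new_rows)
    return new - old, old - new


old_rows = [("123", "10", "20", "2"), ("567", "15", "30", "1"), ("890", "10", "20", "2"),
            ("321", "20", "10", "3"), ("295", "15", "30", "1")]
# Revised data: 589 is new, and 890 now has attributes 10,20,1.
new_rows = [("123", "10", "20", "2"), ("567", "15", "30", "1"), ("890", "10", "20", "1"),
            ("321", "20", "10", "3"), ("295", "15", "30", "1"), ("589", "20", "10", "3")]
added, broken = diff_relationships(old_rows, new_rows)
for triple in sorted(added):
    print("added", *triple)
for triple in sorted(broken):
    print("broken", *triple)
```

Here added plays the role of FAMILIES2 and broken the role of FAMILIES3; relationships that exist in both versions appear in neither set.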

Extracting BLAST output columns in CSV form with python

I have a csv file in excel which contains the output from a BLAST search in the following format:
# BLASTN 2.2.29+
# Query: Cryptocephalus androgyne
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 1 hits found
Cryptocephalus ctg7180000094003 79.59 637 110 9 38 655 1300 1935 1.00E-125 444
# BLASTN 2.2.29+
# Query: Cryptocephalus aureolus
# Database: SANdouble
# Fields: query id subject id % identity alignment length mismatches gap opens q. start q. end s. start s. end evalue bit score
# 4 hits found
Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051
Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780
Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532
Cryptocephalus ctg7180000093818 78.64 515 106 4 8 519 1270 1783 2.00E-94 340
I have imported this as a csv into python using
with open('BLASToutput.csv', 'rU') as csvfile:
    contents = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in contents:
        table = ', '.join(row)
What I now want to be able to do is extract columns of data as a list. My overall aim is to count all the matches which have over 98% identity (the third column).
The issue is that, since this is not in the typical CSV format, there are no headers at the top, so I can't extract a column based on its header. I was thinking that if I could extract the third column as a list, I could then use normal list tools in Python to extract just the numbers I want, but I have never used Python's csv module and I'm struggling to find an appropriate command. Other questions on SO are similar but don't refer to my specific case where there are no headers and there are empty cells. If you could help me I would be very grateful!
The data file is not really in CSV format. It has comments, and its delimiter is not a single character but runs of spaces.
Since your overall aim is
to count all the matches which have over 98% identity (the third column).
and the data file content is well formed, you can use a normal file-parsing approach:
import re
with open('BLASToutput.csv') as f:
    # read the file line by line
    for line in f:
        # skip comments (or maybe leave as it is)
        if line.startswith('#'):
            # print line
            continue
        # split fields
        fields = re.split(r' +', line)
        # check if the 3rd field is greater than 98%
        if float(fields[2]) > 98:
            # output the matched line
            print line
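Since the stated aim is a count rather than a printout, the same line-by-line approach can be wrapped in a function that returns the number of matching hits. This is a Python 3 sketch under the assumption that the identity value is always the third whitespace-separated field, as in the sample above; the name count_high_identity is hypothetical.

```python
def count_high_identity(lines, threshold=98.0):
    """Count hit lines whose third whitespace-separated field exceeds threshold."""
    count = 0
    for line in lines:
        fields = line.split()
        # Skip comment lines, blank lines and anything too short to hold an identity value.
        if line.startswith("#") or len(fields) < 3:
            continue
        if float(fields[2]) > threshold:
            count += 1
    return count


# A few lines copied from the question's sample output.
sample = [
    "# BLASTN 2.2.29+",
    "# 4 hits found",
    "Cryptocephalus ctg7180000093816 95.5 667 12 8 7 655 1269 1935 0 1051",
    "Cryptocephalus ctg7180000094021 88.01 667 62 8 7 655 1269 1935 0 780",
    "Cryptocephalus ctg7180000094015 81.26 667 105 13 7 654 1269 1934 2.00E-152 532",
]
print(count_high_identity(sample))
print(count_high_identity(sample, threshold=85))
```

An open file object can be passed in place of the list, since the function only iterates over lines.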
I managed to find one way based on:
Python: split files using mutliple split delimiters
import csv
csvfile = open("SANDoubleSuperMatrix.csv", "rU")
dialect = csv.Sniffer().sniff(csvfile.read(1024))
csvfile.seek(0)
reader = csv.reader(csvfile, dialect)
identity = []
for line in reader:
    identity.append(line[2])
print identity

Organising two column data

Hi, I have two-column data stored in a file called "Cv_0.out"; the columns are separated by two spaces:
12 454
232 123
879 2354
12312 23423
8794 1237
3245 34
I would like to then sort this data in ascending order based only on the right hand column values whilst at the same time keeping the pairs of values together, so reordering the left hand side values. I would like to get the following:
3245 34
232 123
12 454
8794 1237
879 2354
12312 23423
I have tried the following so far:
import sys,csv
import operator
reader = csv.reader(open('Cv_0.out'),delimiter=' ')
sort = sorted(reader, key=lambda row: int(row[0]))
print sort
Any help would be really appreciated
Your input file can be handled even without the csv module:
with open("input") as f:
    lines = (map(int,x.strip().split()) for x in f)
    newLines = sorted(lines, key=lambda row: row[1])
print "\n".join(str(x)+ " " + str(y) for x,y in newLines)
IMO, the problem was using row[0] instead of row[1], if you wanted to sort on the second column.
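The snippet above is Python 2 (print statement, and map returning a list). A Python 3 sketch of the same idea, with the hypothetical function name sort_by_second_column and the question's data inlined as a list of lines, could look like this:

```python
def sort_by_second_column(lines):
    """Parse 'left right' integer pairs and return them sorted by the right-hand value."""
    pairs = [tuple(int(field) for field in line.split()) for line in lines if line.strip()]
    return sorted(pairs, key=lambda pair: pair[1])


sample = ["12  454", "232  123", "879  2354", "12312  23423", "8794  1237", "3245  34"]
for left, right in sort_by_second_column(sample):
    print(left, right)
```

Using line.split() with no argument also sidesteps the two-space separator, because it splits on any run of whitespace.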

Reading a specific row & columns of data in a text file using Python 2.7

I am new to Python and need to extract data from a text file. I have a text file below:
UNHOLTZ-DICKIE CORPORATION
CALIBRATION DATA
789 3456
222 455
333 5
344 67788
12 6789
2456 56656
And I want to read it on the shell as two columns of data only:
789 3456
222 455
333 5
344 67788
12 6789
2456 56656
Here's a Python program that reads a file and outputs everything from the 3rd line onwards (it drops the first 2 lines). That's all I can deduce that you want, given your short explanation.
# read the whole file
file = open("input.file", 'r')
lines = file.readlines()
file.close()
# Skip first 2 lines, output the rest to stdout
count = 0
for line in lines:
    count +=1
    if count > 2:
        print line,
If you have numpy installed then this is a one-liner (after import numpy; recent numpy versions name the parameter skip_header rather than skiprows):
col1,col2 = numpy.genfromtxt("myfile.txt",skip_header=2,unpack=True)
where myfile.txt is your data file.
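Without numpy, the two columns can also be collected into plain lists. The following is only a sketch: the function name read_columns is hypothetical, and it assumes that every data line holds exactly two integers and that the two title lines always come first.

```python
from itertools import islice


def read_columns(lines, skip=2):
    """Skip the first `skip` lines and return the remaining rows as two integer columns."""
    col1, col2 = [], []
    for line in islice(lines, skip, None):
        fields = line.split()
        if len(fields) != 2:
            continue  # ignore blank or malformed lines
        col1.append(int(fields[0]))
        col2.append(int(fields[1]))
    return col1, col2


# First lines of the question's file, inlined for the demonstration.
sample = ["UNHOLTZ-DICKIE CORPORATION", "CALIBRATION DATA", "789 3456", "222 455", "333 5"]
col1, col2 = read_columns(sample)
print(col1)
print(col2)
```

An open file object can be passed instead of the list, because islice works on any iterable of lines.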
