Sum all values in CSV - python

I have a CSV file with 0s and 1s and need to determine the sum of the entire file. The file looks like this when opened in Excel:
0 1 1 1 0 0 0 1 0 1
1 0 1 0 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 1
0 1 1 1 1 1 1 0 1 1
0 0 1 0 1 0 1 1 0 1
0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 1 1 0 1 1
0 0 1 1 0 0 1 1 0 1
1 0 1 0 1 0 1 1 1 0
0 1 0 0 1 0 0 0 1 1
Using this script I can sum the values of each row and they print out in a single column:
import csv
import numpy as np

path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
with infile as file_in:
    fin = csv.reader(file_in, delimiter=',')
    for line in fin:
        print line.count('1')
I need to be able to sum up the resulting column, but my experience with this is limited. Looking for suggestions. Thanks.

If you have more than just 1s and 0s, map to int and sum all rows:
with open(r'E:\myPy\one_zero.csv') as f:
    r = csv.reader(f, delimiter=',')
    count = sum(sum(map(int, row)) for row in r)
Or just count the 1s:
with open(r'E:\myPy\one_zero.csv') as f:
    r = csv.reader(f, delimiter=',')
    count = sum(row.count("1") for row in r)
Just use with open(r'E:\myPy\one_zero.csv') directly; you don't need to, and should not, open the file first and then pass the handle to with.
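A minimal side-by-side sketch of why that matters, using a stand-in path:
# Anti-pattern: open() sits outside the with statement, so an exception
# raised between open() and with would leave the file handle unclosed.
infile = open('data.csv', 'r')
with infile as f:
    total = sum(row.count('1') for row in f)

# Idiomatic: open() inside the with statement; the file is closed
# no matter how the block exits.
with open('data.csv', 'r') as f:
    total = sum(row.count('1') for row in f)
print(total)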

path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
answer = 0
with infile as file_in:
    fin = csv.reader(file_in, delimiter=',')
    for line in fin:
        a = line.count('1')
        answer += a
print answer
Example:
answer = 0
lines = [[1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1]]
for line in lines:
    a = line.count(1)
    answer += a
print answer
7
One possible error is mixing up:
line.count('1')
vs
line.count(1)
The first counts the string '1', the second the integer 1. csv.reader yields rows as lists of strings, so use '1' there; with lists of ints, as in the example above, use 1.
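A quick interactive check makes the distinction concrete:
>>> row = ['1', '0', '1']   # what csv.reader actually yields
>>> row.count('1')
2
>>> row.count(1)            # the integer 1 never appears in a list of strings
0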

Why use the CSV module at all? You have a file full of 0s, 1s, commas and newlines. Just open the file, read() it and count the 1s:
>>> with open(filename, 'r') as fin: print fin.read().count('1')
That should get you what you want, no?

Related

Merge 2 or more csv files with time overlap data

How do I merge 2 or more csv files with time overlap data? For example,
data1 is
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2 is
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
What I want is:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
So the last few rows of data from the 1st csv file are deleted, and the 2nd csv file is merged in so that the time sequence is increasing.
How can that be done? Thanks.
Python has an excellent built-in library function to help with this called heapq.merge().
Assuming your data is space delimited, you could use this as follows:
from heapq import merge
import csv

filenames = ['data1.csv', 'data2.csv']
merge_list = []

for filename in filenames:
    f_input = open(filename)
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)
    header = next(csv_input)
    merge_list.append(csv_input)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=' ')
    csv_output.writerow(header)
    csv_output.writerows(merge(*merge_list, key=lambda x: float(x[0])))
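(Note: the key argument to heapq.merge requires Python 3.5 or later.)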
This would produce CSV output like:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84005458 0 0.0093211568 0
0.84043291 0 0.0095214359 0
0.94034343 0 0.0094739951 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
This will work for any number of input CSV files.
If both files are already individually ordered by time, a simple line-by-line scan is enough:
# CSV cells are assumed to be comma-separated; change if required
delimiter = ','

# open files and read lines
f1 = open('data1.csv', 'r')
f1_lines = f1.readlines()
f1.close()
f2 = open('data2.csv', 'r')
f2_lines = f2.readlines()
f2.close()

# extract header
output_lines = [f1_lines[0]]

# start scanning from line 2 of both files (line 1 is the header)
f1_index = 1
f2_index = 1
while True:
    # all of data1 is processed; append the remaining lines from data2
    if f1_index >= len(f1_lines):
        output_lines += f2_lines[f2_index:]
        break
    # all of data2 is processed; append the remaining lines from data1
    if f2_index >= len(f2_lines):
        output_lines += f1_lines[f1_index:]
        break
    f1_line_time = float(f1_lines[f1_index].split(delimiter)[0])  # the time cell of data1
    f2_line_time = float(f2_lines[f2_index].split(delimiter)[0])  # the time cell of data2
    if f1_line_time < f2_line_time:
        output_lines.append(f1_lines[f1_index])
        f1_index += 1
    elif f1_line_time == f2_line_time:
        # if they are equal in time, pick one and advance both
        output_lines.append(f1_lines[f1_index])
        f1_index += 1
        f2_index += 1
    else:
        output_lines.append(f2_lines[f2_index])
        f2_index += 1

f_output = open('out.csv', 'w')
f_output.write(''.join(output_lines))
f_output.close()
Another option:
import csv

delimiter = " "
with open("data1.csv", "r") as fin1,\
     open("data2.csv", "r") as fin2,\
     open("data.csv", "w") as fout:
    reader1 = csv.reader(fin1, delimiter=delimiter)
    reader2 = csv.reader(fin2, delimiter=delimiter)
    writer = csv.writer(fout, delimiter=delimiter)
    next(reader2)
    first_row = next(reader2)
    start2 = float(first_row[0])
    writer.writerow(next(reader1))
    for row in reader1:
        if start2 <= float(row[0]):
            break
        writer.writerow(row)
    writer.writerow(first_row)
    writer.writerows(reader2)
The assumption is that the files are already individually ordered by time:
First take the first data row of data2.csv and convert its first entry into a float start2.
With that in mind write all rows from data1.csv with a time less than start2 into the new file data.csv, and break out of the loop once the condition isn't met anymore.
Then write the already extracted first data row from data2.csv to data.csv, and afterwards write the rest of data2.csv to data.csv.
Result for
data1.csv
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2.csv
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
is
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
A more general solution (multiple files) could look like:
import csv

delimiter = " "
files = ["data1.csv", "data2.csv", "data3.csv"]
stops = []

for file in files[1:]:
    with open(file, "r") as f:
        reader = csv.reader(f, delimiter=delimiter)
        header = next(reader)
        stops.append(float(next(reader)[0]))
stops.append(float("inf"))

with open("data.csv", "w") as fout:
    writer = csv.writer(fout, delimiter=delimiter)
    writer.writerow(header)
    for stop, file in zip(stops, files):
        with open(file, "r") as fin:
            next(fin)
            reader = csv.reader(fin, delimiter=delimiter)
            for row in reader:
                if stop <= float(row[0]):
                    break
                writer.writerow(row)
This would work for overlaps looking like
1. file: |------|
2. file: |--------|
3. file: |------|
but not
1. file: |--------|
2. file: |-------|
3. file: |--------------|
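For arbitrary overlaps, a sketch of one alternative (building on the heapq.merge answer above; the file names and shared header are assumptions): merge all files globally by time, then skip rows that repeat a timestamp already written:
import csv
from heapq import merge

delimiter = " "
files = ["data1.csv", "data2.csv", "data3.csv"]  # hypothetical file names

readers = []
for name in files:
    f = open(name)
    r = csv.reader(f, delimiter=delimiter)
    header = next(r)  # every file is assumed to share the same header
    readers.append(r)

with open("data.csv", "w", newline="") as fout:
    writer = csv.writer(fout, delimiter=delimiter)
    writer.writerow(header)
    last_time = None
    # key= requires Python 3.5+; rows come out globally ordered by time
    for row in merge(*readers, key=lambda row: float(row[0])):
        t = float(row[0])
        if t != last_time:  # drop rows that duplicate the previous timestamp
            writer.writerow(row)
            last_time = t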

How to extract columns from a Python list?

My Python code
import operator

with open('index.txt') as f:
    lines = f.read().splitlines()
print type(lines)
print len(lines)
l2 = lines[1::3]
print len(l2)
print l2[0]
list1 = [0, 2]
my_items = operator.itemgetter(*list1)
new_list = [my_items(x) for x in l2]
with open('newindex1.txt', 'w') as thefile:
    for item in l2:
        thefile.write("%s\n" % item)
A couple of lines from index.txt:
0 0 0
0 1 0
0 2 0
1 0 0
1 1 0
1 2 0
2 0 0
2 1 0
2 2 0
3 0 0
A couple of lines from newindex1.txt:
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
I wanted to read the file as a list, then choose every third row, and finally select the first and third columns from that list. It seems that I do not understand how operator works.
If I try Back2Basics' solution:
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ')
anotherarray = myarray[::3][0,2]
I got
File "a12.py", line 4, in <module>
anotherarray = myarray[::3][0,2]
IndexError: too many indices
You don't need to read all the data into memory at all; you can use itertools.islice to pick out the rows you want and the csv lib to read and write the data:
from operator import itemgetter
from itertools import islice
import csv

with open("in.txt") as f, open('newindex1.txt', 'w') as out:
    r = csv.reader(f, delimiter=" ")
    wr = csv.writer(out, delimiter=" ")
    for row in iter(lambda: list(islice(r, 0, 3, 3)), []):
        wr.writerow(map(itemgetter(0, 2), row)[0])
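Each call to the lambda pulls three rows from the reader but keeps only the first, returned as a one-element list; iter() keeps calling it until the reader is exhausted and the empty list comes back, which matches the [] sentinel and ends the loop.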
I'd highly suggest using numpy for this, since it is all numerical data that fits nicely into memory. The code looks like this:
import numpy as np

# np.fromfile with a separator returns a flat 1-D array (which is why
# two-axis indexing raised IndexError), so reshape into rows of three first
myarray = np.fromfile('index.txt', dtype=int, sep=' ').reshape(-1, 3)
anotherarray = myarray[::3, ::2]
and then you want to write the file:
anotherarray.tofile('newfile.txt', sep=" ")
The way the array slicing line [::3,::2] reads is "take every 3rd row starting from 0, and take every other column starting from 0"
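On the ten sample rows shown above, this selects rows 0, 3, 6 and 9 and columns 0 and 2, giving array([[0, 0], [1, 0], [2, 0], [3, 0]]).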
I think you need something like this:
lines = []
with open('index.txt', 'r') as fi:
    lines = fi.read().splitlines()
lines = [line.split() for line in lines]

with open('answer.txt', 'w') as fo:
    for row in range(len(lines)):
        # keep every third row, starting from the second (index 1)
        if row % 3 == 1:
            fo.write('%s %s\n' % (lines[row][0], lines[row][2]))

how to populate a matrix in python

I was trying to write code whose output is a matrix, but being a novice, I am not getting it right. Basically I want to generate a matrix of counts of A, C, G, T for each column. I was able to do it for a single column, but I am having a hard time doing it for the other columns.
Input file
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
My code so far
fh_in = open("consensus_seq.txt", 'r')
A_count = 0
C_count = 0
G_count = 0
T_count = 0
result = []
for line in fh_in:
    line = line.strip()
    if not line.startswith(">"):
        for nuc in line[0]:
            if nuc == "A":
                A_count += 1
            if nuc == "C":
                C_count += 1
            if nuc == "G":
                G_count += 1
            if nuc == "T":
                T_count += 1
result.append(A_count)
result.append(C_count)
result.append(G_count)
result.append(T_count)
print result
Output
[5, 0, 1, 1]
The actual output that I want is
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Any help/hint is appreciated.
First make a list of the rows, stripping out the lines starting with >. Then zip this to turn it into a list of columns. Finally, make a list of the counts of each letter per column.
rows = [line.strip() for line in infile if not line.startswith('>')]
columns = zip(*rows)
for letter in 'ACGT':
    print letter, [column.count(letter) for column in columns]
However this may be memory intensive if your file is very large. An alternative is just to go through line by line counting the letters.
counts = {letter: [0] * 8 for letter in 'ACGT'}
for line in infile:
    if not line.startswith('>'):
        for i, letter in enumerate(line.strip()):
            counts[letter][i] += 1
for letter, columns in counts.items():
    print letter, columns
You could also use a Counter, especially if you aren't sure in advance how many columns there will be:
from collections import Counter
# ...
counts = Counter()
for line in infile:
    if not line.startswith('>'):
        counts.update(enumerate(line.strip()))
columns = range(max(counts.keys())[0] + 1)  # +1 so the last column is included
for letter in 'ACGT':
    print letter, [counts[column, letter] for column in columns]
You could use numpy to load the text file. Since the format is a little funky it is hard to load, but the summation becomes trivial after that:
import numpy as np
data = np.loadtxt("raw.txt", comments=">",
                  converters={0: lambda s: [x for x in s]}, dtype=str)
print (data=="A").sum(axis=0)
print (data=="T").sum(axis=0)
print (data=="C").sum(axis=0)
print (data=="G").sum(axis=0)
Output:
[5 1 0 0 5 5 0 0]
[1 5 0 0 0 1 1 6]
[0 0 1 4 2 0 6 1]
[1 1 6 3 0 1 0 0]
The real advantage to this is the numpy array you've constructed can do other things. For example, let's say I wanted to know, instead of the sum, the average number of times we found an A along the columns of the "Rosalinds":
print (data=="A").mean(axis=0)
[ 0.71428571 0.14285714 0. 0. 0.71428571 0.71428571 0. 0.]
import collections

answer = []
with open('blah') as infile:
    rows = [line.strip() for _, line in zip(infile, infile)]
cols = zip(*rows)
for col in cols:
    d = collections.Counter(col)
    answer.append([d[i] for i in "ATCG"])
answer = [list(i) for i in zip(*answer)]
for line in answer:
    print(' '.join([str(i) for i in line]))
Output (rows in A, T, C, G order):
5 1 0 0 5 5 0 0
1 5 0 0 0 1 1 6
0 0 1 4 2 0 6 1
1 1 6 3 0 1 0 0

Why is .readlines() making a list of individual characters?

I have a text file of this format:
EFF 3500. GRAVITY 0.00000 SDSC GRID [+0.0] VTURB 2.0 KM/S L/H 1.25
wl(nm) Inu(ergs/cm**2/s/hz/ster) for 17 mu in 1221 frequency intervals
1.000 .900 .800 .700 .600 .500 .400 .300 .250 .200 .150 .125 .100 .075 .050 .025 .010
9.09 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.35 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.61 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.77 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.96 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10.20 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
10.38 0.000E+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
...more numbers
I'm trying to make it so File[0][0] will print the word "EFF" and so on.
import sys
import numpy as np
from math import *
import matplotlib.pyplot as plt

print 'Number of arguments:', len(sys.argv), 'arguments.'
print 'Argument List:', str(sys.argv)
z = np.array(sys.argv)  # store all of the file names into an array
i = len(sys.argv)  # the length of the filenames array
File = open(str(z[1])).readlines()  # load spectrum file
for n in range(0, len(File)):
    File[n].split()
for n in range(0, len(File[1])):
    print File[1][n]
However, it keeps outputting individual characters as if each list index were a single character. This includes whitespace too. I have split() in a loop because if I put readlines().split() it gives an error.
Output:
E
F
F
3
5
0
0
.
G
R
A
V
I
...etc.
What am I doing wrong?
>>> text = """some
... multiline
... text
... """
>>> lines = text.splitlines()
>>> for i in range(len(lines)):
...     lines[i].split()  # split *returns* the list of tokens;
...     # it does *not* modify the string in place
...
['some']
['multiline']
['text']
>>> lines #strings unchanged
['some', 'multiline', 'text']
>>> for i in range(len(lines)):
...     lines[i] = lines[i].split()  # you have to modify the list
...
>>> lines
[['some'], ['multiline'], ['text']]
If you want a one-liner do:
>>> words = [line.split() for line in text.splitlines()]
>>> words
[['some'], ['multiline'], ['text']]
Using a file object, it should be:
with open(z[1]) as f:
    File = [line.split() for line in f]
By the way, you are using an anti-idiom when looping. If you want to loop over an iterable, simply do:
for element in iterable:
    # ...
If you also need the index of the element, use enumerate:
for index, element in enumerate(iterable):
    # ...
In your case:
for i, line in enumerate(File):
    File[i] = line.split()
for word in File[1]:
    print word
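With that change, indexing behaves as the question expects:
>>> File[0][0]
'EFF'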
You want something like this:
for line in File:
    fields = line.split()
    # fields[0] is "EFF", fields[1] is "3500.", etc.
The split() method returns a list of strings; it does not modify the object it is called on.

How to find the average of multiple columns in a file using python

Hi, I have a file with too many columns to open in Excel. Each column has 10 rows of numerical values 0-2, plus a row giving the title of the column. I would like the output to be the name of each column and the average value of its 10 rows. The file is too large to open in Excel 2000, so I have to try using Python. Any tips on an easy way to do this?
Here is a sample of the first 3 columns:
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
I want Python to output, as a text file:
Trial 1 Trial 2 Trial 3
1 2 1 (whatever the averages are)
A memory-friendly solution without using any modules:
with open("filename", "rtU") as f:
    columns = f.readline().strip().split(" ")
    numRows = 0
    sums = [0] * len(columns)
    for line in f:
        # skip empty lines
        if not line.strip():
            continue
        values = line.split(" ")
        for i in xrange(len(values)):
            sums[i] += int(values[i])
        numRows += 1
    for index, summedRowValue in enumerate(sums):
        print columns[index], 1.0 * summedRowValue / numRows
You can use Numpy:
import numpy as np
from StringIO import StringIO
s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')
data = np.loadtxt(s, skiprows=1) # skip header row
print data.mean(axis=0) # column means
# OUTPUT: array([ 0.8, 1. , 0.8])
Note that the first argument to loadtxt could be the name of your file instead of a file-like object.
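For example (assuming the data lives in a file named input.txt):
data = np.loadtxt('input.txt', skiprows=1)
print data.mean(axis=0)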
You can use the built-in csv module:
import csv

csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]

def average(l):
    return float(sum(l)) / len(l)

averages = [int(round(average(trial))) for trial in zip(*values)]
print ' '.join(headers)
print ' '.join(str(x) for x in averages)
Result:
Trial1 Trial2 Trial3
1 1 1
Less of an answer than it is an alternative understanding of the problem:
You could think of each line being a vector. In this way, the average done column-by-column is just the average of each of these vectors. All you need in order to do this is
A way to read a line into a vector object,
A vector addition operation,
Scalar multiplication (or division) of vectors.
Python comes (I think) with most of this already installed, but this should lead to some easily readable code.
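A minimal sketch of that idea, assuming the space-delimited layout from the question and a file named input.txt (no modules needed):
def read_vector(line):
    # parse one line into a vector (list) of numbers
    return [float(x) for x in line.split()]

def add(u, v):
    # component-wise vector addition
    return [a + b for a, b in zip(u, v)]

def scale(u, k):
    # scalar multiplication of a vector
    return [a * k for a in u]

with open('input.txt') as f:
    headers = f.readline().split()
    total = [0.0] * len(headers)
    n = 0
    for line in f:
        if line.strip():
            total = add(total, read_vector(line))
            n += 1
    mean = scale(total, 1.0 / n)

print(' '.join(headers))
print(' '.join(str(round(m, 2)) for m in mean))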
