how to populate a matrix in python - python

I was trying to write a code that needs to be outputted as matrix but since being a novice, i am not getting it right. Basically i want to generate a matrix of counts for each of A,C,G,T for each column. I was able to do it for a single column but having hard time to do it for other columns.
Input file
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
My code so far
fh_in = open("consensus_seq.txt", 'r')
A_count = 0
C_count = 0
G_count = 0
T_count = 0
result = []
for line in fh_in:
line = line.strip()
if not line.startswith(">"):
for nuc in line[0]:
if nuc == "A":
A_count += 1
if nuc == "C":
C_count += 1
if nuc == "G":
G_count += 1
if nuc == "T":
T_count += 1
result.append(A_count)
result.append(C_count)
result.append(G_count)
result.append(T_count)
print result
Output
[5, 0, 1, 1]
The actual output that i want is
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Any help/hint is appreciated.

First make a list of the rows, stripping out the lines starting with >. Then you can zip this to turn it into a list of columns. Then you can make a list of column counts of each letter.
rows = [line.strip() for line in infile if not line.startswith('>')]
columns = zip(*rows)
for letter in 'ACGT':
print letter, [column.count(letter) for column in columns]
However this may be memory intensive if your file is very large. An alternative is just to go through line by line counting the letters.
counts = {letter: [0] * 8 for letter in 'ACGT'}
for line in infile:
if not line.startswith('>'):
for i, letter in enumerate(line.strip()):
counts[letter][i] += 1
for letter, columns in counts.items():
print letter, columns
You could also use a Counter, especially if you aren't sure in advance how many columns there will be:
from collections import Counter
# ...
counts = Counter()
for line in infile:
if not line.startswith('>'):
counts.update(enumerate(line.strip()))
columns = range(max(counts.keys())[0])
for letter in 'ACGT':
print letter, [counts[column, letter] for column in columns]

You could use numpy to load the text file. Since the format is a little funky it is hard to load, but the summation becomes trivial after that:
import numpy as np
data = np.loadtxt("raw.txt", comments=">",
converters = {0: lambda s: [x for x in s]}, dtype=str)
print (data=="A").sum(axis=0)
print (data=="T").sum(axis=0)
print (data=="C").sum(axis=0)
print (data=="G").sum(axis=0)
Output:
[5 1 0 0 5 5 0 0]
[1 5 0 0 0 1 1 6]
[0 0 1 4 2 0 6 1]
[1 1 6 3 0 1 0 0]
The real advantage to this is the numpy array you've constructed can do other things. For example, let's say I wanted to know, instead of the sum, the average number of times we found an A along the columns of the "Rosalinds":
print (data=="A").mean(axis=0)
[ 0.71428571 0.14285714 0. 0. 0.71428571 0.71428571 0. 0.]

import collections
answer = []
with open('blah') as infile:
rows = [line.strip() for _,line in zip(infile, infile)]
cols = zip(*rows)
for col in cols:
d = collections.Counter(col)
answer.append([d[i] for i in "ATCG"])
answer = [list(i) for i in zip(*answer)]
for line in answer:
print(' '.join([str(i) for i in line]))
Output:
5 1 0 1 0
0 0 1 6 6
5 0 2 0 1
0 1 6 0 0

Related

how to \n in certain places in a list

Trying to opening a new line in different points in a generated list.
ive tried to use this to seperate a list but it doesnt work.
for j in range (num,0,-1):
for i in range(0,len(num),j):
blist[i:j]
print(blist)
heres my code
num=int(input('Size: '))
list=[]
blist=[]
for k in range(num,-1,-1):
for i in range(0,num,1):
list.append(i)
num-=1
print(list)
for j in range (num,0,-1):
for i in range(0,len(num),j):
blist[i:j]
print(blist)
heres the expected result
Size: 8
0 1 2 3 4 5 6 7
0 1 2 3 4 5 6
0 1 2 3 4 5
0 1 2 3 4
0 1 2 3
0 1 2
0 1
0
Size: 3
0 1 2
0 1
0
This works:
n = int(input('Size: '))
L = [str(i) for i in range(n)]
for i in range(n):
print(' '.join(L[:n-i]))
Line by line explanation:
L = [str(i) for i in range(n)] Create a list of string digits from 0 up to n-1
for i in range(n): set i from 0 up to n-1
L[:n-i] slices the list L from start up to n-i (not inclusive)
' '.join(L[:n-i]) just glues together all the elements of the resulting slice with a white space ' '
How about this?
num=int(input('Size: '))
list_=[]
for k in range(num,-1,-1):
list_.append([str(i) for i in range(0,num,1)])
num -= 1
result = '\n'.join([' '.join(l) for l in list_])
print(result)
The result takes each list in list_ and joins them by a space, then the next join separates them by newlines.
Below code can work for me.
num=int(input('Size: '))
list_data = list(range(num))
for i in reversed(range(1, num +1 )):
print( list_data[0:i])
You can unpack the output of range as arguments to print instead:
num = int(input('Size: '))
for n in range(num):
print(*range(num - n))
Sample input/output:
Size: 7
0 1 2 3 4 5 6
0 1 2 3 4 5
0 1 2 3 4
0 1 2 3
0 1 2
0 1
0

How to extract columnes from Python list?

My Python code
import operator
with open('index.txt') as f:
lines = f.read().splitlines()
print type(lines)
print len(lines)
l2=lines[1::3]
print len(l2)
print l2[0]
list1 = [0,2]
my_items = operator.itemgetter(*list1)
new_list = [ my_items(x) for x in l2 ]
with open('newindex1.txt','w') as thefile:
for item in l2:
thefile.write("%s\n" % item)
Couple of lines from index.txt
0 0 0
0 1 0
0 2 0
1 0 0
1 1 0
1 2 0
2 0 0
2 1 0
2 2 0
3 0 0
Couple of lines from newindex1.txt
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
I wanted to read the file as a list,then choose every third row and then finally select first and the third column from that list.It seems that I do not understand how operator works.
If I try with Back2Basics solution
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ')
anotherarray = myarray[::3][0,2]
I got
File "a12.py", line 4, in <module>
anotherarray = myarray[::3][0,2]
IndexError: too many indices
You don't need to read all the data into memory at all, you can use itertools.islice to parse the rows you want and the csv lib to read and write the data:
from operator import itemgetter
from itertools import islice
import csv
with open("in.txt") as f, open('newindex1.txt','w') as out:
r = csv.reader(f, delimiter=" ")
wr = csv.writer(out, delimiter=" ")
for row in iter(lambda: list(islice(r, 0, 3, 3)), []):
wr.writerow(map(itemgetter(0, 2), row)[0])
I'd highly suggest using numpy for this. The reason being this is all numerical data that fits so nicely into memory. The code looks like this.
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ')
anotherarray = myarray[::3,::2]
and then you want to write the file
anotherarray.tofile('newfile.txt', sep=" ")
The way the array slicing line [::3,::2] reads is "take every 3rd row starting from 0, and take every other column starting from 0"
I think you need something this?
lines = []
with open('index.txt', 'r') as fi:
lines = fi.read().splitlines()
lines = [line.split() for line in lines]
with open('answer.txt', 'w') as fo:
for column in range(len(lines)):
if (column + 1) % 3:
fo.write('%s %s\n' % (lines[column][0], lines[column][2]))

Sum all values in CSV

I have a CSV file with 0s and 1s and need to determine the sum total of the entire file. The file looks like this when opened in ExCel:
0 1 1 1 0 0 0 1 0 1
1 0 1 0 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 1
0 1 1 1 1 1 1 0 1 1
0 0 1 0 1 0 1 1 0 1
0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 1 1 0 1 1
0 0 1 1 0 0 1 1 0 1
1 0 1 0 1 0 1 1 1 0
0 1 0 0 1 0 0 0 1 1
Using this script I can sum the values of each row and they print out in a single column:
import csv
import numpy as np
path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
with infile as file_in:
fin = csv.reader(file_in, delimiter = ',')
for line in fin:
print line.count('1')
I need to be able to sum up the resulting column, but my experience with this is mild. Looking for suggestions. Thanks.
If you have more than just 1's and 0's map to int and sum all rows:
with open( r'E:\myPy\one_zero.csv') as f:
r = csv.reader(f, delimiter = ',')
count = sum(sum(map(int,row)) for row in r)
Or just count the 1's:
with open( r'E:\myPy\one_zero.csv' ) as f:
r = csv.reader(f, delimiter = ',')
count = sum(row.count("1") for row in r)
Just use with open(r'E:\myPy\one_zero.csv'), you don't need to and should not open and then pass the file handle to with.
path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
answer = 0
with infile as file_in:
fin = csv.reader(file_in, delimiter = ',')
for line in fin:
a = line.count(1)
answer += a
print answer
Example:
answer = 0
lines = [[1, 0, 0, 1],[1,1,1,1],[0,0,0,1]]
for line in lines:
a = line.count(1)
answer += a
print answer
7
One possible error is you used:
line.count('1')
vs
line.count(1)
looking for a string instead of a numeric
Why use the CSV module at all? You have a file full of 0s, 1s, commas and newlines. Just open the file, read() it and count the 1s:
>>> with open(filename, 'r') as fin: print fin.read().count('1')
That should get you what you want, no?

Python Ignoring What is in a list?

Working on a project for CS1 that prints out a grid made of 0s and adds shapes of certain numbered sizes to it. Before it adds a shape it needs to check if A) it will fit on the grid and B) if something else is already there. The issue I am having is that when run, the function that checks to make sure placement for the shapes is valid will always do the first and second shapes correctly, but any shape added after that will only "see" the first shape added when looking for a collision. I checked to see if it wasnt taking in the right list after the first time but that doesnt seem to be it. Example of the issue....
Shape Sizes = 4, 3, 2, 1
Python Outputs:
4 4 4 4 1 2 3 0
4 4 4 4 2 2 3 0
4 4 4 4 3 3 3 0
4 4 4 4 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
It Should Output:
4 4 4 4 3 3 3 1
4 4 4 4 3 3 3 0
4 4 4 4 3 3 3 0
4 4 4 4 2 2 0 0
0 0 0 0 2 2 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
What's going on here? Full Code is below...
def binCreate(size):
binlist = [[0 for col in range(size)] for row in range(size)]
return binlist
def binPrint(lst):
for row in range(len(lst)):
for col in range(len(lst[row])):
print(lst[row][col], end = " ")
print()
def itemCreate(fileName):
lst = []
for i in open(fileName):
i = i.split()
lst = i
lst = [int(i) for i in lst]
return lst
def main():
size = int(input("Bin Size: "))
fileName = str(input("Item Size File: "))
binList = binCreate(size)
blockList = itemCreate(fileName)
blockList.sort(reverse = True)
binList = checker(binList, len(binList), blockList)
binPrint(binList)
def isSpaceFree(binList, r, c, size):
if r + size > len(binList[0]):
return False
elif c + size > len(binList[0]):
return False
for row in range(r, r + size):
for col in range(c, c + size):
if binList[r][c] != 0:
return False
elif binList[r][c] == size:
return False
return True
def checker(binList, gSize, blockList):
for i in blockList:
r = 0
c = 0
comp = False
while comp != True:
check = isSpaceFree(binList, r, c, i)
if check == True:
for x in range(c, c+ i):
for y in range(r, r+ i):
binList[x][y] = i
comp = True
else:
print(c)
print(r)
r += 1
if r > gSize:
r = 0
c += 1
if c > gSize:
print("Imcompadible")
comp = True
print(i)
binPrint(binList)
input()
return binList
Your code to test for open spaces looks in binList[r][c] (where r is a row value and c is a column value). However, the code that sets the values once an open space has been found sets binList[x][y] (where x is a column value and y is a row value).
The latter is wrong. You want to set binList[y][x] instead (indexing by row, then column).
That will get you a working solution, but it will still not be exactly what you say you expect (you'll get a reflection across the diagonal). This is because your code updates r first, then c only when r has exceeded the bin size. If you want to place items to the right first, then below, you need to swap them.
I'd suggest using two for loops for r and c, rather than a while too, but to make it work in an elegant way you'd probably need to factor out the "find one item's place" code so you could return from the inner loop (rather than needing some complicated code to let you break out of both of the nested loops).

How to find the average of multiple columns in a file using python

Hi I have a file that consists of too many columns to open in excel. Each column has 10 rows of numerical values 0-2 and has a row saying the title of the column. I would like the output to be the name of the column and the average value of the 10 rows. The file is too large to open in excel 2000 so I have to try using python. Any tips on an easy way to do this.
Here is a sample of the first 3 columns:
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
I want python to output as a test file
Trial 1 Trial 2 Trial 3
1 2 1 (whatever the averages are)
A memory-friendly solution without using any modules:
with open("filename", "rtU") as f:
columns = f.readline().strip().split(" ")
numRows = 0
sums = [0] * len(columns)
for line in f:
# Skip empty lines
if not line.strip():
continue
values = line.split(" ")
for i in xrange(len(values)):
sums[i] += int(values[i])
numRows += 1
for index, summedRowValue in enumerate(sums):
print columns[index], 1.0 * summedRowValue / numRows
You can use Numpy:
import numpy as np
from StringIO import StringIO
s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')
data = np.loadtxt(s, skiprows=1) # skip header row
print data.mean(axis=0) # column means
# OUTPUT: array([ 0.8, 1. , 0.8])
Note that the first argument to loadtxt could be the name of your file instead of a file like object.
You can use the builtin csv module:
import csv
csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]
def average(l):
return float(sum(l)) / len(l)
averages = [int(round(average(trial))) for trial in zip(*values)]
print ' '.join(headers)
print ' '.join(str(x) for x in averages)
Result:
Trial1 Trial2 Trial3
1 1 1
Less of an answer than it is an alternative understanding of the problem:
You could think of each line being a vector. In this way, the average done column-by-column is just the average of each of these vectors. All you need in order to do this is
A way to read a line into a vector object,
A vector addition operation,
Scalar multiplication (or division) of vectors.
Python comes (I think) with most of this already installed, but this should lead to some easily readable code.

Categories