How to extract columns from a Python list? - python

My Python code:
import operator
with open('index.txt') as f:
    lines = f.read().splitlines()
print type(lines)
print len(lines)
l2 = lines[1::3]
print len(l2)
print l2[0]
list1 = [0, 2]
my_items = operator.itemgetter(*list1)
new_list = [my_items(x) for x in l2]
with open('newindex1.txt', 'w') as thefile:
    for item in l2:
        thefile.write("%s\n" % item)
A couple of lines from index.txt:
0 0 0
0 1 0
0 2 0
1 0 0
1 1 0
1 2 0
2 0 0
2 1 0
2 2 0
3 0 0
A couple of lines from newindex1.txt:
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
I wanted to read the file into a list, then choose every third row, and finally select the first and third columns from that list. It seems that I do not understand how operator.itemgetter works.
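(A minimal sketch of the intended itemgetter usage, assuming whitespace-separated columns: each line has to be split into fields before itemgetter can pick columns, and it is new_list, not l2, that should be written out.)
import operator
with open('index.txt') as f:
    lines = f.read().splitlines()
l2 = lines[1::3]                      # every third row, starting at row 1
my_items = operator.itemgetter(0, 2)  # first and third fields
new_list = [my_items(line.split()) for line in l2]
with open('newindex1.txt', 'w') as thefile:
    for item in new_list:
        thefile.write("%s %s\n" % item)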
If I try Back2Basics' solution:
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ')
anotherarray = myarray[::3][0,2]
I got
File "a12.py", line 4, in <module>
anotherarray = myarray[::3][0,2]
IndexError: too many indices
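The IndexError happens because np.fromfile with sep=' ' returns a flat 1-D array, so a two-axis index has too many indices for it. A small sketch of the fix, reshaping to three columns first:
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ').reshape(-1, 3)  # rows of 3 values
anotherarray = myarray[::3, [0, 2]]  # every third row; first and third columns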

You don't need to read all the data into memory at all; you can use itertools.islice to pick out the rows you want and the csv module to read and write the data:
from operator import itemgetter
from itertools import islice
import csv

with open("in.txt") as f, open('newindex1.txt', 'w') as out:
    r = csv.reader(f, delimiter=" ")
    wr = csv.writer(out, delimiter=" ")
    # islice(r, 0, 3, 3) consumes three rows at a time and yields the first;
    # the iter(..., []) sentinel stops when the reader is exhausted
    for row in iter(lambda: list(islice(r, 0, 3, 3)), []):
        wr.writerow(map(itemgetter(0, 2), row)[0])  # Python 2: map returns a list

I'd highly suggest using numpy for this, since it is all numerical data that fits nicely into memory. The code looks like this:
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ').reshape(-1, 3)  # fromfile returns a flat array
anotherarray = myarray[::3, ::2]
and then, to write the file:
anotherarray.tofile('newfile.txt', sep=" ")
The slice [::3, ::2] reads as "take every third row starting from row 0, and take every other column starting from column 0". The reshape is needed first because fromfile returns a 1-D array.

I think you need something like this:
lines = []
with open('index.txt', 'r') as fi:
    lines = fi.read().splitlines()
lines = [line.split() for line in lines]
with open('answer.txt', 'w') as fo:
    for row in range(len(lines)):
        if row % 3 == 1:  # every third row, starting at row 1
            fo.write('%s %s\n' % (lines[row][0], lines[row][2]))

Related

Fast way to create pandas dataframe from pairs

I have a big file of word/tag pairs saved like this:
This/DT gene/NN called/VBN gametocide/NN
Now I want to put these pairs into a DataFrame with their counts like this:
      DT  NN  ...
This   1   0
gene   0   1
...
I tried doing this with a dict that counts the pairs and then putting them into the DataFrame:
from collections import defaultdict
import pandas as pd

file = open("data.txt", "r")
train = file.read()
words = train.split()
data = defaultdict(int)
for i in words:
    data[i] += 1
matrixB = pd.DataFrame()
for elem, count in data.items():
    word, tag = elem.split('/')
    matrixB.loc[tag, word] = count  # note: tags end up as the row index here
But this takes a really long time (the file has around 300,000 of these pairs). Is there a faster way to do this?
What was wrong with the answers from your other question?
from collections import Counter
import pandas as pd

with open('data.txt') as f:
    train = f.read()
c = Counter(tuple(x.split('/')) for x in train.split())
s = pd.Series(c)
df = s.unstack().fillna(0)
print(df)
yields
            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
I thought this question was remarkably similar... Why did you post twice?
from collections import Counter
text = "This/DT gene/NN called/VBN gametocide/NN"
>>> pd.Series(Counter(tuple(pair.split('/')) for pair in text.split())).unstack().fillna(0)
            DT  NN  VBN
This         1   0    0
called       0   0    1
gametocide   0   1    0
gene         0   1    0
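The original version is slow mainly because matrixB.loc[tag, word] = count grows the DataFrame one cell at a time; counting first and building the frame once is much cheaper. If you'd rather not build the Counter yourself, pd.crosstab can do the counting too (a sketch, not from the original answers):
import pandas as pd

with open('data.txt') as f:
    pairs = [token.split('/') for token in f.read().split()]
words, tags = zip(*pairs)
print(pd.crosstab(pd.Series(words, name='word'), pd.Series(tags, name='tag')))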

Sum all values in CSV

I have a CSV file with 0s and 1s and need to determine the sum total of the entire file. The file looks like this when opened in Excel:
0 1 1 1 0 0 0 1 0 1
1 0 1 0 0 1 1 0 0 0
0 0 1 0 0 0 0 1 0 1
0 1 1 1 1 1 1 0 1 1
0 0 1 0 1 0 1 1 0 1
0 0 0 0 0 0 0 0 1 0
0 0 1 0 0 1 1 0 1 1
0 0 1 1 0 0 1 1 0 1
1 0 1 0 1 0 1 1 1 0
0 1 0 0 1 0 0 0 1 1
Using this script I can sum the values of each row and they print out in a single column:
import csv
import numpy as np

path = r'E:\myPy\one_zero.csv'
infile = open(path, 'r')
with infile as file_in:
    fin = csv.reader(file_in, delimiter=',')
    for line in fin:
        print line.count('1')
I need to be able to sum up the resulting column, but my experience with this is mild. Looking for suggestions. Thanks.
If you have more than just 1s and 0s, map to int and sum all the rows:
with open(r'E:\myPy\one_zero.csv') as f:
    r = csv.reader(f, delimiter=',')
    count = sum(sum(map(int, row)) for row in r)
Or just count the 1s:
with open(r'E:\myPy\one_zero.csv') as f:
    r = csv.reader(f, delimiter=',')
    count = sum(row.count("1") for row in r)
Just use with open(r'E:\myPy\one_zero.csv'); you don't need to, and should not, open the file first and then pass the handle to with.
path = r'E:\myPy\one_zero.csv'
answer = 0
with open(path, 'r') as file_in:
    fin = csv.reader(file_in, delimiter=',')
    for line in fin:
        a = line.count('1')  # csv.reader yields lists of strings, so count '1', not 1
        answer += a
print answer
Example:
answer = 0
lines = [['1', '0', '0', '1'], ['1', '1', '1', '1'], ['0', '0', '0', '1']]
for line in lines:
    a = line.count('1')
    answer += a
print answer
7
One possible error is mixing up
line.count('1')
vs
line.count(1)
Since csv.reader yields rows of strings, counting the numeric 1 always gives 0 here; count the string '1'.
Why use the CSV module at all? You have a file full of 0s, 1s, commas and newlines. Just open the file, read() it and count the 1s:
>>> with open(filename, 'r') as fin: print fin.read().count('1')
That should get you what you want, no?
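Since the question already imports numpy, a one-liner is possible too (a sketch, assuming the file really is strictly comma-separated integers):
import numpy as np
print np.loadtxt(r'E:\myPy\one_zero.csv', delimiter=',', dtype=int).sum()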

how to populate a matrix in python

I was trying to write code whose output is a matrix, but being a novice I am not getting it right. Basically I want to generate a matrix of counts of each of A, C, G, T for each column. I was able to do it for a single column, but I am having a hard time doing it for the other columns.
Input file
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
My code so far
fh_in = open("consensus_seq.txt", 'r')
A_count = 0
C_count = 0
G_count = 0
T_count = 0
result = []
for line in fh_in:
    line = line.strip()
    if not line.startswith(">"):
        for nuc in line[0]:  # only looks at the first column
            if nuc == "A":
                A_count += 1
            if nuc == "C":
                C_count += 1
            if nuc == "G":
                G_count += 1
            if nuc == "T":
                T_count += 1
result.append(A_count)
result.append(C_count)
result.append(G_count)
result.append(T_count)
print result
Output
[5, 0, 1, 1]
The actual output that I want is:
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Any help/hint is appreciated.
First make a list of the rows, stripping out the lines starting with >. Then you can zip this to turn it into a list of columns. Then you can make a list of column counts of each letter.
rows = [line.strip() for line in infile if not line.startswith('>')]
columns = zip(*rows)
for letter in 'ACGT':
    print letter, [column.count(letter) for column in columns]
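To get exactly the space-separated layout shown in the question, join the counts instead of printing the list (a small variation on the loop above):
for letter in 'ACGT':
    print letter, ' '.join(str(column.count(letter)) for column in columns)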
However this may be memory intensive if your file is very large. An alternative is just to go through line by line counting the letters.
counts = {letter: [0] * 8 for letter in 'ACGT'}
for line in infile:
    if not line.startswith('>'):
        for i, letter in enumerate(line.strip()):
            counts[letter][i] += 1
for letter, columns in counts.items():
    print letter, columns
You could also use a Counter, especially if you aren't sure in advance how many columns there will be:
from collections import Counter
# ...
counts = Counter()
for line in infile:
    if not line.startswith('>'):
        counts.update(enumerate(line.strip()))
columns = range(max(counts)[0] + 1)  # one entry per column seen
for letter in 'ACGT':
    print letter, [counts[column, letter] for column in columns]
You could use numpy to load the text file. Since the format is a little funky it is hard to load, but the summation becomes trivial after that:
import numpy as np
data = np.loadtxt("raw.txt", comments=">",
                  converters={0: lambda s: [x for x in s]}, dtype=str)
print (data == "A").sum(axis=0)
print (data == "T").sum(axis=0)
print (data == "C").sum(axis=0)
print (data == "G").sum(axis=0)
Output:
[5 1 0 0 5 5 0 0]
[1 5 0 0 0 1 1 6]
[0 0 1 4 2 0 6 1]
[1 1 6 3 0 1 0 0]
The real advantage to this is that the numpy array you've constructed can do other things. For example, let's say I wanted to know the average (instead of the sum) number of times we found an A along the columns of the "Rosalinds":
print (data=="A").mean(axis=0)
[ 0.71428571 0.14285714 0. 0. 0.71428571 0.71428571 0. 0.]
import collections

answer = []
with open('blah') as infile:
    # zip(infile, infile) pairs each '>' header line with the sequence after it;
    # keeping only the second of each pair drops the headers
    rows = [line.strip() for _, line in zip(infile, infile)]
cols = zip(*rows)
for col in cols:
    d = collections.Counter(col)
    answer.append([d[i] for i in "ATCG"])
answer = [list(i) for i in zip(*answer)]
for line in answer:
    print(' '.join([str(i) for i in line]))
Output:
5 1 0 0 5 5 0 0
1 5 0 0 0 1 1 6
0 0 1 4 2 0 6 1
1 1 6 3 0 1 0 0

Python counting lines in files using exact locations

I know this is straightforward, but I am not quite understanding how to make my for loop work.
My first file is a long list of two columns of data:
ROW VALUE
0 165
1 115
2 32
3 14
4 9
5 0
6 89
7 26
. .
406369 129
406370 103
My second file is a list of important row numbers:
1
43
192
and so on
All I want to do is go to the row number of interest in file #1, and then walk down, row by row, until the value column hits zero. The output will then simply be a list of the important row numbers, each followed by the count of lines until file #1 reaches zero. For instance, the output for important row number "1" from file #2 should be 3, because there are three lines before the value reaches 0 in file #1. I appreciate any help! I have some script I have started and can post it in an edit if that is helpful. THANK YOU!
EDIT:
Some script I have started:
positive_starts = []
for line in important_rows_file:
    line = line.strip().split()
    positive_starts.append(int(line[2]))
countsfile = []
for line in file:
    line = line.strip().split()
    countsfile.append([line[0]] + [line[1]])
count = 0
i = 0
for i in range(0, len(countsfile)):
    for start in positive_starts:
        if int(countsfile[start + i][1]) > 0:
            count = count + 1
        else:
            count = count
.... not sure what is next
Here are two ways to do it.
The first way builds a dictionary in memory for all row numbers. This would be a good way to do it if (a) you are going to re-use the same data over and over (you can store it and read it back in), or (b) you are going to process a lot of rows from the second file (i.e. most of the rows need this done). The second way just does a one-off lookup for a given row number.
Given this as the input file:
ROW VALUE
0 165
1 115
2 32
3 14
4 9
5 0
6 89
7 26
8 13
9 0
Method 1.
ref_dict = {}
with open("so_cnt_file.txt") as infile:
    next(infile)  # skip the header row
    cur_rows = []
    for line in infile:
        row, col = [int(val) for val in line.strip().split(" ") if val]
        if col == 0:
            # record, for every pending row, the distance to this zero
            for cur_row in cur_rows:
                ref_dict[cur_row] = row - cur_row - 1
            cur_rows = []
            continue
        cur_rows.append(row)
print ref_dict
OUTPUT
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 6: 2, 7: 1, 8: 0}
Method 2
def get_count_for_row(row=1):
    with open("so_cnt_file.txt") as infile:
        # skip the header plus rows 0..row
        for i in range(0, row + 2):
            next(infile)
        cnt = 0
        for line in infile:
            row, col = [int(val) for val in line.strip().split(" ") if val]
            if col == 0:
                return cnt
            cnt += 1

print get_count_for_row(1)
print get_count_for_row(6)
OUTPUT
3
2
Here is a solution that takes all of the rows of interest in a single call.
def get_count_for_rows(*rows):
    rows = sorted(rows)
    counts = []
    with open("so_cnt_file.txt") as infile:
        cur_row = 0
        for i in range(cur_row, 2):
            next(infile)  # skip the header and row 0
        while rows:
            inrow = rows.pop(0)
            for i in range(cur_row, inrow):
                next(infile)  # skip ahead past the row of interest
            cnt = 0
            for line in infile:
                row, col = [int(val) for val in line.strip().split(" ") if val]
                if col == 0:
                    counts.append((inrow, cnt))
                    break
                cnt += 1
            cur_row = row
    return counts

print get_count_for_rows(1, 6)
OUTPUT
[(1, 3), (6, 2)]
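None of the snippets above actually reads the second file of important row numbers; a minimal sketch of that last step on top of Method 1's ref_dict (the filename important_rows.txt is made up for illustration):
with open("important_rows.txt") as f:
    for line in f:
        row = int(line.strip())
        print row, ref_dict.get(row, 0)  # rows whose own value is 0 aren't in the dict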

How to find the average of multiple columns in a file using python

Hi, I have a file that consists of too many columns to open in Excel. Each column has 10 rows of numerical values (0-2) and a header row giving the title of the column. I would like the output to be the name of each column and the average of its 10 rows. The file is too large to open in Excel 2000, so I have to try using Python. Any tips on an easy way to do this?
Here is a sample of the first 3 columns:
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
I want Python to output a text file like:
Trial1 Trial2 Trial3
1 2 1 (whatever the averages are)
A memory-friendly solution without using any modules:
with open("filename", "rtU") as f:
columns = f.readline().strip().split(" ")
numRows = 0
sums = [0] * len(columns)
for line in f:
# Skip empty lines
if not line.strip():
continue
values = line.split(" ")
for i in xrange(len(values)):
sums[i] += int(values[i])
numRows += 1
for index, summedRowValue in enumerate(sums):
print columns[index], 1.0 * summedRowValue / numRows
You can use Numpy:
import numpy as np
from StringIO import StringIO

s = StringIO('''\
Trial1 Trial2 Trial3
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
1 0 1
0 0 0
0 2 0
2 2 2
1 1 1
''')
data = np.loadtxt(s, skiprows=1)  # skip header row
print data.mean(axis=0)  # column means
# OUTPUT: array([ 0.8, 1. , 0.8])
Note that the first argument to loadtxt could be the name of your file instead of a file-like object.
You can use the built-in csv module:
import csv

csvReader = csv.reader(open('input.txt'), delimiter=' ')
headers = csvReader.next()
values = [map(int, row) for row in csvReader]

def average(l):
    return float(sum(l)) / len(l)

averages = [int(round(average(trial))) for trial in zip(*values)]
print ' '.join(headers)
print ' '.join(str(x) for x in averages)
Result:
Trial1 Trial2 Trial3
1 1 1
Less of an answer than an alternative understanding of the problem:
You could think of each line as a vector. In this way, the average done column-by-column is just the average of these vectors. All you need in order to do this is
a way to read a line into a vector object,
a vector addition operation,
and scalar multiplication (or division) of vectors.
Python comes (I think) with most of this already installed, and it should lead to some easily readable code; a sketch follows below.
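A minimal sketch of that idea in plain Python (the filename data.txt stands in for the real file):
with open('data.txt') as f:
    headers = f.readline().split()
    totals = [0] * len(headers)  # running vector sum
    n = 0
    for line in f:
        if not line.strip():
            continue
        vec = [int(x) for x in line.split()]           # read a line into a vector
        totals = [t + v for t, v in zip(totals, vec)]  # vector addition
        n += 1
    for name, total in zip(headers, totals):
        print name, float(total) / n                   # scalar division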
