Python counting lines in files using exact locations

I know this is straightforward but I am not quite understanding how to make my for loop work.
My first file is a long list of two columns of data:
ROW VALUE
0 165
1 115
2 32
3 14
4 9
5 0
6 89
7 26
. .
406369 129
406370 103
My second file is a list of important row numbers:
1
43
192
so on
All I want to do is go to the row number of interest in file 1, and then walk down, row by row, until the value column hits zero. The output should then simply be a list of the important row numbers, each followed by the count of lines until file 1 reaches zero. For instance, the output for important row number "1" from file #2 should be 3, because there are three lines before the value reaches 0 in file #1. I appreciate any help! I have some script I have started and can post it in an edit if that is helpful. THANK YOU!
EDIT:
Some script I have started:
positive_starts = []
for line in important_rows_file:
    line = line.strip().split()
    positive_starts.append(int(line[0]))

countsfile = []
for line in file:
    line = line.strip().split()
    countsfile.append([line[0], line[1]])

count = 0
for i in range(0, len(countsfile)):
    for start in positive_starts:
        if int(countsfile[start + i][1]) > 0:
            count = count + 1
        else:
            count = count
.... not sure what is next

Here are two ways to do it.
The first way builds a dictionary in memory for all row numbers. This would be a good way to do it if (a) you are going to re-use this same data over and over (you can store it and read it back in), or (b) you are going to process a lot of rows from the second file (i.e. most of the rows need this done). The second way just does a one-off for a given row number.
Given this as the input file:
ROW VALUE
0 165
1 115
2 32
3 14
4 9
5 0
6 89
7 26
8 13
9 0
Method 1.
ref_dict = {}
with open("so_cnt_file.txt") as infile:
    next(infile)
    cur_start_row = 0
    cur_rows = []
    for line in infile:
        row, col = [int(val) for val in line.strip().split(" ") if val]
        if col == 0:
            for cur_row in cur_rows:
                ref_dict[cur_row] = row - cur_row - 1
            cur_start_row = row
            cur_rows = []
            continue
        cur_rows.append(row)
print ref_dict
OUTPUT
{0: 4, 1: 3, 2: 2, 3: 1, 4: 0, 6: 2, 7: 1, 8: 0}
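To answer the original question from there, each important row from the second file can then be looked up in ref_dict. A minimal sketch, assuming the second file is named "so_important_rows.txt" (the name is an assumption, not part of the original answer):

with open("so_important_rows.txt") as rowfile:
    for line in rowfile:
        line = line.strip()
        if not line:
            continue
        row = int(line)
        # rows whose own value is already 0 never get an entry, hence the default of 0
        print("%d %d" % (row, ref_dict.get(row, 0)))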
Method 2
def get_count_for_row(row=1):
    with open("so_cnt_file.txt") as infile:
        for i in range(0, row + 2):
            next(infile)
        cnt = 0
        for line in infile:
            row, col = [int(val) for val in line.strip().split(" ") if val]
            if col == 0:
                return cnt
            cnt += 1

print get_count_for_row(1)
print get_count_for_row(6)
OUTPUT
3
2
Here is a solution that takes all of the rows of interest in a single call.
def get_count_for_rows(*rows):
    rows = sorted(rows)
    counts = []
    with open("so_cnt_file.txt") as infile:
        cur_row = 0
        for i in range(cur_row, 2):
            next(infile)
        while rows:
            inrow = rows.pop(0)
            for i in range(cur_row, inrow):
                next(infile)
            cnt = 0
            for line in infile:
                row, col = [int(val) for val in line.strip().split(" ") if val]
                if col == 0:
                    counts.append((inrow, cnt))
                    break
                cnt += 1
            cur_row = row
    return counts

print get_count_for_rows(1, 6)
OUTPUT
[(1, 3), (6, 2)]
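To drive this from the second file instead of hard-coding the row numbers, something like the following should work (the filename is an assumption, not part of the original answer):

with open("so_important_rows.txt") as rowfile:
    rows_of_interest = [int(line) for line in rowfile if line.strip()]

# counts come back as (row, count) pairs, e.g. [(1, 3), (6, 2)] for the sample data
print(get_count_for_rows(*rows_of_interest))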

Related

Find lowest latest and average milk production?

I have a set of lines which I read from a file into a list. Each row is a new record, and each row consists of 3 numbers and 1 letter. The conditions are:
507 W 1000 1
1 M 6 2
1 W 1400 3
1 M 8 8
1 T 101 10
507 M 4 12
1 W 1700 15
1 M 7 16
507 M 8 20
1) The first element is a cow ID, a unique number representing a cow inside the data set.
2) The second element is an action code: 'W', 'M' or 'T'.
3) If it is 'W', the 3rd element is the latest weight of the cow.
4) If it is 'M', the 3rd element is the amount of milk the cow produced.
5) If it is 'T', the 3rd element is the current temperature of the cow.
6) IMPORTANT: if a cow doesn't have at least one W and at least one M record, exclude it from the output.
output: (id, lowest weight, max weight, average milk)
507 1000 1000 6
1 1400 1700 7
My output is correct, but how should I apply the 6th condition in my code?
My code:
import sys

filename = sys.argv[1]
arr = []
with open(filename, "r") as fileToProcess:
    for line in fileToProcess:
        arr.append(line.strip().split(' '))
        #print(L)
if not arr:
    print("EMPTY")
else:
    lst2 = [item[0] for item in arr]
    # print(lst2)
    mylist = list(set(lst2))
    # print(mylist[0])
    sum_1_M = 0
    sum_1_W = 0
    list_1 = []
    count = 0
    for i in range(len(mylist)):
        for x in arr:
            if x[0] == mylist[i] and x[1] == 'M':
                sum_1_M += int(x[2])
                count = count + 1
            elif x[0] == mylist[i] and x[1] == 'W':
                sum_1_W += int(x[2])
                list_1.append(int(x[2]))
        list_1.sort()
        print('{} {} {} {}'.format(mylist[i], list_1[0], list_1[len(list_1) - 1], int(sum_1_M / count)))
        sum_1_M = 0
        sum_1_W = 0
        list_1 = []
        count = 0
I think you can actually calculate everything while reading; the key is to use a dictionary and update its entries while parsing line by line. Take a look at this code I made for you:
import sys

filename = sys.argv[1]
dic = {}
with open(filename, "r") as fileToProcess:
    for line in fileToProcess:
        arr = line.strip().split(' ')
        if arr[0] not in dic:
            dic[arr[0]] = {
                'min_weight': 99999999,
                'max_weight': 0,
                'total_milk': 0,
                'count_milk': 0
            }
        if arr[1] == 'W':
            if dic[arr[0]]['min_weight'] >= int(arr[2]):
                dic[arr[0]]['min_weight'] = int(arr[2])
            if dic[arr[0]]['max_weight'] <= int(arr[2]):
                dic[arr[0]]['max_weight'] = int(arr[2])
        elif arr[1] == 'M':
            dic[arr[0]]['total_milk'] += int(arr[2])
            dic[arr[0]]['count_milk'] += 1

for k, v in dic.items():
    # condition 6: only print cows that have at least one W and one M record
    if v['max_weight'] > 0 and v['total_milk'] > 0:
        print('({}, {}, {}, {})'.format(
            k,
            v['min_weight'],
            v['max_weight'],
            v['total_milk']/v['count_milk']
        ))
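If the magic sentinel 99999999 feels fragile, the same per-cow aggregation can be written with None as the starting value. This is only a sketch with made-up names, reusing filename from above; it is not part of the original answer:

stats = {}  # cow id -> [min_weight, max_weight, total_milk, milk_count]
with open(filename, "r") as f:
    for line in f:
        cow, code, value = line.split()[:3]
        value = int(value)
        rec = stats.setdefault(cow, [None, None, 0, 0])
        if code == 'W':
            rec[0] = value if rec[0] is None else min(rec[0], value)
            rec[1] = value if rec[1] is None else max(rec[1], value)
        elif code == 'M':
            rec[2] += value
            rec[3] += 1

for cow, (lo, hi, total, n) in stats.items():
    if lo is not None and n > 0:  # condition 6: needs at least one W and one M record
        print('{} {} {} {}'.format(cow, lo, hi, total // n))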

How to extract columns from a Python list?

My Python code
import operator

with open('index.txt') as f:
    lines = f.read().splitlines()

print type(lines)
print len(lines)
l2 = lines[1::3]
print len(l2)
print l2[0]

list1 = [0, 2]
my_items = operator.itemgetter(*list1)
new_list = [my_items(x) for x in l2]

with open('newindex1.txt', 'w') as thefile:
    for item in l2:
        thefile.write("%s\n" % item)
Couple of lines from index.txt
0 0 0
0 1 0
0 2 0
1 0 0
1 1 0
1 2 0
2 0 0
2 1 0
2 2 0
3 0 0
Couple of lines from newindex1.txt
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
I wanted to read the file as a list, then choose every third row, and finally select the first and the third column from that list. It seems that I do not understand how operator works.
If I try Back2Basics' solution
import numpy as np
myarray = np.fromfile('index.txt', dtype=int, sep=' ')
anotherarray = myarray[::3][0,2]
I got
File "a12.py", line 4, in <module>
anotherarray = myarray[::3][0,2]
IndexError: too many indices
You don't need to read all the data into memory at all; you can use itertools.islice to parse the rows you want and the csv lib to read and write the data:
from operator import itemgetter
from itertools import islice
import csv

with open("in.txt") as f, open('newindex1.txt', 'w') as out:
    r = csv.reader(f, delimiter=" ")
    wr = csv.writer(out, delimiter=" ")
    for row in iter(lambda: list(islice(r, 0, 3, 3)), []):
        wr.writerow(map(itemgetter(0, 2), row)[0])
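As written, the last line relies on Python 2 behaviour, where map returns an indexable list. A Python 3 rendering of the same loop might look like this (a sketch, not part of the original answer):

from operator import itemgetter
from itertools import islice
import csv

with open("in.txt") as f, open("newindex1.txt", "w", newline="") as out:
    r = csv.reader(f, delimiter=" ")
    wr = csv.writer(out, delimiter=" ")
    # islice(r, 0, 3, 3) pulls the first row of each group of three
    for chunk in iter(lambda: list(islice(r, 0, 3, 3)), []):
        wr.writerow(itemgetter(0, 2)(chunk[0]))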
I'd highly suggest using numpy for this, the reason being that this is all numerical data that fits nicely into memory. The code looks like this:
import numpy as np

# fromfile returns a flat 1-D array, so reshape it into rows of 3 columns before slicing
myarray = np.fromfile('index.txt', dtype=int, sep=' ').reshape(-1, 3)
anotherarray = myarray[::3, ::2]
and then you want to write the file
anotherarray.tofile('newfile.txt', sep=" ")
The way the array slicing line [::3,::2] reads is "take every 3rd row starting from 0, and take every other column starting from 0"
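If the fromfile/reshape step feels awkward, np.loadtxt parses whitespace-separated columns into a 2-D array directly. A minimal sketch using the filenames from the question:

import numpy as np

data = np.loadtxt('index.txt', dtype=int)      # shape (n_rows, 3)
subset = data[::3, ::2]                        # every 3rd row, columns 0 and 2
np.savetxt('newindex1.txt', subset, fmt='%d')  # writes one row per line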
I think you need something like this?
lines = []
with open('index.txt', 'r') as fi:
    lines = fi.read().splitlines()
lines = [line.split() for line in lines]

with open('answer.txt', 'w') as fo:
    for row in range(len(lines)):
        if row % 3 == 1:  # every third row, starting with the second
            fo.write('%s %s\n' % (lines[row][0], lines[row][2]))

Count how many rows end with 1 or 0 in a CSV file in Python

I have CSV file like this:
2,1,2,3,1
23,3,2,22,0
2,2,11,2,0
1,2,2,1,1
.
.
44,3,3,44,0
2,2,11,2,0
Each row ends with 1 or 0. I want to compute the prior probability of 1 or 0 by counting how many rows have 1 or 0 as the last item and dividing by the total number of rows. How do I solve this in Python? Thank you.
import csv

total = 0
zeros = 0
ones = 0
with open('path/to/file') as infile:
    for row in csv.reader(infile):
        total += 1
        if row[-1] == '0': zeros += 1
        if row[-1] == '1': ones += 1

# do some division to calculate priors
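The division the final comment alludes to could look like this (a minimal sketch; the guard against an empty file is my addition):

if total:
    prior_zero = zeros / float(total)  # float() keeps the division correct on Python 2 as well
    prior_one = ones / float(total)
    print("P(0) = %.3f, P(1) = %.3f" % (prior_zero, prior_one))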

how to populate a matrix in python

I was trying to write code that outputs a matrix, but being a novice, I am not getting it right. Basically I want to generate a matrix of counts of A, C, G, T for each column. I was able to do it for a single column but am having a hard time doing it for the other columns.
Input file
>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT
My code so far
fh_in = open("consensus_seq.txt", 'r')

A_count = 0
C_count = 0
G_count = 0
T_count = 0
result = []
for line in fh_in:
    line = line.strip()
    if not line.startswith(">"):
        for nuc in line[0]:
            if nuc == "A":
                A_count += 1
            if nuc == "C":
                C_count += 1
            if nuc == "G":
                G_count += 1
            if nuc == "T":
                T_count += 1
result.append(A_count)
result.append(C_count)
result.append(G_count)
result.append(T_count)
print result
Output
[5, 0, 1, 1]
The actual output that i want is
A 5 1 0 0 5 5 0 0
C 0 0 1 4 2 0 6 1
G 1 1 6 3 0 1 0 0
T 1 5 0 0 0 1 1 6
Any help/hint is appreciated.
First make a list of the rows, stripping out the lines starting with >. Then you can zip this to turn it into a list of columns. Then you can make a list of column counts of each letter.
rows = [line.strip() for line in infile if not line.startswith('>')]
columns = zip(*rows)
for letter in 'ACGT':
    print letter, [column.count(letter) for column in columns]
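One caveat, not from the original answer: on Python 3, zip returns a one-shot iterator, so the columns would be exhausted after the first letter. Materializing the columns first fixes that (a small sketch):

columns = list(zip(*rows))  # materialize so every letter's pass can reuse the columns
for letter in 'ACGT':
    print(letter, [column.count(letter) for column in columns])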
However this may be memory intensive if your file is very large. An alternative is just to go through line by line counting the letters.
counts = {letter: [0] * 8 for letter in 'ACGT'}
for line in infile:
    if not line.startswith('>'):
        for i, letter in enumerate(line.strip()):
            counts[letter][i] += 1
for letter, columns in counts.items():
    print letter, columns
You could also use a Counter, especially if you aren't sure in advance how many columns there will be:
from collections import Counter
# ...
counts = Counter()
for line in infile:
    if not line.startswith('>'):
        counts.update(enumerate(line.strip()))
columns = range(max(counts.keys())[0] + 1)  # + 1 so the last column is included
for letter in 'ACGT':
    print letter, [counts[column, letter] for column in columns]
You could use numpy to load the text file. Since the format is a little funky it is hard to load, but the summation becomes trivial after that:
import numpy as np
data = np.loadtxt("raw.txt", comments=">",
                  converters={0: lambda s: [x for x in s]}, dtype=str)
print (data=="A").sum(axis=0)
print (data=="T").sum(axis=0)
print (data=="C").sum(axis=0)
print (data=="G").sum(axis=0)
Output:
[5 1 0 0 5 5 0 0]
[1 5 0 0 0 1 1 6]
[0 0 1 4 2 0 6 1]
[1 1 6 3 0 1 0 0]
The real advantage to this is the numpy array you've constructed can do other things. For example, let's say I wanted to know, instead of the sum, the average number of times we found an A along the columns of the "Rosalinds":
print (data=="A").mean(axis=0)
[ 0.71428571 0.14285714 0. 0. 0.71428571 0.71428571 0. 0.]
import collections

answer = []
with open('blah') as infile:
    # zip(infile, infile) pairs each ">" header with the sequence line that follows it;
    # keeping only the second item of each pair drops the headers
    rows = [line.strip() for _, line in zip(infile, infile)]
    cols = zip(*rows)
    for col in cols:
        d = collections.Counter(col)
        answer.append([d[i] for i in "ATCG"])
answer = [list(i) for i in zip(*answer)]
for line in answer:
    print(' '.join([str(i) for i in line]))
Output:
5 1 0 0 5 5 0 0
1 5 0 0 0 1 1 6
0 0 1 4 2 0 6 1
1 1 6 3 0 1 0 0

How to count the frequency of numbers given in a text file

How to count the frequency of numbers given in a text file. The text file is as follows.
0
2
0
1
0
1
55
100
100
I want the output as follows
0 3
1 2
2 1
55 1
100 2
I tried this without success
def histogram(A, flAsList=False):
    """Return histogram of values in array A."""
    H = {}
    for val in A:
        H[val] = H.get(val, 0) + 1
    if flAsList:
        return H.items()
    return H
Any better way? Thanks in advance!
Use Counter. It's the best way for this type of problem:
from collections import Counter
with open('file.txt', 'r') as fd:
    lines = fd.read().split()

counter = Counter(lines)
# sorts items
items = sorted(counter.items(), key=lambda x: int(x[0]))
# prints desired output
for k, repetitions in items:
    print k, '\t', repetitions
The output:
0 3
1 2
2 1
55 1
100 2
Use a Counter object for this:
from collections import Counter
c = Counter(A)
Now the c variable will hold a frequency map of each of the values. For instance:
Counter(['a', 'b', 'c', 'a', 'c', 'a'])
=> Counter({'a': 3, 'c': 2, 'b': 1})
Please consider using update:
def histogram(A, flAsList=False):
    """Return histogram of values in array A."""
    H = {}
    for val in A:
        # H[val] = H.get(val,0) + 1
        if H.has_key(val):
            H[val] = H[val] + 1
        else:
            H.update({val : 1})
    if flAsList:
        return H.items()
    return H
Simple approach using a dictionary:
histogram = {}
with open("file", "r") as f:
    for line in f:
        try:
            histogram[line.strip()] += 1
        except KeyError:
            histogram[line.strip()] = 1

for key in sorted(histogram.keys(), key=int):
    print key, "\t", histogram[key]
Output:
0 3
1 2
2 1
55 1
100 2
Edit:
To select a specific column you'd want to split the line using split(). For example, the sixth field by splitting on a single space:
try:
    histogram[line.strip().split(' ')[5]] += 1
except KeyError:
    histogram[line.strip().split(' ')[5]] = 1
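The same per-column tally can also be written with collections.Counter, which avoids the try/except entirely. A sketch; the filename and the column index are assumptions carried over from the example above:

from collections import Counter

column_counts = Counter()
with open("file") as f:
    for line in f:
        fields = line.split()
        if len(fields) > 5:
            column_counts[fields[5]] += 1  # tally the sixth field

for key in sorted(column_counts, key=int):
    print("%s\t%s" % (key, column_counts[key]))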
