Parsing two text files in Python for a combined result - python

A chocolate company has decided to offer a discount on candy products that were produced 30 days or more before the current date. I have to print a matrix as the result: the program reads through 2 files, one being the cost of the different candies of different sizes, and the other being the threshold number of days after which the discount is offered. So in this question the two text files look something like this
candies.txt
31 32 19 11 15 30 35 37
12 34 39 45 66 78 12 7
76 32 8 2 3 5 18 32 48
99 102 3 46 88 22 25 21
fd zz er 23 44 56 77 99
44 33 22 55 er ee df 22
and the second file days.txt
30
But it can have more than one number. It can look something like
30
40
36
The desired output is
Discount at days = 30
$ $ $
$ $ $
$ $ $ $ $
$ $ $ $
? ? ? $
$ ? ? ? $
Discount at days = 40
and then it should produce the output for each threshold accordingly.
So basically, everywhere the number is under the number given in days.txt it should print a "$" sign, and everywhere it is more than the number (30 in our case) it should just print spaces in its place. We also have an anomaly: there are letters in the candies.txt matrix, and since we are looking for numbers to check the price and not letters, it should print a "?" sign in their place as they are not recognized.
Here's my code
def replace(word, threshold):
    try:
        value = int(word)
    except ValueError:
        return '?'
    if value < threshold:
        return '$'
    if value > threshold:
        return ' '
    return word
def get_threshold(filename):
    thresholds = []
    with open(filename) as fobj:
        for line in fobj:
            if line.strip():
                thresholds.append(int(line))
    return thresholds
def process_file(data_file, threshold):
    lines = []
    print('Original data:')
    with open(data_file) as f:
        for line in f:
            line = line.strip()
            print(line)
            replaced_line = ' '.join(
                replace(chunck, threshold) for chunck in line.split())
            lines.append(replaced_line)
    print('\nData replaced with threshold', threshold)

for threshold in get_threshold('days.txt'):
    process_file('demo.txt', threshold)
My question is that my code works when there is only one number in the second file, days.txt, but it doesn't work when there is more than one number in the second file. I want it to work when there are multiple numbers, one per line, in the second text file. I don't know what I am doing wrong.

Read all thresholds:
def get_thresholds(filename):
    with open(filename) as fobj:
        return [int(line) for line in fobj if line.strip()]
Alternative implementation without the list comprehension:
def get_thresholds(filename):
    thresholds = []
    with open(filename) as fobj:
        for line in fobj:
            if line.strip():
                thresholds.append(int(line))
    return thresholds
Modify your function a bit:
def process_file(data_file, threshold):
    lines = []
    print('Original data:')
    with open(data_file) as f:
        for line in f:
            line = line.strip()
            print(line)
            replaced_line = ' '.join(
                replace(chunk, threshold) for chunk in line.split())
            lines.append(replaced_line)
    print('\nData replaced with threshold', threshold)
    for line in lines:
        print(line)
Go through all thresholds:
for threshold in get_thresholds('days.txt'):
    process_file('candies.txt', threshold)

This is a re-write of my previous answer. Due to the long discussion and the many changes, it seems clearer to write another answer. I chopped the task into smaller sub-tasks and defined a function for each. All functions have docstrings; this is highly recommended.
"""
A chocolate company has decided to offer discount on the candy products
which are produced 30 days of more before the current date.
More story here ...
"""
def read_thresholds(filename):
"""Read values for thresholds from file.
"""
thresholds = []
with open(filename) as fobj:
for line in fobj:
if line.strip():
thresholds.append(int(line))
return thresholds
def read_costs(filename):
"""Read the cost from file.
"""
lines = []
with open(filename) as fobj:
for line in fobj:
lines.append(line.strip())
return lines
def replace(word, threshold):
"""Replace value below threshold with `$`, above threshold with ` `,
non-numbers with `?`, and keep the value if it equals the
threshold.
"""
try:
value = int(word)
except ValueError:
return '?'
if value < threshold:
return '$'
if value > threshold:
return ' '
return word
def process_costs(costs, threshold):
"""Replace the cost for given threshold and print results.
"""
res = []
for line in costs:
replaced_line = ' '.join(
replace(chunck, threshold) for chunck in line.split())
res.append(replaced_line)
print('\nData replaced with threshold', threshold)
for line in res:
print(line)
def show_orig(costs):
"""Show original costs.
"""
print('Original data:')
for line in costs:
print(line)
def main(costs_filename, threshold_filename):
"""Replace costs for all thresholds and display results.
"""
costs = read_costs(costs_filename)
show_orig(costs)
for threshold in read_thresholds(threshold_filename):
process_costs(costs, threshold)
if __name__ == '__main__':
main('candies.txt', 'days.txt')

Related

Counting the occurrence of two words at the beginning of a text file

I am writing a Python script to read all lines in a text file (b.txt) and count those with special words (ATOM and HETATM) at the beginning of each line. The b.txt file is as follows:
REMARK 480 ATOM ZERO OCCUPANCY
ATOM 3332 CA GLY A 8 9.207 4.845 44.955 1.00 42.92 C
HETATM 2954 O HOH A 489 -17.507 4.101 8.012 1.00 53.13 O
and the code is:
pdb_text = open("b.txt","r")
data = pdb_text.read()
n_atoms = data.count("ATOM")
n_het_atom = data.count("HETATM")
total_atoms = n_atoms + n_het_atom
print('Number of atoms:', total_atoms)
I expect “2” as the output, but I get “3” instead.
For counting the lines that start with ATOM or HETATM you can use the string's startswith function. For example you can do:
data = ""
with open("b.txt","r") as file:
data = file.readlines()
counter = 0
for line in data.split('\n'):
if line.startswith("ATOM") or line.startswith("HETATM"):
counter = counter + 1
print('Number of atoms:', counter)
Using the following code, you can read a PDB file and count the number of atoms in it.
# open the pdb file
pdb_file = input("Enter the name of your PDB file: ")
pdb_text = open(pdb_file, "r")
# read contents of the pdb file to string
data = pdb_text.read()
# count lines starting with ATOM or HETATM in the pdb file
n_atoms = 0
for line in data.split('\n'):
    if line.startswith("ATOM") or line.startswith("HETATM"):
        n_atoms = n_atoms + 1
print('Number of atoms:', n_atoms)
Let's generalize it:
We want a function which takes a file_path and a list of words.
It should then return a dictionary of line counts, keyed by the words the lines start with.
For the sum, we take the values of the dictionary and sum them. We can write that as a function, too.
A function which returns a dictionary of words with their counts as values is usually called a frequency function. Since we count not overall occurrences but only those at the start of lines, we call it start_frequencies:
def start_frequencies(path, words):
    dct = {}
    with open(path) as f:
        for line in f:
            for word in words:
                if line.startswith(word):
                    dct[word] = dct.get(word, 0) + 1
    return dct
The magic is dct.get(word, 0) because it says: "If the word already exists as a key in the dictionary dct, take its value; otherwise take 0 as the default count."
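As a quick illustration of that pattern (with made-up words, not the PDB data):

```python
# Count word occurrences with dict.get and a default of 0.
counts = {}
for word in ["ATOM", "HETATM", "ATOM"]:
    counts[word] = counts.get(word, 0) + 1

print(counts)  # {'ATOM': 2, 'HETATM': 1}
```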
Then, we can write the function which returns the sum of all start counts:
def sum_of_start_frequencies(path, words):
    dct = start_frequencies(path, words)
    return sum(dct.values())
So in your case, you use it like:
pdb_file = input("Enter the name of your PDB file: ")
sum_of_start_frequencies(pdb_file, ["ATOM", "HETATM"])
It should return 2.
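For what it's worth, collections.Counter from the standard library can replace the manual dct.get bookkeeping. This is just a sketch of the same idea, not the code above; the sample file name is made up:

```python
from collections import Counter

def start_frequencies(path, words):
    # Count how many lines start with each of the given words.
    counts = Counter()
    with open(path) as f:
        for line in f:
            for word in words:
                if line.startswith(word):
                    counts[word] += 1
    return counts

# tiny demo with a made-up sample file; the REMARK line contains
# "ATOM" but does not start with it, so it is not counted
with open('b_sample.txt', 'w') as f:
    f.write('REMARK 480 ATOM ZERO OCCUPANCY\nATOM 3332\nHETATM 2954\n')

freqs = start_frequencies('b_sample.txt', ['ATOM', 'HETATM'])
print(sum(freqs.values()))  # 2
```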

Separating values from text file in python

I am trying to read in a text file with the following data:
362 147
422 32
145 45
312 57
35 421
361 275
and I want to separate the values into pairs so 362 and 147 would be pair 1, 422 and 32 pair 2 and so on.
However I run into a problem with the 5th pair, which should be 35, 421, but for some reason my code does not split this pair correctly. I think this is because of the spaces, since only this pair has a two digit number followed by a three digit number. But I'm not sure how to fix this; here's my code:
def __init__(filename):
    f = open(filename, "r")  # reads file
    #print(f.read)  # test if file was actually read
    f1 = f.readlines()  # reads individual lines
    counter = 0
    for line in f1:
        values = line.split(" ")  # splits the two values for each line into an array
        value1 = values[0].strip()  # .strip removes spaces at each value
        value2 = values[1].strip()
        counter = counter + 1
        print('\npair: {}'.format(counter))
        #print(values)
        print(value1)
        print(value2)
The output I get:
pair: 1
362
147
pair: 2
422
32
pair: 3
145
45
pair: 4
312
57
pair: 5
35
pair: 6
361
275
Try this :
def __init__(filename):
    with open(filename, "r") as f:
        lines = [i.strip() for i in f.readlines()]
    for line_num, line in enumerate(lines):
        p1, p2 = [i for i in line.split() if i]
        print(f"pair: {line_num+1}\n{p1}\n{p2}\n\n")
Note: always try to use with open(). That way Python takes care of closing the file automatically at the end.
The problem with your code is that you're not checking whether the words extracted after splitting are empty strings or not. If you print values for each line, you'd notice that for pair 5 it is ['', '35', '421\n']. The first value is an empty string. You can change your code to this:
def __init__(filename):
    f = open(filename, "r")  # reads file
    #print(f.read)  # test if file was actually read
    f1 = f.readlines()  # reads individual lines
    counter = 0
    for line in f1:
        values = line.split()  # split on any whitespace; unlike .split(" "), this never yields empty strings
        values = [i for i in values if i]  # removes any remaining empty strings
        value1 = values[0].strip()  # .strip removes spaces at each value
        value2 = values[1].strip()
        counter = counter + 1
        print('\npair: {}'.format(counter))
        #print(values)
        print(value1)
        print(value2)
Change this line:
values = line.split(" ")
to:
values = line.split()
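The difference matters here: split(' ') produces an empty string for every extra space, while split() with no argument splits on any run of whitespace and never yields empty strings. A quick check with a line shaped like the problematic pair 5:

```python
line = ' 35 421\n'
print(line.split(' '))  # ['', '35', '421\n']
print(line.split())     # ['35', '421']
```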

How to find the average for a file then put it in another file

I want to find the average of the numbers in inFile and then write the result to classscores.txt.
classgrades.txt is:
Chapman 90 100 85 66 80 55
Cleese 80 90 85 88
Gilliam 78 82 80 80 75 77
Idle 91
Jones 68 90 22 100 0 80 85
Palin 80 90 80 90
classscores.txt is empty
This is what I have so far... what should I do?
inFile = open('classgrades.txt', 'r')
outFile = open('classscores.txt', 'w')
for line in inFile:
    with open(r'classgrades.txt') as data:
        total_stuff = #Don't know what to do past here
        biggest = min(total_stuff)
        smallest = max(total_stuff)
        print(biggest - smallest)
        print(sum(total_stuff)/len(total_stuff))
You will need to:
- split each line by whitespace and slice out all items but the first
- convert each string value in array to integer
- sum all of those integer values in the array
- add the sum for this line to total_sum
- add the length of those values (the number of numbers) to total_numbers
However, this is only part of the problem... I will leave the rest up to you. This code will not write to the new file, it will simply take an average of all the numbers in the first file. If this isn't exactly what you are asking for, then try playing around with this stuff and you should be able to figure it all out.
inFile = open('classgrades.txt', 'r')
outFile = open('classscores.txt', 'w')
total_sum = 0
total_numbers = 0
with open(r'classgrades.txt') as inFile:
    for line in inFile:
        # split by whitespace and slice out everything after 1st item
        num_strings = line.split(' ')[1:]
        # convert each string value to an integer
        values = [int(n) for n in num_strings]
        # sum all the values on this line
        line_sum = sum(values)
        # add the sum of numbers in this line to the total_sum
        total_sum += line_sum
        # add the number of values in this line to total_numbers
        total_numbers += len(values)
average = total_sum // total_numbers  # // is integer division in python3
print(average)
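A side note on that last line: in Python 3, // is floor division and / is true division, so use // only if you really want a whole-number average:

```python
print(7 / 2)    # 3.5
print(7 // 2)   # 3
print(-7 // 2)  # -4 (floor division rounds toward negative infinity)
```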
You don't need to open the file many times, and you should close the files at the end of your program. Below is what I tried; hope this works for you:
d1 = {}
with open(r'classgrades.txt', 'r') as fp:
    for line in fp:
        contents = line.strip().split(' ')
        # create mapping of student and his numbers
        # (list() so the values can be iterated more than once in Python 3)
        d1[contents[0]] = list(map(int, contents[1:]))
with open(r'classscores.txt', 'w') as fp:
    for key, item in d1.items():
        biggest = max(item)
        smallest = min(item)
        print(biggest - smallest)
        # average of all numbers
        avg = sum(item) / len(item)
        fp.write("%s %s\n" % (key, avg))
Apologies if this is kind of advanced, I try to provide key words/phrases for you to search for to learn more.
Presuming you're looking for each student's separate average:
in_file = open('classgrades.txt', 'r')  # python naming style is under_score
out_file = open('classcores.txt', 'w')
all_grades = []  # if you want a class-wide average as well as individual averages
for line in in_file:
    # make a list of the things between spaces, like ["studentname", "90", "100"]
    student = line.split(' ')[0]
    # this next line includes "list comprehension" and "list slicing"
    # it gets every item in the list aside from the 0th index (student name),
    # and "casts" them to integers so we can do math on them.
    grades = [int(g) for g in line.split(' ')[1:]]
    # hanging on to every grade for later
    all_grades += grades  # lists can be +='d like numbers can
    average = int(sum(grades) / len(grades))
    # str.format() here is one way to do "string interpolation"
    out_file.write('{} {}\n'.format(student, average))
total_avg = sum(all_grades) / len(all_grades)
print('Class average: {}'.format(total_avg))
in_file.close()
out_file.close()
As others pointed out, it is good to get in the habit of closing files.
Other responses here use with open() (as a "context manager") which is best practice because it automatically closes the file for you.
To work with two files without having a data container in between (like Amit's d1 dictionary), you would do something like:
with open('in.txt') as in_file:
    with open('out.txt', 'w') as out_file:
        ... do things ...
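Since Python 2.7 / 3.1 the two with statements can also be combined into a single line; a minimal sketch with placeholder file names:

```python
# create a small input file so the sketch is self-contained
with open('in.txt', 'w') as f:
    f.write('hello\n')

# one `with` statement, two context managers; both files close automatically
with open('in.txt') as in_file, open('out.txt', 'w') as out_file:
    out_file.write(in_file.read().upper())

print(open('out.txt').read())  # HELLO
```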
This script should accomplish what you are trying to do I think:
# define a list data structure to store the classgrades
classgrades = []
with open('classgrades.txt', 'r') as infile:
    for line in infile:
        l = line.split()
        # append a dict to the classgrades list with student as the key
        # and value is list of the students scores.
        classgrades.append({'name': l[0], 'scores': l[1:]})
with open('classscores.txt', 'w') as outfile:
    for student in classgrades:
        # get the student's name out of dict.
        name = student['name']
        # get the student's scores. use list comprehension to convert
        # strings to ints so that scores is a list of ints.
        scores = [int(s) for s in student['scores']]
        # calc. total
        total = sum(scores)
        # get the number of scores.
        count = len(student['scores'])
        # calc. average
        average = total / count
        biggest = max(scores)
        smallest = min(scores)
        diff = (biggest - smallest)
        outfile.write("%s %s %s\n" % (name, diff, average))
Running the above code will create a file called classscores.txt which will contain this:
Chapman 45 79.33333333333333
Cleese 10 85.75
Gilliam 7 78.66666666666667
Idle 0 91.0
Jones 100 63.57142857142857
Palin 10 85.0

how to keep ordered rows in dictionary?

I wrote the following script to retrieve the gene count for each contig. It works well, but the order of the ID list that I use as input is not conserved in the output.
I need to conserve the same order, as my input contig list is ordered by level of expression.
Can anyone help me?
Thanks for your help.
from collections import defaultdict
import numpy as np

gene_list = {}
for line in open('idlist.txt'):
    columns = line.strip().split()
    gene = columns[0]
    rien = columns[1]
    gene_list[gene] = rien

gene_count = defaultdict(lambda: np.zeros(6, dtype=int))
out_file = open('out.txt', 'w')
esem_file = open('Aquilonia.txt')
esem_file.readline()
for line in esem_file:
    fields = line.strip().split()
    exon = fields[0]
    numbers = [float(field) for field in fields[1:]]
    if exon in gene_list.keys():
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
input file:
comp54678_c0_seq3
comp56871_c2_seq8
comp56466_c0_seq5
comp57004_c0_seq1
comp54990_c0_seq11
...
output file comes back in numerical order:
comp100235_c0_seq1 [22 13 15 6 15 16]
comp101274_c0_seq1 [55 2 27 26 6 6]
comp101915_c0_seq1 [20 2 34 12 8 7]
comp101956_c0_seq1 [13 21 11 17 17 28]
comp101964_c0_seq1 [30 73 45 36 0 1]
Use collections.OrderedDict(); it preserves entries in input order.
from collections import OrderedDict
with open('idlist.txt') as idlist:
    gene_list = OrderedDict(line.split(None, 1) for line in idlist)
The above reads your gene_list into an ordered dictionary in one line.
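To see the order-preserving behavior with a few contig-style IDs (in Python 3.7+ plain dicts also preserve insertion order, but OrderedDict was the tool for this at the time):

```python
from collections import OrderedDict

d = OrderedDict()
for contig in ['comp56871_c2_seq8', 'comp100235_c0_seq1', 'comp54990_c0_seq11']:
    d[contig] = 0

# keys come back in insertion order, not sorted order
print(list(d))
```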
However, it looks as if you generate the output file purely based on the order of the input file lines:
for line in esem_file:
    # ...
    if exon in gene_list:  # no need to call `.keys()` here
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
Rework your code to first collect the counts, then use a separate loop to write out your data:
with open('Aquilonia.txt') as esem_file:
    next(esem_file, None)  # skip first line
    for line in esem_file:
        fields = line.split()
        exon = fields[0]
        numbers = [float(field) for field in fields[1:]]
        if exon in gene_list:
            gene_count[gene_list[exon]] += numbers

with open('out.txt', 'w') as out_file:
    for gene in gene_list:
        print >> out_file, gene, gene_count[gene]

Finding the rating of words using python

This is my program. It displays the value if I give the complete name: if I type eng, it will show me only eng with its value.
import re

sent = "eng"
#sent = raw_input("Enter word")
#regex = re.compile('(^|\W)sent(?=(\W|$))')
for line in open("sir_try.txt").readlines():
    if sent == line.split()[0].strip():
        k = line.rsplit(',', 1)[0].strip()
        print k
gene name utr length
ensbta 24
ensg1 12
ensg24 30
ensg37 65
enscat 22
ensm 30
Actually, what I want to do is search for the highest value in the text file, not by exact word. The program should delete from the text file all entries whose name shares the same prefix but has a value lower than that maximum (like 12 and 30 for ensg in the text above). Then it should find the minimum of the remaining utr values and display it with its name.
What people are answering, I have already done, as I mentioned before showing my program.
Please try this:
file = open("sir_try.txt", "r")
list_line = file.readlines()
file.close()
all_text = ""
dic = {}
sent = "ensg"
temp_list = []
for line in list_line:
    all_text = all_text + line
    name = line.rsplit()[0].strip()
    score = line.rsplit()[1].strip()
    dic[name] = score
for i in dic.keys():
    if sent in i:
        temp_list.append(dic[i])
high_score = max(temp_list)

def check(index):
    reverse_text = all_text[index+1::-1]
    index2 = reverse_text.find("\n")
    if sent == reverse_text[:index2+1][::-1][1:len(sent)+1]:
        return False
    else:
        return True

list_to_min = dic.values()
for i in temp_list:
    if i != high_score:
        index = all_text.find(str(i))
        while check(index):
            index = all_text.find(str(i), index+len(str(i)))
        all_text = all_text[0:index] + all_text[index+len(str(i)):]
        list_to_min.remove(str(i))
# write all text to "sir_try.txt"
file2 = open("sir_try.txt", "w")
file2.write(all_text)
file2.close()
min_score = min(list_to_min)
for j in dic.keys():
    if min_score == dic[j]:
        print "min score is :" + str(min_score) + " for person " + j
The check function works around a bug in the solution. To explain: when your file is
gene name utr length
ali 12
ali87 30
ensbta 24
ensg1 12
ensg24 30
ensg37 65
enscat 22
ensm 30
the program would delete ali's score even though we don't want that.
By adding the check function I solved it.
This version is the final answer.
Try replacing if sent == with if sent in (line.split()[0].strip()):.
That checks whether the value of sent (ensg) appears anywhere in the argument (line.split()[0].strip()) in this case.
If you're still trying to only take the highest value, I would just create a variable value, then do something along the lines of
if line.split()[1].strip() > value:
    value = line.split()[1].strip()
Test that out and let us know how it works for you.
To find the name (first column) with the maximum associated value (second column), you need to first split the lines at the whitespace between name and value. Then you can find the maximum value using the built-in max() function, letting it take the value column as the sorting criterion. You can then easily read off the corresponding name.
Example:
file_content = """
gene name utr length
ensbta 24
ensg1 12
ensg24 30
ensg37 65
enscat 22
ensm 30
"""
# split lines at whitespace
l = [line.split() for line in file_content.splitlines()]
# skip headline and empty lines
l = [line for line in l if len(line) == 2]
print l
# find the maximum of second column
max_utr_length_tuple = max(l, key=lambda x:x[1])
print max_utr_length_tuple
print max_utr_length_tuple[0]
the output is:
$ python test.py
[['ensbta', '24'], ['ensg1', '12'], ['ensg24', '30'], ['ensg37', '65'], ['enscat', '22'], ['ensm', '30']]
['ensg37', '65']
ensg37
Short and sweet:
In [01]: t=file_content.split()[4:]
In [02]: b=((zip(t[0::2], t[1::2])))
In [03]: max(b, key=lambda x:x[1])
Out[03]: ('ensg37', '65')
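The slicing trick pairs names with values: t[0::2] is every other token starting at index 0 (the names), t[1::2] is the tokens in between (the values), and zip stitches them together. A small sketch with a subset of the data; note that max here compares the values as strings, which happens to work for these magnitudes but would break for mixed digit counts:

```python
t = ['ensbta', '24', 'ensg1', '12', 'ensg37', '65']
pairs = list(zip(t[0::2], t[1::2]))
print(pairs)                           # [('ensbta', '24'), ('ensg1', '12'), ('ensg37', '65')]
print(max(pairs, key=lambda x: x[1]))  # ('ensg37', '65')
```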
import operator

f = open('./sir_try.txt', 'r')
f = f.readlines()
del f[0]
gene = {}
matched_gene = {}
for line in f:
    words = line.strip().split(' ')
    words = [word for word in words if not word == '']
    gene[words[0]] = words[1]
# getting user input
user_input = raw_input('Enter gene name: ')
for gene_name, utr_length in gene.iteritems():
    if user_input in gene_name:
        matched_gene[gene_name] = utr_length
m = max(matched_gene.iteritems(), key=operator.itemgetter(1))[0]
print m, matched_gene[m]  # expected answer
# code to remove redundant gene names as per requirement
for key in matched_gene.keys():
    if not key == m:
        matched_gene.pop(key)
for key in gene.keys():
    if user_input in key:
        gene.pop(key)
final_gene = dict(gene.items() + matched_gene.items())
out = open('./output.txt', 'w')
out.write('gene name' + '\t\t' + 'utr length' + '\n\n')
for key, value in final_gene.iteritems():
    out.write(key + '\t\t\t\t' + value + '\n')
out.close()
Output:
Enter gene name: ensg
ensg37 65
Since you have tagged your question regex, here's something you might want to see; it's the only answer (at the moment) that uses regex!
import re

sent = 'ensg'  # your sequence
# regex that will "filter" the lines containing value of sent
my_re = re.compile(r'(.*?%s.*?)\s+?(\d+)' % sent)
with open('stack.txt') as f:
    lines = f.read()  # get data from file
filtered = my_re.findall(lines)  # "filter" your data
print filtered
# get the desired (tuple with maximum "utr length")
max_tuple = max(filtered, key=lambda x: x[1])
print max_tuple
Output:
[('ensg1', '12'), ('ensg24', '30'), ('ensg37', '65')]
('ensg37', '65')
