Separating values from a text file in Python

I am trying to read in a text file with the following data:
362 147
422 32
145 45
312 57
35 421
361 275
and I want to separate the values into pairs so 362 and 147 would be pair 1, 422 and 32 pair 2 and so on.
However I run into a problem on the 5th pair, which should be 35, 421: my code does not split this pair correctly. I think this is because of the spaces, since only this pair has a two-digit number followed by a three-digit number. But I'm not sure how to fix this; here's my code:
def __init__(filename):
    f = open(filename, "r") # reads file
    #print(f.read) # test if file was actually read
    f1 = f.readlines() # reads individual lines
    counter = 0
    for line in f1:
        values = line.split(" ") # splits the two values for each line into an array
        value1 = values[0].strip() # .strip removes spaces around each value
        value2 = values[1].strip()
        counter = counter + 1
        print('\npair: {}'.format(counter))
        #print(values)
        print(value1)
        print(value2)
The output I get:
pair: 1
362
147
pair: 2
422
32
pair: 3
145
45
pair: 4
312
57
pair: 5
35
pair: 6
361
275

Try this:
def __init__(filename):
    with open(filename, "r") as f:
        lines = [i.strip() for i in f.readlines()]
    for line_num, line in enumerate(lines):
        p1, p2 = [i for i in line.split() if i]
        print(f"pair: {line_num+1}\n{p1}\n{p2}\n\n")
Note: always try to use with open(). That way Python takes care of closing the file automatically at the end.
The problem with your code is that you're not checking whether the words extracted after splitting are empty strings. If you print values for each line, you'd notice that for pair 5 it is ['', '35', '421\n']. The first element is an empty string. You can change your code to this:
def __init__(filename):
    f = open(filename, "r") # reads file
    #print(f.read) # test if file was actually read
    f1 = f.readlines() # reads individual lines
    counter = 0
    for line in f1:
        values = line.split() # splits on runs of whitespace; unlike .split(" "), this never produces empty strings
        values = [i for i in values if i] # removes any empty strings (redundant with .split(), but harmless)
        value1 = values[0].strip() # .strip removes spaces around each value
        value2 = values[1].strip()
        counter = counter + 1
        print('\npair: {}'.format(counter))
        #print(values)
        print(value1)
        print(value2)

Change this line:
values = line.split(" ")
to:
values = line.split()
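A quick comparison (not part of the original answers) shows why the two calls behave differently on the problematic line:

```python
# split(" ") splits on every single space, so leading or repeated spaces
# produce empty strings; split() collapses whitespace runs and never does.
line = " 35 421\n"

print(line.split(" "))  # ['', '35', '421\n']
print(line.split())     # ['35', '421']
```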


Counting items in txt file with Python dictionaries

I have the following txt file (only a fragment is given):
## DISTANCE : Shortest distance from variant to transcript
## a lot of comments here
## STRAND : Strand of the feature (1/-1)
## FLAGS : Transcript quality flags
#Uploaded_variation Location Allele Gene Feature Feature_type Consequence cDNA_position CDS_position Protein_position Amino_acids Codons Existing_variation Extra
chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant 216 180 60 S tcA/tcG - IMPACT=LOW;STRAND=1
chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000641515 Transcript intron_variant 303 243 81 S tcA/tcG - IMPACT=LOW;STRAND=1
chr1_69511_A/G chr1:69511 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant 457 421 141 T/A Aca/Gca - IMPACT=MODERATE;STRAND=1
with many various unknown ENSG numbers, such as ENSG00000187583, etc. Each ENSG identifier contains 11 digits.
I have to count how many intron_variant and upstream_gene_variant entries each gene (ENSGxxx) contains, and output the result to a csv file.
I use a dictionary for this purpose. The logic should be: if an 11-digit number is not in the dictionary, it should be added with value 1; if it is already in the dictionary, its value should be incremented by 1. I currently have this code, but I am not really a Python programmer and I'm not sure about the correct syntax.
with open(file, 'rt') as f:
    data = f.readlines()
Count = 0
d = {}
for line in data:
    if line[0] == "#":
        output.write(line)
    if line.__contains__('ENSG'):
        d[line.split('ENSG')[1][0:11]] = 1
        if 1 in d:
            d = 1
        else:
            Count += 1
Any suggestions?
Thank you!
Can you try this:
from collections import Counter

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        idx = line.find('ENSG')
        if not line.startswith('#') and idx != -1:
            ensg.append(line[idx+4:idx+15])
count = Counter(ensg)
>>> count
Counter({'00000187961': 2, '00000187583': 2})
Update
I need to know how many ENSGs contain "intron_variant" and "upstream_gene_variant".
Use regex to extract desired patterns:
from collections import Counter
import re

PAT_ENSG = r'ENSG(?P<ensg>\d{11})'
PAT_VARIANT = r'(?P<variant>intron_variant|upstream_gene_variant)'
PATTERN = re.compile(fr'{PAT_ENSG}.*\b{PAT_VARIANT}\b')

with open('data.txt') as fp:
    ensg = []
    for line in fp:
        sre = PATTERN.search(line)
        if not line.startswith('#') and sre:
            ensg.append(sre.groups())
count = Counter(ensg)
Output:
>>> count
Counter({('00000186092', 'upstream_gene_variant'): 2,
('00000186092', 'intron_variant'): 1})
Here's another interpretation of your requirement.
I have modified your sample data so that the first ENSG value is ENSG00000187971, to highlight how this works.
D = {}
with open('eng.txt') as eng:
    for line in eng:
        if not line.startswith('#'):
            t = line.split()
            V = t[6]
            E = t[3]
            if V not in D:
                D[V] = {}
            if E not in D[V]:
                D[V][E] = 1
            else:
                D[V][E] += 1
print(D)
The output of this is:
{'intron_variant': {'ENSG00000187971': 1, 'ENSG00000187961': 1}, 'upstream_gene_variant': {'ENSG00000187583': 2}}
So what you have now is a dictionary keyed by variant. Each variant has its own dictionary keyed by the ENSG values, with a count of occurrences of each ENSG value.
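The same variant-to-gene tally can be sketched more compactly with defaultdict and Counter (not part of the original answers; the rows below are shortened stand-ins for lines of the annotation file):

```python
from collections import defaultdict, Counter

# Shortened stand-ins for non-comment lines of the annotation file;
# field 3 is the ENSG gene, field 6 is the variant consequence.
rows = [
    "chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant",
    "chr1_69270_A/G chr1:69270 G ENSG00000186092 ENST00000641515 Transcript intron_variant",
    "chr1_69511_A/G chr1:69511 G ENSG00000186092 ENST00000335137 Transcript upstream_gene_variant",
]

counts = defaultdict(Counter)  # variant -> Counter of genes
for line in rows:
    if not line.startswith('#'):
        t = line.split()
        counts[t[6]][t[3]] += 1

print(dict(counts))
# {'upstream_gene_variant': Counter({'ENSG00000186092': 2}),
#  'intron_variant': Counter({'ENSG00000186092': 1})}
```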

Selecting line from file by using "startswith" and "next" commands

I have a file from which I want to create a list ("timestep") from the numbers which appear after each line "ITEM: TIMESTEP" so:
timestep = [253400, 253500, .. etc]
Here is the sample of the file I have:
ITEM: TIMESTEP
253400
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
ITEM: TIMESTEP
253500
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
To do this I tried to use the "startswith" and "next" commands together, but it didn't work. Is there another way to do it? Here is the code I'm trying to use:
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.split()
        if line[0].startswith("ITEM: TIMESTEP"):
            timestep.append(next(line))
print(timestep)
The logic is to decide whether or not to append the current line to timestep. So what you need is a variable that tells you to append the current line when it is True.
timestep = []
append_to_list = False # decision variable
with open(file, 'r') as f:
    lines = f.readlines()
for line in lines:
    line = line.strip() # remove "\n" from line
    if line.startswith("ITEM"):
        # update append_to_list
        if line == 'ITEM: TIMESTEP':
            append_to_list = True
        else:
            append_to_list = False
    else:
        # append to list if line doesn't start with "ITEM" and append_to_list is True
        if append_to_list:
            timestep.append(line)
print(timestep)
output:
['253400', '253500']
First, I don't like this, because it doesn't scale. You can only get the first immediately following line nicely; anything else will be just ugh...
But you asked, so... for x in lines will create an iterator over lines and use that to keep the position. You don't have access to that iterator, so next will not give you the element you're expecting. But you can make your own iterator and use that:
lines_iter = iter(lines)
for line in lines_iter:
    # whatever was here
    timestep.append(next(lines_iter))
However, if you ever want to scale it... for is not a good way to iterate over a file like this when you need to know what is in the next/previous line. I would suggest using while:
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
i = 0
while i < len(lines):
    if lines[i].startswith("ITEM: TIMESTEP"):
        i += 1
        while i < len(lines) and not lines[i].startswith("ITEM: "):
            timestep.append(lines[i].strip())
            i += 1
    else:
        i += 1
This way you can extend it for different types of ITEMs of variable length.
So the problem with your code is subtle. You have a list lines which you iterate over, but you can't call next on a list.
Instead, turn it into an explicit iterator and you should be fine
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
lines_iter = iter(lines)
for line in lines_iter:
    line = line.strip() # removes the newline
    if line.startswith("ITEM: TIMESTEP"):
        # the second argument prevents errors when ITEM: TIMESTEP
        # appears as the last line in the file
        timestep.append(next(lines_iter, None))
print(timestep)
I'm also not sure why you included line.split(), which seems incorrect (in any case line.split()[0].startswith('ITEM: TIMESTEP') can never be true, since the split separates ITEM: and TIMESTEP into different elements of the resulting list).
For a more robust answer, consider grouping your data based on when the line begins with ITEM.
def process_file(f):
    ITEM_MARKER = 'ITEM: '
    item_title = '(none)'
    values = []
    for line in f:
        if line.startswith(ITEM_MARKER):
            if values:
                yield (item_title, values)
            item_title = line[len(ITEM_MARKER):].strip() # strip off the marker
            values = []
        else:
            values.append(line.strip())
    if values:
        yield (item_title, values)
This will let you pass in the whole file and will lazily produce a set of values for each ITEM: <whatever> group. Then you can aggregate in some reasonable way.
with open(file, 'r') as f:
    groups = process_file(f)
    aggregations = {}
    for name, values in groups:
        aggregations.setdefault(name, []).extend(values)
    print(aggregations['TIMESTEP']) # this is what you want
You can use enumerate to help with index referencing. We can check whether the string ITEM: TIMESTEP is in the previous line, then add the integer to our timestep list.
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
for i, line in enumerate(lines):
    # note: for i == 0, lines[i-1] is the *last* line of the file,
    # which is harmless here but worth knowing about
    if "ITEM: TIMESTEP" in lines[i-1]:
        timestep.append(int(line.strip()))
print(timestep)
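A closely related variant (not from the original answers) pairs each line with its successor via zip, which avoids the index arithmetic entirely; an in-memory list stands in for the file here:

```python
# Pair each line with the one that follows it; when the first of the
# pair is "ITEM: TIMESTEP", the second is the number we want.
lines = [
    "ITEM: TIMESTEP", "253400",
    "ITEM: NUMBER OF ATOMS", "378",
    "ITEM: TIMESTEP", "253500",
]

timestep = [int(cur) for prev, cur in zip(lines, lines[1:])
            if prev.startswith("ITEM: TIMESTEP")]
print(timestep)  # [253400, 253500]
```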

How to find the average for a file then put it in another file

I want to find the average of the numbers in inFile and then write the result to classscores.
classgrades.txt is:
Chapman 90 100 85 66 80 55
Cleese 80 90 85 88
Gilliam 78 82 80 80 75 77
Idle 91
Jones 68 90 22 100 0 80 85
Palin 80 90 80 90
classcores.txt is empty
This is what I have so far... what should I do?
inFile = open('classgrades.txt', 'r')
outFile = open('classscores.txt', 'w')
for line in inFile:
    with open(r'classgrades.txt') as data:
        total_stuff = # Don't know what to do past here
        biggest = min(total_stuff)
        smallest = max(total_stuff)
        print(biggest - smallest)
        print(sum(total_stuff)/len(total_stuff))
You will need to:
- split each line by whitespace and slice out all items but the first
- convert each string value in array to integer
- sum all of those integer values in the array
- add the sum for this line to total_sum
- add the length of those values (the number of numbers) to total_numbers
However, this is only part of the problem... I will leave the rest up to you. This code will not write to the new file; it simply takes the average of all the numbers in the first file. If this isn't exactly what you are asking for, try playing around with this and you should be able to figure it all out.
total_sum = 0
total_numbers = 0
with open('classgrades.txt') as inFile:
    for line in inFile:
        # split by whitespace and slice out everything after the 1st item
        num_strings = line.split(' ')[1:]
        # convert each string value to an integer
        values = [int(n) for n in num_strings]
        # sum all the values on this line
        line_sum = sum(values)
        # add the sum of numbers in this line to total_sum
        total_sum += line_sum
        # add the number of values in this line to total_numbers
        total_numbers += len(values)
average = total_sum // total_numbers # // is integer division in python3
print(average)
You don't need to open the file many times, and you should close the files at the end of your program. Below is what I tried; hope this works for you:
d1 = {}
with open(r'classgrades.txt', 'r') as fp:
    for line in fp:
        contents = line.strip().split(' ')
        # create a mapping of student to his numbers
        # (a list, not map(), so it can be iterated more than once)
        d1[contents[0]] = [int(n) for n in contents[1:]]
with open(r'classscores.txt', 'w') as fp:
    for key, item in d1.items():
        biggest = max(item)
        smallest = min(item)
        print(biggest - smallest)
        # average of all numbers
        avg = sum(item)/len(item)
        fp.write("%s %s\n" % (key, avg))
Apologies if this is kind of advanced; I've tried to provide key words/phrases for you to search for to learn more.
Presuming you're looking for each student's separate average:
in_file = open('classgrades.txt', 'r') # python naming style is under_score
out_file = open('classcores.txt', 'w')
all_grades = [] # if you want a class-wide average as well as individual averages
for line in in_file:
    # make a list of the things between spaces, like ["studentname", "90", "100"]
    student = line.split(' ')[0]
    # this next line uses "list comprehension" and "list slicing":
    # it gets every item in the list aside from the 0th index (student name)
    # and "casts" them to integers so we can do math on them
    grades = [int(g) for g in line.split(' ')[1:]]
    # hanging on to every grade for later
    all_grades += grades # lists can be +='d like numbers can
    average = int(sum(grades) / len(grades))
    # str.format() here is one way to do "string interpolation"
    out_file.write('{} {}\n'.format(student, average))
total_avg = sum(all_grades) / len(all_grades)
print('Class average: {}'.format(total_avg))
in_file.close()
out_file.close()
As others pointed out, it is good to get in the habit of closing files.
Other responses here use with open() (as a "context manager") which is best practice because it automatically closes the file for you.
To work with two files without having a data container in between (like Amit's d1 dictionary), you would do something like:
with open('in.txt') as in_file:
    with open('out.txt', 'w') as out_file:
        ... do things ...
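Since Python 2.7 / 3.1 the two nested with blocks can also be combined into a single statement. A small self-contained sketch (using a temporary directory rather than the question's actual files):

```python
import os
import tempfile

# Stand-in files so the example is runnable anywhere
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, 'in.txt')
dst = os.path.join(tmp, 'out.txt')
with open(src, 'w') as f:
    f.write('Idle 91\n')

# Combined form: both files opened in one with statement,
# both closed automatically when the block ends
with open(src) as in_file, open(dst, 'w') as out_file:
    out_file.write(in_file.read())

with open(dst) as f:
    print(f.read())  # Idle 91
```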
This script should accomplish what you are trying to do I think:
# define a list data structure to store the classgrades
classgrades = []
with open('classgrades.txt', 'r') as infile:
    for line in infile:
        l = line.split()
        # append a dict to the classgrades list with the student's name
        # and a list of the student's scores
        classgrades.append({'name': l[0], 'scores': l[1:]})
with open('classscores.txt', 'w') as outfile:
    for student in classgrades:
        # get the student's name out of the dict
        name = student['name']
        # use a list comprehension to convert the score strings to ints
        scores = [int(s) for s in student['scores']]
        # calculate the total, count, and average
        total = sum(scores)
        count = len(scores)
        average = total/count
        biggest = max(scores)
        smallest = min(scores)
        diff = biggest - smallest
        outfile.write("%s %s %s\n" % (name, diff, average))
Running the above code will create a file called classscores.txt which will contain this:
Chapman 45 79.33333333333333
Cleese 10 85.75
Gilliam 7 78.66666666666667
Idle 0 91.0
Jones 100 63.57142857142857
Palin 10 85.0

Parsing two text files in Python for a combined result

A chocolate company has decided to offer a discount on candy products which were produced 30 days or more before the current date. I have to print a matrix, where the program reads two files: one with the costs of the different candies of different sizes, and another with the threshold number of days after which the discount is offered. The two text files look something like this:
candies.txt
31 32 19 11 15 30 35 37
12 34 39 45 66 78 12 7
76 32 8 2 3 5 18 32 48
99 102 3 46 88 22 25 21
fd zz er 23 44 56 77 99
44 33 22 55 er ee df 22
and the second file days.txt
30
But it can have more than one number. It can look something like
30
40
36
The desired output is
Discount at days = 30
$ $ $
$ $ $
$ $ $ $ $
$ $ $ $
? ? ? $
$ ? ? ? $
Discount at days = 40
And then the corresponding output for that threshold, and so on.
So basically, everywhere the number is under the number given in days.txt it should print a "$" sign, and everywhere it is more than the number (30 in our case) it should print a space in its place. We also have an anomaly: there are letters in the candies.txt matrix, and since we are looking for numbers to check the price, not letters, it should print a "?" sign in their place as they are not recognized.
Here's my code
def replace(word, threshold):
    try:
        value = int(word)
    except ValueError:
        return '?'
    if value < threshold:
        return '$'
    if value > threshold:
        return ' '
    return word

def get_threshold(filename):
    thresholds = []
    with open(filename) as fobj:
        for line in fobj:
            if line.strip():
                thresholds.append(int(line))
    return thresholds

def process_file(data_file, threshold):
    lines = []
    print('Original data:')
    with open(data_file) as f:
        for line in f:
            line = line.strip()
            print(line)
            replaced_line = ' '.join(
                replace(chunck, threshold) for chunck in line.split())
            lines.append(replaced_line)
    print('\nData replaced with threshold', threshold)

for threshold in get_threshold('days.txt'):
    process_file('demo.txt', threshold)
My question is that my code works when there is only one number in the second file, days.txt, but it doesn't work when there is more than one. I want it to work when each line of the second text file contains a number. I don't know what I am doing wrong.
Read all thresholds:
def get_thresholds(filename):
    with open(filename) as fobj:
        return [int(line) for line in fobj if line.strip()]
Alternative implementation without the list comprehension:
def get_thresholds(filename):
    thresholds = []
    with open(filename) as fobj:
        for line in fobj:
            if line.strip():
                thresholds.append(int(line))
    return thresholds
Modify your function a bit:
def process_file(data_file, threshold):
    lines = []
    print('Original data:')
    with open(data_file) as f:
        for line in f:
            line = line.strip()
            print(line)
            replaced_line = ' '.join(
                replace(chunck, threshold) for chunck in line.split())
            lines.append(replaced_line)
    print('\nData replaced with threshold', threshold)
    for line in lines:
        print(line)
Go through all thresholds:
for threshold in get_thresholds('days.txt'):
    process_file('candies.txt', threshold)
This is a rewrite of my previous answer. Due to the long discussion and the many changes, it seemed clearer to write another answer. I chopped the task into smaller sub-tasks and defined a function for each. All functions have docstrings; this is highly recommended.
"""
A chocolate company has decided to offer discount on the candy products
which are produced 30 days of more before the current date.
More story here ...
"""
def read_thresholds(filename):
"""Read values for thresholds from file.
"""
thresholds = []
with open(filename) as fobj:
for line in fobj:
if line.strip():
thresholds.append(int(line))
return thresholds
def read_costs(filename):
"""Read the cost from file.
"""
lines = []
with open(filename) as fobj:
for line in fobj:
lines.append(line.strip())
return lines
def replace(word, threshold):
"""Replace value below threshold with `$`, above threshold with ` `,
non-numbers with `?`, and keep the value if it equals the
threshold.
"""
try:
value = int(word)
except ValueError:
return '?'
if value < threshold:
return '$'
if value > threshold:
return ' '
return word
def process_costs(costs, threshold):
"""Replace the cost for given threshold and print results.
"""
res = []
for line in costs:
replaced_line = ' '.join(
replace(chunck, threshold) for chunck in line.split())
res.append(replaced_line)
print('\nData replaced with threshold', threshold)
for line in res:
print(line)
def show_orig(costs):
"""Show original costs.
"""
print('Original data:')
for line in costs:
print(line)
def main(costs_filename, threshold_filename):
"""Replace costs for all thresholds and display results.
"""
costs = read_costs(costs_filename)
show_orig(costs)
for threshold in read_thresholds(threshold_filename):
process_costs(costs, threshold)
if __name__ == '__main__':
main('candies.txt', 'days.txt')

How to keep ordered rows in a dictionary?

I wrote the following script to retrieve the gene count for each contig. It works well, but the order of the ID list that I use as input is not preserved in the output.
I need to preserve the same order, since my input contig list is ordered by level of expression.
Can anyone help me?
Thanks for your help.
from collections import defaultdict
import numpy as np

gene_list = {}
for line in open('idlist.txt'):
    columns = line.strip().split()
    gene = columns[0]
    rien = columns[1]
    gene_list[gene] = rien

gene_count = defaultdict(lambda: np.zeros(6, dtype=int))
out_file = open('out.txt', 'w')
esem_file = open('Aquilonia.txt')
esem_file.readline()
for line in esem_file:
    fields = line.strip().split()
    exon = fields[0]
    numbers = [float(field) for field in fields[1:]]
    if exon in gene_list.keys():
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
input file:
comp54678_c0_seq3
comp56871_c2_seq8
comp56466_c0_seq5
comp57004_c0_seq1
comp54990_c0_seq11
...
output file comes back in numerical order:
comp100235_c0_seq1 [22 13 15 6 15 16]
comp101274_c0_seq1 [55 2 27 26 6 6]
comp101915_c0_seq1 [20 2 34 12 8 7]
comp101956_c0_seq1 [13 21 11 17 17 28]
comp101964_c0_seq1 [30 73 45 36 0 1]
Use collections.OrderedDict(); it preserves entries in input order.
from collections import OrderedDict

with open('idlist.txt') as idlist:
    gene_list = OrderedDict(line.split(None, 1) for line in idlist)
The above code reads your gene_list ordered dictionary using one line.
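(As an aside not in the original answer: on Python 3.7+ a plain dict already preserves insertion order, so the same one-liner works without OrderedDict. Toy lines stand in for idlist.txt here:)

```python
# On Python 3.7+, plain dicts keep keys in insertion order
lines = ["comp54678_c0_seq3 geneA", "comp56871_c2_seq8 geneB"]
gene_list = dict(line.split(None, 1) for line in lines)

print(list(gene_list))  # ['comp54678_c0_seq3', 'comp56871_c2_seq8']
```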
However, it looks as if you generate the output file purely based on the order of the input file lines:
for line in esem_file:
    # ...
    if exon in gene_list: # no need to call `.keys()` here
        gene = gene_list[exon]
        gene_count[gene] += numbers
        print >> out_file, gene, gene_count[gene]
Rework your code to first collect the counts, then use a separate loop to write out your data:
with open('Aquilonia.txt') as esem_file:
    next(esem_file, None) # skip first line
    for line in esem_file:
        fields = line.split()
        exon = fields[0]
        numbers = [float(field) for field in fields[1:]]
        if exon in gene_list:
            gene_count[gene_list[exon]] += numbers
with open('out.txt', 'w') as out_file:
    for gene in gene_list:
        print >> out_file, gene, gene_count[gene]
