I would like to count the occurrences of missing values (NA) on every line of a txt file.
foo.txt file:
1 1 1 1 1 NA # so, Missings: 1
1 1 1 NA 1 1 # so, Missings: 1
1 1 NA 1 1 NA # so, Missings: 2
But I would also like to obtain the number of elements in the first line (assuming this is the same for all lines).
miss = []
with open("foo.txt") as f:
    for line in f:
        miss.append(line.count("NA"))
>>> miss
[1, 1, 2] # correct
The problem is when I try to identify the amount of elements. I did this with the following code:
miss = []
with open("foo.txt") as f:
    first_line = f.readline()
    elements = first_line.count(" ") # given that values are separated by space
    for line in f:
        miss.append(line.count("NA"))
>>> (elements + 1)
6 # True, this is correct
>>> miss
[1, 2] # misses the first item because readline() already consumed the first line
How can I read the first line once without removing it for the further operation?
Try f.seek(0). This will reset the file handle to the beginning of the file.
Complete example would then be:
miss = []
with open("foo.txt") as f:
    first_line = f.readline()
    elements = first_line.count(" ") # given that values are separated by space
    f.seek(0)
    for line in f:
        miss.append(line.count("NA"))
Even better would be to read all lines, including the first, only once, and to check the number of elements only once:
miss = []
elements = None
with open("foo.txt") as f:
    for line in f:
        if elements is None:
            elements = line.count(" ") # given that values are separated by space
        miss.append(line.count("NA"))
BTW: wouldn't the number of elements be line.count(" ") + 1?
I'd recommend using len(line.split()), as this also handles tabs, double spaces, leading/trailing spaces etc.
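To illustrate the difference between the two counting approaches, here is a small sketch. The sample strings and the two helper names are made up for this illustration and are not taken from foo.txt.

```python
# Hypothetical sample lines (assumptions for illustration only)
clean = "1 1 1 NA 1 1\n"
messy = "  1  1\t1 NA 1 1 \n"  # leading/trailing/double spaces and a tab

def count_by_spaces(line):
    # Counts separators, so it is only right for strictly single-space-separated input
    return line.count(" ") + 1

def count_by_split(line):
    # str.split() with no argument splits on any run of whitespace
    return len(line.split())

print(count_by_spaces(clean), count_by_split(clean))
print(count_by_spaces(messy), count_by_split(messy))
```

On the clean line both helpers agree; on the messy line only the split-based count still reports six elements.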
Provided all lines have the same number of items, you can just count the items in the last line:
miss = []
with open("foo.txt") as f:
    for line in f:
        miss.append(line.count("NA"))
elements = len(line.split())
A better way to count is probably:
elements = len(line.split())
because this also counts items separated with multiple spaces or tabs.
You can also just treat the first line separately:
with open("foo.txt") as f:
    first_line = next(f)
    elements = first_line.count(" ") # given that values are separated by space
    miss = [first_line.count("NA")]
    for line in f:
        miss.append(line.count("NA"))
Related
I want to rewrite an existing file with contents like:
Tom A
Mike B
Jim C
to
Tom 1
Mike 2
Jim 3
The letters A, B, C can also be something else. Basically, I want to keep the spaces between the names and what comes after them, but change what comes after to numbers. Does someone have an idea, please? Thanks a lot for your help.
I assume your first and second columns are separated by a tab (i.e. \t)?
If so, you can do this by reading the file into a list, using the split function to split each line of the file into components, editing the second component of each line, concatenating the two components back together with a tab separator, and finally writing the result back to the file.
For example, if test.txt is your input file:
# Create list that holds the desired output
output = [1,2,3]
# Open the file to be overwritten
with open('test.txt', 'r') as f:
    # Read file into a list of strings (one string per line)
    text = f.readlines()
# Open the file for writing (FYI this CLEARS the file as we specify 'w')
with open('test.txt', 'w') as f:
    # Loop over lines (i.e. elements) in `text`
    for i,item in enumerate(text):
        # Split line into elements based on whitespace (default for `split`)
        line = item.split()
        # Concatenate the name and desired output with a tab separator and write to the file
        f.write("%s\t%s\n" % (line[0],output[i]))
I assumed your first and second columns were separated by spaces in the file.
You can read the file contents into a list and use the function replace_end(line, newline), which replaces the end of the line with the string you pass. Then you can write the changed list back to the file.
""" rewrite an existing file """
def main():
    """ main """
    filename = "update_me.txt"
    count = 0
    lst = []
    with open(filename, "r",encoding = "utf-8") as filestream:
        _lines = filestream.readlines()
        for line in _lines:
            lst.insert(count,line.strip())
            count += 1
            #print(f"Line {count} {line.strip()}")
    count = 0
    # change the list
    for line in lst:
        lst[count] = replace_end(line,"ABC")
        count +=1
    count = 0
    with open(filename, "w", encoding = "utf-8") as filestream:
        for line in lst:
            filestream.write(line+"\n")
            count +=1

def replace_end(line,newline):
    """ replace the end of a line """
    return line[:-len(newline)] + newline

if __name__ == '__main__':
    main()
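Since the question asks for numbers in place of the last column while keeping the original spacing, a regular-expression sketch may be closer to that goal. The function name number_last_field and the sample lines are assumptions made for this illustration; it works on a list of lines rather than on a file so that it can be tried in isolation.

```python
import re

def number_last_field(lines):
    """Replace the last whitespace-separated field of each line with 1, 2, 3, ..."""
    result = []
    for number, line in enumerate(lines, start=1):
        # Match the final run of non-whitespace plus any trailing whitespace
        # (such as the newline); everything before it, including the spacing
        # between the columns, is left untouched.
        result.append(
            re.sub(r"\S+(\s*)$", lambda m: f"{number}{m.group(1)}", line, count=1)
        )
    return result

# Hypothetical sample with mixed separators
sample = ["Tom   A\n", "Mike\tB\n", "Jim C\n"]
renumbered = number_last_field(sample)
print(renumbered)
```

Note that a line containing only a single field would have that field replaced, so real input may need an extra check.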
Goal: Open the text file. Check whether the first 3 characters of each line are the same in subsequent lines. If yes, delete the bottom one.
The contents of the text file:
cat1
dog4
cat3
fish
dog8
Desired output:
cat1
dog4
fish
Attempt at code:
line = open("text.txt", "r")
for num in line.readlines():
    a = line[num][0:3] #getting first 3 characters
    for num2 in line.readlines():
        b = line[num2][0:3]
        if a in b:
            line[num2] = ""
Open the file and read one line at a time. Note the first 3 characters (prefix). Check if the prefix has been previously observed. If not, keep that line and add the prefix to a set. For example:
with open('text.txt') as infile:
    out_lines = []
    prefixes = set()
    for line in map(str.strip, infile):
        if (prefix := line[:3]) not in prefixes:
            out_lines.append(line)
            prefixes.add(prefix)
print(out_lines)
Output:
['cat1', 'dog4', 'fish']
Note:
Requires Python 3.8+
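For interpreters older than 3.8, the same idea can be written without the walrus operator. This is a minimal sketch; the function name dedupe_by_prefix and its width parameter are assumptions for illustration, and it takes any iterable of lines (an open file works too).

```python
def dedupe_by_prefix(lines, width=3):
    """Keep only the first line seen for each prefix of the given width."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        prefix = line[:width]
        if prefix not in seen:
            seen.add(prefix)
            kept.append(line)
    return kept

print(dedupe_by_prefix(["cat1\n", "dog4\n", "cat3\n", "fish\n", "dog8\n"]))
```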
You can use a dictionary to store the first 3 characters and check it while reading. Sample code below:
first_three_char_dict = {}
kept_lines = []
with open("text.txt", "r") as infile:
    for line in infile:
        line = line.strip()
        a = line[0:3] # getting first 3 characters
        if a in first_three_char_dict:
            continue # prefix already seen: drop this line
        first_three_char_dict[a] = line
        kept_lines.append(line)
Try reading each line and adding the word to a dict, unless its prefix is already present. The key would be the first 3 characters of the word and the value would be the word itself. At the end, the dict keys are unique and their values are your desired result.
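The dict-based idea described above could be sketched as follows. It assumes that the first word seen for each prefix is the one to keep, as in the desired output of the question; the function name first_by_prefix is made up for this example.

```python
def first_by_prefix(lines, width=3):
    """Map each prefix to the first word that carries it, then return the words."""
    first_seen = {}
    for line in lines:
        word = line.strip()
        if word:
            # setdefault stores the word only if the prefix is not yet a key
            first_seen.setdefault(word[:width], word)
    # Regular dicts preserve insertion order (guaranteed from Python 3.7)
    return list(first_seen.values())

print(first_by_prefix(["cat1\n", "dog4\n", "cat3\n", "fish\n", "dog8\n"]))
```

Assigning with first_seen[prefix] = word instead would keep the last word per prefix, which is not what the question asks for.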
You just need to check whether the prefix already exists in a temporary list:
seen = [] # prefixes seen so far
result = [] # lines to keep
with open("text.txt", "r") as infile:
    for line in infile:
        data = line[0:3] # getting first 3 characters
        if data not in seen: # check whether the prefix is already in the list
            seen.append(data) # if not, remember the prefix
            result.append(line.strip()) # and keep the line
I have a text file that looks like:
>MN00153:75:000H37WNG:1:11102:13823:1502
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1504
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1506
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2_rc : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1508
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
EIF2_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
I am interested in lines 3 and 4 of each bucket (each bucket starts with '>'). I want to count 1 if both line 3 and line 4 start with CASP3 (regardless of what comes afterward), so the output should be
3
because the first, second, and third buckets have a CASP3 pair in lines 3 and 4 (the last one does not).
Thanks
If your file is not too large, you can use the .readlines method to get a list of lines in the following way:
with open('filename.txt', 'r') as f:
    lines = f.readlines()
then use enumerate and str methods in the following way:
cnt = 0
for inx, line in enumerate(lines):
    if line.startswith('>') and lines[inx+2].startswith('CASP3') and lines[inx+3].startswith('CASP3'):
        cnt += 1
print(cnt)
My solution requires that there are at least 3 lines after last line starting with >.
Without reading the whole file into memory:
def startswith_casp(iterator):
    # grab the next three lines after the header: the sequence line
    # plus the two lines of interest
    chunk = [line for _, line in zip(range(3), iterator)]
    # require a full chunk, because all() on an empty slice would return True
    return len(chunk) == 3 and all(c.startswith('CASP3') for c in chunk[1:3])

with open('yourfile.txt') as fh:
    count = 0
    for line in fh:
        if not line.strip():
            continue
        elif line.startswith('>'):
            # Function returns a boolean, so True will add 1 while False adds 0
            count += startswith_casp(fh)
        else:
            continue
print(count)
Here I am splitting by \n\n to get the "buckets", then by \n to get the lines within each bucket, then checking the 3rd ([2]) and 4th ([3]) line in each bucket for the pattern:
with open('genes.txt') as file:
    data = file.read()
by_bucket = [i.split('\n') for i in data.split('\n\n')]
count = 0
for bucket in by_bucket:
    count += (bucket[2].startswith('CASP3') and bucket[3].startswith('CASP3'))
print(count)
It would probably help if you were to explain what you already tried. That being said here's a general approach:
Use the .split() method to split at any character(s) you want. This results in you getting a list with each entry being one bucket.
Loop over this list with for VariableName in ExampleList: to check each bucket on their own.
You can optionally check if the first entry is a > or if you did the splitting correctly you may not need to.
Separate each bucket into another list where each entry is one line by using bucket.splitlines().
Then check if the first characters of the 3rd and 4th entry in this list are CASP3 by checking if string[2][:5]=="CASP3" (for the third line) and string[3][:5]=="CASP3" (for the fourth line) are true.
Add a counter to the function that is increased by 1 whenever one bucket is valid.
Return this counter.
If you have additional questions, feel free to ask.
Here's an example that takes a string and returns the value you need:
def getValue(string):
    counter=0
    splitList=string.split("\n\n")
    for bucket in splitList:
        bucket=bucket.splitlines()
        if bucket[2][:5]=="CASP3" and bucket[3][:5]=="CASP3":
            counter+=1
    return counter
Note that this function relies on the buckets being separated by an empty line, but you can change that to separate on any other character(s).
My solution reads file.txt into a dictionary of text sections (where a section spans from one greater-than symbol, i.e. '>', to the next), which then allows you to easily perform comparisons.
file_path = './file.txt'
keyword="CASP3"
section_ID = 0
count = 0
all_sections = {}
with open(file_path,'r') as f:
    for line in f:
        if line.startswith(">"):
            if line not in all_sections:
                section_ID += 1
                all_sections[section_ID] = {}
                all_sections[section_ID]['entries'] = []
        all_sections[section_ID]['entries'].append(line)
for sec_id in all_sections:
    if all_sections[sec_id]['entries'][2].startswith(keyword) and all_sections[sec_id]['entries'][3].startswith(keyword):
        count+=1
print('count : ', count)
output using your file would be :
count : 3
Python noob here. I've been smashing my head trying to do this, tried several Unix tools and I'm convinced that python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that id number is unique, but the number assigned to it may repeat. I have several files like this all with the same number of headers (~500).
File2 has all numbers of File1 and an appended sequence
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that sequence id is unique, as all sequences after it. I have the same amount of files as File1 but the number of sequences within each File2 may vary(~150).
My desired output is the File1 with the sequence from File2, it is important that File1 maintains original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract numbers from File1 and use them as a pattern to match in File2. First I am trying to make this work with only a pair of files. here is what I achieved:
#!/usr/bin/env python
import re
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f]) #maybe I could use regex to get only lines with number as pattern?
print (datafile_lines)
outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)
print (outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
PS: I am open to modifications in the file structure. I tried substituting \n in File2 with ",", to no avail.
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            # even lines (0, 2, ...) hold the ">id" header
            d[line.strip()]=0
            prev = line.strip()
        else:
            # odd lines hold the number assigned to that header
            d[prev] = line.strip()
        i+=1
new_d = {}
with open(schemaseqs, 'r') as f:
    i=0
    prev = None
    for line in f:
        if i % 2 == 0:
            # even lines hold the number
            new_d[line.strip()]=0
            prev = line.strip()
        else:
            # odd lines hold the sequence for that number
            new_d[prev] = line.strip()
        i+=1
# replace each header's number with the matching sequence
for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]
print(d)
# note: this overwrites the original datafile
with open(datafile,'w') as filee:
    for k,v in d.items():
        filee.write(f"{k}\n{v}\n")
Creating two dictionaries and then mapping the values of one onto the other is an easy way to do this.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
id_to_gene_map = {}
with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1: # Update ID on odd numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line # Map previous line's ID to this line's value
with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"): # Keep ">id1" lines unchanged
            line = id_to_gene_map[line] # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
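If stripping is preferred after all (for example because the last line of a file may lack a trailing newline, which would make its key differ), a variant could look like the sketch below. The function name attach_sequences is an assumption, it works on lists of lines rather than open files, and leaving unknown IDs unchanged via dict.get is a design choice made here, not something the question specifies.

```python
def attach_sequences(header_lines, sequence_lines):
    """Replace each number in File1-style lines with its File2 sequence."""
    # File2 layout: ID on one line, sequence on the next; blank lines are ignored
    stripped = [line.strip() for line in sequence_lines if line.strip()]
    id_to_seq = dict(zip(stripped[0::2], stripped[1::2]))

    output = []
    for line in header_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            output.append(line)  # keep ">id1" style headers unchanged
        else:
            # Assumption: IDs missing from File2 are passed through as-is
            output.append(id_to_seq.get(line, line))
    return output

file1_lines = [">id1\n", "77\n", ">id2\n", "2\n", ">id3\n", "2\n", ">id4\n", "22"]
file2_lines = ["1\n", "ATCGTCATA\n", "2\n", "ATCGTCGTA\n",
               "22\n", "CCCGTCGTA\n", "77\n", "ATCGTCATA\n"]
merged = attach_sequences(file1_lines, file2_lines)
print("\n".join(merged))
```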
I'm learning Python, but I still have some problems with my scripts.
I have a file similar to:
1 5
2 5
3 5
4 2
5 1
6 7
7 7
8 8
I want to find the pairs of numbers 2-1 on consecutive lines, looking only at column 2, and then print columns 1 and 2 of the matching lines. The result should look like this:
4 2
5 1
I'm trying to do it with Python because my file has 4,000,000 entries. This is my script:
import linecache
final_lines = []
with open("file.dat") as f:
    for i, line in enumerate(f, 1):
        if "1" in line:
            if "2" in linecache.getline("file.dat", i-1):
                linestart = i - 1
                final_lines.append(linecache.getline("file.dat", linestart))
print(final_lines)
and the result is:
['2\n', '2\n', '2\n']
What must I change in my script to get the result that I want? Can you guide me, please? Thanks a lot.
Use a for loop with enumerate and an if statement to test each line; when the condition is true, append the two lines to the list final_lines:
final_lines = []
with open('file.dat') as f:
    lines = f.readlines()
# Stop one line early so that lines[i+1] always exists
for i,line in enumerate(lines[:-1]):
    if line.split()[1] == '2' and lines[i+1].split()[1] == '1':
        final_lines.extend([line,lines[i+1]])
And now:
print(final_lines)
would print your desired list.
This should work, I think:
import re
with open("info.dat") as f:
    for match in re.findall(r"\d+ 2[\s\n]*\d+ 1",f.read()):
        print(match)
See also: https://repl.it/repls/TatteredViciousResources
Another alternative is:
with open("info.dat") as f:
    lines = f.readlines()
for line,nextline in zip(lines,lines[1:]):
    if line.strip().endswith("2") and nextline.strip().endswith("1"):
        print(line+nextline)
You are a beginner at Python, which is great, so I am going to take a more elementary approach. It's a huge file, so you are better off reading one line at a time and keeping only that line; but you actually need two lines to identify the pattern, so keep two. Consider the following:
fp = open('file.dat')
last_line = fp.readline()
next_line = fp.readline()
while next_line:
    # logic to split the lines into a pair
    # of numbers and check to see if the
    # 2 and 1 end last_line and next_line
    # and outputting
    last_line = next_line
    next_line = fp.readline()
fp.close()
This follows good, readable software patterns, and requires a minimum of resources.
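One way to fill in the placeholder logic of this two-line window is sketched below. The function name find_pairs and its parameters are assumptions for illustration, and io.StringIO stands in for the real file so the example runs on its own; passing an open file object should work the same way.

```python
import io

def find_pairs(lines, first="2", second="1"):
    """Return consecutive line pairs whose second columns are `first` then `second`."""
    pairs = []
    previous = None  # fields of the previous line, or None if there is none yet
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Blank or malformed line: it breaks the "consecutive" relationship
            previous = None
            continue
        if previous is not None and previous[1] == first and fields[1] == second:
            pairs.append((previous, fields))
        previous = fields
    return pairs

# Sample data copied from the question
sample = io.StringIO("1 5\n2 5\n3 5\n4 2\n5 1\n6 7\n7 7\n8 8\n")
found = find_pairs(sample)
for before, after in found:
    print(" ".join(before))
    print(" ".join(after))
```

Only one previous line is kept in memory at any time, so the approach does not depend on the size of the file.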