replace lines in a larger file with ID - python

Hello everyone, I have a problem replacing lines that have the same content with the same ID, e.g.:
ONE -----------> 1
TWO -----------> 2
THREE-----------> 3
HELLO-----------> 4
SEVEN-----------> 5
ONE-----------> 1
ONE-----------> 1
ONE-----------> 1
TWO-----------> 2
I have worked on this code below but with no results:
NOTE: filein and file2 have same value of the defined example.
# opening the file in read mode
file = open("filein.txt", "r")
# opening the file in read and write mode
file2 = open("filein2.txt", "r+")
replacement = ""
count=1
# using the for loop
for line in file:
    for line2 in file2:
        line = line.strip()
        if line == line2 :
            changes = line.replace(line, str(count))
            replacement = replacement + changes + "\n"
    file2.seek(0)
    file2.write(replacement)
    count=count+1
file.close()
filein and filein2 contain same value
ONE
TWO
THREE
HELLO
SEVEN
ONE
ONE
ONE
TWO

As I understand it, this is what you want: compare two files line by line. If the corresponding lines are equal, assign an ID to them; if the same line appeared earlier in the file, reuse its ID, otherwise assign a new one. If the lines differ, keep the contents of both. Finally, write either the ID or the line contents to a new file:
index_dct = dict()
id_ = 1
with open('text.txt') as f1, open('text1.txt') as f2, open('result.txt', 'w') as result:
    for line1, line2 in zip(f1, f2):
        line1, line2 = line1.strip(), line2.strip()
        if line1 == line2:
            text = index_dct.get(line1)
            if text is None:
                text = index_dct[line1] = id_
                id_ += 1
        else:
            text = f'{line1} {line2}'
        result.write(f'{text}\n')
A quick overview of how this works:
First, a dictionary stores each value and its corresponding ID, so that a repeated value gets the same ID.
Then a context manager (with) opens three files: the two inputs and the output.
Next, zip iterates over the two input files in parallel and compares the lines. If they match, look up the ID for that value; if the value is not yet in the dictionary, store it as a key with the current ID as its value, then increase the ID by one.
If the lines don't match, just concatenate them.
Finally, write the resulting value to the third file.
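The ID assignment on its own can be sketched in memory, without the file handling. This is a minimal sketch, assuming the input is already a list of stripped lines; the function name assign_ids is hypothetical and not part of the answer above.

```python
def assign_ids(lines):
    """Return one ID per line; equal lines share the ID given at first sight."""
    ids_by_value = {}
    result = []
    for line in lines:
        # len(ids_by_value) + 1 is the next unused ID, so IDs start at 1;
        # setdefault only stores it when the line has not been seen before
        line_id = ids_by_value.setdefault(line, len(ids_by_value) + 1)
        result.append(line_id)
    return result


sample = ["ONE", "TWO", "THREE", "HELLO", "SEVEN", "ONE", "ONE", "ONE", "TWO"]
print(assign_ids(sample))  # [1, 2, 3, 4, 5, 1, 1, 1, 2]
```

With the sample from the question this gives the numbering shown there: every ONE maps to 1 and every TWO to 2.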

If you're trying to make each unique word have a unique ID, you could use a dictionary:
inputText = "ONE TWO THREE HELLO SEVEN ONE ONE ONE TWO"
indexDictionary = {}
count = 1
outList = []
for word in inputText.split(" "):
    if word not in indexDictionary.keys():
        indexDictionary[word] = count
        count += 1
    outList.append(indexDictionary[word])
print(outList)
print(indexDictionary)

Related

How can I edit several numbers/words in a txt file using python?

I want to rewrite an existing file with things like:
Tom A
Mike B
Jim C
to
Tom 1
Mike 2
Jim 3
The letters A, B, C can also be something else. Basically, I want to keep the spaces between the names and what comes after them, but change what comes after to numbers. Does someone have an idea, please? Thanks a lot for your help.
I assume your first and second columns are separated by a tab (i.e. \t)?
If so, you can do this by reading the file into a list, use the split function to split each line of the file into components, edit the second component of each line, concatenate the two components back together with a tab separator and finally rewrite to a file.
For example, if test.txt is your input file:
# Create list that holds the desired output
output = [1,2,3]
# Open the file to be overwritten
with open('test.txt', 'r') as f:
    # Read file into a list of strings (one string per line)
    text = f.readlines()
# Open the file for writing (FYI this CLEARS the file as we specify 'w')
with open('test.txt', 'w') as f:
    # Loop over lines (i.e. elements) in `text`
    for i,item in enumerate(text):
        # Split line into elements based on whitespace (default for `split`)
        line = item.split()
        # Concatenate the name and desired output with a tab separator and write to the file
        f.write("%s\t%s\n" % (line[0],output[i]))
I assumed your first and second columns were separated by spaces in the file.
You can read the file contents into a list and use the function replace_end(line,newline), which replaces the end of the line with what you pass in. Then you can just write the changed list back to the file.
""" rewrite a exisiting file """
def main():
    """ main """
    filename = "update_me.txt"
    count = 0
    lst = []
    with open(filename, "r",encoding = "utf-8") as filestream:
        _lines = filestream.readlines()
        for line in _lines:
            lst.insert(count,line.strip())
            count += 1
            #print(f"Line {count} {line.strip()}")
    count = 0
    # change the list
    for line in lst:
        lst[count] = replace_end(line,"ABC")
        count +=1
    count = 0
    with open(filename, "w", encoding = "utf-8") as filestream:
        for line in lst:
            filestream.write(line+"\n")
            count +=1
def replace_end(line,newline):
    """ replace the end of a line """
    return line[:-len(newline)] + newline
if __name__ == '__main__':
    main()
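Since the question asks to keep the original spacing between the name and what follows, one option is to capture the whitespace run and reuse it. The following is only a sketch, assuming each line has exactly a name and one label; the function name renumber_second_column is made up for illustration.

```python
import re


def renumber_second_column(lines, start=1):
    """Replace the second column with a running number, keeping the spacing."""
    out = []
    for number, line in enumerate(lines, start=start):
        # group 1: the name, group 2: the original whitespace run;
        # the old label after it is dropped
        match = re.match(r"(\S+)(\s+)\S+", line)
        if match:
            out.append(f"{match.group(1)}{match.group(2)}{number}")
        else:
            # lines without a second column are kept unchanged
            out.append(line.rstrip("\n"))
    return out


print(renumber_second_column(["Tom   A\n", "Mike\tB\n", "Jim C\n"]))
# ['Tom   1', 'Mike\t2', 'Jim 3']
```

Whether the separator is a tab or several spaces, it is written back exactly as it was read.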

If the first 3 characters are the same, delete the line in a text file?

Goal: Open the text file. Check whether the first 3 characters of each line are the same in subsequent lines. If yes, delete the bottom one.
The contents of the text file:
cat1
dog4
cat3
fish
dog8
Desired output:
cat1
dog4
fish
Attempt at code:
line = open("text.txt", "r")
for num in line.readlines():
    a = line[num][0:3] #getting first 3 characters
    for num2 in line.readlines():
        b = line[num2][0:3]
        if a in b:
            line[num2] = ""
Open the file and read one line at a time. Note the first 3 characters (prefix). Check if the prefix has been previously observed. If not, keep that line and add the prefix to a set. For example:
with open('text.txt') as infile:
    out_lines = []
    prefixes = set()
    for line in map(str.strip, infile):
        if (prefix := line[:3]) not in prefixes:
            out_lines.append(line)
            prefixes.add(prefix)
print(out_lines)
Output:
['cat1', 'dog4', 'fish']
Note:
Requires Python 3.8+
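For interpreters older than 3.8, the same idea works without the := operator. A minimal sketch, assuming the lines are passed in as a list; drop_repeated_prefixes and its width parameter are hypothetical names:

```python
def drop_repeated_prefixes(lines, width=3):
    """Keep only the first line for each distinct prefix of `width` characters."""
    seen = set()
    kept = []
    for line in lines:
        line = line.strip()
        prefix = line[:width]
        if prefix not in seen:
            # first occurrence of this prefix: remember it and keep the line
            seen.add(prefix)
            kept.append(line)
    return kept


print(drop_repeated_prefixes(["cat1", "dog4", "cat3", "fish", "dog8"]))
# ['cat1', 'dog4', 'fish']
```

To actually delete the lines from the file, the returned list could be written back to text.txt, one entry per line.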
You can use a dictionary to store the first 3 characters and check it while reading. Sample code below:
first_three_char_dict = {}
with open("text.txt", "r") as f:
    for line in f:
        line = line.strip()
        a = line[0:3]  # getting first 3 characters
        if a not in first_three_char_dict:
            # first time this prefix is seen: remember the whole line
            first_three_char_dict[a] = line
print(list(first_three_char_dict.values()))
Read each line and add it to a dict. The key is the first 3 characters of the line and the value is the line itself. At the end, the dict keys are unique and the values are your desired result.
You just need to check whether the data already exists in a temporary list:
seen = []
result = []
with open("text.txt", "r") as f:
    for line in f:
        line = line.strip()
        data = line[0:3]  # getting first 3 characters
        if data not in seen:  # check whether this prefix is already in the list
            seen.append(data)  # if not, remember the prefix
            result.append(line)  # and keep the line
print(result)

Iterate over pair lines and count using python

I have a text file that looks like:
>MN00153:75:000H37WNG:1:11102:13823:1502
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1504
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1506
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
CASP3_fw2_rc : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
>MN00153:75:000H37WNG:1:11102:13823:1508
CCTGCGTTGAAGTGGCTTACTTGCACCTTATGCTACCGTGACCTGCGAATCCAGTCTCATCGTGACCATTCAGGACCAGTGGCAAGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCGCTCATTATCTCGTATGCCGTCTTCTGCTT
CASP3_fw1_rc : CCATTCAGGACCAGTGGCAAG - The position is 66
EIF2_fw2 : CCTGCGTTGAAGTGGCTTACT - The position is 1
Distance is 44
I am interested in lines 3 and 4 of each bucket (each bucket starts with '>'). I want to add 1 to the count if lines 3 and 4 both start with CASP3 (regardless of what comes afterward), so the output should be
3
Because the first, second, and third buckets have a CASP3 pair in lines 3 and 4; the last one does not.
Thanks
If your file is not too large, you can use .readlines() to get a list of lines:
with open('filename.txt', 'r') as f:
    lines = f.readlines()
then use enumerate and str methods like this:
cnt = 0
for inx, line in enumerate(lines):
    if line.startswith('>') and lines[inx+2].startswith('CASP3') and lines[inx+3].startswith('CASP3'):
        cnt += 1
print(cnt)
My solution requires at least 3 lines after the last line starting with >.
Without reading the whole file into memory:
def startswith_casp(iterator):
    # grab the next three lines of your file (sequence line plus the two lines to check)
    chunk = [line for _, line in zip(range(3), iterator)]
    # use a slice here to avoid index errors; the length check rejects a truncated bucket
    return len(chunk) == 3 and all(c.startswith('CASP3') for c in chunk[1:3])
with open('yourfile.txt') as fh:
    count = 0
    for line in fh:
        if not line.strip():
            continue
        elif line.startswith('>'):
            # Function returns a boolean, so True will add 1 while False adds 0
            count += startswith_casp(fh)
        else:
            continue
print(count)
Here I split by \n\n to get the "buckets", then by \n to get the lines within each bucket, then check the 3rd ([2]) and 4th ([3]) line in each bucket for the pattern:
with open('genes.txt') as file:
    data = file.read()
by_bucket = [i.split('\n') for i in data.split('\n\n')]
count = 0
for bucket in by_bucket:
    count += (bucket[2].startswith('CASP3') and bucket[3].startswith('CASP3'))
print(count)
It would probably help if you explained what you have already tried. That being said, here's a general approach:
Use the .split() method to split at any character(s) you want. This gives you a list in which each entry is one bucket.
Loop over this list with for VariableName in ExampleList: to check each bucket on its own.
You can optionally check whether the first entry starts with >; if you did the splitting correctly, you may not need to.
Separate each bucket into another list where each entry is one line by using bucket.splitlines().
Then check whether the 3rd and 4th entries in this list start with CASP3, i.e. whether bucket[2][:5]=="CASP3" (for the third line) and bucket[3][:5]=="CASP3" (for the fourth line) are both true.
Add a counter to the function that is increased by 1 whenever a bucket is valid.
Return this counter.
If you have additional questions, feel free to ask.
Here's an example that takes a string and returns the value you need:
def getValue(string):
    counter=0
    splitList=string.split("\n\n")
    for bucket in splitList:
        bucket=bucket.splitlines()
        if bucket[2][:5]=="CASP3" and bucket[3][:5]=="CASP3":
            counter+=1
    return counter
Note that this function relies on the buckets being separated by an empty line, but you can change that to separate on any other character(s).
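If the buckets are not reliably separated by an empty line, grouping on the '>' header lines is one alternative. This is a rough sketch, assuming every bucket starts with a '>' line; count_keyword_buckets is a hypothetical name, and buckets with fewer than four lines are simply not counted.

```python
def count_keyword_buckets(lines, keyword="CASP3"):
    """Count buckets whose 3rd and 4th lines both start with `keyword`."""
    buckets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank separator lines are ignored
        if line.startswith(">"):
            buckets.append([line])  # a header opens a new bucket
        elif buckets:
            buckets[-1].append(line)
    return sum(
        1
        for bucket in buckets
        if len(bucket) >= 4
        and bucket[2].startswith(keyword)
        and bucket[3].startswith(keyword)
    )


sample = [
    ">read1", "ACGT", "CASP3_fw1_rc : A", "CASP3_fw2 : B", "Distance is 44",
    "",
    ">read2", "ACGT", "CASP3_fw1_rc : A", "EIF2_fw2 : B", "Distance is 44",
]
print(count_keyword_buckets(sample))  # 1
```

Because it accepts any iterable of lines, an open file object can be passed in directly instead of a list.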
My solution reads file.txt into a dictionary of text sections (a section spans the lines between two greater-than symbols, i.e. '>'), which then lets you perform the comparisons easily.
file_path = './file.txt'
keyword="CASP3"
section_ID = 0
count = 0
all_sections = {}
with open(file_path,'r') as f:
    for line in f:
        if line.startswith(">"):
            if line not in all_sections:
                section_ID += 1
                all_sections[section_ID] = {}
                all_sections[section_ID]['entries'] = []
        all_sections[section_ID]['entries'].append(line)
for sec_id in all_sections:
    if all_sections[sec_id]['entries'][2].startswith(keyword) and all_sections[sec_id]['entries'][3].startswith(keyword):
        count+=1
print('count :', count)
Output using your file would be:
count : 3

Use a file to search another file and print lines matching a pattern to first file

Python noob here. I've been smashing my head trying to do this, tried several Unix tools and I'm convinced that python is the way to go.
I have two files, File1 has headers and numbers like this:
>id1
77
>id2
2
>id3
2
>id4
22
...
Note that id number is unique, but the number assigned to it may repeat. I have several files like this all with the same number of headers (~500).
File2 has all numbers of File1 and an appended sequence
1
ATCGTCATA
2
ATCGTCGTA
...
22
CCCGTCGTA
...
77
ATCGTCATA
...
Note that sequence id is unique, as all sequences after it. I have the same amount of files as File1 but the number of sequences within each File2 may vary(~150).
My desired output is the File1 with the sequence from File2, it is important that File1 maintains original order.
>id1
ATCGTCATA
>id2
ATCGTCGTA
>id3
ATCGTCGTA
>id4
CCCGTCGTA
My approach is to extract numbers from File1 and use them as a pattern to match in File2. First I am trying to make this work with only a pair of files. here is what I achieved:
#!/usr/bin/env python
import re
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
with open(datafile, 'r') as f:
    datafile_lines = set([line.strip() for line in f]) #maybe I could use regex to get only lines with number as pattern?
print (datafile_lines)
outputlist = []
with open(schemaseqs, 'r') as f:
    for line in f:
        seqs = line.split(',')[0]
        if seqs[1:-1] in datafile_lines:
            outputlist.append(line)
print (outputlist)
This outputs a mix of patterns from File1 and the sequences from File2. Any help is appreciated.
PS: I am open to modifying the file structure; I tried substituting "," for \n in File2, to no avail.
import re
datafile = 'protein2683.fasta.txt.named'
schemaseqs = 'protein2683.fasta'
datafile_lines = []
d = {}
prev = None
with open(datafile, 'r') as f:
    i = 0
    for line in f:
        if i % 2 == 0:
            d[line.strip()]=0
            prev = line.strip()
        else:
            d[prev] = line.strip()
        i+=1
new_d = {}
with open(schemaseqs, 'r') as f:
    i=0
    prev = None
    for line in f:
        if i % 2 == 0:
            new_d[line.strip()]=0
            prev = line.strip()
        else:
            new_d[prev] = line.strip()
        i+=1
for key, value in d.items():
    if value in new_d:
        d[key] = new_d[value]
print(d)
with open(datafile,'w') as filee:
    for k,v in d.items():
        filee.writelines(k)
        filee.writelines('\n')
        filee.writelines(v)
        filee.writelines('\n')
Creating two dictionaries and then mapping the values of one onto the other is an easy way to do this.
Since the files are so neatly organized, I wouldn't use a set to store the lines. Sets don't enforce order, and the order of these lines conveys a lot of information. I also wouldn't use Regex; it's probably overkill for the task of parsing individual lines, but not powerful enough to keep track of which ID corresponds to each gene sequence.
Instead, I would read the files in the opposite order. First, read the file with the gene sequences and build a mapping of IDs to genes. Then read in the first file and replace each id with the corresponding value in that mapping.
If the IDs are a continuous sequence (1, 2, 3... n, n+1), then a list is probably the easiest way to store them. If the file is already in order, you don't even have to pay attention to the ID numbers; you can just skip every other row and append each gene sequence to an array in order. If they aren't continuous, you can use a dictionary with the IDs as keys. I'll use the dictionary approach for this example:
file1 = 'File1'  # placeholder path: the file with ">id" headers and numbers
file2 = 'File2'  # placeholder path: the file with numbers and sequences
id_to_gene_map = {}
with open(file2, 'r') as id_to_gene_file:
    for line_number, line in enumerate(id_to_gene_file, start=1):
        if line_number % 2 == 1: # Update ID on odd numbered lines, including line 1
            current_id = line
        else:
            id_to_gene_map[current_id] = line # Map previous line's ID to this line's value
with open(file1, 'r') as input_file, open('output.txt', 'w') as output_file:
    for line in input_file:
        if not line.startswith(">"): # Keep ">id1" lines unchanged
            line = id_to_gene_map[line] # Otherwise, replace with the corresponding gene
        output_file.write(line)
In this case, the IDs and values both have trailing newlines. You can strip them out, but since you'll want to add them back in for writing the output file, it's probably easiest to leave them alone.
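If a missing final newline in either file is a concern, stripping the lines is the safer variant. Below is a minimal in-memory sketch of the same mapping, assuming both inputs strictly alternate ID line and value line; attach_sequences is a hypothetical name.

```python
def attach_sequences(header_lines, sequence_lines):
    """Replace each number in File1-style lines with its File2 sequence."""
    stripped = (line.strip() for line in sequence_lines)
    # zip pulls two consecutive items from the same iterator: (ID, sequence)
    id_to_seq = dict(zip(stripped, stripped))
    out = []
    for line in header_lines:
        line = line.strip()
        if line.startswith(">"):
            out.append(line)  # keep ">id1" lines unchanged
        else:
            out.append(id_to_seq[line])  # raises KeyError for an unknown ID
    return out


file1_lines = [">id1", "77", ">id2", "2", ">id3", "2", ">id4", "22"]
file2_lines = ["1", "ATCGTCATA", "2", "ATCGTCGTA",
               "22", "CCCGTCGTA", "77", "ATCGTCATA"]
print(attach_sequences(file1_lines, file2_lines))
```

The order of File1 is preserved because its lines are processed one by one; only File2 is turned into a lookup table.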

comparing parts of lines in two tsv files in python

So I want to sum/analyse values pertaining to a given line in one file which match another file.
The format of the first file I wish to compare against is:
Acetobacter cibinongensis Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
Acetobacter ghanensis Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
Acetobacter pasteurianus Acetobacter Acetobacteraceae Rhodospirillales Proteobacteria Bacteria
And the second file is like:
Blochmannia endosymbiont of Polyrhachis (Hedomyrma) turneri Candidatus Blochmannia Enterobacteriaceae Enterobacteriales Proteobacteria Bacteria 1990 7.511 14946.9
Blochmannia endosymbiont of Polyrhachis (Hedomyrma) turneri Candidatus Blochmannia Enterobacteriaceae Enterobacteriales Proteobacteria Bacteria 2061 6.451 13295.5
Calyptogena okutanii thioautotrophic gill symbiont Proteobacteria-undef Proteobacteria-undef Proteobacteria-undef Proteobacteria Bacteria 7121 2.466 17560.4
What I want to do is parse every line in the first file, and for every line in the second file where the first 6 fields match, perform analysis on the numbers in the 3 fields following the species info.
My code is as follows:
with open('file1', 'r') as file1:
    with open('file2', 'r') as file2:
        for line in file1:
            count = 0
            line = line.split("\t")
            for l in file2:
                l = l.split("\t")
                if l[0:6] == line[0:6]:
                    count+=1
            count = str(count)
            print line + '\t' + count +'\t'+'\n'
Which I'm hoping will give me the line from the first file and the number of times that species was found in the second file.
I know there's probably a better way of doing THIS particular part of the analysis but I wanted to give a simple example of the objective..
Anyway, I don't get any matches, i.e. I never see an instance where
l[0:6] == line[0:6]
is True.
Any ideas?? :-S
The root cause is that file2 is exhausted during the first iteration of the outer loop, so every later pass iterates over nothing.
Quick fix: read file2 fully into a list. However, this is rather slow (O(N^2): double loop). A dictionary keyed on a tuple of the first 6 values would be better.
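with open('file2', 'r') as f:
    file2 = list(f)
with open('file1', 'r') as file1:
    for line in file1:
        count = 0
        # strip the newline, otherwise the 6th field ("Bacteria\n") never matches
        line = line.rstrip("\n").split("\t")
        for l in file2:
            l = l.rstrip("\n").split("\t")
            if l[0:6] == line[0:6]:
                count+=1
        # line is a list here, so join it before concatenating
        print('\t'.join(line) + '\t' + str(count))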
Also, using the csv module configured with TAB as the separator would spare you some surprises in the future.
Better version, using a dictionary for faster access on data of file2 (the first 6 elements are the key, note that we cannot use a list as key since it's mutable but we have to convert it to a tuple):
d = dict()
# create the dictionary from file2
with open('file2', 'r') as file2:
    for l in file2:
        fields = l.rstrip("\n").split("\t")
        d[tuple(fields[0:6])] = fields[6:]
# iterate through file1, and use dict lookup on data of file2
# much, much faster if file2 contains a lot of data
with open('file1', 'r') as file1:
    for line in file1:
        count = 0
        # strip the newline so the 6th field compares equal to file2's
        line = line.rstrip("\n").split("\t")
        if tuple(line[0:6]) in d: # check if in dictionary
            count+=1
            # we could extract the extra data by accessing
            # d[tuple(line[0:6])]
        # line is a list, so join it before concatenating
        print('\t'.join(line) + '\t' + str(count))
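Note that a plain dictionary keeps only the last file2 row per key, so the count above is either 0 or 1. If the number of matching rows is what matters, collections.Counter combined with the csv module is one way to get it. This is a sketch under the assumption that both files are tab-separated; count_matches and key_width are hypothetical names, and io.StringIO stands in for the real files.

```python
import csv
import io
from collections import Counter


def count_matches(file1, file2, key_width=6):
    """For each file1 row, count file2 rows sharing the first `key_width` fields."""
    counts = Counter(
        tuple(row[:key_width])
        for row in csv.reader(file2, delimiter="\t")
        if row
    )
    results = []
    for row in csv.reader(file1, delimiter="\t"):
        if row:
            key = tuple(row[:key_width])
            # a Counter returns 0 for keys it has never seen
            results.append((key, counts[key]))
    return results


file2_text = (
    "a\tb\tc\td\te\tf\t1990\t7.511\t14946.9\n"
    "a\tb\tc\td\te\tf\t2061\t6.451\t13295.5\n"
    "x\tb\tc\td\te\tf\t7121\t2.466\t17560.4\n"
)
file1_text = "a\tb\tc\td\te\tf\nq\tb\tc\td\te\tf\n"
for key, n in count_matches(io.StringIO(file1_text), io.StringIO(file2_text)):
    print("\t".join(key), n, sep="\t")
```

With real files, opening them with newline='' is what the csv documentation recommends; the csv reader also takes care of the trailing newline on the last field.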
