Remove lines from a file that contain strings from a list - python

I want to remove lines from a .txt file.
I want to make a list of the strings that mark a line for removal, but my code writes each line as many times
as there are strings in the list. How can I avoid that?
file1 = open("base.txt", encoding="utf-8", errors="ignore")
Lines = file1.readlines()
file1.close()
not_needed = ['asd', '123', 'xyz']
row = 0
result = open("result.txt", "w", encoding="utf-8")
for line in Lines:
    for item in not_needed:
        if item not in line:
            row += 1
            result.write(str(row) + ": " + line)
So if a line contains any string from the list, it should be deleted.
The result file should contain the original file without those lines.
How can I do it?

Look at the logic in your for loop. What it's doing is: take each line in Lines, then for every item in not_needed check the line and write it if the condition holds. But the condition holds once for every item that is not found, so the line is written once per missing item.
Try thinking about doing the inverse:
check whether the line contains any of the strings in not_needed.
If it does, do nothing.
Otherwise, write it.

Expanded answer:
Here's what I think you are looking for:
for line in Lines:
    if not any(item in line for item in not_needed):
        row += 1
        result.write(str(row) + ": " + line)
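The inverse check described in this answer can be sketched with any(): a line is kept only when none of the unwanted strings occurs in it, so each surviving line is written exactly once. The function name filter_lines and the sample lines are made up for illustration; the unwanted strings are the ones from the question.

```python
def filter_lines(lines, not_needed):
    """Return numbered lines that contain none of the strings in not_needed."""
    kept = []
    row = 0
    for line in lines:
        # any() is True as soon as one unwanted string occurs in the line
        if any(item in line for item in not_needed):
            continue
        row += 1
        kept.append(str(row) + ": " + line)
    return kept


if __name__ == "__main__":
    # Hypothetical sample input; the real data would come from base.txt
    sample = ["keep me\n", "has asd inside\n", "also fine\n", "xyz here\n"]
    print("".join(filter_lines(sample, ["asd", "123", "xyz"])), end="")
```

Returning a list instead of writing straight to result.txt keeps the filtering separate from the file handling, which makes it easier to check on a few sample lines first.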

Related

Using Python rjust(8) does not seem to work on last item in list

I have a text file containing comma-separated values, which I read and write out again reformatted.
102391,-55.5463,-6.50719,-163.255,2.20855,-2.63099,-7.86673
102392,11.224,-8.15971,15.5387,-11.512,-3.89007,-28.6367
102393,20.5277,-62.3261,-40.9294,-45.5899,-53.222,-1.77512
102394,188.113,19.2829,137.284,14.0548,4.47098,-50.8091
102397,-24.5383,-3.46016,1.74639,2.52063,3.31528,16.2535
102398,-107.719,-102.548,52.1627,-78.4543,-65.2494,-97.8143
I read it using this code:
with open(outfile , 'w') as fout:
    with open(infile) as file:
        for line in file:
            linelist = line.split(",")
            fout.write(" ELEM " + '{:>8}'.format(str(linelist[0]) + "\n"))
            if len(linelist) == 7:
                fout.write(" VALUE " + str(linelist[1][:8]).rjust(8) + str(linelist[2][:8]).rjust(8) + str(linelist[3][:8]).rjust(8) + str(linelist[4][:8]).rjust(8) + str(linelist[5][:8]).rjust(8) + str(linelist[6][:8]).rjust(8) )
                fout.write("\n")
And get this output:
ELEM 102391
VALUE -55.5463-6.50719-163.255 2.20855-2.63099-7.86673
ELEM 102392
VALUE 11.224-8.15971 15.5387 -11.512-3.89007-28.6367
ELEM 102393
VALUE 20.5277-62.3261-40.9294-45.5899 -53.222-1.77512
ELEM 102394
VALUE 188.113 19.2829 137.284 14.0548 4.47098-50.8091
ELEM 102397
VALUE -24.5383-3.46016 1.74639 2.52063 3.3152816.2535
ELEM 102398
VALUE -107.719-102.548 52.1627-78.4543-65.2494-97.8143
Everything is fine except: why do I sometimes get an extra blank line, and why is the last number before the blank line (16.2535) not right-adjusted? These two issues are surely related, but I cannot figure out what is going on.
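It behaves as if the last element of the fifth line of your input contains a newline character at its end.
Can you check the content of linelist[6] for the fifth line of your input? I guess you would find something like '16.2535\n'. Every line read from the file ends with that newline; on the other rows the [:8] slice happens to cut it off because the last number is already eight characters long, whereas '16.2535' has only seven.
Hence, to make sure that your fields do not include a trailing newline, you can call the string method .strip() on the line before splitting it.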
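A minimal sketch of that fix, assuming the same seven-column layout as in the question: the line is stripped before it is split, so the last field no longer carries a newline into rjust(). The helper name format_record is hypothetical, and the leading spaces in the ELEM and VALUE labels are copied from the question's code.

```python
def format_record(line):
    """Format one comma-separated input line as an ELEM row and a VALUE row."""
    # strip() removes the trailing newline before the fields are split,
    # so the last field cannot push a '\n' into the padded output
    fields = line.strip().split(",")
    elem_row = " ELEM " + fields[0].rjust(8)
    # truncate each value to 8 characters, then right-align it in 8 columns
    value_row = " VALUE " + "".join(f[:8].rjust(8) for f in fields[1:])
    return elem_row, value_row


if __name__ == "__main__":
    # Fifth data line from the question, including its newline
    raw = "102397,-24.5383,-3.46016,1.74639,2.52063,3.31528,16.2535\n"
    for row in format_record(raw):
        print(row)
```

Writing the newline explicitly when printing or calling fout.write() keeps line endings under your control instead of depending on what the input line happened to contain.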

How to select the last character of a header in a fasta file?

I have a fasta file like this:
>XP1987651-apple1
ACCTTCCAAGTAG
>XP1235689-lemon2
TTGGAGTCCTGAG
>XP1254115-pear1
ATGCCGTAGTCAA
I would like to create a file containing only the records whose header ends with '1', for example:
>XP1987651-apple1
ACCTTCCAAGTAG
>XP1254115-pear1
ATGCCGTAGTCAA
So far I have written this:
fasta = open('x.fasta')
output = open('x1.fasta', 'w')
seq = ''
for line in fasta:
    if line[0] == '>' and seq == '':
        header = line
    elif line[0] != '>':
        seq = seq + line
        for n in header:
            n = header[-1]
            if '1' in n:
                output.write(header + seq)
                header= line
                seq = ''
if "1" in header:
    output.write(header + seq)
output.close()
However, it doesn't produce any output in the new file created. Can you please spot the error?
Thank you
One option would be to read the entire file into a string, and then use re.findall with the following regex pattern:
>[A-Z0-9]+-\w+1\r?\n[ACGT]+
Sample script:
import re
with open('x.fasta') as fasta:
    text = fasta.read()
matches = re.findall(r'>[A-Z0-9]+-\w+1\r?\n[ACGT]+', text)
print(matches)
For the sample data you gave above, this prints:
['>XP1987651-apple1\nACCTTCCAAGTAG', '>XP1254115-pear1\nATGCCGTAGTCAA']
You can start by getting a list of your individual records, which are delimited by '>', and extract the header and body of each with a single split on the first newline, .split('\n', 1):
records = [
    line.split('\n', 1)
    for line in fasta.read().split('>')[1:]
]
Then you can simply filter out the records whose header does not end with 1:
for header, body in records:
    if header.endswith('1'):
        output.write('>' + header + '\n')
        output.write(body)
You can quite simply set a flag when you see a matching header line.
with open('x.fasta') as fasta, open('x1.fasta', 'w') as output:
    for line in fasta:
        if line.startswith('>'):
            select = line.endswith('1\n')
        if select:
            output.write(line)
This avoids reading the entire file into memory; you are only examining one line at a time.
Note that line still contains the newline at its end, which is why the check is endswith('1\n'). I opted to simply keep it; sometimes things are easier if you trim it with line = line.rstrip('\n') and add it back on output if necessary.
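The flag approach with the trimmed-newline variant could look like the sketch below. It is written as a generator over any iterable of lines so it can be tried without a file; the name filter_fasta and the suffix parameter are made up for illustration, and stripping '\r' as well is an assumption for files with Windows line endings.

```python
import io


def filter_fasta(lines, suffix="1"):
    """Yield the lines of every record whose header ends with suffix."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # compare without the line ending so '\n' and '\r\n' both work
            keep = line.rstrip("\r\n").endswith(suffix)
        if keep:
            yield line


if __name__ == "__main__":
    # Sample records from the question, held in memory instead of x.fasta
    sample = io.StringIO(
        ">XP1987651-apple1\nACCTTCCAAGTAG\n"
        ">XP1235689-lemon2\nTTGGAGTCCTGAG\n"
        ">XP1254115-pear1\nATGCCGTAGTCAA\n"
    )
    print("".join(filter_fasta(sample)), end="")
```

With real files, passing the open input file as lines and calling output.writelines(filter_fasta(fasta)) would still process one line at a time.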

How can I replace lines in a file by line number using Python, given a list of line numbers?

If I have a file like:
Flower
Magnet
5001
100
0
and I have a list containing the line numbers that I have to change:
list =[2,3]
How can I do this using Python? The output I expect is:
Flower
Most
Most
100
0
Code that I've tried:
f = open("your_file.txt","r")
line = f.readlines()[2]
print(line)
if line=="5001":
    print "yes"
else:
    print "no"
but it is not able to match.
I also want to overwrite the file that I am reading.
You may simply loop through the list of indices that you have to replace in your file (my original answer needlessly looped through all lines in the file):
with open('test.txt') as f:
    data = f.read().splitlines()
replace = {1,2}
for i in replace:
    data[i] = 'Most'
print('\n'.join(data))
Output:
Flower
Most
Most
100
0
To overwrite the file you have opened with the replacements, you may use the following:
with open('test.txt', 'r+') as f:
    data = f.read().splitlines()
    replace = {1,2}
    for i in replace:
        data[i] = 'Most'
    f.seek(0)
    f.write('\n'.join(data))
    f.truncate()
The reason that you're having this problem is that when you read a line from a file in Python, you also get the newline character (\n) at the end. To solve this, you could use the str.strip() method, which removes leading and trailing whitespace, including the newline.
E.g.
f = open("your_file.txt","r")
line = f.readlines()
f.close()
lineToCheck = line[2].strip()
if lineToCheck == "5001":
    print("yes")
else:
    print("no")
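The answers in this thread use 0-based indices ({1,2}), while the list in the question ([2,3]) reads as 1-based line numbers. A small sketch that takes 1-based numbers and overwrites the file in place might look like this; the function name replace_lines and the new_text parameter are hypothetical, and the whole file is assumed to fit in memory.

```python
from pathlib import Path


def replace_lines(path, line_numbers, new_text="Most"):
    """Overwrite the given 1-based line numbers in the file with new_text."""
    target = Path(path)
    lines = target.read_text().splitlines()
    for number in line_numbers:
        # convert the 1-based line number to a 0-based list index
        lines[number - 1] = new_text
    # write everything back, ending the file with a single newline
    target.write_text("\n".join(lines) + "\n")
```

Calling replace_lines("your_file.txt", [2, 3]) on the sample file from the question should leave Flower, Most, Most, 100, 0 in the file. A line number beyond the end of the file raises IndexError rather than being silently ignored.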

How do I read a file line by line and print only the lines that contain a specific string in Python?

I have a text file containing these lines
wbwubddwo 7::a number1 234 **
/// 45daa;: number2 12
time 3:44
For example, if the program finds the string number1, it should print 234.
I started with the simple script below, but it did not print what I wanted.
with open("test.txt", "rb") as f:
    lines = f.read()
    word = ["number1", "number2", "time"]
    if any(item in lines for item in word):
        val1 = lines.split("number1 ", 1)[1]
        print val1
This returns the following result:
234 **
/// 45daa;: number2 12
time 3:44
Then I tried changing f.read() to f.readlines(), but this time it did not print anything.
Does anyone know another way to do this? Eventually I want to get the value from each line, for example 234, 12 and 3:44, and store it in a database.
Thank you for your help. I really appreciate it.
Explanations given below:
with open("test.txt", "r") as f:
    lines = f.readlines()
    stripped_lines = [line.strip() for line in lines]
words = ["number1", "number2", "time"]
for a_line in stripped_lines:
    for word in words:
        if word in a_line:
            number = a_line.split()[1]
            print(number)
1) First of all, 'rb' gives a bytes object, i.e. something like b'number1 234' would be returned. Use 'r' to get a string object.
2) The lines you read will be stored in a list that looks something like this:
['number1 234\r\n', 'number2 12\r\n', '\r\n', 'time 3:44']
Notice the \r\n; those mark the end of each line. To remove them, use strip().
3) Take each line from stripped_lines and each word from words,
and check whether that word is present in that line using in.
4) a_line would be 'number1 234', but we only want the number part, so call split().
The output of that would be
['number1','234'], and split()[1] means the element at index 1 (the 2nd element).
5) You can also check whether the string is a digit using your_string.isdigit()
UPDATE: Since you updated your question and input file this works:
import time
def isTimeFormat(input):
    try:
        time.strptime(input, '%H:%M')
        return True
    except ValueError:
        return False
with open("test.txt", "r") as f:
    lines = f.readlines()
    stripped_lines = [line.strip() for line in lines]
words = ["number1", "number2", "time"]
for a_line in stripped_lines:
    for word in words:
        if word in a_line:
            number = a_line.split()[-1] if (a_line.split()[-1].isdigit() or isTimeFormat(a_line.split()[-1])) else a_line.split()[-2]
            print(number)
Why the isTimeFormat() function?
def isTimeFormat(input):
    try:
        time.strptime(input, '%H:%M')
        return True
    except ValueError:
        return False
It checks whether values such as 3:44 or 4:55 are in a time format, since you are treating them as values too.
Final output:
234
12
3:44
After some trial and error, I found the solution below. It is based on the answer provided by #s_vishnu.
with open("test.txt", "r") as f:
    lines = f.readlines()
    stripped_lines = [line.strip() for line in lines]
for item in stripped_lines:
    if "number1" in item:
        getval = item.split("number1 ")[1].split(" ")[0]
        print(getval)
    if "number2" in item:
        getval2 = item.split("number2 ")[1].split(" ")[0]
        print(getval2)
    if "time" in item:
        getval3 = item.split("time ")[1].split(" ")[0]
        print(getval3)
Output:
234
12
3:44
This way, I can also do other things for example saving each data to a database.
I am open to any suggestion to further improve my answer.
You're overthinking this. Assuming you don't have those two asterisks at the end of the first line and you want to print out lines containing a certain value(s), you can just read the file line by line, check if any of the chosen values match and print out the last value (value between a space and the end of the line) - no need to parse/split the whole line at all:
search_values = ["number1", "number2", "time"]  # values to search for
with open("test.txt", "r") as f:  # open your file
    for line in f:  # read it line by line
        if any(value in line for value in search_values):  # check for search_values in line
            print(line[line.rfind(" ") + 1:].rstrip())  # print the last value after space
Which will give you:
234
12
3:44
If you do have asterisks you have to more precisely define your file format as splitting won't necessarily yield you your desired value.
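If the value always follows its keyword directly, one way to make the format assumption explicit is a regular expression that captures the first whitespace-delimited token after the keyword. This is only a sketch under that assumption; the function name extract_values is made up, and the sample lines are the ones from the question, including the trailing asterisks.

```python
import re


def extract_values(lines, keywords=("number1", "number2", "time")):
    """Return (keyword, value) pairs, where value is the token after the keyword."""
    # \b keeps a keyword from matching inside a longer word;
    # (\S+) captures the next run of non-whitespace characters
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\s+(\S+)"
    )
    results = []
    for line in lines:
        match = pattern.search(line)
        if match:
            results.append((match.group(1), match.group(2)))
    return results


if __name__ == "__main__":
    sample = [
        "wbwubddwo 7::a number1 234 **\n",
        "/// 45daa;: number2 12\n",
        "time 3:44\n",
    ]
    for keyword, value in extract_values(sample):
        print(keyword, value)
```

Keeping the keyword next to the value makes it straightforward to decide later which database column each value belongs to.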

Read lines in one file and find all strings starting with 4-letter strings listed in another txt file

I have 2 txt files (file_a.txt and file_b.txt).
file_a.txt contains a long list of 4-letter combinations (one combination per line):
aaaa
bcsg
aacd
gdee
aadw
hwer
etc.
file_b.txt contains a list of letter combinations of various length (some with spaces):
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
etc.
I am looking for a python script that would allow me to do the following:
read file_a.txt line by line
take each 4-letter combination (e.g. aaai)
read file_b.txt and find all the various-length letter combinations starting with the 4-letter combination (eg. aaaibjkes, aaailoiersaaageehikjaaa, aaailoiuwegoiglkjaaaike etc.)
print the results of each search in a separate txt file named with the 4-letter combination.
File aaai.txt:
aaaibjkes
aaailoiersaaageehikjaaa
aaailoiuwegoiglkjaaake
etc.
File bcsi.txt:
bcspwiopiejowih
bcsiweyoieotpwe
etc.
I'm sorry I'm a newbie. Can someone point me in the right direction, please. So far I've got only:
#I presume I will have to use regex at some point
import re
file1 = open('file_a.txt', 'r').readlines()
file2 = open('file_b.txt', 'r').readlines()
#Should I look into findall()?
I hope this helps:
file1 = open('file_a.txt', 'r')
file2 = open('file_b.txt', 'r')
# get every item in your second file into a list
mylist = file2.readlines()
file2.close()
# read each line in the first file
for raw_line in file1:
    searchStr = raw_line.strip()
    if not searchStr:
        continue
    # find the lines in your second file that start with this string
    exists = [s for s in mylist if s.startswith(searchStr)]
    if exists:
        # if any line matched, create a file named after the search string
        fileNew = open(searchStr + '.txt', 'w')
        for line in exists:
            fileNew.write(line)
        fileNew.close()
file1.close()
What you can do is open both files and go through them line by line using for loops.
You can have two for loops: the outer one reads file_a.txt, which you only need to go through once. The inner one reads through file_b.txt and looks for the string at the start of each line.
To do so, you can use .find() to search for the string. Since it must be at the start, the returned index should be 0.
file_a = open("file_a.txt", "r")
file_b = open("file_b.txt", "r")
for a_line in file_a:
    # This result value will be written into your new file
    result = ""
    # This is what we will search with
    search_val = a_line.strip("\n")
    print("---- Using " + search_val + " from file_a to search. ----")
    for b_line in file_b:
        print("Searching file_b using " + b_line.strip("\n"))
        if b_line.strip("\n").find(search_val) == 0:
            result += (b_line)
    print("---- Search ended ----")
    # Set the read pointer to the start of the file again
    file_b.seek(0, 0)
    if result:
        # Write the contents of "result" into a file with the name of "search_val"
        with open(search_val + ".txt", "a") as f:
            f.write(result)
file_a.close()
file_b.close()
Test Cases:
I am using the test cases in your question:
file_a.txt
aaaa
bcsg
aacd
gdee
aadw
hwer
file_b.txt
aaaibjkes
aaleoslk
abaaaalkjel
bcsgiweyoieotpwe
csseiolskj
gaelsi asdas
aaaloiersaaageehikjaaa
hwesdaaadf wiibhuehu
bcspwiopiejowih
gdeaes
aaailoiuwegoiglkjaaake
The program produces an output file bcsg.txt as it is supposed to with bcsgiweyoieotpwe inside.
Try this:
f1 = open("a.txt","r").readlines()
f2 = open("b.txt","r").readlines()
file1 = [word.replace("\n","") for word in f1]
file2 = [word.replace("\n","") for word in f2]
data = []
data_dict ={}
for short_word in file1:
    data += ([[short_word,w] for w in file2 if w.startswith(short_word)])
for single_data in data:
    if single_data[0] in data_dict:
        data_dict[single_data[0]].append(single_data[1])
    else:
        data_dict[single_data[0]]=[single_data[1]]
# items() replaces the Python 2 iteritems()
for key,val in data_dict.items():
    open(key+".txt","w").writelines("\n".join(val))
    print(key + ".txt created")
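The grouping step shared by these answers can be sketched with str.startswith and a defaultdict, with the matching kept separate from the file writing so it can be checked on the sample data first. The names group_by_prefix and write_groups are made up for illustration, and the .txt naming follows the question.

```python
from collections import defaultdict
from pathlib import Path


def group_by_prefix(prefixes, candidates):
    """Map each prefix to the candidate strings that start with it."""
    groups = defaultdict(list)
    for prefix in prefixes:
        for candidate in candidates:
            if candidate.startswith(prefix):
                groups[prefix].append(candidate)
    # prefixes without any match do not appear in the result
    return dict(groups)


def write_groups(groups, directory="."):
    """Write each group to <prefix>.txt inside directory, one match per line."""
    for prefix, matches in groups.items():
        Path(directory, prefix + ".txt").write_text("\n".join(matches) + "\n")


if __name__ == "__main__":
    # Sample data from the question (file_a.txt and file_b.txt contents)
    prefixes = ["aaaa", "bcsg", "aacd", "gdee", "aadw", "hwer"]
    candidates = [
        "aaaibjkes", "aaleoslk", "abaaaalkjel", "bcsgiweyoieotpwe",
        "csseiolskj", "gaelsi asdas", "aaaloiersaaageehikjaaa",
        "hwesdaaadf wiibhuehu", "bcspwiopiejowih", "gdeaes",
        "aaailoiuwegoiglkjaaake",
    ]
    print(group_by_prefix(prefixes, candidates))
```

On the sample data only bcsg has a match, which agrees with the test case reported above. For very long lists, indexing the candidates by their first four characters would avoid the nested loop, but that is an optimization the original answers do not make.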
