Python: Delete lines from a file based on certain criteria - python

I am trying to delete lines from a file using specific criteria.
The script I have seems to work, but I have to add too many or statements.
Is there a way I can make a variable that holds all the criteria I would like to remove from the file?
Example code:
with open("AW.txt", "r+", encoding='utf-8') as f:
new_f = f.readlines()
f.seek(0)
for line in new_f:
if "PPL"not in line.split() or "PPLX"not in line.split() or "PPLC"not in line.split():
f.write(line)
f.truncate()
I was thinking more along these lines, but it fails when I add multiple criteria:
output = []
with open('AW.txt', 'r+', encoding='utf-8') as f:
    lines = f.readlines()
    criteria = 'PPL'
    output = [line for line in lines if criteria not in line]
    f.writelines(output)
Regards

You can use regular expressions, which will reduce the number of statements and checks in the code. If you have a list of criteria, which can be dynamic, let's call it crit_list; the code would then look like this:
import re
with open("AW.txt", "r+", encoding='utf-8') as f:
    new_f = f.readlines()
    crit_list = ['PPL', 'PPLC', 'PPLX']  # Can use any number of criteria
    obj = re.compile(r'%s' % ('|'.join(crit_list)))
    out_lines = [line for line in new_f if not obj.search(line)]
    f.truncate(0)
    f.seek(0)
    f.writelines(out_lines)
Use of the regex makes it look different from what the OP posted, so let me explain the two lines containing the regex:
obj = re.compile(r'%s' % ('|'.join(crit_list)))
This line creates a regex object with the regular expression 'PPL|PPLX|PPLC', which means: match at least one of these strings in the given line. This can be thought of as a substitute for using as many ors in the code as there are criteria.
out_lines = [line for line in new_f if not obj.search(line)]
This statement means: search for the given criteria in the line, and only if none of them is found, preserve that line.
Hope that clears your doubts.
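One caveat (my addition, not part of the original answer): a plain alternation also matches each criterion as a substring of a longer word. If that matters, a hedged variant using re.escape and word boundaries could look like this:
import re
crit_list = ['PPL', 'PPLC', 'PPLX']
# \b anchors each criterion at word boundaries; re.escape guards against
# criteria that contain regex metacharacters.
obj = re.compile(r'\b(?:%s)\b' % '|'.join(re.escape(c) for c in crit_list))
with open("AW.txt", "r+", encoding='utf-8') as f:
    lines = f.readlines()
    f.seek(0)
    f.writelines(line for line in lines if not obj.search(line))
    f.truncate()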

import re
output = []
with open('AW.txt', 'r+', encoding='utf-8') as f:
    lines = f.readlines()
    # (...) groups the alternation; a character class [Crit1|Crit2|Crit3]
    # would match single characters only. re.sub also needs the string to
    # operate on as its third argument.
    output = [re.sub(r"^.*(Crit1|Crit2|Crit3).*$", "", line) for line in lines]
    f.seek(0)
    f.writelines(output)
    f.truncate()
This will blank out the matching lines, but it will not drop them: re.sub only empties the matched text, so writelines still writes an empty line (the trailing newline) for each match.
Your question was a little fuzzy, asking for lines to be deleted but then trying to write them out.
Add as many criteria as you want to the alternation, like this: (Crit1|Crit2|Crit3|Crit4).

You can compare each list item with each criterion and keep only those items that meet the criteria. Then simply take all lines which meet all the criteria.
For example, this can be done like this (EDITED CODE):
with open('AW.txt', 'r+') as f:
    lines = f.readlines()
    criterias = ["PPL", "PPLX", "PPLC"]
    conditioned_lines = [[line for criteria in criterias if criteria not in line] for line in lines]
    output = [criteria_lines[0] for criteria_lines in conditioned_lines if len(criteria_lines) == len(criterias)]
    f.truncate(0)
    f.seek(0)
    f.write(''.join(output))
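The same "keep a line only if every criterion is absent" test can also be written more directly with all(); a minimal sketch (my addition, reusing the names from above):
with open('AW.txt', 'r+') as f:
    lines = f.readlines()
    criterias = ["PPL", "PPLX", "PPLC"]
    # all(...) is True only when no criterion occurs in the line
    output = [line for line in lines if all(c not in line for c in criterias)]
    f.truncate(0)
    f.seek(0)
    f.write(''.join(output))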


delete all rows up to a specific row

How can I delete lines in a text document up to a certain line?
I find the line number using the code:
#!/usr/bin/env python
lookup = '00:00:00'
filename = "test.txt"
with open(filename) as text_file:
    for num, line in enumerate(text_file, 1):
        if lookup in line:
            print(num)
print(num) outputs the line number of the match, for example 66.
How do I delete all the lines up to line 66, i.e. up to the line found by the lookup word?
As proposed here with a small modification to your case:
read all lines of the file.
iterate the lines list until you reach the keyword.
write all remaining lines
with open("yourfile.txt", "r") as f:
lines = iter(f.readlines())
with open("yourfile.txt", "w") as f:
for line in lines:
if lookup in line:
f.write(line)
break
for line in lines:
f.write(line)
That's easy.
filename = "test.txt"
lookup = '00:00:00'
with open(filename,'r') as text_file:
lines = text_file.readlines()
res=[]
for i in range(0,len(lines),1):
if lookup in lines[i]:
res=lines[i:]
break
with open(filename,'w') as text_file:
text_file.writelines(res)
Do you know what lines you want to delete?
#!/usr/bin/env python
lookup = '00:00:00'
filename = "test.txt"
with open(filename) as text_file, open('okfile.txt', 'w') as ok:
    lines = text_file.readlines()
    ok.writelines(lines[4:])
This will drop the first 4 lines and store the rest in a different document, in case you wanna keep the original.
Remember to close the files when you're done with them (the with statement here does that automatically) :)
Providing three alternate solutions. All begin with the same first part - reading:
filename = "test.txt"
lookup = '00:00:00'
with open(filename) as text_file:
lines = text_file.readlines()
The variations for the second parts are:
Using itertools.dropwhile, which discards items from the iterator until the predicate (condition) returns False (i.e. it discards while the predicate is True). From that point on, it yields all the remaining items without re-checking the predicate:
import itertools
with open(filename, 'w') as text_file:
    text_file.writelines(itertools.dropwhile(lambda line: lookup not in line, lines))
Note that the predicate says not in, so all the lines before lookup is found are discarded.
Bonus: If you wanted to do the opposite - write lines until you find the lookup and then stop, replace itertools.dropwhile with itertools.takewhile.
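For illustration, a minimal sketch of that opposite behaviour (my addition, reusing the names from above):
import itertools
with open(filename, 'w') as text_file:
    # write lines only up to (but not including) the first line containing lookup
    text_file.writelines(itertools.takewhile(lambda line: lookup not in line, lines))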
Using a flag-value (found) to determine when to start writing the file:
with open(filename, 'w') as text_file:
    found = False
    for line in lines:
        if not found and lookup in line:  # 2nd expression not checked once `found` is True
            found = True                  # value remains True for all remaining iterations
        if found:
            text_file.write(line)
Similar to #c yj's answer, with some refinements - use enumerate instead of range, and then use the last index (idx) to write the lines from that point on; with no other intermediate variables needed:
for idx, line in enumerate(lines):
    if lookup in line:
        break
with open(filename, 'w') as text_file:
    text_file.writelines(lines[idx:])
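One edge case worth noting (my addition, not part of the original answer): if the lookup never occurs, the loop above finishes without a break and idx is left pointing at the last line, so that line alone would be written. A guarded variant using Python's for/else:
for idx, line in enumerate(lines):
    if lookup in line:
        break
else:
    idx = len(lines)  # no break happened: lookup was never found, write nothing
with open(filename, 'w') as text_file:
    text_file.writelines(lines[idx:])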

Open and Read a CSV File without libraries

I have the following problem: I am supposed to open a CSV file (it's an Excel table) and read it without using any library.
I have already tried a lot, and I now have the first row as a tuple inside a list. But only the first line, the header, and no other row.
This is what I have so far.
with open(path, 'r+') as file:
    results = []
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a = line.split(',')
            b = tuple(a)
            results.append(b)
            return results
The output should be: every line as a tuple, and all the tuples in a list.
My question is now: how can I read the other lines in Python?
I am really sorry, I am new to programming altogether, so I have a real hard time finding my mistake.
Thank you very much in advance for helping me out!
This problem has come up many times on Stack Overflow, so you should be able to find working code. But it is much better to use the csv module for this.
You have wrong indentation, and you use return results right after reading the first line, so it exits the function and never tries to read the other lines.
But even after changing this there are still other problems, so it will still not read the next lines.
You use readline(), so you read only the first line, and your loop works the whole time on that same line - and it will never end, because you never set text = ''.
You should use read() to get all the text, which you later split into lines using split("\n"), or you could use readlines() to get all the lines as a list, and then you don't need split(). Or you can use for line in file:. In all these cases you don't need while.
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        text = file.read()
        for line in text.split('\n'):
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        lines = file.readlines()
        for line in lines:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        for line in file:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
None of these versions will work correctly if you have '\n' or ',' inside an item where it shouldn't be treated as the end of a row or as a separator between items. Such items are put in " ", which also makes it a problem to remove them. All these problems you can resolve using the standard csv module.
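For comparison, a minimal sketch using the standard csv module (my addition; it correctly handles quoted items containing ',' or '\n'):
import csv
def read_csv(path):
    with open(path, newline='') as file:  # newline='' is recommended when using csv
        return [tuple(row) for row in csv.reader(file)]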
Your code is pretty good and you are near the goal:
with open(path, 'r+') as file:
    results = []
    text = file.read()
    # while text != '':
    for line in text.split('\n'):
        a = line.split(',')
        b = tuple(a)
        results.append(b)
    return results
Your Code:
with open(path, 'r+') as file:
    results = []
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a = line.split(',')
            b = tuple(a)
            results.append(b)
            return results
So enjoy learning :)
One caveat is that the CSV may not end with a blank line, as this would result in an ugly tuple at the end of the list like ('',) (which looks like a smiley).
To prevent this you have to check for empty lines: an if line != '': inside the for loop will do the trick.
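Applied to the for line in file version above, that guard might look like this (a sketch of my own, reusing the same names):
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        for line in file:
            line = line.rstrip('\n')
            if line != '':  # skip empty lines, e.g. from a trailing newline
                results.append(tuple(line.split(',')))
        return results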

How to open a file in python, read the comments ("#"), find a word after the comments and select the word after it?

I have a function that loops through a file that looks like this:
"#" XDI/1.0 XDAC/1.4 Athena/0.9.25
"#" Column.4: pre_edge
Content
That is to say, after the "#" there is a comment. My function aims to read each line and, if it starts with a specific word, select what comes after the ":".
For example, if I had these two lines, I would like to read through them, and if the line starts with "#" and contains the word "Column.4", the word "pre_edge" should be stored.
An example of my current approach follows:
with open(file, "r") as f:
    for line in f:
        if line.startswith('#'):
            word = line.split(" Column.4:")[1]
        else:
            print("n")
I think my trouble is specifically this: after finding a line that starts with "#", how can I parse/search through it and save its content if it contains the desired word?
In case the # comment contains the string Column.4: as stated above, you could parse it this way:
with open(filepath) as f:
    for line in f:
        if line.startswith('#'):
            # Here you process comment lines
            if 'Column.4' in line:
                first, remainder = line.split('Column.4: ')
                # remainder contains everything after '# Column.4: '
                # So if you want to get the first word ->
                word = remainder.split()[0]
        else:
            # Here you can process lines that are not comments
            pass
Note
Also, it is good practice to use the for line in f: statement instead of f.readlines() (as mentioned in other answers), because this way you don't load all lines into memory, but process them one by one.
You should start by reading the file into a list and then work through that instead:
file = 'test.txt'  # <- call file whatever you want
with open(file, "r") as f:
    txt = f.readlines()
for line in txt:
    if line.startswith('"#"'):
        word = line.split(" Column.4: ")
        try:
            print(word[1])
        except IndexError:
            print(word)
    else:
        print("n")
Output:
>>> ['"#" XDI/1.0 XDAC/1.4 Athena/0.9.25\n']
>>> pre_edge
A try/except block is used because the first line also starts with "#" and we can't split that with your current logic.
Also, as a side note: in the question the file's lines start with "#" including the quotation marks, so the startswith() call was altered accordingly.
with open('stuff.txt', 'r+') as f:
    data = f.readlines()
for line in data:
    words = line.split()
    if words and ('#' in words[0]) and ("Column.4:" in words):
        print(words[-1])
        # pre_edge

Parsing a file from first char in each line

I'm trying to group a file by the first character in each line of the file.
For example, the file:
s/1/1/2/3/4/5///6
p/22/LLL/GP/1/3//
x//-/-/-/1/5/-/-/
s/1/1/2/3/4/5///6
p/22/LLL/GP/1/3//
x//-/-/-/1/5/-/-/
I need to group everything starting with the first s/ up to the next s/. I don't think split() will work since it would remove the delimiter.
Desired end result:
s/1/1/2/3/4/5///6
p/22/LLL/GP/1/3//
x//-/-/-/1/5/-/-/
s/1/1/2/3/4/5///6
p/22/LLL/GP/1/3//
x//-/-/-/1/5/-/-/
I'd prefer to do this without the re module if possible (is it?)
Edit: Attempts:
The following gets me the values in groups using list comprehension:
with open('/file/path', 'r') as f:
    content = f.read()
groups = ['s/' + group for group in content.split('s/')[1:]]
Since the s/ is the first character in the sequence, I use the [1:] to avoid having an element of just s/ in groups[0].
Is there a better way? Or is this the best?
Assuming the first line of the file starts with 's/' you could try something like this:
groups = []
with open('test.txt', 'r') as f:
    for line in f:
        if line.startswith('s/'):
            groups.append('')
        groups[-1] += line
To deal with files that don't start with 's/' and have the first element be all lines until the first 's/', we can make a small change and add in an empty string on the first line:
groups = []
with open('test.txt', 'r') as f:
    for line in f:
        if line.startswith('s/') or not groups:
            groups.append('')
        groups[-1] += line
Alternatively, if we want to skip lines until the first 's/', we can do the following:
groups = []
with open('test.txt', 'r') as f:
    for line in f:
        if line.startswith('s/'):
            groups.append('')
        if groups:
            groups[-1] += line
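As an aside (my addition, not part of the original answer), the same grouping can be written with itertools.groupby and a small stateful key that counts the 's/' marker lines; a sketch assuming the file is named test.txt:
import itertools
class MarkerKey:
    # returns the number of 's/' lines seen so far, so every line from one
    # 's/' marker up to (but not including) the next gets the same key value
    def __init__(self):
        self.count = 0
    def __call__(self, line):
        if line.startswith('s/'):
            self.count += 1
        return self.count
with open('test.txt', 'r') as f:
    groups = [''.join(chunk) for _, chunk in itertools.groupby(f, key=MarkerKey())]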

Parsing a text file in python and outputting to a CSV

Preface - I'm pretty new to Python, having had more experience in another language.
I have a text file with a single-column list of strings in the generic (but slightly varying) format "./abc123a1/type/1ab2_x_data_type.file.type"
I need to extract the abc123a1 and the 1ab2 portions from all several hundred rows and put them under two columns (column a and b) in a csv. Sometimes there may be a "1ab2_a" and a "1ab2_b", but I only want one 1ab2, so I'd grab "1ab2_a" and ignore all others.
I have the regex which I THINK will work:
import re
def extract_id(x):  # function wrapper and name are assumed; the posted fragment used bare returns
    tmp = list()
    if re.findall(re.compile(r'^([a-zA-Z0-9]{4})_'), x):
        tmp = re.findall(re.compile(r'^([a-zA-Z0-9]{4})_'), x)
    elif re.findall(re.compile(r'_([a-zA-Z0-9]{4})_'), x):
        tmp = re.findall(re.compile(r'_([a-zA-Z0-9]{4})_'), x)
    if len(tmp) == 0:
        return None
    elif len(tmp) > 1:
        print "ERROR found multiple matches"
        return "ERROR"
    else:
        return tmp[0].upper()
I am trying to build this script step by step and test things to make sure they work, but it's just not working.
import sys
import csv
listOfData = []
with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])
print listOfData
with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)
print listOfData
Still failing to get anything in the csv other than column headers, much less a parsed version!
Does anyone have any better ideas or formats I could do this in? A friend mentioned looking into glob.glob, but I haven't had luck getting that to work either.
IMHO, you were not far from making it work. The problem is that you read the whole file once just to print the lines, and then (once at the end of the file) you try to put them into a list... and get an empty list!
You should read the file only once:
import sys
import csv
listOfData = []
with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])
print listOfData
with open('extracted.csv', 'w') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(('column a', 'column b'))
    writer.writerows(listOfData)
print listOfData
Once it works, you still have to use the regex to get the relevant data to put into the csv file.
I am not sure about your regex (it will most probably not work), but the reason why your current (non-regex, simple) code does not work is this:
with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])
As you can see, you are first iterating over each line in the file and printing it, which should be fine, but after that loop ends the file pointer is at the end of the file, so trying to iterate over it again would not produce any result. You should only iterate over it once, and do both the printing and the appending to the list inside it. Example:
with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])
I think at least part of the problem is the two for loops in the following:
with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
    for line in f:
        listOfData.append([line])
The first one prints all the lines of f, so there's nothing left for the second one to iterate over unless you first f.seek(0) and rewind the file.
An alternative would be simply to do this:
with open(sys.argv[1]) as f:
    print "yes"
    for line in f:
        print line
        listOfData.append([line])
It's hard to tell if your regexes are OK without more than one line of sample input data.
Are you sure you need all of the regular expressions? You seem to be parsing a list of paths and filenames. The path could be split up using a split command, for example:
print "./abc123a1/type/1ab2_a_data_type.file.type".split("/")
Would give:
['.', 'abc123a1', 'type', '1ab2_a_data_type.file.type']
You could then create a set of pairs consisting of the second entry and the part of the fourth entry up to the first '_', e.g.
('abc123a1', '1ab2')
This can then be used to write only the first occurrence of each pair:
import sys
import csv
pairs = set()
with open(sys.argv[1], 'r') as in_file, open('extracted.csv', 'wb') as out_file:
    writer = csv.writer(out_file)
    for row in in_file:
        folders = row.split("/")
        col_a = folders[1]
        col_b = folders[3].split("_")[0]
        if (col_a, col_b) not in pairs:
            pairs.add((col_a, col_b))
            writer.writerow([col_a, col_b])
So for an input looking like this:
./abc123a1/type/1ab2_a_data_type.file.type
./abc123a1/type/1ab2_b_data_type.file.type
./abc123a2/type/1ab2_a_data_type.file.type
./abc123a3/type/1ab2_a_data_type.file.type
You would get a CSV file looking like:
abc123a1,1ab2
abc123a2,1ab2
abc123a3,1ab2
