Search lines, pull specific data - python

I need to read a text file, search all lines, find a keyword in a specific location of the line and if it exists, pull other data from that same line.
My example is the word 'TRED'. If TRED is at index location 95 I need to pull data from either certain columns or specific indexes from that line.
Currently my code is this....but it's not finding the word and so the results are all errors.
substr = "TRED"
with open(strFileLoc + "test.txt", 'r') as inputfile:
    for line in inputfile:
        if line.find(substr, 95, 98) != -1:
            print(line.rstrip('\n'))
        else:
            print("There was an error at " + line.rstrip('\n'))

There are a couple of ways to solve this. The problem (based on my quick test) is the range you pass to str.find(): it searches from the start index up to, but not including, the end index. With find(substr, 95, 98) only three characters (TRE) are examined, so the four-character TRED can never match, even when it sits at that position. Raising the end index to 99 fixes it.
Alternatively, find() returns the index where the substring starts, or -1 if it is not found. You can search the whole line and compare the return value with 95, which also handles lines shorter than 95 characters. Note that find() reports only the first occurrence, so this test fails if TRED also appears earlier in the line.
substr = "TRED"
with open(strFileLoc + "test.txt", 'r') as inputfile:
    for line in inputfile:
        loc = line.find(substr)
        if loc == 95:
            print(line.rstrip('\n'))
        else:
            print("There was an error at " + line.rstrip('\n'))
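To see why the original call never matches, here is a small self-contained sketch. The sample line is made up (95 filler characters followed by TRED), and str.startswith() with a start offset is an extra option, not part of the answer above, for testing one exact position.

```python
# Hypothetical sample line: 95 filler characters, then the keyword.
line = "x" * 95 + "TRED" + "rest of the record"
substr = "TRED"

# The end argument is exclusive, so 95..98 covers only 3 characters.
too_short = line.find(substr, 95, 98)

# Widening the window to 95..99 leaves room for all 4 characters.
wide_enough = line.find(substr, 95, 95 + len(substr))

# startswith() with an offset checks exactly one position.
at_position = line.startswith(substr, 95)

print(too_short, wide_enough, at_position)
```

Computing the end index as 95 + len(substr) keeps the window correct if the keyword ever changes length.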

I believe there is an easier way to do this comparison: slice the line and compare the slice directly.
substr = "TRED"
with open(strFileLoc + "test.txt", 'r') as inputfile:
    for line in inputfile:
        if line[95:99] == substr:
            print(line.rstrip('\n'))
        else:
            print("There was an error at " + line.rstrip('\n'))
Output:
sdaksdkakslkdlaksjdlkajslkdjlkajklsfjslkdvnksdjjlsjdlfjlskldfjlsnvkjdglsjdfljalsmnljklasjlfaaaaTREDdjsalkjdlka
Make sure you use the right index values. (Note: line[95:99] takes the characters at positions 95, 96, 97 and 98 only.)
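Once the keyword check passes, the other data can be pulled from the same line with further slices. This is a minimal sketch that assumes a fixed-width layout; the field names and offsets in FIELDS, and the helper name extract, are invented for illustration and must be replaced with the real ones.

```python
# Hypothetical fixed-width layout: name -> (start, end), end exclusive.
FIELDS = {"account": (0, 10), "amount": (10, 20)}
KEYWORD = "TRED"
KEYWORD_POS = 95


def extract(line):
    """Return the fields of a matching line, or None if the keyword is absent."""
    end = KEYWORD_POS + len(KEYWORD)
    if line[KEYWORD_POS:end] != KEYWORD:
        return None
    return {name: line[a:b].strip() for name, (a, b) in FIELDS.items()}


# Made-up record: two 10-character fields, filler up to index 95, then TRED.
sample = "ACC0001234" + "    150.00" + "-" * 75 + "TRED" + "tail\n"
print(extract(sample))
print(extract("short line\n"))
```

Slicing past the end of a short line returns an empty string rather than raising, so short lines simply fail the comparison.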

You mentioned you'd want to pull data from either columns or indices on that line.
If the file is delimited (by spaces, commas, tabs, and so on), you can split each line into columns and test the column directly:
substr = "TRED"
token_splitter = ','  # or whatever separator you have
column_number = 2
with open(strFileLoc + "test.txt", 'r') as inputfile:
    for line in inputfile:
        columns = line.rstrip().split(token_splitter)
        # check the length first so short lines don't raise IndexError
        if len(columns) > column_number and columns[column_number] == substr:
            print(line.rstrip('\n'))
        else:
            print("There was an error at " + line.rstrip('\n'))
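If fields can be quoted, the standard csv module is a safer way to split than str.split(). The sketch below uses made-up in-memory data in place of test.txt; the length check plays the same role as guarding against short lines.

```python
import csv
import io

substr = "TRED"
column_number = 2

# Made-up sample data standing in for the real file.
sample = io.StringIO("a,b,TRED,d\nx,y\n1,2,OTHER,4\n")

matches = []
for row in csv.reader(sample, delimiter=","):
    # Skip rows that are too short to have the column at all.
    if len(row) > column_number and row[column_number] == substr:
        matches.append(row)

print(matches)
```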

Related

Using Python rjust(8) does not seem to work on last item in list

I have a text file containing comma-separated values, which I read and write out again reformatted.
102391,-55.5463,-6.50719,-163.255,2.20855,-2.63099,-7.86673
102392,11.224,-8.15971,15.5387,-11.512,-3.89007,-28.6367
102393,20.5277,-62.3261,-40.9294,-45.5899,-53.222,-1.77512
102394,188.113,19.2829,137.284,14.0548,4.47098,-50.8091
102397,-24.5383,-3.46016,1.74639,2.52063,3.31528,16.2535
102398,-107.719,-102.548,52.1627,-78.4543,-65.2494,-97.8143
I read it using this code:
with open(outfile , 'w') as fout:
    with open(infile) as file:
        for line in file:
            linelist = line.split(",")
            fout.write(" ELEM " + '{:>8}'.format(str(linelist[0]) + "\n"))
            if len(linelist) == 7:
                fout.write(" VALUE " + str(linelist[1][:8]).rjust(8) + str(linelist[2][:8]).rjust(8) + str(linelist[3][:8]).rjust(8) + str(linelist[4][:8]).rjust(8) + str(linelist[5][:8]).rjust(8) + str(linelist[6][:8]).rjust(8) )
                fout.write("\n")
And get this output:
ELEM 102391
VALUE -55.5463-6.50719-163.255 2.20855-2.63099-7.86673
ELEM 102392
VALUE 11.224-8.15971 15.5387 -11.512-3.89007-28.6367
ELEM 102393
VALUE 20.5277-62.3261-40.9294-45.5899 -53.222-1.77512
ELEM 102394
VALUE 188.113 19.2829 137.284 14.0548 4.47098-50.8091
ELEM 102397
VALUE -24.5383-3.46016 1.74639 2.52063 3.3152816.2535
ELEM 102398
VALUE -107.719-102.548 52.1627-78.4543-65.2494-97.8143
Everything is fine except: why do I sometimes get an extra blank line, and why is the last number before the blank line (16.2535) not right-adjusted? These two issues surely belong together, but I cannot figure out what is going on.
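The last element of every input line still ends with a newline character, because line.split(",") does not remove it. On most lines the [:8] slice happens to cut it off, since the last value is already 8 characters long. On the fifth line the last element is '16.2535\n': that is exactly 8 characters including the newline, so the slice keeps the newline and rjust(8) has nothing left to pad.
You can confirm this by printing repr(linelist[6]) for the fifth line of your input.
To make sure the values carry no trailing newline, call the string method .strip() on the line (or on the last element) before formatting.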
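A minimal sketch of that fix, applied to the fifth line of the sample data: strip the line before splitting, then right-justify each value. The helper name format_record is made up for illustration.

```python
def format_record(line):
    """Format one comma-separated record as ELEM/VALUE text lines."""
    parts = line.strip().split(",")  # strip() drops the trailing newline
    out = " ELEM " + parts[0].rjust(8) + "\n"
    if len(parts) == 7:
        # Truncate each value to 8 characters, then pad it to width 8.
        out += " VALUE " + "".join(p[:8].rjust(8) for p in parts[1:]) + "\n"
    return out


text = format_record("102397,-24.5383,-3.46016,1.74639,2.52063,3.31528,16.2535\n")
print(text, end="")
```

With the newline removed up front, the last value is padded like the others and no blank line appears.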

Remove linebreak in csv

I have a CSV file that has errors. The most common one is a too early linebreak.
But now I don't know how best to remove it. If I read the file line by line
with open("test.csv", "r") as reader:
    test = reader.read().splitlines()
the wrong structure is already in my variable. Is this still the right approach? Should I loop over test and build a copy, or can I modify test directly while iterating over it?
I can identify the corrupt lines by the semicolon: some rows end with a ; and others start with one. So maybe counting would be an alternative way to solve it?
EDIT:
I replaced reader.read().splitlines() with reader.readlines() so I could handle the rows that end with a ;
for line in lines:
    if("Foobar" in line):
        line = line.replace("Foobar", "")
    if(";\n" in line):
        line = line.replace(";\n", ";")
The only thing that remains are rows that begin with a ;
since for those I need to go back one entry in the list.
Example:
Col_a;Col_b;Col_c;Col_d
2021;Foobar;Bla
;Blub
Blub belongs in the row above.
Here's a simple Python script to merge lines until you have the desired number of fields.
import sys
sep = ';'
fields = 4
collected = []
for line in sys.stdin:
    new = line.rstrip('\n').split(sep)
    if collected:
        # the break fell inside a field: glue the two halves together
        collected[-1] += new[0]
        collected.extend(new[1:])
    else:
        collected = new
    if len(collected) < fields:
        continue
    print(sep.join(collected))
    collected = []
This simply reads from standard input and prints to standard output. If the last line is incomplete, it will be lost.
The separator and the number of fields can be edited in the variables at the top; exposing these as command-line parameters is left as an exercise.
If you wanted to keep the newlines, it would not be too hard to only strip a newline from the last fields, and use csv.writer to write the fields back out as properly quoted CSV.
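The two follow-ups mentioned above (not losing an incomplete final record, and writing the result with csv.writer) could look roughly like this. It is a sketch only: the function name merge_broken_rows is invented, the input is an in-memory list instead of standard input, and it assumes the separator never appears inside a quoted field.

```python
import csv
import io


def merge_broken_rows(lines, sep=";", fields=4):
    """Yield rows of `fields` columns, gluing prematurely broken lines together."""
    collected = []
    for line in lines:
        new = line.rstrip("\n").split(sep)
        if collected:
            # The break happened inside a field: glue the two halves.
            collected[-1] += new[0]
            collected.extend(new[1:])
        else:
            collected = new
        if len(collected) >= fields:
            yield collected
            collected = []
    if collected:
        # Emit an incomplete final record instead of dropping it.
        yield collected


broken = ["Col_a;Col_b;Col_c;Col_d\n", "2021;Foobar;Bla\n", ";Blub\n"]
out = io.StringIO()
writer = csv.writer(out, delimiter=";", lineterminator="\n")
writer.writerows(merge_broken_rows(broken))
print(out.getvalue(), end="")
```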
This is how I deal with this. This function fixes the line if there are more columns than needed or if there is a line break in the middle.
Parameters of the function are:
message - content of the file - reader.read() in your case
columns - number of expected columns
filename - filename (I use it for logging)
def pre_parse(message, columns, filename):
    parsed_message = []
    i = 0
    temp_line = ''  # holds a partial line until the rest of it arrives
    for line in message.splitlines():
        #print(line)
        split = line.split(',')
        if len(split) == columns:
            parsed_message.append(line)
        elif len(split) > columns:
            print(f'Line {i} has been truncated in file {filename} - too many columns')
            split = split[:columns]
            line = ','.join(split)
            parsed_message.append(line)
        elif len(split) < columns and temp_line == '':
            temp_line = line.replace('\n', '')
            print(temp_line)
        elif temp_line != '':
            line = temp_line + line
            if line.count(',') == columns - 1:
                print(f'Line {i} has been fixed in file {filename} - extra line feed')
                parsed_message.append(line)
                temp_line = ''
            else:
                temp_line = line.replace('\n', '')
        i += 1
    return parsed_message
Make sure you use the proper split character and the proper line feed character.

Remove lines from a file that contain strings from a list

I want to remove lines from a .txt file.
I want to keep a list of the strings to remove, but my code writes each line as many times as there are strings in the list. How can I avoid that?
file1 = open("base.txt", encoding="utf-8", errors="ignore")
Lines = file1.readlines()
file1.close()
not_needed = ['asd', '123', 'xyz']
row = 0
result = open("result.txt", "w", encoding="utf-8")
for line in Lines:
    for item in not_needed:
        if item not in line:
            row += 1
            result.write(str(row) + ": " + line)
So if a line contains a string from the list, it should be deleted.
The file should then be written out without those lines.
How do I do it?
Look at the logic in your for loop. For each line in Lines, it goes through every item in not_needed and writes the line whenever that item is not found in it. So a line is written once for every item it does not contain.
Try inverting the logic:
check whether the line contains any item from not_needed
if it does, do nothing
otherwise, write it
Expanded answer:
Here's what I think you are looking for:
for line in Lines:
    if not any(item in line for item in not_needed):
        row += 1
        result.write(str(row) + ": " + line)
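Put together as a complete script, the inverted logic might look like this. It is a sketch: the function name keep_lines and the sample lines are made up, and the commented usage reuses the file names from the question.

```python
def keep_lines(lines, not_needed):
    """Return numbered lines that contain none of the unwanted strings."""
    kept = []
    for line in lines:
        # any() is True as soon as one unwanted string occurs in the line.
        if any(item in line for item in not_needed):
            continue
        kept.append(f"{len(kept) + 1}: {line}")
    return kept


sample = ["keep me\n", "has asd inside\n", "abc 123\n", "also fine\n"]
result_lines = keep_lines(sample, ["asd", "123", "xyz"])
print("".join(result_lines), end="")

# With real files (names taken from the question):
# with open("base.txt", encoding="utf-8", errors="ignore") as src:
#     numbered = keep_lines(src, ["asd", "123", "xyz"])
# with open("result.txt", "w", encoding="utf-8") as dst:
#     dst.writelines(numbered)
```

Using with blocks for the files also makes sure result.txt is closed and flushed, which the original loop never did.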

How to check to see if a certain line is found before a certain point in a txt file?

I need to figure out if a certain phrase/line is found before another phrase takes place in a text file. If the phrase is found, I will pass, if it does not exist, I will add a line above the cutoff. Of note, the phrase can occur later in the document as well.
An example of what this txt format would be could be:
woijwoi
woeioasd
woaije
Is this found
owijefoiawjwfioj
This is the cutoff
asoi w
more text lines
Is this found
aoiw
The search should cut off on the phrase "This is the cutoff". It is unknown what line the cutoff will be on. If "Is this found" exists before the cutoff, pass. If it does not, I want to add the phrase "Adding a line" right above the cutoff to the output doc.
An example of the code I've tried so far, with all strings previously defined:
find = 'Is this found'
with open(longStr1) as old_file:
    lines = old_file.readlines()
with open(endfile1, "w") as new_file:
    for num, line in enumerate(lines):
        if "This is the" in line:
            base_num = num
        for num in range(1, base_num):
            if not find in line:
                if line.startswith("This is the"):
                    line = newbasecase + line
I am getting the error "name 'base_num' is not defined". Is there a better way to perform this search?
What about something like this? It looks up the index of both the find line and the cutoff line, then walks through the list of lines. At the cutoff it checks whether find occurred earlier, writes "Adding a line" if it did not, writes the cutoff line, and ends the new file.
find = "Is this found"
cutoff = "This is the cutoff"
with open(longStr1) as old_file:
    lines = old_file.readlines()
# readlines() keeps the trailing '\n', so compare against stripped copies
stripped = [line.rstrip('\n') for line in lines]
find_index = stripped.index(find) if find in stripped else -1
cutoff_index = stripped.index(cutoff) if cutoff in stripped else -1
with open(endfile1, "w") as new_file:
    for num, line in enumerate(lines):
        if cutoff in line:
            # add the line only if find does not occur before the cutoff
            if find_index == -1 or cutoff_index < find_index:
                new_file.write("Adding a line\n")
            new_file.write(line)
            break
        new_file.write(line)
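If the text after the cutoff should be kept in the output as well, the check can be done in a single pass. This is a sketch under that assumption, with the invented helper name insert_before_cutoff and in-memory lines instead of files.

```python
def insert_before_cutoff(lines, find, cutoff, extra):
    """Insert `extra` above the first cutoff line unless `find` precedes it."""
    out = []
    seen_find = False
    done = False
    for line in lines:
        if not done and cutoff in line:
            if not seen_find:
                out.append(extra)
            done = True  # later occurrences are left alone
        elif not done and find in line:
            seen_find = True
        out.append(line)
    return out


lines = ["woijwoi\n", "This is the cutoff\n", "Is this found\n", "aoiw\n"]
result = insert_before_cutoff(
    lines, "Is this found", "This is the cutoff", "Adding a line\n"
)
print("".join(result), end="")
```

Because the phrase is only tracked until the cutoff is reached, a later "Is this found" does not affect the decision.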

How to remove extra space from end of the line before newline in python?

I'm quite new to python. I have a program which reads an input file with different characters and then writes all unique characters from that file into an output file with a single space between each of them. The problem is that after the last character there is one extra space (before the newline). How can I remove it?
My code:
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
    for c in line:
        if c not in result:
            result.append(c)
            outfile.write(c.strip())
            if(c == ' '):
                pass
            else:
                outfile.write(' ')
outfile.write('\n')
With the line outfile.write(' '), you write a space after each character (unless the character is a space). So you'll have to avoid writing the last space. Now, you can't tell whether any given character is the last one until you're done reading, so it's not like you can just put in an if statement to test that, but there are a few ways to get around that:
Write the space before the character c instead of after it. That way the space you have to skip is the one before the first character, and that you definitely can identify with an if statement and a boolean variable. If you do this, make sure to check that you get the right result if the first or second c is itself a space.
Alternatively, you can avoid writing anything until the very end. Just save up all the characters you see - you already do this in the list result - and write them all in one go. You can use
' '.join(strings)
to join together a list of strings (in this case, your characters) with spaces between them, and this will automatically omit a trailing space.
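A minimal sketch of the join approach, using in-memory lines instead of files. The helper name unique_chars is made up, and it assumes whitespace characters (spaces and newlines) should not appear in the output at all.

```python
def unique_chars(lines):
    """Return the distinct non-whitespace characters in first-seen order."""
    seen = []
    for line in lines:
        for c in line:
            if not c.isspace() and c not in seen:
                seen.append(c)
    return seen


sample = ["hello\n", "world\n"]
joined = " ".join(unique_chars(sample))
print(joined)
```

With real files, the same call would be outfile.write(' '.join(unique_chars(infile)) + '\n').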
Why are you adding that if block on the end?
Your program is adding the extra space on the end.
import sys
inputName = sys.argv[1]
outputName = sys.argv[2]
infile = open(inputName,"r",encoding="utf-8")
outfile = open(outputName,"w",encoding="utf-8")
result = []
for line in infile:
    charno = 0
    for c in line:
        if c not in result:
            result.append(c)
            outfile.write(c.strip())
            charno += 1
            if (c == ' '):
                pass
            elif charno >= len(line):
                pass
            else:
                outfile.write(' ')
outfile.write('\n')
