I've been trying to write some code to read a CSV file. Some of the lines in the CSV are not complete. I would like the code to skip a bad line if there is data missing in one of the fields. I'm using the following code.
def Test():
    dataFile = open('test.txt', 'r')
    readFile = dataFile.read()
    lineSplit = readFile.split('\n')
    for everyLine in lineSplit:
        dividedLine = everyLine.split(';')
        a = dividedLine[0]
        b = dividedLine[1]
        c = dividedLine[2]
        d = dividedLine[3]
        e = dividedLine[4]
        f = dividedLine[5]
        g = dividedLine[6]
        print(a, b, c, d, e, f, g)
In my opinion, the Pythonic way to do this would be to use the included csv module in conjunction with a try/except block (while following PEP 8 - Style Guide for Python Code).
import csv

def test():
    # text mode with newline='' is what csv.reader expects in Python 3;
    # the sample data is semicolon-separated, so tell the reader so
    with open('reading_test.txt', 'r', newline='') as data_file:
        for line in csv.reader(data_file, delimiter=';'):
            try:
                a, b, c, d, e, f, g = line
            except ValueError:
                continue  # ignore the line
            print(a, b, c, d, e, f, g)

test()
This approach is called "It's Easier to Ask Forgiveness than Permission" (EAFP). The other, more common style is referred to as "Look Before You Leap" (LBYL). Both terms are defined in the Python glossary.
Given that you cannot know beforehand whether a given line is incomplete, you need to check whether it is complete and skip it if it is not. You can use continue for this, which makes the for loop move on to the next iteration:
def Test():
    dataFile = open('test.txt', 'r')
    readFile = dataFile.read()
    lineSplit = readFile.split('\n')
    for everyLine in lineSplit:
        dividedLine = everyLine.split(';')
        if len(dividedLine) != 7:
            continue
        a = dividedLine[0]
        b = dividedLine[1]
        c = dividedLine[2]
        d = dividedLine[3]
        e = dividedLine[4]
        f = dividedLine[5]
        g = dividedLine[6]
        print(a, b, c, d, e, f, g)
This doesn't seem all that Python-related so much as conceptual: a line parsed from a CSV row will be invalid if:
1. It is shorter than the minimum required length (i.e. missing elements)
2. One or more entries parsed come back empty or None (only if all elements are required)
3. The type of an element doesn't match the intended type of the column (not in the scope of what you requested, but good to keep in mind)
In Python, once you have split the line, you can check the first two conditions with
if len(dividedLine) < intended_length or ("" in dividedLine): continue
The first part just needs you to know the intended length for a row; you can usually use the header row for that. The second part could have the quotes replaced with None or something similar, but split returns an empty string for a missing field, so in this case use "".
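For the third condition, here is a minimal sketch combining all three checks; the intended length of 7 and the assumption that the second column holds a number are for illustration only:
intended_length = 7
for everyLine in lineSplit:
    dividedLine = everyLine.split(';')
    # conditions 1 and 2: wrong length, or an empty field
    if len(dividedLine) < intended_length or "" in dividedLine:
        continue
    # condition 3: type check on a column expected to hold a number
    try:
        float(dividedLine[1])
    except ValueError:
        continue
    print(dividedLine)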
HTH
So I know similar questions have been asked before, but every method I have tried is not working...
Here is the ask: I have a text file (which is a log file) that I am parsing for any occurrence of "app.task2". The following are the 2 scenarios that can occur (as they appear in the text file, independent of my code):
Scenario 1:
Mar 23 10:28:24 dasd[116] <Notice>: app.task2.refresh:556A2D:[
{name: ApplicationPolicy, policyWeight: 50.000, response: {Decision: Can Proceed, Score: 0.45}}
] sumScores:68.785000, denominator:96.410000, FinalDecision: Can Proceed FinalScore: 0.713463}
Scenario 2:
Mar 23 10:35:56 dasd[116] <Notice>: 'app.task2.refresh:C6C2FE' CurrentScore: 0.636967, ThresholdScore: 0.410015 DecisionToRun:1
The problem I am facing is that with my current code below, I am not getting the entire log entry in the first case: it only pulls the first line of the entry, not the remainder, because reading stops at the newline that occurs after ":[".
My Code:
all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            all.append(line)
print(all)
How can I get the entire log entry for the first case? I tried stripping escape characters with no luck. From here I should be able to parse the list of results returned for what I truly need, but this will help! ty!
OF NOTE: I need to be able to locate these types of log entries (which will then give us either scenario 1 or scenario 2) by the string "app.task2". So this needs to be incorporated, like in my example...
Before adding the line to all, check if it ends with [. If it does, keep reading and merge the lines until you get to ].
import re

all = []
with open(path_to_log) as f:
    for line in f:
        if "app.task2" in line:
            if re.search(r'\[\s*$', line):  # start of multiline log message
                for line2 in f:
                    line += line2
                    if re.search(r'^\s*\]', line2):  # end of multiline log message
                        break
            all.append(line)
print(all)
You are iterating over each line individually, which is why you only get the first line in scenario 1.
Either you can add a counter like this:
all = []
count = -1
with open(path_to_log) as f:
    for line in f:
        if count > 0:
            all.append(line)
            count -= 1
            if count == 0:
                # merge the header line and its two continuation lines
                tmp = all[-3:]
                del all[-3:]
                all.append("".join(tmp))  # the lines already end with "\n"
            continue
        if "app.task2" in line:
            all.append(line)
            if line.endswith('[\n'):
                count = 2  # a scenario-1 entry spans two more lines
print(all)
In this case I think Barmar's solution would work just as well.
Or, preferably, you can write the log with some distinct delimiter between entries and simply split the log file on that delimiter.
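A minimal sketch of that idea, assuming a hypothetical delimiter line "---END---" was written between entries:
all_entries = []
with open(path_to_log) as f:
    # split the whole file into entries on the delimiter
    for entry in f.read().split("---END---"):
        if "app.task2" in entry:
            all_entries.append(entry.strip())
print(all_entries)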
I like @Barmar's solution with nested loops on the same file object, and may use that technique in the future. But prior to seeing it, I would have done this with a single loop, which may or may not be more readable:
all = []
keep = False
for line in open(path_to_log, "rt"):
    if "app.task2" in line:
        all.append(line)
        keep = line.rstrip().endswith("[")
    elif keep:
        all.append(line)
        keep = not line.lstrip().startswith("]")
print(all)
Or you can print it more nicely with:
print(*all,sep='\n')
I have a text file in the format below, all on a single line:
username:password;username1:password1;username2:password2;
etc.
What I have tried so far is
with open('list.txt') as f:
    d = dict(x.rstrip().split(None, 1) for x in f)
but I get an error saying that the length is 1 and 2 is required, which indicates the file is not being read as key:value pairs.
Is there any way to fix this or should I just reformat the file in another way?
Thanks for your answers. What I've got so far is:
with open('tester.txt') as f:
    password_list = dict(x.strip(":").split(";", 1) for x in f)
    for user, password in password_list.items():
        print(user + " - " + password)
The result comes out as username:password - username1:password1.
What I need is to split each username:password pair so that key = username and value = password.
Since variable f in this case is a file object and not a list, the first thing to do would be to get the lines from it. You could use the https://docs.python.org/2/library/stdtypes.html?highlight=readline#file.readlines* method for this.
Furthermore, I think I would use split with the semicolon (";") as the parameter. This will provide you with a list of "username:password" strings, provided your entire file looks like this.
I think you will figure out what to do after that.
EDIT
* I assumed you were using Python 2.7 for some reason. In version 3.x you might want to look at the distutils.text_file class (https://docs.python.org/3.7/distutils/apiref.html?highlight=readlines#distutils.text_file.TextFile.readlines).
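For what it's worth, a minimal sketch of the approach described above, with the file name assumed from the question:
with open('list.txt') as f:
    line = f.readlines()[0].rstrip()
pairs = [p for p in line.split(';') if p]  # drop the trailing empty entry
password_list = dict(p.split(':', 1) for p in pairs)
print(password_list)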
1. Load the text of the file in Python with open() and read() as a string
2. Apply split(";") to that string to create a list like [username:password, username1:password1, username2:password2]
3. Do a dict comprehension where you apply split(":") to each item of the above list to split those pairs.
with open('list.txt', 'rt') as f:
    raw_data = f.readlines()[0]
list_data = raw_data.split(';')
# the "if x" skips the empty string produced by the trailing semicolon
user_dict = {x.split(':')[0]: x.split(':')[1] for x in list_data if x}
print(user_dict)
Dictionary comprehension is useful here.
A short dict comprehension pulls all the info out of the text file, as requested. Hope your tutor is impressed. Ask him how it works and see what he says. Maybe update your question to include his response.
If you want me to explain, feel free to comment and I shall go into more detail.
The error you're probably getting:
ValueError: dictionary update sequence element #3 has length 1; 2 is required
is because the text line ends with a semicolon. Splitting it on semicolons then results in a list that contains some pairs, and an empty string:
>>> "username:password;username1:password1;username2:password2;".split(";")
['username:password', 'username1:password1', 'username2:password2', '']
Splitting the empty string on colons then results in a single empty string, rather than two strings.
To fix this, filter out the empty string. One example of doing this would be
[element for element in x.split(";") if element != ""]
In general, I recommend you do the work one step at a time and assign to intermediary variables.
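For example, a sketch of that step-by-step style, with each stage in its own variable (file name assumed):
with open('list.txt') as f:
    text = f.read().rstrip()
elements = [element for element in text.split(';') if element != '']  # drop the empty string
pairs = [element.split(':', 1) for element in elements]               # split each pair once
password_dict = dict(pairs)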
Here's a simple (but long) answer. You need to get the line from the file, and then split it and the items resulting from the split:
results = {}
with open('file.txt') as file:
    for line in file:
        # Only one line, but that's fine
        entries = line.split(';')
        for entry in entries:
            if entry != '':
                # The last item in entries will be blank, due to how split works in this example
                user, password = entry.split(':')
                results[user] = password
Try this.
f = open('test.txt').read()
data = f.split(";")
d = {}
for i in data:
    if i:
        value = i.split(":")
        d.update({value[0]: value[1]})
print(d)
This is data from a lab experiment (around 717 lines of data). Rather than trying to do it in Excel, I want to import and graph it in either Python or MATLAB. I'm new here btw... and am a student!
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
more numbers : see Screenshot of more data from my file
I just can't figure out how to read the line up until a comma. Specifically, I need the Load numbers for one of my arrays/list, so for example on the first line I only need 62.638 (which would be the first number on my first index on my list/array).
How can I get an array/list of this, something that iterates/reads the list and ignores strings?
Thanks!
NOTE: I use Anaconda + Jupyter Notebooks for Python & Matlab (school provided software).
EDIT: Okay, so I came home today and worked on it again. I hadn't dealt with CSV files before, but after some searching I was able to learn how to read my file, somewhat.
import csv
from itertools import islice

with open('Blue_bar_GroupD.txt', 'r') as BB:
    BB_csv = csv.reader(BB)
    x = 0
    BB_lb = []
    while x < 7:  # to skip the string data
        next(BB_csv)
        x += 1
    for row in islice(BB_csv, 0, 758):
        print(row[0])  # testing if I can read row data
Okay, here is where I am stuck. I want to make an array/list that has the 0th-index value of each row. Sorry if I'm a freaking noob!
Thanks again!
You can skip all lines up to the first data row and then parse the data into a list for later use - 700+ lines can easily be processed in memory.
Therefore you need to:
read the file line by line
remember the last non-empty line before the numbers start (== header)
check whether a line consists only of numbers, commas and dots; otherwise increase a skip counter (== data)
seek back to 0
skip enough lines to get to the header or the data
read the rest into a data structure
Create test file:
text = """
""
"Test Methdo","exp-l Tensile with Extensometer.msm"
"Sample I.D.","Sample108.mss"
"Speciment Number","1"
"Load (lbf)","Time (s)","Crosshead (in)","Extensometer (in)"
62.638,0.900,0.000,0.00008
122.998,1.700,0.001,0.00012
"""
with open ("t.txt","w") as w:
w.write(text)
Some helpers and the skipping/reading logic:
import re
import csv

def convert_row(row):
    """Convert one row of data into a list of mixed floats and others.
    Float is the preferred data type, else string is used - no other tried."""
    d = []
    for v in row:
        try:
            # convert to float and add
            d.append(float(v))
        except ValueError:
            # not a float, append as is
            d.append(v)
    return d

def count_to_first_data(fh):
    """Count lines in fh not consisting of numbers, dots and commas.
    Side effect: will reset position in fh to 0."""
    skiplines = 0
    header_line = 0
    fh.seek(0)
    for line in fh:
        if re.match(r"^[\d.,]+$", line):
            fh.seek(0)
            return skiplines, header_line
        else:
            if line.strip():
                header_line = skiplines
            skiplines += 1
    raise ValueError("File does not contain pure number rows!")
Usage of helpers / data conversion:
data = []
with open("t.txt", "r") as csvfile:
    skip_to_data, skip_to_header = count_to_first_data(csvfile)
    for _ in range(skip_to_header):  # use skip_to_data if you do not want the headers
        next(csvfile)
    reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in reader:
        row_data = convert_row(row)
        if row_data:
            data.append(row_data)
print(data)
Output (reformatted):
[['Load (lbf)', 'Time (s)', 'Crosshead (in)', 'Extensometer (in)'],
[62.638, 0.9, 0.0, 8e-05],
[122.998, 1.7, 0.001, 0.00012]]
Docs:
re.match
csv.reader
Methods of file objects (e.g. seek())
With this you now have "clean" data that you can use for further processing - including your headers.
For visualization you can have a look at matplotlib
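For instance, a minimal plotting sketch, assuming data as produced above (first row holds the headers, the remaining rows hold the numeric values):
import matplotlib.pyplot as plt

headers, rows = data[0], data[1:]
time = [row[1] for row in rows]  # "Time (s)" column
load = [row[0] for row in rows]  # "Load (lbf)" column
plt.plot(time, load)
plt.xlabel(headers[1])
plt.ylabel(headers[0])
plt.show()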
I would recommend reading your file with Python:
data = []
with open('my_txt.txt', 'r') as fd:
    # Suppress header lines
    for i in range(6):
        fd.readline()
    # Read each data line up to the first comma (first column)
    for line in fd:
        index = line.find(',')
        if index >= 0:
            data.append(float(line[0:index]))
This leads to a list containing the data of the first column:
>>> data
[62.638, 122.998]
The MATLAB solution is less nice, since you have to know the number of data lines in your file (which you do not need to know in the Python solution):
n_header = 6
n_lines = 2 % Insert here 717 (as you mentioned)
M = csvread('my_txt.txt', n_header, 0, [n_header 0 n_header+n_lines-1 0])
leads to:
>> M
M =
62.6380
122.9980
For the sake of clarity: you can also use MATLAB's textscan function to achieve what you want without knowing the number of lines, but still, the Python code would be the better choice in my opinion.
Based on your format, you will need to do 3 steps: one, read all lines; two, determine which lines to use; last, get the floats and assign them to a list.
Assuming your file name is name.txt, try:
f = open("name.txt", "r")
all_lines = f.readlines()
grid = []
for line in all_lines:
if ('"' not in line) and (line != '\n'):
grid.append(list(map(float, line.strip('\n').split(','))))
f.close()
The grid will then contain a series of lists containing your group of floats.
Explanation for fun:
In the for loop, I searched for the double quote to eliminate any string line, as all strings are enclosed in quotes. The other condition skips empty lines.
Based on your needs, you can use the list grid as you please. For example, to fetch the first line's first number, do
grid[0][0]
as Python's lists count from 0 to n-1 for n elements.
This is super simple in Matlab, just 2 lines:
data = dlmread('data.csv', ',', 6,0);
column1 = data(:,1);
Where 6 and 0 should be replaced by the row and column offset you want. So in this case, the data starts at row 7 and you want all the columns, then just copy over the data in column 1 into another vector.
As another note, try typing doc dlmread in matlab - it brings up the help page for dlmread. This is really useful when you're looking for matlab functions, as it has other suggestions for similar functions down the bottom.
I have an issue with a bit of code that works in Python 3, but fails in 2.7. I have the following part of code:
def getDimensions(file, log):
    noStations = 0
    noSpanPts = 0
    dataSet = False
    if log:
        print("attempting to retrieve dimensions. Opening file", file)
    while not dataSet:
        try:  # read until error occurs
            string = file.readline().rstrip()  # to avoid breaking on an empty line
        except IOError:
            break
        # stations
        if "Ax dist hub" in string:  # parse out number of stations
            if log:
                print("found ax dist hub location")
            next(file)  # skip empty line
            eos = False  # end of stations
            while not eos:
                string = file.readline().rstrip()
                if string == "":
                    eos = True
                else:
                    noStations = int(string.split()[0])
This returns an error:
ValueError: Mixing iteration and read methods would lose data.
I understand that the issue is how I read my string in the while loop, or at least that is what I believe. Is there a quick way to fix this? Any help is appreciated. Thank you!
The problem is that you are using next and readline on the same file. As the docs say:
As a consequence of using a read-ahead buffer, combining next() with other file methods (like readline()) does not work right.
The fix is trivial: replace next with readline.
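Concretely, the offending line in the loop above becomes:
file.readline()  # skip empty line (was: next(file))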
If you want short code to do that, try:
with open(filename) as f:
    lines = [line for line in f if line.strip()]
Then you can do your tests on lines.
I'm currently doing a project for class and I need a little advice/help. I have a csv file that I'm extracting data from. (I am not using the csv module because I'm not familiar with it, and the instructor warned us that it's complicated.) I've gotten the data into lists using a function I created. It works fine if the values are just strings of numbers, but if there is a percent sign or 'N/A' in the cell, then I get an error. Here is the code:
def get_values(file, index):
    '''(file object, int) -> list
    Return a list of states and corresponding values at a particular index in file.'''
    values_list = []
    for i in range(6):
        file.readline()
    for line in file:
        line_list = line.split(',')
        values_list.append(line_list[index])
    values_list = [i.rstrip('%') for i in values_list]
    values_list = [float(i) for i in values_list]
    return values_list
while True:
    try:
        file_name = input('Enter in file name: ')
        input_file = open(file_name, 'r')
        break
    except IOError:
        print('File not found.')

heart_list = get_values(input_file, 1)
input_file.close()
input_file = open('riskfactors.csv', 'r')
HIV_list = get_values(input_file, 8)
input_file.close()
I would like to strip the %, but nothing I've tried has worked so far. Any suggestions?
Without seeing a complete SSCCE with sample inputs, it's hard to be sure, but I'm willing to bet the problem is this:
values_list = [i.rstrip('%') for i in values_list]
That will strip any '%' characters off the end of each value, but it won't strip any '%' characters anywhere else. And in a typical CSV file, that isn't good enough.
My guess is that you have a line like this:
foo , 10% , bar
This will split into:
['foo ', ' 10% ', ' bar\n']
So, you add the ' 10% ' to values_list, and the rstrip line will do nothing, because it doesn't end with a '%', it ends with a ' '.
Or, alternatively, it may just be this:
foo,bar,10%
So you get this:
['foo', 'bar', '10%\n']
… which has the same problem.
If this (either version) is the problem, what you want to do is something like:
values_list = [i.strip().rstrip('%') for i in values_list]
Meanwhile, you can make this a lot simpler by just getting rid of the list comprehension. Why try to fix every row after the fact, when you can fix the single values as you add them? For example:
for line in file:
    line_list = line.split(',')
    value = line_list[index]
    value = value.rstrip('%')
    value = float(value)
    values_list.append(value)
return values_list
And now, things are simple enough that you can merge multiple lines without making it less readable.
Of course you still need to deal with 'N/A'. The question is whether you want to treat that as 0.0, or None, or skip it over, or do something different, but whatever you decide, you might consider using try around the float instead of checking for 'N/A', to make your code more robust. For example:
value = value.rstrip('%')
try:
    value = float(value)
except ValueError as e:
    # maybe log the error, or log the error only if not N/A, or...
    pass  # or values_list.append(0.0), or whatever
else:
    values_list.append(value)
By the way, dealing with this kind of stuff is exactly why you should use the csv module.
Here's how you use csv. Instead of this:
for line in file:
    line_list = line.split(',')
Just do this:
for line_list in csv.reader(file):
That's complicated?
And it takes care of all of the subtleties with stripping whitespace (and quoting and escaping and all kinds of other nonsense that you'll forget to test for).
In other words, most likely, if you'd used csv, besides saving one line of code, you wouldn't have had this problem in the first place—and the same would be true for 8 of the next 10 problems you're going to run into.
But if you're learning from an instructor who thinks csv is too complicated… well, it's a good thing you're motivated enough to try to figure things out for yourself and ask questions outside of class, so there's some hope…