Parsing semi-structured text strings in Python

I am trying to parse pseudo-English scripts and want to convert them into another, machine-readable language.
However, the scripts have been written by many people in the past, and each had their own style of writing.
Some examples would be:
On Device 1 Set word 45 and 46 to hex 331
On Device 1 set words 45 and 46 bits 3..7 to 280
on Device 1 set word 45 to oct 332
on device 1 set speed to 60kts Words 3-4 to hex 34
(there are many more different ways used in the source text)
The issue is that the wording is not always logical or consistent.
I have looked at regular expressions and matching certain words. This works out OK, but then I need to know the next word (e.g. for 'Word 24' I would match 'Word' and then try to figure out whether the next token is a number or not). In the case of 'Words' I need to find all the words to set, as well as their values.
In example 1, it should produce Set word 45 to hex 331 and Set word 46 to hex 331,
or, if possible, Set word 45 to hex 331 and word 46 to hex 331.
I tried using re's findall method - that only gives me the matched words, and then I have to find the next word (i.e. the value) manually.
Alternatively, I could split the string on spaces and process each word manually, then do something like the following,
assuming the list is
['On', 'device1:', 'set', 'Word', '1', '', 'to', '88', 'and', 'word', '2', 'to', '2151']
import re

for i in range(len(sp)):
    rew = re.search("[Ww]ord", sp[i])
    if rew:
        print("Found word, next val is", sp[i + 1])
Is there a better way to do what I want? I looked a little into tokenizing, but I am not sure that would work, as the language is not structured in the first place.

I suggest you develop a program that gradually explores the syntax that people have used to write the scripts.
E.g., each instruction in your examples seems to break down into a device-part and a settings-part. So you could try matching each line against the regex ^(.+) set (.+). If you find lines that don't match that pattern, print them out. Examine the output, find a general pattern that matches some of them, add a corresponding regex to your program (or modify an existing regex), and repeat. Proceed until you've recognized (in a very general way) every line in your input.
(Since capitalization appears to be inconsistent, you can either do case-insensitive matches, or convert each line to lowercase before you start processing it. More generally, you may find other 'normalizations' that simplify subsequent processing. E.g., if people were inconsistent about spaces, you can convert every run of whitespace characters into a single space.)
(If your input has typographical errors, e.g. someone wrote "ste" for "set", then you can either change the regex to allow for that (... (set|ste) ...), or go to (a copy of) the input file and just fix the typo.)
Then go back to the lines that matched ^(.+) set (.+), print out just the first group for each, and repeat the above process for just those substrings.
Then repeat the process for the second group in each "set" instruction. And so on, recursively.
Eventually, your program will be, in effect, a parser for the script language. At that point, you can start to add code to convert each recognized construct into the output language.
Depending on your experience with Python, you can find ways to make the code concise.
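For illustration, a rough sketch of that exploratory loop might look like the following; the file name and the single starting pattern are placeholders that you would grow and refine as unrecognised lines turn up:

import re

# hypothetical starting point: one very general pattern per construct recognised so far
patterns = [
    re.compile(r'^on\s+device\s+(\d+)\s+set\s+(.+)$', re.IGNORECASE),
]

with open('scripts.txt') as f:                    # placeholder file name
    for raw in f:
        line = ' '.join(raw.split()).lower()      # normalise whitespace and case
        for pat in patterns:
            m = pat.match(line)
            if m:
                device, settings = m.groups()
                # hand `settings` to the next, more specific layer of patterns here
                break
        else:
            print('UNRECOGNISED:', raw.rstrip())  # examine these, add or adjust a pattern, rerun

Each run shrinks the list of unrecognised lines until every line is accounted for.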

Depending on what you actually want from these strings, you could use a parser, e.g. parsimonious:
from parsimonious.nodes import NodeVisitor
from parsimonious.grammar import Grammar
grammar = Grammar(
r"""
command = set operand to? number (operator number)* middle? to? numsys? number
operand = (~r"words?" / "speed") ws
middle = (~r"[Ww]ords" / "bits")+ ws number
to = ws "to" ws
number = ws ~r"[-\d.]+" "kts"? ws
numsys = ws ("oct" / "hex") ws
operator = ws "and" ws
set = ~"[Ss]et" ws
ws = ~r"\s*"
"""
)
class HorribleStuff(NodeVisitor):
    def __init__(self):
        self.cmds = []

    def generic_visit(self, node, visited_children):
        pass

    def visit_operand(self, node, visited_children):
        self.cmds.append(('operand', node.text))

    def visit_number(self, node, visited_children):
        self.cmds.append(('number', node.text))
examples = ['Set word 45 and 46 to hex 331',
            'set words 45 and 46 bits 3..7 to 280',
            'set word 45 to oct 332',
            'set speed to 60kts Words 3-4 to hex 34']

for example in examples:
    tree = grammar.parse(example)
    hs = HorribleStuff()
    hs.visit(tree)
    print(hs.cmds)
This would yield
[('operand', 'word '), ('number', '45 '), ('number', '46 '), ('number', '331')]
[('operand', 'words '), ('number', '45 '), ('number', '46 '), ('number', '3..7 '), ('number', '280')]
[('operand', 'word '), ('number', '45 '), ('number', '332')]
[('operand', 'speed '), ('number', '60kts '), ('number', '3-4 '), ('number', '34')]
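If the goal is then to expand the parsed pieces into one explicit command per word, as in the first desired output above, a rough post-processing pass over such a cmds list could look like the sketch below. It assumes that the last number is the value, that every earlier number is a word index, and that the number system ('hex' here) is tracked separately; that holds for the first example, but the other styles would need extra handling:

def expand(cmds, base='hex'):
    # cmds as produced by HorribleStuff for one line
    numbers = [text.strip() for kind, text in cmds if kind == 'number']
    words, value = numbers[:-1], numbers[-1]
    return ' and '.join('Set word {} to {} {}'.format(w, base, value) for w in words)

print(expand([('operand', 'word '), ('number', '45 '), ('number', '46 '), ('number', '331')]))
# Set word 45 to hex 331 and Set word 46 to hex 331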


How to extract set of substrings from a paragraph of string

Say I have a string:
output='[{ "id":"b678792277461" ,"Responses":{"SUCCESS":{"sh xyz":"sh xyz\\n Name Age Height Weight\\n Ana \\u003c15 \\u003e 163 47\\n 43\\n DEB \\u003c23 \\u003e 155 \\n Grey \\u003c53 \\u003e 143 54\\n 63\\n Sch#"},"FAILURE":{},"BLACKLISTED":{}}}]'
This is just an example, but I have a much longer output, which is the response from an API call.
I want to extract all the names (Ana, DEB, Grey) and put them in a separate list.
How can I do it?
json_data = json.loads(output)
json_data = [{'id': 'b678792277461', 'Responses': {'SUCCESS': {'sh xyz': 'sh xyz\n Name Age Height Weight\n Ana <15 > 163 47\n 43\n DEB <23 > 155 \n Grey <53 > 143 54\n 63\n Sch#'}, 'FAILURE': {}, 'BLACKLISTED': {}}}]
1) I have tried re.findall('\\n(.+)\\u', output),
but this didn't work; it complains about an "incomplete sequence u".
2)
start = output.find('\\n')
end = output.find('\\u', start)
x=output[start:end]
But I couldn't figure out how to run this piece of code in a loop to extract the names.
Thanks
The \u in your pattern is not a literal that can be matched; re treats it as the start of a Unicode escape sequence, which is why you get the error. The following regex works, but it is a bit quirky. It looks at the beginning of each line, except the first one, and captures the letters up to the first space.
import re

output = json_data[0]['Responses']['SUCCESS']['sh xyz']
pattern = r"\n\s*([a-z]+)\s+"
result = re.findall(pattern, output, re.M | re.I)
# ['Name', 'Ana', 'DEB', 'Grey']
Explanation of the pattern:
start at a new line (\n)
skip all spaces, if any (\s*)
collect one or more letters ([a-z]+)
skip at least one space (\s+)
Unfortunately, "Name" is also recognized as a name. If you know that it is always present in the first line, slice the list of the results:
result[1:]
#['Ana', 'DEB', 'Grey']
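Putting both steps together on the raw output string from the question (assuming, as above, that the interesting text always sits under Responses -> SUCCESS -> 'sh xyz'):

import json
import re

json_data = json.loads(output)  # output is the raw API response string
text = json_data[0]['Responses']['SUCCESS']['sh xyz']
names = re.findall(r"\n\s*([a-z]+)\s+", text, re.M | re.I)[1:]  # drop the 'Name' header
print(names)
# ['Ana', 'DEB', 'Grey']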
I use regexr.com and play around with the regular expression until I get it right, and then convert that into Python.
https://regexr.com/
I'm assuming the \n is the newline character here and I'll bet your \u error is caused by a line break. To use the multiline match in Python, you need to use that flag when you compile.
\n(.*)\n - this will be greedy and grab as much as possible (in the example it would grab everything from \nAna through 54\n).
[{ "id":"678792277461" ,"Responses": {Name Age Height Weight\n Ana \u00315 \u003163 47\n 43\n Deb \u00323 \u003155 60 \n Grey \u00353 \u003144 54\n }]
import re

a = re.compile(r"\\n(.*)\\n", re.MULTILINE)  # \n in the data is a literal two-character sequence
m = a.search(source)                         # source is the string shown above
match = m.group(1).split(r"\n")
# match[0] should be " Ana \u00315 \u003163 47"
# match[2] should be " Deb \u00323 \u003155 60" etc.

Removing chars/signs from string

I'm preparing text for a word cloud, but I'm stuck.
I need to remove all digits and all signs like . , - ? = / ! # etc., but I don't know how. I don't want to call replace again and again. Is there a method for that?
Here is my concept and what I have to do:
Concatenate texts in one string
Set chars to lowercase <--- I'm here
Now I want to delete specific signs and divide the text into words (list)
calculate freq of words
next do the stopwords script...
abstracts_list = open('new', 'r')
abstracts = []
allab = ''
for ab in abstracts_list:
    abstracts.append(ab)
for ab in abstracts:
    allab += ab
Lower = allab.lower()
Text example:
MicroRNAs (miRNAs) are a class of noncoding RNA molecules
approximately 19 to 25 nucleotides in length that downregulate the
expression of target genes at the post-transcriptional level by
binding to the 3'-untranslated region (3'-UTR). Epstein-Barr virus
(EBV) generates at least 44 miRNAs, but the functions of most of these
miRNAs have not yet been identified. Previously, we reported BRUCE as
a target of miR-BART15-3p, a miRNA produced by EBV, but our data
suggested that there might be other apoptosis-associated target genes
of miR-BART15-3p. Thus, in this study, we searched for new target
genes of miR-BART15-3p using in silico analyses. We found a possible
seed match site in the 3'-UTR of Tax1-binding protein 1 (TAX1BP1). The
luciferase activity of a reporter vector including the 3'-UTR of
TAX1BP1 was decreased by miR-BART15-3p. MiR-BART15-3p downregulated
the expression of TAX1BP1 mRNA and protein in AGS cells, while an
inhibitor against miR-BART15-3p upregulated the expression of TAX1BP1
mRNA and protein in AGS-EBV cells. Mir-BART15-3p modulated NF-κB
activity in gastric cancer cell lines. Moreover, miR-BART15-3p
strongly promoted chemosensitivity to 5-fluorouracil (5-FU). Our
results suggest that miR-BART15-3p targets the anti-apoptotic TAX1BP1
gene in cancer cells, causing increased apoptosis and chemosensitivity
to 5-FU.
So to turn upper-case characters into lower-case characters you could do the following:
just store your text in a string variable, for example STRING, and then use
STRING = STRING.lower()
Now your string will be free of capital letters.
To remove the special characters, the re module can help you with its sub function:
import re
STRING = re.sub('[^a-zA-Z0-9-_*.]', ' ', STRING)
With this command your string will be free of most special characters (the character class above still keeps -, _, * and .).
And to determine the word frequencies you can use the Counter class, imported from the collections module.
Then use the following command to determine the frequency with which the words occur:
Counter(STRING.split()).most_common()
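Put together, a minimal version of that pipeline (starting from the allab string built in the question, and dropping digits as well, as the question asks) could look like this:

import re
from collections import Counter

text = allab.lower()                      # lowercase everything
text = re.sub(r'[^a-z\s]', ' ', text)     # replace digits and signs with spaces
words = text.split()                      # split into a list of words
print(Counter(words).most_common(20))     # the 20 most frequent words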
I'd probably try to use string.isalpha():
abstracts = []
with open('new', 'r') as abstracts_list:
    for ab in abstracts_list:  # this gives one line of text.
        if not ab.isalpha():
            # keep letters and whitespace so the words stay separated
            ab = ''.join(c for c in ab if c.isalpha() or c.isspace())
        abstracts.append(ab.lower())

# now assuming you want the text in one big string like allab was
long_string = ''.join(abstracts)

identifying strings which can't be spelt in a list item

I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings which can't be spelt in the list item and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk

spell_dict = enchant.Dict('en_US')  # or whatever language supported

def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w) / 5 + 2  # just for example, allowing around 1 typo per 5 chars.

def check_word(word):
    if spell_dict.check(word):
        return True  # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add a case normalisation and a filter for digits and you'll get a pretty good heuristics.
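With check_word in place, one rough way to filter the original list (splitting each item on whitespace, skipping tokens that contain digits, and keeping only tokens the checker accepts) might be:

data = ['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
        'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ']

def is_spellable(token):
    if any(ch.isdigit() for ch in token):
        return False
    return check_word(token.lower())

cleaned = [' '.join(t for t in item.split() if is_spellable(t)) for item in data]
# e.g. 'birthday JEaM8Lg9oK4 ' becomes 'birthday'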
It is entirely possible to compare your list members against a list of words that you believe to be valid for your input.
This can be done in many ways, partially depending on your definition of "properly spelled" and what you end up using for a comparison list. If you decide that numbers preclude an entry from being valid, or underscores, or mixed case, you could test for regex matching.
Post regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With these above criteria decided, it's just a decision of what word, proper name, and common slang list to use and a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.

fixing words with spaces using a dictionary look up in python?

I have extracted the list of sentences from a document. I am pre-processing this list of sentences to make it more sensible. I am faced with the following problem
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a lookup dictionary to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straightforward task in text preprocessing. I need some pointers towards approaches to look at. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all the string parts you have to get a single string and then run a segmentation algorithm on it.
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since this kind of spurious word break is inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
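For illustration, here is one rough reading of that greedy idea in code. It uses a prefix test rather than a full-word test (which is where the trie would help), and a tiny hand-made vocabulary stands in for a real dictionary:

def greedy_join(fragments, words):
    def is_prefix(s):
        return any(w.startswith(s) for w in words)  # a trie would make this lookup fast

    out, current = [], fragments[0]
    for frag in fragments[1:]:
        candidate = current + frag
        if is_prefix(candidate.lower().strip(',.')):
            current = candidate       # still the start of a known word, keep gluing
        else:
            out.append(current)       # emit what we have and start over
            current = frag
    out.append(current)
    return ' '.join(out)

vocab = {'more', 'recently', 'the', 'development', 'which', 'is', 'a', 'potent'}
broken = "more recen t ly the develop ment, wh ich is a po ten t"
print(greedy_join(broken.split(), vocab))
# more recently the development, which is a potent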
--Solution 1:
Let's think of these chunks in your sentence as beads on an abacus, with each bead consisting of a partial string; the beads can be moved left or right to generate the permutations. The position of each fragment is fixed between its two adjacent fragments.
In the current case, the beads would be:
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves two subproblems:
a) A bead is a single unit, so we do not care about permutations within the bead, i.e. permutations of "more" are not possible.
b) The order of the beads is constant; only the spacing between them changes, i.e. "more" will always be before "recen" and so on.
Now, generate all the permutations of these beads, which will give output like:
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain; the most correct results can easily be filtered out.
more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t
Code which does the permutation part of the beads:
import re

def gen_abacus_perms(frags):
    if len(frags) == 0:
        return []
    if len(frags) == 1:
        return [frags[0]]
    prefix_1 = "{0}{1}".format(frags[0], frags[1])
    prefix_2 = "{0} {1}".format(frags[0], frags[1])
    if len(frags) == 2:
        nres = [prefix_1, prefix_2]
        return nres
    rem_perms = gen_abacus_perms(frags[2:])
    res = ["{0}{1}".format(prefix_1, x) for x in rem_perms] + ["{0} {1}".format(prefix_1, x) for x in rem_perms] + \
          ["{0}{1}".format(prefix_2, x) for x in rem_perms] + ["{0} {1}".format(prefix_2, x) for x in rem_perms]
    return res

broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split(r"\s+", broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
demo:http://ideone.com/pt4PSt
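The scoring step could be as simple as counting what fraction of each permutation's whitespace-separated tokens appear in your dictionary; the small vocab set below is only a stand-in for a real word list:

def score(candidate, vocab):
    tokens = [t.strip(',.').lower() for t in candidate.split()]
    return sum(1 for t in tokens if t in vocab) / max(len(tokens), 1)

vocab = {'more', 'recently', 'the', 'development', 'which', 'is', 'a', 'potent'}
best = max(perms, key=lambda p: score(p, vocab))
print(best)  # more recently the development, which is a potent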
--Solution#2:
I would suggest an alternative approach which makes use of text-analysis intelligence already developed by folks working on similar problems over big corpora of data, relying on dictionaries and grammar, e.g. search engines.
I am not well aware of such public/paid APIs, so my example is based on Google results.
Let's try to use Google:
You can keep feeding your invalid terms to Google over multiple passes, and keep evaluating the results against your lookup dictionary for some score.
Running two passes of your text through Google produces the relevant outputs: the result of the first pass is used as the query for the second pass, which gives you the conversion "more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not so good results.
One crude technique could be a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break it down into. There are a few things you can do to make it more accurate. To get the first word in text with no spaces, try taking the entire string and going through dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), the longest ones first, then taking off letters from the end of the string you want to segment. If you want it to work on a big string, you can make it automatically take off letters from the back so that the string you are looking for the first word in is only as long as the longest dictionary word. This should result in finding the longest words, and make it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that uses input() to take in the text to correct and a dictionary file called dictionary.txt:
dict = open("dictionary.txt",'r') #loads a file with a list of words to break string up into
words = raw_input("enter text to correct spaces on: ")
words = words.strip() #strips away spaces
spaced = [] #this is the list of newly broken up words
parsing = True #this represents when the while loop can end
while parsing:
if len(words) == 0: #checks if all of the text has been broken into words, if it has been it will end the while loop
parsing = False
iterating = True
for iteration in range(45): #goes through each of the possible word lengths, starting from the biggest
if iterating == False:
break
word = words[:45-iteration] #each iteration, the word has one letter removed from the back, starting with the longest possible number of letters, 45
for line in dict:
line = line[:-1] #this deletes the last character of the dictionary word, which will be a newline. delete this line of code if it is not a newline, or change it to [1:] if the newline character is at the beginning
if line == word: #this finds if this is the word we are looking for
spaced.append(word)
words = words[-(len(word)):] #takes away the word from the text list
iterating = False
break
print ' '.join(spaced) #prints the output
If you want it to be even more accurate, you could try using a natural language parsing program, there are several available for python free online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
    chunks.append(chunk)
    joined = ''.join(chunks)
    if is_word(joined):
        print(joined, end=' ')
        del chunks[:]

# deal with left overs
if chunks:
    print(''.join(chunks))
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
    if not wd:
        return False
    # Strip off trailing punctuation. There might be stuff in front
    # that you want to strip too, such as open parentheses; this is
    # just to give the idea, not a complete solution.
    if wd[-1] in ',.!?;:':
        wd = wd[:-1]
    return wd in valid_words
You can iterate through a dictionary of words to find the best fit, adding fragments together when a match is not found.
def iterate(word, dictionary):
    for possible_word in dictionary:
        if word == possible_word:
            finished_sentence.append(word)
            return True
    return False

sentence = "more recen t ly the develop ment, wh ich is a po ten t "
finished_sentence = []
fragments = sentence.split()
index = 0
while index < len(fragments):
    word = fragments[index]
    # keep gluing on the following fragments until the result matches a dictionary word
    while not iterate(word, dictionary) and index + 1 < len(fragments):
        index += 1
        word += fragments[index]
    index += 1
print(' '.join(finished_sentence))
This should work. For the variable dictionary, download a txt file of every single English word, then load it in your program.
My index.py file looks like this:
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
and my index.php file looks like this:
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There!Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
Hope this will work

How to strip variable spaces in each line of a text file based on special condition - one-liner in Python?

I have some data (text files) that is formatted in the most uneven manner one could think of. I am trying to minimize the amount of manual work on parsing this data.
Sample Data :
Name Degree CLASS CODE EDU Scores
--------------------------------------------------------------------------------------
John Marshall CSC 78659944 89989 BE 900
Think Code DB I10 MSC 87782 1231 MS 878
Mary 200 Jones CIVIL 98993483 32985 BE 898
John G. S Mech 7653 54 MS 65
Silent Ghost Python Ninja 788505 88448 MS Comp 887
Conditions :
More than one space should be compressed to a delimiter (a pipe, perhaps? The end goal is to store these files in a database).
Except for the first column, the other columns won't have any spaces in them, so all those spaces can be compressed to a pipe.
Only the first column can have multiple words with spaces (Mary K Jones). The rest of the columns are mostly numbers and a few letters.
The first and second columns are both strings. They almost always have more than one space between them, so that is how we can differentiate between the two columns. (If there is a single space, that is a risk I am willing to take given the horrible formatting!)
The number of columns varies, so we don't have to worry about column names. All we want is to extract each column's data.
Hope I made sense! I have a feeling that this task can be done in a one-liner. I don't want to loop, loop, loop :(
Muchos gracias "Pythonistas" for reading all the way and not quitting before this sentence!
It still seems to me that there's some structure in your files:
>>> regex = r'^(.+)\b\s{2,}\b(.+)\s+(\d+)\s+(\d+)\s+(.+)\s+(\d+)'
>>> for line in s.splitlines():
...     lst = [i.strip() for j in re.findall(regex, line) for i in j if j]
...     print(lst)
[]
[]
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
['Silent Ghost', 'Python Ninja', '788505', '88448', 'MS Comp', '887']
The regex is quite straightforward; the only things you need to pay attention to are the delimiters (\s) and the word breaks (\b) in the case of the first delimiter. Note that when a line doesn't match you get an empty list as lst. That would be a red flag to bring up the user interaction described below. Also, you could skip the header lines by doing:
>>> file = open(fname)
>>> [next(file) for _ in range(2)]
>>> for line in file:
...     # here an empty lst indicates issues with the regex
Previous variants:
>>> import re
>>> for line in open(fname):
...     lst = re.split(r'\s{2,}', line)
...     l = len(lst)
...     if l in (2, 3):
...         lst[l-1:] = lst[l-1].split()
...     print(lst)
['Name', 'Degree', 'CLASS', 'CODE', 'EDU', 'Scores']
['--------------------------------------------------------------------------------------']
['John Marshall', 'CSC', '78659944', '89989', 'BE', '900']
['Think Code DB I10', 'MSC', '87782', '1231', 'MS', '878']
['Mary 200 Jones', 'CIVIL', '98993483', '32985', 'BE', '898']
['John G. S', 'Mech', '7653', '54', 'MS', '65']
Another thing to do is simply allow the user to decide what to do with questionable entries:
if l < 3:
    lst = line.split()
    print(lst)
    iname = input('enter indexes for the elements of the name: ')  # use raw_input in py2k
    idegr = input('enter indexes for the elements of the degree: ')
Hmm, I was under the impression the whole time that the second element might contain spaces; since that's not the case, you could just do:
>>> for line in open(fname):
...     name, _, rest = line.partition(' ')
...     lst = [name] + rest.split()
...     print(lst)
Variation on SilentGhost's answer, this time first splitting the name from the rest (separated by two or more spaces), then just splitting the rest, and finally making one list.
import re

for line in open(fname):
    name, rest = re.split(r'\s{2,}', line, maxsplit=1)
    print([name] + rest.split())
This answer was written after the OP confessed to changing every tab ("\t") in his data to 3 spaces (and not mentioning it in his question).
Looking at the first line, it seems that this is a fixed-column-width report. It is entirely possible that your data contains tabs that if expanded properly might result in a non-crazy result.
Instead of doing line.replace('\t', ' ' * 3) try line.expandtabs().
Docs for expandtabs are here.
If the result looks sensible (columns of data line up), you will need to determine how you can work out the column widths programmatically (if that is possible) -- maybe from the heading line.
Are you sure that the second line is all "-", or are there spaces between the columns?
The reason for asking is that I once needed to parse many different files from a database query report mechanism which presented the results like this:
RecordType ID1 ID2 Description
----------- -------------------- ----------- ----------------------
1 12345678 123456 Widget
4 87654321 654321 Gizmoid
and it was possible to write a completely general reader that inspected the second line to determine where to slice the heading line and the data lines. Hint:
sizes = map(len, dash_line.split())
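Building on that hint, here is a rough sketch of a reader driven entirely by the dashed line; it assumes, as in the sample above, that the dashes form one run per column, and the file name is a placeholder:

def column_slices(dash_line):
    # every run of dashes marks one column; record its start and end offsets
    slices, start = [], None
    for i, ch in enumerate(dash_line):
        if ch == '-' and start is None:
            start = i
        elif ch != '-' and start is not None:
            slices.append(slice(start, i))
            start = None
    if start is not None:
        slices.append(slice(start, None))
    return slices

with open('report.txt') as f:        # placeholder file name
    heading = next(f)
    slices = column_slices(next(f))
    for line in f:
        print([line[s].strip() for s in slices])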
If expandtabs() doesn't work, edit your question to show exactly what you do have, i.e. show the result of print(repr(line)) for the first 5 or so lines (including the heading line). It might also be useful if you could say what software produces these files.
