Looking for a strategy for parsing a file - python

I'm an experienced C programmer, but a complete python newbie. I'm learning python mostly for fun, and as a first exercise want to parse a text file, extracting the meaningful bits from the fluff, and ending up with a tab-delimited string of those bits in a different order.
I've had a blast plowing through tutorials and documentation and stackoverflow Q&As, merrily splitting strings and reading lines from files and etc. Now I think I'm at the point where I need a few road signs from experienced folks to avoid blind alleys.
Here's one chunk of the text I want to parse (you may recognize this as a McMaster order). The actual file will contain one or more chunks like this.
1 92351A603 Lag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5
Your Part Number: 7218-GYROID
22
packs today
5.85
per pack 128.70
Note that the information is split over several lines in the file. I'd like to end up with a tab-delimited string that looks like this:
22\tpacks\tLag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5\t\t92351A603\t5.85\t\t128.70\t7218-GYROID\n
So I need to extract some parts of the string while ignoring others, rearrange them a bit, and re-pack them into a string.
Here's the (very early) code I have at the moment; it reads the file a line at a time and splits each line with delimiters, and I end up with several lists of strings, including a bunch of empty ones where there were double tabs:
import sys
import string

def split(delimiters, string, maxsplit=0):
    """Split the given string with the given delimiters (an array of strings).
    This function lifted from stackoverflow in a post by Kos"""
    import re
    regexPattern = '|'.join(map(re.escape, delimiters))
    return re.split(regexPattern, string, maxsplit)

delimiters = "\t", "\n", "\r", "Your Part Number: "

with open(sys.argv[1], 'r') as f:
    for line in f:
        print(split(delimiters, line))
f.close()
Question 1 is basic: how can I remove the empty strings from my lists, then mash all the strings together into one list? In C I'd loop through all the lists, ignoring the empties and sticking the other strings in a new list. But I have a feeling python has a more elegant way to do this sort of thing.
Question 2 is more open ended: what's a robust strategy here? Should I read more than one line at a time in the first place? Make a dictionary, allowing easier re-ordering of the items later?
Sorry for the novel. Thanks for any pointers. And please, stylistic comments are more than welcome, style matters.

You don't need to close the file when using with.
If I were to implement this, I might use one big regex to extract the parts from each chunk (with finditer) and reassemble them for output.
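Very roughly, a sketch of that finditer idea, assuming every chunk follows the layout of the sample above; the group names and the exact pattern are my own guesses, not anything the order format guarantees:
import re
import sys

# One match per order chunk; tweak the pattern to match the real file layout.
CHUNK = re.compile(
    r"^\d+\s+(?P<part>\S+)\s+(?P<desc>.+?)\s*\n"    # line no., part number, description
    r"Your Part Number:\s*(?P<mypart>\S+)\s*\n"
    r"(?P<qty>\d+)\s*\n"
    r"(?P<unit>\S+) today\s*\n"
    r"(?P<price>[\d.]+)\s*\n"
    r"per \S+\s+(?P<total>[\d.]+)",
    re.MULTILINE)

with open(sys.argv[1]) as f:
    text = f.read()

for m in CHUNK.finditer(text):
    print("\t".join([m.group("qty"), m.group("unit"), m.group("desc"), "",
                     m.group("part"), m.group("price"), "", m.group("total"),
                     m.group("mypart")]))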

You can remove empty strings with:
new_list = list(filter(None, old_list))
Replace the first parameter with a lambda expression that returns True for the elements you want to keep; passing None is equivalent to lambda x: x, so it drops everything falsy, including empty strings.
You can mash strings together into one string using:
a_string = "".join(list_of_strings)
Here "" is the separator; any string works, so "\t".join(...) gives you a tab-delimited result.
If you have several lists (of whatever) and you want to join them together into one list, then:
new_list = reduce(lambda x, y: x + y, old_list)
That simply concatenates them.
If you're new to Python, functions like filter and reduce may seem a bit alien (note that in Python 3 filter returns an iterator and reduce has moved into functools), but they save a lot of coding time, so they're worth getting to know.
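Putting those together on this problem, a minimal sketch (the rows list is made-up split() output, purely for illustration):
from functools import reduce

# Pretend per-line results from split(), including empties left by double tabs:
rows = [['22', '', ''], ['packs', 'today'], ['5.85', '', '128.70']]

flat = reduce(lambda x, y: x + y, rows)   # several lists -> one list
flat = list(filter(None, flat))           # drop the empty strings
print("\t".join(flat))                    # one tab-delimited string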
I think you're on the right track to solving your problem. I'd do this:
break up everything into lines
break the resulting list into smaller lists, one list per order
parse the orders into "something meaningful"
sort, output the result
Personally, I'd make a class to handle the last two parts (they kind of belong together logically), but you could get by without it.
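For what it's worth, a rough sketch of that shape. The field positions and the assumption that every order chunk is exactly six non-blank lines are guesses based on the sample above, not a real format spec:
class Order:
    """One order chunk: parses its lines and emits one tab-delimited row."""

    def __init__(self, lines):
        first = lines[0].split(None, 2)      # "1 92351A603 Lag Screw ..."
        self.part = first[1]
        self.desc = first[2]
        self.my_part = lines[1].split("Your Part Number: ")[1]
        self.qty = lines[2]
        self.unit = lines[3].split()[0]      # "packs today" -> "packs"
        self.price = lines[4]
        self.total = lines[5].split()[-1]    # "per pack 128.70" -> "128.70"

    def as_row(self):
        return "\t".join([self.qty, self.unit, self.desc, "", self.part,
                          self.price, "", self.total, self.my_part])

def parse_orders(text, chunk_size=6):
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]    # step 1
    chunks = [lines[i:i + chunk_size]                                 # step 2
              for i in range(0, len(lines), chunk_size)]
    return [Order(chunk) for chunk in chunks]                         # step 3

# step 4: sort however you need, then output
# for order in sorted(parse_orders(text), key=lambda o: o.part):
#     print(order.as_row())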

Related

Comparing two text files to EXCLUDE duplicates, not line-by-line: I want to output only the strings that are not duplicated across the files

I feel like this isn't that difficult but for some reason it is, and I'm sleep deprived... so yeah. I've been able to neatly format and isolate the words of interest from two long .txt files. I've searched around StackOverflow and I can only seem to find line-by-line comparisons (which specifically seek out duplicate strings, and I'm trying to do the exact opposite), so that is not at all what I'm looking for. My objective is to check whether the same string appears ANYWHERE in either txt file (as in, is duplicated between them; I'm comparing just two), and the resultant output should exclude any and all duplicates and be written to a .txt file, or at least printed to the console. I've read the Python documentation and am aware of set(). I don't mind tips on that, but is there another way to go about it?
edit: each item is just a five-digit numeric string (there are many of them), if that helps. Thank you in advance!
Both .txt files I'm comparing look like this essentially (I have had to change it a bit, but it is same exact idea).
1-94823 Words Words a numeric percentage time lapsed
2-84729 Words Words a numeric percentage time lapsed
The whole document is like that, line by line. There is some overlap between these two txt files, and I am solely interested in the five-digit number after the dash. I apologize that my title is/was unclear: I want to compare every instance of these five-digit numbers from both txt files, exclude any that are found in both (not just line-by-line matches), and output the rest (there are a fair number of duplicates).
Thanks,
Amelia
Once you have a list of those 5-digit numbers you can do this:
List of numbers:
list1 = [12345, 67890, 13579]
list2 = [54321, 67890, 11235]
Create the sets:
set1 = set(list1)
set2 = set(list2)
Get the union without the intersection (the symmetric difference):
non_duplicates_list = list(set1.symmetric_difference(set2))
and the result is:
[11235, 13579, 54321, 12345]
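For completeness, a small sketch of the whole pipeline, assuming every relevant line starts with something like 1-94823 as in the example (the file names are just placeholders):
import re

def five_digit_ids(path):
    # Pull the five digits that follow the leading "N-" on each line.
    with open(path) as f:
        return set(re.findall(r'^\d+-(\d{5})', f.read(), re.MULTILINE))

ids_a = five_digit_ids('file_a.txt')
ids_b = five_digit_ids('file_b.txt')

non_duplicates = sorted(ids_a.symmetric_difference(ids_b))
with open('result.txt', 'w') as out:
    out.write('\n'.join(non_duplicates))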
If there is any problem let me know :)

Removing various symbols from a text

I am trying to clean some texts that are very different from one another. I would like to remove the headlines, quotation marks, abbreviations, special symbols and points that don't actually end sentences.
Example input:
This is a headline
And inside the text there are 'abbreviations', e.g. "bzw." in German or some German dates, like 2. Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely.
• they have
◦ different bullet points
- or even equations and
Sometimes there are special symbols. ✓
Example output:
And inside the text there are abbreviations, for example beziehungsweise in German or some German dates, like 2 Dezember 2017. Sometimes there are even enumerations, that I might just eliminate completely. Sometimes there are special symbols.
What I did:
with open(r'C:\Users\me\Desktop\ex.txt', 'r', encoding="utf8") as infile:
    data = infile.read()

data = data.replace("'", '')
data = data.replace("e.g.", 'for example')
# and so on

with open(r'C:\Users\me\Desktop\ex.txt', 'w', encoding="utf8") as outfile:
    outfile.write(data)
My problems (although number 2 is the most important):
1. I just want to work on a string containing this input, but pasting it straight into the code breaks because of the quotation marks. Is there any way to do this other than working with files like I did? In reality, I'm copy-pasting a text and want an app to clean it.
2. The code seems very inefficient because I just manually write out the things that I remember to delete/clean, but I don't know all the abbreviations by heart. How do I clean it in one go, so to say?
3. Is there any way to eliminate the headline and enumeration, and the point (.) that appears in that German date? My code doesn't do that.
Edit: I just remembered stuff like text = re.sub(r"(#\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text), but regex is inefficient for huge texts, isn't it?
To easily remove all non-standard symbols you can use str.isalnum(), which only returns True for alphanumeric sequences, or str.isascii() for ASCII-only strings; str.isprintable() seems viable too. A full list can be found here. Using those functions you can iterate over the string and filter each character. So something like this:
filteredData = ''.join(filter(str.isidentifier, data))
You can also combine those checks by writing a function that tests several string predicates, like this:
def FilterKey(char: str): return char.isidentifier() and char.isalpha()
Which can be used in filter like this:
filteredData = ''.join(filter(FilterKey, data))
If the function returns True for a character, it is included in the output; if it returns False, it is excluded.
You can also extend this by adding your own per-character checks to the function's return expression. Afterwards, to remove larger chunks of text, you can use the usual str.replace(old, new).
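As a rough illustration of combining the two ideas (a replacement table for known abbreviations plus a per-character filter), here is a sketch; the ABBREVIATIONS entries are just the two from the question, and the character class to keep is an assumption you would tune:
import re

# The abbreviation table is yours to maintain; these entries mirror the question.
ABBREVIATIONS = {
    'e.g.': 'for example',
    'bzw.': 'beziehungsweise',
}

def clean(text):
    # One pass for all abbreviations: an alternation of the escaped keys.
    pattern = re.compile('|'.join(re.escape(k) for k in ABBREVIATIONS))
    text = pattern.sub(lambda m: ABBREVIATIONS[m.group(0)], text)
    # Keep word characters, whitespace and basic punctuation; drop quotes,
    # bullets and other symbols. Adjust the class to taste.
    return re.sub(r'[^\w\s.,]', '', text)

print(clean('e.g. "bzw." in German'))   # for example beziehungsweise in German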

Python convention for sub-delimiter

I am writing a script which takes a string as input and splits it into a list using the .split(sep = ',') function. Then, some of the items in the list will be split into sub-lists. For example:
input = 'my,string,1|2|3'
mylist = input.split(',')
mylist[2] = mylist[2].split('|')
print(mylist)
> ['my','string',['1','2','3']]
The code works without a problem. (I know which position in the list will have the sub-list.) My question is: is there any convention in Python for which delimiter should be used to separate a string that will eventually be converted to numbers (int or float), assuming that ',' is already used as the first delimiter?
As the programmer, I can request the string to be formatted using whichever delimiters I like. But I will have many users, so if there is a convention for separating numerical values, I would like to follow it. Note that the numbers may be float values, so I do not want to use the characters 'hyphen' or 'period' as delimiters.
I should preface this by saying I have never heard of such a convention, but I like the question. The convention for nested lists in English is to use commas for the inner list and semi-colons for the outer list, e.g.:
I have eaten: eggs, bacon, and apple for breakfast; toast, tuna, and a
banana for lunch; and chicken, salad, and potatoes for dinner.
That convention suggests input = 'my;string;1,2,3'.
I also like the idea of using newlines: input = 'my\nstring\n1,2,3\n'. It has the benefit of being easy to read from / write to CSV.
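Reading it back stays a simple two-level split either way; a tiny sketch with the semicolon convention (assuming the numeric values always sit in the third field, as in the question):
raw = 'my;string;1,2,3'
fields = raw.split(';')
numbers = [float(x) for x in fields[2].split(',')]   # float() also handles ints
print(fields[0], fields[1], numbers)                 # my string [1.0, 2.0, 3.0]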

regex to find string in string without respect to order?

I'm not sure how best to word this, so I'll dive straight into an example.
a bunch of lines we don't care about [...]
This is a nice line I can look for
This is the string I wish to extract
a bunch more lines we don't care about [...]
This line contains an integer 12345 related to the string above
more garbage [...]
But sometimes (and I have no control over this) the order is swapped:
a bunch of lines we don't care about [...]
Here is another string I wish to extract
This is a nice line I can look for
a bunch more lines we don't care about [...]
This line contains an integer 67890 related to the string above
more garbage [...]
The two lines ("nice line" and "string I wish to extract") are always adjacent, but the order is not predictable. The integer-containing line is an inconsistent number of lines below. The "nice line" appears multiple times and is always the same, while the strings and integers I'm extracting may or may not repeat from one occurrence to the next.
Ultimately the idea is to populate two lists, one containing the strings and the other containing the integers, both ordered as they are found so the two can later be used as key/value pairs.
What I have no idea how to do, or even if it's possible, is to write a regex that finds the string immediately before OR after a target line?
Doing this in Python, btw.
Thoughts?
edit/addition: So what I'm expecting as a result out of the above sample text would be something like:
list1["This is the string I wish to extract", "Here is another string I wish to extract"]
list2[12345, 67890]
A good strategy might be to look for "nice lines" and then search the lines above and below.
See the following (untested) Python pseudocode:
L1, L2 = [], []
lines = open("file.txt").readlines()

for i, line in enumerate(lines):
    if 'nice line' in line:
        before_line = lines[max(i - 1, 0)]
        after_line = lines[min(i + 1, len(lines) - 1)]
        # You can generalize the above to a few lines above and below.
        # Use regex to parse information from `before_line` and `after_line`
        # and add it to the lists L1 and L2.
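Fleshing that out a little, a sketch that actually fills the two lists. The two tests below (the phrase "wish to extract" on an adjacent line, and an "integer NNNNN" pattern further down) are stand-ins borrowed from the sample text; substitute whatever actually distinguishes those lines in your real data:
import re

L1, L2 = [], []
with open("file.txt") as f:
    lines = f.read().splitlines()

for i, line in enumerate(lines):
    if 'nice line' not in line:
        continue
    # The wanted string sits on the adjacent line, before or after.
    for j in (i - 1, i + 1):
        if 0 <= j < len(lines) and 'wish to extract' in lines[j]:
            L1.append(lines[j])
            break
    # Scan downward for the related integer; the pattern is a guess
    # based on the sample lines.
    for later in lines[i + 1:]:
        m = re.search(r'integer (\d+)', later)
        if m:
            L2.append(int(m.group(1)))
            break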

Efficient way to do a large number of search/replaces in Python?

I'm fairly new to Python, and am writing a series of scripts to convert between some proprietary markup formats. I'm iterating line by line over files and then doing a large number (100-200) of substitutions that basically fall into 4 categories:
line = line.replace("-","<EMDASH>") # Replace single character with tag
line = line.replace("<\\#>","#") # tag with single character
line = line.replace("<\\n>","") # remove tag
line = line.replace("\xe1","•") # replace non-ascii character with entity
The str.replace() function seems to be pretty efficient (fairly low in the numbers when I examine profiling output), but is there a better way to do this? I've seen the re.sub() method with a function as an argument, but am unsure whether this would be better. I guess it depends on what kind of optimizations Python does internally. Thought I would ask for some advice before creating a large dict that might not be very helpful!
Additionally I do some parsing of tags (that look somewhat like HTML, but are not HTML). I identify tags like this:
m = re.findall('(<[^>]+>)',line)
And then do ~100 search/replaces (mostly removing matches) within the matched tags as well, e.g.:
m = re.findall('(<[^>]+>)', line)
for tag in m:
    tag_new = re.sub(r"\*t\([^\)]*\)", "", tag)
    tag_new = re.sub(r"\*p\([^\)]*\)", "", tag_new)
    # do many more searches...
    if tag != tag_new:
        line = line.replace(tag, tag_new, 1)  # potentially problematic
Any thoughts of efficiency here?
Thanks!
str.replace() is more efficient if you're going to do basic search and replaces, and re.sub is (obviously) more efficient if you need complex pattern matching (because otherwise you'd have to use str.replace several times).
I'd recommend you use a combination of both. If you have several patterns that all get replaced by one thing, use re.sub. If you just have some cases where you just need to replace one specific tag with another, use str.replace.
You can also improve efficiency by working on larger strings (call re.sub once on the whole file instead of once per line). That increases memory use, which shouldn't be a problem unless the file is huge, and it improves execution time.
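Since the question already mentions building a large dict, here is roughly what that looks like as a single pass with re.sub and a callback (entries copied from the question's examples; this only covers literal replacements, not patterns):
import re

REPLACEMENTS = {
    "-": "<EMDASH>",
    "<\\#>": "#",
    "<\\n>": "",
    "\xe1": "•",
}

# Longest keys first so longer literals win where keys overlap.
pattern = re.compile("|".join(re.escape(k) for k in
                              sorted(REPLACEMENTS, key=len, reverse=True)))

def replace_all(line):
    return pattern.sub(lambda m: REPLACEMENTS[m.group(0)], line)

# usage: line = replace_all(line)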
If you don't actually need the regex and are just doing literal replacing, str.replace() will almost certainly be faster. But even so, your bottleneck here will be file input/output, not string manipulation.
The best solution, though, would probably be to use cStringIO.
Depending on the ratio of relevant-to-not-relevant portions of the text you're operating on (and whether or not the parts each substitution operates on overlap), it might be more efficient to try to break down the input into tokens and work on each token individually.
Since each replace() in your current implementation has to examine the entire input string, that can be slow. If you instead broke down that stream into something like...
[<normal text>, <tag>, <tag>, <normal text>, <tag>, <normal text>]
# from an original "<normal text><tag><tag><normal text><tag><normal text>"
...then you could simply look to see if a given token is a tag, and replace it in the list (and then ''.join() at the end).
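A minimal sketch of that tokenize-then-join idea, using re.split with a capturing group so the tags survive the split (the two substitutions are the ones from the question):
import re

def process_tags(line):
    # With a capturing group, re.split keeps the delimiters, so the result
    # alternates between normal text and <...> tags.
    tokens = re.split(r'(<[^>]+>)', line)
    for i, tok in enumerate(tokens):
        if tok.startswith('<'):                        # only touch the tags
            tok = re.sub(r'\*t\([^)]*\)', '', tok)
            tok = re.sub(r'\*p\([^)]*\)', '', tok)
            tokens[i] = tok
    return ''.join(tokens)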
You can pass a function object to re.sub instead of a substitution string, it takes the match object and returns the substitution, so for example
>>> r = re.compile(r'<(\w+)>|(-)')
>>> r.sub(lambda m: '(%s)' % (m.group(1) if m.group(1) else 'emdash'), '<atag>-<anothertag>')
'(atag)(emdash)(anothertag)'
Of course you can use a more complex function object, this lambda is just an example.
Using a single regex that does all the substitutions should be slightly faster than iterating over the string many times, but if a lot of substitutions are performed, the overhead of calling the function object that computes the substitution may be significant.
