Formatted Input in Python

I have a peculiar problem. I need to read (from a txt file), using Python, only those substrings that are present at predefined ranges of offsets. Let's say 5-8 and 12-16.
For example, if a line in the file is something like:
abcdefghi akdhflskdhfhglskdjfhghsldk
then I would like to read the two words - "efgh" and "kdhfl". Because, in the word "efgh", the offset of character "e" is 5 and that of "h" is 8. Similarly, the other word "kdhfl".
Please note that the whitespace also adds to the offset. In fact, the whitespace in my file is not "consistently occurring" in every line and cannot be depended upon to extract the words of interest, which is why I have to bank on the offsets.
I hope I've been able to make the question clear.
Awaiting answers!
Edit -
Yes, the amount of whitespace in each line can change, and it also counts toward the offsets. For example, consider these two lines:
abcz d
a bc d
In both cases, I view the offset of the final character "d" as the same. As I said, the white spaces in the file are not consistent and I cannot rely on them. I need to pick up the characters based on their offsets. Does your answer still hold?

Assuming it's a file:
for line in open("file"):
    # offsets 5-8 and 12-16 (1-based) are slices [4:8] and [11:16] (0-based, end-exclusive)
    print(line[4:8], line[11:16])
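
For example, with the sample line from the question:
>>> line = 'abcdefghi akdhflskdhfhglskdjfhghsldk'
>>> line[4:8], line[11:16]
('efgh', 'kdhfl')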

To extract pieces from offsets simply read each line into a string and then access a substring with a slice ([from:to]).
It's unclear what you're saying about the inconsistent whitespace. If whitespace adds to the offset, it must be consistent to be meaningful. If the whitespace amount can change but actually accounts for the offsets, you can't reliably extract your data.
In your added example, as long as d's offset stays the same, you can extract it with slicing.
>>> s = 'a bc d'
>>> s[5:6]
'd'
>>> s = 'abcz d'
>>> s[5:6]
'd'

What's to stop you from using a regular expression? Besides the whitespace do the offsets vary?
/.{4}(.{4}).{3}(.{5})/
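In Python that would look something like this (a quick sketch using the 5-8 and 12-16 ranges from the question):
>>> import re
>>> m = re.match(r'.{4}(.{4}).{3}(.{5})', 'abcdefghi akdhflskdhfhglskdjfhghsldk')
>>> m.groups()
('efgh', 'kdhfl')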

Related

A question on using the split() method and the data's format (solved, posting as a memo)

I have made a variable like the one below,
data = '''
hello my name is mj
and I like reading novels and webtoons
nice meeting you all!
'''
and used data.split('\n') to split by sentences.
The data came out like this:
['', 'hello my name is mj', 'and I like reading novels and webtoons', 'nice meeting you all!', '']
In the above list, why are there double quotation marks (") at the start and at the end? Are single sentences like 'hello my name is mj' and 'and I like ~' tied up as one string? If so, why??
Wait, while writing this question I think I got the answer: it is not a double quotation mark but two single quotation marks written in a row. Since there is nothing written next to the two '''s (only a newline), it just made an empty string.
There is a \n character at the beginning and at the end of your string, therefore it is also part of the return value from split. You can do something like this:
[x for x in data.split('\n') if x]
This uses a list comprehension with a condition to keep only the lines that are not empty.
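With your example data, that gives:
>>> [x for x in data.split('\n') if x]
['hello my name is mj', 'and I like reading novels and webtoons', 'nice meeting you all!']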
...to split by sentence
There is the native splitlines method for this. It is advised to use that method, as it is aware of all the varying conventions for line breaks.
Also, it will not create an extra entry at the end when the input ends with a line break like in your example. However, since you have an explicit empty line at the beginning, that one would still be included.
It might be a pragmatic solution to just strip the surrounding whitespace from your input:
data.strip().splitlines()
For your example input, this will evaluate to:
[
'hello my name is mj',
'and I like reading novels and webtoons',
'nice meeting you all!'
]

s3fs read_block with delimiter lookahead?

I am using s3fs read_block to distribute a csv equally across multiple processes. Each process needs to be given a byte range to operate on and work independently of others. Every line in the csv needs to be processed without overlap.
The problem is that the beginning and ends of byte ranges are unlikely to be the beginning and ends of lines. So some lines may get chopped off.
For example-
My csv looks like this-
beer\npizza\nwings
And I want to process this in chunks of 9 bytes. For byte range 0-9 I will get "beer". And for byte range 10-16 I will get "wings". I will never get "pizza" because the split exists in the middle of a line
beer\npizza\nwings
__________^_______
What I need is some kind of lookahead. Where I want to get bytes between 0-9, and any additional bytes required to form the next line. Then my results would be beer\npizza, wings.
Is lookahead the right way of looking at this or is there another solution? If lookahead is the right way to do this, can this be done with s3fs or do I need a custom implementation to do this lookahead first to find the correct byte range?
Edit:
Custom implementation example:
if self._lookahead:
    self._logger.debug('Performing lookahead')
    # Use lookahead to find the next newline in the csv
    self._logger.debug(f'{end - 1}, {self._lookahead + 1}')
    r = s3.read_block(self._s3_path, end - 1, self._lookahead + 1)
    if b'\n' not in r[:2]:
        # Range ends in the middle of a line; look ahead for the next newline
        read_length = read_length + r.index(b'\n')
        self._logger.debug(f'New end found {read_length}')
From the s3fs documentation: you can simply pass the delimiter to the read_block() function. Hope it helps:
s3.read_block(path, offset=1000, length=10, delimiter=b'\n')
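Applied to the example from the question, it would look something like this (a sketch with a hypothetical bucket path; per the fsspec/s3fs delimiter semantics, block boundaries are shifted to the delimiter, so no line is chopped or read twice):
import s3fs

s3 = s3fs.S3FileSystem()  # assumes AWS credentials are already configured
# hypothetical path to a file holding b'beer\npizza\nwings'
s3.read_block('mybucket/food.csv', 0, 9, delimiter=b'\n')  # expected: b'beer\npizza\n'
s3.read_block('mybucket/food.csv', 9, 9, delimiter=b'\n')  # expected: b'wings'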

Which is the most efficient way of matching and replacing every three newlines with an identifier?

I am working with some .txt files that don't have structure (they are messy); they represent a number of pages. In order to give them some structure, I would like to identify the page numbers, since the file itself doesn't have them. This can be done by replacing every three newlines with an annotation like:
\n
page: N
\n
Where N is the number. I also tried a simple replace, but it gets confused and does not give me the expected format. Any idea how to replace those runs of whitespace with some kind of identifier, just so I can parse the files and get the position of some information (the page)?
I also tried this:
import re
replaced = re.sub('\b(\s+\t+)\b', '\n\n\n', text)
print(replaced)
If the format is as regular as you state in your problem description:
Replace every occurrence of three newlines \n with page: N
You wouldn't have to use the re module. Something as simple as the following would do the trick:
>>> s='aaaaaaaaaaaaaaaaa\n\n\nbbbbbbbbbbbbbbbbbbbbbbb\n\n\nccccccccccccccccccccccc'
>>> pages = s.split('\n\n\n')
>>> ''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages))
'aaaaaaaaaaaaaaaaa\n\tpage: 1\nbbbbbbbbbbbbbbbbbbbbbbb\n\tpage: 2\nccccccccccccccccccccccc\n\tpage: 3\n'
I suspect, though, that your format is less regular than that, but you'll have to include more details before I can give a good answer for that.
If you want to split with messy whitespace (which I'll define as at least three newlines with any other whitespace mixed in), you can replace s.split('\n\n\n') with:
re.split(r'(?:\n\s*?){3,}', s)
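A quick check of that pattern on messy separators:
>>> import re
>>> re.split(r'(?:\n\s*?){3,}', 'aaaa\n \n\t\nbbbb')
['aaaa', 'bbbb']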

regex to find string in string without respect to order?

I'm not sure how best to word this, so I'll dive straight into an example.
a bunch of lines we don't care about [...]
This is a nice line I can look for
This is the string I wish to extract
a bunch more lines we don't care about [...]
This line contains an integer 12345 related to the string above
more garbage [...]
But sometimes (and I have no control over this) the order is swapped:
a bunch of lines we don't care about [...]
Here is another string I wish to extract
This is a nice line I can look for
a bunch more lines we don't care about [...]
This line contains an integer 67890 related to the string above
more garbage [...]
The two lines ("nice line" and "string I wish to extract") are always adjacent, but the order is not predictable. The integer-containing line is an inconsistent number of lines below. The "nice line" appears multiple times and is always the same; the strings and integers I'm extracting (globally) may be the same as or different from each other.
Ultimately the idea is to populate two lists, one containing the strings and the other containing the integers, both ordered as they are found so the two can later be used as key/value pairs.
What I have no idea how to do, or even if it's possible, is to write a regex that finds the string immediately before OR after a target line.
Doing this in Python, btw.
Thoughts?
edit/addition: So what I'm expecting as a result out of the above sample text would be something like:
list1 = ["This is the string I wish to extract", "Here is another string I wish to extract"]
list2 = [12345, 67890]
A good strategy might be to look for "nice lines" and then search the lines above and below.
See the following (untested) Python pseudocode:
L1, L2 = [], []
lines = open("file.txt").readlines()
for i, line in enumerate(lines):
    if 'nice line' in line:
        before_line = lines[max(i - 1, 0)]
        after_line = lines[min(i + 1, len(lines) - 1)]
        # You can generalize the above to a few lines above and below.
        # Use regex to parse information from `before_line` and `after_line`
        # and add it to the lists L1 and L2.
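
A rough, untested sketch of the whole thing, using the marker phrases from your sample text (the 'wish to extract' test is just a stand-in for however you recognize your real strings):
import re

list1, list2 = [], []
lines = open('file.txt').read().splitlines()

for i, line in enumerate(lines):
    if 'nice line I can look for' in line:
        # The wanted string is on the adjacent line, before or after the marker.
        if i > 0 and 'wish to extract' in lines[i - 1]:
            list1.append(lines[i - 1])
        elif i + 1 < len(lines):
            list1.append(lines[i + 1])
    m = re.search(r'contains an integer (\d+)', line)
    if m:
        list2.append(int(m.group(1)))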

Looking for a strategy for parsing a file

I'm an experienced C programmer, but a complete Python newbie. I'm learning Python mostly for fun, and as a first exercise I want to parse a text file, extracting the meaningful bits from the fluff and ending up with a tab-delimited string of those bits in a different order.
I've had a blast plowing through tutorials and documentation and stackoverflow Q&As, merrily splitting strings and reading lines from files and etc. Now I think I'm at the point where I need a few road signs from experienced folks to avoid blind alleys.
Here's one chunk of the text I want to parse (you may recognize this as a McMaster order). The actual file will contain one or more chunks like this.
1 92351A603 Lag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5
Your Part Number: 7218-GYROID
22
packs today
5.85
per pack 128.70
Note that the information is split over several lines in the file. I'd like to end up with a tab-delimited string that looks like this:
22\tpacks\tLag Screw for Wood, 18-8 Stainless Steel, 5/16" Diameter, 5" Long, packs of 5\t\t92351A603\t5.85\t\t128.70\t7218-GYROID\n
So I need to extract some parts of the string while ignoring others, rearrange them a bit, and re-pack them into a string.
Here's the (very early) code I have at the moment. It reads the file a line at a time and splits each line on delimiters; I end up with several lists of strings, including a bunch of empty ones where there were double tabs:
import sys
import string

def split(delimiters, string, maxsplit=0):
    """Split the given string with the given delimiters (an array of strings).
    This function lifted from stackoverflow in a post by Kos"""
    import re
    regexPattern = '|'.join(map(re.escape, delimiters))
    return re.split(regexPattern, string, maxsplit)

delimiters = "\t", "\n", "\r", "Your Part Number: "
with open(sys.argv[1], 'r') as f:
    for line in f:
        print(split(delimiters, line))
f.close()
Question 1 is basic: how can I remove the empty strings from my lists, then mash all the strings together into one list? In C I'd loop through all the lists, ignoring the empties and sticking the other strings in a new list. But I have a feeling python has a more elegant way to do this sort of thing.
Question 2 is more open ended: what's a robust strategy here? Should I read more than one line at a time in the first place? Make a dictionary, allowing easier re-ordering of the items later?
Sorry for the novel. Thanks for any pointers. And please, stylistic comments are more than welcome, style matters.
You don't need to close the file when using with.
If I were to implement this, I might use a big regex to extract the parts from each chunk (with finditer) and reassemble them for output.
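For example, a rough sketch of that idea against the sample chunk (untested; the pattern assumes the chunk layout is exactly as shown, and 'order.txt' is a hypothetical file name):
import re

chunk_re = re.compile(
    r'\d+\s+(?P<part>\S+)\s+(?P<desc>.+)\n'   # line no., catalog number, description
    r'Your Part Number: (?P<mine>\S+)\n'
    r'(?P<qty>\d+)\n'
    r'(?P<unit>\w+) today\n'
    r'(?P<price>[\d.]+)\n'
    r'per \w+ (?P<total>[\d.]+)')

with open('order.txt') as f:
    text = f.read()

for m in chunk_re.finditer(text):
    print('\t'.join([m['qty'], m['unit'], m['desc'], '', m['part'],
                     m['price'], '', m['total'], m['mine']]))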
You can remove the empty strings with:
new_list = list(filter(None, old_list))
Replace the first parameter with a lambda expression that is true for the elements you want to keep; passing None is equivalent to lambda x: x. (In Python 3, filter returns an iterator, hence the list() call.)
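For example:
>>> list(filter(None, ['22', '', 'packs', '']))
['22', 'packs']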
You can mash strings together into one string using:
a_string = "".join(list_of_strings)
That will simply concatenate them, but you can use any string as the separator instead of "".
If you have several lists (of whatever) and you want to join them together into one list, then:
new_list = reduce(lambda x, y: x + y, old_lists)
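For example:
>>> from functools import reduce  # built in to Python 2; in functools in Python 3
>>> reduce(lambda x, y: x + y, [['a'], ['b', 'c'], []])
['a', 'b', 'c']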
If you're new to Python, then functions like filter and reduce may seem a bit alien (note: in Python 3, reduce has moved to the functools module), but they save a lot of time coding, so it's worth getting to know them.
I think you're on the right track to solving your problem. I'd do this:
break up everything into lines
break the resulting list into smaller lists, one list per order
parse the orders into "something meaningful"
sort, output the result
Personally, I'd make a class to handle the last two parts (they kind of belong together logically) but you could get by without it.
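A bare, untested skeleton of that structure ('order.txt' and the chunk-start heuristic are assumptions for illustration, not part of your question):
import re

class Order:
    """Steps 3 and 4: parse one chunk of lines and render the output row."""
    def __init__(self, lines):
        self.lines = lines

    def as_tab_delimited(self):
        # Real field extraction goes here; this placeholder just joins the raw lines.
        return '\t'.join(self.lines)

def split_into_orders(lines):
    # Step 2: assume each order starts with a line number followed by a
    # catalog number, as in the sample chunk.
    orders, current = [], []
    for line in lines:
        if re.match(r'\d+ \S+ ', line) and current:
            orders.append(Order(current))
            current = []
        current.append(line)
    if current:
        orders.append(Order(current))
    return orders

lines = open('order.txt').read().splitlines()  # step 1
for order in split_into_orders(lines):
    print(order.as_tab_delimited())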
