Efficient extraction of anadromes - python

An anadrome is a proper sentence that, when written in reverse, constitutes a (possibly different) proper sentence, up to a possible change of spacing. I have a file with 100 million proper sentences and I would like to find all sub-sentences (divided at word boundaries) which are anadromes, by testing whether their inverse is also in the file when internal spaces are ignored. My initial approach was to extract all sub-sentences and save them to a temporary file, create an in-memory set of their space-stripped inverses, and finally iterate over the temporary file and test if each line, after space-stripping, belongs to the set. This worked fine for smaller files but does not scale, as the set gets too large for memory. Other than replacing the in-memory set with an on-disk database, what could be done?
Edit: I ended up using an SQLite database with an index. On a smaller set of 5 million sentences, using the db instead of an in-memory set takes 2x the time. With the full set, this is the only method I found that could complete the computation.
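For reference, a minimal sketch of that SQLite approach, assuming a single-column indexed table; the schema, file names, and the subsentences() helper are my own illustration, not the asker's actual code:

import sqlite3

# hypothetical helper: yield every sub-sentence (runs of whole words)
def subsentences(sentence):
    words = sentence.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            yield " ".join(words[i:j])

conn = sqlite3.connect("anadromes.db")
conn.execute("CREATE TABLE IF NOT EXISTS stripped (key TEXT)")

# pass 1: store the space-stripped reverse of every sub-sentence
with open("sentences.txt") as f:
    for line in f:
        for sub in subsentences(line.strip()):
            conn.execute("INSERT INTO stripped VALUES (?)",
                         (sub[::-1].replace(" ", ""),))
conn.execute("CREATE INDEX IF NOT EXISTS idx ON stripped (key)")
conn.commit()

# pass 2: a sub-sentence is an anadrome if its space-stripped form
# equals the space-stripped reverse of some sub-sentence
with open("sentences.txt") as f:
    for line in f:
        for sub in subsentences(line.strip()):
            if conn.execute("SELECT 1 FROM stripped WHERE key = ? LIMIT 1",
                            (sub.replace(" ", ""),)).fetchone():
                print(sub)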

For each proper sentence you could invert it, and find all possible proper subsentences.
Then for each inverted subsentence, you strip all the spaces.
You then do a regex search in the original file, looking for the space-stripped inverted subsentence while allowing for \s? in between characters.
For example, d\s?l\s?r\s?o\s?w\s?o\s?l\s?l\s?e\s?h (inverted 'hello world') would match 'dlro wol leh' (inverted 'hello world' with crazy spacing, which would be in the original file if it were a proper sentence).
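A rough sketch of this answer in Python; the file name is a placeholder, and note that scanning the whole file once per subsentence is likely far too slow at 100 million sentences:

import re

def spaced_pattern(subsentence):
    # invert, strip spaces, then allow an optional space between characters
    chars = subsentence[::-1].replace(" ", "")
    return re.compile(r"\s?".join(re.escape(c) for c in chars))

pattern = spaced_pattern("hello world")  # d\s?l\s?r\s?o\s?w\s?o\s?l\s?l\s?e\s?h
with open("sentences.txt") as f:
    for line in f:
        if pattern.search(line):
            print(line.rstrip())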

Related

How to use hash function in Python3 to transform an arbitrary string into a fixed-length sequence of alphanumeric symbols?

I have a large number of different sentences written in different languages (French, Ukrainian, English and so on). For each sentence I want to generate an audio file with the given sentence being pronounced by a text-to-speech program. Now I need to decide how to name those audio files (one file for each sentence). I thought that it would be elegant if I could infer the file name from the sentence. In other words, if I see the sentence, I should be able to compute (infer / derive) the name of the audio file in which this sentence is spoken.
I thought that I could use a hash function for that. I would apply a hash function to the string representing the sentence and, as a result, I would get a string (hash) that I can use as a name of the file.
Why not use the sentence itself as a name? Because sentences can be long and I do not want very long file names. Moreover, I do not want spaces and other punctuation symbols (as well as strange alphabet symbols) in the names of the files. Finally, I expect that a hash will always have the same length, which looks nice.
Now to my question: how can I transform an arbitrary Unicode string into a sequence of alphanumeric symbols that is a hash of the input string, in Python 3?
I also wonder if there is a danger of getting the same hash for different sentences.
ADDED:
I have just realized that by applying the hash function to the same string I can get different results in different sessions. This is obviously something that I would like to avoid.
Sure. Use a cryptographic hash function such as SHA-256; they're available in hashlib. (As you've noticed, hash isn't stable between sessions due to PYTHONHASHSEED, nor necessarily between Python versions and interpreters.)
I also apply some normalization here, but that may or may not be what you want.
import hashlib

def get_filename(sentence: str) -> str:
    # assuming leading/trailing whitespace doesn't matter, nor does case
    sentence_norm = sentence.lower().strip()
    return hashlib.sha256(sentence_norm.encode("utf-8")).hexdigest()
>>> get_filename("Hello, mon ami!")
'c13c197526d17532bd6d9bf3c2ad34486ccb2fcdeadaf7b71c3c67c0f048ecb9'
>>> get_filename("hello, mon ami! ")
'c13c197526d17532bd6d9bf3c2ad34486ccb2fcdeadaf7b71c3c67c0f048ecb9'
>>>
I also wonder if there is a danger of getting the same hash for different sentences.
No, not until SHA-256 is broken, and if it is, we're all in trouble anyway.

How to combine two words in a long string into one & find their associated sentence?

Given a very long string -
"Given the large category of plants, the split ratio was determined to be 88.4. However, we're not sure if the split ratio was consistent across all subcategories or just a calculated average. If however, it deviated, it would be nonetheless, quite strange.
The words: split ratio. In the output I want them to appear as split-ratio (as a single word), and I also only want to retain the sentences where these words occur; so in this case, only the first two sentences.
Is this possible?
You can use replace in a list comprehension:
s = """Given the large category of plants, the split ratio was
determined to be 88.4. However, we're not sure
if the split ratio was consistent across all subcategories
or just a calculated average. If however, it deviated,
it would be nonetheless, quite strange."""
print('. '.join([x.replace('split ratio', 'split-ratio') for x in s.split('. ') if 'split ratio' in x]) + '.')
will print out only the sentences that contain 'split ratio', with each occurrence converted to 'split-ratio'.
Since python is in the tag line, I expect you want it in that language, right? And to be clear, a simple find-and-replace in a normal text editor isn't going to solve this; you need actual logic applied to the text.
I would have to stop and look up the Python specifics for a bit, but in any language the easiest way I can think of is to parse the file/stream and make the changes as you go: read in the stream, look for the pattern you want to match ("split ratio"), and as you read, write out a new stream that reflects your changes. Work in blocks of at least the length of the pattern you are matching.
When the comparison you are constantly running finds the pattern, stop: don't output that string, output the replacement into the new target stream/file instead. A rough sketch follows.
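In Python, that chunked idea might look roughly like this; a sketch only, with placeholder file names, where the tail carried between chunks keeps a match from being cut in half at a block boundary:

PATTERN = "split ratio"
REPLACEMENT = "split-ratio"
CHUNK_SIZE = 4096  # arbitrary block size
keep = len(PATTERN) - 1  # longest partial match that can straddle a boundary

with open("input.txt") as src, open("output.txt", "w") as dst:
    tail = ""
    while True:
        chunk = src.read(CHUNK_SIZE)
        if not chunk:
            dst.write(tail)  # flush whatever is left
            break
        buf = (tail + chunk).replace(PATTERN, REPLACEMENT)
        dst.write(buf[:-keep])
        tail = buf[-keep:]  # carry the tail into the next block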
However, a search for "python search and replace algorithm" gives me this:
https://www.geeksforgeeks.org/python-string-replace/
Someone did the hard work for you already. Love that super-high-level programming language that leaves folks in the dark as to what is actually happening. Oh well.
Enjoy.
atomkey.

Python - Dividing a book in PDF form into individual text files that correspond with page numbers

I've converted my PDF file into a long string using PDFminer.
I'm wondering how I should go about dividing this string into smaller, individual strings/pages. Each page is delimited by a certain series of characters (CRLF, FF, page number, etc.), and the string should be split wherever these characters occur, with each piece appended to a new text file.
I have no experience with regex, but is using the re module the best way to go about this?
My vague idea for implementation is that I have to iterate through the file using the re.search function, creating text files with each new form feed found. The only code I have is PDF > text conversion. Can anyone point me in the right direction?
Edit: I think the expression I should use is something like ^.*(?=(\d\n\n\d\n\n\f\bFavela\b)) (capture everything before two digits, the line breaks, and the book's title 'Favela', which appears at the top of each page).
Can I save these \d digits as variables? I want to use them as file names as I iterate through the book and scoop up the portions of text divided by each appearance of \f\bFavela\b.
I'm thinking the re.sub method would do it, looping through and replacing with an empty string as I go.
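As a hedged sketch of one way to go: re.split() keeps the text of any capturing group, so a pattern with the page number in a group hands you the number alongside each page. Everything below (the file names, and the exact delimiter, which is only a guess from the description) is illustrative:

import re

with open("book.txt") as f:
    text = f.read()

# guessed delimiter: a page number, blank lines, a form feed, then the
# running title 'Favela' at the top of the next page
parts = re.split(r"(\d+)\n\n\f\bFavela\b", text)

# with one capturing group, re.split alternates [text, number, text, ...];
# the final piece after the last delimiter has no trailing number here
pages, numbers = parts[0::2], parts[1::2]
for number, page in zip(numbers, pages):
    with open("page_{}.txt".format(number), "w") as out:
        out.write(page)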

Search string with regex in large binary file (2 GB or more)

What's the best way to search for (multiple) strings in a large binary file (2 GB or more) using a regular expression?
The binary data is just 'raw' data (like a memory dump) and the string is not bounded.
I am able to do this in a large text file by reading the file line by line.
I suppose I need to read the file in chunks, but then there is a boundary risk (a match could sit across a chunk boundary).
And how can I search the binary data?
A short example is very much appreciated.
Edit:
I do not see the similarity; it's by no means clear to me.
read() takes a value which is a numeric indication of how much to read (in binary mode, that's bytes), so you could read the file in chunks, keeping as much as is reasonable and checking it with your regex. As space becomes an issue, remove only the beginning of what you've read before you read in the next chunk. This relies on having at least some guess at the length of a match, or rather an upper bound on it. If the regex you want to match encompasses more than the amount you can hold in memory at a time, then I'm out of ideas.
s = ""
SOME_CHUNK_SIZE = 4096 ## 4kb, totally arbitrary
with open("large_file", "rb") as fh:
if len(s) > SOME_BIG_NUMBER:
s = s[SOME_CHUNK_SIZE:]
s += fh.read(SOME_CHUNK_SIZE)
## do regex test now
That should maybe get you some of the way. You also need to know when you're at the end of the file; read() doesn't seem to throw an error there, it just returns zero bytes, which is what the break above checks for. Alternatively, you could check the file stats and do arithmetic with SOME_CHUNK_SIZE.

Compile and split a string from file using python

How can I compile a string from selected rows of a file, run some operations on that string, and then split it back into the original rows of that same file?
I only need certain rows of the file; I cannot apply the operations to the other parts. I have made a class that separates these rows from the file and runs the operations on them, but I'm thinking it would be even faster to run the operations on a single string containing all the parts of the file they apply to...
Or, if I can run these operations on a whole dictionary, that would help too. The operations are string replacements and regex replacements.
I am using python 3.3
Edit:
I'm going to explain this in greater detail here since my original post was so vague (thank you Paolo for pointing that out).
For instance, if I wanted to fix a SubRip file (an .srt file, a common subtitle format), I would take something like this as input (this is from an actual srt file):
Here you can find a correct example; pasting the file contents here messes up the newlines:
http://pastebin.com/ZdWUpNZ2
...And then I would only fix the rows which hold the actual subtitle lines, not the ordering-number rows or the hide/show timing rows of the subtitle file. So my compiled string might be:
"They're up on that ridge.|They got us pinned down."
Then I would run the operations on that string. Afterwards I would have to save those rows back to the file. How can I get the fixed subtitle rows back into my original file? I could split my compiled and fixed string using "|" as a row delimiter and put the pieces back into the original file, but how can I be certain which row goes where?
You can use pysrt to edit SubRip files:
from pysrt import SubRipFile

subs = SubRipFile.open('some/file.srt')
for sub in subs:
    # do something with sub.text
    pass

# save changes to a new file
subs.save('other/path.srt', encoding='utf-8')
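As an illustration, the loop body can assign back to sub.text; the specific replacement below is just a stand-in for whatever fixes you actually need:

for sub in subs:
    # stand-in fix-up; substitute your own string/regex replacements
    sub.text = sub.text.replace(" ,", ",")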
