How to read an HTML file without any limit using Python?

So I have an HTML file that consists of 4,574 words and 57,718 characters.
But recently, when I read it using the .read() method, it seems to hit a limit and only shows 3,004 words and 39,248 characters when I export it.
How can I read it and export it fully without any limitation?
This is my python script:
from IPython.display import FileLink, HTML
title = "Download HTML file"
filename = "data.html"
payload = open("./dendo_plot(2).html").read()
payload = payload.replace('"', '&quot;')
html = ('<a download="{filename}" href="data:text/html;charset=utf-8,{payload}" '
        'target="_blank">{title}</a>').format(filename=filename, payload=payload, title=title)
print(payload)
HTML(html)
This is what I mean (left: source file, right: exported file); you can see there is a gap between the two.

I don't think there's a problem here; I think you are simply misinterpreting a variation in a metric between your input and output.
When you call read() on an opened file with no arguments, it reads the whole content of the file (until EOF) and puts it into memory:
To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string [...]. size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.
From the official Python tutorial
So technically Python might be unable to read the whole file if it were too big to fit in your memory, but I strongly doubt that's what is happening here.
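As a small, hedged illustration of that behaviour (reusing the question's file path and assuming a UTF-8 file):
# Hedged illustration: read() with no argument returns the whole file,
# read(n) returns at most n characters per call.
with open("./dendo_plot(2).html", encoding="utf-8") as f:
    whole = f.read()             # everything up to EOF
print(len(whole))

with open("./dendo_plot(2).html", encoding="utf-8") as f:
    first_chunk = f.read(1000)   # at most 1000 characters
print(len(first_chunk))          # 1000, or fewer near EOF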
I believe the difference in the number of characters and words you see between your input and output is because your data is changed when it is processed.
Look at payload = payload.replace('"', '&quot;'). From an HTML validation point of view, " and &quot; are the same and are displayed the same (which is why you can swap them), but from a Python point of view they are different and have different lengths:
>>> len('"')
1
>>> len('&quot;')
6
So just with this line you get a variation in your input and output.
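A small, hedged sketch of how much that one substitution inflates the character count (the sample string is made up for illustration):
>>> payload = '<a href="page.html">link</a>'
>>> len(payload)
28
>>> len(payload.replace('"', '&quot;'))
38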
That being said, I don't think it is very relevant to use the number of characters and words to check if two pieces of HTML are the same. Take the following example:
>>> first_html = """<div>
... <p>Hello there</p>
... </div>"""
>>> len(first_html)
31
>>> second_html = "<div><p>Hello there</p></div>"
>>> len(second_html)
29
You would agree that both HTML snippets will display the same thing, but they don't have the same number of characters. The HTML specification is quite tolerant about spaces, tabs and new lines, which is why both previous examples are treated as equal by an HTML parser.
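To illustrate, here is a hedged sketch that normalises the whitespace between tags before comparing the two snippets above:
>>> import re
>>> normalise = lambda s: re.sub(r">\s+<", "><", s.strip())
>>> normalise(first_html) == normalise(second_html)
True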
About the number of words, one simple question (well, not that simple to answer though ^^'): what qualifies as a word in HTML? Is it only the text displayed? Do the HTML tags count as well? If so, what about their attributes?
So to sum up, I don't think you have a real problem here, only a difference that is a problem from a certain point of view, but not from another.

Related

When converting binary data to JSON in python how to get rid of extra lines and whitespaces?

I currently have binary data that looks like this:
test = b'Got [01]:\n{\'test\': [{\'message\': \'foo bar baz \'\n "\'secured\', current \'proposal\'.",\n \'name\': \'this is a very great name \'\n \'approves something of great order \'\n \'has no one else associated\',\n \'status\': \'very good\'}],\n \'log-url\': \'https://localhost/we/are/the/champions\',\n \'status\': \'rockingandrolling\'}\n'
As you can see this is basically JSON.
So what I did was the following:
test.decode('utf8').replace("Got [01]:\n{", '{').replace("\n", "").replace("'", '"')
This basically turned it into a string and got it as close to valid JSON as possible. Unfortunately, it doesn't fully get there, because when I convert it to a string, it keeps all these spaces and line breaks, which are hard to parse out with all the .replace()s I keep using.
Is there any way to make the binary data that is being output and decoded come out as one line, so I can parse the string and turn it into JSON format?
I have also used a regex for this specific case, and it works, but because this binary data is generated dynamically every time, the line breaks and spaces may fall in slightly different places. So a regex is too brittle to catch every case.
Any thoughts?
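One thought, sketched very roughly here (not from the original thread, and it does not fix the quoting issues in the data): collapse every run of whitespace in a single pass instead of chaining .replace() calls.
import re

decoded = test.decode('utf8')            # test is the bytes object shown above
one_line = re.sub(r'\s+', ' ', decoded)  # collapse every whitespace run to one space
print(one_line)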

How to save data to a file on separate items instead of one long string?

I am having trouble simply saving items into a file for later reading. When I save the file, instead of listing the items as separate entries, it joins the data together as one long string. According to my Google searches, this should not be happening.
What am I doing wrong?
Code:
with open('Ped.dta','w+') as p:
    p.write(str(recnum)) # Add record number to top of file
    for x in range(recnum):
        p.write(dte[x]) # Write date
        p.write(str(stp[x])) # Write Steps number
Since you do not show your data or your output, I cannot be sure, but it seems you are trying to use the write method like the print function. There are important differences.
Most importantly, write does not follow the characters it writes with any separator (like the space print uses by default) or end (like the \n print uses by default).
Therefore there is no space between your date and steps number, and no newline between lines, because you did not write them and Python did not add them.
So add those. Try the lines
p.write(dte[x]) # Write date
p.write(' ') # space separator
p.write(str(stp[x])) # Write Steps number
p.write('\n') # line terminator
Note that I do not know the format of your "date" that is written, so you may need to convert that to text before writing it.
Now that I have the time, I'll implement @abarnert's suggestion (in a comment) and show you how to get the advantages of the print function while still writing to a file. Just use the file= parameter in Python 3, or in Python 2 after executing the statement
from __future__ import print_function
Using print you can do my four lines above in one line, since print automatically adds the space separator and newline end:
print(dte[x], str(stp[x]), file=p)
This does assume that your date datum dte[x] is to be printed as text.
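Putting it together, a minimal sketch (the sample values for recnum, dte and stp are made-up assumptions, not data from the question):
recnum = 3
dte = ['2024-01-01', '2024-01-02', '2024-01-03']   # assumed date strings
stp = [5200, 6100, 4800]                            # assumed step counts

with open('Ped.dta', 'w+') as p:
    print(recnum, file=p)              # record number at the top of the file
    for x in range(recnum):
        print(dte[x], stp[x], file=p)  # date and steps, space-separated, one line each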
Try adding a newline character ('\n') at the end of your lines, as shown in the docs. This should solve the problem of 'listing the items as single items', but the file you create may still not be very well structured.
For your further Google searches, you may want to look into serialization, as well as the json and csv formats, all covered in the Python standard library.
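As a hedged illustration of the csv route (the column names and sample records below are assumptions):
import csv

dte = ['2024-01-01', '2024-01-02']   # assumed date strings
stp = [5200, 6100]                   # assumed step counts

with open('Ped.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'steps'])   # header row
    writer.writerows(zip(dte, stp))      # one row per record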
Your question would have benefited from a very small example of the recnum variable. Also, the original f.close() is not necessary since you have a with statement; see here at SO.

Python - Dividing a book in PDF form into individual text files that correspond with page numbers

I've converted my PDF file into a long string using PDFminer.
I'm wondering how I should go about dividing this string into smaller, individual strings/pages. Each page is separated by a certain series of characters (CRLF, FF, page number, etc.), and the string should be split and appended to a new text file wherever these characters occur.
I have no experience with regex, but is using the re module the best way to go about this?
My vague idea for implementation is that I have to iterate through the file using the re.search function, creating text files with each new form feed found. The only code I have is PDF > text conversion. Can anyone point me in the right direction?
Edit: I think the expression I should use is something like ^.*(?=(\d\n\n\d\n\n\f\bFavela\b)) (capture everything before two digits, the line breaks, and the book's title 'Favela', which appears at the top of each page).
Can I save these \d digits as variables? I want to use them as file names as I iterate through the book and scoop up the portions of text divided by each appearance of \fFavela.
I'm thinking the re.sub method would do it, looping through and replacing with an empty string as I go.
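A rough, hedged sketch of that splitting idea (the file name, the delimiter details and the page-number pattern are assumptions based on the description above, not tested against the actual book text):
import re

with open('book.txt', encoding='utf-8') as f:   # hypothetical PDFMiner text output
    text = f.read()

for page in text.split('\f'):                    # form feed between pages
    match = re.search(r'(\d+)\n\nFavela', page)  # assumed page number above the title
    if match:
        with open('page_{}.txt'.format(match.group(1)), 'w', encoding='utf-8') as out:
            out.write(page)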

Extracting img src value from json [duplicate]

I need help in extracting the src values from the text (e.g. LOC/IMG.png). What is an optimal approach to do this, as I have a file count of over 10^5 files?
I have JSON as follows :
{"Items":[{src=\"LOC/IMG.png\"}]}
You have JSON that contains some values that are HTML. If at all possible, therefore, you should parse the JSON as JSON, then parse the HTML values as HTML. This requires you to understand a tiny bit about the structure of the data—but that's a good thing to understand anyway.
For example:
import json
import bs4

def extract_img_srcs(s):  # wrapped in a generator function (name illustrative) so the yield below is valid
    j = json.loads(s)
    for item in j['Items']:
        soup = bs4.BeautifulSoup(item['Item'])
        for img in soup.find_all('img'):
            yield img['src']
This may be too slow, but it only takes a couple minutes to write the correct code, run it on 1000 random representative files, then figure out if it will be fast enough when extrapolated to whatever "file count of 1 Lakh" is. If it's fast enough, then do it this way; all else being equal, it's always better to be correct and simple than to be kludgy or complicated, and you'll save time if unexpected data show up as errors right off the bat than if they show up as incorrect results that you don't notice until a week later…
If your files are about 2K, like your example, my laptop can json.loads 2K of random JSON and BeautifulSoup 2K of random HTML in less time than it takes to read 2K off a hard drive, so at worst this will take only twice as long as reading the data and doing nothing. If you have a slow CPU and a fast SSD, or if your data are very unusual, etc., that may not be true (that's why you test, instead of guessing), but I think you'll be fine.
Let me put in a disclaimer for parser advocates: I do not claim regexes are the coolest, and I myself use XML/JSON parsers wherever I can. However, when it comes to malformed text, parsers usually cannot handle those cases the way I want, and I have to add regex-ish code to deal with those situations.
So, in case a regex is absolutely necessary, use the (?<=src=\\").*?(?=\\") regex. The look-behind (?<=src=\\") and look-ahead (?=\\") will act as boundaries for the values inside the src attributes.
Here is sample code:
import re
p = re.compile(r'(?<=src=\\").*?(?=\\")')
test_str = "YOUR_STRING"
re.findall(p, test_str)
See demo.
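For a quick sanity check, here is a hedged example applying the pattern to the snippet from the question (treated as raw text, so the backslashes are literal):
>>> import re
>>> text = r'{"Items":[{src=\"LOC/IMG.png\"}]}'
>>> re.findall(r'(?<=src=\\").*?(?=\\")', text)
['LOC/IMG.png']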

Python regex to find characters unsupported by XML 1.0 returns no results

I'm writing a Python 3.2 script to find characters in a Unicode XML-formatted text file which aren't valid in XML 1.0. The file itself isn't XML 1.0, so it could easily contain characters supported in 1.1 and later, but the application which uses it can only handle characters valid in XML 1.0 so I need to find them.
XML 1.0 doesn't support any characters in the range \u0001-\u0020, except for \u0009, \u000A, \u000D, and \u0020. Above that, \u0021-\uD7FF, \uE000-\uFFFD, and \u010000-\u10FFFF are also supported ranges, but nothing else. In my Python code, I define that regex pattern this way:
re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
However, the code below isn't finding a known bad character in my sample file (\u0007, the 'bell' character.) Unfortunately I can't provide a sample line (proprietary data).
I think the problem is in one of two places: Either a bad regex pattern, or how I'm opening the file and reading in lines—i.e. an encoding problem. I could be wrong, of course.
Here's the relevant code snippet.
processChunkFile() takes three parameters: chunkfile is an absolute path to a file (a 'chunk' of 500,000 lines of the original file, in this case) which may or may not contain a bad character. outputfile is an absolute path to an optional, pre-existing file to write output to. verbose is a boolean flag to enable more verbose command-line output. The rest of the code is just getting command-line arguments (using argparse) and breaking the single large file up into smaller files. (The original file's typically larger than 4GB, hence the need to 'chunk' it.)
def processChunkFile(chunkfile, outputfile, verbose):
    """
    Processes a given chunk file, looking for XML 1.0 chars.
    Outputs any line containing such a character.
    """
    badlines = []
    if verbose:
        print("Processing file {0}".format(os.path.basename(chunkfile)))
    # open given chunk file and read it as a list of lines
    with open(chunkfile, 'r') as chunk:
        chunklines = chunk.readlines()
    # check to see if a line contains a bad character;
    # if so, add it to the badlines list
    for line in chunklines:
        if badCharacterCheck(line, verbose) == True:
            badlines.append(line)
    # output to file if required
    if outputfile is not None:
        with open(outputfile.encode(), 'a') as outfile:
            for badline in badlines:
                outfile.write(str(badline) + '\n')
    # return list of bad lines
    return badlines

def badCharacterCheck(line, verbose):
    """
    Use regular expressions to seek characters in a line
    which aren't supported in XML 1.0.
    """
    invalidCharacters = re.compile("[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\u010000-\u10FFFF]")
    matches = re.search(invalidCharacters, line)
    if matches:
        if verbose:
            print(line)
            print("FOUND: " + matches.groups())
        return True
    return False
\u010000
Python \u escapes are four hex digits only, so that is U+0100 followed by two U+0030 Digit Zero characters. Use a capital-U escape with eight digits for characters outside the BMP:
\U00010000-\U0010FFFF
Note that this, and your expression in general, won't work on ‘narrow builds’ of Python, where strings are based on UTF-16 code units and characters outside the BMP are handled as two surrogate code units. (Narrow builds were the default on Windows. Thankfully they go away with Python 3.3.)
it could easily contain characters supported in 1.1 and later
(Although XML 1.1 can only contain those characters when they're encoded as numeric character references &#...;, so the file itself may still not be well-formed.)
open(chunkfile, 'r')
Are you sure the chunkfile is encoded in locale.getpreferredencoding?
The original file's typically larger than 4GB, hence the need to 'chunk' it.
Ugh, monster XML is painful. But with sensible streaming APIs (and filesystems!) it should still be possible to handle. For example here, you could process each line one at a time using for line in chunk: instead of reading all of the chunk at once using readlines().
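A hedged sketch of that streaming variant, reusing the names from the question's code and meant to replace the readlines() block inside processChunkFile:
badlines = []
with open(chunkfile, 'r') as chunk:
    for line in chunk:                         # one line at a time, low memory
        if badCharacterCheck(line, verbose):
            badlines.append(line)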
re.search(invalidCharacters, line)
As invalidCharacters is already a compiled pattern object, you can just call invalidCharacters.search(...).
Having said all that, it still matches U+0007 Bell for me.
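For reference, here is a hedged sketch assembling the corrected pattern with the capital \U escape (Python 3.3+, where narrow builds are gone):
import re

# corrected character class: astral range written with eight-digit \U escapes
invalid_xml10 = re.compile(
    "[^\u0009\u000A\u000D\u0020\u0021-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]")
print(bool(invalid_xml10.search("bad \u0007 bell")))  # True: the bell character is caught
print(bool(invalid_xml10.search("all fine here")))    # False: nothing invalid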
The fastest way to remove words, characters, strings or anything else between two known tags or two known characters in a string is a direct, native-C approach using re, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything between the markers and works faster and cleaner than Beautiful Soup. When using Pythonic methods with regular expressions, keep in mind that Python's regular expressions have not strayed far from the low-level implementations they come from, so why iterate many times when a single pass can find it all as one chunk in one iteration? You can do the same with individual characters:
var = re.sub('\[', '<!--', var)
var = re.sub('\]', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var) # wipes it all out, including the delimiters
And you do not need Beautiful Soup. You can also scrape data this way if you understand how it works.
