I've just begun learning Python and I've run into a small problem.
I need to parse a text file, more specifically an HTML file (but its syntax is so weird: divs after divs after divs, the result of Google's 'View as HTML' for a certain PDF whose text I can't seem to extract because it has a messy table done in MS Word).
Anyway, I chose a rather low-level approach because I just need the data ASAP and, since I'm beginning to learn Python, I figured learning the basics would do me some good too.
I've got everything done except for a small part in which I need to retrieve a set of integers from a set of divs. Here's an example:
<div style="position:absolute;top:522;left:1020"><nobr>*88</nobr></div>
Now, the numbers I want to retrieve are all the ones inside <nobr></nobr> (in that case, '588') and, since it's quite a messy file, I have to make sure that what I am getting is correct. To do so, that number inside <nobr></nobr> must be preceded by "left:1020", "left:1024" or "left:1028". This is because of the automatic conversion, and in my opinion the best choice would be to get all the numbers preceded by left:102[0-9].
To do so, I was trying to use:
for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index])
    out = o.group(1)
But so far, no such luck... How can I get those numbers?
Thanks in advance,
J.
Don't use regular expressions to parse HTML. BeautifulSoup will make light work of this.
As for your specific problem, it might be that you are missing a colon at the end of the first line:
for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index]):
    out = o.group(1)
If this isn't the problem, please post the error you are getting, and what you expect the output to be.
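For reference, here is a minimal sketch of the corrected loop, assuming the page is held in a single string. The sample markup and the numbers in it are made up for illustration, and the capture group is narrowed to digits as one possible way to guard against stray matches:

```python
import re

# Hypothetical sample: three divs, only two of which sit at left:102x.
html = (
    '<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>'
    '<div style="position:absolute;top:540;left:300"><nobr>12</nobr></div>'
    '<div style="position:absolute;top:558;left:1024"><nobr>73</nobr></div>'
)

# left:102[0-9] accepts 1020 through 1029; use left:102[048] to accept
# only 1020, 1024 and 1028. (\d+) captures digits only.
pattern = re.compile(r'left:102[0-9]"><nobr>(\d+)</nobr></div>')

numbers = []
for match in pattern.finditer(html):
    numbers.append(int(match.group(1)))

print(numbers)  # [588, 73]
```

If the values are not always plain digits, the original lazy group (.*?) can be kept and the result validated afterwards.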
Yes, I am aware of BeautifulSoup. I know how much better it is but unfortunately Regex is my only option right now, and quite frankly I'm stumped.
I've extracted the titles that I need and can get them to print in console, but can't get them to print as a label in tkinter.
This is what happens when it runs:
I am very appreciative of any advice or help as I have a long couple of nights ahead of me xoxo
Return the list from the print_uk definition, then index into the list returned by print_uk() inside the Label constructor to pick the element you want.
For something neater, try yielding the label text from a generator.
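A rough sketch of both suggestions follows. The body of print_uk and the titles are placeholders, since the original code was not posted; iter_uk and build_labels are hypothetical helper names, and the Tk root window is assumed to be created by the caller:

```python
def print_uk():
    """Print the titles as before, but also return them as a list."""
    titles = ["First title", "Second title", "Third title"]  # placeholder data
    for title in titles:
        print(title)
    return titles


def iter_uk():
    """Generator variant: yield one label text at a time."""
    for title in print_uk():
        yield title


def build_labels(root):
    """Create one tkinter Label per title inside an existing root window."""
    import tkinter as tk  # imported here so the helpers above work without a display

    labels = []
    for title in iter_uk():
        label = tk.Label(root, text=title)
        label.pack()
        labels.append(label)
    return labels
```

To show a single title, index the returned list, for example tk.Label(root, text=print_uk()[0]). To show them all, create the root with tk.Tk(), call build_labels(root) and then root.mainloop().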
I'm trying to teach myself a little Python and in the process I'm 'borrowing' code from places to help build my project. A snippet from a piece of code I have, which extracts a temperature value from a string, looks like this...
re.findall(r"Temp=(\d+.\d+)", string_variable)[0]
For the life of me, I cannot find any documentation on what the "[0]" at the end is used for or how to use it.
Obviously I figured out that without it my final output is something like this:
['71.8']
and with it, my number is cleaner and rounded up:
72.0
Can someone point me to where this is documented so I can better understand how to use it in the future?
re.findall(r"Temp=(\d+.\d+)", string_variable) returns a list, [0] gets the first element of that list.
This is a sign that your method of teaching yourself by looking at snippets of code without context is not working. Go through a more traditional tutorial.
The documentation for re, in the section on re.findall, states "Return all non-overlapping matches of pattern in string, as a list of strings." So the return value is a list. The Python Tutorial section on lists explains what [0] at the end of the list does.
I highly recommend that you read through the entire Python Tutorial, as I did, or something similar, to learn Python.
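To make the list-versus-element distinction concrete, here is a small sketch. The input string is invented for illustration, and the dot in the pattern is escaped so that it matches only a literal decimal point:

```python
import re

string_variable = "Temp=71.8 Humidity=40.2"  # hypothetical input

matches = re.findall(r"Temp=(\d+\.\d+)", string_variable)
print(matches)        # ['71.8']  (a list of strings)

first = matches[0]    # plain list indexing: take the first element
print(first)          # 71.8     (still a string, not rounded)

# Any rounding has to happen separately, e.g. by converting and rounding:
print(round(float(first)))  # 72
```

Note that [0] raises IndexError when findall returns an empty list, so it is worth checking that there was a match before indexing.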
I have the following HTML and I need to remove the script tags and any script related attributes in the HTML. By script related attributes I mean any attribute that starts with on.
<body>
<script src="...">
</script>
<div onresize="CreateFixedHeaders()" onscroll="CreateFixedHeaders()" id="oReportDiv" style="overflow:auto;WIDTH:100%">
<script type="text/javascript" language="javascript">
//<![CDATA[
function CreateFixedHeaders() {}//]]>
</script>
<script>
var ClientReportfb64a4706a3749c484169e...
</script>
</body>
My first thought was to use BeautifulSoup to remove the tags and attributes. Unfortunately, I am unable to use BeautifulSoup. Seeing that BeautifulSoup is off the table I can see two options for doing this. The first option I see is splitting the strings and parsing based on index. This seems like a bad solution to me.
The other option is to use Regular Expressions. However, we know that isn't a good solution either (Cthulhu Parsing).
Now with that in mind, I personally feel it is alright to use regular expressions to strip the attributes. After all, with those it is still simple string manipulation.
So for removing the attributes I have:
script_attribute_regex = r'\son[a-zA-Z]+="[a-zA-Z0-9\.;\(\)_]+"'
result = re.sub(script_attribute_regex, "", page_source)
As I've said before, I personally think the above is a perfectly acceptable use of regular expressions with HTML. But I would still like to get some opinions on the above usage.
Then there is the question of the script tags. I'm very tempted to go with Regular Expressions for this because I know them and I know what I need is pretty simple. Something like:
<script(.*)</script>
The above would start to get me close to what I need. And yes I realize the above RegEx will grab everything starting at the first opening script tag until the last closing script tag, but it's a starting example.
I'm very tempted to use Regular Expressions as I'm familiar with them (more so than Python) and I know that is the quickest way to achieve the results I want, at least for me it is.
So I need help to go against my nature and not be evil. I want to be evil and use RegEx, so somebody please show me the light and guide me to the promised land of non-Regular Expressions.
Thanks
Update:
It looks like I wasn't very clear about what my question actually is; I apologize for that. My question is: how can I parse the HTML using pure Python without regular expressions?
<script(.*)</script>
As for the above code example, it's wrong. I know it is wrong, I was using it as an example of a starting point.
I hope this clears up my question somewhat.
Update 2
I just wanted to add a few more notes about what I am doing.
I am crawling a web site to get the data I need.
Once we have the page that contains the data we need it is saved to the database.
Then the saved web page is displayed to the user.
The issue I am trying to solve happens here. The application throws a script error when you attempt to interact with the page that forces the user to click on a confirmation box. The application is not a web browser but uses the web browser DLL in Windows (I cannot remember the name at the moment).
The error in question only happens in this one page for this one web site.
Update 3
After adding the update I realized I was overthinking the problem: I was looking for a more generic solution. However, in this case that isn't what is needed.
The page is dynamically generated, however the script tags will stay static. With that in mind the solution becomes much simpler. With that I no longer need to treat it like HTML but as static strings.
So the solution I'm looking at is
import re
def strip_script_tags(page_source: str) -> str:
    pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
    result = re.sub(pattern, "", page_source)
    pattern2 = re.compile(r'<script[\s\S]+?/script>')
    result = re.sub(pattern2, "", result)
    return result
I would like to avoid regular expressions; however, since I'm limited to using only the standard library, regular expressions seem like the best solution in this case. Which means @skamazin's answer is correct.
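For comparison, the standard library also ships html.parser, which can do the same job without regular expressions. The following is only a sketch under some assumptions: ScriptStripper and strip_scripts are hypothetical names, comments and doctype declarations are simply dropped, and any attribute whose name starts with "on" is treated as a script attribute:

```python
from html import escape
from html.parser import HTMLParser


class ScriptStripper(HTMLParser):
    """Re-emit HTML while dropping <script> elements and on* attributes."""

    def __init__(self):
        # Keep entity references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_script = False

    def _emit_tag(self, tag, attrs, closer=">"):
        kept = []
        for name, value in attrs:
            if name.startswith("on"):
                continue  # drop event-handler attributes such as onscroll
            if value is None:
                kept.append(name)  # bare attribute without a value
            else:
                kept.append(f'{name}="{escape(value, quote=True)}"')
        self.parts.append("<" + " ".join([tag] + kept) + closer)

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
        elif not self.in_script:
            self._emit_tag(tag, attrs)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags; a self-closing <script/> is dropped entirely.
        if tag != "script" and not self.in_script:
            self._emit_tag(tag, attrs, closer=" />")

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
        elif not self.in_script:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_script:
            self.parts.append(data)

    def handle_entityref(self, name):
        if not self.in_script:
            self.parts.append(f"&{name};")

    def handle_charref(self, name):
        if not self.in_script:
            self.parts.append(f"&#{name};")


def strip_scripts(page_source: str) -> str:
    parser = ScriptStripper()
    parser.feed(page_source)
    parser.close()
    return "".join(parser.parts)


page = (
    '<div onresize="CreateFixedHeaders()" id="oReportDiv">'
    "<script>var x = 1;</script><p>Hi</p></div>"
)
print(strip_scripts(page))  # <div id="oReportDiv"><p>Hi</p></div>
```

The trade-off is that the output is rebuilt rather than edited in place, so whitespace inside tags and attribute quoting may differ from the original page.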
As for removing all the attributes that start with on, you can try the following.
It uses the regex:
\s?on\w+="[^"]+"\s?
And substitutes with the empty string (deletion). So in Python it should be:
pattern = re.compile(r'\s?on\w+="[^"]+"\s?')
subst = ""
result = re.sub(pattern, subst, file)
If you are trying to match anything between the script tags try:
<script[\s\S]+?/script>
The problem with your regex is that the dot (.) doesn't match the newline character. Using a set together with its complement, [\s\S], will match every possible character. And make sure to use the ? in [\s\S]+? so that it is lazy instead of greedy.
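The three behaviours can be compared side by side in a short sketch. The page_source string here is invented for illustration; passing re.DOTALL so that the dot also matches newlines would be an alternative to [\s\S]:

```python
import re

# Hypothetical page: one multi-line script block and one single-line block.
page_source = (
    "<p>keep</p><script>\nvar a = 1;\n</script>"
    "<p>also keep</p><script>b()</script>"
)

# '.' does not cross newlines, so the multi-line block is never matched.
dot_matches = re.findall(r"<script.+?/script>", page_source)
print(dot_matches)   # ['<script>b()</script>']

# Lazy [\s\S]+? stops at the first closing tag, removing each block separately.
lazy_result = re.sub(r"<script[\s\S]+?/script>", "", page_source)
print(lazy_result)   # <p>keep</p><p>also keep</p>

# Greedy [\s\S]+ runs to the last closing tag and swallows the markup between.
greedy_result = re.sub(r"<script[\s\S]+/script>", "", page_source)
print(greedy_result)  # <p>keep</p>
```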
I am writing a program to parse the IETF Internet-drafts and pull out such things as title, date, protocol, and the countries of the authors. I realize this has been done before (arkko.com), but it's a little self-imposed programming exercise.
The problem I'm having is this:
Using some logic, some basic parsing, and
position = doc.tell()
I have precisely identified the point in each document where I need to begin examining lines and looking for, identifying, and pulling out the authors' countries of origin. And I can get to that precise point with:
doc.seek(position)
The problem I'm having is...then what? Having gotten to that position, I've tried every combination of file and string methods that I know to start parsing an arbitrary number of following lines, but I cannot make it work.
Sorry I don't have any full code snippets, but I've tried way too many and I think I might be barking up the entirely wrong tree at this point.
Edit: Actually I came up with a fairly simple solution:
I went through the file once, counted lines, and noted the line number of where I needed to begin parsing.
Then I went through the file again counting lines, and when the line numbers were greater than the first line number, I began parsing.
Probably not the most elegant solution in that I think I should have been able to use doc.seek() to avoid a second count, but it works. And now I know an area of string and file manipulation I need to explore a bit more.
You just need to call doc.read(some_buffer_length) and you'll get a string back.
How you deal with that string is a completely separate issue, but it doesn't matter if it comes from the beginning of the file, or not.
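As an illustration, after seek() the file object can also be read line by line. This is a sketch with made-up file contents and a hypothetical "Authors' Addresses" marker; it uses readline() while searching because tell() is disabled on text files during for-loop iteration:

```python
import os
import tempfile

# Hypothetical stand-in for a draft; real documents will look different.
sample = (
    "Internet-Draft\n"
    "Some Title\n"
    "Authors' Addresses\n"
    "Jane Doe\n"
    "Finland\n"
    "John Roe\n"
    "Canada\n"
)

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

# First pass: find the marker line and remember the position just after it.
position = None
with open(path) as doc:
    while True:
        line = doc.readline()
        if not line:
            break
        if line.startswith("Authors' Addresses"):
            position = doc.tell()
            break

# Later: jump straight to that position and iterate over the lines that follow.
with open(path) as doc:
    doc.seek(position)
    following_lines = [line.rstrip("\n") for line in doc]

os.remove(path)
print(following_lines)  # ['Jane Doe', 'Finland', 'John Roe', 'Canada']
```

This avoids counting lines a second time: the value saved from tell() is passed back to seek() unchanged.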
I'm creating a simple comment-like application and need to convert normal URLs into links, image links into images and YouTube/Vimeo/etc. links into flash objects. E.g.:
http://foo.bar to <a href="http://foo.bar">http://foo.bar</a>
http://foo.bar/image.gif to <img src="http://foo.bar/image.gif"/>
etc.
Of course I can write all of that myself, but I think it's such an obvious piece of code that somebody has already written it (maybe even with splitting text into paragraphs). I was googling for some time but couldn't find anything comprehensive, just a few snippets. Does such a filter (or something like it) exist?
Thanks!
PS. There is urlize but it works only for the first case.
Write a custom filter to handle all the necessary cases. Look at the source code for urlize to get started. You'll also need the urlize function from utils.
In your filter, first test for the first case and call urlize on that. Handle the second case and any other cases you may have.
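The case-by-case logic could look roughly like the sketch below. It is plain Python using only the standard library rather than urlize itself; linkify, URL_RE and IMAGE_EXTENSIONS are hypothetical names, and the URL pattern is deliberately simple (for example, it will swallow trailing punctuation):

```python
import re
from html import escape

# Simplistic URL pattern, assumed good enough for illustration only.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def linkify(text: str) -> str:
    """Escape plain text and wrap URLs in <a> or <img> tags."""
    parts = []
    last = 0
    for match in URL_RE.finditer(text):
        # Escape the text between URLs so user input cannot inject markup.
        parts.append(escape(text[last:match.start()]))
        url = match.group(0)
        safe = escape(url, quote=True)
        if url.lower().endswith(IMAGE_EXTENSIONS):
            parts.append(f'<img src="{safe}"/>')              # image link case
        else:
            parts.append(f'<a href="{safe}">{safe}</a>')      # plain link case
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


print(linkify("see http://foo.bar/image.gif and http://foo.bar"))
```

Video links would be one more branch ahead of the plain-link case, and the function could then be wrapped in a custom template filter as described above.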