Trying to build a tkinter web scraper using regex - python

Yes, I am aware of BeautifulSoup. I know how much better it is but unfortunately Regex is my only option right now, and quite frankly I'm stumped.
I've extracted the titles that I need and can get them to print in console, but can't get them to print as a label in tkinter.
This is what happens when it runs:
I am very appreciative of any advice or help as I have a long couple of nights ahead of me xoxo

Return the list from print_uk, then index into the list returned by print_uk() inside the Label constructor to pick the element you want.
To be fancier, make it a generator that yields each label's text.
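A minimal sketch of that suggestion. The question's code is not shown, so the sample markup, the regex, and the show_titles helper are all assumptions made up for illustration; only the name print_uk comes from the answer.

```python
import re

# Hypothetical sample markup; the real page and pattern are not shown in the question.
SAMPLE_HTML = '<h3 class="title">First song</h3><h3 class="title">Second song</h3>'


def print_uk(html):
    """Return the list of titles instead of only printing them."""
    titles = re.findall(r'<h3 class="title">(.*?)</h3>', html)
    return titles


def show_titles(titles):
    """Build one Label per title and return the root window (hypothetical helper)."""
    import tkinter as tk  # imported here so the parsing part works without a display

    root = tk.Tk()
    for title in titles:
        tk.Label(root, text=title).pack()
    return root
```

To see the labels, something like show_titles(print_uk(SAMPLE_HTML)).mainloop() should do; to show a single title, pass print_uk(SAMPLE_HTML)[0] as the text of one Label.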

Related

How to print HTML and to highlight some tags in PyQt?

I'm writing a program using PyQt5. One of its functions prints HTML with certain tags highlighted in different colours. Each new string is processed and added to the text using the .append method. I need to show the raw HTML, which is why QTextEdit is not suitable. To solve this I switched to QPlainTextEdit, but that created a new problem: I can no longer use <font> tags to assign a colour to a particular tag. Escaping the tags in QTextEdit does not seem like a good idea. I also can't assign a colour to the whole field.
How can I solve this problem?
P.S. Sorry for mistakes in my English. You can tell me about them.
I would like to make a comment but I don't have enough reputation.
The comment section already has a good way of doing it, and here is another way.
You can just import html and use html.escape(text). This escapes the parts of the HTML that are supposed to be literal strings while keeping the rest of the HTML working, so you can keep using QTextEdit.
Here's a quick example of what I did:
a = html.escape("<font size=\"3\" color=\"red\">This is some text!</font>")
self.append(""+a+"")
and this is the result.
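For reference, here is roughly what html.escape does to that string, without any Qt involved. The span wrapper and the variable names are just assumptions for illustration; any real markup around the escaped text would behave the same way.

```python
import html

raw = '<font size="3" color="red">This is some text!</font>'

# The escaped copy is rendered literally by a rich-text widget.
escaped = html.escape(raw)

# Real markup wrapped around the escaped text still works as markup,
# so the literal tag can be coloured.
highlighted = '<span style="color:red">' + escaped + '</span>'

print(escaped)
print(highlighted)
```

The highlighted string is what would be passed to QTextEdit's append.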

How to make an assertion to find xpath?

Bear with me, I am very new to this. I am writing in Python using Selenium Webdriver.
I have a test written out and I want to make an assertion to check whether a class is present. I have tried multiple ways of doing this, but I feel it is not working correctly.
I have attached a photo of the code I am looking at (the highlighted line). The element is a pop-up that fires to notify the user that the form was submitted successfully. This is what I have so far:
self.assertTrue(driver.find_element(By.XPATH, '//*[@id="message-center"]/div'))
I would take advantage of the .find_elements method. It returns an empty list if no such elements are found, whereas .find_element raises NoSuchElementException.
self.assertTrue(len(driver.find_elements(By.XPATH, '//*[@id="message-center"]/div')) > 0)

Python re.findall (returns a number)[0]

I'm trying to teach myself a little Python and in the process I'm 'borrowing' code from places to help build my project. A snippet from a piece of code I have, which extracts a temperature value from a string, looks like this:
re.findall(r"Temp=(\d+.\d+)", string_variable)[0]
For the life of me, I cannot find any documentation on what the "[0]" at the end is for or how to use it.
Obviously I figured out that without it my final output is something like this:
['71.8']
and with it, my number is cleaner and rounded up:
72.0
Can someone point me to where this is documented so I can better understand how to use it in the future?
re.findall(r"Temp=(\d+.\d+)", string_variable) returns a list, [0] gets the first element of that list.
This is a sign that your method of teaching yourself by looking at snippets of code without context is not working. Go through a more traditional tutorial.
The documentation for re, in the section on re.findall, states: "Return all non-overlapping matches of pattern in string, as a list of strings." So the return value is a list. The Python Tutorial section on lists explains what [0] at the end of the list does.
I highly recommend that you read through the entire Python Tutorial, as I did, or something similar, to learn Python.
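To make the list-versus-element point concrete, here is a small sketch. The input string is made up, and the dot in the pattern is escaped here so that it only matches a literal decimal point.

```python
import re

string_variable = "Temp=71.8"  # hypothetical input, not from the question

# findall returns a list with one string per match of the capture group.
matches = re.findall(r"Temp=(\d+\.\d+)", string_variable)

# [0] is ordinary list indexing: it picks the first match, still as a string.
first = matches[0]

# Any conversion or rounding has to be done explicitly afterwards.
value = float(first)

print(matches, first, value)
```

Note that [0] itself does no rounding; if the final output differs from the matched text, that happens somewhere later in the code.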

python - urls.py regex help with coordinates

I tried searching but could not find an answer for this.
I am trying to write a function that takes in coordinates, i.e. latitude and longitude. For example, 53.345633,-6.267014.
These will then be fed into the Google Maps API as the user's current location, together with some nearby places (I already have these locations stored somewhere), and the directions will hopefully be returned.
So, I pretty much have all the Maps work done, but I can't test it because, frustratingly enough, I simply cannot figure out the regex for urls.py.
Can anyone help me with this? I'm hoping it's simple enough for you guys. I tried it earlier but failed miserably! It's so frustrating because I'm so close to finishing this part too!
Thanks for the help!
PS Is this format for coordinates advisable, with the comma? Perhaps 53.345633+-6.267014 would be better (then I can just use my_coords = coords.replace("+", ", ") or something)?
I'm not sure I get it, but you can try something like:
def str2cords(scords):
    return [float(c) for c in scords.split(',')]  # or maybe just scords.split(',') since float might mess up the exact cords?
or regex :
'/(-?\d+\.\d+),(-?\d+\.\d+)/'
If you want the dot to be optional.
(?P<lon>-?\d+\.?\d*)/(?P<lat>-?\d+\.?\d*)
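The pattern can be tried out with the re module alone before it goes into urls.py. This is only a sketch: the group names lat and lon, the comma separator, and the parse_coords helper are assumptions based on the example in the question.

```python
import re

# Latitude first, then longitude, separated by a comma (as in the question's example).
COORDS = re.compile(r"^(?P<lat>-?\d+\.\d+),(?P<lon>-?\d+\.\d+)$")


def parse_coords(text):
    """Return (lat, lon) as floats, or None if the text does not match."""
    match = COORDS.match(text)
    if match is None:
        return None
    return float(match.group("lat")), float(match.group("lon"))


print(parse_coords("53.345633,-6.267014"))
print(parse_coords("not-coordinates"))
```

Named groups are handy here because a URL pattern with named groups hands them to the view as keyword arguments.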

Parsing a range of integers in a list

I've just begun learning Python and I've run into a small problem.
I need to parse a text file, more specifically an HTML file. Its syntax is weird (divs after divs after divs): it is the result of Google's 'View as HTML' for a certain PDF, and I can't seem to extract the text any other way because it contains a messy table made in MS Word.
Anyway, I chose a rather low-level approach because I just need the data ASAP and, since I'm beginning to learn Python, I figured learning the basics would do me some good too.
I've got everything done except for a small part in which I need to retrieve a set of integers from a set of divs. Here's an example:
<div style="position:absolute;top:522;left:1020"><nobr>*88</nobr></div>
The numbers I want to retrieve are the ones inside <nobr></nobr> (in this case, '588'). Since it's quite a messy file, I have to make sure that what I am getting is correct: the number inside <nobr></nobr> must be preceded by "left:1020", "left:1024" or "left:1028". This is due to the automatic conversion, and in my opinion the best choice would be to get all the numbers preceded by left:102[0-9].
To do so, I was trying to use:
for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index])
out = o.group(1)
But so far, no such luck... How can I get those numbers?
Thanks in advance,
J.
Don't use regular expressions to parse HTML. BeautifulSoup will make light work of this.
As for your specific problem, it might be that you are missing a colon at the end of the first line:
for o in re.finditer('left:102[0-9]"><nobr>(.*?)</nobr></div>', words[index]):
    out = o.group(1)
If this isn't the problem, please post the error you are getting and what you expect the output to be.
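For what it's worth, here is a self-contained sketch of the corrected loop collecting every match rather than overwriting out each time. The sample markup is made up to resemble the div in the question, and the variable names are assumptions.

```python
import re

# Hypothetical input resembling the converted PDF markup from the question.
html_text = (
    '<div style="position:absolute;top:522;left:1020"><nobr>588</nobr></div>'
    '<div style="position:absolute;top:540;left:300"><nobr>12</nobr></div>'
    '<div style="position:absolute;top:560;left:1024"><nobr>590</nobr></div>'
)

pattern = re.compile(r'left:102[0-9]"><nobr>(.*?)</nobr></div>')

# Keep only the values whose div is positioned at left:1020 to left:1029.
numbers = [m.group(1) for m in pattern.finditer(html_text)]

print(numbers)
```

The div at left:300 is skipped because its style does not contain left:102 followed by a digit.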
