Scrape content from python string in reverse order using Regex - python

I got the following content from an HTML page:
str='http://www.ralphlauren.com/graphics/product_images/pPOLO2-24922076_alternate1_v360x480.jpg\', zoom: \'s7-1251098_alternate1\' }]\n\n\nEnlarge Image\n\n\n\n\n\n\nCotton Canvas Utility Jacket\nStyle Number : 112933196\n\n\n\n$125.00'
I have many HTML pages like this one. I need some way to read the content that comes right BEFORE the style number; in this case, I need Cotton Canvas Utility Jacket. Is there a regex in Python to do that? Note that I can start looking for the pattern Enlarge Image and read whatever comes before I hit Style Number. The issue is that there are many occurrences of Enlarge Image on the HTML page. What I have shown above is only part of the page; the full HTML page is here.
In short, I need to find the product name from the linked HTML page.
Thanks.
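A minimal sketch of the approach described in the question, assuming the product name is always the single non-empty line sitting between an "Enlarge Image" block and "Style Number" (the sample string below is the one from the question; the other pages are assumed to follow the same layout):

import re

# The snippet from the question; other pages are assumed to look the same.
page = ("http://www.ralphlauren.com/graphics/product_images/pPOLO2-24922076_alternate1_v360x480.jpg', "
        "zoom: 's7-1251098_alternate1' }]\n\n\nEnlarge Image\n\n\n\n\n\n\n"
        "Cotton Canvas Utility Jacket\nStyle Number : 112933196\n\n\n\n$125.00")

# Match the single non-empty line between "Enlarge Image" and "Style Number".
# Other "Enlarge Image" blocks on the page will not match, because they are
# not immediately followed by a "Style Number" line.
match = re.search(r"Enlarge Image\s*([^\n]+)\s*Style Number", page)
if match:
    print(match.group(1).strip())  # Cotton Canvas Utility Jacket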

Related

Python weasyprint convert page to PDF problem (from China)

I am trying to convert cnn.com pages to PDF with WeasyPrint. It mostly works, but the PDF headers are always covered by a black block, which is annoying when it hides the content. Does anybody know how to remove it, for example by defining a CSS stylesheet? Sincerely appreciated!
You can reproduce the problem with any article from cnn.com.
Alternatively, can you recommend a better conversion tool? I have tried pdfkit, but it cannot download the full page: the 'read more' button is always displayed, even with a User-Agent appended to the HTTP headers. These tools behave differently from site to site, which is odd.
This is my code:

import weasyprint

url = 'https://edition.cnn.com/2021/07/23/tech/taiwan-china-cybersecurity-intl-hnk/index.html'
weasyprint.HTML(url).write_pdf('1.pdf')
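If the black block comes from a fixed page header, one thing to try is passing an extra user stylesheet to write_pdf that hides the offending elements. A sketch, assuming this is the cause; the selectors below are guesses and would need to be replaced with the real class names found by inspecting the page:

import weasyprint

url = 'https://edition.cnn.com/2021/07/23/tech/taiwan-china-cybersecurity-intl-hnk/index.html'

# Extra user stylesheet that hides the elements suspected of drawing the
# black block. The selectors are placeholders, not cnn.com's real ones.
hide_chrome = weasyprint.CSS(string="""
    header, nav, .ad-slot, .sticky-banner { display: none !important; }
""")

weasyprint.HTML(url).write_pdf('1.pdf', stylesheets=[hide_chrome])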

Extract text from HTML faster than NLTK?

We use NLTK to extract text from HTML pages, but we only want the most trivial text analysis, e.g. word counts.
Is there a faster way to extract visible text from HTML using Python?
Understanding HTML (and ideally CSS) at some minimal level, e.g. visible vs. invisible nodes, images' alt texts, etc., would be an additional plus.
Ran into the same problem at my previous workplace. You'll want to check out BeautifulSoup.

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')  # html is the page source as a string
print(soup.text)
You'll find its documentation here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can ignore elements based on attributes. As to understanding external stylesheets, I'm not too sure. However, what you could do there, and something that would not be too slow (depending on the page), is to render the page with something like PhantomJS and then select the rendered text :)
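As a rough sketch of the "ignore elements" idea: drop the nodes whose text is never rendered (script, style, etc.) before pulling the text. Whether this ends up faster than NLTK for your pages is something you would have to measure; the sample HTML below is made up:

from bs4 import BeautifulSoup

def visible_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Remove nodes whose text is never shown to the user.
    for node in soup(['script', 'style', 'noscript', 'head']):
        node.decompose()
    # Collapse the remaining text into whitespace-separated words.
    return soup.get_text(separator=' ', strip=True)

html = '<html><head><style>p{color:red}</style></head><body><p>Hello <b>world</b></p></body></html>'
print(len(visible_text(html).split()))  # crude word count: 2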

Extract text from CSS based on font-size

I wrote a function that parses all headers based on the header tags (h1, h2, ...). Now I want to expand on it and add a feature that parses text based on font-size, say either 20px or 1.5em, regardless of the headers: I want to pull any text written in a font size greater than X, wherever it is on the page. The function takes a JSON file as input, containing arbitrary HTML (and whatever else a website could have, i.e. CSS etc.).
Based on crummy it seems like one possible option is to use soup.fetch(); however, I haven't found many examples using it for this purpose.
Since font-size may well appear in a CSS component, I'm not sure that bs4 is the right package for it. I assume the answer involves cssutils or tinycss, but I haven't been able to find the best way to use those for this task.
As a reference, my code for the header tags was posted for review: https://codereview.stackexchange.com/questions/166671/extract-html-content-based-on-tags-specifically-headers/166674?noredirect=1#comment317280_166674.
Posts I've checked:
What is the pythonic way to implement a css parser/replacer ;
Find all the span styles with font size larger than the most common one via beautiful soup python ;
Search in HTML page using Regex patterns with python ;
How to parse a web page containing CSS and HTML using python ;
how to extract text within font tag using beautifulsoup ;
Extract text with bold content from css selector
Thanks much,
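A rough sketch for the inline-style case only: walk the document with BeautifulSoup and keep any element whose style attribute declares a font-size above a pixel threshold (20px here is just an example). Sizes given in em/%, or set in external stylesheets, would still need cssutils/tinycss plus the cascade, which this does not attempt:

import re
from bs4 import BeautifulSoup

FONT_SIZE_RE = re.compile(r'font-size\s*:\s*([\d.]+)\s*px', re.I)

def big_inline_text(html, min_px=20):
    """Return text of elements whose inline style sets font-size >= min_px."""
    soup = BeautifulSoup(html, 'html.parser')
    hits = []
    for tag in soup.find_all(style=True):
        m = FONT_SIZE_RE.search(tag['style'])
        if m and float(m.group(1)) >= min_px:
            hits.append(tag.get_text(strip=True))
    return hits

html = '<div style="font-size: 24px">Big headline</div><p style="font-size:12px">small</p>'
print(big_inline_text(html))  # ['Big headline']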

style hyperlinks in reportlab pdfs

I am using rst2pdf to generate a PDF. I am using links to sections and they appear as hyperlinks in the PDF. If I hover over the link I can see it says "Go to page XXX". Is there a way to insert that page number into the text, so that it can be seen on hardcopies?
I started using reportlab recently. Maybe you can use the superscript tag?
p = Paragraph("<link href='http://someurl' color='blue'><u>Some text</u><super> [goto page xx]</super></link>", customstyle)
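A self-contained version of that snippet, assuming a plain Platypus document; the output file name and the Normal style stand in for whatever customstyle is in your setup. Note that the page number in the superscript is typed by hand here, it is not filled in automatically from the link target:

from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

styles = getSampleStyleSheet()
customstyle = styles['Normal']  # placeholder for your own ParagraphStyle

# The "[goto page xx]" text is hand-written; reportlab does not resolve it.
p = Paragraph(
    "<link href='http://someurl' color='blue'><u>Some text</u>"
    "<super> [goto page xx]</super></link>",
    customstyle,
)

SimpleDocTemplate('link_example.pdf').build([p])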

Counting content only in HTML page

Is there any way I can parse a website by viewing only the content as displayed to the user in the browser? That is, instead of downloading "page.html" and parsing the whole page with all the HTML/JavaScript tags, I would like to retrieve the version as displayed to users in their browsers. I want to "crawl" websites and rank them according to keyword popularity (viewing the HTML source version is problematic for that purpose).
Thanks!
Joel
A browser also downloads the page.html and then renders it. You should work the same way: use an HTML parser like lxml.html or BeautifulSoup; with those you can ask for only the text enclosed within tags (and the attributes you care about, like title and alt).
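A minimal sketch of that suggestion using lxml.html (the sample HTML is made up): text_content() returns the document's text without the markup, after the never-rendered script/style elements have been dropped.

import lxml.html

html = """<html><body>
  <h1 title="greeting">Hi</h1>
  <p>some <b>text</b></p>
  <script>var ignored = 1;</script>
</body></html>"""

doc = lxml.html.fromstring(html)
# Drop elements whose text is never shown to the user.
for el in doc.xpath('//script | //style'):
    el.drop_tree()

words = doc.text_content().split()
print(len(words), words)  # 3 ['Hi', 'some', 'text']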
You could get the source and strip the tags out, leaving only non-tag text, which works for almost all pages, except those where JavaScript-generated content is essential.
The pyparsing wiki Examples page includes this html tag stripper.
