How can I convert String (with linebreaks) to HTML? - python

When I print the string (in Python) coming from a website I scraped it from, it looks like this:
"His this
is
a sample
String"
It does not show the \n breaks. This is what I see in a Python interpreter.
And I want to convert it to HTML that will add in the line breaks. I was looking around and didn't see any libraries that do this out of the box.
I was thinking BeautifulSoup, but wasn't quite sure.

If you have a string that you read from a file, you can just replace each \n with <br>, which is a line break in HTML. Strings are immutable, so assign the result:
my_string = my_string.replace('\n', '<br>')

You can use the python replace(...) method to replace all line breaks with the html version <br> and possibly surround the string in a paragraph tag <p>...</p>. Let's say the name of the variable with the text is text:
html = "<p>" + text.replace("\n", "<br>") + "</p>"

While searching for this answer I found this, which is likely better because it escapes the HTML special characters, at least in Python 3:
Python – Convert HTML Characters To Strings
# import html
import html
# Create Text
text = 'Γeeks for Γeeks'
# unescape() converts HTML entities in the text back to characters
print(html.unescape(text))
# escape() converts the special characters &, < and > to HTML entities
print(html.escape(text))
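To make clearer what html.escape does and does not touch, here is a minimal sketch, assuming Python 3. The sample string and the variable names are made up for illustration, and the "xmlcharrefreplace" step is an optional extra that the answer above does not mention.

```python
import html

text = "Γ < Δ & more\nsecond line"

# html.escape only rewrites the HTML special characters (&, < and >, plus
# quotes by default); non-ASCII letters such as Γ pass through unchanged.
escaped = html.escape(text)
print(escaped)

# html.unescape reverses the escaping.
assert html.unescape(escaped) == text

# To also turn non-ASCII characters into numeric character references,
# the "xmlcharrefreplace" error handler can be applied afterwards.
ascii_only = escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_only)
```

Note that neither call touches the newline, so the line breaks still need replace("\n", "<br>") if they should show up in the rendered page.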

I believe this will work:
result = ""
for char in text:
    # Copy each character, swapping newlines for <br>
    if char == "\n":
        result += "<br>"
    else:
        result += char

Related

Unnecessary Indentations in BeautifulSoup

I'm trying to parse a webpage. However, I only want to focus on the text within the div tag labelled "class='body conbody'". I want my program to look inside of this tag and output the text exactly like how it appears on the webpage.
Here is my code so far:
pres_file = directory + "\\" + pres_number + ".html"
with open(pres_file) as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')
    desiredText = soup.find('div', class_='body conbody')
    for para in desiredText.find_all('p'):
        print(para.get_text())
The problem with my current code is that whenever I try to print the paragraphs, (a), (1), (2), (b), and (c) are always formatted with a lot of unnecessary newlines and additional spaces after it. However, I would like for it to output text that is equivalent to how it looks on the webpage. How can I change my code to accomplish this?
I want my program to look inside of this tag and output the text exactly like how they appear on the webpage.
The browser does a lot of processing to display a web page. This includes removing extra spaces. Additionally, the browser developer tools show a parsed version of the HTML as well as potential additions from dynamic JavaScript code.
On the other hand, you are opening a raw text file and get the text as it is, including any formatting such as indentation and line breaks. You will need to process this yourself to format it the way you want when you output it.
There are at least two things to look for:
Is the indentation made of tab or space characters? print() writes a tab as-is, and most terminals display it by advancing to the next multiple of 8 columns. You can either replace the tabs with spaces to reduce the indentation, or use another output method that lets you configure how tabs are shown.
The strings themselves will include a newline character, and print() adds another line break. So either remove the newline character from each string or use print(para.get_text(), end='') to stop print() from adding its own newline.
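Both points can be sketched in a few lines. This is only an illustration: the helper name tidy, the tab width of 4 and the sample string are assumptions, not part of the original code.

```python
def tidy(text, tabsize=4):
    """Expand tabs to a chosen width and drop trailing newlines."""
    # str.expandtabs replaces each tab with spaces up to the next tab stop.
    # rstrip("\n") removes trailing newlines so print() adds the only one.
    return text.expandtabs(tabsize).rstrip("\n")

raw = "\t(a) First item\n"
print(tidy(raw))      # indented by 4 spaces, one line break from print()
print(raw, end="")    # alternative: keep the text, stop print() adding a newline
```

In the loop from the question, this would be used as print(tidy(para.get_text())).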
You can use strip() on strings, like para.get_text().strip(). This removes any whitespace before and after the string.
You can use lstrip() or rstrip() to remove the extra whitespace from only the left or right side of the string.
s = " \t \n\n something \t \n "
print(s.strip()) # 'something'
print(s.lstrip()) # 'something \t \n '
print(s.rstrip()) # ' \t \n\n something'
Would something like this work:
Strip left and right of the p
Indent the paragraph with 1em (so 1 times the font size)
Newline each paragraph
font_size = 16 # get the font size
for para in desiredText.find_all('p'):
    print(font_size * " " + para.get_text().strip(' \t\n\r') + "\n")

Convert text to HTML comment in Python

I have a Python script that generates some HTML. It does so using the Python markdown library. I'd like to stick the original Markdown text in a comment at the end of the HTML, where it will occasionally be useful for debugging purposes. I've tried just plunking the Markdown text after the end of the HTML, and it doesn't work for me (Firefox). So the way I imagine this working is that I run Markdown and then simply append the Markdown source, marked as a comment, after the HTML. However, HTML is apparently somewhat finicky about what it will allow in comments. The site htmlhelp.com gives the following advice after some discussion:
For this reason, use the following simple rule to compose valid and accepted [portable] comments:
An HTML comment begins with "<!--", ends with "-->" and does not contain "--" or ">" anywhere in the comment.
(source)
So it looks like I need to do some escaping or something to get my bunch of markdown text into a form that HTML will accept as a comment. Is there an existing tool that will help me do this?
According to the w3:
Comments consist of the following parts, in exactly the following order:
- the comment start delimiter "<!--"
- text
- the comment end delimiter "-->"
The text part of comments has the following restrictions:
1. must not start with a ">" character
2. must not start with the string "->"
3. must not contain the string "--"
4. must not end with a "-" character
These are very simple rules. You could regex-enforce them, but they are so simple you don't even need that!
3 of the 4 conditions can be met with concatenation, and the other one with a simple replace(). All in all, it's a one-liner:
def html_comment(text):
    return '<!-- ' + text.replace('--', '- - ') + ' -->'
Note the spaces.
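To convince yourself that the one-liner respects the four restrictions, you can check its output against them. A minimal sketch: html_comment is repeated from above, while comment_body_is_valid and the sample strings are hypothetical helpers written only for this check.

```python
def html_comment(text):
    # Same approach as above: break up "--" and pad with spaces.
    return '<!-- ' + text.replace('--', '- - ') + ' -->'

def comment_body_is_valid(body):
    """Check the four restrictions on the text between the delimiters."""
    return not (
        body.startswith('>')
        or body.startswith('->')
        or '--' in body
        or body.endswith('-')
    )

for sample in ['plain text', '--> sneaky', 'a -- b --- c', '>', 'ends with -']:
    comment = html_comment(sample)
    # Strip the "<!--" and "-->" delimiters to get the comment text.
    body = comment[len('<!--'):-len('-->')]
    assert comment_body_is_valid(body), comment
```

The leading and trailing spaces take care of rules 1, 2 and 4, and the replace() takes care of rule 3.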
Can't you just .replace it? Ultimately, you could replace those characters with anything, but substituting escape codes probably won't make your comment any more readable than substituting nothing.
commented = '<!-- %s -->' % markdown_text.replace('--', '').replace('>', '')

Transform textarea input to paragraphed HTML

I'd like to transform what the user inputs into a textarea on an HTML page into <p>-tagged output, where each line becomes its own <p> instead of being separated by newlines.
I'm trying with regular expressions but I can't get it to work. Will someone correct my expression?
String = "Hey, this is paragraph 1 \n and this is paragraph 2 \n and this will be paragraph 3"
Regex = r'(.+?)$'
It just results in Hey, this is paragraph 1 \n and this is paragraph 2 \n<p>and this will be paragraph 3</p>
I wouldn't use regular expressions for this, simply because you do not need it. Check this out:
text = "Hey, this is paragraph 1 \n and this is paragraph 2 \n and this will be paragraph 3"
html = ''
for line in text.split('\n'):
    html += '<p>' + line + '</p>'
print(html)
To make it one line, because shorter is better, and clearer:
html = ''.join('<p>'+L+'</p>' for L in text.split('\n'))
I would do it this way:
s = "Hey, this is paragraph 1 \n and this is paragraph 2 \n and this will be paragraph 3"
"".join("<p>{0}</p>".format(row) for row in s.split('\n'))
You basically split your string into a list of lines. Then wrap each line with paragraph tags. In the end just join your lines.
Above answers relying on identifying '\n' do not work reliably. You need to use .splitlines(). I don't have enough rep to comment on the chosen answer, and when I edited the wiki, someone just reverted it. So can someone with more rep please fix it.
Text from a textarea may use '\r\n' as its newline sequence.
>>> "1\r\n2".split('\n')
['1\r', '2']
'\r' alone is invalid inside a webpage, so using any of the above solutions produces ill-formed web pages.
Luckily python provides a function to solve this. The answer that works reliably is:
html = ''.join('<p>'+L+'</p>' for L in text.splitlines())
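A slightly fuller sketch of the same idea, assuming Python 3. The function name textarea_to_paragraphs is made up, and skipping blank lines and escaping the user input are additions of mine, not something the answers above do.

```python
import html

def textarea_to_paragraphs(text):
    """Wrap each non-empty line of the input in <p>...</p>."""
    paragraphs = []
    # splitlines() handles LF, CRLF and bare CR line endings alike.
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines instead of emitting empty <p></p>
            paragraphs.append('<p>' + html.escape(line) + '</p>')
    return ''.join(paragraphs)

print(textarea_to_paragraphs("one\r\ntwo <b>\r\n\r\nthree"))
# <p>one</p><p>two &lt;b&gt;</p><p>three</p>
```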
You need to get rid of the anchor, $. Your regex is trying to match one or more of any non-newline characters, followed by the end of the string. You could use MULTILINE mode to make the anchors match at line boundaries, like so:
s1 = re.sub(r'(?m)^.+$', r'<p>\g<0></p>', s0)
...but this works just as well:
s1 = re.sub(r'.+', r'<p>\g<0></p>', s0)
The reluctant quantifier ( .+? ) wasn't doing anything useful either, but it didn't mess up the output like the anchor did.
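The difference between the three patterns is easy to see on a short input. A small sketch, where s0 is a made-up sample and the variable names are only for illustration:

```python
import re

s0 = "line 1\nline 2\nline 3"

# Without MULTILINE, $ matches only at the end of the string (or just
# before a final newline), so only the last line gets wrapped.
original = re.sub(r'(.+?)$', r'<p>\1</p>', s0)
print(original)

# With (?m), ^ and $ match at every line boundary.
multiline = re.sub(r'(?m)^.+$', r'<p>\g<0></p>', s0)
print(multiline)

# Without anchors: "." stops at newlines, so each line is one match.
no_anchor = re.sub(r'.+', r'<p>\g<0></p>', s0)
print(no_anchor)
```

The last two calls give the same result; the first one reproduces the behaviour described in the question.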
Pretty easy >>
html='<p>'+s.replace("\n",'</p><p>')+'</p>'

How to remove \xa0 from string in Python?

I am currently using Beautiful Soup to parse an HTML file and calling get_text(), but it seems like I'm being left with a lot of \xa0 Unicode representing spaces. Is there an efficient way to remove all of them in Python 2.7, and change them into spaces? I guess the more generalized question would be, is there a way to remove Unicode formatting?
I tried using: line = line.replace(u'\xa0',' '), as suggested by another thread, but that changed the \xa0's to u's, so now I have "u"s everywhere instead. ):
EDIT: The problem seems to be resolved by str.replace(u'\xa0', ' ').encode('utf-8'), but just doing .encode('utf-8') without replace() seems to cause it to spit out even weirder characters, \xc2 for instance. Can anyone explain this?
\xa0 is actually non-breaking space in Latin1 (ISO 8859-1), also chr(160). You should replace it with a space.
string = string.replace(u'\xa0', u' ')
Calling .encode('utf-8') encodes the Unicode string as UTF-8, where each code point is represented by 1 to 4 bytes. In this case, \xa0 is represented by the 2 bytes \xc2\xa0.
Read up on http://docs.python.org/howto/unicode.html.
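The same thing can be seen directly in an interpreter. This sketch assumes Python 3 (the question is about 2.7, where the literals would need a u prefix), and the sample text is made up:

```python
text = "price:\xa0100"   # \xa0 is U+00A0, the no-break space

# Encoding does not remove the character; it only changes its byte form.
assert text.encode("utf-8") == b"price:\xc2\xa0100"
assert text.encode("latin-1") == b"price:\xa0100"

# Replacing it gives an ordinary space, whatever encoding is used later.
cleaned = text.replace("\xa0", " ")
print(cleaned.encode("utf-8"))   # b'price: 100'
```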
Please note: this answer is from 2012. Python has moved on, and you should be able to use unicodedata.normalize now.
There are many useful things in Python's unicodedata library. One of them is the .normalize() function.
Try:
new_str = unicodedata.normalize("NFKD", unicode_str)
Replace NFKD with any of the other forms listed in the link above if you don't get the results you're after.
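One thing to keep in mind is that normalization changes more than just \xa0. A small sketch, assuming Python 3, with a made-up sample string that contains an accented letter, a no-break space and a ligature:

```python
import unicodedata

# "café", a no-break space, then the "ﬁ" ligature followed by "ne"
unicode_str = "caf\u00e9\xa0\ufb01ne"

nfkd = unicodedata.normalize("NFKD", unicode_str)
nfkc = unicodedata.normalize("NFKC", unicode_str)

# Both forms turn \xa0 into a plain space and split the ligature into "fi".
# NFKD also decomposes é into "e" plus a combining accent; NFKC recomposes it.
print(repr(nfkd))
print(repr(nfkc))
```

If only the no-break space should change, a plain replace('\xa0', ' ') is the more targeted tool.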
After trying several methods, to summarize it, this is how I did it. Following are two ways of avoiding/removing \xa0 characters from parsed HTML string.
Assume we have our raw html as following:
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
So lets try to clean this HTML string:
from bs4 import BeautifulSoup
raw_html = '<p>Dear Parent, </p><p><span style="font-size: 1rem;">This is a test message, </span><span style="font-size: 1rem;">kindly ignore it. </span></p><p><span style="font-size: 1rem;">Thanks</span></p>'
text_string = BeautifulSoup(raw_html, "lxml").text
print(repr(text_string))
# 'Dear Parent,\xa0This is a test message,\xa0kindly ignore it.\xa0Thanks'
The above code produces these characters \xa0 in the string. To remove them properly, we can use two ways.
Method # 1 (Recommended):
The first one is BeautifulSoup's get_text method with strip argument as True
So our code becomes:
clean_text = BeautifulSoup(raw_html, "lxml").get_text(strip=True)
print(clean_text)
# Dear Parent,This is a test message,kindly ignore it.Thanks
Method # 2:
The other option is to use python's library unicodedata
import unicodedata
text_string = BeautifulSoup(raw_html, "lxml").text
clean_text = unicodedata.normalize("NFKD",text_string)
print(clean_text)
# Dear Parent, This is a test message, kindly ignore it. Thanks
I have also detailed these methods on this blog which you may want to refer.
Try using .strip() at the end of your line
line.strip() worked well for me
try this:
string.replace('\\xa0', ' ')
I ran into this same problem pulling some data from a sqlite3 database with python. The above answers didn't work for me (not sure why), but this did: line = line.decode('ascii', 'ignore') However, my goal was deleting the \xa0s, rather than replacing them with spaces.
I got this from this super-helpful unicode tutorial by Ned Batchelder.
Try this code
import re
re.sub(r'[^\x00-\x7F]+','','paste your string here').decode('utf-8','ignore').strip()
Python recognizes it as a whitespace character, so you can split on it without arguments and join with a normal space:
line = ' '.join(line.split())
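For example, assuming Python 3 (where str.split() treats \xa0 as whitespace) and a made-up sample line:

```python
line = "Dear\xa0Parent,\xa0\xa0 this is\ta test\n"

# str.split() with no arguments splits on any run of Unicode whitespace,
# which includes \xa0, tabs and newlines.
cleaned = " ".join(line.split())
print(cleaned)   # Dear Parent, this is a test
```

Keep in mind that this also collapses tabs and line breaks into single spaces, so it only fits when the line structure does not matter.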
I ended up here while googling a problem with a non-printable character. I use MySQL UTF-8 general_ci and deal with the Polish language. For problematic strings I have to proceed as follows:
text=text.replace('\xc2\xa0', ' ')
It is just a quick workaround, and you should probably try to set up the right encoding instead.
In Beautiful Soup, you can pass get_text() the strip parameter, which strips white space from the beginning and end of the text. This will remove \xa0 or any other white space if it occurs at the start or end of the string. Beautiful Soup replaced an empty string with \xa0 and this solved the problem for me.
mytext = soup.get_text(strip=True)
It's the equivalent of a space character, so strip it
print(string.strip()) # no more xa0
0xA0 (Unicode) is 0xC2A0 in UTF-8. .encode('utf8') will just take your Unicode 0xA0 and replace with UTF-8's 0xC2A0. Hence the apparition of 0xC2s... Encoding is not replacing, as you've probably realized now.
You can try string.strip()
It worked for me! :)
Generic version with a regular expression (it removes literal \x.. escape sequences, such as the text \xa0, from the string):
import re
def remove_control_chart(s):
    return re.sub(r'\\x..', '', s)
This is how I solved this issue when I encountered \xa0 in an HTML-encoded string.
I discovered that a non-breaking space is inserted to ensure that a word and the subsequent HTML markup are not separated when the page is resized.
This presents a problem for the parsing code, as it introduces codec encoding issues. What made it hard was that we are not privy to the encoding used. On Windows machines it can be Latin-1 or CP1252 (Western ISO), but more recent OSes have standardized on UTF-8. By normalizing the Unicode data, we strip \xa0:
my_string = unicodedata.normalize('NFKD', my_string).encode('ASCII', 'ignore')
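Be aware that this approach is lossy: everything that is still non-ASCII after normalization gets dropped. A minimal sketch, assuming Python 3, where the helper name to_ascii and the sample strings are made up:

```python
import unicodedata

def to_ascii(text):
    """NFKD-normalize, then drop whatever is still not ASCII."""
    normalized = unicodedata.normalize("NFKD", text)
    # encode() returns bytes in Python 3, so decode back to str.
    return normalized.encode("ascii", "ignore").decode("ascii")

print(to_ascii("a\xa0b"))                 # the no-break space becomes a space
print(to_ascii("\u0141\u00f3d\u017a"))    # "Łódź": accents go, and Ł vanishes
```

Letters that have a decomposition lose their accents, while letters without one (such as Ł) disappear entirely, so this is only safe when the text is expected to be plain ASCII anyway.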

Print string as HTML

I would like to know if there is any way to convert a plain unicode string to HTML in Genshi so that, for example, it renders newlines as <br/>.
I want this to render some text entered in a textarea.
Thanks in advance!
If Genshi works just as KID (which it should), then all you have to do is
${XML("<p>Hi!</p>")}
We have a small function to transform from a wiki format to HTML
import re

def wikiFormat(text):
    patternBold = re.compile("(''')(.+?)(''')")
    patternItalic = re.compile("('')(.+?)('')")
    patternBoldItalic = re.compile("(''''')(.+?)(''''')")
    translatedText = (text or "").replace("\n", "<br/>")
    # Apply the longest marker first so ''''' is not consumed as ''' or ''
    translatedText = patternBoldItalic.sub(r'<b><i>\2</i></b>', translatedText or '')
    translatedText = patternBold.sub(r'<b>\2</b>', translatedText or '')
    translatedText = patternItalic.sub(r'<i>\2</i>', translatedText or '')
    return translatedText
You should adapt it to your needs.
${XML(wikiFormat(text))}
Maybe use a <pre> tag.
1. Convert plain text to HTML by escaping the "<" and "&" characters (and maybe some more, but these two are the absolute minimum) as HTML entities.
2. Substitute every newline with the text "<br />", possibly still combined with a newline.
In that order.
All in all that shouldn't be more than a few lines of Python code. (I don't do Python but any Python programmer should be able to do that, easily.)
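The two steps, and why the order matters, can be sketched like this. It assumes Python 3 and uses html.escape from the standard library for step 1; the function names plain_to_html and wrong_order are made up for illustration.

```python
import html

def plain_to_html(text):
    # Step 1: escape first. Step 2: add <br /> and keep the newline.
    return html.escape(text, quote=False).replace("\n", "<br />\n")

def wrong_order(text):
    # Reversed steps: the inserted tags get escaped too.
    return html.escape(text.replace("\n", "<br />\n"), quote=False)

sample = "1 < 2\nR&D"
print(plain_to_html(sample))
print(wrong_order(sample))
```

The first call produces a real <br /> tag, while the second one shows the tag as literal text (&lt;br /&gt;) in the browser. The result of plain_to_html can then be passed to the template.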
edit I found code on the web for the first step. For step 2, see string.replace at the bottom of this page.
In case anyone is interested, this is how I solved it. This is the python code before the data is sent to the genshi template.
from trac.wiki.formatter import format_to_html
from trac.mimeview.api import Context
...
context = Context.from_request(req, 'resource')
data['comment'] = format_to_html(self.env, context, comment, True)
return template, data, None
