Identifying sections tabbed in from raw text - python

Consider the text on this page. If you look at the source code, you'll see that the main text is presented exactly as in the page -- no HTML divisions or any other way to obviously find paragraphs/tabbed in sections.
Is there a way to automatically identify and remove sections that are tabbed in from the raw text?
One thing I notice is that when I encode the text as text = unicode(raw_text).encode("utf-8"), I can then see a bunch of \n's for line skips, but no \t's. (This may not be a useful direction to think in, but it's just an idea.)
Edit: The following works
text = unicode(raw_text).encode("utf-8")
y = [x for x in text.split("\n") if "     " not in x]
final = " ".join(y)

Well, after looking at the page, they are 'tabbed' in with spaces rather than the tab character; looking for tabs would not be useful. It looks like the section is tabbed in with 5 spaces.
raw_text.replace('     ', '')
To replace all occurrences of 5 spaces...
from re import sub
...
raw_text = sub(r'     .*\n', '', raw_text)
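Putting the pieces together as a runnable sketch (the five literal spaces tend to get collapsed when the page is rendered, so the pattern below spells them out as ` {5}`; the sample text is made up):

```python
from re import sub

raw_text = "kept line\n     indented section line\nanother kept line\n"
# Drop every line whose content begins with a run of five spaces,
# i.e. the tabbed-in sections:
cleaned = sub(r' {5}.*\n', '', raw_text)
# cleaned == "kept line\nanother kept line\n"
```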

Related

Unnecessary Indentations in BeautifulSoup

I'm trying to parse a webpage. However, I want to focus only on the text within the div tag labelled "class='body conbody'". I want my program to look inside of this tag and output the text exactly like how they appear on the webpage.
Here is my code so far:
pres_file = directory + "\\" + pres_number + ".html"
with open(pres_file) as html_file:
    soup = BeautifulSoup(html_file, 'html.parser')
desiredText = soup.find('div', class_='body conbody')
for para in desiredText.find_all('p'):
    print(para.get_text())
The problem with my current code is that whenever I try to print the paragraphs, (a), (1), (2), (b), and (c) are always formatted with a lot of unnecessary newlines and additional spaces after them. However, I would like it to output text that is equivalent to how it looks on the webpage. How can I change my code to accomplish this?
I want my program to look inside of this tag and output the text exactly like how they appear on the webpage.
The browser does a lot of processing to display a web page. This includes removing extra spaces. Additionally, the browser developer tools show a parsed version of the HTML as well as potential additions from dynamic JavaScript code.
On the other hand, you are opening a raw text file and get the text as it is, including any formatting such as indentation and line breaks. You will need to process this yourself to format it the way you want when you output it.
There are at least two things to look for:
Is the indentation made of tab or space characters? A terminal typically renders a tab as up to 8 spaces. You can either replace the tabs with spaces to reduce the indentation, or use another output method that lets you specify how to show tabs.
The strings themselves will include a newline character, but print() also adds a line break. So either remove the newline character from each string, or use print(para.get_text(), end='') to stop print from adding another newline.
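A minimal sketch of that second point (the sample string is made up):

```python
line = "some paragraph text\n"  # get_text() often keeps the trailing newline
print(line)          # print() adds its own newline, so a blank line appears
print(line, end='')  # end='' suppresses it; only the string's newline remains
```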
You can use strip() on strings, like para.get_text().strip(). This will remove any whitespaces before and after the string.
You can use either lstrip() and rstrip() to remove only the exceeding whitespaces from the left or right side of the string.
s = " \t \n\n something \t \n "
print(s.strip()) # 'something'
print(s.lstrip()) # 'something \t \n '
print(s.rstrip()) # ' \t \n\n something'
Would something like this work:
Strip left and right of the p
Indent the paragraph with 1em (so 1 times the font size)
Newline each paragraph
font_size = 16  # get the font size
for para in desiredText.find_all('p'):
    print(font_size * " " + para.get_text().strip(' \t\n\r') + "\n")

What is the most efficient way of matching and replacing every three newlines with an identifier?

I am working with some .txt files that don't have structure (they are messy); they represent a number of pages. In order to give them some structure, I would like to identify the pages, since the file itself doesn't number them. This can be done by replacing every three newlines with some annotation like:
\n
page: N
\n
Where N is the number. This is what my files look like, and I also tried a simple replace. However, that function gets confused and does not give me the expected format, which would be something like this. Any idea how to replace the spaces with some kind of identifier, just so I can parse them and get the position of some information (the page)?
I also tried this:
import re
replaced = re.sub('\b(\s+\t+)\b', '\n\n\n', text)
print (replaced)
If the format is as regular as you state in your problem description:
Replace every occurrence of three newlines \n with page: N
You wouldn't have to use the re module. Something as simple as the following would do the trick:
>>> s='aaaaaaaaaaaaaaaaa\n\n\nbbbbbbbbbbbbbbbbbbbbbbb\n\n\nccccccccccccccccccccccc'
>>> pages = s.split('\n\n\n')
>>> ''.join(page + '\n\tpage: {}\n'.format(i + 1) for i, page in enumerate(pages))
'aaaaaaaaaaaaaaaaa\n\tpage: 1\nbbbbbbbbbbbbbbbbbbbbbbb\n\tpage: 2\nccccccccccccccccccccccc\n\tpage: 3\n'
I suspect, though, that your format is less regular than that, but you'll have to include more details before I can give a good answer for that.
If you want to split with messy whitespace (which I'll define as at least three newlines with any other whitespace mixed in), you can replace s.split('\n\n\n') with:
re.split(r'(?:\n\s*?){3,}', s)
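As a runnable sketch of the messy-whitespace split (the sample text is made up):

```python
import re

messy = "page one\n \n\t\npage two\n\n\n\npage three"
# Runs of three or more newlines, possibly with other whitespace
# mixed in between them, all count as a single page break:
pages = re.split(r'(?:\n\s*?){3,}', messy)
# pages == ['page one', 'page two', 'page three']
```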

New lines/tabulators turn into spaces in generated document

I have a problem with the \n and \t characters. When I open a generated .docx in OpenOffice everything looks fine, but when I open the same document in Microsoft Word I just get the last two tabulators in the "Surname" section, and spaces instead of newlines/tabulators in the other sections. What is wrong?
p = document.add_paragraph('Simple paragraph')
p.add_run('Name:\t\t' + name).bold = True
p.add_run('\n\nSurname:\t\t' + surname)
In Word, what we often think of as a line feed translates to a paragraph object. If you want empty paragraphs in your document you will need to insert them explicitly.
First of all though, you should ask whether you're using paragraphs for formatting, a common casual practice for Word users but one you might want to deal with differently, in particular by using the space-before and/or space-after properties of a paragraph. In HTML this would correspond roughly to padding-top and padding-bottom.
In this case, if you just want the line feeds, consider using paragraphs like so:
document.add_paragraph('Simple paragraph')
p = document.add_paragraph()
p.add_run('Name:\t\t').bold = True
p.add_run(name)
document.add_paragraph()
p = document.add_paragraph()
p.add_run('Surname:\t\t').bold = True
p.add_run(surname)

Python: Removing particular character (u"\u2610") from string

I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over XML text files (sample) that are apparently encoded in UTF-8, using Beautiful Soup to parse each file, then checking whether any sentence in the file contains one or more words from two different lists of words. Because the XML files are from the eighteenth century, I need to retain the em dashes that are in the XML. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
    openfile = open(work)
    readfile = openfile.read()
    stringfile = str(readfile)
    decodefile = stringfile.decode('utf-8', 'strict') # is this the dodgy line?
    soup = BeautifulSoup(decodefile)
    textwithtags = soup.findAll('text')
    textwithtagsasstring = str(textwithtags)
    # this method strips everything between angle brackets, as it should
    textwithouttags = stripTags(textwithtagsasstring)
    # clean text
    nonewlines = textwithouttags.replace("\n", " ")
    noextrawhitespace = re.sub(' +', ' ', nonewlines)
    print noextrawhitespace # the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error flag:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which it does by using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoded str object. If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re

with open('k000039.000.xml') as f:
    soup = BeautifulSoup(f)           # BS figures out the encoding
text = u''.join(soup.strings)         # strings is a generator for just the text bits
text = re.sub(ur'\s+', ur' ', text)   # Simplify all white space.
text = text.replace(u'\u2223', u'')   # Get rid of the DIVIDES character.
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.

Transform textarea input to paragraphed HTML

I'd like to transform what the user inputs into a textarea on an HTML page into <p>-tagged output, where each newline starts a new <p>.
I'm trying with regular expressions but I can't get it to work. Will someone correct my expression?
String = "Hey, this is paragraph 1 \n and this is paragraph 2 \n and this will be paragraph 3"
Regex = r'(.+?)$'
It just results in Hey, this is paragraph 1 \n and this is paragraph 2 \n<p>and this will be paragraph 3</p>
I wouldn't use regular expressions for this, simply because you do not need it. Check this out:
text = "Hey, this is paragraph 1 \n and this is paragraph 2 \n and this will be paragraph 3"
html = ''
for line in text.split('\n'):
    html += '<p>' + line + '</p>'
print html
To make it one line, because shorter is better, and clearer:
html = ''.join('<p>'+L+'</p>' for L in text.split('\n'))
I would do it this way:
s = "Hey, this is paragraph 1 \n and this is paragraph 2 \n and this will be paragraph 3"
"".join("<p>{0}</p>".format(row) for row in s.split('\n'))
You basically split your string into a list of lines. Then wrap each line with paragraph tags. In the end just join your lines.
The above answers, which rely on identifying '\n', do not work reliably. You need to use .splitlines(). I don't have enough rep to comment on the chosen answer, and when I edited the wiki, someone just reverted it. So can someone with more rep please fix it.
Text from a textarea may use '\r\n' as a new line character.
>>> "1\r\n2".split('\n')
['1\r', '2']
'\r' alone is invalid inside a webpage, so using any of the above solutions produces ill-formed web pages.
Luckily python provides a function to solve this. The answer that works reliably is:
html = ''.join('<p>'+L+'</p>' for L in text.splitlines())
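A quick illustration of the difference (the sample string is made up):

```python
text = "1\r\n2\r3\n4"
# split('\n') would leave stray '\r' characters behind;
# splitlines() recognises '\n', '\r\n', and bare '\r' alike:
parts = text.splitlines()
# parts == ['1', '2', '3', '4']
```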
You need to get rid of the anchor, $. Your regex is trying to match one or more of any non-newline characters, followed by the end of the string. You could use MULTILINE mode to make the anchors match at line boundaries, like so:
s1 = re.sub(r'(?m)^.+$', r'<p>\g<0></p>', s0)
...but this works just as well:
s1 = re.sub(r'.+', r'<p>\g<0></p>', s0)
The reluctant quantifier ( .+? ) wasn't doing anything useful either, but it didn't mess up the output like the anchor did.
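A runnable sketch of the MULTILINE version (the sample string is adapted from the question):

```python
import re

s0 = "Hey, this is paragraph 1\nand this is paragraph 2"
# (?m) makes ^ and $ match at each line boundary, so every line gets wrapped:
s1 = re.sub(r'(?m)^.+$', r'<p>\g<0></p>', s0)
# s1 == '<p>Hey, this is paragraph 1</p>\n<p>and this is paragraph 2</p>'
```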
Pretty easy:
html='<p>'+s.replace("\n",'</p><p>')+'</p>'
