I have been wrestling with decoding and encoding in Python, and I can't quite figure out how to resolve my problem. I am looping over xml text files (sample) that are apparently coded in utf-8, using Beautiful Soup to parse each file, then looking to see if any sentence in the file contains one or more words from two different list of words. Because the xml files are from the eighteenth century, I need to retain the em dashes that are in the xml. The code below does this just fine, but it also retains a pesky box character that I wish to remove. I believe the box character is this character.
(You can find an example of the character I wish to remove in line 3682 of the sample file above. On this webpage, the character looks like an 'or' pipe, but when I read the xml file in Komodo, it looks like a box. When I try to copy and paste the box into a search engine, it looks like an 'or' pipe. When I print to console, though, the character looks like an empty box.)
To sum up, the code below runs without errors, but it prints the empty box character that I would like to remove.
for work in glob.glob(pathtofiles):
openfile = open(work)
readfile = openfile.read()
stringfile = str(readfile)
decodefile = stringfile.decode('utf-8', 'strict') #is this the dodgy line?
soup = BeautifulSoup(decodefile)
textwithtags = soup.findAll('text')
textwithtagsasstring = str(textwithtags)
#this method strips everything between anglebrackets as it should
textwithouttags = stripTags(textwithtagsasstring)
#clean text
nonewlines = textwithouttags.replace("\n", " ")
noextrawhitespace = re.sub(' +',' ', nonewlines)
print noextrawhitespace #the boxes appear
I tried to remove the boxes by using
noboxes = noextrawhitespace.replace(u"\u2610", "")
But Python threw an error flag:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 280: ordinal not in range(128)
Does anyone know how I can remove the boxes from the xml files? I would be grateful for any help others can offer.
The problem is that you're mixing unicode and str. Whenever you do that, Python has to convert one to the other, which is does by using sys.getdefaultencoding(), which is usually ASCII, which is almost never what you want.*
If the exception comes from this line:
noboxes = noextrawhitespace.replace(u"\u2610", "")
… the fix is simple… except that you have to know whether noextrawhitespace is supposed to be a unicode object or a UTF-8-encoding str object). If the former, it's this:
noboxes = noextrawhitespace.replace(u"\u2610", u"")
If the latter, it's this:
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
But really, you have to get all of the strings consistent in your code; mixing the two up is going to cause problems in more places than this one.
Since I don't have your XML files to test, I wrote my own:
<xml>
<text>abc☐def</text>
</xml>
Then, I added these two lines to the bottom of your code (and a bit to the top to just open my file instead of globbing for whatever):
noboxes = noextrawhitespace.replace(u"\u2610".encode('utf-8'), "")
print noboxes
The output is now:
[<text>abc☐def</text>]
[<text>abc☐def</text>]
[<text>abcdef</text>]
So, I think that's what you want here.
* Sure sometimes you want ASCII… but those aren't usually the times when you have unicode objects…
Give this a try:
noextrawhitespace.replace("\\u2610", "")
I think you are just missing that extra '\'
This might also work.
print(noextrawhitespace.decode('unicode_escape').encode('ascii','ignore'))
Reading your sample, the following are the non-ASCII characters in the document:
0x2223 DIVIDES
0x2022 BULLET
0x3009 RIGHT ANGLE BRACKET
0x25aa BLACK SMALL SQUARE
0x25ca LOZENGE
0x3008 LEFT ANGLE BRACKET
0x2014 EM DASH
0x2026 HORIZONTAL ELLIPSIS
\u2223 is the actual character in question in line 3682, and it is being used as a soft hyphen. The others are used in markup for tagging illegible characters, such as:
<GAP DESC="illegible" RESP="oxf" EXTENT="4+ letters" DISP="\u2022\u2022\u2022\u2022\u2026"/>
Here's some code to do what your code is attempting. Make sure to process in Unicode:
from bs4 import BeautifulSoup
import re
with open('k000039.000.xml') as f:
soup = BeautifulSoup(f) # BS figures out the encoding
text = u''.join(soup.strings) # strings is a generator for just the text bits.
text = re.sub(ur'\s+',ur' ',text) # Simplify all white space.
text = text.replace(u'\u2223',u'') # Get rid of the DIVIDES character.
print text
Output:
[[truncated]] reckon my self a Bridegroom too. Buckle. I doubt Kickey won't find him such. [Aside.] Mrs. Sago. Well,—poor Keckky's bound to good Behaviour, or she had lost quite her Puddy's Favour. Shall I for this repine at Fortune?—No. I'm glad at Heart that I'm forgiven so. Some Neighbours Wives have but too lately shown, When Spouse had left 'em all their Friends were flown. Then all you Wives that wou'd avoid my Fate. Remain contented with your present State FINIS.
Related
All the codes I've tried work in VS Code terminal and the Widows Terminal (Power Script and Command Window), so I'm pretty happy about that, however, when I read a string from a text file and I print the string, the escape codes are printed in plain view and no colour is applied to the strings.
I've tried the octal, hexadecimal and unicode versions, I had the same problem with "\n" until I realised that the string read would contain "\n", where it would effectively escape the "" char, so calling .replace("\\n","\n") on the string solved that issue, but I got no joy with the colour codes.
This is the code that I use to read the file:
with open('ascii_art_with_color.txt','r') as file:
for line in file.readlines() :
text_line = line
print( text_line , end='' )
Sample from the ascii file:
encounter = You \033[31mencounter\033[0m a wolf howling at the moonlight
Printing using the print function works just fine, either the string constant or from a variable
print('The wolf \033[31mgrowls\033[0m at you as you try to get closer')
winning = 'The wolf lets out a \033[34mpiercing\033[0m cry, then falls to the ground'
print(winning)
Ideas? The main problem that got me stumped is that the codes are not interpreted/applied for the strings I read from the text file, anything else seems to work.
Update:
As it was suggested in the comments, the file contained the '\033' (4 chars) instead of the '\033' one char. I was hoping python would take the line, then apply/translate/encode it into the ANSI escape sequence code while printing it, as it does with the string in the example above.
In the meantime, I managed to get the colours in the text file using a script that replaces a specific string with the escape sequence (I guess python does the encoding behind the scenes before writing it to file)
file_dest = 'ascii_monster_wolf_dest.txt'
with open(file_name,'r') as file, open(file_dest,'w+') as file_dest:
for line in file.readlines():
line = line.replace('{#}','\033[31m')
line = line.replace('{*}','\033[0m')
file_dest.writelines(line)
This is some progress, but not what I really wanted tho.
Coming back to my question, is there a way to read the file and have the sequence '\033' (4 characters) being interpreted as the 1 char escape sequence, the way it seems to do with strings?
There are a couple of ways to do what you ask.
If you wrap the individual lines with quote marks, so they look like Python string constants, you can use the ast literal evaluator to decode it:
s = '"\\x61\\x62"'
# That string has 10 characters.
print( ast.literal_eval(s) )
# Prints ab
Alternatively, you can convert the strings to byte strings, and use the "unicode-escape" codec:
s = '\\x61\\x62'
s = s.encode('utf-8').decode('unicode-escape')
print( s )
# Prints ab
In my humble opinion, however, you would be better served by using some other kind of markup to denote your colors. By that, I mean something like:
<red>This is red</red> <blue>This is blue</blue
Maybe not exactly an HTML-type syntax, but something with code markers that YOU understand, that can be read by humans, and can be interpreted by all languages.
Open the file in binary format. Then use decode() as Tim Roberts suggested.
with open('ascii_art_with_color.txt','rb') as file:
for line in file.readlines() :
print( line.decode('unicode-escape') , end='' )
I want to replace all emoji with '' but my regEx doesn't work.For example,
content= u'?\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633?'
and I want to replace all the forms like \U0001f633 with '' so I write the code:
print re.sub(ur'\\U[0-9a-fA-F]{8}','',content)
But it doesn't work.
Thanks a lot.
You won't be able to recognize properly decoded unicode codepoints that way (as strings containing \uXXXX, etc.) Properly decoded, by the time the regex parser gets to them, each is a* character.
Depending on whether your python was compiled with only 16-bit unicode code points or not, you'll want a pattern something like either:
# 16-bit codepoints
re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
# 32-bit* codepoints
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
And your code would look like:
import re
# Pick a pattern, adjust as necessary
#re_strip = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
re_strip = re.compile(u'[\U00010000-\U0010FFFF]')
content= u'[\u86cb\u767d12\U0001f633\uff0c\u4f53\u6e29\u65e9\u6668\u6b63\u5e38\uff0c\u5348\u540e\u665a\u95f4\u53d1\u70ed\uff0c\u6211\u73b0\u5728\u8be5\u548b\U0001f633]'
print(content)
stripped = re_strip.sub('', content)
print(stripped)
Both expressions, reduce the number of characters in the stripped string to 26.
These expressions strip out the emojis you were after, but may also strip out other things you do want. It may be worth reviewing a unicode codepoint range listing (e.g. here) and adjusting them.
You can determine whether your python install will only recognize 16-bit codepoints by doing something like:
import sys
print(sys.maxunicode.bit_length())
If this displays 16, you'll need the first regex expression. If it displays something greater than 16 (for me it says 21), the second one is what you want.
Neither expression will work when used on a python install with the wrong sys.maxunicode.
See also: this related.
I have a string of characters that includes [a-z] as well as á,ü,ó,ñ,å,... and so on. Currently I am using regular expressions to get every line in a file that includes these characters.
Sample of spanishList.txt:
adan
celular
tomás
justo
tom
átomo
camara
rosa
avion
Python code (charactersToSearch comes from flask #application.route('/<charactersToSearch>')):
print (charactersToSearch)
#'átdsmjfnueó'
...
#encode
charactersToSearch = charactersToSearch.encode('utf-8')
query = re.compile('[' + charactersToSearch + ']{2,}$', re.UNICODE).match
words = set(word.rstrip('\n') for word in open('spanishList.txt') if query(word))
...
When I do this, I am expecting to get the words in the text file that include the characters in charactersToSearch. It works perfectly for words without special characters:
...
#after doing further searching for other conditions, return list of found words.
return '<br />'.join(sorted(set(word for (word, path) in solve())))
>>> adan
>>> justo
>>> tom
Only problem is that it ignores all words in the file that aren't ASCII. I should also be getting tomás and átomo.
I've tried encode, UTF-8, using ur'[...], but I haven't been able to get it to work for all characters. The file and the program (# -*- coding: utf-8 -*-) are in utf-8 as well.
A different tack
I'm not sure how to fix it in your current workflow, so I'll suggest a different route.
This regex will match characters that are neither white-space characters nor letters in the extended ASCII range, such as A and é. In other words, if one of your words contains a weird character that is not part of this set, the regex will match.
(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S
Of course this will also match punctuation, but I'm assuming that we're only looking at words in an unpunctuated list. otherwise, excluding punctuation is not too hard.
As I see it, your challenge is to define your set.
In Python, you can so something like:
if re.search(r"(?i)(?!(?![×Þß÷þø])[a-zÀ-ÿ])\S", subject):
# Successful match
else:
# Match attempt failed
I feel your pain. Dealing with Unicode in python2.x is the headache.
The problem with that input is that python sees "á" as the raw byte string '\xc3\xa1' instead of the unicode character "u'\uc3a1'. So your going to need to sanitize the input before passing the string into your regex.
To change a raw byte string to to a unicode string
char = "á"
## print char yields the infamous, and in python unparsable "\xc3\xa1".
## which is probably what the regex is not registering.
bytes_in_string = [byte for byte in char]
string = ''.join([str(hex(ord(byte))).strip('0x') for byte in bytes_in_string])
new_unicode_string = unichr(int(string),16))
There's probably a better way, because this is a lot of operations to get something ready for regex, which I think is supposed to be faster in some way than iterating & 'if/else'ing.
Dunno though, not an expert.
I used something similar to this to isolate the special char words when I parsed wiktionary which was a wicked mess. As far as I can tell your going to have to comb through that to clean it up anyways, you may as well just:
for word in file:
try:
word.encode('UTF-8')
except UnicodeDecodeError:
your_list_of_special_char_words.append(word)
Hope this helped, and good luck!
On further research found this post:
Bytes in a unicode Python string
The was able to figure out the issue. After getting the string from the flask app route, encode it otherwise it give you an error, and then decode the charactersToSearch and each word in the file.
charactersToSearch = charactersToSearch.encode('utf-8')
Then decode it in UTF-8. If you leave the previous line out it give you an error
UNIOnlyAlphabet = charactersToSearch.decode('UTF-8')
query = re.compile('[' + UNIOnlyAlphabet + ']{2,}$', re.U).match
Lastly, when reading the UTF-8 file and using query, don't forget to decode each word in the file.
words = set(word.decode('UTF-8').rstrip('\n') for word in open('spanishList.txt') if query(word.decode('UTF-8')))
That should do it. Now the results show regular and special characters.
justo
tomás
átomo
adan
tom
I can't get Python to print a word doc. What I am trying to do is to open the Word document, print it and close it. I can open Word and the Word document:
import win32com.client
msword = win32com.client.Dispatch("Word.Application")
msword.Documents.Open("X:\Backoffice\Adam\checklist.docx")
msword.visible= True
I have tried next to print
msword.activedocument.printout("X:\Backoffice\Adam\checklist.docx")
I get the error of "print out not valid".
Could someone shed some light on this how I can print this file from Python. I think it might be as simple as changing the word "printout". Thanks, I'm new to Python.
msword.ActiveDocument gives you the current active document. The PrintOut method prints that document: it doesn't take a document filename as a parameter.
From http://msdn.microsoft.com/en-us/library/aa220363(v=office.11).aspx:
expression.PrintOut(Background, Append, Range, OutputFileName, From, To, Item,
Copies, Pages, PageType, PrintToFile, Collate, FileName, ActivePrinterMacGX,
ManualDuplexPrint, PrintZoomColumn, PrintZoomRow, PrintZoomPaperWidth,
PrintZoomPaperHeight)
Specifically Word is trying to use your filename as a boolean Background which may be set True to print in the background.
Edit:
Case matters and the error is a bit bizarre. msword.ActiveDocument.Printout() should print it. msword.ActiveDocument.printout() throws an error complaining that 'PrintOut' is not a property.
I think what happens internally is that Python tries to compensate when you don't match the case on properties but it doesn't get it quite right for methods. Or something like that anyway. ActiveDocument and activedocument are interchangeable but PrintOut and printout aren't.
You probably have to escape the backslash character \ with \\:
msword.Documents.Open("X:\\Backoffice\\Adam\\checklist.docx")
EDIT: Explanation
The backslash is usually used to declare special characters. For example \n is the special character for a new-line. If you want a literal \ you have to escape it.
I have a section of code that I need to remove from multiple files that starts like this:
<?php
//{{56541616
and ends like this:
//}}18420732
?>
where both strings of numbers can be any sequence of letters and numbers (not the same).
I wrote a Python program that will return the entire input string except for this problem string:
def removeInsert(text):
m = re.search(r"<\?php\n\/\/\{\{[a-zA-Z0-9]{8}.*\/\/\}\}[a-zA-Z0-9]{8}\n\?>", text, re.DOTALL)
return text[:m.start()] + text[m.end():]
This program works great when I call it with removeInsert("""[file text]""") -- the triple quotes allow it to be read in as multiline.
I attempted to extend this to open a file and pass the string contents of the file to removeInsert() with this:
def fileRW(filename):
input_file = open(filename, 'r')
text = input_file.read()
newText = removeInsert(text)
...
However, when I run fileRW([input-file]), I get this error:
return text[:m.start()] + text[m.end():]
AttributeError: 'NoneType' object has no attribute 'start'
I can confirm that "text" in that last code is actually a string, and does contain the problem code, but it seems that the removeInsert() code doesn't work on this string. My best guess is that it's related to the triple quoting I do when inputting the string manually into removeInsert(). Perhaps the text that fileRW() passes to removeInsert() is not triple-quoted (I've tried different ways of forcing it to have triple quotes ("\"\"\"" added), but that doesn't work). I have no idea how to fix this, though, and can't find any information about it in my google searching. Any suggestions?
Your regex only uses \n for lines. Your text editor may insert a carriage return and newline combination: \r\n. Try changing \n in your regex to (\r\n|\r|\n).
Keep the \n in your regular expressions and open the file as:
input_file= open(filename, 'rU')
Note the extra U in the mode. This will allow your code to work even if used on other operating systems, or given files having “foreign” end-of-line.