Sorry if this isn't a reproducible example, but I am guessing someone will know what to do once I describe the problem. The problem is that I am getting characters like "\xe2" and "\x80" from a txt file that I am reading in the following way:
words = open("directory/file.txt","r")
liness = []
for x in words.readlines():
    liness.append(x.rstrip('\n'))
When I print liness I get the list I want, but then when I use max() in the following way:
max(liness, key = len)
returns the "a line from file.txt that containts \xe2 and \x80" I know this probably has something to do with encoding, but I haven't had luck solving it. Anyone?
I tried to reproduce your error using the following code:
import string

words = open("directory/file.txt", 'r', 0)  # Python 2; the third argument is the buffer size (0 = unbuffered)
line = words.readline()
wordlist = string.split(line)  # same as line.split()
Unfortunately, as you might have guessed, I was not able to reproduce your error. My file was a txt file with a list of English words.
I assume that you are reading a .txt file with non-standard American English characters, correct? If so, you might want to check out this post:
Handling non-standard American English Characters and Symbols in a CSV, using Python
You will need to determine what type of encoding/decoding to use based on your file.
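For example, \xe2 and \x80 are typical lead bytes of multi-byte UTF-8 sequences (curly quotes, dashes, and similar punctuation are encoded starting with \xe2\x80), so decoding the file as UTF-8 while reading it should give you proper characters. A minimal sketch, assuming Python 3 and a UTF-8 encoded file:

# Decode as UTF-8 while reading, instead of keeping raw bytes.
with open("directory/file.txt", "r", encoding="utf-8") as words:
    liness = [line.rstrip('\n') for line in words]

print(max(liness, key=len))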
I am building a program that creates printed outputs from Python code, and the final printed output contains another language (Sinhala). I want to use python-docx to save this output into a Word document. How do I write to a Word document in another language?
My aim is to produce a report-making program in another language (Sinhala). I take all user inputs from widgets and have managed to print the resulting lines in Sinhala in Python.
Now I want to write these lines into a Word file in Sinhala.
a= "කණ්ඩියේ උස මීටර් 5.0 ක් පළල මීටර් 2.0 හා දිග මීටර් 2.0 ක් පමණ වන කොටසක්
අස්ථාවර වී"
document = Document()
document.add_heading("python word doc")
document.add_paragraph(a)
document.save('****\\report.docx')
When I use English, the code does the job, but for Sinhala I'm not sure how to make it work.
I get the following error message for the Sinhala text:
ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters
The error you're seeing is not directly related to the language. The only thing Word knows about language is which spelling dictionary to use; otherwise its text is just an arbitrary sequence of Unicode characters.
What I suspect is that the Unicode encoding of the Sinhala strings you're trying to write is not UTF-8. The other possibility is that the string contains some control characters (as mentioned in the error message), particularly the vertical-tab (VT, 0xB or decimal 11) which can arise in copy and paste from PowerPoint.
This latter one is easier to check for, so perhaps start there.
import re

def sanitize_str(s):
    # strip ASCII control characters (0x00-0x1f) and C1 controls (0x7f-0x9f)
    control_chars = "\x00-\x1f\x7f-\x9f"
    control_char_re = re.compile("[%s]" % control_chars)
    return control_char_re.sub("", s)

document.add_paragraph(sanitize_str(a))
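If you want to confirm that control characters really are the culprit before sanitizing, you can inspect the string first. A quick check, assuming Python 3 and that a is the Sinhala string from the question:

# List any control characters present, together with their positions.
suspects = [(i, hex(ord(c))) for i, c in enumerate(a)
            if ord(c) < 0x20 or 0x7f <= ord(c) <= 0x9f]
print(suspects)  # e.g. [(42, '0xb')] would point to a vertical tab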
Here's the problem:
I copied and pasted this entire list to a txt file from https://www.cboe.org/mdx/mdi/mdiproducts.aspx
Sample of text lines:
BFLY - The CBOE S&P 500 Iron Butterfly Index
BPVIX - CBOE/CME FX British Pound Volatility Index
BPVIX1 - CBOE/CME FX British Pound Volatility First Term Structure Index
BPVIX2 - CBOE/CME FX British Pound Volatility Second Term Structure Index
These lines of course appear normal in my text file, and I saved the file with utf-8 encoding.
My goal is to use Python to strip out only the symbols from this long list, e.g. BFLY, BPVIX etc., and write them to a new file.
I am using the following code to read the file and split it:
x=open('sometextfile.txt','r')
y=x.read().split()
The issue I'm seeing is that there are unfamiliar characters popping up and they are affecting my ability to filter the list. Example:
print(y[0])
BFLY
I'm guessing that these characters have something to do with the encoding, and I have tried a few different things with the codecs module without success. Using .decode('utf-8') throws an error when I try to use it on the variables x or y above. I am able to use .encode('utf-8'), which obviously makes things even worse.
The main problem comes when I try to loop through the list and remove any items that are not all upper case or that contain non-alpha characters. Ex:
y[0].isalpha()
False
y[0].isupper()
False
So in this example the symbol BFLY ends up being removed from the list.
Funny thing is that these characters are not present in a txt file if I do something like:
q=open('someotherfile.txt','w')
q.write(y[0])
Any help would be greatly appreciated. I would really like to understand why this frequently happens when copying and pasting text from web pages like this one.
Why not use Regex?
I think this will catch the letters in caps
"[A-Z]{1,}/?[A-Z]{1,}[0-9]?"
This is better. I got a list of all such symbols. Here's my result.
['BFLY', 'CBOE', 'BPVIX', 'CBOE/CME', 'FX', 'BPVIX1', 'CBOE/CME', 'FX', 'BPVIX2', 'CBOE/CME', 'FX']
Here's the code
import re
reg_obj = re.compile(r'[A-Z]{1,}/?[A-Z]{1,}[0-9]?')
sym = reg_obj.findall(a)  # 'a' holds the text read from the file
print(sym)
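For what it's worth, invisible characters at the start of text copied from the web and saved as UTF-8 are very often a byte-order mark (BOM), which would explain why y[0] prints as BFLY yet fails isalpha(). A minimal sketch, assuming Python 3 and that the file really does start with a BOM; the utf-8-sig codec strips it while reading:

# 'utf-8-sig' transparently removes a leading BOM if one is present.
with open('sometextfile.txt', 'r', encoding='utf-8-sig') as f:
    # keep only the symbol before the first " - " on each line
    symbols = [line.split(' - ')[0].strip() for line in f if ' - ' in line]

print(symbols[:4])  # e.g. ['BFLY', 'BPVIX', 'BPVIX1', 'BPVIX2']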
Now please don't get me wrong on this, but I'm just curious whether I can take a text file and find out how many lines within it have been written on, and then use that number to print selective data from every few lines. Also, could I use Python to find specific words within the text file that are evenly spaced apart? For example, if everything in the text file were written like this:
name:> Ben
Score:> 2
name:> Ethan
Score:> 8
name:> James
Score:> 0
would it be possible for me to search the text file for the string 'name:>' (and then, if possible, save whatever comes after it to a variable)? Or, seeing as they're all equally spaced, could I save a specific person's score to a variable along with their name (since everything would be equally spaced), without having to open the txt file at all?
If all of this sounds completely impossible, or if any of you have gotten any vague idea of what I'm talking about (in which case I'm in awe of your powers of comprehension, given this badly worded example), please give me any thoughts or ideas on how to format text files to create variables.
If all of the above seems too complex, could someone please just tell me whether it's possible to count how many lines within a text file have been written on? From there I've got a vague idea of how to create my program.
You can use a regular expression (RE) to search the text file as a string, find where the existing value you want to change is, and then write the file back out.
https://docs.python.org/2/library/re.html
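As a concrete illustration of that idea, here is a minimal sketch, assuming the file follows the exact name:>/Score:> layout shown above (scores.txt is a placeholder filename):

import re

with open("scores.txt") as f:
    contents = f.read()

# Pair each "name:>" value with the "Score:>" value on the following line.
pairs = re.findall(r"name:>\s*(\w+)\s*Score:>\s*(\d+)", contents)
scores = {name: int(score) for name, score in pairs}
print(scores)  # e.g. {'Ben': 2, 'Ethan': 8, 'James': 0}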
To do what you are asking, I would personally use the built-in re module, as follows:
import re

with open("foo.txt", "r") as foo:
    contents = foo.read()

# re.search returns a match object (or None if nothing matches); .group() gives the matched text
results = re.search("foo-bar", contents).group()
print(results)
That should do what you are looking for.
I'm attempting to take large amounts of natural language from a web forum and correct the spelling with PyEnchant. The text is often informal, and about medical issues, so I have created a text file "test.pwl" containing relevant medical words, chat abbreviations, and so on. In some cases, little bits of html, urls, etc do unfortunately remain in it.
My script is designed to use both the en_US dictionary and the PWL to find all misspelled words and correct them to the first suggestion of d.suggest totally automatically. It prints a list of misspelled words, then a list of words that had no suggestions, and writes the corrected text to 'spellfixed.txt':
import enchant
import codecs

def spellcheckfile(filepath):
    d = enchant.DictWithPWL("en_US", "test.pwl")
    try:
        f = codecs.open(filepath, "r", "utf-8")
    except IOError:
        print "Error reading the file, right filepath?"
        return
    textdata = f.read()
    mispelled = []
    words = textdata.split()
    for word in words:
        # if spell check failed and the word is also not in
        # mis-spelled list already, then add the word
        if d.check(word) == False and word not in mispelled:
            mispelled.append(word)
    print mispelled
    for mspellword in mispelled:
        # get suggestions
        suggestions = d.suggest(mspellword)
        # make sure we actually got some
        if len(suggestions) > 0:
            # pick the first one
            picksuggestion = suggestions[0]
        else:
            print mspellword
        # replace every occurence of the bad word with the suggestion
        # this is almost certainly a bad idea :)
        textdata = textdata.replace(mspellword, picksuggestion)
    try:
        fo = open("spellfixed.txt", "w")
    except IOError:
        print "Error writing spellfixed.txt to current directory. Who knows why."
        return
    fo.write(textdata.encode("UTF-8"))
    fo.close()
    return
The issue is that the output often contains 'corrections' for words that were in either the dictionary or the pwl. For instance, when the first portion of the input was:
My NEW doctor feels that I am now bi-polar . This , after 9 years of being considered majorly depressed by everyone else
I got this:
My NEW dotor feels that I am now bipolar . This , aftER 9 years of being considERed majorly depressed by evERyone else
I could handle the case changes, but doctor --> dotor is no good at all. When the input is much shorter (for example, when the above quotation is the entire input), the result is as desired:
My NEW doctor feels that I am now bipolar . This , after 9 years of being considered majorly depressed by everyone else
Could anybody explain to me why? In very simple terms, please, as I'm very new to programming and newer to Python. A step-by-step solution would be greatly appreciated.
I think your problem is that you're replacing letter sequences inside words. "ER" might be a valid spelling correction for "er", but that doesn't mean that you should change "considered" to "considERed".
You can use regexes instead of simple text replacement to ensure that you replace only full words. "\b" in a regex means "word boundary":
>>> "considered at the er".replace( "er", "ER" )
'considERed at the ER'
>>> import re
>>> re.sub( "\\b" + "er" + "\\b", "ER", "considered at the er" )
'considered at the ER'
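Applied to the loop in the question, that might look like the following rough sketch (same Python 2 style and variable names as the question; re.escape guards against regex metacharacters in the misspelled word):

import re

for mspellword in mispelled:
    suggestions = d.suggest(mspellword)
    if len(suggestions) > 0:
        picksuggestion = suggestions[0]
        # replace whole-word occurrences only, not letter runs inside words
        textdata = re.sub(r"\b" + re.escape(mspellword) + r"\b",
                          picksuggestion, textdata)
    else:
        print mspellword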
#replace every occurence of the bad word with the suggestion
#this is almost certainly a bad idea :)
You were right, that is a bad idea. This is what's causing "considered" to be replaced by "considERed". Also, you're doing a replacement even when you don't find a suggestion. Move the replacement to the if len(suggestions) > 0 block.
As for replacing every instance of the word, what you want to do instead is save the positions of the misspelled words along with the text of the misspelled words (or maybe just the positions and you can look the words up in the text later when you're looking for suggestions), allow duplicate misspelled words, and only replace the individual word with its suggestion.
I'll leave the implementation details and optimizations up to you, though. A step-by-step solution won't help you learn as much.
I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches 'this' and 'thing'. Now if I try to read in a whole file and search through it:
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()

filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so it has spaces and is not a list. I can match single letters, so if I search for just 't' I get a bunch of t's. This tells me that I am matching single characters; what I want is to match whole words, and moreover to preserve the wildcard structure.
What I would like is for a user to enter text (including a wildcard) that I can substitute in place of 'th*', with the wildcard still doing what it should. That leads to the question: can I just stick a variable holding the search text in for 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*', and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about it? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas", and what do I have wrong in the code, given that it does not operate on the incoming text in the second snippet the way it (correctly) does in the first?
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on any whitespace (spaces and newlines) to make a list of words

filtered = fnmatch.filter(file_contents, 'th*')
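And to the other part of the question: yes, the pattern can simply be a variable, so user input can be dropped in where 'th*' was. A small sketch, assuming Python 3 (where input() returns a string):

import fnmatch

with open('testfilefolder/wssnt10.txt') as f:
    words = f.read().lower().split()

pattern = input("Search pattern (wildcards allowed): ")  # e.g. th*
matches = fnmatch.filter(words, pattern)
print(sorted(set(matches)))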