To make a long story short, I am writing a Python script that asks the user to drop a .docx file, converts the file to .txt, then looks for keywords within the .txt file and displays them in the shell. I was running into a UnicodeDecodeError ("charmap codec can't decode..."), which I worked around by adding word.decode("charmap") inside my for loop. NOW, Python is not displaying the keywords it does find in the shell. Any advice on how to overcome this? Maybe have Python skip the characters it cannot decode and continue reading the rest? Here is my code:
import sys
import os
import codecs

filename = input("Drag and drop resume here: ")
keywords = ['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []

with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            word.decode("charmap")
            file_words.append(word)

comparison = []
for words in file_words:
    if words in keywords:
        comparison.append(words)

def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output

comparison = remove_duplicates(comparison)
print("Keywords found:", comparison)

key_count = 0
word_count = 0
for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1

Threshold = word_count / key_count
if Threshold <= 0.7:
    print("The candidate is not qualified for")
else:
    print("The candidate is qualified for")

file.close()
And the output:
Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for
In Python 3, don't open text files in binary mode. By default, the file will be decoded to Unicode using locale.getpreferredencoding(False) (cp1252 on US Windows):
with open(filename) as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
or specify an encoding:
with open(filename, encoding='utf8') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
You do need to know the encoding of your file. There are other options to open() as well, including errors='ignore' and errors='replace', but you shouldn't get errors if you know the correct encoding.
As others have said, posting a sample of your text file that reproduces the error and the error traceback would help diagnose your specific issue.
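For reference, here's a minimal sketch of what those errors= options do. The file path and contents are made up; invalid UTF-8 bytes are written to a temp file so the snippet runs anywhere:

```python
import tempfile, os

# Write bytes that are invalid UTF-8 to demonstrate the errors= options.
path = os.path.join(tempfile.gettempdir(), "demo_errors.txt")
with open(path, "wb") as f:
    f.write(b"hello \xff world")

# errors='ignore' silently drops undecodable bytes
with open(path, encoding="utf8", errors="ignore") as f:
    ignored = f.read()

# errors='replace' substitutes U+FFFD for each undecodable byte
with open(path, encoding="utf8", errors="replace") as f:
    replaced = f.read()

print(ignored)   # "hello  world"
print(replaced)  # "hello \ufffd world"
```

Both options hide rather than fix the underlying mismatch, so prefer passing the correct encoding when you know it.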
In case anyone cares: it's been a long time, but I wanted to clear up that I didn't even know the difference between binary and text files back in those days. I eventually found a doc/docx module for Python that made things easier. Sorry for the headache!
Maybe posting the code producing the traceback would make this easier to fix.
I'm not sure this is the only problem, maybe this would work better:
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            file_words.append(word.decode("charmap"))
Alright, I figured it out. Here is my code. I then tried a .docx file that was more complex, and when converted to .txt the entire file consisted of special characters. So now I am thinking I should move to the python-docx module, since it deals with XML-based files like Word documents. I added encoding='charmap':
with open(filename, encoding='charmap') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
Related
I'm pretty new to Python, but I've been trying to get into some programming in my free time. Currently, I'm dealing with the following problem:
I have 2 documents, 1 and 2. Both have text in them.
I want to search document 1 for a specific string. When I locate that string, I want to insert all the content of document 2 in a line after the specific string.
Before insertion:
Document 1 content:
text...
SpecificString
text...
After insertion:
Document 1 content:
text...
SpecificString
Document 2 content
text...
I've been trying different methods, but none are working; they keep deleting all content from document 1 and replacing it. YouTube & Google haven't yielded any desirable results; maybe I'm just looking in the wrong places.
I tried different things; this is one example:
f1 = '/Users/Win10/Desktop/Pythonprojects/oldfile.txt'
f2 = '/Users/Win10/Desktop/Pythonprojects/newfile.txt'
searchString = str("<\module>")

with open(f1, "r") as moduleinfo, open(f2, "w") as newproject:
    new_contents = newproject.readlines()
    # Now prev_contents is a list of strings and you may add the new line to this list at any position
    if searchString in f1:
        new_contents.insert(0, "\n")
        new_contents.insert(0, moduleinfo)
        #new_file.write("\n".join(new_contents))
The code simply deleted the content of document 1.
You can find interesting answers in How do I write to the middle of a text file while reading its contents?, Can you write to the middle of a file in Python?, and Adding lines after specific line.
By the way, an interesting approach is to iterate over the file in read mode to find the index where the insert must go, then overwrite the file with the spliced content:
File2 = File2[:key_index] + File1 + File2[key_index:]
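A rough, runnable sketch of that splice-by-index idea. The file names, marker string, and contents are placeholders written to temp files so the snippet is self-contained:

```python
import tempfile, os

tmp = tempfile.gettempdir()
f1 = os.path.join(tmp, "oldfile.txt")  # file to insert into
f2 = os.path.join(tmp, "newfile.txt")  # file whose content gets inserted

with open(f1, "w") as f:
    f.write("text...\nSpecificString\ntext...\n")
with open(f2, "w") as f:
    f.write("Document 2 content\n")

# Read both files, find the index just past the marker line,
# and splice the second file's content in at that index.
with open(f1) as f:
    doc1 = f.read()
with open(f2) as f:
    doc2 = f.read()

key_index = doc1.index("SpecificString") + len("SpecificString\n")
merged = doc1[:key_index] + doc2 + doc1[key_index:]

with open(f1, "w") as f:
    f.write(merged)
```

Note this rewrites the whole file, which is fine for small files; there is no way to insert into the middle of a file in place.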
Another option explained by Adding lines after specific line:
with open(file, "r") as in_file:
    buf = in_file.readlines()

with open(file, "w") as out_file:
    for line in buf:
        if line == "YOUR SEARCH\n":
            line = line + "Include below\n"
        out_file.write(line)
Please tell us your final approach.
You have to open the second file in append mode instead of write mode. Write mode overwrites the document, while append mode adds text to the end of the file.
You can enter append mode by replacing the 'w' with 'a'.
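A quick illustration of the difference between the two modes (the file name is made up; a temp file is used so this runs anywhere):

```python
import tempfile, os

path = os.path.join(tempfile.gettempdir(), "mode_demo.txt")

# 'w' truncates: the first write is lost when the file is reopened
with open(path, "w") as f:
    f.write("first\n")
with open(path, "w") as f:
    f.write("second\n")
with open(path) as f:
    after_w = f.read()  # only "second\n" remains

# 'a' appends to whatever is already there
with open(path, "a") as f:
    f.write("third\n")
with open(path) as f:
    after_a = f.read()  # "second\nthird\n"
```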
Thanks for your input; it put me on the right track. I ended up going with the following:
f2 = '/Users/Win10/Desktop/Pythonprojects/newfile.txt'
f1 = '/Users/Win10/Desktop/Pythonprojects/oldfile.txt'

with open(f2) as file:
    original = file.read()
with open(f1) as input:
    myinsert = input.read()

newfile = original.replace("</Module>", "</Module>\n" + myinsert)

with open(f2, "w") as replaced:
    replaced.write(newfile)
Text from oldfile is inserted into newfile on a new line, under the "</Module>" string. I'll follow up if I find better solutions. Again, thank you for your answers.
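One caveat with this replace-based approach: str.replace() substitutes every occurrence of the marker. If the tag can appear more than once and you only want the first hit, pass the optional count argument (shown here on an inline placeholder string):

```python
text = "<a></Module><b></Module>"
insert = "EXTRA"

# Passing count=1 replaces only the first occurrence of the marker
once = text.replace("</Module>", "</Module>\n" + insert, 1)

# The default replaces every occurrence
everywhere = text.replace("</Module>", "</Module>\n" + insert)

print(once)        # <a></Module>\nEXTRA<b></Module>
print(everywhere)  # <a></Module>\nEXTRA<b></Module>\nEXTRA
```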
dictionary = file.read()
I'm currently creating a cipher solver for the 2017 cipher challenge.
I have a word document of fifty-eight thousand words, but I cannot get the file as a string in Python 2.7.9.
I have tried many things I have read online, such as the above code, but to no avail.
I also need this to be easy to understand, as I am new to Python.
Thanks! Don't be negative, be constructive!
The words are from:
http://www.mieliestronk.com/corncob_lowercase.txt
You probably should consult some code examples on the web for reading a file. You need something like:
fp = open(fname, "r")
lines = fp.readlines()
for line in lines:
    do_something_with_the_lines
fp.close()
All you have to do is:
with open("dictionary.txt") as f:  # Open the file and save it as "f"
    dictionary = f.read()  # Read the content of the file and save it to "dictionary"
If you want to read it from a website, try this:
import urllib2
dictionary = urllib2.urlopen("http://www.mieliestronk.com/corncob_lowercase.txt").read() # Open the website from the url and save its contents to "dictionary"
I think you should check this out for what you're trying to do (http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python)
This should be helpful
I have several 1+ gb text files of URLs. I'm trying to use Python to find and replace in order to quickly strip down the URLs.
Because these files are big, I don't want to load them into memory.
My code works on small test files of 50 lines, but when I use this code on a big text file, it actually makes the file larger.
import re
import sys

def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r") as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return

ProcessLargeTextFile()
print("Finished")
The small files I tested my code with result in just the Twitter username (as desired):
username_1
username_2
username_3
while large files result in
https://twitter.com/username_1ഀ
https://twitter.com/username_2ഀ
https://twitter.com/username_3ഀ
It's a problem with the encoding of the file; this works:
import re

def main():
    inputfile = open("1-10_no_dups_split_2.txt", "r", encoding="UTF-16")
    outputfile = open("output.txt", "a", encoding="UTF-8")
    for line in inputfile:
        line = re.sub("^https://twitter.com/", "", line)
        outputfile.write(line)
    inputfile.close()
    outputfile.close()

main()
The trick is to specify UTF-16 when reading and then output as UTF-8. And voilà, the weird stuff goes away. I do a lot of work moving text files around with Python. There are many settings you can play with around encoding, to automatically replace certain characters and whatnot; just read up on the open() function if you get into a weird spot, or post back here.
From a quick look at the results, you'll probably want a few regexes so you can also catch https://mobile.twitter.com/ and other variants, but that's another story. Good luck!
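If you're not sure whether a file is UTF-16 in the first place, one cheap check is to peek at its first two bytes for a byte-order mark before deciding how to open it. This is only a sketch (real UTF-16 files can lack a BOM, in which case you'd need other heuristics); the file here is generated so the snippet is self-contained:

```python
import tempfile, os

path = os.path.join(tempfile.gettempdir(), "bom_demo.txt")
with open(path, "w", encoding="utf-16") as f:  # Python writes a BOM for "utf-16"
    f.write("https://twitter.com/username_1\n")

# Peek at the first two bytes in binary mode
with open(path, "rb") as f:
    head = f.read(2)

# UTF-16 little-endian BOM is FF FE; big-endian is FE FF
if head in (b"\xff\xfe", b"\xfe\xff"):
    enc = "utf-16"
else:
    enc = "utf-8"

with open(path, encoding=enc) as f:
    first_line = f.readline()  # BOM is consumed by the utf-16 codec
```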
You can use the open() method's buffering parameter.
Here is the code for it.
import re
import sys

def ProcessLargeTextFile():
    with open("C:\\Users\\Combined files\\test2.txt", "r", buffering=200000000) as r, open("C:\\Users\\Combined files\\output.txt", "w") as w:
        for line in r:
            line = line.replace('https://twitter.com/', '')
            w.write(line)
    return

ProcessLargeTextFile()
print("Finished")
So I am reading 200 MB of data into memory at a time.
I am writing in Python 3.6 and am having trouble making my code match strings in a short text document. This is a simple example of the exact logic that is breaking my bigger program:
PATH = "C:\\Users\\JoshLaptop\\PycharmProjects\\practice\\commented.txt"
file = open(PATH, 'r')
words = ['bah', 'dah', 'gah', "fah", 'mah']

print(file.read().splitlines())
if 'bah' not in file.read().splitlines():
    print("fail")
with the text document formatted like so:
bah
gah
fah
dah
mah
and it is indeed printing fail each time I run this. Am I using the incorrect method of reading the data from the text document?
The issue is that you call file.read().splitlines() twice: the print() exhausts the file, so the next call to file.read().splitlines() returns an empty list.
A better way to "grep" your pattern is to iterate over the file's lines instead of reading it fully. That way, if you find the string early in the file, you save time:
with open(PATH, 'r') as f:
    for line in f:
        if line.rstrip() == "bah":
            break
    else:
        # else is reached when no break is called from the for loop: fail
        print("fail")
The small catch here is not to forget to call line.rstrip(), because iterating over a file yields each line with its line terminator. Also, if there's a trailing space in your file, this code will still match the word (use strip() if you want to match even with leading blanks).
If you want to match a lot of words, consider creating a set of lines:
lines = {line.rstrip() for line in f}
so your in lines call will be a lot faster.
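For example, to check several words at once against such a set (the file contents are inlined via io.StringIO here so the snippet is self-contained; 'zah' is a made-up word that should be missing):

```python
import io

# Stand-in for open(PATH): a file-like object with one word per line
f = io.StringIO("bah\ngah\nfah\ndah\nmah\n")

# Build the set once; each membership test is then O(1) on average
lines = {line.rstrip() for line in f}

words = ['bah', 'dah', 'gah', 'fah', 'mah', 'zah']
found = [w for w in words if w in lines]
missing = [w for w in words if w not in lines]

print(found)    # ['bah', 'dah', 'gah', 'fah', 'mah']
print(missing)  # ['zah']
```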
Try this:
PATH = "C:\\Users\\JoshLaptop\\PycharmProjects\\practice\\commented.txt"
file = open(PATH, 'r')
words = file.read().splitlines()

print(words)
if 'bah' not in words:
    print("fail")
You can't read the file twice.
When you do print(file.read().splitlines()), the file is read, and the next call returns nothing because you are already at the end of the file.
PATH = "your_file"
file = open(PATH, 'r')
words = ['bah', 'dah', 'gah', "fah", 'mah']

if 'bah' not in file.read().splitlines():
    print("fail")
As you can see, the output is not 'fail'. You must call file.read().splitlines() only once, or save its result in a variable; otherwise you will get a 'fail' message.
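If you really do need to read the same file object twice, you can rewind it with seek(0) instead of re-reading. A sketch using a temp file (names and contents are placeholders):

```python
import tempfile, os

path = os.path.join(tempfile.gettempdir(), "seek_demo.txt")
with open(path, "w") as f:
    f.write("bah\ngah\n")

file = open(path, 'r')
first = file.read().splitlines()   # consumes the whole file
second = file.read().splitlines()  # empty: we're at end-of-file
file.seek(0)                       # rewind to the beginning
third = file.read().splitlines()   # full content again
file.close()

print(first)   # ['bah', 'gah']
print(second)  # []
print(third)   # ['bah', 'gah']
```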
I have a text file I need to search; it may contain uppercase or lowercase letters. How do I check the case of each character in the file using Python?
Maybe you should spend more time writing the question if you expect us to invest time in answering it. Nevertheless, from what I understand, you are looking for something like this:
import sys

# Read the file named on the command line (sys.argv[1]; sys.argv[0] is the script itself)
with open(sys.argv[1], "r") as f:
    for row in f:
        for ch in row:
            if ch.isupper():
                print ch, "uppercase"
            else:
                print ch, "lowercase"
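If all you need is counts rather than per-character output, str methods make this shorter. A Python 3 sketch on an inline placeholder string, since the file from the question isn't available:

```python
text = "Hello World FILE check"

# isupper()/islower() are False for spaces and digits, so those are skipped
upper_count = sum(1 for ch in text if ch.isupper())
lower_count = sum(1 for ch in text if ch.islower())

print(upper_count)  # 6
print(lower_count)  # 13
```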