File to string conversion in python - python

dictionary = file.read()
I'm currently creating a cipher solver for the 2017 Cipher Challenge.
I have a word document of fifty-eight thousand words, but I cannot get the file as a string in Python 2.7.9.
I have tried many things I have read online, such as the above code, but to no avail.
I also need this to be easy to understand, as I am new to Python.
Thanks! Don't be negative, be constructive!
The words are from:
http://www.mieliestronk.com/corncob_lowercase.txt

You probably should consult some code examples on the web for reading a file. You need something like:
fp = open(fname, "r")
lines = fp.readlines()  # list with one string per line
for line in lines:
    do_something_with(line)  # placeholder: replace with your own processing
fp.close()

All you have to do is:
with open("dictionary.txt") as f: # Open the file and save it as "f"
    dictionary = f.read() # Read the content of the file and save it to "dictionary"
If you want to read it from a website, try this:
import urllib2
dictionary = urllib2.urlopen("http://www.mieliestronk.com/corncob_lowercase.txt").read() # Open the website from the url and save its contents to "dictionary"
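Once the file is in a string, a cipher solver usually wants to look words up quickly. Here is a minimal sketch of that step, assuming the word list has been saved locally with one word per line. The function name load_words and the file name in the example are illustrative, not from the question. io.open is used because it accepts an encoding argument on both Python 2.7 and Python 3.

```python
import io

def load_words(path, encoding="utf-8"):
    """Return (text, words): the whole file as one string and a set of its words."""
    # io.open accepts an encoding on both Python 2.7 and Python 3
    with io.open(path, encoding=encoding) as f:
        text = f.read()
    # split() with no arguments splits on any whitespace, including newlines
    words = set(text.split())
    return text, words

# Example (hypothetical file name):
# text, words = load_words("dictionary.txt")
# print("hello" in words)
```

A set is used here because membership tests on a set do not slow down as the word list grows, which matters when checking many candidate decryptions.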

I think you should check this out for what you're trying to do (http://www.pythonforbeginners.com/files/reading-and-writing-files-in-python)
This should be helpful

Related

Reading and Typing in a new File IO

Why does the text not show up when I click on the file_io_reverse.ipynb file??
##I am trying to read 'file_io.ipynb' and put the reverse of it into 'file_io_reverse.ipynb', this code doesn't work at all
f = open('file_io_reverse.ipynb', "a")
with open('file_io.ipynb', "r") as f2:
    for i in f2:
        x = i[::-1]
        print(x)
        f.write(x)
f.close()
As @olvin pointed out, your mixture of ways of opening and closing files is inconsistent, but it is not functionally incorrect and should work.
What are you trying to open the file_io_reverse.ipynb file in?
IPYNB notebooks are plain text files formatted using JSON, making them human-readable and easy to share with others. So if you are trying to reverse contents of each line in the file and trying to save it in another file, then that would make the new ipynb file invalid.
Try opening the file in a text editor, and it should have the reversed lines for each line in the file_io.ipynb.
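For reference, a minimal sketch of the same idea with both files managed by one with statement. The function name reverse_lines is made up for illustration. It assumes you want the newline to stay at the end of each line (the original i[::-1] moves it to the front), and it opens the output with "w" instead of "a" so that re-running does not append a second copy.

```python
def reverse_lines(src_path, dst_path):
    """Write each line of src_path to dst_path with its characters reversed."""
    # Both files are closed automatically when the with block ends
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            # Strip the newline first so it stays at the end of the line
            dst.write(line.rstrip("\n")[::-1] + "\n")

# Example (file names from the question):
# reverse_lines("file_io.ipynb", "file_io_reverse.ipynb")
```

The output is still not a valid notebook, so open it in a text editor, not in Jupyter.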

Searching for a string in a file and saving the results

I have a few quite large text files with data on them. I need to find a string that repeats from the data and the string will always have an id number after it. I will need to then save that number.
I've done some simple scripting with Python, but I am unsure where to start with this, or whether Python is even a good fit for this problem. Any help is appreciated.
I will post more information next time (my bad), but I managed to get something to work that should do it for me.
import re
with open("test.txt", "r") as opened:
    text = opened.read()
output = re.findall(r"\bdata........", text)
out_str = ",".join(output)
print (out_str)
#with open("output.txt", "w") as outp:
    #outp.write(out_str)
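If only the id number is needed, a capturing group can pull it out directly instead of grabbing a fixed number of characters. This is a sketch under assumptions the question does not confirm: that the id is a run of digits, and that it follows the keyword with at most some punctuation or whitespace in between. The name find_ids and the default keyword "data" are illustrative.

```python
import re

def find_ids(text, keyword="data"):
    """Return every run of digits that directly follows the keyword."""
    # re.escape guards against regex metacharacters in the keyword;
    # \W* allows separators such as spaces, ':' or '=' before the digits
    pattern = r"\b" + re.escape(keyword) + r"\W*(\d+)"
    return re.findall(pattern, text)

# Example:
# find_ids("data 123 foo data:456")
```

With a single capturing group, re.findall returns only the captured digits, so the result can be written out without further slicing.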

How does one read a .dif file with Python

I am working on a project that requires me to read a file with a .dif extension. DIF stands for Data Interchange Format. The file opens nicely in OpenOffice Calc, and from there you can easily save it as a CSV file. However, when I open it in Python, all I get are random characters that don't make sense. Here is the last code that I tried, just to see if I could read it.
txt = open(r'C:\myfile.dif', 'rb').read()
print txt
I would even be open to programmatically converting the file to CSV before opening it, if someone knows how to do that. As always, any help is much appreciated. Below is a partial screenshot of what I get when I run the code.
Hadn't heard of this file format. Went and got a sample here.
I tested your method and it works fine:
>>> content = open(r"E:\sample.dif", 'rb').read()
>>> print (content)
b'TABLE\r\n0,1\r\n"EXCEL"\r\nVECTORS\r\n0,8\r\n""\r\nTUPLES\r\n0,3\r\n""\r\nDATA\r\n0,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"Welcome to File Extension FYI Center!"\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n""\r\n1,0\r\n""\r\n1,0\r\n""\r\n-1,0\r\nBOT\r\n1,0\r\n"ID"\r\n1,0\r\n"Type"\r\n1,0\r\n"Description"\r\n-1,0\r\nBOT\r\n0,1\r\nV\r\n1,0\r\n"ASP"\r\n1,0\r\n"Active Server Pages"\r\n-1,0\r\nBOT\r\n0,2\r\nV\r\n1,0\r\n"JSP"\r\n1,0\r\n"JavaServer Pages"\r\n-1,0\r\nBOT\r\n0,3\r\nV\r\n1,0\r\n"PNG"\r\n1,0\r\n"Portable Network Graphics"\r\n-1,0\r\nBOT\r\n0,4\r\nV\r\n1,0\r\n"GIF"\r\n1,0\r\n"Graphics Interchange Format"\r\n-1,0\r\nBOT\r\n0,5\r\nV\r\n1,0\r\n"WMV"\r\n1,0\r\n"Windows Media Video"\r\n-1,0\r\nEOD\r\n'
>>>
The question is what is in the file and how do you want to handle it. Personally I liked:
with open(r"E:\sample.dif", 'rb') as f:
    for line in f:
        print (line)
In the first code block, that long line that has a b'' (for bytes!) in front of it can be iterated on \r\n:
b'TABLE\r\n'
b'0,1\r\n'
b'"EXCEL"\r\n'
b'VECTORS\r\n'
b'0,8\r\n'
b'""\r\n'
b'TUPLES\r\n'
b'0,3\r\n'
b'""\r\n'
b'DATA\r\n'
b'0,0\r\n'
.
.
.
b'"Windows Media Video"\r\n'
b'-1,0\r\n'
b'EOD\r\n'
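Since the question also asks about converting to CSV, here is a rough sketch of turning those lines into rows. It is based only on the layout visible in the sample output above (a header that ends with the DATA item, then pairs of lines where -1 marks BOT/EOD, 0 carries a number after the comma, and 1 carries a quoted string on the next line). It is not a complete DIF reader, and the function name dif_rows and the ASCII default are assumptions.

```python
def dif_rows(raw_bytes, encoding="ascii"):
    """Parse the data section of a DIF file into a list of rows (lists of str)."""
    lines = raw_bytes.decode(encoding).splitlines()
    # Each header item takes three lines; the data section starts after DATA's three
    start = lines.index("DATA") + 3
    rows, current = [], None
    # Data values come in pairs: "type,number" followed by a string line
    for i in range(start, len(lines) - 1, 2):
        kind, number = lines[i].split(",", 1)
        text = lines[i + 1]
        if kind == "-1":
            if text == "BOT":        # beginning of a new row
                current = []
                rows.append(current)
            elif text == "EOD":      # end of data
                break
        elif current is None:
            continue                 # ignore values that appear before the first BOT
        elif kind == "0":            # numeric cell: the value follows the comma
            current.append(number)
        elif kind == "1":            # string cell: the value is the quoted next line
            current.append(text.strip('"'))
    return rows

# Example (path from the answer above):
# rows = dif_rows(open(r"E:\sample.dif", "rb").read())
```

The resulting rows can be passed to csv.writer(...).writerows(rows) from the standard library to produce the CSV file.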

Python Unicode issues with .txt file

To make a long story short, I am writing a Python script that asks the user to drop in a .docx file, which is then converted to .txt. Python looks for keywords within the .txt file and displays them in the shell. I was running into UnicodeDecodeError with the charmap codec. I overcame that by writing word.decode("charmap") within my for loop. Now Python is not displaying the keywords it finds in the shell. Any advice on how to overcome this? Maybe have Python skip the characters it cannot decode and continue reading through the rest? Here is my code:
import sys
import os
import codecs
filename = input("Drag and drop resume here: ")
keywords =['NGA', 'DoD', 'Running', 'Programing', 'Enterprise', 'impossible', 'meets']
file_words = []
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            word.decode("charmap")
            file_words.append(word)
comparison = []
for words in file_words:
    if words in keywords:
        comparison.append(words)
def remove_duplicates(comparison):
    output = []
    seen = set()
    for words in comparison:
        if words not in seen:
            output.append(words)
            seen.add(words)
    return output
comparison = remove_duplicates(comparison)
print ("Keywords found:",comparison)
key_count = 0
word_count = 0
for element in comparison:
    word_count += 1
for element in keywords:
    key_count += 1
Threshold = word_count / key_count
if Threshold <= 0.7:
    print ("The candidate is not qualified for")
else:
    print ("The candidate is qualified for")
file.close()
And the output:
Drag and drop resume here: C:\Users\User\Desktop\Resume_Newton Love_151111.txt
Keywords found: []
The candidate is not qualified for
In Python 3, don't open text files in binary mode. By default, the file is decoded to Unicode using locale.getpreferredencoding(False) (cp1252 on US Windows):
with open(filename) as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
or specify an encoding:
with open(filename, encoding='utf8') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)
You do need to know the encoding of your file. There are other options to open as well, including errors='ignore' or errors='replace' but you shouldn't get errors if you know the correct encoding.
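To illustrate the errors options mentioned above, here is a small Python 3 sketch. The helper name read_words and its defaults are made up for this example. With errors='replace', each undecodable byte becomes the U+FFFD replacement character instead of raising UnicodeDecodeError; with errors='ignore', it is dropped silently.

```python
def read_words(path, encoding="utf-8", errors="strict"):
    """Return the whitespace-separated words of a text file (Python 3)."""
    # errors controls what happens when a byte does not fit the encoding:
    # "strict" raises UnicodeDecodeError, "replace" and "ignore" keep going
    with open(path, encoding=encoding, errors=errors) as f:
        return f.read().split()

# Example (hypothetical file name):
# words = read_words("resume.txt", encoding="utf-8", errors="replace")
```

Both fallback modes lose information, so they are best kept for cases where the real encoding cannot be found.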
As others have said, posting a sample of your text file that reproduces the error and the error traceback would help diagnose your specific issue.
In case anyone cares. It's been a long time, but wanted to clear up that I didn't even know the difference between binary and txt files back in these days. I eventually found a doc/docx module for python that made things easier. Sorry for the headache!
Maybe posting the code producing the traceback would be easier to fix.
I'm not sure this is the only problem, maybe this would work better:
with open(filename, "rb") as file:
    for line in file:
        for word in line.split():
            file_words.append(word.decode("charmap"))
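The reason this matters: in Python 3, bytes and str never compare equal, and decode() returns a new string without changing the original object. A short sketch of the difference follows. The function name find_keywords is hypothetical, and cp1252 stands in for whatever encoding the file really uses.

```python
def find_keywords(raw_words, keywords, encoding="cp1252"):
    """Decode each bytes word and keep those that appear in keywords."""
    # decode() returns a new str; the bytes object itself is left unchanged,
    # so the result has to be stored, not discarded
    decoded = [w.decode(encoding) for w in raw_words]
    return [w for w in decoded if w in keywords]

# On Python 3, bytes never equal str, so the undecoded check always fails:
# b"NGA" in ["NGA"]  is False
# find_keywords([b"NGA", b"other"], ["NGA", "DoD"])  keeps only "NGA"
```

That is why the original loop reported no keywords even though no exception was raised.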
Alright, I figured it out: I added encoding = 'charmap' to the open() call. Here is my code. However, I then tried a more complex docx file, and when it was converted to .txt the entire file consisted of special characters. So now I am thinking that I should move to the python-docx module, since it deals with XML-based files like Word documents.
with open(filename, encoding = 'charmap') as file:
    for line in file:
        for word in line.split():
            file_words.append(word)

Python 2.7 File Container containing HTML XML JPG PNG PDF with f.readlines()

Okay, I have a file container produced by a web crawler that contains a lot of different file types, mostly (but not only) HTML, XML, JPG, PNG and PDF. Most of the container is HTML text, so I tried to open it with:
with open(fname) as f:
    content = f.readlines()
which basically fails when I hit a PDF. The files are structured so that every file is preceded by a little meta information telling me what kind of file follows.
Is there a method similar to .readlines() in Python to read files line by line? I don't need the PDFs; I will ignore them anyway, I just want to skip them.
Thanks in advance
Edit:
Example File: GDrive Link
file has a readline() method too, but the idiomatic way is to simply iterate over the file:
with open("/works/even/with/a/pdf/document.pdf") as f:
    for line in f:
        do_something_with(line)
Also I don't understand what you mean by "(it) basically fails when I hit a PDF". I have no problem applying the above code to a pdf file here.
For reading files line by line you could use fileoperations.
from fileoperations import FileReader
print FileReader.LineByLine(fname) #Note this returns a list of lines.
Could you show us a sample of the PDF? This works for my PDFs.
OK, I found a solution: just open the container with open(fname, 'rb') and you can parse it line by line.
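A sketch of what that can look like when only the text parts are wanted. Since the format of the meta information is not shown here, this version does not use it. Instead it assumes, as a rough heuristic, that lines which fail to decode as UTF-8 belong to binary content and can be skipped. The generator name text_lines is made up for illustration.

```python
def text_lines(path, encoding="utf-8"):
    """Yield only the lines of a mixed text/binary file that decode cleanly."""
    # Binary mode never raises decoding errors while reading
    with open(path, "rb") as f:
        for raw in f:
            try:
                yield raw.decode(encoding)
            except UnicodeDecodeError:
                continue  # treat undecodable lines as binary and skip them

# Example (fname as in the question):
# for line in text_lines(fname):
#     print(line)
```

Some binary lines may happen to decode without error, so skipping based on the meta information in front of each file would be more reliable.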
