Writing to txt file in UTF-8 - Python

My Django application gets a document from the user, creates a report about it, and writes the report to a txt file. The interesting problem is that everything works very well on macOS, but on Windows some letters cannot be read and are converted to symbols like é™, ä±. Here is my code:
views.py:
def result(request):
    last_uploaded = OriginalDocument.objects.latest('id')
    original = open(str(last_uploaded.document), 'r')
    original_words = original.read().lower().split()
    words_count = len(original_words)
    open_original = open(str(last_uploaded.document), "r")
    read_original = open_original.read()
    characters_count = len(read_original)
    report_fives = open("static/report_documents/" + str(last_uploaded.student_name) +
                        "-" + str(last_uploaded.document_title) + "-5.txt", 'w', encoding="utf-8")
    # Path to the documents with which original doc is comparing
    path = 'static/other_documents/doc*.txt'
    files = glob.glob(path)
    #endregion
    rows, found_count, fives_count, rounded_percentage_five, percentage_for_chart_five, fives_for_report, founded_docs_for_report = search_by_five(last_uploaded, 5, original_words, report_fives, files)
    context = {
        ...
    }
    return render(request, 'result.html', context)
report txt file:
['universitetindé™', 'té™hsili', 'alä±ram.', 'mé™n'] was found in static/other_documents\doc1.txt.
...

The issue here is that you're calling open() on a file without specifying the encoding. As noted in the Python documentation, the default encoding is platform dependent. That's probably why you're seeing different results in Windows and MacOS.
Assuming that the file itself was actually encoded in UTF-8, just specify that when reading the file:
original = open(str(last_uploaded.document), 'r', encoding="utf-8")
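The effect can be reproduced in a few lines. This is only a sketch under some assumptions: the file name report_demo.txt is made up, the sample words are my reconstruction of the garbled report output, and cp1252 is used as a stand-in for the Windows default encoding (the actual default depends on the machine's locale). The same UTF-8 bytes are read back twice, once with the matching encoding and once with cp1252.

```python
import os
import tempfile

# Sample text with non-ASCII letters (reconstructed from the garbled report).
text = "universitetində təhsili alıram"

# Hypothetical file name, placed in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), "report_demo.txt")

# Write with an explicit encoding instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the same encoding returns the original text.
with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Reading the same bytes as cp1252 decodes each UTF-8 byte as its own
# character, which is where symbols such as é™ and ä± come from.
with open(path, "r", encoding="cp1252", errors="replace") as f:
    garbled = f.read()

print(round_trip == text)
print(garbled.lower().split())
```

If the uploaded documents are not actually UTF-8, the explicit encoding has to match whatever they were saved in.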

Related

My JSON save and load function is not working

I am writing a simple function to save a twitter search as a JSON, and then load the results. The save function seems to work but the load one doesn't. The error I receive is:
"UnsupportedOperation: not readable"
Can you please advise what the issue might be in my script?
import io
def save_json(filename, data):
    with open('tweet2.json', 'w', encoding='utf8') as file:
        json.dump(data, file, ensure_ascii = False)
def load_json(filename):
    with open('tweet2.json', 'w', encoding = 'utf8') as file:
        return json.load(file)
#sample usage
q = 'Test'
results = twitter_search(twitter_api, q, max_results = 10)
save_json = (q, results)
results = load_json(q)
print(json.dumps(results, indent = 1, ensure_ascii = False))
Opening the file with "w" makes it write-only (and truncates it), so it cannot be read. In load_json you need to use "r" (opens a file for reading only):
open("tweet2.json","r")
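For reference, here is a sketch of how the two functions might look with the mode corrected. It also addresses two other things I noticed in the question's script: both functions ignore their filename argument, and `save_json = (q, results)` assigns a tuple to the name instead of calling the function. The sample data and the temporary path are stand-ins, since twitter_search is not shown.

```python
import json
import os
import tempfile


def save_json(filename, data):
    # "w" opens the file for writing only and truncates any existing content.
    with open(filename, "w", encoding="utf8") as file:
        json.dump(data, file, ensure_ascii=False)


def load_json(filename):
    # "r" opens the file for reading; "w" here raises UnsupportedOperation.
    with open(filename, "r", encoding="utf8") as file:
        return json.load(file)


# Sample usage with stand-in data (the real script would pass search results).
path = os.path.join(tempfile.mkdtemp(), "tweet2.json")
sample = [{"text": "Test tweet", "lang": "en"}]
save_json(path, sample)  # a call, not an assignment to the name save_json
loaded = load_json(path)
print(json.dumps(loaded, indent=1, ensure_ascii=False))
```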

How to write Arabic to a CSV file

I am trying to extract tweets with Python and store them in a CSV file, but I can't seem to include all languages. Arabic appears as special characters.
def recup_all_tweets(screen_name,api):
    all_tweets = []
    new_tweets = api.user_timeline(screen_name,count=300)
    all_tweets.extend(new_tweets)
    #outtweets = [[tweet.id_str, tweet.created_at, tweet.text,tweet.retweet_count,get_hashtagslist(tweet.text)] for tweet in all_tweets]
    outtweets = [[tweet.text,tweet.entities['hashtags']] for tweet in all_tweets]
    # with open('recup_all_tweets.json', 'w', encoding='utf-8') as f:
    #     f.write(json.dumps(outtweets, indent=4, sort_keys=True))
    with open('recup_all_tweets.csv', 'w',encoding='utf-8') as f:
        writer = csv.writer(f,delimiter=',')
        writer.writerow(["text","tag"])
        writer.writerows(outtweets)
    # pass
    return(outtweets)
Example of writing both CSV and JSON:
#coding:utf8
import csv
import json
s = ['عربى','عربى','عربى']
with open('output.csv','w',encoding='utf-8-sig',newline='') as f:
    r = csv.writer(f)
    r.writerow(['header1','header2','header3'])
    r.writerow(s)
with open('output.json','w',encoding='utf8') as f:
    json.dump(s,f,ensure_ascii=False)
output.csv:
header1,header2,header3
عربى,عربى,عربى
output.csv viewed in Excel: [screenshot not included]
output.json:
["عربى", "عربى", "عربى"]
Note: Microsoft Excel needs utf-8-sig to read a UTF-8 file properly. Other applications may or may not need it to display the file properly. Many Windows applications require a UTF-8 "BOM" signature at the start of a text file, or they will assume an ANSI encoding instead. The ANSI encoding varies depending on the localized version of Windows used.
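To see what utf-8-sig actually changes, the sketch below writes the same line with both encodings and compares the raw bytes. The file names are made up for the demo and the files go into a temporary directory.

```python
import codecs
import os
import tempfile

folder = tempfile.mkdtemp()
plain_path = os.path.join(folder, "plain.csv")    # hypothetical file names
bom_path = os.path.join(folder, "with_bom.csv")

# Write identical text, once as plain UTF-8 and once as UTF-8 with signature.
for path, encoding in [(plain_path, "utf-8"), (bom_path, "utf-8-sig")]:
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write("عربى\n")

# Read both files back as raw bytes.
with open(plain_path, "rb") as f:
    plain_bytes = f.read()
with open(bom_path, "rb") as f:
    bom_bytes = f.read()

# The only difference is the three-byte signature at the start of the file.
has_bom = bom_bytes.startswith(codecs.BOM_UTF8)
print(has_bom, len(bom_bytes) - len(plain_bytes))
```

When reading such a file back in Python, opening it with encoding='utf-8-sig' strips the signature again.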
Maybe try with
f.write(json.dumps(outtweets, indent=4, sort_keys=True, ensure_ascii=False))
I searched a lot and finally wrote the following piece of code:
from time import sleep
import arabic_reshaper
from bidi.algorithm import get_display
import numpy as np
import pandas as pd
from selenium.webdriver.common.by import By
# Note: `webdriver` below is an already-created Selenium driver instance, not the selenium.webdriver module
itemsX = webdriver.find_elements(By.CLASS_NAME,"x1i10hfl")
item_linksX = [itemX.get_attribute("href") for itemX in itemsX]
item_linksX = filter(lambda k: '/p/' in k, item_linksX)
counter = 0
for item_linkX in item_linksX:
    AllComments2 = []
    counter = counter + 1
    webdriver.get(item_linkX)
    print(item_linkX)
    sleep(11)
    comments = webdriver.find_elements(By.CLASS_NAME,"_aacl")
    for comment in comments:
        try:
            reshaped_text = arabic_reshaper.reshape(comment.text)
            bidi_text = get_display(reshaped_text)  # computed but unused; reshaped_text is what gets saved
            AllComments2.append(reshaped_text)
        except Exception:
            pass
    df = pd.DataFrame({'col':AllComments2})
    # Raw string so the backslashes in the Windows path are taken literally
    df.to_csv(r'C:\Crawler\Comments' + str(counter) + '.csv', sep='\t', encoding='utf-16')
This code worked perfectly for me. I hope it helps those who haven't used the code from the previous post.

Downloading Zip file through Django, file decompresses as cpzg

My goal is to parse a series of strings into a series of text files that are compressed as a Zip file and downloaded by a web app using Django's HTTP Response.
Developing locally in PyCharm, my method outputs a Zip file called "123.zip" which contains 6 individual files named "123_1", "123_2", etc., containing the letters from my phrase, with no problem.
The issue arises when I push the code to my web app and include the Django HTTP Response: the file downloads, but when I go to extract it, it produces "123.zip.cpzg". Extracting that in turn gives me 123.zip(1), in a frustrating infinite loop. Any suggestions where I'm going wrong?
Code that works locally to produce "123.zip":
def create_text_files1():
    JobNumber = "123"
    z = zipfile.ZipFile(JobNumber +".zip", mode ="w")
    phrase = "A, B, C, D, EF, G"
    words = phrase.split(",")
    x =0
    for word in words:
        word.encode(encoding="UTF-8")
        x = x + 1
        z.writestr(JobNumber +"_" + str(x) + ".txt", word)
    z.close()
Additional part of the method in my web app:
response = HTTPResponse(z, content_type ='application/zip')
response['Content-Disposition'] = "attachment; filename='" + str(jobNumber) + "_AHTextFiles.zip'"
Take a closer look at the example provided in this answer.
Notice a StringIO is opened, the zipFile is called with the StringIO as a "File-Like Object", and then, crucially, after the zipFile is closed, the StringIO is returned in the HTTPResponse.
# Open StringIO to grab in-memory ZIP contents
s = StringIO.StringIO()
# The zip compressor
zf = zipfile.ZipFile(s, "w")
# Grab ZIP file from in-memory, make response with correct MIME-type
resp = HttpResponse(s.getvalue(), mimetype = "application/x-zip-compressed")
I would recommend a few things in your case.
Use BytesIO for forward compatibility
Take advantage of ZipFile's built-in context manager
In your Content-Disposition, be careful of "jobNumber" vs "JobNumber"
Try something like this:
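import io
import zipfile
from django.http import HttpResponse

def print_nozzle_txt(request):
    JobNumber = "123"
    phrase = "A, B, C, D, EF, G"
    words = phrase.split(",")
    x = 0
    byteStream = io.BytesIO()
    with zipfile.ZipFile(byteStream, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
        for word in words:
            x = x + 1
            # writestr() encodes a str as UTF-8 itself, so no separate encode() call is needed
            zf.writestr(JobNumber + "_" + str(x) + ".txt", word)
    # Read the buffer only after the with block has closed (and so finalized) the archive
    response = HttpResponse(byteStream.getvalue(), content_type='application/x-zip-compressed')
    # Double quotes around the file name; single quotes would become part of the name
    response['Content-Disposition'] = 'attachment; filename="' + str(JobNumber) + '_AHTextFiles.zip"'
    return response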
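The in-memory part can be checked without Django at all. The sketch below is an assumption-laden illustration: build_zip_bytes is a hypothetical helper name, and unlike the code in the question it strips the whitespace around each word. It builds the archive in a BytesIO, takes the bytes only after the ZipFile has been closed, and then reopens those bytes to confirm they form a valid archive. These are the bytes you would hand to HttpResponse.

```python
import io
import zipfile


def build_zip_bytes(job_number, phrase):
    """Return the bytes of a ZIP archive with one text file per word."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for index, word in enumerate(phrase.split(","), start=1):
            zf.writestr(f"{job_number}_{index}.txt", word.strip())
    # Take the bytes only after the with block has closed the archive,
    # because closing writes the central directory at the end of the ZIP.
    return buffer.getvalue()


payload = build_zip_bytes("123", "A, B, C, D, EF, G")

# Round-trip check: reopen the bytes as an archive and inspect its members.
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    names = zf.namelist()
    first = zf.read("123_1.txt").decode("utf-8")

print(names, first)
```

Passing the ZipFile object itself to the response, as in the question, sends something other than these bytes, which would explain why the download cannot be extracted.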

Search and replace specific line which starts with specific string in a file

My requirement is to open a properties file and update it. To do the update, I need to search for a specific line that stores the URL information. For this purpose I have written the code below in Python:
import os
owsURL="https://XXXXXXXXXXXXXX/"
reowsURL = "gStrOwsEnv = " + owsURL + "/" + "OWS_WS_51" + "/"
fileName='C:/Users/XXXXXXXXXXX/tempconf.properties'
if not os.path.isfile(fileName):
    print("!!! Message : Configuration.properties file is not present ")
else:
    print("+++ Message : Located the configuration.properties file")
    with open(fileName) as f:
        data = f.readlines()
    for m in data:
        if m.startswith("gStrOwsEnv"):
            print("ok11")
            m = m.replace(m,reowsURL)
After executing the program, the properties file is not updated.
Any help is highly appreciated.
Sample Content of file:
# ***********************************************
# Test Environment Details
# ***********************************************
# Application URL pointing to test execution
#gStrApplicationURL =XXXXXXXXXXXXXXXX/webservices/person
#gStrApplicationURL = XXXXXXXXXXXXXX/GuestAPIService/ProxyServices/
# FOR JSON
#gStrApplicationURL = XXXXXXXXXXXXXX
#SOAP_gStrApplicationURL =XXXXXXXXXXXXXXXXXXXXXXX
#(FOR WSDL PARSING)
version = 5
#v9
#SOAP_gStrApplicationURL = XXXXXXXXXXX/XXXXXXXXX/XXXXXXXXX/
#v5
SOAP_gStrApplicationURL = XXXXXXXXXXXXXXX/OWS_WS_51/
gStrApplicationXAIServerPath=
gStrEnvironmentName=XXXXXXXXX
gStrOwsEnv = XXXXXXXXXXXXXXXXXXXX/OWS_WS_51/
gStrConnectEnv = XXXXXXXXXXXXXXXXX/OWSServices/Proxy/
gStrSubscriptionKey =XXXXXXXXXXXXXXXXXXXXXX
I'm pretty sure that this is not the best way of doing it, but it is one way:
with open(input_file_name, 'r') as f_in, open(output_file_name, 'w') as f_out:
    for line in f_in:
        if line.startswith("gStrOwsEnv"):
            # reowsURL has no trailing newline, so add one to keep the lines separate
            f_out.write(reowsURL + "\n")
        else:
            f_out.write(line)
This script copies every line of input_file_name into output_file_name, except for the lines you want to change, which are replaced by reowsURL.
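If the same file has to be updated in place, one option is to read all lines, build a new list, and write it back. The original attempt fails for two reasons: `m = m.replace(...)` only rebinds the loop variable and leaves the list untouched, and nothing is ever written back to disk. The sketch below is a rough illustration; replace_property_line is a hypothetical helper name, the demo file and its values are placeholders, and UTF-8 is assumed for the file.

```python
import os
import tempfile


def replace_property_line(path, key, new_line):
    """Rewrite the file at `path`, replacing every line that starts with `key`."""
    with open(path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    # Build a new list; assigning to the loop variable would not change `lines`.
    updated = [new_line + "\n" if line.startswith(key) else line for line in lines]
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(updated)


# Demo on a throwaway file with placeholder values.
demo_path = os.path.join(tempfile.mkdtemp(), "tempconf.properties")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("version = 5\ngStrOwsEnv = old-value/OWS_WS_51/\ngStrEnvironmentName=demo\n")

replace_property_line(demo_path, "gStrOwsEnv", "gStrOwsEnv = new-value/OWS_WS_51/")

with open(demo_path, encoding="utf-8") as f:
    result = f.read()
print(result)
```

Commented-out lines such as "#gStrOwsEnv" are left alone, because they do not start with the key.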

Upload and parse csv file with "universal newline" in python on Google App Engine

I'm uploading a csv/tsv file from a form in GAE, and I try to parse the file with the Python csv module.
As described here, uploaded files in GAE are strings.
So I treat my uploaded string as a file-like object:
file = self.request.get('catalog')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
But the newlines in my files are not necessarily '\n' (thanks to Excel), and this generates an error:
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Does anyone know how to use StringIO.StringIO to treat strings like files opened in universal-newline mode?
How about:
file = self.request.get('catalog')
file = '\n'.join(file.splitlines())
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
or as pointed out in the comments, csv.reader() supports input from a list, so:
file = self.request.get('catalog')
catalog = csv.reader(file.splitlines(),dialect=csv.excel_tab)
or if in the future request.get supports read modes:
file = self.request.get('catalog', 'rU')
catalog = csv.reader(StringIO.StringIO(file),dialect=csv.excel_tab)
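The list-based variant can be tried outside App Engine. This is a small Python 3 sketch in which the string `uploaded` and its contents are made-up stand-ins for the value returned by self.request.get('catalog').

```python
import csv

# Stand-in for the uploaded string: tab-separated rows with mixed line endings.
uploaded = "sku\tname\rA1\tWidget\r\nB2\tGadget\nC3\tGizmo"

# str.splitlines() splits on \r, \n and \r\n alike, and csv.reader accepts
# any iterable of strings, so no file-like wrapper is needed.
rows = list(csv.reader(uploaded.splitlines(), dialect=csv.excel_tab))

for row in rows:
    print(row)
```

One caveat: splitlines() also breaks up quoted fields that contain embedded newlines, so this suits files where every record is on a single line.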
The solution described here should work. By defining an iterator class as follows, which loads the blob 1MB at a time, splits the lines using .splitlines() and then feeds lines to the CSV reader one at a time, the newlines can be handled without having to load the whole file into memory.
class BlobIterator:
    """Because the python csv module doesn't like strange newline chars and
    the google blob reader cannot be told to open in universal mode, then
    we need to read blocks of the blob and 'fix' the newlines as we go"""
    def __init__(self, blob_reader):
        self.blob_reader = blob_reader
        self.last_line = ""
        self.line_num = 0
        self.lines = []
        self.buffer = None
    def __iter__(self):
        return self
    def next(self):
        if not self.buffer or len(self.lines) == self.line_num + 1:
            self.buffer = self.blob_reader.read(1048576) # 1MB buffer
            self.lines = self.buffer.splitlines()
            self.line_num = 0
            # Handle special case where our block just happens to end on a new line
            if self.buffer[-1:] == "\n" or self.buffer[-1:] == "\r":
                self.lines.append("")
        if not self.buffer:
            raise StopIteration
        if self.line_num == 0 and len(self.last_line) > 0:
            result = self.last_line + self.lines[self.line_num] + "\n"
        else:
            result = self.lines[self.line_num] + "\n"
        self.last_line = self.lines[self.line_num + 1]
        self.line_num += 1
        return result
Then call this like so:
blob_reader = blobstore.BlobReader(blob_key)
blob_iterator = BlobIterator(blob_reader)
reader = csv.reader(blob_iterator)
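On Python 3, outside the App Engine blob API, a similar streaming effect is available from the standard library. Note that this is a different technique from the BlobIterator class: it swaps in io.TextIOWrapper with newline=None, which translates '\r' and '\r\n' to '\n' while decoding. In this sketch the BytesIO object stands in for any binary stream that TextIOWrapper accepts, and UTF-8 is an assumption about the upload.

```python
import csv
import io

# Stand-in for a binary upload stream with mixed line endings.
raw = io.BytesIO(b"a,b\rc,d\r\ne,f\n")

# newline=None enables universal-newline translation during decoding,
# so the csv reader only ever sees "\n" line endings.
text_stream = io.TextIOWrapper(raw, encoding="utf-8", newline=None)

rows = list(csv.reader(text_stream))
print(rows)
```

The wrapper decodes the stream incrementally, so the whole file does not have to be held in memory as one string. Newlines embedded in quoted fields are normalized to '\n' as well, which may or may not matter for your data.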
