I'm converting a downloaded Facebook Messenger conversation from JSON to a text file using Python. The JSON-to-text conversion looks fine: I strip the unnecessary information, reverse the order of the messages, and save the output to a file. However, when I look at the output file, sometimes there's an â where an apostrophe should be.
My Python isn't great as I normally work with Java, so there's probably a lot of things I could improve. If someone could suggest some better tags for this question, I'd also be very appreciative.
Example of apostrophe working: You're not making them are you?
Example of apostrophe not working: Itâs just a button I discovered
What is causing this to happen, and why doesn't it happen every time there is an apostrophe?
Here is the script:
#!/usr/bin/python3
import datetime

def main():
    input_file = open('messages.txt', 'r')
    output_file = open('results.txt', 'w')
    content_list = []
    sender_name_list = []
    time_list = []
    line = input_file.readline()
    while line:
        line = input_file.readline()
        if "sender_name" in line:
            values = line.split("sender_name")
            sender_name_list.append(values[1][1:])
        if "timestamp_ms" in line:
            values = line.split("timestamp_ms")
            time_value = values[1]
            timestamp = int(time_value[1:])
            time = datetime.datetime.fromtimestamp(timestamp / 1000.0)
            time_truncated = time.replace(microsecond=0)
            time_list.append(time_truncated)
        if "content" in line:
            values = line.split("content")
            content_list.append(values[1][1:])
    content_list.reverse()
    sender_name_list.reverse()
    time_list.reverse()
    for x in range(1, len(content_list)):
        output_file.write(sender_name_list[x])
        output_file.write(str(time_list[x]))
        output_file.write("\n")
        output_file.write(content_list[x])
        output_file.write("\n\n")
    input_file.close()
    output_file.close()

if __name__ == "__main__":
    main()
Edit:
The answer to the question was adding
import codecs
input_file = codecs.open('messages.txt', 'r', 'utf-8')
output_file = codecs.open('results.txt','w', 'utf-8')
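For what it's worth, in Python 3 the built-in open() accepts an encoding argument directly, so codecs.open() isn't needed. A minimal round-trip sketch (using a throwaway results.txt):

```python
# Python 3's open() takes an encoding argument, making codecs.open() unnecessary:
with open('results.txt', 'w', encoding='utf-8') as output_file:
    output_file.write("It\u2019s just a button I discovered\n")
with open('results.txt', 'r', encoding='utf-8') as input_file:
    print(input_file.read())  # It’s just a button I discovered
```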
Without seeing the incoming data it's hard to be sure, but I suspect that instead of an apostrophe (Unicode U+0027 ' APOSTROPHE), you've got its curly equivalent (U+2019 ’ RIGHT SINGLE QUOTATION MARK) in there being interpreted as old-fashioned ASCII.
Instead of
output_file = open('results.txt', 'w')
try
import codecs
output_file = codecs.open('results.txt','w', 'utf-8')
You may also need the equivalent on your input file.
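For the record, the â in the question comes from exactly this mismatch; a small sketch of the mojibake:

```python
# U+2019 encoded as UTF-8 gives three bytes; decoding those bytes as
# Latin-1 (one character per byte) yields 'â' followed by two control characters:
s = "It\u2019s"
mojibake = s.encode('utf-8').decode('latin-1')
print(repr(mojibake))  # 'Itâ\x80\x99s'
```

The two control characters after the â are invisible in most viewers, which is why the file appears to show only "Itâs".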
Related
I'm looking at a .csv file that looks like this:
Hello\r\n
my name is Alex\n
Hello\r\n
my name is John?\n
I'm trying to open the file with the newline character defined as '\n':
with open(outputfile, encoding="ISO-8859-15", newline='\n') as csvfile:
I get:
line1 = 'Hello'
line2 = 'my name is Alex'
line3 = 'Hello'
line4 = 'my name is John'
My desired result is:
line1 = 'Hello\r\nmy name is Alex'
line2 = 'Hello\r\nmy name is John'
Do you have any suggestions on how to fix this?
Thank you in advance!
I'm sure your answers are completely correct and technically advanced.
Sadly the CSV file is not at all RFC 4180 compliant.
Therefore I'm going with the following solution and will correct my temporary "||" characters afterwards:
with open(outputfile_corrected, 'w') as correctedfile_handle:
    with open(outputfile, encoding="ISO-8859-15", newline='') as csvfile:
        csvfile_content = csvfile.read()
        csvfile_content_new = csvfile_content.replace('\r\n', '||')
        correctedfile_handle.write(csvfile_content_new)
(Someone commented this, but the answer has been deleted)
From documentation of the built-in function open in the standard library:
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
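A quick check of that behaviour (using a throwaway temp file with the question's mixed '\r\n' / '\n' data):

```python
import os
import tempfile

# Write the raw bytes, then read them back with different newline settings.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(b'Hello\r\nmy name is Alex\nHello\r\nmy name is John\n')

with open(path, newline=None) as f:   # default: all endings translated to '\n'
    translated = f.readlines()
with open(path, newline='') as f:     # endings kept untranslated
    untranslated = f.readlines()

print(translated)    # ['Hello\n', 'my name is Alex\n', 'Hello\n', 'my name is John\n']
print(untranslated)  # ['Hello\r\n', 'my name is Alex\n', 'Hello\r\n', 'my name is John\n']
os.remove(path)
```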
The file object itself cannot distinguish the data bytes '\r\n' (in your case) from the separator '\n' -- that is the responsibility of the bytes decoder. So, as one option, it is possible to write your own decoder and use the associated encoding as the encoding of your text file. But this is a bit tedious, and for small files it's much easier to use a more straightforward approach with the re module. The solution proposed by @Martijn Pieters should be used to iterate over large files.
import re

with open('data.csv', 'tr', encoding="ISO-8859-15", newline='') as f:
    file_data = f.read()

# Approach 1:
lines1 = re.split(r'(?<!\r)\n', file_data)
if not lines1[-1]:
    lines1.pop()
# Approach 2:
lines2 = re.findall(r'(?:.+?(?:\r\n)?)+', file_data)
# Approach 3:
iterator_lines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+', file_data))
assert lines1 == lines2 == list(iterator_lines3)
print(lines1)
If we need '\n' at the end of each line:
# Approach 1:
nlines1 = re.split(r'(?<!\r\n)(?<=\n)', file_data)
if not nlines1[-1]:
    nlines1.pop()
# Approach 2:
nlines2 = re.findall(r'(?:.+?(?:\r\n)?)+\n?', file_data)
# Approach 3:
iterator_nlines3 = map(re.Match.group, re.finditer(r'(?:.+?(?:\r\n)?)+\n', file_data))
assert nlines1 == nlines2 == list(iterator_nlines3)
print(nlines1)
Results:
['Hello\r\nmy name is Alex', 'Hello\r\nmy name is John']
['Hello\r\nmy name is Alex\n', 'Hello\r\nmy name is John\n']
Python's line splitting algorithm can't do what you want; lines that end in \r\n also end in \n. At most you can set the newline argument to either '\n' or '' and re-join lines if they end in \r\n instead of \n. You can use a generator function to do that for you:
def collapse_CRLF(fileobject):
    buffer = []
    for line in fileobject:
        if line.endswith('\r\n'):
            buffer.append(line)
        else:
            yield ''.join(buffer) + line
            buffer = []
    if buffer:
        yield ''.join(buffer)
then use this as (a generator is not a context manager, so keep the open() in the with statement):
with open(outputfile, encoding="ISO-8859-15", newline='') as f:
    for line in collapse_CRLF(f):
        ...
However, if this is a CSV file, then you really want to use the csv module. It handles files with a mix of \r\n and \n endings for you, as it already knows how to preserve bare newlines in RFC 4180 CSV files:
import csv
with open(outputfile, encoding="ISO-8859-15", newline='') as inputfile:
    reader = csv.reader(inputfile)
Note that in a valid CSV file, \r\n is the delimiter between rows, and \n is valid in column values. So if you did not want to use the csv module here for whatever reason, you'd still want to use newline='\r\n'.
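To illustrate, a minimal sketch (with made-up data) of the csv module preserving a bare \n inside a quoted field while splitting rows on \r\n:

```python
import csv
import io

# One quoted field contains an embedded '\n'; rows are delimited by '\r\n'.
data = 'name,notes\r\nAlex,"line one\nline two"\r\n'
rows = list(csv.reader(io.StringIO(data, newline='')))
print(rows)  # [['name', 'notes'], ['Alex', 'line one\nline two']]
```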
Struggling to automate a text file cleanup for some subsequent data analysis. I have a tab-delimited text file where I need to remove instances of \t" (remove the " but keep the tab).
I then want to remove instances of \n where the preceding character is not \r. i.e. \r\n is OK, x\n is not. I have the first part working but not the second; any help appreciated. I appreciate there are probably far better ways to do this; I'm writing then reopening in byte mode simply because I can't seem to detect \r in 'r' mode.
import re

originalFile = '14-09 - Copy.txt'
amendedFile = '14-09 - amended.txt'

with open(originalFile, 'r') as content_file:
    content = content_file.read()
content = content.replace('\t\"', '\t')
with open(amendedFile, 'w') as f:
    f.write(content)

with open(amendedFile, 'rb') as content_file:
    content = content_file.read()
content = re.sub(b"(?<!\r)\n", b"", content)
with open(amendedFile, 'wb') as f:
    f.write(content)
print("Done")
For clarity or completeness, the Python 2 code below identifies the positions that I'm interested in (I'm just looking to automate their removal now). i.e.
\r\nText should equal \r\nText
\t\nText should equal \tText
Text\nText should equal TextText
import re

with open('14-09 - Copy.txt', 'rb') as content_file:
    content = content_file.read()

newLinePos = [m.start() for m in re.finditer('\n', content)]
for line in newLinePos:
    if content[line-1] != '\r':
        print(repr(content[line-20:line]))
Thanks as always!
You probably want to use ([^\r])\n as your pattern, and then substitute \1 to keep the character before.
So your line would be
content = re.sub(rb"([^\r])\n", rb"\1", content)
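Both patterns behave the same on bytes like these; a quick sanity check with made-up data:

```python
import re

data = b"keep\r\njoin this\nline"
# Lookbehind: drop '\n' only when it is not preceded by '\r'.
out_lookbehind = re.sub(rb"(?<!\r)\n", b"", data)
# Capture group: match the preceding byte too, and keep it via \1.
out_group = re.sub(rb"([^\r])\n", rb"\1", data)
print(out_lookbehind)  # b'keep\r\njoin thisline'
```

The only difference between the two is a \n at the very start of the data: the lookbehind removes it, while the capture-group version leaves it alone because it requires a preceding byte.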
I'm trying to compare two files via regex strings and print the output. I seem to have an issue with my loop, as only the last line gets printed. What am I missing?
import re

delist = [r'"age":.*",', r'"average":.*",', r'"class":.*",']

with open('test1.txt', 'r') as bolo:
    boloman = bolo.read()

for dabo in delist:
    venga = re.findall(dabo, boloman)
    for vaga in venga:
        with open('test.txt', 'r') as f:
            content = f.read()
        venga2 = re.findall(dabo, content)
        for vaga2 in venga2:
            mboa = content.replace(vaga2, vaga, 1)
print(mboa)
At first glance, the problem I see is that you keep overwriting mboa, so it only ever holds the last result. I think what you really want to do is create a list and append each result to it.
import re

mboa = []
delist = [r'"age":.*",', r'"average":.*",', r'"class":.*",']

with open('test1.txt', 'r') as bolo:
    boloman = bolo.read()

for dabo in delist:
    venga = re.findall(dabo, boloman)
    for vaga in venga:
        with open('test.txt', 'r') as f:
            content = f.read()
        venga2 = re.findall(dabo, content)
        for vaga2 in venga2:
            mboa.append(content.replace(vaga2, vaga, 1))
print(mboa)
Does that solve the issue? If it doesn't, add a comment to this question and I'll try to figure it out ;)
Hopefully someone can help me out with the following. It is probably not too complicated but I haven't been able to figure it out. My "output.txt" file is created with:
f = open('output.txt', 'w')
print(tweet['text'].encode('utf-8'))
print(tweet['created_at'][0:19].encode('utf-8'))
print(tweet['user']['name'].encode('utf-8'))
f.close()
If I don't encode it for writing to file, it will give me errors. So "output" contains 3 rows of utf-8 encoded output:
b'testtesttest'
b'line2test'
b'\xca\x83\xc9\x94n ke\xc9\xaan'
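The b'...' lines are the repr of bytes objects; printing (or writing the str() of) a bytes object produces exactly that, using the sample from the output above:

```python
# repr of a bytes object includes the b'' wrapper, which is what ends up in
# the file if the bytes are printed/stringified instead of being decoded:
data = '\u0283\u0254n ke\u026an'.encode('utf-8')
print(data)                  # b'\xca\x83\xc9\x94n ke\xc9\xaan'
print(data.decode('utf-8'))  # ʃɔn keɪn
```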
In "main.py", I am trying to convert this back to a string:
f = open("output.txt", "r", encoding="utf-8")
text = f.read()
print(text)
f.close()
Unfortunately, the b'' format is still not removed. Do I still need to decode it? If possible, I would like to keep the 3-row structure.
My apologies for the newbie question, this is my first one on SO :)
Thank you so much in advance!
With the help of the people answering my question, I have been able to get it to work. The solution is to change how the file is written:
tweet = json.loads(data)
tweet_text = tweet['text'] # content of the tweet
tweet_created_at = tweet['created_at'][0:19] # tweet created at
tweet_user = tweet['user']['name'] # tweet created by
with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(tweet_text + '\n')
    f.write(tweet_created_at + '\n')
    f.write(tweet_user + '\n')
Then read it like:
f = open("output.txt", "r", encoding='utf-8')
tweettext = f.read()
print(tweettext)
f.close()
Instead of specifying the encoding when opening the file, use it to decode as you read.
f = open("output.txt", "rb")
text = f.read().decode(encoding="utf-8")
print(text)
f.close()
If b and the quote ' are in your file, that means there is a problem with how your file was written. Someone probably did write(str(line)) instead of write(line). Now, to decode it, you can use literal_eval. Otherwise @m_callens' answer should be fine.
import ast

with open("b.txt", "r") as f:
    text = [ast.literal_eval(line) for line in f]

for l in text:
    print(l.decode('utf-8'))
# testtesttest
# line2test
# ʃɔn keɪn
Started with python after a long time:
Basically I am trying to read a line from a file:
MY_FILE ='test1.hgx'
Eventually I want to change this test1.hgx with:
test1_para1_para2_para3.hgx
Where para1, 2, 3 are the parameters I want to write.
I wrote a code below
add_name = '%s'%(filename) + '_' + '%s'%(para1) + '_' + '%s'%(para2) + '_' + '%s'%(para3) + '.hgx'
print "added_name:", add_name

with open(filename) as f:
    lines = f.read().splitlines()
with open(filename, 'w') as f:
    for line in lines:
        if line.startswith(' MY_FILE'):
            f.write(line.rsplit(' ', 1)[0] + "'%s'\n" % add_name)
        else:
            f.write(line + '\n')
The above code works as expected and writes out the following when I execute it once:
MY_FILE ='test1_01_02_03.hgx'
However, when I run the script a second time, it eats up the '=' and writes the following:
MY_FILE 'test1_01_02_03.hgx'
Can I add something to my existing code so that it always writes 'test1_01_02_03.hgx' correctly? I think the problem is with:
f.write(line.rsplit(' ', 1)[0] + "'%s'\n" % add_name)
However I am not able to figure it out. Any ideas would be helpful. Thanks.
Change:
f.write(line.rsplit(' ', 1)[0] + "\'%s'\n"%add_name)
to
f.write(line.rsplit('=', 1)[0] + "='%s'\n" % add_name)
Incidentally, are you sure that in the original file, there wasn't a space after the =? If there is no space after the =, this code will always eat up the =. If there is a space, it won't eat it up until the second time the code is run.
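A quick check of that behaviour with two sample lines:

```python
line_with_space = "MY_FILE = 'test1.hgx'"   # space after '='
line_without = "MY_FILE ='test1.hgx'"       # no space after '='

print(line_with_space.rsplit(' ', 1)[0])    # MY_FILE =   ('=' survives)
print(line_without.rsplit(' ', 1)[0])       # MY_FILE     ('=' is eaten)
# Splitting on '=' instead is stable no matter how many times it runs:
print(line_without.rsplit('=', 1)[0] + "='new.hgx'")  # MY_FILE ='new.hgx'
```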
You are splitting on ' ', which is before the =, but not adding another = back. There are many ways you can do this, but the easiest may be to simply add the = back in:
f.write(line.rsplit(' ', 1)[0] + "='%s'\n" % add_name)
Another, cleaner, way to do it would be to use replace:
f.write(line.replace(filename, new_name))
As an aside, you can write the first line much better as:
add_name = '%s_%s_%s_%s.hgx' % (filename, para1, para2, para3)
Try using the fileinput module. Also, use format() to build the strings.
# Using the fileinput module we can open a file and write to it using stdout.
import fileinput
# using stdout we avoid the formatting of print, and avoid extra newlines.
import sys
filename = 'testfile'
params = ['foo', 'bar', 'baz']
# Build the new replacement line.
newline = '{0}_{1}_{2}_{3}.hgx'.format(filename, params[0], params[1], params[2])
for line in fileinput.input(filename, inplace=1):
    if line.startswith('MY_FILE'):
        sys.stdout.write("MY_FILE = '{0}'\n".format(newline))
    else:
        sys.stdout.write(line)
This should replace any line starting with MY_FILE with the line MY_FILE = 'testfile_foo_bar_baz.hgx'.