How to keep the unicode character codes in my csv file? - python

I am handling a large number of incoming emails and many of them have various emoticons in them. I am planning to apply an NLP analysis on the user comments and train a classifier to provide relevant answers, instead of having to manually reply to hundreds of these messages. For this as a first step, I parsed all emails and saved their content in a list called userMessages that I wrote in a csv file. I plan to add further columns to the csv for analytic purposes, such as user name, address, date, and time but this is not relevant for this question now.
Here is the code I use to write the userMessages list into a csv file called user-messages.csv:
with open('user-messages.csv', 'wb') as myfile:
    wr = csv.writer(myfile, dialect='excel', encoding='utf-8', quoting=csv.QUOTE_ALL)
    for _msg in userMessages:
        wr.writerow([_msg])
This doesn't raise an error, thanks to the encoding='utf-8' parameter; however, it removes/recodes the emoticons in such a way that they are no longer recoverable, for instance into the following form: ðŸ˜. Ideally, I would like to have the original Unicode codes in the csv file, such as '\U0001f604' (smiling face with open mouth and smiling eyes), and later substitute these codes with their (approximate) meaning so the NLP can better understand the context of the messages; for instance, in the case of this character ('\U0001f604'), remove the code and add the words 'smile' or 'happy'.
Can this be achieved? Or am I overcomplicating things? Any advice would be greatly appreciated. Thank you!
Edit: I am using Windows and I open the csv files in Microsoft Excel 2016.
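For reference, if all you want is the literal escape codes in the csv, here is a minimal sketch (assuming Python 3) using the unicode_escape codec; this is an editor's illustration, not part of the accepted approach below:
import csv

# A minimal sketch: the unicode_escape codec turns non-ASCII characters
# into literal backslash escapes like \U0001f604.
msg = "Thanks! \U0001f604"
escaped = msg.encode('unicode_escape').decode('ascii')
print(escaped)  # Thanks! \U0001f604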

I really encourage replacing these Unicode characters with their meaning now, rather than keeping the Unicode as a string (which can be done simply by adding the escape character \) and converting them later.
Replacing the Unicode with their meaning can be done easily using the unicodedata.name() method, like so:
import unicodedata

def normalize_unicode(text):
    output = []
    for word in text.split(' '):
        try:
            # unicodedata.name() accepts a single character only, so
            # ordinary multi-character words raise TypeError and are kept.
            meaning = unicodedata.name(word).lower()
            output.append(meaning)
        except (TypeError, ValueError):
            # ValueError covers single characters that have no Unicode name.
            # Caveat: single ASCII letters (e.g. "I") would also be renamed;
            # filter for non-ASCII characters if that matters for your data.
            output.append(word)
    return " ".join(output)
Let's test out this function:
>>> x = "I'm happy \U0001f604"
>>> normalize_unicode(x)
"I'm happy smiling face with open mouth and smiling eyes"
Now, let's see how you would use this function in your code:
with open('user-messages.csv', 'wb') as myfile:
    wr = csv.writer(myfile, dialect='excel', encoding='utf-8', quoting=csv.QUOTE_ALL)
    for _msg in userMessages:
        wr.writerow([normalize_unicode(_msg)])  # <-- can be added here

Related

Python 3.8.5 alternative to .replace with csv.reader and UTF-8 mystery encodings

I have spent 5 hours combing the dark recesses of SO, so I am posting this question as a last resort, and I am genuinely hoping someone can point me in the right direction here:
Scenario:
I have some .csv files (UTF-8 CSVs: verified with the file -I command) from Google surveys that are in multiple languages. Output:
download.csv: application/csv; charset=utf-8
I have a "dictionary" file that has the translations for the questions and answers (one column is the $language and the other is English).
There are LOTS of special characters (umlauts, French accented letters, etc.) in the data from Google, because the surveys are in French, German, and Dutch.
The dictionary file I built reads fine as UTF-8, including special characters, and creates the find/replace keys accurately (verified with print commands).
The issue is that the Google files only read correctly (maintain proper characters) using the csv.reader function in Python. However, that function does not have a .replace, and so I can do one or the other:
read in the source file, make no replacements, and get a perfect copy (not what I need)
convert the csv files/rows to a fileinput/string (UTF-8 still, mind) and get an utterly thrashed output file with missing replacements, because the data "loses" the encoding between the csv read and the string somehow?
The code below comes closest to working, except there is no .replace method on what csv.reader yields:
import csv

# set source, output
source = 'fr_to_trans.csv'
output = 'fr_translated.csv'
dictionary = 'frtrans.csv'
find = []
replace = []

# build the dictionary itself:
with open(dictionary, encoding='utf-8') as dict_file:
    for line in dict_file:
        #print(line)
        temp_split = line.split(',')
        if "!!" in temp_split[0]:
            temp_split[0] = temp_split[0].replace("!!", ",")
        find.append(temp_split[0])
        if "!!" in temp_split[1]:
            temp_split[1] = temp_split[1].replace("!!", ",")
        replace.append(temp_split[1])
#print(len(find))
#print(len(replace))

# set loop counters
check_each = len(find)

# Read in the file to parse
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    output_writer = csv.writer(t_file)
    for row in csv.reader(s_file):
        the_row = row
        print(the_row)  # THIS RETURNS THE CORRECT, FORMATTED, UTF-8 DATA
        i = 0
        # find and replace everything in the find array with its value in the replace array
        while i < check_each:
            print(find[i])
            print(replace[i])
            # THIS LINE DOES NOT WORK:
            the_row = the_row.replace(find[i], replace[i])
            i = i + 1
        output_writer.writerow(the_row)
I have to assume that even though the Google files say they are UTF-8, they are a special "Google branded UTF-8" or some such nonsense. The fact that the file opens correctly with csv.reader, but then you can do nothing to it is infuriating beyond measure.
Just to clarify what I have tried:
Treat files as text and let Python sort out the encoding (fails)
Treat files as UTF-8 text (fails)
Open file as UTF-8, replace strings, and write out using the csv.writer (fails)
Convert the_row to a string, then replace, then write out with csv.writer (fails)
Quick edit - tried utf-8-sig with strings - better, but the output is still totally mangled, because it isn't being read as a csv, but as strings
I have not tried:
"cell by cell" comparison instead of the whole row (working on that while this percolates on SO)
Different encoding of the file (I can only get UTF-8 CSVs so would need some sort of utility?)
If these were ASCII text I would have been done ages ago, but this whole "UTF-8 that isn't but is" thing is driving me mad. Anyone got any ideas on this?
Each row yielded by csv.reader is a list of cell values like
['42', 'spam', 'eggs']
Thus the line
# THIS LINE DOES NOT WORK:
the_row = the_row.replace(find[i], replace[i])
cannot possibly work, because lists don't have a replace method.
What might work is to iterate over the row list and find/replace on each cell value (I'm assuming they are all strings):
the_row = [cell.replace(find[i], replace[i]) for cell in the_row]
However, if all you want to do is replace all instances of some characters in the file with some other characters then it's simpler to open the file as a text file and replace without invoking any csv machinery:
with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    for old, new in zip(find, replace):
        text = text.replace(old, new)
    t_file.write(text)
If the find/replace mapping is the same for all files and every entry in find is a single character, you can use str.translate to avoid the for loop (note that str.maketrans with a dict requires one-character keys).
# Make a reusable translation table
trans_table = str.maketrans(dict(zip(find, replace)))

with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8') as t_file:
    text = s_file.read()
    text = text.translate(trans_table)
    t_file.write(text)
For clarity: csvs are text files, just formatted so that their contents can be interpreted as rows and columns. If you want to manipulate their contents as pure text, it's fine to edit them as normal text files: as long as you don't change any of the characters used as delimiters or quote marks, they will still be usable as csvs when you want to use them as such.
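For completeness, here is a per-cell version of the csv approach suggested above; a sketch that reuses the source/output paths and the find/replace lists from the question:
import csv

with open(source, 'r', encoding='utf-8') as s_file, open(output, 'w', encoding='utf-8', newline='') as t_file:
    output_writer = csv.writer(t_file)
    for row in csv.reader(s_file):
        # apply every find/replace pair to every cell in the row
        for old, new in zip(find, replace):
            row = [cell.replace(old, new) for cell in row]
        output_writer.writerow(row)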

How do I get python to write a csv file from the output of my code?

I am incredibly new to python, so I might not have the right terminology...
I've extracted text from a pdf using pdfplumber. That's been saved as an object. The code I used for that is:
with pdfplumber.open('Bell_2014.pdf') as pdf:
    page = pdf.pages[0]
    bell = page.extract_text()
    print(bell)
So "bell" is all of the text from the first page of the imported PDF.
I need to write all of that text as a string to a csv. I tried using:
with open('Bell_2014_ex.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(bell)
and
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(bell)
All I keep finding when I search this is how to create a csv with specific characters or numbers, but nothing about writing the output of already-executed code. For instance, I can get this code:
bell_ex = 'bell_2014_ex.csv'
with open(bell_ex, 'w', newline='') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=',')
    file_writer.writerow(['bell'])
to create a csv that has "bell" in one cell of the csv, but that's as close as I can get.
I feel like this should be super easy, but I just can't seem to get it to work.
Any thoughts?
Please and thank you for helping my inexperienced self.
page.extract_text() is defined as: "Collates all of the page's character objects into a single string.", which would make bell just a very long string.
The CSV writerow() expects by default a list of strings, with each item in the list corresponding to a single column.
Your main issue is a type mismatch: you're trying to write a single string where a list of strings is expected. You will need to further operate on your bell object to convert it into a format acceptable to be written to a CSV.
Without having any knowledge of what bell contains or what you intend to write, I can't get any more specific, but the documentation on Python's CSV module is very comprehensive in terms of setting delimiters, dialects, column definitions, etc. Once you have converted bell into a proper iterable of lists of strings, you can then write it to a CSV.
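If it helps, here is a hedged sketch of two common ways to shape the text, assuming bell is the string returned by page.extract_text():
import csv

with open('Bell_2014_ex.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Option 1: the whole page in a single cell (note the wrapping list).
    writer.writerow([bell])
    # Option 2: one row per line of extracted text.
    writer.writerows([line] for line in bell.splitlines())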
Some similar code I wrote recently converts a tab-separated file to csv for insertion into an sqlite3 database; maybe this is helpful:
import csv
import os

out_file = os.path.join('input', 'listfile.csv')

# Convert tab-delimited listfile.txt to comma separated values (.csv) file
in_text = open('listfile.txt', 'r')
in_reader = csv.reader(in_text, delimiter='\t')
out_csv = open(out_file, 'w', newline='\n')
out_writer = csv.writer(out_csv, dialect=csv.excel)
for _line in in_reader:
    out_writer.writerow(_line)
out_csv.close()
... and that's it, not too tough
So my problem was that I was missing encoding='utf-8' for special characters, and my delimiter needed to be a space instead of a comma. What ended up working was:
from pdfminer.high_level import extract_text
import csv

object = extract_text('filepath.pdf')
print(object)

new_csv = 'filename.csv'
with open(new_csv, 'w', newline='', encoding='utf-8') as csvfile:
    file_writer = csv.writer(csvfile, delimiter=' ')
    file_writer.writerow(object)
However, since a lot of my pdfs weren't true pdfs but scans, the csv ended up having a lot of weird symbols. This worked for about half of the pdfs I have. If you have true pdfs, this will be great. If not, I'm currently trying to figure out how to extract all the text into a pandas dataframe separated by headers within the pdfs since pdfminer extracted all text perfectly.
Thank you for everyone that helped!

How to force csv writer to include \n's, instead of creating newlines?

While working on a Twitter scraping project recently, I noticed that the tweets I scrape sometimes have the newline character in them - \n - which means that there are line-breaks in some tweets.
This is a problem for the .csv files I am creating with the scraped tweets, because Python's csv.writer keeps interpreting them as new lines, and my .csv's thus become littered with line-breaks everywhere.
In the .csv files I make, line-breaks appear wherever \n's were detected. This is the code I am using to write each tweet in, one at a time:
with open(file_name, 'a') as f:
    writer = csv.writer(f)
    writer.writerow([
        status.created_at, status.author.screen_name,
        len(status.text), status.favorite_count, status.retweet_count,
        status.text
    ])
I don't want to simply do string.replace("\n", " ") each time, as that does not seem to be efficient to me, and I have tried opening the csv with options like newline='\n', but they do not seem to work for me.
How could I tell the csv.writer to not create new lines whenever it sees \n's?
csv.writer quotes strings with \n
>>> import csv
>>> import io
>>> filestream = io.StringIO()
>>> csvwriter = csv.writer(filestream)
>>> csvwriter.writerow(["a", "a\n", "b"])
10
>>> filestream.getvalue()
'a,"a\n",b\r\n'
Notice "a\n" being in double quotes.

How to correctly put extended ASCII characters into a CSV file?

I'm trying to write some data in an array that contains extended ASCII characters to a CSV file. Below is a small example of the code I'm using on the real file.
The array text_array represents an array containing only one row.
import csv

text_array = [["Á","Â","Æ","Ç","Ö","×","Ø","Ù","Þ","ß","á","â","ã","ä","å","æ"]]

with open("/Files/out.csv", "wb") as f:
    writer = csv.writer(f)
    writer.writerows(text_array)
The output I'm getting in the CSV file is wrong, showing these characters:
à Â Æ Ç Ö × Ø Ù Þ ß á â ã ä å æ
I found that the code below fixes the issue in Python 3.4 but I'm working on Python 2.7.
c = csv.writer(open("Out.csv", 'w', newline='', encoding='utf-8'))
How can I fix this?
UPDATE
I received some links in the comments, but it is difficult for me to understand what needs to be done to fix this issue. Could someone show an example, please?
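Since this is Python 2.7, here is a minimal sketch (an editor's assumption, not tested against the real files): the csv module in Python 2 handles byte strings, so encode each unicode cell to UTF-8 before writing; a BOM at the start of the file helps Excel detect the encoding:
# -*- coding: utf-8 -*-
# Python 2.7 sketch: encode each unicode cell to UTF-8 bytes, and write
# a UTF-8 BOM first so Excel opens the file with the right encoding.
import codecs
import csv

text_array = [[u"Á", u"Â", u"Æ", u"Ç", u"Ö", u"×", u"Ø"]]

with open("out.csv", "wb") as f:
    f.write(codecs.BOM_UTF8)
    writer = csv.writer(f)
    for row in text_array:
        writer.writerow([cell.encode("utf-8") for cell in row])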

Searching for a string in a file and saving the results

I have a few quite large text files with data in them. I need to find a string that repeats in the data, and the string will always have an id number after it. I will then need to save that number.
I've done some simple scripting with Python, but I am unsure where to start with this, or whether Python is even a good choice for this problem. Any help is appreciated.
I will post more information next time (my bad), but I managed to get something to work that should do it for me.
import re

with open("test.txt", "r") as opened:
    text = opened.read()
output = re.findall(r"\bdata........", text)
out_str = ",".join(output)
print(out_str)
#with open("output.txt", "w") as outp:
#    outp.write(out_str)
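If the id after the string is always digits, a capture group returns just the numbers instead of a fixed eight characters; a sketch assuming the literal prefix is "data":
import re

with open("test.txt", "r") as opened:
    text = opened.read()
# capture only the digits that immediately follow the prefix "data"
ids = re.findall(r"\bdata(\d+)", text)
print(ids)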