remove <feff> from a file - python

I am using this Python script to convert CSV to XML. After conversion I see <feff> tags in the text (viewed in vim), which cause an XML parsing error.
I have already tried the answers from here, without success.
The converted XML file.
Thanks for any help!

Your input file has BOM (byte-order mark) characters, and Python doesn't strip them automatically when the file is encoded in UTF-8. See: Reading Unicode file data with BOM chars in Python
>>> s = '\xef\xbb\xbfABC'
>>> s.decode('utf8')
u'\ufeffABC'
>>> s.decode('utf-8-sig')
u'ABC'
So for your specific case, try something like
import csv
from io import StringIO
s = StringIO(open(csvFile).read().decode('utf-8-sig'))  # Python 2: decode the bytes and strip the BOM
csvData = csv.reader(s)
Terrible style, but that script is hacked together for a one-shot job anyway.
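If you're on Python 3, a minimal equivalent sketch (same csvFile as above) is to let open() do the decoding instead:
import csv
# utf-8-sig strips a leading BOM if present; newline='' is what the csv docs
# recommend for files handed to csv.reader/csv.writer.
with open(csvFile, encoding='utf-8-sig', newline='') as f:
    csvData = list(csv.reader(f))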

Change utf-8 to utf-8-sig:
import csv
with open('example.txt', 'r', encoding='utf-8-sig') as file:
    reader = csv.reader(file)

Here's an example of a script that uses a real XML-aware library to run a similar conversion. It doesn't have the exact same output, but, well, it's an example -- salt to taste.
import csv
import lxml.etree
csvFile = 'myData.csv'
xmlFile = 'myData.xml'
reader = csv.reader(open(csvFile, 'r'))
with lxml.etree.xmlfile(xmlFile) as xf:
    xf.write_declaration(standalone=True)
    with xf.element('root'):
        for row in reader:
            row_el = lxml.etree.Element('row')
            for col in row:
                col_el = lxml.etree.SubElement(row_el, 'col')
                col_el.text = col
            xf.write(row_el)
To refer to the content of, say, row 2 column 3, you'd then use XPath like /root/row[2]/col[3]/text().
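As a quick sketch of reading it back (using the xmlFile written above), you could evaluate that XPath with lxml:
import lxml.etree
# Parse the generated file and pull out row 2, column 3 (XPath indices are 1-based).
tree = lxml.etree.parse(xmlFile)
print(tree.xpath('/root/row[2]/col[3]/text()'))  # e.g. ['cell value'] if that cell is non-empty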

Related

How to write exact wordings of unicode characters into a file?

I want to write "සිවු අවුරුදු පාටමාලාව" with the exact wording into a JSON file using Python 3.6, but instead \u0dc3\u0dd2\u0dc3\u0dd4\u0db1\u0dca\u0da7 \u0dc3\u0dd2\u0dc0\u0dd4 is written into the JSON file.
I read an Excel file using xlrd and write the output using open().
import xlrd
import json
wb = xlrd.open_workbook('data.xlsx', encoding_override='utf-8')
sheet = wb.sheet_by_index(0)
with open('data.json', 'w') as outfile:
    data = json.dump(outerdata, outfile, ensure_ascii=True)
If you do this in Python with the escape string you report:
>>> print ("\u0dc3\u0dd2\u0dc3\u0dd4\u0db1\u0dca\u0da7 \u0dc3\u0dd2\u0dc0\u0dd4")
සිසුන්ට සිවු
you will see that the escapes do render as the characters you want. These are two different representations of the same data, and both are valid in JSON. But you are using json.dump() and you have specified ensure_ascii=True. That tells json.dump() that you want the representation with escapes. That is what ASCII means here: only the printable characters between chr(32) and chr(126). Change that to ensure_ascii=False.
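A quick way to see the difference (a minimal demo with json.dumps):
import json
s = "සිවු"
print(json.dumps(s, ensure_ascii=True))   # "\u0dc3\u0dd2\u0dc0\u0dd4"
print(json.dumps(s, ensure_ascii=False))  # "සිවු"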
But because you are now no longer writing pure ascii to your output file data.json, you need to specify an encoding when you open it:
with open("data.json", "w", encoding="utf-8") as outfile:
data = json.dump(outerdata,outfile,ensure_ascii=False)
This will make your JSON file look the way you want it to look.

Python can't parse my list of ints [duplicate]

I needed to parse files generated by another tool, which unconditionally outputs a JSON file with a UTF-8 BOM header (EF BB BF). I soon found that this was the problem, as the Python 2.7 json module can't seem to parse it:
>>> import json
>>> data = json.load(open('sample.json'))
ValueError: No JSON object could be decoded
Removing the BOM solves it, but I wonder if there is another way of parsing a JSON file with a BOM header?
You can open with codecs:
import json
import codecs
json.load(codecs.open('sample.json', 'r', 'utf-8-sig'))
or decode with utf-8-sig yourself and pass to loads:
json.loads(open('sample.json').read().decode('utf-8-sig'))
Simple! You don't even need to import codecs.
with open('sample.json', encoding='utf-8-sig') as f:
    data = json.load(f)
Since json.load(stream) uses json.loads(stream.read()) under the hood, it won't be that bad to write a small helper function that lstrips the BOM:
from codecs import BOM_UTF8

def lstrip_bom(str_, bom=BOM_UTF8):
    if str_.startswith(bom):
        return str_[len(bom):]
    else:
        return str_

json.loads(lstrip_bom(open('sample.json').read()))
In other situations where you need to wrap a stream and fix it somehow you may look at inheriting from codecs.StreamReader.
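As a simpler alternative to subclassing codecs.StreamReader, you can also wrap an already-open binary stream with io.TextIOWrapper and let utf-8-sig strip the BOM; a minimal sketch:
import io
import json
# Useful when the bytes come from somewhere you can't reopen (socket, zipfile, subprocess).
with open('sample.json', 'rb') as raw:
    data = json.load(io.TextIOWrapper(raw, encoding='utf-8-sig'))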
You can also do it with the with keyword:
import codecs
with codecs.open('samples.json', 'r', 'utf-8-sig') as json_file:
    data = json.load(json_file)
or better:
import io
with io.open('samples.json', 'r', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
If this is a one-off, a very simple super high-tech solution that worked for me...
Open the JSON file in your favorite text editor.
Select-all
Create a new file
Paste
Save.
BOOM, BOM header gone!
I removed the BOM manually with a Linux command.
First I check whether the file starts with the ef bb bf bytes, with head i_have_BOM | xxd.
Then I run dd bs=1 skip=3 if=i_have_BOM.json of=I_dont_have_BOM.json.
bs=1 processes 1 byte at a time; skip=3 skips the first 3 bytes.
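The same idea in Python, as a small sketch using the file names from the dd example:
with open('i_have_BOM.json', 'rb') as src:
    data = src.read()
if data.startswith(b'\xef\xbb\xbf'):
    data = data[3:]  # drop the 3 BOM bytes, like skip=3
with open('I_dont_have_BOM.json', 'wb') as dst:
    dst.write(data)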
I'm using utf-8-sig, with just import json:
with open('estados.json', encoding='utf-8-sig') as json_file:
    data = json.load(json_file)
    print(data)

Reading .csv files with different footer row length in Python

I am a complete noob at Python so I apologize if the solution is obvious.
I am trying to read some .csv field data on python for processing. Currently I have:
data = pd.read_csv('somedata.csv', sep=' |,', engine='python', usecols=(range(0,10)), skiprows=155, skipfooter=3)
However, depending on whether the data collection was interrupted, the last few lines of the file may be something like:
#data_end
Run Complete
Or
Run Interrupted
ERROR
A bunch of error codes
Hence I can't just use skipfooter=3. Is there a way for Python to detect the length of the footer and skip it? Thank you.
You can first read the content of your file as a plain text file into a Python list, remove the lines that don't contain the expected number of separators, and then transform the list into an IO stream. This IO stream is then passed on to pd.read_csv as if it were a file object.
The code might look like this:
from io import StringIO
import pandas as pd
# adjust these variables to meet your requirements:
number_of_columns = 11
separator = " |, "
# read the content of the file as plain text:
with open("somedata.csv", "r") as infile:
raw = infile.readlines()
# drop the rows that don't contain the expected number of separators:
raw = [x for x in raw if x.count(separator) == number_of_columns]
# turn the list into an IO stream (after joining the rows into a big string):
stream = StringIO("".join(raw))
# pass the string as an argument to pd.read_csv():
df = pd.read_csv(stream, sep=separator, engine='python',
usecols=(range(0,10)), skiprows=155)
If you use Python 2.7, you have to replace the first line from io import StringIO with the following two lines:
from __future__ import unicode_literals
from cStringIO import StringIO
This is so because StringIO requires a unicode string (which is not the default in Python 2.7), and because the StringIO class lives in a different module in Python 2.7.
I think you simply have to resort to counting the commas on each line and manually finding the last correct one. I'm not aware of a read_csv parameter to automate that.
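A rough sketch of that idea, assuming the data rows are comma-separated (adjust the comma count and the check for your real separator):
import pandas as pd
expected_commas = 9  # 10 data columns -> 9 commas per good row (an assumption)
with open('somedata.csv') as f:
    lines = f.readlines()
# walk backwards until we hit the last line that still looks like a data row
footer_length = 0
while footer_length < len(lines) and lines[-1 - footer_length].count(',') != expected_commas:
    footer_length += 1
df = pd.read_csv('somedata.csv', sep=' |,', engine='python',
                 usecols=range(0, 10), skiprows=155, skipfooter=footer_length)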

Mixed encoding in csv file

I have a fairly large database (10,000+ records with about 120 vars each) in R. The problem is that about half of the variables in the original .csv file were correctly encoded in UTF-8 while the rest were encoded in ANSI (Windows-1252) but are being decoded as UTF-8, resulting in weird characters for non-ASCII (mainly Latin) characters like "Ã©" or "Ã³".
I cannot simply change the file encoding because half of it would be decoded with the wrong type. Furthermore, I have no way of knowing which columns were encoded correctly and which ones weren't, and all I have is the original .csv file which I'm trying to fix.
So far I have found that a plain text file can be encoded in UTF-8 and misinterpreted characters (bad Unicode) can be inferred. One library that provides such functionality is ftfy for Python. However, I'm using the following code and so far, haven't had success:
import ftfy
file = open("file.csv", "r", encoding = "UTF8")
content = file.read()
content = ftfy.fix_text(content)
However, content will show exactly the same text as before. I believe this has to do with the way ftfy is inferring the encoding of the content.
Nevertheless, if I run ftfy.fix_text("PÃºblica que cotiza en MÃ©xico") it will show the right response:
>> 'Pública que cotiza en México'
I'm thinking that maybe the way to solve the problem is to iterate through each of the values (cells) in the .csv file, try to fix them with ftfy, and then import the file back into R, but it seems a little complicated.
Any suggestions?
In fact, there was mixed encoding in random cells in several places. Probably there was an issue when the data was exported from its original source.
The problem with ftfy is that it processes the file line by line, and if it encounters well-formatted characters, it assumes that the whole line is encoded in the same way and that the strange characters were intended.
Since these errors appeared randomly throughout the file, I wasn't able to transpose the whole table and process every line (column), so the answer was to process cell by cell. Fortunately, Python has a standard library that provides functionality to work painlessly with CSV (especially because it escapes cells correctly).
This is the code I used to process the file:
import csv
import ftfy
import sys
def main(argv):
    # input file
    csvfile = open(argv[1], "r", encoding="UTF8")
    reader = csv.DictReader(csvfile)
    # output stream
    outfile = open(argv[2], "w", encoding="Windows-1252")  # Windows doesn't like utf8
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames, lineterminator="\n")
    # clean values
    writer.writeheader()
    for row in reader:
        for col in row:
            row[col] = ftfy.fix_text(row[col])
        writer.writerow(row)
    # close files
    csvfile.close()
    outfile.close()

if __name__ == "__main__":
    main(sys.argv)
And then, calling:
$ python fix_encoding.py data.csv out.csv
will output a csv file with the right encoding.
A small suggestion: divide and conquer.
Try using one tool (ftfy?) to align the whole file to the same encoding (and save it as a plain-text file), and only then try parsing it as CSV.
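For instance, a rough sketch of that two-step suggestion (file names are just placeholders):
import ftfy
# Step 1: normalize the whole file to consistent UTF-8 text.
with open('file.csv', 'r', encoding='utf-8') as infile:
    fixed = ftfy.fix_text(infile.read())
with open('file_fixed.csv', 'w', encoding='utf-8') as outfile:
    outfile.write(fixed)
# Step 2: only then parse file_fixed.csv as CSV (csv module, pandas, or R).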

Raw string for variables in python?

I have seen several similar posts on this but nothing has solved my problem.
I am reading a list of numbers with backslashes and writing them to a .csv. Obviously the backslashes are causing problems.
addr = "6253\342\200\2236387"
with open("output.csv", 'a') as w:
write = writer(w)
write.writerow([addr])
I found that using r"6253\342\200\2236387" gave me exactly what I want for the output, but since I am reading my input from a file I can't use a raw string. I tried .encode('string-escape'), but that gave me 6253\xe2\x80\x936387 as output, which is definitely not what I want. unicode-escape gave me an error. Any thoughts?
The r in front of a string is only for defining a string literal in source code. If you're reading data from a file, it's already 'raw'. You shouldn't have to do anything special when reading in your data.
Note that if your data is not plain ascii, you may need to decode it or read it in binary. For example, if the data is utf-8, you can open the file like this before reading:
import codecs
f = codecs.open("test", "r", "utf-8")
Text file contains...
1234\4567\7890
41\5432\345\6789
Code:
import csv
with open('c:/tmp/numbers.csv', 'ab') as w:
    f = open(textfilepath)
    wr = csv.writer(w)
    for line in f:
        line = line.strip()
        wr.writerow([line])
    f.close()
This produced a csv with whole lines in a column. Maybe use 'ab' rather than 'a' as your file open type. I was getting extra blank records in my csv when using just 'a'.
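Note that 'ab' only behaves that way on Python 2; on Python 3 the csv module wants text mode with newline='', which is also what avoids the extra blank records. A minimal sketch:
import csv
with open('c:/tmp/numbers.csv', 'a', newline='') as w:
    wr = csv.writer(w)
    wr.writerow([r'1234\4567\7890'])  # raw string only needed for this inline literal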
I created this a while back. This helps you write to a csv file.
import csv
def write2csv(fileName, theData):
    theFile = open(fileName + '.csv', 'a')
    wr = csv.writer(theFile, delimiter=',', quoting=csv.QUOTE_MINIMAL)
    wr.writerow(theData)
    theFile.close()
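For example, calling write2csv('output', ['6253', '6387', 'more text']) appends that row to output.csv (the file name and values here are just hypothetical).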
