I know this is a common beginner issue and there are a ton of questions like this here on Stack Exchange, and I've been searching through them, but I still can't figure this out. I have some data from a scrape that looks like this (about 1000 items in the list):
inputList = [[u'someplace', u'3901 West Millen Drive', u'Hobbs', u'NH',
              u'88240', u'37.751117', u'-103.187709999'],
             [u'\u0100lon someplace', u'3120 S Las Vegas Blvd', u'Las Duman',
              u'AL', u'89109', u'36.129066', u'-145.168791']]
I'm trying to write it to a csv file like this:
for i in inputList:
for ii in i:
ii.replace(" u'\u2019'", "") #just trying to get rid of offending character
ii.encode("utf-8")
def csvWrite(inList, outFile):
    import csv
    destination = open(outFile, 'w')
    writer = csv.writer(destination, delimiter=',')
    data = inList
    writer.writerows(data)
    destination.close()

csvWrite(inputList, output)
but I keep getting this error on writer.writerows(data):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in
position 5: ordinal not in range(128)
I've tried a bunch of different things to encode the data in the list, but I still always get the error. I'm open to just ignoring the characters that can't be encoded to ASCII. Can anyone point me in the right direction? I'm using Python 2.6.
This line seems strange: ii.replace(" u'\u2019'", ""). Did you mean ii.replace(u"\u2019", u"")? Note also that replace() and encode() return new strings rather than modifying ii in place, so those two lines in your loop had no effect.
If you just want to remove those bad characters, you could use this code instead:
for i in inputList:
    for ii in i:
        ii = "".join(filter(lambda x: ord(x) < 128, ii))
        print ii
Output:
someplace
3901 West Millen Drive
Hobbs
NH
88240
37.751117
-103.187709999
lon someplace
3120 S Las Vegas Blvd
Las Duman
AL
89109
36.129066
-145.168791
The final code will look like this:
inputList = [[u'someplace', u'3901 West Millen Drive', u'Hobbs', u'NH',
              u'88240', u'37.751117', u'-103.187709999'],
             [u'\u0100lon someplace', u'3120 S Las Vegas Blvd', u'Las Duman',
              u'AL', u'89109', u'36.129066', u'-145.168791']]

cleared_inputList = []
for i in inputList:
    c_i = []
    for ii in i:
        ii = "".join(filter(lambda x: ord(x) < 128, ii))
        c_i.append(ii)
    cleared_inputList.append(c_i)

def csvWrite(inList, outFile):
    import csv
    destination = open(outFile, 'w')
    writer = csv.writer(destination, delimiter=',')
    data = inList
    writer.writerows(data)
    destination.close()

csvWrite(cleared_inputList, output)
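If dropping the non-ASCII characters is acceptable, unicode.encode() with errors='ignore' does the same job in one step; a minimal sketch with the same effect as the filter above:

# Encode each field to ASCII, silently dropping anything that can't be encoded.
cleared_inputList = [[ii.encode('ascii', 'ignore') for ii in row] for row in inputList]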
I am trying to save an array to a text file and I am getting a Unicode error:
df_duplicate = df[df['is_duplicate'] == 1]
dfp_nonduplicate = df[df['is_duplicate'] == 0]
# Convert the 2D arrays of q1 and q2 pairs and flatten them: {{1,2},{3,4}} -> {1,2,3,4}
p = np.dstack([df_duplicate["question1"], df_duplicate["question2"]]).flatten()
n = np.dstack([dfp_nonduplicate["question1"], dfp_nonduplicate["question2"]]).flatten()
print ("Number of data points in class 1 (duplicate pairs) :",len(p))
print ("Number of data points in class 0 (non duplicate pairs) :",len(n))
#Saving the np array into a text file
np.savetxt('train_p.txt', p, delimiter=' ', fmt='%s')
np.savetxt('train_n.txt', n, delimiter=' ', fmt='%s')
I know I need to change it to UTF-8 format, but I can't understand how to do that with this particular code. I'm still a beginner in Python.
From the documentation, which I found by putting np.savetxt into a search engine:
numpy.savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ', encoding=None)
Save an array to a text file.
So, yes, it does have an encoding parameter. That is where you specify the file encoding. So:
np.savetxt('train_p.txt', p, delimiter=' ', fmt='%s', encoding='utf-8')
That said: the character in question is a very strange one to have in your text. It would help to see where your data comes from.
I am trying to make a new list from my existing CSV file (not using pandas).
Here is my code:
import csv

with open('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r') as case0:
    reader = csv.DictReader(case0)
    album = []
    for row in reader:
        album.append(row)

print("Number of albums is:", len(album))
The CSV file was downloaded from the Rolling Stone's Top 500 albums data set on data.world.
My logic is to create an empty list named album and collect all the records in that list. But the line for row in reader seems to have some issue.
the error message I got is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 1040: invalid continuation byte
Can anyone let me know what I did wrong?
You need to open the file with the correct codec; UTF-8 is not the right one here. The dataset doesn't specify it, but I have determined that the most likely codec is mac_roman:
with open('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r', encoding='mac_roman') as case0:
The original Kaggle dataset doesn't bother to document it, and the various kernels that use the set all just clobber the encoding. It's clearly an 8-bit Latin variant (the majority of the data is ASCII with a few individual 8-bit codepoints).
So I analysed the data, and found there are just two such codepoints in 9 rows:
>>> import re
>>> eightbit = re.compile(rb'[\x80-\xff]')
>>> with open('albumlist.csv', 'rb') as bindata:
...     nonascii = [l for l in bindata if eightbit.search(l)]
...
>>> len(nonascii)
9
>>> {c for l in nonascii for c in eightbit.findall(l)}
{b'\x89', b'\xca'}
The 0x89 byte appears in just one line:
>>> sum(l.count(b'\x89') for l in nonascii)
1
>>> sum(l.count(b'\xca') for l in nonascii)
22
>>> next(l for l in nonascii if b'\x89' in l)
b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n'
That's clearly Elton John's 1972 Honky Château album, so the 0x89 byte must represent the U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX codepoint.
The 0xCA bytes all appear to represent an alternative space character; they all appear right after commas in the genre and subgenre columns (with one album exception):
>>> import csv
>>> for row in csv.reader((l.decode('ascii', 'backslashreplace') for l in nonascii)):
...     for col in row:
...         if '\\' in col: print(col)
...
Reggae,\xcaPop,\xcaFolk, World, & Country,\xcaStage & Screen
Reggae,\xcaRoots Reggae,\xcaRocksteady,\xcaContemporary,\xcaSoundtrack
Electronic,\xcaStage & Screen
Soundtrack,\xcaDisco
Rock,\xcaBlues
Blues Rock,\xcaElectric Blues,\xcaHarmonica Blues
Garage Rock,\xcaPsychedelic Rock
Honky Ch\x89teau
Pop Rock,\xcaClassic Rock
Funk / Soul,\xcaFolk, World, & Country
Rock,\xcaPop
Stan Getz\xca/\xcaJoao Gilberto\xcafeaturing\xcaAntonio Carlos Jobim
Bossa Nova,\xcaLatin Jazz
Lo-Fi,\xcaIndie Rock
These 0xCA bytes are almost certainly representing the U+00A0 NO-BREAK SPACE codepoint.
With these two mappings, you can try to determine what 8-bit codecs would make the same mapping. Rather than manually try out all Python's codecs I used Tripleee's 8-bit codec mapping to see what codecs use these mappings. There are only two:
0x89
â (U+00E2): mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish
0xca
NO-BREAK SPACE (U+00A0): mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish
There are 6 encodings that are listed in both sets:
>>> set1 = set('mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set2 = set('mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set1 & set2
{'mac_turkish', 'mac_iceland', 'mac_romanian', 'mac_greek', 'mac_croatian', 'mac_roman'}
Of these, the Mac OS Roman codec (mac_roman) is probably the most likely to have been used, as Microsoft Excel for Mac used Mac Roman to create CSV files for a long time. However, it doesn't really matter; any of those 6 would work here.
You may want to replace those U+00A0 non-breaking spaces if you want to split out the genre and subgenre columns (really the genre and style columns if these were taken from Discogs).
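A hedged sketch of that cleanup (the 'Genre' column name is an assumption about the dataset's header row): in this data the list separator appears to be a comma followed by the no-break space, while commas inside a single genre name such as "Folk, World, & Country" are followed by a regular space, so splitting on ',\xa0' keeps those names intact.

import csv

with open('albumlist.csv', 'r', encoding='mac_roman', newline='') as case0:
    for row in csv.DictReader(case0):
        # Split on comma + U+00A0 and strip stray whitespace from each genre.
        genres = [g.strip('\xa0 ') for g in row['Genre'].split(',\xa0')]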
I am learning Python for data mining, and I have a text file that contains a list of world cities and their coordinates. With my code, I am trying to find the coordinates of a list of cities. Everything works well until there is a city name with non-standard characters. I expect the program to skip that name and move on to the next, but instead it terminates. How can I make the program skip names it cannot find and continue to the next?
lst = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria',
       'New York', 'Dublin']
source = 'world.txt'
fh = open(source)
n = 0
for line in fh:
    line.rstrip()
    if lst[n] not in line:
        continue
    else:
        co = line.split(',')
        print lst[n], 'Lat: ', co[5], 'Long: ', co[6]
        if n < (len(lst) - 1):
            n = n + 1
        else:
            break
The outcome of this run is:
>>>
Paris Lat: 33.180704 Long: 67.470836
London Lat: -11.758217 Long: 17.084013
Helsinki Lat: 60.175556 Long: 24.934167
Amsterdam Lat: 6.25 Long: -57.5166667
>>>
Your code has a number of issues. The following fixes most, if not all, of them, and it no longer stops early when a city isn't found.
# -*- coding: iso-8859-1 -*-
from __future__ import print_function

cities = ['Paris', 'London', 'Helsinki', 'Amsterdam', 'Sant Julià de Lòria',
          'New York', 'Dublin']
SOURCE = 'world.txt'

for city in cities:
    with open(SOURCE) as fh:
        for line in fh:
            if city in line:
                fields = line.split(',')
                print(fields[0], 'Lat: ', fields[5], 'Long: ', fields[6])
                break
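Reopening the file once per city is fine for a handful of names, but if world.txt is large, a single pass is much faster. A sketch under the same assumptions about the file layout:

# Scan the file once, reporting each listed city as it is found.
remaining = set(cities)
with open(SOURCE) as fh:
    for line in fh:
        if not remaining:
            break
        for city in list(remaining):
            if city in line:
                fields = line.split(',')
                print(fields[0], 'Lat: ', fields[5], 'Long: ', fields[6])
                remaining.discard(city)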
It may be an encoding problem. You have to know which encoding the file "world.txt" is in. If you do not know, try the most commonly used encodings.
Replace the line:
fh = open(source)
with the lines:
import codecs
fh = codecs.open(source, 'r', 'utf-8')
If it still does not work, replace 'utf-8' with 'cp1252', then with 'iso-8859-1'.
If none of these common encodings works, you have to find the encoding yourself. Try opening "world.txt" in Notepad++; this text editor is able to infer the encoding. (I am not sure whether Notepad++ can open a 3-million-line file, though.)
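Alternatively, the third-party chardet library can guess an encoding programmatically. A sketch, assuming chardet is installed (pip install chardet):

import chardet

# Sample the beginning of the file; a few hundred KB is usually enough.
with open('world.txt', 'rb') as f:
    raw = f.read(200000)
print chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}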
It is also good practice to know which encoding your own Python source file is in, and to declare it explicitly by adding a line like # -*- coding: utf-8 -*- at the beginning of the file. Of course, you have to specify the encoding your source file is actually saved in; once again, you can determine it by opening the file in Notepad++.
I've tried to convert a CSV file encoded in UTF-16 (exported by another program) to a simple array in Python 2.7, with very little luck. Here's the closest solution I've found:
import csv
from io import BytesIO

with open('c:\\pfm\\bdh.txt', 'rb') as f:
    x = f.read().decode('UTF-16').encode('UTF-8')

for line in csv.reader(BytesIO(x)):
    print line
This code returns:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
['1\tnom1\tetq1\text1 ...
What I'm trying to get it's something like this:
[['','Nombre','Etiqueta','Extensión de archivo','Tamaño lógico','Categoría']
['1','nom1','etq1','ext1','123','cat1']
['2','nom2','etq2','ext2','456','cat2']]
So, I'd need to convert those hexadecimal escapes into accented Latin characters (such as á, é, í, ó, ú, or ñ), and split those tab-separated strings into array fields. Do I really need to use dictionaries for the first part? I think there should be an easier solution, as I can see and type all these characters at the keyboard. For the second part, I think the csv library won't help in this case, as I read it can't manage UTF-16 yet.
Could you give me a hand? Thank you!
ITEM #1: The hexadecimal characters
You are getting the:
[' \tNombre\tEtiqueta\tExtensi\xc3\xb3n de archivo\tTama\xc3\xb1ol\xc3\xb3gico\tCategor\xc3\xada']
output because you are printing a list. Printing a list prints the representation of each item, which is the equivalent of:
print('[{0}]'.format(', '.join(repr(item) for item in lst)))
If you use print(line[0]) you will get the output of the line.
ITEM #2: The output
The problem here is that the csv parser is not parsing the content as a tab-separated file but as a comma-separated one. You can fix this by using:
for line in csv.reader(BytesIO(x), delimiter='\t'):
    print(line)
instead.
This will give you the desired result.
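Putting the two items together, a sketch based on your original code:

import csv
from io import BytesIO

# Re-encode the UTF-16 file as UTF-8 bytes, then parse it as tab-separated.
with open('c:\\pfm\\bdh.txt', 'rb') as f:
    x = f.read().decode('UTF-16').encode('UTF-8')

for line in csv.reader(BytesIO(x), delimiter='\t'):
    print(line)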
Processing a UTF-16 file with the csv module in Python 2 can indeed be a pain. Re-encoding to UTF-8 works, but you then still need to decode the resulting columns to produce unicode values.
Note also that your data appears to be tab delimited; the csv.reader() by default uses commas, not tabs, to separate columns. You'll need to configure it to use tabs instead by setting delimiter='\t' when constructing the reader.
Use io.open() to read UTF-16 and produce unicode lines. You can then use codecs.iterencode() to translate the decoded unicode values from the UTF-16 file to UTF-8.
To decode the rows back to unicode values, you could use an extra generator to do so as you iterate:
import csv
import codecs
import io

def row_decode(reader, encoding='utf8'):
    for row in reader:
        yield [col.decode(encoding) for col in row]

with io.open('c:\\pfm\\bdh.txt', encoding='utf16') as f:
    wrapped = codecs.iterencode(f, 'utf8')
    reader = csv.reader(wrapped, delimiter='\t')
    for row in row_decode(reader):
        print row
Each line will still use repr() on each contained value, which means that you'll see Python string literal syntax to represent strings. Any non-printable or non-ASCII codepoint will be represented by an escape code:
>>> [u'', u'Nombre', u'Etiqueta', u'Extensión de archivo', u'Tamaño lógico', u'Categoría']
[u'', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
This is normal; the output is meant to be useful as a debugging aid and can be pasted back into any Python session to reproduce the original value, without worrying about terminal encodings.
For example, ó is represented as \xf3, representing the Unicode codepoint U+00F3 LATIN SMALL LETTER O WITH ACUTE. If you were to print this one column, Python will encode the Unicode string to bytes matching your terminal encoding, resulting in your terminal showing you the correct string again:
>>> u'Extensi\xf3n de archivo'
u'Extensi\xf3n de archivo'
>>> print u'Extensi\xf3n de archivo'
Extensión de archivo
Demo:
>>> import csv, codecs, io
>>> io.open('/tmp/demo.csv', 'w', encoding='utf16').write(u'''\
... \tNombre\tEtiqueta\tExtensi\xf3n de archivo\tTama\xf1o l\xf3gico\tCategor\xeda
... ''')
63L
>>> def row_decode(reader, encoding='utf8'):
...     for row in reader:
...         yield [col.decode(encoding) for col in row]
...
>>> with io.open('/tmp/demo.csv', encoding='utf16') as f:
...     wrapped = codecs.iterencode(f, 'utf8')
...     reader = csv.reader(wrapped, delimiter='\t')
...     for row in row_decode(reader):
...         print row
...
[u' ', u'Nombre', u'Etiqueta', u'Extensi\xf3n de archivo', u'Tama\xf1o l\xf3gico', u'Categor\xeda']
>>> # the row is displayed using repr() for each column; the values are correct:
...
>>> print row[3], row[4], row[5]
Extensión de archivo Tamaño lógico Categoría
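As an aside, if moving to Python 3 is ever an option, this whole re-encoding dance disappears, because the csv module there works on text directly. A sketch:

# Python 3: open the file as UTF-16 text and parse it directly.
import csv

with open(r'c:\pfm\bdh.txt', encoding='utf-16', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        print(row)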
I have a text file which contains entries like:
70154::308933::3
UserId::ProductId::Score
I wrote this program to read it (sorry, the indentation is a bit messed up here):
def generateSyntheticData(fileName):
    dataDict = {}
    # rowDict = []
    innerDict = {}
    try:
        # for key in range(5):
        #     count = 0
        myFile = open(fileName)
        c = 0
        # del innerDict[0:len(innerDict)]
        for line in myFile:
            c += 1
            # line = str(line)
            n = len(line)
            # print 'n: ', n
            if n is not 1:
                # if c % 100 == 0: print "%d: " % c, " entries read so far"
                # words = line.replace(' ', '_')
                words = line.replace('::', ' ')
                words = words.strip().split()
                # print 'userid: ', words[0]
                userId = int(words[0])  # I get the error here
                movieId = int(words[1])
                rating = float(words[2])
                print "userId: ", userId, " productId: ", movieId, " :rating: ", rating
                # print words
                # words = words.replace('_', ' ')
                innerDict = dataDict.setdefault(userId, {})
                innerDict[movieId] = rating
                dataDict[userId] = innerDict
                innerDict = {}
    except IOError as (errno, strerror):
        print "I/O error({0}): {1}".format(errno, strerror)
    finally:
        myFile.close()
    print "total ratings read from file ", fileName, " :%d " % c
    return dataDict
But I get the error:
ValueError: invalid literal for int() with base 10: ''
The funny thing is, it works just fine reading the same format of data from another file.
Actually, while posting this question, I noticed something weird: in the entry 70154::308933::3, each number has a space in between, like 7 space 0 space 1 space 5 space 4 space :: space 3... But the text file looks fine; it only shows this when I copy and paste it. Anyway, any clue what's going on?
Thanks
The "spaces" thay you are seeing appear to be NULs ("\x00"). There is a 99.9% chance that your file is encoded in UTF-16, UTF-16LE, or UTF-16BE. If this is a one-off file, just open it with Notepad and save as "ANSI", not "Unicode" and not "Unicode bigendian". If however you need to process it as is, you'll need to know/detect what the encoding is. To find out which, do this:
print repr(open("yourfile.txt", "rb").read(20))
and compare the srtart of the output with the following:
>>> ucode = u"70154:"
>>> for sfx in ["", "LE", "BE"]:
...     enc = "UTF-16" + sfx
...     print enc, repr(ucode.encode(enc))
...
UTF-16 '\xff\xfe7\x000\x001\x005\x004\x00:\x00'
UTF-16LE '7\x000\x001\x005\x004\x00:\x00'
UTF-16BE '\x007\x000\x001\x005\x004\x00:'
>>>
You can make a detector that's good enough for your purposes by inspecting the first 2 bytes:
[pseudocode]
if f2b in "\xff\xfe\xff": UTF-16
elif f2b[1] == "\x00": UTF-16LE
elif f2b[0] == "\x00": UTF-16BE
else: cp1252 or UTF-8 or whatever else is prevalent in your neck of the woods
You could avoid hard-coding the fallback encoding:
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
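Here is the detector written out as a real function; a minimal sketch where the name detect_encoding matches its use below, and the cp1252 fallback is an assumption you can swap for locale.getpreferredencoding():

def detect_encoding(f2b, fallback="cp1252"):
    # A UTF-16 BOM is "\xff\xfe" (LE) or "\xfe\xff" (BE); both are
    # substrings of "\xff\xfe\xff", hence the membership test.
    if f2b in "\xff\xfe\xff":
        return "UTF-16"
    elif f2b[1:2] == "\x00":   # ASCII byte then NUL: little-endian
        return "UTF-16LE"
    elif f2b[:1] == "\x00":    # NUL then ASCII byte: big-endian
        return "UTF-16BE"
    return fallback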
Your line-reading code will look like this:
rawbytes = open(myFile, "rb").read()
enc = detect_encoding(rawbytes[:2])
for line in rawbytes.decode(enc).splitlines():
    # whatever
Oh, and the lines will be unicode objects ... if that gives you a problem, ask another question.
Debugging 101: simply change the line:
words = words.strip().split()
to:
words = words.strip().split()
print words
and see what comes out.
I will mention a couple of things. If the literal header line UserId::ProductId::Score is in the file and you try to process it, int() won't take kindly to converting "UserId" to an integer.
And the ... unusual line:
if n is not 1:
I would probably write as:
if n != 1:
If, as you indicate in your comment, you end up seeing:
['\x007\x000\x001\x005\x004\x00', '\x003\x000\x008\x009\x003\x003\x00', '3']
then I'd be checking your input file for binary (non-textual) data. You should never end up with that binary information if you're just reading text and trimming/splitting.
And because you state that the digits seem to have spaces between them, you should do a hex dump of the file to find out what's really in there. It may be a UTF-16 Unicode string, for example.
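For instance, a quick and dirty dump of the first bytes (a sketch; adjust the filename to yours):

# Show the first 32 bytes of the file as hex, two digits per byte.
with open('yourfile.txt', 'rb') as f:
    head = f.read(32)
print ' '.join('%02x' % ord(b) for b in head)
# UTF-16LE text shows a telltale pattern like: 37 00 30 00 31 00 ...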