read an ascii file into a numpy array - python

I have an ASCII file that I want to read into a numpy array. With numpy.genfromtxt, the first number in the file comes back as NaN. I then tried reading the file into a list manually:
lines = file('myfile.asc').readlines()
X = []
for line in lines:
    s = str.split(line)
    X.append([float(s[i]) for i in range(len(s))])
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
ValueError: could not convert string to float: 15.514
When I print the first line of the file after splitting it, it looks like this:
>>> s
['\xef\xbb\xbf15.514', '15.433', '15.224', '14.998', '14.792', '15.564', '15.386', '15.293', '15.305', '15.132', '15.073', '15.005', '14.929', '14.823', '14.766', '14.768', '14.789']
How can I read such a file into a numpy array reliably, without assuming anything about the number of rows and columns?

Based on @falsetru's answer, here is a solution that uses NumPy's own file-reading capabilities:
import numpy as np
import codecs
with codecs.open('myfile.asc', encoding='utf-8-sig') as f:
    X = np.loadtxt(f)
This opens the file with the correct encoding and passes the open file object to NumPy. np.loadtxt accepts this kind of handle (handles from open() work as well) and then behaves the same as in any other case.

The file is encoded as UTF-8 with a BOM. Use codecs.open with the utf-8-sig encoding, which strips the BOM (\xef\xbb\xbf) while decoding:
import codecs
X = []
with codecs.open('myfile.asc', encoding='utf-8-sig') as f:
    for line in f:
        s = line.split()
        X.append([float(s[i]) for i in range(len(s))])
UPDATE: You don't need to use an index at all:
with codecs.open('myfile.asc', encoding='utf-8-sig') as f:
    X = [[float(x) for x in line.split()] for line in f]
BTW, instead of using the unbound method str.split(line), use line.split() unless you have a specific reason not to.
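On Python 3 the built-in open() accepts an encoding argument, so the same idea can be sketched without codecs. The following is a minimal, self-contained illustration, not part of the original answers: the helper name read_float_rows and the temporary demo file are assumptions made for the example. It also shows why decoding as plain utf-8 fails, since the BOM stays attached to the first token.

```python
import os
import tempfile


def read_float_rows(path, encoding="utf-8-sig"):
    """Parse a whitespace-separated text file into a list of float rows.

    Hypothetical helper: "utf-8-sig" drops a leading BOM if one is present.
    """
    with open(path, encoding=encoding) as f:
        # Skip blank lines so they do not produce empty rows.
        return [[float(x) for x in line.split()] for line in f if line.strip()]


# Build a small demo file that starts with a UTF-8 BOM, like the one in the question.
with tempfile.NamedTemporaryFile("wb", suffix=".asc", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbf15.514 15.433\n15.224 14.998\n")
    demo_path = tmp.name

rows = read_float_rows(demo_path)

# Decoding as plain "utf-8" keeps the BOM glued to the first number, so float() fails.
try:
    read_float_rows(demo_path, encoding="utf-8")
    bom_caused_error = False
except ValueError:
    bom_caused_error = True

os.remove(demo_path)
print(rows, bom_caused_error)
```

If a numpy array is needed, the resulting list of lists can be passed to np.array(rows).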

Related

Utf-8 decoding with Python

I have a csv with some data, and in one row there is a text that was added after encoding it in utf-8.
This is the text:
"b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'"
I'm trying to recover the original characters from this text with the decode function, but I can't get it to work.
Does anyone know the correct way to do it?
Assuming that the line in your file is exactly like this:
b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
And reading the line from the file gives the output:
>>> line
"b'\\xe7\\x94\\xb3\\xe8\\xbf\\xaa\\xe8\\xa5\\xbf\\xe8\\xb7\\xaf255\\xe5\\xbc\\x84660\\xe5\\x8f\\xb7\\xe5\\x92\\x8c665\\xe5\\x8f\\xb7 \\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe4\\xb8\\x8a\\xe6\\xb5\\xb7\\xe6\\xb5\\xa6\\xe4\\xb8\\x9c\\xe6\\x96\\xb0\\xe5\\x8c\\xba 201205'"
You can use the eval() function to turn that text back into a bytes object, and then decode it:
with open(r"your_csv.csv", "r") as csvfile:
    for line in csvfile:
        # when you reach the desired line
        b = eval(line).decode('utf-8')
Output:
>>> print(b)
申迪西路255弄660号和665号 中国上海浦东新区 201205
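Because eval() executes whatever expression the file contains, a more cautious variant of the same idea is ast.literal_eval, which only accepts Python literals (including bytes literals). The sketch below is an illustration under that assumption; the helper name decode_bytes_literal is hypothetical, and the sample line is a shortened piece of the text from the question.

```python
import ast


def decode_bytes_literal(text, encoding="utf-8"):
    """Turn the text "b'\\xe4...'" back into bytes and decode it.

    Hypothetical helper: literal_eval parses literals only, it never runs code.
    """
    value = ast.literal_eval(text.strip())
    if not isinstance(value, bytes):
        raise ValueError("expected a bytes literal such as b'...'")
    return value.decode(encoding)


# A raw string stands in for a line read from the file: the backslashes are literal.
line = r"b'\xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7 201205'" + "\n"
decoded = decode_bytes_literal(line)
print(decoded)
```

If the cell in the CSV is something other than a bytes literal, the helper raises an error instead of evaluating it.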
Try this:
a = b'\xe7\x94\xb3\xe8\xbf\xaa\xe8\xa5\xbf\xe8\xb7\xaf255\xe5\xbc\x84660\xe5\x8f\xb7\xe5\x92\x8c665\xe5\x8f\xb7 \xe4\xb8\xad\xe5\x9b\xbd\xe4\xb8\x8a\xe6\xb5\xb7\xe6\xb5\xa6\xe4\xb8\x9c\xe6\x96\xb0\xe5\x8c\xba 201205'
print(a.decode('utf-8')) #your decoded output
Since you say you are reading from a file, you can also pass the encoding when opening it:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print(repr(line))

Importing large set of numbers from a text file into python program in matrix form

I'm trying to import some point cloud coordinates into Python. The values are in a text file in this format:
0.0054216 0.11349 0.040749
-0.0017447 0.11425 0.041273
-0.010661 0.11338 0.040916
0.026422 0.11499 0.032623
and so on.
I've tried two methods:
def getValue(filename):
    try:
        file = open(filename, 'r')
    except IOError:
        print('problem with file', filename)
    value = []
    for line in file:
        value.append(float(line))
    return value
When I call this function in IDLE, I get an error saying the string cannot be converted to float.
import numpy as np
import matplotlib.pyplot as plt
data = np.genformtxt('co.txt', delimiter=',')
With this method, when I evaluate data in IDLE, it says data is not defined.
Here is the error message:
>>> data[0:4]
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    data[0:4]
NameError: name 'data' is not defined
Given the data you provided, I would use:
np.loadtxt(<yourtxt.txt>, delimiter=" ")
Your delimiter should be a blank space, as can be seen in your data. This works for me.
Your problem is that you are using a comma as the delimiter.
float() converts a string containing one number, not several (see its docs):
In [32]: line = '0.0054216 0.11349 0.040749 '
In [33]: float(line)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-33-0f93140abeab> in <module>()
----> 1 float(line)
ValueError: could not convert string to float: '0.0054216 0.11349 0.040749 '
Note that the error tells us which string is giving it problems.
It works if we split the line into substrings and convert those individually.
In [34]: [float(x) for x in line.split()]
Out[34]: [0.0054216, 0.11349, 0.040749]
Similarly, genfromtxt needs to split the lines into proper substrings. There are no commas in your file, so a comma is clearly the wrong delimiter. The default delimiter is whitespace, which works just fine in this case.
In [35]: data = np.genfromtxt([line])
In [36]: data
Out[36]: array([0.0054216, 0.11349 , 0.040749 ])
With the wrong delimiter it tries to convert the whole line to a float. It can't (same reason as above), so it uses np.nan instead.
In [37]: data = np.genfromtxt([line], delimiter=',')
In [38]: data
Out[38]: array(nan)
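Putting the two answers together, a whole file of whitespace-separated coordinates can be loaded without naming a delimiter at all. The sketch below uses io.StringIO as a stand-in for the file (an assumption made so the example is self-contained); with a real file, the path 'co.txt' from the question would be passed instead.

```python
import io

import numpy as np

# The four sample rows from the question, standing in for the contents of the file.
sample = """0.0054216 0.11349 0.040749
-0.0017447 0.11425 0.041273
-0.010661 0.11338 0.040916
0.026422 0.11499 0.032623
"""

# With no delimiter argument, both functions split on any run of whitespace.
points = np.loadtxt(io.StringIO(sample))
same_points = np.genfromtxt(io.StringIO(sample))

print(points.shape)
print(points[0:4])
```

Each row of the file becomes one row of the array, so no row or column count has to be known in advance.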

How to apply regex sub to a csv file in python

I have a csv file I wish to apply a regex replacement to with python.
So far I have the following
reader = csv.reader(open('ffrk_inventory_relics.csv', 'r'))
writer = csv.writer(open('outfile.csv','w'))
for row in reader:
    reader = re.sub(r'\+','z',reader)
Which is giving me the following error:
Script error: Traceback (most recent call last):
  File "ffrk_inventory_tracker_v1.6.py", line 22, in response
    getRelics(data['equipments'], 'ffrk_inventory_relics')
  File "ffrk_inventory_tracker_v1.6.py", line 72, in getRelics
    reader = re.sub(r'\+','z',reader)
  File "c:\users\baconcatbug\appdata\local\programs\python\python36\lib\re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or bytes-like object
After googling without much luck, I would like to ask the community how to open the csv file correctly so that I can use re.sub on it and then write the altered csv back to the same filename.
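csv.reader(open('ffrk_inventory_relics.csv', 'r')) returns a reader object that yields each row as a list of strings. re.sub expects a string, but your loop passes it the reader itself (and passing row would still pass a list, not a string). Apply re.sub to each cell instead, and open the output file in write mode:
import csv
import re

# Read every row first, so the same file can then be reopened for writing.
with open('ffrk_inventory_relics.csv', 'r', newline='') as infile:
    final_data = [[re.sub(r'\+', 'z', cell) for cell in row] for row in csv.reader(infile)]

# Overwrite the original file with the altered rows.
with open('ffrk_inventory_relics.csv', 'w', newline='') as outfile:
    csv.writer(outfile).writerows(final_data)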
If you don't need the csv module, you can use str.replace with a regular open:
with open('ffrk_inventory_relics.csv', 'r') as reader, open('outfile.csv','w') as writer:
    for row in reader:
        writer.write(row.replace('+','z'))
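To make the cause of the TypeError concrete, the sketch below reproduces it on in-memory data and then applies the substitution cell by cell. The sample rows and the use of io.StringIO in place of real files are assumptions for illustration only.

```python
import csv
import io
import re

# In-memory stand-in for the CSV file; the second data row contains a quoted comma.
source = io.StringIO('name,bonus\r\nSword,+5\r\n"Shield, large",+10\r\n')
reader = csv.reader(source)
first_row = next(reader)  # each row is a list of strings

# Passing a whole row (a list) to re.sub raises the TypeError from the question.
try:
    re.sub(r"\+", "z", first_row)
    raised = False
except TypeError:
    raised = True

# Substituting inside each cell works, because every cell is a string.
rows = [[re.sub(r"\+", "z", cell) for cell in row] for row in [first_row, *reader]]

out = io.StringIO()
csv.writer(out).writerows(rows)
print(out.getvalue())
```

Working cell by cell through the csv module keeps quoted fields intact, which a plain line-based replace does not guarantee for every pattern.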

python prettytable module raise Could not determine delimiter error for valid csv file

I'm trying to use the prettytable module to print data from a CSV file, but it fails with the following exception:
>>> import prettytable
>>> with file("/tmp/test.csv") as f:
...     prettytable.from_csv(f)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "build/bdist.linux-x86_64/egg/prettytable.py", line 1337, in from_csv
  File "/usr/lib/python2.7/csv.py", line 188, in sniff
    raise Error, "Could not determine delimiter"
_csv.Error: Could not determine delimiter
The CSV file:
input_gps,1424185824460,1424185902788,1424185939525,1424186019313,1424186058952,1424186133797,1424186168766,1424186170214,1424186246354,1424186298434,1424186376789,1424186413625,1424186491453,1424186606143,1424186719394,1424186756366,1424186835829,1424186948532,1424187107293,1424187215557,1424187250693,1424187323097,1424187358989,1424187465475,1424187475824,1424187476738,1424187548602,1424187549228,1424187550690,1424187582866,1424187584248,1424187639923,1424187641623,1424187774567,1424187776418,1424187810376,1424187820238,1424187820998,1424187916896,1424187917472,1424187919241,1424188048340,dummy-0,dummy-1,Total
-73.958315%2C 40.815569,0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),13.0 (42%)
-76.932984%2C 38.992186,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),1.0(100%),0.0(nan%),1.0(100%),0.0(nan%),1.0(100%),0.0(nan%),0.0(nan%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),1.0(100%),1.0(100%),0.0(nan%),0.0(0%),17.0 (55%)
null_input-0,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0 (0%)
null_input-1,0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(0%),0.0(nan%),0.0(nan%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),0.0(0%),0.0(0%),0.0(nan%),1.0(100%),1.0 (3%)
Total,0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),0.0(0%),1.0(3%),0.0(0%),0.0(0%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),1.0(3%),0.0(0%),1.0(3%),31.0(100%)
If anyone can tell me how to work around the problem, or suggest alternatives, it would be very helpful.
According to PyPI, prettytable is only alpha level. I could not find a way to give it configuration to pass on to the csv module. In that case, you should probably read the CSV file yourself, explicitly declaring the delimiter, and build the PrettyTable row by row:
import csv
from prettytable import PrettyTable

pt = None  # optional: names assigned inside a with block remain available after it
with open('/tmp/test.csv') as fd:
    rd = csv.reader(fd, delimiter=',')
    pt = PrettyTable(next(rd))  # the first row supplies the field names
    for row in rd:
        pt.add_row(row)
I ran into the same problem while working with some messier CSVs. I ended up implementing a fallback method that searches for the delimiter manually:
import csv
from collections import Counter
from string import punctuation, whitespace

def get_delimiter(contents: str) -> str:
    # contents = f.read()
    try:
        sniffer = csv.Sniffer()
        dialect = sniffer.sniff(contents)
        return dialect.delimiter
    except csv.Error:
        return fallback_delimiter_search(contents)

def fallback_delimiter_search(contents: str) -> str:
    # ignore spaces, which dominate when the file contains a lot of text
    content_chars = [c for c in contents if (c in punctuation or c in whitespace) and c != ' ']
    counts = Counter(content_chars)
    # the most frequent remaining punctuation or whitespace character wins
    tgt_delimiter = counts.most_common(1)[0][0]
    return tgt_delimiter
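A simpler variant of the same fallback idea, when the plausible delimiters are known in advance, is to restrict the sniffer to a set of candidates and fall back to a fixed default. This is only a sketch: the helper name detect_delimiter, the candidate set, and the default are assumptions for illustration, not something the answers above prescribe.

```python
import csv


def detect_delimiter(sample, candidates=",;\t|", default=","):
    """Return the sniffed delimiter, or `default` if sniffing fails.

    Hypothetical helper: `candidates` and `default` are illustrative choices.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return default


semicolon_sample = "a;b;c\n1;2;3\n4;5;6\n"

# ';' is among the candidates, so the sniffer can pick it.
sniffed = detect_delimiter(semicolon_sample)

# With ';' excluded from the candidates, sniffing fails and the default is returned.
fallback = detect_delimiter(semicolon_sample, candidates=",\t", default="|")

print(repr(sniffed), repr(fallback))
```

The detected character can then be passed as the delimiter argument of csv.reader.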

How to optimize the reading of a file removing all the line feed \n characters

I need to read several integers from a file, written one per line and separated by line feeds, and insert them into a list.
1
2
3
4
5
I can currently read the file with the following code, but I need to optimize it:
import sys
L = []
fd = open(sys.argv[1], 'r')
for line in fd:
    line = line.rstrip('\n')
    L.append(int(line))
Is there a faster way to read all the lines from a file while removing the line feed characters?
Thanks.
int() automatically ignores surrounding whitespace, so there's no need for str.rstrip:
>>> int('10\r\n')
10
>>> int('10\n')
10
>>> int('10 \n')
10
You can also use a list comprehension here; it is faster than calling list.append in a loop:
import sys
with open(sys.argv[1]) as fd:
    L = [int(line) for line in fd]
Why the with statement?:
It is good practice to use the with keyword when dealing with file
objects. This has the advantage that the file is properly closed after
its suite finishes, even if an exception is raised on the way.
You don't actually need to strip the line because int() already ignores leading and trailing whitespace:
L = []
with open('nums.txt') as myfile: # With statements are more pythonic!
    for line in myfile:
        L.append(int(line))
print(L)
Returns:
[1, 2, 3, 4, 5]
As a result, you can also use map(). In Python 3, wrap it in list() so that the file is read before the with block closes it:
with open('nums.txt') as myfile:
    L = list(map(int, myfile))
Hope this helps!
A list comprehension will be quicker because it appends through a dedicated bytecode instruction instead of looking up and calling list.append on every iteration. (You also do not need to strip the newlines.)
[int(line) for line in fd]
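One caveat with these one-liners is that int() tolerates surrounding whitespace but not an empty line, which raises ValueError. The sketch below shows one way to guard against that; the helper name read_ints is hypothetical, and io.StringIO stands in for the opened file.

```python
import io


def read_ints(lines):
    """Convert an iterable of text lines to ints, skipping blank lines.

    Hypothetical helper: int() itself ignores the trailing newline and spaces.
    """
    return [int(line) for line in lines if line.strip()]


# Stand-in for the file: mixed line endings, a trailing space, and one blank line.
sample = io.StringIO("1\n2\r\n3 \n\n4\n5")
numbers = read_ints(sample)

# map() is lazy in Python 3, so list() is needed to materialise the result.
via_map = list(map(int, io.StringIO("1\n2\n3\n")))

print(numbers, via_map)
```

With a real file, the opened file object would be passed in place of the StringIO object, inside the with block.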
You could also consider numpy, specifically numpy.genfromtxt, e.g.:
import numpy as np
data = np.genfromtxt("yourfile.dat",delimiter="\n")
This makes data a numpy array with as many rows and columns as there are in your file.
