I have a numpy structured array with a mixed dtype (i.e., floats, ints, and strings). I want to select some of the columns of the array (all of which contain only floats) and then get the sum, by column, of the rows, as a standard numpy array. The initial array takes a form comparable to:
some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
For this example, I'd like to take the sum of columns A and B, yielding np.array([7.5, 11.15]). With numpy ≤1.13, I could do that as follows:
get_cols = ['A', 'B']
desired_sum = np.sum(some_data[get_cols].view(('<f8', len(get_cols))), axis=0)
With the release of numpy 1.14, this method now fails with ValueError: Changing the dtype to a subarray type is only supported if the total itemsize is unchanged, which is a result of the changes made in numpy 1.14 to the handling of structured arrays. (User bbengfort commented about the FutureWarning given about this change in this answer.)
In light of these changes to structured arrays, how can I obtain the desired sum from the structured array subset?
In [165]: some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)], dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
...:
In [166]: get_cols = ['A','B']
In [167]: some_data[get_cols]
Out[167]:
array([( 3.5, 2.15), ( 2.8, 5.3 ), ( 1.2, 3.7 )],
dtype=[('A', '<f8'), ('B', '<f8')])
Simply reading the field values is fine. In 1.13 we get a warning
In [168]: some_data[get_cols].view(('<f8', len(get_cols)))
/usr/local/bin/ipython3:1: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array.
This code may break in numpy 1.13 because this will return a view instead of a copy -- see release notes for details.
#!/usr/bin/python3
Out[168]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
With the recommended copy, no warning:
In [169]: some_data[get_cols].copy().view(('<f8', len(get_cols)))
Out[169]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
In [171]: np.sum(_, axis=0)
Out[171]: array([ 7.5 , 11.15])
In your original array,
dtype([('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
An A,B slice would have the two f8 items interspersed with the U20 items. Changing the view dtype of such a mix is problematic. That's why working with a copy is more reliable.
Since U20 takes up 4*20 bytes, the total itemsize is 96, a multiple of 8. We can convert the whole thing to f8, reshape and 'throw-away' the U20 columns:
In [183]: some_data.view('f8').reshape(3,-1)[:,-2:]
Out[183]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
It's not very pretty and I don't recommend it, but it may give some insight into how structured data is arranged.
view on a structured array is useful at times, but often a bit tricky to use correctly.
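If you are on numpy 1.16 or later, numpy.lib.recfunctions.structured_to_unstructured does this multi-field-to-plain-array conversion for you, without any view tricks; a minimal sketch using the names from the question:

import numpy as np
from numpy.lib import recfunctions as rfn

some_data = np.array([('foo', 3.5, 2.15), ('bar', 2.8, 5.3), ('baz', 1.2, 3.7)],
                     dtype=[('col1', '<U20'), ('A', '<f8'), ('B', '<f8')])
get_cols = ['A', 'B']

# Pull the selected fields out into a plain (3, 2) float array, then sum by column.
desired_sum = rfn.structured_to_unstructured(some_data[get_cols]).sum(axis=0)
# desired_sum -> array([ 7.5 , 11.15])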
If the 2 numeric fields are usually used together, I'd recommend a compound dtype like:
In [184]: some_data = np.array([('foo', [3.5, 2.15]), ('bar', [2.8, 5.3]), ('baz', [1.2, 3.7])],
     ...:                      dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [185]: some_data
Out[185]:
array([('foo', [ 3.5 , 2.15]), ('bar', [ 2.8 , 5.3 ]),
('baz', [ 1.2 , 3.7 ])],
dtype=[('col1', '<U20'), ('AB', '<f8', (2,))])
In [186]: some_data['AB']
Out[186]:
array([[ 3.5 , 2.15],
[ 2.8 , 5.3 ],
[ 1.2 , 3.7 ]])
genfromtxt accepts this style of dtype.
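With that layout, the column sum asked about in the original question is just an ordinary axis-0 sum over the 'AB' field, no view or copy needed:

some_data['AB'].sum(axis=0)      # -> array([ 7.5, 11.15])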
I'm trying to extract a list of keys and a list of all values for each entry from json. Consider the following json:
[{'height': 1.2 , 'width':2.5, 'weight':5 },{'height': 1.7 , 'width':4.5, 'weight':2 },{'height': 3.2 , 'width':4.5, 'weight':7 }]
And I want the output to be:
['height','width','weight']
[[1.2,2.5,5],[1.7,4.5,2],[3.2,4.5,7]]
Basically I need to load it into a pandas DataFrame. I believe it will be easier to load the rows and columns directly. I can't load the JSON directly into a DataFrame, and I can't find an efficient way to extract the rows and columns as described above.
As the comments have mentioned, you can just pass dictionaries straight to the DataFrame constructor. However, if you really want the format you described for some reason, you can use:
>>> x = [{'height': 1.2 , 'width':2.5, 'weight':5 },{'height': 1.7 , 'width':4.5, 'weight':2 },{'height': 3.2 , 'width':4.5, 'weight':7 }]
>>> list(map(lambda l:list(l.keys()), x))
[['height', 'width', 'weight'], ['height', 'width', 'weight'], ['height', 'width', 'weight']]
>>> list(map(lambda l:list(l.values()), x))
[[1.2, 2.5, 5], [1.7, 4.5, 2], [3.2, 4.5, 7]]
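To spell out the comment about the constructor: the list of dicts can be handed to pandas directly, which gives both the column names and the row values without building them by hand; a minimal sketch:

import pandas as pd

x = [{'height': 1.2, 'width': 2.5, 'weight': 5},
     {'height': 1.7, 'width': 4.5, 'weight': 2},
     {'height': 3.2, 'width': 4.5, 'weight': 7}]

df = pd.DataFrame(x)          # one row per dict, columns taken from the keys
cols = df.columns.tolist()    # column names (order can vary with pandas version)
rows = df.values.tolist()     # row values, in the same order as cols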
Assuming you have the json loaded in a variable, like text, you can simply do:
In [13]: import json, pandas
In [14]: text = json.dumps([{'height': 1.2 , 'width':2.5, 'weight':5 },{'height': 1.7 , 'width':4.5, 'weight':2 },{'height': 3.2 , 'width':4.5, 'weight':7 }])
...:
In [15]: df = pandas.read_json(text)
In [16]: df
Out[16]:
height weight width
0 1.2 5 2.5
1 1.7 2 4.5
2 3.2 7 4.5
In [17]: df.columns.values
Out[17]: array([u'height', u'weight', u'width'], dtype=object)
In [18]: df.as_matrix()
Out[18]:
array([[ 1.2, 5. , 2.5],
[ 1.7, 2. , 4.5],
[ 3.2, 7. , 4.5]])
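One caveat for current readers: as_matrix was deprecated in later pandas releases and removed in pandas 1.0; on a recent pandas the same result comes from to_numpy (or the values attribute):

df.to_numpy()      # plain (3, 3) float ndarray, same values as as_matrix() above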
I'd like to fill in a numpy array with some floating-point values coming from a file. The data would be stored like this:
0 11
5 6.2 4 6
2 5 3.2 6
7 1.4 5 11
The first line gives the first and last index and on the following lines come the actual data. My current approach is to split each data line, use float on each part, and store the values in a pre-allocated array, slice by slice. Here is how I do it now:
import numpy as np

data_file = 'data.txt'
# Non-needed stuff at the beginning
skip_lines = 0

with open(data_file, 'r') as f:
    # Skip any lines if needed
    for _ in range(skip_lines):
        f.readline()
    # Get the data size and preallocate the numpy array
    first, last = map(int, f.readline().split())
    size = last - first + 1
    data = np.zeros(size)
    beg, end = (-1, 0)  # Keep track of where to fill the array
    for line in f:
        if end - 1 == last:
            break
        samples = line.split()
        beg = end
        end += len(samples)
        data[beg:end] = [float(s) for s in samples]
Is there a way in Python to read the data values one by one instead?
import numpy as np

f = open('data.txt', 'r')
first, last = map(int, f.readline().split())
arr = np.zeros(last - first + 1)
for k in range(last - first + 1):
    data = f.read()  # This does not work. Any idea?
    # In C++, it could be done this way: double data; cin >> data
    arr[k] = data
EDIT The only thing one can be sure of is that the first two numbers are the first and last index, and that the last data row contains only the remaining numbers. There can also be other stuff after the data numbers, so one can't just read all the rows after the "first, last" row.
EDIT 2 Added the (working) initial approach: split each data line, use float on each part, and store the values in a pre-allocated array, slice by slice.
Since your sample has the same number of columns in each row (except the first), we can read it as CSV, for example with loadtxt:
In [1]: cat stack43307063.txt
0 11
5 6.2 4 6
2 5 3.2 6
7 1.4 5 11
In [2]: arr = np.loadtxt('stack43307063.txt', skiprows=1)
In [3]: arr
Out[3]:
array([[ 5. , 6.2, 4. , 6. ],
[ 2. , 5. , 3.2, 6. ],
[ 7. , 1.4, 5. , 11. ]])
This is easy to reshape and manipulate. If columns aren't consistent, then we need to work line by line.
In [9]: alist = []
In [10]: with open('stack43307063.txt') as f:
    ...:     start, stop = [int(i) for i in f.readline().split()]
    ...:     print(start, stop)
    ...:     for line in f:  # f.readline()
    ...:         print(line.split())
    ...:         alist.append([float(i) for i in line.split()])
    ...:
0 11
['5', '6.2', '4', '6']
['2', '5', '3.2', '6']
['7', '1.4', '5', '11']
In [11]: alist
Out[11]: [[5.0, 6.2, 4.0, 6.0], [2.0, 5.0, 3.2, 6.0], [7.0, 1.4, 5.0, 11.0]]
Replace the append with extend to collect the values in a flat list instead:
alist.extend([float(i) for i in line.split()])
[5.0, 6.2, 4.0, 6.0, 2.0, 5.0, 3.2, 6.0, 7.0, 1.4, 5.0, 11.0]
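If you go the flat-list route, the pre-allocated array from the question can then be filled in one step; a minimal sketch, assuming start and stop were read from the first line as above, and trimming to the announced length so anything after the expected samples is ignored:

import numpy as np

size = stop - start + 1        # number of samples announced on the first line
data = np.zeros(size)
data[:] = alist[:size]         # keep only the announced number of values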
C++ I/O usually uses streams. Streaming is possible in Python, but text files are more often read line by line.
In [15]: lines = open('stack43307063.txt').readlines()
In [16]: lines
Out[16]: ['0 11\n', '5 6.2 4 6\n', '2 5 3.2 6\n', '7 1.4 5 11\n']
a list of lines, which can then be processed as above.
fromfile could also be used, except it loses any row/column structure in the original:
In [20]: np.fromfile('stack43307063.txt',sep=' ')
Out[20]:
array([ 0. , 11. , 5. , 6.2, 4. , 6. , 2. , 5. , 3.2,
6. , 7. , 1.4, 5. , 11. ])
This load includes the first line. We could skip that with an open and readline.
In [21]: with open('stack43307063.txt') as f:
    ...:     start, stop = [int(i) for i in f.readline().split()]
    ...:     print(start, stop)
    ...:     arr = np.fromfile(f, sep=' ')
0 11
In [22]: arr
Out[22]:
array([ 5. , 6.2, 4. , 6. , 2. , 5. , 3.2, 6. , 7. ,
1.4, 5. , 11. ])
fromfile takes a count parameter as well, which could be set from your start and stop. But unless you only want to read a subset, it isn't needed.
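For the situation described in the edit, where unrelated material can follow the data, count is exactly what stops fromfile early; a minimal sketch:

import numpy as np

with open('stack43307063.txt') as f:
    start, stop = [int(i) for i in f.readline().split()]
    # read only the announced number of values and ignore whatever follows
    arr = np.fromfile(f, sep=' ', count=stop - start + 1)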
This assumes only that the first two numbers give the indices of the values required from the numbers that follow. A varying number of values can appear on the first or subsequent lines. It won't read tokens beyond last.
from io import StringIO
sample = StringIO('''3 11 5\n 6.2 4\n6 2 5 3.2 6 7\n1.4 5 11''')
from shlex import shlex
lexer = shlex(instream=sample, posix=False)
lexer.wordchars = r'0123456789.'
lexer.whitespace = ' \n'
lexer.whitespace_split = True
def oneToken():
    while True:
        token = lexer.get_token()
        if token:
            token = token.strip()
            if not token:
                return
        else:
            return
        token = token.replace('\n', '')
        yield token
tokens = oneToken()
first = int(next(tokens))
print (first)
last = int(next(tokens))
print (last)
all_available = [float(next(tokens)) for i in range(0, last)]
print (all_available)
data = all_available[first:last]
print (data)
Output:
3
11
[5.0, 6.2, 4.0, 6.0, 2.0, 5.0, 3.2, 6.0, 7.0, 1.4, 5.0]
[6.0, 2.0, 5.0, 3.2, 6.0, 7.0, 1.4, 5.0]
f.read() will give you the remaining numbers as a string. You'll have to split them and map to float.
import numpy as np

f = open('data.txt', 'r')
first, last = map(int, f.readline().split())
arr = np.zeros(last - first + 1)
# Split the remaining text into tokens, keep only the announced number, and convert to float
data = map(float, f.read().split()[:last - first + 1])
arr[:] = list(data)
Python is fast at string processing, so you can treat this as a problem of reading with two delimiters: reduce it to a single delimiter and then read (Python 3):
import numpy as np
from io import StringIO

# Turn every space into a newline so loadtxt sees one value per line, skipping the two index values
data = np.loadtxt(StringIO(''.join(l.replace(' ', '\n') for l in open('data.txt'))),
                  delimiter=' ', skiprows=2)
https://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
Data-type is float by default.
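Because every number ends up on its own line, the result is a flat 1-D array of all the values after the first two; a short usage sketch (the filename and the 4-column reshape are assumptions based on the sample data, and it presumes the file holds nothing but the lines shown):

import numpy as np
from io import StringIO

flat = np.loadtxt(StringIO(''.join(l.replace(' ', '\n') for l in open('data.txt'))),
                  skiprows=2)
table = flat.reshape(-1, 4)    # back to the 3 x 4 layout of the sample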
What is the most efficient way to remove negative elements in an array? I have tried numpy.delete, the approach from "Remove all specific value from array", and code of the form x[x != i].
For:
import numpy as np
x = np.array([-2, -1.4, -1.1, 0, 1.2, 2.2, 3.1, 4.4, 8.3, 9.9, 10, 14, 16.2])
I want to end up with an array:
[0, 1.2, 2.2, 3.1, 4.4, 8.3, 9.9, 10, 14, 16.2]
In [2]: x[x >= 0]
Out[2]: array([ 0. , 1.2, 2.2, 3.1, 4.4, 8.3, 9.9, 10. , 14. , 16.2])
If performance is important, you could take advantage of the fact that your np.array is sorted and use numpy.searchsorted
For example:
In [8]: x[np.searchsorted(x, 0) :]
Out[8]: array([ 0. , 1.2, 2.2, 3.1, 4.4, 8.3, 9.9, 10. , 14. , 16.2])
In [9]: %timeit x[np.searchsorted(x, 0) :]
1000000 loops, best of 3: 1.47 us per loop
In [10]: %timeit x[x >= 0]
100000 loops, best of 3: 4.5 us per loop
The performance difference will grow with the size of the array, because np.searchsorted does an O(log n) binary search, whereas x >= 0 does an O(n) linear scan.
In [11]: x = np.arange(-1000, 1000)
In [12]: %timeit x[np.searchsorted(x, 0) :]
1000000 loops, best of 3: 1.61 us per loop
In [13]: %timeit x[x >= 0]
100000 loops, best of 3: 9.87 us per loop
In numpy:
b = array[array>=0]
Example:
>>> import numpy as np
>>> arr = np.array([-2, -1.4, -1.1, 0, 1.2, 2.2, 3.1, 4.4, 8.3, 9.9, 10, 14, 16.2])
>>> arr = arr[arr>=0]
>>> arr
array([ 0. , 1.2, 2.2, 3.1, 4.4, 8.3, 9.9, 10. , 14. , 16.2])
There's probably a cool way to do this in numpy, because numpy is magic to me, but:
x = np.array([num for num in x if num >= 0])
I have a file with the following format :
# a, b, c
0.1 0 0
0.2 0.4 0.5
4 5 0.9
0.3 0 10
which is a file with 3 columns of data; the names of these columns are a, b, and c.
Currently to read these data, I use :
def readdata(filename):
    a, b, c = np.loadtxt(filename, unpack=True)
    return a, b, c
But instead of that, I would like readdata to return a mapping mydata of {column title: numpy array}, so I can call mydata["a"] to get the first column. I want this function to keep working if the file gains new columns (d, e, f, ...).
How can I do that (avoiding unnecessary copies as much as possible)?
This functionality is provided by the numpy function np.genfromtxt, if you call it with the keyword names=True.
Example:
>>> s = """# a, b, c
... 0.1 0 0
... 0.2 0.4 0.5
... 4 5 0.9
... 0.3 0 10
... """
>>> from io import StringIO
>>> data = np.genfromtxt(StringIO(s), names=True)
>>> data['a']
array([ 0.1, 0.2, 4. , 0.3])
>>> data['b']
array([ 0. , 0.4, 5. , 0. ])
>>> data['c']
array([ 0. , 0.5, 0.9, 10. ])
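If you specifically want the {column title: numpy array} mapping from the question, it can be built from the field names of the structured result; a minimal sketch of a readdata along those lines:

import numpy as np

def readdata(filename):
    data = np.genfromtxt(filename, names=True)
    # one entry per column; each value is a view into the structured array, not a copy
    return {name: data[name] for name in data.dtype.names}

mydata = readdata('mydatafile.txt')   # hypothetical filename; mydata['a'] is the first column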
With this file:
#a, b, c
0.1 0 0
0.2 0.4 0.5
4 5 0.9
0.3 0 10
Assuming your first line defines the column headers, you can do this in Numpy:
First, read the header row:
>>> with open('/tmp/testnp.txt','r') as f:
...     header = [n.strip() for n in f.readline().strip().lstrip('#').split(',')]
...
>>> header
['a', 'b', 'c']
Now, create a structured array in Numpy with the names the same as the fields in the header:
>>> import numpy as np
>>> struct=[(name,'float') for name in header]
>>> data=np.loadtxt('/tmp/testnp.txt',dtype=struct,comments='#')
>>> data
array([(0.1, 0.0, 0.0), (0.2, 0.4, 0.5), (4.0, 5.0, 0.9), (0.3, 0.0, 10.0)],
dtype=[('a', '<f8'), ('b', '<f8'), ('c', '<f8')])
>>> data['a']
array([ 0.1, 0.2, 4. , 0.3])
You can read your file into a pandas DataFrame with
import pandas
dataframe = pandas.read_csv(my_file)
then you get your columns just doing:
my_column_series = dataframe['column_name']
Note that your csv file has to have a first (header) row with the column names; otherwise you have to give the names to the dataframe manually.
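For the specific file in the question (whitespace-separated values under a '# a, b, c' comment line), plain read_csv needs a couple of options; a minimal sketch, reusing the manually-read header list from the other answer:

import pandas

with open('/tmp/testnp.txt') as f:
    header = [n.strip() for n in f.readline().strip().lstrip('#').split(',')]

dataframe = pandas.read_csv('/tmp/testnp.txt', sep=r'\s+', comment='#', names=header)
my_column_series = dataframe['a']    # first column as a pandas Series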