I know there are a ton of these threads, but all of them are for very simple cases like 3x3 matrices and things of that sort, and the solutions do not even begin to apply to my situation. So I'm trying to graph G versus l1 (that's not an eleven, but an L1). The data is in a file that I loaded from an Excel file. The Excel file is 14x250, so there are 14 arguments, each with 250 data points. I had another user (shout out to Hugh Bothwell!) help me with an error in my code, but now another error has surfaced.
So here is the code in question:
# format for CSV file:
header = ['l1', 'l2', 'l3', 'l4', 'l5', 'EI',
'S', 'P_right', 'P1_0', 'P3_0',
'w_left', 'w_right', 'G_left', 'G_right']
def loadfile(filename, skip=None, *args):
    skip = set(skip or [])
    with open(filename, *args) as f:
        cr = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
        return np.array(row for i,row in enumerate(cr) if i not in skip)
#plot data
outputs_l1 = [loadfile('C:\\Users\\Chris\\Desktop\\Work\\Python Stuff\\BPCROOM - Shingles analysis\\ERR analysis\\l_1 analysis//BS(1) ERR analysis - l_1 - P_3 = {}.csv'.format(p)) for p in p3_arr]
col = {name:i for i,name in enumerate(header)}
fig = plt.figure()
for data,color in zip(outputs_l1, colors):
    xs = data[:, col["l1" ]]
    gl = data[:, col["G_left" ]] * 1000.0 # column 12
    gr = data[:, col["G_right"]] * 1000.0 # column 13
    plt.plot(xs, gl, color + "-", gr, color + "--")
for output, col in zip(outputs_l1, colors):
    plt.plot(output[:,0], output[:,11]*1E3, col+'--')
plt.ticklabel_format(axis='both', style='plain', scilimits=(-1,1))
plt.xlabel('$l1 (m)$')
plt.ylabel('G $(J / m^2) * 10^{-3}$')
plt.xlim(xmin=.2)
plt.ylim(ymax=2, ymin=0)
plt.subplots_adjust(top=0.8, bottom=0.15, right=0.7)
After running the entire program, I receive the error message:
Traceback (most recent call last):
File "C:/Users/Chris/Desktop/Work/Python Stuff/New Stuff from Brenday 8 26 2014/CD_ssa_plot(2).py", line 115, in <module>
xs = data[:, col["l1" ]]
IndexError: too many indices for array
and before I ran into that problem, I had another involving the line a few below the one the above error message refers to:
Traceback (most recent call last): File "FILE", line 119, in <module>
gl = data[:, col["G_left" ]] * 1000.0 # column 12
IndexError: index 12 is out of bounds for axis 1 with size 12
I understand the first error, but am just having problems fixing it. The second error is confusing for me though. My boss is really breathing down my neck so any help would be GREATLY appreciated!
I think the problem is given in the error message, although it is not very easy to spot:
IndexError: too many indices for array
xs = data[:, col["l1" ]]
'Too many indices' means you've given too many index values. You've given 2 values as you're expecting data to be a 2D array. Numpy is complaining because data is not 2D (it's either 1D or None).
This is a bit of a guess - I wonder if one of the filenames you pass to loadfile() points to an empty file, or a badly formatted one? If so, you might get an array returned that is either 1D, or even empty (np.array(None) does not throw an Error, so you would never know...). If you want to guard against this failure, you can insert some error checking into your loadfile function.
I highly recommend in your for loop inserting:
print(data)
This will work in Python 2.x or 3.x and might reveal the source of the issue. You might well find it is only one value of your outputs_l1 list (i.e. one file) that is giving the issue.
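As a rough sketch of that kind of error checking (assuming the files are purely numeric CSV; load_rows is a hypothetical stand-in for loadfile that takes an already open file instead of a filename): one thing worth checking is that a generator expression handed straight to np.array is stored as a single opaque object, giving a 0-dimensional array, and that also produces "too many indices". Building a list first and checking ndim makes the failure show up at load time.

```python
import csv
import io

import numpy as np


def load_rows(f, skip=None):
    """Read numeric CSV rows from an open text file into a 2-D float array."""
    skip = set(skip or [])
    reader = csv.reader(f, quoting=csv.QUOTE_NONNUMERIC)
    # Build a real list: np.array cannot iterate a generator for us.
    rows = [row for i, row in enumerate(reader) if i not in skip]
    data = np.array(rows, dtype=float)
    if data.ndim != 2:
        raise ValueError("expected a 2-D table, got shape {}".format(data.shape))
    return data


# A generator passed directly to np.array becomes one object, not rows.
lazy = np.array(row for row in [[1.0, 2.0], [3.0, 4.0]])

# In-memory stand-in for one of the CSV files.
sample = io.StringIO("1.0,2.0,3.0\n4.0,5.0,6.0\n")
table = load_rows(sample)
last_column = table[:, 2]
```

With a real file you would call it as load_rows(open(filename)) inside a with block; an empty or ragged file then fails with a ValueError at load time rather than with an IndexError later in the plotting loop.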
The message you are getting is not the one a plain Python list would produce.
For a plain Python list, IndexError is raised only when the index is out of range (even the docs say so).
>>> l = []
>>> l[1]
IndexError: list index out of range
If we try passing multiple items to list, or some other value, we get the TypeError:
>>> l[1, 2]
TypeError: list indices must be integers, not tuple
>>> l[float('NaN')]
TypeError: list indices must be integers, not float
However, here, you seem to be using matplotlib that internally uses numpy for handling arrays. On digging deeper through the codebase for numpy, we see:
static NPY_INLINE npy_intp
unpack_tuple(PyTupleObject *index, PyObject **result, npy_intp result_n)
{
npy_intp n, i;
n = PyTuple_GET_SIZE(index);
if (n > result_n) {
PyErr_SetString(PyExc_IndexError,
"too many indices for array");
return -1;
}
for (i = 0; i < n; i++) {
result[i] = PyTuple_GET_ITEM(index, i);
Py_INCREF(result[i]);
}
return n;
}
Here the unpack function sets an IndexError if the number of indices supplied is greater than the number the array can accept (result_n).
So, unlike a Python list, which raises a TypeError when given a tuple of indices, NumPy raises an IndexError, because it supports multidimensional arrays and a tuple is a valid form of index there.
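The difference can be checked with a small sketch (index_error_name is a hypothetical helper written only for this illustration, not part of Python or NumPy):

```python
import numpy as np


def index_error_name(container, index):
    """Return the name of the exception raised by container[index], or None."""
    try:
        container[index]
    except (IndexError, TypeError) as exc:
        return type(exc).__name__
    return None


plain_list = [10, 20, 30]
flat_array = np.array(plain_list)
grid = np.array([[1, 2], [3, 4]])

list_result = index_error_name(plain_list, (0, 1))   # tuple index on a list
array_result = index_error_name(flat_array, (0, 1))  # two indices on a 1-D array
grid_result = index_error_name(grid, (0, 1))         # two indices on a 2-D array
```

The list rejects the tuple as the wrong type of index, the 1-D array rejects it as too many indices, and the 2-D array accepts it.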
Before transforming the data into an array, I first transformed the data into a list:
data = list(data)
data = np.array(data)
I am trying to explore the topic of concurrency in Python. I saw a couple of posts about how to optimize processes by splitting the input data, processing the parts separately, and afterwards joining the results. My task is to calculate the mean along the Z axis of a stack of rasters. I read the list of rasters from a text file, and then create a stacked numpy array with the data.
Then I wrote a simple function that takes the stack array as input and calculates the mean. This task takes me some minutes to complete, and I would like to process the numpy array in chunks to optimize the script. However, when I do so by using numpy.split (maybe it is not a good idea to split my 3D array), I get the following error:
Traceback (most recent call last):
File "C:\Users\~\AppData\Local\conda\conda\envs\geoprocessing\lib\site-packages\numpy\lib\shape_base.py",
line 553, in split
len(indices_or_sections)
TypeError: object of type 'int' has no len()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "tf_calculation_numpy.py", line 69, in <module>
main()
File "tf_calculation_numpy.py", line 60, in main
subarrays = np.split(final_array, 4)
File "C:\Users\~\AppData\Local\conda\conda\envs\geoprocessing\lib\site-packages\numpy\lib\shape_base.py", line 559, in split
array split does not result in an equal division'
ValueError: array split does not result in an equal division
Code is:
import rasterio
import os
import numpy as np
import time
import concurrent.futures
def mean_py(array):
    print("Calculating mean of array")
    start_time = time.time()
    x = array.shape[1]
    y = array.shape[2]
    values = np.empty((x,y), type(array[0][0][0]))
    for i in range(x):
        for j in range(y):
            #no more need for append operations
            values[i][j] = ((np.mean(array[:, i, j])))
    end_time = time.time()
    hours, rem = divmod(end_time-start_time, 3600)
    minutes, seconds = divmod(rem, 60)
    print("{:0>2}:{:0>2}:{:05.2f}".format(int(hours),int(minutes),seconds))
    print(f"{'.'*80}")
    return values
def TF_mean(ndarray):
    sdir = r'G:\Mosaics\VH'
    final_array = np.asarray(ndarray)
    final_array = mean_py(final_array)
    out_name = (sdir + "/" + "MEAN_VH.tif")
    print(final_array.shape)
    with rasterio.open(out_name, "w", **profile) as dst:
        dst.write(final_array.astype('float32'), 1)
    print(out_name)
    print(f"\nDone!\n{'.'*80}")
def main():
    sdir = r'G:\Mosaics\VH'
    a = np.random.randn(250_000)
    b = np.random.randn(250_000)
    c = np.random.randn(250_000)
    e = np.random.randn(250_000)
    f = np.random.randn(250_000)
    g = np.random.randn(250_000)
    h = np.random.randn(250_000)
    arrays = [a, b, c, e, f, g, h]
    final_array = []
    for array in arrays:
        final_array.append(array)
        print(f"{array} added")
    print("Splitting nd-array!")
    final_array = np.asarray(final_array)
    subarrays = np.split(final_array, 4)
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for subarray, mean in zip(subarrays, executor.map(TF_mean,subarrays)):
            print(f'Processing {subarray}')
if __name__ == '__main__':
    main()
I just expect to have four processes running in parallel and a way to obtain the 4 subarrays and write them as a whole GeoTIFF file.
The second exception is the important one here, in terms of describing the error: "array split does not result in an equal division"
final_array is a 2D array, with shape 7 by 250,000. numpy.split operates along an axis, defaulting to axis 0, so you just asked it to split a length seven axis into four equal parts. Obviously, this isn't possible, so it gives up.
To fix, you can:
1. Split more; you could just split in seven parts and process each separately. The executor is perfectly happy to do seven tasks, no matter how many workers you have; seven won't split evenly, so at the tail end of processing you'll likely have some workers idle while the rest finish up, but that's not the end of the world.
2. Split on a more fine-grained level. You could just flatten the array, e.g. final_array = final_array.reshape(final_array.size), which would make it a flat 1,750,000 element array, which can be split into four parts.
3. Split unevenly; instead of subarrays = np.split(final_array, 4) which requires axis 0 to be evenly splittable, do subarrays = np.split(final_array, (2,4,6)), which splits into three groups of two rows, plus one group with a single row.
There are many other options depending on your use case (e.g. split on axis=1 instead of the default axis=0), but those three are the least invasive (and #1 and #3 shouldn't change behavior meaningfully; #2 might, depending on whether the separation between 250K element blocks is meaningful).
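Here is a small sketch of those options on a toy array (7 rows by 6 columns as a stand-in for the 7 by 250,000 one). np.array_split is not mentioned above; it is NumPy's variant of split that allows unequal chunks, and it is included here as one more possibility rather than as the recommended fix.

```python
import numpy as np

stack = np.arange(7 * 6).reshape(7, 6)  # small stand-in: 7 rows, 6 columns

# Option 1: one chunk per row (7 divides 7 evenly).
per_row = np.split(stack, 7)

# Option 3: explicit split points on axis 0 -> rows [0:2], [2:4], [4:6], [6:7].
uneven = np.split(stack, (2, 4, 6))

# np.array_split chooses unequal chunk sizes itself instead of raising.
auto = np.array_split(stack, 4)

# Splitting along axis 1 works when the column count is divisible.
by_columns = np.split(stack, 3, axis=1)

# The original call: 7 rows cannot be divided into 4 equal parts.
try:
    np.split(stack, 4)
    split_error = None
except ValueError as exc:
    split_error = str(exc)

uneven_sizes = [chunk.shape[0] for chunk in uneven]
auto_sizes = [chunk.shape[0] for chunk in auto]
```

Whichever variant is used, the resulting list of chunks can be passed to executor.map exactly as in the question.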
I am a beginner at Python and need some help. I have run into a problem when trying to manipulate some .dat files.
I have created 159 .dat files (refitted_to_digit0_0.dat, refitted_to_digit0_1.dat, ...refitted_to_digit0_158.dat) containing time series data in two columns (timestep, value) with 2999 rows. In my Python program I have created a list of these files with filelist_refit_0=glob.glob('refitted_to_digit0_*')
plist_refit_0=[]
I now try to load the second column of each 159 files into the plist_refit_0 so that each place in the list contains an array of 2999 values (second columns) that I will use for further manipulations. I have created a for-loop for this and use the len(filelist_refit_0) as the range for the loop. The length being 159 (number of files: 0-158).
However, when I run this I get an error message: list index out of range.
I have tried a lower range for the for-loop, and it seems to work up to range 66 but not above that. filelist_refit_0[66] refers to file refitted_to_digit0_158.dat and filelist_refit_0[67] refers to refitted_to_digit0_16.dat. filelist_refit_0[158] refers to refitted_to_digit0_99.dat. Instead of being sorted in ascending order based on the value 0->158, I think filelist_refit_0 has the files in ascending order based on the digits as text: refitted_to_digit0_0.dat first, then refitted_to_digit0_1.dat, then refitted_to_digit0_10.dat, then refitted_to_digit0_100.dat, then refitted_to_digit0_101.dat, resulting in refitted_to_digit0_158.dat being at place 66 in the list. However, I still don't understand why the interpreter reports the index as being out of range above 66 when the length of filelist_refit_0 is 159 and there really are 159 files, no matter the order. If anyone can explain this and has some advice on how to solve this problem, I would highly appreciate it! Thanks for your help.
I have tried the following to understand the sorting:
print len(filelist_refit_0) => 159
print filelist_refit_0[66] => refitted_to_digit0_158.dat
print filelist_refit_0[67] => refitted_to_digit0_16.dat
print filelist_refit_0[158] => refitted_to_digit0_99.dat
print filelist_refit_0[0] => refitted_to_digit0_0.dat
I have "manually" loaded the files and it seems to work for most index e.g.
t, p = loadtxt(filelist_refit_0[65], usecols=(0, 1), unpack=True)
plist_refit_0.append(p)
t, p = loadtxt(filelist_refit_0[67], usecols=(0, 1), unpack=True)
plist_refit_0.append(p)
print plist_refit_0[0]
print plist_refit_0[1]
BUT it does not work for index66!:
t, p = loadtxt(filelist_refit_0[66], usecols=(0, 1), unpack=True)
plist_refit_0.append(p)
Then I get error: list index out of range.
As can be seen above, it refers to refitted_to_digit0_158.dat, which is the last file. I have looked into the file and it looks exactly the same as all the other files, with the same number of columns and rows (2999). Why is this entry different?
Python 2:
filelist_refit_0 = glob.glob('refitted_to_digit0_*')
plist_refit_0 = []
for i in range(len(filelist_refit_0)):
    t, p = loadtxt(filelist_refit_0[i], usecols=(0, 1), unpack=True)
    plist_refit_0.append(p)
Traceback (most recent call last):
File "test.py", line 107, in <module>
t,p=loadtxt(filelist_refit_0[i],usecols=(0,1),unpack=True)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1092, in loadtxt
for x in read_data(_loadtxt_chunksize):
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/numpy/lib/npyio.py", line 1012, in read_data
vals = [vals[j] for j in usecols]
IndexError: list index out of range
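The failing line in the traceback, vals = [vals[j] for j in usecols], indexes into the fields of a single row, so the "list index out of range" may come from a row in that one file having fewer than two columns rather than from the file list itself. That is only a guess from the traceback. A minimal sketch for locating such rows, and for sorting the names numerically, could look like this (numeric_suffix and short_rows are hypothetical helper names, and whitespace-separated columns are assumed):

```python
import re


def numeric_suffix(filename):
    """Sort key: the integer just before '.dat', e.g. 158 for '..._158.dat'."""
    match = re.search(r"_(\d+)\.dat$", filename)
    return int(match.group(1)) if match else -1


def short_rows(lines, min_fields=2):
    """Return (line_number, text) for non-blank lines with too few columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and len(fields) < min_fields:
            bad.append((number, line.rstrip("\n")))
    return bad


names = ["refitted_to_digit0_10.dat", "refitted_to_digit0_2.dat",
         "refitted_to_digit0_158.dat", "refitted_to_digit0_0.dat"]
ordered = sorted(names, key=numeric_suffix)

# Stand-in for file contents; the last row is missing its second column.
sample = ["0 1.5\n", "1 2.5\n", "2\n"]
problems = short_rows(sample)
```

For a real file, the same check would be short_rows(open(filelist_refit_0[66])); an empty result would rule this explanation out.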
I'm trying to read in data from a file in binary format and store it in a 2-d array. However, I'm getting an error that reads
error: unpack requires a bytes object of length 2
Essentially what I have is something like
import os, struct
from itertools import chain
packets = value1 #The number of packets in the data stream
dataLength = value2 #bytes of data per packet
packHeader = [[0 for x in range(14)] for y in range(packets)]
data = [[0 for x in range(dataLength)] for y in range(packets)]
for i in range(packets):
    packHeader[i][0] = struct.unpack('>H', file.read(2))
    packHeader[i][1] = struct.unpack('>H', file.read(2))
    ....
    packHeader[i][13] = struct.unpack('>i', file.read(4))
    packHeader[i]=list(chain(*packHeader[i])) #Deals with the tuple issue ((x,),(y,),...) -> (x,y,...)
    for j in range(dataLength):
        data[i][j] = struct.unpack('<h', file.read(2))
When it gets to this point it produces the error above. I'm not sure why. Both dataLength and packets are even numbers, so I imagined unpacking 2 bytes at a time shouldn't be an issue. Any thoughts?
EDIT I did check to see what would happen if I read in the data one byte at a time. So
data[i][j] = struct.unpack('<b', file.read(1))
and that worked fine. It just is not liking to unpack anything else.
EDIT 2 I also just went ahead and made that slightly more compact by saying something like
data[i] = [struct.unpack('<h', file.read(2)) for j in range(dataLength)]
Still produces the same error - just more compactly.
As it turns out, dataLength is a count of bytes, so when reading 2 bytes (or more) at a time the loop still had iterations left to perform after the data from the file had run out. The fix is to do something like the following
readBytes = value_wanting_to_be_read
dataLength = int(value2/readBytes)
and then in the actual loop
data[i] = [struct.unpack('<h', file.read(readBytes)) for j in range(dataLength)]
which works if readBytes = 2.
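The same idea as a self-contained sketch, using an in-memory stream in place of the real file (read_int16_block is a hypothetical helper name; the '<h' format is the one from the question). It derives the number of values from the byte count with struct.calcsize and checks that enough bytes actually came back before unpacking:

```python
import io
import struct


def read_int16_block(stream, n_bytes):
    """Read n_bytes from stream and unpack them as little-endian int16 values."""
    size = struct.calcsize("<h")   # bytes per value
    count = n_bytes // size        # number of values, not number of bytes
    raw = stream.read(count * size)
    if len(raw) != count * size:
        raise EOFError("wanted {} bytes, got {}".format(count * size, len(raw)))
    # One unpack call for the whole block; returns a flat tuple of ints.
    return list(struct.unpack("<{}h".format(count), raw))


payload = struct.pack("<4h", 1, -2, 300, -400)  # 8 bytes of sample data
values = read_int16_block(io.BytesIO(payload), len(payload))
```

Unpacking the whole block in one call also avoids the one-element tuples that struct.unpack returns for each individual value.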
I've been having some problems with this code, trying to end up with an inner product of two 1-D arrays. The code of interest looks like this:
def find_percents(i):
    percents=[]
    median=1.5/(6+2*int(i/12))
    b=2*median
    m=b/(7+2*int(i/12))
    for j in xrange (1,6+2*int(i/12)):
        percents.append(float((b-m*j)))
    percentlist=numpy.asarray(percents, dtype=float)
    #print percentlist
    total=sum(percentlist)
    return total, percentlist
def playerlister(i):
    players=[]
    for i in xrange(i+1,i+6+2*int(i/12)):
        position=sheet.cell(i,2)
        points=sheet.cell(i,24)
        if re.findall('RB', str(position.value)):
            vbd=points.value-rbs[24]
            players.append(vbd)
        else:
            pass
    playerlist=numpy.asarray(players, dtype=float)
    return playerlist
def others(i,percentlist,playerlist,total):
    alternatives=[]
    playerlist=playerlister(i)
    percentlist=find_percents(i)
    players=numpy.dot(playerlist,percentlist)
I am receiving the following error in response to the very last line of this attached code:
ValueError: setting an array element with a sequence.
In most other examples of this error, I have found the error to be because of incorrect data types in the arrays percentlist and playerlist, but mine should be float type. If it helps at all, I call these functions a little later in the program, like so:
for i in xrange(1,30):
    total, percentlist= find_percents(i)
    playerlist= playerlister(i)
    print type(playerlist[i])
    draft_score= others(i,percentlist,playerlist,total)
Can anyone help me figure out why I am setting an array element with a sequence? Please let me know if any more information might be helpful! Also, for clarity, playerlister makes use of the xlrd module to extract data from a spreadsheet, but the data are numerical, and testing has shown that both lists have a type of numpy.float64.
The shape and contents of each of these for one iteration of i is
<type 'numpy.float64'>
(5,)
[ 73.7 -94.4 140.9 44.8 130.9]
(5,)
[ 0.42857143 0.35714286 0.28571429 0.21428571 0.14285714]
Your function find_percents returns a two-element tuple.
When you call it in others, you are binding that tuple to the variable named percentlist, which you then try to use in a dot-product.
My guess is that by writing this in others it is fixed:
def others(i,percentlist,playerlist,total):
    playerlist = playerlister(i)
    _, percentlist = find_percents(i)
    players = numpy.dot(playerlist,percentlist)
provided of course playerlist and percentlist always have the same number of elements (which we can't check because of the missing spreadsheet).
To verify, the following gives you the exact error message and the minimum of code needed to reproduce it:
>>> import numpy as np
>>> a = np.arange(5)
>>> np.dot(a, (2, a))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.
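For completeness, here is a sketch that runs without the spreadsheet. It ports find_percents to Python 3 (range and // in place of xrange and int(i/12)) and uses the playerlist values printed in the question as a stand-in for playerlister(i); both substitutions are assumptions made for the illustration.

```python
import numpy as np


def find_percents(i):
    """Python 3 port of the question's function; returns (total, percentlist)."""
    extra = 2 * (i // 12)
    median = 1.5 / (6 + extra)
    b = 2 * median
    m = b / (7 + extra)
    percentlist = np.array([b - m * j for j in range(1, 6 + extra)], dtype=float)
    return percentlist.sum(), percentlist


# Stand-in for playerlister(i): the values printed in the question.
playerlist = np.array([73.7, -94.4, 140.9, 44.8, 130.9])

total, percentlist = find_percents(1)    # unpack both return values
score = np.dot(playerlist, percentlist)  # two (5,) float arrays -> one scalar

# Passing the whole (total, percentlist) tuple reproduces the failure.
try:
    np.dot(playerlist, find_percents(1))
    tuple_error = None
except ValueError as exc:
    tuple_error = type(exc).__name__
```

Unpacking at the call site, as in the first np.dot call, is the same fix as the `_, percentlist = find_percents(i)` line above.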
I am trying to plot a boxplot for a column in several csv files (without the header row of course), but running into some confusion around tuples, lists and arrays. Here is what I have so far
#!/usr/bin/env python
import csv
from numpy import *
import pylab as p
import matplotlib
#open one file, until boxplot-ing works
f = csv.reader (open('2-node.csv'))
#get all the columns in the file
timeStamp,elapsed,label,responseCode,responseMessage,threadName,dataType,success,bytes,Latency = zip(*f)
#Make list out of elapsed to pop the 1st element -- the header
elapsed_list = list(elapsed)
elapsed_list.pop(0)
#Turn list back to a tuple
elapsed = tuple(elapsed_list)
#Turn list to an numpy array
elapsed_array = array(elapsed_list)
#Elapsed Column statically entered into an array
data_array = ([4631, 3641, 1902, 1937, 1745, 8937] )
print data_array #prints in this format: ([xx,xx,xx,xx]), .__class__ is list ... ?
print elapsed #prints in this format: ('xx','xx','xx','xx'), .__class__ is tuple
print elapsed_list # #print in this format: ['xx', 'xx', 'xx', 'xx', 'xx'], .__class__ is list
print elapsed_array #prints in this format: ['xx' 'xx' 'xx' 'xx' 'xx'] -- notice no commas, .__class__ is numpy.ndarray
p.boxplot (data_array) #works
p.boxplot (elapsed) # does not work, error below
p.boxplot (elapsed_list) #does not work
p.boxplot (elapsed_array) #does not work
p.show()
For boxplots, the 1st argument is "an array or a sequence of vectors", so I would think elapsed_array would work ... ? But yet data_array, a "list," works ... but elapsed_list, a "list", does not ... ? Is there a better way to do this ... ?
I am fairly new to Python, and I would like to understand what about the differences among a tuple, list, and numpy array prevents this boxplot from working.
Example error message is:
Traceback (most recent call last):
File "../pullcol.py", line 32, in <module>
p.boxplot (elapsed_list)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/pyplot.py", line 1962, in boxplot
ret = ax.boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/axes.py", line 5383, in boxplot
q1, med, q3 = mlab.prctile(d,[25,50,75])
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/mlab.py", line 946, in prctile
return _interpolate(values[ai],values[bi],frac)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/matplotlib/mlab.py", line 920, in _interpolate
return a + (b - a)*fraction
TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'
elapsed contains strings. Matplotlib needs integers or floats to plot something. Try converting each value of elapsed to integer. You can do this like so
elapsed = tuple([int(i) for i in elapsed])
or as FredL commented below:
elapsed_list = array(elapsed_list, dtype=float)
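A quick way to see the difference without plotting anything is sketched below. The sample values are the ones typed into data_array in the question, written as strings the way csv.reader returns them; np.percentile is used here only as a stand-in for the quartile computation a boxplot needs.

```python
import numpy as np

# Stand-in for the 'elapsed' column as csv.reader returns it: all strings.
elapsed_list = ["4631", "3641", "1902", "1937", "1745", "8937"]

as_strings = np.array(elapsed_list)               # unicode string dtype
as_floats = np.array(elapsed_list, dtype=float)   # parsed into numbers

# Arithmetic on a string array fails with a TypeError, as in the traceback.
try:
    as_strings[1:] - as_strings[:-1]
    string_arithmetic_ok = True
except TypeError:
    string_arithmetic_ok = False

# The numeric array supports the quartile computation a boxplot relies on.
quartiles = np.percentile(as_floats, [25, 50, 75])
```

Once the column is numeric, p.boxplot(as_floats) should behave like the hand-typed data_array case.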
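I'm not familiar with numpy or matplotlib, but just from the description and what's working, it appears it might be looking for a nested sequence of sequences. Note, though, that data_array is not actually nested: ([4631, 3641, ...]) is just a list, since parentheses without a trailing comma do not create a tuple, so the difference is more likely the element type (integers in data_array versus strings in all your other input).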
As for the differences, a list is a mutable sequence of objects, a tuple is an immutable sequence of objects and an array is a mutable sequence of bytes, ints, chars (basically 1, 2, 4 or 8 byte values).
See section 5.6, Sequence Types, of the Python docs; from there you can jump to more detailed info about lists, tuples, arrays or any of the other sequence types in Python.