I have a numpy array like this below:
array([['18.0', '11.0', '5.0', ..., '19.0', '18.0', '20.0'],
['11.0', '14.0', '15.0', ..., '45.0', '26.0', '20.0'],
['1.0', '0.0', '1.0', ..., '3.0', '4.0', '17.0'],
...,
['nan', 'nan', 'nan', ..., 'nan', 'nan', 'nan'],
['nan', 'nan', 'nan', ..., 'nan', 'nan', 'nan'],
['nan', 'nan', 'nan', ..., 'nan', 'nan', 'nan']],
dtype='|S230')
But converting it to an int array turns the np.nan values into weird numbers:
df[:,4:].astype('float').astype('int')
array([[ 18, 11, 5,
..., 19, 18,
20],
[ 11, 14, 15,
..., 45, 26,
20],
[ 1, 0, 1,
..., 3, 4,
17],
...,
[-9223372036854775808, -9223372036854775808, -9223372036854775808,
..., -9223372036854775808, -9223372036854775808,
-9223372036854775808],
[-9223372036854775808, -9223372036854775808, -9223372036854775808,
..., -9223372036854775808, -9223372036854775808,
-9223372036854775808],
[-9223372036854775808, -9223372036854775808, -9223372036854775808,
..., -9223372036854775808, -9223372036854775808,
-9223372036854775808]])
How can I fix this?
Converting a floating-point NaN to an integer type is undefined behavior, as far as I know. The number
-9223372036854775808
is the smallest int64, i.e. -2**63. Note that the same thing happens on my system when I coerce to int32:
>>> arr
array([['18.0', '11.0', '5.0', 'nan']],
dtype='<U4')
>>> arr.astype('float').astype(np.int32)
array([[ 18, 11, 5, -2147483648]], dtype=int32)
>>> -2**31
-2147483648
It all depends on what you expect the result to be. nan is a float value, so converting the string 'nan' into float is no problem. But there is no defined int value for it.
I suggest you handle it differently: first choose which specific int you want all the nan values to become (for example 0), and only then convert the whole array to int:
a = np.array(['1','2','3','nan','nan'])
a[a=='nan'] = 0 # replaces every 'nan' with the string '0'; choose another sentinel if you like
a = a.astype('int')
Now a is equal to
array([1, 2, 3, 0, 0])
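If the array is large, an alternative sketch (my addition, using the same sentinel 0) is to cast to float first and mask the NaNs numerically instead of comparing strings:

```python
import numpy as np

# Go through float first, then replace the real np.nan values
# with a sentinel before the integer cast.
a = np.array(['1', '2', '3', 'nan', 'nan'])

f = a.astype(float)                             # 'nan' strings become np.nan
out = np.where(np.isnan(f), 0, f).astype(int)   # 0 is the chosen sentinel
print(out)  # [1 2 3 0 0]
```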
I'm writing a function to get the HOR_df output below by feeding in 2 dataframes, planned_routes and actual_shippers.
The function should look at actual_pallets (from the actual_shippers df) and then check planned_routes to see whether actual_pallets > max_truck_capacity. If it does, it should generate a new row, like in the picture below.
Visualisation of the inputs and wanted output:
Note
In the above case: S1 had 10 planned pallets, but the new actual_pallets increased so much that max_truck_capacity is too small to handle them. Therefore a new row is generated with the S2 ID and the 3 extra pallets that are needed.
HOR_df in this case makes sure that on the 1st of December 2021, the actual_pallets for shipper S2 are split onto routes of 10 and 3 pallets separately, instead of the 10 pallets that were in the initial planned_routes.
Potential idea how it should be done
I'm not sure what the most efficient way to do this is; for instance, should I build something that iteratively goes through and "fills the routes up" with the new actual_pallet data?
# x = planned_routes
# y = actual_shippers
# z = cost of adhocs and cancellations
# w = truck eligibility
def optimal_trips(x, y, z, w):
    # Step 1: Take in the actual shippers' package and pallet data.
    # Step 2: Feed the actual data into the planned routes and add routes based on demand.
    # Step 3: Return a df with the new optimal routes.
    pass  # to be implemented
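One way to sketch the "fill the routes up" step (my simplification, with a hypothetical flattened input rather than the wide planned_routes layout below): for each shipper on a date, put as many pallets as fit on the planned truck and emit an extra row for the overflow.

```python
import pandas as pd

# Hypothetical simplified input: one row per (date, shipper), already merged
# from planned_routes and actual_shippers. Column names are my assumption.
merged = pd.DataFrame({
    'date': ['2021-12-01', '2021-12-01'],
    'shipper_id': ['S1', 'S2'],
    'actual_pallets': [14, 13],
    'truck_max_capacity': [24, 24],
})

rows = []
capacity_left = dict.fromkeys(merged['date'].unique(), 24)  # truck space per date
for _, r in merged.iterrows():
    left = capacity_left[r['date']]
    on_truck = min(r['actual_pallets'], left)   # what still fits on the planned truck
    capacity_left[r['date']] = left - on_truck
    rows.append({'date': r['date'], 'shipper_id': r['shipper_id'],
                 'pallets': on_truck, 'extra': False})
    overflow = r['actual_pallets'] - on_truck
    if overflow > 0:                            # doesn't fit: generate a new ad-hoc row
        rows.append({'date': r['date'], 'shipper_id': r['shipper_id'],
                     'pallets': overflow, 'extra': True})

result = pd.DataFrame(rows)
```

With the sample numbers above, S1's 14 pallets fit, S2 gets 10 on the planned truck and an extra row with 3, matching the 10/3 split described in the note.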
Code for the dfs (to replicate)
Input 1:
planned_routes = pd.DataFrame({
'date':['2021-12-01','2021-12-02'],
'planned_route_id':['R1', 'R2'],
'S1_id':['S1', 'S1'],
'S2_id':['S2', 'S2'],
'S3_id':['NaN', 'NaN'],
'S4_id':['NaN', 'NaN'],
'S1_planned_packages':[110, 100],
'S2_planned_packages':[120, 100],
'S3_planned_packages':['NaN', 'NaN'],
'S4_planned_packages':['NaN', 'NaN'],
'total_planned_packages':[230, 200],
'S1_planned_pallets':[11, 10],
'S2_planned_pallets':[12, 10],
'S3_planned_pallets':['NaN', 'NaN'],
'S4_planned_pallets':['NaN', 'NaN'],
'total_pallets':[23, 20],
'truck_max_capacity':[24, 24],
'cost_route':[120, 120]
})
Input 2:
actual_shippers = pd.DataFrame({
'date':['2021-12-01','2021-12-01','2021-12-02','2021-12-02'],
'shipper_id':['S1', 'S2','S1', 'S2'],
'actual_packages':[140, 130, 140, 130],
'shipper_spp':[10, 10, 10, 10],
'actual_pallets':[14, 13, 14, 13],
'shipper_max_eligibility':[24, 24, 24, 24],
'truck_max_capacity':[24, 24, 24, 24]
})
Wanted output:
HOR_df = pd.DataFrame({
'date':['2021-12-01','2021-12-01', '2021-12-02', '2021-12-02'],
'planned_route_id':['R1', 'R3','R2', 'R4'],
'S1_id':['S1', 'S2', 'S1', 'S2'],
'S2_id':['S2', 'NaN', 'S2', 'NaN'],
'S3_id':['NaN', 'NaN','NaN', 'NaN'],
'S4_id':['NaN', 'NaN', 'NaN', 'NaN'],
'S1_actual_packages':[140, 0, 140, 0],
'S2_actual_packages':[100, 30, 100, 30],
'S3_actual_packages':['NaN', 'NaN', 'NaN', 'NaN'],
'S4_actual_packages':['NaN', 'NaN', 'NaN', 'NaN'],
'total_planned_packages':[240, 30, 240, 30], # sum(S1_actual_packages, S2_actual packages, S3... etc)
'S1_actual_pallets':[14, 3, 14, 3],
'S2_actual_pallets':[10, 'NaN', 10, 'NaN'],
'S3_actual_pallets':['NaN', 'NaN', 'NaN', 'NaN'],
'S4_actual_pallets':['NaN', 'NaN', 'NaN', 'NaN'],
'total_pallets':[24, 3, 24, 3], #sum(S1_actual_pallets, S2 ... etc)
'truck_max_capacity':[24, 24, 24, 24],
'cost_route':[120, 130, 120, 130]
})
Until now:
I load a CSV file without a header and put in new column names.
I strip away all spaces.
I organize the data. All R150 rows appear.
But I can't do a boxplot of my Istwert column. Error:
"None of [Index([',Istwert'], dtype='object')] are in the [columns]"
If I save the CSV, I don't find anything suspicious.
Any ideas?
The code so far:
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv("<FILELOCATION>", delimiter=";" , skiprows = 1, names=["BID","Testschritt","Testbeschreibung","Sollwert","Minimum","Maximum","Istwert","Einheit"])
df = df.apply(lambda x: x.str.strip() if (x.dtype == "object") | (x.dtype == "float") else x)
result = df.loc[df["Testschritt"] == "R150"]
result.boxplot(column = ["Istwert"])
This is my CSV data:
And this is my result before the boxplot:
Always provide your data as text, not an image. OCR is painful.
This works for me.
What version of pandas are you using?
What are your data types?
result = pd.DataFrame(
{'idx': [226, 1070, 1914, 2758, 3602, 4446, 5290, 6134, 6978, 7822],
'BID': [7249, 7326, 7327, 7328, 7329, 7330, 7331, 7332, 7333, 7333],
'Testschritt': ['R150','R150', 'R150', 'R150', 'R150', 'R150', 'R150', 'R150', 'R150', 'R150'],
'Testbeschreibung': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
'Sollwert': [22, 22, 22, 22, 22, 22, 22, 22, 22, 22],
'Minimum': [19.8, 19.8, 19.8, 19.8, 19.8, 19.8, 19.8, 19.8, 19.8, 19.8],
'Maximum': [24.2, 24.2, 24.2, 24.2, 24.2, 24.2, 24.2, 24.2, 24.2, 24.2],
'Istwert': [20953, 21002, 20838, 20827, 20879, 20942, 20999, 20855, 20969, 20874],
'Einheit': ['KOhm', 'KOhm', 'KOhm', 'KOhm', 'KOhm', 'KOhm', 'KOhm', 'KOhm', 'KOhm', 'KOhm']}
).set_index("idx")
print(f"pandas: {pd.__version__}\n{result.dtypes}")
result.boxplot(column = ["Istwert"])
output
pandas: 1.1.0
BID int64
Testschritt object
Testbeschreibung int64
Sollwert int64
Minimum float64
Maximum float64
Istwert int64
Einheit object
dtype: object
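The error message above shows the column is actually named ',Istwert' with a leading comma, which suggests the delimiter and the names list don't line up with the file. A self-contained sketch of the diagnosis (the data is made up; only the stray comma matters):

```python
import io
import pandas as pd

# Reproduce a label with a stray comma, then make it visible and strip it.
csv = "7249;R150;1;22;19.8;24.2;20953;KOhm\n"
df = pd.read_csv(io.StringIO(csv), delimiter=";",
                 names=["BID", "Testschritt", "Testbeschreibung", "Sollwert",
                        "Minimum", "Maximum", ",Istwert", "Einheit"])  # note ',Istwert'
print([repr(c) for c in df.columns])      # any comma or whitespace shows up here
df.columns = df.columns.str.strip(", ")   # normalize the labels
print(df["Istwert"].tolist())             # now accessible
```

Printing `repr()` of each label catches invisible characters that look fine when the frame is displayed or saved back to CSV.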
Assume I have a dict, call it coeffs:
coeffs = {'X1': 0.1, 'X2':0.2, 'X3':0.4, ..., 'Xn':0.09}
How can I convert the values into a 1 x n ndarray?
Into an n x m ndarray?
Here's an example of using your coeffs to fill in an array, with value indices derived from the dictionary keys:
In [591]: coeffs = {'X1': 0.1, 'X2':0.2, 'X3':0.4, 'X4':0.09}
In [592]: alist = [[int(k[1:]),v] for k,v in coeffs.items()]
In [593]: alist
Out[593]: [[4, 0.09], [3, 0.4], [1, 0.1], [2, 0.2]]
Here I stripped off the initial character and converted the rest to an integer. You could do your own conversion.
Now just initialize an array of zeros, and fill in the values:
In [594]: X = np.zeros((5,))
In [595]: for k,v in alist: X[k] = v
In [596]: X
Out[596]: array([ 0. , 0.1 , 0.2 , 0.4 , 0.09])
Obviously I could have used X = np.zeros((1,5)). An (n,m) array doesn't make sense unless there's a basis for choosing n for each dictionary item.
Just for laughs, here's another way of making an array from a dictionary - put the keys and values into fields of structured array:
In [613]: X = np.zeros(len(coeffs),dtype=[('keys','S3'),('values',float)])
In [614]: X
Out[614]:
array([(b'', 0.0), (b'', 0.0), (b'', 0.0), (b'', 0.0)],
dtype=[('keys', 'S3'), ('values', '<f8')])
In [615]: for i,(k,v) in enumerate(coeffs.items()):
X[i]=(k,v)
.....:
In [616]: X
Out[616]:
array([(b'X4', 0.09), (b'X3', 0.4), (b'X1', 0.1), (b'X2', 0.2)],
dtype=[('keys', 'S3'), ('values', '<f8')])
In [617]: X['keys']
Out[617]:
array([b'X4', b'X3', b'X1', b'X2'],
dtype='|S3')
In [618]: X['values']
Out[618]: array([ 0.09, 0.4 , 0.1 , 0.2 ])
The scipy sparse module has a sparse matrix format that stores its values in a dictionary; in fact, it is a subclass of dict. The keys in this dictionary are (i, j) tuples, the indices of the nonzero elements. sparse has tools for quickly converting such a matrix into other, more computation-friendly sparse formats, and into regular dense arrays.
I learned in other SO questions that a fast way to build such a matrix is to use the regular dictionary update method to copy values from another dictionary.
Inspired by @user's 2-D version of this problem, here's how such a sparse matrix could be created.
Start with @user's sample coeffs:
In [24]: coeffs
Out[24]:
{'Y8': 22,
'Y2': 16,
'Y6': 20,
'X5': 20,
'Y9': 23,
'X2': 17,
...
'Y1': 15,
'X4': 19}
define a little function that converts the X3 style of key to (0,3) style:
In [25]: def decodekey(akey):
pt1,pt2 = akey[0],akey[1:]
i = {'X':0, 'Y':1}[pt1]
j = int(pt2)
return i,j
....:
Apply it with a dictionary comprehension to coeffs (or use a regular loop in earlier Python versions):
In [26]: coeffs1 = {decodekey(k):v for k,v in coeffs.items()}
In [27]: coeffs1
Out[27]:
{(1, 2): 16,
(0, 1): 16,
(0, 0): 15,
(1, 4): 18,
(1, 5): 19,
...
(0, 8): 23,
(0, 2): 17}
Import sparse and define an empty dok matrix:
In [28]: from scipy import sparse
In [29]: M=sparse.dok_matrix((2,10),dtype=int)
In [30]: M.A
Out[30]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
fill it with the coeffs1 dictionary values:
In [31]: M.update(coeffs1)
In [33]: M.A # convert to dense array
Out[33]:
array([[15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
Actually, I don't need sparse to convert coeffs1 into an array. An (i, j) tuple can index an array directly; A[(i, j)] is the same as A[i, j].
In [34]: A=np.zeros((2,10),int)
In [35]: for k,v in coeffs1.items():
....: A[k] = v
....:
In [36]: A
Out[36]:
array([[15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
[14, 15, 16, 17, 18, 19, 20, 21, 22, 23]])
Concerning an n x m array
@hpaulj's answer assumed (rightly) that the numbers after the X were supposed to be positions. If you had data like
coeffs = {'X1': 3, 'X2' : 5, ..., 'Xn' : 34, 'Y1': 5, 'Y2' : -3, ..., 'Yn': 32}
You could do as follows. Given sample data like
{'Y3': 17, 'Y2': 16, 'Y8': 22, 'Y5': 19, 'Y6': 20, 'Y4': 18, 'Y9': 23, 'Y1': 15, 'X8': 23, 'X9': 24, 'Y7': 21, 'Y0': 14, 'X2': 17, 'X3': 18, 'X0': 15, 'X1': 16, 'X6': 21, 'X7': 22, 'X4': 19, 'X5': 20}
created by
a = {}
for i in range(10):
a['X'+str(i)] = 15 + i
for i in range(10):
a['Y'+str(i)] = 14 + i
Put it in some ordered dictionary (inefficient, but easy)
b = {}
for k, v in a.iteritems(): # use a.items() in Python 3
letter = k[0]
index = float(k[1:])
if letter not in b.keys():
b[letter] = {}
b[letter][index] = v
gives
>>> b
{'Y': {0: 14, 1: 15, 2: 16, 3: 17, 4: 18, 5: 19, 6: 20, 7: 21, 8: 22, 9: 23}, 'X': {0: 15, 1: 16, 2: 17, 3: 18, 4: 19, 5: 20, 6: 21, 7: 22, 8: 23, 9: 24}}
Find out the target dimensions of the array. (This assumes all params are the same length and you have all values given; note the + 1, since the indices start at 0.)
row_length = max(b.values()[0]) + 1
row_indices = b.keys()
row_indices.sort()
Create the array via
X = np.empty((len(b.keys()), row_length))
and insert the data:
for i, row in enumerate(row_indices):
    for j in range(row_length):
        X[i, j] = b[row][j]
Result
>>> X
array([[ 15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,  23.,  24.],
       [ 14.,  15.,  16.,  17.,  18.,  19.,  20.,  21.,  22.,  23.]])
Old answer
coeffs.values() gives the dict's values (a list in Python 2, a view object in Python 3). Just create
np.array(list(coeffs.values()))
In general, when you have an object like coeffs, you can type
help(coeffs)
in the interpreter, to get a list of all it can do.
I create two matrices:
import numpy as np
arrA = np.zeros((9000,3))
arrB = np.zeros((9000,6))
I want to concatenate pieces of those matrices.
But when I try to do:
arrC = np.hstack((arrA, arrB[:,1]))
I get an error:
ValueError: all the input arrays must have same number of dimensions
I guess it's because np.shape(arrB[:,1]) is equal (9000,) instead of (9000,1), but I cannot figure out how to resolve it.
Could you please comment on this issue?
You could preserve dimensions by passing a list of indices, not an index:
>>> arrB[:,1].shape
(9000,)
>>> arrB[:,[1]].shape
(9000, 1)
>>> out = np.hstack([arrA, arrB[:,[1]]])
>>> out.shape
(9000, 4)
This is easier to see visually.
Assume:
>>> arrA=np.arange(9000*3).reshape(9000,3)
>>> arrA
array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
...,
[26991, 26992, 26993],
[26994, 26995, 26996],
[26997, 26998, 26999]])
>>> arrB=np.arange(9000*6).reshape(9000,6)
>>> arrB
array([[ 0, 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10, 11],
[ 12, 13, 14, 15, 16, 17],
...,
[53982, 53983, 53984, 53985, 53986, 53987],
[53988, 53989, 53990, 53991, 53992, 53993],
[53994, 53995, 53996, 53997, 53998, 53999]])
If you take a slice of arrB, you produce a 1-D array that prints like a row:
>>> arrB[:,1]
array([ 1, 7, 13, ..., 53983, 53989, 53995])
What you need is a column-shaped array, with the same number of rows as arrA, to append to it:
>>> arrB[:,[1]]
array([[ 1],
[ 7],
[ 13],
...,
[53983],
[53989],
[53995]])
Then hstack works as expected:
>>> arrC=np.hstack((arrA, arrB[:,[1]]))
>>> arrC
array([[ 0, 1, 2, 1],
[ 3, 4, 5, 7],
[ 6, 7, 8, 13],
...,
[26991, 26992, 26993, 53983],
[26994, 26995, 26996, 53989],
[26997, 26998, 26999, 53995]])
An alternate form is to specify -1 in one dimension and the desired number of rows or columns in the other with .reshape():
>>> arrB[:,1].reshape(-1,1) # one col
array([[ 1],
[ 7],
[ 13],
...,
[53983],
[53989],
[53995]])
>>> arrB[:,1].reshape(-1,6) # 6 cols
array([[ 1, 7, 13, 19, 25, 31],
[ 37, 43, 49, 55, 61, 67],
[ 73, 79, 85, 91, 97, 103],
...,
[53893, 53899, 53905, 53911, 53917, 53923],
[53929, 53935, 53941, 53947, 53953, 53959],
[53965, 53971, 53977, 53983, 53989, 53995]])
>>> arrB[:,1].reshape(2,-1) # 2 rows
array([[ 1, 7, 13, ..., 26983, 26989, 26995],
[27001, 27007, 27013, ..., 53983, 53989, 53995]])
There is more on array shaping and stacking in the NumPy documentation.
I would try something like this:
np.vstack((arrA.transpose(), arrB[:,1])).transpose()
There several ways of making your selection from arrB a (9000,1) array:
np.hstack((arrA,arrB[:,[1]]))
np.hstack((arrA,arrB[:,1][:,None]))
np.hstack((arrA,arrB[:,1].reshape(9000,1)))
np.hstack((arrA,arrB[:,1].reshape(-1,1)))
One uses the concept of indexing with an array or list; the next adds a new axis (np.newaxis); the last two use reshape. These are all basic numpy array manipulation tasks.
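Another option worth knowing (my addition, not mentioned in the answers above): np.column_stack treats 1-D inputs as columns, so no reshaping is needed at all:

```python
import numpy as np

# column_stack promotes 1-D arrays to columns automatically,
# so arrB[:, 1] can be appended to arrA without any reshaping.
arrA = np.zeros((9000, 3))
arrB = np.arange(9000 * 6).reshape(9000, 6)

arrC = np.column_stack((arrA, arrB[:, 1]))
print(arrC.shape)  # (9000, 4)
```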
I'm trying to manipulate the values of a N-dimensional array based on the user's decision, at which index the array should be changed. This example works fine:
import numpy as np
a = np.arange(24).reshape(2,3,4)
toChange = ['0', '0', '0'] #input from user via raw_input
a[toChange] = 0
But if I want to change not only one position but a complete row, I run into problems:
toChange = ['0', '0', ':'] #input from user via raw_input
a[toChange] = 0
This causes ValueError: setting an array element with a sequence.
I can see that the problem is the ':' string, because a[0, 0, :] = 0 does exactly what I want. The question is: how can I pass such a string to the array?
Or is there a smarter way to manipulate user-defined slices?
PS: as I'm working on an oldstable Debian I use Python 2.6.6 and Numpy 1.4.1
: is syntactic sugar for a slice object:
>>> class Indexable(object):
... def __getitem__(self, idx):
... return idx
...
>>> Indexable()[0, 0, :]
(0, 0, slice(None, None, None))
So if you replace ':' with slice(None, None, None) you get the desired result:
>>> toChange = [0, 0, slice(None, None, None)]
>>> a[toChange] = 0
>>> a
array([[[ 0, 0, 0, 0],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]],
[[12, 13, 14, 15],
[16, 17, 18, 19],
[20, 21, 22, 23]]])
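Putting this together, a small parser can turn the raw user strings (including ':') into proper indices. Note that on current NumPy, a multi-dimensional index must be a tuple, not a list; the helper name here is my own:

```python
import numpy as np

def parse_index(parts):
    """Turn user-supplied strings like ['0', '0', ':'] into an index tuple.

    ':' becomes slice(None); anything else is taken as an integer.
    """
    return tuple(slice(None) if p == ':' else int(p) for p in parts)

a = np.arange(24).reshape(2, 3, 4)
a[parse_index(['0', '0', ':'])] = 0   # same as a[0, 0, :] = 0
print(a[0, 0])  # [0 0 0 0]
```

Ranges like '1:3' could be handled the same way by splitting on ':' and building slice(start, stop), but the minimal version above covers the question's case.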