How to import a numpy structured array from a data file - python

I'm trying to create an array with 5 columns imported from a data file. Four of them are floats and the last one is a string.
The data file looks like this:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
I tried these:
data = np.genfromtxt(filename, dtype = "float,float,float,float,str", delimiter = ",")
data = np.loadtxt(filename, dtype = "float,float,float,float,str", delimiter = ",")
but both calls import only the first column.
Why? How can I fix this?
Thanks for your time! :)

You must specify the string type correctly: "U20", for example, for at most 20 characters:
data = np.loadtxt('data.txt', dtype = "float,"*4 + "U20", delimiter = ",")
This seems to work:
array([( 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
( 4.9, 3. , 1.4, 0.2, 'Iris-setosa'),
( 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'),
( 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'),
( 5. , 3.6, 1.4, 0.2, 'Iris-setosa'),
( 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'),
( 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'),
( 5. , 3.4, 1.5, 0.2, 'Iris-setosa')],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<U20')])
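As a quick self-check, the same dtype string works on an in-memory file object, and the fields of the resulting structured array can be accessed by their auto-generated names (f0 … f4); the StringIO data below is a two-row stand-in for the real file:

```python
import io
import numpy as np

# Two rows of the question's data as an in-memory stand-in for the file
csv = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

data = np.loadtxt(io.StringIO(csv), dtype="float," * 4 + "U20", delimiter=",")

sepal_length = data['f0']   # first float column, dtype float64
species = data['f4']        # the string column, dtype <U20
```

Field access keeps each column's native dtype, which is the main advantage over an all-object 2-D array.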
Another method, using pandas, gives you an object array, but this slows down further computations:
In [336]: pd.read_csv('data.txt',header=None).values
Out[336]:
array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
[4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
[4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
[4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
[5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
[5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
[4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
[5.0, 3.4, 1.5, 0.2, 'Iris-setosa']], dtype=object)
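If you start from pandas anyway, DataFrame.to_records converts the frame into a structured (record) array with one dtype per column instead of dtype=object; in this sketch the numeric columns come out as float64 while the species column stays object:

```python
import io
import numpy as np
import pandas as pd

csv = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"
df = pd.read_csv(io.StringIO(csv), header=None)

# Structured (record) array: per-column dtypes instead of one big object array
rec = df.to_records(index=False)
```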

Related

Storing multiple arrays in a np.zeros or np.ones

I'm trying to initialize a dummy array of length n using np.zeros(n) with dtype=object. I want to use this dummy array to store n copies of another array of length m.
I'm trying to avoid a for loop to set the values at each index.
I tried the code below but keep getting an error:
temp = np.zeros(10, dtype=object)
arr = np.array([1.1,1.2,1.3,1.4,1.5])
res = temp * arr
The desired result should be -
np.array([[1.1,1.2,1.3,1.4,1.5], [1.1,1.2,1.3,1.4,1.5], ... 10 copies])
I keep getting the error -
operands could not be broadcast together with shapes (10,) (5,)
I understand that this error arises because NumPy thinks I'm trying to broadcast-multiply those arrays.
So how do I achieve the task?
np.tile() is a built-in function that repeats a given array according to a reps argument. To stack 10 copies as rows, pass a tuple so the repetition adds a new axis:
res = np.tile(arr, (10, 1))
>>> arr = np.array([1.1,1.2,1.3,1.4,1.5])
>>> arr
array([1.1, 1.2, 1.3, 1.4, 1.5])
>>> np.array([arr]*10)
array([[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5],
[1.1, 1.2, 1.3, 1.4, 1.5]])
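If the 10 copies are only ever read, np.broadcast_to gives the same 10x5 shape without copying any data, at the cost of the result being a read-only view; a minimal sketch:

```python
import numpy as np

arr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])

tiled = np.tile(arr, (10, 1))          # 10 stacked copies (real memory)
view = np.broadcast_to(arr, (10, 5))   # same shape, zero copies, read-only
```

The view is appropriate as a constant input to further computations; use tile (or np.array([arr]*10)) when the rows must be writable independently.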

How to use a different colormap for different rows of a heatmap

I am trying to change one row in my heatmap to a different color.
Here is the dataset:
m = np.array([[ 0.7, 1.4, 0.2, 1.5, 1.7, 1.2, 1.5, 2.5],
[ 1.1, 2.5, 0.4, 1.7, 2. , 2.4, 2. , 3.2],
[ 0.9, 4.4, 0.7, 2.3, 1.6, 2.3, 2.6, 3.3],
[ 0.8, 2.1, 0.2, 1.8, 2.3, 1.9, 2. , 2.9],
[ 0.9, 1.3, 0.8, 2.2, 1.8, 2.2, 1.7, 2.8],
[ 0.7, 0.9, 0.4, 1.8, 1.4, 2.1, 1.7, 2.9],
[ 1.2, 0.9, 0.4, 2.1, 1.3, 1.2, 1.9, 2.4],
[ 6.3, 13.5, 3.1, 13.4, 12.1, 13.3, 13.4, 20. ]])
data = pd.DataFrame(data = m)
Right now, using a seaborn heatmap, I can only create something like this:
cmap = sns.diverging_palette(240, 10, as_cmap = True)
sns.heatmap(data, annot = True, cmap = "Reds")
plt.show()
I hope to change the color scheme of the last row, here is what I want to achieve (I did this in Excel):
Is it possible I achieve this in Python with seaborn heatmap? Thank you!
You can split in two, mask the unwanted parts, and plot separately:
# Reds
data1 = data.copy()
data1.loc[7] = float('nan')
ax = sns.heatmap(data1, annot=True, cmap="Reds")
# Greens
data2 = data.copy()
data2.loc[:6] = float('nan')
sns.heatmap(data2, annot=True, cmap="Greens")
output:
NB: you need to adapt the loc[…] argument to your actual index labels
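The masking step can be verified without plotting; this sketch uses iloc instead of loc so it does not depend on the index labels (assuming positional indexing fits your frame):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16.0).reshape(4, 4))  # small stand-in grid

reds = data.copy()
reds.iloc[-1] = np.nan      # hide the last row from the "Reds" layer

greens = data.copy()
greens.iloc[:-1] = np.nan   # hide everything else from the "Greens" layer

# seaborn leaves NaN cells blank, so the two layers tile the full grid:
# sns.heatmap(reds, annot=True, cmap="Reds")
# sns.heatmap(greens, annot=True, cmap="Greens")
```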

Pandas Group By column to generate quantiles (.25, 0.5, .75)

Let's say we have CityName, Min-Temperature, Max-Temperature, Humidity of different cities.
We need an output dataframe grouped on CityName and want to generate the 0.25, 0.5 and 0.75 quantiles. New column names would be OldColumnName + 'Q1'/'Q2'/'Q3'.
Example INPUT
df = pd.DataFrame({'cityName': pd.Categorical(['a','a','a','a','b','b','b','b','a','a','a','a','b','b','b','b']),
                   'MinTemp': [1.1, 2.1, 3.1, 1.1, 2, 2.1, 2.2, 2.4, 2.5, 1.11, 1.31, 2.1, 1, 2, 2.3, 2.1],
                   'MaxTemp': [2.1, 4.2, 5.1, 2.13, 4, 3.1, 5.2, 3.4, 3.5, 2.11, 2.31, 3.1, 2, 4.3, 4.3, 3.1],
                   'Humidity': [0.29, 0.19, .45, 0.1, 0.1, 0.1, 0.2, 0.5, 0.11, 0.31, 0.1, .1, .2, 0.3, 0.3, 0.1]
                   })
OUTPUT
First Approach
First, group your data on the desired column, here 'cityName'. Then, because you want several different aggregations of each column, use the agg function. Functions passed to agg cannot take extra parameters, so define small wrappers:
def quantile_25(x):
    return x.quantile(0.25)
def quantile_50(x):
    return x.quantile(0.5)
def quantile_75(x):
    return x.quantile(0.75)
quantile_df = df.groupby('cityName').agg([quantile_25, quantile_50, quantile_75])
quantile_df
Second Approach
You can use the describe method and select the statistics you need. With pd.IndexSlice you can pick which sub-level of the column MultiIndex to keep.
idx = pd.IndexSlice
df.groupby('cityName').describe().loc[:, idx[:, ['25%', '50%', '75%']]]
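A third option (an addition beyond the two approaches above) is to pass the quantile list to quantile() itself, then unstack and rename to the requested Q1/Q2/Q3 suffixes; shown here for a single column on a small stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    'cityName': ['a', 'a', 'a', 'b', 'b', 'b'],
    'MinTemp': [1.1, 2.1, 3.1, 2.0, 2.1, 2.4],
})

# quantile() accepts a list; unstack() moves the quantile level to columns
q = df.groupby('cityName')['MinTemp'].quantile([0.25, 0.5, 0.75]).unstack()
q.columns = ['MinTemp' + name for name in ('Q1', 'Q2', 'Q3')]
```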

is there a parameter to set the precision for numpy.linspace?

I am trying to check if a numpy array contains a specific value:
>>> x = np.linspace(-5,5,101)
>>> x
array([-5. , -4.9, -4.8, -4.7, -4.6, -4.5, -4.4, -4.3, -4.2, -4.1, -4. ,
-3.9, -3.8, -3.7, -3.6, -3.5, -3.4, -3.3, -3.2, -3.1, -3. , -2.9,
-2.8, -2.7, -2.6, -2.5, -2.4, -2.3, -2.2, -2.1, -2. , -1.9, -1.8,
-1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1. , -0.9, -0.8, -0.7,
-0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0. , 0.1, 0.2, 0.3, 0.4,
0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5,
1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6,
2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7,
3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8,
4.9, 5. ])
>>> -5. in x
True
>>> a = 0.2
>>> a
0.2
>>> a in x
False
I assigned a constant to the variable a. It seems that the precision of a is not compatible with the elements of the numpy array generated by np.linspace().
I've searched the docs, but didn't find anything about this.
This is not a question of the precision of np.linspace, but rather of the type of the elements in the generated array.
np.linspace generates elements which, conceptually, equally divide the input range between them. However, these elements are then stored as floating point numbers with limited precision, which makes the generation process itself appear to lack precision.
By passing the dtype argument to np.linspace, you can specify the precision of the floating point type used to store its result, which can increase the apparent precision of the generation process.
Nevertheless, you should not use the equality operator to compare floating point numbers. Instead, use np.isclose in conjunction with np.ndarray.any, or some equivalent:
>>> floats_64 = np.linspace(-5, 5, 101, dtype='float64')
>>> floats_128 = np.linspace(-5, 5, 101, dtype='float128')
>>> print(0.2 in floats_64)
False
>>> print(floats_64[52])
0.20000000000000018
>>> print(np.isclose(0.2, floats_64).any()) # check if any element in floats_64 is close to 0.2
True
>>> print(0.2 in floats_128)
False
>>> print(floats_128[52])
0.20000000000000017764
>>> print(np.isclose(0.2, floats_128).any()) # check if any element in floats_128 is close to 0.2
True
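To also find where the near-match sits, combine isclose with argmin of the absolute difference; a small sketch:

```python
import numpy as np

x = np.linspace(-5, 5, 101)

exact = 0.2 in x                     # equality check: fails
tolerant = np.isclose(x, 0.2).any()  # tolerance-based check: succeeds
idx = np.abs(x - 0.2).argmin()       # index of the element closest to 0.2
```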

How can a Python list be sliced such that a column is moved to being a separate element column?

I have a list of the following form:
[[0, 5.1, 3.5, 1.4, 0.2],
[0, 4.9, 3.0, 1.4, 0.2],
[0, 4.7, 3.2, 1.3, 0.2],
[1, 4.6, 3.1, 1.5, 0.2],
[1, 5.0, 3.6, 1.4, 0.2],
[1, 5.4, 3.9, 1.7, 0.4],
[1, 4.6, 3.4, 1.4, 0.3]]
I want to slice out the first column and insert it as a new element after each row of data (so at each odd position in the list), changing it to the following form:
[[5.1, 3.5, 1.4, 0.2], [0],
[4.9, 3.0, 1.4, 0.2], [0],
[4.7, 3.2, 1.3, 0.2], [0],
[4.6, 3.1, 1.5, 0.2], [1],
[5.0, 3.6, 1.4, 0.2], [1],
[5.4, 3.9, 1.7, 0.4], [1],
[4.6, 3.4, 1.4, 0.3], [1]]
How could I do this?
So far, I have extracted the necessary information in the following ways:
targets = [element[0] for element in dataset]
features = dataset[1:]
Try indexing and then flatten the list; I used a list comprehension for the flattening.
>>> l = [[0, 5.1, 3.5, 1.4, 0.2],
...      [0, 4.9, 3.0, 1.4, 0.2],
...      [0, 4.7, 3.2, 1.3, 0.2],
...      [1, 4.6, 3.1, 1.5, 0.2],
...      [1, 5.0, 3.6, 1.4, 0.2],
...      [1, 5.4, 3.9, 1.7, 0.4],
...      [1, 4.6, 3.4, 1.4, 0.3]]
>>> [[i[1:], [i[0]]] for i in l]  # get sliced list of lists
[[[5.1, 3.5, 1.4, 0.2], [0]], [[4.9, 3.0, 1.4, 0.2], [0]], [[4.7, 3.2, 1.3, 0.2], [0]], [[4.6, 3.1, 1.5, 0.2], [1]], [[5.0, 3.6, 1.4, 0.2], [1]], [[5.4, 3.9, 1.7, 0.4], [1]], [[4.6, 3.4, 1.4, 0.3], [1]]]
>>> d = [[i[1:], [i[0]]] for i in l]
>>> [item for sublist in d for item in sublist]  # flatten list d
[[5.1, 3.5, 1.4, 0.2], [0], [4.9, 3.0, 1.4, 0.2], [0], [4.7, 3.2, 1.3, 0.2], [0], [4.6, 3.1, 1.5, 0.2], [1], [5.0, 3.6, 1.4, 0.2], [1], [5.4, 3.9, 1.7, 0.4], [1], [4.6, 3.4, 1.4, 0.3], [1]]
A one-liner alternative:
[item for sublist in [[i[1:], [i[0]]] for i in l] for item in sublist]  # here l is that list
List comprehensions are nice but can be a bit hard to scan. Loops are still useful, especially when combined with extend:
res = []
for entry in dataset:
    res.extend([entry[1:], entry[:1]])
now:
import pprint
pprint.pprint(res)
prints:
[[5.1, 3.5, 1.4, 0.2],
[0],
[4.9, 3.0, 1.4, 0.2],
[0],
[4.7, 3.2, 1.3, 0.2],
[0],
[4.6, 3.1, 1.5, 0.2],
[1],
[5.0, 3.6, 1.4, 0.2],
[1],
[5.4, 3.9, 1.7, 0.4],
[1],
[4.6, 3.4, 1.4, 0.3],
[1]]
Try this (where a is the input list):
from itertools import chain
print(list(chain(*[[element[1:], [element[0]]] for element in a])))
Output:
[[5.1, 3.5, 1.4, 0.2], [0], [4.9, 3.0, 1.4, 0.2], [0],
[4.7, 3.2, 1.3, 0.2], [0], [4.6, 3.1, 1.5, 0.2], [1],
[5.0, 3.6, 1.4, 0.2], [1], [5.4, 3.9, 1.7, 0.4], [1],
[4.6, 3.4, 1.4, 0.3], [1]]
Slice each sublist and make a new list with an element for each slice:
l = [[0, 5.1, 3.5, 1.4, 0.2],
[0, 4.9, 3.0, 1.4, 0.2],
[0, 4.7, 3.2, 1.3, 0.2],
[1, 4.6, 3.1, 1.5, 0.2],
[1, 5.0, 3.6, 1.4, 0.2],
[1, 5.4, 3.9, 1.7, 0.4],
[1, 4.6, 3.4, 1.4, 0.3]]
>>> print(*[item for sub in l for item in (sub[1:], [sub[0]])], sep='\n')
[5.1, 3.5, 1.4, 0.2]
[0]
[4.9, 3.0, 1.4, 0.2]
[0]
[4.7, 3.2, 1.3, 0.2]
[0]
[4.6, 3.1, 1.5, 0.2]
[1]
[5.0, 3.6, 1.4, 0.2]
[1]
[5.4, 3.9, 1.7, 0.4]
[1]
[4.6, 3.4, 1.4, 0.3]
[1]
A Pythonic approach in Python 3.x, using iterable unpacking and itertools.chain:
>>> from itertools import chain
>>>
>>> list(chain.from_iterable([[j,[i]] for i,*j in A]))
[[5.1, 3.5, 1.4, 0.2], [0],
[4.9, 3.0, 1.4, 0.2], [0],
[4.7, 3.2, 1.3, 0.2], [0],
[4.6, 3.1, 1.5, 0.2], [1],
[5.0, 3.6, 1.4, 0.2], [1],
[5.4, 3.9, 1.7, 0.4], [1],
[4.6, 3.4, 1.4, 0.3], [1]]
