Compare current row with next row in a DataFrame with pandas - python

I have a DataFrame called "DataExample" and a list called "normalsizes", sorted in ascending order.
import pandas as pd

if __name__ == "__main__":

    DataExample = [[0.6, 0.36, 0.00],
                   [0.6, 0.36, 0.00],
                   [0.9, 0.81, 0.85],
                   [0.8, 0.64, 0.91],
                   [1.0, 1.00, 0.92],
                   [1.0, 1.00, 0.95],
                   [0.9, 0.81, 0.97],
                   [1.2, 1.44, 0.97],
                   [1.0, 1.00, 0.97],
                   [1.0, 1.00, 0.99],
                   [1.2, 1.44, 0.99],
                   [1.1, 1.21, 0.99]]
    DataExample = pd.DataFrame(data=DataExample, columns=['Lx', 'A', 'Ratio'])

    normalsizes = [0, 0.75, 1, 1.25, 1.5, 1.75, 2, 2.25, 2.4, 2.5, 2.75, 3,
                   3.25, 3.5, 3.75, 4, 4.25, 4.5, 4.75, 5, 5.25, 5.5, 5.75, 6]

    # for i in DataExample.index:
    #     numb = DataExample['Lx'][i]
What I am looking for is that each value of DataExample['Lx'] is analyzed and located within an interval of normalsizes. For example:
DataExample['Lx'][0] = 0.6 falls in the interval [0, 0.75] (0.6 > 0 and 0.6 <= 0.75), so I take the largest value of that interval, i.e. 0.75. This applies to each row.
With this I should have the following result:
Lx    A     Ratio
1     0.36  0
1     0.36  0
1     0.81  0.85
1     0.64  0.91
1.25  1     0.92
1.25  1     0.95
1     0.81  0.97
1.25  1.44  0.97
1.25  1     0.97
1.25  1     0.99
1.25  1.44  0.99
1.25  1.21  0.99

numpy.searchsorted will get you what you want:

import numpy as np

normalsizes = np.array(normalsizes)  # convert to a numpy array
DataExample["Lx"] = normalsizes[np.searchsorted(normalsizes, DataExample["Lx"])]
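A minimal runnable sketch of this (using just the Lx column and the first few normalsizes, which cover all the sample values):

```python
import numpy as np
import pandas as pd

DataExample = pd.DataFrame(
    {'Lx': [0.6, 0.6, 0.9, 0.8, 1.0, 1.0, 0.9, 1.2, 1.0, 1.0, 1.2, 1.1]})
normalsizes = np.array([0, 0.75, 1, 1.25, 1.5])

# The default side='left' maps a value that equals a bound to that bound
# itself, matching the "greater than the lower bound, at most equal to the
# upper bound" rule from the question (so 1.0 stays 1.0, and 0.6 becomes 0.75)
DataExample['Lx'] = normalsizes[np.searchsorted(normalsizes, DataExample['Lx'])]
print(DataExample['Lx'].tolist())
# [0.75, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.25, 1.0, 1.0, 1.25, 1.25]
```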


Creating a List and maintaining integer value

I am fairly new to Python.
I am trying to convert a DataFrame to a list after changing the datatype of one column to integer. The funny thing is that when it is converted to a list, that column still holds floats.
There are three columns in the DataFrame; the first two are float and I want the last one to be integer, but it still comes out as float.
If I change all columns to integer, then the list is created with integers.
0 1.53 3.13 0.0
1 0.58 2.83 0.0
2 0.28 2.69 0.0
3 1.14 2.14 0.0
4 1.46 3.39 0.0
... ... ... ...
495 2.37 0.93 1.0
496 2.85 0.52 1.0
497 2.35 0.39 1.0
498 2.96 1.68 1.0
499 2.56 0.16 1.0
Above is the DataFrame. Below I convert the last column:

# convert last column to integer datatype
data[6] = data[6].astype(dtype='int64')
display(data.dtypes)
Below I convert the DataFrame to a list:
#Turn DF to list
data_to_List = data.values.tolist()
data_to_List
#below is what is shown now.
[[1.53, 3.13, 0.0],
[0.58, 2.83, 0.0],
[0.28, 2.69, 0.0],
[1.14, 2.14, 0.0],
[3.54, 0.75, 1.0],
[3.04, 0.15, 1.0],
[2.49, 0.15, 1.0],
[2.27, 0.39, 1.0],
[3.65, 1.5, 1.0],
I want the last column to be just 0 or 1 and not 0.0 or 1.0
Yes, you are correct: pandas converts the ints back to floats when you use data.values, because .values returns a single numpy array and the mixed int/float columns are upcast to a common float dtype.
You can convert the floats back to int by using the list comprehension below:
data_to_List = [[x[0],x[1],int(x[2])] for x in data.values.tolist()]
print(data_to_List)
[[1.53, 3.13, 0],
[0.58, 2.83, 0],
[0.28, 2.69, 0],
[1.14, 2.14, 0],
[1.46, 3.39, 0]]
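An alternative sketch (not from the original answer): cast the frame to object dtype before taking .values, so each column keeps its own type instead of being upcast to float64:

```python
import pandas as pd

# Small frame mirroring the question: two float columns plus one int column
data = pd.DataFrame({0: [1.53, 0.58], 1: [3.13, 2.83], 6: [0.0, 1.0]})
data[6] = data[6].astype('int64')

# .values on mixed dtypes upcasts everything to float64; casting the frame
# to object dtype first lets each cell keep its column's type
data_to_List = data.astype(object).values.tolist()
```

The last element of each row now compares equal to a plain 0 or 1 integer rather than 0.0 or 1.0.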

Sensor Data Sampling Frequency Mismatch

I have sensor data captured at different frequencies (this is data I've invented to simplify the operation). I want to resample the voltage data by increasing the number of data points and interpolate them so I have 16 instead of 12.
Pandas has a resample/upsample function but I can only find examples where people have gone from weekly data to daily data (adding 6 daily data points by interpolation between two weekly data points).
time (pressure)    pressure
0.05               1
0.1                1.1
0.15               1.2
0.2                1.3
0.25               1.4
0.3                1.5
0.35               1.6
0.4                1.7
0.45               1.8
0.5                1.9
0.55               2
0.6                2.1
0.65               2.2
0.7                2.3
0.75               2.4
0.8                2.5

time (voltage)    voltage
0.07              2.2
0.14              2.5
0.21              2.8
0.28              3.1
0.35              3.4
0.42              3.7
0.49              4
0.56              4.3
0.63              4.6
0.7               4.9
0.77              5.2
0.84              5.5
I would like my voltage to have 16 samples instead of 12 with the missing values interpolated. Thanks!
Let's assume two Series, "pressure" and "voltage":
pressure = pd.Series({0.05: 1.0, 0.1: 1.1, 0.15: 1.2, 0.2: 1.3, 0.25: 1.4, 0.3: 1.5, 0.35: 1.6, 0.4: 1.7, 0.45: 1.8,
0.5: 1.9, 0.55: 2.0, 0.6: 2.1, 0.65: 2.2, 0.7: 2.3, 0.75: 2.4, 0.8: 2.5}, name='pressure')
voltage = pd.Series({0.07: 2.2, 0.14: 2.5, 0.21: 2.8, 0.28: 3.1, 0.35: 3.4, 0.42: 3.7,
0.49: 4.0, 0.56: 4.3, 0.63: 4.6, 0.7: 4.9, 0.77: 5.2, 0.84: 5.5}, name='voltage')
You can either use pandas.merge_asof:
pd.merge_asof(pressure, voltage, left_index=True, right_index=True)
or pandas.concat+interpolate:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.apply(pd.Series.interpolate)
#.plot(x='pressure', y='voltage', marker='o') # uncomment to plot
)
Finally, to interpolate only on voltage, drop NAs on pressure first:
(pd.concat([pressure, voltage], axis=1)
.sort_index()
.dropna(subset=['pressure'])
.apply(pd.Series.interpolate)
)
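For completeness, here is a runnable sketch of that last approach, rebuilding the two series programmatically with the values from the question (the rounding keeps the shared timestamps 0.35 and 0.7 identical in both indexes):

```python
import pandas as pd

# pressure sampled every 0.05 s, voltage every 0.07 s, as in the question
pressure = pd.Series([round(1.0 + 0.1 * i, 1) for i in range(16)],
                     index=[round(0.05 * (i + 1), 2) for i in range(16)],
                     name='pressure')
voltage = pd.Series([round(2.2 + 0.3 * i, 1) for i in range(12)],
                    index=[round(0.07 * (i + 1), 2) for i in range(12)],
                    name='voltage')

out = (pd.concat([pressure, voltage], axis=1)
       .sort_index()
       .dropna(subset=['pressure'])   # keep only the 16 pressure timestamps
       .apply(pd.Series.interpolate)  # fill voltage at those timestamps
       )
print(len(out))  # 16
```

Note two caveats: interpolate's default method='linear' treats the samples as equally spaced (use method='index' to weight by the actual timestamps), and leading NaNs are not filled, so the voltage values before the first voltage sample at 0.35 remain NaN.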

issue when loading a data file with numpy

I want to train a classifier with scikit-learn, but to do so I first need to load the corresponding data. I am using the following data file, available at:
https://archive.ics.uci.edu/ml/machine-learning-databases/yeast/
When I open it in Word it has the following contents:
ADT1_YEAST 0.58 0.61 0.47 0.13 0.50 0.00 0.48 0.22 MIT
ADT2_YEAST 0.43 0.67 0.48 0.27 0.50 0.00 0.53 0.22 MIT
ADT3_YEAST 0.64 0.62 0.49 0.15 0.50 0.00 0.53 0.22 MIT
AAR2_YEAST 0.58 0.44 0.57 0.13 0.50 0.00 0.54 0.22 NUC
Each field is separated by two spaces and each line ends with a carriage return.
I want to read it with the following command:
f=open("yeast.data")
data = np.loadtxt(f,delimiter=" ")
and at the end I want to be able to use the following:
X = data[:,:-1] # select all columns except the last
y = data[:, -1] # select the last column
for using:
X_train, X_test, y_train, y_test = train_test_split(X, y)
but when I try to read it the following error appears:
ValueError: could not convert string to float: ADT1_YEAST
so how can I read this file in Python to use later the MLPClassifier?
Thanks
You can skip the f=open(...), and you can use dtype='O' to make sure numpy reads it as a mix of numericals and strings. Because of some inconsistencies in the structure of the file you linked, it's best to use genfromtxt instead of loadtxt:
data = np.genfromtxt('yeast.data', dtype='O')
>>> data
array([[b'ADT1_YEAST', b'0.58', b'0.61', ..., b'0.48', b'0.22', b'MIT'],
[b'ADT2_YEAST', b'0.43', b'0.67', ..., b'0.53', b'0.22', b'MIT'],
[b'ADT3_YEAST', b'0.64', b'0.62', ..., b'0.53', b'0.22', b'MIT'],
...,
[b'ZNRP_YEAST', b'0.67', b'0.57', ..., b'0.56', b'0.22', b'ME2'],
[b'ZUO1_YEAST', b'0.43', b'0.40', ..., b'0.53', b'0.39', b'NUC'],
[b'G6PD_YEAST', b'0.65', b'0.54', ..., b'0.53', b'0.22', b'CYT']], dtype=object)
>>> data.shape
(1484, 10)
You can change the dtypes when you call genfromtxt (see documentation), or you can change them manually after like this:
data[:,0] = data[:,0].astype(str)
data[:,1:-1]= data[:,1:-1].astype(float)
data[:,-1] = data[:,-1].astype(str)
>>> data
array([['ADT1_YEAST', 0.58, 0.61, ..., 0.48, 0.22, 'MIT'],
['ADT2_YEAST', 0.43, 0.67, ..., 0.53, 0.22, 'MIT'],
['ADT3_YEAST', 0.64, 0.62, ..., 0.53, 0.22, 'MIT'],
...,
['ZNRP_YEAST', 0.67, 0.57, ..., 0.56, 0.22, 'ME2'],
['ZUO1_YEAST', 0.43, 0.4, ..., 0.53, 0.39, 'NUC'],
['G6PD_YEAST', 0.65, 0.54, ..., 0.53, 0.22, 'CYT']], dtype=object)
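A self-contained sketch of the end goal (splitting into X and y for train_test_split), using two sample lines from the question and dtype=str to sidestep the bytes handling shown above:

```python
import io
import numpy as np

# Two sample lines in the whitespace-delimited format of yeast.data
sample = io.StringIO(
    "ADT1_YEAST  0.58  0.61  0.47  0.13  0.50  0.00  0.48  0.22  MIT\n"
    "AAR2_YEAST  0.58  0.44  0.57  0.13  0.50  0.00  0.54  0.22  NUC\n"
)
data = np.genfromtxt(sample, dtype=str)  # every field parsed as a string

X = data[:, 1:-1].astype(float)  # the eight numeric feature columns
y = data[:, -1]                  # the class label in the last column
print(X.shape)  # (2, 8)
```

The protein name in column 0 is dropped here, since it is an identifier rather than a feature.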

Regrid numpy array based on cell area

import numpy as np
from skimage.measure import block_reduce
arr = np.random.random((6, 6))
area_cell = np.random.random((6, 6))
block_reduce(arr, block_size=(2, 2), func=np.ma.mean)
I would like to regrid a numpy array arr from 6 x 6 size to 3 x 3. Using the skimage function block_reduce for this.
However, block_reduce assumes each grid cell has the same size. How can I solve this problem when each grid cell has a different size? In this case the size of each grid cell is given by the numpy array area_cell.
-- EDIT:
An example:
arr
0.25 0.58 0.69 0.74
0.49 0.11 0.10 0.41
0.43 0.76 0.65 0.79
0.72 0.97 0.92 0.09
If all elements of area_cell were 1, and we were to convert 4 x 4 arr into 2 x 2, result would be:
0.36 0.48
0.72 0.61
However, if area_cell is as follows:
0.00 1.00 1.00 0.00
0.00 1.00 0.00 0.50
0.20 1.00 0.80 0.80
0.00 0.00 1.00 1.00
Then, result becomes:
0.17 0.22
0.21 0.54
It seems you are still reducing by blocks, but after scaling arr with area_cell. So, you just need to perform element-wise multiplication between these two arrays and use the same block_reduce code on that product array, like so -
block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Alternatively, we can simply use np.mean after reshaping to a 4D version of the product array, like so -
m,n = arr.shape
out = (arr*area_cell).reshape(m//2,2,n//2,2).mean(axis=(1,3))
Sample run -
In [21]: arr
Out[21]:
array([[ 0.25, 0.58, 0.69, 0.74],
[ 0.49, 0.11, 0.1 , 0.41],
[ 0.43, 0.76, 0.65, 0.79],
[ 0.72, 0.97, 0.92, 0.09]])
In [22]: area_cell
Out[22]:
array([[ 0. , 1. , 1. , 0. ],
[ 0. , 1. , 0. , 0.5],
[ 0.2, 1. , 0.8, 0.8],
[ 0. , 0. , 1. , 1. ]])
In [23]: block_reduce(arr*area_cell, block_size=(2, 2), func=np.ma.mean)
Out[23]:
array([[ 0.1725 , 0.22375],
[ 0.2115 , 0.5405 ]])
In [24]: m,n = arr.shape
In [25]: (arr*area_cell).reshape(m//2,2,n//2,2).mean(axis=(1,3))
Out[25]:
array([[ 0.1725 , 0.22375],
[ 0.2115 , 0.5405 ]])

How to generate a clean x and y axis for a numpy matrix?

I am creating a distance matrix in numpy, with an output like this:
['H', 'B', 'D', 'A', 'I', 'C', 'F']
[[ 0. 2.4 6.1 3.2 5.2 3.9 7.1]
[ 2.4 0. 4.1 1.2 3.2 1.9 5.1]
[ 6.1 4.1 0. 3.1 6.9 2.8 5.2]
[ 3.2 1.2 3.1 0. 4. 0.9 4.1]
[ 5.2 3.2 6.9 4. 0. 4.7 7.9]
[ 3.9 1.9 2.8 0.9 4.7 0. 3.8]
[ 7.1 5.1 5.2 4.1 7.9 3.8 0. ]]
I am printing that x axis by just printing a list before I print the actual matrix, a:
print " ", names
print a
I need the axis in that order, as the list 'names' properly orders the variables with their values in the matrix. But how would I be able to get a similar y axis in numpy?
It is not so pretty, but this pretty-table printer works:
import numpy as np

names = np.array(['H', 'B', 'D', 'A', 'I', 'C', 'F'])
a = np.array([[0. , 2.4, 6.1, 3.2, 5.2, 3.9, 7.1],
              [2.4, 0. , 4.1, 1.2, 3.2, 1.9, 5.1],
              [6.1, 4.1, 0. , 3.1, 6.9, 2.8, 5.2],
              [3.2, 1.2, 3.1, 0. , 4. , 0.9, 4.1],
              [5.2, 3.2, 6.9, 4. , 0. , 4.7, 7.9],
              [3.9, 1.9, 2.8, 0.9, 4.7, 0. , 3.8],
              [7.1, 5.1, 5.2, 4.1, 7.9, 3.8, 0. ]])

def pptable(x_axis, y_axis, table):
    def format_field(field, fmt='{:,.2f}'):
        if type(field) is str: return field
        if type(field) is tuple: return field[1].format(field[0])
        return fmt.format(field)
    def get_max_col_w(table, index):
        return max([len(format_field(row[index])) for row in table])
    for i, l in enumerate(table):
        l.insert(0, y_axis[i])
    x_axis.insert(0, ' ')
    table.insert(0, x_axis)
    col_paddings = [get_max_col_w(table, i) for i in range(len(table[0]))]
    for i, row in enumerate(table):
        # left col
        row_tab = [str(row[0]).ljust(col_paddings[0])]
        # rest of the cols
        row_tab += [format_field(row[j]).rjust(col_paddings[j])
                    for j in range(1, len(row))]
        print(' '.join(row_tab))

x_axis = ['x{}'.format(c) for c in names]
y_axis = ['y{}'.format(c) for c in names]
pptable(x_axis, y_axis, a.tolist())
Prints:
xH xB xD xA xI xC xF
yH 0.00 2.40 6.10 3.20 5.20 3.90 7.10
yB 2.40 0.00 4.10 1.20 3.20 1.90 5.10
yD 6.10 4.10 0.00 3.10 6.90 2.80 5.20
yA 3.20 1.20 3.10 0.00 4.00 0.90 4.10
yI 5.20 3.20 6.90 4.00 0.00 4.70 7.90
yC 3.90 1.90 2.80 0.90 4.70 0.00 3.80
yF 7.10 5.10 5.20 4.10 7.90 3.80 0.00
If you want the X and Y axis to be the same, just call it with two lists of the same labels.
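If pandas is available, an alternative sketch (not part of the original answer) is to wrap the matrix in a DataFrame, which labels both axes for free:

```python
import numpy as np
import pandas as pd

names = ['H', 'B', 'D', 'A', 'I', 'C', 'F']
a = np.array([[0. , 2.4, 6.1, 3.2, 5.2, 3.9, 7.1],
              [2.4, 0. , 4.1, 1.2, 3.2, 1.9, 5.1],
              [6.1, 4.1, 0. , 3.1, 6.9, 2.8, 5.2],
              [3.2, 1.2, 3.1, 0. , 4. , 0.9, 4.1],
              [5.2, 3.2, 6.9, 4. , 0. , 4.7, 7.9],
              [3.9, 1.9, 2.8, 0.9, 4.7, 0. , 3.8],
              [7.1, 5.1, 5.2, 4.1, 7.9, 3.8, 0. ]])

# Row and column labels come straight from the names list, in order
df = pd.DataFrame(a, index=names, columns=names)
print(df)
```

Printing the DataFrame shows the same names along both the x and y axes, and df.loc['H', 'B'] looks up a distance by label.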
