Pandas Group By column to generate quantiles (.25, 0.5, .75)

Pandas Group By column to generate quantiles (.25, 0.5, .75) - python

Let's say we have CityName, Min-Temperature, Max-Temperature, Humidity of different cities.
We need an output dataframe grouped on CityName and want to generate 0.25, 0.5 and 0.75 quantiles. New column names would be OldColunmName + ('Q1)/('Q2')/('Q3').
Example INPUT
df = pd.DataFrame({'cityName': pd.Categorical(['a','a','a','a','b','b','b','b','a','a','a','a','b','b','b','b']),
'MinTemp': [1.1, 2.1, 3.1, 1.1, 2, 2.1, 2.2, 2.4, 2.5, 1.11, 1.31, 2.1, 1, 2, 2.3, 2.1],
'MaxTemp': [2.1, 4.2, 5.1, 2.13, 4, 3.1, 5.2, 3.4, 3.5, 2.11, 2.31, 3.1, 2, 4.3, 4.3, 3.1],
'Humidity': [0.29, 0.19, .45, 0.1, 0.1, 0.1, 0.2, 0.5, 0.11, 0.31, 0.1, .1, .2, 0.3, 0.3, 0.1]
})
OUTPUT

First Approach
First you have to group your data on the column you want which is 'cityName'. Then, because on each column you want to do multiple and different kinds of aggregations, you can use 'agg' function. For functions in the 'agg', you cannot give parameters so you define them as follow:
def quantile_50(x):
return x.quantile(0.5)
def quantile_25(x):
return x.quantile(0.25)
def quantile_75(x):
return x.quantile(0.75)
quantile_df = df.groupby('cityName').agg([quantile_25, quantile_50, quantile_75])
quantile_df
Second Approach
You can use describe method and select the statistics you need. By using idx you can choose which subindex to choose.
idx = pd.IndexSlice
df.groupby('cityName').describe().loc[:, idx[:, ['25%', '50%', '75%']]]

Related

How to use a different colormap for different rows of a heatmap

I am trying to change 1 row in my heatmap to a different color
here is the dataset:
m = np.array([[ 0.7, 1.4, 0.2, 1.5, 1.7, 1.2, 1.5, 2.5],
[ 1.1, 2.5, 0.4, 1.7, 2. , 2.4, 2. , 3.2],
[ 0.9, 4.4, 0.7, 2.3, 1.6, 2.3, 2.6, 3.3],
[ 0.8, 2.1, 0.2, 1.8, 2.3, 1.9, 2. , 2.9],
[ 0.9, 1.3, 0.8, 2.2, 1.8, 2.2, 1.7, 2.8],
[ 0.7, 0.9, 0.4, 1.8, 1.4, 2.1, 1.7, 2.9],
[ 1.2, 0.9, 0.4, 2.1, 1.3, 1.2, 1.9, 2.4],
[ 6.3, 13.5, 3.1, 13.4, 12.1, 13.3, 13.4, 20. ]])
data = pd.DataFrame(data = m)
Right now I am using seaborn heatmap, I can only create something like this:
cmap = sns.diverging_palette(240, 10, as_cmap = True)
sns.heatmap(data, annot = True, cmap = "Reds")
plt.show
I hope to change the color scheme of the last row, here is what I want to achieve (I did this in Excel):
Is it possible I achieve this in Python with seaborn heatmap? Thank you!

You can split in two, mask the unwanted parts, and plot separately:
# Reds
data1 = data.copy()
data1.loc[7] = float('nan')
ax = sns.heatmap(data1, annot=True, cmap="Reds")
# Greens
data2 = data.copy()
data2.loc[:6] = float('nan')
sns.heatmap(data2, annot=True, cmap="Greens")
output:
NB. you need to adapt the loc[…] parameter to your actual index names

Normalize with respect to row and column

I have an array of probabilities. I would like the columns to sum to 1 (representing probability) and the rows to sum to X (where X is an integer, say 9 for example).
I thought that I could normalize the columns, and then normalize the rows and times by X. But this didn't work, the resulting sums of the rows and columns were not perfectly 1.0 and X.
This is what I tried:
# B is 5 rows by 30 columns
# Normalizing columns to 1.0
col_sum = []
for col in B.T:
col_sum.append(sum(col))
for row in range(B.shape[0]):
for col in range(B.shape[1]):
if B[row][col] != 0.0 and B[row][col] != 1.0:
B[row][col] = (B[row][col] / col_sum[col])
# Normalizing rows to X (9.0)
row_sum = []
for row in B:
row_sum.append(sum(row))
for row in range(B.shape[0]):
for col in range(B.shape[1]):
if B[row][col] != 0.0 and B[row][col] != 1.0:
B[row][col] = (B[row][col] / row_sum[row]) * 9.0

I'm not sure if I understood correctly, but it seems like what you're trying to accomplish might mathematically not be feasible?
Imagine you have a 2x2 matrix where you want the rows to sum up to 1 and the columns to 10. Even if you made all the numbers in the columns 1 (their max possible value) you would still not be able to sum them up to 10 in their columns?

This can only work if your matrix's number of columns is X times the number of rows. For example, if X = 3 and you have 5 rows, then you must have 15 columns. So, you could make your 5x30 matrix work for X=6 but not X=9.
The reason for this is that, if each column sums up to 1.0, the total of all values in the matrix will be 1.0 times the number of columns. And since you want each row to sum up to X, then the total of all values must also be X times the number of rows.
So: Columns * 1.0 = X * Rows
If that constraint is met, you only have to adjust all values proportionally to X/sum(row) and both dimensions will work automatically unless the initial values are not properly balanced. If the matrix is not already balanced, adjusting the values would be similar to solving a sudoku (allegedly an NP problem) and the result would largely be unrelated to the initial values. The matrix is balanced when all rows, adjusted to have the same sum, result in all columns having the same sum.
[0.7, 2.1, 1.4, 0.7, 1.4, 1.4, 0.7, 1.4, 1.4, 2.1, 0.7, 2.1, 1.4, 2.1, 1.4] 21
[2.8, 1.4, 0.7, 2.1, 1.4, 2.1, 0.7, 1.4, 2.1, 1.4, 0.7, 0.7, 1.4, 0.7, 1.4] 21
[1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 0.7, 2.8, 0.7, 0.7, 1.4, 2.1] 21
[1.4, 1.4, 1.4, 1.4, 2.1, 1.4, 1.4, 1.4, 0.7, 0.7, 2.1, 1.4, 1.4, 1.4, 1.4] 21
[0.7, 0.7, 2.1, 1.4, 0.7, 0.7, 2.8, 1.4, 1.4, 2.1, 0.7, 2.1, 2.1, 1.4, 0.7] 21
apply x = x * 3 / 21 to all elements ...
[0.1, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.2, 0.2, 0.3, 0.1, 0.3, 0.2, 0.3, 0.2] 3.0
[0.4, 0.2, 0.1, 0.3, 0.2, 0.3, 0.1, 0.2, 0.3, 0.2, 0.1, 0.1, 0.2, 0.1, 0.2] 3.0
[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.4, 0.1, 0.1, 0.2, 0.3] 3.0
[0.2, 0.2, 0.2, 0.2, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.3, 0.2, 0.2, 0.2, 0.2] 3.0
[0.1, 0.1, 0.3, 0.2, 0.1, 0.1, 0.4, 0.2, 0.2, 0.3, 0.1, 0.3, 0.3, 0.2, 0.1] 3.0
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

is there a parameter to set the precision for numpy.linspace?

I am trying to check if a numpy array contains a specific value:
>>> x = np.linspace(-5,5,101)
>>> x
array([-5. , -4.9, -4.8, -4.7, -4.6, -4.5, -4.4, -4.3, -4.2, -4.1, -4. ,
-3.9, -3.8, -3.7, -3.6, -3.5, -3.4, -3.3, -3.2, -3.1, -3. , -2.9,
-2.8, -2.7, -2.6, -2.5, -2.4, -2.3, -2.2, -2.1, -2. , -1.9, -1.8,
-1.7, -1.6, -1.5, -1.4, -1.3, -1.2, -1.1, -1. , -0.9, -0.8, -0.7,
-0.6, -0.5, -0.4, -0.3, -0.2, -0.1, 0. , 0.1, 0.2, 0.3, 0.4,
0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4, 1.5,
1.6, 1.7, 1.8, 1.9, 2. , 2.1, 2.2, 2.3, 2.4, 2.5, 2.6,
2.7, 2.8, 2.9, 3. , 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7,
3.8, 3.9, 4. , 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8,
4.9, 5. ])
>>> -5. in x
True
>>> a = 0.2
>>> a
0.2
>>> a in x
False
I assigned a constant to variable a. It seems that the precision of a is not compatible with the elements in the numpy array generated by np.linspace().
I've searched the docs, but didn't find anything about this.

This is not a question of the precision of np.linspace, but rather of the type of the elements in the generated array.
np.linspace generates elements which, conceptually, equally divide the input range between them. However, these elements are then stored as floating point numbers with limited precision, which makes the generation process itself appear to lack precision.
By passing the dtype argument to np.linspace, you can specify the precision of the floating point type used to store its result, which can increase the apparent precision of the generation process.
Nevertheless, you should not use the equality operator to compare floating point numbers. Instead, use np.isclose in conjunction with np.ndarray.any, or some equivalent:
>>> floats_64 = np.linspace(-5, 5, 101, dtype='float64')
>>> floats_128 = np.linspace(-5, 5, 101, dtype='float128')
>>> print(0.2 in floats_64)
False
>>> print(floats_64[52])
0.20000000000000018
>>> print(np.isclose(0.2, floats_64).any()) # check if any element in floats_64 is close to 0.2
True
>>> print(0.2 in floats_128)
False
>>> print(floats_128[52])
0.20000000000000017764
>>> print(np.isclose(0.2, floats_128).any()) # check if any element in floats_128 is close to 0.2
True

How to import from a data file a numpy structured array

i'm trying to create an array which has 5 columns imported from a data file. The 4 of them are floats and the last one string.
The data file looks like this:
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
I tried these:
data = np.genfromtxt(filename, dtype = "float,float,float,float,str", delimiter = ",")
data = np.loadtxt(filename, dtype = "float,float,float,float,str", delimiter = ",")
,but both codes import only the first column.
Why? How can i fix this?
Ty for your time! :)

You must specify correctly the str type : "U20" for exemple for 20 characters max :
data = np.loadtxt('data.txt', dtype = "float,"*4 + "U20", delimiter = ",")
seems to work :
array([( 5.1, 3.5, 1.4, 0.2, 'Iris-setosa'),
( 4.9, 3. , 1.4, 0.2, 'Iris-setosa'),
( 4.7, 3.2, 1.3, 0.2, 'Iris-setosa'),
( 4.6, 3.1, 1.5, 0.2, 'Iris-setosa'),
( 5. , 3.6, 1.4, 0.2, 'Iris-setosa'),
( 5.4, 3.9, 1.7, 0.4, 'Iris-setosa'),
( 4.6, 3.4, 1.4, 0.3, 'Iris-setosa'),
( 5. , 3.4, 1.5, 0.2, 'Iris-setosa')],
dtype=[('f0', '<f8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8'), ('f4', '<U20')])
An other method using pandas give you an object array, but this slow down further computations :
In [336]: pd.read_csv('data.txt',header=None).values
Out[336]:
array([[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
[4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
[4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
[4.6, 3.1, 1.5, 0.2, 'Iris-setosa'],
[5.0, 3.6, 1.4, 0.2, 'Iris-setosa'],
[5.4, 3.9, 1.7, 0.4, 'Iris-setosa'],
[4.6, 3.4, 1.4, 0.3, 'Iris-setosa'],
[5.0, 3.4, 1.5, 0.2, 'Iris-setosa']], dtype=object)

Iterating forward and backward in Python

I have a coding interface which has a counter component. It simply increments by 1 with every update. Consider it an infinite generator of {1,2,3,...} over time which I HAVE TO use.
I need to use this value and iterate from -1.5 to 1.5. So, the iteration should start from -1.5 and reach 1.5 and then from 1.5 back to -1.5.
How should I use this infinite iterator to generate an iteration in that range?

You can use cycle from itertools to repeat a sequence.
from itertools import cycle
# build the list with 0.1 increment
v = [(x-15)/10 for x in range(31)]
v = v + list(reversed(v))
cv = cycle(v)
for c in my_counter:
x = next(cv)
This will repeat the list v:
-1.5, -1.4, -1.3, -1.2, -1.1, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4,
-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7,
0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1, -0.2, -0.3, -0.4, -0.5, -0.6,
-0.7, -0.8, -0.9, -1.0, -1.1, -1.2, -1.3, -1.4, -1.5, -1.5, -1.4, -1.3,
-1.2, -1.1, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1,
0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3,
1.4, 1.5, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9 ...

Something like:
import itertools
infGenGiven = itertools.count() # This is similar your generator
def func(x):
if x%2==0:
return 1.5
else:
return -1.5
infGenCycle = itertools.imap(func, infGenGiven)
count=0
while count<10:
print infGenCycle.next()
count+=1
Output:
1.5
-1.5
1.5
-1.5
1.5
-1.5
1.5
-1.5
1.5
-1.5
Note that this starts 1.5 because the first value in infGenGiven is 0, although for your generator it is 1 and so the infGenCycle output will give you what you want.

Thank you all.
I guess the best approach is to use the trigonometric functions (sine or cosine) which oscillate between plus and minus one.
More details at: https://en.wikipedia.org/wiki/Trigonometric_functions
Cheers

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pandas Group By column to generate quantiles (.25, 0.5, .75) - python

Related

How to use a different colormap for different rows of a heatmap

Normalize with respect to row and column

is there a parameter to set the precision for numpy.linspace?

How to import from a data file a numpy structured array

Iterating forward and backward in Python

Categories

Resources