Best way to convert data in a pandas DataFrame to a matrix in Python - python

I found one thread on converting a matrix to a pandas DataFrame. However, I would like to do the opposite: I have a pandas DataFrame with time series data of this structure:
row time stamp, batch, value
1, 1, 0.1
2, 1, 0.2
3, 1, 0.3
4, 1, 0.3
5, 2, 0.25
6, 2, 0.32
7, 2, 0.2
8, 2, 0.1
...
What I would like to have is a matrix of values with one row belonging to one batch:
[[0.1, 0.2, 0.3, 0.3],
[0.25, 0.32, 0.2, 0.1],
...]
which I want to plot as a heatmap using matplotlib or similar.
Any suggestions?

What you can try is to first group by the desired index:
g = df.groupby("batch")
Then convert each group to a list by aggregating with the list constructor.
The result can then be converted to an array using the .values property or .to_numpy() (the older .as_matrix() method was deprecated in pandas 0.23 and removed in 1.0).
mtr = g.aggregate(list).values
One downside of this method is that it creates an object array of lists rather than a regular 2-D array, even when every batch has the same length.
Alternatively, if you know that you get exactly 4 values for every unique value of batch, you can reshape the value column directly.
df = df.sort_values("batch", kind="mergesort") # Stable sort, so the time order within each batch is kept.
value_col = 2 # Position of the "value" column; adjust to your data.
mtr = df.values[:, value_col] # or df["value"].to_numpy()
mtr = mtr.reshape(-1, 4) # Only works if you have exactly 4 values for each batch
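If the batches are not guaranteed to have the same length, a fixed reshape no longer applies. A minimal sketch of another route, assuming the columns are named "time stamp", "batch" and "value" as in the question (the helper column "step" is made up for illustration): number the rows within each batch and pivot on that number.

```python
import pandas as pd

# Sample data copied from the question; the column names are an assumption.
df = pd.DataFrame({
    "time stamp": [1, 2, 3, 4, 5, 6, 7, 8],
    "batch": [1, 1, 1, 1, 2, 2, 2, 2],
    "value": [0.1, 0.2, 0.3, 0.3, 0.25, 0.32, 0.2, 0.1],
})

# Order rows by batch, then by time within each batch.
df = df.sort_values(["batch", "time stamp"])

# Position of each row inside its batch: 0, 1, 2, ...
df["step"] = df.groupby("batch").cumcount()

# One row per batch, one column per step within the batch.
mtr = df.pivot(index="batch", columns="step", values="value").to_numpy()
print(mtr)
```

With batches of unequal length, pivot leaves the missing cells as NaN. The resulting 2-D array can then be handed to something like matplotlib's imshow for the heatmap.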

Try crosstab from pandas, pd.crosstab(). You will have to choose a suitable aggfunc.
https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.crosstab.html
Then convert the result with .values (.as_matrix() did the same job in older pandas versions but has since been removed).

Related

Efficient subsetting of pandas dataframe with indexes in 2d numpy array

I have a 2d (in future it may be a 3d) array of indices. Let's say it looks like this:
[[1,1,1],
[1,2,2],
[2,2,3]]
And I have a pandas dataframe:
index, A, B
1, 0.1, 0.01
2, 0.2, 0.02
3, 0.3, 0.03
I want to get a numpy array (or pandas df) with values from column A, selected according to the numpy array of indices. So the result here would be:
[[0.1,0.1,0.1],
[0.1,0.2,0.2],
[0.2,0.2,0.3]]
I can do it with a loop to get pandas dataframe:
pd.DataFrame(df.A[val].values for val in array)
However, I'm looking for a more efficient way to do it. Is there a better way that lets me use the whole array of indices at once?
You can do the following, where a is the 2d array of indices:
df.loc[a.ravel(),'A'].values.reshape(a.shape)
Output:
array([[0.1, 0.1, 0.1],
[0.1, 0.2, 0.2],
[0.2, 0.2, 0.3]])
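For reference, here is a self-contained version of that lookup, using the sample data from the question. The second variant is only a sketch of an alternative: it translates the labels to positions once with Index.get_indexer (which assumes the index labels are unique) and then indexes the underlying NumPy array. Because both variants flatten and reshape, they should carry over to a 3d index array as well.

```python
import numpy as np
import pandas as pd

# Sample data from the question, with the labels 1, 2, 3 as the index.
df = pd.DataFrame({"A": [0.1, 0.2, 0.3], "B": [0.01, 0.02, 0.03]},
                  index=[1, 2, 3])
a = np.array([[1, 1, 1],
              [1, 2, 2],
              [2, 2, 3]])

# Label-based lookup: flatten the index array, select, restore the shape.
out_loc = df.loc[a.ravel(), "A"].to_numpy().reshape(a.shape)

# Alternative: map labels to integer positions once, then index the
# underlying NumPy array directly. Requires unique index labels.
pos = df.index.get_indexer(a.ravel())
out_pos = df["A"].to_numpy()[pos].reshape(a.shape)

print(out_loc)
```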

Combining Different Scaled arrays in dataframe

Is there a built-in function (numpy or pandas I'm thinking) that would help combine multiple rows of one column in a dataframe, keeping the same dimensions, but different scale? Also, combined with that, summing the values from a different column between the intervals? Or is it something I just need to build from scratch? Example below, I'm not sure exactly how to ask. This would need to be scalable; the example is simple, in reality I'm working with a 250 dim array and theoretically unlimited rows.
Ex:
import pandas as pd
import numpy as np
#Creating DF
df = pd.DataFrame([[[-2,-1,0,1,2],[-10,-5,5,5,-10]],
[[-.5,.5,1.5,2.5,3.5],[-3,-2,0,-2,-3]]])
output:
                            0                     1
0           [-2, -1, 0, 1, 2]  [-10, -5, 5, 5, -10]
1  [-0.5, 0.5, 1.5, 2.5, 3.5]   [-3, -2, 0, -2, -3]
where the answer is [-2,-0.625,0.75,2.125,3.5] (column 0 combined onto a single grid of the same length, 5) and [-10,-5,0,-5,-5] (the sum of column 1 over each step of that grid, i.e. over the entries whose column 0 value x satisfies previous step < x <= current step):
answer = pd.DataFrame([[[-2,-.625,.75,2.125,3.5],[-10,-5,0,-5,-5]]])
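One possible reading of this, sketched with plain NumPy rather than any dedicated built-in: build an evenly spaced grid from the overall minimum to the overall maximum of column 0, assign every original point to the first grid value that is greater than or equal to it, and sum the column 1 values per grid point. The array names x, w, grid and sums are made up for illustration, and the even spacing is an assumption inferred from the expected answer.

```python
import numpy as np

# Column 0 (positions) and column 1 (values) from the example, as 2-D arrays.
x = np.array([[-2, -1, 0, 1, 2],
              [-0.5, 0.5, 1.5, 2.5, 3.5]])
w = np.array([[-10, -5, 5, 5, -10],
              [-3, -2, 0, -2, -3]])

# Common grid with the same number of points as each original row.
grid = np.linspace(x.min(), x.max(), x.shape[1])

# For each point, index i such that grid[i-1] < point <= grid[i].
bins = np.searchsorted(grid, x.ravel(), side="left")

# Sum the values that fall into each grid interval.
sums = np.bincount(bins, weights=w.ravel(), minlength=grid.size)

print(grid)
print(sums)
```

Since everything is flattened before binning, the same steps should work for 250 points per row and any number of rows.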

Multiplying by pattern matching

I have a matrix of the following format:
matrix = np.array([[1, 2, 3, np.nan],
                   [1, np.nan, 3, 4],
                   [np.nan, 2, 3, np.nan]])
and coefficients I want to selectively multiply element-wise with my matrix:
coefficients = np.array([[0.5, np.nan, 0.2, 0.3],
                         [0.3, 0.3, 0.2, np.nan],
                         [np.nan, 0.2, 0.1, np.nan]])
In this case, I would want the first row in matrix to be multiplied with the second row in coefficients, while the second row in matrix would be multiplied with the first row in coefficients. In short, for each row in matrix I want to select the row in coefficients whose np.nan values are in the same positions.
The location of np.nan values will be different for each row in coefficients, as they describe the coefficients for different cases of data availability.
Is there a quick way to do this, that doesn't require writing if-statements for all possible cases?
Approach #1
A quick way would be with NumPy broadcasting -
# Mask of NaNs
mask1 = np.isnan(matrix)
mask2 = np.isnan(coefficients)
# Compare each row of mask1 against every row of mask2, which gives a
# 3D boolean array. Rows whose NaN patterns agree everywhere are found
# with all() along the last axis. np.nonzero then returns the indices of
# the matching pairs: r indexes rows of matrix, c indexes rows of
# coefficients.
r,c = np.nonzero((mask1[:,None] == mask2).all(-1))
# Index into arrays with those indices and perform elementwise multiplication
out = matrix[r] * coefficients[c]
Output for given sample data -
In [40]: out
Out[40]:
array([[ 0.3, 0.6, 0.6, nan],
[ 0.5, nan, 0.6, 1.2],
[ nan, 0.4, 0.3, nan]])
Approach #2
For performance, reduce each row of the NaN mask to its decimal equivalent (reading the row as a binary number) and use that as a row index into a scratch array: store the rows of matrix there, then multiply in the rows of coefficients indexed by the same decimal equivalents -
R = 2**np.arange(matrix.shape[1])
idx1 = mask1.dot(R)
idx2 = mask2.dot(R)
A = np.empty((idx1.max()+1, matrix.shape[1]))
A[idx1] = matrix
A[idx2] *= coefficients
out = A[idx1]
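Note that storing the rows of matrix in the scratch array keeps only one row per NaN pattern, so that version relies on every pattern occurring at most once in matrix. A minimal sketch of a variant that stores the coefficients instead and therefore tolerates repeated patterns (the fourth row of matrix is an extra row added here purely for illustration; the names weights, key_m, key_c and table are hypothetical):

```python
import numpy as np

matrix = np.array([[1, 2, 3, np.nan],
                   [1, np.nan, 3, 4],
                   [np.nan, 2, 3, np.nan],
                   [5, 6, 7, np.nan]])  # extra row, same pattern as row 0
coefficients = np.array([[0.5, np.nan, 0.2, 0.3],
                         [0.3, 0.3, 0.2, np.nan],
                         [np.nan, 0.2, 0.1, np.nan]])

# Encode each row's NaN pattern as an integer (one bit per column).
weights = 2 ** np.arange(matrix.shape[1])
key_m = np.isnan(matrix).dot(weights)
key_c = np.isnan(coefficients).dot(weights)

# Lookup table with one slot per possible pattern; unused slots stay NaN.
table = np.full((2 ** matrix.shape[1], matrix.shape[1]), np.nan)
table[key_c] = coefficients

# Pick the coefficient row that matches each matrix row and multiply.
out = matrix * table[key_m]
print(out)
```

The table has 2**n rows for n columns, so this is only practical for a modest number of columns. A matrix row whose pattern has no matching coefficient row comes out as all NaN.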

Python Numpy: how to get the fraction of the frequency of a specific number in an array

I'm trying to find the fraction of ones in a specific row or column of an array and make a new array of these fractions.
So far I have:
def calc_frac(a,axis=0):
    """a function that returns the fraction of ones in each column or row"""
    s=np.array(((a==1).sum())/len(a))
    return(s)
and all my test values are coming back False when they should be True.
If there are no missing values in the array, you could just call the mean method on the logical array a == 1, which returns the fraction of 1s:
a = np.array([[1,2,3,1], [1,1,1,1], [1,0,2,2], [2,2,1,1]])
a
#array([[1, 2, 3, 1],
# [1, 1, 1, 1],
# [1, 0, 2, 2],
# [2, 2, 1, 1]])
1) Fraction of 1s per column
(a == 1).mean(0)
# array([ 0.75, 0.25, 0.5 , 0.75])
2) Fraction of 1s per row
(a == 1).mean(1)
# array([ 0.5 , 1. , 0.25, 0.5 ])
If nan counts as an entry, the above method still works; if nan should not count as an entry, divide by the number of non-nan entries instead:
(a == 1).sum(axis)/(~np.isnan(a)).sum(axis)
Here axis = 0 gives the fraction per column and axis = 1 gives the fraction per row.
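Wrapped up as a function with the same signature as in the question, this could look roughly as follows. It is a sketch that assumes nan should not count as an entry; the array a_nan is made-up test data.

```python
import numpy as np

def calc_frac(a, axis=0):
    """Return the fraction of ones along the given axis, ignoring NaN."""
    a = np.asarray(a, dtype=float)
    ones = (a == 1).sum(axis=axis)        # number of ones per column/row
    valid = (~np.isnan(a)).sum(axis=axis)  # number of non-NaN entries
    return ones / valid

a = np.array([[1, 2, 3, 1], [1, 1, 1, 1], [1, 0, 2, 2], [2, 2, 1, 1]])
per_column = calc_frac(a, axis=0)
per_row = calc_frac(a, axis=1)

# Made-up example containing a NaN that should not be counted.
a_nan = np.array([[1, np.nan], [1, 2], [0, 1]])
per_column_nan = calc_frac(a_nan, axis=0)

print(per_column, per_row, per_column_nan)
```

A column or row consisting only of nan would divide by zero here and come back as nan with a runtime warning.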

Pandas plot with errorbar: style does not apply

I have a pandas (version 0.14.1) DataFrame object like this:
import pandas as pd
df = pd.DataFrame(zip([1, 2, 3, 4, 5],
[0.1, 0.3, 0.1, 0.2, 0.4]),
columns=['y', 'dy'])
It returns
y dy
0 1 0.1
1 2 0.3
2 3 0.1
3 4 0.2
4 5 0.4
where the first column is the value and the second is the error.
First case: I want to plot the y-values
df['y'].plot(style="ro-")
Second case: I want to add vertical error bars dy to the y-values
df['y'].plot(style="ro-", yerr=df['dy'])
So, if I add the yerr or xerr parameter to the plot method, it ignores style.
Is this a pandas feature or a bug?
As TomAugspurger pointed out, it is a known issue. However, it has an easy workaround in most cases: use the fmt keyword instead of the style keyword to specify shortcut style options.
import pandas as pd
# list() is needed on Python 3, where zip returns an iterator
df = pd.DataFrame(list(zip([1, 2, 3, 4, 5],
                           [0.1, 0.3, 0.1, 0.2, 0.4])),
                  columns=['y', 'dy'])
df['y'].plot(fmt='ro-', yerr=df['dy'], grid=True)
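If the pandas wrapper keeps getting in the way, another option is to skip it and call matplotlib's Axes.errorbar directly, which takes the same kind of format string through its fmt argument. A minimal sketch with the data from the question (the Agg backend line is only there so the example runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this for on-screen plots
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"y": [1, 2, 3, 4, 5],
                   "dy": [0.1, 0.3, 0.1, 0.2, 0.4]})

fig, ax = plt.subplots()

# Red circles joined by a solid line, with vertical error bars from dy.
container = ax.errorbar(df.index.to_numpy(), df["y"].to_numpy(),
                        yerr=df["dy"].to_numpy(), fmt="ro-")
ax.grid(True)

# The first element of the returned container is the styled data line.
data_line = container.lines[0]
print(data_line.get_marker(), data_line.get_linestyle())
```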
