The problem is very similar to that in How to evaluate the sum of values within array blocks, where I need to sum up elements in a matrix by blocks. What is different here is that the blocks could have different sizes. For example, given a 4-by-5 matrix
1 1 | 1 1 1
----|------
1 1 | 1 1 1
1 1 | 1 1 1
1 1 | 1 1 1
and block sizes [1, 3] along the rows and [2, 3] along the columns, the result should be a 2-by-2 matrix:
2 3
6 9
Is there a way of doing this without loops?
Seems like a good fit for np.add.reduceat: sum along the rows first, then along the columns -
def sum_blocks(a, row_sizes, col_sizes):
    # Sum rows based on row sizes
    s1 = np.add.reduceat(a, np.r_[0, row_sizes[:-1].cumsum()], axis=0)
    # Sum cols from the row-summed output based on col sizes
    return np.add.reduceat(s1, np.r_[0, col_sizes[:-1].cumsum()], axis=1)
Sample run -
In [45]: np.random.seed(0)
    ...: a = np.random.randint(0,9,(4,5))

In [46]: a
Out[46]:
array([[5, 0, 3, 3, 7],
       [3, 5, 2, 4, 7],
       [6, 8, 8, 1, 6],
       [7, 7, 8, 1, 5]])

In [47]: row_sizes = np.array([1,3])
    ...: col_sizes = np.array([2,3])

In [48]: sum_blocks(a, row_sizes, col_sizes)
Out[48]:
array([[ 5, 13],
       [36, 42]])
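As a sanity check, running the same function on the all-ones matrix from the question reproduces the expected 2-by-2 result:

```python
import numpy as np

def sum_blocks(a, row_sizes, col_sizes):
    # Sum rows based on row sizes
    s1 = np.add.reduceat(a, np.r_[0, row_sizes[:-1].cumsum()], axis=0)
    # Sum cols from the row-summed output based on col sizes
    return np.add.reduceat(s1, np.r_[0, col_sizes[:-1].cumsum()], axis=1)

a = np.ones((4, 5), dtype=int)
out = sum_blocks(a, np.array([1, 3]), np.array([2, 3]))
# → [[2 3]
#    [6 9]]
```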
Related
I have a list1 = [1,2,3,4,5,6,7,8,9,0]. I want to take out one element "4", then split the remaining list with np.array_split(list1,5). I will get [array([1, 2]), array([5, 6]), array([7, 8]), array([ 9, 10]), array([11])] as the result. When I try to convert it into a pandas DataFrame, the output would be:
     0     1
0    1   2.0
1    5   6.0
2    7   8.0
3    9  10.0
4   11   NaN
But I want the result as just a one-column DataFrame with a single value in each cell and no NaN value at the end.
Any suggestion with this matter would be appreciated.
Put your array into a dict and create your dataframe from that:
list1 = [1,2,3,4,5,77,8,9,0]
x = np.array_split(list1, 5)
df = pd.DataFrame({'column': x})
Output:
>>> df
column
0 [1, 2]
1 [3, 4]
2 [5, 77]
3 [8, 9]
4 [0]
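If you'd rather have one scalar value per cell instead of lists, the same frame can be flattened with DataFrame.explode (available in pandas 0.25+); since the chunks contain no padding, no NaN appears at the end:

```python
import numpy as np
import pandas as pd

list1 = [1, 2, 3, 4, 5, 77, 8, 9, 0]
x = np.array_split(list1, 5)
df = pd.DataFrame({'column': x})

# explode turns each list element into its own row; resetting the
# index gives a clean 0..n-1 range instead of repeated chunk indices
flat = df.explode('column').reset_index(drop=True)
```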
I’m trying to take a set of data that consists of N rows, and expand each row to include the squares/cubes/etc. of each column in that row (what power to go up to is determined by a variable j). The data starts out as a pandas DataFrame but can be turned into a numpy array.
For example:
If the row is [3,2] and j is 3, the row should be transformed to [3, 2, 9, 4, 27, 8]
I currently have a semi-working version that consists of a bunch of nested for loops and is pretty ugly. I’m hoping for a cleaner way to make this transformation so things will be a bit easier for me to debug.
The behavior I’m looking for is basically the same as sklearn's PolynomialFeatures, but I’m trying to do it in numpy and/or pandas only.
Thanks!
Use NumPy broadcasting for a vectorized solution -

In [66]: a = np.array([3,2])

In [67]: j = 3

In [68]: a**np.arange(1,j+1)[:,None]
Out[68]:
array([[ 3,  2],
       [ 9,  4],
       [27,  8]])

And there's a NumPy builtin: np.vander -

In [142]: np.vander(a,j+1).T[::-1][1:]
Out[142]:
array([[ 3,  2],
       [ 9,  4],
       [27,  8]])

Or with the increasing flag set as True -

In [180]: np.vander(a,j+1,increasing=True).T[1:]
Out[180]:
array([[ 3,  2],
       [ 9,  4],
       [27,  8]])
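To handle a whole N-row array at once and lay each row's powers out side by side, as in the question's [3, 2, 9, 4, 27, 8] example, the power blocks can be stacked horizontally (a sketch; the second row of X here is made up for illustration):

```python
import numpy as np

X = np.array([[3, 2],
              [4, 5]])
j = 3

# concatenate X**1, X**2, ..., X**j column-wise so each row becomes
# [x1, x2, x1**2, x2**2, x1**3, x2**3]
out = np.hstack([X**i for i in range(1, j + 1)])
# first row → [ 3  2  9  4 27  8]
```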
Try concat with the ignore_index option to avoid duplicated column names:
df = pd.DataFrame(np.arange(9).reshape(3,3))
j = 3
pd.concat([df**i for i in range(1, j+1)], axis=1, ignore_index=True)
Output:
0 1 2 3 4 5 6 7 8
0 0 1 2 0 1 4 0 1 8
1 3 4 5 9 16 25 27 64 125
2 6 7 8 36 49 64 216 343 512
Let's say I have an array such as this:
a = np.array([[1, 2, 3, 4, 5, 6, 7], [20, 25, 30, 35, 40, 45, 50], [2, 4, 6, 8, 10, 12, 14]])
and a dataframe such as this:
num letter
0 1 a
1 2 b
2 3 c
What I would then like to do is to calculate the difference between the first and last number in each sequence in the array and ultimately add this difference to a new column in the df.
Currently I am able to calculate the desired difference in each sequence in this manner:
for i in a:
    print(i[-1] - i[0])
Giving me the following results:
6
30
12
What I expected to be able to do was replace the print with df['new_col'] like so:
df['new_col'] = (i[-1] - i[0])
And for my df to then look like this:
num letter new_col
0 1 a 6
1 2 b 30
2 3 c 12
However, I end up getting this:
num letter new_col
0 1 a 12
1 2 b 12
2 3 c 12
I would also really appreciate it if anyone could tell me what the equivalents of .diff() and .shift() are in NumPy. I tried them in the same way you would with a pandas DataFrame but just got error messages. This would be useful if I want to calculate the difference not just between the first and last numbers but between numbers somewhere in between.
Any help would be really appreciated, cheers.
Currently you are performing the assignment with only the very last difference, so every row gets the same value.
Use a list comprehension:
a = np.array([[1, 2, 3, 4, 5, 6, 7], [20, 25, 30, 35, 40, 45, 50], [2, 4, 6, 8, 10, 12, 14]])
b = [i[-1] - i[0] for i in a]
If the lengths mismatch, then you need to extend the list with NaNs:
b = b + [np.NaN]*(len(df) - len(b))
df['new_col'] = b
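Since a is already a 2-D NumPy array here, another option (a sketch, not the only way) is plain vectorized indexing, with no Python-level loop at all; and np.diff is the closest NumPy counterpart to pandas' .diff():

```python
import numpy as np
import pandas as pd

a = np.array([[1, 2, 3, 4, 5, 6, 7],
              [20, 25, 30, 35, 40, 45, 50],
              [2, 4, 6, 8, 10, 12, 14]])
df = pd.DataFrame({'num': [1, 2, 3], 'letter': ['a', 'b', 'c']})

# last column minus first column, one value per row of a
df['new_col'] = a[:, -1] - a[:, 0]

# NumPy's analogue of pandas .diff(): consecutive differences per row
steps = np.diff(a, axis=1)
```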
Might be better off doing this in a DataFrame if your array grows in size.
df1 = pd.DataFrame(a.T)
df['new_col'] = df1.iloc[-1] - df1.iloc[0]
print(df)
num letter new_col
0 1 a 6
1 2 b 30
2 3 c 12
I read about pandas multi index, but the examples I found did not cover these cases:
I have a set of measurements i in 1..n. Each measurement i consists of attributes a, b, c, X, Y, Z. While a, b, c have scalar values, X, Y, Z are arrays.
(X, Y, Z have different lengths for different measurements, but within one measurement i, the arrays X, Y, Z have the same length m.)
Question: What's the best way to represent this in a pandas DataFrame?
The multi-index examples I saw would index data e.g. first level by i for the measurement and second level by k for the index into X, Y, Z. But what about the attributes a, b, c that have just one value for each measurement, not m? Should the a, b, c values be repeated? Or should only the first row of each measurement i contain values for a, b, c, with the other 2..m rows containing NaN?
As alluded to in the comment, much of this will depend on the sort of questions you want to be able to answer at the end of the day. One example of a potentially meaningful representation, which takes advantage of the fact that the arrays have the same length within each row, would be to expand the arrays into long format, repeating the $i$th row's scalar values $m_i$ times:
In [12]: from itertools import chain
In [13]: df = pd.DataFrame({'a': [5, 7, 1], 'b': [1, 3, 2], 'c': [10, 1, 1], 'X': [[10, 10], [10, 20, 30], [4, 5]], 'Y': [[20, 30], [1, 4, 5], [0, 1]], 'Z': [[0, 3], [0, 1, 2], [50, 60]]})
In [14]: df
Out[14]:
X Y Z a b c
0 [10, 10] [20, 30] [0, 3] 5 1 10
1 [10, 20, 30] [1, 4, 5] [0, 1, 2] 7 3 1
2 [4, 5] [0, 1] [50, 60] 1 2 1
In [15]: pd.DataFrame({
...: 'a': np.repeat(df.a.values, df.X.apply(len)),
...: 'b': np.repeat(df.b.values, df.X.apply(len)),
...: 'c': np.repeat(df.c.values, df.X.apply(len)),
...: 'X': list(chain.from_iterable(df.X)),
...: 'Y': list(chain.from_iterable(df.Y)),
...: 'Z': list(chain.from_iterable(df.Z))})
...:
Out[15]:
X Y Z a b c
0 10 20 0 5 1 10
1 10 30 3 5 1 10
2 10 1 0 7 3 1
3 20 4 1 7 3 1
4 30 5 2 7 3 1
5 4 0 50 1 2 1
6 5 1 60 1 2 1
This assumes that the fact that the array lengths match across columns means that the arrays themselves are comparable element-wise.
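On newer pandas versions (1.3+ can explode several columns at once), the same long-format expansion can be written more compactly; this is a sketch of the same idea, not a different representation:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 7, 1], 'b': [1, 3, 2], 'c': [10, 1, 1],
                   'X': [[10, 10], [10, 20, 30], [4, 5]],
                   'Y': [[20, 30], [1, 4, 5], [0, 1]],
                   'Z': [[0, 3], [0, 1, 2], [50, 60]]})

# explode all three array columns together; the scalar columns
# a, b, c are repeated automatically for each expanded row
flat = df.explode(['X', 'Y', 'Z']).reset_index(drop=True)
```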
I have a df that is divided into chunks, like this:
A = pd.DataFrame([[1, 5, 2, 0], [2, 4, 4, 0], [3, 3, 1, 1], [4, 2, 2, 0], [5, 1, 4, 0], [2, 4, 4, 1]],
columns=['A', 'B', 'C', 'D'], index=[1, 2, 3, 4, 5, 6,])
In this example, the chunk size is 3, and we have 2 chunks (signaled by the element 1 in the column 'D'). I need to perform a rolling calculation inside each chunk, one that involves 2 columns. Specifically, I need to create a column 'E' that is equal to column 'B' minus the rolling min of column 'C', as in this function:
def retracement(x):
    return x['B'] - pd.rolling_min(x['C'], window=3)
I need to apply the formula above for each chunk. So following this recipe I tried:
chunk_size = 3
A['E'] = A.groupby(np.arange(len(A))//chunk_size).apply(lambda x: retracement(x))
ValueError: Wrong number of items passed 3, placement implies 1
The output would look like:
A B C D E
1 1 5 2 0 3
2 2 4 4 0 2
3 3 3 1 1 2
4 4 2 2 0 0
5 5 1 4 0 -1
6 2 4 4 1 2
Thanks
Update:
Following @EdChum's recommendation didn't work; I got
TypeError: <lambda>() got an unexpected keyword argument 'axis'
Something like this:
def chunkify(chunk_size):
    df['chunk'] = (df.index.values - 1) // chunk_size
    df['E'] = df.groupby('chunk').apply(lambda x: x.B - pd.expanding_min(x.C)).values.flatten()
Note the integer division //, so that every chunk_size consecutive rows share one chunk label.
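pd.rolling_min and pd.expanding_min were removed in later pandas versions; on a current pandas, the same per-chunk calculation can be sketched with groupby plus Series.rolling (min_periods=1 keeps the first rows of each chunk defined, which matches the expected output above):

```python
import numpy as np
import pandas as pd

A = pd.DataFrame([[1, 5, 2, 0], [2, 4, 4, 0], [3, 3, 1, 1],
                  [4, 2, 2, 0], [5, 1, 4, 0], [2, 4, 4, 1]],
                 columns=['A', 'B', 'C', 'D'], index=[1, 2, 3, 4, 5, 6])

chunk_size = 3
chunks = np.arange(len(A)) // chunk_size  # 0, 0, 0, 1, 1, 1

# rolling min of C inside each chunk; min_periods=1 avoids leading NaNs
roll_min = A.groupby(chunks)['C'].transform(
    lambda s: s.rolling(window=chunk_size, min_periods=1).min())
A['E'] = A['B'] - roll_min
```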