This question already has answers here:
Extracting first n columns of a numpy matrix
(2 answers)
Closed 4 years ago.
Although I am not new to python, I am a very infrequent user and get out of my depth pretty quickly. I’m guessing there is a simple solution to this which I am totally missing.
I have an image which is stored as a 2D numpy array. Rather than the 3 values of an RGB image, each pixel contains 6 numbers. The XY coordinates of the pixels are all encoded in the row index of the array, while the 6 columns correspond to the different wavelength values. Currently I can call up any single pixel and see its wavelength values, but I would like to add up the values over a range of pixels.
I know summing arrays is straightforward, and this is essentially what I'm looking to achieve over an arbitrary range of specified pixels in the image.
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([1, 2, 3, 4, 5, 6])
sum = a + b
# sum is now array([2, 4, 6, 8, 10, 12])
I'm guessing I need a loop to specify the range of input pixels I would like summed. Note that, as in the example above, I'm not trying to sum the 6 values within one array; rather, I want to sum all the first elements, all the second elements, and so on. However, I think what is happening is that the loop runs over the specified pixel range but doesn't carry the values over to be added to the next pixel, and I'm not sure how to overcome this.
What I have so far is:
fileinput = np.load('C:\\Users\\image_data.npy')  # this is the image; each pixel is 6 values
exampledata=np.arange(42).reshape((7,6))
for x in range(1, 4):
    signal = exampledata[x]
    print(exampledata[x])  # this prints all the arrays I would like to sum, i.e. the list of 6 values for each pixel in the specified range
    sum(exampledata[x])  # sums up the 6 values within each row, rather than all the first elements, all the second elements, etc.
    exampledata[x].sum(axis=1)  # produces the error: AxisError: axis 1 is out of bounds for array of dimension 1
I suppose I could sum up a small range of pixels manually, though this obviously doesn't scale beyond a handful of pixels.
first=fileinput[1]
second=fileinput[2]
third=fileinput[3]
sum=first+second+third
This should work:
exampledata = np.arange(42).reshape((7, 6))
# Sum the data along axis 0 (element-wise across rows, i.e. across pixels):
sum = exampledata[0:len(exampledata)].sum(axis=0)
print(sum)
Initial 2D array:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
[30 31 32 33 34 35]
[36 37 38 39 40 41]]
Output:
[126 133 140 147 154 161]
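To restrict the sum to an arbitrary range of pixels, slice the rows before summing; for example, rows 1 to 3 (the same pixels as the manual example in the question):
partial = exampledata[1:4].sum(axis=0)  # rows 1, 2 and 3 only
print(partial)                          # [36 39 42 45 48 51]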
Related
I have a DataFrame of NxM values, and I need to iterate through it in order to identify all possible combinations of length 1, 2, 3, ... up to NxM-1 of rows and columns.
For each of these combinations, I want to aggregate values. Let's work through an example: df_unreduced is my initial DataFrame:
df_unreduced= pd.DataFrame({"Strike":[50,75,100,125,150],
"31/12/2021":[1,2,3,4,5],
"31/12/2022":[6,7,8,9,10],
"31/12/2023":[11,12,13,14,15],
"31/12/2024":[16,17,18,19,20],
"31/12/2025":[21,22,23,24,25]})
Here I focus on just one possible combination of rows and columns. Let's say I want to aggregate on these 4 "nodes" (a node being the intersection of one row and one column) of my initial df:
row=75 , col= "31/12/2022"
row=75 , col= "31/12/2024"
row=150 , col= "31/12/2022"
row=150 , col= "31/12/2024"
the expected result is:
df_reduced = pd.DataFrame({"Strike": [75, 150],
                           "31/12/2022": [45.8333, 135.8333],
                           "31/12/2024": [41.6667, 101.6667]})
if the combination was:
row=75 , col= "31/12/2022"
row=75 , col= "31/12/2024"
row=125 , col= "31/12/2022"
row=125 , col= "31/12/2024"
the expected result would be:
df_reduced = pd.DataFrame({"Strike": [75, 125],
"31/12/2022": [36.25, 111.25],
"31/12/2024": [51.25, 126.25]})
and so on. The key point here is that, as you can see from the expected results, I must linearly weight the values that fall between two nodes before summing them. To achieve the expected result I first summed by rows and then by columns, but doing it the other way around would leave the results unchanged.
Once the right summing code is defined, I have to loop over all possible combinations.
Whatever the number and position of the nodes I aggregate on, the sum of the values in the reduced data frame must match the sum of the values in the unreduced one.
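A quick way to express that invariant as a check (just a sketch, using the frames defined above):
import numpy as np

# The total value must be preserved by the aggregation, whatever the nodes:
assert np.isclose(df_unreduced.drop(columns="Strike").to_numpy().sum(),
                  df_reduced.drop(columns="Strike").to_numpy().sum())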
EDIT: to elaborate a bit more on the aggregation and the expected result: in my df_unreduced the "Strike" array is [50, 75, 100, 125, 150]. As said, I'm keen to adopt a two-step approach, summing up by rows and then by columns.
Let's take my first example: the first node I need to aggregate on (call it A) is (75, "31/12/2022") and the third (D) is (150, "31/12/2022"). In the row dimension, between A and D I have two nodes, B = (100, "31/12/2022") and C = (125, "31/12/2022"). The distance between 75 and 150 is 75: 100 falls 1/3 of the way from A and 2/3 of the way from D, while 125 is 1/3 of the way from D and 2/3 of the way from A. For this reason, when aggregating on A, I can say that A_AGG_row = A + (1-1/3)*B + (1-2/3)*C. Conversely, D_AGG_row = D + (1-1/3)*C + (1-2/3)*B.
In the column dimension, if we consider F = (75, "31/12/2024"), we have that E = (75, "31/12/2023") is halfway between A and F, so A_AGG_col = A_AGG_row + (1-1/2)*E_AGG_row and F_AGG_col = F_AGG_row + (1-1/2)*E_AGG_row (where E_AGG_row is the result of the previous summing by rows on node E).
In more scientific terms, when summing up rows I need to split linearly the value of a generic node "i" falling between two selected nodes, A and D, by applying these weights:
1 - (Strike_i - Strike_A) / (Strike_D - Strike_A) -> on node A
1 - (Strike_D - Strike_i) / (Strike_D - Strike_A) -> on node D
When summing up columns I need to split linearly the value of a generic node "j" falling between two selected nodes, A and F:
1 - (Date_j - Date_A) / (Date_F - Date_A) -> on node A
1 - (Date_F - Date_j) / (Date_F - Date_A) -> on node F
These are the complete steps:
1. df_unreduced:

Strike | 31/12/2021 | 31/12/2022 | 31/12/2023 | 31/12/2024 | 31/12/2025
   50  |     1      |     6      |     11     |     16     |     21
   75  |     2      |     7      |     12     |     17     |     22
  100  |     3      |     8      |     13     |     18     |     23
  125  |     4      |     9      |     14     |     19     |     24
  150  |     5      |    10      |     15     |     20     |     25
2. summing rows:

Strike | 31/12/2021 | 31/12/2022 | 31/12/2023 | 31/12/2024 | 31/12/2025
   50  |            |            |            |            |
   75  |   6.333    |   21.333   |   36.333   |   51.333   |   66.333
  100  |            |            |            |            |
  125  |            |            |            |            |
  150  |   8.667    |   18.667   |   28.667   |   38.667   |   48.667
3. summing by columns:

Strike | 31/12/2021 | 31/12/2022 | 31/12/2023 | 31/12/2024 | 31/12/2025
   50  |            |            |            |            |
   75  |            |   45.833   |            |  135.833   |
  100  |            |            |            |            |
  125  |            |            |            |            |
  150  |            |   41.667   |            |  101.667   |
I hope this clarifies the problem. Thanks.
I have a pandas dataframe like below:
Coordinate
1 (1150.0,1760.0)
28 (1260.0,1910.0)
6 (1030.0,2070.0)
12 (1170.0,2300.0)
9 (790.0,2260.0)
5 (750.0,2030.0)
26 (490.0,2130.0)
29 (360.0,1980.0)
3 (40.0,2090.0)
2 (630.0,1660.0)
20 (590.0,1390.0)
Now, I want to create a new column 'dotProduct' by applying the formula
np.dot((b-a), (b-c)), where b is the coordinate (1260.0, 1910.0) at index 28, a is the coordinate of the previous row (index 1) and c is the coordinate of the next row (index 6, i.e. (1030.0, 2070.0)). The calculated product goes in the second row. So, in a way, I have to get the previous row's value and the next row's value for every row, and calculate this over the entire 'Coordinate' column. I am quite new to pandas and still learning, so please guide me a bit.
Thanks a lot for the help.
I assume that your 'Coordinate' column elements are already tuples of float values.
# Convert elements of 'Coordinate' into numpy array
df.Coordinate = df.Coordinate.apply(np.array)
# Subtract +/- 1 shifted values from original 'Coordinate'
a = df.Coordinate - df.Coordinate.shift(1)
b = df.Coordinate - df.Coordinate.shift(-1)
# take row-wise dot product based on the arrays a, b
df['dotProduct'] = [np.dot(x, y) for x, y in zip(a, b)]
# make 'Coordinate' tuple again (if you want)
df.Coordinate = df.Coordinate.apply(tuple)
Now I get this as df:
Coordinate dotProduct
1 (1150.0, 1760.0) NaN
28 (1260.0, 1910.0) 1300.0
6 (1030.0, 2070.0) -4600.0
12 (1170.0, 2300.0) 62400.0
9 (790.0, 2260.0) -24400.0
5 (750.0, 2030.0) 12600.0
26 (490.0, 2130.0) -18800.0
29 (360.0, 1980.0) -25100.0
3 (40.0, 2090.0) 236100.0
2 (630.0, 1660.0) -92500.0
20 (590.0, 1390.0) NaN
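For larger frames, an equivalent vectorized version of the same idea (a sketch, not part of the original answer; 'dotProduct_vec' is just an illustrative column name):
import numpy as np

coords = np.array(df.Coordinate.tolist())           # shape (n, 2)
diff_prev = coords[1:-1] - coords[:-2]              # b - a for the interior rows
diff_next = coords[1:-1] - coords[2:]               # b - c for the interior rows
dots = np.einsum('ij,ij->i', diff_prev, diff_next)  # row-wise dot products
df['dotProduct_vec'] = np.concatenate(([np.nan], dots, [np.nan]))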
I have a density field on a 2d grid. I have a matrix that describes the "correlation" of each point of the grid with all the other points. For example, let's say that I have a 4x4 grid, whose points I label with these indices:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
then I have a matrix which is 16x16. The first row describes the correlation of pixel 1 with all the others, the second row describes the correlation of pixel 2 with all the others and so on until row 16. Symbolically:
1-1 1-2 1-3 ...... 1-16
2-1 2-2 2-3 ...... 2-16
.
.
.
16-1 16-2 16-3...... 16-16
If I use a 2-dimensional FFT of this matrix (np.fft.fft2) do I get a quantity which is the density field power spectrum? Or do I have to do some extra operation on the matrix before applying the FFT, such as reordering the entries according to some rules, or padding extra entries with zeros?
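A minimal sketch of the setup described here, in case it helps frame the question (the correlation matrix below is just a placeholder outer product, not a real correlation estimate):
import numpy as np

rng = np.random.default_rng(0)
field = rng.random((4, 4))        # density field on a 4x4 grid
flat = field.ravel()              # pixels 1..16 in row-major order
corr = np.outer(flat, flat)       # 16x16 matrix: row i pairs pixel i with every pixel
candidate = np.fft.fft2(corr)     # the 2-D FFT the question asks about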
I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger data set. What I am trying to do is create bins of this data set, e.g. quintiles:
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears fine, as expected, but it isn't.
In fact, I have 1569 rows. The nearest number divisible by 5 bins is 1565, which should give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect to have 4 bins with 314 observations and one with 313 observations. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles, or to return percentiles that have greater precision?
Just try the code below:
nbins = 5  # quintiles
pd.qcut(df['value_rank'].rank(method='first'), nbins, labels=False)
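For example, on the 10-row sample from the question this gives perfectly even bins (rank(method='first') breaks the tie between the two 30s by order of appearance, so qcut sees distinct values):
import pandas as pd

s = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53], name='value_rank')
bins = pd.qcut(s.rank(method='first'), 5, labels=False)
print(bins.value_counts().sort_index())  # every bin holds exactly 2 observations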
If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
import numpy as np
import pandas as pd

test_list = [11, 18, 27, 30, 30, 31, 36, 40, 45, 53]
np.random.seed(42)  # set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list))  # add noise to the data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So, now we have two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion because results can vary based on your data; for instance, how large your data set is and/or how the values are spaced. The above "trick" works well for integers because even though we are "salting" the test_list, it will still rank-order in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of the noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
This problem arises from duplicate values. A possible solution to force equal sized bins is to use the index as the input for pd.qcut after sorting the dataframe:
import random
import pandas as pd

df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-
I want to scale the numerical values (similar to R's scale function) based on different groups.
Note: when I talk about scaling, I am referring to this metric:
(x-group_mean)/group_std
Example dataset (to demonstrate the idea):
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desirable results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
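For instance, for advertiser 10 the group mean is (11 + 22 + 2424) / 3 = 819 and the sample standard deviation (ddof = 1) is roughly 1390, so the first value scales to (11 - 819) / 1390 ≈ -0.58, matching the first row above.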
Referring to this link: implementing R scale function in pandas in Python?, I used the scale function defined there and want to apply it like this:
dt.groupby("advertiser_id").apply(scale)
but get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
In my original dataset the number of rows is 15770, but I don't think the scale function should map a single value to more than one result in my case.
I would appreciate some sample code or suggestions on how to modify this. Thanks!
First, np.std behaves differently from most other languages in that its delta degrees of freedom (ddof) defaults to 0, whereas R's sd uses ddof=1. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches the R result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its sample standard deviation is undefined and you will get nan. Check whether that is the reason you see nan; R would return NaN in this case as well.
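A reusable version of the same idea (a sketch; pandas' Series.std already defaults to ddof=1, matching R's sd, and transform keeps the output aligned with the input, which avoids the shape error you saw with apply):
import pandas as pd

def scale(x):
    return (x - x.mean()) / x.std()  # Series.std uses ddof=1 by default

df['scaled_value'] = df.groupby('advertiser_id')['value'].transform(scale)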