Pandas dot product ValueError - python

I am trying to calculate the dot product of a data frame and a series, but I am getting ValueError: matrices are not aligned and I do not really understand why. I get
if (len(common) > len(self.columns) or len(common) > len(other.index)):
raise ValueError('matrices are not aligned')
with the error message, which I totally understand. But when I check my series, it has 25 values:
weights
Out[193]:
0 0.000002
1 0.000577
2 0.002480
3 0.004720
4 0.003640
5 0.001480
6 0.000054
7 0.000022
8 0.009060
9 0.000511
10 0.034900
11 0.140000
12 0.065600
13 0.325000
14 0.072900
15 0.031100
16 0.209000
17 0.003280
18 0.001390
19 0.002100
20 0.000847
21 0.009560
22 0.006320
23 0.014000
24 0.061900
Name: 3, dtype: float64
And when I check my data frame, it also has 25 columns:
In [195]: data
Out[195]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 0 to 130
Data columns (total 25 columns):
(etc)
So I don't understand why I get the error message. What am I missing here?
Some additional information:
I am using weightedave=data.dot(weights)
And I just figured out in the dot code that it does common = data.columns.union(weights.index) to get the common referred to in the error message. So I tested that, but in my case that becomes
In[220]: common
Out[220]: Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, u'100_AVET', u'101_AVET', u'102_AVET', u'13_AVET', u'14_AVET', u'15_AVET', u'18_AVET', u'19_AVET', u'20_AVET', u'22_AVET', u'36_AVET', u'62_AVET', u'74_AVET', u'78_AVET', u'79_AVET', u'80_AVET', u'83_AVET', u'85_AVET', u'86_AVET', u'88_AVET', u'94_AVET', u'95_AVET', u'96_AVET', u'97_AVET', u'99_AVET'], dtype=object)
Which indeed is longer (50) than my number of columns/indices (25). Should I rename either my series or the columns in my data frame?
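A likely explanation, sketched with toy data: DataFrame.dot matches the frame's columns to the Series index by label, not by position. Here the columns are strings such as '100_AVET' while the Series index is 0 to 24, so no labels match and the union has 50 entries. The example below uses three made-up columns and weights (not the real data) and assumes the weights are already in the same order as the columns:

```python
import pandas as pd

# Toy stand-ins for the real data: string column labels, integer-indexed weights.
data = pd.DataFrame(
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    columns=["100_AVET", "101_AVET", "102_AVET"],
)
weights = pd.Series([0.2, 0.3, 0.5])

# Option 1: pass the raw NumPy values, which skips label alignment entirely.
by_position = data.dot(weights.values)

# Option 2: relabel the weights with the column names so the labels match.
relabeled = pd.Series(weights.values, index=data.columns)
by_label = data.dot(relabeled)

print(by_position)
print(by_label)
```

Both options rely on the positional order being correct, so it is worth checking that the first weight really belongs to the first column before relabeling.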

Related

Splitting a Series of counts into bins, so that the bins contain similar count sums

I have a series of values that I want to split up into 5 bins, so that the sum of the values in the bins is approximately the same.
AD8_normal
075e5e2b-dd96-4097-8fc3-c3b82f610277 18
bd9a3e1f-b7a1-4ad9-ac8f-c6009dd5073a 18
b50fa764-1a1d-4760-bd6f-6efbba316adb 16
8cc1408b-9509-485a-8603-41895a53925d 14
3cb98c8c-918f-45fd-bd9f-11b0a6d19838 13
5fa9da0d-bfcb-48da-8bb3-e245e995e3e6 13
5950b2d8-b51a-4c84-86f6-40ee59ddf8c7 12
bbd07332-92a1-4f05-b7de-a0164b017343 12
9b67c710-f5ee-4a4b-ba80-eaf905a1a4fc 12
140f98ce-2ffe-4aca-accb-76f1cf2f08a9 12
177b0245-8ff4-4a34-b15b-1ddf909e0c9e 11
b5aaf5c4-fd3d-4781-9eaf-92e7dee4a234 11
e24f86ce-5cfe-487a-8f23-929fb7674a4a 11
b7ddaf74-adde-4bdf-9e07-f0d1a06176e7 11
5b442d58-78d8-4f3e-9d86-6782dd9945ca 10
acdcd702-dcae-45bc-b56f-1f50e4610c2a 10
c5ac2af7-e040-4d0f-b281-704349338463 10
10f815fa-16d4-4efb-a3e5-05b2f197688e 10
7098c542-7475-4030-87dd-6405952d179f 10
67566970-81f7-4f69-8765-2e1fbe242e62 9
7f2d7f97-da2e-408d-8f62-d46b27cc19fb 8
e0c43c9f-764f-44e9-86af-c80417253ba3 8
efcd4689-a4f1-4bcc-810a-57a7c3141f14 8
acbf916d-fafc-4f4d-8a2e-eeb3c35d4040 8
2fa4333d-dc1c-4f68-a745-c1cbdf90ed95 7
9b96a2c2-f3c5-479a-9666-1d4c44a17270 7
7e1070a8-8f96-440e-998f-409383a9faf1 7
90b1971d-c70f-495a-bdb0-cc3c5b67bbe1 7
Name: id, dtype: int64
I can randomly sample indices, in which case I would expect most bins to lead to similar total counts. However, there's obviously the chance that I accidentally sample the top N indices in a single bin, so that bin becomes 'overweight'. Is there a way to cut up this series into 5 parts (i.e., 5 sets of indices) so each part has a value sum that is approximately the same?
To reproduce the array:
import pandas
idx = ['075e5e2b-dd96-4097-8fc3-c3b82f610277', 'bd9a3e1f-b7a1-4ad9-ac8f-c6009dd5073a', 'b50fa764-1a1d-4760-bd6f-6efbba316adb', '8cc1408b-9509-485a-8603-41895a53925d', '3cb98c8c-918f-45fd-bd9f-11b0a6d19838', '5fa9da0d-bfcb-48da-8bb3-e245e995e3e6', '5950b2d8-b51a-4c84-86f6-40ee59ddf8c7', 'bbd07332-92a1-4f05-b7de-a0164b017343', '9b67c710-f5ee-4a4b-ba80-eaf905a1a4fc', '140f98ce-2ffe-4aca-accb-76f1cf2f08a9', '177b0245-8ff4-4a34-b15b-1ddf909e0c9e', 'b5aaf5c4-fd3d-4781-9eaf-92e7dee4a234', 'e24f86ce-5cfe-487a-8f23-929fb7674a4a', 'b7ddaf74-adde-4bdf-9e07-f0d1a06176e7', '5b442d58-78d8-4f3e-9d86-6782dd9945ca', 'acdcd702-dcae-45bc-b56f-1f50e4610c2a', 'c5ac2af7-e040-4d0f-b281-704349338463', '10f815fa-16d4-4efb-a3e5-05b2f197688e', '7098c542-7475-4030-87dd-6405952d179f', '67566970-81f7-4f69-8765-2e1fbe242e62', '7f2d7f97-da2e-408d-8f62-d46b27cc19fb', 'e0c43c9f-764f-44e9-86af-c80417253ba3', 'efcd4689-a4f1-4bcc-810a-57a7c3141f14', 'acbf916d-fafc-4f4d-8a2e-eeb3c35d4040', '2fa4333d-dc1c-4f68-a745-c1cbdf90ed95', '9b96a2c2-f3c5-479a-9666-1d4c44a17270', '7e1070a8-8f96-440e-998f-409383a9faf1', '90b1971d-c70f-495a-bdb0-cc3c5b67bbe1']
vals = [18, 18, 16, 14, 13, 13, 12, 12, 12, 12, 11, 11, 11, 11, 10, 10, 10, 10, 10, 9, 8, 8, 8, 8, 7, 7, 7, 7]
AD8_normal = pandas.Series(data=vals, index=idx)
Order the values, biggest first.
Put the first five values each in a new bin.
Now you have five bins.
From now on, put each new value in the bin with the smallest total.
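The steps above can be sketched with a heap that always yields the bin with the smallest running total. This is only one possible implementation of that greedy idea; the function name greedy_bins and the short labels 'a' to 'j' are made up for the example, and it does not guarantee the best possible split:

```python
import heapq
import pandas as pd

def greedy_bins(series, n_bins):
    """Assign each label to the bin whose running total is currently smallest."""
    heap = [(0, b) for b in range(n_bins)]  # (running total, bin number)
    heapq.heapify(heap)
    assignment = {}
    # Biggest values first, as described in the steps above.
    for label, value in series.sort_values(ascending=False).items():
        total, b = heapq.heappop(heap)      # bin with the smallest total so far
        assignment[label] = b
        heapq.heappush(heap, (total + value, b))
    return pd.Series(assignment, name="bin")

# Hypothetical short example; the real Series uses UUID labels.
counts = pd.Series([18, 18, 16, 14, 13, 13, 12, 12, 12, 12], index=list("abcdefghij"))
bins = greedy_bins(counts, 5)
totals = counts.groupby(bins).sum()
print(totals)
```

Because the first five values all meet empty bins, each of them opens its own bin, which matches the second step. With this scheme the gap between the largest and smallest bin total can never exceed the largest single value.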

Get 25 quantile in cumsum pandas

Suppose I have the following DataFrame:
df = pd.DataFrame({'id': [2, 4, 10, 12, 13, 14, 19, 20, 21, 22, 24, 25, 27, 29, 30, 31, 42, 50, 54],
'value': [37410.0, 18400.0, 200000.0, 392000.0, 108000.0, 423000.0, 80000.0, 307950.0,
50807.0, 201740.0, 182700.0, 131300.0, 282005.0, 428800.0, 56000.0, 412400.0, 1091595.0, 1237200.0,
927500.0]})
And I do the following:
df.sort_values(by='id').set_index('id').cumsum()
value
id
2 37410.0
4 55810.0
10 255810.0
12 647810.0
13 755810.0
14 1178810.0
19 1258810.0
20 1566760.0
21 1617567.0
22 1819307.0
24 2002007.0
25 2133307.0
27 2415312.0
29 2844112.0
30 2900112.0
31 3312512.0
42 4404107.0
50 5641307.0
54 6568807.0
I want to know the first element of id whose cumulative sum is bigger than 25% of the total. In this example, 25% of the total would be 1,642,201.75. The first element to exceed that would be 22. I know it can be done with a for loop, but I think it would be pretty inefficient.
You could do:
percentile_25 = df['value'].sum() * 0.25
res = df[df['value'].cumsum() > percentile_25].head(1)
print(res)
Output
id value
9 22 201740.0
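Or use searchsorted, which does a binary search (O(log N)) on the cumulative sum. The cumulative sum is sorted as long as the values are non-negative, and side='right' returns the first position whose value is strictly greater than the threshold:
percentile_25 = df['value'].sum() * 0.25
i = df['value'].cumsum().searchsorted(percentile_25, side='right')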
res = df.iloc[i]
print(res)
Output
id 22.0
value 201740.0
Name: 9, dtype: float64

Pandas DataFrame find closest index in previous rows where condition is met

I have the following df1 dataframe:
t A
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6
5 23:05 5
6 23:06 4
7 23:07 9
8 23:08 7
9 23:09 10
10 23:10 8
For each t (increments simplified here, not uniformly distributed in real life), I would like to find, if any, the most recent time tr within the previous 5 min where A(t)- A(tr) >= 4. I want to get:
t A tr
0 23:00 2
1 23:01 1
2 23:02 2
3 23:03 2
4 23:04 6 23:03
5 23:05 5 23:01
6 23:06 4
7 23:07 9 23:06
8 23:08 7
9 23:09 10 23:06
10 23:10 8 23:06
Currently, I can use shift(1) to compare each row to the previous row, like cond = df1['A'] >= df1['A'].shift(1) + 4.
How can I look further in time?
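Assuming your data has exactly one row per minute with no gaps, you can compare against shifted copies:
df1['t'] = pd.to_timedelta(df1['t'].add(':00'))
# column i is True where A(t) - A(t - i minutes) >= 4, for lags of 1 to 4 minutes
df = pd.DataFrame({i:df1.A - df1.A.shift(i) >= 4 for i in range(1,5)})
# idxmax picks the smallest matching lag (the most recent match); rows without a match become NaT
df1['t'] - pd.to_timedelta('1min') * df.idxmax(axis=1).where(df.any(axis=1))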
Output:
0 NaT
1 NaT
2 NaT
3 NaT
4 23:03:00
5 23:01:00
6 NaT
7 23:06:00
8 NaT
9 23:06:00
10 23:06:00
dtype: timedelta64[ns]
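I added a datetime index and used rolling(), which supports time-based windows (such as '5s') in addition to fixed-size windows. Note that this example uses seconds instead of minutes, and that it compares each value only with the oldest value in its window.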
import pandas as pd
import numpy as np
import datetime
df1 = pd.DataFrame({'t' : [
datetime.datetime(2020, 5, 17, 23, 0, 0),
datetime.datetime(2020, 5, 17, 23, 0, 1),
datetime.datetime(2020, 5, 17, 23, 0, 2),
datetime.datetime(2020, 5, 17, 23, 0, 3),
datetime.datetime(2020, 5, 17, 23, 0, 4),
datetime.datetime(2020, 5, 17, 23, 0, 5),
datetime.datetime(2020, 5, 17, 23, 0, 6),
datetime.datetime(2020, 5, 17, 23, 0, 7),
datetime.datetime(2020, 5, 17, 23, 0, 8),
datetime.datetime(2020, 5, 17, 23, 0, 9),
datetime.datetime(2020, 5, 17, 23, 0, 10)
], 'A' : [2,1,2,2,6,5,4,9,7,10,8]}, columns=['t', 'A'])
df1.index = df1['t']
# raw=True passes each window as a NumPy array, so x[0] is the oldest value in the window
cond = df1['A'] >= df1.rolling('5s')['A'].apply(lambda x: x[0] + 4, raw=True)
result = df1[cond]
Gives
t A
2020-05-17 23:00:04 6
2020-05-17 23:00:05 5
2020-05-17 23:00:07 9
2020-05-17 23:00:09 10
2020-05-17 23:00:10 8
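Neither approach above returns tr itself when the timestamps are irregular. A minimal sketch of a direct backward search follows. It assumes the rows are sorted by time and that "within the previous 5 min" means strictly less than 5 minutes ago (this choice reproduces the expected output in the question). The helper name most_recent_match is made up for the example:

```python
import pandas as pd

df1 = pd.DataFrame({
    "t": pd.to_timedelta([f"23:{m:02d}:00" for m in range(11)]),
    "A": [2, 1, 2, 2, 6, 5, 4, 9, 7, 10, 8],
})

def most_recent_match(times, values, window, min_rise):
    """For each row, return the position of the latest earlier row that is less
    than `window` old and at least `min_rise` lower, or None if there is none."""
    result = []
    for i in range(len(values)):
        found = None
        for j in range(i - 1, -1, -1):       # walk backwards: most recent first
            if times[i] - times[j] >= window:
                break                        # this row and all earlier ones are too old
            if values[i] - values[j] >= min_rise:
                found = j
                break
        result.append(found)
    return result

times = list(df1["t"])                       # pandas Timedelta objects
pos = most_recent_match(times, df1["A"].tolist(), pd.Timedelta(minutes=5), 4)
df1["tr"] = [times[j] if j is not None else pd.NaT for j in pos]
print(df1)
```

This is a plain Python loop, so it is not vectorized, but the inner loop stops as soon as it leaves the time window, which keeps the work per row small when the window holds only a few rows.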

Can you explain the output: diff.sort_values(ascending=False).index.astype

Can anyone explain the following statement.
list(diff.sort_values(ascending=False).index.astype(int)[0:5])
Output: Int64Index([24, 26, 17, 2, 1], dtype='int64')
It sorts first, but what is the index doing and how do I get 24, 26, 17, 2, 1?
diff is a Series:
ipdb> diff
1 0.017647
2 0.311765
3 -0.060000
4 -0.120000
5 -0.040000
6 -0.120000
7 -0.190000
8 -0.200000
9 -0.100000
10 -0.011176
11 -0.130000
12 0.008824
13 -0.060000
14 -0.090000
15 -0.060000
16 0.008824
17 0.341765
18 -0.140000
19 -0.050000
20 -0.060000
21 -0.040000
22 -0.210000
23 0.008824
24 0.585882
25 -0.060000
26 0.555882
27 -0.031176
28 -0.060000
29 -0.170000
30 -0.220000
31 -0.170000
32 -0.040000
dtype: float64
Your code returns a list of the index labels of the 5 largest values of the Series, in descending order of value.
The first 'column' printed for a pandas Series is its index, so after sorting, your code converts the index labels to integers and takes the first five by slicing.
print (diff.sort_values(ascending=False))
24 0.585882
26 0.555882
17 0.341765
2 0.311765
1 0.017647
12 0.008824
23 0.008824
16 0.008824
10 -0.011176
27 -0.031176
32 -0.040000
21 -0.040000
5 -0.040000
19 -0.050000
15 -0.060000
3 -0.060000
13 -0.060000
25 -0.060000
28 -0.060000
20 -0.060000
14 -0.090000
9 -0.100000
6 -0.120000
4 -0.120000
11 -0.130000
18 -0.140000
31 -0.170000
29 -0.170000
7 -0.190000
8 -0.200000
22 -0.210000
30 -0.220000
Name: a, dtype: float64
print (diff.sort_values(ascending=False).index.astype(int))
Int64Index([24, 26, 17, 2, 1, 12, 23, 16, 10, 27, 32, 21, 5, 19, 15, 3, 13,
25, 28, 20, 14, 9, 6, 4, 11, 18, 31, 29, 7, 8, 22, 30],
dtype='int64')
print (diff.sort_values(ascending=False).index.astype(int)[0:5])
Int64Index([24, 26, 17, 2, 1], dtype='int64')
print (list(diff.sort_values(ascending=False).index.astype(int)[0:5]))
[24, 26, 17, 2, 1]
Here's what's happening:
diff.sort_values(ascending=False) sorts the Series by value. ascending defaults to True; you passed False, so the result is in descending order.
.index returns the row labels of the sorted Series, in their new sorted-by-value order, so the labels of the largest values come first.
.astype(int) casts those row labels to integers.
[0:5] keeps the first five labels (positions 0 through 4).
Let me know if this helps!
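As a side note, the same top-5 labels can probably be obtained more directly with Series.nlargest. The sketch below uses only a handful of the values from the question (so diff here is a shortened, illustrative Series) and compares the two approaches:

```python
import pandas as pd

# A shortened version of the Series from the question (same labels and values).
diff = pd.Series(
    [0.017647, 0.311765, -0.06, 0.341765, 0.585882, 0.555882],
    index=[1, 2, 3, 17, 24, 26],
)

# Original chain: sort by value, take the index, cast to int, slice, convert to list.
chained = list(diff.sort_values(ascending=False).index.astype(int)[0:5])

# Alternative: nlargest keeps only the 5 largest values, already in descending order.
top5 = diff.nlargest(5).index.tolist()

print(chained)
print(top5)
```

If several values are tied, the two approaches may order the tied labels differently, so they are only guaranteed to agree when the top values are distinct, as they are here.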

Pandas Dataframe splice data into 2 columns and make a number with a comma and integer

I currently am running into two issues:
My data-frame looks like this:
, male_female, no_of_students
0, 24 : 76, "81,120"
1, 33 : 67, "12,270"
2, 50 : 50, "10,120"
3, 42 : 58, "5,120"
4, 12 : 88, "2,200"
What I would like to achieve is this:
, male, female, no_of_students
0, 24, 76, 81120
1, 33, 67, 12270
2, 50, 50, 10120
3, 42, 58, 5120
4, 12, 88, 2200
Basically I want to convert male_female into two columns and no_of_students into a column of integers. I tried a bunch of things, converting the no_of_students column into another type with .astype. But nothing seems to work properly, I also couldn't really find a smart way of splitting the male_female column properly.
Hopefully someone can help me out!
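Use str.split with expand=True to split male_female into two new columns (pop removes the original column). For no_of_students, strip the surrounding quotes and spaces, remove the thousands separators with replace, and convert to integers: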
df[['male','female']] = df.pop('male_female').str.split(' : ', expand=True)
df['no_of_students'] = df['no_of_students'].str.strip('" ').str.replace(',','').astype(int)
df = df[['male','female', 'no_of_students']]
print (df)
male female no_of_students
0 24 76 81120
1 33 67 12270
2 50 50 10120
3 42 58 5120
4 12 88 2200
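For a self-contained variant, here is a sketch that builds the sample frame by hand (assuming the values were read in as plain strings without the quotes) and uses str.extract with a regular expression, which tolerates varying whitespace around the colon and also makes male and female integers:

```python
import pandas as pd

# Sample data from the question, typed in by hand as strings.
df = pd.DataFrame({
    "male_female": ["24 : 76", "33 : 67", "50 : 50", "42 : 58", "12 : 88"],
    "no_of_students": ["81,120", "12,270", "10,120", "5,120", "2,200"],
})

# Capture the two numbers on either side of the colon and convert them to int.
ratio = df.pop("male_female").str.extract(r"(\d+)\s*:\s*(\d+)").astype(int)
ratio.columns = ["male", "female"]

# Put the new columns first, then clean up the student counts.
df = ratio.join(df)
df["no_of_students"] = (
    df["no_of_students"].str.replace(",", "", regex=False).astype(int)
)
print(df)
```

The astype(int) calls will raise if any row fails to match the pattern or contains non-numeric text, which can be a useful early warning about unexpected input.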
