"matrices are not aligned" error message - python

I have the following dataframe of returns
ret
Out[3]:
Symbol            FX      OGDC       PIB       WTI
Date
2010-03-02  0.000443  0.006928  0.000000  0.012375
2010-03-03 -0.000690 -0.007873  0.000171  0.014824
2010-03-04 -0.001354  0.001545  0.000007 -0.008195
2010-03-05 -0.001578  0.008796 -0.000164  0.015955
And the following weights for each symbol:
df3
Out[4]:
  Symbol    Weight
0   OGDC  0.182022
1    WTI  0.534814
2     FX  0.131243
3    PIB  0.151921
I am trying to get a weighted return for each day and tried:
port_ret = ret.dot(df3)
but I get the following error message:
ValueError: matrices are not aligned
My objective is to have a weighted return for each date such that, for example 2010-03-02 would be as follows:
weighted_ret = 0.000443*.131243+.006928*.182022+0.000*0.151921+0.012375*.534814 = 0.007937512
I am not sure why I am getting this error, but I would be very happy with an alternative solution for the weighted return.

You have two columns in your weight matrix:
df3.shape
Out[38]: (4, 2)
Set the index to Symbol on that matrix to get the proper dot:
ret.dot(df3.set_index('Symbol'))
Out[39]:
              Weight
Date
2010-03-02  0.007938
2010-03-03  0.006430
2010-03-04 -0.004278
2010-03-05  0.009902
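If you would rather end up with a plain Series of portfolio returns than a one-column dataframe, you can pass the weights as a Series; pandas aligns ret's columns with the Series index. A minimal sketch built from the data in the question:

import pandas as pd

ret = pd.DataFrame(
    {'FX':   [0.000443, -0.000690, -0.001354, -0.001578],
     'OGDC': [0.006928, -0.007873,  0.001545,  0.008796],
     'PIB':  [0.000000,  0.000171,  0.000007, -0.000164],
     'WTI':  [0.012375,  0.014824, -0.008195,  0.015955]},
    index=pd.to_datetime(['2010-03-02', '2010-03-03',
                          '2010-03-04', '2010-03-05']))
ret.index.name = 'Date'

weights = pd.Series({'OGDC': 0.182022, 'WTI': 0.534814,
                     'FX': 0.131243, 'PIB': 0.151921})

port_ret = ret.dot(weights)  # Series indexed by Date
print(port_ret.round(6))
# Date
# 2010-03-02    0.007938
# 2010-03-03    0.006430
# 2010-03-04   -0.004278
# 2010-03-05    0.009902
# dtype: float64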

For a dot product of dataframe dfA with dataframe dfB, the column names of dfA must coincide with the index of dfB; otherwise you'll get the error ValueError: matrices are not aligned
import pandas as pd

dfA = pd.DataFrame(data=[[1, 2], [3, 4], [5, 6]], columns=['one', 'two'])
dfB = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], index=['one', 'two'])
dfA.dot(dfB)  # works: dfA's columns match dfB's index
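Here the product works and gives a 3x3 frame (the 0/1/2 labels are dfB's default column names):

    0   1   2
0   9  12  15
1  19  26  33
2  29  40  51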

Check the shape of the matrices you're calling the dot product on. The dot product A.dot(B) can be computed only if the second axis of A is the same size as the first axis of B.
In your example, df3 carries an additional non-numeric Symbol column that ruins the computation, and its rows are not in the same order as ret's columns. Move it into the index so pandas can align the two, then try running port_ret = ret.dot(df3.set_index('Symbol')) and check if it produces the result you desire.
For future cases, use the .shape attribute (or the numpy.shape() function) to debug matrix calculations; it is a really helpful tool.
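For instance, with the frames from the question:

print(ret.shape)                      # (4, 4)
print(df3.shape)                      # (4, 2) -> the Symbol column is included
print(df3.set_index('Symbol').shape)  # (4, 1) -> (4, 4) . (4, 1) = (4, 1)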

Related

Counting the number of pandas.DataFrame rows for each column

What I want to do
I would like to count the number of rows matching a condition, computed per column; each column can have a different count.
import numpy as np
import pandas as pd
## Sample DataFrame
data = [[1, 2], [0, 3], [np.nan, np.nan], [1, -1]]
index = ['i1', 'i2', 'i3', 'i4']
columns = ['c1', 'c2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Output
#      c1   c2
# i1  1.0  2.0
# i2  0.0  3.0
# i3  NaN  NaN
# i4  1.0 -1.0
## Question 1: Count non-NaN values
## Expected result
# [3, 3]
## Question 2: Count non-zero numerical values
## Expected result
# [2, 3]
Note: Data types of results are not important. They can be list, pandas.Series, pandas.DataFrame etc. (I can convert data types anyway.)
What I have checked
## For Question 1
print(df[df['c1'].apply(lambda x: not pd.isna(x))].count())
## For Question 2
print(df[df['c1'] != 0].count())
Obviously these two print calls only condition on column c1. It's easy to check the columns one at a time; I would like to know if there is a way to calculate the counts for all columns at once.
Environment
Python 3.10.5
pandas 1.4.3
Don't iterate over your data with apply; you can achieve your results in a vectorized fashion:
print(df.notna().sum().to_list()) # [3, 3]
print((df.ne(0) & df.notna()).sum().to_list()) # [2, 3]
Note that I have assumed that "Question 2: Count non-zero numerical values" also excludes NaN values; otherwise you would get [3, 4].
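As a side note, DataFrame.count() gives the non-NaN count per column directly, so Question 1 can also be written as:

print(df.count().to_list())  # [3, 3]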
You were close, I think! To answer your first question:
>>> df.apply(lambda x: x.notna().sum(), axis=0)
c1    3
c2    3
dtype: int64
Change to axis=1 to apply this operation to each row.
To answer your second question, this is from an already answered question on SO:
>>> df.astype(bool).sum(axis=0)
c1    3
c2    4
dtype: int64
Be aware that astype(bool) turns NaN into True, so this counts NaN cells as non-zero and gives [3, 4] rather than the expected [2, 3]; combine it with notna() as in the other answer to exclude them. In the same way you can change axis to 1 if you want.
Hope it helps!

Error when trying to set column as index in pandas dataframe

I have the following code:
A = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=[['att1', 'att2']])
A['idx'] = ['a', 'b', 'c']
A
which works fine until I do (trying to set column 'idx' as in index for the dataframe)
A.set_index('idx', inplace=True)
which throws an error
TypeError: only integer scalar arrays can be converted to a scalar index
What does this mean ?
The error comes from how you create A:
columns = [['att1', 'att2']]
The nested list makes pandas build a MultiIndex for the columns. If you print A.columns you will get:
MultiIndex([('att1',),
            ('att2',),
            ( 'idx',)],
           )
So 'idx' by itself is not really one of your column labels, which is why you can't set it as the index. Now, this would work:
A.set_index(('idx',))
and give:
        att1  att2
(idx,)
a          1     2
b          1     3
c          4     6
However, you should fix your creation of A with just:
columns = ['att1', 'att2']
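Putting that together, a minimal sketch of the corrected version:

import pandas as pd

# a flat list of labels gives a regular column index, so set_index works
A = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['att1', 'att2'])
A['idx'] = ['a', 'b', 'c']
A.set_index('idx', inplace=True)
print(A)
#      att1  att2
# idx
# a       1     2
# b       1     3
# c       4     6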

Why get different results when comparing two dataframes?

I am comparing two dataframes: .equals() gives me False, but if I append the two together and use drop_duplicates() it gives me nothing. Can someone explain this?
TL;DR
These are completely different operations and I'd have never expected them to produce the same results.
pandas.DataFrame.equals
Will return a boolean value depending on whether Pandas determines that the dataframes being compared are the "same". That means that the index of one is the "same" as the index of the other, the columns of one are the "same" as the columns of the other, and the data of one is the "same" as the data of the other.
See docs
It is NOT the same as pandas.DataFrame.eq which will return a dataframe of boolean values.
Setup
Consider these three dataframes
df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ['B', 'A'])
df2 = pd.DataFrame([[0, 1], [2, 3]], ['foo', 'bar'], ['A', 'B'])
df0          df1          df2

   A  B         B  A           A  B
0  0  1      0  1  0      foo  0  1
1  2  3      1  3  2      bar  2  3
If we check whether df1 equals df0, we get
df0.equals(df1)
False
Even though all elements are the same
df0.eq(df1).all().all()
True
And that is because the columns are not aligned. If I sort the columns then ...
df0.equals(df1.sort_index(axis=1))
True
pandas.DataFrame.drop_duplicates
Compares the values in rows and doesn't care about the index.
So, both of these produce the same looking results
df0.append(df2).drop_duplicates()
and
df0.append(df1, sort=True).drop_duplicates()
   A  B
0  0  1
1  2  3
When I append (or use pandas.concat), pandas aligns the columns and adds the appended dataframe as new rows. Then drop_duplicates does its thing. But it was that inherent aligning of the columns that did what I did above with sort_index and axis=1.
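A side note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so in current pandas the same demonstration reads:

import pandas as pd

df0 = pd.DataFrame([[0, 1], [2, 3]], [0, 1], ['A', 'B'])
df1 = pd.DataFrame([[1, 0], [3, 2]], [0, 1], ['B', 'A'])

# concat aligns the columns before stacking the rows, just like append did
print(pd.concat([df0, df1]).drop_duplicates())
#    A  B
# 0  0  1
# 1  2  3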
Maybe the rows in the two dataframes are not ordered the same way? Dataframes will be equal when the rows corresponding to the same index are the same.

pandas dataframe to features and labels

Here is a dataframe which I want to convert to feature and label lists/arrays.
The dataframe represents Fedex Ground Shipping rates for weight and zone Ids (columns of the dataframe).
The features need to be like below
[weight,zone]
e.g. [[1,2],[1,3] ...[1,25],[2,2],[2,3] ...[2,25]....[8,25]]
And the labels corresponding to them are basically the shipping charges so,
[[shipping charge]]
e.g. [[8.95],[9.44] .....[35.18]]
I am using the following code, but I am sure there has to be a faster, more optimized and perhaps more direct way to achieve this, either with the dataframe or with numpy:
features = {}
labels = {}
i = 0
j = 0
for weight in df_ground.Weight:
    for column in column_list[1:]:  # skipping the weight column!
        features[j] = [df_ground.Weight[i], column]
        labels[j] = df_ground[column][df_ground['Weight'] == df_ground.Weight[i]]
        j += 1
    i += 1
For a dataframe of size 2700 this code takes between 1 and 2 seconds. I am asking for suggestions on a more optimized way.
First, make 'Weight' the index and stack the columns into the index:
mixed = df_ground.set_index('Weight').stack()
#Weight
#1  2     8.95
#   3     9.44
#   4     9.89
#...
#2  2     9.24
#   3     9.92
#   4    10.41
Now, your new index is your features and the data column is your labels:
features = [list(x) for x in mixed.index]
#[[1, 2], [1, 3], [1, 4], ..., [2, 2], [2, 3], [2, 4], ...]
labels = [[x] for x in mixed.values]
#[[8.95],[9.44],[9.89],[9.24],[9.92],[10.41]])
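For reference, a self-contained sketch with a made-up stand-in for df_ground (the real rate table isn't shown in the question):

import pandas as pd

# hypothetical rate table: one row per weight, one column per zone id
df_ground = pd.DataFrame({'Weight': [1, 2],
                          2: [8.95, 9.24],
                          3: [9.44, 9.92]})

mixed = df_ground.set_index('Weight').stack()
features = [list(x) for x in mixed.index]  # [[1, 2], [1, 3], [2, 2], [2, 3]]
labels = [[x] for x in mixed.values]       # [[8.95], [9.44], [9.24], [9.92]]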

pandas groupby first column shifted down

So I have read a csv file in as a pandas dataframe, but when I group it, the year column is shifted down by one. So when I try to pull Years out into a numpy array, it gives an error saying KeyError: 'Year'.
Is there a way to get the array to find the years, or a way to shift that first column up by one?
I have found a way to shift a dataframe column up by one, but I need to shift the grouping, not the dataframe.
I also tried turning the new grouping into a new dataframe so that I can shift the year column up, but haven't been successful.
Year is the name of the index.
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
   A  B
0  1  2
1  3  4
In [13]: df.index.name = "foo"
In [14]: df
Out[14]:
     A  B
foo
0    1  2
1    3  4
Pull out the index with .index:
In [15]: df.index
Out[15]: Int64Index([0, 1], dtype='int64', name='foo')
In [16]: df.index.values
Out[16]: array([0, 1])
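If you want the year back as an ordinary column rather than the index, reset it (or group with as_index=False in the first place); a small sketch using the frame above:

In [17]: df_reset = df.reset_index()   # 'foo' becomes a regular column again

In [18]: df_reset['foo'].values
Out[18]: array([0, 1])

With the original data, grouping with df.groupby('Year', as_index=False) would avoid the issue up front.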
