create matrix structure using pandas

I have loaded the CSV file below, which contains code and coefficient data, into the dataframe df:
CODE|COEFFICIENT
A|0.5
B|0.4
C|0.3
import pandas as pd
import numpy as np
df = pd.read_csv('cod_coeff.csv', delimiter='|', encoding="utf-8-sig")
giving
CODE COEFFICIENT
0 A 0.5
1 B 0.4
2 C 0.3
From the above dataframe, I need to create a final dataframe as below which has a matrix structure with the product of the coefficients:
A B C
A 0.25 0.2 0.15
B 0.2 0.16 0.12
C 0.15 0.12 0.09
I am using np.multiply but I am not successful in producing the result.

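NumPy's outer product is a faster alternative. This assumes CODE has already been set as the index, so that df contains only the numeric COEFFICIENT column:
pd.DataFrame(np.outer(df, df), df.index, df.index)
Timing (the measured results were not preserved) was done on the given sample and on 30,000 rows built with:
df = pd.concat([df for _ in range(10000)], ignore_index=True)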

You want the product of a column vector and its transpose. Set CODE as the index, transpose with .T, and apply the matrix dot function between the two dataframes.
df = df.set_index('CODE')
df.T
Out[10]:
CODE A B C
COEFFICIENT 0.5 0.4 0.3
df.dot(df.T)
Out[11]:
CODE A B C
CODE
A 0.25 0.20 0.15
B 0.20 0.16 0.12
C 0.15 0.12 0.09
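Both answers above can be checked side by side. The following is a minimal sketch that builds the sample data inline instead of reading cod_coeff.csv; the variable names coeff, via_outer and via_dot are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data from the question, built inline instead of read from the CSV file.
df = pd.DataFrame({"CODE": ["A", "B", "C"], "COEFFICIENT": [0.5, 0.4, 0.3]})

# Index by CODE so that only the numeric column takes part in the product.
indexed = df.set_index("CODE")
coeff = indexed["COEFFICIENT"]

# Outer product with NumPy, relabelled with the codes on both axes.
via_outer = pd.DataFrame(np.outer(coeff, coeff), index=coeff.index, columns=coeff.index)

# Same matrix as a (3 x 1) frame times its (1 x 3) transpose.
via_dot = indexed.dot(indexed.T)

print(via_outer)
```

The two results hold the same values; the outer product simply skips the label alignment that DataFrame.dot performs.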

Related

pandas and NumPy to read a CSV and convert it from a 2D array to 1D, ignoring the diagonal values

My csv file looks like this:
0 |0.1|0.2|0.4|
0.1|0 |0.5|0.6|
0.2|0.5|0 |0.9|
0.4|0.6|0.9|0 |
I try to read it row by row, ignoring the diagonal values and write it as one long column like this:
0.1
0.2
0.4
0.1
0.5
0.6
0.2
0.5
0.9
....
I use this method:
import numpy as np
import pandas as pd
data = pd.read_csv(r"C:\Users\soso-\Desktop\SVM\DataSet\chem_Jacarrd_sim.csv")
row_vector = np.array(data)
result = row_vector.ravel()
result.reshape(299756,1)
df = pd.DataFrame({'chem':result})
df.to_csv("my2.csv")
However, the output skips the first row and still contains the zeros, as shown below. How can I fix it?
0.1
0
0.5
0.6
0.2
0.5
0
0.9
....

For the dataframe you have:
0 |0.1|0.2|0.4
0.1|0 |0.5|0.6
0.2|0.5|0 |0.9
0.4|0.6|0.9|0
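which I saved as ffff.csv, you need to do the following. Passing header=None keeps the first row as data instead of treating it as column names, and the mask removes the diagonal: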
import numpy as np
import pandas as pd
data = pd.read_csv("ffff.csv", sep="|", header=None)
print(data)
row_vector = np.array(data)
# Create a new mask with the correct shape
mask = np.zeros((row_vector.shape), dtype=bool)
mask[np.arange(row_vector.shape[0]), np.arange(row_vector.shape[0])] = True
result = np.ma.array(row_vector, mask=mask)
result = result.compressed()
df = pd.DataFrame({'chem':result})
df.to_csv("my2.csv", index=False)
print(df)
which returns:
chem
0 0.1
1 0.2
2 0.4
3 0.1
4 0.5
5 0.6
6 0.2
7 0.5
8 0.9
9 0.4
10 0.6
11 0.9

This one is a bit shorter, assuming you have a 2D NumPy array:
import numpy as np
arr = np.random.rand(3,3)
# array([[0.12964821, 0.92124532, 0.72456772],
# [0.26063188, 0.1486612 , 0.45312145],
# [0.04165099, 0.31071689, 0.26935581]])
arr_out = arr[np.where(~np.eye(arr.shape[0],dtype=bool))]
# array([0.92124532, 0.72456772, 0.26063188, 0.45312145, 0.04165099,
# 0.31071689])
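Applied to the matrix from the question, the boolean-mask idea might look like the sketch below. It reads the data from an in-memory string rather than a file, and it assumes a clean file without the trailing "|" or padding spaces; raw, values and off_diagonal are illustrative names.

```python
import io

import numpy as np
import pandas as pd

# The 4 x 4 similarity matrix from the question, as pipe-separated text.
raw = "0|0.1|0.2|0.4\n0.1|0|0.5|0.6\n0.2|0.5|0|0.9\n0.4|0.6|0.9|0\n"

# header=None keeps the first line as data instead of using it as column names.
data = pd.read_csv(io.StringIO(raw), sep="|", header=None)
values = data.to_numpy()

# np.eye marks the diagonal; inverting it selects every off-diagonal cell, row by row.
off_diagonal = values[~np.eye(values.shape[0], dtype=bool)]

out = pd.DataFrame({"chem": off_diagonal})
print(out)
```

Boolean indexing walks the array in row-major order, so the values come out row by row, which is the order the question asks for.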

Truncate all values in dataframe

I have a pandas dataframe with the first column being dates and then lots of adjusted stock prices (some of which have 16 decimals). I would like to truncate all the dataframe values to 8 decimals so I tried the following:
df = df.set_index("Day").pipe(lambda x: math.trunc(100000000 * x) / 100000000).reset_index()
But I get the following error:
type DataFrame doesn't define __trunc__ method

Have you tried formatting? Note that '%.3f' rounds rather than truncates, and it returns strings:
dec = [1.2736,9.3745,5.412783,8.25389]
to_3dp = lambda x: '%.3f'%(x)
rounded = [to_3dp(i) for i in dec]
print(rounded) # a list of strings; 1.2736 becomes '1.274'
So in your case (the column then holds strings, not floats):
df['column'] = df['column'].apply(lambda x: '%.8f'%(x))
If you want to round and keep floats:
df['column'] = df['column'].apply(lambda x: round(x,8))

Use numpy.trunc for a vectorized solution:
n = 10**8
out = np.trunc(df.set_index("Day").mul(n)).div(n).reset_index()
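To see how this differs from rounding, here is a small sketch on made-up data. The column names AAA and BBB and all the values are hypothetical, chosen so that truncating and rounding to 8 decimals give different results.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; only the "Day" column name comes from the question.
df = pd.DataFrame({
    "Day": ["2021-01-04", "2021-01-05"],
    "AAA": [1.123456789123, 2.5],
    "BBB": [0.999999999999, 10.000000019],
})

n = 10**8

# Scale up, drop the fractional part, scale back down: all columns at once.
truncated = np.trunc(df.set_index("Day").mul(n)).div(n).reset_index()

# Rounding, for comparison, can move the 8th decimal up.
rounded = df.set_index("Day").round(8).reset_index()

print(truncated)
print(rounded)
```

Because the values are binary floats, a number that sits a hair below an 8-decimal boundary after scaling can truncate one step lower than expected; that is inherent to the multiply-and-truncate approach.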

IIUC, you are trying to apply the truncating lambda to the whole dataframe at once, while math.trunc only accepts a single number. That's the reason for the error. Try applymap, which applies your function to each cell independently (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map). You first have to set the date column as the index, leaving only the float columns in the dataframe. Try this:
f = lambda x: math.trunc(100000000 * x) / 100000000 #<-- your function
df.set_index("Day").applymap(f).reset_index() #<-- applied on each cell
Since I don't have the sample dataset you are using, here is a working dummy example.
import math
import numpy as np
import pandas as pd
# Dummy dataframe with random values
df = pd.DataFrame(np.random.random((10,3)),
                  columns = ['col1','col2','col3'])
f = lambda x: math.trunc(100 * x) / 100
df.applymap(f)
col1 col2 col3
0 0.80 0.76 0.14
1 0.40 0.48 0.85
2 0.58 0.40 0.76
3 0.82 0.04 0.10
4 0.23 0.04 0.91
5 0.57 0.41 0.12
6 0.72 0.71 0.71
7 0.32 0.59 0.99
8 0.11 0.70 0.32
9 0.95 0.80 0.24
A simpler option, if it works for you, is to use df.set_index("Day").round(8) directly, but that rounds your numbers to 8 decimals instead of truncating them.

norm for all columns in a pandas dataframe

With a dataframe like this:
index col_1 col_2 ... col_n
0 0.2 0.1 0.3
1 0.2 0.1 0.3
2 0.2 0.1 0.3
...
n 0.4 0.7 0.1
How can one get the norm for each column?
Here the norm is the square root of the sum of the squares.
I am able to do this for each column sequentially, but am unsure how to vectorize it (avoiding a for loop):
import pandas as pd
import numpy as np
norm_col_1 = np.linalg.norm(df['col_1'])
norm_col_2 = np.linalg.norm(df['col_2'])
norm_col_n = np.linalg.norm(df['col_n'])
The answer would be a new Series like this:
norms
col_1 0.111
col_2 0.202
col_3 0.55
...
col_n 0.100

You can pass the entire DataFrame to np.linalg.norm, along with an axis argument of 0 to tell it to apply it column-wise:
np.linalg.norm(df, axis=0)
To create a series with appropriate column names, try:
results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
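As a quick check on small made-up numbers, the sketch below compares the np.linalg.norm call with the definition from the question (square root of the sum of the squares) written in pandas. The data values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with easy-to-verify norms: (3, 4) -> 5, (6, 8) -> 10, (0, 2) -> 2.
df = pd.DataFrame({"col_1": [3.0, 4.0], "col_2": [6.0, 8.0], "col_3": [0.0, 2.0]})

# axis=0 computes one Euclidean norm per column.
norms = pd.Series(np.linalg.norm(df, axis=0), index=df.columns, name="norms")

# The same quantity straight from the definition, without leaving pandas.
by_definition = (df ** 2).sum().pow(0.5)

print(norms)
```

Both give one value per column, labelled by column name.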

How to divide one dataframe by the other without converting to numpy first?

I have a dataframe with two columns, x and y, and a few hundred rows.
I have another dataframe with only one row and two columns, x and y.
I want to divide column x of the big dataframe by the value in x of the small dataframe, and column y by column y.
If I divide one dataframe by the other, I get all NaNs. For the division to work, I must convert the small dataframe to numpy.
Why can't I divide one dataframe by the other? What am I missing? I have a toy example below.
import numpy as np
import pandas as pd
df = pd.DataFrame()
r = int(10)
df['x'] = np.arange(0,r)
df['y'] = df['x'] * 2
other_df = pd.DataFrame()
other_df['x'] = [100]
other_df['y'] = [400]
# This doesn't work - I get all nans
new = df / other_df
# this works - it gives me what I want
new2 = df / [100,400]
# this also works
new3 = df / other_df.to_numpy()

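Dividing one DataFrame by another aligns on both the row index and the column labels, so only rows whose index exists in both frames get a value and every other row becomes NaN. Convert the one-row DataFrame to a Series so that only the columns are aligned, e.g. by selecting the first row with DataFrame.iloc: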
new = df / other_df.iloc[0]
print (new)
x y
0 0.00 0.000
1 0.01 0.005
2 0.02 0.010
3 0.03 0.015
4 0.04 0.020
5 0.05 0.025
6 0.06 0.030
7 0.07 0.035
8 0.08 0.040
9 0.09 0.045
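The alignment behaviour can be seen directly with the toy data from the question. This is only a sketch; the names aligned, by_series and by_div are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data from the question.
df = pd.DataFrame({"x": np.arange(10)})
df["y"] = df["x"] * 2
other_df = pd.DataFrame({"x": [100], "y": [400]})

# DataFrame / DataFrame aligns rows and columns: only index 0 exists in both frames.
aligned = df / other_df

# DataFrame / Series aligns the Series index with the columns and broadcasts down the rows.
by_series = df / other_df.iloc[0]

# The explicit method form of the same operation.
by_div = df.div(other_df.iloc[0], axis="columns")

print(aligned.head(3))
print(by_series.head(3))
```

In aligned, every row from index 1 onward is NaN because other_df has no row with that label, whereas the Series versions divide each column by its matching value.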

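You can also use numpy.divide(), which relies on NumPy broadcasting. Pass the small dataframe as a plain array, because recent pandas versions align two DataFrames by label before applying a NumPy function, which gives the same NaNs as df / other_df:
new = np.divide(df, other_df.to_numpy())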

Getting mean, max, min from pandas dataframe

I have the following dataframe which is the result of performing a standard pandas correlation:
df.corr()
abc xyz jkl
abc 1 0.2 -0.01
xyz -0.34 1 0.23
jkl 0.5 0.4 1
I have a few things that need to be done with these correlations. However, these calculations need to exclude all the cells where the value is 1: those are the cells where an item has a perfect correlation with itself, so I am not interested in them.
Determine the maximum correlation pair. The result is 'jkl' and 'abc' which has a correlation of 0.5
Determine the minimum correlation pair. The result is 'abc' and 'xyz' which has a correlation of -0.34
Determine the average/mean for the whole dataframe (again this needs to exclude all the values which are 1). The result would be (0.2 + -0.01 + -0.34 + 0.23 + 0.5 + 0.4) / 6 = 0.163333333

Check this:
import pandas as pd
from numpy import unravel_index, fill_diagonal, nanargmax, nanargmin
from bottleneck import nanmean  # numpy.nanmean works as well
labels = ['abc', 'xyz', 'jkl']
a = pd.DataFrame([[1, 0.2, -0.01], [-0.34, 1, 0.23], [0.5, 0.4, 1]], index=labels, columns=labels)
b = a.values.copy()  # work on a copy so that a keeps its diagonal
fill_diagonal(b, float('nan'))  # blank out the self-correlations
imax = unravel_index(nanargmax(b), b.shape)  # (row, column) position of the largest value
imin = unravel_index(nanargmin(b), b.shape)  # (row, column) position of the smallest value
print(a.index[imax[0]], a.columns[imax[1]])
print(a.index[imin[0]], a.columns[imin[1]])
print(nanmean(b))
Please don't forget to copy your data, otherwise np.fill_diagonal will erase its diagonal values.
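A pandas-only variant of the same idea is sketched below, assuming the correlation matrix from the question. It swaps the NumPy index arithmetic for DataFrame.mask and stack; the names corr, off_diag and pairs are illustrative.

```python
import numpy as np
import pandas as pd

labels = ["abc", "xyz", "jkl"]
corr = pd.DataFrame(
    [[1, 0.2, -0.01], [-0.34, 1, 0.23], [0.5, 0.4, 1]],
    index=labels,
    columns=labels,
)

# Replace the diagonal with NaN; mask returns a new frame, so corr is left untouched.
off_diag = corr.mask(np.eye(len(corr), dtype=bool))

# One value per (row, column) pair; dropna removes the masked diagonal explicitly.
pairs = off_diag.stack().dropna()

max_pair = pairs.idxmax()   # (row label, column label) of the largest correlation
min_pair = pairs.idxmin()   # (row label, column label) of the smallest correlation
mean_value = pairs.mean()   # mean over the six off-diagonal cells

print(max_pair, pairs[max_pair])
print(min_pair, pairs[min_pair])
print(mean_value)
```

Masking by position rather than by value keeps any genuine off-diagonal correlation of exactly 1 in the calculation.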
