With a dataframe like this:
index  col_1  col_2  ...  col_n
0      0.2    0.1         0.3
1      0.2    0.1         0.3
2      0.2    0.1         0.3
...
n      0.4    0.7         0.1
How can one get the norm of each column, where the norm is the square root of the sum of the squares? I am able to do this for each column sequentially, but am unsure how to vectorize it (avoiding a for loop):
import pandas as pd
import numpy as np
norm_col_1 = np.linalg.norm(df['col_1'])
norm_col_2 = np.linalg.norm(df['col_2'])
norm_col_n = np.linalg.norm(df['col_n'])
The answer would be a new pandas Series like this:
       norms
col_1  0.111
col_2  0.202
col_3  0.55
...
col_n  0.100
You can pass the entire DataFrame to np.linalg.norm, along with an axis argument of 0 to tell it to apply it column-wise:
np.linalg.norm(df, axis=0)
To create a series with appropriate column names, try:
results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
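As a quick sanity check, here is a minimal sketch with made-up values (the numbers are illustrative, not the ones from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [0.2, 0.2, 0.2, 0.4],
                   'col_2': [0.1, 0.1, 0.1, 0.7],
                   'col_3': [0.3, 0.3, 0.3, 0.1]})

results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
print(results)
# col_1    0.529150
# col_2    0.721110
# col_3    0.529150
# dtype: float64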
My csv file looks like this:
0 |0.1|0.2|0.4|
0.1|0 |0.5|0.6|
0.2|0.5|0 |0.9|
0.4|0.6|0.9|0 |
I am trying to read it row by row, ignore the diagonal values, and write it out as one long column, like this:
0.1
0.2
0.4
0.1
0.5
0.6
0.2
0.5
0.9
....
I use this method:
import numpy as np
import pandas as pd
data = pd.read_csv(r"C:\Users\soso-\Desktop\SVM\DataSet\chem_Jacarrd_sim.csv")
row_vector = np.array(data)
result = row_vector.ravel()
result.reshape(299756,1)
df = pd.DataFrame({'chem':result})
df.to_csv("my2.csv")
However, the output ignores the first row and reads the zeros, like this:
0.1
0
0.5
0.6
0.2
0.5
0
0.9
....
How can I fix it?
For the dataframe you have:
0 |0.1|0.2|0.4
0.1|0 |0.5|0.6
0.2|0.5|0 |0.9
0.4|0.6|0.9|0
which I saved as ffff.csv, you need to do the following:
import numpy as np
import pandas as pd

data = pd.read_csv("ffff.csv", sep="|", header=None)
print(data)

row_vector = np.array(data)
# Create a boolean mask that is True on the diagonal only
mask = np.zeros(row_vector.shape, dtype=bool)
mask[np.arange(row_vector.shape[0]), np.arange(row_vector.shape[0])] = True

# Mask the diagonal, then drop the masked entries
result = np.ma.array(row_vector, mask=mask)
result = result.compressed()

df = pd.DataFrame({'chem': result})
df.to_csv("my2.csv", index=False)
print(df)
which returns:
chem
0 0.1
1 0.2
2 0.4
3 0.1
4 0.5
5 0.6
6 0.2
7 0.5
8 0.9
9 0.4
10 0.6
11 0.9
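Compared with your original attempt, the key differences are sep="|" (the file shown is pipe-delimited, not comma-delimited) and header=None (so the first row is read as data rather than consumed as column names, which is why it went missing from your output), plus the mask, which drops the diagonal zeros instead of writing them out.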
This one is a bit shorter, assuming you have a 2D numpy array:
import numpy as np
arr = np.random.rand(3,3)
# array([[0.12964821, 0.92124532, 0.72456772],
# [0.26063188, 0.1486612 , 0.45312145],
# [0.04165099, 0.31071689, 0.26935581]])
arr_out = arr[np.where(~np.eye(arr.shape[0],dtype=bool))]
# array([0.92124532, 0.72456772, 0.26063188, 0.45312145, 0.04165099,
# 0.31071689])
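Note that the np.where is optional here: plain boolean indexing with arr[~np.eye(arr.shape[0], dtype=bool)] returns the same flattened off-diagonal values.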
I am tackling an issue in pandas:
I would like to group a DataFrame by an index column, then perform a transform(np.gradient) (i.e. compute the derivative over all values in a group). This doesn't work if a group is too small (fewer than 2 elements), so I would like to just return 0 in that case.
The following code returns an error:
import pandas as pd
import numpy as np
data = pd.DataFrame(
    {
        "time": [0, 0, 1, 2, 2, 3, 3],
        "position": [0.1, 0.2, 0.2, 0.1, 0.2, 0.1, 0.2],
        "speed": [150.0, 145.0, 149.0, 150.0, 150.0, 150.0, 150.0],
    }
)
derivative = data.groupby("time").transform(np.gradient)
Gives me a ValueError:
ValueError: Shape of array too small to calculate a numerical gradient, at least (edge_order + 1) elements are required.
The desired output for the example DataFrame above would be:
time  position_km
0     0.1          -5.0
      0.2          -5.0
1     0.2           0.0
2     0.1           0.0
      0.2           0.0
3     0.1           0.0
      0.2           0.0
Does anyone have a good idea on how to solve this, e.g. using a lambda function in the transform?
derivative = data.groupby("time").transform(lambda x: np.gradient(x) if len(x) > 1 else 0)
does exactly what I wanted. Thanks @Chrysophylaxs
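For the example data above, transform applies the lambda to each non-grouping column (position and speed) per group, so the full result should look like this (values follow from np.gradient's definition on each group):
derivative = data.groupby("time").transform(
    lambda x: np.gradient(x) if len(x) > 1 else 0
)
print(derivative)
#    position  speed
# 0       0.1   -5.0
# 1       0.1   -5.0
# 2       0.0    0.0
# 3       0.1    0.0
# 4       0.1    0.0
# 5       0.1    0.0
# 6       0.1    0.0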
Possible option:
def gradient_group(group):
    # Groups with fewer than 2 rows can't support np.gradient; fall back to 0
    if group.shape[0] < 2:
        return 0
    return np.gradient(group)

derivative = data.groupby("time").transform(gradient_group)
I have a matrix as follows:
     0    1    2    3  ...
A  0.1  0.2  0.3  0.1
C  0.5  0.4  0.2  0.1
G  0.6  0.4  0.8  0.3
T  0.1  0.1  0.4  0.2
The sequences are in a dataframe as shown:
Genes string
Gene1 ATGC
Gene2 GCTA
Gene3 ATCG
I need to write code to find the score of each sequence. The score for the sequence ATGC is 0.1 + 0.1 + 0.8 + 0.1 = 1.1: A scores 0.1 because A is in the first position and the matrix value for A at that position is 0.1, and the same lookup is applied along the whole length of the sequence (450 letters).
The output should be as follows:
Genes Score
Gene1 1.1
Gene2 1.5
Gene3 0.7
I tried using biopython but could not get it right. Can anyone please help?
Let df and genes be your DataFrames. First, let's convert df into a "tall" form:
tall = df.stack().reset_index()
tall.columns = 'letter', 'pos', 'score'
tall.pos = tall.pos.astype(int) # Need a number here, not a string!
Create a new tuple-based index for the tall DF:
tall.set_index(tall[['pos', 'letter']].apply(tuple, axis=1), inplace=True)
This function extracts the scores indexed by tuples of the form (position, letter) from the tall DF and sums them up:
def gene2score(gene):
    return tall.loc[list(enumerate(gene))]['score'].sum()
genes['string'].apply(gene2score)
#Genes
#Gene1 1.1
#Gene2 1.5
#Gene3 0.7
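For reference, here is a self-contained sketch of the whole pipeline under the same assumptions; the df and genes frames are reconstructed from the question, and df.at does the per-position lookups directly instead of going through the tuple index above:
import pandas as pd

# Score matrix from the question: rows are letters, columns are positions
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.1],
     [0.5, 0.4, 0.2, 0.1],
     [0.6, 0.4, 0.8, 0.3],
     [0.1, 0.1, 0.4, 0.2]],
    index=list("ACGT"),
)

genes = pd.DataFrame({"Genes": ["Gene1", "Gene2", "Gene3"],
                      "string": ["ATGC", "GCTA", "ATCG"]})

# Sum the matrix entry for each (letter, position) pair in the sequence
def score(seq):
    return sum(df.at[letter, pos] for pos, letter in enumerate(seq))

genes["Score"] = genes["string"].apply(score)
print(genes)
#    Genes string  Score
# 0  Gene1   ATGC    1.1
# 1  Gene2   GCTA    1.5
# 2  Gene3   ATCG    0.7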
I have loaded the below CSV file containing code and coefficient data into the below dataframe df:
CODE|COEFFICIENT
A|0.5
B|0.4
C|0.3
import pandas as pd
import numpy as np
df = pd.read_csv('cod_coeff.csv', delimiter='|', encoding="utf-8-sig")
giving:
  CODE  COEFFICIENT
0    A          0.5
1    B          0.4
2    C          0.3
From the above dataframe, I need to create a final dataframe as below which has a matrix structure with the product of the coefficients:
      A     B     C
A  0.25  0.20  0.15
B  0.20  0.16  0.12
C  0.15  0.12  0.09
I am using np.multiply but I am not successful in producing the result.
numpy as a faster alternative
pd.DataFrame(np.outer(df, df), df.index, df.index)
Timing was done on a 30,000-row sample built from the given data with df = pd.concat([df for _ in range(10000)], ignore_index=True).
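As a minimal end-to-end sketch (assuming df is first indexed by CODE, as in the answer below, so that np.outer sees only the coefficient values):
import numpy as np
import pandas as pd

df = pd.DataFrame({"CODE": ["A", "B", "C"],
                   "COEFFICIENT": [0.5, 0.4, 0.3]}).set_index("CODE")

# np.outer flattens its inputs, so the (3, 1) frame becomes a length-3 vector
result = pd.DataFrame(np.outer(df, df), index=df.index, columns=df.index)
print(result)
# CODE     A     B     C
# CODE
# A     0.25  0.20  0.15
# B     0.20  0.16  0.12
# C     0.15  0.12  0.09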
You want to do the math between a vector and its transpose. Set the index, transpose with .T, and apply the matrix dot function between the two dataframes.
df = df.set_index('CODE')
df.T
Out[10]:
CODE A B C
COEFFICIENT 0.5 0.4 0.3
df.dot(df.T)
Out[11]:
CODE A B C
CODE
A 0.25 0.20 0.15
B 0.20 0.16 0.12
C 0.15 0.12 0.09
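This works out to the same outer product as the numpy approach: after set_index, df is a 3x1 matrix, so df.dot(df.T) multiplies it by its 1x3 transpose, yielding the 3x3 matrix of pairwise coefficient products.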
I have the following dataframe which is the result of performing a standard pandas correlation:
df.corr()
      abc    xyz    jkl
abc   1      0.2   -0.01
xyz  -0.34   1      0.23
jkl   0.5    0.4    1
I have a few things that need to be done with these correlations, but the calculations need to exclude all the cells where the value is 1 (the cells where an item has a perfect correlation with itself, which I am not interested in):
Determine the maximum correlation pair. The result is 'jkl' and 'abc', which have a correlation of 0.5.
Determine the minimum correlation pair. The result is 'abc' and 'xyz', which have a correlation of -0.34.
Determine the average/mean for the whole dataframe (again excluding all the values which are 1). The result would be (0.2 + -0.01 + -0.34 + 0.23 + 0.5 + 0.4) / 6 = 0.163333333.
Check this:
from numpy import unravel_index, fill_diagonal, nanargmax, nanargmin
from bottleneck import nanmean
import pandas as pd

a = pd.DataFrame(
    [[1.0, 0.2, -0.01],
     [-0.34, 1.0, 0.23],
     [0.5, 0.4, 1.0]],
    index=['abc', 'xyz', 'jkl'],
    columns=['abc', 'xyz', 'jkl'],
)

b = a.values.copy()     # work on a copy so the original frame keeps its diagonal
fill_diagonal(b, None)  # None is stored as NaN in a float array
imax = unravel_index(nanargmax(b), b.shape)
imin = unravel_index(nanargmin(b), b.shape)
print(a.index[imax[0]], a.columns[imax[1]])
print(a.index[imin[0]], a.columns[imin[1]])
print(nanmean(b))
Please don't forget to copy your data; otherwise np.fill_diagonal will overwrite the diagonal values of your original array.
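(If bottleneck is not available, numpy's own np.nanmean can stand in for the final mean; it gives the same result here.)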