With a dataframe like this:
index  col_1  col_2  ...  col_n
0      0.2    0.1         0.3
1      0.2    0.1         0.3
2      0.2    0.1         0.3
...
n      0.4    0.7         0.1
How can one get the norm of each column, where the norm is the square root of the sum of the squares? I am able to do this for each column sequentially, but am unsure how to vectorize it (avoiding a for loop):
import pandas as pd
import numpy as np
norm_col_1 = np.linalg.norm(df['col_1'])
norm_col_2 = np.linalg.norm(df['col_2'])
norm_col_n = np.linalg.norm(df['col_n'])
The answer would be a new pandas Series like this:
       norms
col_1  0.111
col_2  0.202
col_3  0.55
...
col_n  0.100
You can pass the entire DataFrame to np.linalg.norm, along with an axis argument of 0 to tell it to apply it column-wise:
np.linalg.norm(df, axis=0)
To create a series with appropriate column names, try:
results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
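As a quick sanity check, here is a minimal sketch with made-up values (the numbers are illustrative, not the ones from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [0.2, 0.2, 0.2, 0.4],
                   'col_2': [0.1, 0.1, 0.1, 0.7],
                   'col_3': [0.3, 0.3, 0.3, 0.1]})

results = pd.Series(data=np.linalg.norm(df, axis=0), index=df.columns)
print(results)
# col_1    0.529150
# col_2    0.721110
# col_3    0.529150
# dtype: float64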
My csv file looks like this:
0 |0.1|0.2|0.4|
0.1|0 |0.5|0.6|
0.2|0.5|0 |0.9|
0.4|0.6|0.9|0 |
I am trying to read it row by row, ignore the diagonal values, and write it out as one long column, like this:
0.1
0.2
0.4
0.1
0.5
0.6
0.2
0.5
0.9
....
I use this method:
import numpy as np
import pandas as pd
data = pd.read_csv(r"C:\Users\soso-\Desktop\SVM\DataSet\chem_Jacarrd_sim.csv")
row_vector = np.array(data)
result = row_vector.ravel()
result.reshape(299756,1)
df = pd.DataFrame({'chem':result})
df.to_csv("my2.csv")
However, the output ignores the first row and reads the zeros, like this:
0.1
0
0.5
0.6
0.2
0.5
0
0.9
....
How can I fix it?
For the dataframe you have:
0 |0.1|0.2|0.4
0.1|0 |0.5|0.6
0.2|0.5|0 |0.9
0.4|0.6|0.9|0
which I saved as ffff.csv, you need to do the following:
import numpy as np
import pandas as pd

data = pd.read_csv("ffff.csv", sep="|", header=None)
print(data)

row_vector = np.array(data)
# Create a boolean mask that is True on the diagonal only
mask = np.zeros(row_vector.shape, dtype=bool)
mask[np.arange(row_vector.shape[0]), np.arange(row_vector.shape[0])] = True

# Mask the diagonal, then drop the masked entries
result = np.ma.array(row_vector, mask=mask)
result = result.compressed()

df = pd.DataFrame({'chem': result})
df.to_csv("my2.csv", index=False)
print(df)
which returns:
chem
0 0.1
1 0.2
2 0.4
3 0.1
4 0.5
5 0.6
6 0.2
7 0.5
8 0.9
9 0.4
10 0.6
11 0.9
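Compared with your original attempt, the key differences are sep="|" (the file shown is pipe-delimited, not comma-delimited) and header=None (so the first row is read as data rather than consumed as column names, which is why it went missing from your output), plus the mask, which drops the diagonal zeros instead of writing them out.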
This one is a bit shorter, assuming you have a 2D numpy array:
import numpy as np
arr = np.random.rand(3,3)
# array([[0.12964821, 0.92124532, 0.72456772],
# [0.26063188, 0.1486612 , 0.45312145],
# [0.04165099, 0.31071689, 0.26935581]])
arr_out = arr[np.where(~np.eye(arr.shape[0],dtype=bool))]
# array([0.92124532, 0.72456772, 0.26063188, 0.45312145, 0.04165099,
# 0.31071689])
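Note that the np.where is optional here: plain boolean indexing with arr[~np.eye(arr.shape[0], dtype=bool)] returns the same flattened off-diagonal values.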
I am tackling an issue in pandas:
I would like to group a DataFrame by an index column, then perform a transform(np.gradient) (i.e. compute the derivative over all values in a group). This doesn't work if a group is too small (fewer than 2 elements), so I would like to just return 0 in that case.
The following code returns an error:
import pandas as pd
import numpy as np
data = pd.DataFrame(
    {
        "time": [0, 0, 1, 2, 2, 3, 3],
        "position": [0.1, 0.2, 0.2, 0.1, 0.2, 0.1, 0.2],
        "speed": [150.0, 145.0, 149.0, 150.0, 150.0, 150.0, 150.0],
    }
)
derivative = data.groupby("time").transform(np.gradient)
Gives me a ValueError:
ValueError: Shape of array too small to calculate a numerical gradient, at least (edge_order + 1) elements are required.
The desired output for the example DataFrame above would be:
time  position_km
0     0.1          -5.0
      0.2          -5.0
1     0.2           0.0
2     0.1           0.0
      0.2           0.0
3     0.1           0.0
      0.2           0.0
Does anyone have a good idea on how to solve this, e.g. using a lambda function in the transform?
derivative = data.groupby("time").transform(lambda x: np.gradient(x) if len(x) > 1 else 0)
does exactly what I wanted. Thanks @Chrysophylaxs
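For the example data above, transform applies the lambda to each non-grouping column (position and speed) per group, so the full result should look like this (values follow from np.gradient's definition on each group):
derivative = data.groupby("time").transform(
    lambda x: np.gradient(x) if len(x) > 1 else 0
)
print(derivative)
#    position  speed
# 0       0.1   -5.0
# 1       0.1   -5.0
# 2       0.0    0.0
# 3       0.1    0.0
# 4       0.1    0.0
# 5       0.1    0.0
# 6       0.1    0.0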
Possible option:
def gradient_group(group):
    # Groups with fewer than 2 rows can't support np.gradient; fall back to 0
    if group.shape[0] < 2:
        return 0
    return np.gradient(group)

derivative = data.groupby("time").transform(gradient_group)
I have a matrix as follows:
     0    1    2    3  ...
A  0.1  0.2  0.3  0.1
C  0.5  0.4  0.2  0.1
G  0.6  0.4  0.8  0.3
T  0.1  0.1  0.4  0.2
The sequences are in a dataframe as shown:
Genes string
Gene1 ATGC
Gene2 GCTA
Gene3 ATCG
I need to write code to find the score of each sequence. The score for the sequence ATGC is 0.1 + 0.1 + 0.8 + 0.1 = 1.1: A scores 0.1 because A is in the first position and the matrix value for A at that position is 0.1, and the same lookup is applied along the whole length of the sequence (450 letters).
The output should be as follows:
Genes Score
Gene1 1.1
Gene2 1.5
Gene3 0.7
I tried using biopython but could not get it right. Can anyone please help?
Let df and genes be your DataFrames. First, let's convert df into a "tall" form:
tall = df.stack().reset_index()
tall.columns = 'letter', 'pos', 'score'
tall.pos = tall.pos.astype(int) # Need a number here, not a string!
Create a new tuple-based index for the tall DF:
tall.set_index(tall[['pos', 'letter']].apply(tuple, axis=1), inplace=True)
This function extracts the scores indexed by tuples of the form (position, letter) from the tall DF and sums them up:
def gene2score(gene):
    return tall.loc[list(enumerate(gene))]['score'].sum()
genes['string'].apply(gene2score)
#Genes
#Gene1 1.1
#Gene2 1.5
#Gene3 0.7
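For reference, here is a self-contained sketch of the whole pipeline under the same assumptions; the df and genes frames are reconstructed from the question, and df.at does the per-position lookups directly instead of going through the tuple index above:
import pandas as pd

# Score matrix from the question: rows are letters, columns are positions
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.1],
     [0.5, 0.4, 0.2, 0.1],
     [0.6, 0.4, 0.8, 0.3],
     [0.1, 0.1, 0.4, 0.2]],
    index=list("ACGT"),
)

genes = pd.DataFrame({"Genes": ["Gene1", "Gene2", "Gene3"],
                      "string": ["ATGC", "GCTA", "ATCG"]})

# Sum the matrix entry for each (letter, position) pair in the sequence
def score(seq):
    return sum(df.at[letter, pos] for pos, letter in enumerate(seq))

genes["Score"] = genes["string"].apply(score)
print(genes)
#    Genes string  Score
# 0  Gene1   ATGC    1.1
# 1  Gene2   GCTA    1.5
# 2  Gene3   ATCG    0.7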
I have loaded the below CSV file containing code and coefficient data into the below dataframe df:
CODE|COEFFICIENT
A|0.5
B|0.4
C|0.3
import pandas as pd
import numpy as np
df = pd.read_csv('cod_coeff.csv', delimiter='|', encoding="utf-8-sig")
giving:
  CODE  COEFFICIENT
0    A          0.5
1    B          0.4
2    C          0.3
From the above dataframe, I need to create a final dataframe as below which has a matrix structure with the product of the coefficients:
      A     B     C
A  0.25  0.20  0.15
B  0.20  0.16  0.12
C  0.15  0.12  0.09
I am using np.multiply but I am not successful in producing the result.
numpy as a faster alternative
pd.DataFrame(np.outer(df, df), df.index, df.index)
Timing was done on a 30,000-row sample built from the given data with df = pd.concat([df for _ in range(10000)], ignore_index=True).
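As a minimal end-to-end sketch (assuming df is first indexed by CODE, as in the answer below, so that np.outer sees only the coefficient values):
import numpy as np
import pandas as pd

df = pd.DataFrame({"CODE": ["A", "B", "C"],
                   "COEFFICIENT": [0.5, 0.4, 0.3]}).set_index("CODE")

# np.outer flattens its inputs, so the (3, 1) frame becomes a length-3 vector
result = pd.DataFrame(np.outer(df, df), index=df.index, columns=df.index)
print(result)
# CODE     A     B     C
# CODE
# A     0.25  0.20  0.15
# B     0.20  0.16  0.12
# C     0.15  0.12  0.09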
You want to do the math between a vector and its transpose. Set the index, transpose with .T, and apply the matrix dot function between the two dataframes.
df = df.set_index('CODE')
df.T
Out[10]:
CODE A B C
COEFFICIENT 0.5 0.4 0.3
df.dot(df.T)
Out[11]:
CODE A B C
CODE
A 0.25 0.20 0.15
B 0.20 0.16 0.12
C 0.15 0.12 0.09
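This works out to the same outer product as the numpy approach: after set_index, df is a 3x1 matrix, so df.dot(df.T) multiplies it by its 1x3 transpose, yielding the 3x3 matrix of pairwise coefficient products.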
I have the following dataframe which is the result of performing a standard pandas correlation:
df.corr()
      abc    xyz    jkl
abc   1      0.2   -0.01
xyz  -0.34   1      0.23
jkl   0.5    0.4    1
I have a few things that need to be done with these correlations, but the calculations need to exclude all the cells where the value is 1 (the cells where an item has a perfect correlation with itself, which I am not interested in):
Determine the maximum correlation pair. The result is 'jkl' and 'abc', which have a correlation of 0.5.
Determine the minimum correlation pair. The result is 'abc' and 'xyz', which have a correlation of -0.34.
Determine the average/mean for the whole dataframe (again excluding all the values which are 1). The result would be (0.2 + -0.01 + -0.34 + 0.23 + 0.5 + 0.4) / 6 = 0.163333333.
Check this:
from numpy import unravel_index, fill_diagonal, nanargmax, nanargmin
from bottleneck import nanmean
import pandas as pd

a = pd.DataFrame(
    [[1.0, 0.2, -0.01],
     [-0.34, 1.0, 0.23],
     [0.5, 0.4, 1.0]],
    index=['abc', 'xyz', 'jkl'],
    columns=['abc', 'xyz', 'jkl'],
)

b = a.values.copy()     # work on a copy so the original frame keeps its diagonal
fill_diagonal(b, None)  # None is stored as NaN in a float array
imax = unravel_index(nanargmax(b), b.shape)
imin = unravel_index(nanargmin(b), b.shape)
print(a.index[imax[0]], a.columns[imax[1]])
print(a.index[imin[0]], a.columns[imin[1]])
print(nanmean(b))
Please don't forget to copy your data; otherwise np.fill_diagonal will overwrite the diagonal values of your original array.
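(If bottleneck is not available, numpy's own np.nanmean can stand in for the final mean; it gives the same result here.)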