Calculation is done only on part of the table - python

I am trying to calculate the kurtosis and skewness over my data. I managed to create a table, but for some reason the result only covers some of the columns and not all of the fields.
For example, as you can see, I have many fields (columns):
I calculate the skewness and kurtosis using the following code:
sk = pd.DataFrame(data.skew())
kr = pd.DataFrame(data.kurtosis())
sk['kr'] = kr
sk.rename(columns={0: 'sk'}, inplace=True)
but then I get a result that contains only about half of the data I have:
I have tried head(10), but it doesn't change the fact that some columns disappeared.
How can I calculate this for all the columns?

It is really hard to reproduce the error since you did not give the original data. Your dataframe probably contains non-numeric values in the missing columns, which would cause this behavior.
dat = {"1": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "2": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "3": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "4": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 0.34, 'lg4': 0.45},
       "5": {'lg1': 0.12, 'lg2': 0.23, 'lg3': 'po', 'lg4': 0.45}}
df = pd.DataFrame.from_dict(dat).T
print(df)
    lg1   lg2   lg3   lg4
1  0.12  0.23  0.34  0.45
2  0.12  0.23  0.34  0.45
3  0.12  0.23  0.34  0.45
4  0.12  0.23  0.34  0.45
5  0.12  0.23    po  0.45
print(df.kurtosis())
lg1    0
lg2    0
lg4    0
The solution would be to preprocess the data.
One piece of advice: check whether the error is consistent, i.e. are the same columns always the ones missing?
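For instance, a minimal preprocessing sketch (on a made-up frame mimicking the one above; column names and values are illustrative): coerce every column to numeric so that stray strings become NaN, after which skew() and kurtosis() cover all columns:

```python
import pandas as pd

# Made-up frame: the string 'po' makes lg3 an object (non-numeric) column
dat = {"lg1": [0.12, 0.15, 0.18, 0.21, 0.30],
       "lg2": [0.23, 0.25, 0.27, 0.29, 0.31],
       "lg3": [0.34, 0.36, 0.38, 0.40, "po"]}
df = pd.DataFrame(dat)

# Coerce every column to numeric; unparseable cells become NaN
df_num = df.apply(pd.to_numeric, errors="coerce")

# skew()/kurtosis() now cover all columns, skipping NaNs per column
stats = pd.DataFrame({"sk": df_num.skew(), "kr": df_num.kurtosis()})
print(stats)
```

Whether coercing to NaN is acceptable depends on the data; if the strings carry meaning, they need to be cleaned up rather than dropped.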

Related

Matching to a specific year column in pandas

I am trying to take a "given" value and match it to a "year" in the same row using the following dataframe:
data = {
    'Given':  [0.45, 0.39, 0.99, 0.58, None],
    'Year 1': [0.25, 0.15, 0.3, 0.23, 0.25],
    'Year 2': [0.39, 0.27, 0.55, 0.3, 0.4],
    'Year 3': [0.43, 0.58, 0.78, 0.64, 0.69],
    'Year 4': [0.65, 0.83, 0.95, 0.73, 0.85],
    'Year 5': [0.74, 0.87, 0.99, 0.92, 0.95]
}
df = pd.DataFrame(data)
print(df)
Output:
Given Year 1 Year 2 Year 3 Year 4 Year 5
0 0.45 0.25 0.39 0.43 0.65 0.74
1 0.39 0.15 0.27 0.58 0.83 0.87
2 0.99 0.30 0.55 0.78 0.95 0.99
3 0.58 0.23 0.30 0.64 0.73 0.92
4 NaN 0.25 0.40 0.69 0.85 0.95
However, the matching process has a few caveats. I am trying to match the given value to the closest year before calculating the time to the first year above 70%. So row 0 would match to "Year 3", and we can see in the same row that it will take two years until "Year 5", which is the first occurrence in the row above 70%.
For any given value already above 70%, we can just output "full", and for any given values that don't contain data, we can just output the first year above 70%. The output will look like the following:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2
1 0.39 0.15 0.27 0.58 0.83 0.87 2
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1
4 NaN 0.25 0.40 0.69 0.85 0.95 4
It has taken me a horrendously long time to clean up this data, so at the moment I can't think of a way to begin other than some use of .abs() to start the matching process. All help appreciated.
Vectorized Pandas Approach:
Use reset_index() on the transposed frames (.T) so that both frames share the same column labels and can be subtracted from each other in a vectorized way. pd.concat() with * builds a dataframe that duplicates the first column, so you can take the absolute difference of whole dataframes instead of looping through the columns.
Use idxmax and idxmin to identify the column numbers according to your criteria.
Use np.select to map those column numbers onto your output rules.
import pandas as pd
import numpy as np

# identify 70% columns
pct_70 = (df.T.reset_index(drop=True).T > .7).idxmax(axis=1)

# identify column number of lowest absolute difference to Given
nearest_col = ((df.iloc[:, 1:].T.reset_index(drop=True).T
                - pd.concat([df.iloc[:, 0]] * len(df.columns[1:]), axis=1)
                  .T.reset_index(drop=True).T)).abs().idxmin(axis=1)

# generate an output series
output = pct_70 - nearest_col - 1

# conditionally apply the output series
df['Output'] = np.select([output.gt(0), output.lt(0), output.isnull()],
                         [output, 'full', pct_70], np.nan)
df
Out[1]:
Given Year 1 Year 2 Year 3 Year 4 Year 5 Output
0 0.45 0.25 0.39 0.43 0.65 0.74 2.0
1 0.39 0.15 0.27 0.58 0.83 0.87 2.0
2 0.99 0.30 0.55 0.78 0.95 0.99 full
3 0.58 0.23 0.30 0.64 0.73 0.92 1.0
4 NaN 0.25 0.40 0.69 0.85 0.95 4
Here you go!
import numpy as np

def output(df):
    output = []
    for i in df.iterrows():
        row = i[1].to_list()
        given = row[0]
        compare = np.array(row[1:])
        first_70 = np.argmax(compare > 0.7)
        if np.isnan(given):
            output.append(first_70 + 1)
            continue
        if given > 0.7:
            output.append('full')
            continue
        diff = np.abs(compare - given)
        closest_year = diff.argmin()
        output.append(first_70 - closest_year)
    return output

df['output'] = output(df)

Plot clusters of similar words from pandas dataframe

I have a big dataframe, here's a small subset:
key_words  prove  have  read  represent  lead  replace
be          0.58  0.49  0.48       0.17  0.23     0.89
represent   0.66  0.43     0          1     0     0.46
associate   0.88  0.23  0.12       0.43  0.11     0.67
induce      0.43  0.41  0.33       0.47     0     0.43
This shows how close each word in key_words is to each of the column words (based on their embedding distance).
I want to find a way to visualize this dataframe so that I see the clusters that are being formed among the words that are closest to each other.
Is there a simple way to do this, considering that the key_word column has string values?
One option is to set the key_words column as index and to use seaborn.clustermap to plot the clusters:
# pip install seaborn
import seaborn as sns

sns.clustermap(df.set_index('key_words'),  # data
               vmin=0, vmax=1,             # values of min/max, here white/black
               cmap='Greys',               # color palette
               figsize=(5, 5))             # plot size
output: (clustermap figure: hierarchically clustered heatmap with row and column dendrograms)
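If you also want the cluster ordering back as data, the ClusterGrid object that clustermap returns exposes the reordered row indices; a sketch on a made-up frame shaped like the one in the question:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import pandas as pd
import seaborn as sns

# Toy similarity frame (values illustrative), key words as the index
df = pd.DataFrame(
    {"prove": [0.58, 0.66, 0.88, 0.43],
     "have": [0.49, 0.43, 0.23, 0.41],
     "read": [0.48, 0.0, 0.12, 0.33],
     "represent": [0.17, 1.0, 0.43, 0.47]},
    index=["be", "represent", "associate", "induce"])

g = sns.clustermap(df, vmin=0, vmax=1, cmap="Greys", figsize=(5, 5))
# Row labels in the order the dendrogram arranged them
order = [df.index[i] for i in g.dendrogram_row.reordered_ind]
print(order)
```

The reordered labels tell you which words ended up adjacent, i.e. which rows the clustering considers most similar.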

Creating interaction terms in python

I'm trying to create interaction terms in a dataset. Is there another (simpler) way of creating interaction terms of columns in a dataset? For example, creating interaction terms in combinations of columns 4:98 and 98:106. I tried looping over the columns using numpy arrays, but with the following code, the kernel keeps dying.
col1 = df.columns[4:98]    # 94 columns
col2 = df.columns[98:106]  # 8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]

for i in range(94):
    for j in range(8):
        name = col1[i] + "*" + col2[j]
        df[name] = var1_np[:, i] * var2_np[:, j]
Here, df is the dataframe and df_np is df as a NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Then use join to build each new column name from the product result, and use the pandas prod method to take the product of the two columns along axis 1 (across the columns).
import pandas as pd
import numpy as np
from itertools import product

# setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0, 10)]
print(df)

# code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
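At the original question's scale, part of the kernel trouble may also come from inserting 750+ columns one at a time, which fragments the frame. A broadcasting sketch (toy frame and illustrative slice bounds, not the asker's real columns) builds all pairwise products in one NumPy operation and attaches them with a single concat:

```python
import numpy as np
import pandas as pd
from itertools import product

# Toy frame standing in for the real one; slice bounds are illustrative
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 8)).round(2),
                  columns=[f"col_{c}" for c in range(8)])

col1 = df.columns[3:5]   # first block of columns
col2 = df.columns[5:8]   # second block of columns

a = df[col1].to_numpy()  # shape (n, 2)
b = df[col2].to_numpy()  # shape (n, 3)
# Broadcast to all pairwise products at once: shape (n, 2*3)
prods = (a[:, :, None] * b[:, None, :]).reshape(len(df), -1)

# Row-major reshape order matches product(col1, col2)
names = ["*".join(p) for p in product(col1, col2)]
inter = pd.DataFrame(prods, columns=names, index=df.index)
df = pd.concat([df, inter], axis=1)  # one concat avoids fragmentation
print(df.columns.tolist())
```

The single concat at the end replaces hundreds of individual `df[name] = ...` assignments, which is usually the difference between a responsive kernel and a dead one at that column count.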

Create line plot from dataframe with two columns index

I have the following dataframe:
>>> mean_traf_tie
a     d     c
0.22  0.99  0.11    22
            0.23    21
            0.34    34
            0.46    45
0.44  0.99  0.11    45
            0.23    65
            0.34    66
            0.46    68
0.50  0.50  0.11    22
            0.23    12
            0.34    34
            0.46    37
...
I want to create a plot from this dataframe such that c is the x axis, y is the mean velocity, and the lines are determined by the a and d columns. So, for example, one line would be for a=0.22 and d=0.99, with c on the x axis and mean velocity on the y axis, and the second line would be for a=0.44 and d=0.99, etc.
I have tried to do it like this:
df.plot()
(values are different in the original dataframe).
As you can see, for some reason it puts the a,d values on the x axis and creates only one line.
I have tried to fix it like this:
df.unstack(level=0).plot(figsize=(10,6))
but then I got a very strange graph, with the correct lines by a and d but the wrong x axis:
As you can see, it somehow plots the a,d values, but that is not what I want. I want the x axis to be the c column, with the lines based on the a,d columns, which should produce continuous lines.
I have tried this as well:
df[('mean_traf_tie')].unstack(level=0).plot(figsize=(10,6))
plt.xlabel('C')
plt.ylabel('mean_traf_tie')
but got the same result again:
The desired output would have the c column as the x axis, mean_traf_tie as the y axis, and lines generated based on the a and d columns (one line for 0.22 and 0.99, one line for 0.44 and 0.99, etc.).
Update:
I have managed to work around it by concatenating the two index columns into one before plotting, like this:
df['a,d'] = list(zip(df.a, df.d))
df = df.groupby(['a,d', 'C']).mean()
df.unstack(level=0).plot(figsize=(10, 6))
The legend is still not ideal, but I got the lines and axes I wanted.
If anyone has a better idea how to do it with the original columns, I'm still open to learning.
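One way to keep the original columns is to group by a and d and draw one line per group. A sketch on a toy frame with flat columns a, d, c, and mean_traf_tie (names taken from the question, values illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this when plotting interactively
import pandas as pd
import matplotlib.pyplot as plt

# Toy data shaped like the question's frame (values illustrative)
df = pd.DataFrame({
    "a": [0.22] * 4 + [0.44] * 4,
    "d": [0.99] * 8,
    "c": [0.11, 0.23, 0.34, 0.46] * 2,
    "mean_traf_tie": [22, 21, 34, 45, 45, 65, 66, 68],
})

fig, ax = plt.subplots(figsize=(10, 6))
# One continuous line per (a, d) pair, with c on the x axis
for (a, d), grp in df.groupby(["a", "d"]):
    grp.sort_values("c").plot(x="c", y="mean_traf_tie",
                              ax=ax, label=f"a={a}, d={d}")
ax.set_xlabel("c")
ax.set_ylabel("mean_traf_tie")
```

Because the legend label is built per group, this also gives readable legend entries instead of the tuple labels that unstack produces.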

Delete values over the diagonal in a matrix with python

I have the following problem with a matrix in Python and NumPy.
Given this matrix:
       Cmpd1  Cmpd2  Cmpd3  Cmpd4
Cmpd1      1   0.32   0.77   0.45
Cmpd2   0.32      1   0.14   0.73
Cmpd3   0.77   0.14      1   0.29
Cmpd4   0.45   0.73   0.29      1
I want to obtain this:
       Cmpd1  Cmpd2  Cmpd3  Cmpd4
Cmpd1      1
Cmpd2   0.32      1
Cmpd3   0.77   0.14      1
Cmpd4   0.45   0.73   0.29      1
I was trying with np.diag() but it doesn't work.
Thanks!
Use np.tril(a) to extract the lower triangular matrix.
Refer to this: https://docs.scipy.org/doc/numpy/reference/generated/numpy.tril.html
Make sure you convert your matrix to a NumPy array first. Then, if you want to extract the lower part of the matrix, use:
np.tril(a)
where a is your matrix. https://numpy.org/doc/stable/reference/generated/numpy.tril.html#numpy.tril
Equally, if you need the upper part of the matrix, use:
np.triu(a)
https://numpy.org/doc/stable/reference/generated/numpy.triu.html
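If the matrix lives in a labelled DataFrame and you want to keep the row and column names while blanking the upper triangle, one option is to apply a boolean lower-triangle mask (a sketch using the question's values):

```python
import numpy as np
import pandas as pd

labels = ["Cmpd1", "Cmpd2", "Cmpd3", "Cmpd4"]
m = np.array([[1.00, 0.32, 0.77, 0.45],
              [0.32, 1.00, 0.14, 0.73],
              [0.77, 0.14, 1.00, 0.29],
              [0.45, 0.73, 0.29, 1.00]])
df = pd.DataFrame(m, index=labels, columns=labels)

# Boolean mask: True on and below the diagonal, False above it
mask = np.tril(np.ones(df.shape, dtype=bool))
lower = df.where(mask)  # upper triangle becomes NaN, labels preserved
print(lower)
```

np.tril alone would zero out the upper triangle; masking with where leaves it blank (NaN), which matches the desired output more closely.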
