Plotting variables by class - Python

I need to plot the following columns as a bar chart:
%_Var1 %_Var2 %_Val1 %_Val2 Class
2 0.00 0.00 0.10 0.01 1
3 0.01 0.01 0.07 0.05 0
17 0.00 0.00 0.02 0.01 0
24 0.00 0.00 0.11 0.04 0
27 0.00 0.00 0.02 0.03 1
44 0.00 0.00 0.05 0.02 0
53 0.00 0.00 0.03 0.01 1
67 0.00 0.00 0.06 0.02 0
87 0.00 0.00 0.22 0.01 1
115 0.00 0.00 0.03 0.02 0
comparing the values for Class 1 and Class 0 (i.e. bars that show each column of the dataframe, with the bar for Class 1 placed beside the bar for Class 0 for the same column).
So I should have 8 bars in total: 4 for Class 1 and 4 for Class 0, with each Class 1 bar next to the corresponding Class 0 bar.
I tried as follows:
ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].plot(kind='bar')
but the output is completely wrong, and the same happens with ax = df[["%_Var1", "%_Var2", "%_Val1", "%_Val2"]].Label.plot(kind='bar').
I think I should use a groupby to group by Class, but I do not know how to set it up (plots are not my strong suit).

If you want to try the seaborn way, melt the dataframe to long format and then use the class as the hue:
import seaborn as sns
data = df.melt(id_vars=['Class'], value_vars=['%_Var1', '%_Var2', '%_Val1', '%_Val2'])
sns.barplot(x='variable', y='value', hue='Class', data=data, ci=None)
which gives a grouped bar chart with one group per variable and the Class 0 and Class 1 bars side by side in each group.
Or, if you want to group the bars by class instead, simply swap the hue and the x axis:
sns.barplot(x='Class', y='value', hue='variable', data=data, ci=None)
which gives one group of bars per class, with one bar per variable inside each group.

Using groupby:
df.groupby('Class').mean().plot.bar()
With the pivot_table method you can summarise the data per group as well.
df.pivot_table(index='Class').plot.bar()
# df.pivot_table(columns='Class').plot.bar() # invert order
By default it calculates the mean of your target columns, but you can specify another aggregation method with the aggfunc parameter.
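To get exactly the layout described in the question (one group of bars per column, with the Class 0 and Class 1 bars side by side), you can also transpose the grouped means before plotting. A minimal pandas/matplotlib sketch, assuming the column names shown above:
import matplotlib.pyplot as plt

# mean of each column per class, transposed so the columns become the x axis
means = df.groupby('Class')[['%_Var1', '%_Var2', '%_Val1', '%_Val2']].mean()
ax = means.T.plot.bar(rot=0)   # one group per column, one bar per class
ax.set_ylabel('mean value')
plt.show()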

Related

Creating interaction terms in python

I'm trying to create interaction terms in a dataset. Is there a simpler way to create interaction terms between columns, for example for all combinations of columns 4:98 and 98:106? I tried looping over the columns using NumPy arrays, but with the following code the kernel keeps dying.
col1 = df.columns[4:98]   #94 columns
col2 = df.columns[98:106] #8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]
for i in range(94):
    for j in range(8):
        name = col1[i] + "*" + col2[j]
        df[name] = var1_np[:, i] * var2_np[:, j]
Here, df is the dataframe and df_np is df as a NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Then use join to build the new column name from each pair. After that, use pandas prod to compute the product of the two columns along axis 1 (across the columns).
import pandas as pd
import numpy as np
from itertools import product
#setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0,10)]
print(df)
#code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
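If the kernel keeps dying at the original scale (94 * 8 = 752 new columns), part of the problem may be that inserting the columns one at a time repeatedly copies the dataframe. A vectorised sketch, assuming df has the same layout as in the question (interaction columns taken from positions 4:98 and 98:106):
import numpy as np
import pandas as pd
from itertools import product

col1 = df.columns[4:98]
col2 = df.columns[98:106]

# broadcast (n, 94, 1) * (n, 1, 8) -> (n, 94, 8), then flatten to (n, 752)
a = df[col1].to_numpy()[:, :, None]
b = df[col2].to_numpy()[:, None, :]
products = (a * b).reshape(len(df), -1)

# column names in the same order as the broadcast result
names = ["*".join(pair) for pair in product(col1, col2)]
interactions = pd.DataFrame(products, columns=names, index=df.index)
df = pd.concat([df, interactions], axis=1)   # add all interaction columns in one step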

merge 2 csv files by columns error related to strings?

I am trying to merge 2 csv files by column.
Both of my csv files have filenames ending with '_4.csv', and the final merged csv should look something like this:
0-10 ,83.72,66.76,86.98 ,0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04 ,11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94 ,21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03 ,31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0 ,over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0 ,UHF case ,0.0,0.0,0.0
my code:
#combine 2 csv into 1 by columns
files_in_dir = [f for f in os.listdir(os.getcwd()) if f.endswith('_4.csv')]
temp_data = []
for filenames in files_in_dir:
    temp_data.append(np.loadtxt(filenames, dtype='str'))
temp_data = np.array(temp_data)
np.savetxt('_mix.csv',temp_data.transpose(),fmt='%s',delimiter=',')
However, the error says:
temp_data.append(np.loadtxt(filenames,dtype='str'))
for x in read_data(_loadtxt_chunksize):
raise ValueError("Wrong number of columns at line %d"
ValueError: Wrong number of columns at line 2
I am not sure if it is related to the first column being strings rather than numeric values.
Does anyone know how to fix it? Much appreciated.
I think you're looking for the join method. If we have two .csv files of the form:
0-10 ,83.72,66.76,86.98
11-20 ,15.01,31.12,12.04
21-30 ,1.14,2.05,0.94
31-40 ,0.13,0.07,0.03
over 40 ,0.0,0.0,0.0
UHF case ,0.0,0.0,0.0
Assuming they both have a similar structure, we'll work with one of them, named data.csv:
import pandas as pd
# Assumes there are no headers
df1 = pd.read_csv("data.csv", header=None)
df2 = pd.read_csv("data.csv", header=None)
# By default: DataFrame headers are assigned numbers 0, 1, 2, 3
# In the second data frame, we will rename columns so they do not clash.
# meaning `df2` will now have columns named: 4, 5, 6, 7
df2 = df2.rename(
    columns={
        x: y for x, y in zip(df1.columns, range(len(df2.columns), len(df2.columns) * 2))
    }
)
print(df1.join(df2))
Example output:
0 1 2 3 4 5 6 7
0 0-10 83.72 66.76 86.98 0-10 83.72 66.76 86.98
1 11-20 15.01 31.12 12.04 11-20 15.01 31.12 12.04
2 21-30 1.14 2.05 0.94 21-30 1.14 2.05 0.94
3 31-40 0.13 0.07 0.03 31-40 0.13 0.07 0.03
4 over 40 0.00 0.00 0.00 over 40 0.00 0.00 0.00
5 UHF case 0.00 0.00 0.00 UHF case 0.00 0.00 0.00
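Alternatively, since the goal is just to place the two files side by side, pandas can also pick up the files ending in '_4.csv' and concatenate them along the columns. A sketch, under the assumption that both files have the same number of rows in the same order:
import glob
import pandas as pd

files = sorted(glob.glob('*_4.csv'))                     # the two input files
frames = [pd.read_csv(f, header=None) for f in files]
merged = pd.concat(frames, axis=1, ignore_index=True)    # columns renumbered 0, 1, 2, ...
merged.to_csv('_mix.csv', header=False, index=False)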

Remove the missing values from the rows having greater than 5 missing values and then print the percentage of missing values in each column

import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
d = df.loc[df.isnull().sum(axis=1) > 5]
d.dropna(axis=0, inplace=True)
print(round(100*(1 - df.count()/len(df)), 2))
I am getting this output:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
but the expected output is:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.00
Discount 0.42
Order_Quantity 0.42
Profit 0.42
Shipping_Cost 0.42
Product_Base_Margin 1.06
dtype: float64
Try this way:
df.drop(df[df.isnull().sum(axis=1)>5].index,axis=0,inplace=True)
print(round(100*(1-df.count()/len(df)),2))
I think you are trying to find the index of rows whose null-value count is greater than 5. Use np.where instead of df.loc to find those indices and then drop them.
Try:
import pandas as pd
import numpy as np
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
d = np.where(df.isnull().sum(axis=1)>5)
df= df.drop(df.index[d])
print(round(100*(1-df.count()/len(df)),2))
output:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.00
Discount 0.42
Order_Quantity 0.42
Profit 0.42
Shipping_Cost 0.42
Product_Base_Margin 1.06
dtype: float64
Try this, it should work
df = df[df.isnull().sum(axis=1) <= 5]
print(round(100*(1-df.count()/len(df)),2))
Try this solution
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df = df[df.isnull().sum(axis=1)<=5]
print(round(100*(df.isnull().sum()/len(df.index)),2))
This should work:
df = df.drop(df[df.isnull().sum(axis=1) > 5].index)
print(round(100 * (df.isnull().sum() / len(df.index)), 2))
marks = marks[marks.isnull().sum(axis=1) <= 5]
print(marks.isna().sum())
Please try this; it should help.
This works:
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df = df[df.isnull().sum(axis=1) <= 5]
print(df.isnull().sum())
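For reference, a compact variant equivalent to the answers above, which prints the percentages directly by using isnull().mean() for the per-column fraction of missing values:
import pandas as pd

df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df = df[df.isnull().sum(axis=1) <= 5]         # drop rows with more than 5 missing values
print(round(100 * df.isnull().mean(), 2))     # percentage of missing values per column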

dataframe values multiply by 2

I have a dataframe with a single row:
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.14 0.01 0.12 0.01 0.00 0.01
expected output
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.28 0.02 0.24 0.02 0.00 0.02
I want to multiply all the values by 2. This is my attempt:
[in] df=df.iloc[0,:]*2.0
[out]
A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX
2018-12-14 0.000.00 0.140.14 0.010.01 0.120.12....
It seems the columns are of str/object type, so the values are being concatenated rather than multiplied.
Example
import pandas as pd
pd.DataFrame({'x':['0.1']})*2
Output:
0.10.1
While
pd.DataFrame({'x':[0.1]})*2
Output:
0.2
You can check the type of the column(s) with
print(df.dtypes)
In order to change the type of the column(s):
for col in ['4DS.AX','A2B.AX','A2M.AX','AAC.AX','ABC.AX','ABP.AX','ACW.AX','ADH.AX']:
    df[col] = df[col].astype('float')
Then the multiplication should work as expected:
df.iloc[0,:]*2.0
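If there are many columns, converting them one by one gets tedious; a shorter sketch (assuming every column in the frame should be numeric) converts the whole dataframe in one go:
import pandas as pd

df = df.apply(pd.to_numeric)   # convert all str/object columns to numbers
df = df * 2                    # element-wise multiplication now works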

pd.to_csv set float_format with list

I need to write a df to a text file. To save some space on disk, I would like to set the number of decimal places for each column, i.e. give each column a different width.
I have tried:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.to_csv(path, float_format=['%.3f', '%.3f', '%.3f', '%.10f'])
But this does not work:
TypeError: unsupported operand type(s) for %: 'list' and 'float'
Any suggestions on how to do this with pandas (version 0.23.0)?
You can do it this way:
df.iloc[:,0:3] = df.iloc[:,0:3].round(3)
df['d'] = df['d'].round(10)
df.to_csv('path')
Thanks for all the answers, inspired by #Joe I came up with:
df = df.round({'a':3, 'b':3, 'c':3, 'd':10})
or more generically
df = df.round({c:r for c, r in zip(df.columns, [3, 3, 3, 10])})
This is a workaround and does not answer the original question; round modifies the underlying dataframe, which may be undesirable.
I usually do it this way:
a['column_name'] = round(a['column_name'], 3)
And then you can export it to csv as usual.
You can use applymap, which applies a function to every value across all rows and columns.
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.applymap(lambda x: round(x,2))
Out[58]:
0 1 2 3
0 0.12 0.63 0.47 0.19
1 0.06 0.81 0.09 0.56
2 0.78 0.85 0.42 0.98
3 0.58 0.39 0.73 0.68
4 0.79 0.56 0.77 0.34
5 0.16 0.20 0.94 0.89
6 0.34 0.79 0.54 0.27
7 0.70 0.58 0.05 0.28
8 0.75 0.53 0.37 0.64
9 0.57 0.68 0.59 0.84
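If you want per-column precision without rounding the dataframe itself, another option is to format each column as strings on a copy just before writing. A sketch, assuming the same four-column frame as in the question (the format mapping below is a hypothetical choice mirroring the original float_format list):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10, 4)))
formats = {0: '{:.3f}', 1: '{:.3f}', 2: '{:.3f}', 3: '{:.10f}'}   # per-column format strings

out = df.copy()
for col, fmt in formats.items():
    out[col] = df[col].map(fmt.format)    # numbers become formatted strings
out.to_csv('path.csv', index=False)       # the original df stays numeric and unrounded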
