dataframe values multiply by 2 - python

I have a one-row dataframe:
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.14 0.01 0.12 0.01 0.00 0.01
expected output
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.28 0.02 0.24 0.02 0.00 0.02
I want to multiply all the values by 2. This is my attempt:
[in] df = df.iloc[0, :] * 2.0
[out]
A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX
2018-12-14 0.000.00 0.140.14 0.010.01 0.120.12....

It seems the columns are of str/object dtype, so * 2 repeats each string rather than multiplying the numbers.
Example:
import pandas as pd
pd.DataFrame({'x': ['0.1']}) * 2
Output:
0.10.1
While:
pd.DataFrame({'x': [0.1]}) * 2
Output:
0.2
You can check the type of the column(s):
print(df.dtypes)
To change the type of the column(s):
for col in ['4DS.AX', 'A2B.AX', 'A2M.AX', 'AAC.AX', 'ABC.AX', 'ABP.AX', 'ACW.AX', 'ADH.AX']:
    df[col] = df[col].astype('float')
Then it should work:
df.iloc[0, :] * 2.0
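If every column should be numeric, a shorter route (a sketch; it assumes all columns hold parseable number strings) is to convert the whole frame in one call:
import pandas as pd

# one-row frame of strings, mirroring the question's data (values are illustrative)
df = pd.DataFrame({'A2M.AX': ['0.14'], 'AAC.AX': ['0.01']}, index=['2018-12-14'])

df = df.astype(float)                            # cast every column at once
# df = df.apply(pd.to_numeric, errors='coerce')  # forgiving variant: bad cells become NaN
print(df * 2)                                    # 0.28  0.02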

Related

Creating interaction terms in python

I'm trying to create interaction terms in a dataset. Is there another (simpler) way of creating interaction terms between columns of a dataset? For example, creating interaction terms from combinations of columns 4:98 and 98:106. I tried looping over the columns using NumPy arrays, but with the following code the kernel keeps dying.
col1 = df.columns[4:98]    # 94 columns
col2 = df.columns[98:106]  # 8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]
for i in range(94):
    for j in range(8):
        name = col1[i] + "*" + col2[j]
        df[name] = var1_np[:, i] * var2_np[:, j]
Here, df is the dataframe and df_np is df as a NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Then use join to build each new column name from the product result, and use the pandas prod method to take the product of the two columns over axis 1 (along the columns).
import pandas as pd
import numpy as np
from itertools import product
#setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0,10)]
print(df)
#code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
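If the one-column-at-a-time assignment is what kills the kernel, a vectorised variant (a sketch; the random df below is a stand-in for the question's real data and its 4:98 / 98:106 slices) builds all 94 x 8 products with NumPy broadcasting and concatenates once:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 106)).round(2),
                  columns=[f'col_{i}' for i in range(106)])  # stand-in data

a = df.iloc[:, 4:98].to_numpy()         # shape (n, 94)
b = df.iloc[:, 98:106].to_numpy()       # shape (n, 8)
prod = a[:, :, None] * b[:, None, :]    # shape (n, 94, 8): every pairwise product
names = [f'{c1}*{c2}' for c1 in df.columns[4:98] for c2 in df.columns[98:106]]
inter = pd.DataFrame(prod.reshape(len(df), -1), index=df.index, columns=names)
df = pd.concat([df, inter], axis=1)     # one concat instead of 752 column insertions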

Multiple time range selection in Pandas Python

I have time-series data in CSV format. I want to calculate the mean for several selected time periods in a single run of the script, e.g. 01-05-2017 to 30-04-2018, 01-05-2018 to 30-04-2019, and so on. Below is sample data.
I have a script, but it handles only one time period per run; I want to pass multiple time periods, as mentioned above.
from datetime import datetime
import pandas as pd
df = pd.read_csv(r'D:\Data\RT_2015_2020.csv', index_col=[0],parse_dates=[0])
z = df['2016-05-01' : '2017-04-30']
# Want to make like this way
#z = df[['2016-05-01' : '2017-04-30'], ['2017-05-01' : '2018-04-30']]
# It will calculate the mean for the selected time period
z.mean()
If you use the dates as the index, you can slice the data directly to the desired range.
import pandas as pd
import numpy as np
import io
data = '''
Date Mean
18-05-2016 0.31
07-06-2016 0.32
17-07-2016 0.50
15-09-2016 0.62
25-10-2016 0.63
04-11-2016 0.56
24-11-2016 0.56
14-12-2016 0.22
13-01-2017 0.22
23-01-2017 0.23
12-02-2017 0.21
22-02-2017 0.21
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates are DD-MM-YYYY
df.set_index('Date', inplace=True)
df.loc['2016'].head()
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
df.loc['2016-05-01':'2017-01-30']
Mean
Date
2016-05-18 0.31
2016-06-07 0.32
2016-07-17 0.50
2016-09-15 0.62
2016-10-25 0.63
2016-11-04 0.56
2016-11-24 0.56
2016-12-14 0.22
2017-01-13 0.22
2017-01-23 0.23
df.loc['2016-05-01':'2017-01-30'].mean()
Mean 0.417
dtype: float64
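To handle several periods in one run (a sketch on the date-indexed frame built above; the period boundaries are illustrative), loop over (start, end) pairs:
# df is the indexed frame from the answer above
periods = [('2016-05-01', '2016-12-31'), ('2017-01-01', '2017-04-30')]
for start, end in periods:
    print(f'{start} to {end}: {df.loc[start:end, "Mean"].mean():.4f}')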

Plotting variables by classes

I need to plot the following columns as a bar chart:
%_Var1 %_Var2 %_Val1 %_Val2 Class
2 0.00 0.00 0.10 0.01 1
3 0.01 0.01 0.07 0.05 0
17 0.00 0.00 0.02 0.01 0
24 0.00 0.00 0.11 0.04 0
27 0.00 0.00 0.02 0.03 1
44 0.00 0.00 0.05 0.02 0
53 0.00 0.00 0.03 0.01 1
67 0.00 0.00 0.06 0.02 0
87 0.00 0.00 0.22 0.01 1
115 0.00 0.00 0.03 0.02 0
comparing the values for Class 1 and Class 0 respectively (i.e. bars that show each column of the dataframe, with the Class 1 bar for a column placed beside the Class 0 bar for the same column).
So I should have 8 bars: 4 for Class 1 and the remaining 4 for Class 0.
Each column's Class 1 bar should sit beside its Class 0 bar.
I tried as follows:
ax = df[["%_Var1", "%_Var2", "%_Var3" , "%_Var4"]].plot(kind='bar')
but the output is completely wrong; so is writing ax = df[["%_Var1", "%_Var2", "%_Var3", "%_Var4"]].Label.plot(kind='bar')
I think I should use a groupby to group by class, but I don't know how to set up the ordering (plots are not my top skill).
If you want to try the seaborn way, melt the dataframe to long format and then hue on the class:
data = df.melt(id_vars=['Class'], value_vars=['%_Var1', '%_Var2', '%_Val1', '%_Val2'])
import seaborn as sns
sns.barplot(x='variable', y='value', hue='Class', data=data, ci=0)
which gives one group of bars per variable, split by class (plot omitted).
Or, to group the plot by class instead, simply swap the hue and the x axis:
sns.barplot(x='Class', y='value', hue='variable', data=data, ci=0)
which gives one group of bars per class (plot omitted).
Using groupby:
df.groupby('Class').mean().plot.bar()
The pivot_table method can summarise the data per group as well:
df.pivot_table(index='Class').plot.bar()
# df.pivot_table(columns='Class').plot.bar() # invert order
By default it aggregates your target columns with the mean, but you can specify another aggregation method with the aggfunc parameter.
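For instance (a sketch on the question's df; 'median' stands in for any reducer):
import matplotlib.pyplot as plt

# df is the question's frame; plot the median of each column per class
ax = df.pivot_table(index='Class', aggfunc='median').plot.bar(rot=0)
ax.set_ylabel('median value')
plt.show()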

Remove the missing values from the rows having greater than 5 missing values and then print the percentage of missing values in each column

import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
d= df.loc[df.isnull().sum(axis=1)>5]
d.dropna(axis=0,inplace=True)
print(round(100*(1-df.count()/len(df)),2))
I'm getting this output:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
but the expected output is:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.00
Discount 0.42
Order_Quantity 0.42
Profit 0.42
Shipping_Cost 0.42
Product_Base_Margin 1.06
dtype: float64
Try this way:
df.drop(df[df.isnull().sum(axis=1)>5].index,axis=0,inplace=True)
print(round(100*(1-df.count()/len(df)),2))
I think you are trying to find the index of rows whose count of null values is greater than 5. Use np.where instead of df.loc to find the index, and then drop those rows.
Try:
import pandas as pd
import numpy as np
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
d = np.where(df.isnull().sum(axis=1)>5)
df= df.drop(df.index[d])
print(round(100*(1-df.count()/len(df)),2))
output:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.00
Discount 0.42
Order_Quantity 0.42
Profit 0.42
Shipping_Cost 0.42
Product_Base_Margin 1.06
dtype: float64
Try this; it should work:
df = df[df.isnull().sum(axis=1) <= 5]
print(round(100*(1-df.count()/len(df)),2))
Try this solution
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df = df[df.isnull().sum(axis=1)<=5]
print(round(100*(df.isnull().sum()/len(df.index)),2))
This should work:
df = df.drop(df[df.isnull().sum(axis=1) > 5].index)
print(round(100 * (df.isnull().sum() / len(df.index)), 2))
marks = marks[marks.isnull().sum(axis=1) < 5]
print(marks.isna().sum())
Please try this; it should help.
This works:
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df = df[df.isnull().sum(axis=1)<5]
print(df.isnull().sum())
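For what it's worth (a sketch, not from the answers above): dropna has a thresh parameter meaning "keep rows with at least this many non-null values", which expresses "at most 5 missing" directly:
import pandas as pd

df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
# keep rows with at least (ncols - 5) non-null values, i.e. at most 5 NaNs
df = df.dropna(thresh=df.shape[1] - 5)
print(round(100 * df.isnull().sum() / len(df), 2))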

pd.to_csv set float_format with list

I need to write a df to a text file. To save some space on disk I would like to set the number of decimal places per column, i.e. give each column a different width.
I have tried:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.to_csv(path, float_format=['%.3f', '%.3f', '%.3f', '%.10f'])
But this does not work; it raises:
TypeError: unsupported operand type(s) for %: 'list' and 'float'
Any suggestions on how to do this with pandas (version 0.23.0)?
You can do it this way:
df.iloc[:, 0:3] = df.iloc[:, 0:3].round(3)
df.iloc[:, 3] = df.iloc[:, 3].round(10)
df.to_csv(path)
Thanks for all the answers; inspired by @Joe's I came up with:
df = df.round({'a':3, 'b':3, 'c':3, 'd':10})
or more generically
df = df.round({c:r for c, r in zip(df.columns, [3, 3, 3, 10])})
This is a workaround and does not answer the original question; round modifies the underlying dataframe, which may be undesirable.
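A non-destructive alternative (a sketch; the column names and per-column formats here are illustrative) is to render each column to strings at its own precision and write those, leaving the numeric frame untouched:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10, 4)), columns=list('abcd'))
fmt = {'a': '%.3f', 'b': '%.3f', 'c': '%.3f', 'd': '%.10f'}
# string copy of the frame, one format per column
out = pd.DataFrame({c: df[c].map(lambda v, f=fmt[c]: f % v) for c in df.columns})
out.to_csv('out.csv', index=False)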
I usually do it this way:
a['column_name'] = round(a['column_name'], 3)
And then you can export it to csv as usual.
You can use applymap, which applies a function to every value of the DataFrame:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.applymap(lambda x: round(x,2))
Out[58]:
0 1 2 3
0 0.12 0.63 0.47 0.19
1 0.06 0.81 0.09 0.56
2 0.78 0.85 0.42 0.98
3 0.58 0.39 0.73 0.68
4 0.79 0.56 0.77 0.34
5 0.16 0.20 0.94 0.89
6 0.34 0.79 0.54 0.27
7 0.70 0.58 0.05 0.28
8 0.75 0.53 0.37 0.64
9 0.57 0.68 0.59 0.84
