Multiple time range selection in Pandas Python - python

I have time-series data in CSV format. I want to calculate the mean for several selected time periods in a single run of the script, e.g. 01-05-2017 to 30-04-2018, 01-05-2018 to 30-04-2019, and so on. Below is sample data.
I have a script, but it takes only one time period; I want to pass multiple time periods, as mentioned above.
from datetime import datetime
import pandas as pd
df = pd.read_csv(r'D:\Data\RT_2015_2020.csv', index_col=[0],parse_dates=[0])
z = df['2016-05-01' : '2017-04-30']
# Want to make like this way
#z = df[['2016-05-01' : '2017-04-30'], ['2017-05-01' : '2018-04-30']]
# It will calculate the mean for the selected time period
z.mean()

If you use dates as the index, you can slice out the rows that fall within the desired range.
import pandas as pd
import numpy as np
import io
data = '''
Date Mean
18-05-2016 0.31
07-06-2016 0.32
17-07-2016 0.50
15-09-2016 0.62
25-10-2016 0.63
04-11-2016 0.56
24-11-2016 0.56
14-12-2016 0.22
13-01-2017 0.22
23-01-2017 0.23
12-02-2017 0.21
22-02-2017 0.21
'''
df = pd.read_csv(io.StringIO(data), delim_whitespace=True)
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates are DD-MM-YYYY
df.set_index('Date', inplace=True)
df.loc['2016'].head()
            Mean
Date
2016-05-18  0.31
2016-06-07  0.32
2016-07-17  0.50
2016-09-15  0.62
2016-10-25  0.63
df.loc['2016-05-01':'2017-01-30']
            Mean
Date
2016-05-18  0.31
2016-06-07  0.32
2016-07-17  0.50
2016-09-15  0.62
2016-10-25  0.63
2016-11-04  0.56
2016-11-24  0.56
2016-12-14  0.22
2017-01-13  0.22
2017-01-23  0.23
df.loc['2016-05-01':'2017-01-30'].mean()
Mean    0.417
dtype: float64
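To handle several periods in one run, as the question asks, one approach (a sketch, using a small made-up frame in place of the CSV) is to loop over a list of (start, end) pairs and take the mean of each slice:

```python
import io
import pandas as pd

data = """Date,Mean
2016-05-18,0.31
2016-06-07,0.32
2017-05-10,0.40
2018-06-01,0.50
"""
df = pd.read_csv(io.StringIO(data), parse_dates=['Date'], index_col='Date')

# hypothetical list of (start, end) periods to average over
periods = [('2016-05-01', '2017-04-30'), ('2017-05-01', '2018-04-30')]
means = {f'{start}:{end}': df.loc[start:end, 'Mean'].mean()
         for start, end in periods}
print(means)
```

Each `df.loc[start:end, 'Mean']` slice uses the same date-range selection shown above, so adding another period is just another tuple in the list.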

Related

How To Iterate Over A Timespan and Calculate some Values in a Dataframe using Python?

I have a dataset like below
data = {'ReportingDate': ['2013/5/31', '2013/5/31', '2013/5/31', '2013/5/31', '2013/5/31', '2013/5/31',
                          '2013/6/28', '2013/6/28', '2013/6/28', '2013/6/28', '2013/6/28'],
        'MarketCap': [' ', 0.35, 0.7, 0.875, 0.7, 0.35, ' ', 1, 1.5, 0.75, 1.25],
        'AUM': [3.5, 3.5, 3.5, 3.5, 3.5, 3.5, 5, 5, 5, 5, 5],
        'weight': [' ', 0.1, 0.2, 0.25, 0.2, 0.1, ' ', 0.2, 0.3, 0.15, 0.25]}
# Create DataFrame
df = pd.DataFrame(data)
df.set_index('ReportingDate', inplace=True)
df
Just a sample of an 8,000-row dataset.
ReportingDate runs from 2013/5/31 to 2015/10/30 and covers every month in that period, but only the last day of each month.
The first line of each month has two missing values. I know that:
the sum of weight for each month equals 1
weight * AUM equals MarketCap
I can use the lines below to get the answer I want, but only for one month:
a = 1 - df["2013-5"].iloc[1:]['weight'].sum()
b = a * AUM  # AUM here is that month's AUM value
df.iloc[1, 0] = b
df.iloc[1, 2] = a
How can I use a loop to get the data for the whole period? Thanks
One way using pandas.DataFrame.groupby:
# If the whitespaces are indeed whitespace strings, not NaN
df = df.replace(r"\s+", np.nan, regex=True)
# If not already a DatetimeIndex
df.index = pd.to_datetime(df.index)
s = df["weight"].fillna(1) - df.groupby(df.index.date)["weight"].transform("sum")
df["weight"] = df["weight"].fillna(s)
df["MarketCap"] = df["MarketCap"].fillna(s * df["AUM"])
Note: this assumes that each date is always the last day of its month, so that grouping by exact date is equivalent to grouping by year and month. If not, try:
s = df["weight"].fillna(1) - df.groupby(df.index.strftime("%Y%m"))["weight"].transform("sum")
Output:
MarketCap AUM weight
ReportingDate
2013-05-31 0.525 3.5 0.15
2013-05-31 0.350 3.5 0.10
2013-05-31 0.700 3.5 0.20
2013-05-31 0.875 3.5 0.25
2013-05-31 0.700 3.5 0.20
2013-05-31 0.350 3.5 0.10
2013-06-28 0.500 5.0 0.10
2013-06-28 1.000 5.0 0.20
2013-06-28 1.500 5.0 0.30
2013-06-28 0.750 5.0 0.15
2013-06-28 1.250 5.0 0.25
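Putting the answer together into a runnable sketch on the question's sample data (with NaN standing in for the blank cells):

```python
import numpy as np
import pandas as pd

data = {'ReportingDate': ['2013/5/31'] * 6 + ['2013/6/28'] * 5,
        'MarketCap': [np.nan, 0.35, 0.7, 0.875, 0.7, 0.35,
                      np.nan, 1, 1.5, 0.75, 1.25],
        'AUM': [3.5] * 6 + [5] * 5,
        'weight': [np.nan, 0.1, 0.2, 0.25, 0.2, 0.1,
                   np.nan, 0.2, 0.3, 0.15, 0.25]}
df = pd.DataFrame(data)
df['ReportingDate'] = pd.to_datetime(df['ReportingDate'])
df = df.set_index('ReportingDate')

# missing weight = 1 - sum of the known weights in the same month
s = df['weight'].fillna(1) - df.groupby(df.index.date)['weight'].transform('sum')
df['weight'] = df['weight'].fillna(s)
df['MarketCap'] = df['MarketCap'].fillna(s * df['AUM'])
print(df)
```

For May 2013 the known weights sum to 0.85, so the missing weight is 0.15 and the missing MarketCap is 0.15 * 3.5 = 0.525; for June the missing weight is 0.10 and the MarketCap 0.10 * 5 = 0.50.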

Creating interaction terms in python

I'm trying to create interaction terms in a dataset. Is there a simpler way to create interaction terms from combinations of columns, for example every combination of columns 4:98 with columns 98:106? I tried looping over the columns with NumPy arrays, but with the following code the kernel keeps dying.
col1 = df.columns[4:98]    # 94 columns
col2 = df.columns[98:106]  # 8 columns
var1_np = df_np[:, 4:98]
var2_np = df_np[:, 98:106]
for i in range(94):
    for j in range(8):
        name = col1[i] + "*" + col2[j]
        df[name] = var1_np[:, i] * var2_np[:, j]
Here, df is the DataFrame and df_np is df as a NumPy array.
You could use itertools.product, which is roughly equivalent to nested for-loops in a generator expression. Use join to build each new column name from the product result, then use the pandas prod method to take the product of the two columns along axis 1 (across the columns).
import pandas as pd
import numpy as np
from itertools import product
#setup
np.random.seed(12345)
data = np.random.rand(5, 10).round(2)
df = pd.DataFrame(data)
df.columns = [f'col_{c}' for c in range(0,10)]
print(df)
#code
col1 = df.columns[3:5]
col2 = df.columns[5:8]
df_new = pd.DataFrame()
for i in product(col1, col2):
    name = "*".join(i)
    df_new[name] = df[list(i)].prod(axis=1)
print(df_new)
Output from df
col_0 col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8 col_9
0 0.93 0.32 0.18 0.20 0.57 0.60 0.96 0.65 0.75 0.65
1 0.75 0.96 0.01 0.11 0.30 0.66 0.81 0.87 0.96 0.72
2 0.64 0.72 0.47 0.33 0.44 0.73 0.99 0.68 0.79 0.17
3 0.03 0.80 0.90 0.02 0.49 0.53 0.60 0.05 0.90 0.73
4 0.82 0.50 0.81 0.10 0.22 0.26 0.47 0.46 0.71 0.18
Output from df_new
col_3*col_5 col_3*col_6 col_3*col_7 col_4*col_5 col_4*col_6 col_4*col_7
0 0.1200 0.1920 0.1300 0.3420 0.5472 0.3705
1 0.0726 0.0891 0.0957 0.1980 0.2430 0.2610
2 0.2409 0.3267 0.2244 0.3212 0.4356 0.2992
3 0.0106 0.0120 0.0010 0.2597 0.2940 0.0245
4 0.0260 0.0470 0.0460 0.0572 0.1034 0.1012
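When the kernel dies on the full 94 x 8 case, a likely cause is inserting 752 new columns one at a time. An alternative sketch (same small example) computes every product in a single NumPy broadcast and builds the result frame once; the reshape order matches itertools.product:

```python
import numpy as np
import pandas as pd
from itertools import product

rng = np.random.default_rng(12345)
df = pd.DataFrame(rng.random((5, 10)).round(2),
                  columns=[f'col_{c}' for c in range(10)])

col1 = df.columns[3:5]
col2 = df.columns[5:8]

# broadcast (rows, m, 1) * (rows, 1, n) -> (rows, m, n), then flatten per row
a = df[col1].to_numpy()[:, :, None]
b = df[col2].to_numpy()[:, None, :]
names = ['*'.join(pair) for pair in product(col1, col2)]
df_new = pd.DataFrame((a * b).reshape(len(df), -1),
                      columns=names, index=df.index)
print(df_new)
```

Building the whole array first avoids repeated column insertion, which fragments the frame and is far slower for hundreds of columns.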

Remove rows having more than 5 missing values, then print the percentage of missing values in each column

import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
d = df.loc[df.isnull().sum(axis=1) > 5]
d.dropna(axis=0, inplace=True)
print(round(100*(1 - df.count()/len(df)), 2))
I'm getting output as:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
but the expected output is:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.00
Discount 0.42
Order_Quantity 0.42
Profit 0.42
Shipping_Cost 0.42
Product_Base_Margin 1.06
dtype: float64
Try this way:
df.drop(df[df.isnull().sum(axis=1)>5].index,axis=0,inplace=True)
print(round(100*(1-df.count()/len(df)),2))
I think you are trying to find the index of rows whose count of null values is greater than 5. Use np.where instead of df.loc to find the index, then drop those rows.
Try:
import pandas as pd
import numpy as np
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
d = np.where(df.isnull().sum(axis=1)>5)
df= df.drop(df.index[d])
print(round(100*(1-df.count()/len(df)),2))
output:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.00
Discount 0.42
Order_Quantity 0.42
Profit 0.42
Shipping_Cost 0.42
Product_Base_Margin 1.06
dtype: float64
Try this; it should work:
df = df[df.isnull().sum(axis=1) <= 5]
print(round(100*(1-df.count()/len(df)),2))
Try this solution
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df = df[df.isnull().sum(axis=1)<=5]
print(round(100*(df.isnull().sum()/len(df.index)),2))
This should work:
df = df.drop(df[df.isnull().sum(axis=1) > 5].index)
print(round(100 * (df.isnull().sum() / len(df.index)), 2))
marks = marks[marks.isnull().sum(axis=1) <= 5]
print(marks.isna().sum())
Please try this; it should help.
This works:
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df = df[df.isnull().sum(axis=1) <= 5]
print(df.isnull().sum())
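The working answers above all reduce to the same two steps; here is a self-contained sketch on a small made-up frame (the question's CSV URL is left out so the example is reproducible):

```python
import numpy as np
import pandas as pd

# hypothetical 4x8 frame: row 0 has 6 missing values, row 1 has 2
df = pd.DataFrame(np.ones((4, 8)), columns=list('abcdefgh'))
df.iloc[0, :6] = np.nan   # more than 5 missing -> should be dropped
df.iloc[1, :2] = np.nan   # kept

df = df[df.isnull().sum(axis=1) <= 5]                    # drop rows with >5 NaNs
pct = round(100 * df.isnull().sum() / len(df.index), 2)  # percent missing per column
print(pct)
```

After dropping the first row, three rows remain, so columns a and b each show 33.33 percent missing and every other column shows 0.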

dataframe values multiply by 2

I have a single-row DataFrame:
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.14 0.01 0.12 0.01 0.00 0.01
expected output
4DS.AX A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX ACW.AX ADH.AX
2018-12-14 0.00 0.00 0.28 0.02 0.24 0.02 0.00 0.02
I want to multiply all the values by 2. This is my attempt:
[in] df=df.iloc[0,:]*2.0
[out]
A2B.AX A2M.AX AAC.AX ABC.AX ABP.AX
2018-12-14 0.000.00 0.140.14 0.010.01 0.120.12....
It seems the columns are of str/object dtype, hence the values are being repeated (string concatenation) rather than multiplied.
Example
import pandas as pd
pd.DataFrame({'x':['0.1']})*2
Output:
0.10.1
While
pd.DataFrame({'x':[0.1]})*2
Output:
0.2
You can check the type of the column(s) with
print(df.dtypes)
To change the type of the column(s):
for col in ['4DS.AX', 'A2B.AX', 'A2M.AX', 'AAC.AX', 'ABC.AX', 'ABP.AX', 'ACW.AX', 'ADH.AX']:
    df[col] = df[col].astype('float')
Then it should work:
df.iloc[0,:]*2.0
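Equivalently, every column can be converted in one call; a minimal sketch with a made-up single-row frame of string values:

```python
import pandas as pd

# hypothetical frame whose numbers were read in as strings
df = pd.DataFrame([['0.00', '0.14', '0.01']],
                  columns=['4DS.AX', 'A2M.AX', 'AAC.AX'],
                  index=['2018-12-14'])

df = df.astype(float)   # convert every column at once
doubled = df * 2        # numeric multiply, not string repetition
print(doubled)
```

`astype(float)` on the whole frame saves the per-column loop when all columns should be numeric.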

pd.to_csv set float_format with list

I need to write a df to a text file. To save some disk space, I would like to set the number of decimal places per column, i.e. give each column a different width.
I have tried:
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.to_csv(path, float_format=['%.3f', '%.3f', '%.3f', '%.10f'])
But this does not work:
TypeError: unsupported operand type(s) for %: 'list' and 'float'
Any suggestions on how to do this with pandas (version 0.23.0)?
You can do it this way:
df.iloc[:,0:3] = df.iloc[:,0:3].round(3)
df['d'] = df['d'].round(10)
df.to_csv('path')
Thanks for all the answers; inspired by @Joe I came up with:
df = df.round({'a': 3, 'b': 3, 'c': 3, 'd': 10})
or, more generically:
df = df.round({c: r for c, r in zip(df.columns, [3, 3, 3, 10])})
This is a workaround and does not answer the original question: round modifies the underlying dataframe, which may be undesirable.
I usually do it this way:
a['column_name'] = round(a['column_name'], 3)
And then you can export it to csv as usual.
You can use applymap, which applies a function to every value across all rows and columns.
df = pd.DataFrame(np.random.random(size=(10, 4)))
df.applymap(lambda x: round(x,2))
Out[58]:
0 1 2 3
0 0.12 0.63 0.47 0.19
1 0.06 0.81 0.09 0.56
2 0.78 0.85 0.42 0.98
3 0.58 0.39 0.73 0.68
4 0.79 0.56 0.77 0.34
5 0.16 0.20 0.94 0.89
6 0.34 0.79 0.54 0.27
7 0.70 0.58 0.05 0.28
8 0.75 0.53 0.37 0.64
9 0.57 0.68 0.59 0.84
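If mutating the frame with round is undesirable, another option (a sketch with assumed column names a-d) is to format each column to strings first and write those out, leaving the numeric data untouched:

```python
import io
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random(size=(10, 4)), columns=list('abcd'))

# per-column decimal places
precision = {'a': 3, 'b': 3, 'c': 3, 'd': 10}
formatted = df.apply(lambda col: col.map(lambda v: f'{v:.{precision[col.name]}f}'))

buf = io.StringIO()
formatted.to_csv(buf, index=False)
csv_text = buf.getvalue()
print(csv_text.splitlines()[1])
```

Since the values are pre-formatted strings, to_csv writes them verbatim and no float_format is needed; the original df keeps full precision in memory.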
