I need to calculate a percent change column with respect to the MultiIndex:
import pandas as pd
import numpy as np
row_x1 = ['1']
row_x2 = ['1.5']
row_x3 = ['3']
row_x4 = ['2']
row_x5 = ['3']
index_arrays = [
np.array(['first', 'first', 'first', 'second', 'second']),
np.array(['one','two','three','one','two'])
]
df1 = pd.DataFrame(
[row_x1,row_x2,row_x3,row_x4,row_x5],
columns=['A'],
index=index_arrays,
)
print(df1)
The code prints the following data frame:
A
first one 1
two 1.5
three 3
second one 2
two 3
The final "percentage change" column should be calculated as shown below:
A %
first one 1 0
two 1.5 .5
three 3 1
second one 2 0
two 3 .5
I have a large data set, and I need to do this programmatically.
Group by the first index level and calculate the percent change:
# the values were created as strings, so cast to float first
df1['A'] = df1['A'].astype(float)
# pct_change within each first-level group; the first row of each group is NaN, so fill with 0
df1['%'] = df1.groupby(level=0)['A'].pct_change().fillna(0)
A %
first one 1.0 0.0
two 1.5 0.5
three 3.0 1.0
second one 2.0 0.0
two 3.0 0.5
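Equivalently, pct_change within each group is just the current value divided by the previous value in that group, minus one. A minimal sketch of the same computation written out with groupby shift, in case you ever need to customize it:
# same result computed by hand: (current - previous) / previous within each group
prev = df1.groupby(level=0)['A'].shift()
df1['%'] = ((df1['A'] - prev) / prev).fillna(0)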
Related
I am applying an inner join in a for loop on another dataset, and now I need to remove the rows that are already part of the inner join. I went with DataFrame.isin(another_df), but it is not giving me the expected results. I checked the column names and their data types, and they are all the same. Can someone help me with that, please?
In the following code, isin is where I compare the two data frames, but I'm not getting any difference: I get the same set of rows back even though the two frames have the same number of rows and columns.
Note: I'm dropping an extra column inside the isin call, as it is only present in one of the dataframes.
My code looks like this:
df = pd.DataFrame(columns=override.columns)
for i in list1:
    join_value = tuple(i)
    i.append('creditor_tier_interim')
    subset_df = override.merge(criteria[i].dropna(), on=list(join_value), how='inner')
    subset_df['PRE_CHARGEOFF_FLAG'] = pd.to_numeric(subset_df.PRE_CHARGEOFF_FLAG)
    override = override[~override.isin(subset_df.drop(columns='creditor_tier_interim'))].dropna(how='all')
    print('The override shape would be:', override.shape)
    df = df.append(subset_df)
df = df.append(override)
It sounds as if you have a 'left' and a 'right' DataFrame and you're looking for the records that are exclusively in one or the other. The code below returns the rows that appear in only the left or only the right DataFrame.
import pandas as pd
import numpy as np
from pandas import DataFrame, Series
dataframe_left = DataFrame(np.random.randn(25).reshape(5, 5), columns=['A', 'B', 'C', 'D', 'E'], index=np.arange(5))
dataframe_right = DataFrame(np.random.randn(25).reshape(5, 5), columns=['A', 'B', 'C', 'D', 'E'], index=np.arange(5))
insert_left = DataFrame(np.arange(5).reshape(1, 5), columns=['A', 'B', 'C', 'D', 'E'], index=[7])
insert_right = DataFrame(np.arange(5).reshape(1, 5), columns=['A', 'B', 'C', 'D', 'E'], index=[6])
# DataFrame.append was removed in pandas 2.0; concat does the same job
dataframe_right = pd.concat([dataframe_right, insert_right])
dataframe_left = pd.concat([dataframe_left, insert_left])
The code above produces output along these lines (the values are random, so they change from run to run; rounded here to six decimals):
Left Table
          A         B         C         D         E
0 -0.324009  1.044155 -0.236404  0.546677 -0.212369
1 -0.042634 -0.485549 -1.558428  1.243852 -0.310872
2  0.698258 -0.423792  1.162509 -3.378898  1.055012
3  0.377434  0.640258 -0.278752  0.310718  0.344995
4 -0.133650  0.367977 -2.019671  1.286003 -0.496747
7  0.000000  1.000000  2.000000  3.000000  4.000000
Right Table
          A         B         C         D         E
0 -0.099467 -0.033789 -0.411787  0.219765 -0.702053
1 -2.993618  0.424434 -0.168158 -0.508054 -0.294833
2 -0.656731 -1.221240 -1.260467  0.444725 -0.456297
3 -0.002770  0.132377 -0.110740 -0.359616  1.989453
4  0.517090 -1.169461  0.292387 -0.389125 -0.879307
6  0.000000  1.000000  2.000000  3.000000  4.000000
After setting up the test dataframes we can join the two and filter for the rows we're interested in:
tmp = pd.merge(
    left=dataframe_left,
    right=dataframe_right,
    right_index=True,
    left_index=True,
    how='outer',
    suffixes=['_left', '_right'],
    indicator=True,
)
tmp[tmp._merge.isin(['right_only','left_only'])]
This produces the result below; the matched indices 0-4 drop out, and NaNs mark the side each remaining row was missing from:
   A_left  B_left  C_left  D_left  E_left  A_right  B_right  C_right  D_right  E_right      _merge
6     NaN     NaN     NaN     NaN     NaN      0.0      1.0      2.0      3.0      4.0  right_only
7     0.0     1.0     2.0     3.0     4.0      NaN      NaN      NaN      NaN      NaN   left_only
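Applied back to your loop, the same indicator trick can replace the isin check entirely: anti-join override against subset_df and keep only the left_only rows. A hedged sketch, assuming the shared columns identify a row (the variable names mirror your code):
# anti-join: keep only the rows of override that never matched subset_df
keys = override.columns.intersection(subset_df.columns).tolist()
merged = override.merge(subset_df[keys].drop_duplicates(), on=keys,
                        how='left', indicator=True)
override = merged.loc[merged['_merge'] == 'left_only', override.columns]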
I have a Pandas dataframe like this:
import pandas as pd
df = pd.DataFrame(
{'gender':['F','F','F','F','F','M','M','M','M','M'],
'mature':[0,1,0,0,0,1,1,1,0,1],
'cta' :[1,1,0,1,0,0,0,1,0,1]}
)
df['gender'] = df['gender'].astype('category')
df['mature'] = df['mature'].astype('category')
df['cta'] = pd.to_numeric(df['cta'])
df
I calculated the sum (how many times people clicked) and the total (the number of sent messages). I want to figure out how to calculate the percentage, defined as clicks/total, and how to get a dataframe as output.
temp_groupby = df.groupby('gender').agg({'cta': [('clicks','sum'),
('total','count')]})
temp_groupby
Since cta contains only 0s and 1s, clicks/total is just its average, so add a new tuple to the list:
temp_groupby = df.groupby('gender').agg({'cta': [('clicks','sum'),
('total','count'),
('perc', 'mean')]})
print (temp_groupby)
cta
clicks total perc
gender
F 3 5 0.6
M 2 5 0.4
To avoid a MultiIndex in the columns, select the column right after groupby:
temp_groupby = df.groupby('gender')['cta'].agg([('clicks','sum'),
('total','count'),
('perc', 'mean')]).reset_index()
print (temp_groupby)
gender clicks total perc
0 F 3 5 0.6
1 M 2 5 0.4
Or use named aggregation:
temp_groupby = df.groupby('gender', as_index=False).agg(clicks=('cta', 'sum'),
                                                        total=('cta', 'count'),
                                                        perc=('cta', 'mean'))
print (temp_groupby)
gender clicks total perc
0 F 3 5 0.6
1 M 2 5 0.4
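If perc has to be computed literally as clicks/total (for example, if cta could ever hold values other than 0 and 1), one hedged alternative is to aggregate first and derive the column afterwards:
temp_groupby = df.groupby('gender', as_index=False).agg(clicks=('cta', 'sum'),
                                                        total=('cta', 'count'))
# derive the percentage from the two aggregates
temp_groupby['perc'] = temp_groupby['clicks'] / temp_groupby['total']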
I have imported data from an Excel file into my program and then used set_index to set 'rule_id' as the index. I used this code:
df = pd.read_excel('stack.xlsx')
df = df.set_index('rule_id')
and the data looks like this:
Now I want to compare one column with another, but in reverse order. For example, I want to compare column 'c' with 'b', then 'b' with 'a', and so on, and create a new column after each comparison that contains the index of the column whose value was zero. If both columns have value 0, Null should go in the new column, and if both comparison values are non-zero, Null should also go in the new column.
The result should look like this:
I am not able to work out how to approach this problem; if you could help me, that would be great.
Edit: A minor edit. I have imported the data from an Excel file which looks like this (this is just a part of the data; there are multiple columns):
Then I used pivot_table to manipulate the data as per my requirement using this code:
df = df.pivot_table(index='rule_id', columns=['date'], values='rid_fc', fill_value=0)
and my data looks like this now:
Now I want to compare one column with another, but in reverse order. For example, I want to compare the '2019-04-25 16:36:32' column with '2019-04-25 16:29:05', then '2019-04-25 16:29:05' with '2019-04-25 16:14:14', and so on, creating a new column after each comparison that contains the index of the column whose value was zero. As before, if both columns are 0, or both are non-zero, Null should go in the new column.
IIUC you can try with:
import numpy as np

# map each column name to its integer position: {'a': 0, 'b': 1, 'c': 2}
d = {i: e for e, i in enumerate(df.columns)}
m1 = df[['c', 'b']]
m2 = df[['b', 'a']]
# eq(0).dot(columns) concatenates the names of the zero-valued columns per row
df['comp1'] = m1.eq(0).dot(m1.columns).map(d)
m3 = m2.eq(0).dot(m2.columns)
# keep only rows where exactly one column was zero; otherwise NaN
m3.loc[m3.str.len() != 1] = np.nan
df['comp2'] = m3.map(d)
print(df)
a b c comp1 comp2
rule_id
51234 0 7 6 NaN 0.0
53219 0 0 1 1.0 NaN
56195 0 2 2 NaN 0.0
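To see why this works: eq(0) marks the zero cells, and the dot product with the column names concatenates the names of the zero columns per row, so a string of length one means exactly one zero. For the sample data:
print(m2.eq(0).dot(m2.columns))
# rule_id
# 51234     a
# 53219    ba
# 56195     a
# dtype: object
Only rows whose string has length 1 survive the m3.str.len() != 1 filter; 'ba' (both zero) becomes NaN.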
I suggest using numpy: compare shifted values with logical_and, build the new column values from a reversed range created by np.arange, and assemble the result with numpy.where and the DataFrame constructor:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': [0, 0, 0],
    'b': [7, 0, 2],
    'c': [6, 1, 2],
})
# reverse the column order of the values array: [c, b, a]
x = df.values[:, ::-1]
# True where the earlier column is 0 and the later column is not
a = np.logical_and(x[:, 1:] == 0, x[:, :-1] != 0)
# column positions counting down to 0, to match the original order
b = np.arange(a.shape[1] - 1, -1, -1)
# new column names
c = [f'comp{i+1}' for i in range(x.shape[1] - 1)]
# where the condition holds take the column position, otherwise NaN
df1 = pd.DataFrame(np.where(a, b[None, :], np.nan), columns=c, index=df.index)
print (df1)
comp1 comp2
0 NaN 0.0
1 1.0 NaN
2 NaN 0.0
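To attach the comparison columns back to the original frame, a join on the shared index suffices:
# align on the index and place comp1/comp2 next to a, b, c
df = df.join(df1)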
You can make use of this code snippet. I did not have time to perfect it with loops etc., so please adapt it to your requirements.
import pandas as pd
import numpy as np
# Data
print(df.head())
a b c
0 0 7 6
1 0 0 1
2 0 2 2
cp = df.copy()
# 0/1 indicator: 1 where the value is non-zero
cp[cp != 0] = 1
# sum of the indicators for each pair of columns: 0, 1 or 2
cp['comp1'] = cp['a'] + cp['b']
cp['comp2'] = cp['b'] + cp['c']
# Logic: map the sums 0, 1, 2 to the outputs 1, NaN, 0
cp = cp.replace([0, 1, 2], [1, np.nan, 0])
# restore the original values in a, b, c
cp[['a', 'b', 'c']] = df[['a', 'b', 'c']]
# Results
print(cp.head())
a b c comp1 comp2
0 0 7 6 NaN 0.0
1 0 0 1 1.0 NaN
2 0 2 2 NaN 0.0
I'm trying to do an apparently simple operation in Python:
I have some datasets, say 6, and I want to sum the values of one column whenever the values of the other two columns coincide. After that, I want to divide the summed column by the number of datasets, in this case 6 (i.e. compute the arithmetic mean). Datasets in which the other columns don't coincide should contribute 0 to the sum.
I write down two dataframes here as an example:
Code1 Code2 Distance
0 15.0 15.0 2
1 15.0 60.0 3
2 15.0 69.0 2
3 15.0 434.0 1
4 15.0 842.0 0
Code1 Code2 Distance
0 14.0 15.0 4
1 14.0 60.0 7
2 15.0 15.0 0
3 15.0 60.0 1
4 15.0 69.0 9
The first column is the df.index column. I want to sum the 'Distance' column only where 'Code1' and 'Code2' coincide. In this case the desired output would be something like:
Code1 Code2 Distance
0 14.0 15.0 2
1 14.0 60.0 3.5
2 15.0 15.0 1
3 15.0 60.0 2
4 15.0 69.0 5.5
5 15.0 434.0 0.5
6 15.0 842.0 0
I've tried to do this using conditionals, but for more than two dataframes it gets really hard. Is there any method in Pandas to do it faster?
Any help would be appreciated :-)
You could put all your data frames in a list and then use functools.reduce to either append or merge them all. Take a look at the documentation for reduce.
First, some functions are defined below for sample data generation.
import pandas
import numpy as np
from functools import reduce
# GENERATE DATA

# Code 1 between 13 and 15
def generate_code_1(n):
    return np.floor(np.random.rand(n, 1) * 3 + 13)

# Code 2 between 1 and 1000
def generate_code_2(n):
    return np.floor(np.random.rand(n, 1) * 1000) + 1

# Distance between 0 and 9
def generate_distance(n):
    return np.floor(np.random.rand(n, 1) * 10)

# Generate a data frame as an hstack of 3 arrays
def generate_data_frame(n):
    data = np.hstack([
        generate_code_1(n),
        generate_code_2(n),
        generate_distance(n),
    ])
    df = pandas.DataFrame(data=data, columns=['Code 1', 'Code 2', 'Distance'])
    # Remove possible duplications of Code 1 and Code 2; take the smallest distance in case of duplications.
    # Duplications would break the merge method (they do not break the append method).
    df = df.groupby(['Code 1', 'Code 2'], as_index=False)
    df = df.aggregate(np.min)
    return df

# Generate n data frames, each with m rows, in a list
def generate_data_frames(n, m, with_count=False):
    df_list = []
    for k in range(0, n):
        df = generate_data_frame(m)
        # Add a count column, needed for the merge method to track how many frames contributed
        if with_count:
            df['Count'] = 1
        df_list.append(df)
    return df_list
Append method (faster, shorter, nicer)
df_list = generate_data_frames(94, 5)
# Append all data frames together using reduce
# (DataFrame.append was removed in pandas 2.0, so concat is used here)
df_append = reduce(lambda df_1, df_2: pandas.concat([df_1, df_2]), df_list)
# Aggregate by Code 1 and Code 2
df_append_grouped = df_append.groupby(['Code 1', 'Code 2'], as_index=False)
df_append_result = df_append_grouped.aggregate(np.mean)
df_append_result
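Since the append method only stacks the frames, a hedged shortcut is to skip reduce entirely and hand the whole list to pandas.concat, which has the same effect:
# equivalent one-liner: concatenate every frame, then average per (Code 1, Code 2)
df_append_result = (pandas.concat(df_list)
                    .groupby(['Code 1', 'Code 2'], as_index=False)
                    .mean())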
Merge method
df_list = generate_data_frames(94, 5, with_count=True)

# Function to be passed to reduce. Merge 2 data frames and update Distance and Count.
def merge_dfs(df_1, df_2):
    df = pandas.merge(df_1, df_2, on=['Code 1', 'Code 2'], how='outer', suffixes=('', '_y'))
    df = df.fillna(0)
    df['Distance'] = df['Distance'] + df['Distance_y']
    df['Count'] = df['Count'] + df['Count_y']
    del df['Distance_y']
    del df['Count_y']
    return df
# Use reduce to apply merge over the list of data frames
df_merge_result = reduce(merge_dfs, df_list)
# Replace distance with its mean and drop Count
df_merge_result['Distance'] = df_merge_result['Distance'] / df_merge_result['Count']
del df_merge_result['Count']
df_merge_result
For a DataFrame such as:
dt
          COL000  COL001
STK_ID
Rowname1       2       2
Rowname2       1       4
Rowname3       1       1
What's the easiest way to append to the same data frame the result of dividing Rowname1 by Rowname2? I.e. the desired outcome is:
          COL000  COL001
STK_ID
Rowname1       2       2
Rowname2       1       4
Rowname3       1       1
Newrow         2     0.5
Sorry if this is a simple question; I'm slowly getting to grips with pandas, coming from an R background.
Thanks in advance!!!
The code below will create a new row with index d, formed by dividing row a by row b.
import pandas as pd
df = pd.DataFrame(data={'x':[1,2,3], 'y':[4,5,6]}, index=['a', 'b', 'c'])
df.loc['d'] = df.loc['a'] / df.loc['b']
print(df)
# x y
# a 1.0 4.0
# b 2.0 5.0
# c 3.0 6.0
# d 0.5 0.8
In order to access the first two rows without caring about the index, you can use:
df.loc['newrow'] = df.iloc[0] / df.iloc[1]
Then just follow @Ffisegydd's solution...
In addition, if you want to append multiple rows, use pd.concat (DataFrame.append was removed in pandas 2.0).
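A minimal sketch of that, assuming you want several derived rows at once (the row labels here are illustrative):
import pandas as pd

df = pd.DataFrame(data={'x': [1, 2, 3], 'y': [4, 5, 6]}, index=['a', 'b', 'c'])
# build each derived row as a Series, collect them in a frame, then concatenate once
new_rows = pd.DataFrame([df.loc['a'] / df.loc['b'],
                         df.loc['b'] / df.loc['c']])
new_rows.index = ['a_over_b', 'b_over_c']
df = pd.concat([df, new_rows])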
pandas does the division element-wise, row by row. By indexing with a new name and assigning, you tell it you want a new column holding the division of the two named columns:
data['new_row_with_division'] = data['row_name1_values'] / data['row_name2_values']
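For completeness, a small runnable illustration of that column-wise version; the column names are placeholders carried over from the snippet above:
import pandas as pd

data = pd.DataFrame({'row_name1_values': [2.0, 1.0],
                     'row_name2_values': [1.0, 4.0]})
# element-wise division of the two columns, stored as a new column
data['new_row_with_division'] = data['row_name1_values'] / data['row_name2_values']
print(data)
#    row_name1_values  row_name2_values  new_row_with_division
# 0               2.0               1.0                   2.00
# 1               1.0               4.0                   0.25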