I have a pandas dataframe with about 1000 columns which I got from a SQL pivot.
Some of those columns share a substring (they contain ALPHA). Here is what the dataframe looks like (showing a sample with 5 ALPHA columns).
The characteristic of the data is that for each unique combination of ColA to ColE there is at most one non-null value in each ALPHA column.
input_df
ColA ColB ColC ColD ColE ALPHA_1 ALPHA_2 ALPHA_3 ALPHA_4 ALPHA_5.......
x y z p q NAN 1 NAN NAN 2
x y z p q 2 NAN NAN NAN NAN
x y z p q NAN NAN 11 NAN NAN
x y z p q NAN NAN NAN 15 NAN
u v w z k 11 NAN NAN NAN 1
u v w z k NAN NAN 34 NAN NAN
u v w z k NAN 6 NAN NAN NAN
u v w z k NAN NAN NAN 76 NAN
b d y s t NAN 4 NAN NAN NAN
b d y s t NAN NAN 8 NAN 80
b d y s t NAN NAN NAN 9 NAN
b d y s t 88 NAN NAN NAN NAN
What I am looking for is to drop all the NaNs from the ALPHA columns and collapse the rows into one whenever ColA to ColE are the same.
So the data should look like this:
output_df
ColA ColB ColC ColD ColE ALPHA_1 ALPHA_2 ALPHA_3 ALPHA_4 ALPHA_5 .......
x y z p q 2 1 11 15 2
u v w z k 11 6 34 76 1
u v w z k NAN NAN 34 NAN NAN
u v w z k NAN 6 NAN NAN NAN
u v w z k NAN NAN NAN 76 3
b d y s t 88 4 8 9 8
What I planned is to create one subset of the columns ColA to ColE (keydf) and another subset containing only the ALPHA columns (newdf), then drop the duplicates from the first dataframe, drop the NaNs from each column of the second dataframe, and join the two dataframes by index.
keydf = input_df.loc[:, input_df.columns.str.contains('Col')]
newdf = input_df.loc[:, input_df.columns.str.contains('ALPHA')]
However, I am stuck at this stage and not sure how to proceed. Any help will be immensely appreciated.
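One idea that might work, since each key combination holds at most one non-null value per ALPHA column, is to group by the key columns and take the first non-null value in each group. A minimal sketch, assuming the real column names match the sample:

import pandas as pd

key_cols = [c for c in input_df.columns if c.startswith('Col')]
alpha_cols = [c for c in input_df.columns if 'ALPHA' in c]

# GroupBy.first() skips NaNs, so each key group collapses to its single
# non-null value per ALPHA column (NaN stays where a group has none)
output_df = input_df.groupby(key_cols, as_index=False, sort=False)[alpha_cols].first()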
Related
I would like to restructure grouped data into a dataframe.
Because each group has a different length, when I iterate over the grouped data I get IndexError: list index out of range, and I do not know how to handle this index error.
Background of my data:
I grouped by "Worker_ID" and "Project_ID".
Question  Answer  Worker_ID  Project_ID
DD        AB      X          Y
DD        AB      X          Y
DD        AB      X          Y
BD        BG      K          Y
BD        BG      K          Y
KY        GG      J          Y
KY        GG      J          Y
KY        GG      J          Y
KY        GG      J          Y
RR        FR      X          Q
HU        RT      K          Q
HU        RT      K          Q
HU        RT      K          Q
YU        GE      J          Q
YU        GE      J          Q
XX        FF      K          P
XX        FF      K          P
XI        UF      J          P
XI        UF      J          P
(This table goes on much longer)
I would like to create a dataframe like this↓↓
Question_1  Answer_1  Worker_ID_1  Question_2  Answer_2  Worker_ID_2  Question_3  Answer_3  Worker_ID_3  Project_ID
DD          AB        X            BD          BG        K            KY          GG        J            Y
DD          AB        X            BD          BG        K            KY          GG        J            Y
DD          AB        X            Blank       Blank     Blank        KY          GG        J            Y
Blank       Blank     Blank        Blank       Blank     Blank        KY          GG        J            Y
RR          FR        X            HU          RT        K            YU          GE        J            Q
Blank       Blank     Blank        HU          RT        K            YU          GE        J            Q
Blank       Blank     Blank        HU          RT        K            Blank       Blank     Blank        Q
Blank       Blank     Blank        XX          FF        K            XI          UF        J            P
Blank       Blank     Blank        XX          FF        K            XI          UF        J            P
(This table goes on much longer)
Because each group has a different length, the index error comes up when I try to loop over the length of the longest list in my code.
My code↓↓
import pandas as pd

# group by "Worker_ID", "Project_ID"
grouped_questions = {}
for x, y in df1.groupby(["Worker_ID", "Project_ID"], as_index=True):
    grouped_questions[x] = y.reset_index(drop=True)

# create a list of the unique keys
unique_list = []
for unique in list(grouped_questions):
    if unique not in unique_list:
        unique_list.append(unique)

# bucket the keys by Worker_ID
question_1 = []
question_2 = []
question_3 = []
for qn in unique_list:
    worker = qn[0]  # the key is a (Worker_ID, Project_ID) tuple
    project = qn[1]
    if worker == 'X':
        question_1.append(qn)
    elif worker == 'K':
        question_2.append(qn)
    elif worker == 'J':
        question_3.append(qn)

# combine into dataframe
final_df1 = pd.DataFrame()
max_length = max(len(question_1), len(question_2), len(question_3))
for index in range(max_length):
    # IndexError here: the three lists have different lengths
    one_question = grouped_questions[question_1[index]]
    two_question = grouped_questions[question_2[index]]
    three_question = grouped_questions[question_3[index]]
    merged_df = one_question.merge(two_question, how='outer', left_index=True, right_index=True, suffixes=["_1", "_2"])
    merged_df = merged_df.merge(three_question.add_suffix("_3"), how='outer', left_index=True, right_index=True)
    if final_df1.shape[0] == 0:
        final_df1 = merged_df
    else:
        final_df1 = pd.concat([final_df1, merged_df], ignore_index=True)
final_df1 = final_df1.reset_index(drop=True)
final_df1
Here is what I have so far. As I said in the comments, there are more rows than in your desired output, but I don't know on what conditions I should join them.
import re
import pandas as pd

# restructure the df with `pivot_table`
m1 = df.groupby('Project_ID').cumcount() + 1
m2 = pd.factorize(df['Worker_ID'])[0] + 1

out = (
    df
    .pivot_table(
        index=['Project_ID', m1],
        columns=m2,
        values=['Worker_ID', 'Question', 'Answer'],
        aggfunc='first'
    )
)

# join the multiindex columns into single columns
out.columns = out.columns.map(lambda x: f"{x[0]}_{x[1]}")

# in case you need to automate the sorting of the columns, here is how to do it
# (otherwise you could just pass a list with the new order of the columns):
# first sort by the trailing number, then bring each triple into the right
# order (question, answer, worker)
pattern = '|'.join([f"({x})" for x in ['Question', 'Answer', 'Worker']])

def sort_key(x):
    return re.search(pattern, x).lastindex

new_order = sorted(out.columns.tolist(), key=lambda x: (int(x[-1]), sort_key(x.split('_')[0])))
out = out[new_order]
print(out)
Question_1 Answer_1 Worker_ID_1 Question_2 Answer_2 Worker_ID_2 Question_3 Answer_3 Worker_ID_3
Project_ID
P 1 NaN NaN NaN XX FF K NaN NaN NaN
2 NaN NaN NaN XX FF K NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN XI UF J
4 NaN NaN NaN NaN NaN NaN XI UF J
Q 1 RR FR X NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN HU RT K NaN NaN NaN
3 NaN NaN NaN HU RT K NaN NaN NaN
4 NaN NaN NaN HU RT K NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN YU GE J
6 NaN NaN NaN NaN NaN NaN YU GE J
Y 1 DD AB X NaN NaN NaN NaN NaN NaN
2 DD AB X NaN NaN NaN NaN NaN NaN
3 DD AB X NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN BD BG K NaN NaN NaN
5 NaN NaN NaN BD BG K NaN NaN NaN
6 NaN NaN NaN NaN NaN NaN KY GG J
7 NaN NaN NaN NaN NaN NaN KY GG J
8 NaN NaN NaN NaN NaN NaN KY GG J
9 NaN NaN NaN NaN NaN NaN KY GG J
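Follow-up: judging from your desired output, the rows seem to be paired by their occurrence count within each (Project_ID, Worker_ID) group rather than within the whole project. If that assumption holds, changing m1 to count per worker collapses the rows as wanted:

# count occurrences per (Project_ID, Worker_ID) instead of per project,
# so the n-th row of each worker lands on the same pivoted row
m1 = df.groupby(['Project_ID', 'Worker_ID']).cumcount() + 1

The rest of the code stays the same, and each project then keeps only as many rows as its longest worker group.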
I have a dataframe below. I am trying to create lags for var1, var2, var3 by calculating
(var_n / lag2(var_n)) - 1 (where n is 1, 2, 3).
The code below works fine for creating lag2, but I need to perform the calculation grouped by "grp".
CODE:
lag = [2]
df = pd.concat([df] + [df.groupby('grp')[['var1', 'var2', 'var3']].shift(x).add_prefix('lag' + str(x)) for x in lag], axis=1)
In a different approach I tried the below, but I am not able to apply the groupby:
yoy = [12]
columns_y = df.loc[:, 'var1':'var3']
for col in columns_y.columns:
    for x in yoy:
        columns_y.loc[:, col + "_yoy"] = (columns_y[col] / columns_y[col].shift(x)) - 1
Try this
import pandas as pd

df = pd.DataFrame({
    'grp': ['a', 'a', 'a', 'b', 'b', 'b'],
    'abc2': ['l', 'm', 'n', 'p', 'q', 'r'],
    'abc3': ['x', 'y', 'z', 'a', 'b', 'c'],
    'var1': [20, 30, 20, 40, 50, 90],
    'var2': [50, 80, 70, 20, 30, 40],
    'var3': [50, 80, 70, 20, 30, 40],
})

lag = [2]
lags_df = pd.concat([
    df.groupby('grp')[[f'var{i+1}' for i in range(3)]]
      .shift(x)
      .add_prefix(f'lag{x}_')
    for x in lag
], axis=1)
print(pd.concat([df, lags_df], axis=1))
Output:
grp abc2 abc3 var1 var2 var3 lag2_var1 lag2_var2 lag2_var3
0 a l x 20 50 50 NaN NaN NaN
1 a m y 30 80 80 NaN NaN NaN
2 a n z 20 70 70 20.0 50.0 50.0
3 b p a 40 20 20 NaN NaN NaN
4 b q b 50 30 30 NaN NaN NaN
5 b r c 90 40 40 40.0 20.0 20.0
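If you also need the (var_n / lag2(var_n)) - 1 ratio itself, a short follow-up sketch (the _chg column names are just an illustration):

for x in lag:
    for col in ['var1', 'var2', 'var3']:
        # shift within each group so the ratio never crosses a 'grp' boundary
        df[f'{col}_chg{x}'] = df[col] / df.groupby('grp')[col].shift(x) - 1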
I have a dataframe in python pandas as follows
(the first two columns, mygroup1 & mygroup2, are the groupby columns):
df =
**mygroup1 mygroup2 tname #dt #num #vek**
a p alpha may 6 a
b q alpha june 8 b
c r beta may 9 c
d s beta june 11 d
I want to pivot the table on the values in the tname column, with the new column names formed by joining the tname values with the names of the other columns (#dt, #num and #vek):
**mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek**
a p may 6 a nan nan nan
b q june 8 b nan nan nan
c r nan nan nan may 9 c
d s nan nan nan june 11 d
I am trying to do a pivot using a pandas pivot table but I am not able to get the format above, which is what I really want. I will appreciate any help.
You can do:
new_df = df.set_index(['mygroup1','mygroup2','tname']).unstack('tname')
new_df.columns = [f'{y}{x}' for x,y in new_df.columns]
new_df = new_df.sort_index(axis=1).reset_index()
Output:
mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek
0 a p may 6.0 a NaN NaN NaN
1 b q june 8.0 b NaN NaN NaN
2 c r NaN NaN NaN may 9.0 c
3 d s NaN NaN NaN june 11.0 d
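Since you mentioned pivot_table: the same result is reachable with it, provided you pass aggfunc='first' (the default 'mean' fails on the non-numeric values). A sketch along the same lines:

new_df = df.pivot_table(index=['mygroup1', 'mygroup2'], columns='tname',
                        values=['#dt', '#num', '#vek'], aggfunc='first')
new_df.columns = [f'{y}{x}' for x, y in new_df.columns]
new_df = new_df.sort_index(axis=1).reset_index()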
Need to perform the following operations on a pandas dataframe df inside a for loop with 50 iterations or more:
Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4
The columns which are common in all 5 dataframes - df, df1, df2, df3 and df4 are A, B, C and D
EDIT
The shapes of the dataframes all differ: df is the master dataframe with the maximum number of rows, while the other 4 dataframes each have fewer rows than df and vary in length from one another. So while merging the columns, the rows of both dataframes need to be matched on the common columns first.
Input df
A B C D X Y Z W
1 2 3 4 nan nan nan nan
2 3 4 5 nan nan nan nan
5 9 7 8 nan nan nan nan
4 8 6 3 nan nan nan nan
df1
A B C D X Y Z W
2 3 4 5 100 nan nan nan
4 8 6 3 200 nan nan nan
df2
A B C D X Y Z W
1 2 3 4 nan 50 nan nan
df3
A B C D X Y Z W
1 2 3 4 nan nan 1000 nan
4 8 6 3 nan nan 2000 nan
df4
A B C D X Y Z W
2 3 4 5 nan nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 nan nan nan 45
Output df
A B C D X Y Z W
1 2 3 4 nan 50 1000 nan
2 3 4 5 100 nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 200 nan 2000 45
What is the most efficient and fastest way to achieve this? I tried using 4 separate combine_first statements, but that doesn't seem to be the most efficient way.
Can this be done with just 1 line of code instead?
Any help will be appreciated. Many thanks in advance.
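One possible approach, assuming A, B, C and D together uniquely identify a row: set them as the index and fold the frames together with combine_first. A minimal sketch:

from functools import reduce

key = ['A', 'B', 'C', 'D']
frames = [df1, df2, df3, df4]

# combine_first fills df's NaNs with values from each smaller frame,
# aligning rows on the A-D key
out = reduce(lambda acc, f: acc.combine_first(f.set_index(key)),
             frames, df.set_index(key)).reset_index()

Note that combine_first returns an index-sorted result, so the row order may differ from the original df.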
Sorry if the title isn't clear; I don't know how to describe what I want to do. I'm looking to tag rows with a class, grouping rows that share the same set or a subset of values across columns. Starting with this data in a dataframe:
group1 group2 group3 group4
1 Nan X Y Z
2 Nan X Nan Nan
3 Nan Nan Y Z
4 X Nan Y Nan
5 X V Y Nan
6 V V Nan Nan
7 Nan X Z Y
I want to get to this:
class group1 group2 group3 group4
1 C1 Nan X Y Z
2 C1 Nan X Nan Nan
3 C1 Nan Nan Y Z
4 C1 X Nan Y Nan
5 C2 X V Y Nan
6 C3 V V Nan Nan
7 C4 Nan X Z Y
The class has been grouped by the largest common denominator, and column position does matter when looking at common patterns. I'm looking for a good way to do this without going through many loops.
edited for better example