I have a data frame in python pandas as follows
(the first two columns, mygroup1 & mygroup2, are the groupby columns):
df =
**mygroup1 mygroup2 tname #dt #num #vek**
a p alpha may 6 a
b q alpha june 8 b
c r beta may 9 c
d s beta june 11 d
I want to pivot the table on the values in the tname column, with the new column names formed by joining each tname value with the names of the other columns (#dt, #num and #vek):
**mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek**
a p may 6 a nan nan nan
b q june 8 b nan nan nan
c r nan nan nan may 9 c
d s nan nan nan june 11 d
I am trying to do a pivot using a pandas pivot table, but I am not able to get the format above, which is what I really want. I would appreciate any help.
You can do:
# move the group keys and tname into the index, then pivot tname into
# the columns; this produces a (value column, tname) column MultiIndex
new_df = df.set_index(['mygroup1','mygroup2','tname']).unstack('tname')
# flatten the MultiIndex, putting the tname value first
new_df.columns = [f'{y}{x}' for x,y in new_df.columns]
new_df = new_df.sort_index(axis=1).reset_index()
Output:
mygroup1 mygroup2 alpha#dt alpha#num alpha#vek beta#dt beta#num beta#vek
0 a p may 6.0 a NaN NaN NaN
1 b q june 8.0 b NaN NaN NaN
2 c r NaN NaN NaN may 9.0 c
3 d s NaN NaN NaN june 11.0 d
Related
My question is quite similar to this one: Drop group if another column has duplicate values - pandas dataframe
I have the following dataframe:
letter value many other variables
A 5
A 5
A 8
A 9
B 3
B 10
B 10
B 4
B 5
B 9
C 10
C 10
C 10
D 6
D 8
D 10
E 5
E 5
E 5
F 4
F 4
And when grouping it by letter, I want to remove all the resulting groups whose value column contains only a single repeated value, giving a result like this:
letter value many other variables
A 5
A 5
A 8
A 9
B 3
B 10
B 10
B 4
B 5
B 9
D 6
D 8
D 10
I am afraid that if I use the duplicated() function, similarly to the question I mentioned at the beginning, I would be deleting groups (or the rows in) 'A' and 'B', which should rather stay in place.
You have several possibilities.
Using duplicated and groupby.transform:
# mark rows whose (letter, value) pair occurs more than once anywhere,
# then check per letter whether every row of the group is a duplicate
m = (df.duplicated(subset=['letter', 'value'], keep=False)
       .groupby(df['letter']).transform('all')
     )
out = df[~m]
NB. this won't drop groups with a single row.
Using groupby.transform and nunique:
out = df[df.groupby('letter')['value'].transform('nunique').gt(1)]
NB. this will drop groups with a single row.
Output:
letter value many other variables
0 A 5 NaN NaN NaN
1 A 5 NaN NaN NaN
2 A 8 NaN NaN NaN
3 A 9 NaN NaN NaN
4 B 3 NaN NaN NaN
5 B 10 NaN NaN NaN
6 B 10 NaN NaN NaN
7 B 4 NaN NaN NaN
8 B 5 NaN NaN NaN
9 B 9 NaN NaN NaN
13 D 6 NaN NaN NaN
14 D 8 NaN NaN NaN
15 D 10 NaN NaN NaN
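One more option, a sketch using groupby.filter, which keeps a group only when the callable returns True (like the nunique approach, this drops groups with a single row):
# keep only the letters whose group has more than one distinct value
out = df.groupby('letter').filter(lambda g: g['value'].nunique() > 1)
filter is usually slower than the transform-based masks when there are many groups, but it reads closest to the problem statement.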
I have a pandas data frame with 1000s of columns, which I got from a SQL pivot.
Some of those columns contain a substring (like ALPHA). Here is what the dataframe
looks like (showing a sample with 5 ALPHA columns).
The characteristic of the data is that for each unique combination of ColA to ColE
there is at most one non-null value in each ALPHA column.
input_df
ColA ColB ColC ColD ColE ALPHA_1 ALPHA_2 ALPHA_3 ALPHA_4 ALPHA_5.......
x y z p q NAN 1 NAN NAN 2
x y z p q 2 NAN NAN NAN NAN
x y z p q NAN NAN 11 NAN NAN
x y z p q NAN NAN NAN 15 NAN
u v w z k 11 NAN NAN NAN 1
u v w z k NAN NAN 34 NAN NAN
u v w z k NAN 6 NAN NAN NAN
u v w z k NAN NAN NAN 76 NAN
b d y s t NAN 4 NAN NAN NAN
b d y s t NAN NAN 8 NAN 80
b d y s t NAN NAN NAN 9 NAN
b d y s t 88 NAN NAN NAN NAN
What I am looking for is to drop all the NaNs from the ALPHA columns and combine the rows whenever ColA to ColE are the same.
So the data should look like this:
output_df
ColA ColB ColC ColD ColE ALPHA_1 ALPHA_2 ALPHA_3 ALPHA_4 ALPHA_5 .......
x y z p q 2 1 11 15 2
u v w z k 11 6 34 76 1
b d y s t 88 4 8 9 80
What I planned is to create one subset of columns from ColA to ColE, and another
subset of the columns containing only ALPHA, then drop the duplicates from the first
dataframe (keydf), drop the NaNs from each of the columns of the second dataframe
(newdf), and join the two dataframes by index.
keydf = input_df.loc[:, input_df.columns.str.contains('Col')]
newdf = input_df.loc[:, input_df.columns.str.contains('ALPHA')]
However, I am stuck at this stage and not sure how to proceed. Any help will be immensely appreciated.
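A sketch of one way to proceed, assuming (as stated above) that each ColA-ColE combination has at most one non-null value per ALPHA column: groupby.first skips NaNs, so it collapses every group to its single non-null value in each ALPHA column, and no separate keydf/newdf split is needed.
group_cols = ['ColA', 'ColB', 'ColC', 'ColD', 'ColE']
# first() returns the first non-null value per group and column, which is
# the only non-null value given the characteristic of the data
output_df = input_df.groupby(group_cols, as_index=False).first()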
I am trying to use a regex on a Python dataframe with this script:
import pandas as pd
df1 = {'data':['1gsmxx,2gsm','abc10gsm','10gsm','18gsm hhh4gsm','Abc:10gsm','5gsmaaab3gsmABC55gsm','abc - 15gsm','3gsm,,ff40gsm','9gsm','VV - fg 8gsm','kk 5gsm 00g','001….abc..5gsm']}
df1 = pd.DataFrame(df1)
df1
df1['Result'] = df1['data'].str.findall(r'(\d{1,3}\s?gsm)')
OR
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
However, both approaches return multiple matches in a single column.
Is it possible to get a result like the one shown below?
Use pandas.Series.str.extractall with unstack.
If you want to keep your original column alongside the matches, use pandas.concat.
df2 = df1['data'].str.extractall(r'(\d{1,3}\s?gsm)').unstack()
# drop the capture-group level of the columns, keeping the match numbers
df = pd.concat([df1, df2.droplevel(0, axis=1)], axis=1)
print(df)
Output:
data 0 1 2
0 1gsmxx,2gsm 1gsm 2gsm NaN
1 abc10gsm 10gsm NaN NaN
2 10gsm 10gsm NaN NaN
3 18gsm hhh4gsm 18gsm 4gsm NaN
4 Abc:10gsm 10gsm NaN NaN
5 5gsmaaab3gsmABC55gsm 5gsm 3gsm 55gsm
6 abc - 15gsm 15gsm NaN NaN
7 3gsm,,ff40gsm 3gsm 40gsm NaN
8 9gsm 9gsm NaN NaN
9 VV - fg 8gsm 8gsm NaN NaN
10 kk 5gsm 00g 5gsm NaN NaN
11 001….abc..5gsm 5gsm NaN NaN
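If you prefer more descriptive column names than 0, 1, 2, add_prefix can rename the match columns (the gsm_ prefix here is just an illustrative choice):
df = pd.concat([df1, df2.droplevel(0, axis=1).add_prefix('gsm_')], axis=1)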
Need to perform the following operation on a pandas dataframe df inside a for loop with 50 iterations or more:
Column 'X' of df has to be merged with column 'X' of df1,
Column 'Y' of df has to be merged with column 'Y' of df2,
Column 'Z' of df has to be merged with column 'Z' of df3,
Column 'W' of df has to be merged with column 'W' of df4
The columns common to all 5 dataframes (df, df1, df2, df3 and df4) are A, B, C and D.
EDIT
The dataframes all have different shapes: df is the master dataframe with the maximum number of rows, and the other four dataframes each have fewer rows than df, varying among themselves. So while merging the columns, the rows of the dataframes need to be matched on the common columns first.
Input df
A B C D X Y Z W
1 2 3 4 nan nan nan nan
2 3 4 5 nan nan nan nan
5 9 7 8 nan nan nan nan
4 8 6 3 nan nan nan nan
df1
A B C D X Y Z W
2 3 4 5 100 nan nan nan
4 8 6 3 200 nan nan nan
df2
A B C D X Y Z W
1 2 3 4 nan 50 nan nan
df3
A B C D X Y Z W
1 2 3 4 nan nan 1000 nan
4 8 6 3 nan nan 2000 nan
df4
A B C D X Y Z W
2 3 4 5 nan nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 nan nan nan 45
Output df
A B C D X Y Z W
1 2 3 4 nan 50 1000 nan
2 3 4 5 100 nan nan 25
5 9 7 8 nan nan nan 35
4 8 6 3 200 nan 2000 45
Which is the most efficient and fastest way to achieve this? I tried using 4 separate combine_first statements, but that doesn't seem to be the most efficient way.
Can this be done with just 1 line of code instead?
Any help will be appreciated. Many thanks in advance.
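A sketch of one approach, assuming the combination of A, B, C and D uniquely identifies a row: stack the four partial dataframes on the key columns, collapse them to one row per key, and fill df's missing values from the result.
import pandas as pd

key = ['A', 'B', 'C', 'D']
# stack the partial dataframes and keep the first non-null value
# per key combination and column
others = (pd.concat([d.set_index(key) for d in (df1, df2, df3, df4)])
            .groupby(level=key).first())
# fill df's NaNs from the combined lookup; df's row order is preserved
out = df.set_index(key).fillna(others).reset_index()
This replaces the four separate combine_first calls with a single concat/fillna, which should scale better as the number of partial dataframes grows.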
I'm trying to build a function to eliminate from my dataset the columns with only one value. I used this function:
def oneCatElimination(dataframe):
    columns = dataframe.columns.values
    for column in columns:
        if len(dataframe[column].value_counts().unique()) == 1:
            del dataframe[column]
    return dataframe
The problem is that the function eliminates even columns with more than one distinct value, e.g. an index column with integer numbers.
Just
df.dropna(thresh=2, axis=1)
will work. No need for anything else. It keeps every column that has at least 2 non-NA values (controlled by the value passed to thresh). The axis keyword lets you work on rows or columns; it works on rows by default, so you need to pass axis=1 explicitly to work on columns. See dropna() for more information.
A couple of assumptions went into this:
Null/NA values don't count
You need multiple non-NA values to keep a column
Those values need to be different in some way (e.g., a column full of 1's and only 1's should be dropped)
All that said, I would select the columns based on how many distinct non-null values each one contains (DataFrame.select has since been removed from pandas, so a boolean mask over the columns is used instead).
If you start with this dataframe:
import pandas
N = 15
df = pandas.DataFrame(index=range(10), columns=list('ABCD'))
df.loc[2, 'A'] = 23
df.loc[3, 'B'] = 52
df.loc[4, 'B'] = 36
df.loc[5, 'C'] = 11
df.loc[6, 'C'] = 11
df.loc[7, 'D'] = 43
df.loc[8, 'D'] = 63
df.loc[9, 'D'] = 97
df
Which creates:
A B C D
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 23 NaN NaN NaN
3 NaN 52 NaN NaN
4 NaN 36 NaN NaN
5 NaN NaN 11 NaN
6 NaN NaN 11 NaN
7 NaN NaN NaN 43
8 NaN NaN NaN 63
9 NaN NaN NaN 97
Given my assumptions above, columns A and C should be dropped since A only has one value and both of C's values are the same. You can then do:
df.loc[:, df.apply(lambda col: col.dropna().unique().shape[0] > 1)]
And that gives me:
B D
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 52 NaN
4 36 NaN
5 NaN NaN
6 NaN NaN
7 NaN 43
8 NaN 63
9 NaN 97
This will work for both text and numbers:
# iterate over a copy of the column list so that popping columns
# does not disturb the iteration
for col in list(dataframe.columns):
    if len(dataframe.loc[:, col].unique()) == 1:
        dataframe.pop(col)
Note: this removes the columns having only one value from the original dataframe in place.
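A vectorized one-liner sketch with the same intent: DataFrame.nunique counts the distinct values per column, so keeping only the columns where the count exceeds 1 drops the single-value ones. Note that nunique ignores NaN (unlike unique above), so a column holding one value plus NaNs is dropped as well:
# keep columns that have more than one distinct non-null value
df = df.loc[:, df.nunique() > 1]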