Reshape DataFrame from long to wide along one column - python

I am looking for a way to reconfigure Table A, shown below, into Table B.
Table A:

type  x1  x2  x3
   A   4   6   9
   A   7   4   1
   A   9   6   2
   B   1   3   8
   B   2   7   9

transformed into Table B:

type  x1  x2  x3  x1'  x2'  x3'  x1''  x2''  x3''
   A   4   6   9    7    4    1     9     6     2
   B   1   3   8    2    7    9    NA    NA    NA
The real Table A has over 150,000 rows and 36 columns, with 2,100 unique "type" values.

You can set the index appropriately and then unstack:
df

  type  x1  x2  x3
0    A   4   6   9
1    A   7   4   1
2    A   9   6   2
3    B   1   3   8
4    B   2   7   9

res = (df.set_index(['type', df.groupby('type').cumcount()])
         .unstack()
         .sort_index(level=-1, axis=1))
res.columns = res.columns.map(lambda x: x[0] + "'" * int(x[1]))
res

       x1   x2   x3  x1'  x2'  x3'  x1''  x2''  x3''
type
A     4.0  6.0  9.0  7.0  4.0  1.0   9.0   6.0   2.0
B     1.0  3.0  8.0  2.0  7.0  9.0   NaN   NaN   NaN
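For comparison, the same reshape can also be written with an explicit cumcount column and pivot. A minimal, self-contained sketch using the sample frame from the question (the column relabelling is just tuple unpacking over the resulting MultiIndex):

import pandas as pd

df = pd.DataFrame({'type': ['A', 'A', 'A', 'B', 'B'],
                   'x1': [4, 7, 9, 1, 2],
                   'x2': [6, 4, 6, 3, 7],
                   'x3': [9, 1, 2, 8, 9]})

# Number the rows within each 'type' group, then pivot so each
# within-group position becomes its own block of columns.
wide = (df.assign(n=df.groupby('type').cumcount())
          .pivot(index='type', columns='n')
          .sort_index(level='n', axis=1))

# Rebuild the primed names from the (column, position) tuples.
wide.columns = [col + "'" * n for col, n in wide.columns]
print(wide)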

Related

Pandas: Merging and comparing dataframes

I've got 3 DataFrames I would like to merge or join by "Label" and then be able to compare all columns.
Examples of the DataFrames are below:
df1
Label,col1,col2,col3
NF1,1,1,6
NF2,3,2,8
NF3,4,5,4
NF4,5,7,2
NF5,6,2,2
df2
Label,col1,col2,col3
NF1,8,4,5
NF2,4,7,8
NF3,9,7,8
df3
Label,col1,col2,col3
NF1,2,8,8
NF2,6,2,0
NF3,2,2,5
NF4,2,4,9
NF5,2,5,8
and what I'd like to see is similar to:
Label,df1_col1,df2_col1,df3_col1,df1_col2,df2_col2,df3_col2,df1_col3,df2_col3,df3_col3
NF1,1,8,2,1,4,8,6,5,8
NF2,3,4,6,2,7,2,8,8,0
NF3,4,9,2,5,7,2,4,8,5
NF4,5,,2,7,,4,2,,9
NF5,6,,2,2,,5,2,,8
but I'm open to suggestions on how to make the comparisons more readable.
Thanks!
Use concat with a list of DataFrames, add the keys parameter for prefixes, and sort by column names:
dfs = [df1, df2, df3]
k = ('df1', 'df2', 'df3')
df = (pd.concat([x.set_index('Label') for x in dfs], axis=1, keys=k)
        .sort_index(axis=1, level=1)
        .rename_axis('Label')
        .reset_index())
df.columns = df.columns.map('_'.join).str.strip('_')
print(df)
  Label  df1_col1  df2_col1  df3_col1  df1_col2  df2_col2  df3_col2  df1_col3  df2_col3  df3_col3
0   NF1         1       8.0         2         1       4.0         8         6       5.0         8
1   NF2         3       4.0         6         2       7.0         2         8       8.0         0
2   NF3         4       9.0         2         5       7.0         2         4       8.0         5
3   NF4         5       NaN         2         7       NaN         4         2       NaN         9
4   NF5         6       NaN         2         2       NaN         5         2       NaN         8
You can use df.merge:
In [1965]: res = (df1.merge(df2, on='Label', how='left', suffixes=('_df1', '_df2'))
      ...:           .merge(df3, on='Label', how='left')
      ...:           .rename(columns={'col1': 'col1_df3', 'col2': 'col2_df3',
      ...:                            'col3': 'col3_df3'}))

In [1975]: res = res.reindex(sorted(res.columns), axis=1)

In [1976]: res
Out[1976]:
Label col1_df1 col1_df2 col1_df3 col2_df1 col2_df2 col2_df3 col3_df1 col3_df2 col3_df3
0 NF1 1 8.00 2 1 4.00 8 6 5.00 8
1 NF2 3 4.00 6 2 7.00 2 8 8.00 0
2 NF3 4 9.00 2 5 7.00 2 4 8.00 5
3 NF4 5 nan 2 7 nan 4 2 nan 9
4 NF5 6 nan 2 2 nan 5 2 nan 8
We can use Pandas' join method, by setting the Label column as the index and joining the dataframes:
dfs = [df1, df2, df3]
keys = ['df1', 'df2', 'df3']
# set Label as index and prefix each frame's columns with its name
df1, *others = [frame.set_index("Label").add_prefix(f"{prefix}_")
                for frame, prefix in zip(dfs, keys)]
# join df1 with the others
outcome = df1.join(others, how='outer').rename_axis(index='Label').reset_index()
outcome
Label df1_col1 df1_col2 df1_col3 df2_col1 df2_col2 df2_col3 df3_col1 df3_col2 df3_col3
0 NF1 1 1 6 8.0 4.0 5.0 2 8 8
1 NF2 3 2 8 4.0 7.0 8.0 6 2 0
2 NF3 4 5 4 9.0 7.0 8.0 2 2 5
3 NF4 5 7 2 NaN NaN NaN 2 4 9
4 NF5 6 2 2 NaN NaN NaN 2 5 8
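For completeness, here is a self-contained reproduction of the concat answer on the question's data; a sketch, assuming the corrected col1/col2/col3 headers for df2 and df3:

import pandas as pd
from io import StringIO

csv1 = """Label,col1,col2,col3
NF1,1,1,6
NF2,3,2,8
NF3,4,5,4
NF4,5,7,2
NF5,6,2,2"""
csv2 = """Label,col1,col2,col3
NF1,8,4,5
NF2,4,7,8
NF3,9,7,8"""
csv3 = """Label,col1,col2,col3
NF1,2,8,8
NF2,6,2,0
NF3,2,2,5
NF4,2,4,9
NF5,2,5,8"""
df1, df2, df3 = (pd.read_csv(StringIO(s)) for s in (csv1, csv2, csv3))

# concat aligns the frames on the Label index; keys= tags each source,
# and sorting on level 1 groups the three versions of each column together.
out = (pd.concat([d.set_index('Label') for d in (df1, df2, df3)],
                 axis=1, keys=('df1', 'df2', 'df3'))
         .sort_index(axis=1, level=1))
out.columns = out.columns.map('_'.join)
print(out.reset_index())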

python: inserting row at specific index from one dataframe to another

I have two dataframes as follows:
df1:

   A  B  C  D  E
0  8  6  4  9  7
1  2  6  3  8  5
2  0  7  6  5  8

df2:

   M  N  O  P  Q  R  S  T
0  1  2  3
1  4  5  6
2  7  8  9
3  8  6  5
4  5  4  3
I have taken out a slice of data from df1 as follows:
>data_1 = df1.loc[0:1]
>data_1
A B C D E
0 8 6 4 9 7
1 2 6 3 8 5
Now I need to insert this data_1 into df2 at the specific location (0, 'P') (row, column). Is there any way to do it? I do not want to disturb the other columns in df2.
I can extract individual values of each cell and do it, but since I have to do it for a large dataset, it's not possible to do it cell-wise.
Cellwise method:
>var1 = df1.iat[0,1]
>var2 = df1.iat[0,0]
>df2.at[0, 'P'] = var1
>df2.at[0, 'Q'] = var2
If you specify all the columns, it is possible to do it as follows:
df2.loc[0:1, ['P', 'Q', 'R', 'S', 'T']] = df1.loc[0:1].values
Resulting dataframe:
   M  N  O    P    Q    R    S    T
0  1  2  3  8.0  6.0  4.0  9.0  7.0
1  4  5  6  2.0  6.0  3.0  8.0  5.0
2  7  8  9
3  8  6  5
4  5  4  3
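If the target columns are contiguous, a label slice avoids listing them all. A small sketch, assuming 'P' through 'T' are the last columns of df2 (.loc label slices are inclusive):

df2.loc[0:1, 'P':] = df1.loc[0:1].values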
You can rename the columns and index labels to match the second DataFrame, which makes it possible to use DataFrame.update to write the values at the position specified by the tuple pos:
data_1 = df1.loc[0:1]
print (data_1)
   A  B  C  D  E
0  8  6  4  9  7
1  2  6  3  8  5

pos = (2, 'P')
data_1 = data_1.rename(columns=dict(zip(data_1.columns, df2.loc[:, pos[1]:].columns)),
                       index=dict(zip(data_1.index, df2.loc[pos[0]:].index)))
print (data_1)
   P  Q  R  S  T
2  8  6  4  9  7
3  2  6  3  8  5

df2.update(data_1)
print (df2)
   M  N  O    P    Q    R    S    T
0  1  2  3  NaN  NaN  NaN  NaN  NaN
1  4  5  6  NaN  NaN  NaN  NaN  NaN
2  7  8  9  8.0  6.0  4.0  9.0  7.0
3  8  6  5  2.0  6.0  3.0  8.0  5.0
4  5  4  3  NaN  NaN  NaN  NaN  NaN
How the rename works: loc selects all column labels from the specified column onward (df2.loc[:, pos[1]:]) and all index labels from the specified row onward (df2.loc[pos[0]:]); zipping those against data_1's own columns and index and converting to dictionaries gives the mappings. rename then relabels both the columns and the index of data_1, so that update aligns it exactly at the target position.
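The same idea can be wrapped in a small reusable helper. insert_block below is a hypothetical function (not a pandas API), sketched under the assumption that the block fits inside the target frame starting at pos:

import pandas as pd

def insert_block(target, block, pos):
    # Write `block` into `target` starting at (row label, column label) `pos`,
    # leaving every other cell untouched. Hypothetical helper, not pandas API.
    row, col = pos
    cols = target.loc[:, col:].columns[:block.shape[1]]
    idx = target.loc[row:].index[:block.shape[0]]
    # Relabel the block so update() aligns it with the target region.
    aligned = block.set_axis(cols, axis=1).set_axis(idx, axis=0)
    target.update(aligned)

# usage, with df1 and df2 from the question:
# insert_block(df2, df1.loc[0:1], (2, 'P'))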

Python/Pandas: Use lookup DataFrame + function to replace specific/null values in DataFrame

Say I have an incomplete dataset in a Pandas DataFrame such as:
incData = pd.DataFrame({'comp': ['A']*3 + ['B']*5 + ['C']*4,
                        'x': [1,2,3] + [1,2,3,4,5] + [1,2,3,4],
                        'y': [3,None,7] + [1,4,7,None,None] + [4,None,2,1]})
And also a DataFrame with fitting parameters that I could use to fill holes:
fitTable = pd.DataFrame({'slope': [2,3,-1],
                         'intercept': [1,-2,5]},
                        index=['A','B','C'])
I would like to achieve the following using y=x*slope+intercept for the None entries only:
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
One way I envisioned is by using join and drop:
incData = incData.join(fitTable, on='comp')
incData.loc[incData['y'].isnull(), 'y'] = incData[incData['y'].isnull()]['x'] * \
                                          incData[incData['y'].isnull()]['slope'] + \
                                          incData[incData['y'].isnull()]['intercept']
incData.drop(['slope', 'intercept'], axis=1, inplace=True)
However, that does not seem very efficient, because it adds and then removes columns. It seems that I am making this too complicated; am I overlooking a simpler, more direct solution? Something more like this non-functional code:
incData.loc[incData['y'].isnull(), 'y'] = incData[incData['y'].isnull()]['x'] * \
    fitTable[incData[incData['y'].isnull()]['comp']]['slope'] + \
    fitTable[incData[incData['y'].isnull()]['comp']]['intercept']
I am pretty new to Pandas, so I sometimes get a bit mixed up with the strict indexing rules...
You can use map on the column 'comp', once masked with the null values in 'y', like:
mask = incData['y'].isna()
incData.loc[mask, 'y'] = incData.loc[mask, 'x'] * \
                         incData.loc[mask, 'comp'].map(fitTable['slope']) + \
                         incData.loc[mask, 'comp'].map(fitTable['intercept'])
and your non-functional code, I guess it would be something like:
incData.loc[mask, 'y'] = incData.loc[mask, 'x'] * \
                         fitTable.loc[incData.loc[mask, 'comp'], 'slope'].to_numpy() + \
                         fitTable.loc[incData.loc[mask, 'comp'], 'intercept'].to_numpy()
IIUC:
incData.loc[pd.isna(incData['y']), 'y'] = incData[pd.isna(incData['y'])].apply(
    lambda row: row['x'] * fitTable.loc[row['comp'], 'slope']
                + fitTable.loc[row['comp'], 'intercept'],
    axis=1)
incData
comp x y
0 A 1 3.0
1 A 2 5.0
2 A 3 7.0
3 B 1 1.0
4 B 2 4.0
5 B 3 7.0
6 B 4 10.0
7 B 5 13.0
8 C 1 4.0
9 C 2 3.0
10 C 3 2.0
11 C 4 1.0
merge is another option
# merge two dataframe together on comp
m = incData.merge(fitTable, left_on='comp', right_index=True)
# y = mx+b
m['y'] = m['x']*m['slope']+m['intercept']
comp x y slope intercept
0 A 1 3 2 1
1 A 2 5 2 1
2 A 3 7 2 1
3 B 1 1 3 -2
4 B 2 4 3 -2
5 B 3 7 3 -2
6 B 4 10 3 -2
7 B 5 13 3 -2
8 C 1 4 -1 5
9 C 2 3 -1 5
10 C 3 2 -1 5
11 C 4 1 -1 5
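Note that the merge approach recomputes y for every row; it only matches the desired output here because the known y values happen to sit exactly on the fit lines. To touch only the missing entries without an explicit mask, one option is to compute the fit for all rows and let fillna pick it up where needed; a short sketch on the question's frames:

# y = x * slope + intercept, looked up per row via 'comp'
fit = (incData['x'] * incData['comp'].map(fitTable['slope'])
       + incData['comp'].map(fitTable['intercept']))

# fillna only replaces the rows where y is missing
incData['y'] = incData['y'].fillna(fit)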

Pandas Rolling Groupby Shift back 1, Trying to lag rolling sum

I am trying to get a rolling sum of the past 3 rows for the same ID, but lagged by 1 row. My attempt looked like the code below, where i is the column. There has to be a way to do this, but this method doesn't seem to work.
for i in df.columns.values:
    df.groupby('Id', group_keys=False)[i].rolling(window=3, min_periods=2).mean().shift(1)
id dollars lag
1 6 nan
1 7 nan
1 6 6.5
3 7 nan
3 4 nan
3 4 5.5
3 3 5
5 6 nan
5 5 nan
5 6 5.5
5 12 5.67
5 7 8.3
I am trying to get a rolling sum of the past 3 rows for the same ID but lagging this by 1 row.
You can create the lagged rolling sum by chaining DataFrame.groupby(ID), .shift(1) for the lag 1, .rolling(3) for the window 3, and .sum() for the sum.
Example: Let's say your dataset is:
import pandas as pd
# Reproducible datasets are your friend!
d = pd.DataFrame({'grp': pd.Series(['A']*4 + ['B']*5 + ['C']*6),
                  'x': pd.Series(range(15))})
print(d)
grp x
A 0
A 1
A 2
A 3
B 4
B 5
B 6
B 7
B 8
C 9
C 10
C 11
C 12
C 13
C 14
I think what you're asking for is this:
d['y'] = d.groupby('grp')['x'].shift(1).rolling(3).sum()
print(d)
grp x y
A 0 NaN
A 1 NaN
A 2 NaN
A 3 3.0
B 4 NaN
B 5 NaN
B 6 NaN
B 7 15.0
B 8 18.0
C 9 NaN
C 10 NaN
C 11 NaN
C 12 30.0
C 13 33.0
C 14 36.0
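The question's expected output also fills partial windows (e.g. 6.5 after only two prior rows), which the plain rolling(3) above leaves as NaN. Passing min_periods, as in the question's own attempt, allows that. A sketch, assuming an Id grouping column and a dollars value column; the expected lag column is a mean, so .mean() is used here (swap in .sum() for a lagged rolling sum):

df['lag'] = (df.groupby('Id')['dollars']
               .transform(lambda s: s.shift(1).rolling(3, min_periods=2).mean()))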

pandas.DataFrame set all string values to nan

I have a pandas.DataFrame that contains string, float and int types.
Is there a way to set all strings that cannot be converted to float to NaN?
For example:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 "wajdi"
to:
A B C D
0 1 2 5 7
1 0 4 NaN 15
2 4 8 9 10
3 11 5 8 0
4 11 5 8 NaN
You can use pd.to_numeric and set errors='coerce'
pandas.to_numeric
df['D'] = pd.to_numeric(df.D, errors='coerce')
Which will give you:
A B C D
0 1 2 5.0 7.0
1 0 4 NaN 15.0
2 4 8 9.0 10.0
3 11 5 8.0 0.0
4 11 5 8.0 NaN
Deprecated solution (pandas <= 0.20 only):
df.convert_objects(convert_numeric=True)
pandas.DataFrame.convert_objects
Here's the dev note in the convert_objects source code: # TODO: Remove in 0.18 or 2017, whichever is sooner. So don't make this a long-term solution if you use it.
Here is a way:
df['E'] = pd.to_numeric(df.D, errors='coerce')
And then you have:
A B C D E
0 1 2 5.0 7 7.0
1 0 4 NaN 15 15.0
2 4 8 9.0 10 10.0
3 11 5 8.0 0 0.0
4 11 5 8.0 wajdi NaN
You can use pd.to_numeric with errors='coerce'.
In [30]: df = pd.DataFrame({'a': [1, 2, 'NaN', 'bob', 3.2]})
In [31]: pd.to_numeric(df.a, errors='coerce')
Out[31]:
0 1.0
1 2.0
2 NaN
3 NaN
4 3.2
Name: a, dtype: float64
Here is one way to apply it to all columns:
for c in df.columns:
    df[c] = pd.to_numeric(df[c], errors='coerce')
(See comment by NinjaPuppy for a better way.)
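The better way that comment presumably points to is a single apply over all columns; keyword arguments passed to DataFrame.apply are forwarded to the function:

# coerce every column at once; non-convertible values become NaN
df = df.apply(pd.to_numeric, errors='coerce')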
