I have a pandas dataframe with a two-level column index. It's read in from a spreadsheet where the author used a lot of whitespace to accomplish things like alignment (for example, one column is called 'Tank #').
I've been able to remove the whitespace on the levels individually...
level0 = raw.columns.levels[0].str.replace(r'\s', '', regex=True)
level1 = raw.columns.levels[1].str.replace(r'\s', '', regex=True)
raw.columns = raw.columns.set_levels([level0, level1])
...but I'm curious if there is a way to do it without having to change each individual level one at a time.
I tried raw.columns.set_levels(raw.columns.str.replace(r'\s', '', regex=True))
but got AttributeError: Can only use .str accessor with Index, not MultiIndex.
Here is a small sample subset of the data-- my best attempt at SO table formatting :D, followed by a picture where I've highlighted in yellow the indices as received.
|   | Run Info | Run Info | Run Data  | Run Data |
|---|----------|----------|-----------|----------|
|   | run #    | Tank #   | Step A pH | conc. %  |
| 0 | 6931     | 5        | 5.29      | 33.14    |
| 1 | 6932     | 1        | 5.28      | 33.13    |
| 2 | 6933     | 2        | 5.32      | 33.40    |
| 3 | 6934     | 3        | 5.19      | 32.98    |
Thanks for any insight!
Edit: adding to_dict()
df.to_dict()
Out[5]:
{'Unnamed: 0': {0: nan, 1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0, 5: 4.0},
'Run Info': {0: 'run #',
1: '6931',
2: '6932',
3: '6933',
4: '6934',
5: '6935'},
'Run Info.1': {0: 'Tank #',
1: '5',
2: '1',
3: '2',
4: '3',
5: '4'},
'Run Data': {0: 'Step A\npH',
1: '5.29',
2: '5.28',
3: '5.32',
4: '5.19',
5: '5.28'},
'Run Data.1': {0: 'concentration',
1: '33.14',
2: '33.13',
3: '33.4',
4: '32.98',
5: '32.7'}}
How about rename:
import re
df.rename(columns=lambda x: re.sub(r'\s+', ' ', x.strip()), inplace=True)
If you don't want to keep any of the spaces, you can just replace ' ' with ''.
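A quick self-contained check on a small made-up frame (the labels here are toy values, not your actual data) shows that rename applies the lambda to every label on every level of a MultiIndex:

```python
import re
import pandas as pd

# Toy frame with whitespace-padded labels on both levels.
cols = pd.MultiIndex.from_tuples([('Run Info ', ' run #'), ('Run Info ', 'Tank  #')])
df = pd.DataFrame([[0, 6931], [1, 6932]], columns=cols)

# rename maps the function over every label on every level.
df = df.rename(columns=lambda x: re.sub(r'\s+', '', x))
print(df.columns.tolist())  # [('RunInfo', 'run#'), ('RunInfo', 'Tank#')]
```

So there is no need to touch the levels one at a time.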
Related
I'm trying to pivot my dataframe so that there is a single row, with one cell for each summary-by-metric combination. I have tried pivoting this, but can't figure out a sensible index column.
Here is my current output.
Does anyone know how to achieve my expected output?
To reproduce:
import pandas as pd
pd.DataFrame({'summary': {0: 'mean',
1: 'stddev',
2: 'mean',
3: 'stddev',
4: 'mean',
5: 'stddev'},
'metric': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'C', 5: 'C'},
'value': {0: '2.0',
1: '1.5811388300841898',
2: '0.4',
3: '0.5477225575051661',
4: None,
5: None}})
Remove missing values with DataFrame.dropna, join the two columns together, convert them to the index, and transpose with DataFrame.T:
df = df.dropna(subset=['value'])
df['g'] = df['summary'] + '_' + df['metric']
df = df.set_index('g')[['value']].T.reset_index(drop=True).rename_axis(None, axis=1)
print (df)
mean_A stddev_A mean_B stddev_B
0 2.0 1.5811388300841898 0.4 0.5477225575051661
I have the following dataframe:
df = pd.DataFrame({'Variable': {0: 'Abs', 1: 'rho0', 2: 'cp', 3: 'K0'},
'Value': {0: 0.585, 1: 8220.000, 2: 435.000, 3: 11.400},
'Description': {0: 'foo', 1: 'foo', 2: 'foo', 3: 'foo'}})
I would like to reshape it like this:
df2 = pd.DataFrame({'Abs': {0: 0.585},
'rho0': {0: 8220.000},
'cp': {0: 435.000},
'K0': {0: 11.400}})
How can I do it?
df3 = df.pivot_table(columns='Variable', values='Value')
print(df3)
Variable Abs K0 cp rho0
Value 0.585 11.4 435.0 8220.0
gets very close to what I was looking for, but I'd rather do without the first column Variable, if at all possible.
You can try rename_axis():
df3 = df.pivot_table(values='Value', columns='Variable').rename_axis(None, axis=1)
Additionally, if you want to reset the index:
df3 = df.pivot_table(columns='Variable').rename_axis(None, axis=1).reset_index(drop=True)
df3.to_dict()
# Output
{'Abs': {0: 0.585},
'K0': {0: 11.4},
'cp': {0: 435.0},
'rho0': {0: 8220.0}}
I'm working on the data below and I would like to fill the NaN values in Begin and End with a date taken from the Subscription Period column.
All the columns are strings.
I have several formats:
For 05/03/2020 to 04/03/2021, I use:
# clean when the begin and end dates are in Subscription Period
# create 3 new columns
df_period = df['Subscription Period'] \
.str.extractall(r'(?P<Period>(?P<Begin>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d).+(?P<End>(0[1-9]|[12][0-9]|3[01])[/](0[1-9]|1[012])[/](19|20)?\d\d))')
df['Period'] = df_period['Period'].unstack()
df['Begin'] = df_period['Begin'].unstack()
df['End'] = df_period['End'].unstack()
For the other formats in Subscription Period:
Subscription Hospital Sept-Dec 2018: I would like to extract Sept as 01/09/2018 into Begin and 31/12/2018 into End.
Yearly Subscription Hospital (effective 17/04/2019)
Yearly Subscription Hospital (effective 01 octobre 2018)
For these two, I would like to get the date into Begin, and the date plus one year into End.
I tried these solutions:
with mask()
mask = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')
df.loc[mask, 'Begin'] = df['Subscription Period'].str.contains(r'(\d{2}/\d{2}/\d{2,4})[)]?$')
with loc(): works for a literal value like 'B', but not for a regex with extract.
df.loc[(df['Begin'].isnull()) , 'Period']= 'B'
Here is the data:
data = {'Date': {0: '2020-05-05',
1: '2018-09-12',
2: '2020-04-22',
3: '2020-01-01',
4: '2019-04-17',
5: '2018-09-07',
6: '2018-11-20',
7: '2018-11-28'},
'Subscription Period': {0: 'Subscription Hospital : from 01/05/2020 to 30/04/2021',
1: 'Subscription Hospital Sept-Dec 2018',
2: 'Yearly Subscription Hospital from 05/03/2020 to 04/03/2021',
3: 'Subscription Hospital from 01/01/2020 to 31/12/2020',
4: 'Yearly Subscription Hospital (effective 17/04/2019)',
5: 'Yearly Subscription Hospital (effective 01 octobre 2018)',
6: 'Subscription : Hospital',
7: 'Yearly Subscription Hospital'},
'Period': {0: '01/05/2020 to 30/04/2021',
1: np.NaN,
2: '05/03/2020 to 04/03/2021',
3: '01/01/2020 to 31/12/2020',
4: np.NaN,
5: np.NaN,
6: np.NaN,
7: np.NaN},
'Begin': {0: '01/05/2020',
1: np.NaN,
2: '05/03/2020',
3: '01/01/2020',
4: np.NaN,
5: np.NaN,
6: np.NaN,
7: np.NaN},
'End': {0: '30/04/2021',
1: np.NaN,
2: '04/03/2021',
3: '31/12/2020',
4: np.NaN,
5: np.NaN,
6: np.NaN,
7: np.NaN}}
df = pd.DataFrame.from_dict(data)
Thank you for the help and any tips.
Regarding your mask example, if you're using str.extract or str.extractall, there's no need to index using a mask since the resulting dataframe is already indexed. Instead, you can use concat to join on the index and use combine_first to apply only where Begin is null:
begin2 = df['Subscription Period'].str.extract(r'(\d{2}/\d{2}/\d{2,4})[)]?$').rename({0:'Begin2'}, axis=1)
df = pd.concat([df, begin2], axis=1)
df.Begin = df.Begin.combine_first(df.Begin2)
df = df.drop('Begin2', axis=1)
Hopefully you can take it from here? Otherwise you might have to clarify where exactly you're having trouble.
And by the way, those regexes are pretty hairy. I'd suggest defining a custom function and using df.apply.
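As a minimal sketch of that suggestion (the function name and the single format it handles are made up here; you would extend it to cover your other formats), a plain Python function is much easier to read and extend than one giant regex:

```python
import re
import pandas as pd

def extract_begin(text):
    """Return the first dd/mm/yyyy date found in the text, else None.
    Placeholder logic - add branches for the other formats you listed."""
    m = re.search(r'(\d{2}/\d{2}/\d{4})', text)
    return m.group(1) if m else None

df = pd.DataFrame({'Subscription Period': [
    'Yearly Subscription Hospital (effective 17/04/2019)',
    'Subscription : Hospital',
]})
df['Begin'] = df['Subscription Period'].apply(extract_begin)
print(df['Begin'].tolist())
```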
I need to unstack a contact list (id, relatives, phone numbers...) so that the columns keep a specific order.
Given an index, DataFrame.unstack works by unstacking single columns one by one, even when applied to a pair of columns.
Data have
df_have=pd.DataFrame.from_dict({'ID': {0: '100', 1: '100', 2: '100', 3: '100', 4: '100', 5: '200', 6: '200', 7: '200', 8: '200', 9: '200'},
'ID_RELATIVE': {0: '100', 1: '100', 2: '150', 3: '150', 4: '190', 5: '200', 6: '200', 7: '250', 8: '290', 9: '290'},
'RELATIVE_ROLE': {0: 'self', 1: 'self', 2: 'father', 3: 'father', 4: 'mother', 5: 'self', 6: 'self', 7: 'father', 8: 'mother', 9: 'mother'},
'PHONE': {0: '111111', 1: '222222', 2: '333333', 3: '444444', 4: '555555', 5: '123456', 6: '456789', 7: '987654', 8: '778899', 9: '909090'}})
Data want
df_want=pd.DataFrame.from_dict({'ID': {0: '100', 1: '200'},
'ID_RELATIVE_1': {0: '100', 1: '200'},
'RELATIVE_ROLE_1': {0: 'self', 1: 'self'},
'PHONE_1_1': {0: '111111', 1: '123456'},
'PHONE_1_2': {0: '222222', 1: '456789'},
'ID_RELATIVE_2': {0: '150', 1: '250'},
'RELATIVE_ROLE_2': {0: 'father', 1: 'father'},
'PHONE_2_1': {0: '333333', 1: '987654'},
'PHONE_2_2': {0: '444444', 1: 'nan'},
'ID_RELATIVE_3': {0: '190', 1: '290'},
'RELATIVE_ROLE_3': {0: 'mother', 1: 'mother'},
'PHONE_3_1': {0: '555555', 1: '778899'},
'PHONE_3_2': {0: 'nan', 1: '909090'}})
So, in the end, I need ID to be the index, and to unstack the other columns that will hence become attributes of ID.
The usual unstack process produces a "correct" output, but in the wrong shape.
df2 = df_have.groupby(['ID'])[['ID_RELATIVE', 'RELATIVE_ROLE', 'PHONE']].apply(lambda x: x.reset_index(drop=True)).unstack()
This would require the re-ordering of columns and some removal of duplicates (by columns, not by row), together with a FOR loop. I'd like to avoid using this approach, since I'm looking for a more "elegant" way of achieving the desired result by means of grouping/stacking/unstacking/pivoting and so on.
Thanks a lot
The solution has two main steps: first group by all columns except PHONE to pair up the phone numbers, then convert the column names to an ordered categorical for correct sorting and group by ID:
c = ['ID','ID_RELATIVE','RELATIVE_ROLE']
df = df_have.set_index(c+ [df_have.groupby(c).cumcount().add(1)])['PHONE']
df = df.unstack().add_prefix('PHONE_').reset_index()
df = df.set_index(['ID', df.groupby('ID').cumcount().add(1)])
df.columns = pd.CategoricalIndex(df.columns, categories=df.columns.tolist(), ordered=True)
df = df.unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
ID ID_RELATIVE_1 RELATIVE_ROLE_1 PHONE_1_1 PHONE_2_1 ID_RELATIVE_2 \
0 100 100 self 111111 222222 150
1 200 200 self 123456 456789 250
RELATIVE_ROLE_2 PHONE_1_2 PHONE_2_2 ID_RELATIVE_3 RELATIVE_ROLE_3 PHONE_1_3 \
0 father 333333 444444 190 mother 555555
1 father 987654 NaN 290 mother 778899
PHONE_2_3
0 NaN
1 909090
If you need to change the order of the digits in the PHONE column names:
df.columns = [f'{a.split("_")[0]}_{b}_{a.split("_")[1]}'
if 'PHONE' in a
else f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
ID ID_RELATIVE_1 RELATIVE_ROLE_1 PHONE_1_1 PHONE_1_2 ID_RELATIVE_2 \
0 100 100 self 111111 222222 150
1 200 200 self 123456 456789 250
RELATIVE_ROLE_2 PHONE_2_1 PHONE_2_2 ID_RELATIVE_3 RELATIVE_ROLE_3 PHONE_3_1 \
0 father 333333 444444 190 mother 555555
1 father 987654 NaN 290 mother 778899
PHONE_3_2
0 NaN
1 909090
I am working with movie data and have a dataframe column for movie genre. Currently the column contains a list of movie genres for each movie (as most movies are assigned to multiple genres), but for the purpose of this analysis, I would like to parse the list and create a new dataframe column for each genre. So instead of having genre=['Drama','Thriller'] for a given movie, I would have two columns, something like genre1='Drama' and genre2='Thriller'.
Here is a snippet of my data:
{'color': {0: [u'Color::(Technicolor)'],
1: [u'Color::(Technicolor)'],
2: [u'Color::(Technicolor)'],
3: [u'Color::(Technicolor)'],
4: [u'Black and White']},
'country': {0: [u'USA'],
1: [u'USA'],
2: [u'USA'],
3: [u'USA', u'UK'],
4: [u'USA']},
'genre': {0: [u'Crime', u'Drama'],
1: [u'Crime', u'Drama'],
2: [u'Crime', u'Drama'],
3: [u'Action', u'Crime', u'Drama', u'Thriller'],
4: [u'Crime', u'Drama']},
'language': {0: [u'English'],
1: [u'English', u'Italian', u'Latin'],
2: [u'English', u'Italian', u'Spanish', u'Latin', u'Sicilian'],
3: [u'English', u'Mandarin'],
4: [u'English']},
'rating': {0: 9.3, 1: 9.2, 2: 9.0, 3: 9.0, 4: 8.9},
'runtime': {0: [u'142'],
1: [u'175'],
2: [u'202', u'220::(The Godfather Trilogy 1901-1980 VHS Special Edition)'],
3: [u'152'],
4: [u'96']},
'title': {0: u'The Shawshank Redemption',
1: u'The Godfather',
2: u'The Godfather: Part II',
3: u'The Dark Knight',
4: u'12 Angry Men'},
'votes': {0: 1793199, 1: 1224249, 2: 842044, 3: 1774083, 4: 484061},
'year': {0: 1994, 1: 1972, 2: 1974, 3: 2008, 4: 1957}}
Any help would be greatly appreciated! Thanks!
I think you need the DataFrame constructor with add_prefix, and finally concat back to the original:
df1 = pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')
df = pd.concat([df.drop('genre',axis=1), df1], axis=1)
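A quick sanity check on a tiny made-up frame (two movies with unequal genre counts; the shorter list is padded with None):

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B'],
                   'genre': [['Crime', 'Drama'], ['Action', 'Crime', 'Thriller']]})

# Expand the lists into their own columns; missing slots become None.
df1 = pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')
df = pd.concat([df.drop('genre', axis=1), df1], axis=1)
print(df.columns.tolist())  # ['title', 'genre_0', 'genre_1', 'genre_2']
```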
Timings:
df = pd.DataFrame(d)
print (df)
#5000 rows
df = pd.concat([df]*1000).reset_index(drop=True)
In [394]: %timeit (pd.concat([df.drop('genre',axis=1), pd.DataFrame(df.genre.values.tolist()).add_prefix('genre_')], axis=1))
100 loops, best of 3: 3.4 ms per loop
In [395]: %timeit (pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1))
1 loop, best of 3: 757 ms per loop
This should work for you:
pd.concat([df.drop(['genre'],axis=1),df['genre'].apply(pd.Series).rename(columns={0:'genre_0',1:'genre_1',2:'genre_2',3:'genre_3'})],axis=1)