Python - Unpivoting multiple columns into a single column

As the title suggests, I am trying to turn this
into this
The end goal is to be able to group the values and get a word count.
This is the code that I've tried so far. I am unsure whether it is the right approach.
df_unpivoted = final_df.reset_index().melt(id_vars='index', var_name='Count', value_name='Value')
df = final_df.rename(columns=lambda x: x + x[-1] if x.startswith('index') else x)
df = pd.wide_to_long(df, ['0'], i='id', j='i')
df = df.reset_index(level=1, drop=True).reset_index()
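The original input and output tables did not survive on this page, so the following is only a minimal sketch of the melt-then-count idea. The contents of final_df here are hypothetical stand-in data (a few word columns named '0', '1', '2'); the real frame may look different.

```python
import pandas as pd

# Hypothetical stand-in for final_df: one word per cell, spread over several columns.
final_df = pd.DataFrame({
    "0": ["apple", "pear"],
    "1": ["pear", "apple"],
    "2": ["apple", None],
})

# Unpivot every word column into a single 'Value' column, keeping the row index.
long_df = (
    final_df.reset_index()
    .melt(id_vars="index", var_name="Count", value_name="Value")
    .dropna(subset=["Value"])  # drop empty cells so they are not counted
)

# Group by the word and count occurrences.
word_counts = long_df.groupby("Value").size().sort_values(ascending=False)
print(word_counts)
```

If only the counts are needed, long_df["Value"].value_counts() should give the same numbers without an explicit groupby.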

Related

Check if column values exists in different dataframe

I have a pandas DataFrame 'df' with x rows, and another pandas DataFrame 'df2' with y rows (x < y). I want to return the indexes where the values of df['Field'] equal the values of df2['Field'], in order to add the respective 'Manager' to df.
The code I have is as follows:
data2 = [['field1', 'Paul G'] , ['field2', 'Mark R'], ['field3', 'Roy Jr']]
data = [['field1'] , ['field2']]
columns = ['Field']
columns2 = ['Field', 'Manager']
df = pd.DataFrame(data, columns=columns)
df2 = pd.DataFrame(data2, columns=columns2)
farmNames = df['Field']
exists = farmNames.reset_index(drop=True) == df2['Field'].reset_index(drop=True)
This returns the error message:
ValueError: Can only compare identically-labeled Series objects
Does anyone know how to fix this?
As @NickODell mentioned, you could use a merge, which here is basically a left join. See the code below.
df_new = pd.merge(df, df2, on = 'Field', how = 'left')
print(df_new)
Output:
    Field Manager
0  field1  Paul G
1  field2  Mark R
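If the goal is only to test membership, or to look up the manager without a full merge, a sketch along the following lines may also work. It uses Series.isin and Series.map; the extra 'field9' row is a hypothetical unmatched value added to show what happens when there is no match.

```python
import pandas as pd

df = pd.DataFrame({"Field": ["field1", "field2", "field9"]})  # 'field9' is a made-up unmatched row
df2 = pd.DataFrame(
    [["field1", "Paul G"], ["field2", "Mark R"], ["field3", "Roy Jr"]],
    columns=["Field", "Manager"],
)

# isin compares by value, so the two frames may have different lengths and indexes.
exists = df["Field"].isin(df2["Field"])

# Build a Field -> Manager lookup and map it onto df; unmatched fields become NaN.
manager_by_field = df2.set_index("Field")["Manager"]
df["Manager"] = df["Field"].map(manager_by_field)
print(df)
```

The element-wise == in the question fails because the two Series have different lengths; isin avoids that because it never aligns the two indexes.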

pandas rename multi-level column having the same name

When I use an aggregate function, the resulting columns under 'price' and 'carat' both have the same column name, 'mean'.
How do I rename the mean under price to price_mean and the mean under carat to carat_mean?
I can't change them individually.
diamonds.groupby('cut').agg({
    'price': ['count', 'mean'],
    'carat': 'mean'
}).rename(columns={'mean':'price_mean','mean':'carat_mean'}, level = 1)
You could try this:
# df is assumed to hold the aggregated result with two-level columns
# Rename columns of level 1
df1 = df["price"]
df1.columns = ["count", "price_mean"]
df2 = df["carat"]
df2.columns = ["carat_mean"]
# Reassemble the renamed frames under the level 0 columns
df = pd.concat([df1, df2], axis=1, keys=['price', 'carat'])
print(df)
# Outputs
          price                 carat
          count price_mean carat_mean
Fair   0.693995  -0.632283   0.789963
Good   0.099057   1.005623   0.143289
Ideal -0.277984  -0.105138  -0.611168
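An alternative that avoids the two-level columns altogether is named aggregation, available in pandas 0.25 and later. The sketch below uses a tiny hypothetical stand-in for the diamonds data, since the real dataset is not included here.

```python
import pandas as pd

# Hypothetical miniature stand-in for the diamonds dataset.
diamonds = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Ideal"],
    "price": [300, 500, 400, 900],
    "carat": [0.5, 0.7, 0.4, 1.0],
})

# Named aggregation: each keyword is the output column name,
# each value is a (source column, aggregation) pair.
summary = diamonds.groupby("cut").agg(
    price_count=("price", "count"),
    price_mean=("price", "mean"),
    carat_mean=("carat", "mean"),
)
print(summary)
```

This produces flat, uniquely named columns, so no renaming step is needed afterwards.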

How to reformat dataframe using pandas?

I have the following dataframe:
data = {'Names':['Abbey','English','Maths','Billy','English','Maths','Charlie','English','Maths'],'Subject Grade':['Student Name',85,91,'Student Name',82,74,'Student Name',83,96]}
df = pd.DataFrame(data, columns = ['Names','Subject Grade'])
I would like to reformat the dataframe in order for the names, subject and grades to all be in their respective columns as follows:
data2 = {'Names':['Abbey','Abbey','Billy','Billy','Charlie','Charlie'],'Subject':['English','Maths','English','Maths','English','Maths'],'Grade':[85,91,82,74,83,96]}
df2 = pd.DataFrame(data2, columns = ['Names','Subject','Grade'])
You can use these instructions:
df['name'] = df['Names'].mask(df['Subject Grade'] != "Student Name")
df['name'] = df['name'].ffill()
df = df.query('`Subject Grade`!="Student Name"')
df = df.rename(columns={'Names':'Subject', 'Subject Grade':'Grade', 'name':'Names'})
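For reference, here is a self-contained sketch of the same idea applied to the sample data from the question. The helper names is_header and out are introduced only for illustration, and the final column reordering and integer conversion are assumptions about the desired result based on df2 in the question.

```python
import pandas as pd

data = {
    "Names": ["Abbey", "English", "Maths", "Billy", "English", "Maths",
              "Charlie", "English", "Maths"],
    "Subject Grade": ["Student Name", 85, 91, "Student Name", 82, 74,
                      "Student Name", 83, 96],
}
df = pd.DataFrame(data, columns=["Names", "Subject Grade"])

# Rows whose grade cell says "Student Name" are header rows carrying the student.
is_header = df["Subject Grade"].eq("Student Name")

# Keep the name only on header rows, then forward-fill it down to the subject rows.
df["Student"] = df["Names"].where(is_header).ffill()

# Drop the header rows, rename, and put the columns in the requested order.
out = (
    df.loc[~is_header]
    .rename(columns={"Names": "Subject", "Subject Grade": "Grade", "Student": "Names"})
    [["Names", "Subject", "Grade"]]
    .reset_index(drop=True)
)
out["Grade"] = out["Grade"].astype(int)
print(out)
```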

pandas: how to modify values in a column in dataframe by comparing other column values

I have dataframe with the following structure:
raw_data = {'website': ['bbc.com', 'cnn.com', 'google.com', 'facebook.com'],
'type': ['image', 'audio', 'image', 'video'],
'source': ['bbc','google','stackoverflow','facebook']}
df = pd.DataFrame(raw_data, columns = ['website', 'type', 'source'])
I would like to modify the values in column type with a condition that if the source exists in website, then suffix type with '_1stParty' else '_3rdParty'. The dataframe should eventually look like:
Test the values within each row with the in operator, using apply to process each row separately:
m = df.apply(lambda x: x['source'] in x['website'], axis=1)
Or use zip with list comprehension:
m = [a in b for a, b in zip(df['source'], df['website'])]
and then add new values by numpy.where:
df['type'] += np.where(m, '_1stParty', '_3rdParty')
#'long' alternative
#df['type'] = df['type'] + np.where(m, '_1stParty', '_3rdParty')
print (df)
        website            type         source
0       bbc.com  image_1stParty            bbc
1       cnn.com  audio_3rdParty         google
2    google.com  image_3rdParty  stackoverflow
3  facebook.com  video_1stParty       facebook
You can use the apply method for this, like so:
df["type"] = df.apply(
    lambda row: f"{row['type']}_1stParty" if row["source"] in row["website"]
    else f"{row['type']}_3rdParty", axis=1)
df
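This solution should be faster than the others that use apply(). Note that it compares the part of the website before the first dot with the source, rather than testing for a substring:
df['type'] += df['website'].str.split('.').str[0].eq(df['source']).map(
    {True: '_1stParty', False: '_3rdParty'})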
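The substring test and the first-label comparison agree on the sample data but can differ in general. The sketch below shows both side by side; the extra 'notbbc.com' row is a hypothetical addition made only to expose the difference.

```python
import numpy as np
import pandas as pd

raw_data = {
    "website": ["bbc.com", "cnn.com", "google.com", "facebook.com", "notbbc.com"],
    "type": ["image", "audio", "image", "video", "image"],
    "source": ["bbc", "google", "stackoverflow", "facebook", "bbc"],
}  # the last row is made up for illustration
df = pd.DataFrame(raw_data, columns=["website", "type", "source"])

# Variant 1: the source appears anywhere inside the website string.
substring_match = [s in w for s, w in zip(df["source"], df["website"])]

# Variant 2: the source equals the part of the website before the first dot.
label_match = df["website"].str.split(".").str[0].eq(df["source"]).tolist()

# Build the suffixed column from whichever rule fits the data.
suffix = pd.Series(np.where(substring_match, "_1stParty", "_3rdParty"), index=df.index)
df["type_by_substring"] = df["type"] + suffix
print(df)
```

Which rule is appropriate depends on how the website and source values are formed in the real data.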

How to replace a string in a pandas multiindex?

I have a dataframe with a large multiindex, sourced from a vast number of csv files. Some of those files have errors in the various labels, i.e. "window" is misspelled as "winZZw", which then causes problems when I select all windows with df.xs('window', level='middle', axis=1).
So I need a way to simply replace winZZw with window.
Here's a very minimal sample df: (let's assume the data and the 'roof', 'window'… strings come from some convoluted text reader)
# from_product expects one iterable per level, so each label is wrapped in a list
header = pd.MultiIndex.from_product([['roof'], ['window'], ['basement']], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)
Now I want to xs a new dataframe for all the houses that have a window at their middle level: windf = df.xs('window', level='middle', axis=1)
But this obviously misses the misspelled winZZw.
So, how do I replace winZZw with window?
The only way I found was to use set_levels, but if I understood that correctly, I need to feed it the whole level, i.e.
df.columns.set_levels([u'window',u'window', u'door'], level='middle',inplace=True)
but this has two issues:
1. I need to pass it the whole index, which is easy in this sample, but impossible/stupid for a thousand-column df with hundreds of labels.
2. It seems to need the list backwards (now, my first entry in the df has door in the middle, instead of the window it had). That can probably be fixed, but it seems weird.
I can work around these issues by xsing a new df of only winZZws, then setting the levels with set_levels(df.shape[1]*[u'window'], level='middle') and concatting it together again, but I'd like something more straightforward, analogous to str.replace('winZZw', 'window'), and I can't figure out how.
Use rename and specify the level:
header = pd.MultiIndex.from_product([['roof'],[ 'window'], ['basement']], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)
df = df.rename(columns={'winZZw':'window'}, level='middle')
print(df.head())
top              roof
middle         window                door
bottom       basement  basement  basement
2000-01-01  -0.131052 -1.189049  1.310137
2000-02-01  -0.200646  1.893930  2.124765
2000-03-01  -1.690123 -2.128965  1.639439
2000-04-01  -0.794418  0.605021 -2.810978
2000-05-01   1.528002 -0.286614  0.736445
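A smaller sketch of the same rename, using fixed numbers instead of random data so the effect is easy to check. The names cols, fixed and windf are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Three columns that differ only in the 'middle' level; one label is misspelled.
cols = pd.MultiIndex.from_tuples(
    [("roof", "window", "basement"),
     ("roof", "winZZw", "basement"),
     ("roof", "door", "basement")],
    names=["top", "middle", "bottom"],
)
df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=cols)

# Replace the label on the 'middle' level only; other levels are left untouched.
fixed = df.rename(columns={"winZZw": "window"}, level="middle")

# The cross-section now picks up both window columns.
windf = fixed.xs("window", level="middle", axis=1)
print(fixed.columns.get_level_values("middle").tolist())
print(windf.shape)
```

Because rename takes a mapping, several misspellings can be corrected in one call by adding more entries to the dictionary.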
