pandas rename multi-level column having the same name - python

When I use an aggregate function, the resulting 'price' and 'carat' columns both have a sub-column named 'mean'.
How do I rename the mean under price to price_mean and the one under carat to carat_mean?
I can't change them individually.
diamonds.groupby('cut').agg({
    'price': ['count', 'mean'],
    'carat': 'mean'
}).rename(columns={'mean':'price_mean','mean':'carat_mean'}, level = 1)

You could try this:
# df is the result of the groupby/agg above, with two-level columns
# Rename the level-1 columns of each level-0 group
df1 = df["price"]
df1.columns = ["count", "price_mean"]
df2 = df["carat"]
df2.columns = ["carat_mean"]
# Reassemble the renamed frames under their level-0 column names
df = pd.concat([df1, df2], axis=1, keys=['price', 'carat'])
print(df)
# Outputs
          price                 carat
          count price_mean carat_mean
Fair   0.693995  -0.632283   0.789963
Good   0.099057   1.005623   0.143289
Ideal -0.277984  -0.105138  -0.611168
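If flat column names are acceptable, another option is named aggregation, which sets the output name at aggregation time so the two means never collide. This is only a sketch: the small diamonds frame below is made-up stand-in data, not the real dataset.

```python
import pandas as pd

# Hypothetical stand-in for the diamonds data (values are invented)
diamonds = pd.DataFrame({
    "cut": ["Fair", "Fair", "Good", "Ideal"],
    "price": [300, 500, 400, 600],
    "carat": [0.5, 0.7, 0.4, 0.6],
})

# Named aggregation: output_name=(input_column, aggregation)
result = diamonds.groupby("cut").agg(
    price_count=("price", "count"),
    price_mean=("price", "mean"),
    carat_mean=("carat", "mean"),
)
print(result)
```

The result has a single level of columns (price_count, price_mean, carat_mean), so no renaming step is needed afterwards.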

Related

Check if column values exists in different dataframe

I have a pandas DataFrame 'df' with x rows and another pandas DataFrame 'df2' with y rows (x < y). I want to find the rows where the value of df['Field'] equals a value of df2['Field'], in order to add the respective 'Manager' to df.
The code I have is as follows:
import pandas as pd
data2 = [['field1', 'Paul G'] , ['field2', 'Mark R'], ['field3', 'Roy Jr']]
data = [['field1'] , ['field2']]
columns = ['Field']
columns2 = ['Field', 'Manager']
df = pd.DataFrame(data, columns=columns)
df2 = pd.DataFrame(data2, columns=columns2)
farmNames = df['Field']
exists = farmNames.reset_index(drop=True) == df2['Field'].reset_index(drop=True)
This returns the error message:
ValueError: Can only compare identically-labeled Series objects
Does anyone know how to fix this?
As @NickODell mentioned, you could use a merge, which is basically a left join. See the code below.
df_new = pd.merge(df, df2, on = 'Field', how = 'left')
print(df_new)
Output:
    Field Manager
0  field1  Paul G
1  field2  Mark R
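If only a membership check or a single looked-up column is needed, isin and map are possible alternatives to the merge. A minimal sketch, assuming the 'Field' values in df2 are unique (map on a Series with a duplicated index would raise):

```python
import pandas as pd

df = pd.DataFrame({"Field": ["field1", "field2"]})
df2 = pd.DataFrame({
    "Field": ["field1", "field2", "field3"],
    "Manager": ["Paul G", "Mark R", "Roy Jr"],
})

# Element-wise membership test; works even though the frames differ in length
exists = df["Field"].isin(df2["Field"])

# Look up each field's manager through a Series indexed by Field
df["Manager"] = df["Field"].map(df2.set_index("Field")["Manager"])
print(df)
```

Unlike the == comparison, isin does not require the two Series to have the same length or labels.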

Summing certain columns by similar part of its name

How can I sum columns based on an already fetched list of unique partial column names?
list = ['13-14', '15-16']
DataFrame:
         X.13-14     Y.13-14     Z.13-14     X.15-16 ...
id
182761  10274.00  6097173.00  5758902.00  3345841.00
I.e. I want to create '13-14' and '15-16' columns holding the corresponding sums of (X.13-14, Y.13-14, Z.13-14) and (X.15-16, Y.15-16, Z.15-16).
If you want to sum columns by the part of the column name after the ., use a lambda function in DataFrame.groupby with axis=1:
df1 = df.groupby(lambda x: x.split('.')[1], axis=1).sum()
print (df1)
             13-14      15-16
id
182761  11866349.0  3345841.0
Or, if you need only the columns from the list:
L = ['13-14', '15-16']
# label each column with the period it contains, then sum the columns per label
keys = df.columns.str.extract(f'({"|".join(L)})', expand=False)
df1 = df.T.groupby(keys).sum().T[L]
print (df1)
             13-14      15-16
id
182761  11866349.0  3345841.0
If you need to add the sums to the original:
df = df.join(df1)
print (df)
        X.13-14    Y.13-14    Z.13-14    X.15-16       13-14      15-16
id
182761  10274.0  6097173.0  5758902.0  3345841.0  11866349.0  3345841.0
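Newer pandas releases deprecate the axis=1 argument of DataFrame.groupby, so the same grouping can also be sketched by transposing, grouping the rows, and transposing back. The frame below is rebuilt from the single sample row in the question; the variable name sums is only illustrative.

```python
import pandas as pd

# Sample row from the question
df = pd.DataFrame(
    {
        "X.13-14": [10274.0],
        "Y.13-14": [6097173.0],
        "Z.13-14": [5758902.0],
        "X.15-16": [3345841.0],
    },
    index=pd.Index([182761], name="id"),
)

# After transposing, the column names become row labels, so the
# function passed to groupby receives each original column name.
sums = df.T.groupby(lambda name: name.split(".")[1]).sum().T
print(sums)

# Append the sums to the original columns
combined = df.join(sums)
```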

Using a tuple to map values between dataframes

If I need to map one value between two dataframes, and get the 'FD' value from the row where Round = 1 and ID is 262:
df1 = pd.DataFrame({'Round':1,'ID':262,'FD':30,
'Round':2,'ID':262,'FD':20}, index=[0])
df2 = pd.DataFrame({'Round':1, 'Opponent':262,
'Round':2, 'Opponent':262},index=[0])
I have tried to map with:
df2['P_GS_by_FD'] = df2['Opponent'].map(df1.set_index('ID')['FD'])
Expected output for df2:
Round  Opponent  P_GS_by_FD
    1       262          30
I would use drop_duplicates;
this would select the 'Round 1' rows:
df1.drop_duplicates('ID', keep='first')
df2['P_GS_by_FD'] = df2['Opponent'].map(df1.drop_duplicates('ID', keep='first').set_index('ID')['FD'])
(I think your example df1 and df2 would each contain only one row instead of two.)
Then we need to create the Round column in df2 as well:
df2['Round'] = df2.groupby('Opponent').cumcount()+1
yourdf = df2.merge(df1.rename(columns={'Id' : 'Opponent'}), on = ['Opponent','Round'], how = 'left')
Based on your update:
yourdf = df2.merge(df1.rename(columns={'ID' : 'Opponent'}), on = ['Opponent','Round'], how = 'left')
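To see the two-key merge on frames that really have two rows, here is a small sketch. It assumes the intended data is one row per round (built from lists rather than repeated dict keys), and the variable name merged is only illustrative.

```python
import pandas as pd

# Assumed intent of the example: one row per round
df1 = pd.DataFrame({"Round": [1, 2], "ID": [262, 262], "FD": [30, 20]})
df2 = pd.DataFrame({"Round": [1, 2], "Opponent": [262, 262]})

# Match on both the opponent id and the round, so each round gets its own FD
merged = df2.merge(
    df1.rename(columns={"ID": "Opponent", "FD": "P_GS_by_FD"}),
    on=["Opponent", "Round"],
    how="left",
)
print(merged)
```

Merging on both keys avoids the ambiguity of map, which can only look values up by a single unique index.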

Pyspark dataframe join based on key,group by and max

I have two parquet files, which I load with spark.read. These two dataframes share a column named key, so I join them with:
df = df.join(df2, on=['key'], how='inner')
df columns are: ["key","Duration","Distance"] and df2: ["key","department id"]. In the end I want to print Duration, max(Distance) and department id, grouped by department id. What I have done so far is:
df.join(df.groupBy('departmentid').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
but I think it is too slow. Is there a faster way to achieve my goal?
Thanks in advance.
EDIT: sample (first 2 lines of each file)
df:
369367789289,2015-03-27 18:29:39,2015-03-27 19:08:28,-73.975051879882813,40.760562896728516,-73.847900390625,40.732685089111328,34.8
369367789290,2015-03-27 18:29:40,2015-03-27 18:38:35,-73.988876342773438,40.77423095703125,-73.985160827636719,40.763439178466797,11.16
df2:
369367789289,1
369367789290,2
The columns are separated by ",". The first column in both files is my key, followed by timestamps, longitudes and latitudes. The second file has only the key and the department id.
To create Distance I am using a function called formater. This is how I get my distance and duration:
df = df.filter("_c3!=0 and _c4!=0 and _c5!=0 and _c6!=0")
df = df.withColumn("_c0", df["_c0"].cast(LongType()))
df = df.withColumn("_c1", df["_c1"].cast(TimestampType()))
df = df.withColumn("_c2", df["_c2"].cast(TimestampType()))
df = df.withColumn("_c3", df["_c3"].cast(DoubleType()))
df = df.withColumn("_c4", df["_c4"].cast(DoubleType()))
df = df.withColumn("_c5", df["_c5"].cast(DoubleType()))
df = df.withColumn("_c6", df["_c6"].cast(DoubleType()))
df = df.withColumn('Distance', formater(df._c3,df._c5,df._c4,df._c6))
df = df.withColumn('Duration', F.unix_timestamp(df._c2) -F.unix_timestamp(df._c1))
and then, as I showed above:
df = df.join(vendors, on=['key'], how='inner')
df.registerTempTable("taxi")
df.join(df.groupBy('vendor').agg(F.max('Distance').alias('Distance')),on='Distance',how='leftsemi').show()
The output must be
Distance Duration department id
grouped by department id, keeping only the row with max(Distance).

How to replace a string in a pandas multiindex?

I have a dataframe with a large multiindex, sourced from a vast number of csv files. Some of those files have errors in the various labels, i.e. "window" is misspelled as "winZZw", which then causes problems when I select all windows with df.xs('window', level='middle', axis=1).
So I need a way to simply replace winZZw with window.
Here's a very minimal sample df (let's assume the data and the 'roof', 'window'… strings come from some convoluted text reader):
import numpy as np
import pandas as pd
# from_product expects one iterable of labels per level
header = pd.MultiIndex.from_product([['roof'], ['window'], ['basement']], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)
Now I want to xs a new dataframe for all the houses that have a window at their middle level: windf = df.xs('window', level='middle', axis=1)
But this obviously misses the misspelled winZZw.
So, how do I replace winZZw with window?
The only way I found was to use set_levels, but if I understood that correctly, I need to feed it the whole level, i.e.
df.columns.set_levels([u'window',u'window', u'door'], level='middle',inplace=True)
but this has two issues:
1. I need to pass it the whole index, which is easy in this sample, but impossible/stupid for a thousand-column df with hundreds of labels.
2. It seems to need the list backwards (now the first entry in my df has door in the middle, instead of the window it had). That can probably be fixed, but it seems weird.
I can work around these issues by xsing a new df of only winZZws, then setting the levels with set_levels(df.shape[1]*[u'window'], level='middle') and concatenating it together again, but I'd like something more straightforward, analogous to str.replace('winZZw', 'window'), and I can't figure out how.
Use rename and specify the level:
header = pd.MultiIndex.from_product([['roof'],[ 'window'], ['basement']], names = ['top', 'middle', 'bottom'])
dates = pd.date_range('01/01/2000','01/12/2010', freq='MS')
data = np.random.randn(len(dates))
df = pd.DataFrame(data, index=dates, columns=header)
header2 = pd.MultiIndex.from_product([['roof'], ['winZZw'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 3*(np.random.randn(len(dates)))
df2 = pd.DataFrame(data, index=dates, columns=header2)
df = pd.concat([df, df2], axis=1)
header3 = pd.MultiIndex.from_product([['roof'], ['door'], ['basement']], names = ['top', 'middle', 'bottom'])
data = 2*(np.random.randn(len(dates)))
df3 = pd.DataFrame(data, index=dates, columns=header3)
df = pd.concat([df, df3], axis=1)
df = df.rename(columns={'winZZw':'window'}, level='middle')
print(df.head())
top             roof
middle        window                door
bottom      basement  basement  basement
2000-01-01 -0.131052 -1.189049  1.310137
2000-02-01 -0.200646  1.893930  2.124765
2000-03-01 -1.690123 -2.128965  1.639439
2000-04-01 -0.794418  0.605021 -2.810978
2000-05-01  1.528002 -0.286614  0.736445
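Since the question asks for something analogous to str.replace, it may help to note that rename also accepts a function as the mapper. The sketch below uses a tiny made-up frame with integer data, and the names fixed and windows are only illustrative. It applies a string replacement to every label on the 'middle' level and then checks that xs finds both window columns.

```python
import numpy as np
import pandas as pd

# Small illustrative frame: three columns, one with the misspelled label
cols = pd.MultiIndex.from_tuples(
    [
        ("roof", "window", "basement"),
        ("roof", "winZZw", "basement"),
        ("roof", "door", "basement"),
    ],
    names=["top", "middle", "bottom"],
)
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=cols)

# The function is applied to each label of the 'middle' level only
fixed = df.rename(
    columns=lambda label: label.replace("winZZw", "window"),
    level="middle",
)

# Both window columns are now selected
windows = fixed.xs("window", level="middle", axis=1)
print(windows)
```

A function mapper is convenient when the misspellings follow a pattern; for a handful of known typos the dict form shown above is simpler.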
