Multiindex filtering of grouped data - python

I have a pandas dataframe where I have done a groupby. The groupby results look like this:
As you can see, this dataframe has a multilevel index ('ga:dimension3', 'ga:date') and a single column ('ga:sessions').
I am looking to create a dataframe with the first level of the index ('ga:dimension3') and the first date for each first-level index value:
I can't figure out how to do this.
Guidance appreciated.
Thanks in advance.

Inspired by @ggaurav's suggestion to use first(), I think that the following should do the work (df is the data you provided, after the groupby):
result = df.reset_index(1).groupby('ga:dimension3').first()

You can directly use first(). Since you need data based on just 'ga:dimension3', you need to group by it (or by level=0):
df.groupby(level=0).first()
Without groupby, you can get the level-0 index values and drop the duplicated ones, keeping the first:
df[~df.index.get_level_values(0).duplicated(keep='first')]
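The question's actual data isn't reproduced here, so here is a minimal sketch with made-up 'ga:' values (index labels and numbers are assumptions) showing that both approaches give the same result:

```python
import pandas as pd

# Hypothetical stand-in for the grouped data in the question:
# a two-level index ('ga:dimension3', 'ga:date') and one column.
df = pd.DataFrame(
    {"ga:sessions": [10, 20, 30, 40]},
    index=pd.MultiIndex.from_tuples(
        [("A", "2021-01-01"), ("A", "2021-01-02"),
         ("B", "2021-01-05"), ("B", "2021-01-06")],
        names=["ga:dimension3", "ga:date"],
    ),
)

# Keep the first date (and its sessions) per level-0 value.
first_per_dim = df.reset_index(1).groupby("ga:dimension3").first()

# Equivalent: drop later rows that repeat a level-0 index value.
dedup = df[~df.index.get_level_values(0).duplicated(keep="first")]
```

Both `first_per_dim` and `dedup` keep one row per 'ga:dimension3' value, the one with the earliest date.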

Related

How can I transform a DataFrame so that the headers become column values?

I have Pandas DataFrame in this form:
How can I transform this into a new DataFrame with this form:
I am beginning to use Seaborn and Plotly for plotting, and it seems like they prefer data to be formatted in the second way.
Let's try set_index(), unstack(), reset_index() and rename(columns=...):
`df.set_index('Date').unstack().reset_index().rename(columns={'level_0':'Name',0:'Score'})`
How it works
df.set_index('Date')  # sets Date as the index
df.set_index('Date').unstack()  # flips/melts the dataframe into a Series
d = df.set_index('Date').unstack().reset_index()  # resets the index and allocates columns; the old index levels become level_* columns and the values become column 0
d.rename(columns={'level_0':'Name',0:'Score'})  # renames the columns
Use the melt function in pandas:
df.melt(id_vars="Date", value_vars=["Andy", "Barry", "Cathy"], var_name="Name", value_name="Score")
This should work:
df.stack().reset_index(level=1).rename(columns={'level_1': 'Name'})
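The question's table isn't shown here, so as a sketch, assume a wide frame with a Date column and one column per name (the names Andy, Barry and Cathy are taken from the melt answer; the values are made up):

```python
import pandas as pd

# Assumed shape of the original wide DataFrame.
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-02"],
    "Andy": [1, 4],
    "Barry": [2, 5],
    "Cathy": [3, 6],
})

# Wide -> long: one row per (Date, Name) pair, as Seaborn/Plotly prefer.
long_df = df.melt(
    id_vars="Date",
    value_vars=["Andy", "Barry", "Cathy"],
    var_name="Name",
    value_name="Score",
)
```

The result has columns Date, Name and Score, with one row per original cell.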

How to drop duplicated rows in data frame based on certain criteria?

Our objective right now is to drop the duplicate player rows, but keep the row with the highest count in the G column (Games played). What code can we use to achieve this? I've attached a link to the image of our Pandas output here.
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df.drop_duplicates(['Player'], keep='first')
There are two ways that I can think of:
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'], keep='last')
The first method groups values by Player and keeps the maximum of G for each group (note that it returns only the Player and G columns, not the full rows). The second uses pandas' drop_duplicates method to keep the entire row with the highest G.
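Since the question's table is only an image, here is a sketch with hypothetical player stats showing the sort-then-dedup approach from both answers:

```python
import pandas as pd

# Hypothetical player stats; the question's actual table is an image.
df = pd.DataFrame({
    "Player": ["Jordan", "Jordan", "Bird", "Bird"],
    "G": [82, 60, 77, 80],
})

# Approach 1: sort by G descending, keep the first occurrence per player.
best = df.sort_values("G", ascending=False).drop_duplicates(["Player"], keep="first")

# Approach 2: sort ascending, keep the last occurrence (same result).
best2 = df.sort_values("G").drop_duplicates(["Player"], keep="last")
```

Both keep one row per player, the one with the highest G (82 for Jordan, 80 for Bird in this made-up data).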
Try this. Assuming your dataframe object is df1:
series = df1.groupby('Player')['G'].max()  # this will return a Series
pd.DataFrame(series)
Let me know if this works for you.

Pandas: Use a dataframe to index another one and fill the gaps?

I have two dataframes. df_0 is a complete list of dates and df_1 is a generic register indexed by incomplete dates. I need to make a dataframe that has df_0's complete dates as an index, filled with df_1's register in the matching dates. For dates without a register entry, I just need to repeat the last available register data as a filler. Any ideas on how to do this?
Thanks in advance.
Use DataFrame.reindex with parameter method:
df = df_1.reindex(df_0.index, method="ffill")
Use reindex to expand df_1 to df_0's index, then forward-fill the missing values with ffill:
df_2 = df_1.reindex(df_0.index).ffill()
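The original df_0 and df_1 aren't shown, so here is a minimal sketch with an assumed daily calendar and a register with gaps, showing the reindex-and-forward-fill pattern from both answers:

```python
import pandas as pd

# df_0: complete daily index; df_1: register with gaps (shapes assumed).
full_idx = pd.date_range("2021-01-01", periods=5, freq="D")
df_0 = pd.DataFrame(index=full_idx)
df_1 = pd.DataFrame(
    {"register": [100, 200]},
    index=pd.to_datetime(["2021-01-01", "2021-01-04"]),
)

# Align df_1 to the complete calendar and forward-fill the gaps.
df_2 = df_1.reindex(df_0.index, method="ffill")
```

Dates without a register entry (Jan 2, 3 and 5 here) repeat the last available value.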

How can I set the index of a generated pandas Series to a column from a DataFrame?

In pandas this operation creates a Series:
q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1)
I would like to be able to set the index as a list of values from a df column, i.e.
list(df['Colname'])
I've tried to create the series then update it with the series generated from the first code snippet. I've also searched the docs and don't see a method that will allow me to do this. I would prefer not to manually iterate over it.
Help is appreciated.
You can simply store that series in a variable, say S, and set the index accordingly, as shown below:
S = (q7.loc[:, list(q7)].max(axis=1) - q7.loc[:, list(q7)].min(axis=1))
S.index = df['Colname']
The code assumes the lengths of the series and the column from the dataframe are equal. Hope this helps!
If you want to reset a series s's index, you can do:
s.index = new_index_list
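The real q7 and df aren't shown, so here is a small sketch with assumed data: a numeric frame standing in for q7, and a label column standing in for df['Colname']:

```python
import pandas as pd

# Hypothetical stand-ins for q7 and df in the question.
q7 = pd.DataFrame({"a": [1, 5], "b": [4, 9]})
df = pd.DataFrame({"Colname": ["row1", "row2"]})

# Row-wise range (max - min), then re-label with values from df.
s = q7.max(axis=1) - q7.min(axis=1)
s.index = df["Colname"]
```

As in the answer above, this only lines up correctly when the series and the column have the same length.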

aggregate multiple dataframe with overlapping timeseries

I have multiple dataframes with timeseries indexes in dfList (an example dataframe is shown below).
I tried to concatenate these dataframes into one dataframe with the following command:
db=pd.concat(dfList)
and I got the following dataframe.
The timeseries index contains duplicates (many entries are 2012-10-12 20:00:00), since the timeseries in the base dataframes overlap each other.
I want to remove these duplicates. Does anyone know how to do this?
Some example dataframes in which the timeseries indexes overlap are shown below.
Thank you!!
You can simply drop the duplicates by a particular column value as mentioned in the docs here. You may do something like this:
db = db.drop_duplicates(subset="Timestamp")
which will drop all rows with duplicates in the column "Timestamp" except the first occurrence. (Note that the older cols= spelling of this parameter has been replaced by subset= in current pandas.)
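The answer above assumes the timestamps live in a column; in the question they are the index, in which case index.duplicated is the equivalent tool. A sketch with made-up overlapping frames (names and values assumed):

```python
import pandas as pd

# Two hypothetical overlapping timeseries frames.
idx1 = pd.to_datetime(["2012-10-12 18:00", "2012-10-12 19:00", "2012-10-12 20:00"])
idx2 = pd.to_datetime(["2012-10-12 19:00", "2012-10-12 20:00", "2012-10-12 21:00"])
df_a = pd.DataFrame({"value": [1, 2, 3]}, index=idx1)
df_b = pd.DataFrame({"value": [2, 3, 4]}, index=idx2)

db = pd.concat([df_a, df_b])

# De-duplicate on the index itself, keeping the first occurrence.
db = db[~db.index.duplicated(keep="first")]
```

After de-duplication each timestamp appears exactly once, with the value from whichever frame came first in the concat.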
