Pandas DataFrame creates new row for index using groupby and sum - python

After using the groupby and sum operations as follows:
companyGrouped = dailyStocks.groupby(['SYMBOL'])
sumByCompany = companyGrouped.sum()
I end up with the groupby-and-sum key appearing as a new index row. This is undesirable, as I later want to merge this with another dataframe on ['SYMBOL']. An image of the table, obtained with sumByCompany.head(), is shown below.
I've tried a few things to get around this issue, but manually deleting that row and setting the index to 'SYMBOL' does not seem elegant! Thanks for any help!

Solved with df.reset_index(level=0, inplace=True)
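A minimal sketch of why this works, using made-up stand-in data: after groupby().sum(), SYMBOL lives in the index rather than in a column, which is what breaks a later merge on 'SYMBOL'; reset_index moves it back into a regular column.

import pandas as pd

# made-up stand-in for dailyStocks
dailyStocks = pd.DataFrame({
    'SYMBOL': ['AAPL', 'AAPL', 'MSFT'],
    'VOLUME': [100, 150, 200],
})

sumByCompany = dailyStocks.groupby(['SYMBOL']).sum()
print(sumByCompany.index.name)        # 'SYMBOL' - the key is the index, not a column

# move SYMBOL back into a regular column so merging on it works
sumByCompany.reset_index(level=0, inplace=True)
print(sumByCompany.columns.tolist())  # ['SYMBOL', 'VOLUME']

Alternatively, dailyStocks.groupby(['SYMBOL'], as_index=False).sum() avoids creating the index in the first place.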

Related

Creating a new column with values calculated by continuously adding a value from the previous row in Pandas [duplicate]

I have a certain feature in my data which looks like this:
I'm trying to introduce a cumulative sum of this column in the DataFrame as follows (the feature is int64):
df['Cumulative'] = df['feature'].cumsum()
But for some unknown reason the running total drops at one point, which is odd since the minimum value in the original column is 0:
Can someone explain why this happens and how I can fix it? I just want to sum the feature as it appears.
Thank you in advance.
As suggested in the comments, sort first and then build the cumulative sum.
Did you try it like this?
df = df.sort_values(by='Date') #where "Date" is the column name of the values on the x-axis
df['cumulative'] = df['feature'].cumsum()
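For illustration, a tiny made-up frame showing the effect: when the rows are not in date order, the running total plotted against Date appears to "drop" even though every value is non-negative; sorting first fixes it.

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2021-01-03', '2021-01-01', '2021-01-02']),
    'feature': [5, 2, 4],
})

df['cumulative'] = df['feature'].cumsum()   # 5, 7, 11 - in row order, not date order

df = df.sort_values(by='Date')
df['cumulative'] = df['feature'].cumsum()   # 2, 6, 11 - monotonically increasing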

Pandas: How do I create a list of duplicates from one column, and only keep the highest value in the corresponding columns?

I want to find all of the duplicates in the first column Primary Mod Site and only keep the highest value for all of the compounds (columns B-M) in the dataset.
(screenshot of the Excel sheet)
For code, I have:
import pandas as pd

#read desired excel file
df = pd.read_excel("20220825_CISLIB01_Plate-1_Rows-A-B")

#function to find the duplicates in the dataset, section them, and return them
#can be applied to any dataset with the same format as the original excel files
def getDuplicate():
    #collect every group of rows that shares a "Primary Mod Site" value
    return pd.concat(g for _, g in df.groupby("Primary Mod Site") if len(g) > 1)
I'm stuck on what to do next. Help much appreciated!
It helps if you post the data as code or text, so others can reproduce it.
But, IIUC, you need to group by the first column ('Primary Mod Site') and then take the max of the rest of the columns; this seems to do the trick:
df.groupby("Primary Mod Site").max()
Based on what I noticed in the screenshot (the first 3 rows, for example), the row with the highest values tends to have the highest value in all columns, so something like this might work:
df = df.sort_values("ONCV-1-1-1", ascending = False).drop_duplicates("Primary Mod Site", keep='first', ignore_index=True)
Or, if you're not sure that observation holds for all rows, this would probably work:
df = df.groupby("Primary Mod Site").max()
NB: please post a reproducible example that's easy to copy-paste, so we can test.
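To make that concrete, a sketch on made-up data (the second compound column name is hypothetical, standing in for columns B-M):

import pandas as pd

df = pd.DataFrame({
    'Primary Mod Site': ['K27', 'K27', 'K9'],
    'ONCV-1-1-1': [3.1, 7.4, 2.0],   # compound columns; values are made up
    'ONCV-1-1-2': [1.2, 5.5, 0.9],
})

# per-column maximum within each group of duplicate sites
result = df.groupby("Primary Mod Site").max()
# K27 keeps 7.4 and 5.5; K9 keeps its single row unchanged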

How to index a dataframe based on an applied function? - Pandas

I have a dataframe that I created from a master table in SQL. That new dataframe is then grouped by type as I want to find the outliers for each group in the master table.
The function finds the outliers, showing where in the grouped DataFrame the outliers occur. How do I see these outliers as part of the original dataframe? Not just VOLUME, but also location, SKU, group, etc.
dataframe: HOSIERY_df
Code:
##Sku Group Data Frames
grouped_skus = sku_volume.groupby('SKUGROUP')
HOSIERY_df = grouped_skus.get_group('HOSIERY')
hosiery_outliers = find_outliers_IQR(HOSIERY_df['VOLUME'])
hosiery_outliers
#.iloc[[hosiery_outliers]]
#hosiery_outliers
Picture to show code and output:
I know enough to see that I need to find the rows based on the location of the index, like VLOOKUP in Excel, but I need to do it in Python. I'm not sure how to pull only the 5th, 6th, 7th, ... 3888th, and 4482nd rows of HOSIERY_df.
You can provide a list of index numbers as integers to iloc, which it looks like you have tried based on your commented-out code. So you may want to make sure that find_outliers_IQR returns a list of ints so it works properly with iloc, or convert its output.
It looks like it currently returns a DataFrame. You can get the index of that frame as a list like this:
hosiery_outliers.index.tolist()
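As a sketch, assuming find_outliers_IQR preserves the original index labels of HOSIERY_df, you could then pull the full outlier rows (VOLUME plus location, SKU, group, etc.) with loc rather than iloc, since these are index labels, not positions:

outlier_idx = hosiery_outliers.index.tolist()
outlier_rows = HOSIERY_df.loc[outlier_idx]   # full rows from the original frame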

How to drop duplicated rows in data frame based on certain criteria?

Our objective right now is to drop the duplicate player rows but keep the row with the highest count in the G column (games played). What code can we use to achieve this? I've attached a link to an image of our Pandas output here.
You probably want to first sort the dataframe by column G.
df = df.sort_values(by='G', ascending=False)
You can then use drop_duplicates to drop all duplicates except for the first occurrence.
df.drop_duplicates(['Player'], keep='first')
There are two ways that I can think of:
df.groupby('Player', as_index=False)['G'].max()
and
df.sort_values('G').drop_duplicates(['Player'] , keep = 'last')
The first method uses groupby to group the rows by Player and collapses each group, keeping the maximum of G. The second uses Pandas' drop_duplicates method to achieve the same. Note that the groupby version returns only the Player and G columns, while the drop_duplicates version keeps every column of the winning row.
Try this.
Assuming your dataframe object is df1:
series = df1.groupby('Player')['G'].max()  # this will return a Series
pd.DataFrame(series)
Let me know whether this works for you.
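On a small made-up frame, the two main approaches look like this; note that only the sort-and-drop version keeps the non-G columns:

import pandas as pd

df = pd.DataFrame({
    'Player': ['Smith', 'Smith', 'Jones'],
    'Team': ['NYK', 'BOS', 'LAL'],   # made-up extra column
    'G': [70, 82, 65],
})

# keeps every column of the winning row
best = df.sort_values('G').drop_duplicates(['Player'], keep='last')

# keeps only Player and G
per_player_max = df.groupby('Player', as_index=False)['G'].max()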

Reference Matrix in Pandas similar to Excel

I am trying to create a reference matrix in Pandas that looks like the below image in excel. I decided upon the index and column values, by simply entering in the values for the dates myself. Then, I am able to reference each column and index value for every calculation in the matrix. The calculations below are just for display.
In Pandas, I have been using the Pivot table function to produce a similar table. However, the Pivot table only uses column values if they are present in the data. See the screenshot below for the issue. I have values for 2018-05 in the index, but it doesn't appear in the columns. As such, the data is incomplete.
Therefore the Pivot table functionality does not work for me. I need to be able to manually decide on the column headers and the index values, similar to the example above in Excel.
Any help would be greatly appreciated as I cannot figure this one out!
repayments[
    (repayments.top_repayment_delinquency_reason == 'misappropriation_of_funds')
    & (repayments.repaid_date < date.today() - pd.offsets.MonthBegin(1))
].pivot_table(values='amount_principal', index='top_repayment_due_month',
              columns='repaid_month', aggfunc=sum)
I found an answer in the end.
import pandas as pd
from datetime import date

dates_eom = pd.date_range('2018-5-31', date.today(), freq='M')
dates_eom = dates_eom.to_period('M')
df = pd.DataFrame(index=dates_eom, columns=dates_eom)
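As a possible follow-up (my own sketch, not part of the original answer): once the full index/column grid exists, a pivot_table result keyed by the same monthly periods can be aligned to it with reindex, so that months absent from the data (like 2018-05) show up as NaN instead of disappearing:

import pandas as pd
from datetime import date

dates_eom = pd.date_range('2018-5-31', date.today(), freq='M').to_period('M')

# hypothetical pivot_table output that is missing some months in its columns
pivoted = pd.DataFrame(
    [[100.0], [250.0]],
    index=pd.PeriodIndex(['2018-06', '2018-07'], freq='M'),
    columns=pd.PeriodIndex(['2018-06'], freq='M'),
)

# align to the full grid; missing months appear as NaN columns
pivoted = pivoted.reindex(index=dates_eom, columns=dates_eom)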
