Pandas Groupby First Value in Column - python

Is there a way to get the first or last value in a particular column of a group in a pandas dataframe after performing a groupby?
For example, I want to get the first value in column_z, but this does not work:
df.groupby(by=['A', 'B']).agg({'x':np.sum, 'y':np.max, 'datetime':'count', 'column_z':first()})
The point of getting the first and last value in the group is I would like to eventually get the difference between the two.
I know there is this function: http://pandas.pydata.org/pandas-docs/stable/groupby.html#taking-the-nth-row-of-each-group
But I don't know how to apply it to my use case: getting the first value in a particular column after grouping.
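For what it's worth, pandas accepts the strings 'first' and 'last' inside agg, so a minimal sketch of the grouped difference might look like this (the column names follow the question; the sample data is made up):
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'a', 'b', 'b'],
    'B': [1, 1, 2, 2],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
    'datetime': pd.date_range('2021-01-01', periods=4),
    'column_z': [10, 30, 100, 250],
})

# 'first' and 'last' are passed as strings, not called as functions
out = df.groupby(['A', 'B']).agg(
    x=('x', 'sum'),
    y=('y', 'max'),
    datetime=('datetime', 'count'),
    z_first=('column_z', 'first'),
    z_last=('column_z', 'last'),
)
out['z_diff'] = out['z_last'] - out['z_first']
print(out)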

Related

How to reshape dataframe with pandas?

I have a dataframe that contains product sales for each day from 2018 through 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). For the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once those gaps are filled, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python with pandas.
A sample of my data set looks like this:
I would like the dataframe to look like this:
Use fillna with method='ffill', which propagates the last valid observation forward to the next valid one. Then drop the rows that contain NAs.
# Forward-fill the gap columns, then drop rows that still contain NAs
df['Date'].fillna(method='ffill', inplace=True)
df['Place'].fillna(method='ffill', inplace=True)
df.dropna(inplace=True)
You are going to use the forward-filling method to replace null values with the value of the nearest non-null one above: df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill'). Next, drop the rows with missing values: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
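A minimal end-to-end sketch of the approach above, using made-up sample data since the original screenshots are not included:
import pandas as pd

# Hypothetical stand-in for the screenshot data
df = pd.DataFrame({
    'Date': ['2018-01-01', None, None, '2018-01-02'],
    'Place': ['Berlin', None, None, 'Paris'],
    'ProductCategory': ['Food', 'Toys', None, 'Food'],
    'Sales': [10, 20, 30, 40],
})

# Forward-fill Date and Place, then drop rows missing ProductCategory
df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill')
df.dropna(subset=['ProductCategory'], inplace=True)
print(df)
On newer pandas versions, df[['Date', 'Place']].ffill() does the same thing without the deprecated method= argument.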
Compute the frequency of categories in the column by plotting; from the plot you can see bars representing the most repeated values:
df['column'].value_counts().plot.bar()
Then get the most frequent value by index: index[0] gives the most repeated value, index[1] the second most repeated, and so on, so you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill the missing values with that value:
df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define it as a function, like this:
def impute_nan(df, column):
    # mode()[0] is the most frequent value in the column
    most_frequent_category = df[column].mode()[0]
    df[column].fillna(most_frequent_category, inplace=True)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)

How to expand a list in a pandas dataframe without repeating other column values

I was wondering how I would be able to expand out a list in a cell without repeating variables in other cells.
The goal is to expand the list without repeating the first column's values. I know how to expand the list out, but I would prefer not to have the first column values repeated if that is possible. Thank you for any help!
In order to get what you're asking for, you still have to use explode() to get what you need. You just have to take it a step further and change the values of the first column. Please note that this will destroy the association between the elements of the list and the letter of the row they were first in. You would be creating a third value for the column (an empty string) that would be repeated for every record not beginning with 1.
If you want to eliminate the value from the rows you are talking about but still want to have those records associated with the value that their list was associated with, you can't. It's not logically possible for a value to both be in a given cell but also not be in that cell. So, I will show you the steps for eliminating the original association.
For this example, I named the columns since they are not provided.
import pandas as pd

data = [
    ["a", ["1 hey", "2 hi", "3 hello"]],
    ["b", ["1 what", "2 how", "3 say"]],
]
df = pd.DataFrame(data, columns=["first", "second"])

# One output row per list element
df = df.explode("second")

# Keep the letter only on the first element of each list, blank it elsewhere
df['first'] = df.apply(lambda x: x['first'] if x['second'][0] == '1' else '', axis=1)
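For reference, with the sample above the result looks roughly like this (the letter survives only on each group's first row):
  first   second
0     a    1 hey
0            2 hi
0         3 hello
1     b   1 what
1          2 how
1          3 say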

Pandas: Find string in a column and replace them with numbers with incrementing values

I am working on a dataframe with multiple columns, and one of those columns contains string values in a large number of rows (approximately more than 1000). Kindly check the table below for more details:
In the above image I want to change the string values in the column Group_Number to numbers by picking the value from the first column (MasterGroup) and incrementing by one (01), so that the values look like below:
I also need to verify that, if a string is duplicated, it is replaced with the number already assigned instead of being given a new one. For example, in the above image ANAYSIM is duplicated, and instead of a new sequence number I want the already-given number repeated for the duplicate string.
I have checked different links, but they focus on values supplied by the user:
Pandas DataFrame: replace all values in a column, based on condition
Change one value based on another value in pandas
Conditional Replace Pandas
Any help with achieving the desired outcome is highly appreciated.
We could do cumcount with groupby:
s = (df.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
# to_numeric leaves NaN exactly where Group_number holds a string
t = pd.to_numeric(df.Group_number, errors='coerce')
Then we assign:
df.loc[t.isnull(), 'Group_number'] = df.MasterGroup.astype(str) + s
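A small worked sketch of the idea, with hypothetical data shaped like the question (column names from the screenshot; the values are invented):
import pandas as pd

df = pd.DataFrame({'MasterGroup': [1000, 1000, 2000, 2000],
                   'Group_number': ['ANAYSIM', 5, 'OTHER', 7]})

s = (df.groupby('MasterGroup').cumcount() + 1).mul(10).astype(str)
t = pd.to_numeric(df.Group_number, errors='coerce')  # NaN marks the string rows
df.loc[t.isnull(), 'Group_number'] = df.MasterGroup.astype(str) + s
print(df)
Note that this assigns each string row its own sequence number; mapping duplicated strings to the same number would need an extra grouping on the string values themselves.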

Sum count of unique value counts of all series in a Pandas dataframe

I am at my wit's end as I am writing this. This is probably an incredibly small issue, but I've not been able to get around it. Here's what is going on:
1. I have a dataframe df with 80 columns.
2. Performing value_counts().count() over df iteratively, I am able to print the column names and the number of unique values in each column.
3. Here's the problem: what I also want to do is sum up the count() of unique values across all the columns. Essentially I need just one number. So basically, if column1 had 10 uniques, column2 had 5 and column3 had 3, I am expecting the sum() to be 18.
About #2, here's what works (a simple for loop):
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print(evry_colm, "-", df[evry_colm].value_counts().count())
That works; it prints in the format column - unique values.
Now, alongside that, I'd like to print the sum of the unique value counts. Whatever I tried, it either prints the unique count of the last column (which is incidentally 2), or prints something random. I know it's something to do with the for loop, but I can't seem to figure out what.
I also know that in order to get what I want, which is essentially sum(df[evry_colm].value_counts().count()), I will need to convert df[evry_colm].value_counts().count() to a series, or even a dataframe, but I am stuck with that too!
Thanks in advance for your help.
You could use nunique, which returns a series across all your columns, which you can then sum:
df.nunique().sum()
My first instinct was to do it by series with a list comprehension
sum([df[col].nunique() for col in list(df)])
but this is slower and less Pandorable!
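If you do want the running total from inside the loop, accumulate it in a variable instead of overwriting it each iteration; a small sketch of the fixed loop:
def counting_unique_values_in_df(df):
    total = 0
    for evry_colm in df:
        n = df[evry_colm].value_counts().count()
        print(evry_colm, "-", n)
        total += n  # accumulate rather than overwrite
    print("sum of unique value counts:", total)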

Pandas groupby: divide last in group by first in group

I have a dataframe that I have grouped by multiple columns. Within each group, I would like to generate a value that takes the last entry of the group and divides it by the first entry. I would also like to show the number of entries and the last entry's value in the output.
See below for example data and the desired output. I know how to show the count of each group, as in the code below.
df_group=df.groupby(['ID','Item','End_Date','Type'])
df_output=df_group.size().reset_index(name='Group Count')
I am grouping by ID, Item, End_Date and Type.
So the first row of the example output dataframe I am seeking has a Final Value of 2 (the most recent value for the group), and a percent change equal to the last value of 2 divided by the first value of 3. Two more examples are shown as well.
Please let me know if you have any tips on how to go about this application to a groupby object. Thank you very much for your help.
Just assign using groupby tail and head:
df_group = df.groupby(['ID', 'Item', 'End_Date', 'Type'])
df_output = df_group.size().reset_index(name='Group Count')
# .values strips the index so the last and first rows of each group divide positionally
df_output['PCTChange'] = (df_group.value.tail(1).values / df_group.value.head(1).values) - 1
df_output['FinalValue'] = df_group.value.tail(1).values
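A quick check with invented data matching the question's description (assuming the numeric column is named value):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Item': ['A', 'A', 'A', 'B', 'B'],
    'End_Date': ['2021-01-31'] * 3 + ['2021-02-28'] * 2,
    'Type': ['X'] * 3 + ['Y'] * 2,
    'value': [3, 4, 2, 5, 10],
})

df_group = df.groupby(['ID', 'Item', 'End_Date', 'Type'])
df_output = df_group.size().reset_index(name='Group Count')
df_output['PCTChange'] = (df_group.value.tail(1).values / df_group.value.head(1).values) - 1
df_output['FinalValue'] = df_group.value.tail(1).values
print(df_output)  # first group: FinalValue 2, PCTChange 2/3 - 1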
