I am trying to filter the 3 highest values from a column in a dataframe, without any luck. Can you please help?
I have tried nlargest etc., but it only returns the top 3 rows, not all rows that match the 3 highest values.
Your question is not specific enough; it could mean several things. If you want to get all rows that match the 3 highest values, you can use:
out = df[df['colA'].isin(df['colA'].drop_duplicates().nlargest(3))]
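For example, a minimal sketch with toy data (colA is a placeholder column name):

import pandas as pd

df = pd.DataFrame({'colA': [5, 9, 9, 7, 3, 7, 1]})

# Keep every row whose value is among the 3 highest distinct values (9, 7, 5)
out = df[df['colA'].isin(df['colA'].drop_duplicates().nlargest(3))]
print(out)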
I need help getting two rows in the same dataframe merged/joined.
The first table is the df that I have right now
The second one is the one that I would like to have
I need to combine Jim and Bill. I don't want to overwrite values in either row; I just want to update NaN values in the row (Bill) with the values from the row (Jim), e.g. city.
There are about 20 columns that need updating, so I cannot just update the Bill/City cell.
Thanks
You can try

df.loc['Bill'] = df.loc['Bill'].fillna(df.loc['Jim'])

Note that the in-place variant, df.loc['Bill'].fillna(df.loc['Jim'], inplace=True), operates on a temporary slice and may not update the original dataframe, so prefer the assignment form above.
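For example, a minimal sketch with a hypothetical name index (city and age are made-up columns):

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'city': [np.nan, 'Boston'], 'age': [40, np.nan]},
    index=['Bill', 'Jim'],
)

# Only Bill's NaNs are filled from Jim's row; existing values stay put
df.loc['Bill'] = df.loc['Bill'].fillna(df.loc['Jim'])
print(df.loc['Bill'])  # city Boston, age 40.0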
I have a dataframe that contains product sales for each day from 2018 through 2021. The dataframe contains four columns (Date, Place, ProductCategory and Sales). In the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once the data is filled in, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python pandas.
The sample of my data set looked like this:
I would like the dataframe to look like this:
Use ffill, which propagates the last valid observation forward to the next valid one. Then drop the rows that contain NAs.
df['Date'] = df['Date'].ffill()
df['Place'] = df['Place'].ffill()
df.dropna(inplace=True)
You can also use the forward-filling method to replace null values with the nearest non-null value above them: df[['Date', 'Place']] = df[['Date', 'Place']].ffill(). Next, drop the rows with missing values: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
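Putting both steps together, a minimal sketch with made-up sample values:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Date': ['2018-01-01', np.nan, np.nan, '2018-01-02'],
    'Place': ['London', np.nan, np.nan, 'Paris'],
    'ProductCategory': ['Food', 'Drink', np.nan, 'Food'],
    'Sales': [10, 20, 30, 40],
})

# Forward-fill the identifying columns, then drop rows without a category
df[['Date', 'Place']] = df[['Date', 'Place']].ffill()
df = df.dropna(subset=['ProductCategory'])
print(df)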
Compute the frequency of categories in the column by plotting; from the plot you can see bars representing the most repeated values:

df['column'].value_counts().plot.bar()

Then get the most frequent value using the index: index[0] gives the most repeated value, index[1] gives the second most repeated, and you can choose as per your requirement.

most_frequent_attribute = df['column'].value_counts().index[0]

Then fill the missing values with that value:

df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define it as a function, like this:

def impute_nan(df, column):
    # Replace NaNs in the column with its most frequent value (the mode)
    most_frequent_category = df[column].mode()[0]
    df[column] = df[column].fillna(most_frequent_category)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)
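For instance, a minimal sketch using impute_nan from above, with made-up columns:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'column1': ['a', 'b', np.nan, 'a'],
    'column2': [np.nan, 'x', 'x', 'y'],
})

for feature in ['column1', 'column2']:
    impute_nan(df, feature)

print(df)  # NaNs replaced by the per-column modes 'a' and 'x'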
I have a dataframe like this:
I only want to keep the rows highlighted green, where a student's status changes from Fail to Pass in any 2 consecutive months. How can I drop the rows highlighted red with this condition? I can't figure out how to do this in pandas.
In the second step where i will have a dataframe like this:
I only want to count passes which are preceded by a fail in the previous month (as in red text). That is, get a unique count of passed students each month who failed in the previous month.
I am confused about how to approach this in pandas.
I guess you would need to work with shift to compare consecutive values:

# Compare each month's column against the previous month's column
df['FailToPass'] = ((df == 'Pass') & (df.shift(axis=1) == 'Fail')).any(axis=1)
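For the second step (counting passes preceded by a fail), a minimal sketch, assuming months are the columns and each cell is 'Pass' or 'Fail' (all names are illustrative):

import pandas as pd

df = pd.DataFrame(
    {'Jan': ['Fail', 'Pass', 'Fail'],
     'Feb': ['Pass', 'Pass', 'Fail'],
     'Mar': ['Fail', 'Fail', 'Pass']},
    index=['s1', 's2', 's3'],
)

# True where a student passes in a month right after failing
fail_to_pass = (df == 'Pass') & (df.shift(axis=1) == 'Fail')

# Number of such students per month
print(fail_to_pass.sum())  # Jan 0, Feb 1, Mar 1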
I have a column which contains lists of variable sizes. The lists contain a limited number of short text values, around 60 unique values altogether.
0 ["AC","BB"]
1 ["AD","CB", "FF"]
2 ["AA","CC"]
3 ["CA","BB"]
4 ["AA"]
I want to make these values columns in my dataframe, where a column's value is 1 if the value appears in that row's list and 0 if not.
I know I could expand the lists and then call unique and set those as new columns. But after that I don't know what to do.
Here's one way:
df = pd.get_dummies(df.explode('val')).groupby(level=0).sum()

NOTE: Grouping on level=0 uses the row index for grouping, which collapses the exploded rows back into one row per original list; that is why it is applied after exploding the dataframe. (Older pandas allowed .sum(level=0) as a shorthand, but it has since been removed.)
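For example, with the sample lists from the question:

import pandas as pd

df = pd.DataFrame({'val': [['AC', 'BB'], ['AD', 'CB', 'FF'],
                           ['AA', 'CC'], ['CA', 'BB'], ['AA']]})

# One row per list element, one-hot encode, then collapse back per row;
# the resulting columns are named val_AA, val_AC, ... by default
out = pd.get_dummies(df.explode('val')).groupby(level=0).sum()
print(out)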
I have a dataframe that I have grouped by multiple columns. Within each group, I would like to generate a value that divides the last entry of the group by the first entry. I would also like to show the number of entries and the last entry's value in the output.
See below for example data and the desired output. I know how to show the count of each group, shown below in the code.
df_group=df.groupby(['ID','Item','End_Date','Type'])
df_output=df_group.size().reset_index(name='Group Count')
I am grouping by ID, Item, End_Date and Type.
So the first row in the example output dataframe that I am seeking has the Final Value of 2 (the most recent value for the group), and a percent change of the last value of 2 divided by the first value of 3. Two more examples are shown as well.
Please let me know if you have any tips on how to apply this to a groupby object. Thank you very much for your help.
Just do assign with groupby tail and head:

df_group = df.groupby(['ID', 'Item', 'End_Date', 'Type'])
df_output = df_group.size().reset_index(name='Group Count')
# Take .values on each side so the first- and last-row Series don't
# try to align on their (different) original indices
df_output['PctChange'] = (df_group['value'].tail(1).values / df_group['value'].head(1).values) - 1
df_output['FinalValue'] = df_group['value'].tail(1).values
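A minimal runnable sketch with toy data (the column name value is assumed; rows must be sorted by the group keys for the .values alignment to hold):

import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Item': ['A', 'A', 'A', 'B', 'B'],
    'End_Date': ['2021-01', '2021-01', '2021-01', '2021-02', '2021-02'],
    'Type': ['X', 'X', 'X', 'Y', 'Y'],
    'value': [3, 5, 2, 4, 6],
})

df_group = df.groupby(['ID', 'Item', 'End_Date', 'Type'])
df_output = df_group.size().reset_index(name='Group Count')
df_output['PctChange'] = (df_group['value'].tail(1).values
                          / df_group['value'].head(1).values) - 1
df_output['FinalValue'] = df_group['value'].tail(1).values
print(df_output)  # PctChange: -0.333 for the first group, 0.5 for the second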