Grouping unique values with low value counts - python

My DataFrame contains over 40 unique values for a particular attribute. I want to do some visualisation of this data, but fitting in all 40 points is challenging. Using wine['country'].value_counts(), I can see the frequency of each unique value.
When I go to create, for example, a bar chart, I would like any unique values with value counts less than 100 to be grouped together into their own bar in the visualisation (called, say, 'rest' or 'other').
Any way of doing this?

Initialise a variable x = 0 and iterate through wine['country'].value_counts() with a for loop. On each iteration, check whether that value count is less than 100; if it is, add it to x. This way you accumulate the total count of all values that occur fewer than 100 times.
Now, before charting, create a new dataframe of country vs value_counts() containing only the rows whose count is 100 or more. Then manually append a row named 'other' whose count is x, and use this new dataframe for charting.
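In pandas this can also be done without an explicit loop. A minimal sketch, assuming the wine dataframe from the question:
import pandas as pd

counts = wine['country'].value_counts()
major = counts[counts >= 100]           # countries appearing at least 100 times
other = counts[counts < 100].sum()      # combined count of all the rare countries

plot_counts = pd.concat([major, pd.Series({'other': other})])
plot_counts.plot.bar()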

Related

How to reshape dataframe with pandas?

I have a data frame that contains product sales for each day from 2018 to 2021. The dataframe has four columns (Date, Place, ProductCategory and Sales). In the first two columns (Date, Place) I want to use the available data to fill in the gaps. Once those gaps are filled, I would like to delete the rows that have no data in ProductCategory. I would like to do this in Python with pandas.
A sample of my data set looks like this:
I would like the dataframe to look like this:
Use fillna with method='ffill', which propagates the last valid observation forward to the next valid one. Then drop the rows that still contain NaNs.
df['Date'].fillna(method='ffill', inplace=True)   # fill missing dates from the row above
df['Place'].fillna(method='ffill', inplace=True)  # fill missing places from the row above
df.dropna(inplace=True)                           # drop rows still missing ProductCategory
You are going to use the forward-filling method to replace null values with the nearest value above: df[['Date', 'Place']] = df[['Date', 'Place']].fillna(method='ffill'). Next, drop the rows with missing values: df.dropna(subset=['ProductCategory'], inplace=True). Congrats, now you have your desired df 😄
Documentation: Pandas fillna function, Pandas dropna function
Compute the frequency of the categories in the column by plotting; from the plot you can see bars representing the most repeated values:
df['column'].value_counts().plot.bar()
Then get the most frequent value using the index: index[0] gives the most repeated value, index[1] the second most repeated, and so on, so you can choose as per your requirement.
most_frequent_attribute = df['column'].value_counts().index[0]
Then fill the missing values with that value:
df['column'].fillna(most_frequent_attribute, inplace=True)
To fill multiple columns with the same method, just define this as a function, like this:
def impute_nan(df, column):
    most_frequent_category = df[column].mode()[0]  # most common value in the column
    df[column].fillna(most_frequent_category, inplace=True)

for feature in ['column1', 'column2']:
    impute_nan(df, feature)

Sorting multiple Pandas Dataframe Columns based on the sorting of one column

I have a dataframe with two columns in it, 'Item' and 'Calories'. I have sorted the 'Calories' column numerically using a selection sort algorithm, but I need the 'Item' column to change so the calorie value matches the correct item.
import pandas as pd

menu = pd.read_csv("menu.csv", encoding="utf-8")  # Read in the csv file
menu_df = pd.DataFrame(menu, columns=['Item', 'Calories'])  # Keep just the Item and Calories columns
print(menu_df)  # Display unsorted data
# print(menu_df.at[4, 'Calories'])  # Practise calling on individual elements within the dataframe
# Start of selection sort
for outerloopindex in range(len(menu_df)):
    smallest_value_index = outerloopindex
    for innerloopindex in range(outerloopindex + 1, len(menu_df)):
        if menu_df.at[smallest_value_index, 'Calories'] > menu_df.at[innerloopindex, 'Calories']:
            smallest_value_index = innerloopindex
    # Changing the order of the Calorie column.
    menu_df.at[outerloopindex, 'Calories'], menu_df.at[smallest_value_index, 'Calories'] = menu_df.at[smallest_value_index, 'Calories'], menu_df.at[outerloopindex, 'Calories']
# End of selection sort
print(menu_df)
Any help on how to get the 'Item' column to match the corresponding 'Calorie' values after the sort would be really really appreciated.
Thanks
Martin
You can replace df.at[...] with df.loc[...] and reference multiple columns instead of a single one.
So replace line:
menu_df.at[outerloopindex, 'Calories'], menu_df.at[smallest_value_index, 'Calories'] = menu_df.at[smallest_value_index, 'Calories'], menu_df.at[outerloopindex, 'Calories']
With line:
menu_df.loc[outerloopindex, ['Calories', 'Item']], menu_df.loc[smallest_value_index, ['Calories', 'Item']] = menu_df.loc[smallest_value_index, ['Calories', 'Item']], menu_df.loc[outerloopindex, ['Calories', 'Item']]
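As an aside, if the selection sort itself isn't a requirement, pandas' built-in sort keeps the columns aligned for you. A one-line sketch, assuming the same menu_df (ignore_index needs pandas 1.0+):
menu_df = menu_df.sort_values('Calories', ignore_index=True)  # sorts Item along with Calories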

Select a single minimum value of a Pandas dataframe column instead of multiple

I want to get the minimum value per year from a dataframe (df_greater_TDS) column ('DTS38').
So I grouped by the year column and applied transform(min). However, as there are multiple rows holding the minimum value, the comparison returns multiple rows.
How do I get only one value, i.e. here a single row per year?
idx = df_greater_TDS.groupby('year')['DTS38'].transform(min) == df_greater_TDS['DTS38']  # True for every row equal to its year's minimum
df_TDS = df_greater_TDS[idx]
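To collapse ties to a single row per year, one option is idxmin, which returns the index label of the first occurrence of each group's minimum. A minimal sketch, assuming the same column names:
idx = df_greater_TDS.groupby('year')['DTS38'].idxmin()  # first index at the yearly minimum
df_TDS = df_greater_TDS.loc[idx]                        # exactly one row per year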

Create variables with conditional logic from dataframe

I have a dataframe with a column called 'success' (amongst others). In this column we have only 0 and 1 values. Now I want to count how many times each value occurs.
I tried this command: sdf.groupby('success').sum(), but it only gives me a table with the unique counts in one view.
Since I need to do math on the individual frequencies of 0 and 1, I need them in two separate variables. Example:
col1=6100
col2=5878
c=col1/(col1+col2)
How to do this?
You can use value_counts to count how many times each value in a column occurs. Then you could turn the resulting series into a dataframe, and transpose it to get the values as column headers.
counts = pd.DataFrame(sdf['success'].value_counts()).transpose()
Let me know if this works for you.
To do your calculation, you can then apply a lambda function to the resulting dataframe (which I named counts). row[0] will access your count of 0s in success, since the previous code resulted in a column called 0.
counts['result'] = counts.apply(lambda row: row[0]/(row[0] + row[1]), axis=1)
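If you'd rather have the two frequencies as plain variables, as in your example, a simpler sketch (assuming the same sdf) skips the transpose entirely:
counts = sdf['success'].value_counts()
ones = counts.get(1, 0)    # frequency of 1s (0 if absent)
zeros = counts.get(0, 0)   # frequency of 0s (0 if absent)
c = ones / (ones + zeros)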

Summing a dataframe and keeping row labels

I am just wondering if it's possible to sum a dataframe, showing a total value at the end of each column while keeping the label string description in the first (label) column (like you would in Excel)?
I am using Python 2.7
Summing a column is as easy as Dataframe_Name['COLUMN_NAME'].sum(); you can review it in the documentation here.
You can also do Dataframe_Name.sum() and it will return the sums for each column.
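To mimic the Excel-style total row while keeping a text description in the label column, here is a hedged sketch; the label column name 'Description' is a placeholder for whatever your dataframe actually uses:
import pandas as pd

totals = Dataframe_Name.sum(numeric_only=True)   # column totals for the numeric columns only
totals['Description'] = 'Total'                  # keep a string label in the label column
Dataframe_Name = pd.concat([Dataframe_Name, totals.to_frame().T], ignore_index=True)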
