I would like to take a categorical column, group by each individual type, and then sum within each type.
I am using the following Python code, and its result is what I want:
data2 = data.groupby(['service_type']).sum().unstack()
popular_ser2 = data2.sort_values(ascending = False).head(10).droplevel(0)
popular_ser2
I would like to confirm whether my code is logical, since the need for unstack and droplevel is uncommon when using groupby and sort_values.
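For what it's worth, a more direct route to the same top-10 result, assuming the frame has a single numeric column of interest ('count' below is a hypothetical stand-in), is to select that column before aggregating, which avoids the unstack/droplevel round trip:
# Minimal sketch; 'count' is a hypothetical stand-in for whatever
# numeric column data actually contains.
popular_ser2 = (
    data.groupby('service_type')['count']
        .sum()
        .sort_values(ascending=False)
        .head(10)
)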
I have a given dataframe
new_df:

ID  summary  text_len
1   xxx      45
2   aaa      34
I am performing some dataframe manipulation by adding keyword columns taken from a different dataframe, like this:
keywords = df["keyword"].to_list()
for key in keywords:
    new_df[key] = new_df["summary"].str.lower().str.count(key)
new_df
From here I need two separate dataframes to perform a few actions on (add some columns to each, do some calculations, etc.): a dataframe with occurrence counts, as produced by the code above, and a binary dataframe.
WHAT I DID:
assign dataframe for occurrences:
df_freq = new_df (because it is already calculated and done)
Then I created another dataframe, the binary one, on top of new_df:
#select only numeric columns to change them to binary
numeric_cols = new_df.select_dtypes("number", exclude='float64').columns.tolist()
new_df_binary = new_df
new_df_binary['text_len'] = new_df_binary['text_len'].astype(int)
new_df_binary[numeric_cols] = (new_df_binary[numeric_cols] > 0).astype(int)
Everything works fine and I perform the math I need, but when I come back to df_freq, it is no longer the dataframe with occurrence counts; it looks like it changed along with the binary one. I need separate tables so I can perform separate math on them. Do you know how I can avoid this overwriting issue?
You may use pandas' copy method with the deep argument set to True:
df_freq = new_df.copy(deep=True)
Setting deep=True (which is the default) ensures that modifications to the data or indices of the copy do not impact the original dataframe.
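A minimal sketch of the difference between plain assignment and copy():
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

alias = df                 # plain assignment: both names point to the same object
independent = df.copy()    # deep copy: independent data and indices

alias["a"] = 0
print(df["a"].tolist())           # [0, 0, 0] - df changed through the alias
print(independent["a"].tolist())  # [1, 2, 3] - the copy is unaffected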
I've attempted to search the forum for this question, but I believe I may not be asking it correctly, so here it goes.
I have a large data set with many columns. Originally, I needed to sum, for each row, all columns whose names match a pattern, within multiple groups. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my two factor variables (group & id), with all the columns and the final sum column I need. However, now I want to return the final sum column back into the original dataframe. The above code returns the entire modified dataframe into my sum column. I know this is achievable in R by simply adding .$sum at the end of a piped expression. Any ideas on how to get this in pandas?
My hoped-for output is just the addition of the final "sum" variable from the above lines of code into my original dataframe.
Edit: To clarify, the code above returns the entire grouped dataframe; all I want returned is the final "sum" column.
Is this what you need?
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
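A minimal runnable sketch of why transform works here, using hypothetical column names that match the question's pattern:
import pandas as pd

# Hypothetical data mimicking the question's layout
data = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "group": ["a", "a", "a", "b"],
    "x_name": [1, 2, 3, 4],
    "y_name": [10, 20, 30, 40],
})
cols = data.filter(regex=r'_name$').columns

# transform('sum') returns a frame aligned with the original rows,
# so the row-wise sum can be assigned straight back to data
data['sum'] = data.groupby(['id', 'group'])[cols].transform('sum').sum(axis=1)
print(data)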
I have a dataframe, call it current_data. This dataframe is generated by running statistical functions over another dataframe, current_data_raw. It has a compound index on the columns "Name" and "Request.Method":
current_data = current_data_raw.groupby(['Name', 'Request.Method']).size().reset_index().set_index(['Name', 'Request.Method'])
I then run a bunch of statistical functions over current_data_raw, adding new columns to current_data.
I then need to query that dataframe for specific values of columns. I would love to do something like:
val = df['Request.Name' == some_name, 'Method' = some_method]['Average']
However, this isn't working, nor are the variants I have attempted. .xs returns a Series; I could grab the only row in the Series, but that doesn't seem proper.
If you want to select from a MultiIndex, you can use a tuple in the order of the levels; note that here the index level is named 'Name', not 'Request.Name':
val = df.loc[(some_name, some_method), 'Average']
Another way is to use DataFrame.query, but if level names contain spaces or dots, backticks are necessary:
val = df.query("`Request.Name`=='some_name' & `Request.Method`=='some_method'")['Average']
With one-word level names:
val = df.query("Name=='some_name' & Method=='some_method'")['Average']
I just wanted to know the difference between the functions performed by these two.
Data:
import pandas as pd
df = pd.DataFrame({"ID":["A","B","A","C","A","A","C","B"], "value":[1,2,4,3,6,7,3,4]})
as_index=False :
df_group1 = df.groupby("ID", as_index=False).sum()
reset_index() :
df_group2 = df.groupby("ID").sum().reset_index()
Both of them give the exact same output.
  ID  value
0  A     18
1  B      6
2  C      6
Can anyone tell me the difference, with an example illustrating it?
When you use as_index=False, you indicate to groupby() that you don't want to set the column ID as the index (duh!). When both implementations yield the same results, use as_index=False, because it will save you some typing and an unnecessary pandas operation ;)
However, sometimes, you want to apply more complicated operations on your groups. In those occasions, you might find out that one is more suited than the other.
Example 1: You want to sum the values of three variables (i.e. columns) in a group on both axes.
Using as_index=True allows you to apply a sum over axis=1 without specifying the names of the columns, then summing the values over axis 0. When the operation is finished, you can use reset_index(drop=True/False) to get the dataframe in the right form.
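A sketch of example 1, assuming three hypothetical value columns v1..v3:
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "B", "A", "B"],
    "v1": [1, 2, 3, 4],
    "v2": [5, 6, 7, 8],
    "v3": [9, 10, 11, 12],
})

# as_index=True (the default) moves 'ID' into the index, so sum(axis=1)
# only touches the value columns and never mixes in the group key
totals = df.groupby("ID").sum().sum(axis=1)
print(totals.reset_index(name="total"))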
Example 2: You need to set a value for the group based on the columns in the groupby().
Setting as_index=False allows you to check the condition on a common column and not on an index, which is often way easier.
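A sketch of example 2, reusing ID/value data like the question's:
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A", "C"], "value": [1, 2, 4, 3]})

# as_index=False keeps 'ID' as a regular column, so the condition below
# is a plain column comparison instead of an index lookup
agg = df.groupby("ID", as_index=False)["value"].sum()
agg.loc[agg["ID"] == "A", "value"] = 0
print(agg)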
At some point, you might come across a KeyError when applying operations on groups. In that case, it is often because you are trying to use a column in your aggregate function that is currently an index of your GroupBy object.
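For instance, a sketch of the KeyError case:
import pandas as pd

df = pd.DataFrame({"ID": ["A", "B", "A"], "value": [1, 2, 4]})

res = df.groupby("ID").sum()  # as_index=True: 'ID' is now the index
# res["ID"] would raise KeyError here, since 'ID' is an index level,
# not a column; reset_index() (or as_index=False) restores it
res = res.reset_index()
print(res["ID"])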
I have a pandas dataframe with a 'Cluster' column and several numeric columns (including 'page').
How can I calculate the mean (min/max, median) for a specific column if Cluster == 1 or Cluster == 2?
Thanks!
You can create a new df with only the relevant rows, using:
newdf = df[df['Cluster'].isin([1, 2])]
newdf.mean(axis=0)
In order to calc the mean of a specific column you can:
newdf["page"].mean()
If you meant take the mean only where Cluster is 1 or 2, then the other answers here address your issue. If you meant take a separate mean for each value of Cluster, you can use pandas' aggregation functions, including groupby and agg:
df.groupby("Cluster").mean()
is the simplest and will take means of all columns, grouped by Cluster.
df.groupby("Cluster").agg({"duration" : np.mean})
is an example where you are taking the mean of just one specific column, grouped by cluster. You can also use np.min, np.max, np.median, etc.
The groupby method produces a GroupBy object, which is something like but not like a DataFrame. Think of it as the DataFrame grouped, waiting for aggregation to be applied to it. The GroupBy object has simple built-in aggregation functions that apply to all columns (the mean() in the first example), and also a more general aggregation function (the agg() in the second example) that you can use to apply specific functions in a variety of ways. One way of using it is passing a dict of column names keyed to functions, so specific functions can be applied to specific columns.
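A self-contained sketch, assuming a 'duration' column; string aggregation names such as "mean" are interchangeable with the NumPy functions used above:
import pandas as pd

df = pd.DataFrame({
    "Cluster": [1, 1, 2, 2, 3],
    "duration": [10.0, 20.0, 30.0, 40.0, 50.0],
})

print(df.groupby("Cluster").mean())  # built-in aggregation over all columns
print(df.groupby("Cluster").agg({"duration": ["mean", "min", "max", "median"]}))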
You can do it in one line, using boolean indexing. For example you can do something like:
import numpy as np
import pandas as pd
# This will just produce an example DataFrame
df = pd.DataFrame({'a': np.arange(30), 'Cluster': np.ones(30, dtype=int)})
df.loc[10:19, "Cluster"] *= 2
df.loc[20:, "Cluster"] *= 3
# This line is all you need
df.loc[(df['Cluster']==1)|(df['Cluster']==2), 'a'].mean()
The boolean indexing array is True for the correct clusters; 'a' is just the name of the column to compute the mean over.
Simple intuitive answer
First pick the rows of interest, then average, then pick the columns of interest.
clusters_of_interest = [1, 2]
columns_of_interest = ['page']
# rows of interest
newdf = df[df.CLUSTER.isin(clusters_of_interest)]
# average and pick columns of interest
newdf.mean(axis=0)[columns_of_interest]
More advanced
# Create groups object according to the value in the 'cluster' column
grp = df.groupby('CLUSTER')
# apply functions of interest to all cluster groupings
data_agg = grp.agg(['mean', 'max', 'min'])
The pandas user guide on groupby aggregation describes these techniques in more detail. It should be noted that the "simple answer" averages over clusters 1 AND 2 combined (or whatever is specified in clusters_of_interest), while the .agg function averages over each group of rows having the same CLUSTER value.