How to make a sum by using group by? - python

I have the following dataset and I want to sum the values of the column UnitPrice grouping by CustomerID.
I tried the following, but although the new column is added, its values are not filled in:
data['TotalEN'] = round(data.groupby(['SalesOrderID'])['UnitPrice'].sum(),2)
I printed the groupby result to check whether it calculates the values correctly, and indeed it does:
print(data.groupby(['CustomerID'])['UnitPrice'].sum())
What am I doing wrong?

In this case, the shape of the output from the groupby operation will be different than the shape of your dataframe. You will need to use the transform method on the groupby object to restore the correct shape you need:
data['TotalEN'] = data.groupby(['SalesOrderID'])['UnitPrice'].transform('sum').round(2)
You can read more about transform in the pandas documentation.
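A minimal sketch of the difference, using made-up data (the values below are hypothetical; only the column names come from the question). Assigning a Series to a column aligns on index labels, so a result indexed by SalesOrderID generally does not line up with the dataframe's row index, whereas transform returns one value per original row:

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data = pd.DataFrame({
    "SalesOrderID": [1, 1, 2, 2, 2],
    "UnitPrice": [10.0, 5.5, 3.0, 4.0, 1.25],
})

# sum() collapses the data to one row per group, indexed by SalesOrderID.
per_group = data.groupby("SalesOrderID")["UnitPrice"].sum()

# transform("sum") broadcasts each group's total back to every row of that group,
# so the result has the same index as the original dataframe.
data["TotalEN"] = data.groupby("SalesOrderID")["UnitPrice"].transform("sum").round(2)

print(per_group)
print(data)
```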

Related

Creating a new column with values calculated by continuously adding a value from the previous row in Pandas [duplicate]

I have a certain feature in my data which looks like this:
I'm trying to add a cumulative sum of this column to the DataFrame as follows (the feature is of type int64):
df['Cumulative'] = df['feature'].cumsum()
But for some unknown reason the cumulative sum drops at one point, which is weird since the minimum value in the original column is 0:
Can someone explain why this happens and how I can fix it? I just want to sum the feature as it appears.
Thank you in advance.
As suggested in the comments, sort first and then build the cumulative sum.
Did you try it like this:
df = df.sort_values(by='Date') #where "Date" is the column name of the values on the x-axis
df['cumulative'] = df['feature'].cumsum()
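A small sketch of why sorting matters, assuming the apparent drop comes from rows not being stored in x-axis order (the data and the "Date" column here are hypothetical). cumsum() accumulates in row order, so plotting it against an unsorted column can look like a decrease:

```python
import pandas as pd

# Hypothetical data whose rows are not in chronological order.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-03", "2021-01-01", "2021-01-02"]),
    "feature": [5, 1, 2],
})

# Cumulative sum in storage order: not monotonic when read in date order.
unsorted_cumsum = df["feature"].cumsum()

# Sort by the x-axis column first, then accumulate.
df = df.sort_values(by="Date")
df["cumulative"] = df["feature"].cumsum()

print(df)
```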

I have a dataframe containing arrays; is there a way to collect all of the elements and store them in a separate dataframe?

I can't seem to find a way to split all of the array values in a column of a dataframe.
I have managed to get all the array values using this code:
The dataframe is as follows:
I want to use value_counts() on the dataframe and I get this
I want the array values that are clubbed together to be split so that I can get the accurate count of every value.
Thanks in advance!
You could try .explode(), which would create a new row for every value in each list.
df_mentioned_id_exploded = pd.DataFrame(df_mentioned_id.explode('entities.user_mentions'))
With the above code you would create a new dataframe df_mentioned_id_exploded with a single column entities.user_mentions, which you could then use .value_counts() on.
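As an illustration, here is a small sketch with made-up mention IDs (the data is hypothetical; the column name is taken from the answer). Each list element becomes its own row, so value_counts() then counts individual values rather than whole lists:

```python
import pandas as pd

# Hypothetical data: each row holds a list of mentioned user IDs.
df_mentioned_id = pd.DataFrame({
    "entities.user_mentions": [[11, 22], [22], [11, 22, 33]],
})

# explode() creates one row per list element (the original index is repeated).
df_mentioned_id_exploded = df_mentioned_id.explode("entities.user_mentions")

# Count how often each individual ID occurs.
counts = df_mentioned_id_exploded["entities.user_mentions"].value_counts()

print(counts)
```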

How to make matrix taking specific columns from Dataframe pandas?

I have my data set, https://github.com/mayuripandey/Data-Analysis/blob/main/similarity.csv. Is there any way I can make a matrix from two specific columns? For example, Count and Topic?
Simply subset the columns of interest, and retrieve the values without the column names using the ".values" attribute.
df = pd.read_html("https://github.com/mayuripandey/Data-Analysis/blob/main/similarity.csv")[0]
df[["Count","Topic"]].values
This returns a 2D numpy array containing only the values. If you need to, you can convert it into a matrix object like this:
np.matrix(df[["Count","Topic"]].values)
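A self-contained sketch of the same idea on a small stand-in dataframe (the values and the extra "Name" column are hypothetical, since the real CSV is not reproduced here). It uses to_numpy(), which the pandas documentation recommends over .values; a plain 2D array is usually sufficient, as NumPy's own documentation no longer recommends np.matrix:

```python
import pandas as pd

# Hypothetical stand-in for the CSV; only "Count" and "Topic" come from the question.
df = pd.DataFrame({
    "Topic": [0, 1, 2],
    "Count": [120, 45, 30],
    "Name": ["a", "b", "c"],
})

# Select the two columns of interest and convert them to a 2D NumPy array.
arr = df[["Count", "Topic"]].to_numpy()

print(arr.shape)
print(arr)
```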

Python (pandas) loop through values in a column, do a calc with each value

I have a data set of dB values in a dataframe and want to do a calc for each row in a specific column. I've tried this:
for i in dataAnti['antilog']:
    x = 10**(i/10)
It gives me the correct value but only loops once. How do I save these new values in a new column or save over the values in the antilog column?
You need to define the new column and simply write the calculation you want as an expression on the whole column.
dataAnti['new_column'] = 10**(dataAnti['antilog']/10)
This will automatically take the value of each row, perform the calculation, and assign the resulting value to the same row in new_column.
You can make use of the apply method.
dataAnti['result']=dataAnti['antilog'].apply(lambda i: 10**(i/10))
You can pass any function to apply() that takes a single value as input; it is called on each element of the column, and the results form the new column.
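A short sketch comparing the two approaches on made-up dB values (the numbers and the new column names are hypothetical). Note that the loop in the question does visit every row; it only appears to run once because x is overwritten on each iteration, so only the last value survives:

```python
import pandas as pd

# Hypothetical dB values; the column name follows the question.
dataAnti = pd.DataFrame({"antilog": [0.0, 10.0, 20.0, 30.0]})

# Vectorized: the arithmetic is applied to the whole column at once.
dataAnti["vectorized"] = 10 ** (dataAnti["antilog"] / 10)

# Element-wise with apply: the lambda is called once per value.
dataAnti["applied"] = dataAnti["antilog"].apply(lambda i: 10 ** (i / 10))

print(dataAnti)
```

Both columns hold the same values; the vectorized form is usually the more idiomatic choice for simple arithmetic.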

Python correlation (.corr) results as dataframe

I am running the following code with a dataset named "mpg_data"
mpg_data.corr(method='pearson').style.format("{:.2}")
As a result I get the data I need as a table. However, when I try to assign these results to a variable, so I can get them as a usable dataframe, doing this:
results = mpg_data.corr(method='pearson').style.format("{:.2}")
As a result I get:
<pandas.formats.style.Styler object at 0x130379e90>
How can I get the correlation result as a usable dataframe?
Drop the .style...
results = mpg_data.corr(arguments)
This should return the correlation matrix as a dataframe. If you want to display just two digits, you can actually do this in matplotlib or use .apply() on the dataframe.
You might use the dataframe's applymap instead of style.format:
results = mpg_data.corr(method='pearson').applymap('${:,.2f}'.format)
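A minimal sketch with a made-up two-column dataset (the values and column names are hypothetical). Formatting each cell with a format string turns the values into text, so if the numbers are needed for further computation, round() is one way to keep the result numeric:

```python
import pandas as pd

# Hypothetical stand-in for mpg_data.
mpg_data = pd.DataFrame({
    "mpg": [21.0, 22.8, 18.7, 14.3],
    "weight": [2.62, 2.32, 3.44, 3.57],
})

# Without .style, corr() returns an ordinary DataFrame.
results = mpg_data.corr(method="pearson")

# round() limits the displayed precision while keeping float values.
rounded = results.round(2)

print(type(results))
print(rounded)
```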
