Python correlation (.corr) results as dataframe - python

I am running the following code with a dataset named "mpg_data":
mpg_data.corr(method='pearson').style.format("{:.2}")
As a result I get the data I need as a table. However, when I try to assign these results to a variable, so I can get them as a usable dataframe, doing this:
results = mpg_data.corr(method='pearson').style.format("{:.2}")
As a result I get:
<pandas.formats.style.Styler object at 0x130379e90>
How can I get the correlation result as a usable dataframe?

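Drop the .style part. .style returns a Styler object, which is meant for display, not a dataframe.
results = mpg_data.corr(method='pearson')
This returns the correlation matrix as a dataframe. If you want to display just two digits, you can round the result with .round(2), handle it at display time (for example when plotting with matplotlib), or use .apply() on the dataframe.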
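As a rough illustration of that advice, here is a minimal sketch. The small frame below is a made-up stand-in for mpg_data (its columns and values are assumptions, not the asker's data). It shows that .corr() without .style gives an ordinary dataframe, and that .round(2) limits the digits while keeping the values numeric.

```python
import pandas as pd

# Hypothetical stand-in for mpg_data; columns and values are invented.
mpg_data = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 24.0, 30.0],
    "weight": [3504, 3693, 1800, 2430, 2100],
    "horsepower": [130, 165, 60, 95, 70],
})

# Without .style, corr() returns an ordinary DataFrame.
results = mpg_data.corr(method="pearson")

# round() keeps the values as floats, unlike string formatting.
results_2dp = results.round(2)

print(type(results_2dp).__name__)
print(results_2dp)
```

Because results_2dp is still numeric, it can be filtered, sorted, or passed to a plotting function afterwards.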

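You might use the dataframe's applymap instead of style.format. Note that this converts every value to a string, and that in pandas 2.1 and later applymap is deprecated in favor of DataFrame.map:
results = mpg_data.corr(method='pearson').applymap('{:.2f}'.format)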

Related

How to Index a dataframe based on an applied function? -Pandas

I have a dataframe that I created from a master table in SQL. That new dataframe is then grouped by type, as I want to find the outliers for each group in the master table.
The function finds the outliers and shows where in the group dataframe they occur. How do I see these outliers as part of the original dataframe? Not just volume, but also location, SKU, group, etc.
dataframe: HOSIERY_df
Code:
##Sku Group Data Frames
grouped_skus = sku_volume.groupby('SKUGROUP')
HOSIERY_df = grouped_skus.get_group('HOSIERY')
hosiery_outliers = find_outliers_IQR(HOSIERY_df['VOLUME'])
hosiery_outliers
#.iloc[[hosiery_outliers]]
#hosiery_outliers
I know enough to see that I need to find the rows based on their location in the index, like VLOOKUP in Excel, but I need to do it in Python. I am not sure how to pull only the 5th, 6th, 7th...3888th and 4482nd rows of HOSIERY_df.
You can provide a list of integers to iloc, which it looks like you have tried based on your commented-out code. So, you may want to make sure that find_outliers_IQR returns a list of int so that it works properly with iloc, or convert its output. Keep in mind that iloc selects by position; if the values you have are index labels carried over from the original dataframe, pass them to .loc instead.
It looks like it's currently returning a DataFrame. You can get the index of that frame as a list like this:
hosiery_outliers.index.tolist()
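Here is a small sketch of that lookup. The original find_outliers_IQR is not shown in the question, so the version below is a hypothetical 1.5 * IQR implementation, and the sku_volume data is made up. It assumes the function returns the outlying values with their original index labels, in which case .loc with that index pulls the full rows.

```python
import pandas as pd

def find_outliers_IQR(values: pd.Series) -> pd.Series:
    # Hypothetical implementation; the asker's function is not shown.
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return values[mask]

# Made-up data standing in for sku_volume.
sku_volume = pd.DataFrame({
    "SKUGROUP": ["SOCKS", "HOSIERY", "HOSIERY", "HOSIERY",
                 "HOSIERY", "HOSIERY", "HOSIERY"],
    "SKU": ["S1", "H1", "H2", "H3", "H4", "H5", "H6"],
    "LOCATION": ["A", "A", "B", "A", "B", "A", "B"],
    "VOLUME": [5, 10, 12, 11, 13, 500, 12],
})

HOSIERY_df = sku_volume.groupby("SKUGROUP").get_group("HOSIERY")
hosiery_outliers = find_outliers_IQR(HOSIERY_df["VOLUME"])

# The outliers keep HOSIERY_df's index labels, so .loc returns the
# complete rows (SKU, LOCATION, ...) for those labels.
outlier_rows = HOSIERY_df.loc[hosiery_outliers.index]
print(outlier_rows)
```

In this sketch the group's index starts at 1 rather than 0, so passing the same labels to iloc would select a different row. That is the reason for preferring .loc when the values are labels.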

How to make a sum by using group by?

I have the following dataset and I want to sum the values of the column UnitPrice grouping by CustomerID.
I'm trying the following, but although the new column is added, its values are not filled in:
data['TotalEN'] = round(data.groupby(['SalesOrderID'])['UnitPrice'].sum(),2)
I printed the groupby result to check whether it calculates the values correctly, and indeed it does:
print(data.groupby(['CustomerID'])['UnitPrice'].sum())
What am I doing wrong?
In this case, the output of the groupby operation has a different shape than your dataframe: it has one row per group and is indexed by the group key, so it does not line up with the original rows. You will need to use the transform method on the groupby object to get a result with the same shape and index as your dataframe:
data['TotalEN'] = data.groupby(['SalesOrderID'])['UnitPrice'].transform('sum').round(2)
You can read more about transform in the pandas groupby documentation.
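To make the shape difference concrete, here is a minimal sketch with made-up order data (the values are assumptions for illustration). It compares the aggregated result with the transformed one before assigning the column.

```python
import pandas as pd

# Made-up order lines; only the column names come from the question.
data = pd.DataFrame({
    "SalesOrderID": [1, 1, 2, 2, 2],
    "UnitPrice": [10.25, 5.0, 3.0, 4.0, 1.5],
})

# sum() aggregates: one value per group, indexed by SalesOrderID.
per_group = data.groupby("SalesOrderID")["UnitPrice"].sum()

# transform("sum") broadcasts each group's total back to its rows,
# so the result has the same index as data.
per_row = data.groupby("SalesOrderID")["UnitPrice"].transform("sum")

data["TotalEN"] = per_row.round(2)

print(len(per_group), len(per_row))
print(data)
```

Assigning per_group directly would align on the index labels instead, which is why most rows end up empty in the original attempt.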

panda dataframe extracting values

I have a dataframe called "nums" and am trying to find the value of the column "angle" by specifying the values of other columns like this:
nums[(nums['frame']==300)&(nums['tad']==6)]['angl']
When I do so, I do not get a singular number and cannot do calculations on them. What am I doing wrong?
First of all, in general you should use .loc rather than chaining indexing operations like that:
>>> s = nums.loc[(nums['frame']==300)&(nums['tad']==6), 'angl']
Now, to get the float, you can call the .item() method, which works when the selection contains exactly one value.
>>> s.item()
-0.466331
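The same steps as a self-contained sketch. The nums frame here is invented for illustration (only the column names frame, tad and angl come from the question), so treat the numbers as placeholders.

```python
import pandas as pd

# Made-up stand-in for nums.
nums = pd.DataFrame({
    "frame": [300, 300, 301],
    "tad": [5, 6, 6],
    "angl": [0.12, -0.466331, 0.9],
})

# Boolean mask for the rows, column label for the column, in one .loc call.
s = nums.loc[(nums["frame"] == 300) & (nums["tad"] == 6), "angl"]

# s is still a Series of length 1; item() unwraps it to a Python float.
# item() raises ValueError if the selection does not have exactly one element.
value = s.item()
print(value * 2)
```

If several rows can match, a method such as s.iloc[0] or s.mean() may be a better fit than item(), depending on what the calculation needs.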

pandas: return mutated column into original dataframe

I've attempted to search the forum for this question, but I believe I may not be asking it correctly. So here it goes.
I have a large data set with many columns. Originally, I needed to sum all columns for each row by multiple groups based on a name pattern of variables. I was able to do so via:
cols = data.filter(regex=r'_name$').columns
data['sum'] = data.groupby(['id','group'],as_index=False)[cols].sum().assign(sum = lambda x: x.sum(axis=1))
By running this code, I receive a modified dataframe grouped by my 2 factor variables (group & id), with all the columns, and the final sum column I need. However, now, I want to return the final sum column back into the original dataframe. The above code returns the entire modified dataframe into my sum column. I know this is achievable in R by simply adding a .$sum at the end of a piped code. Any ideas on how to get this in pandas?
My desired output is just the addition of the final "sum" variable from the above lines of code to my original dataframe.
Edit: To clarify, the code above returns this entire dataframe:
All I want returned is the column in yellow
Is this what you need?
data['sum'] = data.groupby(['id','group'])[cols].transform('sum').sum(axis = 1)
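A small sketch of what that line does, on made-up data (the column names x_name, y_name and other are assumptions chosen to match the _name$ pattern). transform('sum') gives each row its group's column totals, and .sum(axis=1) then adds those totals across the selected columns.

```python
import pandas as pd

# Made-up data; only id, group and the _name$ pattern come from the question.
data = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "group": ["a", "a", "b", "b"],
    "x_name": [1, 2, 3, 4],
    "y_name": [10, 20, 30, 40],
    "other": [0, 0, 0, 0],
})

cols = data.filter(regex=r"_name$").columns

# Per-group column totals, broadcast back to every row of the group.
group_totals = data.groupby(["id", "group"])[cols].transform("sum")

# Row-wise sum of those totals is a single Series aligned with data.
data["sum"] = group_totals.sum(axis=1)
print(data)
```

The result is one column with the same index as the original dataframe, so no merge back is needed.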

indexing into a column in pandas

I am trying to set colb in a pandas dataframe depending on the value in cola.
The order in which I refer to the two column indices in the array seems to have an impact on whether the indexing works. Why is this?
Here is an example of what I mean.
I set up my dataframe:
import numpy as np
import pandas as pd
test=pd.DataFrame(np.random.rand(20,1))
test['cola']=range(20)
test['colb']=0
If I try to set column b using the following code:
test.loc['colb',test.cola>2]=1
I get the error: ValueError: setting an array element with a sequence
If I use the following code, the code alters the dataframe as I expect.
test.loc[test.cola>2,'colb']=1
Why is this?
Further, is there a better way to assign a column using a test like this?
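For reference, .loc takes the row selector first and the column selector second, so in the failing form 'colb' is read as a row label and the boolean Series as a column selector. The sketch below repeats the working form and shows one possible alternative using numpy.where. The column name colc is a hypothetical addition for comparison, not part of the question.

```python
import numpy as np
import pandas as pd

test = pd.DataFrame(np.random.rand(20, 1))
test["cola"] = range(20)
test["colb"] = 0

# .loc[row_selector, column_selector]: boolean mask picks the rows,
# the label picks the column.
test.loc[test["cola"] > 2, "colb"] = 1

# Alternative sketch: build the whole column in one step.
# "colc" is a hypothetical column used only to compare the results.
test["colc"] = np.where(test["cola"] > 2, 1, 0)

print(test[["cola", "colb", "colc"]].head())
```

Both approaches give the same values here. np.where is convenient when the column is created from scratch, while the .loc form is handy for updating part of an existing column.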
