Pandas correlation - python

I have the following dataframe structure:
                      roc_sector               roc_symbol
                      mean  max  min  count    mean  max   min  count
date       industry
2015-03-15 Health     123   675  12   6        35    5677  12   7
2015-03-15 Mining     456   687  11   9        54    7897  44   3
2015-03-16 Health     346   547  34   8        67    7699  23   5
2015-03-16 Mining     234   879  34   2        35    3457  23   4
2015-03-17 Health     345   875  54   6        45    7688  12   8
2015-03-17 Mining     876   987  23   7        56    5656  43   9
What I need to do is calculate the correlation between the industries over x number of days. For example, I would need to see what the correlation is between the Health and Mining industry over the last 3 days for the roc_sector + mean.
I've been trying a few things with pandas df.corr() and pd.rolling_corr() but I haven't had any success because I can't seem to change the dataframe structure from what it is currently (as above), into something that will allow me to get the required correlations per industry, over x days.

You could do this by performing an appropriate unstack followed by a regular rolling_corr.
Start off by setting industry as the index (or part of the index), then unstack that index level. In the resulting dataframe, just use rolling_corr on the columns of the industry means.
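A minimal sketch of that approach, assuming df has the (date, industry) MultiIndex rows and (roc_sector/roc_symbol, stat) MultiIndex columns shown above (in current pandas, .rolling(...).corr() replaces the older pd.rolling_corr):
# pull out the roc_sector mean, put industries into columns, then correlate
wide = df[('roc_sector', 'mean')].unstack('industry')            # columns: Health, Mining
corr_3d = wide['Health'].rolling(window=3).corr(wide['Mining'])  # 3-day rolling correlation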

Is this what you are expecting to do? Assuming this df is your dataframe:
In [43]: df
Out[43]:
date industry mean max min count
0 2015-03-15 Health 123 675 12 6
1 2015-03-15 Mining 456 687 11 9
2 2015-03-16 Health 346 547 34 8
3 2015-03-16 Mining 234 879 34 2
4 2015-03-17 Health 345 875 54 6
5 2015-03-17 Mining 876 987 23 7
In [44]: x = df.pivot(index='date', columns='industry', values='mean')
In [45]: x
Out[45]:
industry Health Mining
date
2015-03-15 123 456
2015-03-16 346 234
2015-03-17 345 876
In [46]: x.corr()
Out[46]:
industry Health Mining
industry
Health 1.000000 0.171471
Mining 0.171471 1.000000
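And to get the correlation over the last x days rather than over the whole frame, the same pivoted frame can be fed into a rolling correlation (a sketch; pd.rolling_corr was the older spelling of the same operation):
# 3-day rolling correlation between the two industry columns of x
x['Health'].rolling(window=3).corr(x['Mining'])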

Related

Pandas: Convert annual data to decade data

Background
I want to determine the global cumulative value of a variable for different decades from 1990 to 2014, i.e. the decades starting in 1990, 2000, and 2010 (3 decades, each summed separately). I have annual data for different countries. However, data availability is not uniform.
Existing questions
Uses R: 1
Following questions look at date formatting issues: 2, 3
Answers to these questions do not address the current question.
Current question
How can I obtain a global sum for each of the different decades using the features/tools of Pandas?
Expected outcome
1990-2000 x1
2000-2010 x2
2010-2015 x3
Method used so far
data_binned = data_pivoted.copy()
decade = []
# obtaining decade values for each country
for i in range(1960, 2017):
    if i in list(data_binned):
        # adding the columns into the decade list
        decade.append(i)
    if i % 10 == 0:
        # adding large header so that newly created columns are set at the end of the dataframe
        data_binned[i * 10] = data_binned.apply(lambda x: sum(x[j] for j in decade), axis=1)
        decade = []
for x in list(data_binned):
    if x < 3000:
        # removing non-decade columns
        del data_binned[x]
# renaming the decade columns
new_names = [int(x / 10) for x in list(data_binned)]
data_binned.columns = new_names
# computing global values
global_values = data_binned.sum(axis=0)
This is not an optimal method, as I have limited experience with Pandas. Kindly suggest a better method that uses Pandas features. Thank you.
If I had a pandas.DataFrame called df looking like this:
>>> df = pd.DataFrame(
... {
... 1990: [1, 12, 45, 67, 78],
... 1999: [1, 12, 45, 67, 78],
... 2000: [34, 6, 67, 21, 65],
... 2009: [34, 6, 67, 21, 65],
... 2010: [3, 6, 6, 2, 6555],
... 2015: [3, 6, 6, 2, 6555],
... }, index=['country_1', 'country_2', 'country_3', 'country_4', 'country_5']
... )
>>> print(df)
1990 1999 2000 2009 2010 2015
country_1 1 1 34 34 3 3
country_2 12 12 6 6 6 6
country_3 45 45 67 67 6 6
country_4 67 67 21 21 2 2
country_5 78 78 65 65 6555 6555
I could make another pandas.DataFrame called df_decades with decade statistics like this:
>>> df_decades = pd.DataFrame()
>>>
>>> for decade in set([(col // 10) * 10 for col in df.columns]):
... cols_in_decade = [col for col in df.columns if (col // 10) * 10 == decade]
... df_decades[f'{decade}-{decade + 9}'] = df[cols_in_decade].sum(axis=1)
>>>
>>> df_decades = df_decades[sorted(df_decades.columns)]
>>> print(df_decades)
1990-1999 2000-2009 2010-2019
country_1 2 68 6
country_2 24 12 12
country_3 90 134 12
country_4 134 42 4
country_5 156 130 13110
The idea behind this is to iterate over all decades implied by the column names in df, filter the columns that belong to each decade, and aggregate them.
Finally, I could merge these data frames together, so that df is enriched with the decade statistics from df_decades.
>>> df = pd.merge(left=df, right=df_decades, left_index=True, right_index=True, how='left')
>>> print(df)
1990 1999 2000 2009 2010 2015 1990-1999 2000-2009 2010-2019
country_1 1 1 34 34 3 3 2 68 6
country_2 12 12 6 6 6 6 24 12 12
country_3 45 45 67 67 6 6 90 134 12
country_4 67 67 21 21 2 2 134 42 4
country_5 78 78 65 65 6555 6555 156 130 13110
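As an aside (not part of the original answer), the same aggregation can be written more compactly by grouping the year columns by their decade; a minimal sketch assuming integer year columns as above:
# group the year columns by their decade and sum across each group
# (done on the transpose to avoid the deprecated axis=1 groupby)
decades = (df.columns // 10) * 10
df_decades = df.T.groupby(decades).sum().T
df_decades.columns = [f'{d}-{d + 9}' for d in df_decades.columns]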

How do I extract max n values for multiple criteria from a DataFrame

I have a dataframe created from the dictionary below -
d = {
    'Region': [
        'north', 'north', 'north', 'north', 'south',
        'south', 'south', 'east', 'east', 'east',
        'east', 'west', 'west', 'west'
    ],
    'Store No': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'Sales': [196, 193, 176, 168, 165, 163, 166, 135, 151, 108, 119, 176, 132, 107]
}
1) How do I create another dataframe that extracts the top 3 stores (by the "Sales" column) for each region?
2) Assuming the "Region" column had many more different values (such as Northeast, Northwest, Southwest, etc.), how do I create another dataframe that extracts the regions starting with "North"?
You can use the groupby and nlargest functions (assuming df = pd.DataFrame(d)).
1) Top 3 sales per region:
You can create a dictionary of dataframes, one for each region, with the top 3 sales:
In [687]: top_3_sales = df.set_index('Store No').groupby('Region')['Sales'].nlargest(3).reset_index()
In [688]: list_of_regions = df.Region.unique().tolist()
In [691]: dict_of_region_df = {region: top_3_sales.loc[top_3_sales['Region'] == region] for region in list_of_regions}
Then query your dict to get the individual dataframes:
In [693]: dict_of_region_df['north']
Out[693]:
Region Store No Sales
3 north 1 196
4 north 2 193
5 north 3 176
In [694]: dict_of_region_df['east']
Out[694]:
Region Store No Sales
0 east 9 151
1 east 8 135
2 east 11 119
2) Regions starting with 'north':
In [681]: df[df.Region.str.startswith('north')]
Out[681]:
Region Store No Sales
0 north 1 196
1 north 2 193
2 north 3 176
3 north 4 168
For question 1, use the nlargest function on the dataframe (again assuming df = pd.DataFrame(d)).
In [13]: df_1 = df.groupby('Region')['Sales'].nlargest(3)
In [14]: df_1
Out[14]:
Region
east 8 151
7 135
10 119
north 0 196
1 193
2 176
south 6 166
4 165
5 163
west 11 176
12 132
13 107
Name: Sales, dtype: int64
For the second question, you can use str.startswith to find the regions starting with 'north'.
In [11]: df_2 = df[df['Region'].str.startswith('north')]
In [12]: df_2
Out[12]:
Region Store No Sales
0 north 1 196
1 north 2 193
2 north 3 176
3 north 4 168
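A hedged alternative to both answers (a sketch, not from the original): sorting by Sales and taking the first 3 rows per region keeps all of the original columns, including the actual Store No.
import pandas as pd

df = pd.DataFrame(d)
# top 3 stores by Sales within each region
top3 = df.sort_values('Sales', ascending=False).groupby('Region').head(3)
# regions whose name starts with 'north'
north_regions = df[df['Region'].str.startswith('north')]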

One-to-one column-value comparison between 2 dataframes - pandas

I have 2 dataframes -
print(d)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 53 53
1 2020 3443 455 455 455
2 2021 6777 123 123 123
3 2019 5466 313 313 313
4 2020 4656 545 545 545
5 2021 4565 775 775 775
6 2019 4654 567 567 567
7 2020 7867 657 657 657
8 2021 6766 567 567 567
print(d1)
Year Salary Amount Amount1 Amount2
0 2019 1200 53 73 63
import pandas as pd

d = pd.DataFrame({
    'Year': [2019, 2020, 2021] * 3,
    'Salary': [1200, 3443, 6777, 5466, 4656, 4565, 4654, 7867, 6766],
    'Amount': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount1': [53, 455, 123, 313, 545, 775, 567, 657, 567],
    'Amount2': [53, 455, 123, 313, 545, 775, 567, 657, 567]
})
d1 = pd.DataFrame({
    'Year': [2019],
    'Salary': [1200],
    'Amount': [53],
    'Amount1': [73],
    'Amount2': [63]
})
I want to compare the 'Salary' value of dataframe d1 (i.e. 1200) with all the values of 'Salary' in dataframe d and count how many satisfy a Boolean comparison (>= or <). This needs to be done for all the columns (Amount, Amount1, Amount2, etc.); if the value in any column of d1 is NaN/None, no comparison needs to be done for that column. The column names will always be the same, so it is basically a one-to-one column comparison.
My approach and thoughts -
I can get the values of d1 in a list by doing -
l = []
for i in range(len(d1.columns.values)):
    if i == 0:
        continue
    else:
        num = d1.iloc[0, i]
        l.append(num)
print(l)
# list comprehension equivalent
lst = [d1.iloc[0, i] for i in range(len(d1.columns.values)) if i != 0]
[1200, 53, 73, 63]
and then use iterrows to iterate over all the columns and rows in dataframe d, OR I can iterate over d and then perform a similar comparison by looping over d1 - but both would be time consuming for a high-dimensional dataframe (d in this case).
What would be the more efficient or Pythonic way of doing it?
IIUC, you can do:
(d >= d1.values).sum()
Output:
Year 9
Salary 9
Amount 9
Amount1 8
Amount2 8
dtype: int64
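To also honour the requirement that NaN/None columns in d1 be skipped, a hedged extension of the above (not part of the original answer):
# drop NaN entries from d1's single row, then compare only the remaining columns;
# the Series aligns with d's columns by name
valid = d1.iloc[0].dropna()
counts = (d[valid.index] >= valid).sum()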

pandas - How to convert aggregated data to dictionary

Here is a snippet of the CSV file I am working with:
ID SN Age Gender Item ID Item Name Price
0, Lisim78, 20, Male, 108, Extraction of Quickblade, 3.53
1, Lisovynya38, 40, Male, 143, Frenzied Scimitar, 1.56
2, Ithergue48, 24, Male, 92, Final Critic, 4.88
3, Chamassasya86, 24, Male, 100, Blindscythe, 3.27
4, Iskosia90, 23, Male, 131, Fury, 1.44
5, Yalae81, 22, Male, 81, Dreamkiss, 3.61
6, Itheria73, 36, Male, 169, Interrogator, 2.18
7, Iskjaskst81, 20, Male, 162, Abyssal Shard, 2.67
8, Undjask33, 22, Male, 21, Souleater, 1.1
9, Chanosian48, 35, Other, 136, Ghastly, 3.58
10, Inguron55, 23, Male, 95, Singed Onyx, 4.74
I want to get the count of the most profitable items - profitable items are determined by taking the sum of the prices of the most frequently purchased items.
This is what I tried:
profitableCount = df.groupby('Item ID').agg({'Price': ['count', 'sum']})
And the output looks like this:
Price
count sum
Item ID
0 4 5.12
1 3 9.78
2 6 14.88
3 6 14.94
4 5 8.50
5 4 16.32
6 2 7.40
7 7 9.31
8 3 11.79
9 4 10.92
10 4 7.16
I want to extract the 'count' and 'sum' columns and put them in a dictionary, but I can't seem to drop the 'Item ID' column (Item ID seems to be the index). How do I do this? Please help!
A dictionary consists of a series of {key: value} pairs, and in the outcome you provided there is no natural key for each value (the counts repeat, so they cannot serve as dictionary keys):
{4: 5.12, 3: 9.78, 6: 14.88, 6: 14.94, 5: 8.50, 4: 16.32, 2: 7.40, 7: 9.31, 3: 11.79, 4: 10.92, 4: 7.16}
Alternatively, you can create two lists, profitableCount[('Price', 'count')].tolist() and profitableCount[('Price', 'sum')].tolist(),
and put them into a list of tuples with list(zip(list1, list2)).
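A hedged sketch of getting the counts and sums into a dictionary keyed by Item ID (the group key is the index of the aggregated frame, so it naturally becomes the dictionary key):
# aggregate without the MultiIndex column level, then convert to a dict
agg = df.groupby('Item ID')['Price'].agg(['count', 'sum'])
item_stats = agg.to_dict(orient='index')   # {item_id: {'count': ..., 'sum': ...}, ...}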

Pandas: return a Dataframe with multiple aggregate values conditioned on another value [closed]

I'm trying to do something well beyond my Pandas level and have spent far too much time getting this wrong. In this example I need to return individual Dataframes for each of the teams. The dataframes would show the mean cost, mean area, and sum of size, for each grade.
Because I need to produce separate tables, I probably need to pass single team names into a function over and over. To be clear, I'm happy to pass the team names into a function (or similar) manually to produce each table.
team grade cost area size
0 man utd 1 52300 5 1045
1 chelsea 3 52000 42 957
2 arsenal 2 25000 20 1099
3 man utd 1 61600 20 1400
4 man utd 2 43000 43 1592
5 arsenal 2 23400 78 1006
6 man utd 2 52300 89 987
7 chelsea 4 62000 30 849
8 arsenal 1 62000 46 973
9 arsenal 2 73000 78 1005
The man utd dataframe would look like this for example:
grade mean_cost mean_area size
1 56950 12.5 2445
2 47650 66 2579
Use groupby/agg to group by both the team and grade, and then aggregate the cost, area and size columns. Note that agg can accept a dict whose keys are column names and whose values are aggregation functions (such as mean or sum), so you can specify aggregation functions on a per-column basis.
In [120]: df.groupby(['team', 'grade']).agg({'cost':'mean', 'area':'mean', 'size':'sum'}).rename(columns={'cost':'mean_cost', 'area':'mean_area'})
Out[120]:
size mean_cost mean_area
team grade
arsenal 1 973 62000.000000 46.000000
2 3110 40466.666667 58.666667
chelsea 3 957 52000.000000 42.000000
4 849 62000.000000 30.000000
man utd 1 2445 56950.000000 12.500000
2 2579 47650.000000 66.000000
groupby returns an iterable. Therefore, to make a dict mapping team names to DataFrames you could use:
dfs = {team:grp for team, grp in result.reset_index().groupby('team')}
For example,
import pandas as pd

df = pd.DataFrame(
    {'area': [5, 42, 20, 20, 43, 78, 89, 30, 46, 78],
     'cost': [52300, 52000, 25000, 61600, 43000, 23400, 52300, 62000, 62000, 73000],
     'grade': [1, 3, 2, 1, 2, 2, 2, 4, 1, 2],
     'size': [1045, 957, 1099, 1400, 1592, 1006, 987, 849, 973, 1005],
     'team': ['man utd', 'chelsea', 'arsenal', 'man utd', 'man utd',
              'arsenal', 'man utd', 'chelsea', 'arsenal', 'arsenal']})

result = df.groupby(['team', 'grade']).agg(
    {'cost': 'mean', 'area': 'mean', 'size': 'sum'}).rename(
    columns={'cost': 'mean_cost', 'area': 'mean_area'})

dfs = {team: grp.drop('team', axis=1)
       for team, grp in result.reset_index().groupby('team')}

for team, grp in dfs.items():
    print('{}:\n{}\n'.format(team, grp))
yields
chelsea:
grade mean_cost mean_area size
2 3 52000 42 957
3 4 62000 30 849
arsenal:
grade mean_cost mean_area size
0 1 62000.000000 46.000000 973
1 2 40466.666667 58.666667 3110
man utd:
grade mean_cost mean_area size
4 1 56950 12.5 2445
5 2 47650 66.0 2579
For better performance, try to avoid breaking up DataFrames into smaller DataFrames: once you use a dict or a list, you are forced to use Python loops instead of the faster implicit C-compiled loops used by Pandas/NumPy methods. So for computation, try to stick with the result DataFrame, and use the dfs dict only if you have to do something like print the DataFrames separately.
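For instance (a sketch, sticking with the result DataFrame from above), a single team's table can be pulled straight out of the MultiIndex without building the dict at all:
# select the 'man utd' slice of the (team, grade) MultiIndex
man_utd_table = result.loc['man utd']
print(man_utd_table)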
