How to convert aggregated data to a dictionary in pandas

Here is a snippet of the CSV file I am working with:
ID, SN, Age, Gender, Item ID, Item Name, Price
0, Lisim78, 20, Male, 108, Extraction of Quickblade, 3.53
1, Lisovynya38, 40, Male, 143, Frenzied Scimitar, 1.56
2, Ithergue48, 24, Male, 92, Final Critic, 4.88
3, Chamassasya86, 24, Male, 100, Blindscythe, 3.27
4, Iskosia90, 23, Male, 131, Fury, 1.44
5, Yalae81, 22, Male, 81, Dreamkiss, 3.61
6, Itheria73, 36, Male, 169, Interrogator, 2.18
7, Iskjaskst81, 20, Male, 162, Abyssal Shard, 2.67
8, Undjask33, 22, Male, 21, Souleater, 1.1
9, Chanosian48, 35, Other, 136, Ghastly, 3.58
10, Inguron55, 23, Male, 95, Singed Onyx, 4.74
I want to get the counts of the most profitable items, where profitability is determined by summing the prices of the most frequently purchased items.
This is what I tried:
profitableCount = df.groupby('Item ID').agg({'Price': ['count', 'sum']})
And the output looks like this:
Price
count sum
Item ID
0 4 5.12
1 3 9.78
2 6 14.88
3 6 14.94
4 5 8.50
5 4 16.32
6 2 7.40
7 7 9.31
8 3 11.79
9 4 10.92
10 4 7.16
I want to extract the 'count' and 'sum' columns and put them in a dictionary, but I can't seem to drop the 'Item ID' column ('Item ID' seems to be the index). How do I do this?

A dictionary consists of {key: value} pairs, and in the output you provided there is no obvious key. Taking count as key and sum as value would give
{4: 5.12, 3: 9.78, 6: 14.88, 6: 14.94, 5: 8.50, 4: 16.32, 2: 7.40, 7: 9.31, 3: 11.79, 4: 10.92, 4: 7.16}
where the duplicate keys (4, 3, 6) would silently overwrite each other, so the Item ID index is the only sensible key.
Alternatively you can create two lists, df[('Price', 'count')].tolist() and df[('Price', 'sum')].tolist() (plain df.count / df.sum would hit the DataFrame methods, not the MultiIndex columns), and pair them into a list of tuples: list(zip(counts, sums)).
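A minimal sketch of one way to get what the question asks for, keyed by Item ID (assuming df here is the raw frame from the question, before any grouping):
import pandas as pd

# aggregate per item; selecting the 'Price' series first keeps the result
# single-level, so there is no ('Price', 'count') MultiIndex to fight with
agg = df.groupby('Item ID')['Price'].agg(['count', 'sum'])

# Item ID stays as the index, which is exactly what makes it usable as the key:
# {item_id: {'count': ..., 'sum': ...}}
result = agg.to_dict('index')

# or pair counts with sums per item, as suggested above
pairs = dict(zip(agg.index, zip(agg['count'], agg['sum'])))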

Related

Bucket Customers based on Points

I have a customer table. For each ParentCustomerID I am trying to look at the points of its rows and select a single row based on the conditions below:
IF 0 points & negative points, select the row with the most negative points (i.e. -30 beats -20)
IF 0 points & positive points, select the row with the highest positive points
IF positive & negative points, select the row with the highest positive points
IF positive, 0, and negative points, select the row with the highest positive points
IF all rows have 0 points, select any row with 0 points
IF all negative, select the row with the most negative points (i.e. -30 beats -20)
1:M relationship between ParentCustomerID and ChildCustomerID
ParentCustomerID  ChildCustomerID  Points
101               1                  0.0
101               2                -20.0
101               3                -30.50
102               4                 20.86
102               5                  0.0
102               6                 50.0
103               7                 10.0
103               8                 50.0
103               9                -30.0
104               10               -30.0
104               11                 0.0
104               12                60.80
104               13                40.0
105               14                 0.0
105               15                 0.0
105               16                 0.0
106               17               -20.0
106               18               -30.80
106               19               -40.20
Output should be:
ParentCustomerID  ChildCustomerID  Points
101               3                -30.50
102               6                 50.0
103               8                 50.0
104               12                60.80
105               16                 0.0
106               19               -40.20
Note: for customer 105, any row can be chosen, because all of its rows have 0 points.
Note 2: Points can be floats, and ChildCustomerID can be missing (np.nan).
I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.
Thank you in advance!
Code
import numpy as np

df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)

(
    df.sort_values(['pri', 'abs'])
      .drop_duplicates('ParentCustomerID', keep='last')
      .drop(['pri', 'abs'], axis=1)
      .sort_index()
)
How this works
Assign a temporary column named abs with the absolute values of Points.
Assign a temporary column named pri (priority) with the arithmetic sign (-1, 0, 1) of each value in Points. Important hack: replace 0 with -2 so that zero always has the least priority.
Sort the values by priority, then by absolute value.
Drop the duplicates in the sorted dataframe, keeping the last row per ParentCustomerID; the sort guarantees that last row is the preferred one. For example, customer 101 has (pri, abs) keys (-2, 0.0), (-1, 20.0), (-1, 30.5), so after the ascending sort the -30.50 row comes last and is kept.
Result
ParentCustomerID ChildCustomerID Points
2 101 3 -30.5
5 102 6 50.0
7 103 8 50.0
11 104 12 60.8
15 105 16 0.0
18 106 19 -40.2
import pandas as pd
import numpy as np

df = pd.DataFrame([
    [101, 1, 0.0],
    [101, 2, -20.0],
    [101, 3, -30.50],
    [102, 4, 20.86],
    [102, 5, 0.0],
    [102, 6, 50.0],
    [103, 7, 10.0],
    [103, 8, 50.0],
    [103, 9, -30.0],
    [104, 10, -30.0],
    [104, 11, 0.0],
    [104, 12, 60.80],
    [104, 13, 40.0],
    [105, 14, 0.0],
    [105, 15, 0.0],
    [105, 16, 0.0],
    [106, 17, -20.0],
    [106, 18, -30.80],
    [106, 19, -40.20],
], columns=['ParentCustomerID', 'ChildCustomerID', 'Points'])

# per group: the position of the best row (argmax if any positive value exists,
# else argmin), plus Points and ChildCustomerID as lists to index into afterwards
data = df.groupby('ParentCustomerID').agg({
    'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
    'ChildCustomerID': list
})

# pick the ChildCustomerID and Points value at that position for each group
pd.DataFrame(
    data.apply(
        lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]],
                   x["Points", "list"][x["Points", "<lambda_0>"]]),
        axis=1,
    ).tolist(),
    index=data.index,
).rename(columns={0: "ChildCustomerID", 1: "Points"}).reset_index()

Iterate over pandas dataframe columns containing nested arrays (reformulated request)

I hope you can help. A few weeks ago you gave me huge help with a similar issue regarding nested arrays.
Today I have a similar issue, and I've tried all the solutions provided in the link below:
Iterate over pandas dataframe columns containing nested arrays
My data is an ORB vector containing descriptor points. It returns a list; when I convert the list into an array I get this output:
import numpy as np

data = np.asarray([
    ['Test /file0090', np.asarray([[ 84,  55, 189],
                                   [248, 100,  18],
                                   [ 68,   0,  88]])],
    ['aa file6565',    np.asarray([[ 86,  58, 189],
                                   [ 24,  10, 118],
                                   [ 68,  11,   0]])],
    ['aa filejjhgjgj', None],
    ['Test /file0088', np.asarray([[ 54,  58, 787],
                                   [  4,   1,  18],
                                   [  8,   1,   0]])],
], dtype=object)
This is a small sample; the real data is an array of 800,000 x 2.
Some images do not return any descriptor points, and their value shows None.
Below is an example; I've selected 2 rows where the value was None:
array([['/00cbbc837d340fa163d11e169fbdb952.jpg',
None],
['/078b35be31e8ac99b0cbb817dab4c23f.jpg',
None]], dtype=object)
Once again, I need to get this into n x 4 shape (here there are 4 variables, but in my real data there are 33), like this:
col0            col1  col2  col3
Test /file0090    84    55   189
Test /file0090   248   100    18
Test /file0090    68     0    88
aa file6565       86    58   189
aa file6565       24    10   118
aa file6565       68    11     0
aa filejjhgjgj     0     0     0
Test /file0088    54    58   787
Test /file0088     4     1    18
Test /file0088     8     1     0
The issue with the solution provided in the link is that when the array contains these None values, it returns:
ufunc 'add' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('<U21')
Can someone help me get through this?
You can modify @anky's answer to handle null values by using df.fillna(''):
df = pd.DataFrame(data).add_prefix('col')
df = df.fillna('').explode('col1').reset_index(drop=True)
df = df.join(pd.DataFrame(df.pop('col1').tolist()).add_prefix('Col')).fillna(0)
Returns
col0 Col0 Col1 Col2
Test /file0090 84.0 55.0 189.0
Test /file0090 248.0 100.0 18.0
Test /file0090 68.0 0.0 88.0
aa file6565 86.0 58.0 189.0
aa file6565 24.0 10.0 118.0
aa file6565 68.0 11.0 0.0
aa filejjhgjgj 0.0 0.0 0.0
Test /file0088 54.0 58.0 787.0
Test /file0088 4.0 1.0 18.0
Test /file0088 8.0 1.0 0.0

Assign two-dimensional array to pandas dataframe

I want to break the two-dimensional array into its one-dimensional rows and assign them to the pandas dataframe. I need help with this.
This is my pandas dataframe:
Id Dept Date
100 Healthcare 2007-01-03
100 Healthcare 2007-01-10
100 Healthcare 2007-01-17
The two-dimensional array looks like:
array([[10, 20, 30],
       [40, 50, 60],
       [70, 80, 90]])
The output should be:
Id Dept Date vect
100 Healthcare 2007-01-03 [10, 20, 30]
100 Healthcare 2007-01-10 [40, 50, 60]
100 Healthcare 2007-01-17 [70, 80, 90]
You can achieve that by converting the array to a list of row lists with tolist:
df['vect']=ary.tolist()
df
Out[281]:
Id Dept Date vect
0 100 Healthcare 2007-01-03 [10, 20, 30]
1 100 Healthcare 2007-01-10 [40, 50, 60]
2 100 Healthcare 2007-01-17 [70, 80, 90]
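For a self-contained run, a minimal sketch reconstructing the frames above (names taken from the question; ary is assumed to be the 2-D array shown there):
import numpy as np
import pandas as pd

# rebuild the question's frame and array
df = pd.DataFrame({
    'Id': [100, 100, 100],
    'Dept': ['Healthcare'] * 3,
    'Date': ['2007-01-03', '2007-01-10', '2007-01-17'],
})
ary = np.array([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90]])

# tolist() turns the 2-D array into one Python list per row, so each
# DataFrame row gets the matching row of the array (lengths must agree)
df['vect'] = ary.tolist()
print(df)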

Pandas Dataframe splice data into 2 columns and make a number with a comma and integer

I am currently running into two issues.
My dataframe looks like this:
, male_female, no_of_students
0, 24 : 76, "81,120"
1, 33 : 67, "12,270"
2, 50 : 50, "10,120"
3, 42 : 58, "5,120"
4, 12 : 88, "2,200"
What I would like to achieve is this:
, male, female, no_of_students
0, 24, 76, 81120
1, 33, 67, 12270
2, 50, 50, 10120
3, 42, 58, 5120
4, 12, 88, 2200
Basically I want to split male_female into two columns and turn no_of_students into a column of integers. I tried a bunch of things, such as converting the no_of_students column to another type with .astype, but nothing worked properly, and I couldn't find a smart way of splitting the male_female column either.
Hopefully someone can help me out!
Use str.split with pop to build the new columns from the separator, then strip the quotes, remove the thousands separator with replace, and convert to integers:
df[['male','female']] = df.pop('male_female').str.split(' : ', expand=True)
df['no_of_students'] = df['no_of_students'].str.strip('" ').str.replace(',','').astype(int)
df = df[['male','female', 'no_of_students']]
print (df)
male female no_of_students
0 24 76 81120
1 33 67 12270
2 50 50 10120
3 42 58 5120
4 12 88 2200
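For a self-contained run, a minimal sketch reconstructing the question's frame (assuming the quotes in the CSV are literally part of the strings, which is why the answer strips them):
import pandas as pd

df = pd.DataFrame({
    'male_female': ['24 : 76', '33 : 67', '50 : 50', '42 : 58', '12 : 88'],
    'no_of_students': ['"81,120"', '"12,270"', '"10,120"', '"5,120"', '"2,200"'],
})

# pop removes male_female from df while splitting it into the two new columns
df[['male', 'female']] = df.pop('male_female').str.split(' : ', expand=True)
df['no_of_students'] = (df['no_of_students'].str.strip('" ')
                                            .str.replace(',', '')
                                            .astype(int))
print(df[['male', 'female', 'no_of_students']])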

Pandas correlation

I have the following dataframe structure:
                    roc_sector               roc_symbol
                    mean  max  min  count    mean  max   min  count
date       industry
2015-03-15 Health    123  675   12      6      35  5677    12      7
2015-03-15 Mining    456  687   11      9      54  7897    44      3
2015-03-16 Health    346  547   34      8      67  7699    23      5
2015-03-16 Mining    234  879   34      2      35  3457    23      4
2015-03-17 Health    345  875   54      6      45  7688    12      8
2015-03-17 Mining    876  987   23      7      56  5656    43      9
What I need to do is calculate the correlation between the industries over x days. For example, I need the correlation between the Health and Mining industries over the last 3 days for the roc_sector mean.
I've been trying a few things with pandas df.corr() and pd.rolling_corr(), but I haven't had any success, because I can't work out how to reshape the dataframe from its current structure (above) into something that gives me the required correlations per industry over x days.
You could do this by performing an appropriate unstack followed by a regular rolling correlation.
Start by putting date and industry in the index (the frame above already has them there), then unstack the industry level. In the resulting dataframe, just run a rolling correlation on the columns of the industry means.
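A minimal sketch of that approach, assuming df is the multi-indexed frame from the question and ('roc_sector', 'mean') is the column of interest (pd.rolling_corr is the old pre-0.18 API; the modern spelling is .rolling(...).corr(...)):
# pull out the per-industry mean series: one row per date, one column per industry
means = df[('roc_sector', 'mean')].unstack('industry')

# rolling 3-day correlation between the two industry series
corr = means['Health'].rolling(3).corr(means['Mining'])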
Is this what you are expecting to do? Assume this df is your dataframe:
In [43]: df
Out[43]:
date industry mean max min count
0 2015-03-15 Health 123 675 12 6
1 2015-03-15 Mining 456 687 11 9
2 2015-03-16 Health 346 547 34 8
3 2015-03-16 Mining 234 879 34 2
4 2015-03-17 Health 345 875 54 6
5 2015-03-17 Mining 876 987 23 7
In [44]: x = df.pivot(index='date', columns='industry', values='mean')
In [45]: x
Out[45]:
industry Health Mining
date
2015-03-15 123 456
2015-03-16 346 234
2015-03-17 345 876
In [46]: x.corr()
Out[46]:
industry Health Mining
industry
Health 1.000000 0.171471
Mining 0.171471 1.000000
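Note that x.corr() above is the correlation over the whole sample; for the rolling last-x-days version the question asks about, a sketch on the same pivoted frame (window of 3 to match the example) would be:
x['Health'].rolling(3).corr(x['Mining'])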
