Iterate over pandas dataframe columns containing nested arrays (reformulated request) - python

I hope you can help. A few weeks ago you gave me huge help with a similar issue regarding nested arrays.
Today I have a similar issue, and I've tried all the solutions provided in the link below:
Iterate over pandas dataframe columns containing nested arrays
My data is an ORB vector containing descriptor points. It returns a list. When I convert the list into an array I get this output:
import numpy as np

data = np.asarray([['Test /file0090',
                    np.asarray([[ 84,  55, 189],
                                [248, 100,  18],
                                [ 68,   0,  88]])],
                   ['aa file6565',
                    np.asarray([[ 86,  58, 189],
                                [ 24,  10, 118],
                                [ 68,  11,   0]])],
                   ['aa filejjhgjgj',
                    None],
                   ['Test /file0088',
                    np.asarray([[ 54,  58, 787],
                                [  4,   1,  18],
                                [  8,   1,   0]])]], dtype=object)
This is a small sample; the real data is an array of 800,000 x 2.
Some images do not return any descriptor points, and the value shows None.
Below is an example; I've selected 2 rows where the values were None:
array([['/00cbbc837d340fa163d11e169fbdb952.jpg',
None],
['/078b35be31e8ac99b0cbb817dab4c23f.jpg',
None]], dtype=object)
Once again, I need to get this into an n x 4 shape (in this case there are 4 variables, but in my real data there are 33), like this:
col0             Col1  Col2  Col3
Test /file0090   84    55    189
Test /file0090   248   100   18
Test /file0090   68    0     88
aa file6565      86    58    189
aa file6565      24    10    118
aa file6565      68    11    0
aa filejjhgjgj   0     0     0
Test /file0088   54    58    787
Test /file0088   4     1     18
Test /file0088   8     1     0
The issue with the solution provided in the link is that when we have these None values in the array, it returns:
ufunc 'add' did not contain a loop with signature matching types (dtype('<U21'), dtype('<U21')) -> dtype('<U21')
Can someone help me get past this?

You can modify @anky's answer to handle null values by using df.fillna(''):
df = pd.DataFrame(data).add_prefix('col')
df = df.fillna('').explode('col1').reset_index(drop=True)
df = df.join(pd.DataFrame(df.pop('col1').tolist()).add_prefix('Col')).fillna(0)
Returns
col0 Col0 Col1 Col2
Test /file0090 84.0 55.0 189.0
Test /file0090 248.0 100.0 18.0
Test /file0090 68.0 0.0 88.0
aa file6565 86.0 58.0 189.0
aa file6565 24.0 10.0 118.0
aa file6565 68.0 11.0 0.0
aa filejjhgjgj 0.0 0.0 0.0
Test /file0088 54.0 58.0 787.0
Test /file0088 4.0 1.0 18.0
Test /file0088 8.0 1.0 0.0
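If you want plain integers instead of floats in the numeric columns, a possible follow-up on the df produced above (Col0/Col1/Col2 are the names generated by add_prefix; extend the list to cover all 33 variables in the real data):
num_cols = ['Col0', 'Col1', 'Col2']
df[num_cols] = df[num_cols].astype(int)  # safe here because the None rows were already filled with 0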

Related

Bucket Customers based on Points

I have a customer table. I am trying to filter each ParentCustomerID based on multiple points they have and select a row based on the below conditions:
IF 0 points & negative points, select the row with the most negative points (i.e. -30 wins over -20)
IF 0 points & positive points, select the row with the highest positive points
IF positive & negative points, select the row with the highest positive points
IF positive, 0, and negative points, select the row with the highest positive points
IF only 0 points, select any row with 0 points
IF all negative, select the row with the most negative points (i.e. -30 wins over -20)
1:M relationship between ParentCustomerID and ChildCustomerID
ParentCustomerID  ChildCustomerID  Points
101               1                0.0
101               2                -20.0
101               3                -30.50
102               4                20.86
102               5                0.0
102               6                50.0
103               7                10.0
103               8                50.0
103               9                -30.0
104               10               -30.0
104               11               0.0
104               12               60.80
104               13               40.0
105               14               0.0
105               15               0.0
105               16               0.0
106               17               -20.0
106               18               -30.80
106               19               -40.20
Output should be:
ParentCustomerID  ChildCustomerID  Points
101               3                -30.50
102               6                50.0
103               8                50.0
104               12               60.80
105               16               0.0
106               19               -40.20
Note: for customer 105, any row can be chosen because they all have 0 points.
Note2: Points can be float and ChildCustomerID can be missing (np.nan)
I do not know how to group each ParentCustomerID, check the above conditions, and select a specific row for each ParentCustomerID.
Thank you in advance!
Code
df['abs'] = df['Points'].abs()
df['pri'] = np.sign(df['Points']).replace(0, -2)

(
    df.sort_values(['pri', 'abs'])
      .drop_duplicates('ParentCustomerID', keep='last')
      .drop(['pri', 'abs'], axis=1)
      .sort_index()
)
How this works
Assign a temporary column named abs with the absolute values of Points.
Assign a temporary column named pri (priority) holding the arithmetic sign (-1, 0, 1) of each value in Points. Important trick: replace 0 with -2 so that zero always has the lowest priority (see the short sketch after this list).
Sort the values by priority and then by absolute value.
Drop the duplicates in the sorted dataframe, keeping the last row per ParentCustomerID.
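To make the -2 trick concrete, here is a small sketch of the helper columns for customer 101, using the sample data from the question:
import numpy as np
import pandas as pd

sample = pd.DataFrame({'ParentCustomerID': [101, 101, 101],
                       'ChildCustomerID': [1, 2, 3],
                       'Points': [0.0, -20.0, -30.5]})
sample['abs'] = sample['Points'].abs()
sample['pri'] = np.sign(sample['Points']).replace(0, -2)
# After sort_values(['pri', 'abs']) the 0-point row (pri == -2) sorts first and the
# most negative row (-30.5) sorts last, so drop_duplicates(..., keep='last') keeps it.
print(sample.sort_values(['pri', 'abs']))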
Result
ParentCustomerID ChildCustomerID Points
2 101 3 -30.5
5 102 6 50.0
7 103 8 50.0
11 104 12 60.8
15 105 16 0.0
18 106 19 -40.2
import pandas as pd
import numpy as np

df = pd.DataFrame([
    [101, 1, 0.0],
    [101, 2, -20.0],
    [101, 3, -30.50],
    [102, 4, 20.86],
    [102, 5, 0.0],
    [102, 6, 50.0],
    [103, 7, 10.0],
    [103, 8, 50.0],
    [103, 9, -30.0],
    [104, 10, -30.0],
    [104, 11, 0.0],
    [104, 12, 60.80],
    [104, 13, 40.0],
    [105, 14, 0.0],
    [105, 15, 0.0],
    [105, 16, 0.0],
    [106, 17, -20.0],
    [106, 18, -30.80],
    [106, 19, -40.20],
], columns=['ParentCustomerID', 'ChildCustomerID', 'Points'])
data = df.groupby('ParentCustomerID').agg({
    'Points': [lambda x: np.argmax(x) if (np.array(x) > 0).sum() else np.argmin(x), list],
    'ChildCustomerID': list
})

pd.DataFrame(
    data.apply(lambda x: (x["ChildCustomerID", "list"][x["Points", "<lambda_0>"]],
                          x["Points", "list"][x["Points", "<lambda_0>"]]), axis=1).tolist(),
    index=data.index
).rename(columns={0: "ChildCustomerID", 1: "Points"}).reset_index()

pandas - How to convert aggregated data to dictionary

Here is a snippet of the CSV file I am working with:
ID SN Age Gender Item ID Item Name Price
0, Lisim78, 20, Male, 108, Extraction of Quickblade, 3.53
1, Lisovynya38, 40, Male, 143, Frenzied Scimitar, 1.56
2, Ithergue48, 24, Male, 92, Final Critic, 4.88
3, Chamassasya86,24, Male, 100, Blindscythe, 3.27
4, Iskosia90, 23, Male, 131, Fury, 1.44
5, Yalae81, 22, Male, 81, Dreamkiss, 3.61
6, Itheria73, 36, Male, 169, Interrogator, 2.18
7, Iskjaskst81, 20, Male, 162, Abyssal Shard, 2.67
8, Undjask33, 22, Male, 21, Souleater, 1.1
9, Chanosian48, 35, Other, 136, Ghastly, 3.58
10, Inguron55, 23, Male, 95, Singed Onyx, 4.74
I want to get the count of the most profitable items; profitable items are determined by taking the sum of the prices of the most frequently purchased items.
This is what I tried:
profitableCount = df.groupby('Item ID').agg({'Price': ['count', 'sum']})
And the output looks like this:
Price
count sum
Item ID
0 4 5.12
1 3 9.78
2 6 14.88
3 6 14.94
4 5 8.50
5 4 16.32
6 2 7.40
7 7 9.31
8 3 11.79
9 4 10.92
10 4 7.16
I want to extract the 'count' and 'sum' columns and put them in a dictionary but I can't seem to drop the 'Item ID' column (Item ID seems to be the index). How do I do this? Please help!!!
A dictionary consists of a series of {key: value} pairs. In the outcome you provided there is no natural key:value mapping; it would come out as something like
{4: 5.12, 3: 9.78, 6: 14.88, 6: 14.94, 5: 8.50, 4: 16.32, 2: 7.40, 7: 9.31, 3: 11.79, 4: 10.92, 4: 7.16}
where the repeated counts would collide as keys.
Alternatively you can create two lists, one from the count column and one from the sum column (e.g. with .tolist()), and put them into a list of tuples: list(zip(list1, list2))
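For example, one way to build those structures from the aggregated frame in the question (a sketch; profitableCount has the MultiIndex columns ('Price', 'count') and ('Price', 'sum') shown above):
counts = profitableCount[('Price', 'count')].tolist()
sums = profitableCount[('Price', 'sum')].tolist()
pairs = list(zip(counts, sums))                        # [(4, 5.12), (3, 9.78), ...]

# Or, keyed by Item ID so nothing collides:
as_dict = profitableCount['Price'].to_dict('index')    # {0: {'count': 4, 'sum': 5.12}, ...}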

Pandas Dataframe splice data into 2 columns and make a number with a comma and integer

I am currently running into two issues.
My dataframe looks like this:
, male_female, no_of_students
0, 24 : 76, "81,120"
1, 33 : 67, "12,270"
2, 50 : 50, "10,120"
3, 42 : 58, "5,120"
4, 12 : 88, "2,200"
What I would like to achieve is this:
, male, female, no_of_students
0, 24, 76, 81120
1, 33, 67, 12270
2, 50, 50, 10120
3, 42, 58, 5120
4, 12, 88, 2200
Basically I want to convert male_female into two columns and no_of_students into a column of integers. I tried a bunch of things, such as converting the no_of_students column to another type with .astype, but nothing seems to work properly, and I couldn't find a smart way of splitting the male_female column either.
Hopefully someone can help me out!
Use str.split with pop to create the new columns from the separator, then strip the stray quote characters, remove the thousands commas with replace and, if necessary, convert to integers:
df[['male','female']] = df.pop('male_female').str.split(' : ', expand=True)
df['no_of_students'] = df['no_of_students'].str.strip('" ').str.replace(',','').astype(int)
df = df[['male','female', 'no_of_students']]
print (df)
male female no_of_students
0 24 76 81120
1 33 67 12270
2 50 50 10120
3 42 58 5120
4 12 88 2200
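As a side note, if the data comes straight from a CSV file, the comma grouping can also be handled at load time with the thousands parameter of pd.read_csv (a sketch; the file name is hypothetical):
import pandas as pd

# thousands=',' makes the parser read the quoted "81,120" as the integer 81120
df = pd.read_csv('students.csv', thousands=',')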

Pandas DataFrame: Complex linear interpolation

I have a dataframe with 4 sections
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I will run through a couple examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0. This is equal to the upper bound for this product, so the simulation value is simply the state_6 value, 60.
Product B - sim_2
The input here is 1.5. the LB to UB range is (1,2), therefore the 6 states are {1,1.2,1.4,1.6,1.8,2}. 1.5 is exactly in the middle of state_3 which has a value of 31 and state 4 which has a value of 41. Therefore the simulation value is 36.
Product C - sim_1
The input here is .61. The LB to UB range is (.5, .625), therefore the 6 states are {.5, .525, .55, .575, .6, .625}. .61 is between state 5 and 6. Specifically, the bucket it falls under is 5*(.61-.5)/(.625-.5)+1 = 5.4 (it is multiplied by 5 because that is the number of intervals; you can calculate it other ways and get the same result). Then to calculate the value we use that bucket in a weighting of the values for state 5 and state 6: (62-52)*(5.4-5)+52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1, so we need to extrapolate the value. We use the same formula as above, just with the values of state 1 and state 2. The bucket would be 5*(0-1)/(2-1)+1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1)+11 = -39.
I've also simplified the problem to try to visualize the solution; my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation although I'm not committed to them specifically.
Bucket = N*(sim_value-LB)/(UB-LB) + 1
where N is the number of intervals
then nLower is the state value directly below the bucket, and nHigher is the state value directly above the bucket. If the bucket is outside the UB/LB, then force nLower and nHigher to be either the first two or last two values.
Final_value = (nHigher - nLower)*(Bucket - state_number_of_nLower) + nLower
To summarize, my question is how I can generate the final results based on the combination of input data provided. The most challenging part to me is how to make the connection from the Bucket number to the nLower and nHigher values.
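To make the formulas concrete, here is a plain-Python sketch of the per-value calculation (the function name and clamping are only illustrative, not the pandas solutions below):
def interp_value(sim, lb, ub, states):
    """Interpolate (or extrapolate) a product value from the 6 state values."""
    n = len(states) - 1                       # number of intervals (5)
    bucket = n * (sim - lb) / (ub - lb) + 1   # Bucket = N*(sim-LB)/(UB-LB) + 1
    lv = min(max(int(bucket), 1), n)          # nLower state number, clamped to 1..5
    hv = lv + 1                               # nHigher state number
    n_lower, n_higher = states[lv - 1], states[hv - 1]
    return (n_higher - n_lower) * (bucket - lv) + n_lower

print(interp_value(0.61, 0.5, 0.625, [12, 22, 32, 42, 52, 62]))  # Product C, sim_1: ~56
print(interp_value(0.0, 1.0, 2.0, [11, 21, 31, 41, 51, 61]))     # Product B, sim_1: -39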
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
Output:
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2 \
0 40 50 60 1.000 0.00 1.0
1 41 51 61 2.000 0.00 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.00 9.0
Bucket1 lv hv nLower nHigher Final_value_1 Bucket2 Final_value_2
0 3.5 5 6 50 60 35.0 6.0 60.0
1 -4.0 3 4 31 41 -39.0 3.5 36.0
2 5.4 5 6 52 62 56.0 9.0 92.0
3 2.0 3 4 33 43 23.0 3.0 33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
buckets = df.iloc[:,-2:].sub(df['Lower_Bound'],axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'],axis=0),axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:,None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:,None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']

pandas combining dataframe

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
java = pickle.load(open('JavaSafe.p','rb')) ##import 2d array
python = pickle.load(open('PythonSafe.p','rb')) ##import 2d array
javaFrame = pd.DataFrame(java,columns=['Town','Java Jobs'])
pythonFrame = pd.DataFrame(python,columns=['Town','Python Jobs'])
javaFrame = javaFrame.sort_values(by='Java Jobs',ascending=False)
pythonFrame = pythonFrame.sort_values(by='Python Jobs',ascending=False)
print(javaFrame,"\n",pythonFrame)
This code comes out with the following:
Town Java Jobs
435 York,NY 3593
212 NewYork,NY 3585
584 Seattle,WA 2080
624 Chicago,IL 1920
301 Boston,MA 1571
...
79 Holland,MI 5
38 Manhattan,KS 5
497 Vernon,IL 5
30 Clayton,MO 5
90 Waukegan,IL 5
[653 rows x 2 columns]
Town Python Jobs
160 NewYork,NY 2949
11 York,NY 2938
349 Seattle,WA 1321
91 Chicago,IL 1312
167 Boston,MA 1117
383 Hanover,NH 5
209 Bulverde,TX 5
203 Salisbury,NC 5
67 Rockford,IL 5
256 Ventura,CA 5
[416 rows x 2 columns]
I want to make a new dataframe that uses the town names as an index and has one column each for Java and Python jobs. However, some of the towns will only have results for one of the languages.
import pandas as pd

javaFrame = pd.DataFrame({'Java Jobs': [3593, 3585, 2080, 1920, 1571, 5, 5, 5, 5, 5],
                          'Town': ['York,NY', 'NewYork,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA',
                                   'Holland,MI', 'Manhattan,KS', 'Vernon,IL', 'Clayton,MO', 'Waukegan,IL']},
                         index=[435, 212, 584, 624, 301, 79, 38, 497, 30, 90])
pythonFrame = pd.DataFrame({'Python Jobs': [2949, 2938, 1321, 1312, 1117, 5, 5, 5, 5, 5],
                            'Town': ['NewYork,NY', 'York,NY', 'Seattle,WA', 'Chicago,IL', 'Boston,MA',
                                     'Hanover,NH', 'Bulverde,TX', 'Salisbury,NC', 'Rockford,IL', 'Ventura,CA']},
                           index=[160, 11, 349, 91, 167, 383, 209, 203, 67, 256])
result = pd.merge(javaFrame, pythonFrame, how='outer').set_index('Town')
# Java Jobs Python Jobs
# Town
# York,NY 3593.0 2938.0
# NewYork,NY 3585.0 2949.0
# Seattle,WA 2080.0 1321.0
# Chicago,IL 1920.0 1312.0
# Boston,MA 1571.0 1117.0
# Holland,MI 5.0 NaN
# Manhattan,KS 5.0 NaN
# Vernon,IL 5.0 NaN
# Clayton,MO 5.0 NaN
# Waukegan,IL 5.0 NaN
# Hanover,NH NaN 5.0
# Bulverde,TX NaN 5.0
# Salisbury,NC NaN 5.0
# Rockford,IL NaN 5.0
# Ventura,CA NaN 5.0
pd.merge will by default join two DataFrames on all columns they share. In this case, javaFrame and pythonFrame share only the Town column, so by default pd.merge joins the two DataFrames on Town.
how='outer' causes pd.merge to use the union of the keys from both frames. In other words, it returns rows whose data come from either javaFrame or pythonFrame, even if only one of them contains the Town. Missing data is filled with NaNs.
Use pd.concat
df = pd.concat([df.set_index('Town') for df in [javaFrame, pythonFrame]], axis=1)
Java Jobs Python Jobs
Boston,MA 1571.0 1117.0
Bulverde,TX NaN 5.0
Chicago,IL 1920.0 1312.0
Clayton,MO 5.0 NaN
Hanover,NH NaN 5.0
Holland,MI 5.0 NaN
Manhattan,KS 5.0 NaN
NewYork,NY 3585.0 2949.0
Rockford,IL NaN 5.0
Salisbury,NC NaN 5.0
Seattle,WA 2080.0 1321.0
Ventura,CA NaN 5.0
Vernon,IL 5.0 NaN
Waukegan,IL 5.0 NaN
York,NY 3593.0 2938.0
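For completeness, DataFrame.join gives the same outer alignment once both frames are indexed by Town (a small sketch, equivalent to the concat above):
result = javaFrame.set_index('Town').join(pythonFrame.set_index('Town'), how='outer')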
