PySpark distributed matrix sum of non-null values - python

I'm attempting to convert a pandas "dot matrix nansum" function to pyspark.
The goal is to convert this table into a matrix of non-null column sums:
    dan  ste  bob
t1  na   2    na
t2  2    na   1
t3  2    1    na
t4  1    na   2
t5  na   1    2
t6  2    1    na
t7  1    na   2
For example, when 'dan' is not-null (t2, t3, t4, t6, t7) the sum of 'ste' is 2 and the sum of 'bob' is 5. When 'ste' is not-null the sum of 'dan' is 4. (I zeroed out the diagonal, but that isn't required.)
     dan  ste  bob
dan  0    2    5
ste  4    0    2
bob  4    1    0
The calculation must remain distributed (no toPandas).
Here's the pandas version which works wonderfully: https://stackoverflow.com/a/46871184/7542835
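One way this could be sketched in PySpark (an untested outline, not a confirmed answer): build one conditional sum per ordered column pair inside a single agg, which keeps the computation distributed. The DataFrame construction below is a hypothetical rebuild of the table above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction of the input table.
rows = [(None, 2, None), (2, None, 1), (2, 1, None), (1, None, 2),
        (None, 1, 2), (2, 1, None), (1, None, 2)]
df = spark.createDataFrame(rows, "dan int, ste int, bob int")

cols = df.columns

# For each ordered pair (i, j), sum column j over the rows where column i is not null.
# sum() already skips nulls in j, so this mirrors the pandas nansum behaviour.
exprs = [
    F.sum(F.when(F.col(i).isNotNull(), F.col(j))).alias(f"{i}_{j}")
    for i in cols for j in cols if i != j
]

# A single distributed aggregation; the one-row result can then be reshaped into the matrix.
df.agg(*exprs).show()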

Related

Reshaping long format dataframe to wide format according to the value of the elements in columns

Suppose I have the following pandas dataframe X in long format:
rank  group  ID
1     1      3
2     1      1
3     1      2
4     1      4
1     2      2
2     2      1
1     3      1
2     3      4
3     3      3
4     3      2
1     4      1
1     5      3
2     5      2
3     5      1
1     6      1
And I would like to reshape it to the following wide format according to the following rules:
Split the ID column into 4 columns n1, n2, n3, n4 representing the 4 persons in the ID column.
For column ni, i = 1, 2, 3, 4, the entry in the j-th row is 5 minus the rank of the i-th person in the j-th group. For example, in group 3, person 4 gets rank 2, hence the 3rd row of the n4 column is 5 - 2 = 3.
If person i doesn't exist in group j, then the j-th entry in column ni is NA.
So basically I want to create a "score system" for person i according to the ranking: the person who is ranked 1 gets the highest score and the person who is ranked 4 gets the lowest score (or NA if there aren't that many people in the group).
i.e.:
group  n1  n2  n3  n4
1      3   2   4   1
2      3   4   NA  NA
3      4   1   2   3
4      4   NA  NA  NA
5      2   3   4   NA
6      4   NA  NA  NA
I hope I have explained it in an understandable manner. Thank you.
Reshape the dataframe using pivot, then subtract the values from 5 with rsub and add a prefix of n to the column names:
df.pivot(index='group', columns='ID', values='rank').rsub(5).add_prefix('n')
ID       n1   n2   n3   n4
group
1       3.0  2.0  4.0  1.0
2       3.0  4.0  NaN  NaN
3       4.0  1.0  2.0  3.0
4       4.0  NaN  NaN  NaN
5       2.0  3.0  4.0  NaN
6       4.0  NaN  NaN  NaN
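For reference, a self-contained run of that one-liner, rebuilding the example frame from the question by hand (the construction itself is just a sketch):

import pandas as pd

df = pd.DataFrame({
    'rank':  [1, 2, 3, 4, 1, 2, 1, 2, 3, 4, 1, 1, 2, 3, 1],
    'group': [1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 5, 5, 5, 6],
    'ID':    [3, 1, 2, 4, 2, 1, 1, 4, 3, 2, 1, 3, 2, 1, 1],
})

# 5 minus the rank, one column per person, NaN where a person is missing from a group.
wide = df.pivot(index='group', columns='ID', values='rank').rsub(5).add_prefix('n')
print(wide)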

How to match rows and filter using column values in a pandas DataFrame

For example:
I have
Name  Code  State  Unit
John  +2    AZ     3
Mike  +3    UT     3
Mike  +3    UT     4
Jack  +4    KY     6
Jack  +5    KY     6
I need to remove the row with the lowest Unit from the dataframe if all the other columns match:
Name  Code  State  Unit
John  +2    AZ     3
Mike  +3    UT     4
Jack  +4    KY     6
Jack  +5    KY     6
If you need to remove only the first lowest value, first sort the values and then use DataFrame.duplicated with boolean indexing:
df = df.sort_values('Unit')
m1 = df.duplicated(['Name','Code','State'])
m2 = df.duplicated(['Name','Code','State'], keep=False)
df = df[m1 | ~m2]
print (df)
   Name  Code State  Unit
0  John     2    AZ     3
2  Mike     3    UT     4
3  Jack     4    KY     6
4  Jack     5    KY     6
If you need to remove all lowest values, you can compare against the minimal value per group for the first mask:
print (df)
   Name  Code State  Unit
0  John     2    AZ     3
1  Mike     3    UT     3
2  Mike     3    UT     3
3  Mike     3    UT     4
4  Jack     4    KY     6
5  Jack     5    KY     6
m1 = df.groupby(['Name','Code','State'])['Unit'].transform('min').eq(df['Unit'])
m2 = df.duplicated(['Name','Code','State'], keep=False)
df = df[~m1 | ~m2]
print (df)
   Name  Code State  Unit
0  John     2    AZ     3
3  Mike     3    UT     4
4  Jack     4    KY     6
5  Jack     5    KY     6
EDIT:
If you need to keep all rows matching the maximal Unit per group of all the other columns:
m1 = df.groupby(['Name','Code','State'])['Unit'].transform('max').eq(df['Unit'])
df2 = df[m1]
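A quick, self-contained check of that edit against the sample above (a sketch; df is just the sample data rebuilt by hand):

import pandas as pd

df = pd.DataFrame({
    'Name':  ['John', 'Mike', 'Mike', 'Mike', 'Jack', 'Jack'],
    'Code':  [2, 3, 3, 3, 4, 5],
    'State': ['AZ', 'UT', 'UT', 'UT', 'KY', 'KY'],
    'Unit':  [3, 3, 3, 4, 6, 6],
})

# Keep only the rows whose Unit equals the per-group maximum.
m1 = df.groupby(['Name', 'Code', 'State'])['Unit'].transform('max').eq(df['Unit'])
df2 = df[m1]
print(df2)   # rows 0, 3, 4 and 5 remain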

Applying .mean() to a grouped data with a condition

I have a df that looks like this:
Day  Country  Type  Product  Cost
Mon  US       1     a1       0
Mon  US       2     a1       5
Mon  US       3     a1       6
Mon  CA       1     a1       8
Mon  CA       2     a1       0
Mon  CA       3     a1       1
I am trying to make it to this:
Day  Country  Type  Product  Cost  Average
Mon  US       1     a1       0     (5+6)/2
Mon  US       2     a1       5     (5+6)/2
Mon  US       3     a1       6     (5+6)/2
Mon  CA       1     a1       8     (8+1)/2
Mon  CA       2     a1       0     (8+1)/2
Mon  CA       3     a1       1     (8+1)/2
The idea is to group it by Country and Product and get the average Cost, but only using the Costs that are > 0.
What I've tried:
np.where(df['Cost']>0, df.loc[df.groupby(['Country','Product'])]['Cost'].mean())
But I get:
ValueError: Cannot index with multidimensional key
What is the best-practice way of applying built-in functions like .mean(), .max(), etc. to a grouped pandas dataframe with a filter?
The first idea is to replace 0 with NaN and then use GroupBy.transform with mean; missing values are omitted by default:
print (df.assign(new = df['Cost'].where(df['Cost'] > 0)))
  Day Country  Type Product  Cost  new
0 Mon      US     1      a1     0  NaN
1 Mon      US     2      a1     5  5.0
2 Mon      US     3      a1     6  6.0
3 Mon      CA     1      a1     8  8.0
4 Mon      CA     2      a1     0  NaN
5 Mon      CA     3      a1     1  1.0
df['Average'] = (df.assign(new = df['Cost'].where(df['Cost'] > 0))
                   .groupby(['Country','Product'])['new']
                   .transform('mean'))
print (df)
  Day Country  Type Product  Cost  Average
0 Mon      US     1      a1     0      5.5
1 Mon      US     2      a1     5      5.5
2 Mon      US     3      a1     6      5.5
3 Mon      CA     1      a1     8      4.5
4 Mon      CA     2      a1     0      4.5
5 Mon      CA     3      a1     1      4.5
Or first filter, aggregate the mean and assign it back with DataFrame.join:
s = df[df["Cost"] > 0].groupby(['Country','Product'])['Cost'].mean().rename('Average')
df = df.join(s, on=['Country','Product'])
print (df)
  Day Country  Type Product  Cost  Average
0 Mon      US     1      a1     0      5.5
1 Mon      US     2      a1     5      5.5
2 Mon      US     3      a1     6      5.5
3 Mon      CA     1      a1     8      4.5
4 Mon      CA     2      a1     0      4.5
5 Mon      CA     3      a1     1      4.5
Try this:
df[df["Cost"] > 0].groupby(['Country','Product'])["Cost"].mean()
It keeps only the rows where Cost is greater than zero, groups them, and then takes the mean.
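Note that this returns only the per-group means, not the per-row Average column. One compact variant that does both in one step (a sketch, not part of the original answers) filters inside a transform:

df['Average'] = (df.groupby(['Country', 'Product'])['Cost']
                   .transform(lambda s: s[s > 0].mean()))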

How to drop 1st level index and then merge the remaining index values with custom logic for a pd DataFrame?

Say I have a MultiIndex DataFrame like so:
                   price  volume
year product city
2010 A       LA       10       7
     B       SF        7       9
     C       NY        7       6
             LA       18      21
             SF        4       8
2011 A       LA       13       5
     B       SF        2       4
     C       NY        9       3
             SF        2       0
I want to do a somewhat complex merge where the first level of the DataFrame index (year) is dropped and the duplicates in the now first level index (product) in the DataFrame get merged according to some custom logic. In this case I would like to be able to set the price column to use the value from the 2010 outer index and the volume column to use the values from the 2011 outer index, but I would like a general solution that can be applied to more columns should they exist.
Final DataFrame would look like this, where the price values are those from the 2010 index and the volume values are those from the 2011 index, where missing values are filled with NaNs.
              price  volume
product city
A       LA       10       5
B       SF        7       4
C       NY        7       3
        LA       18     NaN
        SF        4       0
You can select by the first level with DataFrame.xs and then concat:
df = pd.concat([df.xs(2010)['price'], df.xs(2011)['volume']], axis=1)
It is also possible to use loc:
df = pd.concat([df.loc[2010, 'price'], df.loc[2011, 'volume']], axis=1)
print (df)
              price  volume
product city
A       LA       10     5.0
B       SF        7     4.0
C       LA       18     NaN
        NY        7     3.0
        SF        4     0.0
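For the more general case with additional columns, one possible sketch (assuming a hypothetical mapping from each column to the year it should be taken from):

import pandas as pd

# Hypothetical: which year each column should come from.
year_for_column = {'price': 2010, 'volume': 2011}

# Take each column's cross-section for its year and line them up on (product, city).
df_out = pd.concat(
    {col: df.xs(year)[col] for col, year in year_for_column.items()},
    axis=1,
)
print(df_out)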

How to classify one column's values using another dataframe?

I am trying to classify one dataframe based on a dataframe of standards.
The standard is df1, and I want to classify df2 based on df1.
df1:
PAUCode SubClass
1 RA
2 RB
3 CZ
df2:
PAUCode SubClass
2 non
2 non
2 non
3 non
1 non
2 non
3 non
I want to get the df2 like as below:
expected result:
PAUCode SubClass
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
Option 1
fillna
df2 = df2.replace('non', np.nan)
df2.set_index('PAUCode').SubClass\
.fillna(df1.set_index('PAUCode').SubClass)
PAUCode
2 RB
2 RB
2 RB
3 CZ
1 RA
2 RB
3 CZ
Name: SubClass, dtype: object
Option 2
map
df2.PAUCode.map(df1.set_index('PAUCode').SubClass)
0 RB
1 RB
2 RB
3 CZ
4 RA
5 RB
6 CZ
Name: PAUCode, dtype: object
Option 3
merge
df2[['PAUCode']].merge(df1, on='PAUCode')
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 2 RB
4 3 CZ
5 3 CZ
6 1 RA
Note here the order of the data changes, but the answer remains the same.
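If keeping df2's original row order matters, a left merge should preserve it (a small variant sketch of the same idea):

df2[['PAUCode']].merge(df1, on='PAUCode', how='left')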
Let us use reindex:
df1.set_index('PAUCode').reindex(df2.PAUCode).reset_index()
Out[9]:
PAUCode SubClass
0 2 RB
1 2 RB
2 2 RB
3 3 CZ
4 1 RA
5 2 RB
6 3 CZ
