Pandas code to PySpark with groupby operations - python

In pandas I managed to transform the following, which basically splits the first non-null value evenly across the following null values:
[100, None, None, 40, None, 120]
into
[33.33, 33.33, 33.33, 20, 20, 120]
Thanks to the solution given here, I managed to produce the following code for my specific task:
cols = ['CUSTOMER', 'WEEK', 'PRODUCT_ID']
colsToSplit = ['VOLUME', 'REVENUE']
# resample each CUSTOMER/PRODUCT_ID group to a weekly frequency
# (WEEK must already be a datetime column for asfreq('W') to work)
df = pd.concat([
    d.asfreq('W')
    for _, d in df.set_index('WEEK').groupby(['CUSTOMER', 'PRODUCT_ID'])
]).reset_index()
df[cols] = df[cols].ffill()
# size of each block of rows that starts with a non-null VOLUME
df['nb_nan'] = df.groupby(['CUSTOMER', 'PRODUCT_ID', df['VOLUME'].notnull().cumsum()])['VOLUME'].transform('size')
# forward-fill within each group and split the value evenly over its block
df[colsToSplit] = df.groupby(['CUSTOMER', 'PRODUCT_ID'])[colsToSplit].ffill()[colsToSplit].div(df.nb_nan, axis=0)
df
My full dataframe looks like this:
df = pd.DataFrame(map(list, zip(*[['a', 'a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
                                  ['2018-01-14', '2018-01-28', '2018-01-14', '2018-01-28', '2018-01-14', '2018-02-04', '2018-02-11', '2018-01-28', '2018-02-11'],
                                  [1, 1, 2, 2, 1, 1, 1, 3, 3],
                                  [50, 44, 22, 34, 42, 41, 43, 12, 13],
                                  [15, 14, 6, 11, 14, 13.5, 13.75, 3, 3.5]])),
                  columns=['CUSTOMER', 'WEEK', 'PRODUCT_ID', 'VOLUME', 'REVENUE'])
df
Out[16]:
CUSTOMER WEEK PRODUCT_ID VOLUME REVENUE
0 a 2018-01-14 1 50 15.00
1 a 2018-01-28 1 44 14.00
2 a 2018-01-14 2 22 6.00
3 a 2018-01-28 2 34 11.00
4 b 2018-01-14 1 42 14.00
5 b 2018-02-04 1 41 13.50
6 b 2018-02-11 1 43 13.75
7 c 2018-01-28 3 12 3.00
8 c 2018-02-11 3 13 3.50
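One note not in the original post: for the asfreq('W') resampling above to run on this sample frame, WEEK has to be a real datetime column first, e.g.:
# convert the string weeks to timestamps so set_index('WEEK').asfreq('W') works
df['WEEK'] = pd.to_datetime(df['WEEK'])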
In this case, for example, the result would be:
CUSTOMER WEEK PRODUCT_ID VOLUME REVENUE
a 2018-01-14 1 25 7.50
a 2018-01-21 1 25 7.50
a 2018-01-28 1 44 14.00
a 2018-01-14 2 11 3.00
a 2018-01-21 2 11 3.00
a 2018-01-28 2 34 11.00
b 2018-01-14 1 14 4.67
b 2018-01-21 1 14 4.67
b 2018-01-28 1 14 4.67
b 2018-02-04 1 41 13.50
b 2018-02-11 1 43 13.75
c 2018-01-28 3 6 1.50
c 2018-02-04 3 6 1.50
c 2018-02-11 3 13 3.50
Sadly, my dataframe is way too big for further use and for joins with other datasets, so I would like to try this in Spark. I checked out many tutorials that cover most of these steps in PySpark, but none of them really showed how to include the groupby part: I found how to do a transform('size'), but not how to do df.groupby(...).transform('size'), nor how to combine all of my steps.
Is there maybe a tool that can translate pandas to PySpark? Otherwise, could I have a clue on how to translate this piece of code? Thanks, maybe I'm just overcomplicating this.
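For what it's worth, here is a minimal PySpark sketch of the groupby/transform part only. It assumes the missing weekly rows have already been inserted (the asfreq('W') resampling would need a separate step, e.g. a join against a generated calendar of weeks), and the column and variable names are only illustrative:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)  # df already contains the missing weekly rows

# equivalent of df['VOLUME'].notnull().cumsum() within each CUSTOMER/PRODUCT_ID:
# a running count of non-null VOLUME values defines a "block" per original value
running = (Window.partitionBy('CUSTOMER', 'PRODUCT_ID')
                 .orderBy('WEEK')
                 .rowsBetween(Window.unboundedPreceding, Window.currentRow))
sdf = sdf.withColumn(
    'block',
    F.sum(F.when(F.col('VOLUME').isNotNull(), 1).otherwise(0)).over(running))

# equivalent of groupby(...).transform('size'): count the rows in each block
block_w = Window.partitionBy('CUSTOMER', 'PRODUCT_ID', 'block')
sdf = sdf.withColumn('nb_rows', F.count(F.lit(1)).over(block_w))

# forward-fill within each block and split the value evenly over its rows
ffill_w = (Window.partitionBy('CUSTOMER', 'PRODUCT_ID', 'block')
                 .orderBy('WEEK')
                 .rowsBetween(Window.unboundedPreceding, Window.currentRow))
for c in ['VOLUME', 'REVENUE']:
    sdf = sdf.withColumn(c, F.last(c, ignorenulls=True).over(ffill_w) / F.col('nb_rows'))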

Related

Pandas - Identify non-unique rows, grouping any pairs except in a particular case

This is an extension to this question.
I am trying to figure out a non-looping way to identify (an auto-incrementing int would be ideal) the non-unique groups of rows (a group can contain one or more rows) within each TDateID, GroupID combination, except that I need it to ignore a paired grouping if all of its rows have Structure = "s".
Here is an example DataFrame:
Index  Cents  Structure  SD_YF  TDateID  GroupID
10     182.5  s          2.1    0        0
11     182.5  s          2.1    0        0
12     153.5  s          1.05   0        1
13     153.5  s          1.05   0        1
14     43     p          11     1        2
15     43     p          11     1        2
4      152    s          21     1        2
5      152    s          21     1        2
21     53     s          13     2        3
22     53     s          13     2        3
24     252    s          25     2        3
25     252    s          25     2        3
In pandas form:
df = pd.DataFrame({'Index': [10, 11, 12, 13, 14, 15, 4, 5, 21, 22, 24, 25],
                   'Cents': [182.5, 182.5, 153.5, 153.5, 43.0, 43.0,
                             152.0, 152.0, 53.0, 53.0, 252.0, 252.0],
                   'Structure': ['s', 's', 's', 's', 'p', 'p', 's', 's', 's', 's', 's', 's'],
                   'SD_YF': [2.1, 2.1, 1.05, 1.05, 11.0, 11.0,
                             21.0, 21.0, 13.0, 13.0, 25.0, 25.0],
                   'TDateID': [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
                   'GroupID': [0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]})
My ideal output would be:
Index  Cents  Structure  SD_YF  TDateID  GroupID  UniID
10     182.5  s          2.1    0        0        1
11     182.5  s          2.1    0        0        2
12     153.5  s          1.05   0        1        3
13     153.5  s          1.05   0        1        4
14     43     p          11     1        2        5
15     43     p          11     1        2        6
4      152    s          21     1        2        5
5      152    s          21     1        2        6
21     53     s          13     2        3        7
22     53     s          13     2        3        8
24     252    s          25     2        3        9
25     252    s          25     2        3        10
Note UniID 5, which shows how index 14 and index 4 are paired together; the same goes for UniID 6. I hope that makes sense!
Using the following code worked great, except that it would need to be adapted to handle the condition that a pairing is ignored when all rows in the grouping have Structure = "s".
df['UniID'] = (df['GroupID']
               + df.groupby('GroupID').ngroup().add(1)
               + df.groupby(['GroupID', 'Cents', 'SD_YF']).cumcount()
               )
Do the IDs need to be consecutive?
If the occurrence of "duplicate" rows is small, looping over just those groups might not be too bad.
First assign an ID to all the pairs using the code you have (and add an indicator column marking that a row belongs to a group). Then select all the rows that are in groups (using the indicator column) and iterate over those groups. If a group is all "s", reassign the IDs so that each row gets a unique one, roughly as sketched below.
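A rough sketch of that loop, under the assumption that 'UniID' has already been assigned by the pairing code above and that 'in_group' is an indicator column marking the rows that were paired (both names are illustrative, not from the original post):
# hypothetical sketch: break up any pair whose rows are all "s"
next_id = df['UniID'].max() + 1
for uid, grp in df[df['in_group']].groupby('UniID'):
    if len(grp) > 1 and (grp['Structure'] == 's').all():
        # all-"s" pair: give every row a fresh, unique ID instead
        df.loc[grp.index, 'UniID'] = range(next_id, next_id + len(grp))
        next_id += len(grp)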
The tricky thing is to imagine how this should generalize. Here is my understanding: create a sequential count ignoring the p, then back fill those.
m = df['Structure'].eq('s')
# m.cumsum() numbers the 's' rows sequentially; for the non-'s' rows, add their own
# running count on top of the 's' count so far (mask zeroes that term for 's' rows)
df['UniID'] = m.cumsum() + (~m).cumsum().mask(m, 0)
Output:
Index Cents Structure SD_YF TDateID GroupID UniID
0 10 182.5 s 2.10 0 0 1
1 11 182.5 s 2.10 0 0 2
2 12 153.5 s 1.05 0 1 3
3 13 153.5 s 1.05 0 1 4
4 14 43.0 p 11.00 1 2 5
5 15 43.0 p 11.00 1 2 6
6 4 152.0 s 21.00 1 2 5
7 5 152.0 s 21.00 1 2 6
8 21 53.0 s 13.00 2 3 7
9 22 53.0 s 13.00 2 3 8
10 24 252.0 s 25.00 2 3 9
11 25 252.0 s 25.00 2 3 10

How to sort a big dataframe by two columns?

I have a big dataframe which records all the price info for a stock market.
In this dataframe there are two index columns, 'tme' (time) and 'con'.
Here is an example:
In [15]: df = pd.DataFrame(np.reshape(range(20), (5,4)))
In [16]: df
Out[16]:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
4 16 17 18 19
In [17]: df.columns = ['open', 'high', 'low', 'close']
In [18]: df['tme'] = ['9:00','9:00', '9:01', '9:01', '9:02']
In [19]: df['con'] = ['a', 'b', 'a', 'b', 'a']
In [20]: df
Out[20]:
open high low close tme con
0 0 1 2 3 9:00 a
1 4 5 6 7 9:00 b
2 8 9 10 11 9:01 a
3 12 13 14 15 9:01 b
4 16 17 18 19 9:02 a
What I want is some dataframes like this:
## the close dataframe, which contains only the close info, indexed by 'tme' and 'con'
Out[31]:
a b
9:00 3 7.0
9:01 11 15.0
9:02 19 NaN
How can I get this dataframe?
Use df.pivot:
In [117]: df.pivot(index='tme', columns='con', values='close')
Out[117]:
con a b
tme
9:00 3.0 7.0
9:01 11.0 15.0
9:02 19.0 NaN
One solution is to use pivot_table. Try this out:
df.pivot_table(index=df['tme'], columns='con', values='close')
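One difference worth noting (an addition, not part of the original answers): df.pivot raises a ValueError if a ('tme', 'con') pair occurs more than once, while pivot_table aggregates the duplicates (mean by default). If duplicate timestamps are possible and the last quote should win, something like this can be used:
# keep the last close per (tme, con) pair instead of the default mean aggregation
close = df.pivot_table(index='tme', columns='con', values='close', aggfunc='last')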

Using pandas cut function with groupby and group-specific bins

I have the following sample DataFrame
import pandas as pd
import numpy as np
df = pd.DataFrame({'Tag': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C'],
                   'ID': [11, 12, 16, 19, 14, 9, 4, 13, 6, 18, 21, 1, 2],
                   'Value': [1, 13, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, 10]})
to which I add the percentage of the Value using
df['Percent_value'] = df['Value'].rank(method='dense', pct=True)
and add the Order using pd.cut() with pre-defined percentage bins
percentage = np.array([10, 20, 50, 70, 100])/100
df['Order'] = pd.cut(df['Percent_value'], bins=np.insert(percentage, 0, 0), labels = [1,2,3,4,5])
which gives
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 2
5 B 9 3 0.230769 3
6 B 4 4 0.307692 3
7 C 13 5 0.384615 3
8 C 6 6 0.461538 3
9 C 18 7 0.538462 4
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 5
My Question
Now, instead of having a single percentage array (bins) for all Tags (groups), I have a separate percentage array for each Tag group, i.e. A, B and C. How can I apply df.groupby('Tag') and then apply pd.cut() using different percentage bins for each group, taken from the following dictionary? Is there a direct way that avoids the for loop I use below?
percentages = {'A': np.array([10, 20, 50, 70, 100])/100,
               'B': np.array([20, 40, 60, 90, 100])/100,
               'C': np.array([30, 50, 60, 80, 100])/100}
Desired outcome (Note: Order is now computed for each Tag using different bins):
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
My Attempt
orders = []
for k, g in df.groupby('Tag'):
    percentage = percentages[k]
    g['Order'] = pd.cut(g['Percent_value'], bins=np.insert(percentage, 0, 0), labels=[1, 2, 3, 4, 5])
    orders.append(g)
df_final = pd.concat(orders, axis=0, join='outer')
You can apply pd.cut within groupby:
df['Order'] = df.groupby('Tag').apply(lambda x: pd.cut(x['Percent_value'], bins=np.insert(percentages[x.name],0,0), labels=[1,2,3,4,5])).reset_index(drop = True)
Tag ID Value Percent_value Order
0 A 11 1 0.076923 1
1 A 12 13 1.000000 5
2 A 16 11 0.846154 5
3 B 19 12 0.923077 5
4 B 14 2 0.153846 1
5 B 9 3 0.230769 2
6 B 4 4 0.307692 2
7 C 13 5 0.384615 2
8 C 6 6 0.461538 2
9 C 18 7 0.538462 3
10 C 21 8 0.615385 4
11 C 1 9 0.692308 4
12 C 2 10 0.769231 4
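A side note that is not from the original answer: the trailing reset_index(drop=True) relies on df already being ordered by Tag. An index-aligned variant (a sketch using the same percentages dict) avoids that assumption:
# return a Series per group and let pandas align it on the original row index,
# so the row order of df does not matter
def cut_group(g):
    bins = np.insert(percentages[g.name], 0, 0)
    return pd.cut(g['Percent_value'], bins=bins, labels=[1, 2, 3, 4, 5])

df['Order'] = df.groupby('Tag', group_keys=False).apply(cut_group)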

Pandas: moving data from two dataframes to another with tuple index

I have three dataframes like the following:
final_df
other ref
(2014-12-24 13:20:00-05:00, a) NaN NaN
(2014-12-24 13:40:00-05:00, b) NaN NaN
(2018-07-03 14:00:00-04:00, d) NaN NaN
ref_df
a b c d
2014-12-24 13:20:00-05:00 1 2 3 4
2014-12-24 13:40:00-05:00 2 3 4 5
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 9 10 11 12
2019-07-03 13:10:00-04:00 ..............
other_df
a b c d
2014-12-24 13:20:00-05:00 10 20 30 40
2014-12-24 13:40:00-05:00 20 30 40 50
2017-11-24 13:10:00-05:00 ..............
2018-07-03 13:20:00-04:00 ..............
2018-07-03 13:25:00-04:00 ..............
2018-07-03 14:00:00-04:00 90 100 110 120
2019-07-03 13:10:00-04:00 ..............
And I need to replace the NaN values in my final_df with the corresponding values from the related dataframe, to end up with this:
other ref
(2014-12-24 13:20:00-05:00, a) 10 1
(2014-12-24 13:40:00-05:00, b) 30 3
(2018-07-03 14:00:00-04:00, d) 110 11
How can I get it?
pandas.DataFrame.lookup
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
map and get
For when you have missing bits
final_df['ref'] = list(map(ref_df.stack().get, final_df.index))
final_df['other'] = list(map(other_df.stack().get, final_df.index))
Demo
Setup
idx = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'd')])
final_df = pd.DataFrame(index=idx, columns=['other', 'ref'])

ref_df = pd.DataFrame([
    [1,  2,  3,  4],
    [2,  3,  4,  5],
    [9, 10, 11, 12]
], [1, 2, 3], ['a', 'b', 'c', 'd'])

other_df = pd.DataFrame([
    [10,  20,  30,  40],
    [20,  30,  40,  50],
    [90, 100, 110, 120]
], [1, 2, 3], ['a', 'b', 'c', 'd'])

print(final_df, ref_df, other_df, sep='\n\n')
other ref
1 a NaN NaN
2 b NaN NaN
3 d NaN NaN
a b c d
1 1 2 3 4
2 2 3 4 5
3 9 10 11 12
a b c d
1 10 20 30 40
2 20 30 40 50
3 90 100 110 120
Result
final_df['ref'] = ref_df.lookup(*zip(*final_df.index))
final_df['other'] = other_df.lookup(*zip(*final_df.index))
final_df
other ref
1 a 10 1
2 b 30 3
3 d 120 12
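Worth noting (an addition, not from the original answer): DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer pandas, a rough equivalent using positional indexing (assuming every (index, column) pair in final_df.index exists in the looked-up frame, as in this demo) is:
# translate the index tuples into positional row/column indexers
rows, cols = zip(*final_df.index)
ri = ref_df.index.get_indexer(list(rows))
ci = ref_df.columns.get_indexer(list(cols))
final_df['ref'] = ref_df.to_numpy()[ri, ci]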
Another solution that can work with missing dates in ref_df and other_df:
index = pd.MultiIndex.from_tuples(final_df.index)
ref = ref_df.stack().rename('ref')
other = other_df.stack().rename('other')
result = pd.DataFrame(index=index).join(ref).join(other)
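If the original final_df (with its tuple index) should keep its shape, the joined values can be copied back positionally, since result is built in the same row order as final_df.index:
# copy the looked-up values back into the original frame, row by row
final_df['ref'] = result['ref'].to_numpy()
final_df['other'] = result['other'].to_numpy()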

Groupby two columns and print different quantiles as separate columns

Here is a reproducible example:
import pandas as pd
df = pd.DataFrame([['Type A', 'Event1', 1, 2, 3], ['Type A', 'Event1', 4, 5, 6], ['Type A', 'Event1', 7, 8, 9],
                   ['Type A', 'Event2', 10, 11, 12], ['Type A', 'Event2', 13, 14, 15], ['Type A', 'Event2', 16, 17, 18],
                   ['Type B', 'Event1', 19, 20, 21], ['Type B', 'Event1', 22, 23, 24], ['Type B', 'Event1', 25, 26, 27],
                   ['Type B', 'Event2', 28, 29, 30], ['Type B', 'Event2', 31, 32, 33], ['Type B', 'Event2', 34, 35, 36]])
df.columns = ['TypeName', 'EventNumber', 'PricePart1', 'PricePart2', 'PricePart3']
print(df)
Gives:
TypeName EventNumber PricePart1 PricePart2 PricePart3
0 Type A Event1 1 2 3
1 Type A Event1 4 5 6
2 Type A Event1 7 8 9
3 Type A Event2 10 11 12
4 Type A Event2 13 14 15
5 Type A Event2 16 17 18
6 Type B Event1 19 20 21
7 Type B Event1 22 23 24
8 Type B Event1 25 26 27
9 Type B Event2 28 29 30
10 Type B Event2 31 32 33
11 Type B Event2 34 35 36
Here is what I've tried:
df['Average'] = df[['PricePart1', 'PricePart2', 'PricePart3']].mean(axis = 1)
print(df)
TypeName EventNumber PricePart1 PricePart2 PricePart3 Average
0 Type A Event1 1 2 3 2.0
1 Type A Event1 4 5 6 5.0
2 Type A Event1 7 8 9 8.0
3 Type A Event2 10 11 12 11.0
4 Type A Event2 13 14 15 14.0
5 Type A Event2 16 17 18 17.0
6 Type B Event1 19 20 21 20.0
7 Type B Event1 22 23 24 23.0
8 Type B Event1 25 26 27 26.0
9 Type B Event2 28 29 30 29.0
10 Type B Event2 31 32 33 32.0
11 Type B Event2 34 35 36 35.0
Now that I have this new column called Average, I can group by the TypeName and EventNumber columns and find the 25th and 50th percentiles using this piece of code:
print(df.groupby(['TypeName', 'EventNumber'])['Average'].quantile([0.25, 0.50]).reset_index())
What I have:
TypeName EventNumber level_2 Average
0 Type A Event1 0.25 3.5
1 Type A Event1 0.50 5.0
2 Type A Event2 0.25 12.5
3 Type A Event2 0.50 14.0
4 Type B Event1 0.25 21.5
5 Type B Event1 0.50 23.0
6 Type B Event2 0.25 30.5
7 Type B Event2 0.50 32.0
I want the level_2 values as separate columns, holding the values from the Average column, like in the output DataFrame I've created:
df1 = pd.DataFrame([['Type A', 'Event1', 3.5, 5], ['Type A', 'Event2', 12.5, 14], ['Type B', 'Event1', 21.5, 23], ['Type B', 'Event2', 30.5, 32]])
df1.columns = ['TypeName', 'EventNumber', '0.25', '0.50']
print(df1)
What I want:
TypeName EventNumber 0.25 0.50
0 Type A Event1 3.5 5
1 Type A Event2 12.5 14
2 Type B Event1 21.5 23
3 Type B Event2 30.5 32
I'm fairly sure this is a duplicate of something, but I have searched through StackOverflow and not found my answer, because of the difficulty of wording the question (or maybe I'm just stupid).
Use unstack with reset_index:
df = (df.groupby(['TypeName', 'EventNumber'])['Average']
        .quantile([0.25, 0.50])
        .unstack()
        .reset_index())
print(df)
TypeName EventNumber 0.25 0.5
0 Type A Event1 3.5 5.0
1 Type A Event2 12.5 14.0
2 Type B Event1 21.5 23.0
3 Type B Event2 30.5 32.0
Syntactic-sugar solution: the new Average column is not necessary; it is possible to use groupby with three Series:
s = df[['PricePart1', 'PricePart2', 'PricePart3']].mean(axis=1)
df = (s.groupby([df['TypeName'], df['EventNumber']])
       .quantile([0.25, 0.50])
       .unstack()
       .reset_index())
print(df)
TypeName EventNumber 0.25 0.5
0 Type A Event1 3.5 5.0
1 Type A Event2 12.5 14.0
2 Type B Event1 21.5 23.0
3 Type B Event2 30.5 32.0
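A small follow-up not in the original answers: unstack leaves the quantile levels as float column labels (0.25 and 0.5). If the string column names from the desired output are preferred, they can be renamed afterwards:
# rename the float quantile labels to the string column names used in df1
df = df.rename(columns={0.25: '0.25', 0.5: '0.50'})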
