Given the following example dataframe:
advertiser_id | name    | amount      | total               | max_total_advertiser
4061          | source1 | -434.955284 | -354882.75336200005 | -355938.53950700007
4061          | source2 | -594.012216 | -355476.76557800005 | -355938.53950700007
4061          | source3 | -461.773929 | -355938.53950700007 | -355938.53950700007
I need to sum the amount and the max_total_advertiser fields in order to get the correct total value in each row, and I need this running total for every group partitioned by advertiser_id. (The total column in the initial dataframe is incorrect, which is why I want to recalculate it.)
Something like this should work:
w = Window.partitionBy("advertiser_id").orderBy("advertiser_id")
df.withColumn("total_aux", when( lag("advertiser_id").over(w) == col("advertiser_id"), lag("total_aux").over(w) + col("amount") ).otherwise( col("max_total_advertiser") + col("amount") ))
This lag("total_aux") is not working because the column is not generated yet, that's what I want to achieve, if it is the first row in the group, sum the columns in the same row if not sum the previous obtained value with the current amount field.
Example output:
advertiser_id | name    | amount      | total_aux
4061          | source1 | -434.955284 | -356373.494791
4061          | source2 | -594.012216 | -356967.507007
4061          | source3 | -461.773929 | -357429.280936
Thanks.
I assume that name is a distinct value for each advertiser_id and your dataset is therefore sortable by name. I also assume that max_total_advertiser contains the same value for each advertiser_id. If one of those is not the case, please add a comment.
Since every row just adds its amount to the previous total, the recurrence unrolls into max_total_advertiser plus a running sum of amount up to the current row. What you need for that running sum is a rangeBetween window, which gives you all rows within the specified range relative to the current row; we will use Window.unboundedPreceding as the lower bound because we want to sum up all the previous values.
import pyspark.sql.functions as F
from pyspark.sql import Window
l = [
    (4061, 'source1', -434.955284, -354882.75336200005, -355938.53950700007),
    (4061, 'source2', -594.012216, -355476.76557800005, -345938.53950700007),
    (4062, 'source1', -594.012216, -355476.76557800005, -5938.53950700007),
    (4062, 'source2', -594.012216, -355476.76557800005, -5938.53950700007),
    (4061, 'source3', -461.773929, -355938.53950700007, -355938.53950700007)
]
columns = ['advertiser_id', 'name', 'amount', 'total', 'max_total_advertiser']
df = spark.createDataFrame(l, columns)
w = Window.partitionBy('advertiser_id').orderBy('name').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('total', F.sum('amount').over(w) + df.max_total_advertiser)
df.show()
Output:
+-------------+-------+-----------+-------------------+--------------------+
|advertiser_id| name| amount| total|max_total_advertiser|
+-------------+-------+-----------+-------------------+--------------------+
| 4062|source1|-594.012216|-6532.5517230000705| -5938.53950700007|
| 4062|source2|-594.012216| -7126.563939000071| -5938.53950700007|
| 4061|source1|-434.955284| -356373.4947910001| -355938.53950700007|
| 4061|source2|-594.012216| -346967.5070070001| -345938.53950700007|
| 4061|source3|-461.773929|-357429.28093600005| -355938.53950700007|
+-------------+-------+-----------+-------------------+--------------------+
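Since name is distinct within each advertiser_id, a row-based frame behaves the same way here; a minimal variant sketch under that assumption:
# Equivalent cumulative sum using a rowsBetween frame instead of rangeBetween
w_rows = (Window.partitionBy('advertiser_id')
          .orderBy('name')
          .rowsBetween(Window.unboundedPreceding, Window.currentRow))

df = df.withColumn('total', F.sum('amount').over(w_rows) + F.col('max_total_advertiser'))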
You might be looking for the orderBy() function. Does this work?
import pyspark.sql.functions as F
from pyspark.sql.window import Window

df.withColumn("cumulativeSum", F.sum(F.col("amount"))
    .over(Window.partitionBy("advertiser_id").orderBy("amount")))
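If that gives you the running sum, the question's total_aux could then be obtained by adding the per-group constant on top (a hedged follow-up, assuming max_total_advertiser is the same for every row of an advertiser_id):
df = df.withColumn("total_aux", F.col("cumulativeSum") + F.col("max_total_advertiser"))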
Related
I have many thousands of huge files in a folder.
Each file has 2 header rows and a trailer row.
file1
H|*|F|*|TYPE|*|EXTRACT|*|Stage_|*|2021.04.18 07:35:26|##|
H|*|TYP_ID|*|TYP_DESC|*|UPD_USR|*|UPD_TSTMP|##|
E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##|
H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##|
S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##|
T|*|3|*|2021.04.18 07:35:43|##|
file2
H|*|F|*|PA__STAT|*|EXTRACT|*|Folder|*|2021.04.18 07:35:26|##|
H|*|STAT_ID|*|STAT_DESC|*|UPD_USR|*|UPD_TSTMP|##|
A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##|
D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##|
I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##|
L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##|
P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##|
T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##|
U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##|
T|*|7|*|2021.04.18 07:35:55|##|
file3
H|*|K|*|PA_CPN|*|EXTRACT|*|SuccessFactors|*|2021.04.22 23:09:26|##|
H|*|COL_NUM|*|CPNT_TYP_ID|*|CPNT_ID|*|REV_DTE|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##|
40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##|
40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##|
T|*|3|*|2021.04.22 23:27:17|##|
I am applying a filter on lines starting with H|*| and T|*|, but it is rejecting the data for a few rows.
df_cleanse=spark.sql("select replace(replace(replace(value,'~','-'),'|*|','~'),'|##|','') as value from linenumber3 where value not like 'T|*|%' and value not like 'H|*|%'")
I know we can use zipWithIndex, but then I would have to read file by file, apply the zip index, and then filter on the rows.
for each file:
    df = spark.read.text('file1')
    # Adding an index column so each row gets its row number. Spark distributes
    # the data, so to maintain the order of the data we need to perform this action.
    df_1 = df.rdd.map(lambda r: r).zipWithIndex().toDF(['value', 'index'])
    df_1.createOrReplaceTempView("linenumber")
    spark.sql("select * from linenumber where index > 1 and value.value not like 'T|*|%'")
Please let me know the optimal solution for this. I do not want to run an extensive program; all I need is to just remove 3 lines per file. Even a regex to remove the rows is fine, as we need to process TBs of files in this format.
Unix commands and sed are ruled out due to the file sizes.
While I wait for your answer, try this to remove the first two lines and the last one:
from pyspark.sql.window import Window
import pyspark.sql.functions as f
df = spark.read.csv('your_path', schema='value string')
df = df.withColumn('filename', f.input_file_name())
df = df.repartition('filename')
df = df.withColumn('index', f.monotonically_increasing_id())
w = Window.partitionBy('filename')
df = (df
      .withColumn('remove', (f.col('index') == f.max('index').over(w)) |
                            (f.col('index') < f.min('index').over(w) + f.lit(2)))
      .where(~f.col('remove'))
      .select('value'))
df.show(truncate=False)
Output
+-------------------------------------------------------------+
|value |
+-------------------------------------------------------------+
|E|*||*|CONNECTOR|*|2012.06.01 09:03:11|##| |
|H|*|Tracking|*|asdasasd|*|2011.03.04 11:50:51|##| |
|S|*|Tracking|*|asdasdas|*|2011.03.04 11:51:06|##| |
|A|*|Active / Actif|*|1604872|*|2018.06.25 15:12:35|##| |
|D|*||*|CONNECTOR|*|2012.04.06 10:49:09|##| |
|I|*|Intermittent Leave|*|asdasda|*|2021.04.09 13:14:00|##| |
|L|*|On Leave|*|asdasasd|*|2011.03.04 11:49:40|##| |
|P|*|Paid Leave|*|asdasd|*|2011.03.04 11:49:56|##| |
|T|*|Terminated / Terminé|*|1604872|*|2018.06.25 15:13:06|##||
|U|*||*|CONNECTOR|*|2012.06.16 09:04:14|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:07:00|##| |
|40|*|OLL|*|asdasda|*|2019.01.21 14:18:00|##| |
|40|*|OLL|*|asdasdas|*|2019.01.21 14:20:00|##| |
+-------------------------------------------------------------+
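As a hedged follow-up, the delimiter replacements from your SQL can then be applied to the cleaned rows, for example:
# Sketch: reuse the question's replace() expression on the cleaned 'value' column
df_cleansed = df.select(
    f.expr("replace(replace(replace(value, '~', '-'), '|*|', '~'), '|##|', '')").alias('value')
)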
The given problem:
I have folders named from folder1 to folder999. In each folder there are parquet files named from 1.parquet to 999.parquet. Each parquet file contains a pandas dataframe with the given structure:
id |title |a
1 |abc |1
1 |abc |3
1 |abc |2
2 |abc |1
... |def | ...
Where column a can take a value in the range a1 to a3.
The partial (intermediate) step is to obtain the structure:
id | title | a1 | a2 | a3
1 | abc | 1 | 1 | 1
2 | abc | 1 | 0 | 0
...
In order to obtain the final form:
title
id | abc | def | ...
1 | 3 | ... |
2 | 1 | ... |
where the value in column abc is the sum of columns a1, a2 and a3.
The goal is to obtain final form calculated on all the parquet files in all the folders.
Now, the situation I am in looks like this: I do know how to get from the partial step to the final form, e.g. by using sparse.coo_matrix() as explained in How to make full matrix from dense pandas dataframe.
The problem is: due to memory limitations I cannot simply read all the parquets at once.
I have three questions:
How do I get there efficiently if I have plenty of data (assume each parquet file is about 500 MB)?
Can I transform each parquet file to the final form separately and THEN merge them somehow? If yes, how could I do that?
Is there any way to skip the partial step?
For every dataframe in the files, you seem to:
Group the data by the columns id, title
Sum the data in column a for each group
Creating a full matrix for the task is not necessary, and neither is the partial step.
I am not sure how many unique combinations of id, title exist in a single file, or across all of them. A safe approach is to process the files in batches, save their results, and later combine all the results.
Which could look like this:
import pandas as pd
import numpy as np
import string

def gen_random_data(N, M):
    # Simulate reading a file: N rows, roughly M distinct ids/titles
    titles = np.apply_along_axis(lambda x: ''.join(x), 1,
                                 np.random.choice(list(string.ascii_lowercase), 3 * M).reshape(-1, 3))
    titles = np.random.choice(titles, N)
    _id = np.random.choice(np.arange(M) + 1, N)
    val = np.random.randint(M, size=(N,))
    df = pd.DataFrame(np.vstack((_id, titles, val)).T, columns=['id', 'title', 'a'])
    df = df.astype({'id': np.int64, 'title': str, 'a': np.int64})
    return df

def combine_results(grplist):
    # Stitch into one dataframe
    comb_df = pd.concat(grplist, axis=1)
    # Sum over common axes i.e. id, title
    comb_df = comb_df.apply(lambda row: np.nansum(row), axis=1)
    # Return a dataframe with the sum of a's
    return comb_df.to_frame('sum_of_a')

totalfiles = 10
batch = 2
filelist = []
for counter, _ in enumerate(range(0, totalfiles, batch)):
    # Read data from files (here: generate random data instead)
    dflist = [gen_random_data(100, 2) for _ in range(batch)]
    # Process the data in memory
    dflist = [_.groupby(['id', 'title']).agg(['sum']) for _ in dflist]
    collection = combine_results(dflist)
    # Write intermediate results to file and repeat for the rest of the files
    intermediate_result_file_name = f'resfile_{counter}'
    collection.to_parquet(intermediate_result_file_name, index=True)
    filelist.append(intermediate_result_file_name)

# Combine the intermediate result files
collection = [pd.read_parquet(file) for file in filelist]
totalresult = combine_results(collection)
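To reach the final layout from the question (titles as columns, one row per id), a hedged last step could be:
# Sketch: pivot the combined (id, title) sums into the question's final form
final_form = (totalresult
              .reset_index()
              .pivot_table(index='id', columns='title', values='sum_of_a',
                           aggfunc='sum', fill_value=0))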
I have the following table:
+---------+------------+----------------+
| IRR | Price List | Cambridge Data |
+=========+============+================+
| '1.56%' | '0' | '6/30/1989' |
+---------+------------+----------------+
| '5.17%' | '100' | '9/30/1989' |
+---------+------------+----------------+
| '4.44%' | '0' | '12/31/1990' |
+---------+------------+----------------+
I'm trying to write a calculator that updates the Price List field by making a simple calculation. The logic is basically this:
previous price * ( 1 + IRR%)
So for the last row, the calculation would be: 100 * (1 + 4.44%) = 104.44
Since I'm using petl, I'm trying to figure out how to update a field using the value from the row above together with a value from the same row, and then populate this across the whole Price List column. I can't seem to find a useful petl utility for this. Should I just write a method manually? What do you guys think?
Try this:
# The conversion function can access other values from the same row
table = etl.convert(table, 'Price List',
                    lambda row: 100 * (1 + row.IRR),
                    pass_row=True)
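etl.convert only sees the current row, so carrying a previously computed price forward needs a plain pass over the rows. A minimal sketch, assuming IRR is stored as a string like '4.44%' and that a listed price of '0' is a placeholder to be filled from the previous row's price:
import petl as etl

# Hypothetical reconstruction of the question's table
table = etl.wrap([
    ['IRR', 'Price List', 'Cambridge Data'],
    ['1.56%', '0', '6/30/1989'],
    ['5.17%', '100', '9/30/1989'],
    ['4.44%', '0', '12/31/1990'],
])

# Walk the rows once, carrying the previously computed price forward
out_rows = [list(etl.header(table))]
prev_price = 0.0
for irr, price, date in etl.data(table):
    rate = float(irr.strip('%')) / 100.0
    listed = float(price)
    # Keep a non-zero listed price; otherwise compound the previous price by (1 + IRR)
    new_price = listed if listed != 0 else prev_price * (1 + rate)
    prev_price = new_price
    out_rows.append([irr, round(new_price, 2), date])

result = etl.wrap(out_rows)
print(etl.lookall(result))   # last row becomes 104.44 = 100 * (1 + 4.44%)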
Hi, I have a rather simple task, but it seems like none of the online help is working.
I have data set like this:
ID    | Px_1       | Px_2
theta | 106.013676 | 102.8024788702673
Rho   | 100.002818 | 102.62640389123405
gamma | 105.360589 | 107.21999706084836
Beta  | 106.133046 | 115.40449479551263
alpha | 106.821119 | 110.54312246081719
I want to find the row-wise min and put it in a fourth column, so that the output would be, for example, theta is 102.802 because it is the min of Px_1 and Px_2.
I tried this but it doesn't work; I constantly get the max value.
df_subset = read.set_index('ID')[['Px_1','Px_2']]
d = df_subset.min( axis=1)
Thanks
You can try this
df["min"] = df[["Px_1", "Px_2"]].min(axis=1)
Select the columns needed, here ["Px_1", "Px_2"], and take the min along axis=1 (row-wise).
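For example, reconstructing the question's data, this adds a row-wise minimum column:
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    'ID': ['theta', 'Rho', 'gamma', 'Beta', 'alpha'],
    'Px_1': [106.013676, 100.002818, 105.360589, 106.133046, 106.821119],
    'Px_2': [102.8024788702673, 102.62640389123405, 107.21999706084836,
             115.40449479551263, 110.54312246081719],
})

df["min"] = df[["Px_1", "Px_2"]].min(axis=1)
print(df)   # theta gets 102.8024788702673, the smaller of Px_1 and Px_2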
I use pyspark and work with the following dataframe:
+---------+----+--------------------+-------------------+
| id| sid| values| ratio|
+---------+----+--------------------+-------------------+
| 6052791|4178|[2#2#2#2#3#3#3#3#...|0.32673267326732675|
| 57908575|4178|[2#2#2#2#3#3#3#3#...| 0.3173076923076923|
| 78836630|4178|[2#2#2#2#3#3#3#3#...| 0.782608695652174|
|109252111|4178|[2#2#2#2#3#3#3#3#...| 0.2803738317757009|
|139428308|4385|[2#2#2#3#4#4#4#4#...| 1.140625|
|173158079|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|183739386|4390|[3#2#2#3#3#2#4#4#...|0.32080419580419584|
|206815630|4178|[2#2#2#2#3#3#3#3#...|0.14782608695652175|
|242251660|4320|[2#2#2#2#3#3#3#3#...| 0.1452991452991453|
|272670796|5038|[3#2#2#2#2#2#2#3#...| 0.2648648648648649|
|297848516|4320|[2#2#2#2#3#3#3#3#...|0.12195121951219512|
|346566485|4113|[2#3#3#2#2#2#2#3#...| 0.646823138928402|
|369667874|5038|[2#2#2#2#2#2#2#3#...| 0.4546293788454067|
|374645154|4320|[2#2#2#2#3#3#3#3#...|0.34782608695652173|
|400996010|4320|[2#2#2#2#3#3#3#3#...|0.14049586776859505|
|401594848|4178|[3#3#6#6#3#3#4#4#...| 0.7647058823529411|
|401954629|4569|[3#3#3#3#3#3#3#3#...| 0.5520833333333333|
|417115190|4320|[2#2#2#2#3#3#3#3#...| 0.6235294117647059|
|423877535|4178|[2#2#2#2#3#3#3#3#...| 0.5538461538461539|
|445523599|4320|[2#2#2#2#3#3#3#3#...| 0.1271186440677966|
+---------+----+--------------------+-------------------+
What I want is to turn each sid (e.g. 4178) into a column and put its rounded ratio as the row value. The result should look as follows:
+---------+------+------+------+
|       id| 4178 | 4385 | 4390 |
+---------+------+------+------+
|  6052791| 0.32 |    0 |    0 |
+---------+------+------+------+
(if the sid exists for that id, fill the cell with its rounded ratio; if not, fill with 0)
The number of columns is the number of sids that have the same rounded ratio.
If a sid does not exist for a given id, then that sid's column has to contain 0.
You need a column to group by, for which I am adding a new column called sNo.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(List(
  (6052791, 4178, 0.42673267326732675),
  (6052791, 4178, 0.22673267326732675),
  (6052791, 4179, 0.62673267326732675),
  (6052791, 4180, 0.72673267326732675),
  (6052791, 4179, 0.82673267326732675),
  (6052791, 4179, 0.92673267326732675))).toDF("id", "sid", "ratio")

df.withColumn("sNo", lit(1))
  .groupBy("sNo")
  .pivot("sid")
  .agg(min("ratio"))
  .show
This would return the following output:
+---+-------------------+------------------+------------------+
|sNo| 4178| 4179| 4180|
+---+-------------------+------------------+------------------+
| 1|0.22673267326732674|0.6267326732673267|0.7267326732673267|
+---+-------------------+------------------+------------------+
That sounds like a pivot, which could be done in Spark SQL (Scala version) as follows:
scala> ratios.
groupBy("id").
pivot("sid").
agg(first("ratio")).
show
+-------+-------------------+
| id| 4178|
+-------+-------------------+
|6052791|0.32673267326732675|
+-------+-------------------+
I'm still unsure how to select the other columns (4385 and 4390 in your example). It seems that you round ratio and search for other sids that would match.
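Since the question uses PySpark, one hedged way to get the remaining sid columns and fill missing id/sid combinations with 0 (a sketch, assuming df holds the original id, sid and ratio columns):
import pyspark.sql.functions as F

result = (df
          .withColumn('ratio', F.round('ratio', 2))   # round first
          .groupBy('id')
          .pivot('sid')
          .agg(F.first('ratio'))                      # one ratio per (id, sid)
          .na.fill(0))                                # 0 where a sid is missing for an id
result.show()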