I have two very, very large pandas dataframes.
df_A has one row for each YearQuarter and Company
df_B has one row for each distinct employee
I want to count the number of employees that are employed for each company each quarter. An employee is counted as employed in a quarter if their StartYearQuarter <= YearQuarter and EndYearQuarter >= YearQuarter.
I have tried a variety of different approaches so far, but they have all run into memory issues or returned incorrect results because the dataframes are so large.
Here is an example of one piece of code I tried; when run in Jupyter it told me it would need 160 GiB of free RAM and it crashed my Azure Python kernel:
merged = pd.merge(df_A, df_B, on="Company Name")
employed = merged[(merged['StartYearQuarter'] <= merged['YearQuarter']) & (merged['EndYearQuarter'] >= merged['YearQuarter'])]
result = employed.groupby(['YearQuarter', 'Company Name']).size().reset_index(name='Employee Count')
Is there a more memory efficient way of counting the number of employees for each Company by YearQuarter?
Many thanks for any help!
If you can use PySpark:
from pyspark.sql import functions as f

data1 = [['1997Q3', 'test1'], ['1997Q4', 'test1']]
data2 = [['test1', '1997Q2', '1998Q1', 1], ['test1', '1997Q3', '1997Q3', 2]]
df1 = spark.createDataFrame(data1, ['YearQuarter', 'Company Name'])
df2 = spark.createDataFrame(data2, ['Company Name2', 'StartYearQuarter', 'EndYearQuarter', 'ID'])
df1.show()
df2.show()
df1.join(df2, (f.col('Company Name') == f.col('Company Name2')) & f.col('YearQuarter').between(f.col('StartYearQuarter'), f.col('EndYearQuarter')), 'inner') \
    .groupBy('Company Name', 'YearQuarter') \
    .count() \
    .show()
+-----------+------------+
|YearQuarter|Company Name|
+-----------+------------+
| 1997Q3| test1|
| 1997Q4| test1|
+-----------+------------+
+-------------+----------------+--------------+---+
|Company Name2|StartYearQuarter|EndYearQuarter| ID|
+-------------+----------------+--------------+---+
| test1| 1997Q2| 1998Q1| 1|
| test1| 1997Q3| 1997Q3| 2|
+-------------+----------------+--------------+---+
+------------+-----------+-----+
|Company Name|YearQuarter|count|
+------------+-----------+-----+
| test1| 1997Q3| 2|
| test1| 1997Q4| 1|
+------------+-----------+-----+
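If you have to stay in pandas, here is a rough, untested sketch that avoids materialising the full join by merging one company at a time. It assumes df_B also carries a "Company Name" column (as the merge in the question implies) and that quarters are strings like "1997Q3", which compare correctly as text:
import pandas as pd

# Process one company at a time so the intermediate merge stays small
pieces = []
for company, quarters in df_A.groupby("Company Name"):
    emps = df_B[df_B["Company Name"] == company]
    merged = quarters.merge(emps, on="Company Name")
    employed = merged[
        (merged["StartYearQuarter"] <= merged["YearQuarter"])
        & (merged["EndYearQuarter"] >= merged["YearQuarter"])
    ]
    pieces.append(
        employed.groupby(["YearQuarter", "Company Name"])
                .size()
                .reset_index(name="Employee Count")
    )

result = pd.concat(pieces, ignore_index=True)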
I have a table like this:
+-------+-----+------+------+
|user_id|apple|banana|carrot|
+-------+-----+------+------+
| user_0|    0|     3|     1|
| user_1|    1|     0|     2|
| user_2|    5|     1|     2|
+-------+-----+------+------+
Here, for each fruit, I want to get the list of customers who bought the most items.
The required output is the following:
                max_user  max_count
apple           [user_2]          5
banana          [user_0]          3
carrot  [user_1, user_2]          2
MWE
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql import functions as F
spark = pyspark.sql.SparkSession.builder.getOrCreate()
sc = spark.sparkContext
sqlContext = pyspark.SQLContext(sc)
# pandas dataframe
pdf = pd.DataFrame({'user_id': ['user_0', 'user_1', 'user_2'],
                    'apple': [0, 1, 5],
                    'banana': [3, 0, 1],
                    'carrot': [1, 2, 2]})
# spark dataframe
df = sqlContext.createDataFrame(pdf)
# df.show()
df.createOrReplaceTempView("grocery")
spark.sql('select * from grocery').show()
Question 1
How to get the required output using Pyspark?
Question 2
How to get the required output using Pyspark sql?
References
I have already done some research and searched multiple pages. So far I have found one close answer, but it requires a transposed table, whereas my table is in the normal orientation. I am also trying to learn multiple methods, such as the Spark DataFrame method and the SQL method.
how to get the name of column with maximum value in pyspark dataframe
PySpark solution, similar to the pandas solutions: first melt the dataframe using stack, then keep the rows with the max count using rank, group by fruit, and collect the list of users with collect_list.
from pyspark.sql import functions as F, Window
df2 = df.selectExpr(
    'user_id',
    'stack(3, ' + ', '.join(["'%s', %s" % (c, c) for c in df.columns[1:]]) + ') as (fruit, items)'
).withColumn(
    'rn',
    F.rank().over(Window.partitionBy('fruit').orderBy(F.desc('items')))
).filter('rn = 1').groupBy('fruit').agg(
    F.collect_list('user_id').alias('max_user'),
    F.max('items').alias('max_count')
)
df2.show()
+------+----------------+---------+
| fruit| max_user|max_count|
+------+----------------+---------+
| apple| [user_2]| 5|
|banana| [user_0]| 3|
|carrot|[user_1, user_2]| 2|
+------+----------------+---------+
For Spark SQL:
df.createOrReplaceTempView("grocery")
df2 = spark.sql("""
    select
        fruit,
        collect_list(user_id) as max_user,
        max(items) as max_count
    from (
        select *,
               rank() over (partition by fruit order by items desc) as rn
        from (
            select
                user_id,
                stack(3, 'apple', apple, 'banana', banana, 'carrot', carrot) as (fruit, items)
            from grocery
        )
    )
    where rn = 1
    group by fruit
""")
df2.show()
+------+----------------+---------+
| fruit| max_user|max_count|
+------+----------------+---------+
| apple| [user_2]| 5|
|banana| [user_0]| 3|
|carrot|[user_1, user_2]| 2|
+------+----------------+---------+
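If there are many item columns, the hard-coded stack(...) part of the SQL string can be assembled programmatically, the same way the DataFrame version builds it from df.columns[1:]; a small sketch (the stack_clause helper name is mine):
# Build the stack(...) clause from every column after user_id
fruit_cols = df.columns[1:]
stack_clause = "stack({}, {}) as (fruit, items)".format(
    len(fruit_cols),
    ", ".join("'{0}', {0}".format(c) for c in fruit_cols)
)

df2 = spark.sql(f"""
    select fruit, collect_list(user_id) as max_user, max(items) as max_count
    from (
        select *, rank() over (partition by fruit order by items desc) as rn
        from (select user_id, {stack_clause} from grocery)
    )
    where rn = 1
    group by fruit
""")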
You can try melt, filter the max values, then groupby().agg():
# melt the pandas dataframe (pdf in the question)
s = pdf.melt('user_id')
max_val = s.groupby('variable')['value'].transform('max')

(s[s['value'] == max_val].groupby('variable')
   .agg(max_user=('user_id', list),
        max_count=('value', 'first')))
Output:
                  max_user  max_count
variable
apple             [user_2]          5
banana            [user_0]          3
carrot    [user_1, user_2]          2
For pandas you can do this:
pdf = pd.DataFrame({'user_id': ['user_0', 'user_1', 'user_2'],
                    'apple': [0, 1, 5],
                    'banana': [3, 0, 1],
                    'carrot': [1, 2, 2]})

ans = pdf.set_index('user_id').apply(lambda s: pd.Series(
    [(s[s == s.max()]).index.tolist(), s.max()],
    index=['max_user', 'max_count']
)).T
ans
This gives:
                max_user  max_count
apple           [user_2]          5
banana          [user_0]          3
carrot  [user_1, user_2]          2
I started converting my pandas implementations to PySpark, but I'm having trouble with some basic operations. So I have this table:
+-----+-----+----+
| Col1| Col2|Col3|
+-----+-----+----+
|    1|[1,3]|   0|
|   44|[2,0]|   1|
|   77|[1,5]|   7|
+-----+-----+----+
My desired output is:
+-----+-----+----+----+
| Col1| Col2|Col3|Col4|
+-----+-----+----+----+
|    1|[1,3]|   0|2.67|
|   44|[2,0]|   1|2.67|
|   77|[1,5]|   7|2.67|
+-----+-----+----+----+
To get here:
I averaged the first item of every array in Col2 and the second item of every array in Col2. Since the average of the second "sub-column" ((3+0+5)/3 ≈ 2.67) is bigger than the average of the first ((1+2+1)/3 ≈ 1.33), that is the "winning" value. I then created a new column with the "winning" average replicated across every row of the table (3 rows in this example).
I was already able to do this by "manually" selecting the column, averaging it, and then using lit() to replicate the result. The problem with my implementation is that collect() takes a lot of time and, as far as I know, it's not recommended.
Could you please help me with this one?
You can use greatest to get the greatest average of each (sub-)column in the array:
from pyspark.sql import functions as F, Window

# UDF that turns each (vector-typed) Col2 entry into a plain array of doubles
to_array = F.udf(lambda r: [float(i) for i in r.toArray()], 'array<double>')

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(2)])
)
df2.show()
+----+------+----+------------------+
|Col1| Col2|Col3| Col4|
+----+------+----+------------------+
| 1|[1, 3]| 0|2.6666666666666665|
| 44|[2, 0]| 1|2.6666666666666665|
| 77|[1, 5]| 7|2.6666666666666665|
+----+------+----+------------------+
If you want the array size to be dynamic, you can do
# Determine the largest array length, then take the greatest per-index average
arr_size = df.select(F.max(F.size(to_array('Col2')))).head()[0]

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(arr_size)])
)
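If Col2 is a Spark ML vector column and you are on Spark 3.0+, the built-in vector_to_array can replace the Python UDF; a rough sketch, not tested against your data:
from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F, Window

# Unpartitioned window: all rows end up on one partition, so this (like the
# UDF version above) is only reasonable for small/medium dataframes.
w = Window.partitionBy()

df2 = (df.withColumn('arr', vector_to_array(F.col('Col2')))
         .withColumn('Col4', F.greatest(*[F.avg(F.col('arr')[i]).over(w)
                                          for i in range(2)]))
         .drop('arr'))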
I have data like below
year  name     percent   sex
1880  John     0.081541  boy
1881  William  0.080511  boy
1881  John     0.050057  boy
I need to groupby and count using different columns
df_year = df.groupby('year').count()
df_name = df.groupby('name').count()
df_sex = df.groupby('sex').count()
then I have to create a Window to get the top-3 data by each column
window = Window.partitionBy('year').orderBy(col("count").desc())
top4_res = df_year.withColumn('topn', func.row_number().over(window)) \
    .filter(col('topn') <= 4).repartition(1)
Suppose I have hundreds of columns to group by, count, and apply the top-k operation to.
Can I do it all at once?
Or is there a better way to do it?
I am not sure if this will meet your requirement, but if you are okay with a single dataframe, I think it can give you a start; let me know if otherwise. You can stack these 3 columns (or more) and then group by and take the count:
from pyspark.sql import functions as F

cols = ['year', 'name', 'sex']
e = f"""stack({len(cols)},{','.join(map(','.join,
    (zip([f'"{i}"' for i in cols], cols))))}) as (col,val)"""

(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
   .groupBy('col', 'val').agg(F.count("col").alias("Counts")).orderBy('col')).show()
+----+-------+------+
| col| val|Counts|
+----+-------+------+
|name| John| 2|
|name|William| 1|
| sex| boy| 3|
|year| 1881| 2|
|year| 1880| 1|
+----+-------+------+
If you want a wide form, you can also pivot, but I think the long form would be more helpful:
(df.select(*[F.col(i).cast('string') for i in cols]).selectExpr(e)
   .groupBy('col').pivot('val').agg(F.count('val')).show())
+----+----+----+----+-------+----+
| col|1880|1881|John|William| boy|
+----+----+----+----+-------+----+
|name|null|null|   2|      1|null|
|year|   1|   2|null|   null|null|
| sex|null|null|null|   null|   3|
+----+----+----+----+-------+----+
If you want the top-n values with the biggest counts for each column, this should work:
from pyspark.sql.functions import *
columns_to_check = [ 'year', 'name' ]
n = 4
for c in columns_to_check:
    # returns a dataframe
    x = df.groupBy(c).count().sort(col("count").desc()).limit(n)
    x.show()

    # returns a list of rows
    x = df.groupBy(c).count().sort(col("count").desc()).take(n)
    print(x)
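If you want the long-format counts and the top-n filter in a single pass, the stacked output from the first answer can be ranked with one window; a rough sketch (the names n and topn below are mine):
from pyspark.sql import functions as F, Window

cols = ['year', 'name', 'sex']
n = 3  # how many top values to keep per column

# Long format: one row per (column, value) pair with its count
e = "stack({}, {}) as (col, val)".format(
    len(cols),
    ", ".join("'{0}', {0}".format(c) for c in cols)
)
long_counts = (df.select(*[F.col(i).cast('string') for i in cols])
                 .selectExpr(e)
                 .groupBy('col', 'val')
                 .count())

# Rank values within each original column and keep the n most frequent ones
w = Window.partitionBy('col').orderBy(F.col('count').desc())
topn = long_counts.withColumn('topn', F.row_number().over(w)).filter(F.col('topn') <= n)
topn.show()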
I want to find the last amount given by each customer and the time of each customer's last sale.
I have two dataframes:
DF1:
+----------+-----------+-----------+
|    ref_ID|     Amount|  Sale time|
+----------+-----------+-----------+
|  11111111|        100| 2014-04-21|
|  22222222|         60| 2013-07-04|
|  33333333|         12| 2017-08-02|
|  22222222|         90| 2014-05-02|
|  22222222|         80| 2017-08-02|
|  11111111|         30| 2014-05-02|
+----------+-----------+-----------+
DF2:
+----------+----------+
|        ID|  num_sale|
+----------+----------+
|  11111111|         2|
|  33333333|         1|
|  22222222|         3|
+----------+----------+
I Need this output:
+----------+-----------+---------------+----------------+
|        ID|   num_sale| last_sale_time|last_sale_amount|
+----------+-----------+---------------+----------------+
|  11111111|          2|     2014-05-02|              30|
|  33333333|          1|     2017-08-02|              12|
|  22222222|          3|     2017-08-02|              80|
+----------+-----------+---------------+----------------+
What I am trying to do is:
last_sale_amount = []
for index, row in df.iterrows():
    try:
        last_sale_amount = max(df2.loc[df['id'] == row['f_id'], 'last_sale_time'])
        print(str(last_sale_amount))
        num_attempt.append(last_sale_amount)
    except KeyError:
        last_sale_amount.append(0)
ad['last_sale_amount'] = last_sale_amount
You can use groupby to get the maximum sale time for each ref_ID, then merge back the info from df1 and df2:
df_maxsale = df1.groupby('ref_ID')['Sale time'].max().to_frame().reset_index() \
.merge(df1, how='left', on=['ref_ID', 'Sale time']) \
.merge(df2, how='left', left_on='ref_ID', right_on='ID')
note: .max() returns a series with ref_ID as the index, so you need to call to_frame().reset_index() so that ref_ID is a column and you can merge on it and Sale time
We can group by ref_ID with the sale times sorted and take the last row of each group:
df1 = df1.sort_values('Sale time').groupby('ref_ID').last().reset_index()
And then merge it with the second dataframe (df2):
df2 = df2.merge(df1, left_on="ID", right_on="ref_ID", how="left")
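Another pandas idiom for the same thing, as a hedged sketch; it assumes 'Sale time' can be parsed as a datetime:
import pandas as pd

# Pick the row of the latest sale per customer via idxmax, then attach num_sale
df1['Sale time'] = pd.to_datetime(df1['Sale time'])
last_rows = df1.loc[df1.groupby('ref_ID')['Sale time'].idxmax()]
out = (df2.merge(last_rows, left_on='ID', right_on='ref_ID', how='left')
          .drop(columns='ref_ID')
          .rename(columns={'Sale time': 'last_sale_time', 'Amount': 'last_sale_amount'}))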
I have a pyspark dataframe like this
data = [(("ID1", 10, 30)), (("ID2", 20, 60))]
df1 = spark.createDataFrame(data, ["ID", "colA", "colB"])
df1.show()
df1:
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1|  10|  30|
|ID2|  20|  60|
+---+----+----+
I have another dataframe like this:
data = [(("colA", 2)), (("colB", 5))]
df2 = spark.createDataFrame(data, ["Column", "Value"])
df2.show()
df2:
+-------+------+
| Column| Value|
+-------+------+
|   colA|     2|
|   colB|     5|
+-------+------+
I want to divide every column in df1 by the respective value in df2. Hence df3 will look like
df3:
+---+------------+------------+
| ID|        colA|        colB|
+---+------------+------------+
|ID1|    10/2 = 5|    30/5 = 6|
|ID2|   20/2 = 10|   60/5 = 12|
+---+------------+------------+
Ultimately, I want to add colA and colB to get the final df4 per ID
df4:
+---+---------------+
| ID|       finalSum|
+---+---------------+
|ID1|     5 + 6 = 11|
|ID2|   10 + 12 = 22|
+---+---------------+
The idea is to join both DataFrames together and then apply the division operation. Since df2 contains the column names and the respective values, we need to pivot() it first and then join it with the main table df1. (Pivoting is an expensive operation, but it should be fine as long as the DataFrame is small.)
# Loading the requisite packages
from pyspark.sql.functions import col
from functools import reduce
from operator import add
# Creating the DataFrames
df1 = sqlContext.createDataFrame([('ID1', 10, 30), ('ID2', 20, 60)],('ID','ColA','ColB'))
df2 = sqlContext.createDataFrame([('ColA', 2), ('ColB', 5)],('Column','Value'))
The code is fairly generic, so we need not specify the column names ourselves. We find the column names we need to operate on: everything except ID.
# This contains the list of columns where we apply mathematical operations
columns_to_be_operated = df1.columns
columns_to_be_operated.remove('ID')
print(columns_to_be_operated)
['ColA', 'ColB']
Pivoting the df2, which we will join to df1.
# Pivoting the df2 to get the rows in column form
df2 = df2.groupBy().pivot('Column').sum('Value')
df2.show()
+----+----+
|ColA|ColB|
+----+----+
| 2| 5|
+----+----+
We change the column names so that we don't have duplicate names after the join. We do so by adding a suffix _x to all the names.
# Dynamically changing the name of the columns in df2
df2 = df2.select([col(c).alias(c+'_x') for c in df2.columns])
df2.show()
+------+------+
|ColA_x|ColB_x|
+------+------+
| 2| 5|
+------+------+
Next we join the tables with a Cartesian join. (Note that you may run into memory issues if df2 is large.)
df = df1.crossJoin(df2)
df.show()
+---+----+----+------+------+
| ID|ColA|ColB|ColA_x|ColB_x|
+---+----+----+------+------+
|ID1| 10| 30| 2| 5|
|ID2| 20| 60| 2| 5|
+---+----+----+------+------+
Finally, we add the columns after first dividing each by its corresponding value. reduce() applies the two-argument function add() cumulatively to the items of the sequence.
df = df.withColumn(
    'finalSum',
    reduce(add, [col(c) / col(c + '_x') for c in columns_to_be_operated])
).select('ID', 'finalSum')
df.show()
+---+--------+
| ID|finalSum|
+---+--------+
|ID1| 11.0|
|ID2| 22.0|
+---+--------+
Note: OP has to be careful with division by 0. The snippet just above can be altered to take this condition into account, as sketched below.
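One hedged way to handle that (the choice of substituting 0 for a zero or null divisor is mine; you may prefer to keep nulls or raise an error):
from functools import reduce
from operator import add
from pyspark.sql.functions import col, when, lit

# Re-create the cross-joined frame and guard each term: a 0 (or null) divisor
# makes that term contribute 0 instead of turning the whole sum null.
joined = df1.crossJoin(df2)   # df2 here is the pivoted, '_x'-suffixed version
safe_terms = [
    when(col(c + '_x') != 0, col(c) / col(c + '_x')).otherwise(lit(0))
    for c in columns_to_be_operated
]
df_safe = joined.withColumn('finalSum', reduce(add, safe_terms)).select('ID', 'finalSum')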