I have a spark dataframe in python. And, it was sorted based on a column. How can I select a specific range of data (for example 50% of data in the middle)? For example, if I have 1M data, I want to take data from 250K to 750K index. How can I do that without using collect in pyspark?
To be more precise, I want something like take function to get results between a range. For example, something like take(250000, 750000).
Here is one way to select a range in a pyspark DF:
Create DF
df = spark.createDataFrame(
data = [(10, "2018-01-01"), (22, "2017-01-01"), (13, "2014-01-01"), (4, "2015-01-01")\
,(35, "2013-01-01"),(26, "2016-01-01"),(7, "2012-01-01"),(18, "2011-01-01")]
, schema = ["amount", "date"]
)
df.show()
+------+----------+
|amount| date|
+------+----------+
| 10|2018-01-01|
| 22|2017-01-01|
| 13|2014-01-01|
| 4|2015-01-01|
| 35|2013-01-01|
| 26|2016-01-01|
| 7|2012-01-01|
| 18|2011-01-01|
+------+----------+
Sort (on date) and insert index (based on row number)
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window.orderBy("date")
df = df.withColumn("index", F.row_number().over(w))
df.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
| 18|2011-01-01| 1|
| 7|2012-01-01| 2|
| 35|2013-01-01| 3|
| 13|2014-01-01| 4|
| 4|2015-01-01| 5|
| 26|2016-01-01| 6|
| 22|2017-01-01| 7|
| 10|2018-01-01| 8|
+------+----------+-----+
Get The Required Range (assume want everything between rows 3 and 6)
df1=df.filter(df.index.between(3, 6))
df1.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
| 35|2013-01-01| 3|
| 13|2014-01-01| 4|
| 4|2015-01-01| 5|
| 26|2016-01-01| 6|
+------+----------+-----+
This is very simple using between , for example assuming your sorted column name is index -
df_sample = df.select(df.somecolumn, df.index.between(250000, 750000))
once you create a new dataframe df_sample, you can perform any operation (including take or collect) as per your need.
Related
I have a spark dataframe: df1 as below:
age = spark.createDataFrame(["10","11","13"], "string").toDF("age")
age.show()
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
I have a requirement of adding a row number column to the dataframe to make it:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
None of the columns in my dataframe contains unique values.
I tried to use F.monotonically_increasing_id()) but it is just producing random numbers in increasing order.
>>> age = spark.createDataFrame(["10","11","13"], "string").toDF("age").withColumn("rowId1", F.monotonically_increasing_id())
>>> age
DataFrame[age: string, rowId1: bigint]
>>> age.show
<bound method DataFrame.show of DataFrame[age: string, rowId1: bigint]>
>>> age.show()
+---+-----------+
|age| rowId1|
+---+-----------+
| 10|17179869184|
| 11|42949672960|
| 13|60129542144|
+---+-----------+
Since I don't have any column with unique data, I am worried about using windowing functions and generate row_numbers.
So, is there a way I can add a column with row_count to the dataframe that gives:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
If windowing function is the only way to implement, how can I make sure all the data comes under a single partition ?
or if there is a way to implement the same without using windowing functions, how to implement it ?
Any help is appreciated.
Use zipWithIndex.
I could not find code I did myself in the past yesterday as I was busy working on issues, but here is a good post that explains it. https://sqlandhadoop.com/pyspark-zipwithindex-example/
pyspark different to Scala.
Other answer not good for performance - going to single Executor. zipWithIndex is narrow transformation so it works per partition.
Here goes, you can tailor accordingly:
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType
from pyspark.sql.types import StringType, LongType
import pyspark.sql.functions as F
df1 = spark.createDataFrame([ ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4'), ('abc'),('2'),('3'),('4') ], StringType())
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1.show()
returns:
+-----+-----+
|value|index|
+-----+-----+
| abc| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| abc| 4|
| 2| 5|
| 3| 6|
| 4| 7|
| abc| 8|
| 2| 9|
| 3| 10|
| 4| 11|
+-----+-----+
Assumption: This answer is based on the assumption that the order of col_id should depend on the age column. If the assumption does not hold true the other suggested solution is the in the questions comments mentioned zipWithIndex. An example usage of zipWithIndex can be found in this answer.
Proposed solution:
You can use a window with an empty partitionBy and the the row number to get the expected numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy().orderBy(F.col('age').asc())
age = age.withColumn(
'col_id',
F.row_number().over(windowSpec)
)
[EDIT] Add assumption of requirements and reference to alternative solution.
I have two tables like the following:
First Table:
+---+------+----------+----------+
| id|sub_id| startDate| endDate|
+---+------+----------+----------+
| 2| a|2018-11-15|2018-12-01|
| 2| b|2018-10-15|2018-11-01|
| 3| a|2018-09-15|2018-10-01|
+---+------+----------+----------+
Second Table:
+---+----------+----+
| id| date|time|
+---+----------+----+
| 2|2018-10-15|1200|
| 2|2018-10-16|1200|
| 2|2018-10-18|1200|
| 3|2018-09-28|1200|
| 3|2018-09-29|1200|
+---+----------+----+
For a particular id and a given startDate and endDate, I require to filter the second table between the given timeframe.
From the filtered table I require the sum of the time column and output should be like following:
+---+------+----------+----------+---------+
| id|sub_id| startDate| endDate|totalTime|
+---+------+----------+----------+---------+
| 2| a|2018-11-15|2018-12-01| 0|
| 2| b|2018-10-15|2018-11-01| 3600|
| 3| a|2018-09-15|2018-10-01| 2400|
+---+------+----------+----------+---------+
My objective is to avoid using for loop along with filter. I tried using pandas_udf but it works with only one dataframe.
I have to rename columns of table() with values from mapping table(df2 below) in Pyspark.
Thanks for any help!
I tried to do it with pandas but it works for 25 min with my tables.
import pandas as pd
df = pd.DataFrame({'kod':[1,1,3,4,5], 'freq':[4,8,8,20,16], 'lsv':[100,200,300,250,400]})
df2 = pd.DataFrame({'oldid':['kod','freq','lsv'], 'newid':['code','visits','volume']})
mapping=dict(df2[['oldid', 'newid']].values)
df=df.rename(columns=mapping)
display(df2)
Spark Dataframes works little differently than Pandas data frame
after converting your pandas dataframes into Spark data frames
I am updating the name of freq to zeq just to demonstrate the sorting
df = spark.createDataFrame([(4,1,100),(8,1,200),(8,3,300),(20,4,250),(16,5,400)], ['zeq','kod','lsv'])
sorted_df = df.select(sorted(df.columns))
sorted_df.show()
+---+---+---+
|kod|lsv|zeq|
+---+---+---+
| 1|100| 4|
| 1|200| 8|
| 3|300| 8|
| 4|250| 20|
| 5|400| 16|
+---+---+---+
header dataFrame
headers = spark.createDataFrame([('code','kod'),('visits','zeq'),('volume','lsv')],['newid','oldid'])
headers.show()
+------+-----+
| newid|oldid|
+------+-----+
| code| kod|
|visits| zeq|
|volume| lsv|
+------+-----+
there is a method called toDF available on Spark dataframe that takes the list of new header columns as an argument and updates the header of the dataframe.
so sort your data frame based on oldid and select new id and convert that column values into list like below
sorted_headers_list = headers.sort('oldid').select('newid').rdd.flatMap(lambda x: x).collect()
update your dataframe with new headers
df_with_updated_headers = sorted_df.toDF(*sorted_headers_list)
df_with_updated_headers.show()
+----+------+------+
|code|volume|visits|
+----+------+------+
| 1| 100| 4|
| 1| 200| 8|
| 3| 300| 8|
| 4| 250| 20|
| 5| 400| 16|
+----+------+------+
please let me know if you need more details
I have two dataframe which has been readed from two csv files.
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 30|
| 2|9090909093| 30|
| 3|9090909090| 30|
| 4|9090909094| 30|
+---+----------+-----------------+
and
+---+----------+-----------------+
| ID| NUMBER | RECHARGE_AMOUNT|
+---+----------+-----------------+
| 1|9090909092| 40|
| 2|9090909093| 50|
| 3|9090909090| 60|
| 4|9090909094| 70|
+---+----------+-----------------+
I am triying to join this two data from using NUMBER coumn using the pyspark code dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner') and new dataframe is generated as follows.
+----------+---+-----------------+---+-----------------+
| NUMBER | ID| RECHARGE_AMOUNT| ID| RECHARGE_AMOUNT|
+----------+---+-----------------+---+-----------------+
|9090909092| 1| 30| 1| 40|
|9090909093| 2| 30| 2| 50|
|9090909090| 3| 30| 3| 60|
|9090909094| 4| 30| 4| 70|
+----------+---+-----------------+---+-----------------+
But i am not able to write this dataframe into a file since the dataframe after joining is having duplicate column. I am using the following code. dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output',header = 'true') Is there any way to avoid duplicate column after joining in spark. Given below is my pyspark code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("test1").getOrCreate()
files = ["/home/user/test1.txt", "/home/user/test2.txt"]
dfFinal = spark.read.load(files[0],format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
dfFinal.show()
for i in range(1,len(files)):
df2 = spark.read.load(files[i],format="csv", sep=",", inferSchema="false", header="true", mode="DROPMALFORMED")
df2.show()
dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner')
dfFinal.show()
dfFinal.coalesce(1).write.format('com.databricks.spark.csv').save('/home/user/output',header = 'true')
I need to generate unique column name.ie: if i gave two files in files array with same coumn it should generate as follows.
+----------+----+-------------------+-----+-------------------+
| NUMBER |IDx | RECHARGE_AMOUNTx | IDy | RECHARGE_AMOUNTy |
+----------+----+-------------------+-----+-------------------+
|9090909092| 1 | 30 | 1 | 40 |
|9090909093| 2 | 30 | 2 | 50 |
|9090909090| 3 | 30 | 3 | 60 |
|9090909094| 4 | 30 | 4 | 70 |
+----------+---+-----------------+---+------------------------+
In panda i can use suffixes argument as show below dfFinal = dfFinal.merge(df2,left_on='NUMBER',right_on='NUMBER',how='inner',suffixes=('x', 'y'),sort=True) which will generate the above dataframe. Is there any way i can replicate this on pyspark.
You can select the columns from each dataframe and alias it.
Like this.
dfFinal = dfFinal.join(df2, on=['NUMBER'], how='inner') \
.select('NUMBER',
dfFinal.ID.alias('ID_1'),
dfFinal.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_1'),
df2.ID.alias('ID_2'),
df2.RECHARGE_AMOUNT.alias('RECHARGE_AMOUNT_2'))
I have this code:
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
sc = SparkContext()
sqlContext = SQLContext(sc)
documents = sqlContext.createDataFrame([
Row(id=1, title=[Row(value=u'cars', max_dist=1000)]),
Row(id=2, title=[Row(value=u'horse bus',max_dist=50), Row(value=u'normal bus',max_dist=100)]),
Row(id=3, title=[Row(value=u'Airplane', max_dist=5000)]),
Row(id=4, title=[Row(value=u'Bicycles', max_dist=20),Row(value=u'Motorbikes', max_dist=80)]),
Row(id=5, title=[Row(value=u'Trams', max_dist=15)])])
documents.show(truncate=False)
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[[1000,cars]] |
#|2 |[[50,horse bus], [100,normal bus]]|
#|3 |[[5000,Airplane]] |
#|4 |[[20,Bicycles], [80,Motorbikes]] |
#|5 |[[15,Trams]] |
#+---+----------------------------------+
I need to split all compound rows (e.g. 2 & 4) to multiple rows while retaining the 'id', to get a result like this:
#+---+----------------------------------+
#|id |title |
#+---+----------------------------------+
#|1 |[1000,cars] |
#|2 |[50,horse bus] |
#|2 |[100,normal bus] |
#|3 |[5000,Airplane] |
#|4 |[20,Bicycles] |
#|4 |[80,Motorbikes] |
#|5 |[15,Trams] |
#+---+----------------------------------+
Just explode it:
from pyspark.sql.functions import explode
documents.withColumn("title", explode("title"))
## +---+----------------+
## | id| title|
## +---+----------------+
## | 1| [1000,cars]|
## | 2| [50,horse bus]|
## | 2|[100,normal bus]|
## | 3| [5000,Airplane]|
## | 4| [20,Bicycles]|
## | 4| [80,Motorbikes]|
## | 5| [15,Trams]|
## +---+----------------+
Ok, here is what I've come up with. Unfortunately, I had to leave the world of Row objects and enter the world of list objects because I couldn't find a way to append to a Row object.
That means this method is bit messy. If you can find a way to add a new column to a Row object, then this is NOT the way to go.
def add_id(row):
it_list = []
for i in range(0, len(row[1])):
sm_list = []
for j in row[1][i]:
sm_list.append(j)
sm_list.append(row[0])
it_list.append(sm_list)
return it_list
with_id = documents.flatMap(lambda x: add_id(x))
df = with_id.map(lambda x: Row(id=x[2], title=Row(value=x[0], max_dist=x[1]))).toDF()
When I run df.show(), I get:
+---+----------------+
| id| title|
+---+----------------+
| 1| [cars,1000]|
| 2| [horse bus,50]|
| 2|[normal bus,100]|
| 3| [Airplane,5000]|
| 4| [Bicycles,20]|
| 4| [Motorbikes,80]|
| 5| [Trams,15]|
+---+----------------+
I am using Spark Dataset API, and following solved the 'explode' requirement for me:
Dataset<Row> explodedDataset = initialDataset.selectExpr("ID","explode(finished_chunk) as chunks");
Note: The explode method of Dataset API is deprecated in Spark 2.4.5 and the documentation suggests using Select(shown above) or FlatMap.