How to rename DataFrame header with values from a mapping table in PySpark

I have to rename the columns of a table with values from a mapping table (df2 below) in PySpark.
I tried to do it with pandas, but it takes about 25 minutes with my tables.
Thanks for any help!
import pandas as pd
df = pd.DataFrame({'kod':[1,1,3,4,5], 'freq':[4,8,8,20,16], 'lsv':[100,200,300,250,400]})
df2 = pd.DataFrame({'oldid':['kod','freq','lsv'], 'newid':['code','visits','volume']})
mapping=dict(df2[['oldid', 'newid']].values)
df=df.rename(columns=mapping)
display(df)

Spark DataFrames work a little differently from pandas DataFrames.
After converting your pandas DataFrames into Spark DataFrames (I am renaming freq to zeq just to demonstrate the sorting):
df = spark.createDataFrame([(4,1,100),(8,1,200),(8,3,300),(20,4,250),(16,5,400)], ['zeq','kod','lsv'])
# Reorder the columns alphabetically so they line up with the sorted mapping below
sorted_df = df.select(sorted(df.columns))
sorted_df.show()
+---+---+---+
|kod|lsv|zeq|
+---+---+---+
|  1|100|  4|
|  1|200|  8|
|  3|300|  8|
|  4|250| 20|
|  5|400| 16|
+---+---+---+
The header (mapping) DataFrame:
headers = spark.createDataFrame([('code','kod'),('visits','zeq'),('volume','lsv')],['newid','oldid'])
headers.show()
+------+-----+
| newid|oldid|
+------+-----+
|  code|  kod|
|visits|  zeq|
|volume|  lsv|
+------+-----+
There is a method called toDF available on a Spark DataFrame; it takes a list of new column names as an argument and replaces the DataFrame's header.
So sort your mapping DataFrame by oldid, select newid, and collect that column's values into a list, like below:
sorted_headers_list = headers.sort('oldid').select('newid').rdd.flatMap(lambda x: x).collect()
Then update your DataFrame with the new headers:
df_with_updated_headers = sorted_df.toDF(*sorted_headers_list)
df_with_updated_headers.show()
+----+------+------+
|code|volume|visits|
+----+------+------+
|   1|   100|     4|
|   1|   200|     8|
|   3|   300|     8|
|   4|   250|    20|
|   5|   400|    16|
+----+------+------+
Please let me know if you need more details.
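A minimal alternative sketch, assuming the headers mapping is small enough to collect to the driver: build an {oldid: newid} dict and rename via aliases, which avoids relying on both sides being sorted the same way.

from pyspark.sql import functions as F

# Collect the small mapping DataFrame into a plain dict and rename with aliases
mapping = {row['oldid']: row['newid'] for row in headers.collect()}
renamed_df = df.select([F.col(c).alias(mapping.get(c, c)) for c in df.columns])
renamed_df.show()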

Related

Is there a way to add a column with range of values to a Spark Dataframe?

I have a Spark DataFrame, age, as below:
age = spark.createDataFrame(["10","11","13"], "string").toDF("age")
age.show()
+---+
|age|
+---+
| 10|
| 11|
| 13|
+---+
I have a requirement of adding a row number column to the dataframe to make it:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
None of the columns in my DataFrame contains unique values.
I tried to use F.monotonically_increasing_id(), but it just produces arbitrary numbers in increasing order.
>>> age = spark.createDataFrame(["10","11","13"], "string").toDF("age").withColumn("rowId1", F.monotonically_increasing_id())
>>> age
DataFrame[age: string, rowId1: bigint]
>>> age.show
<bound method DataFrame.show of DataFrame[age: string, rowId1: bigint]>
>>> age.show()
+---+-----------+
|age| rowId1|
+---+-----------+
| 10|17179869184|
| 11|42949672960|
| 13|60129542144|
+---+-----------+
Since I don't have any column with unique data, I am wary of using window functions to generate row numbers.
So, is there a way I can add a row-count column to the DataFrame that gives:
+---+------+
|age|col_id|
+---+------+
| 10| 1 |
| 11| 2 |
| 13| 3 |
+---+------+
If a window function is the only way to implement this, how can I make sure all the data ends up in a single partition?
Or, if there is a way to implement this without window functions, how can I do it?
Any help is appreciated.
Use zipWithIndex.
I could not find the code I wrote for this myself (I was busy working on other issues), but here is a good post that explains it: https://sqlandhadoop.com/pyspark-zipwithindex-example/
PySpark is a little different from Scala here.
The other answer is not good for performance, as it forces all the data onto a single executor. zipWithIndex is a narrow transformation, so it works per partition.
Here goes, you can tailor accordingly:
from pyspark.sql.types import StructField, StructType, StringType, LongType

# Sample single-column DataFrame; with a StringType schema the column is named "value"
df1 = spark.createDataFrame(['abc', '2', '3', '4', 'abc', '2', '3', '4', 'abc', '2', '3', '4'], StringType())

# Extend the existing schema with a LongType index column
schema = StructType(df1.schema.fields[:] + [StructField("index", LongType(), True)])

# zipWithIndex pairs each row with its position; being a narrow transformation,
# it does not shuffle everything onto a single executor
rdd = df1.rdd.zipWithIndex()
rdd1 = rdd.map(lambda row: tuple(row[0].asDict()[c] for c in schema.fieldNames()[:-1]) + (row[1],))
df1 = spark.createDataFrame(rdd1, schema)
df1.show()
df1.show()
returns:
+-----+-----+
|value|index|
+-----+-----+
|  abc|    0|
|    2|    1|
|    3|    2|
|    4|    3|
|  abc|    4|
|    2|    5|
|    3|    6|
|    4|    7|
|  abc|    8|
|    2|    9|
|    3|   10|
|    4|   11|
+-----+-----+
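Note that zipWithIndex is 0-based, while the question asks for col_id starting at 1; a small hedged adjustment would be to shift the index afterwards:

from pyspark.sql import functions as F

# Shift the 0-based index to the 1-based col_id requested in the question
df1 = df1.withColumn("col_id", F.col("index") + 1).drop("index")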
Assumption: this answer is based on the assumption that the order of col_id should depend on the age column. If that assumption does not hold, the alternative is the zipWithIndex approach suggested in the question's comments. An example usage of zipWithIndex can be found in this answer.
Proposed solution:
You can use a window with an empty partitionBy and row_number to get the expected numbers.
from pyspark.sql.window import Window
from pyspark.sql import functions as F

windowSpec = Window.partitionBy().orderBy(F.col('age').asc())
age = age.withColumn(
    'col_id',
    F.row_number().over(windowSpec)
)
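For the sample age DataFrame from the question, ordering by age ascending should give exactly the requested numbering:

age.show()
+---+------+
|age|col_id|
+---+------+
| 10|     1|
| 11|     2|
| 13|     3|
+---+------+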
[EDIT] Add assumption of requirements and reference to alternative solution.

Pyspark - Using two time indices for window function

I have a dataframe where each row has two date columns. I would like to create a window function with a rangeBetween that counts the number of rows in a particular range, where BOTH date columns are within the range. In the case below, both timestamps of a row must be before the corresponding timestamps of the current row for it to be included in the count.
Example df including the count column:
+---+-----------+-----------+-----+
| ID|Timestamp_1|Timestamp_2|Count|
+---+-----------+-----------+-----+
|  a|          0|          3|    0|
|  b|          2|          5|    0|
|  d|          5|          5|    3|
|  c|          5|          9|    3|
|  e|          8|         10|    4|
+---+-----------+-----------+-----+
I tried creating two windows and creating the new column over both of these:
w_1 = Window.partitionBy().orderBy('Timestamp_1').rangeBetween(Window.unboundedPreceding, 0)
w_2 = Window.partitionBy().orderBy('Timestamp_2').rangeBetween(Window.unboundedPreceding, 0)
df = df.withColumn('count', F.count('ID').over(w_1).over(w_2))
However, chaining .over() like this is not allowed in PySpark, so it raises an error.
Any ideas? Solutions in SQL are also fine!
Would a self-join work?
from pyspark.sql import functions as F

# For every row (alias 'a'), count the rows (alias 'b') whose two timestamps
# are both less than or equal to the current row's timestamps
df_count = (
    df.alias('a')
    .join(
        df.alias('b'),
        (F.col('b.Timestamp_1') <= F.col('a.Timestamp_1')) &
        (F.col('b.Timestamp_2') <= F.col('a.Timestamp_2')),
        'left'
    )
    .groupBy('a.ID')
    .agg(F.count('b.ID').alias('count'))
)
df = df.join(df_count, 'ID')
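Since the question says SQL solutions are fine too, here is a hedged Spark SQL sketch of the same self-join; the view name "events" is just for illustration:

# Register the DataFrame as a temporary view (the name "events" is hypothetical)
df.createOrReplaceTempView("events")

df_count = spark.sql("""
    SELECT a.ID, COUNT(b.ID) AS count
    FROM events a
    LEFT JOIN events b
      ON b.Timestamp_1 <= a.Timestamp_1
     AND b.Timestamp_2 <= a.Timestamp_2
    GROUP BY a.ID
""")
df = df.join(df_count, 'ID')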

Select a range in Pyspark

I have a Spark DataFrame in Python, and it is sorted on a column. How can I select a specific range of the data (for example, the 50% of data in the middle)? For example, if I have 1M rows, I want to take the rows from index 250K to 750K. How can I do that without using collect in PySpark?
To be more precise, I want something like the take function to get results within a range, for example something like take(250000, 750000).
Here is one way to select a range in a PySpark DataFrame.
Create the DataFrame:
df = spark.createDataFrame(
    data=[(10, "2018-01-01"), (22, "2017-01-01"), (13, "2014-01-01"), (4, "2015-01-01"),
          (35, "2013-01-01"), (26, "2016-01-01"), (7, "2012-01-01"), (18, "2011-01-01")],
    schema=["amount", "date"]
)
df.show()
+------+----------+
|amount| date|
+------+----------+
|    10|2018-01-01|
|    22|2017-01-01|
|    13|2014-01-01|
|     4|2015-01-01|
|    35|2013-01-01|
|    26|2016-01-01|
|     7|2012-01-01|
|    18|2011-01-01|
+------+----------+
Sort (on date) and insert index (based on row number)
from pyspark.sql.window import Window
from pyspark.sql import functions as F
w = Window.orderBy("date")
df = df.withColumn("index", F.row_number().over(w))
df.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
|    18|2011-01-01|    1|
|     7|2012-01-01|    2|
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
|    22|2017-01-01|    7|
|    10|2018-01-01|    8|
+------+----------+-----+
Get the required range (assume we want everything between rows 3 and 6):
df1 = df.filter(df.index.between(3, 6))
df1.show()
+------+----------+-----+
|amount| date|index|
+------+----------+-----+
|    35|2013-01-01|    3|
|    13|2014-01-01|    4|
|     4|2015-01-01|    5|
|    26|2016-01-01|    6|
+------+----------+-----+
This is very simple using between. For example, assuming your sorted index column is named index:
df_sample = df.filter(df.index.between(250000, 750000))
Once you have the new DataFrame df_sample, you can perform any operation on it (including take or collect) as needed.
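If the goal is specifically the "50% of data in the middle" from the question rather than fixed row offsets, a hedged sketch using percent_rank over the same sort order would be:

from pyspark.sql.window import Window
from pyspark.sql import functions as F

# percent_rank assigns each row a value in [0, 1] along the sort order,
# so keeping ranks between 0.25 and 0.75 selects roughly the middle half.
# Like the row_number approach above, the empty window pulls all data into one partition.
w = Window.orderBy("date")
df_middle = (
    df.withColumn("pct", F.percent_rank().over(w))
      .filter((F.col("pct") >= 0.25) & (F.col("pct") <= 0.75))
      .drop("pct")
)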

Remove all rows that are duplicates with respect to some rows

I've seen a couple questions like this but not a satisfactory answer for my situation. Here is a sample DataFrame:
+------+-----+----+
| id|value|type|
+------+-----+----+
|283924|  1.5|   0|
|283924|  1.5|   1|
|982384|  3.0|   0|
|982384|  3.0|   1|
|892383|  2.0|   0|
|892383|  2.5|   1|
+------+-----+----+
I want to identify duplicates by just the "id" and "value" columns, and then remove all instances.
In this case:
Rows 1 and 2 are duplicates (again, we are ignoring the "type" column).
Rows 3 and 4 are duplicates, so only rows 5 and 6 should remain.
The output would be:
+------+-----+----+
| id|value|type|
+------+-----+----+
|892383|  2.5|   1|
|892383|  2.0|   0|
+------+-----+----+
I've tried
df.dropDuplicates(subset=['id', 'value'], keep=False)
but the keep option isn't available in PySpark (as it is in pandas.DataFrame.drop_duplicates).
How else could I do this?
You can do that using window functions:
from pyspark.sql import Window, functions as F

df.withColumn(
    'fg',
    F.count("id").over(Window.partitionBy("id", "value"))
).where("fg = 1").drop("fg").show()
You can groupBy the id and value columns to get the count, then use a join to filter out the rows in your DataFrame where the count is not 1:
df.join(
    df.groupBy('id', 'value').count().where('count = 1').drop('count'),
    on=['id', 'value']
).show()
#+------+-----+----+
#| id|value|type|
#+------+-----+----+
#|892383|  2.5|   1|
#|892383|  2.0|   0|
#+------+-----+----+

Create Multiple .csv from Single Dataframe by Filtering on Column value

Here is my dataframe:
        Date Val1  Val2
0   1/1/2015    a     2
1   1/1/2015    g     6
2   1/2/2015    d     4
3   1/2/2015    a     6
4   1/2/2015    f     7
5  1/13/2015    b     8
6  1/14/2015    r     0
7  1/14/2015    a     1
8    1/12015    t     2
I want to take the values in the 'Date' column and create a separate .csv for each date, i.e.
01012015.csv, 01022015.csv, 01132015.csv, 01142015.csv, etc.
Each .csv file would contain only the data for that date.
Ideally I was thinking of splitting the data frame into multiple dataframes and then creating the .csv files.
I can do this manually, but I have not been able to do it with a loop or a unique() list.
I've looked at the question below, but it doesn't get me to what I need.
Python Pandas Create Multiple dataframes from list
I think you're just looking for groupby. If your dataframe is called df and the "Date" column is a string, then something like the below should work:
df_by_date = df.groupby("Date")
for date, date_df in df_by_date:
    # e.g. "1/1/2015" becomes "112015.csv"
    filename = date.replace("/", "") + ".csv"
    date_df.to_csv(filename)
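If the filenames really need the zero-padded form from the question (01012015.csv rather than 112015.csv), a hedged variant would parse the dates first; this assumes the Date strings are consistently in month/day/year form:

import pandas as pd

# Parse the Date column, then format each group's date as MMDDYYYY for the filename
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
for date, date_df in df.groupby("Date"):
    date_df.to_csv(date.strftime("%m%d%Y") + ".csv", index=False)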
