Split column into multiple columns using pyspark 2.4 [duplicate] - python

This question already has answers here:
Split Spark dataframe string column into multiple columns
(5 answers)
Closed 4 years ago.
How can we split a Spark DataFrame column on a separator such as '/' using PySpark 2.4?
My column contains:
+-------------------+
|           timezone|
+-------------------+
|   America/New_York|
|  Africa/Casablanca|
|      Europe/Madrid|
|      Europe/Madrid|
|                   |
|               null|
+-------------------+
Thanks

# Creating a dataframe
values = [('America/New_York',),('Africa/Casablanca',),('Europe/Madrid',),('Europe/Madrid',),('Germany',),('',),(None,)]
df = sqlContext.createDataFrame(values,['timezone',])
df.show(truncate=False)
+-----------------+
|timezone |
+-----------------+
|America/New_York |
|Africa/Casablanca|
|Europe/Madrid |
|Europe/Madrid |
|Germany |
| |
|null |
+-----------------+
from pyspark.sql.functions import col, instr, split, when

# Flag rows that actually contain the '/' separator (instr returns 0 when not found)
df = df.withColumn('separator_if_exists', (instr(col('timezone'), '/') > 0) & instr(col('timezone'), '/').isNotNull())
df = df.withColumn('col1', when(col('separator_if_exists'), split(col('timezone'), '/')[0]).otherwise(None))
df = df.withColumn('col2', when(col('separator_if_exists'), split(col('timezone'), '/')[1]).otherwise(None)).drop('separator_if_exists')
df.show(truncate=False)
+-----------------+-------+----------+
|timezone |col1 |col2 |
+-----------------+-------+----------+
|America/New_York |America|New_York |
|Africa/Casablanca|Africa |Casablanca|
|Europe/Madrid |Europe |Madrid |
|Europe/Madrid |Europe |Madrid |
|Germany |null |null |
| |null |null |
|null |null |null |
+-----------------+-------+----------+
Documentation: split() and instr(). Note that instr() uses 1-based indexing; if the substring is not found, it returns 0.
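As a side note, a slightly more compact variant (a sketch reusing the df built above, not part of the original answer; df2 and parts are just illustrative names) relies on Column.contains() and on the fact that when() without otherwise() already yields null:
from pyspark.sql.functions import col, split, when

# Guard on the presence of '/'; rows without it (including '' and null) get null in both columns
parts = split(col('timezone'), '/')
df2 = df.withColumn('col1', when(col('timezone').contains('/'), parts[0])) \
        .withColumn('col2', when(col('timezone').contains('/'), parts[1]))
df2.show(truncate=False)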

Related

Modify Spark Dataframe

I have a Spark dataframe to which I need to make some changes.
Input dataframe :
+-----+-----+----------+
| name|Index|     Value|
+-----+-----+----------+
|name1|    1|        ab|
|name2|    1|        vf|
|name2|    2|        ee|
|name2|    3|        id|
|name3|    1|        bd|
+-----+-----+----------+
For every name there are multiple values, which we need to bind together as shown below.
Output dataframe :
+-----+----------+
| name|     value|
+-----+----------+
|name1|      [ab]|
|name2|[vf,ee,id]|
|name3|      [bd]|
+-----+----------+
Thank you
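The excerpt above does not include an answer; a minimal sketch of one common approach (assuming the input dataframe is named df) groups by name and collects the values into an array. Note that collect_list does not guarantee that the Index order is preserved.
from pyspark.sql.functions import collect_list

# One row per name, with all of its Value entries gathered into an array
result = df.groupBy('name').agg(collect_list('Value').alias('value'))
result.show(truncate=False)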

How to run TA-Lib on multiple tickers in a single dataframe

I have a pandas dataframe named idf with data from 4/19/21 to 5/19/21 for 4675 tickers with the following columns: symbol, date, open, high, low, close, vol
|index |symbol |date |open |high |low |close |vol |EMA8|EMA21|RSI3|RSI14|
|-------|-------|-----------|-------|-------|-----------|-------|-------|----|-----|----|-----|
|0 |AACG |2021-04-19 |2.85 |3.03 |2.8000 |2.99 |173000 | | | | |
|1 |AACG |2021-04-20 |2.93 |2.99 |2.7700 |2.85 |73700 | | | | |
|2 |AACG |2021-04-21 |2.82 |2.95 |2.7500 |2.76 |93200 | | | | |
|3 |AACG |2021-04-22 |2.76 |2.95 |2.7200 |2.75 |56500 | | | | |
|4 |AACG |2021-04-23 |2.75 |2.88 |2.7000 |2.84 |277700 | | | | |
|... |... |... |... |... |... |... |... | | | | |
|101873 |ZYXI |2021-05-13 |13.94 |14.13 |13.2718 |13.48 |413200 | | | | |
|101874 |ZYXI |2021-05-14 |13.61 |14.01 |13.2200 |13.87 |225200 | | | | |
|101875 |ZYXI |2021-05-17 |13.72 |14.05 |13.5500 |13.82 |183600 | | | | |
|101876 |ZYXI |2021-05-18 |13.97 |14.63 |13.8300 |14.41 |232200 | | | | |
|101877 |ZYXI |2021-05-19 |14.10 |14.26 |13.7700 |14.25 |165600 | | | | |
I would like to use ta-lib to calculate several technical indicators like EMA of length 8 and 21, and RSI of 3 and 14.
I have been doing this with the following code after uploading the file and creating a dataframe named idf:
ind = pd.DataFrame()
tind = pd.DataFrame()

for ticker in idf['symbol'].unique():
    tind['rsi3'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 3).round(2)
    tind['rsi14'] = ta.RSI(idf.loc[idf['symbol'] == ticker, 'close'], 14).round(2)
    tind['ema8'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 8).round(2)
    tind['ema21'] = ta.EMA(idf.loc[idf['symbol'] == ticker, 'close'], 21).round(2)
    ind = ind.append(tind)
    tind = tind.iloc[0:0]

idf = pd.merge(idf, ind, left_index=True, right_index=True)
Is this the most efficient way of doing this?
If not, what is the easiest and fastest way to calculate indicator values and get those calculated indicator values into the dataframe idf?
Prefer to avoid a for loop if possible.
Any help is highly appreciated.
rsi = lambda x: talib.RSI(idf.loc[x.index, "close"], 14)
idf['rsi(14)'] = idf.groupby(['symbol']).apply(rsi).reset_index(0,drop=True)
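Extending that idea to all four indicators (a sketch, not taken from the answer above; it assumes idf is sorted by symbol and date, and that TA-Lib is imported as ta as in the question):
import pandas as pd
import talib as ta

def add_indicators(group: pd.DataFrame) -> pd.DataFrame:
    # Compute every indicator on this ticker's close series, keeping the group's index
    out = pd.DataFrame(index=group.index)
    out['EMA8'] = ta.EMA(group['close'], 8).round(2)
    out['EMA21'] = ta.EMA(group['close'], 21).round(2)
    out['RSI3'] = ta.RSI(group['close'], 3).round(2)
    out['RSI14'] = ta.RSI(group['close'], 14).round(2)
    return out

# One pass per ticker; the per-group frames keep idf's index, so they join back cleanly
idf = idf.join(idf.groupby('symbol', group_keys=False).apply(add_indicators))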

Compare two pyspark dataframes by joining

I have two pyspark dataframes that have different number of rows. I am trying to compare values in all the columns by joining these two dataframes on multiple keys so I can find the records that have different values and the records that have the same values in these columns.
#df1:
+---+----+----+-----+
| id| age| sex|value|
+---+----+----+-----+
|  1|  23|   M|  8.4|
|  2|   4|   M|    2|
|  3|  16|   F|  4.1|
|  4|  60|   M|    4|
|  5|null|   F|    5|
+---+----+----+-----+
#df2:
+---+----+----+-----+
| id| age| sex|value|
+---+----+----+-----+
|  1|  23|   M|  8.4|
|  2|   4|null|    2|
|  4|  13|   M|  3.1|
|  5|  34|   F|  6.2|
+---+----+----+-----+
#joining df1 and df2 on multiple keys
same=df1.join(df2, on=['id','age','sex','value'], how='inner')
Please note that the dataframes above are just samples. My real data has around 25 columns and 100k+ rows, so when I tried the join, the Spark job took a long time and never finished.
Does anyone have good advice on comparing two dataframes and finding the records whose column values differ, either by joining or some other method?
Use hashing.
from pyspark.sql.functions import hash
df1 = spark.createDataFrame([('312312','151132'),('004312','12232'),('','151132'),('013vjhrr134','111232'),(None,'151132'),('0fsgdhfjgk','151132')],
("Fruits", "Meat"))
df1 = df1.withColumn('hash_value', hash("Fruits", "Meat"))
df = spark.createDataFrame([('312312','151132'),('000312','151132'),('','151132'),('013vjh134134','151132'),(None,'151132'),('0fsgdhfjgk','151132')],
("Fruits", "Meat"))
df = df.withColumn('hash_value', hash("Fruits", "Meat"))
df.show()
+------------+------+-----------+
| Fruits| Meat| hash_value|
+------------+------+-----------+
| 312312|151132| -344340697|
| 000312|151132| -548650515|
| |151132|-2105905448|
|013vjh134134|151132| 2052362224|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+------------+------+-----------+
df1.show()
+-----------+------+-----------+
| Fruits| Meat| hash_value|
+-----------+------+-----------+
| 312312|151132| -344340697|
| 004312| 12232| 76821046|
| |151132|-2105905448|
|013vjhrr134|111232| 1289730088|
| null|151132| 598159392|
| 0fsgdhfjgk|151132| 951458223|
+-----------+------+-----------+
Or you can use SHA-2 for the same purpose:
from pyspark.sql.functions import sha2, concat_ws
df1.withColumn("row_sha2", sha2(concat_ws("||", *df1.columns), 256)).show(truncate=False)
+-----------+------+----------------------------------------------------------------+
|Fruits |Meat |row_sha2 |
+-----------+------+----------------------------------------------------------------+
|312312 |151132|7be3824bcaa5fa29ad58df2587d392a1cc9ca5511ef01005be6f97c9558d1eed|
|004312 |12232 |c7fcf8031a17e5f3168297579f6dc8a6f17d7a4a71939d6b989ca783f30e21ac|
| |151132|68ea989b7d33da275a16ff897b0ab5a88bc0f4545ec22d90cee63244c1f00fb0|
|013vjhrr134|111232|9c9df63553d841463a803c64e3f4a8aed53bcdf78bf4a089a88af9e91406a226|
|null |151132|83de2d466a881cb4bb16b83665b687c01752044296079b2cae5bab8af93db14f|
|0fsgdhfjgk |151132|394631bbd1ccee841d3ba200806f8d0a51c66119b13575cf547f8cc91066c90d|
+-----------+------+----------------------------------------------------------------+
This creates a (practically) unique code for each row.
If the values in two rows are the same, they will have the same hash value as well.
Now you can compare the two dataframes by joining on the hash:
df1.join(df, "hash_value", "inner").show()
+-----------+----------+------+----------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+----------+------+----------+------+
|-2105905448| |151132| |151132|
| -344340697| 312312|151132| 312312|151132|
| 598159392| null|151132| null|151132|
| 951458223|0fsgdhfjgk|151132|0fsgdhfjgk|151132|
+-----------+----------+------+----------+------+
df1.join(df, "hash_value", "outer").show()
+-----------+-----------+------+------------+------+
| hash_value| Fruits| Meat| Fruits| Meat|
+-----------+-----------+------+------------+------+
|-2105905448| |151132| |151132|
| -548650515| null| null| 000312|151132|
| -344340697| 312312|151132| 312312|151132|
| 76821046| 004312| 12232| null| null|
| 598159392| null|151132| null|151132|
| 951458223| 0fsgdhfjgk|151132| 0fsgdhfjgk|151132|
| 1289730088|013vjhrr134|111232| null| null|
| 2052362224| null| null|013vjh134134|151132|
+-----------+-----------+------+------------+------+
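To pull out only the records that differ (a sketch building on the hash columns above, not part of the original answer), an anti-join keeps the rows of one dataframe whose hash has no match in the other:
# Rows of df1 with no matching hash in df (changed or missing on the df side)
df1.join(df, 'hash_value', 'left_anti').show()
# Rows of df with no matching hash in df1
df.join(df1, 'hash_value', 'left_anti').show()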

pyspark - how can I remove all duplicate rows (ignoring certain columns) and not leaving any dupe pairs behind?

I have the following table:
df = spark.createDataFrame([(2, 'john', 1, '131434234342'),
                            (2, 'john', 1, '10-22-2018'),
                            (3, 'pete', 8, '10-22-2018'),
                            (3, 'pete', 8, '3258958304'),
                            (5, 'steve', 9, '124324234')],
                           ['id', 'name', 'value', 'date'])
df.show()
+----+-------+-------+--------------+
| id | name | value | date |
+----+-------+-------+--------------+
| 2 | john | 1 | 131434234342 |
| 2 | john | 1 | 10-22-2018 |
| 3 | pete | 8 | 10-22-2018 |
| 3 | pete | 8 | 3258958304 |
| 5 | steve | 9 | 124324234 |
+----+-------+-------+--------------+
I want to remove all duplicate pairs (where rows count as duplicates based on id, name, and value but NOT date) so that I end up with:
+----+-------+-------+-----------+
| id | name | value | date |
+----+-------+-------+-----------+
| 5 | steve | 9 | 124324234 |
+----+-------+-------+-----------+
How can I do this in PySpark?
You could groupBy id, name and value and filter on the count column:
df = df.groupBy('id','name','value').count().where('count = 1')
df.show()
+---+-----+-----+-----+
| id| name|value|count|
+---+-----+-----+-----+
| 5|steve| 9| 1|
+---+-----+-----+-----+
You could then drop the count column if needed.
Do a groupBy on the columns you want, count, and filter where the count equals 1; then you can drop the count column, like below:
import pyspark.sql.functions as f
df = df.groupBy("id", "name", "value").agg(f.count("*").alias('cnt')).where('cnt = 1').drop('cnt')
You can add the date column to the groupBy condition if you want.
Hope this helps.
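Both approaches above lose the date column because it is not part of the groupBy. One way to keep every original column (a sketch, not from the answers above) is to count duplicates over a window and keep only the singletons:
from pyspark.sql import Window
from pyspark.sql.functions import count, lit

# Count rows sharing the same id/name/value, then keep those that appear exactly once
w = Window.partitionBy('id', 'name', 'value')
df_unique = df.withColumn('cnt', count(lit(1)).over(w)).where('cnt = 1').drop('cnt')
df_unique.show()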

How do I flatten a PySpark dataframe?

I have a spark dataframe like this:
id | Operation | Value
-----------------------------------------------------------
1 | [Date_Min, Date_Max, Device] | [148590, 148590, iphone]
2 | [Date_Min, Date_Max, Review] | [148590, 148590, Good]
3 | [Date_Min, Date_Max, Review, Device] | [148590, 148590, Bad,samsung]
The result that I expect:
id | Operation | Value |
--------------------------
1 | Date_Min | 148590 |
1 | Date_Max | 148590 |
1 | Device | iphone |
2 | Date_Min | 148590 |
2 | Date_Max | 148590 |
2 | Review | Good |
3 | Date_Min | 148590 |
3 | Date_Max | 148590 |
3 | Review | Bad |
3 | Device | samsung|
I'm using Spark 2.1.0 with pyspark. I tried this solution but it worked only for one column.
Thanks
Here is an example dataframe built from the data above. I use this solution (zipping the two arrays with a UDF, then exploding) to answer your question.
df = spark.createDataFrame(
    [[1, ['Date_Min', 'Date_Max', 'Device'], ['148590', '148590', 'iphone']],
     [2, ['Date_Min', 'Date_Max', 'Review'], ['148590', '148590', 'Good']],
     [3, ['Date_Min', 'Date_Max', 'Review', 'Device'], ['148590', '148590', 'Bad', 'samsung']]],
    schema=['id', 'l1', 'l2'])
Here, you can first define a UDF that zips the two lists together for each row.
from pyspark.sql.types import *
from pyspark.sql.functions import col, udf, explode

zip_list = udf(
    lambda x, y: list(zip(x, y)),
    ArrayType(StructType([
        StructField("first", StringType()),
        StructField("second", StringType())
    ]))
)
Finally, apply the UDF to zip the two columns together, then explode the resulting column.
df_out = df.withColumn("tmp", zip_list('l1', 'l2')).\
    withColumn("tmp", explode("tmp")).\
    select('id', col('tmp.first').alias('Operation'), col('tmp.second').alias('Value'))
df_out.show()
Output
+---+---------+-------+
| id|Operation| Value|
+---+---------+-------+
| 1| Date_Min| 148590|
| 1| Date_Max| 148590|
| 1| Device| iphone|
| 2| Date_Min| 148590|
| 2| Date_Max| 148590|
| 2| Review| Good|
| 3| Date_Min| 148590|
| 3| Date_Max| 148590|
| 3| Review| Bad|
| 3| Device|samsung|
+---+---------+-------+
If you want to stay with the DataFrame API, two explode calls in a single select will not work (Spark allows only one generator per select clause). On Spark 2.4+ you can zip the arrays with arrays_zip and explode once; on 2.1 the UDF approach above is the way to go:
import pyspark.sql.functions as F
your_df.select("id", F.explode(F.arrays_zip("Operation", "Value")).alias("tmp")).select("id", "tmp.Operation", "tmp.Value").show()
