Pyspark - how to merge transformed columns with an original DataFrame? - python

I created a function to test transformations on a DataFrame. This returns only the transformed columns.
from pyspark.sql import DataFrame, functions as F
from pyspark.sql.types import StringType

def test_concat(df: DataFrame, col_names: list) -> DataFrame:
    return df.select(
        *[F.concat(df[column].cast(StringType()), F.lit(" new!")).alias(column)
          for column in col_names]
    )
How can I replace the existing columns with the transformed ones in the original DF and return the whole DF?
Example DF:
test_df = self.spark.createDataFrame([(1, 'metric1', 10), (2, 'metric2', 20), (3, 'metric3', 30)], ['id', 'metric', 'score'])
cols = ["metric"]
new_df = test_concat(test_df, cols)
new_df.show()
Expected result:
+------------+-----+
|      metric|score|
+------------+-----+
|metric1 new!|   10|
|metric2 new!|   20|
|metric3 new!|   30|
+------------+-----+
It looks like I can drop the original columns from the DF and then somehow append the transformed ones, but I'm not sure that is the right way to achieve this.

I can see you are only appending a keyword to the metric column; the same can be achieved using the built-in Spark function as below.
withColumn has two behaviours:
If the column is not present, it creates a new column.
If the column is already there, it performs the operation on that same column.
Logic to Concat
from pyspark.sql import functions as F
df = df.withColumn('metric', F.concat(F.col('metric'), F.lit(' '), F.lit('new!')))
df = df.select('metric', 'score')
df.show()
Output:
+------------+-----+
|      metric|score|
+------------+-----+
|metric1 new!|   10|
|metric2 new!|   20|
|metric3 new!|   30|
+------------+-----+

If you want to do it for many columns you would make a foldLeft call.
#dsk has the right approach.
You probably want to avoid joins here, since there is no need to decouple the operation you are describing from the original dataframe (based on the examples you provided; if your real case has different needs, a different example may be warranted).
columnsToTransform.foldLeft(df)(
  (acc, next) => acc.withColumn(next, concat(col(next), lit("new !")))
)
Edit: I just realised that what I am proposing only works for Scala, and your snippet is in Python.
In Python the same approach still works; instead of a fold you use a for loop:
from pyspark.sql.functions import col, concat, lit

df = yourOriginalDf
for col_name in columnsToTransform:
    df = df.withColumn(col_name, concat(col(col_name), lit("new !")))
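For completeness, here is a minimal sketch (not part of the original answer) that applies this loop to the test_df from the question; the column list and the " new!" suffix are taken from the example:

from pyspark.sql import functions as F

columns_to_transform = ["metric"]

result_df = test_df  # the example DataFrame from the question
for col_name in columns_to_transform:
    result_df = result_df.withColumn(
        col_name,
        F.concat(F.col(col_name).cast("string"), F.lit(" new!"))
    )

result_df.show()  # id and score are preserved, metric is transformed in place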

Create a new dataframe with the updated column values and a monotonically increasing id:
new_df = test_concat(test_df, cols).withColumn("index", F.monotonically_increasing_id())
Drop the listed columns from the first dataframe and add a monotonically increasing id:
test_df_upt = test_df.drop(*cols).withColumn("index", F.monotonically_increasing_id())
Join the above two dataframes and drop the index column:
test_df_upt.join(new_df, "index").drop("index").show()

Related

How to select rows that are not present in another dataframe with pyspark 2.1.0?

Env
pyspark 2.1.0
Context
I have two dataframes with the following structures:
dataframe 1:
id | ... | distance
dataframe 2:
id | ... | distance | other calculated values
The second dataframe is created based on a filter of the dataframe 1. This filter selects, from dataframe 1, only the distances <= 30.0.
Note that dataframe 1 will contain the same ID on multiple rows.
Problem
I need to select from dataframe 1 the rows whose ID does not appear in dataframe 2.
The purpose is to select the rows whose ID has no distance lower than or equal to 30.0.
Tested solution
I have tried the leftanti join, which, according to unofficial sources on the internet (the official doc does not really explain it), selects all rows from df1 that are not present in df2:
distinct_id_thirty = within_thirty_km \
    .select("id") \
    .distinct()

not_within_thirty_km = data_with_straight_distance.join(
    distinct_id_thirty,
    "id",
    "leftanti")
Where:
within_thirty_km is a dataframe resulting from the filter filter(col("distance") <= 30.0) on data_with_straight_distance
data_with_straight_distance is a dataframe containing all the data
distinct_id_thirty is a dataframe containing the distinct list of IDs from the dataframe within_thirty_km
Question
The above returns data where the distance is below 30, so I assume I am doing something wrong:
What am I doing wrong here ?
Is this the right way to solve this problem? If not, how should I proceed?
Edit:
Here is a minimal example of what I expect:
data = [
    ("1", 15),
    ("1", 35),
    ("2", 15),
    ("2", 30),
    ("3", 35)]
data = spark.createDataFrame(data, ['id', 'distance'])
data.show()
from pyspark.sql.functions import col

thirty = data.filter(col("distance") <= 30)
dist_thirty = thirty.select("id").distinct()
not_in_thirty = data.join(dist_thirty, "id", "left_anti")
print("thirty")
thirty.show()
print("distinst thirty")
dist_thirty.show()
print("not_in_thirty")
not_in_thirty.show()
Output:
+---+--------+
| id|distance|
+---+--------+
| 3| 35|
+---+--------+
But I do get distances <= 30 when running on my actual data.
"leftanti" should be replaced by "left_anti" following the documentation on:
https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join
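Applied to the snippet above, that is just a change of the join-type string (a minimal sketch reusing the variable names from the question):

not_within_thirty_km = data_with_straight_distance.join(
    distinct_id_thirty,
    "id",
    "left_anti")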

Count elements satisfying an extra condition on another column when grouping by in pyspark

The following pyspark command
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
created the following result.
+---------+---------+
|URL_short|NumOfReqs|
+---------+---------+
|    http1|      500|
|    http4|      500|
|    http2|      500|
|    http3|      500|
+---------+---------+
In the original DataFrame dataFrame I have a column named success whose type is text. The value can be "true" or "false".
In the result I would like to have an additional column, named for example NumOfSuccess, which counts the elements having the entry "true" in the original column success, per URL_short category.
How can I modify
df = dataFrame.groupBy("URL_short").count().select("URL_short", col("count").alias("NumOfReqs"))
to also output a column counting the elements satisfying the condition success == "true" per URL_short category?
One way to do it is to add another aggregation expression (also turn the count into an agg expression):
import pyspark.sql.functions as f
dataFrame.groupBy("URL_short").agg(
f.count('*').alias('NumOfReqs'),
f.sum(f.when(f.col('success'), 1).otherwise(0)).alias('CountOfSuccess')
).show()
Note this assumes your success column is of boolean type. If it is a string, change the expression to f.sum(f.when(f.col('success') == 'true', 1).otherwise(0)).alias('CountOfSuccess').
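For reference, a minimal sketch of the string-typed variant, using the NumOfSuccess name the question asked for (the column names are the ones from the question):

import pyspark.sql.functions as f

dataFrame.groupBy("URL_short").agg(
    f.count('*').alias('NumOfReqs'),
    # success is stored as the text "true"/"false" per the question
    f.sum(f.when(f.col('success') == 'true', 1).otherwise(0)).alias('NumOfSuccess')
).show()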

Appending to a Pandas dataframe and specifying the row index

I'm a little confused about the workings of the Pandas dataframe.
I have a pd.DataFrame that looks like this:
index | val1 | val2
-----------------------------------
20-11-2017 22:33:20 | 0.33 | 05.43
23-11-2017 23:34:14 | 4.23 | 09.43
I'd like to append a row to it, and be able to specify the index, which in my case is a date and time.
I have tried the following methods:
dataframe = pd.DataFrame(columns=['val1', 'val2'])
dataframe.loc[someDate] = [someVal, someVal]
This seems to overwrite if the index already exists, but I want to be able to have duplicate indices.
dataframe = pd.DataFrame(columns=['val1', 'val2'])
record = pd.Series(
    index=[someDate],
    data=[someVal, someVal]
)
dataframe.append(record)
This causes the application to hang without returning an exception or error.
Am I missing something? Is this the correct way of doing the thing I want to achieve?
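For reference, a minimal sketch of one way to append a row with an explicit, possibly duplicated, datetime index, assuming a pandas version in which DataFrame.append is still available (it was removed in pandas 2.0); the values are made up for illustration:

import pandas as pd

dataframe = pd.DataFrame(columns=['val1', 'val2'])

# Giving the Series a name makes that name the new row's index label;
# note that append() returns a new DataFrame instead of modifying in place.
record = pd.Series(data={'val1': 0.33, 'val2': 5.43},
                   name=pd.Timestamp('2017-11-20 22:33:20'))

dataframe = dataframe.append(record)
dataframe = dataframe.append(record)  # duplicate index labels are allowed

print(dataframe)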

Fill in missing boolean rows in Pandas

I have a MySQL query that is doing a groupby and returning data in the following form:
ID | Boolean | Count
Sometimes there isn't data in the table for one of the boolean states, so data for a single ID might be returned like this:
1234 | 0 | 10
However I need it in this form for downstream analysis:
1234 | 0 | 10
1234 | 1 | 0
with an index on [ID, Boolean].
From querying Google and SO, it seems like getting MySQL to do this transform is a bit of a pain. Is there a simple way to do this in Pandas? I haven't been able to find anything useful in the docs or the Pandas cookbook.
You can assume that I've already loaded the data into a Pandas dataframe with no indexes.
Thanks.
I would set the index of your dataframe to the ID and Boolean columns, and then construct a new index from the Cartesian product of the unique values.
That would look like this:
import pandas
indexcols = ['ID', 'Boolean']
data = pandas.read_sql_query(querytext, engine)  # SQL string first, then the connection

full_index = pandas.MultiIndex.from_product(
    [data['ID'].unique(), [0, 1]],
    names=indexcols
)

data = (
    data.set_index(indexcols)
        .reindex(full_index)
        .fillna(0)
        .reset_index()
)
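As a quick check, a minimal sketch with the single-row example from the question, using an in-memory frame instead of the SQL query:

import pandas

data = pandas.DataFrame({'ID': [1234], 'Boolean': [0], 'Count': [10]})

indexcols = ['ID', 'Boolean']
full_index = pandas.MultiIndex.from_product(
    [data['ID'].unique(), [0, 1]],
    names=indexcols
)

out = (
    data.set_index(indexcols)
        .reindex(full_index)
        .fillna(0)
        .reset_index()
)
print(out)
# ID 1234 now has both Boolean=0 (Count 10.0) and Boolean=1 (Count 0.0)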

Efficiently plotting multiple columns in pandas

I would like to know how to efficiently plot groups of multiple columns in a pandas dataframe.
I have the following dataframe
| a | b | c |...|trial1.1|trial1.2|...|trial1.12|trial2.1|...|trial2.12|trial3.1|...|trial3.12|
GlobalID|
sd12f |...|...|...|...| 210.1 | 213.1 |...| 170.1 | 176.2 |...| 160.31 | 162.4 |...| 186.1 |
...
I would like to loop through the rows and for each row plot three waveforms: trial1.[1-12], trial2.[1-12], trial3.[1-12]. What is the most efficient way to do this? Right now I have:
t1 = df.ix[0][df.columns[[colname.startswith('trial1') for colname in df]]]
t2 = df.ix[0][df.columns[[colname.startswith('trial2') for colname in df]]]
t3 = df.ix[0][df.columns[[colname.startswith('trial3') for colname in df]]]
t1.astype(float).plot()
t2.astype(float).plot()
t3.astype(float).plot()
I need the .astype(float) because the values are originally strings. Is there some more efficient way of doing this I am missing? I am new to python and pandas.
How about first transposing the dataframe, then splitting it by trial, then plotting?
import pandas as pd

# Transpose
data = pd.read_csv("data.txt").T
# Insert your code to remove irrelevant rows, like a, b, c in your example
#
# Group by the trial number (the first six characters) and plot
data.groupby(lambda x: x[:6], axis=0).plot()
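To make that concrete, here is a minimal sketch under the same assumptions (a hypothetical data.txt laid out as in the question, with GlobalID as the index column):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("data.txt", index_col="GlobalID")

# Keep only the trial columns, convert the string values to floats,
# and transpose so each trial reading becomes a row.
trials = data.filter(regex=r"^trial").astype(float).T

# Group the rows by their trial prefix ("trial1", "trial2", "trial3")
# and plot the waveform of the first GlobalID for each group.
for prefix, group in trials.groupby(lambda label: label.split(".")[0]):
    group.iloc[:, 0].plot(label=prefix)

plt.legend()
plt.show()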
