PySpark: Create a column from ordered concatenation of columns - python

I am having an issue creating a new column from the ordered concatenation of two existing columns on a PySpark dataframe, i.e.:
+------+------+--------+
| Col1 | Col2 | NewCol |
+------+------+--------+
| ORD | DFW | DFWORD |
| CUN | MCI | CUNMCI |
| LAX | JFK | JFKLAX |
+------+------+--------+
In other words, I want to grab Col1 and Col2, order them alphabetically, and concatenate them.
Any suggestions?

Combine concat_ws, array and sort_array
from pyspark.sql.functions import concat_ws, array, sort_array
df = spark.createDataFrame(
    [("ORD", "DFW"), ("CUN", "MCI"), ("LAX", "JFK")],
    ("Col1", "Col2"))
df.withColumn("NewCol", concat_ws("", sort_array(array("Col1", "Col2")))).show()
# +----+----+------+
# |Col1|Col2|NewCol|
# +----+----+------+
# | ORD| DFW|DFWORD|
# | CUN| MCI|CUNMCI|
# | LAX| JFK|JFKLAX|
# +----+----+------+
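If you later need the same ordered concatenation over more than two columns, the pattern should generalize by unpacking a list of column names into array. A minimal sketch, assuming any extra columns (e.g. a hypothetical Col3) exist on the dataframe:
from pyspark.sql.functions import array, concat_ws, sort_array

# Hypothetical: extend this list with any additional columns, e.g. "Col3".
cols = ["Col1", "Col2"]
df.withColumn("NewCol", concat_ws("", sort_array(array(*cols)))).show()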

Related

How to keep only certain df columns that appear as rows in another df

Assuming that we have two dataframes
df_1
+--------+--------+-------+-------+
| id | col1 | col2 | col3 |
+--------+--------+-------+-------+
| A | 10 | 5 | 4 |
| B | 5 | 3 | 2 |
+--------+--------+-------+-------+
and df_2
+----------+--------+---------+
| col_name | col_t | col_d |
+----------+--------+---------+
| col1 | 3.3 | 2.2 |
| col3 | 1 | 2 |
+----------+--------+---------+
What I want to achieve is to join the two tables, such that only the columns that appear under df_2's col_name are kept in df_1 i.e. the desired table would be
+--------+--------+-------+
| id | col1 | col3 |
+--------+--------+-------+
| A | 10 | 4 |
| B | 5 | 2 |
+--------+--------+-------+
However, I need to perform this action only through joins and/or a dataframe transpose or pivot, if possible.
I know that the above could easily be achieved by just selecting the df_1 columns as they appear in df_2's col_name, but that is not what I am looking for here.
One way to do this is to dedupe and obtain the values in df_2.col_name using collect_list, then pass this list of column names to your df_1 dataframe:
from pyspark.sql.functions import collect_list

col_list = list(set(df_2.select(collect_list("col_name")).collect()[0][0]))
list_with_id = ['id'] + col_list
df_1[list_with_id].show()
Output:
+---+----+----+
| id|col1|col3|
+---+----+----+
| A| 10| 4|
| B| 5| 2|
+---+----+----+
Is this what you're looking for? (Assuming you want something dynamic and not manually selecting columns). I'm not using joins or pivots here though.
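For the join/pivot requirement specifically, one possible sketch (not from the answer above, and assuming the three data columns share a common type) is to unpivot df_1 with stack, inner-join against df_2 on col_name, and pivot back:
from pyspark.sql.functions import first

# Unpivot df_1 into (id, col_name, value) rows.
long_df = df_1.selectExpr(
    "id",
    "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) as (col_name, value)")

# Keep only the column names that appear in df_2, then pivot back to wide form.
kept = long_df.join(df_2.select("col_name"), on="col_name", how="inner")
kept.groupBy("id").pivot("col_name").agg(first("value")).show()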

Python: Replace None value with value of other column [duplicate]

I'm trying to fill a column of a dataframe from another dataframe based on conditions. Let's say my first dataframe is df1 and the second is named df2.
# df1 is described as below:
+------+------+
| Col1 | Col2 |
+------+------+
| A | 1 |
| B | 2 |
| C | 3 |
| A | 1 |
+------+------+
And
# df2 is described as below:
+------+------+
| Col1 | Col2 |
+------+------+
| A | NaN |
| B | NaN |
| D | NaN |
+------+------+
Each distinct value of Col1 has its own id number (in Col2), so what I want is to fill the NaN values in df2.Col2 where df2.Col1 == df1.Col1.
So that my second dataframe will look like :
# df2 :
+------+------+
| Col1 | Col2 |
+------+------+
| A | 1 |
| B | 2 |
| D | NaN |
+------+------+
I'm using Python 2.7
Use drop_duplicates with set_index and combine_first:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates().set_index('Col1')).reset_index()
If you need to check duplicates only in Col1:
df = df2.set_index('Col1').combine_first(df1.drop_duplicates('Col1').set_index('Col1')).reset_index()
Here is a solution with the filter df1.Col1 == df2.Col1
df2['Col2'] = df1[df1.Col1 == df2.Col1]['Col2']
It is even better to use loc (but less clear from my point of view)
df2['Col2'] = df1.loc[df1.Col1 == df2.Col1, 'Col2']
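Another common idiom, offered here only as a hedged alternative sketch rather than one of the answers above: build a Col1 -> Col2 mapping from df1 and fill only the missing values in df2:
import numpy as np
import pandas as pd

# Rebuild the example frames from the question.
df1 = pd.DataFrame({'Col1': ['A', 'B', 'C', 'A'], 'Col2': [1, 2, 3, 1]})
df2 = pd.DataFrame({'Col1': ['A', 'B', 'D'], 'Col2': [np.nan, np.nan, np.nan]})

# Map each Col1 value to its id from df1 and fill only the NaN entries in df2.
mapping = df1.drop_duplicates('Col1').set_index('Col1')['Col2']
df2['Col2'] = df2['Col2'].fillna(df2['Col1'].map(mapping))
# Result: A -> 1, B -> 2, D stays NaN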

How do I reference a sub-column from a pivoted table inside a method such as sort_values?

This table has 2 levels of columns after pivoting col2. I want to sort the table by df['col3']['A'], but .sort_values() only accepts a string or a list of strings to reference column(s).
I know that for this specific problem I could just sort the table before pivoting, but the same issue applies to other methods, e.g. df.style.set_properties(subset=...), on any dataframe with multi-level/hierarchical columns.
EDIT:
Table here:
+------+---------------------+---------------------+
| | col3 | col4 |
+------+----------+----------+----------+----------+
| col2 | A | B | A | B |
+------+----------+----------+----------+----------+
| col1 | | | | |
+------+----------+----------+----------+----------+
| 1 | 0.242926 | 0.181175 | 0.189465 | 0.338340 |
| 2 | 0.240864 | 0.494611 | 0.211883 | 0.614739 |
| 3 | 0.052051 | 0.757591 | 0.361446 | 0.389341 |
+------+----------+----------+----------+----------+
Found the answer here: Sort Pandas Pivot Table by the margin ('All') values column
Basically, just put the column and sub-column(s) in a tuple; for my case it is just .sort_values(('col3', 'A')).
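A minimal, self-contained sketch (with made-up data, since the original frame isn't shown) of both the pivot and the tuple-based sort:
import numpy as np
import pandas as pd

# Hypothetical long-format data shaped like the question's table.
long_df = pd.DataFrame({
    'col1': [1, 1, 2, 2, 3, 3],
    'col2': ['A', 'B'] * 3,
    'col3': np.random.rand(6),
    'col4': np.random.rand(6),
})

# Pivot col2 into a second column level, then sort by the ('col3', 'A') sub-column.
wide = long_df.pivot_table(index='col1', columns='col2', values=['col3', 'col4'])
print(wide.sort_values(('col3', 'A')))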

Pandas Merge multiple dataframes on index and column

I am trying to merge multiple dataframes into one main dataframe, using the datetime index and id from the main dataframe and the DateTime and id columns from the other dataframes.
Main dataframe
DateTime | id | data
(Df.Index)
---------|----|------
2017-9-8 | 1 | a
2017-9-9 | 2 | b
df1
id | data1 | data2 | DateTime
---|-------|-------|---------
1 | a | c | 2017-9-8
2 | b | d | 2017-9-9
5 | a | e | 2017-9-20
df2
id | data3 | data4 | DateTime
---|-------|-------|---------
1 | d | c | 2017-9-8
2 | e | a | 2017-9-9
4 | f | h | 2017-9-20
The main dataframe and the other dataframes are in different dictionaries. I want to read from each dictionary and merge when the joining condition (DateTime, id) is met:
for sleep in dictOfSleep:  # main dataframes
    for sensorDevice in dictOfSensor:  # other dataframes
        try:
            dictOfSleep[sleep] = pd.merge(dictOfSleep[sleep], dictOfSensor[sensorDevice],
                                          how='outer', on=['DateTime', 'id'])
        except:
            print('Join could not be done')
Desired Output:
DateTime | id | data | data1 | data2 | data3 | data4
(Df.Index)
---------|----|------|-------|-------|-------|-------|
2017-9-8 | 1 | a | a | c | d | c |
2017-9-9 | 2 | b | b | d | e | a |
I'm not sure how your dictionaries are set up, so you will most likely need to modify this, but I'd try something like:
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # set df index to match the main_df index
    df = df.set_index(['DateTime'])
    # try join (not merge) when combining on index
    main_df = main_df.join(df, how='outer')
Alternatively, if the id column is very important, you can first reset your main_df index and then merge on both columns.
main_df = main_df.reset_index()
for sensorDevice in dictOfSensor:
    df = dictOfSensor[sensorDevice]
    # try to merge on both columns
    main_df = main_df.merge(df, how='outer', on=['DateTime', 'id'])
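For reference, a self-contained sketch of that second approach using the sample frames from the question (an inner merge is used here so the output matches the desired two rows; the outer merge above would also keep the unmatched 2017-9-20 rows with NaNs):
import pandas as pd

# Rebuild the sample frames; DateTime is the index of the main dataframe.
main_df = pd.DataFrame(
    {'id': [1, 2], 'data': ['a', 'b']},
    index=pd.to_datetime(['2017-9-8', '2017-9-9']).rename('DateTime'))
df1 = pd.DataFrame({
    'id': [1, 2, 5], 'data1': ['a', 'b', 'a'], 'data2': ['c', 'd', 'e'],
    'DateTime': pd.to_datetime(['2017-9-8', '2017-9-9', '2017-9-20'])})
df2 = pd.DataFrame({
    'id': [1, 2, 4], 'data3': ['d', 'e', 'f'], 'data4': ['c', 'a', 'h'],
    'DateTime': pd.to_datetime(['2017-9-8', '2017-9-9', '2017-9-20'])})

# Move DateTime out of the index, merge on both keys, then restore the index.
out = main_df.reset_index()
for other in (df1, df2):
    out = out.merge(other, how='inner', on=['DateTime', 'id'])
out = out.set_index('DateTime')
print(out)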

Merge multiple columns into one column in pyspark dataframe using python

I need to merge multiple columns of a dataframe into one single column, with a list (or tuple) as the value for that column, using PySpark in Python.
Input dataframe:
+-------+-------+-------+-------+-------+
| name |mark1 |mark2 |mark3 | Grade |
+-------+-------+-------+-------+-------+
| Jim | 20 | 30 | 40 | "C" |
+-------+-------+-------+-------+-------+
| Bill | 30 | 35 | 45 | "A" |
+-------+-------+-------+-------+-------+
| Kim | 25 | 36 | 42 | "B" |
+-------+-------+-------+-------+-------+
Output dataframe should be
+-------+-----------------+
| name |marks |
+-------+-----------------+
| Jim | [20,30,40,"C"] |
+-------+-----------------+
| Bill | [30,35,45,"A"] |
+-------+-----------------+
| Kim | [25,36,42,"B"] |
+-------+-----------------+
Columns can be merged with Spark's array function:
import pyspark.sql.functions as f
columns = [f.col("mark1"), ...]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")
You might need to change the type of the entries in order for the merge to be successful
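A hedged sketch of that type note: Spark's array requires a common element type, so one option is to cast the numeric marks to string before combining them with the string Grade column:
import pyspark.sql.functions as f

# Cast everything to string so array() gets a single, consistent element type.
columns = [f.col(c).cast("string") for c in ["mark1", "mark2", "mark3", "Grade"]]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")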
Look at this doc: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")
output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)
You can do it in a select like the following:
from pyspark.sql.functions import *
df.select(
    'name',
    concat(
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade")
    ).alias('marks')
)
If the [ ] brackets are necessary, they can be added with the lit function.
from pyspark.sql.functions import *
df.select(
    'name',
    concat(
        lit("["),
        col("mark1"), lit(","),
        col("mark2"), lit(","),
        col("mark3"), lit(","),
        col("Grade"), lit("]")
    ).alias('marks')
)
If this is still relevant, you can use StringIndexer to encode your string values to float substitutes.
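A minimal sketch of that StringIndexer idea combined with the VectorAssembler answer above (assuming the mark columns are already numeric):
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Encode the string Grade column into a numeric index first.
indexer = StringIndexer(inputCol="Grade", outputCol="GradeIndex")
indexed = indexer.fit(df).transform(df)

# Then assemble all four values into a single vector column.
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3", "GradeIndex"],
    outputCol="marks")
assembler.transform(indexed).select("name", "marks").show(truncate=False)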
