Selecting only numeric/string column names from a Spark DF in PySpark - python

I have a Spark DataFrame in Pyspark (2.1.0) and I am looking to get the names of numeric columns only or string columns only.
For example, this is the Schema of my DF:
root
|-- Gender: string (nullable = true)
|-- SeniorCitizen: string (nullable = true)
|-- MonthlyCharges: double (nullable = true)
|-- TotalCharges: double (nullable = true)
|-- Churn: string (nullable = true)
This is what I need:
num_cols = [MonthlyCharges, TotalCharges]
str_cols = [Gender, SeniorCitizen, Churn]
How can I make it?

dtypes is a list of (columnName, type) tuples, so you can use a simple filter:
columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
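The same dtypes filter also gives you the numeric columns; a minimal sketch, assuming the numeric columns are doubles as in the schema above:
# string columns
str_cols = [name for name, dtype in df.dtypes if dtype == 'string']
# numeric columns (double here; extend with 'int', 'bigint', etc. if needed)
num_cols = [name for name, dtype in df.dtypes if dtype == 'double']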

PySpark provides a rich API related to schema types. As @DanieldePaula mentioned, you can access the fields' metadata through df.schema.fields.
Here is a different approach based on checking the field types with isinstance:
from pyspark.sql.types import StringType, DoubleType
df = spark.createDataFrame([
    [1, 2.3, "t1"],
    [2, 5.3, "t2"],
    [3, 2.1, "t3"],
    [4, 1.5, "t4"]
], ["cola", "colb", "colc"])
# get string
str_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StringType)]
# ['colc']
# or double
dbl_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, DoubleType)]
# ['colb']
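If you need every numeric column (int, long, double, ...) rather than doubles only, a small variation of the same idea is to check against the NumericType base class; a minimal sketch:
from pyspark.sql.types import NumericType
num_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, NumericType)]
# ['cola', 'colb']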

You can do what zlidme suggested to get only string (categorical) columns. To extend on that answer, take a look at the example below. It will give you all numeric (continuous) columns in a list called continuousCols, all categorical columns in a list called categoricalCols and all columns in a list called allCols.
import pandas as pd

data = {'mylongint': [0, 1, 2],
        'shoes': ['blue', 'green', 'yellow'],
        'hous': ['furnitur', 'roof', 'foundation'],
        'C': [1, 0, 0]}
play_df = pd.DataFrame(data)
play_ddf = spark.createDataFrame(play_df)
# store all column names in a list
allCols = [item[0] for item in play_ddf.dtypes]
# store all column names that are categorical in a list
categoricalCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('string')]
# store all column names that are continuous in a list
continuousCols = [item[0] for item in play_ddf.dtypes if item[1].startswith('bigint')]
print(len(allCols), ' - ', len(continuousCols), ' - ', len(categoricalCols))
This will give the result: 4 - 2 - 2

Related

Pyspark removing duplicate columns after broadcast join

I have two dataframes which I wish to join and then save as a parquet table. After performing the join my resulting table has duplicate columns, preventing me from saving the dataset.
Here is my code for the join
join_conditions = [
    df1.colX == df2.colY,
    df1.col1 == df2.col1,
    df1.col2 == df2.col2,
    df1.col3 == df2.col3,
]
dfj = df1.alias("1").join(
    F.broadcast(df2.alias("2")), join_conditions, "inner"
).drop("1.col1", "1.col2", "1.col3")
dfj.write.format("parquet").mode("overwrite").saveAsTable("table")
I expected the drop to remove the duplicate columns, but an exception is thrown saying they are still there when I try to save the table. drop() doesn't throw an exception if the columns don't exist, which means that the alias is probably wrong / not working as I expect?
I cannot do the join conditions as a list of strings as this seems to cause an error when not all columns in the join condition are called the same on each DataFrame:
join_conditions = [
    df1.colX == df2.colY,
    "col1",
    "col2",
    "col3"
]
doesn't work for example.
This join works but still results in the duplicate columns
join_conditions = [
    df1.colX == df2.colY,
    F.col("1.col1") == F.col("2.col1"),
    F.col("1.col2") == F.col("2.col2"),
    F.col("1.col3") == F.col("2.col3"),
]
also didn't work. All of these methods still result in the joined dataframe having the duplicate columns col1, col2 and col3. What am I doing wrong / not understanding correctly? Answers with pyspark sample code would be appreciated.
I'm not sure why it doesn't work; it's really strange.
This isn't so pretty, but it works:
from pyspark.sql import functions as F
data = [{'colX': "hello", 'col1': 1, 'col2': 2, 'col3': 3}]
data2 = [{'colY': "hello", 'col1': 1, 'col2': 2, 'col3': 3}]
df1 = spark.createDataFrame(data)
df2 = spark.createDataFrame(data2)
join_cond = [df1.colX == df2.colY,
             df1.col1 == df2.col1,
             df1.col2 == df2.col2,
             df1.col3 == df2.col3]
df1.join(F.broadcast(df2), join_cond, 'inner').drop(df1.col1).drop(df1.col2).drop(df1.col3).printSchema()
root
|-- colX: string (nullable = true)
|-- col1: long (nullable = true)
|-- col2: long (nullable = true)
|-- col3: long (nullable = true)
|-- colY: string (nullable = true)
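An alternative that avoids the duplicates entirely, assuming you are free to rename df2.colY to match df1.colX first: when you join on a list of column names, Spark keeps a single copy of those join columns. A minimal sketch:
# rename df2.colY so all join keys share the same name, then join on the names
df2_renamed = df2.withColumnRenamed("colY", "colX")
dfj = df1.join(F.broadcast(df2_renamed), ["colX", "col1", "col2", "col3"], "inner")
dfj.printSchema()  # colX, col1, col2, col3 each appear only once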

flatten nested json scala code in pyspark

Trying to do the following scala code but in pyspark:
val maxJsonParts = 3 // whatever that number is...
val jsonElements = (0 until maxJsonParts)
.map(i => get_json_object($"Payment", s"$$[$i]"))
val newDF = dataframe
.withColumn("Payment", explode(array(jsonElements: _*)))
.where(!isnull($"Payment"))
For example, I am trying to make a nested column such as in the payment column below:
id | name  | payment
---|-------|----------------------------------------------------------------
1  | James | [ {"#id": 1, "currency":"GBP"},{"#id": 2, "currency": "USD"} ]

to become:

id | name  | payment
---|-------|------------------------------
1  | James | {"#id": 1, "currency":"GBP"}
1  | James | {"#id":2, "currency":"USD"}
The table schema:
root
|-- id: integer (nullable = true)
|-- Name: string (nullable = true)
|-- Payment: string (nullable = true)
I tried writing this in PySpark but it just turns the nested column (payment) into null:
lst = [range(0,10)]
jsonElem = [F.get_json_object(F.col("payment"), f"$[{i}]") for i in lst]
bronzeDF = bronzeDF.withColumn("payment2", F.explode(F.array(*jsonElem)))
bronzeDF.show()
Any help is highly appreciated.
Here is another approach, which parses the given JSON against the right schema in order to generate the payment array. The solution is based on the from_json function, which allows you to parse a JSON string into a struct type.
from pyspark.sql.types import IntegerType, StringType, ArrayType, StructType, StructField
from pyspark.sql.functions import from_json, explode

data = [
    (1, 'James', '[ {"#id": 1, "currency":"GBP"},{"#id": 2, "currency": "USD"} ]'),
    (2, 'Tonny', '[ {"#id": 3, "currency":"EUR"},{"#id": 4, "currency": "USD"} ]'),
]
df = spark.createDataFrame(data, ['id', 'name', 'payment'])
str_schema = 'array<struct<`#id`:int,`currency`:string>>'
# st_schema = ArrayType(StructType([
#     StructField('#id', IntegerType()),
#     StructField('currency', StringType())]))
df = df.withColumn("payment", explode(from_json(df["payment"], str_schema)))
df.show()
df.show()
# +---+-----+--------+
# | id| name| payment|
# +---+-----+--------+
# | 1|James|[1, GBP]|
# | 1|James|[2, USD]|
# | 2|Tonny|[3, EUR]|
# | 2|Tonny|[4, USD]|
# +---+-----+--------+
Note: as you can see, you can choose between the string representation of the schema and the ArrayType object. Both should produce the same results.
I came to the following solution:
First, convert the column to a string type as follows:
bronzeDF = bronzeDF.withColumn("payment2", F.to_json("payment")).drop("payment")
Then you can run the following code on the column to stack the n nested JSON objects as separate rows with the same outer key values:
max_json_parts = 50
lst = [f for f in range(0, max_json_parts, 1)]
jsonElem = [F.get_json_object(F.col("payment2"), f"$[{i}]") for i in lst]
bronzeDF = bronzeDF.withColumn("payment2", F.explode(F.array(*jsonElem))).where(F.col("payment2").isNotNull())
Then convert back to a struct with a defined schema and explode the object keys as separate columns:
bronzeDF = bronzeDF.withColumn("temp", F.from_json("payment2", jsonSchemaPayment)).select("*", "temp.*").drop("payment2")

Create new dataFrame based on reformatted columns from old dataFrame

I imported the data from a database
df = spark.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/test.db") \
    .load()
I have selected the double columns using
double_list = [name for name,types in df.dtypes if types == 'double']
Credits to @Ramesh Maharjan.
To remove special characters we use
removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
The question is:
How can I create a new dataframe based on df with ONLY the double_list columns?
If you already have the list of column names with double as the datatype, then the next step is to remove the special characters, which can be done by using .isalnum() as you did:
removedSpecials = [''.join(y for y in x if y.isalnum()) for x in double_list]
Once you have the list of column names with the special characters removed, it's just a .withColumnRenamed() API call:
for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)
df.show(truncate=False) should give you the dataframe with the double-datatype columns renamed.
If you don't want the columns that are not in double_list, i.e. not of double datatype, then you can use select:
df.select(*removedSpecials).show(truncate=False)
The reason for doing
for (x, y) in zip(double_list, removedSpecials):
    df = df.withColumnRenamed(x, y)
before doing
df.select(*removedSpecials).show(truncate=False)
is that there might be special characters like . which prevent concise solutions like df.select([df[x].alias(y) for (x, y) in zip(double_list, removedSpecials)]).show(truncate=False) from working.
I hope the answer is helpful
Scala code, which you can convert into Python:
import sqlContext.implicits._
// sample df
df.show()
+----+--------------------+--------+
|data| Week|NumCCol1|
+----+--------------------+--------+
| aac|01/28/2018-02/03/...| 2.0|
| aac|02/04/2018-02/10/...| 23.0|
| aac|02/11/2018-02/17/...| 105.0|
+----+--------------------+--------+
df.printSchema()
root
|-- data: string (nullable = true)
|-- Week: string (nullable = true)
|-- NumCCol1: double (nullable = false)
val doubleCols = df.schema.fields
  .collect({ case x if x.dataType.typeName == "double" => x.name })
val df1 = df.select(doubleCols.head, doubleCols.tail: _*)
// df with only double columns
df1.show()
use df1.withColumnRenamed to rename the columns
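For reference, a rough PySpark sketch of the same idea (selecting only the double-typed columns), assuming df is the same sample DataFrame:
# collect the names of all double-typed columns, then select only those
double_cols = [f.name for f in df.schema.fields if f.dataType.typeName() == "double"]
df1 = df.select(*double_cols)
df1.show()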

Comparing schema of dataframe using Pyspark

I have a data frame (df).
For showing its schema I use:
from pyspark.sql.functions import *
df1.printSchema()
And I get the following result:
#root
# |-- name: string (nullable = true)
# |-- age: long (nullable = true)
Sometimes the schema changes (the column type or name):
df2.printSchema()
#root
# |-- name: array (nullable = true)
# |-- gender: integer (nullable = true)
# |-- age: long (nullable = true)
I would like to compare the two schemas (df1 and df2) and get only the differences in types and column names (sometimes a column can move to another position).
The results should be a table (or data frame) something like this:
column    df1      df2      diff
name      string   array    type
gender    N/A      integer  new column
(The age column is the same and didn't change. In case a column is omitted, there will be an indication 'omitted'.)
How can I do it efficiently if I have many columns in each?
Without any external library, we can find the schema difference using
from pyspark.sql.session import SparkSession
from pyspark.sql import DataFrame
def schema_diff(spark: SparkSession, df_1: DataFrame, df_2: DataFrame):
    s1 = spark.createDataFrame(df_1.dtypes, ["d1_name", "d1_type"])
    s2 = spark.createDataFrame(df_2.dtypes, ["d2_name", "d2_type"])
    difference = (
        s1.join(s2, (s1.d1_name == s2.d2_name) & (s1.d1_type == s2.d2_type), how="outer")
        .where(s1.d1_type.isNull() | s2.d2_type.isNull())
        .select(s1.d1_name, s1.d1_type, s2.d2_name, s2.d2_type)
        .fillna("")
    )
    return difference
fillna is optional; I prefer to view the missing entries as empty strings.
In the where clause we check the type columns: because the join is on both name and type, this also shows columns that exist in both dataframes but have different types.
This will also show all columns that are in the second dataframe but not in the first dataframe.
Usage:
diff = schema_diff(spark, df_1, df_2)
diff.show(diff.count(), truncate=False)
You can try creating two pandas dataframes with metadata from both df1 and df2 like below
pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type'])
and then join those two pandas dataframes through an 'outer' join, as sketched below.
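A minimal sketch of that idea, assuming pandas is available and using the column name as the join key (the type columns are renamed here to avoid merge suffixes):
import pandas as pd

pd_df1 = pd.DataFrame(df1.dtypes, columns=['column', 'data_type_df1'])
pd_df2 = pd.DataFrame(df2.dtypes, columns=['column', 'data_type_df2'])
# outer join on column name; NaN marks columns missing from one side
diff = pd_df1.merge(pd_df2, on='column', how='outer')
diff = diff[diff['data_type_df1'] != diff['data_type_df2']]
print(diff)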
A custom function that could be useful for someone.
def SchemaDiff(DF1, DF2):
    # Getting schema for both dataframes in a dictionary
    DF1Schema = {x[0]: x[1] for x in DF1.dtypes}
    DF2Schema = {x[0]: x[1] for x in DF2.dtypes}

    # Columns present in DF1 but not in DF2
    DF1MinusDF2 = dict.fromkeys((set(DF1.columns) - set(DF2.columns)), '')
    for column_name in DF1MinusDF2:
        DF1MinusDF2[column_name] = DF1Schema[column_name]

    # Columns present in DF2 but not in DF1
    DF2MinusDF1 = dict.fromkeys((set(DF2.columns) - set(DF1.columns)), '')
    for column_name in DF2MinusDF1:
        DF2MinusDF1[column_name] = DF2Schema[column_name]

    # Find data types changed in DF1 as compared to DF2
    UpdatedDF1Schema = {k: v for k, v in DF1Schema.items() if k not in DF1MinusDF2}
    UpdatedDF1Schema = {**UpdatedDF1Schema, **DF2MinusDF1}
    DF1DataTypesChanged = {}
    for column_name in UpdatedDF1Schema:
        if UpdatedDF1Schema[column_name] != DF2Schema[column_name]:
            DF1DataTypesChanged[column_name] = DF2Schema[column_name]

    return DF1MinusDF2, DF2MinusDF1, DF1DataTypesChanged
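A usage sketch, assuming df1 and df2 are the two DataFrames being compared:
df1_only, df2_only, type_changes = SchemaDiff(df1, df2)
print(df1_only)       # columns only in df1, with their df1 types
print(df2_only)       # columns only in df2, with their df2 types
print(type_changes)   # common columns whose type changed, with the df2 type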
To simply check whether the two schemas are identical, compare the schema objects rather than printSchema() (which only prints and returns None, so its results always compare equal):
df1.schema == df2.schema

Compare column names in two data frames pyspark

I have two data frames in pyspark df and data. The schema are like below
>>> df.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- Date: timestamp (nullable = false)
|-- ZipCode: integer (nullable = true)
|-- car: string (nullable = true)
|-- van: string (nullable = true)
>>> data.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- date: string (nullable = true)
|-- zipcode: integer (nullable = true)
Now I want to add the columns car and van to my data data frame by comparing both schemas.
I would also like to compare the two data frames: if the columns are the same, do nothing, but if the columns are different, then add the missing columns to the data frame that doesn't have them.
How can we achieve that in pyspark?
FYI I am using spark 1.6
Once the columns are added to the data frame, the values for those columns should be null.
For example, here we are adding columns to the data data frame, so the columns car and van in data should contain null values, but the same columns in df should keep their original values.
What happens if there are more than 2 new columns to be added?
As the schema is nothing but a StructType consisting of a list of StructFields, we can retrieve the fields list to compare and find the missing columns,
from pyspark.sql.functions import lit

df_schema = df.schema.fields
data_schema = data.schema.fields
df_names = [x.name.lower() for x in df_schema]
data_names = [x.name.lower() for x in data_schema]
if df_schema <> data_schema:
    col_diff = set(df_names) ^ set(data_names)
    col_list = [(x[0].name, x[0].dataType) for x in map(None, df_schema, data_schema)
                if ((x[0] is not None and x[0].name.lower() in col_diff) or x[1].name.lower() in col_diff)]
    for i in col_list:
        if i[0] in df_names:
            data = data.withColumn("%s" % i[0], lit(None).cast(i[1]))
        else:
            df = df.withColumn("%s" % i[0], lit(None).cast(i[1]))
else:
    print "Nothing to do"
You mentioned adding the column only if there are no null values, but since your schema differences are nullable columns, I have not used that check. If you need it, then add a check for nullable as below,
col_list = [(x[0].name, x[0].dataType) for x in map(None, df_schema, data_schema) if (x[0].name.lower() in col_diff or x[1].name.lower() in col_diff) and not x[0].nullable]
Please check the documentation for more about StructType and StructFields,
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.types.StructType
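If you are on a newer Python/Spark version and only need the basic behaviour, a more compact sketch of the same idea (adding each missing column as a typed null on the other DataFrame):
from pyspark.sql.functions import lit

def add_missing_columns(target, source):
    # add columns that exist in source but not in target (case-insensitive) as typed nulls
    target_names = set(c.lower() for c in target.columns)
    for field in source.schema.fields:
        if field.name.lower() not in target_names:
            target = target.withColumn(field.name, lit(None).cast(field.dataType))
    return target

data = add_missing_columns(data, df)   # adds car and van to data as null columns
df = add_missing_columns(df, data)     # handles the reverse direction, a no-op here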
If you have to do this to multiple tables, it might be worth it to generalize the code a bit. This code takes the first non-null value in the non-matching source column to create the new column in the target table.
from pyspark.sql.functions import lit, first

def first_non_null(f, t):  # find the first non-null value of a column
    return f.select(first(f[t], ignorenulls=True)).first()[0]

def match_type(f1, f2, miss):  # add missing columns to the target table
    for i in miss:
        try:
            f1 = f1.withColumn(i, lit(first_non_null(f2, i)))
        except:
            pass
        try:
            f2 = f2.withColumn(i, lit(first_non_null(f1, i)))
        except:
            pass
    return f1, f2

def column_sync_up(d1, d2):  # test if the matching requirement is met
    missing = list(set(d1.columns) ^ set(d2.columns))
    if len(missing) > 0:
        return match_type(d1, d2, missing)
    else:
        print "Columns Match!"
        return d1, d2

df1, df2 = column_sync_up(df1, df2)  # reuse as necessary
