Converting Pandas dataframe into Spark dataframe error - python

I'm trying to convert a Pandas DataFrame into a Spark one.
DF head:
10000001,1,0,1,12:35,OK,10002,1,0,9,f,NA,24,24,0,3,9,0,0,1,1,0,0,4,543
10000001,2,0,1,12:36,OK,10002,1,0,9,f,NA,24,24,0,3,9,2,1,1,3,1,3,2,611
10000002,1,0,4,12:19,PA,10003,1,1,7,f,NA,74,74,0,2,15,2,0,2,3,1,2,2,691
Code:
dataset = pd.read_csv("data/AS/test_v2.csv")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
sdf = sqlCtx.createDataFrame(dataset)
And I got an error:
TypeError: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
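For context, this merge error usually means that at least one object-typed pandas column holds a mix of Python types (for example strings alongside NaN, which is a float), so Spark infers conflicting types for different rows. A minimal sketch that reproduces the same kind of error; the column name and values are made up for illustration:
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# An object column mixing strings with NaN (a float) leads Spark to infer
# StringType for some rows and DoubleType for others.
mixed = pd.DataFrame({"flag": ["f", float("nan"), "f"]})
spark.createDataFrame(mixed)  # typically raises the "Can not merge type" TypeError (without Arrow enabled)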

I made this script; it worked for my 10 pandas DataFrames:
from pyspark.sql.types import *

# Auxiliary functions
def equivalent_type(f):
    if f == 'datetime64[ns]': return TimestampType()
    elif f == 'int64': return LongType()
    elif f == 'int32': return IntegerType()
    elif f == 'float64': return DoubleType()
    elif f == 'float32': return FloatType()
    else: return StringType()

def define_structure(string, format_type):
    try: typo = equivalent_type(format_type)
    except: typo = StringType()
    return StructField(string, typo)

# Given a pandas dataframe, it will return a Spark dataframe.
def pandas_to_spark(pandas_df):
    columns = list(pandas_df.columns)
    types = list(pandas_df.dtypes)
    struct_list = []
    for column, typo in zip(columns, types):
        struct_list.append(define_structure(column, typo))
    p_schema = StructType(struct_list)
    return sqlContext.createDataFrame(pandas_df, p_schema)
You can also see it in this gist.
With this you just have to call spark_df = pandas_to_spark(pandas_df).
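A quick usage sketch on the question's CSV (this assumes a SQLContext named sqlContext already exists, since the function above uses it):
import pandas as pd

dataset = pd.read_csv("data/AS/test_v2.csv")
spark_df = pandas_to_spark(dataset)
spark_df.printSchema()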

You need to make sure your pandas dataframe columns are appropriate for the type Spark is inferring. If your pandas dataframe lists something like:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5062 entries, 0 to 5061
Data columns (total 51 columns):
SomeCol 5062 non-null object
Col2 5062 non-null object
And you're getting that error, try:
df[['SomeCol', 'Col2']] = df[['SomeCol', 'Col2']].astype(str)
Now, make sure .astype(str) is actually the type you want those columns to be. Basically, when the underlying Java code tries to infer the type from a Python object, it uses a few observations and makes a guess. If that guess doesn't apply to all the data in the column(s) it's trying to convert from pandas to Spark, it will fail.
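A hedged variation of the same idea, if you would rather cast every object-typed column to string in one pass before converting (df and sqlCtx are the names used above; whether str is the right target type for all of those columns is an assumption about your data):
# Cast every object-dtype column to str so Spark sees one consistent type per column.
object_cols = df.select_dtypes(include="object").columns
df[object_cols] = df[object_cols].astype(str)
sdf = sqlCtx.createDataFrame(df)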

Type-related errors can be avoided by imposing a schema as follows:
Note: a text file was created (test.csv) with the original data (as above), and hypothetical column names were inserted ("col1", "col2", ..., "col25").
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('pandasToSparkDF').getOrCreate()
pdDF = pd.read_csv("test.csv")
contents of the pandas data frame:
col1 col2 col3 col4 col5 col6 col7 col8 ...
0 10000001 1 0 1 12:35 OK 10002 1 ...
1 10000001 2 0 1 12:36 OK 10002 1 ...
2 10000002 1 0 4 12:19 PA 10003 1 ...
Next, create the schema:
from pyspark.sql.types import *
mySchema = StructType([ StructField("col1", LongType(), True)\
,StructField("col2", IntegerType(), True)\
,StructField("col3", IntegerType(), True)\
,StructField("col4", IntegerType(), True)\
,StructField("col5", StringType(), True)\
,StructField("col6", StringType(), True)\
,StructField("col7", IntegerType(), True)\
,StructField("col8", IntegerType(), True)\
,StructField("col9", IntegerType(), True)\
,StructField("col10", IntegerType(), True)\
,StructField("col11", StringType(), True)\
,StructField("col12", StringType(), True)\
,StructField("col13", IntegerType(), True)\
,StructField("col14", IntegerType(), True)\
,StructField("col15", IntegerType(), True)\
,StructField("col16", IntegerType(), True)\
,StructField("col17", IntegerType(), True)\
,StructField("col18", IntegerType(), True)\
,StructField("col19", IntegerType(), True)\
,StructField("col20", IntegerType(), True)\
,StructField("col21", IntegerType(), True)\
,StructField("col22", IntegerType(), True)\
,StructField("col23", IntegerType(), True)\
,StructField("col24", IntegerType(), True)\
,StructField("col25", IntegerType(), True)])
Note: True means the field is nullable.
create the pyspark dataframe:
df = spark.createDataFrame(pdDF,schema=mySchema)
confirm the pandas data frame is now a pyspark data frame:
type(df)
output:
pyspark.sql.dataframe.DataFrame
Aside:
To address Kate's comment below - to impose a general (String) schema you can do the following:
df=spark.createDataFrame(pdDF.astype(str))
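If you go the all-string route, you can still cast individual columns back to numeric types on the Spark side afterwards; a small sketch (the column names follow the hypothetical col1 ... col25 naming above):
from pyspark.sql.functions import col

df = spark.createDataFrame(pdDF.astype(str))
df = df.withColumn("col1", col("col1").cast("long")) \
       .withColumn("col2", col("col2").cast("int"))
df.printSchema()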

In Spark version >= 3 you can convert a pandas dataframe to a PySpark dataframe in one line: use spark.createDataFrame(pandasDF).
dataset = pd.read_csv("data/AS/test_v2.csv")
sparkDf = spark.createDataFrame(dataset)
If you are confused about the spark session variable, it is created as follows:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))
spark = SparkSession \
    .builder \
    .getOrCreate()
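As an aside, for larger pandas frames the conversion can often be sped up with Apache Arrow, assuming pyarrow is installed; the config key below is the Spark 3.x name:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
sparkDf = spark.createDataFrame(dataset)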

I have tried this with your data and it is working:
%pyspark
import pandas as pd
from pyspark.sql import SQLContext

print(sc)
df = pd.read_csv("test.csv")
print(type(df))
print(df)
sqlCtx = SQLContext(sc)
sqlCtx.createDataFrame(df).show()

I cleaned up / simplified the top answer a bit:
import pyspark.sql.types as ps_types


def get_equivalent_spark_type(pandas_type):
    """
    This method will retrieve the corresponding spark type given a pandas
    type.

    Args:
        pandas_type (str): pandas data type

    Returns:
        spark data type
    """
    type_map = {
        'datetime64[ns]': ps_types.TimestampType(),
        'int64': ps_types.LongType(),
        'int32': ps_types.IntegerType(),
        'float64': ps_types.DoubleType(),
        'float32': ps_types.FloatType()}
    if pandas_type not in type_map:
        return ps_types.StringType()
    else:
        return type_map[pandas_type]


def pandas_to_spark(spark, pandas_df):
    """
    This method will return a spark dataframe given a pandas dataframe.

    Args:
        spark (pyspark.sql.session.SparkSession): pyspark session
        pandas_df (pandas.core.frame.DataFrame): pandas DataFrame

    Returns:
        equivalent spark DataFrame
    """
    columns = list(pandas_df.columns)
    types = list(pandas_df.dtypes)
    p_schema = ps_types.StructType([
        ps_types.StructField(column, get_equivalent_spark_type(pandas_type))
        for column, pandas_type in zip(columns, types)])
    return spark.createDataFrame(pandas_df, p_schema)
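A hedged usage sketch (the CSV path is reused from the question; the SparkSession setup is an assumption):
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pandas_df = pd.read_csv("data/AS/test_v2.csv")
spark_df = pandas_to_spark(spark, pandas_df)
spark_df.printSchema()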

I received a similar error message once; in my case it was because my pandas dataframe contained NULLs. I would recommend handling these in pandas before converting to Spark (that resolved the issue in my case).
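A minimal sketch of that kind of cleanup, assuming missing values in string columns should become empty strings and missing values in numeric columns should become 0 (both fill values are assumptions; pick what fits your data):
import pandas as pd

df = pd.read_csv("data/AS/test_v2.csv")

# Fill missing values per dtype so each object column holds a single Python type.
string_cols = df.select_dtypes(include="object").columns
numeric_cols = df.select_dtypes(include="number").columns
df[string_cols] = df[string_cols].fillna("")
df[numeric_cols] = df[numeric_cols].fillna(0)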

Related

Iterate through each column and find the max length

I want to get the maximum length of each column of a pyspark dataframe.
Following is the sample dataframe:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)
         ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
I tried to implement the solution provided in Scala but could not convert it.
This works:
from pyspark.sql.functions import col, length, max
df=df.select([max(length(col(name))) for name in df.schema.names])
Edit: For reference, to convert the result to rows (as asked in the comments; updated in "pyspark max string length for each column in the dataframe" as well):
from pyspark.sql import Row
df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row = df.first().asDict()
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names], ['col', 'length'])
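If you do not need a second DataFrame, a hedged alternative is to reuse the collected row dict from the snippet above and print the (column, max length) pairs directly:
# row maps each column name to its maximum length (collected above).
for name in df.schema.names:
    print(name, row[name])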

Convert RDD[Row] containing strings To Dataframe Of IntegerTypes

I have the following RDD of Rows. As can be seen, each field is a string type:
[Row(A='6', B='1', C='hi'),
Row(A='4', B='5', C='bye'),
Row(A='8', B='9', C='night')]
I want to convert this RDD into a dataframe with IntegerType for columns A and B:
dtypes = [
    StructField('A', IntegerType(), True),
    StructField('B', IntegerType(), True),
    StructField('C', StringType(), True)
]
df = spark.createDataFrame(rdd, StructType(dtypes))
I get the following error:
TypeError: field A: IntegerType can not accept object '6' in type <class 'str'>
How can I successfully convert '6' into an IntegerType?
You should modify the RDD of rows before you create a dataframe with the desired column types. Note that iterating a Row yields its values rather than its field names, so convert it to a dict first:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def modify_row(row):
    new_row = row.asDict()
    for key in ['A', 'B']:
        new_row[key] = int(new_row[key])
    return Row(**new_row)

rdd = (sc.parallelize([Row(A='6', B='1', C='hi'),
                       Row(A='4', B='5', C='bye'),
                       Row(A='8', B='9', C='night')])
       .map(lambda x: modify_row(x)))

dtypes = [
    StructField('A', IntegerType(), True),
    StructField('B', IntegerType(), True),
    StructField('C', StringType(), True)
]
df = spark.createDataFrame(rdd, StructType(dtypes))
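An alternative sketch that avoids modifying the RDD at all: let Spark infer string columns first, then cast A and B (rdd_of_rows stands for the original RDD of Rows from the question):
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

df_str = spark.createDataFrame(rdd_of_rows)  # schema inferred, all fields strings
df = (df_str
      .withColumn('A', col('A').cast(IntegerType()))
      .withColumn('B', col('B').cast(IntegerType())))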

PySpark: Cannot create small dataframe

I'm trying to create a small dataframe so that I can save two scalars (doubles) and a string.
From "How to create spark dataframe with column name which contains dot/period?":
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson])
schema = StructType([StructField("Comparison", StringType(), False), \
StructField("Paired p-value", DoubleType(), False), \
StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
producing the error message:
TypeError: StructType can not accept object 's3://sanford-biofx-dev/con/dev3/dev' in type <class 'str'>
This doesn't make any sense to me; this column was meant for strings.
My other attempted solution is from "Add new rows to pyspark Dataframe":
columns = ["comparison", "paired p", "Pearson coefficient"]
vals = [output_stem, paired_p_value, scalar_pearson]
df = spark.createDataFrame(vals, columns)
display(df)
but this gives an error: TypeError: Can not infer schema for type: <class 'str'>
I just want a small dataframe:
comparison | paired p-value | Pearson Coefficient
-------------------------------------------------
s3://sadf | 0.045 | -0.039
The solution is to put a comma of mystery at the end of input_data, thanks to user "10465355 says Reinstate Monica":
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
input_data = ([output_stem, paired_p_value, scalar_pearson],)
schema = StructType([StructField("Comparison", StringType(), False), \
StructField("Paired p-value", DoubleType(), False), \
StructField("Pearson coefficient", DoubleType(), True)])
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)
I didn't understand at first why this comma is necessary or what it does, but it does the job: createDataFrame expects a collection of rows, so without the comma input_data is just a list of three scalars, which Spark treats as three malformed rows; the trailing comma wraps that list in a one-element tuple, making it a single row with three columns.
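For what it's worth, a sketch of the equivalent but more explicit form, which makes the "collection of rows" shape visible (a list containing one row tuple):
# One row with three columns; the outer list is the collection of rows.
input_data = [(output_stem, paired_p_value, scalar_pearson)]
df_compare_AF = sqlContext.createDataFrame(input_data, schema)
display(df_compare_AF)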

What should I add to the code to avoid 'exceeds max allowed bytes' error using pyspark?

I have a dataframe with 4 million rows and 10 columns. I am trying to write this to a table in hdfs from the Cloudera Data Science Workbench using pyspark. I am running into an error when trying to do this:
[Stage 0:> (0 + 1) /
2]19/02/20 12:31:04 ERROR datasources.FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 0:0 was 318690577 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
I can break up the dataframe into 3 dataframes and perform the spark write 3 separate times, but I would like to do this just once if possible, perhaps by adding something like coalesce to the spark code.
import pandas as pd
df=pd.read_csv('BulkWhois/2019-02-20_Arin_Bulk/Networks_arin_db_2-20-2019_parsed.csv')
'''PYSPARK'''
from pyspark.sql import SQLContext
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark import SparkContext
spark = SparkSession.builder.appName('Arin_Network').getOrCreate()
schema = StructType([StructField('NetHandle', StringType(), False),
StructField('OrgID', StringType(), True),
StructField('Parent', StringType(), True),
StructField('NetName', StringType(), True),
StructField('NetRange', StringType(), True),
StructField('NetType', StringType(), True),
StructField('Comment', StringType(), True),
StructField('RegDate', StringType(), True),
StructField('Updated', StringType(), True),
StructField('Source', StringType(), True)])
dataframe = spark.createDataFrame(df, schema)
dataframe.write. \
mode("append"). \
option("path", "/user/hive/warehouse/bulkwhois_analytics.db/arin_network"). \
saveAsTable("bulkwhois_analytics.arin_network")
User 10465355 mentioned that I should read the data with Spark directly. Doing this is simpler and is the correct way to accomplish it.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('Networks').getOrCreate()
dataset = spark.read.csv('Networks_arin_db_2-20-2019_parsed.csv', header=True, inferSchema=True)
dataset.show(5)
dataset.write \
.mode("append") \
.option("path", "/user/hive/warehouse/analytics.db/arin_network") \
.saveAsTable("analytics.arin_network")
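If you ever do need to push a large local pandas frame through createDataFrame, the error message itself points at spark.rpc.message.maxSize; a hedged sketch of raising it at session build time (the 512 MB value is only an example):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('Arin_Network')
         .config("spark.rpc.message.maxSize", "512")  # value is in MB
         .getOrCreate())
dataframe = spark.createDataFrame(df, schema)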

Add RDD to DataFrame Column PySpark

I want to create a DataFrame from the columns of two RDDs.
The first is an RDD that I get from a CSV and the second is another RDD with a cluster prediction for each row.
My schema is:
customSchema = StructType([ \
StructField("Area", FloatType(), True), \
StructField("Perimeter", FloatType(), True), \
StructField("Compactness", FloatType(), True), \
StructField("Lenght", FloatType(), True), \
StructField("Width", FloatType(), True), \
StructField("Asymmetry", FloatType(), True), \
StructField("KernelGroove", FloatType(), True)])
Map my rdd and create the DataFrame:
FN2 = rdd.map(lambda x: (float(x[0]), float(x[1]),float(x[2]),float(x[3]),float(x[4]),float(x[5]),float(x[6])))
df = sqlContext.createDataFrame(FN2, customSchema)
And my cluster prediction:
result = Kmodel.predict(rdd)
So, to conclude, I want to have in my DataFrame the rows of my CSV and their cluster prediction at the end.
I tried to add a new column with .withColumn() but I got nothing.
Thanks.
If you have a common field on both data frames, then join on that key; otherwise, create a unique id and join both dataframes to get the rows of the CSV and their cluster prediction in a single dataframe.
Here is Scala code to generate a unique id for each row; try to convert it for PySpark (a PySpark sketch follows after the Scala snippets). You need to generate an increasing row id and join the dataframes on that id.
import org.apache.spark.sql.types.{StructType, StructField, LongType}
val df = sc.parallelize(Seq(("abc", 2), ("def", 1), ("hij", 3))).toDF("word", "count")
val wcschema = df.schema
val inputRows = df.rdd.zipWithUniqueId.map{
case (r: Row, id: Long) => Row.fromSeq(id +: r.toSeq)}
val wcID = sqlContext.createDataFrame(inputRows, StructType(StructField("id", LongType, false) +: wcschema.fields))
or use sql query
val tmpTable1 = sqlContext.sql("select row_number() over (order by count) as rnk,word,count from wordcount")
tmpTable1.show()
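Since the question is in PySpark, here is a hedged translation of the same zip-with-id-and-join idea, assuming FN2 (the feature RDD) and result (the prediction RDD from Kmodel.predict) line up row for row, and that customSchema and sqlContext are the ones defined in the question:
from pyspark.sql.types import StructType, StructField, LongType, IntegerType

# Give each row of both RDDs a matching id, then join the two DataFrames on it.
features_with_id = FN2.zipWithIndex().map(lambda t: (t[1],) + t[0])
predictions_with_id = result.zipWithIndex().map(lambda t: (t[1], int(t[0])))

features_df = sqlContext.createDataFrame(
    features_with_id,
    StructType([StructField("id", LongType(), False)] + customSchema.fields))
predictions_df = sqlContext.createDataFrame(
    predictions_with_id,
    StructType([StructField("id", LongType(), False),
                StructField("prediction", IntegerType(), True)]))

final_df = features_df.join(predictions_df, "id").drop("id")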
