I am trying to use KMeans on geospatial data stored in a MongoDB database using Apache Spark. The data has the following format:
DataFrame[decimalLatitude: double, decimalLongitude: double, features: vector]
The code is as follows, where inputdf is the dataframe.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"],
    outputCol="features")
inputdf = vecAssembler.transform(inputdf)
kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(inputdf.select("features"))
There seem to be some empty strings in the dataset, as I get the following error:
com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a IntegerType (value: BsonString{value=''})
I tried to find such rows using,
issuedf = inputdf.where(inputdf.decimalLatitude == '')
issuedf.show()
But I get the same type conversion error as above. I also tried df.replace, but I got the same error. How do I remove all rows where such values are present?
This issue can be solved by providing an explicit schema when loading the data, as follows:
from pyspark.sql.types import StructType, StructField, DoubleType

inputdf = my_spark.read.format("mongo").load(schema=StructType([
    StructField("decimalLatitude", DoubleType(), True),
    StructField("decimalLongitude", DoubleType(), True)]))
This ensures that all values are of DoubleType. Empty values can then be removed with inputdf.dropna().
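Putting the pieces together, a minimal sketch of the full flow under the same assumptions (the my_spark session and the MongoDB connector options are assumed to be configured elsewhere):

from pyspark.sql.types import StructType, StructField, DoubleType
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Load with an explicit schema so the coordinate columns arrive as doubles
schema = StructType([
    StructField("decimalLatitude", DoubleType(), True),
    StructField("decimalLongitude", DoubleType(), True)])
inputdf = my_spark.read.format("mongo").load(schema=schema)

# Drop rows with missing coordinates, then assemble features and cluster
inputdf = inputdf.dropna(subset=["decimalLatitude", "decimalLongitude"])
vecAssembler = VectorAssembler(
    inputCols=["decimalLatitude", "decimalLongitude"],
    outputCol="features")
inputdf = vecAssembler.transform(inputdf)
model = KMeans(k=10, seed=123).fit(inputdf.select("features"))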
I want to create a generic function in PySpark that takes a DataFrame and a datatype as parameters and filters out the columns that do not satisfy the criteria. I am not very good at Python, and I am stuck at the point where I cannot figure out how to do that.
I have a Scala version of the code that does the same thing:
// sample data
val df = Seq(("587","mumbai",Some(5000),5.05),("786","chennai",Some(40000),7.055),("432","Gujarat",Some(20000),6.75),("2","Delhi",None,10.0)).toDF("Id","City","Salary","Increase").withColumn("RefID",$"Id")

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, IntegerType}

def selectByType(colType: DataType, df: DataFrame) = {
  val cols = df.schema.toList
    .filter(x => x.dataType == colType)
    .map(c => col(c.name))
  df.select(cols: _*)
}

val res = selectByType(IntegerType, df)
res is the DataFrame that has only the integer columns, in this case the Salary column, and all the other columns with different types have been dropped dynamically.
I want the same behaviour in PySpark, but I am not able to accomplish it.
This is what I have tried:
# sample data
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
    StructField("raise", DoubleType(), True)
])
data2 = [("James","","Smith","36636","M",3000,2.5),
("Michael","Rose","","40288","M",4000,4.7),
("Robert","","Williams","42114","M",4000,8.9),
("Maria","Anne","Jones","39192","F",4000,0.0),
("Jen","Mary","Brown","","F",-1,-1.2)
]
df = spark.createDataFrame(data=data2,schema=schema)
# getting the column list from the schema of the dataframe
pschema = df.schema.fields
datatypes = [IntegerType, DoubleType]  # column datatypes that I want
out = filter(lambda x: x.dataType.isin(datatypes), pschema) //gives invalid syntax error.
Can someone help me understand what I am doing wrong? The Scala code only passes a single datatype, but for my use case I want to handle the scenario in which multiple datatypes can be passed and we get back a DataFrame with only the columns of those specified datatypes.
Initially, if someone can give me an idea of how to make it work for a single datatype, I can try to do the same for multiple datatypes.
Note: the sample data for Scala and PySpark is different, as I copied the PySpark sample data from somewhere else just to speed things up; I am only concerned about the final output requirement.
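One possible direction, not from the original post: // is not a Python comment marker (which is the likely source of the reported invalid syntax error), and DataType objects have no isin method. Since each StructField.dataType is an instance such as IntegerType(), an isinstance check against the wanted type classes can do the filtering. A minimal sketch, assuming the df defined above:

from pyspark.sql.types import IntegerType, DoubleType

def select_by_types(df, datatypes):
    # keep a column if its dataType is an instance of any of the wanted type classes
    cols = [f.name for f in df.schema.fields
            if isinstance(f.dataType, tuple(datatypes))]
    return df.select(cols)

# e.g. only the integer and double columns: salary and raise
out = select_by_types(df, [IntegerType, DoubleType])
out.printSchema()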
To pass a schema when reading a JSON file, we do this:
from pyspark.sql.types import (StructField, StringType, StructType, IntegerType)

data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]
final_struc = StructType(fields=data_schema)
df = spark.read.json('people.json', schema=final_struc)
The above code works as expected. However, now I have data in a table, which I display with:
df = sqlContext.sql("SELECT * FROM people_json")
But if I try to pass a new schema to it using the following command, it does not work:
df2 = spark.sql("SELECT * FROM people_json", schema=final_struc)
It gives the following error:
sql() got an unexpected keyword argument 'schema'
NOTE: I am using Databricks Community Edition.
What am I missing?
How do I pass the new schema if I have data in the table instead of some JSON file?
You cannot apply a new schema to an already created DataFrame. However, you can change the type of each column by casting it to another datatype, as below:
df.withColumn("column_name", df["column_name"].cast("new_datatype"))
If you need to apply a new schema, you need to convert to an RDD and create a new DataFrame again, as below:
df = sqlContext.sql("SELECT * FROM people_json")
newDF = spark.createDataFrame(df.rdd, schema=final_struc)
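For the casting route in PySpark specifically, a minimal hypothetical sketch that re-types the table's columns according to the schema from the question (final_struc has an integer age and a string name):

from pyspark.sql.functions import col

df = spark.sql("SELECT * FROM people_json")

# cast each column to the type declared in final_struc
df2 = df.select(
    col("age").cast("int").alias("age"),
    col("name").cast("string").alias("name"))
df2.printSchema()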
Hope this helps!
There is already one answer available, but I still want to add something.
Create DF from RDD

Using toDF:
newDF = rdd.toDF(schema)  # schema can be a StructType or just a list of column names

Using createDataFrame:
newDF = spark.createDataFrame(rdd, schema)  # again, schema can be a StructType or a list of column names
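A tiny hypothetical illustration of both variants, using an RDD of tuples:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

rdd = spark.sparkContext.parallelize([("alice", 34), ("bob", 29)])

# with a plain list of column names
df_a = rdd.toDF(["name", "age"])

# with an explicit StructType
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)])
df_b = spark.createDataFrame(rdd, schema)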
Create DF from other DF
Suppose I have a DataFrame with columns and data types: name | string, marks | string, gender | string.
If I want to get only marks, as an integer:
newDF = oldDF.select("marks")
newDF_with_int = newDF.withColumn("marks", newDF['marks'].cast('integer'))
This will convert marks to integer.
The following is more or less straight Python code which functionally extracts exactly what I want. The schema of the column I'm working with inside the DataFrame is basically a JSON string.
However, I had to greatly bump up the memory requirement for this and I'm only running on a single node. Using a collect is probably bad and creating all of this on a single node really isn't taking advantage of the distributed nature of Spark.
I'd like a more Spark centric solution. Can anyone help me massage the logic below to better take advantage of Spark? Also, as a learning point: please provide an explanation for why/how the updates make it better.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import json
from pyspark.sql.types import StructType, StructField, StringType

input_schema = StructType([
    StructField('scrubbed_col_name', StringType(), nullable=True)
])

output_schema = StructType([
    StructField('val01_field_name', StringType(), nullable=True),
    StructField('val02_field_name', StringType(), nullable=True)
])
example_input = [
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_a"},
        {"val01_field_name": "val01_a", "val02_field_name": "val02_b"},
        {"val01_field_name": "val01_b", "val02_field_name": "val02_c"}]''',
    '''[{"val01_field_name": "val01_c", "val02_field_name": "val02_a"}]''',
    '''[{"val01_field_name": "val01_a", "val02_field_name": "val02_d"}]''',
]

desired_output = {
    'val01_a': ['val02_a', 'val02_b', 'val02_d'],
    'val01_b': ['val02_c'],
    'val01_c': ['val02_a'],
}
def capture(dataframe):
    # Capture column from data frame if it's not empty
    data = dataframe.filter('scrubbed_col_name is not null')\
        .select('scrubbed_col_name')\
        .rdd\
        .collect()

    # Create a mapping of val1: list(val2)
    mapping = {}
    # For every row in the rdd
    for row in data:
        # For each json_string within the row
        for json_string in row:
            # For each item within the json string
            for val in json.loads(json_string):
                # Extract the data properly
                val01 = val.get('val01_field_name')
                val02 = val.get('val02_field_name')
                if val02 not in mapping.get(val01, []):
                    mapping.setdefault(val01, []).append(val02)
    return mapping
One possible solution:
(df
.rdd # Convert to rdd
.flatMap(lambda x: x) # Flatten rows
# Parse JSON. In practice you should add proper exception handling
.flatMap(lambda x: json.loads(x))
# Get values
.map(lambda x: (x.get('val01_field_name'), x.get('val02_field_name')))
# Convert to final shape
.groupByKey())
Given the output specification, this operation is not exactly efficient (do you really require grouped values?), but it is still much better than a collect.
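If the final Python dictionary is still needed on the driver, one hedged way to finish the pipeline above (keeping only distinct val02 values per key before bringing the small grouped result back) could be:

result = (df
    .rdd
    .flatMap(lambda x: x)
    .flatMap(lambda x: json.loads(x))
    .map(lambda x: (x.get('val01_field_name'), x.get('val02_field_name')))
    .groupByKey()
    # deduplicate per key, then collect the (much smaller) grouped result
    .mapValues(lambda vals: sorted(set(vals)))
    .collectAsMap())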
I want to read data from a CSV file with the following format:
HIERARCHYELEMENTID, REFERENCETIMESTAMP, VALUE
LOUTHNMA,"2014-12-03 00:00:00.0",0.004433333289
LOUTHNMA,"2014-12-03 00:15:00.0",0.004022222182
LOUTHNMA,"2014-12-03 00:30:00.0",0.0037666666289999998
LOUTHNMA,"2014-12-03 00:45:00.0",0.003522222187
LOUTHNMA,"2014-12-03 01:00:00.0",0.0033333332999999996
I am using the following PySpark function to read from this file
# Define a specific function to load flow data with schema
def load_flow_data(sqlContext, filename, timeFormat):
    # Columns we're interested in
    flow_columns = ['DMAID', 'TimeStamp', 'Value']
    df = load_data(sqlContext, filename, flow_schema, flow_columns)
    # convert type of timestamp column from string to timestamp
    col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp")
    df = df.withColumn('realTimeStamp', col)
    return df
with the following schema and auxiliary function
from pyspark.sql.types import StructType, StructField, StringType, FloatType
from pyspark.sql.functions import unix_timestamp

flow_schema = StructType([
    StructField('DMAID', StringType(), True),
    StructField('TimeStamp', StringType(), True),
    StructField('Value', FloatType(), True)
])

def load_data(sqlContext, filename, schema=None, columns=None):
    # If no schema is specified, then infer the schema automatically
    if schema is None:
        df = sqlContext.read.format('com.databricks.spark.csv') \
            .option('header', 'true').option('inferschema', 'true') \
            .option('mode', 'DROPMALFORMED') \
            .load(filename)
    else:
        df = sqlContext.read.format('com.databricks.spark.csv') \
            .options(header='true').load(filename, schema=schema)
    # If no columns are specified, then select all columns
    if columns is None:
        columns = schema.names
    df = df.select(columns)
    return df
I load the data from the CSV file using these commands:
timeFormat = "yyyy-MM-dd HH:mm:SS"
df_flow_DMA = load_flow_data(sqlContext, flow_file, timeFormat)
Then I convert this data frame to Pandas for visualisation purposes.
However, I find that col = unix_timestamp(df['TimeStamp'], timeFormat).cast("timestamp") maps different date and time strings from the CSV file (in the 'TimeStamp' field) to the same 'realTimeStamp' value, as shown in the attached screenshot.
I suspect that the problem is related to the date time string format that I am passing to load_flow_data; I have tried several variations but nothing seems to work.
Could someone please provide a hint at what is wrong with my code? I use Python 2.7 and Spark 1.6.
Cheers
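One hedged guess, not confirmed by the original post: unix_timestamp uses Java SimpleDateFormat patterns, where uppercase SS means milliseconds and lowercase ss means seconds, and the sample strings also carry a trailing fractional digit. A pattern matching the data more closely might be worth trying:

# assumption: parse seconds with lowercase ss and the trailing fractional digit with .S
timeFormat = "yyyy-MM-dd HH:mm:ss.S"
df_flow_DMA = load_flow_data(sqlContext, flow_file, timeFormat)
df_flow_DMA.select('TimeStamp', 'realTimeStamp').show(5)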
Problem writing pandas dataframe (timeseries) to HDF5 using pytables/tstables:
import pandas
import tables
import tstables
# example dataframe
valfloat = [512.3, 918.8]
valstr = ['abc','cba']
tstamp = [1445464064, 1445464013]
df = pandas.DataFrame(data = zip(valfloat, valstr, tstamp), columns = ['colfloat', 'colstr', 'timestamp'])
df.set_index(pandas.to_datetime(df['timestamp'].astype(int), unit='s'), inplace=True)
df.index = df.index.tz_localize('UTC')
colsel = ['colfloat', 'colstr']
dftoadd = df[colsel].sort_index()
# try string conversion from object-type (no type mixing here ?)
##dftoadd.loc[:,'colstr'] = dftoadd['colstr'].map(str)
h5fname = 'df.h5'
# class to use as tstable description
class TsExample(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)
    colfloat = tables.Float64Col(pos=1)
    colstr = tables.StringCol(itemsize=8, pos=2)
# create new time series
h5f = tables.open_file(h5fname, 'a')
ts = h5f.create_ts('/','example',TsExample)
# append to HDF5
ts.append(dftoadd, convert_strings=True)
# save data and close file
h5f.flush()
h5f.close()
Exception:
ValueError: rows parameter cannot be converted into a recarray object
compliant with table tstables.tstable.TsTable instance at ...
The error was: cannot view Object as non-Object type
While this particular error happens with TsTables, the code chunk responsible for it is identical to the PyTables try-section here.
The error is happening after I upgraded pandas to 0.17.0; the same code was running error-free with 0.16.2.
NOTE: if a string column is excluded then everything works fine, so this problem must be related to string-column type representation in the dataframe.
The issue could be related to this question. Is there some conversion required for the 'colstr' column of the dataframe that I am missing?
This is not going to work with a newer pandas, as the index is timezone-aware; see here.
You can:
- convert to a type PyTables understands; this would require localizing
- use HDFStore to write the frame
Note that what you are doing is the reason HDFStore exists in the first place: to make reading and writing PyTables-backed files friendly for pandas objects. Doing this 'manually' is full of pitfalls.
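For reference, a minimal sketch of the HDFStore route with the dftoadd frame from the question (the file name here is arbitrary); HDFStore takes care of the timezone-aware index and the mixed float/string columns itself:

import pandas

# write the frame through HDFStore instead of appending to a TsTable manually
with pandas.HDFStore('df_store.h5', mode='a') as store:
    store.append('example', dftoadd, data_columns=True)

# read it back
with pandas.HDFStore('df_store.h5', mode='r') as store:
    roundtrip = store['example']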