Unable to create dataframe from RDD - python

I am trying to create a recommender system from this Kaggle dataset:
https://www.kaggle.com/kerneler/starter-user-artist-playcount-dataset-f7a1f242-c
the file is called: "user_artist_data_small.txt"
The data looks like this:
1059637 1000010 238
1059637 1000049 1
1059637 1000056 1
1059637 1000062 11
1059637 1000094 1
I'm getting an error on the third-to-last line of code.
!pip install pyspark==3.0.1 py4j==0.10.9
from pyspark.sql import SparkSession
from pyspark import SparkContext
appName="Collaborative Filtering with PySpark"
from pyspark.sql.types import StructType,StructField,IntegerType,StringType,LongType
from pyspark.sql.functions import col
from pyspark.ml.recommendation import ALS
from google.colab import drive
drive.mount ('/content/gdrive')
spark = SparkSession.builder.appName(appName).getOrCreate()
sc = spark.sparkContext
userArtistData1=sc.textFile("/content/gdrive/My Drive/data/user_artist_data_small.txt")
schema_user_artist = StructType([StructField("userId",StringType(),True),StructField("artistId",StringType(),True),StructField("playCount",StringType(),True)])
userArtistRDD = userArtistData1.map(lambda k: k.split())
user_artist_df = spark.createDataFrame(userArtistRDD,schema_user_artist,['userId','artistId','playCount'])
ua = user_artist_df.alias('ua')
(training, test) = ua.randomSplit([0.8, 0.2]) #Training the model
als = ALS(maxIter=5, implicitPrefs=True,userCol="userId", itemCol="artistId", ratingCol="playCount",coldStartStrategy="drop")
model = als.fit(training)# predict using the testing datatset
predictions = model.transform(test)
predictions.show()
The error is:
IllegalArgumentException: requirement failed: Column userId must be of type numeric but was actually of type string.
So I changed the type from StringType to IntegerType in the schema, and I got this error:
TypeError: field userId: IntegerType can not accept object '1059637' in type <class 'str'>
The number happens to be the first item in the dataset. Please help?

Just create the dataframe with the CSV reader (using a space delimiter) instead of going through an RDD; with the IntegerType schema the reader parses the values into the declared types:
user_artist_df = spark.read.schema(schema_user_artist).csv('/content/gdrive/My Drive/data/user_artist_data_small.txt', sep=' ')
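If you would rather keep the RDD route, the equivalent fix is to cast the split tokens to int before building the dataframe. A minimal sketch, assuming the same field names (userId, artistId, playCount) and an IntegerType schema:
from pyspark.sql.types import StructType, StructField, IntegerType

schema_user_artist = StructType([
    StructField("userId", IntegerType(), True),
    StructField("artistId", IntegerType(), True),
    StructField("playCount", IntegerType(), True)
])

# Cast every whitespace-separated token to int so the rows match the numeric schema
userArtistRDD = userArtistData1.map(lambda line: [int(x) for x in line.split()])
user_artist_df = spark.createDataFrame(userArtistRDD, schema_user_artist)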

Related

How to generate Pyspark dynamic frame name dynamically

I have a table whose data is shown in the diagram. I want to store results in dynamically generated dataframe names.
For example, in the case below I want to create two dataframes named dnb_df and es_df, store the read results in them, and print the structure of each dataframe.
When I run the code below I get the error:
SyntaxError: can't assign to operator (TestGlue2.py, line 66)
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import regexp_replace, col
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
#sc.setLogLevel('DEBUG')
glueContext = GlueContext(sc)
spark = glueContext.spark_session
#logger = glueContext.get_logger()
#logger.DEBUG('Hello Glue')
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
client = boto3.client('glue', region_name='XXXXXX')
response = client.get_connection(Name='XXXXXX')
connection_properties = response['Connection']['ConnectionProperties']
URL = connection_properties['JDBC_CONNECTION_URL']
url_list = URL.split("/")
host = "{}".format(url_list[-2][:-5])
new_host=host.split('#',1)[1]
port = url_list[-2][-4:]
database = "{}".format(url_list[-1])
Oracle_Username = "{}".format(connection_properties['USERNAME'])
Oracle_Password = "{}".format(connection_properties['PASSWORD'])
#print("Oracle_Username:",Oracle_Username)
#print("Oracle_Password:",Oracle_Password)
print("Host:",host)
print("New Host:",new_host)
print("Port:",port)
print("Database:",database)
Oracle_jdbc_url="jdbc:oracle:thin:#//"+new_host+":"+port+"/"+database
print("Oracle_jdbc_url:",Oracle_jdbc_url)
source_df = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", "(select * from schema.table order by VENDOR_EXECUTION_ORDER) ").option("user", Oracle_Username).option("password", Oracle_Password).load()
vendor_data=source_df.collect()
for row in vendor_data:
    vendor_query = row.SRC_QUERY
    row.VENDOR_NAME+'_df' = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", vendor_query).option("user", Oracle_Username).option("password", Oracle_Password).load()
    print(row.VENDOR_NAME+'_df')
Added use case in picture
Update: As discussed in the comments, your requirement is to further join all of these with another dataframe:
for row in vendor_data:
    rowAsDict = row.asDict()
    # Any key works here, since rowAsDict is not going to be used anywhere else anyway
    rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"] = spark.sql(rowAsDict["SOURCE_QUERY"])
    main_dataframe = main_dataframe.join(rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"], "acc_id")
Input main_dataframe:
source_df:
View1 and View2:
Output main_dataframe:
If I understood correctly, you need to generate the VENDOR_NAME_df dataframes dynamically.
You can't assign to a Row object, and it wouldn't be useful to assign a DataFrame to a Row anyway, since you can't create a DataFrame with a column of type DataFrame.
However, you can convert the Row to a dict using asDict and use that instead.
This would work:
vendor_data = source_df.collect()
for row in vendor_data:
    rowAsDict = row.asDict()
    # Replace this with spark.read() or any other way of creating a DataFrame
    rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"] = spark.sql(rowAsDict["SOURCE_QUERY"])
    rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"].show()
Input Source_DF:
Result of SOURCE_QUERY:
Output (of rowAsDict[rowAsDict["VENDOR_NAME"]+"_df"].show()):
Final rowAsDict:
{'VENDOR_NAME': 'Name1', 'SOURCE_QUERY': 'select * from view1', 'Name1_df': DataFrame[id: string, date: string, Code: string]}
Add the last two lines inside your for loop and you should be able to get the results.
The first one creates a temp view using the dynamic df name.
The second one shows the data in that temp view.
for row in vendor_data:
    vendor_query = row.SRC_QUERY
    spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", vendor_query).option("user", Oracle_Username).option("password", Oracle_Password).load().createOrReplaceTempView(row.VENDOR_NAME+'_df')
    spark.sql("select * from "+row.VENDOR_NAME+"_df").show()
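If you would rather not reuse the Row dict, a plain Python dict keyed by vendor name works just as well. A minimal sketch, assuming the same source_df columns and JDBC options as above:
vendor_dfs = {}  # vendor name -> DataFrame

for row in source_df.collect():
    # Read each vendor's query into its own dataframe and keep it under a dynamic name
    vendor_dfs[row.VENDOR_NAME + '_df'] = spark.read.format("jdbc").option("url", Oracle_jdbc_url).option("dbtable", row.SRC_QUERY).option("user", Oracle_Username).option("password", Oracle_Password).load()

for name, df in vendor_dfs.items():
    print(name)
    df.printSchema()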

Use Spacy with Pandas

I'm trying to build a multi-class text classifier using spaCy. I have built the model, but I am facing a problem applying it to my full dataset. The model I have built so far is in the screenshot:
Screenshot
Below is the code I used to apply to my full dataset using Pandas:
Messages = pd.read_csv('Messages.csv', encoding='cp1252')
Messages['Body'] = Messages['Body'].astype(str)
Messages['NLP_Result'] = nlp(Messages['Body'])._.cats
But it gives me the error:
ValueError: [E1041] Expected a string, Doc, or bytes as input, but got: <class 'pandas.core.series.Series'>
The reason I wanted to use Pandas in this case is that the dataset has 2 columns: ID and Body. I want to apply the NLP model only to the Body column, but I want the final dataset to have 3 columns: ID, Body and the NLP result, like in the screenshot above.
Thanks so much
I tried the Pandas apply method too, but had no luck. Code used:
Messages['NLP_Result'] = Messages['Body'].apply(nlp)._.cats
The error I got: AttributeError: 'Series' object has no attribute '_'
The expectation is to generate 3 columns as described above.
You should pass a callable to the Series.apply call:
Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
Here, each value in the Body column will be assigned to the x variable.
nlp(x) will create a Doc object that exposes the properties you'd like to access; nlp(x)._.cats will then return the expected value.
import spacy
import classy_classification
import csv
import pandas as pd

with open('Deliveries.txt', 'r') as d:
    Deliveries = d.read().splitlines()
with open('Not Spam.txt', 'r') as n:
    Not_Spam = n.read().splitlines()

data = {}
data["Deliveries"] = Deliveries
data["Not_Spam"] = Not_Spam

# NLP model
nlp = spacy.blank("en")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "gpu"
    }
)

Messages['NLP_Result'] = Messages['Body'].apply(lambda x: nlp(x)._.cats)
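If Messages is large, running the texts through nlp.pipe in batches is usually faster than calling nlp once per row. A minimal sketch, assuming the same Messages dataframe and pipeline as above:
# Process the Body column in batches and keep the category scores per document
Messages['NLP_Result'] = [doc._.cats for doc in nlp.pipe(Messages['Body'].tolist())]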

Streaming JSON data from Kafka into Pyspark: aggregation not working on the JSON data

I am trying to read JSON data from Kafka into Spark using Python and then do some aggregation on the data, but there is a problem. I have the following code:
First I just send the input data to the console:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell'
from pyspark.sql.functions import from_json
import findspark
findspark.init()
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import col
spark = SparkSession\
.builder\
.appName("Total-spending-for-top-users")\
.getOrCreate()
df = spark.readStream.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "saurabmarjara2")\
.option("startingOffsets", "earliest")\
.load()
jsonschema = StructType([StructField("order_id", IntegerType()),
StructField("customer_id", IntegerType()),
StructField("taxful_total_price", IntegerType())])
mta_stream = df.select(from_json(col("value").cast("string"), jsonschema) \
.alias("parsed_mta_values"))
mta_data = mta_stream.select("parsed_mta_values.*")
qry = mta_data.writeStream.outputMode("append").format("console").start()
qry.awaitTermination()
This works correctly. The output is:
+--------+-----------+------------------+
|order_id|customer_id|taxful_total_price|
+--------+-----------+------------------+
|  937665|          7|                10|
|  937666|          3|                 4|
|  937667|          6|                 3|
|  937668|          4|                 4|
+--------+-----------+------------------+
But I want to perform aggregation on the data by grouping by the customer_id field and summing the taxful_total_price. This will give me the total spending for every customer.
Here are the changes I made to the code:
df2 = mta_data.groupBy("customer_id").agg(sum("taxful_total_price").alias("total_spending"))
qry = df2.writeStream.outputMode("append").format("console").start()
qry.awaitTermination()
This is the error I am getting:
File "sparkconsumer.py", line 31, in
df2=mta_data.groupBy("customer_id").agg(sum("taxful_total_price").alias("total_spending"))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
I have specified that each of the columns is an integer type in the jsonschema. I think the problem is with:
mta_stream.select("parsed_mta_values.*")
I tried this:
df2 = mta_data.groupBy("parsed_mta_values.customer_id").agg(sum("parsed_mta_values.taxful_total_price").alias("total_spending"))
But this gives the same error as above.
Please help!
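A note on the traceback: it is consistent with Python's built-in sum being applied to the string "taxful_total_price" (0 + 't' raises exactly this TypeError), rather than the Spark aggregate function. A minimal sketch of the aggregation using the Spark sum explicitly, assuming the same mta_data stream as above:
import pyspark.sql.functions as F

# Use the Spark aggregate sum, not Python's built-in sum
df2 = mta_data.groupBy("customer_id").agg(F.sum("taxful_total_price").alias("total_spending"))

# A streaming aggregation without a watermark generally needs "complete" or "update" output mode
qry = df2.writeStream.outputMode("complete").format("console").start()
qry.awaitTermination()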

PySpark Error when Using Jellyfish Functions: str argument expected

I am working on a task that computes similarity scores for name-related data. I am using Spark and the jellyfish library in Python. Below is my code, which lives in a class:
import jellyfish
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, DataFrame
from pyspark import SparkContext

df = self.jaro_winkler_func(df, 'df1.first_name', 'df2.first_name')

def jaro_winkler_score(self, s1, s2):
    if s1 is None or s2 is None:
        out = 0
    else:
        out = jellyfish.jaro_winkler(s1, s2)
    return out

def jaro_winkler_func(self, df, column_left, column_right):
    df = df.withColumn('test', self.jaro_winkler_score(df[column_left], df[column_right]))
    return df
Below is the error I got:
out = jellyfish.jaro_winkler(s1, s2)
TypeError: str argument expected
I have seen other related posts (below) about the same issue, but the functions above already borrow from the answers in those posts.
Creating score column in Pyspark data frame using jellyfish package
Pyspark: How to deal with null values in python user defined functions
I am using Spark 2.3.
Please suggest and thanks in advance.
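One likely cause, judging from the traceback: jaro_winkler_score is called directly with Column objects, so jellyfish receives Columns instead of strings. A minimal sketch of wrapping it as a UDF so Spark calls it per row with string values, keeping the class methods as above (an assumption about the intended usage, not the only possible fix):
from pyspark.sql.types import DoubleType

def jaro_winkler_score(self, s1, s2):
    if s1 is None or s2 is None:
        return 0.0  # return a float so it matches the UDF's DoubleType
    return jellyfish.jaro_winkler(s1, s2)

def jaro_winkler_func(self, df, column_left, column_right):
    # Register the plain Python function as a UDF so it receives row values, not Column objects
    jaro_winkler_udf = F.udf(self.jaro_winkler_score, DoubleType())
    return df.withColumn('test', jaro_winkler_udf(df[column_left], df[column_right]))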

Expand json column in PySpark - schema issues - AttributeError: 'tuple' object has no attribute 'name'

I am using pyspark to extract data from a multi-line json object. I am able to read in the file, but I am unable to parse out the contents of the geometry column.
An example of the overall table is shown below.
+--------------------+--------------------+-------+
| geometry| properties| type|
+--------------------+--------------------+-------+
|{[13.583336, 37.2...|{AGRIGENTO, AGRIG...|Feature|
|{[13.584538, 37.3...|{AGRIGENTO, AGRIG...|Feature|
|{[13.657838, 37.3...|{FAVARA, AGRIGENT...|Feature|
|{[13.846247, 37.3...|{CANICATTÌ, AGRI...|Feature|
|{[13.616626, 37.4...|{ARAGONA, AGRIGEN...|Feature|
|{[13.108426, 37.6...|{SAMBUCA DI SICIL...|Feature|
|{[16.709313, 41.0...|{GRUMO APPULA, BA...|Feature|
|{[12.670994, 41.4...|{NETTUNO, ROMA, 6...|Feature|
|{[12.501805, 42.1...|{CASTELNUOVO DI P...|Feature|
|{[12.608105, 41.4...|{ANZIO, ROMA, b54...|Feature|
+--------------------+--------------------+-------+
This is the format of a single line of the json geometry column
"geometry":{"type":"Point","coordinates":[13.583336,37.270182]}
and when I extract the schema this is what it looks like
StructType(List(StructField("geometry",StructType(List(StructField("coordinates",ArrayType(DoubleType,true),true),StructField("type",StringType,true))),true)
However, when I try to set up the schema in PySpark to import the data, I get the following error.
AttributeError: 'tuple' object has no attribute 'name'
This is the code I am using.
from pyspark.sql.types import StructField, StructType, StringType, FloatType, ArrayType, DoubleType
import pyspark.sql.functions as F
df = spark.read.option("multiLine", False).option("mode", "PERMISSIVE").json('Italy/it_countrywide-addresses-country.geojson')
schema = StructType([
    (StructField("coordinates", ArrayType(DoubleType())),
     StructField("type", StringType()))
])
df.withColumn("geometry", F.from_json("geometry", schema)).select(col('geometry.*')).show()
I welcome your comments.
Ultimately, my goal was to read in the json file and access the nested values. The error I was receiving was down to me not creating the schema correctly. The best way to correct this error is to avoid manually creating the schema.
To do this I used the schema that you can create by calling .schema on the json file. This resolves any problems of creating the schema yourself.
The downside is that you are effectively reading the file twice; no doubt this can be optimised further.
json_schema = spark.read.option("multiLine", False).option("mode", "PERMISSIVE").json('Italy/it_countrywide-addresses-country.geojson').schema
df_with_schema = spark.read.option("multiLine", False).option("mode", "PERMISSIVE").schema(json_schema).json('Italy/it_countrywide-addresses-country.geojson')
df_with_schema.printSchema()
# Select coordinates array
coordinates = df_with_schema.select(F.col('geometry.coordinates'))
# select single value from coordinates array
single_value_from_coordinates_array = df_with_schema.select(F.col('geometry.coordinates')[0])
# create my own dataframe choosing multiple columns from json file
multi_columns = df_with_schema.select(F.col('geometry.coordinates'), F.col('properties.city'))
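For reference, the original AttributeError most likely came from the extra parentheses that turned the two StructFields into a tuple. A manually defined geometry schema without that tuple would look like this (a sketch, assuming the same field names as the printed schema):
from pyspark.sql.types import StructType, StructField, StringType, ArrayType, DoubleType

# StructType expects a flat list of StructFields, not a tuple of them
geometry_schema = StructType([
    StructField("coordinates", ArrayType(DoubleType()), True),
    StructField("type", StringType(), True)
])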
