Error with LabeledPoint object in PySpark - Python

I am writing a function that takes an RDD as input, splits the comma-separated values, converts each row into a LabeledPoint object, and finally returns the output as a DataFrame.
Code:
def parse_points(raw_rdd):
    cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
    new_df = cleaned_rdd.map(lambda line: LabeledPoint(line[0], [line[1:]])).toDF()
    return new_df
output = parse_points(input_rdd)
Up to this point the code runs fine with no error. But when I add the line
output.take(5)
I get the error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 129.0 failed 1 times, most recent failure: Lost task 0.0 in stage 129.0 (TID 152, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Py4JJavaError Traceback (most recent call last)
<ipython-input-100-a68c448b64b0> in <module>()
20
21 output = parse_points(raw_rdd)
---> 22 print output.show()
Please suggest what the mistake is.

The reason you had no errors until you executed the action
output.take(5)
is the lazy nature of Spark: nothing is actually executed until you run an action such as take(5).
You have a few issues in your code, and I think the failure comes from the extra "[" and "]" in [line[1:]].
So remove the extra "[" and "]" around line[1:] (and keep only line[1:]).
Another issue you might need to solve is the missing DataFrame schema: replace "toDF()" with "toDF(["features","label"])" to give the DataFrame a schema.

Try:
>>> raw_rdd.map(lambda line: line.split(",")) \
...        .map(lambda line: LabeledPoint(line[0], [float(x) for x in line[1:]]))
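For completeness, a minimal sketch of the whole function with both fixes applied (assuming LabeledPoint comes from pyspark.mllib.regression, a SparkSession/SQLContext is already active so toDF() is available, and the first comma-separated field is the label):
from pyspark.mllib.regression import LabeledPoint

def parse_points(raw_rdd):
    # split each line, then use the first field as the label and the rest as features
    cleaned_rdd = raw_rdd.map(lambda line: line.split(","))
    new_df = cleaned_rdd.map(lambda line: LabeledPoint(float(line[0]),
                                                       [float(x) for x in line[1:]])).toDF()
    return new_df

output = parse_points(input_rdd)
output.take(5)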

Related

Pyspark job aborted error due to stage failure

I have the following piece of code:
# fact table
df = (spark.table(f'nn_squad7_{country}.fact_table')
.filter(f.col('date_key').between(start_date,end_date))
#.filter(f.col('is_lidl_plus')==1)
.filter(f.col('source')=='tickets')
.filter(f.col('subtype')=='trx')
.filter(f.col('is_trx_ok') == 1)
.join(dim_stores,'store_id','inner')
.join(dim_customers,'customer_id','inner')
.withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(date_key, 1), "Y-ww")'))
.withColumn('quarter', f.expr('DATE_FORMAT(DATE_SUB(date_key, 1), "Q")')))
# checking metrics
df2 = (df
       .groupby('is_client_plus', 'quarter')
       .agg(
           f.countDistinct('store_id'),
           f.sum('customer_id'),
           f.sum('ticket_id')))

display(df2)
When I execute the query I get the following error:
SparkException: Job aborted due to stage failure: Task 58 in stage 13.0 failed 4 times, most recent failure: Lost task 58.3 in stage 13.0 (TID 488, 10.32.14.43, executor 4): java.lang.IllegalArgumentException: Illegal pattern character 'Q'
I'm not sure why I'm getting this error, because when I run the fact table chunk alone I don't get any error.
Any advice? Thanks!
According to the Spark 3 docs, 'Q' is a valid datetime format pattern, even though it is not a valid java.text.SimpleDateFormat pattern. I'm not sure why it didn't work for you - maybe a Spark version issue. Try using the quarter function instead, which should give the same expected output:
df = (spark.table(f'nn_squad7_{country}.fact_table')
      .filter(f.col('date_key').between(start_date, end_date))
      #.filter(f.col('is_lidl_plus') == 1)
      .filter(f.col('source') == 'tickets')
      .filter(f.col('subtype') == 'trx')
      .filter(f.col('is_trx_ok') == 1)
      .join(dim_stores, 'store_id', 'inner')
      .join(dim_customers, 'customer_id', 'inner')
      .withColumn('week', f.expr('DATE_FORMAT(DATE_SUB(date_key, 1), "Y-ww")'))
      .withColumn('quarter', f.expr('quarter(DATE_SUB(date_key, 1))')))
If you look at the documentation of the function, it explicitly states what pattern letters are valid to use:
All pattern letters of the Java class java.text.SimpleDateFormat can be used.
You can see the valid patterns here:
https://docs.oracle.com/javase/10/docs/api/java/text/SimpleDateFormat.html
It looks like Q is not one of them. As per the comments, #mck shows a suitable alternative.
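As a variant of that fix, the same column can be built with the built-in column functions instead of an expr string; a minimal sketch, assuming f is pyspark.sql.functions and date_key is a date or timestamp column:
from pyspark.sql import functions as f

# quarter() returns 1-4 for the quarter of the shifted date, matching the intent of the 'Q' pattern
df = df.withColumn('quarter', f.quarter(f.date_sub(f.col('date_key'), 1)))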

KeyError in Python, even though the key exists

I have been scratching my head on this for a few days now and cannot seem to find a solution online for my problem. I am trying to access data on Zendesk and go through the pagination. For some reason, I am getting a KeyError, even though I can see that the key does exist. Here is my code:
data_users2 = [[]]
while url_users:
    users_pagination = requests.get(url_users, auth=(user, pwd))
    data_user_page = json.loads(users_pagination.text)
    print(data_user_page.keys())
    for user in data_user_page['users']:
        data_users2.append(user)
    url = data_user_page['next_page']
Here is the output :
dict_keys(['users', 'next_page', 'previous_page', 'count'])
dict_keys(['error'])
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-22-fab95d95ddeb> in <module>
6 data_user_page = json.loads(users_pagination.text)
7 print (data_user_page.keys())
----> 8 for user in data_user_page["users"]:
9 data_users2.append(user)
10 url = data_user_page["next_page"]
KeyError: 'users'
As you can see, users does exist. The same thing happens if I try to print next_page: I get a KeyError for next_page.
Any help would be appreciated! Thanks!
Your code is failing on the second iteration of the loop; at that point the only key in data_user_page is "error", as you can see in the output you pasted:
dict_keys(['users', 'next_page', 'previous_page', 'count']) <----- FIRST ITERATION
dict_keys(['error']) <---- SECOND ITERATION, THEREFORE YOUR KEY DOES NOT EXIST
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-22-fab95d95ddeb> in <module>
6 data_user_page = json.loads(users_pagination.text)
7 print (data_user_page.keys())
----> 8 for user in data_user_page["users"]:
9 data_users2.append(user)
10 url = data_user_page["next_page"]
KeyError: 'users'
EDIT: This could be due to the fact that you are saving the next URL in a variable called url, not url_users.
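A minimal sketch of the corrected loop, assuming the Zendesk response contains next_page (None on the last page) and that you want to stop cleanly instead of raising a KeyError if the API returns an error payload:
import json
import requests

data_users2 = []
while url_users:
    response = requests.get(url_users, auth=(user, pwd))
    data_user_page = json.loads(response.text)
    if 'users' not in data_user_page:
        # e.g. {'error': ...} - report it and stop instead of crashing
        print(data_user_page)
        break
    data_users2.extend(data_user_page['users'])
    # update the variable the while-loop actually tests
    url_users = data_user_page['next_page']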

Unicode error while converting RDD to Spark DataFrame

I am getting the following error when I run the show method on the data frame.
Py4JJavaError: An error occurred while calling o14904.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23450.0 failed 1 times, most recent failure: Lost task 0.0 in stage 23450.0 (TID 120652, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/Users/i854319/spark2/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-8-b76896bc4e43>", line 320, in <lambda>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 3-5: ordinal not in range(128)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.next(PythonRDD.scala:156)
When I only fetch 12 rows, it doesn't throw the error.
jpsa_rf.features_df.show(12)
+------------+--------------------+
|Feature_name| Importance_value|
+------------+--------------------+
| competitive|0.019380017988201638|
| new|0.012416277407924172|
|self-reliant|0.009044388916918005|
| related|0.008968947484358822|
| retail|0.008729510712416655|
| sales,|0.007680271475590303|
| work|0.007548541044789985|
| performance|0.007209008630295571|
| superior|0.007065626808393139|
| license|0.006436001036918034|
| industry|0.006416712169788629|
| record|0.006227581067732823|
+------------+--------------------+
only showing top 12 rows
But when I do .show(15) I get the error.
I created this data frame as below; it is basically a data frame of features with their importance values from a Random Forest model:
vocab = np.array(self.cvModel.bestModel.stages[3].vocabulary)

if est_name == "rf":
    feature_importance = self.cvModel.bestModel.stages[5].featureImportances.toArray()
    argsort_feature_indices = feature_importance.argsort()[::-1]
elif est_name == "blr":
    feature_importance = self.cvModel.bestModel.stages[5].coefficients.toArray()
    argsort_feature_indices = abs(feature_importance).argsort()[::-1]

# Sort the feature importance array in descending order and get the indices
feature_names = vocab[argsort_feature_indices]

self.features_df = sc.parallelize(zip(feature_names, feature_importance[argsort_feature_indices])).\
    map(lambda x: (str(x[0]), float(x[1]))).toDF(["Feature_name", "Importance_value"])
I assume you're using Python 2. The problem is most likely the str(x[0]) part in your map. It seems x[0] is a unicode string and str is supposed to convert it to a bytestring. It does so, however, by implicitly assuming an ASCII encoding, which only works for plain English text.
This is not how things are supposed to be done.
The short answer is: change str(x[0]) to x[0].encode('utf-8').
The long answer can be found e.g. here or here.
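Applied to the line that builds features_df, a minimal sketch of that change (Python 2, everything else as in the question):
self.features_df = sc.parallelize(zip(feature_names, feature_importance[argsort_feature_indices])).\
    map(lambda x: (x[0].encode('utf-8'), float(x[1]))).\
    toDF(["Feature_name", "Importance_value"])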

ZeroDivisionError unknown cause

I have a piece of code from a larger script, and for the life of me I can't figure out what's causing the error.
It looks like this:
counter = 1
for i in range(len(binCounts)):
    thisRatio = float(binCounts[i]) / (float(counter) / float(len(bins)))
    OUTFILE.write("\t".join(bins[i][0:3]))
    OUTFILE.write("\t")
    OUTFILE.write(str(binCounts[i]))
    OUTFILE.write("\t")
    OUTFILE.write(str(thisRatio))
    OUTFILE.write("\n")
where binCounts is a chronological list [1, 2, 3]
and bins is another list that contains slightly more info:
[['chrY', '28626328', '3064930174', '28718777', '92449', '49911'], ['chrY', '28718777', '3065022623', '28797881', '79104', '49911'], ['chrY', '28797881', '3065101727', '59373566', '30575685', '49912']]
For each entry in binCounts, it should take the calculated thisRatio, the first 3 fields of the corresponding row in bins, and the binCounts value itself, and write them together to a new file (OUTFILE).
But it's not doing this. It's giving me an error:
thisRatio = float(binCounts[i]) / (float(counter) / float(len(bins)))
ZeroDivisionError: float division by zero
When I run the line:
thisRatio = float(binCounts[i]) / (float(counter) / float(len(bins)))
interactively, it works fine.
When I break it into pieces, this is what I get:
A = float(binCounts[i])
print (A)
49999.0
B = (float(counter))
print (B)
1.0
C = float(len(bins))
print (C)
50000.0
thisRatio
2499950000.0
And then I reran the whole piece interactively (which I hadn't done before - just the single thisRatio line) and got this error...
Traceback (most recent call last):
File "<stdin>", line 3, in <module>
IndexError: list index out of range
So it seems when run as a .py script the error is a ZeroDivisionError, and when run interactively the error is an IndexError.
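For reference, in that expression only the inner division can raise this error, so len(bins) must be 0 in the script's scope even though it is 50000 interactively; a minimal sketch of the two failure modes (hypothetical values, not taken from the script):
counter = 1

# Case 1: bins is empty -> the inner division divides by zero (the script-like failure)
binCounts, bins = [1, 2, 3], []
try:
    thisRatio = float(binCounts[0]) / (float(counter) / float(len(bins)))
except ZeroDivisionError as e:
    print('script-like failure:', e)       # float division by zero

# Case 2: the index runs past the end of one list -> IndexError instead (the interactive failure)
binCounts, bins = [1, 2, 3], [['chrY', '28626328', '3064930174']]
try:
    for i in range(len(binCounts)):
        row = bins[i][0:3]                  # fails once i exceeds len(bins) - 1
except IndexError as e:
    print('interactive-like failure:', e)   # list index out of range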

Python Error: 'numpy.float64' object is not callable

I have written code in Python to generate a sequence of ARIMA models and determine their AIC values in order to compare them. The code is as below:
p = 0
q = 0
d = 0
for p in range(5):
    for d in range(1):
        for q in range(4):
            arima_mod = sm.tsa.ARIMA(df, (p, d, q)).fit()
            print(arima_mod.params)
            print arima_mod.aic()
I am getting an error message as below:
TypeError Traceback (most recent call last)
<ipython-input-60-b662b0c42796> in <module>()
8 arima_mod=sm.tsa.ARIMA(df,(p,d,q)).fit()
9 print(arima_mod.params)
---> 10 print arima_mod.aic()
global arima_mod.aic = 1262.2449736558815
11
**TypeError: 'numpy.float64' object is not callable**
Remove the parentheses from arima_mod.aic(). As I read it, arima_mod.aic is 1262.2449736558815, and thus a float. The parentheses make Python think it is a function and try to call it. You do not want that (because it breaks), you just want the value. So remove the parentheses, and you'll be fine.
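A minimal sketch of the corrected loop (assuming sm is statsmodels.api and df is the series being fitted, as in the question):
for p in range(5):
    for d in range(1):
        for q in range(4):
            arima_mod = sm.tsa.ARIMA(df, (p, d, q)).fit()
            print(arima_mod.params)
            print(arima_mod.aic)  # aic is a float attribute, not a method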
