SAS Proc Transpose to Pyspark - python

I am trying to convert a SAS PROC TRANSPOSE statement to PySpark in Databricks.
With the following data as a sample:
data = [{"duns":1234, "finc stress":100,"ver":6.0},{"duns":1234, "finc stress":125,"ver":7.0},{"duns":1234, "finc stress":135,"ver":7.1},{"duns":12345, "finc stress":125,"ver":7.6}]
I would expect the result to look like this (one row per duns, one column per ver value prefixed with ver):
duns   ver6.0  ver7.0  ver7.1  ver7.6
1234   100     125     135     NaN
12345  NaN     NaN     NaN     125
I tried using the pandas pivot_table() function with the following code; however, I ran into some performance issues because of the size of the data:
tst = (df.pivot_table(index=['duns'], columns=['ver'], values='finc stress')
.add_prefix('ver')
.reset_index())
Is there a way to translate the SAS PROC TRANSPOSE logic to PySpark instead of using pandas? I am trying something like this, but I am getting an error:
tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
AssertionError: all exprs should be Column
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<command-2507760044487307> in <module>
4 df = pd.DataFrame(data) # pandas
5
----> 6 tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
7
8
/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
115 else:
116 # Columns
--> 117 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
118 jdf = self._jgd.agg(exprs[0]._jc,
119 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
AssertionError: all exprs should be Column
If you could help me out I would so appreciate it! Thank you so much.

I don't know how you create df from data, but here is what I did:
import pyspark.pandas as ps
df = ps.DataFrame(data)
df['ver'] = df['ver'].astype('str')
Then your pandas code worked.
To use the PySpark approach, here is what I did (with pyspark.sql.functions imported as F):
sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))
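If you also want the ver prefix on the pivoted columns (the add_prefix('ver') step from the pandas version), here is a minimal sketch, assuming the sparkdf built from the sample data above; the backticks are needed because pivoted column names such as 6.0 contain a dot:
from pyspark.sql import functions as F

# pivot: one row per duns, one column per distinct ver value
pivoted = (sparkdf.groupBy('duns')
                  .pivot('ver')
                  .agg(F.first('finc stress')))

# mirror pandas' add_prefix('ver') by renaming every pivoted column
pivoted = pivoted.select(
    'duns',
    *[F.col(f'`{c}`').alias(f'ver{c}') for c in pivoted.columns if c != 'duns']
)
pivoted.show()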

Related

Python string column iteration

I am working with OpenAI and I'm stuck. I have tried to sort this issue out on my own but didn't get any resolution. I want my code to run the sentence-generation operation on every row of the Input_Description_OAI column and put the output in another column (OpenAI_Description). Can someone please help me complete this task? I am new to Python.
The dataset looks like:
import os
import openai
import wandb
import pandas as pd
openai.api_key = "MY-API-Key"
data=pd.read_excel("/content/OpenAI description.xlsx")
data
data["OpenAI_Description"] = data.apply(lambda _: ' ', axis=1)
data
gpt_prompt = ("Write product description for: Brand: COILCRAFT ; MPN: DO5010H-103MLD..")
response = openai.Completion.create(engine="text-curie-001", prompt=gpt_prompt,
temperature=0.7, max_tokens=1000, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0)
print(response['choices'][0]['text'])
data['OpenAI_Description'] = data.apply(gpt_prompt,response['choices'][0]['text'], axis=1)
I got this error after execution on the first row:
---------------------------------------------------------------------------
TypeError
Traceback (most recent call last)
<ipython-input-32-c798fbf9bc16> in <module>
15 print(response['choices'][0]['text'])
16 #data.add_data(gpt_prompt,response['choices'][0]['text'])
---> 17 data['OpenAI_Description'] = data.apply(gpt_prompt,response['choices'][0]['text'], axis=1)
18
TypeError: apply() got multiple values for argument 'axis'
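One way to get past that error: apply expects a function, not a prompt string and a value, so the prompt has to be built per row inside a function. A minimal sketch, assuming the openai.Completion API used above, the data frame already loaded, and an input column named Input_Description_OAI:
import openai

# assumes openai.api_key and data are set up as in the question above

def generate_description(row):
    # build the prompt from this row's input description
    prompt = f"Write product description for: {row['Input_Description_OAI']}"
    response = openai.Completion.create(
        engine="text-curie-001",
        prompt=prompt,
        temperature=0.7,
        max_tokens=1000,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
    )
    return response['choices'][0]['text']

# apply takes a function; the failing call passed a string and a value instead
data['OpenAI_Description'] = data.apply(generate_description, axis=1)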

Problem about printing certain rows without using Pandas

I want to print out the first 5 rows of the data from sklearn.datasets.load_diabetes. I tried head() and iloc, but neither seems to work. What should I do?
Here is my work
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
# 2. Load the data (use .load_diabetes() function )
df = datasets.load_diabetes()
df
# 3. Print out feature names and target names
# Features Names
x = df.feature_names
x
# Target Names
y = df.target
y
# 4. Print out the first 5 rows of the data
df.head(5)
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
113 try:
--> 114 return self[key]
115 except KeyError:
KeyError: 'head'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
114 return self[key]
115 except KeyError:
--> 116 raise AttributeError(key)
117
118 def __setstate__(self, state):
AttributeError: head
According to the documentation for load_diabetes(), it doesn't return a pandas DataFrame by default, so no wonder it doesn't work.
You can apparently do
df = datasets.load_diabetes(as_frame=True).data
if you want a dataframe.
If you don't want a dataframe, you need to read up on how Numpy array slicing works, since that's what you get by default.
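For example, a minimal sketch of printing the first 5 rows without pandas, using plain NumPy slicing on the default return value:
from sklearn import datasets

data = datasets.load_diabetes()

print(data.feature_names)   # the column labels
print(data.data[:5])        # first 5 rows of the feature array (NumPy slicing)
print(data.target[:5])      # first 5 target values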
Well, I thank Mr. AKX for the useful hint. I found my answer:
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
import pandas as pd
# 2. Load the data (use .load_diabetes() function )
data = datasets.load_diabetes()
# 3. Print out feature names and target names
# Features Names
x = data.feature_names
x
# Target Names
y = data.target
y
# 4. Print out the first 5 rows of the data
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head(5)
The method load_diabetes() doesn't return a DataFrame by default, but if you are using sklearn 0.23 or higher you can set the as_frame parameter to True; the returned bunch then holds the data as a pandas DataFrame in its frame attribute.
df = datasets.load_diabetes(as_frame=True).frame
Then you can call the head method and it will show you the first 5 rows; no need to specify 5.
print(df.head())
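For reference, a minimal sketch of what as_frame=True actually exposes: the returned bunch holds pandas objects in its data, target, and frame attributes.
from sklearn import datasets

bunch = datasets.load_diabetes(as_frame=True)

print(bunch.frame.head())    # full DataFrame: features plus target column
print(bunch.data.head())     # features only, as a DataFrame
print(bunch.target.head())   # target, as a Series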

Why do I get a TypeError when I try to create a data frame?

I am writing code to analyze some data and want to create a data frame. How do I set it up so that it runs successfully?
This is for data analysis; I would like to create a data frame that can categorize data into different grades, such as A.
Here is the code I wrote:
import analyze_lc_Feb2update
from imp import reload
analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
df = analyze_lc_Feb2update.create_df()
df.shape
df_new = df[df.grade=='A']
df_new.shape
df.columns
df.int_rate.head(5)
df.int_rate.tail(5)
df.int_rate.dtype
df.term.dtype
df_new = df[df.grade =='A']
df_new.shape
output:
TypeError Traceback (most recent call last)
<ipython-input-3-7079435f776f> in <module>()
2 from imp import reload
3 analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
4 df = analyze_lc_Feb2update.create_df()
5 df.shape
6 df_new = df[df.grade=='A']
TypeError: create_df() missing 1 required positional argument: 'grade'
Based on what was provided I guess your problem is here:
from imp import reload
analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
df = analyze_lc_Feb2update.create_df()
This looks like a custom library you are trying to use, whose .create_df() method requires a positional argument grade; you would need to do something like:
df = analyze_lc_Feb2update.create_df(grade="blah")
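The error itself is ordinary Python behavior, independent of the library: any function defined with a required positional parameter raises this TypeError when called without one. A minimal illustration, where create_df is only a hypothetical stand-in for the custom library's function:
import pandas as pd

def create_df(grade):
    # hypothetical stand-in for analyze_lc_Feb2update.create_df
    df = pd.DataFrame({'grade': ['A', 'B', 'A'], 'int_rate': [5.0, 7.5, 6.1]})
    return df[df.grade == grade]

# create_df()              # TypeError: create_df() missing 1 required positional argument: 'grade'
df = create_df(grade='A')  # works once the argument is supplied
print(df.shape)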

filtering dataframe in LAMBDA function in python [duplicate]

I am trying to filter an RDD as shown below:
spark_df = sc.createDataFrame(pandas_df)
spark_df.filter(lambda r: str(r['target']).startswith('good'))
spark_df.take(5)
But got the following errors:
TypeErrorTraceback (most recent call last)
<ipython-input-8-86cfb363dd8b> in <module>()
1 spark_df = sc.createDataFrame(pandas_df)
----> 2 spark_df.filter(lambda r: str(r['target']).startswith('good'))
3 spark_df.take(5)
/usr/local/spark-latest/python/pyspark/sql/dataframe.py in filter(self, condition)
904 jdf = self._jdf.filter(condition._jc)
905 else:
--> 906 raise TypeError("condition should be string or Column")
907 return DataFrame(jdf, self.sql_ctx)
908
TypeError: condition should be string or Column
Any idea what I missed? Thank you!
DataFrame.filter, which is an alias for DataFrame.where, expects a SQL expression expressed either as a Column:
from pyspark.sql.functions import col

spark_df.filter(col("target").like("good%"))
or equivalent SQL string:
spark_df.filter("target LIKE 'good%'")
I believe you're trying to use RDD.filter here, which is a completely different method:
spark_df.rdd.filter(lambda r: r['target'].startswith('good'))
and does not benefit from SQL optimizations.
I have been through this and have settled on using a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
filtered_df = spark_df.filter(udf(lambda target: target.startswith('good'),
BooleanType())(spark_df.target))
It would be more readable to use a normal function definition instead of the lambda.
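A minimal sketch of that, assuming the same spark_df with a target column:
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def starts_with_good(target):
    # row-level predicate evaluated by the UDF
    return target is not None and str(target).startswith('good')

starts_with_good_udf = udf(starts_with_good, BooleanType())
filtered_df = spark_df.filter(starts_with_good_udf(spark_df.target))
filtered_df.take(5)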
Convert the DataFrame into an RDD:
spark_df = sc.createDataFrame(pandas_df)
spark_df.rdd.filter(lambda r: str(r['target']).startswith('good'))
spark_df.take(5)
I think it may work!
