Python string column iteration

Python string column iteration - python

I am working on openAI, and stuck I have tried to sort this issue on my own but didn't get any resolution. I want my code to run the sentence generation operation on every row of the Input_Description_OAI column and give me the output in another column (OpenAI_Description). Can someone please help me with the completion of this task. I am new to python.
The dataset looks like:
import os
import openai
import wandb
import pandas as pd
openai.api_key = "MY-API-Key"
data=pd.read_excel("/content/OpenAI description.xlsx")
data
data["OpenAI_Description"] = data.apply(lambda _: ' ', axis=1)
data
gpt_prompt = ("Write product description for: Brand: COILCRAFT ; MPN: DO5010H-103MLD..")
response = openai.Completion.create(engine="text-curie-001", prompt=gpt_prompt,
temperature=0.7, max_tokens=1000, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0)
print(response['choices'][0]['text'])
data['OpenAI_Description'] = data.apply(gpt_prompt,response['choices'][0]['text'], axis=1)
I got the error after execution on first row as:
---------------------------------------------------------------------------
TypeError
Traceback (most recent call last)
<ipython-input-32-c798fbf9bc16> in <module>
15 print(response['choices'][0]['text'])
16 #data.add_data(gpt_prompt,response['choices'][0]['text'])
---> 17 data['OpenAI_Description'] = data.apply(gpt_prompt,response['choices'][0]['text'], axis=1)
18
TypeError: apply() got multiple values for argument 'axis'

Related

SAS Proc Transpose to Pyspark

I am trying to convert a SAS proc transpose statement to pyspark in databricks.
With the following data as a sample:
data = [{"duns":1234, "finc stress":100,"ver":6.0},{"duns":1234, "finc stress":125,"ver":7.0},{"duns":1234, "finc stress":135,"ver":7.1},{"duns":12345, "finc stress":125,"ver":7.6}]
I would expect the result to look like this
I tried using the pandas pivot_table() function with the following code however I ran into some performance issues with the size of the data:
tst = (df.pivot_table(index=['duns'], columns=['ver'], values='finc stress')
.add_prefix('ver')
.reset_index())
Is there a way to translate the PROC Transpose SAS logic to Pyspark instead of using pandas?
I am trying something like this but am getting an error
tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
AssertionError: all exprs should be Column
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<command-2507760044487307> in <module>
4 df = pd.DataFrame(data) # pandas
5
----> 6 tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
7
8
/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
115 else:
116 # Columns
--> 117 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
118 jdf = self._jgd.agg(exprs[0]._jc,
119 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
AssertionError: all exprs should be Column
If you could help me out I would so appreciate it! Thank you so much.

I don't know how you create df from data but here is what I did:
import pyspark.pandas as ps
df = ps.DataFrame(data)
df['ver'] = df['ver'].astype('str')
Then your pandas code worked.
To use PySpark method, here is what I did:
sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))

Why do I get a TypeError when I try to create a data frame?

I am writing code to analyze some data and want to create a data frame. How do I set it up successfully to run?
this is for analysis of data and I would like to create a data frame that can categorize data in different grades such as A
Here is the code I wrote:
import analyze_lc_Feb2update
from imp import reload
analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
df = analyze_lc_Feb2update.create_df()
df.shape
df_new = df[df.grade=='A']
df_new.shape
df.columns
df.int_rate.head(5)
df.int_rate.tail(5)
df.int_rate.dtype
df.term.dtype
df_new = df[df.grade =='A']
df_new.shape
output:
TypeError Traceback (most recent call last)
<ipython-input-3-7079435f776f> in <module>()
2 from imp import reload
3 analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
4 df = analyze_lc_Feb2update.create_df()
5 df.shape
6 df_new = df[df.grade=='A']
TypeError: create_df() missing 1 required positional
argument: 'grade'

Based on what was provided I guess your problem is here:
from imp import reload
analyze_lc_Feb2update = reload(analyze_lc_Feb2update)
df = analyze_lc_Feb2update.create_df()
This looks like some custom library you are trying to use, of which the .create_df() method requires a positional argument "grade" which would require you to do something like:
df = analyze_lc_Feb2update.create_df(grade="blah")

How to remove every possible accents from a column in python

I am new in python. I have a data frame with a column, named 'Name'. The column contains different type of accents. I am trying to remove those accents. For example, rubén => ruben, zuñiga=zuniga, etc. I wrote following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata
data=pd.read_csv('transactions.csv')
data.head()
nm=data['Name']
normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
I am getting error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-41-1410866bc2c5> in <module>()
1 nm=data['Name']
----> 2 normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
TypeError: normalize() argument 2 must be unicode, not Series

The reason why it is giving you that error is because normalize requires a string for the second parameter, not a list of strings. I found an example of this online:
unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'

Try this for one column:
nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
Try this for multiple columns:
obj_cols = data.select_dtypes(include=['O']).columns
data.loc[obj_cols] = data.loc[obj_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))

Try this for one column:
df[column_name] = df[column_name].apply(lambda x: unicodedata.normalize(u'NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8'))
Change the column name according to your data columns.

How to print unique values of a column in a group using Pandas?

I am trying to print unique values of the column ADO_name in my data set. Following is the example data set and code I tried (which gives error):
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
data = {'ADO_name':['car1','car1','car1','car2','car2','car2'],
'Time_sec':[0,1,2,0,1,2],
'Speed.kph':[50,51,52,0,0,52]}
dframe = DataFrame(data)
for ado in dframe.groupby('ADO_name'):
ado_name = ado["ADO_name"]
adoID = ado_name.unique()
print(adoID)
Traceback (most recent call last):
File "C:\Users\Quinton\AppData\Local\Temp\Rtmp88ifpB\chunk-code-188c39fc7de8.txt", line 14, in <module>
ado_name = ado["ADO_name"]
TypeError: tuple indices must be integers or slices, not str
What am I doing wrong and how to fix it? Please help.

You can do: dframe["ADO_name"].unique().

You may want to correct your code or use the correct way.
Here is what you need to correct in your code.
for ado in dframe.groupby('ADO_name'):
ado_name = ado[1]["ADO_name"]
adoID = ado_name.unique()
print(adoID)

Getting error while trying to use a list with numpy to get some stat values

Hi I am having problems with this code:
**import numpy as np
# Summarize the data about minutes spent in the classroom
#total_minutes = total_minutes_by_account.values()
total_minutes = list(total_minutes_by_account.values())
type(total_minutes)
# Printing out the samething converting to a list
print('Printing out the samething converting to a list ')
print(type(total_minutes))
print ('Mean:', np.mean(total_minutes))
print ('Standard deviation:', np.std(total_minutes))
print ('Minimum:', np.min(total_minutes))
print ('Maximum:', np.max(total_minutes))**
The error I get is:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-93-945375bf6098> in <module>()
3 # Summarize the data about minutes spent in the classroom
4 #total_minutes = total_minutes_by_account.values()
----> 5 total_minutes = list(total_minutes_by_account.values())
6 type(total_minutes)
7 #print(total_minutes)
AttributeError: 'list' object has no attribute 'values'
I really would lie to know how I can make this work, I can do it with pandas converitng it to a numpy array and the getting values for the statistics I want with numpy

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python string column iteration - python

Related

SAS Proc Transpose to Pyspark

Why do I get a TypeError when I try to create a data frame?

How to remove every possible accents from a column in python

How to print unique values of a column in a group using Pandas?

Getting error while trying to use a list with numpy to get some stat values

Categories

Resources