Pandas interpolation function fails to interpolate after replacing values with .nan - python

I am working with pandas and trying to interpolate a missing value after removing an entry that isn't numeric. However, isna().sum() still reports one NA value after the interpolation. A fuller explanation is below.
The input .csv file can be found here.
Here is what I have done:
#Import modules
import pandas as pd
import numpy as np
#Import data
df = pd.read_csv('example.csv')
df.isna().sum() #Shows no NA values, but I know that one of them is not numeric.
pd.to_numeric(df['example'])
The following error is produced, indicating that the entry at position 949 cannot be parsed and needs to be removed:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~libs\lib.pyx:2315, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "asdf"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [111], in <cell line: 3>()
1 df1 = pd.read_csv('example.csv')
2 df1.isna().sum()
----> 3 pd.to_numeric(df1['example'])
File ~numeric.py:184, in to_numeric(arg, errors, downcast)
182 coerce_numeric = errors not in ("ignore", "raise")
183 try:
--> 184 values, _ = lib.maybe_convert_numeric(
185 values, set(), coerce_numeric=coerce_numeric
186 )
187 except (ValueError, TypeError):
188 if errors == "raise":
File ~libs\lib.pyx:2357, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "asdf" at position 949
Here is my attempt to remove this value and interpolate a new one in its place:
idx_missing = df== 'asdf'
df[idx_missing] = np.nan
df['example'].isnull().sum() #This line confirms that there is one value missing
#Perform interpolation with a linear method
df1.iloc[:, -1] = df.iloc[:, -1].interpolate(method='linear') #Specifying the last column in the dataframe with the 'iloc' command
df1.isna().sum()
Apparently, there is still a missing value and the value was not interpolated:
example 1
dtype: int64
How can I correctly interpolate this value?

If you first replace any value that contains a non-digit character with NaN, the column can be converted to a numeric dtype, which should fix your issue.
#Import modules
import pandas as pd
import numpy as np
#Import data
df = pd.read_csv('example.csv')
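df['example'] = df['example'].replace(r'[^\d]', np.nan, regex=True)
#Convert to a numeric dtype, then interpolate the missing value
df['example'] = pd.to_numeric(df['example']).interpolate(method='linear')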
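An alternative that avoids writing a pattern at all is to let to_numeric do the cleaning with errors="coerce". The sketch below is only an illustration: the five sample values are made up and stand in for the 'example' column of example.csv. Note that the pattern [^\d] also matches a decimal point or a minus sign, so the regex approach assumes the column holds non-negative integers only.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 'example' column of example.csv (assumption):
# read_csv yields strings (object dtype) when one entry is not numeric.
df = pd.DataFrame({"example": ["1.0", "2.0", "asdf", "4.0", "5.0"]})

# errors="coerce" turns anything unparseable into NaN and returns a float column.
df["example"] = pd.to_numeric(df["example"], errors="coerce")
n_missing = int(df["example"].isna().sum())

# Linear interpolation fills the gap from its two numeric neighbours.
df["example"] = df["example"].interpolate(method="linear")
print(n_missing, df["example"].tolist())
```

Converting first matters because the column in the question was most likely still of object dtype after the manual replacement, and interpolation needs a numeric dtype to do anything useful.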

Related

Python string column iteration

I am working with OpenAI and am stuck; I have tried to sort this out on my own without success. I want my code to run the sentence generation on every row of the Input_Description_OAI column and write the output to another column (OpenAI_Description). Can someone please help me complete this task? I am new to Python.
The dataset looks like:
import os
import openai
import wandb
import pandas as pd
openai.api_key = "MY-API-Key"
data=pd.read_excel("/content/OpenAI description.xlsx")
data
data["OpenAI_Description"] = data.apply(lambda _: ' ', axis=1)
data
gpt_prompt = ("Write product description for: Brand: COILCRAFT ; MPN: DO5010H-103MLD..")
response = openai.Completion.create(engine="text-curie-001", prompt=gpt_prompt,
temperature=0.7, max_tokens=1000, top_p=1.0, frequency_penalty=0.0, presence_penalty=0.0)
print(response['choices'][0]['text'])
data['OpenAI_Description'] = data.apply(gpt_prompt,response['choices'][0]['text'], axis=1)
After it ran on the first row, I got this error:
---------------------------------------------------------------------------
TypeError
Traceback (most recent call last)
<ipython-input-32-c798fbf9bc16> in <module>
15 print(response['choices'][0]['text'])
16 #data.add_data(gpt_prompt,response['choices'][0]['text'])
---> 17 data['OpenAI_Description'] = data.apply(gpt_prompt,response['choices'][0]['text'], axis=1)
18
TypeError: apply() got multiple values for argument 'axis'
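The TypeError arises because the second positional parameter of DataFrame.apply is axis, so the response text passed in that position collides with the keyword axis=1. A minimal sketch of a per-row version follows. It rests on assumptions: describe_row is a hypothetical stand-in for the openai.Completion.create call, and the two sample rows are made up in place of the Excel sheet.

```python
import pandas as pd

# Made-up rows standing in for the Excel sheet (assumption).
data = pd.DataFrame({
    "Input_Description_OAI": [
        "Brand: COILCRAFT ; MPN: DO5010H-103MLD",
        "Brand: EXAMPLE ; MPN: ABC-123",
    ]
})


def describe_row(text):
    """Hypothetical stand-in for the API call: build the prompt for one row.

    In real use, send `prompt` to the completion endpoint here and
    return the generated text instead of the prompt itself.
    """
    prompt = f"Write product description for: {text}"
    return prompt


# Series.apply calls the function once per cell, so each row gets its own result.
data["OpenAI_Description"] = data["Input_Description_OAI"].apply(describe_row)
print(data)
```

The key point is that apply receives a function, not a prompt string or a finished response, and the function is then called once for every row.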

Problem about printing certain rows without using Pandas

I want to print out the first 5 rows of the data from sklearn.datasets.load_diabetes. I tried head() and iloc, but neither works. What should I do?
Here is my work
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
# 2. Load the data (use .load_diabetes() function )
df = datasets.load_diabetes()
df
# 3. Print out feature names and target names
# Features Names
x = df.feature_names
x
# Target Names
y = df.target
y
# 4. Print out the first 5 rows of the data
df.head(5)
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
113 try:
--> 114 return self[key]
115 except KeyError:
KeyError: 'head'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
114 return self[key]
115 except KeyError:
--> 116 raise AttributeError(key)
117
118 def __setstate__(self, state):
AttributeError: head
According to the documentation for load_diabetes() it doesn't return a Pandas dataframe by default, so no wonder it doesn't work.
You can apparently do
df = datasets.load_diabetes(as_frame=True).data
if you want a dataframe.
If you don't want a dataframe, you need to read up on how Numpy array slicing works, since that's what you get by default.
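For the non-DataFrame route, a short sketch of the slicing might look like this. It assumes scikit-learn is installed and uses only the default return value of load_diabetes(); the variable names are illustrative.

```python
from sklearn import datasets

# By default load_diabetes() returns a Bunch whose .data is a NumPy array.
diabetes = datasets.load_diabetes()

# Slice the first five rows and keep every column.
first_five = diabetes.data[:5, :]
print(diabetes.feature_names)
print(first_five)
```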
Thanks to AKX for the useful hint. I found my answer:
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
import pandas as pd
# 2. Load the data (use .load_diabetes() function )
data = datasets.load_diabetes()
# 3. Print out feature names and target names
# Features Names
x = data.feature_names
x
# Target Names
y = data.target
y
# 4. Print out the first 5 rows of the data
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head(5)
The method load_diabetes() doesn't return a DataFrame by default, but if you are using scikit-learn 0.23 or higher you can set the as_frame parameter to True. It then returns a Bunch whose frame attribute is a pandas DataFrame containing the features and the target.
df = datasets.load_diabetes(as_frame=True).frame
Then you can call the head method, which shows the first 5 rows by default, so there is no need to pass 5.
print(df.head())

seaborn.kdeplot (pandas.DataFrame) - ValueError: If using all scalar values, you must pass an index

I want to make a kdeplot for my pandas dataframe. I used the code below:
mean = [0,0]
cov = [[1,0],[0,100]]
dataset2 = np.random.multivariate_normal(mean,cov,1000)
dframe = pd.DataFrame(dataset2,columns=['X','Y'])
sns.kdeplot(dframe)
And got this error message:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-20-852468cc1da8> in <module>()
7 dframe = pd.DataFrame(dataset2,columns=['X','Y'])
8
----> 9 sns.kdeplot(dframe)
9 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/internals/construction.py in extract_index(data)
385
386 if not indexes and not raw_lengths:
--> 387 raise ValueError("If using all scalar values, you must pass an index")
388
389 if have_series:
ValueError: If using all scalar values, you must pass an index
How should I amend my code?
Note: It works when I instead use:
sns.kdeplot(dframe.X,dframe.Y)
You need to tell kdeplot which columns to plot:
sns.kdeplot(data=dframe,x='X',y='Y')

pyspark taking a log of column

length = df.count()
df = df.withColumn("log", log(col("power"),lit(length)))
The lines above throw the error below. Can you please help me take the log of a column using another value or another column as the base?
TypeError Traceback (most recent call last)
<ipython-input-102-c0894b6127d1> in <module>()
1 #df.show()
2
----> 3 df = df.withColumn("log", log(col("power"),lit(2)))
5 frames
/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/column.py in __iter__(self)
342
343 def __iter__(self):
--> 344 raise TypeError("Column is not iterable")
345
346 # string methods
TypeError: Column is not iterable
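If you want to use functions that are not built into Spark DataFrames, you can use user-defined functions. In your case it would look like this:
from pyspark.sql.functions import udf
from math import log
@udf("float")
def log_udf(s):
    return log(s, 2)
df.withColumn("log", log_udf("power")).show()
Note that the built-in pyspark.sql.functions.log also accepts a base, but as the first argument and as a plain number rather than a Column, for example log(2.0, col("power")).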

How to remove every possible accents from a column in python

I am new to Python. I have a data frame with a column named 'Name'. The column contains different types of accents, and I am trying to remove them. For example, rubén => ruben, zuñiga => zuniga, etc. I wrote the following code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import unicodedata
data=pd.read_csv('transactions.csv')
data.head()
nm=data['Name']
normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
I am getting this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-41-1410866bc2c5> in <module>()
1 nm=data['Name']
----> 2 normal = unicodedata.normalize('NFKD', nm).encode('ASCII', 'ignore')
TypeError: normalize() argument 2 must be unicode, not Series
You get that error because normalize requires a single string as its second argument, not a pandas Series of strings. I found an example of this online:
unicodedata.normalize('NFKD', u"Durrës Åland Islands").encode('ascii','ignore')
'Durres Aland Islands'
Try this for one column:
nm = nm.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8')
Try this for multiple columns:
obj_cols = data.select_dtypes(include=['O']).columns
data[obj_cols] = data[obj_cols].apply(lambda x: x.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8'))
Try this for one column:
df[column_name] = df[column_name].apply(lambda x: unicodedata.normalize(u'NFKD', str(x)).encode('ascii', 'ignore').decode('utf-8'))
Change the column name according to your data columns.
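Put together, the per-value approach can be sketched as a small helper. This is an illustration under assumptions: strip_accents is a hypothetical helper name, and the sample names are made up in place of the 'Name' column of transactions.csv.

```python
import unicodedata

import pandas as pd


def strip_accents(text):
    """Decompose accented characters (NFKD) and drop the non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", str(text))
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Made-up names standing in for the 'Name' column (assumption).
data = pd.DataFrame({"Name": ["rubén", "zuñiga", "Durrës Åland Islands"]})
data["Name"] = data["Name"].map(strip_accents)
print(data["Name"].tolist())
```

Two caveats apply to this approach: str(text) turns missing values into the string 'nan', and characters without a decomposition (such as 'ø' or 'ß') are dropped entirely rather than replaced.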
