Problem about printing certain rows without using Pandas - python

I want to print out the first 5 rows of the data from sklearn.datasets.load_diabetes. I tried head() and iloc but it seems not effective. What should I do?
Here is my work
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
# 2. Load the data (use .load_diabetes() function )
df = datasets.load_diabetes()
df
# 3. Print out feature names and target names
# Features Names
x = df.feature_names
x
# Target Names
y = df.target
y
# 4. Print out the first 5 rows of the data
df.head(5)
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
113 try:
--> 114 return self[key]
115 except KeyError:
KeyError: 'head'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
1 frames
/usr/local/lib/python3.7/dist-packages/sklearn/utils/__init__.py in __getattr__(self, key)
114 return self[key]
115 except KeyError:
--> 116 raise AttributeError(key)
117
118 def __setstate__(self, state):
AttributeError: head

According to the documentation for load_diabetes() it doesn't return a Pandas dataframe by default, so no wonder it doesn't work.
You can apparently do
df = datasets.load_diabetes(as_frame=True).data
if you want a dataframe.
If you don't want a dataframe, you need to read up on how Numpy array slicing works, since that's what you get by default.

Well, I thank Mr.AKX for giving me a useful hint. I can find my answer:
# 1. Import dataset about diabetes from the sklearn package: from sklearn import
from sklearn import datasets
import pandas as pd
# 2. Load the data (use .load_diabetes() function )
data = datasets.load_diabetes()
# 3. Print out feature names and target names
# Features Names
x = data.feature_names
x
# Target Names
y = data.target
y
# 4. Print out the first 5 rows of the data
df = pd.DataFrame(data.data, columns=data.feature_names)
df.head(5)

The method load_diabetes() doesn't return a DataFrame by default but if you are using sklearn 0.23 or higher you can set as_frame parameter to True so it will return a Pd.DataFrame object.
df = datasets.load_diabetes(as_frame=True)
Then you can call head method and it will show you the first 5 rows, no need to specify 5.
print(df.head())

Related

SAS Proc Transpose to Pyspark

I am trying to convert a SAS proc transpose statement to pyspark in databricks.
With the following data as a sample:
data = [{"duns":1234, "finc stress":100,"ver":6.0},{"duns":1234, "finc stress":125,"ver":7.0},{"duns":1234, "finc stress":135,"ver":7.1},{"duns":12345, "finc stress":125,"ver":7.6}]
I would expect the result to look like this
I tried using the pandas pivot_table() function with the following code however I ran into some performance issues with the size of the data:
tst = (df.pivot_table(index=['duns'], columns=['ver'], values='finc stress')
.add_prefix('ver')
.reset_index())
Is there a way to translate the PROC Transpose SAS logic to Pyspark instead of using pandas?
I am trying something like this but am getting an error
tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
AssertionError: all exprs should be Column
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<command-2507760044487307> in <module>
4 df = pd.DataFrame(data) # pandas
5
----> 6 tst= sparkdf.groupBy('duns').pivot('ver').agg('finc_stress').withColumn('ver')
7
8
/databricks/spark/python/pyspark/sql/group.py in agg(self, *exprs)
115 else:
116 # Columns
--> 117 assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
118 jdf = self._jgd.agg(exprs[0]._jc,
119 _to_seq(self.sql_ctx._sc, [c._jc for c in exprs[1:]]))
AssertionError: all exprs should be Column
If you could help me out I would so appreciate it! Thank you so much.
I don't know how you create df from data but here is what I did:
import pyspark.pandas as ps
df = ps.DataFrame(data)
df['ver'] = df['ver'].astype('str')
Then your pandas code worked.
To use PySpark method, here is what I did:
sparkdf.groupBy('duns').pivot('ver').agg(F.first('finc stress'))

Pandas interpolation function fails to interpolate after replacing values with .nan

I am working with the pandas function, and I am trying to interpolate a missing value after removing a value that isn't numeric. However, I am still reading one na value when calling the isna().sum() function. A better explanation is below.
The input .csv file can be found here.
Here is what I have done:
#Import modules
import pandas as pd
import numpy as np
#Import data
df = pd.read_csv('example.csv')
df.isna().sum() #Shows no NA values, but I know that one of them is not numeric.
pd.to_numeric(df['example'])
The following error is produced, indicating the presence of an entry that needs to be removed at line number 949:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
File ~libs\lib.pyx:2315, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "asdf"
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
Input In [111], in <cell line: 3>()
1 df1 = pd.read_csv('example.csv')
2 df1.isna().sum()
----> 3 pd.to_numeric(df1['example'])
File ~numeric.py:184, in to_numeric(arg, errors, downcast)
182 coerce_numeric = errors not in ("ignore", "raise")
183 try:
--> 184 values, _ = lib.maybe_convert_numeric(
185 values, set(), coerce_numeric=coerce_numeric
186 )
187 except (ValueError, TypeError):
188 if errors == "raise":
File ~libs\lib.pyx:2357, in pandas._libs.lib.maybe_convert_numeric()
ValueError: Unable to parse string "asdf" at position 949
Here is my attempt to correct remove this value and interpolate a new one in its place:
idx_missing = df== 'asdf'
df[idx_missing] = np.nan
df['example'].isnull().sum() #This line confirms that there is one value missing
#Perform interpolation with a linear method
df1.iloc[:, -1] = df.iloc[:, -1].interpolate(method='linear') #Specifying the last column in the dataframe with the 'iloc' command
df1.isna().sum()
Apparently, there is still a missing value and the value was not interpolated:
example 1
dtype: int64
How can I correctly interpolate this value?
If you first find and replace any value that is not a digit, that should fix your issue.
#Import modules
import pandas as pd
import numpy as np
#Import data
df = pd.read_csv('example.csv')
df['example'] = df.example.replace(r'[^\d]',np.nan,regex=True)
pd.to_numeric(df.example)

How to apply a function to a DataFrame as part of a separate function

I am trying to apply a regex function to a DataFrame, which replaces a date formatted cell with a string taken from some of the characters.
I am having a problem getting the function to be applied to the dataframe itself.
This is my code so far:
def preprocess_test_data(self, test_df):
def to_month_day(s):
m = re.match("\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}", s)
if m:
return m[0][8:10].lstrip('0') + '-' + m[0][5:7].lstrip('0')
return s
test_df = test_df.apply(to_month_day)
a = test_df[:,0].astype(str)
b = test_df[:,1].astype(str)
c = test_df[:,2].astype(str)
d = test_df[:,3].astype(str)
e = test_df[:,4].astype(str)
f = test_df[:,5].astype(str)
g = test_df[:,6].astype(str)
h = test_df[:,7].astype(str)
i = test_df[:,8].astype(str)
I keep recieving this error:
AttributeError Traceback (most recent call last)
<ipython-input-10-a9f16326387d> in <module>
183
184 # Dont change
--> 185 x_test_processed = my_model.preprocess_test_data(x_test)
186
187 # Train your model
<ipython-input-10-a9f16326387d> in preprocess_test_data(self, test_df)
119 return s
120
--> 121 test_df = test_df.apply(to_month_day)
122 a = test_df[:,0].astype(str)
123 b = test_df[:,1].astype(str)
AttributeError: 'numpy.ndarray' object has no attribute 'apply'
How can I reformat the dataframe so that it allows me to run the Re function.
The error is caused by test_df being a numpy array and not a Pandas DataFrame. But even with a true dataframe, the function passed in the apply method will receive a full Series, by default a column, or a row if you use axis=1.
Here what you want (once test_df is a DataFrame) is:
test_df = test_df.apply(lambda x: x.apply(to_month_day))

pyspark taking a log of column

length = df.count()
df = df.withColumn("log", log(col("power"),lit(length)))
The following lines throw such an error. Can you please help me take a log of a column using another value or another column as a base.
TypeError Traceback (most recent call last)
<ipython-input-102-c0894b6127d1> in <module>()
1 #df.show()
2
----> 3 df = df.withColumn("log", log(col("power"),lit(2)))
5 frames
/content/spark-2.4.5-bin-hadoop2.7/python/pyspark/sql/column.py in __iter__(self)
342
343 def __iter__(self):
--> 344 raise TypeError("Column is not iterable")
345
346 # string methods
TypeError: Column is not iterable
If you want to use funtions that are not build-in on spark dataframes you can use user-defined functions, in your case it would look like this:
from pyspark.sql.functions import udf
from math import log
#udf("float")
def log_udf(s):
return log(s,2)
df.withColumn("log", log_udf("power")).show()

AttributeError: 'Series' object has no attribute 'label'

I'm trying to follow a tutorial on sound classification in neural networks, and I've found 3 different versions of the same tutorial, all of which work, but they all reach a snag at this point in the code, where I get the "AttributeError: 'Series' object has no attribute 'label'" issue. I'm not particularly au fait with either NNs or Python, so apologies if this is something trivial like a deprecation error, but I can't seem to figure it out myself.
def parser(row):
# function to load files and extract features
file_name = os.path.join(os.path.abspath(data_dir), 'Train/train', str(row.ID) + '.wav')
# handle exception to check if there isn't a file which is corrupted
try:
# here kaiser_fast is a technique used for faster extraction
X, sample_rate = librosa.load(file_name, res_type='kaiser_fast')
# we extract mfcc feature from data
mfccs = np.mean(librosa.feature.mfcc(y=X, sr=sample_rate, n_mfcc=40).T,axis=0)
except Exception as e:
print("Error encountered while parsing file: ", file)
return None, None
feature = mfccs
label = row.Class
return [feature, label]
temp = train.apply(parser, axis=1)
temp.columns = ['feature', 'label']
from sklearn.preprocessing import LabelEncoder
X = np.array(temp.feature.tolist())
y = np.array(temp.label.tolist())
lb = LabelEncoder()
y = np_utils.to_categorical(lb.fit_transform(y))
As mentioned, I've seen three different tutorials on the same subject, all of which end with the same "temp = train.apply(parser, axis=1) temp.columns = ['feature', 'label']" fragment, so I'm assuming this is assigning correctly, but I don't know where it's going wrong otherwise. Help appreciated!
Edit: Traceback as requested, turns out I'd added the wrong traceback. Also I've since found out that this is a case of converting the series object to a dataframe, so any help with that would be great.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-17-1613f53e2d98> in <module>()
1 from sklearn.preprocessing import LabelEncoder
2
----> 3 X = np.array(temp.feature.tolist())
4 y = np.array(temp.label.tolist())
5
/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'feature'
Your current implementation of parser(row) method returns a list for each row of data from train DataFrame. But this is then collected as a pandas.Series object.
So your temp is actually a Series object. Then the following line dont have any effect:
temp.columns = ['feature', 'label']
Since temp is a Series, it does not have any columns, and hence temp.feature and temp.label dont exist and hence the error.
Change your parser() method as following:
def parser(row):
...
...
...
# Return pandas.Series instead of List
return pd.Series([feature, label])
By doing this, the apply method from temp = train.apply(parser, axis=1) will return a DataFrame, so your other code will work.
I cannot say about the tutorials you are following. Maybe they followed an older version of pandas which allowed a list to be automatically converted to DataFrame.

Categories