spark cannot create LabeledPoint - python

I always get this error:
AnalysisException: u"cannot resolve 'substring(l,1,-1)' due to data type mismatch: argument 1 requires (string or binary) type, however, 'l' is of array type.;"
Quite confused because l[0] is a string, and matches arg 1.
dataframe has only one column named 'value', which is a comma separated string.
And I want to convert this original dataframe to another dataframe of object LabeledPoint, with the first element to be 'label' and the others to be 'features'.
from pyspark.mllib.regression import LabeledPoint
def parse_points(dataframe):
df1=df.select(split(dataframe.value,',').alias('l'))
u_label_point=udf(LabeledPoint)
df2=df1.select(u_label_point(col('l')[0],col('l')[1:-1]))
return df2
parsed_points_df = parse_points(raw_data_df)

I think you what to create LabeledPoint in dataframe. So you can:
def parse_points(df):
df1=df.select(split(df.value,',').alias('l'))
df2=df1.map(lambda seq: LabeledPoint(float(seq[0][0]),seq[0][1:])) # since map applies lambda in each tuple
return df2.toDF() #this will convert pipelinedRDD to dataframe
parsed_points_df = parse_points(raw_data_df)

Related

Cannot plot or use .tolist() on pd dataframe column

so I am reading in data from a csv and saving it to a dataframe so I can use the columns. Here is my code:
filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(filename,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
length_ = len(df.date)
scan = list(range(1,length_+1))
plt.plot(scan,df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought to do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is turning my data from this to this:
before .tolist()
To this:
after .tolist()
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!
So, the df.ch104.values.tolist() code beasicly turns your column into a 2d 1XN array. But what you want is a 1D array of size N.
So transpose it before you call .tolist(). Lastly call [0] to convert Nx1 array to N array
df.ch104.values.tolist()[0]
Might I also suggest you include dropna() to avoid 'value' must be an instance of str or bytes, not a Non
df.dropna(subset=['ch104']).ch104.values.tolist()[0]
The error clearly says there are None or NaN values in your dataframe. You need to check for None and deal with them - replace with a suitable value or delete them.

A function for onehotencoding and labelencoding in a dataframe

I keep getting AttributeError: 'DataFrame' object has no attribute 'column' when I run the function on a column in a dataframe
def reform (column, dataframe):
if dataframe.column.nunique() > 2 and dataframe.column.dtypes == object:
enc.fit(dataframe[['column']])
enc.categories_
onehot = enc.transform(dataframe[[column]]).toarray()
dataframe[enc.categories_] = onehot
elif dataframe.column.nunique() == 2 and dataframe.column.dtypes == object :
le.fit_transform(dataframe[['column']])
else:
print('Column cannot be reformed')
return dataframe
Try changing
dataframe.column to dataframe.loc[:,column].
dataframe[['column']] to dataframe.loc[:,[column]]
For more help, please provide more information. Such as: What is enc (show your imports)? What does dataframe look like (show a small example, perhaps with dataframe.head(5))?
Details:
Since column is an input (probably a string), you need to use it correctly when asking for that column from the dataframe object. If you just use dataframe.column it will try to find the column actually named 'column', but if you ask for it dataframe.loc[:,column], it will use the string that is represented by the input parameter named column.
With dataframe.loc[:,column], you get a Pandas Series, and with dataframe.loc[:,[column]] you get a Pandas DataFrame.
The pandas attribute 'columns', used as dataframe.columns (note the 's' at the end) just returns a list of the names of all columns in your dataframe, probably not what you want here.
TIPS:
Try to name input parameters so that you know what they are.
When developing a function, try setting the input to something static, and iterate the code until you get desired output. E.g.
input_df = my_df
column_name = 'some_test_column'
if input_df.loc[:,column_name].nunique() > 2 and input_df.loc[:,column_name].dtypes == object:
enc.fit(input_df.loc[:,[column_name]])
onehot = enc.transform(input_df.loc[:,[column_name]]).toarray()
input_df.loc[:, enc.categories_] = onehot
elif input_df.loc[:,column_name].nunique() == 2 and input_df.loc[:,column_name].dtypes == object :
le.fit_transform(input_df.loc[:,[column_name]])
else:
print('Column cannot be transformed')
Look up on how to use SciKit Learn Pipelines, with ColumnTransformer. It will help make the workflow easier (https://scikit-learn.org/stable/modules/compose.html).

ValueError while trying to check for a "W" in dataset

Datset
I'm trying to check for a win from the WINorLOSS column, but I'm getting the following error:
Code and Error Message
The variable combined.WINorLOSS seems to be a Series type object and you can't compare an iterable (like list, dict, Series,etc) with a string type value. I think you meant to do:
for i in combined.WINorLOSS:
if i=='W':
hteamw+=1
else:
ateamw+=1
You can't compare a Series of values (like your WINorLOSS dataframe column) to a single string value. However you can use the following to counts the 'L' and 'W' in your columns:
hteamw = combined['WINorLOSS'].value_counts()['W']
hteaml = combined['WINorLOSS'].value_counts()['L']

pandas : Unable to update empty dataframe by single string

I created an empty dataframe and tried adding one string value to a column but it is showing empty
import pandas as pd
df = pd.DataFrame()
df['Man'] = "manish"
print(df)
when i am running above code i am getting output as:
Empty DataFrame
Columns: [Man]
Index: []
While when i am running below code
df['Man'] = ['manish']
print(df)
i am getting correct output which i expected
Man
0 manish
Can anyone explain me why this is happening ?
It seems, by looking at the code of __setitem__ that it expects a list like value, as written in the function _ensure_valid_index:
def _ensure_valid_index(self, value):
"""
Ensure that if we don't have an index, that we can create one from the
passed value.
"""
# GH5632, make sure that we are a Series convertible
if not len(self.index) and is_list_like(value):
try:
value = Series(value)
except (ValueError, NotImplementedError, TypeError):
raise ValueError(
"Cannot set a frame with no defined index "
"and a value that cannot be converted to a "
"Series"
)
self._data = self._data.reindex_axis(
value.index.copy(), axis=1, fill_value=np.nan
)
So if the len of the index is zero (as in your case) it expects a list like value, which a string is not, to convert to a Series an use the index from there. The function _ensure_valid_index is called inside set_item.

Iteration over the rows of a Pandas DataFrame as dictionaries

I need to iterate over a pandas dataframe in order to pass each row as argument of a function (actually, class constructor) with **kwargs. This means that each row should behave as a dictionary with keys the column names and values the corresponding ones for each row.
This works, but it performs very badly:
import pandas as pd
def myfunc(**kwargs):
try:
area = kwargs.get('length', 0)* kwargs.get('width', 0)
return area
except TypeError:
return 'Error : length and width should be int or float'
df = pd.DataFrame({'length':[1,2,3], 'width':[10, 20, 30]})
for i in range(len(df)):
print myfunc(**df.iloc[i])
Any suggestions on how to make that more performing ? I have tried iterating with tried df.iterrows(),
but I get the following error :
TypeError: myfunc() argument after ** must be a mapping, not tuple
I have also tried df.itertuples() and df.values , but either I am missing something, or it means that I have to convert each tuple / np.array to a pd.Series or dict , which will also be slow.
My constraint is that the script has to work with python 2.7 and pandas 0.14.1.
one clean option is this one:
for row_dict in df.to_dict(orient="records"):
print(row_dict['column_name'])
You can try:
for k, row in df.iterrows():
myfunc(**row)
Here k is the dataframe index and row is a dict, so you can access any column with: row["my_column_name"]
Defining a separate function for this will be inefficient, as you are applying row-wise calculations. More efficient would be to calculate a new series, then iterate the series:
df = pd.DataFrame({'length':[1,2,3,'test'], 'width':[10, 20, 30,'hello']})
df2 = df.iloc[:].apply(pd.to_numeric, errors='coerce')
error_str = 'Error : length and width should be int or float'
print(*(df2['length'] * df2['width']).fillna(error_str), sep='\n')
10.0
40.0
90.0
Error : length and width should be int or float

Categories