Fitting a pandas column containing a list in scikit-learn - python

I have a pandas DataFrame with a column called 'X', where each entry is a list of 300 doubles, and a column called 'label'. When I try to run:
cls = SVC()
cls.fit(miniset.loc[:,'X'],miniset.loc[:,'label'])
I get the error:
ValueError: setting an array element with a sequence.
Any idea how to fix it?
Thanks
Head of my DataFrame
label X
0 0 [-1.1990741, 0.98229957, -2.7413394, 0.5774205...
1 1 [0.10277234, 1.8292198, -1.8241594, 0.07206603...
2 0 [-0.26603428, 1.8654639, -2.2495375, -0.695124...
3 0 [-1.1662953, 3.0714324, -3.4975948, 0.01011618...
4 0 [-0.13769871, 1.9866339, -1.9885212, -0.830097...

Your issue is the 'X' column of your DataFrame. To get this to work with SVC (or basically any scikit-learn model), you need to split that column into several columns, one for each element in your lists.
The pandas package is not intended to store lists or other collections as values. It is meant to store panel data, hence the name pandas.
You can fix that by doing something like this:
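Here is a minimal sketch of that split (the toy miniset below just mimics the DataFrame shown above; it is not the asker's actual data):
import numpy as np
import pandas as pd
from sklearn.svm import SVC
# Toy stand-in: a list-valued 'X' column and a 'label' column
miniset = pd.DataFrame({
    'X': [list(np.random.randn(300)) for _ in range(5)],
    'label': [0, 1, 0, 0, 1],
})
# Expand the list column into 300 numeric columns, one per list element
features = pd.DataFrame(miniset['X'].tolist(), index=miniset.index)
cls = SVC()
cls.fit(features, miniset['label'])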

You can try:
cls.fit(np.array(miniset.loc[:,'X'].tolist()),miniset.loc[:,'label'])
where .tolist() turns the column of lists into a plain list of lists, which np.array() then stacks into a 2D array (which would be good enough for SVC).
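As a quick illustration of why the original call fails (reusing a toy miniset like the one sketched above): the list-valued column comes out of pandas as a 1D object array of lists, which numpy cannot coerce into a float matrix, whereas the .tolist() route yields a clean 2D array.
miniset.loc[:, 'X'].values.dtype              # dtype('O'): object array of lists -> the ValueError
np.array(miniset.loc[:, 'X'].tolist()).shape  # (n_rows, 300): proper 2D float array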

Related

Cannot plot or use .tolist() on pd dataframe column

So I am reading in data from a CSV and saving it to a DataFrame so I can use the columns. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt
filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date","time","ch104","alarm104","ch114","alarm114","ch115","alarm115","ch116","alarm116","ch117","alarm117","ch118","alarm118"]
df = pd.read_csv(filename,sep='[, ]',encoding='UTF-16 LE',names=columns,header=15,on_bad_lines='skip',engine='python')
length_ = len(df.date)
scan = list(range(1,length_+1))
plt.plot(scan,df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought to do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is changing my data from this (screenshot: column values before .tolist()) to this (screenshot: values after .tolist()).
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!
So, the df.ch104.values.tolist() call basically turns your column into a 2D 1xN array, but what you want is a 1D array of size N.
So transpose it before you call .tolist(). Lastly, call [0] to convert the Nx1 array to an N array:
df.ch104.values.tolist()[0]
Might I also suggest you include dropna() to avoid 'value' must be an instance of str or bytes, not a None:
df.dropna(subset=['ch104']).ch104.values.tolist()[0]
The error clearly says there are None or NaN values in your dataframe. You need to check for None and deal with them - replace with a suitable value or delete them.
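A small sketch of that suggestion, with made-up numbers standing in for the CSV from the question: drop the rows where ch104 is missing before plotting, so matplotlib never receives None/NaN.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'ch104': [1.2, np.nan, 3.4, 2.8, np.nan, 4.1]})
clean = df.dropna(subset=['ch104'])   # remove rows with missing readings
scan = range(1, len(clean) + 1)       # scan index, as in the question
plt.plot(scan, clean['ch104'])
plt.show()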

Iterate through 200 datasets [duplicate]

This question already has answers here:
Creating multiple dataframes with a loop
(3 answers)
Closed 1 year ago.
I have 200 datasets and I want to iterate through them to pick random rows and add them to another (empty) dataset, using iloc and .values. When I execute the code it does not give an error, but it also does not add anything to the empty dataset. However, when I try to run the single command to check whether the random row has any value or not, it gives this error:
AttributeError: 'str' object has no attribute 'iloc'.
my code is given below:
Tdata = np.zeros([20, 6])
k = 0
for j in range(200):
    for j1 in range(0, 20):
        Tdata[k:k+1, :] = (('dataset'+j)).iloc[random.randint(100)].values
        k += 1
('dataset'+j) is basically selecting different datasets. The names of my datasets are dataset0, dataset1, dataset2, ...; they are already defined.
There are multiple issues with your code.
1. Using str in place of the actual DataFrame variable
You are trying to use .iloc on a string, 'dataset1' for example. This won't work since str has no attribute .iloc, as the error tells you.
Since you want to work with DataFrame variable names, you may need to use eval() to interpret the string as a variable name. NOTE: BE EXTRA CAREFUL while using eval(). Please read the dangers of using eval() carefully.
2. Sampling 20 rows from each DataFrame.
If you are trying to get 20 rows by using for j1 in range(0, 20): along with random.randint(100), there is a better way that avoids this iteration: use np.random.randint(0, 100, (n,)) to get n random indexes, in this case np.random.randint(0, 100, (20,)).
Or, even better, simply use df.sample(20) to sample 20 rows from a given dataframe.
3. Forcing update over views of the dataframe
It's better to use a different approach than forcing an update over a view of the dataframe with Tdata[k:k+1,:] = .... Since you want to combine dataframes, it's better to just collect them in a list and pass them to pd.concat, which is much more useful.
Here is sample code with a simple setting which should help guide you to what you are looking for.
import pandas as pd
import numpy as np
dataset0 = pd.DataFrame(np.random.random((100,3)))
dataset1 = pd.DataFrame(np.random.random((100,3)))
dataset2 = pd.DataFrame(np.random.random((100,3)))
dataset3 = pd.DataFrame(np.random.random((100,3)))
##Using random.randint
##samples = [eval('dataset'+str(i)).iloc[np.random.randint(0,100,(3,))] for i in range(4)]
##Using df.sample()
samples = [eval('dataset'+str(i)).sample(3) for i in range(4)]
##Change -
##1. The 3 to 20 for 20 samples per dataframe
##2. range(4) to range(200) to work with 200 dataframes
output = pd.concat(samples)
print(output)
0 1 2
42 0.372626 0.445972 0.030467
20 0.376201 0.445504 0.835735
56 0.214806 0.083550 0.582863
85 0.691495 0.346022 0.619638
24 0.290397 0.202795 0.704082
16 0.112986 0.013269 0.903917
51 0.521951 0.115386 0.632143
73 0.946870 0.531085 0.437418
98 0.745897 0.718701 0.280326
56 0.679253 0.010143 0.124667
4 0.028559 0.769682 0.737377
84 0.857553 0.866464 0.827472
4. Storing 200 dataframes??
Last but not least, you should ask yourself why you are storing 200 DataFrames as individual variables, only to sample some rows from each.
Why not instead:
- read each of the files iteratively,
- sample rows from each,
- store the samples in a list of DataFrames, and
- pd.concat once you are done iterating over the 200 files,
instead of saving 200 DataFrames and then doing the same? A rough sketch follows.
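Assuming (purely for illustration) that the 200 datasets live in CSV files named dataset0.csv through dataset199.csv, the loop could look like this:
import glob
import pandas as pd
samples = []
for path in sorted(glob.glob('dataset*.csv')):   # read one file at a time
    df = pd.read_csv(path)
    samples.append(df.sample(20))                # sample 20 random rows from it
Tdata = pd.concat(samples, ignore_index=True)    # one combined DataFrame at the end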

numpy get rows where value in list within column

I have kind of an odd numpy ndarray, it looks like this:
[[0,1,3,list([0,1])],
[0,0,0,list([0])],
[0,0,0,list([])],
[1,1,1,list([1,2,3,4,5,6,7])]]
so, the column at index 3 is actually of type list.
I would like to extract the rows which contain a key value in the list contained in index 3, so for instance, something like magic_function(matrix, 0) might return [[0,1,3,list([0,1])], [0,0,0,list([0])]], as the rows at index 0 and 1 contain the value 0 in the list at index 3.
I tried using a combination of np.where and np.isin, but couldn't quite get it working in a way that was elegant (things like matrix[np.where(np.isin(matrix[:,3], 0))]). I would prefer the fastest approach, which I believe would be an approach using numpy rather than iteration in python.
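For what it's worth, one way this could look (just a sketch): np.isin does not reach into an object-dtype column of lists, so the row mask is built with a plain Python membership test instead.
import numpy as np
matrix = np.array([
    [0, 1, 3, [0, 1]],
    [0, 0, 0, [0]],
    [0, 0, 0, []],
    [1, 1, 1, [1, 2, 3, 4, 5, 6, 7]],
], dtype=object)
def magic_function(matrix, key):
    # keep rows whose list (column index 3) contains the key
    mask = np.array([key in row[3] for row in matrix])
    return matrix[mask]
print(magic_function(matrix, 0))   # rows 0 and 1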

Why does pandas.to_numeric result in a list of lists?

I am trying to import csv data into a pandas dataframe. To do this I am doing the following:
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
index = pd.MultiIndex.from_arrays((columns, units, descr))
df.columns = index
df.columns.names = ['Name','Unit','Description']
df = df.apply(pd.to_numeric)
data['isotherm'] = df
This produces e.g. the following table:
In: data['isotherm']
Out:
Name Relative_Pressure Volume_STP
Unit - ccm/g
Description p/p0
0 0.042691 29.3601
1 0.078319 30.3071
2 0.129529 31.1643
3 0.183355 31.8513
4 0.233435 32.3972
5 0.280847 32.8724
However if I only want to get the values of the column Relative_Pressure I get this output:
In: data['isotherm']['Relative_Pressure'].values
Out:
array([[0.042691],
[0.078319],
[0.129529],
[0.183355],
[0.233435],
[0.280847]])
Of course, I could now flatten every column I want to use:
x = [item for sublist in data['isotherm']['Relative_Pressure'].values for item in sublist]
However, this would take a lot of extra effort and would also reduce readability. How can I make sure the data is flat for the whole DataFrame?
array([[...]]) is not a list of lists, but a 2D numpy array. (I'm not sure why the values are returned as a single-column 2D array rather than a 1D array here, though. When I create a primitive DataFrame, a single column's values are returned as a 1D array.)
You can concatenate and flatten them using numpy's built-in functions, e.g.
x = data['isotherm']['Relative_Pressure'].values.flatten()
Edit: This might be caused by the MultiIndex.
The direct way of indexing into one column belonging to your MultiIndex object is with a tuple as follows:
data[('isotherm', 'Relative_Pressure')]
which will return a Series object whose .values attribute will give you the expected 1D array. The docs discuss this here
You should be careful using chained indexing like data['isotherm']['Relative_Pressure'] because you won't know if you are dealing with a copy of the data or a view of the data. Please do a SO search of pandas' SettingWithCopyWarning for more details or read the docs here.
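A tiny sketch of the difference, with a toy MultiIndex frame standing in for data['isotherm']:
import numpy as np
import pandas as pd
columns = pd.MultiIndex.from_arrays(
    [['Relative_Pressure', 'Volume_STP'], ['-', 'ccm/g'], ['p/p0', '']],
    names=['Name', 'Unit', 'Description'])
df = pd.DataFrame(np.random.random((5, 2)), columns=columns)
df['Relative_Pressure'].values                   # 2D, shape (5, 1): a one-column DataFrame slice
df[('Relative_Pressure', '-', 'p/p0')].values    # 1D, shape (5,): a single Series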

pandas SparseDataFrame insertion

I would like to create a pandas SparseDataFrame with dimensions 250,000 x 250,000. In the end my aim is to come up with a big adjacency matrix.
So far it is no problem to create that DataFrame:
df = SparseDataFrame(columns=arange(250000), index=arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
    """
    Put single value at passed column and index

    Parameters
    ----------
    index : row label
    col : column label
    value : scalar value

    Notes
    -----
    This method *always* returns a new object. It is currently not
    particularly efficient (and potentially very expensive) but is provided
    for API compatibility with DataFrame
    ...
That last sentence describes the problem with using pandas in this case. I really would like to keep on using pandas here, but as it stands it seems totally impossible!
Does someone have an idea, how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts or so...
Thanks for your help!
Do it this way
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))
s = df[2000].to_dense()
s[1000] = 1
df[2000] = s
In [11]: df.ix[1000,2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SDF will convert the passed-in series to a SparseSeries. (You can do it yourself to see what they look like with s.to_sparse().) The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparseness will have some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
