I'm new to Python. I have a dataframe and I want to do min-max (0-1) scaling on every column (every attribute). I found the MinMaxScaler method, but I don't know how to use it with a dataframe.
from sklearn import preprocessing

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
    return minmax_scale.fit_transform(X)

data_normalized = sci_minmax(data)
data_variance = data_normalized.var()
data_variance.head(10)
The error is 'numpy.float64' object has no attribute 'head'. I need the return type to be a dataframe.
There is no head method in scipy/numpy.
If you want a pandas.DataFrame, you'll have to call the constructor.
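For example, a minimal sketch, reusing your sci_minmax and assuming data is a pandas DataFrame:

import pandas as pd

data_normalized = pd.DataFrame(sci_minmax(data),
                               index=data.index, columns=data.columns)
data_variance = data_normalized.var()   # per-column variance, now a pandas Series
data_variance.head(10)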
Any chance you mean to look at the first 10 records with head?
You can do this easily with numpy, too.
To select the first 10 records of an array, the Python syntax is array[:10]. With numpy matrices, you will want to specify rows and columns: array[:10, :] or array[:, :10].
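For instance, a quick sketch with a toy array:

import numpy as np

a = np.arange(100).reshape(20, 5)
a[:10]        # first 10 rows
a[:10, :]     # same, with the columns made explicit
a[:, :3]      # all rows, first 3 columns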
So I am reading in data from a CSV and saving it to a dataframe so I can use the columns. Here is my code:
import pandas as pd
import matplotlib.pyplot as plt

filename = open(r"C:\Users\avalcarcel\Downloads\Data INSTR 9 8_16_2022 11_02_42.csv")
columns = ["date", "time", "ch104", "alarm104", "ch114", "alarm114", "ch115", "alarm115", "ch116", "alarm116", "ch117", "alarm117", "ch118", "alarm118"]
df = pd.read_csv(filename, sep='[, ]', encoding='UTF-16 LE', names=columns, header=15, on_bad_lines='skip', engine='python')

length_ = len(df.date)
scan = list(range(1, length_ + 1))

plt.plot(scan, df.ch104)
plt.show()
When I try to plot scan vs. df.ch104, I get the following exception thrown:
'value' must be an instance of str or bytes, not a None
So what I thought I would do was make each column in my df a list:
ch104 = df.ch104.tolist()
But it is mangling my data; the original post included screenshots of the column before .tolist() and after .tolist().
This also happens when I use df.ch104.values.tolist()
Can anyone help me? I haven't used python/pandas in a while and I am just trying to get the data read in first. Thanks!
So, df.ch104.values.tolist() basically turns your column into a 2D N×1 list of lists, but what you want is a flat list of N values. So transpose it before you call .tolist(); lastly, take [0] to convert the resulting 1×N array into a flat list of N values:
df.ch104.values.transpose().tolist()[0]
Might I also suggest you include dropna() to avoid the 'value' must be an instance of str or bytes, not a None error:
df.dropna(subset=['ch104']).ch104.values.transpose().tolist()[0]
The error clearly says there are None or NaN values in your dataframe. You need to check for None and deal with them - replace with a suitable value or delete them.
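For example, a minimal sketch of both options, assuming the df from the question:

df = df.dropna(subset=['ch104'])     # delete rows where ch104 is missing
# or
df['ch104'] = df['ch104'].fillna(0)  # replace missing values with a placeholder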
I am running into a weird inconsistency. I have had to learn the difference between immutable and mutable data types. For my purposes, I need to convert my pandas DataFrame to NumPy, apply operations, and convert it back, as I do not wish to alter my input.
So I am converting as follows:
mix = pd.DataFrame(array, columns=columns)

def mix_to_pmix(mix, p_tank):
    previous = 0
    columns, mix_in = np.array(mix)  # <---
    mix_in *= p_tank
    previous = 0
    for count, val in enumerate(mix_in):
        mix_in[count] = val + previous
        previous += val
    return pd.DataFrame(mix_in, columns=columns)
This works perfectly fine, but the line:
columns, mix_in = np.array(mix)
seems not to behave consistently, as in this case:
def to_molfrac(mix):
    columns, mix_in = np.array(mix)
    shape = mix_in.shape
    for i in range(shape[0]):
        mix_in[i, :] *= 1 / max(mix_in[i, :])
    for k in range(shape[1] - 1, 0, -1):
        mix_in[:, k] += -mix_in[:, k - 1]
    mix_in = mix_in / mix_in.sum(axis=1)[:, np.newaxis]
    return pd.DataFrame(mix_in, columns=columns)
I receive the error:
ValueError: too many values to unpack (expected 2)
The input of the latter function is the output of the previous function. So it should be the same case.
It's impossible to understand the input of to_molfrac and mix_to_pmix without an example.
But pandas objects have a .values attribute which allows you to access the underlying numpy array, so it's probably better to use mix_in = mix.values instead:
columns, values = df.columns, df.values
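To see why the original unpacking is fragile, here is a minimal sketch with toy data: np.array(mix) yields a 2D array, and tuple unpacking splits it along the first axis, so it only succeeds when the frame happens to have exactly two rows.

import numpy as np
import pandas as pd

mix = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=['a', 'b'])
row0, row1 = np.array(mix)  # works only because there are exactly 2 rows

mix3 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], columns=['a', 'b'])
# columns, mix_in = np.array(mix3)  # ValueError: too many values to unpack (expected 2)

columns, mix_in = mix3.columns, mix3.values  # shape-independent and explicit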
I created a numpy matrix called my_stocks with two columns:
column 0 contains specific objects that I defined
column 1 contains numbers
I try to sort it by column 1, but np.sort(my_stocks) raises a ValueError.
The full method:
def sort_stocks(self):
    my_stocks = np.empty(2)
    for data in self.datas:
        if data.buflen() > self.p.period:
            new_stock = [data, get_slope(self, data)]
            my_stocks = np.vstack([my_stocks, new_stock])
    self.stocks = np.sort(my_stocks)
It doesn't cause problems if instead of sorting the matrix, I print it. A print example is below:
[[<backtrader.feeds.yahoo.YahooFinanceData object at 0x124ec2eb0>
  1.1081551020408162]
 [<backtrader.feeds.yahoo.YahooFinanceData object at 0x124ec2190>
  0.20202275819418677]
 [<backtrader.feeds.yahoo.YahooFinanceData object at 0x124eda610>
  0.08357118119975258]
 [<backtrader.feeds.yahoo.YahooFinanceData object at 0x124ecc400>
  0.5487027829313539]]
I figured out that the method was sorting my matrix by row and not by column.
This works even if it looks rather counterintuitive:
my_stocks[my_stocks[:, 1].argsort()]
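A minimal sketch of the pattern with hypothetical stand-in objects: argsort on column 1 returns the row order, and fancy indexing then reorders whole rows.

import numpy as np

# hypothetical (object, slope) rows, dtype=object like the original matrix
my_stocks = np.array([('AAPL', 1.108), ('MSFT', 0.202), ('GOOG', 0.549)], dtype=object)

order = my_stocks[:, 1].argsort()  # indices that sort column 1 ascending
print(my_stocks[order])
# [['MSFT' 0.202]
#  ['GOOG' 0.549]
#  ['AAPL' 1.108]]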
I am using Python with pandas and sklearn and trying to use the new and very convenient sklearn-pandas.
I have a big data frame and need to transform multiple columns in a similar way. I have the column names in the variable other.
The source code documentation here states explicitly that it is possible to transform multiple columns with the same transformation, but the following code does not behave as expected:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([[other[0], other[1]], LabelEncoder()])
mapper.fit_transform(df.copy())
I get the following error:
raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)
When I use the following code, it works great:
cols = [(other[i], LabelEncoder()) for i, col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())
To my understanding, both should work well and yield the same results.
What am I doing wrong here?
Thanks!
The problem you encounter here, is that the two snippets of code are completely different in terms of data structure.
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Do note that you can shorten this line of code to:
cols = [(col, LabelEncoder()) for col in other]
Anyway, the first snippet, [[other[0], other[1]], LabelEncoder()], results in a list containing two elements: a list and a LabelEncoder instance. Now, it is documented that you can transform multiple columns:
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
mapper2 = DataFrameMapper([
    (['children', 'salary'], sklearn.decomposition.PCA(1))
])
This is a list containing tuple(list, object) structured elements, not list[list, object] structured elements.
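Schematically, a sketch using MinMaxScaler as a stand-in multi-column transformer, with the column names taken from the error message:

from sklearn.preprocessing import MinMaxScaler
from sklearn_pandas import DataFrameMapper

# documented structure: a list of (selector, transformer) tuples
mapper = DataFrameMapper([
    (['EFW', 'BPD'], MinMaxScaler()),   # tuple(list, object): one feature definition
])

# the failing snippet instead built list[list, object]:
# DataFrameMapper([['EFW', 'BPD'], LabelEncoder()])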
If we take a look at the source code itself,
class DataFrameMapper(BaseEstimator, TransformerMixin):
    """
    Map Pandas data frame column subsets to their own
    sklearn transformation.
    """

    def __init__(self, features, default=False, sparse=False, df_out=False,
                 input_df=False):
        """
        Params:

        features    a list of tuples with features definitions.
                    The first element is the pandas column selector. This can
                    be a string (for one column) or a list of strings.
                    The second element is an object that supports
                    sklearn's transform interface, or a list of such objects.
                    The third element is optional and, if present, must be
                    a dictionary with the options to apply to the
                    transformation. Example: {'alias': 'day_of_week'}
        """
It is also clearly stated in the class definition that the features argument to DataFrameMapper is required to be a list of tuples, where the elements of the tuple may be lists.
As a last note, as to why you actually get your error message: the LabelEncoder transformer in sklearn is meant for labeling purposes on 1D arrays. As such, it is fundamentally unable to handle two columns at once and will raise an exception. So, if you want to use the LabelEncoder, you will have to build N tuples, each with one column name and the transformer, where N is the number of columns you wish to transform.
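That is exactly what your second, working snippet does. A minimal sketch, assuming other holds the column names:

from sklearn.preprocessing import LabelEncoder
from sklearn_pandas import DataFrameMapper

# one (column, transformer) tuple per column, since LabelEncoder only handles 1D input
mapper = DataFrameMapper([(col, LabelEncoder()) for col in other])
result = mapper.fit_transform(df.copy())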
I would like to create a pandas SparseDataFrame with the dimensions 250,000 x 250,000. In the end my aim is to come up with a big adjacency matrix.
So far it is no problem to create that data frame:
df = SparseDataFrame(columns=arange(250000), index=arange(250000))
But when I try to update the DataFrame, I run into massive memory/runtime problems:
index = 1000
col = 2000
value = 1
df.set_value(index, col, value)
I checked the source:
def set_value(self, index, col, value):
    """
    Put single value at passed column and index

    Parameters
    ----------
    index : row label
    col : column label
    value : scalar value

    Notes
    -----
    This method *always* returns a new object. It is currently not
    particularly efficient (and potentially very expensive) but is provided
    for API compatibility with DataFrame
    """
    ...
That last sentence describes exactly the problem I am having with pandas here. I would really like to keep using pandas, but as it stands this is totally impossible!
Does someone have an idea how to solve this problem more efficiently?
My next idea is to work with something like nested lists/dicts.
Thanks for your help!
Do it this way:
df = pd.SparseDataFrame(columns=np.arange(250000), index=np.arange(250000))

s = df[2000].to_dense()
s[1000] = 1
df[2000] = s

In [11]: df.ix[1000, 2000]
Out[11]: 1.0
So the procedure is to swap out an entire series at a time. The SDF will convert the passed-in series to a SparseSeries. (You can do it yourself to see what they look like with s.to_sparse().) The SparseDataFrame is basically a dict of these SparseSeries, which themselves are immutable. Sparseness will have some changes in 0.12 to better support these types of operations (e.g. setting will work efficiently).
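If you have many values to set in one column, you can batch them into a single swap. A sketch of the same dense/sparse round trip, where updates is a hypothetical dict of row -> value for column 2000 (this assumes the SparseDataFrame API of that era, which was removed in pandas 1.0):

updates = {1000: 1, 1500: 1, 2500: 1}  # hypothetical row -> value pairs

s = df[2000].to_dense()
for row, value in updates.items():
    s[row] = value
df[2000] = s  # converted back to a SparseSeries on assignment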