Create Tabular Dataset in Azure using python sdk - python

So I'm just starting with Azure and I have this problem:
Here is my code:
def getWorkspace(name):
    ws = Workspace.get(
        name=name,
        subscription_id=sid,
        resource_group='my_ressource',
        location='my_location')
    return ws

def uploadDataset(ws, file, separator=','):
    datastore = Datastore.get_default(ws)
    path = DataPath(datastore=datastore, path_on_datastore=file)
    dataset = TabularDatasetFactory.from_delimited_files(path=path, separator=separator)
    #dataset = Dataset.Tabular.from_delimited_files(path=path, separator=separator)
    print(dataset.to_pandas_dataframe().head())
    print(type(dataset))

ws = getWorkspace(workspace_name)
uploadDataset(ws, my_csv, ";")
#result :
   fixed_acidity  volatile_acidity  citric_acid  residual_sugar  chlorides  ...  density    pH  sulphates  alcohol  quality
0            7.5              0.33         0.32            11.1      0.036  ...  0.99620  3.15       0.34     10.5        6
1            6.3              0.27         0.29            12.2      0.044  ...  0.99782  3.14       0.40      8.8        6
2            7.0              0.30         0.51            13.6      0.050  ...  0.99760  3.07       0.52      9.6        7
3            7.4              0.38         0.27             7.5      0.041  ...  0.99535  3.17       0.43     10.0        5
4            8.1              0.12         0.38             0.9      0.034  ...  0.99026  2.80       0.55     12.0        6
[5 rows x 12 columns]
<class 'azureml.data.tabular_dataset.TabularDataset'>
But when I go to Microsoft Azure Machine Learning Studio in datasets, this dataset isn't created.
What am I doing wrong?

First, check the format of the file. For .csv or .tsv files, use the from_delimited_files() method of the TabularDatasetFactory class; for .parquet files, use from_parquet_files(). There is also the register_pandas_dataframe() method, which registers a TabularDataset in the workspace and uploads the data to your underlying storage. Note that from_delimited_files() on its own only creates a dataset object in memory; it is listed under Datasets in the Studio only after it has been registered in the workspace, for example with the dataset's register() method.
Also, if a virtual network or firewall is enabled for the storage, set validate=False in from_delimited_files() to skip the validation step.
Specify the datastore name as below along with Workspace:
datastore_name = 'your datastore name'
workspace = Workspace.from_config() #if we have existing work space.
datastore = Datastore.get(workspace, datastore_name)
Below is the way to create TabularDataSets from 3 file paths.
datastore_paths = [(datastore, 'weather/2018/11.csv'),
                   (datastore, 'weather/2018/12.csv'),
                   (datastore, 'weather/2019/*.csv')]
Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths)
If we want to specify the separator, we can do it as below:
Create_TBDS = Dataset.Tabular.from_delimited_files(path=datastore_paths, separator=',')
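A dataset created with from_delimited_files() shows up in the Studio only once it has been registered in the workspace. The following is a minimal sketch of that step, assuming the azureml-core SDK (v1) is installed. The helper name register_delimited_dataset and the dataset name passed to it are hypothetical, and the import is deferred into the function so the snippet loads even where the SDK is missing.

```python
def register_delimited_dataset(workspace, datastore, path_on_datastore,
                               name, separator=","):
    """Create a TabularDataset from a delimited file and register it.

    Hypothetical helper: `workspace` and `datastore` are the objects obtained
    with Workspace.from_config() and Datastore.get() as shown above.
    """
    # Deferred import: requires the azureml-core package (SDK v1).
    from azureml.core import Dataset

    dataset = Dataset.Tabular.from_delimited_files(
        path=[(datastore, path_on_datastore)],
        separator=separator,
    )
    # register() is the call that makes the dataset visible in the workspace;
    # create_new_version=True registers a new version if the name exists.
    return dataset.register(
        workspace=workspace,
        name=name,
        create_new_version=True,
    )
```

A call could look like register_delimited_dataset(workspace, datastore, 'weather/2018/11.csv', name='weather_2018_11'), where the dataset name is only an illustration.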

Related

Encoding data with LabelEncoder()

I'm having the following dataset as a csv file.
Dataset ecoli.csv:
seq_name,mcg,gvh,lip,chg,aac,alm1,alm2,class
AAT_ECOLI,0.49,0.29,0.48,0.50,0.56,0.24,0.35,cp
ACEA_ECOLI,0.07,0.40,0.48,0.50,0.54,0.35,0.44,cp
(more entries...)
ACKA_ECOLI,0.59,0.49,0.48,0.50,0.52,0.45,0.36,cp
ADI_ECOLI,0.23,0.32,0.48,0.50,0.55,0.25,0.35,cp
My goal is to apply some classification algorithms to this dataset. To prepare the ecoli.csv file, I'm trying to move the class column to the first position and drop the seq_name column. Then I print a check for null values. Afterwards I plot a correlation heatmap with the help of the sns library.
Code before error:
column_drop = 'seq_name'
dataframe = pd.read_csv('ecoli.txt', header='infer')
dataframe.drop(column_drop, axis=1, inplace=True) # Dropping columns that I don't need
print(dataframe.isnull().sum())
plt.figure(figsize=(10,8))
sns.heatmap(dataframe.corr(), annot=True)
plt.show()
Before the encoding (and the error I'm facing), I group the values of the dataset by class. Finally, I try to encode the dataset with LabelEncoder, but an error appears:
Error code:
result = dataframe.groupby(by=("class")).sum().reset_index()
print(result)
le = preprocessing.LabelEncoder()
dataframe.result = le.fit_transform(dataframe.result)
print(result)
Error:
AttributeError: 'DataFrame' object has no attribute 'result'
Update: result is filled with the following index
  class    mcg    gvh    lip   chg    aac   alm1   alm2
0    cp  51.99  58.59  68.64  71.5  64.99  44.71  56.52
1    im  36.84  38.24  37.48  38.5  41.28  58.33  56.24
2   imL   1.45   0.94   2.00   1.5   0.91   1.29   1.14
3   imS   1.48   1.02   0.96   1.0   1.07   1.28   1.14
4   imU  25.41  16.06  17.32  17.5  19.56  26.04  26.18
5    om  13.45  14.20  10.12  10.0  14.78   9.25   6.11
6   omL   3.49   2.56   5.00   2.5   2.71   2.82   1.11
7    pp  33.91  36.39  24.96  26.0  22.71  24.34  19.47
Desired output:
Any thoughts?
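The AttributeError comes from dataframe.result: result is a separate variable, not a column of dataframe, so the attribute lookup fails. A minimal sketch of one way around it, assuming the intent is to encode the class column itself, could look like this. The small frame uses two rows from the sample above plus two invented rows so that more than one class is present.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Two rows from the sample above plus two invented rows with other classes.
dataframe = pd.DataFrame({
    "mcg": [0.49, 0.07, 0.30, 0.60],
    "gvh": [0.29, 0.40, 0.50, 0.45],
    "class": ["cp", "cp", "im", "pp"],
})

le = LabelEncoder()
# Encode the column itself; index it by name, because `class` is a Python
# keyword and cannot be used with attribute access.
dataframe["class"] = le.fit_transform(dataframe["class"])

print(dataframe)
print(list(le.classes_))  # original labels, in the order of their integer codes
```

The fitted encoder keeps the original labels in le.classes_, so le.inverse_transform() can map the integers back later.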

How to solve NaN values error using Lmfit with Python

I'm trying to fit a set of data, produced by an external simulation and stored in a vector, with the Lmfit library.
Below is my code:
import numpy as np
import matplotlib.pyplot as plt
from lmfit import Model
from lmfit import Parameters
def DGauss3Par(x,I1,sigma1,sigma2):
    I2 = 2.63 - I1
    return (I1/np.sqrt(2*np.pi*sigma1))*np.exp(-(x*x)/(2*sigma1*sigma1)) + (I2/np.sqrt(2*np.pi*sigma2))*np.exp(-(x*x)/(2*sigma2*sigma2))
#TAKE DATA
xFull = []
yFull = []
# np.float was removed in NumPy 1.24; the builtin float is equivalent
fileTypex = np.dtype([('xFull', float)])
fileTypey = np.dtype([('yFull', float)])
fDatax = "xValue.dat"
fDatay = "yValue.dat"
xFull = np.loadtxt(fDatax, dtype=fileTypex)
yFull = np.loadtxt(fDatay, dtype=fileTypey)
xGauss = xFull[:]["xFull"]
yGauss = yFull[:]["yFull"]
#MODEL'S DEFINITION
gmodel = Model(DGauss3Par)
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
#PLOTS
plt.plot(xGauss, result3.best_fit, 'y-')
plt.show()
When I run it, I get this error:
File "Overlap.py", line 133, in <module>
result3 = gmodel.fit(yGauss, x=xGauss, params=params)
ValueError: The input contains nan values
These are the values of the data contained in the vector xGauss (related to the x axis):
[-3.88 -3.28 -3.13 -3.08 -3.03 -2.98 -2.93 -2.88 -2.83 -2.78 -2.73 -2.68
-2.63 -2.58 -2.53 -2.48 -2.43 -2.38 -2.33 -2.28 -2.23 -2.18 -2.13 -2.08
-2.03 -1.98 -1.93 -1.88 -1.83 -1.78 -1.73 -1.68 -1.63 -1.58 -1.53 -1.48
-1.43 -1.38 -1.33 -1.28 -1.23 -1.18 -1.13 -1.08 -1.03 -0.98 -0.93 -0.88
-0.83 -0.78 -0.73 -0.68 -0.63 -0.58 -0.53 -0.48 -0.43 -0.38 -0.33 -0.28
-0.23 -0.18 -0.13 -0.08 -0.03 0.03 0.08 0.13 0.18 0.23 0.28 0.33
0.38 0.43 0.48 0.53 0.58 0.63 0.68 0.73 0.78 0.83 0.88 0.93
0.98 1.03 1.08 1.13 1.18 1.23 1.28 1.33 1.38 1.43 1.48 1.53
1.58 1.63 1.68 1.73 1.78 1.83 1.88 1.93 1.98 2.03 2.08 2.13
2.18 2.23 2.28 2.33 2.38 2.43 2.48 2.53 2.58 2.63 2.68 2.73
2.78 2.83 2.88 2.93 2.98 3.03 3.08 3.13 3.28 3.88]
And these are the ones in the vector yGauss (related to the y axis):
[0.00173977 0.00986279 0.01529543 0.0242624 0.0287456 0.03238484
0.03285927 0.03945234 0.04615091 0.05701618 0.0637672 0.07194268
0.07763934 0.08565687 0.09615262 0.1043281 0.11350606 0.1199406
0.1260062 0.14093328 0.15079665 0.16651464 0.18065023 0.1938894
0.2047541 0.21794024 0.22806706 0.23793043 0.25164404 0.2635118
0.28075974 0.29568682 0.30871501 0.3311846 0.34648062 0.36984661
0.38540666 0.40618835 0.4283945 0.45002014 0.48303911 0.50746062
0.53167057 0.5548792 0.57835128 0.60256181 0.62566436 0.65704847
0.68289386 0.71332794 0.73258027 0.769608 0.78769989 0.81407275
0.83358852 0.85210239 0.87109068 0.89456217 0.91618782 0.93760247
0.95680234 0.96919757 0.9783219 0.98486193 0.9931429 0.9931429
0.98486193 0.9783219 0.96919757 0.95680234 0.93760247 0.91618782
0.89456217 0.87109068 0.85210239 0.83358852 0.81407275 0.78769989
0.769608 0.73258027 0.71332794 0.68289386 0.65704847 0.62566436
0.60256181 0.57835128 0.5548792 0.53167057 0.50746062 0.48303911
0.45002014 0.4283945 0.40618835 0.38540666 0.36984661 0.34648062
0.3311846 0.30871501 0.29568682 0.28075974 0.2635118 0.25164404
0.23793043 0.22806706 0.21794024 0.2047541 0.1938894 0.18065023
0.16651464 0.15079665 0.14093328 0.1260062 0.1199406 0.11350606
0.1043281 0.09615262 0.08565687 0.07763934 0.07194268 0.0637672
0.05701618 0.04615091 0.03945234 0.03285927 0.03238484 0.0287456
0.0242624 0.01529543 0.00986279 0.00173977]
I've also tried to print the values returned by my function, to see if there really were some NaN values:
params = Parameters()
params.add('I1', value=1.66)
params.add('sigma1', value=1.04)
params.add('sigma2', value=1.2)
func = DGauss3Par(xGauss,I1,sigma1,sigma2)
print func
but what I obtained is:
[0.04835225 0.06938855 0.07735839 0.08040181 0.08366964 0.08718237
0.09096169 0.09503048 0.0994128 0.10413374 0.10921938 0.11469669
0.12059333 0.12693754 0.13375795 0.14108333 0.14894236 0.15736337
0.16637406 0.17600115 0.18627003 0.19720444 0.20882607 0.22115413
0.23420498 0.24799173 0.26252377 0.27780639 0.29384037 0.3106216
0.32814069 0.34638266 0.3653266 0.38494543 0.40520569 0.42606735
0.44748374 0.46940149 0.49176057 0.51449442 0.5375301 0.56078857
0.58418507 0.60762948 0.63102687 0.65427809 0.6772804 0.69992818
0.72211377 0.74372824 0.76466232 0.78480729 0.80405595 0.82230355
0.83944875 0.85539458 0.87004937 0.88332762 0.89515085 0.90544838
0.91415806 0.92122688 0.92661155 0.93027889 0.93220625 0.93220625
0.93027889 0.92661155 0.92122688 0.91415806 0.90544838 0.89515085
0.88332762 0.87004937 0.85539458 0.83944875 0.82230355 0.80405595
0.78480729 0.76466232 0.74372824 0.72211377 0.69992818 0.6772804
0.65427809 0.63102687 0.60762948 0.58418507 0.56078857 0.5375301
0.51449442 0.49176057 0.46940149 0.44748374 0.42606735 0.40520569
0.38494543 0.3653266 0.34638266 0.32814069 0.3106216 0.29384037
0.27780639 0.26252377 0.24799173 0.23420498 0.22115413 0.20882607
0.19720444 0.18627003 0.17600115 0.16637406 0.15736337 0.14894236
0.14108333 0.13375795 0.12693754 0.12059333 0.11469669 0.10921938
0.10413374 0.0994128 0.09503048 0.09096169 0.08718237 0.08366964
0.08040181 0.07735839 0.06938855 0.04835225]
So it doesn't seem that there are any NaN values, and I don't understand why it returns that error.
Could anyone help me, please? Thanks!
If you add a print call to your fit function that prints sigma1 and sigma2, you'll find that DGauss3Par has already been evaluated a few times before the error occurs.
Both sigma variables have a negative value at the time the error occurs.
Taking the square root of a negative value causes, of course, a NaN.
You should add a min bound or similar to your sigma1 and sigma2 parameters to prevent this. Using min=0.0 as an additional argument to params.add(...) will result in a good fit.
Be aware that for some analyses, setting explicit bounds to your fitting parameters may make these analyses invalid. For most cases, you'll be fine, but for some cases, you'll need to check whether the fitting parameters should be allowed to vary from negative infinity to positive infinity, or are allowed to be bounded.
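The effect described above can be reproduced without running a fit. The sketch below evaluates the model from the question with a negative sigma1 and shows that the output is NaN, then builds the parameter set with the min=0.0 bound suggested above. The helper name bounded_params is hypothetical, and the lmfit import is deferred so the NumPy part runs even where lmfit is not installed.

```python
import numpy as np

def DGauss3Par(x, I1, sigma1, sigma2):
    # Same model as in the question.
    I2 = 2.63 - I1
    return ((I1 / np.sqrt(2 * np.pi * sigma1)) * np.exp(-(x * x) / (2 * sigma1 * sigma1))
            + (I2 / np.sqrt(2 * np.pi * sigma2)) * np.exp(-(x * x) / (2 * sigma2 * sigma2)))

x = np.linspace(-3.0, 3.0, 7)

# A negative sigma puts a negative number under the square root, giving NaN.
with np.errstate(invalid="ignore"):
    bad = DGauss3Par(x, 1.66, -1.04, 1.2)
good = DGauss3Par(x, 1.66, 1.04, 1.2)
print(np.isnan(bad).all(), np.isfinite(good).all())

def bounded_params():
    """Starting values from the question with a lower bound on both sigmas."""
    from lmfit import Parameters  # deferred: requires lmfit
    params = Parameters()
    params.add('I1', value=1.66)
    params.add('sigma1', value=1.04, min=0.0)
    params.add('sigma2', value=1.2, min=0.0)
    return params
```

The result of bounded_params() would be passed as params= to gmodel.fit() in place of the unbounded parameters.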

Error building a function to calculate standard deviation

I am new to Python and I am trying to build a function to run some statistics on a data set. The data is in an Excel format and it contains 7 rows, with the first row... I know what a function is and how it should be built; nevertheless, I can't figure out how to build this function.
This is the function:
def st_dev(benchmark, factor):
    benchmark = mkt_ret
    factor = smb
    statistics = st.stdev(benchmark, factor)
    return statistics

print(st_dev)
And this is the result:
Mkt-RF SMB HML RMW CMA RF
196307 -0.39 -0.46 -0.81 0.72 -1.16 0.27
196308 5.07 -0.81 1.65 0.42 -0.4 0.25
196309 -1.57 -0.48 0.19 -0.8 0.23 0.27
196310 2.53 -1.29 -0.09 2.75 -2.26 0.29
196311 -0.85 -0.85 1.71 -0.34 2.22 0.27
4.38
<function st_dev at 0x0000000002D92F28>
Process finished with exit code 0
the full code can be viewed here.
I tried several versions to write the function, some error messages told me that I cannot convert 'Series' to numerator/denominator.
I am running python 3.7
Thank you for your help.
Alex
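Two things stand out in the function above, assuming st is the standard-library statistics module: print(st_dev) prints the function object because the function is never called, and statistics.stdev() takes a single data sequence (its optional second argument is the mean, not a second series). A minimal sketch that computes the standard deviation of one series at a time, using the first five Mkt-RF and SMB values shown above, might look like this:

```python
import statistics as st

def st_dev(series):
    """Sample standard deviation of one series of returns."""
    return st.stdev(series)

# First five Mkt-RF and SMB values from the output above.
mkt_ret = [-0.39, 5.07, -1.57, 2.53, -0.85]
smb = [-0.46, -0.81, -0.48, -1.29, -0.85]

# Call the function (note the parentheses) once per series.
print(st_dev(mkt_ret))
print(st_dev(smb))
```

If the columns are pandas Series, Series.std() computes the same sample standard deviation directly.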

Convert elements in masked astropy Table to np.nan

Consider the simple process of reading a data file with some non-valid entries. This is my test.dat file:
16 1035.22 1041.09 24.54 0.30 1.39 0.30 1.80 0.30 2.26 0.30 1.14 0.30 0.28 0.30 0.2884
127 824.57 1105.52 25.02 0.29 0.87 0.29 1.30 0.29 2.12 0.29 0.66 0.29 0.10 0.29 0.2986
182 1015.83 904.93 INDEF 0.28 1.80 0.28 1.64 0.28 2.38 0.28 1.04 0.28 0.06 0.28 0.3271
185 1019.15 1155.09 24.31 0.28 1.40 0.28 1.78 0.28 2.10 0.28 0.87 0.28 0.35 0.28 0.3290
192 1024.80 1045.57 24.27 0.27 1.24 0.27 2.01 0.27 2.40 0.27 0.90 0.27 0.09 0.27 0.3328
197 1035.99 876.04 24.10 0.27 1.23 0.27 1.52 0.27 2.59 0.27 0.45 0.27 0.25 0.27 0.3357
198 1110.80 1087.97 24.53 0.27 1.49 0.27 1.71 0.27 2.33 0.27 0.22 0.27 0.00 0.27 0.3362
1103 1168.39 1065.97 24.35 0.27 1.28 0.27 1.29 0.27 2.68 0.27 0.43 0.27 0.26 0.27 0.3388
And this is the code to read it, and replace the "bad" values (INDEF) with a float (99.999)
import numpy as np
from astropy.io import ascii
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = data.filled(99.999)
This works just fine, but if I instead try to replace the bad values with a np.nan (i.e., I use the line data = data.filled(np.nan)) I get:
ValueError: cannot convert float NaN to integer
Why is this, and how can I get around it?
As noted the issue is that the numpy MaskedArray.filled() method seems to try converting the fill value to the appropriate type before checking if there is actually anything to fill. Since the table in the example has an int column, this fails within numpy (and astropy.Table is just calling the filled() method on each column).
This should work:
In [44]: def fill_cols(tbl, fill=np.nan, kind='f'):
    ...:     """
    ...:     In-place fill of ``tbl`` columns which have dtype ``kind``
    ...:     with ``fill`` value.
    ...:     """
    ...:     for col in tbl.itercols():
    ...:         if col.dtype.kind == kind:
    ...:             col[...] = col.filled(fill)
    ...:
In [45]: t = simple_table(masked=True)
In [46]: t
Out[46]:
<Table masked=True length=3>
  a      b     c
int64 float64 str1
----- ------- ----
   --     1.0    c
    2     2.0   --
    3      --    e
In [47]: fill_cols(t)
In [48]: t
Out[48]:
<Table masked=True length=3>
  a      b     c
int64 float64 str1
----- ------- ----
   --     1.0    c
    2     2.0   --
    3     nan    e
I don't think it's primarily a numpy problem, as it works with individual columns:
>>> data['col4'].filled(np.nan)
<Column name='col4' dtype='float64' length=8>
24.54
25.02
nan
24.31
24.27
24.1
24.53
24.35
but you still can't construct a Table from this -
Table([data[n].filled(np.nan) for n in data.colnames])
raises the same error in np.ma.core.
You can explicitly set
data['col4'] = data['col4'].filled(np.nan)
but this apparently lets the table lose its .filled() method...
I am not that familiar with masked arrays and tables, but as you've already filed a related issue on Github, you might want to add this problem.
This is happening fairly deep in numpy, in numpy.ma.filled. fill values have to be scalars, basically.
A messy solution that fills with nan's and still returns a table could look like:
import numpy as np
from astropy.io import ascii
from astropy.table import Table
def fill_with_nan(t):
    arr = t.as_array()
    arr_list = arr.tolist()
    arr = np.array(arr_list)
    arr[np.equal(arr, None)] = np.nan
    arr = np.array(arr.tolist())
    return Table(arr)
data = ascii.read("test.dat", fill_values=[('INDEF', '0')])
data = fill_with_nan(data)
Cut out the middleman? Passing fill_values=[('INDEF', np.nan)] to ascii.read() seems to work.
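The integer-column failure discussed in this thread can be reproduced with plain NumPy masked arrays, without astropy. The sketch below is only an illustration under that assumption: the helper fill_with_nan is hypothetical, the values are IDs from the first column of test.dat, and the mask is invented. It casts integer data to float before filling, since NaN has no integer representation.

```python
import numpy as np

def fill_with_nan(masked):
    """Return a plain array with masked entries replaced by NaN.

    Integer and boolean data is cast to float first, because NaN
    cannot be stored in those dtypes.
    """
    if masked.dtype.kind in "iub":
        masked = masked.astype(float)
    return masked.filled(np.nan)

# IDs from the first column of test.dat; the mask is made up for the demo.
ids = np.ma.masked_array([16, 127, 182], mask=[False, True, False])

try:
    ids.filled(np.nan)  # fails: NaN cannot be converted to an integer
except (TypeError, ValueError) as exc:
    print("direct fill failed:", exc)

print(fill_with_nan(ids))
```

The exception type raised by the direct fill differs between NumPy versions, which is why both TypeError and ValueError are caught here.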

Using Pandas in Python to Join Multiple Files Based on Date

I have csv files that I need to join together based upon date but the dates in each file are not the same (i.e. some files start on 1/1/1991 and other in 1998). I have a basic start to the code (see below) but I am not sure where to go from here. Any tips are appreciated. Below please find a sample of the different csv I am trying to join.
import os, pandas as pd, glob
directory = r'C:\data\Monthly_Data'
files = os.listdir(directory)
print(files)
all_data =pd.DataFrame()
for f in glob.glob(directory):
    df=pd.read_csv(f)
    all_data=all_data.append(df,ignore_index=True)
all_data.describe()
File 1
DateTime F1_cfs F2_cfs F3_cfs F4_cfs F5_cfs F6_cfs F7_cfs
3/31/1991 0.860702028 1.167239264 0 0 0 0 0
4/30/1991 2.116930556 2.463493056 3.316688418
5/31/1991 4.056572581 4.544307796 5.562668011
6/30/1991 1.587513889 2.348215278 2.611659722
7/31/1991 0.55328629 1.089637097 1.132043011
8/31/1991 0.29702957 0.54186828 0.585073925 2.624375
9/30/1991 0.237083333 0.323902778 0.362583333 0.925563094 1.157786606 2.68722973 2.104090278
File 2
DateTime F1_mg-P_L F2_mg-P_L F3_mg-P_L F4_mg-P_L F5_mg-P_L F6_mg-P_L F7_mg-P_L
6/1/1992 0.05 0.05 0.06 0.04 0.03 0.18 0.08
7/1/1992 0.03 0.05 0.04 0.03 0.04 0.05 0.09
8/1/1992 0.02 0.03 0.02 0.02 0.02 0.02 0.02
File 3
DateTime F1_TSS_mgL F1_TVS_mgL F2_TSS_mgL F2_TVS_mgL F3_TSS_mgL F3_TVS_mgL F4_TSS_mgL F4_TVS_mgL F5_TSS_mgL F5_TVS_mgL F6_TSS_mgL F6_TVS_mgL F7_TSS_mgL F7_TVS_mgL
4/30/1991 10 7.285714286 8.5 6.083333333 3.7 3.1
5/31/1991 5.042553191 3.723404255 6.8 6.3 3.769230769 2.980769231
6/30/1991 5 5 1 1
7/31/1991
8/31/1991
9/30/1991 5.75 3.75 6.75 4.75 9.666666667 6.333333333 8.666666667 5 12 7.666666667 8 5.5 9 6.75
10/31/1991 14.33333333 9 14 10.66666667 16.25 11 12.75 9.25 10.25 7.25 29.33333333 18.33333333 13.66666667 9
11/30/1991 2.2 1.933333333 2 1.88 0 0 4.208333333 3.708333333 10.15151515 7.909090909 9.5 6.785714286 4.612903226 3.580645161
You didn't read the CSV files correctly.
1) You can comment out the following lines, because you never use them later in your code.
files = os.listdir(directory)
print(files)
2) glob.glob(directory) didn't return any matching files. glob.glob() takes a pattern as its argument, for example 'C:\data\Monthly_Data\File*.csv'. You passed the directory itself as the pattern, so no CSV files were found:
for f in glob.glob(directory):
After I modified these two parts and printed all_data, the file contents were displayed on my console.
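Once the glob pattern is fixed, the files still have to be aligned on their dates rather than stacked. The following is a rough sketch of one way to do that, assuming every file has a DateTime column as in the samples above. The function name join_csvs_on_date is hypothetical, and the demo writes two throwaway files with a few values taken from File 1 and File 3.

```python
import glob
import os
import tempfile

import pandas as pd

def join_csvs_on_date(pattern, date_col="DateTime"):
    """Outer-join every CSV matching `pattern` on its date column."""
    frames = []
    for path in sorted(glob.glob(pattern)):
        df = pd.read_csv(path, parse_dates=[date_col])
        frames.append(df.set_index(date_col))
    # axis=1 places the files side by side; join="outer" keeps every date
    # and leaves NaN where a file has no row for that date.
    return pd.concat(frames, axis=1, join="outer").sort_index()

# Tiny demo with two throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "File1.csv"), "w") as fh:
        fh.write("DateTime,F1_cfs\n3/31/1991,0.860702028\n4/30/1991,2.116930556\n")
    with open(os.path.join(tmp, "File3.csv"), "w") as fh:
        fh.write("DateTime,F1_TSS_mgL\n4/30/1991,10\n5/31/1991,5.042553191\n")
    joined = join_csvs_on_date(os.path.join(tmp, "File*.csv"))

print(joined)
```

An outer join keeps dates that appear in only one file. Since File 2 uses first-of-month dates while the others use month-end dates, exact dates will not line up across all files; converting each index to a monthly period before joining is one possible way to handle that.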
