The data in the dataset consists purely of chars. For example:
p,x,s,n,t,p,f,c,n,k,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,y,t,a,f,c,b,k,e,c,s,s,w,w,p,w,o,p,n,n,g
e,b,s,w,t,l,f,c,b,n,e,c,s,s,w,w,p,w,o,p,n,n,m
p,x,y,w,t,p,f,c,n,n,e,e,s,s,w,w,p,w,o,p,k,s,u
e,x,s,g,f,n,f,w,b,k,t,e,s,s,w,w,p,w,o,e,n,a,g
A complete copy of the data can be found in agaricus-lepiota.data in the UCI Machine Learning Repository's Mushroom dataset.
Are there ways to visualise char data with matplotlib (instead of having to convert the dataset to numeric)? Any sort of visualization would do, e.g.:
import pandas as pd

filename = 'mushrooms.csv'
df_mushrooms = pd.read_csv(filename, names=["Classes", "Cap-Shape", "Cap-Surface", "Cap-Colour", "Bruises", "Odor", "Gill-Attachment", "Gill-Spacing", "Gill-Size", "Gill-Colour", "Stalk-Shape", "Stalk-Root", "Stalk-Surface-Above-Ring", "Stalk-Surface-Below-Ring", "Stalk-Colour-Above-Ring", "Stalk-Colour-Below-Ring", "Veil-Type", "Veil-Colour", "Ring-Number", "Ring-Type", "Spore-Print-Colour", "Population", "Habitat"])
# If any entries (rows) have missing values/NaNs, drop the row.
df_mushrooms.dropna(axis=0, how='any', inplace=True)
df_mushrooms.plot.scatter(x='Classes', y='Cap-Shape')
It is possible to do this, but the result doesn't really make much sense graphically. I know I shouldn't tread into the territory of telling someone how to present their graphs, but a plot like that doesn't convey any information to me. The issue is that using the Classes and Cap-Shape fields as your x and y indices will always put the same letter in the same place; there is no variability. Perhaps there is some other field you could use as the index, with Cap-Shape as your marker, but as it stands this doesn't add any value. Again, that is just me personally.
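If a count-based view is acceptable, one alternative I'd suggest (my suggestion, not from the question itself) is tabulating how often each pair of categories occurs with pandas' crosstab and plotting the counts; a minimal sketch, assuming df_mushrooms from above:

import matplotlib.pyplot as plt

# Tabulate how often each (Classes, Cap-Shape) pair occurs,
# then show the counts as a stacked bar chart.
counts = pd.crosstab(df_mushrooms['Classes'], df_mushrooms['Cap-Shape'])
counts.plot.bar(stacked=True)
plt.ylabel('Count')
plt.show()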
To use a string as a marker you can use the "$...$" marker described in matplotlib.markers, but again I must add the caveat that graphing like this is much slower than the traditional method, since you must iterate over the rows of your dataframe.
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Classes only has 'p' and 'e' as unique values, so map them to 1 and 2 on the x axis
df_mushrooms['Class_Id'] = df_mushrooms['Classes'].map(lambda x: 1 if x == 'p' else 2)
# map each cap-shape letter to its position in the alphabet (a=1 ... z=26)
df_mushrooms['Cap_Val'] = df_mushrooms['Cap-Shape'].map(lambda x: ord(x) - 96)
for idx, row in df_mushrooms.iterrows():
    ax.scatter(x=row['Class_Id'], y=row['Cap_Val'],
               marker=r"$ {} $".format(row['Cap-Shape']),
               color=plt.cm.nipy_spectral(row['Cap_Val'] / 26))
ax.set_xticks([0, 1, 2, 3])
ax.set_xticklabels(['', 'p', 'e', ''])
ax.set_yticklabels(['', 'e', 'j', 'o', 't', 'y'])
fig.show()
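If the per-row loop becomes too slow, a sketch of one workaround (again my suggestion, using the same Class_Id/Cap_Val columns as above) is to issue one scatter call per unique letter rather than one per row:

# One scatter call per unique cap-shape letter instead of one per row;
# all rows sharing a letter also share a marker and a colour.
for letter, group in df_mushrooms.groupby('Cap-Shape'):
    ax.scatter(group['Class_Id'], group['Cap_Val'],
               marker=r"$ {} $".format(letter),
               color=plt.cm.nipy_spectral((ord(letter) - 96) / 26))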
Related
I've created a dataset using the hdf5cpp library with a fixed-size string (a requirement). However, when loading with PyTables or pandas the strings are always represented like:
b'test\x00\xff\xff\xff\xff\xff\xff\xff\xff\xff
That is, the string value 'test' with the padding after it. Does anyone know a way to suppress or hide this padding data? I really just want 'test' shown. I realise this may be correct behaviour.
My hdf5cpp setup for strings:
strType = H5Tcopy(H5T_C_S1);
status = H5Tset_size(strType, 36);
H5Tset_strpad(strType, H5T_STR_NULLTERM);
I can't help with your C code. It is possible to work with padded strings in PyTables: I can read data written by a C application that creates a struct array of mixed types, including padded strings. (Note: there was an issue related to copying a NumPy struct array with padding. It was fixed in 3.5.0; read PyTables GitHub Pull 720 for details.)
Here is an example that shows proper string handling with a file created by PyTables. Maybe it will help you investigate your problem. Checking the dataset's properties would be a good start.
import tables as tb
import numpy as np

arr = np.empty(10, 'S10')
arr[0] = 'test'
arr[1] = 'one'
arr[2] = 'two'
arr[3] = 'three'

with tb.File('SO_63184571.h5', 'w') as h5f:
    ds = h5f.create_array('/', 'testdata', obj=arr)
    print(ds.atom)
    for i in range(4):
        print(ds[i])
        print(ds[i].decode('utf-8'))
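If padding bytes do show up in what you read back (as in the b'test\x00\xff...' output above), a minimal sketch for stripping them on the Python side, assuming the payload ends at the first NUL byte:

raw = b'test\x00\xff\xff\xff'   # illustrative padded value
clean = raw.split(b'\x00', 1)[0].decode('utf-8')
print(clean)                    # -> test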
The example below was added to demonstrate a compound dataset with an int and a fixed string. This is called a Table in PyTables (Arrays always contain homogeneous values). This can be done a number of ways; I show the two methods I prefer:
1. Create a record array and reference it with the description= or obj= parameter. This is useful when you already have all of your data AND it will fit in memory.
2. Create a record array dtype and reference it with the description= parameter, then add the data with the .append() method. This is useful when all of your data will NOT fit in memory, OR you need to add data to an existing table.
Code below:
recarr_dtype = np.dtype(
    {'names': ['ints', 'strs'],
     'formats': [int, 'S10']})

a = np.arange(5)
b = np.array(['a', 'b', 'c', 'd', 'e'])
recarr = np.rec.fromarrays((a, b), dtype=recarr_dtype)

with tb.File('SO_63184571.h5', 'w') as h5f:
    # Method 1: build the table directly from the record array
    ds1 = h5f.create_table('/', 'compound_data1', description=recarr)
    for i in range(5):
        print(ds1[i]['ints'], ds1[i]['strs'].decode('utf-8'))

    # Method 2: build an empty table from the dtype, then append the data
    ds2 = h5f.create_table('/', 'compound_data2', description=recarr_dtype)
    ds2.append(recarr)
    for i in range(5):
        print(ds2[i]['ints'], ds2[i]['strs'].decode('utf-8'))
Say I have multiple lists:
names1 = [name11, name12, etc]
names2 = [name21, name22, etc]
names3 = [name31, name32, etc]
How do I create a for loop that combines the components of the lists in order ('name11name21name31', 'name11name21name32' and so on)?
I want to use this to name columns as I add them to a data frame. I tried this:
Results['{}' .format(model_names[j]) + '{}' .format(Data_names[i])] = proba.tolist()
I am trying to take some results that I obtain as an array and introduce them one by one into a data frame, naming the columns as I go. It is for a machine learning model I am trying to build.
This is the whole code; I am sure it is messy because I am a beginner.
Train = [X_train_F, X_train_M, X_train_R, X_train_SM]
Test = [X_test_F, X_test_M, X_test_R, X_test_SM]
models_to_run = [knn, svc, forest, dtc]
model_names = ['knn', 'svc', 'forest', 'dtc']
Data_names = ['F', 'M', 'R', 'SM']
Results = pd.DataFrame()

for T, t in zip(Train, Test):
    for j, model in enumerate(models_to_run):
        model.fit(T, y_train.values.ravel())
        proba = model.predict_proba(t)
        proba = pd.DataFrame(proba.max(axis=1))
        proba = proba.to_numpy()
        proba = proba.flatten()
        Results['{}'.format(model_names[j]) + '{}'.format(Data_names[i])] = proba.tolist()
I don't know how to integrate 'i' into the loop so I can use it to go through the list Data_names and add the data name to the column name. I am sure there is a cleaner way to do this. Please be gentle.
Edit: It currently gives me a data frame with 4 columns instead of the 16 it should have, and it just adds the whole Data_names list to the column name.
How about:
Results = {}
for T, t, dname in zip(Train, Test, Data_names):
    for mname, model in zip(model_names, models_to_run):
        ...
        Results[(dname, mname)] = proba.tolist()

Results = pd.DataFrame(list(Results.values()), index=Results.keys()).T
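For completeness, a minimal self-contained sketch of this pattern with stand-in data (the probability values are illustrative only):

import pandas as pd

Data_names = ['F', 'M', 'R', 'SM']
model_names = ['knn', 'svc', 'forest', 'dtc']

Results = {}
for dname in Data_names:
    for mname in model_names:
        # stand-in for the real flattened predict_proba output
        Results[(dname, mname)] = [0.1, 0.2, 0.3]

Results = pd.DataFrame(list(Results.values()), index=list(Results.keys())).T
print(Results.shape)  # (3, 16): one column per (data, model) pair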
I am pretty new to coding so this may be simple, but none of the answers I've found so far have provided information in a way I can understand.
I'd like to take a column of data and apply a function, a * e^(b*x), where a > 0 and b < 0; x in this case is the float value in each row of my data.
Here's what I have so far, but I'm not sure where to go from here:
def plot_data():
    # read the file
    data = pd.read_excel(FILENAME)
    # convert to pandas dataframe
    df = pd.DataFrame(data, columns=['FP Signal'])
    # add a blank column to store the normalized data
    headers = ['FP Signal', 'Normalized']
    df = df.reindex(columns=headers)
    df.plot(subplots=True, layout=(1, 2))
    df['Normalized'] = df.apply(normalize(['FP Signal']), axis=1)
    print(df['Normalized'])
    # show the plot
    plt.show()

# normalization formula (exponential) = a * e^(b*x) where a > 0, b < 0
def normalize(x):
    x = A * E ** (B * x)
    return x
I can get this image to show, but not the 'normalized' data...
thanks for any help!
Your code is almost correct.
# normalization formula (exponential) = a * e^(b*x) where a > 0, b < 0
# A, B and E are assumed to be constants defined elsewhere (e.g. E could be np.e)
def normalize(x):
    x = A * E ** (B * x)
    return x

def plot_data():
    # read the file
    data = pd.read_excel(FILENAME)
    # convert to pandas dataframe
    df = pd.DataFrame(data, columns=['FP Signal'])
    # add a blank column to store the normalized data
    headers = ['FP Signal', 'Normalized']
    df = df.reindex(columns=headers)
    df['Normalized'] = df['FP Signal'].apply(lambda x: normalize(x))
    print(df['Normalized'])
    df.plot(subplots=True, layout=(1, 2))
    # show the plot
    plt.show()
I changed the apply call to the following: df['FP Signal'].apply(lambda x: normalize(x)).
It takes only the value of df['FP Signal'] because you don't need the entire row. lambda x assigns the current value to x, which we then send to normalize.
You can also write df['FP Signal'].apply(normalize), which is more direct and simpler. Using lambda is just my personal preference, but many may disagree.
One small addition is to put df.plot(subplots=True, layout=(1, 2)) after you change the dataframe. If you plot before changing the dataframe, you won't see any change in the plot. df.plot actually does the plotting; plt.show just displays it. That's why df.plot must come after you are done processing your data.
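As a side note, the per-row apply can be avoided entirely by vectorizing the formula with NumPy; a minimal sketch, with A and B as illustrative constants:

import numpy as np

A, B = 1.0, -0.5  # illustrative constants; a > 0, b < 0 per the formula

# apply a * e^(b*x) to the whole column at once
df['Normalized'] = A * np.exp(B * df['FP Signal'])

Computing the whole column in one call is typically much faster than a row-by-row apply.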
You can use map to apply a function to a field; see pandas.Series.map:

s = pd.Series(['cat', 'dog', 'rabbit'])
s.map(lambda x: x.upper())
0       CAT
1       DOG
2    RABBIT
dtype: object
I'm new to any kind of programming, as you can tell by this 'beautiful' piece of hard coding. With sweat and tears (not so bad, just a little), I've created a very sequential piece of code, and that's actually my problem. My goal is to create a somewhat-automated script, probably including a for loop, which I've tried unsuccessfully.
The main aim is to create a randomization loop which takes the original dataset (see the 'dataset' screenshot) and picks rows from it randomly, one by one, saving each pick to another Excel list. The point is that the values in the columns position01 and position02 of the selected row should never match the previous pick in either of those two columns. That should eventually produce an Excel sheet of randomized rows in which each row shares no position01/position02 values with the row before it: row 2 should not include any of those values from row 1, row 3 should not contain the values of row 2, and so on. It should also iterate over the range of the list length, which is 0-11. The Excel output is also important since I need the rest of the columns; I just need to shuffle the row order.
I hope my aim and description are clear enough; if not, I'm happy to answer any questions. I would appreciate any hint or help that gets me 'unstuck'. Thank you. Code below. (PS: I'm aware there is probably a much neater solution than this.)
import pandas as pd
import random

dataset = pd.read_excel("C:\\Users\\ibm\\Documents\\Psychopy\\DataInput_Training01.xlsx")

# original data set used for comparisons
imageDataset = dataset.loc[0:11, :]

# creating an empty df for storing rows from imageDataset
emptyExcel = pd.DataFrame()

randomPick = imageDataset.sample()           # select one row randomly from imageDataset
emptyExcel = emptyExcel.append(randomPick)   # append the row to the empty df
randomPickIndex = randomPick.index.tolist()  # get the index of the row
imageDataset2 = imageDataset.drop(index=randomPickIndex)  # delete the row selected above

# getting the raw values from the row; 'position01'/'position02' are column headers
randomPickTemp1 = randomPick['position01'].values[0]
randomPickTemp2 = randomPick['position02'].values[0]

# getting a dataset that does not include the row values from position01 and position02
isit = imageDataset2[(imageDataset2.position01 != randomPickTemp1) &
                     (imageDataset2.position02 != randomPickTemp1) &
                     (imageDataset2.position01 != randomPickTemp2) &
                     (imageDataset2.position02 != randomPickTemp2)]

# pick another row from the dataset, excluding the row selected at the beginning (randomPick)
randomPick2 = isit.sample()
# save it in the empty df
emptyExcel = emptyExcel.append(randomPick2, sort=False)
# get the index of this second row to delete it in the next step
randomPick2Index = randomPick2.index.tolist()
# delete that row as well
imageDataset3 = imageDataset2.drop(index=randomPick2Index)

# AND REPEAT the comparison of the raw values against the dataset that no longer includes the original row:
randomPickTemp1 = randomPick2['position01'].values[0]
randomPickTemp2 = randomPick2['position02'].values[0]
isit2 = imageDataset3[(imageDataset3.position01 != randomPickTemp1) &
                      (imageDataset3.position02 != randomPickTemp1) &
                      (imageDataset3.position01 != randomPickTemp2) &
                      (imageDataset3.position02 != randomPickTemp2)]

# AND REPEAT with another pick - save - match - pick again... until the end of the dataset (rows 0-11)
In the end I used a solution provided by David Bridges (post from Sep 19 2019) on the PsychoPy forums. In case anyone is interested, here is the link: https://discourse.psychopy.org/t/how-do-i-make-selective-no-consecutive-trials/9186
I just adjusted the condition in the for loop to my case, like this:
remaining = [choices[x] for x in choices
             if last['position01'] != choices[x]['position01']
             and last['position01'] != choices[x]['position02']
             and last['position02'] != choices[x]['position01']
             and last['position02'] != choices[x]['position02']]
Thank you very much for the helpful answer! And hopefully I did not spam this thread too much.
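For readers without the linked thread, here is a minimal sketch of how that condition can sit inside a full selection loop; the trial data and the 'item' key are illustrative, not from the original:

import random

# illustrative trials; only position01/position02 matter for the constraint
choices = [{'position01': a, 'position02': b, 'item': i}
           for i, (a, b) in enumerate([(0, 1), (2, 3), (1, 2), (3, 0)])]

order = [choices.pop(random.randrange(len(choices)))]
while choices:
    last = order[-1]
    remaining = [c for c in choices
                 if last['position01'] not in (c['position01'], c['position02'])
                 and last['position02'] not in (c['position01'], c['position02'])]
    if not remaining:
        break  # dead end; in practice you would restart the shuffle
    pick = random.choice(remaining)
    order.append(pick)
    choices.remove(pick)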
import itertools as it
import random
import pandas as pd

# list of pairs of numbers
tmp1 = [x for x in it.permutations(range(6), 2)]
df = pd.DataFrame(tmp1, columns=["position01", "position02"])
df1 = pd.DataFrame()

# seed the output with one random row
i = random.choice(df.index)
df1 = df1.append(df.loc[i], ignore_index=True)  # note: pandas >= 2.0 removed .append; use pd.concat there
df = df.drop(index=i)

while not df.empty:
    val = list(df1.iloc[-1])
    # keep only rows sharing no position value with the last pick
    tmp = df[(df["position01"] != val[0]) & (df["position01"] != val[1]) &
             (df["position02"] != val[0]) & (df["position02"] != val[1])]
    if tmp.empty:  # looped 10,000 times; this was never empty
        print("here")
        break
    i = random.choice(tmp.index)
    df1 = df1.append(df.loc[i], ignore_index=True)
    df = df.drop(index=i)
I want to know how I should index/access some data programmatically in Python.
I have columnar data: depth, temperature, gradient, gamma, for a set of boreholes. There are n boreholes. I have a header, which lists the borehole name and numeric ID. Example:
Bore_name,Bore_ID,,,Bore_name,Bore_ID,,,, ...
<a row of headers>
depth,temp,gradient,gamma,depth,temp,gradient,gamma ...
I don't know how to index the data, apart from crude iteration:
with open(filename, 'rU') as f:
    bores = f.readline().rstrip().split(',')
    headers = f.readline().rstrip().split(',')

# load from CSV file; missing values are empty 'cells'
tdata = numpy.genfromtxt(filename, skip_header=2, delimiter=',',
                         missing_values='', filling_values=numpy.nan)

for column in range(0, numpy.shape(tdata)[1], 4):
    # plots temperature on x, depth on y
    pl.plot(tdata[:, column + 1], tdata[:, column], label=bores[column])
    # get index at max depth
    depth = numpy.nanargmin(tdata[:, column])
    # plot text label at max depth (y) and temp at that depth (x)
    pl.text(tdata[depth, column + 1], tdata[depth, column], bores[column])
It seems easy enough this way, but I've been using R recently and have got a bit used to its way of referencing data objects via classes and subclasses interpreted from headers.
Well, if you like R's data.table, there have been a few (at least) attempts to re-create that functionality in NumPy, through additional classes in NumPy core and through external Python libraries. The effort I find most promising is the datarray library by Fernando Perez. Here's how it works.
>>> # create a NumPy array for use as our data set
>>> import numpy as NP
>>> D = NP.random.randint(0, 10, 40).reshape(8, 5)
>>> # create some generic row and column names to pass to the constructor
>>> row_ids = ["row{0}".format(c) for c in range(D.shape[0])]
>>> rows = 'rows', row_ids
>>> variables = ["col{0}".format(c) for c in range(D.shape[1])]
>>> cols = 'cols', variables
Instantiate the DataArray by calling the constructor and passing in an ordinary NumPy array plus a list of tuples, one tuple per axis. Since ndim = 2 here, there are two tuples in the list; each tuple consists of an axis label (str) and a sequence of labels for that axis (list).
>>> from datarray.datarray import DataArray as DA
>>> D1 = DA(D, [rows, cols])
>>> D1.axes
(Axis(name='rows', index=0, labels=['row0', 'row1', 'row2', 'row3',
'row4', 'row5', 'row6', 'row7']), Axis(name='cols', index=1,
labels=['col0', 'col1', 'col2', 'col3', 'col4']))
>>> # now you can use R-like syntax to reference a NumPy data array by column:
>>> D1[:,'col1']
DataArray([8, 5, 0, 7, 8, 9, 9, 4])
('rows',)
You could put your data into a dict of dicts, one per borehole, keyed by the borehole ID, with the headers as the inner keys. Roughly like this:
data = {boreid1: {"temp": temparray, ...}, boreid2: {"temp": temparray}}
Reading from files will probably be a little more cumbersome with this approach, but for plotting you could do something like
pl.plot(data[boreid]["temperature"], data[boreid]["depth"])
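A minimal sketch of building that dict from the arrays already parsed in the question (it follows the question's four-columns-per-bore layout; the inner key names are illustrative):

data = {}
for col in range(0, tdata.shape[1], 4):
    bore_id = bores[col]  # borehole name from the first header row
    data[bore_id] = {
        "depth":       tdata[:, col],
        "temperature": tdata[:, col + 1],
        "gradient":    tdata[:, col + 2],
        "gamma":       tdata[:, col + 3],
    }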
Here are idioms for naming rows and columns:
row0, row1 = np.ones((2, 5))

for col in range(0, tdata.shape[1], 4):
    depth, temp, gradient, gamma = tdata[:, col:col+4].T
    pl.plot(temp, depth)
See also namedtuple:
from collections import namedtuple

Rec = namedtuple("Rec", "depth temp gradient gamma")
r = Rec(*tdata[:, col:col+4].T)
print(r.temp, r.depth)
datarray (thanks Doug) is certainly more general.