Data separation for ML - python

I have imported a data set for a Machine Learning project. I need each "Neuron" in my first input layer to contain one numerical piece of data. However, I have been unable to do this. Here is my code:
import math
import numpy as np
import pandas as pd

v = pd.read_csv('atestred.csv', error_bad_lines=False).values
rw = 1
print(v)
for x in range(0, 10):
    rw += 1
    s = (v[rw])
    list(s)
    # s is one row of the dataset
    print(s)  # Just a debug.
    myvar = s

class l1neuron(object):
    def gi():
        for n in range(0, len(s)):
            x = (s[n])
            print(x)  # Just another debug

n11 = l1neuron
n11.gi()
What I would ideally like is a variant of this where the code creates a new variable for every new row it extracts from the data (what I try to do in the first loop) and a new variable for every piece of data extracted from each row (what I try to do in the class and second loop).
If I have been completely missing the point with my code, feel free to point me in the right direction for a complete rewrite.
Here are the first few rows of my dataset:
fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
Thanks in advance.

If I understand your problem correctly, you would like to convert each row of your csv table into a separate variable that, in turn, holds all the values of that row.
Here is an example of how you might approach this. There are many ways to that end, and others may be more efficient, faster, more pythonic, hipper or whatever; the code below was written to help you understand how to store tabular data in named variables.
Two remarks:
if reading the data is the only thing you need pandas for, you might look for a less complex solution
the L1Neuron class is not very transparent, since its members cannot be read from the code but are instead created at runtime from the variables in attr. You may want to have a look at namedtuple for better readability instead (see the sketch after the example code below).
import pandas as pd
from io import StringIO
import numbers

# example data:
atestred = StringIO("""fixed acidity;volatile acidity;citric acid;\
residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;\
density;pH;sulphates;alcohol;quality
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
""")

# read example data into dataframe 'data'; extract values and column names:
data = pd.read_csv(atestred, error_bad_lines=False, sep=';')
colNames = list(data)

class L1Neuron(object):
    "neuron class that holds the variables of one data line"
    def __init__(self, **attr):
        """
        attr is a dict (like {'alcohol': 12, 'pH': 7.4});
        every pair in attr will result in a member variable
        of this object with that name and value
        """
        for name, value in attr.items():
            setattr(self, name.replace(" ", "_"), value)

    def gi(self):
        "print all numeric member variables whose names don't start with an underscore"
        for v in sorted(dir(self)):
            if not v.startswith('_'):
                value = getattr(self, v)
                if isinstance(value, numbers.Number):
                    print("%-20s = %5.2f" % (v, value))
        print('-' * 50)

# read csv into variables (one for each line):
neuronVariables = []
for s in data.values:
    variables = dict(zip(colNames, s))
    neuron = L1Neuron(**variables)
    neuronVariables.append(neuron)

# now the variables in neuronVariables are ready to be used:
for n11 in neuronVariables:
    print("free sulphur dioxide in this variable:", n11.free_sulfur_dioxide, end=" of ")
    print(n11.total_sulfur_dioxide, "total sulphur dioxide")
    n11.gi()
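As a footnote to the namedtuple remark above, here is a minimal sketch of that variant; it reuses colNames and data from the example and sanitizes the field names the same way:

from collections import namedtuple

# field names may not contain spaces, so replace them as above
Neuron = namedtuple('Neuron', [c.replace(' ', '_') for c in colNames])
neurons = [Neuron(*row) for row in data.values]
print(neurons[0].fixed_acidity)  # -> 7.4
print(neurons[0].quality)        # -> 5.0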

If this is for a machine learning project, I would recommend loading your CSV into a numpy array for ease of manipulation. Storing every value in the table as its own variable will cost you performance, by preventing you from using vectorized operations, and it will make your data more difficult to work with. I'd suggest this:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=';')  # note: your sample file is semicolon-separated
If your machine learning problem is supervised, you'll also want to split your labels into a separate data structure. If you're doing unsupervised learning, though, a single data structure will suffice. If you provide additional context on the problem you're trying to solve, we can give more specific guidance.
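For instance, with the wine data from the question, a minimal sketch of that split (assuming the last column, quality, is the label and the single header row is skipped):

from numpy import genfromtxt

# atestred.csv: semicolon-separated, one header row, 'quality' last
data = genfromtxt('atestred.csv', delimiter=';', skip_header=1)
X = data[:, :-1]  # features: every column except the last
y = data[:, -1]   # labels: the 'quality' column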


Pandas Styler.to_latex() - how to pass commands and do simple editing

How do I pass the following commands into the latex environment?
\centering (I need landscape tables to be centered)
and
\caption* (I need to skip for a panel the table numbering)
In addition, I would need to add parentheses and asterisks to the t-statistics, meaning row-specific formatting on the dataframes.
For example:

Current:

variable              value
const                 2.439628
t stat                13.921319
FamFirm               0.114914
t stat                0.351283
founder               0.154914
t stat                2.351283
Adjusted R Square     0.291328

I want this:

variable              value
const                 2.439628
t stat                (13.921319)***
FamFirm               0.114914
t stat                (0.351283)
founder               0.154914
t stat                (1.651283)**
Adjusted R Square     0.291328
I'm doing my research papers in DataSpell. All empirical work is in Python, and I then use LaTeX (TexiFy) to create the pdf within DataSpell. Due to this workflow, I can't edit the tables in LaTeX code, because they get overwritten every time I run the Jupyter notebook.
In case it helps, here's an example of how I pass a table to the latex environment:
# drop index to column
panel_a.reset_index(inplace=True)

# write LaTeX index and cut names to appropriate length
ind_list = [
    "ageFirm",
    "meanAgeF",
    "lnAssets",
    "bsVol",
    "roa",
    "fndrCeo",
    "lnQ",
    "sic",
    "hightech",
    "nonFndrFam"
]

# assign the list of values to the column
panel_a["index"] = ind_list

# format column names
header = ["", "count", "mean", "std", "min", "25%", "50%", "75%", "max"]
panel_a.columns = header

with open(
    os.path.join(r"/.../tables/panel_a.tex"), "w"
) as tf:
    tf.write(
        panel_a
        .style
        .format(precision=3)
        .format_index(escape="latex", axis=1)
        .hide(level=0, axis=0)
        .to_latex(
            caption="Panel A: Summary Statistics for the Full Sample",
            label="tab:table_label",
            hrules=True,
        ))
You're asking three questions in one. I think I can do you two out of three (I hear that "ain't bad").
How to pass \centering to the LaTeX env using Styler.to_latex?
Use the position_float parameter. Simplified:
df.style.to_latex(position_float='centering')
How to pass \caption*?
This one I don't know. Perhaps useful: Why is caption not working.
How to apply row-specific formatting?
This one's a little tricky. Let me give an example of how I would normally do this:
df = pd.DataFrame({'a': ['some_var', 't stat'], 'b': [1.01235, 2.01235]})
df.style.format({'a': str,
                 'b': lambda x: "{:.3f}".format(x) if x < 2 else '({:.3f})***'.format(x)})
Result (rendered as an image in the original post): 1.01235 displays as 1.012, and 2.01235 as (2.012)***.
You can see from this example that style.format accepts a callable (here nested inside a dict, but you could also do: .format(func, subset='value')). So, this is great if each value itself is evaluated (x < 2).
The problem in your case is that the evaluation depends on some other value, namely a (not supplied) P value combined with panel_a['variable'] == 't stat'. Now, assuming you have those P values in a different column, I suggest you populate a list with a for loop so that it ends up like this:
fmt_list = ['{:.3f}','({:.3f})***','{:.3f}','({:.3f})','{:.3f}','({:.3f})***','{:.3f}']
Now, we can apply a function to df.style.format, and pop/select from the list like so:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

panel_a.style.format({'variable': str, 'value': func})
Result (rendered as an image in the original post): each entry in the value column is formatted with the matching entry from fmt_list, so the t-stat rows come out wrapped in parentheses with asterisks.
This solution is admittedly a bit "hacky", since modifying a globally declared list inside a function is far from good practice: if you modify the list again before calling func, it is unlikely to behave as expected, or worse, it may throw an error that is difficult to track down. I'm not sure how to remedy this other than simply turning all the floats in panel_a.value into strings in place. In that case you don't need .format anymore, but it alters your df, which is also not ideal. You could make a copy first (df2 = df.copy()), but that costs memory.
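For completeness, a sketch of that copy-based route (panel_b is a hypothetical copy; same fmt_list assumption as above):

fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

# pre-render the display strings row by row; no mutable global state needed
panel_b = panel_a.copy()
panel_b['value'] = [fmt.format(v) for fmt, v in zip(fmt_list, panel_b['value'])]
# panel_b.style can now go straight to .to_latex() without a .format() call for 'value'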
Anyway, hope this helps. So, in full you add this as follows to your code:
fmt_list = ['{:.3f}', '({:.3f})***', '{:.3f}', '({:.3f})', '{:.3f}', '({:.3f})***', '{:.3f}']

def func(v):
    fmt = fmt_list.pop(0)
    return fmt.format(v)

with open(fname, "w") as tf:
    tf.write(
        panel_a
        .style
        .format({'variable': str, 'value': func})
        ...
        .to_latex(
            ...
            position_float='centering'
        ))

How to turn items from extracted data to numbers for plotting in python?

So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but they are not numbers that I can use for anything. I want to use the numbers to plot a graph, but the elements in the array are text strings. How would I turn them into numbers and remove unnecessary characters like commas and 'n=', for instance?
Here is my code, and below is my print output.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19']
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write based on the mentioned post) and within a try/except block, so failed conversions are simply dropped from your list.
To make it more consistent, any conversion error should discard the whole affected line, so you might want to use intermediate variables and a check for each field; a minimal sketch follows.
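A sketch under stated assumptions: strtofloat is one possible implementation of the converter named above, and 'data.txt' and the words[1] position are illustrative.

def strtofloat(s):
    # drop labels like 'n=' and stray commas before converting
    return float(s.replace('n=', '').strip(','))

delta_x = []
with open('data.txt') as infile:
    for line in infile:
        words = line.split()
        try:
            delta_x.append(strtofloat(words[1]))
        except (ValueError, IndexError):
            continue  # discard lines that don't parse cleanly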
Btw., I noticed the argument to the extract function; it would seem logical to make that argument a string containing the file name from which to extract the data.
EDIT: as a side note, you might want to look into pandas, which is a library specialised in numerical data handling. Depending on the format of your data file there are probably standard functions to read your whole file into a DataFrame (which is a kind of super-charged array class which can handle a lot of data processing as well) in a single command.
I would consider using a regular expression:
import re

match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')

for line in infile:
    words = line.split()
    new_delta_x = float(re.search(match_number, words[1]).group())
    new_abs_error = float(re.search(match_number, words[7]).group())
    new_n = int(re.search(match_number, words[10]).group())
    delta_x.append(new_delta_x)
    abs_error.append(new_abs_error)
    n.append(new_n)
But it seems like your data is already in CSV format, so try using pandas.
Then read the data into a dataframe without a header (column names will be integers):
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always the row number
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
From the given array in the question, if you would like to remove the 'n=' and convert each element to an integer, you may try the following.
import numpy as np

array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
                  'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)

Get index of value in Pytables unidimensional Array

I'm writing code for college right now that works with very large amounts of data, using Pytables with various matrices so as not to overflow memory, and it's been working well so far.
Right now I need to assign an integer identifier (from 0 upwards) to a number of distinct Strings, store the assignment, and be able to get the corresponding integer for a certain String and vice versa. Of course, normal types don't cut it; there are just too many Strings, so I need to use something that works with files, like Pytables.
I thought of just using a unidimensional Pytables EArray (because I can't know how many Strings there will be), storing the Strings there, and letting the index of each element be the assigned integer identifier of the String.
This is an example of what I thought of using:
>>> import tables as tb, numpy as np
>>> file = tb.open_file("sample_file.hdf5", mode='w')
>>> sample_array = file.create_earray(file.root, 'data', tb.StringAtom(itemsize=50),
...                                   shape=(0,), expectedrows=10000)
>>> sample_array.append(np.array(["String_value"]))
That way I can get the String value of a given integer, like in any normal array:
>>> sample_array[0]
b'String_value'
But I can't for the life of me find out how to do the opposite, to find the index given the String; I'm only coming up with more absurd ways of doing shit...
>>> sample_array[np.where("String_value") in sample_array]
b'String_value'
>>> sample_array[np.where("String_value")]
array([b'String_value'], dtype='|S50')
>>> np.where("String_value") in sample_array
False
Thank you in advance!
EDIT:
Forgot to update; I figured it out while working on something else... Facepalmed hard, very hard. It was really stupid, but I couldn't figure out what was wrong for hours.
>>> np.where(sample_array[:] == b'String_value')
(array([0]),)
OP answered his question above. However, it's buried under EDIT:, so it is not obvious in search results (or to the casual reader). Also, there is another way to approach the problem (using a Table instead of an EArray). This provides a comparison of the two methods.
OP's solution with an EArray (with some embellishment):

import tables as tb, numpy as np

h5f = tb.open_file("sample_file.hdf5", mode='w')
sample_array = h5f.create_earray(h5f.root, 'data', tb.StringAtom(itemsize=50),
                                 shape=(0,), expectedrows=10000)
sample_array.append(np.array(['str_val0']))
sample_array.append(np.array(['str_val10']))
sample_array.append(np.array(['str_val20']))
sample_array.append(np.array(['str_val30']))
sample_array.append(np.array(['str_val40']))

print(sample_array[0])
print(sample_array[-1])
print(np.where(sample_array[:] == b'str_val0'))
print(np.where(sample_array[:] == b'str_val40'))
print('\n')

h5f.close()
Output looks like this:
b'str_val0'
b'str_val40'
(array([0], dtype=int64),)
(array([4], dtype=int64),)
My approach with a Table:
I like Tables in Pytables. They are handy because they have multiple built-in search and iteration methods (in this case using .get_where_list(); there are many others). This example shows Table creation from a np.recarray (uses dtype to define fields/columns, and data to populate the table). Additional data rows are added later with the .append() method.

import tables as tb, numpy as np

h5f = tb.open_file("sample_file.hdf5", mode='w')

simple_recarray = np.recarray((4,), dtype=[('tstr', 'S50')])
simple_recarray['tstr'][0] = 'str_val1'
simple_recarray['tstr'][1] = 'str_val2'
simple_recarray['tstr'][2] = 'str_val10'
simple_recarray['tstr'][3] = 'str_val20'

simple_table = h5f.create_table(h5f.root, 'table_data', simple_recarray, 'Simple dataset')
print(simple_table.get_where_list("tstr == b'str_val1'"))
print(simple_table.get_where_list("tstr == b'str_val20'"))

simple_table.append([('str_val30',), ('str_val31',)])
print(simple_table.get_where_list("tstr == b'str_val31'"))

h5f.close()
Output looks like this (slightly different b/c strings are not stored in arrays):
[0]
[3]
[5]

basic python vlookup equivalent

I'm looking for the equivalent of the vlookup function in Excel. I have a script where I read in a csv file, and I would like to be able to query an associated value from another column of the .csv. Script so far:
import matplotlib
import matplotlib.mlab as mlab
import glob

for files in glob.glob("*.csv"):
    print files
    r = mlab.csv2rec(files)
    r.cols = r.dtype.names
    depVar = r[r.cols[0]]
    indVar = r[r.cols[1]]
    print indVar
This will read in .csv files from the same folder the script is in. In the above example depVar is the first column in the .csv, and indVar is the second column. In my case, I know a value for indVar, and I want to return the associated value for depVar. I'd like to add a command like:
depVar = r[r.cols[0]]
indVar = r[r.cols[1]]
print indVar
depVarAt5 = lookup value in depVar where indVar = 5 (I could sub in things for the 5 later)
In my case, all values in all fields are numbers and all of the values of indVar are unique. I want to be able to define a new variable (depVarAt5 in the last example) equal to the associated value.
Here's an example of the .csv contents; name the file anything and place it in the same folder as the script. In this example, depVarAt5 should end up equal to 16.1309.
Temp,Depth
16.1309,5
16.1476,94.4007
16.2488,100.552
16.4232,106.573
16.4637,112.796
16.478,118.696
16.4961,124.925
16.5105,131.101
16.5462,137.325
16.7016,143.186
16.8575,149.101
16.9369,155.148
17.0462,161.187
I think this solves your problem quite directly:

import numpy
import glob

for f in glob.glob("*.csv"):
    print f
    r = numpy.recfromcsv(f)
    print numpy.interp(5, r.depth, r.temp)

I'm pretty sure numpy is a prerequisite for matplotlib.
Not sure what that r object is, but since it has a member called cols, I'm going to assume it also has a member called rows which contains the row data.
>>> r.rows
[[16.1309, 5], [16.1476, 94.4007], ...]
In that case, your pseudocode very nearly contains a valid generator expression/list comprehension.
depVarAt5 = lookup value in depVar where indVar = 5 (I could sub in things for the 5 later)
becomes
depVarAt5 = [row[0] for row in r.rows if row[1] == 5]
Or, more generally
depVarValue = [row[depVarColIndex] for row in r.rows if row[indVarColIndex] == searchValue]
so
def vlookup(rows, searchColumn, dataColumn, searchValue):
    return [row[dataColumn] for row in rows if row[searchColumn] == searchValue]
Throw a [0] on the end of that if you can guarantee there will be exactly one output per input.
There's also a csv module in the Python standard library which you might prefer to work with. =)
For arbitrary orderings and exact matches you can use indVar.index() and index depVar with the returned index.
If indVar is ordered and/or you need the closest match, you should look at using bisect on indVar; see the sketch below.
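A minimal sketch of both lookups, using the first rows of the sample data above as plain Python lists:

from bisect import bisect_left

indVar = [5, 94.4007, 100.552, 106.573]        # Depth column
depVar = [16.1309, 16.1476, 16.2488, 16.4232]  # Temp column

# exact match:
depVarAt5 = depVar[indVar.index(5)]   # -> 16.1309

# closest match on a sorted indVar:
target = 97.0
i = bisect_left(indVar, target)       # insertion point for target
j = min(max(i, 1), len(indVar) - 1)   # clamp to a valid neighbour pair
closest = j if abs(indVar[j] - target) < abs(indVar[j - 1] - target) else j - 1
print(depVar[closest])                # -> 16.1476 (94.4007 is nearer to 97.0)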

Data analysis for inconsistent string formatting

I have this task that I've been working on, but am having extreme misgivings about my methodology.
So the problem is that I have a ton of Excel files that are formatted strangely (and not consistently), and I need to extract certain fields from each entry. An example data set was shown as a screenshot in the original post (not reproduced here).
My original approach was this:
Export to csv
Separate into counties
Separate into districts
Analyze each district individually, pull out values
Write to output.csv
The problem I've run into is that the format (seemingly well organized) is almost random across files. Each line contains the same fields, but in a different order, spacing, and wording. I wrote a script to correctly process one file, but it doesn't work on any other files.
So my question is, is there a more robust method of approaching this problem rather than simple string processing? What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
If it helps clear up the problem, here is the script I wrote:
# This file takes a tax CSV file as input
# and separates it into counties
# then appends each county's entries onto
# the end of the master out.csv
# which will contain everything including
# taxes, bonds, etc from all years

# import the data csv
import sys
import re
import csv

def cleancommas(x):
    toggle = False
    for i, j in enumerate(x):
        if j == "\"":
            toggle = not toggle
        if toggle == True:
            if j == ",":
                x = x[:i] + " " + x[i+1:]
    return x

def districtatize(x):
    # list indexes of entries starting with "for" or "to" of length >5
    indices = [1]
    for i, j in enumerate(x):
        if len(j) > 2:
            if j[:2] == "to":
                indices.append(i)
        if len(j) > 3:
            if j[:3] == " to" or j[:3] == "for":
                indices.append(i)
        if len(j) > 5:
            if j[:5] == " \"for" or j[:5] == " \'for":
                indices.append(i)
        if len(j) > 4:
            if j[:4] == " \"to" or j[:4] == " \'to" or j[:4] == " for":
                indices.append(i)
    if len(indices) == 1:
        return [x[0], x[1:len(x)-1]]
    new = [x[0], x[1:indices[1]+1]]
    z = 1
    while z < len(indices) - 1:
        new.append(x[indices[z]+1:indices[z+1]+1])
        z += 1
    return new
    # should return a list of lists. First entry will be county;
    # each successive element in the list will be a list by district

def splitforstos(string):
    for itemind, item in enumerate(string):              # take all exception cases that didn't get processed
        splitfor = re.split('(?<=\d)\s\s(?=for)', item)  # correctly and split them up so that the for begins
        splitto = re.split('(?<=\d)\s\s(?=to)', item)    # a cell
        if len(splitfor) > 1:
            print "\n\n\nfor detected\n\n"
            string.remove(item)
            string.insert(itemind, splitfor[0])
            string.insert(itemind+1, splitfor[1])
        elif len(splitto) > 1:
            print "\n\n\nto detected\n\n"
            string.remove(item)
            string.insert(itemind, splitto[0])
            string.insert(itemind+1, splitto[1])

def analyze(x):
    # input should be a string of content
    # target values are nomills,levytype,term,yearcom,yeardue
    clean = cleancommas(x)
    countylist = clean.split(',')
    emptystrip = filter(lambda a: a != '', countylist)
    empt2strip = filter(lambda a: a != ' ', emptystrip)
    singstrip = filter(lambda a: a != '\' \'', empt2strip)
    quotestrip = filter(lambda a: a != '\" \"', singstrip)
    splitforstos(quotestrip)
    distd = districtatize(quotestrip)
    print '\n\ndistrictized\n\n', distd
    county = distd[0]
    for x in distd[1:]:
        if len(x) > 8:
            district = x[0]
            vote1 = x[1]
            votemil = x[2]
            spaceindex = [m.start() for m in re.finditer(' ', votemil)][-1]
            vote2 = votemil[:spaceindex]
            mills = votemil[spaceindex+1:]
            votetype = x[4]
            numyears = x[6]
            yearcom = x[8]
            yeardue = x[10]
            reason = x[11]
            data = [filename, county, district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data", data
        else:
            print "x\n\n", x
            district = x[0]
            vote1 = x[1]
            votemil = x[2]
            spaceindex = [m.start() for m in re.finditer(' ', votemil)][-1]
            vote2 = votemil[:spaceindex]
            mills = votemil[spaceindex+1:]
            votetype = x[4]
            special = x[5]
            splitspec = special.split(' ')
            try:
                forind = [i for i, j in enumerate(splitspec) if j == 'for'][0]
                numyears = splitspec[forind+1]
                yearcom = splitspec[forind+6]
            except:
                forind = [i for i, j in enumerate(splitspec) if j == 'commencing'][0]
                numyears = None
                yearcom = splitspec[forind+2]
            yeardue = str(x[6])[-4:]
            reason = x[7]
            data = [filename, county, district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
            print "data other", data
        openfile = csv.writer(open('out.csv', 'a'), delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
        openfile.writerow(data)

# call the file like so: python tax.py 2007May8Tax.csv
filename = sys.argv[1]  # the file is the first argument
f = open(filename, 'r')
contents = f.read()  # entire csv as string

# find index of every instance of the word COUNTY
separators = [m.start() for m in re.finditer('\w+\sCOUNTY', contents)]  # alternative implementation in regex

# split contents into sections by county;
# analyze each section and append to out.csv
for x, y in enumerate(separators):
    try:
        data = contents[y:separators[x+1]]
    except:
        data = contents[y:]
    analyze(data)
is there a more robust method of approaching this problem rather than simple string processing?
Not really.
What I had in mind was more of a fuzzy logic approach for trying to pin which field an item was, which could handle the inputs being a little arbitrary. How would you approach this problem?
After a ton of analysis and programming, it won't be significantly better than what you've got.
Reading stuff prepared by people requires -- sadly -- people-like brains.
You can mess with NLTK to try and do a better job, but it doesn't work out terribly well either.
You don't need a radically new approach. You need to streamline the approach you have.
For example.
district=x[0]
vote1=x[1]
votemil=x[2]
spaceindex=[m.start() for m in re.finditer(' ', votemil)][-1]
vote2=votemil[:spaceindex]
mills=votemil[spaceindex+1:]
votetype=x[4]
numyears=x[6]
yearcom=x[8]
yeardue=x[10]
reason=x[11]
data = [filename,county,district, vote1, vote2, mills, votetype, numyears, yearcom, yeardue, reason]
print "data",data
Might be improved by using a named tuple.
Then build something like this.
data = SomeSensibleName(
    district=x[0],
    vote1=x[1],
    # ... etc.
)
So that you're not creating a lot of intermediate (and largely uninformative) loose variables.
Also, keep looking at your analyze function (and any other function) to pull out the various "pattern matching" rules into functions of their own. The idea is that you'll examine a county's data and step through a bunch of functions until one matches the pattern; the matching function also creates the named tuple. You want something like this.
for p in (some, list, of, functions):
    match = p(data)
    if match:
        return match
Each function either returns a named tuple (because it liked the row) or None (because it didn't like the row).
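To make that concrete, a minimal sketch of the dispatch pattern (RowData, the matcher names, and the chosen fields are all hypothetical; adapt them to your real columns):

from collections import namedtuple

RowData = namedtuple('RowData', 'district vote1 vote2 votetype')

def match_long_form(x):
    # liked the row: return a populated named tuple
    if len(x) > 8:
        return RowData(x[0], x[1], x[2], x[4])
    return None  # didn't like the row

def match_short_form(x):
    if len(x) > 5:
        return RowData(x[0], x[1], x[2], x[4])
    return None

def analyze_row(x):
    for p in (match_long_form, match_short_form):
        match = p(x)
        if match:
            return match
    return None  # no rule matched this row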
