So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but the elements are text strings, not numbers I can use for anything. I want to use the numbers to plot them in a graph. How would I turn them into numbers and remove unnecessary characters such as commas and n=?
Here is my code, and below it is the output of my print statement.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19']
I'd use the conversion method presented in this post within the extract function, e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write based on the mentioned post) and within a try/except block, so that failed conversions are simply ignored.
To keep the lists consistent, any conversion error should discard the whole affected line, so you might want to use intermediate variables and a check for each field, for example:
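Here is a minimal sketch of that pattern; the file name, the column positions, and the strtofloat body are assumptions for illustration (strtofloat stands in for the helper you'd write from the linked post):

def strtofloat(s):
    # hypothetical helper: strip commas before converting
    return float(s.replace(',', ''))

delta_x, abs_error = [], []
with open('data.txt') as infile:   # file name is an assumption
    for line in infile:
        words = line.split()
        try:
            # convert into intermediate variables first, so a failure in any
            # field discards the whole line and the lists stay in sync
            new_delta_x = strtofloat(words[1])
            new_abs_error = strtofloat(words[7])
        except (ValueError, IndexError):
            continue   # ignore lines that don't parse cleanly
        delta_x.append(new_delta_x)
        abs_error.append(new_abs_error)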
By the way, I noticed the argument to the extract function; wouldn't it be more logical to make the argument a string containing the file name from which to extract the data?
EDIT: as a side note, you might want to look into pandas, a library specialised in numerical data handling. Depending on the format of your data file, there are probably standard functions to read your whole file into a DataFrame (a kind of super-charged array class that can handle a lot of data processing as well) in a single command.
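For instance, a sketch assuming a whitespace-separated file (the file name and separator are assumptions about your format):

import pandas as pd

# sep=r'\s+' splits on any run of whitespace; header=None because the
# file has no column-name row
df = pd.read_csv('data.txt', sep=r'\s+', header=None)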
I would consider using a regular expression:
import re
# optional sign, digits, optional decimal part, optional exponent;
# the raw string avoids invalid-escape warnings for the backslash
match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')
for line in infile:
    words = line.split()
    new_delta_x = float(re.search(match_number, words[1]).group())
    new_abs_error = float(re.search(match_number, words[7]).group())
    new_n = int(re.search(match_number, words[10]).group())
    delta_x.append(new_delta_x)
    abs_error.append(new_abs_error)
    n.append(new_n)
But it seems like your data is already in CSV format, so try using pandas.
Read the data into a DataFrame without a header (the column names will be integers).
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
Given the array in the question, if you would like to remove the 'n=' prefix and convert each element to an integer, you may try the following.
import numpy as np
array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
# strip the 'n=' prefix and convert; note the result is a plain Python list
array = [int(i.replace('n=', '')) for i in array]
print(array)
I am trying to find the parametric equations for a certain set of plots, but when I run the code I only get the first value back. The code is from a website, as I am not very proficient in Python. I am using 3.6.5. Here is the code:
import numpy as np
import scipy as sp
from fractions import Fraction
def trigSeries(x):
    f=sp.fft(x)
    n=len(x)
    A0=abs(f[0])/n
    A0=Fraction(A0).limit_denominator(1000)
    hn=np.ceil(n/2)
    f=f[1:int(hn)]
    A=2*abs(f)/n
    P=sp.pi/2-sp.angle(f)
    A=map(Fraction,A)
    A=map(lambda a:a.limit_denominator(1000),A)
    P=map(Fraction,P)
    P=map(lambda a:a.limit_denominator(1000),P)
    s=map(str,A)
    s=map(lambda a: a+"*np.sin(", s)
    s=map(lambda a,b,c:
          a+str(b)+"-2*sp.pi*t*"+str(c)+")",
          s,P,range(1,len(list(P))+1))
    s="+".join(s)
    s=str(A0)+"+"+s
    return s
x=[5041,4333,3625,3018,2816,2967,3625,4535,5800,6811,7823,8834,8429,7418,6305,5193,4181,3018,3018,3777,4687,5496,6912,7974,9087]
y=[4494,5577,6930,8825,10990,13426,14509,15456,15456,15186,15321,17486,19246,21005,21276,21952,22223,23712,25877,27501,28178,28448,27636,26960,25742]
xf=trigSeries(x)
print(xf)
Any help would be appreciated.
I tried to make the code work but I could not manage to do it.
The problem is that when you call map(...) you create an iterator, so in order to print its contents you have to do:
for data in iterator:
    print(data)
The problem here is that once the lambda functions have been applied and the iterators consumed, cycling over the variable returns nothing.
You could convert all the lambdas into for loops, but you would have to think carefully about the triple-argument lambda.
The problem is at this step:
s=map(lambda a,b,c:
      a+str(b)+"-2*sp.pi*t*"+str(c)+")",
      s,P,range(1,len(list(P))+1))
It produces an empty result, because len(list(P)) consumes the iterator P before map can read from it. To resolve it, convert s and P to lists just before feeding them to this map call by adding two lines above it:
s = list(s)
P = list(P)
Output for your example
134723/25+308794/391*np.sin(-1016/709-2*sp.pi*t*1)+2537094/989*np.sin(641/835-2*sp.pi*t*2)+264721/598*np.sin(-68/241-2*sp.pi*t*3)+285344/787*np.sin(-84/997-2*sp.pi*t*4)+118145/543*np.sin(-190/737-2*sp.pi*t*5)+281400/761*np.sin(-469/956-2*sp.pi*t*6)+1451/8*np.sin(-563/489-2*sp.pi*t*7)+122323/624*np.sin(-311/343-2*sp.pi*t*8)+115874/719*np.sin(-137/183-2*sp.pi*t*9)+171452/861*np.sin(-67/52-2*sp.pi*t*10)+18152/105*np.sin(-777/716-2*sp.pi*t*11)+24049/125*np.sin(-107/76-2*sp.pi*t*12)
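For context, a minimal standalone illustration of the underlying pitfall (not from the original code): in Python 3, map returns a single-use iterator.

P = map(str, [1, 2, 3])
print(len(list(P)))   # 3 -- list(P) consumes the iterator
print(list(P))        # [] -- P is now exhausted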
I'm writing code for college right now that works with very large amounts of data, using PyTables with various matrices so as not to overflow memory, and it's been working well so far.
Right now I need to assign an integer identifier (from 0 to whatever) to a number of distinct Strings, store the assignment, and be able to get the corresponding integer for a certain String and vice versa. Of course, normal types don't cut it; there are just too many Strings, so I need to use something that works with files, like PyTables.
I thought of just using a one-dimensional PyTables EArray (because I can't know how many Strings there will be), storing the Strings there, and letting the index of each element be the assigned integer identifier of the String.
This is an example of what I thought of using:
>>> import tables as tb, numpy as np
>>> file = tb.open_file("sample_file.hdf5", mode='w')
>>> sample_array = file.create_earray(file.root, 'data', tb.StringAtom(itemsize=50),
...                                   shape=(0,), expectedrows=10000)
>>> sample_array.append(np.array(["String_value"]))
That way I can get the String value of a given integer, like in any normal array
>>>sample_array[0]
b'String_value'
But I can't for the life of me find out how to do the opposite, to find the index given the String; I'm only coming up with more absurd ways of doing it...
>>> sample_array[np.where("String_value") in sample_array]
b'String_value'
>>> sample_array[np.where("String_value")]
array([b'String_value'], dtype='|S50')
>>> np.where("String_value") in sample_array
False
Thank you in advance!
EDIT:
Forgot to update: I figured it out while working on something else... Facepalmed hard, very hard; it was really stupid, but I couldn't figure out what was wrong for hours.
>>> np.where(sample_array[:] == b'String_value')
(array([0]),)
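Wrapped up as a pair of hypothetical helpers (a sketch, not part of the original post), the two lookup directions look like this:

def string_for(idx):
    # integer -> String: plain indexing
    return sample_array[idx]

def index_for(value):
    # String -> integer: np.where returns a tuple of index arrays
    matches = np.where(sample_array[:] == value.encode())[0]
    return int(matches[0]) if matches.size else None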
OP answered his own question above. However, it's buried under EDIT:, so it's not obvious in search results (or to the casual reader). Also, there is another way to approach the problem (using a Table instead of an EArray). This provides a comparison of the two methods.
OP's solution with an EArray (with some embellishment):
import tables as tb, numpy as np
h5f = tb.open_file("sample_file.hdf5", mode='w')
sample_array = h5f.create_earray(h5f.root, 'data', tb.StringAtom(itemsize=50),
shape=(0,), expectedrows=10000)
sample_array.append(np.array(['str_val0']))
sample_array.append(np.array(['str_val10']))
sample_array.append(np.array(['str_val20']))
sample_array.append(np.array(['str_val30']))
sample_array.append(np.array(['str_val40']))
print (sample_array[0])
print (sample_array[-1])
print (np.where(sample_array[:] == b'str_val0'))
print (np.where(sample_array[:] == b'str_val40'))
print ('\n')
h5f.close()
Output looks like this:
b'str_val0'
b'str_val40'
(array([0], dtype=int64),)
(array([4], dtype=int64),)
My approach with a Table:
I like Tables in Pytables. They are handy because they have multiple built-in search and iteration methods (in this case using .get_where_list(); there are many others). This example shows Table creation from a np.recarray (uses dtype to define fields/columns, and data to populate the table). Additional data rows are added later with the .append() method.
import tables as tb, numpy as np
h5f = tb.open_file("sample_file.hdf5", mode='w')
simple_recarray = np.recarray((4,),dtype=[('tstr','S50')])
simple_recarray['tstr'][0] = 'str_val1'
simple_recarray['tstr'][1] = 'str_val2'
simple_recarray['tstr'][2] = 'str_val10'
simple_recarray['tstr'][3] = 'str_val20'
simple_table = h5f.create_table(h5f.root, 'table_data', simple_recarray, 'Simple dataset')
print (simple_table.get_where_list("tstr == b'str_val1'"))
print (simple_table.get_where_list("tstr == b'str_val20'"))
simple_table.append([('str_val30',), ('str_val31',)])
print (simple_table.get_where_list("tstr == b'str_val31'"))
h5f.close()
Output looks like this (slightly different because the strings are not stored in arrays):
[0]
[3]
[5]
I'm new to Python and coding (as of last night). I need to generate a very large number of itertools products with a specific format for the output. I can generate the combinations using:
import itertools
s=[ ['CPT1','OTHERCPT1','OTHERCPT2','OTHERCPT3','OTHERCPT4','OTHERCPT5','OTHERCPT6','OTHERCPT7','OTHERCPT8','OTHERCPT9','OTHERCPT10','CONCURR1','CONCURR2','CONCURR3','CONCURR4','CONCURR5','CONCURR6','CONCURR7','CONCURR8','CONCURR9','CONCURR10'], ['15756','15757','15758','43496','49006','20969','20955','20956','20957','20962','20970','20972','20973'],['CPT1','OTHERCPT1','OTHERCPT2','OTHERCPT3','OTHERCPT4','OTHERCPT5','OTHERCPT6','OTHERCPT7','OTHERCPT8','OTHERCPT9','OTHERCPT10','CONCURR1','CONCURR2','CONCURR3','CONCURR4','CONCURR5','CONCURR6','CONCURR7','CONCURR8','CONCURR9','CONCURR10'], ['15756','15757','15758','43496','49006','20969','20955','20956','20957','20962','20970','20972','20973']]
x=list(itertools.product(*s))
print x
however the output appears as such:
('CPT1', '15756', 'CPT1', '15756'), ... etc.
I would like it to appear:
SELECT IF(CPT1='15756' AND CPT1='15756').
SELECT IF(...).
etc.
Thanks for your help!
You should use string formatting (https://docs.python.org/2/library/string.html)
import itertools
s = [[...first list...],[...second list...]]
for p in itertools.product(*s):
    # p is a 4-tuple: (column, value, column, value)
    print("SELECT IF({}='{}' AND {}='{}').".format(*p))
I'm running the following python script:
#!/usr/bin/python
import os,sys
from scipy import stats
import numpy as np
f=open('data2.txt', 'r').readlines()
N=len(f)-1
for i in range(0,N):
    w=f[i].split()
    l1=w[1:8]
    l2=w[8:15]
    list1=[float(x) for x in l1]
    list2=[float(x) for x in l2]
    result=stats.ttest_ind(list1,list2)
    print result[1]
However, I get errors like:
ValueError: could not convert string to float: id
I'm confused by this.
When I try this for only one line in an interactive session, instead of using the for loop in a script:
>>> from scipy import stats
>>> import numpy as np
>>> f=open('data2.txt','r').readlines()
>>> w=f[1].split()
>>> l1=w[1:8]
>>> l2=w[8:15]
>>> list1=[float(x) for x in l1]
>>> list1
[5.3209183842, 4.6422726719, 4.3788135547, 5.9299061614, 5.9331108706, 5.0287087832, 4.57...]
It works well.
Can anyone explain a little bit about this?
Thank you.
Obviously some of your lines don't contain valid float data; specifically, some lines have the text id, which can't be converted to float.
When you try it at the interactive prompt you are trying only a single line, so the best way is to print the line where you are getting this error, and you will find the offending line, e.g.
#!/usr/bin/python
import os,sys
from scipy import stats
import numpy as np
f=open('data2.txt', 'r').readlines()
N=len(f)-1
for i in range(0,N):
    w=f[i].split()
    l1=w[1:8]
    l2=w[8:15]
    try:
        list1=[float(x) for x in l1]
        list2=[float(x) for x in l2]
    except ValueError, e:
        print "error",e,"on line",i
        continue  # skip the bad line so stale lists aren't reused below
    result=stats.ttest_ind(list1,list2)
    print result[1]
My error was very simple: the text file containing the data had a space (hence invisible) character on the last line.
As the output of grep, I had '45 ' (with a trailing space) instead of just '45'.
This error is pretty verbose:
ValueError: could not convert string to float: id
Somewhere in your text file, a line has the word id in it, which can't really be converted to a number.
Your test code works because the word id isn't present in line 2.
If you want to catch that line, try this code. I cleaned your code up a tad:
#!/usr/bin/python
import os, sys
from scipy import stats
import numpy as np

for index, line in enumerate(open('data2.txt', 'r').readlines()):
    w = line.split(' ')
    l1 = w[1:8]
    l2 = w[8:15]
    try:
        list1 = map(float, l1)
        list2 = map(float, l2)
    except ValueError:
        print 'Line {i} is corrupt!'.format(i = index)
        break
    result = stats.ttest_ind(list1, list2)
    print result[1]
For a Pandas dataframe with a column of numbers with commas, use this:
df["Numbers"] = [float(str(i).replace(",", "")) for i in df["Numbers"]]
So values like 4,200.42 would be converted to 4200.42 as a float.
Bonus 1: This is fast.
Bonus 2: More space efficient if saving that dataframe in something like Apache Parquet format.
Perhaps your numbers aren't actually numbers, but letters masquerading as numbers?
In my case, the font I was using meant that "l" and "1" looked very similar, so I had a string like 'l1919' which I thought was '11919', and that messed things up.
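A quick way to unmask such lookalikes (a small sketch, not from the original answer):

s = 'l1919'          # lowercase L, not the digit 1
print(repr(s))       # 'l1919' -- shows the exact characters
print(s.isdigit())   # False -- confirms it is not all digits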
Your data may not be what you expect -- it seems you're expecting, but not getting, floats.
A simple solution to figuring out where this occurs would be to add a try/except to the for-loop:
for i in range(0,N):
    w=f[i].split()
    l1=w[1:8]
    l2=w[8:15]
    try:
        list1=[float(x) for x in l1]
        list2=[float(x) for x in l2]
    except ValueError, e:
        # report the error in some way that is helpful -- maybe print out i
        continue  # then move on to the next line
    result=stats.ttest_ind(list1,list2)
    print result[1]
Shortest way:
df["id"] = df['id'].str.replace(',', '').astype(float)   # if ',' is the problem
df["id"] = df['id'].str.replace(' ', '').astype(float)   # if blank space is the problem
To update empty string values with 0.0 (if you know the possible non-float values, update them explicitly first):
df.loc[df['score'] == '', 'score'] = 0.0
df['score'] = df['score'].astype(float)
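If you don't know all the non-float values in advance, pd.to_numeric with errors='coerce' (a standard pandas function) turns anything unparseable into NaN, which you can then fill:

import pandas as pd

# anything that can't be parsed becomes NaN, then is filled with 0.0
df['score'] = pd.to_numeric(df['score'], errors='coerce').fillna(0.0)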
I solved a similar situation with a basic technique using pandas. First load the file using pandas; it's pretty simple:
data = pd.read_excel('link to the file')
Then set the index of data to the respective column that needs to be changed. For example, if your data has ID as one attribute or column, then set the index to ID:
data = data.set_index("ID")
Then delete all the rows with "id" as the value instead of a number using the following command:
data = data.drop("id", axis=0)
Hope this helps.