I'm trying to use Python to make plots. Below is a simplified version of my code that causes the error.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use("AGG")
distance = np.array('f')
depth = np.array('f')  # make sure these two arrays store float type values
with open('Line671.txt', 'r') as datafile:
    for line in datafile:
        word_count = 0
        for word in line.split():
            word = float(word)  # convert string type to float
            if word_count == 0:
                distance = np.append(distance, word)
            elif word_count == 1:
                depth = np.append(depth, word)
            else:
                print 'Error'
            word_count += 1
datafile.closed
print depth
print distance  # outputs look correct
# original data is like this: -5.3458000e+00
# output of the array is: ['f' '-5.3458' '-5.3463' ..., '-5.4902' '-5.4912' '-5.4926']
plt.plot(depth, distance)  # here comes the problem
The error message points at the line plt.plot(depth, distance): ValueError: could not convert string to float: f
I don't understand this, because it seems I converted all string values to float. I searched for this problem on Stack Overflow, but the answers all seem to solve it by casting all string values to float or int. Can anyone give any suggestion on this problem? I would really appreciate any help.
You confused the value with the type. If you're trying to declare the type, you need to use "dtype=". What you actually did was stick a single character into the array: np.array('f') creates an array holding the string 'f', so everything appended to it afterwards is stored as a string too.
To answer a later question, your line
word = float(word)
likely worked just fine. The converted value is appended, but because the arrays were seeded with the string 'f', NumPy upcasts every appended float back to a string, which is exactly what your printed output shows.
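A minimal sketch of the intended setup (assuming each line of Line671.txt holds exactly two columns, distance then depth; collecting into plain lists avoids repeated np.append calls):

import numpy as np
import matplotlib
matplotlib.use("AGG")
import matplotlib.pyplot as plt

distance = []  # plain Python lists; convert once at the end
depth = []
with open('Line671.txt', 'r') as datafile:
    for line in datafile:
        words = line.split()
        distance.append(float(words[0]))
        depth.append(float(words[1]))

distance = np.array(distance, dtype=float)  # dtype= declares the element type
depth = np.array(depth, dtype=float)
plt.plot(depth, distance)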
I am fairly new to Python, so there may be a lot to improve upon, but in the following code I am trying to write a function that takes in the location of the data file, the attribute to be normalized, and the type of normalization to perform ('min_max' or 'z_score').
Based on the normalization type, it should apply the appropriate formula and return a dictionary where key = original value in the dataset and value = normalized value.
def normalization(fname, attr, normType):
    result = {}
    df = pd.read_csv(fname)
    targ = list(df[df.columns[attr]])
    scaler = MinMaxScaler()
    df["minmax"] = scaler.fit.transform(df[[df.columns[attr]]])
    df["zscore”] = ((df[[df.columns[attr]]]) - (df[[df.columns[attr.mean()]]])) / (df[[df.columns[attr.std(ddof=1)]]])
    if normType == "min_max":
        result = dict(zip(targ, df.minmax.values.tolist()))
    else:
        result = dict(zip(targ, df.zscore.values.tolist()))
    return result
I continually get an error specifically on the line with the zscore calculation and have been struggling to troubleshoot it. I would appreciate any help that could point me in the right direction.
Thanks
Edit: Error message shown is
"SyntaxError: EOL while scanning string literal"
"zscore” alone causes that error. The problem is that the ” isn't a proper double-quotes character so the string isn't properly terminated. Not sure how it got there, maybe bad formatting in a document while pasting code around. The fix: "zscore"
So I have a text document with a lot of values from calculations. I have extracted all the data and stored it in an array, but they are not numbers I can use for anything. I want to use the numbers to plot them in a graph, but the elements in the array are text strings. How would I turn them into numbers and remove unnecessary characters like commas and n=, for instance?
Here is my code, and below is my print statement.
import numpy as np
['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9', 'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19']
I'd use the conversion method presented in this post within the extract function, so e.g.
...
delta_x.append(strtofloat(words[1]))
...
where you might as well do the conversion inline (my strtofloat is a function you'd have to write based on the post mentioned above) and within a try/except block, so failed conversions are simply dropped from your list.
To make it more consistent, any conversion error should discard the whole affected line, so you might want to use intermediate variables and a check for each field; a sketch follows.
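A hedged sketch of such a helper, with the per-field try/except (strtofloat is the hypothetical name used above; the stripping rules are assumptions you'd adapt to your own data):

def strtofloat(word):
    # drop trailing commas and an optional prefix like 'n=' before converting
    cleaned = word.strip().rstrip(',')
    if '=' in cleaned:
        cleaned = cleaned.split('=', 1)[1]
    return float(cleaned)

try:
    delta_x.append(strtofloat(words[1]))
except ValueError:
    pass  # a failed conversion simply skips this field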
By the way, I noticed the argument to the extract function; it would seem logical to make that argument a string containing the file name from which to extract the data.
EDIT: As a side note, you might want to look into pandas, a library specialised in numerical data handling. Depending on the format of your data file, there are probably standard functions to read your whole file into a DataFrame (a kind of super-charged array class that can handle a lot of data processing as well) in a single command, as sketched below.
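For example, assuming whitespace-separated columns and no header row (both assumptions, and the file name is a placeholder):

import pandas as pd
df = pd.read_csv('yourdata.txt', sep=r'\s+', header=None)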
I would consider using a regular expression:
import re

match_number = re.compile(r'-?[0-9]+\.?[0-9]*(?:[Ee]-?[0-9]+)?')

for line in infile:
    words = line.split()
    new_delta_x = float(re.search(match_number, words[1]).group())
    new_abs_error = float(re.search(match_number, words[7]).group())
    new_n = int(re.search(match_number, words[10]).group())
    delta_x.append(new_delta_x)
    abs_error.append(new_abs_error)
    n.append(new_n)
But it seems like your data is already in CSV format, so try using pandas.
Read the data into a DataFrame without a header (the column names will be integers).
import numpy as np
import pandas as pd
df = pd.read_csv('approx_derivative_sine.txt', header=None)
delta_x = df[1].to_numpy()
abs_error = df[7].to_numpy()
# if n is always number of the row
n = df.index.to_numpy(dtype=int)
# if n is always in the form 'n=<integer>'
n = df[10].apply(lambda x: x.strip()[2:]).to_numpy(dtype=int)
If you could post a few rows of your approx_derivative_sine.txt file, that would be useful.
From the given array in the question, if you would like to remove the n= prefix and convert each element to an integer, you may try the following.
import numpy as np
array = np.array(['n=1', 'n=2', 'n=3', 'n=4', 'n=5', 'n=6', 'n=7', 'n=8', 'n=9',
                  'n=10', 'n=11', 'n=12', 'n=13', 'n=14', 'n=15', 'n=16', 'n=17', 'n=18', 'n=19'])
array = [int(i.replace('n=', '')) for i in array]
print(array)
Introduction to the problem
I have inputs in a .txt file and I want to 'extract' the values when a velocity is given.
Inputs have the form: velocity\tval1\tval2...\tvaln
[...]
16\t1\t0\n
1.0000\t9.3465\t8.9406\t35.9604\n
2.0000\t10.4654\t9.9456\t36.9107\n
3.0000\t11.1235\t10.9378\t37.1578\n
[...]
What have I done
I have written a piece of code to return values when a velocity is requested:
def values(input, velocity):
    return re.findall("\n" + str(velocity) + ".*", input)[-1][1:]
It works "backwards" because I want to ignore the first row from the inputs (16\t1\t0\n), this way if I call:
>>>values('inputs.txt',16)
>>>16.0000\t0.5646\t14.3658\t1.4782\n
But it has a big problem: if I call the function for 1, it returns the value for 19.0000.
Since I thought all inputs would be in the same format, I made a little fix:
def values(input, velocity):
    if velocity <= 5:  # because velocity goes to 50
        velocity = str(velocity) + '.0'
    return re.findall("\n" + str(velocity) + ".*", input)[-1][1:]
And it works pretty well. Maybe it's not the most beautiful (or efficient) way to do it, but I'm a beginner.
The problem
But with this code I have a problem: sometimes inputs have this form instead:
[...]
16\t1\t0\n
1\t9.3465\t8.9406\t35.9604\n
2\t10.4654\t9.9456\t36.9107\n
3\t11.1235\t10.9378\t37.1578\n
[...]
And, of course, my solution doesn't work.
So, is there any pattern that fits both kinds of inputs?
Thank you for your help.
P.S. I have a solution using split('\n') and indexes, but I would like to solve it with the re library:
def values(input, velocity):
    return input.split('\n')[velocity + 1]  # +1 to avoid the first row
You could use a positive lookahead to check that after your velocity there is either a period or a tab. That stops you from picking up further numbers without hardcoding that there must be a .0, so velocity 1 will match both 1 and 1.xxxxx.
import re
from typing import List

def find_by_velocity(velocity: int, data: str) -> List[str]:
    return re.findall(r"\n" + str(velocity) + r"(?=\.|\t).*", data)

data = """16\t1\t0\n1\t9.3465\t8.9406\t35.9604\n2\t10.4654\t9.9456\t36.9107\n3\t11.1235\t10.9378\t37.1578\n16\t1\t0\n1.0000\t9.3465\t8.9406\t35.9604\n2.0000\t10.4654\t9.9456\t36.9107\n3.0000\t11.1235\t10.9378\t37.1578\n"""
print(find_by_velocity(1, data))
OUTPUT
['\n1\t9.3465\t8.9406\t35.9604', '\n1.0000\t9.3465\t8.9406\t35.9604']
I am trying to get the proportion of nouns in my text using the code below, and it is giving me an error. I am using a function that calculates the number of nouns in my text, and I have the overall word count in a different column.
pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS']
}

def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag, value in x.items():
            if tag in pos_family[flag]:
                cnt += value
    except:
        pass
    return cnt

df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used the nltk package to get the counts by PoS tag, and I have the counts as a dictionary in the PoS_Count column of my dataframe.
If I remove /df2['word_count'], run it once to get the noun count, and then add it back and run again, it works fine; but if I run it with the division the first time, I get the error below.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas Series, but you need a float or an int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
I'm running the following Python script:
#!/usr/bin/python
import os, sys
from scipy import stats
import numpy as np

f = open('data2.txt', 'r').readlines()
N = len(f) - 1
for i in range(0, N):
    w = f[i].split()
    l1 = w[1:8]
    l2 = w[8:15]
    list1 = [float(x) for x in l1]
    list2 = [float(x) for x in l2]
    result = stats.ttest_ind(list1, list2)
    print result[1]
However, I get an error like:
ValueError: could not convert string to float: id
I'm confused by this.
When I try this for only one line in an interactive session, instead of using the for loop in the script:
>>> from scipy import stats
>>> import numpy as np
>>> f=open('data2.txt','r').readlines()
>>> w=f[1].split()
>>> l1=w[1:8]
>>> l2=w[8:15]
>>> list1=[float(x) for x in l1]
>>> list1
[5.3209183842, 4.6422726719, 4.3788135547, 5.9299061614, 5.9331108706, 5.0287087832, 4.57...]
It works well.
Can anyone explain a little bit about this?
Thank you.
Obviously some of your lines don't have valid float data; specifically, some line has the text id, which can't be converted to float.
When you try it at the interactive prompt you are trying only one line, so the best way is to print the line number where you get this error, and then you will know the bad line, e.g.
#!/usr/bin/python
import os, sys
from scipy import stats
import numpy as np

f = open('data2.txt', 'r').readlines()
N = len(f) - 1
for i in range(0, N):
    w = f[i].split()
    l1 = w[1:8]
    l2 = w[8:15]
    try:
        list1 = [float(x) for x in l1]
        list2 = [float(x) for x in l2]
    except ValueError, e:
        print "error", e, "on line", i
        continue  # skip lines that fail to parse
    result = stats.ttest_ind(list1, list2)
    print result[1]
My error was very simple: the text file containing the data had some whitespace (so not visible) characters on the last line.
As an output of grep, I had '45 ' (with a trailing space) instead of just '45'.
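A minimal guard against that kind of invisible debris (a sketch; 'data.txt' is a placeholder name, and it assumes one value per line):

values = []
with open('data.txt') as fh:
    for line in fh:
        line = line.strip()  # drop invisible leading/trailing whitespace
        if line:             # skip lines that are whitespace only
            values.append(float(line))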
This error is pretty explicit:
ValueError: could not convert string to float: id
Somewhere in your text file, a line has the word id in it, which can't really be converted to a number.
Your test code works because the word id isn't present in line 2.
If you want to catch that line, try this code. I cleaned your code up a tad:
#!/usr/bin/python
import os, sys
from scipy import stats
import numpy as np

for index, line in enumerate(open('data2.txt', 'r').readlines()):
    w = line.split(' ')
    l1 = w[1:8]
    l2 = w[8:15]
    try:
        list1 = map(float, l1)
        list2 = map(float, l2)
    except ValueError:
        print 'Line {i} is corrupt!'.format(i=index)
        break
    result = stats.ttest_ind(list1, list2)
    print result[1]
For a Pandas dataframe with a column of numbers with commas, use this:
df["Numbers"] = [float(str(i).replace(",", "")) for i in df["Numbers"]]
So values like 4,200.42 would be converted to 4200.42 as a float.
Bonus 1: This is fast.
Bonus 2: More space efficient if saving that dataframe in something like Apache Parquet format.
Perhaps your numbers aren't actually numbers, but letters masquerading as numbers?
In my case, the font I was using meant that "l" and "1" looked very similar. I had a string like 'l1919' which I thought was '11919', and that messed things up.
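A quick way to expose such lookalikes is to print the code points of any value that refuses to convert (the sample strings below are just illustrations):

for value in ['11919', 'l1919']:
    try:
        float(value)
    except ValueError:
        # show exactly which characters make up the offending string
        print(value, [hex(ord(ch)) for ch in value])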
Your data may not be what you expect -- it seems you're expecting, but not getting, floats.
A simple solution to figuring out where this occurs would be to add a try/except to the for-loop:
for i in range(0, N):
    w = f[i].split()
    l1 = w[1:8]
    l2 = w[8:15]
    try:
        list1 = [float(x) for x in l1]
        list2 = [float(x) for x in l2]
    except ValueError, e:
        # report the error in some way that is helpful -- maybe print out i
        continue  # then skip the bad line
    result = stats.ttest_ind(list1, list2)
    print result[1]
Shortest way:
df["id"] = df['id'].str.replace(',', '').astype(float)  # if ',' is the problem
df["id"] = df['id'].str.replace(' ', '').astype(float)  # if blank space is the problem
Update empty string values with 0.0: if you know the possible non-float values, then update them first.
df.loc[df['score'] == '', 'score'] = 0.0
df['score'] = df['score'].astype(float)
I solved a similar situation with a basic technique using pandas. First load the file using pandas; it's pretty simple:
data = pd.read_excel('link to the file')
Then set the index of data to the relevant column that needs to be changed. For example, if your data has ID as one attribute or column, set the index to ID.
data = data.set_index("ID")
Then delete all the rows with "id" as the value instead of a number using the following command.
data = data.drop("id", axis=0)
Hope this will help you.