Syntax error while trying to build dataframe in Python

I am trying to create a function in Python 3 that builds a dataframe from a CSV file. However, I keep getting a syntax error that points at
y = (data_df["Status"].replace("underperform",0).replace("outperform",1).values.tolist())
even though that line never runs, because I never actually call the function. Here is all of my code.
def Build_Data_Set(features = ["DE Ratio","Trailing P/E"]):
    data_df = pd.read_csv("key_stats.csv") #created in other file
    X = np.array(data_df[features].values#.tolist())
    y = (data_df["Status"].replace("underperform",0).replace("outperform",1).values.tolist())
    return X,y
How should I go about fixing this error?

You're missing a closing parenthesis on the line X = np.array(data_df[features].values#.tolist()) - the parenthesis is there, but it's commented out by the # sign.
Your Python interpreter does not know that you wanted the expression to end there and keeps searching for a closing parenthesis. Before it finds one, it stumbles over the assignment on the next line, which is illegal inside an open parenthesis and causes your syntax error. That is also why the error is reported on a line that "never runs": syntax errors are raised at parse time, before any code executes.
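The effect can be reproduced in miniature by feeding the same shape of mistake to compile(), so the parse failure is observable without touching the original file (the variable names here are placeholders, not the question's data):

```python
# A miniature of the same mistake: the "#" comments out the closing
# parenthesis, so the "(" on the first line is never closed and the
# parser runs on into the next line looking for it.
src = (
    "X = np.array(data#.tolist())\n"
    "y = 1\n"
)
try:
    compile(src, "<example>", "exec")
    raised = False
except SyntaxError:
    raised = True
print(raised)  # True: the two lines fail to parse as a unit
```

Note that compile() raises before anything is executed, which matches the behavior described in the question.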

You can simply do:
def Build_Data_Set(features=["DE Ratio", "Trailing P/E"]):
    data_df = pd.read_csv("key_stats.csv")  # created in other file
    X = data_df[features].values  # .values already returns an array
    y = data_df["Status"].replace({"underperform": 0, "outperform": 1}).values
    return X, y

X, y = Build_Data_Set()
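The dict form of replace used above can be seen in isolation with toy data standing in for key_stats.csv (hypothetical values, just to show the mapping):

```python
import pandas as pd

# Toy frame standing in for key_stats.csv
data_df = pd.DataFrame({"Status": ["underperform", "outperform", "outperform"]})

# One dict-based replace instead of two chained .replace() calls
y = data_df["Status"].replace({"underperform": 0, "outperform": 1}).values
print(list(y))  # [0, 1, 1]
```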

Related

How do I solve "SyntaxError: invalid character in identifier" with regards to df.iloc

df = pd.read_csv("foo.csv")
# drop NaN
df = df.dropna()
df.iloc[0,0] = 1
# drop time
for i in range(3):
    df.iloc[0,0] = 2
    df.iloc[:,i] = datetime.strptime(df.iloc[:,i], "%Y-%m-%dT%H:%M:%S%z")
I get this error:
File "accuracy.py", line 19
    df.iloc[0,0] = 2
    ^
SyntaxError: invalid character in identifier
I want to modify every value in the dataframe, so that I can work with the dates in there. However, when I use df.iloc in the for-loop, I get a similar SyntaxError to this one. I tried it again with simpler code, hence the df.iloc[0,0] = 2, but it gives the same error I would get from the other line. I also tried the code outside of the loop, and that does not raise an error, so I'm confused.
I'm working from my terminal since I can't download the csv I'm working on, which makes debugging a little more interesting. Mostly because I cannot print things if the file does not compile.
Any idea what the problem could be?
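"invalid character in identifier" is usually caused by a non-ASCII look-alike character (a non-breaking space, smart quote, or full-width symbol) pasted into the source. Since the file only fails at compile time, a hedged way to hunt for the offender is to scan the source for any character outside the ASCII range (the file name accuracy.py comes from the traceback; substitute your own):

```python
import os

def find_non_ascii(path):
    # Report (line, column, character, codepoint) for every non-ASCII
    # character in the file - these are the usual suspects behind
    # "invalid character in identifier".
    hits = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            for col, ch in enumerate(line, start=1):
                if ord(ch) > 127:
                    hits.append((lineno, col, ch, hex(ord(ch))))
    return hits

if os.path.exists("accuracy.py"):
    for hit in find_non_ascii("accuracy.py"):
        print(hit)
```

Retyping the flagged line by hand (rather than re-pasting it) usually clears the error.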

Writing a function that will normalize using either min max method or z-score method

I am fairly new to Python, so there may be a lot to improve upon, but in the following code I am trying to write a function that takes in the location of the data file, the attribute that has to be normalized, and the type of normalization to be performed ('min_max' or 'z_score').
After this, based on the normalization type that is given, I want it to apply the appropriate formula and return a dictionary where key = original value in the dataset and value = normalized value.
def normalization(fname, attr, normType):
    result = {}
    df = pd.read_csv(fname)
    targ = list(df[df.columns[attr]])
    scaler = MinMaxScaler()
    df["minmax"] = scaler.fit.transform(df[[df.columns[attr]]])
    df["zscore”] = ((df[[df.columns[attr]]]) - (df[[df.columns[attr.mean()]]]))/ (df[[df.columns[attr.std(ddof=1)]]])
    if normType == "min_max":
        result = dict(zip(targ, df.minmax.values.tolist())
    else:
        result = dict(zip(targ, df.zscore.values.tolist())
    return result
I continually get an error specifically on the line with the zscore calculation and have been struggling to troubleshoot it. I would appreciate any help that could point me in the right direction.
Thanks
Edit: Error message shown is
"SyntaxError: EOL while scanning string literal"
"zscore” alone causes that error. The problem is that the ” is a typographic (curly) quote, not a proper ASCII double-quote character, so the string literal is never terminated. Not sure how it got there; most likely bad formatting from a word processor or document while pasting code around. The fix: "zscore". (There are more bugs waiting once that one is fixed - scaler.fit.transform should be scaler.fit_transform, and both dict(zip(...)) lines are missing a closing parenthesis - but the EOL error comes from the quote.)
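For reference, here is a minimal corrected sketch of the whole function, assuming (as the question's code suggests) that attr is the column's integer position. It uses the plain formulas in place of MinMaxScaler, which gives the same result for min-max scaling, and only computes the normalization that was requested:

```python
import pandas as pd

def normalization(fname, attr, normType):
    # attr is assumed to be the column's integer position, as in the question
    df = pd.read_csv(fname)
    s = df[df.columns[attr]]
    if normType == "min_max":
        norm = (s - s.min()) / (s.max() - s.min())  # same result as MinMaxScaler
    else:  # "z_score"
        norm = (s - s.mean()) / s.std(ddof=1)       # sample standard deviation
    # key = original value, value = normalized value
    return dict(zip(s, norm))
```

Note that using the original values as dictionary keys silently merges duplicates; that matches the question's spec, but is worth knowing.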

How to generate the Python lambda filter codes in the for loop?

I am a beginner with Python lambdas, and I am trying to convert a Python for loop into a lambda expression. First I would like to explain the for-loop lines.
fred = Fred2Hdfs()  # construct the python imported objects
for i, state in enumerate(us_states):
    df_unemployee_annual = fred.getFredDF('A', state, 'search_text')  # generate dataframe from the object
    if df_unemployee_annual is None:
        continue
    if i == 0:
        fred.writeCsv2Hdfs('unemployee_annual.csv', df_unemployee_annual)  # write dataframe
    else:
        fred.appendCsv2Hdfs('unemployee_annual.csv', df_unemployee_annual)  # append dataframe
The above code works successfully without errors. Below is the lambda-based code I am trying to convert it to.
fred = Fred2Hdfs()
freq='A'
str='search_text'
result_df_list = list(map(lambda state: fred.getFredDF(freq, state, str), us_states))
result_df_list = list(filter(lambda df: df is not None, result_df_list))
print(result_df_list) # codes work correctly until this line.
#func=map(lambda df:fred.writeCsv2Hdfs('unemployee_annual_.csv', df) , result_df_list)
I am stuck on the if i == 0: line of the for loop. How can I express that condition as a Python lambda? I'm afraid I have no idea how to implement this kind of if filter with lambdas.
Tuple unpacking in a lambda's parameter list (lambda (i, df): ...) was removed in Python 3, so index into the tuple instead. Also, map is lazy in Python 3: wrap it in list(...) or the calls never execute.
list(map(lambda t: fred.writeCsv2Hdfs('unemployee_annual_.csv', t[1]) if t[0] == 0
         else fred.appendCsv2Hdfs('unemployee_annual_.csv', t[1]),
         enumerate(result_df_list)))
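The same pattern in a self-contained sketch, with stub write/append functions standing in for the Fred2Hdfs methods (the names and "dataframes" here are placeholders):

```python
calls = []  # record of what would be written to HDFS

def write_csv(name, df):   # stand-in for fred.writeCsv2Hdfs
    calls.append(("write", name, df))

def append_csv(name, df):  # stand-in for fred.appendCsv2Hdfs
    calls.append(("append", name, df))

result_df_list = ["df_AK", "df_AL", "df_AR"]  # placeholder dataframes

# First element is written, the rest are appended; list() forces the lazy map.
list(map(lambda t: write_csv("unemployee_annual_.csv", t[1]) if t[0] == 0
         else append_csv("unemployee_annual_.csv", t[1]),
         enumerate(result_df_list)))

print([c[0] for c in calls])  # ['write', 'append', 'append']
```

That said, using map purely for side effects is generally considered unidiomatic; the original for loop is arguably the clearer solution here.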

rpy2 code snippet returns an empty object

I am using rpy2 to use a R library in Python. The library has a function prebas() that returns an array in which the item with index [8] is the output. When I write this output to CSV within the R code snippet, everything works as expected (the output is a CSV over 200kB). However, when I return the same object (PREBASout[8]), it returns an empty object. So, obviously, when I write that object to CSV, the file is empty.
run_prebasso = robjects.r('''
weather <- read.csv("/home/example_inputs/weather.csv",header = T)
PAR = c(weather$PAR,weather$PAR,weather$PAR)
TAir = c(weather$TAir,weather$TAir,weather$TAir)
Precip = c(weather$Precip,weather$Precip,weather$Precip)
VPD = c(weather$VPD,weather$VPD,weather$VPD)
CO2 = c(weather$CO2,weather$CO2,weather$CO2)
DOY = c(weather$DOY,weather$DOY,weather$DOY)
library(Rprebasso)
PREBASout = prebas(nYears = 100, PAR=PAR,TAir=TAir,VPD=VPD,Precip=Precip,CO2=CO2)
write.csv(PREBASout[8],"/home/outputs/written_in_r.csv",row.names = F)
PREBASout[8]
''')
r_write_csv = robjects.r['write.csv']
r_write_csv(run_prebasso, "/home/outputs/written_in_py.csv")
This is what the code snippet returns:
(Pdb) run_prebasso
<rpy2.rinterface.NULLType object at 0x7fc1b31e6b48> [RTYPES.NILSXP]
Question: Why aren't written_in_py.csv and written_in_r.csv the same?
I have just found the bug. The problem was in the line
write.csv(PREBASout[8],"/home/outputs/written_in_r.csv",row.names = F)
The value of this statement (write.csv returns NULL) was what ended up being returned, instead of what I wanted (PREBASout[8]). When I removed the line or assigned its result to a variable, everything worked as expected.

Having an issue with using median function in numpy

I am having an issue with using the median function in numpy. The code used to work on a previous computer, but when I tried to run it on my new machine, I got the error "cannot perform reduce with flexible type". To try to fix this, I used the map() function to make sure my list was floating point, and got this error message: could not convert string to float: .
After some more attempts at debugging, it seems that my issue is with the splitting of the lines in my input file. The lines are of the form 2456893.248202,4.490 and I want to split on the ",". However, when I print out the list for the second column of that line, I get
4
.
4
9
0
so it somehow seems to be splitting on each character, though I'm not sure how. The relevant section of code is below; I appreciate any thoughts or ideas, and thanks in advance.
def curve_split(fn):
    with open(fn) as f:
        for line in f:
            line = line.strip()
            time,lc = line.split(",")
    #debugging stuff
    g=open('test.txt','w')
    l1=map(lambda x:x+'\n',lc)
    g.writelines(l1)
    g.close()
    #end debugging stuff
    return time,lc

if __name__ == '__main__':
    # place where I keep the lightcurve files from the image subtraction
    dirname = '/home/kuehn/m4/kepler/subtraction/detrending'
    files = glob.glob(dirname + '/*lc')
    print(len(files))
    # in order to create our lightcurve array, we need to know
    # the length of one of our lightcurve files
    lc0 = curve_split(files[0])
    lcarr = np.zeros([len(files),len(lc0)])
    # loop through every file
    for i,fn in enumerate(files):
        time,lc = curve_split(fn)
        lc = map(float, lc)
        # debugging
        print(fn[5:58])
        print(lc)
        print(time)
        # end debugging
        lcm = lc/np.median(float(lc))
        #lcm = ((lc[qual0]-np.median(lc[qual0]))/
        #       np.median(lc[qual0]))
        lcarr[i] = lcm
        print(fn,i,len(files))
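For what it's worth, the per-character output in the question is exactly what mapping over a string produces: iterating a string yields its individual characters, and curve_split returns the last line's second field as a string, not a list of floats. A small illustration of the difference:

```python
lc = "4.490"  # a single value taken from a line like "2456893.248202,4.490"

# Iterating a string walks over its characters, one at a time ...
chars = list(map(lambda x: x + '\n', lc))
print(chars)  # ['4\n', '.\n', '4\n', '9\n', '0\n']

# ... whereas the single value should be converted as a whole.
print(float(lc))  # 4.49
```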
