Iterating and Writing Pandas Dataframe NaNs back to MySQL - python

I'm attempting to write the results of a regression back to MySQL, but am having problems iterating through the fitted values and getting the NaNs to write as null values. Originally, I did the iteration this way:
for i in dataframe:
    cur = cnx.cursor()
    query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+(dataframe['yhat'].__str__())+" where timecount="+(dataframe['timecount'].__str__())+";")
    cur.execute(query)
    cnx.commit()
    cur.close()
...which MySQL threw back at me with:
"mysql.connector.errors.ProgrammingError: 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'NaN'"
So, I've been trying to filter out the NaNs by only asking Python to commit when yhat does not equal NaN:
for i in dataframe:
    if cleandf['yhat']>(-1000):
        cur = cnx.cursor()
        query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+(dataframe['yhat'].__str__())+" where timecount="+(dataframe['timecount'].__str__())+";")
        cur.execute(query)
        cnx.commit()
        cur.close()
But then I get this:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
So I tried to get around it with this in the syntax above:
if cleandf['yhat'][i]>(-1000):
but then get this:
ValueError: Can only tuple-index with a MultiIndex
And then I tried adding iterrows() to both, as in:
for i in dataframe.iterrows():
    if cleandf['yhat'][i]>(-1000):
but get the same problems as above.
I'm not sure what I'm doing wrong here, but assume it's something to do with how iteration works over Pandas DataFrames. And even if I get the iteration right, I still want to write NULLs into SQL wherever a NaN appears.
So, how do you think I should do this?

I don't have a complete answer, but perhaps I have some tips that might help. I believe you are thinking of your dataframe as an object similar to a SQL record set.
for i in dataframe
This will iterate over the column name strings in the dataframe. i will take on column names, not rows.
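A tiny demonstration of that (made-up column names, just to show what i becomes):
import pandas as pd
dataframe = pd.DataFrame({'yhat': [1.0, 2.0], 'timecount': [10, 20]})
for i in dataframe:
    print(i)  # prints the column labels 'yhat' and 'timecount', not rows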
dataframe['yhat']
This returns an entire column (a pandas.Series, which is backed by a numpy.ndarray), not a single value. Therefore:
dataframe['yhat'].__str__()
will give a string representation of an entire column that is useful for humans to read. It is certainly not a single value that can be converted to string for your query.
if cleandf['yhat']>(-1000)
This gives an error, because again, cleandf['yhat'] is an entire array of values, not just a single value. Think of it as an entire column, not the value from a single row.
if cleandf['yhat'][i]>(-1000):
This is getting closer, but you really want i to be an integer here, not another column name.
for i in dataframe.iterrows():
    if cleandf['yhat'][i]>(-1000):
Using iterrows seems like the right thing for you. However, i takes on an (index, row) pair for each row, not an integer that can index into a column (cleandf['yhat'] is a full column).
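A quick illustration of what iterrows actually yields (made-up data):
import pandas as pd
dataframe = pd.DataFrame({'yhat': [1.5, float('nan')], 'timecount': [1, 2]})
for row_index, row_values in dataframe.iterrows():
    # row_index is the index label; row_values is a Series holding that row's values
    print(row_index, row_values['yhat'], row_values['timecount'])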
Also, note that pandas has better ways to check for missing values than relying on a huge negative number. Try something like this:
non_missing_index = pandas.notnull(dataframe['yhat'])
cleandf = dataframe[non_missing_index]
for row in cleandf.iterrows():
    row_index, row_values = row
    query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+(row_values['yhat'].__str__())+" where timecount="+(row_values['timecount'].__str__())+";")
    execute_my_query(query)
You can implement execute_my_query better than I can, I expect. However, this solution is not quite what you want. You really want to iterate over all rows and do two types of inserts. Try this:
for row in dataframe.iterrows():
    row_index, row_values = row
    if pandas.isnull(row_values['yhat']):
        pass  # populate the 'null' update query here
    else:
        query = ("UPDATE Regression_Data.Input SET FITTEDVALUES="+(row_values['yhat'].__str__())+" where timecount="+(row_values['timecount'].__str__())+";")
        execute_my_query(query)
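For the 'null' branch, one option is to switch to parameterized queries so the connector itself turns Python None into SQL NULL. A hedged sketch, assuming the cnx connection and dataframe from your question and mysql.connector as the driver:
import pandas
def execute_my_query(cnx, query, params):
    # run one parameterized statement and commit it
    cur = cnx.cursor()
    cur.execute(query, params)
    cnx.commit()
    cur.close()
update_sql = ("UPDATE Regression_Data.Input SET FITTEDVALUES = %s "
              "WHERE timecount = %s")
for row_index, row_values in dataframe.iterrows():
    if pandas.isnull(row_values['yhat']):
        # the connector sends Python None to MySQL as NULL
        execute_my_query(cnx, update_sql, (None, row_values['timecount']))
    else:
        # cast to a plain float in case the driver rejects numpy types
        execute_my_query(cnx, update_sql, (float(row_values['yhat']), row_values['timecount']))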
Hope it helps.

Related

Add another column to existing list

I'm starting to learn Python and I'm doing an exercise where I have to save some stock data coming from a SQL query in a "rows" variable, like this:
rows = db.execute("SELECT * FROM user_quote WHERE user_id=:userid", userid=session["user_id"])
This will return 4 columns (id, user_id, symbol, name)
Then, for every row the query returns I'll get the last known price of that stock from an API, and I want to add that information to another column in my rows variable. Is there a way to do this? Should I use another approach?
Thanks for your time!
I'm not sure what type the rows variable is, but you can just add an additional column in the SELECT:
rows = db.execute("SELECT *, 0 NewCol FROM user_quote WHERE user_id=:userid", userid=session["user_id"])
Assuming rows is mutable, this will provide a placeholder for the new value.
Convert the rows tuple to a list, then you can use append() to add the price.
rows = list(rows)
rows.append(price)
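If each row behaves like a dict (as it does with CS50's db.execute), another option is to attach the price to every row as you fetch it. A rough sketch; lookup_price is a hypothetical stand-in for your API call, and db/session come from your own code:
def lookup_price(symbol):
    # hypothetical placeholder for whatever API call returns the latest price
    return 0.0
rows = db.execute("SELECT * FROM user_quote WHERE user_id=:userid",
                  userid=session["user_id"])
for row in rows:
    # add the latest known price as an extra key on each row
    row["price"] = lookup_price(row["symbol"])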

How to store pandas series to sql as a row

I have a pandas Dataframe object, and I iterate through the rows with:
for idx, row in df.iterrows():
    # do some stuff
    # save row to database
The problem is that when I try to save it to the database, to_sql treats my row as a column.
The variable row seems to be of type Series, and I did a careful search through Series.to_sql in the manual, but I don't see any way of treating it as a database row instead of a column.
The workaround I came up with is converting the Series to a DataFrame and then transposing it:
temp = pd.DataFrame(row).T
temp.to_sql(table, con=engine, if_exists='append', index_label='idx')
Is there a simpler way?
Rather than use df.iterrows, which returns indices and a Series representation of each row, one approach would be to iterate over the row positions and use integer-location based indexing to slice the data frame for row manipulation.
import pandas as pd
df = pd.DataFrame.from_dict({'a':[1,2,3],'b':[4,5,6]})
for i in range(len(df.index)):
    row = df.iloc[i:i+1,:]
    # do stuff
    row.to_sql(...)
This is the recommended way to modify your dataframe. From the df.iterrows docstring:
2. You should **never modify** something you are iterating over.
This is not guaranteed to work in all cases. Depending on the
data types, the iterator returns a copy and not a view, and writing
to it will have no effect.
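Here is a hedged, self-contained version of that loop, using a throwaway in-memory SQLite engine (my assumption, just to make the to_sql call concrete):
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite://')  # in-memory database, illustration only
df = pd.DataFrame.from_dict({'a': [1, 2, 3], 'b': [4, 5, 6]})
for i in range(len(df.index)):
    row = df.iloc[i:i + 1, :]  # a one-row DataFrame, not a Series
    row.to_sql('my_table', con=engine, if_exists='append', index=False)
print(pd.read_sql('SELECT * FROM my_table', engine))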

Splitting one large comma-separated row into many rows after a set number of values

I'm rather new to MySQL, so apologies if this is an intuitive problem; I couldn't find anything too helpful on Stack Overflow. I currently have a rather large amount of financial data in one row, with each value separated by a comma. 12 values make up one set of data, so I want to start a new row after every 12 values.
In other words, the data I have looks like this:
(open_time,open,high,low,close,volume,close_time,quotevol,trades,ignore1,ignore2,ignore3, ...repeat...)
And I'd like for it to look like:
Row1:(open_time,open,high,low,close,volume,close_time,quotevol,trades,ignore1,ignore2,ignore3)
Row2:(open_time2,open2,high2,low2,close2,volume2,close_time2,quotevol2,trades2,ignore4,ignore5,ignore6)
Row3:
...
The data is already a .sql file and I have it in a table too if that makes a difference.
To clarify, the table it is in has only one row and one column.
I don't doubt there is a way to do it in MySQL, but I would approach it by exporting the record as a .CSV.
Export to CSV
Write a simple Python script using the csv module that shifts every x fields onto a new row, using the comma as the delimiter. Afterward, you can reimport the result back into MySQL.
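A minimal sketch of such a script, assuming the export landed as one long comma-separated line in data.csv (both file names are made up):
import csv
CHUNK = 12  # one logical record is 12 fields
with open('data.csv', newline='') as src, open('rows.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    fields = next(reader)  # the single long row of values
    for start in range(0, len(fields), CHUNK):
        writer.writerow(fields[start:start + CHUNK])
The resulting rows.csv can then be loaded back into MySQL with LOAD DATA INFILE or an ordinary bulk insert.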
If I understand correctly, you want to do the following:
Get the string from the database, which is located in the first row of the first column in the query results
Break the string into "rows" of 12 values each
Be able to use this data
The way I would go about this in Python is to:
Create a mysql connection and cursor
Execute the query to pull the data from the database
Put the data from the single cell into a string
Split the string at each comma and add those values to a list
Break that list into chunks of 12 elements each
Put this data into a tabular form for easy consumption
Code:
import mysql.connector
import pandas as pd
query = '''this is your sql statement that returns everything into the first row of the first column in your query results'''
cnx = mysql.connector.connect('''enter relevant connection information here: user, password, host, and database''')
mycursor = cnx.cursor()
mycursor.execute(query)
tup = tuple(mycursor.fetchall()[0])
text = str(tup[0])
ls = text.split(',')  # converts the text into a list of values
n = 12
rows = [ls[i:i + n] for i in range(0, len(ls), n)]  # chunks of 12 values each
data = []
for row in rows:
    data.append(tuple(row))
labels = ['open_time','open','high','low','close','volume','close_time','quotevol','trades','ignore1','ignore2','ignore3']
df = pd.DataFrame.from_records(data, columns=labels)
print(df)
The list comprehension code was taken from this. You did not specify exactly how you wanted your resultant dataset, but the pandas data frame should have each of your rows.
Without an actual string or dataset, I can't confirm that this works entirely. Would you be able to give us a Minimal, Complete, and Verifiable example?

Importing SQL query into Pandas results in only 1 column

I'm trying to import the results of a complex SQL query into a pandas dataframe. My query requires me to create several temporary tables since the final result table I want includes some aggregates.
My code looks like this:
cnxn = pyodbc.connect(r'DRIVER=foo;SERVER=bar;etc')
cursor = cnxn.cursor()
cursor.execute('SQL QUERY HERE')
cursor.execute('SECONDARY SQL QUERY HERE')
...
df = pd.DataFrame(cursor.fetchall(),columns = [desc[0] for desc in cursor.description])
I get an error that tells me shapes aren't matching:
ValueError: Shape of passed values is (1,900000),indices imply (5,900000)
And indeed, the result of all the SQL queries should be a table with 5 columns rather than 1. I've run the SQL query in Microsoft SQL Server Management Studio and it works, returning the 5-column table that I want. I've also tried not passing any column names into the dataframe and printing out its head, and found that pandas has put all the information from the 5 columns into 1. The value in each row is a list of 5 values separated by commas, but pandas treats the entire list as 1 column. Why is pandas doing this? I've also tried going the pd.read_sql route, but I still get the same error.
EDIT:
I have done some more debugging, taking the comments into account. The issue doesn't appear to stem from the fact that my query is nested. I tried a simple (one line) query to return a 3 column table and I still got the same error. Printing out fetchall() looks like this:
[(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),
(str1,str2,str3,datetime.date(stuff),datetime.date(stuff)),...]
Use pd.DataFrame.from_records instead:
df = pd.DataFrame.from_records(cursor.fetchall(),
                               columns = [desc[0] for desc in cursor.description])
Simply adjust the pd.DataFrame() call, since right now pandas is treating each row returned by cursor.fetchall() as a single value. Use tuple() or list to map the child elements into their own columns:
df = pd.DataFrame([tuple(row) for row in cursor.fetchall()],
                  columns = [desc[0] for desc in cursor.description])

cursor.fetchall() in Python

After saving some data in a variable with cursor.fetchall(), it looks as follows:
mylist = [('abc1',), ('abc2',)], which is apparently a list.
That is not the issue.
The problem is that the following doesn't work:
if 'abc1' in mylist
It can't find 'abc1'. Is there a way in Python to do this easily, or do I have to use a loop?
fetchall() returns a row list, i.e., a list containing rows.
Each row is a tuple containing column values. There is a tuple even when there is only one column.
To check whether a row is in the row list, you have to check for the row instead of the column value alone:
if ('abc1',) in mylist
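Alternatively, if you only care about the column values, flatten the one-element tuples first (a small sketch):
mylist = [('abc1',), ('abc2',)]
values = [row[0] for row in mylist]  # pull the single column value out of each row
if 'abc1' in values:
    print('found it')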
This is a problem with using a select * statement.
Instead, use select col1, col2 from table_name.
The code below might help:
sql = "select col1,col2,col3 from table_name"
cursor.execute(sql) # initialize cursor in your way
input_dict = dict( (row[0],(row[1],row[2])) for row in cursor.fetchall() )
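With that dict built, a lookup would look like this (a sketch; 'some_col1_value' is a made-up key):
if 'some_col1_value' in input_dict:
    col2_val, col3_val = input_dict['some_col1_value']  # the (col2, col3) tuple for that col1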
