insert list as a new column mysql - python

I have a list of string values new_values and I want to insert it as a new column in the companies table of my MySQL database. Since I have hundreds of rows, I cannot manually type them using the ? syntax that I came across on SO.
import MySQLdb
cursor = db.cursor()
cursor.execute("INSERT INTO companies ....")
lst_to_add = ["name1", "name2", "name3"]
db.commit()
db.close()
However, I am not sure what query I should use to pass in my list, or what the correct syntax is to include the new column name (e.g. "newCol") in the query.
Edit:
current table:
id originalName
1 Hannah
2 Joi
3 Kale
expected output:
id originalName fakeName
1 Hannah name1
2 Joi name2
3 Kale name3

Use ALTER TABLE to add the new column.
cursor.execute("ALTER TABLE companies ADD COLUMN fakeName VARCHAR(100)")
Then loop through the data, updating each row. You need Python data that indicates the mapping from original name to fake name. You can use a dictionary, for example.
names_to_add = {
    "Hannah": "name1",
    "Joi": "name2",
    "Kale": "name3"
}
for oldname, newname in names_to_add.items():
    cursor.execute("UPDATE companies SET fakeName = %s WHERE originalName = %s", (newname, oldname))
db.commit()  # persist the updates


Get the most common word in a MySQL table using Python

I have a table full of movie genres, like this:
id | genre
---+----------------------------
1 | Drama, Romance, War
2 | Drama, Musical, Romance
3 | Adventure, Biography, Drama
I'm looking for a way to get the most common word in the whole genre column and return it to a variable for further steps in Python.
I'm new to Python, so I really don't know how to do it. Currently, I have these lines to connect to the database, but I don't know how to get the most common word mentioned above.
conn = mysql.connect()
cursor = conn.cursor()
most_common_word = cursor.execute()
cursor.close()
conn.close()
First you need to get the list of words in each row, i.e. create another table like
genre_words(genre_id bigint, word varchar(50))
For clues on how to do that, you may check this question:
SQL split values to multiple rows
You can do that as a temporary table if you wish, or use a transaction and roll it back. Which one to choose depends on your data size and the machine the database runs on. A sketch of the split step is shown below.
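For instance, assuming the source table is called movies(id, genre) and no row holds more than five comma-separated genres (both assumptions, adjust to your schema), the split can be done with SUBSTRING_INDEX:
cursor.execute("CREATE TABLE genre_words (genre_id bigint, word varchar(50))")
# The derived numbers table (1..5) drives one output row per comma-separated value
cursor.execute("""
    INSERT INTO genre_words (genre_id, word)
    SELECT m.id,
           TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(m.genre, ',', n.n), ',', -1))
    FROM movies m
    JOIN (SELECT 1 AS n UNION SELECT 2 UNION SELECT 3
          UNION SELECT 4 UNION SELECT 5) n
      ON n.n <= 1 + LENGTH(m.genre) - LENGTH(REPLACE(m.genre, ',', ''))
""")
conn.commit()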
After that, the query is really simple:
select count(*) as c, word from genre_words group by word order by count(*) desc limit 1;
You can also do it in Python, but then it is not really a MySQL question at all: read the table and keep a word-to-counter mapping, adding a word when it is new and increasing its counter when it already exists.
from collections import Counter
# Connect to database and get rows from table
rows = ...
# Create a list to hold all of the genres
genres = []
# Loop through each row, split the genre string on commas and strip the
# surrounding whitespace to get a list of individual genres
for row in rows:
    genre_list = [g.strip() for g in row['genre'].split(',')]
    genres.extend(genre_list)
# Use a Counter to count the number of occurrences of each genre
genre_counts = Counter(genres)
# Get the most common genre
most_common_genre = genre_counts.most_common(1)
# Print the most common genre
print(most_common_genre)
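The rows = ... line above is deliberately left open. One way to fill it, assuming the MySQLdb driver and a table called movies with a genre column (both assumptions), is a DictCursor so each row can be read as row['genre']:
import MySQLdb
import MySQLdb.cursors

# Assumption: MySQLdb driver, database mydb, table movies(genre)
conn = MySQLdb.connect(host="localhost", user="user", passwd="password",
                       db="mydb", cursorclass=MySQLdb.cursors.DictCursor)
cursor = conn.cursor()
cursor.execute("SELECT genre FROM movies")
rows = cursor.fetchall()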

make dataframe column become sql query statement

I am using jupyter notebook to access Teradata database.
Assume I have a dataframe
Name Age
Sam 5
Tom 6
Roy 7
I want the whole content of the "Name" column to become the WHERE condition of a SQL query.
query = '''select Age
from xxx
where Name in (Sam, Tom, Roy)'''
age = pd.read_sql(query,conn)
How can I format the column so that the whole column is inserted into the SQL statement automatically, instead of manually pasting the column content?
Join the Name column and insert it into the query using an f-string:
query = f'''select Age
from xxx
where Name in ({", ".join(df.Name)})'''
print(query)
select Age
from xxx
where Name in (Sam, Tom, Roy)
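Note that most SQL engines, Teradata included, expect string literals in an IN list to be quoted, so a safer variant is to generate placeholders and pass the values as parameters. A minimal sketch, assuming your driver uses qmark-style ? placeholders (check its paramstyle):
placeholders = ", ".join(["?"] * len(df.Name))
query = f"select Age from xxx where Name in ({placeholders})"
# pd.read_sql forwards params to the underlying driver
age = pd.read_sql(query, conn, params=list(df.Name))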

Parameterizing ON DUPLICATE KEY UPDATE

So I have a SQL query run in Python that adds data to a database, but if there is a duplicate key I just want it to update a couple of fields. The data that I am using is around 30 columns, and I am wondering if there is a way to do this.
data = [3, "hello", "this", "is", "random", "data",.......,44] #this being 30 items long
car_placeholder = ",".join(['%s'] * len(data))
qry = f"INSERT INTO car_sales_example VALUES ({car_placeholder}) ON DUPLICATE KEY UPDATE
Price = {data[15]}, IdNum = {data[29]}"
cursor.execute(qry, data)
conn.commit()
I want to be able to add an entry if the key doesn't exist, but if it does, update some of the columns within the entry, namely Price and IdNum, which are at odd locations in the dataset. Is this even possible?
If it is not, is there a way to update every column in the row without listing them explicitly? For example:
qry = f"INSERT INTO car_sales_example VALUES ({car_placeholder}) ON DUPLICATE KEY UPDATE
car_sales_example VALUES ({car_placeholder})"
instead of going column by column ->
ON DUPLICATE KEY UPDATE Id = %s, Name = %s, Number = %s, etc... #for 30 columns
In ON DUPLICATE KEY UPDATE you can use the VALUES() function with the name of a column to get the value that would have been inserted into that column.
ON DUPLICATE KEY UPDATE price = VALUES(price), idnum = VALUES(idnum)
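Applied to the code from the question, a minimal sketch (the Price and IdNum column names are taken from the question; note that MySQL 8.0.20+ deprecates VALUES() in this clause in favour of row aliases):
car_placeholder = ",".join(['%s'] * len(data))
qry = (
    f"INSERT INTO car_sales_example VALUES ({car_placeholder}) "
    "ON DUPLICATE KEY UPDATE Price = VALUES(Price), IdNum = VALUES(IdNum)"
)
# The same 30-item data list fills the placeholders; the UPDATE clause
# reuses the values that would have been inserted
cursor.execute(qry, data)
conn.commit()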

sqlite3.OperationalError: no such column - but I'm not asking for a column?

So, I'm trying to use sqlite3 and there seems to be a problem when I run a SELECT query. I'm not too familiar with it, so I was wondering where the problem is:
def show_items():
    var = cursor.execute("SELECT Cost FROM Items WHERE ID = A01")
    for row in cursor.fetchall():
        print(row)
When I run this (hopefully asking for a cost value where the ID = A01), I get the error:
sqlite3.OperationalError: no such column: A01
Though I wasn't asking for it to look in column A01, I was asking for it to look in column 'Cost'?
If you're looking for a string value in a column, you have to wrap it in single quotes ('), otherwise it will be interpreted as a column name:
var = cursor.execute("SELECT Cost FROM Items WHERE ID = 'A01'")
Update 2021-02-10:
Since this Q&A gets so much attention, I think it's worth editing to let you readers know about prepared statements.
Not only will they help you avoid SQL injections, they might in some cases even speed up your queries, and you will no longer have to worry about those single quotes around strings, as the DB library will take care of it for you.
Let's assume we have the query above, and our value A01 is stored in a variable value.
You could write:
var = cursor.execute("SELECT Cost FROM Items WHERE ID = '{}'".format( value ))
And as a prepared statement it will look like this:
var = cursor.execute("SELECT Cost FROM Items WHERE ID = ?", (value,))
Notice that the cursor.execute() method accepts a second parameter, which must be a sequence (either a tuple or a list). Since we have only a single value, it is easy to miss the , in (value,) that effectively turns the single value into a tuple.
If you want to use a list instead of a tuple the query would look like this:
var = cursor.execute("SELECT Cost FROM Items WHERE ID = ?", [value])
When working with multiple values, just make sure the number of ? and the number of values in your sequence match up:
cursor.execute("SELECT * FROM students WHERE ID=? AND name=? AND age=?", (123, "Steve", 17))
You could also use named-style parameters, where instead of a tuple or list, you use a dictionary as parameter:
d = { "name": "Steve", "age": 17, "id": 123 }
cursor.execute("SELECT * FROM students WHERE ID = :id AND name = :name AND age = :age", d)
If you want to delete data in SQLite:
data_list = ['1', 'apple']
cnt.execute("DELETE FROM to_do_data WHERE task='%s'" % str(data_list[1]))

how to collapse/compress/reduce string columns in pandas

Essentially, what I am trying to do is join Table_A to Table_B using a key to do a lookup in Table_B to pull column records for names present in Table_A.
Table_B can be thought of as the master name table that stores various attributes about a name. Table_A represents incoming data with information about a name.
There are two columns that represent a name: a column named 'raw_name' and a column named 'real_name'. The 'raw_name' has a code prefix (e.g. "CE993_") before the real_name.
i.e.
raw_name = CE993_VincentHanna
real_name = VincentHanna
Key = real_name, which exists in Table_A and Table_B
Please see the mySQL tables and query here: http://sqlfiddle.com/#!9/65e13/1
For all real_names in Table_A that DO-NOT exist in Table_B I want to store raw_name/real_name pairs into an object so I can send an alert to the data-entry staff for manual insertion.
For all real_names in Table_A that DO exist in Table_B, we already know about the name and can add the new raw_name associated with this real_name into our master Table_B.
In mySQL, this is easy to do, as you can see in my sqlfiddle example. I join on real_name and I compress/collapse the result by grouping on a.real_name, since I don't care if there are multiple records in Table_B for the same real_name.
All I want is to pull the attributes (stats1, stats2, stats3) so I can assign them to the newly discovered raw_name.
In the mySQL query result I can then separate the NULL records to be sent for manual data-entry and automatically insert the remaining records into Table_B.
Now, I am trying to do the same in Pandas but am stuck at the point of the groupby on real_name.
e = {'raw_name': pd.Series(['AW103_Waingro', 'CE993_VincentHanna', 'EES43_NeilMcCauley', 'SME16_ChrisShiherlis',
'MEC14_MichaelCheritto', 'OTP23_RogerVanZant', 'MDU232_AlanMarciano']),
'real_name': pd.Series(['Waingro', 'VincentHanna', 'NeilMcCauley', 'ChrisShiherlis', 'MichaelCheritto',
'RogerVanZant', 'AlanMarciano'])}
f = {'raw_name': pd.Series(['SME893_VincentHanna', 'TVA405_VincentHanna', 'MET783_NeilMcCauley',
'CE321_NeilMcCauley', 'CIN453_NeilMcCauley', 'NIPS16_ChrisShiherlis',
'ALTW12_MichaelCheritto', 'NSP42_MichaelCheritto', 'CONS23_RogerVanZant',
'WAUE34_RogerVanZant']),
'real_name': pd.Series(['VincentHanna', 'VincentHanna', 'NeilMcCauley', 'NeilMcCauley', 'NeilMcCauley',
'ChrisShiherlis', 'MichaelCheritto', 'MichaelCheritto', 'RogerVanZant',
'RogerVanZant']),
'stats1': pd.Series(['meh1', 'meh1', 'yo1', 'yo1', 'yo1', 'hello1', 'bye1', 'bye1', 'namaste1',
'namaste1']),
'stats2': pd.Series(['meh2', 'meh2', 'yo2', 'yo2', 'yo2', 'hello2', 'bye2', 'bye2', 'namaste2',
'namaste2']),
'stats3': pd.Series(['meh3', 'meh3', 'yo3', 'yo3', 'yo3', 'hello3', 'bye3', 'bye3', 'namaste3',
'namaste3'])}
df_e = pd.DataFrame(e)
df_f = pd.DataFrame(f)
df_new = pd.merge(df_e, df_f, how='left', on='real_name', suffixes=['_left', '_right'])
df_new_grouped = df_new.groupby(df_new['raw_name_left'])
Now how do I compress/collapse the groups in df_new_grouped on real_name like I did in mySQL?
Once I have an object with the collapsed results I can slice the dataframe to report real_names we don't have a record of (NULL values) and those that we already know and can store the newly discovered raw_name.
You can drop duplicates based on the raw_name_left column and also remove the raw_name_right column using drop:
In [99]: df_new.drop_duplicates('raw_name_left').drop('raw_name_right', 1)
Out[99]:
raw_name_left real_name stats1 stats2 stats3
0 AW103_Waingro Waingro NaN NaN NaN
1 CE993_VincentHanna VincentHanna meh1 meh2 meh3
3 EES43_NeilMcCauley NeilMcCauley yo1 yo2 yo3
6 SME16_ChrisShiherlis ChrisShiherlis hello1 hello2 hello3
7 MEC14_MichaelCheritto MichaelCheritto bye1 bye2 bye3
9 OTP23_RogerVanZant RogerVanZant namaste1 namaste2 namaste3
11 MDU232_AlanMarciano AlanMarciano NaN NaN NaN
Just to be thorough, this can also be done using groupby, which I found on Wes McKinney's blog, although drop_duplicates is cleaner and more efficient.
http://wesmckinney.com/blog/filtering-out-duplicate-dataframe-rows/
index = [gp_keys[0] for gp_keys in df_new_grouped.groups.values()]
unique_df = df_new.reindex(index)
unique_df
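From there, a short sketch of the slicing step described in the question (column names as defined above; rows whose stats columns are NaN after the left merge are the real_names missing from Table_B):
deduped = df_new.drop_duplicates('raw_name_left').drop('raw_name_right', axis=1)
# real_names with no match in Table_B: send these raw_name/real_name pairs to data entry
unknown = deduped[deduped['stats1'].isna()][['raw_name_left', 'real_name']]
# real_names we already know: the new raw_name can be inserted into Table_B
known = deduped[deduped['stats1'].notna()]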
