how to collapse/compress/reduce string columns in pandas - python

Essentially, what I am trying to do is join Table_A to Table_B using a key to do a lookup in Table_B to pull column records for names present in Table_A.
Table_B can be thought of as the master name table that stores various attributes about a name. Table_A represents incoming data with information about a name.
There are two columns that represent a name - a column named 'raw_name' and a column named 'real_name'. The 'raw_name' has the string "code_" before the real_name.
i.e.
raw_name = CE993_VincentHanna
real_name = VincentHanna
Key = real_name, which exists in Table_A and Table_B
Please see the mySQL tables and query here: http://sqlfiddle.com/#!9/65e13/1
For all real_names in Table_A that DO-NOT exist in Table_B I want to store raw_name/real_name pairs into an object so I can send an alert to the data-entry staff for manual insertion.
For all real_names in Table_A that DO exist in Table_B, which means we know about this name and can add the new raw_name associated with this real_name into our master Table_B
In mySQL, this is easy to do as you can see in my sqlfidde example. I join on real_name and I compress/collapse the result by groupby a.real_name since I don't care if there are multiple records in Table_B for the same real_name.
All I want is to pull the attributes (stats1, stats2, stats3) so I can assign them to the newly discovered raw_name.
In the mySQL query result I can then separate the NULL records to be sent for manual data-entry and automatically insert the remaining records into Table_B.
Now, I am trying to do the same in Pandas but am stuck at the point of groupby on real-name.
e = {'raw_name': pd.Series(['AW103_Waingro', 'CE993_VincentHanna', 'EES43_NeilMcCauley', 'SME16_ChrisShiherlis',
'MEC14_MichaelCheritto', 'OTP23_RogerVanZant', 'MDU232_AlanMarciano']),
'real_name': pd.Series(['Waingro', 'VincentHanna', 'NeilMcCauley', 'ChrisShiherlis', 'MichaelCheritto',
'RogerVanZant', 'AlanMarciano'])}
f = {'raw_name': pd.Series(['SME893_VincentHanna', 'TVA405_VincentHanna', 'MET783_NeilMcCauley',
'CE321_NeilMcCauley', 'CIN453_NeilMcCauley', 'NIPS16_ChrisShiherlis',
'ALTW12_MichaelCheritto', 'NSP42_MichaelCheritto', 'CONS23_RogerVanZant',
'WAUE34_RogerVanZant']),
'real_name': pd.Series(['VincentHanna', 'VincentHanna', 'NeilMcCauley', 'NeilMcCauley', 'NeilMcCauley',
'ChrisShiherlis', 'MichaelCheritto', 'MichaelCheritto', 'RogerVanZant',
'RogerVanZant']),
'stats1': pd.Series(['meh1', 'meh1', 'yo1', 'yo1', 'yo1', 'hello1', 'bye1', 'bye1', 'namaste1',
'namaste1']),
'stats2': pd.Series(['meh2', 'meh2', 'yo2', 'yo2', 'yo2', 'hello2', 'bye2', 'bye2', 'namaste2',
'namaste2']),
'stats3': pd.Series(['meh3', 'meh3', 'yo3', 'yo3', 'yo3', 'hello3', 'bye3', 'bye3', 'namaste3',
'namaste3'])}
df_e = pd.DataFrame(e)
df_f = pd.DataFrame(f)
df_new = pd.merge(df_e, df_f, how='left', on='real_name', suffixes=['_left', '_right'])
df_new_grouped = df_new.groupby(df_new['raw_name_left'])
Now how do I compress/collapse the groups in df_new_grouped on real-name like I did in mySQL.
Once I have an object with the collapsed results I can slice the dataframe to report real_names we don't have a record of (NULL values) and those that we already know and can store the newly discovered raw_name.

You can drop duplicates based on columns raw_name_left and also remove the raw_name_right column using drop
In [99]: df_new.drop_duplicates('raw_name_left').drop('raw_name_right', 1)
Out[99]:
raw_name_left real_name stats1 stats2 stats3
0 AW103_Waingro Waingro NaN NaN NaN
1 CE993_VincentHanna VincentHanna meh1 meh2 meh3
3 EES43_NeilMcCauley NeilMcCauley yo1 yo2 yo3
6 SME16_ChrisShiherlis ChrisShiherlis hello1 hello2 hello3
7 MEC14_MichaelCheritto MichaelCheritto bye1 bye2 bye3
9 OTP23_RogerVanZant RogerVanZant namaste1 namaste2 namaste3
11 MDU232_AlanMarciano AlanMarciano NaN NaN NaN

Just to be thorough, this can also be done using Groupby, which I found on Wes McKinney's blog although drop_duplicates is cleaner and more efficient.
http://wesmckinney.com/blog/filtering-out-duplicate-dataframe-rows/
>index = [gp_keys[0] for gp_keys in df_new_grouped.groups.values()]
>unique_df = df_new.reindex(index)
>unique_df

Related

insert list as a new column mysql

I have a list of string values new_values and I want to insert it as a new column in the companies tables my mysql database. Since I have hundreds of rows, I cannot manually type them using the ? syntax that I came across on SO.
import MySQLdb
cursor = db.cursor()
cursor.execute("INSERT INTO companies ....")
lst_to_add = ["name1", "name2", "name3"]
db.commit()
db.close()
However, i am not sure what query I should use to pass in my list and what's the correct syntax to include the new column name (eg: "newCol") into the query.
Edit:
current table:
id originalName
1 Hannah
2 Joi
3 Kale
expected output:
id originalName fakeName
1 Hannah name1
2 Joi name2
3 Kale name3
Use ALTER TABLE to add the new column.
cursor.execute("ALTER TABLE companies ADD COLUMN fakeName VARCHAR(100)")
Then loop through the data, updating each row. You need Python data that indicates the mapping from original name to fake name. You can use a dictionary, for example.
names_to_add = {
"Hannah": "name1",
"Joi": "name2",
"Kale": "name3"
}
for oldname, newname in names_to_add.items():
cursor.execute("UPDATE companies SET fakeName = %s WHERE originalName = %s", (newname, oldname))

Executing an SQL update statement from a pandas dataframe

Context: I am using MSSQL, pandas, and pyodbc.
Steps:
Obtain dataframe from query using pyodbc (no problemo)
Process columns to generate the context of a new (but already existing) column
Fill an auxilliary column with UPDATE statements (i.e. UPDATE t SET t.value = df.value FROM dbo.table t where t.ID = df.ID)
Now how do I execute the sql code in the auxilliary column, without looping through each row?
sample data
The first two columns are obtained by querying dbo.table, the third columns exists but is empty in the database. The fourth column only exists in the dataframe to prepare the SQL statement that would correspond to updating dbo.table
ID
raw
processed
strSQL
1
lorum.ipsum#test.com
lorum ipsum
UPDATE t SET t.processed = 'lorum ipsum' FROM dbo.table t WHERE t.ID = 1
2
rumlo.sumip#test.com
rumlo sumip
UPDATE t SET t.processed = 'rumlo sumip' FROM dbo.table t WHERE t.ID = 2
3
...
...
...
I would like to execute the SQL script in each row in an efficient manner.
After I recommended .executemany() in a comment to the question, a subsequent comment from #Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.
For an existing table named MillionRows
ID TextField
-- ---------
1 foo
2 bar
3 baz
…
and example data of the form
num_rows = 1_000_000
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
my test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True
crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
took about 180 seconds (3 minutes).
However, by creating a user-defined table type
CREATE TYPE dbo.TextField_ID AS TABLE
(
TextField nvarchar(255) NULL,
ID int NOT NULL,
PRIMARY KEY (ID)
)
and a stored procedure
CREATE PROCEDURE [dbo].[mr_update]
#tbl dbo.TextField_ID READONLY
AS
BEGIN
SET NOCOUNT ON;
UPDATE MillionRows SET TextField = t.TextField
FROM MillionRows mr INNER JOIN #tbl t ON mr.ID = t.ID
END
when I used
crsr.execute("{CALL mr_update (?)}", (rows,))
it did the same update in approximately 80 seconds (less than half the time).

How to use variable column name in filter in Django ORM?

I have two tables BloodBank(id, name, phone, address) and BloodStock(id, a_pos, b_pos, a_neg, b_neg, bloodbank_id). I want to fetch all the columns from two tables where the variable column name (say bloodgroup) which have values like a_pos or a_neg... like that and their value should be greater than 0. How can I write ORM for the same?
SQL query is written like this to get the required results.
sql="select * from public.bloodbank_bloodbank as bb, public.bloodbank_bloodstock as bs where bs."+blood+">0 and bb.id=bs.bloodbank_id order by bs."+blood+" desc;"
cursor = connection.cursor()
cursor.execute(sql)
bloodbanks = cursor.fetchall()
You could be more specific in your questions, but I believe you have a variable called blood which contains the string name of the column and that the columns a_pos, b_pos, etc. are numeric.
You can use a dictionary to create keyword arguments from strings:
filter_dict = {bloodstock__blood + '__gt': 0}
bloodbanks = Bloodbank.objects.filter(**filter_dict)
This will get you Bloodbank objects that have a related bloodstock with a greater than zero value in the bloodgroup represented by the blood variable.
Note that the way I have written this, you don't get the bloodstock columns selected, and you may get duplicate bloodbanks. If you want to get eliminate duplicate bloodbanks you can add .distinct() to your query. The bloodstocks are available for each bloodbank instance using .bloodstock_set.all().
The ORM will generate SQL using a join. Alternatively, you can do an EXISTS in the where clause and no join.
from django.db.models import Exists, OuterRef
filter_dict = {blood + '__gt': 0}
exists = Exists(Bloodstock.objects.filter(
bloodbank_id=OuterRef('id'),
**filter_dict
)
bloodbanks = Bloodbank.objects.filter(exists)
There will be no need for a .distinct() in this case.

Update Firebird table with data from csv Python

Let's say we have a CSV file with payment information like this:
somthing1:500.00
somthing2:300.00
somthing3:200.00
I need to update different rows in a tableB in a Firebird database that has a column called paid with the values in the second column in the CSV file. For that reason iI created a global temporary table just to pump the data in it and then to update the desired table. What I tried is this:
idd = 0
with open('file.csv', 'r', encoding='utf-8', newline='') as file:
reader = list(csv.reader(file, delimiter=':'))
for row in reader:
idd += 1
c.execute("""INSERT INTO temp_table (id,code,paid) VALUES (?,?,?)""",
(idd,str(row[0]),str(row[1]), ))
This works as expected - the table is populated. After that I tried to update other table like this:
c.execute("""select paid from temp_table;""")
res = c.fetchall()
for r in res:
c.execute(f"""UPDATE tableB SET paid = {r[0]} WHERE oid = 10;""")
This does work - for every result from the SELECT query it updates all of the rows with the same value of the next result from that. I tried with MERGE in the database itself with the same result - every row is updated with the same value from temp_table:
MERGE INTO tableB b
USING (SELECT paid FROM temp_table) e
ON b.paid = 0 AND oid = 10
WHEN MATCHED THEN
UPDATE SET b.paid = e.paid;
I need some way to update first row from tableB with first row from the CSV and so on. What am I missing?
The problem is that you are not correlating rows from the temp_table with rows in the tableB.
In your initial attempt, you update all rows in tableB with oid = 10. Similarly, in your second attempt you update all rows with paid = 0 without correlating to temp_table.
You need add a condition to match rows between the tables, say both have an id column:
MERGE INTO tableB b
USING (SELECT paid FROM temp_table) e
ON b.paid = 0 AND b.oid = 10 AND b.id = e.id
WHEN MATCHED THEN
UPDATE SET b.paid = e.paid;

How can I get column name and type from an existing table in SQLAlchemy?

Suppose I have the table users and I want to know what the column names are and what the types are for each column.
I connect like this;
connectstring = ('mssql+pyodbc:///?odbc_connect=DRIVER%3D%7BSQL'
'+Server%7D%3B+server%3D.....')
engine = sqlalchemy.create_engine(connectstring).connect()
md = sqlalchemy.MetaData()
table = sqlalchemy.Table('users', md, autoload=True, autoload_with=engine)
columns = table.c
If I call
for c in columns:
print type(columns)
I get the output
<class 'sqlalchemy.sql.base.ImmutableColumnCollection'>
printed once for each column in the table.
Furthermore,
print columns
prints
['users.column_name_1', 'users.column_name_2', 'users.column_name_3'....]
Is it possible to get the column names without the table name being included?
columns have name and type attributes
for c in columns:
print c.name, c.type
Better use "inspect" to obtain only information from the columns, with that you do not reflect the table.
import sqlalchemy
from sqlalchemy import inspect
engine = sqlalchemy.create_engine(<url>)
insp = inspect(engine)
columns_table = insp.get_columns(<table_name>, <schema>) #schema is optional
for c in columns_table :
print(c['name'], c['type'])
You could call the column names from the Information Schema:
SELECT *
FROM information_schema.columns
WHERE TABLE_NAME = 'users'
Then sort the result as an array, and loop through them.

Categories