Split table into 2 new tables by index with sqlite3 and Python

I have a database.db file that contains a table1 like the one below. Note that the index consists of codes, not numbers:
id | item | price
-------------
45f5 | book | 20
25h8 | copy | 30
as34 | pen | 10
t674 | key | 15
5h6f | ring | 25
67yu | mug | 40
and I would like to create two additional tables in my database.db named table2 and table3, where one contains the first 4 rows and the other the last 2 rows:
table2
id | item | price
-------------
45f5 | book | 20
25h8 | copy | 30
as34 | pen | 10
t674 | key | 15
table3
id | item | price
-------------
5h6f | ring | 25
67yu | mug | 40
I have been trying with CREATE TABLE, but table1 has too many columns to write them out one by one. What would be your approach to this problem? Thanks!
CREATE TABLE table2 AS SELECT * FROM table1 WHERE condition

SQLite does support creating a table directly from the result of a query, so the CREATE TABLE ... AS SELECT statement above should work. If you prefer to define the column names and types explicitly (see https://www.w3schools.com/sql/sql_create_table.asp), create the table first and then populate it from the existing one:
INSERT INTO table2 SELECT * FROM table1 WHERE condition;
Try the statement without the WHERE clause first to confirm the copy works.
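If you want to do the whole thing from Python, here is a minimal sqlite3 sketch. It assumes table1 is an ordinary rowid table, so "the first 4 rows" means insertion order (ORDER BY rowid); adjust the ORDER BY if you have another notion of order:

import sqlite3

con = sqlite3.connect("database.db")
cur = con.cursor()

# CREATE TABLE ... AS SELECT copies the column names and data
# without having to list every column by hand
cur.execute("CREATE TABLE table2 AS SELECT * FROM table1 ORDER BY rowid LIMIT 4")
# LIMIT -1 means "no limit" in SQLite, so this grabs everything after the first 4 rows
cur.execute("CREATE TABLE table3 AS SELECT * FROM table1 ORDER BY rowid LIMIT -1 OFFSET 4")

con.commit()
con.close()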

Related

completely remove rows whose ID is associated with more than one industry

I've seen a lot of posts on removing duplicates, but those don't apply to my case.
The idea is that I only care about whether the dataset contains IDs associated with more than one industry: if an ID has more than one industry, completely remove that ID and all rows associated with it from the dataset. Can this be done with SQL? Python?
For example:
ID | Date | Industry |
S000123 | oct/1/22 | Media |
S000123 | oct/1/22 | Education |
S000456 | oct/4/22 | Auto |
S000789 | oct/4/22 | Beverage |
becomes
ID | Date | Industry |
S000456 | oct/4/22 | Auto |
S000789 | oct/4/22 | Beverage |
This will select only the rows you are looking for:
select *
from data
where ID in (
    select ID
    from data
    group by ID
    having count(distinct Industry) <= 1
)
The inner query selects only IDs with one or fewer (in case of a NULL Industry) different values for Industry.
You can do:
delete from t
where id in (
    select id from t group by id having count(distinct industry) > 1
)
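If you would rather do it in Python, a rough pandas equivalent looks like this (a sketch built from the sample rows above; df, keep_ids and result are just illustrative names):

import pandas as pd

df = pd.DataFrame({
    "ID": ["S000123", "S000123", "S000456", "S000789"],
    "Date": ["oct/1/22", "oct/1/22", "oct/4/22", "oct/4/22"],
    "Industry": ["Media", "Education", "Auto", "Beverage"],
})

# count distinct industries per ID and keep only the IDs with at most one
industry_counts = df.groupby("ID")["Industry"].nunique()
keep_ids = industry_counts[industry_counts <= 1].index
result = df[df["ID"].isin(keep_ids)]
print(result)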

How to create a table from another table with GridDB?

I have a GridDB container where I have stored my database. I want to copy the table, but excluding a few columns. The function I need should extract all columns matching a given keyword and then create a new table from them. It must always include the first column, *id, because it is needed in every table.
For example, in the table given below:
| employee_id | department_id | employee_first_name | employee_last_name | employee_gender |
|-------------|---------------|---------------------|--------------------|-----------------|
| 1           | 1             | John                | Matthew            | M               |
| 2           | 1             | Alexandra           | Philips            | F               |
| 3           | 2             | Hen                 | Lotte              | M               |
Suppose I need to get the first column and every other column starting with "employee". How can I do this through a Python function?
I am using the GridDB Python client on my Ubuntu machine, and I have already stored the database.csv file in the container. Thanks in advance for your help!
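The column-selection logic itself can be sketched in plain pandas. Note that this is not the GridDB client API, only the filtering step, and it assumes the rows have already been read into a DataFrame (for example from database.csv):

import pandas as pd

def select_matching_columns(df, keyword):
    # always keep the first column (the *_id column), plus every other
    # column whose name starts with the given keyword
    cols = [df.columns[0]] + [c for c in df.columns[1:] if c.startswith(keyword)]
    return df[cols]

# sample data matching the table above
df = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "department_id": [1, 1, 2],
    "employee_first_name": ["John", "Alexandra", "Hen"],
    "employee_last_name": ["Matthew", "Philips", "Lotte"],
    "employee_gender": ["M", "F", "M"],
})
print(select_matching_columns(df, "employee"))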

update with csv file using python

I have to update the database from CSV files. Suppose the database table looks like this:
The CSV file data looks like this:
As you can see, in the CSV file some data has been modified and some new records have been added. What I am supposed to do is update only the data that was modified and insert the new records that were added.
In Table2 the first record of col2 is modified. I need to update only the first record of col2 (i.e. AA), not the whole of col2.
I could do this by hardcoding, but I don't want to hardcode it because I need to do this for 2000 tables.
Can anyone suggest the steps to approach my goal?
Here is my code snippet:
import pandas as pd

# engine is a SQLAlchemy engine for the target database, created elsewhere
df = pd.read_csv('F:\\filename.csv', sep=",", header=0, dtype=str)
sql_query2 = engine.execute('''
SELECT
*
FROM ttcmcs023111temp
''')
df2 = pd.DataFrame(sql_query2)
df.update(df2)
Since I do not have data similar to yours, I used my own DB.
The schema of my books table is as follows:
+--------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id | int(11) | NO | PRI | NULL | |
| name | varchar(30) | NO | | NULL | |
| author | char(30) | NO | | NULL | |
+--------+-------------+------+-----+---------+-------+
And the table looks like this:
+----+--------------------+------------------+
| id | name | author |
+----+--------------------+------------------+
| 1 | Origin | Dan Brown |
| 2 | River God | Wilbur Smith |
| 3 | Chromosome 6 | Robin Cook |
| 4 | Where Eagles Dare | Alistair Maclean |
| 5 | The Seventh Scroll | Dan Brown | ### Added wrong entry to prove
+----+--------------------+------------------+ ### my point
So, my approach is to use Python to create, from the CSV, a new temporary table with the same schema as the books table.
The code I used is as follows:
import sqlalchemy

# db_connection is an open SQLAlchemy connection; csv_df is the DataFrame read from the CSV
sql_query = sqlalchemy.text("CREATE TABLE temp (id int primary key, name varchar(30) not null, author varchar(30) not null)")
result = db_connection.execute(sql_query)
csv_df.to_sql('temp', con=db_connection, index=False, if_exists='append')
Which creates a table like this:
+----+--------------------+------------------+
| id | name | author |
+----+--------------------+------------------+
| 1 | Origin | Dan Brown |
| 2 | River God | Wilbur Smith |
| 3 | Chromosome 6 | Robin Cook |
| 4 | Where Eagles Dare | Alistair Maclean |
| 5 | The Seventh Scroll | Wilbur Smith |
+----+--------------------+------------------+
Now, you just need to run an UPDATE with an INNER JOIN in MySQL to update the values you want in your original table (in my case, books).
Here's how you'll do this:
statement = '''update books b
inner join temp t
on t.id = b.id
set b.name = t.name,
b.author = t.author;
'''
db_connection.execute(statement)
This query will update the values in table books from the table temp that I've created using the CSV.
You can destroy the temp table after updating the values.
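For example, reusing the db_connection from above:

db_connection.execute(sqlalchemy.text("DROP TABLE temp"))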

SQL - Conditionally join and replace values between two tables

I have two tables where one is holding "raw" data and another is holding "updated" data. The updated data just contains corrections of rows from the first table, but is essentially the same. It is a functional requirement for this data to be stored separately.
I want a query with the following conditions:
Select all rows from the first table
If there is a matching row in the second table (i.e. when raw_d.primary_key_col_1 = edit_d.primary_key_col_1 and raw_d.primary_key_col_2 = edit_d.primary_key_col_2), use the most recent values from the second table rather than the first, where "most recent" is determined by primary_key_col_3
Otherwise we use the values from the first table.
Note: I have many more "value" columns in the actual data. Consider the following toy example with two tables, raw_d and edit_d:
primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------------+-------------------------+-------------------+-------------------
src_1 | dest_1 | 0 | 1
src_2 | dest_2 | 5 | 4
src_3 | dest_3 | 2 | 2
src_4 | dest_4 | 6 | 3
src_5 | dest_5 | 9 | 9
primary_key_col_1 | primary_key_col_2 | primary_key_col_3 | value_col_1 | value_col_2
-------------------------+-------------------------+-------------------------+-------------------+-------------------
src_1 | dest_1 | 2020-05-09 | 7 | 0
src_2 | dest_2 | 2020-05-08 | 6 | 1
src_3 | dest_3 | 2020-05-07 | 5 | 2
src_1 | dest_1 | 2020-05-08 | 3 | 4
src_2 | dest_2 | 2020-05-09 | 2 | 5
The expected result is as given:
primary_key_col_1 | primary_key_col_2 | value_col_1 | value_col_2
-------------------------+-------------------------+-------------------+-------------------
src_1 | dest_1 | 7 | 0
src_2 | dest_2 | 2 | 5
src_3 | dest_3 | 5 | 2
src_4 | dest_4 | 6 | 3
src_5 | dest_5 | 9 | 9
My proposed solution is to query the "greatest n per group" from the second table and then "overwrite" the matching rows of the first table, using Pandas.
The first query would just grab data from the first table:
SELECT * FROM raw_d
The second query to select "the greatest n per group" would be as follows:
SELECT DISTINCT ON (primary_key_col_1, primary_key_col_2) * FROM edit_d
ORDER BY primary_key_col_1, primary_key_col_2, primary_key_col_3 DESC;
I planned on merging the data as in "Replace column values based on another dataframe python pandas - better way?".
Does anyone know a better solution, preferably using SQL only? For reference, I am using PostgreSQL and Pandas as part of my data stack.
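For reference, the pandas version I had in mind would look roughly like this (only a sketch, using the toy tables above):

import pandas as pd

raw_d = pd.DataFrame({
    "primary_key_col_1": ["src_1", "src_2", "src_3", "src_4", "src_5"],
    "primary_key_col_2": ["dest_1", "dest_2", "dest_3", "dest_4", "dest_5"],
    "value_col_1": [0, 5, 2, 6, 9],
    "value_col_2": [1, 4, 2, 3, 9],
})
edit_d = pd.DataFrame({
    "primary_key_col_1": ["src_1", "src_2", "src_3", "src_1", "src_2"],
    "primary_key_col_2": ["dest_1", "dest_2", "dest_3", "dest_1", "dest_2"],
    "primary_key_col_3": ["2020-05-09", "2020-05-08", "2020-05-07", "2020-05-08", "2020-05-09"],
    "value_col_1": [7, 6, 5, 3, 2],
    "value_col_2": [0, 1, 2, 4, 5],
})

# keep only the latest edit per (primary_key_col_1, primary_key_col_2)
latest_edits = (
    edit_d.sort_values("primary_key_col_3", ascending=False)
          .drop_duplicates(["primary_key_col_1", "primary_key_col_2"])
          .drop(columns="primary_key_col_3")
)

# left-join the edits onto the raw rows, then prefer the edited values where present
merged = raw_d.merge(latest_edits, on=["primary_key_col_1", "primary_key_col_2"],
                     how="left", suffixes=("", "_edit"))
for col in ["value_col_1", "value_col_2"]:
    merged[col] = merged[col + "_edit"].combine_first(merged[col])
result = merged[raw_d.columns]
print(result)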
I would suggest phrasing the requirements as:
select the most recent row from the second table
bring in additional rows from the first table that don't match
This is a union all with distinct on:
(select distinct on (primary_key_col_1, primary_key_col_2)
        u.primary_key_col_1, u.primary_key_col_2, u.value_col_1, u.value_col_2
 from updated u
 order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
) union all
select r.primary_key_col_1, r.primary_key_col_2, r.value_col_1, r.value_col_2
from raw r
where not exists (select 1
                  from updated u
                  where u.primary_key_col_1 = r.primary_key_col_1 and
                        u.primary_key_col_2 = r.primary_key_col_2
                 );
As I understand from your question, there are two ways to solve this:
1. Using FULL OUTER JOIN
with cte as (
    select distinct on (primary_key_col_1, primary_key_col_2) *
    from edit_d
    order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc
)
select
    coalesce(t1.primary_key_col_1, t2.primary_key_col_1),
    coalesce(t1.primary_key_col_2, t2.primary_key_col_2),
    coalesce(t1.value_col_1, t2.value_col_1),
    coalesce(t1.value_col_2, t2.value_col_2)
from cte t1
full outer join raw_d t2
    on t1.primary_key_col_1 = t2.primary_key_col_1
    and t1.primary_key_col_2 = t2.primary_key_col_2
2. Using Union
select distinct on (primary_key_col_1, primary_key_col_2)
    primary_key_col_1, primary_key_col_2, value_col_1, value_col_2
from (
    select * from edit_d
    union all
    select primary_key_col_1, primary_key_col_2, null as "primary_key_col_3",
           value_col_1, value_col_2
    from raw_d
    order by primary_key_col_1, primary_key_col_2, primary_key_col_3 desc nulls last
) tab

SQLAlchemy: Insert or Update when column value is a duplicate

I have a table A with the following columns:
id UUID
str_identifier TEXT
num FLOAT
and a table B with similar columns:
str_identifier TEXT
num FLOAT
entry_date TIMESTAMP
I want to construct a sqlalchemy query that does the following:
finds entries in table B that do not yet exist in table A, and inserts them
finds entries in table B that do exist in table A but have a different value for the num column, and updates them
The catch is that table B has the entry_date column, and as a result can have multiple entries with the same str_identifier but different entry dates. So I always want to perform this insert/update query using the latest entry for a given str_identifier (if it has multiple entries in table B).
For example, if before the query runs tables A and B are:
[A]
| id | str_identifier | num |
|-----|-----------------|-------|
| 1 | str_id_1 | 25 |
[B]
| str_identifier | num | entry_date |
|----------------|-----|------------|
| str_id_1 | 89 | 2020-07-20 |
| str_id_1 | 25 | 2020-06-20 |
| str_id_1 | 50 | 2020-05-20 |
| str_id_2 | 45 | 2020-05-20 |
After the update query, table A should look like:
[A]
| id | str_identifier | num |
|-----|-----------------|-----|
| 1 | str_id_1 | 89 |
| 2 | str_id_2 | 45 |
The query I've constructed so far should detect differences, but will adding order_by(B.entry_date.desc()) ensure I only do the existence comparisons against the latest entry for each str_identifier?
My Current Query
query = (
    select([B.str_identifier, B.value])
    .select_from(
        join(B, A, onclause=B.str_identifier == A.str_identifier, isouter=True)
    )
    .where(
        and_(
            ~exists().where(
                and_(
                    B.str_identifier == A.str_identifier,
                    B.value == A.value,
                    ~B.value.in_([None]),
                )
            )
        )
    )
)
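One way to make sure the comparison only sees the newest row per str_identifier is to filter B against a grouped subquery rather than relying on order_by. This is only a sketch (untested, and the tuple IN form assumes a backend such as PostgreSQL that supports it):

from sqlalchemy import select, func, tuple_

# newest entry_date per str_identifier in table B
latest_per_id = (
    select([B.str_identifier, func.max(B.entry_date)])
    .group_by(B.str_identifier)
)

# restrict B to its latest rows before comparing with A
latest_b_filter = tuple_(B.str_identifier, B.entry_date).in_(latest_per_id)
# adding .where(latest_b_filter) to the query above limits the
# existence/difference checks to the most recent row for each str_identifier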
