I'm struggling with something relatively simple. Let's say I have a table whose first column is an autoincrement primary key:
id  name   age
1   tom    22
2   harry  33
3   greg   44
4   sally  55
I want to remove row 2 and have the remaining rows automatically renumber, so it looks like this:
id  name   age
1   tom    22
2   greg   44
3   sally  55
I have tried every available bit of advice I can find online; it all involves deleting the table name from the sqlite_sequence table, which doesn't work.
Is there a simple way to achieve what I am after?
I don't see the point/need to resequence the id column, as all values there will continue to be unique, even after deleting one or more records. If you really wanted to view your data this way, you could delete the record mentioned, and then use ROW_NUMBER:
DELETE
FROM yourTable
WHERE id = 2;
SELECT ROW_NUMBER() OVER (ORDER BY id) id, name, age
FROM yourTable
ORDER BY id;
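If it helps to see that approach end to end, here is a minimal sketch using Python's sqlite3 module, assuming the bundled SQLite is 3.25+ so window functions are available; the file name example.db is just a placeholder:
import sqlite3

# Placeholder database file; substitute your own.
conn = sqlite3.connect("example.db")
cur = conn.cursor()

# Delete the unwanted row; the stored id values are left untouched.
cur.execute("DELETE FROM yourTable WHERE id = ?", (2,))
conn.commit()

# Renumber only in the query output via ROW_NUMBER().
rows = cur.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS id, name, age
    FROM yourTable
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'tom', 22), (2, 'greg', 44), (3, 'sally', 55)]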
I have a data frame containing a multi-parent hierarchy of employees. Node (int64) is a key that identifies each unique combination of employee and manager. Parent (float64) is a key that refers to the manager's entry in 'node'.
Due to some source data anomalies, there are parent keys present in the data frame that are not 'nodes'. I would like to delete all such rows where this is occurring.
empId  empName    mgrId  mgrName  node  parent
111    Alice      222    Bob      1     3
111    Alice      333    Charlie  2     4
222    Bob        444    Dave     3     5
333    Charlie    444    Dave     4     5
444    Dave                       5
555    Elizabeth  333    Charlie  6     7
In the above sample, it would be employee ID 555, because parent key 7 is not present anywhere in the 'node' column.
Here's what I tried so far:
This removes some rows but does not remove them all; I'm not sure why:
df1 = df[df['parent'].isin(df['node'])]
I thought maybe it was because 'parent' is float and 'node' is int64, so I converted it and tried again, but got the same result as before:
df1 = df[df['parent'].astype('int64').isin(df['node'])]
Something to consider is that the data frame contains around 1.5 million rows.
I also tried this, but it just keeps running forever. I assume this is because .map will loop through the entire data frame (which is around 1.5 million rows):
df[df['parent'].map(lambda x: np.isin(x, df['node']).all())]
What especially perplexes me about the first two snippets is why they consistently filter out only a small subset of the rows that do not meet the filter condition, rather than all of them.
Again, 'parent' is float64 and has empty values. 'node' is int64 and has no empty values. A more realistic example of node and parent keys is as follows:
Node - 13192210227
Parent - 12668210227.0
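For what it's worth, here is a minimal sketch of the isin-based filter, assuming the blanks in 'parent' are NaN. Since 'node' contains no NaN, rows with an empty parent fail the isin() test and are dropped by the plain filter, which may account for some of the unexpected row counts; keeping them needs an explicit isna() check. The float/int difference itself is not the issue, since isin() matches 5.0 against 5:
import numpy as np
import pandas as pd

# Toy frame mirroring the sample above (blanks represented as NaN).
df = pd.DataFrame({
    'empId':  [111, 111, 222, 333, 444, 555],
    'node':   [1, 2, 3, 4, 5, 6],
    'parent': [3.0, 4.0, 5.0, 5.0, np.nan, 7.0],
})

# Keep rows whose parent is either missing or present in 'node';
# only the row with parent 7 (empId 555) is removed.
keep = df['parent'].isna() | df['parent'].isin(df['node'])
df1 = df[keep]
print(df1)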
OK, I have a bit of a humdinger.
I have a dataframe that can have upwards of 120,000 entries.
The frames will be similar to this:
ID UID NAME DATE
1 1234 Bob 02/02/2020
2 1235 Jim 02/04/2020
3 1234 Bob
4 1234 Bob 02/02/2020
5 1236 Jan 20/03/2020
6 1235 Jim
I need to eliminate all duplicates. However, I need to check whether any of the duplicates has a date; if so, one of the ones that does have a date should be the record kept, and all the others removed. If none of the duplicates has a date, then just keep whichever is easiest.
I am struggling to come up with a way to do this elegantly.
My thought is:
Iterate through all entries; for each entry, create a temp DF and place all of its duplicates in it, iterate through THAT df, and if I find a date, save the index and then delete each entry that is not that entry. But that seems VERY bulky and slow.
Any better suggestions?
Since the blanks are the empty string '', you can do this:
(sample_df.sort_values(['UID','NAME','DATE'],
ascending=(True, True, False))
.groupby(['UID','NAME'])
.first()
.reset_index())
which gives you:
UID NAME DATE
0 1234 Bob 02/02/2020
1 1235 Jim 02/04/2020
2 1236 Jan 20/03/2020
Note the ascending flag in sort_values. Pandas sorts strings lexicographically, and the empty string '' sorts before any non-empty date, so to have non-empty DATE values come before empty ones you need to sort that column in descending order.
After sorting, you can simply group each pair of (UID, NAME) and keep the first entry.
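An equivalent sketch using drop_duplicates instead of groupby().first(), under the same assumption that the blanks in DATE are empty strings; the sample data is typed inline here for illustration:
import pandas as pd

sample_df = pd.DataFrame({
    'ID':   [1, 2, 3, 4, 5, 6],
    'UID':  [1234, 1235, 1234, 1234, 1236, 1235],
    'NAME': ['Bob', 'Jim', 'Bob', 'Bob', 'Jan', 'Jim'],
    'DATE': ['02/02/2020', '02/04/2020', '', '02/02/2020', '20/03/2020', ''],
})

# Sort so that rows with a non-empty DATE come first within each (UID, NAME)
# group, then keep only the first row of each group.
deduped = (sample_df
           .sort_values(['UID', 'NAME', 'DATE'], ascending=(True, True, False))
           .drop_duplicates(subset=['UID', 'NAME'], keep='first'))
print(deduped[['UID', 'NAME', 'DATE']])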
I have 2 data frames created from CSV files, and there is another data frame which is a reference for these tables. For example:
1 Employee demographic (Emp_id, dept_id)
2 Employee detail (Emp_id, RM_ID)
I have a 3rd dataframe (dept_manager) which has only 2 columns (dept_id, RM_ID). Now I need to join tables 1 and 2, referencing the 3rd dataframe.
I'm trying this out in pandas (Python); any help here would be much appreciated. Thanks in advance.
Table1
Empdemogr
Empid dept_id
1 10
2 20
1 30
Table2
Empdetail
Empid RM_id
1 E120
2 E140
3 E130
Table3
dept_manager
dept_id RM_id
10 E110
10 E120
10 E121
10 E122
10 E123
20 E140
20 E141
20 E142
30 E130
30 E131
30 E132
Output:
Emp_id dept_id RM_id
1 10 E120
2 20 E140
1 30 E130
So I'm trying to bring this SQL into Python:
select a.Emp_id, a.dept_id, b.RM_id
from Empdemogr a, Empdetail b, dept_manager d
where
a.Emp_id = b.Emp_id
and a.dept_id = d.dept_id
and b.RM_id = d.RM_id
I'm trying to figure out whether you have a typo or a misunderstanding: your SQL above would not output the result you are looking for based on the provided data. I do not think you will see dept_id 30 in there.
But going by your SQL query, here is how you can write the same thing with pandas DataFrames:
Preparing DataFrames (I will leave it up to you how you load the dataframes):
import pandas as pd
EmployeeDemo = pd.read_csv(r"YourEmployeeDemoFile.txt")
EmpDetail = pd.read_csv(r"YourEmpDetailFile.txt")
Dept_Manager = pd.read_csv(r"YourDept_Manager.txt")
Code to Join the DataFrames:
joined_dataframe = pd.merge(pd.merge(EmployeeDemo, EmpDetail, on="Empid"), Dept_Manager, on=["dept_id", "RM_id"])
print(joined_dataframe)
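To make the behaviour concrete, here is a self-contained sketch with the sample rows typed inline instead of read from CSV; as noted above, the dept_id 30 row drops out because Empdetail maps employee 1 to E120 while dept 30 only has E130/E131/E132:
import pandas as pd

Empdemogr = pd.DataFrame({'Empid': [1, 2, 1], 'dept_id': [10, 20, 30]})
Empdetail = pd.DataFrame({'Empid': [1, 2, 3], 'RM_id': ['E120', 'E140', 'E130']})
dept_manager = pd.DataFrame({
    'dept_id': [10, 10, 10, 10, 10, 20, 20, 20, 30, 30, 30],
    'RM_id':   ['E110', 'E120', 'E121', 'E122', 'E123',
                'E140', 'E141', 'E142', 'E130', 'E131', 'E132'],
})

# Join employees to their detail row, then keep only (dept_id, RM_id)
# pairs that actually exist in dept_manager.
joined = pd.merge(pd.merge(Empdemogr, Empdetail, on='Empid'),
                  dept_manager, on=['dept_id', 'RM_id'])
print(joined)
#    Empid  dept_id RM_id
# 0      1       10  E120
# 1      2       20  E140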
Now I have a list of tuples named "Data":
[
('171115090000',
Timestamp('2017-11-15 09:00:00'),
'PAIR1',
156.0)
]
I want to insert this list into an Oracle DB; my code is:
cur.executemany(
'''INSERT INTO A
("SID","DATE","ATT","VALUE")
VALUES(:1,:2,:3,:4)''',Data)
And it works well. However, if I want to add or replace new records in this database, I have to create a table B to put those records in, then merge A and B.
Is there anything like on duplicate key update that would let me finish the job without creating a new table?
I know I could select all records from A, convert them to a DataFrame and merge the DataFrames in Python, but is this a good solution?
Is there anything like on duplicate key update
In Oracle, it is called MERGE; have a look at the following example:
Table contents at the beginning:
SQL> select * From dept;
DEPTNO DNAME LOC
---------- -------------- -------------
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
MERGE statement:
SQL> merge into dept d
2 using (select deptno, dname, loc
3 from (select 10 deptno, 'ACC' dname, 'NY' loc from dual --> already exists, should be updated
4 union all
5 select 99 , 'NEW DEPT' , 'LONDON' from dual --> doesn't exist, should be inserted
6 )
7 ) x
8 on (d.deptno = x.deptno)
9 when matched then update set
10 d.dname = x.dname,
11 d.loc = x.loc
12 when not matched then insert (d.deptno, d.dname, d.loc)
13 values (x.deptno, x.dname, x.loc);
2 rows merged.
The result: as you can see, values for existing DEPTNO = 10 were updated, while the new DEPTNO = 99 was inserted into the table.
SQL> select * From dept;
DEPTNO DNAME LOC
---------- -------------- -------------
10 ACC NY
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
99 NEW DEPT LONDON
SQL>
I don't speak Python so I can't compose code you might use, but I hope that you'll manage to do it yourself.
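For reference, a sketch of driving such a MERGE from Python with cx_Oracle's executemany, assuming ("SID", "DATE") is the key that defines a duplicate in table A; adjust the ON clause to your real unique or primary key:
# Reuses the cur and Data objects from the question.
merge_sql = """
    MERGE INTO A t
    USING (SELECT :1 AS sid, :2 AS dt, :3 AS att, :4 AS val FROM dual) s
    ON (t."SID" = s.sid AND t."DATE" = s.dt)
    WHEN MATCHED THEN UPDATE SET t."ATT" = s.att, t."VALUE" = s.val
    WHEN NOT MATCHED THEN INSERT ("SID", "DATE", "ATT", "VALUE")
        VALUES (s.sid, s.dt, s.att, s.val)
"""
cur.executemany(merge_sql, Data)
# Remember to commit on your connection object afterwards.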
I have a pandas data-frame that looks like this:
ID Hobbby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars NaN
5 Photo Andrew
6 Football NaN
.............. 1303 rows.
The number of distinct Names filled in might be larger than 2 as well. I would like to end up with the entire Name column filled, with the missing rows split equally between the names (or +1 for one of them when the rows don't split evenly). I already store the total number of names in a variable; in the above case it's 2. I tried filtering and counting by each name, but I don't know how to do this when the number of names is dynamic.
Expected Dataframe:
ID Hobbby Name
1 Travel Kevin
2 Photo Andrew
3 Travel Kevin
4 Cars Kevin
5 Photo Andrew
6 Football Andrew
I tried: replacing NaN with 0 in the Name column using fillna, filtering the column to end up with a dataframe that has only the NaN rows, then using len(df) to get the number of NaNs, and from there creating 2 dataframes, each containing half of the rows. But I think this approach is completely wrong, as I do not always have 2 names; there could be 2, 3, 4, etc. (this is given by a dictionary).
Any help is highly appreciated.
Thanks.
It's difficult to tell, but I think you need ffill:
df['Name'] = df['Name'].ffill()
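A quick check of that suggestion against the sample frame (with NaN standing in for the missing names); ffill() just copies the most recent non-missing Name downward, which happens to reproduce the expected output here:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID':     [1, 2, 3, 4, 5, 6],
    'Hobbby': ['Travel', 'Photo', 'Travel', 'Cars', 'Photo', 'Football'],
    'Name':   ['Kevin', 'Andrew', 'Kevin', np.nan, 'Andrew', np.nan],
})

# Forward-fill: row 4 inherits Kevin (from row 3), row 6 inherits Andrew (from row 5).
df['Name'] = df['Name'].ffill()
print(df)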