Given a pandas dataframe df which looks like following:
p_id | sales | salesperson | year
1 | 10,000| None | 2017
2 | 15,000| None | 2016
5 | 7,000 | None | 2014
5 | 3,000 | None | 2015
There exists an SQL table, persons, which looks like the following:
p_id | p_name | from_year | to_year
1 | Brian Griffin | 2017 | Null
2 | Quagmire | 2016 | Null
5 | Cleveland | 2014 | 2015
5 | Lois Griffin | 2015 | Null
I'm trying to populate the missing data in my dataframe from the SQL table.
A p_id can be reused, as long as it is only used by one person at a time.
What I've done is the following:
for index, row in df.iterrows():
    df.at[index, 'salesperson'] = fetch_name(row['p_id'], row['year'])

def fetch_name(pid, year):
    meta = sqlalchemy.MetaData()
    persons = sqlalchemy.Table('persons', meta, autoload=True, autoload_with=data_engine)
    stmt = sqlalchemy.select([persons.c.p_name]).where(
        and_(persons.c.p_id == pid,
             and_(year >= persons.c.from_year,
                  or_(year < persons.c.to_year, persons.c.to_year.is_(None)))))
    name = data_engine.execute(stmt).scalar()
    return name
This works fine but it's very slow. For a dataframe of 30,000 rows, it takes about 20 minutes to map and populate the missing data.
Would there be a better way to achieve the same result?
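One way that could speed this up dramatically (a sketch, not from the original post) is to read the persons table into pandas once and turn the per-row query into a single vectorized merge. It assumes the same df, data_engine and column names as above, plus a default integer index on df:

import pandas as pd

# Read the whole persons table once instead of issuing one query per row.
persons_df = pd.read_sql_table('persons', data_engine)

# Join each sales row to every period recorded for its p_id, then keep the
# period that contains the sales year (to_year is NULL for the current holder).
merged = df.reset_index().merge(persons_df, on='p_id', how='left')
in_period = (merged['year'] >= merged['from_year']) & (
    (merged['year'] < merged['to_year']) | merged['to_year'].isna())

# The question guarantees at most one matching period per row.
df['salesperson'] = merged[in_period].set_index('index')['p_name']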
Related
This is a follow-up to the question Check for 3 consecutive declined payment dates based on name that I asked the other day.
The following is the same as that question, except Jim has 2 loans to repay and the dates in df2 are not sorted in any particular order.
Jim and Bob are part of a loan repayment program. They need to make monthly payments to pay off their loans. Sometimes a payment can be declined for various reasons. I would like to find when there are 3 consecutive declined payments in a row (so no complete payment in between any of the 3).
I believe the proper way to solve this is to look at a certain time frame. Having something that looks within a 3 month window prior to the due date seems like a good strategy.
Here are the dataframes:
DF1
| Name | ID | Due Date |
| -----| ---- |------------------------|
| Jim | 1 | 2020-05-10 |
| Bob | 2 | 2021-06-11 |
| Jim | 3 | 2022-06-10 |
Here we have the "payment" dataframe.
DF2
| Name | Payment Date | Declined/Complete |
| -----|------------ | -------------------|
| Jim | 2020-04-5 | declined |
| Jim | 2020-03-9 | declined |
| Jim | 2020-05-6 | declined |
| Bob | 2021-04-11 | declined |
| Bob | 2021-03-20 | complete |
| Bob | 2021-05-11 | declined |
| Jim | 2022-04-3 | declined |
| Jim | 2022-03-5 | complete |
| Jim | 2022-05-15 | declined |
Jim (ID = 1) has had 3 consecutive declined payments before his due date, so he gets flagged (1).
Bob (ID = 2) had a complete payment in between his last 3 payments, so he does not get flagged (0).
Jim (ID = 3) has a complete payment between two declines, so he does not get flagged (0).
Expected Output
| Name | ID |Due Date | 3 consecutive declines |
| -----|----|-------------|---------------------------|
| Jim | 1 |2020-05-10 |1 |
| Bob | 2 |2021-06-11 |0 |
| Jim | 3 |2022-06-10 |0 |
Here is one way to do it. Note that we need an ID column in df2 in order to map; otherwise both Jims would be treated as a single person. (The code below also assumes the columns have been renamed to DueDate, PaymentDate and Declined_Complete.)
import pandas as pd
import numpy as np

# Map the due date into df2, so we can filter on payments made prior to the due date
df2['DueDate'] = df2['ID'].map(df1.set_index('ID')['DueDate'])

# Make sure both date columns are datetimes
df2['PaymentDate'] = pd.to_datetime(df2['PaymentDate'])
df2['DueDate'] = pd.to_datetime(df2['DueDate'])

# Count the declined payments made before the due date for each ID
ref = df2[(df2['PaymentDate'] <= df2['DueDate']) &
          (df2['Declined_Complete'] == 'declined')
         ].groupby('ID', as_index=False).size()
ref

# Map the declined count back onto df1
df1['flag'] = df1['ID'].map(ref.set_index('ID')['size'])

# Flag the customers that have reached the threshold number of declined payments, here 3
missed_payment_count = 3
df1['flag'] = np.where(df1['flag'] >= missed_payment_count, 1, 0)
df1
Name ID DueDate flag
0 Jim 1 2020-05-10 1
1 Bob 2 2021-06-11 0
2 Jim 3 2022-06-10 0
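Note that the count above reproduces the expected output for this data, but it does not strictly enforce the 3-month window or the "no complete payment in between" rule mentioned in the question. Here is a rough sketch of a stricter check, reusing the df1/df2 names and the DueDate column created above (the 3-month window is an assumption taken from the question):

import pandas as pd

def has_three_consecutive_declines(group):
    # keep only payments in the 3 months leading up to this ID's due date
    due = group['DueDate'].iloc[0]
    window = group[(group['PaymentDate'] <= due) &
                   (group['PaymentDate'] >= due - pd.DateOffset(months=3))]
    window = window.sort_values('PaymentDate')
    # longest run of consecutive 'declined' rows, reset by any 'complete' row
    declined = (window['Declined_Complete'] == 'declined').astype(int)
    run = declined.groupby((declined == 0).cumsum()).cumsum()
    return int((run >= 3).any())

flags = df2.groupby('ID').apply(has_three_consecutive_declines)
df1['3 consecutive declines'] = df1['ID'].map(flags).fillna(0).astype(int)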
I have a table in a SQLite database as below:
+-------+------------+---------------+-------+------------+
| ROWID | student_id | qualification | grade | date_stamp |
+-------+------------+---------------+-------+------------+
| 1 | 000001 | Mathematics | A | 2022-04-01 |
| 2 | 000002 | NULL | NULL | 2022-03-01 |
| 3 | 000003 | Physics | B | 2022-03-01 |
| 4 | 000003 | NULL | NULL | 2022-02-01 |
+-------+------------+---------------+-------+------------+
It is a table of student exam results: if a student has a qualification in a subject, it appears in the table like ROW #1. If a student has no qualifications, it appears like ROW #2.
ROW #3 and ROW #4 refer to a student (id 000003) who previously had no qualifications in the database, but now has a B in Physics. I need to delete ROW #4, because that student now has a qualification and the NULL values are no longer appropriate. ROW #2 for student 000002 should be unaffected.
The date_stamp column just shows when that record was last updated.
Appreciate any help, thanks in advance.
You may try doing a delete with exists logic:
DELETE
FROM yourTable
WHERE qualification IS NULL AND
EXISTS (
SELECT 1
FROM yourTable t
WHERE t.student_id = yourTable.student_id AND
t.qualification IS NOT NULL AND
t.date_stamp > yourTable.date_stamp
);
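If it helps to see the statement run end to end, here is a minimal self-contained sqlite3 sketch. The table name results is an assumption (the question does not give one); the columns and sample rows are taken from the question:

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE results (student_id TEXT, qualification TEXT, grade TEXT, date_stamp TEXT)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [('000001', 'Mathematics', 'A', '2022-04-01'),
     ('000002', None, None, '2022-03-01'),
     ('000003', 'Physics', 'B', '2022-03-01'),
     ('000003', None, None, '2022-02-01')])

# Delete a NULL row only when the same student has a later, non-NULL row.
conn.execute("""
    DELETE FROM results
    WHERE qualification IS NULL AND
          EXISTS (SELECT 1
                  FROM results t
                  WHERE t.student_id = results.student_id AND
                        t.qualification IS NOT NULL AND
                        t.date_stamp > results.date_stamp)
""")
print(conn.execute("SELECT * FROM results").fetchall())
# only student 000003's NULL row is removed; student 000002's row stays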
The dataset contains data about COVID-19 patients. It is available in both Excel and CSV file formats, and contains several variables and over 7 thousand records (rows), which makes the problem extremely hard and very time-consuming to solve manually. Below are the 4 most important variables (columns) needed to solve the problem: 1) id, identifying each record (row); 2) day_at_hosp, one row for each day a patient remained admitted at the hospital; 3) sex of the patient; 4) death, whether the patient eventually died or survived.
I want to create a new variable total_days_at_hosp which should contain the total number of days a patient remained admitted at the hospital.
Old Table:
_______________________________________
| id | day_at_hosp | sex | death |
|_______|_____________|________|________|
| 1 | 0 | male | no |
| 2 | 1 | | |
| 3 | 2 | | |
| 4 | 0 | female | no |
| 5 | 1 | | |
| 6 | 0 | male | no |
| 7 | 0 | female | no |
| 8 | 0 | male | no |
| 9 | 1 | | |
| 10 | 2 | | |
| 11 | 3 | | |
| 12 | 4 | | |
| ... | ... | ... | ... |
| 7882 | 0 | female | no |
| 7883 | 1 | | |
|_______|_____________|________|________|
New Table:
I want to convert table above into table below:
____________________________________________
| id |total_days_at_hosp| sex | death |
|_______|__________________|________|________|
| 1 | 3 | male | no |
| 4 | 2 | male | yes |
| 6 | 1 | male | yes |
| 7 | 1 | female | no |
| 8 | 5 | male | no |
| ... | ... | ... | ... |
| 2565 | 2 | female | no |
|_______|__________________|________|________|
NOTE: the id column numbers every record entered, and multiple records were entered for each patient depending on how long the patient remained admitted at the hospital. The day_at_hosp variable contains days: 0 = initial day at hospital, 1 = second day at hospital, ..., n = nth and last day at hospital.
The record (row) where day_at_hosp is 0 has all the other columns filled in. If day_at_hosp is not 0 (say 1, 2, 3, ..., n), then the row belongs to the patient right above it, and all the corresponding variables (columns) are left blank.
However, the dataset I need should look like the New Table above.
It should include a new variable (column) called total_days_at_hosp generated from the variable (column) day_at_hosp. The new variable total_days_at_hosp is more useful in the statistical tests to be conducted and will replace day_at_hosp, so that all blank rows can be deleted.
To move from the old table to the new table, the program should do the following:
day_at_hosp ===> total_days_at_hosp
0
1 ---> 3
2
-------------------------------------
0 ---> 2
1
-------------------------------------
0 ---> 1
-------------------------------------
0 ---> 1
-------------------------------------
0
1
2 ---> 5
3
4
-------------------------------------
...
-------------------------------------
0 ---> 2
1
-------------------------------------
How can I achieve this?
Another formula option, without a dummy value placed at the end of the Old/New Table.
1] Create the New Table by:
Copy and paste all Old Table data to an unused area
Click "AutoFilter"
In the "days_at_hospital" column, select the =0 value
Copy and paste the filtered admissions to New Table column F
Delete all 0s in the rows of column G
Then,
2] In G2, formula copied down :
=IF(F2="","",IF(F3="",MATCH(9^9,A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
Remark : If your "ID Column" is Text value, formula changed to :
=IF(F2="","",IF(F3="",MATCH("zzz",A:A)+1,MATCH(F3,A:A,0))-MATCH(F2,A:A,0))
It is apparent that your data are sorted by patient, and that your desired table will be much 'shorter' - accordingly the starting point for this answer is to apply an AutoFilter to your original data, setting the filter criterion to be days_at_hospital = 0, and then copy this filter of admissions to column F:
After deleting the old column G data, the formula below can then be entered in cell G2 and copied down:
=INDEX(B:B,MATCH(F3,A:A,0)-1)+1
To keep the formula simple, the same dummy maximum value should be entered at the end of both the old and new tables.
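Both answers above stay in Excel. Since the rest of this page works in pandas, here is a rough equivalent sketch (my assumption, not part of either answer): load the old table into a DataFrame old with columns id, day_at_hosp, sex and death, with blank cells read as NaN and day_at_hosp numeric, then aggregate one row per hospital stay:

import pandas as pd

# Each row with day_at_hosp == 0 starts a new stay; label stays by counting
# those start rows, then collapse each stay into a single record.
old['stay'] = (old['day_at_hosp'] == 0).cumsum()
new = old.groupby('stay').agg(
    id=('id', 'first'),                          # id of the admission row
    total_days_at_hosp=('day_at_hosp', 'size'),  # one input row per day
    sex=('sex', 'first'),
    death=('death', 'first'),
).reset_index(drop=True)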
I have to update the database with CSV files. Consider a database table that looks like this:
The CSV file data looks like this:
As you can see, in the CSV file some data has been modified and some new records have been added; what I am supposed to do is update only the data that was modified and insert the new records that were added.
In Table2 the first record of col2 has been modified. I need to update only the first record of col2 (i.e., AA), not all the records of col2.
I could do this by hardcoding, but I don't want to hardcode it, as I need to do this for 2000 tables.
Can anyone suggest the steps to approach my goal?
Here is my code snippet:
df = pd.read_csv('F:\\filename.csv', sep=",", header=0, dtype=str)
sql_query2 = engine.execute('''
    SELECT *
    FROM ttcmcs023111temp
''')
df2 = pd.DataFrame(sql_query2)
df.update(df2)
Since I do not have data similar to yours, I used my own DB.
The schema of my books table is as follows:
+--------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id | int(11) | NO | PRI | NULL | |
| name | varchar(30) | NO | | NULL | |
| author | char(30) | NO | | NULL | |
+--------+-------------+------+-----+---------+-------+
And the table looks like this:
+----+--------------------+------------------+
| id | name | author |
+----+--------------------+------------------+
| 1 | Origin | Dan Brown |
| 2 | River God | Wilbur Smith |
| 3 | Chromosome 6 | Robin Cook |
| 4 | Where Eagles Dare | Alistair Maclean |
| 5 | The Seventh Scroll | Dan Brown | ### Added wrong entry to prove
+----+--------------------+------------------+ ### my point
So, my approach is to create a new temporary table with the same schema as the books table from the CSV using python.
The code I used is as follows:
sql_query = sqlalchemy.text("CREATE TABLE temp (id int primary key, name varchar(30) not null, author varchar(30) not null)")
result = db_connection.execute(sql_query)
csv_df.to_sql('temp', con = db_connection, index = False, if_exists = 'append')
Which creates a table like this:
+----+--------------------+------------------+
| id | name | author |
+----+--------------------+------------------+
| 1 | Origin | Dan Brown |
| 2 | River God | Wilbur Smith |
| 3 | Chromosome 6 | Robin Cook |
| 4 | Where Eagles Dare | Alistair Maclean |
| 5 | The Seventh Scroll | Wilbur Smith |
+----+--------------------+------------------+
Now, you just need to use an UPDATE with an INNER JOIN in MySQL to update the values you want in your original table (in my case, books).
Here's how you'll do this:
statement = '''update books b
inner join temp t
on t.id = b.id
set b.name = t.name,
b.author = t.author;
'''
db_connection.execute(statement)
This query will update the values in table books from the table temp that I've created using the CSV.
You can destroy the temp table after updating the values.
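For example, a one-line sketch using the same db_connection as above:

# Drop the staging table created from the CSV once books has been updated
db_connection.execute(sqlalchemy.text("DROP TABLE temp"))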
I have two dataframes. One contains a list of the most recent meeting for each customer. The second is a list of statuses that each customer has been recorded with, and their start date and end date.
I want to look up a customer and meeting date, and find out what status they were at when the meeting occurred.
What I think this will involve is creating a new column in my meeting dataframe that checks the rows of the statuses dataframe for a matching customer ID, then checks if the date from the first dataframe is between two dates in the second. If it is, the calculated column will take its value from the second dataframe's status column.
My dataframes are:
meeting
| CustomerID | MeetingDate |
|------------|-------------|
| 70704 | 2019-07-23 |
| 70916 | 2019-09-04 |
| 72712 | 2019-04-16 |
statuses
| CustomerID | Status | StartDate | EndDate |
|------------|--------|------------|------------|
| 70704 | First | 2019-04-01 | 2019-06-30 |
| 70704 | Second | 2019-07-01 | 2019-08-25 |
| 70916 | First | 2019-09-01 | 2019-10-13 |
| 72712 | First | 2019-03-15 | 2019-05-02 |
So, I think I want to take meeting.CustomerID and find a match in statuses.CustomerID. I then want to check if meeting.MeetingDate is between statuses.StartDate and statuses.EndDate. If it is, I want to return statuses.Status from the matching row, if not, ignore that row and move to the next to see if that matches the criteria and return the Status as described.
The final result should look like:
| CustomerID | MeetingDate | Status |
|------------|-------------|--------|
| 70704 | 2019-07-23 | Second |
| 70916 | 2019-09-04 | First |
| 72712 | 2019-04-16 | First |
I'm certain there must be a neater and more streamlined way to do this than what I've suggested, but I'm still learning the ins and outs of python and pandas and would appreciate if someone could point me in the right direction.
This should work. It assumes the rows are sorted by CustomerID and Status; if they are not, sorting them first is easy. It also assumes your dates are already a datetime type. Here, df2 refers to the dataframe whose columns are CustomerID, Status, StartDate, and EndDate.
import numpy as np
import pandas as pd

df2 = df2[::-1]  # reverse so each customer's most recent status row comes first
# index of the first occurrence of each CustomerID, i.e. its latest status row
row_arr = np.unique(df2.CustomerID, return_index=True)[1]
df2 = df2.iloc[row_arr, :].drop(['StartDate', 'EndDate'], axis=1)
final = pd.merge(df1, df2, how='inner', on='CustomerID')
I managed to wrangle something that works for me:
df = statuses.merge(meetings, on='CustomerID')
df = df[(df['MeetingDate'] >= df['StartDate']) & (df['MeetingDate'] <= df['EndDate'])].reset_index(drop=True)
Gives:
| CustomerID | Status | StartDate | EndDate | MeetingDate |
|------------|--------|------------|------------|-------------|
| 70704 | Second | 2019-01-21 | 2019-07-28 | 2019-07-23 |
| 70916 | First | 2019-09-04 | 2019-10-21 | 2019-09-04 |
| 72712 | First | 2019-03-19 | 2019-04-17 | 2019-04-16 |
And I can just drop the now unneeded columns.
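For example, with the merged frame from above still named df:

# Keep only the columns needed for the final result
df = df.drop(columns=['StartDate', 'EndDate'])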