Python data subset - select value ... from DF1 ... where value exists in DF2 - python

I am trying to create a dataframe partly by checking whether values exist in another dataframe. Here is the SQL version of what I am trying to do:
SELECT *
FROM DF1
WHERE
Patient_alive='still_alive'
AND Patient_ID in (SELECT Pat_ID from DF2)
Here is the code I am struggling with. The last line is what I can't figure out; I have two versions of pseudocode concerning PT_ID:
DF3 = DF1[
    (DF1['Patient_alive'].str.contains('still_alive', case=False)) &
    # (DF1['PT_ID'].isin(DF2))
    (DF1['PT_ID'].contains(DF2, case=False))
]
Update1:
Input Data of df1:
Patient_ID | Patient_Alive | Patient_Name
12345 | StillAlive | Knowles, Archibald
23456 | NotAlive | Hauzer, Bruno
911235 | StillAlive | Samarkand, Samsonite VII
Input Data of df2:
PT_ID
12345
22222
55555
99999
DF3 desired output:
Patient_ID | Patient_Alive | Patient_Name
12345 | StillAlive | Knowles, Archibald
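For what it's worth, a minimal sketch of the isin-based approach, using the column names from the sample data above and assuming the status strings literally contain "StillAlive":

DF3 = DF1[
    DF1['Patient_Alive'].str.contains('StillAlive', case=False)
    & DF1['Patient_ID'].isin(DF2['PT_ID'])
]

Note that isin needs the column (DF2['PT_ID']) rather than the whole dataframe, and the two ID columns must share a dtype (both int or both str) for the membership test to match.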

Related

Merging 2 datasets in Python

I have two different datasets and I want to merge them on the column "country", keeping the common country names and dropping the ones that differ. I have done it with an inner merge, but the resulting dataset is not what I want.
inner_merged = pd.merge(TFC_DATA,CO2_DATA,how="inner",on="country")
TFC_DATA (in the original dataset there exists a column called year, but I've dropped it):
| Country | TFP |
| Angola | 0.8633379340171814 |
| Angola | 0.9345720410346984 |
| Angola | 1.0301895141601562 |
| Angola | 1.0850582122802734 |
...
CO2_DATA:
| Country | year | GDP | co2
| Afghanistan | 2005 | 25397688320.0 | 1
| Afghanistan | 2006 | 28704401408.0 | 2
| Afghanistan | 2007 | 34507530240.0 | 2
| Afghanistan | 2008 | 1.0850582122802734 | 3
| Afghanistan | 2009 | 1.040212631225586 | 1
...
What I want is
Output
|Country|Year|gdp|co2|TFP
Angola|2005|51967275008.0|19.006|0.8633379340171814
Angola|2006|66748907520.0|19.006|0.9345720410346984
Angola|2007|87085293568.0|19.006|1.0301895141601562
...
What I have instead
Output
Country|Year|gdp|co2|Year|TFP
Angola|2005|51967275008.0|19.006|2005|0.8633379340171814
Angola|2005|51967275008.0|19.006|2006|0.9345720410346984
Angola|2005|51967275008.0|19.006|2007|1.0301895141601562
Angola|2005|51967275008.0|19.006|2008|1.0850582122802734
Angola|2005|51967275008.0|19.006|2009|1.040212631225586
Angola|2005|51967275008.0|19.006|2010|1.0594196319580078
Angola|2005|51967275008.0|19.006|2011|1.036203384399414
Angola|2005|51967275008.0|19.006|2012|1.076979637145996
Angola|2005|51967275008.0|19.006|2013|1.0862818956375122
Angola|2005|51967275008.0|19.006|2014|1.096832513809204
Angola|2005|51967275008.0|19.006|2015|1.0682281255722046
Angola|2005|51967275008.0|19.006|2016|1.0160540342330933
Angola|2005|51967275008.0|19.006|2017|1.0
I expected each country's data to be merged into one dataset, but instead each row duplicates itself until the other dataset's data is exhausted, and then the second dataset does the same.
Regarding "TFC_DATA (in the original dataset there exists a column called year but I've dropped it)": based on your expected output, you should not drop the Year column from TFC_DATA. Only then can you use pandas.merge as shown below; otherwise you'll end up with duplicated values.
pd.merge(CO2_DATA, TFC_DATA, left_on=["country", "year"], right_on=["country", "Year"])
or:
pd.merge(CO2_DATA, TFC_DATA.rename(columns={"Year": "year"}), on=["country", "year"])
The pd.merge() function performs an inner join by default, which means it only includes rows that have matching values in the specified columns. To use a different join type, one option is a left outer join, which includes all rows from the left dataset (TFC_DATA) and only the matching rows from the right dataset (CO2_DATA).
Specify a left outer join with the how="left" parameter in pd.merge():
merged_data = pd.merge(TFC_DATA, CO2_DATA, how="left", on="country")
EDIT (after #abokey's comment):
First, create a new column in the TFC_DATA dataset with the year value:
TFC_DATA["year"] = TFC_DATA.index.year
Group the TFC_DATA dataset by "country" and "year", and compute the mean TFP value for each group:
TFC_DATA_agg = TFC_DATA.groupby(["country", "year"]).mean()
Reset the index to make "country" and "year" columns in the resulting dataset:
TFC_DATA_agg = TFC_DATA_agg.reset_index()
Perform the inner merge, using "country" and "year" as the merge keys:
merged_data = pd.merge(CO2_DATA, TFC_DATA_agg, how="inner", on=["country", "year"])

Pandas question - merging on int64 and object columns?

I have a couple of Pandas dataframes that I am trying to merge together without any luck.
The first dataframe (let's call it dataframe A) looks a little like this:
offer | type | country
------|------|--------
123 | A | UK
456 | B | UK
789 | A | ROI
It's created by reading in an .xlsx file using the following code:
file_name = "My file.xlsx"
df_a = pd.read_excel(file_name, sheet_name = "Offers Plan", usecols= ["offer","type","country"], dtype={'offer': str})
The offer column is being read in as a string because otherwise they end up in the format 123.0. The offers need to be in the format 123 because they're being used in some embedded SQL later on in the code that looks for them in a certain database table. In this table the offers are in the format 123, so the SQL will return no results when looking for 123.0.
The second dataframe (dataframe B) looks a little like this:
offer
-----
123
456
789
123
456
123
What I want to do is merge the two dataframes together so the results look like this:
offer | type | country
------|------|--------
123 | A | UK
456 | B | UK
789 | A | ROI
123 | A | UK
456 | B | UK
123 | A | UK
I've tried using the following code, but I get an error message saying "ValueError: You are trying to merge on int64 and object columns":
df_merged = pd.merge(df_b, df_a, how='left', on="offer")
Does anyone know how I can merge the dataframes correctly please?
IIUC you can just change the df_a column to an int
df_a['offer'] = df_a['offer'].astype(int)
This will change it from a float/str to an int. If this gives you an error about converting from a str/float to an int, check to make sure that you don't have any NaN/Null values in your data. If you do, you will need to remove them for a successful conversion.
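Alternatively, if you need to keep offer as a string for the embedded SQL mentioned in the question, you can convert dataframe B's column instead, so that both merge keys end up as strings (a sketch, assuming df_b holds dataframe B and its offer column is currently int64):

df_b['offer'] = df_b['offer'].astype(str)
df_merged = pd.merge(df_b, df_a, how='left', on='offer')

Either way, the underlying issue is the same: both "offer" columns must have the same dtype before pd.merge will use them as a join key.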

How to aggregate in pandas with some conditions?

I want to aggregate my data in this way:
df.groupby('date').agg({'user_id': 'nunique',
                        'user_id': 'nunique' ONLY WHERE purchase_flag==1})
date | user_id | purchase_flag
4-1-2020 | 1 | 1
4-1-2020 | 1 | 1 (purchased second time but still same unique user on that day)
4-1-2020 | 2 | 0
In this case I want the output to look like:
date | total_users | total_users_who_purchased
4-1-2020 | 2 | 1
How can I best achieve this?
Try this: create a helper column in your dataframe to flag the users who purchased, then group by date and aggregate on that helper column:
df["user_id_purchased"] = df["user_id"].where(df["purchase_flag"].astype(bool))
df_output = df.groupby("date", as_index=False).agg(
    total_users=("user_id", "nunique"),
    total_users_who_purchased=("user_id_purchased", "nunique"),
)
Output:
date total_users total_users_who_purchased
0 4-1-2020 2 1
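For reference, a self-contained version of the above, with the sample frame rebuilt from the table in the question:

import pandas as pd

df = pd.DataFrame({
    "date": ["4-1-2020", "4-1-2020", "4-1-2020"],
    "user_id": [1, 1, 2],
    "purchase_flag": [1, 1, 0],
})

# user_id_purchased is NaN wherever purchase_flag is 0, so nunique() ignores those rows
df["user_id_purchased"] = df["user_id"].where(df["purchase_flag"].astype(bool))

df_output = df.groupby("date", as_index=False).agg(
    total_users=("user_id", "nunique"),
    total_users_who_purchased=("user_id_purchased", "nunique"),
)
print(df_output)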
I think that one way to achieve this goal is using .loc:
df.loc[df["purchase_flag"] == 1].user_id.nunique()
Implementation to get your output:
details = {'date': ['4-1-2020'],
           'total_users': df.user_id.nunique(),
           'total_users_who_purchased': df.loc[df["purchase_flag"] == 1].user_id.nunique()}
df2 = pd.DataFrame(details)
df2

Pandas Merge two tables with the second tables' one column transposed

Table 1
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
Table 2
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
Result Table
Brief Explanation on how the Result table needs to be created:
I have two data frames and I want to merge them based on df1_id, but the date column from the second table should be transposed into the resultant table.
The date columns for the result table will be a range between the min date and max date from the second table
The column values for the dates in the result table will be from the data column of the second table.
Also, the test column in the result table should take its value from the second table's row with the latest date.
I hope this is clear. Any suggestion or help regarding this will be greatly appreciated.
I have tried using pivot on the second table and then merging the pivoted table with df1, but it's not working. I do not know how to get only one row for the latest value of test.
Note: I am trying to solve this problem with vectorized operations and do not want to iterate over each row.
You need to pivot your df2 into two separate tables, since we need both the data and test values, and then merge both resulting pivot tables with df1:
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','03-05-2021','05-05-2021'],'data':[12,13,9,16],'test':['g','h','i','j']})
test_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['test'])
data_piv = df2.pivot(index=['df1_id'],columns=['date'],values=['data'])
max_test = test_piv['test'].ffill(axis=1).iloc[:,-1].rename('test')
final = df1.merge(data_piv['data'],left_on=df1.df1_id, right_index=True, how='left')
final = final.merge(max_test,left_on=df1.df1_id, right_index=True, how='left')
Hence your resulting final dataframe is as below:
| | df1_id | col1 | col2 | 01-05-2021 | 03-05-2021 | 05-05-2021 | test |
|---:|---------:|:-------|:-------|-------------:|-------------:|-------------:|:-------|
| 0 | 1 | a | d | 12 | 9 | 16 | j |
| 1 | 2 | b | e | nan | 13 | nan | h |
| 2 | 3 | c | f | nan | nan | nan | nan |
Here is the solution for the question:
I first sort df2 by df1_id and date to ensure that the table entries are in order.
Then I drop duplicates based on df1_id, keeping the last row, to get the latest values of test and test2.
Then I pivot df2 to get each date as a column with data as its value.
Then I merge the pivoted table with those latest-value rows to bring in the latest test and test2.
Then I merge with df1 to get the resultant table.
df1 = pd.DataFrame({'df1_id':['1','2','3'],'col1':["a","b","c"],'col2':["d","e","f"]})
df2 = pd.DataFrame({'df1_id':['1','2','1','1'],'date':['01-05-2021','03-05-2021','05-05-2021','03-05-2021'],'data':[12,13,16,9],'test':['g','h','j','i'],'test2':['k','l','m','n']})
df2=df2.sort_values(by=['df1_id','date'])
df2_latest_vals = df2.drop_duplicates(subset=['df1_id'],keep='last')
df2_pivoted = df2.pivot_table(index=['df1_id'],columns=['date'],values=['data'])
df2_pivoted = df2_pivoted.droplevel(0,axis=1).reset_index()
df2_pivoted = pd.merge(df2_pivoted,df2_latest_vals,on='df1_id')
df2_pivoted = df2_pivoted.drop(columns=['date','data'])
result = pd.merge(df1,df2_pivoted,on='df1_id',how='left')
result
Note: I have not been able to figure out how to get the entire date range between 01-05-2021 and 05-05-2021 and show the empty values as NaN. If anyone can help, please edit the answer.
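One possible way to fill in the missing dates (a sketch, untested against the real data, assuming the dates are day-first strings as in the example) is to reindex the pivoted columns against the complete date range before the merges:

dates = pd.to_datetime(df2['date'], dayfirst=True)
full_range = pd.date_range(dates.min(), dates.max()).strftime('%d-%m-%Y')

# reindex adds the missing dates (here 02-05-2021 and 04-05-2021) as all-NaN columns
df2_pivoted = df2.pivot_table(index=['df1_id'], columns=['date'], values=['data'])
df2_pivoted = df2_pivoted.droplevel(0, axis=1).reindex(columns=full_range).reset_index()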

Faster way to index pandas dataframe multiple times

For every row in df_a, I am looking to find rows in df_b where the id's are the same and the df_a row's location falls within the df_b row's start and end location.
df_a looks like:
|---------------------|------------------|------------------|
| Name | id | location |
|---------------------|------------------|------------------|
| a | 1 | 202013 |
|---------------------|------------------|------------------|
df_b looks like:
|---------------------|------------------|------------------|------------------|
| Name | id | location_start | location_end |
|---------------------|------------------|------------------|------------------|
| x | 1 | 202010 | 2020199 |
|---------------------|------------------|------------------|------------------|
Unfortunately, df_a and df_b are both nearly a million rows. This code is taking like 10 hours to run on my local. Currently I'm running the following:
for index, row in df_a.iterrows():
    matched = df_b[(df_b['location_start'] < row['location'])
                   & (df_b['location_end'] > row['location'])
                   & (df_b['id'] == row['id'])]
Is there any obvious way to speed this up?
You can do this:
Consider my sample dataframes below:
In [90]: df_a = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location':[202013, 102013]})
In [91]: df_b = pd.DataFrame({'Name':['a','b'], 'id':[1,2], 'location_start':[202010, 1020199],'location_end':[2020199, 1020299] })
In [92]: df_a
Out[92]:
Name id location
0 a 1 202013
1 b 2 102013
In [93]: df_b
Out[93]:
Name id location_start location_end
0 a 1 202010 2020199
1 b 2 1020199 1020299
In [95]: d = pd.merge(df_a, df_b, on='id')
In [106]: indexes = d[d['location'].between(d['location_start'], d['location_end'])].index.tolist()
In [107]: df_b.iloc[indexes, :]
Out[107]:
Name id location_start location_end
0 a 1 202010 2020199
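One small caveat: the original loop used strict < and > comparisons, while Series.between includes both endpoints by default. In pandas 1.3+ you can reproduce the strict comparison with the inclusive parameter (a sketch):

indexes = d[d['location'].between(d['location_start'], d['location_end'], inclusive='neither')].index.tolist()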
