Need a way to overwrite columns in 2 separate pandas dataframes - python

I have 2 dataframes; both have an identical Emails column and each has a unique ID column. The code I used to create these looks like this:
import pandas as pd

df = pd.read_excel(r'C:\Users\file.xlsx')
df['healthAssessment'] = df['ltv'] * .01 * df['Employment.Weight'] * df['Income_Per_Year'] / df['Debits_Per_Year'].astype(int)
df0 = df.loc[df['receivedHealthEmail'].str.contains('No Email Sent')]
df2 = df0.loc[df0['healthAssessment'] > 2.5]
df3 = df2.loc[df2['Emails'].str.contains('#')]
print(df)
df4 = df
df1 = df3
receiver = df1['Emails'].astype(str)
df1['receivedHealthEmail'] = receiver
print(df1)
The first dataframe it produces looks roughly like this:
Unique ID | Emails | receivedHealthEmail | healthAssessment
0 | aaaaaaaaaa#aaaaaa | No Email Sent | 2.443849
1 | bbbbbbbbbbbbb#bbb | No Email Sent | 3.809817
2 | ccccccccccccc#ccc | No Email Sent | 2.952871
3 | ddddddddddddd#ddd | No Email Sent | 2.564398
4 | eeeeeeeeeee#eeeee | No Email Sent | 3.315868
... | ... | ... | ...
3294 | no email provided | No Email Sent | 7.674677
The second dataframe looks like this:
Unique ID | Emails | receivedHealthEmail | healthAssessment
1 | bbbbbbbbbbbbb#bbb | bbbbbbbbbbbbb#bbb | 3.809817
2 | cccccccccccccc#cc | cccccccccccccc#cc | 2.952871
3 | ddddddddddddd#ddd | ddddddddddddd#ddd | 2.564398
4 | eeeeeeeeeee#eeeee | eeeeeeeeeee#eeeee | 3.315868
I need a way to overwrite the receivedHealthEmail column in the first dataframe using the values from the second dataframe. Any help is appreciated.

You can merge the 2 dataframes based on UniqueID:
df = df1.merge(df2, on='UniqueID')
df.drop(columns=['receivedHealthEmail_x', 'healthAssessment_x', 'Emails_x'], inplace=True)
print(df)
UniqueID Emails_y receivedHealthEmail_y healthAssessment_y
0 1 bbbbbbbbbbbbb#bbb bbbbbbbbbbbbb#bbb 3.809817
1 2 cccccccccccccc#cc cccccccccccccc#cc 2.952871
2 3 ddddddddddddd#ddd ddddddddddddd#ddd 2.564398
3 4 eeeeeeeeeee#eeeee eeeeeeeeeee#eeeee 3.315868
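If you'd rather keep every row of the first dataframe and only overwrite receivedHealthEmail where the second one has a matching ID, DataFrame.update is a good fit. A minimal sketch, where df_full and df_sent are hypothetical names standing in for the two frames shown above, both with a Unique ID column:
# align both frames on the shared key
df_full = df_full.set_index('Unique ID')
df_sent = df_sent.set_index('Unique ID')
# overwrite receivedHealthEmail in place for matching IDs;
# rows without a match keep their 'No Email Sent' value
df_full.update(df_sent[['receivedHealthEmail']])
df_full = df_full.reset_index()
Unlike an inner merge, this keeps the rows that have no match in the second dataframe.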

Related

Alter dataframe based on values in other rows

I'm trying to alter my dataframe to create a Sankey diagram.
I've 3 million rows like this:
client_id | start_date | end_date   | position
1234      | 16-07-2019 | 27-03-2021 | 3
1234      | 18-07-2021 | 09-10-2021 | 1
1234      | 28-03-2021 | 17-07-2021 | 2
1234      | 10-10-2021 | 20-11-2021 | 2
I want it to look like this:
client_id | start_date | end_date   | position | source | target
1234      | 16-07-2019 | 27-03-2021 | 3        | 3      | 2
1234      | 18-07-2021 | 09-10-2021 | 1        | 1      | 2
1234      | 28-03-2021 | 17-07-2021 | 2        | 2      | 1
1234      | 10-10-2021 | 20-11-2021 | 2        | 2      | 4
Value 4 is the value that I use as "exit" in the flow.
I have no idea how to do this.
Background: the source and target values contain the position values based on start_date and end_date. For example, in the first row the source is position 3 but the target is position 2, because after the end date the client changed from position 3 to 2.
Because source and target are determined by each client's date order, you can sort by date and find each row's next position.
import pandas as pd

columns = ["client_id", "start_date", "end_date", "position"]
data = [
    ["1234", "16-07-2019", "27-03-2021", 3],
    ["1234", "18-07-2021", "09-10-2021", 1],
    ["1234", "28-03-2021", "17-07-2021", 2],
    ["1234", "10-10-2021", "20-11-2021", 2],
    ["5678", "16-07-2019", "27-03-2021", 3],
    ["5678", "18-07-2021", "09-10-2021", 1],
    ["5678", "28-03-2021", "17-07-2021", 2],
    ["5678", "10-10-2021", "20-11-2021", 2],
]
df = pd.DataFrame(data, columns=columns)
# the dates are day-first, so give to_datetime an explicit format
df = df.assign(
    start_date=pd.to_datetime(df["start_date"], format="%d-%m-%Y"),
    end_date=pd.to_datetime(df["end_date"], format="%d-%m-%Y"),
)
# rank each client's rows chronologically; the row with rank r flows into rank r + 1
sdf = df.assign(rank=df.groupby("client_id")["start_date"].rank())
sdf = sdf.assign(next_rank=sdf["rank"] + 1)
# self-merge: match each row with the row holding its client's next rank
combine_result = pd.merge(
    sdf,
    sdf[["client_id", "position", "rank"]],
    left_on=["client_id", "next_rank"],
    right_on=["client_id", "rank"],
    how="left",
    suffixes=["", "_next"],
).fillna({"position_next": 4})  # 4 marks the exit of the flow
combine_result[["client_id", "start_date", "end_date", "position", "position_next"]].rename(
    {"position": "source", "position_next": "target"}, axis=1
).sort_values(["client_id", "start_date"])

Date difference from a list in pandas dataframe

I have a pandas dataframe of text data. I created it by doing a group-by and aggregate to get the texts per id, like below. I later calculated the word count.
df = df.groupby('id') \
       .agg({'chat': ', '.join}) \
       .reset_index()
It looks like this:
chat is the collection of the text data per id. created_at holds the dates of the chats, converted to string type.
|id|chat |word count|created_at |
|23|hi,hey!,hi|3 |2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22|
|24|look there|2 |2017-11-03 18:05:34,2017-11-06 18:03:22 |
|25|thank you!|2 |2017-11-07 09:18:01,2017-11-18 11:09:37 |
I want to add a chat_duration column that gives the difference between the first and last date in days, as an integer. If the chat ends the same day, then 1. The new expected column is:
|chat_duration|
|1 |
|3 |
|11 |
Copied to the clipboard, the data looks like this before the group-by:
,id,chat,created_at
0,23,"hi",2018-11-09 02:11:24
1,23,"hey!",2018-11-09 02:11:43
2,23,"hi",2018-11-09 03:13:22
If I were doing the entire process, beginning with the unprocessed data:
id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22
# create the dataframe with this data on the clipboard
df = pd.read_clipboard(sep=',')
Set created_at to datetime:
df.created_at = pd.to_datetime(df.created_at)
Create word_count:
df['word_count'] = df.chat.str.split(' ').map(len)
Group by and aggregate to get all chat as one string, created_at as a list, and word_count as a total sum:
df = df.groupby('id').agg({'chat': ','.join , 'created_at': list, 'word_count': sum}).reset_index()
Calculate chat_duration:
df['chat_duration'] = df['created_at'].apply(lambda x: (max(x) - min(x)).days)
Convert created_at to the desired string format (if you skip this step, created_at will be a list of datetimes):
df['created_at'] = df['created_at'].apply(lambda x: ','.join([y.strftime("%m/%d/%Y %H:%M:%S") for y in x]))
Final df
| | id | chat | created_at | word_count | chat_duration |
|---:|-----:|:------------------------------------------|:------------------------------------------------------------|-------------:|----------------:|
| 0 | 23 | hi i'm at school,hey! how are you,hi mom | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/09/2018 03:13:22 | 10 | 0 |
| 1 | 24 | leaving home,not today,i'll be back | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/10/2018 03:13:22 | 7 | 1 |
| 2 | 25 | yesterday i had,it's to hot,see you later | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/12/2018 03:13:22 | 9 | 3 |
After some tries I got it:
First convert the string to a list (str.split already returns one), then subtract the min date from the max date. Note the timestamps include a time part, so the format string must cover it:
from datetime import datetime

df['created_at'] = df['created_at'].str.split(',')
df['created_at'] = df['created_at'].apply(
    lambda s: (datetime.strptime(max(s), '%Y-%m-%d %H:%M:%S')
               - datetime.strptime(min(s), '%Y-%m-%d %H:%M:%S')).days)
Create DataFrame by split and then subtract first and last columns converted to datetimes:
df1 = df['created_at'].str.split(',', expand=True).ffill(axis=1)
df['created_at'] = (pd.to_datetime(df1.iloc[:, -1]) - pd.to_datetime(df1.iloc[:, 0])).dt.days
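A compact variant of the same idea, assuming created_at is still the comma-separated string column: parse each row's dates in one go with pd.to_datetime and take the spread:
dates = df['created_at'].str.split(',').apply(pd.to_datetime)
df['chat_duration'] = dates.apply(lambda d: (d.max() - d.min()).days)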

Split column without changing row position

I am trying to split a column, but I noticed the split changes other values. For example, some values of row 10 are exchanged with row 8. Why is that?
Actual data on ID 10
| vat_number | email | foi_mail | website
| 10 | abc#test.com;example#test.com;example#test.com | xyz#test.com | example.com
After executing this line of code:
base_data[['email','email_1','email_2']] = pd.DataFrame(base_data.email.str.split(';').tolist(),
columns = ['email','email_1','email_2'])
base_data becomes:
| vat_number | email | foi_mail | website | email_1 | email_2
| 10 | some other row value | some other row value | example.com | ------ | -----
The data contains thousands of rows, but I showed only one row.
Try a table within a table:
def test():
    base_data = []
    base_data.append(['12', '32'])
    base_data.append(['352', '335'])
    base_data.append(['232', '32'])
    print(base_data)
    a = base_data[0]
    print(a)
    print(a[0])
    print(a[1])
    input("Enter to continue. . .")
and use a loop to add the rows.
If I understand the case correctly, I believe you need something like this:
base_data = base_data.merge(base_data['email'].str.split(';', expand=True).rename(columns={0: 'email', 1: 'email_1', 2: 'email_2'}), left_index=True, right_index=True)
Here is the logic explanation:
import pandas as pd

a1 = list('abcdef')
b1 = list('fedcba')
c1 = [f'{x[0]};{x[1]}' for x in zip(a1, b1)]
df1 = pd.DataFrame({'c1': c1})
df1
Out[1]:
c1
0 a;f
1 b;e
2 c;d
3 d;c
4 e;b
5 f;a
df1 = df1.merge(df1['c1'].str.split(';', expand = True).rename(columns = {0:'c2',1:'c3'}), left_index = True, right_index = True)
df1
Out[2]:
c1 c2 c3
0 a;f a f
1 b;e b e
2 c;d c d
3 d;c d c
4 e;b e b
5 f;a f a
Use the expand parameter of .str.split; it returns a DataFrame that keeps the original index, so the pieces stay aligned with their rows. Your original code most likely scrambled the rows because pd.DataFrame(... .tolist()) builds a fresh RangeIndex that no longer matches base_data's index when the new columns are assigned back.
import pandas as pd
# your dataframe
vat_number email foi_mail website
NaN abc#test.com;example#test.com;example#test.com xyz#test.com example.com
# split and expand
df[['email_1', 'email_2', 'email_3']] = df['email'].str.split(';', expand=True)
# drop `email` col
df.drop(columns='email', inplace=True)
# result
vat_number foi_mail website email_1 email_2 email_3
NaN xyz#test.com example.com abc#test.com example#test.com example#test.com
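If the number of addresses varies by row, hard-coding three target columns will break as soon as a row splits into more pieces. A sketch that sizes the new columns from the data instead (column names here are illustrative):
parts = base_data['email'].str.split(';', expand=True)  # keeps base_data's index
parts.columns = [f'email_{i + 1}' for i in range(parts.shape[1])]
base_data = base_data.drop(columns='email').join(parts)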

Groupby and join values but keep all columns

I have this DataFrame and want to group on ID and join the values.
ID | A_Num | I_Num
--------------------------
001 | A_001 | I_001
002 | A_002 | I_002
003 | A_003 | I_004
005 | A_002 | I_002
Desired Output
ID | A_Num | I_Num
--------------------------
001 | A_001 | I_001
002;005 | A_002 | I_002
003 | A_003 | I_004
Code:
df = df.groupby(['A_Num', 'I_Num'])['ID'].apply(lambda tags: ';'.join(tags))
df.to_csv(r'D:\joined.csv', sep=';', encoding='utf-8-sig', quoting=csv.QUOTE_ALL, index=False, header=True)
When I write the DataFrame to a csv file, I get only the ID column.
Try reset_index():
df=df.groupby(['A_Num','I_Num'])["ID"].apply(lambda tags: ';'.join(tags.values)).reset_index()
This way the aggregation from apply() is executed and then reassigned as a column instead of the index.
Just another way to do it is:
result= df.groupby(['A_Num', 'I_Num']).agg({'ID': list})
result.reset_index(inplace=True)
result[['ID', 'A_Num', 'I_Num']]
The output is:
Out[37]:
ID A_Num I_Num
0 [001 ] A_001 I_001
1 [002 , 005 ] A_002 I_002
2 [003 ] A_003 I_004
ID contains lists in that case. If you'd rather have strings, just do:
result['ID']= result['ID'].map(lambda lst: ';'.join(lst))
result[['ID', 'A_Num', 'I_Num']]
Which outputs:
Out[48]:
ID A_Num I_Num
0 001 A_001 I_001
1 002;005 A_002 I_002
2 003 A_003 I_004
Group by 'A_Num' and 'I_Num' and then join the IDs within each group:
df.groupby(['A_Num','I_Num']).ID.apply(lambda x: ';'.join(x.tolist())).reset_index()
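For completeness, the same result can be spelled with agg and as_index=False, which skips the separate reset_index(); a sketch, reusing the question's CSV call:
out = df.groupby(['A_Num', 'I_Num'], as_index=False)['ID'].agg(';'.join)
out = out[['ID', 'A_Num', 'I_Num']]  # restore the original column order
out.to_csv(r'D:\joined.csv', sep=';', encoding='utf-8-sig', index=False)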

Maintaining column order when adding two dataframes with similar formats

I have two dataframes with similar formats. Both have 3 indexes/headers. Most of the headers are the same but df2 has a few additional ones. When I add them up the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of df format (values truncated):
HeaderA | Header2 | Header3
xxx1|xxx2|xxx3|xxx4 || xxx1|xxx2|xxx3|xxx4 || xxx1|xxx2|xxx3|xxx4
ColX|ColY | ColA|ColB|ColC|ColD || ColD|ColE|ColF|ColG || ColH|ColI|ColJ|ColK
1 | ds | 1 |  | +1 | -1 | ...
2 | dh | ...
3 | ge | ...
4 | ew | ...
5 | er | ...
df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(The second option should preserve the order of Oslav columns as well, if you care about that.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))
If you don't want to keep the columns that are in Oslav but not in Global, you can do
df = df[Global.columns].copy()
Note that without .copy(), you're getting a view of the previous dataframe, rather than a dataframe in its own right.
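A minimal self-contained demonstration of the reindex route, using toy single-level-column frames (the real frames have multi-level headers, but the mechanics are the same):
import pandas as pd

# Global's column order is the one we want to keep; Oslav adds 'd'
Global = pd.DataFrame([[1, 2, 3]], columns=['b', 'a', 'c'])
Oslav = pd.DataFrame([[10, 20, 30, 40]], columns=['a', 'b', 'c', 'd'])

df = Global.add(Oslav, fill_value=0)  # add() returns the union, alphabetized
ordered = list(Global.columns) + [c for c in Oslav.columns if c not in Global.columns]
df = df.reindex(columns=ordered)      # back to Global's order, extras at the end
print(df)  # columns: b, a, c, d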
