I'm trying to alter my dataframe to create a Sankey diagram.
I have 3 million rows like this:
client_id | start_date | end_date   | position
1234      | 16-07-2019 | 27-03-2021 | 3
1234      | 18-07-2021 | 09-10-2021 | 1
1234      | 28-03-2021 | 17-07-2021 | 2
1234      | 10-10-2021 | 20-11-2021 | 2
I want it to look like this:
client_id | start_date | end_date   | position | source | target
1234      | 16-07-2019 | 27-03-2021 | 3        | 3      | 2
1234      | 18-07-2021 | 09-10-2021 | 1        | 1      | 2
1234      | 28-03-2021 | 17-07-2021 | 2        | 2      | 1
1234      | 10-10-2021 | 20-11-2021 | 2        | 2      | 4
Value 4 is the value that I use as the "exit" in the flow.
I have no idea how to do this.
Background: the source and target values are position values ordered by start_date and end_date. For example, in the first row the source is position 3 and the target is position 2, because after the end date the client moved from position 3 to position 2.
Because the source and target follow each client's date order, you can sort the dates within each client and look up the next position.
import pandas as pd

columns = ["client_id", "start_date", "end_date", "position"]
data = [
    ["1234", "16-07-2019", "27-03-2021", 3],
    ["1234", "18-07-2021", "09-10-2021", 1],
    ["1234", "28-03-2021", "17-07-2021", 2],
    ["1234", "10-10-2021", "20-11-2021", 2],
    ["5678", "16-07-2019", "27-03-2021", 3],
    ["5678", "18-07-2021", "09-10-2021", 1],
    ["5678", "28-03-2021", "17-07-2021", 2],
    ["5678", "10-10-2021", "20-11-2021", 2],
]
df = pd.DataFrame(data, columns=columns)
# the dates are day-first (DD-MM-YYYY), so state the format explicitly
df = df.assign(
    start_date=pd.to_datetime(df["start_date"], format="%d-%m-%Y"),
    end_date=pd.to_datetime(df["end_date"], format="%d-%m-%Y"),
)
# rank each client's rows by start date; the following row has rank + 1
sdf = df.assign(rank=df.groupby("client_id")["start_date"].rank())
sdf = sdf.assign(next_rank=sdf["rank"] + 1)
# self-join every row to the next row of the same client;
# rows without a successor get the exit value 4
combine_result = pd.merge(
    sdf,
    sdf[["client_id", "position", "rank"]],
    left_on=["client_id", "next_rank"],
    right_on=["client_id", "rank"],
    how="left",
    suffixes=("", "_next"),
).fillna({"position_next": 4})
combine_result[["client_id", "start_date", "end_date", "position", "position_next"]].rename(
    {"position": "source", "position_next": "target"}, axis=1
).sort_values(["client_id", "start_date"])
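The same "next position per client" idea can probably be written without a self-join. The following is only a sketch, assuming one row per client and period as in the sample above: it sorts by client and start date, then uses groupby with shift(-1) to read the position of the following row. The name EXIT and the four-row sample are illustrative and not part of the original answer.

```python
import pandas as pd

# Illustrative sample mirroring the rows from the question.
df = pd.DataFrame(
    {
        "client_id": ["1234"] * 4,
        "start_date": ["16-07-2019", "18-07-2021", "28-03-2021", "10-10-2021"],
        "end_date": ["27-03-2021", "09-10-2021", "17-07-2021", "20-11-2021"],
        "position": [3, 1, 2, 2],
    }
)
# The dates are day-first, so the format is given explicitly.
df["start_date"] = pd.to_datetime(df["start_date"], format="%d-%m-%Y")
df["end_date"] = pd.to_datetime(df["end_date"], format="%d-%m-%Y")

EXIT = 4  # assumed marker for "exit", taken from the question

# Order each client's rows chronologically, then look one row ahead.
ordered = df.sort_values(["client_id", "start_date"])
next_position = ordered.groupby("client_id")["position"].shift(-1)

result = ordered.assign(
    source=ordered["position"],
    # The last row of each client has no successor and becomes EXIT.
    target=next_position.fillna(EXIT).astype(int),
)
print(result)
```

On 3 million rows this avoids building a second copy of the frame for the merge, although whether it is actually faster would need to be measured.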
I have a central DataFrame called "cases" (5000000 rows × 5 columns) and a secondary DataFrame called "relevant information" (300 rows × 6 columns), which acts as a kind of dictionary for the central DataFrame.
I am trying to fill in the central DataFrame based on a common column called "Verdict_type".
If a value does not appear in the secondary DataFrame, it should fill in "not_relevant" in all the rows that will be added.
I have tried several approaches without success.
I would appreciate a pointer in the right direction.
The DataFrames
import pandas as pd
# this is a mockup of the raw data
cases = [
[1, "1", "v1"],
[2, "2", "v2"],
[3, "3", "v3"]
]
relevant_info = [
["v1", "info1"],
["v3", "info3"]
]
# these are the data from screenshot
df_cases = pd.DataFrame(cases, columns=["id", "verdict_name", "verdict_type"]).set_index("id")
df_relevant_info = pd.DataFrame(relevant_info, columns=["verdict_type", "features"])
Input:
df_cases <-- note here the index marked as 'id'
df_relevant_info
# first, flatten the index of the cases ( this is probably what you were missing )
df_cases = df_cases.reset_index()
# then, merge the two sets on the verdict_type
df_merge = pd.merge(df_cases, df_relevant_info, on="verdict_type", how="outer")
# finally, mark missing values as non relevant
df_merge["features"] = df_merge["features"].fillna(value="not_relevant")
Output:
merged set:
+----+------+----------------+----------------+--------------+
| | id | verdict_name | verdict_type | features |
|----+------+----------------+----------------+--------------|
| 0 | 1 | 1 | v1 | info1 |
| 1 | 2 | 2 | v2 | not_relevant |
| 2 | 3 | 3 | v3 | info3 |
+----+------+----------------+----------------+--------------+
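Since the real lookup table has several columns, a variation may be useful. This is a sketch only, assuming that every case row should be kept exactly once (hence how="left") and that all columns coming from the lookup table should receive "not_relevant". The extra "weight" column is hypothetical and only there to show more than one column being filled.

```python
import pandas as pd

df_cases = pd.DataFrame(
    {"id": [1, 2, 3], "verdict_name": ["1", "2", "3"], "verdict_type": ["v1", "v2", "v3"]}
)
# "weight" is a made-up second info column for illustration.
df_relevant_info = pd.DataFrame(
    {"verdict_type": ["v1", "v3"], "features": ["info1", "info3"], "weight": ["high", "low"]}
)

# A left merge keeps every case and never adds rows that exist only in the lookup table.
merged = df_cases.merge(df_relevant_info, on="verdict_type", how="left")

# Fill only the columns that came from the lookup table.
info_cols = [c for c in df_relevant_info.columns if c != "verdict_type"]
merged[info_cols] = merged[info_cols].fillna("not_relevant")
print(merged)
```

If verdict_type is not unique in the lookup table, a left merge duplicates the matching case rows, so it may be worth checking that first.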
I have a pandas dataframe of text data. I created it with a group by and aggregate to collect the texts per id, as shown below, and later calculated the word count.
df = df.groupby('id') \
.agg({'chat': ', '.join }) \
.reset_index()
It looks like this:
chat is the collection of the text data per id. created_at holds the dates of the chats, converted to string type.
|id|chat |word count|created_at |
|23|hi,hey!,hi|3 |2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22|
|24|look there|2 |2017-11-03 18:05:34,2017-11-06 18:03:22 |
|25|thank you!|2 |2017-11-07 09:18:01,2017-11-18 11:09:37 |
I want to add a chat duration column that gives the difference between the first date and the last date in days, as an integer. If the chat ends on the same day, the value should be 1. The expected new column is:
|chat_duration|
|1 |
|3 |
|11 |
Before the group by, the data copied to the clipboard looks like this:
,id,chat,created_at
0,23,"hi",2018-11-09 02:11:24
1,23,"hey!",2018-11-09 02:11:43
2,23,"hi",2018-11-09 03:13:22
If I were doing the entire process
Beginning with the unprocessed data
id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22
# create the dataframe with this data on the clipboard
df = pd.read_clipboard(sep=',')
set created_at to datetime
df.created_at = pd.to_datetime(df.created_at)
create word_count
df['word_count'] = df.chat.str.split(' ').map(len)
groupby agg to get all chat as a string, created_at as a list, and word_count as a total sum.
df = df.groupby('id').agg({'chat': ','.join , 'created_at': list, 'word_count': sum}).reset_index()
calculate chat_duration
df['chat_duration'] = df['created_at'].apply(lambda x: (max(x) - min(x)).days)
convert created_at to desired string format
If you skip this step, created_at will be a list of datetimes.
df['created_at'] = df['created_at'].apply(lambda x: ','.join([y.strftime("%m/%d/%Y %H:%M:%S") for y in x]))
Final df
| | id | chat | created_at | word_count | chat_duration |
|---:|-----:|:------------------------------------------|:------------------------------------------------------------|-------------:|----------------:|
| 0 | 23 | hi i'm at school,hey! how are you,hi mom | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/09/2018 03:13:22 | 10 | 0 |
| 1 | 24 | leaving home,not today,i'll be back | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/10/2018 03:13:22 | 7 | 1 |
| 2 | 25 | yesterday i had,it's to hot,see you later | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/12/2018 03:13:22 | 9 | 3 |
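The Final df above reports 0 for a chat that starts and ends on the same day, whereas the question asks for 1 in that case. A minimal sketch of one way to get that, assuming the duration should count calendar days and be floored at 1, is shown below. It uses the timestamps from the question's table; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [23, 23, 23, 24, 24, 25, 25],
        "created_at": pd.to_datetime(
            [
                "2018-11-09 02:11:24", "2018-11-09 02:11:43", "2018-11-09 03:13:22",
                "2017-11-03 18:05:34", "2017-11-06 18:03:22",
                "2017-11-07 09:18:01", "2017-11-18 11:09:37",
            ]
        ),
    }
)

# Drop the time of day so that only calendar dates are compared.
days = df["created_at"].dt.normalize()
per_id = days.groupby(df["id"])

# Last date minus first date, in whole days, per id.
span = (per_id.max() - per_id.min()).dt.days

# A chat that ends on the day it started counts as 1 day.
chat_duration = span.clip(lower=1).rename("chat_duration")
print(chat_duration)
```

The result is indexed by id, so it can be joined back onto the aggregated frame with merge or map.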
After some tries I got it:
First split the string into a list (the next step also needs the datetime class).
from datetime import datetime
df['created_at'] = df['created_at'].str.split(',')
Then subtract the earliest timestamp from the latest one. The strings include the time of day, so the format has to cover it:
fmt = '%Y-%m-%d %H:%M:%S'
df['created_at'] = df['created_at'].apply(
    lambda s: (datetime.strptime(max(s), fmt) - datetime.strptime(min(s), fmt)).days)
Create DataFrame by split and then subtract first and last columns converted to datetimes:
df1 = df['created_at'].str.split(',', expand=True).ffill(axis=1)
df['created_at'] = (pd.to_datetime(df1.iloc[:, -1]) - pd.to_datetime(df1.iloc[:, 0])).dt.days
I have 2 dataframes. Both have an identical Emails column and each has a unique ID column. The code used to create them looks like this:
import pandas as pd
df = pd.read_excel(r'C:\Users\file.xlsx')
df['healthAssessment'] = df['ltv']*.01*df['Employment.Weight']*df['Income_Per_Year']/df['Debits_Per_Year'].astype(int)
df0 = df.loc[df['receivedHealthEmail'].str.contains('No Email Sent')]
df2 = df0.loc[df['healthAssessment'] > 2.5]
df3 = df2.loc[df['Emails'].str.contains('#')]
print (df)
df4 = df
df1 = df3
receiver = df1['Emails'].astype(str)
receivers = receiver
df1['receivedHealthEmail'] = receiver
print (df1)
The first dataframe it produces looks roughly like this:
Unique ID | Emails | receivedHealthEmail| healthAssessment
0 | aaaaaaaaaa#aaaaaa | No Email Sent| 2.443849
1 | bbbbbbbbbbbbb#bbb | No Email Sent| 3.809817
2 | ccccccccccccc#ccc | No Email Sent| 2.952871
3 | ddddddddddddd#ddd | No Email Sent| 2.564398
4 | eeeeeeeeeee#eeeee | No Email Sent| 3.315868
... | ... | ... ...
3294 | no email provided | No Email Sent| 7.674677
The second dataframe looks like this:
Unique ID | Emails | receivedHealthEmail| healthAssessment
1 | bbbbbbbbbbbbb#bbb| bbbbbbbbbbbbb#bbb| 3.809817
2 | cccccccccccccc#cc| cccccccccccccc#cc| 2.952871
3 | ddddddddddddd#ddd| ddddddddddddd#ddd| 2.564398
4 | eeeeeeeeeee#eeeee| eeeeeeeeeee#eeeee| 3.315868
I need a way to overwrite the receivedHealthEmail column in the first dataframe using the values from the second dataframe. Any help is appreciated.
You can merge the 2 dataframes based on UniqueID:
df = df1.merge(df2, on='UniqueID')
df.drop(columns=['receivedHealthEmail_x', 'healthAssessment_x', 'Emails_x'], inplace=True)
print(df)
UniqueID Emails_y receivedHealthEmail_y healthAssessment_y
0 1 bbbbbbbbbbbbb#bbb bbbbbbbbbbbbb#bbb 3.809817
1 2 cccccccccccccc#cc cccccccccccccc#cc 2.952871
2 3 ddddddddddddd#ddd ddddddddddddd#ddd 2.564398
3 4 eeeeeeeeeee#eeeee eeeeeeeeeee#eeeee 3.315868
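The merge above returns only the rows present in both frames. If the goal is to keep every row of the first dataframe and overwrite receivedHealthEmail only where the second dataframe has a value, a lookup by ID may be closer to what was asked. This is a sketch under assumptions: the ID column is named UniqueID as in the answer, IDs are unique in the second frame, and the tiny frames df_all and df_sent are made-up stand-ins for the real data.

```python
import pandas as pd

# Made-up miniature of the first (full) dataframe.
df_all = pd.DataFrame(
    {
        "UniqueID": [0, 1, 2],
        "Emails": ["a#a", "b#b", "no email provided"],
        "receivedHealthEmail": ["No Email Sent"] * 3,
    }
)
# Made-up miniature of the second (filtered) dataframe.
df_sent = pd.DataFrame(
    {"UniqueID": [1], "Emails": ["b#b"], "receivedHealthEmail": ["b#b"]}
)

# Look up the new value for each ID; IDs missing from df_sent become NaN.
lookup = df_sent.set_index("UniqueID")["receivedHealthEmail"]
new_values = df_all["UniqueID"].map(lookup)

# Keep the old value wherever no new value was found.
df_all["receivedHealthEmail"] = new_values.fillna(df_all["receivedHealthEmail"])
print(df_all)
```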
I am tracking in which "month" a certain event has taken place. If it hasn't, the "month" field is a NaN. The starting table looks like this:
+-------+----------+---------+
| Month | Category | Balance |
+-------+----------+---------+
| 1 | a | 100 |
| nan | a | 300 |
| 2 | a | 200 |
+-------+----------+---------+
I am trying to build a crosstab like this:
+-------+----------------------------------+
| Month | Category a - cumulative % amount |
+-------+----------------------------------+
| 1 | 0.16 |
| 2 | 0.50 |
+-------+----------------------------------+
In month 1, the event has happened for 100/600, i.e. about 16.7%.
In month 2, the event has happened, cumulatively, for (100 + 200) / 600 = 50%, where 100 is in month 1 and 200 in month 2.
My issue is with NaNs. Pandas automatically removes NaNs from any groupby / pivot / crosstab. I could convert the month field to string, so that grouping it won't remove the NaNs, but then pandas sorts by the month as if it were a string, ie it would sort: 10, 48, 5, 6.
Any suggestions?
The following works but seems extremely convoluted:
Convert "month" to string
Do a crosstab
Convert "month" back to float (can I do it without moving the index to a column, and then the column back to the index?)
Sort again
Do the cumsum
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame()
mylen = int(10e3)
df['ix'] = np.arange(0,mylen)
df['amount'] = np.random.uniform(10e3,20e3,mylen)
df['category'] = np.where( df['ix'] <=4000, 'a','b' )
df['month'] = np.random.uniform(3,48,mylen)
df['month'] = np.where( df['ix'] <=1000, np.nan, df['month'] )
df['month rounded'] = np.ceil(df['month'])
ct = pd.crosstab(df['month rounded'].astype(str) , df['category'], \
values = df['amount'] ,aggfunc = 'sum', margins = True ,\
normalize = 'columns', dropna = False)
# the index is 'month rounded'
ct = ct.reset_index()
ct['month rounded'] = ct['month rounded'].astype('float32')
ct = ct.sort_values('month rounded')
ct = ct.set_index('month rounded')
ct2 = ct.cumsum (axis = 0)
Use:
new_df = df.assign(cumulative=df['Balance'].mask(df['Month'].isna())
.groupby(df['Category'])
.cumsum()
.div(df.groupby('Category')['Balance']
.transform('sum'))).dropna()
print(new_df)
Month Category Balance cumulative
0 1.0 a 100 0.166667
2 2.0 a 200 0.500000
If you want to create a DataFrame for each Category, you could build a dict:
df_category = {i:group for i,group in new_df.groupby('Category')}
df['Category a - cumulative % amount'] = (
df.groupby(by=df.Month.fillna(np.inf))
.apply(lambda x: x.Balance.cumsum().div(df.Balance.sum()))
.reset_index(level=0, drop=True)
)
df.dropna()
Month Category Balance Category a - cumulative % amount
0 1 a 100 0.166667
2 2 a 200 0.333333
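For the crosstab layout the question asks for (one row per month, one column per category), the following sketch may help. It assumes the denominator is each category's total balance including the rows with a NaN month, as in the 100/600 example, and it uses groupby with unstack rather than pd.crosstab. Month stays numeric throughout, so months sort as numbers and no string conversion is needed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Month": [1, np.nan, 2], "Category": ["a", "a", "a"], "Balance": [100, 300, 200]}
)

# Denominator: total balance per category, NaN months included.
totals = df.groupby("Category")["Balance"].sum()

# Numerator: balance per (Month, Category). groupby drops NaN keys by default,
# so rows without a month contribute only to the denominator.
per_month = df.groupby(["Month", "Category"])["Balance"].sum().unstack(fill_value=0)

# Running total down the months, divided column-wise by each category's total.
cumulative_share = per_month.cumsum().div(totals, axis="columns")
print(cumulative_share)
```

For the sample table this gives about 0.167 for month 1 and 0.5 for month 2 in category a.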
I have an Impala table that I'd like to query using Ibis. The table looks like the following:
id | timestamp
-------------------
A | 5
A | 7
A | 3
B | 9
B | 5
I'd like to group_by this table according to unique combinations of id and timestamp range. The grouping operation should ultimately produce a single grouped object that I can then apply aggregations on. For example:
group1 conditions: id == A; 4 < timestamp < 11
group2 conditions: id == A; 1 < timestamp < 6
group3 conditions: id == B; 4 < timestamp < 7
yielding a grouped object with the following groups:
group1:
id | timestamp
-------------------
A | 5
A | 7
group2:
id | timestamp
-------------------
A | 5
A | 3
group3:
id | timestamp
-------------------
B | 5
Once I have the groups I'll perform various aggregations to get my final results. If anybody could help me figure this group_by out it would be greatly appreciated, even a regular pandas expression would be helpful!
So here is an example for groupby (no underscore):
df = pd.DataFrame({"id":["a","b","a","b","c","c"], "timestamp":[1,2,3,4,5,6]})
Create a grouper column for your timestamp:
df["my interval"] = (df["timestamp"] > 3) & (df["timestamp"] < 5)
# you need some data columns, i.e. those which you do not use for grouping
df["dummy"] = 1
df.groupby(["id", "my interval"]).agg("count")["dummy"]
Or you can use both:
df["something that I need"] = df["my interval"] & (df["id"] == "b")
df.groupby(["something that I need"]).agg("count")["dummy"]
you might also want to apply integer division to generate time intervals:
df = pd.DataFrame({"id":["a","b","a","b","c","c"], "timestamp":[1,2,13,14,25,26], "sales": [0,4,2,3,6,7]})
epoch = 10
df["my interval"] = epoch* (df["timestamp"] // epoch)
df.groupby(["my interval"]).agg(sum)["sales"]
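Integer division only works for intervals of equal width. Where the interval boundaries are arbitrary, pd.cut is one option. The sketch below reuses the sales example; the bin edges are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"timestamp": [1, 2, 13, 14, 25, 26], "sales": [0, 4, 2, 3, 6, 7]})

# Made-up bin edges; right=False makes each interval closed on the left: [0, 10), [10, 20), [20, 30).
bins = [0, 10, 20, 30]
df["interval"] = pd.cut(df["timestamp"], bins=bins, right=False)

# observed=True restricts the result to intervals that actually contain rows.
totals = df.groupby("interval", observed=True)["sales"].sum()
print(totals)
```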
EDIT:
your example:
import pandas as pd
A = "A"
B = "B"
df = pd.DataFrame({"id":[A,A,A,B,B], "timestamp":[5,7,3,9,5]})
df["dummy"] = 1
Solution:
grouper = (df["id"] == A) & (4 < df["timestamp"] ) & ( df["timestamp"] < 11)
df.groupby( grouper ).agg(sum)["dummy"]
or better:
df[grouper]["dummy"].sum()
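The groups in the question overlap (the row A | 5 belongs to both group1 and group2), which a single boolean grouper cannot express because each row gets exactly one label. One possible pandas approach, sketched here under the assumption that the group definitions can be listed in a small table, is to merge the data with that table on id and then filter by the interval. The groups frame and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"id": ["A", "A", "A", "B", "B"], "timestamp": [5, 7, 3, 9, 5]})

# Hypothetical table of group definitions: an id plus an open interval (low, high).
groups = pd.DataFrame(
    {
        "group": ["group1", "group2", "group3"],
        "id": ["A", "A", "B"],
        "low": [4, 1, 4],
        "high": [11, 6, 7],
    }
)

# Pair every row with every group defined for its id ...
paired = df.merge(groups, on="id")
# ... and keep only the pairs whose timestamp lies strictly inside the interval.
inside = (paired["timestamp"] > paired["low"]) & (paired["timestamp"] < paired["high"])
labelled = paired[inside]

# A row can now appear under several groups, so an ordinary groupby works.
summary = labelled.groupby("group")["timestamp"].agg(["count", "min", "max"])
print(summary)
```

The merge multiplies each row by the number of groups defined for its id, so on a large table with many group definitions the intermediate frame can become big.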