Date difference from a list in pandas dataframe - python

I have a pandas dataframe of text data. I created it by doing a groupby and aggregate to collect the texts per id, like below. I later calculated the word count.
df = df.groupby('id') \
       .agg({'chat': ', '.join}) \
       .reset_index()
It looks like this:
chat is the collection of the text data per id. created_at holds the dates of the chats, joined into a single comma-separated string.
| id | chat       | word count | created_at                                                  |
|----|------------|------------|-------------------------------------------------------------|
| 23 | hi,hey!,hi | 3          | 2018-11-09 02:11:24,2018-11-09 02:11:43,2018-11-09 03:13:22 |
| 24 | look there | 2          | 2017-11-03 18:05:34,2017-11-06 18:03:22                     |
| 25 | thank you! | 2          | 2017-11-07 09:18:01,2017-11-18 11:09:37                     |
I want to add a chat_duration column that gives the difference between the first date and the last date in days, as an integer. If a chat starts and ends on the same day, the duration should be 1. The new expected column is:
| chat_duration |
|---------------|
| 1             |
| 3             |
| 11            |
Copied to the clipboard, the data looks like this before the group by:
,id,chat,created_at
0,23,"hi",2018-11-09 02:11:24
1,23,"hey!",2018-11-09 02:11:43
2,23,"hi",2018-11-09 03:13:22

If I were doing the entire process, beginning with the unprocessed data:
id,chat,created_at
23,"hi i'm at school",2018-11-09 02:11:24
23,"hey! how are you",2018-11-09 02:11:43
23,"hi mom",2018-11-09 03:13:22
24,"leaving home",2018-11-09 02:11:24
24,"not today",2018-11-09 02:11:43
24,"i'll be back",2018-11-10 03:13:22
25,"yesterday i had",2018-11-09 02:11:24
25,"it's to hot",2018-11-09 02:11:43
25,"see you later",2018-11-12 03:13:22
# create the dataframe with this data on the clipboard
df = pd.read_clipboard(sep=',')
Set created_at to datetime:
df.created_at = pd.to_datetime(df.created_at)
Create word_count:
df['word_count'] = df.chat.str.split(' ').map(len)
Group by id and aggregate to get all chat as a single string, created_at as a list, and word_count as a total sum:
df = df.groupby('id').agg({'chat': ','.join , 'created_at': list, 'word_count': sum}).reset_index()
Calculate chat_duration:
df['chat_duration'] = df['created_at'].apply(lambda x: (max(x) - min(x)).days)
Convert created_at to the desired string format. If you skip this step, created_at will remain a list of datetimes:
df['created_at'] = df['created_at'].apply(lambda x: ','.join([y.strftime("%m/%d/%Y %H:%M:%S") for y in x]))
Final df
| | id | chat | created_at | word_count | chat_duration |
|---:|-----:|:------------------------------------------|:------------------------------------------------------------|-------------:|----------------:|
| 0 | 23 | hi i'm at school,hey! how are you,hi mom | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/09/2018 03:13:22 | 10 | 0 |
| 1 | 24 | leaving home,not today,i'll be back | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/10/2018 03:13:22 | 7 | 1 |
| 2 | 25 | yesterday i had,it's to hot,see you later | 11/09/2018 02:11:24,11/09/2018 02:11:43,11/12/2018 03:13:22 | 9 | 3 |
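Note that this computes 0 for a chat that starts and ends on the same day, while the question's expected output counts such a chat as 1. If that convention is needed, one extra line (an addition, not part of the answer above) clips the lower bound:
# count same-day chats as 1, per the question's expected output
df['chat_duration'] = df['chat_duration'].clip(lower=1)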

After some tries, I got it:
First convert the string to a list (str.split already returns a list, so no extra apply is needed):
df['created_at'] = df['created_at'].str.split(',')
Then subtract the min date from the max date in each list. The format string has to include the time part, and datetime must be imported:
from datetime import datetime

fmt = '%Y-%m-%d %H:%M:%S'
# ISO-formatted strings sort chronologically, so max/min work on the raw strings
df['created_at'] = df['created_at'].apply(
    lambda s: (datetime.strptime(max(s), fmt).date()
               - datetime.strptime(min(s), fmt).date()).days)

Or create a DataFrame by splitting, then subtract the first and last columns converted to datetimes:
df1 = df['created_at'].str.split(',', expand=True).ffill(axis=1)
df['created_at'] = (pd.to_datetime(df1.iloc[:, -1]) - pd.to_datetime(df1.iloc[:, 0])).dt.days

Related

Calculate difference in days in one single column based on the values in another column (pandas)

I have a pandas df (called df2) like this:
| id  | orderdate  |
|-----|------------|
| 123 | 2020-11-01 |
| 123 | 2020-08-01 |
| 233 | 2020-07-01 |
| 233 | 2020-11-04 |
| 444 | 2020-11-04 |
| 444 | 2020-05-03 |
| 444 | 2020-04-01 |
| 444 | 2020-11-25 |
The values of orderdate are datetimes with the format '%Y-%m-%d'. They represent a client's orders. I want to calculate the delta between the first order and the second one for each id (each client).
I came up with:
for i in list(set(df2.id)):
    list_sorted = list(set(df2.loc[df2['id'] == i, 'orderdate']))
    list_sorted = sorted(list_sorted)  # sorted list of the order dates in ascending order
    min_list = list_sorted[0]  # first element is the first order
    df2.loc[df2['id'] == i, 'First Order'] = min_list
    if len(list_sorted) > 1:
        second_list = list_sorted[1]
        df2.loc[df2['id'] == i, 'Second Order'] = second_list  # second element is the second order
        df2.loc[df2['id'] == i, 'Delta orders'] = second_list - min_list  # calculate the delta
    else:
        df2.loc[df2['id'] == i, 'Delta orders'] = None
My expected outcome is:
| id  | orderdate  | First Order | Second Order | Delta Orders |
|-----|------------|-------------|--------------|--------------|
| 123 | 2020-11-01 | 2020-08-01  | 2020-11-01   | 92 days      |
| 123 | 2020-08-01 | 2020-08-01  | 2020-11-01   | 92 days      |
| 233 | 2020-07-01 | 2020-07-01  | 2020-11-04   | 126 days     |
| 233 | 2020-11-04 | 2020-07-01  | 2020-11-04   | 126 days     |
| 444 | 2020-11-04 | 2020-04-01  | 2020-05-03   | 32 days      |
| 444 | 2020-05-03 | 2020-04-01  | 2020-05-03   | 32 days      |
| 444 | 2020-04-01 | 2020-04-01  | 2020-05-03   | 32 days      |
| 444 | 2020-11-25 | 2020-04-01  | 2020-05-03   | 32 days      |
It works but I feel like it's cumbersome. Any easier way to do it?
Slightly different from what you want, but it's a start:
import pandas as pd
from io import StringIO
data = StringIO(
"""id|orderdate
123|2020-11-01
123|2020-08-01
233|2020-07-01
233|2020-11-04
444|2020-11-04
444|2020-05-03
444|2020-04-01
444|2020-11-25 """)
df = pd.read_csv(data, sep='|')
df['orderdate'] = pd.to_datetime(df['orderdate'], infer_datetime_format=True)
df = df.sort_values(['id', 'orderdate'], ascending=False)
def date_diff(df):
    df['order_time_diff'] = (df['orderdate'] - df['orderdate'].shift(-1)).dt.days
    df = df.dropna()
    return df
# this calculates all order differences
df.groupby('id').apply(date_diff)
# this will get the data as requested
df.groupby('id', as_index=False).apply(date_diff).groupby('id').tail(1)
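If you want exactly the expected outcome (First Order, Second Order, and the delta broadcast onto every row of each id), a sketch along the same lines, reusing the df built above; transform broadcasts each group's first and second order dates back onto the rows:
# sort ascending so the first rows per id are the earliest orders
df = df.sort_values(['id', 'orderdate'])
df['First Order'] = df.groupby('id')['orderdate'].transform('min')
# second-earliest order, or NaT for ids with a single order
df['Second Order'] = df.groupby('id')['orderdate'].transform(
    lambda s: s.iloc[1] if len(s) > 1 else pd.NaT)
df['Delta orders'] = (df['Second Order'] - df['First Order']).dt.days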

row_number ranking function to filter the latest records in DF

I want to apply a Window function to a DataFrame to get only the latest metrics for every Id. For the following data I expect the df to contain only the first two records after applying a Window function.
| id | metric | transaction_date |
|----|--------|------------------|
| 1  | 0.5    | 05-10-2019       |
| 2  | 15.9   | 07-22-2020       |
| 2  | 4.7    | 11-03-2017       |
Is it a correct approach to use row_number ranking function? My current implementation looks like this:
(df.withColumn(
     "_row_number",
     F.row_number().over(
         Window.partitionBy("id").orderBy(F.desc("transaction_date"))))
 .filter(F.col("_row_number") == 1)
 .drop("_row_number"))
You need to first sort the dataframe by id and date (descending), then group by id. The first method on the groupby object returns the first row of each group (which has the latest date).
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'id': [1, 2, 2],
                   'metric': [0.5, 15.9, 4.7],
                   'date': [datetime(2019, 5, 10), datetime(2020, 7, 22), datetime(2017, 11, 3)]})
# sort df by id and date (descending)
df = df.sort_values(['id', 'date'], ascending=[True, False])
# return the first row of each group
df.groupby('id').first()
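An equivalent idiom (not from the answer above, but standard pandas) is drop_duplicates after the sort, which keeps the first, i.e. latest, row per id while preserving the regular columns:
df.sort_values(['id', 'date'], ascending=[True, False]).drop_duplicates('id')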
Or the same in Spark with Scala:
import org.apache.spark.sql.functions.{first, to_date}
import spark.implicits._  // for .toDF and the 'symbol column syntax

val fDF = Seq((1, 0.5, "05-10-2019"),
              (2, 15.9, "07-22-2020"),
              (2, 4.7, "11-03-2017"))
  .toDF("id", "metric", "transaction_date")
val f1DF = fDF
  .withColumn("transaction_date", to_date('transaction_date, "MM-dd-yyyy"))
  .orderBy('id.asc, 'transaction_date.desc)
val f2DF = f1DF.groupBy("id")
  .agg(first('transaction_date).alias("transaction_date"),
       first('metric).alias("metric"))
f2DF.show(false)
// +---+----------------+------+
// |id |transaction_date|metric|
// +---+----------------+------+
// |1 |2019-05-10 |0.5 |
// |2 |2020-07-22 |15.9 |
// +---+----------------+------+
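For reference, the row_number approach from the question also runs as-is in PySpark; a minimal self-contained sketch (assuming a local SparkSession and the sample data above):
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 0.5, "05-10-2019"), (2, 15.9, "07-22-2020"), (2, 4.7, "11-03-2017")],
    ["id", "metric", "transaction_date"])
df = df.withColumn("transaction_date", F.to_date("transaction_date", "MM-dd-yyyy"))

# keep only the newest row per id
latest = (df.withColumn(
              "_row_number",
              F.row_number().over(
                  Window.partitionBy("id").orderBy(F.desc("transaction_date"))))
          .filter(F.col("_row_number") == 1)
          .drop("_row_number"))
latest.show()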

Replacing string value in a pandas dataframe column inside a list in Python

I have a column in my dataframe like this:
| columnn               |
|-----------------------|
| [happiness#sad]       |
| [happy ness#moderate] |
| [happie ness#sad]     |
and I want to replace "happy ness", "happiness", and "happie ness" with 'happyness'. I am currently using the methods below, but nothing changes.
The strings must match exactly:
happy ness ===> happyness
happiness ===> happyness
happie ness ===> happyness
I tried the two approaches below.
1st approach:
df.column = df.column.replace({"happiness": "happyness", "happy ness": "happyness", "happie ness": "happyness"})
2nd approach:
df['column'] = df['column'].str.replace("happiness", "happyness").str.replace("happy ness", "happyness").str.replace("happie ness", "happyness")
Desired Output:
| columnn                |
|------------------------|
| [happyness, sad]       |
| [happyness, moderate]  |
| [happyness, sad]       |
This is one approach using replace with regex=True.
Ex:
import pandas as pd

df = pd.DataFrame({"columnn": [["happiness#sad"], ["happy ness#moderate"], ["happie ness#sad"]]})
data = {"happiness": "happyness", "happy ness": "happyness", "happie ness": "happyness"}
df["columnn"] = df["columnn"].apply(lambda x: pd.Series(x).replace(data, regex=True).tolist())
print(df)
Output:
                 columnn
0        [happyness#sad]
1   [happyness#moderate]
2        [happyness#sad]
Try this approach; I think it will work for you:
df['new_col'] = df['column'].replace(
    to_replace=['happy ness', 'happiness', 'happie ness'],
    value=['happyness', 'happyness', 'happyness'])
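If the goal is the exact desired output (each cell split on '#' and each piece replaced by exact match), a minimal sketch assuming each cell is a one-element list of 'word#mood' strings:
import pandas as pd

df = pd.DataFrame({"columnn": [["happiness#sad"],
                               ["happy ness#moderate"],
                               ["happie ness#sad"]]})
mapping = {"happiness": "happyness",
           "happy ness": "happyness",
           "happie ness": "happyness"}
# split every item on '#' and map each token through the exact-match dict
df["columnn"] = df["columnn"].apply(
    lambda cell: [mapping.get(tok, tok)
                  for item in cell
                  for tok in item.split("#")])
print(df)
#                  columnn
# 0        [happyness, sad]
# 1   [happyness, moderate]
# 2        [happyness, sad]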

Maintaining column order when adding two dataframes with similar formats

I have two dataframes with similar formats. Both have 3 indexes/headers. Most of the headers are the same but df2 has a few additional ones. When I add them up the order of the headers gets mixed up. I would like to maintain the order of df1. Any ideas?
Global = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Global')
Oslav = pd.read_excel('Mickey Mouse_Clean2.xlsx',header=[0,1,2,3],index_col=[0,1],sheet_name = 'Country XYZ')
Oslav = Oslav.replace(to_replace=1,value=10)
Oslav = Oslav.replace(to_replace=-1,value=-2)
df = Global.add(Oslav,fill_value=0)
Example of df Format
HeaderA | Header2 | Header3 |
xxx1|xxx2|xxx3|xxx4||xxx1|xxx2|xxx3|xxx4||xxx1|xxx2|xxx3|xxx4 |
ColX|ColY |ColA|ColB|ColC|ColD||ColD|ColE|ColF|ColG||ColH|ColI|ColJ|ColDK|
1 | ds | 1 | |+1 |-1 | .......................................
2 | dh | ..........................................................
3 | ge | ..........................................................
4 | ew | ..........................................................
5 | er | ..........................................................
df = df[list(Global.columns) + list(set(Oslav.columns) - set(Global.columns))].copy()
or
df = df[list(Global.columns) + [col for col in Oslav.columns if col not in Global.columns]].copy()
(The second option should preserve the order of Oslav columns as well, if you care about that.)
or
df = df.reindex(columns=list(Global.columns) + list(set(Oslav.columns) - set(Global.columns)))
If you don't want to keep the columns that are in Oslav, but not in Global, you can do
df = df[Global.columns].copy()
Note that without .copy(), you're getting a view of the previous dataframe, rather than a dataframe in its own right.
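As a toy illustration of why the selection works (hypothetical frames, not the Mickey Mouse data): add returns the union of columns in sorted order, and selecting by the combined column list restores df1's order:
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3]], columns=['c', 'a', 'b'])
df2 = pd.DataFrame([[10, 20, 30, 40]], columns=['b', 'a', 'c', 'd'])
out = df1.add(df2, fill_value=0)   # columns come back as ['a', 'b', 'c', 'd']
ordered = list(df1.columns) + [c for c in df2.columns if c not in df1.columns]
out = out[ordered]                 # back to ['c', 'a', 'b', 'd']
print(out.columns.tolist())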

Using Pandas to join and append columns in a loop

I want to append columns from tables generated in a loop to a dataframe. I was hoping to accomplish this using pandas.merge, but it doesn't seem to be working out for me.
My code:
from datetime import date
from datetime import timedelta
import pandas
import numpy
import pyodbc
date1 = date(2017, 1, 1) #Starting Date
date2 = date(2017, 1, 10) #Ending Date
DateDelta = date2 - date1
DateAdd = DateDelta.days
StartDate = date1
count = 1
# Create the holding table
conn = pyodbc.connect('Server Information')
basetable = pandas.read_sql("SELECT....")
while count <= DateAdd:
    print(StartDate)
    datatable = pandas.read_sql("SELECT...WHERE Date = " + str(StartDate) + "...")
    finaltable = basetable.merge(datatable, how='left', left_on='OrganizationName', right_on='OrganizationName')
    StartDate = StartDate + timedelta(days=1)
    count = count + 1
print(finaltable)
Shortened the select statements for brevity's sake, but the tables produced look like this:
Basetable:
School_District
---------------
District_Alpha
District_Beta
...
District_Zed
Datatable:
School_District|2016-01-01|
---------------|----------|
District_Alpha | 400 |
District_Beta | 300 |
... | 200 |
District_Zed | 100 |
I have the datatable written so the column takes the name of the date selected for that particular loop, so column names can be unique once I get this up and running. My problem, however, is that the above code only produces one column of data. I have a good guess as to why: only the last merge is being processed. I thought using pandas.append would be the way to get around that, but pandas.append doesn't "join" like merge does. Is there some other way to accomplish a sort of join-and-append using pandas? My goal is to keep this flexible so that other dates can easily be input depending on our data needs.
In the end, what I want to see is:
School_District|2016-01-01|2016-01-02|... |2016-01-10|
---------------|----------|----------|-----|----------|
District_Alpha | 400 | 1 | | 45 |
District_Beta | 300 | 2 | | 33 |
... | 200 | 3 | | 5435 |
District_Zed | 100 | 4 | | 333 |
Your error is in the statement finaltable = basetable.merge(datatable,...). At each loop iteration, you merge the original basetable with the new datatable, store the result in the finaltable... and discard it. What you need is basetable = basetable.merge(datatable,...). No finaltables.
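A toy sketch of the fix (stand-in DataFrames replace the SQL queries, since the original SELECT statements are shortened):
import pandas as pd

basetable = pd.DataFrame({'School_District': ['District_Alpha', 'District_Beta']})
daily_tables = [
    pd.DataFrame({'School_District': ['District_Alpha', 'District_Beta'],
                  '2016-01-01': [400, 300]}),
    pd.DataFrame({'School_District': ['District_Alpha', 'District_Beta'],
                  '2016-01-02': [1, 2]}),
]
# merge each day's table back into the accumulating basetable
for datatable in daily_tables:
    basetable = basetable.merge(datatable, how='left', on='School_District')
print(basetable)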
