pandas dataframe: duplicates based on column and time range - python

I have a (very much simplified here) pandas dataframe which looks like this:
df
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
2 2012-11-21 17:00:08 u3 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
What I would like to do now is to get all the duplicate messages whose timestamps are within 3 seconds of each other. The desired output would be:
datetime user type msg
0 2012-11-11 15:41:08 u1 txt hello world
1 2012-11-11 15:41:11 u2 txt hello world
3 2012-11-22 18:08:35 u4 txt hello you
4 2012-11-22 18:08:37 u5 txt hello you
without the third row, as its text is the same as in rows one and two, but its timestamp is not within 3 seconds of theirs.
I tried to define the columns datetime and msg as the subset for the duplicated() method, but it returns an empty dataframe because the timestamps are not identical:
mask = df.duplicated(subset=['datetime', 'msg'], keep=False)
print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []
Is there a way where I can define a range for my "datetime" parameter? To illustrate, something
like:
mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)
Any help here would as always be very much appreciated.

This piece of code gives the expected output:
df[(df.groupby(["msg"], as_index=False)["datetime"].diff().fillna(0).dt.seconds <= 3).reset_index(drop=True)]
I have grouped on the "msg" column of the dataframe, then selected the "datetime" column of each group and used the built-in diff function, which finds the difference between consecutive values in that column. The NaT produced for the first row of each group is filled with zero, and only the rows whose difference is at most 3 seconds are kept.
Before using the above code, make sure your dataframe is sorted on datetime in ascending order.
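For readability, the same idea can be unrolled into separate steps. The sketch below is a hedged rewrite of the one-liner above (it assumes the datetime column has already been parsed with pd.to_datetime) and uses .dt.total_seconds() rather than .dt.seconds, since .dt.seconds only returns the seconds component of a gap and ignores whole days:
# hedged step-by-step version of the groupby/diff approach above
df = df.sort_values('datetime')                  # diff only makes sense on a sorted frame
gaps = df.groupby('msg')['datetime'].diff()      # gap to the previous row with the same msg
gaps = gaps.fillna(pd.Timedelta(0))              # first occurrence of each msg counts as a gap of 0
mask = gaps.dt.total_seconds() <= 3              # keep rows within 3 seconds of the previous one
print(df[mask])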

This bit of code works on your example data, although you might have to play around with any extreme cases.
From your question I'm assuming you want to filter messages relative to the first time each one appears in df. It won't work if you want to keep a message that shows up again later as the start of a new cluster, beyond the original threshold.
In short, I wrote a function that takes your dataframe and the 'msg' to filter for. It takes the timestamp of the first time the message appears and compares it to all the other times it appears.
It then selects only the instances where it appears within 3 seconds of the first appearance.
import numpy as np
import pandas as pd
#function which will return dataframe containing messages within three seconds of the first message
def get_info_within_3seconds(df, msg):
    df_of_msg = df[df['msg'] == msg].sort_values(by='datetime')
    t1 = df_of_msg['datetime'].reset_index(drop=True)[0]
    datetime_deltas = [(i - t1).total_seconds() for i in df_of_msg['datetime']]
    filter_list = [i <= 3.0 for i in datetime_deltas]
    return df_of_msg[filter_list]
msgs = df['msg'].unique()
#apply function to each unique message and then create a new df
new_df = pd.concat([get_info_within_3seconds(df, i) for i in msgs])
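For completeness, a hedged end-to-end run on the sample data from the question (assuming the timestamps are parsed with pd.to_datetime) would keep rows 0, 1, 3 and 4 and drop row 2:
# hypothetical driver built from the sample data in the question
df = pd.DataFrame({
    'datetime': pd.to_datetime(['2012-11-11 15:41:08', '2012-11-11 15:41:11',
                                '2012-11-21 17:00:08', '2012-11-22 18:08:35',
                                '2012-11-22 18:08:37']),
    'user': ['u1', 'u2', 'u3', 'u4', 'u5'],
    'type': ['txt'] * 5,
    'msg': ['hello world'] * 3 + ['hello you'] * 2,
})
new_df = pd.concat([get_info_within_3seconds(df, m) for m in df['msg'].unique()])
# new_df now contains rows 0, 1, 3 and 4 -- row 2 is more than 3 seconds
# after the first 'hello world', so it is filtered out.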

Related

Checking for Specific Value in a Pandas Column and performing further operation

I have a pandas DataFrame which has a column named is_retweeted. The values in this column are either Yes or No. If the value is 'Yes', I want to go ahead and perform X-type sentiment analysis (the code for which I have); else, if the value is No, I want to go ahead and perform Y-type sentiment analysis (again, the code for which I have).
But I am unable to check for this condition. I get the same error seen here, and no solution there is helping for my use case.
Based on what is suggested here if I do:
s = 'Yes' in tweet_df.is_retweeted
print(s)
I get False as output.
This is what the dataframe looks like (for ease of representation I haven't displayed the other columns here):
tweet_dt is_retweeted
2020-09-01 No
2020-09-01 No
2020-09-01 Yes
I want to perform the operation below based on the value in the 'is_retweeted' column:
retweets_nerlst = []
while tweet_df['is_retweeted'] == 'Yes':
    for index, row in tqdm(tweet_df.iterrows(), total=tweet_df.shape[0]):
        cleanedTweet = row['tweet'].replace("#", "")
        sentence = Sentence(cleanedTweet, use_tokenizer=True)
PS: My codebase can be seen here
I think you can do it with np.where:
import pandas as pd
import numpy as np

def SentimentX(text):
    # your SentimentX code
    return f"SentimentX_result of {text}"

def SentimentY(text):
    # your SentimentY code
    return f"SentimentY_result of {text}"

data = {"date": ["2020-09-01", "2020-09-02", "2020-09-03"],
        "is_retweeted": ["No", "No", "Yes"],
        "text": ["text1", "text2", "text3"]}
df = pd.DataFrame(data)
df['sentiment'] = np.where(df["is_retweeted"] == "Yes",
                           df['text'].apply(SentimentX),
                           df['text'].apply(SentimentY))
print(df)
result:
date is_retweeted text sentiment
0 2020-09-01 No text1 SentimentY_result of text1
1 2020-09-02 No text2 SentimentY_result of text2
2 2020-09-03 Yes text3 SentimentX_result of text3
If I understand your question correctly, you can use apply with a condition:
tweet_df['result'] = tweet_df.apply(lambda x: sentiment_x(x.text) if x.is_retweeted =='Yes' else sentiment_y(x.text), axis = 1)
This assumes your dataframe contains a "text" column which you are trying to do sentiment analysis on, and that your sentiment functions return a value to be stored in a new column, which I have called "result".
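If calling both sentiment functions on every row (as the np.where version above effectively does) is too expensive, a hedged alternative sketch is to apply each function only to its own rows via boolean masks; the 'text' and 'result' column names are the same assumptions as above:
# run each sentiment function only on the rows that need it
yes_mask = tweet_df['is_retweeted'] == 'Yes'
tweet_df.loc[yes_mask, 'result'] = tweet_df.loc[yes_mask, 'text'].apply(sentiment_x)
tweet_df.loc[~yes_mask, 'result'] = tweet_df.loc[~yes_mask, 'text'].apply(sentiment_y)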

Pandas groupby based on a condition from another column

I have a df such as example below and I am looking to identify users who message the same text within a given time period, such as <= 60 minutes for the example:
user = [1,2,3,4,5,6]
text = ['hello','hello','whats up','not now','not now','hello']
times = ['2010-09-14 16:51:00','2010-09-14 15:59:00','2010-09-14 15:14:00',
         '2010-09-14 14:55:00','2010-09-14 15:47:00','2010-09-14 15:29:00']
df = pd.DataFrame({'userid':user,'message':text,'time':times})
My current method groups the text by a list of users who messaged each text:
group = df.groupby('message')['userid'].apply(list)
Then I return all the possible combinations of userids from each list as an array of pairs, and use the userid-text of each instance as a key to retrieve the time of each message for each pair from the original df.
This method works, but I have been trying to find a better way of grouping the users of each distinct text conditionally, based on whether the time between instances is less than a specified period, say 60 minutes for this example, taken as the difference between the two users' messages. So "hello" from users 1 and 2 is less than 60 minutes apart, so the pair passes the condition and both users are added to the list for "hello".
The expected output for the example would therefore be:
userid
"hello" [1,2,6]
"not not" [4,5]
I haven't found any exact or similar solutions so any help is really appreciated. It may be that my approach to the problem is wrong!
Not sure it's the most elegant solution, but here's one using groupby and rolling. The advantage of this method is that it works for large sets of data: it doesn't create the full Cartesian product of all the users and times who sent the same message.
res = []
def collect_users(x):
    if len(x) > 1:
        s = set(x)
        if res and res[-1].issubset(s):
            res.pop()
        res.append(set(x))
    return 0

df.groupby("message").rolling("3600s").agg(collect_users)
The result comes as a list of sets:
[{1.0, 2.0, 6.0}, {4.0, 5.0}]
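Note that a time-based rolling window like "3600s" only works against a monotonic datetime index, so the frame in the question (where time is a column of strings) needs a small amount of preparation first; a hedged setup sketch:
df['time'] = pd.to_datetime(df['time'])       # parse the string timestamps
df = df.set_index('time').sort_index()        # rolling('3600s') needs a sorted datetime index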
One option is to use a groupby to find the next matching message chronologically, merge it to the original dataframe, and then filter to things where the message gap is < 1 hour:
In [402]: df2 = df.merge(df.sort_values("time").groupby("message").shift(), left_index=True, right_index=True, suffixes=["_source", "_target"])
In [403]: df2.loc[df2['time_source'].sub(df2['time_target']).lt("1h"), ["message", "userid_source", "userid_target"]].astype('O')
Out[403]:
message userid_source userid_target
0 hello 1 2
1 hello 2 6
4 not now 5 4
Note that in your current data, users 2 and 6 messaged "hello" 30 minutes apart, so that pair also appears here.
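To get from those pairs back to the per-message user lists the question asks for, one hedged follow-up sketch (reusing df2 from above, and again assuming the time columns were parsed with pd.to_datetime before the merge) is:
# collapse the matched pairs into one user list per message
pairs = df2.loc[df2['time_source'].sub(df2['time_target']).lt("1h")]
stacked = pd.concat([
    pairs[['message', 'userid_source']].rename(columns={'userid_source': 'userid'}),
    pairs[['message', 'userid_target']].rename(columns={'userid_target': 'userid'}),
])
users_per_msg = stacked.groupby('message')['userid'].apply(lambda s: sorted(set(s.astype(int))))
# hello      [1, 2, 6]
# not now    [4, 5]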

how to divide pandas dataframe into different dataframes based on unique values from one column and iterate over that?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create one dataframe per unique value; however, I am not sure how to iterate over those dataframes.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique()) ### lets assume Unique values are 0,1,2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:,0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
for example lets say I want to find out the length of the first unique dataframe
if I manually type the name of the DF I get the correct output
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over each dataframe just as I would by typing its name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the DF has more than two columns, where the key would be the unique group and the value holds the remaining columns for that group.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, for each description a unique task has to be created, but we can club 10 tasks together and submit them as one request. So if I divide the df into different dfs based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, such as getting the mean price for each group if that were a column.
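If the goal is also to iterate over each per-group dataframe (rather than creating numbered globals or eval'ing their names), a hedged sketch is to loop over the groupby object directly; the column is spelled 'Assignment Group' here to match the code in the question, though the sample data header reads 'Assignment_group':
for name, group in df.groupby('Assignment Group'):
    print(name, len(group))   # length of the sub-dataframe for this group
    # ... build the request / sub-tasks for this group from `group` here ...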
I think you want to try something like len(eval('df%s' % 0))

Creating new dataframe with .txt file using Pandas

I have a text file with data displayed like this:
{"created_at":"Mon Jun 02 00:04:00 +0000 2018","id":870430762953920,"id_str":"87043076220","text":"Hello there","source":"\u003ca href=\"http:\/\/tapbots.com\/software\/tweetbot\/mac\" rel=\"nofollow\"\u003eTweetbot for Mac\u003c\/a\u003e","truncated":false,"in_reply_to_status_id"}
The data is twitter posts and I have hundreds of these in one text file. I want to get the key/value pair "text":"Hello there" and turn that into its own dataframe with a third column named target. I don't need any of the other columns. I'm doing some sensitivity analysis.
What would be the most pythonic way to go about this? I thought about using
df = pd.read_csv('test.txt', sep=r'"'), but then I don't know how to get rid of all the other columns I don't need and select the column with the text in it.
Any help would be much appreciated!
I had to modify the last two key/value pairs in your data to get it to work. You may want to check whether you're getting the data correctly, or whether you copied and pasted it properly, because you should be getting errors with the data as displayed in your post.
"truncated":False,"in_reply_to_status_id":1
Then this worked well for me:
import pandas as pd

with open('test.txt', 'r') as inf1:
    d = eval(inf1.read())   # reads the text file as code to evaluate

index = range(len(d))
df = pd.DataFrame(d, index=index)  # have to add an index because the values are all scalars
df = df.pop('text')
print(df)
Returns
0 Hello there
1 Hello there
2 Hello there
3 Hello there
4 Hello there
5 Hello there
6 Hello there
Name: text, dtype: object
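Since each record in the file looks like a JSON object (one tweet per line), a more robust hedged alternative, assuming the file really is line-delimited JSON, is to parse it with the json module instead of eval and keep only the text field:
import json
import pandas as pd

# parse each non-empty line as JSON and keep only the tweet text
with open('test.txt') as fh:
    texts = [json.loads(line)['text'] for line in fh if line.strip()]

df = pd.DataFrame({'text': texts})
df['target'] = None   # placeholder for the extra column the question mentions
print(df)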

Add multiple columns to multiple data frames

I have a number of small dataframes with a date and stock price for a given stock. Someone else showed me how to loop through them so they are contained in a list called all_dfs. So all_dfs[0] would be a dataframe with Date and IBM US equity, all_dfs[1] would be Date and MMM US Equity, etc. (example shown below). The Date column in the dataframes is always the same, but the stock names are all different, and the numbers associated with each stock column are always different. So when you call all_dfs[1] this is the dataframe you would see (i.e., all_dfs[1].head()):
IDX Date MMM US equity
0 1/3/2000 47.19
1 1/4/2000 45.31
2 1/5/2000 46.63
3 1/6/2000 50.38
I want to add the same additional columns to EVERY dataframe. So I was trying to loop through them and add the columns. The numbers in the stock name columns are the basis for the calculations that make the other columns.
There are more columns to add, but I think they will all loop through the same way, so this is a sample of the columns I want to add:
Column 1 to add >>> df['P_CHG1D'] = df['Stock name #1'].pct_change(1) * 100
Column 2 to add >>> df['PCHG_SIG'] = P_CHG1D > 3
Column 3 to add>>> df['PCHG_SIG']= df['PCHG_SIG'].map({True:1,False:0})
This is the code that I have so far, but it is returning a syntax error for all_dfs[i].
for i in range(len(df.columns)):
    for all_dfs[i]:
        df['P_CHG1D'] = df.loc[:,0].pct_change(1) * 100
So I also have two problems that I cannot figure out:
I don't know how to add columns to every dataframe in the loop. So I would have to do something like all_dfs[i]['ADD COLUMN NAME'] = df['Stock Name 1'].pct_change(1) * 100
The second part after the =, which is df['Stock Name 1'], keeps changing (in this example it is called MMM US Equity, but the next time it would be the column header of the second dataframe, so it could be IBM US Equity), as each dataframe has a different name, so I don't know how to refer to it properly in the loop.
I am new to python/pandas so if I am thinking about this the wrong way let me know if there is a better solution.
Consider iterating through the length of all_dfs to reference each element in the loop by its index. For the first new column, select the stock column by its position of 2 (the third column) with positional indexing (.ix in older pandas, .iloc in current versions):
for i in range(len(all_dfs)):
    all_dfs[i] = all_dfs[i].copy()   # work on a copy to avoid SettingWithCopyWarning (is_copy is deprecated)
    all_dfs[i]['P_CHG1D'] = all_dfs[i].iloc[:, 2].pct_change(1) * 100
    all_dfs[i]['PCHG_SIG'] = all_dfs[i]['P_CHG1D'] > 3
    all_dfs[i]['PCHG_SIG_VAL'] = all_dfs[i]['PCHG_SIG'].map({True: 1, False: 0})
