Pandas Split-Apply-Combine - python

I have a dataset with UserIDs, Tweets and CreatedDates. Each UserID will have multiple tweets created at different dates. I want to find the frequency of tweets, and I've written a small calculation which gives me the number of tweets per hour per UserID. I used groupby to do this; the code is as follows:
twitterDataFrame = twitterDataFrame.set_index(['CreatedAt'])
tweetsByEachUser = twitterDataFrame.groupby('UserID')
numberOfHoursBetweenFirstAndLastTweet = (tweetsByEachUser['CreatedAtForCalculations'].first() - tweetsByEachUser['CreatedAtForCalculations'].last()).astype('timedelta64[h]')
numberOfTweetsByTheUser = tweetsByEachUser.size()
frequency = numberOfTweetsByTheUser / numberOfHoursBetweenFirstAndLastTweet
When printing the value of frequency I get :
UserID
807095 5.629630
28785486 2.250000
134758540 8.333333
Now I need to go back into my big DataFrame (twitterDataFrame) and add these values alongside the correct UserIDs. How can I possibly do that? I'd like to say
twitterDataFrame['frequency'] = the frequency corresponding to that row's UserID, i.e. look up twitterDataFrame['UserID'] against the frequency values we got above.
However, I am not sure how to do this. Would anyone know how I can achieve this?

You can use a join operation on the frequency Series you created (see the sketch below), or compute tweets per hour in one stage with transform (tweet count divided by the hours between each user's earliest and latest tweet):
get_freq = lambda ts: len(ts) / ((ts.max() - ts.min()) / pd.Timedelta(hours=1))
twitterDataFrame['frequency'] = twitterDataFrame.groupby('UserID')['CreatedAtForCalculations'].transform(get_freq)
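If you go the two-stage route instead, the frequency Series you already computed is indexed by UserID, so it can be mapped or joined back onto the original frame. A minimal sketch, assuming frequency is the Series printed above:
# look up each row's UserID in the frequency Series
twitterDataFrame['frequency'] = twitterDataFrame['UserID'].map(frequency)
# equivalently, as a join on the UserID column:
# twitterDataFrame = twitterDataFrame.join(frequency.rename('frequency'), on='UserID')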

Related

In Python pandas, how do you eliminate rows of data that fail to meet a condition of grouped data?

I have a data set that contains hourly data of marketing campaigns. There are several campaigns, and not all of them are active during all 24 hours of the day. My goal is to eliminate all rows for campaign days where I don't have the full 24 data rows of a single day.
The raw data contains a lot of information like this:
Original Data Set
I created a dummy variable of ones to be able to count single instances of rows. This is the code I applied to be able to see the results I want to get:
tmp = df.groupby(['id','date']).count()
tmp.query('Hour > 23')
I get the following results:
Results of two lines of code
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate the data per campaign per day that does not reach 24 rows? The objective is the real data, not the count, i.e. ungrouped from what I present in the second picture.
I appreciate the guidance.
Use transform to broadcast the count over all rows of your dataframe, then use loc as a replacement for query:
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
               .loc[lambda x: x > 23].index]
Drop the data you don't want before you do the groupby.
You can use .loc or .drop; I am unfamiliar with .query.
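The same idea can also be written with groupby().filter, which keeps only the groups that satisfy a condition. A minimal sketch, assuming the same 'id', 'date' and 'Hour' columns as above:
# keep only campaign/date groups that contain a full day of 24 hourly rows
out = df.groupby(['id', 'date']).filter(lambda g: g['Hour'].count() == 24)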

Pandas Dataframe not appending

Here is what I need help with, specifically #3.
Import labelled data
Evaluate word for matches
Update labelled data with those words that are similar to 'labelled data'
I have a set of labelled data (say 100 rows) and I want to update it automatically when a new word is similar to an existing row (i.e., SimilarityScore > 75%).
I start out by importing the labelled data into two DataFrames. The first (labelled_data) is where I calculate and store the similarity score in two added columns (this is where I store the similar text and the associated score). The second (dictionary_revised) is the DataFrame that I want to append to. Here is the code that I use to create those two DataFrames:
#Read the labelled data
labelled_data = pd.read_csv('DictionaryV2.csv')
dictionary_revised = pd.read_csv('DictionaryV2.csv')
#Add two columns to labelled_data
labelled_data['SimilarText'] = ''
labelled_data['SimilarityScore'] = float()
Next, I calculate the similarity of Word A and Word B, updating labelled_data with SimilarText and SimilarityScore. This works; here is what the output looks like:
QueryText Subjectmatter DateAdded SimilarText SimilarityScore
2 hr HR & Benefits 1/1/2020 support 0.771284
4 pay HR & Benefits 1/1/2020 check 0.829261
Next, I created the following variable to return only those scores > 75%. This works:
score = labelled_data['SimilarityScore'] > 0.75
Here is a sample output:
QueryText Subjectmatter DateAdded SimilarText SimilarityScore
0 store Shopping 1/1/2020 retail 0.730492
1 performance Career & Jobs 1/1/2020 connecting 0.743287
Next, I get the current date (as I want to know when the SimilarityScore was calculated)
import datetime
now = datetime.datetime.now()
Finally, I attempt to append to the dictionary_revised DataFrame using the following, but it is not working. I have tried with the 'results =' portion of the code and without it. Neither works.
for i in range(len(labelled_data[score])):
    results = dictionary_revised.append({'QueryText': labelled_data['SimilarText'],
                                         'Subjectmatter': labelled_data['Subjectmatter'],
                                         'DateAdded': now.strftime('%Y-%m-%d')}, ignore_index=True)
Any suggestions?
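One way this could be fixed, sketched under the assumption that you want one new row per entry in labelled_data[score]: each append call above passes whole Series rather than single values, and its result is never accumulated, so build the new rows from the filtered frame and concatenate them once (DataFrame.append is deprecated in recent pandas, hence pd.concat):
matches = labelled_data[score]                   # rows with SimilarityScore > 0.75
new_rows = pd.DataFrame({
    'QueryText': matches['SimilarText'].values,  # the similar word becomes the new QueryText
    'Subjectmatter': matches['Subjectmatter'].values,
    'DateAdded': now.strftime('%Y-%m-%d'),       # scalar broadcast to every new row
})
dictionary_revised = pd.concat([dictionary_revised, new_rows], ignore_index=True)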

How to divide a pandas dataframe into different dataframes based on unique values from one column and iterate over them?

I have a dataframe with three columns.
The first column has 3 unique values. I used the code below to create one dataframe per unique value; however, I am unable to iterate over those dataframes and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique()) ### lets assume Unique values are 0,1,2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:,0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe.
If I manually type the name of the DF, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over each dataframe just as I would by typing its name.
What I'm looking for is this: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, but I can't figure out how to iterate over the dictionary when the DF has more than two columns, where the key would be the unique group and the value contains the two columns on the same line.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, for each description a unique task has to be created. We can club 10 tasks together and submit them as one request, so if I divide the dataframe into different dataframes based on the Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of a price column for each group if you had one.
I think you want to try something like len(eval('df%s' % 0))
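Rather than creating named globals and going through eval, it is usually simpler to iterate over the groupby object (or the df_dict from the question) directly. A minimal sketch using the column names from the sample data (the sample header says Assignment_group while the earlier code uses 'Assignment Group', so adjust the name to your file; the create_request helper in the comment is hypothetical):
import pandas as pd

df = pd.read_excel("input.xlsx")

# each iteration yields the group label and the sub-dataframe for that group
for name, group in df.groupby("Assignment_group"):
    print(name, len(group))                      # e.g. the length you wanted from len(df0)
    # club the rows into chunks of up to 10 descriptions per request
    for start in range(0, len(group), 10):
        chunk = group.iloc[start:start + 10]
        # one request per chunk, one sub-task per row, e.g.:
        # create_request(name, chunk["Description"].tolist(), chunk["Document"].tolist())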

Obtaining column based mean over unique multi index values using pandas

Good day awesome people,
I'm working on the dataframe shown in Table and am looking to achieve New table. I first tried obtaining the test score and total averages of the new table using:
df = pd.read_csv("testdata.csv")
grouped = df.groupby(["county_id","school_id","student_id"]).mean()
print (grouped)
It gives me this error:
KeyError: 'county_id'
My plan is for the new table to be grouped by county_id, school_id and student_id. Then, for each unique index, the average of the test scores and a remark based on score bands (Excellent 20.0-25.0, Good 17.0-19.9 and Pass 14.0-16.9) would be populated.
I will really appreciate anyone who could help out. Also, if it's possible to use a lambda function to achieve this, that would be cool too. Thank you.
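A KeyError like this usually means the column names in the CSV do not match exactly (stray spaces or different capitalization are common). A minimal sketch of the grouping plus a lambda for the remark bands, assuming the score column is named test_score (adjust the names to your file):
import pandas as pd

df = pd.read_csv("testdata.csv")
df.columns = df.columns.str.strip()   # guard against stray whitespace in the headers

# average the numeric columns per (county_id, school_id, student_id)
grouped = df.groupby(["county_id", "school_id", "student_id"], as_index=False).mean(numeric_only=True)

# map the averaged score onto the remark bands from the question
grouped["remark"] = grouped["test_score"].apply(
    lambda s: "Excellent" if s >= 20.0 else "Good" if s >= 17.0 else "Pass" if s >= 14.0 else ""
)
print(grouped)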

Performing multiple calculations on a Python Pandas group from CSV data

I have daily CSVs that are automatically created for work; they average about 1000 rows and have exactly 630 columns. I've been trying to work with pandas to create a summary report that I can write to a new txt file each day.
The problem that I'm facing is that I don't know how to group the data by 'provider', while also performing my own calculations based on the unique values within that group.
After 'Start', the rest of the columns (-2000 to 300000) are profit/loss data based on time (milliseconds). The file is usually between 700-1000 lines and I usually don't use any data past column heading '20000' (not shown).
I am trying to make an output text file that will summarize the csv file by 'provider'(there are usually 5-15 unique providers per file and they are different each day). The calculations I would like to perform are:
Provider = df.group('providers')
total tickets = sum of 'filled' (filled column: 1=filled, 0=reject)
share % = a providers total tickets / sum of all filled tickets in file
fill rate = sum of filled / (sum of filled + sum of rejected)
Size = Sum of 'fill_size'
1s Loss = (count how many times column '1000' < $0) / total_tickets
1s Avg = average of column '1000'
10s Loss = (count how many times MIN of range ('1000':'10000') < $0) / total_tickets
10s Avg = average of range ('1000':'10000')
Ideally, my output file will have these headings transposed across the top and the 5-15 unique providers underneath.
While I still don't understand the proper format to write all of these custom calculations, my biggest hurdle is referencing one of my calculations in the new dataframe (i.e., total_tickets) and applying it to the next calculation (i.e., 1s Loss).
I'm looking for someone to tell me the best way to perform these calculations and maybe provide an example of at least 2 or 3 of my metrics. I think that if I have the proper format, I'll be able to run with the rest of this project.
Thanks for the help.
The function you want is DataFrame.groupby; there are more examples in the pandas documentation.
Usage is fairly straightforward.
You have a field called 'provider' in your dataframe, so to create groups, you simply call grouped = df.groupby('provider'). Note that this does no calculations, it just tells pandas how to find groups.
To apply functions to this object, you can do a few things:
If it's an existing function (like sum), tell the grouped object which columns you want and then call .sum(), e.g., grouped['filled'].sum() will give the sum of 'filled' for each group. If you want the sum of every column, grouped.sum() will do that. For your second example, you could divide this resulting series by df['filled'].sum() to get your percentages.
If you want to pass a custom function, you can call grouped.apply(func) to apply that function to each group.
To store your values (e.g., for total tickets), you can just assign them to variables, e.g. total_tickets = df['filled'].sum() and tickets_by_provider = grouped['filled'].sum(). You can then use these in other calculations, as in the sketch below.
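For instance, the share % metric from the question could come straight from those two stored values (a sketch; column names follow the question):
total_tickets = df['filled'].sum()                     # all filled tickets in the file
tickets_by_provider = grouped['filled'].sum()          # filled tickets per provider
share_pct = tickets_by_provider / total_tickets * 100  # share % as a Series indexed by provider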
Update:
For one second loss (and for the other losses), you need two things:
The number of times df['1000'] < 0 for each provider
The total number of records for each provider
These both fit within groupby.
For the first, you can use grouped.apply with a lambda function. It could look like this:
_1s_loss_freq = grouped.apply(lambda x: x['filled'][x['1000'] < 0].sum())
For group totals, you just need to pick a column and get counts. This is done with the count() function.
records_per_group = grouped['1000'].count()
Then, because pandas aligns on indices, you can get your percentages with _1s_loss_freq / records_per_group.
The same approach applies to the 10s Loss metric.
The last question, about the average over a range of columns, relies on pandas' understanding of how it should apply functions. If you take a dataframe and call dataframe.mean(), pandas returns the mean of each column. There's a default argument in mean() that is axis=0. If you change that to axis=1, pandas will instead take the mean of each row.
For your last question, 10s Avg, I'm assuming you've aggregated to the provider level already, so that each provider has one row. I'll do that with sum() below but any aggregation will do. Assuming the columns you want the mean over are stored in a list called cols, you want:
one_rec_per_provider = grouped[cols].sum()
provider_means_over_cols = one_rec_per_provider.mean(axis=1)
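Putting several of these pieces together, a per-provider summary for most of the metrics in the question might be assembled as sketched below. This is not a definitive implementation: the input file name is hypothetical, and the millisecond P/L column names between '1000' and '10000' are assumed to run in steps of 1000, so adjust both to your actual file.
import pandas as pd

df = pd.read_csv("daily_report.csv")          # hypothetical file name
grouped = df.groupby('provider')

ten_s_cols = [str(ms) for ms in range(1000, 10001, 1000)]    # '1000' ... '10000' (assumed step)

summary = pd.DataFrame({
    'total_tickets': grouped['filled'].sum(),
    'size': grouped['fill_size'].sum(),
    'fill_rate': grouped['filled'].mean(),                   # filled / (filled + rejected), since filled is 0/1
    '1s_avg': grouped['1000'].mean(),
    '1s_loss': grouped.apply(lambda g: (g['1000'] < 0).sum()),
    '10s_avg': grouped[ten_s_cols].mean().mean(axis=1),
    '10s_loss': grouped.apply(lambda g: (g[ten_s_cols].min(axis=1) < 0).sum()),
})

summary['share_pct'] = summary['total_tickets'] / summary['total_tickets'].sum() * 100
summary['1s_loss'] = summary['1s_loss'] / summary['total_tickets']
summary['10s_loss'] = summary['10s_loss'] / summary['total_tickets']

summary.to_csv("provider_summary.txt", sep='\t')             # the daily summary report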
