I have the dataframe named Tasks, containing a column named UserName. I want to count every occurrence of a row containing the same UserName, therefore getting to know how many tasks a user has been assigned to. For a better understanding, here's how my dataframe looks like:
In order to achieve this, I used the code below:
Most_Involved = Tasks['UserName'].value_counts()
But this got me a DataFrame like this:
Index Username
John 4
Paul 1
Radu 1
Which is not exactly what I am looking for. How should I re-write the code in order to achieve this:
Most_Involved
Index UserName Tasks
0 John 4
1 Paul 1
2 Radu 1
You can use transform to add a new column to existing data frame:
df['Tasks'] = df.groupby('UserName')['UserName'].transform('size')
# finally select the columns needed
df = df[['Index','UserName','Tasks']]
you can find duplicate rows based on columns by using pandas.
duplicateRowsDF = dataframe[dataframe.duplicated(['columnName'])]
here is the complete solution
Related
I was struggling with how to word the question, so I will provide an example of what I am trying to do below. I have a dataframe that looks like this:
ID CODE COST
0 60086 V2401 105.38
1 60142 V2500 221.58
2 60086 V2500 105.38
3 60134 V2750 35
4 60134 V2020 0
I am trying to create a dataframe that has the ID as rows, the CODE as columns, and the COST as values since the cost for the same code is different per ID. How can I do this in?
This seems like a classic "long to wide" problem, and there are several ways to do it. You can try pivot, for example:
df.pivot_table(index='ID', columns='CODE', values='COST')
(assuming that the dataframe is df.)
B0e.png
I am still relatively new to python. I am trying to do something more complicated. How can I use a for loop or iteration so that I count the same names ranked 1 and add them but also place them into a list format and also place the counted names in a separate list. The reason for this is that I will create a plot, and that I can do but I am stuck on how to separate the total counts of the same name and the names already counted.
Using this as example DataFrame:
RANK NAME
0 1 EMILY
1 1 DANIEL
2 1 EMILY
3 1 ISABELLA
You can do this to get the counted names:
counted_names = name_file[name_file.RANK == 1]['NAME'].value_counts()
print(counted_names)
EMILY 2
DANIEL 1
ISABELLA 1
pandas.groupby()
To solve any form aggregation in Python, all you need is to crack the groupby Function
For your case if you want to sum over 'Count' for all the unique names and later plot it, use pd.groupby()
Make sure you convert it into a DataFrame first and then apply Groupby Magic
name_file = pd.DataFrame(name_file)
name_file.groupby('Name').agg({'Count':'sum'})
This gives you the aggregated sum of counts for eaxh unique name in your dataframe
To get the Count of each Name, use the size.reset_index() method below
pd.DataFrame(name_file).groupby('Name').size().reset_index()
This returns the frequency of occurence of each unique name in the name_file
Hope this helps! Cheers !
I have a dataframe with three columns
The first column has 3 unique values I used the below code to create unique dataframes, However I am unable to iterate over that dataframe and not sure how to use that to iterate.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique()) ### lets assume Unique values are 0,1,2
mtlist = []
for index, value in enumerate(unique_groups):
globals()['df%s' % index] = df[df.iloc[:,0] == value]
mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
for example lets say I want to find out the length of the first unique dataframe
if I manually type the name of the DF I get the correct output
len(df0)
O/P
35
But I am trying to automate the code so technically I want to find the length and itterate over that dataframe normally as i would by typing the name.
What I'm looking for is
if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a Dictionary using the below code but I cant figure out how to iterate over the dictionary when the DF columns are more than two, where key would be the unique group and the value containes the two columns in same line.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Lets assume there are 100 rows of data
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is GROUP A tickets should go to one group, however for each description an unique task has to be created, but we can club 10 task once and submit as one request so if I divide the df's into different df based on the Assignment_group it would be easier to iterate over(thats the only idea which i could think of)
For example lets say we have REQUEST001
within that request it will have multiple sub tasks such as STASK001,STASK002 ... STASK010.
hope this helps
Your problem is easily solved by groupby: one of the most useful tools in pandas. :
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kind of operations (sum, count, std, etc) on your remaining columns, like getting the mean value of price for each group if that was a column.
I think you want to try something like len(eval('df%s' % 0))
I'm a beginner at Python and I have a school proyect where I need to analyze an excel document with information. It has aproximately 7 columns and more than 1000 rows.
Theres a column named "Materials" that starts at B13. It contains a code that we use to identify some materials. The material code looks like this -> 3A8356. There are different material codes in the same column they repeat a lot. I want to identify them and make a list with only one code, no repeating. Is there a way I can analyze the column and extract the codes that repeat so I can take them and make a new column with only one of each material codes?
An example would be:
12 Materials
13 3A8356
14 3A8376
15 3A8356
16 3A8356
17 3A8346
18 3A8346
and transform it toosomething like this:
1 Materials
2 3A8346
3 3A8356
4 3A8376
Yes.
If df is your dataframe, you only have to do df = df.drop_duplicates(subset=['Materials',], keep=False)
To load the dataframe from an excel file, just do:
import pandas as pd
df = pd.read_excel(path_to_file)
the subset argument indicates which column headings you want to look at.
Docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html
For the docs, the new data frame with the duplicates dropped is returned so you can assign it to any variable you want. If you want to re_index the first column, take a look at:
new_data_frame = new_data_frame.reset_index(drop=True)
Or simply
new_data_frame.reset_index(drop=True, inplace=True)
I have data frame df and I would like to keep a running total of names that occur in a column of that data frame. I am trying to calculate the running total column:
name running total
a 1
a 2
b 1
a 3
c 1
b 2
There are two ways I thought to do this:
Loop through the dataframe and use a separate dictionary containing name and current count. The current count for the relevant name would increase by 1 each time the loop is carried out, and that value would be copied into my dataframe.
Change the count in field for each value in the dataframe. In excel I would use a countif combined with a drag down formula A$1:A1 to fix the first value but make the second value relative so that the range I am looking in changes with the row.
The problem is I am not sure how to implement these. Does anyone have any ideas on which is preferable and how these could be implemented?
#bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.