extract the top values from one column based on another column - python

So basically I have this dataframe called df, with a user ID column, the genre each user played, and the total number of streams. How can I extract the top 10 genres with the most streams, while also showing the total number of users who streamed them?
What I thought of doing is sorting the column values like this:
df_genre.sort_values(by="total_streams", ascending=False)
and then taking the top genres, but the result is not what I want. How can I fix it?

I think this is what you are looking for:
Data:
ID,genre,plays
12345,pop,23
12345,pop,576
12345,dance,18
12345,world,45
12345,dance,23
12345,pop,456
Input:
df = df.groupby(['ID','genre'])['plays'].sum().reset_index()
df.sort_values(by=['plays'], ascending=False)
Output:
ID genre plays
1 12345 pop 1055
2 12345 world 45
0 12345 dance 41
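If what you ultimately need is the top 10 genres overall, together with how many distinct users streamed each one, you can aggregate by genre alone and take the largest rows. A minimal sketch, assuming the columns are named ID, genre and plays as above:
top10 = (df.groupby('genre')
           .agg(total_streams=('plays', 'sum'),   # total plays per genre
                users=('ID', 'nunique'))          # distinct users per genre
           .nlargest(10, 'total_streams')         # top 10 genres by streams
           .reset_index())
print(top10)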

Related

Fill column with value from database if value in rows match when there are duplicates

So I have a data frame (D1) where every entry has a code associated with it, and these codes represent specific categories. In a separate data frame (D2), every code has a description. What I need to do is look through D1, match its codes with those in D2, and pull the description into an additional column. I've tried to do this using merge, but I keep running into duplicate errors. What is the best way to do this?
d1 = pd.DataFrame({'data':['one','two','three','four'],'code':['abc','xyz','abc','lnm']})
d2 = pd.DataFrame({'code':['abc','lnm','xyz'],'description':['first','second','third']})
need =
data code description
0 one abc first
1 two xyz third
2 three abc first
3 four lnm second
You can use a simple .map:
d1["description"] = d1["code"].map(d2.set_index("code")["description"])
print(d1)
Prints:
data code description
0 one abc first
1 two xyz third
2 three abc first
3 four lnm second
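The duplicate errors you were hitting with merge typically come from repeated codes in the lookup table; once d2 has one row per code, a plain left merge also works. A sketch under that assumption:
d1 = d1.merge(d2.drop_duplicates('code'), on='code', how='left')  # left join keeps every row of d1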

Python Pandas gather all occurrences

I am new to pandas, so I am not sure if I am doing what I want the best way possible, and one part seems to not be working properly.
In my database I have a table that records all the sales of the products on my website, and I would like to create a CSV report with every product that has been sold, its min price, max price, and other information. The table that records each sale has the following attributes:
product_id
sell_price
created_by
From my research I found out how to make a dataframe from a CSV database export, like below.
sellsdb = pd.read_csv('sellsdb.csv', delimiter = ',')
Now I make a copy of that dataframe without duplicates.
sells = sellsdb.copy().drop_duplicates(subset='product_id', keep=False)
Now I loop over each unique product in the copied dataframe:
for index, row in sells.iterrows():
    countSells = sellsdb.loc[sellsdb['product_id'] == str(row['product_id'])].count()['product_id']
    if countSells > 1:
        print(countSells)
When I run this, all the counts come back as 1, even when there are duplicates in the dataframe, but when I hard-code a product id I get the right number for that id. What is going on?
In the loop I was just going to append the columns that I need for the report to the dataframe with no duplicates.
Assume that your DataFrame contains:
product_id sell_price created_by
0 Aaa 20.35 Xxxx1
1 Aaa 20.15 Xxxx2
2 Aaa 22.00 Xxxx3
3 Bbb 10.13 Xxxx4
4 Ccc 16.00 Xxxx5
5 Ccc 16.50 Xxxx6
6 Ccc 17.00 Xxxx7
To compute the number of sales per product it is much easier (and more
pandasonic) to run:
result = df.groupby('product_id').sell_price.count().rename('cnt')
I added rename('cnt') to give the result a meaningful name.
Otherwise the name would have been inherited from the original column
(sell_price), but the values are numbers of sales, not prices.
The result, for the above sample input, is:
product_id
Aaa 3
Bbb 1
Ccc 3
Name: cnt, dtype: int64
It is a Series with index named product_id and the value under each
index is the count of sales of this product.
And a final remark: I named the result cnt (not count) because
count is the name of a pandas function (used here) and it is bad
practice to use names that "overwrite" the names of existing functions
or attributes.
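As a side note, the reason all your counts came back as 1: drop_duplicates(subset='product_id', keep=False) removes every row whose product_id occurs more than once, so sells only contains products sold exactly once, and counting those in sellsdb can only ever give 1. For the report itself you don't need the loop at all; a sketch of the aggregation, using the column names from your sample:
report = (sellsdb.groupby('product_id')['sell_price']
          .agg(['min', 'max', 'count'])    # min price, max price, number of sales
          .reset_index())
report.to_csv('report.csv', index=False)   # 'report.csv' is a hypothetical output name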

how to divide pandas dataframe into different dataframes based on unique values from one column and iterate over that?

I have a dataframe with three columns.
The first column has 3 unique values. I used the below code to create one dataframe per unique value; however, I am unable to iterate over those dataframes and not sure how to automate that.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:,0].unique())  ### let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:,0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe. If I manually type the name of the DF, I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to get the length and iterate over each dataframe as I normally would by typing its name.
What I'm looking for: if I try the below code
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the below code, but I can't figure out how to iterate over the dictionary when the DF has more than two columns, where the key would be the unique group and the value contains the other columns of the same row.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
My end goal: Group A tickets should go to one group, but for each description a unique task has to be created; we can club 10 tasks together and submit them as one request, so if I divide the df into separate dfs based on Assignment_group, it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002 ... STASK010.
Hope this helps.
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of a price column for each group if you had one.
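A minimal sketch of that kind of aggregation, assuming a hypothetical numeric column named Price alongside your real ones:
summary = df.groupby('Assignment Group').agg(
    tickets=('Description', 'count'),   # number of rows per group
    avg_price=('Price', 'mean'))        # 'Price' is hypothetical, for illustration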
I think you want to try something like len(eval('df%s' % 0))
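Rather than eval or globals(), you can iterate over the groups directly: groupby yields the group key and the matching sub-dataframe on each pass, so no dynamic variable names are needed. A sketch, batching each group into requests of up to 10 tasks as you described:
for name, group_df in df.groupby('Assignment Group'):
    print(name, len(group_df))               # the per-group length you wanted
    for start in range(0, len(group_df), 10):
        batch = group_df.iloc[start:start + 10]
        # create one request with up to 10 sub-tasks from `batch` here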

Can I update the value of a column based on the same column value in a python dataframe?

I have a dataframe to capture characteristics of people accessing a webpage. The list of times spent on the page by each user is one of the characteristic features that I get as input. I want to update this column with the maximum value of the list. Is there a way in which I can do this?
Assume that my data is:
df = pd.DataFrame({'Page_id': [1,2,3,4], 'User_count': [5,3,3,6], 'Max_time': [[45,56,78,90,120],[87,109,23],[78,45,89],[103,178,398,121,431,98]]})
What I want to do is convert the Max_time column in df to [120, 109, 89, 431].
I am not supposed to add another column for computing the max separately as this table structure cannot be altered.
I tried the following:
for i in range(len(df)):
    df.loc[i]["Max_time"] = max(df.loc[i]["Max_time"])
But this is not changing the column as I intended it to. Is there something that I missed?
df = pd.DataFrame({'Page_id':[1,2,3,4],'User_count':[5,3,3,6],'Max_time':[[45,56,78,90,120],[87,109,23],[78,45,89],[103,178,398,121,431,98]]})
df.Max_time = df.Max_time.apply(max)
Result:
Page_id User_count Max_time
0 1 5 120
1 2 3 109
2 3 3 89
3 4 6 431
You can also use this (assuming numpy is imported as np):
df['Max_time'] = df['Max_time'].map(lambda x: np.max(x))
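As for why your original loop changed nothing: df.loc[i]["Max_time"] = ... is chained indexing, so the assignment lands on a temporary copy rather than on df itself. If you do want a loop, go through a single indexer such as .at:
for i in range(len(df)):
    df.at[i, 'Max_time'] = max(df.at[i, 'Max_time'])  # .at writes directly into df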

Drop duplicates keeping the row with the highest value in another column

a = [['John', 'Mary', 'John'], [10, 22, 50]]
df1 = pd.DataFrame(list(zip(*a)), columns=['Name', 'Count'])
Given a data frame like this, I want to compare all rows sharing the same "Name" value and keep the one with the highest "Count". I'm not sure how to do this with a dataframe in Python.
Ex: In the case above the Answer would be:
Name Count
Mary 22
John 50
The lower value John 10 has been dropped (I only want to see the highest value of "Count" based on the same value for "Name").
In SQL it would be something like a SELECT CASE query (wherein I select the case where Name == Name and Count > Count recursively to determine the highest number), or a for loop over each name, but as I understand it, looping over DataFrames is a bad idea due to the nature of the object.
Is there a way to do this with a DF in Python? I could create a new data frame for each unique name (one with only John) and then take its highest value (df.value()[:1] or similar), but as I have many hundreds of unique entries, that seems like a terrible solution. :D
Either sort_values and drop_duplicates,
df1.sort_values('Count').drop_duplicates('Name', keep='last')
Name Count
1 Mary 22
2 John 50
Or, like miradulo said, groupby and max.
df1.groupby('Name')['Count'].max().reset_index()
Name Count
0 John 50
1 Mary 22
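If you need to keep every column of the winning row (not just Name and Count), another option is to select rows by the index of each group's maximum. A sketch:
df1.loc[df1.groupby('Name')['Count'].idxmax()]  # one full row per name, at its max Count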
