Can you have automatically numbered index hierarchies? - python

I am wondering if it is possible to have multiple index levels, similar to the picture, where one of them (the second level in my case) counts automatically.
My problem is that I have data which needs to be updated repeatedly, and each entry belongs to either the category "Math" or "English". However, I would like to keep track of the first entry, second entry and so on for each category.
The trick is that I would like the second-level index to count automatically within the category, so that every time I add a new entry to a category, "Math" for example, the second-level index updates automatically.
Thanks for the help.

You can set_index() using a column together with a computed series; in your case cumcount() does what you need.
import numpy as np
import pandas as pd

df = pd.DataFrame({"category": np.random.choice(["English", "Math"], 15),
                   "data": np.random.uniform(2, 5, 15)})
df = df.sort_values("category")
df2 = df.set_index(["category", df.groupby("category").cumcount() + 1])
df2
output:
                 data
category
English  1   2.163213
         2   4.292678
         3   4.227062
         4   3.255596
         5   3.376833
         6   2.477596
Math     1   3.436956
         2   3.275532
         3   2.720285
         4   2.181704
         5   3.667757
         6   2.683818
         7   2.069882
         8   3.155550
         9   4.155107
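To see the automatic numbering in action, here is a minimal sketch (the sample values and the helper name are invented for illustration): rebuilding the index after each append gives the new entry the next counter within its category.

```python
import pandas as pd

# Hypothetical helper: rebuild the two-level index after every update.
def with_counter_index(df):
    df = df.sort_values("category", kind="stable")
    return df.set_index(["category", df.groupby("category").cumcount() + 1])

df = pd.DataFrame({"category": ["Math", "English", "Math"],
                   "data": [3.1, 2.2, 4.5]})

# Append a new "Math" entry; recomputing the index numbers it 3 automatically.
df = pd.concat([df, pd.DataFrame({"category": ["Math"], "data": [2.9]})],
               ignore_index=True)
df2 = with_counter_index(df)
print(df2)
```

Because the counter is recomputed from the data, there is nothing to keep in sync manually after an update.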

Related

Printing the whole row of my data from a max value in a column

I am trying to select the highest value from this data, but I also need the month it comes from, i.e. printing the whole row. Currently I'm using df.max(), which just pulls the highest value. Does anyone know how to do this in pandas?
#current code
accidents["month"] = accidents.Date.apply(lambda s: int(s.split("/")[1]))
temp = accidents.groupby('month').size().rename('Accidents')
#selecting the highest value from the dataframe
temp.max()
answer given = 10937
answer I need should look like this (month and no. of accidents): 11 10937
temp Series:
month
1 9371
2 8838
3 9427
4 8899
5 9758
6 9942
7 10325
8 9534
9 10222
10 10311
11 10937
12 9972
Name: Accidents, dtype: int64
It would also be good to rename the counts column to Accidents, if anyone can help with that too. Thanks
If the maximum value is unique (in your case it is) you can take a boolean subset of the Series.
temp[temp == temp.max()]
Note that temp here is a Series, not a DataFrame (groupby(...).size() returns one), so positional indexing like .iloc[:, 1] would fail. Comparing the Series to its own max builds a boolean mask, and the subset keeps the entry whose count equals the maximum, with the month as its index label.
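Alternatively, idxmax() returns the index label of the maximum directly, so the month and its count come out together (the temp values below are copied from the question):

```python
import pandas as pd

# Rebuild the temp Series from the question.
temp = pd.Series([9371, 8838, 9427, 8899, 9758, 9942, 10325, 9534,
                  10222, 10311, 10937, 9972],
                 index=range(1, 13), name="Accidents")
temp.index.name = "month"

# idxmax() gives the index label (the month) of the maximum value.
busiest_month = temp.idxmax()   # 11
most_accidents = temp.max()     # 10937
print(busiest_month, most_accidents)
```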

How to do arithmetic on a Python DataFrame using instructions held in another DataFrame?

I asked this question for R a few months back and got a great answer that I used often. Now I'm trying to transition to Python but I was dreading attempting rewriting this code snippet. And now after trying I haven't been able to translate the answer I got (or find anything similar by searching).
The question is: I have a dataframe that I'd like to append new columns to where the calculation is dependent on values in another dataframe which holds the instructions.
I have created a reproducible example below (although in reality there are quite a few more columns and many rows so speed is important and I'd like to avoid a loop if possible):
input dataframes:
import pandas as pd

data = {"A": ["orange", "apple", "banana"], "B": [5, 3, 6], "C": [7, 12, 4],
        "D": [5, 2, 7], "E": [1, 18, 4]}
data_df = pd.DataFrame(data)
key = {"cols": ["A", "B", "C", "D", "E"],
       "include": ["no", "no", "yes", "no", "yes"],
       "subtract": ["na", "A", "B", "C", "D"],
       "names": ["na", "G", "H", "I", "J"]}
key_df = pd.DataFrame(key)
desired output (same as data but with 2 new columns):
output = {"A":["orange","apple","banana"],"B":[5,3,6],"C":[7,12,4],"D":[5,2,7],"E":[1,18,4],"H":[2,9,-2],"J":[-4,16,-3]}
output_df= pd.DataFrame(output)
So, the key dataframe has 1 row for each column in the base dataframe and it has an "include" column that has to be set to "yes" if any calculation is to be done. When it is set to "yes", then I want to add a new column with a defined name that subtracts a defined column (all lookups from the key dataframe).
For example, column "C" in the base dataframe is included, so I want to create a new column called "H" which is the value from column "C" minus the value from column "B".
P.S. here is the answer from R, in case that triggers any thought processes for someone better skilled than me!
k <- subset(key, include == "yes")
output <- cbind(base,setNames(base[k[["cols"]]]-base[k[["subtract"]]],k$names))
Filter for the yes values in include:
yes = key_df.loc[key_df.include.eq("yes"), ["cols", "subtract", "names"]]
  cols subtract names
2    C        B     H
4    E        D     J
Create a dictionary of the yes values and unpack it in the assign method:
yes_values = {name: data_df[col] - data_df[subtract]
              for col, subtract, name in yes.to_numpy()}
data_df.assign(**yes_values)
A B C D E H J
0 orange 5 7 5 1 2 -4
1 apple 3 12 2 18 9 16
2 banana 6 4 7 4 -2 -3
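Put together, the whole answer runs end-to-end like this (same frames as above, assembled into one runnable piece):

```python
import pandas as pd

data_df = pd.DataFrame({"A": ["orange", "apple", "banana"],
                        "B": [5, 3, 6], "C": [7, 12, 4],
                        "D": [5, 2, 7], "E": [1, 18, 4]})
key_df = pd.DataFrame({"cols": ["A", "B", "C", "D", "E"],
                       "include": ["no", "no", "yes", "no", "yes"],
                       "subtract": ["na", "A", "B", "C", "D"],
                       "names": ["na", "G", "H", "I", "J"]})

# keep only the instruction rows flagged "yes"
yes = key_df.loc[key_df.include.eq("yes"), ["cols", "subtract", "names"]]

# build {new_name: col - subtract_col} and attach all new columns at once
yes_values = {name: data_df[col] - data_df[subtract]
              for col, subtract, name in yes.to_numpy()}
result = data_df.assign(**yes_values)
print(result)
```

Since the per-column subtractions are vectorised and only the small key frame is iterated, this avoids a loop over the data rows.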

Calculating each specific occurrence using value_counts() in Python

I have a dataframe named Tasks containing a column named UserName. I want to count every occurrence of rows sharing the same UserName, to find out how many tasks each user has been assigned. For a better understanding, here is what my dataframe looks like:
In order to achieve this, I used the code below:
Most_Involved = Tasks['UserName'].value_counts()
But this got me a Series like this:
John 4
Paul 1
Radu 1
Which is not exactly what I am looking for. How should I re-write the code in order to achieve this:
Most_Involved
Index UserName Tasks
0 John 4
1 Paul 1
2 Radu 1
You can use transform to add a new column to existing data frame:
df['Tasks'] = df.groupby('UserName')['UserName'].transform('size')
# finally select the columns needed
df = df[['Index','UserName','Tasks']]
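If you would rather keep the aggregated shape that value_counts() produces, a small sketch (the Tasks frame here is a stand-in rebuilt from the counts in the question) converts the Series into the desired two-column frame:

```python
import pandas as pd

# Stand-in for the Tasks frame from the question.
Tasks = pd.DataFrame({"UserName": ["John"] * 4 + ["Paul", "Radu"]})

# value_counts() returns a Series; rename_axis()/reset_index() turn it
# into the UserName/Tasks frame shown in the desired output.
Most_Involved = (Tasks["UserName"].value_counts()
                 .rename_axis("UserName")
                 .reset_index(name="Tasks"))
print(Most_Involved)
```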
You can also find duplicate rows based on one or more columns with pandas:
duplicateRowsDF = dataframe[dataframe.duplicated(['columnName'])]

Finding New/Existing Customers from a Dataframe

I need to create a categorical column indicating whether the client account code has occurred for the first time i.e. "New" or it has occurred before i.e. "Existing".
Only the first occurrence needs to be considered as "New", the rest of the occurrences, irrespective of the gap in occurrences, should all be considered as "Existing".
I tried looping through the list of unique account codes, filtering the dataframe for each code and finding the minimum date, which would be stored in a separate table. Then, looking up this table, I would enter the New/Existing tag in the categorical column. I couldn't execute it properly, though.
Is there a simple way to accomplish it?
I have attached the sample file below:
Sample Data
The data also contains some non-UTF-8 encoded characters that I couldn't handle.
Try:
import numpy as np

df.assign(Occurence=np.where(~df['Account Code'].duplicated(), 'New', 'Existing'))
Output:
Created Date Account Code Occurence
0 7-Sep-13 CL000247 New
1 7-Sep-13 CL000012 New
2 7-Sep-13 CL000875 New
3 7-Sep-13 CL000084 New
4 7-Sep-13 CL000186 New
5 7-Sep-13 CL000167 New
6 7-Sep-13 CL000167 Existing
7 7-Sep-13 CL000215 New
8 12-Sep-13 Wan2013001419 New
9 12-Sep-13 CL000097 New
...
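As a self-contained sketch (the account codes below are taken from the output above, in an invented order): duplicated() marks every occurrence after the first as True, so ~duplicated() flags exactly the first occurrences.

```python
import numpy as np
import pandas as pd

# Small stand-in for the sample file.
df = pd.DataFrame({"Account Code": ["CL000247", "CL000167", "CL000167",
                                    "CL000012", "CL000247"]})

# First occurrence of a code -> "New", every later occurrence -> "Existing".
df = df.assign(Occurence=np.where(~df["Account Code"].duplicated(),
                                  "New", "Existing"))
print(df)
```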

Calculating running total

I have data frame df and I would like to keep a running total of names that occur in a column of that data frame. I am trying to calculate the running total column:
name running total
a 1
a 2
b 1
a 3
c 1
b 2
There are two ways I thought to do this:
Loop through the dataframe and use a separate dictionary containing name and current count. The current count for the relevant name would increase by 1 each time the loop is carried out, and that value would be copied into my dataframe.
Calculate the count for each value directly in the dataframe. In Excel I would use COUNTIF combined with a drag-down formula, anchoring the start of the range (A$1:A1) so that the range I am looking in grows with the row.
The problem is I am not sure how to implement these. Does anyone have any ideas on which is preferable and how these could be implemented?
#bunji is right. I'm assuming you're using pandas and that your data is in a dataframe called df. To add the running totals to your dataframe, you could do something like this:
df['running total'] = df.groupby(['name']).cumcount() + 1
The + 1 gives you a 1 for your first occurrence instead of 0, which is what you would get otherwise.
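A runnable sketch using the exact data from the question's table:

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "a", "b", "a", "c", "b"]})

# cumcount() numbers each row within its group starting at 0;
# adding 1 makes the running total start at 1.
df["running total"] = df.groupby("name").cumcount() + 1
print(df)
```

This reproduces the 1, 2, 1, 3, 1, 2 column from the question without any explicit loop or dictionary.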
