Obtaining column based mean over unique multi index values using pandas - python

Good day awesome people,
I'm working with the dataframe shown in Table and am trying to produce New table. I first tried to obtain the test score and total averages for the new table using:
df = pd.read_csv("testdata.csv")
grouped = df.groupby(["county_id","school_id","student_id"]).mean()
print (grouped)
It gives me this error:
KeyError: 'county_id'
My plan is for the new table to be grouped by county_id, school_id and student_id. For each unique index combination, it should contain the average of the test scores plus a remark based on the following bands: Excellent 20.0-25.0, Good 17.0-19.9 and Pass 14.0-16.9.
I would really appreciate any help. If it's possible to achieve this with a lambda function, that would be cool too. Thank you.
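A minimal sketch of one possible approach, assuming the CSV really has columns named county_id, school_id and student_id plus a score column called test_score (the test_score name is a guess; the KeyError usually means the column name in the file differs, for example because of stray whitespace in the header):
import pandas as pd

df = pd.read_csv("testdata.csv")
df.columns = df.columns.str.strip()  # guard against whitespace in the header names

# Average test score per unique (county_id, school_id, student_id) combination
grouped = df.groupby(["county_id", "school_id", "student_id"], as_index=False)["test_score"].mean()

# Remark bands from the question, applied with a lambda to the averaged score
grouped["remark"] = grouped["test_score"].apply(
    lambda s: "Excellent" if s >= 20.0
    else "Good" if s >= 17.0
    else "Pass" if s >= 14.0
    else None  # below 14.0 falls outside the stated bands
)
print(grouped)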

Related

In python pandas, How do you eliminate rows of data that fail to meet a condition of grouped data?

I have a data set that contains hourly data of marketing campaigns. There are several campaigns and not all of them are active during the 24 hours of the day. My goal is to eliminate all rows of a campaign for days where I don't have the full 24 data rows.
The raw data contains a lot of information like this:
Original Data Set
I created a dummy variable of ones to be able to count single instances of rows. This is the code I applied to be able to see the results I want to get:
tmp = df.groupby(['id','date']).count()
tmp.query('Hour > 23')
I get the following results:
Results of two lines of code
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate the data, per campaign per day, that does not reach 24 rows? The objective is not the count but the real data, i.e. ungrouped, unlike what I present in the second picture.
I appreciate the guidance.
Use transform to broadcast the count over all rows of your dataframe, then use loc as a replacement for query:
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
             .loc[lambda x: x > 23].index]
Drop the data you don't want before you do the groupby. You can use .loc or .drop; I am unfamiliar with .query.
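A slightly more direct variant of the transform idea, given here only as a hedged sketch with the same assumed column names (id, date, Hour): build a boolean mask and index with it directly.
# Keep only rows belonging to (id, date) groups that have all 24 hourly rows
full_days = df.groupby(['id', 'date'])['Hour'].transform('count') > 23
out = df[full_days]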

Counting Frequency of an Aggregate result using pandas

Broadly, I have the Smart Meters dataset from Kaggle and I'm trying to get the first and last measurement per house, then aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different from the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
SELECT House_ID, MAX(Date_Time) AS Max_DT
FROM ElectricGrid GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However, I'm failing to get the outer query. Specifically, I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time, as in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example my second query fails to find Date_Time or Max_Date_Time. In the latter case, the ravel code appears not to find House_Id when I run it.
That actually makes sense: I would expect your code not to be able to find the House_Id field. After you perform your groupby on House_Id it becomes an index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
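As another hedged option (assuming pandas 0.25 or newer and the same House_Id / Date_Time column names), named aggregation sidesteps the MultiIndex columns entirely, so the renamed column is available straight away:
# Named aggregation produces a flat column called Max_Date_Time
house_max = house_info.groupby('House_Id').agg(Max_Date_Time=('Date_Time', 'max'))
start_end_collate = house_max.groupby('Max_Date_Time').size()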

Select and minimum value of a data frame column, by category

I have a data frame representing IMDb ratings of a selection of tv shows with the following columns:
date, ep_no, episode, show_title, season, rating
I need to select the lowest rated episode of each show, but I am having trouble displaying all of the columns I want.
I can successfully select the correct data using:
df.groupby('show_title')['rating'].min()
But this only displays the show title and the rating of the lowest rated episode for that show.
I need it to display:
show_title, ep_no, episode, rating
I have tried various tweaks to the code, from the simple to the complex, but I guess I'm just not experienced enough to crack this particular puzzle right now.
Any ideas?
If I understand what you want correctly, this question is similar to this one, and the following code should do the trick:
df[df.groupby('show_title')['rating'].transform(min) == df['rating']]
One approach is to sort the DataFrame by rating, then drop duplicates of show_title while keeping the first occurrence of each show:
df.sort_values(by='rating').drop_duplicates(['show_title'], keep='first')
# It's easy: just sort by show_title and rating before using groupby
df.sort_values(by=['show_title', 'rating'], inplace=True)
# Now use groupby and return the first row of every group;
# the first row automatically contains the minimum rating
df1 = df.groupby('show_title').first()
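Another common pattern, offered only as a hedged sketch with the same column names from the question: use idxmin to grab the row label of each show's lowest-rated episode, then select just the columns you want to display.
# Row labels of the minimum-rated episode per show, then the columns of interest
lowest = df.loc[df.groupby('show_title')['rating'].idxmin(),
                ['show_title', 'ep_no', 'episode', 'rating']]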

Incrementing Values in Column Based on Values in Another (Pandas)

I'd like to find a simple way to increment values in one column that correspond to a particular date in Pandas. This is what I have so far:
old_casual_value = tstDF['casual'].loc[tstDF['ds'] == '2012-10-08'].values[0]
old_registered_value = tstDF['registered'].loc[tstDF['ds'] == '2012-10-08'].values[0]
# Adjusting the numbers of customers for that day.
tstDF.set_value(406, 'casual', old_casual_value*1.05)
tstDF.set_value(406, 'registered', old_registered_value*1.05)
If I could find a better and simpler way to do this (a one-liner), I'd greatly appreciate it.
Thanks for your help.
The following one-liner should work based on your limited description of your problem. If not, please provide more information.
# The code below first filters the records for the specified date and then scales the casual and registered column values by 1.05.
tstDF.loc[tstDF['ds'] == '2012-10-08', ['casual', 'registered']] *= 1.05
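A tiny usage sketch on a made-up frame (the ds, casual and registered names come from the question; the values are invented purely for illustration):
import pandas as pd

tstDF = pd.DataFrame({
    'ds': ['2012-10-07', '2012-10-08'],
    'casual': [100.0, 200.0],
    'registered': [300.0, 400.0],
})
tstDF.loc[tstDF['ds'] == '2012-10-08', ['casual', 'registered']] *= 1.05
print(tstDF)  # the 2012-10-08 row becomes casual=210.0, registered=420.0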

Pandas: speeding up groupby?

I am wondering whether it is possible to speed up pandas dataframe.groupby with the following application:
Basic data structure:
HDFStore with 9 columns
4 columns are columns with data (colF ... colI)
the combination of the remaining 5 columns (colA ... colE) gives a unique index
colE is a "last modified" column
The basic idea is to implement a database with a "transactional memory". When an entry changes, I don't delete it but write a new row with a new value in the "last modified" column. This allows me to retroactively look at how entries have changed over time.
However, in situations where I only want the currently valid "state" of the data, it requires selecting only those rows with the most recent "last modified" column:
idx = df.groupby(['colA', 'colB', 'colC', 'colD'],
                 as_index=False, sort=False)['colE'].max()
df_current_state = df.merge(idx, 'inner', on=['colA', 'colB', 'colC', 'colD', 'colE'])
This groupby method eats up about 70% of my run time.
Note: For the majority of rows, there exists only a single entry with respect to the "last modified" column. Only for very few, multiple versions of the row with different "last modified" values exist.
Is there a way to speed up this process other than changing the program logic as follows?
Alternative Solution without need for groupby:
Add an additional "boolean" column activeState which stores whether the row is part of the "active state".
When rows change, mark their activeState field as False and insert a new row with activeState=True.
One can then query the table with activeState==True rather than use groupby.
My issue with this solution is that it has the potential for mistakes where the activeState field is not set appropriately. Of course this is recoverable from using the "last modified" column, but if the groupby could be sped up, it would be foolproof...
What about using a sort followed by drop_duplicates? I'm using it on a large database with four levels of grouping with good speed. I'm taking the first, so I don't know how first vs last affects the speed, but you can always reverse the sort too.
df_current_state = df.sort_values(by='colE')
df_current_state = df_current_state.drop_duplicates(subset=['colA', 'colB', 'colC', 'colD'], keep='last')
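For comparison, a hedged sketch of an idxmax-based variant (using the same colA...colE columns from the question) that skips the merge entirely; whether it beats sort plus drop_duplicates will depend on the data:
# Row label of the most recent "last modified" entry within each logical key
latest = df.groupby(['colA', 'colB', 'colC', 'colD'], sort=False)['colE'].idxmax()
df_current_state = df.loc[latest]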
