How to output groupby variables when using .groupby() in pandas? - python

I have some data that I want to analyze. I group my data by the relevant group variables (here, 'test_condition' and 'region') and analyze the measure variable ('rt') with a function I wrote:
grouped = data.groupby(['test_condition', 'region'])['rt'].apply(summarize)
That works fine. The output looks like this (fake data):
                                       ci1         ci2        mean
test_condition      region
Test Condition Name And    0    295.055978  338.857066  316.956522
                    Spill1 0    296.210167  357.036210  326.623188
                    Spill2 0    292.955327  329.435977  311.195652
The problem is, 'test_condition' and 'region' are not actual columns, I can't index into them. I just want columns with the names of the group variables! This seems so simple (and is automatically done in R's ddply) but after lots of googling I have come up with nothing. Does anyone have a simple solution?

By default, the grouping variables become the index of the result. You can turn the index levels back into columns with grouped.reset_index().
My second suggestion, passing as_index=False in the groupby call, does not seem to work as desired with apply in this case (it does work with aggregate).
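A minimal sketch of both options, on made-up data. The asker's summarize function is not shown, so this uses agg with built-in statistics instead; the column values and the names keys, summary, flat and direct are assumptions for illustration only.

```python
import pandas as pd

# Made-up data; only the column names come from the question.
data = pd.DataFrame({
    "test_condition": ["A", "A", "A", "A"],
    "region": ["And", "And", "Spill1", "Spill1"],
    "rt": [300.0, 320.0, 310.0, 350.0],
})
keys = ["test_condition", "region"]

# Option 1: group as usual, then move the index levels back into columns.
summary = data.groupby(keys)["rt"].agg(["mean", "min", "max"])
flat = summary.reset_index()

# Option 2: keep the group keys as columns from the start (works with agg).
direct = data.groupby(keys, as_index=False).agg(mean=("rt", "mean"))

print(flat.columns.tolist())
print(direct.columns.tolist())
```

In both results 'test_condition' and 'region' are ordinary columns that can be indexed like any other.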


Fixing broken naming after merging a groupby pivot_table dataframe

I have a problem with the column names of a dataframe that results from merging it with an aggregated version of itself created with groupby.
The code that creates the mess looks like this:
volume_aggrao = volume.groupby(by = ['room_name', 'material', 'RAO']).sum()['quantity']
volume_aggrao_concat = pd.pivot_table(pd.DataFrame(volume_aggrao), index=['room_name', 'material'], columns = ['RAO'], values = ['quantity'])
volume = volume.merge(volume_aggrao_concat, how = 'left', on = ['room_name', 'material'])
Now to what it does: the goal of pivot_table is to show the sum of the 'quantity' variable for each category of 'RAO', and it looks like this:
That is fine until you look at how it is stored internally:
"('room_name', '')","('material', '')","('quantity', 'moi')","('quantity', 'nao')","('quantity', 'onrao')","('quantity', 'prom')","('quantity', 'sao')"
1,aluminum,NaN,13.0,NaN,NaN,NaN
1,concrete,151.0,NaN,NaN,NaN,NaN
1,plastic,56.0,NaN,NaN,NaN,NaN
1,steel_mark_1,NaN,30.0,2.0,NaN,1.0
1,steel_mark_2,52.0,NaN,88.0,NaN,NaN
2,aluminum,123.0,NaN,84.0,NaN,NaN
2,concrete,155.0,NaN,NaN,30.0,NaN
2,plastic,170.0,NaN,NaN,NaN,NaN
2,steel_mark_1,107.0,NaN,105.0,47.0,NaN
2,steel_mark_2,81.0,41.0,NaN,NaN,NaN
3,aluminum,NaN,NaN,90.0,NaN,79.0
3,concrete,NaN,82.0,NaN,NaN,NaN
3,plastic,1.0,NaN,25.0,NaN,NaN
3,steel_mark_1,116.0,10.0,NaN,136.0,NaN
3,steel_mark_2,NaN,92.0,34.0,NaN,NaN
4,aluminum,50.0,74.0,NaN,NaN,88.0
4,concrete,96.0,NaN,27.0,NaN,NaN
4,plastic,63.0,135.0,NaN,NaN,NaN
4,steel_mark_1,97.0,NaN,28.0,87.0,NaN
4,steel_mark_2,57.0,22.0,7.0,NaN,NaN
Nevertheless, I was still able to merge it, with the resulting columns being named automatically like this:
I cannot seem to reference these '(quantity, smth)' columns and hence could not even rename them directly. So I decided to reset all the column names with volume.columns = ["id", "room_name", "material", "alpha_UA", "beta_UA", "alpha_F", "beta_F", "gamma_EP", "quantity", "files_id", "all_UA", "RAO", "moi", "nao", "onrao", "prom", "sao"], which is indeed bulky, but it worked. Except it fails when one or more of the categorical values of "RAO" is missing. For example, if there is no "nao" in "RAO", no such column is created and the code has nothing to rename.
I tried fixing it with volume.rename(lambda x: x.lstrip("(\'quantity\',").strip("\'() \'") if "(" in x else x, axis=1), but it seems to do nothing to them.
I want to know if there is a way to rename these columns.
Data
Here is some example data from the 'volume' dataframe that you can use to replicate the process, with the desired output embedded in it for comparison:
"id","room_name","RAO","moi","nao","onrao","prom","sao"
"1","3","onrao","1","","25","",""
"2","4","nao","57","22","7","",""
"4","2","moi","170","","","",""
"6","4","moi","97","","28","87",""
"7","4","moi","97","","28","87",""
"11","1","nao","","13","","",""
"12","4","onrao","97","","28","87",""
"13","2","moi","107","","105","47",""
"18","2","moi","123","","84","",""
"19","2","moi","155","","","30",""
"22","2","moi","170","","","",""
"23","4","sao","50","74","","","88"
"24","4","nao","50","74","","","88"
So, after a cup of coffee and a cold shower, I was able to investigate a bit further and found out that the strange names are actually tuples and not strings! Knowing that, I decided to iterate over the columns, convert the tuples to strings, and then apply the filter. A bit bulky once again, but here is a solution:
names = []
for name in volume.columns:
    names.append(str(name).lstrip("(\'quantity\',").strip("\'() \'") if isinstance(name, tuple) else name)
volume.columns = names
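An alternative that avoids the tuple names altogether, as a sketch on made-up data: passing values as a plain string (rather than a one-element list) to pivot_table should give single-level columns, so the merge produces ordinary string names and nothing needs renaming. The sample rows and the name wide are assumptions for illustration.

```python
import pandas as pd

# Made-up rows; only the column names follow the question.
volume = pd.DataFrame({
    "room_name": [1, 1, 2, 2],
    "material": ["steel", "steel", "plastic", "plastic"],
    "RAO": ["moi", "nao", "moi", "moi"],
    "quantity": [10.0, 5.0, 7.0, 3.0],
})

# values="quantity" (a string, not a list) keeps the columns single-level.
wide = volume.pivot_table(
    index=["room_name", "material"],
    columns="RAO",
    values="quantity",
    aggfunc="sum",
)
wide.columns.name = None   # drop the leftover "RAO" axis label
wide = wide.reset_index()  # turn room_name/material back into columns

merged = volume.merge(wide, how="left", on=["room_name", "material"])
print(merged.columns.tolist())
```

Because the new columns are created from whatever categories are present in "RAO", a missing category simply produces no column instead of breaking a hard-coded list of names.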

Pick a list of values from one CSV and get the count of the values of the list in a different CSV

I am working on Python code to count the occurrences of a few values in a column of a CSV file.
Example: CSV1 is as below
**Type Value**
Simple test
complex problem
simple formula
complex theory
simple idea
simple task
I need to get the values for the types simple and complex, i.e.
**Type Value**
simple test
simple formula
simple idea
simple task
complex theory
complex problem
Then I need to query the other CSV, CSV2, for the total count of occurrences of the simple list, i.e. [test, formula, idea, task], and the complex list, i.e. [theory, problem].
CSV2 is:
**Category**
test
test
test
formula
formula
formula
test
test
idea
task
task
idea
task
idea
task
problem
problem
theory
problem
problem
idea
task
problem
test
Both CSV1 and CSV2 are dynamic. From CSV1, for example for type "simple", I get the list of corresponding values and then look in CSV2 for the count of each value, i.e. the counts of test, idea, task, and formula.
The same applies to the complex type.
I tried multiple methods with pandas but did not get the expected result. Any pointers, please?
Use:
df2['cat'] = df2['Category'].map(df1.set_index('Value')['Type'])
df2 = df2['cat'].value_counts().rename_axis('a').reset_index(name='b')
print(df2)
         a   b
0   simple  18
1  complex   6
Much like @jezrael's answer; however, I would first group the second CSV. This helps with merging if the second CSV is very large.
df2 = cv2.groupby('Category').size().reset_index(name='cnt')
This gives a dataframe with two columns, Category and cnt.
Now you can merge it with cv1:
df1 = cv1.merge(df2, left_on='Value', right_on='Category', how='inner')
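For reference, here is a self-contained sketch that produces both the per-value counts and the per-type totals on the sample data from the question. The frames are built inline rather than read from files, and the names cv1, cv2, counts, per_value and per_type are assumptions for illustration.

```python
import pandas as pd

# CSV1 from the question (Type is written in lowercase throughout here).
cv1 = pd.DataFrame({
    "Type": ["simple", "complex", "simple", "complex", "simple", "simple"],
    "Value": ["test", "problem", "formula", "theory", "idea", "task"],
})
# CSV2 from the question, written compactly.
cv2 = pd.DataFrame({
    "Category": ["test"] * 6 + ["formula"] * 3 + ["idea"] * 4
                + ["task"] * 5 + ["problem"] * 5 + ["theory"] * 1,
})

# Count each value once, then attach the counts to CSV1.
counts = cv2["Category"].value_counts().rename_axis("Value").reset_index(name="cnt")
per_value = cv1.merge(counts, on="Value", how="left")

# Totals per type.
per_type = per_value.groupby("Type")["cnt"].sum()

print(per_value)
print(per_type)
```

The how="left" merge keeps every value listed in CSV1, so a value that never appears in CSV2 would show up with a missing count instead of disappearing.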

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. To do this, I need to use the Python package sportsreference. I have two dataframes: one called df, which has the game date, and one called box_index (shown below), which has the unique link of each game. I need to replace the date column with the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which should make this easier to do with regex or .contains(). I keep getting a KeyError: 0. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    #print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i,"boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i,"date"] = box.loc[i,"boxscore_index"]
get_team_schedule("Virginia")
It seems that "box" and "df" are pandas DataFrames, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of looking up each cell with ".loc"). Note also that a plain Python string has no .contains method, so the check below uses the in operator:
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
The ".at" accessor overwrites the value of a single cell.
Just FYI, iterrows is more efficient than .loc; however, itertuples is about 10x faster, and zip about 100x.
The KeyError: 0 is saying that there is no row labeled 0: box.loc[i, "boxscore_index"] looks rows up by index label, and the index values here are date-based strings (for example '2020-12-22-14-virginia'), not integers. You could use positional indexing with .iloc instead, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc calls to that.
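To illustrate the difference, here is a small sketch with a hand-built frame standing in for box. It assumes, as described above, that the index holds the same strings as the boxscore_index column; the variable names raised and first_link are hypothetical.

```python
import pandas as pd

# Stand-in for `box`: the index holds strings, not integers.
link = "2020-12-22-14-virginia"
box = pd.DataFrame({"boxscore_index": [link]}, index=[link])

# Label-based lookup: there is no row labeled 0, so this raises KeyError.
raised = False
try:
    box.loc[0, "boxscore_index"]
except KeyError:
    raised = True

# Position-based lookup: the first row, whatever its label is.
first_link = box.iloc[0]["boxscore_index"]

print(raised, first_link)
```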
Like the other post said, though, I wouldn't go down that path. I actually wouldn't even use iterrows here. I would put the box_index values into a list, then iterate through that, and use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work, or is not what you are looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule
def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        # Rows of df whose date matches this box score index
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')
get_team_schedule("Virginia")
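If the goal is to replace each date with the link that starts with it, a sketch along these lines may be closer to what was asked. It uses hand-written sample values instead of calling sportsreference; the second link and the helper name find_link are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for df and the boxscore index list.
df = pd.DataFrame({"date": ["2020-12-19", "2020-12-22"]})
box_index_list = ["2020-12-22-14-virginia", "2020-12-19-20-virginia"]

def find_link(date):
    # First link that starts with the date; keep the date if none matches.
    return next((link for link in box_index_list if link.startswith(date)), date)

df["date"] = df["date"].map(find_link)
print(df)
```

Using startswith rather than a substring test follows the statement in the question that every link begins with the date.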

How to exclude more than one group in a groupby using python?

I have grouped the number of customers by region and year joined using groupby in Python. However I want to remove several regions from the region group.
I know in order to exclude one group from a groupby you can use the following code:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest')).index)
Therefore I initially tried the following:
grouped = df.groupby(['Region'])
df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index)
However, that gave me the error ('Southwest','Northwest').
Now I am wondering if there is a way to drop several groups at once instead of me having to type out the above code repeatedly for each region I want to remove.
I expect the output of the final query to be similar to the image shown below however information regarding the Northwest and Southwest regions should be removed.
The call df1 = df.drop(grouped.get_group(('Southwest','Northwest')).index) fails because grouped.get_group takes a single group name as its argument. To drop more than one group, combine the indexes of the groups first, for example df1 = df.drop(grouped.get_group('Southwest').index.union(grouped.get_group('Northwest').index)), since drop accepts a list-like of index labels.
As a side note, ('Southwest') evaluates to 'Southwest' (i.e. it is not a tuple). To make a tuple of size 1, write ('Southwest',).
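A different route that skips get_group entirely is to filter the rows before grouping. This is a sketch on made-up data; the Customers column and the names excluded and counts are assumptions for illustration.

```python
import pandas as pd

# Made-up data; only the Region column name comes from the question.
df = pd.DataFrame({
    "Region": ["Southwest", "Northwest", "Northeast", "Southwest", "Midwest"],
    "Customers": [3, 5, 2, 4, 1],
})

excluded = ["Southwest", "Northwest"]

# Keep only rows whose Region is not in the excluded list, then group.
df1 = df[~df["Region"].isin(excluded)]
counts = df1.groupby("Region")["Customers"].sum()
print(counts)
```

Adding another region to exclude only means extending the list, with no extra drop calls.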

Counting Frequency of an Aggregate result using pandas

Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
SELECT House_ID, MAX(Date_Time) AS Max_DT
FROM ElectricGrid GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
With the rename example, my second query fails to find Date_Time or Max_Date_Time. In the latter case, with the ravel code, it appears not to find House_Id when I run it.
That seems weird; I would expect your code not to be able to find the House_Id field. After you group by House_Id, it becomes the index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively, you can just drop the top level of the multilevel columns:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
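As a sketch of the whole SQL query in one go, named aggregation together with as_index=False avoids the multilevel columns entirely. The sample rows are made up; the output names Max_DT and HouseCount follow the SQL in the question.

```python
import pandas as pd

# Made-up readings; column names follow the pandas code in the question.
house_info = pd.DataFrame({
    "House_Id": ["A", "A", "B", "B", "C"],
    "Date_Time": pd.to_datetime(
        ["2014-01-01", "2014-02-27", "2014-01-05", "2014-02-27", "2014-02-28"]
    ),
})

# Inner query: last reading per house, with House_Id kept as a column.
house_max = house_info.groupby("House_Id", as_index=False).agg(
    Max_DT=("Date_Time", "max")
)

# Outer query: number of houses per last-reading date.
house_count = (
    house_max.groupby("Max_DT")["House_Id"]
    .count()
    .rename("HouseCount")
    .reset_index()
)
print(house_count)
```

Swapping "max" for "min" in the inner step gives the count of houses that began reporting on each day.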
