Fixing broken naming after merging a groupby pivot_table dataframe - python

I have a problem with the naming of columns in a dataframe that results from merging it with an aggregation of itself created by groupby.
Generally, the code that creates the mess looks like this:
volume_aggrao = volume.groupby(by=['room_name', 'material', 'RAO']).sum()['quantity']
volume_aggrao_concat = pd.pivot_table(pd.DataFrame(volume_aggrao), index=['room_name', 'material'], columns=['RAO'], values=['quantity'])
volume = volume.merge(volume_aggrao_concat, how='left', on=['room_name', 'material'])
Now to what it does: the goal of the pivot_table is to show the sum of 'quantity' over each category of 'RAO'. That works fine until you look at how the result is stored on the inside:
"('room_name', '')","('material', '')","('quantity', 'moi')","('quantity', 'nao')","('quantity', 'onrao')","('quantity', 'prom')","('quantity', 'sao')"
1,aluminum,NaN,13.0,NaN,NaN,NaN
1,concrete,151.0,NaN,NaN,NaN,NaN
1,plastic,56.0,NaN,NaN,NaN,NaN
1,steel_mark_1,NaN,30.0,2.0,NaN,1.0
1,steel_mark_2,52.0,NaN,88.0,NaN,NaN
2,aluminum,123.0,NaN,84.0,NaN,NaN
2,concrete,155.0,NaN,NaN,30.0,NaN
2,plastic,170.0,NaN,NaN,NaN,NaN
2,steel_mark_1,107.0,NaN,105.0,47.0,NaN
2,steel_mark_2,81.0,41.0,NaN,NaN,NaN
3,aluminum,NaN,NaN,90.0,NaN,79.0
3,concrete,NaN,82.0,NaN,NaN,NaN
3,plastic,1.0,NaN,25.0,NaN,NaN
3,steel_mark_1,116.0,10.0,NaN,136.0,NaN
3,steel_mark_2,NaN,92.0,34.0,NaN,NaN
4,aluminum,50.0,74.0,NaN,NaN,88.0
4,concrete,96.0,NaN,27.0,NaN,NaN
4,plastic,63.0,135.0,NaN,NaN,NaN
4,steel_mark_1,97.0,NaN,28.0,87.0,NaN
4,steel_mark_2,57.0,22.0,7.0,NaN,NaN
Nevertheless, I was still able to merge it, with the resulting columns named automatically.
I cannot seem to call these '(quantity, smth)' columns and hence could not even rename them directly. So I decided to fully reset the column names with volume.columns = ["id", "room_name", "material", "alpha_UA", "beta_UA", "alpha_F", "beta_F", "gamma_EP", "quantity", "files_id", "all_UA", "RAO", "moi", "nao", "onrao", "prom", "sao"], which is indeed bulky, but it worked. Except it did not work when one or more of the categorical values of "RAO" was missing: for example, if there is no "nao" in "RAO", then no such column is created and the code has nothing to rename.
I tried fixing it with volume.rename(lambda x: x.lstrip("('quantity',").strip("'() '") if "(" in x else x, axis=1), but it seems to do nothing to them.
I want to know if there is a way to rename these columns.
Data
Here's some example data for the 'volume' dataframe that you may use to replicate the process, with the desired output embedded in it for comparison:
"id","room_name","RAO","moi","nao","onrao","prom","sao"
"1","3","onrao","1","","25","",""
"2","4","nao","57","22","7","",""
"4","2","moi","170","","","",""
"6","4","moi","97","","28","87",""
"7","4","moi","97","","28","87",""
"11","1","nao","","13","","",""
"12","4","onrao","97","","28","87",""
"13","2","moi","107","","105","47",""
"18","2","moi","123","","84","",""
"19","2","moi","155","","","30",""
"22","2","moi","170","","","",""
"23","4","sao","50","74","","","88"
"24","4","nao","50","74","","","88"

So, after a cup of coffee and a cold shower, I was able to investigate a bit further and found out that the strange names are actually tuples, not strings! Knowing that, I decided to iterate over the columns, convert each to a string, and then apply the filter. A bit bulky once again, but here is a solution:
names = []
for name in volume.columns:
    # str() turns the tuple ('quantity', 'moi') into "('quantity', 'moi')",
    # which the strip calls then reduce to just "moi"
    names.append(str(name).lstrip("('quantity',").strip("'() '"))
volume.columns = names
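A less string-dependent alternative (just a sketch, assuming the pivot was built exactly as above) is to flatten the pivot's MultiIndex columns before merging, so the merge never produces tuple names in the first place:
flat = volume_aggrao_concat.copy()
flat.columns = flat.columns.get_level_values(1)  # ('quantity', 'moi') -> 'moi'
volume = volume.merge(flat.reset_index(), how='left', on=['room_name', 'material'])
This also sidesteps the missing-category problem, since only the 'RAO' values that actually occur become columns.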

Related

Identify if records exist in another dataframe, within the first dataframe

I have two csv files, OrderOne (approx 105k records) & OrderTwo (approx 115k records).
I want to add a column in OrderTwo which states "TRUE" if that record is found in OrderOne, and "FALSE" if not.
The new column should be appended and the file output.
There is no shared key, so I'm creating one. It will be the concatenation of columns within the orders, which are in different formats from different suppliers. For simplicity in this example, it will be 'Forename' + 'Surname'.
I am reading the two data tables in, one of which I only need a few columns from. I'm converting names to upper case & stripping out white space to ensure they match correctly.
I've read the outputs from these files and they look correct. So far, so good.
import pandas as pd

orderoneData = pd.read_csv('orderone.csv', usecols=['Customer Reference', 'Forename', 'Surname'], index_col=False)
orderoneData.set_index('Customer Reference', inplace=True)
orderoneData["FNSN"] = orderoneData['Forename'].str.strip() + orderoneData["Surname"].str.strip()
orderoneData["FNSN"] = orderoneData["FNSN"].str.upper()

ordertwoData = pd.read_csv('ordertwo.csv')
ordertwoData.set_index('Supplier Reference', inplace=True)
ordertwoData["FNSN"] = ordertwoData['Forename'].str.strip() + ordertwoData["Surname"].str.strip()
ordertwoData["FNSN"] = ordertwoData["FNSN"].str.upper()
Next I'm merging; I'm using OrderTwo as the left (because that's the file I want the new column added to). I intend to change the values of the indicator to Boolean ('both' = True, otherwise False) but I haven't got that far yet.
d = (
    ordertwoData.merge(orderoneData['FNSN'],
                       on=['FNSN'],
                       how='left',
                       indicator=True,
                       )
)
d.reset_index(drop=True, inplace=True)
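For reference, that Boolean conversion could be a one-liner once the merge behaves; 'Found' is a made-up column name, and _merge is the column that indicator=True creates:
d['Found'] = d['_merge'] == 'both'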
At this point, I have far too many records (approx 179k; I'm expecting the same as OrderTwo, which is 115k). My understanding was that a left join should have the same number of records as the left table, which in my case is ordertwoData.
#I thought I might have used the wrong merge criteria and it was creating duplicates, so I thought I would just remove them
d1 = d.drop_duplicates()
print(d1)
d1.to_csv("d.csv")
Dropping duplicates leaves me with too few records, so I'm confused how I get the right result.
Any help much appreciated!
As @Clegane identified, the issue here was not the code but the input data containing duplicate records. By including the original reference in the merge and then dropping duplicates on OrderTwo['Supplier Reference'], I got the expected answer. Thanks!
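If only the True/False flag is needed, the merge can be sidestepped entirely with isin, which never multiplies rows, so OrderTwo keeps its original record count. A minimal sketch using the column names above ('FoundInOrderOne' is a made-up name):
# True where the combined name key appears anywhere in OrderOne
ordertwoData['FoundInOrderOne'] = ordertwoData['FNSN'].isin(orderoneData['FNSN'])
ordertwoData.to_csv('d.csv')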

Python loop through two dataframes and find similar column

I am currently working on a project where my goal is to get the game scores for each NCAA men's basketball game. In order to do this, I need to use the Python package sportsreference. I need to use two dataframes, one called df which has the game date and one called box_index (shown below) which has the unique link of each game. I need to get the date column replaced by the unique link of each game. These unique links start with the date (formatted exactly as in the date column of df), which makes it easier to do this with regex or .contains(). I keep getting a KeyError: 0. Can someone help me figure out what is wrong with my logic below?
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index = combined["boxscore_index"]
    box = box_index.to_frame()
    # print(box)
    for i in range(len(df)):
        for j in range(len(box)):
            if box.loc[i, "boxscore_index"].contains(df.loc[i, "date"]):
                df.loc[i, "date"] = box.loc[i, "boxscore_index"]

get_team_schedule("Virginia")
It seems like "box" and "df" are pandas dataframes, and since you are iterating through all the rows, it may be more efficient to use iterrows (instead of searching by index with ".loc"):
for i, row_df in df.iterrows():
    for j, row_box in box.iterrows():
        # plain strings have no .contains method; use the `in` operator instead
        if row_df["date"] in row_box["boxscore_index"]:
            df.at[i, 'date'] = row_box["boxscore_index"]
the ".at" function will overwrite the value at a given cell
Just fyi, iterrows is more efficient than .loc., however itertuples is about 10x faster, and zip about 100xs.
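For reference, the same nested loop with itertuples might look like this (a sketch; attribute access works here because 'date' and 'boxscore_index' are valid Python identifiers):
for row_df in df.itertuples():
    for row_box in box.itertuples():
        if row_df.date in row_box.boxscore_index:
            df.at[row_df.Index, 'date'] = row_box.boxscore_index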
The KeyError: 0 is saying you can't get that row at index 0, because there is no index value of 0 when using box.loc[i, "boxscore_index"] (the index values are the dates, for example '2020-12-22-14-virginia'). You could use .iloc though, like box.iloc[i]["boxscore_index"]. You'd have to convert all the .loc calls to that.
Like the other post said though, I wouldn't go down that path. I actually wouldn't even use iterrows here. I would put the box_index into a list, then iterate through that, and use pandas to filter your df dataframe. I'm making some assumptions about what df looks like, so if this doesn't work or isn't what you're looking to do, please share some sample rows of df:
from sportsreference.ncaab.schedule import Schedule

def get_team_schedule(name):
    combined = Schedule(name).dataframe
    box_index_list = list(combined["boxscore_index"])
    for box_index in box_index_list:
        # filter df down to the rows whose 'date' matches this boxscore link
        temp_game_data = df[df["date"] == box_index]
        print(box_index)
        print(temp_game_data, '\n')

get_team_schedule("Virginia")

Select specific rows on pandas based on condition

I have a dataframe containing a column called bmi (Body Mass Index) containing int values.
I have to separate the values in the bmi column into Underweight, Normal, Overweight and Obese based on the values. I wrote a loop for this (posted as a screenshot), but I am getting an error. I am a beginner; I just started coding 2 weeks back.
Generally speaking, using a for loop in pandas is usually a bad idea. Pandas allows you to manipulate data easily. For example, if you want to filter by some condition:
print(df[df["bmi"] > 30])
will print all rows where bmi > 30. It works as follows: df[condition]. The condition in this case is that "bmi" is larger than 30, so our condition is df["bmi"] > 30. Notice that the expression df[df["bmi"] > 30] returns all rows that satisfy the condition. I printed them, but you can manipulate them however you like.
Even though it's a bad technique (or one used only for specific needs), you can of course iterate through a dataframe. This is not done via for l in df, as df is a dataframe object. To iterate through it you can use iterrows:
for index, row in df.iterrows():
    if row["bmi"] > 30:
        print("Obese")
Also, for next time, please provide your code inline; don't paste an image of it.
If your goal is to separate into different labels, I suggest the following:
df.loc[df["bmi"] > 30, "NewColumn"] = "Obese"
df.loc[df["bmi"] < 18.5, "NewColumn"] = "Underweight"
The .loc operator lets you manipulate only part of the data. Its format is [rows, columns]. So the code above takes only the rows where bmi > 30, and only "NewColumn" (name it whatever you like), which is a new column, and puts the value on the right into that column. After that operation, you have a new column in your dataframe which holds "Obese"/"Underweight" as you like.
As a side note, there are better ways to map values (e.g. pandas' map and others), but if you are a beginner it's important to understand simple methods of manipulating data before diving into more complex ones. That's why I'm avoiding explaining the more complex methods here.
First of all, as mentioned in the comments, you should post text/code instead of screenshots.
You could do binning in pandas:
bmi_labels = ['Normal', 'Overweight', 'Obese']
cut_bins = [18.5, 24.9, 29.9, df["bmi"].max()]
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)
Here, I have made a separate column (bmi_label) to store the label, but you could do it in the same column (bmi) too.
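If the Underweight category from the question is also needed, one way (a sketch with made-up values; the open-ended outer bins are an assumption) is to extend the bin edges on both sides:
import numpy as np
import pandas as pd

df = pd.DataFrame({'bmi': [16, 22, 27, 34]})  # hypothetical example values

bmi_labels = ['Underweight', 'Normal', 'Overweight', 'Obese']
cut_bins = [-np.inf, 18.5, 24.9, 29.9, np.inf]  # open-ended lowest and highest bins
df['bmi_label'] = pd.cut(df['bmi'], bins=cut_bins, labels=bmi_labels)
print(df)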

Counting Frequency of an Aggregate result using pandas

Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
    SELECT House_ID, MAX(Date_Time) AS Max_DT
    FROM ElectricGrid
    GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However, I'm failing to get the outer query. Specifically, I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time, as in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example, my second query fails to find Date_Time or Max_Date_Time. In the latter case (the ravel code), it appears not to find House_Id when I run it.
That seems odd at first, but I would expect your code not to be able to find the House_Id field: after you perform your groupby on House_Id, it becomes an index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
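If a newer pandas is available (named aggregation needs pandas 0.25+), the output column name can be set up front, avoiding the MultiIndex entirely; a sketch reusing the names from above:
# 'Max_Date_Time' is chosen here; name it whatever you like
house_max = house_info.groupby('House_Id').agg(Max_Date_Time=('Date_Time', 'max'))
start_end_collate = house_max.groupby('Max_Date_Time').size()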

How to output groupby variables when using .groupby() in pandas?

I have some data that I want to analyze. I group my data by the relevant group variables (here, 'test_condition' and 'region') and analyze the measure variable ('rt') with a function I wrote:
grouped = data.groupby(['test_condition', 'region'])['rt'].apply(summarize)
That works fine. The output looks like this (fake data):
                                      ci1         ci2        mean
test_condition      region
Test Condition Name And 0      295.055978  338.857066  316.956522
                    Spill1 0   296.210167  357.036210  326.623188
                    Spill2 0   292.955327  329.435977  311.195652
The problem is, 'test_condition' and 'region' are not actual columns; I can't index into them. I just want columns with the names of the grouping variables! This seems so simple (and is done automatically in R's ddply), but after lots of googling I have come up with nothing. Does anyone have a simple solution?
By default, the grouping variables are turned into an index. You can change the index to columns with grouped.reset_index().
My second suggestion, specifying as_index=False in the groupby call, seems not to work as desired in this case with apply (though it does work when using aggregate).
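A minimal sketch of the reset_index approach, reusing the names from the question (summarize is assumed to be the user's own function):
grouped = data.groupby(['test_condition', 'region'])['rt'].apply(summarize)
flat = grouped.reset_index()  # index levels become ordinary columns
print(flat['test_condition'])  # now indexable like any other column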
