Remove extra index in a dataframe - python

I would like to remove the extra index called service_type_id. I have not included it in my code, but it just appears for no apparent reason. I am using Python.
My code is
data_tr = data.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id')
The output is this table with extra index:
I believe it has something to do with the groupby and unstack. Kindly highlight to me why there is an extra index and what my code should be.
The dataset
https://drive.google.com/file/d/1XZVfXbgpV0l3Oewgh09Vw5lVCchK0SEh/view?usp=sharing

I hope pandas.DataFrame.droplevel can do the job for your query.
import pandas as pd
df = pd.read_csv('Dataset - Transaction.csv')
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id').droplevel(0,1)
data_tr.head(2)
Output
df.groupby(['transaction_id', 'service_type']).sum() takes the sum of the numerical field service_type_id
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack()
print(data_tr.columns)
MultiIndex([('service_type_id', '3 Phase Wiring'),
            ('service_type_id', 'AV Equipment'),
            ...
            ('service_type_id', 'Yoga Lessons'),
            ('service_type_id', 'Zumba Classes')],
           names=[None, 'service_type'], length=188)
#print(data_tr.info())
Initially there was only one column (service_type_id) and two indexes (transaction_id, service_type). After you unstack, service_type becomes tuple-like columns (a MultiIndex), where each service type holds the value of service_type_id. droplevel(0, 1) will convert your dataframe from a MultiIndex to a single Index as follows
print(data_tr.columns)
Index(['3 Phase Wiring', ......,'Zumba Classes'],
dtype='object', name='service_type', length=188)
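To see the mechanics in isolation, here is a minimal, self-contained sketch on invented data (the column names mirror the ones above, the values are made up):
import pandas as pd

toy = pd.DataFrame({'transaction_id': [1, 1, 2],
                    'service_type': ['Yoga Lessons', 'Zumba Classes', 'Yoga Lessons'],
                    'service_type_id': [10, 20, 10]})

# sum() aggregates every remaining numeric column - here only service_type_id
wide = toy.groupby(['transaction_id', 'service_type']).sum().unstack().fillna(0)
print(wide.columns)            # MultiIndex: ('service_type_id', <service_type>)

# drop the top level so the columns are just the service types
wide = wide.droplevel(0, axis=1)
print(wide.columns)            # Index(['Yoga Lessons', 'Zumba Classes'], ...)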

It looks like you are trying to make a pivot table of transaction_id and service_type, using service_type_id as the value. The reason you are getting the extra index is that your sum generates a sum for every (numerical) column.
For insight, try to execute just
data.groupby(['transaction_id', 'service_type']).sum()
Since the data uses the label service_type_id, I assume the sum actually only serves the purpose of getting the id value out. A cleaner way to get the desired result is using a pivot:
data_tr = data[['transaction_id', 'service_type', 'service_type_id']].pivot(
    index='transaction_id',
    columns='service_type',
    values='service_type_id'
).fillna(0)
Depending on how you like your data structure, you can follow up with a .reset_index()
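One caveat: if the same (transaction_id, service_type) pair can appear more than once in the data, pivot will raise a ValueError about duplicate entries. In that case pivot_table with an explicit aggregation is a safer variant; a sketch with the same column names, where aggfunc='sum' mirrors the original groupby sum:
# pivot_table tolerates duplicate (transaction_id, service_type) pairs
data_tr = data.pivot_table(index='transaction_id',
                           columns='service_type',
                           values='service_type_id',
                           aggfunc='sum',
                           fill_value=0)
data_tr = data_tr.reset_index()   # optional, if you prefer transaction_id as a column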

Related

Iteration over df rows to sum groups of items

I am new to coding. I looked for similar questions on this site that helped me come up with a working version of my code, but I need help with making it more professional.
I need help with iterating over rows of a data frame in pandas. What I want to do is to find same items (e.g. Groceries) in 'Description' column and total (sum) their values from 'Amount' column, and finally to write the result to a .csv file. The reason I am doing all this is to compile data for a bar graph I want to create based on those categories.
I was able to accomplish all that with the help of the following code, but it is most likely not very pythonic or efficient. What I did is use a print statement nested in an if statement to get the category label (i) and amount printed to a file. The issue is that I had to add a lot of things for the whole thing to work. First, I had to create an empty list to make sure that the if statement does not trigger .loc every time it sees a desired item in the 'Description' column. Second, I am not sure if saving print statements is the best way to go here, as it appears very ersatz. It feels like I am a step away from using punch cards. In short, I would appreciate it if someone could help me bring my code up to standard.
'''
used_list = []
for i in df['Description']:
    if i in used_list:
        continue
    sys.stdout = open(file_to_hist, "a")
    print(i, ',', df.loc[df['Description'] == i, 'Amount'].sum())
    used_list.append(i)
'''
I also tried a slightly different approach (save results directly into a df), but then I get NaN values all across the 'Amount' column and no other errors (exit code 0) to help me understand what is going on:
'''
used_list = []
df_hist_data = pd.DataFrame(columns=['Description', 'Amount'])
for i in df['Description']:
    if i in used_list:
        continue
    df_hist_data = df_hist_data.append({'Description' : i}, {'Amount' : df.loc[df['Description'] == i, 'Amount'].sum()})
    used_list.append(i)
print(df_hist_data)
'''
You can select only the rows that match a criterion with
df[ a boolean mask here ]
When doing df["a column name"] == "value" you actually get a boolean mask (a Series) where rows in which "a column name" equals "value" are True and the others are False.
To summarize: Dataframe[Dataframe["Description"] == "banana"] gives you a new dataframe where only the rows matching your condition are kept (the original dataframe is not altered).
If you select column "Amount" of this dataframe and .sum() it, you have what you desired, in one line.
That's the typical pandorable (the pandas equivalent of pythonic) way of doing conditional sums.
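For example, a minimal, self-contained sketch of that single-line conditional sum (the data here is invented and echoes the example built just below):
import pandas as pd

Dataframe = pd.DataFrame([{"Description": "apple",  "Amount": 15},
                          {"Description": "banana", "Amount": 1},
                          {"Description": "banana", "Amount": 4}])

# sum of 'Amount' over the rows where Description == "banana"
print(Dataframe[Dataframe["Description"] == "banana"]["Amount"].sum())   # 5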
If required, to select the rows of the dataframe where the condition can take multiple values, use .isin() to get your boolean mask:
Dataframe["Description"].isin(["banana","apple"])
Then, when scanning all possible values of "Description" in your dataframe, use .unique() when generating your iterator.
And then you can finally append a Series to your empty dataframe before saving it as csv.
Overall, we get the code:
import pandas as pd
Dataframe = pd.DataFrame([
    {"Description": "apple",  "Amount": 15},
    {"Description": "banana", "Amount": 1},
    {"Description": "berry",  "Amount": 155},
    {"Description": "banana", "Amount": 4}])

df_hist_data = pd.DataFrame(columns=['Description', 'Sum'])
for item in Dataframe["Description"].unique():
    df_hist_data = df_hist_data.append(pd.Series(
        {"Description": item,
         "Sum": Dataframe[(Dataframe["Description"].isin([item]))]["Amount"].sum()}
        ), ignore_index=True)
OUT:
  Description  Sum
0       apple   15
1      banana    5
2       berry  155
You can also do it even more pythonically, in one line, with a list comprehension:
selector = "Description"
sum_on = "Amount"
new_df = pd.DataFrame([ {selector : item , sum_on : df[(df[selector].isin([item]))][sum_on].sum() } for item in df[selector].unique() ] )
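One caveat worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the loop above will not run as written. A sketch of two replacements, assuming the same Dataframe and column names as above:
# Option 1: collect rows in a plain list and build the frame once at the end
rows = [{"Description": item,
         "Sum": Dataframe[Dataframe["Description"] == item]["Amount"].sum()}
        for item in Dataframe["Description"].unique()]
df_hist_data = pd.DataFrame(rows)

# Option 2: let groupby do the conditional sums in one vectorized call
df_hist_data = (Dataframe.groupby("Description", as_index=False)["Amount"]
                         .sum()
                         .rename(columns={"Amount": "Sum"}))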

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
        time      value        date
0  12.850000  19.195359  08-22-2019
1   9.733333  13.519543  09-19-2019
2  14.083333   9.191413  08-26-2019
3  16.616667  18.346598  08-19-2019
...
where every date can occur multiple times, recording values at different points during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
For this small example, the result df has one row per date with Min and Max columns.
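Putting the pattern together on the sample rows from the question (a minimal sketch; the values are the ones quoted above and the column names are assumed to match):
import pandas as pd

df = pd.DataFrame({'time':  [12.850000, 9.733333, 14.083333, 16.616667],
                   'value': [19.195359, 13.519543, 9.191413, 18.346598],
                   'date':  ['08-22-2019', '09-19-2019', '08-26-2019', '08-19-2019']})

# "group by A, sort by B, grab C": the value at the earliest time of each date
day_open = (df.sort_values('time')
              .groupby('date')['value']
              .first()
              .rename('Open'))

# min/max value per date, with readable column names
day_min_max = df.groupby('date')['value'].agg([('Min', 'min'), ('Max', 'max')])

print(day_min_max.join(day_open))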

Sorting by columns using groupby pandas

I have this data frame:
There is a column "CD" which has the city name, a column "FAC18" which is the number of people each line represents, and a column "S3P1" which is the academic level and is of type int.
When I group it by "CD" with hab_ciudad = cuestio.groupby('CD')["FAC18"].sum() I get:
Where I have summed over "FAC18".
Now, I want also to group by academic level ("S3P1"). I want it to look like this:
where the columns are the values of "S3P1" and the sum is made over "FAC18".
I tried this code: test = cuestio.groupby(['CD','S3P1'])["FAC18"].sum()
But I get this:
What's the syntax to get the form I want?
After your groupby, reset the index so CD and S3P1 become columns again, then pivot:
test = cuestio.groupby(['CD','S3P1'])["FAC18"].sum().reset_index()
test = test.pivot(index='CD', columns='S3P1', values='FAC18')
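An equivalent route, since the grouped result already carries CD and S3P1 in its index, is to unstack the second level directly (a sketch using the same column names; fill_value is optional):
# Series with a (CD, S3P1) MultiIndex -> one column per S3P1 value, one row per CD
test = cuestio.groupby(['CD', 'S3P1'])['FAC18'].sum().unstack('S3P1', fill_value=0)

# or in a single step
test = cuestio.pivot_table(index='CD', columns='S3P1', values='FAC18',
                           aggfunc='sum', fill_value=0)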

Counting Frequency of an Aggregate result using pandas

Broadly, I have the Smart Meters dataset from Kaggle and I'm trying to get the first and last measurement by house, then aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different from the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
    SELECT House_ID, MAX(Date_Time) AS Max_DT
    FROM ElectricGrid
    GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example, my second query fails to find Date_Time or Max_Date_Time. In the latter case (the ravel code), it appears not to find House_Id when I run it.
That seems weird; I would think your code would not be able to find the House_Id field. After you perform your groupby on House_Id, it becomes an index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
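A self-contained sketch of the whole flow on invented data (the column names follow the question; the house ids and timestamps are made up):
import pandas as pd

house_info = pd.DataFrame({'House_Id':  [1, 1, 2, 3],
                           'Date_Time': pd.to_datetime(['2013-01-01', '2013-03-05',
                                                        '2013-01-01', '2013-03-05'])})

# last reading per house; the columns become a MultiIndex like ('Date_Time', 'max')
house_max = house_info.groupby('House_Id').agg({'Date_Time': ['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]

# how many houses stopped reporting on each day
print(house_max.groupby('Date_Time_max').size())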

How to maintain lexsort status when adding to a multi-indexed DataFrame?

Say I construct a dataframe with pandas, having multi-indexed columns:
import numpy as np
import pandas as pd
mi = pd.MultiIndex.from_product([['trial_1', 'trial_2', 'trial_3'], ['motor_neuron','afferent_neuron','interneuron'], ['time','voltage','calcium']])
ind = np.arange(1,11)
df = pd.DataFrame(np.random.randn(10,27), index=ind, columns=mi)
Link to image of output dataframe
Say I want only the voltage data from trial 1. I know that the following code fails, because the indices are not sorted lexically:
idx = pd.IndexSlice
df.loc[:,idx['trial_1',:,'voltage']]
As explained in another post, the solution is to sort the dataframe's indices, which works as expected:
dfSorted = df.sortlevel(axis=1)
dfSorted.loc[:,idx['trial_1',:,'voltage']]
I understand why this is necessary. However, say I want to add a new column:
dfSorted.loc[:,('trial_1','interneuron','scaledTime')] = 100 * dfSorted.loc[:,('trial_1','interneuron','time')]
Now dfSorted is not sorted anymore, since the new column was tacked onto the end, rather than snuggled into order. Again, I have to call sortlevel before selecting multiple columns.
I feel this makes for repetitive, bug-prone code, especially when adding lots of columns to the much bigger dataframe in my own project. Is there a (preferably clean-looking) way of inserting new columns in lexical order without having to call sortlevel over and over again?
One approach would be to use filter which does a text filter on the column names:
In [117]: df['trial_1'].filter(like='voltage')
Out[117]:
  motor_neuron afferent_neuron interneuron
       voltage         voltage     voltage
1    -0.548699        0.986121   -1.339783
2    -1.320589       -0.509410   -0.529686
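The original question (adding new columns without re-sorting every time) is hard to avoid entirely, but you can batch your insertions and re-sort the columns once at the end. A sketch, assuming the dfSorted from the question and a modern pandas where sort_index(axis=1) has replaced sortlevel (this is an assumption, not part of the answer above):
# add any number of new columns; keys are full column tuples
dfSorted[('trial_1', 'interneuron', 'scaledTime')] = (
    100 * dfSorted[('trial_1', 'interneuron', 'time')])

# re-sort the column MultiIndex once before doing IndexSlice selections
dfSorted = dfSorted.sort_index(axis=1)

idx = pd.IndexSlice
voltages = dfSorted.loc[:, idx['trial_1', :, 'voltage']]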
