I have this data frame:
There is a column "CD" with the city name, a column "FAC18" with the number of people each row represents, and a column "S3P1" with the academic level, which is of type int.
When I group it by "CD" with hab_ciudad = cuestio.groupby('CD')["FAC18"].sum() I get:
Where I have summed over "FAC18".
Now, I want also to group by academic level ("S3P1"). I want it to look like this:
where the columns are the values of "S3P1" and the sum is made over "FAC18".
I tried this code: test = cuestio.groupby(['CD','S3P1'])["FAC18"].sum()
But I get this:
What's the syntax to get the form I want?
After your groupby, reset the index so you have a regular DataFrame again, then use pivot:
test = cuestio.groupby(['CD', 'S3P1'])["FAC18"].sum().reset_index()
test = test.pivot(index='CD', columns='S3P1', values='FAC18')
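An equivalent route, assuming the same column names as in the question, is to unstack the S3P1 level of the grouped index instead of pivoting; a minimal sketch with made-up data:
import pandas as pd

# Hypothetical sample shaped like the question's data
cuestio = pd.DataFrame({
    'CD': ['City A', 'City A', 'City B', 'City B'],
    'S3P1': [1, 2, 1, 2],
    'FAC18': [10, 20, 30, 40],
})

# Sum FAC18 per (CD, S3P1), then move S3P1 into the columns
test = cuestio.groupby(['CD', 'S3P1'])['FAC18'].sum().unstack('S3P1')
print(test)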
I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values from the value column.
The data looks something like this:
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
Where every date can occur multiple times, recording values at different points
during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue
comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[df.time == df.time.min()])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is whether there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform a filtering or sorting operation on column B, grab the resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
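To cover the general "group by A, sort by B, grab C" pattern, you can sort once and then take the first and last row of each group; a minimal sketch with made-up data shaped like yours:
import pandas as pd

# Hypothetical rows shaped like the question's data
df = pd.DataFrame({
    'time': [12.85, 9.73, 16.62, 8.10],
    'value': [19.20, 13.52, 18.35, 17.57],
    'date': ['08-22-2019', '08-22-2019', '08-19-2019', '08-19-2019'],
})

sorted_df = df.sort_values('time')
day_open = sorted_df.groupby('date')['value'].first()   # value at the earliest time of each day
day_close = sorted_df.groupby('date')['value'].last()   # value at the latest time of each day
print(pd.concat([day_open, day_close], axis=1, keys=['open', 'close']))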
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min', 'min'), ('Max', 'max')])
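On recent pandas versions (0.25+), the same result can also be written with named aggregation, which may read a little more cleanly:
df.sort_values('time').groupby('date')['value'].agg(Min='min', Max='max')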
I would like to remove the extra index level called service_type_id; I have not included it in my code but it just appears for no apparent reason. I am using Python.
My code is
data_tr = data.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id')
The output is this table with extra index:
I believe it has something to do with the groupby and unstack. Kindly highlight to me why there is an extra index and what my code should be.
The dataset
https://drive.google.com/file/d/1XZVfXbgpV0l3Oewgh09Vw5lVCchK0SEh/view?usp=sharing
I hope pandas.DataFrame.droplevel can do the job for your query:
import pandas as pd
df = pd.read_csv('Dataset - Transaction.csv')
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack().reset_index().fillna(0).set_index('transaction_id').droplevel(0,1)
data_tr.head(2)
df.groupby(['transaction_id', 'service_type']).sum() takes the sum of the numerical field service_type_id.
data_tr = df.groupby(['transaction_id', 'service_type']).sum().unstack()
print(data_tr.columns)
MultiIndex([('service_type_id', '3 Phase Wiring'),
            ('service_type_id', 'AV Equipment'),
            ...
            ('service_type_id', 'Yoga Lessons'),
            ('service_type_id', 'Zumba Classes')],
           names=[None, 'service_type'], length=188)
#print(data_tr.info())
Initially there was only one column (service_type_id) and two index levels (transaction_id, service_type). After you unstack, service_type becomes tuple-like columns (a MultiIndex) where each service type holds the value of service_type_id. droplevel(0, 1) will convert your DataFrame columns from a MultiIndex to a single Index, as follows:
print(data_tr.columns)
Index(['3 Phase Wiring', ......,'Zumba Classes'],
dtype='object', name='service_type', length=188)
It looks like you are trying to make a pivot table of transaction_id and service_type, using service_type_id as the value. The reason you are getting the extra index is that your sum generates a sum for every (numerical) column.
For insight, try to execute just
data.groupby(['transaction_id', 'service_type']).sum()
Since the data uses the label service_type_id, I assume the sum actually only serves the purpose of getting the id value out. A cleaner way to get the desired result is using a pivot:
data_tr = data[['transaction_id', 'service_type', 'service_type_id']].pivot(
    index='transaction_id',
    columns='service_type',
    values='service_type_id',
).fillna(0)
Depending on how you like your data structure, you can follow up with a .reset_index()
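One caveat, assuming the same column names: if a (transaction_id, service_type) pair can occur more than once, pivot will raise a duplicate-index error. A hedged alternative is pivot_table, which aggregates duplicates:
data_tr = data.pivot_table(
    index='transaction_id',
    columns='service_type',
    values='service_type_id',
    aggfunc='sum',      # mirrors the groupby().sum() in the question
    fill_value=0,
)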
I would like to learn how to loop through column names with conditions in pandas.
For example, I have a list T = [400,500,600]. I have a dataframe whose column names are G_ads_400, G_ads_500, ...
I would like to get the min value of each G_ads_ column whose suffix matches a value in T, using a for-loop and an if-statement (I am only familiar with these two; open to other suggestions).
For example: Take min value of G_ads_400 when T = 400
Here is my code:
T = [400, 500, 600, 700]
for t in T:
    if t in df.columns[df.columns.str.contains('t')]:
        min_value = df.columns.min()
        print(min_value)
I tried a few other ways but it didn't work. It either returned an error or only the names of the columns.
Thank you!
I think you can do it easily like this. This will return the min values for all columns matching the T values:
T = [400,500,600,700]
columns = [f"G_ads_{i}" for i in T]
res = df[columns].min()
If you need the min values as a dataframe:
res = df[columns].min().to_frame().T
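If some values in T might not have a matching G_ads_ column, a small hedged variant filters the generated names against df.columns first to avoid a KeyError:
columns = [f"G_ads_{i}" for i in T if f"G_ads_{i}" in df.columns]
res = df[columns].min()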
I have a dataframe with three columns
The first column has 3 unique values. I used the code below to create a separate dataframe per unique value; however, I am unable to iterate over those dataframes and am not sure how to do so.
df = pd.read_excel("input.xlsx")
unique_groups = list(df.iloc[:, 0].unique())  # let's assume the unique values are 0, 1, 2
mtlist = []
for index, value in enumerate(unique_groups):
    globals()['df%s' % index] = df[df.iloc[:, 0] == value]
    mtlist.append('df%s' % index)
print(mtlist)
O/P
['df0', 'df1', 'df2']
For example, let's say I want to find out the length of the first unique dataframe. If I manually type the name of the DF I get the correct output:
len(df0)
O/P
35
But I am trying to automate the code, so I want to find the length and iterate over that dataframe the same way I would by typing its name.
What I'm looking for is: if I try the code below,
len('df%s' % 0)
I want to get the actual length of the dataframe instead of the length of the string.
Could someone please guide me how to do this?
I have also tried to create a dictionary using the code below, where the key is the unique group and the value contains the remaining columns for that group, but I can't figure out how to iterate over the dictionary when the DF has more than two columns.
df = pd.read_excel("input.xlsx")
unique_groups = list(df["Assignment Group"].unique())
length_of_unique_groups = len(unique_groups)
mtlist = []
df_dict = {name: df.loc[df['Assignment Group'] == name] for name in unique_groups}
Can someone please provide a better solution?
UPDATE
SAMPLE DATA
Assignment_group Description Document
Group A Text to be updated on the ticket 1 doc1.pdf
Group B Text to be updated on the ticket 2 doc2.pdf
Group A Text to be updated on the ticket 3 doc3.pdf
Group B Text to be updated on the ticket 4 doc4.pdf
Group A Text to be updated on the ticket 5 doc5.pdf
Group B Text to be updated on the ticket 6 doc6.pdf
Group C Text to be updated on the ticket 7 doc7.pdf
Group C Text to be updated on the ticket 8 doc8.pdf
Let's assume there are 100 rows of data.
I'm trying to automate ServiceNow ticket creation with the above data.
So my end goal is that Group A tickets should go to one group; however, for each description a unique task has to be created. We can club 10 tasks together and submit them as one request, so if I divide the df into different dfs based on Assignment_group it would be easier to iterate over (that's the only idea I could think of).
For example, let's say we have REQUEST001; within that request there will be multiple sub-tasks such as STASK001, STASK002, ... STASK010.
hope this helps
Your problem is easily solved by groupby, one of the most useful tools in pandas:
length_of_unique_groups = df.groupby('Assignment Group').size()
You can do all kinds of operations (sum, count, std, etc.) on your remaining columns, like getting the mean value of a price column for each group if there were one.
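Related to the iteration goal in the update, groupby also lets you loop over each sub-frame directly instead of creating df0, df1, ... variables; a minimal sketch with made-up rows shaped like the sample data (column names assumed from the question):
import pandas as pd

# Hypothetical stand-in for the sample data in the update
df = pd.DataFrame({
    'Assignment_group': ['Group A', 'Group B', 'Group A', 'Group C'],
    'Description': ['ticket 1', 'ticket 2', 'ticket 3', 'ticket 4'],
    'Document': ['doc1.pdf', 'doc2.pdf', 'doc3.pdf', 'doc4.pdf'],
})

for group_name, group_df in df.groupby('Assignment_group'):
    print(group_name, len(group_df))                # size of each sub-frame
    for row in group_df.itertuples(index=False):
        print('  ', row.Description, row.Document)  # each row could become one sub-task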
I think you want to try something like len(eval('df%s' % 0))
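If eval feels risky, the df_dict you already built supports the same kind of iteration and avoids dynamic variable names; a small hedged sketch using the names from your question:
# df_dict maps each unique 'Assignment Group' value to its sub-frame
for name, sub_df in df_dict.items():
    print(name, len(sub_df))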
Broadly I have the Smart Meters dataset from Kaggle and I'm trying to get a count of the first and last measure by house, then trying to aggregate that to see how many houses began (or ended) reporting on a given day. I'm open to methods totally different than the line I pursue below.
In SQL, when exploring data I often used something like following:
SELECT Max_DT, COUNT(House_ID) AS HouseCount
FROM
(
SELECT House_ID, MAX(Date_Time) AS Max_DT
FROM ElectricGrid GROUP BY House_ID
) MeasureMax
GROUP BY Max_DT
I'm trying to replicate this logic in Pandas and failing. I can get the initial aggregation like:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
However I'm failing to get the outer query. Specifically I don't know what the aggregated column is called. If I do a describe() it shows as Date_Time in the example above. I tried renaming the columns:
house_max.columns = ['House_Id','Max_Date_Time']
I found a StackOverflow discussion about renaming the results of aggregation and attempted to apply it:
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
I still find that a describe() returns Date_Time as the column name.
start_end_collate = house_max.groupby('Date_Time_max')['House_Id'].size()
In the rename example my second query fails to find Date_Time or Max_Date_Time. In the latter case, with the ravel code, it appears not to find House_Id when I run it.
That seems weird; I would have thought your code would not be able to find the House_Id field. After you perform your groupby on House_Id it becomes an index, which you cannot reference as a column.
This should work:
house_max = house_info.groupby('House_Id').agg({'Date_Time' :['max']})
house_max.columns = ["_".join(x) for x in house_max.columns.ravel()]
start_end_collate = house_max.groupby('Date_Time_max').size()
Alternatively you can just drop the multilevel column:
house_max.columns = house_max.columns.droplevel(0)
start_end_collate = house_max.groupby('max').size()
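For reference, a hedged one-pass equivalent of the original SQL, assuming the same column names, is to take the max Date_Time per house and then count houses per resulting timestamp with value_counts:
start_end_collate = (
    house_info.groupby('House_Id')['Date_Time']
              .max()               # latest reading per house
              .value_counts()      # number of houses whose last reading falls on each Date_Time
              .sort_index()        # order by Date_Time rather than by count
              .rename('HouseCount')
)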