Pandas group by values in list (in series) - python

I am trying to group by the items of a list stored in a DataFrame Series. The dataset being used is the Stack Overflow 2020 Survey.
The layout is roughly as follows:
... LanguageWorkedWith ... ConvertedComp ...
Respondent
1 Python;C 50000
2 C++;C 70000
I essentially want to use groupby on the unique values in the list of languages worked with, and apply a mean aggregator function to ConvertedComp, like so...
LanguageWorkedWith
C++ 70000
C 60000
Python 50000
I have actually managed to achieve the desired output but my solution seems somewhat janky and being new to Pandas, I believe that there is probably a better way.
My solution is as follows:
import pandas as pd

# read csv
sos = pd.read_csv("developer_survey_2020/survey_results_public.csv", index_col='Respondent')
# separate string into list of strings, disregarding unanswered responses
temp = sos["LanguageWorkedWith"].dropna().str.split(';')
# create new DataFrame with respondent index and rows populated with known languages
langs_known = pd.DataFrame(temp.tolist(), index=temp.index)
# stack columns as rows, dropping old column names
stacked_responses = langs_known.stack().reset_index(level=1, drop=True)
# Re-index the ConvertedComp series to match the stacked_responses index
reindexed_pays = sos["ConvertedComp"].reindex(stacked_responses.index)
# Concatenate the re-indexed series with the stacked languages column-wise
stacked_with_pay = pd.concat([stacked_responses, reindexed_pays], axis='columns')
# Remove rows with no salary data
stacked_with_pay.dropna(how='any', inplace=True)
# Rename columns
stacked_with_pay.columns = ["LWW", "Salary"]
# Group by LWW and apply median
lang_ave_pay = stacked_with_pay.groupby("LWW")["Salary"].median().sort_values(ascending=False).head()
Output:
LWW
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: Salary, dtype: float64
which matches the value calculated when choosing specific language: sos.loc[sos["LanguageWorkedWith"].str.contains('Perl').fillna(False), "ConvertedComp"].median()
Any tips on how to improve/functions that provide this functionality/etc would be appreciated!

Start from a data frame that holds only the two columns of interest, split the language string into separate columns, and place them next to the salary. The next step is to convert the data from wide (horizontal) format to long (vertical) format using melt. Then group by language name to get the median. See the pandas melt documentation for details.
lww = sos[["LanguageWorkedWith","ConvertedComp"]]
lwws = pd.concat([lww['ConvertedComp'], lww['LanguageWorkedWith'].str.split(';', expand=True)], axis=1)
lwws.reset_index(drop=True, inplace=True)
df_long = pd.melt(lwws, id_vars='ConvertedComp', value_vars=lwws.columns[1:], var_name='lang', value_name='lang_name')
df_long.groupby('lang_name')['ConvertedComp'].median().sort_values(ascending=False).head()
lang_name
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: ConvertedComp, dtype: float64
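Another option, not part of the answer above, is to split the string into lists and let explode produce one row per language. This is only a sketch on a tiny made-up frame that mimics the layout in the question (the three rows and their values are assumptions, not survey data); it uses the mean, as in the small example at the top of the question.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the real file has many more columns.
sos = pd.DataFrame(
    {
        "LanguageWorkedWith": ["Python;C", "C++;C", None],
        "ConvertedComp": [50000, 70000, 65000],
    },
    index=pd.Index([1, 2, 3], name="Respondent"),
)

long_form = (
    sos[["LanguageWorkedWith", "ConvertedComp"]]
    .dropna()  # drop respondents with no languages or no salary
    .assign(LanguageWorkedWith=lambda d: d["LanguageWorkedWith"].str.split(";"))
    .explode("LanguageWorkedWith")  # one row per (respondent, language)
)

lang_mean_pay = (
    long_form.groupby("LanguageWorkedWith")["ConvertedComp"]
    .mean()
    .sort_values(ascending=False)
)
print(lang_mean_pay)
```

Swapping .mean() for .median() gives the same kind of ranking as in the question, without the intermediate wide frame.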

Related

Apply function to several DataFrames by creating new DataFrame

Maybe someone can help me find a solution:
I have 100 dataframes. Each dataframe contains time / High_Price / Low_Price.
I would like to create a new dataframe that contains the gain from each dataframe.
Example:
df1 = pd.DataFrame({"high":[5,4,5,2],
"low":[1,2,2,1]},
index=["2019-04-06","2019-04-07","2019-04-08","2019-04-09"])
df100 = pd.DataFrame({"high":[7,5,6,7],
"low":[1,2,3,4]},
index=["2019-04-06","2019-04-07","2019-04-08","2019-04-09"])
Functions:
def myfunc(data, amount):
    # keep only rows that contain at least one non-zero value
    data = data.loc[(data != 0).any(axis=1)]
    profit = (amount / data.iloc[0]['low']) * data.iloc[-1]['high']
    return profit
Output should be:
output= pd.DataFrame({"Gain":[1,6]},
index=["df1","df100"])
How can I apply the function to all 100 dataframes and collect only the gains in a new dataframe that shows the name of each dataframe and its gain?
Put your dataframes in a list and access them by integer index. Having variables named df1 to df100 is bad programming style because a) the dataframes belong together, so put them in a collection (e.g. list) and b) you cannot get "the" name of an object from its value, leading to complications such as the one you are facing now.
So let dfs be your list of 100 dataframes, starting at index 0.
Use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs], columns=['Gain'])
The index of output now corresponds to the index of dfs, starting at 0. There's no reason to rename it to 'df1' ... 'df100', you gain no information and the output becomes harder to handle.
In case of arbitrary dataframe names, use a dictionary that maps name to df. Let's call it dfs again. Then use
amount = ... # the value you want to use
output = pd.DataFrame([myfunc(df, amount) for df in dfs.values()], columns=['Gain'], index=list(dfs.keys()))
I'm assuming myfunc is correct, I did not debug it.
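For reference, here is a small self-contained sketch of the dictionary variant, using the two example frames from the question. The value amount = 1 is an assumption (the question does not say which amount to use), and myfunc is copied from the question with only the axis argument spelled out.

```python
import pandas as pd

def myfunc(data, amount):
    # Same logic as in the question: drop all-zero rows, then
    # divide the amount by the first low and multiply by the last high.
    data = data.loc[(data != 0).any(axis=1)]
    return (amount / data.iloc[0]["low"]) * data.iloc[-1]["high"]

dates = ["2019-04-06", "2019-04-07", "2019-04-08", "2019-04-09"]

# Map each name to its dataframe instead of keeping df1 ... df100 as variables.
dfs = {
    "df1": pd.DataFrame({"high": [5, 4, 5, 2], "low": [1, 2, 2, 1]}, index=dates),
    "df100": pd.DataFrame({"high": [7, 5, 6, 7], "low": [1, 2, 3, 4]}, index=dates),
}

amount = 1  # assumed value, not given in the question

output = pd.DataFrame(
    {"Gain": [myfunc(df, amount) for df in dfs.values()]},
    index=list(dfs),  # dictionary keys become the row labels
)
print(output)
```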

Is there a better way to group by a category, and then select values based on different column values in Pandas?

I have an issue where I want to group by a date column, sort by a time column, and grab the resulting values in the values column.
The data looks something like this:
time value date
0 12.850000 19.195359 08-22-2019
1 9.733333 13.519543 09-19-2019
2 14.083333 9.191413 08-26-2019
3 16.616667 18.346598 08-19-2019
...
where every date can occur multiple times, recording values at different points during the day.
I wanted to group by date, and extract the minimum and maximum values of those groupings so I did this:
dayMin = df.groupby('date').value.min()
which gives me a Series object that is fairly easy to manipulate. The issue comes up when I want to group by 'date', sort by 'time', then grab the 'value'.
What I did was:
dayOpen = df.groupby('date').apply(lambda df: df[ df.time == df.time.min() ])['value']
which almost worked, resulting in a DataFrame of:
date
08-19-2019 13344 17.573522
08-20-2019 12798 19.496609
08-21-2019 2009 20.033917
08-22-2019 5231 19.393700
08-23-2019 12848 17.784213
08-26-2019 417 9.717627
08-27-2019 6318 7.630234
I figured out how to clean up those nasty indexes to the left, name the column, and even concat with my dayMin Series to achieve my goal.
Ultimately my question is if there is a nicer way to perform these data manipulations that follow the general pattern of: "Group by column A, perform filtering or sorting operation on column B, grab resulting values from column C" for future applications.
Thank you in advance :)
You can sort the data frame before calling groupby:
first_of_day = df.sort_values('time').groupby('date').head(1)
This should work for you:
df.sort_values('time').groupby(['date'])['value'].agg([('Min' , 'min'), ('Max', 'max')])
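The general pattern "group by A, order by B, take values from C" can be sketched by sorting once and then aggregating with first and last. The six rows below are made up to resemble the data in the question, and the column names open, close, low and high are just illustrative labels.

```python
import pandas as pd

# Hypothetical sample in the same layout as the question.
df = pd.DataFrame(
    {
        "time": [12.85, 9.73, 14.08, 16.62, 8.50, 10.25],
        "value": [19.2, 13.5, 9.2, 18.3, 17.0, 12.1],
        "date": [
            "08-22-2019", "09-19-2019", "08-26-2019",
            "08-19-2019", "08-22-2019", "09-19-2019",
        ],
    }
)

# Sort by B (time) once; groupby keeps the row order inside each group,
# so "first" and "last" pick the earliest and latest reading of each day.
summary = (
    df.sort_values("time")
    .groupby("date")["value"]
    .agg(open="first", close="last", low="min", high="max")
)
print(summary)
```

The result is a DataFrame indexed by date with one column per aggregate, so no index clean-up or concat is needed afterwards.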

Pandas dataframe: summing cell data from a group of rows, storing in a new column

As part of a treatment for a health-related issue, I need to measure my liquid intake (along with some other parameters), registering the amount of liquid every time I drink. I have a dataframe with several months of such records.
I want to store my daily total in an additional column.
I would like to store it in the first row of each slice returned by df.groupby(df['Date']), for all the days.
I tried the following:
df.groupby(df.Date).first()['Total']= df.groupby(df.Date)['Drank'].fillna(0).sum()
But this does not seem to be the way to do it.
Grateful for any advice.
Thanks
Michael
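Use the fact that False == 0.
The first row of each date is the one where Date is not equal to its own shift(), so Date.eq(Date.shift()) is False (i.e. 0) on exactly those rows.
Use merge() to bring the per-date sums back onto those rows.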
import numpy as np
import pandas as pd

## construct a data set
d = pd.date_range("1-jan-2021", "1-mar-2021", freq="2H")
A = np.random.randint(20,300,len(d)).astype(float)
A.ravel()[np.random.choice(A.size, A.size//2, replace=False)] = np.nan
df = pd.DataFrame({"datetime":d, "Drank":A})
df = df.assign(Date=df.datetime.dt.date, Time=df.datetime.dt.time).drop(columns=["datetime"]).loc[:,["Date","Time","Drank"]]
## construction done
# row is False (== 0) on the first row of each date, because its Date differs from the shifted Date
# merge Total back onto the rows where row == 0
df.assign(row=df.Date.eq(df.Date.shift())).merge(
    df.groupby("Date", as_index=False).agg(Total=("Drank","sum")).assign(row=0),
    on=["Date","row"], how="left").drop(columns="row")
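A shorter route to the same kind of result, offered only as a sketch, is to compute the daily sum with transform and then blank it out everywhere except the first row of each date. The five rows below are invented for illustration; the column names Date, Time and Drank follow the question.

```python
import numpy as np
import pandas as pd

# Hypothetical records: three on the first day, two on the second.
df = pd.DataFrame(
    {
        "Date": ["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
        "Time": ["08:00", "12:00", "18:00", "09:00", "20:00"],
        "Drank": [200.0, np.nan, 150.0, 300.0, 250.0],
    }
)

# Daily total broadcast to every row of that day (NaN entries are skipped by sum).
daily_total = df.groupby("Date")["Drank"].transform("sum")

# True only on the first row of each date.
first_of_day = ~df["Date"].duplicated()

# Keep the total on the first row of each day, NaN elsewhere.
df["Total"] = daily_total.where(first_of_day)
print(df)
```

This assumes the rows of one date may appear anywhere in the frame; duplicated() marks every later occurrence regardless of ordering.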

Python, lambda function as argument for groupby

I'm trying to figure out what a piece of code is doing, but I'm getting kinda lost on it.
I have a pandas dataframe, which has been loaded by the following .csv file:
origin_census_block_group,date_range_start,date_range_end,device_count,distance_traveled_from_home,bucketed_distance_traveled,median_dwell_at_bucketed_distance_traveled,completely_home_device_count,median_home_dwell_time,bucketed_home_dwell_time,at_home_by_each_hour,part_time_work_behavior_devices,full_time_work_behavior_devices,destination_cbgs,delivery_behavior_devices,median_non_home_dwell_time,candidate_device_count,bucketed_away_from_home_time,median_percentage_time_home,bucketed_percentage_time_home,mean_home_dwell_time,mean_non_home_dwell_time,mean_distance_traveled_from_home
010539707003,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,49,626,"{""16001-50000"":5,""0"":11,"">50000"":4,""2001-8000"":3,""1-1000"":9,""1001-2000"":7,""8001-16000"":1}","{""16001-50000"":110,"">50000"":155,""<1000"":40,""2001-8000"":237,""1001-2000"":27,""8001-16000"":180}",12,627,"{""721-1080"":11,""361-720"":9,""61-360"":1,""<60"":11,"">1080"":12}","[32,32,28,30,30,31,27,23,20,20,20,17,19,19,15,14,17,20,20,21,25,22,24,23]",7,3,"{""120330012011"":1,""010030107031"":1,""010030114052"":2,""120330038001"":1,""010539701003"":1,""010030108001"":1,""010539707002"":14,""010539705003"":2,""120330015001"":1,""121130102003"":1,""010539701002"":1,""120330040001"":1,""370350101014"":2,""120330033081"":2,""010030106003"":1,""010539706001"":2,""010539707004"":3,""120330039001"":1,""010539699003"":1,""120330030003"":1,""010539707003"":41,""010970029003"":1,""010539705004"":1,""120330009002"":1,""010539705001"":3,""010539704003"":1,""120330028012"":1,""120330035081"":1,""120330036102"":1,""120330036142"":1,""010030114062"":1,""010539706004"":7,""010539706002"":1,""120330036082"":1,""010539707001"":7,""010030102001"":1,""120330028011"":1}",2,241,71,"{""21-45"":4,""481-540"":2,""541-600"":1,""721-840"":1,""1201-1320"":1,""301-360"":3,""<20"":13,""61-120"":3,""241-300"":3,""121-180"":1,""421-480"":3,""1321-1440"":4,""1081-1200"":1,""961-1080"":2,""601-660"":1,""181-240"":1,""661-720"":2,""361-420"":3}",72,"{""0-25"":13,""76-100"":21,""51-75"":6,""26-50"":3}",657,413,1936
010730144081,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,139,2211,"{""16001-50000"":17,""0"":41,"">50000"":15,""2001-8000"":22,""1-1000"":8,""1001-2000"":12,""8001-16000"":24}","{""16001-50000"":143,"">50000"":104,""<1000"":132,""2001-8000"":39,""1001-2000"":15,""8001-16000"":102}",41,806,"{""721-1080"":32,""361-720"":16,""61-360"":12,""<60"":30,"">1080"":46}","[91,92,93,91,91,90,86,83,78,64,64,61,64,62,65,62,60,74,61,64,75,78,81,84]",8,6,"{""131350501064"":1,""131350502151"":1,""010730102002"":1,""011170302131"":2,""010730038024"":1,""010730108041"":1,""010730144133"":1,""010730132003"":1,""011210118002"":1,""011170303053"":1,""010730111084"":2,""011170302142"":1,""010730119011"":1,""010730129063"":2,""010730107063"":1,""010730059083"":1,""010730058003"":1,""011270204003"":1,""010730049012"":2,""130879701001"":1,""010730120021"":1,""130890219133"":1,""010730144082"":4,""170310301031"":1,""010730129112"":1,""010730024002"":1,""011170303034"":2,""481390616004"":1,""121270826052"":1,""010730128021"":2,""121270825073"":1,""010730004004"":1,""211959313002"":1,""010730100012"":1,""011170302151"":1,""010730142041"":1,""010730129123"":1,""010730129084"":1,""010730042002"":1,""010730059033"":2,""170318306001"":1,""130519800001"":1,""010730027003"":1,""121270826042"":1,""481610001002"":1,""010730100011"":1,""010730023032"":1,""350250004002"":1,""010730056003"":1,""010730132001"":1,""011170302171"":2,""120910227003"":1,""011239620001"":1,""130351503002"":1,""010730129155"":1,""010730001001"":2,""010730110021"":1,""170310104003"":1,""010730059082"":2,""010730120022"":1,""011170303151"":1,""010730139022"":1,""011170303441"":4,""010730144092"":3,""010730129151"":1,""011210119001"":2,""010730144081"":117,""010730108052"":1,""010730129122"":9,""370710321003"":1,""010730142034"":2,""010730042001"":2,""010570201003"":1,""010730144132"":6,""010730059032"":1,""010730012001"":2,""010730102003"":1,""011170303332"":1,""010730128032"":2,""010730129081"":1,""010730103011"":1,""0107
30058001"":3,""011150401041"":1,""010730045001"":3,""010730110013"":1,""010730119041"":1,""010730042003"":1,""010730141041"":1,""010730144091"":1,""010730129154"":1,""484759501002"":1,""010730144063"":1,""010730144102"":12,""011170303141"":1,""011250106011"":1,""011170303152"":1,""010730059104"":1,""010730107021"":1,""010730100014"":1,""010730008004"":1,""011170303451"":1,""010730127041"":2,""370559704003"":1,""010730047011"":2,""010730129132"":2,""011010014002"":1,""010730144131"":1,""011170302133"":1,""010730030011"":1,""131350506063"":1,""010730118023"":1,""010890110141"":1,""010730128023"":1,""010730106022"":2,""130879703004"":1,""010730108015"":1,""131390010041"":1,""011170305013"":1,""010730134002"":1,""010730031004"":1,""010730138012"":1,""010730011004"":1,""011250102041"":1,""010730129131"":4,""010730144101"":4,""011170303331"":2,""010730003001"":1,""011010033012"":1,""483539504004"":1,""010550104021"":1,""011170303411"":1,""010730106031"":1,""011170303153"":5,""010730128034"":1,""010730129061"":1,""131390010023"":1,""010730051042"":1,""130510107002"":1,""010730027001"":2,""120090686011"":1,""010730107042"":1,""010730123052"":1,""010730129102"":1,""011210115003"":1,""010730129083"":4,""011170303142"":1,""011010014001"":1,""010730107064"":2}",7,176,205,"{""21-45"":7,""481-540"":10,""541-600"":4,""46-60"":2,""721-840"":3,""1201-1320"":3,""301-360"":7,""<20"":46,""61-120"":6,""241-300"":4,""121-180"":9,""421-480"":2,""1321-1440"":3,""1081-1200"":5,""961-1080"":1,""601-660"":1,""181-240"":5,""661-720"":1,""361-420"":7}",78,"{""0-25"":29,""76-100"":71,""51-75"":27,""26-50"":8}",751,338,38937
010890017002,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,78,1934,"{""16001-50000"":2,""0"":12,"">50000"":9,""2001-8000"":27,""1-1000"":12,""1001-2000"":8,""8001-16000"":8}","{""16001-50000"":49,"">50000"":99,""<1000"":111,""2001-8000"":37,""1001-2000"":24,""8001-16000"":28}",11,787,"{""721-1080"":17,""361-720"":11,""61-360"":11,""<60"":15,"">1080"":23}","[49,42,48,48,47,48,44,44,39,32,34,32,36,31,32,36,40,37,36,38,49,45,46,46]",5,1,"{""010890101002"":1,""010730108041"":1,""010890020003"":2,""010890010001"":2,""010890025011"":3,""010890026001"":4,""280819505003"":1,""281059504004"":1,""010890103022"":1,""120990056011"":1,""010890109012"":2,""010890019021"":6,""010890013021"":4,""010890015004"":3,""010890108003"":1,""010890014022"":6,""281059501003"":1,""281059503001"":1,""010890007022"":3,""010890017001"":3,""010890107023"":1,""010890021002"":1,""010890009011"":1,""010890109013"":1,""010730120022"":1,""010890031003"":15,""011170303151"":1,""010890019011"":9,""010890030002"":2,""010890110221"":1,""011170305021"":1,""010890026003"":2,""010890025012"":3,""010730117034"":1,""010830208022"":1,""010890031002"":2,""010890112002"":1,""010210602001"":1,""010890002022"":1,""010890017002"":65,""281059506021"":1,""010890010003"":2,""010890106222"":1,""120990059182"":1,""010890110222"":1,""010890020001"":1,""010890101003"":1,""010890018013"":1,""010890021001"":1,""010890109021"":1,""010890108001"":1,""010770106005"":1,""281059506011"":1,""010030114032"":2,""010830209001"":1,""010890027222"":1,""010730128023"":1,""010890009021"":1,""010030114051"":1,""010030109031"":1,""010030103003"":1,""010890031001"":1,""010890021003"":1,""010030114062"":4,""010890106241"":1,""281059504003"":1,""010890018011"":10,""010890019031"":5,""010890027012"":1,""010730108054"":1,""010890106223"":2,""010890111001"":1,""010210603002"":1,""010890109011"":1,""010890019012"":2,""010890113001"":1,""010890028013"":3}",1,229,99,"{""481-540"":3,""541-600"":2,""46-60"":1,""721-840"":1,""1201-1320"":7,""301
-360"":6,""<20"":18,""61-120"":10,""241-300"":5,""121-180"":2,""1321-1440"":2,""841-960"":1,""1081-1200"":1,""961-1080"":3,""601-660"":3,""181-240"":2,""661-720"":3}",78,"{""0-25"":16,""76-100"":44,""51-75"":11,""26-50"":7}",708,353,14328
010950308022,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,100,2481,"{""16001-50000"":11,""0"":19,"">50000"":11,""2001-8000"":40,""1-1000"":6,""1001-2000"":3,""8001-16000"":4}","{""16001-50000"":150,"">50000"":23,""<1000"":739,""2001-8000"":23,""1001-2000"":12,""8001-16000"":208}",17,703,"{""721-1080"":21,""361-720"":19,""61-360"":10,""<60"":24,"">1080"":26}","[62,64,64,63,65,67,54,48,37,37,34,33,30,34,32,33,35,43,50,56,58,56,56,57]",8,6,"{""010950306004"":1,""010950302023"":1,""011030054051"":1,""010950311002"":1,""010950309023"":1,""010499606003"":1,""121319506023"":2,""010950308022"":86,""121319506016"":2,""010950304013"":1,""010950307024"":1,""010950309041"":1,""010890019021"":2,""010950312001"":5,""010499607002"":1,""011150402013"":1,""010550102003"":1,""120050027043"":3,""010719509003"":1,""010950302022"":1,""010950308023"":2,""120050027051"":2,""471079701022"":1,""010890106221"":1,""010950306001"":1,""010950302011"":2,""011150405013"":1,""011150402041"":2,""010950312002"":16,""011030054042"":1,""010950301002"":2,""130459105011"":1,""010730001001"":1,""130459102001"":1,""010890109013"":2,""010950308013"":14,""010719508004"":1,""120050027041"":3,""010550110021"":3,""010730049022"":1,""010950308024"":1,""010950312004"":6,""010950312003"":1,""010550104012"":2,""010550110013"":1,""120860004111"":1,""010890027222"":1,""010950306002"":2,""010950304015"":1,""011030054041"":1,""010950309031"":8,""010950308021"":1,""010950302024"":1,""010950307011"":5,""010550110012"":2,""011150404013"":1,""130459103003"":1,""120050027032"":3,""010950307012"":5,""010950309022"":2,""010950307023"":1,""010719508003"":1,""010499608001"":2,""010950310003"":1,""011150402043"":1,""120860099063"":1,""010950309021"":4,""010950309043"":2,""010950308011"":1,""010950306003"":3,""120050027042"":1,""010950308025"":5,""010950309032"":6,""010499607001"":1}",1,199,132,"{""21-45"":8,""481-540"":6,""541-600"":4,""46-60"":3,""721-840"":3,""1201-1320"":4,""301-360"":3,""<20"":20,""61-120"":10,""241-
300"":2,""121-180"":4,""421-480"":3,""1321-1440"":1,""841-960"":3,""961-1080"":2,""601-660"":1,""181-240"":3,""661-720"":1,""361-420"":2}",74,"{""0-25"":20,""76-100"":48,""51-75"":23,""26-50"":4}",661,350,5044
df = pd.read_csv(csv_file,
    usecols=[
        'origin_census_block_group',
        'date_range_start',
        'date_range_end',
        'device_count',
        'distance_traveled_from_home',
        'completely_home_device_count',
        'median_home_dwell_time',
        'part_time_work_behavior_devices',
        'full_time_work_behavior_devices'
    ],
    dtype={'origin_census_block_group': str},
).set_index('origin_census_block_group')
and, later in the code, the dataframe is modified by:
df = df.groupby(lambda cbg: cbg[:5]).sum()
I don't quite understand what this line is doing precisely.
Groupby generally groups a dataframe by column, so...is it grouping the dataframe using multiple columns (0 to 5)? What is the effect of .sum() at the end?
If you run your code exactly as you wrote it (both the creation of df and the groupby), you can see the result. Here are the first couple of columns of the groupby output:
device_count distance_traveled_from_home
----- -------------- -----------------------------
01053 49 626
01073 139 2211
01089 78 1934
01095 100 2481
What happens here is that the function lambda cbg: cbg[:5] is applied to each of the index values (strings that look like numbers, taken from the column origin_census_block_group). As an aside, note the statement
...
dtype={'origin_census_block_group': str},
when creating the df, so somebody went to the trouble of making sure the values are actually str.
So the function is applied to a string like '010539707003' and returns the substring made of the first 5 characters of that string:
'010539707003'[:5]
produces
'01053'
So I assume there are multiple keys that share the first 5 characters in the actual file (in the snippet they are all unique, so it is not very interesting), and all these rows are grouped together.
Then .sum() is applied to each numerical column of each group and returns the column sum for each groupby key. This is what you see in my output in the column 'device_count' and so on.
Hope this is clear now.
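To see the grouping actually merge rows, here is a tiny sketch in which two index values share their first five characters. The second row (010539707004 with a device count of 10) is invented for the illustration; the other two rows reuse values from the snippet.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "device_count": [49, 10, 139],
        "distance_traveled_from_home": [626, 100, 2211],
    },
    index=pd.Index(
        ["010539707003", "010539707004", "010730144081"],  # second ID is made up
        name="origin_census_block_group",
    ),
)

# A function passed to groupby is called on each index label.
by_function = df.groupby(lambda cbg: cbg[:5]).sum()

# The same grouping expressed with the vectorised string accessor.
by_key = df.groupby(df.index.str[:5]).sum()

print(by_function)
```

Both calls produce one row per five-character prefix, with the numeric columns summed within each prefix.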
Pandas' read_csv() loads a CSV-formatted file into a pandas DataFrame.
I recommend having a read of the pandas documentation, as it's very thorough -> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
usecols=[
'origin_census_block_group',
'date_range_start',
'date_range_end',
'device_count',
'distance_traveled_from_home',
'completely_home_device_count',
'median_home_dwell_time',
'part_time_work_behavior_devices',
'full_time_work_behavior_devices'
],
The usecols parameter takes a list of the desired column names and loads only those columns into the dataframe.
dtype={'origin_census_block_group': str}
The dtype parameter takes a dict that specifies the data type of the values in each column, like {'column': datatype}. Here it keeps the block group IDs as strings, so their leading zeros are preserved.
.set_index('origin_census_block_group')
.set_index() sets the specified column as the index (it is displayed as the leftmost column). By default, a pandas DataFrame is indexed by row number (0, 1, 2, ...); after set_index, the values of the specified column are used as the row labels instead. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
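The effect of these three pieces can be checked on a small in-memory CSV. This is only a sketch: the two data rows and the extra column are made up, and io.StringIO stands in for the real file.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with one column we do not want to load.
csv_text = (
    "origin_census_block_group,device_count,extra\n"
    "010539707003,49,x\n"
    "010730144081,139,y\n"
)

wanted = ["origin_census_block_group", "device_count"]

# With dtype=str the IDs stay strings, so the leading zero survives.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=wanted,
    dtype={"origin_census_block_group": str},
).set_index("origin_census_block_group")

# Without dtype, pandas infers an integer column and the leading zero is lost.
as_int = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(df.index[0])
print(as_int["origin_census_block_group"].iloc[0])
```

Keeping the IDs as strings is also what makes the later slice cbg[:5] possible, since integers cannot be sliced.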
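pandas' .groupby() function takes a dataframe and regroups it based on the values in the specified column or, when it is given a function, on the result of calling that function on each index label.
That is to say, if we have a dataframe (with Fruit as the index) such as df =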
Fruit   Name       Quality  Count
Apple   Marco      High         4
Pear    Lucia      Medium      10
Apple   Francesco  Low          3
Banana  Carlo      Medium       6
Pear    Timmy      Low          7
Apple   Roberto    High         8
Banana  Joe        High        21
Banana  Jack       Low          3
Pear    Rob        Medium       5
Apple   Louis      Medium       6
Pear    Jennifer   Low          7
Pear    Laura      High         8
Performing a groupby operation, such as:
df = df.groupby(lambda x: x[:2]).sum()
will take all the labels in the index, keep their first two characters (x[:2], i.e. positions 0 and 1), and return the sum of the corresponding values for each resulting key, i.e.:
Ap 21
Ba 30
Pe 37
Now, you might be wondering about that final .sum() method. Without it, groupby only returns a DataFrameGroupBy object, which describes the groups but holds no aggregated result. If you print that object, or write .sum without the parentheses, you'll get something like this:
<bound method GroupBy.sum of <pandas.core.groupby.generic.DataFrameGroupBy object at 0x109d260a0>>
This is because pandas has created a groupby object and does not yet know how you want it aggregated. Do you want the number of occurrences of each key in the index? You'd do this:
df = df.groupby(lambda x: x[:2]).size()
And that would output:
Ap 4
Ba 3
Pe 5
Or maybe the sum of their respective summable values? (Which is what is done in the example)
df = df.groupby(lambda x: x[:2]).sum()
Which again, will output:
Ap 21
Ba 30
Pe 37
Notice it has taken the first two letters of the string in the index. Had it been x[:3], it would have taken the first three letters, of course.
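The fruit example can be reproduced with the short sketch below. It assumes Fruit is the index and keeps only the Count column, because on recent pandas versions .sum() applied to the text columns (Name, Quality) may concatenate the strings rather than skip them.

```python
import pandas as pd

# Fruit is the index; only the numeric Count column is kept for this sketch.
df = pd.DataFrame(
    {"Count": [4, 10, 3, 6, 7, 8, 21, 3, 5, 6, 7, 8]},
    index=[
        "Apple", "Pear", "Apple", "Banana", "Pear", "Apple",
        "Banana", "Banana", "Pear", "Apple", "Pear", "Pear",
    ],
)

# The lambda receives each index label and returns its first two characters.
grouped = df.groupby(lambda x: x[:2])

sums = grouped["Count"].sum()   # total Count per key
sizes = grouped.size()          # number of rows per key

print(sums)
print(sizes)
```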
Summing it up:
-> .groupby() is given a function here, so it calls that function on each element of the index (the row labels) and groups together the rows that produce the same result
-> The function you have given to groupby is an anonymous (lambda) function that keeps the first 5 characters of its input (positions 0 through 4)
-> You choose how the groups are aggregated by appending a method such as .sum() or .size() to the groupby object
I also recommend reading about Python's lambda functions:
https://docs.python.org/3/reference/expressions.html

How to maintain lexsort status when adding to a multi-indexed DataFrame?

Say I construct a dataframe with pandas, having multi-indexed columns:
mi = pd.MultiIndex.from_product([['trial_1', 'trial_2', 'trial_3'], ['motor_neuron','afferent_neuron','interneuron'], ['time','voltage','calcium']])
ind = np.arange(1,11)
df = pd.DataFrame(np.random.randn(10,27),index=ind, columns=mi)
Say I want only the voltage data from trial 1. I know that the following code fails, because the indices are not sorted lexically:
idx = pd.IndexSlice
df.loc[:,idx['trial_1',:,'voltage']]
As explained in another post, the solution is to sort the dataframe's indices, which works as expected:
dfSorted = df.sortlevel(axis=1)  # sortlevel was removed in pandas 1.0; on current versions use df.sort_index(axis=1)
dfSorted.loc[:,idx['trial_1',:,'voltage']]
I understand why this is necessary. However, say I want to add a new column:
dfSorted.loc[:,('trial_1','interneuron','scaledTime')] = 100 * dfSorted.loc[:,('trial_1','interneuron','time')]
Now dfSorted is not sorted anymore, since the new column was tacked onto the end, rather than snuggled into order. Again, I have to call sortlevel before selecting multiple columns.
I feel this makes for repetitive, bug-prone code, especially when adding lots of columns to the much bigger dataframe in my own project. Is there a (preferably clean-looking) way of inserting new columns in lexical order without having to call sortlevel over and over again?
One approach would be to use filter which does a text filter on the column names:
In [117]: df['trial_1'].filter(like='voltage')
Out[117]:
   motor_neuron afferent_neuron interneuron
        voltage         voltage     voltage
1     -0.548699        0.986121   -1.339783
2     -1.320589       -0.509410   -0.529686
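If slicing with pd.IndexSlice is still needed after adding columns, one way to keep the re-sorting in a single place is a small helper that adds the column and returns a sorted frame. This is a sketch only: add_column_sorted is a hypothetical name, it uses sort_index(axis=1) (the current replacement for sortlevel), and it still sorts on every call rather than inserting in place.

```python
import numpy as np
import pandas as pd

def add_column_sorted(df, key, values):
    """Hypothetical helper: add one column, then restore lexical column order."""
    out = df.copy()
    out[key] = values
    return out.sort_index(axis=1)

mi = pd.MultiIndex.from_product(
    [
        ["trial_1", "trial_2", "trial_3"],
        ["motor_neuron", "afferent_neuron", "interneuron"],
        ["time", "voltage", "calcium"],
    ]
)
df = pd.DataFrame(np.random.randn(10, 27), index=np.arange(1, 11), columns=mi)
df = df.sort_index(axis=1)

# Add the scaled column through the helper so the columns stay sorted.
new_key = ("trial_1", "interneuron", "scaledTime")
df = add_column_sorted(df, new_key, 100 * df[("trial_1", "interneuron", "time")])

# Slicing works right away because the columns are lexically sorted again.
idx = pd.IndexSlice
voltage_trial_1 = df.loc[:, idx["trial_1", :, "voltage"]]
print(voltage_trial_1.head())
```

When many columns are added at once, it is cheaper to add them all first and sort a single time at the end.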
