Python, lambda function as argument for groupby - python

I'm trying to figure out what a piece of code is doing, but I'm getting kinda lost on it.
I have a pandas dataframe, which has been loaded by the following .csv file:
origin_census_block_group,date_range_start,date_range_end,device_count,distance_traveled_from_home,bucketed_distance_traveled,median_dwell_at_bucketed_distance_traveled,completely_home_device_count,median_home_dwell_time,bucketed_home_dwell_time,at_home_by_each_hour,part_time_work_behavior_devices,full_time_work_behavior_devices,destination_cbgs,delivery_behavior_devices,median_non_home_dwell_time,candidate_device_count,bucketed_away_from_home_time,median_percentage_time_home,bucketed_percentage_time_home,mean_home_dwell_time,mean_non_home_dwell_time,mean_distance_traveled_from_home
010539707003,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,49,626,"{""16001-50000"":5,""0"":11,"">50000"":4,""2001-8000"":3,""1-1000"":9,""1001-2000"":7,""8001-16000"":1}","{""16001-50000"":110,"">50000"":155,""<1000"":40,""2001-8000"":237,""1001-2000"":27,""8001-16000"":180}",12,627,"{""721-1080"":11,""361-720"":9,""61-360"":1,""<60"":11,"">1080"":12}","[32,32,28,30,30,31,27,23,20,20,20,17,19,19,15,14,17,20,20,21,25,22,24,23]",7,3,"{""120330012011"":1,""010030107031"":1,""010030114052"":2,""120330038001"":1,""010539701003"":1,""010030108001"":1,""010539707002"":14,""010539705003"":2,""120330015001"":1,""121130102003"":1,""010539701002"":1,""120330040001"":1,""370350101014"":2,""120330033081"":2,""010030106003"":1,""010539706001"":2,""010539707004"":3,""120330039001"":1,""010539699003"":1,""120330030003"":1,""010539707003"":41,""010970029003"":1,""010539705004"":1,""120330009002"":1,""010539705001"":3,""010539704003"":1,""120330028012"":1,""120330035081"":1,""120330036102"":1,""120330036142"":1,""010030114062"":1,""010539706004"":7,""010539706002"":1,""120330036082"":1,""010539707001"":7,""010030102001"":1,""120330028011"":1}",2,241,71,"{""21-45"":4,""481-540"":2,""541-600"":1,""721-840"":1,""1201-1320"":1,""301-360"":3,""<20"":13,""61-120"":3,""241-300"":3,""121-180"":1,""421-480"":3,""1321-1440"":4,""1081-1200"":1,""961-1080"":2,""601-660"":1,""181-240"":1,""661-720"":2,""361-420"":3}",72,"{""0-25"":13,""76-100"":21,""51-75"":6,""26-50"":3}",657,413,1936
010730144081,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,139,2211,"{""16001-50000"":17,""0"":41,"">50000"":15,""2001-8000"":22,""1-1000"":8,""1001-2000"":12,""8001-16000"":24}","{""16001-50000"":143,"">50000"":104,""<1000"":132,""2001-8000"":39,""1001-2000"":15,""8001-16000"":102}",41,806,"{""721-1080"":32,""361-720"":16,""61-360"":12,""<60"":30,"">1080"":46}","[91,92,93,91,91,90,86,83,78,64,64,61,64,62,65,62,60,74,61,64,75,78,81,84]",8,6,"{""131350501064"":1,""131350502151"":1,""010730102002"":1,""011170302131"":2,""010730038024"":1,""010730108041"":1,""010730144133"":1,""010730132003"":1,""011210118002"":1,""011170303053"":1,""010730111084"":2,""011170302142"":1,""010730119011"":1,""010730129063"":2,""010730107063"":1,""010730059083"":1,""010730058003"":1,""011270204003"":1,""010730049012"":2,""130879701001"":1,""010730120021"":1,""130890219133"":1,""010730144082"":4,""170310301031"":1,""010730129112"":1,""010730024002"":1,""011170303034"":2,""481390616004"":1,""121270826052"":1,""010730128021"":2,""121270825073"":1,""010730004004"":1,""211959313002"":1,""010730100012"":1,""011170302151"":1,""010730142041"":1,""010730129123"":1,""010730129084"":1,""010730042002"":1,""010730059033"":2,""170318306001"":1,""130519800001"":1,""010730027003"":1,""121270826042"":1,""481610001002"":1,""010730100011"":1,""010730023032"":1,""350250004002"":1,""010730056003"":1,""010730132001"":1,""011170302171"":2,""120910227003"":1,""011239620001"":1,""130351503002"":1,""010730129155"":1,""010730001001"":2,""010730110021"":1,""170310104003"":1,""010730059082"":2,""010730120022"":1,""011170303151"":1,""010730139022"":1,""011170303441"":4,""010730144092"":3,""010730129151"":1,""011210119001"":2,""010730144081"":117,""010730108052"":1,""010730129122"":9,""370710321003"":1,""010730142034"":2,""010730042001"":2,""010570201003"":1,""010730144132"":6,""010730059032"":1,""010730012001"":2,""010730102003"":1,""011170303332"":1,""010730128032"":2,""010730129081"":1,""010730103011"":1,""010730058001"":3,""011150401041"":1,""010730045001"":3,""010730110013"":1,""010730119041"":1,""010730042003"":1,""010730141041"":1,""010730144091"":1,""010730129154"":1,""484759501002"":1,""010730144063"":1,""010730144102"":12,""011170303141"":1,""011250106011"":1,""011170303152"":1,""010730059104"":1,""010730107021"":1,""010730100014"":1,""010730008004"":1,""011170303451"":1,""010730127041"":2,""370559704003"":1,""010730047011"":2,""010730129132"":2,""011010014002"":1,""010730144131"":1,""011170302133"":1,""010730030011"":1,""131350506063"":1,""010730118023"":1,""010890110141"":1,""010730128023"":1,""010730106022"":2,""130879703004"":1,""010730108015"":1,""131390010041"":1,""011170305013"":1,""010730134002"":1,""010730031004"":1,""010730138012"":1,""010730011004"":1,""011250102041"":1,""010730129131"":4,""010730144101"":4,""011170303331"":2,""010730003001"":1,""011010033012"":1,""483539504004"":1,""010550104021"":1,""011170303411"":1,""010730106031"":1,""011170303153"":5,""010730128034"":1,""010730129061"":1,""131390010023"":1,""010730051042"":1,""130510107002"":1,""010730027001"":2,""120090686011"":1,""010730107042"":1,""010730123052"":1,""010730129102"":1,""011210115003"":1,""010730129083"":4,""011170303142"":1,""011010014001"":1,""010730107064"":2}",7,176,205,"{""21-45"":7,""481-540"":10,""541-600"":4,""46-60"":2,""721-840"":3,""1201-1320"":3,""301-360"":7,""<20"":46,""61-120"":6,""241-300"":4,""121-180"":9,""421-480"":2,""1321-1440"":3,""1081-1200"":5,""961-1080"":1,""601-660"":1,""181-240"":5,""661-720"":1,""361-420"":7}",78,"{""0-25"":29,""76-100"":71,""51-75"":27,""26-50"":8}",751,338,38937
010890017002,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,78,1934,"{""16001-50000"":2,""0"":12,"">50000"":9,""2001-8000"":27,""1-1000"":12,""1001-2000"":8,""8001-16000"":8}","{""16001-50000"":49,"">50000"":99,""<1000"":111,""2001-8000"":37,""1001-2000"":24,""8001-16000"":28}",11,787,"{""721-1080"":17,""361-720"":11,""61-360"":11,""<60"":15,"">1080"":23}","[49,42,48,48,47,48,44,44,39,32,34,32,36,31,32,36,40,37,36,38,49,45,46,46]",5,1,"{""010890101002"":1,""010730108041"":1,""010890020003"":2,""010890010001"":2,""010890025011"":3,""010890026001"":4,""280819505003"":1,""281059504004"":1,""010890103022"":1,""120990056011"":1,""010890109012"":2,""010890019021"":6,""010890013021"":4,""010890015004"":3,""010890108003"":1,""010890014022"":6,""281059501003"":1,""281059503001"":1,""010890007022"":3,""010890017001"":3,""010890107023"":1,""010890021002"":1,""010890009011"":1,""010890109013"":1,""010730120022"":1,""010890031003"":15,""011170303151"":1,""010890019011"":9,""010890030002"":2,""010890110221"":1,""011170305021"":1,""010890026003"":2,""010890025012"":3,""010730117034"":1,""010830208022"":1,""010890031002"":2,""010890112002"":1,""010210602001"":1,""010890002022"":1,""010890017002"":65,""281059506021"":1,""010890010003"":2,""010890106222"":1,""120990059182"":1,""010890110222"":1,""010890020001"":1,""010890101003"":1,""010890018013"":1,""010890021001"":1,""010890109021"":1,""010890108001"":1,""010770106005"":1,""281059506011"":1,""010030114032"":2,""010830209001"":1,""010890027222"":1,""010730128023"":1,""010890009021"":1,""010030114051"":1,""010030109031"":1,""010030103003"":1,""010890031001"":1,""010890021003"":1,""010030114062"":4,""010890106241"":1,""281059504003"":1,""010890018011"":10,""010890019031"":5,""010890027012"":1,""010730108054"":1,""010890106223"":2,""010890111001"":1,""010210603002"":1,""010890109011"":1,""010890019012"":2,""010890113001"":1,""010890028013"":3}",1,229,99,"{""481-540"":3,""541-600"":2,""46-60"":1,""721-840"":1,""1201-1320"":7,""301-360"":6,""<20"":18,""61-120"":10,""241-300"":5,""121-180"":2,""1321-1440"":2,""841-960"":1,""1081-1200"":1,""961-1080"":3,""601-660"":3,""181-240"":2,""661-720"":3}",78,"{""0-25"":16,""76-100"":44,""51-75"":11,""26-50"":7}",708,353,14328
010950308022,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,100,2481,"{""16001-50000"":11,""0"":19,"">50000"":11,""2001-8000"":40,""1-1000"":6,""1001-2000"":3,""8001-16000"":4}","{""16001-50000"":150,"">50000"":23,""<1000"":739,""2001-8000"":23,""1001-2000"":12,""8001-16000"":208}",17,703,"{""721-1080"":21,""361-720"":19,""61-360"":10,""<60"":24,"">1080"":26}","[62,64,64,63,65,67,54,48,37,37,34,33,30,34,32,33,35,43,50,56,58,56,56,57]",8,6,"{""010950306004"":1,""010950302023"":1,""011030054051"":1,""010950311002"":1,""010950309023"":1,""010499606003"":1,""121319506023"":2,""010950308022"":86,""121319506016"":2,""010950304013"":1,""010950307024"":1,""010950309041"":1,""010890019021"":2,""010950312001"":5,""010499607002"":1,""011150402013"":1,""010550102003"":1,""120050027043"":3,""010719509003"":1,""010950302022"":1,""010950308023"":2,""120050027051"":2,""471079701022"":1,""010890106221"":1,""010950306001"":1,""010950302011"":2,""011150405013"":1,""011150402041"":2,""010950312002"":16,""011030054042"":1,""010950301002"":2,""130459105011"":1,""010730001001"":1,""130459102001"":1,""010890109013"":2,""010950308013"":14,""010719508004"":1,""120050027041"":3,""010550110021"":3,""010730049022"":1,""010950308024"":1,""010950312004"":6,""010950312003"":1,""010550104012"":2,""010550110013"":1,""120860004111"":1,""010890027222"":1,""010950306002"":2,""010950304015"":1,""011030054041"":1,""010950309031"":8,""010950308021"":1,""010950302024"":1,""010950307011"":5,""010550110012"":2,""011150404013"":1,""130459103003"":1,""120050027032"":3,""010950307012"":5,""010950309022"":2,""010950307023"":1,""010719508003"":1,""010499608001"":2,""010950310003"":1,""011150402043"":1,""120860099063"":1,""010950309021"":4,""010950309043"":2,""010950308011"":1,""010950306003"":3,""120050027042"":1,""010950308025"":5,""010950309032"":6,""010499607001"":1}",1,199,132,"{""21-45"":8,""481-540"":6,""541-600"":4,""46-60"":3,""721-840"":3,""1201-1320"":4,""301-360"":3,""<20"":20,""61-120"":10,""241-300"":2,""121-180"":4,""421-480"":3,""1321-1440"":1,""841-960"":3,""961-1080"":2,""601-660"":1,""181-240"":3,""661-720"":1,""361-420"":2}",74,"{""0-25"":20,""76-100"":48,""51-75"":23,""26-50"":4}",661,350,5044
df = pd.read_csv(csv_file,
usecols=[
'origin_census_block_group',
'date_range_start',
'date_range_end',
'device_count',
'distance_traveled_from_home',
'completely_home_device_count',
'median_home_dwell_time',
'part_time_work_behavior_devices',
'full_time_work_behavior_devices'
],
dtype={'origin_census_block_group': str},
).set_index('origin_census_block_group')
and, later in the code, the dataframe is modified by:
df = df.groupby(lambda cbg: cbg[:5]).sum()
I don't quite understand what this line is doing precisely.
Groupby generally groups a dataframe by column, so...is it grouping the dataframe using multiple columns (0 to 5)? What is the effect of .sum() at the end?

If you run your code exactly as you wrote it (both the creation of df and the groupby) you can see the result. I print first couple of columns of the output of groupby
device_count distance_traveled_from_home
----- -------------- -----------------------------
01053 49 626
01073 139 2211
01089 78 1934
01095 100 2481
What happens here is the function lambda cbg: cbg[:5] is applied to each of the index values (strings that look like numbers in column origin_census_block_group). As a side, note the statement
...
dtype={'origin_census_block_group': str},
when creating the df, so somebody went into trouble to make sure they are actually str
So the function is applied to string like '010539707003' and returns a substring which is the first 5 characters of that string:
'010539707003'[:5]
produces
'01053'
so I assume there are multiple keys that share the first 5 characters (in the actual file -- the snippet has them all unique so not very interesting) and all these rows are grouped together
Then .sum() is applied to each numerical column of each group and returns, well, the column sum per each groupby key. This is what you see in my output in column 'device_count' and so on.
Hope this is clear now

Pandas' read_csv() will render a csv-formatted file a Pandas Dataframe.
I recommend having a ready at the Pandas' documentation, as it's very exhaustive -> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
usecols=[
'origin_census_block_group',
'date_range_start',
'date_range_end',
'device_count',
'distance_traveled_from_home',
'completely_home_device_count',
'median_home_dwell_time',
'part_time_work_behavior_devices',
'full_time_work_behavior_devices'
],
The usecols parameter will take as input an array of desired columns and will only load the specified columns into the dataframe.
dtype={'origin_census_block_group': str}
The dtype parameter will take a dict as input and is to specify the data type of the values, like {'column' : datatype}
.set_index('origin_census_block_group')
.set_index() will set the specificed column as the index column (ie: the first column). The usual index of Pandas' Dataframe is the row's index number, which appears as the first column of the dataframe. By setting the index, the first column now becomes the specified column. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
Panda's .groupby() function will take a dataframe a regroup it basing on the occurrences of he values from the specified column.
That is to say, if we a dataframe such as df =
Fruit Name Quality Count
Apple Marco High 4
Pear Lucia Medium 10
Apple Francesco Low 3
Banana Carlo Medium 6
Pear Timmy Low 7
Apple Roberto High 8
Banana Joe High 21
Banana Jack Low 3
Pear Rob Medium 5
Apple Louis Medium 6
Pear Jennifer Low 7
Pear Laura High 8
Performing a groupby operations, such as:
df = df.groupby(lambda x: x[:2]).sum()
Will take all the elements in the index, slice them from index 0 through index 2 and return the sum of all the corresponding values, ie:
Ap 21
Ba 30
Pe 37
Now, you might be wondering about that final .sum() method. If you try to print the dataframe without applying it, you'll likely get something like this:
<bound method GroupBy.sum of <pandas.core.groupby.generic.DataFrameGroupBy object at 0x109d260a0>>
This is because Pandas has created a groubpy object and does not yet now how to display it to you. Do you want to have it displayed by the number of the occurrences in the index? You'd do this:
df = df.groupby(lambda x: x[:2]).size()
And that would output:
Ap 4
Ba 3
Pe 5
Or maybe the sum of their respective summable values? (Which is what is done in the example)
df = df.groupby(lambda x: x[:2]).sum()
Which again, will output:
Ap 21
Ba 30
Pe 37
Notice it has taken the first two letters of the string in the index. Had it been x[:3], it would have taken the first three letters, of course.
Summing it up:
-> .groupby() takes the elements in the index, i.e. the first column of the dataframe and organises the dataframe in groups relating to the index
-> The input you have given to groubpy is an anonymous function, i.e. lambda function, slicing from index 0 through 5 of its mapped input
-> You may choose how to have the results of groubpy by appending the methos .sum() or .size() to a groubpy object
I also recommend reading about Python's lambda functions:
https://docs.python.org/3/reference/expressions.html

Related

python pandas how to organize similar group data

I want to organize similar group data. Here is my data frame
SKU
FATUT
GUYGE
FATUT-01
SUPAU
GUYPE
SUPAU-01
FATUT-02
GUYGE-01
my expected dataframe will be look like this:
SKU
FATUT
FATUT-01
FATUT-02
GUYGE
GUYGE-01
SUPAU
SUPAU-01
GUYPE
I want to organize similar group of data sequentially.
One option is to use groupby with the parameter sort=False; then concatenate the split DataFrames.
How it works:
Group df by the strings before the dash
groupby sorts by the groupby keys by default; when we specify sort=False, we make sure that the keys are stored in the same order as they first appear in df, i.e. "GUYPE" stays behind "SUPAU".
groupby object contains information about the groups that you can unpack like a dictionary. Then unpack it and build a generator expression that returns the grouped DataFrames.
Using concat, concatenate the split DataFrames into one; by using ignore_index=True, we ignore index coming from the split DataFrames and reset the index.
out = pd.concat((d for _, d in df.groupby(df['SKU'].str.split('-').str[0], sort=False)), ignore_index=True)
Output:
SKU
0 FATUT
1 FATUT-01
2 FATUT-02
3 GUYGE
4 GUYGE-01
5 SUPAU
6 SUPAU-01
7 GUYPE
But I feel like, for your task, sort_values might work as well, even if the orders are not exactly the same as in the desired output:
df = df.sort_values(by='SKU', ignore_index=True)
Output:
SKU
0 FATUT
1 FATUT-01
2 FATUT-02
3 GUYGE
4 GUYGE-01
5 GUYPE
6 SUPAU
7 SUPAU-01

Pandas group by values in list (in series)

I am trying to group by items in a list in DataFrame Series. The dataset being used is the Stack Overflow 2020 Survey.
The layout is roughly as follows:
... LanguageWorkedWith ... ConvertedComp ...
Respondent
1 Python;C 50000
2 C++;C 70000
I want to essentially want to use groupby on the unique values in the list of languages worked with, and apply the a mean aggregator function to the ConvertedComp like so...
LanguageWorkedWith
C++ 70000
C 60000
Python 50000
I have actually managed to achieve the desired output but my solution seems somewhat janky and being new to Pandas, I believe that there is probably a better way.
My solution is as follows:
# read csv
sos = pd.read_csv("developer_survey_2020/survey_results_public.csv", index_col='Respondent')
# seperate string into list of strings, disregarding unanswered responses
temp = sos["LanguageWorkedWith"].dropna().str.split(';')
# create new DataFrame with respondent index and rows populated withknown languages
langs_known = pd.DataFrame(temp.tolist(), index=temp.index)
# stack columns as rows, dropping old column names
stacked_responses = langs_known.stack().reset_index(level=1, drop=True)
# Re-indexing sos DataFrame to match stacked_responses dimension
# Concatenate reindex series to ConvertedComp series columnwise
reindexed_pays = sos["ConvertedComp"].reindex(stacked_responses.index)
stacked_with_pay = pd.concat([stacked_responses, reindexed_pays], axis='columns')
# Remove rows with no salary data
# Renaming columns
stacked_with_pay.dropna(how='any', inplace=True)
stacked_with_pay.columns = ["LWW", "Salary"]
# Group by LLW and apply median
lang_ave_pay = stacked_with_pay.groupby("LWW")["Salary"].median().sort_values(ascending=False).head()
Output:
LWW
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: Salary, dtype: float64
which matches the value calculated when choosing specific language: sos.loc[sos["LanguageWorkedWith"].str.contains('Perl').fillna(False), "ConvertedComp"].median()
Any tips on how to improve/functions that provide this functionality/etc would be appreciated!
In the target column only data frame, decompose the language name and combine it with the salary. The next step is to convert the data from horizontal format to vertical format using melt. Then we group the language names together to get the median. melt docs
lww = sos[["LanguageWorkedWith","ConvertedComp"]]
lwws = pd.concat([lww['ConvertedComp'], lww['LanguageWorkedWith'].str.split(';', expand=True)], axis=1)
lwws.reset_index(drop=True, inplace=True)
df_long = pd.melt(lwws, id_vars='ConvertedComp', value_vars=lwws.columns[1:], var_name='lang', value_name='lang_name')
df_long.groupby('lang_name')['ConvertedComp'].median().sort_values(ascending=False).head()
lang_name
Perl 76131.5
Scala 75669.0
Go 74034.0
Rust 74000.0
Ruby 71093.0
Name: ConvertedComp, dtype: float64

Custom Grouper object in pandas

How can I implement a custom pandas.Grouper class?
In other words - which methods/interfaces should the subclass of pandas.Grouper implement so that it can operate as an argument of DataFrame.groupby?
As a brief toy example: writing a grouper class that will group dataframe based on the first 5 symbols of some particular column of strings. Indeed this is easily achieved using existing methods:
df.groupby(df[column].str[:5])
But what I'm looking for is implementing some NameGrouper class, so that the operation is done by:
df.groupby(NameGrouper(column))
p.s.
There're quite some questions on SO about tailoring group-by to a particular needs (e.g. Pandas: Custom group-by function, Pandas customized group aggregation, Pandas groupby custom groups) with answers using standard pandas functionality. But what I'm interested in an extension of the default pandas functionality.
Important Note: In practice, you never need to subclass pandas.Grouper, and I would generally advise against doing so. It's API is internal and may change between versions without notice or warning.
Note: If you need to use custom logic in your groupby function, and no other option works for you, you can always fall back to:
def custom_fn(df):
return series # a column of group indices
(
df
.assign(custom_groups=custom_fn)
.groupby("custom_groups")
...
)
If at this point, you are still reading, then I assume that you are aware and have accepted that this is for educational purposes only :). In this case, you can create your own pandas.Grouper subclass by overwriting exactly one method: _get_grouper. It takes as input an NDFrame (aka df) and a boolean (validate) and - in pandas v1.5.3 - returns a (None, Grouper, NDFrame) triplet. (The None is for compatibility.)
For example, we can create a VowelGrouper, which groups the dataframe based on the number of vowels inside a given string column:
import pandas as pd
from pandas.core.groupby.grouper import get_grouper
class VowelCounter(pd.Grouper):
"""Group by number of vowels.
"""
def _get_grouper(self, df:pd.DataFrame, validate: bool = True):
groups = df[self.key].map(lambda x: sum(map(x.count, "aeiou")))
groups.name = "vowel_count"
grouper, _, _ = get_grouper(
df,
groups,
axis=self.axis,
level=self.level,
sort=self.sort,
validate=validate,
dropna=self.dropna,
)
return None, grouper, df
In the above, groups is a hashmap that maps the dataframe's index onto its group label. It can be pretty much anything that can be understood as a index_key->hash mapping, e.g., a dict, pd.Series (used above), callable, column name (str), etc. For a detailed list of which types of hashmaps are supported, I recommend having a look at pandas.core.groupby.grouper.get_grouper's source code (which is fairly easy to understand).
get_grouper is another internal function, which takes the hashmap we created and turns it into a pandas.core.groupby.ops.BaseGrouper. The BaseGrouper is the object that eventually creates the pandas.core.groupby.grouper.Grouping objects that make up the groupby selections. You can explore these at your leisure.
To use our newly created VowelGrouper, we simply pass it to Groupby as we would normally do. Given a dataframe
import string
silly_df = pd.DataFrame().from_dict({
"integer":np.arange(10, 0, -1),
"char":list("abcdefghij"),
"fruit":["apple", "bunana"] * 5,
"random_string": np.random.choice(list(string.ascii_lowercase), 100).view(dtype="<U10")
})
integer
char
fruit
random_string
0
10
a
apple
cujhtsutjj
1
9
b
bunana
vqxlxuyrao
2
8
c
apple
reocupqhvw
3
7
d
bunana
umahpirheg
4
6
e
apple
yxjdfyqjmp
5
5
f
bunana
tiolonmvjw
6
4
g
apple
jebzzhspsd
7
3
h
bunana
giorkxlyzq
8
2
i
apple
mvzkovpnmt
9
1
j
bunana
cokbxvqijo
we can group it by the fruit and the number of vowels in random_string and apply our usual aggregations on it
(
silly_df
.groupby(["fruit", VowelCounter("random_string")])
.agg({"integer": "mean", "char": "sum"})
.reset_index()
)
fruit
vowel_count
integer
char
0
apple
0
8
ae
1
apple
1
2
i
2
apple
3
6
cg
3
bunana
1
6.33333
bdh
4
bunana
2
1
j
5
bunana
5
5
f

Summing rows based on keyword within index

I am trying to sum multiple rows together based on a keyword that is part of the index - but it is not the entire index. For example, the index could look like
Count
1234_Banana_Green 43
4321_Banana_Yellow 34
2244_Banana_Brown 23
12345_Apple_Red 45
I would like to sum all of the rows that have the same "keyword" within them and create a total "banana" row. Is there a way to do this without searching for the keyword "banana"? For my purposes, this keyword changes every time and I would like to be able to automate this summing process. Any help is very much appreciated.
May be this:
df.groupby(df.index.to_series()
.str.split('_', expand=True)[1]
)['Count'].sum()
Output:
1
Apple 45
Banana 100
Name: Count, dtype: int64
Given the following dataframe:
raw_data = {'id': ['1234_Banana_Green', '4321_Banana_Yellow',
'2244_Banana_Brown', '12345_Apple_Red',
'1267_Apple_Blue']}
df = pd.DataFrame(raw_data).set_index(['id'])
Try this code:
df = df.reset_index()
df['extracted_keyword'] = df['id'].apply(lambda x: x.split('_')[1])
df.groupby(["extracted_keyword"]).count()
And gives:
id
extracted_keyword
Apple 2
Banana 3
if you want restore the index, add in the end:
df = df.set_index(['id'])

pandas dataframe throwing an empty list

I have a table where column names are not really organized like they have different years of data with different column numbers.
So I should access each data through specified column names.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it has the column name with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
When writing this dumb question, I was just a beginner not even knowing what I wanted ask.
The OP's question comes down to "getting the row as a list" since he ended his post asking
how to get numbers(though he said "number" maybe by mistake) of each row.
The answer is that he made a mistake of using double square brackets in his example and it caused problems.
The solution is to use df = df["2018/12"] instead of df= df[["2018/12"]]
As for things I(me at the time of writing this) mentioned, I will answer them one by one:
Let's say the table looks like this
Unnamed: 0 2018/12 country drives_right
0 US 809 United States True
1 AUS 731 Australia False
2 JAP 588 Japan False
3 IN 18 India False
4 RU 200 Russia True
5 MOR 70 Morocco True
6 EG 45 Egypt True
1>df = df[["2018/12"]]
: it will output a dataframe which only has the column "2018/12" and the index column on the left side.
2>df.iloc[0,0]
Now, since from 1> we have a new dataframe having only one column(except for index column mentioning index values) this will output the first element of the column.
In the example above, the outcome will be "809" since it's the first element of the column.
3>
But when I just want to extract numbers under that column, using
df.iloc[0,0]
-> doesn't make sense if you want to get extract numbers. It will just output one element
809 from the sub-dataframe you created using df = df[["2018/12"]].
it throws an error like
single positional indexer is out-of-bounds
Maybe you are confused about the outcome.(Maybe in this case "df" is the one before your df dataframe subset assignment?(df=df[["2018/12"]]) Since df = df[["2018/12"]] will output a dataframe so it will work fine.
3
So I am using
df.loc[0]
but it has the column name with the numeric data.
: Yes df.loc[0] from df = df[["2018/12"]] will return column name and the first element of that column.
4.
How can I extract just the number of each row?
You mean "numbers" of each row right?
Use this:
print(df["2018/12"].values.tolist())
In terms of finding varying names of columns or rows, and then access each rows and columns, you should think of using regex.

Categories