Custom Grouper object in pandas - python

How can I implement a custom pandas.Grouper class?
In other words - which methods/interfaces should the subclass of pandas.Grouper implement so that it can operate as an argument of DataFrame.groupby?
As a brief toy example: writing a grouper class that will group dataframe based on the first 5 symbols of some particular column of strings. Indeed this is easily achieved using existing methods:
df.groupby(df[column].str[:5])
But what I'm looking for is implementing some NameGrouper class, so that the operation is done by:
df.groupby(NameGrouper(column))
p.s.
There are quite a few questions on SO about tailoring group-by to particular needs (e.g. Pandas: Custom group-by function, Pandas customized group aggregation, Pandas groupby custom groups) with answers using standard pandas functionality. But what I'm interested in is an extension of the default pandas functionality.

Important Note: In practice, you never need to subclass pandas.Grouper, and I would generally advise against doing so. Its API is internal and may change between versions without notice or warning.
Note: If you need to use custom logic in your groupby function, and no other option works for you, you can always fall back to:
def custom_fn(df):
    return series  # a column of group indices
(
    df
    .assign(custom_groups=custom_fn)
    .groupby("custom_groups")
    ...
)
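As a concrete, runnable sketch of that fallback, here is the toy example from the question (grouping on the first 5 characters of a string column). The frame `people`, its columns, and the helper `first_five` are made up for illustration; only `assign` and `groupby` are the public pandas calls.

```python
import pandas as pd


def first_five(df: pd.DataFrame) -> pd.Series:
    # One group label per row: the first five characters of "name".
    return df["name"].str[:5]


# Hypothetical data, invented for this sketch.
people = pd.DataFrame({
    "name": ["alexander", "alexandra", "benjamin", "benjamina"],
    "score": [1, 2, 3, 4],
})

totals = (
    people
    .assign(custom_groups=first_five)  # assign calls first_five(people)
    .groupby("custom_groups")["score"]
    .sum()
)
print(totals)
```

Because `assign` accepts a callable, the grouping logic stays in an ordinary function and nothing internal to pandas is touched.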
If at this point, you are still reading, then I assume that you are aware and have accepted that this is for educational purposes only :). In this case, you can create your own pandas.Grouper subclass by overriding exactly one method: _get_grouper. It takes as input an NDFrame (aka df) and a boolean (validate) and - in pandas v1.5.3 - returns a (None, Grouper, NDFrame) triplet. (The None is for compatibility.)
For example, we can create a VowelCounter, which groups the dataframe based on the number of vowels inside a given string column:
import pandas as pd
from pandas.core.groupby.grouper import get_grouper

class VowelCounter(pd.Grouper):
    """Group by number of vowels.
    """
    def _get_grouper(self, df: pd.DataFrame, validate: bool = True):
        groups = df[self.key].map(lambda x: sum(map(x.count, "aeiou")))
        groups.name = "vowel_count"
        grouper, _, _ = get_grouper(
            df,
            groups,
            axis=self.axis,
            level=self.level,
            sort=self.sort,
            validate=validate,
            dropna=self.dropna,
        )
        return None, grouper, df
In the above, groups is a hashmap that maps the dataframe's index onto its group label. It can be pretty much anything that can be understood as an index_key->hash mapping, e.g., a dict, pd.Series (used above), callable, column name (str), etc. For a detailed list of which types of hashmaps are supported, I recommend having a look at pandas.core.groupby.grouper.get_grouper's source code (which is fairly easy to understand).
get_grouper is another internal function, which takes the hashmap we created and turns it into a pandas.core.groupby.ops.BaseGrouper. The BaseGrouper is the object that eventually creates the pandas.core.groupby.grouper.Grouping objects that make up the groupby selections. You can explore these at your leisure.
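To make the "pretty much anything" claim concrete without touching internals, here is a small sketch that passes three of the mapping types mentioned above (a Series, a dict, and a callable) to the public `DataFrame.groupby`. The frame and the helper `count_vowels` are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for this sketch.
df = pd.DataFrame(
    {"word": ["tree", "sky", "ocean", "moon"], "n": [1, 2, 3, 4]}
)


def count_vowels(text: str) -> int:
    return sum(text.count(v) for v in "aeiou")


# 1. A Series aligned with the frame's index.
as_series = df["word"].map(count_vowels)
# 2. A dict mapping index label -> group label.
as_dict = as_series.to_dict()
# 3. A callable, applied to each index label.
as_callable = lambda label: count_vowels(df.loc[label, "word"])

results = [
    df.groupby(key)["n"].sum()
    for key in (as_series, as_dict, as_callable)
]
for result in results:
    print(result.to_dict())
```

All three keys describe the same index-to-label mapping, so the three aggregations should agree.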
To use our newly created VowelCounter, we simply pass it to groupby as we would normally do. Given a dataframe
import string
import numpy as np

silly_df = pd.DataFrame.from_dict({
    "integer": np.arange(10, 0, -1),
    "char": list("abcdefghij"),
    "fruit": ["apple", "bunana"] * 5,
    # 100 random letters, viewed as 10 strings of 10 characters each
    "random_string": np.random.choice(list(string.ascii_lowercase), 100).view(dtype="<U10")
})
   integer char   fruit random_string
0       10    a   apple    cujhtsutjj
1        9    b  bunana    vqxlxuyrao
2        8    c   apple    reocupqhvw
3        7    d  bunana    umahpirheg
4        6    e   apple    yxjdfyqjmp
5        5    f  bunana    tiolonmvjw
6        4    g   apple    jebzzhspsd
7        3    h  bunana    giorkxlyzq
8        2    i   apple    mvzkovpnmt
9        1    j  bunana    cokbxvqijo
we can group it by the fruit and the number of vowels in random_string and apply our usual aggregations to it:
(
    silly_df
    .groupby(["fruit", VowelCounter("random_string")])
    .agg({"integer": "mean", "char": "sum"})
    .reset_index()
)
    fruit  vowel_count  integer char
0   apple            0  8       ae
1   apple            1  2       i
2   apple            3  6       cg
3  bunana            1  6.33333 bdh
4  bunana            2  1       j
5  bunana            5  5       f

Related

Python, lambda function as argument for groupby

I'm trying to figure out what a piece of code is doing, but I'm getting kinda lost on it.
I have a pandas dataframe, which has been loaded by the following .csv file:
origin_census_block_group,date_range_start,date_range_end,device_count,distance_traveled_from_home,bucketed_distance_traveled,median_dwell_at_bucketed_distance_traveled,completely_home_device_count,median_home_dwell_time,bucketed_home_dwell_time,at_home_by_each_hour,part_time_work_behavior_devices,full_time_work_behavior_devices,destination_cbgs,delivery_behavior_devices,median_non_home_dwell_time,candidate_device_count,bucketed_away_from_home_time,median_percentage_time_home,bucketed_percentage_time_home,mean_home_dwell_time,mean_non_home_dwell_time,mean_distance_traveled_from_home
010539707003,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,49,626,"{""16001-50000"":5,""0"":11,"">50000"":4,""2001-8000"":3,""1-1000"":9,""1001-2000"":7,""8001-16000"":1}","{""16001-50000"":110,"">50000"":155,""<1000"":40,""2001-8000"":237,""1001-2000"":27,""8001-16000"":180}",12,627,"{""721-1080"":11,""361-720"":9,""61-360"":1,""<60"":11,"">1080"":12}","[32,32,28,30,30,31,27,23,20,20,20,17,19,19,15,14,17,20,20,21,25,22,24,23]",7,3,"{""120330012011"":1,""010030107031"":1,""010030114052"":2,""120330038001"":1,""010539701003"":1,""010030108001"":1,""010539707002"":14,""010539705003"":2,""120330015001"":1,""121130102003"":1,""010539701002"":1,""120330040001"":1,""370350101014"":2,""120330033081"":2,""010030106003"":1,""010539706001"":2,""010539707004"":3,""120330039001"":1,""010539699003"":1,""120330030003"":1,""010539707003"":41,""010970029003"":1,""010539705004"":1,""120330009002"":1,""010539705001"":3,""010539704003"":1,""120330028012"":1,""120330035081"":1,""120330036102"":1,""120330036142"":1,""010030114062"":1,""010539706004"":7,""010539706002"":1,""120330036082"":1,""010539707001"":7,""010030102001"":1,""120330028011"":1}",2,241,71,"{""21-45"":4,""481-540"":2,""541-600"":1,""721-840"":1,""1201-1320"":1,""301-360"":3,""<20"":13,""61-120"":3,""241-300"":3,""121-180"":1,""421-480"":3,""1321-1440"":4,""1081-1200"":1,""961-1080"":2,""601-660"":1,""181-240"":1,""661-720"":2,""361-420"":3}",72,"{""0-25"":13,""76-100"":21,""51-75"":6,""26-50"":3}",657,413,1936
010730144081,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,139,2211,"{""16001-50000"":17,""0"":41,"">50000"":15,""2001-8000"":22,""1-1000"":8,""1001-2000"":12,""8001-16000"":24}","{""16001-50000"":143,"">50000"":104,""<1000"":132,""2001-8000"":39,""1001-2000"":15,""8001-16000"":102}",41,806,"{""721-1080"":32,""361-720"":16,""61-360"":12,""<60"":30,"">1080"":46}","[91,92,93,91,91,90,86,83,78,64,64,61,64,62,65,62,60,74,61,64,75,78,81,84]",8,6,"{""131350501064"":1,""131350502151"":1,""010730102002"":1,""011170302131"":2,""010730038024"":1,""010730108041"":1,""010730144133"":1,""010730132003"":1,""011210118002"":1,""011170303053"":1,""010730111084"":2,""011170302142"":1,""010730119011"":1,""010730129063"":2,""010730107063"":1,""010730059083"":1,""010730058003"":1,""011270204003"":1,""010730049012"":2,""130879701001"":1,""010730120021"":1,""130890219133"":1,""010730144082"":4,""170310301031"":1,""010730129112"":1,""010730024002"":1,""011170303034"":2,""481390616004"":1,""121270826052"":1,""010730128021"":2,""121270825073"":1,""010730004004"":1,""211959313002"":1,""010730100012"":1,""011170302151"":1,""010730142041"":1,""010730129123"":1,""010730129084"":1,""010730042002"":1,""010730059033"":2,""170318306001"":1,""130519800001"":1,""010730027003"":1,""121270826042"":1,""481610001002"":1,""010730100011"":1,""010730023032"":1,""350250004002"":1,""010730056003"":1,""010730132001"":1,""011170302171"":2,""120910227003"":1,""011239620001"":1,""130351503002"":1,""010730129155"":1,""010730001001"":2,""010730110021"":1,""170310104003"":1,""010730059082"":2,""010730120022"":1,""011170303151"":1,""010730139022"":1,""011170303441"":4,""010730144092"":3,""010730129151"":1,""011210119001"":2,""010730144081"":117,""010730108052"":1,""010730129122"":9,""370710321003"":1,""010730142034"":2,""010730042001"":2,""010570201003"":1,""010730144132"":6,""010730059032"":1,""010730012001"":2,""010730102003"":1,""011170303332"":1,""010730128032"":2,""010730129081"":1,""010730103011"":1,""0107
30058001"":3,""011150401041"":1,""010730045001"":3,""010730110013"":1,""010730119041"":1,""010730042003"":1,""010730141041"":1,""010730144091"":1,""010730129154"":1,""484759501002"":1,""010730144063"":1,""010730144102"":12,""011170303141"":1,""011250106011"":1,""011170303152"":1,""010730059104"":1,""010730107021"":1,""010730100014"":1,""010730008004"":1,""011170303451"":1,""010730127041"":2,""370559704003"":1,""010730047011"":2,""010730129132"":2,""011010014002"":1,""010730144131"":1,""011170302133"":1,""010730030011"":1,""131350506063"":1,""010730118023"":1,""010890110141"":1,""010730128023"":1,""010730106022"":2,""130879703004"":1,""010730108015"":1,""131390010041"":1,""011170305013"":1,""010730134002"":1,""010730031004"":1,""010730138012"":1,""010730011004"":1,""011250102041"":1,""010730129131"":4,""010730144101"":4,""011170303331"":2,""010730003001"":1,""011010033012"":1,""483539504004"":1,""010550104021"":1,""011170303411"":1,""010730106031"":1,""011170303153"":5,""010730128034"":1,""010730129061"":1,""131390010023"":1,""010730051042"":1,""130510107002"":1,""010730027001"":2,""120090686011"":1,""010730107042"":1,""010730123052"":1,""010730129102"":1,""011210115003"":1,""010730129083"":4,""011170303142"":1,""011010014001"":1,""010730107064"":2}",7,176,205,"{""21-45"":7,""481-540"":10,""541-600"":4,""46-60"":2,""721-840"":3,""1201-1320"":3,""301-360"":7,""<20"":46,""61-120"":6,""241-300"":4,""121-180"":9,""421-480"":2,""1321-1440"":3,""1081-1200"":5,""961-1080"":1,""601-660"":1,""181-240"":5,""661-720"":1,""361-420"":7}",78,"{""0-25"":29,""76-100"":71,""51-75"":27,""26-50"":8}",751,338,38937
010890017002,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,78,1934,"{""16001-50000"":2,""0"":12,"">50000"":9,""2001-8000"":27,""1-1000"":12,""1001-2000"":8,""8001-16000"":8}","{""16001-50000"":49,"">50000"":99,""<1000"":111,""2001-8000"":37,""1001-2000"":24,""8001-16000"":28}",11,787,"{""721-1080"":17,""361-720"":11,""61-360"":11,""<60"":15,"">1080"":23}","[49,42,48,48,47,48,44,44,39,32,34,32,36,31,32,36,40,37,36,38,49,45,46,46]",5,1,"{""010890101002"":1,""010730108041"":1,""010890020003"":2,""010890010001"":2,""010890025011"":3,""010890026001"":4,""280819505003"":1,""281059504004"":1,""010890103022"":1,""120990056011"":1,""010890109012"":2,""010890019021"":6,""010890013021"":4,""010890015004"":3,""010890108003"":1,""010890014022"":6,""281059501003"":1,""281059503001"":1,""010890007022"":3,""010890017001"":3,""010890107023"":1,""010890021002"":1,""010890009011"":1,""010890109013"":1,""010730120022"":1,""010890031003"":15,""011170303151"":1,""010890019011"":9,""010890030002"":2,""010890110221"":1,""011170305021"":1,""010890026003"":2,""010890025012"":3,""010730117034"":1,""010830208022"":1,""010890031002"":2,""010890112002"":1,""010210602001"":1,""010890002022"":1,""010890017002"":65,""281059506021"":1,""010890010003"":2,""010890106222"":1,""120990059182"":1,""010890110222"":1,""010890020001"":1,""010890101003"":1,""010890018013"":1,""010890021001"":1,""010890109021"":1,""010890108001"":1,""010770106005"":1,""281059506011"":1,""010030114032"":2,""010830209001"":1,""010890027222"":1,""010730128023"":1,""010890009021"":1,""010030114051"":1,""010030109031"":1,""010030103003"":1,""010890031001"":1,""010890021003"":1,""010030114062"":4,""010890106241"":1,""281059504003"":1,""010890018011"":10,""010890019031"":5,""010890027012"":1,""010730108054"":1,""010890106223"":2,""010890111001"":1,""010210603002"":1,""010890109011"":1,""010890019012"":2,""010890113001"":1,""010890028013"":3}",1,229,99,"{""481-540"":3,""541-600"":2,""46-60"":1,""721-840"":1,""1201-1320"":7,""301
-360"":6,""<20"":18,""61-120"":10,""241-300"":5,""121-180"":2,""1321-1440"":2,""841-960"":1,""1081-1200"":1,""961-1080"":3,""601-660"":3,""181-240"":2,""661-720"":3}",78,"{""0-25"":16,""76-100"":44,""51-75"":11,""26-50"":7}",708,353,14328
010950308022,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,100,2481,"{""16001-50000"":11,""0"":19,"">50000"":11,""2001-8000"":40,""1-1000"":6,""1001-2000"":3,""8001-16000"":4}","{""16001-50000"":150,"">50000"":23,""<1000"":739,""2001-8000"":23,""1001-2000"":12,""8001-16000"":208}",17,703,"{""721-1080"":21,""361-720"":19,""61-360"":10,""<60"":24,"">1080"":26}","[62,64,64,63,65,67,54,48,37,37,34,33,30,34,32,33,35,43,50,56,58,56,56,57]",8,6,"{""010950306004"":1,""010950302023"":1,""011030054051"":1,""010950311002"":1,""010950309023"":1,""010499606003"":1,""121319506023"":2,""010950308022"":86,""121319506016"":2,""010950304013"":1,""010950307024"":1,""010950309041"":1,""010890019021"":2,""010950312001"":5,""010499607002"":1,""011150402013"":1,""010550102003"":1,""120050027043"":3,""010719509003"":1,""010950302022"":1,""010950308023"":2,""120050027051"":2,""471079701022"":1,""010890106221"":1,""010950306001"":1,""010950302011"":2,""011150405013"":1,""011150402041"":2,""010950312002"":16,""011030054042"":1,""010950301002"":2,""130459105011"":1,""010730001001"":1,""130459102001"":1,""010890109013"":2,""010950308013"":14,""010719508004"":1,""120050027041"":3,""010550110021"":3,""010730049022"":1,""010950308024"":1,""010950312004"":6,""010950312003"":1,""010550104012"":2,""010550110013"":1,""120860004111"":1,""010890027222"":1,""010950306002"":2,""010950304015"":1,""011030054041"":1,""010950309031"":8,""010950308021"":1,""010950302024"":1,""010950307011"":5,""010550110012"":2,""011150404013"":1,""130459103003"":1,""120050027032"":3,""010950307012"":5,""010950309022"":2,""010950307023"":1,""010719508003"":1,""010499608001"":2,""010950310003"":1,""011150402043"":1,""120860099063"":1,""010950309021"":4,""010950309043"":2,""010950308011"":1,""010950306003"":3,""120050027042"":1,""010950308025"":5,""010950309032"":6,""010499607001"":1}",1,199,132,"{""21-45"":8,""481-540"":6,""541-600"":4,""46-60"":3,""721-840"":3,""1201-1320"":4,""301-360"":3,""<20"":20,""61-120"":10,""241-
300"":2,""121-180"":4,""421-480"":3,""1321-1440"":1,""841-960"":3,""961-1080"":2,""601-660"":1,""181-240"":3,""661-720"":1,""361-420"":2}",74,"{""0-25"":20,""76-100"":48,""51-75"":23,""26-50"":4}",661,350,5044
df = pd.read_csv(csv_file,
    usecols=[
        'origin_census_block_group',
        'date_range_start',
        'date_range_end',
        'device_count',
        'distance_traveled_from_home',
        'completely_home_device_count',
        'median_home_dwell_time',
        'part_time_work_behavior_devices',
        'full_time_work_behavior_devices'
    ],
    dtype={'origin_census_block_group': str},
).set_index('origin_census_block_group')
and, later in the code, the dataframe is modified by:
df = df.groupby(lambda cbg: cbg[:5]).sum()
I don't quite understand what this line is doing precisely.
Groupby generally groups a dataframe by column, so...is it grouping the dataframe using multiple columns (0 to 5)? What is the effect of .sum() at the end?
If you run your code exactly as you wrote it (both the creation of df and the groupby), you can see the result. Here are the first couple of columns of the groupby output:
device_count distance_traveled_from_home
----- -------------- -----------------------------
01053 49 626
01073 139 2211
01089 78 1934
01095 100 2481
What happens here is that the function lambda cbg: cbg[:5] is applied to each of the index values (strings that look like numbers, taken from the column origin_census_block_group). As an aside, note the statement
...
dtype={'origin_census_block_group': str},
when creating the df, so somebody went to the trouble of making sure they are actually str.
So the function is applied to a string like '010539707003' and returns a substring consisting of the first 5 characters of that string:
'010539707003'[:5]
produces
'01053'
So I assume there are multiple keys that share the first 5 characters (in the actual file; the snippet has them all unique, so it is not very interesting), and all these rows are grouped together.
Then .sum() is applied to each numerical column of each group and returns the column sum for each groupby key. This is what you see in my output in the column 'device_count' and so on.
Hope this is clear now.
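To see rows actually being merged, here is a small sketch in which two index values share a five-character prefix. The first three rows reuse values from the question's CSV; the fourth row ('010539707004' with a count of 10) is invented so that the '01053' group has more than one member.

```python
import pandas as pd

df = pd.DataFrame(
    {"device_count": [49, 139, 78, 10]},  # last value is invented
    index=pd.Index(
        ["010539707003", "010730144081", "010890017002", "010539707004"],
        name="origin_census_block_group",
    ),
)

# The function receives each index label and returns its group key.
by_lambda = df.groupby(lambda cbg: cbg[:5]).sum()

# The same grouping, written with the vectorised string accessor.
by_slice = df.groupby(df.index.str[:5]).sum()

print(by_lambda)
```

Both spellings produce the same groups; the lambda form is simply the more general one, since the function can do anything with the label.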
pandas' read_csv() loads a CSV-formatted file into a pandas DataFrame.
I recommend having a read of the pandas documentation, as it's very exhaustive -> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
usecols=[
    'origin_census_block_group',
    'date_range_start',
    'date_range_end',
    'device_count',
    'distance_traveled_from_home',
    'completely_home_device_count',
    'median_home_dwell_time',
    'part_time_work_behavior_devices',
    'full_time_work_behavior_devices'
],
The usecols parameter takes a list of the desired columns and loads only those columns into the dataframe.
dtype={'origin_census_block_group': str}
The dtype parameter takes a dict that specifies the data type of each column's values, like {'column': datatype}.
.set_index('origin_census_block_group')
.set_index() sets the specified column as the index (i.e. the row labels, displayed as the first column). The default index of a pandas DataFrame is the row number, which appears as the first column of the dataframe. After setting the index, the first column displayed is the specified column. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
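A small sketch of why the dtype argument matters here, using an in-memory CSV instead of a file. The two data rows reuse block-group ids from the question; the column `extra` is invented to show usecols dropping it.

```python
import io

import pandas as pd

csv_text = (
    "origin_census_block_group,device_count,extra\n"
    "010539707003,49,x\n"
    "010730144081,139,y\n"
)

with_str = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["origin_census_block_group", "device_count"],  # "extra" is skipped
    dtype={"origin_census_block_group": str},  # keep the leading zero
).set_index("origin_census_block_group")

# Without dtype, the ids are parsed as integers and lose the leading zero.
without_str = pd.read_csv(io.StringIO(csv_text))

print(with_str.index.tolist())
print(without_str["origin_census_block_group"].tolist())
```

With integer ids, a slice such as cbg[:5] would not even be possible, which is presumably why the original code forces str.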
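pandas' .groupby() function takes a dataframe and regroups it based on the values of the specified column (or, when it is given a function as here, on the result of applying that function to each index value).
That is to say, if we have a dataframe such as the following (with Fruit as the index), df =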
Fruit Name Quality Count
Apple Marco High 4
Pear Lucia Medium 10
Apple Francesco Low 3
Banana Carlo Medium 6
Pear Timmy Low 7
Apple Roberto High 8
Banana Joe High 21
Banana Jack Low 3
Pear Rob Medium 5
Apple Louis Medium 6
Pear Jennifer Low 7
Pear Laura High 8
Performing a groupby operation, such as:
df = df.groupby(lambda x: x[:2]).sum()
will take every element of the index, keep its first two characters (positions 0 and 1), group the rows by that prefix, and return the sum of the corresponding values, i.e.:
Ap 21
Ba 30
Pe 37
Now, you might be wondering about that final .sum() method. If you print the result without calling it (for example, by writing .sum without the parentheses), you'll get something like this:
<bound method GroupBy.sum of <pandas.core.groupby.generic.DataFrameGroupBy object at 0x109d260a0>>
This is because pandas has created a groupby object and does not yet know how you want it summarised. (Leaving .sum off entirely prints the DataFrameGroupBy object itself.) Do you want the number of occurrences of each key in the index? You'd do this:
df = df.groupby(lambda x: x[:2]).size()
And that would output:
Ap 4
Ba 3
Pe 5
Or maybe the sum of their respective summable values? (Which is what is done in the example)
df = df.groupby(lambda x: x[:2]).sum()
Which again, will output:
Ap 21
Ba 30
Pe 37
Notice it has taken the first two letters of the string in the index. Had it been x[:3], it would have taken the first three letters, of course.
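Here is a runnable sketch of the same idea, using only the first five rows of the table above and only the Count column, with Fruit assumed to be the index (so the totals are smaller than the ones shown for the full table).

```python
import pandas as pd

fruit = pd.DataFrame(
    {"Count": [4, 10, 3, 6, 7]},
    index=["Apple", "Pear", "Apple", "Banana", "Pear"],
)

# groupby alone only records how to split the rows; nothing is computed yet.
grouped = fruit.groupby(lambda name: name[:2])
print(type(grouped).__name__)

sizes = grouped.size()          # number of rows per two-letter prefix
sums = grouped["Count"].sum()   # total Count per two-letter prefix

print(sizes)
print(sums)
```

The groupby object can be reused for several aggregations, which is cheaper than regrouping each time.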
Summing it up:
-> .groupby(), when given a function, applies it to the elements of the index (displayed as the first column of the dataframe) and organises the rows into groups according to the result
-> The input you have given to groupby is an anonymous function, i.e. a lambda function, that keeps the first five characters (positions 0 through 4) of its input
-> You choose how the groups are summarised by appending a method such as .sum() or .size() to the groupby object
I also recommend reading about Python's lambda functions:
https://docs.python.org/3/reference/expressions.html

Sort string columns with numbers in it in Pandas

I want to order my table by a column. The column is a string that has numbers in it, for example ASH11, ASH2, ASH1, etc. The problem is that sort_values sorts character by character, so the values from the example are ordered like this --> ASH1, ASH11, ASH2. I want them ordered like this --> AS20H1, AS20H2, AS20H11 (taking into account the last number).
I thought about taking the last characters of the string, but sometimes that would be only the last one and in other cases the last two. The other way around (taking the characters from the beginning) doesn't work either, because the strings are not always the same length (i.e. in some cases the name is ASH1, ASGH22, ASHGT3, etc.)
Use the key parameter (new in pandas 1.1.0):
import re
df.sort_values(by=['xxx'], key=lambda col: col.map(lambda x: int(re.split(r'(\d+)', x)[-2])))
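A self-contained sketch of the key approach on the labels from the question. It uses re.search with an end-of-string anchor rather than re.split, and it assumes every label ends in at least one digit; the helper name `trailing_number` is made up.

```python
import re

import pandas as pd

df = pd.DataFrame({"label": ["ASH11", "ASH2", "ASH1", "ASGH22", "ASHGT3"]})


def trailing_number(col: pd.Series) -> pd.Series:
    # key receives the whole column and must return a same-length Series.
    # Assumes each label ends in digits; re.search would return None otherwise.
    return col.map(lambda s: int(re.search(r"(\d+)$", s).group(1)))


ordered = df.sort_values(by="label", key=trailing_number)
print(ordered["label"].tolist())
```

The key function only affects the ordering; the label column itself is left unchanged.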
Using list comprehension and regular expression:
>>> import pandas as pd
>>> import re #Regular expression
>>> a = pd.DataFrame({'label':['AS20H1','AS20H2','AS20H11','ASH1','ASGH22','ASHGT3']})
>>> a
label
0 AS20H1
1 AS20H2
2 AS20H11
3 ASH1
4 ASGH22
5 ASHGT3
r'(\d+)(?!.*\d)'
Matches the last number in a string
>>> a['sort_int'] = [ int(re.search(r'(\d+)(?!.*\d)',i).group(0)) for i in a['label']]
>>> a
label sort_int
0 AS20H1 1
1 AS20H2 2
2 AS20H11 11
3 ASH1 1
4 ASGH22 22
5 ASHGT3 3
>>> a.sort_values(by='sort_int',ascending=True)
label sort_int
0 AS20H1 1
3 ASH1 1
1 AS20H2 2
5 ASHGT3 3
2 AS20H11 11
4 ASGH22 22
You could extract the trailing integer from your column and then use it to sort your DataFrame. Convert it to a number first, otherwise the extracted digits are still compared as strings:
df["new_index"] = pd.to_numeric(df["yourColumn"].str.extract(r'(\d+)$', expand=False))
df.sort_values(by=["new_index"], inplace=True)
In case you get some NA values in your "new_index" column, you can use the na_position option of sort_values to choose where to put them (beginning or end).
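A short sketch of the na_position option, assuming the helper column holds the trailing number converted with pd.to_numeric. The value "none" is invented to produce a row without any digits.

```python
import pandas as pd

df = pd.DataFrame({"yourColumn": ["ASH11", "none", "ASH2", "ASH1"]})

# Rows without trailing digits get NaN in the helper column.
df["new_index"] = pd.to_numeric(
    df["yourColumn"].str.extract(r"(\d+)$", expand=False)
)

# na_position accepts "first" or "last" (the default).
ordered = df.sort_values(by="new_index", na_position="first")
print(ordered["yourColumn"].tolist())
```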

Filter Dataframe by using ~isin([list_of_substrings])

Given a dataframe full of emails, I want to filter out rows containing potentially blocked domain names or clearly fake emails. The dataframe below represents an example of my data.
>> print(df)
email number
1 fake#fake.com 2
2 real.email#gmail.com 1
3 no.email#email.com 5
4 real#yahoo.com 2
5 rich#money.com 1
I want to filter by two lists. The first list is fake_lst = ['noemail', 'noaddress', 'fake', ... 'no.email'].
The second list is just the set from disposable_email_domains import blocklist converted to a list (or kept as a set).
When I use df = df[~df['email'].str.contains('noemail')] it works fine and filters out that entry. Yet when I do df = df[~df['email'].str.contains(fake_lst)] I get TypeError: unhashable type: 'list'.
The obvious answer is to use df = df[~df['email'].isin(fake_lst)] as in many other stackoverflow questions, like Filter Pandas Dataframe based on List of substrings or pandas filtering using isin function but that ends up having no effect.
I suppose I could use str.contains('string') for each possible list entry, but that is ridiculously cumbersome.
Therefore, I need to filter this dataframe on the substrings in the two lists, so that every row whose email contains one of those substrings is removed.
In the example above, the dataframe after filtering would be:
>> print(df)
email number
2 real.email#gmail.com 1
4 real#yahoo.com 2
5 rich#money.com 1
Here is a potential solution assuming you have following df and fake_lst
df = pd.DataFrame({
'email': ['fake#fake.com', 'real.email#gmail.com', 'no.email#email.com',
'real#yahoo.com', 'rich#money.com'],
'number': [2, 1, 5, 2, 1]
})
fake_lst = ['fake', 'money']
Option 1:
Filter out rows that have any of the fake_lst words in email with apply:
df.loc[
~df['email'].apply(lambda x: any([i in x for i in fake_lst]))
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
Option 2:
Filter out without apply
df.loc[
[not any(i) for i in zip(*[df['email'].str.contains(word) for word in fake_lst])]
]
email number
1 real.email#gmail.com 1
2 no.email#email.com 5
3 real#yahoo.com 2
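A third variant, not part of the answer above: build one regular expression from the list and call str.contains once. This is a sketch under the assumption that the list entries should be matched literally, hence re.escape (an entry such as 'no.email' contains a dot, which would otherwise match any character).

```python
import re

import pandas as pd

df = pd.DataFrame({
    "email": ["fake#fake.com", "real.email#gmail.com", "no.email#email.com",
              "real#yahoo.com", "rich#money.com"],
    "number": [2, 1, 5, 2, 1],
})
fake_lst = ["fake", "money"]

# "fake|money": matches if any of the escaped substrings occurs.
pattern = "|".join(re.escape(word) for word in fake_lst)
filtered = df[~df["email"].str.contains(pattern, regex=True)]
print(filtered)
```

This keeps the same rows as the two options above while scanning the column only once.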
Use Series.isin to check whether each element of the column is contained in a list of values. isin tests for exact matches, not substrings, which is why it had no effect on the full addresses. Your fake list contains names without the domain, so you need str.split to drop the part you are not matching against.
Note: str.contains tests whether a single pattern or regex is contained within each string of a Series, so your code df['email'].str.contains('noemail') works fine, but passing a list does not.
df[~df['email'].str.split('#').str[0].isin(fake_lst)]
email number
0 fake#fake.com 2
1 real.email#gmail.com 1
3 real#yahoo.com 2
4 rich#money.com 1
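The same split-then-isin idea can be pointed at the domain part, which is how a domain blocklist would be applied. In this sketch `blocked_domains` is a hypothetical stand-in for the real blocklist set mentioned in the question, and '#' is the separator used in the example data.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["fake#fake.com", "real.email#gmail.com", "no.email#email.com",
              "real#yahoo.com", "rich#money.com"],
    "number": [2, 1, 5, 2, 1],
})

# Hypothetical blocklist, invented for this sketch.
blocked_domains = {"fake.com", "email.com"}

# Everything after the separator is the domain; isin needs exact matches.
domain = df["email"].str.split("#").str[1]
kept = df[~domain.isin(blocked_domains)]
print(kept)
```

Since isin does exact matching, it suits whole domains well, whereas the substring list still needs str.contains or a similar test.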

pandas settingwithcopywarning on groupby

While I generally understand the warnings, and many posts deal with this, I don't understand why I am getting a warning only when I reach the groupby line (the last one):
grouped = data.groupby(['group'])
for name, group in grouped:
    data2=group.loc[data['B-values'] > 0]
    data2["unique_A-values"]=data2.groupby(["A-values"])["A-values"].transform('count')
EDIT:
Here is my dataframe (data):
group A-values B-values
human 1 -1
human 1 5
human 1 4
human 3 4
human 2 10
bird 7 8
....
For B-values > 0 (data2=group.loc[data['B-values'] > 0]):
human has two A-values equal to one, one equals to 3 and one equals to 2 (data2["unique_A-values"]=data2.groupby(["A-values"])["A-values"].transform('count'))
You get the warning because you take a reference to a slice of your group and then try to add a column to it, so pandas is warning you that if your intention is to update the original df then this may or may not work.
If you are just modifying a local copy, then take a copy using copy() so it's explicit, and the warning will go away:
for name, group in grouped:
    data2=group.loc[data['B-values'] > 0].copy() # <- add .copy() here
    data2["unique_A-values"]=data2.groupby(["A-values"])["A-values"].transform('count')
FYI the pandas groupby user guide says:
Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results.
for name, group in grouped:
    # making a reference to the group chunk
    data2 = group.loc[data['B-values'] > 0]
    # trying to make a change to that group chunk reference
    data2["unique_A-values"] = data2.groupby(["A-values"])["A-values"].transform('count')
That said, it looks like you just want to count the values in the data frame so you may be better off using value_counts():
>>> data[data['B-values']>0].groupby('group')['A-values'].value_counts()
group A-values
bird 7 1
human 1 2
2 1
3 1
Name: A-values, dtype: int64
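If the per-row count column is what you are after (rather than the summary table), one possible loop-free sketch is to filter once, take an explicit copy, and group on both columns. The data is the sample from the question.

```python
import pandas as pd

data = pd.DataFrame({
    "group": ["human", "human", "human", "human", "human", "bird"],
    "A-values": [1, 1, 1, 3, 2, 7],
    "B-values": [-1, 5, 4, 4, 10, 8],
})

# Explicit copy: we intend to modify the filtered frame, not `data`.
positive = data[data["B-values"] > 0].copy()

# transform returns one value per row, aligned with `positive`.
positive["unique_A-values"] = (
    positive.groupby(["group", "A-values"])["A-values"].transform("count")
)
print(positive)
```

Grouping on both 'group' and 'A-values' replaces the outer loop over groups, and the original `data` is left untouched.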

Combining MultiIndex and Index in a PANDAS DataFrame

I'm trying to come up with a DataFrame to do some data analysis and would really benefit from having a data frame that can handle regular indexing and MultiIndexing together in one data frame.
For each patient, I have 6 slices of various types of data (T1avg, T2avg, etc...). Let's call this dataframe1 (from an ipython notebook):
import numpy
import pandas
dat0 = numpy.zeros([6])
dat1 = numpy.zeros([6])
pat0=(['NecS3Hs05']*6)
pat1=(['NecS3Hs06']*6)
slc = (['Slice ' + str(x) for x in range(dat0.shape[-1])])
ind = list(zip(*[pat0+pat1,slc+slc]))
named_ind = pandas.MultiIndex.from_tuples(ind, names = ['Patients','Slices'])
ser = pandas.Series(numpy.append(dat0,dat1),index = named_ind)
df = pandas.DataFrame(data=ser, columns=['T1avg'])
Image of output: df1
I also have, for each patient, various strings of information (tumour type, number of imaging sessions, treatment type):
pats = ['NecS3Hs05','NecS3Hs06']
tx = ['Control','Treated']
Ttype = ['subcutaneous','orthotopic']
NSessions = ['2','3']
cols = ['Tx Group', 'Tumour Type', 'Imaging Sessions']
dat = numpy.array([tx,Ttype,NSessions]).T
df2 = pandas.DataFrame(dat, index=pats,columns=cols)
[I'd like to post a picture here as well, but I need at least 10 reputation to do so]
Ideally, I want to have a dataframe that looks as follows (sketched it out in an image editor sorry)
Image of desired output: df-desired
But when I try to use the append command,
com = df.append(df2)
I get something undesired, the MultiIndex that I set up in df is now gone, replaced with a simple index of type tuples ('NecS3Hs05, Slice 0' etc...). The indices from df2 remain the same 'NecS3Hs05'.
Is this possible to do with PANDAS, or am I barking up the wrong tree here? Also, is this even a recommended way of storing Patient attributes in a dataframe (i.e. is this unpandas)? I think what I would really like is to keep everything a simple index, but instead store N-d arrays inside the elements of the data frame.
For instance, if I try something like:
com['NecS3Hs05','T1avg']
I want to get an array/tuple of shape/len 6
and when I try to get the tumour type:
com['NecS3Hs05','Tumour Type']
I get the string 'subcutaneous'. Obviously I also want to retain the cool features of data frames as well, it looks like PANDAS is the right way to go here, I just need to understand a bit more about how to set up my dataframe
I hope this is a sensible question, if not, I'd be happy to re-form it.
Your problem can be solved, I believe, if you drop the MultiIndex business. Imagine df only has the (non-unique) 'Patients' as its index. 'Slices' would become a simple column.
named_ind = pandas.Index(pat0+pat1, name='Patients')
df = pandas.DataFrame({'T1avg': numpy.append(dat0, dat1)}, index=named_ind)
df['Slice'] = slc + slc
If you had to select on the slice, you can still do that:
df[df['Slice']=='Slice 4']
Will give you Slice 4 for all patients. Note how this eliminates the need to have that row for all patients.
As long as your new dataframe (df2) defines the same index you can now join on that index quite simply:
df.join(df2)
and you'll get
T1avg Slice Tx Group Tumour Type Imaging Sessions
Patients
NecS3Hs05 0 Slice 0 Control subcutaneous 2
NecS3Hs05 0 Slice 1 Control subcutaneous 2
NecS3Hs05 0 Slice 2 Control subcutaneous 2
NecS3Hs05 0 Slice 3 Control subcutaneous 2
NecS3Hs05 0 Slice 4 Control subcutaneous 2
NecS3Hs05 0 Slice 5 Control subcutaneous 2
NecS3Hs06 0 Slice 0 Treated orthotopic 3
NecS3Hs06 0 Slice 1 Treated orthotopic 3
NecS3Hs06 0 Slice 2 Treated orthotopic 3
NecS3Hs06 0 Slice 3 Treated orthotopic 3
NecS3Hs06 0 Slice 4 Treated orthotopic 3
NecS3Hs06 0 Slice 5 Treated orthotopic 3
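For reference, a compact, self-contained sketch of the same join, shortened to two slices per patient and two attribute columns so that it runs on its own. The patient ids and attribute values are taken from the question; the layout is otherwise simplified.

```python
import numpy as np
import pandas as pd

patients = ["NecS3Hs05"] * 2 + ["NecS3Hs06"] * 2

# Per-slice data: non-unique patient index, slice as an ordinary column.
df = pd.DataFrame(
    {"T1avg": np.zeros(4), "Slice": ["Slice 0", "Slice 1"] * 2},
    index=pd.Index(patients, name="Patients"),
)

# Per-patient attributes: one row per patient.
df2 = pd.DataFrame(
    {"Tx Group": ["Control", "Treated"],
     "Tumour Type": ["subcutaneous", "orthotopic"]},
    index=["NecS3Hs05", "NecS3Hs06"],
)

# Index-on-index join: each slice row picks up its patient's attributes.
com = df.join(df2)
slice_1 = com[com["Slice"] == "Slice 1"]
print(com)
```

The per-patient attributes are repeated on every slice row, which costs some memory but keeps every selection a plain boolean filter.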
