Pandas Group By and Sorting by multiple columns - python
I have some initial data that looks like this:
code type value
1111 Golf Acceptable
1111 Golf Undesirable
1111 Basketball Acceptable
1111 Basketball Undesirable
1111 Basketball Undesirable
and I'm trying to group it on the code and type columns to get the row with the most occurrences. In the case of a tie, I want to select the row with the value Undesirable. So the example above would become this:
code type value
1111 Golf Undesirable
1111 Basketball Undesirable
Currently I'm doing it this way:
df = pd.DataFrame(df.groupby(['code', 'type', 'value']).size().reset_index(name='count'))
df = df.sort_values(['type', 'count'])
df = pd.DataFrame(df.groupby(['code', 'type']).last().reset_index())
I've done some testing of this and it seems to do what I want, but I don't really like trusting the .last() call and hoping that, in the case of a tie, Undesirable happened to be sorted last. Is there a better way to group this to ensure I always get the higher count, or in the case of a tie select the Undesirable value?
Performance isn't too much of an issue as I'm only working with around 50k rows or so.
Case 1
If the value column only contains two values, i.e. ['Acceptable', 'Undesirable'], then we can rely on the fact that 'Acceptable' < 'Undesirable' alphabetically. In that case you can use the following simplified solution.
Create an auxiliary column called count which contains the number of rows per code, type and value. Then sort the dataframe by count and value and drop the duplicates per code and type, keeping the last row.
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
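For reference, here is a minimal reconstruction of the question's sample frame (dtypes assumed), so the snippets in both cases can be run end to end:
import pandas as pd

df = pd.DataFrame({
    'code': [1111, 1111, 1111, 1111, 1111],
    'type': ['Golf', 'Golf', 'Basketball', 'Basketball', 'Basketball'],
    'value': ['Acceptable', 'Undesirable', 'Acceptable', 'Undesirable', 'Undesirable'],
})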
Case 2
If the value column contains other values and you can't rely on alphabetical ordering, use the following solution, which is similar to the one proposed in Case 1 but first converts the value column to an ordered Categorical type before sorting.
c = ['code', 'type']
df['count'] = df.groupby([*c, 'value'])['value'].transform('count')
df['value'] = pd.Categorical(df['value'], categories=['Acceptable', 'Undesirable'], ordered=True)
df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
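To see why the ordered Categorical matters, here is a small illustration; the third value ('Critical') is hypothetical, chosen so that the declared order differs from the alphabetical one:
import pandas as pd

s = pd.Series(pd.Categorical(
    ['Critical', 'Acceptable', 'Undesirable'],
    categories=['Acceptable', 'Undesirable', 'Critical'],  # hypothetical severity order
    ordered=True,
))
print(s.sort_values().tolist())
# ['Acceptable', 'Undesirable', 'Critical'] -- category order, not alphabetical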
Result
code type value count
1 1111 Golf Undesirable 1
4 1111 Basketball Undesirable 2
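If you want the output to match the question exactly, the helper column can be dropped afterwards, e.g.:
result = df.sort_values(['count', 'value']).drop_duplicates(c, keep='last')
result = result.drop(columns='count')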
Another possible solution, which is based on the following ideas:
Grouping the data by code and type.
If a group has more than one row (len(x) > 1) and its rows all have the same count ((x['count'] == x['count'].min()).all()), return the row with Undesirable.
Otherwise, return the row where the count is maximum (x.iloc[[x['count'].argmax()]]).
(df.groupby(['code', 'type', 'value'])['value'].size()
.reset_index(name='count').groupby(['code', 'type'])
.apply(lambda x: x.loc[x['value'] == 'Undesirable'] if
((len(x) > 1) and (x['count'] == x['count'].min()).all()) else
x.iloc[[x['count'].argmax()]])
.reset_index(drop=True)
.drop('count', axis=1))
Output:
code type value
0 1111 Basketball Undesirable
1 1111 Golf Undesirable
Related
Python, lambda function as argument for groupby
I'm trying to figure out what a piece of code is doing, but I'm getting kinda lost on it. I have a pandas dataframe, which has been loaded from the following .csv file:
origin_census_block_group,date_range_start,date_range_end,device_count,distance_traveled_from_home,bucketed_distance_traveled,median_dwell_at_bucketed_distance_traveled,completely_home_device_count,median_home_dwell_time,bucketed_home_dwell_time,at_home_by_each_hour,part_time_work_behavior_devices,full_time_work_behavior_devices,destination_cbgs,delivery_behavior_devices,median_non_home_dwell_time,candidate_device_count,bucketed_away_from_home_time,median_percentage_time_home,bucketed_percentage_time_home,mean_home_dwell_time,mean_non_home_dwell_time,mean_distance_traveled_from_home
010539707003,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,49,626,"{""16001-50000"":5,""0"":11,"">50000"":4,""2001-8000"":3,""1-1000"":9,""1001-2000"":7,""8001-16000"":1}","{""16001-50000"":110,"">50000"":155,""<1000"":40,""2001-8000"":237,""1001-2000"":27,""8001-16000"":180}",12,627,"{""721-1080"":11,""361-720"":9,""61-360"":1,""<60"":11,"">1080"":12}","[32,32,28,30,30,31,27,23,20,20,20,17,19,19,15,14,17,20,20,21,25,22,24,23]",7,3,"{""120330012011"":1,""010030107031"":1,""010030114052"":2,""120330038001"":1,""010539701003"":1,""010030108001"":1,""010539707002"":14,""010539705003"":2,""120330015001"":1,""121130102003"":1,""010539701002"":1,""120330040001"":1,""370350101014"":2,""120330033081"":2,""010030106003"":1,""010539706001"":2,""010539707004"":3,""120330039001"":1,""010539699003"":1,""120330030003"":1,""010539707003"":41,""010970029003"":1,""010539705004"":1,""120330009002"":1,""010539705001"":3,""010539704003"":1,""120330028012"":1,""120330035081"":1,""120330036102"":1,""120330036142"":1,""010030114062"":1,""010539706004"":7,""010539706002"":1,""120330036082"":1,""010539707001"":7,""010030102001"":1,""120330028011"":1}",2,241,71,"{""21-45"":4,""481-540"":2,""541-600"":1,""721-840"":1,""1201-1320"":1,""301-360"":3,""<20"":13,""61-120"":3,""241-300"":3,""121-180"":1,""421-480"":3,""1321-1440"":4,""1081-1200"":1,""961-1080"":2,""601-660"":1,""181-240"":1,""661-720"":2,""361-420"":3}",72,"{""0-25"":13,""76-100"":21,""51-75"":6,""26-50"":3}",657,413,1936
010730144081,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,139,2211,"{""16001-50000"":17,""0"":41,"">50000"":15,""2001-8000"":22,""1-1000"":8,""1001-2000"":12,""8001-16000"":24}","{""16001-50000"":143,"">50000"":104,""<1000"":132,""2001-8000"":39,""1001-2000"":15,""8001-16000"":102}",41,806,"{""721-1080"":32,""361-720"":16,""61-360"":12,""<60"":30,"">1080"":46}","[91,92,93,91,91,90,86,83,78,64,64,61,64,62,65,62,60,74,61,64,75,78,81,84]",8,6,"{""131350501064"":1,""131350502151"":1,""010730102002"":1,""011170302131"":2,""010730038024"":1,""010730108041"":1,""010730144133"":1,""010730132003"":1,""011210118002"":1,""011170303053"":1,""010730111084"":2,""011170302142"":1,""010730119011"":1,""010730129063"":2,""010730107063"":1,""010730059083"":1,""010730058003"":1,""011270204003"":1,""010730049012"":2,""130879701001"":1,""010730120021"":1,""130890219133"":1,""010730144082"":4,""170310301031"":1,""010730129112"":1,""010730024002"":1,""011170303034"":2,""481390616004"":1,""121270826052"":1,""010730128021"":2,""121270825073"":1,""010730004004"":1,""211959313002"":1,""010730100012"":1,""011170302151"":1,""010730142041"":1,""010730129123"":1,""010730129084"":1,""010730042002"":1,""010730059033"":2,""170318306001"":1,""130519800001"":1,""010730027003"":1,""121270826042"":1,""481610001002"":1,""010730100011"":1,""010730023032"":1,""350250004002"":1,""010730056003"":1,""010730132001"":1,""011170302171"":2,""120910227003"":1,""011239620001"":1,""130351503002"":1,""010730129155"":1,""010730001001"":2,""010730110021"":1,""170310104003"":1,""010730059082"":2,""010730120022"":1,""011170303151"":1,""010730139022"":1,""011170303441"":4,""010730144092"":3,""010730129151"":1,""011210119001"":2,""010730144081"":117,""010730108052"":1,""010730129122"":9,""370710321003"":1,""010730142034"":2,""010730042001"":2,""010570201003"":1,""010730144132"":6,""010730059032"":1,""010730012001"":2,""010730102003"":1,""011170303332"":1,""010730128032"":2,""010730129081"":1,""010730103011"":1,""010730058001"":3,""011150401041"":1,""010730045001"":3,""010730110013"":1,""010730119041"":1,""010730042003"":1,""010730141041"":1,""010730144091"":1,""010730129154"":1,""484759501002"":1,""010730144063"":1,""010730144102"":12,""011170303141"":1,""011250106011"":1,""011170303152"":1,""010730059104"":1,""010730107021"":1,""010730100014"":1,""010730008004"":1,""011170303451"":1,""010730127041"":2,""370559704003"":1,""010730047011"":2,""010730129132"":2,""011010014002"":1,""010730144131"":1,""011170302133"":1,""010730030011"":1,""131350506063"":1,""010730118023"":1,""010890110141"":1,""010730128023"":1,""010730106022"":2,""130879703004"":1,""010730108015"":1,""131390010041"":1,""011170305013"":1,""010730134002"":1,""010730031004"":1,""010730138012"":1,""010730011004"":1,""011250102041"":1,""010730129131"":4,""010730144101"":4,""011170303331"":2,""010730003001"":1,""011010033012"":1,""483539504004"":1,""010550104021"":1,""011170303411"":1,""010730106031"":1,""011170303153"":5,""010730128034"":1,""010730129061"":1,""131390010023"":1,""010730051042"":1,""130510107002"":1,""010730027001"":2,""120090686011"":1,""010730107042"":1,""010730123052"":1,""010730129102"":1,""011210115003"":1,""010730129083"":4,""011170303142"":1,""011010014001"":1,""010730107064"":2}",7,176,205,"{""21-45"":7,""481-540"":10,""541-600"":4,""46-60"":2,""721-840"":3,""1201-1320"":3,""301-360"":7,""<20"":46,""61-120"":6,""241-300"":4,""121-180"":9,""421-480"":2,""1321-1440"":3,""1081-1200"":5,""961-1080"":1,""601-660"":1,""181-240"":5,""661-720"":1,""361-420"":7}",78,"{
""0-25"":29,""76-100"":71,""51-75"":27,""26-50"":8}",751,338,38937 010890017002,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,78,1934,"{""16001-50000"":2,""0"":12,"">50000"":9,""2001-8000"":27,""1-1000"":12,""1001-2000"":8,""8001-16000"":8}","{""16001-50000"":49,"">50000"":99,""<1000"":111,""2001-8000"":37,""1001-2000"":24,""8001-16000"":28}",11,787,"{""721-1080"":17,""361-720"":11,""61-360"":11,""<60"":15,"">1080"":23}","[49,42,48,48,47,48,44,44,39,32,34,32,36,31,32,36,40,37,36,38,49,45,46,46]",5,1,"{""010890101002"":1,""010730108041"":1,""010890020003"":2,""010890010001"":2,""010890025011"":3,""010890026001"":4,""280819505003"":1,""281059504004"":1,""010890103022"":1,""120990056011"":1,""010890109012"":2,""010890019021"":6,""010890013021"":4,""010890015004"":3,""010890108003"":1,""010890014022"":6,""281059501003"":1,""281059503001"":1,""010890007022"":3,""010890017001"":3,""010890107023"":1,""010890021002"":1,""010890009011"":1,""010890109013"":1,""010730120022"":1,""010890031003"":15,""011170303151"":1,""010890019011"":9,""010890030002"":2,""010890110221"":1,""011170305021"":1,""010890026003"":2,""010890025012"":3,""010730117034"":1,""010830208022"":1,""010890031002"":2,""010890112002"":1,""010210602001"":1,""010890002022"":1,""010890017002"":65,""281059506021"":1,""010890010003"":2,""010890106222"":1,""120990059182"":1,""010890110222"":1,""010890020001"":1,""010890101003"":1,""010890018013"":1,""010890021001"":1,""010890109021"":1,""010890108001"":1,""010770106005"":1,""281059506011"":1,""010030114032"":2,""010830209001"":1,""010890027222"":1,""010730128023"":1,""010890009021"":1,""010030114051"":1,""010030109031"":1,""010030103003"":1,""010890031001"":1,""010890021003"":1,""010030114062"":4,""010890106241"":1,""281059504003"":1,""010890018011"":10,""010890019031"":5,""010890027012"":1,""010730108054"":1,""010890106223"":2,""010890111001"":1,""010210603002"":1,""010890109011"":1,""010890019012"":2,""010890113001"":1,""010890028013"":3}",1,229,99,"{""481-540"":3,""541-600"":2,""46-60"":1,""721-840"":1,""1201-1320"":7,""301-360"":6,""<20"":18,""61-120"":10,""241-300"":5,""121-180"":2,""1321-1440"":2,""841-960"":1,""1081-1200"":1,""961-1080"":3,""601-660"":3,""181-240"":2,""661-720"":3}",78,"{""0-25"":16,""76-100"":44,""51-75"":11,""26-50"":7}",708,353,14328 
010950308022,2020-06-25T00:00:00-05:00,2020-06-26T00:00:00-05:00,100,2481,"{""16001-50000"":11,""0"":19,"">50000"":11,""2001-8000"":40,""1-1000"":6,""1001-2000"":3,""8001-16000"":4}","{""16001-50000"":150,"">50000"":23,""<1000"":739,""2001-8000"":23,""1001-2000"":12,""8001-16000"":208}",17,703,"{""721-1080"":21,""361-720"":19,""61-360"":10,""<60"":24,"">1080"":26}","[62,64,64,63,65,67,54,48,37,37,34,33,30,34,32,33,35,43,50,56,58,56,56,57]",8,6,"{""010950306004"":1,""010950302023"":1,""011030054051"":1,""010950311002"":1,""010950309023"":1,""010499606003"":1,""121319506023"":2,""010950308022"":86,""121319506016"":2,""010950304013"":1,""010950307024"":1,""010950309041"":1,""010890019021"":2,""010950312001"":5,""010499607002"":1,""011150402013"":1,""010550102003"":1,""120050027043"":3,""010719509003"":1,""010950302022"":1,""010950308023"":2,""120050027051"":2,""471079701022"":1,""010890106221"":1,""010950306001"":1,""010950302011"":2,""011150405013"":1,""011150402041"":2,""010950312002"":16,""011030054042"":1,""010950301002"":2,""130459105011"":1,""010730001001"":1,""130459102001"":1,""010890109013"":2,""010950308013"":14,""010719508004"":1,""120050027041"":3,""010550110021"":3,""010730049022"":1,""010950308024"":1,""010950312004"":6,""010950312003"":1,""010550104012"":2,""010550110013"":1,""120860004111"":1,""010890027222"":1,""010950306002"":2,""010950304015"":1,""011030054041"":1,""010950309031"":8,""010950308021"":1,""010950302024"":1,""010950307011"":5,""010550110012"":2,""011150404013"":1,""130459103003"":1,""120050027032"":3,""010950307012"":5,""010950309022"":2,""010950307023"":1,""010719508003"":1,""010499608001"":2,""010950310003"":1,""011150402043"":1,""120860099063"":1,""010950309021"":4,""010950309043"":2,""010950308011"":1,""010950306003"":3,""120050027042"":1,""010950308025"":5,""010950309032"":6,""010499607001"":1}",1,199,132,"{""21-45"":8,""481-540"":6,""541-600"":4,""46-60"":3,""721-840"":3,""1201-1320"":4,""301-360"":3,""<20"":20,""61-120"":10,""241-300"":2,""121-180"":4,""421-480"":3,""1321-1440"":1,""841-960"":3,""961-1080"":2,""601-660"":1,""181-240"":3,""661-720"":1,""361-420"":2}",74,"{""0-25"":20,""76-100"":48,""51-75"":23,""26-50"":4}",661,350,5044
df = pd.read_csv(
    csv_file,
    usecols=[
        'origin_census_block_group',
        'date_range_start',
        'date_range_end',
        'device_count',
        'distance_traveled_from_home',
        'completely_home_device_count',
        'median_home_dwell_time',
        'part_time_work_behavior_devices',
        'full_time_work_behavior_devices'
    ],
    dtype={'origin_census_block_group': str},
).set_index('origin_census_block_group')
and, later in the code, the dataframe is modified by:
df = df.groupby(lambda cbg: cbg[:5]).sum()
I don't quite understand what this line is doing precisely. Groupby generally groups a dataframe by column, so... is it grouping the dataframe using multiple columns (0 to 5)? What is the effect of .sum() at the end?
If you run your code exactly as you wrote it (both the creation of df and the groupby) you can see the result. I print the first couple of columns of the groupby output:
       device_count  distance_traveled_from_home
-----  ------------  ---------------------------
01053            49                          626
01073           139                         2211
01089            78                         1934
01095           100                         2481
What happens here is that the function lambda cbg: cbg[:5] is applied to each of the index values (strings that look like numbers, from the column origin_census_block_group). As an aside, note the statement dtype={'origin_census_block_group': str} when creating the df: somebody went to the trouble of making sure they are actually str.
So the function is applied to a string like '010539707003' and returns a substring which is the first 5 characters of that string: '010539707003'[:5] produces '01053'. I assume there are multiple keys that share the first 5 characters in the actual file (the snippet has them all unique, so not very interesting), and all of those rows are grouped together.
Then .sum() is applied to each numerical column of each group and returns, well, the column sum per groupby key. This is what you see in my output in the column 'device_count' and so on. Hope this is clear now.
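A minimal sketch of the same mechanism, using only the device_count values from the snippet (everything else trimmed away):
import pandas as pd

df = pd.DataFrame(
    {'device_count': [49, 139, 78, 100]},
    index=['010539707003', '010730144081', '010890017002', '010950308022'],
)
print(df.groupby(lambda cbg: cbg[:5]).sum())
#        device_count
# 01053            49
# 01073           139
# 01089            78
# 01095           100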
Pandas' read_csv() will render a csv-formatted file as a Pandas DataFrame. I recommend having a read of the Pandas documentation, as it's very exhaustive: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
usecols=[
    'origin_census_block_group',
    'date_range_start',
    'date_range_end',
    'device_count',
    'distance_traveled_from_home',
    'completely_home_device_count',
    'median_home_dwell_time',
    'part_time_work_behavior_devices',
    'full_time_work_behavior_devices'
],
The usecols parameter takes as input an array of desired columns and will only load the specified columns into the dataframe.
dtype={'origin_census_block_group': str}
The dtype parameter takes a dict as input and specifies the data type of the values, like {'column': datatype}.
.set_index('origin_census_block_group')
.set_index() will set the specified column as the index column (i.e. the first column). The usual index of a Pandas DataFrame is the row's index number, which appears as the first column of the dataframe. By setting the index, the first column now becomes the specified column. See: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html
Pandas' .groupby() function will take a dataframe and regroup it based on the occurrences of the values from the specified column (or, when given a function, of the index labels). That is to say, if we have a dataframe such as:
df =
Fruit   Name       Quality  Count
Apple   Marco      High     4
Pear    Lucia      Medium   10
Apple   Francesco  Low      3
Banana  Carlo      Medium   6
Pear    Timmy      Low      7
Apple   Roberto    High     8
Banana  Joe        High     21
Banana  Jack       Low      3
Pear    Rob        Medium   5
Apple   Louis      Medium   6
Pear    Jennifer   Low      7
Pear    Laura      High     8
performing a groupby operation such as:
df = df.groupby(lambda x: x[:2]).sum()
will take all the elements in the index, keep their first two characters, and return the sum of all the corresponding values, i.e.:
Ap 21
Ba 30
Pe 37
Now, you might be wondering about that final .sum() method. If you stop at the groupby (or reference .sum without calling it), you'll likely get something like this:
<bound method GroupBy.sum of <pandas.core.groupby.generic.DataFrameGroupBy object at 0x109d260a0>>
This is because Pandas has created a groupby object and does not yet know how to display it to you. Do you want it displayed by the number of occurrences in the index? You'd do this:
df = df.groupby(lambda x: x[:2]).size()
And that would output:
Ap 4
Ba 3
Pe 5
Or maybe the sum of their respective summable values? (Which is what is done in the example.)
df = df.groupby(lambda x: x[:2]).sum()
Which again will output:
Ap 21
Ba 30
Pe 37
Notice it has taken the first two letters of each string in the index. Had it been x[:3], it would have taken the first three letters, of course.
Summing it up:
-> .groupby() takes the elements in the index, i.e. the first column of the dataframe, and organises the dataframe in groups relating to the index
-> The input you have given to groupby is an anonymous (lambda) function that takes the first five characters (indices 0 through 4) of its mapped input
-> You may choose how to display the results of groupby by appending a method such as .sum() or .size() to the groupby object
I also recommend reading about Python's lambda functions: https://docs.python.org/3/reference/expressions.html
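As a runnable sketch of the fruit example above (note that the answer implicitly assumes Fruit is the index, since the grouping function is applied to index labels):
import pandas as pd

df = pd.DataFrame({
    'Fruit': ['Apple', 'Pear', 'Apple', 'Banana', 'Pear', 'Apple',
              'Banana', 'Banana', 'Pear', 'Apple', 'Pear', 'Pear'],
    'Count': [4, 10, 3, 6, 7, 8, 21, 3, 5, 6, 7, 8],
}).set_index('Fruit')

print(df.groupby(lambda x: x[:2]).sum())   # Ap 21, Ba 30, Pe 37
print(df.groupby(lambda x: x[:2]).size())  # Ap 4,  Ba 3,  Pe 5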
Filling columns based on other dataframe columns
I have two data sets:
df1 = pd.DataFrame({"skuid": ("A","B","C","D"), "price": (0,0,0,0)})
df2 = pd.DataFrame({"skuid": ("A","B","C","D"), "salesprice": (10,0,0,30), "regularprice": (9,10,0,2)})
I want to insert the sales price and regular price into price with these conditions: if df1 skuid and df2 skuid match and df2 salesprice is not zero, use salesprice as the price value. If the skuids match and df2 salesprice is zero, use regularprice. Otherwise use zero as the price value.
def pric(df1, df2):
    if (df1['skuid'] == df2['skuid'] and salesprice != 0):
        price = salesprice
    elif (df1['skuid'] == df2['skuid'] and regularprice != 0):
        price = regularprice
    else:
        price = 0
I made a function with similar conditions but it's not working. The result should look like this in df1:
skuid  price
A      10
B      10
C      0
D      30
Thanks.
So there are a number of issues with the function given above. Here are a few, in no particular order:
- Indentation in Python matters: https://docs.python.org/2.0/ref/indentation.html
- Vectorized functions versus loops. The function you give looks vaguely like it expects to be applied on a vectorized basis, but Python doesn't work like that. You need to loop through the rows you want to look at (https://wiki.python.org/moin/ForLoop). While there is support for column transformations in Python (which work without loops), they need to be invoked specifically (here's some documentation for one instance of such functionality: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transform.html).
- Relatedly, accessing dataframe elements and indexing: Indexing Pandas data frames: integer rows, named columns
- Return: if you want your Python function to give you a result, you should have it return the value. Not all programming languages require this (Julia), but in Python you should/must.
- Generality. This isn't strictly necessary in a one-off application, but your function is vulnerable to breaking if you change, for example, the column names in the dataframe. It is better practice to let the user give the relevant names in the input, for this reason and for simple flexibility.
Here is a version of your function which was more or less minimally changed to fix the above specific issues:
import pandas as pd

df1 = pd.DataFrame({"skuid": ("A","B","C","D"), "price": (0,0,0,0)})
df2 = pd.DataFrame({"skuid": ("A","B","C","D"), "salesprice": (10,0,0,30), "regularprice": (9,10,0,2)})

def pric(df1, df2, id_colname, df1_price_colname, df2_salesprice_colname, df2_regularprice_colname):
    for i in range(df1.shape[0]):
        for j in range(df2.shape[0]):
            if (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_salesprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_salesprice_colname]
                break
            elif (df1.loc[df1.index[i], id_colname] == df2.loc[df2.index[j], id_colname]
                    and df2.loc[df2.index[j], df2_regularprice_colname] != 0):
                df1.loc[df1.index[i], df1_price_colname] = df2.loc[df2.index[j], df2_regularprice_colname]
                break
    return df1
for which entering
df1_imputed = pric(df1, df2, 'skuid', 'price', 'salesprice', 'regularprice')
print(df1_imputed['price'])
gives
0    10
1    10
2     0
3    30
Name: price, dtype: int64
Notice how the function loops through row indices before checking equality conditions on specific elements given by a row-index / column pair. A few things to consider: Why does the code loop through df1 "above" the loop through df2? Relatedly, what purpose does the break statement serve? Why was the else condition omitted? What is df1.loc[df1.index[i], id_colname] all about? (Hint: check one of the above links.)
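For contrast, here is a hedged sketch of the vectorized route the answer alludes to, using a merge plus np.where instead of nested loops (it assumes the same column names as the question and that each skuid appears at most once in df2):
import numpy as np
import pandas as pd

merged = df1[['skuid']].merge(df2, on='skuid', how='left')
# salesprice if non-zero, else regularprice; unmatched skuids fall back to 0
df1['price'] = np.where(
    merged['salesprice'].fillna(0) != 0,
    merged['salesprice'],
    merged['regularprice'].fillna(0),
)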
Assign value to dataframe from another dataframe based on two conditions
I am trying to assign values from a column in df2['values'] to a column df1['values']. However, values should only be assigned if:
df2['category'] is equal to df1['category'] (rows are part of the same category)
df1['date'] is in df2['date_range'] (date is in a certain range for a specific category)
So far I have this code, which works but is far from efficient, since it takes me two days to process the two dfs (df1 has ca. 700k rows):
for i in df1.category.unique():
    for j in df2.category.unique():
        if i == j:  # matching categories
            for ia, ra in df1.loc[df1['category'] == i].iterrows():
                for ib, rb in df2.loc[df2['category'] == j].iterrows():
                    if df1['date'][ia] in df2['date_range'][ib]:
                        df1.loc[ia, 'values'] = rb['values']
                        break
I read that I should try to avoid using for-loops when working with dataframes. List comprehensions are great; however, since I do not have a lot of experience yet, I struggle to formulate more complicated code. How can I iterate over this problem more efficiently? What essential key aspects should I think about when iterating over dataframes with conditions? The code above tends to skip some rows or assign them wrongly, so I need to do a cleanup afterwards. And the biggest problem is that it is really slow. Thank you.
Some df1 insight:
df1.head()
        date category
0 2015-01-07       f2
1 2015-01-26       f2
2 2015-01-26       f2
3 2015-04-08       f2
4 2015-04-10       f2
Some df2 insight:
df2.date_range[0]
DatetimeIndex(['2011-11-02', '2011-11-03', '2011-11-04', '2011-11-05',
               '2011-11-06', '2011-11-07', '2011-11-08', '2011-11-09',
               '2011-11-10', '2011-11-11', '2011-11-12', '2011-11-13',
               '2011-11-14', '2011-11-15', '2011-11-16', '2011-11-17',
               '2011-11-18'],
              dtype='datetime64[ns]', freq='D')
df2's other two columns:
df2[['values','category']].head()
  values category
0     01       f1
1     02       f1
2    2.1       f1
3    2.2       f1
4     03       f1
Edit: corrected erroneous code and added OP input from a comment.
Alright, so if you want to join the dataframes on similar categories, you can merge them:
import pandas as pd

df3 = df1.merge(df2, on="category")
Next, since date is a timestamp and the "date_range" is actually generated from two columns, per OP's comment, we rather use:
mask = (df3["startdate"] <= df3["date"]) & (df3["date"] <= df3["enddate"])
subset = df3.loc[mask]
Now we get back to df1 and merge on the common dates while keeping all the values from df1. This will create NaNs for the subset values where they didn't match with df1 in the earlier merge. As such, we set df1["values"] where the entries in common are not NaN and leave them be otherwise:
common_dates = df1.merge(subset, on="date", how="left")  # keeping df1 values
df1["values"] = np.where(common_dates["values_y"].notna(), common_dates["values_y"], df1["values"])
N.B.: if more than one df1["date"] matches the date range, you'll have to drop some values, otherwise the duplicates will mess things up.
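One way to guard against the duplicate case mentioned in the note is to keep a single match per date before the second merge (a sketch, assuming keeping the first matching range per date is acceptable):
subset = subset.drop_duplicates(subset='date', keep='first')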
You could accomplish the first point:
1. df2['category'] is equal to the df1['category']
with the use of a join. You could then use a for loop to filter out the data points from df1['date'] inside the merged dataframe that are not contemplated in df2['date_range']. Unfortunately I would need more information about the content of df1['date'] and df2['date_range'] to write code here that would do exactly that.
Combine paired rows after pandas groupby, give NaN value if ID didn't occur twice in df
I have a single dataframe containing an ID column id, and I know that the ID will exist in either exactly one row ('mismatched') or two rows ('matched') in the dataframe. In order to select the mismatched rows and the pairs of matched rows I can use a groupby on the ID column. Now for each group, I want to take some columns from the second (pair) row, rename them, and copy them to the first row. I can then discard all the second rows and return a single dataframe containing all the modified first rows (for each and every group). Where there is no second row (mismatched) it's fine to put NaN in its place. To illustrate, see the table below; id=1 and 3 are matched pairs, but id=2 is mismatched:
entity  id  partner  value
A       1   B        200
B       1   A        300
A       2   B        600
B       3   C        350
C       3   B        200
The resulting transformation should leave me with the following:
entity  id  partner  entity_value  partner_value
A       1   B        200           300
A       2   B        600           NaN
B       3   C        350           200
What's baffling me is how to come up with a generic way of getting the matching partner_value from row 2, copied into row 1 after the groupby, in a way that also works when there is no matching id.
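For reference, a minimal reconstruction of the sample frame (dtypes assumed), so the answers below can be run as-is:
import pandas as pd

df = pd.DataFrame({
    'entity':  ['A', 'B', 'A', 'B', 'C'],
    'id':      [1, 1, 2, 3, 3],
    'partner': ['B', 'A', 'B', 'C', 'B'],
    'value':   [200, 300, 600, 350, 200],
})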
Solution (this was tricky):
import numpy as np

dfg = df.groupby('id', sort=False)

# Create 'entity', 'id', 'partner', 'entity_value' from the first row...
df2 = dfg[['entity', 'id', 'partner', 'value']].first().rename(columns={'value': 'entity_value'})

# Now insert 'partner_value' from those groups that have a second row...
df2['partner_value'] = np.nan
df2['partner_value'] = dfg['value'].nth(n=1)

   entity  id partner  entity_value  partner_value
id
1       A   1       B           200          300.0
2       A   2       B           600            NaN
3       B   3       C           350          200.0
(Note that in recent pandas versions .nth() returns rows keyed by the original index rather than the group keys, so the alignment trick above may need adapting.)
This was tricky to get working. The short answer is that although pd.groupby(...).agg(...) in principle allows you to specify a list of tuples of (column, aggregate_function), and you could then chain those into a rename, that won't work here, since we're trying to do two separate aggregate operations on the same value column and rename both of their results (you get pandas.core.base.SpecificationError: Function names must be unique, found multiple named value).
Other complications:
We can't directly use groupby.nth(n), which sounds useful at first glance, except that it silently drops groups which don't have an n'th element, which is not what we want. (But it does keep the index, so we can use it by first initializing the column as all-NaNs, then selectively inserting into that column, as above.)
In any case, the pd.groupby.agg() syntax won't even let you call nth() by just passing 'nth' as the agg_func name, since nth() is missing its n argument; you'd have to declare a lambda.
I tried defining the following function second_else_nan to use inside an agg() as above, but after much struggling I couldn't get it to work, for multiple reasons, only one of which is that you can't do two aggs on the same column:
def second_else_nan(v):
    if v.size == 2:
        return v[1]
    else:
        return np.nan
(i.e. the Series equivalent of the dict.get(key, default) builtin)
I would do it that way. First, get the first value per group:
df_grouped = df.reset_index().groupby('id').agg("first")
Then retrieve the values that are duplicated and insert them:
df_grouped["partner_value"] = df.groupby("id")["value"].agg("last")
The only thing is that you get a repeated value where the id is not duplicated (instead of a NaN).
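A possible patch for that caveat (a sketch: blank out the partner value for groups of size one; assumes numpy is available):
import numpy as np

sizes = df.groupby('id')['value'].size()
df_grouped.loc[sizes == 1, 'partner_value'] = np.nan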
What about something like this?
grouped = df.groupby("id")
first_values = grouped.agg("first")
sums = grouped.agg("sum")
first_values["partner_value"] = sums["value"] - first_values["value"]
first_values["partner_value"].replace(0, np.nan, inplace=True)
transformed_df = first_values.copy()
Group the data by id, take the first row, take the sum of the 'value' column for each group, and from this subtract 'value' from the first row. Then replace 0's in the resulting column with np.nan (making the assumption here that data in the 'value' column is never 0).
Drop duplicates keeping the row with the highest value in another column
a = [['John', 'Mary', 'John'], [10, 22, 50]]
df1 = pd.DataFrame(list(zip(*a)), columns=['Name', 'Count'])
(Note: the rows of a are transposed with zip so that the two column names line up; passing a directly would raise an error.)
Given a data frame like this, I want to compare all similar string values of "Name" against the "Count" value to determine the highest. I'm not sure how to do this with a dataframe in Python. Ex: in the case above the answer would be:
Name  Count
Mary  22
John  50
The lower value (John 10) has been dropped; I only want to see the highest value of "Count" for each value of "Name". In SQL it would be something like a SELECT CASE query (wherein I select the case where Name == Name and Count > Count recursively to determine the highest number), or a for loop for each name, but as I understand it, loops over DataFrames are a bad idea due to the nature of the object.
Is there a way to do this with a DF in Python? I could create a new data frame for each variable (one with only John) and then get the highest value (df.value()[:1] or similar), but as I have many hundreds of unique entries, that seems like a terrible solution. :D
Either sort_values and drop_duplicates:
df1.sort_values('Count').drop_duplicates('Name', keep='last')
   Name  Count
1  Mary     22
2  John     50
Or, like miradulo said, groupby and max:
df1.groupby('Name')['Count'].max().reset_index()
   Name  Count
0  John     50
1  Mary     22
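A third variant (not from the answer) that keeps whole rows, which matters once the frame has more columns than just Name and Count: idxmax returns the row label of each group's maximum, and .loc selects those rows:
df1.loc[df1.groupby('Name')['Count'].idxmax()]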