So I have data similar to this:
import pandas as pd
df = pd.DataFrame({'Order ID':[555,556,557,558,559,560,561,562,563,564,565,566],
'State':["MA","MA","MA","MA","MA","MA","CT","CT","CT","CT","CT","CT"],
'County':["Essex","Essex","Essex","Worcester","Worcester","Worcester","Bristol","Bristol","Bristol","Hartford","Hartford","Hartford"],
'AP':[50,50,75,100,100,125,150,150,175,200,200,225]})
but I need to add a column that shows the mode of AP grouped by State and County. I can get the mode this way:
(df.groupby(['State', 'County'])
   .AP.agg(Mode=lambda x: x.value_counts().index[0])
   .reset_index()
   .round(0))
I'm just not sure how I can get that data added to the original data so that it looks like this:
Order ID  State  County      AP   Mode
555       MA     Essex        50    50
556       MA     Essex        50    50
557       MA     Essex        75    50
558       MA     Worcester   100   100
559       MA     Worcester   100   100
560       MA     Worcester   125   100
561       CT     Bristol     150   150
562       CT     Bristol     150   150
563       CT     Bristol     175   150
564       CT     Hartford    200   200
565       CT     Hartford    200   200
566       CT     Hartford    225   200
Use GroupBy.transform for new column:
df['Mode'] = (df.groupby(['State', 'County']).AP
                .transform(lambda x: x.value_counts().index[0]))
Or Series.mode:
df['Mode'] = df.groupby(['State', 'County']).AP.transform(lambda x: x.mode().iat[0])
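If you prefer to keep the aggregation you already have, you can also merge it back onto the original frame on the grouping keys. A minimal sketch reusing the agg call from the question:

import pandas as pd

# Compute the per-group mode once, then left-merge it onto df on the grouping keys.
modes = (df.groupby(['State', 'County'])
           .AP.agg(Mode=lambda x: x.value_counts().index[0])
           .reset_index())
df = df.merge(modes, on=['State', 'County'], how='left')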
I have a data frame that looks like
df1
UserID group day sp PU
213 test 12/11/14 3 311
314 control 13/11/14 4 345
354 test 13/08/14 5 376
and a second data frame df2 that holds some of the UserID values from df1. Rows whose UserID matches between df1 and df2 should become test-Red, and the others should keep their existing value.
df2
UserID
213
What I am aiming for is to append a new column group2 to df1, derived from the group column in df1 and from the matching values in df2, as follows. For instance, UserID 213 appears in both df1 and df2, so the newly appended 'group2' column should show test-Red for that row; otherwise it should be as it is from the group column.
df1
UserID group day sp PU group2
213 test 12/11/14 3 311 test-Red
314 control 13/11/14 4 345 control
354 test 13/08/14 5 376 test-NonRed
This is what I tried:
def converters(df2, df1):
    if df1['UserId'] == df2['UserId']:
        val = "test-Red"
    elif df1['group'] == "test":
        val = "test-NonRed"
    else:
        val = "control"
    return val
But it throws error as following,
ValueError: Series lengths must match to compare
The ValueError occurs because == compares the two UserID Series element-wise, and they have different lengths. Test membership with Series.isin and use numpy.where instead:
import numpy as np

df1['new'] = np.where(df1['UserID'].isin(df2['UserID']), 'test-Red',
             np.where(df1['group'] == 'test', 'test-NonRed', df1['group']))
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
Or numpy.select:
m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'] == 'test'
df1['new'] = np.select([m1,m2], ['test-Red', 'test-NonRed'],default=df1['group'])
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
More general solution:
print (df1)
UserID group day sp PU
0 213 test 12/11/14 3 311
1 314 control 13/11/14 4 345
2 354 test 13/08/14 5 376
3 2131 test1 12/11/14 3 311
4 314 control1 13/11/14 4 345
5 354 test1 13/08/14 5 376
df2 = pd.DataFrame({'UserID':[213, 2131]})
m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'].isin(df1.loc[m1, 'group'])
df1['new'] = np.select([m1,m2],
[df1['group'] + '-Red', df1['group'] + '-NonRed'],
default=df1['group'])
print (df1)
UserID group day sp PU new
0 213 test 12/11/14 3 311 test-Red
1 314 control 13/11/14 4 345 control
2 354 test 13/08/14 5 376 test-NonRed
3 2131 test1 12/11/14 3 311 test1-Red
4 314 control1 13/11/14 4 345 control1
5 354 test1 13/08/14 5 376 test1-NonRed
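Note that np.select checks the conditions in order and takes the value for the first one that is True, so a UserID that appears in df2 always gets the -Red label even when the group condition would also match.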
Can you use pd.merge and specify the how='outer' parameter? This would include all the data from both tables being joined, e.g.:
df1.merge(df2, how='outer', on='UserID')
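A merge by itself won't produce the group2 labels, but the idea can be combined with the logic above. A rough sketch using merge's indicator flag (assuming the UserID values in df2 are unique, so the left merge does not duplicate rows):

import numpy as np
import pandas as pd

# indicator=True adds a _merge column marking which rows of df1 found a match in df2;
# that flag then drives the same labelling as the np.select answer above.
merged = df1.merge(df2, on='UserID', how='left', indicator=True)
matched = (merged['_merge'] == 'both').to_numpy()
df1['group2'] = np.select([matched, df1['group'] == 'test'],
                          ['test-Red', 'test-NonRed'],
                          default=df1['group'])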
I have the following dataframe:
df1= df[['rsa_units','regions','ssno','veteran','pos_off_ttl','occ_ser','grade','gender','ethnicity','age','age_category','service_time','type_appt','disabled','actn_dt','nat_actn_2_3','csc_auth_12','fy']]
This will produce 1.4 million records. I've taken the first 12:
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for certain hiring codes:
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like in Excel, where I can have ['regions'] as the rows and ['fy'] as the columns, giving me a count based off ['ssno'] for each ['fy']. It would also be nice to eventually do calculations on those numbers too, like averages and sums.
Along with looking at examples in the url: http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So, based on the answer below, I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)
This produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value, my results came out really strange and didn't match my value_counts numbers. I'm not sure whether the pivot eliminates duplicates or what, but it did not come out correctly.
ssno
fy 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
nat_actn_2_3
100 34 20 25 18 38 43 45 14 19 25 10
101 510 453 725 795 1029 1293 957 383 470 605 145
108 170 132 112 85 123 127 84 43 40 29 10
115 9203 8972 7946 9038 10139 10480 9211 8735 10482 11258 339
130 299 313 431 324 291 325 336 202 230 436 112
140 62 74 71 75 132 125 82 42 45 74 18
141 20 16 23 17 20 14 10 9 13 17 7
170 202 433 226 278 336 386 284 265 121 118 49
171 4771 4627 4234 4196 4470 4472 3270 3145 354 341 34
190 1 1 NaN NaN NaN 1 NaN NaN NaN NaN NaN
702 3141 3099 3429 3030 3758 3952 3813 2902 2329 2375 650
703 2280 2354 2225 2050 2260 2328 2172 2503 2649 2856 726
Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin references a list of strings, but the data you provide for the 'nat_actn_2_3' column are ints.
If you have an older version of pandas, try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
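On a current pandas version, the regions-by-fiscal-year count the question asks for could be sketched along the same lines, counting ssno entries per cell; margins=True adds the row and column totals mentioned as a follow-up (this assumes the h3 frame built in the question):

# Count of ssno per region and fiscal year, with 'All' totals from margins=True.
hires_by_fy = h3.pivot_table(values='ssno', index='regions', columns='fy',
                             aggfunc=len, fill_value=0, margins=True)

# Averages and sums can then be taken from the resulting table, e.g. mean hires per year:
avg_per_year = hires_by_fy.drop('All', axis=0).drop('All', axis=1).mean(axis=1)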
Problem:
I'm trying to merge two relatively small datasets, but the merge raises a MemoryError. I have two datasets of aggregates of country trade data that I'm trying to merge on the keys year and country, so the rows need to be matched up on those keys. This unfortunately makes the use of concat and its performance benefits impossible, as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
mporter cod CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!
In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (e.g. df = df.set_index(['A','B'])), and then using concat to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
A B C
0 1 2 3
1 2 3 4
df2
A B D
0 1 2 7
1 2 3 8
both
A B C D
0 1 2 3 7
1 2 3 4 8
I haven't benchmarked the performance of this approach, but it didn't get the memory error and worked for my applications.
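Applied to the loop from the traceback, a rough sketch might look like this (key column names taken from the merge call in the question; it also assumes the year/country keys are unique within each frame, since concat cannot align duplicated index labels):

import pandas as pd

# Rename the keys in each country-trade frame to match df, index both sides on them,
# and concatenate along the columns instead of calling merge repeatedly.
# Unlike merge, concat keeps any repeated non-key column names side by side.
df = df.set_index(['year', 'ComTrade_CC'])
pieces = [c.rename(columns={'Year': 'year', 'Partner Code': 'ComTrade_CC'})
           .set_index(['year', 'ComTrade_CC'])
          for c in c_list]
df = pd.concat([df] + pieces, axis=1).reset_index()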