Pandas column reformatting - python

Is there a quick way to achieve the output below, please?
Input:
Code  Items
123   eq-hk
456   ca-eu; tp-lbe
789   ca-us
321   go-ch
654   ca-au; go-au
987   go-jp
147   co-ml; go-ml
258   ca-us
369   ca-us; ca-my
741   ca-us
852   ca-eu
963   ca-ml; co-ml; go-ml
Output:
Code  eq  ca     go  co  tp
123   hk
456       eu             lbe
789       us
321              ch
654       au     au
987              jp
147              ml  ml
258       us
369       us,my
741       us
852       eu
963       ml     ml  ml
I am again running into loops and very ugly code to make it work. Is there an elegant way to achieve this, please?
Thank you!

This is a little bit complicated:
(df.set_index('Code')
   .Items.str.split('; ', expand=True)
   .stack()
   .str.split('-', expand=True)
   .set_index(0, append=True)[1]
   .unstack()
   .fillna('')
   .sum(level=0))
0     ca    co  eq  go  tp
Code
123             hk
147         ml      ml
258   us
321                 ch
369   usmy
456   eu                lbe
654   au            au
741   us
789   us
852   eu
963   ml    ml      ml
987                 jp
# Use str.split to unnest the Items column, then stack and str.split again;
# set the key (the first split column) as an extra index level.
# After unstack we get the result.
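Note: Series.sum(level=...) was deprecated in pandas 1.3 and removed in 2.0, so on current pandas the last step of the chain becomes an explicit groupby; only the final line changes:

 .fillna('')
 .groupby(level=0).sum())   # same per-Code string concatenation as .sum(level=0)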

List comprehensions work better (read: much faster) for string problems like this which require multiple levels of splitting.
df2 = pd.DataFrame([
    dict(y.split('-') for y in x.split('; '))
    for x in df.Items]).fillna('')
df2.insert(0, 'Code', df.Code)
print(df2)
    Code  ca   co  eq  go  tp
0   123            hk
1   456   eu               lbe
2   789   us
3   321                ch
4   654   au           au
5   987                jp
6   147        ml      ml
7   258   us
8   369   my    # Should be "us,my"... see below.
9   741   us
10  852   eu
11  963   ml   ml      ml
This does not handle the situation where multiple items with the same key can be present in a row. For that, a slightly more involved solution is needed.
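The reason is that dict keeps only the last value when a key repeats, which is easy to see in isolation:

dict(y.split('-') for y in 'ca-us; ca-my'.split('; '))
# {'ca': 'my'}  -- the 'us' entry is silently overwritten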
from itertools import chain

v = [x.split('; ') for x in df.Items]
X = pd.Series(df.Code.values.repeat([len(x) for x in v]))
Y = pd.DataFrame([x.split('-') for x in chain.from_iterable(v)])

df2 = pd.concat([X, Y], axis=1, ignore_index=True)
# Column 3 numbers repeated (Code, key) pairs (e.g., 369 has "ca" twice)
# so the index is unique and unstack does not fail.
df2[3] = df2.groupby([0, 1]).cumcount()
(df2.set_index([0, 1, 3])[2]
    .unstack(1)
    .fillna('')
    .groupby(level=0)
    .agg(lambda x: ','.join(x).strip(',')))
1     ca     co  eq  go  tp
0
123              hk
147          ml      ml
258   us
321                  ch
369   us,my
456   eu                 lbe
654   au             au
741   us
789   us
852   eu
963   ml     ml      ml
987                  jp

import pandas as pd

df = pd.DataFrame([
    ('123', 'eq-hk'),
    ('456', 'ca-eu; tp-lbe'),
    ('789', 'ca-us'),
    ('321', 'go-ch'),
    ('654', 'ca-au; go-au'),
    ('987', 'go-jp'),
    ('147', 'co-ml; go-ml'),
    ('258', 'ca-us'),
    ('369', 'ca-us; ca-my'),
    ('741', 'ca-us'),
    ('852', 'ca-eu'),
    ('963', 'ca-ml; co-ml; go-ml')],
    columns=['Code', 'Items'])
# Get item type list from each row, sum (concatenate) the lists and convert
# to a set to remove duplicates
item_types = set(df['Items'].str.findall(r'(\w+)-').sum())
print(item_types)
# {'ca', 'co', 'eq', 'go', 'tp'}
# Generate a column for each item type
df1 = pd.DataFrame(df['Code'])
for t in item_types:
    df1[t] = df['Items'].str.findall(r'%s-(\w+)' % t).apply(lambda x: ''.join(x))
print(df1)
#    Code  ca    tp   eq  co  go
#0   123              hk
#1   456   eu    lbe
#2   789   us
#3   321                      ch
#4   654   au                 au
#5   987                      jp
#6   147                  ml  ml
#7   258   us
#8   369   usmy
#9   741   us
#10  852   eu
#11  963   ml             ml  ml
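To match the asker's expected "us,my" for code 369 (the loop above concatenates duplicates into "usmy"), join the matches with a comma instead:

for t in item_types:
    df1[t] = df['Items'].str.findall(r'%s-(\w+)' % t).apply(lambda x: ','.join(x))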

Related

Python - Adding grouped mode as additional column in original dataset

So I have data similar to this:
import pandas as pd
df = pd.DataFrame({'Order ID': [555, 556, 557, 558, 559, 560, 561, 562, 563, 564, 565, 566],
                   'State': ["MA", "MA", "MA", "MA", "MA", "MA", "CT", "CT", "CT", "CT", "CT", "CT"],
                   'County': ["Essex", "Essex", "Essex", "Worcester", "Worcester", "Worcester",
                              "Bristol", "Bristol", "Bristol", "Hartford", "Hartford", "Hartford"],
                   'AP': [50, 50, 75, 100, 100, 125, 150, 150, 175, 200, 200, 225]})
but I need to add a column that shows the mode of AP grouped by State and County. I can get the mode this way:
(df.groupby(['State', 'County']).AP.agg(Mode = (lambda x: x.value_counts().index[0])).reset_index().round(0))
I'm just not sure how I can get that data added to the original data so that it looks like this:
Order ID  State  County     AP   Mode
555       MA     Essex      50   50
556       MA     Essex      50   50
557       MA     Essex      75   50
558       MA     Worcester  100  100
559       MA     Worcester  100  100
560       MA     Worcester  125  100
561       CT     Bristol    150  150
562       CT     Bristol    150  150
563       CT     Bristol    175  150
564       CT     Hartford   200  200
565       CT     Hartford   200  200
566       CT     Hartford   225  200
Use GroupBy.transform for the new column:
df['Mode'] = (df.groupby(['State', 'County']).AP
                .transform(lambda x: x.value_counts().index[0]))
Or Series.mode:
df['Mode'] = df.groupby(['State', 'County']).AP.transform(lambda x: x.mode().iat[0])
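Both variants work because transform returns a Series aligned with the original index (one broadcast value per input row), which is what makes the direct column assignment valid. A quick sanity check:

out = df.groupby(['State', 'County']).AP.transform(lambda x: x.mode().iat[0])
assert len(out) == len(df)              # one mode per original row
assert (out.index == df.index).all()    # same index, so assignment aligns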

Append a new column to a dataframe based on another dataframe, matching rows and filling the non-matching ones with values from an existing column

I have a data frame that looks like this:
df1
UserID  group    day       sp  PU
213     test     12/11/14  3   311
314     control  13/11/14  4   345
354     test     13/08/14  5   376
and a second data frame df2, which holds information about the values in df1's UserID column. Rows whose UserID matches between df1 and df2 should become test-Red; the others should keep what they have.
df2
UserID
213
What I am aiming for is to append a new column group2 to df1, derived from the group column using the matching values from df2 as well as the values already in df1, as follows. For instance, UserID 213 appears in both df1 and df2, so the newly appended group2 column should hold test-Red for it; otherwise group2 should carry over the value from the group column.
df1
UserID  group    day       sp  PU   group2
213     test     12/11/14  3   311  test-Red
314     control  13/11/14  4   345  control
354     test     13/08/14  5   376  test-NonRed
This is what I tried:
def converters(df2, df1):
    if df1['UserId'] == df2['UserId']:
        val = "test-Red"
    elif df1['group'] == "test":
        val = "test-NonRed"
    else:
        val = "control"
    return val
But it throws the following error, because == compares two Series element-wise and here they have different lengths:
ValueError: Series lengths must match to compare
Use numpy.where:
import numpy as np

df1['new'] = np.where(df1['UserID'].isin(df2['UserID']), 'test-Red',
                      np.where(df1['group'] == 'test', 'test-NonRed', df1['group']))
print (df1)
  UserID    group       day  sp   PU          new
0    213     test  12/11/14   3  311     test-Red
1    314  control  13/11/14   4  345      control
2    354     test  13/08/14   5  376  test-NonRed
Or numpy.select:
m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'] == 'test'
df1['new'] = np.select([m1, m2], ['test-Red', 'test-NonRed'], default=df1['group'])
print (df1)
  UserID    group       day  sp   PU          new
0    213     test  12/11/14   3  311     test-Red
1    314  control  13/11/14   4  345      control
2    354     test  13/08/14   5  376  test-NonRed
More general solution:
print (df1)
  UserID     group       day  sp   PU
0    213      test  12/11/14   3  311
1    314   control  13/11/14   4  345
2    354      test  13/08/14   5  376
3   2131     test1  12/11/14   3  311
4    314  control1  13/11/14   4  345
5    354     test1  13/08/14   5  376
df2 = pd.DataFrame({'UserID': [213, 2131]})

m1 = df1['UserID'].isin(df2['UserID'])
m2 = df1['group'].isin(df1.loc[m1, 'group'])
df1['new'] = np.select([m1, m2],
                       [df1['group'] + '-Red', df1['group'] + '-NonRed'],
                       default=df1['group'])
print (df1)
  UserID     group       day  sp   PU           new
0    213      test  12/11/14   3  311      test-Red
1    314   control  13/11/14   4  345       control
2    354      test  13/08/14   5  376   test-NonRed
3   2131     test1  12/11/14   3  311     test1-Red
4    314  control1  13/11/14   4  345      control1
5    354     test1  13/08/14   5  376  test1-NonRed
Can you use pd.merge and specify the how='outer' parameter? This would include all the data from both tables being joined, i.e.:
df1.merge(df2, how='outer', on='UserID')
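An outer merge by itself will not produce the group2 column, though; a left merge plus a fill comes closer. A sketch of that idea, using a hypothetical flag column to mark the matched users:

import numpy as np

df2['flag'] = 'test-Red'                       # hypothetical marker column
out = df1.merge(df2, how='left', on='UserID')  # flag is NaN where no match
out['group2'] = np.where(out['flag'].notna(), out['flag'],
                         np.where(out['group'] == 'test', 'test-NonRed', out['group']))
out = out.drop(columns='flag')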

Pandas data pull - messy strings to float

I am new to Pandas and I am just starting to take in the versatility of the package. While working with a small practice csv file, I pulled the following data in:
Rank Corporation Sector Headquarters Revenue (thousand PLN) Profit (thousand PLN) Employees
1.ÿ PKN Orlen SA oil and gas P?ock 79 037 121 2 396 447 4,445
2.ÿ Lotos Group SA oil and gas Gda?sk 29 258 539 584 878 5,168
3.ÿ PGE SA energy Warsaw 28 111 354 6 165 394 44,317
4.ÿ Jer¢nimo Martins retail Kostrzyn 25 285 407 N/A 36,419
5.ÿ PGNiG SA oil and gas Warsaw 23 003 534 1 711 787 33,071
6.ÿ Tauron Group SA energy Katowice 20 755 222 1 565 936 26,710
7.ÿ KGHM Polska Mied? SA mining Lubin 20 097 392 13 653 597 18,578
8.ÿ Metro Group Poland retail Warsaw 17 200 000 N/A 22,556
9.ÿ Fiat Auto Poland SA automotive Bielsko-Bia?a 16 513 651 83 919 5,303
10.ÿ Orange Polska telecommunications Warsaw 14 922 000 1 785 000 23,805
I have two serious problems with it that I cannot seem to find solution for:
1) Data in the "Revenue" and "Profit" columns is pulled in as strings because of the odd formatting with spaces between thousands, and I cannot figure out how to make Pandas convert it to floating point values.
2) Data under the "Rank" column is pulled in as "1.?", "2.?" etc. What's happening there? Again, when I try to rewrite this data with something more appropriate like "1.", "2." etc., the DataFrame just does not budge.
Ideas? Suggestions? I am also open for outright bashing because my problem might be quite obvious and silly - excuse my lack of experience then :)
I would use the converters parameter. Pass this to your pd.read_csv call:
def space_float(x):
    return float(x.replace(' ', ''))

converters = {
    'Revenue (thousand PLN)': space_float,
    'Profit (thousand PLN)': space_float,
    'Rank': str.strip
}

pd.read_csv(... converters=converters ...)
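A minimal end-to-end sketch, assuming the file is named companies.csv (hypothetical name):

df = pd.read_csv('companies.csv', converters=converters)

Note that read_csv also has a thousands parameter, so thousands=' ' may handle the Revenue/Profit columns without a custom converter.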

Pivot tables using pandas

I have the following dataframe:
df1 = df[['rsa_units', 'regions', 'ssno', 'veteran', 'pos_off_ttl', 'occ_ser',
          'grade', 'gender', 'ethnicity', 'age', 'age_category', 'service_time',
          'type_appt', 'disabled', 'actn_dt', 'nat_actn_2_3', 'csc_auth_12', 'fy']]
This produces 1.4 million records; I've taken the first 12.
Eastern Region (R9),Eastern Region (R9),123456789,Non Vet,LBRER,3502,3,Male,White,43.0,Older Gen X'ers,5.0,Temporary,,2009-05-18 00:00:00,115,BDN,2009
Northern Region (R1),Northern Region (R1),234567891,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,7.0,Temporary,,2007-05-27 00:00:00,115,BDN,2007
Northern Region (R1),Northern Region (R1),345678912,Non Vet,FRSTRY AID,0462,3,Male,White,33.0,Younger Gen X'ers,8.0,Temporary,,2006-06-05 00:00:00,115,BDN,2006
Northern Research Station (NRS),Research & Development(RES),456789123,Non Vet,FRSTRY TECHNCN,0462,7,Male,White,37.0,Younger Gen X'ers,10.0,Term,,2006-11-26 00:00:00,702,N6M,2007
Intermountain Region (R4),Intermountain Region (R4),5678912345,Non Vet,BIOLCL SCI TECHNCN,0404,5,Male,White,45.0,Older Gen X'ers,6.0,Temporary,,2008-05-18 00:00:00,115,BWA,2008
Intermountain Region (R4),Intermountain Region (R4),678912345,Non Vet,FRSTRY AID (FIRE),0462,3,Female,White,31.0,Younger Gen X'ers,5.0,Temporary,,2009-05-10 00:00:00,115,BDN,2009
Pacific Southwest Region (R5),Pacific Southwest Region (R5),789123456,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2012-05-06 00:00:00,115,NAM,2012
Pacific Southwest Region (R5),Pacific Southwest Region (R5),891234567,Non Vet,FRSTRY AID (FIRE),0462,3,Male,White,31.0,Younger Gen X'ers,3.0,Temporary,,2011-06-05 00:00:00,115,BDN,2011
Intermountain Region (R4),Intermountain Region (R4),912345678,Non Vet,FRSTRY TECHNCN,0462,5,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2006-04-30 00:00:00,115,BDN,2006
Northern Region (R1),Northern Region (R1),987654321,Non Vet,FRSTRY TECHNCN,0462,4,Male,White,37.0,Younger Gen X'ers,11.0,Temporary,,2005-04-11 00:00:00,115,BDN,2005
Southwest Region (R3),Southwest Region (R3),876543219,Non Vet,FRSTRY TECHNCN (HOTSHOT/HANDCREW),0462,4,Male,White,30.0,Gen Y Millennial,4.0,Temporary,,2013-03-24 00:00:00,115,NAM,2013
Southwest Region (R3),Southwest Region (R3),765432198,Non Vet,FRSTRY TECHNCN (RECR),0462,4,Male,White,30.0,Gen Y Millennial,5.0,Temporary,,2010-11-21 00:00:00,115,BDN,2011
I then filter on ['nat_actn_2_3'] for the certain hiring codes.
h1 = df1[df1['nat_actn_2_3'].isin(['100','101','108','170','171','115','130','140','141','190','702','703'])]
h2 = h1.sort('ssno')
h3 = h2.drop_duplicates(['ssno','actn_dt'])
and can look at value_counts() to see total hires by region.
total_newhires = h3['regions'].value_counts()
total_newhires
produces:
Out[38]:
Pacific Southwest Region (R5) 42255
Pacific Northwest Region (R6) 32081
Intermountain Region (R4) 24045
Northern Region (R1) 22822
Rocky Mountain Region (R2) 17481
Southwest Region (R3) 17305
Eastern Region (R9) 11034
Research & Development(RES) 7337
Southern Region (R8) 7288
Albuquerque Service Center(ASC) 7032
Washington Office(WO) 4837
Alaska Region (R10) 4210
Job Corps(JC) 4010
nda 438
I'd like to do something like in excel where I can have the ['regions'] as my row and the ['fy'] as the columns to give me a total count of numbers based off the ['ssno'] for each ['fy']. It would also be nice to eventually do calculations based off the numbers too, like averages and sums.
Along with looking at examples in the url: http://pandas.pydata.org/pandas-docs/stable/reshaping.html, I've also tried:
hirestable = pivot_table(h3, values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'])
I'm wondering if groupby may be what I'm looking for?
Any help is appreciated. I've spent 3 days on this and can't seem to put it together.
So based off the answer below I did a pivot using the following code:
h3.pivot_table(values=['ssno'], rows=['nat_actn_2_3'], cols=['fy'], aggfunc=len)
Which produced a somewhat decent result. When I used 'ethnicity' or 'veteran' as a value my results came out really strange and didn't match my value counts numbers. Not sure if the pivot eliminates duplicates or what, but it did not come out correctly.
               ssno
fy             2005   2006   2007   2008   2009   2010   2011   2012   2013   2014  2015
nat_actn_2_3
100              34     20     25     18     38     43     45     14     19     25    10
101             510    453    725    795   1029   1293    957    383    470    605   145
108             170    132    112     85    123    127     84     43     40     29    10
115            9203   8972   7946   9038  10139  10480   9211   8735  10482  11258   339
130             299    313    431    324    291    325    336    202    230    436   112
140              62     74     71     75    132    125     82     42     45     74    18
141              20     16     23     17     20     14     10      9     13     17     7
170             202    433    226    278    336    386    284    265    121    118    49
171            4771   4627   4234   4196   4470   4472   3270   3145    354    341    34
190               1      1    NaN    NaN    NaN      1    NaN    NaN    NaN    NaN   NaN
702            3141   3099   3429   3030   3758   3952   3813   2902   2329   2375   650
703            2280   2354   2225   2050   2260   2328   2172   2503   2649   2856   726
Try it like this:
h3.pivot_table(values=['ethnicity', 'veteran'], index=['regions'], columns=['fy'], aggfunc=len, fill_value=0)
To get counts, use aggfunc=len.
Also, your isin references a list of strings, but the data in the 'nat_actn_2_3' column is int.
Try:
h3.pivot_table(values=['ethnicity', 'veteran'], rows=['regions'], cols=['fy'], aggfunc=len, fill_value=0)
if you have an older version of pandas
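Since the question mentions groupby: a rough groupby equivalent of the desired pivot, counting distinct ssno values per region and fiscal year (a sketch, not tested against the full data):

counts = (h3.groupby(['regions', 'fy'])['ssno']
            .nunique()
            .unstack('fy', fill_value=0))
print(counts)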

Pandas Merge Error: MemoryError

Problem:
I'm trying to merge two relatively small datasets, but the merge raises a MemoryError. I have two datasets of aggregates of country trade data that I'm trying to merge on the keys year and country, so the data needs to be particularly placed. This unfortunately makes the use of concat and its performance benefits impossible, as seen in the answer to this question: MemoryError on large merges with pandas in Python.
Here's the setup:
The attempted merge:
df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
Basic data structure:
i:
Year Reporter_Code Trade_Flow_Code Partner_Code Classification Commodity Code Quantity Unit Code Supplementary Quantity Netweight (kg) Value Estimation Code
0 2003 381 2 36 H2 070951 8 1274 1274 13810 0
1 2003 381 2 36 H2 070930 8 17150 17150 30626 0
2 2003 381 2 36 H2 0709 8 20493 20493 635840 0
3 2003 381 1 36 H2 0507 8 5200 5200 27619 0
4 2003 381 1 36 H2 050400 8 56439 56439 683104 0
df:
mporter cod CC ComTrade_CC Distance_miles
0 110 215 215 757 428.989
1 110 215 215 757 428.989
2 110 215 215 757 428.989
3 110 215 215 757 428.989
4 110 215 215 757 428.989
Error Traceback:
MemoryError Traceback (most recent call last)
<ipython-input-10-8d6e9fb45de6> in <module>()
1 for i in c_list:
----> 2 df = merge(df, i, left_on=['year', 'ComTrade_CC'], right_on=["Year","Partner Code"])
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy)
36 right_index=right_index, sort=sort, suffixes=suffixes,
37 copy=copy)
---> 38 return op.get_result()
39 if __debug__:
40 merge.__doc__ = _merge_doc % '\nleft : DataFrame'
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
193 copy=self.copy)
194
--> 195 result_data = join_op.get_result()
196 result = DataFrame(result_data)
197
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in get_result(self)
693 if klass in mapping:
694 klass_blocks.extend((unit, b) for b in mapping[klass])
--> 695 res_blk = self._get_merged_block(klass_blocks)
696
697 # if we have a unique result index, need to clear the _ref_locs
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _get_merged_block(self, to_merge)
706 def _get_merged_block(self, to_merge):
707 if len(to_merge) > 1:
--> 708 return self._merge_blocks(to_merge)
709 else:
710 unit, block = to_merge[0]
/usr/local/lib/python2.7/dist-packages/pandas-0.12.0rc1_309_g9fc8636-py2.7-linux-x86_64.egg/pandas/tools/merge.pyc in _merge_blocks(self, merge_chunks)
728 # Should use Fortran order??
729 block_dtype = _get_block_dtype([x[1] for x in merge_chunks])
--> 730 out = np.empty(out_shape, dtype=block_dtype)
731
732 sofar = 0
MemoryError:
Thanks for your thoughts!
In case anyone coming across this question still has similar trouble with merge, you can probably get concat to work by renaming the relevant columns in the two dataframes to the same names, setting them as a MultiIndex (i.e. df = df.set_index(['A','B'])), and then using concat to join them.
UPDATE
Example:
df1 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'C':[3, 4]})
df2 = pd.DataFrame({'A':[1, 2], 'B':[2, 3], 'D':[7, 8]})
both = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])], axis=1).reset_index()
df1
   A  B  C
0  1  2  3
1  2  3  4
df2
   A  B  D
0  1  2  7
1  2  3  8
both
   A  B  C  D
0  1  2  3  7
1  2  3  4  8
I haven't benchmarked the performance of this approach, but it didn't get the memory error and worked for my applications.
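Note that concat along axis=1 aligns rows on the index, which is why moving the merge keys into the index emulates the join. The default is an outer join; join='inner' gives merge-like inner behavior, e.g.:

inner = pd.concat([df1.set_index(['A','B']), df2.set_index(['A','B'])],
                  axis=1, join='inner').reset_index()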
