pandas not sorting as expected - python

I have a pandas dataframe I am trying to sort. It contains an int column (encoded_target), which I sort like so:
some_set.encoded_target = train_set.encoded_target.astype(int) # last but one column
some_set.sort_values(by='encoded_target', ascending=True)
print(some_set)
and this gives me:
1953 61c4930b42ca426eb8dfaf7314899d08__11_115_3... 61c4930b42ca426eb8dfaf7314899d08__115 134 61c4930b42ca426eb8dfaf7314899d08
1623 3659cfea02b44543812e13f0d7fb7147__105_105_4... 3659cfea02b44543812e13f0d7fb7147__105 63 3659cfea02b44543812e13f0d7fb7147
241 bd67717fe59e4fa8bb5307a663016eb3__13_13_3_p... bd67717fe59e4fa8bb5307a663016eb3__13 290 bd67717fe59e4fa8bb5307a663016eb3
1573 9fdfabfad9974d6cac5b588ff2d9e47a__194__194_2... 9fdfabfad9974d6cac5b588ff2d9e47a__194 238 9fdfabfad9974d6cac5b588ff2d9e47a
602 0a64aee93755481cb9f5162373c776f8__182__182_1... 0a64aee93755481cb9f5162373c776f8__182 13 0a64aee93755481cb9f5162373c776f8
... ... ... ... ...
1779 7b19321376b842a2aece02cd458fb043__186__186_3... 7b19321376b842a2aece02cd458fb043__186 187 7b19321376b842a2aece02cd458fb043
2910 64bff78431914373a78c8f547d985b7d__141__141_2... 64bff78431914373a78c8f547d985b7d__141 142 64bff78431914373a78c8f547d985b7d
1377 2410de3f2fee45cdab25b61428f282bd__93__93_3_p... 2410de3f2fee45cdab25b61428f282bd__93 39 2410de3f2fee45cdab25b61428f282bd
2533 a567db4f10c34228b5452f79b5ff08d7__43__43_1_p... a567db4f10c34228b5452f79b5ff08d7__43 247 a567db4f10c34228b5452f79b5ff08d7
2790 9430d8f375bc4888a0a61b47bc7228fd__102__102_3... 9430d8f375bc4888a0a61b47bc7228fd__102 217 9430d8f375bc4888a0a61b47bc7228fd
Clearly, this is wrong: 13 must come before 134.
I have spent two hours trying to figure out what could be wrong, but I am having no luck whatsoever.
:((
Any clues would be great.

One thing to remember is that sort_values returns a sorted copy rather than sorting in place, so you need to assign it back:
some_set = some_set.sort_values(by='encoded_target', ascending=True)
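A minimal sketch of the difference, using a made-up frame (only the column name comes from the question):
import pandas as pd

df = pd.DataFrame({'encoded_target': [134, 13, 290]})

df.sort_values(by='encoded_target')        # returns a sorted copy; df is unchanged
df = df.sort_values(by='encoded_target')   # assign the result back
# or sort in place instead:
df.sort_values(by='encoded_target', inplace=True)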


Filling an empty list in python

I am trying to create a new list using data from a pandas DataFrame. The DataFrame in question has a column of dates as well as a column for units sold, as seen below:
Peep = Xsku[['new_date', 'cum_sum']]
Peep.head(15)
Out[159]:
new_date cum_sum
18 2011-01-17 214
1173 2011-01-24 343
2328 2011-01-31 407 #Save Entry in List
3483 2011-02-07 71
4638 2011-02-14 159
5793 2011-02-21 294
6948 2011-02-28 425 #Save Entry in List
8103 2011-03-07 113
9258 2011-03-14 249
10413 2011-03-21 347
11568 2011-03-28 463 #Save Entry in List
12723 2011-04-04 99
13878 2011-04-11 186
15033 2011-04-18 291
16188 2011-04-25 385
I am trying to make a new list, where the list contains the maximum 'cum_sum' before the number is reset (i.e. becomes smaller). For example, in the first four entries above, the cum_sum reaches 407 and then goes back down to 71. I am thus trying to save the number 407 as well as the corresponding 'new_date' (2011-01-31 in this example) and do this for every entry.
My final List will thus have all the maximum 'cum_sum' values before it is reset.
For example it will look like as follows:
(First Three Expected Values)
MyList
Out[]:
new_date cum_sum
2011-01-31 407
2011-02-28 425
2011-03-28 463
...
I have been trying to do this with a for loop, but continually run into problems:
MyList = []  # my empty list
for i in range(len(Peep['new_date']) - 1):  # -1, otherwise i + 1 runs off the end
    if Peep.iloc[i, 1] > Peep.iloc[i + 1, 1]:
        MyList.append(Peep.iloc[i, 1])
Can anyone help me in this regard?
Use .diff and filter, like:
In [17]: df[df['cum_sum'].diff(-1).ge(0)]
Out[17]:
new_date cum_sum
2 2011-01-31 407
6 2011-02-28 425
10 2011-03-28 463
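For a runnable version of the same idea, here is a sketch using the first few rows from the question:
import pandas as pd

Peep = pd.DataFrame({
    'new_date': pd.to_datetime(['2011-01-17', '2011-01-24', '2011-01-31',
                                '2011-02-07', '2011-02-14']),
    'cum_sum': [214, 343, 407, 71, 159],
})

# diff(-1) is each row minus the next row; it is >= 0 exactly where
# cum_sum stops growing, i.e. just before a reset
peaks = Peep[Peep['cum_sum'].diff(-1).ge(0)]
print(peaks)  # keeps only the 2011-01-31 / 407 row
# caveat: the very last row can never be selected, because its
# diff against a missing next row is NaN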

Segment dataframe every time a value repeats in Python using pandas

I am trying to generate strings, or at least a different dataframe, from a dataframe that I have. The one that I have is:
Line MM/DD/YYhh:mm:ss.ms.us TEST
9 04/17/2013:44:18.215.500 S
20 04/17/2013:44:18.216.020 U
27 04/17/2013:44:18.216.544 P
34 04/17/2013:44:18.217.064 P
39 04/17/2013:44:18.217.584 L
48 04/17/2013:44:18.218.104 Y
55 04/17/2013:44:18.218.627 P
62 04/17/2013:44:18.219.147 R
69 04/17/2013:44:18.219.667 <CR>
76 04/17/2013:44:18.220.187 <LF>
179 04/17/2013:44:18.721.249 U
184 04/17/2013:44:18.721.769 L
193 04/17/2013:44:18.722.289 <CR>
200 04/17/2013:44:18.722.812 <LF>
304 04/17/2013:44:19.236.017 E
311 04/17/2013:44:19.236.537 R
318 04/17/2013:44:19.237.060 R
327 04/17/2013:44:19.237.580 <CR>
334 04/17/2013:44:19.238.100 <LF>
371 04/17/2013:44:19.649.033 M
376 04/17/2013:44:19.649.553 O
383 04/17/2013:44:19.650.073 D
390 04/17/2013:44:19.650.596 E
395 04/17/2013:44:19.651.116 ?
402 04/17/2013:44:19.651.636 <CR>
409 04/17/2013:44:19.652.156 <LF>
489 04/17/2013:44:20.160.040 T
496 04/17/2013:44:20.160.560 P
505 04/17/2013:44:20.161.084 <CR>
512 04/17/2013:44:20.161.604 <LF>
607 04/17/2013:44:20.642.301 P
614 04/17/2013:44:20.642.821 R
623 04/17/2013:44:20.643.345 <CR>
630 04/17/2013:44:20.643.865 <LF>
I am trying to format the above snippet into strings so that it looks like this:
04/17/2013:44:18.220.187-SUPPLYPR<CR><LF>
04/17/2013:44:18.722.812-UL<CR><LF>
.
.
.
What it should do is take the MM/DD/YY timestamp from the row where the value of TEST is <LF> and combine all the TEST values up to each <LF>, making one string per occurrence of <LF>. The raw data I used to get to this dataframe was different and took a lot of work. But now I am kind of stuck on how to get this format. Any ideas/suggestions will be appreciated. Thanks :)
You are looking for groupby:
(df.groupby(df.TEST.shift().eq('<LF>').cumsum())
   .agg({'MM/DD/YYhh:mm:ss.ms.us': 'last',
         'TEST': ''.join})
   .reset_index(drop=True)
)
Output:
MM/DD/YYhh:mm:ss.ms.us TEST
0 04/17/2013:44:18.220.187 SUPPLYPR<CR><LF>
1 04/17/2013:44:18.722.812 UL<CR><LF>
2 04/17/2013:44:19.238.100 ERR<CR><LF>
3 04/17/2013:44:19.652.156 MODE?<CR><LF>
4 04/17/2013:44:20.161.604 TP<CR><LF>
5 04/17/2013:44:20.643.865 PR<CR><LF>
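To see why the grouping key works, here is a small self-contained sketch (timestamps abbreviated to t1..t7; only the shift/eq/cumsum pattern is from the answer above):
import pandas as pd

df = pd.DataFrame({
    'MM/DD/YYhh:mm:ss.ms.us': ['t1', 't2', 't3', 't4', 't5', 't6', 't7'],
    'TEST': ['U', 'L', '<CR>', '<LF>', 'T', 'P', '<CR>'],
})

# shift().eq('<LF>') is True on the row *after* each <LF>, so cumsum()
# assigns one integer label per line
key = df.TEST.shift().eq('<LF>').cumsum()
print(key.tolist())  # [0, 0, 0, 0, 1, 1, 1]

out = (df.groupby(key)
         .agg({'MM/DD/YYhh:mm:ss.ms.us': 'last', 'TEST': ''.join})
         .reset_index(drop=True))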

How to make data frame calculations from two columns to generate another with a custom function?

I am trying to apply a custom function that takes two arguments to two particular columns of a grouped dataframe.
I have tried with apply on a groupby dataframe, but any suggestion is welcome.
I have the following dataframe:
id y z
115 10 820
115 12 960
115 13 1100
144 25 2500
144 55 5500
144 65 960
144 68 6200
144 25 2550
146 25 2487
146 25 2847
146 25 2569
146 25 2600
146 25 2382
And I would like to apply a custom function with two arguments and get the result by id.
def train_logmodel(x, y):
    ##.........
    return x
data.groupby('id')[['y','z']].apply(train_logmodel)
TypeError: train_logmodel() missing 1 required positional argument: 'y'
I would like to know how to pass 'y' and 'z' in order to estimate the desired column 'x' by each id.
The expected output example:
id x
115 0.23
144 0.45
146 0.58
It is a little different from the question How to apply a function to two columns of Pandas dataframe: in this case we have to deal with a groupby object, which works slightly differently from a dataframe.
Thanks in advance!
Not knowing your train_logmodel function, I can only give a general example here. Your function takes one argument; from this argument you get the columns inside your function:
def train_logmodel(data):
    return (data.z / data.y).min()
df.groupby('id').apply(train_logmodel)
Result:
id
115 80.000000
144 14.769231
146 95.280000
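A self-contained version of that sketch, with a placeholder body standing in for the real train_logmodel (the z/y ratio is illustrative only):
import pandas as pd

df = pd.DataFrame({
    'id': [115, 115, 115, 144, 144, 144],
    'y':  [10, 12, 13, 25, 55, 65],
    'z':  [820, 960, 1100, 2500, 5500, 960],
})

def train_logmodel(data):
    # `data` is the sub-frame for one id, so both columns are available here
    return (data.z / data.y).min()  # placeholder for the real model fit

print(df.groupby('id').apply(train_logmodel))
# id
# 115    80.000000
# 144    14.769231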

Searching a Pandas series using a string produces a KeyError

I'm trying to use df[df['col'].str.contains("string")] (described in these two SO questions: 1 & 2) to select rows based on a partial string match. Here's my code:
import requests
import json
import pandas as pd
import datetime
url = "http://api.turfgame.com/v4/zones/all" # get request returns .json
r = requests.get(url)
df = pd.read_json(r.content) # create a df containing all zone info
print df[df['region'].str.contains("Uppsala")].head()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-23-55bbf5679808> in <module>()
----> 1 print df[df['region'].str.contains("Uppsala")].head()
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
1670 if isinstance(key, (Series, np.ndarray, list)):
1671 # either boolean or fancy integer index
-> 1672 return self._getitem_array(key)
1673 elif isinstance(key, DataFrame):
1674 return self._getitem_frame(key)
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\frame.pyc in _getitem_array(self, key)
1714 return self.take(indexer, axis=0, convert=False)
1715 else:
-> 1716 indexer = self.ix._convert_to_indexer(key, axis=1)
1717 return self.take(indexer, axis=1, convert=True)
1718
C:\Users\User\AppData\Local\Enthought\Canopy32\User\lib\site-packages\pandas\core\indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
1083 if isinstance(obj, tuple) and is_setter:
1084 return {'key': obj}
-> 1085 raise KeyError('%s not in index' % objarr[mask])
1086
1087 return indexer
KeyError: '[ nan nan nan ..., nan nan nan] not in index'
I don't understand why I get a KeyError, because df.columns returns:
Index([u'dateCreated', u'id', u'latitude', u'longitude', u'name', u'pointsPerHour', u'region', u'takeoverPoints', u'totalTakeovers'], dtype='object')
So the key is in the list of columns, and opening the page in a browser I can find 739 instances of 'Uppsala'.
The column I'm searching was a nested .json table that looks like this: {"id":200,"name":"Scotland","country":"gb"}. Do I have to do something special to search between '{}' characters? Could somebody explain where I've made my mistake(s)?
Looks to me like your region column contains dictionaries, which aren't really supported as elements, and so .str isn't working. One way to solve the problem is to promote the region dictionaries to columns in their own right, maybe with something like:
>>> region = pd.DataFrame(df.pop("region").tolist())
>>> df = df.join(region, rsuffix="_region")
after which you have
>>> df.head()
dateCreated id latitude longitude name pointsPerHour takeoverPoints totalTakeovers country id_region name_region
0 2013-06-15T08:00:00+0000 14639 55.947079 -3.206477 GrandSquare 1 185 32 gb 200 Scotland
1 2014-06-15T20:02:37+0000 31571 55.649181 12.609056 Stenringen 1 185 6 dk 172 Hovedstaden
2 2013-06-15T08:00:00+0000 18958 54.593570 -5.955772 Hospitality 0 250 1 gb 206 Northern Ireland
3 2013-06-15T08:00:00+0000 18661 53.754283 -1.526638 LanshawZone 0 250 0 gb 202 Yorkshire & The Humber
4 2013-06-15T08:00:00+0000 17424 55.949285 -3.144777 NoDogsZone 0 250 5 gb 200 Scotland
and
>>> df[df["name_region"].str.contains("Uppsala")].head()
dateCreated id latitude longitude name pointsPerHour takeoverPoints totalTakeovers country id_region name_region
28 2013-07-16T18:53:48+0000 20828 59.793476 17.775389 MoraStenRast 5 125 536 se 142 Uppsala
59 2013-02-08T21:42:53+0000 14797 59.570418 17.482116 BålWoods 3 155 555 se 142 Uppsala
102 2014-06-19T12:00:00+0000 31843 59.617637 17.077094 EnaAlle 5 125 168 se 142 Uppsala
328 2012-09-24T20:08:22+0000 11461 59.634438 17.066398 BluePark 6 110 1968 se 142 Uppsala
330 2014-08-28T20:00:00+0000 33695 59.867027 17.710792 EnbackensBro 4 140 59 se 142 Uppsala
(A hack workaround would be df["region"].apply(str).str.contains("Uppsala"), but I think it's best to clean the data right at the start.)
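For completeness, a self-contained sketch of the same fix, with two hand-built rows mimicking the API payload:
import pandas as pd

df = pd.DataFrame({
    'name': ['MoraStenRast', 'GrandSquare'],
    'region': [{'id': 142, 'name': 'Uppsala', 'country': 'se'},
               {'id': 200, 'name': 'Scotland', 'country': 'gb'}],
})

region = pd.DataFrame(df.pop('region').tolist())  # dicts -> columns
df = df.join(region, rsuffix='_region')           # 'name' collides, becomes 'name_region'

print(df[df['name_region'].str.contains('Uppsala')])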

Speed of using append in python repeatedly

Is it substantially faster to start with a preallocated list and set items at each index, as opposed to starting with an empty list and appending items? I need this list to hold 10k-100k items.
I ask because I am trying to implement an algorithm that requires O(n) time at each level of recursion, but I am getting results that indicate O(n^2) time. I thought perhaps python needing to keep resizing the list might cause this slowdown.
I found similar questions, but none that explicitly answered mine. One answer indicated that garbage collection might be very slow with so many items, so I tried turning gc on and off and saw no improvement in results.
PROBLEM SOLVED:
If anyone is curious, the slowdown was caused by unioning sets together too often. Now I use a different method (involves sorting), to check if the same key is seen twice.
Python over-allocates lists in chunks proportional to the current size of the list. This gives amortized O(1) appends.
Here is a simple test to see when a list grows. Note that many of these resizes can be done in place, so a copy-over isn't always necessary:
>>> import sys
>>> A = []
>>> sz = sys.getsizeof(A)
>>> for i in range(100000):
...     if sz != sys.getsizeof(A):
...         sz = sys.getsizeof(A)
...         print i, sz
...     A.append(i)
...
1 48
5 64
9 96
17 132
26 172
36 216
47 264
59 320
73 384
89 456
107 536
127 624
149 724
174 836
202 964
234 1108
270 1268
310 1448
355 1652
406 1880
463 2136
527 2424
599 2748
680 3116
772 3528
875 3992
991 4512
1121 5100
1268 5760
1433 6504
1619 7340
1828 8280
2063 9336
2327 10524
2624 11864
2959 13368
3335 15060
3758 16964
4234 19108
4770 21520
5373 24232
6051 27284
6814 30716
7672 34580
8638 38924
9724 43812
10946 49312
12321 55500
13868 62460
15608 70292
17566 79100
19768 89012
22246 100160
25033 112704
28169 126816
31697 142692
35666 160552
40131 180644
45154 203248
50805 228676
57162 257284
64314 289468
72360 325676
81412 366408
91595 412232
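To answer the original timing question directly, here is a rough timeit sketch (numbers vary by machine, but both variants scale linearly):
import timeit

n = 100000

def with_append():
    out = []
    for i in range(n):
        out.append(i)
    return out

def preallocated():
    out = [None] * n
    for i in range(n):
        out[i] = i
    return out

print(timeit.timeit(with_append, number=100))
print(timeit.timeit(preallocated, number=100))
# preallocation is usually somewhat faster, but append is still
# amortized O(1), so list resizing cannot explain O(n^2) behaviour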
