Change the sign of numbers in a pandas Series - python

How do I change the sign of values in the series? If I have:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
and need to get:
1, 2, 3, -4, -5, -6, 7, 8, 9, -10, -11, -12, 13
I need to be able to set the period (now it is equal to 3) and the index from which the function starts (now it is equal to 3).
For example, if I specify 2 as the index, I get
1, 2, -3, -4, -5, 6, 7, 8, -9, -10, -11, 12, 13
I need to apply this function sequentially to each column, since applying to the entire DataFrame leads to a memory error.

Use numpy.where with integer division (//) and modulo (%) to build a boolean mask:
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
N = 3
#if default RangeIndex
m = (s.index // N) % 2 == 1
#general index
#m = (np.arange(len(s.index)) // N) % 2 == 1
s = pd.Series(np.where(m, -s, s))
print (s)
0 1
1 2
2 3
3 -4
4 -5
5 -6
6 7
7 8
8 9
9 -10
10 -11
11 -12
12 13
dtype: int64
EDIT:
N = 3
M = 1
m = np.concatenate([np.repeat(False, M),
                    (np.arange(len(s.index) - M) // N) % 2 == 0])
s = pd.Series(np.where(m, -s, s))
print (s)
0 1
1 -2
2 -3
3 -4
4 5
5 6
6 7
7 -8
8 -9
9 -10
10 11
11 12
12 13
dtype: int64
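To meet the per-column requirement, the mask logic above can be wrapped in a small helper with a configurable period N and start offset M, applied one column at a time (the function name flip_sign and the two-column DataFrame are my own illustration, not from the original):

```python
import numpy as np
import pandas as pd

def flip_sign(s, N=3, M=3):
    # keep the first M values as-is, then negate alternating blocks of length N
    m = np.concatenate([np.repeat(False, M),
                        (np.arange(len(s) - M) // N) % 2 == 0])
    return pd.Series(np.where(m, -s, s), index=s.index)

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
print(flip_sign(s, N=3, M=3).tolist())
# [1, 2, 3, -4, -5, -6, 7, 8, 9, -10, -11, -12, 13]

# apply column by column, to avoid building one large mask for the whole DataFrame
df = pd.DataFrame({'a': s, 'b': s})
for col in df.columns:
    df[col] = flip_sign(df[col], N=3, M=1)
```

Processing one column per iteration keeps only a single column-sized mask in memory at a time, which is the point of the sequential application mentioned in the question.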

Related

Pandas compare items in list in one column with single value in another column

Consider this two-column df. I would like to create an apply function that compares each item in the "other_yrs" column list with the single integer in the "cur" column and keeps count of each item in the "other_yrs" list that is less than or equal to the single value in the "cur" column. I cannot figure out how to do this in pandas with apply. I am using apply functions for other purposes and they are working well. Any ideas would be very appreciated.
cur other_yrs
1 11 [11, 11]
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0]
4 16 [15, 85]
5 17 [17, 17, 16]
6 13 [8, 8]
Below is the function I used to extract the values into the "other_yrs" column. I am thinking I can just insert into this function some way of comparing each successive value in the list with the "cur" column value and keep count. I really only need to store the count of how many of the list items are <= the value in the "cur" column.
def col_check(col_string):
    cs_yr_lst = []
    count = 0
    if len(col_string) < 1:  # avoids col values of 0, meaning no other cases
        pass
    else:
        case_lst = col_string.split(", ")  # splits the string of cases into a list
        for i in case_lst:
            cs_yr = int(i[3:5])  # gets the case year from each individual case number
            cs_yr_lst.append(cs_yr)  # stores those integers in a list, put into a new column using apply
    return cs_yr_lst
The expected output would be this:
cur other_yrs count
1 11 [11, 11] 2
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0] 11
4 16 [15, 85] 1
5 17 [17, 17, 16] 3
6 13 [8, 8] 2
Use zip inside a list comprehension to pair the columns cur and other_yrs, then use np.sum on the boolean mask:
df['count'] = [np.sum(np.array(b) <= a) for a, b in zip(df['cur'], df['other_yrs'])]
Another idea:
df['count'] = pd.DataFrame(df['other_yrs'].tolist(), index=df.index).le(df['cur'], axis=0).sum(1)
Result:
cur other_yrs count
1 11 [11, 11] 2
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0] 11
4 16 [15, 85] 1
5 17 [17, 17, 16] 3
6 13 [8, 8] 2
You can also explode, compare, then group on level=0 and sum:
u = df.explode('other_yrs')
df['Count'] = u['cur'].ge(u['other_yrs']).sum(level=0).astype(int)
print(df)
cur other_yrs Count
1 11 [11, 11] 2
2 12 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0] 11
4 16 [15, 85] 1
5 17 [17, 17, 16] 3
6 13 [8, 8] 2
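One caveat, added as a note: Series.sum(level=0), used in the snippet above, was deprecated and later removed (in pandas 2.0); grouping on the index level explicitly does the same thing. A minimal runnable sketch of the explode approach with that adjustment:

```python
import pandas as pd

df = pd.DataFrame({'cur': [11, 12, 16, 17, 13],
                   'other_yrs': [[11, 11],
                                 [16, 13, 12, 9, 9, 6, 6, 3, 3, 3, 2, 1, 0],
                                 [15, 85],
                                 [17, 17, 16],
                                 [8, 8]]})
u = df.explode('other_yrs')
# Series.sum(level=0) is gone in pandas >= 2.0; group on the index level instead
df['Count'] = u['cur'].ge(u['other_yrs']).groupby(level=0).sum().astype(int)
print(df['Count'].tolist())
# [2, 11, 1, 3, 2]
```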
If both columns contain millions of records and each element in the first column has to be compared with all the elements in the second column, the following might help:
for element in Dataframe1.Column1:
    Dataframe2[Dataframe2.Column2.isin([element])]
The snippet above returns, one at a time, the rows of Dataframe2 where the element from Dataframe1 is found in Column2.

Python: Membership testing way slower with frozenset than sets, tuples and lists?

I have been reading up for a few hours trying to understand membership testing and speeds as I fell down that rabbit hole. I thought I had gotten it, until I ran my own little timeit test.
Here's the code
range_ = range(20, -1, -1)
w = timeit.timeit('0 in {seq}'.format(seq=list(range_)))
x = timeit.timeit('0 in {seq}'.format(seq=tuple(range_)))
y = timeit.timeit('0 in {seq}'.format(seq=set(range_)))
z = timeit.timeit('0 in {seq}'.format(seq=frozenset(range_)))
print('list:', w)
print('tuple:', x)
print('set:', y)
print('frozenset:', z)
and here is the result
list: 0.3762843
tuple: 0.38087859999999996
set: 0.06568490000000005
frozenset: 1.5114070000000002
List and tuple having the same time makes sense.
I thought set and frozenset would have the same time as well, but frozenset is extremely slow, even compared to lists?
Changing the code to the following gives me similar results still:
list_ = list(range(20, -1, -1))
tuple_ = tuple(range(20, -1, -1))
set_ = set(range(20, -1, -1))
frozenset_ = frozenset(range(20, -1, -1))
w = timeit.timeit('0 in {seq}'.format(seq=list_))
x = timeit.timeit('0 in {seq}'.format(seq=tuple_))
y = timeit.timeit('0 in {seq}'.format(seq=set_))
z = timeit.timeit('0 in {seq}'.format(seq=frozenset_))
It's not the membership test, it's the construction that's taking the time.
Consider the following:
import timeit
list_ = list(range(20, -1, -1))
tuple_ = tuple(range(20, -1, -1))
set_ = set(range(20, -1, -1))
frozenset_ = frozenset(range(20, -1, -1))
w = timeit.timeit('0 in list_', globals=globals())
x = timeit.timeit('0 in tuple_', globals=globals())
y = timeit.timeit('0 in set_', globals=globals())
z = timeit.timeit('0 in frozenset_', globals=globals())
print('list:', w)
print('tuple:', x)
print('set:', y)
print('frozenset:', z)
I get the following timings with Python 3.5:
list: 0.28041897085495293
tuple: 0.2775509520433843
set: 0.0552431708201766
frozenset: 0.05547476885840297
The following will demonstrate why frozenset is so much slower by disassembling the code you're benchmarking:
import dis
def print_dis(code):
    print('{code}:'.format(code=code))
    dis.dis(code)
range_ = range(20, -1, -1)
print_dis('0 in {seq}'.format(seq=list(range_)))
print_dis('0 in {seq}'.format(seq=tuple(range_)))
print_dis('0 in {seq}'.format(seq=set(range_)))
print_dis('0 in {seq}'.format(seq=frozenset(range_)))
Its output is pretty self-explanatory:
0 in [20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]:
1 0 LOAD_CONST 0 (0)
3 LOAD_CONST 21 ((20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0))
6 COMPARE_OP 6 (in)
9 RETURN_VALUE
0 in (20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0):
1 0 LOAD_CONST 0 (0)
3 LOAD_CONST 21 ((20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0))
6 COMPARE_OP 6 (in)
9 RETURN_VALUE
0 in {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}:
1 0 LOAD_CONST 0 (0)
3 LOAD_CONST 21 (frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}))
6 COMPARE_OP 6 (in)
9 RETURN_VALUE
0 in frozenset({0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}):
1 0 LOAD_CONST 0 (0)
3 LOAD_NAME 0 (frozenset)
6 LOAD_CONST 0 (0)
9 LOAD_CONST 1 (1)
12 LOAD_CONST 2 (2)
15 LOAD_CONST 3 (3)
18 LOAD_CONST 4 (4)
21 LOAD_CONST 5 (5)
24 LOAD_CONST 6 (6)
27 LOAD_CONST 7 (7)
30 LOAD_CONST 8 (8)
33 LOAD_CONST 9 (9)
36 LOAD_CONST 10 (10)
39 LOAD_CONST 11 (11)
42 LOAD_CONST 12 (12)
45 LOAD_CONST 13 (13)
48 LOAD_CONST 14 (14)
51 LOAD_CONST 15 (15)
54 LOAD_CONST 16 (16)
57 LOAD_CONST 17 (17)
60 LOAD_CONST 18 (18)
63 LOAD_CONST 19 (19)
66 LOAD_CONST 20 (20)
69 BUILD_SET 21
72 CALL_FUNCTION 1 (1 positional, 0 keyword pair)
75 COMPARE_OP 6 (in)
78 RETURN_VALUE
This is because, among the four data types you converted the range object into, frozenset is the only one in Python 3 whose literal form requires a name lookup when evaluated. Name lookups are expensive: the name string must be hashed and then looked up through the local, global, and built-in namespaces:
>>> repr(list(range(3)))
'[0, 1, 2]'
>>> repr(tuple(range(3)))
'(0, 1, 2)'
>>> repr(set(range(3)))
'{0, 1, 2}'
>>> repr(frozenset(range(3)))
'frozenset([0, 1, 2])' # requires a name lookup when evaluated by timeit
In Python 2, sets also require a name lookup when converted by repr, which is why @NPE reported in the comments that there is little difference in performance between a frozenset and a set in Python 2:
>>> repr(set(range(3)))
'set([0, 1, 2])'
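A way to confirm that the cost is in construction rather than lookup is to move construction into timeit's setup argument, so only the in test itself is timed. This is a sketch; the absolute numbers will vary by machine:

```python
import timeit

for ctor in ('list', 'tuple', 'set', 'frozenset'):
    # construction runs once in setup; only the membership test is timed
    t = timeit.timeit('0 in seq',
                      setup='seq = {}(range(20, -1, -1))'.format(ctor),
                      number=100000)
    print('{}: {:.4f}'.format(ctor, t))
```

With construction excluded, set and frozenset should come out essentially identical, and both far faster than the linear scans over list and tuple.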

How to drop a value from a data series with a MultiIndex?

I have a data frame with the temperatures recorded per day/month/year.
Then I find the lowest temperature in each month using the groupby and min functions, which gives a data series with a MultiIndex.
How can I drop a value from a specific year and month? eg. year 2005 month 12?
# Find the lowest value per each month
[In] low = df.groupby([df['Date'].dt.year,df['Date'].dt.month])['Data_Value'].min()
[In] low
[Out]
Date Date
2005 1 -60
2 -114
3 -153
4 -13
5 -14
6 26
7 83
8 65
9 21
10 36
11 -36
12 -86
2006 1 -75
2 -53
3 -83
4 -30
5 36
6 17
7 85
8 82
9 66
10 40
11 -2
12 -32
2007 1 -63
2 -42
3 -21
4 -11
5 28
6 74
7 73
8 61
9 46
10 -33
11 -37
12 -97
[In] low.index
[Out] MultiIndex(levels=[[2005, 2006, 2007], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]],
labels=[[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]],
names=['Date', 'Date'])
This works.
#dummy data
mux = pd.MultiIndex.from_arrays([
    (2017,)*12 + (2018,)*12,
    list(range(1, 13))*2
], names=['year', 'month'])
df = pd.DataFrame({'value': np.random.randint(1, 20, len(mux))}, mux)
Then just use drop.
df.drop((2017, 12), inplace=True)
>>> print(df)
value
year month
2017 1 18
2 13
3 14
4 1
5 8
6 19
7 19
8 8
9 11
10 5
11 7 <<<
2018 1 9
2 18
3 9
4 14
5 7
6 4
7 6
8 12
9 12
10 1
11 19
12 10
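The same drop works directly on the groupby result from the question; pass the (year, month) tuple as the label. A small sketch with made-up temperature data:

```python
import pandas as pd

dates = pd.to_datetime(['2005-11-03', '2005-12-15', '2005-12-20', '2006-01-05'])
df = pd.DataFrame({'Date': dates, 'Data_Value': [-36, -86, -40, -75]})

# same groupby as in the question: lowest value per (year, month)
low = df.groupby([df['Date'].dt.year, df['Date'].dt.month])['Data_Value'].min()

# drop the 2005-12 entry by its MultiIndex label
low = low.drop((2005, 12))
print(low)
```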

Radius search in list of coordinates

I'm using python 2.7 and numpy (import numpy as np).
I have a list of x-y coordinates in the following shape:
coords = np.zeros((100, 2), dtype=np.int)
I have a list of values corresponding to these coordinates:
values = np.zeros(100, dtype=np.int)
My program is populating these arrays.
Now, for each coordinate, I want to find neighbours within radius r that have a non-zero value. What's the most efficient way to do that?
Demo:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
In [101]: np.random.seed(123)
In [102]: coords = np.random.rand(20, 2)
In [103]: r = 0.3
In [104]: d = pd.DataFrame(squareform(pdist(coords)))
In [105]: d
Out[105]:
0 1 2 3 4 5 6 7 8 9 10 11 12 \
0 0.000000 0.539313 0.138885 0.489671 0.240183 0.566555 0.343214 0.541508 0.525761 0.295906 0.566702 0.326087 0.045059
1 0.539313 0.000000 0.509028 0.765644 0.299834 0.212418 0.535287 0.253292 0.378472 0.305322 0.504946 0.501173 0.545672
2 0.138885 0.509028 0.000000 0.369830 0.240542 0.484970 0.459329 0.449965 0.591335 0.217102 0.434730 0.187983 0.100192
3 0.489671 0.765644 0.369830 0.000000 0.579235 0.639118 0.827519 0.585140 0.946945 0.474554 0.383486 0.266724 0.444612
4 0.240183 0.299834 0.240542 0.579235 0.000000 0.364005 0.335128 0.355671 0.368796 0.148598 0.482379 0.327450 0.251218
5 0.566555 0.212418 0.484970 0.639118 0.364005 0.000000 0.676135 0.055591 0.576447 0.272729 0.315123 0.399127 0.555655
6 0.343214 0.535287 0.459329 0.827519 0.335128 0.676135 0.000000 0.679527 0.281035 0.481218 0.813671 0.621056 0.387169
7 0.541508 0.253292 0.449965 0.585140 0.355671 0.055591 0.679527 0.000000 0.602427 0.245620 0.261309 0.350237 0.526773
8 0.525761 0.378472 0.591335 0.946945 0.368796 0.576447 0.281035 0.602427 0.000000 0.498845 0.811462 0.695304 0.559738
9 0.295906 0.305322 0.217102 0.474554 0.148598 0.272729 0.481218 0.245620 0.498845 0.000000 0.333842 0.208528 0.282959
10 0.566702 0.504946 0.434730 0.383486 0.482379 0.315123 0.813671 0.261309 0.811462 0.333842 0.000000 0.254850 0.533784
11 0.326087 0.501173 0.187983 0.266724 0.327450 0.399127 0.621056 0.350237 0.695304 0.208528 0.254850 0.000000 0.288072
12 0.045059 0.545672 0.100192 0.444612 0.251218 0.555655 0.387169 0.526773 0.559738 0.282959 0.533784 0.288072 0.000000
13 0.339648 0.350100 0.407307 0.769145 0.202592 0.501132 0.185248 0.511020 0.186913 0.347808 0.678357 0.527288 0.372879
14 0.530211 0.104003 0.473790 0.689158 0.303486 0.109841 0.589377 0.149459 0.468906 0.257676 0.404710 0.431203 0.527905
15 0.622118 0.178856 0.627453 0.923461 0.391044 0.387645 0.509836 0.431502 0.273610 0.450269 0.683313 0.656742 0.639993
16 0.337079 0.211995 0.297111 0.582175 0.113238 0.251168 0.434076 0.246505 0.403684 0.107671 0.409858 0.316172 0.337886
17 0.271897 0.311029 0.313864 0.668400 0.097022 0.424905 0.252905 0.426640 0.279160 0.243693 0.576241 0.422417 0.296806
18 0.664617 0.395999 0.554151 0.592343 0.504234 0.184188 0.833801 0.157951 0.758223 0.376555 0.212643 0.410605 0.642698
19 0.328445 0.719013 0.238085 0.186618 0.476045 0.642499 0.671657 0.594990 0.828653 0.413697 0.465589 0.245340 0.284878
13 14 15 16 17 18 19
0 0.339648 0.530211 0.622118 0.337079 0.271897 0.664617 0.328445
1 0.350100 0.104003 0.178856 0.211995 0.311029 0.395999 0.719013
2 0.407307 0.473790 0.627453 0.297111 0.313864 0.554151 0.238085
3 0.769145 0.689158 0.923461 0.582175 0.668400 0.592343 0.186618
4 0.202592 0.303486 0.391044 0.113238 0.097022 0.504234 0.476045
5 0.501132 0.109841 0.387645 0.251168 0.424905 0.184188 0.642499
6 0.185248 0.589377 0.509836 0.434076 0.252905 0.833801 0.671657
7 0.511020 0.149459 0.431502 0.246505 0.426640 0.157951 0.594990
8 0.186913 0.468906 0.273610 0.403684 0.279160 0.758223 0.828653
9 0.347808 0.257676 0.450269 0.107671 0.243693 0.376555 0.413697
10 0.678357 0.404710 0.683313 0.409858 0.576241 0.212643 0.465589
11 0.527288 0.431203 0.656742 0.316172 0.422417 0.410605 0.245340
12 0.372879 0.527905 0.639993 0.337886 0.296806 0.642698 0.284878
13 0.000000 0.408426 0.339019 0.274263 0.105627 0.668252 0.643427
14 0.408426 0.000000 0.282070 0.194058 0.345013 0.294029 0.663142
15 0.339019 0.282070 0.000000 0.344028 0.355134 0.568361 0.854775
16 0.274263 0.194058 0.344028 0.000000 0.181494 0.399730 0.513362
17 0.105627 0.345013 0.355134 0.181494 0.000000 0.581128 0.551910
18 0.668252 0.294029 0.568361 0.399730 0.581128 0.000000 0.649183
19 0.643427 0.663142 0.854775 0.513362 0.551910 0.649183 0.000000
result:
In [107]: d[(0 < d) & (d < r)].apply(lambda x: x.dropna().index.tolist())
Out[107]:
0 [2, 4, 9, 12, 17]
1 [4, 5, 7, 14, 15, 16]
2 [0, 4, 9, 11, 12, 16, 19]
3 [11, 19]
4 [0, 1, 2, 9, 12, 13, 16, 17]
5 [1, 7, 9, 14, 16, 18]
6 [8, 13, 17]
7 [1, 5, 9, 10, 14, 16, 18]
8 [6, 13, 15, 17]
9 [0, 2, 4, 5, 7, 11, 12, 14, 16, 17]
10 [7, 11, 18]
11 [2, 3, 9, 10, 12, 19]
12 [0, 2, 4, 9, 11, 17, 19]
13 [4, 6, 8, 16, 17]
14 [1, 5, 7, 9, 15, 16, 18]
15 [1, 8, 14]
16 [1, 2, 4, 5, 7, 9, 13, 14, 17]
17 [0, 4, 6, 8, 9, 12, 13, 16]
18 [5, 7, 10, 14]
19 [2, 3, 11, 12]
dtype: object
You can also do this with only numpy and scipy; I find it faster.
from scipy.spatial.distance import pdist, squareform
import numpy

SIZE = 512
N_PARTICLE = 100
RADIUS = 15
VALUE_THRESHOLD = 0

coords = numpy.random.randint(0, SIZE, size=(N_PARTICLE, 2))
values = numpy.random.randint(0, 2, N_PARTICLE)
square_dist = squareform(pdist(coords, metric='euclidean'))

condlist = []
for i, row in enumerate(square_dist):
    condlist.append(numpy.where((values > VALUE_THRESHOLD) & (row < RADIUS) & (row > 0))[0].tolist())
There must be a better way to do this, though.
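Indeed, for large inputs a spatial index avoids building the O(n²) distance matrix at all; scipy's cKDTree supports radius queries directly. A sketch of the same search (not from the original answers) using query_ball_point:

```python
import numpy as np
from scipy.spatial import cKDTree

np.random.seed(0)
coords = np.random.randint(0, 512, size=(100, 2))
values = np.random.randint(0, 2, 100)
r = 15

tree = cKDTree(coords)
# query_ball_point returns, for each query point, the indices within distance r (inclusive)
neighbours = []
for i, idxs in enumerate(tree.query_ball_point(coords, r)):
    # drop the point itself and keep only non-zero-valued neighbours
    neighbours.append(sorted(j for j in idxs if j != i and values[j] > 0))
```

Construction is O(n log n) and each radius query is roughly O(log n + k), so this scales to point counts where an n-by-n distance matrix would not fit in memory.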

How to generate split indices for n-fold cross-validation

I have a requirement to generate a split for cross-validation; say s is an index of records:
s = [1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20]
Now I want to randomly shuffle and split the data into 5 folds; typically I want output something like this:
s = [[1 5 4 6], [2,3, 19,20], [... ], [... ], [.. ]]
Note: In each array numbers should be unique, it should not repeat
I know I can use chunk(), but chunk only splits sequentially, like 1-4, 5-8, ...
Can anyone help me on this ?
Shuffle your array using random.shuffle and split it into 5 pieces:
For Python2 use
import random
s = range(1, 21)
random.shuffle(s)
s = [s[i::5] for i in range(5)]
or for Python3:
import random
s = list(range(1, 21))
random.shuffle(s)
s = [s[i::5] for i in range(5)]
import random
s = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
print [random.sample(s, 5) for i in xrange(len(s) / 5)]
(Note that random.sample draws each fold independently from the full list, so the same index can appear in more than one fold.)
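If numpy is acceptable, shuffle-and-split is two calls, and np.array_split guarantees every index lands in exactly one fold (a sketch under that assumption):

```python
import numpy as np

rng = np.random.default_rng(42)   # seeded only to make the sketch reproducible
s = np.arange(1, 21)
rng.shuffle(s)                    # in-place shuffle
folds = np.array_split(s, 5)      # 5 folds of 4, no index repeated across folds
print([f.tolist() for f in folds])
```

array_split also handles lengths that do not divide evenly, making the leading folds one element longer.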
