How to select ranges of values in pandas?

Newbie question.
My dataframe looks like this:
class A B
0 1 3.767809 11.016
1 1 2.808231 4.500
2 1 4.822522 1.008
3 2 5.016933 -3.636
4 2 6.036203 -5.220
5 2 7.234567 -6.696
6 2 5.855065 -7.272
7 4 4.116770 -8.208
8 4 2.628000 -10.296
9 4 1.539184 -10.728
10 3 0.875918 -10.116
11 3 0.569210 -9.072
12 3 0.676379 -7.632
13 3 0.933921 -5.436
14 3 0.113842 -3.276
15 3 0.367129 -2.196
16 1 0.968661 -1.980
17 1 0.160997 -2.736
18 1 0.469383 -2.232
19 1 0.410463 -2.340
20 1 0.660872 -2.484
I would like to get groups where class is the same, like:
class 1: rows 0..2
class 2: rows 3..6
class 4: rows 7..9
class 3: rows 10..15
class 1: rows 16..20
The reason is that order matters: my requirements say class 4 can only occur between classes 1 and 2, and if after prediction we get class 4 following a 2, it should be treated as a 2.

Build a new column to identify each group:
df['group'] = df['class'].diff().ne(0).cumsum()
df.groupby('group')['group'].apply(lambda x: x.index)
Out[106]:
group
1 Int64Index([0, 1, 2], dtype='int64')
2 Int64Index([3, 4, 5, 6], dtype='int64')
3 Int64Index([7, 8, 9], dtype='int64')
4 Int64Index([10, 11, 12, 13, 14, 15], dtype='in...
5 Int64Index([16, 17, 18, 19, 20], dtype='int64')
Name: group, dtype: object
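Building on that group column, each run can be summarised into a single row (its class plus start and stop index), which makes ordering rules like "class 4 must sit between 1 and 2" easy to check. The run-summary shape below is just an assumed sketch of what that check might need, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({'class': [1, 1, 1, 2, 2, 2, 2, 4, 4, 4,
                             3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 1]})

# label consecutive runs of the same class (same trick as above)
df['group'] = df['class'].diff().ne(0).cumsum()

# one row per run: the class it holds and its first/last row index
runs = (df.reset_index()
          .groupby('group')
          .agg(cls=('class', 'first'),
               start=('index', 'min'),
               stop=('index', 'max')))
print(runs)
```

This reproduces the "class 1: rows 0..2, class 2: rows 3..6, ..." summary from the question, and a follow-up rule such as "a class-4 run not preceded by 1 and followed by 2 becomes class 2" can then be applied per run instead of per row.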


moving last two dataframe rows

I'm trying to move the last two rows up:
import pandas as pd
df = pd.DataFrame({
"A" : [1,2,3,4],
"C": [5, 6, 7, 8],
"D": [9, 10, 11, 12],
"E": [13, 14, 15, 16],
})
print(df)
Output:
A C D E
0 1 5 9 13
1 2 6 10 14
2 3 7 11 15
3 4 8 12 16
Desired output:
A C D E
0 3 7 11 15
1 4 8 12 16
2 1 5 9 13
3 2 6 10 14
I was able to move the last row using
df = df.reindex(np.roll(df.index, shift=1))
But I can't get the second-to-last row to move as well. Any advice on the most efficient way to do this without creating a copy of the dataframe?
Using your code, you can just change the roll's shift value.
import pandas as pd
import numpy as np
df = pd.DataFrame({
"A" : [1,2,3,4],
"C": [5, 6, 7, 8],
"D": [9, 10, 11, 12],
"E": [13, 14, 15, 16],
})
df = df.reindex(np.roll(df.index, shift=2), copy=False)
df.reset_index(inplace=True, drop=True)
print(df)
A C D E
0 3 7 11 15
1 4 8 12 16
2 1 5 9 13
3 2 6 10 14
The shift value controls how many rows the roll moves; afterwards we just reset the index of the dataframe so that it goes back to 0, 1, 2, 3.
Based on the comment about wanting to swap indexes 0 and 1 around, we can use an answer in @CatalinaChou's link to do that. I'm choosing to do it after the roll, so we only have to contend with indexes 0 and 1 once everything has been shifted.
# continuing from where the last code fence ends
swap_indexes = {1: 0, 0: 1}
df.rename(swap_indexes, inplace=True)
df.sort_index(inplace=True)
print(df)
A C D E
0 4 8 12 16
1 3 7 11 15
2 1 5 9 13
3 2 6 10 14
A notable difference is the use of inplace=True, which prevents chaining the methods, but it keeps us from copying the dataframe as far as possible (I'm not sure whether df.reindex makes an internal copy even with copy=False).
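If a copy is acceptable after all (which relaxes the OP's no-copy constraint, so treat this as an assumption for comparison), the same roll can be expressed positionally with iloc, which sidesteps reindexing by labels entirely:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "C": [5, 6, 7, 8],
    "D": [9, 10, 11, 12],
    "E": [13, 14, 15, 16],
})

# roll the positional order by 2 and take rows by position;
# note this produces a new dataframe, unlike the reindex(copy=False) route
out = df.iloc[np.roll(np.arange(len(df)), 2)].reset_index(drop=True)
print(out)
```

The result matches the desired output above (rows 2, 3, 0, 1 with a fresh 0-3 index).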

Pandas dataframe with N columns

I need to use Python with Pandas to write a DataFrame with N columns. This is a simplified version of what I have:
Ind=[[1, 2, 3],[4, 5, 6],[7, 8, 9],[10, 11, 12]]
DAT = pd.DataFrame([Ind[0],Ind[1],Ind[2],Ind[3]], index=None).T
DAT.head()
Out
0 1 2 3
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
This is the result that I want, but my real Ind has 121 sets of points and I really don't want to write each one in the DataFrame's argument. Is there a way to write this easily? I tried using a for loop, but that didn't work out.
You can just pass the list directly:
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
df = pd.DataFrame(data, index=None).T
df.head()
Outputs:
0 1 2 3
0 1 4 7 10
1 2 5 8 11
2 3 6 9 12
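If you would rather skip the transpose entirely, one sketch is to build the frame from a dict comprehension so each inner list becomes a column directly; the integer column names here are just the list positions, an assumption for illustration:

```python
import pandas as pd

ind = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# each inner list becomes a column keyed by its position in ind
df = pd.DataFrame({i: col for i, col in enumerate(ind)})
print(df)
```

This scales to the 121 sets of points without writing each one out, and gives the same shape as the transposed version.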

Dict to df if value is a list of lists

I have following dictionary:
my_dict = dict([(779825550, [[346583, 2, 305.98, 9]]), (779825605, [[276184, 2, 169.5, 15], [331465, 2, 214.5, 15], [276184, 2, 169.5, 15], [331465, 2, 214.5, 15], [637210, 2, 368.5, 15], [249559, 2, 133.46, 15], [591652, 2, 132.0, 15], [216367, 2, 142.5, 14]]), (779825644, [[568025, 13, 494.5, 15]]), (779825657, [[75366, 18, 43.26, 9]])])
I need to convert this dict into a pandas df. In each row I need the my_dict key (that is 779825550, 779825605, etc.) followed by the values in the list of lists. So the first row will be: 779825550, 346583, 2, 305.98, 9. If there are more lists in the list (as for 779825605), I need more rows with the same key in the first column (that is 779825605, 276184, 2, 169.5, 15 and 779825605, 331465, 2, 214.5, 15, etc.). How can I do this please?
I tried:
df = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in my_dict.items() ]))
but it gives me a wrong result. Thanks
You can flatten the nested lists, prepending k to each inner list with * unpacking, and pass the result to the DataFrame constructor:
df = pd.DataFrame((k, *x) for k,v in my_dict.items() for x in v)
print (df)
0 1 2 3 4
0 779825550 346583 2 305.98 9
1 779825605 276184 2 169.50 15
2 779825605 331465 2 214.50 15
3 779825605 276184 2 169.50 15
4 779825605 331465 2 214.50 15
5 779825605 637210 2 368.50 15
6 779825605 249559 2 133.46 15
7 779825605 591652 2 132.00 15
8 779825605 216367 2 142.50 14
9 779825644 568025 13 494.50 15
10 779825657 75366 18 43.26 9
Your approach can be fixed by building a DataFrame per key and combining them with concat:
df = pd.concat(dict((k,pd.DataFrame(v)) for k,v in my_dict.items()))
print (df)
0 1 2 3
779825550 0 346583 2 305.98 9
779825605 0 276184 2 169.50 15
1 331465 2 214.50 15
2 276184 2 169.50 15
3 331465 2 214.50 15
4 637210 2 368.50 15
5 249559 2 133.46 15
6 591652 2 132.00 15
7 216367 2 142.50 14
779825644 0 568025 13 494.50 15
779825657 0 75366 18 43.26 9
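To get the key back as a regular column with readable headers, a variant of the first solution is to name the columns up front; the column names below are invented for illustration, since the question doesn't say what the values mean:

```python
import pandas as pd

my_dict = {779825550: [[346583, 2, 305.98, 9]],
           779825644: [[568025, 13, 494.5, 15]]}

# same flattening as above, but with explicit (made-up) column names
df = pd.DataFrame(((k, *row) for k, v in my_dict.items() for row in v),
                  columns=['key', 'item_id', 'qty', 'price', 'code'])
print(df)
```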

Compare each row of a data frame with all rows of another one

EDIT
@ashkangh's answer to the original question is perfectly fine, but the question itself turned out to be less trivial: for df1, not all possible values for width and thickness are given, only min and max values. Moreover, width and thickness in both data frames are, in general, floats (not integers).
So, I have two data frames df1 and df2:
import pandas as pd
d1 = {'order_id': range(3),
'width': [[0.9, 3.1], [0.7, 2.5], [1.9, 3.3]],
'thickness': [[9.9, 11.1], [11.7, 14.4], [9.1, 13.2]]}
df1 = pd.DataFrame(d1)
df1
order_id width thickness
0 0 [0.9, 3.1] [9.9, 11.1]
1 1 [0.7, 2.5] [11.7, 14.4]
2 2 [1.9, 3.3] [9.1, 13.2]
d2 = {'piece_id': range(10, 15),
'width': [2, 3, 3, 1, 2],
'thickness':[10, 15, 9, 11, 12]}
df2 = pd.DataFrame(d2)
df2
piece_id width thickness
0 10 2 10
1 11 3 15
2 12 3 9
3 13 1 11
4 14 2 12
Now I want to find what pieces from df2 are okay for what orders in df1. I.e. if df2.width is in df1.width and df2.thickness is in df1.thickness.
So, the desired output should be:
order_id piece_id width thickness
0 0 10 True True
1 0 11 True False
2 0 12 True False
3 0 13 True True
4 0 14 True False
5 1 10 True False
6 1 11 False False
7 1 12 False False
8 1 13 True False
9 1 14 True True
10 2 10 True True
11 2 11 True False
12 2 12 True False
13 2 13 False True
14 2 14 True True
Or, even better (only suitable order_id-piece_id pairs are kept),
order_id piece_id
0 0 10
1 0 13
2 1 14
3 2 10
4 2 14
I can do it with loops, but the data frames can be rather big (10^3 to 10^5 rows), so I'm wondering if there is a smarter pandas solution.
ORIGINAL DATA FRAMES
(width and thickness in df1 were given explicitly and had only integers.)
import pandas as pd
d1 = {'order_id': range(3),
'width': [[1, 2, 3], [1, 2], [2, 3]],
'thickness': [[10, 11], [12, 13, 14], [10, 11, 12, 13]]}
df1 = pd.DataFrame(d1)
df1
order_id width thickness
0 0 [1, 2, 3] [10, 11]
1 1 [1, 2] [12, 13, 14]
2 2 [2, 3] [10, 11, 12, 13]
d2 = {'piece_id': range(10, 15),
'width': [2, 3, 3, 1, 2],
'thickness':[10, 15, 9, 11, 12]}
df2 = pd.DataFrame(d2)
df2
piece_id width thickness
0 10 2 10
1 11 3 15
2 12 3 9
3 13 1 11
4 14 2 12
Using the explode method and merge, you can get your result:
df1.explode('width').explode('thickness')\
.merge(df2, on=['width', 'thickness'], how='inner')[['order_id', 'piece_id']]
Output:
order_id piece_id
0 0 13
1 0 10
2 2 10
3 1 14
4 2 14
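The explode/merge answer covers the original integer case, where every admissible value is listed. For the edited min/max version, one possible sketch (assuming pandas >= 1.2 for merge's how='cross') splits the bounds into scalar columns, cross-joins the two frames, and filters with between; the w_min/w_max/t_min/t_max names are made up here:

```python
import pandas as pd

d1 = {'order_id': range(3),
      'width': [[0.9, 3.1], [0.7, 2.5], [1.9, 3.3]],
      'thickness': [[9.9, 11.1], [11.7, 14.4], [9.1, 13.2]]}
df1 = pd.DataFrame(d1)
d2 = {'piece_id': range(10, 15),
      'width': [2, 3, 3, 1, 2],
      'thickness': [10, 15, 9, 11, 12]}
df2 = pd.DataFrame(d2)

# pull the [min, max] pairs apart into scalar bound columns
bounds = df1.assign(
    w_min=df1['width'].str[0], w_max=df1['width'].str[1],
    t_min=df1['thickness'].str[0], t_max=df1['thickness'].str[1],
)[['order_id', 'w_min', 'w_max', 't_min', 't_max']]

# cross join every order with every piece, then keep in-range pairs
pairs = bounds.merge(df2, how='cross')
ok = pairs[
    pairs['width'].between(pairs['w_min'], pairs['w_max'])
    & pairs['thickness'].between(pairs['t_min'], pairs['t_max'])
]
result = ok[['order_id', 'piece_id']].reset_index(drop=True)
print(result)
```

This reproduces the compact desired output (pairs 0-10, 0-13, 1-14, 2-10, 2-14). The cross join is len(df1) * len(df2) rows, so at 10^5 rows on each side it may need chunking, but it stays vectorised with no Python-level loops.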

Finding consecutive segments in a pandas data frame

I have a pandas.DataFrame with measurements taken at consecutive points in time. Along with each measurement the system under observation had a distinct state at each point in time. Hence, the DataFrame also contains a column with the state of the system at each measurement. State changes are much slower than the measurement interval. As a result, the column indicating the states might look like this (index: state):
1: 3
2: 3
3: 3
4: 3
5: 4
6: 4
7: 4
8: 4
9: 1
10: 1
11: 1
12: 1
13: 1
Is there an easy way to retrieve the indices of each segment of consecutively equal states? That means I would like to get something like this:
[[1,2,3,4], [5,6,7,8], [9,10,11,12,13]]
The result might also be in something different than plain lists.
The only solution I could think of so far is manually iterating over the rows, finding segment change points and reconstructing the indices from these change points, but I have the hope that there is an easier solution.
One-liner:
df.reset_index().groupby('A')['index'].apply(np.array)
Code for example:
In [1]: import numpy as np
In [2]: from pandas import *
In [3]: df = DataFrame([3]*4+[4]*4+[1]*4, columns=['A'])
In [4]: df
Out[4]:
A
0 3
1 3
2 3
3 3
4 4
5 4
6 4
7 4
8 1
9 1
10 1
11 1
In [5]: df.reset_index().groupby('A')['index'].apply(np.array)
Out[5]:
A
1 [8, 9, 10, 11]
3 [0, 1, 2, 3]
4 [4, 5, 6, 7]
You can also directly access the information from the groupby object:
In [1]: grp = df.groupby('A')
In [2]: grp.indices
Out[2]:
{1: array([ 8, 9, 10, 11], dtype=int64),
 3: array([0, 1, 2, 3], dtype=int64),
 4: array([4, 5, 6, 7], dtype=int64)}
In [3]: grp.indices[3]
Out[3]: array([0, 1, 2, 3], dtype=int64)
To address the situation that DSM mentioned you could do something like:
In [1]: df['block'] = (df.A.shift(1) != df.A).astype(int).cumsum()
In [2]: df
Out[2]:
A block
0 3 1
1 3 1
2 3 1
3 3 1
4 4 2
5 4 2
6 4 2
7 4 2
8 1 3
9 1 3
10 1 3
11 1 3
12 3 4
13 3 4
14 3 4
15 3 4
Now groupby both columns and apply the lambda function:
In [77]: df.reset_index().groupby(['A','block'])['index'].apply(np.array)
Out[77]:
A block
1 3 [8, 9, 10, 11]
3 1 [0, 1, 2, 3]
4 [12, 13, 14, 15]
4 2 [4, 5, 6, 7]
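On current pandas, the same shift/cumsum idea condenses into a ne/shift comparison that yields the question's list-of-lists directly; this is a sketch assuming the states live in a Series s carrying the question's 1-based index:

```python
import pandas as pd

# the states column from the question, indexed 1..13
s = pd.Series([3] * 4 + [4] * 4 + [1] * 5, index=range(1, 14))

# a new block starts wherever the value differs from the previous row
blocks = s.ne(s.shift()).cumsum()

# collect the index labels of each block, in order
segments = s.index.to_series().groupby(blocks).apply(list).tolist()
print(segments)
```

Unlike grouping on the state value alone, this keeps two separate runs of the same state apart, so it also handles the repeated-block case addressed above.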
You could use np.diff() to test where a segment starts/ends and iterate over those results. It's a very simple solution, so probably not the most performant one.
import numpy as np
a = np.array([3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 1, 1, 1, 1, 4, 4, 12, 12, 12])
prev = 0
splits = np.append(np.where(np.diff(a) != 0)[0], len(a) + 1) + 1
for split in splits:
    print(np.arange(1, a.size + 1, 1)[prev:split])
    prev = split
Results in:
[1 2 3 4 5]
[ 6 7 8 9 10]
[11 12 13 14]
[15 16]
[17 18 19]
