Pandas concat similar DataFrames and Series - python

I have a list of DataFrames, all with the same columns. Occasionally, one of them has only one row and is therefore a Series. When I concatenate the list with pd.concat, the entry that was a Series ends up with what I want to be the columns in the index. See below for a minimal working example.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: d = {'a':np.random.randn(100), 'b':np.random.randn(100)}
In [4]: df = pd.DataFrame(d)
In [5]: thing1 = df.iloc[:10, :]
In [6]: thing1
Out[6]:
a b
0 -0.505268 -1.109089
1 -1.792729 -0.580566
2 -0.478042 0.410095
3 -0.758376 0.558772
4 0.112519 0.556316
5 -1.015813 -0.568148
6 1.234858 -1.062879
7 -0.455796 -0.107942
8 1.231422 0.780694
9 -1.082461 -1.809412
In [7]: thing2 = df.iloc[10,:]
In [8]: thing2
Out[8]:
a -1.527836
b 0.653610
Name: 10, dtype: float64
In [9]: thing3 = df.iloc[11:, :]
In [10]: thing3
Out[10]:
a b
11 -1.247939 -0.694491
12 1.359737 0.625284
13 -0.491533 -0.230665
14 1.360465 0.472451
15 0.691532 -1.822708
16 0.938316 1.310101
17 0.485776 -0.313206
18 1.398189 -0.232446
19 -0.626278 0.714052
20 -1.292272 -1.299580
21 -1.521746 -1.615611
22 1.464332 2.839602
23 0.707370 -0.162056
24 -1.825903 0.000278
25 0.917284 -0.094716
26 -0.239839 0.132572
27 -0.463240 -0.805458
28 1.174125 0.131057
29 0.183503 0.328603
30 0.045839 -0.244965
31 0.449265 0.642082
32 2.381600 -0.417044
33 0.276217 -0.257426
34 0.755067 0.012898
35 0.130339 -0.094300
36 -1.643097 0.038982
37 0.895719 0.789494
38 0.701480 -0.668440
39 -0.201400 1.441928
40 -2.018043 -0.106764
.. ... ...
70 0.971799 0.298164
71 1.307070 -2.093075
72 -1.049177 2.183065
73 -0.469273 -0.739449
74 0.685838 2.579547
75 1.994485 0.783204
76 -0.414760 -0.285766
77 -1.005873 -0.783886
78 1.486588 -0.349575
79 1.417006 -0.676501
80 1.284611 -0.817505
81 -0.624406 -1.659931
82 -0.921061 0.424663
83 -0.645472 -0.769509
84 -1.217172 -0.943542
85 -0.184948 0.482977
86 -0.253972 -0.080682
87 -0.699122 0.368751
88 1.391163 0.042899
89 -0.075512 0.019728
90 0.449151 0.486462
91 -0.182553 0.876379
92 -0.209162 0.390093
93 0.789094 1.570251
94 -1.018724 -0.084603
95 1.109534 1.840739
96 0.774806 -0.380387
97 0.534344 1.165343
98 1.003597 -0.221899
99 -0.659863 -1.061590
[89 rows x 2 columns]
In [11]: pd.concat([thing1, thing2, thing3])
Out[11]:
a b 0
0 -0.505268 -1.109089 NaN
1 -1.792729 -0.580566 NaN
2 -0.478042 0.410095 NaN
3 -0.758376 0.558772 NaN
4 0.112519 0.556316 NaN
5 -1.015813 -0.568148 NaN
6 1.234858 -1.062879 NaN
7 -0.455796 -0.107942 NaN
8 1.231422 0.780694 NaN
9 -1.082461 -1.809412 NaN
a NaN NaN -1.527836
b NaN NaN 0.653610
11 -1.247939 -0.694491 NaN
12 1.359737 0.625284 NaN
13 -0.491533 -0.230665 NaN
14 1.360465 0.472451 NaN
15 0.691532 -1.822708 NaN
16 0.938316 1.310101 NaN
17 0.485776 -0.313206 NaN
18 1.398189 -0.232446 NaN
19 -0.626278 0.714052 NaN
20 -1.292272 -1.299580 NaN
21 -1.521746 -1.615611 NaN
22 1.464332 2.839602 NaN
23 0.707370 -0.162056 NaN
24 -1.825903 0.000278 NaN
25 0.917284 -0.094716 NaN
26 -0.239839 0.132572 NaN
27 -0.463240 -0.805458 NaN
28 1.174125 0.131057 NaN
.. ... ... ...
70 0.971799 0.298164 NaN
71 1.307070 -2.093075 NaN
72 -1.049177 2.183065 NaN
73 -0.469273 -0.739449 NaN
74 0.685838 2.579547 NaN
75 1.994485 0.783204 NaN
76 -0.414760 -0.285766 NaN
77 -1.005873 -0.783886 NaN
78 1.486588 -0.349575 NaN
79 1.417006 -0.676501 NaN
80 1.284611 -0.817505 NaN
81 -0.624406 -1.659931 NaN
82 -0.921061 0.424663 NaN
83 -0.645472 -0.769509 NaN
84 -1.217172 -0.943542 NaN
85 -0.184948 0.482977 NaN
86 -0.253972 -0.080682 NaN
87 -0.699122 0.368751 NaN
88 1.391163 0.042899 NaN
89 -0.075512 0.019728 NaN
90 0.449151 0.486462 NaN
91 -0.182553 0.876379 NaN
92 -0.209162 0.390093 NaN
93 0.789094 1.570251 NaN
94 -1.018724 -0.084603 NaN
95 1.109534 1.840739 NaN
96 0.774806 -0.380387 NaN
97 0.534344 1.165343 NaN
98 1.003597 -0.221899 NaN
99 -0.659863 -1.061590 NaN
[101 rows x 3 columns]
Please note that for this problem, I need to maintain the original index.
I've spent a long time investigating the documentation but can't seem to figure out my problem. Is there an easy way around this?

thing2 = pd.DataFrame(thing2).transpose()
pd.concat([thing1, thing2, thing3])
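Wrapping the Series in a DataFrame and calling transpose() turns the Series index into the columns, so the row lines up with the other frames and concatenates cleanly. The Series name (10 here) becomes the row label, so the original index is preserved.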
Documentation here : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.transpose.html
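The same idea can be applied to a whole list in which some entries are Series. The following is a minimal sketch, assuming every Series in the list represents exactly one row; the helper name as_frame is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.standard_normal(100), "b": rng.standard_normal(100)})

# An integer position returns a Series; slices return DataFrames
pieces = [df.iloc[:10, :], df.iloc[10, :], df.iloc[11:, :]]


def as_frame(obj):
    """Hypothetical helper: turn a one-row Series into a one-row DataFrame."""
    if isinstance(obj, pd.Series):
        # to_frame() gives one column named after the Series; .T makes it a row
        return obj.to_frame().T
    return obj


combined = pd.concat([as_frame(p) for p in pieces])
print(combined.shape)
print(combined.loc[[9, 10, 11]])
```

If you control the slicing yourself, df.iloc[[10], :] (a list of positions instead of a single integer) returns a one-row DataFrame directly, so no conversion is needed. Note that when the columns have mixed dtypes, a row that has passed through a Series will likely come back as object dtype.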

Related

How to perform operations with columns from different datasets with different indexation?

The goal
A bit of background, to get familiar with variables and understand what the problem is:
floor, square, matc and volume are tables or dataframes, all share same column "id" (which simply goes from 1 to 100), so every row is unique;
floor and square also share column "room_name";
volume is generally equivalent to floor, except all rows with rooms ("room_name") that have no values in "square" column of square dataframe were dropped; This implies that some values of "id" are missing
That done, I needed to create a new column in volume dataframe, which would consist of multiplication of one of its own columns with two other columns from matc and square dataframes.
The problem
This seemingly simple interaction turned out to be quite difficult, because, well, the columns I am working with are of different length (except for square and matc, they are the same) and I need to align them by "id". To make matters worse, when called directly as volume['coefLoosening'] (please note that coefLoosening does not originate from floor and is added after the table is created), it returns a series with its own index and no way to relate it to "id".
What I tried
Whilst trying to solve the issue, I came up with this abomination:
volume = volume.merge(pd.DataFrame({"id": matc.loc[matc["id"].isin(volume["id"])]["id"], "tempCoef": volume['coefLoosening'] * matc.loc[matc["id"].isin(volume["id"])]['width'] * square.loc[square["id"].isin(volume["id"])]['square']}), how = "left", on = ["id"])
This, however, misaligns the "id" column completely and somehow creates more rows. For instance, this is what `` returns:
index id tempCoef
0 1.0 960.430612244898
1 2.0 4665.499999999999
2 NaN NaN
3 4.0 2425.44652173913
4 5.0 5764.964210526316
5 6.0 55201.68727272727
6 NaN NaN
7 NaN NaN
8 NaN NaN
9 10.0 1780.7208791208789
10 11.0 6075.385074626865
11 12.0 10400.94
12 13.0 31.378285714285713
13 NaN NaN
14 NaN NaN
15 NaN NaN
16 17.0 10505.431451612903
17 18.0 1208.994845360825
18 NaN NaN
19 NaN NaN
20 21.0 568.8900000000001
21 22.0 4275.416470588235
22 NaN NaN
23 NaN NaN
24 25.0 547.04
25 26.0 2090.666111111111
26 27.0 2096.88406779661
27 NaN NaN
28 29.0 8324.566547619048
29 NaN NaN
30 NaN NaN
31 NaN NaN
32 33.0 2459.8314736842103
33 34.0 2177.778461538461
34 35.0 166.1257142857143
35 36.0 1866.8492307692304
36 37.0 3598.1470588235293
37 38.0 21821.709411764703
38 NaN NaN
39 40.0 2999.248
40 41.0 980.3136
41 42.0 2641.3503947368426
42 NaN NaN
43 44.0 25829.878148148146
44 45.0 649.3632
45 46.0 10895.386666666667
46 NaN NaN
47 NaN NaN
48 49.0 825.9879310344828
49 50.0 15951.941666666671
50 51.0 2614.9343434343436
51 52.0 2462.30625
52 NaN NaN
53 NaN NaN
54 55.0 1366.8287671232877
55 56.0 307.38
56 57.0 11601.975
57 58.0 1002.5415730337081
58 59.0 2493.4532432432434
59 60.0 981.7482608695652
61 62.0 NaN
63 64.0 NaN
65 66.0 NaN
67 68.0 NaN
73 74.0 NaN
75 76.0 NaN
76 77.0 NaN
77 78.0 NaN
78 79.0 NaN
80 81.0 NaN
82 83.0 NaN
84 85.0 NaN
88 89.0 NaN
89 90.0 NaN
90 91.0 NaN
92 93.0 NaN
94 95.0 NaN
95 96.0 NaN
97 98.0 NaN
98 99.0 NaN
99 100.0 NaN
For clarity, none of the columns involved in the operation contain NaNs.
This is what `volume["coefLoosening"]` returns:
0 1.020408
1 1.515152
2 2.000000
3 4.347826
4 5.263158
5 9.090909
6 1.162791
7 1.149425
8 1.851852
9 1.098901
10 1.492537
11 2.083333
12 1.428571
13 1.010101
14 1.562500
15 3.448276
16 1.612903
17 1.030928
18 33.333333
19 1.000000
20 1.123596
21 1.960784
22 2.127660
23 2.857143
24 1.369863
25 1.111111
26 1.694915
27 1.492537
28 1.190476
29 1.818182
30 1.612903
31 12.500000
32 1.052632
33 3.846154
34 2.040816
35 1.098901
36 2.941176
37 2.941176
38 2.857143
39 1.111111
40 1.333333
41 1.315789
42 3.703704
43 3.703704
44 2.000000
45 33.333333
46 12.500000
47 1.149425
48 1.724138
49 4.166667
50 1.010101
51 1.041667
52 1.162791
53 3.225806
54 1.369863
55 1.666667
56 4.545455
57 1.123596
58 1.351351
59 2.173913
and finally, this is what volume["id"] returns (to compare to the result of «abomination»):
0 1
1 2
2 4
3 5
4 6
5 10
6 11
7 12
8 13
9 17
10 18
11 21
12 22
13 25
14 26
15 27
16 29
17 33
18 34
19 35
20 36
21 37
22 38
23 40
24 41
25 42
26 44
27 45
28 46
29 49
30 50
31 51
32 52
33 55
34 56
35 57
36 58
37 59
38 60
39 62
40 64
41 66
42 68
43 74
44 76
45 77
46 78
47 79
48 81
49 83
50 85
51 89
52 90
53 91
54 93
55 95
56 96
57 98
58 99
59 100
Some thoughts
I believe, part of the problem is how pandas returns columns (as series with default indexation) and I don't know how to work around that.
Another source of the problem might be how .loc returns its result. In the case of matc.loc[matc["id"].isin(volume["id"])]['width'] it is:
0 15.98
1 36.12
3 32.19
4 18.54
5 98.96
9 64.56
10 58.20
11 55.08
12 3.84
16 77.31
17 15.25
20 63.21
21 76.32
24 10.52
25 54.65
26 95.46
28 79.67
32 57.01
33 27.54
34 7.36
35 36.44
36 23.64
37 78.98
39 92.19
40 31.26
41 61.71
43 70.07
44 10.91
45 4.24
48 7.35
49 46.70
50 97.69
51 32.03
54 13.50
55 42.30
56 94.71
57 37.49
58 57.86
59 50.29
61 18.18
63 88.26
65 4.28
67 28.89
73 4.05
75 22.37
76 52.20
77 98.29
78 72.98
80 6.07
82 35.80
84 64.16
88 23.60
89 45.05
90 21.14
92 31.21
94 46.04
95 7.15
97 27.70
98 31.93
99 79.62
which is shifted by -1 and I don't see a way to change this manually.
So, any ideas? Maybe an analogous question has already been answered (I searched before asking, but found nothing)?
Data
Minimal columns of tables required to replicate this (because stack overflow does not allow files to be uploaded)
volume:
index,id,room_name,coefLoosening
0,1,6,1.0204081632653061
1,2,7,1.5151515151515151
2,4,3,2.0
3,5,7,4.3478260869565215
4,6,4,5.2631578947368425
5,10,7,9.090909090909092
6,11,5,1.1627906976744187
7,12,4,1.1494252873563218
8,13,1,1.8518518518518516
9,17,3,1.0989010989010988
10,18,3,1.4925373134328357
11,21,3,2.0833333333333335
12,22,7,1.4285714285714286
13,25,3,1.0101010101010102
14,26,6,1.5625
15,27,6,3.4482758620689657
16,29,4,1.6129032258064517
17,33,2,1.0309278350515465
18,34,2,33.333333333333336
19,35,5,1.0
20,36,4,1.1235955056179776
21,37,2,1.9607843137254901
22,38,6,2.127659574468085
23,40,5,2.857142857142857
24,41,6,1.36986301369863
25,42,3,1.1111111111111112
26,44,2,1.6949152542372883
27,45,4,1.4925373134328357
28,46,2,1.1904761904761905
29,49,5,1.8181818181818181
30,50,4,1.6129032258064517
31,51,2,12.5
32,52,3,1.0526315789473684
33,55,6,3.846153846153846
34,56,5,2.0408163265306123
35,57,5,1.0989010989010988
36,58,4,2.941176470588235
37,59,5,2.941176470588235
38,60,5,2.857142857142857
39,62,7,1.1111111111111112
40,64,7,1.3333333333333333
41,66,7,1.3157894736842106
42,68,3,3.7037037037037033
43,74,5,3.7037037037037033
44,76,4,2.0
45,77,3,33.333333333333336
46,78,4,12.5
47,79,5,1.1494252873563218
48,81,5,1.7241379310344829
49,83,4,4.166666666666667
50,85,2,1.0101010101010102
51,89,4,1.0416666666666667
52,90,1,1.1627906976744187
53,91,2,3.2258064516129035
54,93,2,1.36986301369863
55,95,1,1.6666666666666667
56,96,4,4.545454545454546
57,98,7,1.1235955056179776
58,99,7,1.3513513513513513
59,100,5,2.1739130434782608
matc:
index,id,width
0,1,15.98
1,2,36.12
2,3,63.41
3,4,32.19
4,5,18.54
5,6,98.96
6,7,5.65
7,8,97.42
8,9,50.88
9,10,64.56
10,11,58.2
11,12,55.08
12,13,3.84
13,14,75.87
14,15,96.51
15,16,42.08
16,17,77.31
17,18,15.25
18,19,81.43
19,20,98.71
20,21,63.21
21,22,76.32
22,23,22.59
23,24,30.79
24,25,10.52
25,26,54.65
26,27,95.46
27,28,49.93
28,29,79.67
29,30,45.0
30,31,59.14
31,32,62.25
32,33,57.01
33,34,27.54
34,35,7.36
35,36,36.44
36,37,23.64
37,38,78.98
38,39,47.8
39,40,92.19
40,41,31.26
41,42,61.71
42,43,93.11
43,44,70.07
44,45,10.91
45,46,4.24
46,47,35.39
47,48,99.1
48,49,7.35
49,50,46.7
50,51,97.69
51,52,32.03
52,53,48.61
53,54,33.44
54,55,13.5
55,56,42.3
56,57,94.71
57,58,37.49
58,59,57.86
59,60,50.29
60,61,77.98
61,62,18.18
62,63,3.42
63,64,88.26
64,65,48.66
65,66,4.28
66,67,20.78
67,68,28.89
68,69,27.17
69,70,57.48
70,71,59.07
71,72,12.63
72,73,22.06
73,74,4.05
74,75,22.3
75,76,22.37
76,77,52.2
77,78,98.29
78,79,72.98
79,80,49.37
80,81,6.07
81,82,28.85
82,83,35.8
83,84,66.74
84,85,64.16
85,86,33.64
86,87,66.36
87,88,34.51
88,89,23.6
89,90,45.05
90,91,21.14
91,92,97.27
92,93,31.21
93,94,13.04
94,95,46.04
95,96,7.15
96,97,47.87
97,98,27.7
98,99,31.93
99,100,79.62
square:
index,id,room_name,square
0,1,5,58.9
1,2,3,85.25
2,3,5,90.39
3,4,3,17.33
4,5,2,59.08
5,6,4,61.36
6,7,2,29.02
7,8,2,59.63
8,9,6,98.31
9,10,4,25.1
10,11,3,69.94
11,12,7,90.64
12,13,4,5.72
13,14,6,29.96
14,15,4,59.06
15,16,1,41.85
16,17,7,84.25
17,18,4,76.9
18,19,1,17.2
19,20,4,60.9
20,21,1,8.01
21,22,2,28.57
22,23,1,65.07
23,24,1,20.24
24,25,7,37.96
25,26,7,34.43
26,27,3,12.96
27,28,6,80.96
28,29,5,87.77
29,30,2,95.67
30,31,1,10.4
31,32,1,30.96
32,33,6,40.99
33,34,7,20.56
34,35,5,11.06
35,36,4,46.62
36,37,3,51.75
37,38,4,93.94
38,39,5,62.64
39,40,6,29.28
40,41,3,23.52
41,42,6,32.53
42,43,1,33.3
43,44,3,99.53
44,45,5,29.76
45,46,7,77.09
46,47,1,71.31
47,48,2,59.22
48,49,1,65.18
49,50,7,81.98
50,51,7,26.5
51,52,3,73.8
52,53,6,78.52
53,54,6,69.67
54,55,6,73.91
55,56,6,4.36
56,57,5,26.95
57,58,2,23.8
58,59,2,31.89
59,60,1,8.98
60,61,1,88.76
61,62,5,88.75
62,63,4,44.94
63,64,4,81.13
64,65,5,48.39
65,66,3,55.63
66,67,7,46.28
67,68,3,40.85
68,69,7,54.37
69,70,3,14.01
70,71,6,20.13
71,72,2,90.67
72,73,3,4.28
73,74,4,56.18
74,75,3,74.8
75,76,5,10.34
76,77,6,15.94
77,78,2,29.4
78,79,6,60.8
79,80,3,13.05
80,81,3,49.46
81,82,1,75.76
82,83,1,84.27
83,84,5,76.36
84,85,3,75.98
85,86,7,77.81
86,87,2,56.34
87,88,1,43.93
88,89,5,30.64
89,90,5,55.78
90,91,5,88.26
91,92,6,15.11
92,93,1,20.64
93,94,2,5.08
94,95,1,82.31
95,96,4,76.92
96,97,1,53.47
97,98,2,2.7
98,99,7,77.12
99,100,4,29.43
floor:
index,id,room_name
0,1,6
1,2,7
2,3,12
3,4,3
4,5,7
5,6,4
6,7,8
7,8,11
8,9,10
9,10,7
10,11,5
11,12,4
12,13,1
13,14,11
14,15,12
15,16,9
16,17,3
17,18,3
18,19,9
19,20,12
20,21,3
21,22,7
22,23,8
23,24,12
24,25,3
25,26,6
26,27,6
27,28,10
28,29,4
29,30,10
30,31,9
31,32,11
32,33,2
33,34,2
34,35,5
35,36,4
36,37,2
37,38,6
38,39,11
39,40,5
40,41,6
41,42,3
42,43,11
43,44,2
44,45,4
45,46,2
46,47,9
47,48,12
48,49,5
49,50,4
50,51,2
51,52,3
52,53,9
53,54,10
54,55,6
55,56,5
56,57,5
57,58,4
58,59,5
59,60,5
60,61,12
61,62,7
62,63,12
63,64,7
64,65,11
65,66,7
66,67,12
67,68,3
68,69,8
69,70,11
70,71,12
71,72,8
72,73,12
73,74,5
74,75,11
75,76,4
76,77,3
77,78,4
78,79,5
79,80,12
80,81,5
81,82,12
82,83,4
83,84,8
84,85,2
85,86,8
86,87,8
87,88,9
88,89,4
89,90,1
90,91,2
91,92,9
92,93,2
93,94,12
94,95,1
95,96,4
96,97,8
97,98,7
98,99,7
99,100,5
IIUC you overcomplicated things. The whole point of merging on id is that you don't need to filter the other DataFrames on id beforehand with loc and isin, as you tried to do; merge does that for you.
You can multiply square and width in square_df (matc_df would also work, since both have the same length, the same index and the same ids).
Then merge this new column into volume_df (which keeps the multiplied result only for the ids found in volume_df) and multiply again.
square_df['square*width'] = square_df['square'] * matc_df['width']
df = volume_df.merge(square_df[['id', 'square*width']], on='id', how='left')
df['result'] = df['coefLoosening'] * df['square*width']
Output df:
id room_name coefLoosening square*width result
0 1 6 1.020408 941.2220 960.430612
1 2 7 1.515152 3079.2300 4665.500000
2 4 3 2.000000 557.8527 1115.705400
3 5 7 4.347826 1095.3432 4762.361739
4 6 4 5.263158 6072.1856 31958.871579
5 10 7 9.090909 1620.4560 14731.418182
6 11 5 1.162791 4070.5080 4733.148837
7 12 4 1.149425 4992.4512 5738.449655
8 13 1 1.851852 21.9648 40.675556
9 17 3 1.098901 6513.3675 7157.546703
10 18 3 1.492537 1172.7250 1750.335821
11 21 3 2.083333 506.3121 1054.816875
12 22 7 1.428571 2180.4624 3114.946286
13 25 3 1.010101 399.3392 403.372929
14 26 6 1.562500 1881.5995 2939.999219
15 27 6 3.448276 1237.1616 4266.074483
16 29 4 1.612903 6992.6359 11278.445000
17 33 2 1.030928 2336.8399 2409.113299
18 34 2 33.333333 566.2224 18874.080000
19 35 5 1.000000 81.4016 81.401600
20 36 4 1.123596 1698.8328 1908.800899
21 37 2 1.960784 1223.3700 2398.764706
22 38 6 2.127660 7419.3812 15785.917447
23 40 5 2.857143 2699.3232 7712.352000
24 41 6 1.369863 735.2352 1007.171507
25 42 3 1.111111 2007.4263 2230.473667
26 44 2 1.694915 6974.0671 11820.452712
27 45 4 1.492537 324.6816 484.599403
28 46 2 1.190476 326.8616 389.120952
29 49 5 1.818182 479.0730 871.041818
30 50 4 1.612903 3828.4660 6174.945161
31 51 2 12.500000 2588.7850 32359.812500
32 52 3 1.052632 2363.8140 2488.225263
33 55 6 3.846154 997.7850 3837.634615
34 56 5 2.040816 184.4280 376.383673
35 57 5 1.098901 2552.4345 2804.873077
36 58 4 2.941176 892.2620 2624.300000
37 59 5 2.941176 1845.1554 5426.927647
38 60 5 2.857143 451.6042 1290.297714
39 62 7 1.111111 1613.4750 1792.750000
40 64 7 1.333333 7160.5338 9547.378400
41 66 7 1.315789 238.0964 313.284737
42 68 3 3.703704 1180.1565 4370.950000
43 74 5 3.703704 227.5290 842.700000
44 76 4 2.000000 231.3058 462.611600
45 77 3 33.333333 832.0680 27735.600000
46 78 4 12.500000 2889.7260 36121.575000
47 79 5 1.149425 4437.1840 5100.211494
48 81 5 1.724138 300.2222 517.624483
49 83 4 4.166667 3016.8660 12570.275000
50 85 2 1.010101 4874.8768 4924.117980
51 89 4 1.041667 723.1040 753.233333
52 90 1 1.162791 2512.8890 2921.963953
53 91 2 3.225806 1865.8164 6018.762581
54 93 2 1.369863 644.1744 882.430685
55 95 1 1.666667 3789.5524 6315.920667
56 96 4 4.545455 549.9780 2499.900000
57 98 7 1.123596 74.7900 84.033708
58 99 7 1.351351 2462.4416 3327.623784
59 100 5 2.173913 2343.2166 5093.949130
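A variant of the same idea, sketched on small made-up frames (the values below are invented for illustration and are not the question's data): merging both matc and square into volume on "id" avoids relying on square and matc sharing the same row order. It also shows why the original attempt produced NaNs: arithmetic between two Series aligns on index labels, not on position.

```python
import pandas as pd

# Toy stand-ins for the question's tables (assumed shapes, invented values)
volume = pd.DataFrame({"id": [1, 2, 4], "coefLoosening": [2.0, 3.0, 0.5]})
matc = pd.DataFrame({"id": [1, 2, 3, 4], "width": [10.0, 20.0, 30.0, 40.0]})
square = pd.DataFrame({"id": [1, 2, 3, 4], "square": [1.0, 2.0, 3.0, 4.0]})

# Index-based multiplication: volume has labels 0..2, the filtered matc has 0, 1, 3,
# so label 2 and label 3 have no partner and become NaN
misaligned = volume["coefLoosening"] * matc.loc[matc["id"].isin(volume["id"]), "width"]
print(misaligned)

# Key-based alignment: bring the needed columns in by id, then multiply row by row
merged = (
    volume
    .merge(matc[["id", "width"]], on="id", how="left")
    .merge(square[["id", "square"]], on="id", how="left")
)
merged["result"] = merged["coefLoosening"] * merged["width"] * merged["square"]
print(merged)
```

With how="left", every row of volume is kept exactly once as long as "id" is unique in the other tables, which the question states is the case.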

What causes multiplication to behave differently when the DataFrame has named columns?

I tested these two snippets. The DataFrames share the same structure apart from df.columns, so what causes the difference between them? And how should I change my second snippet: should I always use pandas.DataFrame.mul, or is there another method to avoid this?
# test1
df = pd.DataFrame(np.random.randint(100, size=(10, 10))) \
.assign(Count=np.random.rand(10))
df.iloc[:, 0:3] *= df['Count']
df
Out[1]:
0 1 2 3 4 5 6 7 8 9 Count
0 26.484949 68.217006 4.902341 61 10 13 31 15 10 11 0.645974
1 56.845743 70.085965 28.106758 79 56 47 82 83 62 40 0.934480
2 33.590667 78.496281 1.634114 94 3 91 16 41 93 55 0.326823
3 51.031974 15.886152 26.145821 67 31 20 81 21 10 8 0.012706
4 47.156128 82.234199 10.458328 24 8 68 44 24 4 50 0.517130
5 18.733256 61.675649 23.531239 74 61 97 20 12 0 95 0.360815
6 4.521820 26.165427 26.145821 68 10 77 67 92 82 11 0.606739
7 24.547026 62.610129 23.531239 50 45 69 94 56 77 56 0.412445
8 52.969897 75.692843 9.804683 73 74 5 10 60 51 77 0.125309
9 21.963128 30.837825 19.609366 75 9 50 68 10 82 96 0.687966
#test2
df = pd.DataFrame(np.random.randint(100, size=(10, 10))) \
.assign(Count=np.random.rand(10))
df.columns = ['find', 'a', 'b', 3, 4, 5, 6, 7, 8, 9, 'Count']
df.iloc[:, 0:3] *= df['Count']
df
Out[2]:
find a b 3 4 5 6 7 8 9 Count
0 NaN NaN NaN 63 63 47 81 3 48 34 0.603953
1 NaN NaN NaN 70 48 41 27 78 75 23 0.839635
2 NaN NaN NaN 5 38 52 23 3 75 4 0.515159
3 NaN NaN NaN 40 49 31 25 63 48 25 0.483255
4 NaN NaN NaN 42 89 46 47 78 30 5 0.693555
5 NaN NaN NaN 68 83 81 87 7 54 3 0.108306
6 NaN NaN NaN 74 48 99 67 80 81 36 0.361500
7 NaN NaN NaN 10 19 26 41 11 24 33 0.705899
8 NaN NaN NaN 38 51 83 78 7 31 42 0.838703
9 NaN NaN NaN 2 7 63 14 28 38 10 0.277547
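df.iloc[:,0:3] is a DataFrame with three columns, named find, a, and b. df['Count'] is a Series whose index is the row labels 0 to 9. When you multiply a DataFrame by a Series with *, pandas aligns the Series index with the DataFrame's column labels. In test2 none of the labels match, so every slot becomes NaN, and those NaNs are then assigned back to the DataFrame. In test1 the column labels 0, 1, 2 happen to match the Series labels 0, 1, 2, so each column is scaled by a single Count value (column 0 by Count[0], and so on), which is not a row-wise multiplication either.
I think that using .mul with axis=0, so that the Series is aligned with the row index instead, is the way around this, but I may be wrong about that...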
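To make the alignment difference concrete, here is a small sketch on invented data (the frame below is a toy stand-in, not the question's random data). It contrasts the default * operator with DataFrame.mul(..., axis=0).

```python
import pandas as pd

# Toy frame with named columns and a default 0..2 row index (invented values)
df = pd.DataFrame({
    "find": [1.0, 2.0, 3.0],
    "a": [4.0, 5.0, 6.0],
    "Count": [0.5, 2.0, 10.0],
})
cols = ["find", "a"]

# Default `*`: the Series index (0, 1, 2) is matched against the column labels
# ('find', 'a'); nothing matches, so the result is all NaN
by_columns = df[cols] * df["Count"]

# axis=0: the Series index is matched against the row index, giving a
# row-wise multiplication
by_rows = df[cols].mul(df["Count"], axis=0)

print(by_columns)
print(by_rows)
```

Assigning the result back with df[cols] = by_rows then updates only the selected columns. pandas may emit a warning about unsortable mixed labels when building by_columns; that is a side effect of the label mismatch itself.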

Python: compare many columns to one column and replace values greater than that column with NaN

My overall goal is to remove outliers in a row that are higher than 1.5xIQR of that row. I have a large dataframe with thousands of features, which mainly consists of numeric data. I have calculated the 1.5xIQR in a row-wise fashion and set it as a new column. I would like to replace any data within each row that is greater than its respective 1.5xIQR with either NaN (preferred) or 0.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
df
A B C D
0 46 99 38 11
1 43 49 3 95
2 64 39 33 49
3 41 60 49 7
4 38 95 70 13
5 11 45 57 73
6 8 62 57 22
7 9 83 89 91
8 47 82 61 40
9 34 21 21 41
I have tried numerous variations of this and beyond with no success.
df1 = df.iloc[:,:] > df.loc['D'] = 'NaN'
I think this should work:
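def f(row):
    Q1 = row.quantile(0.25)
    Q3 = row.quantile(0.75)
    IQR = Q3 - Q1
    # Keep values at or below 1.5*IQR; everything above becomes NaN.
    # where() returns a new Series, so the integer row is not modified in place.
    return row.where(row <= 1.5*IQR)
df1 = df.apply(f, axis=1)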
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
s = df.apply(lambda x: (x.quantile(.75)-x.quantile(.25))*1.5, axis=1)
df=df.where(df.lt(s, axis=0),np.nan)
print(df)
My understanding from the wording of question and your tried code is that you have already calculated the 1.5xIQR in column D. As such, you can use df.mask as follows:
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
Demo:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 4),), columns=list('ABCD'))
print(df)
A B C D
0 45 71 15 22
1 56 68 62 91
2 21 90 44 15
3 60 87 2 68
4 48 21 22 25
5 60 68 67 60
6 74 97 94 27
7 69 26 56 85
8 39 42 74 73
9 23 99 91 72
df1 = df.mask(df.gt(df['D'], axis=0), np.nan)
print(df1)
A B C D
0 NaN NaN 15.0 22
1 56.0 68.0 62.0 91
2 NaN NaN NaN 15
3 60.0 NaN 2.0 68
4 NaN 21.0 22.0 25
5 60.0 NaN NaN 60
6 NaN NaN NaN 27
7 69.0 26.0 56.0 85
8 39.0 42.0 NaN 73
9 23.0 NaN NaN 72
Alternatively, you can also use the simplified code below to update df elements meeting the criteria in place:
df[df.gt(df['D'], axis=0)] = np.nan
Printing df after the code will give the same result.
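As a worked check of the masking approach, here is a sketch on a small fixed frame (values invented for illustration). It assumes the threshold is exactly 1.5 times the row's IQR, as the question words it, and computes the threshold as a separate Series instead of storing it in a column.

```python
import numpy as np
import pandas as pd

# Small deterministic frame (invented values)
df = pd.DataFrame({
    "A": [1, 50, 3],
    "B": [2, 4, 60],
    "C": [3, 5, 5],
    "D": [4, 6, 7],
})

# Row-wise IQR without apply: quantile(axis=1) returns one value per row
iqr = df.quantile(0.75, axis=1) - df.quantile(0.25, axis=1)
threshold = 1.5 * iqr

# Compare every column against the per-row threshold (axis=0 aligns on rows),
# then blank out the cells that exceed it
cleaned = df.mask(df.gt(threshold, axis=0))

print(threshold)
print(cleaned)
```

Keeping the threshold outside the frame means it cannot be masked or compared against itself, which is what happens to column D in the demo above.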

How to create a rolling window in pandas with another condition

I have a data frame with 2 columns
df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
A B
0 11 10
1 61 30
2 24 54
3 47 52
4 72 42
... ... ...
95 61 2
96 67 41
97 95 30
98 29 66
99 49 22
100 rows × 2 columns
Now I want to create a third column, which is a rolling window max of col 'A' BUT
the max has to be lower than the corresponding value in col 'B'. In other words I want the value of the 4 (using a window size of 4) in column 'A' closest to the value in col 'B', yet smaller than B
So for example in row
3 47 52
the new value I am looking for, is not 61 but 47, because it is the highest value of the 4 that is not higher than 52
pseudo code
df['C'] = df['A'].rolling(window=4).max() where < df['B']
You can use concat + shift to create a wide DataFrame with the previous values, which makes complicated rolling calculations a bit easier.
Sample Data
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 2)), columns=list('AB'))
Code
N = 4
# End slice ensures same default min_periods behavior to `.rolling`
df1 = pd.concat([df['A'].shift(i).rename(i) for i in range(N)], axis=1).iloc[N-1:]
# Remove values larger than B, then find the max of remaining.
df['C'] = df1.where(df1.lt(df.B, axis=0)).max(1)
print(df.head(15))
A B C
0 51 92 NaN # Missing b/c min_periods
1 14 71 NaN # Missing b/c min_periods
2 60 20 NaN # Missing b/c min_periods
3 82 86 82.0
4 74 74 60.0
5 87 99 87.0
6 23 2 NaN # Missing b/c 82, 74, 87, 23 all > 2
7 21 52 23.0 # Max of 21, 23, 87, 74 which is < 52
8 1 87 23.0
9 29 37 29.0
10 1 63 29.0
11 59 20 1.0
12 32 75 59.0
13 57 21 1.0
14 88 48 32.0
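One way to gain confidence in the concat + shift approach is to compare it with a plain loop. The sketch below does that on a small random frame; the seed, the frame size and the column name C_loop are arbitrary choices made for this check.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(20, 2)), columns=list("AB"))
N = 4

# Vectorized version: one column per lag, then drop values not below B
wide = pd.concat([df["A"].shift(i).rename(i) for i in range(N)], axis=1).iloc[N - 1:]
df["C"] = wide.where(wide.lt(df["B"], axis=0)).max(axis=1)

# Reference version: explicit loop over each full window
expected = []
for i in range(len(df)):
    if i < N - 1:
        expected.append(np.nan)  # window not yet full
        continue
    window = df["A"].iloc[i - N + 1 : i + 1]
    below = window[window < df["B"].iloc[i]]
    expected.append(below.max() if len(below) else np.nan)
df["C_loop"] = expected

print(df)
print(np.allclose(df["C"], df["C_loop"], equal_nan=True))
```

The loop is slower but states the rule directly: for each row, take the last N values of A and keep the largest one that is strictly below that row's B.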
You can use a custom function to .apply to the rolling window. In this case, you can use a default argument to pass in the B column.
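df = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('AB'))
def rollup(a, B=df.B):
    # a is the current window of A; its largest index label is the current row
    ix = a.index.max()
    b = B[ix]
    # Largest value in the window that is strictly below the current B
    return a[a<b].max()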
df['C'] = df.A.rolling(4).apply(rollup)
df
# returns:
A B C
0 8 17 NaN
1 23 84 NaN
2 75 84 NaN
3 86 24 23.0
4 52 83 75.0
.. .. .. ...
95 38 22 NaN
96 53 48 38.0
97 45 4 NaN
98 3 92 53.0
99 91 86 53.0
The NaN values occur when no number in the window of A is less than B or at the start of the series when the window is too big for the first few rows.
You can use where to replace values that don't fulfill the condition with np.nan and then use rolling(window=4, min_periods=1):
In [37]: df['C'] = df['A'].where(df['A'] < df['B'], np.nan).rolling(window=4, min_periods=1).max()
In [38]: df
Out[38]:
A B C
0 0 1 0.0
1 1 2 1.0
2 2 3 2.0
3 10 4 2.0
4 4 5 4.0
5 5 6 5.0
6 10 7 5.0
7 10 8 5.0
8 10 9 5.0
9 10 10 NaN

Compute annual rate using a DataFrame and pct_change()

I have a column inside a DataFrame that I want to use in order to perform the operation:
step = 3
n = step/12
t1 = step - 1
pd.DataFrame(100*((df[t1+step::step]['Column'].values / df[t1:-t1:step]['Column'].values)**(1/n) - 1))
A possible set of values for the column of interest could be:
>>> df['Column']
0 NaN
1 NaN
2 7469.5
3 NaN
4 NaN
5 7537.9
6 NaN
7 NaN
8 7655.2
9 NaN
10 NaN
11 7712.6
12 NaN
13 NaN
14 7784.1
15 NaN
16 NaN
17 7819.8
18 NaN
19 NaN
20 7898.6
21 NaN
22 NaN
23 7939.5
24 NaN
25 NaN
26 7995.0
27 NaN
28 NaN
29 8084.7
...
So df[t1+step::step]['Column'] would give us:
>>> df[5::3]['Column']
5 7537.9
8 7655.2
11 7712.6
14 7784.1
17 7819.8
20 7898.6
23 7939.5
26 7995.0
29 8084.7
32 8158.0
35 8292.7
38 8339.3
41 8449.5
44 8498.3
47 8610.9
50 8697.7
53 8766.1
56 8831.5
59 8850.2
62 8947.1
65 8981.7
68 8983.9
71 8907.4
74 8865.6
77 8934.4
80 8977.3
83 9016.4
86 9123.0
89 9223.5
92 9313.2
...
And lastly df[t1:-t1:step]['Column']
>>> df[2:-2:3]['Column']
2 7469.5
5 7537.9
8 7655.2
11 7712.6
14 7784.1
17 7819.8
20 7898.6
23 7939.5
26 7995.0
29 8084.7
32 8158.0
35 8292.7
38 8339.3
41 8449.5
44 8498.3
47 8610.9
50 8697.7
53 8766.1
56 8831.5
59 8850.2
62 8947.1
65 8981.7
68 8983.9
71 8907.4
74 8865.6
77 8934.4
80 8977.3
83 9016.4
86 9123.0
89 9223.5
...
With these values what we expect is the following output:
>>> pd.DataFrame(100*((df[5::3]['Column'].values / df[2:-2:3]['Column'].values)**4 -1))
0 3.713517
1 6.371352
2 3.033171
3 3.760103
4 1.847168
5 4.092131
6 2.087397
7 2.825602
8 4.563898
9 3.676223
10 6.769944
11 2.266778
12 5.391516
13 2.330287
14 5.406150
15 4.093476
16 3.182961
17 3.017786
18 0.849662
19 4.452016
20 1.555866
21 0.098013
22 -3.362834
23 -1.863919
24 3.140454
25 1.934544
26 1.753587
27 4.813692
28 4.479794
29 3.947179
Since this closely resembles what pct_change() does, I was wondering if I could achieve the same result by doing something like:
>>> df['Column'].pct_change(periods=step)**(1/n) * 100
Until now I am getting incorrect outputs though. Is it possible to use pct_change() and achieve the same result?
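One possible reading of the mismatch, offered as a sketch and not as a confirmed answer: pct_change() returns x_t / x_(t-step) - 1, whereas the formula above raises the ratio x_t / x_(t-step) itself to the power 1/n. Adding 1 back before exponentiating and subtracting it afterwards should reproduce the manual computation. The example uses only the first four values shown in the question and assumes fill_method=None, so that the NaN rows are not forward-filled into spurious ratios.

```python
import numpy as np
import pandas as pd

step = 3
n = step / 12

# Values on every third row, NaN elsewhere (first four values from the question)
col = pd.Series(np.nan, index=range(12))
col.iloc[[2, 5, 8, 11]] = [7469.5, 7537.9, 7655.2, 7712.6]
df = pd.DataFrame({"Column": col})

# pct_change gives ratio - 1, so add 1 to recover the ratio before annualizing
ratio = df["Column"].pct_change(periods=step, fill_method=None) + 1
annual = (100 * (ratio ** (1 / n) - 1)).dropna()

# Manual computation in the style of the question, for comparison
manual = 100 * (
    (df["Column"].values[5::3] / df["Column"].values[2:-2:3]) ** (1 / n) - 1
)

print(annual)
print(manual)
```

Unlike the manual version, the result keeps the original row labels (5, 8, 11), which makes it easy to assign back to the DataFrame as a new column.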
