I want to remove rows where the value count of a specific column is less than some threshold.
The DataFrame looks like this:
class reason bank_fees cash_advance community food_and_drink ... recreation service shops tax transfer travel
0 0 at&t 20.00 0.0 253.95 254.48 ... 19.27 629.34 842.77 0.0 -4089.51 121.23
1 0 at&t 0.00 0.0 0.00 319.55 ... 0.00 1327.53 656.84 -1784.0 -1333.20 79.60
2 1 entergy arkansas 80.00 0.0 3.39 3580.99 ... 612.36 3746.90 4990.33 0.0 -14402.54 888.67
3 1 entergy arkansas 0.00 0.0 0.00 37.03 ... 0.00 405.24 47.34 0.0 -400.01 41.12
4 1 entergy arkansas 0.00 0.0 0.00 250.18 ... 0.00 123.48 54.28 0.0 -270.15 87.00
... ... ... ... ... ... ... ... ... ... ... ... ... ...
6659 0 t-mobile 0.00 0.0 0.00 0.00 ... 0.00 0.00 50.00 0.0 -253.74 284.44
6660 0 optimum 0.00 -30.0 108.63 158.67 ... 10.11 7098.23 2657.95 0.0 -12641.89 3011.04
6661 0 optimum 0.00 0.0 0.00 267.86 ... 0.00 2459.41 1939.35 0.0 -5727.50 212.06
6662 0 state farm insurance 0.00 0.0 0.00 80.91 ... 25.00 130.27 195.42 0.0 -1189.71 48.79
6663 0 verizon 39.97 0.0 0.00 0.00 ... 0.00 110.00 0.00 0.0 0.00 0.00
[6664 rows x 15 columns]
These are the value counts of the column reason:
at&t 724
verizon 544
geico 341
t-mobile 309
state farm insurance 135
...
town of smyrna 1
city of hendersonville 1
duke energy 1
pima medical institute 1
gieco 1
Name: reason, Length: 649, dtype: int64
The important column here is reason. For example, if a value occurs fewer than 5 times, I want to remove the rows that contain it. How can I do that? Thanks.
You can get the index of the value counts that are below 5 and use isin to filter out those values:
out = df[~df['reason'].isin(df['reason'].value_counts().lt(5).pipe(lambda s: s[s].index))]
To elaborate on each step:
out = df[~df['reason'].isin(
    df['reason'].value_counts()    # count the occurrences of each value
    .lt(5)                         # boolean mask: True where the count is below 5
    .pipe(lambda s: s[s].index)    # keep only the values whose count is below 5
)]                                 # keep the rows whose reason is not among those values
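The same filter can also be written by mapping each row to the count of its own reason. This is only a sketch on made-up data: the toy DataFrame and the name min_count are assumptions for illustration, not part of the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the question's DataFrame
df = pd.DataFrame({
    "reason": ["at&t"] * 5 + ["verizon"] * 6 + ["gieco"] + ["duke energy"] * 2,
    "travel": range(14),
})

min_count = 5  # hypothetical name for the threshold from the question

# Look up, for every row, how often its reason occurs in the whole column
counts = df["reason"].map(df["reason"].value_counts())

# Keep only the rows whose reason occurs at least min_count times
out = df[counts >= min_count]
print(out["reason"].value_counts())
```

Because counts has the same index as df, it can be used directly as a boolean mask without building an intermediate list of rare values.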
I have a current Pandas DataFrame in the format below (see Current DataFrame) but I want to change the structure of it to look like the Desired DataFrame below. The top row of titles is longitudes and the first column of titles is latitudes.
Current DataFrame:
E0 E1 E2 E3 E4
LAT
89 0.01 0.01 0.02 0.01 0.00
88 0.01 0.00 0.00 0.01 0.00
87 0.00 0.02 0.01 0.02 0.01
86 0.02 0.00 0.03 0.02 0.00
85 0.00 0.00 0.00 0.01 0.03
Code to build it:
df = pd.DataFrame({
'LAT': [89, 88, 87, 86, 85],
'E0': [0.01, 0.01, 0.0, 0.02, 0.0],
'E1': [0.01, 0.0, 0.02, 0.0, 0.0],
'E2': [0.02, 0.0, 0.01, 0.03, 0.0],
'E3': [0.01, 0.01, 0.02, 0.02, 0.01],
'E4': [0.0, 0.0, 0.01, 0.0, 0.03]
}).set_index('LAT')
Desired DataFrame:
LAT LON R
89 0 0.01
89 1 0.01
89 2 0.02
89 3 0.01
89 4 0.00
88 0 0.01
88 1 0.00
88 2 0.00
88 3 0.01
88 4 0.00
87 0 0.00
87 1 0.02
87 2 0.01
87 3 0.02
87 4 0.01
86 0 0.02
86 1 0.00
86 2 0.03
86 3 0.02
86 4 0.00
85 0 0.00
85 1 0.00
85 2 0.00
85 3 0.01
85 4 0.03
Try with stack + str.extract:
new_df = (
df.stack()
.reset_index(name='R')
.rename(columns={'level_1': 'LON'})
)
new_df['LON'] = new_df['LON'].str.extract(r'(\d+$)').astype(int)
Or with pd.wide_to_long + reindex:
new_df = df.reset_index()
new_df = (
pd.wide_to_long(new_df, stubnames='E', i='LAT', j='LON')
.reindex(new_df['LAT'], level=0)
.rename(columns={'E': 'R'})
.reset_index()
)
new_df:
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03
You could solve it with pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index = None,
names_to = 'LON',
values_to = "R",
names_pattern = r".(.)",
sort_by_appearance = True,
ignore_index = False).reset_index()
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03
Here we are only interested in the numbers at the end of the column names; we extract them by passing a regular expression to names_pattern.
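You can avoid pyjanitor altogether by using rename and melt. Note that the result is ordered by LON rather than by LAT, that LAT stays in the index, and that col[-1] only handles single-character suffixes and leaves LON as a string: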
(df.rename(columns=lambda col: col[-1])
.melt(var_name='LON', value_name='R', ignore_index=False)
)
LON R
LAT
89 0 0.01
88 0 0.01
87 0 0.00
86 0 0.02
85 0 0.00
89 1 0.01
88 1 0.00
87 1 0.02
86 1 0.00
85 1 0.00
89 2 0.02
88 2 0.00
87 2 0.01
86 2 0.03
85 2 0.00
89 3 0.01
88 3 0.01
87 3 0.02
86 3 0.02
85 3 0.01
89 4 0.00
88 4 0.00
87 4 0.01
86 4 0.00
85 4 0.03
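If the row order and integer LON of the desired output matter, a melt-based variant could look like the sketch below. It assumes that LAT should end up in descending order, which matches the sample data but is not stated as a general rule in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'LAT': [89, 88, 87, 86, 85],
    'E0': [0.01, 0.01, 0.0, 0.02, 0.0],
    'E1': [0.01, 0.0, 0.02, 0.0, 0.0],
    'E2': [0.02, 0.0, 0.01, 0.03, 0.0],
    'E3': [0.01, 0.01, 0.02, 0.02, 0.01],
    'E4': [0.0, 0.0, 0.01, 0.0, 0.03],
}).set_index('LAT')

new_df = (
    df.reset_index()
      # one row per (LAT, column name) pair
      .melt(id_vars='LAT', var_name='LON', value_name='R')
      # pull the trailing digits out of 'E0', 'E1', ... and convert to int
      .assign(LON=lambda d: d['LON'].str.extract(r'(\d+)$', expand=False).astype(int))
      # assumption: LAT descending, LON ascending, as in the desired output
      .sort_values(['LAT', 'LON'], ascending=[False, True], ignore_index=True)
)
print(new_df.head(10))
```

Extracting the digits with a regular expression instead of col[-1] keeps the approach usable for column names such as E10.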
Another approach uses pd.wide_to_long. Note that the result is sorted by ascending LAT and that the value column is still named E:
pd.wide_to_long(df.reset_index(), ['E'], i = 'LAT', j = 'LON').reset_index().sort_values(by = ['LAT','LON'])
LAT LON E
4 85 0 0.00
9 85 1 0.00
14 85 2 0.00
19 85 3 0.01
24 85 4 0.03
3 86 0 0.02
8 86 1 0.00
13 86 2 0.03
18 86 3 0.02
23 86 4 0.00
2 87 0 0.00
7 87 1 0.02
12 87 2 0.01
17 87 3 0.02
22 87 4 0.01
1 88 0 0.01
6 88 1 0.00
11 88 2 0.00
16 88 3 0.01
21 88 4 0.00
0 89 0 0.01
5 89 1 0.01
10 89 2 0.02
15 89 3 0.01
20 89 4 0.00
Quick and dirty.
Pad your LAT with your LON in a list of tuple pairs.
[
(89.0, 0.01),
(89.1, 0.01),
(89.2, 0.02)
]
I'm sure someone can break down a way to organize it like you want, but from what I know you need a unique ID data point for most data in a query structure.
OR:
If you aren't putting this back into a db, then maybe you can use a dict something like this:
{ '89' : { '0' : 0.01,
'1' : 0.01,
'2' : 0.02 .....
}
}
You can then get the data with
dpoint = data['89']['0']
assert dpoint == 0.01
# True
dpoint = data['89']['2']
assert dpoint == 0.02
# True
I have a DataFrame df in which column A contains repeated values:
Index A B
0 0.00 0.00
1 0.00 0.00
29 0.50 105.00
36 0.80 167.00
37 0.80 167.00
42 1.00 209.00
44 0.50 105.00
45 0.50 105.00
46 0.50 105.00
50 0.00 0.00
51 0.00 0.00
52 0.00 0.00
53 0.00 0.00
When I use:
df.drop_duplicates(subset=['A'], keep='last')
I get:
Index A B
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Which makes sense, that's what the function does. However, what I actually would like to achieve is something like:
Index A B
1 0.00 0.00
29 0.50 105.00
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Basically, I want to pick the last row from each consecutive run of equal values in column A ((0, 0), (0.80, 0.80), etc.).
It is also important that the values in column A stay in this order (0; 0.5; 0.8; 1; 0.5; 0) and do not get mixed.
Compare each value with the next one using Series.ne and Series.shift(-1), then filter by boolean indexing:
df1 = df[df['A'].ne(df['A'].shift(-1))]
print (df1)
A B
Index
1 0.0 0.0
29 0.5 105.0
37 0.8 167.0
42 1.0 209.0
46 0.5 105.0
53 0.0 0.0
Details:
print (df['A'].ne(df['A'].shift(-1)))
Index
0 False
1 True
29 True
36 False
37 True
42 True
44 False
45 False
46 True
50 False
51 False
52 False
53 True
Name: A, dtype: bool
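For reference, here is a self-contained version of the same idea that rebuilds the sample data from the question. The variable names last_of_run and first_of_run are only illustrative, and the first_of_run variant is an extra that the question did not ask for.

```python
import pandas as pd

# Sample data from the question, with "Index" as the index
df = pd.DataFrame(
    {
        "A": [0.0, 0.0, 0.5, 0.8, 0.8, 1.0, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
        "B": [0.0, 0.0, 105.0, 167.0, 167.0, 209.0, 105.0, 105.0, 105.0,
              0.0, 0.0, 0.0, 0.0],
    },
    index=pd.Index([0, 1, 29, 36, 37, 42, 44, 45, 46, 50, 51, 52, 53], name="Index"),
)

# A row is the last of its run when the NEXT value differs.
# shift(-1) puts NaN in the final position, so the final row is always kept.
last_of_run = df[df["A"].ne(df["A"].shift(-1))]

# Mirror image: a row is the first of its run when the PREVIOUS value differs.
first_of_run = df[df["A"].ne(df["A"].shift())]

print(last_of_run)
```

Unlike drop_duplicates, this only compares neighbouring rows, so the second run of 0.5 and the second run of 0.0 survive and the original order is preserved.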
I am confused by the results of subtracting two pandas columns. When I subtract an int64 column from a float64 column, the result contains several NaN entries. Why is this happening? What could be the cause of this strange behavior?
Final Update: As N.Wouda pointed out, my problem was that the indexes did not match.
Y_predd.reset_index(drop=True,inplace=True)
Y_train_2.reset_index(drop=True,inplace=True)
solved my problem.
Update 2: It seems like my indexes don't match, which makes sense because both are sampled from the same data frame. How can I "start fresh" with new indexes?
Update: Y_predd - Y_train_2.astype('float64') also yields NaN values. I am confused why this did not raise an error. They are the same size. Why could this be yielding NaN?
In [48]: Y_predd.size
Out[48]: 182527
In [49]: Y_train_2.astype('float64').size
Out[49]: 182527
Original documentation of error:
In [38]: Y_train_2
Out[38]:
66419 0
2319 0
114195 0
217532 0
131687 0
144024 0
94055 0
143479 0
143124 0
49910 0
109278 0
215905 1
127311 0
150365 0
117866 0
28702 0
168111 0
64625 0
207180 0
14555 0
179268 0
22021 1
120169 0
218769 0
259754 0
188296 1
63503 1
175104 0
218261 0
35453 0
..
112048 0
97294 0
68569 0
60333 0
184119 1
57632 0
153729 1
155353 0
114979 1
180634 0
42842 0
99979 0
243728 0
203679 0
244381 0
55646 0
35557 0
148977 0
164008 0
53227 1
219863 0
4625 0
155759 0
232463 0
167807 0
123638 0
230463 1
198219 0
128459 1
53911 0
Name: objective_for_classifier, dtype: int64
In [39]: Y_predd
Out[39]:
0 0.00
1 0.48
2 0.04
3 0.00
4 0.48
5 0.58
6 0.00
7 0.00
8 0.02
9 0.06
10 0.22
11 0.32
12 0.12
13 0.26
14 0.18
15 0.18
16 0.28
17 0.30
18 0.52
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 0.64
26 0.30
27 0.76
28 0.10
29 0.42
...
182497 0.60
182498 0.00
182499 0.06
182500 0.12
182501 0.00
182502 0.40
182503 0.70
182504 0.42
182505 0.54
182506 0.24
182507 0.56
182508 0.34
182509 0.10
182510 0.18
182511 0.06
182512 0.12
182513 0.00
182514 0.22
182515 0.08
182516 0.22
182517 0.00
182518 0.42
182519 0.02
182520 0.50
182521 0.00
182522 0.08
182523 0.16
182524 0.00
182525 0.32
182526 0.06
Name: prediction_method_used, dtype: float64
In [40]: Y_predd - Y_train_2
Out[41]:
0 NaN
1 NaN
2 0.04
3 NaN
4 0.48
5 NaN
6 0.00
7 0.00
8 NaN
9 NaN
10 NaN
11 0.32
12 -0.88
13 -0.74
14 0.18
15 NaN
16 NaN
17 NaN
18 NaN
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 NaN
26 0.30
27 NaN
28 0.10
29 0.42
...
260705 NaN
260706 NaN
260709 NaN
260710 NaN
260711 NaN
260713 NaN
260715 NaN
260716 NaN
260718 NaN
260721 NaN
260722 NaN
260723 NaN
260724 NaN
260725 NaN
260726 NaN
260727 NaN
260731 NaN
260735 NaN
260737 NaN
260738 NaN
260739 NaN
260740 NaN
260742 NaN
260743 NaN
260745 NaN
260748 NaN
260749 NaN
260750 NaN
260751 NaN
260752 NaN
dtype: float64
Posting here so we can close the question, from the comments:
Are you sure each dataframe has the same index range?
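pandas aligns the two operands on their index labels before subtracting, so any label that appears in only one of them produces NaN. You can reset the indices on both frames with df.reset_index(drop=True) and then subtract the frames as you were already doing. This should give the desired output.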
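The index alignment behind the NaN values can be reproduced on a small scale. The two Series below are made-up stand-ins for Y_predd and Y_train_2, and the to_numpy variant is an additional option that was not mentioned in the comments.

```python
import pandas as pd

# Hypothetical small stand-ins for Y_predd and Y_train_2
y_pred = pd.Series([0.0, 0.48, 0.04, 0.0])            # default index 0..3
y_true = pd.Series([0, 1, 0, 1], index=[2, 7, 0, 9])  # index left over from sampling

# Subtraction aligns on index labels: the result covers the union of both
# indexes, and labels present in only one Series become NaN.
misaligned = y_pred - y_true

# Resetting both indexes makes the subtraction positional.
aligned = y_pred.reset_index(drop=True) - y_true.reset_index(drop=True)

# Alternative: drop the index entirely and subtract the underlying arrays.
by_position = y_pred.to_numpy() - y_true.to_numpy()

print(misaligned)
print(aligned)
```

This also explains why equal sizes do not prevent NaN: the result is built from labels, not positions, which is why the output in the question runs up to label 260752 even though each Series has 182527 entries.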
I have grabbed a column from a pandas data frame and made a list from the rows.
If I print the first two values of the list, I get this output:
print dfList[0]
print dfList[1]
0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
5 0.11
6 0.84
7 1.00
8 0.27
9 0.00
10 0.52
11 0.55
12 0.92
13 0.00
14 0.00
...
50 0.42
51 0.00
52 0.00
53 0.00
54 0.40
55 0.65
56 0.81
57 1.00
58 0.54
59 0.21
60 0.00
61 0.33
62 1.00
63 0.75
64 1.00
Name: 9, Length: 65, dtype: float64
65 1.00
66 1.00
67 1.00
68 1.00
69 1.00
70 1.00
71 0.55
72 0.00
73 0.39
74 0.51
75 0.70
76 0.83
77 0.87
78 0.85
79 0.53
...
126 0.83
127 0.83
128 0.83
129 0.71
130 0.26
131 0.11
132 0.00
133 0.00
134 0.50
135 1.00
136 1.00
137 0.59
138 0.59
139 0.59
140 1.00
Name: 9, Length: 76, dtype: float64
When I try to iterate over the list with a for loop, I get an error message:
for i in dfList:
    print dfList[i]
Traceback (most recent call last):
File "./windows.py", line 84, in <module>
print dfList[i]
TypeError: list indices must be integers, not Series
If I write my code as:
for i in dfList:
    print i
I get the correct output:
0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
5 0.11
6 0.84
7 1.00
8 0.27
9 0.00
10 0.52
11 0.55
12 0.92
13 0.00
14 0.00
...
50 0.42
51 0.00
52 0.00
53 0.00
54 0.40
55 0.65
56 0.81
57 1.00
58 0.54
59 0.21
60 0.00
61 0.33
62 1.00
63 0.75
64 1.00
Name: 9, Length: 65, dtype: float64
...
...
Name: 9, Length: 108, dtype: float64
507919 0.00
507920 0.83
507921 1.00
507922 1.00
507923 0.46
507924 0.83
507925 1.00
507926 1.00
507927 1.00
507928 1.00
507929 1.00
507930 1.00
507931 1.00
507932 1.00
507933 1.00
...
508216 1
508217 1
508218 1
508219 1
508220 1
508221 1
508222 1
508223 1
508224 1
508225 1
508226 1
508227 1
508228 1
508229 1
508230 1
Name: 9, Length: 312, dtype: float64
I do not know why this happens.
I ultimately want to iterate through the list for 5 consecutive "windows" and calculate their means and put these means in a list.
In Python, when you write for i in dfList, i is the element, not the index, so you need print i rather than print dfList[i].
Either of the following two options is fine:
for i in dfList:
    print i
for i in range(len(dfList)):
    print dfList[i]
The first is more Pythonic and elegant unless you need the index.
Edit
As jwilner suggested you can also do:
for i, element in enumerate(dfList):
    print i, element
Here i is the index and element is dfList[i].
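For the stated goal of collecting the window means in a list, iterating over the elements directly is enough. The sketch below uses Python 3 syntax and a made-up dfList of three short Series, since the real data is not available; it assumes each list element is one "window".

```python
import pandas as pd

# Hypothetical stand-in for dfList: a list of Series, one per window
dfList = [
    pd.Series([0.0, 0.5, 1.0]),
    pd.Series([1.0, 1.0]),
    pd.Series([0.25, 0.75]),
]

# Iterate over the elements themselves and collect one mean per window
window_means = [window.mean() for window in dfList]
print(window_means)
```

To restrict this to the first five windows, slice the list first, for example dfList[:5].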