I have grabbed a column from a pandas data frame and made a list from the rows.
If I print the first two values of the list, I get this output:
print dfList[0]
print dfList[1]
0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
5 0.11
6 0.84
7 1.00
8 0.27
9 0.00
10 0.52
11 0.55
12 0.92
13 0.00
14 0.00
...
50 0.42
51 0.00
52 0.00
53 0.00
54 0.40
55 0.65
56 0.81
57 1.00
58 0.54
59 0.21
60 0.00
61 0.33
62 1.00
63 0.75
64 1.00
Name: 9, Length: 65, dtype: float64
65 1.00
66 1.00
67 1.00
68 1.00
69 1.00
70 1.00
71 0.55
72 0.00
73 0.39
74 0.51
75 0.70
76 0.83
77 0.87
78 0.85
79 0.53
...
126 0.83
127 0.83
128 0.83
129 0.71
130 0.26
131 0.11
132 0.00
133 0.00
134 0.50
135 1.00
136 1.00
137 0.59
138 0.59
139 0.59
140 1.00
Name: 9, Length: 76, dtype: float64
When I try to iterate over the list with a for loop, I get an error message:
for i in dfList:
    print dfList[i]
Traceback (most recent call last):
File "./windows.py", line 84, in <module>
print dfList[i]
TypeError: list indices must be integers, not Series
If I write my code as:
for i in dfList:
    print i
I get the correct output:
0 0.00
1 0.00
2 0.00
3 0.00
4 0.00
5 0.11
6 0.84
7 1.00
8 0.27
9 0.00
10 0.52
11 0.55
12 0.92
13 0.00
14 0.00
...
50 0.42
51 0.00
52 0.00
53 0.00
54 0.40
55 0.65
56 0.81
57 1.00
58 0.54
59 0.21
60 0.00
61 0.33
62 1.00
63 0.75
64 1.00
Name: 9, Length: 65, dtype: float64
...
...
Name: 9, Length: 108, dtype: float64
507919 0.00
507920 0.83
507921 1.00
507922 1.00
507923 0.46
507924 0.83
507925 1.00
507926 1.00
507927 1.00
507928 1.00
507929 1.00
507930 1.00
507931 1.00
507932 1.00
507933 1.00
...
508216 1
508217 1
508218 1
508219 1
508220 1
508221 1
508222 1
508223 1
508224 1
508225 1
508226 1
508227 1
508228 1
508229 1
508230 1
Name: 9, Length: 312, dtype: float64
I do not know why this happens.
Ultimately, I want to iterate through the list in windows of 5 consecutive values, calculate the mean of each window, and put those means in a list.
In Python, when you write for i in dfList, i is the element itself, not the index, so you need to print i, not dfList[i].
Either of the following two options is fine:
for i in dfList:
    print i
for i in range(len(dfList)):
    print dfList[i]
The first is more Pythonic and elegant, unless you need the index.
Edit
As jwilner suggested you can also do:
for i, element in enumerate(dfList):
    print i, element
Here i is the index and element is dfList[i].
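Since the end goal is the mean of every window of 5 consecutive entries, here is a rough sketch of that last step. It assumes the items are plain numbers (the names values and window_means are purely illustrative):
window = 5
window_means = []
for start in range(0, len(values) - window + 1, window):
    chunk = values[start:start + window]              # one block of 5 consecutive items
    window_means.append(sum(chunk) / float(window))   # its mean
print(window_means)
For a sliding (overlapping) window, change the step in range from window to 1.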
Related
I am using the below code to calculate F1 score for my dataset
from sklearn.metrics import f1_score
from sklearn.preprocessing import MultiLabelBinarizer
m = MultiLabelBinarizer().fit(y_test_true_f)
print("F1-score is : {:.1%}".format(f1_score(m.transform(y_test_true_f),
m.transform(y_pred_f),
average='macro')))
and classification report
from sklearn.metrics import classification_report
print(classification_report(m.transform(y_test_true_f), m.transform(y_pred_f)))
but the output of the classification report does not show the label names
precision recall f1-score support
0 0.88 1.00 0.94 15
1 1.00 0.95 0.98 22
2 0.82 0.74 0.78 19
3 0.90 0.85 0.88 33
4 0.68 0.87 0.76 15
5 0.94 0.98 0.96 46
6 0.83 0.94 0.88 16
7 0.33 0.86 0.48 7
8 0.95 0.90 0.92 20
9 0.67 1.00 0.80 10
10 0.91 0.83 0.87 12
11 0.29 0.33 0.31 6
12 0.25 0.40 0.31 5
13 0.00 0.00 0.00 3
14 0.88 1.00 0.93 7
15 0.50 0.75 0.60 8
16 0.50 1.00 0.67 1
17 1.00 1.00 1.00 10
18 0.80 1.00 0.89 8
19 0.89 1.00 0.94 17
20 0.88 1.00 0.94 15
21 0.86 0.80 0.83 15
22 0.71 0.79 0.75 19
23 0.65 1.00 0.79 11
24 0.74 0.82 0.78 17
25 1.00 1.00 1.00 11
26 0.75 0.86 0.80 14
How should I update my code to see the label names instead of the numbers 0, 1, 2, 3, ...?
According to the output there are 27 classes in the dataset, if I am not mistaken. To get the class names/labels you need the MultiLabelBinarizer attribute that records the mapping between each class and 0, 1, 2, 3, ..., because the binarizer transforms the labels into numeric indicator columns.
That attribute is .classes_; pass it as the target_names parameter of classification_report:
print(classification_report(m.transform(y_test_true_f), m.transform(y_pred_f), target_names=m.classes_))
I hope this gives you the class labels.
Specify them as target_names when calling classification_report.
From their examples:
>>> from sklearn.metrics import classification_report
>>> y_true = [0, 1, 2, 2, 2]
>>> y_pred = [0, 0, 2, 2, 1]
>>> target_names = ['class 0', 'class 1', 'class 2']
>>> print(classification_report(y_true, y_pred, target_names=target_names))
precision recall f1-score support
class 0 0.50 1.00 0.67 1
class 1 0.00 0.00 0.00 1
class 2 1.00 0.67 0.80 3
accuracy 0.60 5
macro avg 0.50 0.56 0.49 5
weighted avg 0.70 0.60 0.61 5
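For reference, a self-contained sketch with made-up string labels (the label values here are invented purely for illustration) showing how .classes_ lines up with the binarized columns:
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

# Toy multilabel data; the label strings are made up for illustration.
y_true = [['cat'], ['dog', 'cat'], ['bird'], ['dog']]
y_pred = [['cat'], ['dog'], ['bird', 'cat'], ['dog']]

m = MultiLabelBinarizer().fit(y_true)
print(m.classes_)  # ['bird' 'cat' 'dog'] - the column order of the transform

print(classification_report(m.transform(y_true),
                            m.transform(y_pred),
                            target_names=m.classes_))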
I have a Pandas DataFrame in the format below (see Current DataFrame), but I want to change its structure to look like the Desired DataFrame below. The column headers are longitudes and the index values are latitudes.
Current DataFrame:
E0 E1 E2 E3 E4
LAT
89 0.01 0.01 0.02 0.01 0.00
88 0.01 0.00 0.00 0.01 0.00
87 0.00 0.02 0.01 0.02 0.01
86 0.02 0.00 0.03 0.02 0.00
85 0.00 0.00 0.00 0.01 0.03
Code to build it:
df = pd.DataFrame({
'LAT': [89, 88, 87, 86, 85],
'E0': [0.01, 0.01, 0.0, 0.02, 0.0],
'E1': [0.01, 0.0, 0.02, 0.0, 0.0],
'E2': [0.02, 0.0, 0.01, 0.03, 0.0],
'E3': [0.01, 0.01, 0.02, 0.02, 0.01],
'E4': [0.0, 0.0, 0.01, 0.0, 0.03]
}).set_index('LAT')
Desired DataFrame:
LAT LON R
89 0 0.01
89 1 0.01
89 2 0.02
89 3 0.01
89 4 0.00
88 0 0.01
88 1 0.00
88 2 0.00
88 3 0.01
88 4 0.00
87 0 0.00
87 1 0.02
87 2 0.01
87 3 0.02
87 4 0.01
86 0 0.02
86 1 0.00
86 2 0.03
86 3 0.02
86 4 0.00
85 0 0.00
85 1 0.00
85 2 0.00
85 3 0.01
85 4 0.03
Try with stack + str.extract:
new_df = (
df.stack()
.reset_index(name='R')
.rename(columns={'level_1': 'LON'})
)
new_df['LON'] = new_df['LON'].str.extract(r'(\d+$)').astype(int)
Or with pd.wide_to_long + reindex:
new_df = df.reset_index()
new_df = (
pd.wide_to_long(new_df, stubnames='E', i='LAT', j='LON')
.reindex(new_df['LAT'], level=0)
.rename(columns={'E': 'R'})
.reset_index()
)
new_df:
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03
You could solve it with pivot_longer from pyjanitor:
# pip install pyjanitor
import janitor
import pandas as pd
df.pivot_longer(index = None,
names_to = 'LON',
values_to = "R",
names_pattern = r".(.)",
sort_by_appearance = True,
ignore_index = False).reset_index()
LAT LON R
0 89 0 0.01
1 89 1 0.01
2 89 2 0.02
3 89 3 0.01
4 89 4 0.00
5 88 0 0.01
6 88 1 0.00
7 88 2 0.00
8 88 3 0.01
9 88 4 0.00
10 87 0 0.00
11 87 1 0.02
12 87 2 0.01
13 87 3 0.02
14 87 4 0.01
15 86 0 0.02
16 86 1 0.00
17 86 2 0.03
18 86 3 0.02
19 86 4 0.00
20 85 0 0.00
21 85 1 0.00
22 85 2 0.00
23 85 3 0.01
24 85 4 0.03
Here we are only interested in the numbers at the end of the column names; we capture them by passing a regular expression to names_pattern.
You can avoid pyjanitor altogether by using melt and rename:
(df.rename(columns=lambda col: col[-1])
.melt(var_name='LON', value_name='R', ignore_index=False)
)
LON R
LAT
89 0 0.01
88 0 0.01
87 0 0.00
86 0 0.02
85 0 0.00
89 1 0.01
88 1 0.00
87 1 0.02
86 1 0.00
85 1 0.00
89 2 0.02
88 2 0.00
87 2 0.01
86 2 0.03
85 2 0.00
89 3 0.01
88 3 0.01
87 3 0.02
86 3 0.02
85 3 0.01
89 4 0.00
88 4 0.00
87 4 0.01
86 4 0.00
85 4 0.03
Another approach; does this work?
pd.wide_to_long(df.reset_index(), ['E'], i = 'LAT', j = 'LON').reset_index().sort_values(by = ['LAT','LON'])
LAT LON E
4 85 0 0.00
9 85 1 0.00
14 85 2 0.00
19 85 3 0.01
24 85 4 0.03
3 86 0 0.02
8 86 1 0.00
13 86 2 0.03
18 86 3 0.02
23 86 4 0.00
2 87 0 0.00
7 87 1 0.02
12 87 2 0.01
17 87 3 0.02
22 87 4 0.01
1 88 0 0.01
6 88 1 0.00
11 88 2 0.00
16 88 3 0.01
21 88 4 0.00
0 89 0 0.01
5 89 1 0.01
10 89 2 0.02
15 89 3 0.01
20 89 4 0.00
Quick and dirty.
Pad your LAT with your LON in a list of tuple pairs.
[
(89.0, 0.01),
(89.1, 0.01),
(89.2, 0.02)
]
I'm sure someone can work out a way to organize it the way you want... but as far as I know you need a unique ID per data point for most data in a query structure.
OR:
If you aren't putting this back into a db, then maybe you can use a dict something like this:
{ '89' : { '0' : 0.01,
'1' : 0.01,
'2' : 0.02 .....
}
}
You can then get the data with:
dpoint = data['89']['0']
assert dpoint == 0.01  # True
dpoint = data['89']['2']
assert dpoint == 0.02  # True
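If it helps, a sketch of building that nested dict straight from the wide frame in the question (keys kept as strings to match the example above; the name nested is illustrative):
nested = {str(lat): {col[-1]: val for col, val in row.items()}
          for lat, row in df.iterrows()}
print(nested['89']['0'])  # 0.01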
I am looking for a way to aggregate (in pandas) a subset of values over a particular partition, the equivalent of:
select table.*,
sum(income) over (order by id, num_yyyymm rows between 3 preceding and 1 preceding) as prev_income_3,
sum(income) over (order by id, num_yyyymm rows between 1 following and 3 following) as next_income_3
from table order by a.id_customer, num_yyyymm;
I tried the following solution, but it has some problems:
1) It takes ages to complete.
2) I have to merge all the results at the end.
for x, y in df.groupby(['id_customer']):
    print(y[['num_yyyymm', 'income']])
    y['next3'] = y['income'].iloc[::-1].rolling(3).sum()
    print(y[['num_yyyymm', 'income', 'next3']])
    break
Results:
num_yyyymm income next3
0 201501 0.00 0.00
1 201502 0.00 0.00
2 201503 0.00 0.00
3 201504 0.00 0.00
4 201505 0.00 0.00
5 201506 0.00 0.00
6 201507 0.00 0.00
7 201508 0.00 0.00
8 201509 0.00 0.00
9 201510 0.00 0.00
10 201511 0.00 0.00
11 201512 0.00 0.00
12 201601 0.00 0.00
13 201602 0.00 0.00
14 201603 0.00 0.00
15 201604 0.00 0.00
16 201605 0.00 0.00
17 201606 0.00 0.00
18 201607 0.00 0.00
19 201608 0.00 0.00
20 201609 0.00 1522.07
21 201610 0.00 1522.07
22 201611 0.00 1522.07
23 201612 1522.07 0.00
24 201701 0.00 -0.00
25 201702 0.00 1.52
26 201703 0.00 1522.07
27 201704 0.00 1522.07
28 201705 1.52 1520.55
29 201706 1520.55 0.00
30 201707 0.00 NaN
31 201708 0.00 NaN
32 201709 0.00 NaN
Does anybody have an alternative solution?
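For what it's worth, a rough vectorized sketch of one possible alternative (the column names are taken from the question, the index is assumed to be unique, and the very last rows of each customer come out as NaN rather than the partial sums a SQL window frame would give):
import pandas as pd

df = df.sort_values(['id_customer', 'num_yyyymm'])
g = df.groupby('id_customer')['income']

# rolling 3-row sum ending at the current row, per customer
roll3 = g.transform(lambda s: s.rolling(3, min_periods=1).sum())

# rows between 3 preceding and 1 preceding == roll3 one row earlier
df['prev_income_3'] = roll3.groupby(df['id_customer']).shift(1)

# rows between 1 following and 3 following == roll3 three rows later
df['next_income_3'] = roll3.groupby(df['id_customer']).shift(-3)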
I have a DataFrame df filled with rows and columns where there are duplicate Id's:
Index A B
0 0.00 0.00
1 0.00 0.00
29 0.50 105.00
36 0.80 167.00
37 0.80 167.00
42 1.00 209.00
44 0.50 105.00
45 0.50 105.00
46 0.50 105.00
50 0.00 0.00
51 0.00 0.00
52 0.00 0.00
53 0.00 0.00
When I use:
df.drop_duplicates(subset=['A'], keep='last')
I get:
Index A B
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Which makes sense, that's what the function does. However, what I actually would like to achieve is something like:
Index A B
1 0.00 0.00
29 0.50 105.00
37 0.80 167.00
42 1.00 209.00
46 0.50 105.00
53 0.00 0.00
Basically, from each consecutive block of column A, such as (0, 0) or (0.80, 0.80), I want to pick the last row.
It is also important that the values in column A stay in this order (0; 0.5; 0.8; 1; 0.5; 0) and do not get mixed up.
Compare each value with the next one, using Series.ne with Series.shift(-1), and filter by boolean indexing:
df1 = df[df['A'].ne(df['A'].shift(-1))]
print (df1)
A B
Index
1 0.0 0.0
29 0.5 105.0
37 0.8 167.0
42 1.0 209.0
46 0.5 105.0
53 0.0 0.0
Details:
print (df['A'].ne(df['A'].shift(-1)))
Index
0 False
1 True
29 True
36 False
37 True
42 True
44 False
45 False
46 True
50 False
51 False
52 False
53 True
Name: A, dtype: bool
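An equivalent way to think about it, in case you ever need the whole block rather than just its last row: label each run of identical consecutive values in A and keep the last row of every run. A sketch on the same df; the result should match the filter above, and the original index is preserved:
# each change in A starts a new block id
block = df['A'].ne(df['A'].shift()).cumsum()
df1 = df.groupby(block).tail(1)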
I have a dataframe like this:
Ind TIME PREC ET PET YIELD
0 1 1.21 0.02 0.02 0.00
1 2 0.00 0.03 0.04 0.00
2 3 0.00 0.03 0.05 0.00
3 4 0.00 0.04 0.05 0.00
4 5 0.00 0.05 0.07 0.00
5 6 0.00 0.03 0.05 0.00
6 7 0.00 0.02 0.04 0.00
7 8 1.14 0.03 0.04 0.00
8 9 0.10 0.02 0.03 0.00
9 10 0.00 0.03 0.04 0.00
10 11 0.10 0.05 0.11 0.00
11 12 0.00 0.06 0.15 0.00
12 13 2.30 0.14 0.44 0.00
13 14 0.17 0.09 0.29 0.00
14 15 0.00 0.13 0.35 0.00
15 16 0.00 0.14 0.39 0.00
16 17 0.00 0.10 0.31 0.00
17 18 0.00 0.15 0.51 0.00
18 19 0.00 0.22 0.58 0.00
19 20 0.10 0.04 0.09 0.00
20 21 0.00 0.04 0.06 0.00
21 22 0.27 0.13 0.43 0.00
22 23 0.00 0.10 0.25 0.00
23 24 0.00 0.03 0.04 0.00
24 25 0.00 0.04 0.05 0.00
25 26 0.43 0.04 0.15 0.00
26 27 0.17 0.06 0.23 0.00
27 28 0.50 0.02 0.04 0.00
28 29 0.00 0.03 0.04 0.00
29 30 0.00 0.04 0.08 0.00
30 31 0.00 0.04 0.08 0.00
31 1 6.48 1.97 5.10 0.03
32 32 0.00 0.22 0.70 0.00
33 33 0.00 0.49 0.88 0.00
In this dataframe the column 'TIME' shows the ordinal day number in the year, and after the end of every month it also contains the ordinal month number, which messes up all dataframe calculations, so I would like to drop all rows that contain a month value. First, I tried to use .shift():
df = df.loc[df.TIME == df.TIME.shift() + 1]
However, in this case I delete twice as many rows as I should. I also tried to delete every value after the end of every month:
for i in indexes:
    df = df.loc[df.index != i]
where indexes is a list containing the row indexes right after the day value equals 31, 59, ..., 365, i.e. the end of every month. However, in a leap year these values would be different; I could create another list for a leap year, but that method would be very un-Pythonic. So I wonder, is there any better way to delete these non-consecutive values from a dataframe (excluding the point where one year ends and the next one starts: 364, 365, 1, 2)?
EDIT: I should probably add that there are twenty years in this dataframe, so this is how the dataframe looks at the end of each year:
TIME PREC ET PET YIELD
370 360 0.00 0.14 0.26 0.04
371 361 0.00 0.15 0.27 0.04
372 362 0.00 0.14 0.25 0.04
373 363 0.11 0.18 0.32 0.04
374 364 0.00 0.15 0.25 0.04
375 365 0.00 0.17 0.29 0.04
376 12 16.29 4.44 7.74 1.89
377 1 0.00 0.16 0.28 0.03
378 2 0.00 0.18 0.32 0.03
379 3 0.00 0.22 0.40 0.03
df
TIME PREC ET PET YIELD
0 360 0.00 0.14 0.26 0.04
1 361 0.00 0.15 0.27 0.04
2 362 0.00 0.14 0.25 0.04
3 363 0.11 0.18 0.32 0.04
4 364 0.00 0.15 0.25 0.04
5 365 0.00 0.17 0.29 0.04
6 12 16.29 4.44 7.74 1.89
7 1 1.21 0.02 0.02 0.00
8 2 0.00 0.03 0.04 0.00
9 3 0.00 0.03 0.05 0.00
10 4 0.00 0.04 0.05 0.00
11 5 0.00 0.05 0.07 0.00
12 6 0.00 0.03 0.05 0.00
13 7 0.00 0.02 0.04 0.00
14 8 1.14 0.03 0.04 0.00
15 9 0.10 0.02 0.03 0.00
16 10 0.00 0.03 0.04 0.00
17 11 0.10 0.05 0.11 0.00
18 12 0.00 0.06 0.15 0.00
19 13 2.30 0.14 0.44 0.00
20 14 0.17 0.09 0.29 0.00
21 15 0.00 0.13 0.35 0.00
22 16 0.00 0.14 0.39 0.00
23 17 0.00 0.10 0.31 0.00
24 18 0.00 0.15 0.51 0.00
25 19 0.00 0.22 0.58 0.00
26 20 0.10 0.04 0.09 0.00
27 21 0.00 0.04 0.06 0.00
28 22 0.27 0.13 0.43 0.00
29 23 0.00 0.10 0.25 0.00
30 24 0.00 0.03 0.04 0.00
31 25 0.00 0.04 0.05 0.00
32 26 0.43 0.04 0.15 0.00
33 27 0.17 0.06 0.23 0.00
34 28 0.50 0.02 0.04 0.00
35 29 0.00 0.03 0.04 0.00
36 30 0.00 0.04 0.08 0.00
37 31 0.00 0.04 0.08 0.00
38 1 6.48 1.97 5.10 0.03
39 32 0.00 0.22 0.70 0.00
40 33 0.00 0.49 0.88 0.00
Look at the diffs in TIME. The month rows are the ones where TIME jumps sharply backwards, so drop the rows where the diff is -12 or less; the year-end step from month 12 to day 1 is only -11, so that row survives.
df[~df.TIME.diff().le(-12)]
TIME PREC ET PET YIELD
0 360 0.00 0.14 0.26 0.04
1 361 0.00 0.15 0.27 0.04
2 362 0.00 0.14 0.25 0.04
3 363 0.11 0.18 0.32 0.04
4 364 0.00 0.15 0.25 0.04
5 365 0.00 0.17 0.29 0.04
7 1 1.21 0.02 0.02 0.00
8 2 0.00 0.03 0.04 0.00
9 3 0.00 0.03 0.05 0.00
10 4 0.00 0.04 0.05 0.00
11 5 0.00 0.05 0.07 0.00
12 6 0.00 0.03 0.05 0.00
13 7 0.00 0.02 0.04 0.00
14 8 1.14 0.03 0.04 0.00
15 9 0.10 0.02 0.03 0.00
16 10 0.00 0.03 0.04 0.00
17 11 0.10 0.05 0.11 0.00
18 12 0.00 0.06 0.15 0.00
19 13 2.30 0.14 0.44 0.00
20 14 0.17 0.09 0.29 0.00
21 15 0.00 0.13 0.35 0.00
22 16 0.00 0.14 0.39 0.00
23 17 0.00 0.10 0.31 0.00
24 18 0.00 0.15 0.51 0.00
25 19 0.00 0.22 0.58 0.00
26 20 0.10 0.04 0.09 0.00
27 21 0.00 0.04 0.06 0.00
28 22 0.27 0.13 0.43 0.00
29 23 0.00 0.10 0.25 0.00
30 24 0.00 0.03 0.04 0.00
31 25 0.00 0.04 0.05 0.00
32 26 0.43 0.04 0.15 0.00
33 27 0.17 0.06 0.23 0.00
34 28 0.50 0.02 0.04 0.00
35 29 0.00 0.03 0.04 0.00
36 30 0.00 0.04 0.08 0.00
37 31 0.00 0.04 0.08 0.00
39 32 0.00 0.22 0.70 0.00
40 33 0.00 0.49 0.88 0.00
df[df['TIME'].shift().fillna(0) <= df['TIME']]
Gives what you're looking for. You were almost there with
df.loc[df.TIME == df.TIME.shift() +1]
But you don't need to get rid of the cases where .shift() is smaller, because those are just the first days of the next month.
The addition of .fillna(0) takes care of the NaN in the first row of df['TIME'].shift().
Edit:
For the end-of-year case, also allow a backward step of up to 11, to catch the spot where the month-12 row is followed by day 1 of the next year.
That would give
df[df['TIME'].shift().fillna(0) <= df['TIME'] + 11]
Edit 2:
By the by, I checked solution runtimes, and the current version (df[~df.TIME.diff().le(-12)]) of @piRSquared's answer seems to run fastest.
For completeness, comparing the one presented in this post with the original version posted by @piRSquared, the former was a bit faster on datasets of about 10,000 rows or fewer, and the latter somewhat faster on larger ones.