With the following code snippet
import pandas as pd
train = pd.read_csv('train.csv', parse_dates=['dates'])
print(train['dates'])
I load and inspect the data.
My question is: how can I standardize/normalize train['dates'] so that all the elements lie between -1 and 1 (linearly or as a Gaussian)?
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
import time

def convert_to_timestamp(x):
    """Convert date objects to integers"""
    return time.mktime(x.timetuple())

def normalize(df):
    """Normalize the DF using min/max"""
    scaler = MinMaxScaler(feature_range=(-1, 1))
    # MinMaxScaler expects 2D input, hence df[['dates']]; ravel back to 1D
    dates_scaled = scaler.fit_transform(df[['dates']]).ravel()
    return dates_scaled

if __name__ == '__main__':
    # Create a random series of dates
    df = pd.DataFrame({
        'dates':
            ['1980-01-01', '1980-02-02', '1980-03-02', '1980-01-21',
             '1981-01-21', '1991-02-21', '1991-03-23']
    })

    # Convert to date objects
    df['dates'] = pd.to_datetime(df['dates'])

    # Now that df has date objects, convert them to UNIX timestamps
    df['dates'] = df['dates'].apply(convert_to_timestamp)

    # Call the normalization function
    df = normalize(df)
Sample:
Date objects that we convert using convert_to_timestamp
dates
0 1980-01-01
1 1980-02-02
2 1980-03-02
3 1980-01-21
4 1981-01-21
5 1991-02-21
6 1991-03-23
UNIX timestamps that we can normalize using a MinMaxScaler from sklearn
dates
0 315507600
1 318272400
2 320778000
3 317235600
4 348858000
5 667069200
6 669661200
Normalized to (-1, 1), the final result
[-1. -0.98438644 -0.97023664 -0.99024152 -0.81166138 0.98536228
1. ]
A solution using only pandas:
df = pd.DataFrame({
    'A':
        ['1980-01-01', '1980-02-02', '1980-03-02', '1980-01-21',
         '1981-01-21', '1991-02-21', '1991-03-23'] })
df['A'] = pd.to_datetime(df['A']).astype('int64')
max_a = df.A.max()
min_a = df.A.min()
min_norm = -1
max_norm = 1
df['NORMA'] = (df.A - min_a) * (max_norm - min_norm) / (max_a - min_a) + min_norm
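If you ever need the original dates back, the min/max formula above is easy to invert; a minimal sketch, reusing min_a, max_a, min_norm and max_norm from above (the 'recovered' column name is just for illustration):
# undo the scaling, then reinterpret the int64 nanoseconds as datetimes
orig_ns = (df['NORMA'] - min_norm) * (max_a - min_a) / (max_norm - min_norm) + min_a
df['recovered'] = pd.to_datetime(orig_ns.round().astype('int64'))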
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
df = pd.DataFrame(np.random.randint(1, 100, (1000, 2)).astype(np.float64), columns=['A', 'B'])
A B
0 87 95
1 15 12
2 85 88
3 33 61
4 33 29
5 33 91
6 67 19
7 68 20
8 79 18
9 29 93
.. .. ..
990 70 84
991 37 24
992 91 12
993 92 13
994 4 64
995 32 98
996 97 62
997 38 40
998 12 56
999 48 8
[1000 rows x 2 columns]
# specify your desired range (-1, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(df.values)
print(scaled)
[[ 0.7551 0.9184]
[-0.7143 -0.7755]
[ 0.7143 0.7755]
...,
[-0.2449 -0.2041]
[-0.7755 0.1224]
[-0.0408 -0.8571]]
df[['A', 'B']] = scaled
df
Out[30]:
A B
0 0.7551 0.9184
1 -0.7143 -0.7755
2 0.7143 0.7755
3 -0.3469 0.2245
4 -0.3469 -0.4286
5 -0.3469 0.8367
6 0.3469 -0.6327
7 0.3673 -0.6122
8 0.5918 -0.6531
9 -0.4286 0.8776
.. ... ...
990 0.4082 0.6939
991 -0.2653 -0.5306
992 0.8367 -0.7755
993 0.8571 -0.7551
994 -0.9388 0.2857
995 -0.3673 0.9796
996 0.9592 0.2449
997 -0.2449 -0.2041
998 -0.7755 0.1224
999 -0.0408 -0.8571
[1000 rows x 2 columns]
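If you later need the original values back, the fitted scaler remembers the per-column min/max, so the transform can be undone (a minimal sketch):
original = scaler.inverse_transform(scaled)   # values back in their original range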
I am filtering records down to last month's data; however, when doing
emp_df = emp_df[emp_df['Date'].dt.month == (currentMonth - 1)]
it misses some records (it treats the month of some records as the day). Link to File
from datetime import datetime, date
import pandas as pd
import numpy as np

cholareport = pd.read_excel("D:/Automations/HealthCheck and Audit Trail/report.xlsx")
uniqueemp = set(cholareport['Email'])
cholareport['Date'] = pd.to_datetime(cholareport['Date'])
uniqueemp = set(cholareport['Email'])
daystoignore = ['Holiday_COE', 'Leave_COE']
# datedfforemp = pd.DataFrame(columns=uniqueemp)
cholareport['Date'] = cholareport['Date'].apply(lambda x:
                                                pd.to_datetime(x).strftime('%d/%m/%Y'))
cholareport["Date"] = pd.to_datetime(cholareport["Date"], utc=True)

for emp in uniqueemp:
    emp_df = cholareport[cholareport['Email'].isin([emp])]
    emp_df = emp_df[~emp_df['Task: Task Name'].isin(daystoignore)]
    # s1 = pd.to_datetime(emp_df['Date']).dt.strftime('%Y-%m')
    # s2 = (pd.to_datetime('today').strftime('%Y-%m') - pd.DateOffset(months=1)).strftime('%Y-%m')
    # emp_df = emp_df[s1 == s2]
    currentMonth = datetime.now().month
    # print(currentMonth)
    # print(emp_df['Date'])
    emp_df['Date'] = pd.to_datetime(emp_df['Date']).dt.strftime("%dd-%mm-%YYYY")
    format_data = "%dd-%mm-%YYYY"
    empdfdate = []
    for i in emp_df['Date']:
        empdfdate.append(datetime.strptime(i, format_data))
    print(empdfdate)
    emp_df['Date'] = empdfdate
    for i in emp_df['Date']:
        print(i.month, i.day)
    # emp_df['Date'] = pd.to_datetime(emp_df['Date']).dt.strftime('%Y-%m')
    emp_df = emp_df[emp_df['Date'].dt.month == (currentMonth-1)]
    for i in emp_df['Date']:
        print(i.month, i.day)
Results :
6 10
7 10
10 10
11 10
12 10
10 13
10 14
Expected:
6 10
7 10
10 10
11 10
12 10
13 10
14 10
I am not entirely sure what you want to accomplish. If I understand it correctly, you simply want to count the number of entries per day for the past month. In that case, you can simply do the following.
from datetime import datetime
import pandas as pd
report = pd.read_excel('report.xlsx')
print('day: counts', report.Date[report.Date.dt.month == datetime.now().month - 1].dt.day.value_counts(), sep='\n')
I do not get your expected results. It might be that you also want to filter by email somehow; however, I cannot tell from your code what it is that you want to do (a per-email sketch follows the output below).
Output:
day: counts
3 101
5 101
6 101
7 101
4 101
24 84
28 84
27 84
26 84
25 84
10 82
11 82
12 82
13 82
14 82
17 67
21 67
20 67
19 67
18 67
31 2
Name: Date, dtype: int64
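If a per-employee breakdown is also needed, a minimal sketch along the same lines, assuming the Email column from your file:
last_month = report[report.Date.dt.month == datetime.now().month - 1]
print(last_month.groupby([last_month.Email, last_month.Date.dt.day]).size())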
I'm trying to work out the best way to compute a p-value using Fisher's exact test from four columns in a dataframe. I have already extracted the four parts of a contingency table, with 'a' being top-left, 'b' being top-right, 'c' being bottom-left and 'd' being bottom-right. I have started adding calculated columns via simple pandas operations, but these aren't necessary if there's an easier way to just use the four initial columns. I have over 1 million rows when including an additional set (x.type = high), so I want to use an efficient method. So far this is my code:
import pandas as pd
import glob
import math
path = r'directory_path'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
frame['a+b'] = frame['a'] + frame['b']
frame['c+d'] = frame['c'] + frame['d']
frame['a+c'] = frame['a'] + frame['c']
frame['b+d'] = frame['b'] + frame['d']
As an example of this data, 'frame' currently shows:
ID(n) a b c d i x.name x.type a+b c+d a+c b+d
0 1258065 5 28 31 1690 1754 Albumin low 33 1721 36 1718
1 1132105 4 19 32 1699 1754 Albumin low 23 1731 36 1718
2 898621 4 30 32 1688 1754 Albumin low 34 1720 36 1718
3 573158 4 30 32 1688 1754 Albumin low 34 1720 36 1718
4 572975 4 23 32 1695 1754 Albumin low 27 1727 36 1718
... ... ... ... ... ... ... ... ... ... ... ... ...
666646 12435 1 0 27 1726 1754 WHR low 1 1753 28 1726
666647 15119 1 0 27 1726 1754 WHR low 1 1753 28 1726
666648 17053 1 2 27 1724 1754 WHR low 3 1751 28 1726
666649 24765 1 3 27 1723 1754 WHR low 4 1750 28 1726
666650 8733 1 1 27 1725 1754 WHR low 2 1752 28 1726
Is the best way to convert these to a numpy array and process them by iteration, or to keep them in pandas? I assume that I can't use math functions directly within a dataframe (I've tried math.comb(), which didn't work on a dataframe). I've also tried using pyranges for its fisher method, but it doesn't seem to work in my environment (Python 3.8).
Any help would be much appreciated!
Following the answer here, which came from the author of pyranges (I think), let's say your data is something like:
import pandas as pd
import scipy.stats as stats
import numpy as np
np.random.seed(111)
df = pd.DataFrame(np.random.randint(1,100,(1000000,4)))
df.columns=['a','b','c','d']
df['ID'] = range(1000000)
df.head()
a b c d ID
0 85 85 85 87 0
1 20 42 67 83 1
2 41 72 58 8 2
3 13 11 66 89 3
4 29 15 35 22 4
Convert it into a numpy array and do it like in the post:
c = df[['a','b','c','d']].to_numpy(dtype='uint64')
from fisher import pvalue_npy
_, _, twosided = pvalue_npy(c[:, 0], c[:, 1], c[:, 2], c[:, 3])
df['odds'] = (c[:, 0] * c[:, 3]) / (c[:, 1] * c[:, 2])
df['pvalue'] = twosided
Or you can pass the columns in directly:
_, _, twosided = pvalue_npy(df['a'].to_numpy(np.uint), df['b'].to_numpy(np.uint),
df['c'].to_numpy(np.uint), df['d'].to_numpy(np.uint))
df['odds'] = (df['a'] * df['d']) / (df['b'] * df['c'])
df['pvalue'] = twosided
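For a sanity check on a handful of rows, scipy's fisher_exact computes the same two-sided test for a single 2x2 table (a minimal sketch; it is far too slow to loop over a million rows, which is why the vectorized fisher package is used above):
row = df.loc[0, ['a', 'b', 'c', 'd']]
odds, p = stats.fisher_exact([[row['a'], row['b']], [row['c'], row['d']]])
print(odds, p, df.loc[0, 'pvalue'])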
So I'm using scikit-learn to classify some data. I have 13 different class values/categories to classify the data into. I have been able to use cross-validation and print the confusion matrix. However, it only shows the TP and FP etc. without the class labels, so I don't know which class is which. Below is my code and my output:
# Imports assumed; since the pipeline mixes a sampler and a classifier,
# imbalanced-learn's make_pipeline is the one that fits here.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

def classify_data(df, feature_cols, file):
    nbr_folds = 5
    RANDOM_STATE = 0
    attributes = df.loc[:, feature_cols]  # Also known as x
    class_label = df['task']  # Class label, also known as y.

    file.write("\nFeatures used: ")
    for feature in feature_cols:
        file.write(feature + ",")
    print("Features used", feature_cols)

    sampler = RandomOverSampler(random_state=RANDOM_STATE)

    print("RandomForest")
    file.write("\nRandomForest")
    rfc = RandomForestClassifier(max_depth=2, random_state=RANDOM_STATE)
    pipeline = make_pipeline(sampler, rfc)
    class_label_predicted = cross_val_predict(pipeline, attributes, class_label, cv=nbr_folds)

    conf_mat = confusion_matrix(class_label, class_label_predicted)
    print(conf_mat)
    accuracy = accuracy_score(class_label, class_label_predicted)
    print("Rows classified: " + str(len(class_label_predicted)))
    print("Accuracy: {0:.3f}%\n".format(accuracy * 100))
    file.write("\nClassifier settings:" + str(pipeline) + "\n")
    file.write("\nRows classified: " + str(len(class_label_predicted)))
    file.write("\nAccuracy: {0:.3f}%\n".format(accuracy * 100))
    file.writelines('\t'.join(str(j) for j in i) + '\n' for i in conf_mat)
#Output
Rows classified: 23504
Accuracy: 17.925%
0 372 46 88 5 73 0 536 44 317 0 200 127
0 501 29 85 0 136 0 655 9 154 0 172 67
0 97 141 78 1 56 0 336 37 429 0 435 198
0 135 74 416 5 37 0 507 19 323 0 128 164
0 247 72 145 12 64 0 424 21 296 0 304 223
0 190 41 36 0 178 0 984 29 196 0 111 43
0 218 13 71 7 52 0 917 139 177 0 111 103
0 215 30 84 3 71 0 1175 11 55 0 102 62
0 257 55 156 1 13 0 322 184 463 0 197 160
0 188 36 104 2 34 0 313 99 827 0 69 136
0 281 80 111 22 16 0 494 19 261 0 313 211
0 207 66 87 18 58 0 489 23 157 0 464 239
0 113 114 44 6 51 0 389 30 408 0 338 315
As you can see, you can't really tell which column is which, and the printout is misaligned, so it's difficult to read.
Is there a way to print the labels as well?
From the docs, it seems there is no option to print the row and column labels of the confusion matrix. However, you can specify the label order using the labels=... argument.
Example:
from sklearn.metrics import confusion_matrix
y_true = ['yes','yes','yes','no','no','no']
y_pred = ['yes','no','no','no','no','no']
print(confusion_matrix(y_true, y_pred))
# Output:
# [[3 0]
# [2 1]]
print(confusion_matrix(y_true, y_pred, labels=['yes', 'no']))
# Output:
# [[1 2]
# [0 3]]
If you want to print the confusion matrix with labels, you may try pandas and set the index and columns of the DataFrame.
import pandas as pd
cmtx = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=['yes', 'no']),
index=['true:yes', 'true:no'],
columns=['pred:yes', 'pred:no']
)
print(cmtx)
# Output:
# pred:yes pred:no
# true:yes 1 2
# true:no 0 3
Or
import numpy as np
unique_label = np.unique([y_true, y_pred])
cmtx = pd.DataFrame(
confusion_matrix(y_true, y_pred, labels=unique_label),
index=['true:{:}'.format(x) for x in unique_label],
columns=['pred:{:}'.format(x) for x in unique_label]
)
print(cmtx)
# Output:
# pred:no pred:yes
# true:no 3 0
# true:yes 2 1
It is important to ensure that the way you label your confusion matrix rows and columns corresponds exactly to the way sklearn has coded the classes. The true order of the labels can be revealed using the .classes_ attribute of the classifier. You can use the code below to prepare a confusion matrix data frame.
labels = rfc.classes_
conf_df = pd.DataFrame(confusion_matrix(class_label, class_label_predicted), columns=labels, index=labels)
conf_df.index.name = 'True labels'
The second thing to note is that your classifier is not predicting labels well. The number of correctly predicted labels is shown on the main diagonal of the confusion matrix. You have non-zero values across the matrix, and some classes have not been predicted at all - the columns that are all zero. It might be a good idea to run the classifier with its default parameters and then try to optimise them.
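For instance, a minimal sketch of that suggestion, assuming attributes and class_label are the feature matrix and labels from the question (the parameter grid is only illustrative):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Default parameters first (no max_depth=2 cap), then a small, illustrative grid search
rfc = RandomForestClassifier(random_state=0)
grid = GridSearchCV(rfc, {'n_estimators': [100, 300], 'max_depth': [None, 10, 20]}, cv=5)
grid.fit(attributes, class_label)
print(grid.best_params_, grid.best_score_)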
Another good way of doing this is using the crosstab function in pandas.
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
or
pd.crosstab(le.inverse_transform(y_true),
le.inverse_transform(y_pred),
rownames=['True'],
colnames=['Predicted'],
margins=True)
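Here le is assumed to be a sklearn LabelEncoder that was used to encode the labels; a self-contained sketch of that variant:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

le = LabelEncoder()
y_true_enc = le.fit_transform(['yes', 'yes', 'yes', 'no', 'no', 'no'])
y_pred_enc = le.transform(['yes', 'no', 'no', 'no', 'no', 'no'])
# inverse_transform recovers the string labels for display
print(pd.crosstab(le.inverse_transform(y_true_enc),
                  le.inverse_transform(y_pred_enc),
                  rownames=['True'], colnames=['Predicted'], margins=True))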
Since the confusion matrix is just a numpy array, it does not contain any column information. What you can do is convert your matrix into a dataframe and then print that dataframe.
import pandas as pd
import numpy as np

def cm2df(cm, labels):
    rows = []
    # rows
    for i, row_label in enumerate(labels):
        rowdata = {}
        # columns
        for j, col_label in enumerate(labels):
            rowdata[col_label] = cm[i, j]
        rows.append(pd.DataFrame.from_dict({row_label: rowdata}, orient='index'))
    df = pd.concat(rows)  # DataFrame.append was removed in pandas 2.0, so concatenate instead
    return df[labels]

cm = np.arange(9).reshape((3, 3))
df = cm2df(cm, ["a", "b", "c"])
print(df)
Code snippet is from https://gist.github.com/nickynicolson/202fe765c99af49acb20ea9f77b6255e
Output:
a b c
a 0 1 2
b 3 4 5
c 6 7 8
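Note that, if the labels are already in the right order, the same result can be had in one line (a minimal sketch using the cm and labels above):
df = pd.DataFrame(cm, index=["a", "b", "c"], columns=["a", "b", "c"])
print(df)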
It appears your data has 13 different classes, which is why your confusion matrix has 13 rows and columns. Furthermore, your classes aren't labeled in any way, just integers from what I can see.
If this isn't the case, and your training data has actual labels, you can pass a list of unique labels to confusion_matrix
conf_mat = confusion_matrix(class_label, class_label_predicted, labels=df['task'].unique())
I am making lists based on some conditions.
This is what it looks like.
def time_price_pair(a, b):
    if 32400 <= a and a < 32940:
        a_list = []
        a_list.append(b)
    elif 32940 <= a and a < 33480:
        b_list = []
        b_list.append(b)
    elif 33480 <= a and a < 34020:
        c_list = []
        c_list.append(b)
    ......
    ......
    ......
    elif 52920 <= a and a < 53460:
        some_list = []
        some_list.append(b)
Each boundary increases by 540, like [32400, 32940, 33480, 34020, 34560, 35100, 35640, 36180, 36720, 37260, 37800, 38340, 38880, 39420 ... 53460], and the list names don't matter.
I would use a dict to store these lists of values, and use some math to work out where to put each number:
from collections import defaultdict

lists = defaultdict(list)

def time_price_pair(a, b):
    if 32400 <= a < 53460:
        i = (a - 32400) // 540
        lists[i].append(b)
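A small usage sketch with made-up (time, price) pairs:
for a, b in [(32450, 20), (32500, 30), (32950, 10)]:
    time_price_pair(a, b)
print(dict(lists))   # {0: [20, 30], 1: [10]}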
You can just use a for loop with some incrementing variable i and keep updating the requirements. Something like this:
def time_price_pair(a, b):
    lo = 32400
    inc = 540
    n_bins = 39   # (53460 - 32400) // 540 intervals
    for i in range(n_bins):
        if lo + inc * i <= a < lo + inc * (i + 1):
            bin_start = lo + inc * i
            a_list = [b]
            break
It looks like a simple high-level pandas function pd.cut would suit your purpose very well.
import pandas as pd
import numpy as np
# simulate your data
# ==================================
np.random.seed(0)
a = np.random.randint(32400, 53439, size=1000000)
b = np.random.randn(1000000)
# put them in dataframe
df = pd.DataFrame(dict(a=a, b=b))
print(df)
a b
0 35132 -0.4605
1 43199 -0.9469
2 42245 0.2580
3 52048 -0.7309
4 45523 -0.4334
5 41625 2.0155
6 53157 -1.4712
7 46516 -0.1715
8 47335 -0.6594
9 47830 -1.0391
... ... ...
999990 39754 0.8771
999991 34779 0.7030
999992 37836 0.5409
999993 44330 -0.6747
999994 41078 -1.1368
999995 38752 1.6121
999996 42155 -0.1139
999997 49018 -0.1737
999998 45848 -1.2640
999999 50669 -0.4367
# processing
# ===================================
rng = np.arange(32400, 53461, 540)
# your custom labels
labels = np.arange(1, len(rng))
# use pd.cut()
%time df['cat'] = pd.cut(df.a, bins=rng, right=False, labels=labels)
CPU times: user 52.5 ms, sys: 16 µs, total: 52.5 ms
Wall time: 51.6 ms
print(df)
a b cat
0 35132 -0.4605 6
1 43199 -0.9469 20
2 42245 0.2580 19
3 52048 -0.7309 37
4 45523 -0.4334 25
5 41625 2.0155 18
6 53157 -1.4712 39
7 46516 -0.1715 27
8 47335 -0.6594 28
9 47830 -1.0391 29
... ... ... ..
999990 39754 0.8771 14
999991 34779 0.7030 5
999992 37836 0.5409 11
999993 44330 -0.6747 23
999994 41078 -1.1368 17
999995 38752 1.6121 12
999996 42155 -0.1139 19
999997 49018 -0.1737 31
999998 45848 -1.2640 25
999999 50669 -0.4367 34
[1000000 rows x 3 columns]
# groupby
grouped = df.groupby('cat')['b']
# access to a particular group using your user_defined key
grouped.get_group(1).values
array([ 0.4525, -0.7226, -0.981 , ..., 0.0985, -1.4286, -0.2257])
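From there, per-bin statistics are a one-liner (a small usage sketch):
print(grouped.agg(['mean', 'count']))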
A dictionary could be used to hold all of the used time range bins as follows:
import collections

time_prices = [(32401, 20), (32402, 30), (32939, 42), (32940, 10), (32941, 15), (40000, 123), (40100, 234)]
dPrices = collections.OrderedDict()
for atime, aprice in time_prices:
    abin = 32400 + ((atime - 32400) // 540) * 540   # For bins as times
    #abin = (atime - 32400) // 540 + 1              # For bins starting from 1
    dPrices.setdefault(abin, []).append(aprice)

# Display results
for atime, prices in dPrices.items():
    print atime, prices
This would give you the following output:
32400 [20, 30, 42]
32940 [10, 15]
39960 [123, 234]
Or individually as:
print dPrices[32400]
[20, 30, 42]
Tested using Python 2.7
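For Python 3, only the print calls need parentheses; the rest works unchanged (a minimal sketch):
for atime, prices in dPrices.items():
    print(atime, prices)
print(dPrices[32400])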
I have a df that looks like:
import pandas as pd
import numpy as np
d = {'Hours':np.arange(12, 97, 12),
'Average':np.random.random(8),
'Count':[500, 250, 125, 75, 60, 25, 5, 15]}
df = pd.DataFrame(d)
This df has a decreasing number of cases in each row. After the count drops below a certain threshold, I'd like to drop the remainder, for example once a count below 10 is reached.
Starting:
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
6 0.302894 5 84
7 0.418912 15 96
Finished (everything after row 6 removed):
Average Count Hours
0 0.560671 500 12
1 0.743811 250 24
2 0.953704 125 36
3 0.313850 75 48
4 0.640588 60 60
5 0.591149 25 72
We can take the index generated from a boolean mask and slice the df using iloc:
In [58]:
df.iloc[:df[df.Count < 10].index[0]]
Out[58]:
Average Count Hours
0 0.183016 500 12
1 0.046221 250 24
2 0.687945 125 36
3 0.387634 75 48
4 0.167491 60 60
5 0.660325 25 72
Just to break down what is happening here
In [54]:
# use a boolean mask to index into the df
df[df.Count < 10]
Out[54]:
Average Count Hours
6 0.244839 5 84
In [56]:
# we want the index and can subscript the first element using [0]
df[df.Count < 10].index
Out[56]:
Int64Index([6], dtype='int64')
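One caveat with index[0]: if no count is below the threshold, it raises an IndexError. A hedged sketch of a variant that guards against that (and uses positions rather than labels, so it also works when the index is not the default RangeIndex):
below = (df.Count < 10).to_numpy()
df_trimmed = df.iloc[:below.argmax()] if below.any() else df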