Add row to pandas data frame causes prediction failure - python

What is the best way to add a row to training data?
import numpy as np
import pandas as pd
# Features=x / Labels=y
new_train1 = pd.DataFrame({'A': [1, 2, 3, 3, 4, 4],
                           'B': [4, 5, 6, 6, 4, 3],
                           'C': ['a', 'b', 'c', 'ddd', 'c', 'ddd']})
new_train2 = pd.DataFrame({'A': [1],
                           'B': [4],
                           'C': ['a']})
# Add new_train2's row to new_train1.
Maybe this would work:
new_train1 = new_train1.append(new_train2)
new_train1 = new_train1.reset_index(drop=True)
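Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on a current install the line above raises an AttributeError. pd.concat is the supported equivalent; a minimal sketch:
# pd.concat replaces the removed DataFrame.append;
# ignore_index=True also makes the reset_index step unnecessary
new_train1 = pd.concat([new_train1, new_train2], ignore_index=True)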
Finally, the data is split into features and labels.
new_train_x = new_train1.iloc[:, 0:2] # Cols A and B (the slice end is exclusive, so 0:1 would select column A only)
new_train_y = new_train1['C']
EDIT: Notably, after attempting this process (adding a row), here is the confusion matrix (the numbers are from my real data set, not the sample above):
[[336 0 7 0 3 0]
[ 23 8 358 0 0 3]
[ 0 0 373 1 0 0]
[ 0 0 0 281 30 25]
[ 0 0 0 14 220 33]
[ 0 0 0 6 14 265]]
Whereas prior to adding the row (and whenever dropping one row, tried multiple times), here is the typical confusion matrix (again using the real data, not the sample above):
[[343 0 0 0 3 0]
[ 2 349 39 0 0 2]
[ 0 52 322 0 0 0]
[ 0 0 0 330 3 3]
[ 0 0 0 3 261 3]
[ 0 0 0 2 1 282]]
And here is the confusion matrix before adding or removing any data points:
[[343 0 0 0 3 0]
[ 3 355 31 0 0 3]
[ 0 30 344 0 0 0]
[ 0 0 0 331 1 4]
[ 0 0 0 1 261 5]
[ 0 0 0 3 4 278]]

Related

How to pivot dataframe into ML format

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
   month  day  week_day  classname_en  origin  destination
0      1    7         2             1       2            5
1      1    2         6             2       1          167
2      2    1         5             1       2           54
3      2    2         6             4       1            6
4      1    2         6             5       6            1
But I want to turn it into something like:
   month_1  month_2  ...  classname_en_1  classname_en_2  ...  origin_1  origin_2  ...  destination_1
0        1        0                   1               0              0         1                   0
1        1        0                   0               1              1         0                   0
2        0        1                   1               0              0         1                   0
3        0        1                   0               0              1         0                   0
4        1        0                   0               0              0         0                   1
Basically, turn every distinct value into its own column, with binary entries: 1 if the value is present in that row, 0 if not.
I don't know if it is at all possible to do with a single function, but I would appreciate any and all help!
To expand on @Corralien's answer:
It is indeed a way to do it, but since you are preparing data for ML, you might introduce a bug.
With the code above you get a matrix with 20 features. Now say you want to predict on some data that suddenly contains a month value your training data never had: the matrix for your prediction data would then have 21 features, so you cannot feed it into your fitted model.
To overcome this you can use one-hot encoding from scikit-learn. It makes sure that you always have the same number of features on new data as in your training data.
import pandas as pd
df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
pd.get_dummies(df_train)
# output
   age  color_blue  color_red
0   10           0          1
1   15           1          0
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})
pd.get_dummies(df_new)
# output
   age  color_blue  color_green  color_red
0   10            0            0          1
1   15            1            0          0
2   20            0            1          0
As you can see, the set of dummy columns has also changed: color_green now appears between color_blue and color_red, so the new matrix no longer lines up with the training one.
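(A pandas-only workaround, sketched here under the same df_train/df_new names: reindex the new data's dummies to the training columns, which drops unseen categories and zero-fills missing ones.)
# Align df_new's dummies with the training layout:
# unseen categories (color_green) are dropped, missing ones are zero-filled
train_cols = pd.get_dummies(df_train).columns
aligned = pd.get_dummies(df_new).reindex(columns=train_cols, fill_value=0)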
If we instead use OneHotEncoder, we can avoid all of those issues:
from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed = ohe.fit_transform(df_train[["color"]])  # creates a sparse matrix
ohe_features = ohe.get_feature_names_out()  # ['color_blue', 'color_red']
pd.DataFrame(color_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
   color_blue  color_red
0           0          1
1           1          0
# now transform new data
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
# output
   color_blue  color_red
0           0          1
1           1          0
2           0          0
Note in the last row that both color_blue and color_red are zero, since that row has color "green", which was not present in the training data.
Note that the todense() call is only used here to illustrate how the encoding works. Usually you would keep the result as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
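For example, a minimal sketch (assuming scipy is installed) that appends the numeric age column while keeping everything sparse:
# scipy.sparse.hstack concatenates blocks column-wise and converts
# the dense age block to sparse automatically
import scipy.sparse as sp
X = sp.hstack([color_ohe_transformed, df_train[["age"]].to_numpy()], format="csr")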
Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]
You can use pandas' get_dummies function to convert each value into an indicator column.
For your data the code would be:
import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 2, 2, 1],
    'day': [7, 2, 1, 2, 2],
    'week_day': [2, 6, 5, 6, 6],
    'classname_en': [1, 2, 1, 4, 5],
    'origin': [2, 1, 2, 1, 6],
    'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)
Result: the same 20-column indicator table as shown in the answer above.

Cannot pick columns dtype after numpy hstack

I want to select columns based on dtype.
Examples:
import numpy as np
import pandas as pd

a = np.random.randn(10, 10).astype('float')
b = np.random.randn(10, 10).astype('uint8')
t = np.hstack((a, b))
t = pd.DataFrame(t)
uints = t.select_dtypes(include=['uint8']).columns.tolist()
The expected output for uints is: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
The problem is that when I join my original numpy arrays (a and b) using hstack, the dtypes are not detected correctly: the code above returns [].
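(The root cause: np.hstack returns a single homogeneous array, so NumPy promotes both inputs to a common dtype, float64 here, and the uint8 information is lost before pandas ever sees it. A quick check illustrates this:)
import numpy as np
a = np.random.randn(10, 10).astype('float')
b = np.random.randn(10, 10).astype('uint8')
print(np.hstack((a, b)).dtype)  # float64 -- one dtype for the entire array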
I think pandas can handle mixed data types better, since a DataFrame stores one dtype per column. Try this:
# Converting your arrays to DataFrames (each block keeps its own dtypes)
a = pd.DataFrame(np.random.randn(10, 10).astype('float'))
b = pd.DataFrame(np.random.randn(10, 10).astype('uint8'))
df = pd.concat([a, b], axis=1)  # horizontally concatenating a and b
df.columns = range(20)          # setting the column names manually
print(df.head())
0 1 2 3 4 5 6 \
0 0.931404 0.612939 -0.369925 -0.777209 0.776831 1.923639 0.714632
1 1.002620 0.612617 -0.184530 -0.279565 -0.021436 1.079653 0.299139
2 0.938141 0.621674 1.723074 0.298568 -0.892739 -1.154118 -2.623486
3 -1.050390 -1.058590 1.319297 -1.052302 -0.633126 -1.089275 0.796025
4 -0.312114 -0.045124 -0.094495 0.296262 0.518496 0.068003 -1.247959
7 8 9 10 11 12 13 14 15 16 17 18 19
0 0.710094 -1.465146 -0.009591 0 255 0 255 0 0 0 0 0 1
1 1.645174 -0.491199 0.961290 0 253 0 1 254 1 0 255 0 0
2 0.633076 -1.366998 -0.450123 0 1 255 255 0 0 255 0 254 0
3 -0.650617 1.226741 1.884750 0 255 0 0 0 0 255 0 1 0
4 -0.774224 0.780239 -1.072834 0 254 3 2 0 0 0 0 0 0
df.select_dtypes(include=['uint8']).columns.tolist()
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]

Convert dataframe string into multiple dummy variables in Python

I have a dataframe with several columns. One column is "category", which is a space-separated string. A sample of the df's category column is:
3 36 211 433 474 533 690 980
3 36 211
3 16 36 211 396 398 409
3 35 184 590 1038
67 179 208 1008 5000 5237
I have another list of categories dict = [3,5,7,8,16,5000].
What I would like to see is a new data frame with dict as columns, and 0/1 as entries. If a row in df contains the dict entry, it's 1, else it's 0. So the output is:
3 5 7 8 16 36 5000
1 0 0 0 0 1 0
1 0 0 0 0 1 0
1 0 0 0 1 1 0
1 0 0 0 0 0 0
0 0 0 0 0 0 1
Have tried something like:
for cat in level_0_cat:
    df[cat] = df.apply(lambda x: int(cat in map(int, x.category.split())), axis=1)
But it does not work for a large dataset (10 million rows). Have also tried isin, but have not figured it out. Any ideas are appreciated.
This ought to do it.
# Read your data
>>> s = pd.read_clipboard(sep='|', header=None)
# Convert `cats` to string to make `to_string` approach work below
>>> cats = list(map(str, [3,4,7,8,16,36,5000]))
>>> cats
['3', '4', '7', '8', '16', '36', '5000']
# Nested list comprehension... Checks whether each `v` in `cats` exists in each row
>>> encoded = [[1 if v in set(s.ix[idx].to_string().split()) else 0 for idx in s.index] for v in cats]
>>> encoded
[[1, 1, 1, 1, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 1, 0, 0], [1, 1, 1, 0, 0], [0, 0, 0, 0, 1]]
>>> import numpy as np
# Convert the whole thing to a dataframe to add columns
>>> encoded = pd.DataFrame(data=np.array(encoded).T, columns=cats)
>>> encoded
3 4 7 8 16 36 5000
0 1 0 0 0 0 1 0
1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0
3 1 0 0 0 0 0 0
4 0 0 0 0 0 0 1
Edit: here is a way to do this without directly calling any pandas indexing methods like ix or loc (ix is deprecated in modern pandas).
encoded = [[1 if v in row else 0 for row in s[0].str.split().map(set)] for v in cats]
encoded
Out[18]:
[[1, 1, 1, 1, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 1, 1, 0, 0],
[0, 0, 0, 0, 1]]
encoded = pd.DataFrame(data=np.array(encoded).T, columns=cats)
encoded
Out[20]:
3 4 7 8 16 36 5000
0 1 0 0 0 0 1 0
1 1 0 0 0 0 1 0
2 1 0 0 0 1 1 0
3 1 0 0 0 0 0 0
4 0 0 0 0 0 0 1
You don't need to convert every line to integers; it's simpler to convert the elements of the list of categories to strings and test membership against each line's tokens:
categories = [l.strip() for l in '''\
3 36 211 433 474 533 690 980
3 36 211
3 16 36 211 396 398 409
3 35 184 590 1038
67 179 208 1008 5000 5237'''.split('\n')]
cats = [3, 5, 7, 8, 16, 5000]
d = [str(n) for n in cats]
result = []
for category in categories:
    tokens = set(category.split())  # exact tokens, so '3' does not match '5237'
    result.append([1 if s in tokens else 0 for s in d])
Please don't use dict (which is a builtin type) as the name of one of your objects.
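For completeness, a vectorized pandas alternative to all of the above (a sketch, not one of the answers' methods): Series.str.get_dummies splits on the separator and one-hot encodes in a single step, and reindex pins the result to the fixed category list so absent categories become all-zero columns.
import pandas as pd
s = pd.Series(['3 36 211 433 474 533 690 980',
               '3 36 211',
               '3 16 36 211 396 398 409',
               '3 35 184 590 1038',
               '67 179 208 1008 5000 5237'])
cats = ['3', '5', '7', '8', '16', '36', '5000']
out = s.str.get_dummies(sep=' ').reindex(columns=cats, fill_value=0)
print(out)  # one 0/1 column per category, rows aligned with s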

reason for transposed confusion matrix in heatmap

I plot a heatmap which takes a confusion matrix as input data. The confusion matrix has the shape:
[[37 0 0 0 0 0 0 0 0 0]
[ 0 42 0 0 0 1 0 0 0 0]
[ 1 0 43 0 0 0 0 0 0 0]
[ 0 0 0 44 0 0 0 0 1 0]
[ 0 0 0 0 37 0 0 1 0 0]
[ 0 0 0 0 0 47 0 0 0 1]
[ 0 0 0 0 0 0 52 0 0 0]
[ 0 0 0 0 1 0 0 47 0 0]
[ 0 1 0 1 0 0 0 1 45 0]
[ 0 0 0 0 0 2 0 0 0 45]]
The code to plot the heatmap is:
import matplotlib.pyplot as plt
import seaborn as sns

fig2 = plt.figure()
fig2.add_subplot(111)
sns.heatmap(confm.T, annot=True, square=True, cbar=False, fmt="d")
plt.xlabel("true label")
plt.ylabel("predicted label")
which yields a heatmap with the true labels along the x-axis and the predicted labels along the y-axis (plot not reproduced here).
As you can see, the input matrix "confm" is transposed (confm.T). What is the reason for this? Do I necessarily have to do that?
When I plot your data with the code you provided, I get the same plot as in your question. Without the transpose, and with the x and y labels swapped:
fig2 = plt.figure()
fig2.add_subplot(111)
sns.heatmap(confm, annot=True, square=True, cbar=False, fmt="d")
plt.xlabel("predicted label")
plt.ylabel("true label")
which results in the same confusion matrix. What the transpose really does is swap which axis holds the prediction and which holds the ground truth (true label). Which form you need depends on how your data is formatted.
You only need to transpose if you want to switch which data is placed along which axis. I usually use the confusion matrix as is: y = true labels, x = predicted labels. You need to transpose the matrix and swap the labels only if you prefer it the other way around: y = predicted labels, x = true labels.
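For reference, a minimal sketch assuming the matrix comes from scikit-learn: confusion_matrix(y_true, y_pred) puts true labels on the rows and predicted labels on the columns, so plotting it untransposed matches x = predicted, y = true.
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

y_true = [0, 0, 1, 1, 2, 2]  # toy labels, only to illustrate the convention
y_pred = [0, 0, 1, 2, 2, 2]
cm = confusion_matrix(y_true, y_pred)  # rows = true labels, cols = predicted
sns.heatmap(cm, annot=True, square=True, cbar=False, fmt="d")
plt.xlabel("predicted label")
plt.ylabel("true label")
plt.show()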

New pandas dataframe from meta information of existing DF

I currently have a CSV file that I read into a dataframe as follows:
[in]
df = pd.read_csv(file_name)
df.sort_values('TOTAL_MONTHS', inplace=True)  # DataFrame.sort was removed; use sort_values
print(df[['TOTAL_MONTHS', 'COUNTEM']])
[out]
TOTAL_MONTHS COUNTEM
12 0
12 0
12 2
25 10
25 0
37 1
68 3
I want to get the total number of rows (by TOTAL_MONTHS) for which the 'COUNTEM' value falls within a preset bin.
The data is going to be entered into a histogram via excel/powerpoint with:
X-axis = Number of contracts
Y-axis = Total_months
Color of bar = COUNTEM
The input of the graph is like this (columns being COUNTEM bins):
MONTHS 0 1-3 4-6 7-10 10+ 20+
0 0 0 0 0 0 0
1 0 0 0 0 0 0
2 0 0 0 0 0 0
3 0 0 0 0 0 0
...
12 2 1 0 0 0 0
...
25 1 0 0 0 1 0
...
37 0 1 0 0 0 0
...
68 0 1 0 0 0 0
Ideally I'd like the code to output a dataframe in that format.
Interesting problem. I don't know pandas deeply, so there may well be a much fancier and simpler solution to this. However, it can also be done through iteration in the following manner:
# First, imports and create your data
import pandas as pd

DF = pd.DataFrame({'TOTAL_MONTHS': [12, 12, 12, 25, 25, 37, 68],
                   'COUNTEM': [0, 0, 2, 10, 0, 1, 3]})

# Next create a data frame of 'bins' with the months as index and all
# values set at a default of zero
New_DF = pd.DataFrame({'bin0': 0,
                       'bin1': 0,
                       'bin2': 0,
                       'bin3': 0,
                       'bin4': 0,
                       'bin5': 0},
                      index=DF.TOTAL_MONTHS.unique())
In [59]: New_DF
Out[59]:
bin0 bin1 bin2 bin3 bin4 bin5
12 0 0 0 0 0 0
25 0 0 0 0 0 0
37 0 0 0 0 0 0
68 0 0 0 0 0 0
#Create a list of bins (rather than 20 to infinity I limited it to 100)
bins = [[0], range(1, 4), range(4, 7), range(7, 10), range(10, 20), range(20, 100)]
# Now iterate over the months of the New_DF index and slice the original
# DF where TOTAL_MONTHS equals the month of the current iteration. Then
# get a value count from the original data frame and place each count in
# the appropriate column of New_DF:
for month in New_DF.index:
    monthly = DF[DF['TOTAL_MONTHS'] == month]
    counts = monthly['COUNTEM'].value_counts()
    for count in counts.keys():
        for x in range(len(bins)):
            if count in bins[x]:
                New_DF.loc[month, New_DF.columns[x]] = counts[count]
Which gives me:
In [62]: New_DF
Out[62]:
bin0 bin1 bin2 bin3 bin4 bin5
12 2 1 0 0 0 0
25 1 0 0 0 1 0
37 0 1 0 0 0 0
68 0 1 0 0 0 0
Which appears to be what you want. You can rename the columns and index as you see fit.
Hope this helps. Perhaps someone has a solution that uses a built-in pandas function, but for now this seems to work.
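For reference, a shorter sketch with built-in pandas functions (pd.cut plus pd.crosstab, using bin edges that match the ranges above; the labels are mine):
import pandas as pd

DF = pd.DataFrame({'TOTAL_MONTHS': [12, 12, 12, 25, 25, 37, 68],
                   'COUNTEM': [0, 0, 2, 10, 0, 1, 3]})
labels = ['0', '1-3', '4-6', '7-9', '10-19', '20+']
binned = pd.cut(DF['COUNTEM'], bins=[-1, 0, 3, 6, 9, 19, float('inf')],
                labels=labels)
out = (pd.crosstab(DF['TOTAL_MONTHS'], binned)
         .reindex(columns=labels, fill_value=0))  # keep empty bins as zero columns
print(out)
This reproduces the bin0..bin5 counts above, with the bin labels as column names.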
