Grouping and splitting to avoid leakage - python

I have a pandas dataframe where the data is arranged as follows:
     filename  label
0     4456723      0
1  4456723_01      0
2  4456723_02      0
3     ab43912      1
4  ab43912_01      1
5  ab43912_03      1
..        ...    ...
I want to randomly split this dataframe into training and validation sets. However, a purely random split would introduce leakage, because the files are images with slight variations saved under different names: ab43912, ab43912_01, and ab43912_03, for example, are all the same image with some variation.
Is there any efficient way to group these files and then make a split that doesn't introduce leakage?

You can manually select ~80% of the unique file handles at random:
import numpy as np
import pandas as pd

df = pd.DataFrame({'filename': list('aaabbbcccdddeeefff')})
df['filename'] = df['filename'] + ['', '_01', '_02'] * 6

# Get the unique handles (the part before the underscore)
files = df.filename.str.split('_').str[0]

# Randomly select ~80% of the unique handles
m = files.isin(np.random.choice(files.unique(), int(files.nunique() * 0.8), replace=False))

# Split on the mask
train, test = df.loc[m], df.loc[~m]
In effect we get a 2/3 to 1/3 split because of the small N: int(6 * 0.8) selects 4 of the 6 unique handles.
train:
   filename
0         a
1      a_01
2      a_02
6         c
7      c_01
8      c_02
12        e
13     e_01
14     e_02
15        f
16     f_01
17     f_02
test:
   filename
3         b
4      b_01
5      b_02
9         d
10     d_01
11     d_02
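If scikit-learn is available, the same grouping idea is built in; here is a minimal sketch using GroupShuffleSplit, which guarantees that all variants of a handle land on the same side of the split:

import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

df = pd.DataFrame({'filename': [f + s for f in 'abcdef' for s in ('', '_01', '_02')]})
# Group key: the handle before the underscore
groups = df.filename.str.split('_').str[0]

gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(df, groups=groups))
train, test = df.iloc[train_idx], df.iloc[test_idx]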


Losing my target variable when encoding categorical variables

I am dealing with a little challenge: I am trying to create a (multiclass) logistic regression model. Some of my variables are categorical, so I am trying to encode them.
My initial dataset is shown in a screenshot (not reproduced here). The column I want to predict is action1_preflop; it contains 3 possible classes: "r", "c", "f".
When encoding the categorical features, I end up losing the variable I want to predict, as it gets converted into 3 sub-variables:
action1_preflop_r
action1_preflop_f
action1_preflop_c
Below is the new dataframe after encoding
       tiers  tiers2_theory  ...  action1_preflop_f  action1_preflop_r
0          7             11  ...                  1                  0
1          1              7  ...                  0                  1
2          5             11  ...                  1                  0
3          1             11  ...                  0                  1
4          1              7  ...                  0                  1
...      ...            ...  ...                ...                ...
31007      4             11  ...                  0                  1
31008      1             11  ...                  0                  1
31009      1             11  ...                  0                  1
31010      1             11  ...                  0                  1
31011      2              7  ...                  0                  1

[31012 rows x 11 columns]
Could you please let me know how I am supposed to deal with these new variables, considering that the variable I wanted to predict is the one that got encoded?
Thanks for the help.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
df_raw = pd.read_csv('\\Users\\rapha\\Desktop\\Consulting\\Poker\\Tables test\\SB_preflop_a1_prob V1.csv', sep=";")
#Select categorical features only & use binary encoding
feature_cols = ['tiers','tiers2_theory','tiers3_theory','assorties','score','proba_preflop','action1_preflop']
df_raw = df_raw[feature_cols]
cat_features = df_raw.select_dtypes(include=[object])
num_features = df_raw.select_dtypes(exclude=[object])
df = num_features.join(pd.get_dummies(cat_features))
df = df.select_dtypes(exclude = [object])
df_outcome = df.action1_preflop
df_variables = df.drop('action1_preflop',axis=1)
x = df_variables
y = df.action1_preflop
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)
predict_test=lm.predict(x_test)
print(lm.score(x_test, y_test))
You should leave 'action1_preflop' out of the cat_features dataframe and include it in the num_features dataframe:
cat_features = df_raw.select_dtypes(include=[object])
cat_features = cat_features.drop(['action1_preflop'], axis=1)
num_features = df_raw.select_dtypes(exclude=[object])
num_features = pd.concat([num_features, df_raw['action1_preflop']], axis=1)
You can also save some typing, and skip the join, by building the list of columns to encode first:
cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove("action1_preflop")
And then, you can just include this list of columns in the columns parameter
df = pd.get_dummies(df_raw, columns=cat_features)
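Putting it together, a minimal end-to-end sketch of this approach (column names taken from the question; the CSV path is a hypothetical stand-in): encode only the categorical feature columns and keep the string target as-is, since scikit-learn's LogisticRegression accepts string class labels directly.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df_raw = pd.read_csv('SB_preflop_a1_prob.csv', sep=';')  # hypothetical path

# One-hot encode every object column except the target
cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove('action1_preflop')
df = pd.get_dummies(df_raw, columns=cat_features)

X = df.drop('action1_preflop', axis=1)
y = df['action1_preflop']            # 'r', 'c', 'f' stay as string labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
lm = LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(X_train, y_train)
print(lm.score(X_test, y_test))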

How to parse tables from .txt files using Pandas

I'm trying to extract tables from log files which are in .txt format. The file is loaded using read_csv() from pandas.
The log file looks like this:
aaa
bbb
ccc
=====================
A B C D E F
=====================
1 2 3 4 5 6
7 8 9 1 2 3
4 5 6 7 8 9
1 2 3 4 5 6
---------------------
=====================
G H I J
=====================
1   3 4
5   6 7
---------------------
=====================
K L M N O
=====================
1     2 3
4 5     6
7 8   9
---------------------
xxx
yyy
zzz
Here are some points about the log file:
Files start and end with some lines of comment which can be ignored.
In the example above there are three tables.
Headers for each table are located between lines of "======..."
The end of each table is signified by a line of "------..."
My code as of now:
import pandas as pd
import itertools

df = pd.read_csv("xxx.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # while loop to find lines which are table rows & append to one list
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            r.append(df.iloc[i+x].str.split().tolist())
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
This code raises AssertionError: 14 columns passed, passed data had 15 columns. I know this is because, for the table rows, I am using .str.split(), which by default splits on whitespace. Since some columns have missing values, the number of header elements and the number of row elements do not match for the second and third tables. I am struggling to get around this, since the amount of whitespace that signifies a missing value differs from table to table.
My question is: is there a way to account for missing values in some of the columns, so that I get a DataFrame where missing values appear as null or NaN as appropriate?
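A side note, added here for completeness rather than taken from the answers below: when a table's columns start at fixed character positions, pandas.read_fwf can often parse the block directly and yields NaN for empty cells. A minimal sketch under that fixed-width assumption:

import pandas as pd
from io import StringIO

block = """A    B    C    D
1    2         4
5         7    8"""

# read_fwf infers fixed-width column boundaries; empty cells become NaN
t = pd.read_fwf(StringIO(block))
print(t)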
Using Victor Ruiz's method, I added if-conditions to handle different header sizes.
Description in code:
import re
import pandas as pd
import itertools

df = pd.read_csv("stack.txt", sep="\n", header=None)

# delimiters for header and end-of-table
h_dl = "=" * 21
r_dl = "-" * 21

for i in range(len(df.index)-2):
    # if block to find lines which are table headers & convert to list
    if (df.iloc[i].any() == h_dl) & (df.iloc[i+2].any() == h_dl):
        h = df.iloc[i+1].str.split().tolist()
        h = list(itertools.chain(*h))
        # get header string
        head = df.iloc[i+1].to_string()
        # get space distance in header
        space_range = 0
        for result in re.findall(r'([ ]*)', head):
            if len(result) > 0:
                space_range = len(result)
        x = 3
        r = []
        while True:
            if df.iloc[i+x].any() == r_dl:
                break
            # strip line
            line = df.iloc[i+x].to_string()[5::]
            # collect items based on element distance
            items = []
            for result in re.finditer(r'(\d+)([ ]*)', line):
                item, delimiter = result.groups()
                items.append(item)
                if len(delimiter) > space_range*2+1:
                    items.append('NaN')
                    items.append('NaN')
                if len(delimiter) < space_range*2+2 and len(delimiter) > space_range:
                    items.append('NaN')
            r.append([items])
            x += 1
        r = list(itertools.chain(*r))
        # create pandas dataframe with header and rows obtained above
        t = pd.DataFrame(data=r, columns=h)
Output:
   A  B  C  D  E  F
0  1  2  3  4  5  6
1  7  8  9  1  2  3
2  4  5  6  7  8  9
3  1  2  3  4  5  6

   G    H  I  J
0  1  NaN  3  4
1  5  NaN  6  7

   K    L    M    N     O
0  1  NaN  NaN    2     3
1  4    5  NaN  NaN     6
2  7    8  NaN    9  None
Maybe this can help you.
Suppose we have the following line of text, where the wide gap hides a missing value:
1           3     4
The problem is to identify how many spaces delimit two consecutive items, without mistaking a missing value between them for a plain delimiter.
Let's say 5 spaces is a delimiter, and more than 5 spaces means a missing value.
You can use a regex to parse the items:
from re import finditer
from numpy import nan

line = '1           3     4'
items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    if len(delimiter) > 5:
        items.append(nan)
print(items)
Output is:
['1', nan, '3', '4']
A more complex situation arises if two or more consecutive missing values can appear (the code above will inject only one nan).
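One way to handle that case, sketched here as an extension of the idea above (the cell width of 5 is an assumption carried over from the example): estimate how many cells a gap spans from its width, and inject one nan per skipped cell.

from re import finditer
from numpy import nan

cell_width = 5                       # assumed single-cell delimiter width
line = '1' + ' ' * 17 + '4'          # hypothetical line with two missing values

items = []
for result in finditer(r'(\d+)([ ]*)', line):
    item, delimiter = result.groups()
    items.append(item)
    # each extra cell_width of spaces beyond the plain delimiter is one nan
    n_missing = max(0, round(len(delimiter) / cell_width) - 1)
    items.extend([nan] * n_missing)

print(items)  # ['1', nan, nan, '4']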

How to delete a matrix cell's neighbors that have the same value as it

I have a matrix as shown below (read from a txt file passed as an argument), and every cell has neighbors. Once you pick a cell, that cell and all neighboring cells that contain the same number disappear.
1 0 4 7 6 8
0 5 4 4 5 5
2 1 4 4 4 6
4 1 3 7 4 4
I've tried to do this using recursion. I split the function into four parts: up(), down(), left() and right(). But I got an error: RecursionError: maximum recursion depth exceeded in comparison.
# 'lines' holds the matrix rows read from the txt file
cmd = input("Row,column:")
cmdlist = cmd.split(",")
row, column = int(cmdlist[0]), int(cmdlist[1])
num = lines[row-1][column-1]

def up(x, y):
    if lines[x-2][y-1] == num and x > 1:
        left(x, y)
        right(x, y)
        lines[x-2][y-1] = None

def left(x, y):
    if lines[x-1][y-2] == num and y > 1:
        up(x, y)
        down(x, y)
        lines[x-1][y-2] = None

def right(x, y):
    if lines[x-1][y] == num and y < len(lines[row-1]):
        up(x, y)
        down(x, y)
        lines[x-1][y] = None

def down(x, y):
    if lines[x][y-1] == num and x < len(lines):
        left(x, y)
        right(x, y)
        lines[x][y-1] = None

up(row, column)
down(row, column)

for i in lines:
    print(str(i).strip("[]").replace(",", "").replace("None", " "))
When I give the input (3,3), which lands on a "4", the output should look like this:
1 0   7 6 8
0 5     5 5
2 1       6
4 1 3 7
I don't need fixed code, just the main idea will be enough. Thanks a lot.
Recursion error happens when your recursion does not terminate.
You can solve this without recursion, using sets of indexes:
collect all indexes that contain the looked-for number into all_num_idx
add the index you are currently at (your input) to a set tbd (to be deleted)
loop over tbd and add every index from all_num_idx that differs by only -1/+1 in row or column from any index already in the set
repeat until tbd no longer grows
delete all indexes in tbd:
t = """4 0 4 7 6 8
0 5 4 4 5 5
2 1 4 4 4 6
4 1 3 7 4 4"""
data = [k.strip().split() for k in t.splitlines()]
row,column=map(int,input("Row,column:").strip().split(";"))
num = data[row][column]
len_r =len(data)
len_c = len(data[0])
all_num_idx = set((r,c) for r in range(len_r) for c in range(len_c) if data[r][c]==num)
tbd = set( [ (row,column)] ) # inital field
tbd_size = 0 # different size to enter while
done = set() # we processed those already
while len(tbd) != tbd_size: # loop while growing
tbd_size=len(tbd)
for t in tbd:
if t in done:
continue
# only 4-piece neighbourhood +1 or -1 in one direction
poss_neighbours = set( [(t[0]+1,t[1]), (t[0],t[1]+1),
(t[0]-1,t[1]), (t[0],t[1]-1)] )
# 8-way neighbourhood with diagonals
# poss_neighbours = set((t[0]+a,t[1]+b) for a in range(-1,2) for b in range(-1,2))
tbd = tbd.union( poss_neighbours & all_num_idx)
# reduce all_num_idx by all those that we already addded
all_num_idx -= tbd
done.add(t)
# delete the indexes we collected
for r,c in tbd:
data[r][c]=None
# output
for line in data:
print(*(c or " " for c in line) , sep=" ")
Output:
Row,column: 3,4
4 0   7 6 8
0 5     5 5
2 1       6
4 1 3 7
This is a variant of a "flood fill" algorithm, flooding only cells of a certain value. See https://en.wikipedia.org/wiki/Flood_fill
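For reference, a compact iterative sketch of the same flood-fill idea, using an explicit stack instead of a frontier set (assumes 0-based row/column indexes):

from collections import deque

def flood_delete(grid, r, c):
    # delete the picked cell and every 4-connected cell with the same value
    target = grid[r][c]
    stack = deque([(r, c)])
    while stack:
        x, y = stack.pop()
        if 0 <= x < len(grid) and 0 <= y < len(grid[0]) and grid[x][y] == target:
            grid[x][y] = None  # marking as deleted also prevents revisits
            stack.extend([(x+1, y), (x-1, y), (x, y+1), (x, y-1)])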
Maybe you should replace
def right(x, y):
    if lines[x-1][y] == num and y < len(lines[row-1]):
        up(x, y)
        down(x, y)
        lines[x-1][y] = None
with
def right(x, y):
    if lines[x-1][y] == num and y < len(lines[row-1]):
        lines[x-1][y] = None
        up(x, y + 1)
        down(x, y + 1)
        right(x, y + 1)
and do the same for all the other functions.
Setting lines[x-1][y] = None before recursing ensures that your algorithm terminates, and shifting the indices ensures that the next step starts from the neighbouring cell.

pandas display: truncate column display rather than wrapping

With lengthy column names, DataFrames will display in a very messy form seemingly no matter what options are set.
Info: I'm in Jupyter QtConsole, pandas 0.20.1, with the following relevant options specified at startup:
pd.set_option('display.max_colwidth', 20)
pd.set_option('expand_frame_repr', False)
pd.set_option('display.max_rows', 25)
Question: how can I truncate the DataFrame if necessary rather than wrapping the columns to the next line, while keeping expand_frame_repr=False?
Here's an example. Again, the issue doesn't depend on the number of columns but on the length of the column names.
This will not cause an issue:
df = pd.DataFrame(np.random.randn(1000, 1000),
                  columns=['col' + str(i) for i in range(1000)])
since the output is perfectly readable (screenshot not reproduced here).
The same DataFrame with long column names causes the issue I'm talking about:
df = pd.DataFrame(np.random.randn(1000, 1000),
                  columns=['very_long_col_name_' + str(i) for i in range(1000)])
Is there any way I'm missing to make the second output conform to the first? (Through specifying an option, not through using .iloc every time I want to view the frame.)
Use display.max_columns:
from string import ascii_letters

df = pd.DataFrame(np.random.randint(10, size=(5, 52)), columns=list(ascii_letters))

with pd.option_context(
    'display.max_colwidth', 20,
    'expand_frame_repr', False,
    'display.max_rows', 25,
    'display.max_columns', 5,
):
    print(df.add_prefix('really_long_column_name_'))
   really_long_column_name_a  really_long_column_name_b  ...  really_long_column_name_Y  really_long_column_name_Z
0                          8                          1  ...                          1                          9
1                          8                          5  ...                          2                          1
2                          5                          0  ...                          9                          9
3                          6                          8  ...                          0                          9
4                          1                          2  ...                          7                          1

[5 rows x 52 columns]
Another idea... Obviously not exactly what you want, but maybe you can twist it to your needs.
d1 = df.add_suffix('_really_long_column_name')

with pd.option_context('display.max_colwidth', 4, 'expand_frame_repr', False):
    mw = pd.get_option('display.max_colwidth')
    print(d1.rename(columns=lambda x: x[:mw-3] + '...' if len(x) > mw else x))
a... b... c... d... e... f... g... h... i... j... ... Q... R... S... T... U... V... W... X... Y... Z...
0 6 5 5 5 8 3 5 0 7 6 ... 9 0 6 9 6 8 4 0 6 7
1 0 5 4 7 2 5 4 3 8 7 ... 8 1 5 3 5 9 4 5 5 3
2 7 2 1 6 5 1 0 1 3 1 ... 6 7 0 9 9 5 2 8 2 2
3 1 8 7 1 4 5 5 8 8 3 ... 3 6 5 7 1 0 8 1 4 0
4 7 5 6 2 4 9 7 9 0 5 ... 6 8 1 6 3 5 4 2 3 2
Looks like it will need an enhancement. The relevant code in the repr function appears to be here:
max_rows = get_option("display.max_rows")
max_cols = get_option("display.max_columns")
show_dimensions = get_option("display.show_dimensions")

if get_option("display.expand_frame_repr"):
    width, _ = console.get_console_size()
else:
    width = None

self.to_string(buf=buf, max_rows=max_rows, max_cols=max_cols,
               line_width=width, show_dimensions=show_dimensions)
So either you pass expand_frame_repr=True and it wraps on the line width, or you pass expand_frame_repr=False and it shouldn't. But it looks like there is a bug in the code (this should be pandas 0.20.3 iirc):
in pd.io.formats.format.DataFrameFormatter:
def _chk_truncate(self):
    """
    Checks whether the frame should be truncated. If so, slices
    the frame up.
    """
    from pandas.core.reshape.concat import concat

    # Column of which first element is used to determine width of a dot col
    self.tr_size_col = -1

    # Cut the data to the information actually printed
    max_cols = self.max_cols
    max_rows = self.max_rows

    if max_cols == 0 or max_rows == 0:  # assume we are in the terminal
                                        # (why else = 0)
        (w, h) = get_terminal_size()
        self.w = w
        self.h = h
        if self.max_rows == 0:
            dot_row = 1
            prompt_row = 1
            if self.show_dimensions:
                show_dimension_rows = 3
            n_add_rows = (self.header + dot_row + show_dimension_rows +
                          prompt_row)
            # rows available to fill with actual data
            max_rows_adj = self.h - n_add_rows
            self.max_rows_adj = max_rows_adj

        # Format only rows and columns that could potentially fit the
        # screen
        if max_cols == 0 and len(self.frame.columns) > w:
            max_cols = w
        if max_rows == 0 and len(self.frame) > h:
            max_rows = h
Looks like it intended to do what you wanted, but was unfinished. It's checking max_cols against the number of columns, not the total width of the columns.
So you could either create a show_df function that would calculate the correct number of columns and show it in an option_context like pi2Squared's answer, or fix it here (and maybe submit a patch if you need it distributed).
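As an illustration of the first option, a rough sketch of such a show_df helper (the width parameters are made-up defaults, not pandas internals): estimate how many columns fit in the available width and let option_context truncate the rest.

import pandas as pd

def show_df(df, total_width=120, pad=3):
    # widest header, capped at display.max_colwidth (assumed to be an int here)
    col_w = min(int(pd.get_option('display.max_colwidth')),
                max(len(str(c)) for c in df.columns)) + pad
    n_cols = max(1, total_width // col_w)
    with pd.option_context('display.max_columns', n_cols,
                           'expand_frame_repr', False):
        print(df)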
As others have pointed out, Pandas itself seems to be bugged or badly designed here, so a workaround is required.
Most of the time this problem occurs with numerical columns, since numbers are relatively short. Pandas will split the column heading onto multiple lines if there are spaces in it, so you can "hack in" the correct behavior by inserting spaces into column headings for numerical columns when you display the dataframe. I have a one-liner to do this:
def colfix(df, L=5): return df.rename(columns=lambda x: ' '.join(x.replace('_', ' ')[i:i+L] for i in range(0,len(x),L)) if df[x].dtype in ['float64','int64'] else x )
To display your dataframe, simply type:
colfix(your_df)
Note that the renaming does not permanently change the dataframe; it only adds spaces to the names for the purpose of displaying it that one time.
Results (in a Jupyter Notebook): screenshots with and without colfix (not reproduced here). With colfix, the long numerical headings wrap within their own columns instead of forcing the frame onto extra lines.

How to Align Unicode-Type Values of a Column?

I have the following output containing two columns (line# and ID):
1 Q50331
2 P75247
3 P75544
4 P22446
5 P78027
6 P75271
7 P75176
8 P0ABB4
9 P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
How can I make the ID values in the second column align with one another, like the following:
1  Q50331
2  P75247
3  P75544
4  P22446
5  P78027
6  P75271
7  P75176
8  P0ABB4
9  P63284
10 P0A6M8
11 P0AES4
12 P39452
13 P0A8T7
14 P0A698
The problem I am facing is how to incorporate a solution into my code. I tried to use the Python tabulate package, but found it does not work properly, since what I am printing, row[0], is a unicode string from the tuple row (see the following code).
count = 0
for row in c:
    count += 1
    print count, row[0]
Any idea how I can incorporate tabulate or another method to align the unicode-type values in the column?
Use alignment specifiers:
data = {
    1: 'Q50331',
    2: 'P75247',
    3: 'P75544',
    4: 'P22446',
    5: 'P78027',
    6: 'P75271',
    7: 'P75176',
    8: 'P0ABB4',
    9: 'P63284',
    10: 'P0A6M8',
    11: 'P0AES4',
    12: 'P39452',
    13: 'P0A8T7',
    14: 'P0A698',
    333: 'P00bar'
}

length = len(str(max(data.keys()))) + 1
for k, v in data.items():
    print "{:<{}}{}".format(k, length, v)
Output:
1   Q50331
2   P75247
3   P75544
4   P22446
5   P78027
6   P75271
7   P75176
8   P0ABB4
9   P63284
10  P0A6M8
11  P0AES4
12  P39452
13  P0A8T7
14  P0A698
333 P00bar
I've created length, which holds the length of the largest key in data plus 1. I then pass that value to my alignment specifier, which in this case resolves to 4:
{:<4}{}
