Friday, November 11, 2016

Nearest adversary border detection



We continue to improve our border detection algorithm. We started with a simple algorithm based on the differences in kNN results under different k values (here), which did not work well when the results for different k values tended to be the same. We then adopted a local region test, relying on a break in class homogeneity across a pre-set number of neighbors as an indicator of being near a border. It is easy to see that the local region approach will not work if the clouds of points for each class are far apart: the local region will always be of a single class, preventing the algorithm from detecting any border at all.

Nearest adversary

We now evaluate another border detection strategy. We still examine a local region, but we invert the process: instead of checking whether a data point is itself a border point, we use that data point to flag other points as border points. That is, for each point, we scan its neighbors in order of distance until we find a point of the opposing class (an adversary), and we label that adversary as a border point. It is a border point because it is the nearest opposing-class point to at least one member of the other class. This should work well in concept. We now test whether it works well in practice.
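In code, the idea reduces to a few lines. Below is a minimal brute-force sketch, assuming a small two-class dataset X, y; the scikit-learn-based implementation used later in this post searches neighbors with kneighbors instead, and the function name here is purely illustrative.

import numpy as np

def nearest_adversary_sketch(X,y):
    # for each point, find its single nearest opposing-class point and mark it as a border point
    border=np.zeros(len(X),dtype=bool)
    for i in range(len(X)):
        opponents=np.where(y!=y[i])[0]                     # indices of the opposing class
        dists=np.linalg.norm(X[opponents]-X[i],axis=1)     # distances from point i to every adversary
        border[opponents[np.argmin(dists)]]=True           # the closest adversary is a border point
    return X[border],y[border]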

So let's get to work. We will reuse the code for the Voronoi tessellation, but increase the number of meshgrid points to make the decision boundaries more precise.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

def generate_data(N,n):  # N is population size, n is sample size
    # generate N+n random points from a uniform distribution over [-1,+1] in each coordinate
    P_N=np.array(np.random.uniform(-1,+1,(N+n,2)))
    X_n=P_N[:n,:]
    P_N=P_N[n:N+n,:]
    
    # random dividing line g in standard form
    end_pts=np.array(np.random.uniform(-1,+1,(2,2)))
    slope=(end_pts[1,1]-end_pts[0,1])/(end_pts[1,0]-end_pts[0,0])
    b=end_pts[0,1]-slope*end_pts[0,0]
    w_div=np.array([b,slope,-1])
    
    # add x0=1 to the sample vectors in X and the population vectors in P
    X=np.hstack((np.ones(shape=(n,1)),X_n))
    P=np.hstack((np.ones(shape=(N,1)),P_N))
    
    # classify the training and test data set
    y=np.sign(np.dot(X,w_div))
    y_N=np.sign(np.dot(P,w_div))
    
    # train data, train data labels, test data, test data labels, weights vector
    return X_n,y,P_N,y_N,w_div

def plot_points(ax,X,y,labels=True):
    type_mask=y==1
    ax.scatter(X[type_mask,0],X[type_mask,1],marker='+',label='Type +1',c='r')
    ax.scatter(X[np.invert(type_mask),0],X[np.invert(type_mask),1],marker='x',label='Type -1',c='b')
    if labels: ax.legend(loc=4, fontsize = 'x-small')

def plot_border_points(ax,X_b,y_b,labels=True):
    type_mask_b=y_b==1
    ax.scatter(X_b[type_mask_b,0],X_b[type_mask_b,1],marker='^',label='Type +1, border',c='g',s=40)
    ax.scatter(X_b[np.invert(type_mask_b),0],X_b[np.invert(type_mask_b),1],marker='o',label='Type -1, border',c='gold',s=40)
    if labels: ax.legend(loc=4, fontsize = 'x-small')

def adjust_plot(ax):
    ax.set_aspect('equal')
    ax.axis([-1,+1,-1,+1])
    return ax
    
def plot_voronoi(ax,clf,X,y):
    #x_min,x_max=X[:,0].min(),X[:,0].max()
    #y_min,y_max=X[:,1].min(),X[:,1].max()
    x_min,x_max=-1,+1
    y_min,y_max=-1,+1
    xx,yy=np.meshgrid(np.arange(x_min,x_max+0.01,0.01),np.arange(y_min,y_max+0.01,0.01))
    Z=clf.predict(np.c_[xx.ravel(),yy.ravel()])
    Z=Z.reshape(xx.shape)
    ax.contourf(xx,yy,Z,alpha=0.4)

N=1000
n=1000
train_data,y,test_data,y_N,w_g=generate_data(N,n)

kNN=KNeighborsClassifier(n_neighbors=1).fit(train_data,y)

fig=plt.figure(figsize=(8,4))
ax=fig.add_subplot(121)
ax=adjust_plot(ax)

plot_voronoi(ax,kNN,train_data,y)
plot_points(ax,train_data,y)

plt.show()


Let us run the same code for several values of k in kNN. Notice that the decision boundary stays fairly consistent across k because the data are uniformly distributed. This is a situation where the kNN-delta-based border detector would identify only a few border points, if any.

In [2]:
kNN1=KNeighborsClassifier(n_neighbors=1).fit(train_data,y)
kNN2=KNeighborsClassifier(n_neighbors=3).fit(train_data,y)
kNN3=KNeighborsClassifier(n_neighbors=11).fit(train_data,y)
kNN4=KNeighborsClassifier(n_neighbors=21).fit(train_data,y)

kNNs=[kNN1,kNN2,kNN3,kNN4]

fig=plt.figure(figsize=(12,3))
plot_num=1
for kNN in kNNs:
    ax=plt.subplot(1,4,plot_num)
    ax=adjust_plot(ax)
    plot_voronoi(ax,kNN,train_data,y)
    plot_points(ax,train_data,y)
    
    plot_num+=1
plt.show()


Comparison of the two previous border detectors

To recall our previous border detection algorithms, we show their outputs side by side.

In [3]:
def find_borders_kNN_delta(X,y):
    kNN1=KNeighborsClassifier(n_neighbors=1).fit(X,y)
    pred1=kNN1.predict(X)
    kNN2=KNeighborsClassifier(n_neighbors=5).fit(X,y)
    pred2=kNN2.predict(X)

    # keep only the points on which the two classifiers disagree
    border_rows=pred1!=pred2
    X=X[border_rows]
    y=y[border_rows]
    
    return X,y,np.count_nonzero(border_rows)

def find_borders_local_field(X,y):
    kNN=KNeighborsClassifier(n_neighbors=1).fit(X,y)
    neighbors=kNN.kneighbors(n_neighbors=5,return_distance=False)
    
    # flag a point as a border point if any of its 5 nearest neighbors carries a different label
    border_rows=(y[neighbors]!=y[:,None]).any(axis=1)
    new_X=X[border_rows]
    new_y=y[border_rows]
    
    return new_X,new_y,np.count_nonzero(border_rows)    

fig=plt.figure(figsize=(12,8))
borders_list=[find_borders_kNN_delta(train_data,y),find_borders_local_field(train_data,y)]
plot_num=1
for borders in borders_list:
    ax=plt.subplot(2,3,plot_num)
    ax=adjust_plot(ax)
    plot_points(ax,train_data,y)
    
    ax=plt.subplot(2,3,plot_num+1)
    ax=adjust_plot(ax)
    X_b,y_b,border_N=borders
    plot_border_points(adjust_plot(ax),X_b,y_b)
    
    ax=plt.subplot(2,3,plot_num+2)
    ax=adjust_plot(ax)
    plot_points(ax,train_data,y)
    plot_border_points(adjust_plot(ax),X_b,y_b)

    plot_num+=3
plt.show()


We can see clearly that the delta-based algorithm does not work well on uniformly distributed data, or on data where the region near the decision boundary is equally populated by sample points of both classes. The local region approach addresses this by using a break in local homogeneity to detect whether a point is near a border. However, it is easy to see that if the two classes are widely separated, even the local region approach will not work. We show both situations below.

Non-uniform data

Let us plot a few non-uniform datasets using scikit-learn's datasets module. As long as we keep the noise level low, the spread of each class stays small: the points of each class cluster tightly together, leaving a clear 'gap' between the classes.

In [4]:
from sklearn import datasets as dt
from sklearn.preprocessing import StandardScaler as ss

np.random.seed(0)
n_samples=1000
noisy_circles=dt.make_circles(n_samples=n_samples,factor=0.25,noise=0.1)
noisy_moons=dt.make_moons(n_samples=n_samples,noise=0.1)
blobs=dt.make_blobs(n_samples=n_samples,centers=2,random_state=1)

datasets=[noisy_circles,noisy_moons,blobs]
fig=plt.figure(figsize=(12,4))
plot_num=1

for dataset in datasets:
    X,y=dataset
    X=ss().fit_transform(X)
    X[:,0]=X[:,0]/np.max(X[:,0])
    X[:,1]=X[:,1]/np.max(X[:,1])
    ax=plt.subplot(1,3,plot_num)
    plot_points(adjust_plot(ax),X,y)
    plot_num+=1
plt.show()


We then run the local region homogeneity detector to see it fail. As expected, it is unable to find any border points because every point's nearest neighbors are dominated by its own class.
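Before running it, a quick check of that claim: the snippet below is a diagnostic sketch (not part of the detector) that reuses the raw, unscaled datasets list from the cell above and measures the fraction of points whose five nearest neighbors all share the point's own label. If the claim holds, the printed fractions should be at or very near 1.0, leaving the homogeneity test nothing to flag.

from sklearn.neighbors import KNeighborsClassifier

def local_homogeneity(X,y,k=5):
    # fraction of points whose k nearest neighbors all carry the same label as the point itself
    nn=KNeighborsClassifier(n_neighbors=1).fit(X,y)
    neighbors=nn.kneighbors(n_neighbors=k,return_distance=False)
    return (y[neighbors]==y[:,None]).all(axis=1).mean()

for name,(X,y) in zip(['circles','moons','blobs'],datasets):
    print(name,local_homogeneity(X,y))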

In [5]:
np.random.seed(0)
n_samples=1000
noisy_circles=dt.make_circles(n_samples=n_samples,factor=0.25,noise=0.1)
noisy_moons=dt.make_moons(n_samples=n_samples,noise=0.1)
blobs=dt.make_blobs(n_samples=n_samples,centers=2,random_state=1)

datasets=[noisy_circles,noisy_moons,blobs]
fig=plt.figure(figsize=(12,8))
plot_num=1

for dataset in datasets:
    X,y=dataset
    X=ss().fit_transform(X)
    X[:,0]=X[:,0]/np.max(X[:,0])
    X[:,1]=X[:,1]/np.max(X[:,1])
    
    X_b,y_b,border_N=find_borders_local_field(X,y)
    #if len(X_b)!=0:
    #    X_b[:,0]=X_b[:,0]/np.max(X_b[:,0])
    #    X_b[:,1]=X_b[:,1]/np.max(X_b[:,1])
    
    ax=plt.subplot(2,3,plot_num)
    plot_points(adjust_plot(ax),X,y)
    
    ax=plt.subplot(2,3,plot_num+3)
    plot_border_points(adjust_plot(ax),X_b,y_b)
    plot_num+=1
plt.show()



Nearest adversary detector

To solve the problem above, we now implement the simple nearest adversary algorithm described at the beginning of this post.

In [6]:
def find_borders_nearest_adversary(X,y,friendly_fraction=1.0):
    kNN=KNeighborsClassifier(n_neighbors=1).fit(X,y)
    # query enough neighbors to guarantee that at least one opposing-class point is reached
    neighbors=kNN.kneighbors(n_neighbors=int(1+friendly_fraction*max(sum(y==1),sum(y!=1))),return_distance=False)
    
    adversary_list=[]
    for i,n in enumerate(neighbors):
        # positions (within the neighbor list) of opposing-class points, ordered by distance
        adversary=np.where(y[n]!=y[i])[0]
        if len(adversary)!=0:
            # the first one is this point's nearest adversary: mark it as a border point
            adversary_list.append(n[adversary][0])
    adversary_list=np.array(np.unique(adversary_list))
    
    X_b=X[adversary_list]
    y_b=y[adversary_list]
    
    return X_b,y_b,len(adversary_list)

np.random.seed(0)
n_samples=1000
circles=dt.make_circles(n_samples=n_samples,factor=0.25,noise=0.1)
moons=dt.make_moons(n_samples=n_samples,noise=0.1)
blobs=dt.make_blobs(n_samples=n_samples,centers=2)#,random_state=1)

datasets=[circles,moons,blobs]
fig=plt.figure(figsize=(12,12))
plot_num=1

for dataset in datasets:
    X,y=dataset
    X=ss().fit_transform(X)
    X[:,0]=X[:,0]/np.max(X[:,0])
    X[:,1]=X[:,1]/np.max(X[:,1])
    
    X_b,y_b,border_N=find_borders_nearest_adversary(X,y)
    print('... border fraction:',border_N/len(y))
    #if len(X_b)!=0:
    #    X_b[:,0]=X_b[:,0]/np.max(X_b[:,0])
    #    X_b[:,1]=X_b[:,1]/np.max(X_b[:,1])
    
    ax=plt.subplot(3,3,plot_num)
    plot_points(adjust_plot(ax),X,y)
    
    ax=plt.subplot(3,3,plot_num+1)
    plot_border_points(adjust_plot(ax),X_b,y_b)
    
    ax=plt.subplot(3,3,plot_num+2)
    plot_points(ax,X,y)
    plot_border_points(adjust_plot(ax),X_b,y_b)
    
    plot_num+=3
plt.show()
... border fraction: 0.029
... border fraction: 0.036
... border fraction: 0.013


We are now able to detect the border points even when there are big gaps between the classes. Naturally, since they are border points, they also define the location of the decision boundary. This looks good enough, perhaps even as a candidate for compressing the kNN training set. (There is a literature on compressing kNN memory by keeping only the training points that define the decision boundary, or that maintain the in-sample classification accuracy with the fewest points; some of these approaches also create new points.)
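To get a rough feel for the compression angle, a simple sanity check (an illustrative sketch, not part of the detector; it assumes find_borders_nearest_adversary from the cell above is already defined) is to fit one 1-NN on the full training set and another on only the detected border points, then compare their accuracy on a fresh sample from the same generator.

import numpy as np
from sklearn import datasets as dt
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(0)
X_train,y_train=dt.make_moons(n_samples=1000,noise=0.1)
X_test,y_test=dt.make_moons(n_samples=1000,noise=0.1)

# keep only the nearest-adversary border points of the training set
X_b,y_b,border_N=find_borders_nearest_adversary(X_train,y_train)

full_kNN=KNeighborsClassifier(n_neighbors=1).fit(X_train,y_train)
border_kNN=KNeighborsClassifier(n_neighbors=1).fit(X_b,y_b)

print('stored points       :',len(y_train),'vs',border_N)
print('full 1-NN accuracy  :',full_kNN.score(X_test,y_test))
print('border 1-NN accuracy:',border_kNN.score(X_test,y_test))

If the two accuracies come out close, the border set alone is a serviceable stand-in for the full kNN memory.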

Voronoi tessellation of the nearest adversary algorithm

To see whether it indeed comes close to replicating the original decision boundary, let's show the Voronoi tessellation.

In [7]:
def nearest_neighbor_voronoi(n_samples,seed):
    np.random.seed(seed)
    circles=dt.make_circles(n_samples=n_samples,factor=0.25,noise=0.1)
    moons=dt.make_moons(n_samples=n_samples,noise=0.1)
    blobs=dt.make_blobs(n_samples=n_samples,centers=2)#,random_state=1)

    datasets=[circles,moons,blobs]
    fig=plt.figure(figsize=(12,9))
    plot_num=1

    for dataset in datasets:
        X,y=dataset
        X=ss().fit_transform(X)
        X[:,0]=X[:,0]/np.max(X[:,0])
        X[:,1]=X[:,1]/np.max(X[:,1])

        X_b,y_b,border_N=find_borders_nearest_adversary(X,y)
        print('... reduction factor:',1-border_N/len(y),'(from ',len(y),'to',border_N,'points)')

        kNN1=KNeighborsClassifier(n_neighbors=1).fit(X,y)
        kNN2=KNeighborsClassifier(n_neighbors=1).fit(X_b,y_b)

        ax=plt.subplot(3,4,plot_num)
        plot_voronoi(adjust_plot(ax),kNN1,X,y)
        plot_points(adjust_plot(ax),X,y)

        ax=plt.subplot(3,4,plot_num+1)
        plot_voronoi(adjust_plot(ax),kNN2,X_b,y_b)
        plot_border_points(adjust_plot(ax),X_b,y_b)

        ax=plt.subplot(3,4,plot_num+2)
        plot_voronoi(adjust_plot(ax),kNN1,X,y)
        plot_points(adjust_plot(ax),X,y)
        plot_border_points(adjust_plot(ax),X_b,y_b)

        ax=plt.subplot(3,4,plot_num+3)
        plot_voronoi(adjust_plot(ax),kNN2,X_b,y_b)
        plot_points(adjust_plot(ax),X,y)
        plot_border_points(adjust_plot(ax),X_b,y_b)

        plot_num+=4
    plt.show()

nearest_neighbor_voronoi(1000,0)
... reduction factor: 0.971 (from  1000 to 29 points)
... reduction factor: 0.964 (from  1000 to 36 points)
... reduction factor: 0.987 (from  1000 to 13 points)


It tracks the original decision boundary well but does not duplicate it exactly, often letting one class intrude into the other class' territory.

Let's see how this will behave with more training data.

In [8]:
nearest_neighbor_voronoi(10000,0)
... reduction factor: 0.9952 (from  10000 to 48 points)
... reduction factor: 0.9942 (from  10000 to 58 points)
... reduction factor: 0.9986 (from  10000 to 14 points)

Uh-oh, not good. At larger sample sizes the decision boundary becomes highly erroneous on the datasets that are not linearly separable, and even on blob-type data when the points from the different Gaussian centers are dense and packed closely together.

The main flaw is that when a region is highly dense (also visible in other tests not shown), a few adversary points are heavily favored as the nearest adversary, while other neighboring adversary points that actually define the adjacent stretch of the decision boundary are skipped. The algorithm also often misses the outermost key points near the tips of each cluster, which causes the extrapolated decision boundary to veer away at an angle. That said, as a rough border definition, it seems useful enough.
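One way to see the 'heavily favored adversary' effect is to count how many points map to each selected adversary. The snippet below is a diagnostic sketch under the same assumptions as the cells above (moons data, nearest-adversary selection); if the flaw described here is at work, a handful of adversaries should absorb most of the mappings, so the retained set under-samples the rest of the boundary.

import numpy as np
from collections import Counter
from sklearn import datasets as dt
from sklearn.neighbors import KNeighborsClassifier

np.random.seed(0)
X,y=dt.make_moons(n_samples=2000,noise=0.1)

kNN=KNeighborsClassifier(n_neighbors=1).fit(X,y)
neighbors=kNN.kneighbors(n_neighbors=int(1+max(sum(y==1),sum(y!=1))),return_distance=False)

# for each point, record the index of its nearest opposing-class neighbor
nearest_adversaries=[]
for i,n in enumerate(neighbors):
    adversary=np.where(y[n]!=y[i])[0]
    if len(adversary)!=0:
        nearest_adversaries.append(n[adversary][0])

counts=Counter(nearest_adversaries)
print('distinct adversaries kept:',len(counts))
print('most re-used adversaries :',counts.most_common(5))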

Closing thoughts

This post will, for now, mark the end of the border point detection series. It started with a simple kNN delta, which had limited use. This last iteration appears more robust, although it is not able to perfectly recreate the decision boundary of the original training set.

However, the algorithm is fast and does not need to iterate through all points repeatedly to test whether removing each point affects the decision boundary. As a generic border detector and a kNN memory compression scheme, it is good enough. It has its limits, particularly on datasets that are not clearly separable, or where local regions contain overlapping, highly dense points from different classes.

Finally, the computational savings made possible by the reduction in points are quite significant when the classes are clearly separable, as shown in the reduction factors above (from 1,000 points down to a few dozen, and from 10,000 points likewise down to a few dozen).