Saturday, October 22, 2016

OLS-primed PLA: From linear regression to linear classifier to PLA



Jump-starting the PLA with OLS

In this post, we will explore an old statistics staple -- the ever-useful linear regression. Mathematically, it is similar to what the PLA does: both fit a linear function of the inputs. Linear regression is of course real-valued, whereas a Perceptron is by definition binary, so there are obvious differences as well. We explore the possibility of using linear regression (ordinary least squares, OLS) to help the PLA. As an aside, we can also compare the performance of an OLS-assisted PLA with our Border PLA, which we know by now (here and here) to be faster than an ordinary PLA.

Let's jump right into linear regression.

Linear regression

Linear regression is comparable to minimizing the gap between a sample point and a mean -- in other words, the gap to the average in each dimension, summed over all $d$ dimensions:

$$\sum_{i=1}^d \left(w_ix_i-\bar x_i\right)^2$$

Applying this across $N$ points, we get:

$$\sum_{i=1}^N \sum_{j=1}^d \left(w_{j}x_{ij}-\bar x_j\right)^2$$

In the context of a vector $\mathbf x$, the single-point error is the squared error between an estimate $h(\mathbf x)$ and a target function $f(\mathbf x)$:

$$(h(\mathbf x)-f(\mathbf x))^2$$

So the linear regression error over $N$ points, which is equivalent to the in-sample error $E_\text{in}$ in our PLA (here), is:

$$E_\text{in}(h) = \frac{1}{N} \sum_{i=1}^{N} \left(h(\mathbf x_i)-f(\mathbf x_i)\right)^2$$

From our PLA discussion, we of course know that:

$$h(\mathbf x)=\mathbf w^\mathsf T\mathbf x$$
$$f(\mathbf x)=\mathbf w_f^\mathsf T\mathbf x$$

Substituting these into the in-sample error, and writing $y_i=\mathbf w_f^\mathsf T\mathbf x_i$ for the target value of each point, we get:

$$E_\text{in}(\mathbf w) = \frac{1}{N} \sum_{i=1}^{N} \left(\mathbf w^\mathsf T\mathbf x_i-\mathbf w_f^\mathsf T\mathbf x_i\right)^2 = \frac{1}{N} \sum_{i=1}^{N} \left(\mathbf w^\mathsf T\mathbf x_i-y_i\right)^2 $$

We can simplify the notation further by noting that the matrix $\mathbf X$ has the vectors $\mathbf x_1,\mathbf x_2,\dots,\mathbf x_N$ as its rows, and that $\mathbf y=(y_1,y_2,\dots,y_N)$:

$$E_\text{in}(\mathbf w) = \frac{1}{N} \lVert\mathbf X\mathbf w-\mathbf y\rVert^2$$
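
As a quick sanity check (a toy example of my own, not part of the derivation), the matrix form gives exactly the same number as the explicit sum; the data below are made up:

import numpy as np

# made-up toy data: 5 points in 2 dimensions plus a bias column x0=1
np.random.seed(0)
X=np.ones((5,3))
X[:,1:]=np.random.uniform(-1,+1,(5,2))
y=np.sign(np.random.uniform(-1,+1,5))
w=np.random.uniform(-1,+1,3)

# sum form: (1/N) * sum_i (w^T x_i - y_i)^2
E_sum=np.mean([(np.dot(w,x_i)-y_i)**2 for x_i,y_i in zip(X,y)])

# matrix form: (1/N) * ||Xw - y||^2
E_vec=np.sum((np.dot(X,w)-y)**2)/len(y)

print(E_sum,E_vec)   # both print the same value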



Closed form solution

Since the goal of linear regression is to find a line that minimizes the sum of the individual squared errors (the squared residuals; their sum is known in statistics as the SSE), we have to do a little bit of matrix calculus to find the smallest squared error (hence, ordinary least squares, OLS). Taking the gradient:

$$\nabla E_\text{in}(\mathbf w) = \frac{2}{N}\mathbf X^\mathsf T (\mathbf X\mathbf w-\mathbf y)$$

Setting it to zero (i.e., the closed form solution) and collecting terms,

$$\mathbf X^\mathsf T \mathbf X\mathbf w=\mathbf X^\mathsf T\mathbf y$$

Extracting $\mathbf w$,

$$\bbox[yellow,6px,border:1px solid red]{\mathbf w=(\mathbf X^\mathsf T\mathbf X)^{-1}\mathbf X^\mathsf T\mathbf y}$$

And there we have the solution to finding $\mathbf w$ via linear regression. This looks like cheating, doesn't it? It barely looks like 'learning'. It is a very direct and exact method, as opposed to iterating by trial and error toward the final value of $\mathbf w$.

Before we leave this part, note the presence of a matrix inversion, which is only possible when $\mathbf X^\mathsf T\mathbf X$ is invertible; when it is singular (degenerate, which I will not describe here), we have to apply a pseudo-inverse instead. In code, this simply means using a different matrix function: for instance, np.linalg.pinv(X) instead of np.linalg.inv(X) in NumPy.
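
For reference, here is a minimal sketch of this closed form in NumPy (my own illustration; the run_OLS function in the next cell performs the same computation). X is assumed to be a training matrix with a leading column of ones, and y its target vector:

import numpy as np

def ols_weights(X,y):
    # closed-form OLS: w = (X^T X)^-1 X^T y, computed with the pseudo-inverse
    # so the call still works when X^T X happens to be singular
    return np.dot(np.dot(np.linalg.pinv(np.dot(X.T,X)),X.T),y)

# equivalently, np.linalg.lstsq(X,y,rcond=None)[0] solves the same
# least-squares problem without forming X^T X explicitly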

Connecting linear regression to the PLA

If we could use linear regression to do the Perceptron's work, we could have a one-step solution. But linear regression uses and produces real values, that is, $y\in \Bbb R$. The Perceptron is a classifier, and while its inputs are real-valued, its output is $y\in\{-1,+1\}$. If we run a linear regression but force it to minimize errors against targets $y\in\{-1,+1\}$, it will produce real-valued outputs that cluster around -1 and +1. We can then set a cut-off: anything above 0 becomes a +1, anything below 0 becomes a -1 (i.e., we apply a sign function on top of the linear regression output).

Let's write the OLS classifier code below.

In [1]:
%matplotlib inline
import numpy as np
import time
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

def generate_data(N,n,d):  # N is population size, n is sample size
    # generate n random sample using a uniform distribution from -1 to +1
    P_N=np.array(np.random.uniform(-1,+1,(N+n,d)))
    X_n=np.array(P_N[:n,:])
    P_N=np.array(P_N[n:N+n,:])
    
    # random dividing plane
    w_div=np.array(np.random.uniform(-1,+1,d+1))
    w_div[-1]=-1
    
    # add x0=1 to feature vector X and training vectors in P
    X=np.ones(shape=(n,len(X_n[0])+1))
    P=np.ones(shape=(N,len(P_N[0])+1))
    X[:,1:]=X_n
    P[:,1:]=P_N
    
    # classify the training and test data set
    y=np.sign(np.dot(X,w_div))
    y_N=np.sign(np.dot(P,w_div))
    
    return X_n,y,P_N,y_N,w_div

def run_OLS(X,y,P,y_N,w_div):
    # add x0=1 to X and P
    X_temp=np.ones(shape=(len(X),len(X[0])+1))
    P_temp=np.ones(shape=(len(P),len(P[0])+1))
    X_temp[:,1:]=X
    P_temp[:,1:]=P
    X=X_temp
    P=P_temp
    
    # normal equations, ordinary least squares
    X_pseudo=np.dot(np.linalg.pinv(np.dot(X.T,X)),X.T)
    w=np.dot(X_pseudo,y)
    #w=np.dot(np.linalg.pinv(X),y)    # same effect: pinv(X) is the pseudo-inverse of X itself
    
    # classify X using w
    h=np.array(np.sign(np.dot(X,w)))
    error_mask=h!=y
    
    # classify P using w
    h_N=np.array(np.sign(np.dot(P,w)))
    error_mask_N=h_N!=y_N
    
    # error rates
    in_err=1 if len(error_mask)==0 else np.count_nonzero(error_mask)/len(error_mask)
    out_err=1 if len(error_mask_N)==0 else np.count_nonzero(error_mask_N)/len(error_mask_N)
    
    return in_err,out_err,w

def setup_OLS_experiment(N,sample_sizes,d,reps):
    print('Running linear regression experiment, N=',N,', sample sizes=',sample_sizes,', dimensions=',d,'...')
    for n in sample_sizes:
        total_in_accuracy=[]
        total_out_accuracy=[]
        start_time=time.time()
        for i in range(reps):
            X_in,y_in,X_out,y_out,w_f=generate_data(N,n,d)
            in_err_ols,out_err_ols,w_ols=run_OLS(X_in,y_in,X_out,y_out,w_f)
            total_in_accuracy.append(in_err_ols)
            total_out_accuracy.append(out_err_ols)
        print('training sample size:',n)
        print('... elapsed time:',time.time()-start_time)
        print('... average in-sample error:',np.mean(total_in_accuracy))
        print('... average out-of-sample error:',np.mean(total_out_accuracy))

########## main code ##########
setup_OLS_experiment(20,(10,100,1000),2,10)
Running linear regression experiment, N= 20 , sample sizes= (10, 100, 1000) , dimensions= 2 ...
training sample size: 10
... elapsed time: 0.005002260208129883
... average in-sample error: 0.01
... average out-of-sample error: 0.095
training sample size: 100
... elapsed time: 0.0010006427764892578
... average in-sample error: 0.057
... average out-of-sample error: 0.085
training sample size: 1000
... elapsed time: 0.0015010833740234375
... average in-sample error: 0.0473
... average out-of-sample error: 0.05


It works! We did not have to use our old PLA code that iterates so many times.

But hang on a second, we seem to have a fairly high out-of-sample error. Worse still, we are getting some in-sample error! What's going on? We might expect linear regression to find a $\mathbf w$ that cleanly separates the training points, yet we are seeing errors even on the training points!

Well, this is because the linear regression is trying its best to minimize the squared error against either a -1 or a +1. Consider a training point that produces an output of +5. Against the sign(+5) threshold, it is correctly classified as a +1. For the linear regression, however, +5 is far from +1, and is therefore a 'huge' error! So it adjusts the 'average' regressor line to make that error smaller. The same happens with extreme negative training points. The net effect is that the OLS line does not perfectly separate the training points, much less the unseen test points.
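
A tiny numerical illustration of this point (toy numbers, not taken from the experiment above): every point below is correctly classified by the sign threshold, yet the extreme points dominate the squared loss that OLS is actually minimizing:

import numpy as np

raw_outputs=np.array([+5.0,+0.9,-1.1,-4.0])   # hypothetical w^T x values
targets=np.array([+1.0,+1.0,-1.0,-1.0])       # the {-1,+1} labels

print(np.sign(raw_outputs)==targets)   # all True: every point correctly classified
print((raw_outputs-targets)**2)        # approx. [16. 0.01 0.01 9.]: the squared errors OLS sees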

OLS-primed PLA

So what else can we do with OLS? Well, the OLS line is already pretty close to a perfect separator, so can we use it as the initial line for a PLA? Absolutely; let's try that below.

We first make a couple of small changes to our previous PLA algorithm: we add the option to pass in an initial weight vector $\mathbf w$, and we record a trail of the intermediate $\mathbf w$ values.

In [2]:
def record_PLA(X,y,P,y_N,w_div,w_init=None,limit=100,record_limit=1):
    # add x0=1 to X and P
    X_temp=np.ones(shape=(len(X),len(X[0])+1))
    P_temp=np.ones(shape=(len(P),len(P[0])+1))
    X_temp[:,1:]=X
    P_temp[:,1:]=P
    X=X_temp
    P=P_temp
    
    # initial weights vector w
    if w_init is None:
        w=np.zeros(len(X[0]))
    else:
        w=w_init.copy()    # copy so the in-place updates below do not modify the caller's array
    
    # classify X using w
    h=np.array(np.sign(np.dot(X,w)))
    error_mask=h!=y
    error=True if error_mask.any() else False
    
    alpha=1
    pla_iter=0
    w_list=[]
    w_list.append(w.copy())
    while error and pla_iter<limit:
        # find a misclassified point
        miss=error_mask.nonzero()[0]            
        miss_index=np.random.choice(miss)
        
        # update w
        w+=y[miss_index]*X[miss_index]*alpha
        if pla_iter<record_limit: w_list.append(w.copy())
        pla_iter+=1
        
        # classify X using w
        h=np.sign(np.dot(X,w))
        error_mask=h!=y
        error=True if error_mask.any() else False
        
    # classify P using w
    h_N=np.array(np.sign(np.dot(P,w)))
    error_mask_N=h_N!=y_N
    
    # error rates
    in_err=np.count_nonzero(error_mask)/len(error_mask)
    out_err=np.count_nonzero(error_mask_N)/len(error_mask_N)
    
    return pla_iter,in_err,out_err,w,w_list

def setup_PLA_experiment(N,sample_sizes,d,reps):
    print('Running PLA experiment, N=',N,', sample sizes=',sample_sizes,', dimensions=',d,'...')
    for n in sample_sizes:
        total_iter=[]
        total_in_accuracy=[]
        total_out_accuracy=[]
        start_time=time.time()
        for i in range(reps):
            X_in,y_in,X_out,y_out,w_f=generate_data(N,n,d)
            w_ols=None
            in_err_ols,out_err_ols,w_ols=run_OLS(X_in,y_in,X_out,y_out,w_f)
            max_iter=X_in.shape[0]*2
            record_limit=10
            pla_iter,in_acc,out_acc,w,w_list=record_PLA(X_in,y_in,X_out,y_out,w_f,w_ols,max_iter,record_limit)
            total_iter.append(pla_iter)
            total_in_accuracy.append(in_acc)
            total_out_accuracy.append(out_acc)
        print('training sample size:',n)
        print('... elapsed time:',time.time()-start_time)
        print('... average iteration:',np.mean(total_iter))
        print('... average in-sample error:',np.mean(total_in_accuracy))
        print('... average out-of-sample error:',np.mean(total_out_accuracy))

########## main code ##########
setup_PLA_experiment(20,(10,100,1000),2,10)
Running PLA experiment, N= 20 , sample sizes= (10, 100, 1000) , dimensions= 2 ...
training sample size: 10
... elapsed time: 0.0
... average iteration: 0.2
... average in-sample error: 0.0
... average out-of-sample error: 0.08
training sample size: 100
... elapsed time: 0.015711069107055664
... average iteration: 65.7
... average in-sample error: 0.007
... average out-of-sample error: 0.01
training sample size: 1000
... elapsed time: 0.08463215827941895
... average iteration: 456.2
... average in-sample error: 0.0
... average out-of-sample error: 0.0


That looks better: we get essentially no training error and a reasonably small test error. We also notice that the average iteration count appears to be lower. For example, on the $n=10$ experiment, the average is well under one iteration, vs. the typical 10 iterations. The OLS line is almost always near the final PLA line, needing at most one more adjustment. These iteration savings appear to carry over to the other experiments with larger training sizes, although they are less pronounced.

Visualizing a PLA, OLS-primed PLA, and a Border PLA

To help us understand how an OLS-primed PLA works, it is useful to show how the hypothesis line moves through each iteration, in comparison to others we previously tested.

However, unlike the first posts on the PLA, we won't be attaching a GIF (too much work to stitch together the images :) ). Instead, we will just show the first 10 iterations, hence the earlier code changes to record the $\mathbf w$ values. We will also exclude the test data points for clarity.

We will also run across different values of $n$ and review the output later. Note that the red line is the target function, while the black line is the current hypothesis.

In [14]:
import matplotlib.pyplot as plt

def plot_2D_points(ax,X,y,X_b,y_b):
    ax.set_aspect('equal')
    ax.axis([-1,+1,-1,+1])
    type_mask=y==1
    type_mask_b=y_b==1
    ax.scatter(X[type_mask,0],X[type_mask,1],marker='+',label='Type +1',c='r')
    ax.scatter(X[np.invert(type_mask),0],X[np.invert(type_mask),1],marker='x',label='Type -1',c='g')

def plot_points(ax,X,y):
    ax.set_xlim(-1,+1)
    ax.set_ylim(-1,+1)
    ax.set_zlim(-1,+1)
    y1=y==1
    not_y1=np.invert(y1)
    ax.scatter(X[:,0][y1],X[:,1][y1],X[:,2][y1], c='r', marker='+')
    ax.scatter(X[:,0][not_y1],X[:,1][not_y1],X[:,2][not_y1], c='g', marker='x')
    
def plot_lines(ax,w_f,w):
    ax.plot([-1,+1],[-(w_f[0]+w_f[1]*-1)/w_f[2],-(w_f[0]+w_f[1]*+1)/w_f[2]],'k-',c='r',label='target f',linewidth=1.0)
    ax.plot([-1,+1],[-(w[0]+w[1]*-1)/w[2],-(w[0]+w[1]*+1)/w[2]],'k-',c='k',label='hypothesis h',linewidth=1.0)
In [15]:
import time

def find_borders_local_field(X,y):
    kNN=KNeighborsClassifier(n_neighbors=1).fit(X,y)
    neighbors=kNN.kneighbors(n_neighbors=5,return_distance=False)
    
    b1=y[neighbors[:,0]]!=y
    b2=y[neighbors[:,1]]!=y
    b3=y[neighbors[:,2]]!=y
    b4=y[neighbors[:,3]]!=y
    b5=y[neighbors[:,4]]!=y
    border_rows=b1+b2+b3+b4+b5
    new_X=X[border_rows]
    new_y=y[border_rows]
    return new_X,new_y,np.count_nonzero(border_rows)

def compare_PLAs(n):
    N=2000
    #n=10
    d=2
    
    X_in,y_in,X_out,y_out,w_f=generate_data(N,n,d)
    w_ols=None
    max_iter=X_in.shape[0]*2
    record_limit=10

    # plain PLA
    print('Running PLA experiment, N=',N,', sample size=',n,', dimensions=',d,'...')
    start_time=time.time()
    pla_iter,in_acc,out_acc,w,w_list_pla=record_PLA(X_in,y_in,X_out,y_out,w_f,w_ols,max_iter,record_limit)
    print('training sample size:',X_in.shape[0])
    print('... elapsed time:',time.time()-start_time)
    print('... iteration:',pla_iter)
    print('... in-sample error:',in_acc)
    print('... out-of-sample error:',out_acc)
    print('... showing first 10 iterations (or convergence) below...')
    
    # PLA with OLS
    print('Running PLA (OLS) experiment, N=',N,', sample size=',n,', dimensions=',d,'...')
    start_time=time.time()
    in_err_ols,out_err_ols,w_ols=run_OLS(X_in,y_in,X_out,y_out,w_f)
    pla_iter,in_acc,out_acc,w,w_list_ols=record_PLA(X_in,y_in,X_out,y_out,w_f,w_ols,max_iter,record_limit)
    print('training sample size:',X_in.shape[0])
    print('... elapsed time:',time.time()-start_time)
    print('... iteration:',pla_iter)
    print('... in-sample error:',in_acc)
    print('... out-of-sample error:',out_acc)
    print('... showing first 10 iterations (or convergence) below...')
    
    # border-only PLA
    print('Running PLA (border-only) experiment, N=',N,', sample size=',n,', dimensions=',d,'...')
    start_time=time.time()
    X_b,y_b,border_N=find_borders_local_field(X_in,y_in)
    w_ols=None
    pla_iter,in_acc,out_acc,w,w_list_border=record_PLA(X_b,y_b,X_out,y_out,w_f,w_ols,max_iter,record_limit)
    print('... training sample size:',X_b.shape[0])
    print('... elapsed time:',time.time()-start_time)
    print('... iteration:',pla_iter)
    print('... in-sample error:',in_acc)
    print('... out-of-sample error:',out_acc)
    print('... showing first 10 iterations (or convergence) below...')
    
    w_list_pla=w_list_pla[1:]
    fig=plt.figure(figsize=(15,9),facecolor='0.75')
    for i in range(len(w_list_pla)):
        if i==10: break
        ax=plt.subplot2grid((3,10),(0,i))
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        plot_2D_points(ax,X_in,y_in,X_in,y_in)
        plot_lines(ax,w_f,w_list_pla[i])
    plt.show()    
    
    #w_list_ols=w_list_ols[1:]
    fig=plt.figure(figsize=(15,9),facecolor='0.5')
    for i in range(len(w_list_ols)):
        if i==10: break
        ax=plt.subplot2grid((3,10),(1,i))
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        plot_2D_points(ax,X_in,y_in,X_in,y_in)
        plot_lines(ax,w_f,w_list_ols[i])
    plt.show()
    
    w_list_border=w_list_border[1:]
    fig=plt.figure(figsize=(15,9),facecolor='0.25')
    for i in range(len(w_list_border)):
        if i==10: break
        ax=plt.subplot2grid((3,10),(2,i))
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        plot_2D_points(ax,X_b,y_b,X_b,y_b)
        plot_lines(ax,w_f,w_list_border[i])

    plt.show()

compare_PLAs(n=10)
Running PLA experiment, N= 2000 , sample size= 10 , dimensions= 2 ...
training sample size: 10
... elapsed time: 0.0
... iteration: 20
... in-sample error: 0.2
... out-of-sample error: 0.152
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 10 , dimensions= 2 ...
training sample size: 10
... elapsed time: 0.0
... iteration: 0
... in-sample error: 0.0
... out-of-sample error: 0.088
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 10 , dimensions= 2 ...
... training sample size: 10
... elapsed time: 0.0
... iteration: 15
... in-sample error: 0.0
... out-of-sample error: 0.0645
... showing first 10 iterations (or convergence) below...
In [16]:
compare_PLAs(10)
Running PLA experiment, N= 2000 , sample size= 10 , dimensions= 2 ...
training sample size: 10
... elapsed time: 0.0005004405975341797
... iteration: 4
... in-sample error: 0.0
... out-of-sample error: 0.2015
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 10 , dimensions= 2 ...
training sample size: 10
... elapsed time: 0.000499725341796875
... iteration: 0
... in-sample error: 0.0
... out-of-sample error: 0.2205
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 10 , dimensions= 2 ...
... training sample size: 7
... elapsed time: 0.001001596450805664
... iteration: 6
... in-sample error: 0.0
... out-of-sample error: 0.202
... showing first 10 iterations (or convergence) below...
In [17]:
compare_PLAs(100)
Running PLA experiment, N= 2000 , sample size= 100 , dimensions= 2 ...
training sample size: 100
... elapsed time: 0.0010027885437011719
... iteration: 16
... in-sample error: 0.0
... out-of-sample error: 0.0175
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 100 , dimensions= 2 ...
training sample size: 100
... elapsed time: 0.0014972686767578125
... iteration: 55
... in-sample error: 0.0
... out-of-sample error: 0.0075
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 100 , dimensions= 2 ...
... training sample size: 11
... elapsed time: 0.0020172595977783203
... iteration: 71
... in-sample error: 0.0
... out-of-sample error: 0.013
... showing first 10 iterations (or convergence) below...
In [18]:
compare_PLAs(100)
Running PLA experiment, N= 2000 , sample size= 100 , dimensions= 2 ...
training sample size: 100
... elapsed time: 0.0010006427764892578
... iteration: 37
... in-sample error: 0.0
... out-of-sample error: 0.0045
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 100 , dimensions= 2 ...
training sample size: 100
... elapsed time: 0.0035021305084228516
... iteration: 130
... in-sample error: 0.0
... out-of-sample error: 0.005
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 100 , dimensions= 2 ...
... training sample size: 14
... elapsed time: 0.00250244140625
... iteration: 46
... in-sample error: 0.0
... out-of-sample error: 0.0095
... showing first 10 iterations (or convergence) below...
In [19]:
compare_PLAs(1000)
Running PLA experiment, N= 2000 , sample size= 1000 , dimensions= 2 ...
training sample size: 1000
... elapsed time: 0.004515171051025391
... iteration: 205
... in-sample error: 0.0
... out-of-sample error: 0.0045
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 1000 , dimensions= 2 ...
training sample size: 1000
... elapsed time: 0.0029883384704589844
... iteration: 114
... in-sample error: 0.0
... out-of-sample error: 0.0045
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 1000 , dimensions= 2 ...
... training sample size: 32
... elapsed time: 0.009007692337036133
... iteration: 405
... in-sample error: 0.0
... out-of-sample error: 0.0045
... showing first 10 iterations (or convergence) below...
In [20]:
compare_PLAs(1000)
Running PLA experiment, N= 2000 , sample size= 1000 , dimensions= 2 ...
training sample size: 1000
... elapsed time: 0.03775954246520996
... iteration: 2000
... in-sample error: 0.01
... out-of-sample error: 0.008
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 1000 , dimensions= 2 ...
training sample size: 1000
... elapsed time: 0.03124856948852539
... iteration: 2000
... in-sample error: 0.028
... out-of-sample error: 0.0195
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 1000 , dimensions= 2 ...
... training sample size: 69
... elapsed time: 0.03778266906738281
... iteration: 2000
... in-sample error: 0.15942028985507245
... out-of-sample error: 0.0045
... showing first 10 iterations (or convergence) below...
In [21]:
compare_PLAs(10000)
Running PLA experiment, N= 2000 , sample size= 10000 , dimensions= 2 ...
training sample size: 10000
... elapsed time: 0.3325328826904297
... iteration: 3762
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 10000 , dimensions= 2 ...
training sample size: 10000
... elapsed time: 0.2029876708984375
... iteration: 2163
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 10000 , dimensions= 2 ...
... training sample size: 181
... elapsed time: 0.08005595207214355
... iteration: 2822
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
In [22]:
compare_PLAs(10000)
Running PLA experiment, N= 2000 , sample size= 10000 , dimensions= 2 ...
training sample size: 10000
... elapsed time: 1.006901502609253
... iteration: 13683
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 10000 , dimensions= 2 ...
training sample size: 10000
... elapsed time: 0.71750807762146
... iteration: 9918
... in-sample error: 0.0
... out-of-sample error: 0.0005
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 10000 , dimensions= 2 ...
... training sample size: 166
... elapsed time: 0.21164941787719727
... iteration: 13339
... in-sample error: 0.0
... out-of-sample error: 0.0005
... showing first 10 iterations (or convergence) below...
In [23]:
compare_PLAs(100000)
Running PLA experiment, N= 2000 , sample size= 100000 , dimensions= 2 ...
training sample size: 100000
... elapsed time: 54.138582706451416
... iteration: 72616
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 100000 , dimensions= 2 ...
training sample size: 100000
... elapsed time: 51.66406607627869
... iteration: 69561
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 100000 , dimensions= 2 ...
... training sample size: 580
... elapsed time: 1.890852689743042
... iteration: 74863
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
In [24]:
compare_PLAs(100000)
Running PLA experiment, N= 2000 , sample size= 100000 , dimensions= 2 ...
training sample size: 100000
... elapsed time: 44.96293926239014
... iteration: 79837
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
Running PLA (OLS) experiment, N= 2000 , sample size= 100000 , dimensions= 2 ...
training sample size: 100000
... elapsed time: 27.52830982208252
... iteration: 50317
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...
Running PLA (border-only) experiment, N= 2000 , sample size= 100000 , dimensions= 2 ...
... training sample size: 437
... elapsed time: 2.13401198387146
... iteration: 104963
... in-sample error: 0.0
... out-of-sample error: 0.0
... showing first 10 iterations (or convergence) below...


Performance comparison of the three PLAs

It could be misleading to draw conclusions from such a limited run (two runs per $n$), so we will restrict ourselves to the more obvious observations, stated in general terms. We will conduct a longer comparison in a future experiment (repeated runs on 100k training points take some time, so that is best done another time).

  • At low $n$ ($n=10$), the OLS-derived hypothesis often matches the location and orientation of the converged PLA line, or needs only a few more iterations to converge. This gives the OLS-primed PLA a processing-speed advantage over both the plain PLA and the Border PLA.
  • However, this advantage diminishes with increasing $n$. At large training sizes, the OLS-primed PLA takes roughly the same time as the plain PLA, and both are slower than the Border PLA. This suggests that starting close to the target line is no guarantee of processing speed, at least at high $n$. The reason lies in the way a PLA performs an update: with a fixed +1 or -1 factor, regardless of the size of the error, which standard gradient descent algorithms do take into account (see the sketch right after this list).
  • The Border PLA is comparatively slower at low $n$, but definitively beats the execution time of both the OLS-primed PLA and the plain PLA at high $n$ (as was shown in previous posts).
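
To make the last point concrete, here is a minimal sketch (my own illustration, not code from this notebook) contrasting the fixed-size PLA update with an error-proportional, gradient-style update. Here w is assumed to be a current weight vector, x a training point with its x0=1 bias term, and y its {-1,+1} label:

import numpy as np

def pla_update(w,x,y):
    # Perceptron step: add the full point scaled only by its label,
    # no matter how badly (or how barely) the point is misclassified
    return w+y*x

def gradient_style_update(w,x,y,eta=0.1):
    # squared-error gradient step: the move is proportional to the
    # residual (w^T x - y), so larger errors produce larger corrections
    return w-eta*(np.dot(w,x)-y)*x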

To solidify these points, we will revisit these generalizations in a future experiment with more runs/repetitions, to minimize the chance that the results are due to randomness.