Monday, November 7, 2016

Gradient descent on linear regression



Gradient descent on a linear regression

In the last post (here), we observed that even with an initial weight vector provided by a linear regression --that is, an initial hypothesis already very close to the target function-- the OLS-primed PLA ends up with roughly the same number of iterations and execution time as a normal PLA started from $\mathbf 0$ (or some random initial weights).

This surprised me because at small training sets, the OLS-initialized PLA was clearly faster. It turns out that the only clear advantage of an OLS initialization is at small sample sizes. On closer inspection, this non-intuitive behavior is due to the PLA update itself: it simply adds $\pm 1$ times the input, without regard to the distance between the input point and the hypothesis. I should not have been surprised had I looked at the math :), but it is hard to visualize the math when dealing with random points.

Analytical solution to a linear regression

A more sensible approach is to use the magnitude of the error (bigger error, bigger jump), as is done in linear regression (here). In linear regression we proceed immediately to a closed-form formula that derives $\mathbf w$ from the input training matrix $\mathbf X$ and the ground-truth label vector $\mathbf y$ for the training points:

$$\mathbf w=(\mathbf X^\mathsf T\mathbf X)^{-1}\mathbf X^\mathsf T\mathbf y$$
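As a quick sanity check, here is the closed form in NumPy. This is a minimal sketch; the toy data and the use of pinv for numerical stability are my own choices, not part of the derivation:

import numpy as np

# minimal sketch of the closed-form (normal equations) solution;
# assumes X already includes the x0=1 bias column
def ols_weights(X, y):
    return np.dot(np.dot(np.linalg.pinv(np.dot(X.T, X)), X.T), y)

# hypothetical toy data: y = 1 + 2*x, with the bias column prepended
X = np.array([[1., 0.], [1., 1.], [1., 2.]])
y = np.array([1., 3., 5.])
print(ols_weights(X, y))  # approximately [1. 2.]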


Numerical solution to a linear regression

Unfortunately, the analytical solution above becomes cumbersome when the matrices grow very large (as they tend to in ML). Instead, we can solve this numerically by applying gradient descent to the squared error, that is, adjusting $\mathbf w$ using the gradient $\nabla E(\mathbf w)$ (the partial derivatives along each axis) of the error $(\mathbf y-h(\mathbf w,\mathbf x))^2$.

For each axis $j$ (the same derivation repeats for the other axes), the partial derivative with respect to $w_j$ is:

$$\begin{align} \frac{\partial E(\mathbf w)}{\partial w_j}&=\frac{\partial\left [\frac{1}{n}\sum_i^n\left ( y_i-h(\mathbf x_{i})\right ) ^2\right ]}{\partial w_j}\\ &=\frac{\partial\left [\frac{1}{n}\sum_i^n\left ( y_i-\sum_{k=0}^dw_kx_{i,k}\right ) ^2\right ]}{\partial w_j}\\ &=-\frac{2}{n}\sum_{i=1}^{n}(y_i-h(\mathbf x_i))x_{i,j}\\ &\rightarrow-\frac{1}{n}\sum_{i=1}^{n}(y_i-h(\mathbf x_i))x_{i,j} \end{align}$$

where the other axes drop out as derivatives of constants inside the chain rule. In the last step we conveniently dropped the "2" in $\frac{2}{n}$, since it does not affect which $\mathbf w$ attains the lowest error (alternatively, start the derivation with $\frac{1}{2n}$ to cancel the 2).

The gradient $\nabla E(\mathbf w)$ therefore is just a collection of these straightforward partial derivatives, where $x_{i,0}=1$ is handled like any other $x_{i,j}$:

$$\begin{align} \nabla E(\mathbf w)&=\left ( \frac{\partial E(\mathbf w)}{\partial w_0}, \frac{\partial E(\mathbf w)}{\partial w_1} ,..., \frac{\partial E(\mathbf w)}{\partial w_d} \right )\\ &=\left (-\frac{1}{n}\sum_{i=1}^{n}(y_i-h(\mathbf x_i))x_{i,0},\ -\frac{1}{n}\sum_{i=1}^{n}(y_i-h(\mathbf x_i))x_{i,1}\ ,...,\ -\frac{1}{n}\sum_{i=1}^{n}(y_i-h(\mathbf x_i))x_{i,d} \right )\\ \end{align} $$

Applying gradient descent to the squared error (negative step on the gradient), we then have:

$$\mathbf w\leftarrow\mathbf w+\Delta\mathbf w$$

$$\Delta\mathbf w=-\eta\nabla E(\mathbf w)$$

$$\mathbf w_{t+1}=\mathbf w_t+\Delta\mathbf w=\mathbf w_t-\eta\nabla E(\mathbf w_t)$$

$$\mathbf w_{t+1}=\mathbf w_t-\eta\sum_{i=1}^{n}\nabla E_i(\mathbf w_t)$$

where $\eta$ is the learning-rate parameter (how big a jump the gradient descent takes; this is analogous to the step size in numerical calculations). The standard definition of the PLA sets this to 1. It is sometimes written as $\alpha$, which is probably the better choice, since $\eta$ is easily confused with the training size $n$.

Note that we can also write the error as $(h(\mathbf w,\mathbf x)-\mathbf y)^2$; the squared error is identical, but the derivative then comes out as $+(h(\mathbf x_i)-y_i)x_{i,j}$, flipping the sign convention in the resulting $\Delta \mathbf w$.

Using the partial derivatives above, and to make clear that $\mathbf w$ is comprised of individual $w_j$'s, we can write:

$$w_j\leftarrow w_j-\alpha \left (-\frac{1}{n} \sum_{i=1}^{n} (y_i-h(\mathbf x_i)) x_{i,j} \right ) \quad \text{(for all j)}$$

$$w_j\leftarrow w_j+\alpha\frac{1}{n} \sum_{i=1}^{n} (y_i-h(\mathbf x_i)) x_{i,j}\quad \text{(for all j)}$$

Since vector-vector addition and scalar-vector multiplication are element-wise, we can also write this as:

$$\mathbf w_{t+1}=\mathbf w_t+\alpha\frac{1}{n} \sum_{i=1}^{n}(y_i-h(\mathbf x_i))\mathbf x_i$$

Converting to vectors and matrices, we have:

$$\mathbf w_{t+1}=\mathbf w_t+\alpha\frac{1}{n} \mathbf X^\mathsf T(\mathbf y-\mathbf X\mathbf w_t)$$

Or,

$$\mathbf w_{t+1}=\mathbf w_t-\alpha\frac{1}{n} \mathbf X^\mathsf T(\mathbf X\mathbf w_t-\mathbf y)$$

The second term looks suspiciously like the closed-form term in the previous post (here). The normalization by $n$ can be skipped as long as the learning rate $\alpha$ is given a very low value to compensate. A high value of $\frac{\alpha}{n}$ causes $\Delta\mathbf w$ to 'blow up', as the sketch below shows.
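To see the blow-up concretely, here is a tiny sketch on random toy data; the rates are arbitrary choices of mine, with $\alpha=20$ deliberately too large for this problem:

import numpy as np

np.random.seed(2)
X = np.random.uniform(-1, 1, (100, 3))
y = np.random.uniform(-1, 1, 100)
for alpha in (0.1, 20.0):  # a reasonable rate vs. a deliberately bad one
    w = np.zeros(3)
    for _ in range(50):
        w += alpha * np.dot(X.T, y - np.dot(X, w)) / len(X)
    # the weight norm settles for alpha=0.1 but explodes for alpha=20
    print(alpha, np.linalg.norm(w))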

Why change to vectors and matrices?

Most of us are likely comfortable with the computational clarity of the partial derivatives, and with the element-wise update to $w_j \ \text{(for all j)}$, even with the confusion-inducing multiple indices. So why change to vectors and matrices, where we lose some of the computational guideposts and have to mentally work out what a matrix transpose means in relation to a dot product against a vector?

Well, some might say the clean notation is more intuitive (sometimes I disagree with this assertion :) ). To me, the better reason is that the abstraction allows us to convert this algorithm into functional code, as opposed to procedural code.

For instance, we could procedurally write the $\mathbf w$ update with two loops, one per summation --i.e., loop over each dimension $j$ of $\mathbf w$, and loop over each data point $i$ in $\mathbf X$. This will work, and with an affinity for for/while loops after years of programming, this is my natural way of implementing gradient descent in code. However, the vectorized notation is far cleaner and I am less likely to make a coding error, as you will see in the code below. It is also faster, which is worth verifying; a sketch comparing the two follows.
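Here is a minimal sketch of that comparison on hypothetical toy data (the shapes and alpha are arbitrary and not tied to the experiment code below):

import numpy as np

np.random.seed(0)
n, d = 5, 3
X = np.random.uniform(-1, 1, (n, d))  # assume the bias column is already included
y = np.random.uniform(-1, 1, n)
w = np.zeros(d)
alpha = 0.1

# procedural update: loop over each dimension j and each point i
w_loop = w.copy()
for j in range(d):
    s = 0.0
    for i in range(n):
        s += (y[i] - np.dot(X[i], w)) * X[i, j]
    w_loop[j] = w[j] + alpha * s / n

# vectorized update: a single line
w_vec = w + alpha * np.dot(X.T, y - np.dot(X, w)) / n

print(np.allclose(w_loop, w_vec))  # True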

A trick that helped me is to recognize how each piece of vectorized notation translates to an element-wise operation (e.g., $\mathbf w^\mathsf T\mathbf x$ equals $\sum_{j=0}^dw_jx_j$; $\mathbf X\mathbf w$ is the vector whose $i$-th entry is $\sum_{j=0}^dw_jx_{i,j}$; and $\mathbf X^\mathsf T\mathbf e$, for an error vector $\mathbf e$, is the vector whose $j$-th entry is $\sum_{i=1}^ne_ix_{i,j}$). To accept this mental leap, it helps to actually work out the matrix-vector operations on paper and remove all doubts.
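A quick NumPy check of those identities (toy shapes, purely illustrative):

import numpy as np

np.random.seed(1)
n, d = 4, 3
X = np.random.randn(n, d)
w = np.random.randn(d)
e = np.random.randn(n)

# w'x as an explicit sum over dimensions
x = X[0]
print(np.isclose(np.dot(w, x), sum(w[j]*x[j] for j in range(d))))

# Xw: the i-th entry is sum_j w_j * x_ij
print(np.allclose(np.dot(X, w), [sum(w[j]*X[i, j] for j in range(d)) for i in range(n)]))

# X'e: the j-th entry is sum_i e_i * x_ij
print(np.allclose(np.dot(X.T, e), [sum(e[i]*X[i, j] for i in range(n)) for j in range(d)]))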


Randomized gradient descent on squared error

Recalling the PLA update,

$$\mathbf w_{t+1}=\mathbf w_t+\Delta\mathbf w=\mathbf w_t+y_i\mathbf x_i$$

Under the PLA, we update $\mathbf w$ using the effect of one training point. We can do the same with the squared error, albeit with slightly different effects than a 'full' gradient descent: we estimate the gradient using the gradient at one randomly selected point $i$. Thus,

$$\mathbf w_{t+1}=\mathbf w_t-\eta\nabla E_i(\mathbf w_t)$$

$$\mathbf w_{t+1}=\mathbf w_t+\eta(y_i-h(\mathbf x_i))\mathbf x_i$$

Or equivalently, $$\mathbf w_{t+1}=\mathbf w_t-\eta(h(\mathbf x_i)-y_i)\mathbf x_i$$
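As a minimal sketch of a single stochastic step (the function name and the default alpha are my own choices):

import numpy as np

def sgd_step(w, X, y, alpha=0.1):
    # estimate the gradient from one randomly selected point
    i = np.random.randint(len(y))
    h_i = np.dot(X[i], w)  # linear (Adaline-style) hypothesis at point i
    return w + alpha * (y[i] - h_i) * X[i]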

Stochastic linear regression

This randomized single-point gradient descent is how the ADALINE (Adaptive Linear Element, or Adaptive Linear Neuron) works. The Adaline is a natural evolution of the perceptron and was invented in the 1960s. We briefly mentioned it in the first PLA post (mid-way through here). The delta rule, as applied to neural networks, refers to this concept.

Notice the difference in signs on the $\Delta \mathbf w$ ($+y\mathbf x$ vs. $-\eta\nabla E$). The PLA adds because, on a misclassified point, the label $y$ is always opposite the (incorrect) prediction, so the addition is in effect a subtraction, just as in a gradient descent. This is why the labels in a PLA have to be {-1,+1}, not {0,+1}: with a 0 label, $y_i\mathbf x_i=\mathbf 0$ and the PLA would never update on misclassified 0-labeled points.

There is really no need to be restricted to {-1,+1}, though; {0,+1} will work just as well if we note that the PLA update can be written as $\mathbf w_{t+1}=\mathbf w_t+\Delta\mathbf w=\mathbf w_t+(y_i-h(\mathbf x_i))\mathbf x_i$. This looks like the Adaline update, but for the key difference that $h_{\text{PLA}}(\mathbf x)=\text{sign}(\mathbf w^\mathsf T\mathbf x)$ while $h_{\text{Adaline}}(\mathbf x)=\mathbf w^\mathsf T\mathbf x$. A quick sketch follows.
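Here is that rewritten update under {0,+1} labels (toy numbers, purely illustrative):

import numpy as np

def pla_update_01(w, x, y):
    # hypothesis thresholded to {0,1} to match the labels
    h = 1 if np.dot(w, x) > 0 else 0
    return w + (y - h) * x  # zero when correct, -x/+x when wrong

# a 0-labeled point that w misclassifies still produces a nonzero update
w = np.array([1.0, 1.0])
x = np.array([1.0, 2.0])
print(pla_update_01(w, x, y=0))  # [ 0. -1.] -- w pushed away from x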

Notice also that the original PLA updates $\mathbf w$ only for incorrectly classified training points, disregarding correct ones. In a gradient-descent version we naturally do not have to discriminate between correct and incorrect points, because the squared error handles this for us, even when the training point is on the correct side of the target function (i.e., same label). This is an advantage of the Adaline: it can influence the hypothesis even with correctly classified training points, and it continues to make updates after all points are classified correctly, where a PLA would simply stop.

Stochastic vs. batch gradient descent

Finally, note that the original derivation applies an update based on all the errors, $\nabla E$. This is called batch gradient descent and is the faster way to do a numerical linear regression. In an Adaline, by convention, we are limited to picking one random point each time; in this sense it is a stochastic gradient descent.

A batch gradient descent (on a convex error function) will always improve --i.e., lower the error, descend the gradient-- whereas a stochastic (random) gradient descent may increase the error on some updates (though it generally descends across repetitions). A batch gradient descent is also deterministic --it will always follow the same descent path on different runs-- whereas stochastic gradient descent is, well, stochastic in nature across runs.

In ML, stochastic gradient descent is useful for large training datasets, where the memory and speed requirements of full matrix and vector calculations are prohibitive. The middle ground of course also works, hence the 'mini-batch' gradient descent: a gradient descent using a smaller batch of randomly selected points. A sketch of one such step follows.
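A minimal sketch of one mini-batch step (the batch size and alpha defaults are arbitrary choices of mine):

import numpy as np

def minibatch_step(w, X, y, batch_size=32, alpha=0.1):
    # sample a random subset of points and descend on their average gradient
    idx = np.random.choice(len(y), size=min(batch_size, len(y)), replace=False)
    Xb, yb = X[idx], y[idx]
    return w + alpha * np.dot(Xb.T, yb - np.dot(Xb, w)) / len(idx)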

So then let's revise our PLA into a gradient descent linear regression.

In [1]:
%matplotlib inline
import numpy as np
import time
import matplotlib.pyplot as plt

def hx_linear(X,w):
    return np.dot(X,w)

def hx_pla(X,w):
    return np.sign(np.dot(X,w))

def generate_data(N,n,d,hx):  # N is population size, n is sample size, d is dimensions
    # generate n random sample using a uniform distribution from -1 to +1
    P_N=np.array(np.random.uniform(-1,+1,(N+n,d)))
    X_n=np.array(P_N[:n,:])
    P_N=np.array(P_N[n:N+n,:])
    
    # random dividing hyperplane
    #w_div=np.array(np.random.uniform(-1,+1,d+1))
    #w_div[-1]=-1
    
    # random dividing line g (y = slope*x + b; assumes d=2)
    end_pts=np.array(np.random.uniform(-1,+1,(2,2)))
    slope=(end_pts[1,1]-end_pts[0,1])/(end_pts[1,0]-end_pts[0,0])
    b=end_pts[0,1]-slope*end_pts[0,0]
    w_div=np.array([b,slope,-1])
    
    # add x0=1 to feature vector X and training vectors in P
    X=np.hstack((np.ones(shape=(n,1)),X_n))
    P=np.hstack((np.ones(shape=(N,1)),P_N))
    
    # classify the training and test data set
    #y=np.sign(np.dot(X,w_div))
    #y_N=np.sign(np.dot(P,w_div))
    #y=np.dot(X,w_div)
    #y_N=np.dot(P,w_div)
    y=hx(X,w_div)
    y_N=hx(P,w_div)
    
    return X_n,y,P_N,y_N,w_div

def run_closed_form_OLS(X,y,P,y_N,w_div):
    # add x0=1 to X and P
    X=np.hstack((np.ones(shape=(len(X),1)),X))
    P=np.hstack((np.ones(shape=(len(P),1)),P))
    
    # normal equations, ordinary least squares
    X_pseudo=np.linalg.pinv(np.dot(X.T,X))
    w=np.dot(np.dot(X_pseudo,X.T),y)
    
    # classify X and P using w
    h=np.array(np.dot(X,w))
    h_N=np.array(np.dot(P,w))
    
    # mean squared errors
    in_err=np.mean((y-h)**2)
    out_err=np.mean((y_N-h_N)**2)
    
    return in_err,out_err,w

def setup_OLS_experiment(N,sample_sizes,d,reps):
    print('Running linear regression experiment, N=',N,', sample sizes=',sample_sizes,', dimensions=',d,'...')
    for n in sample_sizes:
        total_in_accuracy=[]
        total_out_accuracy=[]
        start_time=time.time()
        for i in range(reps):
            X_in,y_in,X_out,y_out,w_f=generate_data(N,n,d,hx_pla)
            in_err_ols,out_err_ols,w_ols=run_closed_form_OLS(X_in,y_in,X_out,y_out,w_f)
            total_in_accuracy.append(in_err_ols)
            total_out_accuracy.append(out_err_ols)
        print('training sample size:',n)
        print('... elapsed time:',time.time()-start_time)
        print('... average in-sample error:',np.mean(total_in_accuracy))
        print('... average out-of-sample error:',np.mean(total_out_accuracy))

########## main code ##########
setup_OLS_experiment(N=20,sample_sizes=(10,100,1000),d=2,reps=10)
Running linear regression experiment, N= 20 , sample sizes= (10, 100, 1000) , dimensions= 2 ...
training sample size: 10
... elapsed time: 0.01950669288635254
... average in-sample error: 0.0209595315225
... average out-of-sample error: 0.0291288097497
training sample size: 100
... elapsed time: 0.0015001296997070312
... average in-sample error: 0.00275642537508
... average out-of-sample error: 0.0126123124657
training sample size: 1000
... elapsed time: 0.002001523971557617
... average in-sample error: 0.0002610789894
... average out-of-sample error: 0.0136257859072


Next, a few helper functions to plot the training points, the target line, and each hypothesis line.

In [2]:
import matplotlib.pyplot as plt

def adjust_plot(ax):
    ax.set_aspect('equal')
    ax.axis([-1,+1,-1,+1])
    ax.legend(loc=4, fontsize = 'x-small')
    return ax
    
def plot_points(ax,X,y,X_b,y_b):
    y1=y==1
    not_y1=np.invert(y1)
    ax.scatter(X[:,0][y1],X[:,1][y1],marker='+',label='Type +1',c='r')
    ax.scatter(X[:,0][not_y1],X[:,1][not_y1],marker='x',label='Type -1',c='g')

def plot_lines(ax,w_y,w_o,w_h1,w_h2):
    # each w defines the line w[0]+w[1]*x+w[2]*y=0; plot its y at x=-1 and x=+1
    ax.plot([-1,+1],[-(w_y[0]+w_y[1]*-1)/w_y[2],-(w_y[0]+w_y[1]*+1)/w_y[2]],
            '-',c='r',label='target y',linewidth=1.0)
    ax.plot([-1,+1],[-(w_o[0]+w_o[1]*-1)/w_o[2],-(w_o[0]+w_o[1]*+1)/w_o[2]],
            '-',c='g',label='ols h',linewidth=1.0)
    ax.plot([-1,+1],[-(w_h1[0]+w_h1[1]*-1)/w_h1[2],-(w_h1[0]+w_h1[1]*+1)/w_h1[2]],
            '-',c='b',label='grad h',linewidth=1.0)
    ax.plot([-1,+1],[-(w_h2[0]+w_h2[1]*-1)/w_h2[2],-(w_h2[0]+w_h2[1]*+1)/w_h2[2]],
            '-',c='c',label='sgd h',linewidth=1.0)


With the plotting helpers in place, here is the gradient descent itself, with a flag to switch between batch and stochastic updates.

In [3]:
def record_gradient_descent(X,y,P,y_N,w_div,w_init=None,limit=100,record_limit=1,stochastic=False):
    # add x0=1 to X and P
    X=np.hstack((np.ones(shape=(len(X),1)),X))
    P=np.hstack((np.ones(shape=(len(P),1)),P))
    
    # initial weights vector w
    if w_init is None:
        w=np.zeros(X.shape[1])
    else:
        w=w_init.copy()  # copy so the caller's array is not mutated by w+=
    
    # classify X using w
    h=np.array(np.dot(X,w))
    
    alpha=0.1
    gd_iter=0
    w_list=[]
    
    error_grad=y-h
    while gd_iter<limit:
        # update w
        if stochastic:
            # estimate the gradient from one randomly selected point
            rand_index=np.random.randint(len(y))
            w+=error_grad[rand_index]*X[rand_index]*alpha
        else:
            # this is the single line update to w, avoiding messy nested loops
            w+=np.dot(X.T,error_grad)*alpha/len(X)
        
        # classify X using w then update gradient
        h=np.dot(X,w)
        error_grad=y-h
        
        if gd_iter<record_limit: w_list.append(w.copy())        
        gd_iter+=1
        
    # classify P using w
    h_N=np.array(np.dot(P,w))
    
    # mean squared errors
    in_err=np.mean((y-h)**2)
    out_err=np.mean((y_N-h_N)**2)
    
    return gd_iter,in_err,out_err,w,w_list
In [4]:
def setup_gradient_descent_experiment(N,sample_sizes,d,reps,function):
    print('Running gradient descent experiment, N=',N,', sample sizes=',sample_sizes,', dimensions=',d,' reps=',reps,'...')
    fig=plt.figure(figsize=(12,8))
    plot_num=1
    max_iter=1000
    record_limit=10
    for n in sample_sizes:
        total_iter=[]
        total_in_accuracy=[]
        total_out_accuracy=[]
        start_time=time.time()
        for i in range(reps):
            X_in,y_in,X_out,y_out,w_t=generate_data(N,n,d,function)
            in_err_ols,out_err_ols,w_o=run_closed_form_OLS(X_in,y_in,X_out,y_out,w_t)
            gd_iter,in_acc,out_acc,w_h_batch,w_list_batch=record_gradient_descent(X_in,y_in,X_out,y_out,w_t,None,max_iter,record_limit,stochastic=False)
            total_iter.append(gd_iter)
            total_in_accuracy.append(in_acc)
            total_out_accuracy.append(out_acc)
            gd_iter,in_acc,out_acc,w_h_sgd,w_list_sgd=record_gradient_descent(X_in,y_in,X_out,y_out,w_t,None,max_iter,record_limit,stochastic=True)
        print('training sample size:',n)
        print('... elapsed time:',time.time()-start_time)
        print('... average iteration:',np.mean(total_iter))
        print('... average in-sample error:',np.mean(total_in_accuracy))
        print('... average out-of-sample error:',np.mean(total_out_accuracy))
        print('... last rep, target function w:',w_t)
        print('... last rep, ols function w   :',w_o)
        print('... last rep, batch function w :',w_h_batch)
        print('... last rep, sgd function w   :',w_h_sgd)
        
        if len(sample_sizes)<=4:
            ax=plt.subplot(1,len(sample_sizes),plot_num)
            plot_points(ax,X_in,y_in,X_out,y_out)
            plot_lines(ax,w_t,w_o,w_h_batch,w_h_sgd)
            ax=adjust_plot(ax)
            plot_num+=1
    plt.show()
    
########## main code ##########
setup_gradient_descent_experiment(N=1000,sample_sizes=(10,100,1000,1000),d=2,reps=10,function=hx_pla)
setup_gradient_descent_experiment(N=1000,sample_sizes=(10,100,1000,1000),d=2,reps=10,function=hx_linear)
Running gradient descent experiment, N= 1000 , sample sizes= (10, 100, 1000, 1000) , dimensions= 2  reps= 10 ...
training sample size: 10
... elapsed time: 0.11107826232910156
... average iteration: 1000.0
... average in-sample error: 0.0192878481594
... average out-of-sample error: 0.00052400235754
... last rep, target function w: [ 0.07592975 -1.31624979 -1.        ]
... last rep, ols function w   : [-0.41150622 -0.92060904 -0.21502963]
... last rep, batch function w : [-0.41150622 -0.92060904 -0.21502963]
... last rep, sgd function w   : [-0.5518486  -0.81580821 -0.28571952]
training sample size: 100
... elapsed time: 0.11913681030273438
... average iteration: 1000.0
... average in-sample error: 0.00284674131789
... average out-of-sample error: 0.000292822367859
... last rep, target function w: [-0.48846949  0.66076885 -1.        ]
... last rep, ols function w   : [-0.47637007  0.77044316 -0.97364561]
... last rep, batch function w : [-0.47637007  0.77044316 -0.97364561]
... last rep, sgd function w   : [-0.59630526  0.81598496 -1.13254178]
training sample size: 1000
... elapsed time: 0.21617555618286133
... average iteration: 1000.0
... average in-sample error: 0.000296534908816
... average out-of-sample error: 0.000292061356311
... last rep, target function w: [-0.52668128 -1.42511623 -1.        ]
... last rep, ols function w   : [-0.38390563 -1.07494789 -0.68298783]
... last rep, batch function w : [-0.38390563 -1.07494789 -0.68298783]
... last rep, sgd function w   : [-0.46174364 -0.97967514 -0.63976644]
training sample size: 1000
... elapsed time: 0.23203730583190918
... average iteration: 1000.0
... average in-sample error: 0.000291146500407
... average out-of-sample error: 0.000291120718678
... last rep, target function w: [-0.17213134  1.01293162 -1.        ]
... last rep, ols function w   : [-0.15588191  1.07072429 -0.93572306]
... last rep, batch function w : [-0.15588191  1.07072429 -0.93572306]
... last rep, sgd function w   : [-0.13008698  0.84491809 -0.9453702 ]
Running gradient descent experiment, N= 1000 , sample sizes= (10, 100, 1000, 1000) , dimensions= 2  reps= 10 ...
training sample size: 10
... elapsed time: 0.10978889465332031
... average iteration: 1000.0
... average in-sample error: 4.57732260591e-10
... average out-of-sample error: 2.48631787352e-11
... last rep, target function w: [ 4.57797834  5.55646258 -1.        ]
... last rep, ols function w   : [ 4.57797834  5.55646258 -1.        ]
... last rep, batch function w : [ 4.57797834  5.55646258 -0.99999999]
... last rep, sgd function w   : [ 4.57797834  5.55646258 -1.        ]
training sample size: 100
... elapsed time: 0.11592268943786621
... average iteration: 1000.0
... average in-sample error: 1.4498605276e-25
... average out-of-sample error: 1.86108121487e-26
... last rep, target function w: [-0.02748999 -0.26987796 -1.        ]
... last rep, ols function w   : [-0.02748999 -0.26987796 -1.        ]
... last rep, batch function w : [-0.02748999 -0.26987796 -1.        ]
... last rep, sgd function w   : [-0.02748999 -0.26987796 -1.        ]
training sample size: 1000
... elapsed time: 0.22690200805664062
... average iteration: 1000.0
... average in-sample error: 2.0483809509e-31
... average out-of-sample error: 2.19597422729e-31
... last rep, target function w: [-0.23757398  6.65881723 -1.        ]
... last rep, ols function w   : [-0.23757398  6.65881723 -1.        ]
... last rep, batch function w : [-0.23757398  6.65881723 -1.        ]
... last rep, sgd function w   : [-0.23757398  6.65881723 -1.        ]
training sample size: 1000
... elapsed time: 0.2346477508544922
... average iteration: 1000.0
... average in-sample error: 4.66126961638e-31
... average out-of-sample error: 5.05856476346e-31
... last rep, target function w: [ 0.07913549  1.53919783 -1.        ]
... last rep, ols function w   : [ 0.07913549  1.53919783 -1.        ]
... last rep, batch function w : [ 0.07913549  1.53919783 -1.        ]
... last rep, sgd function w   : [ 0.07913549  1.53919783 -1.        ]


A few observations from the runs above: the batch gradient descent converges to exactly the same weights as the closed-form OLS, while the SGD hypothesis lands close to, but not exactly on, those weights. And when the target itself is linear (hx_linear) rather than a sign classification, all three methods recover the target weights to near machine precision.
