So suppose we have a set of points $(x_1, y_1), \ldots, (x_n, y_n)$ and we want to find $\beta_0$, $\beta_1$ such that $y_i = \beta_1 x_i + \beta_0$ for each $i$, or in matrix notation $Y = X\bar\beta$, where $Y$ is the vector of the $y_i$’s, $\bar\beta = \begin{pmatrix} \beta_1 \\ \beta_0 \end{pmatrix}$, and $X$ is n-by-2 with the $x_i$’s in its first column and $1$’s in its second. Well, unfortunately, this may not have a solution, since $Y$ may not lie in the span of the two columns of $X$. So let’s instead try to find another vector $Y'$ which is in the column space of $X$ and is as close to $Y$ as possible.
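To make this concrete, here is a small sketch in NumPy with some made-up data points; the specific numbers are arbitrary and only illustrate how $X$ and $Y$ get set up.

```python
import numpy as np

# Made-up data points (x_i, y_i); any n >= 2 points would do.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Design matrix: x_i's in the first column, 1's in the second,
# matching the convention above, so beta-bar = (beta_1, beta_0).
X = np.column_stack([x, np.ones_like(x)])
Y = y
```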
This can clearly be found by projection. Let $v_1, v_2$ be the columns of $X$. Then we can uniquely write $Y = av_1 + bv_2 + u$, where $u$ is orthogonal to $v_1, v_2$. Thus, multiplying through by $X^T$, we see that $X^TY = aX^Tv_1 + bX^Tv_2 + 0$, since the rows of $X^T$ are the columns of $X$ and hence perpendicular to $u$. This can now be rewritten as $X^TY = X^T X \begin{pmatrix} a \\ b \end{pmatrix}$. So we have $\begin{pmatrix} a \\ b \end{pmatrix} = (X^TX)^{-1}X^TY$. But by definition $a$ and $b$ are the coordinates of the projection of $Y$ onto the column space of $X$, written in terms of $X$’s columns. So we’ve found $\bar \beta$ – assuming that $X^TX$ is invertible!
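Continuing the NumPy sketch above, we can check the projection argument numerically: the residual $u = Y - X\bar\beta$ should be (up to rounding) perpendicular to both columns of $X$.

```python
# Continuing with X, Y from the sketch above.
# Solve the normal equations X^T X beta = X^T Y.
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# The residual u = Y - X beta should be orthogonal to every column
# of X, i.e. X^T u should be ~0 up to floating-point error.
u = Y - X @ beta
print(X.T @ u)
```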
If $X^TX$ is not invertible, we can find some nonzero vector $v$ with $X^TXv = 0$. But then $X^Tu = 0$, where $u = Xv$. Each entry in the vector $X^Tu$ is the dot product of a column of $X$ with $u$. As these are all 0, it follows that $u$ is perpendicular to all of $X$’s columns, hence to any linear combination of them. But $u = Xv$ is itself such a linear combination, so it is perpendicular to itself and must be 0. Since $Xv = 0$ with $v \neq 0$, there is some linear dependence between the columns of $X$. Therefore, if the columns of $X$ are linearly independent, then $X^TX$ is invertible and the above argument gives us $\bar \beta = (X^TX)^{-1}X^TY$.
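To see the failure mode concretely (again with made-up numbers): if one column of $X$ is a multiple of another, $X^TX$ drops rank and cannot be inverted.

```python
# A design matrix whose second column is a multiple of the first:
# the columns are linearly dependent, so X_bad^T X_bad is singular.
X_bad = np.column_stack([x, 2.0 * x])
print(np.linalg.matrix_rank(X_bad.T @ X_bad))  # prints 1, not 2
# np.linalg.solve(X_bad.T @ X_bad, X_bad.T @ Y) would raise LinAlgError.
```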
$Y' = X\bar\beta$ being the closest vector to $Y$ in the column space of $X$ means we’ve minimized $\|Y - X\beta\|$, hence minimized $\|Y - X\beta\|^2 = (Y - X\beta)^T(Y - X\beta)$. This expands out to $f(\beta) = Y^TY - \beta^TX^TY - Y^TX\beta + \beta^T(X^TX)\beta = Y^TY - 2Y^TX\beta + \beta^T(X^TX)\beta$, since $\beta^TX^TY$ and $Y^TX\beta$ are scalars and transposes of each other. Taking the matrix derivative gives $\nabla f(\beta) = -2X^TY + 2(X^TX)\beta$, and setting this equal to $0$ recovers the normal equations $X^TX\beta = X^TY$, so again $\beta = (X^TX)^{-1}X^TY$.
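As a quick sanity check on the calculus route (still with the made-up data above), the closed-form solution matches what NumPy’s least-squares solver finds, and the gradient vanishes there:

```python
# Closed-form solution from the normal equations.
beta_closed = np.linalg.solve(X.T @ X, X.T @ Y)

# np.linalg.lstsq minimizes ||Y - X beta||^2 directly.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_closed, beta_lstsq))  # True

# The gradient -2 X^T Y + 2 X^T X beta is ~0 at the minimizer.
print(-2 * X.T @ Y + 2 * X.T @ X @ beta_closed)
```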
The same logic obviously applies if we have more than one predictor variable, in which case the matrix $X$ has more columns (one per predictor, plus the column of $1$’s) and $\beta$ has correspondingly more rows. Also, if $W$ is a diagonal matrix of positive weights and we instead minimize the weighted sum of squared residuals $(Y - X\beta)^TW(Y - X\beta)$, the same differentiation argument gives $\beta = (X^TWX)^{-1}X^TWY$.
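Here is a minimal weighted sketch, assuming some hypothetical positive weights $w_i$ on the same made-up data; equivalently, one can rescale each row by $\sqrt{w_i}$ and run ordinary least squares.

```python
# Hypothetical positive weights, one per data point.
w = np.array([1.0, 2.0, 1.0, 0.5, 3.0])
W = np.diag(w)

# Weighted least squares: beta = (X^T W X)^{-1} X^T W Y.
beta_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Equivalent: scale each row of X and Y by sqrt(w_i), then solve the
# rescaled problem by ordinary least squares.
sw = np.sqrt(w)
beta_scaled, *_ = np.linalg.lstsq(X * sw[:, None], Y * sw, rcond=None)
print(np.allclose(beta_w, beta_scaled))  # True
```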