LinAlg Notes

Type cmd + P, then "Fold all" to fold bullet points.

  • Properties of Euclidean Norms
    • Cauchy-Schwarz: $|x^T y| \le \|x\| \|y\|$
      • Implies the triangle inequality $\|x + y\| \le \|x\| + \|y\|$: the direct way is always shorter than or as long as the way with a stop in between
  • Linear Transformation; Matrix Multiplication
    • linear transformation: $f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)$
    • For every linear transformation $f$ there exists a matrix $A$ that acts like it: $f(x) = Ax$
    • Matrix mult. is distributive and associative (why? b/c composition of linear functions is)
    • Why is matrix mult. defined the way it is? So that $AB$ represents the composition: $(AB)x = A(Bx)$
  • Transposes and Inverses
      • $(Ax)^T = x^T A^T$. Makes sense: a matrix is simply a series of vectors.
      • $(AB)^{-1} = B^{-1} A^{-1}$: When you perform $ABx$, you first perform $B$, then $A$. Therefore, to undo, you first have to undo $A$, then $B$.
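Both reversal rules can be sanity-checked numerically; a quick sketch with random matrices (seed arbitrary, invertibility assumed):

```python
import numpy as np

# Random 3x3 matrices; Gaussian matrices are invertible with probability 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# (AB)^T = B^T A^T: transposing reverses the order
transpose_rule = np.allclose((A @ B).T, B.T @ A.T)

# (AB)^{-1} = B^{-1} A^{-1}: to undo AB, first undo A, then B
inverse_rule = np.allclose(np.linalg.inv(A @ B),
                           np.linalg.inv(B) @ np.linalg.inv(A))
```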
  • Rank, Column Space, Nullspace
      • $x$ in nullspace $\Leftrightarrow$ $Ax = 0$. $x$ is therefore orthogonal to every row of $A$.
      • Since the rows span the rowspace, $x$ must be orth. to the rowspace.
    • $N(A) = C(A^T)^\perp$ (and therefore $N(A^T) = C(A)^\perp$)
      • follows because the same argument applies to $A^T$
    • $\text{rank}(A) = \text{rank}(A^T)$ (and therefore col rank $=$ row rank)
      • Result of reduced row echelon form. There are always as many pivot rows as pivot columns.
    • $\text{rank}(A^T A) = \text{rank}(A)$ (and therefore $\text{rank}(AA^T) = \text{rank}(A^T)$)
      • $N(A^T A) = N(A)$: $A^T A x = 0 \Rightarrow x^T A^T A x = \|Ax\|^2 = 0 \Rightarrow Ax = 0$, and $Ax = 0 \Rightarrow A^T A x = 0$ by def.
      • All four are equal b/c of $\text{rank}(A) = \text{rank}(A^T)$
    • $Ax = b$ solvable $\Rightarrow$ every solution is $x = x_r + x_n$ with unique $x_r \in C(A^T)$ and $x_n \in N(A)$
      • You can add an arbitrary nullspace element to a unique solution in the row space and still have a solution.
      • Intuition: We can move a solution up and down the nullspace b/c adding a vector in the nullspace to a solution doesn’t change the validity ($A(x + x_n) = Ax + Ax_n = Ax = b$). Since the nullspace is the orth. complement of the rowspace, it must intersect the rowspace in exactly one position ($x_r$, which gets moved along $N(A)$)
      • Direct result of the decomposition above:
      • If there exists a solution $x$, then $x + x_n$ is also a solution, where $x_n \in N(A)$
      • Every solution can be written as $x_r + x_n$ for some $x_n \in N(A)$
    • underdetermined ($A$ wide) w/ full row rank $\Rightarrow$ $AA^T$ full rank $\Rightarrow$ $x = A^T (AA^T)^{-1} b$ solves $Ax = b$ (so this $x$ is unique b/c matrix mult. is well defined)
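A small NumPy check of the rank identities and the nullspace-orthogonality fact (made-up rank-deficient matrix):

```python
import numpy as np

# 4x3 example with col3 = col1 + col2, so rank 2.
A = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [2., 1., 3.],
              [1., 2., 3.]])

# rank(A) = rank(A^T) = rank(A^T A) = rank(A A^T)
ranks = [int(np.linalg.matrix_rank(M)) for M in (A, A.T, A.T @ A, A @ A.T)]

# A nullspace vector is orthogonal to every row of A.
x = np.array([1., 1., -1.])   # col1 + col2 - col3 = 0
in_nullspace = np.allclose(A @ x, 0)
```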
  • CR-Decomposition
    • Any $A$ can be decomposed into $A = CR$
      • $C$: Lin. ind. columns of $A$
      • $R$: Each col contains the coefficients of the lin. comb. of the cols of $C$ that make up the cols of $A$ (How much of each $C$ col goes into each $A$ col)
    • $C(A^T) = C(R^T)$, and $C(C) = C(A)$ by construction b/c $C$'s cols are the lin. ind. cols of $A$
      • rowspace of $A$ $\subseteq$ rowspace of $R$: Each row of $A$ is a lin. comb. of the rows of $R$, with each row of $C$ offering the coefficients
      • rowspace of $R$ $\subseteq$ rowspace of $A$: $C$ has full col. rank $\Rightarrow$ left inverse $C^+$ (see pseudoinverse), so $R = C^+ A$. Therefore: Each row of $R$ is a lin. comb. of the rows of $A$, with each row of $C^+$ offering the coefficients
    • Decomposition can be done using Gauss-Jordan elimination: $C$ = the cols of $A$ where $\text{rref}(A)$ has pivots, and $R$ = the non-zero rows of $\text{rref}(A)$
      • $C$: cols follow def (since cols with pivots are lin. ind.)
      • $R$: We can write the Gauss-Jordan steps into an $m \times m$ matrix $E$: $EA = \text{rref}(A)$ w/ as many 0 rows as $m - \text{rank}(A)$, which yields 1s on the diag. of the pivot cols & 0 rows below
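The Gauss-Jordan construction can be sketched with SymPy's exact `rref` (assuming SymPy is available; `cr_decompose` is a made-up helper name):

```python
import numpy as np
import sympy as sp

def cr_decompose(A):
    """A = C R: C = pivot columns of A, R = non-zero rows of rref(A)."""
    rref, pivots = sp.Matrix(A).rref()
    A = np.asarray(A, dtype=float)
    C = A[:, list(pivots)]                               # lin. ind. cols of A
    R = np.asarray(rref.tolist(), dtype=float)[:len(pivots), :]
    return C, R

# Rank-2 example: row2 = 2*row1
A = [[1, 2, 3],
     [2, 4, 6],
     [1, 1, 2]]
C, R = cr_decompose(A)
reconstructs = np.allclose(C @ R, np.asarray(A, dtype=float))
```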
  • Certificates
      • $x$ w/ $Ax = b$ certifies: $Ax = b$ has a solution b/c $b \in C(A)$
      • $y$ w/ $y^T A = 0$ and $y^T b = 1$ certifies: $Ax = b$ has no solution ($y^T A x = 0 \neq 1 = y^T b$). Any nonzero $y$ that satisfies $y^T A = 0$, $y^T b \neq 0$ can be scaled by $\frac{1}{y^T b}$ without leaving $N(A^T)$, i.e. while maintaining the condition $y^T A = 0$ (vectorspace axioms).
  • Projections onto Subspaces (https://youtu.be/Y_Ac6KiQ1t0)
    • $Ax = b$ may have no solution. Solve $A\hat{x} = p$ instead, where $p$ is the projection of $b$ onto the col. space. projection = closest point in col. space to $b$.
    • $e = b - p$ is the error of $b$, i.e. the vector from $b$ to its projection. $e$ is orth. to $C(A)$ b/c that’s the directest and therefore shortest way to reach the col. space. $A^T(b - A\hat{x}) = 0$ (normal equation) $\Rightarrow$ $\hat{x} = (A^T A)^{-1} A^T b$ $\Rightarrow$ projection matrix $P = A (A^T A)^{-1} A^T$
    • $P^2 = P$ b/c projecting something in the col. space (the first projection) changes nothing
    • $P^T = P$ b/c $((A^T A)^{-1})^T = (A^T A)^{-1}$ (inverse of a sym. matrix is sym.)
    • Intuitively: Projection matrix annihilates the orth. vector component
      • $b = p + e$ for some $e \in C(A)^\perp$ ($= N(A^T)$) (dot prod. between $e$ & every col. of $A$ is 0; $Pb = Pp + Pe = p$)
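The three projection-matrix facts can be verified numerically; a sketch with a made-up tall matrix (full column rank assumed so $(A^T A)^{-1}$ exists):

```python
import numpy as np

A = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
b = np.array([6., 0., 0.])

P = A @ np.linalg.inv(A.T @ A) @ A.T    # projection matrix onto C(A)
p = P @ b                               # projection of b
e = b - p                               # error component

idempotent = np.allclose(P @ P, P)      # P^2 = P
symmetric = np.allclose(P.T, P)         # P^T = P
e_orthogonal = np.allclose(A.T @ e, 0)  # e is orth. to every col of A
```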
  • Least Squares Regression
    • Goal: calculate coefficients of function so that we minimize the squared error, i.e. the sum of the squared distances from our function output to actual datapoints
    • Example: we have datapoints $(x_i, y_i)$ and want a quadratic function $f(x) = c_0 + c_1 x + c_2 x^2$. If we plug them in: $Ac = y$, where row $i$ of $A$ is $(1, x_i, x_i^2)$
    • Minimizing now becomes reducing the sum of squared differences from every coordinate of $Ac$ (all our predictions) to the corres. coordinate of $y$ (all our expected results): $\min_c \|Ac - y\|^2$ (norm is always positive, so minimizing the norm and the squared norm is the same)
    • $Ac - y = -e$ from projections. We therefore know how to minimize the norm of $e$: pick $c$ so that $e$ is orth. to the col. space. Hence, $c = (A^T A)^{-1} A^T y$.
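A minimal sketch of the quadratic fit via the normal equation, on made-up data that lies exactly on $y = 1 + 2x^2$:

```python
import numpy as np

x = np.array([0., 1., 2., 3., 4.])
y = np.array([1., 3., 9., 19., 33.])   # exactly 1 + 2*x^2

# Rows (1, x_i, x_i^2): plug every datapoint into c0 + c1*x + c2*x^2
A = np.column_stack([np.ones_like(x), x, x**2])

# Normal equation: A^T A c = A^T y
c = np.linalg.solve(A.T @ A, A.T @ y)

# np.linalg.lstsq minimizes ||Ac - y||^2 directly, same answer
c_ref, *_ = np.linalg.lstsq(A, y, rcond=None)
```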
  • Orthogonal Matrices and Gram-Schmidt
    • Orthonormal columns: $Q^T Q = I$ b/c rows of $Q^T$ are always orthogonal to cols of $Q$ except if row idx = col idx, in which case we have $q_i^T q_i = 1$
    • If square: called orthogonal matrix. $Q$ then also has a right inverse (left inverse $\Rightarrow$ matrix injective: a vector first gets multiplied with $Q$, then with $Q^T$. If $Q$ reduced multiple values to the same value, $Q^T$ could not reverse it $\Rightarrow$ matrix bijective if square) since $QQ^T = I$ ($Q^T$ is left inverse & right inverse)
    • orthogonal $\Rightarrow$ norm preserving: $\|Qx\| = \|x\|$ and $(Qx)^T (Qy) = x^T y$
    • Gram-Schmidt: $q_1 = \frac{a_1}{\|a_1\|}$; $\tilde{q}_k = a_k - \sum_{i<k} (q_i^T a_k)\, q_i$; $q_k = \frac{\tilde{q}_k}{\|\tilde{q}_k\|}$
      • We project $a_k$ onto the already created subspace and then normalize the error
      • The sum above is the same as subtracting $Q_{k-1} Q_{k-1}^T a_k$, where $Q_{k-1} Q_{k-1}^T$ is the proj. matrix: $Q_{k-1}(Q_{k-1}^T Q_{k-1})^{-1} Q_{k-1}^T = Q_{k-1} Q_{k-1}^T$ since $Q_{k-1}^T Q_{k-1} = I$ by definition. $Q_{k-1} Q_{k-1}^T \neq I$ b/c $Q_{k-1}$ is not yet square.
    • For orthonormal matrices: $\hat{x} = Q^T b$ is the least-squares solution to $Qx = b$ (normal equation: $(Q^T Q)^{-1} Q^T b = Q^T b$)
    • A matrix $A$ can be decomposed into $A = QR$
      • $Q$ has orthonormal columns (not necessarily an orthogonal matrix b/c it may not be square).
      • $R = Q^T A$ denotes how much each column of $Q$ contributes to a column of $A$. Since every column of $A$ is a linear combination of all the already created orthonormal columns + a new one in Gram-Schmidt, we get an upper triangular matrix.
        • If $A$ has linearly independent columns, then $R$ has non-zero values along the diagonal: each col. adds a new orthonormal vector to $Q$. It is therefore invertible. This is part of the def. in this class.
        • Otherwise, we add a column w/o pivot (we don’t add a new col to $Q$ b/c the col is already in the span, so the error is the 0-vector). We get an upper triangular matrix with pivots set back.
      • Since $Q^T Q = I$, $QQ^T$ is a projection matrix.
      • $\hat{x} = R^{-1} Q^T b$ is the least-squares solution to $Ax = b$ ($A = QR$ and $A$ has full col. rank $\Rightarrow$ has left inverse)
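The Gram-Schmidt loop above can be written out directly; a sketch assuming linearly independent columns (`gram_schmidt` is a made-up helper name):

```python
import numpy as np

def gram_schmidt(A):
    """Classical Gram-Schmidt on the columns of A (lin. independence assumed).
    Returns Q (orthonormal columns) and upper-triangular R with A = Q R."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for k in range(n):
        v = A[:, k].copy()
        for i in range(k):
            R[i, k] = Q[:, i] @ A[:, k]  # how much of q_i is in a_k
            v -= R[i, k] * Q[:, i]       # subtract the projection onto q_i
        R[k, k] = np.linalg.norm(v)      # length of the error
        Q[:, k] = v / R[k, k]            # normalize the error
    return Q, R

A = np.array([[1., 1., 0.],
              [1., 0., 1.],
              [0., 1., 1.]])
Q, R = gram_schmidt(A)
```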
  • Determinant Formula (https://youtu.be/Sv7VseMsOQc)
    • We can easily derive 3 axioms from volume:
      • $\det(I) = 1$: a hypercube with edge length 1 has volume 1
      • Linearity in every single column is a direct result of volume:
        • A volume is spanned by multiple vectors. If you scale any one of the vectors by $\alpha$, your volume increases by the factor $\alpha$: $\det(\dots, \alpha a, \dots) = \alpha \det(\dots, a, \dots)$
        • A vector always determines two parallel edges in the case of a parallelogram and four parallel edges in the case of a 3d parallelepiped. If you split a vector into two ($a = b + c$), that equally bulges the shape in on one side and out at the other. Therefore, the volume stays the same: $\det(\dots, b + c, \dots) = \det(\dots, b, \dots) + \det(\dots, c, \dots)$
      • If a spanning vector is linearly dependent on the others, then the parallelepiped collapses into one dimension less $\Rightarrow$ volume of 0.
    • From these 3 axioms, it follows that the sign must switch if we swap two columns: expand $0 = \det(\dots, a + b, \dots, a + b, \dots)$ using linearity, then drop the terms with repeated columns (linear dependence)
    • After using our linearity and linear dependence axioms multiple times, we are left with only permutations of $I$’s columns. By sign flipping per col. switch, we get each term back to $\det(I)$. Hence: $\det(A) = \sum_{\sigma \in S_n} \text{sgn}(\sigma) \prod_i a_{\sigma(i) i}$
      • $S_n$: set of all possible permutation functions
    • Trick for sign of permutation: Draw two sets of dots from 1 to $n$. Connect input and output of permutation. $(-1)^{\#\text{crossings}}$ is the sign.
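The permutation-sum formula and the crossing-count sign trick, written out as a sketch (made-up helper names; factorial cost, so only for tiny matrices):

```python
import numpy as np
from itertools import permutations
from math import prod

def perm_sign(p):
    """(-1)^(number of crossings/inversions): the dot-connecting trick."""
    inversions = sum(p[i] > p[j]
                     for i in range(len(p)) for j in range(i + 1, len(p)))
    return -1 if inversions % 2 else 1

def det_leibniz(A):
    """det(A) = sum over permutations sigma of sgn(sigma) * prod_i A[sigma(i)][i]."""
    n = len(A)
    return sum(perm_sign(p) * prod(A[p[i]][i] for i in range(n))
               for p in permutations(range(n)))

A = [[2., 1., 0.],
     [1., 3., 1.],
     [0., 1., 2.]]
```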
  • Properties of Determinants
    • $\det(A^T) = \det(A)$
      • $\det(A) = \sum_P \det(P) \prod (\text{entries of } A \text{ picked by } P)$, where the sum is over the set of all permutation matrices.
      • Each permutation matrix is like a hole puncher: it picks one entry per row and per column.
      • If you turn both the page and the holes you punch (transpose $A$ and $P$), you still get the same holes.
    • $\det(AB) = \det(A)\det(B)$
      • Applying $B$ scales the volume by $\det(B)$ first, then $A$ scales it by $\det(A)$
    • $\det(\alpha A) = \alpha^n \det(A)$
      • Follows from linearity: we can extract the $\alpha$ of each of the $n$ columns separately
    • $\det(A^{-1}) = \frac{1}{\det(A)}$
      • $\det(A)\det(A^{-1}) = \det(AA^{-1}) = \det(I) = 1$, then divide by $\det(A)$.
  • Cofactors, Cramer’s rule and beyond
    • Cofactor: det of matrix = sum of products of row or col entries w/ cofactors: $\det(A) = \sum_j a_{ij} C_{ij}$ with $C_{ij} = (-1)^{i+j} \det(M_{ij})$. $M_{ij}$ is the minor obtained by deleting row $i$ and column $j$ from $A$.
      • Performs linearity on row or col: Split the row or col into $n$ vectors with one entry and otherwise only zeros. Delete row and column b/c of lin. dep.
      • Swapping that entry up to position $(1,1)$, which is part of the diagonal, swaps the sign $(i-1) + (j-1)$ times $\Rightarrow$ factor $(-1)^{i+j}$.
      • Example cofactor expansion of a $3 \times 3$ matrix along the first col: $\det(A) = a_{11} C_{11} + a_{21} C_{21} + a_{31} C_{31}$
    • $A^{-1} = \frac{1}{\det(A)} C^T$, where $C$ is the matrix with the cofactors of $A$ as entries.
      • Equivalent: $A C^T = \det(A) I$. Look at entry $(i,j)$ of $AC^T$:
      • Case 1: $i = j$
        • Then $(AC^T)_{ii} = \sum_k a_{ik} C_{ik}$
        • This is exactly the cofactor expansion of $\det(A)$ along row $i$.
        • So $(AC^T)_{ii} = \det(A)$.
      • Case 2: $i \neq j$. Then $(AC^T)_{ij} = \sum_k a_{ik} C_{jk}$
        • This is like taking the determinant of a matrix where two rows are equal (row $i$ and row $j$ after replacing row $j$ with row $i$)
        • Determinant = 0 in this case.
    • Efficient way to compute det: subtract and add multiples of rows until upper triangular form (similar to Gauss), then multiply the diagonal entries. If swapping rows: multiply final det by $-1$. Scaling not allowed.
      • Intuition: det measures volume of parallelepiped. Adding a multiple of one row to another just slides one face along another, which doesn’t change the volume ($\det(A) = \det(A^T)$, so rows can be seen as spanning vectors as well).
    • Cramer’s rule: $x_i = \frac{\det(B_i)}{\det(A)}$, where $x_i$ is the $i$-th component of the sol. vector for $Ax = b$ and $B_i$ has the $i$-th col of $A$ replaced by $b$
      • $\det(B_i) = \det(\dots, a_{i-1}, b, a_{i+1}, \dots)$ ($b =$ lin. comb. of $A$’s cols) $= x_i \det(A)$ (after applying linearity, all other dets $= 0$ b/c of linearly dep. cols)
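Cramer's rule as a direct sketch (`cramer_solve` is a made-up helper; for intuition only, far slower than Gaussian elimination):

```python
import numpy as np

def cramer_solve(A, b):
    """x_i = det(B_i) / det(A), where B_i is A with col i replaced by b."""
    d = np.linalg.det(A)
    x = np.empty(len(b))
    for i in range(len(b)):
        Bi = A.copy()
        Bi[:, i] = b          # replace the i-th column by b
        x[i] = np.linalg.det(Bi) / d
    return x

A = np.array([[2., 1.],
              [1., 3.]])
b = np.array([3., 5.])
x = cramer_solve(A, b)
```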
  • Eigenvalues and Eigenvectors
    • $Av = \lambda v$ for $v \neq 0$ $\Leftrightarrow$ $\det(A - \lambda I) = 0$ (because $A - \lambda I$ must have nontrivial nullspace)
      • The pair $(\lambda, 0)$ does not count. Only if $A$ is singular is $\lambda = 0$ an eigenvalue b/c then some $v \neq 0$ satisfies $Av = 0 = 0v$.
    • There are always as many (not necessarily unique) eigenvalues as dimensions b/c $\det(A - \lambda I)$ results in a polynomial of degree $n$, which has $n$ (possibly complex) roots (fundamental theorem of algebra)
    • Characteristic polynomial: $\det(\lambda I - A)$ instead of $\det(A - \lambda I)$ b/c we want a monic polynomial (leading coefficient $1$)
      • Only the diagonal contributes to the leading coefficient (see proof for the trace). If we have an uneven $n$, then we have an uneven number of $(-\lambda)$ factors in the diagonal of $A - \lambda I$ $\Rightarrow$ $-1$ as leading coefficient
      • Monic polynomial can always be factored into $(\lambda - \lambda_1) \cdots (\lambda - \lambda_n)$ b/c the coefficient of $\lambda^n$ after combining all factors must be pos. ($= 1$)
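Numerically, the roots of the monic characteristic polynomial are exactly the eigenvalues; a sketch (`np.poly` returns the characteristic-polynomial coefficients of a square matrix):

```python
import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])

# Monic characteristic polynomial det(lambda*I - A);
# for a 2x2 matrix: lambda^2 - tr(A)*lambda + det(A)
coeffs = np.poly(A)
roots = np.sort(np.roots(coeffs))      # roots = eigenvalues
eigs = np.sort(np.linalg.eigvals(A))
```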
  • Properties of Eigenvalues and Eigenvectors
    • If $k \in \mathbb{N}$ and $(\lambda, v)$ eigenpair of $A$: $(\lambda^k, v)$ eigenpair of $A^k$
      • If you repeat the same matrix operation $k$ times, evecs are scaled by $\lambda$ $k$ times.
    • If $A$ invertible and $(\lambda, v)$ eigenpair of $A$: $(\frac{1}{\lambda}, v)$ eigenpair of $A^{-1}$
      • If multiplying by $A$ scales $v$ by $\lambda$, then reversing should scale by $\frac{1}{\lambda}$
    • Eigenvalues distinct $\Rightarrow$ corresponding evecs lin. ind.
      • Assume $v_1, v_2, v_3$ are evecs of $A$ w/ distinct eigenvalues $\lambda_1, \lambda_2, \lambda_3$, and suppose $v_3$ is a linear combination of lin. ind. $v_1$ and $v_2$, i.e. $v_3 = \alpha v_1 + \beta v_2$ for $\alpha, \beta \neq 0$. Then $\lambda_3 v_3 = A v_3 = \alpha \lambda_1 v_1 + \beta \lambda_2 v_2$ and $\lambda_3 v_3 = \alpha \lambda_3 v_1 + \beta \lambda_3 v_2$ $\Rightarrow$ $\alpha \lambda_1 = \alpha \lambda_3$ ($v_1, v_2$ lin. ind.); $\alpha$ non-zero, so $\lambda_1 = \lambda_3$. Contradiction. Analogous for $\beta$.
      • Proof above generalizes for any num of vectors (can be written using sum notation)
      • Result: if all $n$ eigenvalues are distinct, then we can create a basis of evecs
    • $(\lambda, v)$ eigenpair of $A$ $\Leftrightarrow$ $(\lambda - \mu, v)$ eigenpair of $A - \mu I$
      • $(A - \mu I)v = \lambda v - \mu v = (\lambda - \mu)v$ if $Av = \lambda v$
    • Eigenvalues for $AB$ = eigenvalues for $BA$
      • Set $w = Bv$ for an eigenpair $(\lambda, v)$ of $AB$: $BAw = B(ABv) = \lambda Bv = \lambda w$. Then you get the claimed result.
    • $\text{tr}(A) = \sum_i \lambda_i$ and $\det(A) = \prod_i \lambda_i$
      • Characteristic polynomial: $\prod_i (\lambda - \lambda_i)$
        • Coefficient of $\lambda^{n-1}$ is $-\sum_i \lambda_i$ since we pick every eigenvalue once and multiply it with the $\lambda$ of all the other factors
      • We can calculate the characteristic polynomial by using the Leibniz formula on $\det(\lambda I - A)$.
        • Results in a sum of products of matrix entries
        • Since we add these products, each product can only contribute to coefficients of powers of variables that it contains
        • Therefore: we are looking for products with a $\lambda^{n-1}$. Since only the diagonal contains $\lambda$, we need to pick $n-1$ elements from the diagonal to get the $\lambda^{n-1}$ in. Picking $n-1$ elements forces the last one $\Rightarrow$ only the diagonal is relevant.
        • The diagonal: $\prod_i (\lambda - a_{ii})$, where $a_{ii}$ is $A$’s $i$-th diagonal entry
        • Coefficient of $\lambda^{n-1}$ is $-\sum_i a_{ii}$ ($= -\text{tr}(A)$) b/c we pick $\lambda$ $n-1$ times and then must pick one of the $-a_{ii}$’s. Comparing coefficients: $\text{tr}(A) = \sum_i \lambda_i$.
      • If diagonalizable: $\text{tr}(A) = \text{tr}(X \Lambda X^{-1}) = \text{tr}(\Lambda X^{-1} X) = \sum_i \lambda_i$ (using $\text{tr}(AB) = \text{tr}(BA)$)
      • and $\det(A) = \det(X)\det(\Lambda)\det(X^{-1}) = \det(\Lambda) = \prod_i \lambda_i$
    • $A^T A$ and $AA^T$ share non-zero eigenvalues
      • Since $A^T A v = \lambda v$, $AA^T (Av) = A (A^T A v) = \lambda (Av)$. Symmetric argument for $AA^T$.
      • $v$ evec of $A^T A$: $Av$ evec of $AA^T$ (if $Av \neq 0$)
      • $w$ evec of $AA^T$: $A^T w$ evec of $A^T A$ (if $A^T w \neq 0$)
  • Diagonalization and Powers of A
    • $X$ := the matrix with the evecs as cols. Then $AX = X\Lambda$, where $\Lambda$ has the eigenvalues corresponding to the evecs in $X$ on the diagonal, and $A = X \Lambda X^{-1}$ if $X$ invertible
      • $AX = \Lambda X$ would be wrong b/c you would then multiply each component of every evec with a different eigenvalue
    • To calculate power of $A$: $A^k = X \Lambda^k X^{-1}$ b/c the $X^{-1}$ and the $X$ in between cancel out
      • Can be used for closed formulas for recursions
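As a sketch of the closed-formula idea, Fibonacci via diagonalization of the recursion matrix (`fib` is a made-up helper name):

```python
import numpy as np

# Fibonacci step as a matrix: (F_{n+1}, F_n)^T = A (F_n, F_{n-1})^T
A = np.array([[1., 1.],
              [1., 0.]])

lam, X = np.linalg.eig(A)       # A = X diag(lam) X^{-1}
X_inv = np.linalg.inv(X)

def fib(n):
    # A^n = X diag(lam^n) X^{-1}; entry (0, 1) of A^n is F_n
    An = X @ np.diag(lam ** n) @ X_inv
    return round(An[0, 1].real)

fibs = [fib(n) for n in range(1, 11)]
```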
  • Spectral Theorem
    • Any symmetric matrix has real eigenvalues and an orthonormal basis of $\mathbb{R}^n$ consisting of its evecs.
      • Induction proof in script
    • Symmetric matrices can be diagonalized as $A = Q \Lambda Q^T$ since $X$ can be made orthogonal
  • Positive (Semi-)Definite Matrices
    • All eigenvalues of a symmetric matrix $> 0$ (positive definite, PD) or $\geq 0$ (positive semi-definite, PSD)
    • Rayleigh-Quotient $R(x) = \frac{x^T A x}{x^T x}$
      • Projects $Ax$ onto $x$, then norms by $x^T x$
      • Measures how much $A$ scales $x$ in $x$’s direction
      • $\lambda_{\min} \leq R(x) \leq \lambda_{\max}$ (evecs of sym. matrix can form orth. matrix $Q$)
      • Let $z = Q^T x$, which is a vector. Then $R(x) = \frac{x^T Q \Lambda Q^T x}{x^T x} = \frac{z^T \Lambda z}{z^T z} = \frac{\sum_i \lambda_i z_i^2}{\sum_i z_i^2}$ ($Q^T$ is orth. and therefore norm preserving)
      • This is a positively weighted average of the eigenvalues $\lambda_i$, so $\lambda_{\min} \leq R(x) \leq \lambda_{\max}$
    • Matrix PD $\Leftrightarrow$ $x^T A x > 0$ for all $x \neq 0$
      • Implies matrix invertible: if $Ax = 0$ for $x \neq 0$, then $x^T A x = 0$, meaning $0$ were an eigenvalue, contradicting PD.
    • Matrix PSD $\Leftrightarrow$ $x^T A x \geq 0$ for all $x$
    • $A^T A$ and $AA^T$ are always positive semi-definite
      • $x^T A^T A x = \|Ax\|^2 \geq 0$ (and the denominator $x^T x$ of $R(x)$ is always $> 0$, so all eigenvalues $\geq 0$)
  • Gram Matrices and Cholesky Decomposition
    • Gram matrix: $S = A^T A$ for some matrix $A$
    • $S$ PSD matrix $\Leftrightarrow$ $S$ Gram matrix $\Leftrightarrow$ $S = R^T R$, where $R$ upper triangular (Cholesky Decomposition)
      • $S = Q \Lambda Q^T = (\sqrt{\Lambda} Q^T)^T (\sqrt{\Lambda} Q^T)$ b/c PSD $\Rightarrow$ symmetric, and $\sqrt{\Lambda}$ exists b/c eigenvalues non-negative
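A quick sketch: build a Gram matrix, confirm it is PSD, and recover an upper-triangular factor (NumPy's `cholesky` returns the lower-triangular factor, so we transpose):

```python
import numpy as np

# Any B gives a PSD Gram matrix S = B^T B (here even PD, so Cholesky exists).
B = np.array([[1., 2.],
              [0., 1.],
              [1., 0.]])
S = B.T @ B

L = np.linalg.cholesky(S)   # lower-triangular L with S = L L^T
R = L.T                     # upper-triangular factor: S = R^T R

psd = bool((np.linalg.eigvalsh(S) >= 0).all())
factors = np.allclose(R.T @ R, S)
upper_triangular = np.allclose(R, np.triu(R))
```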
  • Singular Value Decomposition
    • We’re looking for orthogonal matrices $U$ and $V$ s.t. $A = U \Sigma V^T$, where $\Sigma$ is diagonal, with only positive elements along the diagonal (the “singular values”). $V^T = V^{-1}$ b/c $Q^{-1} = Q^T$ for orthogonal matrices
      • $U$ is $m \times m$ b/c the left most matrix in matrix mult. determines num rows (it telescopes)
      • $\Sigma$ is $m \times n$ to match dim
      • $V^T$ is $n \times n$ b/c the right most matrix determines num cols
    • To find $U$ and $V$: $A^T A = V \Sigma^T \Sigma V^T$ and $AA^T = U \Sigma \Sigma^T U^T$
      • Since $A^T A$ and $AA^T$ are sym., they each have a full set of orthogonal evecs (spectral theorem) $\Rightarrow$ above is their diagonalization, where $\Sigma^T \Sigma$ and $\Sigma \Sigma^T$ correspond to $\Lambda$
      • Works b/c both $A^T A$ and $AA^T$ are PSD and have the same non-zero eigenvalues. $\Sigma$ is filled from the top left with the square roots of the shared eigenvalues of $A^T A$ and $AA^T$. By convention: descending order. Rest of matrix is filled with zeros.
        • $m < n$: $A^T A$ is larger than $AA^T$, so it has additional zero eigenvalues. Therefore, $V$ is larger than $U$. $\Sigma$ is wide.
        • $m > n$: Reverse. $\Sigma$ is tall.
    • Intuition: $U \Sigma V^T$ tells you how much the input is rotated ($V^T$), how much it is stretched ($\Sigma$), and how much the output is rotated ($U$). Since $U$ and $V$ are orth., they don’t stretch at all.
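The shape bookkeeping and the eigenvalue connection can be checked on a small wide matrix:

```python
import numpy as np

A = np.array([[3., 0., 1.],
              [0., 2., 0.]])            # m=2 < n=3, so Sigma is wide

U, s, Vt = np.linalg.svd(A)             # s: singular values, descending
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)    # fill from the top left

shapes_ok = U.shape == (2, 2) and Vt.shape == (3, 3)
reconstructs = np.allclose(U @ Sigma @ Vt, A)

# Singular values squared = shared non-zero eigenvalues of A A^T (and A^T A)
eigs = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1]
squares_match = np.allclose(s ** 2, eigs)
```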
  • Change of Basis
    • A vector is a series of numbers that each scale a basis vector. Therefore, the same vector looks different in different bases.
    • $B$ and $B'$: two bases
    • $T$ := matrix w/ $B'$’s basis vectors as cols, as expressed in $B$. Multiplying a vector in $B'$-coordinates by $T$ therefore maps it to $B$-coordinates.
    • $M$ := matrix of a lin. transformation in $B$. To apply it to a vector in $B'$-coordinates, we must first convert to $B$ ($T$), then apply it ($M$) and then convert back to $B'$ ($T^{-1}$): $M' = T^{-1} M T$
      • $M$ and $T^{-1} M T$ are called similar
    • Application: Image Compression
      • Represent entire image or parts of image as vector or matrix, where each coordinate or row has a grayscale/color value.
      • Change basis so that basis vectors capture larger parts of the image, for example all pixels black, half black/half white, and so on. We then discard the ones with low coefficients since they don’t contribute much.
      • Common choices:
        • Fourier matrix: You have a wave going across the entire vector, with different frequencies for each basis vector.
        • Wavelet matrix: You have one wave cycle somewhere in the column vector. The crests have a couple different widths and locations.
  • Left and Right Inverses; Pseudoinverse
    • Goal: Always get the most relevant answer to $Ax = b$
    • Tall matrix, full col. rank ($= n$)
      • Does not collapse information: Converts vectors into a higher dimension vectorspace; full column rank $\Rightarrow$ the subspace that the vectors inhabit has the same dim as their original vectorspace, so the transformation is injective. Therefore, $A$ should have a left inverse.
      • $A^T A$ is an $n \times n$ square matrix. Since $\text{rank}(A^T A) = \text{rank}(A) = n$, $A^T A$ is full rank and therefore invertible.
        • $A^+ = (A^T A)^{-1} A^T$ is left inverse of $A$
      • Same as projection / least squares: $A^+ b = (A^T A)^{-1} A^T b = \hat{x}$
        • Intuition: The inverse must work for all points, not just ones on the hyperspace formed by $C(A)$. The inverse must therefore project along $N(A^T)$ onto the hyperspace.
    • Wide matrix, full row rank ($= m$)
      • $A^T$ does not collapse information: it converts vectors into a higher dimension vectorspace; full row rank $\Rightarrow$ the subspace that the vectors inhabit has the same dim as their original vectorspace, so $A^T$ is injective and $A$ is surjective. Therefore, $A$ should have a right inverse.
      • $AA^T$ is an $m \times m$ square matrix. Since $\text{rank}(AA^T) = \text{rank}(A) = m$, $AA^T$ is full rank and therefore invertible.
        • $A^+ = A^T (AA^T)^{-1}$ is right inverse of $A$
      • $A^+ b$ is min norm solution to $Ax = b$
        • $x = x_r + x_n$, where $x_r \in C(A^T)$ and $x_n \in N(A)$, b/c they’re orth. complements
        • $x_r$ is min-norm solution b/c $\|x\|^2 = \|x_r\|^2 + \|x_n\|^2$ ($x_r \perp x_n$)
        • $A^+ b = A^T (AA^T)^{-1} b = A^T w$ for some $w$ $\Rightarrow$ $A^+ b \in C(A^T)$, so $A^+ b = x_r$
        • Taking the pseudoinverse of a wide matrix is the same as transposing it into a tall matrix, taking the pseudoinverse, and then transposing again: $A^+ = ((A^T)^+)^T$.
    • For any matrix: $A^+ = R^+ C^+$
      • Use CR-decomposition: $A = CR$
      • $C$ is a tall matrix w/ full col. rank. $R$ is a wide matrix w/ full row rank
      • $A^+ = R^+ C^+ = R^T (RR^T)^{-1} (C^T C)^{-1} C^T$ ($RR^T$ and $C^T C$ invertible $\Rightarrow$ both factors exist)
      • $A^+ b$ is least squares sol:
        • least squares sol satisfies normal equation $A^T A \hat{x} = A^T b$
        • $A^T A A^+ b = R^T C^T C R R^+ C^+ b = R^T C^T C C^+ b$ ($R^+$ right inverse) $= R^T C^T C (C^T C)^{-1} C^T b = R^T C^T b = A^T b$ (def. $C^+$; def. transpose)
      • $A^+ b$ is min norm solution:
        • We know from wide matrices: the min norm solution lies in the row space
        • $A^+ b = R^T w$ for some $w$ $\Rightarrow$ $A^+ b \in C(R^T) = C(A^T)$ ($A$ and $R$ have the same row space)
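A sketch of the general pseudoinverse through a full-rank factorization (the $C$ and $R$ below are made up by hand rather than computed via Gauss-Jordan):

```python
import numpy as np

# Full-rank factorization A = C R (C tall w/ full col. rank, R wide w/ full row rank)
C = np.array([[1., 0.],
              [1., 1.],
              [0., 1.]])
R = np.array([[1., 0., 2.],
              [0., 1., 1.]])
A = C @ R                                # 3x3 matrix of rank 2

C_pinv = np.linalg.inv(C.T @ C) @ C.T    # left inverse of C
R_pinv = R.T @ np.linalg.inv(R @ R.T)    # right inverse of R
A_pinv = R_pinv @ C_pinv                 # A^+ = R^+ C^+

left_ok = np.allclose(C_pinv @ C, np.eye(2))
right_ok = np.allclose(R @ R_pinv, np.eye(2))
matches_numpy = np.allclose(A_pinv, np.linalg.pinv(A))
```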
  • Properties of the Pseudoinverse
    • $AA^+$ and $A^+ A$ are projection matrices: $AA^+$ projects onto $C(A)$ and $A^+ A$ onto $C(A^T)$
      • $AA^+ = C R R^+ C^+ = C C^+$ ($R^+$ = right inverse of $R$) and $A^+ A = R^+ C^+ C R = R^+ R$ ($C^+$ = left inverse of $C$)
    • $(A^+)^+ = A$.