PCA(3 points)

Given n data points $x_1,\ldots,x_n$, where $x_i \in \mathcal{R}^d$. Describe how to find the top k principle components by SVD.

• 对X进行奇异值分解：$X = U\Sigma V^T = \sum_{i=1}^d \sigma_iu_iv_i^T$
• 取X的k个最大的左奇异向量：$u_1,u_2,\ldots,u_k$
• x的新坐标：$U_k^Tx = [u_1^Tx, u_2^Tx, \ldots, u_k^Tx] \in \mathcal{R}^k, U_k = [u_1, u_2,\ldots,u_k] \in \mathcal{R}^{d*k}$
• X的新坐标：$U_k^TX=U_k^TU_r\Sigma_rV_r^T = \Sigma_kV_k^T$

Association Pattern Mining(2 points)

The downward closure property (i.e.,every subset of frequent itemset is also frequcent) is leveraged to design efficient algorithms for association pattern mining. Why does this property hold?

NMF(3 points)

Suppose we want to find a rank-k approximation of matrix $X \in \mathcal{R}^{d*n}$ by nonnegative matrix factorization. What is the optimization problem? Is it convex?

• 优化问题：$\min_{U \in \mathcal{R}^{d*k}, V \in \mathcal{R}^{v*k}} \ \Vert X - UV^T \Vert_F^2$， $s.t. \ U \ge 0, V \ge 0$
• 非负矩阵分解是非凸的

SVM(6 points)

Given a set of training data $(x_1,y_1),\ldots,(x_n,y_n)$, where $x_i \in \mathcal{R}^d$ and $y_i \in {\pm 1}$. The primal problem of SVM without intercept is given by: $$\min_{w \in \mathcal{R}^d} \sum_{i=1}^m \max(0,1-y_iw^Tx_i) + \frac{\lambda}{2} \Vert w \Vert_2^2$$. Show the derivation of the dual problem of SVM.

Hints: Let $\mathcal{l}(x) = max(0,1-x)$ be the hinge loss. Then, its conjugate function id given by $$\mathcal{l}^*(y) = sup_x(yx-\mathcal{l}(x)) = \begin{cases} y& -1 \le y \le 0 \\ \infty& \ otherwise \end{cases}$$

• 由上述定义的hinge loss及其共轭函数知，原优化问题简化为：$\min_{w \in R^d} \sum_{i=1}^n \mathcal{l}(y_iw^Tx_i) + \frac{\lambda}{2}\Vert w \Vert_2^2$,等价于： $$\sum_{i=1}^n \mathcal{l}(u_i) + \frac{\lambda}{2} \Vert w \Vert_2^2, s.t. \ u_i = y_iw^Tx_i, i = 1, \ldots, n$$
• 由拉格朗日乘法，得：$L(w,u,v) = \sum_{i=1}^n \mathcal{l}(u_i) + \frac{\lambda}{2} \Vert w \Vert_2^2 + \sum_{i=1}^n v_i(u_i - y_iw^Tx_i)$,其对偶问题：

$$\begin{array} {lcl} g(v) &=& \inf_{w,u} L(w,u,v) \\ &=& \inf_{w,u} \sum_{i=1}^n \mathcal{l}(u_i) + \frac{\lambda}{2} \Vert w \Vert_2^2 + \sum_{i=1}^n v_i(u_i - y_iw^Tx_i) \\ &=& \inf_{w,u} \sum_{i=1}^n(\mathcal{l}(u_i) + v_iu_i) + (\frac{\lambda}{2} \Vert w \Vert_2^2 - w^T\sum_{i=1}^nv_iy_ix_i) \end{array}$$

依次最小化w,u:

• $\inf_{u_i}(\mathcal{l}(u_i) + v_iu_i) = -\sup_{u_i}(-v_iu_i - l(u_i)) = - \mathcal{l}^* (-v_i) = v_i$, if $0 \le v_i \le 1$
• $\nabla_w L(w,u,v) = \lambda w - \sum_{i=1}^nv_iy_ix_i)$, 得：$w = \frac{1}{\lambda} \sum_{i=1}^nv_iy_ix_i$
• 最后， $g(v) = \sum_{i=1}^n v_i - \frac{1}{2 \lambda}\sum_{i=1}^n\sum_{j=1}^n v_iv_jy_iy_jx_i^Tx_j$，得到对偶问题： $$\max_{v \in R^n} \sum_{i=1}^n v_i - \frac{1}{2 \lambda}\sum_{i=1}^n\sum_{j=1}^n v_iv_jy_iy_jx_i^Tx_j, \ s.t.\ 0 \le v_i \le 1, i=1, \ldots,n$$

Ensemble(2 points)

Ensemble analysis is used to reduce the bias or variance of the classification process. Which of them is reduced by bagging/boosting?

a) Bagging aims to reduce the: variance(代表性算法：随机森林，随机化的决策树模型，每次随机选择一定大小的特征)

b) Boosting aims to reduce the:bias(代表性算法：AdaBoost，每个训练实例都有个权重，分类错误的实例会赋予一个更大的权重)

Ridge Regression(5 points)

Given a set of training data $(x_1,y_1),\ldots,(x_n,y_n)$, where $x_i \in \mathcal{R}^d$ and $y_i \in \mathcal{R}$. Our goal is to learn a linear model $f(x) = x^Tw + b$ to predict the label $y \in \mathcal{R}$ of an instance $x \in \mathcal{R}^d$

a) Show the optimization problem of the ridge regression for learning $w \in \mathcal{R}^d$ and $b \in \mathcal{R}$. Notations:$X = [x_1,\ldots,x_n] \in \mathcal{R}^{d*n}, y = [y_1, \ldots, y_n]$

b) Derive the optimal solution $w_*$, and $b_*$ of the above problem. Notations: I is the identity matrix, and $H = I - \frac{1}{n}1_n1_n^T$ is the centering matrix. Hints:$\frac{\partial \Vert u - A^Tw \Vert_2^2}{\partial w} = 2A(A^Tw-u)$

1. 优化问题：$\min_{b \in R,w \in R^d} \Vert y - Xw - 1_Nb \Vert_2^2 + \lambda \Vert w \Vert_2^2$, 其中：$1_N = [1,\ldots,1]^T \in R^d$

• 对上式$b$进行求导，并令倒数为0：$-2*1_N^T(y-Xw -1_Nb)=0, b = \frac{1}{N}1_N^T(y-Xw)$
• 对上式$w$进行求导，并令倒数为0：$2*X^T(Xw -y + 1_Nb) + 2 \lambda w=0, (X^T(I-\frac{1}{N}1_N1_N^T)X + \lambda I)w=X^T(I-\frac{1}{N}1_N1_N^T)y$
• 令$H = I - \frac{1}{N}1_N1_N^T$为中心矩阵，则得到：$w_* = (X^THX + \lambda I)^{-1}X^THy,b_* = \frac{1}{N}1_N^T(y-Xw_*)$

Give a brief introduction of semi-suppervised learning and active learning

• semi-suppervised learning
• 标记数据的代价昂贵并且难于获得；无标签数据通常是大量可得到的；无标签数据是有用的：无标签数据可以用来评估数据的低维流型结构和特征的联合概率分布。
• 相关算法：
• 元算法：使用任何现有的算法作为子程序，这类算法主要有Self-training和Co-training。
• 具体算法：半监督贝叶斯分类器，转导支持向量机，基于图的半监督学习算法。
• 数据要求：数据的类别特征应该近似和它的聚类特征匹配；在实际应用中，当有标签的数据非常少时效率很高。
• active learning
• 主动学习是半监督机器学习的一个特例，在主动学习中，一个学习算法可以主动地提出一些标注请求，将一些经过筛选的数据提交给专家进行标注。
• 两种查询系统：选择性取样和基于池的采样。
• 种类：基于异构的模型、基于性能的模型、基于代表的模型。

Collaborative Filtering(5 points)

A merchant has an n*d ratings matrix D representing the preferences of n customers across d items. It is assumed that the matrix is sparse, and therefore each customer may have bought only a few items. Please provide one approach that the utilizes the rating matrix D to make recommendations to customers.

• 基于邻居的方法：基于用户的评级相似性、基于商品的评级相似性
• 基于图的方法：
• 聚类方法：自适应k-means聚类，自适应协同聚类
• 潜伏因子模型：奇异值分解、矩阵分解、矩阵填充