-
Notifications
You must be signed in to change notification settings - Fork 0
/
introduction.tex
102 lines (92 loc) · 3.76 KB
/
introduction.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
\section{Introduction}
\label{sec:intro}
This report use sketching methods to deal with the
least absolute deviation regression problem which minimizes
\begin{equation} \label{eq:ls}
f(x) = \frac{1}{N}\|Ax-b\|_1,
\end{equation}
where $N$ is the number of samples,
$A\in\real^{n\times d}$, $b\in\real^n$,
and $x\in\real^d$.
The two sketching methods are
\emph{Cauchy}
and \emph{Exponential}.
\subsection{Dataset}
Same as {\it HW1},
we use the \emph{how much did it rain ii} \cite{rain, Lakshmanan16}
dataset from \emph{kaggle.com} as our benchmark.
The aim of the dataset is to predict the amount of raining.
The original dataset contains 13,765,201 training samples
with 23 features and 1 prediction value.
The following description was extracted from web page \cite{rain}.
\begin{description}[align=left,labelindent=2em]
\item [Id] A unique number for the set of observations over an hour at a gauge.
\item [Minutes\_past] For each set of radar observations,
the minutes past the top of the hour that the radar observations were carried out.
Radar observations are snapshots at that point in time.
\item [Radardist\_km] Distance of gauge from the radar whose observations are being reported.
\item [Ref] Radar reflectivity in km.
\item [Ref\_5x5\_10th, Ref\_5x5\_50th, Ref\_5x5\_90th]
10th, 50-th, 90-th percentile of reflectivity values
in 5x5 neighborhood around the gauge.
\item [RefComposite, RefComposite\_5x5\_10th, RefComposite\_5x5\_50th, RefComposite\_5x5\_90th]
Maximum reflectivity in the vertical column above gauge. In dBZ.
\item [RhoHV, RhoHV\_5x5\_10th, RhoHV\_5x5\_50th, RhoHV\_5x5\_90th] Correlation coefficient.
\item [Zdr, Zdr\_5x5\_10th, Zdr\_5x5\_50th, Zdr\_5x5\_90th] Differential reflectivity in dB.
\item [Kdp, Kdp\_5x5\_10th, Kdp\_5x5\_50th, Kdp\_5x5\_90th] Specific differential phase (deg/km).
\item [Expected] Actual gauge observation in mm at the end of the hour.
\end{description}
Since it contains null values and outliers,
we will do some preprocesses.
All the experiments were conducted on a machine
that has an 16-core Intel Xeon E5-2637 v2 CPU at 3.5GHz,
and 256G memory.
Codes were implemented using the Julia language
and are available at \emph{https://github.com/innerlee/random.computation.report1}.
\subsection{Preprocess and Baselines} \label{sec:preprocess}
Since some observations are not complete,
we first filter out 2,769,088 samples that do not contain missing values.
The first two features \emph{id} and \emph{minutes\_past}
are used just for identification rows,
thus we omit them and
get data matrices
\begin{equation}
A_0 \in \real^{2769088\times 21}, \,
b_0\in\real^{2769088}.
\end{equation}
The $99$-th percentile of prediction values in $b_0$ is $144.0$.
We then use samples that have prediction values less than $144.0$
and get new data
\begin{equation} \label{eq:data}
A_1 \in \real^{2741220\times 21}, \,
b_1\in\real^{2741220}.
\end{equation}
Since the LP solver for getting the exact solution cannot process large scale
data in a reasonable time,
we then truncate the matrix to 20,000 rows.
As suggested by instruction,
we normalize matrix $A$ by subtracting mean and divide by std column-wisely.
$b$ is \emph{not} normalized.
The final dataset we use is
\begin{equation} \label{eq:data}
A \in \real^{20000\times 21}, \,
b\in\real^{20000}.
\end{equation}
Solving problem~\eqref{eq:ls} by LP solver,
we have
{
\def\OldComma{,}
\catcode`\,=13
\def,{%
\ifmmode%
\OldComma\discretionary{}{}{}%
\else%
\OldComma%
\fi%
}%
$x^* = (0.637049,-0.0401072,0.117485,-0.196894,-0.375371,-0.267876,0.855095,0.509979,
0.146945,0.0406474,-0.306016,0.0689336,-0.0244424,0.0619177,0.202188,0.395404,-0.183876,
-0.0417801,0.0202723,-0.00508027,-0.125157)$
}
with loss $f(x^*) = 3.429$ in $97.3$ seconds.
This serves as the baseline for our later discussion.