*The views expressed in this article are those of the interviewee and do not necessarily reflect the position of RBC or Borealis AI.*

Data privacy is a hot topic for most organizations. But operating in a highly regulated industry must place privacy even higher on RBC’s agenda. Is privacy driven by regulation?

*Holly Shonaman (HS):*

Not at all. I agree that we are highly regulated and that privacy is central to those regulations. But I think most data-rich businesses now understand that privacy protection is about much more than simply meeting a regulatory hurdle.

Like many organizations, we need data from our customers in order to run our business. We need it to ensure our products and services are meaningful and valuable to them. If our customers don’t trust us with their data, it becomes very difficult for us to do our jobs and deliver value.

So, yes, we are always mindful of the regulatory aspects. But that’s not what guides us: our focus is on building trust with our customers, and privacy is central to that.

*(HS):*

My job is to consider how we are using and processing data in all aspects of the business. And in that respect, AI isn’t all that different from more conventional methods of data analytics.

However, there are clear nuances that surround the use of AI, particularly in a consumer setting. In part, it’s the scale and speed that AI can achieve. That makes privacy and reputational risks more difficult to assess and control.

But it’s also that the public conversation around AI remains mired in mistrust. People don’t trust that the data is accurate; they don’t trust it is free from bias; they don’t trust how their data is going to be used. They simply don’t believe that machine learning can replace human interactions.

*(HS):*

It comes down to literacy on the topic. I don’t believe people really know what AI is and what protections are around it. I would argue that, as a country, we need to have a much more robust conversation about AI and help Canadians understand what kinds of questions they should be asking. That will require some thinking at a national policy level. But it’s important that – like financial literacy and data literacy – Canadians gain some AI literacy as well.

*(HS):*

At RBC, data privacy is baked into our processes. Our role in the global Privacy Office is to ensure that AI developers and business leaders understand and assess privacy risks. For example, before launching any new product or initiative, we conduct a privacy risk impact assessment which looks at the entire end-to-end process. If a risk is identified, we have a conversation about the types of controls that should be put in place.

Sometimes that means applying differential privacy techniques or limiting the amount of information that goes into the model. Or it could require further testing for things like the right level of data granularity, to ensure anonymous people in the data set cannot be identified based on the outcomes.

*(HS):*

I can’t overstate the trust aspect. One of my concerns is that if we overuse AI without understanding the full short and long-term consequences of models, we could end up destroying any trust we build in the technology as a society.

The problem is that society is changing extraordinarily rapidly and that means the AI community can’t always assess the full impact of their models until the risks are all too apparent. Being able to stay on top of these shifts is our focus.

*(HS):*

I am very encouraged by the robustness of the way many management teams – including those at RBC – are approaching this issue. We have a very strong risk management team. And our board of directors and executives demand clarity on what we are doing to treat clients fairly and use their data appropriately.

Generally speaking, I think everyone is very happy to do things with more speed, better information and more efficiency. But they also recognize that if you have a fast car, you need strong brakes. In other words, companies need to have the ability to continuously assess these models and take them ‘offline’ if there is a problem.

*(HS):*

I would argue that it needs to start at the university and training level – we need to educate developers on ethical AI from the outset. It can’t just be all about code; developers need to understand the social, ethical and privacy issues that influence their field.

I also think bias and risk should always be top of mind. They need to try to think broadly about a range of potential short, medium and long-term scenarios and test against them. That’s not easy; it’s hard work to look into the future.

I would also encourage AI developers to be more front-and-centre, working with the business and the privacy team to talk about what they are doing, the problems they have identified, their data sources and their designs.

*(HS):*

Business leaders need to keep doing more of what they are already doing. They need to demand more transparency, more reporting and testing. Perhaps more importantly, leaders need to allow employees to find flaws in their models, and maybe even reward that.

I think we are also going to see a lot more focus on third-party verification and audits to ensure corporate models and controls are really up to the task. It’s good protection for the business and helps the organization understand the robustness of their own testing.

*(HS):*

Quite to the contrary. I actually believe that – if we get it right – privacy is the key to building trust in AI. It doesn’t matter if you are lending money or selling sweatpants; access to customer data is critical to being able to deepen your relationship with your customers, deliver a great experience to them, and serve them. If you don’t use their data respectfully to support the client relationship, they’ll lose interest in your business. If you breach their privacy or cross the ethical line, you lose their trust. So our focus and attention to privacy controls is actually what will allow us to move ahead with AI development in Canada.

As RBC’s Chief Privacy Officer, Holly Shonaman leads RBC’s global Privacy Risk Management program and provides compliance oversight in support of the bank’s leadership in digitally-enabled relationship banking. Ms. Shonaman has held various positions within RBC across the retail and commercial banking and wealth management divisions.

This tutorial concerns the *Boolean satisfiability* or *SAT* problem. We are given a formula containing binary variables that are connected by logical relations such as $\text{OR}$ and $\text{AND}$. We aim to establish whether there is any way to set these variables so that the formula evaluates to $\text{true}$. Algorithms that are applied to this problem are known as *SAT solvers*.

The tutorial is divided into three parts. In part I, we introduce Boolean logic and the SAT problem. We discuss how to transform SAT problems into a standard form that is amenable to algorithmic manipulation. We categorize types of SAT solvers and present two naïve algorithms. We introduce several SAT constructions, which can be thought of as common sub-routines for SAT problems. Finally, we present some applications; the Boolean satisfiability problem may seem abstract, but as we shall see it has many practical uses.

In part II of the tutorial, we will dig more deeply into the internals of modern SAT solver algorithms. In part III, we recast SAT solving in terms of message passing on factor graphs. We also discuss satisfiability modulo theory (SMT) solvers, which extend the machinery of SAT solvers to solve more general problems involving continuous variables.

The relevance of SAT solvers to machine learning is not immediately obvious. However, there are two direct connections. First, machine learning algorithms rely on optimization. SAT can also be considered an optimization problem and SAT solvers can find global optima without relying on gradients. Indeed, in this tutorial, we'll show how to fit both neural networks and decision trees using SAT solvers.

Second, machine learning techniques are often used as components of SAT solvers; in part II of this tutorial, we'll discuss how reinforcement learning can be used to speed up SAT solving, and in part III we will show that there is a close connection between factor graphs and SAT solvers and that belief propagation algorithms can be used to solve satisfiability problems.

In this section, we define a set of *Boolean operators* and show how they are combined into *Boolean logic formulae*. Then we introduce the *Boolean satisfiability problem*.

*Boolean operators* are standard functions that take one or more binary variables as input and return a single binary output. Hence, they can be defined by *truth tables* in which we enumerate every combination of inputs and define the output for each (figure 1). Common logical operators include:

- The $\text{OR}$ operator is written as $\lor$ and takes two inputs $x_{1}$ and $x_{2}$. It returns $\text{true}$ if one or both of the inputs are $\text{true}$ and returns $\text{false}$ otherwise.
- The $\text{AND}$ operator is written as $\land$ and takes two inputs $x_{1}$ and $x_{2}$. It returns $\text{true}$ if both the inputs are $\text{true}$ and $\text{false}$ otherwise.
- The $\text{IMPLICATION}$ operator is written as $\Rightarrow$ and evaluates whether the two inputs are consistent with the statement 'if $x_{1}$ then $x_{2}$'. The statement is only disobeyed when $x_{1}$ is $\text{true}$ and $x_{2}$ is $\text{false}$ and so implication returns $\text{false}$ for this combination of inputs and $\text{true}$ otherwise.
- The $\text{EQUIVALENCE}$ operator is written as $\Leftrightarrow$ and takes two inputs $x_{1}$ and $x_{2}$. It returns $\text{true}$ if the two inputs are the same and returns $\text{false}$ otherwise.
- The $\text{NOT}$ operator is written as $\lnot$ and takes one input. It returns $\text{true}$ if the input $x_{1}$ is $\text{false}$ and vice-versa. We refer to $\lnot x_{1}$ as the *complement* of $x_{1}$.

A *Boolean logic formula* $\phi$ takes a set of $I$ variables $\{x_{i}\}_{i=1}^{I}\in\{\text{false},\text{true}\}$ and combines them using Boolean operators, returning $\text{true}$ or $\text{false}$. For example:

\begin{equation}

\phi:= (x_{1}\Rightarrow (\lnot x_{2}\land x_{3})) \land (x_{2} \Leftrightarrow (\lnot x_{3} \lor x_{1})). \tag{1}

\end{equation}

For any combination of input variables $x_{1},x_{2},x_{3}\in\{\text{false},\text{true}\}$, we could evaluate this formula and see if it returns $\text{true}$ or $\text{false}$. Notice that even for this simple example with three variables it is hard to see what the answer will be by inspection.

The *Boolean satisfiability problem* asks whether there is *at least one* combination of binary input variables $x_{i}\in\{\text{false},\text{true}\}$ for which a Boolean logic formula returns $\text{true}$. When this is the case, we say the formula is *satisfiable*.
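For a formula this small, satisfiability can be settled by simply trying every assignment. The sketch below does this for equation 1, reading the formula with its brackets balanced as $(x_{1}\Rightarrow (\lnot x_{2}\land x_{3})) \land (x_{2} \Leftrightarrow (\lnot x_{3} \lor x_{1}))$. The helper names `implies` and `phi` are illustrative choices and not part of the tutorial.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """IMPLICATION: false only when a is true and b is false."""
    return (not a) or b


def phi(x1: bool, x2: bool, x3: bool) -> bool:
    """Equation 1: (x1 => (not x2 and x3)) and (x2 <=> (not x3 or x1))."""
    return implies(x1, (not x2) and x3) and (x2 == ((not x3) or x1))


# Collect every assignment (x1, x2, x3) for which the formula is true.
satisfying = [xs for xs in product([False, True], repeat=3) if phi(*xs)]
is_satisfiable = len(satisfying) > 0
print(is_satisfiable, satisfying)
```

Under this reading, exactly two of the eight assignments satisfy the formula ($x_{3}$ alone $\text{true}$, or $x_{2}$ alone $\text{true}$), so the formula is satisfiable.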

A SAT solver is an algorithm for establishing satisfiability. It takes the Boolean logic formula as input and returns $\text{SAT}$ if it finds a combination of variables that can satisfy it or $\text{UNSAT}$ if it can demonstrate that no such combination exists. In addition, it may sometimes return without an answer if it cannot determine whether the problem is $\text{SAT}$ or $\text{UNSAT}$.

To solve the SAT problem, we first convert the Boolean logic formula to a standard form that is more amenable to algorithmic manipulation. Any formula can be re-written as a conjunction of disjunctions (i.e., the logical $\text{AND}$ of statements containing $\text{OR}$ relations). This is known as *conjunctive normal form*. For example:

\begin{equation}\label{eq:example_cnf}

\phi:= (x_{1} \lor x_{2} \lor x_{3}) \land (\lnot x_{1} \lor x_{2} \lor x_{3}) \land (x_{1} \lor \lnot x_{2} \lor x_{3}) \land (x_{1} \lor x_{2} \lor \lnot x_{3}). \tag{2}

\end{equation}

Each term in brackets is known as a *clause* and combines together variables and their complements with a series of logical $\text{OR}$s. The clauses themselves are combined via $\text{AND}$ relations.
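One convenient way to hold a formula in conjunctive normal form in code is as a list of clauses, where each clause is a list of signed integers: `3` stands for $x_{3}$ and `-3` for $\lnot x_{3}$. This mirrors the convention of the DIMACS CNF file format that many solvers read. The sketch below is an assumption about representation only; the names `EXAMPLE_CNF`, `literal_value` and `evaluate_cnf` are hypothetical.

```python
from typing import Dict, List

Clause = List[int]  # e.g. [1, -2, 3] means (x1 OR NOT x2 OR x3)

# Equation 2 written as a list of clauses.
EXAMPLE_CNF: List[Clause] = [[1, 2, 3], [-1, 2, 3], [1, -2, 3], [1, 2, -3]]


def literal_value(lit: int, assignment: Dict[int, bool]) -> bool:
    """Value of a literal: the variable itself, or its complement if negative."""
    value = assignment[abs(lit)]
    return value if lit > 0 else not value


def evaluate_cnf(cnf: List[Clause], assignment: Dict[int, bool]) -> bool:
    """A CNF formula is true when every clause has at least one true literal."""
    return all(
        any(literal_value(lit, assignment) for lit in clause) for clause in cnf
    )
```

With this representation, evaluating a candidate assignment is a pair of nested loops: `all` over the clauses and `any` over the literals in each clause.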

The *Tseitin transformation* converts an arbitrary logic formula to conjunctive normal form. The approach is to (i) associate new variables with sub-parts of the formula using logical equivalence relations, (ii) restate the formula by logically $\text{AND}$-ing these new variables together, and finally (iii) manipulate each of the equivalence relations so that they themselves are in conjunctive normal form.

This process is most easily understood with a concrete example. Consider the conversion of the formula:

\begin{equation}

\phi:= ((x_{1} \lor x_{2}) \Leftrightarrow x_{3}) \Rightarrow (\lnot x_{4}). \tag{3}

\end{equation}

**Step 1:** We associate new binary variables $y_{i}$ with the sub-parts of the original formula using the $\text{EQUIVALENCE}$ operator:

\begin{eqnarray}\label{eq:tseitin}

y_{1} &\Leftrightarrow &(x_{1} \lor x_{2})\nonumber \\

y_{2} &\Leftrightarrow &(y_{1} \Leftrightarrow x_{3}) \nonumber \\

y_{3} &\Leftrightarrow &\lnot x_{4}\nonumber \\

y_{4} &\Leftrightarrow &(y_{2} \Rightarrow y_{3}). \tag{4}

\end{eqnarray}

We work from the inside out (i.e., from the deepest brackets to the least deep) and choose sub-formulae that contain a single operator ($\lor, \land, \lnot, \Rightarrow$ or $\Leftrightarrow$).

**Step 2:** We restate the formula in terms of these relations. The full original statement is now represented by $y_{4}$ together with the definitions of $y_{1},y_{2},y_{3},y_{4}$ in equation 4. So the statement is $\text{true}$ when we combine all of these relations with logical $\text{AND}$ relations. Working backwards we get:

\begin{eqnarray}\label{eq:tseitin_stage2}

\phi&=& y_{4} \land (y_{4} \Leftrightarrow (y_{2} \Rightarrow y_{3})) \nonumber \\&&\hspace{0.4cm}\land (y_{3} \Leftrightarrow \lnot x_{4})\nonumber\\&& \hspace{0.4cm}\land (y_{2} \Leftrightarrow (y_{1} \Leftrightarrow x_{3}))\nonumber

\\&&\hspace{0.4cm}\land (y_{1} \Leftrightarrow (x_{1} \lor x_{2})). \tag{5}

\end{eqnarray}

This is getting closer to the conjunctive normal form as it is now a conjunction (logical $\text{AND}$) of different terms.

**Step 3:** We convert each of these individual terms to conjunctive normal form. In practice, there is a recipe for each type of operator:

\begin{eqnarray}

a \Leftrightarrow (\lnot b) & = & (a \lor b) \land (\lnot a \lor \lnot b) \nonumber \\

a \Leftrightarrow (b \lor c) &= & (a\lor \lnot b) \land (a \lor \lnot c) \land (\lnot a \lor b \lor c) \nonumber \\

a \Leftrightarrow (b \land c) & = & (\lnot a \lor b) \land (\lnot a \lor c) \land (a \lor \lnot b \lor \lnot c) \nonumber \\

a \Leftrightarrow (b \Rightarrow c) & = & (a \lor b) \land (a \lor \lnot c) \land (\lnot a \lor \lnot b \lor c) \nonumber \\

a \Leftrightarrow (b \Leftrightarrow c) & = & (\lnot a \lor \lnot b \lor c)\land (\lnot a \lor b \lor \lnot c) \land (a \lor \lnot b \lor \lnot c) \land (a\lor b\lor c). \tag{6}

\end{eqnarray}

The first of these recipes is easy to understand. If $a$ is $\text{true}$ then the first clause is satisfied, but the second can only be satisfied by having $\lnot b$. If $a$ is $\text{false}$ then the second clause is satisfied, but the first clause can only be satisfied by $b$. Hence when $a$ is $\text{true}$, $\lnot b$ is $\text{true}$ and when $a$ is $\text{false}$, $\lnot b$ is $\text{false}$ and so $a \Leftrightarrow (\lnot b)$ as required.

The remaining recipes are not obvious, but you can confirm that they are correct by writing out the truth tables for the left and right sides of each expression and confirming that they are the same. Applying the recipes to equation 5 we get the final expression in conjunctive normal form:

\begin{eqnarray}\label{eq:tseitin_stage3}

\phi\!\!&\!\!:=& y_{4} \land (y_4\lor y_2) \land (y_4 \lor \lnot y_3) \land (\lnot y_4 \lor \lnot y_2 \lor y_3)\nonumber \\

&&\hspace{0.4cm}\land (y_3 \lor x_4) \land (\lnot y_3 \lor \lnot x_4)\nonumber\\

&& \hspace{0.4cm}\land (\lnot y_2 \lor \lnot y_1 \lor x_3)\land (\lnot y_2 \lor y_1 \lor \lnot x_3) \land (y_2 \lor \lnot y_1 \lor \lnot x_3) \land (y_2\lor y_1\lor x_3)\nonumber \\&&

\hspace{0.4cm}\land (y_1\lor \lnot x_1) \land (y_1 \lor \lnot x_2) \land (\lnot y_1 \lor x_1 \lor x_2). \tag{7}

\end{eqnarray}
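Both the recipes in equation 6 and the final result in equation 7 can be checked mechanically with truth tables. The sketch below first compares the two sides of each recipe over all inputs. It then confirms that, for every setting of $x_{1},\ldots,x_{4}$, the original formula in equation 3 is $\text{true}$ exactly when some setting of the auxiliary variables $y_{1},\ldots,y_{4}$ satisfies equation 7. The function names are illustrative.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    return (not a) or b


# Each pair is (left side, right side) of one recipe in equation 6.
RECIPES = [
    (lambda a, b, c: a == (not b),
     lambda a, b, c: (a or b) and ((not a) or (not b))),
    (lambda a, b, c: a == (b or c),
     lambda a, b, c: (a or (not b)) and (a or (not c)) and ((not a) or b or c)),
    (lambda a, b, c: a == (b and c),
     lambda a, b, c: ((not a) or b) and ((not a) or c)
     and (a or (not b) or (not c))),
    (lambda a, b, c: a == implies(b, c),
     lambda a, b, c: (a or b) and (a or (not c)) and ((not a) or (not b) or c)),
    (lambda a, b, c: a == (b == c),
     lambda a, b, c: ((not a) or (not b) or c) and ((not a) or b or (not c))
     and (a or (not b) or (not c)) and (a or b or c)),
]


def recipes_hold() -> bool:
    """Compare truth tables of both sides of every recipe."""
    return all(lhs(*v) == rhs(*v)
               for lhs, rhs in RECIPES
               for v in product([False, True], repeat=3))


def original(x1, x2, x3, x4) -> bool:
    """Equation 3: ((x1 or x2) <=> x3) => (not x4)."""
    return implies((x1 or x2) == x3, not x4)


def tseitin_cnf(x1, x2, x3, x4, y1, y2, y3, y4) -> bool:
    """Equation 7, written clause by clause."""
    clauses = [
        y4,
        y4 or y2, y4 or (not y3), (not y4) or (not y2) or y3,
        y3 or x4, (not y3) or (not x4),
        (not y2) or (not y1) or x3, (not y2) or y1 or (not x3),
        y2 or (not y1) or (not x3), y2 or y1 or x3,
        y1 or (not x1), y1 or (not x2), (not y1) or x1 or x2,
    ]
    return all(clauses)


def equisatisfiable() -> bool:
    """For each x, equation 3 holds iff some y satisfies equation 7."""
    for xs in product([False, True], repeat=4):
        exists_y = any(tseitin_cnf(*xs, *ys)
                       for ys in product([False, True], repeat=4))
        if exists_y != original(*xs):
            return False
    return True
```

Note that the two formulas are not equal as functions of all eight variables. The transformed formula agrees with the original only once the auxiliary variables take the values dictated by their definitions, which is why the check quantifies over $y$.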

In the conjunctive normal form, each clause is a disjunction (logical $\text{OR}$) of variables and their complements. For neatness, we will write the complement $\lnot x$ of a variable as $\overline{x}$, so instead of writing:

\begin{equation}

\phi:= (x_{1} \lor x_{2} \lor x_{3}) \land (\lnot x_{1} \lor x_{2} \lor x_{3}) \land (x_{1} \lor \lnot x_{2} \lor x_{3}) \land (x_{1} \lor x_{2} \lor \lnot x_{3}), \tag{8}

\end{equation}

we write:

\begin{equation}\label{eq:example_cnf2}

\phi:= (x_{1} \lor x_{2} \lor x_{3}) \land (\overline{x}_{1} \lor x_{2} \lor x_{3}) \land (x_{1} \lor \overline{x}_{2} \lor x_{3}) \land (x_{1} \lor x_{2} \lor \overline{x}_{3}). \tag{9}

\end{equation}

We collectively refer to the variables and their complements as *literals* and so this formula contains literals $x_{1},\overline{x}_{1},x_{2},\overline{x}_{2}, x_{3}$ and $\overline{x}_{3}.$

When expressed in conjunctive normal form, we can characterise the problem in terms of the number of variables, the number of clauses and the size of those clauses. To facilitate this we introduce the following terminology:

- A clause that contains $k$ variables is known as a $k$*-clause*. When a clause contains only a single variable, it is known as a *unit clause*.
- When all the clauses contain $k$ variables, we refer to a problem as $k$*-SAT*. Using this nomenclature, we see that equation 9 is a 3-SAT problem.

SAT solvers are algorithms that establish whether a Boolean expression is satisfiable and they can be classified into two types. *Complete* algorithms guarantee to return $\text{SAT}$ or $\text{UNSAT}$ (although they may take an impractically long time to do so). *Incomplete* algorithms return $\text{SAT}$ or return $\text{UNKNOWN}$ (i.e. return without providing an answer). If they find a solution that satisfies the expression then all is good, but if they don't then we can draw no conclusions.

Here are two naïve algorithms that will help you understand the difference:

- An example of a complete algorithm is *exhaustive search*. If there are $V$ variables, we evaluate the expression with all $2^{V}$ combinations of literals and see if any combination returns $\text{true}$. Obviously, this will take an impractically long time when the number of variables is large, but nonetheless it is guaranteed to return either $\text{SAT}$ or $\text{UNSAT}$ eventually.
- An example of an incomplete algorithm is *Schöning's random walk*. This is a Monte Carlo solver in which we repeatedly (i) randomly choose an unsatisfied clause, (ii) choose one of the variables in this clause at random and set it to the opposite value. At each step we test if the formula is now satisfied and if so return $\text{SAT}$. After $3V$ iterations, we return $\text{UNKNOWN}$ if we have not found a satisfying configuration.
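The two naïve algorithms can be sketched in a few lines. The code below assumes the formula is in conjunctive normal form and stored as a list of clauses of signed integers (`-2` meaning $\overline{x}_{2}$), which is a representation chosen for this example. The random walk starts from a uniformly random assignment, a detail not spelled out above and treated here as an assumption. Returning `None` stands for $\text{UNSAT}$ in the first function and for $\text{UNKNOWN}$ in the second.

```python
import random
from itertools import product
from typing import Dict, List, Optional

Clause = List[int]


def satisfied(clause: Clause, assignment: Dict[int, bool]) -> bool:
    """A clause holds if any literal matches the assigned value."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def exhaustive_search(cnf: List[Clause], num_vars: int) -> Optional[Dict[int, bool]]:
    """Complete: tries all 2^V assignments; None means UNSAT."""
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if all(satisfied(c, assignment) for c in cnf):
            return assignment
    return None


def schoening_walk(cnf: List[Clause], num_vars: int,
                   rng: random.Random) -> Optional[Dict[int, bool]]:
    """Incomplete: at most 3V random flips; None means UNKNOWN."""
    # Assumption: start from a uniformly random assignment.
    assignment = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    for _ in range(3 * num_vars):
        unsatisfied = [c for c in cnf if not satisfied(c, assignment)]
        if not unsatisfied:
            return assignment
        # (i) pick an unsatisfied clause, (ii) flip one of its variables.
        var = abs(rng.choice(rng.choice(unsatisfied)))
        assignment[var] = not assignment[var]
    if all(satisfied(c, assignment) for c in cnf):
        return assignment
    return None
```

A `None` from `exhaustive_search` is a proof that the formula is unsatisfiable, whereas a `None` from `schoening_walk` says nothing either way. In practice the walk would be restarted many times from fresh random assignments.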

When a solver returns $\text{SAT}$ or $\text{UNSAT}$, it also returns a *certificate*, which can be used to check the result with a simpler algorithm. If the solver returns $\text{SAT}$, then the certificate will be an assignment of the variables that satisfies the formula. This can be checked by simply evaluating the formula with these values and confirming that it returns $\text{true}$. If it returns $\text{UNSAT}$ then the certificate will usually be a complex data structure that depends on the solver.

First, the bad news. The SAT problem is proven to be NP-complete and it follows that there is no known polynomial algorithm for establishing satisfiability in the general case. An important exception to this statement is 2-SAT for which a polynomial algorithm is known. However, for 3-SAT and above the problem is very difficult.

The good news is that modern SAT solvers are very efficient and can often solve problems involving tens of thousands of variables and millions of clauses in practice. In part II of this tutorial we will explain how these algorithms work.

Until now we have focused on the satisfiability problem in which we try to establish if there is at least one set of literals that makes a given statement evaluate to $\text{true}$. We note that there are also a number of closely related problems:

**UNSAT:** In the UNSAT problem we aim to show that there is no combination of literals that satisfies the formula. This is subtly different from SAT where algorithms return as soon as they find literals that show the formula is $\text{SAT}$, but may take exponential time if they cannot find a solution. For the UNSAT problem, the converse is true. The algorithm will return as soon as it establishes the formula is *not* $\text{UNSAT}$, but may take exponential time to show that it is $\text{UNSAT}$.

**Model counting:** In model counting (sometimes referred to as #SAT or #CSP), our goal is to count the number of distinct sets of literals that satisfy the formula.

**Max-SAT:** In Max-SAT, it may be the case that a formula is $\text{UNSAT}$ but we aim to find a solution that minimizes the number of clauses that are invalid.

**Weighted Max-SAT:** This is a variation of Max-SAT in which we pay a different penalty for each clause when it is invalid. We wish to find the solution that incurs the least penalty.
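For small formulas, model counting, Max-SAT and weighted Max-SAT can all be illustrated by brute-force enumeration. The sketch below assumes clauses stored as lists of signed integers (`-2` meaning $\overline{x}_{2}$), and the function names are hypothetical. It is meant only to pin down the definitions, not to suggest how practical solvers for these problems work.

```python
from itertools import product
from typing import Dict, Iterator, List, Sequence

Clause = List[int]


def clause_satisfied(clause: Clause, assignment: Dict[int, bool]) -> bool:
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)


def assignments(num_vars: int) -> Iterator[Dict[int, bool]]:
    """Yield every assignment of variables 1..num_vars."""
    for values in product([False, True], repeat=num_vars):
        yield dict(enumerate(values, start=1))


def count_models(cnf: List[Clause], num_vars: int) -> int:
    """Model counting: number of assignments satisfying every clause."""
    return sum(all(clause_satisfied(c, a) for c in cnf)
               for a in assignments(num_vars))


def min_violated(cnf: List[Clause], num_vars: int) -> int:
    """Max-SAT: smallest number of clauses left invalid by any assignment."""
    return min(sum(not clause_satisfied(c, a) for c in cnf)
               for a in assignments(num_vars))


def min_penalty(cnf: List[Clause], weights: Sequence[float],
                num_vars: int) -> float:
    """Weighted Max-SAT: smallest total penalty over invalid clauses."""
    return min(sum(w for c, w in zip(cnf, weights)
                   if not clause_satisfied(c, a))
               for a in assignments(num_vars))
```

For example, the 3-SAT formula in equation 9 has four models: each of its four clauses rules out exactly one of the eight assignments, and the four excluded assignments are distinct.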

For the rest of this tutorial, we'll concentrate on the main SAT problem, but we'll return to these related problems in part III of this tutorial when we discuss factor graph methods.

Most of the remainder of part I of this tutorial is devoted to discussing practical applications of satisfiability problems. Based on the discussion thus far, the reader would be forgiven for being sceptical about how this rather abstract problem can find real-world uses. We will attempt to convince you that it can! However, before we can do this, it will be helpful to review commonly-used *SAT constructions*.

SAT constructions can be thought of as subroutines for Boolean logic expressions. A common situation is that we have a set of variables $x_{1},x_{2},x_{3},\ldots$ and we want to enforce a collective constraint on their values. In this section, we'll discuss how to enforce the constraints that they are all the same, that exactly one of them is $\text{true}$, that at least $K$ of them are $\text{true}$, that fewer than $K$ of them are $\text{true}$, or that exactly $K$ of them are $\text{true}$.

To enforce the constraint that a set of variables $x_{1},x_{2}$ and $x_{3}$ are either all $\text{true}$ or all $\text{false}$ we simply take the logical $\text{OR}$ of these two cases so we have:

\begin{equation}

\mbox{Same}[x_{1},x_{2},x_{3}]:= (x_{1}\land x_{2}\land x_{3})\lor(\overline{x}_{1}\land \overline{x}_{2}\land \overline{x}_{3}). \tag{10}

\end{equation}

Note that this is not in conjunctive normal form (the $\text{AND}$ and $\text{OR}$s are the wrong way around) but could be converted via the Tseitin transformation.
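For this particular constraint there is also a direct route to conjunctive normal form that avoids auxiliary variables: a cycle of implications $x_{1}\Rightarrow x_{2}\Rightarrow x_{3}\Rightarrow x_{1}$, each of which is a 2-clause. This is an alternative to the Tseitin route mentioned above, not the tutorial's own construction. The sketch below builds these clauses as lists of signed integers (`-2` meaning $\overline{x}_{2}$) and checks them against equation 10 by enumeration. The function names are illustrative.

```python
from itertools import product
from typing import List


def same_cnf(variables: List[int]) -> List[List[int]]:
    """Cycle x1 => x2 => ... => xn => x1; each implication a => b is [-a, b]."""
    n = len(variables)
    return [[-variables[i], variables[(i + 1) % n]] for i in range(n)]


def agrees_with_equation_10() -> bool:
    """Compare the clause form with equation 10 on all eight assignments."""
    cnf = same_cnf([1, 2, 3])
    for x1, x2, x3 in product([False, True], repeat=3):
        assignment = {1: x1, 2: x2, 3: x3}
        cnf_value = all(
            any(assignment[abs(lit)] == (lit > 0) for lit in clause)
            for clause in cnf
        )
        direct = (x1 and x2 and x3) or (not x1 and not x2 and not x3)
        if cnf_value != direct:
            return False
    return True
```

The cycle works because a $\text{true}$ value anywhere propagates all the way round, so the only consistent assignments are all $\text{true}$ or all $\text{false}$.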

To enforce the constraint that only one of a set of variables $x_{1},x_{2}$ and $x_{3}$ is true and the other two are false, we add two constraints. First we ensure that at least one variable is $\text{true}$ by logically $\text{OR}$ing the variables together:

\begin{equation}

\phi_{1}:= x_{1}\lor x_{2} \lor x_{3}. \tag{11}

\end{equation}

Then we add a constraint that indicates that both members of any pair of variables cannot be simultaneously $\text{true}$:

\begin{equation}\label{eq:exactly_one}

\mbox{ExactlyOne}[x_{1},x_{2},x_{3}]:= \phi_{1}\land \lnot (x_{1}\land x_{2}) \land \lnot (x_{1}\land x_{3}) \land \lnot (x_{2}\land x_{3}) . \tag{12}

\end{equation}
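Equation 12 converts to conjunctive normal form directly, since by De Morgan's law each term $\lnot(x_{i}\land x_{j})$ is the 2-clause $(\overline{x}_{i}\lor\overline{x}_{j})$. The sketch below generates the clauses for any number of variables, using lists of signed integers (`-2` meaning $\overline{x}_{2}$) as an assumed representation. The names `exactly_one` and `holds` are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def exactly_one(variables: List[int]) -> List[List[int]]:
    """Clauses for ExactlyOne: one 'at least one' clause plus pairwise exclusions."""
    at_least_one = [list(variables)]
    at_most_one = [[-a, -b] for a, b in combinations(variables, 2)]
    return at_least_one + at_most_one


def holds(cnf: List[List[int]], assignment: Dict[int, bool]) -> bool:
    """True when every clause contains at least one true literal."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )
```

For three variables, `exactly_one([1, 2, 3])` gives `[[1, 2, 3], [-1, -2], [-1, -3], [-2, -3]]`. The number of pairwise clauses grows quadratically with the number of variables, which is acceptable for small sets like the colors of a vertex.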

There are many standard ways to enforce the constraint that at least $K$ of a set of variables are $\text{true}$. We'll present one method which is a simplified version of the sequential counter encoding.

The idea is straightforward. If we have $J$ variables $x_{1},x_{2},\ldots x_{J}$ and wish to test if $K$ or more are true, we construct a $J\times K$ matrix containing new binary variables $r_{j,k}$ (figures 2b and d). The $j^{th}$ row of the table contains a count of the number of $\text{true}$ elements we have seen in $x_{1\ldots j}$. So, if we have seen 3 variables that are $\text{true}$ in the first $j$ elements, the $j^{th}$ row will start with 3 $\text{true}$ elements and finish with $K-3$ $\text{false}$ elements.

If at least $K$ of the variables are $\text{true}$, then the bottom right variable $r_{J,K}$ in this table must be $\text{true}$. So in practice, we add a clause $(r_{J,K})$ stating that this bottom right element must be $\text{true}$ to enforce the constraint. When this element is $\text{false}$, the solver will search for a different solution where $\mathbf{x}$ does have at least $K$ $\text{true}$ elements or return $\text{UNSAT}$ if it cannot find one. By the same logic, to enforce the constraint that there are fewer than $K$ $\text{true}$ elements, we add a clause $(\overline{r}_{J,K})$ stating that the bottom right-hand variable is $\text{false}$.

The table constructed in figure 2d also shows us how to constrain the data to have exactly $K$ $\text{true}$ values. In this case, we expect the bottom right element to be $\text{true}$, but the element above this to be $\text{false}$, so we add the constraint $(r_{J,K}\land \overline{r}_{J-1,K})$. Figure 3 provides more detail about how we add extra clauses to the SAT formula that build these tables.
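To make the meaning of the counter table concrete, the sketch below fills it in directly for a known vector $\mathbf{x}$ rather than emitting clauses for a solver. It assumes the recurrence $r_{j,k} = r_{j-1,k} \lor (x_{j}\land r_{j-1,k-1})$, where a count of zero is always reached and nothing is reached before the first row. This recurrence is an assumption made for illustration, since the clauses from figure 3 are not reproduced here. The function names are hypothetical.

```python
from typing import List


def counter_table(x: List[bool], K: int) -> List[List[bool]]:
    """table[j][k] is true iff at least k+1 of x[0..j] are true (0-based)."""
    table: List[List[bool]] = []
    prev = [False] * K  # before any element is seen, no count is reached
    for xj in x:
        row = []
        for k in range(K):
            already = prev[k]                          # count k+1 reached earlier
            one_short = True if k == 0 else prev[k - 1]  # count k reached earlier
            row.append(already or (xj and one_short))
        table.append(row)
        prev = row
    return table


def at_least_k(x: List[bool], K: int) -> bool:
    """Read off the bottom right entry r_{J,K} (requires K >= 1, x non-empty)."""
    return counter_table(x, K)[-1][-1]
```

For `x = [True, False, True, True]` and `K = 3`, the rows are `[True, False, False]`, `[True, False, False]`, `[True, True, False]` and `[True, True, True]`, so each row starts with as many $\text{true}$ entries as have been counted so far, and the bottom right entry reports that at least three elements are $\text{true}$.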

Armed with these SAT constructions, we'll now present two complementary ways of thinking about SAT applications. The goal is to inspire the novice reader to see the applicability to their own problems. In the next section, we'll consider SAT in terms of constraint satisfaction problems and in the section following that, we'll discuss it in terms of model fitting.

The constraint satisfaction viewpoint considers combinatorial problems where there are a very large number of potential solutions, but most of those solutions are ruled out by some pre-specified constraints. To make this explicit, we'll consider the two examples of graph coloring and scheduling.

In the graph coloring problem (figure 4) we are given a graph consisting of a set of vertices and edges. We want to associate each vertex with a color in such a way that every pair of vertices connected by an edge have different colors. We might also want to know how many colors are necessary to find a valid solution. Note that this maps to our description of the generic constraint satisfaction problem; there are a large number of possible assignments of colors, but many of these are ruled out by the constraint that neighboring colors must be different.

To encode this as a SAT problem, we'll choose the number of colors $C$ to test. Then we create binary variables $x_{c,v}$ which will be $\text{true}$ if vertex $v$ is colored with color $c$. We then encode the constraint that each vertex has exactly one color using the construction $\mbox{ExactlyOne}[x_{\bullet, v}]$ from equation 12. We also add the constraints to ensure that the neighbours have different colors. Formally this means that $x_{c,v}\Rightarrow \lnot x_{c,v'}$ for every color $c$ and neighbour $v'$ of vertex $v$.

Having set up the problem, we run the SAT solver. If it returns $\text{UNSAT}$ this means we need more colors. If it returns $\text{SAT}$ with a concrete coloring, then we have an answer. We can find the minimum number of colors required by using binary search over the number of colors to find the point where the problem changes from $\text{SAT}$ to $\text{UNSAT}$.
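The encoding can be sketched end to end for a small graph. The code below builds the clauses as lists of signed integers, numbering $x_{c,v}$ as `v * C + c + 1`, which is a choice made for this example. To stay self-contained it uses a brute-force search in place of a real SAT solver, and a simple increasing search over $C$ rather than binary search, so it is only suitable for tiny graphs. All function names are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, List, Optional, Tuple


def coloring_cnf(num_vertices: int, edges: List[Tuple[int, int]],
                 C: int) -> List[List[int]]:
    """CNF that is satisfiable iff the graph can be colored with C colors."""
    def var(c: int, v: int) -> int:
        return v * C + c + 1  # x_{c,v} as a positive integer

    cnf: List[List[int]] = []
    for v in range(num_vertices):
        colors = [var(c, v) for c in range(C)]
        cnf.append(colors)                                     # at least one color
        cnf += [[-a, -b] for a, b in combinations(colors, 2)]  # at most one color
    for v, w in edges:
        for c in range(C):
            cnf.append([-var(c, v), -var(c, w)])  # x_{c,v} => not x_{c,w}
    return cnf


def brute_force_sat(cnf: List[List[int]],
                    num_vars: int) -> Optional[Dict[int, bool]]:
    """Stand-in for a SAT solver: returns an assignment, or None for UNSAT."""
    for values in product([False, True], repeat=num_vars):
        a = dict(enumerate(values, start=1))
        if all(any(a[abs(lit)] == (lit > 0) for lit in cl) for cl in cnf):
            return a
    return None


def min_colors(num_vertices: int, edges: List[Tuple[int, int]]) -> int:
    """Smallest C for which the encoding changes from UNSAT to SAT."""
    for C in range(1, num_vertices + 1):
        cnf = coloring_cnf(num_vertices, edges, C)
        if brute_force_sat(cnf, num_vertices * C) is not None:
            return C
    raise ValueError("graph must have at least one vertex")
```

For a triangle with one extra vertex attached, `min_colors(4, [(0, 1), (1, 2), (2, 0), (2, 3)])` reports three colors: the encoding is $\text{UNSAT}$ for $C=1$ and $C=2$ and becomes $\text{SAT}$ at $C=3$.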

The graph coloring problem is a rather artificial computer science example, but many real-world problems can similarly be expressed in terms of satisfiability. For example, consider scheduling courses in a university. We have a number of professors, each of whom teaches several different courses. We have a number of classrooms. We have a number of possible time-slots in each classroom. Finally, we have the students themselves, who are each signed up to a different subset of courses. We can use the SAT machinery to decide which course will be taught in which classroom and in what time-slot so that no clashes occur.

In practice, this is done by defining binary variables describing the known relations between the real world quantities. For example, we might have variables $x_{i,j}$ indicating that student $i$ takes course $j$. Then we encode the relevant constraints: no teacher can teach two classes simultaneously, no student can be in two classes simultaneously, no room can host more than one class simultaneously, and so on. The details are left as an exercise to the reader, but the similarity to the graph coloring problem is clear.

A second way to think about satisfiability is in terms of function fitting. Here, there is a clear connection to machine learning in which we fit complex functions (i.e., models) to training data. In fact there is a simple relationship between function-fitting and constraint satisfaction; when we fit a model, we can consider the parameters as unknown variables, and each training data/label pair represents a constraint on the values those parameters can take. In this section, we'll consider fitting binary neural networks and decision trees.

Binary neural networks are nets in which both the weights and activations are binary. Their performance can be surprisingly good, and their implementation can be extremely efficient. We'll show how to fit a binary neural network using SAT.

Following Mezard and Mora (2008) we consider a one layer binary network with $K$ neurons. The network takes a $J$ dimensional data example $\mathbf{x}$ with elements $x_{j}\in\{-1,1\}$ and computes a label $y\in\{-1,1\}$, using the function:

\begin{equation}\label{eq:one_layer}

y = \mbox{sign}\left[\sum_{j=1}^{J}\phi_{j}x_{j}\right] \tag{13}

\end{equation}

where the unknown model parameters $\phi_{j}$ are also binary and the function $\mbox{sign}[\bullet]$ returns -1 or 1 (figure 5) based on the sign of the summed terms.

Given a training set of $I$ data/label pairs $\{\mathbf{x}_{i},y_{i}\}$, our goal is to choose the model parameters $\phi_{j}$. We'll force all of the training examples to be classified correctly and so each training example/label pair can be considered a hard constraint on the parameters.

To encode these constraints, we create new variables $z_{i,j}$ that indicate whether the product $\phi_{j}x_{i,j}$ is positive. This happens when either both elements are positive or both are negative, so we can use the $\mbox{Same}[\phi_{j},x_{i,j}]$ construction. Note that for the rest of this discussion we'll revert to the convention that $x_{i,j}, y_{i}\in\{\text{false},\text{true}\}$.

The predicted label is the sign of the sum of the product terms, so it will be positive when more than half of the elements $z_{i,\bullet}$ evaluate to $\text{true}$. Likewise it will be negative if fewer than half are $\text{true}$. Hence, for the network to predict the correct output label $y_{i}$ we require

\begin{equation}

\left(y_{i} \land \mbox{AtLeastK}[\mathbf{z}_{i}]\right)\lor \left(\overline{y}_{i} \land \lnot\mbox{AtLeastK}[\mathbf{z}_{i}]\right) \tag{14}

\end{equation}

where $K=J/2$ and the vector $\mathbf{z}_{i}$ contains the product terms $z_{i,\bullet}$.

We have one such constraint for each training example and we logically $\text{AND}$ these together. When we run the SAT solver we are asking whether it is possible to find a set of parameters $\boldsymbol\phi$ for which all of these constraints are met.
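The structure of this fitting problem can be illustrated without a solver by enumerating all $2^{J}$ parameter vectors and keeping one that meets every training constraint. This brute-force search is a stand-in for the SAT solver and is only feasible for very small $J$. The sketch assumes $J$ is odd so that ties cannot occur, and uses the $\{\text{false},\text{true}\}$ convention. The names `predict` and `fit` are hypothetical.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def predict(phi: Sequence[bool], x: Sequence[bool]) -> bool:
    """Label is true when at least J/2 of the products phi_j * x_j are positive."""
    z = [p == xj for p, xj in zip(phi, x)]  # Same[phi_j, x_j]
    return sum(z) >= len(x) / 2             # AtLeastK with K = J/2


def fit(data: List[Sequence[bool]],
        labels: List[bool]) -> Optional[Tuple[bool, ...]]:
    """Return parameters classifying every example correctly, or None (UNSAT)."""
    J = len(data[0])
    for phi in product([False, True], repeat=J):
        # Each training pair is a hard constraint; all must hold together.
        if all(predict(phi, x) == y for x, y in zip(data, labels)):
            return phi
    return None
```

If two identical examples carry different labels, no parameter vector can satisfy both constraints and `fit` returns `None`, which corresponds to the solver returning $\text{UNSAT}$.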

It is easy to extend this example to multi-layer networks and to allow a certain amount of training error and we leave these extensions as exercises for the reader.

A binary decision tree also classifies data $\mathbf{x}_{i}$ into binary labels $y_{i}\in\{0,1\}$. Each data example $\mathbf{x}_{i}$ starts at the root. It then passes to either the left or right branch of the tree by testing one of its elements $x_{i,j}$. We'll consider binary data $x_{i,j}\in\{\text{false},\text{true}\}$ and adopt the convention that the data example passes left if $x_{i,j}$ is $\text{false}$ and right if $x_{i,j}$ is $\text{true}$. This procedure continues, testing a different value of $x_{i,j}$ at each node in the tree until we reach a leaf node at which a binary output label is assigned.

Learning the binary decision tree can also be framed as a satisfiability problem. From a training perspective, we would like to select the tree structure so that the training examples $\mathbf{x}_{i}$ that reach each leaf node have labels $y_{i}$ that are all $\text{true}$ or all $\text{false}$ and hence the training classification performance is 100%.

We'll develop a simplified version of the approach of Narodytska et al. (2018). Incredibly, we can learn both the structure of the tree and which features to branch on simultaneously. When we run the SAT solver for a given number $N$ of tree nodes, it will search over the space of all tree structures and branching features and return $\text{SAT}$ if it is possible to classify all the training examples correctly and provide a concrete example in which this is possible. By changing the number of tree nodes, we can find the point at which this problem turns from $\text{SAT}$ to $\text{UNSAT}$ and hence find the smallest possible tree that classifies the training data correctly.

We'll describe the SAT construction in two parts. First we'll describe how to encode the structure of the tree as a set of logical relations and then we'll discuss how to choose branching features that classify the data correctly.

**Tree structure:** We create $N$ binary variables $v_{n}$ that indicate if each of the $N$ nodes is a leaf. Similarly we create $N^{2}$ binary variables $l_{m,n}$ indicating if node $n$ is the left child of node $m$ and $N^{2}$ binary variables $r_{m,n}$ indicating if node $n$ is the right child of node $m$. Then we build Boolean expressions to enforce the following constraints:

- The root node (node 1) is not a leaf.
- Leaf nodes have neither left nor right children.
- Non-leaf nodes have exactly one left and one right child.
- Every node except the root is either the left child or the right child of exactly one other node.

Any set of variables $v_{n}$, $l_{m,n}$, $r_{m,n}$ that obeys these constraints forms a valid tree, and we can find such a configuration with a SAT solver. Two such trees are illustrated in figure 6.
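
To make the structural constraints concrete, here is a minimal sketch that checks them for a given assignment of the variables. It is a checker rather than an encoding: a SAT solver would search for values of $v_{n}$, $l_{m,n}$ and $r_{m,n}$, whereas this code only verifies the four constraints exactly as listed above. The function names and the use of index 0 for the root are assumptions of the example.

```python
def is_valid_structure(v, l, r):
    """Check the four listed structural constraints for candidate variable values.

    v[n]    : True if node n is a leaf
    l[m][n] : True if node n is the left child of node m
    r[m][n] : True if node n is the right child of node m
    Index 0 plays the role of the root (node 1 in the text).
    """
    N = len(v)
    # The root node is not a leaf.
    if v[0]:
        return False
    for m in range(N):
        n_left, n_right = sum(l[m]), sum(r[m])
        if v[m]:
            # Leaf nodes have neither left nor right children.
            if n_left or n_right:
                return False
        elif n_left != 1 or n_right != 1:
            # Non-leaf nodes have exactly one left and one right child.
            return False
    for n in range(1, N):
        # Every non-root node is a child of exactly one other node.
        n_parents = sum(l[m][n] + r[m][n] for m in range(N) if m != n)
        if n_parents != 1 or l[n][n] or r[n][n]:
            return False
    return True


def empty(N):
    """Helper: an N x N matrix of False values."""
    return [[False] * N for _ in range(N)]


# A three-node tree: the root with two leaf children.
v = [False, True, True]
l, r = empty(3), empty(3)
l[0][1] = True
r[0][2] = True
print(is_valid_structure(v, l, r))
```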

**Classification:** The second part of the construction ensures that the data examples $\mathbf{x}_{i}$ are classified correctly (figure 7). We introduce variables $f_{n,j}$ that indicate that node $n$ branches on feature $x_j$. We'll adopt the convention that when the branching variable $x_{j}$ is $\text{false}$ we will always branch left and when it is $\text{true}$ we will always branch right. In addition, we introduce variables $\hat{y}_{n}$ that will indicate if each leaf node classifies the data as $\text{true}$ or $\text{false}$ (their values will be arbitrary for non-leaf nodes).

We'll also create several book-keeping variables that are needed to set this up as a SAT problem, but are not required to run the model once trained. We introduce ancestor variables $a^{l}_{nj}$ at each node $n$ which are $\text{true}$ if we branched left on feature $j$ at node $n$ or at any of its ancestors and similarly $a^{r}_{nj}$ if we branched right on feature $j$ at this node or any of its ancestors. Finally, we introduce variables $e_{i,n}$ that indicate that training example $\mathbf{x}_{i}$ reached leaf node $n$. Notice that this happens when $x_{ij}$ is $\text{false}$ everywhere $a^{l}_{nj}$ is $\text{true}$ (i.e., we branched left somewhere above on these left ancestor features) and $x_{ij}$ is $\text{true}$ everywhere $a^{r}_{nj}$ is $\text{true}$ (i.e., we branched right somewhere above on these right ancestor features).

Using these variables, we build Boolean expressions to enforce the following constraints:

- Each non-leaf node must branch on exactly one feature.
- The left and right ancestor variables at the root are all $\text{false}$.
- The left ancestor variables $a^{l}_{n,\bullet}$ at a node $n$ are the same as the parent's, except that the entry for the parent's branching feature is also set to $\text{true}$ if we branched left to get here.
- The right ancestor variables $a^{r}_{n,\bullet}$ at a node $n$ are the same as the parent's, except that the entry for the parent's branching feature is also set to $\text{true}$ if we branched right to get here.
- You can't branch on a variable twice in any one path to a leaf.
- A data example reaches a leaf node if the left and right ancestors match its pattern of $\text{true}$ and $\text{false}$ elements as described above.
- All data that reach a given leaf node must have the same class.

Collectively, these constraints mean that all of the data must be correctly classified. When we logically $\text{AND}$ all of these constraints together, and find a solution that is $\text{SAT}$ we retrieve a tree that classifies the data 100% correctly. By reducing the number of nodes until the point that the problem becomes $\text{UNSAT}$, we can find the most efficient tree that partitions the training data exactly.
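
The role of the book-keeping variables may be easier to see in code. The sketch below takes a fixed, hand-written tree (so nothing is being solved), propagates the left and right ancestor variables from parent to child, and then tests which leaf each example reaches. The data layout (`parent`, `is_left_child`, `feature`) and all function names are assumptions of this example, as is the requirement that parents have smaller indices than their children.

```python
def ancestor_variables(parent, is_left_child, feature, num_features):
    """Copy the parent's ancestor variables and switch on the parent's branching feature."""
    N = len(parent)
    a_left = [[False] * num_features for _ in range(N)]   # all false at the root
    a_right = [[False] * num_features for _ in range(N)]
    for n in range(1, N):  # assumes every parent has a smaller index than its children
        p = parent[n]
        a_left[n] = list(a_left[p])
        a_right[n] = list(a_right[p])
        if is_left_child[n]:
            a_left[n][feature[p]] = True
        else:
            a_right[n][feature[p]] = True
    return a_left, a_right


def reaches(x, a_left_n, a_right_n):
    """x reaches the node if x_j is false wherever a_left is true and true wherever a_right is true."""
    return all(
        not (al and xj) and not (ar and not xj)
        for xj, al, ar in zip(x, a_left_n, a_right_n)
    )


def classifies_all(X, Y, leaves, leaf_label, a_left, a_right):
    """Every example must agree with the label of the leaf it reaches."""
    for x, y in zip(X, Y):
        for n in leaves:
            if reaches(x, a_left[n], a_right[n]) and leaf_label[n] != y:
                return False
    return True


# Hypothetical five-node tree: node 0 tests feature 0, node 2 tests feature 1.
parent = [None, 0, 0, 2, 2]
is_left_child = [None, True, False, True, False]
feature = [0, None, 1, None, None]
leaves = [1, 3, 4]
leaf_label = {1: False, 3: True, 4: False}

a_left, a_right = ancestor_variables(parent, is_left_child, feature, num_features=2)
X = [(False, False), (False, True), (True, False), (True, True)]
Y = [False, False, True, False]
print(classifies_all(X, Y, leaves, leaf_label, a_left, a_right))
```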

This concludes part I of this tutorial on SAT solvers. We've introduced the SAT problem, shown how to convert it to conjunctive normal form and presented some standard SAT constructions. Finally, we've described several different applications which we hope will inspire you to see SAT as a viable approach to your own problems.

In the next part of this tutorial, we'll delve into how SAT solvers actually work. In the final part, we'll elucidate the connections between SAT solving and factor graphs. For those readers who still harbor reservations about the applicability of a method based purely on Boolean variables, we'll also consider (i) how to convert non-Boolean variables to binary form and (ii) methods to work with them directly using SMT solvers.

If you want to try working with SAT algorithms, then this tutorial will help you get started. For an extremely comprehensive list of applications of satisfiability, consult SAT/SMT by example. This may give you more inspiration for how to re-frame your problems in terms of satisfiability.

]]>

However properties of distributions constructed with normalizing flows remain less well understood theoretically. One important property is that of *tail behavior*. We can think about a distribution as having two regions: the *typical set* and the *tails* which are illustrated in Figure 1. The typical set is what is most often considered; it's the area where the distribution has a significant amount of density. That is, if you draw samples or have a set of training examples they're generally from the typical set of the distribution. How accurately a model captures the typical set is important when we want to use distributions to, for instance, generate data which looks similar to the training data. Many papers show figures like Figure 2 which showcase how well a model matches the target distribution in regions where there's lots of density.

The tails of the distribution are basically everything else and, when working on an unbounded domain (like $\mathbb{R}^n$), studying them corresponds to asking how the probability density behaves as you go to infinity. We know that the probability density of a continuous distribution on an unbounded domain goes to zero in the limit, but the rate at which it goes to zero can vary significantly between different distributions. Intuitively, tail behaviour indicates how likely extreme events are and this behaviour can be very important in practice. For instance, in financial modelling applications like risk estimation, return prediction and actuarial modelling, tail behaviour plays a key role.

This blog post discusses the tail behaviour of normalizing flows and presents a theoretical analysis showing that some popular normalizing flow architectures are actually unable to estimate tail behaviour. Experiments show that this is indeed a problem in practice and a remedy is proposed for the case of estimating heavy-tailed distributions. This post will omit the proofs and other formalities and instead will aim at providing a high level overview of the results. For readers interested in the details we refer them to the full paper which was recently presented at ICML 2020.

Let $\mathbf{X} \in \mathbb{R}^D$ be a random variable with a known and tractable probability density function $f_\mathbf{X} : \mathbb{R}^D \to \mathbb{R}$. Let $\mathbf{T}$ be an invertible function and $\mathbf{X} = \mathbf{T}(\mathbf{Y})$. Then using the change of variables formula, one can compute the probability density function of the random variable $\mathbf{Y}$:

\begin{align}

f_\mathbf{Y}(\mathbf{y}) & = f_\mathbf{X}(\mathbf{T}(\mathbf{y})) \left| \det \textrm{D}\mathbf{T}(\mathbf{y}) \right| , \tag{1}

\end{align}

where $\textrm{D}\mathbf{T}(\mathbf{y}) = \frac{\partial \mathbf{T}} {\partial \mathbf{y}}$ is the Jacobian of $\mathbf{T}$. Normalizing Flows are constructed by defining invertible, differentiable functions $\mathbf{T}$ which can be thought of as transforming the complex distribution of data into the simple base distribution, or "normalizing" it. The paper attempts to characterize the tail behaviour of $f_\mathbf{Y}$ in terms of $f_\mathbf{X}$ and properties of the transformation $\mathbf{T}$.
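
As a small sanity check of equation 1, consider a 1D affine map. If $X$ is standard normal and $X = T(Y) = (Y - m)/s$, the formula should recover the density of a normal distribution with mean $m$ and standard deviation $s$. The sketch below verifies this numerically; the function names and the particular values of $m$ and $s$ are assumptions chosen for illustration.

```python
import math


def standard_normal_pdf(x):
    """Base density f_X."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def flow_density(y, shift, scale):
    """f_Y(y) = f_X(T(y)) |dT/dy| for the affine map T(y) = (y - shift) / scale."""
    x = (y - shift) / scale      # T(y)
    jacobian = 1.0 / scale       # dT/dy, a scalar in 1D
    return standard_normal_pdf(x) * abs(jacobian)


def normal_pdf(y, mean, std):
    """Closed-form density of N(mean, std^2), used only for comparison."""
    return math.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


y, m, s = 1.3, 2.0, 0.5
print(flow_density(y, m, s), normal_pdf(y, m, s))
```

In higher dimensions the scalar derivative is replaced by the determinant of the Jacobian matrix, which is why flow architectures are usually designed so that this determinant is cheap to compute.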

Before we can do that though we need to formally define what we mean by tail behaviour. The basis for characterizing tail behaviour in 1D was provided in a paper by Emanuel Parzen. Parzen argued that tail behaviour could be characterized in terms of the *density-quantile function*. If $f$ is a probability density and $F : \mathbb{R} \to [0,1]$ is its cumulative distribution function then the quantile function is the inverse, *i.e.*, $Q = F^{-1}$ where $Q : [0,1] \to \mathbb{R}$. The density-quantile function $fQ : [0,1] \to \mathbb{R}$ is then the composition of the density and the quantile function $fQ(u) = f(Q(u))$ and is well defined for square integrable densities. Parzen suggested that the limiting behaviour of the density-quantile function captured the differences in the tail behaviour of distributions. In particular, for many distributions

\begin{equation}

\lim_{u\rightarrow1^-} \frac{fQ(u)}{(1-u)^{\alpha}} \tag{2}

\end{equation}

converges for some $\alpha > 0$. In other words, the density-quantile function asymptotically behaves like $(1-u)^{\alpha}$ and we denote this as $fQ(u) \sim (1-u)^{\alpha}$. (Note that here we consider the right tail, i.e., $u \to 1^-$, but we could just as easily consider the left tail, i.e., $u \to 0^+$.) We call the parameter $\alpha$ the *tail exponent* and Parzen noted that it characterizes how heavy a distribution is with larger values having heavier tails. Values of $\alpha$ between $0$ and $1$ are called light tailed and include things like bounded distributions. A value of $\alpha=1$ corresponds to some well known distributions like the Gaussian or Exponential distributions. Distributions with $\alpha > 1$ are called heavy tailed, *e.g.*, a Cauchy or student-T. More fine-grained characterizations of tail behaviour are possible in some cases but we won't go into those here.
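
To get a feel for the tail exponent, it can be estimated numerically as the slope of $\log fQ(u)$ against $\log(1-u)$ near $u=1$. The sketch below does this for two distributions whose quantile functions are available in closed form: the unit-rate Exponential, for which $fQ(u) = 1-u$, and the standard Cauchy. The two-point slope estimator and the function names are assumptions of this example, not the estimator used in the paper.

```python
import math


def fq_exponential(u):
    """Density-quantile function of Exponential(1): f(Q(u)) with Q(u) = -log(1 - u)."""
    x = -math.log(1.0 - u)
    return math.exp(-x)


def fq_cauchy(u):
    """Density-quantile function of the standard Cauchy: Q(u) = tan(pi (u - 1/2))."""
    x = math.tan(math.pi * (u - 0.5))
    return 1.0 / (math.pi * (1.0 + x * x))


def tail_exponent(fq, u1=1.0 - 1e-4, u2=1.0 - 1e-6):
    """Slope of log fQ(u) versus log(1 - u) between two points close to u = 1."""
    return math.log(fq(u1) / fq(u2)) / math.log((1.0 - u1) / (1.0 - u2))


print(tail_exponent(fq_exponential))  # close to 1
print(tail_exponent(fq_cauchy))       # close to 2, i.e. heavy tailed
```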

Now, given the above and two 1D random variables, $\mathbf{Y}$ and $\mathbf{X}$ with tail exponents $\alpha_\mathbf{Y}$ and $\alpha_\mathbf{X}$, we can make a statement about the transformation $\mathbf{T}$ that maps between them. First, the transformation is given by $T(\mathbf{x}) = Q_\mathbf{Y}( F_\mathbf{X}( \mathbf{x} ) )$ where $F_\mathbf{X}$ denotes the CDF of $\mathbf{X}$ and $Q_\mathbf{Y}$ denotes the quantile function (i.e., the inverse CDF) of $\mathbf{Y}$. Second, we can then show that the derivative of this transformation is given by

\begin{equation}

T'(\mathbf{x}) = \frac{fQ_\mathbf{X}(u)}{fQ_\mathbf{Y}(u)} \tag{3}

\end{equation}

where $u=F_\mathbf{X}(\mathbf{x})$ and $fQ_\mathbf{X}$ and $fQ_\mathbf{Y}$ are the density-quantile functions of $\mathbf{X}$ and $\mathbf{Y}$ respectively.

Now, given our characterization of tail behaviour we get that

\begin{equation}

T'(\mathbf{x}) \sim \frac{(1-u)^{\alpha_{\mathbf{X}}}}{(1-u)^{\alpha_{\mathbf{Y}}}} = (1-u)^{\alpha_{\mathbf{X}}-\alpha_{\mathbf{Y}}} \tag{4}

\end{equation}

and now we come to a key result. If $\alpha_{\mathbf{X}} < \alpha_{\mathbf{Y}}$ then, as $u \to 1$ we get that $T'(\mathbf{x}) \to \infty$. That is, if the tails of the target distribution of $\mathbf{Y}$ are heavier than those of the source distribution $\mathbf{X}$ then the slope of the transformation must be unbounded. Conversely, if the slope of $T(\mathbf{x})$ is bounded (i.e., $T(\mathbf{x})$ is Lipschitz) then the tail exponent of $\mathbf{Y}$ will be the same as $\mathbf{X}$, i.e., $\alpha_\mathbf{Y} = \alpha_\mathbf{X}$.
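
A concrete pair of distributions makes this result tangible. Take $\mathbf{X}$ to be a unit-rate Exponential ($\alpha_\mathbf{X}=1$) and $\mathbf{Y}$ a standard Cauchy ($\alpha_\mathbf{Y}=2$). The sketch below builds $T = Q_\mathbf{Y} \circ F_\mathbf{X}$, compares a finite-difference estimate of its slope with the ratio of density-quantile functions in equation 3, and shows that the slope keeps growing as $x$ increases. The choice of distributions and the function names are assumptions made for illustration.

```python
import math


def T(x):
    """T(x) = Q_Y(F_X(x)) for X ~ Exponential(1) and Y ~ standard Cauchy."""
    u = 1.0 - math.exp(-x)                   # F_X(x)
    return math.tan(math.pi * (u - 0.5))     # Q_Y(u)


def slope_from_density_quantiles(x):
    """Right-hand side of equation 3: fQ_X(u) / fQ_Y(u) with u = F_X(x)."""
    u = 1.0 - math.exp(-x)
    fq_x = 1.0 - u                                  # Exponential(1)
    fq_y = math.sin(math.pi * u) ** 2 / math.pi     # standard Cauchy
    return fq_x / fq_y


def slope_numeric(x, h=1e-5):
    """Central finite-difference estimate of T'(x)."""
    return (T(x + h) - T(x - h)) / (2.0 * h)


for x in (2.0, 5.0, 8.0):
    print(x, slope_numeric(x), slope_from_density_quantiles(x))
```

The two columns agree closely, and both grow without bound as $x$ increases, so this particular $T$ is not Lipschitz.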

The above is an elegant characterization of tail behaviour and its relationship to the transformations between distributions but it only applies to distributions in 1D. To generalize it to higher dimensional distributions, we consider the tail behaviour of the norm of a random variable, i.e., $\Vert \mathbf{X} \Vert$. Then the degree of heaviness of $\mathbf{X}$ can be characterized by the degree of heaviness of the distribution of the norm. Using this characterization we can then prove an analog of the above.

**Theorem 3** *Let $\mathbf{X}$ be a random variable with density function $f_\mathbf{X}$ that is light-tailed and $\mathbf{Y}$ be a target random variable with density function $f_\mathbf{Y}$ that is heavy-tailed. Let $T$ be such that $\mathbf{Y} = T(\mathbf{X})$, then $T$ cannot be a Lipschitz function.*

So what does this all mean for normalizing flows which are attempting to transform a Gaussian distribution into some complex data distribution? The results show that a Lipschitz transformation of a distribution cannot make it heavier tailed. Unfortunately, many commonly implemented normalizing flows are actually Lipschitz. The transformations used in RealNVP and Glow are known as affine coupling layers and they have the form

\begin{equation}

T(\mathbf{x}) = \left(\mathbf{x}^{(A)},\sigma(\mathbf{x}^{(A)}) \odot \mathbf{x}^{(B)} + \mu(\mathbf{x}^{(A)})\right) \tag{5}

\end{equation}

where $\mathbf{x} = (\mathbf{x}^{(A)},\mathbf{x}^{(B)})$ is a disjoint partitioning of the dimensions, $\odot$ is element-wise multiplication and $\sigma(\cdot)$ and $\mu(\cdot)$ are arbitrary functions. For transformations of this form, we can then prove the following:

**Theorem 4** *Let $p$ be a light-tailed density and $T(\cdot)$ be a triangular transformation such that $T_j(x_j; ~x_{<j}) = \sigma_{j}\cdot x_j + \mu_j$. If $\sigma_j(x_{<j})$ is bounded above and $\mu_j(x_{<j})$ is Lipschitz continuous then the distribution resulting from transforming $p$ by $T$ is also light-tailed.*

The RealNVP paper uses $\sigma(\cdot) = \exp(NN(\cdot))$ and $\mu(\cdot) = NN(\cdot)$ where $NN(\cdot)$ is a neural network with ReLU activation functions. The translation function $\mu(\cdot)$ is hence Lipschitz since a neural network with ReLU activation is Lipschitz. However, the scale function $\sigma(\cdot)$ is, at first glance, not bounded because the exponential function is unbounded. In practice, though, this was implemented as $\sigma(\cdot) = \exp(c\tanh(NN(\cdot)))$ for a scalar $c$. This means that, as originally implemented, $\sigma(\cdot)$ *is* bounded above, i.e., $\sigma(\cdot) < \exp(c)$. Similarly, Glow uses $\sigma(\cdot) = \mathsf{sigmoid}(NN(\cdot))$ which is also clearly bounded above.
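
The effect of the $\tanh$ can be seen in a toy scalar version of an affine coupling layer. In the sketch below, `toy_net` is a hypothetical stand-in for $NN(\cdot)$ (a fixed affine function rather than a trained network), and the same function is reused for the scale and the translation purely to keep the example short. It is not the RealNVP implementation, only an illustration of why the scale cannot exceed $\exp(c)$.

```python
import math


def toy_net(x_a):
    """Hypothetical stand-in for NN(.); any real-valued function would do here."""
    return 3.0 * x_a - 1.0


def scale(x_a, c=2.0):
    """sigma(x_a) = exp(c * tanh(NN(x_a))), which lies between exp(-c) and exp(c)."""
    return math.exp(c * math.tanh(toy_net(x_a)))


def coupling_forward(x_a, x_b, c=2.0):
    """Leave x_a unchanged and apply an affine map to x_b that depends on x_a."""
    return x_a, scale(x_a, c) * x_b + toy_net(x_a)


def coupling_inverse(y_a, y_b, c=2.0):
    """Invert the layer: x_a is available unchanged, so the affine map can be undone."""
    return y_a, (y_b - toy_net(y_a)) / scale(y_a, c)


y_a, y_b = coupling_forward(0.7, -1.2)
print(coupling_inverse(y_a, y_b))          # recovers (0.7, -1.2) up to rounding
print(scale(1e6) <= math.exp(2.0))         # the scale saturates at exp(c)
```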

Hence, RealNVP and Glow are unable to represent heavier tailed distributions. Not all architectures have this property though and we point out a few that can actually change tail behaviour, for instance SOS Flows.

To address this limitation of common architectures, we proposed *Tail Adaptive Flows* (TAF), which use a parametric base distribution that is capable of representing heavier tails. In particular, we proposed the use of the student-T distribution as a base distribution with learnable degree-of-freedom parameters. With TAF the tail behaviour can be learned in the base distribution while the transformation captures the behaviour of the typical set of the distribution.
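
To illustrate why the choice of base distribution matters, the sketch below evaluates the log density of a 1D student-T distribution and compares it with a standard Gaussian far from the origin. Here the degrees of freedom `nu` is an ordinary function argument; in a tail adaptive flow it would be a parameter learned together with the rest of the model. The function names are assumptions of this example.

```python
import math


def student_t_logpdf(x, nu):
    """Log density of a standard 1D student-T distribution with nu degrees of freedom."""
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(nu * math.pi)
        - 0.5 * (nu + 1.0) * math.log1p(x * x / nu)
    )


def normal_logpdf(x):
    """Log density of the standard Gaussian."""
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


# Far from the origin the student-T assigns much more log density than the Gaussian.
print(student_t_logpdf(10.0, nu=2.0), normal_logpdf(10.0))
# For very large nu the student-T is close to the Gaussian.
print(student_t_logpdf(1.0, nu=1e6), normal_logpdf(1.0))
```

Small values of `nu` give heavy tails, while large values approach the Gaussian, so a single learnable parameter spans a range of tail behaviour.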

We also explored these limitations experimentally. First we created a synthetic dataset using a target distribution with heavy tails. After fitting with a normalizing flow, we can measure its tail behaviour. Measuring tail behaviour can be done by estimating the density-quantile function and finding the value of $\alpha$ such that $(1-u)^{\alpha}$ approximates it near $u=1$. Our experimental results confirmed the theory. In particular, fitting a normalizing flow with a RealNVP or Glow style affine coupling layer was fundamentally unable to change the tail exponent, even as more depth was added. Figure 4 shows an attempt to fit a model based on RealNVP-style affine coupling layers to a heavy tailed distribution (student T). No matter how many blocks of affine coupling layers are used, it is unable to capture the structure of the distribution and the measured tail exponents remain the same as the base distribution.

However, when using a tail adaptive flow the tail behaviour can be readily learned. Figure 5 shows the results of fitting a tail adaptive flow on the same target as above but with 5 blocks. This isn't entirely surprising as tail adaptive flows use a student T base distribution. However, SOS Flows is also able to learn the tail behaviour as predicted by the theory. This is shown in Figure 6.

We also evaluated TAF on a number of other datasets. For instance, Figure 7 shows tail adaptive flows successfully fitting the tails of Neal's Funnel, an important distribution which has heavier tails and exhibits some challenging geometry.

In terms of log likelihood on a test set, our experiments show that using TAF is effectively equivalent to not using TAF. However, this shouldn't be too surprising.

We know that normalizing flows are able to capture the distribution around the typical set and this is where most samples, even in the test set, are likely to be. Put another way, capturing tail behaviour is about understanding how frequently rare events happen and by definition it's unlikely that a test set will have many of these events.

This paper explored the behaviour of the tails of commonly used normalizing flows and showed that two of the most popular normalizing flow models are unable to learn tail behaviour that is heavier than that of the base distribution. It also showed that by changing the base distribution we are able to restore the ability of these models to capture tail behaviour. Alternatively, other normalizing flow models like SOS Flows are also able to learn tail behaviour.

So does any of this matter in practice? If the problem you're working on is sensitive to tail behaviour then absolutely and our work suggests that using an adaptive base distribution with a range of tail behaviour is a simple and effective way to ensure that your flow can capture tail behaviour. If your problem isn't sensitive to tail behaviour then perhaps less so. However, it is interesting to note that the seemingly minor detail of adding a $\tanh(\cdot)$ or replacing $\exp(\cdot)$ with sigmoid could significantly change the expressiveness of the overall model. These details have typically been motivated by empirically observed training instabilities. However our work connects these details to fundamental properties of the estimated distributions, perhaps suggesting alternative explanations for why they were empirically necessary.

]]>That’s something that the organizers of the recent AI4Good Lab Industry Night were able to recreate – albeit in a virtual world - thanks to personalized avatars, virtual meeting rooms and real-time chats.

The purpose of the event was to give the all-women students of the AI4Good Lab a stronger sense of research groups and companies that work in the AI space and an array of initiatives that they can get involved in. It also gave the partners an opportunity to provide more detailed information about themselves.

Borealis AI’s all-female team, along with other partners, including CIFAR, IVADO, Amii, DeepMind and Accenture among others, participated in the AI4Good industry event, chatting with delegates about internships, fellowships, and offering advice on how to navigate the job market in the AI space. The team shared their thoughts on a wide range of topical issues, including ethical AI. They also provided information about AI research and products at Borealis AI as well as various internship and job opportunities with the team.

The AI4Good team prepared avatars for everyone, using photos of the participants. The delegates were able to virtually walk and stand with each other while they chatted. Borealis AI’s room, designed by visual designer, April Cooper, brought some nature and light to the room with the addition of a virtual tree!

Thanks to Maya Marcus-Sells, Executive Director of AI4Good Lab, and her colleague, Yosra Kazemi, for pulling the Industry Night together and giving us a much-needed chance to chat and further build the women in AI community.

If you would like a peek inside this year’s virtual Industry Night, a tour of the 3D booths, a look at Maya’s, Eirene’s, and April’s avatars enjoying the virtual shadow of the Borealis AI tree, or just want to virtually “feel” and “smell” the breeze through the branches of the Borealis AI tree, we’ve got you covered!

Click on the gallery below to see pics from the event.

]]>

The partnership is part of Borealis AI’s ongoing commitment to advancing AI in Canada and fostering diversity and inclusion in the field. Borealis AI will be providing mentorship, career advice, and online workshops for the 30 women selected to participate in this year’s program as well as ongoing support for the AI4Good team.

The AI4Good Lab was founded in 2017 in Montreal by Angelique Mannella, Global Alliance Lead at Amazon Web Services, and Dr. Doina Precup, researcher at Mila, McGill University, and DeepMind. It's the first program of its kind to combine rigorous teaching in Artificial Intelligence (AI) with tackling diversity and inclusion in research and development, while promoting AI as a tool for social good.

The 2020 AI4Good Lab cohort marks the 4th year of training the next generation of diverse AI leaders, with 110 participants and alumni from across Canada. This year the lab will be held virtually from June 8th to July 28th due to the COVID-19 pandemic.

The 7-week program consists of two parts:

- intensive machine learning training through workshops and lectures by AI experts from academia and industry;
- a prototype development phase, during which the participants will work on AI products to tackle a social good problem of their choosing.

Borealis AI will actively be involved in both parts of the program with presentations and mentorship for the students as well as advice on how to navigate the job market in the AI space.

Speaking about the Lab, co-founder Angelique Mannella explained:

“Creating more diversity in technical environments is hard. While progress is being made, the only way to make sustainable, lasting change is to take an ecosystem approach, where organizations work together to surface new ways of working, new ways of knowledge sharing, and new ways of nurturing talent.”

Dr. Eirene Seiradaki, Director of Research Partnerships at Borealis AI, said:

“Increasing the number of women working in technology and science is a priority for Borealis AI. We are delighted to support the AI4Good Lab program. We hope this program will provide the participants with new skills and tools to help them develop their careers in this exciting and evolving industry.”

Maya Marcus-Sells, Executive Director at the AI4Good Lab, said:

“Our partnership with Borealis AI helps bring women across Canada into the fast-moving tech ecosystem. Through mentorship, speaking, and career guidance, Borealis AI will provide the participants of the AI4Good Lab with insights and networks into AI careers that will help them grow into the AI leaders of tomorrow, ultimately leading towards a more diverse and representative AI talent pool.”

“Borealis AI's commitment from day one to foster an environment for gender diverse talent and to work with us to create new opportunities for knowledge sharing and mentorship has been invaluable to our participants and alumni and also to our ability to amplify our collective impact across Canada,” added Mannella.

“There is a lot more to be done in this area,” said Seiradaki. “Borealis AI will continue to partner with universities, government, and industry to further narrow the gender gap and improve the talent pool through a larger representation of women in AI.”

Borealis AI is a world-class AI Research center backed by RBC. Recognized for scientific excellence, Borealis AI uses the latest in machine learning capabilities to solve challenging problems in the financial industry. Led by award-winning inventor and entrepreneur, Foteini Agrafioti, and with top North American scientists and engineers, Borealis AI is at the core of the bank’s innovation strategy and benefits from RBC’s scale, data, and trusted brand.

With a focus on responsible AI, natural language processing, and reinforcement learning, Borealis AI is committed to building solutions using machine learning and artificial intelligence that will transform the way individuals manage their finances and their futures. For more information please visit www.borealisai.com.

]]>This talk will provide an update on recent progress in this area. It will start out with novel state-of-the-art methods for the self-play setting. Next, it will introduce the Zero-Shot Coordination setting as a new frontier for multi-agent research. Finally it will introduce Other-Play as a novel learning algorithm, which allows agents to coordinate ad-hoc and biases learning towards more human compatible policies.

]]>However, other optimization problems are much more challenging. Consider *hyperparameter search* in a neural network. Before we train the network, we must choose the architecture, optimization algorithm, and cost function. These choices are encoded numerically as a vector of hyperparameters. To get the best performance, we must find the hyperparameters for which the resulting trained network is best. This hyperparameter optimization problem has many challenging characteristics:

**Evaluation cost:** Evaluating the function that we wish to maximize (i.e., the network performance) in hyperparameter search is very expensive; we have to train the neural network model and then run it on the validation set to measure the network performance for a given set of hyperparameters.

**Multiple local optima:** The function is not convex and there may be many combinations of hyperparameters that are locally optimal.

**No derivatives:** We do not have access to the gradient of the function with respect to the hyperparameters; there is no easy and inexpensive way to propagate gradients back through the model training / validation process.

**Variable types:** There is a mixture of discrete variables (e.g., the number of layers, number of units per layer and type of non-linearity) and continuous variables (e.g., the learning rate and regularization weights).

**Conditional variables:** The existence of some variables depends on the settings of others. For example, the number of units in layer $3$ is only relevant if we already chose $\geq 3$ layers.

**Noise:** The function may return different values for the same input hyperparameter set. The neural network training process relies on stochastic gradient descent and so we typically don't get exactly the same result every time.

Bayesian optimization is a framework that can deal with optimization problems that have all of these challenges. The core idea is to build a model of the entire function that we are optimizing. This model includes both our current estimate of that function and the uncertainty around that estimate. By considering this model, we can choose where next to sample the function. Then we update the model based on the observed sample. This process continues until we are sufficiently certain of where the best point on the function is.

Let's now put aside the specific example of hyperparameter search and consider Bayesian optimization in its more general form. Bayesian optimization addresses problems where the aim is to find the parameters $\hat{\mathbf{x}}$ that maximize a function $\mbox{f}[\mathbf{x}]$ over some domain $\mathcal{X}$ consisting of finite lower and upper bounds on every variable:

\begin{equation}

\hat{\mathbf{x}} = \mathop{\rm argmax}_{\mathbf{x} \in \mathcal{X}} \left[ \mbox{f}[\mathbf{x}]\right]. \tag{1}

\label{eq:global-opt}

\end{equation}

At iteration $t$, the algorithm can learn about the function by choosing parameters $\mathbf{x}_t$ and receiving the corresponding function value $f[\mathbf{x}_t]$. The goal of Bayesian optimization is to find the maximum point on the function using the minimum number of function evaluations. More formally, we want to minimize the number of iterations $t$ before we can guarantee that we find parameters $\hat{\mathbf{x}}$ such that $f[\hat{\mathbf{x}}]$ is less than $\epsilon$ from the true maximum $\hat{f}$.

We'll assume for now that all parameters are continuous, that their existences are not conditional on one another, and that the cost function is deterministic so that it always returns the same value for the same input. We'll return to these complications later in this document. To help understand the basic optimization problem let's consider some simple strategies:

**Grid Search:** One obvious approach is to quantize each dimension of $\mathbf{x}$ to form an input grid and then evaluate each point in the grid (figure 1). This is simple and easily parallelizable, but suffers from the curse of dimensionality; the size of the grid grows exponentially in the number of dimensions.

**Random Search:** Another strategy is to specify probability distributions for each dimension of $\mathbf{x}$ and then randomly sample from these distributions (Bergstra and Bengio, 2012). This addresses a subtle inefficiency of grid search that occurs when one of the parameters has very little effect on the function output (see figure 1 for details). Random search is also simple and parallelizable. However, if we are unlucky, we may either (i) make many similar observations that provide redundant information, or (ii) never sample close to the global maximum.

**Sequential search strategies:** One obvious deficiency of both grid search and random search is that they do not take into account previous measurements. If the measurements are made sequentially then we could use the previous results to decide where it might be strategically best to sample next (figure 2). One idea is that we could *explore* areas where there are few samples so that we are less likely to miss the global maximum entirely. Another approach could *exploit* what we have learned so far by sampling more in relatively promising areas. An optimal strategy would recognize that there is a trade-off between *exploration* and *exploitation* and combine both ideas.
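
The two non-sequential strategies above, grid search and random search, can be sketched in a few lines. The objective below is a hypothetical toy function that is cheap to evaluate; in hyperparameter search each call would involve training and validating a network. The function names, the uniform sampling distribution and the fixed random seed are assumptions of this example.

```python
import itertools
import random


def objective(x):
    """Hypothetical toy function with its maximum of 1.0 at (0.3, 0.7)."""
    return 1.0 - (x[0] - 0.3) ** 2 - (x[1] - 0.7) ** 2


def grid_search(f, bounds, points_per_dim):
    """Evaluate f on a regular grid (points_per_dim >= 2) and return the best point."""
    axes = [
        [lo + (hi - lo) * i / (points_per_dim - 1) for i in range(points_per_dim)]
        for lo, hi in bounds
    ]
    # The number of evaluations is points_per_dim ** len(bounds).
    return max(itertools.product(*axes), key=f)


def random_search(f, bounds, num_samples, seed=0):
    """Evaluate f at uniformly sampled points and return the best one."""
    rng = random.Random(seed)
    samples = [
        tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in range(num_samples)
    ]
    return max(samples, key=f)


bounds = [(0.0, 1.0), (0.0, 1.0)]
best_grid = grid_search(objective, bounds, points_per_dim=11)
best_random = random_search(objective, bounds, num_samples=200)
print(best_grid, objective(best_grid))
print(best_random, objective(best_random))
```

Neither function uses earlier evaluations to decide where to look next, which is exactly the deficiency that sequential strategies address.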

Bayesian optimization is a sequential search framework that incorporates both exploration and exploitation and can be considerably more efficient than either grid search or random search. It can easily be motivated from figure 2; the goal is to build a probabilistic model of the underlying function that will know both (i) that $\mathbf{x}_{1}$ is a good place to sample because the function will probably return a high value here and (ii) that $\mathbf{x}_{2}$ is a good place to sample because the uncertainty here is very large.

A Bayesian optimization algorithm has two main components:

**A probabilistic model of the function:** Bayesian optimization starts with an initial probability distribution (the prior) over the function $f[\bullet]$ to be optimized. Usually this just reflects the fact that we are extremely uncertain about what the function is. With each observation of the function $(\mathbf{x}_t, f[\mathbf{x}_t])$, we learn more and the distribution over possible functions (now called the posterior) becomes narrower.

**An acquisition function:** This is computed from the posterior distribution over the function and is defined on the same domain. The acquisition indicates the desirability of sampling each point next and depending on how it is defined it can favor exploration or exploitation.

In the next two sections, we consider each of these components in turn.

There are several ways to model the function and its uncertainty, but the most popular approach is to use Gaussian processes (GPs). We will present other models (Bernoulli-Beta bandits, random forests, and Tree-Parzen estimators) later in this document.

A Gaussian Process is a collection of random variables, where any finite number of these are jointly normally distributed. It is defined by (i) a mean function $\mbox{m}[\mathbf{x}]$ and (ii) a covariance function $k[\mathbf{x},\mathbf{x}']$ that returns the similarity between two points. When we model our function as $\mbox{f}[\mathbf{x}]\sim \mbox{GP}[\mbox{m}[\mathbf{x}],k[\mathbf{x},\mathbf{x}^\prime]]$ we are saying that:

\begin{eqnarray}

\mathbb{E}[\mbox{f}[\mathbf{x}]] &=& \mbox{m}[\mathbf{x}] \tag{2}

\end{eqnarray}

\begin{eqnarray}

\mathbb{E}[(\mbox{f}[\mathbf{x}]-\mbox{m}[\mathbf{x}])(f[\mathbf{x}']-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}']. \tag{3}

\end{eqnarray}

The first equation states that the expected value of the function is given by some function $\mbox{m}[\mathbf{x}]$ of $\mathbf{x}$ and the second equation tells us how to compute the covariance of any two points $\mathbf{x}$ and $\mathbf{x}'$. As a concrete example, let's choose:

\begin{eqnarray}

\mbox{m}[\mathbf{x}] &=& 0 \tag{4}

\end{eqnarray}

\begin{eqnarray}

k[\mathbf{x}, \mathbf{x}']

&=&\mbox{exp}\left[-\frac{1}{2}\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)\right], \tag{5}

\end{eqnarray}

so here the expected function values are all zero and the covariance decreases as a function of the distance between two points. In other words, function values at points that are very close to one another will tend to be similar, and values at points that are further apart will be less similar.

Given observations $\mathbf{f} = [f[\mathbf{x}_{1}], f[\mathbf{x}_{2}],\ldots, f[\mathbf{x}_{t}]]$ at $t$ points, we would like to make a prediction about the function value at a new point $\mathbf{x}^{*}$. This new function value $f^{*} = f[\mathbf{x}^{*}]$ is jointly normally distributed with the observations $\mathbf{f}$ so that:

\begin{equation}

Pr\left(\begin{bmatrix}\label{eq:GP_Joint}

\mathbf{f}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}] & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{6}

\end{equation}

where $\mathbf{K}[\mathbf{X},\mathbf{X}]$ is a $t\times t$ matrix where element $(i,j)$ is given by $k[\mathbf{x}_{i},\mathbf{x}_{j}]$, $\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]$ is a $t\times 1$ vector where element $i$ is given by $k[\mathbf{x}_{i},\mathbf{x}^{*}]$ and so on.

Since the function values in equation 6 are jointly normal, the conditional distribution $Pr(f^{*}|\mathbf{f})$ must also be normal, and we can use the standard formula for the mean and variance of this conditional distribution:

\begin{equation}\label{eq:gp_posterior}

Pr(f^*|\mathbf{f}) = \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{7}

\end{equation}

where

\begin{eqnarray}\label{eq:GP_Conditional}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{f}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}]&=&\mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}]\mathbf{K}[\mathbf{X},\mathbf{X}]^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{8}

\end{eqnarray}

Using this formula, we can estimate the distribution of the function at any new point $\mathbf{x}^{*}$. The best estimate of the function value is given by the mean $\mu[\mathbf{x}^{*}]$, and the uncertainty is given by the variance $\sigma^{2}[\mathbf{x}^{*}]$. Figure 3 shows an example of measuring several points on a function sequentially and how the predicted mean and variance change at other points.
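The conditional mean and variance in equation 8 can be sketched in a few lines of NumPy. This is a minimal illustration that assumes the zero mean function and the kernel of equations 4 and 5; the function names and the small `jitter` term added for numerical stability are choices made for this example.

```python
import numpy as np

def sq_exp_kernel(A, B):
    """Equation 5: k[x, x'] = exp(-0.5 * ||x - x'||^2) for all pairs of rows of A and B."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

def gp_posterior(X, f, X_star, jitter=1e-10):
    """Posterior mean and variance (equation 8) at each row of X_star."""
    K = sq_exp_kernel(X, X) + jitter * np.eye(len(X))   # K[X, X]
    K_star = sq_exp_kernel(X_star, X)                    # K[x*, X], one row per new point
    mean = K_star @ np.linalg.solve(K, f)
    v = np.linalg.solve(K, K_star.T)                     # K[X, X]^{-1} K[X, x*]
    var = 1.0 - np.sum(K_star * v.T, axis=1)             # k[x*, x*] = 1 for this kernel
    return mean, np.maximum(var, 0.0)                    # clip tiny negative round-off

X = np.array([[0.0], [1.0], [2.5]])      # three observed positions
f = np.sin(X[:, 0])                      # noise-free observations
X_star = np.array([[1.0], [10.0]])       # an observed position and a far-away one
mean, var = gp_posterior(X, f, X_star)
```

At the observed position the mean reproduces the observation and the variance collapses to nearly zero; far from the data the prediction reverts to the prior (mean zero, variance one).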

Now that we have a model of the function and its uncertainty, we will use this to choose which point to sample next. The *acquisition* function takes the mean and variance at each point $\mathbf{x}$ on the function and computes a value that indicates how desirable it is to sample next at this position. A good acquisition function should trade off exploration and exploitation.

In the following sections we'll describe four popular acquisition functions: the upper confidence bound (Srinivas *et* al., 2010), expected improvement (Močkus, 1975), probability of improvement (Kushner, 1964), and Thompson sampling (Thompson, 1933). Note that there are several other approaches which are not discussed here including those based on entropy search (Villemonteix *et* al., 2009, Hennig and Schuler, 2012) and the knowledge gradient (Wu *et* al., 2017).

**Upper confidence bound:** This acquisition function (figure 4a) is defined as:

\begin{align}

\mbox{UCB}[\mathbf{x}^{*}] = \mu[\mathbf{x}^{*}] + \beta^{1/2} \sigma[\mathbf{x}^{*}]. \label{eq:UCB-def} \tag{9}

\end{align}

This favors either (i) regions where $\mu[\mathbf{x}^{*}]$ is large (for exploitation) or (ii) regions where $\sigma[\mathbf{x}^{*}]$ is large (for exploration). The positive parameter $\beta$ trades off these two tendencies.
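To make the role of $\beta$ in equation 9 concrete, here is a small sketch with made-up posterior means and variances at three candidate points. The numbers are illustrative only.

```python
import numpy as np

def upper_confidence_bound(mean, var, beta):
    """Equation 9: UCB = mu + sqrt(beta) * sigma, evaluated element-wise."""
    return mean + np.sqrt(beta) * np.sqrt(var)

# Hypothetical posterior at three candidates:
# index 0 has a high mean and low uncertainty, index 1 a low mean and high uncertainty.
mean = np.array([0.9, 0.2, 0.5])
var = np.array([0.01, 1.0, 0.04])

explore_choice = int(np.argmax(upper_confidence_bound(mean, var, beta=4.0)))
exploit_choice = int(np.argmax(upper_confidence_bound(mean, var, beta=0.01)))
```

With $\beta=4$ the scores are 1.1, 2.2 and 0.9, so the uncertain point (index 1) is chosen; with $\beta=0.01$ they are 0.91, 0.3 and 0.52, so the point with the highest mean (index 0) wins.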

**Probability of improvement:** This acquisition function computes the likelihood that the function at $\mathbf{x}^{*}$ will return a result higher than the current maximum $\mbox{f}[\hat{\mathbf{x}}]$. For each point $\mathbf{x}^{*}$, we integrate the part of the associated normal distribution that is above the current maximum (figure 4b) so that:

\begin{equation}

\mbox{PI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} \mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}]. \tag{10}

\end{equation}

**Expected improvement:** The main disadvantage of the probability of improvement function is that it does not take into account how much the improvement will be; we do not want to favor small improvements (even if they are very likely) over larger ones. Expected improvement (figure 4c) takes this into account. It computes the expectation of the improvement $f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}]$ over the part of the normal distribution that is above the current maximum to give:

\begin{equation}

\mbox{EI}[\mathbf{x}^{*}] = \int_{\mbox{f}[\hat{\mathbf{x}}]}^{\infty} (f[\mathbf{x}^{*}]- f[\hat{\mathbf{x}}])\mbox{Norm}_{\mbox{f}[\mathbf{x}^{*}]}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]] d\mbox{f}[\mathbf{x}^{*}]. \tag{11}

\end{equation}

There also exist methods that allow us to trade off exploitation and exploration for probability of improvement and expected improvement (see Brochu *et* al., 2010).
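Because the posterior at each point is normal, the integrals in equations 10 and 11 have well-known closed forms. Writing $z = (\mu[\mathbf{x}^{*}] - f[\hat{\mathbf{x}}])/\sigma[\mathbf{x}^{*}]$, probability of improvement is $\Phi[z]$ and expected improvement is $(\mu[\mathbf{x}^{*}] - f[\hat{\mathbf{x}}])\Phi[z] + \sigma[\mathbf{x}^{*}]\phi[z]$, where $\Phi$ and $\phi$ are the standard normal cumulative distribution and density. The sketch below evaluates both for two hypothetical candidates; the helper names and the numbers are assumptions for illustration, and the code assumes $\sigma > 0$.

```python
import math
import numpy as np

def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

def probability_of_improvement(mean, std, f_best):
    """Closed form of equation 10."""
    return norm_cdf((mean - f_best) / std)

def expected_improvement(mean, std, f_best):
    """Closed form of equation 11."""
    z = (mean - f_best) / std
    return (mean - f_best) * norm_cdf(z) + std * norm_pdf(z)

f_best = 1.0                     # current maximum
mean = np.array([1.05, 0.8])     # candidate 0: slightly better mean; candidate 1: worse mean
std = np.array([0.05, 1.0])      # candidate 0: nearly certain; candidate 1: very uncertain

pi = probability_of_improvement(mean, std, f_best)
ei = expected_improvement(mean, std, f_best)
```

Probability of improvement prefers candidate 0 (a likely but tiny gain), whereas expected improvement prefers candidate 1, where a much larger gain is possible.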

**Thompson sampling:** When we introduced Gaussian processes, we only talked about how to compute the probability distribution for a single new point $\mathbf{x}^{*}$. However, it's also possible to draw a sample from the joint distribution of many new points that could collectively represent the entire function. Thompson sampling (figure 4d) exploits this by drawing such a sample from the posterior distribution over possible functions and then choosing the next point $\mathbf{x}$ according to the position of the maximum of this sampled function. To draw the sample, we append an equally spaced set of points to the observed ones as in equation 6, use the conditional formula to find a Gaussian distribution over these points as in equation 8, and then draw a sample from this Gaussian.
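A minimal one-dimensional sketch of this procedure follows. It assumes the kernel of equation 5 and noise-free observations; the function names, the grid, and the `jitter` value are illustrative choices. The joint posterior covariance over the grid is the matrix version of the variance in equation 8.

```python
import numpy as np

def sq_exp_kernel(A, B):
    """Equation 5 evaluated for all pairs of rows of A and B."""
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

def thompson_sample_next(X, f, grid, rng, jitter=1e-8):
    """Draw one function from the GP posterior on `grid` and return its maximizer."""
    K = sq_exp_kernel(X, X) + jitter * np.eye(len(X))
    K_gX = sq_exp_kernel(grid, X)
    mean = K_gX @ np.linalg.solve(K, f)
    cov = sq_exp_kernel(grid, grid) - K_gX @ np.linalg.solve(K, K_gX.T)
    cov = 0.5 * (cov + cov.T)    # remove round-off asymmetry
    # The covariance is nearly singular on a fine grid, so skip the validity check.
    sample = rng.multivariate_normal(mean, cov, check_valid="ignore")
    return grid[int(np.argmax(sample))], sample

rng = np.random.default_rng(0)
X = np.array([[1.0], [3.0], [4.5]])
f = np.sin(X[:, 0])
grid = np.linspace(0.0, 5.0, 51)[:, None]
x_next, sample = thompson_sample_next(X, f, grid, rng)
```

Calling `thompson_sample_next` repeatedly gives different sampled functions and therefore different proposals, which is the source of the exploration.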

Figure 5 shows a complete worked example of Bayesian optimization in one dimension using the upper confidence bound. As we sample more points, the function becomes steadily more certain. The method explores the function but also focuses on promising areas, exploiting what it has already learned.
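The whole loop can be sketched compactly. The code below is a simplified illustration rather than the procedure used to generate figure 5: it assumes the kernel of equation 5, a noise-free objective, candidate points restricted to a fixed grid, and $\beta=4$. The names `bayes_opt_ucb`, `n_init`, and `n_iter` are made up for this example.

```python
import numpy as np

def sq_exp_kernel(A, B):
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

def gp_posterior(X, f, X_star, jitter=1e-6):
    """Posterior mean and variance from equation 8 (jitter keeps K invertible)."""
    K = sq_exp_kernel(X, X) + jitter * np.eye(len(X))
    K_star = sq_exp_kernel(X_star, X)
    mean = K_star @ np.linalg.solve(K, f)
    var = 1.0 - np.sum(K_star * np.linalg.solve(K, K_star.T).T, axis=1)
    return mean, np.maximum(var, 0.0)

def bayes_opt_ucb(objective, grid, n_init=2, n_iter=15, beta=4.0, seed=0):
    rng = np.random.default_rng(seed)
    # Start from a few randomly chosen grid points.
    idx = rng.choice(len(grid), size=n_init, replace=False)
    X = grid[idx]
    f = np.array([objective(x) for x in X])
    for _ in range(n_iter):
        mean, var = gp_posterior(X, f, grid)
        ucb = mean + np.sqrt(beta) * np.sqrt(var)     # equation 9
        x_next = grid[int(np.argmax(ucb))]            # most desirable point
        X = np.vstack([X, x_next])
        f = np.append(f, objective(x_next))
    best = int(np.argmax(f))
    return X[best], f[best], X, f

grid = np.linspace(0.0, 5.0, 101)[:, None]
x_best, f_best, X_all, f_all = bayes_opt_ucb(lambda x: np.sin(x[0]), grid)
```

Early iterations tend to land in unexplored regions where the variance is large; later ones cluster around the maximum of the sine function near $\pi/2$.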

In the previous section, we summarized the main ideas of Bayesian optimization with Gaussian processes. In this section, we'll dig a bit deeper into some of the practical aspects. We consider how to deal with noisy observations, how to choose a kernel, how to learn the parameters of that kernel, how to exploit parallel sampling of the function, and finally we'll discuss some limitations of the approach.

Until this point, we have assumed that the function that we are estimating is noise-free and always returns the same value $\mbox{f}[\mathbf{x}]$ for a given input $\mathbf{x}$. To incorporate a stochastic output with variance $\sigma_{n}^{2}$, we add an extra noise term to the expression for the Gaussian process covariance:

\begin{eqnarray}

\mathbb{E}[(y[\mathbf{x}]-\mbox{m}[\mathbf{x}])(y[\mathbf{x}']-\mbox{m}[\mathbf{x}'])] &=& k[\mathbf{x}, \mathbf{x}'] + \sigma^{2}_{n}\delta[\mathbf{x},\mathbf{x}']. \tag{12}

\end{eqnarray}

Here $\delta[\mathbf{x},\mathbf{x}']$ is one when both arguments refer to the same observation and zero otherwise, so the noise is independent across observations. We no longer observe the function values $\mbox{f}[\mathbf{x}]$ directly, but observe noisy corruptions $y[\mathbf{x}] = \mbox{f}[\mathbf{x}]+\epsilon$ of them. The joint distribution of previously observed noisy function values $\mathbf{y}$ and a new unobserved point $f^{*}$ becomes:

\begin{equation}

Pr\left(\begin{bmatrix}

\mathbf{y}\\f^{*}\end{bmatrix}\right) = \mbox{Norm}\left[\mathbf{0}, \begin{bmatrix}\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I} & \mathbf{K}[\mathbf{X},\mathbf{x}^{*}]\\ \mathbf{K}[\mathbf{x}^{*},\mathbf{X}]& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\end{bmatrix}\right], \tag{13}

\end{equation}

and the conditional probability of a new point becomes:

\begin{eqnarray}\label{eq:noisy_gp_posterior}

Pr(f^{*}|\mathbf{y}) &=& \mbox{Norm}[\mu[\mathbf{x}^{*}],\sigma^{2}[\mathbf{x}^{*}]], \tag{14}

\end{eqnarray}

where

\begin{eqnarray}

\mu[\mathbf{x}^{*}]&=& \mathbf{K}[\mathbf{x}^{*},\mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{y}\nonumber \\

\sigma^{2}[\mathbf{x}^{*}] &=& \mathbf{K}[\mathbf{x}^{*},\mathbf{x}^{*}]\!-\!\mathbf{K}[\mathbf{x}^{*}, \mathbf{X}](\mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I})^{-1}\mathbf{K}[\mathbf{X},\mathbf{x}^{*}]. \tag{15}

\end{eqnarray}

Incorporating noise means that there is uncertainty about the function even where we have already sampled points (figure 6), and so sampling twice at the same position or at very similar positions could be sensible.
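The following sketch evaluates equation 15 at the observed positions themselves to show that the uncertainty no longer collapses to zero there. It assumes the kernel of equation 5 and conditions on the noisy observations $\mathbf{y}$; the data values and the noise variance are made up for illustration.

```python
import numpy as np

def sq_exp_kernel(A, B):
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

def noisy_gp_posterior(X, y, X_star, noise_var):
    """Posterior over the latent function f* given noisy observations y (equation 15)."""
    K_noisy = sq_exp_kernel(X, X) + noise_var * np.eye(len(X))   # K[X, X] + sigma_n^2 I
    K_star = sq_exp_kernel(X_star, X)
    mean = K_star @ np.linalg.solve(K_noisy, y)
    var = 1.0 - np.sum(K_star * np.linalg.solve(K_noisy, K_star.T).T, axis=1)
    return mean, var

X = np.array([[0.0], [1.0], [2.5]])
y = np.array([0.1, 0.8, 0.5])            # hypothetical noisy measurements
mean, var = noisy_gp_posterior(X, y, X, noise_var=0.1)
```

The variance at each observed position is strictly positive but smaller than the noise variance, which is why a second measurement at or near the same position can still be informative.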

When we build the model of the function and its uncertainty, we are assuming that the function is smooth. If this was not the case, then we could say nothing at all about the function between the sampled points. The details of this smoothness assumption are embodied in the choice of kernel covariance function.

We can visualize the covariance function by drawing samples from the Gaussian process prior. In one dimension, we do this by defining an evenly spaced set of points $\mathbf{X}=\begin{bmatrix}\mathbf{x}_{1},& \mathbf{x}_{2},&\cdots,& \mathbf{x}_{I}\end{bmatrix}$, drawing a sample from $\mbox{Norm}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]]$ and then plotting the results. In this section, we'll consider several different choices of covariance function, and use this method to visualize each.

**Squared Exponential Kernel:** In our example above, we used the squared exponential kernel, but more properly we should have included the amplitude $\alpha$ which controls the overall amount of variability and the length scale $\lambda$ which controls the amount of smoothness:

\begin{equation}\label{eq:bo_squared_exp}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \mbox{exp}\left[-\frac{d^{2}}{2\lambda^{2}}\right],\nonumber

\end{equation}

where $d$ is the Euclidean distance between the points:

\begin{equation}

d = \sqrt {\left(\mathbf{x}-\mathbf{x}'\right)^{T}\left(\mathbf{x}-\mathbf{x}'\right)}. \tag{16}

\end{equation}

When the amplitude $\alpha$ is small, the function does not vary too much in the vertical direction. When it is larger, there is more variation. When the length scale $\lambda$ is small, the function is assumed to be less smooth and we quickly become uncertain about the state of the function as we move away from known positions. When it is large, the function is assumed to be smoother and we are increasingly confident about what happens away from these observations (figure 7). Samples from the squared exponential kernel are visualized in figure 8a-c.

**Matérn kernel:** The squared exponential function assumes that the function is infinitely differentiable. The Matérn kernel (figure 8d-l) relaxes this constraint by assuming a certain degree of smoothness $\nu$. The Matérn kernel with $\nu=0.5$ produces functions that are continuous but not differentiable and is defined as

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2}\cdot \exp\left[-\frac{d}{\lambda}\right], \tag{17}

\end{equation}

where once again, $d$ is the Euclidean distance between $\mathbf{x}$ and $\mathbf{x}'$, $\alpha$ is the amplitude, and $\lambda$ is the length scale. The Matérn kernel with $\nu=1.5$ is once differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{3}d}{\lambda}\right)\exp\left[-\frac{\sqrt{3}d}{\lambda}\right]. \tag{18}

\end{equation}

The Matérn kernel with $\nu=2.5$ is twice differentiable and is defined as:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}'] = \alpha^{2} \left(1+\frac{\sqrt{5}d}{\lambda} + \frac{5d^{2}}{3\lambda^{2}}\right)\exp\left[-\frac{\sqrt{5}d}{\lambda}\right]. \tag{19}

\end{equation}

In the limit $\nu\rightarrow\infty$, the Matérn kernel is infinitely differentiable and is identical to the squared exponential kernel defined above.

**Periodic Kernel:** If we believe that the underlying function is oscillatory, we use the periodic function:

\begin{equation}

\mbox{k}[\mathbf{x},\mathbf{x}^\prime] = \alpha^{2} \cdot \exp \left[ \frac{-2(\sin[\pi d/\tau])^{2}}{\lambda^2} \right], \tag{20}

\end{equation}

where $\tau$ is the period of the oscillation and the other parameters have the same meanings as before.
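The kernels in this section are simple functions of the distance $d$, so they are easy to compare in code. The sketch below assumes the common parameterization in which the length scale enters as $d/\lambda$ (and as $d^{2}/2\lambda^{2}$ for the squared exponential), and it draws prior samples on a one-dimensional grid as described at the start of the section. All function names are illustrative.

```python
import numpy as np

def squared_exponential(d, alpha=1.0, lam=1.0):
    return alpha ** 2 * np.exp(-d ** 2 / (2.0 * lam ** 2))

def matern12(d, alpha=1.0, lam=1.0):
    return alpha ** 2 * np.exp(-d / lam)

def matern32(d, alpha=1.0, lam=1.0):
    r = np.sqrt(3.0) * d / lam
    return alpha ** 2 * (1.0 + r) * np.exp(-r)

def matern52(d, alpha=1.0, lam=1.0):
    r = np.sqrt(5.0) * d / lam           # r^2 / 3 equals 5 d^2 / (3 lam^2)
    return alpha ** 2 * (1.0 + r + r ** 2 / 3.0) * np.exp(-r)

def periodic(d, alpha=1.0, lam=1.0, tau=1.0):
    return alpha ** 2 * np.exp(-2.0 * np.sin(np.pi * d / tau) ** 2 / lam ** 2)

def sample_prior(kernel, grid, n_samples, rng, **params):
    """Draw functions from Norm[0, K[X, X]] on an evenly spaced 1D grid."""
    d = np.abs(grid[:, None] - grid[None, :])     # pairwise distances
    K = kernel(d, **params)
    # K can be numerically singular on a fine grid, so skip the validity check.
    return rng.multivariate_normal(np.zeros(len(grid)), K, size=n_samples,
                                   check_valid="ignore")

rng = np.random.default_rng(0)
grid = np.linspace(0.0, 5.0, 100)
rough = sample_prior(matern12, grid, n_samples=3, rng=rng, lam=1.0)
smooth = sample_prior(squared_exponential, grid, n_samples=3, rng=rng, lam=1.0)
```

Plotting the rows of `rough` and `smooth` against `grid` should reproduce the qualitative difference between the jagged Matérn samples and the smooth squared exponential samples.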

A common application for Bayesian optimization is to search for the best hyperparameters of a machine learning model. However, in an ironic twist, the kernel functions used in Bayesian optimization themselves contain unknown hyper-hyperparameters like the amplitude $\alpha$, length scale $\lambda$ and noise $\sigma^{2}_{n}$. There are several possible approaches to choosing these hyperparameters:

**1. Maximum likelihood:** similar to training ML models, we can choose these parameters by maximizing the marginal likelihood (i.e., the likelihood of the data after marginalizing over the possible values of the function):

\begin{eqnarray}\label{eq:bo_learning}

Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)&=&\int Pr(\mathbf{y}|\mathbf{f},\mathbf{x},\boldsymbol\theta)d\mathbf{f}\nonumber\\

&=& \mbox{Norm}_{y}[\mathbf{0}, \mathbf{K}[\mathbf{X},\mathbf{X}]+\sigma^{2}_{n}\mathbf{I}], \tag{21}

\end{eqnarray}

where $\boldsymbol\theta$ contains the unknown parameters in the kernel function and the measurement noise $\sigma^{2}_{n}$.

In Bayesian optimization, we collect the observations sequentially, and where we collect them depends on the kernel parameters. Consequently, we would have to interleave the processes of acquiring new points and optimizing the kernel parameters.
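As a sketch of the maximum likelihood approach, the code below evaluates the logarithm of equation 21 for a squared exponential kernel and picks the best length scale from a small candidate list. A real implementation would typically use a gradient-based optimizer over all parameters; the candidate list, the synthetic data, and the parameterization $d^{2}/2\lambda^{2}$ are assumptions for this example.

```python
import numpy as np

def log_marginal_likelihood(X, y, alpha, lam, noise_var):
    """log Norm_y[0, K[X, X] + sigma_n^2 I] for a squared exponential kernel."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = alpha ** 2 * np.exp(-sq_dist / (2.0 * lam ** 2)) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # K = L L^T
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))        # a = K^{-1} y
    return (-0.5 * y @ a
            - np.sum(np.log(np.diag(L)))                   # equals -0.5 * log|K|
            - 0.5 * len(X) * np.log(2.0 * np.pi))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)        # synthetic noisy data

length_scales = [0.01, 0.1, 0.3, 1.0, 3.0]
scores = [log_marginal_likelihood(X, y, alpha=1.0, lam=lam, noise_var=0.01)
          for lam in length_scales]
best_lam = length_scales[int(np.argmax(scores))]
```

Very short length scales treat the observations as nearly independent and are penalized here because the data vary smoothly.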

**2. Full Bayesian approach:** here we would choose a prior distribution $Pr(\boldsymbol\theta)$ on the kernel parameters of the Gaussian process and combine this with the likelihood in equation 21 to compute the posterior. We then weight the acquisition functions according to this posterior:

\begin{equation}\label{eq:snoek_post}

\hat{a}[\mathbf{x}^{*}]\propto \int a[\mathbf{x}^{*}|\boldsymbol\theta]Pr(\mathbf{y}|\mathbf{x},\boldsymbol\theta)Pr(\boldsymbol\theta)d\boldsymbol\theta. \tag{22}

\end{equation}

In practice this would usually be done using a Monte Carlo approach in which the posterior is represented by a set of samples (see Snoek *et* al., 2012) and we sum together multiple acquisition functions derived from these kernel parameter samples (figure 9).

For practical applications like hyperparameter search, we would want to make multiple function evaluations in parallel. In this case, we must consider how to prevent the algorithm from starting a new function evaluation in a place that is already being explored by a parallel thread.

One solution is to use a stochastic acquisition function. For example, Thompson sampling draws from the posterior distribution over the function and samples where this sample is maximal (figure 4d). When we sample several times, we will get different draws from the posterior and hence different values of $\mathbf{x}$.

A more sophisticated approach is to treat the problem in a fully Bayesian way (Snoek et al., 2012). The optimization algorithm keeps track of both the points that have been evaluated and the points that are pending, marginalizing over the uncertainty in the pending points. This can be done using a sampling approach similar to the method in figure 9 for incorporating different length scales. We draw samples from the Gaussians representing the possible pending results and build an acquisition function for each. We then average together these acquisition functions weighted by the probability of observing those results.

In practice, Bayesian optimization with Gaussian Processes works best if we start with a number of points from the function that have already been evaluated. A rule of thumb might be to use random sampling for $\sqrt{d}$ iterations where $d$ is the number of dimensions and then start the Bayesian optimization process. A second useful trick is to occasionally incorporate a random sample into the scheme. This can stop the Bayesian optimization process getting waylaid examining unproductive regions of the space and forces a certain degree of exploration. A typical approach might be to use a random sample every 10 iterations.

The main limitation of Bayesian optimization with GPs is efficiency. As the dimensionality increases, more points need to be evaluated. Unfortunately, the cost of exact inference in the Gaussian process scales as $\mathcal{O}[n^3]$ where $n$ is the number of data points. There has been some work to reduce this cost through different approximations such as:

**Inducing points:** This approach tries to summarize the large number of observed points into a smaller subset known as inducing points (Snelson *et* al., 2006).

**Decomposing the kernel:** This approach decomposes the "big" kernel in high dimension into "small" kernels that act on small dimensions (Duvenaud *et* al., 2011).

**Using random projections:** This approach relies on random embedding to solve the optimization problem in a lower dimension (Wang *et* al., 2013).

So far, we have considered optimizing continuous variables. What does Bayesian optimization look like in the discrete case? Perhaps we wish to choose which of $K$ discrete conditions (parameter values) yields the best output. In the absence of noise, this problem is trivial; we simply try all $K$ conditions in turn and choose the one that returns the maximum. However, when there is noise on the output, we can use Bayesian optimization to find the best condition efficiently.

The basic approach is to model each condition independently. For continuous observations, we could model each output $f_{k}$ with a normal distribution, choose a prior over the mean of the normal and then use the measurements to compute a posterior over this mean. We'll leave developing this model as an exercise for the reader. Instead, and for a bit of variety, we'll move to a different setting where the observations are binary and we wish to find the configuration that produces the highest proportion of '1's in the output. This setting motivates the Beta-Bernoulli bandit model.

Consider the problem of choosing which of $K$ graphics to present to the user for a web-advert. We assume that for the $k^{th}$ graphic, there is a fixed probability $f_{k}$ that the person will click, but these parameters are unknown. We would like to efficiently choose the graphic that prompts the most clicks.

To solve this problem, we treat the parameters $f_{1}\ldots f_{K}$ as uncertain and place an uninformative Beta distribution prior with $\alpha,\beta=1$ over their values:

\begin{equation}

Pr(f_{k}) = \mbox{Beta}_{f_{k}}\left[1.0, 1.0\right]. \tag{23}

\end{equation}

The likelihood of showing the $k^{th}$ graphic $n_{k}$ times and receiving $c_{k}$ clicks is then

\begin{equation}

Pr(c_{k}|f_{k}, n_{k}) = f_{k}^{c_{k}}(1-f_{k})^{n_{k}-c_{k}}, \tag{24}

\end{equation}

and we can combine these two equations via Bayes rule to compute the posterior distribution over the parameter $f_{k}$ (see chapter 4 of Prince, 2012) which will be given by

\begin{equation}

Pr(f_{k}|c_{k},n_{k}) = \mbox{Beta}_{f_{k}}\left[1.0 + c_{k}, 1.0 + n_{k}-c_{k} \right]. \tag{25}

\end{equation}

Now we must choose which value of $k$ to try next given the $K$ posterior distributions over the probabilities $f_{k}$ of getting a click (figure 10). As before, we choose an acquisition function and sample the value of $k$ that maximizes this. In this case, the most practical approach is to use Thompson sampling. We sample from each posterior distribution separately (they are independent) and choose $k$ based on the highest sampled value.
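Thompson sampling for this model takes only a few lines. The simulation below is a sketch with made-up click probabilities; in a real deployment the clicks would come from users rather than from `true_rates`.

```python
import numpy as np

def thompson_bandit(true_rates, n_rounds, seed=0):
    """Simulate Thompson sampling with independent Beta[1, 1] priors on each arm."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_rates)
    clicks = np.zeros(n_arms)    # c_k
    shows = np.zeros(n_arms)     # n_k
    for _ in range(n_rounds):
        # One draw from each posterior Beta[1 + c_k, 1 + n_k - c_k] (equation 25).
        samples = rng.beta(1.0 + clicks, 1.0 + shows - clicks)
        k = int(np.argmax(samples))
        shows[k] += 1
        clicks[k] += rng.random() < true_rates[k]    # simulated user response
    return clicks, shows

# Hypothetical click probabilities for three graphics.
clicks, shows = thompson_bandit(true_rates=[0.05, 0.1, 0.5], n_rounds=2000)
```

Over time `shows` concentrates on the graphic with the highest click rate, while the weaker graphics are still tried occasionally as long as their posteriors remain broad.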

As in the continuous case, this method will trade off exploiting existing knowledge by showing graphics that it knows will generate a high rate of clicks and exploring graphics where the click rate is very uncertain. This model and algorithm are part of a more general literature on bandit algorithms. More information can be found in this book.

When we have many discrete variables (e.g., the orientation, color, font size in an advert graphic), we could treat each combination of variables as one value of $k$ and use the above approach in which each condition is treated independently. However, the number of combinations may be very large and so this is not necessarily practical.

If the discrete variables have a natural order (e.g., font size) then one approach is to treat them as continuous. We amalgamate them into an observation vector $\mathbf{x}$ and use a Gaussian process model. The only complication is that we now only compute the acquisition function at the discrete values that are valid.

If the discrete variables have no natural order then we are in trouble. Gaussian processes depend on the kernel 'distance' between points and it is hard to define such kernels for discrete variables. One approach is to use a one-hot encoding, apply a kernel for each dimension and let the overall kernel be defined by the product of these sub-kernels (Duvenaud *et* al., 2014). However, this is not ideal because there is no way for the model to know about the invalid input values which will be assigned some probability and may be selected as new points to evaluate. One way to move forward is to consider a different underlying probabilistic model.

The approaches up to this point can deal with most of the problems that we outlined in the introduction, but are not suited to the case where there are many discrete variables (possibly in combination with continuous variables). Moreover, they cannot elegantly handle the case of conditional variables where the existence of some variables is contingent on the settings of others. In this section we consider random forest models and tree-Parzen estimators, both of which can handle these situations.

The *Sequential Model-based Algorithm Configuration* (SMAC) algorithm uses a random forest as an alternative to Gaussian processes. Consider the case where we have made some observations and trained a regression forest. For any point, we can measure the mean of the trees' predictions and their variance (figure 11). This mean and variance are then treated similarly to the equivalent outputs from a Gaussian process model. We apply an acquisition function to choose which of a set of candidate points to sample next. In practice, the forest must be updated as we go along and a simple way to do that is just to split a leaf when it accumulates a certain number of training examples.

Random forests based on binary splits can easily cope with combinations of discrete and continuous variables; it is just as easy to split the data by thresholding a continuous value as it is to split it by dividing a discrete variable into two non-overlapping sets. Moreover, the tree structure makes it easy to accommodate conditional parameters: we do not consider splitting on contingent variables until they are guaranteed by prior choices to exist.

The Tree-Parzen estimator (Bergstra *et* al., 2011) works quite differently from the models that we have considered so far. It describes the likelihood $Pr(\mathbf{x}|y)$ of the data $\mathbf{x}$ given the noisy function value $y$ rather than the posterior $Pr(y|\mathbf{x})$.

More specifically, the goal is to build two separate models $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ where the set $\mathcal{L}$ contains the lowest values of $y$ seen so far and the set $\mathcal{H}$ contains the highest. These sets are created by partitioning the values according to whether they fall below or above some fixed quantile.

The likelihoods $Pr(\mathbf{x}|y\in\mathcal{L})$ and $Pr(\mathbf{x}|y\in\mathcal{H})$ are modelled with kernel density estimators; for example, we might describe the likelihood as a sum of Gaussians with a mean on each observed data point $\mathbf{x}$ and fixed variance (figure 12). It can be shown that expected improvement is then maximized by choosing the point that maximizes the ratio $Pr(\mathbf{x}|y\in\mathcal{H})/Pr(\mathbf{x}|y\in\mathcal{L})$.
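A one-dimensional sketch of this selection rule is given below. It is a simplification under stated assumptions: the candidates are supplied as a fixed grid, both densities use Gaussian kernels with the same fixed bandwidth, and the quantile is 0.75. The function names and these settings are illustrative and are not taken from Bergstra *et* al. (2011).

```python
import numpy as np

def kde(points, query, bandwidth):
    """Gaussian kernel density estimate: one Gaussian centred on each observed point."""
    z = (query[:, None] - points[None, :]) / bandwidth
    return np.mean(np.exp(-0.5 * z ** 2), axis=1) / (bandwidth * np.sqrt(2.0 * np.pi))

def tpe_suggest(x_obs, y_obs, candidates, quantile=0.75, bandwidth=0.3):
    """Return the candidate that maximizes Pr(x | y in H) / Pr(x | y in L)."""
    threshold = np.quantile(y_obs, quantile)
    high = x_obs[y_obs >= threshold]     # set H: positions with the highest outputs
    low = x_obs[y_obs < threshold]       # set L: the remaining positions
    ratio = kde(high, candidates, bandwidth) / (kde(low, candidates, bandwidth) + 1e-12)
    return candidates[int(np.argmax(ratio))]

x_obs = np.linspace(0.0, 5.0, 21)
y_obs = np.sin(x_obs)                    # hypothetical observations
candidates = np.linspace(0.0, 5.0, 101)
x_next = tpe_suggest(x_obs, y_obs, candidates)
```

The suggestion falls inside the region where the best observations were made and away from the positions that produced low values.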

Tree-Parzen estimators work when we have a mixture of discrete and continuous spaces, and when some parameters are contingent on others. Moreover, the computation scales linearly with the number of data points as opposed to with their cube as for Gaussian processes.

In this tutorial, we have discussed Bayesian optimization, its key components, and applications. For further information, consult the recent surveys by Shahriari *et* al. (2016) and Frazier (2018). Python packages for Bayesian optimization include BoTorch, Spearmint, GPFlow, and GPyOpt.

Code for hyperparameter optimization can be found in the Hyperopt and HPBandSter packages. A popular application of Bayesian optimization is for AutoML which broadens the scope of hyperparameter optimization to also compare different model types as well as choosing their parameters and hyperparameters. Python packages for AutoML include Auto-sklearn, Hyperopt-sklearn, and NNI.


**School: **Perimeter Institute for Theoretical Physics, University of Waterloo.

**Research areas: **Theoretical physics.

Research topic: Machine Learning for physics and physics for machine learning.

**School: **AMII, University of Alberta.

**Research areas: **Machine learning, deep learning and natural language processing.

Research topic: Towards empathetic conversational AI.

**School: **Concordia Institute for Information System Engineering (CIISE), Concordia University, Montreal.

**Research areas: **Deep learning in health domain applications.

Research topic: Designing optimal deep neural networks for hand gesture recognition and force prediction, and developing domain adaptation algorithms for time-domain features.

**School: **University of Toronto.

**Research areas: **Machine learning.

Research topic: Interplay between optimization, generalization and uncertainty of deep learning.

**School: **Centre for Intelligent Machines (CIM), McGill University.

**Research areas: **Computer vision and artificial intelligence.

Research topic: Towards building reliable deep neural network.

**School: **University of British Columbia.

**Research areas: **Deep learning, chemistry, natural language processing, artificial intelligence, generative models, mass spectrometry.

Research topic: Automated discovery of unknown molecules using deep neural networks.

**School: **Mila, Université de Montréal.

**Research areas: **Optimization and deep learning.

Research topic: A dynamical systems perspective into game optimization.

**School: **Mila, Université de Montréal.

**Research areas: **Natural language processing and deep learning.

Research topic: Learning and modeling neural representations of text.

**School: **Artificial Intelligence and Algorithms laboratories at the University of British Columbia.

**Research areas: **Stochastic processes - neural networks - DNA computing.

Research topic: Mean first passage time and parameter estimation for continuous-time Markov Chains.

**School: **McMaster University.

**Research areas: **Deep learning, watermarking, steganography, information-theoretical principles.

Research topic: New deep neural network architectures for blind image watermarking based on the information-theoretic principles.

The ten Fellows won the awards for their outstanding research capabilities and represent leading universities and AI institutes from provinces across Canada.

These fellowships are part of Borealis AI’s commitment to support Canadian academic excellence in AI and machine learning. They provide financial assistance for exceptional domestic and international graduate students to carry out fundamental research as they pursue their masters and PhDs in various fields of AI. The program is one of a number of Borealis AI initiatives designed to strengthen the partnership between academia and industry and advance the momentum of Canada’s leadership in the AI space.

This year’s winners demonstrated exceptional talent, vision and passion for high-quality research. Backed by some of Canada’s leading AI professors, the projects range from applying AI in metabolomics and quantum physics to research in areas such as natural language processing, deep learning, uncertainty, and computer vision.

Speaking about the program, Prof. Geoffrey Hinton, Chief Scientific Advisor at Vector Institute, said:

“Deep learning is poised to change the way we work and live and I am proud of the talent and caliber that our Universities have to offer. Canada is a top destination for research in machine learning globally and the Borealis AI Fellowships demonstrate the continuous support of the industry in that regard. Supporting students with the means to conduct their research is very important for our community.”

Foteini Agrafioti, Head of Borealis AI, said:

“AI was pioneered in Canada, and our universities have trained some of the most prolific experts in the world. There is a huge demand for AI expertise, and that is why we are committed to nurturing talent in this highly critical field in Canada. I’m impressed by the caliber of this year’s winners and am excited to provide them with additional resources to advance their research and kick start their promising careers.”

Ibtihel Amara, a PhD student at McGill University at the Centre for Intelligent Machines (CIM), was awarded a fellowship this year for her work on addressing uncertainty and integrating reliability and trust into modern AI systems. Speaking about her award, Ibtihel said:

“The Borealis AI fellowship is important to me because it means I can focus completely on my research work. This award has also encouraged me to believe in the significance of my research, especially to be chosen among the best candidates in the AI research community. It compels me to achieve more and dream big!”

Click here to meet the class of 2020.

Our study shows that maximizing margins can be achieved by minimizing the adversarial loss on the decision boundary at the "shortest successful perturbation", demonstrating a close connection between adversarial losses and the margins. We propose Max-Margin Adversarial (MMA) training to directly maximize the margins to achieve adversarial robustness.

Instead of adversarial training with a fixed $\epsilon$, MMA offers an improvement by enabling adaptive selection of the "correct" $\epsilon$ as the margin individually for each datapoint. In addition, we rigorously analyze adversarial training with the perspective of margin maximization, and provide an alternative interpretation for adversarial training, maximizing either a lower or an upper bound of the margins. Our experiments empirically confirm our theory and demonstrate MMA training's efficacy on the MNIST and CIFAR10 datasets w.r.t. $\ell_\infty$ and $\ell_2$ robustness.

While sharing data between institutions and communities can help boost AI innovation, this practice runs the risk of exposing sensitive and private information about the parties involved. Private and secure sharing of data is essential for the AI field to succeed at scale.

Traditionally, a common practice has been to simply delete personally identifiable information (PII), such as name, social insurance number, home address, or birth date, from data before it is shared. However, “scrubbing,” as it is often called, is no longer a reliable way to protect privacy, because the widespread proliferation of user-generated data has made it possible to reconstruct PII from scrubbed data. For example, in 2009 a Netflix user sued the company because her “anonymous” viewing history could be de-anonymized using other publicly available data sources, allowing her sexual orientation to be inferred.

Differentially private synthetic data generation (referred to here simply as differential privacy) presents an interesting solution to this problem. In a nutshell, this technology adds “noise” to sensitive data while preserving the statistical patterns from which machine learning algorithms learn, allowing data to be shared safely and innovation to proceed rapidly.

Differential privacy preserves the statistical properties of a data set (the patterns and trends that algorithms rely on to drive insights or automate processes) while obfuscating the underlying data themselves. The key idea behind synthetic data generation is to mask PII by adding statistical noise. The noisy synthetic data can be shared without compromising users’ privacy but still yields approximately the same aggregate insights as the raw, noiseless data. Compare it to a doctor sharing trends and statistics about a patient base without ever revealing individual patients’ specific details.
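The idea of adding calibrated noise can be illustrated with the Laplace mechanism applied to a single aggregate, the mean. This is a deliberately simpler setting than synthetic data generation, and the function names, the clipping bounds, and the sample ages below are assumptions made for the sketch; the dataset size is also assumed to be public.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Mean of values clipped to [lower, upper], plus Laplace noise
    calibrated for epsilon-differential privacy."""
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Replacing one record changes the clipped mean by at most this much.
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical patient ages; only the noisy aggregate would be released.
rng = random.Random(0)
ages = [34, 51, 47, 29, 62, 45, 38, 55]
print("true mean :", sum(ages) / len(ages))
print("noisy mean:", private_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng))
```

A smaller `epsilon` means more noise and stronger privacy; a larger one means less noise and a released value closer to the true mean.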

Many major technology companies are already using differential privacy. For example, Google has applied differential privacy in the form of RAPPOR, a novel privacy technology that allows inferring statistics about populations while preserving the privacy of individual users. Apple also applies statistical noise to mask users’ individual data.
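RAPPOR builds on a much older idea called randomized response, in which each user perturbs their own answer before it ever leaves their device. The sketch below shows only that basic building block for a single yes/no attribute, not RAPPOR itself; the function names, the 30% true rate, and the truth probability of 0.5 are assumptions chosen for illustration.

```python
import random

def randomized_response(true_bit, p_truth, rng):
    """Report the true bit with probability p_truth; otherwise report
    a fair coin flip, so any single report is deniable."""
    if rng.random() < p_truth:
        return true_bit
    return 1 if rng.random() < 0.5 else 0

def estimate_rate(reports, p_truth):
    """Invert the known noise to estimate the population rate.
    E[report] = p_truth * rate + (1 - p_truth) / 2."""
    observed = sum(reports) / len(reports)
    return (observed - (1.0 - p_truth) / 2.0) / p_truth

# Hypothetical population: 30% of users have the sensitive attribute.
rng = random.Random(42)
true_bits = [1] * 30_000 + [0] * 70_000
reports = [randomized_response(bit, 0.5, rng) for bit in true_bits]
print("estimated rate:", round(estimate_rate(reports, 0.5), 3))
```

No individual report can be taken at face value, yet the aggregate estimate lands close to the true rate because the noise averages out over many users.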

Differential privacy is not a free lunch, however: adding noise makes ML algorithms less accurate, especially with smaller datasets. Because that accuracy penalty shrinks as datasets grow, differential privacy gives groups both a reason and a safe way to join forces and leverage the combined size of their data to gain new insights. Consider a network of hospitals studying diabetes that need to use their patient records to construct early diagnostic techniques from their collective intelligence. Each hospital could analyze its own patient records independently. However, modern AI systems thrive on massive amounts of data, which in practice can only be obtained through large-scale merging of patient records. Differential privacy presents a way of achieving that through the sharing of synthetic data and the creation of a single, massive, but still privacy-preserving, dataset for scientists.
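A rough piece of arithmetic shows why dataset size matters so much. For a differentially private mean computed with Laplace noise, the noise scale (which is also the expected absolute error added) is the value range divided by the number of records times epsilon. The hospital counts, record counts, value range, and epsilon below are hypothetical, and the sketch covers a simple noisy query rather than synthetic data generation.

```python
def mean_noise_scale(n_records, value_range, epsilon):
    """Laplace scale for an epsilon-DP mean over n_records bounded values.
    This equals the expected absolute noise added to the released mean."""
    return value_range / (n_records * epsilon)

# Hypothetical setup: 20 hospitals with 500 patients each, a measurement
# bounded to a range of width 10, and epsilon = 0.5.
single_hospital = mean_noise_scale(500, value_range=10.0, epsilon=0.5)
pooled_network = mean_noise_scale(500 * 20, value_range=10.0, epsilon=0.5)

print(f"expected noise, one hospital : {single_hospital:.4f}")
print(f"expected noise, pooled data  : {pooled_network:.4f}")
```

At the same privacy level, pooling twenty times as many records cuts the added noise by a factor of twenty, which is the sense in which larger combined datasets soften the accuracy cost.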

While differential privacy is not a universal solution, it bridges the gap between the need for individual privacy and the need for statistical insights – opening the doors to new possibilities.
