Embedding Prior Knowledge into Neural Networks


In many scientific and engineering problems, purely data-driven learning is insufficient. Physical systems are governed by underlying structure, including conservation laws, symmetries, boundary conditions, and causality. Incorporating such structure into neural networks can significantly improve data efficiency, generalization, and physical consistency.

This section presents systematic ways to embed prior knowledge, such as constraints and structure, into neural networks.


1. Hard vs. Soft Constraints

Constraints on neural network outputs arise from many sources: physical laws, mathematical requirements, conservation principles, and symmetry properties. Regardless of their origin, the key question is how strongly the constraint must be enforced. This leads to two fundamentally different approaches.

A useful way to think about this is in terms of prior knowledge and enforcement level. When we have strong, exact prior knowledge (e.g., a boundary condition that must hold precisely), hard enforcement is appropriate. When the prior knowledge is approximate, uncertain, or too complex to encode architecturally, soft enforcement provides a flexible alternative. The choice between the two is therefore not just a matter of implementation convenience, but reflects how confident we are in the constraint and how costly a violation would be.


1.1. Soft Constraints

Soft constraints incorporate requirements into the training loss as penalty terms. Instead of forcing the network to satisfy a constraint exactly, we add an extra term to the loss that increases whenever the constraint is violated. The network then learns to reduce this penalty during training.

For example, if we want the output to satisfy some requirement $\mathcal{C}$, the loss becomes


$$ \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{data}}(\theta) + \lambda \, \mathcal{R}(\theta) $$

where $\mathcal{R}(\theta) \geq 0$ measures the degree of violation and $\lambda > 0$ controls how strongly the constraint is enforced.

The main advantage is that soft constraints are easy to implement and applicable to almost any differentiable constraint. The main limitation is that satisfaction is only approximate: the network reduces the penalty but is never forced to eliminate it entirely.
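
As a minimal sketch of the penalized loss above, the following plain-Python example combines a mean-squared data loss with a penalty term. The requirement used here (outputs should sum to 1) and the function names are illustrative assumptions, not part of the original text.

```python
def penalized_loss(pred, target, violation, lam):
    """Mean-squared data loss plus lam times a non-negative penalty R."""
    data = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return data + lam * violation(pred)


def sum_to_one_violation(pred):
    # Hypothetical requirement C: the outputs should sum to 1.
    return (sum(pred) - 1.0) ** 2


target = [0.2, 0.3, 0.5]
feasible = [0.25, 0.25, 0.5]    # sums to 1, so the penalty is zero
infeasible = [0.25, 0.25, 0.7]  # sums to 1.2, so the penalty is positive

loss_feasible = penalized_loss(feasible, target, sum_to_one_violation, lam=10.0)
loss_infeasible = penalized_loss(infeasible, target, sum_to_one_violation, lam=10.0)
```

Increasing `lam` makes violations more expensive, but any finite value still leaves the optimizer free to trade a small violation for a lower data loss.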


1.2. Hard Constraints

Hard constraints enforce requirements directly through the network architecture, so that the output always satisfies the constraint exactly, regardless of the parameter values $\theta$.

Instead of penalizing violations, the network is redesigned so that producing a violating output is structurally impossible. This is typically achieved in one of two ways.


(1) Output parameterization passes the raw network output $v(x)$ through a fixed transformation that maps it into the constraint-satisfying set.

For example, if the output must be non-negative:


$$ \hat{v}(x) = \exp\!\left(v(x)\right) $$

(2) Projection applies a correction step after the raw output is computed, mapping it onto the constraint-satisfying set.

For example, if the output must satisfy a zero-mean condition $\int_\Omega v(x)\,dx = 0$, the raw output $v$ is corrected by subtracting its mean:


$$ \hat{v}(x) = v(x) - \frac{1}{|\Omega|}\int_\Omega v(y)\,dy $$

which guarantees $\int_\Omega \hat{v}(x)\,dx = 0$ exactly, for any raw output $v$.


The main advantage is exact satisfaction at every point, for every input, at every stage of training and inference. The main limitation is construction difficulty: not every constraint has a simple architectural encoding.
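
The two mechanisms can be sketched on sampled outputs as follows. This is an illustration under the assumption that the output is represented by values on a uniform grid, so the integral mean is replaced by the sample mean; the function names are hypothetical.

```python
import math


def positive_output(raw):
    # Output parameterization: exp maps any real value to a positive one.
    return [math.exp(v) for v in raw]


def zero_mean_projection(raw):
    # Projection: subtract the sample mean (discrete analogue of the integral mean).
    mean = sum(raw) / len(raw)
    return [v - mean for v in raw]


raw = [-3.0, 0.5, 2.0, 7.5]  # arbitrary raw network outputs
pos = positive_output(raw)
proj = zero_mean_projection(raw)
```

In both cases the constraint holds for any values in `raw`, so it does not depend on how well the network has been trained.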


1.3. Choosing Between Hard and Soft

In practice, the two approaches are often combined. The decision of which to use for a given constraint depends on three factors.

Certainty of the prior knowledge. If the constraint is exact and known with certainty (e.g., a Dirichlet boundary condition or a conservation law), hard enforcement is appropriate. If the constraint is approximate, domain-dependent, or derived from uncertain physical assumptions, soft enforcement is more robust.

Cost of violation. In safety-critical or physically interpretable settings, even a small constraint violation may be unacceptable. Hard enforcement eliminates this risk entirely. In exploratory or data-rich settings, approximate satisfaction via soft constraints may be sufficient.

Architectural tractability. Not every constraint admits a clean hard encoding. Complex constraints, nonlinear invariants, or constraints that involve interactions between multiple outputs may be difficult or impossible to encode exactly. In such cases, soft enforcement is the only practical option.

A common and effective strategy is to apply hard constraints wherever an exact architectural encoding is available, and soft constraints for the remainder. This hybrid approach preserves the flexibility of penalty-based training while guaranteeing satisfaction of the most critical structural requirements.

2. Positivity and Negativity Constraints

A positivity constraint requires that the output is always non-negative:


$$ f(x) \geq 0 \quad \forall x $$

This arises naturally in many engineering problems. For example, stress intensity factors, energy densities, and material stiffness values are all quantities that must be non-negative by physical definition.


2.1. Enforcing Positivity


(1) Output transformation. Pass the raw network output $v(x)$ through a strictly positive function:


$$ \hat{f}(x) = \exp(v(x)) $$

or the softplus variant:


$$ \hat{f}(x) = \log(1 + \exp(v(x))) $$

The exponential is the simplest but can produce very large values. Softplus behaves linearly for large inputs and is more stable during training.


(2) Soft enforcement. Add a penalty term to the loss that penalizes negative outputs:


$$ \mathcal{L}_{\mathrm{pos}}(\theta) = \frac{1}{N} \sum_{j=1}^{N} \left[ \max\left(0, -f(x_j)\right) \right]^2 $$

This term is zero when the output is non-negative everywhere and grows quadratically with the magnitude of any violation. As with all soft constraints, satisfaction is only approximate.

Output transformation is the simplest and most reliable approach: it provides exact constraint satisfaction at negligible computational cost. Soft enforcement is easy to implement but provides no exact guarantee.
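
A small sketch of both options, using only the standard library. The softplus is written in a rearranged but mathematically identical form, $\max(v, 0) + \log(1 + e^{-|v|})$, which avoids overflow for large inputs; the function names are illustrative.

```python
import math


def softplus(v):
    # Stable form of log(1 + exp(v)): max(v, 0) + log1p(exp(-|v|)).
    return max(v, 0.0) + math.log1p(math.exp(-abs(v)))


def positivity_penalty(outputs):
    # Mean of squared negative parts; zero when all outputs are non-negative.
    return sum(max(0.0, -f) ** 2 for f in outputs) / len(outputs)


raw = [-50.0, -1.0, 0.0, 1.0, 800.0]  # math.exp(800.0) alone would overflow
hard = [softplus(v) for v in raw]
```

In floating point, extremely negative inputs can underflow to exactly zero, so the strict inequality holds only up to machine precision.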


2.2. Extending to Negativity Constraints

By reversing the sign convention, the same ideas apply directly to a negativity constraint, which requires the output to be always non-positive:


$$ f(x) \leq 0 \quad \forall x $$

This arises in problems where the output represents a quantity that is bounded above by zero, such as a compressive stress component, a dissipation rate expressed as a negative quantity, or a signed distance function restricted to the interior of a domain.


(1) Output transformation. Apply a strictly negative function to the raw output:


$$ \hat{f}(x) = -\exp(v(x)) $$

or the negated softplus variant:


$$ \hat{f}(x) = -\log(1 + \exp(v(x))) $$

Both are always negative regardless of $v(x)$, so the constraint is satisfied by construction.


(2) Soft enforcement. Penalize positive outputs:


$$ \mathcal{L}_{\mathrm{neg}}(\theta) = \frac{1}{N} \sum_{j=1}^{N} \left[ \max\left(0, f(x_j)\right) \right]^2 $$

3. Monotonicity Constraint

A monotonicity constraint requires that the output changes consistently in one direction as the input increases. There are two cases:

Monotone increasing requires that the output increases as the input increases:


$$ x_1 \leq x_2 \implies f(x_1) \leq f(x_2) $$

For example, the deflection of a beam increases monotonically as the applied load increases.

Monotone decreasing requires that the output decreases as the input increases:


$$ x_1 \leq x_2 \implies f(x_1) \geq f(x_2) $$

For example, the fatigue life of a material decreases monotonically as the stress amplitude increases.

Note that a monotone decreasing function can always be written as $f(x) = -h(x)$ where $h$ is monotone increasing. Therefore, all strategies below are described for the increasing case, and the decreasing case follows by negating the output.


3.1. Strategies for Enforcing Monotone Increasing

(1) Non-negative derivative. Enforce monotonicity by ensuring the derivative of the output is always non-negative:


$$ \frac{\partial f}{\partial x} \geq 0 $$

One way to achieve this is to model the derivative as a strictly positive function of an unconstrained network and integrate to recover the output:


$$ f(x) = f(x_0) + \int_{x_0}^{x} \exp(g(t)) \, dt $$

where $g$ is an unconstrained network. Since $\exp(g(t)) > 0$ always, the integral is strictly increasing in $x$.


(2) Monotone architecture. Constrain all weights in the network to be non-negative, and use only non-decreasing activation functions such as ReLU or sigmoid. A network with non-negative weights and non-decreasing activations is guaranteed to be monotone with respect to its input.


(3) Soft enforcement. Add a penalty term to the loss that penalizes violations of monotonicity at sampled input pairs $x_1 \leq x_2$:


$$ \mathcal{L}_{\mathrm{mono}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max\left(0, \, f(x_1^i) - f(x_2^i)\right) \right]^2 $$

This term is zero when monotonicity is satisfied and grows quadratically with the magnitude of any violation.
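
The integral construction in (1) and the penalty in (3) can be sketched on a grid as follows. The function `g` is a stand-in for an unconstrained network, and the cumulative trapezoidal rule is an assumed choice of quadrature; neither comes from the original text.

```python
import math


def g(t):
    # Stand-in for an unconstrained network; any real-valued function works.
    return math.sin(3.0 * t) - 0.5 * t


def monotone_values(xs, f0=0.0):
    # Cumulative trapezoidal rule for f(x) = f(x0) + int_{x0}^{x} exp(g(t)) dt.
    # Every increment is positive, so the values increase along the grid.
    values = [f0]
    for a, b in zip(xs[:-1], xs[1:]):
        increment = 0.5 * (b - a) * (math.exp(g(a)) + math.exp(g(b)))
        values.append(values[-1] + increment)
    return values


def monotonicity_penalty(values):
    # Soft alternative: penalize decreases between consecutive grid values.
    diffs = [max(0.0, v1 - v2) ** 2 for v1, v2 in zip(values[:-1], values[1:])]
    return sum(diffs) / len(diffs)


xs = [0.1 * k for k in range(21)]  # grid on [0, 2]
values = monotone_values(xs)
```

Accumulating positive increments on a shared grid keeps the discrete output increasing regardless of quadrature error, which would not be guaranteed if each point were integrated independently.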


3.2. Extending to Monotone Decreasing

For a monotone decreasing constraint, the same three strategies apply with the sign reversed.


(1) Non-positive derivative.


$$ f(x) = f(x_0) - \int_{x_0}^{x} \exp(g(t)) \, dt $$

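(2) Monotone architecture. Apply the monotone increasing architecture to $-f(x)$, that is, negate the output of a network with non-negative weights. Making all weights non-positive does not work in general: two consecutive layers with non-positive weights compose to a non-decreasing map.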


(3) Soft enforcement. Penalize violations of the decreasing condition:


$$ \mathcal{L}_{\mathrm{mono}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \max\left(0, \, f(x_2^i) - f(x_1^i)\right) \right]^2 $$
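
A minimal sketch of the monotone architecture, assuming one hidden layer, weights made non-negative by exponentiating unconstrained raw weights, and `tanh` as the non-decreasing activation. These choices and all parameter values are illustrative. The decreasing version simply negates the increasing network.

```python
import math


def increasing_net(x, raw_w1, b1, raw_w2, b2):
    # exp(raw weight) is non-negative; tanh is a non-decreasing activation.
    hidden = [math.tanh(math.exp(w) * x + b) for w, b in zip(raw_w1, b1)]
    return sum(math.exp(w) * h for w, h in zip(raw_w2, hidden)) + b2


def decreasing_net(x, *params):
    # Monotone decreasing by negating a monotone increasing network.
    return -increasing_net(x, *params)


# Arbitrary parameter values: raw_w1, b1, raw_w2, b2.
params = ([-1.0, 0.3, 2.0], [0.5, -1.0, 0.0], [0.1, -2.0, 0.7], -0.4)
xs = [-2.0 + 0.25 * k for k in range(17)]  # grid on [-2, 2]
inc = [increasing_net(x, *params) for x in xs]
dec = [decreasing_net(x, *params) for x in xs]
```

Biases are left unconstrained, since they shift the function without affecting its direction of change.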

3.3. Summary

Both hard strategies guarantee monotonicity exactly. The non-negative derivative approach is the more flexible of the two, since $g$ is unconstrained, but evaluating $f$ requires an integral, which in practice is computed by numerical quadrature. The monotone architecture approach is simpler to evaluate but more restrictive, as constraining all weights limits the expressive power of the network. Soft enforcement is the easiest to implement but provides no exact guarantee.

4. Boundary and Initial Conditions

For a PDE problem, the output must satisfy prescribed boundary or initial conditions. Instead of penalizing violations, we can enforce them exactly through output parameterization.


4.1. Boundary Conditions

Suppose the constrained output $\hat{v}$ must satisfy $\hat{v}(x) = g(x)$ on $\partial\Omega$, where $g$ is defined (or extended) on all of $\Omega$. We construct it from the raw network output $v$ as


$$ \hat{v}(x) = g(x) + s(x)\,v(x) $$

where $s(x)$ is a masking function satisfying $s(x) = 0$ on $\partial\Omega$ and $s(x) > 0$ inside $\Omega$. Since $s(x)$ vanishes on the boundary, we have $\hat{v}(x) = g(x)$ on $\partial\Omega$ exactly, for any network output $v(x)$.

The choice of masking function depends on the domain geometry. For the unit interval $\Omega = (0, 1)$, the simplest choice is:


$$ s(x) = x(1-x) $$

This satisfies $s(0) = 0$ and $s(1) = 0$ exactly, and $s(x) > 0$ for all $x \in (0, 1)$. For example, if the boundary conditions are homogeneous $g(x) = 0$, the parameterized output becomes:


$$ \hat{v}(x) = x(1-x)\,v(x) $$

The network $v(x)$ can output anything it likes, but the final output always vanishes at both endpoints. For higher-dimensional domains, the masking function generalizes naturally. On the unit square $\Omega = (0,1)^2$:


$$ s(x_1, x_2) = x_1(1-x_1)\,x_2(1-x_2) $$

which vanishes on all four edges by construction.
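
A short sketch of the masked parameterization on the unit interval and the mask on the unit square. The raw network `raw_net` and the boundary extension `g` (with $g(0) = 1$, $g(1) = 3$) are stand-ins chosen for illustration.

```python
import math


def raw_net(x):
    # Stand-in for the unconstrained network v(x).
    return math.cos(5.0 * x) + 2.0


def g(x):
    # Assumed extension of the boundary data into the interior: g(0)=1, g(1)=3.
    return 1.0 + 2.0 * x


def constrained_1d(x):
    # v_hat(x) = g(x) + s(x) v(x) with the mask s(x) = x (1 - x).
    return g(x) + x * (1.0 - x) * raw_net(x)


def mask_2d(x1, x2):
    # Mask for the unit square; vanishes on all four edges.
    return x1 * (1.0 - x1) * x2 * (1.0 - x2)
```

At the endpoints the mask is exactly zero, so `constrained_1d` returns the boundary values whatever `raw_net` produces.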


4.2. Initial Conditions

For time-dependent problems, suppose the constrained output must satisfy $\hat{v}(x, 0) = v_0(x)$. The same idea applies with a temporal masking function:


$$ \hat{v}(x, t) = v_0(x) + \tau(t)\,v(x, t) $$

where $\tau(0) = 0$ and $\tau(t) > 0$ for $t > 0$. A common choice is $\tau(t) = t$. At $t = 0$, the second term vanishes and $\hat{v}(x, 0) = v_0(x)$ exactly.

Other choices of $\tau(t)$ are possible depending on the problem. For example:


$$ \tau(t) = 1 - e^{-\alpha t}, \qquad \alpha > 0 $$

which also satisfies $\tau(0) = 0$ and grows smoothly for $t > 0$. The parameter $\alpha$ controls how quickly the network contribution is switched on relative to the initial condition.
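
The two choices of $\tau(t)$ can be compared in a small sketch. The initial profile `v0`, the raw network `raw_net`, and the value $\alpha = 5$ are illustrative assumptions.

```python
import math


def v0(x):
    # Assumed initial profile.
    return math.sin(math.pi * x)


def raw_net(x, t):
    # Stand-in for the unconstrained network v(x, t).
    return 3.0 * x - t + 1.0


def tau_linear(t):
    return t


def tau_exp(t, alpha=5.0):
    # 1 - exp(-alpha t), written with expm1 for accuracy near t = 0.
    return -math.expm1(-alpha * t)


def constrained(x, t, tau):
    # v_hat(x, t) = v0(x) + tau(t) v(x, t)
    return v0(x) + tau(t) * raw_net(x, t)
```

Unlike $\tau(t) = t$, the exponential mask saturates at 1, so the network contribution is not amplified at large times.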


4.3. Summary

In both cases, the parameterization separates the constraint-carrying part (the known boundary or initial value) from the free part (the network output). The network $v$ is free to learn any function on the interior or for $t > 0$, while the constraint is satisfied by construction at the boundary or at $t = 0$.


5. Conservation Laws and Invariants

Conservation laws require that some quantity computed from the output remains fixed. A common example is a global integral constraint:


$$ \int_\Omega v(x)\,dx = C $$

where $C$ is a prescribed constant such as total mass or total energy.


Soft enforcement adds the violation as a penalty term:


$$ \mathcal{L}(\theta) = \mathcal{L}_{\mathrm{data}}(\theta) + \lambda\left(\int_\Omega v(x)\,dx - C\right)^2 $$

Hard enforcement corrects the raw output by an additive shift:


$$ \hat{v}(x) = v(x) + \frac{C - \int_\Omega v(y)\,dy}{|\Omega|} $$

which guarantees $\int_\Omega \hat{v}(x)\,dx = C$ exactly for any $v$.
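
On a grid, the additive shift can be sketched as follows, assuming a uniform grid on $\Omega = [0, 1]$ and the trapezoidal rule as the discrete integral. The guarantee is then exact with respect to that quadrature rule.

```python
def trapezoid(values, h):
    # Composite trapezoidal rule on a uniform grid with spacing h.
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def conserve_integral(raw, h, target):
    # Shift by (C - integral of v) / |Omega|, with |Omega| = h * (n - 1).
    length = h * (len(raw) - 1)
    shift = (target - trapezoid(raw, h)) / length
    return [v + shift for v in raw]


n = 11
h = 1.0 / (n - 1)                             # grid on Omega = [0, 1]
raw = [(k * h) ** 2 - 0.3 for k in range(n)]  # arbitrary raw output
fixed = conserve_integral(raw, h, target=2.0)
```

Because the trapezoidal rule integrates a constant exactly, the shifted output reproduces the target integral up to rounding.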


When the conserved quantity must also be non-negative, such as a probability density, both positivity and normalization can be enforced simultaneously via the continuous softmax:


$$ \hat{v}(x) = \frac{\exp(v(x))}{\int_\Omega \exp(v(y))\,dy} $$

which guarantees $\hat{v}(x) > 0$ everywhere and $\int_\Omega \hat{v}(x)\,dx = 1$ exactly.
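
A discrete sketch of the continuous softmax, again assuming a uniform grid on $[0, 1]$ and trapezoidal quadrature. Subtracting the maximum before exponentiating is a standard stabilization that leaves the ratio unchanged; the raw values are deliberately large to show why it matters.

```python
import math


def trapezoid(values, h):
    # Composite trapezoidal rule on a uniform grid with spacing h.
    return h * (sum(values) - 0.5 * (values[0] + values[-1]))


def density_from_raw(raw, h):
    # Subtracting max(raw) leaves the ratio unchanged and avoids overflow.
    m = max(raw)
    unnormalized = [math.exp(v - m) for v in raw]
    z = trapezoid(unnormalized, h)
    return [u / z for u in unnormalized]


n = 101
h = 1.0 / (n - 1)
# Values near 1000: a direct math.exp(v) would overflow.
raw = [1000.0 + 5.0 * math.sin(6.0 * k * h) for k in range(n)]
density = density_from_raw(raw, h)
```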



6. Equivariance and Symmetry

Many neural networks are asked to learn mappings that respect symmetries. If we rotate or translate the input, the output should rotate or translate accordingly. Embedding this structure into the network improves data efficiency and guarantees consistent predictions under transformed inputs.


6.1. Equivariance

Let $f$ be a neural network and $T$ a transformation (e.g., rotation, translation, reflection). Equivariance requires that the network commutes with the transformation:


$$ f(T x) = T\,f(x) $$

This means transforming the input and then applying the network gives the same result as applying the network and then transforming the output.

For example, consider a network that predicts the velocity field of a fluid. If we rotate the entire flow configuration by 90 degrees, the predicted velocity field should also rotate by 90 degrees. As another example, if a network predicts the temperature distribution given a heat source, shifting the heat source to the right should shift the entire predicted temperature field by the same amount. In both cases, equivariance guarantees the correct response automatically, without retraining.


There are three main strategies for enforcing equivariance.

(1) Group-equivariant layers build the symmetry directly into the network by constructing weight-sharing patterns that commute with the transformation. The canonical example is the convolution layer, which is equivariant to translations: shifting the input shifts the output by the same amount. For rotational equivariance, group convolutional networks extend this idea by sharing weights across rotated versions of each filter.


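(2) Output symmetrization enforces equivariance by post-processing the raw network output. For a finite group of transformations $\{T_1, T_2, \ldots, T_n\}$ (a set closed under composition and inversion, such as the four rotations by multiples of 90 degrees):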


$$ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} T_i^{-1} f(T_i x) $$

This guarantees exact equivariance for any network $f$ without modifying the architecture, but requires evaluating the network $n$ times per input.


(3) Data augmentation exposes the network to transformed versions of each training example. Given a training pair $(x_i, y_i)$, augmented pairs are generated as $(T x_i,\; T y_i)$ for various transformations $T$. This encourages equivariant behavior without modifying the architecture, but provides only approximate equivariance rather than exact equivariance for every input.


Among the three strategies, group-equivariant layers provide the strongest guarantee. Data augmentation is the easiest to implement but the weakest. Output symmetrization sits in between: it is easy to apply and guarantees exact equivariance, but requires multiple forward passes.
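
Output symmetrization can be sketched for the four rotations by multiples of 90 degrees acting on a square grid. The `raw_model` below is a deliberately non-equivariant stand-in for a network; it and the sample grid are illustrative assumptions.

```python
def rot90(grid):
    # Rotate a square grid (list of rows) by 90 degrees clockwise.
    return [list(row) for row in zip(*grid[::-1])]


def rotate(grid, k):
    # Apply rot90 k times; k = 0..3 enumerates the rotation group C4.
    for _ in range(k % 4):
        grid = rot90(grid)
    return grid


def raw_model(grid):
    # Stand-in for a network that is not equivariant: weights depend on position.
    n = len(grid)
    return [[(1 + i + 2 * j) * grid[i][j] for j in range(n)] for i in range(n)]


def symmetrized(grid):
    # f_hat(x) = (1/4) sum_k T_k^{-1} f(T_k x); the inverse of k turns is 4 - k turns.
    n = len(grid)
    total = [[0.0] * n for _ in range(n)]
    for k in range(4):
        out = rotate(raw_model(rotate(grid, k)), 4 - k)
        for i in range(n):
            for j in range(n):
                total[i][j] += out[i][j] / 4.0
    return total


x = [[1.0, 2.0, 0.0], [0.0, 3.0, 5.0], [4.0, 0.0, 1.0]]
```

Rotating the input of `symmetrized` only permutes the four terms of the sum, which is why closure of the transformation set under composition is needed.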


6.2. Symmetry

Consider a clamped-clamped beam defined on $x \in [-L/2, L/2]$, with the origin at the center. Given a load $q(x)$, the network predicts the deflection $v(x)$.

Now suppose the load is symmetric about the center:


$$ q(-x) = q(x) $$

Then the deflection must also be symmetric:


$$ v(-x) = v(x) $$

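This is a consequence of equivariance under the flip $x \mapsto -x$: flipping the load flips the deflection, so a load that is unchanged by the flip must produce a deflection that is unchanged by it. However, a network $f$ trained without any symmetry constraint may not satisfy this, even for a symmetric load.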


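Applying output symmetrization with the flip $(Rq)(x) = q(-x)$, and writing $f[q](x)$ for the deflection the network predicts at $x$ for load $q$:

  1. Pass the original load $q$ through the network to obtain $f[q](x)$.
  2. Pass the flipped load $Rq$ through the network to obtain $f[Rq](x)$.
  3. Flip the second prediction back and average the two results:

$$ \hat{v}[q](x) = \frac{f[q](x) + f[Rq](-x)}{2} $$

The flip-back in step 3 is the inverse transformation $T_i^{-1}$ of the general formula (a flip is its own inverse). Even if the network $f$ has not learned the symmetry at all, the result is equivariant, $\hat{v}[Rq](x) = \hat{v}[q](-x)$, so a symmetric load $Rq = q$ always yields $\hat{v}(-x) = \hat{v}(x)$. Symmetry of the deflection is guaranteed by construction, at the cost of two forward passes instead of one.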
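
A minimal sketch of flip symmetrization on sampled data, following the general formula $\hat{f}(x) = \frac{1}{n}\sum_i T_i^{-1} f(T_i x)$ from Section 6.1: the prediction for the flipped load is flipped back before averaging. It assumes the load and deflection are sampled on a grid symmetric about $x = 0$, so reversing a list maps $x$ to $-x$; `raw_model` is a stand-in for an untrained network.

```python
def flip(values):
    # Samples on a grid symmetric about x = 0; reversing maps x to -x.
    return values[::-1]


def raw_model(load):
    # Stand-in for an unconstrained network: load samples -> deflection samples.
    # The position-dependent factor breaks left-right symmetry on purpose.
    n = len(load)
    return [(1.0 + 0.5 * i / (n - 1)) * q for i, q in enumerate(load)]


def symmetrized(load):
    # Average f(q) with the flipped-back prediction for the flipped load.
    direct = raw_model(load)
    mirrored = flip(raw_model(flip(load)))
    return [0.5 * (a + b) for a, b in zip(direct, mirrored)]


symmetric_load = [1.0, 2.0, 4.0, 2.0, 1.0]
general_load = [1.0, 0.0, 3.0, 2.0, 5.0]
v_sym = symmetrized(symmetric_load)
```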


7. Invariance and Symmetry

7.1. Invariance

Invariance is the special case where the output is unchanged by the transformation:


$$ f(Tx) = f(x) $$

This is appropriate when the output is a scalar quantity that does not depend on orientation or position. For example, a network that predicts the total energy of a molecule should return the same value regardless of how the molecule is rotated or translated in space, since total energy is a physical scalar. Similarly, a network that classifies an image as "cat" or "not cat" should return the same label whether the image is shifted left or right by a few pixels.

The key distinction is: if the output is a field or a vector that has a direction, equivariance is the right requirement. If the output is a single number with no directional dependence, invariance is the right requirement.


There are three main strategies.

(1) Invariant input features transform the raw input into features that are unchanged by the transformation, before passing them into the network. For example, if the network should be invariant to rotation, one can use pairwise distances between points as input features, since distances do not change under rotation. The network then operates entirely on invariant quantities and its output is automatically invariant.


(2) Pooling over transformations computes the network output for all transformed versions of the input and aggregates them into a single value. For example, to enforce invariance over a finite set of transformations $\{T_1, T_2, \ldots, T_n\}$:


$$ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} f(T_i x) $$

Since the average is taken over all transformations, and the set is assumed to be closed under composition (a group), applying any $T_j$ to the input only reorders the terms, so the result is unchanged. Max pooling can be used in place of averaging:


$$ \hat{f}(x) = \max_{i} \, f(T_i x) $$

(3) Data augmentation exposes the network to transformed versions of each training example. Given a training pair $(x_i, y_i)$ where $y_i$ is a scalar label, augmented pairs are generated as:


$$ (T x_i,\; y_i) $$

Note that unlike the equivariant case, the label $y_i$ is kept unchanged, since invariance means the output should not change when the input is transformed. This encourages invariant behavior without modifying the architecture, but provides only approximate invariance rather than exact invariance for every input.

Among the three strategies, invariant input features provide the strongest and most efficient guarantee since the invariance is built into the representation itself. Pooling over transformations is exact but requires multiple forward passes. Data augmentation is the easiest to implement but the weakest.
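
The first two strategies can be sketched for planar rotations. Pairwise distances serve as invariant input features, and pooling averages a stand-in model over the four rotations by multiples of 90 degrees. The model, the point set, and the choice of rotation group are illustrative assumptions.

```python
import math


def rotate(points, angle):
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def pairwise_distances(points):
    # Rotation- and translation-invariant features (in a fixed pair order).
    return [math.hypot(p[0] - q[0], p[1] - q[1])
            for i, p in enumerate(points) for q in points[i + 1:]]


def raw_model(points):
    # Stand-in for a network that is not invariant: depends on raw coordinates.
    return sum(x + 2.0 * y * y for x, y in points)


def pooled(points, n=4):
    # Average over the n rotations by multiples of 2*pi/n (a finite group).
    return sum(raw_model(rotate(points, 2.0 * math.pi * k / n))
               for k in range(n)) / n


points = [(1.0, 0.0), (0.0, 2.0), (-1.5, 0.5)]
turned = rotate(points, 0.7)
```

Distance features are invariant to any rotation angle, whereas pooling is invariant only to the rotations included in the average.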


7.2. Symmetry

Consider a clamped-clamped beam defined on $x \in [-L/2, L/2]$. Given a load $q(x)$, the network predicts the total strain energy $E$, where $EI$ in the integrand denotes the bending stiffness:


$$ E = \int_{-L/2}^{L/2} \frac{1}{2} EI \left(\frac{d^2 v}{dx^2}\right)^2 dx $$

Now suppose the load is antisymmetric about the center:


$$ q(-x) = -q(x) $$

The deflection $v(x)$ will be antisymmetric as well, $v(-x) = -v(x)$, so flipping the load simply negates both the load and the deflection. The total strain energy is nevertheless unchanged, because it depends on the squared curvature, which is insensitive to the sign of the deflection:


$$ E[q(-x)] = E[q(x)] $$

This is exactly invariance: flipping the load left-to-right does not change the total strain energy. The same holds for any load, not only antisymmetric ones, since flipping the load mirrors the deflection about the center and leaves the energy integral over the symmetric domain unchanged. The output is a scalar with no directional dependence, so it should be the same for both $q(x)$ and $q(-x)$.

A network that predicts total strain energy should therefore satisfy:


$$ f(q(-x)) = f(q(x)) $$

If this is not enforced, the network may predict different energy values for $q(x)$ and $q(-x)$, which is physically inconsistent. Two strategies enforce this invariance by construction; a third, data augmentation, only encourages it.


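(1) Invariant input features. Decompose the load into its symmetric and antisymmetric parts before passing it to the network:


$$ q_s(x) = \frac{q(x) + q(-x)}{2}, \qquad q_a(x) = \frac{q(x) - q(-x)}{2} $$

Flipping the load leaves $q_s$ unchanged and only changes the sign of $q_a$. Features built from $q_s$ and from sign-insensitive functions of $q_a$ (for example the products $q_a(x)\,q_a(y)$) are therefore invariant, and a network operating on them is automatically invariant. Feeding the network $q_s$ alone is also invariant, but it discards $q_a$, which in general stores strain energy: a purely antisymmetric load would be mapped to the zero load.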


(2) Output symmetrization. Pass both the original and flipped load through the network and average the results:


$$ \hat{E} = \frac{f(q(x)) + f(q(-x))}{2} $$

Since the output is a scalar, no inverse transformation is needed after averaging. Even if the network $f$ has not learned the invariance at all, the final output $\hat{E}$ always satisfies $\hat{E}[q(-x)] = \hat{E}[q(x)]$. Invariance is guaranteed by construction, at the cost of two forward passes instead of one.


(3) Data augmentation. Expose the network to both the original and flipped versions of each training example. Given a training pair $(q_i, E_i)$ where $E_i$ is the total strain energy, an augmented pair is generated as:


$$ (q_i(-x),\; E_i) $$

Since the total strain energy is invariant under load flipping, the label $E_i$ is kept unchanged. The network is trained on both $(q_i(x), E_i)$ and $(q_i(-x), E_i)$ simultaneously, which encourages it to produce the same energy prediction for both inputs.

Unlike invariant input features and output symmetrization, data augmentation does not guarantee exact invariance. The network is encouraged to learn the invariance from the training data, but may still produce slightly different predictions for $q(x)$ and $q(-x)$ on unseen examples. It is the easiest strategy to implement, requiring no modification to the network architecture or the inference procedure.
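
Output symmetrization and data augmentation for the scalar energy can be sketched as follows, assuming the load is sampled on a grid symmetric about the center so that reversing the list flips it. The `raw_model` and the label value are made-up placeholders.

```python
def flip(load):
    # Reversing the samples corresponds to q(x) -> q(-x).
    return load[::-1]


def raw_model(load):
    # Stand-in for an unconstrained scalar-output network; not flip-invariant.
    return sum((i + 1) * q * q for i, q in enumerate(load))


def symmetrized_energy(load):
    # E_hat = (f(q) + f(flipped q)) / 2; no inverse transform for a scalar.
    return 0.5 * (raw_model(load) + raw_model(flip(load)))


def augment(dataset):
    # Add (flipped load, same energy label) for every training pair.
    return dataset + [(flip(q), e) for q, e in dataset]


load = [0.0, 1.0, -2.0, 4.0]
dataset = [(load, 3.5)]  # the label 3.5 is a made-up placeholder value
augmented = augment(dataset)
```

The symmetrized prediction is invariant for any `raw_model`, while the augmented dataset only supplies training examples that encourage invariance.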
