AI Writing Tools

Explore the best AI Writing Tools — independent reviews, comparisons, pricing and step-by-step how-to guides, curated by Aizhi.

  • Common Image Generator Interface

    Common Image Generator Interface

    The Common Image Generator Interface (CIGI) (pronounced sig-ee), is an on-the-wire data protocol that allows communication between an Image Generator and its host simulation. The interface is designed to promote a standard way for a host device to communicate with an image generator (IG) within the industry. CIGI enables plug-and-play by standard-compliant image generator vendors and reduces integration costs when upgrading visual systems. == Background == Most high-end simulators do not have everything running on a single machine the way popular home software flight simulators are currently implemented. The airplane model is run on one machine, normally referred to as the host, and the out the window visuals or scene graph program is run on another, usually referred to as an Image Generator (IG). Frequently there are multiple IGs required to display the surrounding environment created by a host. CIGI is the interface between the 'host' and the IGs. The main goal of CIGI is to capitalize on previous investments through the use of a common interface. CIGI is designed to assist suppliers and integrators of IG systems with ease of integration, code reuse, and overall cost reduction. In the past most image generators provided their own proprietary interface; every host had to implement that interface making changing image generators a costly ordeal. CIGI was created to standardize the interface between the host and the image generator so that little modification would be needed to switch image generators. The CIGI initiative was largely spearheaded by The Boeing Company during the early 21st century. The latest version of CIGI (CIGI 4.0) was developed by the Simulation Interoperability Standards Organization (SISO) in the form of SISO-STD-013-2014, Standard for Common Image Generator Interface (CIGI), Version 4.0, dated 22 August 2014. SISO-STD-013-2014 is freely available from SISO. == Definitions == Image generator – In this context an image generator consists of one or more rendering channels that produce an image that can be used to visualize an “Out-The-Window” scene, or images produced by various sensor simulations such as Infra-red, Day TV, electro-optical, and night vision. Host simulation – In this context a “Host” is the computational system that provides information about the device being simulated so that the image generator can portray the correct scenery to the user. This information is passed via CIGI to the image generator. == Maturation == CIGI 4 is the latest version of the standard as was approved by the Simulation Interoperability Standards Organization on August 22, 2014. CIGI became an international SISO standard known as SISO-STD-013-2014; which contains the CIGI version 4.0 Interface Control Document (ICD). CIGI 4.0 is the official standard, published by SISO. Previous versions of CIGI were spearheaded by Boeing include CIGI v3.3, in November 2008, v3.2 April 2006, v3.1 June 2004, v3 November 2003, v2 in March 2002, and the original (v1) in March 2001 == Protocol dependencies == Typically, CIGI uses UDP as its transport protocol, but CIGI does not require a specific transport mechanism, only packet definition conformance. CIGI traffic does not have a well known port; however, the use of ports 8004-8005 has been widely adopted by commercial image generator vendors implementations. == Development tools == === Host Emulator === The Host Emulator can be used as a surrogate to manipulate the interface when a simulation Host is not available. It is a Windows-based image generator Host application used to develop, integrate and test image generators that use the CIGI protocol. It provides a graphical user interface (GUI) for the creation, modification and deletion of entities; manipulation of views; control of environmental attributes and phenomena; and other host functions. The Host Emulator has several features that are useful for integration and testing. A free-flight mode allows for fixed-wing and rotorcraft flight, movement along entity axes and free rotation using a joystick or a joystick-like widget. Scripting and record/playback features support regression testing, demonstrations and other tasks needing exact reproduction of certain sequences of events. A packet-level snoop feature allows the user to examine the contents of CIGI messages, image generator response times and latencies. A Heartbeat Monitor Window shows a graphical timing history of the Image Generator's data frame rate. Other features include explicit packet creation, animation control, missile flyouts and a situation display window (Host Emulator 3.x only). === Multi-Purpose Viewer === The Multi-Purpose Viewer (MPV) provides the basic functionality expected of an Image Generator, such as loading and displaying a terrain database, displaying entities and so forth. The Multi-Purpose Viewer can be used as a surrogate to manipulate the interface when a real Image Generator is not available. The MPV is capable of operating with both the Windows and Linux operating systems. === CIGI Class Library === The CCL is an object-oriented software interface that automatically handles message composition and decomposition (i.e. packing, unpacking and byte swapping to the ICD specification) on both the Host and Image Generator sides of the interface. The CCL interprets Host or Image Generator messages based on compile time parameters. It also performs error handling and translation between different versions of CIGI. Each packet type has its own class. The individual packet members are accessed through packet class accessors. Outgoing messages are constructed by placing each packet into the outgoing buffer using a streaming operator. Incoming messages are parsed using callback or event-based mechanisms that supply the using program with fully populated packet objects. === Current tool suite === A set of CIGI development tools are managed and maintained by the SISO CIGI Product Support Group. The latest packages are available on SourceForge. Comments/Suggestions to the package can be directed to the SISO discussion board at: https://discussions.sisostds.org/index.htm?A0=SAC-PSG-CIGI Archived 2017-09-13 at the Wayback Machine === Wireshark === Wireshark is a free and open source packet analyzer. It is used for network troubleshooting, analysis, software and communications protocol development, and education. Wireshark provides a dissector for CIGI packets. As of October 2016, “The CIGI dissector is fully functional for CIGI version 2 and 3. Version 1 is not yet implemented.” === Older versions of CIGI === A CIGI Interface Control Document (ICD) and development suite is available in open source format. The tools, ICD, and accompanying user documentation can be found and downloaded from the CIGI sourceforge web site. The SourceForge version of the MPV is limited in its support of CIGI data packets and is intended to grow as needs arise. The MPV uses CIGI 3 as its interface, but the MPV is backward-compatible with earlier CIGI versions through the use of the CCL. The MPV uses the Open Scene Graph library to render a scene. The scene graph is manipulated according to the CIGI commands received from the Host via the CCL. The MPV itself is an application layer that consists of a small kernel leveraging heavily on a plug-in architecture for ease of maintainability and flexibility. An implementer can implement the interface from scratch, however a full suite of integration tools is available. These tools consist of three elements. The Host Emulator (HE), the Multi-Purpose Viewer (MPV), and the CIGI Class Library (CCL).

    Read more →
  • Oscillatory neural network

    Oscillatory neural network

    An oscillatory neural network (ONN) is an artificial neural network that uses coupled oscillators as neurons. Oscillatory neural networks are closely linked to the Kuramoto model, and are inspired by the phenomenon of neural oscillations in the brain. Oscillatory neural networks have been trained to recognize images. Complex-Valued Oscillatory network has also been shown to store and retrieve multidimensional aperiodic signals. An oscillatory autoencoder has also been demonstrated, which uses a combination of oscillators and rate-coded neurons. A neuron made of two coupled oscillators, one having a fixed and the other having a tunable natural frequency, has been shown able to run logic gates such as XOR that conventional sigmoid neurons cannot.

    Read more →
  • Multiclass classification

    Multiclass classification

    In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification). For example, deciding on whether an image is showing a banana, peach, orange, or an apple is a multiclass classification problem, with four possible classes (banana, peach, orange, apple), while deciding on whether an image contains an apple or not is a binary classification problem (with the two possible classes being: apple, no apple). While many classification algorithms (e.g., decision trees, k-NN, neural networks and multinomial logistic regression) naturally permit the use of more than two classes, some are by nature binary algorithms (e.g., classical binary support vector machine) and require decomposition strategies such as one-vs-all, one-vs-one, or ECOC to solve multiclass problems. Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance (e.g., predicting that an image contains both an apple and an orange, in the previous example). == Better-than-random multiclass models == From the confusion matrix of a multiclass model, we can determine whether a model does better than chance. Let K ≥ 3 {\displaystyle K\geq 3} be the number of classes, O {\displaystyle {\mathcal {O}}} a set of observations, y ^ : O → { 1 , . . . , K } {\displaystyle {\hat {y}}:{\mathcal {O}}\to \{1,...,K\}} a model of the target variable y : O → { 1 , . . . , K } {\displaystyle y:{\mathcal {O}}\to \{1,...,K\}} and n i , j {\displaystyle n_{i,j}} be the number of observations in the set { y = i } ∩ { y ^ = j } {\displaystyle \{y=i\}\cap \{{\hat {y}}=j\}} . We note n i . = ∑ j n i , j {\displaystyle n_{i.}=\sum _{j}n_{i,j}} , n . j = ∑ i n i , j {\displaystyle n_{.j}=\sum _{i}n_{i,j}} , n = ∑ j n . j = ∑ i n i . {\displaystyle n=\sum _{j}n_{.j}=\sum _{i}n_{i.}} , λ i = n i . n {\displaystyle \lambda _{i}={\frac {n_{i.}}{n}}} and μ j = n . j n {\displaystyle \mu _{j}={\frac {n_{.j}}{n}}} . It is assumed that the confusion matrix ( n i , j ) i , j {\displaystyle (n_{i,j})_{i,j}} contains at least one non-zero entry in each row, that is λ i > 0 {\displaystyle \lambda _{i}>0} for any i {\displaystyle i} . Finally we call "normalized confusion matrix" the matrix of conditional probabilities ( P ( y ^ = j ∣ y = i ) ) i , j = ( n i , j n i . ) i , j {\displaystyle (\mathbb {P} ({\hat {y}}=j\mid y=i))_{i,j}=\left({\frac {n_{i,j}}{n_{i.}}}\right)_{i,j}} . === Intuitive explanation === The lift is a way of measuring the deviation from independence of two events A {\displaystyle A} and B {\displaystyle B} : L i f t ( A , B ) = P ( A ∩ B ) P ( A ) P ( B ) = P ( A ∣ B ) P ( A ) = P ( B ∣ A ) P ( B ) {\displaystyle \mathrm {Lift} (A,B)={\frac {\mathbb {P} (A\cap B)}{\mathbb {P} (A)\mathbb {P} (B)}}={\frac {\mathbb {P} (A\mid B)}{\mathbb {P} (A)}}={\frac {\mathbb {P} (B\mid A)}{\mathbb {P} (B)}}} We have L i f t ( A , B ) > 1 {\displaystyle \mathrm {Lift} (A,B)>1} if and only if events A {\displaystyle A} and B {\displaystyle B} occur simultaneously with a greater probability than if they were independent. In other words, if one of the two events occurs, the probability of observing the other event increases. A first condition to satisfy is to have L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} . And the quality of a model (better or worse than chance) does not change if we over- or undersample the dataset, that is if we multiply each row R i {\displaystyle R_{i}} of the confusion matrix by a constant c i {\displaystyle c_{i}} . Thus the second condition is that the necessary and sufficient conditions for doing better than chance need only depend on the normalized confusion matrix. The condition on lifts can be reformulated with One versus Rest binary models : for any i {\displaystyle i} , we define the binary target variable y i {\displaystyle y_{i}} which is the indicator of event { y = i } {\displaystyle \{y=i\}} , and the binary model y ^ i {\displaystyle {\hat {y}}_{i}} of y i {\displaystyle y_{i}} which is the indicator of event { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} . Each of the y ^ i {\displaystyle {\hat {y}}_{i}} models is a "One versus Rest" model. L i f t ( y = i , y ^ = i ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)} only depends on the events { y = i } {\displaystyle \{y=i\}} and { y ^ = i } {\displaystyle \{{\hat {y}}=i\}} , so merging or not merging the other classes doesn't change its value. We therefore have L i f t ( y = i , y ^ = i ) = L i f t ( y i = 1 , y ^ i = 1 ) {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)=\mathrm {Lift} (y_{i}=1,{\hat {y}}_{i}=1)} and the first condition is that all binary One versus Rest models are better than chance. ==== Example ==== If K = 2 {\displaystyle K=2} and 2 is the class of interest , the normalized confusion matrix is ( s p e c i f i c i t y 1 − s p e c i f i c i t y 1 − s e n s i t i v i t y s e n s i t i v i t y ) {\displaystyle {\begin{pmatrix}\mathrm {specificity} &1-\mathrm {specificity} \\1-\mathrm {sensitivity} &\mathrm {sensitivity} \end{pmatrix}}} and we have L i f t ( y = 1 , y ^ = 1 ) − 1 = P ( y = y ^ = 1 ) λ 1 μ 1 − 1 = n 1 , 1 n n 1. n .1 − 1 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)-1={\frac {\mathbb {P} (y={\hat {y}}=1)}{\lambda _{1}\mu _{1}}}-1={\frac {n_{1,1}n}{n_{1.}n_{.1}}}-1} = n 1 , 1 ( n 1 , 1 + n 1 , 2 + n 2 , 1 + n 2 , 2 ) − ( n 1 , 1 + n 1 , 2 ) ( n 1 , 1 + n 2 , 1 ) n 1. n .1 = n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 n 1. n .1 {\displaystyle ={\frac {n_{1,1}(n_{1,1}+n_{1,2}+n_{2,1}+n_{2,2})-(n_{1,1}+n_{1,2})(n_{1,1}+n_{2,1})}{n_{1.}n_{.1}}}={\frac {n_{1,1}n_{2,2}-n_{1,2}n_{2,1}}{n_{1.}n_{.1}}}} . Thus L i f t ( y = 1 , y ^ = 1 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=1,{\hat {y}}=1)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Similarly, by swapping the roles of 1 and 2, we find that L i f t ( y = 2 , y ^ = 2 ) ≥ 1 ⟺ n 1 , 1 n 2 , 2 − n 1 , 2 n 2 , 1 ≥ 0 {\displaystyle \mathrm {Lift} (y=2,{\hat {y}}=2)\geq 1\iff n_{1,1}n_{2,2}-n_{1,2}n_{2,1}\geq 0} . Dividing by n 1. n 2. {\displaystyle n_{1.}n_{2.}} we find that the necessary and sufficient condition on the normalized confusion matrix is s e n s i t i v i t y s p e c i f i c i t y − ( 1 − s e n s i t i v i t y ) ( 1 − s p e c i f i c i t y ) ≥ 0 ⟺ s e n s i t i v i t y + s p e c i f i c i t y − 1 ≥ 0 ⟺ J ≥ 0 {\displaystyle \mathrm {sensitivity} \ \mathrm {specificity} -(1-\mathrm {sensitivity} )(1-\mathrm {specificity} )\geq 0\iff \mathrm {sensitivity} +\mathrm {specificity} -1\geq 0\iff J\geq 0} . This brings us back to the classical binary condition: Youden's J must be positive (or zero for random models). === Random models === A random model is a model that is independent of the target variable. This property is easily reformulated with the confusion matrix. This proposition shows that the model y ^ {\displaystyle {\hat {y}}} of y {\displaystyle y} is uninformative if and only if there are two families of numbers ( α i ) i {\displaystyle (\alpha _{i})_{i}} and ( β j ) j {\displaystyle (\beta _{j})_{j}} such that P ( { y = i } ∩ { y ^ = j } ) = α i β j {\displaystyle \mathbb {P} (\{y=i\}\cap \{{\hat {y}}=j\})=\alpha _{i}\beta _{j}} for any i {\displaystyle i} and j {\displaystyle j} . === Multiclass likelihood ratios and diagnostic odds ratios === We define generalized likelihood ratios calculated from the normalized confusion matrix: for any i {\displaystyle i} and j ≠ i {\displaystyle j\not =i} , let L R i , j = P ( y ^ = j ∣ y = j ) P ( y ^ = j ∣ y = i ) {\displaystyle \mathrm {LR} _{i,j}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)}}} . When K = 2 {\displaystyle K=2} , if 2 is the class of interest,, we find the classical likelihood ratios L R 1 , 2 = L R + {\displaystyle \mathrm {LR} _{1,2}=\mathrm {LR} _{+}} and L R 2 , 1 = 1 L R − {\displaystyle \mathrm {LR} _{2,1}={\frac {1}{\mathrm {LR} _{-}}}} . Multiclass diagnostic odds ratios can also be defined using the formula D O R i , j = D O R j , i = L R i , j L R j , i = n i , i n j , j n i , j n j , i = P ( y ^ = j ∣ y = j ) / P ( y ^ = i ∣ y = j ) P ( y ^ = j ∣ y = i ) / P ( y ^ = i ∣ y = i ) {\displaystyle \mathrm {DOR} _{i,j}=\mathrm {DOR} _{j,i}=\mathrm {LR} _{i,j}\mathrm {LR} _{j,i}={\frac {n_{i,i}n_{j,j}}{n_{i,j}n_{j,i}}}={\frac {\mathbb {P} ({\hat {y}}=j\mid y=j)/\mathbb {P} ({\hat {y}}=i\mid y=j)}{\mathbb {P} ({\hat {y}}=j\mid y=i)/\mathbb {P} ({\hat {y}}=i\mid y=i)}}} We saw above that a better-than-chance model (or a random model) must verify L i f t ( y = i , y ^ = i ) ≥ 1 {\displaystyle \mathrm {Lift} (y=i,{\hat {y}}=i)\geq 1} for any i {\displaystyle i} and λ i {\displaystyle \lambda _{i}} . According to the previous corollary, likelihood ratios are thus greater

    Read more →
  • Logic learning machine

    Logic learning machine

    Logic learning machine (LLM) is a machine learning method based on the generation of intelligible rules. LLM is an efficient implementation of the Switching Neural Network (SNN) paradigm, developed by Marco Muselli, Senior Researcher at the Italian National Research Council CNR-IEIIT in Genoa. LLM has been employed in many different sectors, including the field of medicine (orthopedic patient classification, DNA micro-array analysis and Clinical Decision Support Systems), financial services and supply chain management. == History == The Switching Neural Network approach was developed in the 1990s to overcome the drawbacks of the most commonly used machine learning methods. In particular, black box methods, such as multilayer perceptron and support vector machine, had good accuracy but could not provide deep insight into the studied phenomenon. On the other hand, decision trees were able to describe the phenomenon but often lacked accuracy. Switching Neural Networks made use of Boolean algebra to build sets of intelligible rules able to obtain very good performance. In 2014, an efficient version of Switching Neural Network was developed and implemented in the Rulex suite with the name Logic Learning Machine. Also, an LLM version devoted to regression problems was developed. == General == Like other machine learning methods, LLM uses data to build a model able to perform a good forecast about future behaviors. LLM starts from a table including a target variable (output) and some inputs and generates a set of rules that return the output value y {\displaystyle y} corresponding to a given configuration of inputs. A rule is written in the form: if premise then consequence where consequence contains the output value whereas premise includes one or more conditions on the inputs. According to the input type, conditions can have different forms: for categorical variables the input value must be in a given subset: x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} . for ordered variables the condition is written as an inequality or an interval: x 2 ≤ α {\displaystyle x_{2}\leq \alpha } or β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } A possible rule is therefore in the form if x 1 ∈ { A , B , C , . . . } {\displaystyle x_{1}\in \{A,B,C,...\}} AND x 2 ≤ α {\displaystyle x_{2}\leq \alpha } AND β ≤ x 3 ≤ γ {\displaystyle \beta \leq x_{3}\leq \gamma } then y = y ¯ {\displaystyle y={\bar {y}}} == Types == According to the output type, different versions of the Logic Learning Machine have been developed: Logic Learning Machine for classification, when the output is a categorical variable, which can assume values in a finite set Logic Learning Machine for regression, when the output is an integer or real number.

    Read more →
  • Apertus (LLM)

    Apertus (LLM)

    Apertus is a public large language model, developed by the Swiss AI Initiative (a collaboration between EPFL, ETH Zurich, and the Swiss National Supercomputing Centre). It was released on September 2, 2025, under the free and open-source Apache 2.0 license. Designed initially for business and research use cases around the world, Apertus was trained on over 1800 languages, and comes in 8 billion or 70 billion parameter versions and is available on Hugging Face for download. The model was developed aiming to adhere to European copyright law, and is one of the first examples of AI as a public good in the vein of AI Sovereignty. It is also the first large model to comply with the European Union's Artificial Intelligence Act. At its launch, the model creators emphasized multilinguality, transparency, and auditability as priorities in contrast to commercial frontier model. While international reception was largely positive, the first iteration was significantly behind the capabilities of frontier models and needs adaptation for many use cases with chatbots being a secondary but not a primary use case. As of late 2025, it was considered the largest and most capable fully open model. The capability of future models will depend in part on how much more funding can be secured.

    Read more →
  • Dynamic Bayesian network

    Dynamic Bayesian network

    A dynamic Bayesian network (DBN) is a Bayesian network (BN) which relates variables to each other over adjacent time steps. == History == A dynamic Bayesian network (DBN) is often called a "two-timeslice" BN (2TBN) because it says that at any point in time T, the value of a variable can be calculated from the internal regressors and the immediate prior value (time T-1). DBNs were developed by Paul Dagum in the early 1990s at Stanford University's Section on Medical Informatics. Dagum developed DBNs to unify and extend traditional linear state-space models such as Kalman filters, linear and normal forecasting models such as ARMA and simple dependency models such as hidden Markov models into a general probabilistic representation and inference mechanism for arbitrary nonlinear and non-normal time-dependent domains. Today, DBNs are common in robotics, and have shown potential for a wide range of data mining applications. For example, they have been used in speech recognition, digital forensics, protein sequencing, and bioinformatics. DBN is a generalization of hidden Markov models and Kalman filters. DBNs are conceptually related to probabilistic Boolean networks and can, similarly, be used to model dynamical systems at steady-state.

    Read more →
  • Population model (evolutionary algorithm)

    Population model (evolutionary algorithm)

    The population model of an evolutionary algorithm (EA) describes the structural properties of its population to which its members are subject. A population is the set of all proposed solutions of an EA considered in one iteration, which are also called individuals according to the biological role model. The individuals of a population can generate further individuals as offspring with the help of the genetic operators of the procedure. The simplest and widely used population model in EAs is the global or panmictic model, which corresponds to an unstructured population. It allows each individual to choose any other individual of the population as a partner for the production of offspring by crossover, whereby the details of the selection are irrelevant as long as the fitness of the individuals plays a significant role. Due to global mate selection, the genetic information of even slightly better individuals can prevail in a population after a few generations (iteration of an EA), provided that no better other offspring have emerged in this phase. If the solution found in this way is not the optimum sought, that is called premature convergence. This effect can be observed more often in panmictic populations. In nature global mating pools are rarely found. What prevails is a certain and limited isolation due to spatial distance. The resulting local neighbourhoods initially evolve independently and mutants have a higher chance of persisting over several generations. As a result, genotypic diversity in the gene pool is preserved longer than in a panmictic population. It is therefore obvious to divide the previously global population by substructures. Two basic models were introduced for this purpose, the island models, which are based on a division of the population into fixed subpopulations that exchange individuals from time to time, and the neighbourhood models, which assign individuals to overlapping neighbourhoods, also known as cellular genetic or evolutionary algorithms (cGA or cEA). The associated division of the population also suggests a corresponding parallelization of the procedure. For this reason, the topic of population models is also frequently discussed in the literature in connection with the parallelization of EAs. == Island models == In the island model, also called the migration model or coarse grained model, evolution takes place in strictly divided subpopulations. These can be organised panmictically, but do not have to be. From time to time an exchange of individuals takes place, which is called migration. The time between an exchange is called an epoch and its end can be triggered by various criteria: E.g. after a given time or given number of completed generations, or after the occurrence of stagnation. Stagnation can be detected, for example, by the fact that no fitness improvement has occurred in the island for a given number of generations. Island models introduce a variety of new strategy parameters: Number of subpopulations Size of the subpopulations Neighbourhood relations between islands: they determine which islands are considered neighbouring and can thus exchange individuals, see picture of a simple unidirectional ring (black arrows) and its extension by additional bidirectional neighbourhood relations (additional green arrows) Criteria for the termination of an epoch, synchronous or asynchronous migration Migration rate: number or proportion of individuals involved in migration. Migrant selection: There are many alternatives for this. E.g. the best individuals can replace the worst or randomly selected ones. Depending on the migration rate, this can affect one or more individuals at a time. With these parameters, the selection pressure can be influenced to a considerable extent. For example, it increases with the interconnectedness of the islands and decreases with the number of subpopulations or the epoch length. == Neighbourhood models or cellular evolutionary algorithms == The neighbourhood model, also called diffusion model or fine grained model, defines a topological neighbouhood relation between the individuals of a population that is independent of their phenotypic properties. The fundamental idea of this model is to provide the EA population with a special structure defined as a connected graph, in which each vertex is an individual that communicates with its nearest neighbours. Particularly, individuals are conceptually set in a toroidal mesh, and are only allowed to recombine with close individuals. This leads to a kind of locality known as isolation by distance. The set of potential mates of an individual is called its neighbourhood or deme. The adjacent figure illustrates that by showing two slightly overlapping neighbourhoods of two individuals marked yellow, through which genetic information can spread between the two demes. It is known that in this kind of algorithm, similar individuals tend to cluster and create niches that are independent of the deme boundaries and, in particular, can be larger than a deme. There is no clear borderline between adjacent groups, and close niches could be easily colonized by competitive ones and maybe merge solution contents during this process. Simultaneously, farther niches can be affected more slowly. EAs with this type of population are also well known as cellular EAs (cEA) or cellular genetic algorithms (cGA). A commonly used structure for arranging the individuals of a population is a 2D toroidal grid, although the number of dimensions can be easily extended (to 3D) or reduced (to 1D, e.g. a ring, see the figure on the right). The neighbourhood of a particular individual in the grid is defined in terms of the Manhattan distance from it to others in the population. In the basic algorithm, all the neighbourhoods have the same size and identical shapes. The two most commonly used neighbourhoods for two-dimensional cEAs are L5 and C9, see the figure on the left. Here, L stands for Linear while C stands for Compact. Each deme represents a panmictic subpopulation within which mate selection and the acceptance of offspring takes place by replacing the parent. The rules for the acceptance of offspring are local in nature and based on the neighbourhood: for example, it can be specified that the best offspring must be better than the parent being replaced or, less strictly, only better than the worst individual in the deme. The first rule is elitist and creates a higher selective pressure than the second non-elitist rule. In elitist EAs, the best individual of a population always survives. In this respect, they deviate from the biological model. The overlap of the neighbourhoods causes a mostly slow spread of genetic information across the neighbourhood boundaries, hence the name diffusion model. A better offspring now needs more generations than in panmixy to spread in the population. This promotes the emergence of local niches and their local evolution, thus preserving genotypic diversity over a longer period of time. The result is a better and dynamic balance between breadth and depth search adapted to the search space during a run. Depth search takes place in the niches and breadth search in the niche boundaries and through the evolution of the different niches of the whole population. For the same neighbourhood size, the spread of genetic information is larger for elongated figures like L9 than for a block like C9, and again significantly larger than for a ring. This means that ring neighbourhoods are well suited for achieving high quality results, even if this requires comparatively long run times. On the other hand, if one is primarily interested in fast and good, but possibly suboptimal results, 2D topologies are more suitable. == Comparison == When applying both population models to genetic algorithms, evolutionary strategy and other EAs, the splitting of a total population into subpopulations usually reduces the risk of premature convergence and leads to better results overall more reliably and faster than would be expected with panmictic EAs. Island models have the disadvantage compared to neighbourhood models that they introduce a large number of new strategy parameters. Despite the existing studies on this topic in the literature, a certain risk of unfavourable settings remains for the user. With neighbourhood models, on the other hand, only the size of the neighbourhood has to be specified and, in the case of the two-dimensional model, the choice of the neighbourhood figure is added. == Parallelism == Since both population models imply population partitioning, they are well suited as a basis for parallelizing an EA. This applies even more to cellular EAs, since they rely only on locally available information about the members of their respective demes. Thus, in the extreme case, an independent execution thread can be assigned to each individual, so that the entire cEA can run on a parallel hardware platform. The island model also supports p

    Read more →
  • Bayesian hierarchical modeling

    Bayesian hierarchical modeling

    Bayesian hierarchical modelling is a statistical model written in multiple levels (hierarchical form) that estimates the posterior distribution of model parameters using the Bayesian method. The sub-models combine to form the hierarchical model, and Bayes' theorem is used to integrate them with the observed data and account for all the uncertainty that is present. This integration enables calculation of updated posterior over the (hyper)parameters, effectively updating prior beliefs in light of the observed data. Frequentist statistics may yield conclusions seemingly incompatible with those offered by Bayesian statistics due to the Bayesian treatment of the parameters as random variables and its use of subjective information in establishing assumptions on these parameters. As the approaches answer different questions the formal results are not technically contradictory but the two approaches disagree over which answer is relevant to particular applications. Bayesians argue that relevant information regarding decision-making and updating beliefs cannot be ignored and that hierarchical modeling has the potential to overrule classical methods in applications where respondents give multiple observational data. Moreover, the model has proven to be robust, with the posterior distribution less sensitive to the more flexible hierarchical priors. Hierarchical modeling, as its name implies, retains nested data structure, and is used when information is available at several different levels of observational units. For example, in epidemiological modeling to describe infection trajectories for multiple countries, observational units are countries, and each country has its own time-based profile of daily infected cases. In decline curve analysis to describe oil or gas production decline curve for multiple wells, observational units are oil or gas wells in a reservoir region, and each well has each own time-based profile of oil or gas production rates (usually, barrels per month). Hierarchical modeling is used to devise computation based strategies for multiparameter problems. == Philosophy == Statistical methods and models commonly involve multiple parameters that can be regarded as related or connected in such a way that the problem implies a dependence of the joint probability model for these parameters. Individual degrees of belief, expressed in the form of probabilities, come with uncertainty. Amidst this is the change of the degrees of belief over time. As was stated by Professor José M. Bernardo and Professor Adrian F. Smith, "The actuality of the learning process consists in the evolution of individual and subjective beliefs about the reality." These subjective probabilities are more directly involved in the mind rather than the physical probabilities. Hence, it is with this need of updating beliefs that Bayesians have formulated an alternative statistical model which takes into account the prior occurrence of a particular event. == Bayes' theorem == The assumed occurrence of a real-world event will typically modify preferences between certain options. This is done by modifying the degrees of belief attached, by an individual, to the events defining the options. Suppose in a study of the effectiveness of cardiac treatments, with the patients in hospital j having survival probability θ j {\displaystyle \theta _{j}} , the survival probability will be updated with the occurrence of y, the event in which a controversial serum is created which, as believed by some, increases survival in cardiac patients. In order to make updated probability statements about θ j {\displaystyle \theta _{j}} , given the occurrence of event y, we must begin with a model providing a joint probability distribution for θ j {\displaystyle \theta _{j}} and y. This can be written as a product of the two distributions that are often referred to as the prior distribution P ( θ ) {\displaystyle P(\theta )} and the sampling distribution P ( y ∣ θ ) {\displaystyle P(y\mid \theta )} respectively: P ( θ , y ) = P ( θ ) P ( y ∣ θ ) {\displaystyle P(\theta ,y)=P(\theta )P(y\mid \theta )} Using the basic property of conditional probability, the posterior distribution will yield: P ( θ ∣ y ) = P ( θ , y ) P ( y ) = P ( y ∣ θ ) P ( θ ) P ( y ) {\displaystyle P(\theta \mid y)={\frac {P(\theta ,y)}{P(y)}}={\frac {P(y\mid \theta )P(\theta )}{P(y)}}} This equation, showing the relationship between the conditional probability and the individual events, is known as Bayes' theorem. This simple expression encapsulates the technical core of Bayesian inference which aims to deconstruct the probability, P ( θ ∣ y ) {\displaystyle P(\theta \mid y)} , relative to solvable subsets of its supportive evidence. == Exchangeability == The usual starting point of a statistical analysis is the assumption that the n values y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} are exchangeable. If no information – other than data y – is available to distinguish any of the θ j {\displaystyle \theta _{j}} 's from any others, and no ordering or grouping of the parameters can be made, one must assume symmetry of prior distribution parameters. This symmetry is represented probabilistically by exchangeability. Generally, it is useful and appropriate to model data from an exchangeable distribution as independently and identically distributed, given some unknown parameter vector θ {\displaystyle \theta } , with distribution P ( θ ) {\displaystyle P(\theta )} . === Finite exchangeability === For a fixed number n, the set y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} is exchangeable if the joint probability P ( y 1 , y 2 , … , y n ) {\displaystyle P(y_{1},y_{2},\ldots ,y_{n})} is invariant under permutations of the indices. That is, for every permutation π {\displaystyle \pi } or ( π 1 , π 2 , … , π n ) {\displaystyle (\pi _{1},\pi _{2},\ldots ,\pi _{n})} of (1, 2, …, n), P ( y 1 , y 2 , … , y n ) = P ( y π 1 , y π 2 , … , y π n ) . {\displaystyle P(y_{1},y_{2},\ldots ,y_{n})=P(y_{\pi _{1}},y_{\pi _{2}},\ldots ,y_{\pi _{n}}).} The following is an exchangeable, but not independent and identical (iid), example: Consider an urn with a red ball and a blue ball inside, with probability 1 2 {\displaystyle {\frac {1}{2}}} of drawing either. Balls are drawn without replacement, i.e. after one ball is drawn from the n {\displaystyle n} balls, there will be n − 1 {\displaystyle n-1} remaining balls left for the next draw. Let Y i = { 1 , if the i th ball is red , 0 , otherwise . {\displaystyle {\text{Let }}Y_{i}={\begin{cases}1,&{\text{if the }}i{\text{th ball is red}},\\0,&{\text{otherwise}}.\end{cases}}} The probability of selecting a red ball in the first draw and a blue ball in the second draw is equal to the probability of selecting a blue ball on the first draw and a red on the second, both of which are 1/2: P ( y 1 = 1 , y 2 = 0 ) = P ( y 1 = 0 , y 2 = 1 ) = 1 2 {\displaystyle P(y_{1}=1,y_{2}=0)=P(y_{1}=0,y_{2}=1)={\frac {1}{2}}} . This makes y 1 {\displaystyle y_{1}} and y 2 {\displaystyle y_{2}} exchangeable. But the probability of selecting a red ball on the second draw given that the red ball has already been selected in the first is 0. This is not equal to the probability that the red ball is selected in the second draw, which is 1/2: P ( y 2 = 1 ∣ y 1 = 1 ) = 0 ≠ P ( y 2 = 1 ) = 1 2 {\displaystyle P(y_{2}=1\mid y_{1}=1)=0\neq P(y_{2}=1)={\frac {1}{2}}} . Thus, y 1 {\displaystyle y_{1}} and y 2 {\displaystyle y_{2}} are not independent. If x 1 , … , x n {\displaystyle x_{1},\ldots ,x_{n}} are independent and identically distributed, then they are exchangeable, but the converse is not necessarily true. === Infinite exchangeability === Infinite exchangeability is the property that every finite subset of an infinite sequence y 1 {\displaystyle y_{1}} , y 2 , … {\displaystyle y_{2},\ldots } is exchangeable. For any n, the sequence y 1 , y 2 , … , y n {\displaystyle y_{1},y_{2},\ldots ,y_{n}} is exchangeable. == Hierarchical models == === Components === Bayesian hierarchical modeling makes use of two important concepts in deriving the posterior distribution, namely: Hyperparameters: parameters of the prior distribution Hyperpriors: distributions of Hyperparameters Suppose a random variable Y follows a normal distribution with parameter θ {\displaystyle \theta } as the mean and 1 as the variance, that is Y ∣ θ ∼ N ( θ , 1 ) {\displaystyle Y\mid \theta \sim N(\theta ,1)} . The tilde relation ∼ {\displaystyle \sim } can be read as "has the distribution of" or "is distributed as". Suppose also that the parameter θ {\displaystyle \theta } has a distribution given by a normal distribution with mean μ {\displaystyle \mu } and variance 1, i.e. θ ∣ μ ∼ N ( μ , 1 ) {\displaystyle \theta \mid \mu \sim N(\mu ,1)} . Furthermore, μ {\displaystyle \mu } follows another distribution given, for example, by the standard normal distribution, N ( 0 , 1 ) {\displaystyle {\text{N}}(0,1)} . The parameter μ {\dis

    Read more →
  • Audio-visual speech recognition

    Audio-visual speech recognition

    Audio visual speech recognition (AVSR) is a technique that uses image processing capabilities in lip reading to aid speech recognition systems in recognizing indeterministic phones or giving preponderance among near probability decisions. Each system of lip reading and speech recognition works separately, then their results are mixed at the stage of feature fusion. As the name suggests, it has two parts. First one is the audio part and second one is the visual part. In audio part we use features like log mel spectrogram, mfcc etc. from the raw audio samples and we build a model to get feature vector out of it . For visual part generally we use some variant of convolutional neural network to compress the image to a feature vector after that we concatenate these two vectors (audio and visual ) and try to predict the target object.

    Read more →
  • Sum of absolute differences

    Sum of absolute differences

    In digital image processing, the sum of absolute differences (SAD) is a measure of the similarity between image blocks. It is calculated by taking the absolute difference between each pixel in the original block and the corresponding pixel in the block being used for comparison. These differences are summed to create a simple metric of block similarity, the L1 norm of the difference image or Manhattan distance between two image blocks. The sum of absolute differences may be used for a variety of purposes, such as object recognition, the generation of disparity maps for stereo images, and motion estimation for video compression. == Example == This example uses the sum of absolute differences to identify which part of a search image is most similar to a template image. In this example, the template image is 3 by 3 pixels in size, while the search image is 3 by 5 pixels in size. Each pixel is represented by a single integer from 0 to 9. Template Search image 2 5 5 2 7 5 8 6 4 0 7 1 7 4 2 7 7 5 9 8 4 6 8 5 There are exactly three unique locations within the search image where the template may fit: the left side of the image, the center of the image, and the right side of the image. To calculate the SAD values, the absolute value of the difference between each corresponding pair of pixels is used: the difference between 2 and 2 is 0, 4 and 1 is 3, 7 and 8 is 1, and so forth. Calculating the values of the absolute differences for each pixel, for the three possible template locations, gives the following: Left Center Right 0 2 0 5 0 3 3 3 1 3 7 3 3 4 5 0 2 0 1 1 3 3 1 1 1 3 4 For each of these three image patches, the 9 absolute differences are added together, giving SAD values of 20, 25, and 17, respectively. From these SAD values, it could be asserted that the right side of the search image is the most similar to the template image, because it has the lowest sum of absolute differences as compared to the other two locations. == Comparison to other metrics == === Object recognition === The sum of absolute differences provides a simple way to automate the searching for objects inside an image, but may be unreliable due to the effects of contextual factors such as changes in lighting, color, viewing direction, size, or shape. The SAD may be used in conjunction with other object recognition methods, such as edge detection, to improve the reliability of results. === Video compression === SAD is an extremely fast metric due to its simplicity; it is effectively the simplest possible metric that takes into account every pixel in a block. Therefore, it is very effective for a wide motion search of many different blocks. SAD is also easily parallelizable since it analyzes each pixel separately, making it easily implementable with such instructions as ARM NEON or x86 SSE2. For example, SSE has packed sum of absolute differences instruction (PSADBW) specifically for this purpose. Once candidate blocks are found, the final refinement of the motion estimation process is often done with other slower but more accurate metrics, which better take into account human perception. These include the sum of absolute transformed differences (SATD), the sum of squared differences (SSD), and rate–distortion optimization.

    Read more →
  • Spiking neural network

    Spiking neural network

    Spiking neural networks (SNNs) are artificial neural networks (ANN) that mimic natural neural networks. These models leverage timing of discrete spikes as the main information carrier. In addition to neuronal and synaptic state, SNNs incorporate the concept of time into their operating model. The idea is that neurons in the SNN do not transmit information at each propagation cycle (as it happens with typical multi-layer perceptron networks), but rather transmit information only when a membrane potential—an intrinsic quality of the neuron related to its membrane electrical charge—reaches a specific value, called the threshold. When the membrane potential reaches the threshold, the neuron fires, and generates a signal that travels to other neurons which, in turn, increase or decrease their potentials in response to this signal. A neuron model that fires at the moment of threshold crossing is also called a spiking neuron model. While spike rates can be considered the analogue of the variable output of a traditional ANN, neurobiology research indicated that high speed processing cannot be performed solely through a rate-based scheme. For example humans can perform an image recognition task requiring no more than 10ms of processing time per neuron through the successive layers (going from the retina to the temporal lobe). This time window is too short for rate-based encoding. The precise spike timings in a small set of spiking neurons also has a higher information coding capacity compared with a rate-based approach. The most prominent spiking neuron model is the leaky integrate-and-fire model. In that model, the momentary activation level (modeled as a differential equation) is normally considered to be the neuron's state, with incoming spikes pushing this value higher or lower, until the state eventually either decays or—if the firing threshold is reached—the neuron fires. After firing, the state variable is reset to a lower value. Various decoding methods exist for interpreting the outgoing spike train as a real-value number, relying on either the frequency of spikes (rate-code), the time-to-first-spike after stimulation, or the interval between spikes. == History == Many multi-layer artificial neural networks are fully connected, receiving input from every neuron in the previous layer and signalling every neuron in the subsequent layer. Although these networks have achieved breakthroughs, they do not match biological networks and do not mimic neurons. The biology-inspired Hodgkin–Huxley model of a spiking neuron was proposed in 1952. This model described how action potentials are initiated and propagated. Communication between neurons, which requires the exchange of chemical neurotransmitters in the synaptic gap, is described in models such as the integrate-and-fire model, FitzHugh–Nagumo model (1961–1962), and Hindmarsh–Rose model (1984). The leaky integrate-and-fire model (or a derivative) is commonly used as it is easier to compute than Hodgkin–Huxley. While the notion of an artificial spiking neural network became popular only in the twenty-first century, studies between 1980 and 1995 supported the concept. The first models of this type of ANN appeared to simulate non-algorithmic intelligent information processing systems. However, the notion of the spiking neural network as a mathematical model was first worked on in the early 1970s. As of 2019 SNNs lagged behind ANNs in accuracy, but the gap is decreasing, and has vanished on some tasks. == Underpinnings == Information in the brain is represented as action potentials (neuron spikes), which may group into spike trains or coordinated waves. A fundamental question of neuroscience is to determine whether neurons communicate by a rate or temporal code. Temporal coding implies that a single spiking neuron can replace hundreds of hidden units on a conventional neural net. SNNs define a neuron's current state as its potential (possibly modeled as a differential equation). An input pulse causes the potential to rise and then gradually decline. Encoding schemes can interpret these pulse sequences as a number, considering pulse frequency and pulse interval. Using the precise time of pulse occurrence, a neural network can consider more information and offer better computing properties. SNNs compute in the continuous domain. Such neurons test for activation only when their potentials reach a certain value. When a neuron is activated, it produces a signal that is passed to connected neurons, accordingly raising or lowering their potentials. The SNN approach produces a continuous output instead of the binary output of traditional ANNs. Pulse trains are not easily interpretable, hence the need for encoding schemes. However, a pulse train representation may be more suited for processing spatiotemporal data (or real-world sensory data classification). SNNs connect neurons only to nearby neurons so that they process input blocks separately (similar to CNN using filters). They consider time by encoding information as pulse trains so as not to lose information. This avoids the complexity of a recurrent neural network (RNN). Impulse neurons are more powerful computational units than traditional artificial neurons. SNNs are theoretically more powerful than so called "second-generation networks" defined as ANNs "based on computational units that apply activation function with a continuous set of possible output values to a weighted sum (or polynomial) of the inputs"; however, SNN training issues and hardware requirements limit their use. Although unsupervised biologically inspired learning methods are available such as Hebbian learning and STDP, no effective supervised training method is suitable for SNNs that can provide better performance than second-generation networks. Spike-based activation of SNNs is not differentiable, thus gradient descent-based backpropagation (BP) is not available. SNNs have much larger computational costs for simulating realistic neural models than traditional ANNs. Pulse-coupled neural networks (PCNN) are often confused with SNNs. A PCNN can be seen as a kind of SNN. Researchers are actively working on various topics. The first concerns differentiability. The expressions for both the forward- and backward-learning methods contain the derivative of the neural activation function which is not differentiable because a neuron's output is either 1 when it spikes, and 0 otherwise. This all-or-nothing behavior disrupts gradients and makes these neurons unsuitable for gradient-based optimization. Approaches to resolving it include: resorting to entirely biologically inspired local learning rules for the hidden units translating conventionally trained "rate-based" NNs to SNNs smoothing the network model to be continuously differentiable defining an SG (Surrogate Gradient) as a continuous relaxation of the real gradients The second concerns the optimization algorithm. Standard BP can be expensive in terms of computation, memory, and communication and may be poorly suited to the hardware that implements it (e.g., a computer, brain, or neuromorphic device). Incorporating additional neuron dynamics such as Spike Frequency Adaptation (SFA) is a notable advance, enhancing efficiency and computational power. These neurons sit between biological complexity and computational complexity. Originating from biological insights, SFA offers significant computational benefits by reducing power usage, especially in cases of repetitive or intense stimuli. This adaptation improves signal/noise clarity and introduces an elementary short-term memory at the neuron level, which in turn, improves accuracy and efficiency. This was mostly achieved using compartmental neuron models. The simpler versions are of neuron models with adaptive thresholds, are an indirect way of achieving SFA. It equips SNNs with improved learning capabilities, even with constrained synaptic plasticity, and elevates computational efficiency. This feature lessens the demand on network layers by decreasing the need for spike processing, thus lowering computational load and memory access time—essential aspects of neural computation. Moreover, SNNs utilizing neurons capable of SFA achieve levels of accuracy that rival those of conventional ANNs, while also requiring fewer neurons for comparable tasks. This efficiency streamlines the computational workflow and conserves space and energy, while maintaining technical integrity. High-performance deep spiking neural networks can operate with 0.3 spikes per neuron. == Applications == SNNs can in principle be applied to the same applications as traditional ANNs. In addition, SNNs can model the central nervous system of biological organisms, such as an insect seeking food without prior knowledge of the environment. Due to their relative realism, they can be used to study biological neural circuits. Starting with a hypothesis about the topology of a biological neuronal circuit and its functi

    Read more →
  • Probably approximately correct learning

    Probably approximately correct learning

    In computational learning theory, probably approximately correct (PAC) learning is a framework for mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples. The model was later extended to treat noise (misclassified samples). An important innovation of the PAC framework is the introduction of computational complexity theory concepts to machine learning. In particular, the learner is expected to find efficient functions (time and space requirements bounded to a polynomial of the example size), and the learner itself must implement an efficient procedure (requiring an example count bounded to a polynomial of the concept size, modified by the approximation and likelihood bounds). == Definitions and terminology == In order to give the definition for something that is PAC-learnable, we first have to introduce some terminology. For the following definitions, two examples will be used. The first is the problem of character recognition given an array of n {\displaystyle n} bits encoding a binary-valued image. The other example is the problem of finding an interval that will correctly classify points within the interval as positive and the points outside of the range as negative. Let X {\displaystyle X} be a set called the instance space or the encoding of all the samples. In the character recognition problem, the instance space is X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} . In the interval problem the instance space, X {\displaystyle X} , is the set of all bounded intervals in R {\displaystyle \mathbb {R} } , where R {\displaystyle \mathbb {R} } denotes the set of all real numbers. A concept is a subset c ⊂ X {\displaystyle c\subset X} . One concept is the set of all patterns of bits in X = { 0 , 1 } n {\displaystyle X=\{0,1\}^{n}} that encode a picture of the letter "P". An example concept from the second example is the set of open intervals, { ( a , b ) ∣ 0 ≤ a ≤ π / 2 , π ≤ b ≤ 13 } {\displaystyle \{(a,b)\mid 0\leq a\leq \pi /2,\pi \leq b\leq {\sqrt {13}}\}} , each of which contains only the positive points. A concept class C {\displaystyle C} is a collection of concepts over X {\displaystyle X} . This could be the set of all subsets of the array of bits that are skeletonized 4-connected (width of the font is 1). Let EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} be a procedure that draws an example, x {\displaystyle x} , using a probability distribution D {\displaystyle D} and gives the correct label c ( x ) {\displaystyle c(x)} , that is 1 if x ∈ c {\displaystyle x\in c} and 0 otherwise. Now, given 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} , assume there is an algorithm A {\displaystyle A} and a polynomial p {\displaystyle p} in 1 / ϵ , 1 / δ {\displaystyle 1/\epsilon ,1/\delta } (and other relevant parameters of the class C {\displaystyle C} ) such that, given a sample of size p {\displaystyle p} drawn according to EX ⁡ ( c , D ) {\displaystyle \operatorname {EX} (c,D)} , then, with probability of at least 1 − δ {\displaystyle 1-\delta } , A {\displaystyle A} outputs a hypothesis h ∈ C {\displaystyle h\in C} that has an average error less than or equal to ϵ {\displaystyle \epsilon } on X {\displaystyle X} with the same distribution D {\displaystyle D} . Further if the above statement for algorithm A {\displaystyle A} is true for every concept c ∈ C {\displaystyle c\in C} and for every distribution D {\displaystyle D} over X {\displaystyle X} , and for all 0 < ϵ , δ < 1 {\displaystyle 0<\epsilon ,\delta <1} then C {\displaystyle C} is (efficiently) PAC learnable (or distribution-free PAC learnable). We can also say that A {\displaystyle A} is a PAC learning algorithm for C {\displaystyle C} . == Equivalence == Under some regularity conditions these conditions are equivalent: The concept class C is PAC learnable. The VC dimension of C is finite. C is a uniformly Glivenko-Cantelli class. C is compressible in the sense of Littlestone and Warmuth

    Read more →
  • Sherwood Applied Business Security Architecture

    Sherwood Applied Business Security Architecture

    SABSA (Sherwood Applied Business Security Architecture) is a model and methodology for developing a risk-driven enterprise information security architecture and service management, to support critical business processes. It was developed independently from the Zachman Framework, but has a similar structure. The primary characteristic of the SABSA model is that everything must be derived from an analysis of the business requirements for security, especially those in which security has an enabling function through which new business opportunities can be developed and exploited. The process analyzes the business requirements at the outset, and creates a chain of traceability through the strategy and concept, design, implementation, and ongoing ‘manage and measure’ phases of the lifecycle to ensure that the business mandate is preserved. Framework tools created from practical experience further support the whole methodology. The model is layered, with the top layer being the business requirements definition stage. At each lower layer a new level of abstraction and detail is developed, going through the definition of the conceptual architecture, logical services architecture, physical infrastructure architecture and finally at the lowest layer, the selection of technologies and products (component architecture). The SABSA model itself is generic and can be the starting point for any organization, but by going through the process of analysis and decision-making implied by its structure, it becomes specific to the enterprise, and is finally highly customized to a unique business model. It becomes in reality the enterprise security architecture, and it is central to the success of a strategic program of information security management within the organization. SABSA is a particular example of a methodology that can be used both for IT (information technology) and OT (operational technology) environments. == SABSA matrix == Note: The above is the original SABSA Matrix, which is still valid today, but it has been expanded by a comprehensive service management matrix and updated in some detail and terminology areas. In the words of David Lynas, SABSA author, "The SABSA Matrix and the SABSA Service Management Matrix have not been updated since the late 90s. We have redesigned them to deliver the improvements your feedback has requested over the years. We have not fundamentally changed the structure or principles of the matrices (very few elements have changed position) but have focused on terminology update and consistency." The new versions can be downloaded (along with the 2009 revision of the SABSA White Paper and other important documents like the SABSA Certification Roadmap) at the SABSA Members' Web Site.

    Read more →
  • Cross-entropy

    Cross-entropy

    In information theory, the cross-entropy between two probability distributions p {\displaystyle p} and q {\displaystyle q} , over the same underlying set of events, measures the average number of bits needed to identify an event drawn from the set when the coding scheme used for the set is optimized for an estimated probability distribution q {\displaystyle q} , rather than the true distribution p {\displaystyle p} . == Definition == The cross-entropy of the distribution q {\displaystyle q} relative to a distribution p {\displaystyle p} over a given set is defined as follows: H ( p , q ) = − E p ⁡ [ log ⁡ q ] , {\displaystyle H(p,q)=-\operatorname {E} _{p}[\log q],} where E p ⁡ [ ⋅ ] {\displaystyle \operatorname {E} _{p}[\cdot ]} is the expected value operator with respect to the distribution p {\displaystyle p} . The definition may be formulated using the Kullback–Leibler divergence D K L ( p ∥ q ) {\displaystyle D_{\mathrm {KL} }(p\parallel q)} , divergence of p {\displaystyle p} from q {\displaystyle q} (also known as the relative entropy of p {\displaystyle p} with respect to q {\displaystyle q} ). H ( p , q ) = H ( p ) + D K L ( p ∥ q ) , {\displaystyle H(p,q)=H(p)+D_{\mathrm {KL} }(p\parallel q),} where H ( p ) {\displaystyle H(p)} is the entropy of p {\displaystyle p} . For discrete probability distributions p {\displaystyle p} and q {\displaystyle q} with the same support X {\displaystyle {\mathcal {X}}} , this means The situation for continuous distributions is analogous. We have to assume that p {\displaystyle p} and q {\displaystyle q} are absolutely continuous with respect to some reference measure r {\displaystyle r} (usually r {\displaystyle r} is a Lebesgue measure on a Borel σ-algebra). Let P {\displaystyle P} and Q {\displaystyle Q} be probability density functions of p {\displaystyle p} and q {\displaystyle q} with respect to r {\displaystyle r} . Then − ∫ X P ( x ) log ⁡ Q ( x ) d x = E p ⁡ [ − log ⁡ Q ] , {\displaystyle -\int _{\mathcal {X}}P(x)\,\log Q(x)\,\mathrm {d} x=\operatorname {E} _{p}[-\log Q],} and therefore NB: The notation H ( p , q ) {\displaystyle H(p,q)} is also used for a different concept, the joint entropy of p {\displaystyle p} and q {\displaystyle q} . == Motivation == In information theory, the Kraft–McMillan theorem establishes that any directly decodable coding scheme for coding a message to identify one value x i {\displaystyle x_{i}} out of a set of possibilities { x 1 , … , x n } {\displaystyle \{x_{1},\ldots ,x_{n}\}} can be seen as representing an implicit probability distribution q ( x i ) = ( 1 2 ) ℓ i {\displaystyle q(x_{i})=\left({\frac {1}{2}}\right)^{\ell _{i}}} over { x 1 , … , x n } {\displaystyle \{x_{1},\ldots ,x_{n}\}} , where ℓ i {\displaystyle \ell _{i}} is the length of the code for x i {\displaystyle x_{i}} in bits. Therefore, cross-entropy can be interpreted as the expected message-length per datum when a wrong distribution q {\displaystyle q} is assumed while the data actually follows a distribution p {\displaystyle p} . That is why the expectation is taken over the true probability distribution p {\displaystyle p} and not q . {\displaystyle q.} Indeed the expected message-length under the true distribution p {\displaystyle p} is E p ⁡ [ ℓ ] = − E p ⁡ [ ln ⁡ q ( x ) ln ⁡ ( 2 ) ] = − E p ⁡ [ log 2 ⁡ q ( x ) ] = − ∑ x i p ( x i ) log 2 ⁡ q ( x i ) = − ∑ x p ( x ) log 2 ⁡ q ( x ) = H ( p , q ) . {\displaystyle {\begin{aligned}\operatorname {E} _{p}[\ell ]&=-\operatorname {E} _{p}\left[{\frac {\ln {q(x)}}{\ln(2)}}\right]\\[1ex]&=-\operatorname {E} _{p}\left[\log _{2}{q(x)}\right]\\[1ex]&=-\sum _{x_{i}}p(x_{i})\,\log _{2}q(x_{i})\\[1ex]&=-\sum _{x}p(x)\,\log _{2}q(x)=H(p,q).\end{aligned}}} == Estimation == There are many situations where cross-entropy needs to be measured but the distribution of p {\displaystyle p} is unknown. An example is language modeling, where a model is created based on a training set T {\displaystyle T} , and then its cross-entropy is measured on a test set to assess how accurate the model is in predicting the test data. In this example, p {\displaystyle p} is the true distribution of words in any corpus, and q {\displaystyle q} is the distribution of words as predicted by the model. Since the true distribution is unknown, cross-entropy cannot be directly calculated. In these cases, an estimate of cross-entropy is calculated using the following formula: H ( T , q ) = − ∑ i = 1 N 1 N log 2 ⁡ q ( x i ) {\displaystyle H(T,q)=-\sum _{i=1}^{N}{\frac {1}{N}}\log _{2}q(x_{i})} where N {\displaystyle N} is the size of the test set, and q ( x ) {\displaystyle q(x)} is the probability of event x {\displaystyle x} estimated from the training set. In other words, q ( x i ) {\displaystyle q(x_{i})} is the probability estimate of the model that the i-th word of the text is x i {\displaystyle x_{i}} . The sum is averaged over the N {\displaystyle N} words of the test. This is a Monte Carlo estimate of the true cross-entropy, where the test set is treated as samples from p ( x ) {\displaystyle p(x)} . == Relation to maximum likelihood == The cross entropy arises in classification problems when introducing a logarithm in the guise of the log-likelihood function. This section concerns the estimation of the probabilities of different discrete outcomes. To this end, denote a parametrized family of distributions by q θ {\displaystyle q_{\theta }} , with θ {\displaystyle \theta } subject to the optimization effort. Consider a given finite sequence of N {\displaystyle N} values x i {\displaystyle x_{i}} from a training set, obtained from conditionally independent sampling. The likelihood assigned to any considered parameter θ {\displaystyle \theta } of the model is then given by the product over all probabilities q θ ( X = x i ) {\displaystyle q_{\theta }(X=x_{i})} . Repeated occurrences are possible, leading to equal factors in the product. If the count of occurrences of the value equal to x {\displaystyle x} is denoted by # x {\displaystyle \#x} , then the frequency of that value equals # x / N {\displaystyle \#x/N} . If p ( X = x ) {\displaystyle p(X=x)} is the underlying probability distribution, for large N {\displaystyle N} we expect p ( X = x ) ≈ # x / N {\displaystyle p(X=x)\approx \#x/N} , by the law of large numbers. Writing our likelihood function as the product of observations from the distribution q θ {\displaystyle q_{\theta }} : L ( θ ; x ) = ∏ i q θ ( X = x i ) = ∏ x q θ ( X = x ) # x ≈ ∏ x q θ ( X = x ) N ⋅ p ( X = x ) = exp ⁡ log ⁡ [ ∏ x q θ ( X = x ) N ⋅ p ( X = x ) ] = exp ⁡ ( ∑ x N ⋅ p ( X = x ) log ⁡ q θ ( X = x ) ) , {\displaystyle {\begin{aligned}{\mathcal {L}}(\theta ;{\mathbf {x} })&=\prod _{i}q_{\theta }(X=x_{i})=\prod _{x}q_{\theta }(X=x)^{\#x}\\&\approx \prod _{x}q_{\theta }(X=x)^{N\cdot p(X=x)}=\exp \log \left[\prod _{x}q_{\theta }(X=x)^{N\cdot p(X=x)}\right]\\&=\exp \left(\sum _{x}N\cdot p(X=x)\log q_{\theta }(X=x)^{}\right),\end{aligned}}} where we have used the calculation rules for the logarithm in the final line. Notice how the exponent contains a − H ( p , q θ ) {\displaystyle -H(p,q_{\theta })} term. Taking the logarithm of both sides gives: log ⁡ L ( θ ; x ) = − N ⋅ H ( p , q θ ) . {\displaystyle \log {\mathcal {L}}(\theta ;{\mathbf {x} })=-N\cdot H(p,q_{\theta }).} Since the logarithm is a monotonically increasing function, the maximizing value of θ {\displaystyle \theta } is unaffected by this final step. Similarly, the maximizing value of θ {\displaystyle \theta } is unaffected by the factor of N {\displaystyle N} . So we observe that the likelihood maximization amounts to minimization of the cross-entropy. == Cross-entropy minimization == Cross-entropy minimization is frequently used in optimization and rare-event probability estimation. When comparing a distribution q {\displaystyle q} against a fixed reference distribution p {\displaystyle p} , cross-entropy and KL divergence are identical up to an additive constant (since p {\displaystyle p} is fixed): According to the Gibbs' inequality, both take on their minimal values when p = q {\displaystyle p=q} , which is 0 {\displaystyle 0} for KL divergence, and H ( p ) {\displaystyle \mathrm {H} (p)} for cross-entropy. In the engineering literature, the principle of minimizing KL divergence (Kullback's "Principle of Minimum Discrimination Information") is often called the Principle of Minimum Cross-Entropy (MCE), or Minxent. However, as discussed in the article Kullback–Leibler divergence, sometimes the distribution q {\displaystyle q} is the fixed prior reference distribution, and the distribution p {\displaystyle p} is optimized to be as close to q {\displaystyle q} as possible, subject to some constraint. In this case the two minimizations are not equivalent. This has led to some ambiguity in the literature, with some authors attempting to resolve the inconsistency by restating cross-entropy to be D K L ( p ∥ q ) {\displaystyle D_{\mathrm {KL} }(p\parallel q)} , rather than H (

    Read more →
  • Ordination (statistics)

    Ordination (statistics)

    Ordination or gradient analysis, in multivariate analysis, is a method complementary to data clustering, and used mainly in exploratory data analysis (rather than in hypothesis testing). In contrast to cluster analysis, ordination orders quantities in a (usually lower-dimensional) latent space. In the ordination space, quantities that are near each other share attributes (i.e., are similar to some degree), and dissimilar objects are farther from each other. Such relationships between the objects, on each of several axes or latent variables, are then characterized numerically and/or graphically in a biplot. The first ordination method, principal components analysis, was suggested by Karl Pearson in 1901. == Methods == Ordination methods can broadly be categorized in eigenvector-, algorithm-, or model-based methods. Many classical ordination techniques, including principal components analysis, correspondence analysis (CA) and its derivatives (detrended correspondence analysis, canonical correspondence analysis, and redundancy analysis, belong to the first group). The second group includes some distance-based methods such as non-metric multidimensional scaling, and machine learning methods such as T-distributed stochastic neighbor embedding and nonlinear dimensionality reduction. The third group includes model-based ordination methods, which can be considered as multivariate extensions of Generalized Linear Models. Model-based ordination methods are more flexible in their application than classical ordination methods, so that it is for example possible to include random-effects. Unlike in the aforementioned two groups, there is no (implicit or explicit) distance measure in the ordination. Instead, a distribution needs to be specified for the responses as is typical for statistical models. These and other assumptions, such as the assumed mean-variance relationship, can be validated with the use of residual diagnostics, unlike in other ordination methods. == Applications == Ordination can be used on the analysis of any set of multivariate objects. It is frequently used in several environmental or ecological sciences, particularly plant community ecology. It is also used in genetics and systems biology for microarray data analysis and in psychometrics.

    Read more →