## Parallel Computation with Threshold Functions IAN PARBERRY AND GEORG SCHNITGER\* Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802 Received August 1, 1986; revised February 20, 1987 We study two classes of unbounded fan-in parallel computation, the standard one based on unbounded fan-in ANDs and ORs, and a new class based on unbounded fan-in threshold functions. The latter is motivated by a connectionist model of the brain used in artificial intelligence. We are interested in the resources of time and address complexity. Intuitively, the address complexity of a parallel machine is the number of bits needed to describe an individual piece of hardware. We demonstrate that (for WRAMs and uniform unbounded fan-in circuits) parallel time and address complexity is simultaneously equivalent to alternations and time on an alternating Turing machine (the former to within a constant multiple, and the latter a polynomial). In particular, for constant parallel time, the latter equivalence holds to within a constant multiple. Thus, for example, polynomial-processor, constant-time WRAMs recognize exactly the languages in the logarithmic time hierarchy, and polynomialword-size, constant-time WRAMs recognize exactly the languages in the polynomial time hierarchy. As a corollary, we provide improved simulations of deterministic Turing machines by constant-time shared-memory machines. Furthermore, in the threshold model, the same results hold if we replace the alternating Turing machine with the analogous threshold Turing machine, and replace the resource of alternations with the corresponding resource of thresholds. Threshold parallel computers are much more powerful than the standard models (for example, with only polynomially many processors, they can multiply two integers, compute the parity function and sort in constant time), and appear less amenable to known lower-bound proof techniques. © 1988 Academic Press, Inc. #### 1. INTRODUCTION There has recently been a growing interest in the time complexity of massively parallel computers, that is, parallel machines with an excessively large number of processors and unbounded fan-in. Two machine models which have become popular are the MIMO shared-memory machine, a non-uniform collection of random-access machines which communicate via a shared memory (a MIMO parallel computer may have a different program for each processor, see Flynn [10]), and the unbounded fan-in creat, a nonuniform combinational circuit built from unbounded fan-in AND and OR gates. The important resources for the former are (unit-cost) time and number of processors, and for the later, size (unmber of wires) and depth (maximum length of any path from an input to an 0022-0000/88 \$3.00 <sup>\*</sup> Research supported by NSF Grant DCR-84-07256. output). There is a well-known result [37] which states that simultaneous size and edpth of nonaniform unbounded finis—in circuits is equivalent to simultaneous processors and time on a MIMD shared-memory machine, the former related by a constant multiple, and the latter by a polynomial. Recent research has centered on the number of processors required to obtain constant time on a massively parallal as superpolynomial number of processors is required to a superpolynomial number of processors is required to compute the parity of n bits to constant time on a nonuniform model. Recently, Andrew Yao (43) has shown that $2^{n+1}$ processors are necessary, for some real number $\lambda > 0$ . We provide a matching unper bound in this paper. We wish to characterize surform unbounded fars-in parallel computers, such as the WRAM, a SIMD shared-memory machine with simultaneous writes. Limited characterizations have been attempted, but they appear to fail when the number of processors get too brail. The parallel computers computed in the parallel computers of the interpretation states that fast parallel computers recognize exactly the languages in POLYLOGSPACE. For convenience, we will call this the first parallel computers to such as the surface of the will call this the first parallel computers. One of the parallel computers of the parallel computers of the parallel computers and the parallel computers of comput These parallel computation theses appear to have lost popularity due to the increasing interest in massively parallel computers. But no! 49 cennostrated that aeither the first nor the second parallel computation thesis holds for shared-menory machines with a large number of processors. It was shown in [25] that the first parallel computation thesis holds for shared-memory machines with a large number of processors. It was shown in [25] that otherwise, in particular, constant parallel time is possible for any recursive function if word-size (and number of processors) is large enough. One of our aims into paper is to provide a characterization of unbounded fan-in parallelism which is valid for sub-logarithmic, or even constant parallelism. We will also find that this characterization extends to a wider class of parallel computation. The standard parallel models described above derive their power from the ability to compute umbounded fan-in Boolean ANDs and ORs. We consider parallel computers based on unbounded fan-in throubsold functions, motivated by connectionist models of the brain used in artificial intelligence. Parallel computers based on threshold functions, most estimated properties for example, they can compute the parity function in constanted for the constant of o The remainder of this paper is divided into seven sections. In Section 2 we examine a machine model from artificial intelligence called the *Boltzmann machine*. TABLE I | | | Simulated Machine | | | Simulating Machine | chine | |----------------|-------------------|-------------------|----------------|-------------------|----------------------|---------------------------| | Result | Machine type | Parallel time | Hardware | Machine type | Parallel time | Hardware | | Theorem 2.1 | Beltzmann machine | Time T(s) | Size Z(n) | Threshold circuit | Depth O(T(n)) | Size Z(s)our | | Theorem 6.1 | Minimal TRAM | Time T(n) | Word-size W(n) | A-Tape TTM | Thresholds O(T(n)) | Time O(T(a) W(a)) | | Sorollary 6.2 | Reasonable TRAM | Time T(n) | Word-size W(n) | A-Tape TTM | Thresholds O(T(a)) | Time W(n)(n) | | emma 7.1 | k-Tape DTM | | Time T(n) | Minimal WRAM | Time O(1) | Word-size O(T(n)) | | Pheorem 7.2 | A-Tape TTM | Thresholds H(n) | Time T(n) | Minimal TRAM | Time O(H(n)) | Word-size O(T(n) H(n)) | | Sorellary 7.3 | A-Tape TTM | Thresholds H(n) | Time T(n) | Reasonable TRAM | Time O(H(n) | Word-size O(T(n)?) | | Theorem 8.1 | Minimal WRAM | Time T(n) | Word-size W(n) | A-Tape ATM | Alternations O(T(n)) | | | Jorellary 8.2 | Reasonable WRAM | Time T(n) | Woed-size W(n) | A-Tape ATM | Alternations O(T(n)) | Time W(n)(n) | | Theorem 8.3 | A-Tape ATM | Alternations H(n) | Time T(n) | Minimal WRAM | Time O(H(n)) | Word-size O(T(n) H(n)) | | Corollary 8.4 | A-Tape DTM | | Time T(n) | Minimal WRAM | Time O(1) | Word-size O(T(n)35e* T(n) | | Corollary R.S. | A-Ture ATM | Alternations M(a) | Time T(a) | Bassashie WBAM | Time Others | Want of a Contract) | We define the authorized fine-in threshold circuit, and show that it is equivalent to a deterministic variant of the Boltzmann machine. In Section 3 we define a variant of the popular districts and the section of the popular districts and the section of the popular districts and the section of the popular districts and the section of This paper contains eleven simulations of one resource-bounded parallel machine by another. A brief summary appears in Table I. A preliminary version of this paper has appeared in [26]. #### 2. BOLTZMANN MACHINES AND THRESHOLD CIRCUITS Connectionist models of the brain have recently regained popularity amongst researchers in artificial intelligence. The connectionist model is a parallel system which has a large number of simple processing elements. Computation is performed by increasing or decreasing the strength of communications between physically connected processors. One such model is the Boltzmann machine [1, 18], an undirected graph in which vertices represent processors, and edges links. Each vertex is labelled with a threshold value, and each edge with a weight, both of which are integers. Each processor can be in one of two states, which we will call active and inactive. The computation occurs synchronously as follows. At time t, a processor computes the sum of the weights of the edges connecting it to its active neighbors. That processor is active at time t+1 with probability depending on the difference between that sum and its threshold (with probability tending to zero below the threshold, exactly one-half at the threshold, and tending to one above it). At the start of the computation a distinguished set of input vertices is held in either the active or inactive state, to represent the input in binary. The output is similarly encoded in the states of a distinguished set of output vertices on completion of the computation. The computation is completed when the energy of the system is at a local minimum. The three key properties of this model are that it is probabilistic, it computes using threshold functions, and that the termination condition depends on global energy. We wish to focus on the second property and address the question of how parallel machines which compute using threshold functions differ from previously studed models of parallel computers. This may throw some light on the prodigious computing power of the human brain. We isolate this property by making the model deterministic and simplifying the termination condition. This enables us to avoid certain problems concerned with the determination of the global energy in the more general model [20, 21]. Formally, a deterministic holtzmann machine (hereafter we will drop the adjective 'deterministic') an infinite family of finite machines, $B_k$ , $B_k$ , $B_k$ , one for each input size. Each $B_k$ consists of a directed graph $G_k = (V_k, E_k)$ , distinguished sets of large and output vertices $I_k$ , $O_k \in V_k$ , $I_k$ , $I_k$ , and $I_k$ and suppression $I_k$ , $I_k$ and $I_k$ and a weight a szigmment $w_k$ : $E_k = X$ . Computation on an input $x \in [0, 1]$ "proceeds as follows: At time t = 0, the input processors are placed in states which encode x. All other processors are placed in the inactive state. A processor $e_k$ $V_k$ is a fixed $v_k$ and $$S_v = \big\{ u \in V_n \mid (u,v) \in E_n \text{ and } u \text{ is active at time } t-1 \big\}.$$ The computation terminates when no further change of state occurs in the system. At that time, the output is seneded in the states of the output vertices. A Boltzan machine runs in time T(n) if the maximum number of steps taken by $B_n$ on an input that $G_n$ is zero in size n is bounded above by T(n), and has size Z(n) if $T(B_n | | Z(n))$ . We will that $G_n$ is connected, so that the number of edges is a reasonable measure of size. We also assume that the absolute values of the edge weights (and therefore the thresholds) are bounded above by a polynomial in size, and that $T(n) \le Z(n)$ . A number of features of this model are non-critical; for example, the connection graph can be made acyclic, all weights can be made I, and all threshold values can be made nonequieve. In addition, the threshold functions can be replaced by "exactly equal" functions. Let $B = \{0,1\}$ denote the Boolean set. Define the function $\#_k$ : $B = B = B \text{ follows: } \#_k(x, x_2, x_k) = 1$ iff exactly k of the $x_k^2$ are equal to I. Define an unbounded fain-in threshold circuit to be similar to the standard unbounded fain-in circuit (see, for example, $III_k$ ). All coverplot for the standard equivalent to circuits built from upper-threshold (true if at least K inputs are true) and lower-threshold (true if at the substance) and example $III_k$ . All the substances in size and a simultaneous increase in depth of at most a constant multiple. We will efter to a parallel machine based on upper- and lower-threshold (true if the substance) in the substance sub $$\begin{split} & x_1 \wedge x_2 \wedge \dots \wedge x_s = \#_s(x_1, \dots, x_s) \\ & x_1 \vee x_2 \vee \dots \vee x_s = \#_1(\#_1(x_1, \dots, x_s), \#_2(x_1, \dots, x_s), \dots, \#_s(x_1, \dots, x_s)) \\ & x_1 \oplus x_2 \oplus \dots \oplus x_s = \#_1(\#_1(x_1, \dots, x_s), \#_2(x_1, \dots, x_s), \dots, \#_{s \in (a \bmod 2)-1}(x_1, \dots, x_s)) \\ & & & & & & & & & & & & & & & \\ & & & & & & & & & & & & \\ & & & & & & & & & & & \\ & & & & & & & & & & & \\ & & & & & & & & & & & \\ & & & & & & & & & & & \\ & & & & & & & & & & & \\ & & & & & & & & & & \\ & & & & & & & & & & \\ & & & & & & & & & & \\ & & & & & & & & & & \\ & & & & & & & & & \\ & & & & & & & & & \\ & & & & & & & & & \\ & & & & & & & & & \\ & & & & & & & & & \\ & & & & & & & & \\ & & & & & & & & \\ & & & & & & & \\ & & & & & & & & \\ & & & & & & & & \\ & & & & & & & & \\ & & & & & & & & \\ & & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & & \\ & & & & \\ & & & & & \\ & & & & \\ & & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & & \\ & & & \\ & & & & \\ & &$$ THEOREM 2.1. Size and time on a Boltzmann machine is simultaneously equivalent to size and depth on an unbounded fan-in threshold circuit, size to within a notynomial, and time/depth to within a constant multiple. Proof (Sketch). If $w_1, \dots, w_n \in \mathbb{Z}$ , define the function $\#_k(w_1, \dots, w_n) \colon B^n \to B$ as follows: $\#_k(w_1, \dots, w_n)(x_1, \dots, x_n) = 1$ provided $\sum_{i=1}^n x_i \cdot w_i = k$ (and 0 otherwise). Similarly define $\#_k(w_1, \dots, w_n)$ and $\#_k(w_1, \dots, w_n) \colon \mathbb{Z}^n \to B$ to be 1 iff $\sum_{i=1}^n x_i \cdot w_i \leqslant k$ and $\sum_{i=1}^n x_i \cdot w_i \geqslant k$ . respectively. CLAIM 1. Boltzmann machines based on $\#_k$ are at least as powerful as those based on $\geqslant_k$ . Proof. For all integers k. $$\geqslant_k(w_1, ..., w_n)(x_1, ..., x_n) = \#_k(w_1, ..., w_n)(x_1, ..., x_n) \vee \#_{k+1}(w_1, ..., w_n)(x_1, ..., x_n)$$ $\vee \cdots \vee \#_{\Sigma_{n-1}^n w_n}(w_1, ..., w_n)(x_1, ..., x_n).$ Claim 2. Boltzmann machines based on $\geqslant_k$ are at least as powerful as those based on $\#_k$ . Proof. For all integers k. $$\#_k(w_1,...,w_n)(x_1,...,x_n) = \geqslant_k(w_1,...,w_n)(x_1,...,x_n) \, \wedge \, \leqslant_k(w_1,...,w_n)(x_1,...,x_n)$$ $$\leq_k (w_1, w_2, ..., w_n)(x_1, x_2, ..., x_n) = \geq_{-k} (-w_1, -w_2, ..., -w_n)(x_1, x_2, ..., x_n).$$ Thus #-functions provide a convenient normal-form for Boltzmann machines. It remains to show that these normal-form machines can be "unwound" into circuits. CLAIM 3. All edge-weights can be made positive. *Proof.* Consider a vertex computing $\#_k(w_1, w_2, ..., w_n)(x_1, ..., x_2, ..., x_n)$ . Without loss of generality assume that $w_1, ..., w_r > 0$ and $w_{r+1}, ..., w_n < 0$ . Let $S = \sum_{l=1}^r w_l$ . First construct S - k + 1 vertices computing $$\#_r(w_1, w_2, ..., w_r)(x_1, x_2, ..., x_r)$$ for $k \le r \le S$ and S-k vertices computing $$\#_{s}(-w_{t+1}, -w_{t+2}, ..., -w_{n})(x_{t+1}, x_{t+2}, ..., x_{n})$$ for $1 \le s \le S - k$ . Finally, AND together all pairs of #, and #, pairs with r-s=k, and OR together these ANDs. $\blacksquare$ If we allow the connection graph to have multiple edges, we have CLAIM 4. All edge-weights can be made 1. *Proof.* Replace each edge of weight w from vertex u to vertex v by w edges of weight 1. CLAIM 5. The graph can be "unwound" into a circuit. Proof. Uses standard techniques similar to those used in [16, 33]. Multiple edges in the graph can easily be handled using additional fan-out. Each of these transformations increases the size by a polynomial and the depth by a constant multiple. It is also interesting to observe that combinatorial circuits built from "majority gates are also equivalent to Boltzmann machines. Let $\mathrm{hal}(x_1,\dots,x_s)=\mu_{s\geq s}(x_1,\dots,x_s)$ for s>1. Then if k<(n/2), $\#_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_{s}(x_1,\dots,x_s)=\mu_$ Even polynomial-size uniform threshold circuits appear to be extremely powerful. For example, we have seen that they can compute the parity of n bits in constant time. Chandra, Stockmeyer, and Vishkin [6] observe that they can also sort n opplynomiable tinegers, and multiply two n-bit integers in constant time. They can also multiply two n-w matrices of polynomiable tinegers in constant time. They can also multiply two n-w matrices of polynomiable tinegers in constant time. We have chosen to remove the probabilism from the removed from nonuniform Boltzmann machines without increasing their size by more than a polynomial, and their running time by more than a constant multiple [27]. We have chosen to model the brain as an unbounded fan-in computer. It may argued that this is an unreasonable low-level model, since the brain appears to have around 10th enough a control of the order of the property th ### 3. THE SHARED-MEMORY MODEL The shared-memory machine is a popular parallel model used by complexity theorists. Informally, it consists of a large number of powerful processors which communicate via a shared memory. Each processor possesses an infinite number of a general purpose registers $x_0$ , $x_1$ , $x_2$ , each of which can hold a single integrate an unique read-only processor identity register PID which is preset to i in the ith processor, i eN V6 denotes the set of nonnegative integres. The shared memory consists of an infinite number of cells $x_0$ , $x_1$ , $x_2$ , $x_3$ , $x_4$ | $r_i \leftarrow \text{constant}$ | (load register with constant) | |----------------------------------|------------------------------------| | $r_i \leftarrow r_j \circ r_k$ | (binary operation) | | $r_i \leftarrow r_{r_i}$ | (indirect load) | | $r_n \leftarrow r_j$ | (indirect store) | | $r_i \leftarrow PID$ | (store PID) | | halt | (end execution) | | goto $m$ if $r_i \ge 0$ | (conditional transfer of control). | Communication instructions have the form: $$r_i \leftarrow s_{r_i}$$ (read) $s_{r_i} \leftarrow r_i$ (write). So far, we have left the binary operation "o" unspecified. In particular, we will be interested in two types of instruction-sets. The minimal instruction-set allows integer addition and subtraction, and logical shifts: $$r_i \leftarrow r_j \pm r_k$$ (addition) $r_i \leftarrow \lfloor r_j^* 2^{r_k} \rfloor$ (logical shift). Note that in the second instruction, $\epsilon_n$ may be either positive (corresponding to a felt-shift) or negative (corresponding to a felt-shift). The general instruction-set includes any instruction which can be simulated by a k-tape deterministic Turing machine in polynomial time. (That is, k-ket-ped terministic Turing machine can, when given as input the m-bit binary representations of the operands, compute the binary representation of the result in time $m^{(0,1)}$ . More formally, each machine is specified by a program M and a processor bound P(n). A computation proceeds roughly as follows. Suppose $x \in \mathbb{Z}^n$ , $x \in \langle x_1, x_2, ..., x_n \rangle$ . Each $x_i$ is called an *input symbol*. The symbol $x_i$ is placed into whared-memory location $i_i$ for $1 \le i \le n$ . All other memory locations and general purpose registers are set to zero. Processors $0, 1_i, -p(n) = 1$ are activated simultaneously; they synchronously execute the program M. When all processors have halted, the output is to be found in some specified place in the shared memory. In particular, single-integer outputs are found in shared memory location $\delta_n$ , a shared-memory machine acts as an acceptor if the inputs are restricted to a finite alphabet (encoded as integers in the obvious fashion), and at the end of a computation, shared memory location $\delta_n$ contains either 1 = 0 t, indicating acceptance or rejection of the impat respectively. We will comider several different particular $\delta_n$ in - The PRAM model. Simultaneous memory access is forbidden. In the following models, simultaneous access to individual shared-memory locations is allowed. Any number of processors may simultaneously read from any sharedmemory location. We will be interested primarily in two different conventions for dealing with simultaneous writes. - The WRAM model. If several processors are attempting to write into a single shared-memory cell, then the smallest-numbered processor succeeds. All other data is lost. - 3. The TRAM model. Suppose several processors are attempting to write into a shared-memory location which currently contains the value k. If exactly k of them are attempting to write a nonzero value, then the smallest-numbered processor succeeds. Otherwise the shared-memory location takes on the value zero. The processor bound P(n) is a measure of the number of processors used as a function of input size. The machine is said to have worksize W(n) if the maximum value in any register or sharted memory location during any computation on an imput of size n has absolute value local $\Sigma^{(0)}$ . Note that this includes the inputs, we will concentrate on parallel medis and $\Sigma^{(0)}$ . Note that this include the inputs, will concentrate on parallel medis when $\Sigma^{(0)}$ is $\Sigma^{(0)}$ is $\Sigma^{(0)}$ in is a reasonable assumption, since at least $\Gamma^{(0)}$ in $\Sigma^{(0)}$ is a reasonable assumption, since at least $\Gamma^{(0)}$ in $\Sigma^{(0)}$ is a reasonable assumption, since at least $\Gamma^{(0)}$ in $\Sigma^{(0)}$ is a reasonable assumption, since at least $\Gamma^{(0)}$ in $\Gamma^{(0)}$ is a recognized to address its processors. We also assume that W(n) is a recognized to address the n shared-memory locations containing the inputs. The time has the shared $\Gamma^{(0)}$ in $\Gamma^{(0)$ Similar parallel machine models have appeared in a large number of papers, the carliest of which include Fortune and Willie [11]. Goldschlager [13]. Schwartz [34], and Shiloach and Vishkin [35]. Our model is not strictly SIMD (see Flyan [10] for nomenclature), since different processors can be at different points in their program at any given time. However, it is easy to show [23] that is equivalent to a strictly SIMD one. Note that our machines are slightly nonuniform in the sense that information (depending on the input-size) may be encoded in the number of processors P(n). Our simulations will make heavy use of this property. #### 4. Some Preliminary Results Concerning Shared Memory Machines WRAMs are powerful parallel machines. In particular, they can compute various useful arithmetic functions in constant time. The results in this section will prove useful during the simulation of threshold Turing machines by TRAMs in Section 7. The reader who wishes to process this paper in a top-down fashion may skip the material contained in this section during the first reading. LEMMA 4.1. A PRAM with $\lceil \log n \rceil$ processors can compute $\lceil \log n \rceil$ or $\lfloor \log n \rfloor$ in constant time with word-size $O(\log n)$ , when given as input a single positive integer n. Proof. Suppose n > 1: $\lfloor \log n \rfloor$ is computed as follows. Processor $i, i \geqslant 0$ , computes the value $v = \lfloor n/t^2 \rfloor$ using a single shift instruction. If it finds that v > 0 and $\lfloor v/2 \rfloor = 0$ , then $i = \lfloor \log n \rfloor$ . Computation of $\lceil \log n \rceil$ if from $\lfloor \log n \rfloor$ is simple; if $n = 2^t$ then $\lceil \log n \rceil = i$ , otherwise $\lceil \log n \rceil = i + 1$ . The required value can then be written into shared-memory cell $s_0$ by processor i. $\blacksquare$ LEMMA 4.2. A WRAM with word-size $O(n(b + \log n))$ can add n b-bit numbers in constant time. *Proof.* We use the techniques of [23-25]. Suppose that we are given n b-bit integers $x_1, \dots, x_n$ . Let us assume, for the purposes of this proof, that they are all positive. The modifications necessary for the inclusion of negative integers are simple but tedious. We can assume without loss of generality that n is a power of 2. First, it is easy to determine n, the number of inputs. We can adopt the convenient hat zero is never used for an input value (if necessary we can encode non-negative integers by adding one to them). Processor i reads shared memory cells i and i+1. If the latter contains zero while the former contains a nonzero value, then i=n. Processor i can then write its PID into shared-memory location 0, where it can be read simulationally by all processors Note that i is not a power of 2, compute $2^{2+n\delta}$ , the power of 2 immediately above $n_i$ using a shift operation. This will have a negligible effect on our resource bounds. The sum of $n \stackrel{>}{\sim} b$ 1st numbers can have no more than $n + \log n$ bits. Each processor equives this value. The value log n on the computed using Lemma 4.1; it remains to determine the value of h. We do this by finding the largest input integer. We use the first $n^2$ processors, divided into n or equal-absted teams. Each processor can determine the value of h. We do this by finding the largest input integer. We use Since n is a power of 2 and both n and log n are known to all processors, $n^2$ can be computed with a single shift operation. Next, each processor determines which team it is in, and its identity number within its team, as follows. Each processor extracts the last $\log n$ bits of its PID using two shifts and a subtraction. It treats this value as its identity number within its team. It treats the remaining leading bits of its PID as its team identity number. That is, it has divided its PID in constant time into two values (i, j), where $0 \leqslant i, j \leqslant n$ , and acts as the jth member of the jth team. The th team uses shared memory location n+i+1 for communication (remembering that the cells t through n contain the inpu). The 0th processor in team i (which we will call the t1 for t2 for t3 for t4 for t2 for t4 for t3 for t4 for t4 for t6 for t6 for t6 for t6 for t7 for t7 for t7 for t8 for t8 for t8 for t8 for t8 for t9 for t8 for t9 for t8 for t9 for t1 for t1 for t1 for former is greater than the latter, then it writes a one into shared-memory location n+i+1. When this is finished, the canel-adeer reads shared-memory location t9 for t1 for t8 Now the value $b + \log n$ is known to all processors. Let us assume for the purposes of this proof that it is a power of 2. If not, then it can easily be rounded up to a power of 2 by again using Lemma 4.1 and a shift operation. The number of bits in this value is also easily obtained using Lemma 4.1. The processors now divide themselves into $2^{O(n(b+\log n))}$ teams of n processors. Each team interprets its team identification number (which has $O(n(b \log n))$ bits) as a sequence of $n(b + \log n)$ -bit integers. The ith member of each team, $1 \le i < n$ extracts the ith and (i+1)th integer in this sequence, while the 0th processor extracts the first integer. The ith processor will have to do a shift of $i(b + \log n)$ bits. In order to do this, it will have to first compute the shift amount. Since the latter factor is a power of 2, the multiplication can be implemented using a shift operation. The ith processor of each team. $1 \le i \le n$ verifies that this (i+1)th integer is equal to the ith plus $x_{i+1}$ , while the 0th processor verifies that the first integer equals x1. Those processors which find a discrepancy report to their team leaders via the shared-memory as described in the previous paragraph. Exactly one team will find no discrepancy. Its team leader knows that its team identity number represents a valid prefix-sum string for the given inputs. It then extracts the total sum (the last integer in the sequence) which it finally writes into shared-memory location 0 for output. All of the operations described take place in constant time. The PIDs of the processors have $O(n(b + \log n))$ bits, and are the largest words used in the computation. LEMMA 4.3. A WRAM with word-size O(b2) can multiply two b-bit positive integers in constant time. *Proof.* We will use the standard shift-and-add algorithm. For simplicity, let us assume that the two input integers $x_1$ and $x_2$ are both positive. The machine first finds the value of b by taking the logarithm of the largest input (using Lemma 4.1). Processor i, $0 \le i < b$ , does the following. (a) Extract the ith bit of x<sub>2</sub> (where the bits are numbered from left-to-right starting with 0) using shifts and a subtraction. (b) If the value obtained in (b) is nonzero, then left-shift $x_1$ by i places (that is, multiply it by $2^i$ ), and write the result into shared-memory location i+1. The sum of these values is computed in constant time using Lemma 4.2. Lemma 4.4. A WRAM, when given as input a single integer m, can compute $\lceil m^{1/c} \rceil$ in constant time and word-size $O(\log^2 m)$ , for any natural number c > 1. *Proof.* Use p teams of processors, where $p \ge \lceil m^{kr} \rceil$ . Team $i, 0 \le i < p$ , checks to see whether $i = \lceil m^{kr} \rceil$ . It does this by computing i' (using e-1 multiplications). For this purpose, each team has $2^{20\log^2 m}$ processors (by Lemma 4.3). If $i' \ge m$ , yet i' = 1 < m, then $i = \lceil m^{kr} \rceil$ . For our purposes, it will be sufficient to show that a WRAM with linear wordsize can add n constant-bit integers in constant time. However, a much stronger result is possible without much extra effort. Lemma 4.5. For every $0 < \lambda \le \frac{1}{2}$ there exists $\mu > 0$ such that a WRAM with wordsize $O(n^{1-\mu})$ can add $n n^{1-\lambda}$ -bit integers in constant time. *Proof.* Suppose we have as input n positive integers, each of $n^{1-1/6}$ bits, for some positive integer $c \geqslant 2$ . Since the technique used is elementary, for a cleaner presentation of this proof we will omit the floor and ceiling operators necessary to ensure that all values are integers. The sum of these integers (and every partial sum) has at most $O(n^{1-1/6})$ bits. The WRAM first determines n, and computes $m = n^{1/(c+1)}$ . After this pre-computation, the summation is performed in two phases. Phase 1. Divide the input into n/m groups of m numbers, and sum each group. After c iterations of this process, we are left with $n/m^c$ partial sums. Phase 2. Add the n/me partial sums. By Lemma 4.4, the pre-computation takes constant time and negligible word-size (for large enough n). By Lemma 4.2, Phases 1 and 2 can be performed in constant time. The word-size required for the former is proportional to $mn^{1-1/c} = n^{1-1/(c^2+c)}$ and, for the latter, is proportional to $$\frac{n}{m^c} n^{1-1/c} = n^{1-1/(c^2+c)}.$$ Thus $n n^{1-1/\epsilon}$ -bit integers can be summed in constant time with word-size $O(n^{1-1/(\epsilon^2+\epsilon)})$ . Suppose $x_i \in \mathbb{N}$ , $1 \le x_i \le n$ for $1 \le i \le n$ . Define prev: $Z^n \to Z^n$ by prev $(x_1, \dots, x_n) = \langle y_1, \dots, y_n \rangle$ , where $y_i = j$ if $x_j = x_i, j < i$ , and $x_i \ne x_i$ for j < k < l (and 0 if no such j exists). Define last: $Z^n \to Z^n$ by $last(x_1, \dots, x_n) = \langle y_1, \dots, y_n \rangle$ , where $y_i = j$ if $x_i = i$ and $x_i \ne i$ for $j < k \le n$ (and 0 if no such j exists). LEMMA 4.6. A WRAM can compute $prev(x_1, ..., x_n)$ and $last(x_1, ..., x_n)$ in constant time with word-size $O(\log n)$ . #### 5. THRESHOLD TURING MACHINES It is possible to define threshold quantifiers based on the threshold functions introduced in Section 2. Let $\Sigma$ be a finite alphable, and F: $\Sigma^2 - R$ from the finite $(R^*)w^* R^*)v \in R$ , denoting the predicate "exactly k strings of size n statisty F: is defined as follows: $(g^* p \in P(u)) - R$ if $((w \in \Sigma^*) F(u)) - R$ denoting the contribution of the states are powerful as existential and universal quantifiers are at least as powerful as existential and universal quantifiers are the states of $(g^* p \in R(u))_{-k}$ . $(g^* p \in R(u))_{-k}$ ( $g^* p \in R(u)_{-k}$ ) where the state of $(g^* p \in R(u))_{-k}$ ( $g^* p \in R(u)_{-k}$ ), where the domain of the quantification is obvious from context. A 8-tape threshold Turing machine (abbreviated TTM) is similar to the popular alternating Turing machine (abbreviated ATM) [5. 7, 9, 23]. It has & read/write work-taper, a finite-state control, and random-access to its input via a write-only made-viape. The latter device is necessary life was to discuss sublinear running-time and will be familiar to those acquainted with alternating Turing machine. It also not incomplete that the state of the state of the state of the state of the one direction and have cell numbered [2, 2, c, ach of which can hold a single symbol Each tape has a single head, which scans a single tape cell. More formally, a 8-tape TTM is a 9-tuple (Q, 1, 2, 6, 6, 9, q, q, q, 1). $\Sigma$ is a finite input alphabet. Without loss of generality, we will take $\Sigma = \{0,1\}$ . $\Gamma$ is a finite tape alphabet, $\Sigma \subset \Gamma$ . Without loss of generality, we will take $\Gamma = \{0,1,b\}$ , where b is the distinguished blank symbol. Q is a finite set of states, including four distinct distinguished states, as follows: $q_0$ is the initial state, $q_a$ the accept state, $q_c$ the reject state, and $q_i$ the threshold state. $\delta$ is the transition function. If $d = \{\text{left, stay, right}\}$ is the set of directions in which a tape-head may move, then $\delta \colon \mathcal{L} \times (Q - \{q_s, q_t, q_t\}) \times f^{d+1} \to Q \times (F \times d)^2 \times (Z \times d)^2 \times d$ . We define a configuration of a TTM in the normal way, to be a snapshot of the machine at some instant in time. A configuration with state, q is called a harding configuration, all others are called nonbranching. A configuration with state q, or q is called a harding configuration. All This is started with albeds in the first cell of their respective tapes, and all tape cells blank except for the first cell on the threshold and indee tapes, which both contain the symbol 1<sup>1</sup>. The finite-state control is in state q<sub>0</sub>. This is called the intuit configuration. Consider an arbitrary contained to the control of con (i) Suppose the finite-state control is in state $q \in Q - \{q_a, q_i, q_i\}$ , symbol $s_0 \in \Gamma$ is under the guess-head, symbols $s_1, s_2, ..., s_k$ are under the k work-tape heads, and $\delta(x_i,\,q,\,s_0,\,s_1,\,...,\,s_k)=(r,\,(t_1,\,d_1),\,...,\,(t_k,\,d_k),\,(t_{k+1},\,d_{k+1}),\,(t_{k+2},\,d_{k+2}),\,d_{k+3}).$ Then $t_i$ is written in the cell under the head on the $t_i$ th work-tape and the head is moved one cell in direction $d_{t_i} = t_i \le k_i t_{t_i}$ ; is written in the cell under the head on the index tape and the index-head is moved one cell in direction $d_{t_i} = t_i$ the written in the cell under the head on the threshold tape and the threshold-head is moved one cell in direction $d_{t_i}$ , the guest-head is moved one cell in direction $d_{t_i}$ , and the finite-theat courted moves into since $t_i$ . The new configuration the $d_{t_i}$ is successor is accepting. The time requirement of the original configuration is defined to be one plus the time requirement of the successor. The threshold requirement of the successor. The threshold requirement of the successor. (ii) If the finite-state control is in state q ∈ {q, q, r} then the TTM halts. If q = q, then the configuration is called accepting. The time requirement and threshold requirement is defined to be zero in both cases. (iii) If the finite-state control is in state q<sub>n</sub>, then the contents of the threshold tape are interpreted as the binary encoding of a nonnegative integer m. Suppose the guess-head is on cell g of the guess-tape. The TTM is restarted in state q<sub>n</sub>, with its work-tape and index-tape unaltered, the threshold-pare returned to its initial contents, a random string of symbols from Σ written on cells one through g of the understange the remaining cells left blank), and the guess-head returned to the first cell. Each of these 2\* possible new configurations are called successors of the original configuration. The configuration is said to be accepting if exactly mof its successors are accepting. The time requirement of the original configuration is defined to be one plus the longest time requirement of its successors. The threshold requirement is defined to be one plus the largest threshold requirement of its suc- A TIM is said to accept its input if its initial configuration is accepting. The language recognized by a TIM is the set of accepted strings over alphabet $\mathcal{E}$ . A TIM is said to run in time T(n) if, for all input strings of length n, the initial configuration has time requirement bounded above by T(n). It is said to use thresholds H(n) if, for all input strings of length n, the initial configuration has threshold requirement bounded above by H(n) if its said to use thresholds requirement bounded above by H(n) if its said to the substitute of the said in the said in the said in the said in the said to the said in the said in the said in the said to the said in to the said in s Since existential and universal quantifiers are a special case of the threshold quantifier (see the identities given in the first paragraph of this section,) it is clear that the standard alternating Turing machine (row a special case of the threshold Turing machine (row of the former to machine swit constructible time-bounds). Furthermore, time on this limited threshold Turing machine is related by a polynomial to time on an alternating Turing machine, and the number of attentions to the constraint purpose of the constraint of the short Suppose, for convenience, we reduce threshold Turing machines using upperthreshold and lower-threshold functions instead of #-functions. As noted in Section 2, this affects the time by a polynomial, and the number of thresholds by at most a constant multiple. Threshold Turing machines can then be used to define a polynomial-time threshold hierarchy, analogous to the Meyer-Stockmeyer polynomial-time hierarchy [38]. Let TP be the class of languages recognizable in polynomial time by a TTM using a single threshold, either a ≤-threshold or a >-threshold (note that the class remains the same in either case). Then define $\Theta_0^p = P$ , and for $k \ge 0$ , $\Theta_{k+1}^p = TP^{\Theta_k^p}$ . By induction on k, $\Theta_k$ contains the class of languages recognizable in polynomial time by a TTM in k thresholds. It can also be proved using standard techniques that any language in $\theta_k$ can be recognized in polynomial time by a TTM in 2k thresholds. However, it is not apparent that the 2k can be reduced to k (as in the case of the polynomial-time hierarchy and alternating Turing machines, see Wrathall [42]) since it appears impossible to combine two consecutive polynomial-bounded threshold quantifiers of the same type. The polynomial-time threshold hierarchy clearly includes the polynomial-time hierarchy; for $k \ge 0$ , $\Sigma \xi$ , $\Pi \xi \subseteq \Theta \xi$ . Note also that $\Theta \xi$ contains the language class corresponding to Valiant's #P [39, 40], since it is possible to verify the number of solutions to a polynomial-time verifiable predicate using two queries to an oracle language in O?. The polynomial-time threshold hierarchy has been studied in detail by Wagner [41] under the name of "the counting polynomial-time hierarchy." The polynomial-time threshold hierarchy is contained in PSPACE. #### 6. SIMULATION OF TRAMS BY THRESHOLD TURING MACHINES In this section we consider the simulation of a T(n) time-bounded, W(n) wordsize bounded TRAM on a threshold Turing machine. We say that $W(n) = \Omega(\log n)$ is constructible if a k-tape deterministic Turing machine can, when given the binary representation of n, compute the binary representation of W(n) in time O(W(n)). Most useful functions are constructible, for example, $W(n) = [\log^{-1}W(n)] = [\log^{-1}W(n)] = [\log^{-1}W(n)]$ . Theorem 6.1. Suppose W(n) is constructible. A threshold Turing machine can standare a T(n) time-bounded, W(n) word-size TRAM with the minimal instruction-set using O(T(n)) thresholds and time $O(T(n) \cdot W(n))$ . Proof (Sketch). Let M be a TRAM which runs in time T(n) and uses word-size W(n). Let $X_1, X_2, \dots X_n$ be an input to M for the purposes of this proof, we will assume that each $x_i \in B$ . In general, each $x_i$ will be an integer of at most W(n) bits in this case, the input to the TTM will be a binary encoding of this sequence. We will demonstrate a threshold Turing machine which accepts this input string iff M does. On input $1_{x_i, X_i_i} \dots X_n$ , the TTM first computes w = W(n). During the simulation of M. The TTM first computes w = W(n). During the simulation of M is expense of O(1) contingous work-size eds. It he program of M is stored in the influe-state control. We will write I(I) for the Ith instruction of this program, I is stored in the following mutually recurrive Boolean procedures returns the value of Each of the following mutually recurrive Boolean procedures returns the value of the quantified Boolean formula given as its statement part. Quantified variables range over all possible values of length w. ``` function result(p, t, v) {processor p computes value v at time t} ``` ``` (case /[/] of ``` "r, + constant": const = p " $r_i \leftarrow r_j \circ r_k$ ": $\exists v_1 \exists v_2 \ (\operatorname{local}(j, p, t-1, v_1) \land \operatorname{local}(k, p, t-1, v_2) \land (v = v_1 \circ v_2))$ $(v = v_1 \circ v_2)$ " $r_i \leftarrow r_i$ ": $\exists v_1(local(i, p, t-1, v_1) \land local(v_1, p, t-1, v_1))$ " $r_s \leftarrow r_i$ ": local(j, p, t-1, v) " $r_i \leftarrow PID$ ": v = p" $r_i \leftarrow s_i$ ": $\exists v_i (local(j, p, t-1, v_i) \land shared(v_i, p, t-1, v))$ " $s_n \leftarrow r_j$ ": local(j, p, t-1, v) function target(p, t, h) {processor p changes register $r_b$ at time t} $\exists l(pc(p, t, l) \land (I[I])$ is of the form " $r_k \leftarrow \cdots$ ", or " $r_{r_i} \leftarrow r_i$ " with local (i, p, t-1, h))) function write (p, t, h) {processor p writes into shared-memory location $s_i$ at time t} $\exists I(pc(p, t, l) \land (I[I] \text{ is "}s_r \leftarrow r_i") \land local(i, p, t-1, h))$ ``` function p(p, t, l) (program-counter of processor p at time t is l) (t = 0, l = 1), (p(ep, t, t - 1, 0), \Lambda l = 0) \vee \exists k(p(ep, t - 1, k)) \wedge (exas <math>I(k) of "goto m if r_i \ge 0": \exists v \ge 0 \log al(i, p, t - 1, v) \wedge (m = l) "halt": e0 others: l = k + 1 ``` The Boolean operations $\wedge$ and $\vee$ are computed by branching universally and existentially, respectively. Negations can be computed directly or pushed back to the final states, much in the same manner as alternating Turing machines [5]. Quantifiers are computed by guessing quantified values. The simulation of individual instructions of M is carried out deterministically. The TTM simulates M by computing 2(Mp(Ep(E, M), 0)) a harded(M, M). We claim that any call to procedures result(p, t, e), target(p, t, h), write(t, p, t), per(p, t, h), local(p, t, e), or share(t, t, e) requires at most O(t) thresholds. Let r(t), t(t), w(t), p(t), l(t), and s(t) denote the number of thresholds required by these procedures, respectively. Then r(0) = r(0) = w(0) = p(0) = l(0) = s(0) = 0 ``` r(t) \le \max(p(t) + 2, l(t - 1) + 6, s(t - 1) + 6) v(t) \le \max(p(t) + 2, l(t - 1) + 4) w(t) \le \max(p(t) + 2, l(t - 1) + 4) w(t) \le \max(p(t - 1) + 3, l(t - 1) + 7) l(t) \le \max(p(t - 1) + 3, l(t - 1) + 2) v(t) \le \max(p(t - 1) + 3, l(t - 1) + 3) ``` Thus, in particular, $s(t) \leqslant 28t + 21$ , and so the simulation of a T(n) time-bounded TRAM uses O(T(n)) thresholds. If M has the minimal instruction-set, then the computation between each threshold takes time O(W(n)) (since it only involves guessing register contents, and simulating local instructions of M). This gives the required result. Note that if the TRAM has the general instruction-set, then the running-time of the TTM increases by only a polynomial. COROLLARY 6.2. If W(n) is constructible, a threshold Turing machine can simulate a reasonable T(n) time-bounded, W(n) word-size TRAM using O(T(n)) thresholds and time $W(n)^{O(1)}$ . #### 7. SIMULATION OF THRESHOLD TURING MACHINES BY TRAMS Before we address the problem of simulating a TTM on a TRAM we first consider a much simpler problem, that of simulating a standard k-tape deterministic Turing machine (DTM) on a WRAM. LEMMA 7.1. A WRAM with the minimal instruction-set can simulate a T(n) timebounded k-tane DTM in constant time and word-size O(T(n)). *Proof* (Sketch). Suppose we number the rules of the DTM in some reasonable fashion. We divide the processors into $2^{OT(n)}$ teams, one for each sequence of T(n) rules $r_0, r_1, \dots, r_{T(n)-1}$ . Each team has $2^{OT(n)}$ processors. Each processor and elermine which team it is in and its identity number within that team, in constant time as follows. First, the WRAM computes as in the proof of Lemma 4.2 Second, the machine determines the number of active processor is affolious. Processor is writes a one into shared-memory location n+(+1. The number of processors can then be computed using the technique used to determine n (see the proof of Lemma 4.2). Since the number of processors is 2<sup>-10.8</sup> with c a small constant dependent on the constant in the Turing machine, the value of Tife can be found difficiently using Lemma 4.1. This value, along with the number of lists in Ti(n) (found by using Lemma 4.1 assains can be made available to all trocessors through the shared memory. Now that the value of T(n) is known to all processors, each can extract the first $T(n) + \log T(n)$ bits of its PID, which it treats as its identity number within its team. The remaining O(T(n)) bits are treated as the identity number of the team. These values can be extracted in constant time using shifts and subtractions, as in the proof of Lemma 4.7. Once each team has determined its team identity number, it interprets that identity number as sequence of $\Gamma(n)$ rules. The ith processor of that team extracts the ith rule r, $0 \le i < \Gamma(n)$ , and determines for each of the tapes the head direction associated with har rule, + 1 for n rightward move, -1 for a leftward move, and 0 for no move at all. The head position at each point in time is easily determined for each tape by computing a prefix-sum of these values within the team. Each prefix- sum can be computed using T(n) separate additions performed in parallel using Lemma 4.5, using $T(n)/2^{1/2}$ represents on it each team. (This is the reason for requiring T(n)+log T(n) bits for the identity number of each processor within its team.) Each team uses a separate part of the shartee-memory for communication during this computation, the relevant addresses are found by multiplying the team identity number by the appropriate value. The latter can be taken to be a power of 2, so that the multiplication can be performed using a shift (operation. It then verifies that: - The sequence of states determined by the sequence of rules is a valid one. That is, rule r₀ requires that the DTM be in its initial state, and for 1 ≤ i < T(n) if rule r₁ leaves the DTM in state q, then rule r₁ requires that the DTM be in state q.</li> - For each of the k tapes, and for each time t, 0 ≤ t < T(n):</li> (i) If this is the first time that the head visits this cell, then the symbol read - (1) If this is the first time that the head visits this cell, then the symbol read by rule r<sub>t</sub> is the symbol found in that cell in the initial configuration. - (ii) If the last time the head visited this cell was at time s < t, then the symbol written by rule $r_s$ is the symbol read by rule $r_t$ . The information necessary for this verification is provided by using the algorithm of Lemma 46. Exactly one team will find that its sequence of rules is valid. It can then determine if the final state is accepting and set the contents of shared memory location 0 to 0 or 1, accordingly. By Lemmas 4.5 and 4.6 the simulation requires constant time and word-size Or(In). This result can obviously be extended to the simulation of deterministic Turing machines which compute results, rather than acts as exceptors for a language (the final configuration can be constructed from the valid sequence of rule numbers by use of the algorithm for function last in Lemma 46,3 and to the simulation of non-deterministic Turing machines. Furthermore, it can also be used to simulate threshold Turing machines on TRAMs. THEOREM 7.2. Suppose T(n) is constructible. A TRAM with the minimal instruction-set can simulate a T(n) time-bounded, H(n) threshold-bounded k-tape TTM in time O(H(n) and word-size O(T(n) - H(n)). Proof (Sketch). The TRAM first evaluates $\Gamma(n)$ in constant time and word-wire $O(\Gamma(n))$ (by use of Lemma 7.1) and constructs a look-up thele showing, for every nonbranching configuration, the configuration which follows by the rules of $\delta$ in $\Gamma$ steps of the TTM. It $\delta \in T(n)$ is with the convention that once the TTM entire $\delta$ in a halfing or branching configuration, then it remains there). The table can be constructed in constant time by utilizing the technique used in the proof of Lemma 7.1. A slight of the contraction of the proof of Lemma 7.1. The state of the proof of Lemma 7.1. The state of the proof of Lemma 7.1. The state of the proof of Lemma 7.1. The state of Lemma 7.1. The proof input pointers can be computed in constant time using word-size $T(n)^{1-\lambda}$ for any real number $\lambda > 0$ , by using a technique similar to Lemma 4.5. The simulation then proceeds as follows. The TTM is simulated up to the point when a branching or halting configuration is entered for the first time (using the look-up table and Lemma 46). Call this new configuration. C. If the accept state has been entered, then the computation is accepting, and this can be reported in the appropriate way (similarly if the reject state has been entered). Otherwise C is a branching configuration. The processors divide themselves into as many as 20<sup>76101</sup> branching configuration. The processors divide themselves into as many as 20<sup>76101</sup> teams, one for each possible guess of the appropriate size (gleaned from teams, one for each possible guess of the appropriate size (gleaned from the continues that the computation from the new configuration. When the teams have (recursively) completed their simulation, they report back by having teams which find that their configurations are configuration which is the configuration of conf Since the computation between branchings takes constant time, the entire simulation takes time O(H(n)). The word-size is dominated by $O(T(n) \cdot H(n))$ for the recursive branching. COROLLARY 7.3. A reasonable TRAM can simulate a T(n) time-bounded, H(n) threshold-bounded k-tape TTM in time O(H(n)) and word-size $O(T(n)^2)$ . *Proof.* The TRAM of Theorem 7.2 is reasonable since the number of thresholds that a TTM can perform is bounded above by its running time. Thus time and word-size on a reasonable TRAM are simultaneously equivalent to thresholds and time on a threshold Turing machine. The first equivalence holds to within a constant multiple, and the second to within a polynomial. 8. Two Parallel Computation Theses for Unbounded Fan-in Parallelism Since WRAMs are a weaker form of TRAM it is not necessary to use the full power of threshold Turing machines in order to simulate them efficiently. THEOREM 8.1. Suppose W(n) is constructible. An alternating Turing machine can simulate a T(n) time-bounded, W(n) word-size WRAM with the minimal instruction-set using O(T(n)) alternations and time $O(T(n) \cdot W(n))$ . *Proof.* Similar to the proof of Theorem 6.1, replacing function shared by: function shared (i, t, v) {shared-memory cell $s_i$ contains v at time t} $(t = 0 \land 1 \le i \le n \land v = x_i) \lor$ $(t=0 \land i > n \land v=0) \lor$ $(\exists p(\text{write}(i, p, t) \land \text{result}(p, t, v) \land \neg \exists q$ $(( \neg \exists p \text{ write}(p, t)) \land \text{shared}(i, t-1, v)).$ COROLLARY 8.2. Suppose W(n) is constructible. An alternating Turing machine can simulate a reasonable T(n) time-bounded, W(n) word-size WRAM using O(T(n)) alternations and time $W(n)^{O(1)}$ . Similarly, it is not necessary to use the full power of TRAMs to simulate ATM's efficiently. THEOREM 8.3. A WRAM with the minimal instruction-set can simulate a T(n) time-bounded, H(n) alternation-bounded k-tape ATM in time O(H(n)) and word-size O(T(n)-H(n)). It was shown in [24] that a WRAM with minimal instruction-set can simulate $\Gamma(n)$ time-bounded nondeterministic 4-tage Turing machine in constant time with word-size $\Gamma(n)^{1+\epsilon}$ , for any real number $\epsilon > 0$ . Theorem 8.3 improves the word-size $O(\Gamma(n))$ and extends the result to alternating Turing machines with bounded alternations. A further improvement can be shown for the simulation of k-tage deterministic Turing machines. COROLLARY 8-4. Suppose $T(n) = \Omega(n \cdot \log^n n)$ is time-constructible. A WRAM with the minimal instruction-set can simulate a T(n) time-bounded k-tape deterministic Turine machine in constant time using word-size $OT(n)\log^n T(n)$ . Proof. By Theorem 8.3 above, and Theorem 3.3 of Paul et al. [29]. An improvement of the word-size to $O(\sqrt{T(n)})$ for the simulation of single-tape deterministic Turing machines can be made using the weaker result of Maass [22]. Since on an alternating Turing machine, alternations are bounded above by time, we have COROLLARY 8.5. A reasonable WRAM can simulate a T(n) time-bounded, H(n) alternation-bounded k-tape ATM in time O(H(n)) and word-size $O(T(n)^2)$ . Proof. By Theorem 8.3. Thus time and word-size on a reasonable WRAM are simultaneously equivalent to alternations and time on an alternating Furing machine. The first equivalence holds to within a constant multiple, and the second to within a polynomial. This provides evidence for the third paulified computation thesis. "In a partialled machine with unbounded fas-in- communication, time and address complexity are simultaneously equivalent to alternations and time on an alternating Turing machine, the former to within a constant, and the latter a polynomial." Our results also show that, for constant parallel time, address complexity is equivalent to within a constant multiple to time on a constant-alternation ATM. Thus, for example, a constant multiple to time on a constant-alternation ATM. Thus, for example, a constant-time massively parallel computers recognize exactly the languages in the polynomial-size parallel computers recognize exactly the languages in [Supers's logarithmic-time hierarchy, 36]. The third parallel computation thesis sheds some light on a dilemma raised by Cook [3]: It is popular to take alternating time as a measure of 'parallel time' since alternating time is polynomially related to sequential space. This 'space is parallel time' characterization was proposed independently by Chandra et al. [45]. And Goldschänger [13, 14] (the latter calling it the parallel computation state (Incomparation Corresponding to Thardware." This led Dymond to his formulation of the extended parallel computation thesis [8, 9], based on the seminal work by Pippenger [30]. Our results suggest that the reasons why the alternating Turing machine papered to have no resource corresponding to 'hardware' was that the work of the control of parallel time. Instead, another of alternations corresponds to 'parallel time,' and alternating time is related, on the front of "address or "and alternating time is related, on the front of "address or "and alternating time is related, on the front of "address or "and the anting time is related, on the front of "address or "and alternating time is related, on the front of "address or "and the parallel time." Intellegate the support of the control of the parallel time. Intellegate the support of the control of the parallel time. Intellegate the number of this necessary to describe an individual unit of hardware. Corollary 6.2 and Corollary 7.3 provide evidence for the fourth parallel computation thesis, which seeks to characterize parallel computation based on threshold functions. "In a parallel machine with unbounded fan-in communication using threshold functions, time and address complexity are simultaneously equivalent to thresholds and time on a threshold Turing machine, the former to within a constant, and the latter a polynomial." Further evidence for the third and fourth parallel computation theses is provided by considering uniform unbounded fan-in circuits. The connection language of an unbounded fan-in threshold circuit is the set of 4-tuples $\langle g_1, g_2, i, k, n \rangle$ such that in the finite circuit with n inputs the ith input of gate $g_1$ is connected to the output of gate $g_2$ , and $g_3$ is a $\#_4$ -gate. An unbounded fan-in threshold circuit is said to be uniform if its connection language can be recognized by a polynomial-time k-tape deterministic Turing machine (note that the running-time of the Turing machine is thus polynomial in the address complexity of the circuit). Clearly the depth and address complexity of such a circuit is simultaneously equivalent to time and word-size on a TRAM, the former to within a constant multiple and the latter a polynomial (the techniques of [37] extend equally well to the uniform case, with processors that can shift, and to threshold computations). The corresponding result holds for conventional unbounded fan-in circuits and WRAMs. #### 9. CONCLUSION We have provided evidence for two parallel computation theses for unbounded fan-in parallelium, it appears that, for the standard unbounded fan-in models, parallel time and address complexity are simultaneously equivalent to alternations and time on an alternating Turing machine (the former to within a constant of the latter a polynomial). A similar result holds for the new class of threshold computations, provided the alternating Turing machine is replaced by a threshold Turing machine, and the resource of alternations is replaced by the analogous resource of thresholds: Many interesting open problems remain. The standard lover-bound proof techniques for unbounded flam-in critical suppart to break down completely in the case of threshold circuits. Is there a problem in LOGSPACE which you have been depended or the control of the problem of the control Finally, we have simplified the Boltzmann machine by removing probabilism and simplifying the termination condition. The computing power of the more general machine remains an open problem. # ACKNOWLEDGMENTS The authors are grateful to Piotr Berman and Nick Pippenger who showed us how to construct the constant-depth threshold circuit for integer multiplication mentioned in Section 2, to Klaus Wagner or correspondence concerning the polynomial-time threshold hierarchy, and to Richard Ladner for suggestions on improving the readability of this paper. # REFERENCES - D. H. ACKLEY, G. E. HINTON, AND T. J. SEINOWSKI, A learning algorithm for Boltzmann machines, Cognit. Sci. 9 (1985), 147–169. - L. ADLEMAN, Two theorems on random polynomial time, in "Proceedings, 19th Ann. IEEE Symp. on Foundations of Computer Science," 1978, pp. 75–83. - M. Altau, Σ]-formulae on finite structures, Ann. Pure Appl. Logic 24 (1983), 1-48. N. Blun, A note on the "parallel computation thesis," Inform. Process. Lett. 17 (1983), 203-205. - A. K. CIANDIRA, D. C. KOZIN, AND L. J. STOCKMEYDE, Alternation, J. Assoc. Comput. Mach. 28, No. 1 (1981), 114-133. A. K. CHANDIRA, L. J. STOCKMEYDE, AND U. VERIKIN, Constant depth reducibility, SIAM J. Comput. - A. K. CHANDRA, L. J. STOCKMITTER, AND U. VISHKIN, Constant depth reducibility, SLAM J. Comput. 13, No. 2 (1984), 423-422. S. A. COOK, Towards a complexity theory of synchronous parallel computation, L'Enseign. Math. 30 - (1980). 8. P. W. DYMOND, "Simultaneous Resource Bounds and Parallel Computations," Ph. D. thesis. - W. DYBOND, "Simutaneous Resource Bounds and Parallel Computations," Ph. D. thesis. Technical Report TR14/50, Dept. of Computer Science, Univ. of Toronto, Aug. 1980. P. W. DYBOND AND S. A. COOK, Hardware complexity and parallel computation, in "Proceedings, - 21st Annu IEEE Symp on Foundations of Computer Science, Oct. 1980, pp. 360-372. 10. M. FLYNN, Very high-speed computing systems, Proc. IEEE 54 (1966), 1901-190. - S. FORTUNE AND J. WYLLIE, Parallelism in random access machines, in "Proceedings, 10th Annu. ACM Symp. on Theory of Computing, 1978," pp. 114-118. - 12. M. FURST, J. B. SAXE, AND M. SIPSER, Parity, circuits and the polynomial time hierarchy, Math. - Systems Theory 17, No. 1 (1984), 13-27. 13. L. M. GOLDSCHLAGER, "Synchronous Parallel Computation," Ph. D. thesis, Technical Report TR- - 114, Dept. of Computer Science, Univ. of Toronto, Dec. 1977. 14. L. M. GOLDSCHLAGER, A universal interconnection pattern for parallel computers, J. Assoc. Comput. Mach. 29. No. 4 (1982), 1073-1086. - 15. L. M. GOLDSCHLAGER, "A Computational Theory of Higher Brain Function," Technical Report, - Stanford Univ., Apr. 1984. 16. L. M. GOLDSCHLAGER AND I. PARBERRY, On the construction of parallel computers from various - bases of Boolean functions, Theoret. Comput. Sci. 43, No. 1 (1986), 43-48. 17. J. HARTMANIS AND J. SIMON, On the power of multiplication in random access machines, in "Proceedings, 15th Annu. IEEE Symp. on Switching and Automata Theory, 1974," pp. 13-23 - 18. G. E. HINTON, T. J. SEINOWSKI, AND D. H. ACKLEY, "Boltzmann Machines: Constraint Satisfaction Networks That Learn," Technical Report CMU-CS-84-119, Dept. of Computer Science, Carnegie- - Mellon Univ., May 1984. 19. J. HOPCROFT, W. PAUL, AND L. VALJANT, On time versus space, J. Assoc. Comput. Mach. 24, No. 2 - 20. J. J. HOPFIELD, Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci. 79 (1982), 2554-2558. - 21. M. Lusy, A simple parallel algorithm for the maximal independent set problem, in "Proceedings, 17th Annu. ACM Symp. on Theory of Computing, Providence, R.I., May 1985," pp. 1-10. - 22. W. Maass, personal communication, 1985. 23. I. PARMERRY, "A Complexity Theory of Parallel Computation," Ph. D. thesis, Dept. of Computer - Science, Univ. of Warwick, May 1984. 24. I. PARBERRY, "On the Number of Processors Required to Simulate Turing Machines in Constant - Parallel Time," Technical Report CS-85-17, Dept. of Computer Science, Penn. State Univ., Aug. 1985. 25. I. PARBERRY, Parallel speedup of sequential machines: A defense of the parallel computation thesis, - SIGACT News 18, No. 1 (1986), 54-67. 26. I. PARBERRY AND G. SCHNITGER, Parallel computation with threshold functions (preliminary - version), in "Proceedings, Structure in Complexity Theory Conference, Berkeley, California, June 1986," Lecture Notes in Computer Science Vol. 223, pp. 272-290, Springer-Verlag, New York/Berlin, 1986 27. I. PARBERRY AND G. SCHNITGER, "Relating Boltzmann Machines to Conventional Models of - Computation." Technical Report CS-87-07, Dept. of Computer Science, Penn. State Univ., Mar. - 28. M. S. PATERSON, Tape bounds for time-bounded Turing machines, J. Comput. System Sci. 6, No. 2 - 29. W. J. PAUL, N. PIPPENGER, E. SZEMERÉDI, AND W. T. TROTTER, On determinism versus nondeterminism and related problems, in "Proceedings, 24th Annu. IEEE Symp. on Foundations of Computer Science, Tucson, Arizona, Nov. 1983," pp. 429-438. - 30. N. PIPPENGER, On simultaneous resource bounds, in "Proceedings, 20th Annu. IEEE Symp. on Foundations of Computer Science, Oct. 1979," pp. 307-311. - 31. V. PRATT AND L. J. STOCKMEYER, "A characterization of the power of vector machines," J. Comput. System Sci. 12 (1976), 198-221. - 32. W. L. Ruzzo, On uniform circuit complexity, J. Comput. System Sci. 22, No. 3 (1981), 365-383. 33. J. E. SAVAGE. Computational work and time on finite machines, J. Assoc. Comput. Mach. 19, No. 4 (1972), 660-674. - J. T. SCHWARTZ, Ultracomputers, ACM TOPLAS 2, No. 4 (1980), 484–521. - 35. Y. SHILOACH AND U. VISHKIN, Finding the maximum, sorting and merging in a parallel computation model, J. Algorithms 2 (1981), 88-102. - M. SIPSER, Borel sets and circuit complexity, in "Proceedings, 15th Annu. ACM Symp. on Theory of Computing, Boston, Mass., Apr. 1983," pp. 61–69. - L. STOCKMIYER AND U. VISHKIN, Simulation of parallel random access machines by circuits, SIAM J. Comput. 13, No. 2 (1984), 409 –422. - 38. L. J. STOCKMEYER, The polynomial time hierarchy, Theoret. Comput. Sci. 3 (1977), 1-22. - L. G. VALIANT, The Complexity of enumeration and reliability problems, SIAM J. Comput. 8, No. 3 (1979), 410-421. L. G. VALIANT, The complexity of computing the permanent, Theoret. Comput. Sci. 8 (1979), - 189-201. 41. K. W. WAGNER, The complexity of combinatorial problems with succinct input representation, Acta - Inform. 23, No. 3 (1986), 325–356. 42. C. Wrattfall, Complete sets and the polynomial-time hierarchy, Theoret. Comput. Sci. 3 (1976). - A. C. YAO, Separating the polynomial-time hierarchy by oracles, in "Proceedings, 26th Annu. IEEE Symp. on Foundations of Computer Science, Portland, Oregon, Oct. 1985."