# Ultra-low Power Hybrid CMOS-Magnetic Logic Architecture

Jayita Das, Syed M. Alam, Sanjukta Bhanja

Magnetic coupling between single-layer nanomagnets is used to realize magnetic logic. Apart from writing and reading, one other operation performed on the magnets is clocking. Traditionally, these operations were carried out using external magnetic fields generated by current-carrying conductors. However, the current requirements are typically in the mA range, which increases the overall power. Also, the fields cannot be sharply terminated at the boundary between two nanomagnets that need to be clocked at two different instants. These concerns motivated us to look into alternative magnetic devices to realize magnetic logic. We suggest the use of multilayer spintronic devices, the Magnetic Tunnel Junctions (MTJs), for carrying out logic computation. MTJs are already in use in magnetic RAMs (MRAMs), from which we have borrowed some concepts for writing and reading our logic. The MTJ free layers are capable of interacting with their neighbors through magnetic coupling. We propose the use of this coupling to compute logic in this paper. At the same time, MTJs also provide scope for CMOS integration, which we have used to assist in current-driven writing, clocking and reading of the devices. CMOS integration also improves the overall control over individual cells in the logic. In this paper we present a novel CMOS-integrated MTJ architecture layout that enables (a) logic computation using magnetic coupling between MTJs and (b) current-driven input, clock and read operations that are much more energy efficient. A feasibility study of this integration in the 22nm CMOS node is presented in the paper, along with a variability-tolerant reading scheme for the logic. The proposed architecture achieves over 95% reduction in energy, as seen in various adders and an array multiplier, over traditional magnetic logic with external field-based clocking.

#### I. INTRODUCTION

Nanomagnetic logic (NML, alternately known as Magnetic Quantum Cellular Automata or MQCA) uses single-domain, single-layer nanomagnets for logic computation. The nanomagnets are typically patterned into rectangular or elliptical shapes. This gives the nanomagnets a distinct shape anisotropy which is used to store a logic 0 or a 1 at room temperature. For example, when the magnetization is along the +x direction, it is denoted as logic 1 (see *Fig.* 1*a*). A magnetization along -x is denoted as logic 0. We will use this convention in this paper unless otherwise stated. Logic 1 and



**Figure 1:** (a) Single layer nanomagnets as elemental computing cells of traditional MQCA logic. *Logic* 0 and *logic* 1 are represented by the two energy minimum configurations of the nanomagnets as shown in the figure. (b) MTJs, the multi-layer spintronic devices, as basic computing elements. The free layers of the devices behave as single-domain nanomagnets. The energy minimum configuration is termed the *easy axis* while the energy maximum is called the *hard axis*. *Logic* 1 and *logic* 0 are represented by the magnetization of the free layer along the two easy-axis directions.

0 are written into the nanomagnets traditionally with the aid of external magnetic fields. The two logic states are energy minimum configurations and are also separated by an energy barrier at room temperature. For nanomagnets of dimensions larger than the superparamagnetic limit, an external energy source (external field) is required to help switch between the logic states at room temperature. This phenomenon is called clocking. The external fields are generated by current carrying conductors placed underneath the nanomagnets.

However, this traditional magnetic computing suffers from certain drawbacks: (i) large current requirements (on the order of mA) to generate the external fields, and (ii) no control over the fields so as to influence only the desired number of cells in a closely placed group of cells. We seek to alleviate these two major concerns in this paper by proposing a novel CMOS-Magnetic Tunnel Junction (MTJ) integrated architecture which aids in current-driven logic operations and increases control over *when* and *which* MTJ to select in the logic.

MTJs are primarily two ferromagnetic layers, a free layer and a fixed layer, with a tunnel barrier sandwiched in between (see *Fig.* 1b). The free layer behaves as a soft magnet while the fixed layer behaves as a hard magnet. The free layer is used to store logic 1 and 0. A current of appropriate magnitude and direction can be used to switch between the logic states. A current can also be used to clock the MTJs, as identified in our previous work [1]. When the stack is appropriately designed, the free layers of the MTJs can interact with each other like single-layer nanomagnets. We have verified this behavior in the micromagnetic simulator LLG [2], which is widely accepted in the magnetics community [3], [4], [5], [6]. MTJs can be read with the help of a current by measuring their resistance. Due to the

J.Das and S.Bhanja are with Electrical Engineering Department, University of South Florida, Tampa, FL.

S.M.Alam is Senior Member of Technical Staff at Everspin Technologies Inc., Austin, TX.



Note: All dimensions are in nm.

**Figure 2:** Ferromagnetic and antiferromagnetic coupling between the free layers of two MTJs, *A* and *B*. Ferromagnetic coupling orients the free layers of both the devices along the same magnetization direction. Antiferromagnetic coupling on the other hand orients the devices along opposite magnetization direction.

phenomenon known as Tunnel Magnetoresistance (TMR), the resistance of the logic 1 state is different from that of the logic 0 state. The current requirements to perform each of the write, clock and read operations are much lower than those needed to generate external magnetic fields. The use of MTJs can therefore reduce overall power in magnetic logic computation compared to a traditional NML implementation and allow integration with CMOS, which adds cell selectivity in an array of cells.

While our previous work [1] proposed the MTJ as the elemental logic cell in NML and presented details of the MTJ stacked layer analysis and the need for tilted MTJs, this work focuses on the architecture of NML for realizing circuits ranging from small gates to large cascaded circuits. By integrating 22nm CMOS with MTJs, we propose in this paper a novel NML architecture. A certain spacing needs to be maintained between the MTJs for effective magnetic coupling to take place. At the same time, the 22nm CMOS transistor and metal pitch should not be violated. Although selected previous research used MTJs in nanomagnetic logic, our work is significantly different and is the first (i) to utilize magnetic coupling between the free layers of neighboring MTJs for logic computation, and (ii) to present a CMOS-integrated architectural solution with low-power read, write and clocking suitable for large magnetic logic implementation. A review of previous work is presented in Section II.

A brief summary of the technical contributions of this paper is as follows:

- 1) We have designed the proposed hybrid architecture to obey the constraints of dipolar interaction of magnetic logic, central to magnetic information processing. Some of these constraints are: free layer dimension (shape anisotropy and super-paramagnetic limit), intercell spacing, pinned and free layer configurations and readability through TMR.
- 2) We have designed the architecture obeying the 22nm CMOS integration constraints such as 22nm CMOS metal pitch requirements, sizing of the access transistors capable of driving the required switching current and minimizing the number of routing metal layers.
- 3) The architecture is designed such that all magnetic cells are connected with a bitline and a sourceline. A few cells are connected to wordline as well in order to



- selectively write to input cells or
- to selectively deactivate a few cells while others are written or clocked in order to achieve lower power dissipation
- to selectively clock a cell and finally
- to read an output cell.

The architecture is regular and uses bitlines, sourcelines and wordlines similar to conventional memory. Hence the architecture is suitable for logic-in-memory application.

- 4) This work is the first to integrate the clocking schemes and timing of clocking pulses into the regular 2D grid architecture. The clocking scheme that is introduced in this work is a sequence of three voltage pulses that ensures the desired clocking operation.
- 5) We have proposed a differential ReadOut scheme where a bit is compared against its complement, which eliminates the requirement for a precise reference voltage or resistance. Note that since nanomagnetic logic relies on neighbor interaction, a bit and its antiferromagnetically coupled neighbor automatically provide the complementary values. We leverage this feature of NML and have thus reduced the expensive circuitry required to store variation-prone reference values. Additional features of the proposed read schemes are:
  - a) Non-destructive read scheme where the cell's logic states are retained while the cell is being read by TMR.
  - b) Since magnetic cells are sensitive to fabrication and geometric variations, we have proposed a variation tolerant read scheme where duplicate copies (can be extended to multiple copies as well) of the output cells and their complements are compared.
- 6) Finally, we have used the features of the architecture to build a two-input XOR logic (Section IV-F) which is an important component in datapath circuits.
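The differential read idea in contribution 5 can be illustrated with a minimal Python sketch. The resistance values, the read current, the mapping of logic 1 to the high-resistance state, and all function names are illustrative assumptions rather than values from this paper. Summing the resistances of duplicate copies is likewise only one plausible reading of the variation-tolerant scheme, which is detailed in Section VII.

```python
import random

# Illustrative values only (assumptions, not taken from the paper).
R_LOW, R_HIGH = 2000.0, 3000.0   # ohms
I_READ = 10e-6                   # amperes

def cell_resistance(bit, sigma=0.0, rng=random):
    """Resistance of a cell storing `bit`, with optional Gaussian variation.

    Assumption: logic 1 maps to the high-resistance state.
    """
    nominal = R_HIGH if bit else R_LOW
    return nominal * (1.0 + rng.gauss(0.0, sigma))

def differential_read(r_s, r_s_bar):
    """Recover the bit from the sign of V(S) - V(S_bar); no reference is used."""
    return 1 if I_READ * (r_s - r_s_bar) > 0 else 0

def tolerant_read(bit, copies=2, sigma=0.05, rng=random):
    """Compare the summed resistance of `copies` cells holding S against
    the same number of cells holding its complement (assumed scheme)."""
    r_s = sum(cell_resistance(bit, sigma, rng) for _ in range(copies))
    r_s_bar = sum(cell_resistance(1 - bit, sigma, rng) for _ in range(copies))
    return differential_read(r_s, r_s_bar)
```

Because only the sign of the difference matters, a common shift in both resistances (for example from process variation that affects neighboring cells alike) does not change the decoded bit in this sketch.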

In Section VIII, we report the MTJ dimensions, the current densities and pulse durations for writing, reading, and clocking operations. Several adders (half, full and 32-



**Figure 3:** Various configurations of MTJs. (a) **Inplane MTJ:** Both the reference and the free layer have their easy axis along the x-direction. The anti-ferromagnetic coupling from the reference layer onto the free layer dominates over the magnetostatic coupling from the neighboring free layer. (b) **Perpendicular MTJ:** The free layer has its easy axis along the x-direction while the reference layer has it along the z-direction, which is perpendicular to the plane of the free layer. Neighbor interaction is observed to switch the free layer from one of its energy maximum states. However, the cells do not display any TMR between the two energy minimum states of the free layer along the easy axis. Low power STT current-induced clocking by targeting stationary states in the y-z plane is not possible. (c) **Tilted MTJ:** Cells with their reference layer aligned equally to the z and x axes while their free layer has an in-plane anisotropy along the x-direction. Cells are capable of using neighbor interaction for logic computation. A distinctive TMR separates the +x orientation of the free layer from the -x alignment. STT current-induced low power clocking can be achieved.

bit ripple carry) and an array multiplier are implemented in NML, evaluated for delay and energy, and compared with traditional NML implementations for energy efficiency. Section *IX* concludes our work.

#### II. MAGNETIC LOGIC: A REVIEW OF THE TRADITIONAL AND CONTEMPORARY MAGNETIC LOGIC IMPLEMENTATION

Cowburn et al. [7] and Imre et al. [8] were the first to demonstrate successful room-temperature Magnetic Quantum Cellular Automata (MQCA) operation. Logic functions were realized using ferromagnetic and antiferromagnetic coupling between the magnetic elements. The conventional elemental cells of MQCA are single-domain nanomagnets. Various MQCA logic components and interconnects, such as the majority logic [8], AND/OR [9], NAND/NOR [10], the ferro and antiferro wire [11], fanout [12], majority line [11] and coplanar wire [13], [14], have been demonstrated over time. However, conventional MQCAs face certain drawbacks from the use of external magnetic fields for switching between their logic states. For adiabatic switching between the states, an external pumping field must be applied in the direction of the hard axis, followed by a field in the direction of the final magnetization [15], [16]. The external pumping field serves as a clock and flattens the energy landscape. The field is generated by current flowing through an underlying wire [17]. Though the dissipation in the nanomagnets during switching is minimal, the current required to generate the external field is on the order of a few hundred mA [18]. This large current discourages the use of magnetic logic for low-power logic implementations. Moreover, the field generated by a clocking wire influences the cells in the neighboring clock zones. Furthermore, to support this large current, the dimensions of the underlying wires need to be in the range of a few micrometers [17]. However, for effective interaction between the cells, the spacing between two cells should be maintained in the range of 20-30 nm. Therefore, a micrometer-wide wire would encompass multiple cells under its magnetic field. This clocking strategy, though suitable for long interconnects, cannot place single rows of cells in different clock zones.

Modifying the shape of the nanostructures in order to alter the topography of their energy curves and energy barrier [9], [19] has been explored to devise more efficient switching and clocking mechanisms. But the devices still suffer from power dissipation in the external circuits used for clocking. Writing to the input cells has been proposed and carried out through fields generated either by input wires external to the logic [20] or by external MTJs in close proximity (1 nm) to the nanomagnetic cells [21]. A feasible dynamic input mechanism to write into magnetic logic needs further exploration and experimental study. Reading the output of the logic has been performed with the help of output sensors that transport the signal to off-chip peripherals for data determination [15]. The peripheral circuits used in writing, reading, and clocking have not been much explored. Preliminary calculations suggest that the peripherals generally consume considerable power while also affecting the compactness and homogeneity of the logic entity. The feasibility of simultaneous clocking and reading through the above-mentioned schemes is still an open question that needs further attention. Moreover, the lack of controllability over individual cells in existing MQCA logic limits its use in compact designs of magnetic logic.

Observing these challenges faced by the existing magnetic logic architectures, we realized a critical need to integrate the input, output, and clocking operations of magnetic logic with CMOS as a solution to the drawbacks mentioned above. The MTJs, which can be written, clocked, and read with the aid of current, show promise as an alternate candidate for magnetic logic realization. A group of researchers have demonstrated the use of MTJs integrated with CMOS in the building of non-volatile lookup table for field programmable gate arrays [22]. The non-volatility property of MTJ helps in immediate power up and zero standby power. In another effort to extend the concepts of Magnetic Logic, researchers

have recently demonstrated the feasibility of integrating MTJs with CMOS for resistance measurements [23].

In [24], Lee et al. have proposed building a full adder with MTJs. However, the authors utilize only the spin-current-induced writing and the TMR of the MTJs to perform the logic operation. Such a logic implementation faces serious concerns when cascaded to realize larger circuits, as stage-to-stage interaction is electrical and not magnetic. In this paper, we use the STT-induced write and clock and the TMR-based read of the MTJs. In addition, we use one more property of MTJs: the ability of their free layers to compute and propagate information like any single-domain nanomagnet. This helps in cascading without CMOS intervention between stages of the logic. The MTJs in the logic are also integrated with CMOS, which improves localized control over individual cells of a magnetic logic architecture.

#### III. MTJS AS ELEMENTAL LOGIC CELLS

We have chosen free-layer dimensions of $100 \times 50\ \text{nm}^2$ for the MTJs used in our architecture. The free layers are single-domain and store logic 1 and 0 through their magnetization direction (see Fig. 1b). However, the effectiveness of interaction between the free layers of neighboring MTJs for logic realization also depends on the orientation of their fixed layers, as discussed next. In this paper we broadly classify MTJs into three different configurations depending on the polarities of their free and fixed layers.

- 1) inplane fixed layer with inplane free layer, *abbr. inplane*
- 2) perpendicular fixed layer with inplane free layer, *abbr*. *perpendicular*
- 3) tilted fixed layer with inplane free layer, *abbr. tilted*

Table *II* lists the plane of magnetization of the free layer, the polarity of the fixed layer and direction of the easy axis, saddle point and hard axis direction of the free layer for each of the three types of MTJs. In Table III we have mentioned the direction of magnetization of the free layer during logic 1 & 0 representation in the three types of MTJs. Table III also provides a qualitative assessment of (i) interaction between free layers of neighboring MTJs for logic realization (obtained through simulations in LLG [2]) (ii) STT current induced clocking of the free layers and (iii) TMR based readability. For inplane MTJs, the antiferromagnetic coupling between the free and fixed layer of a single MTJ hinders effective coupling between the free layers of two neighboring MTJs. Neighbor coupling in perpendicular MTJs is excellent since the fixed layer has no inplane component of magnetization. In tilted devices, neighbor coupling is possible and the effectiveness is between those of the inplane and the perpendicular MTJ.

We have clocked the MTJs in the logic using a train of STT voltage pulses. During clocking, the voltage pulses (of appropriate magnitudes and durations) sweep the magnetization of the free layer of an MTJ to its saddle point [1]. We have seen in [1] that to obtain this feature, we need an MTJ with the fixed layer polarized at $45^{\circ}$ to the x and z axes. STT clocking is therefore possible only in the tilted MTJs. Perpendicular MTJs do not have TMR since the polarization

**Table II:** MTJ types and the characteristic features of their free and fixed layers.

| MTJ type | Magnetization plane of free layer | Polarization of fixed layer | Easy axis | Saddle point\* | Hard axis |
|----------|-----------------------------------|-----------------------------|-----------|----------------|-----------|
| Inplane  | $x$-$y$ | $x$-axis | $x$-axis | $y$-axis | $z$-axis |
| Perp.    | $x$-$y$ | $z$-axis | $x$-axis | $y$-axis | $z$-axis |
| Tilted   | $x$-$y$ | $45^{\circ}$ to $x$ and $z$ axes | $x$-axis | $y$-axis | $z$-axis |

**Table III:** Logic 1 & 0 representation and possibility of coupling, STT clocking and TMR in the three different types of MTJs.

| MTJ type | Logic 0 | Logic 1 | Neighbor coupling | STT clocking | TMR read |
|----------|---------|---------|-------------------|--------------|----------|
| Inplane  | $+x$-axis | $-x$-axis | no  | no  | yes |
| Perp.    | $+x$-axis | $-x$-axis | yes | no  | no  |
| Tilted   | $+x$-axis | $-x$-axis | yes | yes | yes |

of the free layer is normal to that of the fixed layer for any logic 0 & 1 configuration. However, a reasonable TMR is observed in tilted devices due to the *x* component of magnetization of their fixed layer. Section *VII* explains this in further detail. Since the tilted MTJs are the only candidates that possess all three properties listed in Table *III*, we have used them as elementary cells for logic computation in our novel hybrid architecture.
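The geometric argument behind the TMR column of Table III can be checked with unit vectors. The Python sketch below assumes the commonly used angular dependence in which the junction conductance varies with $\cos\theta$, where $\theta$ is the angle between the free and fixed layer magnetizations, so the read contrast between the two logic states scales with the difference in $\cos\theta$. The dictionary and function names are hypothetical, and the numbers are relative, not device values.

```python
import math

# Fixed-layer polarization direction for each MTJ type (unit vectors, x-y-z).
FIXED = {
    "inplane": (1.0, 0.0, 0.0),
    "perpendicular": (0.0, 0.0, 1.0),
    "tilted": (math.cos(math.radians(45)), 0.0, math.sin(math.radians(45))),
}

# Free-layer magnetization for the two logic states (along the easy axis).
FREE_STATES = {"+x": (1.0, 0.0, 0.0), "-x": (-1.0, 0.0, 0.0)}

def cos_angle(a, b):
    """Cosine of the angle between two 3D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def read_contrast(mtj_type):
    """Difference in cos(theta) between the +x and -x states.

    Assumption: conductance varies with cos(theta), so zero contrast
    means the two logic states cannot be told apart by resistance.
    """
    fixed = FIXED[mtj_type]
    return abs(cos_angle(FREE_STATES["+x"], fixed)
               - cos_angle(FREE_STATES["-x"], fixed))

for name in FIXED:
    print(name, round(read_contrast(name), 3))
```

Under this assumption the perpendicular configuration gives zero contrast because the free layer is normal to the fixed layer in both states, while the tilted configuration retains a nonzero contrast through the x component of its fixed layer.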

#### IV. REGULAR HYBRID CMOS-MAGNETIC LOGIC ARCHITECTURE

# A. Integration Challenges

The integration of MTJs with 22nm CMOS for NML realization needs to meet the following basic criteria.

- 1) The *spacing between the MTJs* should allow effective neighbor interaction.
- 2) The CMOS minimum metal pitch requirement should not be violated.
- 3) Transistors need sufficient W/L ratio to sustain the required writing and clocking currents.
- 4) *Minimize the number of metal layers* for cost-effective implementation.

We have devised a novel CMOS-Magnetic logic architecture addressing the above-mentioned challenges. We observed satisfactory coupling between MTJs when they are placed 20nm apart. The architecture has a regular 2D lattice structure (see Fig. 4) of rows and columns of MTJs placed 20nm apart. The row pitch of the architecture is (50 + 20)nm = 70nm. The column pitch is (100 + 20)nm = 120nm. The CMOS minimum metal pitch of 64nm for layer 1 and intermediate wiring [25] is satisfied. Access transistors are integrated with 1 in 4 MTJs of the architecture. For example (see Fig. 4), only $X_{11}$ in the group ($X_{11}$, $X_{12}$, $X_{21}$ and $X_{22}$) has an access transistor.
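The pitch arithmetic above can be reproduced with a short Python sketch. The numbers are those stated in the text (a $100 \times 50$ nm free layer, 20nm spacing, 64nm minimum metal pitch); the function and constant names are hypothetical.

```python
FREE_LAYER_NM = (100, 50)    # (long side, short side) of the free layer
SPACING_NM = 20              # edge-to-edge spacing between neighboring MTJs
MIN_METAL_PITCH_NM = 64      # metal 1 and intermediate wiring, from [25]

def pitches(dims=FREE_LAYER_NM, spacing=SPACING_NM):
    """Cell pitches: one cell dimension plus the edge-to-edge spacing."""
    long_side, short_side = dims
    return {"row": short_side + spacing, "column": long_side + spacing}

def satisfies_metal_pitch(pitch_nm, minimum=MIN_METAL_PITCH_NM):
    """True when a line pitch does not violate the minimum metal pitch."""
    return pitch_nm >= minimum

for name, value in pitches().items():
    print(name, value, "nm, ok" if satisfies_metal_pitch(value) else "nm, violates")
```

Both pitches exceed 64nm, so lines routed at the cell pitch do not violate the minimum metal pitch; the 70nm pitch leaves only a 6nm margin.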

\*A saddle point is an energy equilibrium position with energy value in between the easy axis and the hard axis of the magnet.



**Figure 4:** The regular CMOS-magnetic logic architecture and its CMOS integration with the MTJs. The cell (MTJ) layout in the architecture is regular, with constant horizontal and vertical pitches of 70nm and 120nm maintained between the cells. The blue cells are integrated with an underlying access transistor. The yellow cells do not have an access transistor. Note that only one cell with an access transistor is present for every $2 \times 2$ cells of the array, e.g. $X_{11}, X_{12}, X_{21}$ and $X_{22}$. Also, only one of two adjacent rows (e.g. $r_1$ and $r_2$) can have cells with access transistors. $Col_i$ represents the $i^{th}$ word line running across the architecture. $r_i$ represents the rows of the architecture. Each row is identified by a pair of bit and source lines.

#### B. Salient Features of the Hybrid Architecture

- 1) A transistor for every $2 \times 2$ MTJ array, i.e. 1 in 4 MTJs has an access transistor. Two properties of the architecture follow from this. The minimum spacing between two MTJs with access transistors (any two adjacent blue cells in Fig. 4) is
  - a) one row apart (e.g. cells  $X_{11}$  and  $X_{31}$ ) in the vertical direction
    - which implies that out of any two adjacent rows (e.g.  $r_3 \& r_4$  in Fig. 4) only one can have MTJs with access transistors.
  - b) one column apart (e.g. cells $X_{11}$ and $X_{13}$) in the horizontal direction.
- 2) A Source and Bit line pair for every row and Word Line for every alternate column.

A dedicated source and bit line pair is assigned to every row of cells in the architecture (see Fig. 4, 5, & 6). The bit line is housed in metal layer 2 while the source line in metal layer 1. The bit line for a row is connected to the free layer of the MTJs in that row. It runs on-axis with the MTJs. The connection of the source line to an MTJ varies depending on the row where the MTJ sits.

- a) For rows without access transistors (e.g.  $r_2$  in Fig. 4), the source line is connected on-axis to the fixed layer of the MTJs in that row (see cell  $X_{21}$  in Fig. 5).
- b) For rows with access transistors, the source line is aligned to the source of the access transistor (cell $X_{11}$ in Fig. 5). The drain of the access transistor is connected to an MTJ.

The word line runs vertically across the architecture and is housed in metal layer 3. One word line is present for every alternate column of cells. This condition is



**Figure 5:** A 3D view of a column of the architecture. This is $Col_1$ of Fig. 4 as seen from the direction of the red arrow. M1, M2 and M3 represent the three different metal layers used in routing the bit, source and word lines in the architecture. Note that the source and bit lines run parallel to each other and identify a particular row of the architecture. The word lines run vertically across the architecture and are connected to the polysilicon lines of the access transistors that are in one vertical column.



**Figure 6:** A 3D view of a row with both types of MTJs: MTJs with and MTJs without access transistors. The figure shows row $r_1$ of Fig. 4 viewed in the direction of the arrow. Note that the contact with the source line for the MTJ without an access transistor $(X_{12})$ is offset from the midpoint in order to maintain alignment with the MTJ with an access transistor $(X_{11})$. The color code is consistent with Fig. 5.

imposed by the placement of access transistors beneath alternate MTJs in a row. Only three metal layers are utilized in designing the regular architecture. Their respective pitches in the architecture and the corresponding pitch requirements in 22nm CMOS are listed in Table *IV*. Fig. 7 gives a 2D cross-section view of Fig. 5 looking from the front.
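The placement rule of one access transistor per $2 \times 2$ group, and the resulting word line on every alternate column, can be sketched in Python. The sketch assumes, following the $X_{11}$ example of Fig. 4, that transistors sit at odd rows and odd columns (1-indexed); the function names are hypothetical and an actual logic layout may remove or shift cells.

```python
def has_access_transistor(row, col):
    """1-indexed cell X_{row,col}; assumed rule: odd row and odd column."""
    return row % 2 == 1 and col % 2 == 1

def build_grid(n_rows, n_cols):
    """Boolean grid, True where a cell has an access transistor."""
    return [[has_access_transistor(r, c) for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]

def word_line_columns(n_cols):
    """Columns that need a word line: every alternate column."""
    return [c for c in range(1, n_cols + 1) if c % 2 == 1]

def transistors_in_window(grid, top, left):
    """Count transistors in the 2x2 window whose top-left cell is
    grid[top][left] (0-indexed)."""
    return sum(grid[r][c] for r in (top, top + 1) for c in (left, left + 1))

grid = build_grid(6, 6)
# Every 2x2 neighborhood of the array contains exactly one access transistor.
windows_ok = all(transistors_in_window(grid, r, c) == 1
                 for r in range(5) for c in range(5))
print(windows_ok, word_line_columns(6))
```

With this rule, no two adjacent rows both contain access transistors, and no two transistors in a row share a word line, which matches the two properties listed under feature 1.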

#### C. Operation techniques

When an MTJ needs to be written or clocked, a suitable current needs to flow through it. The technique to write or clock is different for MTJs with and without access transistors. We briefly describe the two methods in this section, while the details of the writing and clocking principles are discussed in Sections V & VI.

- 1) *MTJs without access transistors* 
  - To have the desired writing and clocking currents through these MTJs, an appropriate potential is applied across the bit and source lines of the row(s) to which the MTJ(s) belong.
- 2) *MTJs with access transistors*

**Table IV:** Metal layers and their pitches in the architecture.

| Metal Layer | Architecture Line | Pitch† | CMOS pitch required <sup>††</sup> |
|-------------|-------------------|--------|-----------------------------------|
| Metal 1     | Source Line       | 70     | 64                                |
| Metal 2     | Bit Line          | 120    | > 64                              |
| Metal 3     | Word Line         | 140    | > 64                              |

† All pitches are in nm.

*†† Metal 1 and intermediate wiring.* 

Likewise, the required potential is applied across the bit and source lines of the row(s) to which the MTJ(s) belong. But a current can flow through the MTJ only if its access transistor is on. The access transistor is turned on and off by the potential on the word line. Therefore, writing and clocking of these MTJs are controlled by their word line in addition to the bit and source lines of their rows.
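The two operation techniques amount to a simple gating condition, sketched below in Python. This is a Boolean abstraction only: it ignores current magnitude and direction, and the function names and the example row are hypothetical.

```python
def current_flows(row_biased, has_transistor, word_line_on=False):
    """True when a write or clock current can flow through an MTJ.

    row_biased: a potential is applied across the bit and source lines
                of the MTJ's row.
    has_transistor: the MTJ sits on an access transistor.
    word_line_on: the word line of that transistor is selected.
    """
    if not row_biased:
        return False
    if has_transistor:
        return word_line_on
    return True

def driven_cells(row, active_word_lines):
    """Names of the cells that conduct when their row is biased.

    row: list of (name, word_line) pairs; word_line is None for an MTJ
         without an access transistor.
    """
    return [name for name, wl in row
            if current_flows(True, wl is not None, wl in active_word_lines)]

# Hypothetical row with one input cell, one ungated cell, one gated cell.
example_row = [("input", "WL1"), ("standard", None), ("controlled", "WL2")]
print(driven_cells(example_row, {"WL2"}))
```

In the example, selecting only `WL2` drives the ungated cell and the gated cell on `WL2`, while the cell on `WL1` is left undisturbed even though its row is biased.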

# D. Cell Types

The placement of transistors in the array and an MTJ's location in the logic give rise to three broad categories of cells.

• Input Cells

They are MTJs that are:

- only written to provide input to a logic gate/block and are never clocked. Their word line is activated only during the writing period.
- always integrated with access transistors. Therefore, the inputs remain unaffected when other MTJs in their rows are clocked later in time. This is made possible since no two access transistors in a row share a common word line.

Cells A, B,  $\overline{A}$  and  $\overline{B}$  are input cells in Fig. 9. They are marked in green.

• Output Cells

These MTJs also have access transistors. They undergo two operations:

- 1) *Clocking*
- 2) *Reading*

These operations are further discussed in later sections. The output cells are marked in red in Fig. 9.

• Logic Cells

They form the heart of the array and perform the logic computation. These cells are only clocked and are never written or read. They are of two types:

- *Standard Cells*, or MTJs without access transistors. These MTJs are marked in yellow in Fig. 9. The minimum spacing between two such MTJs is equal to the row and column pitch of the array.
- *Controlled Cells*, or MTJs with access transistors. These cells are marked in blue in Fig. 9.

## E. Elementary Logic Blocks

The three building blocks in the architecture that are used to realize any logic are:





- 1) *Majority* for performing AND/OR operation (see Fig. 8a). The fixed MTJ in these blocks is set to either *logic* 1 or *logic* 0 depending on the AND/OR operation [8].
- 2) *Interconnects*, which are further classified into

- a) *Horizontal interconnects or Horizontal wires* where the MTJs interact with their neighbors through antiferromagnetic coupling. A horizontal interconnect propagates a bit by generating complementary values on any two adjacent cells (see Fig. 8b).
- b) Vertical interconnects or Vertical wires where the MTJs interact with their neighbors through ferromagnetic coupling. A vertical interconnect transmits a bit without altering its value in any of its cells (see Fig. 8b).
- 3) Differential output generator block that generates the output, S, and its complement,  $\overline{S}$ . In magnetic logic, antiferromagnetic coupling makes it possible to generate complementary values on horizontally adjacent MTJs, a concept which we have used. The difference in the electrical resistances of S and  $\overline{S}$  is then utilized to produce a differential output voltage. This method, explained in detail in Section VII, eliminates the need of a reference voltage to determine the output of the logic.
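At the Boolean level, the three building blocks can be sketched in Python as follows. This abstraction deliberately ignores the magnetization dynamics and any inversion that the physical placement of a majority gate's inputs may introduce; the function names are hypothetical, and the choice of fixed value 0 for AND and 1 for OR follows the standard majority-gate identity.

```python
def majority(a, b, c):
    """Majority vote of three bits."""
    return int(a + b + c >= 2)

def and_gate(a, b):
    """Majority block with the fixed MTJ held at logic 0."""
    return majority(a, b, 0)

def or_gate(a, b):
    """Majority block with the fixed MTJ held at logic 1."""
    return majority(a, b, 1)

def horizontal_wire(bit, n_cells):
    """Antiferromagnetic coupling: adjacent cells hold complementary values."""
    return [bit ^ (i % 2) for i in range(n_cells)]

def vertical_wire(bit, n_cells):
    """Ferromagnetic coupling: every cell holds the same value."""
    return [bit] * n_cells

def differential_pair(s):
    """Output S and its complement on horizontally adjacent cells."""
    return tuple(horizontal_wire(s, 2))
```

For instance, `horizontal_wire(1, 4)` yields the alternating pattern `[1, 0, 1, 0]`, so a horizontal wire with an even number of cells delivers the complement of its first cell at its last cell.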

# F. Case Study: Two-input XOR

Fig. 9 shows a schematic view of a two-input XOR implementation using the proposed architectural and layout specifications to compute the output S. Note that S is equivalent to the sum output of a half adder. The following cells build up the logic:

- Input Cells marked in green.
- Standard Cells marked in yellow.
- Controlled Cells marked in blue.
- *Output Cells* marked in red.

As discussed in Section IV-D, the green, blue and red cells have access transistors integrated to them as seen in the figure.

The metal lines of the logic include

- *Source Line*: horizontal lines in blue (metal layer 1).
- *Bit Line*: horizontal lines in black (metal layer 2).
- *Word Line*: vertical lines in green (metal layer 3).

**Figure 8:** The cell layouts for major logic blocks in the architecture. (a) A majority AND and OR. (b) Horizontal and Vertical interconnects between different logic blocks in the architecture. The interconnects are used to propagate logic information between the logic blocks by using the antiferromagnetic and ferromagnetic coupling between the free layers of two adjacent MTJs.

**Figure 9:** Schematic view of a two-input XOR logic.

1) Explaining the cell placements: Each of the rows of MTJs $(r_1, r_2, r_3, ...)$ in the schematic view in Fig. 9 has a source and bit line. The MTJs are placed in the form of a 2D array. Just as in NML, MTJs at certain locations of the array are removed to realize the logic. At the same time, we have placed the Standard and Controlled Cells so that only one access transistor is present for every $2 \times 2$ neighboring MTJs in the array.

To minimize power consumption, it is advisable to place a minimum number of Standard Cells (MTJs directly connected to source and bit lines) in a row. The current flow through Standard Cells cannot be gated once a potential is applied across the bit and source lines connected across them. Therefore, in this logic realization, since the Input Cells each have an access transistor, we have placed only Controlled Cells alongside them in their rows to reduce the overall power consumption.

**Algorithm 1:** *Two-input XOR operation sequence of Fig.* 9

1. **Inputs:** $A, \overline{A}, B, \overline{B}$. **Outputs:** $S_a = S_b = f(A, B) = A \oplus B$.
2. Write $X_1 = \overline{B}, X_2 = A, X_3 = B, X_4 = \overline{A}$.
3. **Clock** rows $r_2$ & $r_6$. {Apply appropriate voltage pulses across bit and source lines of the rows $r_2$ & $r_6$.}
4. Release clock.
5. $X_5$, $X_6$ & $X_7$ settle to $A.\overline{B}$, $A.\overline{B}$ & $A.\overline{B}$ respectively.
6. $X_8$, $X_9$ & $X_{10}$ settle to $\overline{A}.B$, $\overline{A}.B$ & $\overline{A}.B$ respectively.
7. Select word line $WL_2$. Clock rows $r_3$ & $r_5$. $\dagger$
8. Release clock.
9. $X_{11}$ & $X_{12}$ settle to $A.\overline{B}$ & $\overline{A}.B$ respectively.
10. Clock row $r_4$.
11. $X_{13}, \ldots X_{16}$ settle to $(A \oplus B) \ldots (\overline{A \oplus B})$ respectively.
12. Select word line $WL_3$. Clock rows $r_3$ & $r_2$ and $r_5$ & $r_6$.
13. Release clocks in sequence: $r_3$ then $r_2$ & $r_5$ then $r_6$.
14. Both $X_{17}$ & $X_{18}$ settle to $\overline{A \oplus B}$.
15. Both $X_{19}$ & $X_{20}$ settle to $A \oplus B$.
16. Select word line $WL_4$. Clock rows $r_1$ & $r_7$.
17. $X_{21}$ & $X_{23}$ settle to $\overline{A \oplus B} = \overline{S}_a$ & $\overline{S}_b$ respectively.
18. $X_{24}$ & $X_{26}$ settle to $A \oplus B = S_a$ & $S_b$ respectively.

$\dagger$ $X_2, X_3, X_{17}$ and $X_{19}$ are not clocked since their access transistors are not turned on.

For this reason, $X_{11}$, $X_{12}$, $X_{17}$ and $X_{19}$ are Controlled Cells. Also note that no two adjacent rows have access transistors (a consequence of *feature* 1, Section IV-B). By the same rule, $X_{13} \cdots X_{16}$ are all Standard Cells. The output and its complement are generated in cells $X_{15}$ & $X_{16}$ and are copied through ferromagnetic coupling into $X_{17}$ and $X_{19}$. They are then propagated along the two vertical columns to produce the output values in cells $X_{24}$ & $X_{26}$ and the complement in $X_{21}$ & $X_{23}$ (shown in red). The two pairs of cells are then used by the two arms of the variability tolerant differential read scheme discussed later in Section VII.

2) The Elementary Logic blocks used in the two-input XOR implementation are:

- *Majority AND*: Two  $(A_1 \& A_2 \text{ annotated in Fig. 9})$ .
- *Majority OR*: One  $(O_1)$ .
- Horizontal Wire: Five  $(H_1, H_2, H_3, H_4, H_5)$ .
- Vertical Wire: Two  $(V_1, V_2)$ .
- Differential Output Generator: One  $(D_1)$ .

3) Sequence of logic operation: Algorithm 1 gives an outline of the sequence of the 2-input XOR operation for the circuit shown in Fig. 9.
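The logic function that Algorithm 1 settles to can be checked with a small behavioral sketch. It models each Majority AND and Majority OR block as a three-input majority vote with one input pinned to 0 or 1; that pinning, and the function names `majority`, `maj_and`, `maj_or` and `xor_from_blocks`, are assumptions made for illustration and say nothing about how the cells are physically biased in Fig. 9.

```python
from itertools import product

def majority(a: int, b: int, c: int) -> int:
    """Three-input majority vote on 0/1 values."""
    return 1 if a + b + c >= 2 else 0

def maj_and(a: int, b: int) -> int:
    # Assumption: the third majority input is pinned to logic 0.
    return majority(a, b, 0)

def maj_or(a: int, b: int) -> int:
    # Assumption: the third majority input is pinned to logic 1.
    return majority(a, b, 1)

def xor_from_blocks(a: int, b: int):
    """Return (S, S_bar) built from two Majority ANDs and one Majority OR."""
    not_a, not_b = 1 - a, 1 - b
    first_and = maj_and(a, not_b)    # A.(not B), as on X5..X7
    second_and = maj_and(not_a, b)   # (not A).B, as on X8..X10
    s = maj_or(first_and, second_and)
    return s, 1 - s                  # complement, as from the differential output generator

for a, b in product((0, 1), repeat=2):
    s, s_bar = xor_from_blocks(a, b)
    print(f"A={a} B={b} -> S={s} S_bar={s_bar}")
```

The sketch only verifies the Boolean behavior; the clock ordering and word-line selection of Algorithm 1 are not modeled.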

#### V. STT CURRENT INDUCED WRITE-IN SCHEME

In our architecture, we write a *logic* 0 or *logic* 1 into the Input MTJs (Input Cells) with the help of STT current. Depending on its direction and magnitude, the current can write a 1 or a 0 into the MTJ [26]. Writing into an Input Cell takes place in two steps:

1) Turn on its access transistor by selecting its word line.
2) Apply a suitable potential difference across the bit and source lines of the row to which the Input Cell belongs. To write
   - a) *Logic* 0, a negative potential difference is applied between the bit and source lines. The resulting current is called the negative current.
   - b) *Logic* 1, the reverse potential is applied across the bit and source lines. The resulting current is called the positive current.

Figure 10: Input Cell: In an Input Cell the MTJ is always integrated with an access transistor. The color code is consistent with Fig. 5.

#### How does it write?

The writing takes place through conservation of angular momentum between the electrons in the current and the electrons residing in the device [26]. In a negative current, a stream of electrons flows through the device from the fixed layer to the free layer. These electrons are first polarized by the fixed layer along its direction of magnetization. On reaching the free layer, the spin-polarized electrons transfer their angular momentum to the electrons in the free layer. If the magnitude of the spin-polarized current exceeds a certain critical value, it causes the free layer to align along the +x direction [27] (see Fig. 1b). We say that a *logic* 0 is written into the Input Cell. Similarly, a positive current carries a stream of electrons from the free layer to the fixed layer. A *logic* 1 is written into the cell by the electrons reflected from the interface of the fixed layer and the metal layers underneath it. The reflected electrons have a polarity opposite to that of the fixed layer and therefore align the Input Cell along the -x direction (see Fig. 1b).

The relationship between the spin-polarized current and the magnetization of the free layer is best captured by the Landau-Lifshitz-Gilbert (LLG) equation with the Slonczewski term added to it, given by Eq. (1) in Table V. The switching from one logic state to another can take place either through multiple precessional movements or through half a precession, depending on the current magnitude and duration [30], resulting in a trade-off between speed and power consumption. The critical current for switching the cell is given in Eq. (4) [31]. For a half-precession switch, current pulses of both polarities (generated by appropriate voltages), one followed by the other, should each be applied for a duration of $\tau/2$ [30], where $\tau$ is given by Eq. (9). The relevant symbols are explained in Table VI.
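As a rough consistency check on Eqs. (4) to (6), the ratio of the two writing currents can be evaluated from the TMR alone. The sketch below assumes that every factor in Eq. (4) other than $\eta(\theta)$ is the same for both switching directions, so that the ratio reduces to $\eta_-/\eta_+$. It uses the TMR of about 30% and the current magnitudes of Table VIII quoted elsewhere in this paper; the function names are illustrative.

```python
import math

def spin_polarization(tmr: float) -> float:
    """Eq. (6): p = sqrt(TMR / (TMR + 2))."""
    return math.sqrt(tmr / (tmr + 2.0))

def efficiency(p: float, zero_to_one: bool) -> float:
    """Eq. (5): eta = p / (1 +/- p^2), '+' for logic 0 -> 1, '-' for logic 1 -> 0."""
    sign = 1.0 if zero_to_one else -1.0
    return p / (1.0 + sign * p * p)

def write_current_ratio(tmr: float) -> float:
    """|I_c(0 -> 1)| / |I_c(1 -> 0)| from Eq. (4).

    Assumption: alpha, M_s, Vol, H_k and H_eff are identical for both
    directions, so only 1/eta differs between the two currents.
    """
    p = spin_polarization(tmr)
    return efficiency(p, zero_to_one=False) / efficiency(p, zero_to_one=True)

TMR = 0.30                       # TMR quoted for the devices in this paper
ratio_model = write_current_ratio(TMR)
ratio_table = 278.9 / 216.0      # magnitudes of the two write currents in Table VIII

print(f"model ratio: {ratio_model:.3f}")
print(f"table ratio: {ratio_table:.3f}")
```

Under this assumption the ratio simplifies algebraically to $1 + TMR$, which lies close to the ratio of the two write-current magnitudes listed in Table VIII.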

The main advantages of this STT current driven writing are as follows:

1) Low Power: The writing current is in the range of $\mu A$ (see Table VIII). With improvements in device fabrication, writing current densities as low as $0.8\ MA/cm^2$ have been obtained [32]. Field driven writing requires current in the range of mA [17].
2) Scalable: The current magnitude scales with the device dimensions (see Eq. 1). In field driven writing, the reverse effect is observed with device scaling [33], [34], [35].
3) No interference on neighboring cells during writing: Only the specific Input Cell that is selected through its access transistor is written. There are no stray fields (unlike in field driven writing) impacting the MTJs in the neighborhood.

Figure 11: Clocked state of a tilted polarizer MTJ. Please note that the free layer orients along the y axis when clocked.

#### VI. LOW POWER CLOCKING

In this paper we have proposed a novel train of voltage pulses (Fig. 12) to clock the MTJs (Standard Cells, Controlled Cells and Output Cells) inside the logic. The main underlying concept in clocking is the use of STT current to orient the magnetization of the free layer to a stationary state along the y axis [1] (see Fig. 11). The clocking current density $J_{clk}$ is given by Eq. (8) (see Table V). STT current induced clocking shares the low current, scalability, and bit selectivity properties described for the write operation in the previous section, which makes clocking a low power operation compared with the use of external fields.

We have proposed a novel train of three voltage pulses of varying magnitude and duration that are to be applied (in three different phases: I, II and III) across the bit and source lines for clocking the cells of the logic. The three different phases are discussed below:

- *Phase I*: A positive voltage pulse,  $V_1$ , is applied across the source and bit lines. The resultant current magnitude is equal to the writing current for logic 1 (see Table *VIII*). This pulse ensures that all the cells targeted for clocking are in logic 1 state at the end of Phase I.
- *Phase II*: A positive voltage pulse, $V_2$, is applied across the bit and source lines. The resultant current magnitude is equal to the writing current for logic 0 (see Table *VIII*). This pulse lasts for a quarter of a precession ($\tau/4$) and is referred to as the *QP* pulse. The pulse sweeps the magnetization of the cells from the logic 1 state towards logic 0. At the end of the pulse ($\tau/4$), the magnetization is along the *y*-direction. This phase is immediately followed by the clocking pulse described next.

| Description                  | Equation                                                                                                                                                                                       |     |
|------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----|
| LLG equation [28]            | $\frac{d\mathbf{m}}{dt} = -\gamma M_s \mathbf{m} \times \left( \mathbf{H}_{eff} - \frac{\alpha}{\gamma M_s} \frac{d\mathbf{m}}{dt} - \frac{J_e G}{J_p} \mathbf{e}_p \times \mathbf{m} \right)$ | (1) |
|                              | where $G = \left[-4 + (1+P)^3 \frac{(3+\hat{s_1}.\hat{s_2})}{4P^{3/2}}\right]^{-1}$                                                                                                            | (2) |
|                              | $J_p = \mu_0 \cdot M_s^2 \frac{\mid e \mid d}{\hbar}$                                                                                                                                          | (3) |
| Writing current [29]         | $\mid I_c \mid = \left(\frac{2e}{\hbar}\right) \cdot \left[\frac{\alpha M_s \cdot Vol}{\eta(\theta)}\right] \cdot \left(H_k + \frac{H_{eff}}{2\sqrt{2}}\right)$                                | (4) |
|                              | where $\eta(\theta) = \frac{p}{1\pm p^2}$                                                                                                                                                      | (5) |
|                              | $p = \sqrt{(TMR/(TMR+2))}$                                                                                                                                                                     | (6) |
|                              | '+' for $logic \ 0 \rightarrow logic \ 1$ & '-' for $logic \ 1 \rightarrow logic \ 0$                                                                                                          |     |
|                              | $H_{eff} = 4\pi M_s$                                                                                                                                                                           | (7) |
| Clocking current density [1] | $\mid J_{clk} \mid = \left(\frac{\mu_0 \cdot M_s \cdot \mid e \mid \cdot d \cdot H_d}{\hbar \cdot G}\right)$                                                                                  | (8) |
| Current pulse duration [30]  | $\tau = \frac{1}{4\gamma M_s}$                                                                                                                                                                 | (9) |

★ The symbols used are defined in Table VI.

- *Phase III*: A positive voltage pulse, $V_3$, is applied across the source and bit lines. The pulse magnitude is sufficient to sustain a clocking current density of $J_{clk}$. The current magnitude and duration are given in Table *VIII*. In this phase, the cell remains in the clocked state for the entire duration of the pulse.
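The three phases can be written down as a simple piecewise-constant current schedule. The sketch below is only illustrative: the current values and the 10 ps QP duration are taken from Table VIII, but the Phase I duration (set equal to the 10 ps write pulse) and the 3 ns Phase III duration (the average clocking duration quoted with Fig. 15) are assumptions, as are the names `Pulse`, `clock_pulse_train` and `current_at`. The sign of each current follows Table VIII.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Pulse:
    phase: str
    current_uA: float    # signed current, following Table VIII
    duration_ps: float

def clock_pulse_train(t_init_ps: float = 10.0, t_clk_ps: float = 3000.0) -> List[Pulse]:
    """Build the Phase I / II / III sequence for one clocked row.

    Assumptions: Phase I lasts as long as a write pulse (t_init_ps) and
    Phase III lasts t_clk_ps; neither duration is fixed by the text here.
    """
    return [
        Pulse("I", 278.9, t_init_ps),    # drive all targeted cells to logic 1
        Pulse("II", -216.0, 10.0),       # QP pulse: sweep towards the y axis
        Pulse("III", -169.3, t_clk_ps),  # hold the clocked state
    ]

def current_at(train: List[Pulse], t_ps: float) -> float:
    """Return the current (uA) at time t_ps, or 0.0 outside the train."""
    start = 0.0
    for pulse in train:
        if start <= t_ps < start + pulse.duration_ps:
            return pulse.current_uA
        start += pulse.duration_ps
    return 0.0

train = clock_pulse_train()
for t in (5.0, 15.0, 1000.0, 3020.0):
    print(f"t = {t:7.1f} ps -> I = {current_at(train, t):7.1f} uA")
```

Releasing the clock corresponds to the time after the last pulse, where the schedule returns zero current.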

#### Case Study: Clocking sequence in a magnetic logic

The novel train of voltage pulses introduced above is now applied to a magnetic logic (see Fig. 12*a*), where the blue cells indicate the inputs (*A*, *B* and *C*) to the logic. The cells $U_1 \cdots U_2$ and $W_1$, $W_2$ form the body of the logic and need to be clocked. Fig. 12*b* shows the clocking sequence. The logic has two clocking zones: one comprising cells $U_1 \cdots U_2$ and the other comprising cells $W_1$ & $W_2$. Prior to releasing the clock for a clock zone, we need to ensure that the cells in the next clock zone are in the clocked state. This makes certain that the states of the MTJs in a clock zone are influenced only by the logic from the previous clock zone, which guarantees that information propagates from the input towards the output of the logic.

Note that the technique of sequentially releasing a field-induced clock from a horizontal and a vertical row of cells, as discussed by Carlton et al. [19], is applicable to our clocking scheme as well.
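The ordering rule for adjacent clock zones can be expressed as a small schedule check. In this sketch each zone is described by the time at which its clock is asserted and the time at which it is released; the representation and the name `valid_release_order` are illustrative assumptions, and settling times are ignored.

```python
from typing import List, Tuple

# (clock_asserted_at, clock_released_at) for each zone, ordered from input to output
Zone = Tuple[float, float]

def valid_release_order(zones: List[Zone]) -> bool:
    """Check that every zone is released only while the next zone is clocked."""
    # Each zone must be asserted before it is released.
    if any(not (start < release) for start, release in zones):
        return False
    for (_, release), (next_start, next_release) in zip(zones, zones[1:]):
        # The next zone must already be clocked when this zone is released,
        # and must stay clocked until after that release.
        if not (next_start < release < next_release):
            return False
    return True

# Two zones as in the case study: the U cells first, then the W cells.
print(valid_release_order([(0.0, 5.0), (3.0, 9.0)]))   # W zone clocked before U zone releases
print(valid_release_order([(0.0, 5.0), (6.0, 9.0)]))   # W zone clocked too late
```

A schedule that fails this check would let cells in a zone settle while their downstream neighbors are already holding data, so information could flow backwards.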

# VII. TMR BASED READOUT SCHEME

The intrinsic dependence of the TMR of the MTJs on the bias voltage across them gives a wide difference in the electrical resistance of the MTJ between the *logic* 0 and *logic* 1 states near zero-bias voltage [36]. The device conductance $G(\theta)$ for an MTJ with a tilted fixed layer is given by Eq. (10) [37], where $\theta$ is the angle between the magnetizations of the free and the fixed layers. $\theta$ is given by Eq. (11). The symbols are explained in Table VI.

$$G(\theta) = \frac{1}{2}(1 + \cos(\theta))G_p + \frac{1}{2}(1 - \cos(\theta))G_{ap} \tag{10}$$

$$\theta \sim \cos^{-1} \left[ \frac{H_{dz} - \left(\frac{\hbar}{2e\alpha}\right) \left[\frac{g(\pi/2)}{M_s \cdot Vol}\right] I}{4\pi M_s + (H_k \pm H_{dx})/2} \right] \tag{11}$$

The TMR of the device can then be written as in Eq. (12) [36].

$$TMR = \frac{G_1^{-1} - G_0^{-1}}{G_0^{-1}} \tag{12}$$



**Figure 12:** STT current induced clocking in a Magnetic logic. (a) Voltage sources to illustrate clocking operation for generating output  $W_1$  and  $W_2$  from initial input states (at t = 0) of A, B, and C for a majority logic. (b) Bit states and voltage waveforms for producing valid output  $W_1 \& W_2$  through the proposed clocking sequence.  $|V_{SB_r0}|$  and  $|V_{SB_r1}|$  are the magnitudes of the voltage difference across the source and bit lines for rows  $r_0$  and  $r_1$  respectively.

Table VI: List of symbols used in the equations in this paper.

| Symbols                      | Description                                                                                                                                                                                                   |
|------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $P$                          | Spin-polarizing factor [28]                                                                                                                                                                                   |
| $\hat{s_1}$, $\hat{s_2}$     | Unit vectors along the global spin orientation of the fixed and free layers resp.                                                                                                                             |
| $M_s$                        | Saturation magnetization of the material                                                                                                                                                                      |
| $\mu_0$                      | Permeability of free space                                                                                                                                                                                    |
| $e$                          | Electron charge                                                                                                                                                                                               |
| $\alpha$                     | Damping constant                                                                                                                                                                                              |
| $\gamma$                     | Gyromagnetic ratio                                                                                                                                                                                            |
| $\hbar$                      | Reduced Planck's constant                                                                                                                                                                                     |
| $L, W, d$                    | Length, width and thickness of free layer                                                                                                                                                                     |
| $Vol$                        | Volume of free layer                                                                                                                                                                                          |
| $H_k$                        | Anisotropy field                                                                                                                                                                                              |
| $H_d$                        | Coupling field from fixed layer                                                                                                                                                                               |
| $H_{dx}, H_{dz}$             | x- and z-components of coupling field                                                                                                                                                                         |
| $\mathbf{H}_{eff}$           | Effective magnetic field on the free layer arising from crystalline and shape anisotropy, demagnetization field, exchange field and external field, which can be in the form of coupling from the fixed layer |
| $M_x, M_y, M_z$              | x-, y- and z-components of magnetization of free layer                                                                                                                                                        |
| $\mathbf{m}$, $\mathbf{e}_p$ | Unit vectors in the direction of magnetization of the free and fixed layers respectively                                                                                                                      |
| $\theta$                     | Difference in magnetization direction of the free and fixed layers                                                                                                                                            |
| $G_p, G_{ap}$                | Conductivities for the $(\theta = 0^{\circ})$ and $(\theta = 180^{\circ})$ states                                                                                                                             |

Figure 13: ReadOut circuitry for the hybrid CMOS-Magnetic logic architecture.

where  $G_1$  and  $G_0$  are the conductances of the device for logic 1 and logic 0 states respectively.
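Eqs. (10) and (12) can be evaluated directly for the resistance values used in this paper (2 kΩ for logic 0 and 2.6 kΩ for logic 1). The sketch below assumes, for illustration only, that logic 0 corresponds to $\theta = 0$ and logic 1 to $\theta = \pi$, which ignores the tilt of the fixed layer; the function names are hypothetical.

```python
import math

def conductance(theta: float, g_p: float, g_ap: float) -> float:
    """Eq. (10): angle-dependent conductance of the MTJ."""
    return 0.5 * (1.0 + math.cos(theta)) * g_p + 0.5 * (1.0 - math.cos(theta)) * g_ap

def tmr(g1: float, g0: float) -> float:
    """Eq. (12): TMR from the logic 1 and logic 0 conductances."""
    return (1.0 / g1 - 1.0 / g0) / (1.0 / g0)

R0_OHM, R1_OHM = 2000.0, 2600.0           # logic 0 and logic 1 resistances
G_P, G_AP = 1.0 / R0_OHM, 1.0 / R1_OHM    # assumption: logic 0 is the theta = 0 state

g0 = conductance(0.0, G_P, G_AP)
g1 = conductance(math.pi, G_P, G_AP)
g_mid = conductance(math.pi / 2.0, G_P, G_AP)

print(f"TMR = {tmr(g1, g0):.3f}")
print(f"G at theta = pi/2: {g_mid * 1e6:.1f} uS")
```

At $\theta = \pi/2$ the conductance is the mean of $G_p$ and $G_{ap}$, which is the value seen by a cell held in the clocked state under this simplified model.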

We have devised a Low Power Differential ReadOut scheme for reading the output of magnetic logic by effectively utilizing the TMR of the MTJ. Furthermore, the read mechanism that we describe in this section is non-destructive in nature. The Differential ReadOut scheme leverages a characteristic of the CMOS-Magnetic logic architecture, namely that a bit and its complement are spatially adjacent. The differential readout technique gives a higher sense margin since a bit is compared against its complement. The readout scheme features a higher tolerance to MTJ resistance variability by reading the average output over a number of same-state MTJs that are grouped together (e.g. $(S_a, S_b)$ and $(\overline{S}_a, \overline{S}_b)$ in Fig. 13). The ReadOut circuit is illustrated in Fig. 13. Symmetry is maintained between the transistors in the two arms of the circuit. The reading of the cell is carried out in two consecutive phases: a *Pre-charge phase* followed by a *Sensing phase*, as shown in the simulated waveforms of Fig. 14. The waveforms are simulated using 22 nm predictive CMOS technology [38], [39], [40], [41]. $(M_{1a}, M_{1b})$ and $(M_{2a}, M_{2b})$ are the access transistors of the output MTJs $S_a$, $S_b$ and their complements $\overline{S}_a$, $\overline{S}_b$, respectively. The access transistors remain on $(\phi_1 = 1)$ during the entire read operation.

During the Pre-charge phase, the $\phi_2$ signal is pulled low to turn off transistors $M_3$ and $M_4$. The active-low signal $\phi_3$ is pulled down to assist in fast pre-charge of nodes X and Y to potential $V_{DD}$. Signal $E_q$ is raised high to equalize nodes X and Y through transistor $M_9$. During the Sensing phase, $\phi_2$ is raised to a low voltage, say $V_{read}$, to apply a low voltage bias across the output MTJs. With $E_q = 0$ and $\phi_3 = 1$, a voltage difference starts to grow between nodes X and Y due to the differential current from the complementary output states. The Comparator senses the voltage difference and accordingly sets its output O/P to either high or low. Fig. 14b shows the waveforms at nodes X and Y when $S_a$ & $S_b$ are in logic state 0. Fig. 14c shows the case when $S_a$ & $S_b$ are in logic state 1. A sense margin of 32.3 mV is obtained. The comparator reads a low and a high in these two cases respectively.

The key characteristics of our ReadOut scheme are summarized below:

- **Differential Output Reading**: This technique utilizes the inherent property of the magnetic logic architecture to readily produce the complement of a bit through antiferromagnetic coupling.
- **Low Power Non-destructive Read**: To read the contents of the Output Cells, an average current in the order of 28-32 $\mu$A is supplied to the MTJs. This ensures that the contents of the MTJs are read at a low voltage bias, thus preventing the output MTJs from switching during the read operation.
- **Variability Tolerance**: In the nanometer regime, any variations that creep into the device dimensions can have a profound impact on the device's parameters. Furthermore, critical dimension and oxide thickness variations in MTJs can also result in MTJ resistance variation. Hence, we proposed a Variability Tolerant Read Architecture that reads pairs of MTJs $(S_a, S_b)$ and their complements $(\overline{S}_a, \overline{S}_b)$ simultaneously, as shown in Fig. 13. Sections S1 and S2 of the Supplementary Document present, respectively, a comparative analysis of the variability tolerance of different read schemes and a 22 nm node transistor mismatch analysis of the comparator in the proposed read circuit.
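The benefit of reading grouped MTJs can be illustrated with a small Monte Carlo sketch. It is not the analysis of the Supplementary Document: it assumes Gaussian resistance spreads with the means and standard deviations listed in Table VII, models the cells in each arm as connected in parallel so that their conductances add, and treats the comparator as ideal with zero offset. The function names and the trial count are arbitrary.

```python
import random

R0_OHM, R1_OHM = 2000.0, 2600.0      # mean resistances for logic 0 and logic 1
SIGMA0, SIGMA1 = 0.093, 0.103        # relative standard deviations

def sample_conductance(rng: random.Random, mean_r: float, rel_sigma: float) -> float:
    """Draw one MTJ conductance from a Gaussian resistance distribution."""
    return 1.0 / rng.gauss(mean_r, rel_sigma * mean_r)

def misread_rate(cells_per_arm: int, trials: int = 20000, seed: int = 1) -> float:
    """Fraction of trials in which the low-resistance arm does not win.

    Assumption: cells in one arm are in parallel (conductances add) and the
    comparator is ideal, so a misread occurs when the arm holding logic 0
    cells shows no more conductance than the arm holding logic 1 cells.
    """
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        g_arm0 = sum(sample_conductance(rng, R0_OHM, SIGMA0) for _ in range(cells_per_arm))
        g_arm1 = sum(sample_conductance(rng, R1_OHM, SIGMA1) for _ in range(cells_per_arm))
        if g_arm0 <= g_arm1:
            errors += 1
    return errors / trials

single = misread_rate(cells_per_arm=1)
paired = misread_rate(cells_per_arm=2)
print(f"one MTJ per arm : {single:.4f}")
print(f"two MTJs per arm: {paired:.4f}")
```

Under these assumptions, grouping two same-state cells per arm narrows the relative spread of each arm and lowers the misread fraction compared with a single cell per arm.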

#### VIII. RESULTS AND DISCUSSIONS

The MTJ device characteristics in the proposed hybrid CMOS-Magnetic logic architecture are summarized in Table VII, while the magnitudes and durations of the writing, clocking, and reading currents are outlined in Table VIII. The writing and clocking current magnitudes and durations are theoretically computed using Eqs. (4), (8) & (9), while the value of a suitable reading current is obtained through simulations in Cadence. In the simulations of the readout circuits, the MTJs are replaced by their resistance values represented by Eq. (10). With a resistance of 2 kΩ for *logic* 0 and 2.6 kΩ for *logic* 1 [32], a TMR of around 30% is maintained in these devices. This TMR is sufficient to read the contents of the MTJs using the Differential ReadOut scheme discussed in the previous section.

Table VII: Characteristics of a single MTJ used in the architecture.

| MTJ characteristics                              | Value                  |
|--------------------------------------------------|------------------------|
| MTJ footprint                                    | $100 \times 50\ nm^2$  |
| Free layer thickness                             | 2 nm                   |
| Horizontal pitch                                 | 70 nm                  |
| Vertical pitch                                   | 120 nm                 |
| Logic 0 resistance $(R_0)$ [32]                  | 2 kΩ                   |
| Logic 1 resistance $(R_1)$                       | 2.6 kΩ                 |
| Standard deviation for $R_0$ ($\sigma_0$) [42]   | 9.3%                   |
| Standard deviation for $R_1$ ($\sigma_1$)        | 10.3%                  |

The architecture proposed here can realize larger logic by stitching together smaller sub-circuit modules, as in any CMOS circuit. A half-adder is designed within the specifications of the hybrid architecture (see Fig. 9). A full-adder can be designed using half-adders as modules. The full-adder templates can then be instantiated to develop a 32-bit ripple carry adder and an $8 \times 8$ array multiplier, thus supporting the modularity of the design architecture. Table IX presents the delay, energy and area of the half-adder, full-adder, 32-bit ripple carry adder and $8 \times 8$ array multiplier. The total energy consumed is computed for our proposed STT current-induced clocking and compared with that of the *field-induced* clocking technique. The current magnitude and duration for field-induced clocking are taken from [18]. An energy reduction of more than 95% is observed in STT current-induced writing and clocking operations over field-induced operations, thus substantiating our claim that current-induced logic operation is a low power technique. Simulation results in Fig. 15 show the energy consumed vs. the number of cells clocked in the external field-induced and STT current-induced clocking schemes. A total of 40 cells can be clocked at a time (in a single clocking zone) using STT current before the total clocking energy equals the value required for field-induced clocking. Due to cell selectivity in the proposed architecture, only the required number of cells need be clocked, resulting in more energy efficient clocking. Another major advantage of STT current-induced clocking is power reduction with scaling.

As seen in Eq. (4), the switching current decreases in proportion to the dimension of the MTJs used as elemental building blocks of magnetic logic. But the reverse effect occurs for field-induced clocking with scaling of the device dimension [34], [35], [43].
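The energy reduction quoted above can be reproduced from the clocking energies listed in Table IX. The short sketch below simply recomputes $1 - E_{STT}/E_{field}$ for each circuit; the dictionary layout and the function name are illustrative.

```python
# Clocking energy per circuit from Table IX: (STT, field-induced), both in pJ.
CLOCK_ENERGY_PJ = {
    "Half adder": (10.76, 287.0),
    "Full adder": (37.95, 1015.0),
    "32-bit RC adder": (1.21e3, 32.5e3),
    "8x8 array multiplier": (2.16e3, 57e3),
}

def energy_reduction(stt_pj: float, field_pj: float) -> float:
    """Fractional saving of STT current-induced over field-induced clocking."""
    return 1.0 - stt_pj / field_pj

reductions = {name: energy_reduction(stt, field)
              for name, (stt, field) in CLOCK_ENERGY_PJ.items()}

for name, value in reductions.items():
    print(f"{name}: {100.0 * value:.1f}% reduction")
```

All four circuits land above the 95% mark, since the clocking energy dominates the input and read energies in every row of the table.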

#### IX. CONCLUSION AND FUTURE DIRECTIONS

This work highlights the selectivity of cells offered and the power and energy improvement achieved through STT




Figure 14: Output Signals at node X and Y during Pre-Charge and Sensing Phases. V<sub>out</sub> represents the signal at the output of the comparator. Design implemented in 22nm predictive CMOS technology including the comparator.

Table IX: Delay and Energy comparison in Various Logic Circuits in Hybrid CMOS-Magnetic logic Architecture.

| Logic function                | Delay $[n(t_{clk} + t_p) + mt_w]$ | Operation | STT energy (pJ) | Field energy (pJ) | Energy reduction | Area $(\mu m^2)$ |
|-------------------------------|-----------------------------------|-----------|-----------------|-------------------|------------------|------------------|
| Half adder                    | 4 + 2                             | Input     | 0.0098          | -                 |                  | 0.41             |
|                               |                                   | Clock     | 10.76           | 287               | $\approx 96.2\%$ |                  |
|                               |                                   | Read      | 0.13            | -                 |                  |                  |
| Full adder                    | 8 + 2                             | Input     | 0.034           | -                 |                  | 2.7              |
|                               |                                   | Clock     | 37.95           | 1015              | $\approx 96.2\%$ |                  |
|                               |                                   | Read      | 0.13            | -                 |                  |                  |
| 32-bit RC adder               | 256 + 2                           | Input     | 0.80            | -                 |                  | 86.4             |
|                               |                                   | Clock     | 1.21e3          | 32.5e3            | $\approx 96.2\%$ |                  |
|                               |                                   | Read      | 0.13            | -                 |                  |                  |
| $8 \times 8$ array multiplier | 60 + 2                            | Input     | 0.785           | -                 |                  | 151.2            |
|                               |                                   | Clock     | 2.16e3          | 57e3              | $\approx 96.2\%$ |                  |
|                               |                                   | Read      | 1.05            | -                 |                  |                  |

n and m are integers.
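As a quick check of the energy-reduction column, the sketch below recomputes it from the tabulated energies. It assumes that the reduction compares the total STT energy (input + clock + read) against the field-induced clocking energy, and that a delay entry such as "4 + 2" means n = 4 and m = 2; the pulse durations are those of Table VIII. The dictionary layout and function names are illustrative only.

```python
# Energies in pJ copied from Table IX: STT (input, clock, read) and field-induced clocking.
CIRCUITS = {
    "half adder": {"stt": (0.0098, 10.76, 0.13), "field": 287.0, "n": 4, "m": 2},
    "full adder": {"stt": (0.034, 37.95, 0.13), "field": 1015.0, "n": 8, "m": 2},
    "32-bit RC adder": {"stt": (0.80, 1.21e3, 0.13), "field": 32.5e3, "n": 256, "m": 2},
    "8x8 array multiplier": {"stt": (0.785, 2.16e3, 1.05), "field": 57e3, "n": 60, "m": 2},
}

T_CLK_NS = 3.0   # duration of the clocked state (Table VIII)
T_P_NS = 0.010   # logic 1 -> QP pulse, 10 ps (Table VIII)
T_W_NS = 0.010   # write pulse, 10 ps (Table VIII)


def energy_reduction_percent(stt, field):
    """Reduction of the total STT energy relative to the field clocking energy."""
    return 100.0 * (1.0 - sum(stt) / field)


def delay_ns(n, m):
    """Delay n(t_clk + t_p) + m*t_w, assuming Table IX lists the delay as 'n + m'."""
    return n * (T_CLK_NS + T_P_NS) + m * T_W_NS


for name, c in CIRCUITS.items():
    reduction = energy_reduction_percent(c["stt"], c["field"])
    print(f"{name:22s} reduction = {reduction:.2f} %, delay = {delay_ns(c['n'], c['m']):.2f} ns")
```

Under these assumptions every circuit lands within 0.1 percentage points of the tabulated 96.2%, because the clock energy dominates both the STT and the field totals.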



**Figure 15:** Energy consumption vs. number of cells in a single clocking zone for STT current-induced clocking and Field-induced clocking. The clocking frequency for Field-induced clocking is  $10^8$  Hz with 50% duty cycle with a clocking current of 4 mA [17]. For the cell dimensions of  $100 \times 50 \ nm^2$  and a vertical pitch of  $120 \ nm$ , Field-induced clocking of 40 cells in one clock zone would require a clock wire length of  $4.8 \ \mu m$  with an overall resistance of  $0.216 \ \Omega$ . Since the clocking in STT current-induced clocking is implemented using a stationary state in the y-axis, the average clocking duration can be approximated to  $3 \ ns$  from our simulation results (*see supplemental document*) and the clocking current is a combination of the three voltage pulses as discussed in Section 20.

Table VIII: Current Specifications

| Operation | Phase / transition                | Current ($\mu A$) | Pulse duration                    |
|-----------|-----------------------------------|-------------------|-----------------------------------|
| Write     | $logic \ 0 \rightarrow logic \ 1$ | 278.9             | $t_w = 10 \ ps \ (\tau/2)$ [30]   |
| Write     | $logic \ 1 \rightarrow logic \ 0$ | -216              | $t_w = 10 \ ps \ (\tau/2)$        |
| Clock     | $logic \ 1 \to QP$                | -216              | $t_p = 10 \ ps \ (\tau/2)$        |
| Clock     | Clocked                           | -169.3            | $t_{clk}$\*                       |
| Read      | Pre-charge                        | 0.26              | 2 ns                              |
| Read      | Sense, logic 0                    | 82.29 (peak)      | 2 ns                              |
| Read      | Sense, logic 1                    | 73.59 (peak)      | 2 ns                              |
| Read      | logic 0                           | 31.43 (avg.)      | 4 ns                              |
| Read      | logic 1                           | 28 (avg.)         | 4 ns                              |

\* $t_{clk}$ is the duration for which the cells remain in the clocked state. $t_{clk} = 3 \ ns$.

current-induced clocking in a novel hybrid CMOS-Magnetic logic architecture using MTJs. The feasibility of the architecture with 22nm CMOS technology is studied. The fast, low power, and non-destructive ReadOut circuit demonstrates robustness against variability by leveraging inherent properties of the hybrid CMOS-Magnetic logic architecture. The modularity of the design block helps in the realization of larger circuits. As the underlying CMOS scales in the future, the accompanying scaling of the overlying MTJ dimensions will reduce the writing and clocking currents.

Nanomagnetic logic is at a very early stage of exploration and is going through gradual improvements. Hence, we compared our proposed architecture with prior magnetic logic work and demonstrated an energy reduction of more than 95% using STT-induced clocking over the traditional method of field-induced clocking. However, in its current state, magnetic logic does not compare favorably with conventional CMOS in terms of area and energy consumption for general purpose computation. Nanomagnetic memory (MRAM) has already been commercialized for limited applications. Certain properties of magnets, such as high temperature operation, radiation hardness, and non-volatility, are attractive for different applications. One early application of magnetic logic is to augment magnetic memory. Wherever magnetic memory is effective and judged suitable, one can apply the proposed magnetic logic to add a few logic constructs that would be used directly in the memory. For a more widespread and general purpose application of magnetic logic, further research is needed to reduce clocking energy through new materials and physical phenomena.

#### REFERENCES

- [1] J. Das, S. M. Alam, and S. Bhanja, "Low Power Magnetic Quantum Cellular Automata Realization Using Magnetic Multi-Layer Structures," *IEEE Transactions on Emerging and Selected Topics in Circuits and Systems*, June 2011.
- [2] "Llg micromagnetic simulator." http://llgmicro.home.mindspring.com/.
- [3] A. Kakay and L. K. Varga, "Micromagnetic simulation of random anisotropy model," *Journal of Magnetism and Magnetic Materials*, vol. 272-276, no. Part 1, pp. 741–742, 2004. Proceedings of the International Conference on Magnetism (ICM 2003).
- [4] K. Ito, T. Devolder, C. Chappert, M. J. Carey, and J. A. Katine, "Micromagnetic simulation of spin transfer torque switching combined with precessional motion from a hard axis magnetic field," *Applied Physics Letters*, vol. 89, pp. 252509 –252509–3, Dec. 2006.
- [5] B. S. Chun, J. Y. Hwang, J. R. Rhee, T. Kim, S. Saito, S. Yoshimura, M. Tsunoda, M. Takahashi, and Y. K. Kim, "Magnetization switching of cofesib free-layered magnetic tunnel junctions," *Journal of Magnetism and Magnetic Materials*, vol. 303, no. 2, pp. e223 – e225, 2006. The 6th International Symposium on Physics of Magnetic Materials.
- [6] K. Ito, T. Devolder, C. Chappert, M. J. Carey, and J. A. Katine, "Micromagnetic simulation on effect of oersted field and hard axis field in spin transfer torque switching," *Journal of Physics D: Applied Physics*, vol. 40, pp. 1261–1267, Mar. 2007.
- [7] R. P. Cowburn and M. E. Welland, "Room Temperature Magnetic Quantum Cellular Automata," *Science*, vol. 287, pp. 1466–1468, feb 2000.
- [8] A. Imre, G. Csaba, L. Ji, A. Orlov, G. H. Bernstein, and W. Porod, "Majority logic gate for magnetic Quantum-Dot cellular automata," *Science*, vol. 311, pp. 205–208, jan 2006.
- [9] M. Niemier, E. Varga, G. H. Bernstein, W. Porod, M. T. Alam, A. Dingler, A. Orlov, and X. S. Hu, "Boolean logic through shape-engineered magnetic dots with slanted edges," *IEEE Trans. on Nanotechnology*, 2010.
- [10] R. Nakatani, H. Nomura, and Y. Endo, "Magnetic logic devices composed of permalloy dots," *Journ. of Phys.: Conf. Series*, vol. 165, no. 1, p. 012030, 2009.
- [11] A. Orlov, A. Imre, G. Csaba, L. Ji, W. Porod, and G. H. Bernstein, "Magnetic Quantum-Dot Cellular Automata: Recent Developments and Prospects," *Journal of Nanoelectronics and Optoelectronics*, vol. 3, 2008.
- [12] E. Varga, A. Orlov, M. T. Niemier, X. S. Hu, G. H. Bernstein, and W. Porod, "Experimental Demonstration of Fanout for Nanomagnetic Logic," *IEEE Transactions on Nanotechnology*, vol. 9, pp. 668–670, Nov. 2010.
- [13] J. F. Pulecio and S. Bhanja, "Magnetic cellular automata coplanar cross wire systems," *Journ. of Appl. Phys.*, vol. 107, pp. 034308–+, Feb. 2010.
- [14] J. F. Pulecio and S. Bhanja, "Magnetic Cellular Automata wires," in *Nanotech. Mat. and Dev. Conf., 2009. NMDC '09. IEEE*, pp. 73–75, June 2009.
- [15] G. Csaba, P. Lugli, A. Csurgay, and W. Porod, "Simulation of power gain and dissipation in field-coupled nanomagnets," *Journal of Computational Electronics*, vol. 4, pp. 105–110, 2005. 10.1007/s10825-005-7118-5.
- [16] A. Kumari and S. Bhanja, "Landauer Clocking for Magnetic Cellular Automata (MCA) Arrays," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 19, pp. 714–717, Apr. 2011.
- [17] M. Niemier, M. Alam, X. S. Hu, G. Bernstein, W. Porod, M. Putney, and J. DeAngelis, "Clocking structures and power analysis for nanomagnet-based logic devices," in *Proceedings of the 2007 international symposium on Low power electronics and design*, ISLPED '07, (New York, NY, USA), pp. 26–31, ACM, 2007.
- [18] M. Alam, M. Siddiq, G. Bernstein, M. Niemier, W. Porod, and X. Hu, "On-chip clocking for nanomagnet logic devices," *IEEE Transactions on Nanotechnology*, vol. 9, pp. 348–351, May 2010.
- [19] D. Carlton, N. Emley, E. Tuchfeld, and J. Bokor, "Simulation Studies of Nanomagnet-Based Logic Architecture," *Nano Letters*, vol. 8, no. 12, pp. 4173–4178, 2008.
- [20] G. Csaba, A. Imre, G. Bernstein, W. Porod, and V. Metlushko, "Nanocomputing by field-coupled nanomagnets," *IEEE Transactions on Nanotechnology*, vol. 1, pp. 209–213, Dec. 2002.
- [21] C. Augustine, B. Behin-Aein, X. Fong, and K. Roy, "A design methodology and device/circuit/architecture compatible simulation framework for low-power magnetic quantum cellular automata systems," in *Proceedings of the 2009 Asia and South Pacific Design Automation Conference*, ASP-DAC '09, (Piscataway, NJ, USA), pp. 847–852, IEEE Press, 2009.
- [22] D. Suzuki, M. Natsui, S. Ikeda, H. Hasegawa, K. Miura, J. Hayakawa, T. Endoh, H. Ohno, and T. Hanyu, "Fabrication of a nonvolatile lookup-table circuit chip using magneto/semiconductor-hybrid structure for an immediate-power-up field programmable gate array," in *2009 Symposium on VLSI Circuits*, pp. 80–81, June 2009.
- [23] A. Lyle, J. Harms, A. Klemm, and J. Wang, "Incorporating magneto resistance into MQCA logic," in *Annual Conference on Magnetism and Magnetic Material*, 2010.
- [24] S. Lee, S. Seo, S. Lee, and H. Shin, "A Full Adder Design Using Serially Connected Single-Layer Magnetic Tunnel Junction Elements," *IEEE Transactions on Electron Devices*, vol. 55, pp. 890–895, Mar. 2008.
- [25] "International Technology Roadmap for Semiconductor," 2009.
- [26] J. C. Slonczewski, "Current-driven excitation of magnetic multilayers," *Journal of Magnetism and Magnetic Materials*, vol. 159, pp. L1–L7, jun 1996.
- [27] D. C. Ralph and R. A. Buhrman, *Concepts in Spin Electronics*, ch. Spin-transfer Torques and Nanomagnets. Oxford Science Publications, 2006.
- [28] Bertotti, Mayergoyz, and Serpico, Nonlinear Magnetization Dynamics in Nanosystems. Elsevier, 2009.
- [29] T. Moriyama, T. J. Gudmundsen, P. Y. Huang, L. Liu, D. A. Muller, D. C. Ralph, and R. A. Buhrman, "Tunnel magnetoresistance and spin torque switching in MgO-based magnetic tunnel junctions with a Co/Ni multilayer electrode," *Applied Physics Letters*, vol. 97, pp. 072513–+, aug 2010.
- [30] A. D. Kent, B. Ozyilmaz, and E. del Barco, "Spin-transfer-induced precessional magnetization reversal," *Appl. Phys. Lett.*, vol. 84, pp. 3897 –3899, may 2004.
- [31] K. J. Lee, O. Redon, and B. Dieny, "Analytical investigation of spin-transfer dynamics using a perpendicular-to-plane polarizer," *Appl. Phys. Lett.*, vol. 86, pp. 022505–022505–3, jan 2005.
- [32] S. C. Oh, J. H. Jeong, W. C. Lim, W. J. Kim, Y. H. Kim, H. J. Shin, J. E. Lee, Y. G. Shin, S. Choi, and C. Chung, "On-axis scheme and novel MTJ structure for sub-30nm Gb density STT-MRAM," in *2010 IEEE International Electron Devices Meeting (IEDM)*, pp. 12.6.1–12.6.4, Dec. 2010.
- [33] S. Salahuddin, "Current Induced Switching of Ferromagnets for Low-power Memory Applications," ISQED Symposium, 2011. Tutorial.
- [34] P. Braganca, J. Katine, N. Emley, D. Mauri, J. Childress, P. Rice, E. Delenia, D. Ralph, and R. Buhrman, "A three-terminal approach to developing spin-torque written magnetic random access memory cells," *IEEE Transactions on Nanotechnology*, vol. 8, pp. 190–195, Mar. 2009.

- [35] J. Katine and E. E. Fullerton, "Device implications of spin-transfer torques," *Journal of Magnetism and Magnetic Materials*, vol. 320, no. 7, pp. 1217 – 1226, 2008.
- [36] S. Yuasa, T. Nagahama, A. Fukushima, Y. Suzuki, and K. Ando, "Giant room-temperature magnetoresistance in single-crystal Fe/MgO/Fe magnetic tunnel junctions," 2004.
- [37] H. X. Wei, Q. H. Qin, Z. C. Wen, X. F. Han, and X. Zhang, "Magnetic tunnel junction sensor with Co/Pt perpendicular anisotropy ferromagnetic layer," *Applied Physics Letters*, vol. 94, pp. 172902–+, apr 2009.
- [38] "Predictive technology model." http://ptm.asu.edu/. Downloaded in 2010.
- [39] A. Balijepalli, S. Sinha, and Y. Cao, "Compact modeling of carbon nanotube transistor for early stage process-design exploration," in *Proceedings of the 2007 international symposium on Low power electronics and design*, ISLPED '07, (New York, NY, USA), pp. 2–7, ACM, 2007.
- [40] Y. Cao, T. Sato, M. Orshansky, D. Sylvester, and C. Hu, "New paradigm of predictive mosfet and interconnect modeling for early circuit simulation," in *Custom Integrated Circuits Conference*, 2000. *CICC. Proceedings of the IEEE 2000*, pp. 201–204, 2000.
- [41] W. Zhao and Y. Cao, "New generation of predictive technology model for sub-45nm design exploration," in *Proceedings of the 7th International Symposium on Quality Electronic Design*, ISQED '06, (Washington, DC, USA), pp. 585–590, IEEE Computer Society, 2006.
- [42] Z. Sun, H. Li, Y. Chen, and X. Wang, "Variation tolerant sensing scheme of Spin-Transfer Torque Memory for yield improvement," in *2010 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)*, pp. 432–437, Nov. 2010.
- [43] I. L. Prejbeanu, M. Kerekes, R. C. Sousa, H. Sibuet, O. Redon, B. Dieny, and J. P. Nozieres, "Thermally assisted mram," *Journal of Physics: Condensed Matter*, vol. 19, no. 16, p. 165218, 2007.
- [44] H. Jeon, Y. Kim, and M. Choi, "Offset Voltage Analysis of Dynamic Latched Comparator," 2009.
- [45] R. Jacob Baker, CMOS Circuit Design, Layout, And Simulation. Wiley-IEEE, 2008.
- [46] A.-T. Do, Z.-H. Kong, and K.-S. Yeo, "Criterion to Evaluate Input-Offset Voltage of a Latch-Type Sense Amplifier," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 57, pp. 83–92, Jan. 2010.
- [47] J. He, S. Zhan, D. Chen, and R. L. Geiger, "Analyses of Static and Dynamic Random Offset Voltages in Dynamic Comparators," *IEEE Transactions on Circuits and Systems I: Regular Papers*, vol. 56, pp. 911–919, May 2009.
- [48] "22 nanometer." http://en.wikipedia.org/wiki/22\_nanometer.

# **Supplementary Documents**

#### **S1.** Analysis of three different read schemes

Fig. 16 shows the sense margin of the read circuit $(V_x - V_y)$ for three different cases. All three schemes use the read circuit of Fig. 13 and differ only in the resistances connected to its two arms. The resistances of the MTJs in logic state 0 $(R_0)$ and 1 $(R_1)$ are listed in Table VII and are consistent with recent literature [32]. Due to process variations, the MTJ resistances vary with standard deviations $\sigma_0$ and $\sigma_1$, also listed in Table VII.

1) Reading using reference resistance: In this scheme, an MTJ (logic 0 or logic 1) is compared against a reference resistance $R_{ref}$ that is held at the midpoint of the two resistance states of an MTJ. Therefore,

$$R_{ref} = \frac{R_0 + R_1}{2}$$
(13)

$R_{ref}$ is attached to the right arm of the read circuit while the MTJ to be read is connected to the left arm. Fig. 16a & 16c show the waveforms at nodes X and Y while reading a *logic* 0 and a *logic* 1. Fig. 16b & 16d show the worst-case waveforms at X and Y when variations affect the MTJs being read. In the worst case, the resistances for logic 0 and 1 (written $R_{0v}$ and $R_{1v}$) are given by Eq. 14.

$$R_{0v} = R_0 + 0.093 \times R_0 \tag{14a}$$

$$R_{1v} = R_1 - 0.13 \times R_1 \tag{14b}$$

In the worst case, the sense margin falls by 56.89% when reading a logic 0 and by 96.85% when reading a logic 1 of an MTJ.

The two main disadvantages of this scheme are:

- a) The precision required to fabricate the reference resistance.
- b) The sense margin is low since the difference between the reference resistance and the resistance to be read is half of the total difference in resistance  $(|R_1 - R_0|)$  between the two logic states.
2) Reading in Differential Scheme: In this scheme, the MTJ to be read is compared against its complement. As mentioned earlier, the complement of a bit is easily obtained in magnetic logic through antiferromagnetic coupling. Fig. 16e shows the waveforms at nodes X and Y without variations affecting the MTJs. A sense margin of 25.95 mV is obtained. In the worst case, the variations affect the MTJs in a manner that decreases the sense margin. The resistances for logic 0 and 1 are then given by Eq. 14. The sense margin falls by 72.71% in the worst case, as seen in Fig. 16f.

3) Reading in Variability Tolerant Differential Scheme: This scheme, which is introduced in this paper, uses an averaging technique to reduce the effect of variation in a differential scheme. Instead of comparing single bits against their complements, it compares a pair of same-valued bits against a pair of their complements. Fig. 16g shows a sense margin of 19.961 mV obtained without variations affecting the MTJs. In the worst case, when variations affect the MTJs and their complements in opposite directions, a sense margin of 7 mV is obtained (as shown in Fig. 16h), which is a reduction of 65.04%.
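The benefit of averaging can be illustrated with a small Monte Carlo experiment. This is only a sketch under assumptions that are not taken from the paper: the two same-valued MTJs are treated as connected in series, their resistances vary independently with a Gaussian distribution, and the nominal values ($2\,k\Omega$ with a 9.3% relative spread) are placeholders. It is a statistical picture, not the deterministic worst-case analysis used for Fig. 16.

```python
import random
import statistics


def relative_spread(n_series, r_nominal=2000.0, sigma_rel=0.093, trials=20000, seed=1):
    """Relative std. dev. of n_series MTJs in series, each varying independently."""
    rng = random.Random(seed)
    totals = [
        sum(rng.gauss(r_nominal, sigma_rel * r_nominal) for _ in range(n_series))
        for _ in range(trials)
    ]
    return statistics.pstdev(totals) / statistics.fmean(totals)


single = relative_spread(1)  # one MTJ per arm, as in the plain differential scheme
pair = relative_spread(2)    # a pair of same-valued MTJs per arm
print(f"single MTJ: {single:.4f}, pair: {pair:.4f}, ratio: {pair / single:.3f}")
```

With independent variations the relative spread of the pair is smaller than that of a single device by a factor close to $1/\sqrt{2}$, which is the sense in which averaging reduces the effect of variation.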

The variability tolerant differential readout scheme introduced in this paper therefore tolerates variations in the MTJs better than the plain differential scheme (a worst-case reduction in sense margin of 65.04% against 72.71%). The area and temporal cost associated with this improvement is minimal.
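The percentages quoted for the three schemes can be recomputed from the sense margins, using the reduction defined as $(\Delta V_1 - \Delta V_2)/\Delta V_1$ in Table XI. The sketch below takes the nominal margins from this section where they are stated (25.95 mV and 19.961 mV) and the remaining values from Case A of Table XI; the function and variable names are illustrative.

```python
def margin_reduction_percent(dv_nominal_mv, dv_worst_mv):
    """Percentage loss of sense margin: 100 * (dV1 - dV2) / dV1."""
    return 100.0 * (dv_nominal_mv - dv_worst_mv) / dv_nominal_mv


# (nominal, worst-case) sense margins in mV.
SCHEMES = {
    "reference, logic 0": (14.15, 6.1),
    "reference, logic 1": (11.122, 0.35),
    "differential": (25.95, 7.08),
    "variability tolerant": (19.961, 6.979),
}

for name, (dv1, dv2) in SCHEMES.items():
    print(f"{name:22s} {margin_reduction_percent(dv1, dv2):.2f} %")
```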



**Figure 16:** (a), (b), (c), (d) **Reference Sensing Scheme** - (a) Sensing logic 0 (b) Sensing logic 0 amidst variability (c) Sensing logic 1 (d) Sensing logic 1 amidst variability. (e), (f) **Differential Sensing Scheme** - (e) Without variability (f) With variability. (g), (h) **Variability Tolerant Differential Sensing Scheme** - (g) Without variability (h) With variability.

Please note: Circuit parameters i.e. transistor dimensions and the bias voltages of Fig. 13 are maintained constant over all the simulations.

#### **S2.** Analysis of the sense margin (with and without variation) under different $R_0$ and TMR values

In this section we will discuss two cases:

**Case 1.** Relation of the sense margin with  $R_0$  and TMR for the different read schemes.

**Case 2.** Relative change in sense margin with variations for the different read schemes.

Please note that in all the cases we have reported the sense margin for the worst case variation effect i.e. under variation analysis,  $R_0$  has been replaced with  $R_{0v} = R_0(1 + \sigma_0)$  and  $R_1$  has been replaced with  $R_{1v} = R_1(1 - \sigma_1)$ .  $\sigma_0$  and  $\sigma_1$  values are taken from Table VII. Before we go into the details of the analysis, we have briefly summarized the steps followed during the simulations for Fig. 16 and Table XI.

(a) Reference Sensing Scheme

(i) Sensing logic 0

**Step I:** $R_0$ is read against $R_{ref}$, where $R_{ref} = (R_1 + R_0)/2$.

**Step II:** $R_0$ is replaced with $R_{0v}$ while $R_{ref}$ is maintained constant. Please note that this is one of the major drawbacks of the Reference Sensing scheme, where a precise $R_{ref}$ needs to be maintained. This limitation is overcome in Differential Sensing.

(ii) Sensing logic 1

**Step I:** $R_1$ is read against $R_{ref}$.

**Step II:** $R_1$ is replaced with $R_{1v}$ while $R_{ref}$ is maintained constant.

(b) Differential Sensing Scheme

**Step I:** Compare $R_0$ against $R_1$.

**Step II:** Compare $R_{0v}$ against $R_{1v}$.

(c) Variability Tolerant Differential Sensing Scheme

Similar to Differential Sensing, except that a pair of $R_0$s is now compared against a pair of $R_1$s.

*Note:*  $\Delta V_1$  and  $\Delta V_2$  in Table XI are obtained from Steps I & II respectively.

**Case 1.** How do $R_0$ and TMR influence the sense margin?

As mentioned in Section VII, the difference in currents $(\Delta I)$ in the two arms of the read circuit during the Sensing phase arises from the difference in the resistances of the two arms, which are connected across a constant voltage $(V_X = V_Y)$ developed during the Pre-charge phase. Let V denote the value of this voltage across the resistances at the end of the Pre-charge phase. The larger the $\Delta I$, the larger the sense margin of the read. In this section, we discuss the dependence of $\Delta I$ on $R_0$ and TMR for the different read schemes.

1) Reference Sensing

a) Sensing logic 0

$$\Delta I = V\left(\frac{1}{R_0} - \frac{1}{R_{ref}}\right) = V\left(\frac{1}{R_0} - \frac{1}{(1+0.5 \times TMR)R_0}\right) = \frac{0.5 \times TMR}{(1+0.5 \times TMR)R_0}V \tag{15}$$

Therefore, the sense margin improves when either

- $R_0$ decreases or,
- TMR increases, provided that TMR is not $\gg 1$ (as is the case in practice).

b) Sensing logic 1

$$|\Delta I| = V\left|\frac{1}{R_1} - \frac{1}{R_{ref}}\right| = V\left|\frac{1}{R_1} - \frac{1}{(1+0.5 \times TMR)R_0}\right| = \frac{0.5 \times TMR \times V}{(1+0.5 \times TMR)(1+TMR)R_0} \tag{16}$$

The same conclusion as above holds.

2) Differential Sensing

$$\Delta I = V\left(\frac{1}{R_0} - \frac{1}{R_1}\right) = V\left(\frac{1}{R_0} - \frac{1}{(1 + TMR)R_0}\right) = \frac{TMR}{(1 + TMR)R_0}V \tag{17}$$

Again, the sense margin improves when either

- TMR increases (for TMR in the vicinity of 1) or,
- $R_0$ decreases.

3) Variability Tolerant Differential Sensing

A similar conclusion holds true.

A comparison between $Case \ B$ and $Case \ C$ in Table XI shows a similar relation between the sense margin and $R_0$. Comparisons between $Case \ A$ and $Case \ B$, and between $Case \ C$ and $Case \ D$, in Table XI support the relation between the sense margin and TMR.
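Eqs. 15-17 can be explored numerically. Because the pre-charge voltage V is not given in this section, the sketch below works with the normalized quantity $\Delta I \cdot R_0 / V$; it assumes $R_1 = (1 + TMR)R_0$ as used in the equations, and the function names are illustrative.

```python
import math


def ref_logic0(tmr):
    """Eq. 15 normalized by V/R0: Reference Sensing of logic 0."""
    return 0.5 * tmr / (1.0 + 0.5 * tmr)


def ref_logic1(tmr):
    """Eq. 16 normalized by V/R0: Reference Sensing of logic 1."""
    return 0.5 * tmr / ((1.0 + 0.5 * tmr) * (1.0 + tmr))


def differential(tmr):
    """Eq. 17 normalized by V/R0: Differential Sensing."""
    return tmr / (1.0 + tmr)


def delta_i(normalized, r0_ohm, v_volt):
    """Convert a normalized value back to a current difference in amperes."""
    return normalized * v_volt / r0_ohm


# TMR values of Cases A-D in Table XI, written as fractions.
for tmr in (0.3, 1.0, 1.5):
    print(f"TMR = {tmr:.1f}: ref0 = {ref_logic0(tmr):.3f}, "
          f"ref1 = {ref_logic1(tmr):.3f}, diff = {differential(tmr):.3f}")

# Coarse grid search for the TMR that maximizes the logic 1 reference margin.
grid = [i / 1000.0 for i in range(1, 5001)]
tmr_peak = max(grid, key=ref_logic1)
print(f"ref1 peaks near TMR = {tmr_peak:.3f} (sqrt(2) = {math.sqrt(2):.3f})")
```

Two points follow from the expressions. The differential $\Delta I$ equals the sum of the two reference-sensing values, since $R_{ref}$ lies between $R_0$ and $R_1$. Eq. 16 is not monotonic in TMR: it peaks at $TMR = \sqrt{2}$, which is why the benefit of a larger TMR holds only while TMR is not much greater than 1.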

**Case 2.** How is the sense margin affected by variation?

Here  $\Delta I'$  denotes the current difference in the two arms of the read circuit under variations. Table X compiles the results.

The two key observations from Table X are:

1. The sense margin deteriorates with variation, as expected (compare $\Delta V_1$ and $\Delta V_2$ in Table XI).
2. As TMR increases, the effect of variation on the sense margin decreases. The same is observed between Case A & Case B and between Case C & Case D of Table XI.

A few more observations from Table XI are as follows:

1) The variation effect for Reference Sensing of logic 1 is more prominent than for logic 0 since $\sigma_1 > \sigma_0$.

2) The effect of variability on Differential Sensing is greater than on Reference Sensing of logic 0, since in Differential Sensing the resistances in both arms of the read circuit are replaced with $R_{0v}$ and $R_{1v}$ respectively to emulate the worst-case variability effects. For Reference Sensing, on the other hand, the reference resistance is held constant, so the impact of variability on that scheme is smaller. The same can be observed from Column 3 of Table X by comparing the expressions for the two schemes.

3) Our proposed Variability Tolerant Sensing scheme is more tolerant to variability than the Differential Sensing scheme.

To summarize:

- To overcome the drawbacks of a precise reference resistance and a low sense margin, Differential Sensing is chosen over Reference Sensing.
- To overcome the variability effects in Differential Sensing, we have proposed the Variability Tolerant Differential Sensing.

**Table X:** Difference in currents $(\Delta I')$ between the two arms of the sense circuit, with variation, for the different read schemes.

| Read Scheme                 | $\lvert \Delta I' \rvert$                                            | $\frac{\lvert \Delta I - \Delta I' \rvert}{\Delta I}$ |
|-----------------------------|----------------------------------------------------------------------|-------------------------------------------------------|
| Reference Sensing (logic 0) | $\frac{(0.457 \times TMR - 0.085)}{R_0(1 + 0.5 \times TMR)}$         | $(0.086 + \frac{0.17}{TMR})$                          |
| Reference Sensing (logic 1) | $\frac{(0.44 \times TMR - 0.115)}{R_0(1 + 0.5 \times TMR)(1 + TMR)}$ | $(0.12 + \frac{0.23}{TMR})$                           |
| Differential Sensing        | $\frac{(0.915 \times TMR - 0.199)}{R_0(1 + TMR)}$                    | $(0.085 + \frac{0.199}{TMR})$                         |

Table XI: Comparison of different read schemes under different $R_0$ and $TMR$ values.

| Read Scheme                                     | $\Delta V_1$ (mV) ★ | $\Delta V_2$ (mV) ★★ | $\% = \frac{\Delta V_1 - \Delta V_2}{\Delta V_1}$ |
|-------------------------------------------------|---------------------|----------------------|---------------------------------------------------|
| **Case A:** $R_0 = 2k\Omega$, TMR = 30%         |                     |                      |                                                   |
| Reference Sensing logic 0                       | 14.15               | 6.1                  | 56.89%                                            |
| Reference Sensing logic 1                       | 11.122              | 0.35                 | 96.85%                                            |
| Differential Sensing                            | 25.97               | 7.08                 | 72.71%                                            |
| Variability Tolerant                            | 19.61               | 6.979                | 65.04%                                            |
| **Case B:** $R_0 = 2k\Omega$, TMR = 100%        |                     |                      |                                                   |
| Reference Sensing logic 0                       | 21.46               | 15.88                | 26%                                               |
| Reference Sensing logic 1                       | 14.59               | 8.9                  | 38.99%                                            |
| Differential Sensing                            | 38.3                | 26.26                | 31.44%                                            |
| Variability Tolerant                            | 59.98               | 43.04                | 28.24%                                            |
| **Case C:** $R_0 = 4k\Omega$, TMR = 100%        |                     |                      |                                                   |
| Reference Sensing logic 0                       | 20.69               | 15.78                | 21.9%                                             |
| Reference Sensing logic 1                       | 13.66               | 8.41                 | 38.43%                                            |
| Differential Sensing                            | 36.03               | 24.98                | 30.66%                                            |
| Variability Tolerant                            | 56.48               | 40.25                | 28.73%                                            |
| **Case D:** $R_0 = 4k\Omega$, TMR = 150%        |                     |                      |                                                   |
| Reference Sensing logic 0                       | 28                  | 23.53                | 15.96%                                            |
| Reference Sensing logic 1                       | 16.82               | 11.61                | 30.97%                                            |
| Differential Sensing                            | 47.65               | 36.35                | 23.71%                                            |
| Variability Tolerant                            | 72.95               | 57.42                | 21.28%                                            |

★ Sense margin without variability.

★★ Sense margin with variability.
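The Reference Sensing (logic 0) entry of Table X can be re-derived from Eq. 15. Substituting $R_{0v} = R_0(1 + \sigma_0)$ for $R_0$ while keeping $R_{ref} = (1 + 0.5 \times TMR)R_0$ fixed gives $\Delta I' = V \frac{0.5 \times TMR - \sigma_0}{(1 + \sigma_0)(1 + 0.5 \times TMR)R_0}$. The sketch below evaluates the coefficients assuming $\sigma_0 = 0.093$ (the factor in Eq. 14a); the function names are illustrative, and only the logic 0 row is checked here.

```python
def ref_logic0_coefficients(sigma0=0.093):
    """Return (a, b) with dI' * R0 * (1 + 0.5*TMR) / V = a*TMR - b."""
    a = 0.5 / (1.0 + sigma0)
    b = sigma0 / (1.0 + sigma0)
    return a, b


def relative_change(tmr, sigma0=0.093):
    """(dI - dI') / dI for Reference Sensing of logic 0, with dI from Eq. 15."""
    a, b = ref_logic0_coefficients(sigma0)
    return 1.0 - (a * tmr - b) / (0.5 * tmr)


a, b = ref_logic0_coefficients()
print(f"dI' numerator: {a:.3f} * TMR - {b:.3f}")
for tmr in (0.3, 1.0, 1.5):
    print(f"TMR = {tmr:.1f}: relative change = {relative_change(tmr):.3f}")
```

The coefficients come out at about 0.457 and 0.085, and the relative change has the form $c_1 + c_2/TMR$ with $c_1 \approx 0.085$ and $c_2 \approx 0.17$, so the penalty from variation shrinks as TMR grows, in line with the second key observation.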

#### **S2.** The comparator and its variation analysis in 22nm CMOS technology

Fig. 17 shows the comparator that we designed in 22nm CMOS to read the sense margin from the read circuit of Fig. 13. The comparator has two stages: the Decision Circuit and the Sense Amplifier. The circuit parameters (transistor dimensions and bias voltages) of the comparator are listed in Table X.

The comparator has a $1\sigma$ input offset voltage of around 13.89 mV, as listed in Table XI. The sense margin of 32 mV (see Fig. 14b & 14c) lies above the $2\sigma$ input offset (27.78 mV), so the comparator can read it in the presence of variations. A process variation analysis in 22nm CMOS is done on the comparator. The threshold voltage offset is calculated using Eq. 15 with $A_{V_{th}} = 4.5mV\mu m$ [44].

$$\sigma_{V_{th}} = \frac{A_{V_{th}}}{\sqrt{WL}} \tag{15}$$

The area of the comparator can be approximated to  $4.336 \mu m^2$ .



Figure 17: The comparator used in reading the sense margin from the read circuit of Fig. 13. The comparator has two stages: the Decision Circuit and the Sense Amplifier [45]. Please note that C and D need to be connected between the two stages.

Table X: Comparator characteristics.

| Transistor dimensions\*\* and biasing voltages in the comparator |                      |
|------------------------------------------------------------------|----------------------|
| Stage I                                                          | Stage II             |
| $P_1, P_2 = 10,000/50$                                           | $P_3, P_4 = 9000/50$ |
| $N_1, N_2 = 6000/68$                                             | $N_5, N_6 = 9000/25$ |
| $N_3, N_4 = 6000/60$                                             | $N_7, N_8 = 9000/25$ |

\*\* All dimensions are in nm.

Please note that we have:

1) reported only the $V_{th}$ influence on the input offset voltage since it has the greatest influence compared to the transconductance (K) and the capacitance (C) mismatches [46].

Table XI: Threshold variations and corresponding input offset for the comparator.

| Transistor | Magnitude of threshold variations (mV) | Input Offset (mV) |
|------------|----------------------------------------|-------------------|
| $P_1$      | 6.4                                    | 6.65              |
| $P_2$      | 6.4                                    | 5.33              |
| $N_1$      | 7                                      | 0.11              |
| $N_2$      | 7                                      | 7.43              |
| $N_3$      | 7.5                                    | 7.97              |
| $N_4$      | 7.5                                    | 0.16              |
| $P_3$      | 6.7                                    | 0.28              |
| $P_4$      | 6.7                                    | 0                 |
| $N_5$      | 9.5                                    | 1.28              |
| $N_6$      | 9.5                                    | 0.05              |
| $N_7$      | 9.5                                    | 0.01              |
| $N_8$      | 9.5                                    | 0                 |

$1\sigma$ input offset for the comparator is 13.89 mV.

2) used a conservative $A_{V_{th}}$ value of $4.5\,mV\mu m$ for our variation analysis of 22nm CMOS technology and verified the functional behavior of the comparator design. However, literature [47] reports optimistic values of $A_{V_{th}}$ equaling $1.8\,mV\mu m$ for NMOS and $1.7\,mV\mu m$ for PMOS in 40nm technology. Therefore, our analysis with conservative numbers represents a validation of comparator robustness under worst case assumptions.
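The numbers in Tables X and XI can be cross-checked with a short script. This is a minimal sketch, not the authors' analysis flow, and it rests on two assumptions that the text does not state: that each entry in Table X is width/length in nm, so that the threshold variation follows the usual Pelgrom relation $\sigma_{V_{th}} = A_{V_{th}}/\sqrt{W \cdot L}$, and that the per-transistor input offsets are independent and combine as a root-sum-square. Under these assumptions the tabulated values are reproduced to within rounding.

```python
import math

A_VTH_MV_UM = 4.5  # conservative matching coefficient quoted in the text (mV*um)

# Assumption: Table X entries are (width, length) in nm.
DIMENSIONS_NM = {
    "P1": (10000, 50), "P2": (10000, 50),
    "N1": (6000, 68), "N2": (6000, 68),
    "N3": (6000, 60), "N4": (6000, 60),
    "P3": (9000, 50), "P4": (9000, 50),
    "N5": (9000, 25), "N6": (9000, 25),
    "N7": (9000, 25), "N8": (9000, 25),
}

# Input offset contributions (mV) copied from Table XI.
INPUT_OFFSET_MV = {
    "P1": 6.65, "P2": 5.33, "N1": 0.11, "N2": 7.43,
    "N3": 7.97, "N4": 0.16, "P3": 0.28, "P4": 0.0,
    "N5": 1.28, "N6": 0.05, "N7": 0.01, "N8": 0.0,
}


def sigma_vth_mv(width_nm, length_nm, a_vth=A_VTH_MV_UM):
    """Pelgrom estimate of the threshold-voltage spread in mV."""
    area_um2 = (width_nm / 1000.0) * (length_nm / 1000.0)
    return a_vth / math.sqrt(area_um2)


def root_sum_square(values):
    """Combine independent contributions in quadrature."""
    return math.sqrt(sum(v * v for v in values))


if __name__ == "__main__":
    for name, (w, l) in DIMENSIONS_NM.items():
        print(f"{name}: sigma_Vth = {sigma_vth_mv(w, l):.1f} mV")
    total = root_sum_square(INPUT_OFFSET_MV.values())
    print(f"combined 1-sigma input offset = {total:.2f} mV")
```

The combined value evaluates to roughly 13.9 mV, in line with the $1\sigma$ input offset reported for the comparator.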

#### **S3.** Analysis of STT Clocking

According to spin-torque induced clocking, the clocking is performed by using appropriate current to drive the cell to a stationary magnetization state along the y-axis. The clocking current can therefore be theoretically derived from Eq. (1) by substituting dm/dt = 0 (stationary) and equating the field components along the  $\hat{x}$ ,  $\hat{y}$  and  $\hat{z}$  directions. The coupling from the underneath tilted reference layer is brought about by the addition of the field term  $H_d$  to  $H_{eff}$  where  $H_d$  is given by

$$\mathbf{H}_{\mathbf{d}} = -H_{dx}\hat{e_x} + H_{dz}\hat{e_z} = -H_d\alpha_p\hat{e_x} + H_d\gamma_p\hat{e_z}$$
(16)

H<sub>eff</sub> is given by [28]

$$\mathbf{H}_{\mathbf{eff}} = \mathbf{H}_{\mathbf{d}} + \mathbf{H}_{\mathbf{M}} + \mathbf{H}_{\mathbf{AN}}$$
(17)

where $H_M$ is the field due to demagnetization effects and $H_{AN}$ arises from the crystalline and shape anisotropy of the free layer. During the clocking state $H_{AN} = 0$ and $H_M = -D_y \cdot M_y \cdot \hat{e_y}$ owing to the presence of only the y-component of magnetization. Also from the fundamental constraint,

$$|\mathbf{M}(\mathbf{r},t)| = M_s \tag{18}$$

in the clocking state we have

$$M_y = M_s \tag{19}$$

Therefore, when clocked, Eq. (1) modifies to

$$\gamma M_s \hat{m} \times \hat{h_{eff}} = \gamma M_s \frac{J_e}{J_p} G \hat{m} \times \hat{e_p} \times \hat{m} \tag{20}$$

where

$$\hat{m} = \hat{e_y} \tag{21a}$$

$$\hat{h_{eff}} = \frac{1}{M_s} \left[ -H_{dx} \hat{e_x} - D_y M_s \hat{e_y} + H_{dz} \hat{e_z} \right] \tag{21b}$$

Equating the $\hat{e_z}$ and $\hat{e_x}$ terms gives

$$M_y \left(\frac{J_e G}{J_p}\right) = \frac{H_{dx}}{\gamma_p} \tag{22a}$$

$$M_y \left(\frac{J_e G}{J_p}\right) = \frac{H_{dz}}{\alpha_p} \tag{22b}$$

which mandates

$$\alpha_p = \gamma_p \tag{23}$$

i.e. the reference layer should have a tilt of $45^{\circ}$ in its polarization with the z-axis. Therefore, the device switches to a clocked state with a current density of (See Table VI for symbol definitions)

$$J_{clk} = \left(\frac{\mu_0 \cdot M_s \cdot |e| \cdot d \cdot H_d}{\hbar \cdot G}\right)$$
(24)
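The intermediate algebra between Eq. (20) and Eq. (24) can be spelled out as follows. This is a sketch under two assumptions that this section does not state explicitly: the polarizer direction is written in direction cosines as $\hat{e_p} = \alpha_p\hat{e_x} + \beta_p\hat{e_y} + \gamma_p\hat{e_z}$, and the normalizing current density is $J_p = \mu_0 M_s^2 |e| d/\hbar$, which is the value that makes Eq. (24) follow from Eq. (22); Table VI remains the reference for the definition used in the paper.

With $\hat{m} = \hat{e_y}$, the demagnetization term drops out of the cross product:

$$\hat{m} \times \hat{h_{eff}} = \frac{1}{M_s}\,\hat{e_y} \times \left[ -H_{dx}\hat{e_x} - D_y M_s \hat{e_y} + H_{dz}\hat{e_z} \right] = \frac{1}{M_s}\left[ H_{dz}\hat{e_x} + H_{dx}\hat{e_z} \right]$$

$$\hat{m} \times \hat{e_p} \times \hat{m} = \hat{e_p} - (\hat{m}\cdot\hat{e_p})\,\hat{m} = \alpha_p\hat{e_x} + \gamma_p\hat{e_z}$$

Matching the $\hat{e_z}$ and $\hat{e_x}$ components in Eq. (20) with $M_y = M_s$ gives the two conditions of Eq. (22). Substituting $H_{dx} = H_d\alpha_p$ and $H_{dz} = H_d\gamma_p$ from Eq. (16), both conditions can hold only if $\alpha_p = \gamma_p$, and they then reduce to

$$M_s\,\frac{J_e G}{J_p} = H_d \quad\Rightarrow\quad J_{clk} = \frac{J_p H_d}{M_s G} = \frac{\mu_0 M_s |e|\, d\, H_d}{\hbar\, G}$$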



**Figure 18:** Clocked state of a tilted polarizer MTJ. Please note that the free layer orients along the *y* axis when clocked.

### **Response to Reviewers**

We thank the reviewers and the EIC for their time and encouragement. In this section, we clearly identify how the reviewers' comments have been incorporated in the revised version. Along with our response, we have reproduced the relevant reviewer's comments for ready reference. We have rewritten most of the sections and updated the references as suggested. Topics that required further elaboration are added as supplementary documents in order to keep the main contents of the paper focused on the CMOS-MTJ architecture integrated with the write, clock and read mechanism. The changes made are highlighted in the paper. Change to a figure/table or addition of a figure/table has been marked by highlighting their captions. Data in certain tables are updated according to the latest available references in the area.

For the sake of completeness, we added a few tutorial elements to the paper since the magnetic logic area is relatively new, and this raised an issue of clarity in the technical contribution. We want to emphasize the technical contribution of the paper in this section for all the reviewers, and we have also modified Section I (page 2) of the revised manuscript to state it.

The technical contributions of this work are:

- 1) "We have designed the proposed hybrid architecture to obey the constraints of dipolar interaction of magnetic logic, central to magnetic information processing, to mention some: free layer dimension (shape anisotropy and super-paramagnetic limit), inter-cell spacing, pinned and free layer configurations and readability through TMR.
- 2) We have designed the architecture obeying the 22nm CMOS integration constraints such as 22nm CMOS metal pitch requirements, sizing of the access transistors capable of driving the required switching current and minimizing the number of routing metal layers.
- 3) The architecture is designed such that all magnetic cells are connected with a bitline and a sourceline. A few cells are connected to a wordline as well in order to
  - selectively write to input cells,
  - selectively deactivate a few cells while others are written or clocked, in order to achieve lower power dissipation,
  - selectively clock a cell, and finally
  - read an output cell.

  The architecture is regular and uses bitlines, sourcelines and wordlines similar to conventional memory. Hence the architecture is suitable for logic-in-memory applications.

- 4) Even though we have discussed the equations for switching current, clocking current, TMR and clocking schemes in [1], this work is the first to integrate the clocking schemes and timing of clocking pulses into the regular 2D grid architecture. The clocking scheme that is introduced in this work is a sequence of three voltage pulses that ensures the desired clocking operation.
- 5) We have proposed a differential ReadOut scheme where a bit is compared against its complement, which eliminates the requirement for a precise reference voltage or resistance. Note that since nanomagnetic logic relies on neighbor interaction, a bit and its antiferromagnetically coupled neighbor automatically provide the complements. We leverage this feature of NML and have thus reduced the expensive circuitry required to store variation prone reference values. Additional features of the proposed read schemes are:
  - a) Non-destructive read scheme where the cell's logic states are retained while the cell is being read by TMR.
  - b) Since magnetic cells are sensitive to fabrication and geometric variations, we have proposed a variation tolerant read scheme where duplicate copies (can be extended to multiple copies as well) of the output cells and their complements are compared."
- 6) Finally, we have used the features of the architecture to build a two-input XOR logic (Section IV-F) which is an important component in datapath circuits.

The novelty of this work w.r.t. our previous work [1]:

- 1) While our previous work only proposes that MTJs can be used as elemental logic cells in NML, this work proposes a hybrid CMOS-MTJ architecture that is used to realize magnetic logic. A case study of a two-input XOR is given in Section IV-F.
- 2) Our previous work was aimed at developing a Verilog-A model to emulate the behavior of the free layer of a single MTJ and the coupling between the free layers of neighboring MTJs. In order to build up the model, we required the expressions for switching current, clocking current and TMR of the MTJ which were mentioned in the paper. In this paper we have described how the input cells within the architecture can be written with the help of bitline, sourceline and wordline, how the MTJs inside the logic can be clocked with the help of a train of voltage pulses and a ReadOut scheme that will use the TMR to read the output of the logic.

The novelty of this work w.r.t. selected previous works:

While selected previous research used MTJs in nanomagnetic logic, our work is significantly different and the first to

- 1) utilize magnetic coupling between the free layers of neighboring MTJs for logic computation, and
- 2) present a CMOS integrated architectural solution with low power read, write and clocking suitable for large magnetic logic implementation.

We thank the reviewers once again for their valuable feedback for this work.

A brief note on the changes in figure and table numbers in the revised version:

- Fig. 4 is currently Fig. 9 in revised version.
- Fig. 5 is currently Fig. 7 in revised version.
- Fig. 7 is currently Fig. 12 in revised version.
- Fig. 8 is currently Fig. 13 in revised version.
- Fig. 10 is currently Fig. 15 in revised version.

- Table II is currently Table V in revised version.
- Table III is currently Table VI in revised version.
- Table IV is currently Table VII in revised version.
- Table V is currently Table VIII in revised version.
- Table VI is currently Table IX in revised version.

*Figures that are added or modified in the paper.* Fig. 4, Fig. 5, Fig. 6, Fig. 8, Fig. 10, Fig. 12, Fig. 14.

*Tables that are added.* Table II, Table III, Table IV, Table VII, Table VIII, Table IX.

#### A. Response to Reviewer 1

1) The Scaling challenges of CMOS need research as the Abstract of the paper mentions and I thank the authors for their attempt in this direction. However, I feel that their paper did not really succeed in that. Although the paper is obviously a good improvement over previous magnetic logic work, it does not however adequately address the major problems of today's circuits.

We thank the reviewer for identifying this shortcoming in our original submission. We have changed the scope of our work in the revised manuscript to focus on magnetic logic and removed the reference to CMOS scaling challenges. Magnetic memory (as already commercialized by Freescale Semiconductors, and now Everspin) is already in the market and does not really threaten the general purpose CMOS memory. Certain properties of magnets, such as high temperature operation, radiation hardness, and non-volatility, are attractive for different applications. Essentially wherever magnetic memory is effective and judged suitable, one can apply the proposed magnetic logic to add a few additional logic constructs that would be used directly in memory. Thus, we have narrowed the scope of our work towards logic in memory applications as per your suggestion. The last paragraph of Section *IX* discusses this in the manuscript.

2) While presenting the logic based on the nonlinear electro-magnetic interactions Landauer [reference 3 in the references which I propose to you] says "No scheme which requires precisely-timed signals at every stage has been truly successful". The three references that I propose are classics which have proven their strengths over time. I hope you benefit from them. Your use of a clocked structure is essential as you mention on page 7 column 1 to solve another problem which is the I/O isolation. However, this clocking to every logic element is difficult to route in a real chip with billions of elements and consumes a large amount of energy.

We agree with the reviewer that clocking every cell might be unrealistic when considering billions of nodes. In potential early applications for magnetic logic in magnetic memory, it is unlikely that a very complicated logic would be implemented, due to the energy constraint you mentioned. Rather, we envision smaller and regular logic. Also note that our proposed architecture has sourcelines, bitlines, and wordlines for accessing the magnetic logic cells (MTJs) for clocking, reading, or writing. Such an access mechanism is more in line with traditional memory arrays where a row or column of cells is accessed at a time. Therefore, the proposed CMOS-MTJ architecture is more suitable for accessibility and metal routing. However, we do acknowledge (in Section *VIII*) the fact that clocking energy consumption would limit how many cells are accessed/clocked at a time. Please note that ultra-low energy clocking technologies are already evolving through straintronics and other methods under investigation and very recently funded by NSF (See "http://www.eetimes.com/electronics-news/4219545/Researchers-aim-for-energy-harvesting-CPUs".)

- 3) Furthermore, your simple half adder for example has a much larger area than one in conventional CMOS at the same technology node (22nm) due to the large sizes of the magnetic parts.

  We agree with the reviewer. We tried to compare with existing 22nm experimental data points. Unfortunately the exact transistor-pitch (usually even much larger than 22nm) is not available to us. Based on the predicted values [48], the area of the CMOS as well as magnetic logic implementations for similar circuits like the half-adder are in the $\mu m^2$ range. We believe that this work can potentially augment magnetic memory with built-in non-volatile radiation-hard and high-temperature logic in early applications.
- 4) Your proposal (which I agree is a good idea) might be good for another purpose (memories, sensors, analog, ...) but I do not see it as a good alternative for logic. Please see the references I provided to you. Make a good comparison with conventional CMOS instead of just to other prior magnetic logic and maybe retarget your device to another application.
  - 1) R. W. Keyes, "What makes a good computer device?," Science, vol. 230, pp. 138-144, Oct. 1985.
  - 2) R. W. Keyes, "Physics of digital devices," Reviews of Modern Physics, vol. 61, pp. 279-287, Apr. 1989.
  - 3) R. Landauer, "Advanced technology and truth in advertising," Physica A, vol. 168, pp. 75-87, 1990.

We agree with the reviewer. However, we want to humbly point out that when CMOS was a nascent technology competing with BJT, many of its performance indicators were worse than those of BJT. More than 50 years later, it is a mature technology in which many of the initial shortcomings have been overcome. Currently we envision the scope of magnetic logic to be narrow (where magnetic memory would be effective), but we are optimistic that, much like CMOS, many of the initial shortcomings will be overcome through the evolution of new materials and physical phenomena. Magnetic logic is at a very early stage of exploration and is going through gradual improvements. Hence, we compare with prior magnetic logic work to gauge the improvement. In its current state, magnetic logic certainly does not compare favorably with conventional CMOS for general purpose computation. Please note that there are articles, such as the following, that point to the future potential of energy-efficient CPUs using magnetic logic, which is under active research: "http://www.eetimes.com/electronics-news/4219545/Researchers-aim-for-energy-harvesting-CPUs".

#### B. Response to Reviewer 2

The new MTJ structure presented in the paper is quite interesting. The corresponding control/clock logic seems okay too while it's quite hard to judge because they highly depend on the specific MTJ structure in the paper.

**How it is addressed in revision:** The clocking scheme which sets the magnetization along the saddle point or y-direction of the free layer is made possible by the tilted-polarizer fixed layer structure of the MTJ. A summary of the different MTJ configurations together with their feasibility of being clocked using STT current is presented in Table III.

- 1) In Fig. 10, the clock frequency is only 100MHz. Are there any reasons to use such slow clock for evaluations?

  **How it is addressed in revision:** The clock frequency of 100MHz is as used by the authors of [17]. We have only used the value in Fig. 15 to calculate the energy consumption in field induced clocking.
- 2) As shown in Fig. 10, if the number of cells clocked is more than 40, the energy consumption of STT-induced clocking is larger than field-induced one. However, in Table VI, even for an 8x8 multiplier which should have many cells, the energy saving is 96.2%. This looks inconsistent with Fig. 10.

How it is addressed in revision: Thank you for raising this concern. This was not clarified in our previous version. In our revised manuscript we have clarified it by modifying the figure caption which in its current form reads as "Energy consumption vs. number of cells in a single clocking zone for STT current-induced clocking and Field-induced clocking. The clocking frequency for Field-induced clocking is  $10^8 Hz$  with 50% duty cycle with a clocking current of 4 mA [17]. For the cell dimensions of  $100 \times 50 nm^2$  and a vertical pitch of 120 nm, Field-induced clocking of 40 cells in one clock zone would require a clock wire length of  $4.8 \ \mu m$  with an overall resistance of  $0.216 \ \Omega$ . Since the clocking in STT current-induced clocking is implemented using a stationary state in the y-axis, the average clocking duration can be approximated to 3 ns from our simulation results (*see supplemental document*) and the clocking current is a combination of the three voltage pulses as discussed in Section 20."

The number 40 in Figure 15 refers to the number of cells that are clocked in a single clocking zone. However, when we have an $8 \times 8$ multiplier, the cells are distributed throughout the logic and they will be clocked in separate clocking zones.



Figure 19: Cell placements abiding by the hybrid architecture specifications. Color Code: Green $\rightarrow$ Input Cells, Yellow $\rightarrow$ Standard Cells, Blue $\rightarrow$ Controlled Cells, Red $\rightarrow$ Output Cells, Pink $\rightarrow$ logic 1 (fixed magnetization), Purple $\rightarrow$ logic 0 (fixed magnetization).

This justifies the energy reduction obtained with STT current-induced clocking over field-induced clocking in larger circuits that consist of a large number of cells (> 40).

3) Also, in Table VI, in all logic functions energy consumption using STT-induced clocking is reduced by the same number 96.2%. Is this just a coincidence or is there a theory behind this?

**How it is addressed in revision:** Thank you for this question. It is not a coincidence. For all the circuits mentioned in Table IX the basic building block is a half adder which is again a very important block in datapath circuits. Therefore, the energy reduction obtained in a single half adder gets reflected when computing the energy reduction for the other circuits mentioned in Table IX. This data shows the potential for realizing larger circuits by cascading through magnetic coupling.

As a brief explanation of how we arrived at the energy reduction values in Table IX, we present here the energy calculation for a half adder and a full adder. Please note that the values reported in Table IX are lower than the values calculated here. This is because we have kept a margin over the values calculated here in order to account for any interconnects that may be required to build the real circuit. The values reported in the table are therefore conservative.

a) Half-adder:

The cell placement for a half-adder is shown in Fig. 19a.

b) Full-adder:

The cell placement for a full-adder is shown in Fig. 19b.

If the average STT write current is  $I_W$ , the average STT clocking current is  $I_C$  and the duration of writing and clocking are  $t_w$  and  $t_c$  respectively, then the total energy consumed in STT writing and STT clocking is given by Eq. 25

$$E_{STT} = N_W \times I_W \times t_w + N_C \times I_C \times t_c \tag{25}$$

where $N_W$ and $N_C$ represent the total number of cells that are written into and clocked respectively during the course of logic execution.

The total energy consumed by a logic which is driven through external fields is calculated using Eq. 26

$$E_{Field} = N_R \times I_F \times t_f \tag{26}$$

where $N_R$ is the total number of rows of underlying conducting wires through which current is passed (during logic execution) in order to generate the required magnetic fields for the cells lying above. $I_F$ and $t_f$ are the current magnitudes and their durations in the wires respectively. In these calculations, we have assumed that the current required to generate magnetic field for writing into a nanomagnet is equal to the current required to generate the field for clocking a nanomagnet which is equal to $I_F$. This is a realistic assumption.

Table XII: Energy consumption calculation

| $N_W$ | $N_C$ | $N_R$ | STT (pJ) | Field-induced (pJ) | % reduction | logic      |
|-------|-------|-------|----------|--------------------|-------------|------------|
| 4     | 21    | 11    | 10.78    | 300                | 96.4%       | Half-adder |
| 13    | 50    | 41    | 25.6     | 820                | 96.87%      | Full-adder |

The values of $N_W$, $N_C$ and $N_R$ for a half-adder and a full-adder logic for the cell placement in Fig. 19 are listed in Table XII.

Using the data in Table VII, the values of current and their durations in STT driven operations are calculated to be: $I_W = 247.45\mu A$, $I_C = 169.81\mu A$, $t_w = 10ps$, $t_c = 3.02ns$.

For the field-induced operations, the current and their durations are taken from Niemier et al. [17] and are mentioned below:

 $I_F = 4mA, t_f = 5ns.$ 

The energy consumption values and the corresponding reduction in energies are mentioned in Table XII for each of the half adder and the full adder.

The energy reduction values in Table IX are reported as a conservative estimate of the reduction obtained through STT current.
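The arithmetic behind Table XII can be reproduced with a few lines of code. This is a minimal sketch that evaluates Eq. (25) and Eq. (26) exactly as written, using the current and duration values quoted above; the function names are illustrative only. The expressions contain no explicit voltage factor, so the products are reported in the table's units on the assumption that any such factor is unity. For the half-adder the field-induced figure is taken from Table XII as listed (300 pJ) rather than recomputed, since Eq. (26) with $N_R = 11$ evaluates to 220 in the same units.

```python
# Values quoted in the text.
I_W = 247.45e-6   # average STT write current (A)
I_C = 169.81e-6   # average STT clocking current (A)
T_W = 10e-12      # write duration (s)
T_C = 3.02e-9     # clocking duration (s)
I_F = 4e-3        # field-generating wire current (A), from [17]
T_F = 5e-9        # field-generating current duration (s), from [17]


def stt_energy(n_w, n_c):
    """Eq. (25): N_W * I_W * t_w + N_C * I_C * t_c."""
    return n_w * I_W * T_W + n_c * I_C * T_C


def field_energy(n_r):
    """Eq. (26): N_R * I_F * t_f."""
    return n_r * I_F * T_F


def reduction_percent(e_stt, e_field):
    """Relative saving of STT operation over field-induced operation."""
    return 100.0 * (1.0 - e_stt / e_field)


if __name__ == "__main__":
    # Half-adder: STT side from Eq. (25), field side as tabulated.
    half_stt = stt_energy(n_w=4, n_c=21)
    print(f"half-adder STT: {half_stt * 1e12:.2f}")
    print(f"half-adder reduction: {reduction_percent(half_stt, 300e-12):.1f}%")

    # Full-adder: both sides from the equations.
    full_stt = stt_energy(n_w=13, n_c=50)
    full_field = field_energy(n_r=41)
    print(f"full-adder STT: {full_stt * 1e12:.2f}")
    print(f"full-adder field-induced: {full_field * 1e12:.0f}")
    print(f"full-adder reduction: {reduction_percent(full_stt, full_field):.2f}%")
```

In both circuits the clocking term dominates the STT total, because $t_c$ is about 300 times longer than $t_w$ while the two currents are of the same order.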

#### C. Response to Reviewer 3

This paper is related to one very new and emerging area of the circuits and systems society. It deals with the exploration of a new computing paradigm for sub-100nm domain. However, the way this paper is presented, makes it very difficult to identify where exactly the novelty is. I am sure the authors have done a significant amount of work. However, the writing style is more like a "technical report" with a lot of redundancy and repetition. I would like to suggest the authors to change the presentation and organisation style of this paper and highlight the novelty at the very beginning.

**How it is addressed in revision:** We are thankful to the reviewer for this suggestion. We have modified most of the sections and have highlighted the novelties in the paper in Section I. Kindly see the response to the EIC at the beginning of this section where the novelties are mentioned.

*My evaluations are given below:*

1) The abstract creates an impression that the authors proposed a novel magnetic logic realization using multi-level spintronic devices as elemental cells. However, when I read authors' previous work, I found that they have already introduced this concept in that paper. So, I feel the abstract is misleading. It needs rewriting.

How it is addressed in revision: Thanks for the suggestion. We have modified the abstract and in its present form it reads as

"Magnetic coupling between single layer nanomagnets is used to realize magnetic logic. Apart from writing and reading, one other phenomenon performed on the magnets is clocking. Traditionally, these operations were carried out using external magnetic fields generated by current carrying conductors. But the current requirements are typically in mA which increases the overall power. Also, the fields cannot be sharply terminated at the boundary between two nanomagnets which needs to be clocked at two different instants. The above concerns motivated us to look into alternate magnetic devices to realize magnetic logic. We suggested the use of multi-layer spintronic devices (the Magnetic Tunnel Junctions abbr. MTJs) for carrying out logic computation. MTJs are already in use in magnetic-MRAMs from where we have borrowed some concepts in writing and reading our logic. The MTJ free layers are capable of interacting with neighbors through magnetic coupling. We have proposed the use of this coupling to compute logic in this paper. At the same time, MTJs also provide scope for CMOS integration which we have used to assist in current driven writing, clocking and reading the devices. CMOS integration also improves the overall control over individual cells in the logic. In this paper we have presented a novel CMOS integrated MTJ architecture layout that enables (a) logic computation using magnetic coupling between MTJs and (b) current driven input, clock and read operations that are much more energy efficient. A feasibility study of this integration in 22nm CMOS node is presented in the paper along with a variability tolerant reading scheme for the logic. The proposed architecture achieves over 95% reduction in energy as seen in various adders and array multiplier over traditional magnetic logic with external field-based clocking."

This work is significantly different from our previous work and we listed the differences in our address to the EIC at the beginning of this section.

2) The Introduction part: The main motivation behind this work, as stated in the introduction part is, "generating better controllability over magnetic logic cells and devising low power write, read and clocking operations for magnetic logic are the key motivation". Then onwards it appears that the use of MTJs is the main contribution. However, it has already been introduced in the authors' previous work where they have also dealt with inplane and perpendicular MTJs. In this work, the authors have used tilted polarizer reference layer MTJs with inplane free layer magnetization to realize magnetic logic. So, is it the novelty of this paper?

How it is addressed in revision: We have rewritten the introduction and highlighted the novelties. The novelties of this work are mentioned in the beginning of this section.

3) On Page-2 left column, you have mentioned in point number 1 that "Optimum" CMOS-Magnetic logic integration. However, I could not find the "optimality criteria" mentioned in the paper.

**How it is addressed in revision:** We have added a separate section, Section IV-A where we have discussed the challenges in CMOS integration and how we arrived at the CMOS integration satisfying the following integration challenges. By 'optimal solution' we meant an integration solution that solves the challenges mentioned below. However, we have removed the word from our revised manuscript.

"Integration Challenges

The integration of MTJs with 22nm CMOS for NML realization needs to meet the following basic criteria.

- a) The spacing between the MTJs should allow effective neighbor interaction.
- b) The CMOS minimum metal pitch requirement should not be violated.
- c) Transistors need sufficient W/L ratio to sustain the required writing and clocking currents.
- d) Minimize the number of metal layers for cost-effective implementation.

We have devised a novel CMOS-Magnetic logic architecture addressing the above mentioned challenges. We observed satisfactory coupling between MTJs when they are placed 20nm apart. The architecture has a regular 2D lattice structure (see Fig. 4) of rows and columns of MTJs placed 20nm apart. The row pitch of the architecture is (50+20)nm = 70nm. The column pitch is (100+20)nm = 120nm. The CMOS minimum metal pitch for layer 1 and intermediate wiring of 64nm [25] is satisfied. Access transistors are integrated to 1 in 4 MTJs of the architecture. For example (see Fig. 4), only $X_{11}$ in the group of ($X_{11}$, $X_{12}$, $X_{21}$ and $X_{22}$) has an access transistor."
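The pitch arithmetic in the quoted passage can be captured in a small check. This is a minimal sketch with hypothetical names (`GridSpec`, `has_access_transistor`), not part of the authors' design flow. It assumes that the 100nm cell dimension lies along the column-pitch direction and the 50nm dimension along the row-pitch direction, as the quoted sums imply, and it assumes that the 1-in-4 access transistor rule repeats over the whole grid with the transistor on the odd-row, odd-column cell of each 2 x 2 group, generalizing the $X_{11}$ example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridSpec:
    """Geometry of the regular MTJ lattice, all lengths in nm."""
    cell_long_nm: float = 100.0       # free-layer dimension along the column pitch
    cell_short_nm: float = 50.0       # free-layer dimension along the row pitch
    spacing_nm: float = 20.0          # MTJ-to-MTJ spacing for neighbor coupling
    min_metal_pitch_nm: float = 64.0  # minimum metal pitch quoted from [25]

    @property
    def row_pitch_nm(self) -> float:
        return self.cell_short_nm + self.spacing_nm

    @property
    def column_pitch_nm(self) -> float:
        return self.cell_long_nm + self.spacing_nm

    def meets_metal_pitch(self) -> bool:
        """Both lattice pitches must be at least the minimum metal pitch."""
        return min(self.row_pitch_nm, self.column_pitch_nm) >= self.min_metal_pitch_nm


def has_access_transistor(row: int, col: int) -> bool:
    """Assumed rule: with 1-indexed positions, only the odd/odd cell of each
    2 x 2 group (X_11 in the first group) carries an access transistor."""
    return row % 2 == 1 and col % 2 == 1


if __name__ == "__main__":
    spec = GridSpec()
    print(spec.row_pitch_nm, spec.column_pitch_nm, spec.meets_metal_pitch())
    cells = [(r, c) for r in range(1, 5) for c in range(1, 5)]
    with_transistor = [rc for rc in cells if has_access_transistor(*rc)]
    print(f"{len(with_transistor)} of {len(cells)} cells have an access transistor")
```

With the default values the row pitch of 70nm is the tighter of the two and still exceeds the 64nm minimum metal pitch.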

- 4) On Page-2 left column, you have mentioned about "low power clocking" in point number. It has been mentioned in several places throughout this paper. I was expecting to have its analytical reasoning in Section VI. But there also the authors simply stated that the use of STT current pulse is the cause of power minimization. I would expect the authors to explain this point further with proper analytical reasoning. Similarly, at the end of Section VI, it is mentioned that "current requirement per MTJ reduces to the order of uA, a drop by orders of magnitude..". It needs further analysis.

  **How it is addressed in revision:** Thank you once again for this feedback. The concept of STT current induced clocking has been introduced in our previous work in order to build up a Verilog-A model for the MTJ [1]. An analysis of the clocking is provided in supplementary document S3. In this work we concentrate on using the technique to clock the cells within the logic with the help of a novel train of voltage pulses. Table VIII gives the current magnitudes for STT clocking. Clocking using external fields requires currents in the order of mA. Therefore, with STT current, the current requirement for clocking reduces by orders of magnitude. An analysis of the energy consumption between STT current driven clocking and field induced clocking is presented in Fig. 15 where comparison is drawn over number of cells that are clocked in a single clocking zone. In Table IX we have presented a comparison between the energy consumed by three different adder circuits and a multiplier circuit using STT clocking and also external fields for clocking. Furthermore, we have modified the contents of Section 20. to highlight the novelties of clocking in this paper.
- 5) Section I, II and III can be made concise. In its present form, it has a lot of redundancies and repetitions. I would like to suggest the authors to state the novelty at the very beginning and then back it up with proper background. At this moment, lot of background research has been given first and the novelty is scattered throughput. It is difficult to follow. I would like to advise the authors to make it more concrete.

**How it is addressed in revision:** We once again thank the reviewer for this feedback. Most of the paper is rewritten, including the Sections *I*, *II* & *III*. The rewritten portions are highlighted. The novelties are listed in Section I. Some additional references are included in Section II and are highlighted.

6) Section VII, page-8, column-2 : Just below "eqn. 12", how did you devise "low power differential readout scheme"?

**How it is addressed in revision:** The low power differential readout schematic is shown in Fig. 13. The read out scheme is differential since we compare the output against its complement and then we read the difference in voltage generated at nodes X and Y through a comparator as mentioned in the main paper which is quoted here:

"A symmetry is maintained between the transistors in the two arms of the circuit. The reading of the cell is carried out in two consecutive phases: *Pre-charge phase* followed by *Sensing phase* as shown in the simulated waveforms of Fig. 14. The waveforms are simulated using 22 nm predictive CMOS technology [38], [39], [40], [41].  $(M_{1a}, M_{1b})$  and  $(M_{2a}, M_{2b})$  are access transistors of the output MTJs  $S_a$ ,  $S_b$  and their complements  $\overline{S}_a$ ,  $\overline{S}_b$ , respectively. The access transistors remain on  $(\phi_1 = 1)$  during the entire read operation.

During the Pre-charge phase, the  $\phi_2$  signal is pulled low to turn off transistors  $M_3$  and  $M_4$ . The active low signal  $\phi_3$  is pulled down to assist in fast pre-charge of nodes X and Y to potential  $V_{DD}$ . Signal  $E_q$  is raised high to equalize nodes X and Y through transistor  $M_9$ . During the sensing phase,  $\phi_2$  is raised to a low voltage, say  $V_{read}$ , for applying a low voltage bias on the output MTJs. With  $E_q = 0$  and  $\phi_3 = 1$ , voltage differences start to grow at nodes X and Y due to differential current from the complementary output states. The Comparator senses the voltage difference and accordingly sets its output O/P to either high or low."

The scheme is low power since

- it is *non-destructive*, i.e. the values in the cells  $S_a$ ,  $S_b$  and  $\overline{S_a}$ ,  $\overline{S_b}$  are not erased during the read operation. This eliminates the need to write the values back into the cells. From Table VIII we can see that a write operation consumes more power than a read operation. Please note that writing a logic 0 and a logic 1 requires current magnitudes of  $216\mu A$  and  $278\mu A$  respectively, while reading requires average current magnitudes of  $28\mu A$  to  $31.4\mu A$  (see Table VIII).
- it uses the TMR, an inherent property of the MTJs that enables current-driven read. As mentioned, the read current magnitude is low, which results in low reading power. The read duration is also in the range of a few ns, which makes the read operation energy efficient as well. This is in contrast to the magnetic sensors traditionally proposed for reading the logic output, which are not only cumbersome but are also expected to be less power efficient.
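As a quick check of the current figures quoted in the first item, the sketch below computes the ratio of write current to read current. Only the four current values come from the text; the function and variable names are illustrative, and no pulse durations or resistances are assumed, so this compares currents only, not energies.

```python
# Currents quoted in the response (microamperes, attributed to Table VIII).
WRITE_UA = {"logic 0": 216.0, "logic 1": 278.0}
READ_UA = (28.0, 31.4)  # range of average read currents


def write_to_read_ratios():
    """Smallest and largest ratio of write current to read current."""
    lo = min(WRITE_UA.values()) / max(READ_UA)  # smallest write vs largest read
    hi = max(WRITE_UA.values()) / min(READ_UA)  # largest write vs smallest read
    return lo, hi


lo, hi = write_to_read_ratios()
print(f"write current is {lo:.1f}x to {hi:.1f}x the read current")
```

With the quoted values the write current is roughly 7 to 10 times the read current, which is why avoiding a write-back after each read matters.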

By the phrase "effective utilizing the TMR of the MTJ" we mean that we have effectively used the TMR of the MTJs, which is an inherent property of MTJs, for reading. In other words, we didn't use external sensors to perform the read from the outputs. By utilizing this TMR, we could devise the low power non-destructive differential read out circuit that is presented in this paper.

# D. Response to Reviewer 4

This paper proposes STT current-induced clocking and write schemes to reduce energy consumption compared to the filed-induced clocking for Magnetic Logic. A differential read-out scheme is proposed for the proposed Magnetic Logic. Section II gives a good tutorial of Magnetic Logic. A half-adder using the STT device instead of using conventional filed-induced MTJ device is presented in this work. However, it is not clear what are the novel points proposed by the authors. The author should focus on presenting the novelty of this work in the revised manuscript. More analysis and comparison should be added.

**How it is addressed in revision:** We are thankful to the reviewer for this valuable suggestion. We have modified the introduction with special focus on the novelties of the paper. The novelties of the paper are mentioned at the beginning of the section in the response to the EIC.

1) Fig. 4 is not readable.

**How it is addressed in revision:** Fig. 4 has been modified and is currently Fig. 9. Some contents from the figure are reduced for better clarity and understanding and the contents are distributed to other figures (like Fig. 4, Fig. 8) that are added to the manuscript.

a) Are the standard cells (yellow ones) connected to the source-line?

**How it is addressed in revision:** The standard cells are the yellow ones in Fig. 9. They are connected to source lines. Two of their connections are shown in Fig. 6 and Fig. 5. For rows that contain only standard cells (see Fig. 5), the source line is connected on-axis to the cell (MTJ). For rows in which both standard cells and cells with access transistors are present, the source line is connected off-axis to the standard cell (see Fig. 6).

b) Cells P, Q, and R in Fig. 5 are not marked in Fig.4

**How it is addressed in revision:** Fig. 5 has been slightly modified and is currently Fig. 7 in the paper. P, Q and R cells are renamed  $X_{11}$ ,  $X_{13}$  and  $X_{12}$  in the current Fig. 7. The corresponding cells in Fig. 9 are marked as  $X_{11}$ ,  $X_{13}$  and  $X_{12}$ .

c) What do the dots on the figure stand for?

**How it is addressed in revision:** The dots in the figure (Fig. 9) indicate that the two crossing metal lines are connected at that point.

2) The detail operation for Fig. 4 should be added.

**How it is addressed in revision:** As mentioned, Fig. 4 is currently Fig. 9. A subsection (Section IV-F) is added to explain the figure, along with Algorithm 1, which explains the sequence of logic operations in the figure.

3) This works use the previous Magnetic Logic and STT device to design a half adder. However, the negative voltage pulse to write the STT device is not new. This scheme is commonly used in STT-RAM already. It is not clear what circuit or structure is newly proposed by the authors. Please focus on presenting the novel points in the revised manuscript.

**How it is addressed in revision:** Thank you for your suggestion. We agree with the reviewer that a voltage pulse to write an MTJ is already in use in MRAMs. In this work our focus is on how to provide input to the logic in a way that is low power, scalable and doesn't interfere with neighboring cells.

Prior to this work, the traditional way to write into magnetic logic was through external magnetic fields generated by current-carrying conductors placed underneath the rows of nanomagnets. The current required to generate the fields was in the mA range, which raises serious concerns about the overall power consumption. One of the motivations for this work was to reduce the power in Magnetic Logic. It uses MTJs and therefore leverages some of the concepts of STT-MRAM. Section V is also included for completeness, for the benefit of the readers.

4) Moreover, the authors do not discuss how to provide the negative voltage. Need a negative charge pump circuit? What is the area and power overhead to provide the required negative voltage to all the STT devices (w/ heavy load for complex logic)? Does this negative voltage scheme still give the power reduction advantage?

**How it is addressed in revision:** Thanks for raising this concern. We do not use any negative voltage in the logic. We just alter the potential difference between the source and bit lines to change the direction of current through the MTJ. We have made this explicit in our revised manuscript (Section 20) as mentioned below.

- "Phase I: A positive voltage pulse,  $V_1$ , is applied across the source and bit lines. The resultant current magnitude is equal to the writing current for logic 1 (see Table VIII). This pulse ensures that all the cells targeted for clocking are in logic 1 state at the end of Phase I. A speed power trade off is obtained if the pulse duration is limited to half a precession ( $\tau/2$ ) [30].
- *Phase II*: A positive voltage pulse,  $V_2$ , is applied across the bit and source lines. The resultant current magnitude is equal to the writing current for logic 0 (see Table *VIII*). This pulse is for a quarter of a precession duration  $(\tau/4)$  and is referred as QP pulse. The pulse sweeps the magnetization of the cells from the logic 1 state towards the logic 0. At the end of the pulse  $(\tau/4)$ , the magnetization is along the *y*-direction. This phase is immediately followed by a clocking pulse described next.
- *Phase III*: A positive voltage pulse,  $V_3$ , is applied across the source and bit lines. The pulse magnitude is sufficient to sustain a clocking current density of  $J_{clk}$ . The current magnitude and duration are mentioned in Table *VIII*. In this phase, the cell remains in the clocked state for the entire duration of the pulse."

Fig. 12 is also modified to explain the voltage requirements across the bit and source lines during clocking. The same is applicable for writing where a logic 1 and a logic 0 are written by altering the polarity of the voltage pulse across source and bit lines.

Since no negative voltage is required, we do not need a negative charge pump circuit.
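The three-phase sequence quoted above can be sketched as a list of pulses in which the current direction is reversed by swapping which line is held high, so neither line ever needs a negative potential. This is a minimal sketch under stated assumptions: "applied across the source and bit lines" is read here as the source line being at the higher potential (and the reverse for "bit and source lines"), the `Pulse` class and `clocking_sequence` function are hypothetical names, and the timing values are placeholders because the actual durations are given in Table VIII of the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pulse:
    phase: str
    high_line: str  # line held at the higher, non-negative potential (assumed)
    low_line: str  # line held at the lower potential, e.g. ground
    duration_s: float


def clocking_sequence(tau_s, t_clk_s):
    """Three-phase clocking pulses; tau_s is the precession period."""
    return [
        Pulse("I", "source", "bit", tau_s / 2),  # drive targeted cells to logic 1
        Pulse("II", "bit", "source", tau_s / 4),  # quarter-precession (QP) pulse
        Pulse("III", "source", "bit", t_clk_s),  # hold the clocked state
    ]


# Placeholder timings for illustration only; they are not values from the paper.
seq = clocking_sequence(tau_s=1e-9, t_clk_s=2e-9)
for p in seq:
    print(f"Phase {p.phase}: {p.high_line} line high, {p.low_line} line low, "
          f"{p.duration_s:.2e} s")
```

Only the roles of the two lines change between phases; the pulse amplitudes  $V_1$ ,  $V_2$  and  $V_3$  are all positive.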

5) The differential readout circuit requires a comparator or (sense amplifier). What's the area overhead due to the comparator? Does the process variation induce input offset (comparator) cause read failure? (Especially for the 22nm process.)

**How it is addressed in revision:** We designed a comparator in 22nm CMOS and the design details are provided in supplementary documents, S2. The area overhead due to the comparator is  $4.336\mu m^2$ . The  $1\sigma$  input offset of the comparator is 13.89mV, which is within the "32mV difference observed across the inputs of the comparator from the read circuit in Fig. 13". Therefore, the comparator is capable of detecting the read margin in the presence of process variations. (*Kindly refer to the supplementary document section 16 for further details.*)
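The relation between the quoted offset and the quoted read signal can be expressed in units of  $\sigma$ . The sketch below is a back-of-the-envelope estimate and not the authors' analysis: it assumes the input offset is zero-mean and Gaussian, and it treats the 32mV difference as a fixed signal. The function names are illustrative.

```python
import math

SIGMA_OFFSET_MV = 13.89  # 1-sigma comparator input offset quoted in the response
READ_SIGNAL_MV = 32.0  # voltage difference at the comparator inputs (Fig. 13)


def margin_in_sigma(signal_mv, sigma_mv):
    """Read signal expressed as a multiple of the offset standard deviation."""
    return signal_mv / sigma_mv


def misread_probability(signal_mv, sigma_mv):
    """P(offset opposes and exceeds the signal) for a zero-mean Gaussian offset."""
    z = margin_in_sigma(signal_mv, sigma_mv)
    return 0.5 * math.erfc(z / math.sqrt(2))  # one-sided Gaussian tail


z = margin_in_sigma(READ_SIGNAL_MV, SIGMA_OFFSET_MV)
p = misread_probability(READ_SIGNAL_MV, SIGMA_OFFSET_MV)
print(f"margin = {z:.2f} sigma, estimated misread probability = {p:.4f}")
```

Under these assumptions the read signal is about 2.3 times the  $1\sigma$  offset.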

6) It is not clear why the authors believe the differential readout circuit is better than conventional scheme. Please add analysis and comparison to support this claim.

**How it is addressed in revision:** In Section S1 (Supplementary Documents) we have shown a comparison between three different readout schemes: using reference resistance, using complementary output values (differential reading) and a variability tolerant differential reading scheme. Kindly see the section for further details.

The main advantages of the differential read scheme are outlined below:

- a) It gives a high read margin, since a logic 0 is compared against a logic 1. Moreover, these complementary output bits are easily obtained through antiferromagnetic coupling in magnetic logic.
- b) It does not require maintaining a highly precise reference voltage or reference resistance.
- c) The read scheme is non-destructive in nature, which saves the power that would otherwise be spent rewriting the contents back into the output cell.

Conventionally, reading from magnetic logic required magnetic sensors, which reduce the portability of the logic. They also increase the power consumption and sacrifice the homogeneity of the logic.
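The read-margin advantage in item a) can be illustrated with an idealized current-mode model. This is a sketch under stated assumptions, not the circuit of Fig. 13: each MTJ is modeled as a plain resistor at a fixed read bias, the reference branch is assumed to sit exactly midway between the two state currents, and the values of `r0_ohm`, `tmr` and `v_read` are hypothetical placeholders rather than values from the paper.

```python
def read_currents(r0_ohm, tmr, v_read):
    """Currents through an MTJ in its low- and high-resistance states."""
    r_low = r0_ohm
    r_high = r0_ohm * (1.0 + tmr)  # TMR defined as (R_high - R_low) / R_low
    return v_read / r_low, v_read / r_high


def sense_margins(r0_ohm, tmr, v_read):
    """Return (differential margin, reference-scheme margin) in amperes."""
    i_low_r, i_high_r = read_currents(r0_ohm, tmr, v_read)
    # Differential: a cell is compared directly against its complement.
    differential = i_low_r - i_high_r
    # Reference: a cell is compared against an ideal midpoint reference current.
    i_ref = 0.5 * (i_low_r + i_high_r)
    reference = min(i_low_r - i_ref, i_ref - i_high_r)
    return differential, reference


# Hypothetical device values, for illustration only.
d, r = sense_margins(r0_ohm=2000.0, tmr=1.0, v_read=0.1)
print(f"differential margin = {d * 1e6:.2f} uA, reference margin = {r * 1e6:.2f} uA")
```

In this idealized model the differential margin is twice the reference margin for any choice of resistance and TMR, because the ideal reference sits halfway between the two states.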

7) Please Label the M1a and M1b, which mentioned in the text, in Fig. 8.

**How it is addressed in revision:** Fig. 8 is currently Fig. 13 in the manuscript.  $M_{1a}$  and  $M_{1b}$  are labeled in the figure.

8) The data shown in TABLE V is for a half adder? Or just a single MTJ cell?

**How it is addressed in revision:** Table V in the previous manuscript is currently Table VII. The data is for a single MTJ.

9) The author claim that the proposed readout scheme has features for a high tolerance to variability. Please add analysis and detail discussion to validate this claim.

**How it is addressed in revision:** Section S1, including Fig. 16, is added to provide an analysis of the variability tolerant feature of the read scheme proposed in the paper. We have shown through simulations that the proposed variability tolerant scheme gives a sense margin variation of 65%, as against a read scheme using reference resistance (sense margin variation > 96% with process variation) and a scheme using complementary outputs, i.e. the differential scheme (sense margin variation > 72% with process variation). The simulations show that the variation tolerant scheme is more robust to process variations than the other two read schemes.

10) There is an on-line paper, titled as "Hybrid CMOS-MQCA Logic Architecture using Multi-Layer Spintronic Devices," was published by the same authors of this manuscript. There are 6 figures (60% of the 10 figures) are the same between this manuscript and the on-line paper. Please remove the published material and focus on presenting the new material for the revised manuscript.

**How it is addressed in revision:** The online paper "*Hybrid CMOS-MQCA Logic Architecture using Multi-Layer Spintronic Devices*" is not a published work. It is only uploaded to the archive maintained by Cornell University to time stamp the work. It is a common practice in the physics community to archive one's work while it is under review at journals. We will remove the paper from the archive once this work is accepted by the journal.

11) The paper title for ref [9] is not correct. It should be "...shape-engineered ..."

**How it is addressed in revision:** The correction to ref [9], currently ref [9], is incorporated.

12) I suggest to include the following three papers in the manuscript:

- a) the on-line paper, "Hybrid CMOS-MQCA Logic Architecture using Multi-Layer Spintronic Devices,"
- b) Daisuke Suzuki et al, "Fabrication of a Nonvolatile Lookup-Table Circuit Chip Using Magneto/Semiconductor-Hybrid Structure for an Immediate-Power-Up Field Programmable Gate Array," Symp. VLSI Circuits, 2008, pp. 80-81
- c) Seungyeon Lee et al, "A Full Adder Design Using Serially Connected Single-Layer Magnetic Tunnel Junction Element," IEEE Trans. Electron Devices, Mar. 2008, pp. 890-895.

**How it is addressed in revision:** Thank you for your suggestion. References to the works by Suzuki et al. and Lee et al. are included in the paper.

Once again we would like to thank our reviewers for their valuable feedback which helped us to enhance the contents of the paper.

#### **Response to Reviewers**

We would like to once again thank our reviewers, the Associate Editor and the Editor-in-Chief for their precious time, their encouragement and their valuable feedback. We would like to sincerely acknowledge that each piece of feedback has helped us significantly to improve the overall technical quality of the paper. We are also rewarded intellectually by your appreciation of our hard work and effort. In this section we have addressed the concerns of Reviewer 2.

We have added Section S2 under Supplementary Documents. The section heading is highlighted.

#### A. Response to Reviewer 2

This paper proposes a hybrid CMOS-Magnetic Logic architecture using current driven input, clock and read operations to reduce the energy consumption for magnetic logic. I can tell the authors spend a lot effort to improve the quality of this manuscript. I am satisfied with most of the answers. If the author can add descriptions for the following comments, this manuscript would be interested to the readers of TCAS-I.

1) According to the description for Fig. 16, the sense margin is degraded by using the differential scheme compared to the reference-resistance scheme. Please discuss this issue and compare the two read out schemes.

**How it is addressed in revision:** We thank the reviewer for raising this question. We have added a supplementary section S2 to discuss the sense margins between the different read schemes. The data available in Fig. 16 is compiled under *Case A* in Table XI for the convenience of the readers. For any particular case in Table XI we can see that the sense margin for the Differential Scheme is always greater than the Reference Sensing Scheme. We thank the reviewer for pointing out the closeness in  $\Delta V_2$  values between Reference Sensing of logic 0 and Differential Sensing for *Case A*. This is due to the specific  $R_0$  and *TMR* values that have been chosen for that case. For different  $R_0$  and *TMR* values shown in other cases, we can see that the sense margin  $\Delta V_2$  is different for Reference Sensing of logic 0 and Differential Sensing schemes.

2) The area of the read-out comparator is  $4.336\mu m^2$ . The area of a full adder is only  $2.7\mu m^2$ . Based on the demonstrated half/full adder circuit, the large area overhead due to the read-out circuit make the proposed hybrid CMOS-magnetic logic not practical. If the authors can add another logic circuit with reasonable read-out area overhead, the readers would be convinced that the proposed CMOS-magnetic circuit is near practical.

**How it is addressed in revision:** We once again thank the reviewer for this question. The half adder and full adder circuits were mentioned in Table IX since they can be used as elementary blocks for building larger circuits. For instance, the 32-bit RC adder and the  $8 \times 8$  multiplier mentioned in Table IX are built up hierarchically with full adders as building blocks, which in turn are built using half adders. The area of the 32-bit RC adder is  $89.4\mu m^2$ , while that of the  $8 \times 8$  array multiplier is  $151.2\mu m^2$ . Please note that in nanomagnetic logic:

- a) A sensing circuit is needed only for the primary outputs. A full adder is only a part of the logic, where two full adders are cascaded inside the logic using magnetic interaction. For the array multiplier, for example, the sensing circuit would occupy only 2.8% of the logic area.
- b) The primary outputs from a nanomagnetic logic can be multiplexed so that more than one output can share the sensing circuit in a time division multiplexed manner. This would further reduce the area overhead of the sense circuit for the logic under consideration.
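The area argument in items a) and b) can be checked with the figures quoted in this response. The sketch below assumes a single comparator shared by all primary outputs through time-division multiplexing, as item b) suggests, and ignores the area of the multiplexing logic itself; the function and dictionary names are illustrative.

```python
COMPARATOR_UM2 = 4.336  # read-out comparator area quoted in the response

# Logic areas quoted in the reviewer comment and the response (square micrometres).
LOGIC_AREA_UM2 = {
    "full adder": 2.7,
    "32-bit RC adder": 89.4,
    "8x8 array multiplier": 151.2,
}


def overhead_percent(logic_um2, comparators=1):
    """Comparator area as a percentage of the logic area."""
    return 100.0 * comparators * COMPARATOR_UM2 / logic_um2


for name, area in LOGIC_AREA_UM2.items():
    print(f"{name}: comparator is {overhead_percent(area):.1f}% of the logic area")
```

For a single full adder the comparator is larger than the logic itself, which is the reviewer's concern; for the array multiplier the ratio falls to just under 3%, in line with the figure quoted in item a).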