A tour of the Solaris ATA-over-Ethernet (AoE) implementation

Norman Wilson
2007 August 15

1. Introduction

This is a description of the Solaris ATA-over-Ethernet (AoE) device-driver implementation, version 1.4. It includes:

This is not a tutorial on the Solaris operating system, device-driver and STREAMS subsystems, or kernel; the AoE protocol; or C programming. Neither does it explain how to install and operate the AoE subsystem. For background on these topics and others, the reader should have access to the following references:

Solaris ATA-over-Ethernet (AoE) Installation and Operation Guide
Companion to the present document; /opt/CORDaoe/doc/aoe-guide.html in the customer distribution.
Writing Device Drivers
STREAMS Programming Guide
Application Packaging Developer's Guide
Solaris System Administration Guide
Official Sun reference books, available in HTML or PDF from http://docs.sun.com. There are different editions for different versions of Solaris.
AoE (ATA over Ethernet)
AoE protocol specification, available online from http://www.coraid.com.
Data Link Provider Interface (DLPI) Version 2
Ornate setup protocol used by Solaris network devices, available online from http://www.opengroup.org.
Sundry Solaris manual entries
Cited as needed in the text below, in the conventional entry-name(section) form. Available on a running Solaris system or at http://docs.sun.com.

Certain parts of this subsystem behave differently when newer Solaris features are present:

This document uses the new binary-multiple prefixes described in the IEEE 1541-2002 standard. Prefixes like kilo, mega, giga, tera (k, M, G, T) refer to powers of ten; when the near-equivalent powers of two are meant, they are called kibi, mebi, gibi, tebi (Ki, Mi, Gi, Ti). For example, a gigabyte (GB) is 10^9 bytes, while a gibibyte (GiB) is 2^30 bytes (1073741824). For more details see http://physics.nist.gov/cuu/Units/binary.html.

2. Components

The AoE subsystem comprises several modules executing in kernel mode and several user-mode utility programs.

These are the kernel parts. Each is a loadable kernel module, supplied in both 32- and 64-bit binary versions.

aoecomm
STREAMS module (line discipline). An instance is pushed on each communication channel (usually an Ethernet device) over which AoE messages will be received and sent. Other AoE kernel modules make internal subroutine calls to aoecomm to send messages, to arrange to receive messages, and to query and update a registry of AoE targets known to exist.
aoectl
Character device driver with one statically-configured instance. Supplies two control devices: /dev/aoectl, to which configuration commands may be written; /dev/aoemon, from which error reports and unsolicited AoE messages may be read.
aoed
Block-and-character disk device driver. One instance configured for each possible AoE target disk. Provides disk devices in the /dev/dsk and /dev/rdsk directories implementing normal Solaris disk semantics, sufficient to allow file systems to be created and mounted and to support standard maintenance programs like fsck and format(1M). Implements the required Solaris disk-specific ioctl calls, including some Sun doesn't bother to document.

These are the user-mode parts. Except as noted, each is a binary executable program installed in directory /usr/sbin. Only 32-bit binaries are supplied; they run fine on 64-bit systems.

aoestart
Set up an AoE communication channel on a specified communication device.
aoestop
Shut down a specified channel or channels.
aoectl
Perform control operations through the aoectl driver; e.g. query or tweak the active-target registry, broadcast a Query-Config request to cause targets to identify themselves.
aoemon
Read the monitoring device supplied by the aoectl driver, logging errors and unusual events and updating the registry as Query-Config responses are received.
aoemkconf
Given a list of AoE target addresses, generate the kernel configuration file entry required to declare each device to aoed. Entries can be written by hand as well; this is just a convenience.
aoelabinit
Label an empty disk. Needed to work around a bug in format(1M).
aoeunlabel
Wipe a disk of all labels, to reuse an EFI-labelled disk on a pre-EFI system.
aoe
aoe.xml
Start or stop the AoE subsystem, calling aoestart for each channel named in configuration file /etc/aoe.conf. On a non-SMF system, aoe is stored in directory /etc/init.d and linked into /etc/rc2.d, so it will be run when the system boots. If SMF is installed and active, aoe is stored as method script /lib/svc/method/device-aoe and aoe.xml is added to the SMF inventory, so that AoE will be started when service svc:/device/aoe is enabled.

All of these programs, along with default configuration files and manual pages and other documentation, are bundled into a single Solaris package named CORDaoe for distribution. To install AoE, a system administrator runs pkgadd(1M), edits a configuration file or two, and starts the subsystem: on a non-SMF system, /etc/init.d/aoe start or a reboot; with SMF, svcadm enable svc:/device/aoe. See the Installation and Operation Guide for further operational details.

3. Source tree layout

The source code tree contains these directories:

kern
Source files for kernel modules, including header files used only by those modules. Object files and 32-bit binaries are kept in the main kern directory, 64-bit binaries (whether for SPARC or x86) in subdirectory kern/bin64.
user
Source files for user-mode utility programs. Object files and binaries are stored here too.
include
Include files shared by user-mode and kernel code.
man
Source files for manual entries, in troff -man format.
doc
Source files for longer documents (including this one), in fidl format. Fidl is an experimental formatter-independent document language; see the author for details.
pkg
Where the installable package is built. Subdirectory pkg/data contains files to be assembled into the eventual package, copied there by make pkg in each of the directories above. The resulting package is stored in subdirectory pkg/CORDaoe in file system format, and file pkg/CORDaoe.Z in compressed stream format.

4. Building programs and package

The build process is controlled by make(1). Makefiles come ready-to-use; there is no system-dependent configuration process, automatic or otherwise.

At present the code may be compiled only on Solaris 10, though the resulting binaries will run on systems as old as Solaris 7.

Each source-file directory kern, user, man, doc has a Makefile with these targets (except as noted):

make (default target)
Build target files from source files: compile and load binary programs, format documents.
make install
Build targets if necessary, then install in standard places on the local system. Used only during development; not for customer use, and not guaranteed to work on any system save the author's. Not in man/Makefile or doc/Makefile.
make pkg
Build targets if necessary, then copy to the staging area in directory pkg/data. Create subdirectories within pkg/data as needed.
make clean
Remove generated targets and intermediate files, leaving only the original source files.

pkg/Makefile has these targets:

make pkg
Use the current contents of staging area pkg/data to create two copies of the installable package: one in file system format (a collection of individual files) in directory pkg/CORDaoe, one in compressed stream format (a single file) in file CORDaoe.Z.
make clean
Remove intermediate files (i.e. empty the staging area); leave installable packages, if they exist.
make clobber
Empty the staging area and remove the packages, leaving only Makefile and the control files.

The root directory of the source tree contains an overall Makefile with these targets:

make (default target)
Compile everything: run make in kern, user, man, and doc, in that order.
make pkg
Compile everything, then build packages: run make pkg in kern, user, man, doc, and pkg, in that order.
make clean
Clean up everything: run make clean in kern, user, man, doc, and pkg, in that order.

Thus:

  • To build packages with contents reflecting the current source code, run make pkg at the top level.
  • To clean up but leave the packages there, run make clean at the top level.
  • To work on a kernel module, cd kern and edit the files. To compile what has been changed, run make. When all seems well (perhaps after installing the new binary by hand to test it), build a new package with
    make pkg && cd ../pkg && make pkg
    or with
    cd .. && make pkg

All Makefiles assume that required tools are in the shell's search path; in particular cc if C programs are to be compiled, ld if kernel object files are to be linked into loadable modules, fidl if long documents are to be rendered.

The stream-format package file resulting from the build process is named CORDaoe.Z. Societal conventions may dictate giving it another name in public, like aoe-version.Z. The package name CORDaoe is encoded in the package data; the filename doesn't matter. A fixed name is used so that pkg/Makefile need not be edited just because the version number has changed.

4.1. Architecture-specific build issues

Makefiles and associated conventions are designed to let the distribution build without fuss whether on SPARC or x86; no manual configuration is required. It is assumed that only one of those architectures will be built at a time; there is no provision to keep SPARC- and x86-specific binaries separate. To use the same source tree for both, build one architecture, save the resulting binaries, then run make clean before building the other.

Within each architecture 32- and 64-bit kernel binaries are kept side-by-side. Compiler flag -xarch=generic64 is used for 64-bit compilations; this is recognized by Sun's compiler on either architecture. 64-bit kernel modules are created in subdirectory bin64. When make pkg copies files to pkg/data for packaging, a recursive make call selects target directory usr/kernel/drv/sparcv9 or usr/kernel/drv/amd64 according to the architecture on which make is run.

The user-mode utility programs are built only as 32-bit binaries, since those work fine on 64-bit systems.

pkg/Makefile calls pkgmk(1) with variables declaring the current architecture name and the corresponding subdirectory name for 64-bit kernel modules, so that names need not be hard-coded in pkginfo and prototype.

5. Coding conventions and other background

Some aspects of the style used in C code and documentation are a bit unusual. The following is meant to aid comprehension, not as a religious tract, despite the occasional descent into sermon.

5.1. C style

As a general but not inviolate rule, code is presented top-down: within each file, major routines and entry points come first, followed by local subroutines they call, and so on. Subroutines used only by one or two related major routines immediately follow those routines; common code used by many different routines is grouped after all uses. Routines performing related functions (e.g. different operations on the same type of data structure) are placed near one another.

Data declarations and definitions are placed at the top of the file, except for tables used by only one or two subroutines, which are usually placed next to the relevant code.

Every procedure has a prototype declaration. A global procedure (one called from another file) is declared in a header file. A static procedure is declared near the top of the source file, after all data declarations but before any code. As a special case, if a procedure's address is used as a data initializer (e.g. device-driver entry points), its prototype precedes the relevant data.

Although every procedure has an ISO-C prototype, procedure definitions declare arguments in old-C syntax:

static mblk_t *aoectlintr(int, mblk_t *);
...
static mblk_t *
aoectlintr(chan, mp)
int chan;
mblk_t *mp;

The original C specification intentionally made procedure definitions resemble calls, with type information as a sidebar, in the belief that this makes the code easier to read. The author still agrees. The meaning of such mixed notation is clearly defined in the ISO standard.

Variadic procedures are an exception: their definitions use the ISO syntax, because that is the only way to express them.

Types are always declared explicitly:

static int
doquery(ap, maj, min)
Aoechan *ap;
int maj, min;
{
     unsigned int x;
     ...

not

static
doquery(ap, maj, min)
Aoechan *ap;
{
     unsigned x;
     ...

Neither procedures nor data are allowed global scope unless that is truly required. (Here the author disagrees with both original and ISO C: static scope should have been the default.)

Machine-specific pseudo-optimizations such as the register keyword are not used.

Source files are not made aware of the source-code directory structure. In particular, #include directives for AoE-specific header files are always of the form #include "file.h"; if file.h might be in directory ../include, compiler option -I../include should be used.

Compile-time conditionals are used sparingly. There are no `normal' options; only one version of the code is officially supported. #ifdefs are used only to hide unsupported experimental code, to make compiler or system-bug band-aids stand out, or to work around compile-time environment differences in different versions of Solaris.

5.2. Binary data structures

Externally-defined binary data structures (e.g. AoE messages, ATA IDENTIFY data, EFI disk labels) are treated as byte arrays, not as C structures. A constant is defined for the offset to each element. Multi-byte objects other than plain byte arrays (e.g. integers of various sizes) are accessed through machine-independent macros that pack or unpack data with appropriate attention to byte order. This allows source code to be ignorant of per-system byte-order and word-alignment rules.

These header files, all stored in directory include in the source tree, define constants and macros to implement this scheme. In the descriptions below, buf is always a buffer address, and may be either char * or unsigned char *. Off is an integer offset within the buffer addressed by buf. Val is an integer value; sometimes only the low-order bits are used.

aoeproto.h
AoE protocol:
  • Offsets for the elements of AoE protocol messages, relative to the first octet of the AoE header, i.e. the byte following the Ethernet MAC header (in case the protocol is someday used over another medium).
  • Constant values used in various protocol fields.
  • These access macros:
    fraoechar(buf, off)
    fraoeshort(buf, off)
    fraoelong(buf, off)
    Fetch 8-bit, 16-bit, or 32-bit unsigned value from offset off in buffer buf, using the byte order defined for the AoE protocol.
    toaoechar(buf, off, val)
    toaoeshort(buf, off, val)
    toaoelong(buf, off, val)
    Store 8-bit, 16-bit, or 32-bit value val at offset off in buf.
  • Macros to encode or decode some sub-octet values, such as version-and-flag or version-and-config-string-op codes.
etherproto.h
Ethernet MAC header:
  • Offsets for Ethernet source, destination, and protocol (type).
  • Macro fretherprot(buf, off) to fetch a protocol-type value from offset off in buffer buf; macro toetherprot(buf, off, val) to set protocol-type value val.
  • These additional constant values:
    LEN_EADDR
    Length of an Ethernet MAC address. Such addresses are treated as opaque fixed-length byte arrays.
    LEN_ETHER
    Length of the MAC header.
    MINLEN_ETHER
    MAXLEN_ETHER
    Minimum and maximum standard lengths of an Ethernet datagram, including header.
ata.h
ATA disk register contents and associated ATA message data:
  • Values of flags and command codes read from or written to ATA registers. Register offsets are not defined; for our purposes a register is a location in an AoE message.
  • Offsets and constants for the relevant (used by this package) parts of the large data structure returned by the ATA IDENTIFY command.
  • Access macros for elements within the IDENTIFY structure:
    frataword(buf, off)
    fratalong(buf, off)
    fratall(buf, off)
    Fetch 16-bit, 32-bit, or 64-bit value from offset off in buf, using the byte order defined by the ATA specification.
    toataword(buf, off, val)
    toatalong(buf, off, val)
    toatall(buf, off, val)
    Store 16-bit, 32-bit, or 64-bit value val to offset off in buf.
aoecommproto.h
Message protocol used for control messages exchanged between user-mode programs and the aoecomm and aoectl drivers. This header defines offsets and other constants; integer values are formatted using the fraoexxx and toaoexxx macros from aoeproto.h.
efilabel.h
Data structures used by the new EFI label scheme:
  • Offsets within the GPT (GUID Partition Table header, one per copy of the label).
  • Offsets within each GPE (GUID Partition Entry, one per partition per copy of the label).
  • Access macros for elements of either structure:
    frefiword(buf, off)
    frefilong(buf, off)
    frefill(buf, off)
    Fetch 16-bit, 32-bit, or 64-bit value from offset off in buf, using the byte order defined by the EFI specification.
    toefiword(buf, off, val)
    toefilong(buf, off, val)
    toefill(buf, off, val)
    Store 16-bit, 32-bit, or 64-bit value val to offset off in buf.
  • These additional constants:
    MINLEN_GPT
    Length of the meaningful contents of the GPT.
    LEN_GPT
    Length of the whole GPT sector.
    EFI_GPT_LOC
    Sector number of the primary copy of the GPT.
    LEN_GPE
    Length of each GPE.
    LEN_GUID
    Length of the UUID (128-bit unique ID) used both in the GPT and in each GPE.
The toefixxx macros are not used at present; they are included for completeness.
dospart.h
Data structures used by DOS partition labels:
  • Offsets for elements (partition descriptors, magic number) within the DOS boot sector.
  • Offsets for the components of each of the four partition descriptors.
  • Access macros:
    frdospchar(buf, off)
    frdospword(buf, off)
    frdosplong(buf, off)
    Fetch 8-bit, 16-bit, or 32-bit value from offset off in buf, using the conventional (Intel) byte order.
    todospchar(buf, off, val)
    todospword(buf, off, val)
    todosplong(buf, off, val)
    Store 8-bit, 16-bit, or 32-bit value val to offset off in buf.
    The todospxxx macros are not used at present; they are included for completeness.
  • Additional constants, including:
    LEN_BSEC
    LEN_DOSP
    Length of the entire boot sector, and of a single partition descriptor.
    BSECMAGIC
    Magic number found at the end of a valid boot sector.
    SYSID_SUNOS
    SYSID_SUNOS2
    The two system ID (partition type) values used when a partition encapsulates a Sun VTOC label.

Sun supply headers describing EFI and DOS labels, but they use C structures accompanied by explicit machine-dependent byte-swapping, something we would rather avoid. Having our own headers may also simplify future support for compiling the AoE code on older versions of Solaris.

This code fragment tests whether the AoE message at p is a response message, and if so fetches its tag value:

if (fraoechar(p, AOE_VF) & AOEF_R)
     tag = fraoelong(p, AOE_TAG);

This fragment sets the major and minor target numbers to standard wildcard values:

toaoeshort(p, AOE_MAJOR, AOEMAJWILD);
toaoechar(p, AOE_MINOR, AOEMINWILD);
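
The access scheme described in this section can be illustrated with a few portable helpers. This is a minimal sketch, not the contents of the real headers: the names get_be16, put_be16, get_be32, put_be32, and get_le32 are hypothetical, and it assumes big-endian (network) order for AoE fields and little-endian (Intel) order for DOS fields. The real macros in aoeproto.h and dospart.h may be written quite differently.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helpers, not the real macros: they only illustrate
 * byte-at-a-time access that is independent of host byte order
 * and alignment.  buf may be char * or unsigned char *.
 */

/* Fetch a 16-bit big-endian value from offset off in buf. */
static inline uint16_t
get_be16(const void *buf, size_t off)
{
	const unsigned char *p = (const unsigned char *)buf + off;

	return (uint16_t)((p[0] << 8) | p[1]);
}

/* Store the low 16 bits of val, big-endian, at offset off in buf. */
static inline void
put_be16(void *buf, size_t off, uint16_t val)
{
	unsigned char *p = (unsigned char *)buf + off;

	p[0] = (unsigned char)(val >> 8);
	p[1] = (unsigned char)(val & 0xff);
}

/* Fetch a 32-bit big-endian value from offset off in buf. */
static inline uint32_t
get_be32(const void *buf, size_t off)
{
	const unsigned char *p = (const unsigned char *)buf + off;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	    ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Store a 32-bit value, big-endian, at offset off in buf. */
static inline void
put_be32(void *buf, size_t off, uint32_t val)
{
	unsigned char *p = (unsigned char *)buf + off;

	p[0] = (unsigned char)(val >> 24);
	p[1] = (unsigned char)((val >> 16) & 0xff);
	p[2] = (unsigned char)((val >> 8) & 0xff);
	p[3] = (unsigned char)(val & 0xff);
}

/* Fetch a 32-bit little-endian value from offset off in buf. */
static inline uint32_t
get_le32(const void *buf, size_t off)
{
	const unsigned char *p = (const unsigned char *)buf + off;

	return ((uint32_t)p[3] << 24) | ((uint32_t)p[2] << 16) |
	    ((uint32_t)p[1] << 8) | (uint32_t)p[0];
}
```

Because every access goes through single bytes, the same source compiles unchanged on SPARC and x86, and the buffer need not be aligned.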

5.3. Control message protocol

An internal protocol is used for messages exchanged between user-mode programs and the aoecomm STREAMS module during channel setup, and for messages written to /dev/aoectl and read from /dev/aoemon. The message format is defined by constants in aoecommproto.h; messages should be accessed using the macros defined in aoeproto.h.

Each message comprises a fixed-length header and a variable-length body. The header contains these fields:

AC_TYPE
Message type; determines the body format.
AC_LEN
Length of the entire message, including this header.
AC_CHANNEL
AoE channel number, if any, associated with this message.

These message types are defined:

ACINITCHAN
Initialize a communication channel using the enclosed channel number, local MAC address, Ethernet protocol type, and maximum data-segment length, and validating the request with a 32-bit random cookie. Written first to /dev/aoectl, then to the channel on which aoecomm was pushed.
ACINITACK
Reply from aoecomm to ACINITCHAN, announcing that channel setup is complete or giving an errno value explaining why it failed.
ACSEND
Command to /dev/aoectl: send the enclosed datagram (usually an AoE message) via the channel named. The MAC header must be included, but only the destination address is significant; source address and protocol values are filled in by the system.
ACDEVENAB
Command to /dev/aoectl: query or update the active-target registry. Body gives an AoE target address (major and minor numbers) and a command: enable target, disable, query.
ACLOG
Message from /dev/aoemon: body contains an unexpected, ill-formed, or unsendable AoE message, as explained by a type code:
ACLUNSOL
Unsolicited AoE message. Usually a Query-Config response from a target that is not open.
ACLILLFORM
Ill-formed AoE message.
ACLSENDFAIL
AoE message that couldn't be sent (aoecomm_send failed).
ACLTIMEOUT
AoE message that was sent, but for which no reply was received after repeated retries. The operation that generated the message has already received an error. The message may be truncated to the MAC and AoE headers, omitting any data sectors.
ACLAOEERR
ACLATAERR
AoE message reporting an AoE or ATA error.

See the source code and the supplied aoecomm(7M) and aoectl(7D) manual entries for details.
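
As an illustration of the header-plus-body format, here is a sketch of composing and checking a control message. The layout is an assumption made for the example only: three 16-bit big-endian fields at offsets 0, 2, and 4, and an invented type code. The real offsets, field widths, and codes are those in aoecommproto.h; acbuild and acvalid are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* ASSUMED layout for illustration; see aoecommproto.h for the real one. */
enum { AC_TYPE = 0, AC_LEN = 2, AC_CHANNEL = 4, LEN_ACHDR = 6 };
enum { ACSEND = 3 };		/* invented type code */

static inline void
put16(unsigned char *buf, size_t off, unsigned val)
{
	buf[off] = (unsigned char)((val >> 8) & 0xff);
	buf[off + 1] = (unsigned char)(val & 0xff);
}

static inline unsigned
get16(const unsigned char *buf, size_t off)
{
	return ((unsigned)buf[off] << 8) | buf[off + 1];
}

/*
 * Write header and body into out.  Return the total message length,
 * or 0 if the message would not fit in out or in the length field.
 */
static inline size_t
acbuild(unsigned char *out, size_t outlen, unsigned type, unsigned chan,
    const void *body, size_t bodylen)
{
	size_t total = (size_t)LEN_ACHDR + bodylen;

	if (total > outlen || total > 0xFFFF)
		return 0;
	put16(out, AC_TYPE, type);
	put16(out, AC_LEN, (unsigned)total);	/* includes the header */
	put16(out, AC_CHANNEL, chan);
	if (bodylen > 0)
		memcpy(out + LEN_ACHDR, body, bodylen);
	return total;
}

/* A received message is plausible if AC_LEN covers the header and fits. */
static inline int
acvalid(const unsigned char *msg, size_t msglen)
{
	size_t len;

	if (msglen < (size_t)LEN_ACHDR)
		return 0;
	len = get16(msg, AC_LEN);
	return len >= (size_t)LEN_ACHDR && len <= msglen;
}
```

Checking AC_LEN against the number of bytes actually received, before touching the body, is the sort of validation a driver would want for messages arriving from user mode.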

6. The big picture

Here is a sketch of major operations showing the components and subroutines used.

6.1. Initialize channel (activate AoE on a network device)

These user-mode steps use a file descriptor for a network device of suitable type, suitably initialized (e.g. configured to receive the desired Ethernet protocol), to create an active AoE channel:

  1. Push an instance of the aoecomm STREAMS module onto the file descriptor.
  2. Invent a 32-bit nonzero random cookie value. Compose an ACINITCHAN AoE control message containing the desired channel number, the cookie, the local MAC address of the network device (usually discovered by a device-specific call), the Ethernet protocol type to be used, and the maximum data-segment size (usually computed from the Ethernet device MTU).
  3. Open /dev/aoectl and write the ACINITCHAN message. Close /dev/aoectl. The aoectl module calls aoecomm_initchan to register the cookie for the desired channel.
  4. Write the ACINITCHAN message to the file descriptor. aoecomm compares the cookie stored for the desired channel with that supplied in the message at hand. If the cookies match and the channel number is not already in use, the protocol number and MAC address are stored for use when sending messages, and the channel is made active. The two-way handshake using both aoectl and aoecomm is needed to make it harder for bad guys to cause trouble.
The file descriptor must remain open; when it is closed, the aoecomm instance for this channel is popped, and the channel shut down.
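
The cookie handshake in steps 3 and 4 can be modelled in ordinary user-mode C. This is a sketch under stated assumptions, not the kernel code: regcookie and initchan are hypothetical stand-ins for the work done on behalf of /dev/aoectl and of the stream respectively, the cookie is assumed to be consumed once used, the active array stands in for the real channel table, and all locking is omitted.

```c
#include <stdint.h>

enum { MAXCHAN = 10, NOCOOKIE = 0 };

typedef uint32_t Aoecookie;		/* assumed representation */

static Aoecookie pendcookie[MAXCHAN];	/* cookie awaiting each channel */
static int active[MAXCHAN];		/* stand-in for the channel table */

/* Step 3: the control device records the cookie for the channel. */
static inline int
regcookie(int chan, Aoecookie cookie)
{
	if (chan < 0 || chan >= MAXCHAN || cookie == NOCOOKIE)
		return -1;
	pendcookie[chan] = cookie;
	return 0;
}

/* Step 4: the same message arrives on the stream; compare cookies. */
static inline int
initchan(int chan, Aoecookie cookie)
{
	if (chan < 0 || chan >= MAXCHAN || active[chan])
		return -1;		/* bad or already-active channel */
	if (pendcookie[chan] == NOCOOKIE || pendcookie[chan] != cookie)
		return -1;		/* nothing registered, or mismatch */
	pendcookie[chan] = NOCOOKIE;	/* assumption: cookie is single-use */
	active[chan] = 1;
	return 0;
}
```

A process that can write to the network device but not to the control device cannot complete step 4, since it has no way to register a matching cookie first.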

This initialization dance is normally done by user-mode program aoestart within library routine comminit. Aoestart then calls library routine achattach, which uses fattach(3) to attach the file descriptor to a file in directory /etc/aoe to keep the channel open.

6.2. Open AoE disk device

The aoed disk driver open routine initializes the device if necessary, broadcasting an AoE Query-Config message to discover the target's Ethernet MAC address and other characteristics.

Device configuration is semi-automatic. Every target to be used must be listed in advance in kernel configuration file aoed.conf; changes to the file are effective only when the aoed module is reloaded, or (Solaris 10 only) when update_drv -f is run. The memory overhead for a target that is configured but not used is modest; it is unreasonable to declare every possible device allowed by the protocol (255*65535 of them), but prudent to declare every slot in a new EtherDrive shelf even if only a few will be used at first.

It takes a few seconds to discover that an AoE device isn't available on the network even though its device files exist. If there are many such unavailable devices, programs that open every possible disk device in the system (notably format(1M)) will run quite slowly. To prevent this, aoecomm maintains a registry of `enabled' devices, those believed actually to exist. Daemon aoemon listens to /dev/aoemon for unsolicited Query-Config response messages, generated when a target device is powered up or in response to a broadcast Query-Config command from aoestart or aoectl. When such a message arrives, aoemon enables the responding target by writing an ACDEVENAB control-protocol message to /dev/aoectl, triggering a call to aoecomm_devenab to enable the device in the registry.
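
The registry can be pictured as a small table keyed by target address. The following is only a sketch: the fixed-size table, the command codes, and the name devenab are assumptions made for illustration, and say nothing about how aoecomm actually stores its registry or how it locks it.

```c
#include <stddef.h>
#include <stdint.h>

enum { NREG = 64 };				/* invented capacity */
enum { DEV_ENABLE, DEV_DISABLE, DEV_QUERY };	/* invented command codes */

typedef struct {
	int used;
	uint16_t major;		/* AoE major (shelf) number */
	uint8_t minor;		/* AoE minor (slot) number */
} Regent;

static Regent registry[NREG];

/*
 * Apply cmd to target major.minor.  Return 1 if the target is enabled
 * afterwards, 0 if it is not, -1 if it could not be added.
 */
static inline int
devenab(uint16_t major, uint8_t minor, int cmd)
{
	Regent *hit = NULL, *freep = NULL;
	int i;

	for (i = 0; i < NREG; i++) {
		if (registry[i].used && registry[i].major == major &&
		    registry[i].minor == minor)
			hit = &registry[i];
		else if (!registry[i].used && freep == NULL)
			freep = &registry[i];
	}
	switch (cmd) {
	case DEV_ENABLE:
		if (hit != NULL)
			return 1;	/* already enabled */
		if (freep == NULL)
			return -1;	/* table full */
		freep->used = 1;
		freep->major = major;
		freep->minor = minor;
		return 1;
	case DEV_DISABLE:
		if (hit != NULL)
			hit->used = 0;
		return 0;
	default:			/* DEV_QUERY */
		return hit != NULL;
	}
}
```

An open of a disk whose target is not in the registry can then fail at once instead of waiting for a Query-Config timeout.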

The aoectl command lists and modifies the registry.

6.3. Read and write disks

The aoed driver supplies read and write entry points for raw I/O, and a strategy routine for block and file system I/O.

I/O is done by composing an AoE command in a STREAMS buffer and calling aoecomm_send to send it to the network device. When an AoE message arrives on a channel in which aoed has registered interest (any on which some disk device has been opened), receive routine adreceive is called. Pending commands are stored in a list. When a message arrives, the system searches for a pending command for the same AoE device, with the same AoE tag value. Timer routine chantimer is called periodically to scan the pending-command list for commands which haven't received responses within a specified time interval. If a command has timed out only a few times, it is retransmitted. After several timeouts the command is abandoned, and the I/O request aborted with an error.
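
The matching and retry logic can be sketched as follows. Everything here is hypothetical: the Pending structure, the table size, the retry limit, and the tick counts are invented for the example and are not taken from aoed, which also has to worry about locking and about freeing STREAMS buffers.

```c
#include <stddef.h>
#include <stdint.h>

/* Invented limits for illustration only. */
enum { NPEND = 8, MAXRETRY = 3, TIMEOUT_TICKS = 2 };
enum { T_NONE, T_RESEND, T_ABORT };

typedef struct {
	int inuse;
	uint16_t major;		/* AoE target address */
	uint8_t minor;
	uint32_t tag;		/* AoE tag sent with the command */
	int age;		/* ticks since last transmission */
	int retries;		/* retransmissions so far */
} Pending;

/* Find the pending command that a response answers, or NULL. */
static inline Pending *
pendfind(Pending *tab, uint16_t major, uint8_t minor, uint32_t tag)
{
	int i;

	for (i = 0; i < NPEND; i++)
		if (tab[i].inuse && tab[i].major == major &&
		    tab[i].minor == minor && tab[i].tag == tag)
			return &tab[i];
	return NULL;
}

/* One timer tick for one entry: keep waiting, resend, or give up. */
static inline int
pendtick(Pending *p)
{
	if (!p->inuse || ++p->age < TIMEOUT_TICKS)
		return T_NONE;
	if (p->retries >= MAXRETRY) {
		p->inuse = 0;	/* caller fails the I/O request */
		return T_ABORT;
	}
	p->retries++;
	p->age = 0;
	return T_RESEND;
}
```

A response whose tag matches no pending command (for instance a late reply to a command already abandoned) finds nothing in the table and would be treated as unsolicited.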

Often an I/O request will involve more data than can be handled in a single AoE command: 1024 bytes by default, more if larger (jumbo) Ethernet frames are allowed. If so, a single request will generate several AoE commands. The corresponding several AoE responses may be returned out of order.
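
The arithmetic of splitting is simple enough to show directly. This is a sketch with hypothetical function names; maxseg is the channel's maximum data-segment length in bytes (1024 in the default case mentioned above).

```c
#include <stddef.h>

/* Number of AoE commands needed to move nbytes, at most maxseg each. */
static inline size_t
ncommands(size_t nbytes, size_t maxseg)
{
	return (nbytes + maxseg - 1) / maxseg;
}

/* Bytes carried by command number i (counting from 0); 0 if past the end. */
static inline size_t
seglen(size_t nbytes, size_t maxseg, size_t i)
{
	size_t off = i * maxseg;

	if (off >= nbytes)
		return 0;
	return nbytes - off < maxseg ? nbytes - off : maxseg;
}
```

For example, an 8192-byte request with a 1024-byte segment limit becomes eight commands. Since responses may arrive in any order, the request can be considered complete only when all of its commands have been answered, not when the last-numbered one has.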

6.4. When the system boots

On a non-SMF system, startup script /etc/init.d/aoe is run at boot time. When SMF is present, AoE is represented by service svc:/device/aoe, initially disabled; when the service is enabled and on subsequent reboots, method script /lib/svc/method/device-aoe is called. In either case the script starts the aoemon daemon, then calls the aoestart command for each channel listed in AoE-specific configuration file /etc/aoe.conf.

The startup script is called early in the boot process, after the root and /usr file systems have been mounted, but before any others. Hence the AoE subsystem must not rely on access to other file systems, and in particular must not require access to /var or /opt. The payoff is that any other file system, including /var or /opt, may be stored on an AoE disk. AoE drivers are normally stored in /usr/kernel, support programs in /usr/sbin, configuration files and the mount points used by fattach in /etc.

6.5. When the system shuts down

Nothing special need be done when the system is shut down. Channels remain active until the very end, even after all processes have been killed, because it is the fattach operation that keeps them open. Thus the normal shutdown code to unmount all file systems at the last minute works without fuss. Notice, however, that the file system where AoE channels are attached cannot be unmounted, because the AoE fattach calls keep it busy. That is why AoE uses /etc/aoe rather than /var/aoe; the latter directory is sometimes a separate file system, the former rarely if ever.

AoE may be intentionally shut down by running /etc/init.d/aoe stop (non-SMF system) or svcadm disable svc:/device/aoe (SMF). This is sometimes useful for maintenance purposes, but is not done by a normal shutdown.

7. Kernel component details

7.1. aoecomm

7.1.1. Configuration

An instance of aoecomm is created whenever the module is pushed on a stream file; thus each active AoE channel has a separate instance.

7.1.2. Source code

All code is in a single source file, aoecomm.c. It uses AoE-specific header files aoeproto.h, aoecomm.h, aoecommproto.h, and etherproto.h.

7.1.3. Data structures

None is used outside the aoecomm module.

Several data structures provide global context within aoecomm, but are not accessible to the rest of the kernel:

static void (*logger)(int, int, mblk_t *);
static kmutex_t loggerlock;
Logger points to the logging routine called by aoecomm_log, if any. Loggerlock prevents concurrent access to logger.
static Aoechan *aoechan[MAXCHAN];
static kmutex_t chantlock;
The aoechan table contains pointers to Aoechan structures indexed by channel number (assigned when the channel is initialized) for faster lookup. MAXCHAN is defined in aoecomm.c; its present value is 10. Chantlock prevents concurrent access to the aoechan array, but not the Aoechans.
static Aoecookie pendcookie[MAXCHAN];
static kmutex_t pendlock;
pendcookie[i] contains a pending initialization cookie for channel i, or NOCOOKIE (zero) if none has arrived yet. Pendlock prevents concurrent access to the pendcookie array.

7.1.4. Exports

Standard Solaris loadable-module and STREAMS-module entry points and data structures are supplied: in particular _init, _fini, and _info routines, a modlinkage structure (and its many children), a streamtab structure, and a pair of qinit structures. Open and close (module push and pop), read put and service, and write service routines are supplied; puts to the write queue use putq(9F).

Only _init, _fini, and _info are global. _init calls mod_install to tell the system where to find the modlinkage structure through which the other data structures and routines can be located.

Several routines called by other AoE kernel modules are made available as global entry points. Prototypes for these routines are declared in aoecomm.h, constant values in aoecommproto.h.

If the receiver routine for channel chan is NULL, set it to receiver. If receiver was already this channel's receiver, leave it be. In either case return > 0.

If chan is invalid or already has a different receiver, return < 0.
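
The registration rule just described is small enough to sketch. The name setreceiver, the Receiver signature, and the two sample receivers are hypothetical; MAXCHAN is the value quoted above; the locking a kernel module would need around the table is omitted.

```c
#include <stddef.h>

enum { MAXCHAN = 10 };

typedef void (*Receiver)(int chan, void *msg);	/* assumed signature */

static Receiver receivers[MAXCHAN];

/* Two do-nothing receivers, present only so the rule can be exercised. */
static inline void
rcv_a(int chan, void *msg)
{
	(void)chan;
	(void)msg;
}

static inline void
rcv_b(int chan, void *msg)
{
	(void)chan;
	(void)msg;
}

/*
 * Register receiver for chan.  Return > 0 if receiver is now (or was
 * already) the channel's receiver; return < 0 if chan is invalid or
 * a different receiver is already registered.
 */
static inline int
setreceiver(int chan, Receiver receiver)
{
	if (chan < 0 || chan >= MAXCHAN)
		return -1;
	if (receivers[chan] == NULL)
		receivers[chan] = receiver;
	else if (receivers[chan] != receiver)
		return -1;
	return 1;
}
```

Making a repeated registration of the same routine succeed means a driver can register on every device open without tracking whether it has already done so for that channel.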

If a logger routine has been registered with aoecomm_initlog, call it with the same arguments: (*logger)(chan, code, mp). The logger becomes responsible for STREAMS message mp, and will free it when finished; neither aoecomm_log nor its caller may use it further.

Send STREAMS message mp via channel chan. If chan is ACHWILD, send a copy to every active channel.

The message must begin with an Ethernet MAC header with destination MAC address filled in. Source address and protocol type are overwritten with the values from the Aoechan.

If the message is sent, return > 0. If an error occurs (invalid chan, no room downstream), free the message and return < 0.

7.1.5. Algorithms

7.1.5.1. Module pushed

Open routine aoecommopen allocates a new Aoechan, marked inactive; stores its address in q_ptr for both the read and write queues; and calls qprocson(9F) to enable queue processing. A channel number is not assigned yet, so no data may be sent; the channel is inactive, so received data are thrown away.

7.1.5.2. Message arrives on write queue (from user code)

Arriving messages are added to the write queue by putq(9F).

Write service routine aoecommwsrv loops calling getq(9F), processing each message as follows:

  • If the message is of any type but M_DATA, pass it downstream without further interpretation.
  • If the data message comprises a valid ACINITCHAN message, and if:
    • the channel number is valid, and not that of an already-active channel;
    • a cookie has already been stored for that channel;
    • and the cookie in the message matches the stored value;
    make the channel active:
    1. Copy the channel number, MAC address, Ethernet protocol type, and maximum data-segment size from the ACINITCHAN message to this channel's Aoechan.
    2. Call strqset(9F) to adjust the read-queue high water mark according to the maximum data-segment size.
    3. Enter the Aoechan in aoechan.
    4. Send an ACINITACK reply reporting success.
    If the channel number is invalid, the cookie is wrong, or some other error occurs, send an ACINITACK reply containing an appropriate error code.
  • If the data message is not a valid ACINITCHAN, free it and continue.
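
The activation tests above can be modelled outside the kernel. The following is a minimal sketch, not the driver's code: struct chan_model, its fields, and the INIT_* result codes are hypothetical names invented for illustration, and the real routine works on Aoechan structures and replies with ACINITACK messages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-channel record; field names are illustrative only. */
struct chan_model {
	int active;		/* nonzero once ACINITCHAN has succeeded */
	int have_cookie;	/* nonzero once a cookie has been stored */
	unsigned long cookie;	/* the stored cookie value */
};

/* Hypothetical result codes; the real reply carries an ACINITACK error. */
enum { INIT_OK = 0, INIT_BADCHAN, INIT_BUSY, INIT_NOCOOKIE, INIT_BADCOOKIE };

/* Apply the three activation tests in the order the list gives them. */
int
initchan_check(const struct chan_model *tab, int nchan, int chan,
    unsigned long cookie)
{
	assert(tab != NULL);
	if (chan < 0 || chan >= nchan)
		return INIT_BADCHAN;	/* invalid channel number */
	if (tab[chan].active)
		return INIT_BUSY;	/* channel already active */
	if (!tab[chan].have_cookie)
		return INIT_NOCOOKIE;	/* no cookie stored yet */
	if (tab[chan].cookie != cookie)
		return INIT_BADCOOKIE;	/* cookie mismatch */
	return INIT_OK;
}
```

Only when every test passes would the channel be made active; any other result corresponds to an error reply.
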
7.1.5.3. Message arrives on read queue (from communication channel)

Read put routine aoecommrput works as follows:

  • If this is a control message (mp->b_datap->db_type >= QPCTL), pass the message upstream.
  • If this is a data message and the channel is active, call putq(9F) to place the message on our read queue for later processing by aoecommrsrv.
  • If this is a data message and the channel is not active, free the message.
7.1.5.4. Module popped

Close routine aoecommclose does the following:

7.2. aoectl

7.2.1. Configuration

A single instance of aoectl is configured by driver configuration file aoectl.conf, attached to the pseudo device nexus. There are two minor devices:

/devices/pseudo/aoectl@0:ctl (/dev/aoectl)
Write-only control device. Each write system call provides a complete AoE control protocol message; the return value is the size of the buffer written if the command succeeded, -1 with an appropriate errno value if it failed, zero for a few special cases. Concurrent opens are allowed, but commands are handled one at a time without overlap.
/devices/pseudo/aoectl@0:mon (/dev/aoemon)
Read-only control device; concurrent opens forbidden. Each read returns one or more rejected AoE messages.

Neither device supports polling; neither has any ioctl commands.

7.2.2. Source code

All code is in a single source file, aoectl.c. It uses AoE-specific header files aoeproto.h, aoecomm.h, aoecommproto.h, and etherproto.h.

7.2.3. Data structures

None is used outside the aoectl module.

static Aoectlstate aoectlstate;
Static private context data, including:
  • Ring of Aoemon structures used to store messages pending receipt by /dev/aoemon.
  • Two pointers into the ring: rmon, the next message to be read; wmon, the next to be written.
  • Number of Aoemons in the ring.
  • kmutex_t lock to prevent concurrent use of /dev/aoectl.
  • kmutex_t lock to prevent concurrent access to the ring buffer.
  • kcondvar_t condition variable on which reads from /dev/aoemon sleep if no messages are available; a flag indicating that some process is sleeping.
  • Pointer to the Solaris dev_info_t structure for the sole device instance.

Each Aoemon contains:

  • Data to be logged: a pointer to a STREAMS message, an integer log-message type, the associated AoE channel number.
  • Count of messages lost before this message was received, if any. Incremented if a message arrives and there's no place to put it.
  • next, a pointer to the next Aoemon in the ring.
The buffer ring is allocated when the /dev/aoemon device is opened, and the next-Aoemon pointers set to sew the Aoemons into a ring. All Aoemons (and any unread messages) are freed when the device is closed. When aoectlstate.wmon == aoectlstate.rmon, the ring is empty; when aoectlstate.wmon->next == aoectlstate.rmon, the ring is full. Notice that there is always at least one empty Aoemon following that pointed to by aoectlstate.rmon.

When a message arrives:

  • If the ring is full, increment the lost-messages counter in *aoectlstate.wmon.
  • Otherwise store the message in *aoectlstate.wmon, and set aoectlstate.wmon to aoectlstate.wmon->next.

When it is desired to read a message:

  • If aoectlstate.wmon == aoectlstate.rmon, no message is available.
  • Otherwise consume the message stored in *aoectlstate.rmon and set aoectlstate.rmon to aoectlstate.rmon->next.
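
The ring discipline just described can be tried out in user space. This is a simplified sketch under stated assumptions: struct mon stands in for Aoemon, an int replaces the STREAMS message pointer, the ring size is fixed at compile time rather than taken from a property, and no locking is shown. The names ring_init, ring_put, and ring_get are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define NMON 4	/* illustrative ring size; not the driver's default */

/* Simplified stand-in for an Aoemon: an int replaces the message. */
struct mon {
	int msg;
	int lost;		/* messages dropped while the ring was full */
	struct mon *next;
};

struct ring {
	struct mon slot[NMON];
	struct mon *rmon;	/* next entry to be read */
	struct mon *wmon;	/* next entry to be written */
};

/* Sew the entries into a ring; both pointers start at the same entry. */
void
ring_init(struct ring *r)
{
	int i;

	assert(r != NULL);
	for (i = 0; i < NMON; i++) {
		r->slot[i].msg = 0;
		r->slot[i].lost = 0;
		r->slot[i].next = &r->slot[(i + 1) % NMON];
	}
	r->rmon = r->wmon = &r->slot[0];
}

/* Store a message; return 1 if stored, 0 if the ring was full. */
int
ring_put(struct ring *r, int msg)
{
	if (r->wmon->next == r->rmon) {	/* full: just count the loss */
		r->wmon->lost++;
		return 0;
	}
	r->wmon->msg = msg;
	r->wmon = r->wmon->next;
	return 1;
}

/* Fetch the oldest message; return 1 on success, 0 if the ring is empty. */
int
ring_get(struct ring *r, int *msg, int *lost)
{
	if (r->wmon == r->rmon)
		return 0;
	*msg = r->rmon->msg;
	*lost = r->rmon->lost;
	r->rmon->lost = 0;	/* counter is cleared once reported */
	r->rmon = r->rmon->next;
	return 1;
}
```

Because one entry is always left empty, a ring of NMON entries holds at most NMON-1 messages. The loss count accumulated while the ring was full is reported with the next message stored in that entry.
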

7.2.4. Exports

Standard Solaris loadable-module and device-driver entry points and data structures are supplied: in particular _init, _fini, and _info routines, a modlinkage structure (and its many children), dev_ops and cb_ops structures. A _depends_on string declares a dependency on aoecomm, since aoectl calls procedures supplied by that module.

Only _init, _fini, and _info are global. _init calls mod_install to tell the system where to find the modlinkage structure through which the other data structures and routines can be located. Apparently _depends_on need not be global to work.

Internal routine aoectlintr is registered with aoecomm as the global message receiver if /dev/aoemon is open.

7.2.5. Algorithms

7.2.5.1. Open device

Open routine aoectlopen acts according to the minor device opened:

/dev/aoectl
Return success.
/dev/aoemon
  • If there is already a buffer ring (the device is already open), return EBUSY.
  • Call aoecomm_initlog(aoectlintr) to register aoectlintr as the logger. If this fails, return EBUSY.
  • Allocate the ring buffer. If integer property monbufs exists, allocate that many messages; the default is 32. If this fails, deregister the logger (call aoecomm_initlog(NULL)) and return EBUSY.
Any other device
Return ENXIO.
7.2.5.2. Close device

Close routine aoectlclose acts according to the minor device closed:

/dev/aoemon
  • Call aoecomm_initlog(NULL) to deregister the logger.
  • Free the buffer ring and any STREAMS messages still stored there.
  • Return success.
Any other device
Return success.
7.2.5.3. Write to device

Write routine aoectlwrite acts according to the minor device written:

/dev/aoectl
Fetch the user's buffer, and parse it as if it contains a single complete AoE control protocol message. Process according to message type:
ACINITCHAN
Extract chan and cookie values; call aoecomm_initchan(chan, cookie). On success, return as if all bytes were successfully written; on failure, return ENXIO. Only the channel and cookie values are used here; other parameters are ignored.
ACSEND
Extract channel number and message contents, allocate a STREAMS buffer, and copy the message there. Call aoecomm_send(chan, message). If all is well, return success; if no STREAMS buffer was available, ENOMEM; if aoecomm_send failed, ENXIO.
ACDEVENAB
Extract chan, maj, min, and cmd values. Call aoecomm_devenab(chan, maj, min, cmd). If the result is ADENAB, return as if all bytes were written; if ADDISAB, return zero (no error, but nothing written).
If the message is ill-formed or has an unknown type, return ENXIO.
Any other device
Return ENODEV.
7.2.5.4. Read from device

Read routine aoectlread acts according to the minor device read:

/dev/aoemon
  • If aoectlstate.rmon == aoectlstate.wmon, set the process-sleeping flag, wait on the condition variable, and try again. If a signal arrives while waiting, return EINTR.
  • If aoectlstate.rmon != aoectlstate.wmon and the user's buffer has room for the message stored in *aoectlstate.rmon, compose an ACLOG control-message header including the channel number, log-message type, and lost-message count stored in that Aoemon; use uiomove(9F) to copy the header, then the data contents of the message, to the user's buffer. Free the STREAMS message; clear the lost-message counter in *aoectlstate.rmon; set aoectlstate.rmon to aoectlstate.rmon->next.
  • Loop until aoectlstate.rmon == aoectlstate.wmon, the user's buffer has no room for the next message, or uiomove returns an error. In the last case return the same error as uiomove; otherwise return the number of bytes read.
Notice that if the first available message is longer than the user's buffer, read returns zero.
Any other device
Return ENODEV.
7.2.5.5. Message received
static void aoectlintr(chan, code, mp)
int chan, code;
mblk_t *mp;

If there's room in the ring, store mp, code (the log-message type), and chan in *aoectlstate.wmon and set aoectlstate.wmon to aoectlstate.wmon->next. If the ring is full, increment the lost-messages counter in *aoectlstate.wmon and free mp. In either case, if the process-sleeping flag is set, signal the condition variable, awakening any process blocked reading /dev/aoemon.

7.3. aoed

7.3.1. Configuration

Each instance of aoed represents one AoE disk target, with 13 or 21 block and 13 or 21 character minor devices:

  • Eight (SPARC) or 16 (x86) devices accessing the standard Solaris disk partitions.
  • Four devices, present even on SPARC systems, accessing the direct DOS partitions if a DOS label is present.
  • A final partition accessing the whole disk regardless of partition tables, reserved cylinders, and other operating-system impedimenta.

Links in directories /dev/dsk and /dev/rdsk use fixed `controller' name ca.

These devices are created for aoed:

/devices/pseudo/aoed@inst:a (/dev/dsk/cadinsts0)
    ...
/devices/pseudo/aoed@inst:p (/dev/dsk/cadinsts15)
Block devices for instance inst, partitions a-h (and on to p on x86) in /devices-speak, or s0-s7 (s15) in /dev-speak.
/devices/pseudo/aoed@inst:r (/dev/dsk/cadinstp1)
    ...
/devices/pseudo/aoed@inst:u (/dev/dsk/cadinstp4)
Block devices for instance inst, partitions r-u in /devices-speak, or p1-p4 in /dev-speak.
/devices/pseudo/aoed@inst:wd (/dev/dsk/cadinst)
Whole-disk block device for inst.
/devices/pseudo/aoed@inst:a,raw (/dev/rdsk/cadinsts0)
    ...
/devices/pseudo/aoed@inst:h,raw (/dev/rdsk/cadinsts7)
/devices/pseudo/aoed@inst:r,raw (/dev/rdsk/cadinstp1)
    ...
/devices/pseudo/aoed@inst:u,raw (/dev/rdsk/cadinstp4)
/devices/pseudo/aoed@inst:wd,raw (/dev/rdsk/cadinst)
Standard and whole-disk character devices for inst.

Sun's standard device drivers create the DOS-partition devices only for certain kinds of disk and only on Solaris/x86; when the DOS devices exist, the whole-disk device in /dev is called p0. The AoE driver supports DOS partition tables on Solaris/SPARC as well, but omits the p0 name to avoid a bug in tools like format(1M).

In 32-bit versions of Solaris, 18 bits are available for minor device numbers. 13 (21 on x86) minor devices are created for each instance, allowing 262144/13 = 20164 (262144/21 = 12483) instances. In 64-bit systems, 32 bits are available, hence there may be 4294967296/13 = 330382099 (4294967296/21 = 204522252) instances.
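
The instance limits quoted above follow from integer division of the minor-number space by the number of minor numbers each instance consumes. A small sketch of that arithmetic follows; max_instances is a name invented here and is not part of the driver.

```c
#include <assert.h>

/*
 * Largest number of instances that fit when each instance consumes
 * per_inst minor numbers out of a space of 2^bits.
 */
unsigned long long
max_instances(unsigned bits, unsigned per_inst)
{
	assert(bits < 64 && per_inst > 0);
	return (1ULL << bits) / per_inst;	/* truncating division */
}
```

With bits set to 18 or 32 and per_inst set to 13 or 21, the function yields the four figures given in the text.
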

These optional integer-valued properties may be included:

timeout=ms
If an I/O operation hasn't completed within ms milliseconds, assume it never will; send the command again, or after several retries give up and report an error. The default is 200 milliseconds.
hd=nhead
sec=nsec
Assume this device has cylinders of nhead heads, and tracks of nsec sectors, ignoring any information supplied by the hardware or that invented by fudgegeom. Honoured only if both properties are included; if only one is given it is ignored.
maxdata=len
Allow a data segment of at most len bytes, overriding the default: the minimum of that offered in the target's Query-Config response message (1024 if no value returned), and that configured for the specified aoechan.
maxbuf=nbuf
Limit the maximum buffer count (number of concurrent messages the target can handle) to nbuf rather than the default 256. The value is in any case bounded above by that suggested by the target.
disksort=enable
If enable is 1, call the standard disksort(9F) kernel routine to sort queued disk operations; if zero, just use FIFO ordering.
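
The way the maxdata and maxbuf properties combine with values reported by the target can be written down compactly. This is a sketch of the rules as stated here, not an extract from the driver: the function names are invented, and a negative prop argument is an assumed convention meaning that the property is absent from aoed.conf.

```c
#include <assert.h>

#define DEF_MAXDATA 1024	/* used when the target reports no value */
#define DEF_MAXBUF   256	/* default cap on concurrent commands */

/*
 * Effective data-segment size.  prop < 0 means "no maxdata property"
 * (an assumption of this sketch); target is the Query-Config value,
 * zero if none was returned; chan is the channel's configured limit.
 */
int
eff_maxdata(int prop, int target, int chan)
{
	int t;

	assert(target >= 0 && chan > 0);
	if (prop >= 0)
		return prop;		/* maxdata=len overrides the default */
	t = (target == 0) ? DEF_MAXDATA : target;
	return (t < chan) ? t : chan;	/* lesser of target and channel */
}

/* Effective buffer count: the cap, bounded above by the target's value. */
int
eff_maxbuf(int prop, int target)
{
	int cap = (prop >= 0) ? prop : DEF_MAXBUF;

	assert(target >= 0);
	return (cap < target) ? cap : target;
}
```
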

On Solaris 9 and earlier versions, changes to aoed.conf take effect only when the aoed kernel module is loaded. If AoE disks have already been used, the file will not be effective until the driver is unloaded, which means every AoE disk device must be unmounted and closed. Often it is simplest just to reboot the system.

On Solaris 10, update_drv(1M) updates the configuration. With option -f it can do so without a module unload. Device instances presently in use cannot be changed on the fly, but new devices may be added without rebooting.

7.3.2. Source code

The aoed driver source code is broken into several files:

aoed.c
Main program: official driver entry points and data structures.
addadk.c
aduscsi.c
Handle ATA- and SCSI- (sic) specific ioctl calls.
adio.c
Generate and send AoE commands for an I/O request, or as needed during device initialization. Process AoE replies. Retransmit commands whose answers don't arrive in a specified time; cancel commands that have suffered too many retries.
addos.c
adefi.c
advtoc.c
Read and write disk labels and partition tables; separate implementations for each of the several label formats.
adcmd.c
Allocate, free, and manage lists of pending AoE commands.
adsubr.c
Miscellaneous subroutines: find the Adrive structure associated with a Solaris device; error-message helpers.
adtrace.c
Simple event-tracing mechanism; used only when debugging the driver.
aoed.h
Header file declaring data types and instances and procedure prototypes used in more than one source file.
dadkio32.h
Header file declaring a 32-bit version of certain data structures used by the Solaris DIOCTL_RWCMD ioctl. For some reason Sun's header files leave these out.

To make it easier to include optional code only as needed (trace in particular), all object files except main program aoed.o are collected into object library adlib.a. The module binary is generated by running

ld -r -o aoed aoed.o adlib.a

7.3.3. Data structures

None is used outside the aoed module.

7.3.3.1. Adrive

An Adrive structure is allocated for each device instance when it is attached, freed when it is detached, containing:

7.3.3.2. Acmd

Each Adrive has a pool of Acmd structures, allocated when the device is first opened, freed on last close. The pool is initialized with as many Acmds as the target allows concurrent commands. When a command is composed and sent, an Acmd is taken from the pool; while the command awaits a response, the Acmd is kept in a per-channel pending-command list; when a response is received or the command is abandoned after a timeout, the Acmd is returned to the pool. If the pool is empty no new commands may be sent until an outstanding command has completed or timed out and its Acmd has been returned to the pool.

Each Acmd contains:

The state value was introduced when tracking a subtle bug that freed an Acmd that was still on the active-channel list. It remains because it's cheap insurance: if another such bug creeps in, it will be caught early and found more easily.
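
The pool discipline, and the kind of check a state value makes possible, can be sketched as follows. This is an illustration only: struct cmd, the three state names, and all function names are hypothetical, the pending-command list itself is omitted, and the real Acmd holds far more than is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states: in the pool, held by a caller, or on a pending list. */
enum cmd_state { CMD_FREE, CMD_HELD, CMD_PENDING };

struct cmd {
	enum cmd_state state;
	struct cmd *next;	/* link in the pool's free list */
};

struct pool {
	struct cmd *free;	/* singly linked list of idle commands */
};

/* Fill the pool with n commands taken from caller-supplied storage. */
void
pool_init(struct pool *p, struct cmd *store, int n)
{
	int i;

	p->free = NULL;
	for (i = 0; i < n; i++) {
		store[i].state = CMD_FREE;
		store[i].next = p->free;
		p->free = &store[i];
	}
}

/* Take a command, or return NULL if every command is outstanding. */
struct cmd *
pool_get(struct pool *p)
{
	struct cmd *c = p->free;

	if (c == NULL)
		return NULL;
	p->free = c->next;
	c->next = NULL;
	c->state = CMD_HELD;
	return c;
}

/* Mark a held command as placed on (or removed from) a pending list. */
void
cmd_post(struct cmd *c)
{
	assert(c->state == CMD_HELD);
	c->state = CMD_PENDING;
}

void
cmd_unpost(struct cmd *c)
{
	assert(c->state == CMD_PENDING);
	c->state = CMD_HELD;
}

/* Return a command to the pool; refuse (-1) unless it is merely held. */
int
pool_put(struct pool *p, struct cmd *c)
{
	if (c->state != CMD_HELD)
		return -1;	/* still pending, or already freed */
	c->state = CMD_FREE;
	c->next = p->free;
	p->free = c;
	return 0;
}
```

In this model, an attempt to free a command that is still marked pending is rejected at once instead of corrupting the list later.
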
7.3.3.3. Achan

Each aoechan named at least once in aoed.conf has an Achan structure. Achans are allocated as needed as devices are attached, in a dynamic array handled by ddi_soft_state_init and ddi_get_soft_state(9F). The whole array is freed when the module is unloaded.

Each Achan contains:

7.3.3.4. Abuf

Every buf structure for a pending or active I/O transfer has an associated Abuf structure, allocated when the buf is accepted by adstrategy. The address of the Abuf is stored in bp->b_private. The Abuf is freed before biodone(9F) is called.

Macro bptoabuf(bp) is a shorthand for (Abuf *)(bp->b_private).

Each Abuf contains:

  • The number of AoE commands outstanding for this buf.
  • The absolute logical sector number to be used in the next AoE command generated for this transfer.
  • The offset within the transfer to the first byte to be read or written by the next command.
  • The total number of bytes to be transferred.
The transfer associated with a buf structure is complete when the offset value equals the total-bytes value and the number of outstanding commands becomes zero, or when an error occurs.
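
The completion rule can be expressed as a small predicate over the four fields listed above. The sketch below uses invented names (struct abuf_model, abuf_done, abuf_issue); the way abuf_issue splits a transfer into commands of at most maxdata bytes is an assumption made for illustration, not a description of adstart.

```c
#include <assert.h>

/* Simplified model of the four Abuf fields listed above. */
struct abuf_model {
	int ncmd;		/* AoE commands outstanding */
	unsigned long long lba;	/* sector for the next command */
	unsigned long off;	/* offset of the next byte to transfer */
	unsigned long total;	/* total bytes in the transfer */
};

/* Nonzero when the transfer is finished, successfully or by error. */
int
abuf_done(const struct abuf_model *ab, int error)
{
	assert(ab->off <= ab->total);
	if (error != 0)
		return 1;
	return ab->off == ab->total && ab->ncmd == 0;
}

/* Account for issuing one command covering at most maxdata bytes. */
unsigned long
abuf_issue(struct abuf_model *ab, unsigned long maxdata,
    unsigned long secsize)
{
	unsigned long n = ab->total - ab->off;

	assert(secsize > 0);
	if (n > maxdata)
		n = maxdata;
	ab->off += n;			/* next command starts here */
	ab->lba += n / secsize;		/* advance the sector address */
	ab->ncmd++;			/* one more command outstanding */
	return n;
}
```

Note that the offset reaches the total as soon as the last command is issued; the transfer is complete only once the outstanding count has also dropped to zero.
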

7.3.4. Exports

Standard Solaris loadable-module and device-driver entry points and data structures are supplied: in particular _init, _fini, and _info routines, a modlinkage structure (and its many children), dev_ops and cb_ops structures. A _depends_on string declares a dependency on aoecomm, since aoed calls procedures supplied by that module.

Only _init, _fini, and _info are global. _init calls mod_install to tell the system where to find the modlinkage structure through which the other data structures and routines can be located. Apparently _depends_on need not be global to work.

Internal routine adreceive is registered with aoecomm as the message receiver for each aoechan that has been used to send at least one message. Internal routine chantimer runs as a self-renewing timer, with one instance per aoechan.

7.3.5. Algorithms

7.3.5.1. Attach device instance

Attach routine adattach works as follows:

  • Allocate an Adrive for this device instance. Initialize it with constants at hand: instance number and dev_info_t pointer; aoechan, aoemaj, aoemin, other aoed.conf-entry properties, fetched with ddi_prop_get_int(9F). If any required property is missing, return failure.
  • Call solversion; if this system is earlier than Solaris 10, set the flag indicating an old-style argument to the DKIOCGETEFI and DKIOCSETEFI ioctls.
  • Call ddi_create_minor_node(9F) to create each of the block and character minor devices for this instance.
  • Initialize locks and condition variables in this Adrive.
  • Call ddi_report_dev(9F) to inform the system that the instance is ready.
Adattach does not attempt to access the AoE target. The aoechan may not yet be active. Even if it is, the target may not exist; there's no point in stalling for several seconds for each non-existent device configured in aoed.conf.
7.3.5.2. Open device partition

Open routine adopen works as follows:

  • Call aoecomm_devenab to check that the AoE target has been enabled. If not, immediately return ENXIO.
  • Create kstat_t structures for the drive and for the partition being opened, as necessary.
  • If this is a non-blocking open (FNDELAY or FNONBLOCK set in open flags) and the Adrive state is ADCLOSED, call driveinit to start the initialization process.
  • If neither FNDELAY nor FNONBLOCK was set, call openwait to start initialization if necessary and wait for completion. If openwait fails, return the error reported. Check whether the partition being opened is of length zero; if so, return ENXIO.
  • Mark this partition open according to the type of open call: for block or character open, set the appropriate bit in the appropriate open-subdevice bitmask; for a layered open, increment the layered-open count.

Drive initialization comprises these steps:

1. Driveinit called; drive state is ADCLOSED
Allocate a single Acmd for this drive; broadcast an AoE Query-Config command with the desired aoemaj and aoemin addresses to the desired AoE channel, to locate the target. Set drive state to ADWAOE.
2. Query-Config response arrives matching the pending Acmd; drive state is ADWAOE
Store the maxbuf value reported in the Query-Config response (limited by the maximum configured for this drive), and the MAC address whence it came. Call acinit to top up this drive's Acmd pool to maxbuf entries. If the maximum data-segment size wasn't specified as a configuration property, extract that in the response message, interpreting zero as the default value 1024; compare it with the value configured for the channel, returned by aoecomm_maxdata; store the lesser in ep. Compose and send an ATA IDENTIFY command. Set state to ADWATA.
3. ATA response message arrives; drive state is ADWATA
Extract size and geometry information and device-type strings from the presumed IDENTIFY response. If the disk is VTOC-proper, initialize the dk_geom structure in ep and call fudgegeom to tweak the numbers if necessary; if VTOC-improper, invent a cylinder size such that the largest possible cylinder number (given the size of this drive) is bounded by ULONG_MAX, for disksort(9F). Set up whole-disk partition limits. Fill in the configuration kstat_t structure, now that dynamically-determined values like maxdata and mibsize are known. Set state to ADGATA. Awaken any processes sleeping in openwait for this drive.
4. Drive state ADGATA discovered within openwait
Compose a device ID object (ddi_devid_register(9F)) and several string properties; the former may be used by some Solaris programs, the latter are an invention and may disappear in the future. Set state to ADWPTAB. Call readptab to read the disk label, filling in all remaining partition entries, the label type, and possibly other label-specific data in the Adrive. When all is well, set state to ADREADY.
If any step fails, initialization is completely abandoned: the drive state is reset to ADCLOSED, the Acmd pool is freed. The next open attempt will start over.

Locks are used to assure that at most one process at a time is working on initializing a given drive. Other processes desiring to access the drive sleep until initialization succeeds or is abandoned due to an error.
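
The initialization sequence amounts to a small state machine. The sketch below uses the state names from the text, but their numeric values, the event names, and the function adnext are all invented for illustration. The sketch also assumes that an event arriving in the wrong state is simply ignored, which the text does not say.

```c
#include <assert.h>

/* State names come from the text; the numeric values are arbitrary. */
enum adstate { ADCLOSED, ADWAOE, ADWATA, ADGATA, ADWPTAB, ADREADY };

/* Hypothetical event names for the triggers described in steps 1-4. */
enum adevent {
	EV_DRIVEINIT,	/* driveinit called */
	EV_QCRESP,	/* Query-Config response matched the pending Acmd */
	EV_ATARESP,	/* ATA IDENTIFY response arrived */
	EV_OPENWAIT,	/* openwait noticed state ADGATA */
	EV_LABELREAD,	/* readptab finished successfully */
	EV_FAIL		/* some step failed */
};

/* Return the state that follows s when event e occurs. */
enum adstate
adnext(enum adstate s, enum adevent e)
{
	if (e == EV_FAIL)	/* abandon; the next open starts over */
		return (s == ADREADY) ? s : ADCLOSED;
	switch (s) {
	case ADCLOSED:	return (e == EV_DRIVEINIT) ? ADWAOE : s;
	case ADWAOE:	return (e == EV_QCRESP) ? ADWATA : s;
	case ADWATA:	return (e == EV_ATARESP) ? ADGATA : s;
	case ADGATA:	return (e == EV_OPENWAIT) ? ADWPTAB : s;
	case ADWPTAB:	return (e == EV_LABELREAD) ? ADREADY : s;
	case ADREADY:	return s;
	}
	return s;
}
```
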

7.3.5.3. Close device partition
7.3.5.4. Detach device instance

Detach routine addetach works as follows.

  • Discard all minor devices for this device instance (ddi_remove_minor_node(9F) with second argument NULL).
  • If the list of active buf structs for this drive is not empty, log an error message and return failure. (This shouldn't happen.)
  • Discard any remaining Acmds for this drive; clear out the Acmd pool.
  • Free any kstat_t structures associated with this drive, including those for all partitions.
  • Destroy the kmutex_t locks and the kcondvar_t condition variable.
  • Free the Adrive.

The system calls addetach only after all references to this device instance have been closed; adclose blocks until no pending I/O operations remain. Hence addetach need show no mercy to any I/O still unfinished.

7.3.5.5. Read or write device

Read routine adread and write routine adwrite, called only for the character (raw) device, just call physio(9F) which allocates a buf structure and calls adstrategy. Block-device I/O, whether through the file system or by direct read or write call, is handled by the system; the device driver sees only a call to adstrategy for each block.

All I/O transfers are therefore queued by strategy routine adstrategy, which works as follows:

7.3.5.6. Process ioctl

Ioctl routine adioctl calls openwait to check that initialization is complete (in case this was a non-blocking open), then switches on the ioctl command code.

The data-model rules for ioctl in Solaris complicate matters. When a 32-bit program makes ioctl calls under a 64-bit kernel, conversions may be necessary, depending on the data used by the particular ioctl command. Sometimes the 32- and 64-bit forms of a data structure have the same size and layout; sometimes they differ, and the appropriate version must be filled in and returned. In one case they differ but Sun doesn't ship a header file declaring the 32-bit form even though it is needed for format(1M). See Writing Device Drivers for more about this mess.

In another case, handled by efiioc, the argument format differs according to the Solaris version. This is why adattach checks the version and sets a flag.

The following standard disk-device commands are supported. Many, but not all, are listed in the Writing Device Drivers book and described in somewhat more detail in dkio(7D).

DKIOCINFO
Fetch information about the disk controller. Returns constant controller type DKC_DIRECT, with a constant large controller number (in the hope that it won't conflict with a real one), the instance number as the disk unit number, slave number zero.
DKIOCGAPART
DKIOCSAPART
Fetch or set the starting block number and length of all eight standard Solaris partitions on this drive. Only the in-core Adrive is written, not the disk label.
DKIOCPARTINFO
Fetch the starting block number and length of the partition open on this file descriptor.
DKIOCGGEOM
DKIOCG_PHYGEOM
Return the drive's current dk_geom structure.
DKIOCSGEOM
Update the dk_geom structure in ep, without changing the on-disk label.
DKIOCGMEDIAINFO
Return a dk_minfo structure containing the media type (always DK_FIXED_DISK), sector size (always DEV_BSIZE), and total sector count of the disk.
DKIOCREMOVABLE
Is this a removable-media device? Always returns `no.'
DKIOCGETEFI
DKIOCSETEFI
DKIOCPARTITION
EFI-label-specific calls handled by efiioc.
DKIOCGVTOC
DKIOCSVTOC
VTOC-label-specific calls handled by vtocioc.
DKIOCGMBOOT
DKIOCSMBOOT
DOS-label-specific calls handled by dosioc.
DIOCTL_RWCMD
DIOCTL_GETMODEL
DIOCTL_GETSERIAL
ATA-specific calls handled by dadkioctl.
USCSICMD
SCSI-specific (sic) call handled by aduscsi.
7.3.5.7. Send AoE command

To send an AoE command from within the driver:

  • Call acalloc to allocate an Acmd from the pool for this drive. If none is available, no more commands may be sent for now. Acalloc initializes the Adrive pointer, header-length field, and the common part of the AoE header (AoE command, aoemaj and aoemin, version code and other flags, unique tag value), and zeroes all other fields.
  • Fill in the expiry interval and maximum retry count. Fill in the remainder of the AoE command, e.g. ATA register values for an ATA command, but not data sectors to be written. If this is a transfer, set the buf pointer, the starting block number, and the transfer length and offset.
  • Call adstcmd to send the command and place the Acmd on the pending-command list for this aoechan.

Chantimer is called at regular intervals to discover commands that must be retransmitted or expired.

7.3.5.8. AoE message arrives

AoE messages arrive by calls to adreceive, which calls adlookup to locate the corresponding Acmd, then processes the message according to its command code. If the message is garbled or otherwise questionable, the Acmd is returned to the channel's pending-command list; otherwise it is freed to the pool for its Adrive. A garbled, erroneous, or unsolicited (no Acmd) message is logged and discarded.

7.3.6. Internal interfaces

Here is a summary of major internal procedures, grouped by source file. Unless otherwise noted, each is global within the aoed module, but is not meant to be called from elsewhere in the system; hence aoed.h supplies both a prototype and a name-hiding macro, as explained above.

7.3.6.1. aoed.c
static int openwait(ep)
Adrive *ep;

Return zero when ep has reached state ADREADY, blocking if necessary; or return a nonzero errno value if initialization fails.

If ep is in state ADCLOSED (initialization hasn't started yet), call driveinit to get things started. At each subsequent step prior to ADREADY, call cv_wait_sig(9F) to block on the condition variable in ep. If cv_wait_sig is interrupted by a UNIX signal, return EINTR.

Openwait is local to aoed.c.

Search device ep for a valid disk label:

  1. If the drive state is ADGATA or the disk size is not known, return -1.
  2. If fetchvtoc finds a proper VTOC label, return 1, leaving the label-type flags set to AVLABVTOC.
  3. If fetchefi finds a proper EFI label, return 1, leaving the label-type flags set to AVLABEFI.
  4. If fetchdos finds a proper DOS label:
    • If a DOS partition is marked as possibly containing an encapsulated VTOC, and this is a Solaris/x86 system, call fetchvtoc. If no proper VTOC label is found, call fakevtoc to invent one. In either case return 1, leaving the label-type flags set to AVLABDOS|AVLABVTOC.
    • If no DOS partition has the encapsulated-VTOC type, or in any case on Solaris/SPARC, return 1, leaving the label flags set to AVLABDOS.
  5. If fakevtoc can make a fake VTOC label, return 0, leaving the label-type flags set to AVLABVTOC.
  6. Otherwise return 1, with no label-type flags set.

Readptab is local to aoed.c.

Read (rw==B_READ) or write (B_WRITE) len bytes to or from kernel or user address buf from or to absolute sector bno on drive ep. Return zero for success, nonzero errno value on failure.

If the transfer finished without error but without reading or writing the whole buffer, adrwkern returns nonstandard error value EXDEV; adrwuser returns zero, but if resp is nonzero, stores the number of bytes not transferred in *resp.

Adrwkern makes a synthetic buffer header and calls adstrategy; adrwuser makes a struct uio and calls physio(9F). Both operate on the whole-disk partition.

7.3.6.2. addadk.c
int dadkioctl(ep, cmd, arg, mode)
Adrive *ep;
int cmd;
intptr_t arg;
int mode;

Implement these undocumented ATA-specific ioctl calls. Some are used by format(1M).

DIOCTL_RWCMD
Read from or write to an absolute address on an ATA disk, regardless of the partition table. Uses adrwuser to do the dirty work.
DIOCTL_GETMODEL
DIOCTL_GETSERIAL
Return the ATA device-model and serial-number strings, learned from an ATA IDENTIFY command and stored in the Adrive during device initialization.
Dadkioctl returns zero if all is well, an errno value if an error occurred.
7.3.6.3. aduscsi.c
int aduscsi(ep, cmd, arg, mode)
Adrive *ep;
int cmd;
intptr_t arg;
int mode;

Implement the SCSI-specific USCSICMD ioctl, faking the SCSI INQUIRY, TEST UNIT READY, READ CAPACITY, and READ BLOCK LIMITS commands. In all cases this comprises encoding and returning various data stored in the Adrive.

Those familiar with SCSI may note that the SCSI standard doesn't define READ BLOCK LIMITS for disk devices. Format calls it anyway.

The ATA IDENTIFY command returns a device-model string; SCSI INQUIRY returns separate vendor and model strings. The two SCSI strings together afford less than half the maximum length of the single ATA string, and in fact some ATA devices return strings too long to fit in the SCSI message. The faking code does the best it can.

7.3.6.4. addos.c
int fetchdos(ep)
Adrive *ep;

Handle DOS-specific disk ioctl cmd, with the given arg and call mode:

DKIOCGMBOOT
DKIOCSMBOOT
Call adrwkern to read or write the boot sector (where the DOS label lives) on the disk, to or from the buffer pointed to by arg.
Dosioc returns zero if all was well, an errno value if an error occurred.
7.3.6.5. adefi.c
int fetchefi(ep)
Adrive *ep;

Search drive ep for a valid EFI label, trying the primary (sector 1) and backup (last sector) GPT locations as necessary. If a label is found, store the starting sector address and length of the first eight partitions in the drive's partition table, set label-type flag to AVLABEFI, and return 1. If there is no label, return 0; if an I/O error occurred, return -1.

Handle EFI-specific disk ioctl cmd, with the given arg and call mode:

DKIOCGETEFI
DKIOCSETEFI
Read or write absolute sectors of the drive, as specified by the user-space dk_efi_t structure addressed by arg, via calls to adrwuser. Take care to account for the different format the structure had in Solaris 9. For DKIOCSETEFI only, call zaplabels with flag argument AVLABEFI, then set that flag in the label-type flags of ep. The Solaris libefi(3LIB) library uses these calls; the real label processing is done by the library, not the kernel.
DKIOCPARTITION
Read the EFI partition descriptor for the partition specified in the partition64 structure addressed by arg directly from the disk (not from the driver's in-core partition table). Copy the partition's starting sector, length, and UUID to the partition64 and copy it back to the user.
Efiioc returns zero if all was well, an errno value if an error occurred.
7.3.6.6. advtoc.c
int fetchvtoc(ep)
Adrive *ep;
7.3.6.7. adio.c
int driveinit(ep, ndelay)
Adrive *ep;
int ndelay;

Start initialization for drive ep:

  1. If ndelay is nonzero (a non-blocking open is in progress), reset the MAC address for this drive to the global broadcast address and return success. The real work will be done when driveinit is called by a subsequent openwait call.
  2. Confirm that the target is enabled, and that the target channel is active.
  3. Register adreceive as the receiver for the target channel.
  4. Start chantimer for this channel if necessary.
  5. Call acalloc to allocate a single Acmd, and immediately use it to send an AoE Query-Config command.

Initially ep should be in state ADCLOSED. If all is well, return 1 with ep in state ADWAOE; if an error occurs, return -1 with the state unchanged.

Acp really points to an Achan, but the argument type must be void * to satisfy the rules of timeout(9F).

Examine each Acmd awaiting a response from the aoechan associated with acp. If an Acmd has expired but another retry is allowed, retransmit it. If no more retries are allowed, cancel the command:

  • If there is an associated buf structure, pass it to biodone(9F) with error ETIME, and call adkillbuf to free any other Acmds associated with the same buf.
  • If there is an associated condition variable, call cv_broadcast(9F) to awaken processes waiting on it.
In any case call acfree to return the Acmd to the pool for the corresponding Adrive.
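
The choice chantimer makes for each outstanding command can be sketched as a pure function. This is a user-space illustration under stated assumptions, not the driver's code: the enum, the structure, and the function name are hypothetical, and only the meaning of the expiry and retry fields is taken from this document. Treating a zero exptime as "never expires" follows the field description given for adstcmd.

```c
#include <stdint.h>

/* Hypothetical names; the real Acmd carries many more fields. */
enum timer_action { TA_WAIT, TA_RETRANSMIT, TA_CANCEL };

struct cmd_sketch {
    int64_t exptime;    /* expiry interval in ticks; 0 means never expire */
    int64_t expires;    /* absolute tick at which the command expires */
    int     ntries;     /* times the command has already been sent */
    int     maxtries;   /* most times the command may be sent */
};

/* Decide what one timer pass should do with one outstanding command. */
enum timer_action
timer_decide(const struct cmd_sketch *c, int64_t now)
{
    if (c->exptime == 0 || now < c->expires)
        return TA_WAIT;         /* not expired (or never expires) */
    if (c->ntries < c->maxtries)
        return TA_RETRANSMIT;   /* another try is still allowed */
    return TA_CANCEL;           /* out of tries: fail the request */
}
```

A caller would retransmit on TA_RETRANSMIT and take the cancel path (biodone or cv_broadcast, then acfree) on TA_CANCEL.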

A copy of chantimer normally runs every TMO_CLOCK ticks (20 ms) for each AoE channel that has been used at least once by the aoed module. If a command was retransmitted, chantimer quits, and the next run begins in TMO_CLOCKHOLD ticks (10 ms), to avoid flooding a congested network with retries.

Chantimer is local to adio.c, though it is called from outside the module via timeout.

AoE message mp has arrived via channel chan:

  • If mp is NULL, the channel is shutting down. Locate the corresponding Achan, clear the channel-is-active flag, and return.
  • If the message is obviously malformed (too short, invalid version, not an AoE response), call aoecomm_log to log it with code ACLILLFORM; return.
  • Call adlookup to locate the Acmd for which this is a response, and to remove it from the active-command list. If none is found, log it with code ACLUNSOL and return.
  • If the message is a well-formed Query-Config response and the drive state is ADWAOE, copy the MAC address and maxbuf value to the Adrive; adjust the maximum data-segment size if required; call acfree to free the Acmd; and move to the next initialization step.
  • If the message is a well-formed ATA-command response:
    • In drive state ADWATA, this is the response to an ATA IDENTIFY command: store data; free the Acmd; and move to the next initialization step.
    • In drive state ADGATA, ADWPTAB, or ADREADY, this is the response to an ATA READ or WRITE request. Locate the corresponding buf structure; free the Acmd. Store read data if any; if the transfer is now complete or an error occurred, free the Abuf and call biodone(9F). If the device-is-closing flag is set and both the active-transfer list and the pending-transfer queue are empty, signal the condition variable to awaken adclose. In any case call adstart to start more I/O if any remains pending, whether for this transfer or another.
  • If the message is ill-formed or the drive is in an unexpected state, log the message with code ACLILLFORM or ACLUNSOL, and return the Acmd to the active-command list.

Adreceive is local to adio.c, but is usually registered with aoecomm as the receiver for one or more channels.

Depending on the transfer length and the maximum data-segment size for this drive, several commands may be required for a single buf structure. Adstart starts as much as it can. If it was forced to stop (ran out of Acmds) partway through a buf, the next call to adstart continues with that buffer before starting another.
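
The splitting and resumption described above might look roughly like this. It is a sketch only: struct seg and split_transfer are invented names, the nsegs limit stands in for the supply of free Acmds, and the real adstart works on buf structures rather than bare offsets.

```c
#include <stddef.h>

/* One piece of a larger transfer; illustrative only. */
struct seg {
    size_t off;     /* byte offset within the buffer */
    size_t len;     /* bytes carried by this command */
};

/*
 * Split the part of a len-byte transfer beginning at start into pieces
 * of at most maxseg bytes.  Fill at most nsegs entries.  Return the
 * number filled and store in *next the offset at which a later call
 * should resume.
 */
size_t
split_transfer(size_t start, size_t len, size_t maxseg,
    struct seg *segs, size_t nsegs, size_t *next)
{
    size_t n = 0, off = start;

    while (maxseg > 0 && off < len && n < nsegs) {
        size_t piece = len - off;

        if (piece > maxseg)
            piece = maxseg;
        segs[n].off = off;
        segs[n].len = piece;
        off += piece;
        n++;
    }
    *next = off;
    return n;
}
```

When the second call is given the offset saved by the first, it continues with the same buffer before any other is started, which is the behaviour the paragraph above describes.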

Compose the AoE command described in dp in a new STREAMS buffer, and call aoecomm_send to send it. On success, call adstore to add dp to the channel's active-command list and return 1. Return -1 on error: either dp has unreasonable contents or aoecomm_send failed.

Before calling aoecomm_send, adstcmd locates the Achan for the target channel and checks the channel-is-active flag. If the flag is clear, aoecomm_initdriver is called to re-register adreceive as the receiver for this channel; if aoecomm_initdriver succeeds, the flag is then set. This affords recovery when a channel is shut down and then restarted: adreceive clears the flag when aoecomm reports the shutdown; adstcmd re-registers the receiver when next possible, and sets the flag when it happens. Pending messages presumably time out and are retried. If the channel comes back within the timeout, no data are lost; if not, an I/O error is reported as for any other timeout.

These fields must have been filled in in dp:

dp->ep
Adrive to which the command should be sent. Both the target address (aoechan, aoemaj, aoemin) and Ethernet MAC address are significant; the latter may be the broadcast address.
dp->tag
Tag value for this command.
dp->aoehd
dp->aoelen
AoE command to be sent, excluding data to be written. aoehd is preallocated, with room for AOEHEADLEN (32) bytes, of which aoelen are used by this command.
dp->exptime
Minimum time, in clock ticks, allowed before declaring that this command has expired. Zero means never.
dp->maxtries
dp->ntries
This command may be sent at most maxtries times, i.e. at most maxtries-1 retries; it has already been sent ntries times.
dp->bp
dp->len
dp->off
If bp is not NULL and B_READ is not set in its flags, append len bytes of data from that buffer, starting at offset off within the buffer.

Dp->expires is set to the current time in clock ticks (read from ddi_get_lbolt(9F)) plus the expiry interval, with the latter lengthened a little for retries:

dp->expires = ddi_get_lbolt() + (dp->exptime * (1 + dp->ntries))
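
As a worked illustration of the formula, here is the same computation as a stand-alone function. The name cmd_expires is hypothetical, and the now argument stands in for the value returned by ddi_get_lbolt(9F) so that the sketch can run outside the kernel.

```c
#include <stdint.h>

/*
 * Absolute expiry tick for a command.  The allowed interval grows
 * linearly with the number of times the command has already been sent.
 */
int64_t
cmd_expires(int64_t now, int64_t exptime, int ntries)
{
    return now + exptime * (1 + ntries);
}
```

With exptime 50, a first transmission at tick 1000 expires at tick 1050; a transmission at the same tick after two earlier sends expires at tick 1150.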
7.3.6.8. adcmd.c

Extract the tag value and target-address numbers from the AoE message pointed to by p. Search the active-command list for channel chan for an Acmd with a matching tag, associated with an Adrive with a matching target address. If a match is found, remove the Acmd from the active-command list, set its state to CSLOOSE, decrement the active-command count in the Adrive, and return the address of the Acmd. If there is no match, return NULL.

If dp is on the active-command list for channel chan or cp, remove it; set its state to CSLOOSE; decrement the active-command count of the corresponding Adrive.

Cp must be locked before calling adcremove.

Allocate an Acmd from the pool associated with ep, and initialize its contents as follows:

  • Set the Adrive pointer to ep.
  • Store ep->nexttag & ~TAGOFFSET in the tag field of the Acmd, taking care to skip the value zero (used by AoE targets for unsolicited broadcast replies). Increment ep->nexttag for next time.
  • Set the AoE command length to hdlen. Fill in the common header: standard protocol-version and flag values, command type aoecmd, aoemaj and aoemin from ep, tag from that just invented.
  • Set the state to CSLOOSE.
  • Zero all other fields.
Return the address of the new Acmd, or NULL if none was available.

Messages with tag values greater than or equal to TAGOFFSET (0x80000000) are reserved for non-kernel diagnostic programs. There are none such yet in the Solaris implementation. Originally the sense was inverted, with TAGOFFSET set in messages generated by the Solaris kernel driver, lesser values reserved for non-kernel use; this was changed in version 1.3.3 for consistency with the Linux AoE implementation.
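
The tag rule can be sketched as follows. TAGOFFSET has the value quoted above; next_tag is a hypothetical name, and the loop is one plausible way of skipping zero, not necessarily the way the driver does it.

```c
#include <stdint.h>

#define TAGOFFSET 0x80000000u   /* value quoted in the text */

/* Return the next usable kernel tag and advance the per-drive counter. */
uint32_t
next_tag(uint32_t *nexttag)
{
    uint32_t tag;

    do {
        /* Clear the high bit, reserved for non-kernel programs. */
        tag = (*nexttag)++ & ~TAGOFFSET;
    } while (tag == 0);         /* zero marks unsolicited replies */
    return tag;
}
```

When the counter reaches 0x80000000 the masked value is zero, so the sketch skips it and hands out 1 again.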

Allocate new Acmds from system memory, initialize them with state CSFREE, and add them to the pool associated with ep until the pool contains at least nfree entries or the system runs out of memory.

If the pool is not empty, but has fewer than nfree entries, it is topped up to nfree. If it is already larger than nfree the system panics.

Empty the Acmd pool associated with ep, returning memory to the system.

7.3.6.9. adsubr.c
Adrive *devtodrive(dev, s)
dev_t dev;
char *s;

Return a pointer to the Adrive corresponding to Solaris device dev. If none exists, log an error message (including s if non-NULL) and return NULL.

Aoeerrstr returns a string encoding aoecode as a decimal number, with a string explaining its meaning as an AoE error code if one is known.

Ataerrstr returns a string encoding atacode as an eight-bit hexadecimal number, with a string containing the standard abbreviation describing each bit that is set.

Write the message given by printf-like format string fmt and any following arguments to syslog and the console, prepending a string of the form

aoedinst spart chan/aoemaj/aoemin
describing the device identified by dev and ep.

If ep is NULL, the Adrive is determined from dev if possible. If no Adrive can be found, the target address is omitted. If dev is NODEV, the partition name is omitted.

A newline is appended to the message; fmt should contain none.

The Nblocks and Size properties, at least one of which is required by ZFS, have 64-bit integer values.

7.3.6.10. adtrace.c
void trace(id, p0, i0, i1)
int id;
void *p0;
int i0, i1;

Store the arguments and a sequence number in the next slot in a circular buffer of 1024 trace records. There is sufficient locking to prevent concurrent calls from using the same slot.

Trace is meant for collecting real-time event traces during debugging.

No tools are supplied to read the trace buffer; use adb or mdb(1).

A kmutex_t lock is initialized when trace is first called, but never destroyed. Hence if trace is used and the aoed module is reloaded, the system loses a little memory until the next boot. Since trace is used mostly to gather information just before a reproducible crash or hang, this is unlikely to cause trouble in practice.
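
A minimal sketch of such a circular trace buffer follows. The record layout and all names are assumptions; the sketch is single-threaded, so the point at which the driver would hold its mutex is marked only by a comment.

```c
#include <stddef.h>

#define NTRACE 1024     /* number of records, as stated in the text */

struct trace_rec {
    unsigned long seq;  /* orders records once the buffer has wrapped */
    int id;
    void *p0;
    int i0, i1;
};

struct trace_rec tracebuf[NTRACE];
unsigned long traceseq;

/* Record one event in the next slot, overwriting the oldest record. */
void
trace_sketch(int id, void *p0, int i0, int i1)
{
    /* The driver would take its lock here so each call gets its own slot. */
    unsigned long seq = traceseq++;
    struct trace_rec *r = &tracebuf[seq % NTRACE];

    r->seq = seq;
    r->id = id;
    r->p0 = p0;
    r->i0 = i0;
    r->i1 = i1;
}
```

Keeping the sequence number in each record lets a debugger find the newest entry after wraparound without any separate index.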

8. User-mode component details

This is a summary of function and implementation; see the manual pages for details of usage.

Each program has a single source file, but C programs use the library described in its own section below. Most C programs also use some of the network data structure include files from the include directory.

8.1. Aoestart

Aoestart starts an AoE channel according to its arguments. A network device and channel number must be given. Optional parameters include Ethernet protocol type and maximum data-segment size, with defaults taken from the AoE protocol spec and derived from the device's MTU setting.

Aoestart calls comminit to open the Ethernet device, configure it, and perform the channel startup handshake with the aoecomm and aoectl kernel modules. If all is well, it calls achattach to keep the channel open. By default it then writes an ACSEND command to /dev/aoectl to broadcast a Query-Config command, in the hope that every AoE target will respond, and that aoemon will hear the responses and enable all the devices.

The source code is file aoestart.c.

8.2. Aoestop

Aoestop calls achdetach to close the AoE channels named by its arguments.

The source code is file aoestop.c.

8.3. Aoectl

Aoectl opens /dev/aoectl and writes commands according to its arguments:

probe
Compose an AoE Query-Config command for each target named; use ACSEND to send to the broadcast address on the channel named.
list
Send one or more ACDEVENAB commands with subcommand ADQUERY. If a target has wildcards, first let the system discover whether any matching device is enabled; if so, iterate over possible values to find out what they are. On older hardware this takes a few seconds.
enable
disable
Send one or more ACDEVENAB commands with subcommand ADENAB or ADDISAB. Wildcards are handled by the system.

The source code is file aoectl.c.

8.4. Aoemon

Aoemon opens /dev/aoemon and loops forever reading it. An ACLOG message of type ACLUNSOL containing an AoE Query-Config response causes the responding target device to be enabled, and the action reported with logmsg. Any other message is just logged with logpkt.

The source code is file aoemon.c.

8.5. Library routines

Common code used by more than one user-mode component or isolating some implementation dependency is compiled into object library libcmd.a, used with all the programs listed above. Header file aoecmd.h (in the user directory, since no kernel component needs it) declares prototypes for all library routines, as well as a few parameter values.

Here is a list of library routines, organized by source file.

8.5.1. comm.c

Open Ethernet device name. Configure it to receive AoE messages with Ethernet protocol type proto and maximum data-segment size maxdata (if maxdata <= 0, a size computed from the device's current MTU). Make the resulting file into AoE channel chan. Copy the Ethernet MAC address of the device to retaddr, and return the resulting file descriptor; or return -1 if an error occurred.

If maxdata is zero, fetch the device MTU, compute the corresponding AoE maximum-data size, and so inform aoecomm.

Hidden inside comminit are many Solaris DLPI calls and the AoE startup dance. The channel is initialized, but not attached to its file system mount point; see achattach.

8.5.2. achan.c

Return > 0 for success, < 0 for failure or if chan has an unreasonable value.

Undo what achattach did, closing the file descriptor and shutting down AoE channel chan. If chan is negative, do this for every active channel.

Return > 0 if this was done; zero if there is no file descriptor associated with chan (or, if chan is negative, none associated with any channel); < 0 if an error occurred.

Achattach calls fattach(3) to attach fd to file /etc/aoe/chnn (nn the channel number, expressed as a two-digit decimal number), after creating the file if necessary. Achdetach calls fdetach(3) on /etc/aoe/chnn.
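
For illustration, the attach-point name could be built like this. The function name chan_path is hypothetical, and rejecting channel numbers outside 0 to 99 is an assumption drawn from the two-digit naming rule rather than a documented limit.

```c
#include <stdio.h>

/*
 * Build the attach-point name for channel chan in buf.
 * Return 0 on success, -1 if chan or the buffer is unsuitable.
 */
int
chan_path(char *buf, size_t len, int chan)
{
    int n;

    if (chan < 0 || chan > 99)
        return -1;              /* only two decimal digits are available */
    n = snprintf(buf, len, "/etc/aoe/ch%02d", chan);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Channel 7 therefore maps to /etc/aoe/ch07.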

8.5.3. trdwr.c

int tread(fd, buf, len, ms)
int fd;
void *buf;
int len, ms;

int twrite(fd, buf, len, ms)
int fd;
void *buf;
int len, ms;

Call read or write, but return -1 with errno set to EINTR if the operation hasn't completed in ms milliseconds.
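
One way to get this behaviour is sketched below with poll(2). This is a substitute technique chosen for brevity: the text does not say whether the library uses poll, an alarm signal, or something else, and tread_sketch is a hypothetical name.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Read with a deadline: wait for readability, then read once. */
int
tread_sketch(int fd, void *buf, int len, int ms)
{
    struct pollfd pfd;
    int r;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    r = poll(&pfd, 1, ms);
    if (r < 0)
        return -1;          /* poll itself failed; errno is already set */
    if (r == 0) {
        errno = EINTR;      /* timed out: report it as the text describes */
        return -1;
    }
    return (int)read(fd, buf, (size_t)len);
}
```

A write-side counterpart would differ only in polling for POLLOUT and calling write.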

8.5.4. log.c

Write to an error log, using syslog(3) with facility LOG_DAEMON and a severity level specified as an abstract argument.

Why not just call syslog directly? Partly to make it easier to fine-tune the implementation (e.g. the best mapping from internal severity levels to those of syslog depends on syslog.conf conventions); partly to avoid stepping on the buffer-overflow and format-string-trust problems in some versions of syslog.

Write a message to the log. Format is a format string of the sort accepted by printf(3); it may be followed by arguments. Type has one of the following values, listed here from most important to least:

LE
Error message: something isn't working right. Uses syslog severity LOG_ERR.
LN
Notice: something unusual has happened, but perhaps all is still well. Syslog severity LOG_WARNING.
LI
Status information; nothing is wrong. Syslog severity LOG_NOTICE.
LD
Debugging chatter. Syslog severity LOG_DEBUG.
Flag LS may be or-ed into type to request that the message be copied to standard error regardless of the tostderr argument in loginit.

The mapping between type codes and syslog severity values is not quite the obvious one because the default /etc/syslog.conf file supplied with Solaris throws away daemon.info messages.
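
The mapping can be written down as a small function. The numeric values given to LE, LN, LI, LD, and LS are invented for the sketch (only the names come from this document), as is the name log_priority; LOG_ERR is the constant that syslog.h defines for error severity.

```c
#include <syslog.h>

/* Hypothetical values; only the names appear in the text. */
enum { LE = 0, LN = 1, LI = 2, LD = 3, LS = 0x100 };

/* Map an abstract message type to a syslog priority, ignoring LS. */
int
log_priority(int type)
{
    switch (type & ~LS) {
    case LE:
        return LOG_DAEMON | LOG_ERR;
    case LN:
        return LOG_DAEMON | LOG_WARNING;
    case LI:
        return LOG_DAEMON | LOG_NOTICE;
    default:
        return LOG_DAEMON | LOG_DEBUG;
    }
}
```

Keeping the table in one place is what makes it cheap to retune if local syslog.conf conventions differ.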

Return a printable string representing the six bytes at address addr as an Ethernet MAC address. If buf is not NULL, put the string there; at least LEN_EADDR*3 bytes should be available. If buf is NULL, use a static buffer.
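
A sketch of such a formatter follows. The name eaddr_str, the value 6 for LEN_EADDR, and the colon-separated lower-case hexadecimal output are all assumptions; the text specifies only the buffer size. Seventeen characters plus the terminating NUL fit exactly in LEN_EADDR*3 bytes.

```c
#include <stdio.h>

#define LEN_EADDR 6     /* assumed: bytes in an Ethernet address */

/* Format addr as xx:xx:xx:xx:xx:xx in buf, or in a static buffer. */
char *
eaddr_str(const unsigned char *addr, char *buf)
{
    static char sbuf[LEN_EADDR * 3];

    if (buf == NULL)
        buf = sbuf;     /* caller supplied no buffer; not reentrant */
    snprintf(buf, LEN_EADDR * 3, "%02x:%02x:%02x:%02x:%02x:%02x",
        addr[0], addr[1], addr[2], addr[3], addr[4], addr[5]);
    return buf;
}
```

The static buffer is convenient for a single use inside a log call, but two uses in one argument list would overwrite each other.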

8.5.5. debug.c

int verbose;
int debug;

If verbose has a nonzero value, some programs and library routines chatter a little (via logmsg) as they work. If debug is nonzero, chatter is more copious and more detailed. Normally these are set by command-line options.

8.6. Aoelabinit

Aoelabinit writes an initial disk label to one or more disks. By default the label is VTOC format if the disk size allows that, EFI otherwise, and a label is written only if none (of either format) already exists; different choices may be specified. Writing a label of one type invalidates any existing label of the other.

This program exists only as a workaround for a bug in format(1M), which is unable to cope with an unlabelled ATA disk large enough to require EFI labelling. Probably Sun will fix this eventually, but aoelabinit will remain both to avoid stranding those with older systems and because it seems like a useful tool in its own right.

Aoelabinit uses the Solaris libefi(3LIB) library to read and write EFI labels. Since this is a shared library present only since Solaris 9 4/03, aoelabinit can be run only on newer systems. Static linking isn't practical because the library uses a kernel interface that differs in Solaris 9 and Solaris 10.

8.7. Aoeunlabel

Aoeunlabel zeroes the sectors conventionally used for EFI and VTOC labels and their backups. Something of the sort must be done before a disk with an EFI label may be repartitioned on a pre-EFI system, e.g. when a disk used as part of a ZFS pool is recycled to ordinary use on a Solaris 8 system.

Aoeunlabel uses none of the special Solaris label-access libraries; it just overwrites sectors directly. Hence it works even on a pre-EFI system, unlike aoelabinit.

8.8. Aoemkconf

An argument of the form 0/11/9 names a blade for which an entry is desired; the aoemin part may be an inclusive range, like 0/11/9-11 as a shorthand for 0/11/9 0/11/10 0/11/11. Entries contain only the required properties: name="aoed", parent="pseudo", aoechan and aoemaj and aoemin as specified, instance computed in the standard way.

Any other argument names an existing kernel-configuration file; entries that would duplicate an instance number already declared in the file are omitted.

For example:

aoemkconf /usr/kernel/drv/aoed.conf 0/11/0-14 0/13/0-14
prints entries for two 15-slot EtherDrive shelves at aoemaj addresses 11 and 13, but suppresses any the system has already been told about;
aoemkconf /usr/kernel/drv/aoed.conf `aoectl list`
prints an entry for every device that has made itself known on the network but is not yet configured; and, for those who enjoy living on the edge,
aoemkconf /usr/kernel/drv/aoed.conf `aoectl list` \
     >>/usr/kernel/drv/aoed.conf
updates the configuration file in place.

Aoemkconf is an awk program. There are no explicit hooks for customization, but it ought to be easy to adapt it to local ideas of instance numbers or to supply additional device properties.
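
For anyone adapting the argument syntax elsewhere, here is a sketch in C of parsing a target argument with an optional aoemin range. It is not taken from the awk program; parse_target is a hypothetical name, and rejecting trailing characters and descending ranges is an assumption about what counts as malformed.

```c
#include <stdio.h>

/*
 * Parse "chan/maj/min" or "chan/maj/lo-hi".
 * Return 0 and fill the outputs on success, -1 if s is malformed.
 */
int
parse_target(const char *s, int *chan, int *maj, int *lo, int *hi)
{
    int used = 0;

    if (sscanf(s, "%d/%d/%d%n", chan, maj, lo, &used) != 3)
        return -1;
    s += used;
    if (*s == '\0') {
        *hi = *lo;              /* single device: a range of one */
        return 0;
    }
    if (sscanf(s, "-%d%n", hi, &used) != 1 || s[used] != '\0')
        return -1;              /* junk after the range */
    return (*lo <= *hi) ? 0 : -1;
}
```

The caller would then emit one configuration entry for each aoemin value from lo to hi inclusive.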

8.9. /etc/init.d/aoe, /lib/svc/method/device-aoe

Startup/shutdown shell script, with the same contents under either name: /etc/init.d/aoe for use in the init.d(4) mechanism on a non-SMF system, /lib/svc/method/device-aoe for use with SMF.

The script acts according to its first argument:

aoe start
Start aoemon if necessary. Massage the contents of /etc/aoe.conf to produce a collection of aoestart commands, one for each channel to be started, and execute them. In spirit this is just
sed </etc/aoe.conf 's;^;/sbin/aoestart ;' | sh
but practical details (comments, blank lines) add a little complexity.
aoe stop [ contract ]
Call aoestop to shut down all channels; kill aoemon. If SMF is active, contract is the SMF contract number under which aoemon was run.
aoe restart [ contract ]
aoe refresh [ contract ]
Equivalent to aoe stop; aoe start except that aoemon is not restarted.

On a non-SMF system, /etc/init.d/aoe is linked to /etc/rc2.d/S00aoe so that aoe start will be called very early in the system startup process. There is no Kxxaoe link: there's no need to call aoe stop during a normal shutdown.

On an SMF system, AoE is installed as service svc:/device/aoe, and should be started and stopped (enabled and disabled) with svcadm(1M). The service manifest is user/aoe.xml in the source-code tree, /opt/CORDaoe/lib/aoe.xml and /var/svc/manifest/device/aoe.xml in the installed package. The aoe service depends on svc:/filesystems/root and declares svc:/filesystems/usr as an optional dependent. Thus any file system but the root or /usr, and any swap area, may be placed on an AoE disk.

The choice of dependencies derives from Solaris implementation details:

  • svc:/filesystems/usr enables swap areas, so svc:/device/aoe must be enabled first to allow swapping to AoE disks.
  • The dependency is optional so that a failure to start AoE won't interfere with other startup, except to the extent caused by failure to mount AoE-disk file systems.
  • It sounds impossible for AoE to start before svc:/filesystems/usr, because AoE drivers and tools are stored in /usr/kernel and /usr/sbin. In fact, if /usr is a separate file system (not so common any more) filesystems/root mounts /usr read-only, presumably because so much of Solaris itself is stored in /usr.

If file /etc/default/aoe.options exists, its contents are interpolated into the aoe script (with the shell's . operator) before anything else is done. Many parameters such as the location of aoestart, aoestop, and aoemon, the directory where channel files are attached to keep them open, and the name of the configuration file are set within aoe by shell variables; settings in aoe.options override the default values.

SMF or init.d startup is selected when the CORDaoe package is installed:

9. Design details, compromises, bugs, and other concessions to reality

Here is a collection of notes about design decisions, problems encountered with Solaris or elsewhere and how they have been papered over (or not), and so on.

Some of the problems described here will, we hope, be mended in future versions of the subsystem, or eased by future versions of Solaris, though the latter is no panacea since older Solaris systems must not be abandoned lightly.

9.1. Gcc versus Sun C

Earlier versions of the driver were built with gcc, which proved unsatisfactory:

  • Gcc-specific runtime-library routines were required to support certain 64-bit operations on 32-bit machines. Only certain 64-bit operations were affected, apparently only in certain contexts. The Solaris kernel environment doesn't offer these routines; fetching and compiling the gcc runtime-library source code just for those compiler-support routines seemed prohibitively cumbersome. As a workaround, the aoed driver was modified to keep certain values as 32-bit numbers even though 64 bits would have been more appropriate; in particular, the size of an AoE disk was limited to 32 bits' worth of sectors, i.e. 2 tebibytes.
  • Variadic procedures are implemented differently in gcc and Sun C; to supply a new variadic function requires more gcc-specific runtime support. This forbade the present admsg routine, making error logging cumbersome.

These problems were observed with gcc version 3; perhaps newer versions are better. Sun's C compiler works fine, no longer costs an arm and a leg, and does stricter type-checking (which has helped prevent a few bugs); we plan to stick with it. The compromises required for gcc have been removed from the code and will not be reinstated.

9.2. Solaris version differences

The same CORDaoe binary package may be installed on Solaris 7, 8, 9, or 10. Surprisingly little magic is required to make this work: despite marked changes in the underlying operating system, the interfaces visible to device drivers have been quite stable since Solaris 7.

Care is needed in a few places:

  • Newer versions of Solaris require new disk-specific ioctl calls, especially for EFI-format label support. It does no harm for the corresponding AoE code to be present in older versions, but it does make it difficult to compile the AoE subsystem on an older system.
  • The argument structure to the DKIOCGETEFI and DKIOCSETEFI ioctls, introduced in Solaris 9, was changed for Solaris 10, apparently for the convenience of the library code in which Sun use that call. This call is used by prtvtoc and format(1M), so it is important that it work properly. To paper this over, the aoed module makes a somewhat-hacky explicit runtime Solaris-version test.
  • In Solaris 9, ddi_create_minor_node(9F) gained some flags for Sun Cluster support; those flag values disappeared from the header files in Solaris 10. The driver avoids them. This may make it incompatible with Sun Clusters; we haven't had a chance to test.
  • The package must adapt when installed to use SMF if the target supports it, init.d(4) otherwise.
  • A device to be added to a ZFS pool must have the 64-bit integer property Nblocks defined. A call to define a 64-bit integer property wasn't added to the Sun driver interface until Solaris 9. A custom interface routine is required to paper over this, so that the driver may support ZFS but will still load on Solaris 7 and 8.

The AoE package currently compiles without error only on Solaris 10, though the resulting binaries may be run on any supported system.

9.3. Disk labels

The original Solaris disk-label format has limits that have recently become troublesome: in particular, it is unable to handle multi-terabyte disks. A mid-life update to Solaris 9 introduced a new label format, which removes the old limitations at the cost of considerable extra complexity.

To top it off, Solaris/SPARC and Solaris/x86 use different forms of old-style disk label, and the x86 implementation allows an old-style label to be encapsulated within a specially-designated DOS partition.

Here is a summary of what Sun did and how it affects the AoE driver, derived from Sun documentation, experiment, and some analysis of OpenSolaris source code.

9.3.1. VTOC (old-style) labels

The original Solaris volume table of contents (VTOC) disk label comprises a single sector (512 bytes) of label information. The VTOC label records disk geometry information (in particular cylinder, head, and sector counts), a string label or two, and an array of eight or sixteen partition descriptors. Values are stored as 16- and 32-bit native integers. The label is protected by a magic number and a simple-minded checksum.

The eight- and sixteen-partition label formats differ quite a bit; they are not meant to interoperate. SPARC systems use the eight-partition variant, x86 systems the sixteen-partition one.

The primary VTOC label is stored in the first sector of the disk. Backups are kept at sector offsets 1, 3, 5, 7, and 9 in the last track.

Many limitations of the VTOC format are more obvious now than when the scheme was adopted in the early 1990s:

  • The format is inherently architecture-dependent, both because it contains native integers and because of the 8-partition/16-partition divide. Hence it is unreasonably messy to share disks between SPARC and x86 systems. That has become more likely with the spread of shared-host storage interconnects like AoE and Fibre Channel.
  • The labelling scheme depends to some extent on the outdated assumption that a disk is composed of an array of equal-sized tracks and cylinders, and the equally obsolete belief that the disk will report such size numbers.
  • Partition sizes are stored as signed 32-bit numbers; in effect a partition may be at most 2^31 sectors, or 1TiB. Storage arrays affording multi-tebibyte logical disks are now common, and often the point is to be able to have a single large file system.
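
The arithmetic behind the partition-size limit is easy to check. The sketch below assumes 512-byte sectors, as elsewhere in this document; max_bytes is a hypothetical helper.

```c
#include <stdint.h>

#define SECTOR_BYTES 512u   /* sector size assumed throughout */

/*
 * Bytes addressable when the sector count is limited to sector_bits
 * bits.  Valid only for sector_bits <= 54, beyond which the result
 * no longer fits in 64 bits.
 */
uint64_t
max_bytes(unsigned sector_bits)
{
    return ((uint64_t)1 << sector_bits) * SECTOR_BYTES;
}
```

Thirty-one bits of sector count give 2^40 bytes, which is 1 TiB; a full 32 bits would give 2 TiB.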

9.3.2. DOS (fdisk) labels

Solaris/x86 allows (in some cases requires) a disk to have a DOS partition label.

A DOS label is stored in the first sector of the disk; it contains four partition descriptors, each including the starting and ending sector of the partition and a partition type (`system ID'). The label is protected by a magic number. Integer values in the label are always in Intel (little-endian) order. There is no backup label.

The four partitions described in the label are called primary partitions. There is a scheme to allow a primary partition to encapsulate a logical disk containing another DOS label, affording additional logical partitions.

Solaris supports only primary partitions. It has its own encapsulation scheme, however: partition type 130 or 191 (decimal) indicates a logical disk with a Sun VTOC label. Solaris makes the (relative) partitions described in the VTOC available through the conventional subdevices s0-s15; DOS primary partitions are accessed through new devices p1-p4, whether or not there is an encapsulated VTOC.

9.3.3. EFI (new-style) labels

Sun addressed the problems with the old VTOC label scheme by adding support for EFI labels, borrowed from the Intel Extensible Firmware Interface standard. EFI support first appeared in Solaris 9 4/03. Patches are available to add support to older copies of Solaris 9, but not to Solaris 8 or older releases.

An EFI label comprises two data structures: the GUID Partition Table header (GPT) and a GUID Partition Entry (GPE) array.

  • The GPT is exactly one 512-byte sector. It contains header data, including the size of the user-accessible part of the disk (a contiguous area not occupied by EFI data structures); the disk address and length of the GPE table; the disk address of this copy of the GPT, and of the other (normally there are two); and a magic number and checksum.
  • A GPE is 128 bytes. Each descriptor includes the starting and ending disk address of the partition, a 128-bit unique partition identifier, and a 128-bit partition-type code. GPEs are stored in a contiguous array of sectors, which must be at least 16kiB (32 sectors, room for 128 GPEs) but may be longer. Not every GPE in the array need be active.

One copy of the GPT is stored in the second sector of the disk, immediately followed by the GPE array in contiguous sectors. A backup GPT is stored in the last sector, immediately preceded by a backup GPE array. Sectors from the end of the primary GPE array to the beginning of the backup may be allocated to partitions. EFI partitions may not overlap one another, nor may an EFI partition overlap the GPT or GPE or sector 0.
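
The resulting on-disk layout can be computed from the disk size alone. The sketch assumes 512-byte sectors and a GPE array of the minimum size (32 sectors); the structure and function names are invented for illustration.

```c
#include <stdint.h>

#define GPE_SECTORS 32u     /* minimum GPE array: 16 KiB of 512-byte sectors */

struct efi_layout {
    uint64_t gpt, gpe;                  /* primary header and entry array */
    uint64_t first_usable, last_usable; /* range available to partitions */
    uint64_t backup_gpe, backup_gpt;    /* backup entry array and header */
};

/* Fill *l for a disk of nsectors; return -1 if the disk is too small. */
int
efi_compute_layout(uint64_t nsectors, struct efi_layout *l)
{
    /* Sector 0, two headers, two entry arrays, and one data sector. */
    if (nsectors < 2 * (1 + GPE_SECTORS) + 2)
        return -1;
    l->gpt = 1;
    l->gpe = 2;
    l->first_usable = l->gpe + GPE_SECTORS;
    l->backup_gpt = nsectors - 1;
    l->backup_gpe = l->backup_gpt - GPE_SECTORS;
    l->last_usable = l->backup_gpe - 1;
    return 0;
}
```

On a 1000-sector disk this places partitions within sectors 34 to 966, the backup GPE array at 967 to 998, and the backup GPT at 999.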

The EFI label contains no information about cylinders, heads, or tracks; the disk is treated as a simple linear array of sectors. Sector addresses and counts are stored as 64-bit unsigned integers, allowing for disks and individual partitions as large as 8 zebibytes (more than 8 billion tebibytes). All integer values are stored in a fixed byte order defined by the EFI standard; hence a label written by a big-endian SPARC system may be read without difficulty by a little-endian IA32 system or vice versa.

9.3.4. Bigger file systems

The original Solaris UFS file system format also allows only 31 bits for sector numbers. In Solaris 9 8/03 (with corresponding patches for earlier editions of Solaris 9), Sun added a `multi-terabyte file system' variant.

9.3.5. Sun implementation details

When there were only VTOC labels, a Solaris disk driver was expected to support these label-related ioctl commands:

DKIOCGGEOM
DKIOCSGEOM
Fetch or set a struct dk_geom containing disk-geometry information.
DKIOCGVTOC
DKIOCSVTOC
Fetch or set a struct vtoc containing disk-partition information. The on-disk VTOC contains a merger of struct vtoc and struct dk_geom.
DKIOCGAPART
DKIOCSAPART
Fetch or set a struct dk_allmap containing the starting cylinder number and size in sectors of all eight or sixteen partitions.

DOS-label support added these ioctls:

DKIOCG_PHYGEOM
DKIOCG_VIRTGEOM
Fetch physical or virtual geometry. Physical is that of the whole disk; virtual (apparently) that of the encapsulated-VTOC logical disk.
DKIOCPARTINFO
Fetch starting sector and length of the current partition.
DKIOCGMBOOT
DKIOCSMBOOT
Fetch or set the contents of the DOS label; equivalent to reading or writing the first sector of the physical disk.

An EFI-compliant driver supports the VTOC-label and (if DOS labels are implemented) DOS-label ioctls, but returns error ENOTSUP if the disk has an EFI label, or if DKIOCSVTOC is called on a VTOC-improper disk. An EFI-compliant driver also supports these ioctls:

DKIOCGETEFI
DKIOCSETEFI
Fetch or set EFI partition info.
DKIOCPARTITION
Fetch a struct partition64 containing the starting sector address, length, and 128-bit partition-type code for a designated partition.
DKIOCGMEDIAINFO
Fetch a struct dk_minfo giving the sector size in bytes, device size in sectors, and a device-type code (removable disk, fixed disk, CD-ROM, CD-R or CD-RW, floppy, etc.).

The DKIOCGETEFI and DKIOCSETEFI ioctls really just perform I/O to absolute disk addresses, regardless of the partition table.

In an EFI-compliant version of Solaris, utility programs like format and prtvtoc(1M) use new libefi and libvtoc(3LIB) libraries to read and write disk labels. Libvtoc is a simple wrapper around the old ioctls. Libefi is more complicated: when reading a label it validates all the checksums (device drivers just check the magic number); when writing, it computes correct checksums and validates other GPT and GPE values, in particular enforcing the EFI-standard rule that partitions may not overlap and a Sun-specific rule that exactly one partition must be of special reserved type, presumably as a stand-in for the reserved cylinders in the old VTOC scheme.

To find the label on a disk, Sun's EFI-compliant drivers search as follows:

  1. If sector 0 contains a valid VTOC label, use it. If the disk is not VTOC-proper, print a warning. There is apparently no attempt to locate a backup VTOC label.
  2. If sector 1 or the last (backup-label) sector contains a valid EFI label, use it. Sun's drivers use only the first eight or sixteen partitions. Sun's drivers also ignore the eighth partition described in the label, replacing partition :h with one called :wd mapping the entire physical disk, including the label areas.
  3. If the driver supports DOS labels, examine sector 0. If it is a valid DOS label, use it. If one of the DOS partitions has an encapsulated-VTOC type, use the VTOC label too. If more than one DOS partition has such a type, ignore all but the highest-numbered.
  4. If no label can be found but the disk is VTOC-proper, invent a default partition table in which partitions :a and :c span the whole disk.
  5. If no label can be found but the disk is VTOC-improper, set up no partition table at all, not even the :wd partition. Apparently it is expected that format will be used to label the disk before use. (But what if the disk was written by another operating system with its own label scheme?)
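
The search order above can be summarized as a short decision function. This is only a sketch in user-space C: struct disk_probe, its fields, and choose_label are names invented here, not taken from Sun's drivers or from aoed, and the real work of reading and validating each label is reduced to a boolean. The encapsulated-VTOC refinement of step 3 is left out.

```c
#include <stdbool.h>

/* Hypothetical summary of what a driver learned by reading the disk. */
struct disk_probe {
    bool vtoc_at_sector0;  /* sector 0 holds a valid VTOC label */
    bool efi_primary;      /* sector 1 holds a valid EFI label */
    bool efi_backup;       /* last sector holds a valid EFI label */
    bool driver_does_dos;  /* this driver implements DOS labels */
    bool dos_at_sector0;   /* sector 0 holds a valid DOS label */
    bool vtoc_proper;      /* disk is VTOC-proper */
};

enum label_kind {
    LABEL_VTOC,     /* step 1 */
    LABEL_EFI,      /* step 2 */
    LABEL_DOS,      /* step 3 */
    LABEL_DEFAULT,  /* step 4: invented table, :a and :c span the disk */
    LABEL_NONE      /* step 5: no partition table at all */
};

/* Apply the five steps in order; the first match wins. */
enum label_kind choose_label(const struct disk_probe *p)
{
    if (p->vtoc_at_sector0)
        return LABEL_VTOC;
    if (p->efi_primary || p->efi_backup)
        return LABEL_EFI;
    if (p->driver_does_dos && p->dos_at_sector0)
        return LABEL_DOS;
    if (p->vtoc_proper)
        return LABEL_DEFAULT;
    return LABEL_NONE;
}
```

Because the tests are ordered, a disk carrying both a valid VTOC label and a valid EFI label is treated as a VTOC disk, which is what step 1 implies.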

9.3.6. AoE implementation details

The aoed driver recognizes VTOC, DOS (and encapsulated-VTOC), and EFI labels, but with some differences from the Sun convention:

9.4. Concurrency issues

The Solaris kernel is pre-emptive: all processing in the kernel belongs to a scheduling thread and may be pre-empted. Thus locking is important even on a single-processor system.

9.4.1. Locks in aoecomm

There are three global locks: chantlock protects the table that maps aoechan numbers to Aoechan structures; pendlock protects the pending-cookie table; loggerlock protects the message-logger pointer. Each of these locks is held for only a few lines of code at a time; protected code sections contain no procedure calls, and in particular no lock calls.

Each Aoechan structure includes a lock, protecting its contents. These locks are also held only during short code sequences that cannot provoke other locks.

During normal operation, the data protected by loggerlock and the Aoechan lock are written rarely but read constantly. These locks proved to be hot spots; changing them from the kmutex_t type to krwlock_t made raw disk I/O measurably faster.

9.4.2. Locks in aoectl

There are two locks in aoectl. One serializes writes to the /dev/aoectl device; the other prevents concurrent access to the buffer ring feeding the /dev/aoemon device. The two are entirely disjoint: code using one lock never calls code using the other.

9.4.3. Locks in aoed

Two locks are used throughout the aoed module: one protecting the Adrive structure, one the Achan structure. To avoid nested-lock hangups, there is a rule that code in which an Adrive is locked may lock (or call code that locks) an Achan, but not vice versa.

The driver's main entry points locate and lock the relevant Adrive early on, and unlock it just before returning. Most subroutines that take an Adrive argument assume it was already locked, and leave it that way.

Achans are used (hence locked) only here and there, and for short periods; usually just for long enough to search or to make a single change to the active-command list. Only rarely need an Adrive be accessed while an Achan is locked.

Timer routine chantimer contains an exception that illustrates the rule. Chantimer locks an Achan while walking its active-command list looking for expired commands. If one is found, the Achan is unlocked and the Adrive associated with the command locked while the command is retransmitted or cancelled. Then the Adrive is unlocked and the Achan locked again; and because the active-command list may have changed while the Achan was unlocked, chantimer starts over from the beginning of the list.
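
The unlock, relock, and restart sequence just described looks roughly like this. The sketch uses user-space pthread mutexes and invented structure layouts (struct achan, struct adrive, struct cmd, and the retransmit helper are all hypothetical); it shows only the lock ordering, and ignores the question of keeping a command alive while the Achan is unlocked, which real code must address.

```c
#include <pthread.h>
#include <stddef.h>

struct adrive {
    pthread_mutex_t lock;
    int retransmits;       /* counter, for illustration only */
};

struct cmd {
    struct cmd *next;
    struct adrive *drive;  /* drive this command belongs to */
    long deadline;         /* tick at which the command expires */
};

struct achan {
    pthread_mutex_t lock;
    struct cmd *active;    /* active-command list */
};

/* Called with the Adrive locked and the Achan unlocked. */
static void retransmit(struct adrive *d, struct cmd *c, long now, long timeout)
{
    d->retransmits++;
    c->deadline = now + timeout;
}

/*
 * Walk the active list looking for expired commands. The Achan
 * lock is never held while an Adrive lock is taken. Requires
 * timeout > 0, so a handled command is no longer expired and the
 * restarted scan eventually finishes.
 */
int chantimer_sketch(struct achan *ch, long now, long timeout)
{
    struct cmd *c;
    int handled = 0;

    pthread_mutex_lock(&ch->lock);
restart:
    for (c = ch->active; c != NULL; c = c->next) {
        if (c->deadline > now)
            continue;
        struct adrive *d = c->drive;
        pthread_mutex_unlock(&ch->lock);  /* drop Achan first */
        pthread_mutex_lock(&d->lock);
        retransmit(d, c, now, timeout);
        pthread_mutex_unlock(&d->lock);
        pthread_mutex_lock(&ch->lock);    /* list may have changed */
        handled++;
        goto restart;
    }
    pthread_mutex_unlock(&ch->lock);
    return handled;
}
```

Restarting from the head of the list costs a little extra scanning, but it avoids trusting a list position that was obtained before the Achan lock was dropped.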

9.5. The startup dance

The channel-startup algorithm is complicated by a potential security problem.

A STREAMS module has no associated permissions. Anyone can open a stream device of some sort: a pipe, a network connection, his own terminal. Given an open stream, anyone may push any module. In particular, anyone could push aoecomm onto one end of a pipe, handle AoE messages on the other, and the system would think it a valid aoechan.

A malicious user could do this using a channel number normally assigned to official disk devices. An active channel number cannot be reused, but a bad guy might attempt a race with the system administrator or take advantage of a system problem.

Hence the algorithm that first requires /dev/aoectl to be opened: that device file can be given whatever permissions local policy dictates. Someone who may open /dev/aoectl can set up AoE channels; someone who may not, cannot.

9.6. How aoed fits into the system

The aoed driver is in a sense half-redundant. Since the AoE protocol is just a way to bundle ATA commands into Ethernet packets, one might think it would be possible to write not a complete disk driver but just an ATA host-adapter driver, using the existing Solaris ATA code to do the rest of the work.

If it were SCSI-over-Ethernet rather than ATA, that would be possible: Sun expects third parties to write SCSI HBA drivers, and documents the details. Unfortunately that is not true for ATA. Hence aoed is a standalone disk driver attached to the pseudo nexus.

Perhaps this can be revisited in the future, if Sun stabilize and publish their ATA-adapter interface, and once their ATA code has been fixed to handle multi-terabyte disks. Even then, support for older Solaris versions would require something like the current driver.

9.7. Device-creation static

Device instances in Solaris are created in one of two ways:

  1. The system scans the available hardware. When a device is discovered, the corresponding driver's probe routine is called to decide whether the device is usable. If probe approves, the driver's attach routine is called to create the software instance. This works only if the system (or some bus-adapter driver) can generate a list of all existing devices, and can somehow tell which device is of what type.
  2. Instances may be configured statically in a driver configuration file. The system reads the file, and for each instance listed creates a dev_info_t structure and calls the driver attach routine.

Because AoE is a software construct attached to the pseudo nexus, only static configuration is allowed. Some versions of Solaris even complain if such a driver has a real probe entry point.

In Solaris 9 and earlier versions, this means devices can be attached only when the aoed module is loaded. To create more devices requires shutting down the driver, unloading the module, and loading it afresh.

In Solaris 10, a static device configuration can be updated without unloading the driver; update_drv(1M) makes it happen. Devices that are in use (open or mounted) cannot be changed, but new devices can be added and idle devices removed or given new properties. At present, the system administrator is expected to update aoed.conf and run update_drv. Perhaps devices can be detected automatically in a future version, though that might forbid custom target-address-to-instance-number mappings.

9.8. Device types, and device-name horrors

It might seem natural for aoed to create device nodes by calling ddi_create_minor_node(9F) with node type DDI_NT_BLOCK_CHAN, using the aoemaj number (or some combination of aoechan and aoemaj) as the device instance and aoemin as the target number. DDI_NT_BLOCK_CHAN is listed in the documentation, but nowhere do the manuals say how to supply a target number. Apparently there are special hooks for use by SCSI and ATA host-bus adapter drivers, but no mechanism for general use. Worse, if one uses DDI_NT_BLOCK_CHAN anyway, the system apparently picks a garbage number out of uninitialized memory.

An early version of the driver used node type DDI_NT_BLOCK. This ran afoul of a bug in devfsadm(1M) in Solaris 7, 8, and 9. Calling ddi_create_minor_node with the dev_info_t for instance 15 created devices named /devices/pseudo/aoed@15:*, as expected; but the links to /dev/dsk and /dev/rdsk were named cNd21*. Apparently the code that creates names in /devices believes instance numbers are decimal, but the code that reads those names and creates names in /dev believes the /devices names are hexadecimal and `corrects' them.
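
The arithmetic of the mismatch is easy to reproduce. The function below is not devfsadm code; misread_instance is a name invented here to show what happens when a number printed in decimal is parsed back as hexadecimal.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Print an instance number in decimal, as it appears in the
 * /devices name, then parse the digits as if they were
 * hexadecimal, as the link-generating code apparently does.
 */
long misread_instance(int instance)
{
    char name[16];

    snprintf(name, sizeof name, "%d", instance);
    return strtol(name, NULL, 16);
}
```

Instances 0 through 9 survive unchanged, which may be why the bug is easy to miss; instance 15 comes back as 21, matching the cNd21* links described above.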

To avoid these problems, the aoed driver does everything the hard way. Device nodes are created with type DDI_NT_PSEUDO, for which devfsadm does no automatic processing. Installing the AoE subsystem adds explicit rules to /etc/devlink.tab to generate /dev links for aoed:

type=ddi_pseudo;name=aoed;minor=a dsk/cadA0s0
type=ddi_pseudo;name=aoed;minor=a,raw rdsk/cadA0s0
type=ddi_pseudo;name=aoed;minor=b dsk/cadA0s1
type=ddi_pseudo;name=aoed;minor=b,raw rdsk/cadA0s1
...
type=ddi_pseudo;name=aoed;minor=p dsk/cadA0s15
type=ddi_pseudo;name=aoed;minor=p,raw rdsk/cadA0s15
type=ddi_pseudo;name=aoed;minor=r dsk/cadA0p1
type=ddi_pseudo;name=aoed;minor=r,raw rdsk/cadA0p1
...
type=ddi_pseudo;name=aoed;minor=u dsk/cadA0p4
type=ddi_pseudo;name=aoed;minor=u,raw rdsk/cadA0p4
type=ddi_pseudo;name=aoed;minor=wd dsk/cadA0
type=ddi_pseudo;name=aoed;minor=wd,raw rdsk/cadA0

This approach reaps some minor benefits:

  • In older versions of Solaris, devfsadm doesn't know what to do with the wd partition. The explicit rules generate a single consistent name.
  • If AoE were treated as just another disk, Solaris would invent a numeric controller ID: AoE disks might be named /dev/dsk/c1d* on one system, /dev/dsk/c2d* on another with different hardware or a different Solaris version; the names might even change after an OS reinstall. Again, the explicit rules generate a single consistent name.
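
The devlink.tab rules listed above follow a regular pattern from minor name to /dev suffix, which can be written down as a small function. This is a sketch inferred from the listing and from the :q/p0 remark in section 9.9, not code from the AoE distribution; minor_to_suffix is an invented name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Map an aoed minor name to the suffix used in /dev/dsk and
 * /dev/rdsk: a..p become s0..s15, r..u become p1..p4, and wd
 * gets no suffix. Returns 0 on success, -1 for a minor name
 * with no rule. That includes q, which would be p0 but is
 * never created by aoed.
 */
int minor_to_suffix(const char *minor, char *buf, size_t len)
{
    char c;

    if (len == 0)
        return -1;
    if (strcmp(minor, "wd") == 0) {
        buf[0] = '\0';
        return 0;
    }
    if (strlen(minor) != 1)
        return -1;
    c = minor[0];
    if (c >= 'a' && c <= 'p')
        snprintf(buf, len, "s%d", c - 'a');
    else if (c >= 'r' && c <= 'u')
        snprintf(buf, len, "p%d", c - 'r' + 1);
    else
        return -1;
    return 0;
}
```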

9.9. Format woes

In an early version of aoed, the DKIOCINFO ioctl returned a customer controller type value (>=DKC_CUSTOMER). This causes format(1M) to misbehave quite badly:

  • When a customer-controller-type disk is selected, format prompts for about a dozen subtle disk parameters, most of them meaningless for modern ATA disks, some already available via DKIOCINFO.
  • When the disk label is updated, format attempts to read the backup labels to verify that they were written correctly. On a customer-type disk it does this wrong: it reads mysteriously-chosen blocks somewhere in the middle of the disk, nowhere near the backup labels. It couldn't read the backup labels anyway: they are stored outside the user-data part of the disk.
  • On Solaris 9, when checking the label or running tests on a customer-type disk, format makes lseek, read, and write calls with patently-invalid arguments (as shown by truss(1)). Worse, though the calls return errors as they should, format reports no trouble; apparently the error returns are ignored.

To avoid this mess, aoed attempts to mimic a directly-attached ATA disk driver, even though that is an undocumented interface. DKIOCINFO reports controller type DKC_DIRECT; the driver implements the subset of the DIOCTL_RWCMD ioctl that format uses for direct-address access to ATA disks. The details are not officially documented; they were worked out by tracing format, reading <sys/dktp/dadkio.h>, and applying a mix of imagination and common sense.

The EFI-label code in later Solaris 9 releases and in Solaris 10 introduced two new format embarrassments:

  • If an ATA disk is larger than 1TiB, device-model information is fetched with SCSI-specific ioctl calls. This wouldn't work even with Sun's ATA-disk driver; it's just a bug. aoed implements a fake USCSICMD ioctl. Since ATA allows longer device-model strings than SCSI, the string is sometimes truncated; there's nothing we can do about that.
  • If an ATA disk is larger than 1TiB and does not already have an EFI label, format cannot write one; in fact format cannot do anything with such a disk. The AoE subsystem includes a tool aoelabinit to write an initial label, which format can then edit.
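
The truncation mentioned in the first item follows from field widths: an ATA IDENTIFY model number is 40 characters, while the product-identification field of a SCSI INQUIRY reply is 16 bytes. The sketch below assumes the fake USCSICMD answers with something shaped like an INQUIRY reply and simply copies the leading characters; how aoed really divides the string is not shown here, and fill_inquiry_field is an invented name.

```c
#include <string.h>

enum {
    ATA_MODEL_LEN = 40,    /* ATA IDENTIFY model number, in characters */
    SCSI_PRODUCT_LEN = 16  /* SCSI INQUIRY product identification */
};

/*
 * Copy src into a fixed-width, space-padded field with no
 * terminating NUL, as SCSI INQUIRY fields are laid out.
 * Returns 1 if src had to be truncated, 0 otherwise.
 */
int fill_inquiry_field(char *field, size_t width, const char *src)
{
    size_t n = strlen(src);
    int truncated = n > width;

    if (truncated)
        n = width;
    memcpy(field, src, n);
    memset(field + n, ' ', width - n);
    return truncated;
}
```

A 25-character model string such as "Example Disk Model 9000XL" would thus be reported as "Example Disk Mod".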

There is also at least one embarrassment in DOS-label support:

  • When a DOS label is present, Solaris creates minor device :q (p0 in /dev) accessing the whole disk. If this device exists, format insists that the disk have a DOS label; if none is present, format refuses even to print the existing partition table, commanding Please use fdisk first. AoE therefore doesn't make the :q/p0 device. The :wd (no suffix in /dev) device, created by AoE regardless of label type, affords equivalent access if needed.
Some of these bugs may have been cured in Solaris 10 11/06.

9.10. Waiting for the network

When a Solaris Ethernet device is opened, it sometimes takes a few seconds for the hardware and software to initialize. During this interval, DLPI attachment and configuration messages are processed correctly, and DLPI queries report that the device is ready, but it isn't: messages written to the device are silently thrown away. Thus aoestart must somehow wait until the device is really ready before broadcasting a Query-Config message to discover which devices exist.

Comminit, the library routine called by aoestart to open and initialize the device, tries two schemes to wait for the hardware:

  1. Using the kstat(3kstat) library, search for a kstat structure associated with the network device bearing an integer-valued link_up parameter. If found, wait up to 30 seconds for the value to become positive; when it does, return success.
  2. If no link_up parameter was found in a kstat, use magic calls equivalent to the innards of ndd(1M) to query the link_status value for the device. Wait up to 30 seconds for the value to become positive; when it does, return success. A global state variable in the kernel must be changed to select the device to be polled; good citizenship suggests its original value should be saved just before every test and restored immediately after, though races are still possible.
If neither test is possible or the test chosen fails, comminit returns an error.
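
The control flow of the two schemes can be sketched without any Solaris-specific calls. Everything here is hypothetical: the probe functions stand in for the kstat and ndd queries, the one-second polling interval is an assumption, and the sleep routine is passed in so the example can run without really waiting. Only the 30-second limit and the fallback rule come from the description above.

```c
#include <stddef.h>

/* A probe returns >0 if the link is up, 0 if down, <0 if unavailable. */
typedef int (*probe_fn)(void *arg);
typedef void (*sleep_fn)(unsigned seconds);

enum wait_result { LINK_UP, LINK_TIMEOUT, LINK_NO_PROBE };

/* Poll one probe about once a second for up to max_seconds. */
enum wait_result wait_for_link(probe_fn probe, void *arg,
                               sleep_fn nap, unsigned max_seconds)
{
    for (unsigned waited = 0; ; waited++) {
        int v = probe(arg);

        if (v < 0)
            return LINK_NO_PROBE;
        if (v > 0)
            return LINK_UP;
        if (waited >= max_seconds)
            return LINK_TIMEOUT;
        nap(1);
    }
}

/*
 * Try the kstat-style probe first; fall back to the ndd-style
 * probe only if the first is unavailable. A timeout on the
 * chosen probe is an error, not a reason to try the other.
 */
enum wait_result comminit_wait(probe_fn kstat_probe, probe_fn ndd_probe,
                               void *arg, sleep_fn nap)
{
    enum wait_result r = wait_for_link(kstat_probe, arg, nap, 30);

    if (r == LINK_NO_PROBE)
        r = wait_for_link(ndd_probe, arg, nap, 30);
    return r;
}

/* Stand-ins for trying the sketch: link comes up after *arg polls. */
int countdown_probe(void *arg)
{
    int *remaining = arg;

    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}

int absent_probe(void *arg) { (void)arg; return -1; }

void no_sleep(unsigned seconds) { (void)seconds; }
```

With countdown_probe and a counter of 3, comminit_wait reports LINK_UP on the fourth poll; with absent_probe in both positions it reports LINK_NO_PROBE, the case in which comminit returns an error because neither test is possible.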

The kstat test is tried first because it's a bit faster and rather less hacky in implementation than the ndd test. Recent network-device drivers no longer support the ndd parameter, but all Sun-supported network devices we know support the kstat scheme, except the old le (10Mbps LANCE Ethernet) driver. The latter device isn't a good choice for AoE, but support will remain for now because it is sometimes useful during our own testing.

9.11. Newfs woes

Solaris newfs gets into trouble when a large disk has small cylinders, producing complaints like `Insufficient space in super block for rotational tables' and `inode blocks/cyl group >= data blocks.' The details are not yet understood; possibly newfs is incorrectly doing some bit of arithmetic in too small a variable.

Empirically the trouble seems to vanish when cylinders are very large. Hence the phony numbers generated by fudgegeom, which make cylinders as big as the system seems to be able to stand.

The -T (new multi-terabyte file system) newfs option in Solaris 10 and in newer editions of Solaris 9 also avoids the problem, but that's no help in older versions of Solaris, nor on 32-bit hardware.

9.12. I/O timeouts

The default timeouts for I/O operations recorded in aoed.h are surprisingly long: 200ms for a read or write, half a second for an AoE Query-Config or ATA IDENTIFY. Originally much smaller values were used, but complicated devices like RAID controllers really do take as long as 100-150ms to finish some operations.

Longer timeouts shouldn't cause much grief anyway except in special circumstances, since lost messages are unlikely on modern (switched, flow-controlled) networks unless something is broken.

The large values may cause grief on slow, congested networks, e.g. 10Mbps or 100Mbps, especially when repeaters or broadcast cables are used rather than switches. If that case comes up in real life, just set the timeouts down manually. If it happens often enough it may make sense to invent a way to set a per-channel I/O timeout default rather than having to set it for each target.


Footnotes

1.
UNIX is a registered trademark of the Open Group.