Lumos Framework

Introduction

Lumos is a framework to analytically quantify the performance limits of many-core, heterogeneous systems operating at low supply voltage (e.g. near-threshold). Due to limited scaling of supply voltage, power density is expected to grow in future technology nodes. This increasing power density potentially limits the number of transistors switching at full speed in the future. Near-threshold operation can increase the number of simultaneously active cores, at the expense of much lower operating frequency (“dim silicon”). Although promising to increase overall throughput, dim cores suffer from diminishing returns as the number of cores increases. At this point, hardware accelerators become more efficient alternatives. Lumos is developed to explore such a broad design space.

How to Get it

The latest Lumos release can be downloaded as a tar.gz or zip archive. The source code is also available for quick browsing online.

Quick Start

Lumos is written mostly in Python, with external modules written in C. The latest Lumos requires Python 3.4 and a recent GCC, and has been tested on an Ubuntu 14.04 machine with Python 3.4.3 and gcc-4.8.

Prepare python modules

  • numpy and scipy

    These are usually included in a typical installation of Python. If you do not have them, the two packages can be obtained from the numpy and scipy project sites. Lumos has been tested on numpy-1.6.1 and scipy-0.9.0.

  • pandas

    This package is used to process technology model files.

  • matplotlib

    This package is required to generate plots for analyses. It is usually included in the package repositories of popular Linux distributions (e.g. Ubuntu). Otherwise, it can be obtained from the matplotlib project site. Lumos has been tested on matplotlib-1.1.1rc.

  • lxml

    This package is required to parse the descriptions of kernels and workloads stored in XML files. It is included in the package repository of Ubuntu, and can also be downloaded from the lxml project site. Lumos has been tested on lxml-2.3.2.

  • ConfigObj

    This package is required to parse configurations. It can be installed using pip as:

    pip install configobj
    
  • python-igraph

    This package is required to model applications as directed acyclic graphs. It can be installed using pip as:

    pip install python-igraph
    

    Note

    If the igraph C library is not installed (on Ubuntu, it can be installed through the system package manager, e.g. the libigraph and libigraph-dev packages via apt-get rather than pip), the above procedure will try to compile the library first. In this case, GCC is required.

  • nose (optional)

    This package is only required for unit tests.
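Before moving on, it can help to confirm that the modules listed above are importable. The following is a minimal sketch and not part of Lumos: the helper name missing_modules is hypothetical, and the import names (configobj for ConfigObj, igraph for python-igraph) are assumed to be the ones these packages are normally imported under.

```python
import importlib.util

# Import names of the modules listed above (assumed; not taken from Lumos).
REQUIRED = ['numpy', 'scipy', 'pandas', 'matplotlib', 'lxml', 'configobj', 'igraph']
# nose is only needed for the unit tests.
OPTIONAL = ['nose']


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == '__main__':
    print('missing required modules:', missing_modules(REQUIRED))
    print('missing optional modules:', missing_modules(OPTIONAL))
```

An empty list for the required modules suggests that the Python side of the setup is complete. Version compatibility is not checked.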

Build cacti-p

This is as simple as running the following command in the home directory of Lumos:

make -C cacti-p

Run simple example

Lumos is now ready to run. Apart from cacti-p, which was built above, Lumos is pure Python and needs no compilation step (technically speaking, the interpreter “compiles” all Python scripts to bytecode to accelerate execution, but this is transparent to users). Just follow these steps:

  1. Set the environment variable LUMOS_HOME to the root directory of the package:

    >cd Lumos-0.1
    >export LUMOS_HOME=(full-path-to)/Lumos-0.1
    
  2. Run the sample analysis:

    >python lumos/analyses/homosys_example.py
    
  3. Done!

Now the plots for this sample analysis should be ready to check out in $LUMOS_HOME/analyses/core/figures.
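If the plots do not show up, a common cause is that LUMOS_HOME is not set in the shell that ran the analysis. The sketch below resolves the figure directory named above from the environment. The helper figures_dir is hypothetical and not part of Lumos; it only joins the path components given in this section.

```python
import os


def figures_dir(environ=os.environ):
    """Return the figure directory of the sample analysis under LUMOS_HOME."""
    home = environ.get('LUMOS_HOME')
    if not home:
        raise RuntimeError('LUMOS_HOME is not set; export it as in step 1')
    # Path components as stated in the text: analyses/core/figures
    return os.path.join(home, 'analyses', 'core', 'figures')


if __name__ == '__main__':
    path = figures_dir()
    print(path, 'exists' if os.path.isdir(path) else 'does not exist yet')
```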

Model

The model comprises technology models, cores, accelerators, and applications.

Technology

The supported technology libraries are:

TechName    TechVariant    Mnemonic
cmos        hp             cmos-hp
cmos        lp             cmos-lp
finfet      hp             finfet-hp
finfet      lp             finfet-lp
tfet        homo30nm       tfet-homo30nm
tfet        homo60nm       tfet-homo60nm

Lumos provides a factory function to retrieve a supported technology model by its name and variant:

from lumos.model.tech import get_model
techmodel = get_model('cmos', 'hp')
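The mnemonics in the technology table appear to follow a simple name-variant pattern. As a rough illustration, the sketch below rebuilds a mnemonic from the two arguments passed to get_model() and rejects pairs that the table does not list. The helper tech_mnemonic and the SUPPORTED_TECH set are hypothetical and are not part of the Lumos API.

```python
# (TechName, TechVariant) pairs copied from the table of supported technologies.
SUPPORTED_TECH = {
    ('cmos', 'hp'), ('cmos', 'lp'),
    ('finfet', 'hp'), ('finfet', 'lp'),
    ('tfet', 'homo30nm'), ('tfet', 'homo60nm'),
}


def tech_mnemonic(name, variant):
    """Return the mnemonic for a supported technology, e.g. 'cmos-hp'."""
    if (name, variant) not in SUPPORTED_TECH:
        raise ValueError('unsupported technology: {} {}'.format(name, variant))
    return '{}-{}'.format(name, variant)
```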

Core

To create an object for conventional cores, these four arguments have to be specified:

  • technology node (nm)
  • technology name
  • technology variant
  • core type

For example, the following code snippet creates an in-order core using CMOS technology with the hp variant at 45nm:

from lumos.model.core import BaseCore
core = BaseCore(45, 'cmos', 'hp', 'io')

All supported combinations are listed as follows:

TechName    TechVariant           CoreType                      TechNode (nm)
cmos        hp, lp                io, o3                        45, 32, 22, 16
finfet      hp, lp                smallcore, bigcore            20, 16, 14, 10, 7
tfet        homo30nm, homo60nm    io, o3, smallcore, bigcore    22
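When sweeping over many configurations, it can be convenient to check a combination against this table before constructing a core. The following is a minimal sketch under that assumption: SUPPORTED_CORES and is_supported are hypothetical names, the data is copied from the table, and the argument order mirrors the BaseCore call shown earlier. How Lumos itself reacts to an unsupported combination is not covered here.

```python
# Supported combinations, copied from the table above.
SUPPORTED_CORES = {
    'cmos': {
        'variants': {'hp', 'lp'},
        'core_types': {'io', 'o3'},
        'nodes': {45, 32, 22, 16},
    },
    'finfet': {
        'variants': {'hp', 'lp'},
        'core_types': {'smallcore', 'bigcore'},
        'nodes': {20, 16, 14, 10, 7},
    },
    'tfet': {
        'variants': {'homo30nm', 'homo60nm'},
        'core_types': {'io', 'o3', 'smallcore', 'bigcore'},
        'nodes': {22},
    },
}


def is_supported(tech_node, tech_name, tech_variant, core_type):
    """Return True if the combination appears in the table of supported cores."""
    entry = SUPPORTED_CORES.get(tech_name)
    if entry is None:
        return False
    return (tech_variant in entry['variants']
            and core_type in entry['core_types']
            and tech_node in entry['nodes'])
```

For instance, is_supported(45, 'cmos', 'hp', 'io') corresponds to the in-order core created in the snippet above.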

Accelerators

To be edited

Application

Sirius-suite

Data extracted from table 5 of the paper [sirius-asplos15]:

[sirius-asplos15] Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers, ASPLOS 15.
[ucore-micro10] Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?, MICRO 10.

The paper reports the speedup achieved by an FPGA (Xilinx ML605) over a conventional CPU (Xeon E3-1240 v3). We rescale the speedup as if the baseline were a Core i7-960, using SPECfp2006 scores. According to the published SPECfp2006 results, the Core i7-960 and the Xeon E3-1240 v3 score 43.5 and 75.6, respectively. Therefore, the speedup numbers in Table 5 of [sirius-asplos15] are scaled by a factor of 75.6/43.5 = 1.738 for the FPGA accelerator.

Further, the area of the FPGA is estimated by assuming that each LUT takes 0.00191 mm^2 at 40nm. The total number of LUTs for the Virtex6 LXT240T is 37680*8. The resource utilization of each kernel is listed as follows:

  • gmm: 100%
  • dnn-asr: 100%
  • stemmer: 100%
  • regex: 100%
  • crf: 10%
  • fe: 20%
  • fd: 20%
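The arithmetic above can be reproduced with a few lines of Python. This is a sketch of the calculation as described in this section, not code from Lumos: the names are hypothetical, and multiplying the total FPGA area by a kernel's utilization to get a per-kernel area is an assumption about how the utilization figures are meant to be used. No speedup values from the paper are included; scaled_speedup takes the reported number as input.

```python
# SPECfp2006 scores quoted in the text.
SPECFP_CORE_I7_960 = 43.5
SPECFP_XEON_E3_1240_V3 = 75.6

# Factor applied to the FPGA speedups reported against the Xeon baseline.
SPEEDUP_SCALE = SPECFP_XEON_E3_1240_V3 / SPECFP_CORE_I7_960

# Area estimate: per-LUT area at 40nm times the LUT count quoted in the text.
LUT_AREA_MM2 = 0.00191
TOTAL_LUTS = 37680 * 8
FPGA_AREA_MM2 = TOTAL_LUTS * LUT_AREA_MM2

# Resource utilization per kernel, as listed above.
UTILIZATION = {
    'gmm': 1.00, 'dnn-asr': 1.00, 'stemmer': 1.00, 'regex': 1.00,
    'crf': 0.10, 'fe': 0.20, 'fd': 0.20,
}


def scaled_speedup(reported_speedup):
    """Rescale a speedup reported against the Xeon to the Core i7-960 baseline."""
    return reported_speedup * SPEEDUP_SCALE


def kernel_area_mm2(kernel):
    """Assumed per-kernel area: total FPGA area times the kernel's utilization."""
    return FPGA_AREA_MM2 * UTILIZATION[kernel]


if __name__ == '__main__':
    print('scale factor: {:.3f}'.format(SPEEDUP_SCALE))
    print('FPGA area: {:.2f} mm^2'.format(FPGA_AREA_MM2))
    for name in sorted(UTILIZATION):
        print('{:8s} {:8.2f} mm^2'.format(name, kernel_area_mm2(name)))
```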

Analysis

A typical analysis in the Lumos framework involves three steps: define the workload, define the system, and run the analysis.

Define Workload

The workload in the Lumos framework is defined as a pool of applications. Each application is divided into serial and parallel parts, and the parallel fraction is specified as f_parallel. Part of an application can also be partitioned into several computing kernels. These kernels can be accelerated by various computing units, such as multicore (possibly dim) CPU cores, reconfigurable logic (RL), and customized ASICs. We model the speedup and the power consumption of RL and customized ASICs for a given kernel by u-core parameters.

A workload is defined by enumerating all applications in the format of XML, such as:

<workload>
   <app name="app0">
     <f_parallel>1</f_parallel>
     <kernel_config>
       <kernel name="ker005" cov="0.4826"/>
     </kernel_config>
   </app>
   ...
</workload>

where f_parallel is the parallel ratio. Within kernel_config, name is the name of the kernel, and cov is the kernel's coverage: the fraction of the application's execution time spent in the kernel when the application runs on a single baseline core (e.g. an in-order core at the nominal supply voltage and 45nm). The coverages of an application's kernels need not sum to 100%. Lumos assumes that the rest of the application cannot be accelerated and can only be executed on conventional cores.
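To make the layout concrete, the sketch below reads a workload description of this shape with the standard library and reports the part of each application that no kernel covers. It is only an illustration, not the loader that Lumos uses; the function read_workload and the key names in its result are hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample workload from above, without the elided applications.
WORKLOAD_XML = """
<workload>
  <app name="app0">
    <f_parallel>1</f_parallel>
    <kernel_config>
      <kernel name="ker005" cov="0.4826"/>
    </kernel_config>
  </app>
</workload>
"""


def read_workload(xml_text):
    """Parse a workload description into a dict keyed by application name."""
    apps = {}
    for app in ET.fromstring(xml_text.strip()).findall('app'):
        # Map each kernel name to its coverage of the application.
        kernels = {k.get('name'): float(k.get('cov'))
                   for k in app.findall('kernel_config/kernel')}
        apps[app.get('name')] = {
            'f_parallel': float(app.findtext('f_parallel')),
            'kernels': kernels,
            # Fraction left to conventional cores because no kernel covers it.
            'uncovered': 1.0 - sum(kernels.values()),
        }
    return apps


if __name__ == '__main__':
    for name, app in read_workload(WORKLOAD_XML).items():
        print(name, app)
```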

A set of kernels is enumerated in the format of XML as well, such as:

<kernels>
  <kernel name="ker005">
    <fpga miu="20"/>
    <asic miu="100"/>
    <occur>0.003206</occur>
  </kernel>
  ...
</kernels>

where miu is the relative performance of the FPGA and ASIC implementations, respectively, and occur is the probability that this kernel is present in an application.
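A kernel description of this shape can be read in the same spirit. The sketch below uses the standard library, not the Lumos loader; read_kernels and the key names in its result are hypothetical. It also derives the ASIC-to-FPGA performance ratio of each kernel from the two miu values.

```python
import xml.etree.ElementTree as ET

# The sample kernel set from above, without the elided kernels.
KERNELS_XML = """
<kernels>
  <kernel name="ker005">
    <fpga miu="20"/>
    <asic miu="100"/>
    <occur>0.003206</occur>
  </kernel>
</kernels>
"""


def read_kernels(xml_text):
    """Parse a kernel description into a dict keyed by kernel name."""
    kernels = {}
    for ker in ET.fromstring(xml_text.strip()).findall('kernel'):
        fpga_miu = float(ker.find('fpga').get('miu'))
        asic_miu = float(ker.find('asic').get('miu'))
        kernels[ker.get('name')] = {
            'fpga_miu': fpga_miu,
            'asic_miu': asic_miu,
            'occur': float(ker.findtext('occur')),
            # How much faster the ASIC is than the FPGA for this kernel.
            'asic_fpga_ratio': asic_miu / fpga_miu,
        }
    return kernels


if __name__ == '__main__':
    for name, ker in read_kernels(KERNELS_XML).items():
        print(name, ker)
```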

There are several helper functions to assist you in generating kernels and applications that follow certain statistical distributions. See create_fixednorm_xml(), create_randnorm_xml(), build(), and build_fixedcov() for more details. Moreover, existing XML descriptions can be loaded by kernel.load_xml() for kernels and workload.load_xml() for workloads.

Define System

Lumos supports conventional cores such as a Niagara2-like in-order core and an out-of-order core (IOCore and O3Core), as well as unconventional cores, such as accelerators (UCore) and federated cores (FedCore).

On top of these cores, Lumos supports two kinds of systems: a homogeneous multi-core system (HomogSys), and a heterogeneous multi-core system with a serial core, a number of throughput cores, and accelerators (HeterogSys). The usage of these two systems is demonstrated later in the Example Analyses section.

Do Analysis

HeterogSys provides a method get_perf() to get the relative performance of the system for a given application.

HomogSys provides several methods to retrieve the relative performance of the system for a given application, depending on which constraints apply:

  • Explicit constraint on the supply voltage.

    In this case, the system will try to enable as many cores at the given supply as possible within the given power budget. If the supply voltage is relatively high, it ends up with a dark silicon homogeneous many-core system. Use perf_by_vfs() and perf_by_vdd() for this scenario.

  • Explicit constraint on the number of active cores.

    In this case, the system will probe for the highest supply voltage for the core to meet the overall power budget. Use perf_by_cnum() for this scenario.

  • No constraints on the supply voltage or the number of active cores.

    In this case, the system will probe for the optimal configuration of supply voltage and the number of cores to achieve the best overall throughput. Use opt_core_num() for this scenario.
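The three scenarios can be illustrated with a deliberately simplified toy model. The sketch below is not the Lumos implementation and does not use its technology models: it assumes an alpha-power-law frequency curve, dynamic power proportional to Vdd^2 * f with leakage ignored, a fully parallel workload, and made-up constants. All function names are hypothetical stand-ins for the idea behind perf_by_vdd(), perf_by_cnum(), and opt_core_num().

```python
V_NOM = 1.0        # nominal supply voltage in volts (toy value)
V_TH = 0.3         # threshold voltage in volts (toy value)
ALPHA = 1.3        # exponent of the alpha-power law (toy value)
P_CORE_NOM = 10.0  # power of one core at V_NOM in watts (toy value)
# Candidate supply voltages from 0.35 V to 1.3 V in 10 mV steps.
VDD_GRID = [mv / 1000.0 for mv in range(350, 1301, 10)]


def rel_freq(vdd):
    """Core frequency relative to the frequency at V_NOM."""
    def raw(v):
        return (v - V_TH) ** ALPHA / v
    return raw(vdd) / raw(V_NOM)


def core_power(vdd):
    """Dynamic power of one core, proportional to vdd^2 * f (no leakage)."""
    return P_CORE_NOM * (vdd / V_NOM) ** 2 * rel_freq(vdd)


def toy_perf_at_vdd(vdd, power_budget, max_cores):
    """Fix the supply voltage, then enable as many cores as the budget allows."""
    cnum = min(max_cores, int(power_budget // core_power(vdd)))
    return {'vdd': vdd, 'cnum': cnum, 'perf': cnum * rel_freq(vdd)}


def toy_perf_at_cnum(cnum, power_budget):
    """Fix the core count, then pick the highest voltage that fits the budget."""
    for vdd in reversed(VDD_GRID):
        if cnum * core_power(vdd) <= power_budget:
            return {'vdd': vdd, 'cnum': cnum, 'perf': cnum * rel_freq(vdd)}
    return None  # even the lowest voltage on the grid exceeds the budget


def toy_optimum(power_budget, max_cores):
    """Sweep the voltage grid and keep the best-throughput configuration."""
    return max((toy_perf_at_vdd(v, power_budget, max_cores) for v in VDD_GRID),
               key=lambda cfg: cfg['perf'])


if __name__ == '__main__':
    print(toy_perf_at_vdd(V_NOM, power_budget=100.0, max_cores=64))
    print(toy_perf_at_cnum(20, power_budget=100.0))
    print(toy_optimum(power_budget=100.0, max_cores=64))
```

In this toy setting, max_cores stands in for the area limit. Lowering the voltage lets more cores fit in the budget until that limit is reached, which is the dim-versus-dark trade-off the real methods explore with far more detailed models.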

Example Analyses

  1. Example of using HomogSys

    An example analysis of HomogSys is in $LUMOS_HOME/lumos/analyses/homosys_example.py. This example models a homogeneous many-core architecture composed of Niagara2-like in-order cores. The system is defined as follows:

    sys = HomogSys()
    sys.set_sys_prop(area=self.sys_area, power=self.sys_power)
    sys.set_sys_prop(core=IOCore(mech=self.mech))
    

    The analysis compares four scenarios applied to the system: 1) dim cores without consideration of variation-induced frequency penalty; 2) dim cores considering the frequency penalty; 3) dim cores with reduced frequency penalty of 0.5 and 0.1 respectively; 4) dark cores with maximum supply voltage (1.3x nominal) and frequency. Each scenario will require some tweaks to system parameters, for example, the second scenario requires:

    sys.set_core_prop(tech=ctech, pv=True)
    

    More details can be found in analyze(). For each scenario, the relative performance is obtained by opt_core_num() as follows:

    ret = sys.opt_core_num()
    ret['perf']
    

    Finally, plot_series() is used to generate a plot for the above scenarios, as in plot().

  2. Example of using HeterogSys

    An example analysis of HeterogSys is in $LUMOS_HOME/lumos/analyses/heterosys_example.py. This example models a heterogeneous many-core system composed of in-order cores, reconfigurable logic (FPGA), and dedicated ASICs. All related files for this analysis are placed in $LUMOS_HOME/analyses/heterosys_example. For maximum flexibility, this example reads its input parameters from an external configuration file in addition to the command line. The default configuration file is heterosys_example.cfg. This example uses pre-defined synthetic kernels and workloads stored in kernels_asicfpgaratio10x.xml and workload_norm40x10.xml. The analysis loads the kernels and the workload as follows:

    kernels = kernel.load_xml(options.kernel)
    workload = workload.load_xml(options.workload)
    

    The system is defined as follows:

    sys = HeterogSys(self.budget)
    sys.set_mech('HKMGS')
    sys.set_tech(16)
    if kfirst != 0:  # there are ASIC accelerators to be added
        sys.set_asic('_gen_fixednorm_004', alloc*kfirst*0.33)
        sys.set_asic('_gen_fixednorm_005', alloc*kfirst*0.33)
        sys.set_asic('_gen_fixednorm_006', alloc*kfirst*0.34)
    sys.realloc_gpacc(alloc*(1-kfirst))
    sys.use_gpacc = True
    

    The performance is collected as follows:

    perfs = numpy.array([ sys.get_perf(app)['perf']
                for app in self.workload])
    

    This analysis involves a large design space exploration. To speed up the exhaustive search, it runs in parallel on multicore machines.

    Finally, the analysis is plotted in plot().

  3. More examples.

    There are many more analyses in $LUMOS_HOME/lumos/analyses, which can serve as examples of the various functions the Lumos framework provides. Unfortunately, they are less thoroughly documented at the moment.

License

LICENSE TERMS

Copyright (c) 2013-2015, Liang Wang and Kevin Skadron,
Department of Computer Science, University of Virginia
All rights reserved.

Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.
3. Neither the name of the University of Virginia nor the names of its
   contributors may be used to endorse or promote products derived from this
   software without specific prior written permission.
4. All advertising materials or publications mentioning features or use of this
   software must display the following acknowledgment: "This product includes
   software developed by the University of Virginia" and should also cite
   
   L. Wang, K. Skadron, "Dark vs. Dim Silicon and Near-Threshold Computing
   Extended Results," University of Virginia Department of Computer Science
   Technical Report TR-2013-01.

THIS SOFTWARE IS PROVIDED BY THE AUTHORS ''AS IS'' AND ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT
SHALL THE UNIVERSITY OF VIRGINIA BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.

Funding Acknowledgement

This work was supported by the SRC under GRC task 1972.001, the NSF under grants MCDA-0903471 and CNS-0916908, DARPA MTO under contract HR0011-13-C-0022, NSF grant no. EF-1124931, and C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA. The views expressed are those of the authors and do not reflect the official policy or position of the sponsors.