Lumos is a framework to analytically quantify the performance limits of many-core, heterogeneous systems operating at low supply voltages (e.g. near-threshold). Due to the limited scaling of supply voltage, power density is expected to grow in future technology nodes. This increasing power density potentially limits the number of transistors that can switch at full speed. Near-threshold operation can increase the number of simultaneously active cores, at the expense of much lower operating frequency ("dim silicon"). Although promising for overall throughput, dim cores suffer from diminishing returns as the number of cores increases; at that point, hardware accelerators become the more efficient alternative. Lumos was developed to explore this broad design space.
Lumos is written mostly in Python, with external modules written in C. The latest Lumos requires Python 3.4 and a recent GCC, and has been tested on an Ubuntu 14.04 box with Python 3.4.3 and GCC 4.8.
numpy and scipy
These packages are used to process technology model files.
This package is required to generate plots for analyses. It is usually included in the package repositories of popular Linux distributions (e.g. Ubuntu); otherwise, you can always get it from matplotlib. Lumos has been tested with matplotlib-1.1.1rc.
This package is required to parse the descriptions of kernels and workloads stored in XML files. It is included in the package repository of Ubuntu, and can also be downloaded from lxml. Lumos has been tested with lxml-2.3.2.
This package is required to parse configurations. It can be installed using
pip install configobj
This package is required to model the directed-acyclic-graph-based
application model. It can be installed using
pip install python-igraph
If the igraph C library is not installed (on Ubuntu, e.g.
apt-get install libigraph0 libigraph0-dev), the above
procedure will try to compile the library first. In this case, GCC is required.
This package is only required for unit tests.
This is as simple as running the following command in the home directory of Lumos:
make -C cacti-p
Now it is ready to go. Aside from the external cacti-p module built above, Lumos itself is pure Python and needs no further compilation (technically speaking, the interpreter will "compile" all Python scripts to bytecode to accelerate execution, but this is transparent to users). Just follow these steps:
Set the environment variable LUMOS_HOME to the root directory of the Lumos distribution:

cd Lumos-0.1
export LUMOS_HOME=(full-path-to)/Lumos-0.1
Run the sample analysis:
Now the plots for this sample analysis should be ready to check out in
The model includes technology model, cores, accelerators, and applications.
The supported technology libraries are:
Lumos provides a factory generator to retrieve supported technologies by the name and variant of a technology model:
from lumos.model.tech import get_model
techmodel = get_model('cmos', 'hp')
To create an object for conventional cores, four arguments have to be specified: the technology node, the technology name, the technology variant, and the core type. For example, the following code snippet creates an in-order core using CMOS technology with the hp variant at 45nm:
from lumos.model.core import BaseCore
core = BaseCore(45, 'cmos', 'hp', 'io')
All supported combinations are listed as follows:
| Technology | Variants | Core types | Nodes (nm) |
|---|---|---|---|
| cmos | hp, lp | io, o3 | 45, 32, 22, 16 |
| finfet | hp, lp | smallcore, bigcore | 20, 16, 14, 10, 7 |
| tfet | homo30nm, homo60nm | io, o3, smallcore, bigcore | 22 |
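The combinations above can be captured as a small lookup table. The sketch below is illustrative only (the factory in lumos.model.tech remains the authoritative source); is_supported() is a hypothetical helper, not part of the Lumos API:

```python
# Supported (technology, variant, core type, node) combinations, as a
# plain dictionary mirroring the table above.
SUPPORTED = {
    'cmos':   {'variants': ('hp', 'lp'),
               'cores': ('io', 'o3'),
               'nodes': (45, 32, 22, 16)},
    'finfet': {'variants': ('hp', 'lp'),
               'cores': ('smallcore', 'bigcore'),
               'nodes': (20, 16, 14, 10, 7)},
    'tfet':   {'variants': ('homo30nm', 'homo60nm'),
               'cores': ('io', 'o3', 'smallcore', 'bigcore'),
               'nodes': (22,)},
}

def is_supported(tech, variant, core, node):
    """Check whether a combination appears in the table above."""
    entry = SUPPORTED.get(tech)
    return (entry is not None
            and variant in entry['variants']
            and core in entry['cores']
            and node in entry['nodes'])
```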
To be edited
Data extracted from table 5 of the paper [sirius-asplos15]:
[sirius-asplos15] Sirius: An Open End-to-End Voice and Vision Personal Assistant and Its Implications for Future Warehouse Scale Computers, ASPLOS '15.
[ucore-micro10] Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?, MICRO '10.
The paper reported the speedup achieved by an FPGA (Xilinx ML605) compared to a conventional CPU (Xeon E3-1240 v3). We scale the performance speedup as if the baseline were a Core i7-960, using SPECfp2006 scores. According to published SPECfp2006 results, the Core i7-960 and the Xeon E3-1240 v3 score 43.5 and 75.6, respectively. Therefore, the speedup numbers in Table 5 of [sirius-asplos15] are scaled by a factor of 75.6/43.5 = 1.738 for the FPGA accelerator. Further, the area of the FPGA is estimated by assuming each LUT takes 0.00191 mm^2 at 40nm; the total number of LUTs for the Virtex-6 LX240T is 37680*8. The resource utilization of each kernel is listed as follows:
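The scaling arithmetic above can be written out directly. This is a minimal sketch using only the numbers quoted in the text; the function name scaled_fpga_speedup is ours, not a Lumos API:

```python
# SPECfp2006 scores quoted above.
SPECFP_I7_960 = 43.5        # baseline: Core i7-960
SPECFP_E3_1240V3 = 75.6     # CPU used in [sirius-asplos15]

SCALE = SPECFP_E3_1240V3 / SPECFP_I7_960  # ~1.738

def scaled_fpga_speedup(reported_speedup):
    """Re-baseline an FPGA speedup reported against the Xeon E3-1240 v3."""
    return reported_speedup * SCALE

# Area estimate: 0.00191 mm^2 per LUT at 40nm, 37680*8 LUTs in total.
AREA_PER_LUT_MM2 = 0.00191
TOTAL_LUTS = 37680 * 8
TOTAL_AREA_MM2 = AREA_PER_LUT_MM2 * TOTAL_LUTS
```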
A typical analysis in the Lumos framework involves three steps: define the workload, define the system, and perform the analysis.
The workload in the Lumos framework is defined as a pool of applications. Each
application is divided into serial and parallel parts, and the parallel ratio is
f_parallel. Part of an application can also be partitioned into
several computing kernels. These kernels can be accelerated by various computing
units, such as many (possibly dim) CPU cores, reconfigurable logic (RL), and
customized ASICs. We model the speedup and the power consumption of RL and
customized ASICs for a given kernel with u-core parameters.
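As a back-of-the-envelope illustration of this model (not the Lumos implementation), an Amdahl-style estimate of application speedup under the serial/parallel split and a single accelerated kernel could look like the following. All names are ours, and we assume for simplicity that the kernel lies inside the parallel part:

```python
def app_speedup(f_parallel, cov, u_perf, n_cores=1):
    """Amdahl-style speedup over one baseline core.

    f_parallel: parallel fraction of the application
    cov: fraction of runtime spent in the accelerated kernel
         (assumed to lie inside the parallel part)
    u_perf: relative performance of the accelerator (u-core)
    n_cores: throughput cores running the remaining parallel part
    """
    serial = 1.0 - f_parallel            # runs on one core, unaccelerated
    kernel = cov / u_perf                # runs on the accelerator
    rest = (f_parallel - cov) / n_cores  # runs on the throughput cores
    return 1.0 / (serial + kernel + rest)
```

For example, a fully parallel application whose only kernel covers half the runtime and gains 10x on an accelerator is limited to 1/0.55 ≈ 1.8x on a single core, which is the kind of bottleneck the system-level analyses below quantify.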
A workload is defined by enumerating all applications in the format of XML, such as:
<workload>
  <app name="app0">
    <f_parallel>1</f_parallel>
    <kernel_config>
      <kernel name="ker005" cov="0.4826"/>
    </kernel_config>
  </app>
  ...
</workload>
f_parallel is the parallel ratio. Within kernel_config,
name is the name of the kernel, and
cov is the kernel's execution time
as a percentage of the application running on a single baseline core (e.g. an
in-order core at the nominal supply voltage and 45nm). Coverages of kernels do
not necessarily sum to 100%; Lumos assumes the rest of the application cannot
be accelerated and is executed only on conventional cores.
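A workload file in this format can be parsed with any XML library. Below is a minimal sketch using the standard-library ElementTree (Lumos itself uses lxml); the inline document and the parse_workload() helper are ours:

```python
import xml.etree.ElementTree as ET

# Inline example in the workload format described above.
WORKLOAD_XML = """
<workload>
  <app name="app0">
    <f_parallel>1</f_parallel>
    <kernel_config>
      <kernel name="ker005" cov="0.4826"/>
    </kernel_config>
  </app>
</workload>
"""

def parse_workload(xml_text):
    """Parse a workload document into {app_name: {f_parallel, kernels}}."""
    apps = {}
    root = ET.fromstring(xml_text)
    for app in root.findall('app'):
        f_parallel = float(app.findtext('f_parallel'))
        kernels = {k.get('name'): float(k.get('cov'))
                   for k in app.find('kernel_config').findall('kernel')}
        apps[app.get('name')] = {'f_parallel': f_parallel,
                                 'kernels': kernels}
    return apps

apps = parse_workload(WORKLOAD_XML)
```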
A set of kernels is enumerated in the format of XML as well, such as:
<kernels>
  <kernel name="ker005">
    <fpga miu="20"/>
    <asic miu="100"/>
    <occur>0.003206</occur>
  </kernel>
  ...
</kernels>
miu is the relative performance of the kernel on FPGA and ASIC, and
occur is the probability of this kernel being
present in an application.
There are a couple of helper functions to assist you in generating kernels and
applications following certain statistical distributions; see
build_fixedcov() for more details. Moreover,
existing XML descriptions can be loaded by
kernel.load_xml() for kernels and
workload.load_xml() for workloads.
Lumos supports conventional cores, such as a Niagara2-like in-order core
(IOCore) and an out-of-order core (
O3Core), as well as unconventional cores, such as u-cores (
UCore), and federated cores.
On top of these cores, Lumos supports two kinds of systems: a homogeneous
multi-core system (
HomogSys), and a heterogeneous
multi-core system with a serial core, a certain number of throughput cores, and
accelerators (
HeterogSys). The usage of
these two systems is demonstrated later in the Example Analysis section.
HeterogSys provides a method
get_perf() to get the relative
performance of the system for a given application.
HomogSys provides a couple of methods to retrieve the
relative performance of the system for a given application:
Explicit constraint on the supply voltage.
In this case, the system will try to enable as many cores at the given supply
as possible within the given power budget. If the supply voltage is relatively
high, it ends up with a dark silicon homogeneous many-core system. Use
perf_by_vdd() for this scenario.
Explicit constraint on the number of active cores.
In this case, the system will probe for the highest supply voltage for the
core to meet the overall power budget. Use
perf_by_cnum() for this scenario.
No constraints on either the supply voltage or the number of active cores.
In this case, the system will probe for the optimal configuration of supply
voltage and the number of cores to achieve the best overall throughput. Use
opt_core_num() for this scenario.
Example of using HomogSys

An example analysis of HomogSys is in
$LUMOS_HOME/lumos/analyses/homosys_example.py. This example models a
homogeneous many-core architecture composed of Niagara2-like in-order cores.
The system is defined as follows:

sys = HomogSys()
sys.set_sys_prop(area=self.sys_area, power=self.sys_power)
sys.set_sys_prop(core=IOCore(mech=self.mech))
The analysis compares four scenarios applied to the system: 1) dim cores without consideration of variation-induced frequency penalty; 2) dim cores considering the frequency penalty; 3) dim cores with the frequency penalty reduced to 0.5 and 0.1, respectively; 4) dark cores at the maximum supply voltage (1.3x nominal) and frequency. Each scenario requires some tweaks to the system parameters; for example, the second scenario requires:
More details can be found in
analyze(). For each
scenario, the relative performance is obtained by
opt_core_num() as follows:

ret = sys.opt_core_num()
ret['perf']
plot_series() is used to generate a
plot for the above scenarios, as in
Example of using HeterogSys

An example analysis of HeterogSys is in
$LUMOS_HOME/lumos/analyses/heterosys_example.py. This example models a
heterogeneous many-core system composed of in-order cores, reconfigurable
logic (FPGA), and dedicated ASICs. All related files for this analysis are in
$LUMOS_HOME/analyses/heterosys_example. For maximum flexibility,
this example employs an external configuration file to specify various input
parameters in addition to command-line parameters. The default configurations
are stored in heterosys_example.cfg. This example uses pre-defined synthetic
kernels and workloads stored in
workload_norm40x10.xml. The analysis will load kernels and the workload as follows:
kernels = kernel.load_xml(options.kernel)
workload = workload.load_xml(options.workload)
The system is defined as follows:

sys = HeterogSys(self.budget)
sys.set_mech('HKMGS')
sys.set_tech(16)
if kfirst != 0:
    # there are ASIC accelerators to be added
    sys.set_asic('_gen_fixednorm_004', alloc*kfirst*0.33)
    sys.set_asic('_gen_fixednorm_005', alloc*kfirst*0.33)
    sys.set_asic('_gen_fixednorm_006', alloc*kfirst*0.34)
sys.realloc_gpacc(alloc*(1-kfirst))
sys.use_gpacc = True
The performance is collected as follows:
perfs = numpy.array([sys.get_perf(app)['perf'] for app in self.workload])
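With per-application performance collected in perfs, the analysis can aggregate it into a single metric. A small sketch with numpy (the sample values are made up, not taken from the example analysis):

```python
import numpy

# Hypothetical per-application relative performance values.
perfs = numpy.array([1.0, 2.0, 4.0])

mean_perf = perfs.mean()                           # arithmetic mean
geomean_perf = numpy.exp(numpy.log(perfs).mean())  # geometric mean
```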
This analysis involves a large design space exploration. To accelerate the exhaustive search, this analysis also takes advantage of parallel execution to run fast on multicore machines.
Finally, the analysis is plotted in
There are many analyses in
$LUMOS_HOME/lumos/analyses, which can be
used as examples of the various functions the Lumos framework provides.
Unfortunately, they are not well documented at this moment.
LICENSE TERMS

Copyright (c) 2013-2015, Liang Wang and Kevin Skadron, Department of Computer Science, University of Virginia. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. Neither the name of the University of Virginia nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
4. All advertising materials or publications mentioning features or use of this software must display the following acknowledgment: "This product includes software developed by the University of Virginia" and should also cite L. Wang, K. Skadron, "Dark vs. Dim Silicon and Near-Threshold Computing Extended Results," University of Virginia Department of Computer Science Technical Report TR-2013-01.

THIS SOFTWARE IS PROVIDED BY THE AUTHORS ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF VIRGINIA BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
This work was supported by the SRC under GRC task 1972.001, the NSF under grants MCDA-0903471 and CNS-0916908, DARPA MTO under contract HR0011-13-C-0022, NSF grant no. EF-1124931, and C-FAR, one of six centers of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA. The views expressed are those of the authors and do not reflect the official policy or position of the sponsors.