Update of the original article from August 1, 2007. First Published in July 2007.
I’ve noticed this is getting a lot more hits recently, presumably because Google is ranking it higher. On the assumption it’s proving useful, I thought it deserved a 2008 update.
What’s in a Name?
Let’s get some terminology out of the way. I’ve called this “Systemic Requirements” because I prefer that to the two other commonly used titles. The first, system qualities and constraints, sounds to me too dry and academic, lacking the kind of visceral relevance you need to get the right level of business attention. The second, non-functional requirements, I find misleading, as I outlined in a previous article called The Analysis Business: user needs around things like availability are very functional indeed, and calling them non-functional might reduce the prominence given to them. Some people just call them the ilities, which allows you to do the joke “You’ll be ill at ease without your ilities” and make your business sponsor want to punch you. I’m not fussy: as long as you think about them, you can call them what you want. If you’re interested, I took the term from an eBay presentation (slide 5), though it’s been in use for quite a while.
When last I looked, Wikipedia listed 61 systemic requirements so I recommend a visit there too, but I have to say some of them felt to me a bit like splitting hairs. Good to have a grasp of perhaps, but a list that long would be pretty impractical to implement on a project. I’ve opted for 15 and even these can get a bit subtle, so I would suggest you go into this with the idea that not all will apply in all circumstances.
Like most things in life, you’ll have to use a bit of common sense.
We aim to please.
Analysts, developers and architects should collate, question, and prioritise each requirement for each logical component or sub-system in the solution as they apply. Many requirements are interdependent - for example, you can’t set high expectations around availability if you completely neglect reliability. These kinds of requirements, more so than functional requirements, are a vehicle for defining and discussing the operational characteristics of a solution; none of them means very much in isolation. If there are symptoms of incomprehension on the business side, I find it best to illustrate how the solution would operate (or not) if a requirement isn’t met, based on the idea that it’s quite hard to describe something like security but not hard to describe what might happen if you didn’t have any.
The time the component is fully operational, as a percentage of total time
Component C will be fully operational for P% of the time
over a continuous measured period of 30 days
(note that this is equivalent to T minutes downtime)
Don’t fall into the trap of allowing phrases like “24x7” or “100%” for availability. There’s no such thing (unless you have an astronomical budget). Steer the business instead to use percentages: like “99% available”, defined against a time period (e.g. “per calendar month”), and illustrate what the 1% downtime might mean in lost minutes.
Walk through the solution regularly and look for single points of failure (SPOFs). Be prepared for there being different availability requirements for different components in the same solution - but again highlight dependencies between components that may not be available at the same time. For reference, 99% availability is equal to 7.2 hours downtime in any 30 day period, moving to 99.9% reduces that to 43.2 minutes and with 99.99% availability it’s 4.3 minutes. Those extra nines make a big difference, but they will cost more to provide.
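The downtime arithmetic above is easy to reproduce when you need to show the business what a given target means. Here’s a minimal Python sketch; the function names are my own, and the serial calculation assumes component failures are independent, which is an optimistic simplification.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Downtime budget in minutes for an availability target over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - availability_pct) / 100.0


def serial_availability_pct(*component_pcts: float) -> float:
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, which real systems rarely manage.
    """
    combined = 1.0
    for pct in component_pcts:
        combined *= pct / 100.0
    return combined * 100.0


for target in (99.0, 99.9, 99.99):
    print(f"{target}%: {allowed_downtime_minutes(target):.1f} minutes per 30 days")

# Two 99.9% components in series are worse than either one alone.
print(f"{serial_availability_pct(99.9, 99.9):.4f}%")
```

The second function is there to make the dependency point: a solution that needs two components up at the same time can’t promise more than the product of their individual availabilities.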
The component consistently maintains all other systemic requirements as data load demands increase
Component C will provide sufficient capacity for U users each with M Mb of files
Component C will provide sufficient capacity for archived reports for MM calendar months
at a report creation rate of R per calendar month
How do you know when capacity is OK? When the other systemic requirements aren’t causing you pain. “Load” as it’s used here represents data (accounts, reports, general storage) and not activity, which is covered under scalability later.
Work closely with operations staff to get this right. Take measurements of existing systems, look at historic growth as well as predicted (remember that business forecasts can be optimistic as well as pessimistic). Plan for peaks, and have a plan in place for what you will do when capacity is exceeded - let’s face it, one day capacity will be exceeded. Often you can’t afford all the capacity you think you’ll need, so cost out what you can, and make sure any remaining risks and issues are well understood by the business.
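The capacity statements above reduce to simple arithmetic, which makes them easy to cost out with operations. A rough Python sketch follows; the function names and all the figures are invented for illustration, and real growth is rarely as linear as this assumes.

```python
def storage_needed_mb(users: int, mb_per_user: float) -> float:
    """Storage for U users each holding M Mb of files."""
    return users * mb_per_user


def archived_report_count(months_retained: int, reports_per_month: int) -> int:
    """Reports held when archiving MM months at a creation rate of R per month."""
    return months_retained * reports_per_month


def months_until_full(current_mb: float, capacity_mb: float,
                      growth_mb_per_month: float) -> float:
    """Months of headroom left, assuming steady linear growth."""
    if growth_mb_per_month <= 0:
        return float("inf")
    return max(capacity_mb - current_mb, 0.0) / growth_mb_per_month


print(storage_needed_mb(1000, 50))         # hypothetical U and M
print(archived_report_count(24, 500))      # hypothetical MM and R
print(months_until_full(600.0, 1000.0, 100.0))
```

The last function is the one worth showing the business: it turns “one day capacity will be exceeded” into an approximate date.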
The component maintains all other systemic requirements when under multiple simultaneous loads
Component C will support a concurrent group of U users running pre-defined acceptance script S simultaneously
Here we’re looking at business operations being thread and resource safe, with their state well managed. You can prove concurrency with performance and soak tests, but the code walk-through is also a very effective preventative measure. The more operations you can run concurrently, the better overall performance you’ll get, but there’s always a bottleneck somewhere - database connections, serial access to a legacy system, etc. Work out what your code will do if it has to sit in a queue for ages (making everything asynchronous is rarely a realistic option), and create management plans within the software to handle situations where access to a constrained resource approaches deadlock. A useful bit of background here is Amdahl’s Law, for which I once knocked up an illustrative calculator.
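Amdahl’s Law itself fits in a few lines: if a fraction p of the work can run in parallel across n workers, the best possible speedup is 1 / ((1 - p) + p / n). This is not the calculator mentioned above, just a minimal Python sketch with a function name and example figures of my own choosing.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 10% of the work stuck behind a serial bottleneck, adding workers
# gives diminishing returns and the speedup never reaches 10x.
for workers in (1, 2, 10, 100, 1000):
    print(workers, round(amdahl_speedup(0.9, workers), 2))
```

The point for a walkthrough is that the serial part (the legacy system, the single database connection) sets the ceiling, however many threads you throw at the rest.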
The component supports changes that extend or improve the existing business logic, or functionality
This is not a measurable requirement, unless you already know the enhancements. If you do then it’s simply a case of stating:
Component C must support adding additional feature F using less than M man-days effort.
Whether this is then true is so hard to measure as to be practically unworkable. Existing code is easier to change if it is isolated from other code: look for dependencies, and seek to pass handles to these into functions (in the classic IoC style). It’s also easier to change if it’s logically compartmentalised - i.e. it’s located where you’d expect it.
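To make the “pass handles into functions” idea concrete, here is a small Python sketch. The order-pricing scenario, the function names and the tax rates are all hypothetical; the only point is that the dependency arrives as an argument rather than being reached for inside the function.

```python
from typing import Callable


def price_order(quantity: int, unit_price: float,
                tax_rate_for: Callable[[str], float], region: str) -> float:
    """Price an order using whatever tax lookup the caller hands in."""
    net = quantity * unit_price
    return round(net * (1.0 + tax_rate_for(region)), 2)


def table_tax_rate(region: str) -> float:
    """One possible lookup; the rates are made up for the example."""
    return {"north": 0.20, "south": 0.10}[region]


# Production code passes the real lookup; a test can pass a stub instead.
print(price_order(3, 10.0, table_tax_rate, "north"))
print(price_order(3, 10.0, lambda region: 0.0, "anywhere"))
```

Because price_order knows nothing about where tax rates come from, swapping the lookup for a service call or a database read later doesn’t touch the pricing logic.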
My suggestion for sign-off here is just to put a logical plan in place and walk the business through it. If they feel you are leaving too much work for the future at least a debate can be had. Less work later often means more work now, so it’s all about trade-offs.
The component supports changes that add new business logic, or functionality, to cater for previously unknown situations
Not a measurable requirement.
Don’t get caught up in splitting hairs on whether this has been ‘met’ or not. It’s far more likely that so much will change nobody will remember what you meant by it anyway.
The magic phrase here is ‘unknown situations’, and in practice there’s as much luck as judgment in catering for the unknown, but it’s not impossible to make future work easier. Businesses don’t change as much as they think they do, and when they do, everybody knows about it. This is to your advantage, as it means you only need to deal with the known situations (because unknown means everybody is on the back foot to some extent).
What you are looking for is potential reuse and reasonable granularity - both good building blocks for the future. But don’t aim for ‘reuse everywhere’ (a danger in SOA these days) - it’s a myth, as Robert Glass makes clear in Facts and Fallacies of Software Engineering. I prefer the adage that it ain’t reusable until it’s being reused.
The component communicates with neighbouring components, without undue overhead, maintaining other named systemic requirements
The word ‘undue’ is somewhat subjective. As for extensibility, the explanatory notes are a useful guide. Components will either work or they won’t (use of techniques like continuous integration should help identify this early), but what amounts to undue overhead should be a decision for the Principal Architect or TDA.
Most projects have to deal with heterogeneous environments: shrink-wrapped applications talking to bespoke software, legacy, new, third parties, etc. Each one should get its data across these chasms in the most efficient and standardised way practicable.
Practicable means making a compromise between having an enterprise data model used everywhere (giving rise to excess translations into and out of it, for anything but the bespoke applications) and using native lowest common denominator formats, which lead to inextensible, point-to-point integration.
All high-level designs are signed off by [insert appropriate enterprise authority] as meeting the strategic goals of the organisation
Another that can’t really be defined by example. All organisations should have a business strategy, and there should be some kind of one-pager, idealised, Reference Architecture that shows the kind of technology and thinking that would support this strategy, plus a Target Architecture that represents the realistic, costed, project-aligned plan for the current year. Essentially what we’re saying here is that it’s a systemic requirement for there to be some sense of alignment to that. Otherwise there’s a risk that one project’s success will be obtained at a risk to others.
Hard to measure this one, except by going and asking your group/enterprise architecture team if they are happy with your choices (in many organisations you’ll know this already as they will have loaded you up with seven tons of impossibly optimistic PowerPoint driven ‘strategic’ garbage before you even had time for a project kick-off).
But play nice for a second. Many strategic choices are cost based. Sure, make an argument to use a novel app server if it adds real value to your project, but if using the standard one won’t kill you then concentrate your efforts on making your code better.
Turf wars between solution and enterprise architects aren’t worth it. I’ve been on both sides, and in my experience all they do is distract from the main event. Look on corporate standards as a challenge, like integrating with the 1970s mainframe that nobody on the project understands. Make it fun. You’ll ultimately enjoy it more and live a bit longer.
A business operation with a clear start and end point completes within a predefined time period
Acceptance script S completes within T1 seconds on an unloaded system,
and within T2 seconds on a system running at maximum capacity as defined
in the Concurrency requirement
You won’t find ‘performance’ as a systemic requirement on this list. It’s too vague and, on closer inspection, can be broken up into smaller, measurable bits. Latency is one of the sub-requirements that many people mean when they say performance: “I click the send button, and the results return to my screen in less than five seconds” etc.
It’s not worth having hundreds of these on a project because, like strategic compliance, meeting them can become a distraction. A few key operations (especially the ones that customers will benchmark you by) are enough. Any other overly slow operations can be tweaked later. As with Concurrency, walkthroughs are useful here - follow the data through each sub-system and across each interface, look at all the things that could slow it down (especially other operations of the same type).
The life span of the component is compatible with the business’s Application Portfolio Management (APM) plan
Add the owner of the APM plan to the reviewer list of the high-level architecture documents that detail which components are new, changing (and to what extent) and being decommissioned. Or better still get the solution architects to reference the APM plan as part of their design.
An Application Portfolio what? Indeed.
Well, it never ceases to amaze me how many organisations don’t have one. It’s basically a multi-year plan showing exactly how strategic each application (bought or built) is, and thus when it will be replaced, updated, upgraded or decommissioned.
If the support software used in your customer call centre isn’t up to the job, and can’t be extended, or take any more sticky tape to hold it together then it’s pretty safe to say it’s not strategic. Therefore your business shouldn’t be investing too much in it. Therefore you don’t want to build any castles on that sand. Therefore your architecture should reflect that.
The component has the ability to undergo routine maintenance operations in a manner that doesn’t conflict with other systemic requirements
List the maintenance activities that the operations team regularly perform (often a good exercise in itself) and define how and when they are required to be performed (elapsed time, time of day, etc).
Does the component need to be maintained (upgraded, log files cleansed, backed-up, etc) while the business is running? Would shutting it down require action on other applications or components?
The best way to deal with this requirement is to start with operational staff and work through a likely maintenance schedule, then take this to the business and run through maintenance scenarios. Don’t forget unplanned maintenance (e.g. when a database version is updated and regression tests didn’t pick up the fact that it requires a software upgrade to your application server). Then design your architecture accordingly, and get the operations team to sign it off (having good friends inside operations is a recurring pattern in delivering good solutions to meet systemic requirements).
The component has the ability to undergo routine administration operations in a manner that doesn’t conflict with other requirements
List the administration activities that the business regularly perform and define how and when they are required to be performed (elapsed time, time of day, etc)
Administration is different to Maintenance, in the sense that this set of actions is often performed by the business (reference data updates, new products, price plans, promotions, offers etc) whereas operational actions are performed by Operations (systems administrators and the like).
The business should supply functional requirements covering what they need to be able to change in core business data entities, but if they don’t, just map out a list of all the entities in your enterprise model against CRUD actions, and then make sure there’s an answer for every cell. Then create an architecture to support this - and scream from the hilltops if the business thinks it’s OK to have operational staff doing product updates using SQL scripts. It is not. Not only is it risky, it’s time consuming and results in the business blaming IT for not being able to respond quickly to changes.
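The entity-versus-CRUD grid is simple enough to keep in a spreadsheet, but as a sketch of the “answer for every cell” check, here is one way it might look in Python. The entity names and the answers are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

CRUD_ACTIONS = ("create", "read", "update", "delete")


def unanswered_cells(
    entities: Iterable[str],
    answers: Dict[Tuple[str, str], str],
) -> List[Tuple[str, str]]:
    """Return every (entity, action) cell that has no agreed answer yet."""
    return [
        (entity, action)
        for entity in entities
        for action in CRUD_ACTIONS
        if not answers.get((entity, action))
    ]


# Hypothetical entities and answers, purely for illustration.
answers = {
    ("Product", "create"): "business admin screen",
    ("Product", "read"): "business admin screen",
    ("Product", "update"): "business admin screen",
    ("Product", "delete"): "not allowed; products are retired instead",
    ("PricePlan", "read"): "reporting tool",
}

for entity, action in unanswered_cells(["Product", "PricePlan"], answers):
    print(f"No answer yet for {action} on {entity}")
```

Note that “not allowed” counts as an answer. The gaps you are hunting for are the cells nobody has thought about.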
The present (and possibly historical) status of the component can be determined easily by operational and support staff
All errors and exceptions encountered during events E1, E2, E3 are to be
captured by component C and reported into the standard monitoring tool MT
If having operational friends is key to getting systemic requirements met, then this is one of the main ways to make those friends.
Slinging a new system over the wall at operations, and hoping they’ll somehow deal with it, is sadly a common practice. Don’t let it be yours. Use of basic tools such as SNMP, JMX or similar and a few simple dashboard additions can go a long way to help system administrators keep an eye on what’s happening and react better when things go awry. Treat them just like any other customer, and do your damnedest to meet their requirements. Just because your sponsor won’t ever want to know how many HTTP requests there have been to the web server in the last ten minutes doesn’t mean it’s not important.
From a design perspective this is where techniques like AOP become useful. The details of AOP are outside the scope of this list but it’s pretty straightforward: to have a code component remain relatively clean and free from coupling to other system details, you don’t want it to know about things like security, logging or monitoring. These are aspects of the system that apply to all (or most) of your components; they are in a sense the context within which components run, not part of the components themselves - like an inversion of control, just system wide.
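A full AOP framework is more than this list needs, but a Python decorator gives a rough flavour of the idea: the monitoring concern wraps the business function without the function knowing anything about it. This is a sketch under my own assumptions - the logger name, the decorator name and the order function are all invented, and a decorator is only a stand-in for a real aspect weaver.

```python
import functools
import logging
import time

logger = logging.getLogger("monitoring")  # hypothetical logger name


def monitored(func):
    """Wrap a function so its timing and failures are reported centrally."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Report the failure, then let it propagate unchanged.
            logger.exception("%s failed", func.__name__)
            raise
        elapsed_ms = (time.perf_counter() - started) * 1000
        logger.info("%s completed in %.1f ms", func.__name__, elapsed_ms)
        return result
    return wrapper


@monitored
def place_order(order_id: str, quantity: int) -> dict:
    """Business logic only: no logging or monitoring code in here."""
    if quantity <= 0:
        raise ValueError("quantity must be positive")
    return {"order_id": order_id, "quantity": quantity, "status": "accepted"}


print(place_order("A1", 2))
```

Pointing the logger at whatever the operations team already watches is then a configuration change, not a code change in every component.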
The component has the ability to migrate onto a new architecture, at an acceptable cost
Define future platforms and versions in the requirements. Assess cost of migration (rough order of magnitude is the best you’ll get).
Meeting this doesn’t mean you have to use a JVM (or CLR). And, to be fair, for many organisations this is not a particularly high priority requirement. I like to ask though, as it sometimes opens up some illuminating conversations about infrastructure strategy. It’s extraordinary the number of companies that were “Unix” and are soon to be “Microsoft” (as if it were that simple..) or vice versa.
The component has the ability to deal with inconsistent or erroneous data using a mechanism, and within a time, acceptable to the business
On receipt of a new Order by component C1, items bearing data attributes
not conforming to the defined schema will be submitted to component C2
Application A1 will provide an interface to delete these Orders or manually update
them and resubmit to C1
Define forks in the processing pipeline that can accommodate data that cannot be processed, and how that data will be rectified and resubmitted. Define separately the requirements for how the handling components will be managed (they are in effect applications in their own right and possibly not subject to the same high demands as the main application components).
There are subtle differences between recoverability, resilience and reliability so it’s as well to be clear on this:
Recoverability measures the ability of the application or component to deal with bad data - e.g. a mistyped post code on an order application. It’s wise to accept from the outset that duff data will creep into your production system from time to time and that no amount of defensive coding or techniques like Design By Contract can completely protect you. Rather than allow it to be persisted, I prefer to have a separate queue for items bearing unrecognised attributes and let them be dealt with that way (initially this may have to be manual, but as patterns are detected it is quite easy to automate the process).
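As a sketch of the separate-queue approach, here is a minimal Python version. The schema, the field names and the in-memory queues are hypothetical stand-ins; in a real system the queues would be durable and the quarantine would have its own management screens.

```python
from collections import deque

REQUIRED_FIELDS = {"order_id", "post_code", "quantity"}  # hypothetical schema


def route_order(order: dict, accepted: deque, quarantine: deque) -> bool:
    """Send a conforming order onward; park anything else for later review."""
    fields = set(order)
    missing = REQUIRED_FIELDS - fields
    unknown = fields - REQUIRED_FIELDS
    if missing or unknown:
        # Keep the original order plus the reason, so it can be fixed and resubmitted.
        quarantine.append({
            "order": order,
            "missing": sorted(missing),
            "unknown": sorted(unknown),
        })
        return False
    accepted.append(order)
    return True


accepted, quarantine = deque(), deque()
route_order({"order_id": "1", "post_code": "AB1 2CD", "quantity": 3}, accepted, quarantine)
route_order({"order_id": "2", "postcode": "AB1 2CD", "quantity": 1}, accepted, quarantine)
print(len(accepted), len(quarantine))
print(quarantine[0]["missing"], quarantine[0]["unknown"])
```

Recording why an item was parked is what makes the later automation possible: once the same reason shows up often enough, you can write a rule for it.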
The component has the ability to deal with unreliable but dependent components
Define failure scenarios that might constitute ‘unreliability’ and define expected action for each
If recoverability is the ability to cope with unpredicted data, then reliability is the ability to cope with unpredicted outages of networks and communicating components. Personally, I use Peter Deutsch’s (et al) Fallacies of Distributed Computing as the benchmark for what can go wrong here. Admittedly, some of the fallacies may more properly relate to other systemic features, but as long as they’re covered it doesn’t really matter what name you address them under.
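One common “expected action” for an unreliable dependency is to retry a few times and then fall back to something the business has agreed is acceptable. The Python sketch below shows that one scenario only; the function name, the retry count and the fallback value are my assumptions, and which failures deserve a retry at all is a decision for your own failure scenarios.

```python
import time


def call_with_retry(operation, attempts: int = 3,
                    delay_seconds: float = 0.0, fallback=None):
    """Call an unreliable dependency, retrying on connection-style failures.

    Returns the fallback value if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt < attempts:
                time.sleep(delay_seconds)  # pause before trying again
    return fallback


def fetch_stock_level():
    """Hypothetical dependency that is currently unreachable."""
    raise ConnectionError("stock service unavailable")


print(call_with_retry(fetch_stock_level, attempts=2, fallback="stock level unknown"))
```

Only connection and timeout errors are retried here; anything else is allowed to propagate, on the basis that retrying a genuine bug just hides it.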
The component has the ability to manage its data, and itself, in such a way that reliability and recoverability requirements are consistently maintained
I’m not convinced there’s a way to specify enough requirements or detailed designs to cover all eventualities here, or even that you’d want to. It would be highly un-agile to disappear for months itemising error conditions before the code is even written.
My favoured approach is to specify a consistent and detailed error reporting scheme that allows them to be caught in integration testing, and fixed as you go
To make this definition clear: A Resilient system is able to deal with internal failures, as opposed to external (reliability) or data (recoverability) failures, whilst continuing to deliver its defined service. It’s therefore closely related to consistency in that a resilient component is always in a “safe” state, which can atomically be returned to if the next operation fails.
It’s a bit more than just making transactions atomic though, and includes considerations like periodic saving or sharing of state data, load balancing etc.
The component consistently maintains other (specified) systemic requirements as activity load demands vary
Component C will support a P1% increase in the number of concurrent active users U (simulated by running acceptance script S) whilst maintaining average response times within P2% of the latency requirement
Perhaps one of the most misused systemic requirements - scalability is not the same as performance, but the ability to maintain other systemic requirements as active users, or business operations, or whatever, changes (which usually means “increases”, but of course you may want to scale down as well as up).
Three things contribute to scalability:
The most important of these is the design for scalability - a design that explicitly promotes scalability has to, according to Brewer’s Theorem, compromise on Consistency or Partition Tolerance (a symptom of reliability).
Simplistically put, the more consistent you force your data to be within any one component the further away from linear the resource-to-performance graph for that component will be, i.e. if you write every received order to a database as it comes in then you probably have strong consistency, but will find, at high loads, that the marshalling and blocking and locking and queuing that needs to happen to keep your data in good shape will reduce the ability to scale.
The second thing you need is some volumetric prediction of what the scaling needs are so you can design your scalability and consistency tradeoffs accordingly. Werner Vogels has a good post on this.
The third thing you need is a Scalability Plan to meet the predicted loads. For example, in the world of smaller/cheaper boxes, there should be a clear horizontal scalability path, not one requiring that servers get more memory added when they run out. And don’t forget that horizontal scalability has implications for shared data (e.g. state) and application design that should have clear answers in a walk through. Having a plan might sound obvious, but it’s here to make a distinction between a system that’s already been load-tested to meet the volumetric prediction (and therefore scales from today’s loads to tomorrow’s loads without any operator action required) and a system that hits its limits and needs some kind of response, which will have a cost associated with it. If that action is cost-effective you have scalability.
The component operates in such a way as to meet regulatory, compliance and local security standards
Security is a whole topic unto itself, so it’s only here as a reminder. I have a 100+ page document for security audits and I’m not even an expert in this area. One day I will put it online, but for the purpose of this checklist we’ll have to be satisfied with ‘security’ as a catch-all term, and defer to your specific support in this area.
All I can say from my experience is talk to security teams early and do what you can to get sign-off. Let project managers deal with discrepancies arising from cost etc.
The component meets redefined operational demands in extremis
Chances are when there’s a major disaster, such as a data-centre flood or fire, your business will accept a reduced set of operational parameters. (Note that survivability only applies at enterprise/business level, having one component or application at full operational readiness when everything else is wet or burnt is pretty useless business-wise.)
Only the most real-time continuous businesses expect to meet high throughput and low latency targets after a catastrophe has befallen them. Many wouldn’t have the cash to pay for it. But still, it’s always worth the discussion because of what may come up.
Start with business continuity planning (BCP) - what processes need to continue, where users get relocated, what the alternative operational plan looks like. These are business issues that need answers before any technology architecture can be selected to support them.
Then look at Disaster Recovery (DR) options - SAN replication, log shipping, etc. and design the architecture accordingly. If the business wants a DR solution but doesn’t have a BCP (no?.. really?.. some businesses want technology to solve business problems?.. surely shome mishtake..) escalate the issue away from the solution level. Many a good project has gone off the rails by this kind of über-scope-creep.
The component can handle specified levels of transactional activity
Component C will support N1 transactions per second, of type T1, on an
unloaded system, and N2 transactions per second, of type T2, on a system
loaded as set in the concurrency requirement
Another spin off from the world of performance this one, but thankfully easy to define and test.
The hardest part is in agreeing what a transaction, in the context of an isolated project or component, is (something meaningful to the business is preferable but not always achievable), and then where it begins and ends - it’s also worthwhile being clear about what’s in and out of scope for a transactional Service Level Agreement early on. On more than one occasion I have found suppliers inadvertently agreeing to include “the internet” as part of the transaction they will guarantee throughput times on.
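Once the boundaries of a transaction are agreed, the measurement itself is straightforward. The Python sketch below times a callable that stands in for one transaction; the function names are mine, and a real test would drive the system from a proper load tool rather than a single-threaded loop.

```python
import time


def measure_throughput(transaction, count: int) -> float:
    """Run a transaction `count` times and return transactions per second."""
    started = time.perf_counter()
    for _ in range(count):
        transaction()
    elapsed = time.perf_counter() - started
    return count / elapsed if elapsed > 0 else float("inf")


def meets_throughput(measured_tps: float, required_tps: float) -> bool:
    """Compare a measurement against the N1 or N2 figure in the requirement."""
    return measured_tps >= required_tps


def sample_transaction() -> None:
    """Hypothetical stand-in for a real business transaction."""
    sum(range(1000))


tps = measure_throughput(sample_transaction, 500)
print(f"{tps:.0f} transactions per second")
```

Whatever goes inside the callable is exactly what you are signing up to guarantee, which is a good reason to keep “the internet” out of it.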
(I used to call this Transactional Integrity, but it seems Consistency is the preferred term these days)
The component maintains data in a predictable, and consistently safe, state
Transactions of type T1, T2, T3.. Tn will have ACID properties
ACID - Atomic, Consistent, Isolated and Durable
Atomic - Means that the operation will either work or not work (not half-work).
Consistent - Means that the operation takes the data from one valid state to another, so every defined rule and constraint still holds afterwards.
Isolated - Means that its working, or not working, and its effects, will not be impacted by unconnected events.
Durable - Means once it’s worked, its results will remain (unless the resulting data is modified by another ACID transaction of course).
To be contrasted, when thinking about scalability and availability, with:
BASE - Basically Available, Soft state, Eventually consistent
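To see the Atomic property from the ACID list in action, here is a small Python sketch using the standard library’s sqlite3 module. The accounts table, the names and the amounts are invented for the example; the point is only that a transfer either happens in full or leaves no trace.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
    conn.execute("INSERT INTO accounts VALUES ('bob', 0)")
    conn.commit()
    return conn


def transfer(conn: sqlite3.Connection, source: str, target: str, amount: int) -> bool:
    """Move funds in one transaction; True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn: sqlite3.Connection) -> dict:
    """Return the current balance of every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = make_db()
print(transfer(conn, "alice", "bob", 150), balances(conn))
print(transfer(conn, "alice", "bob", 40), balances(conn))
```

The first transfer credits bob before discovering that alice can’t cover it. The constraint failure rolls the whole thing back, so the credit never becomes visible: it didn’t half-work.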