Wednesday, January 7, 2026

Fundamentals of Software Architecture - An Engineering Approach

2024-2026--fundamentals-of-software-architecture

Notes for the book Fundamentals of Software Architecture - An Engineering Approach by Mark Richards and Neal Ford.

Brief Summary & thoughts

The book had a certain vision of the software architect role, but I missed some discussion of roles at different levels of architecture, e.g. system design, solution, enterprise, etc. (See e.g. the SFIA skills framework for Systems Design vs Enterprise architecture.)

Some high-level take-aways

  • Importance of context - Past decisions were based on realities of the environment from that time
  • Why is more important than how. (Second law)

Intros

  • Preface: Invalidating axioms
    • Mathematicians build theories based on axioms
    • Software architects also build theories atop axioms, but the software world is softer than mathematics.
    • It is important to question assumptions and axioms left over from previous eras.
    • Subtitle "Engineering approach" - Engineering discipline, implying repeatability, rigor and effective analysis.
    • Trade-off analysis is important!

Why no clear career path for software architects?

  • Definition of software architecture or architect is difficult

    • Definition of a software architect from Martin Fowler's Who Needs an Architect
      • Architecture is about the important stuff... whatever it is (Ralph Johnson)
  • Massive amount and scope of responsibility

  • Constantly moving target

  • Much of the material about software architecture has mostly historical relevance.

  • NOTE: When studying architecture, keep in mind that it can only be understood in context

    • Past decisions were based on realities of the environment from that time

One way to think about software architecture:

  • Structure of the system
  • Architecture characteristics ("-ilities")
  • Architecture decisions
  • Design principles
    • Principle is a guideline rather than a hard rule.

Expectations of an architect (instead of defining the role)

  • Define the architecture decisions and design principles
  • Continually analyze the architecture
  • Keep current with latest trends
  • Ensure compliance with decisions
  • Diverse exposure and experience
  • Have business domain knowledge
  • Possess interpersonal skills
    • Gerald Weinberg: "no matter what they tell you, it's always a people problem."
  • Understand and navigate politics
    • Almost every decision an architect makes will be challenged -> negotiation skills

Engineering practices

  • All architectures become iterative because of unknown unknowns, Agile just recognizes this and does it sooner. (Mark Richards)
  • See also Building Evolutionary Architectures
    • Architectural fitness functions

Laws of Software Architecture

  • Everything in software architecture is a trade-off (First law)
  • If an architect thinks they have discovered something that isn't a trade-off, more likely they just haven't identified the trade-off yet. (Corollary 1)
  • Why is more important than how. (Second law)

Part I - Foundations

Chapter 2: Architectural Thinking

Four aspects

  • Understanding the difference between architecture and design
    • In traditional view, unidirectional flow from the architect to the developer
    • To make architecture work, both the physical and virtual barriers between architects and developers must be broken down.
  • Having a wide breadth of technical knowledge
    • Three sections
      • Stuff you know (and need to maintain)
      • Stuff you know you don't know
      • Stuff you don't know you don't know
    • Architects should focus on technical breadth more than depth.
  • Understanding and analyzing trade-offs
    • Architecture is the stuff you can't Google (Mark Richards)
    • There are no right or wrong answers in architecture - only trade-offs (Neal Ford)
    • Programmers know the benefits of everything and the trade-offs of nothing. Architects need to understand both (Rich Hickey)
  • Understanding the business drivers and their importance

Chapter 3: Modularity

  • Physics analogy: Energy must be added to a physical system to preserve order. The same is true for software systems - good structural soundness won't happen by accident or by itself.
  • Modularity definition of book: Logical grouping of related code (actual term varies by programming language etc.)
    • For discussions about architecture, a logical separation, not necessarily "physical"

Three metrics/concepts for modularity:

  • Cohesion - To what extent the parts of a module should be contained within the same module
  • Coupling
    • Afferent & efferent couplings and other metrics in Software package metrics
    • NOTE: Coming from the age of structured programming, before e.g. object-oriented languages
  • Connascence
    • Two components are connascent if a change in one would require the other to be modified in order to maintain the overall correctness of the system (Meilir Page-Jones)
    • Static connascence (source-code-level coupling) vs dynamic (execution-time) connascence
    • Strength of connascence (see e.g. the blog post About connascence)
    • Three guidelines from Page-Jones for using connascence to improve system modularity:
      • Minimize overall connascence by breaking the system into encapsulated elements.
      • Minimize any remaining connascence that crosses encapsulation boundaries.
      • Maximize the connascence within encapsulation boundaries.
    • Jim Weirich's two principles (see the sketch after this list)
      • Rule of Degree: Convert strong forms of connascence into weaker forms of connascence.
      • Rule of Locality: As the distance between software elements increases, use weaker forms of connascence.
    • Problems with 1990s connascence
      • Low-level, focusing on code quality and hygiene more than architectural structure
      • Doesn't really address a fundamental decision - synchronous or asynchronous communication in distributed architectures
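
(Own sketch, not from the book) A minimal Python illustration of Weirich's Rule of Degree: connascence of position (callers must know the argument order) is converted into the weaker connascence of name (callers depend only on parameter names). The address-formatting functions are made up for the example.

```python
# Connascence of position: every caller must know that the first argument is
# the street and the second the city; changing the order breaks all callers.
def format_address_positional(street, city):
    return f"{street}, {city}"

print(format_address_positional("Main St 1", "Springfield"))

# Weaker connascence of name: callers depend only on the parameter names,
# so the parameter order can change freely without breaking them.
def format_address_named(*, street, city):
    return f"{street}, {city}"

print(format_address_named(city="Springfield", street="Main St 1"))
```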

Own book recommendation related to modularity: A Philosophy of Software Design

Chapter 4: Architecture Characteristics Defined

  • Architectural characteristics: "All the things the software must do that aren't directly related to the domain functionality"
    • The traditional term "non-functional requirement" is not preferred, as "non-functional" doesn't sound like something worth paying attention to.
  • Three criteria for architectural characteristics
    • Specifies a non-domain design consideration
    • Influences some structural aspect of the design
    • Is critical or important to application success
  • NOTE: A critical job for architects lies in choosing the fewest architecture characteristics rather than the most possible.
  • The chapter lists example architecture characteristics (operational, structural and cross-cutting)
  • Quote: Never shoot for the best architecture, but rather the least worst architecture.
  • See also the book Building Evolutionary Architectures

Chapter 5: Identifying Architectural Characteristics

At least three ways to extract architecture characteristics:

  • from domain concerns
    • For prioritization, a good approach is to have the domain stakeholders select the top three most important characteristics from the final list.
    • Note: Often domain concern consists of multiple architecture characteristics. E.g. "time to market" comes from agility + testability + deployability
  • from requirements
  • from implicit domain knowledge

For practising, see Architectural Katas

Quote: There are no wrong answers in architecture, only expensive ones.

Chapter 6: Measuring and Governing Architecture Characteristics

  • Challenges around the definition of architecture characteristics
    • They aren't physics
    • Wildly varying definitions
    • Too composite
  • Objective definitions solve all three problems.
  • Architecture fitness functions
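
(Own sketch) As a hypothetical example of an architecture fitness function: an objective, automatable check of one structural rule, here that domain modules must not depend on infrastructure modules. The module names and the dependency map are invented; in practice the map would be extracted from the codebase.

```python
# A hypothetical fitness function: fail the build if any "domain" module
# depends on an "infrastructure" module.
DEPENDENCIES = {
    "domain.orders": {"domain.customers"},
    "domain.customers": set(),
    "infrastructure.postgres": {"domain.orders"},
}

def check_domain_independence(deps):
    """Return all (module, dependency) pairs that violate the layering rule."""
    return [
        (module, target)
        for module, targets in deps.items()
        if module.startswith("domain.")
        for target in targets
        if target.startswith("infrastructure.")
    ]

if __name__ == "__main__":
    assert not check_domain_independence(DEPENDENCIES), "layering rule violated"
    print("fitness function passed")
```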

Chapter 7: Scope of Architecture Characteristics

  • The metrics discussed earlier, e.g. afferent/efferent coupling, are at too fine-grained a level for architectural analysis.
  • Connascence at the service level in a (micro-)service architecture
    • static connascence: two services share the same class definition
    • dynamic connascence:
      • synchronous: a synchronous call between two distributed services
      • asynchronous: asynchronous calls allowing fire-and-forget semantics
  • Defining Architecture quantum: An independently deployable artifact with high functional cohesion and synchronous connascence
    • Architecture quantum provides a new scope for architecture characteristics
    • In modern systems, architects define architecture characteristics at the quantum level rather than system level.

Chapter 8: Component Based Thinking

  • Component (here): Physical packaging of modules
    • library: A component wrapping code at a higher modularity than classes/functions
    • layers/subsystems
    • services
  • Architect role (book's view)
    • Typically architect defines, refines, manages and governs components within an architecture
    • Generally, a component is the lowest level of the software system an architect interacts directly with.
  • Two types of top-level architecture partitioning - One of the first decisions an architect must make.
    • Layered (often "technical partitioning")
    • Modular (often "domain partitioning")
  • Conway's Law: Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations.

Part II - Architecture Styles

  • Architecture style/pattern? (Book's definition)
    • Architecture style: The overarching structure of how the user interface and backend source code are organized and how the source code interacts with a datastore
    • Architecture patterns: Lower-level design structures that help form solutions within an architecture style

Chapter 9: Foundations

  • An architecture style name acts as a shorthand between experienced architects

Fundamental Patterns

  • Big Ball of Mud
  • Unitary Architecture - Everything running on one computer
  • Client/Server

Two main types

  • Monolithic
  • Distributed Architectures

For distributed architectures, it's important to be aware of Fallacies of distributed computing

  • The network is reliable
  • Latency is zero
    • Be aware of "long tail" latency - it's important to know the 95th to 99th percentile (see the sketch after this list)
  • Bandwidth is infinite
    • Regardless of the technique, ensure that a minimal amount of data is passed between services in a distributed architecture to avoid bandwidth problems.
  • The network is secure
  • Topology doesn't change
  • There is one administrator
  • Transport cost is zero
  • The network is homogeneous
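
(Own sketch) A small Python illustration of checking the long-tail latency mentioned above: computing the 95th and 99th percentiles from a set of latency samples. The sample data is randomly generated purely for the example.

```python
import random

# Hypothetical request latencies in milliseconds; in practice these would
# come from real measurements, not a random generator.
random.seed(1)
latencies = [random.lognormvariate(3.0, 0.6) for _ in range(10_000)]

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    index = min(len(ordered) - 1, int(round(p / 100 * (len(ordered) - 1))))
    return ordered[index]

print(f"median: {percentile(latencies, 50):.1f} ms")
print(f"p95:    {percentile(latencies, 95):.1f} ms")
print(f"p99:    {percentile(latencies, 99):.1f} ms")  # the "long tail"
```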

Other distributed considerations

  • Distributed logging
  • Distributed transactions
    • Eventual consistency
    • Transactional sagas
    • BASE transactions
      • Basically available
      • Soft state
      • Eventually consistent

Chapter 10 - Layered Architecture Style

  • Often ends up being used when the developer/architect is unsure which architecture style to use, or when the team "just starts coding"
  • Typically four layers: Presentation, business, persistence and database
  • Technically partitioned (opposed to domain-partitioned)
    • Any particular business domain is spread throughout all of the layers of the architecture
  • Can be either
    • closed - request must go through the layer immediately below it to get to the next layer
    • open - allowing requests to pass layers
  • Layer of isolation concept - Changes made in one layer generally don't affect components in other layers, provided the contracts between layers stay unchanged
  • Be aware of architecture sinkhole anti-pattern
  • Why to use?
    • Good choice for small, simple applications or websites.
    • Good choice as a starting point with very tight budget and time constraints

Chapter 11 - Pipeline Architecture Style

  • Also known as pipes and filters
  • Four types of filters (see the sketch after this list)
    • Producer - The starting point
    • Transformer - FP map
    • Tester - FP reduce
    • Consumer - The termination point
  • Blog post More shell, less egg illustrates power of these abstractions
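
(Own sketch) A minimal Python version of the pipeline style, with the four filter types composed as generators; the data and filters are invented for illustration.

```python
# Producer -> Transformer -> Tester -> Consumer, composed as generators.
def producer():
    """Producer: the starting point, emits raw data."""
    yield from ["10", "oops", "25", "3"]

def transformer(items):
    """Transformer: transforms each item it receives (map-like)."""
    for item in items:
        try:
            yield int(item)
        except ValueError:
            pass  # drop unparseable input

def tester(items, threshold=5):
    """Tester: passes on only items that meet a criterion."""
    for item in items:
        if item >= threshold:
            yield item

def consumer(items):
    """Consumer: the termination point, here it just prints."""
    for item in items:
        print(f"consumed {item}")

consumer(tester(transformer(producer())))
```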

Chapter 12 - Microkernel Architecture Style

  • Also known as plug-in architecture
  • Natural fit for product-based applications, e.g. IDEs
  • Basic components
    • Core System (can be implemented as a layered architecture or a modular monolith)
    • Plug-In Components
    • (Additionally) Registry: A common way to manage which plug-in modules are available and how to get them
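
(Own sketch) A hypothetical plug-in registry for a microkernel-style core system: the core dispatches to whatever plug-ins have registered themselves, without knowing them individually. The plug-in names are made up.

```python
# The core system only knows the registry contract, not the individual plug-ins.
PLUGIN_REGISTRY = {}

def register(name):
    """Decorator that adds a plug-in handler to the registry."""
    def decorator(func):
        PLUGIN_REGISTRY[name] = func
        return func
    return decorator

@register("spellcheck")
def spellcheck(document):
    return f"spellchecked: {document}"

@register("word-count")
def word_count(document):
    return f"{len(document.split())} words"

def core_process(document, enabled_plugins):
    """The core system dispatches to whatever plug-ins are registered."""
    return [PLUGIN_REGISTRY[name](document)
            for name in enabled_plugins if name in PLUGIN_REGISTRY]

print(core_process("hello plug-in world", ["word-count", "spellcheck"]))
```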

Chapter 13 - Service-based Architecture Style

  • Hybrid of the microservices style
  • Topology
    • Separately deployed user interface
    • Separately deployed remote coarse-grained services
    • A monolithic database (usually)
  • Service design and granularity
    • Regular ACID database transactions are used to ensure database integrity within a single domain service.
    • Whereas microservice setups typically use BASE transactions
  • When to use?
    • Flexibility
    • One of the most pragmatic architecture styles
    • Natural fit when doing DDD
    • When ACID transactions are needed
    • A good choice for achieving a good level of architectural modularity without complexities and pitfalls of granularity
    • Not so much need for coordination as with other distributed architectures

Chapter 14 - Event-Driven Architecture Style

  • Made up of decoupled event processing components that asynchronously receive and process events
  • Can be used as a standalone style or embedded with other styles
  • How to implement a request-based model with an event-driven style?
    • Request orchestrator - Typically UI but can also be implemented through an API layer or ESB

Two main topology variants

  • Variant: Broker topology
    • no central event mediator
    • Message flow is distributed through a lightweight message broker (e.g. RabbitMQ, ActiveMQ etc)
    • Four primary architecture components
      • Initiating event
      • Event broker
      • Event processor
      • Processing event
    • It is a good practice for each event processor to advertise what it did to the rest of the system (-> provides easy extensibility)
    • Trade-offs of the broker topology
      • (+) Highly decoupled event processors vs (-) Workflow control
      • (+) High scalability vs (-) Error handling
      • (+) High responsiveness vs (-) Recoverability
      • (+) High performance vs (-) Restart capabilities
      • (+) High fault tolerance vs (-) Data inconsistency
  • Variant: Mediator topology
    • Addresses some of the shortcomings of the broker topology
    • Event mediator manages and controls the workflow for initiating events
    • Primary architecture components
      • Initiating event
      • Event queue
      • Event mediator
      • Event channels
      • Event processors
    • Various approaches
    • The book recommends classifying events as simple, hard or complex
      • Simple events with a simple mediator (e.g. Apache Camel or Mule)
      • If the event workflow requires lots of conditional processing / dynamic paths, a mediator such as Apache ODE or Oracle BPEL Process Manager is recommended
    • Main differences to broker topology
      • Central mediator that can maintain event state and manage error handling, recoverability and restart capabilities
      • Typically, in broker topology the messages are events (things that have happened), whereas in mediator topology they are commands (things that need to happen)

Various topics on Event-driven Architecture Style

Asynchronous capabilities

  • Can be a powerful technique for increasing the overall responsiveness of a system
  • Responsiveness vs Performance
    • Responsiveness - Notifying the user that the action has been accepted and will be processed momentarily
    • Performance - Making the end-to-end process faster
  • Main issue with asynchronous communication is error handling

Preventing Data Loss

  • Typical setup - Processor A sends message to a queue and Processor B accepts the message and inserts the message into a database
  • Three areas of data loss
    • Message does not make it to the queue from Processor A - Can be solved with synchronous send with persistent message queue
    • Processor B de-queues the message and crashes before it is processed - Can be solved with client acknowledgement mode
    • Processor B is unable to persist the message - Can be solved with an ACID transaction via database commit and acknowledging the message only after it
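
(Own sketch) One possible shape for the second and third safeguards using RabbitMQ's Python client (pika): a durable queue plus client acknowledgement mode, acknowledging only after the database write has succeeded. The queue name, connection details and the save_to_database helper are assumptions for the example.

```python
import pika  # RabbitMQ client; assumes a broker running on localhost

def save_to_database(body):
    # Placeholder for an ACID transaction that persists the message.
    print(f"persisted {body!r}")

def on_message(channel, method, properties, body):
    save_to_database(body)
    # Acknowledge only after the database commit, so a crash before this
    # point leaves the message on the queue for redelivery.
    channel.basic_ack(delivery_tag=method.delivery_tag)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)  # persistent queue
channel.basic_consume(queue="orders", on_message_callback=on_message,
                      auto_ack=False)  # client acknowledgement mode
channel.start_consuming()
```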

Request-reply messaging (pseudosynchronous communication)

  • Two primary techniques
    • Correlation ID in the request message header and using the same ID in the reply message
    • Temporary queue for the reply
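
(Own sketch) A minimal in-process illustration of the correlation ID technique, using standard-library queues instead of a real broker: the reply carries the same ID as the request so the requestor can match them.

```python
import queue
import threading
import uuid

request_queue = queue.Queue()
reply_queue = queue.Queue()

def replier():
    """Pretend service: reads a request and replies with the same correlation ID."""
    correlation_id, payload = request_queue.get()
    reply_queue.put((correlation_id, f"processed {payload}"))

threading.Thread(target=replier, daemon=True).start()

# Requestor: attach a correlation ID, then wait for a reply carrying it back.
correlation_id = str(uuid.uuid4())
request_queue.put((correlation_id, "order-123"))

reply_id, reply_body = reply_queue.get(timeout=5)
assert reply_id == correlation_id  # match the reply to the original request
print(reply_body)
```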

Trade-offs of event-driven model vs request-based

  • (+) Better response to dynamic user content
  • (+) Better scalability and elasticity, responsiveness and performance
  • (+) Better adaptability and extensibility
  • (-) Only supports eventual consistency
  • (-) Less control over processing flow
  • (-) Less certainty over outcome of event flow
  • (-) Difficult to test and debug

Architectural quantum: Even though all communication in event-driven architecture is asynchronous, if multiple event processors share a single database instance, they are all contained within the same architectural quantum.

Chapter 15 - Space-Based Architecture Style

  • The space-based architecture style is specifically designed to address problems involving high scalability, elasticity and high concurrency.
  • The style gets its name from the concept of tuple space, the technique of using multiple parallel processors communicating through shared memory.
  • Architecture components:
    • Processing unit containing the application code
    • Virtualized middleware used to manage and coordinate the processing units
    • Data pumps to asynchronously send updated data to the database
    • Data writers that perform the updates from the data pumps
    • Data readers that read DB data and deliver it to processing units upon startup.
  • One decision to be aware of : Replicated cache vs distributed cache

Chapter 16 - Orchestration-Driven Service-Oriented Architecture Style

  • Architecture styles must be understood in the context of the era in which they evolved.
  • Late 1990s
    • Operating systems were expensive and licensed per machine. Commercial database servers with Byzantine licensing schemes...
    • -> Architects were expected to reuse as much as possible
    • -> Driving philosophy in this architecture centered around enterprise-level reuse.
  • Topology
    • Business Services - "Entry points"
    • Enterprise Services - Meant as reusable building blocks. (In practice, this often did not work so well)
    • Application Services - one-off single-implementation services
    • Infrastructure Services (e.g. monitoring, logging, ...)
    • Orchestration Engine (ESB?)
      • Typically one or a few relational databases - Transactional behaviour handled in the orchestration engine
      • Because of Conway's Law, the team of integration architects responsible for this engine became a political force within the organization, and eventually a bureaucratic bottleneck
      • Finding the correct level of granularity of transactions became difficult
  • Reuse ... and coupling
    • Aggressive reuse
    • High amount of coupling between components
    • E.g. Customer service ending up including all the details the organization knows about customers
  • Combining the disadvantages of both monolithic and distributed architectures

Chapter 17 - Microservices Architecture

  • The driving philosophy of microservices is the DDD concept bounded context - Each service models a domain or workflow.
  • Primary goal of microservices: High decoupling, physically modeling the logical notion of bounded context.
  • Each service owning its own process
    • Note: This is nowadays practical because of virtual machines & containers, open source operating systems etc.
  • Granularity
    • Difficult - Often the mistake of making services too small
    • Purpose is to capture a domain or workflow
    • Some guidelines
      • Purpose - Functionally cohesive services
      • Transactions - Entities that need to cooperate in a transaction show a good service boundary
      • Choreography - Sometimes it might make sense to bundle services together to avoid the communication overhead.
  • Data isolation - Avoiding shared schemas and databases used as integration points
  • API Layer
    • Sits between the consumers of the system (UIs or calls from external systems) and the services
    • Optional
  • Operational reuse
    • Microservices prefer duplication to coupling
    • Certain parts of the architecture benefit from coupling - such as operational concerns like monitoring, logging and circuit breakers
    • Sidecar pattern, even a service mesh
  • Two styles of user interfaces commonly appear with microservices architecture
    • Monolithic frontend - A single UI that calls through the API layer
    • Microfrontends
  • Communication
    • Big decision - Synchronous or asynchronous communication
    • Microservices often utilize protocol-aware heterogeneous interoperability
      • Standardize how services communicate with each other
  • Choreography and orchestration
    • Choreography
      • Same communication style as a broker event-driven architecture
      • No central coordinator/mediator
    • Orchestration - localized mediator
    • Each option has trade-offs
  • Transactions and sagas
    • The best advice for architects who want to do transactions across services is: Don't!
    • Saga pattern is a popular distributed transactional pattern in microservices.
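
(Own sketch) A toy illustration of the saga idea: local transactions executed in sequence, with compensating actions run in reverse order when a later step fails. The service steps are hypothetical.

```python
# Each saga step pairs a local action with a compensating action that
# semantically undoes it if a later step fails.
def run_saga(steps):
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception as error:
        print(f"step failed ({error}), compensating...")
        for compensate in reversed(completed):
            compensate()

def reserve_stock():   print("stock reserved")
def release_stock():   print("stock released")
def charge_payment():  raise RuntimeError("payment declined")
def refund_payment():  print("payment refunded")

run_saga([
    (reserve_stock, release_stock),
    (charge_payment, refund_payment),
])
```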

Chapter 18 - Choosing the Appropriate Architecture Style

  • Very contextual to a number of factors within an organization and what software it builds.
  • Good to be aware of industry trends, when to follow and when not to.
  • Many factors to take into account, e.g.
    • The domain
    • Architecture characteristics that impact structure
    • Data architecture
    • Organizational factors
    • ...
  • Taking these into account, the architect must make several determinations
    • Monolithic versus distributed?
    • Where should data live?
    • Communication style between services - synchronous or asynchronous?
      • Rule of thumb: Use synchronous by default, asynchronous when necessary.
  • Output of this process is architecture topology.

Part III - Techniques and Soft Skills

Chapter 19 - Architecture Decisions

  • Architecture decisions
    • Usually involve the structure of the application or system
    • May involve technology decisions as well, especially when those impact architecture characteristics.
    • A good architecture decision is one that helps guide development teams in making the right technical choices
  • Three major anti-patterns
    • Usually a progressive flow: overcoming the first leads to the second, etc.
  • Anti-pattern 1: Covering Your Assets
    • Avoiding or deferring decisions out of fear of making the wrong choice.
    • 2 ways to overcome:
      • Wait until the last responsible moment
      • Continuously collaborate with the development teams to ensure that the decision you made can be implemented as expected
  • Anti-pattern 2: Groundhog Day
    • People don't know why a decision was made -> It keeps getting discussed over and over again
    • Background: Missing or incomplete justification
    • To overcome: Provide both technical and business justifications
  • Anti-pattern 3: Email-Driven Architecture
    • People lose, forget or don't even know an architecture decision has been made
    • Ways to overcome:
      • Do not include the architecture decision in the body of an email
      • Instead, mention only the nature and context of the decision and provide a link to the single system of record for the architecture decision
      • Only notify those people who really care about the architecture decision.
  • Architecturally significant decisions
    • Those that affect the structure, non-functional characteristics, dependencies, interfaces, or construction techniques.
  • Architecture Decision Records
    • ADRs, see https://adr.github.io/
    • Basic structure: Title, Status, Context, Decision and Consequences
    • Important to mark Superseded and link back and forth
    • More emphasis on the why rather than how.

Chapter 20 - Analyzing Architecture Risk

  • One of the key activities of architecture
  • Risk matrix: likelihood of the risk occurring × overall impact of the risk (see the sketch after this list)
  • Risk storming exercise, see https://riskstorming.com/
    • Identification, consensus, mitigation
    • Recommendation: Whenever possible, restrict risk storming efforts to a single dimension.
  • Not a one-time process but continuous process
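
(Own sketch) A simple way to compute the risk matrix rating: likelihood and impact each rated 1-3 and multiplied. The bucket thresholds used below are my assumption for illustration, not necessarily the book's exact figures.

```python
# A 3x3 risk matrix: likelihood and impact each rated 1 (low) to 3 (high);
# the product gives an overall risk score which is bucketed into low/medium/high.
def risk_rating(likelihood, impact):
    score = likelihood * impact
    if score <= 2:
        return score, "low"
    if score <= 4:
        return score, "medium"
    return score, "high"

for likelihood, impact, name in [(1, 3, "data loss"), (3, 3, "availability")]:
    score, bucket = risk_rating(likelihood, impact)
    print(f"{name}: {score} ({bucket})")
```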

Chapter 21 - Diagramming and Presenting Architecture

  • No matter how great an architect's technical ideas are, if they can't convince managers to fund them and developers to build them, their brilliance will never manifest.
  • Representational consistency - Always showing the relationships between parts of an architecture
  • Avoid "Irrational artifact attachment"
  • Many tools, the authors happily used OmniGraffle for the diagrams of the book
    • Good to have: Layers, Stencils/templates, Magnets
    • Standards: UML, C4 and ArchiMate
  • Various tips & links

Chapter 22 - Making Teams Effective

  • (Own comment) Here the role of an architect was left somewhat open to me.
  • Promoting close cooperation between an architect and development teams.
  • Important: Create and communicate the constraints within which developers can implement the architecture. Avoid making them too tight or too loose.
  • Architect personalities:
    • Architect personality: Control Freak
      • (Own comment) In the book, architect is seen as a role that does not code
      • Tries to control every detailed aspect of the software development process.
    • Architect personality: Armchair Architect
      • Quote: Here's a dirty little secret about architecture - it's really easy to fake it as an architect
      • It might not be the intention of an architect to become an armchair architect, but rather it just happens by being spread too thin between projects or development teams and losing touch with technology or the business domain.
      • To avoid, get more involved in the technology being used in the project and understanding the business problem and domain.
    • Architect personality: Effective architect
      • Produces the appropriate constraints and boundaries on the team.
      • Ensures the team members are working well together.
      • Provides the right level of guidance to the team.
  • How much control?
    • Elastic Leadership
    • Finding the right level of control (balancing between control freak and armchair architect) depends on 5 factors
      • Team familiarity (with each other)
      • Team size
      • Overall experience
      • Project complexity
      • Project duration
    • Note: This changes as the system continues to evolve and the project goes on
    • Model from the book: assign each factor a value between -20 and +20 and sum them
  • Three factors when considering a proper team size
    • Process loss (Brooks's Law from The Mythical Man-Month)
    • Pluralistic ignorance - Occurs as the team size gets too big (compare with "The Emperor's New Clothes")
    • Diffusion of Responsibility
  • Leveraging Checklists
    • See The Checklist Manifesto
    • Good candidates for checklists: Processes that don't have any procedural order or dependent tasks.
    • Three key checklists
      • Developer code completion checklist (~DoD)
      • A unit and functional testing checklist
      • Software release checklist
  • On Providing Guidance
    • (Discussion on code-level 3rd party libraries, frameworks etc)
    • When somebody is proposing a library/framework, first ask the following questions
      • Are there any overlaps between the proposed library and existing functionality in the system?
      • What is the justification for the proposed library
    • (Own comment) Here the architect role already goes quite deep into the details, too.

Chapter 23 - Negotiation and Leadership Skills

  • A software architect must understand the political climate of the enterprise and be able to navigate the politics.
    • Reason: Almost every decision an architect makes will be challenged
  • Negotiating with Business Stakeholders (Various tips)
    • Leverage the use of grammar and buzzwords to better understand the situation
    • Gather as much information as possible before entering the negotiation
    • Validate concerns first, then move on to the negotiation
    • When all else fails, state things in terms of cost and time
    • Leverage the "divide and conquer" rule to qualify demands or requirements
      • E.g. does some requirement apply to the entire system or some specific part/functionality?
  • Negotiating with Developers
    • Work with the development team to gain respect
    • When convincing developers to e.g. adopt an architecture decision, provide a justification rather than "dictating from on high"
    • Provide justification/reason first to make sure it will be heard.
    • Guide the developer to arrive at the solution on their own.
  • The Software Architect as a Leader
    • 50% of being an effective software architect is having good people skills, facilitation skills and leadership skills.
    • Essential complexity vs accidental complexity
    • 4 C's of architecture: Communication, Collaboration, Clarity and Conciseness
  • Be Pragmatic, Yet Visionary
    • Visionary - Thinking about or planning the future with imagination or wisdom
    • Pragmatic - Dealing with things sensibly and realistically in a way that is based on practical rather than theoretical considerations
    • Strive to find an appropriate balance between these
  • Leading Teams with Example
    • "No matter what the problem is, it's a people problem" [https://en.wikipedia.org/wiki/Gerald_Weinberg](Gerald Weinberg)
    • (Basic tips)
    • Instead of "What you need" or "You must", try "Have you considered", "What about"
    • Try to use a person's name during conversation or negotiation.
    • When meeting someone for the first time or only occasionally, shake the person's hand and make eye contact.
  • Being an effective architect -> Make more time for the development team -> Control meetings
  • "The most important single ingredient in the formula of success is knowing how to get along with people." (Theodore Roosevelt)

Chapter 24 - Developing a Career Path

  • An architect must continue to learn throughout their career as the technology changes at a fast pace.
  • Keep an eye out for relevant resources, both technology and business.
  • 20-minute rule
    • Devote at least 20 minutes a day to your career by learning something new or diving deeper into a specific topic.
    • Some example resources: InfoQ, DZone Refcardz, ThoughtWorks Technology Radar
    • Recommended to leverage the 20-minute rule first thing in the morning.
  • Assess a personal radar
    • Suggested quadrants for personal use:
      • Hold: Not only technologies and techniques to avoid, but also habits you're trying to break.
      • Assess: Promising technologies that you have heard good things about but haven't had time to assess for yourself
      • Trial: Active research and development
      • Adopt: The new things you're most excited about

Other potential resources for system design and architecture

Wednesday, October 9, 2024

Accelerate

2024-accelerate

Notes for the book Accelerate - The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations by Nicole Forsgren, Jez Humble and Gene Kim.

Brief summary

The book describes the State of DevOps Reports research conducted by Google's DevOps Research and Assessment (DORA) team. The book goes through what they found in the research and how the research was done, and presents a case study of transforming an organization.

The book/research introduces four "DORA metrics" for measuring the software delivery performance of an organization (a small computation sketch follows the list):

  • Change Lead Time
  • Deployment Frequency
  • Change Failure Rate
  • Mean Time to Recovery (MTTR)
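
(Own sketch) A rough illustration of how the four metrics could be computed from deployment records; the record format and the numbers are invented.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical deployment records: commit time, deploy time, whether the
# deployment caused a failure, and how long recovery took.
deployments = [
    {"committed": datetime(2024, 1, 1, 9), "deployed": datetime(2024, 1, 1, 15),
     "failed": False, "recovery": None},
    {"committed": datetime(2024, 1, 2, 10), "deployed": datetime(2024, 1, 3, 10),
     "failed": True, "recovery": timedelta(hours=2)},
]

lead_times = [d["deployed"] - d["committed"] for d in deployments]
days_covered = (max(d["deployed"] for d in deployments)
                - min(d["deployed"] for d in deployments)).days or 1

print("change lead time:", sum(lead_times, timedelta()) / len(lead_times))
print("deployment frequency:", len(deployments) / days_covered, "per day")
print("change failure rate:", mean(d["failed"] for d in deployments))
print("MTTR:", mean(d["recovery"].total_seconds() / 3600
                    for d in deployments if d["failed"]), "hours")
```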

They also present 24 capabilities to drive improvements in software delivery performance, classified into five categories:

  • Continuous delivery
  • Architecture
  • Product and process
  • Lean management and monitoring
  • Cultural

For more details, see reference of 24 capabilities.

Part One - What we found

Chapter 1 - Accelerate

  • Focus on capabilities, not maturity
  • By focusing on a capabilities paradigm and the right capabilities, organizations can continuously drive improvement.
  • The research has identified 24 key capabilities that are easy to define, measure and improve.

Chapter 2 - Measuring performance

  • First define a valid, reliable measure of software delivery performance
  • Challenges with many existing ways to measure
    • They focus on outputs rather than outcomes
    • They focus on individual or local measures rather than team or global ones.

In search for measures of delivery performance to avoid the above challenges, the research settled on the following four

  • Change Lead Time
    • Arising from lead time in Lean theory
    • Here, the delivery part of the lead time, excluding the design part
    • The time it takes to go from code committed to code successfully running in production.
  • Deployment Frequency
    • Arising from batch size
    • Deployment frequency used as a proxy for batch size since it is easy to measure and typically has low variability.
  • Change Failure Rate
  • Mean Time to Recovery (MTTR)

Continuing with these metrics

  • Surprisingly, there is no trade-off between improving performance and achieving higher levels of stability and quality.
  • Figure 2.4: Software delivery performance impacts Organizational Performance and Noncommercial Performance
  • Important: Distinguishing which software is strategic and which isn't, and managing them appropriately.
  • IMPORTANT: Use these tools carefully
    • In organizations with a learning culture, they are incredibly powerful.
    • In pathological and bureaucratic organizational cultures, measurement is used as a form of control and people hide information that challenges existing rules, strategies, and power structures.
    • Deming: "Whenever there is fear, you get the wrong numbers"

Chapter 3: Measuring and Changing Culture

  • Organizational culture can exist at three levels in organizations: Basic assumptions, values and artifacts (Schein 1985)
  • The research uses a model of organizational culture defined by Ron Westrum.
  • Westrum typology of organizational cultures
    • Pathological (power-oriented, characterized by fear and threat)
    • Bureaucratic (rule-oriented)
    • Generative (performance-oriented, mission-focused)
  • Westrum's insight is that organizational culture predicts the way information flows through an organization.
  • Westrum's three characteristics of good information
    • It provides answers to the questions that the receiver needs answered.
    • It is timely.
    • It is presented in such a way that it can be effectively used by the receiver.
  • Bureaucracy is not necessarily bad. The goal of bureaucracy is to "ensure fairness by applying rules to administrative behavior ..." (Mark Schwartz)
    • Westrum's rule-oriented culture is perhaps best thought of as one where following the rules is considered more important than achieving the mission
  • Figure 3.2: Westrum organizational culture impacts Software Delivery Performance and Organizational Performance
  • References to Google's Project Aristotle research on team performance (2015), "it all comes down to team dynamics"
  • Accident investigations that stop at "human error" are dangerous. Human errors should be the start of the investigation, instead.
  • How to change culture? John Shook on transforming the culture of teams (How to Change a Culture: Lessons From NUMMI)

What my NUMMI experience taught me that was so powerful was that the way to change culture is not to first change how people think, but instead to start by changing how people behave — what they do.

  • Figure 3.3: Continuous Delivery and Lean Management impact Westrum Organizational Culture.

Chapter 4: Technical Practices

  • Technical practices are an enabler of more frequent, higher-quality and lower-risk software releases.
  • Continuous Delivery
  • For the Figure 4.2, see Accelerate Digest / Impact of CD
  • Going through various technical practices.
  • Pick: Test automation is important
    • but having automated tests primarily created and maintained by a separate party is not correlated with IT performance.
    • Testers also serve an essential role, performing manual testing such as exploratory, usability and acceptance testing, and helping to create and evolve automated tests by working alongside developers

Chapter 5: Architecture

  • High performance is possible with all kinds of systems, provided that the systems - and the teams that build and maintain them - are loosely coupled.
  • Situation likely at low performers
    • Software they were building was custom software developed by another company.
    • Working on mainframe systems. (Interestingly, integrating against mainframe systems was not significantly correlated with performance)
  • Importance of focusing on architecture characteristics rather than implementation details of your architecture.
  • Deployability and testability are important for creating high performance.
  • "Inverse Conway Maneuver" mentioned
  • The goal of loosely coupled architecture is to
    • ensure that the available communication bandwidth (between teams) isn't overwhelmed by implementation-level details but can be used for discussing higher-level shared goals and how to achieve them.
    • enable scaling

Chapter 6: Integrating Infosec into the Delivery Lifecycle

  • Arguably the DevOps movement is poorly named.
  • The original intent of the DevOps movement was - in part - to bring together developers and operations teams to create win-win solutions in the pursuit of system-level goals
  • Not limited to just development and operations, it occurs whenever different functions within the software delivery value stream do not work effectively together.
  • "Shift left" on security
    • Build it into software delivery process instead of making it a separate phase happening downstream in the process.
    • Impacts ability to practice continuous delivery
    • Shift from security teams doing reviews themselves to giving the developers the means to build security in.
  • Related: cloud.gov is now FedRAMP Authorized for use by federal agencies

Chapter 7: Management Practices for Software

  • Lean management applied to software development, with three components
    • Limiting work in progress (WIP)
    • Creating and maintaining visual displays showing key metrics etc.
    • Using data from application performance and infrastructure monitoring tools to make business decisions.
  • WIP limits by themselves did not strongly predict delivery performance.
    • Only when combined with the use of visual displays and a feedback loop from production monitoring tools back to the delivery teams and the business.
  • Interesting quote of approval processes

External approvals were negatively correlated with lead time, deployment frequency, and restore time, and had no correlation with change fail rate. In short, approval by an external body (such as a manager or CAB) simply doesn’t work to increase the stability of production systems, measured by the time to restore service and change fail rate. However, it certainly slows things down. It is, in fact, worse than having no change approval process at all.

Chapter 8: Product Development

  • Eric Ries' Lean Startup mentioned
    • Synthesis of ideas from the Lean movement, design thinking, and the work of entrepreneur Steve Blank, emphasizing importance of taking an experimental approach to product development.
  • Figure 8.2: Lean Product management impacts
    • Westrum Organizational Culture, which impacts Organizational Performance
    • Organizational Performance (straight)
    • Software Delivery Performance, which impacts Organizational Performance
    • Less Burnout.

Chapter 9: Making Work Sustainable

  • Deployment pain/fear can tell a lot about a team's software delivery performance.
  • Fundamentally, most deployment problems are caused by a complex, brittle deployment process.
  • This is typically a result of 3 factors
    • SW is often not written with deployability in mind
    • Probability of a failed deployment rises substantially when manual changes must be made to production environment as part of the deployment process.
    • Complex deployments often require complex handoffs between teams.
  • Six organizational risk factors that predict burnout
    • Work overload
    • Lack of control
    • Insufficient rewards
    • Breakdown of community
    • Absence of fairness
    • Value conflicts

Chapter 10: Employee Satisfaction, Identity, and Engagement

  • Employees on high-performing teams were 2.2 times more likely to recommend their organization to a friend
  • Research recommending diverse teams: Rock and Grant 2016, Deloitte 2013, Hunt et al 2013

Chapter 11: Leaders and Managers

  • Transformational leadership
    • Leaders inspire and motivate followers to achieve higher performance by appealing to their values and sense of purpose, facilitating wide-scale organizational change.
  • Model for transformational leadership with five characteristics (Rafferty and Griffin 2004)
    • Vision
    • Inspirational communication
    • Intellectual stimulation
    • Supportive leadership
    • Personal recognition
  • Three things highly correlated with SW delivery performance and contribute to a strong team culture
    • Cross-functional collaboration
    • A climate for learning
    • Tools

Part Two - The Research

  • Presenting the science behind the research findings in Part 1

Chapter 12 - The science behind this book

  • Primary and secondary research
    • Primary research - collecting new data by the research team
    • Secondary research - utilizes data collected by someone else.
  • Qualitative and quantitative research
    • Research presented in this book is quantitative, because it was collected using a Likert-type survey instrument
  • Six types of data analysis (according to framework Dr. Jeffrey Leek)
    • Descriptive
    • Exploratory
    • Inferential predictive
    • Predictive
    • Causal
    • Mechanistic
    • (Analysis presented in this book fall into the first three categories)

Chapter 13 - Introduction to Psychometrics

  • Questions on the research: Why to use surveys, can you trust the data collected with surveys?
  • "Latent construct" is a way of measuring something that can't be measured directly
    • E.g. "organizational culture"
    • Help to think carefully what we want to measure and how we are defining our constructs.

Chapter 14 - Why Use a Survey

  • Discussion on surveys vs "system data"
  • Surveys allow you to collect and analyze data quickly.
  • Measuring the full stack with system data is difficult
  • Measuring completely with system data is difficult
  • You can trust survey data
  • Some things can be measured only through surveys.

Part Three - Transformation

  • Chapter by Steve Bell and Karen Whitney Bell on leadership and organizational transformation

Chapter 16 - High-Performance Leadership and Management

  • Leadership has a powerful impact on results.
  • Component for sustaining competitive advantage (in addition to technical performance): A lightweight, high-performance management framework that:
    • connects enterprise strategy with action
    • streamlines the flow of ideas to value
    • facilitates rapid feedback and learning
    • capitalizes on and connects the creative capabilities of every individual...
  • Case study from ING Netherlands, some picks
    • You have to understand why, not just copy the behaviors
    • The work itself will constantly change; the organization that leads is the one whose people have the consistent behaviors to rapidly learn and adapt.
  • Summary at https://bit.ly/high-perf-behaviors-practices

Saturday, November 12, 2022

Flow-Based Product Development

2022-principles-of-product-development-flow

Notes for the book The Principles of Product Development Flow: Second Generation Lean Product Development (Amazon) by Donald Reinertsen.

As a summary

  • The author has a strongly opinionated view on how product development should be done
  • The book is built as a collection of principles for various areas
  • The author takes inspiration to product development from various domains (e.g. queuing theory, data communication networks, warfare)

The Principles of Flow

It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so. (Mark Twain)

  • The author states that the dominant paradigm for managing product development is fundamentally wrong.
  • New paradigm emphasizing achieving flow, emphasizing e.g. small batch transfers, rapid feedback and limiting WIP.
  • Could be labeled Lean Product Development (LPD), though lean manufacturing has very different characteristics than product development.
  • Also ideas from different domains -> the new paradigm is called Flow-Based Product Development

What's the problem?

  • Current paradigm based on internally consistent but dysfunctional beliefs.
    • E.g. combining the belief that efficiency is good with a blindness to queues -> high levels of capacity utilization -> large queues and long cycle times

Problems with the current orthodoxy:

  1. Failure to Correctly Quantify Economics
  • E.g. focusing too much on proxy variables
  2. Blindness to Queues
  • Too much design-in-process inventory (DIP)
  • Why? Because DIP is typically both financially and physically invisible
  • Also, we're often blind to the dangers of high levels of DIP
  3. Worship of Efficiency
  4. Hostility to Variability
  • Without variability, we cannot innovate.
  • Variability is only a proxy variable
  5. Worship of Conformance
  • Instead of worshipping conformance to plan, we should utilize the valuable new information constantly arriving throughout the development cycle.
  6. Institutionalization of Large Batch Sizes
  • Coming from blindness towards queues and focus on efficiency
  • Blindness to the issue of batch size
  7. Underutilization of Cadence
  • E.g. meetings with a regular and predictable cadence have very low set-up cost.
  8. Managing Timelines instead of Queues
  • Failure to understand the statistics of granular schedules
  • Queues are a better control variable than cycle time because today's queues are leading indicators of future cycle-time problems
  9. Absence of WIP Constraints
  10. Inflexibility
  • Specialized resources and high levels of utilization -> delays
  • How is it tackled currently? By focusing on reducing variability
  • Instead, the book recommends focusing on making resources, people and processes flexible
  11. Non-economic Flow Control
  • Current systems to control flow are not based on economics.
  12. Centralized Control

Major themes of the book

The book has eight major themes and a major chapter for each.

  • Economics - Economically-based decision making
  • Queues - Even a basic understanding of queuing theory will help a lot with product development
  • Variability
  • Batch Size
  • WIP Constraints
  • Cadence, Synchronization and Flow Control
  • Fast Feedback (Loops) - Suggesting that feedback is what permits us to operate product development process effectively in a noisy environment
  • Decentralized Control

Relevant idea sources used

  • Lean Manufacturing
  • Economics
  • Queuing Theory
  • Statistics
  • The Internet (protocols)
  • Operating System Design
  • Control Engineering
  • Maneuver Warfare

The Design of the book

The economic view

  • Why do we want to change the product development process? The answer: To increase profits.
  • Proxy objectives/variables often used
  • Experience from asking people what a 60-day delay to market for a project would cost the company - answers typically vary by a factor of 50 to 1
  • Approach product development decisions as economic choices

The Nature of Our Economics (Principles E1-E2)

  • Select actions based on quantified overall economic impact.
  • Five key economic objectives: Cycle time, product cost, product value, development expense and risk
  • We can’t just change one thing.

The Project Economic Framework (Principles E3-E5)

  • Unit of measure for a product and project: Life-cycle profit impact
  • If you only quantify one thing, quantify the cost of delay.

The Nature of Our Decisions (... E6-E11)

  • Important trade-offs are likely to have U-curve optimizations.
  • Important properties of U-curves
    • Optimization never occurs at extreme values
    • Flat bottoms -> U-curve optimizations do not require precise answers
  • See e.g. this blog post
  • Even imperfect answers improve decision making
  • Many economic choices are more valuable when made quickly.

Our Control Strategy (E12-E15)

  • Background: Many small decisions creating most value when done quickly
  • Use decision rules to decentralize economic control (see the sketch after this list)
    • Instead of controlling the decisions, control the economic logic of the decisions.
  • Ensure decision makers feel both cost and benefit.
  • We should make each decision at the point where further delay no longer increases the expected economic outcome.
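
(Own sketch) A toy example of an economic decision rule of the kind the book advocates: anyone can apply it locally without escalating the decision. The cost-of-delay figure and the expedite fee are invented.

```python
# A hypothetical decision rule: expedite whenever the delay cost avoided
# exceeds the extra expense. The figures are made up for illustration.
COST_OF_DELAY_PER_WEEK = 20_000  # from the project economic framework

def should_expedite(weeks_saved, extra_expense):
    value_of_time_saved = weeks_saved * COST_OF_DELAY_PER_WEEK
    return value_of_time_saved > extra_expense

# e.g. a 5,000 rush fee that saves half a week of schedule
print(should_expedite(weeks_saved=0.5, extra_expense=5_000))  # True
```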

Some Basic Economic Concepts (E16-E18)

  • Importance of marginal economics to e.g. avoid "feature creep"
  • Avoid "Sunk cost" fallacy, instead look at the return of the remaining investment

Managing Queues

  • Festina lente
    • Time spent in queues might be more important than speeding up the activities

Queuing theory basics

  • Queuing Theory originates from telecommunication
  • Basic concepts
    • Queue - Waiting work
    • Server - Resource performing work
    • Arrival process - The pattern with which work arrives (can be unpredictable)
    • Service process - The time it takes to accomplish the work (can be unpredictable)
    • Queuing discipline - The sequence/pattern in which waiting work is handled
  • A simple queue: M/M/1/∞ (Kendall notation)
    • First "M" - Arrival process (here a Markov process)
    • Second "M" - Service process (also a Markov process)
    • 1 - Number of parallel servers
    • ∞ (infinite) - (No) upper limit on queue size
  • Measures of queue performance
    • Occupancy
    • Cycle time

Why Queues Matter (Q1-Q2)

  • Idle work (time spent waiting in queues) is inventory, which is the root cause of many other economic problems.
  • In manufacturing, we are often aware of work-in-progress (WIP) inventory. But in product development, we're often not aware of the design-in-progress (DIP) inventory.
  • Product development queues are often bigger than manufacturing queues.
  • Product development queues are often invisible -> they don't catch the eye.
  • Effect of queues:
    • Increased cycle time
    • Increased risk
    • Increased variability
    • Increased overhead
    • Lower quality (by delaying feedback)
    • Negative psychological effect

The Behavior of Queues (Q3-Q8)

  • For an M/M/1/∞ queue, capacity utilization (𝜌) allows us to predict many properties of the queue
    • E.g. number of items in the queue: 𝜌/(1-𝜌) -> as utilization approaches 100%, the queues start to grow exponentially (see the numeric sketch after this list)
  • Capacity utilization is difficult to measure. Instead, queue size and WIP/DIP are practical factors to measure.
  • See also A Dash of Queueing Theory - A good blog post on the topic with live simulations of various processes
  • High queue states cause most of the economic damage
  • If possible to balance the load / share a queue between multiple servers, that helps to manage queues. See M/M/c queue for more details (the book uses term M/M/n queue)
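
(Own sketch) A quick numeric illustration of the 𝜌/(1-𝜌) relationship above: as capacity utilization climbs, the expected number of items grows very rapidly.

```python
# Expected number of items for an M/M/1 queue as a function of capacity
# utilization rho, using the rho / (1 - rho) relationship cited in the notes.
for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
    items = rho / (1 - rho)
    print(f"utilization {rho:.0%}: ~{items:.0f} items in the system")
```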

The Economics of Queues (Q9-Q10)

  • Find optimum queue size with quantitative analysis, avoiding simple "Queues are evil"
  • Scheduling affects the queue cost (more on scheduling later)

Managing Queues (Q11-Q16)

  • Cumulative Flow Diagrams (CFDs) are useful for managing queues
  • Little's Law: mean response time = mean number in system / mean throughput (see the worked example after this list)
    • Can be applied both to a queue and to the system as a whole
  • Control queue size instead of utilization or cycle time
  • From statistics of random processes: Over time, queues will randomly spin seriously out of control
    • The distribution of the cumulative sum of a random variable flattens as N grows
  • "We can rely on randomness to create a queue but we cannot rely on randomness to correct this queue"
  • -> Monitoring the queues and intervening when needed
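
(Own sketch) A worked example of Little's Law as stated above, with invented numbers: 30 items in process and a throughput of 5 items per week give an average of 6 weeks in the system.

```python
# Little's Law: mean time in system = mean number in system / mean throughput
mean_number_in_system = 30   # e.g. items of design-in-process (DIP)
mean_throughput = 5          # items completed per week

mean_response_time = mean_number_in_system / mean_throughput
print(f"average time in system: {mean_response_time} weeks")  # 6.0 weeks
```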

Exploiting variability

We cannot add value without adding variability but we can add variability without adding value

  • Economic cost of variability (by an economic payoff-function) is more important than amount of variability

The Economics of Product Development (V1-V4)

  • Risk-taking is central to value creation in product development.
  • We cannot maximize economic value by eliminating all choices with uncertain outcomes
  • Asymmetric Payoff is important with creating economic value with variability (See e.g. Product Development Payoff Asymmetry)
    • Note that payoff functions in product development are different than in manufacturing as in manufacturing variance is most typically a negative thing.
  • Variability is not desired or undesired as such. Instead, it is desired when it increases economic value.
    • -> It shouldn't be minimized or maximized
  • A 50% failure rate is usually optimum for generating information. Note here that all activities are not designed to maximize information, though.

Reducing Variability (V5-V11)

  • Two main approaches to improve the economics of variability
    • Change the amount of variability
    • Change the economic consequences of variability
  • Diffusion principle: When uncorrelated random variables are combined, the variability of the sum decreases.
    • E.g. diversifying a stock portfolio
    • Doing many small experiments instead of one big one.
  • Repetition and reuse reduce variation
  • With buffers we can trade e.g. cycle time for reduced variability in cycle time
    • -> Finding the best amount of buffering (not minimizing buffer nor maximizing confidence)

Reducing Economic Consequences (V12-V16)

  • Usually the best way to reduce cost of variability
  • Rapid feedback
  • Aim to replace expensive variability with cheap variability.
  • Note: Often it is better to improve iteration speed than defect rate.

Reducing batch size

  • Product developers don't usually think of batch size, which would be an important tool to improve flow

The Case for Batch Size Reduction (B1-B10)

  • Reducing batch size (normally)
    • Reduces cycle time
    • Reduces variability
    • Accelerates feedback
    • Reduces risk
    • Reduces overhead
  • Whereas large batch sizes (normally)
    • Reduce overall efficiency
    • Lower motivation
    • Cause exponential cost and schedule growth
    • Lead to even larger batches

The Science of Batch Size

  • Economic batch size is usually a U-curve optimization (see Economic Order Quantity, EOQ; a sketch follows this list)
  • Batch size reduction often lowers transaction costs, which saves more than originally assumed
    • -> Usually we don't know the optimum batch size without testing and measuring.
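
(Own sketch) A small illustration of the U-curve behind economic batch size: a fixed transaction cost amortized over the batch plus a holding cost that grows with batch size. The cost parameters are invented.

```python
# U-curve of total cost per item as a function of batch size: the fixed
# transaction cost is amortized over the batch, while holding cost grows
# with batch size; the optimum sits at the flat bottom of the curve.
TRANSACTION_COST = 100.0   # fixed cost per batch
HOLDING_COST = 2.0         # holding cost per item

def cost_per_item(batch_size):
    return TRANSACTION_COST / batch_size + HOLDING_COST * batch_size / 2

for batch_size in (1, 2, 5, 10, 20, 50):
    print(f"batch size {batch_size:>2}: cost per item {cost_per_item(batch_size):6.1f}")
```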

Managing Batch Size

  • Separate
    • "production change batch" - Changing the state of the product
    • "Transport batch size" - Changing the location of the product (typically more important)
  • To enable small transport batch size, reduce distances. -> Co-locate teams etc.
  • Note: Small batches require good infrastructure
  • Consider sequence/order of batches
  • Adjust batch sizes as the context changes

Applying WIP constraints

It is easier to start work than it is to finish it

Start finishing, finish starting.

  • A tool to respond to growing queues
  • WIP constraints, can be seen in e.g.
    • Manufacturing - Toyota Production System (TPS)
      • Note: Mainly repetitive and homogeneous flows
    • Telecommunication networks & protocols as inspiration
      • Assuming highly variable, nonhomogeneous flows

The Economic Logic of WIP Control (W1-W5)

  • WIP constraints
    • Enable controlling cycle time and flow
    • Note that they also reject potentially valuable demand and reduce capacity utilization
    • -> Cost-benefit analysis
    • Force rate-matching
  • Theory of Constraints (TOC)
    • Identify the bottleneck in the process -> pace the rest of the work according to it
    • A global constraint
    • Useful for predictable and permanent bottlenecks
  • When possible, constrain local WIP pools (Local Constraints)
    • E.g. TPS Kanban system
    • Useful when there is no predictable/permanent bottleneck
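
A minimal time-stepped sketch of the cost-benefit trade-off above (my own illustration; parameters are arbitrary, not from the book): a single-server system running at roughly 100% utilization, with and without a WIP cap. The cap bounds WIP and therefore cycle time (via Little's law), at the price of rejecting some demand.

```python
import random

def simulate(wip_limit=None, steps=100_000, arrival_p=0.5, service_p=0.5, seed=1):
    """Time-stepped single-server queue.

    Each tick a new job arrives with probability `arrival_p` (rejected if the
    WIP limit is reached) and the job in service finishes with probability
    `service_p`. Returns average WIP, throughput, cycle time (Little's law)
    and the number of rejected jobs.
    """
    rng = random.Random(seed)
    wip = wip_area = completed = rejected = 0
    for _ in range(steps):
        if rng.random() < arrival_p:
            if wip_limit is None or wip < wip_limit:
                wip += 1
            else:
                rejected += 1
        if wip > 0 and rng.random() < service_p:
            wip -= 1
            completed += 1
        wip_area += wip
    avg_wip = wip_area / steps
    throughput = completed / steps
    cycle_time = avg_wip / throughput   # Little's law: WIP = throughput x cycle time
    return avg_wip, throughput, cycle_time, rejected

# Without a cap the queue drifts ever larger; a tight cap trades rejected demand
# for short, predictable cycle times.
for limit in (None, 20, 5):
    avg_wip, tput, ct, rej = simulate(wip_limit=limit)
    print(f"WIP limit {str(limit):>4}: avg WIP {avg_wip:7.1f}, "
          f"throughput {tput:.3f}, cycle time {ct:7.1f}, rejected {rej}")
```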

Reacting to Emergent Queues (W6-W14)

  • The core of managing queues is not monitoring them but acting when the limits are reached

Various ways to respond to high WIP

  • Demand-focused
    • Block all new demand when the upper WIP limit is reached
    • Purge low-value jobs on high WIP - Kill the "zombie projects"
    • Shed requirements
  • Supply-focused
    • Extra resources
    • Part-time resources for high variability tasks
    • Powerful experts to emerging bottlenecks
    • T-shaped resources
    • Cross-training
  • Mix change

WIP Constraints in Practice (W15-W23)

  • W15-W23 are practical principles for controlling WIP
  • One example: inspired by the adaptive "window size" of Internet protocols, adjust WIP limits as capacity changes

Controlling Flow Under Uncertainty

Anyone can be captain in a calm sea

  • WIP constraints are important but don't solve all our problems.

Congestion (F1-F4)

  • Congestion: A system condition combining high capacity utilization and low throughput
  • Traffic flow = Speed x Density
    • Vehicles/hour = Miles/hour x Vehicles/Mile
  • Bruce Greenshields' traffic flow model
    • As speed increases, the distances increase and the density decreases
    • Throughput is a parabolic curve - Low throughput at both extremes
    • Low-speed operating point ("left") is inherently unstable
      • Increasing density -> Speed decreases -> Flow decreases -> Density increases
    • High-speed operating point ("right") is inherently stable
      • Increasing density -> Decreasing speed -> Increasing flow -> Decreasing Density
    • See also "Traffic Flow Theory" section at The Science of Kanban - Process
  • For a system with a strong throughput peak, we usually want to operate near that point (see the sketch after this list)
    • To keep the system at the desirable operating point, the easiest control variable is occupancy (a more general term for density)
  • Use expected flow time instead of queue size to inform users of congestion
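
A short sketch of Greenshields' linear speed-density model (the free-flow speed and jam density values are illustrative): flow = density x speed, which peaks at half the jam density and falls off toward both extremes, i.e. the parabolic throughput curve described above.

```python
FREE_FLOW_SPEED = 60.0   # mph on an empty road (illustrative)
JAM_DENSITY = 200.0      # vehicles/mile at a standstill (illustrative)

def speed(density):
    """Greenshields: speed falls linearly from free-flow speed to zero at jam density."""
    return FREE_FLOW_SPEED * (1 - density / JAM_DENSITY)

def flow(density):
    """Flow = speed x density: a parabola that peaks at half the jam density."""
    return speed(density) * density

# Low throughput at both extremes, maximum in the middle.
for d in (20, 60, 100, 140, 180):
    print(f"density {d:3} veh/mile -> speed {speed(d):5.1f} mph, flow {flow(d):6.0f} veh/hour")
```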

Cadence (F5-F9)

  • Cadence - Use of a regular, predictable rhythm within a process
  • Can be used to e.g.
    • Limit the accumulation of variance
    • Make waiting times predictable
    • To enable small batch sizes
  • Some examples for use of cadence: Product introduction, testing, project meetings, ...

Synchronization (F10-F14)

  • Synchronization vs cadence
    • Cadence causes events to happen at regular time intervals.
    • Synchronization causes multiple events to happen at the same time.
  • Valuable when there is economic advantage from processing multiple items at the same time.

Sequencing Work (F15-F21)

  • Sequencing in manufacturing vs product development
    • In manufacturing, like the Toyota Production System, work is processed on a first-in-first-out (FIFO) basis
    • Product development is different as both delay costs and task durations vary among projects.
    • Hospital emergency room is a good mental model for sequencing work
  • The author emphasizes two points
    • Complex prioritization algorithms are often used - prefer a simple approach instead (the goal is to prevent big mistakes)
    • Sequencing matters most when queues are large - with small queues, sequencing matters much less
  • When delay costs are homogeneous, do the shortest job first (SJF)
  • When job durations are homogeneous, do the high cost-of-delay job first (HDCF)
  • When neither delay costs nor job durations are homogeneous, do the weighted shortest job first (WSJF) - see the sketch after this list
  • Three common mistakes with prioritizing
    • Prioritizing purely on ROI.
    • FIFO
    • Critical chain (not optimal when projects have different delay costs)
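
A minimal sketch of the sequencing rules above with made-up jobs: sort by WSJF = cost of delay / duration and compare the total delay cost against plain FIFO.

```python
# Each job: (name, cost_of_delay_per_week, duration_weeks) -- illustrative numbers.
JOBS = [("A", 10, 1), ("B", 3, 4), ("C", 30, 5), ("D", 1, 2)]

def total_delay_cost(jobs):
    """Sum of cost-of-delay x time-until-finished when jobs run one after another."""
    elapsed, cost = 0, 0
    for _, cod, duration in jobs:
        elapsed += duration
        cost += cod * elapsed
    return cost

fifo = JOBS
wsjf = sorted(JOBS, key=lambda j: j[1] / j[2], reverse=True)  # highest CoD/duration first

print("FIFO order:", [j[0] for j in fifo], "-> total delay cost", total_delay_cost(fifo))
print("WSJF order:", [j[0] for j in wsjf], "-> total delay cost", total_delay_cost(wsjf))
```

With these numbers WSJF reorders the work to A, C, B, D and the total delay cost drops from 337 to 232; the ranking only needs rough relative estimates of delay cost and duration, which fits the "simple approach" point above.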

Managing the Development Network (F22-F28)

  • Managing the product development resource network with ideas from managing a data communication network
  • Routing tailored for tasks
  • Routing based on the currently most economic route - often selectable only over a short time horizon
  • Alternate routes to avoid congestion
  • ...
  • Flexibility helps to absorb variation but requires pre-planning and investment.

Correcting Two Misconceptions (F29-F30)

  • "It is always preferable to decentralize all resources to dedicated project teams"
    • Often the case but not always
    • Centralized resources can enable variability pooling and thus reduce queues
  • "Queueing delays at a bottleneck are solely determined by the characteristics of the bottleneck"
    • The process before the bottleneck also has an important influence
    • -> Aim to reduce variability before a bottleneck.

Using fast feedback

  • Use of feedback loops and control systems
  • Combining ideas from economics and control systems engineering
  • Issues of dynamic response and stability
  • This material requires a mental shift for people with a manufacturing or quality-control background
    • In the economics of manufacturing, payoff functions have an inverted-U shape - the larger the variance, the larger the losses
    • Product development is different as the goals are dynamic and payoff functions can be asymmetric
  • Fast feedback can alter the economic payoff function, as it
    • ... allows truncating unproductive paths more quickly
    • ... allows raising the expected gains by exploiting good outcomes

The Economic View of Control (FF1-FF6)

  • What makes a good control variable?
    • Economic influence
    • Efficiency of control
    • Ones that allow early intervention
  • Focus on controlling economic impact instead of focusing on the proxy variables
    • -> E.g. set "alert thresholds" to points of equal impact
  • Note the difference between static and dynamic goals
    • In product development, we continually get better information that allows us to reevaluate and shift the goals

The Benefits of Fast Feedback (FF7-FF8)

  • Fast feedback
    • Enables smaller queues
    • Makes learning faster and more efficient
  • Typically it requires investment to create an environment that can extract the smaller signals

Control System Design (FF9-FF18)

  • Note the difference between a metric and a control system
    • "What gets measured might not get done"
  • Short turning radius reduces the need for longer planning horizons -> reduces the magnitude of the control problem
  • Prefer local feedback
  • Combine long and short control loops
    • short time horizon for adapting to the random variation of the process
    • long time horizon for improving process characteristics considered causal to success

The Human Side of Feedback (FF19-FF24)

  • Colocation typically improves communication
    • Faster feedback
    • Psychological aspects
  • Faster feedback improves the sense of control
  • Large queues prevent an atmosphere of urgency
  • Human elements tend to amplify large excursions -> Aim to keep the system within a controllable range
  • Balance personal/local/overall basis of rewarding to align behaviors

Metrics for Flow-Based Development

  • For a list, see page 15 of Agile Metrics at Scale
  • Flow
    • Design-in-process inventory (DIP)
    • Average flow time
  • Queues
    • Number of items in queue (easy)
    • Estimate amount of work in queue (difficult)
    • Quite often the simple count is surprisingly effective and sufficient (see the sketch below)
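
A sketch of the two flow metrics above computed from per-item start/finish dates (the item names and dates are hypothetical): unfinished items count toward design-in-process inventory (DIP), finished items give the average flow time.

```python
from datetime import date

# (item, started, finished or None if still in process) -- hypothetical data
ITEMS = [
    ("feature-1", date(2022, 9, 1),  date(2022, 9, 12)),
    ("feature-2", date(2022, 9, 5),  date(2022, 10, 3)),
    ("feature-3", date(2022, 9, 20), None),
    ("feature-4", date(2022, 9, 28), None),
]

# DIP: items started but not yet finished.
dip = sum(1 for _, _, finished in ITEMS if finished is None)

# Average flow time over finished items, in days.
flow_times = [(finished - started).days for _, started, finished in ITEMS if finished]
avg_flow_time = sum(flow_times) / len(flow_times)

print(f"Design-in-process inventory (DIP): {dip} items")
print(f"Average flow time (finished items): {avg_flow_time:.1f} days")
```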

Achieving Decentralized Control

  • Decentralized control allows fast local feedback loops (the topic of the previous chapter) to work best
  • Examining what we can learn from military doctrine
    • The military has a long history of balancing centralized and decentralized control
    • Advanced models of centrally coordinated, decentralized control
  • The Marines
    • ... believe that warfare constantly presents unforeseen obstacles and unexpected opportunities.
    • ... believe that the original plan was based on imperfect data

How Warfare works

  • Typically one side attacks and the other defends
    • Typical understanding: attack and defense require different organizational approaches
    • Old military adage: Centralize control for offense, decentralize it for defense
    • Rule of thumb: For an attacker to succeed, they should outnumber defenders by 3 to 1
  • Attacker can concentrate the forces, the defender must allocate forces to the entire perimeter
  • Various approaches for the defender
    • Harden the perimeter at the most logical places for attack - but this is often circumvented
    • Better: Mass nearby forces to counteract the local superiority of the attacker.
    • Related: Defense-in-depth approach: An outer perimeter slows the attacking forces, allowing more defending forces to be moved to the area of the attack.
  • Maneuver warfare: Use of surprise and movement

Balancing Centralization and Decentralization (D1-D6)

  • Decentralize control for problems and opportunities that are best dealt with quickly
  • Centralize control for problems that are infrequent, large or have significant economies of scale
  • Adapt the approach as the knowledge increases
    • Triage process approach (works if there is enough information when a new problem arrives)
    • Escalation process
  • The value of faster response time can outweigh the inefficiency of decentralization
  • Pure decentralization is rarely optimal; instead, find a balance

Military Lessons on Maintaining Alignment (D7-D16)

  • Misalignment is the risk of decentralized control
    • Locally optimal choices might be bad at the system level
    • Overall alignment creates more value than local excellence
  • Maintaining alignment is "the sophisticated heart of maneuver warfare"
  • Mission: Specify the end goal, its purpose and the minimum possible constraints
  • Establish clear roles and boundaries
    • Avoid both excessive overlap and gaps
  • Designate a main effort and focus on it
    • Often only a small set of product attributes truly drive success
  • The main effort can be shifted when conditions change
    • -> Develop the ability to quickly shift focus
    • OODA loop (Observe -> Orient -> Decide -> Act) by Colonel John Boyd
  • Localize tactical coordination
  • Make early and meaningful contact with the problem
    • In product development, our "opposing forces" are the market and the technical risks
    • There is no substitute for quick proofs of concept and early market feedback

The Technology of Decentralization (D17-D20)

  • Key information is needed to make decisions -> share
    • In the Marine Corps, the minimum is to understand the intentions of commanders two levels higher in the organization
  • Accelerate decision-making speed
    • Fewer people and layers of management -> Giving authority, information and practice to lower organizational levels to make decisions.
    • When response time is important, measure it.

The Human Side of Decentralization (D21-D23)

  • Cultivate Initiative
    • The Marines view initiative as the most critical quality in a leader.
  • Face-to-face communication
  • Decentralized control is based on trust. Trust is built through experience.

Monday, October 10, 2022

Building Evolutionary Architectures

2022-building-evolutionary-architectures

Notes for the book Building Evolutionary Architectures by Neal Ford, Rebecca Parsons and Patrick Kua.

Main take-aways / Summary

  • Software architectures are not created in a vacuum - They always reflect the ecosystem in which they were defined
    • E.g. When SOA was popular, all infrastructure was commercial, licensed and expensive.
  • An evolutionary architecture supports guided, incremental change across multiple dimensions.
  • Anything that verifies the architecture is a fitness function
    • -> Treat those uniformly
    • Think of architectural characteristics as evaluable things.

Software Architecture

  • There are many definitions for software architecture.
  • There are many "-ilities" for software architecture to support. In this book adding a new one: evolvability.
  • Whatever the aspect of software development, we expect constant change.
  • Alternative to fixed plans? Learning to adapt. Make change less expensive e.g. by automating formerly manual processes etc.
  • Yet another definition of software architecture: "parts hard to change later"
    • A convenient definition, but it has a blind spot: it can become a self-fulfilling prophecy.
  • -> Building changeability into architecture?
    • Having ease of change as a principle.

Evolutionary architecture

  • Book's definition:

An evolutionary architecture supports guided, incremental change across multiple dimensions.

  • Incremental change - Two aspects: How teams build software incrementally and how they deploy it.
  • Guided changes - Once architects have chosen important characteristics, they want to guide changes to the architecture to protect those characteristics.
  • There are many dimensions of architecture
    • Architectural concerns, i.e. the list of "-ilities".
    • Not only "-ilities" but other dimensions to consider for evolvability
      • Technical dimensions
      • Data
      • Security
      • Operational/System
  • There are various techniques for carving up architectures
  • In this book, in contrast, we don't attempt to create a taxonomy of dimensions but rather recognize the ones extant in existing projects.
  • Impact of team structure on surprising things, e.g. architecture -> Conway's law

Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.

  • So one should not pay attention only to the architecture and design of the software, but also the delegation, assignment, and coordination of the work between teams.
  • Inverse Conway Maneuver - Structuring teams and organizational structure around the desired architecture.

Structure teams to look like your target architecture, and it will be easier to achieve it.

  • Two critical characteristics for evolutionary architecture: incremental and guided.

Fitness Functions

  • Book's definition for a fitness function:

An architectural fitness function provides an objective integrity assessment of some architectural characteristic(s).

  • Systemwide fitness function - a collection of fitness functions corresponding to different dimensions of the architecture.
  • It is an important architectural decision to define important dimensions (scalability, performance, security, ...)
  • Different "categories" of fitness functions
    • Atomic vs Holistic
    • Triggered vs Continual
    • Static vs Dynamic
    • Automated vs Manual (an automated example is sketched after this list)
    • Temporal (e.g. "break upon upgrade")
    • Intentional over Emergent (There will usually be unknown unknowns)
    • Domain-specific
  • Classify fitness functions by importance
    • Key - Crucial ones
    • Relevant
    • Not relevant
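
As an illustration of an atomic, automated, static fitness function (a sketch of my own, not from the book; the package names myapp.domain and myapp.infrastructure are hypothetical): a pytest-style test that fails the build whenever domain code imports infrastructure code, using only the Python standard library.

```python
import ast
import pathlib

# Hypothetical layering rule: domain code must not depend on infrastructure code.
FORBIDDEN_PREFIX = "myapp.infrastructure"
DOMAIN_DIR = pathlib.Path("src/myapp/domain")

def imported_modules(path):
    """Collect every module name imported by a Python source file."""
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            yield from (alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            yield node.module

def test_domain_does_not_depend_on_infrastructure():
    """Atomic, automated, static fitness function: run it in the deployment pipeline."""
    violations = [
        (str(path), module)
        for path in DOMAIN_DIR.rglob("*.py")
        for module in imported_modules(path)
        if module.startswith(FORBIDDEN_PREFIX)
    ]
    assert not violations, f"Forbidden dependencies: {violations}"
```

In a Java codebase the same layering rule could be checked with a package-dependency tool such as JDepend (mentioned later in these notes); the point is that the architectural characteristic becomes an evaluable thing wired into the pipeline.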

Engineering Incremental Change

Architecture is abstract until operationalized, when it becomes a living thing

  • Long-term viability of an architecture cannot be judged until design, implementation, upgrade and inevitable change are successful.
  • Common combinations of fitness function categories
    • atomic + triggered - e.g. unit tests
    • holistic + triggered - e.g. wider integration testing via a deployment pipeline
    • atomic + continual
    • holistic + continual - e.g. Chaos Monkey
  • Hypothesis- and Data-Driven Development
    • Hypothesis-driven development - Include users also in the feedback loop

Architectural Coupling

  • Focus on appropriate coupling - how to identify which dimensions of the architecture should be coupled
  • Term definitions here
    • module - some way of grouping related code together
    • modularity - logical grouping of related code
    • components - physical packaging of modules
    • Modules imply logical grouping while components imply physical grouping.
    • library is one kind of a component
  • functional cohesion - business concepts semantically binding parts of the system together
  • architectural quantum - independently deployable component
    • quantum size determines the lower bound of the incremental change possible
  • One key thing: Determining structural component granularity and coupling between components
  • In general, the smaller the architectural quanta, the more evolvable the architecture will be.
  • JDepend for package dependencies

Evolvability of architectural styles

  • Different architectural styles have different inherent quantum sizes

Big Ball of Mud

  • Quantum: The whole system
  • Incremental change is difficult because of scattered dependencies
  • Building fitness functions is difficult because there is no clearly defined partitioning
  • Good example of inappropriate coupling

Unstructured monoliths

  • Large quantum size hinders incremental change
  • Building fitness functions difficult but not impossible
  • Somewhat similar coupling as with Big Ball of Mud

Layered monoliths

  • Quantum is still the whole application
  • Incremental change easier particularly if changes are isolated to existing layers
  • Easier to write fitness functions with more structure
  • Often easy understandability

Modular monoliths

  • Many of the benefits of microservices can be achieved also with monoliths if developers are extremely disciplined about coupling
  • Incremental change easier because of modularity
  • Easier to design and implement fitness functions
  • Appropriate coupling

If you can't build a monolith, what makes you think microservices are the answer (Simon Brown)

Microkernel

  • Commonly used in e.g. browsers and IDEs
  • Typically a core system with an API for plug-ins
  • Quantum: One for the core, another for the plug-ins

Event-Driven architectures - Broker pattern

  • Typically message queues, initiating event, intra-process events, event processors
  • Coordination and error handling typically difficult
  • Allow incremental change in multiple forms
  • Atomic fitness functions typically easy to write but holistic fitness functions are both necessary and complex in this architecture
  • Low degree of coupling - Between services and the message contracts

Event-Driven architectures - Mediator pattern

  • Has a hub that acts as a coordinator
  • Primary advantage: Transactional coordination
  • Incremental change as with the broker pattern
  • Holistic fitness functions easier to build than with the broker version
  • Coupling increases

Broker or mediator - classic example of an architectural tradeoff

Service-Oriented Architectures - ESB-driven SOA

  • Enterprise Service Bus (ESB) - Mediator for event interactions
  • The style varies but is based on segregating services by reusability, shared concepts and scope.
  • Architectural quantum is massive - Entire system
  • The style allows reuse and segregation of resources but hampers making the most common types of change, those to business domains.
  • Testing in general is difficult.
  • Note: Software architectures are not created in a vacuum - They always reflect the ecosystem in which they were defined
    • E.g. When SOA was popular
      • Automatic provisioning of machines wasn't possible
      • all infrastructure was commercial, licensed and expensive.

Service-Oriented Architectures - Microservices

  • Combines engineering practices of Continuous Delivery with logical partitioning of bounded contexts
    • Typically separated along domain dimension
    • Compared to typical layered architecture, a microservice has all the layers but handles only one bounded context
  • 7 principles from Building Microservices
    • Modelled around the business domain
    • Hide implementation details
    • Culture of automation
    • Highly decentralized
    • Deployed independently
    • Isolate failure
    • Highly observable
  • "Share nothing" - "No entangling coupling points"
  • Service templates such as DropWizard and Spring Boot
  • Why wasn't this done before? See the earlier note on e.g. SOA

Evolutionary standpoint:

  • Supports both aspects of incremental change
  • Easy to build both atomic and holistic fitness functions.
    • (Well, I wouldn't agree 100% myself about holistic fitness functions)
  • Two kinds of coupling: Integration and service template

Service-based architectures

  • Similar to microservices but differs in one or more of:
    • service granularity - bigger services / quantum size
    • database scope - sharing a database
    • integration middleware - a mediator like service bus
  • Incremental change works relatively well
  • Potentially more difficult to write fitness functions
  • More coupling

"Serverless" Architectures

  • Broadly, two different meanings
    • BaaS - Backend as a Service
    • FaaS - Function as a Service
  • Supports incremental change
  • Typically requires more holistic fitness functions
  • Attractive because it eliminates several dimensions/concerns
  • Suffers from serious constraints also

Evolutionary Data

Migrations

  • Developers should treat changes to database structure the same way as changes to source code: tested, versioned and incremental (a minimal migration-runner sketch follows this list)
  • Most teams have moved away from building undo migration capabilities
    • If all the migrations exist, the database can be built up to exactly the point needed without backing up from a later version
    • Why maintain two versions of correctness, both forward and backward?
    • Undo is sometimes daunting or even impossible (e.g. restoring a dropped column or table)
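
A minimal sketch of forward-only, versioned migrations (the tables and statements are hypothetical; real projects would typically use a tool such as Flyway or Liquibase): each migration runs at most once and is recorded in a schema_version table, so any database copy can be built up to exactly the version it needs.

```python
import sqlite3

# Hypothetical, forward-only migrations: never edited once applied, only appended.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
    (3, "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)"),
]

def migrate(conn, target_version=None):
    """Apply, in order, every migration newer than the recorded schema version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0
    for version, statement in MIGRATIONS:
        if version <= current:
            continue                      # already applied on this database
        if target_version and version > target_version:
            break                         # build only up to the version needed
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn, target_version=2)           # e.g. reproduce an older environment
migrate(conn)                             # then bring it fully up to date
```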

Shared Database integration

  • Shared Database Integration pattern
  • Using the database as an integration point fossilizes the database schema across all sharing projects
  • To evolve the schema: Expand/contract pattern
  • Options on example change
    • No integration points, no legacy data -> straightforward
    • Legacy data, no integration points -> migrate the data, after that done
    • Existing data and integration points -> Potentially DB triggers etc.

Two-phase commit Transactions

  • Transactions are a special form of coupling because transactional behavior doesn't typically appear in traditional architecture-centric tools
  • Heavily transactional systems are difficult to translate to e.g. microservices
  • The binding imposed by databases is strong because transaction boundaries often define how the business processes work.

Database transactions act as a strong nuclear force, binding quanta together.

Age and Quality of Data + summary

  • Adding another join table is a common way to expand schema definitions
  • For evolutionary architecture, make sure developers can evolve the data as well (both schema and quality)

Refusing to refactor schemas or eliminate old data couples your architecture to the past, which is difficult to refactor.

Summary

  • The database can evolve alongside the architecture as long as proper engineering practices are applied, such as continuous integration, source control etc.
  • Refactoring databases is an important skill and craft.

Building Evolvable Architectures

  • Tying together the previously covered aspects (fitness functions, incremental change and appropriate coupling)

Mechanics

  • Identify Dimensions Affected by Evolution
  • Define Fitness Function(s) for Each Dimension
  • Use Deployment Pipelines to Automate Fitness Functions

Retrofitting existing architectures

  • Three factors
    • Component coupling and cohesion
    • Engineering practice maturity
    • Developer ease in crafting fitness functions
  • Refactoring vs Restructuring
    • Refactoring - No changes to external behavior
    • Restructuring an architecture - Often changes also behavior
  • Migrating architectures
    • Architects are often tempted by highly evolutionary architecture as a target for migration but this is often difficult, mainly because of existing coupling.
    • Trap: meta-work is more interesting than work (writing a framework rather than using one)

Don't build an architecture just because it will be fun meta-work.

When restructuring architecture, consider all the affected dimensions.

Migrating Architectures

  • When decomposing a monolithic architecture, finding the correct service granularity is key.
    1. Partitioning - considering
      • Business functionality groups
      • Transactional boundaries
      • Deployment goals
    2. Separation of business layers from the UI
    3. Service discovery

When migrating from a monolith, build a small number of larger services first. (Sam Newman)

Various Guidelines for building Evolutionary Architecture

All architectures become iterative because of unknown unknowns, Agile just recognizes this and does it sooner. (Mark Richards)

  • Build Anticorruption Layers
    • Encourages one to think about the semantics of what is needed from a library, not the syntax.

Developers understand the benefits of everything and the tradeoffs of nothing! (Rich Hickey)

Service Templates

  • Remove needless variables
  • Service templates are one common solution for ensuring consistency
    • Pre-configured sets of common infrastructure libraries (logging, monitoring, ...)
  • Seen as appropriate coupling by the book.

Build Sacrificial Architectures

The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. […] Hence plan to throw one away; you will, anyhow. (Fred Brooks)

Mitigate External Change

  • When relying on code from a third party, create own safeguards against unexpected occurrences: breaking changes, unannounced removal, and so on

Transitive dependency management is our "considered harmful" moment (Chris Ford)

Updating Libraries vs Frameworks

  • "a developer's code calls library whereas the framework calls a developer's code"
  • Libraries generally form less brittle coupling points than frameworks.
  • One informal governance model treats framework updates as push updates (~ASAP) and library updates as pull updates ("update when needed")

Various

  • Version numbering vs internal resolution
    • Prefer internal resolution to version numbering
    • Support only two versions at a time

Evolutionary Architecture Pitfalls and Antipatterns

  • Pitfalls and antipatterns
    • An antipattern is a practice that initially looks like a good idea, but turns out to be a mistake
    • A pitfall looks superficially like a good idea but immediately reveals itself to be a bad path

Antipattern: Vendor King

  • To escape: Treat all software as just another integration point

Pitfall: Leaky Abstractions

All non-trivial abstractions, to some degree, are leaky (Joel Spolsky)

Antipattern: Last 10% Trap

  • Experiences from a project with 4GL (rapid application development tools)
    • 80% of the functionality was quick and easy to build
    • Next 10% was extremely difficult but possible
    • Last 10% wasn't achieved
  • IBM's San Francisco Project
    • infinite regress problem

Antipattern: Code Reuse Abuse

Software reuse is more like an organ transplant than snapping together Lego blocks. (John D. Cook)

  • Ease of code use is often inversely proportional to how reusable that code is.
  • Microservices might adopt the philosophy of preferring duplication to coupling

When coupling points impede evolution or other important architectural characteristics, break the coupling by forking or duplication.

Antipattern: Inappropriate Governance

  • Software architecture never exists in a vacuum but it's often a reflection of the environment in which it was designed
  • Goal in most microservices projects isn't to pick different technologies cavalierly, but rather to right-size the technology choice for the size of the problem.
  • Goldilocks Governance model: Pick three technology stacks for standardization: Simple, intermediate and complex

Pitfall: Planning Horizons

  • The more time and effort you invest in planning or a document, the more likely you will protect what's contained in the plan/document even when it is inaccurate or outdated.

Putting Evolutionary Architecture into Practice

  • Cross-Functional Teams
    • One goal here is to eliminate coordination friction
  • Organize teams around business capabilities, not job functions
  • Product over Project
    • Products potentially live forever, unlike projects, which have a limited lifespan
    • Inverse Conway Maneuver
  • Dealing with external change: Consumer-driven contracts
  • Culture
    • Adjusting the behavior of a team often involves adjusting the process around it

Tell me how you measure me, and I will tell you how I will behave. (Dr Eliyahu M. Goldratt / The Haystack Syndrome)

  • Culture of Experimentation

The real measure of success is the number of experiments that can be crowded into 24 hours. (Thomas Alva Edison)

  • Finding the sweet spot between the proper quantum size and the corresponding costs.
  • The role of an Enterprise architect (in an evolutionary architecture): Guidance and enterprise-wide fitness functions

Why should a Company choose to build an Evolutionary Architecture? (A bit of a last chapter sales pitch)

  • Predictable vs evolvable
  • Scale
  • Advanced business capabilities
  • Cycle time as a business metric
  • Isolating architectural characteristics at the quantum level

Why Should a Company Choose Not to build an Evolutionary Architecture?

  • Can't evolve a ball of mud
  • Other architectural characteristics dominate
  • Sacrificial architecture