Saturday, November 12, 2022

Flow-Based Product Development

2022-principles-of-product-development-flow

Notes for the book The Principles of Product Development Flow: Second Generation Lean Product Development (Amazon) by Donald Reinertsen.

As a summary

  • The author has a strongly opinionated view on how product development should be done
  • The book is built as a collection of principles for various areas
  • The author draws inspiration for product development from various domains (e.g. queuing theory, data communication networks, warfare)

The Principles of Flow

It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so. (Mark Twain)

  • The author states that the dominant paradigm for managing product development is fundamentally wrong.
  • A new paradigm emphasizes achieving flow: e.g. small batch transfers, rapid feedback and limiting WIP.
  • It could be labeled Lean Product Development (LPD), though lean manufacturing has very different characteristics than product development.
  • Since it also draws ideas from other domains, the new paradigm is called Flow-Based Product Development

What's the problem?

  • Current paradigm based on internally consistent but dysfunctional beliefs.
    • E.g. combining the belief that efficiency is good with a blindness to queues -> high levels of capacity utilization -> large queues and long cycle times

Problems with the current orthodoxy:

  1. Failure to Correctly Quantify Economics
  • E.g. focusing too much on proxy variables
  2. Blindness to Queues
  • Too much design-in-process inventory (DIP)
  • Why? Because DIP is typically both financially and physically invisible
  • We are also often blind to the dangers of high levels of DIP
  3. Worship of Efficiency
  4. Hostility to Variability
  • Without variability, we cannot innovate.
  • Variability is only a proxy variable
  5. Worship of Conformance
  • Conformance to plan squanders the valuable new information constantly arriving throughout the development cycle.
  6. Institutionalization of Large Batch Sizes
  • Comes from blindness to queues and the focus on efficiency
  • Blindness to the issue of batch size
  7. Underutilization of Cadence
  • E.g. meetings with a regular and predictable cadence have very low set-up cost.
  8. Managing Timelines instead of Queues
  • Failure to understand the statistics of granular schedules
  • Queues are a better control variable than cycle time because today's queues are leading indicators of future cycle-time problems
  9. Absence of WIP Constraints
  10. Inflexibility
  • Specialized resources and high levels of utilization -> delays
  • How to tackle it? The current focus is on reducing variability
  • Instead, the book recommends making resources, people and processes flexible
  11. Non-economic Flow Control
  • Current systems to control flow are not based on economics.
  12. Centralized Control

Major themes of the book

The book has eight major themes and a major chapter for each.

  • Economics - Economically-based decision making
  • Queues - Even a basic understanding of queuing theory will help a lot with product development
  • Variability
  • Batch Size
  • WIP Constraints
  • Cadence, Synchronization and Flow Control
  • Fast Feedback (Loops) - Suggesting that feedback is what permits us to operate product development process effectively in a noisy environment
  • Decentralized Control

Relevant idea sources used

  • Lean Manufacturing
  • Economics
  • Queuing Theory
  • Statistics
  • The Internet (protocols)
  • Operating System Design
  • Control Engineering
  • Maneuver Warfare

The Design of the book

The economic view

  • Why do we want to change the product development process? The answer: To increase profits.
  • Proxy objectives/variables often used
  • The author's experience of asking people what a 60-day delay to market would cost a project: answers within one company typically vary over a range of 50 to 1
  • Approach product development decisions as economic choices

The Nature of Our Economics (Principles E1-E2)

  • Select actions based on quantified overall economic impact.
  • Five key economic objectives: Cycle time, product cost, product value, development expense and risk
  • We can’t just change one thing.

The Project Economic Framework (Principles E3-E5)

  • Unit of measure for a product and project: Life-cycle profit impact
  • If you only quantify one thing, quantify the cost of delay.

The Nature of Our Decisions (... E6-E11)

  • Important trade-offs are likely to have U-curve optimizations.
  • Important properties of U-curves
    • Optimization never occurs at extreme values
    • Flat bottoms -> U-curve optimizations do not require precise answers
  • See e.g. this blog post
  • Even imperfect answers improve decision making
  • Many economic choices are more valuable when made quickly.

Our Control Strategy (E12-E15)

  • Background: Many small decisions creating most value when done quickly
  • Use decision rules to decentralize economic control
    • Instead of controlling the decisions, control the economic logic of the decisions.
  • Ensure decision makers feel both cost and benefit.
  • We should make each decision at the point where further delay no longer increases the expected economic outcome.

Some Basic Economic Concepts (E16-E18)

  • Importance of marginal economics to e.g. avoid "feature creep"
  • Avoid "Sunk cost" fallacy, instead look at the return of the remaining investment

Managing Queues

  • Festina lente
    • Time spent in queues might be more important than speeding up the activities

Queuing theory basics

  • Queuing Theory originates from telecommunication
  • Basic concepts
    • Queue - Waiting work
    • Server - Resource performing work
    • Arrival process - The pattern with which work arrives (can be unpredictable)
    • Service process - The time it takes to accomplish the work (can be unpredictable)
    • Queuing discipline - The sequence/pattern in which waiting work is handled
  • A simple queue: M/M/1/∞ (Kendall notation)
    • First "M" - Arrival process (here a Markov process)
    • Second "M" - Service process (also a Markov process)
    • 1 - Number of parallel servers
    • ∞ (infinite) - (No) upper limit on queue size
  • Measures of queue performance
    • Occupancy
    • Cycle time

Why Queues Matter (Q1-Q2)

  • Work idling in queues increases inventory, which is the root cause of many other economic problems.
  • In manufacturing, we are often aware of work-in-progress (WIP) inventory. But in product development, we're often not aware of the design-in-progress (DIP) inventory.
  • Product development queues are often bigger than manufacturing queues.
  • Product development queues are often invisible -> they don't catch the eye.
  • Effect of queues:
    • Increased cycle time
    • Increased risk
    • Increased variability
    • Increased overhead
    • Lower quality (by delaying feedback)
    • Negative psychological effect

The Behavior of Queues (Q3-Q8)

  • For an M/M/1/∞ queue, capacity utilization (𝜌) allows us to predict many properties of the queue
    • E.g. the number of items in the system: 𝜌/(1-𝜌) -> as utilization approaches 100%, queues start to grow exponentially
  • Capacity utilization is difficult to measure. Instead, queue size and WIP/DIP are practical factors to measure.
  • See also A Dash of Queueing Theory - A good blog post on the topic with live simulations of various processes
  • High queue states cause most of the economic damage
  • If it is possible to balance the load / share a queue between multiple servers, that helps to manage queues. See the M/M/c queue for more details (the book uses the term M/M/n queue)
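The 𝜌/(1-𝜌) relationship above can be sketched in a few lines; the utilization values below are illustrative:

```python
# Standard M/M/1 results: how queue metrics explode as utilization nears 100%.

def mm1_metrics(rho):
    """Return (avg items in system, cycle-time multiplier) for an M/M/1 queue
    at capacity utilization rho (0 <= rho < 1)."""
    items_in_system = rho / (1 - rho)            # L = rho / (1 - rho)
    cycle_time_multiplier = 1 / (1 - rho)        # W relative to bare service time
    return items_in_system, cycle_time_multiplier

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    items, mult = mm1_metrics(rho)
    print(f"utilization {rho:.0%}: {items:5.1f} items in system, "
          f"cycle time x{mult:.0f}")
```

Going from 50% to 90% utilization takes the system from 1 queued-plus-served item to 9, and from 90% to 99% takes it to 99.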

The Economics of Queues (Q9-Q10)

  • Find optimum queue size with quantitative analysis, avoiding simple "Queues are evil"
  • Scheduling affects the queue cost (more on scheduling later)

Managing Queues (Q11-Q16)

  • Cumulative Flow Diagrams (CFDs) are useful for managing queues
  • Little's Law: Mean response time = mean number in system / mean throughput
    • Can be applied both to a queue or to the system as a whole
  • Control queue size instead of utilization or cycle time
  • From the statistics of random processes: over time, queues will randomly drift seriously out of control
    • The distribution of the cumulative sum of a random variable flattens and spreads as N grows
  • "We can rely on randomness to create a queue but we cannot rely on randomness to correct this queue"
  • -> Monitor the queues and intervene when needed
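Little's Law can be illustrated with a toy single-server simulation (a minimal sketch; the arrival and service rates are assumptions, not values from the book):

```python
import random

# Little's Law on a toy single-server queue: the time-average number of items
# in the system equals throughput times the mean time in the system.
# The arrival and service rates are illustrative assumptions.
random.seed(1)

def simulate(arrival_rate, service_rate, n_jobs=50_000):
    t_arrive = 0.0   # arrival time of the current job
    t_done = 0.0     # completion time of the previous job
    total_time_in_system = 0.0
    for _ in range(n_jobs):
        t_arrive += random.expovariate(arrival_rate)
        start = max(t_arrive, t_done)            # wait if the server is busy
        t_done = start + random.expovariate(service_rate)
        total_time_in_system += t_done - t_arrive
    horizon = t_done
    mean_W = total_time_in_system / n_jobs       # mean time in system
    throughput = n_jobs / horizon                # completed jobs per time unit
    mean_L = total_time_in_system / horizon      # time-average items in system
    return mean_L, throughput, mean_W

L, lam, W = simulate(arrival_rate=0.8, service_rate=1.0)
print(f"L = {L:.2f}, lambda x W = {lam * W:.2f}")  # the two sides match
```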

Exploiting variability

We cannot add value without adding variability but we can add variability without adding value

  • The economic cost of variability (measured by an economic payoff function) is more important than the amount of variability

The Economics of Product Development (V1-V4)

  • Risk-taking is central to value creation in product development.
  • We cannot maximize economic value by eliminating all choices with uncertain outcomes
  • Asymmetric Payoff is important with creating economic value with variability (See e.g. Product Development Payoff Asymmetry)
    • Note that payoff functions in product development differ from those in manufacturing, where variance is almost always a negative thing.
  • Variability is not desired or undesired as such. Instead, it is desired when it increases economic value.
    • -> It shouldn't be minimized or maximized
  • A 50% failure rate is usually optimum for generating information. Note, though, that not all activities are designed to maximize information.

Reducing Variability (V5-V11)

  • Two main approaches to improve the economics of variability
    • Change the amount of variability
    • Change the economic consequences of variability
  • Diffusion principle: When uncorrelated random variables are combined, the relative variability of the sum decreases.
    • E.g. diversifying a stock portfolio
    • Doing many small experiments instead of one big one.
  • Repetition and reuse reduce variation
  • With buffers we can trade e.g. cycle time for reduced variability in cycle time
    • -> Finding the best amount of buffering (not minimizing buffer nor maximizing confidence)
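The pooling effect can be checked numerically: summing N uncorrelated tasks grows the mean by N but the standard deviation only by √N, so the coefficient of variation shrinks as 1/√N. A minimal sketch, with an assumed exponential task-duration distribution:

```python
import math
import random

# Variability pooling: relative variability of a sum of uncorrelated tasks
# shrinks as 1/sqrt(N). The exponential task distribution is an assumption.
random.seed(42)

def relative_variability(n_tasks, trials=20_000):
    totals = [sum(random.expovariate(1.0) for _ in range(n_tasks))
              for _ in range(trials)]
    mean = sum(totals) / trials
    var = sum((t - mean) ** 2 for t in totals) / trials
    return math.sqrt(var) / mean                 # coefficient of variation

for n in (1, 4, 16):
    print(f"{n:2d} pooled tasks: CV = {relative_variability(n):.2f}")
# Theory: CV = 1/sqrt(n), i.e. 1.00, 0.50, 0.25
```

This is the same mechanism as diversifying a stock portfolio or running many small experiments instead of one big one.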

Reducing Economic Consequences (V12-V16)

  • Usually the best way to reduce cost of variability
  • Rapid feedback
  • Aim to replace expensive variability with cheap variability.
  • Note: Often it is better to improve iteration speed than defect rate.

Reducing batch size

  • Product developers don't usually think of batch size, which would be an important tool to improve flow

The Case for Batch Size Reduction (B1-B10)

  • Reducing batch size (normally)
    • Reduces cycle time
    • Reduces variability
    • Accelerates feedback
    • Reduces risk
    • Reduces overhead
  • Whereas large batch sizes (normally)
    • Reduce overall efficiency
    • Lower motivation
    • Cause exponential cost and schedule growth
    • Lead to even larger batches

The Science of Batch Size

  • Economic batch size is usually a U-curve optimization (see Economic Order Quantity (EOQ))
  • Batch size reduction often lowers transaction costs, which saves more than originally assumed
    • -> Usually we don't know the optimum batch size without testing and measuring.
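The EOQ-style U-curve and its flat bottom can be sketched as follows; the cost constants are illustrative assumptions:

```python
import math

# EOQ-style U-curve: per-item cost = transaction cost amortized over the batch
# plus holding cost growing with batch size. Cost constants are assumptions.

def cost_per_item(batch_size, transaction_cost=100.0, holding_cost=1.0):
    return transaction_cost / batch_size + holding_cost * batch_size / 2

optimum = math.sqrt(2 * 100.0 / 1.0)             # classic EOQ formula
print(f"optimum batch: {optimum:.1f}, cost {cost_per_item(optimum):.2f}")
# The bottom is flat: 40% below the optimum raises cost only ~13%.
print(f"at 60% of optimum: cost {cost_per_item(0.6 * optimum):.2f}")
```

The flat bottom is why U-curve optimizations do not require precise inputs, as noted earlier in the economics chapter.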

Managing Batch Size

  • Distinguish between the
    • "production batch" - Changing the state of the product
    • "transport batch" - Changing the location of the product (typically more important)
  • To enable small transport batch size, reduce distances. -> Co-locate teams etc.
  • Note: Small batches require good infrastructure
  • Consider sequence/order of batches
  • Adjust batch sizes as the context changes

Applying WIP constraints

It is easier to start work than it is to finish it

Stop starting, start finishing.

  • A tool to respond to growing queues
  • WIP constraints, can be seen in e.g.
    • Manufacturing - Toyota Production System (TPS)
      • Note: Mainly repetitive and homogenous flows
    • Telecommunication networks & protocols as inspiration
      • Assuming highly variable, nonhomogeneous flows

The Economic Logic of WIP Control (W1-W5)

  • WIP constraints
    • Enable controlling cycle time and flow
    • Note that they also reject potentially valuable demand and reduce capacity utilization
    • -> Cost-benefit analysis
    • Force rate-matching
  • Theory of Constraints (TOC)
    • Identify the bottleneck in the process -> subordinate the rest of the work to it
    • A global constraint
    • Useful for predictable and permanent bottlenecks
  • When possible, constrain local WIP pools (Local Constraints)
    • E.g. TPS Kanban system
    • Useful when there is no predictable/permanent bottleneck

Reacting to Emergent Queues (W6-W14)

  • The core of managing queues is not monitoring them but the actions taken when limits are reached

Various ways to respond to high WIP

  • Demand-focused
    • Block all demand when WIP hits the upper limit
    • Purge low-value jobs on high WIP - Kill the "zombie projects"
    • Shed requirements
  • Supply-focused
    • Extra resources
    • Part-time resources for high variability tasks
    • Powerful experts to emerging bottlenecks
    • T-shaped resources
    • Cross-training
  • Mix change

WIP Constraints in Practice (W15-W23)

  • W15-W23 are practical principles for controlling WIP
  • As one pick: inspired by the "window size" of Internet protocols, adjust WIP limits as capacity changes
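That window-size idea can be sketched as a simple controller: grow the WIP limit while the system keeps up, cut it sharply when cycle times degrade. The thresholds and step sizes here are illustrative assumptions, not the book's recipe:

```python
# TCP-congestion-window-style WIP limit: additive increase while healthy,
# multiplicative decrease on congestion. Thresholds/steps are assumptions.

def adjust_wip_limit(wip_limit, cycle_time, target_cycle_time,
                     min_limit=1, step=1):
    if cycle_time > 1.5 * target_cycle_time:
        return max(min_limit, wip_limit // 2)    # congestion: back off sharply
    if cycle_time <= target_cycle_time:
        return wip_limit + step                  # healthy: probe more capacity
    return wip_limit                             # in between: hold steady

limit = 8
for observed in (4.0, 4.5, 12.0, 5.0):           # observed cycle times, target 5.0
    limit = adjust_wip_limit(limit, observed, target_cycle_time=5.0)
    print(f"observed {observed:4.1f} days -> WIP limit {limit}")
```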

Controlling Flow Under Uncertainty

Anyone can be captain in a calm sea

  • WIP constraints are important but don't solve all our problems.

Congestion (F1-F4)

  • Congestion: A system condition combining high capacity utilization and low throughput
  • Traffic flow = Speed x Density
    • Vehicles/hour = Miles/hour x Vehicles/Mile
  • Bruce Greenshields' Traffic Flow Model
    • As speed increases, the distances increase and the density decreases
    • Throughput is a parabolic curve - Low throughput at both extremes
    • Low-speed operating point ("left") is inherently unstable
      • Increasing density -> Speed decreases -> Flow decreases -> Density increases
    • The high-speed operating point ("right") is inherently stable
      • Increasing density -> Decreasing speed -> Increasing flow -> Decreasing Density
    • See also "Traffic Flow Theory" section at The Science of Kanban - Process
  • For a system with a strong throughput peak, we usually want to operate near that point
    • To maintain the system at the desirable operating point, it is easiest to control occupancy (a more general term for density)
  • Use expected flow time instead of queue size to inform users of congestion
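Greenshields' model can be sketched in a few lines; the free-flow speed and jam density below are illustrative:

```python
# Greenshields' linear speed-density model: flow = speed x density is a
# parabola with a single throughput peak. Constants are illustrative.

def flow(density, free_speed=60.0, jam_density=200.0):
    speed = free_speed * (1 - density / jam_density)   # linear speed-density
    return speed * density                             # vehicles per hour

for d in (0, 50, 100, 150, 200):
    print(f"density {d:3d} veh/mile -> flow {flow(d):6.0f} veh/hour")
# Peak throughput at half the jam density; zero flow at both extremes.
```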

Cadence (F5-F9)

  • Cadence - Use of a regular, predictable rhythm within a process
  • Can be used to e.g.
    • Limit the accumulation of variance
    • Make waiting times predictable
    • To enable small batch sizes
  • Some examples for use of cadence: Product introduction, testing, project meetings, ...

Synchronization (F10-F14)

  • Synchronization vs cadence
    • Cadence causes events to happen at regular time intervals.
    • Synchronization causes multiple events to happen at the same time.
  • Valuable when there is economic advantage from processing multiple items at the same time.

Sequencing Work (F15-F21)

  • Sequencing in manufacturing vs product development
    • In manufacturing, like the Toyota Production System, work is processed on a first-in-first-out (FIFO) basis
    • Product development is different as both delay costs and task durations vary among projects.
    • Hospital emergency room is a good mental model for sequencing work
  • The author emphasizes two points
    • Complex prioritization algorithms are often used - instead, prefer a simple approach (prevents big mistakes)
    • Sequencing matters most when queues are large - with small queues, sequencing matters less
  • When delay costs are homogeneous, do the shortest job first (SJF)
  • When job durations are homogeneous, do the highest cost-of-delay job first (HDCF)
  • When neither delay costs nor job durations are homogeneous, use weighted shortest job first (WSJF)
  • Three common mistakes with prioritizing
    • Prioritizing purely on ROI.
    • FIFO
    • Critical chain (not optimal when projects have different delay costs)
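WSJF itself is a one-liner: sequence by cost of delay divided by duration, highest ratio first. The job data below is illustrative:

```python
# Weighted shortest job first: order jobs by cost_of_delay / duration, descending.

def wsjf_order(jobs):
    """jobs: list of (name, cost_of_delay_per_week, duration_weeks)."""
    return sorted(jobs, key=lambda j: j[1] / j[2], reverse=True)

jobs = [("A", 10, 5),    # ratio 2.0
        ("B", 3, 1),     # ratio 3.0
        ("C", 8, 8)]     # ratio 1.0
print([name for name, *_ in wsjf_order(jobs)])   # B first, then A, then C
```

Note how B jumps ahead of A despite its lower cost of delay: it is cheap to get out of the way.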

Managing the Development Network (F22-F28)

  • Manage the product development resource network with ideas from managing a data communication network
  • Routing tailored for tasks
  • Route based on the currently most economic route - often this can be chosen only over a short time horizon
  • Alternate routes to avoid congestion
  • ...
  • Flexibility helps to absorb variation but requires pre-planning and investment.

Correcting Two Misconceptions (F29-F30)

  • "It is always preferable to decentralize all resources to dedicated project teams"
    • Often the case but not always
    • Centralized resources can enable variability pooling and thus reduce queues
  • "Queueing delays at a bottleneck are solely determined by the characteristics of the bottleneck"
    • The process before the bottleneck also has an important influence
    • -> Aim to reduce variability before a bottleneck.

Using fast feedback

  • Use of feedback loops and control systems
  • Combining ideas from economics and control systems engineering
  • Issues of dynamic response and stability
  • This material will require a mental shift for people with a manufacturing or quality control background
    • In the economics of manufacturing, payoff functions have an inverted-U shape - the larger the variance, the larger the losses
    • Product development is different, as the goals are dynamic and payoff functions can be asymmetric
  • Fast feedback can alter the economic payoff function, as it
    • ... allows us to truncate unproductive paths more quickly
    • ... raises expected gains by exploiting good outcomes

The Economic View of Control (FF1-FF6)

  • What makes a good control variable?
    • Economic influence
    • Efficiency of control
    • Ones that allow early intervention
  • Focus on controlling economic impact instead of focusing on the proxy variables
    • -> E.g. set "alert thresholds" to points of equal impact
  • Note the difference between static and dynamic goals
    • In product development, we continually get better information, which allows us to reevaluate and shift the goals

The Benefits of Fast Feedback (FF7-FF8)

  • Fast feedback
    • Enables smaller queues
    • Makes learning faster and more efficient
  • Typically it requires investment to create an environment to extract the smaller signals

Control System Design (FF9-FF18)

  • Note the difference between a metric and a control system
    • "What gets measured might not get done"
  • Short turning radius reduces the need for longer planning horizons -> reduces the magnitude of the control problem
  • Prefer local feedback
  • Combine long and short control loops
    • short time horizon for adapting to the random variation of the process
    • long time horizon for improving process characteristics considered causal to success

The Human Side of Feedback (FF19-FF24)

  • Colocation typically improves communication
    • Faster feedback
    • Psychological aspects
  • Faster feedback improves the sense of control
  • Large queues prevent an atmosphere of urgency
  • Human elements tend to amplify large excursions -> Aim to keep the system within a controllable range
  • Balance personal/local/overall basis of rewarding to align behaviors

Metrics for Flow-Based Development

  • For a list, see page 15 of Agile Metrics at Scale
  • Flow
    • Design-in-process inventory (DIP)
    • Average flow time
  • Queues
    • Number of items in queue (easy)
    • Estimate amount of work in queue (difficult)
    • Quite often the first one is surprisingly effective and enough

Achieving Decentralized Control

  • Decentralized control allows fast local feedback loops (the topic of the previous chapter) to work best
  • Examining what we can learn from military doctrine
    • Military has long history on balancing centralized and decentralized control
    • Advanced models of centrally coordinated, decentralized control
  • The Marines
    • ... believe that warfare constantly presents unforeseen obstacles and unexpected opportunities.
    • ... believe that the original plan was based on imperfect data

How Warfare works

  • Typically one side attacks and the other defends
    • Typical understanding: attack and defense require different organizational approaches
    • Old military adage: Centralize control for offense, decentralize it for defense
    • Rule of thumb: For an attacker to succeed, they should outnumber defenders by 3 to 1
  • Attacker can concentrate the forces, the defender must allocate forces to the entire perimeter
  • Various approaches for the defender
    • Harden the perimeter at the most likely points of attack. But this is often circumvented
    • Better: Mass nearby forces to counteract the local superiority of the attacker.
    • Related: the defense-in-depth approach: an outer perimeter that slows attacking forces, allowing more defenders to be moved to the area of the attack.
  • Maneuver warfare: Use of surprise and movement

Balancing Centralization and Decentralization (D1-D6)

  • Decentralize control for problems and opportunities that are best dealt with quickly
  • Centralize control for problems that are infrequent, large or have significant economies of scale
  • Adapt the approach as the knowledge increases
    • Triage process approach (works if there is enough information when a new problem arrives)
    • Escalation process
  • The value of a faster response time can outweigh the inefficiency of decentralization
  • Pure decentralization is rarely optimal; instead, find a balance

Military Lessons on Maintaining Alignment (D7-D16)

  • Misalignment is the risk of decentralized control
    • Locally optimal choices might be bad at the system level
    • Overall alignment creates more value than local excellence
  • Maintaining alignment is "the sophisticated heart of maneuver warfare"
  • Mission: Specify the end goal, its purpose and the minimum possible constraints
  • Establish clear roles and boundaries
    • Avoid both excessive overlap and gaps
  • Designate a main effort and focus on it
    • Often only a small set of product attributes truly drives success
  • The main effort can be shifted when conditions change
    • -> Develop the ability to quickly shift focus
    • OODA loop (Observe -> Orient -> Decide -> Act) by Colonel John Boyd
  • Localize tactical coordination
  • Make early and meaningful contact with the problem
    • In product development, our "opposing forces" are the market and the technical risks
    • There is no substitute for quick proofs of concept and early market feedback

The Technology of Decentralization (D17-D20)

  • Key information is needed to make decisions -> share it
    • In the Marine Corps, the minimum is to understand the intentions of commanders two levels higher in the organization
  • Accelerate decision-making speed
    • Fewer people and layers of management -> give authority, information and practice to lower organizational levels to make decisions.
    • When response time is important, measure it.

The Human Side of Decentralization (D21-D23)

  • Cultivate Initiative
    • The Marines view initiative as the most critical quality in a leader.
  • Face-to-face communication
  • Decentralized control is based on trust. Trust is built through experience.

Monday, October 10, 2022

Building Evolutionary Architectures

2022-building-evolutionary-architectures

Notes for the book Building Evolutionary Architectures by Neal Ford, Rebecca Parsons and Patrick Kua.

Main take-aways / Summary

  • Software architectures are not created in a vacuum - They always reflect the ecosystem in which they were defined
    • E.g. When SOA was popular, all infrastructure was commercial, licensed and expensive.
  • An evolutionary architecture supports guided, incremental change across multiple dimensions.
  • Anything that verifies the architecture is a fitness function
    • -> Treat those uniformly
    • Think of architectural characteristics as evaluable things.

Software Architecture

  • There are many definitions for software architecture.
  • There are many "-ilities" for software architecture to support. This book adds a new one: evolvability.
  • Whatever the aspect of software development, we expect constant change.
  • The alternative to fixed plans? Learning to adapt. Make change less expensive, e.g. by automating formerly manual processes.
  • Yet another definition of software architecture: "parts hard to change later"
    • A convenient definition, but also a blind spot and a potentially self-fulfilling prophecy.
  • -> Building changeability into architecture?
    • Having ease of change as a principle.

Evolutionary architecture

  • Book's definition:

An evolutionary architecture supports guided, incremental change across multiple dimensions.

  • Incremental change - Two aspects: How teams build software incrementally and how they deploy it.
  • Guided changes - Once architects have chosen important characteristics, they want to guide changes to the architecture to protect those characteristics.
  • There are many dimensions of architecture
    • Architectural concerns, i.e. the list of "-ilities".
    • Not only "-ilities" but other dimensions to consider for evolvability
      • Technical dimensions
      • Data
      • Security
      • Operational/System
  • There are various techniques for carving up architectures
  • In this book, in contrast, we don't attempt to create a taxonomy of dimensions but rather recognize the ones extant in existing projects.
  • Impact of team structure on surprising things, e.g. architecture -> Conway's law

Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.

  • So one should not pay attention only to the architecture and design of the software, but also the delegation, assignment, and coordination of the work between teams.
  • Inverse Conway Maneuver - Structuring teams and organizational structure around the desired architecture.

Structure teams to look like your target architecture, and it will be easier to achieve it.

  • Two critical characteristics for evolutionary architecture: incremental and guided.

Fitness Functions

  • Book's definition for a fitness function:

An architectural fitness function provides an objective integrity assessment of some architectural characteristic(s).

  • Systemwide fitness function - a collection of fitness functions corresponding to different dimensions of the architecture.
  • It is an important architectural decision to define important dimensions (scalability, performance, security, ...)
  • Different "categories" of fitness functions
    • Atomic vs Holistic
    • Triggered vs Continual
    • Static vs Dynamic
    • Automated vs Manual
    • Temporal (e.g. "break upon upgrade")
    • Intentional over Emergent (There will usually be unknown unknowns)
    • Domain-specific
  • Importance categories to classify fitness functions into
    • Key - Crucial ones
    • Relevant
    • Not relevant
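As a minimal sketch of an atomic, triggered, automated fitness function (the service function and its latency budget are hypothetical, not from the book):

```python
import time

# Atomic + triggered + automated fitness function: a unit-test-style check
# that one architectural characteristic (a latency budget) still holds.
LATENCY_BUDGET_SECONDS = 0.05                    # assumed characteristic

def lookup_user(user_id):                        # hypothetical component
    time.sleep(0.001)                            # stand-in for real work
    return {"id": user_id}

def test_lookup_meets_latency_budget():
    start = time.perf_counter()
    lookup_user(42)
    elapsed = time.perf_counter() - start
    assert elapsed < LATENCY_BUDGET_SECONDS, f"latency {elapsed:.3f}s over budget"

test_lookup_meets_latency_budget()
print("fitness function passed")
```

Run in a deployment pipeline, such checks make an architectural characteristic an evaluable thing rather than an aspiration.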

Engineering Incremental Change

Architecture is abstract until operationalized, when it becomes a living thing

  • Long-term viability of an architecture cannot be judged until design, implementation, upgrade and inevitable change are successful.
  • Common combinations of fitness function categories
    • atomic + triggered - e.g. unit tests
    • holistic + triggered - e.g. wider integration testing via a deployment pipeline
    • atomic + continual
    • holistic + continual - e.g. Chaos Monkey
  • Hypothesis- and Data-Driven Development
    • Hypothesis-driven development - Include users also in the feedback loop

Architectural Coupling

  • Focus on appropriate coupling - how to identify which dimensions of the architecture should be coupled
  • Term definitions here
    • module - some way of grouping related code together
    • modularity - logical grouping of related code
    • components - physical packaging of modules
    • Modules imply logical grouping while components imply physical grouping.
    • library is one kind of a component
  • functional cohesion - business concepts semantically binding parts of the system together
  • architectural quantum - independently deployable component
    • quantum size determines the lower bound of the incremental change possible
  • One key thing: Determining structural component granularity and coupling between components
  • In general, the smaller the architectural quanta, the more evolvable the architecture will be.
  • JDepend for package dependencies
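A JDepend-style dependency check can be sketched in Python as well, as a fitness function over import statements; the layer names and the forbidden-dependency rule are hypothetical:

```python
import ast

# Dependency fitness function: fail if a module in one layer imports from a
# layer it must not depend on. Layer names and the rule are hypothetical.
FORBIDDEN = {("persistence", "web")}             # (importer layer, imported layer)

def layer_violations(module_layer, source):
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            imported_layer = name.split(".")[0]  # top-level package = layer
            if (module_layer, imported_layer) in FORBIDDEN:
                violations.append(name)
    return violations

print(layer_violations("persistence", "from web.handlers import render"))
```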

Evolvability of architectural styles

  • Different architectural styles have different inherent quantum sizes

Big Ball of Mud

  • Quantum: The whole system
  • Incremental change is difficult because of scattered dependencies
  • Building fitness functions is difficult because there is no clearly defined partitioning
  • Good example of inappropriate coupling

Unstructured monoliths

  • Large quantum size hinders incremental change
  • Building fitness functions difficult but not impossible
  • Somewhat similar coupling as with Big Ball of Mud

Layered monoliths

  • Quantum is still the whole application
  • Incremental change easier particularly if changes are isolated to existing layers
  • Easier to write fitness functions with more structure
  • Often easy understandability

Modular monoliths

  • Many of the benefits of microservices can be achieved also with monoliths if developers are extremely disciplined about coupling
  • Incremental change easier because of modularity
  • Easier to design and implement fitness functions
  • Appropriate coupling

If you can't build a monolith, what makes you think microservices are the answer (Simon Brown)

Microkernel

  • Commonly used in e.g. browsers and IDEs
  • Typically a core system with an API for plug-ins
  • Quantum: One for the core, another for the plug-ins

Event-Driven architectures - Broker pattern

  • Typically message queues, initiating event, intra-process events, event processors
  • Coordination and error handling typically difficult
  • Allow incremental change in multiple forms
  • Atomic fitness functions typically easy to write but holistic fitness functions are both necessary and complex in this architecture
  • Low degree of coupling - Between services and the message contracts

Event-Driven architectures - Mediator pattern

  • Has a hub that acts as a coordinator
  • Primary advantage: Transactional coordination
  • Incremental change as with the broker pattern
  • Holistic fitness functions easier to build than with the broker version
  • Coupling increases

Broker or mediator - classic example of an architectural tradeoff

Service-Oriented Architectures - ESB-driven SOA

  • Enterprise Service Bus (ESB) - Mediator for event interactions
  • Styles differ, but all are based on segregating services by reusability, shared concepts and scope.
  • Architectural quantum is massive - Entire system
  • Incremental change allows reuse and segregation of resources but hampers making the most common types of change to business domains.
  • Testing in general is difficult.
  • Note: Software architectures are not created in a vacuum - They always reflect the ecosystem in which they were defined
    • E.g. When SOA was popular
      • Automatic provisioning of machines wasn't possible
      • all infrastructure was commercial, licensed and expensive.

Service-Oriented Architectures - Microservices

  • Combines engineering practices of Continuous Delivery with logical partitioning of bounded contexts
    • Typically separated along domain dimension
    • Compared to typical layered architecture, a microservice has all the layers but handles only one bounded context
  • 7 principles from Building Microservices
    • Modelled around the business domain
    • Hide implementation details
    • Culture of automation
    • Highly decentralized
    • Deployed independently
    • Isolate failure
    • Highly observable
  • "Share nothing" - "No entangling coupling points"
  • Service templates such as DropWizard and Spring Boot
  • Why not done before? See the earlier note on e.g. SOA

Evolutionary standpoint:

  • Supports both aspects of incremental change
  • Easy to build both atomic and holistic fitness functions.
    • (Well, I wouldn't agree 100% myself about holistic fitness functions)
  • Two kinds of coupling: Integration and service template

Service-based architectures

  • Similar to microservices but differs in one or more of:
    • service granularity - bigger services / quantum size
    • database scope - sharing a database
    • integration middleware - a mediator like service bus
  • Incremental change relatively functional
  • Potentially more difficult to write fitness functions
  • More coupling

"Serverless" Architectures

  • Broadly, two different meanings
    • BaaS - Backend as a Service
    • FaaS - Function as a Service
  • Supports incremental change
  • Typically requires more holistic fitness functions
  • Attractive because it eliminates several dimensions/concerns
  • Suffers from serious constraints also

Evolutionary Data

Migrations

  • Developers should treat changes to database structure the same way as to source code: tested, versioned and incremental
  • Most teams have moved away from building undo migration capabilities
    • If all the migrations exist, the database can be built up to exactly the point needed, without restoring a previous version
    • Why maintain two versions of correctness, both forward and backward?
    • Sometimes daunting challenges / impossible (e.g. dropping a column or a table)
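The "schema changes as versioned source code" idea can be sketched as a minimal, hypothetical migration runner (real tools such as Flyway or Liquibase provide this); here against an in-memory SQLite database:

```python
import sqlite3

# Ordered, forward-only migrations: tested, versioned, incremental.
# Once shipped, a migration is never edited -- only new ones are appended.
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customer ADD COLUMN email TEXT"),
]

def migrate(conn):
    """Apply every migration newer than the database's current version."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()
    return max(current, MIGRATIONS[-1][0])

conn = sqlite3.connect(":memory:")
migrate(conn)  # builds the schema to the latest version; reruns are no-ops
```

Because every migration is kept, any schema version can be rebuilt from scratch, which is why undo migrations add little: there is no second "version of correctness" to maintain.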

Shared Database integration

  • Shared Database Integration pattern
  • Using the database as an integration point fossilizes the database schema across all sharing projects
  • To evolve the schema: Expand/contract pattern
  • Options on example change
    • No integration points, no legacy data -> straight-forward
    • Legacy data, no integration points -> migrate the data; after that, done
    • Existing data and integration points -> Potentially DB triggers etc.
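The expand/contract pattern can be sketched with SQLite (table and column names are hypothetical). In the expand phase the old and new columns coexist, so existing integration points keep working; the contract phase removes the old column once every consumer has migrated:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.execute("INSERT INTO person VALUES (1, 'Ada Lovelace')")

# Expand: add the new column and backfill it. The legacy 'fullname'
# column stays, so old readers and writers are unaffected.
conn.execute("ALTER TABLE person ADD COLUMN display_name TEXT")
conn.execute("UPDATE person SET display_name = fullname")

# (Transition period: writers keep both columns in sync, e.g. via a trigger.)

# Contract: once all integration points read 'display_name', drop the
# old column by rebuilding the table (older SQLite lacks DROP COLUMN).
conn.executescript("""
    CREATE TABLE person_new (id INTEGER PRIMARY KEY, display_name TEXT);
    INSERT INTO person_new SELECT id, display_name FROM person;
    DROP TABLE person;
    ALTER TABLE person_new RENAME TO person;
""")
```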

Two-phase commit Transactions

  • Transactions are a special form of coupling because transactional behavior doesn't typically appear in traditional architecture-centric tools
  • Heavily transactional systems difficult to translate to e.g. microservices
  • The coupling imposed by databases is strong because of transaction boundaries, which often define how the business processes work.

Database transactions act as a strong nuclear force, binding quanta together.

Age and Quality of Data + summary

  • Adding another join table is a common way to expand schema definitions
  • For evolutionary architecture, make sure developers can evolve the data as well (both schema and quality)

Refusing to refactor schemas or eliminate old data couples your architecture to the past, which is difficult to refactor.

Summary

  • The database can evolve alongside the architecture as long as proper engineering practices are applied, such as continuous integration, source control etc.
  • Refactoring databases is an important skill and craft.

Building Evolvable Architectures

  • Tying together the previously handled aspects (fitness functions, incremental change and appropriate coupling)

Mechanics

  • Identify Dimensions Affected by Evolution
  • Define Fitness Function(s) for Each Dimension
  • Use Deployment Pipelines to Automate Fitness Functions
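As a sketch of such an automated fitness function (layering rule and package names are hypothetical): a test in the deployment pipeline can parse the code and fail the build when the domain layer grows a dependency on the infrastructure layer:

```python
import ast
from pathlib import Path

def imports_of(source):
    """Top-level module names imported by a piece of Python source."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return modules

def domain_is_clean(root):
    """Fitness function: no file under domain/ may import infrastructure."""
    return all(
        "infrastructure" not in imports_of(path.read_text(encoding="utf-8"))
        for path in Path(root, "domain").rglob("*.py")
    )
```

Run as a normal unit test, this makes the architectural decision executable rather than a slide in a deck.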

Retrofitting existing architectures

  • Three factors
    • Component coupling and cohesion
    • Engineering practice maturity
    • Developer ease in crafting fitness functions
  • Refactoring vs Restructuring
    • Refactoring - No changes to external behavior
    • Restructuring an architecture - Often changes also behavior
  • Migrating architectures
    • Architects are often tempted by highly evolutionary architecture as a target for migration but this is often difficult, mainly because of existing coupling.
    • The trap of "meta-work is more interesting than work" (writing a framework rather than using one)

Don't build an architecture just because it will be fun meta-work.

When restructuring architecture, consider all the affected dimensions.

Migrating Architectures

  • When decomposing a monolithic architecture, finding the correct service granularity is key.
    1. Partitioning - considering
    • Business functionality groups
    • Transactional boundaries
    • Deployment goals
    2. Separation of business layers from the UI
    3. Service discovery

When migrating from a monolith, build a small number of larger services first. (Sam Newman)

Various Guidelines for building Evolutionary Architecture

All architectures become iterative because of unknown unknowns; Agile just recognizes this and does it sooner. (Mark Richards)

  • Build Anticorruption Layers
    • Encourages one to think about the semantics of what is needed from a library, not the syntax.
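A minimal anticorruption-layer sketch (all names hypothetical): callers depend on the semantics "send a notification", while the vendor's syntax is confined to one adapter, so replacing the vendor touches a single module:

```python
from dataclasses import dataclass

@dataclass
class Notification:
    """Our own model: the semantics we need, independent of any vendor."""
    recipient: str
    message: str

class Notifier:
    """Anticorruption layer translating our model to a vendor's API.
    'vendor_client' stands in for any third-party SDK."""

    def __init__(self, vendor_client):
        self._client = vendor_client

    def send(self, note):
        # Vendor-specific syntax lives only on this line.
        self._client.post_message(to=note.recipient, body=note.message)
```

Swapping vendors means writing a new adapter with the same `send` semantics; nothing else in the codebase changes.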

Developers understand the benefits of everything and the tradeoffs of nothing! (Rich Hickey)

Service Templates

  • Remove needless variables
  • Service templates are one common solution for ensuring consistency
    • Pre-configured sets of common infrastructure libraries (logging, monitoring, ...)
  • Seen as appropriate coupling by the book.

Build Sacrificial Architectures

The management question, therefore, is not whether to build a pilot system and throw it away. You will do that. […] Hence plan to throw one away; you will, anyhow. (Fred Brooks)

Mitigate External Change

  • When relying on code from a third party, create own safeguards against unexpected occurrences: breaking changes, unannounced removal, and so on

Transitive dependency management is our "considered harmful" moment (Chris Ford)

Updating Libraries vs Frameworks

  • "a developer's code calls library whereas the framework calls a developer's code"
  • Libraries generally form less brittle coupling points than frameworks.
  • One informal governance model treats framework updates as push updates (~ASAP) and library updates as pull updates ("update when needed")

Various

  • Version numbering vs internal resolution
    • Prefer internal versioning to numbering
    • support only two versions at a time

Evolutionary Architecture Pitfalls and Antipatterns

  • Pitfalls and antipatterns
    • An antipattern is a practice that initially looks like a good idea, but turns out to be a mistake
    • A pitfall looks superficially like a good idea but immediately reveals itself to be a bad path

Antipattern: Vendor King

  • To escape: Treat all software as just another integration point

Pitfall: Leaky Abstractions

All non-trivial abstractions, to some degree, are leaky (Joel Spolsky)

Antipattern: Last 10% Trap

  • Experiences from a project with 4GL (rapid application development tools)
    • 80% of the functionality was quick and easy to build
    • Next 10% was extremely difficult but possible
    • Last 10% wasn't achieved
  • IBM's San Francisco Project
    • infinite regress problem

Antipattern: Code Reuse Abuse

Software reuse is more like an organ transplant than snapping together Lego blocks. (John D. Cook)

  • Ease of code use is often inversely proportional to how reusable that code is.
  • Microservices might adopt the philosophy of preferring duplication to coupling

When coupling points impede evolution or other important architectural characteristics, break the coupling by forking or duplication.

Antipattern: Inappropriate Governance

  • Software architecture never exists in a vacuum but it's often a reflection of the environment in which it was designed
  • Goal in most microservices projects isn't to pick different technologies cavalierly, but rather to right-size the technology choice for the size of the problem.
  • Goldilocks Governance model: Pick three technology stacks for standardization: Simple, intermediate and complex

Pitfall: Planning Horizons

  • The more time and effort you invest in planning or a document, the more likely you will protect what's contained in the plan/document even when it is inaccurate or outdated.

Putting Evolutionary Architecture into Practice

  • Cross-Functional Teams
    • One goal here is to eliminate coordination friction
  • Organize teams around business capabilities, not job functions
  • Product over Project
    • Products live potentially forever, unlike the limited lifespan of a project
    • Inverse Conway Maneuver
  • Dealing with external change: Consumer-driven contracts
  • Culture
    • Adjusting the behavior of a team often involves adjusting the process around it

Tell me how you measure me, and I will tell you how I will behave. (Dr Eliyahu M. Goldratt / The Haystack Syndrome)

  • Culture of Experimentation

The real measure of success is the number of experiments that can be crowded into 24 hours. (Thomas Alva Edison)

  • Finding the sweet spot between the proper quantum size and the corresponding costs.
  • The role of an Enterprise architect (in an evolutionary architecture): Guidance and enterprise-wide fitness functions

Why should a Company choose to build an Evolutionary Architecture? (A bit of a last chapter sales pitch)

  • Predictable vs evolvable
  • Scale
  • Advanced business capabilities
  • Cycle time as a business metric
  • Isolating architectural characteristics at the quantum level

Why Should a Company Choose Not to build an Evolutionary Architecture?

  • Can't evolve a ball of mud
  • Other architectural characteristics dominate
  • Sacrificial architecture

Tuesday, January 11, 2022

A Philosophy of Software Design

Notes for the book A Philosophy of Software Design by John Ousterhout (see also his home page).

Introduction

  • The author starts with an interesting thought: Writing computer software is one of the purest activities in the history of the human race.
  • The greatest limitation in writing software is our ability to understand the systems we are creating
  • Two general approaches to fighting complexity
    • Eliminate complexity by making code simpler and more obvious
    • Encapsulate it
  • Incremental development
    • Software design is never done.
    • You should always be on the lookout for opportunities to improve the design of the system
  • Red flags - signs that a piece of code is probably more complicated than it needs to be.
    • One of the best ways to improve your design skills: learning to recognize red flags

What is complexity?

  • Book's definition

    Complexity is anything related to the structure of a software system that makes it hard to understand and modify the system.

  • Crude mathematical definition of the overall complexity of a system: C = Σ_p (c_p × t_p), i.e. the complexity of each part p weighted by the fraction of development time spent on that part. (Though this kind of excludes interaction between parts - that will be addressed later)

  • Complexity is more apparent to readers than to writers.

  • Symptoms of complexity - Three general symptoms

    • Change amplification
    • Cognitive load
    • Unknown unknowns
  • A very important design goal for a system is to be obvious.

  • Two main causes of complexity

    • Dependencies - In this book: When a given piece of code cannot be understood and modified in isolation
    • Obscurity - When important information is not obvious

Tactical vs strategic programming

  • Tactical: Main focus to get something working
  • Strategic: Working code isn't enough but you need to produce good design
  • Recommendation to continually spend 10-20% of total development time on investment
  • Research data on this would be interesting
  • "Tactical tornado" - Quick progress leaving mess behind
  • A related quote from Uncle Bob: The only way to go fast is to go well

Deep vs shallow modules

  • For this book: A module is any unit of code that has an interface and an implementation. (class, method/function, ...)
  • Viewing each module in two parts: an interface and an implementation
  • Best modules are those whose interfaces are much simpler than their implementations
  • Interface has two parts of information: Formal and informal
  • An abstraction is a simplified view of an entity, which omits unimportant details
  • Abstraction can go wrong in two ways
    • An abstraction can include details that are not really important
    • An abstraction can omit details that really are important
  • Deep vs shallow modules
    • Deep modules: Powerful functionality (lots of functionality) with simple interfaces
    • Shallow modules: Ones whose interface is relatively complex compared to the functionality it provides (Red flag 🚩)
  • Classitis: Syndrome stemming from mistaken view that "classes are good, so more classes are better"
    • Java Streams as an example
  • Interfaces should make the common case as simple as possible
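A toy contrast of the idea (hypothetical classes): both persist a dictionary, but the shallow interface makes every caller orchestrate files and encoding, while the deep one hides all of that behind two methods:

```python
import json
from pathlib import Path

class ShallowStore:
    """Shallow: the interface mirrors the implementation step by step,
    so callers must know about files, encodings and parsing."""
    def open_for_read(self, path):
        self.f = open(path, encoding="utf-8")
    def read_raw(self):
        return self.f.read()
    def parse(self, raw):
        return json.loads(raw)
    def close(self):
        self.f.close()

class DeepStore:
    """Deep: lots of functionality behind a simple interface."""
    def __init__(self, path):
        self._path = Path(path)

    def save(self, data):
        self._path.write_text(json.dumps(data), encoding="utf-8")

    def load(self):
        return json.loads(self._path.read_text(encoding="utf-8"))
```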

Information hiding

  • Each module should encapsulate a few pieces of knowledge.
  • Information hiding reduces complexity in two ways
    • Simplifies the interface to a module
    • Makes it easier to evolve the system.
  • When designing a new module, you should think carefully what information can be hidden in that module.
  • The opposite of information hiding is information leakage.
    • One of the most important red flags in SW design (Red flag 🚩)
    • Causes dependencies
    • Common cause: temporal decomposition
  • Note: Information hiding can often be improved by making a class slightly larger
  • Overexposure (Red flag 🚩) - If API for the common case requires users to learn about rarely-used cases
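A small sketch of leakage versus hiding (hypothetical class): exposing the internal list leaks the storage format into every caller; hiding it keeps that knowledge inside one module, so the representation can change freely:

```python
class LeakyHistory:
    """Leaks its representation: callers see (and can corrupt) the raw
    list, and must all know what its elements look like."""
    def __init__(self):
        self.entries = []  # exposed internal state

class History:
    """Hides the representation: only behavior is exposed, so the
    storage could later become a ring buffer, a file, etc."""
    def __init__(self):
        self._entries = []

    def record(self, event):
        self._entries.append(event)

    def latest(self):
        return self._entries[-1] if self._entries else None
```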

General-Purpose Modules are Deeper (generality vs specialization)

  • Design decision: Whether to implement a new class/module in a general-purpose or special-purpose fashion.
    • In general, the author has found that specialization leads to complexity
    • Sweet spot: Implement new modules in somewhat general-purpose fashion: Functionality should reflect your current needs but interface should not.
  • Questions to ask yourself (when designing an interface)
    • What is the simplest interface that will cover all my current needs?
    • In how many situations will this method be used?
    • Is this API easy to use for my current needs
  • Push specialization upwards (or downwards)
    • General-purpose API, specific use of API
    • OTOH downwards: Device drivers - Very specific to devices but APIs are generic
  • Eliminate special cases in code
  • Summa summarum: Unnecessary specialization is a significant contributor to software complexity.
    • Whether in form of special-purpose classes/methods or special cases in code

Layers & abstractions

  • In a well-designed system, each layer provides a different abstraction from the layers above and below it.
  • (Red flag 🚩) Pass-Through methods / adjacent layers with similar abstractions
  • Each new method should contribute significant functionality
  • Decorators - often pass-through methods, easy to overuse
  • Pass-through variable - another form of API duplication - a variable passed down through a long chain of methods

Pull Complexity Downwards

  • It is more important for a module to have a simple interface than a simple implementation (deep modules)
  • E.g. configuration parameters might be an easy excuse to avoid dealing with important issues - though also needed in many cases
  • When developing a module, look for opportunities to take a little bit of extra suffering upon yourself to reduce the load of your users.

Better Together Or Better Apart

  • Given two pieces of functionality, should they be implemented together or separate?
  • The decision should reduce the complexity of the system as a whole and improve its modularity.
  • Some guidelines
    • Bring together if information is shared
    • Bring together if it will simplify the interface
    • Bring together to eliminate duplication
      • (Red flag 🚩) Repetition
    • Separate general-purpose and special-purpose code
      • (Red flag 🚩) Mixing special/general-purpose code
  • Method length - Author's opinion: Length by itself is rarely a good reason to split up a method
    • Additional interfaces -> additional complexity
    • When designing methods, most important goal should be to provide clean abstractions
    • Each method should do one thing and do it completely.
    • (Red flag 🚩) Conjoined methods
  • A different opinion: Uncle Bob's Clean Code

The first rule of functions is that they should be small. The second rule of functions is that they should be smaller than that.

Define Errors Out Of Existence

  • Summary: Reduce the number of places where exceptions must be handled
  • Exception here - any uncommon condition that alters the normal flow of control in a program
  • Two typical approaches to deal with exceptions
    • Move forward and complete the work despite the exception
    • Abort the operation in progress and report the exception upwards
  • Exceptions make interfaces more complex - Classes with lots of exceptions are shallower than classes with fewer exceptions
  • Ways to reduce the number of places to handle exceptions
    • Define errors out of existence
    • Mask exceptions
    • Exception aggregation (both this and masking position exception handler where it can catch the most exceptions)
    • Just crash?
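Two of these techniques sketched in Python (hypothetical helpers): `discard` defines the "key is missing" error out of existence, so callers never need a try/except, and `read_config` masks a low-level exception behind a sensible default:

```python
def discard(d, key):
    """Deleting a missing key is defined as a no-op, not an error."""
    d.pop(key, None)

def read_config(path, default=""):
    """Mask the exception: a missing config file simply means
    'use the default', so FileNotFoundError never propagates."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return default
```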

Design it Twice

  • When designing a piece of software, design it twice - consider multiple options.
  • If you're accustomed to solving problems with the first quick idea, it doesn't typically work with harder problems
  • Avoid the fallacy of "smart people get it right the first time"

Why Write Comments? The Four Excuses

  • The process of writing comments, if done correctly, will actually improve a system's design
  • Four excuses
    • Good code is self-documenting
      • If users must read the code of a method in order to use it, then there is no abstraction
    • I don't have time to write comments
      • Vs. long-term investment mindset
    • Comments get out of date and become misleading
      • Organizing the documentation to be as easy as possible to keep it up-to-date
      • Code reviews
    • All the comments I have seen are worthless
      • See the next sections
  • Benefits of well-written documents - Overall idea - Capture information that was in the mind of the designer but couldn't be represented in the code
  • A different opinion from Uncle Bob: ... comments are, at best, a necessary evil...

Comments should describe things that aren't obvious from the code

  • Comment categories
    • Interface
      • For users of the interface
      • Note: Separate interface/implementation comments
      • (Red flag 🚩) Implementation documentation contaminates interface
    • Implementation comment
      • Main goal: Help readers understand what the code is doing (not how it does it)
      • Also why
    • Data structure comment
    • Cross-module comment - Describing dependencies
  • Pick conventions
  • Don't repeat the code
    • (Red flag 🚩) Comments repeating the code
  • Lower-level comments add precision
  • Higher-level comments enhance intuition
  • Document cross-module design decisions - "Design notes" documentation (that can be referenced from comments in code)

Choosing names

  • Bad names create bugs
  • Names are a form of abstractions
  • Names should be precise
    • (Red flag 🚩) Vague names
    • (Red flag 🚩) Hard to pick name - a hint that the underlying thing may not have a clean design
  • Use names consistently
  • Avoid extra words
  • A different opinion from the Go style guide on variables: "Keep them short; long names obscure what the code does."

Write The Comments First

  • Write the comments first
  • Comments as a design tool, "Comment-driven design"
  • (Red flag 🚩) Hard to describe - a hint that there may be a problem with the design of the thing you're describing

Modifying Existing Code

  • Improving & cleaning as changing code: Ideally, when you have finished with a change, the system should have the structure it would have if you had designed it from the start with that change in mind.

Consistency

  • Consistency applies at many levels in a system, e.g.
    • Names
    • Coding style
    • Interfaces
    • Design patterns
    • Invariants
  • How to ensure consistency?
    • Document
    • Enforce
    • "When in Rome, do as the Romans do"
    • Don't change existing conventions - Even if the new idea would be better, value of consistency over inconsistency is almost always greater than value of one approach over another

Code should be obvious

  • Code being obvious: One can read the code quickly, without much thought, and their first guesses about the behaviour or meaning of the code will be correct.
  • "Obvious" is in the mind of the reader - It's easier to notice that someone else's code is non-obvious than to see problems with your own code.
  • (Red flag 🚩) Non-obvious code
  • Software should be designed for ease of reading, not ease of writing.
  • To make code obvious, ensure that the reader always has the information they need to understand it.
  • Inheritance
    • Should be used with caution
    • Consider composition over inheritance
  • Agile development
    • Risk of focusing on features, not abstractions
    • Risk of encouraging developers to put off design decisions in order to produce working software ASAP
    • The increment of development should be abstractions, not features
  • TDD
    • The writer states that TDD would focus on getting features working instead of finding the best design
    • The writer kind of ignores the "refactor" step typically considered important in TDD
  • Getters and setters
    • Although it may make sense to use getters/setters if you must expose instance variables, it's better not to expose instance variables in the first place.

Designing for Performance

  • Measure before (and after) modifying
  • Design around the critical path

Decide What Matters

  • Separate what matters from what doesn't
  • Structure software systems around the things that matter.

Wednesday, June 10, 2020

Notes and picks from "Range: Why Generalists Triumph in a Specialized World"

Notes and picks from book Range: Why Generalists Triumph in a Specialized World by David Epstein.

Specialized learning vs broad/wide learning

Epstein is against having 10,000-Hour Rule in a very high position. He starts with comparing Tiger Woods (who has been practicing golf heavily from early childhood) and Roger Federer (who dabbled in many different kinds of sports in his youth).

Epstein makes a distinction between "kind" and "wicked" environments (reminds me of the Cynefin framework):

  • "Kind" environments are environments with clear rules, cause-and-effect etc.
    • E.g. chess, firefighters, playing violin
    • In fields like this, 10,000-Hour rule is more relevant.
    • Studying often relates to patterns & repetitive structures
  • "Wicked" environments don't have so clear rules
    • Wider learning needed; very narrow expertise might even hurt the outcome

Too narrow knowledge

Epstein states that people study, and are taught, deep separate branches of knowledge without getting the big picture.

Research of James Flynn is discussed (e.g. the Flynn effect, the increase in IQ test scores over the 20th century). According to Epstein, Flynn argues that universities teach narrow specialization instead of breadth and critical thinking.

“Even the best universities aren’t developing critical intelligence,” he said. “They aren’t giving students the tools to analyze the modern world, except in their area of specialization. Their education is too narrow.”

Scientific education does not automatically make us more critical or open-minded: Yale law & psychology professor Dan Kahan has shown that more scientifically literate people are more likely to become dogmatic on politically polarized subjects in science, see e.g. the column Why we are poles apart on climate change.

Ospedale della Pietà is also discussed

  • A convent, orphanage and music school in Venice.
  • In the 1600s & 1700s it was famous for its all-female musical ensembles.
  • Epstein states that the students learned many different instruments in their youth instead of focusing early on one instrument.

Analogies, potentially from distant domains, can be valuable when solving difficult problems.

Daniel Kahneman's Curriculum project was also referred - beware the "inside view".

Slow learning preferred

Epstein states that learning should not be fast. The struggle to retrieve information improves learning and moves knowledge into long-term memory. Learning is improved by spacing, testing and making connections.

If you want it to stick, learning should be slow and hard, not quick and easy. The professors who received positive feedback had a net negative effect on their students in the long run. In contrast, those professors who received worse feedback actually inspired better student performance later on.

Focused "head start" or "early sampling"

Epstein discusses study & career paths - whether one should "be gritty with their chosen path" or change path if finding out that selected path is not optimal.

  • Epstein has a concept of "match quality" - the degree of fit between a person and their career
  • "Winners quit fast and often" instead of "Quitters never win"
  • Knowing when to quit is important (though perseverance in difficult times is also important)
  • One's personality is not fixed
    • Personality changes by time, especially between 18 & late 20's -> early guess might result in low match quality.
    • Also, personality varies by context - Instead of asking who's gritty and who is not, ask who is gritty in which situation.

Some related quotes

We find who we are by living.

We discover (our) possibilities by doing, trying out new activities, building new networks, finding new role models.

An early sampling period is better than a focused head start.

Foxes, Birds, Hedgehogs and Frogs

There are parables related to deep vs broad knowledge and experience. These two were presented:

Foxes vs Hedgehogs

  • E.g. essay by Isaiah Berlin
  • Title is attributed to the Ancient Greek poet Archilochus quote: "a fox knows many things, but a hedgehog one important thing"
  • "Hedgehogs" would be people who view the world through the lens of a single defining idea
  • "Foxes" would be people drawing on a wide variety of experiences and for whom the world cannot be boiled down to a single idea

Birds vs frogs

Comes from a Freeman Dyson essay. Deep vs broad thinking - both are needed.

Birds fly high in the air and survey broad vistas of mathematics out to the far horizon. They delight in concepts that unify our thinking and bring together diverse problems from different parts of the landscape.

Frogs live in the mud below and see only the flowers that grow nearby. They delight in the details of particular objects, and they solve problems one at a time.

On decision-making and communication

The Carter Racing case study is discussed (related to the NASA Challenger launch disaster and the decisions made there)

  • We don't do a good job of asking whether the data currently shown is all the data we need for making a decision, or whether there is more
  • Reminds me of Kahneman's concept What You See Is All There Is

"Chain of command" and "Chain of communication" should be differentiated (information should flow in many directions).

Value of wide knowledge / range

Epstein gives an example of coronary stents & cardiologists: High specialization in one area causes one to see that one thing as "the one" solution for any case (seeing only a couple of pieces of a huge jigsaw puzzle)

Quote from Oliver Smithies:

Take your skills to a place that's not doing the same sort of thing. Take your skills and apply them to a new problem, or take your problem and try completely new skills.

New knowledge combinations

To recap: work that builds bridges between disparate pieces of knowledge is less likely to be funded, less likely to appear in famous journals, more likely to be ignored upon publication, and then more likely in the long run to be a smash hit in the library of human knowledge.

Related articles:

Advice for anyone: It's important to read “something outside your field”.

Final quotes

Compare yourself to yourself yesterday, not to younger people who aren’t you... you probably don’t even know where exactly you’re going, so feeling behind doesn’t help

 

So, about that, one sentence of advice: Don’t feel behind... research in myriad areas suggests that mental meandering and personal experimentation are sources of power, and head starts are overrated.

Thursday, June 4, 2020

Json Web Tokens (JWT)

This time I read the JWT Handbook by Sebastián Peyrott, available from Auth0 in exchange for your email address.

JWT in general

What is JWT?

  • JWT stands for JSON Web Token.
  • A standard for safely passing claims in space-constrained environments
  • Aims to be a simple, useful, standard container format that can optionally be also validated and/or encrypted.
  • Latest JWT spec: RFC 7519
  • Related specs

Example JWT

Example JWT from jwt.io (newlines added for readability):

eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.
eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.
SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c

This JWT has three parts separated with a dot:

  • Header JSON (encoded with Base64Url)
{
  "alg": "HS256",
  "typ": "JWT"
}
  • Payload (also encoded with Base64Url)
{
  "sub": "1234567890",
  "name": "John Doe",
  "iat": 1516239022
}
  • Signature built on header, payload & a secret
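The decoding can be reproduced with the Python standard library. Base64Url is base64 with `-`/`_` in place of `+`/`/` and the padding stripped, so padding must be restored before decoding:

```python
import base64
import json

token = (
    "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9."
    "eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ."
    "SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"
)

def b64url_decode(part):
    # Restore the padding that compact serialization strips.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))

header_b64, payload_b64, signature_b64 = token.split(".")
header = json.loads(b64url_decode(header_b64))    # {"alg": "HS256", "typ": "JWT"}
payload = json.loads(b64url_decode(payload_b64))  # {"sub": ..., "name": ..., "iat": ...}
```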

Typical applications

client-side/stateless sessions

  • for storing client-side data
  • signature is typically used to validate the data
  • Can be potentially also encrypted

Security considerations

  • Signature Stripping
    • Removing the signature and changing the header to claim that the JWT is unsigned
    • -> Validation should not consider unsigned JWTs valid.
  • Cross-Site Request Forgery (CSRF)
    • Tries to make the user's browser perform requests, from a different site, against a site where the user is logged in
    • Relevant when session/JWT is in a cookie as cookies are sent by browser.
  • Cross-Site Scripting (XSS)
    • Attempt to inject JavaScript in trusted sites

Federated identity

OAuth 2.0 Access token & Refresh token as an example:

  • Access Token
    • Gives access to protected resources
    • Usually short-lived
    • Typically carries a signature (as signed JWT) -> can be validated by the resource servers
  • Refresh Token
    • Allows user to request new access tokens
    • Usually long-lived
    • Requires access to the authorization server
  • OAuth 2.0 does not specify the format of tokens.
    • JWTs are a good match for access tokens.
    • OpenID Connect uses JWT to represent the ID token

JSON Web Token in Detail

  • Three elements: header, payload and signature/encryption data
  • Header & payload are JSON objects
  • Signature/encryption part depends on the algorithm used for signing or encryption. (In the case of unencrypted JWT it is omitted)
  • Compact serialization: Base64 URL-safe encoding of UTF-8 bytes of header & payload (JSON) and signing/encryption data (not JSON)

Header

  • Also known as the JOSE header (JSON Object Signing and Encryption)
  • Contains claims about the JWT itself

Claims:

  • alg (Algorithm)
    • Algorithm used for signing and/or encrypting the JWT
    • Only mandatory claim for an unencrypted JWT
  • cty (Content Type)
    • In the typical case of specific claims and arbitrary data, this must not be set.
    • Must be "JWT" when payload is another JWT itself (nested JWT)
  • typ (Media type)
    • relevant only in cases when JWTs could be mixed with other objects carrying a JOSE header (which rarely happens)

Payload

  • No mandatory claims
  • Registered claims have specific meaning

Registered Claims:

  • iss (Issuer)
    • A case-sensitive string or URI uniquely identifying the JWT issuer
    • Application-specific interpretation
  • sub (Subject)
    • A case-sensitive string or URI
    • Identifying the party that this JWT carries information about.
    • JWT claims are about this party.
    • Application-specific handling
  • aud (Audience)
    • Either a single case-sensitive string or URI or an array of such values
    • Identifying intended recipients
    • Application-specific interpretation
  • exp (Expiration (time))
  • nbf (Not before (time))
    • "Opposite of exp claim"
  • iat (Issued At (time))
  • jti (JWT ID)
    • Can be used to differentiate this JWT from other JWTs

All other claims are either public (registered or using collision-resistant names) or private (agreed upon between the parties)
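The time- and identity-related registered claims above are typically checked together during validation. A minimal sketch; the claim names come from RFC 7519, while the issuer/audience values and the policy are hypothetical:

```javascript
// Validate the registered exp/nbf/iss/aud claims of a decoded payload.
// Times are NumericDate values (seconds since the Unix epoch).
function validateClaims(payload, { issuer, audience, now = Math.floor(Date.now() / 1000) }) {
  if (payload.exp !== undefined && now >= payload.exp) return 'expired';
  if (payload.nbf !== undefined && now < payload.nbf) return 'not yet valid';
  if (issuer && payload.iss !== issuer) return 'wrong issuer';
  // "aud" may be a single string or an array of strings.
  const aud = Array.isArray(payload.aud) ? payload.aud : [payload.aud];
  if (audience && !aud.includes(audience)) return 'wrong audience';
  return 'ok';
}

const payload = { iss: 'https://issuer.example', aud: ['api'], exp: 2000000000, nbf: 1000000000 };
console.log(validateClaims(payload, { issuer: 'https://issuer.example', audience: 'api', now: 1500000000 })); // 'ok'
console.log(validateClaims(payload, { issuer: 'https://issuer.example', audience: 'api', now: 2100000000 })); // 'expired'
```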

JSON Web Signatures (JWS)

  • The book states JWS as "probably the single most useful feature of JWTs"
  • Allows establishing the authenticity of the JWT (validation)
  • Note: Does not prevent other parties from reading the contents inside the JWT

Algorithms

Specified in RFC 7518, JSON Web Algorithms (JWA)

Keyed-Hash Message Authentication Code (HMAC) is an algorithm that produces a code (hash) from a certain payload with a secret using a cryptographic hash function.

One algorithm is required to be supported by all JWS conforming implementations:

  • "HS256"
    • HMAC using SHA-256 hash function (shared secret scheme)

These are recommended:

  • "RS256"
    • RSASSA PKCS1 v1.5 using SHA-256
    • RSASSA is a variation of the asymmetric RSA algorithm used for signatures.
      • Private key can be used to create the signature (and to verify it)
      • Public key can only be used to verify the signature (and thus the authenticity of the message)
  • "ES256"
    • ECDSA using P-256 and SHA-256
    • Uses an alternative to RSA, Elliptic Curve Digital Signature Algorithm (ECDSA)
    • Also an algorithm with public and private keys but different mathematics.

Optional ones (that are in practice variations of required and recommended ones):

  • "HS384" & "HS512": Variations of "HS256" with SHA-384 and SHA-512
  • "RS384" & "RS512": Variations of "RS256" with SHA-384 and SHA-512
  • "ES384" & "ES512": Variations of "ES256" with SHA-384 and SHA-512.
  • "PS256", "PS384" & "PS512": RSASSA-PSS + MGF1 with SHA-256/SHA-384/SHA-512

JWS Header Claims

See section 4.1. of RFC 7515

Serializations

JWS spec defines two types of serialization:

  • JWS Compact Serialization
    • The typical JWT serialization
    • Base64url-encoded header, payload and signature separated with dots
    • Single signature
  • JWS JSON Serialization, with two alternatives
    • General syntax that supports multiple signatures
    • Flattened syntax (a single signature)

For more details, see section 7 of RFC 7515.

JSON Web Encryption (JWE)

Whereas JWS makes it possible to validate data, JWE makes it possible to prevent third parties from reading it.

As in JWS, two schemes:

  • a shared secret scheme - A party that holds the shared secret can encrypt and decrypt data
  • a public/private key scheme
    • A party that holds the public key can encrypt data.
    • A party that holds the private key can decrypt data.
    • NOTE: Anyone holding the public key can encrypt new data
      • Thus JWE does not replace role of JWS in token exchange
      • JWE and JWS are complementary when using a public/private key scheme.
    • Encrypted JWTs are sometimes nested: An encrypted JWT serves as a container for a signed JWT

Structure of an encrypted JWT

Encrypted JWT compact representation has 5 elements (instead of 3 in signed and unsecured JWTs):

  1. The protected header - As the JWS header
  2. The encrypted key - A symmetric key used to encrypt the ciphertext & other encrypted data
    • Note that the ciphertext is encrypted symmetrically even if an asymmetric algorithm is used to encrypt the key.
  3. The initialization vector - Needed for some encryption algorithms
  4. The encrypted data (ciphertext)
  5. The authentication tag - Can be used to validate the ciphertext
    • Note that this doesn't remove the need for nested JWTs

Key Encryption Algorithms

Key Encryption Algorithms ("alg" header) recommended to be implemented:

  • RSA variants:
    • "RSA1_5" - RSAES-PKCS1-v1_5 (NOTE: marked for removal of the recommendation)
    • "RSA-OAEP" - RSAES-OAEP with defaults (marked to be required in the future)
  • AES variants
    • "A128KW" - AES-128 Key Wrap
    • "A256KW" - AES-256 Key Wrap
  • Elliptic Curve variants:
    • "ECDH-ES" - Elliptic Curve Diffie-Hellman Ephemeral Static (ECDH-ES) using Concat KDF (marked to be required in the future)
  • Combinations
    • "ECDH-ES+A128KW" - ECDH-ES using Concat KDF and CEK wrapped with AES-128
    • "ECDH-ES+A256KW" - ECDH-ES using Concat KDF and CEK wrapped with AES-256

Key Management Modes

JWE spec defines a couple of different Key Management Modes related to determining the Content Encryption Key (CEK)

  • Key Encryption - CEK is encrypted for the intended recipient using an asymmetric encryption algorithm
  • Key Wrapping - CEK is encrypted for the intended recipient using a symmetric encryption algorithm
  • Direct Key Agreement - a key agreement algorithm is used to agree upon the CEK value.
  • Key Agreement with Key Wrapping - a key agreement algorithm is used to agree upon a symmetric key used to encrypt the CEK value to the intended recipient using a symmetric key wrapping algorithm.
  • Direct Encryption - shared symmetric key is used as the CEK

It's important to note that the CEK and the JWE encryption key are different things:

  • CEK is the key used to encrypt/decrypt the actual data payload
  • JWE encryption key is used to encrypt or compute the CEK (unless Direct Encryption is used)

Required Content Encryption Algorithms ("enc" header):

  • AES CBC + HMAC SHA - AES 128/256 with Cipher Block Chaining and HMAC + SHA-256/512 for validation.
    • "A128CBC-HS256" - AES_128_CBC_HMAC_SHA_256
    • "A256CBC-HS512" - AES_256_CBC_HMAC_SHA_512

JWE Header Claims

See section 4.1. of RFC 7516

JSON Web Keys (JWK)

  • Different representations for the keys used for signatures and encryption
  • Aiming for a unified representation of all keys supported in the JWA spec.

An example JWK from RFC 7517:

{
  "kty": "EC",    // Key type: Elliptic Curve
  "crv": "P-256", // Curve type: P-256
  "x": "f83OJ3D2xF1Bg8vub9tLe1gHMzV76e8Tus9uPHvRVEU", // base64-encoded x & y coordinates
  "y": "x_FEzRu9m36HLN_tue659LNpXW6pCyStikYjKIWI5a0", // (Parameters for elliptic curves)
  "kid": "Public key used in JWS spec Appendix A.3 example" // Key identifier
}

Common parameters (more details in section 4 of RFC 7517):

  • kty (Key Type) - "EC" / "RSA" / "oct" (symmetric keys)
  • use (Public Key Use) - "sig" (signature) / "enc" (encryption)
  • key_ops (Key Operations)
    • an array of strings specifying detailed uses for the key
    • Potential values "sign", "verify", "encrypt", "decrypt", "wrapKey", "unwrapKey", "deriveKey", "deriveBits"
  • alg (Algorithm) - the algorithm intended for use with the key
  • kid (Key ID) - A unique identifier for this key.
  • x5u (X.509 URL) - A URL pointing to a X.509 public key certificate or certificate chain in PEM encoded form
  • x5c (X.509 Certificate Chain) - Base64-URL encoded X.509 DER public key certificate or certificate chain
  • x5t (X.509 Certificate SHA-1 Thumbprint) - Base-64-URL encoded SHA-1 thumbprint/fingerprint of the DER encoding of a X.509 certificate
  • x5t#S256 (X.509 Certificate SHA-256 Thumbprint) - As x5t, but with SHA-256 thumbprint/fingerprint.
  • Other parameters specific to the key algorithm. e.g. x, y, d, n, e etc.

JSON Web Key Sets (aka JWK Sets)

  • Carry more than one key
  • Meaning of the order of the keys is user-defined
  • A JSON object with "keys" field consisting of a JSON array of JWKs

JSON Web Algorithms

In this chapter, the algorithms used in earlier chapters are discussed in more detail.

Base64

Base64 is a binary-to-text encoding used widely with JWT, JWS and JWE. With JWT & related specs, a URL-safe variant of Base64 (base64url) is used. For more details, see e.g. RFC 4648.

Secure Hash Algorithm (SHA)

  • SHA used in JWT is defined in FIPS-180, see also RFC 4634.
  • Note: Not to be confused with SHA-1 (deprecated, should not be used)
    • FIPS-180 SHA is sometimes called SHA-2
  • For JWT, SHA-256 & SHA-512 are of interest.
  • Roughly:
    • Input is processed in fixed-size chunks
    • For each chunk, perform a bunch of mathematical operations
    • Result is accumulated with previous chunk results
    • After all chunks, digest is computed.
  • For code example, see sha256.js

Hash-based Message Authentication Code (HMAC)

  • Use a cryptographic hash function (e.g. SHA family) and a key to create an authentication code.
  • Takes a hash function, a message and a secret key as input
  • Produces an authentication code (HMAC) as output

Definition from RFC 2104:

To compute HMAC over the data `text' we perform

H(K XOR opad, H(K XOR ipad, text))

  • ipad = the byte 0x36 repeated B times
  • opad = the byte 0x5C repeated B times
  • B = the block length of the hash function in bytes (64 for SHA-256)

So, e.g. "HS256" (HMAC + SHA256) means HMAC using SHA-256 as the hash function,

RSA

  • "RSA" stands for the initials of it's developers Ron Rivest, Adi Shamir and Leonard Adleman.
  • Has variations both for signing and encryption
  • Stands on integer factorization being computationally relative extensize operation to perform.

The RSA "basic expression": (m^e)^d = m (mod n) where

  • It is computationally feasible to find very large integers e, d and n that satisfy the equation.
  • It is relatively difficult to find d when other numbers are known.
  • Public key is composed of values n and e
  • Private key is composed of values n and d
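The basic expression can be illustrated with deliberately tiny numbers, using the classic textbook example p = 61, q = 53 (real keys use huge primes, and real implementations add padding):

```javascript
// Square-and-multiply modular exponentiation with BigInt.
const modPow = (base, exp, mod) => {
  let result = 1n;
  base %= mod;
  while (exp > 0n) {
    if (exp & 1n) result = (result * base) % mod;
    base = (base * base) % mod;
    exp >>= 1n;
  }
  return result;
};

// n = 61 * 53 = 3233, e = 17, d = 2753; (m^e)^d = m (mod n).
const n = 3233n, e = 17n, d = 2753n, m = 65n;
const c = modPow(m, e, n);    // apply the public exponent
console.log(c);               // 2790n
console.log(modPow(c, d, n)); // 65n - the private exponent recovers m
```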

More details can be found from e.g. The Public Key Cryptography Standard #1 (PKCS #1) (RFC 3447).

Signing with RSA

Signing:

  • Produce a message digest from the message
  • Raise digest to the power of d mod n
  • Attach the result as signature

Verifying signature:

  • Raise signature to the power of e mod n
  • Produce a message digest from the message
  • If the results from previous steps match, the signature is valid

JWT "RS256" signature algorithm is PKCS#1 RSASSA v1.5 using SHA-256.

Elliptic Curve (EC)

Elliptic curves are a different field of mathematics that provides a "one-way function": easy to compute, but hard to invert (the elliptic curve discrete logarithm problem).

Elliptic Curve Digital Signature Algorithm (ECDSA)

  • Curves and algorithms defined in FIPS 186-4 + other associated standards.
    • JWA uses three curves: P-256, P-384, and P-521.
  • Within a given curve, a "base point" G is used for EC operations:
    • Private key can be constructed by picking a random number between 1 and n (the order of base point G)
    • Public key can be computed by multiplying the private key with the base point G

"ES256" is ECDSA using elliptic curve P-256 and SHA-256 hash.

Best practices

Based on RFC 8725.

Common pitfalls / attacks

  • alg: none
    • setting header "alg" to "none" and modifying payload
  • Using RS256 Public-key as HS256 secret
    • as public key is often public
  • Weak HMAC keys
    • If using a HMAC key of "typical password length", brute force attack might be possible
  • Wrong stacked encryption + signature verification assumptions
    • Wrong assumption that encryption would provide also protection against tampering
    • Esp. non-standard encryption algorithms might not have data integrity verification
    • Nested JWTs: Failing to validate the innermost JWT when an encrypted JWT is carrying a signed JWT
  • Invalid Elliptic curve attacks
  • Substitution attacks
    • Sending a token intended for recipient A to recipient B (if both verify the token with the same public key)
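Several of these pitfalls (notably "alg: none" and the RS256-as-HS256 confusion) can be caught with one rule: decode the header and enforce an explicit algorithm allow-list before any verification. A sketch; the forged token and the allow-list are demo values:

```javascript
// Reject tokens whose header algorithm is not explicitly allowed.
function checkAlgorithm(token, allowed) {
  const header = JSON.parse(Buffer.from(token.split('.')[0], 'base64url').toString('utf8'));
  if (!allowed.includes(header.alg)) {
    throw new Error(`unexpected algorithm: ${header.alg}`);
  }
  return header.alg;
}

// A forged "alg: none" token with a tampered payload.
const b64 = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const forged = `${b64({ alg: 'none' })}.${b64({ sub: 'admin' })}.`;

try {
  checkAlgorithm(forged, ['RS256']); // only RS256 tokens are acceptable here
} catch (err) {
  console.log(err.message); // 'unexpected algorithm: none'
}
```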

Mitigations