Functional Thinking Part III (Week 4 Tuesday)

— Christopher Genovese and Alex Reinhart

Announcements #

  • Checking status
  • Office hours: Wed 3 on Zoom (link to announcements)
  • Schedule adjustments forthcoming
  • New exercises forthcoming

Plan for Today #

  • Brief Debrief on Dominoes Activity
  • Data-oriented Functional Programming: A Preview
  • Languages for Describing Data
  • One approach: A Tour
  • Practices, a first look

Review Core Concepts of Functional Programming #

Composability is a fundamental aspiration in software design. We want the parts of our system to interoperate and combine as freely as possible. This offers power, expressiveness, reusability, and many other benefits.

In functional programming (FP), functions are the fundamental unit of abstraction providing simple, easily composed building blocks for solving problems.

Central features of FP:

  • Functions are first-class entities
  • Functions are pure (whenever possible)
  • Data is immutable (whenever possible)
  • Programs have declarative structure (as much as possible)
  • Exploit laziness when appropriate
  • Referential transparency
  • Higher-Order Functions

Distinguish actions, calculations, data.
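
For example, a quick Python sketch of the distinction (the names here are illustrative):

prices = (3.50, 4.25, 2.00)          # data: inert facts

def total(prices):                   # calculation: pure, same output for same input
    return sum(prices)

def report(prices):                  # action: affects (or depends on) the outside world
    print("Total:", total(prices))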

Data-Oriented Functional Programming: A Preview #

An approach to design that makes the data entities in a system a central concern. This is an opinionated flavor of functional programming that suggests practices to reduce complexity, maximize reuse, and improve correctness.

Let’s start with a few core principles that we will dive into.

Principle #1: Separate Data and Code #

Design operations to operate on data that is separately described and stored. Keep state out of your functions.

import scipy.signal

def lp_filter(tseries):
    # The filter coefficients are data, but here they are buried inside the code
    filter_coefs = [0.1, 0.2, 0.3, 0.4, 0.3, 0.2, 0.1]
    return scipy.signal.fftconvolve(tseries, filter_coefs, mode='full')

class Author:
    # ...
    @property
    def fullname(self):
        return self.first_name + ' ' + self.last_name

    def is_prolific(self):
        # The threshold 100 is data hidden inside the code
        return len(self.books) > 100

What value is there in having that data in the code? What harm could it do? When is it a good idea to include such implicit arguments? When not?

This principle suggests decoupling data and code: our code consists of functions whose behavior does not depend on data that is encapsulated in the function’s context.

This means writing functions that take the data they need as explicit arguments rather than relying on data hidden in their context. This applies to both FP and Object Oriented (OO) approaches.
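
As a small sketch of this idea applied to the lp_filter example above (assuming scipy is available), the coefficients become an explicit argument, so the function's behavior depends only on the data it is given:

import scipy.signal

def lp_filter(tseries, filter_coefs):
    # Behavior depends only on the arguments; the coefficients are just data
    return scipy.signal.fftconvolve(tseries, filter_coefs, mode='full')

# The filter specification lives with the program's data, not in the code
lp_coefs = [0.1, 0.2, 0.3, 0.4, 0.3, 0.2, 0.1]
smoothed = lp_filter([1.0, 2.0, 4.0, 2.0, 1.0], lp_coefs)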

For instance, OO tools: static methods and data classes.
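
A rough sketch of the OO version (the Author fields here are illustrative): a data class holds the data, while plain functions or static methods carry the behavior, with any thresholds passed in explicitly.

from dataclasses import dataclass

@dataclass(frozen=True)
class Author:
    first_name: str
    last_name: str
    books: tuple = ()

def fullname(author):
    return author.first_name + ' ' + author.last_name

def is_prolific(author, threshold=100):
    # The threshold is explicit data, not a constant hidden in a method
    return len(author.books) > threshold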

Costs: less control, changes packaging structure

Benefits of this principle: enable reuse, easier testing, reduce complexity

Principle #2: Use Immutable Data When Possible #

Immutability reduces complexity at a very small performance penalty in general use. Transient mutability is fine in tight bottlenecks.

Strategies:

  1. Copy on write
  2. Defensive copying
  3. Library immutable data structures (e.g., tuples in Python)
  4. Persistent data structures (structure sharing)
  5. Isolate mutating operations

Costs: a small performance hit (escapable in critical code); it also helps to have a supporting library with persistent data structures.

Benefits: access data with confidence, have history, predictable code behavior, fast equality checks, concurrency safety

More benefits of this principle: enable reuse, easier testing, reduce complexity
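
Here is a minimal Python sketch of two of the strategies above, copy on write and library immutable structures (the names are illustrative):

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)
q = replace(p, y=3.0)              # copy on write: build a new value, p is unchanged

coefs = (0.1, 0.2, 0.4, 0.2, 0.1)  # a tuple is an immutable library structure

def normalized(weights):
    total = sum(weights)
    return tuple(w / total for w in weights)  # returns new data; input untouched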

Principle #3: Represent Data with Generic Data Structures #

It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures.

– Alan Perlis

Tuples, Arrays, Dictionaries/Hashtables, Sets, Bags

Most languages offer a fairly rich collection of generic data structures with good APIs and good performance characteristics.

Key distinction: specialized data structures with custom behaviors versus data relationships built on generic structures.

Examples:

  • Dictionaries/Maps/Hashtables as general associative storage
  • NamedTuples/Dataclasses versus general objects in Python
  • Lists/environments/data frames in R versus S4/S3 classes

Many of the functions we write will have similar logic, so we can use code that is not limited to a specific use case.
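
As a sketch (with made-up record fields), a generic dictionary plus a small generic function can replace a bespoke class for many such tasks:

# A record as a plain dictionary rather than a custom class
book = {"title": "A Green Door", "year": 2001, "tags": ["home", "green"]}

def pluck(records, key):
    # Generic: works on any sequence of dict-like records, not just books
    return [r[key] for r in records]

titles = pluck([book], "title")   # ['A Green Door']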

Costs: lose some case-specific optimization, lose compile time checks, sometimes need to convert data

Benefits of this principle: Helps separate distinct problem logic from common operations, allows a flexible data model, better code reuse, less complexity (if we’re careful)

Principle #4: Describe Complex Data Models #

Type systems and schema can provide rich descriptions of our data models.

Benefits:

  • Describing the data carefully helps us think about the problem
  • The data description can help communicate intent (to ourselves and others)
  • Describing the data carefully helps us catch mismatches and bugs
  • It can help our tools (compilers, linters/analyzers) catch more bugs
  • Validation can (to some degree) be automated
  • Documentation can be automated
  • Test generation can (to some degree) be automated
  • Visualization of data relationships

Costs:

  • Overhead
  • Changes in code/design/algorithm require distributed changes in types
  • Some rigidity
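
For instance, in Python a typed dataclass gives a lightweight description of a data model (a sketch; the field names are illustrative and anticipate the product example below):

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Product:
    product_id: int
    product_name: str
    price: float
    tags: tuple[str, ...] = ()
    discount: Optional[float] = None

Tools such as linters and type checkers can then flag a Product built with, say, a string where the price should be.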

Languages for Describing Data #

It helps to have a language for describing the shape of data. In some languages, the shape is specified at compile time as types (via a type system); in others, it is described at compile and/or run time via schemas (e.g., JSON Schema, Malli, etc.).

Static versus Dynamic Typing #

In statically typed languages, the types of data are specified statically in the code. This enables the compiler to analyze the code more thoroughly and ensure a baseline of legality. It documents intent and potentially prevents many hard-to-catch bugs. Traditional imperative languages like C, C++, and Java are statically typed (though weakly in a sense, because many data conversions are possible). Also, strongly typed functional languages like Haskell and OCaml have rich but stern type systems. (Strongly typed languages put restrictions on type conversions of your data; weakly typed languages allow more flexibility – or rope – for such changes.)

Dynamically typed languages typically do not constrain the types of data used in the code. Functions can accept all kinds of objects and will work if the data passed meets some basic requirements. This makes the code highly flexible and easily adaptable to changing or expanded requirements, and code reuse is high. But the responsibility for using data of the right shape falls to the programmer, and complexity can make this difficult to track. Rather than a formal type system, many dynamically typed languages use schemas to describe data shapes. The schemas are used by tools and, optionally, for run-time validation.

In between these poles, we see gradual typing as a mechanism for adding type information to existing dynamic languages. Examples include type annotations in Python and TypeScript, a variant of JavaScript with a type specification language as part of the spec. Many dynamic languages also have a simpler facility for type hinting that can improve performance in critical parts of the code.
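
For example, Python type annotations are ignored at run time but can be checked by tools such as mypy (a small sketch):

def mean(xs: list[float]) -> float:
    return sum(xs) / len(xs)

def label(name: str, value: float) -> str:
    return name + ': ' + format(value, '.2f')

# mean("not a list")   # a checker like mypy flags this statically;
                       # Python itself does not enforce the annotation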

Schemas #

Schemas are descriptions of data as data. They are usually applied for linting/analysis, tooling, and runtime validation in dynamic languages.

Typically, schemas are kept separate from the code. For example, consider this JSON data:

{
  "productId": 1,
  "productName": "A green door",
  "price": 12.50,
  "tags": [ "home", "green" ]
}

A schema (https://json-schema.org/) for these data:

{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "https://example.com/product.schema.json",
  "title": "Product",
  "description": "A product from Acme's catalog",
  "type": "object",
  "properties": {
    "productId": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "productName": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "description": "The price of the product",
      "type": "number",
      "exclusiveMinimum": 0
    },
    "tags": {
      "description": "Tags for the product",
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 1,
      "uniqueItems": true
    }
  },
  "required": [ "productId", "productName", "price" ]
}

From one point of view, schemas are like type systems on the cheap: fewer constraints, but less ability to detect bugs. However, schemas do offer some advantages, including support for richly featured tools and powerful run-time validation that can do more than a static type system (more on this later).
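
For example (a sketch using the third-party jsonschema package in Python), run-time validation checks a value against a schema and reports violations:

from jsonschema import validate, ValidationError

schema = {
    "type": "object",
    "properties": {
        "productId": {"type": "integer"},
        "price": {"type": "number", "exclusiveMinimum": 0},
    },
    "required": ["productId", "price"],
}

validate(instance={"productId": 1, "price": 12.50}, schema=schema)  # passes silently

try:
    validate(instance={"productId": 1, "price": -1}, schema=schema)
except ValidationError as err:
    print(err.message)   # describes the violation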

A Type Language #

Whether we think in terms of types, schemas, or type hints/annotations, it is useful to have a language for describing data shape while we are thinking about a problem and designing our approach.

We will borrow the language of types to get a kind of pseudocode for describing the shape of data.

Types have a surprisingly rich algebraic structure that can capture many relationships among our data.

We have concrete, built-in data types in any language. The names may change but the structure is fairly common:

Int, Double, UInt32, [Int], (Int, Int), String

We have function types, with arrows separating the argument types and the return type (no commas between arguments):

Int -> Int -> Int     # f(int, int) -> int
String -> String      # g(string) -> string
String -> ()          # An action that returns nothing

# Can *name* these types
data UnaryIntFn = Int -> Int
data BinaryIntFn = Int -> Int -> Int
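
In Python, the typing module expresses roughly the same shapes with Callable (a sketch of the correspondence):

from typing import Callable

UnaryIntFn = Callable[[int], int]         # Int -> Int
BinaryIntFn = Callable[[int, int], int]   # Int -> Int -> Int

def apply_twice(f: UnaryIntFn, x: int) -> int:
    return f(f(x))

apply_twice(lambda n: n + 1, 3)   # 5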

We can parameterize types by (possibly constrained) classes of types, giving abstract types

[a]                     # List consisting of all one type
Number a => a -> a      # f(number) -> number
Ord a => a -> a -> Bool # Fn of ordered types, a predicate

data Algebra f a = f a -> a
data CoAlgebra f a = a -> f a

# A concrete instantiation
Algebra List Int
CoAlgebra List Int

Function types are like exponentials in the algebra of types; two more constructions give us sums and products.

Sums:

data Color = Red | Green | Blue

data Shape = Circle Float Float Float
           | Triangle Float Float Float
           | Rectangle Float Float Float Float

type Radius = Float         # Aliases
type XPosition = Float
type YPosition = Float

data Circle = Circle Radius XPosition YPosition  # Clearer

data Test = TestCase Assertion
          | TestList [Test]
          | TestLabel String Test

data Maybe a = Nothing
             | Something a

data Expr a
  = Literal { intVal :: Int }
  | Ident   { name :: String  }
  | Index   { target :: a, idx :: a }
  | Unary   { op :: String, target :: a }
  | Binary  { lhs :: a, op :: String, rhs :: a }
  | Call    { func :: a, args :: [a] }
  | Paren   { subexpr :: a }
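
A rough Python analogue of a sum type like Shape uses one dataclass per variant and a Union alias (a sketch, not a true algebraic data type):

import math
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Circle:
    radius: float
    x: float
    y: float

@dataclass(frozen=True)
class Rectangle:
    width: float
    height: float
    x: float
    y: float

Shape = Union[Circle, Rectangle]

def area(shape: Shape) -> float:
    # Dispatch on the variant, as a pattern match would
    if isinstance(shape, Circle):
        return math.pi * shape.radius ** 2
    return shape.width * shape.height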

Products:

data Seed = (End, Pool)
data Point = (XValue, YValue, ZValue)

Recursive specifications:

data List a = Nil | Cons a (List a)

data BTree a = Leaf a | Branch (BTree a) (BTree a)
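
The same recursive shape can be sketched in Python with variant classes that refer to the union (illustrative names):

from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Leaf:
    value: int

@dataclass(frozen=True)
class Branch:
    left: "BTree"
    right: "BTree"

BTree = Union[Leaf, Branch]

def leaves(tree: BTree) -> list:
    # Recursion in the function mirrors recursion in the type
    if isinstance(tree, Leaf):
        return [tree.value]
    return leaves(tree.left) + leaves(tree.right)

leaves(Branch(Leaf(1), Branch(Leaf(2), Leaf(3))))   # [1, 2, 3]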

A Tour of One Approach Illustrating DOP/FP Style #

In the documents repository, you can view the code in the src/dominoes/ directory.