class: center, middle

# Practical Haskell Performance

Tim McGilchrist

.bottom[![lambda](/talks/lambda-jam-2016-performance/green-lambda.png)]

---
class: center, middle, section-white
background-image: url(/talks/lambda-jam-2016-performance/background-image03.jpg)

# Introduction

???
* spend a lot of time with abstractions
* Haskell hides the evaluation model
* limits in hardware
* continue using high-level abstractions without significant penalty
* get 80% of the performance with minimum effort

---

# Hardware

.center[![xeon](/talks/lambda-jam-2016-performance/intel_xeon_e5_2600_v2_chip_scheme_678x452.jpg)]

???
* Xeon Haswell chip from a few years ago
* 10 cores, 25MB L3 cache
* likes locality of reference in memory
* chasing pointers breaks that
* optimised for working in the cache
* https://gist.github.com/jboner/2841832

---

# Typical Pipeline

Program reads data off storage, does some processing and then writes back to storage.

.center[![compute](/talks/lambda-jam-2016-performance/Architecture.svg)]

???
* reading largish chunks of data from disk
* performing some computation (computing stats, filtering, choosing)
* writing out largish data
* typical pipeline
* actually most of our architecture looks like this, composable building blocks

---

# Cost

Because we use AWS, there is a direct link between a day's computing and the angriness of our pointy-haired boss.

.center[![mark](/talks/lambda-jam-2016-performance/Phb.png)]

???
* why do we care about cost?
* the cost of a good car most months
* performance, but not at the cost of correctness
* often worth the effort to optimise; operational costs are greater than engineering costs

---

# Optimising

"The art of poking stacks, trawling asm and black magic."

???
* optimisation is fickle
* often context sensitive
* may require facial hair
* Ambiata view on our performance concerns

---

# Haskell IO

100% of time spent reading and writing to disk?

???
* obviously this

--

This raises two issues

1. How do we efficiently read in enough data without blowing our memory budget?
2. Is it safe? What about IO exceptions and cleaning up?

???
* not sacrificing correctness or stability for performance
* producing fast wrong results costs $$$

---

# ByteString Lazy vs cat

Using straight lazy ByteString

``` haskell
runStatisticsBSL :: InputPath -> OutputPath -> IO ()
runStatisticsBSL (InputPath i) (OutputPath o) =
  BSL.readFile i >>= BSL.writeFile o
```

???
* straight bytestring will read everything into memory
* tested on a 31GB psv file

--

```shell
> ./hatchet sample.psv output.psv
2.72s user 35.20s system 53% cpu 1:10.79 total
```

--

```shell
> time cat sample.psv > output.psv
1.70s user 49.06s system 65% cpu 1:17.49 total
```

???
* ByteString is sufficiently fast
* composability is a problem

---

# Conduit

* Promises composable building blocks
* Constant memory usage on large / infinite streams

--

``` haskell
runStatisticsC :: InputPath -> OutputPath -> IO ()
runStatisticsC (InputPath i) (OutputPath o) =
  runResourceT . runConduit $ CB.sourceFile i =$= CB.sinkFile o
```

???
* newtype wrapper around FilePath
* add stages in between =$=
* already present as a dependency
* default buffer size in conduit is too small, using the Ambiata fork with a better value

--

```shell
> ./hatchet sample.psv output.psv
4.46s user 34.81s system 49% cpu 1:19.35 total
```

???
* sufficiently fast performance
* nicer building blocks without IO cost
* IO probably isn't your bottleneck
* parallel read / download
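---

# Conduit With a Stage

Not from the original benchmark, but a minimal sketch of what dropping a stage in between `=$=` looks like, assuming `conduit-extra`'s `Data.Conduit.Binary` as `CB` and `Data.Conduit.List` as `CL`, with a hypothetical per-row predicate:

``` haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BS

runFilterC :: InputPath -> OutputPath -> IO ()
runFilterC (InputPath i) (OutputPath o) =
  runResourceT . runConduit $
        CB.sourceFile i
    =$= CB.lines                        -- split the byte stream into rows
    =$= CL.filter (BS.isInfixOf "NSW")  -- hypothetical filter, keep NSW rows
    =$= CL.map (\l -> BS.snoc l '\n')   -- put the newline back
    =$= CB.sinkFile o
```

Each stage sees one chunk at a time, so the pipeline still runs in constant memory.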
---

# Data Types

``` haskell
data Row =
  Row {
    rowId :: RowId
  , rowSpend :: Double
  , rowItems :: Int
  , rowDate :: UTCTime
  , rowState :: Text
  } deriving (Eq, Show)
```

Each row in our file gets parsed into this structure.

???
* point out good code style for types

--

``` shell
some-id | 150.0 | 5 | 10/3/2016 | NSW
```

???
* pipe-separated data, the one true enterprise format

---

# Data Type Size

``` shell
some-id | 150.0 | 5 | 10/3/2016 | NSW
```

Raw size of a line is 38 bytes.

--

``` haskell
    rowId :: RowId     -- 6 words + 2N bytes
  , rowSpend :: Double -- 3 words (2 on 32 bit)
  , rowItems :: Int    -- 2 words
  , rowDate :: UTCTime -- 6 words
  , rowState :: Text   -- 6 words + 2N bytes
```

Becomes 43 words * 8, or 344 bytes, on a 64-bit machine.

???
* roughly a 9x increase
* not really an ideal situation; why is it like that?

---

# Memory Layout

* Value Representation

.center[![layout](/talks/lambda-jam-2016-performance/Memory_layout.jpg)]

``` haskell
data Row =
  Row {
    rowId :: RowId
  , rowSpend :: Double
  , rowItems :: Int
  ...
  } deriving (Eq, Show)
```

???
* ideally you might think that memory is laid out like this
* no pointer chasing, which remember our CPU doesn't really like

---

# Memory Layout

* Value Representation

.center[![layout](/talks/lambda-jam-2016-performance/Memory_layout.jpg)]

* Reference Representation

.center[![layout](/talks/lambda-jam-2016-performance/Memory_layout_pointers.jpg)]

???
* actually it looks more like this
* potential cache misses for each pointer chase
* want to avoid chasing pointers; it makes the CPU sad and slow
* move towards the layout we actually want

---

# Optimising via newtype

``` haskell
newtype RowId =
  RowId Text
  deriving (Eq, Show)
```

* constructors defined via newtype are free
* purely a compile-time idea
* takes no space and costs no instructions

???
* there's no excuse for bad types
* what about for other types?

---

# Boxed vs Unboxed Values

What are they?

* Boxed
  * most types in Haskell
  * represented by a pointer to a heap object
* Unboxed
  * primitive values, e.g. raw machine types
  * cannot be directly defined

???
* boxed: represented by a pointer to a heap object
* unboxed: no pointers or heap allocation are involved
* built into the language and GHC
* want something like unboxed for best memory performance

---

# UNPACK All The Things

As a general rule, scalar fields, such as Int and Double, should always be unpacked.

--

``` haskell
data Row =
  Row {
    rowId :: {-# UNPACK #-} !RowId
  , rowSpend :: {-# UNPACK #-} !Double
  , rowItems :: {-# UNPACK #-} !Int
  , rowDate :: {-# UNPACK #-} !UTCTime
  , rowState :: {-# UNPACK #-} !Text
  } deriving (Eq, Show)
```

--

* Syntactically noisy

???
* looks like rubbish
* UNPACK only applies to strict (`!`) fields; GHC ignores it otherwise
* is there a better way?

---

# Better Unpacking

* Use `-funbox-strict-fields` for all files which contain data types.

``` haskell
-- other language pragmas --
{-# OPTIONS_GHC -funbox-strict-fields #-}
module Hatchet.Data (Row(..)) where
```

---

# How do I know?

Scalar fields can always be unpacked.

Things that can't be unpacked:

* Sum types (types with more than one constructor)
* Polymorphic fields (how would you know what's there?)

--

Try drawing boxes representing your data types, and think about how they might be represented in memory.

---

# Strict Data Types

By default, always use strict data types.

--

Haskell being lazy, why should this be the case?

--

Think about repeatedly adding a value to a lazy data structure.
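--

As a sketch of what goes wrong (these names are illustrative, not part of the pipeline):

``` haskell
data Total =
  Total Int -- lazy field

tally :: [Row] -> Total
tally =
  foldl step (Total 0)
  where
    -- (+) is never forced here, so after n rows the field holds
    -- the thunk (..((0 + i1) + i2).. + in) rather than an Int
    step (Total t) r = Total (t + rowItems r)
```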
--

.center[![layout](/talks/lambda-jam-2016-performance/lazy_hash.jpg)]

???
* performing some calculation on Row 1, repeatedly
* builds up a series of thunks (unevaluated computations)
* building up thunks is expensive

---

# Strict Row

``` haskell
data Row =
  Row {
    rowId :: !RowId
  , rowSpend :: !Double
  , rowItems :: !Int
  , rowDate :: !UTCTime
  , rowState :: !Text
  } deriving (Eq, Show)
```

* Add strictness annotations (`!`) to every field

???
This simple thing alone goes a long way to avoiding common laziness problems, such as space leaks.

Hold on, good Sir, I have a recursive data structure.

Most of my data types aren't recursive.

---

# Strict Recursive

``` haskell
data Exp =
    Infix !Op Exp Exp
  | Constant !Type !Value

data Op =
    Add
  | Subtract
  | Multiply
  | Divide
```

* select the non-recursive parts to make strict

???
* here it makes sense for Op to be strict but perhaps not Exp
* similarly for Constant
* perhaps the recursive part isn't that large; experiment with making it strict too

---

# Data Summary

We've covered a bunch of things to optimise your data structures:

* choosing the right types
* using newtype liberally
* making data structures strict
* unpacking data types

---

# Final Data Type

``` haskell
-- other language pragmas --
{-# OPTIONS_GHC -funbox-strict-fields #-}
module Hatchet.Data (Row(..)) where

data Row =
  Row {
    rowId :: !RowId
  , rowSpend :: !Double
  , rowItems :: !Int
  , rowDate :: !UTCTime
  , rowState :: !Text
  } deriving (Eq, Show)
```

---
class: center, middle, section-white
background-image: url(/talks/lambda-jam-2016-performance/background-image03.jpg)

# Functions

???
* optimising your actual computations isn't always such a clear-cut thing
* in general we lean towards having function arguments be lazy

---

# Why Lazy Functions?

Consider this code

``` haskell
any :: (a -> Bool) -> [a] -> Bool
any p = or . map p
```

???
* with lazy evaluation, `or` stops at the first True, so `map p` never runs past it
* if arguments were strict, the whole list would be mapped before `or` got a look at it, i.e. more work than needed to compute the answer

---

# Forcing Computation

Consider this computation:

``` haskell
sum :: [Row] -> Double
sum =
  let
    loop acc = \case
      [] -> acc
      (x:xs) -> loop (acc + rowSpend x) xs
  in
    loop 0
```

???
* going back to a previous example of computing something across a series of Rows
* the accumulation part of that needs to be strict
* it'll build up a thunk for each Row it sees, basically a space leak

---

# Forcing Computation

Sometimes it does make sense to force the arguments to a function.

``` haskell
sum :: [Row] -> Double
sum =
  let
    loop !acc = \case -- Force acc
      [] -> acc
      (x:xs) -> loop (acc + rowSpend x) xs
  in
    loop 0
```

???
* the accumulation part of that needs to be strict
* another technique we've used

---

# Remove Polymorphism

Imagine we wrote our code with this type signature

``` haskell
add :: Real a => a -> a -> a
add rowSpend acc = acc + rowSpend
```

???
* well written using the most general form
* may be a bad choice for a performance-sensitive area like an inner loop
* you monster, my beautiful polymorphic code

--

``` haskell
add :: Int -> Int -> Int
```

???
* you may already know what the types are, so use them
* the monomorphic version will perform better than the polymorphic one
* what if you do use it with multiple types?

---

# Specialise

If however it really is used polymorphically, you can use `SPECIALIZE` to give GHC a hint of what it's used for

``` haskell
{-# SPECIALIZE add :: Int -> Int -> Int #-}
{-# SPECIALIZE add :: Double -> Double -> Double #-}
add :: Real a => a -> a -> a
```

???
* can go even further with printing Core and inspecting assembly

---

# Tools

* Dump RTS statistics
  * raw information like GC time, memory usage
  * add `-rtsopts` to ghc-options in cabal
  * run with `+RTS -sstderr`

--

* Time profiling
  * shows where the program is spending time
  * add `-prof -auto-all -caf-all -fforce-recomp`
  * run with `+RTS -p`

--

* Space profiling
  * shows memory usage over time
  * various options available
  * differ in how the live heap is broken down

--

* Criterion
  * framework for executing and analysing benchmarks
  * pretty output and graphs
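--

A sketch of the RTS statistics workflow (compiling directly with ghc here rather than cabal, binary name as used earlier):

``` shell
> ghc -O2 -rtsopts Hatchet.hs -o hatchet
> ./hatchet sample.psv output.psv +RTS -sstderr
```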
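--

And a minimal Criterion benchmark, assuming the strict and lazy `sum` variants from earlier and some list of parsed rows (all names hypothetical):

``` haskell
import Criterion.Main (bench, defaultMain, whnf)

main :: IO ()
main =
  defaultMain
    [ bench "sum/strict" (whnf sumStrict rows) -- forces the accumulator
    , bench "sum/lazy" (whnf sumLazy rows)     -- builds thunks as it goes
    ]
```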
---

# Conclusion

By default, use strict data types and lazy functions.

* Think about the evaluation model
* Plan your data types
* Force evaluation where necessary
* Measure performance

???
* Johan Tibell's presentations

---
class: center, middle

# Thanks!

Tim McGilchrist

@lambda_foo

Master Commander of Big Data @ Ambiata