
I have the following Haskell code, implementing a simple version of the "cat" Unix command-line utility. Testing performance with "time" on a 400MB file, it's about 3x slower than the system cat. (The exact script I am using to test it is below the code.)

My questions are:

  1. Is this a valid test of performance?
  2. How can I make this program run faster?
  3. How can I identify performance bottlenecks in Haskell programs in general?

Regarding questions 2 and 3: I have compiled with GHC's -prof flag and run with +RTS -p, but I'm finding the output a bit uninformative here.
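Concretely, a sketch of this profiling workflow (with -fprof-auto added so GHC generates cost centres automatically; without it the report only shows MAIN and CAFs). It assumes GHC is on the PATH and that Main.hs and the huge test file exist, and guards so it is a no-op otherwise:

```shell
# Sketch of the GHC profiling workflow: -prof enables profiling,
# -fprof-auto inserts cost centres on all top-level bindings.
# Guarded: does nothing if GHC or the test artifacts are missing.
if command -v ghc >/dev/null 2>&1 && [ -f Main.hs ] && [ -f huge ]; then
  ghc -O2 -prof -fprof-auto Main.hs -o MainProf
  ./MainProf huge +RTS -p -RTS > /dev/null
  # The time/alloc report is written to MainProf.prof
  head -n 20 MainProf.prof
fi
```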

Source (Main.hs)

module Main where

import System.IO
import System.Environment
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle buf = do
  hPut stdout buf
  eof <- hIsEOF handle
  unless eof $ do
    buf <- hGetSome handle bufsize
    go handle buf

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- hGetSome handle bufsize
  hSetBuffering stdin $ BlockBuffering (Just bufsize)
  hSetBuffering stdout $ BlockBuffering (Just bufsize)
  go handle buf

Timing script (run.sh):

#!/usr/bin/env bash

# Generate 10M lines of silly test data
yes aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | head -n 10000000 > huge

# Compile with optimisation
ghc -O2 Main.hs

# Run haskell
echo "timing Haskell"
time ./Main huge > /dev/null

echo ""
echo ""

# Run cat
echo "timing 'cat'"
time cat huge > /dev/null

My results:

timing Haskell

real    0m0.980s
user    0m0.296s
sys     0m0.684s

timing 'cat'

real    0m0.304s
user    0m0.001s
sys     0m0.302s

The profiling report when compiling with -prof and running with +RTS -p is below:

Sat Dec 13 21:26 2014 Time and Allocation Profiling Report  (Final)

       Main +RTS -p -RTS huge

  total time  =        0.92 secs   (922 ticks @ 1000 us, 1 processor)
  total alloc = 7,258,596,176 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0

                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.Conc.Signal           84           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Handle.FD          81           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    76           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           70           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     69           0    0.0    0.0     0.0    0.0
Tags: performance, unix, haskell

asked Dec 13 '14 at 21:10 by statusfailed (edited Dec 13 '14 at 21:19)
Comments:

It's good to see that you have also added -O2 optimization. You can also post the link to the profiled report here. – Sibi, Dec 13 '14 at 21:13

Instead of hGetSome use hGetNonBlocking; it increases the performance substantially. – Sibi, Dec 13 '14 at 21:25

Notice caching impacts the timing significantly. If I run cat first in the shell script "benchmark" instead of second, it performs about 60% slower on my system. – Thomas M. DuBuisson, Dec 13 '14 at 21:59

-prof won't be terribly useful if you don't have any cost centres; try compiling it (and the whole bytestring package too, if you can, probably in a Cabal sandbox) with -fprof-auto and maybe -fprof-cafs in addition to -prof. – icktoofay, Dec 13 '14 at 22:36

@statusfailed A pipes solution runs pretty close. I have also attached a benchmark with the code. – Sibi, Dec 13 '14 at 23:19
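Following Thomas M. DuBuisson's point about caching, a fairer comparison warms the OS page cache once before either timed run, so neither program benefits from running second. A minimal sketch, guarded in case the binary or test file from run.sh is absent:

```shell
# Sketch: warm the page cache untimed, then time both programs,
# so ordering in the benchmark script no longer skews the result.
# Guarded: no-op if ./Main or huge (the run.sh artifacts) are missing.
if [ -x ./Main ] && [ -f huge ]; then
  cat huge > /dev/null           # warm the cache, untimed
  time ./Main huge > /dev/null   # both timed runs now hit the cache
  time cat huge > /dev/null
fi
```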

3 Answers

14 votes

This is only a partial answer trying to address the second question:

I tried something like this using the GHC.IO.Buffer API:

module Main where

import System.IO
import System.Environment
import GHC.IO.Buffer
import Data.ByteString as BS

import Control.Monad

-- Copied from cat source code
bufsize = 1024*128

go handle bufPtr = do
  read <- hGetBuf handle bufPtr bufsize
  when (read > 0) $ do
    hPutBuf stdout bufPtr read
    go handle bufPtr

main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- newByteBuffer bufsize WriteBuffer
  withBuffer buf $ go handle

and it seems to come closer to the performance of 'cat', but still definitely slower...

time ./Cat huge > /dev/null
./Cat huge > /dev/null  0.00s user 0.06s system 76% cpu 0.081 total

time cat huge > /dev/null
cat huge > /dev/null  0.00s user 0.05s system 75% cpu 0.063 total

I think that by using the buffer API we explicitly avoid allocating a fresh ByteString for every chunk, as hGetSome does in the original code, but I am only guessing here and don't know exactly what is happening in either compiled program...

UPDATE: Adding the original code's performance on my laptop:

time ./Cat2 huge > /dev/null
./Cat2 huge > /dev/null  0.12s user 0.10s system 99% cpu 0.219 total

UPDATE 2: Adding some basic profiling results:

Original Code:

Cat2 +RTS -p -RTS huge

  total time  =        0.21 secs   (211 ticks @ 1000 us, 1 processor)
  total alloc = 6,954,068,112 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0  100.0

                                                       individual     inherited
COST CENTRE MODULE                   no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                      46           0  100.0  100.0   100.0  100.0
 CAF        GHC.IO.Handle.FD          86           0    0.0    0.0     0.0    0.0
 CAF        GHC.Conc.Signal           82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding           80           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.FD                 79           0    0.0    0.0     0.0    0.0
 CAF        System.Posix.Internals    75           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv     72           0    0.0    0.0     0.0    0.0

Buffer-API Code:

Cat +RTS -p -RTS huge

  total time  =        0.06 secs   (61 ticks @ 1000 us, 1 processor)
  total alloc =   3,487,712 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

MAIN        MAIN    100.0   98.9

                                                       individual     inherited
COST CENTRE MODULE                  no.     entries  %time %alloc   %time %alloc

MAIN        MAIN                     44           0  100.0   98.9   100.0  100.0
 CAF        GHC.IO.Handle.FD         85           0    0.0    1.0     0.0    1.0
 CAF        GHC.Conc.Signal          82           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding          80           0    0.0    0.1     0.0    0.1
 CAF        GHC.IO.FD                79           0    0.0    0.0     0.0    0.0
 CAF        GHC.IO.Encoding.Iconv    71           0    0.0    0.0     0.0    0.0

Notice especially the big difference in allocation costs...

answered Dec 13 '14 at 22:20 by bmk (edited Dec 13 '14 at 23:11)
Comments:

Note that there is also a fixed overhead of approximately 10 ms for every Haskell program, so if you subtract that from your Haskell timing it is even closer to cat. – Gabriel Gonzalez, Dec 13 '14 at 22:35

I know I didn't specify this in the question, but I'm hoping to go for a more idiomatic / "high level" Haskell solution. Upvoted for a good answer though. – statusfailed, Dec 13 '14 at 22:58
12 votes

The original question made me think it was about finding a performance issue in the exact code provided. Since the comment "I'm hoping to go for a more idiomatic / "high level" Haskell solution" contradicts that assumption, I'll give the reasonably performing idiomatic Haskell solution.

The way I would expect any programmer familiar with Haskell to solve this problem is with lazy ByteStrings. This allows the programmer to simply specify the task of reading input and writing output while letting the library worry about the buffering and looping constructs.

module Main where

import System.IO
import System.Environment
import Data.ByteString.Lazy as BS

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  handle  <- openFile file ReadMode
  buf     <- BS.hGetContents handle
  hPut stdout buf

The result is both more readable and better performing than the code in the original question:

timing 'cat'

real    0m0.075s
user    0m0.000s
sys     0m0.074s

timing strict bytestring with GHC -O2

real    0m0.254s
user    0m0.126s
sys     0m0.127s

timing strict bytestring with GHC -O2 -fllvm

real    0m0.267s
user    0m0.132s
sys     0m0.134s

timing lazy bytestring with GHC -O2

real    0m0.091s
user    0m0.023s
sys     0m0.067s

timing lazy bytestring with GHC -O2 -fllvm

real    0m0.091s
user    0m0.021s
sys     0m0.069s

That is, the lazy bytestring solution is 21% slower than cat. Putting cat last for preferential caching behavior results in a 59 ms runtime, placing the Haskell solution at 51% slower.

EDIT: Dons suggested that using memory-mapped IO would more accurately model cat's behavior. I'm not sure how accurate that statement is, but mmap almost always results in better performance, and this situation is certainly no exception:

timing memory mapped lazy bytestring with GHC -O2  real    0m0.008s user    0m0.004s sys     0m0.003s

Which was produced by:

module Main where

import System.IO (stdout)
import System.Environment
import System.IO.Posix.MMap.Lazy
import Data.ByteString.Lazy (hPut)

import Control.Monad

main :: IO ()
main = do
  file    <- fmap Prelude.head getArgs
  buf     <- unsafeMMapFile file
  hPut stdout buf