Opportunities to use monoids

17 Feb 2014

Recently, I was watching this talk by Avi Bryant where he talks about using algebra in programming. The talk is geared towards distributed systems, but I think the concepts are applicable in general. This certainly became clear when I realised that my earlier trouble also concerned monoids. So, what's to gain from understanding this algebra? At least improved code reusability and parallelism. And an opportunity to satisfy your curiosity!

As always, any suggestions or remarks are welcomed!

What is a monoid?

A monoid is an algebraic structure. It consists of a set and an operation on that set: when A and B are elements of the set, A operation B yields C, which is also an element of the set. The following conditions must hold:

  • Associativity: (a operation b) operation c == a operation (b operation c)
  • Identity Element: identity operation a == a == a operation identity

You can think of many monoids yourself. For example: the set of integers, the + operation and identity element 0. Or another example: the set of all possible strings, the concat operation and an empty string as identity element.
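To make this concrete, here is a small sketch that checks both conditions for the integer example (the function names are my own, just for illustration):

```haskell
-- A minimal check of the two monoid laws for (Integer, (+), 0).
assocHolds :: Integer -> Integer -> Integer -> Bool
assocHolds a b c = (a + b) + c == a + (b + c)

identityHolds :: Integer -> Bool
identityHolds a = (0 + a == a) && (a + 0 == a)

-- Both should hold for any choice of integers, e.g.:
lawsOk :: Bool
lawsOk = assocHolds 1 2 3 && identityHolds 42
```

Note that subtraction fails the associativity check: (1 - 2) - 3 gives -4 while 1 - (2 - 3) gives 2, so the integers with subtraction do not form a monoid.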

Why will this be useful?

Above, I talked about code reusability and parallelism. Let’s start with parallelism.

To get started, we’ll use the following monoid: Set = integers; Operation = +; Identity = 0. Let’s assume for a moment that + is a very difficult operation and it takes a computer an hour to calculate the result. When asked to add the integers 1 through 8, it is natural to solve the problem in the following way:

(((((((1 + 2) + 3) + 4) + 5) + 6) + 7) + 8)

If we use this method, we would have the result in 7 hours. Could we do it faster? Especially when we have multiple computers? Well yes, just solve the problem in the following way:

(1 + 2) + (3 + 4) + (5 + 6) + (7 + 8)
(  3  ) + (  7  ) + (  11 ) + (  15 )
(      10       ) + (       26      )
(                36                 )

The same calculation now takes 3 hours (but uses 4 computers; a nice trade-off if we have a lot of computers and a lack of time). The advantage lies in the fact that adding is inherently parallel. It doesn’t matter if you start to add at the beginning, the end or in the middle, the end result is always the same. This parallelism stems from the associativity property, which you get with every monoid!
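The chunked strategy above can be sketched in Haskell (sequentially here; the per-chunk reductions are what you would hand to separate machines — `chunksOf'` is a hand-rolled helper, not the library function):

```haskell
-- Reduce a list in fixed-size chunks, then combine the partial results.
-- Associativity guarantees this matches the plain left-to-right fold.
chunkedSum :: Int -> [Int] -> Int
chunkedSum n = foldr (+) 0 . map (foldr (+) 0) . chunksOf' n
  where
    -- split a list into chunks of (at most) k elements
    chunksOf' _ [] = []
    chunksOf' k ys = let (h, t) = splitAt k ys in h : chunksOf' k t
```

For example, `chunkedSum 2 [1..8]` and `foldl (+) 0 [1..8]` both give 36.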

Now, how about code reusability? To understand this you have to think about the fundamental concepts of the monoid. There has to be a set, an operation and an identity method. Even when you don’t know which monoid you are working with, the method of solving is always the same: take 2 elements as input, apply the operation and receive 1 element as output. Repeat until only one element is left. Because this method is always the same, you can just reuse it and plug different monoids in as needed. Still fuzzy? Have a look at the examples.

Examples

I’m importing these libraries everywhere.

import Data.Monoid
import Data.List.Split

The sum example described in the previous section:

-- summing
-- some data to work with
sum_data =  [AdditionTest c | c <- [1..]]

-- define a type to work with. Does anybody know if you can also define a monoid on Int?
data AdditionTest = AdditionTest Int deriving (Show)

-- define the monoid, addTest is the operation
-- on modern GHC, the operation lives in a Semigroup instance; mappend defaults to (<>)
instance Semigroup AdditionTest where
    (<>) = addTest

instance Monoid AdditionTest where
    mempty = AdditionTest 0

addTest :: AdditionTest -> AdditionTest -> AdditionTest
addTest (AdditionTest x) (AdditionTest y) = AdditionTest (x+y)

-- when a monoid is defined on a certain type, mappend points to its operation and mempty to the identity element
-- foldl takes a list and applies an operation on it from left to right (with given start element)
-- in this case, the operation is mappend and the start element is mempty
sum_result = foldl mappend mempty $ take 100 sum_data

-- doesn't matter if starting left or right...
sum_result' = foldr mappend mempty $ take 100 sum_data

-- or even different chunks and then combine the chunks...
chunks = chunksOf 10 $ take 100 sum_data
sum_result'' = foldr mappend mempty $ map (foldr mappend mempty) chunks

Let’s generalise our calculation method. We’ll define a new aggregation function that just takes a list and applies the mappend function on the elements. The start element will again be mempty:

aggregate :: Monoid a => [a] -> a
aggregate x = foldl mappend mempty x

-- example of the general function with sum
sum_result''' = aggregate $ take 100 sum_data

Now, we’ll be able to use the aggregate function for different monoids. For example, can we aggregate strings?

--defining some data to work with (wrapped in our type, so our monoid is used)
string_data = map StringAddition ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"]

--a new type and corresponding monoid
data StringAddition = StringAddition String deriving (Show)
instance Semigroup StringAddition where
    (<>) = stringAppend

instance Monoid StringAddition where
    mempty = StringAddition ""

--the mappend operation
stringAppend :: StringAddition -> StringAddition -> StringAddition
stringAppend (StringAddition x) (StringAddition y) = StringAddition (x ++ y)

--reusing the aggregate function
sum_string = aggregate string_data

That was easy! How about finding a maximum?

Apropos, does anybody know how to define a Monoid for all types with an Ord instance? There is probably a problem with choosing a sane mempty, because Ord doesn’t provide one.
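If I understand the libraries correctly, the standard answer is an extra Bounded constraint: Data.Semigroup ships a Max newtype whose Semigroup instance only needs Ord, while its Monoid instance additionally requires Bounded so that minBound can serve as mempty. A sketch:

```haskell
import Data.Semigroup (Max(..))

-- Max has a Semigroup instance for any Ord type; the Monoid instance
-- needs Bounded too, so mempty can be Max minBound.
maxOfList :: [Int] -> Int
maxOfList = getMax . mconcat . map Max
```

For example, `maxOfList [3, 1, 4, 1, 5]` gives 5 (and `maxOfList []` gives `minBound`, which is exactly the "sane mempty" problem solved by the Bounded constraint).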

Apropos 2, is it possible to define two monoids for the same data type? For example: an Int monoid for sum and one for product.
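As far as I know, this second question is exactly what Data.Monoid’s newtype wrappers solve: a type can only have one Monoid instance, but you can wrap Int in Sum or Product and pick the monoid by choosing the wrapper:

```haskell
import Data.Monoid (Sum(..), Product(..))

-- same underlying Int, two different monoids via two wrappers
sumOf :: [Int] -> Int
sumOf = getSum . mconcat . map Sum

productOf :: [Int] -> Int
productOf = getProduct . mconcat . map Product
```

For example, `sumOf [1, 2, 3, 4]` gives 10 and `productOf [1, 2, 3, 4]` gives 24.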

--same drill: defining type, data and monoid
data MaxTest = MaxTest Int deriving (Show)
max_data =  [MaxTest c | c <- [1..]]

instance Semigroup MaxTest where
    (<>) = maxTest

instance Monoid MaxTest where
    -- 0 is a valid identity here because max_data only holds positive numbers
    mempty = MaxTest 0

maxTest :: MaxTest -> MaxTest -> MaxTest
maxTest (MaxTest x) (MaxTest y) = MaxTest $ max x y

max_result = aggregate $ take 100 max_data

Averages? This one is a bit different: we cannot just add averages together. Instead, we store the sum of all numbers and the count of numbers we processed. The average can then be calculated as sum / occurrences.

data AverageData = AverageData { sum :: Int, occurrences :: Int } deriving (Show)

average_data = [(AverageData 2 3), (AverageData 2 5)]

instance Semigroup AverageData where
    (<>) = averageAggregate

instance Monoid AverageData where
    mempty = AverageData 0 0

averageAggregate :: AverageData -> AverageData -> AverageData
averageAggregate (AverageData a b) (AverageData x y) = AverageData (a+x) (b+y)

average_result = aggregate average_data
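To get the actual average out of the aggregated result, one more (non-monoidal) step is needed at the very end. A self-contained sketch, with its own copy of the type (the `average` helper and the field names are my own choices):

```haskell
-- standalone copy of the running-average pair
data AverageData = AverageData { total :: Int, occurrences :: Int }
    deriving (Show)

-- combine partial results, as in the Monoid instance above
combine :: AverageData -> AverageData -> AverageData
combine (AverageData a b) (AverageData x y) = AverageData (a + x) (b + y)

-- the final, non-associative step: divide once, after all combining is done
average :: AverageData -> Double
average (AverageData s n) = fromIntegral s / fromIntegral n
```

For example, `average (combine (AverageData 2 3) (AverageData 2 5))` gives 4 / 8 = 0.5.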

Next steps? Well, there is still a lot to learn.

What do you think? Suggestions and remarks are welcomed!

Update

This comment on Reddit by mstksg points out that monoids already have an aggregate function built in. mconcat is defined as foldr mappend mempty, almost exactly as I did.
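So the hand-rolled aggregate collapses to a library one-liner; a quick sketch comparing the two (using the Sum wrapper from Data.Monoid):

```haskell
import Data.Monoid (Sum(..))

-- mconcat is the library's version of foldr mappend mempty
sumViaMconcat :: Int
sumViaMconcat = getSum (mconcat (map Sum [1 .. 100]))

-- the hand-rolled equivalent, for comparison
sumViaFold :: Int
sumViaFold = getSum (foldr mappend mempty (map Sum [1 .. 100]))
```

Both give 5050, as expected.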


