Amateur Topologist

Politics, programming, math, and science.

Tag: haskell

Security vulnerability in Haskell with CGI

Compiled Haskell programs all include special RTS (Run Time System) options, that change things like the number of cores that it runs on, various internal things relating to how often garbage collection runs, etc. They’re specified by invoking the program like ./foo +RTS -m10 -k2000 -RTS to run the GHC-compiled program ‘foo’, reserving 10% of the heap for allocation and setting each thread’s stack size to a maximum of 2000 bytes. In the current build of GHC, there is no way to disable these options from working (although the option –RTS will make all further options be interpreted as normal, non-RTS options). The problem is that the option -tout will write profiling data to the file out. So, if your program is setuid root, anybody who runs it can write the profiling data to, say /etc/passwd and render the system unusable. They don’t get to pick what gets written, so they can’t add a backdoor for themselves, but they can essentially scribble over whatever files they want. This is bug #3910, and the fix (disabling RTS by default) has been uploaded.

Now, one of the more little-known features of CGI is that if you pass a query string that does not contain any = signs to a CGI script, the httpd may pass the string along as command-line arguments. This is specified in section 4.4 of RFC 3875, and it specifies how the query string SHOULD be turned into arguments (although it does not say anything about whether the httpd should behave this way, only that some do). This is an example script that only outputs its arguments in a comma-separated list; the link gives it some sample arguments. Note that by URL-escaping, you can send arbitrary strings through… including +RTS. So if that were, say, a Haskell script, I could pass the query string ?%2BRTS+-tindex.html+-RTS and overwrite index.html.

There are three ways to get around this: first, GHC 6.12.2 has the -no-rtsopts option, which will obviously disable RTS options. So if you just recompile your script with that, it’ll be safe. Note that 6.14 will disable the RTS options by default; the 6.12.2 patch didn’t for backwards-compatibility reasons. Second, if you don’t want to use 6.12.2 for whatever reason, you can wrap it in a shell script that calls it with no options. For example, replace the Haskell script with a shell script called, say, hscript.cgi (if your Haskell program is called hscript) that calls it with no arguments, e.g.

#!/bin/bash
./hscript.real

and rename the Haskell script to hscript.real, so that it doesn’t get run as CGI (I’m assuming that .real files don’t get run as CGI on your machine!) Another thing you can do is to add the following to your .htaccess, which will give 403 Forbidden errors to anybody passing RTS arguments in the URL:

RewriteEngine on
RewriteCond %{QUERY_STRING} ^(?:[^=]*\+)?(?:%2[bB]|(?:-|%2[dD]){1,2})(?:%52|R)(?:%54|T)(?:%53|S)(?:\+[^=]*)?$
RewriteRule ^ - [F]

This will solve it for every Haskell script you use, but relies on the regex being correct, which isn’t something I can guarantee.

dissociated-blogosphere: never have to write an original post again!

For the past two weeks or so, I’ve been working off and on on a project called dissociated blogosphere (OSX and Linux binaries here). It takes a bunch of URLs, looks through them for an RSS for the raw content of the posts, and then stores the words of the posts in an array. It then picks N random, consecutive words (where in this case N is 2), and starts generating new text, by picking a new word x% of the time if x% of the time, the previous N words were followed by that word. For example, if 90% of the time, the words ‘the quick’ were followed by ‘brown’, and the other 10% of the time, they were followed by ‘red’, then when the two-word phrase ‘the quick’ was randomly generated, it would pick ‘brown’ 9 times out of 10, and ‘red’ 1 time out of 10. This is the algorithm Emacs‘s dissociated press feature uses, hence the name. Running it a few times on this site and picking some of my favorite sentences gives:

Second, I ignored the axes of the work you envision. So start small, and think about the free group on two generators, which is obviously highly undesirable behavior. However, it does have the web interface, I’ll have it up by last week, but that obviously didnt happen. Taking into account the fact that I’m using. The central thing that makes MS Paint Adventures unique to the point where it’s my go-to language for random programs (I still use Python for that), but if we pick two of them and rotate one of the set of all rotations that you have some custom function you want soup or salad, both is not a valid answer.

It’s my first medium-scale project written in Haskell (even though there isn’t a lot of code, what little was there was not trivial to write), and I’ve learned several lessons from it:

  • The Haskell wiki is an excellent resource. When I was trying to learn how to use HXT, the Haskell XML Toolbox, I found the provided documentation somewhat inadequate. But the HXT article on the Haskell wiki is an excellent introduction to the filter abstraction, which is all that I need for the basic stuff that I’m using.
  • Read the Haddock documentation. The HXT article, as useful as it was, didn’t cover a couple essential things I needed to know (such as how to pull all elements with type “application/rss+xml”). So I look at the documentation for Text.XML.HXT.Arrow.XmlArrow (the module containing the arrows that HXT uses to filter XML), and saw that hasAttrValue :: String -> (String -> Bool) -> a XmlTree XmlTree looks about right; from the type, I can guess (correctly) that I need to pass it the attribute and a prediate on the value of the attribute (i.e., hasAttrValue "href" (== "application/rss+xml")).
  • One goal at a time. This isn’t specific to Haskell. When I started on this, I meant for it to require you to provide the RSS feed. Then, I realized that having a larger corpus might be better, so I added the ability to pull from multiple feeds. Then I decided that expecting people to find the RSS feed by hand might be a bit much, so I rewrote it to pull the RSS feed from the site. And I eventually plan to write a CGI frontend so that you can just run it online. If I had decided from the start to do all these things, I probably never would’ve gotten started. As Linus Torvalds said:

    Nobody should start to undertake a large project. You start with a small trivial project, and you should never expect it to get large. If you do, you’ll just overdesign and generally think it is more important than it likely is at that stage. Or worse, you might be scared away by the sheer size of the work you envision. So start small, and think about the details. Don’t think about some big picture and fancy design. If it doesn’t solve some fairly immediate need, it’s almost certainly over-designed.

  • Strip and gzip your executables if you’re going to distribute them. Due to the fact that I’m statically linking in HXT, which is a sizeable library, the compiled, non-stripped version of dissociated-blogosphere is a whopping 12 megabytes. This isn’t due to inefficiencies in my own code, but due to the sheer size of the HXT library. Running the Unix command line utility strip (which only removes internal debugging information) cuts it down to about 5 MB, and then gzipping the binaries takes it down to a little over a megabyte.
  • Split things into libraries where it’s appropriate. Part of the problem with using HXT is that it makes recompilation slow; if I could do it all over again, I might have used HaXml, but HXT has the advantage of having nontrivial amounts of documentation written about it (on the Haskell wiki). If I had instead split the RSS parsing code into its own library, I could have only recompiled those parts whenever I touched them, which wasn’t nearly as often as I touched the code frontend. Plus, it’s just good programming practice.

So what do I have planned for dissociated-blogosphere? First off, I plan to make it faster by caching RSS lookups; by storing a map from page URLs to RSS feed, I can cut the number of network requests in half. Second, I plan to implement actual error handling; right now if you give it a bad URL it fails and doesn’t produce any useful output, regardless of whether other URLs are good. Third, I’m going to split out the RSS part into its own library, which I might make its own package on hackage. Fourth, I intend to eventually write a web interface (either in Haskell or in Python) so that you don’t have to download and install it. I originally intended to have the web interface up by last week, but that obviously didn’t happen. Taking into account the fact that it’ll take longer than I think it does, I’m guessing I’ll have it up by two weeks from now (so, a month). And finally, when/if I do the web interface, I’ll have it color the text according to which blog it’s from, or maybe even output xterm color codes if I don’t write the web interface.

Why I Love Currying

So I’ve been playing around with Haskell a lot lately and using it for various random stuff; I haven’t progressed to the point where it’s my go-to language for random programs (I still use Python for that), but I at least have an idea of how to use it. And there’s one feature of Haskell that I miss sorely when I write code in Python, or pretty much any other vaguely functional language: currying.

In Haskell, every function takes a single argument. A function of multiple arguments, such as map, which applies a function to every element in a list, actually only has one argument; for example, map can be interpreted either as taking a function and a list and returning a list, or as taking a function and returning a function that takes a list and returns a list. More formally, in Haskell, these two type declarations are equivalent:

map :: (a -> b) -> [a] -> [b]
map :: (a -> b) -> ([a] -> [b])

This process, of taking a multi-argument function and converting it into a series of single-argument functions is known as currying, after the mathematician Haskell Curry (who, obviously, is also the source of the name Haskell); the process of partially applying arguments to a function in this way is known as ‘partial application’, but is also called currying. One of the most obvious examples of currying is in sections: the function (0 ==) is syntactic sugar for (==) 0, and returns whether its argument is equal to zero. Furthermore, we can also partially apply the predicate to filter, to make a function that filters its argument on a fixed predicate. So, these three examples are completely equivalent:

removeZeros :: [Integer] -> [Integer]
removeZeros xs = filter (\x -> x /= 0) xs
removeZeros xs = filter (/= 0) xs
removeZeros = filter (/= 0)

(where /= is Haskell’s not-equal operator). The first is the most explicitly-written version, using no currying at all. The second curries the predicate; (/= 0) x is the same as x /= 0. Finally, since removeZeroes applied to an argument is the same as applying filter (/= 0) to it, we might as well define the former as the latter. Or, to take another example, look at the sortBy: it has type (a -> a -> Ordering) -> [a] -> [a], where Ordering is a datatype that can either be EQ, LT or GT for equal, less than, or greater than. So if you have some custom function you want to sort a list on, you can just say mySort = sortBy f and it will be the same as writing mySort xs = sortBy f xs, only cleaner and neater. Or in my Data.Nimber module (specifically lines 38, 39, and 43), many operations on Nimbers that’re required in order for me to call then ‘numbers’ are just the identity operation. So instead of saying abs x = x, I can just say abs = id.

Furthermore, without currying, you couldn’t have variadic functions; in order to work inside Haskell’s type system, the two types a -> b -> c and a -> (b -> c) have to be the same type. The full explanation involves typeclasses, and is (in my opinion) worth a read, because it’s a good explanation of a pretty horriblexcellent (it’s both at once, you see) type system hack.

As an aside, this also means that id :: a -> a, the identity function, is in a sense the same thing as ($) :: (a -> b) -> a -> b, which is function application. You can see this by substituting (b -> c) for a in the type of id, then removing parentheses:

id :: a -> a
id :: (b -> c) -> (b -> c)
id :: (b -> c) -> b -> c

So, in particular, f `id` x is the same as f $ x, which is just f x. Another way to think of this is that f `id` x = id f x = (id f) x = f x.

Variadic Functions in Haskell

Most modern languages have some kind of printf analogue: a function that takes a format string, and a series of things to be inserted into that string, and formats them all accordingly. At first glance, Haskell’s strong type system would seem to preclude this. There’s no built-in system for writing functions that take variable numbers of arguments, and it seems like it would be difficult to write one. The standard approach is to take a list instead, but this fundamentally doesn’t work for printf, since you’re going to be wanting to print Integers, Strings, and Floats. It’s possible to just pre-apply show to everything, but that’s not really a good idea, because you might want to show them in a different way than the built-in show does. You can use an extension called existential types to create a list of PrintfWrappers which wrap integers/floats/strings (more on that below), but that requires your users to manually do the wrapping, which is, once again, not a good idea. Haskell’s Text.Printf module takes a third approach. Look at the following lines:

instance (IsChar c) => PrintfType [c]
instance PrintfType (IO a)
instance (PrintfArg a, PrintfType r) => PrintfType (a -> r)

instance PrintfArg Integer
instance (IsChar c) => PrintfArg [c]

printf :: (PrintfType r) => String -> r

Here’s how to interpret this: PrintfType is the type of things that can be printed to. Printing to a String just gives you a string, much like sprintf in C or Perl, printing to an IO () will actually print it out (so you can use it like a normal printf in do blocks, a behavior which I personally find distasteful.). However, printf will return undefined when asked to return an IO r; the reason that you can nevertheless return one is that only declaring IO () as an instance of PrintfType is invalid according to Haskell 98.

PrintfArg, by comparison, are the elements that are valid arguments to printf; they basically consist of the various WordN/IntN types, Integer, Float, Char, and (IsChar c) => [c]. The point of the last instance is that, while you can’t have a specific version of a polymorphic type be an instance of a typeclass, you can restrict it to types whose parameters are themselves instance of another typeclass; the only instance of isChar is Char.
So now that we have that clarified, let’s suppose we want to call printf with “%s %d %f” “foo” 42 3.1, passing it the format string, String, an Integer, and a Float. This causes printf’s type to become

printf :: String -> String -> Integer -> Float -> String

Does this match the pattern (PrintfType r) => String -> r? Let’s go in reverse. String is an instance of PrintfType, and Float is an instance of PrintfArg, so Float -> String is an instance of PrintfType. Therefore, Integer -> (Float -> String) is an instance of PrintfArg, and so is String -> (Integer -> (Float -> String)). Dropping parentheses, this becomes String -> Integer -> Float -> String. So the types all check out. If you pass an invalid type, then you’ll run into something that isn’t an instance of PrintfArg and so the types won’t check.

I mentioned above that if you use something called ‘existential types’, you can do something similar. The way it works is that you define a new type whose data constructor only requires that its argument be of a given typeclass. Look at the following example

{-# LANGUAGE ExistentialQuantifiers #-}
data Box = forall s. (Show s) => Box s

boxes = [Box 2, Box "f", Box [8,3]]\

showBoxes :: [Box] -> String
showBoxes [] = ""
showBoxes ((Box x):xs) = show x ++ " " ++ showBoxes xs

When you run showBoxes boxes, you get 2 "f" 83, exactly as you’d expect. Note that, however, the function unbox (Box x) = x cannot be written; it would have to be of type (Show s) => Box -> s, and there’s just no real way to do that. So once you’ve wrapped something up in a Box, you can only get at it by showing it. From this, you can see how to pass a heterogeneous list to printf. The reason that this approach is suboptimal is that it would require Text.Printf to export a Printf data constructor which would wrap up everything to make it of the appropriate type, and that would be rather annoying, especially since it relies on show preserving enough information for you to format the number after reading it back in.

This pattern can obviously extended to any other variadic, heterogeneous function, as long as you can define a suitable typeclass that its arguments must all be instances of. And that’s not really a restriction at all; if you can’t specify a behavior that the instances must have, then you don’t really know what you can do with the arguments, and so you can’t do anything at all!

Data.MemoCombinators and You

Part of the beauty of Haskell is that it allows you to simply write recursive functions. But part of the problem with recursive functions is that they tend to have absolutely horrible big-O run times. The usual solution to this problem is to use what’s known as memoization, which is memorization without the ‘r’, since programmers have to have special names for everything. Memoization is usually implemented as an associative array (or a plain array in the common case where the function takes a single non-negative integer as an argument); the function attempts to look up the return value for its arguments in an associative array. If it finds it, it can return without doing expensive computation; if it doesn’t, then it performs the computation, stores the result in its array, and then returns. In Python, a memoized Fibonacci function might be written as follows:

fib_cache = {}
def fib(n):
    if (n < 2): return n
    if n not in fib_cache: fib_cache[n] = fib(n-1) + fib(n-2)
    return fib_cache[n]

The speed savings gained by this are enormous; on my test machine, fib(35) takes 15 seconds to compute without memoization, whereas fib(1000) computes almost instantly with memoization. In terms of big-O running times, I believe that the memoized version takes O(n^2) time, whereas the unmemoized version takes \Theta(\phi^n), which is interesting since the Fibonacci numbers themselves are \Theta(\phi^n). In any case, the memoized version is clearly superior.

But how do you do this in a language such as Haskell? You can’t carry state between the various incarnations of the function, since that could potentially lead to the function’s values not solely depending on its arguments, violating referential transparency. You can’t carry the state around in a monad because then different calls to the function would each have separate caches, so you’d have to pull some kind of trick where the function returns itself and its value, then pass the function around and it would just be a huge mess. So instead what you do is you use Data.MemoCombinators, which is a package that lets you turn functions into other, memoized functions. So how do you use it? It’s not too hard, especially if you’re memoizing functions that only use builtin types. An example, straight from the Data.MemoCombinators page:

fib = Memo.integral fib'
    where
    fib' 0 = 0
    fib' 1 = 1
    fib' x = fib (x-1) + fib (x-2)

There are two things to note here: first, the memoized version, fib, is generated from the non-memoized by calling Memo.integral on it. This is how you create memoized versions of single-variable functions: you apply the appropriate combinator. Second, fib’ calls fib inside it. This is very important: if fib’ called fib’, then you couldn’t save time within the fib function, only outside of it. With fib’ calling fib, on the other hand, then the first time you call fib 1000, not only will it return before the heat death of the universe, but you’ll also get fib 999, fib 998, etc. cached.

But what if your function to be memoized isn’t one of the standard types? That’s why there’s Memo.wrap. You just have to define two mappings: one from your type to some combination of MemoCombinator types, and one that goes from that combination back. An example will make it clear:

data Foo = Foo String Int deriving (Show)
memoFoo = Memo.wrap toFoo fromFoo (Memo.pair memoString Memo.integral) where
    memoString = Memo.list Memo.char
    toFoo (a, b) = Foo a b
    fromFoo (Foo a b) = (a, b)
f = memoFoo f' where
    f' (Foo x y) = x

So as you can see, first you build up a memoString type which can memoize Strings; since a String is just a list of Chars, you can just apply Memo.list to Memo.char. Then you define toFoo and fromFoo, which send you from the abstracted Foo type to a tuple of a String and an Int. Finally, you use Memo.wrap to ‘wrap’ the pair of a memoized String and a memoized Int (constructed using Memo.pair, naturally) up in an abstract memoFoo memoizer.

The other thing you can do with MemoCombinators is memoize functions of multiple variables. Take this sample of code from a project I’m working on:

memoNimber = memoNimber = Memo.wrap toNimber fromNimber Memo.integral
(*) = Memo.memo2 memoNimber memoNimber (*!) where
      x *! (Nimber 1) = x
      (Nimber 1) *! x = x
      a *! b = mex $ liftM2 combine [0 .. pred a] [0 .. pred b]
        where
          mex xs = fromJust $ find (`notElem` xs) [0..]
          combine a' b' = a * b' + a' * b + a' * b'

The actual definition of *! isn’t important, I’m only including it for completeness. Nor are the definitions of toNimber and fromNimber. What is important is Memo.memo2: you use it to generate a memoized function of multiple arguments. You just pass it memoizers for each of its arguments (since * takes two Nimbers, I pass it memoNimber twice) and the unmemoized version, and it gives you a memoized version.

As for how Data.MemoCombinators works, I can’t really explain that. I know it has to do with the fact that expressions in function definitions are cached, but beyond that my knowledge fails. Maybe if I ever learn it I’ll return to this and explain it.

Edit: After I wrote this I realized that Data.MemoTrie exists; while it has cleaner syntax for memoizing functions (the memoizer doesn’t need to know the types of the arguments), it has a disadvantage in that it’s not immediately obvious how to memoize the types it doesn’t give you. But if you’re just memoizing functions of Ints or something, go ahead and use MemoTrie.