Amateur Topologist

Everything but topology.

dissociated-blogosphere: never have to write an original post again!

For the past two weeks or so, I’ve been working off and on on a project called dissociated blogosphere (OSX and Linux binaries here). It takes a bunch of URLs, looks through them for an RSS for the raw content of the posts, and then stores the words of the posts in an array. It then picks N random, consecutive words (where in this case N is 2), and starts generating new text, by picking a new word x% of the time if x% of the time, the previous N words were followed by that word. For example, if 90% of the time, the words ‘the quick’ were followed by ‘brown’, and the other 10% of the time, they were followed by ‘red’, then when the two-word phrase ‘the quick’ was randomly generated, it would pick ‘brown’ 9 times out of 10, and ‘red’ 1 time out of 10. This is the algorithm Emacs‘s dissociated press feature uses, hence the name. Running it a few times on this site and picking some of my favorite sentences gives:

Second, I ignored the axes of the work you envision. So start small, and think about the free group on two generators, which is obviously highly undesirable behavior. However, it does have the web interface, I’ll have it up by last week, but that obviously didnt happen. Taking into account the fact that I’m using. The central thing that makes MS Paint Adventures unique to the point where it’s my go-to language for random programs (I still use Python for that), but if we pick two of them and rotate one of the set of all rotations that you have some custom function you want soup or salad, both is not a valid answer.

It’s my first medium-scale project written in Haskell (even though there isn’t a lot of code, what little was there was not trivial to write), and I’ve learned several lessons from it:

  • The Haskell wiki is an excellent resource. When I was trying to learn how to use HXT, the Haskell XML Toolbox, I found the provided documentation somewhat inadequate. But the HXT article on the Haskell wiki is an excellent introduction to the filter abstraction, which is all that I need for the basic stuff that I’m using.
  • Read the Haddock documentation. The HXT article, as useful as it was, didn’t cover a couple essential things I needed to know (such as how to pull all elements with type “application/rss+xml”). So I look at the documentation for Text.XML.HXT.Arrow.XmlArrow (the module containing the arrows that HXT uses to filter XML), and saw that hasAttrValue :: String -> (String -> Bool) -> a XmlTree XmlTree looks about right; from the type, I can guess (correctly) that I need to pass it the attribute and a prediate on the value of the attribute (i.e., hasAttrValue "href" (== "application/rss+xml")).
  • One goal at a time. This isn’t specific to Haskell. When I started on this, I meant for it to require you to provide the RSS feed. Then, I realized that having a larger corpus might be better, so I added the ability to pull from multiple feeds. Then I decided that expecting people to find the RSS feed by hand might be a bit much, so I rewrote it to pull the RSS feed from the site. And I eventually plan to write a CGI frontend so that you can just run it online. If I had decided from the start to do all these things, I probably never would’ve gotten started. As Linus Torvalds said:

    Nobody should start to undertake a large project. You start with a small trivial project, and you should never expect it to get large. If you do, you’ll just overdesign and generally think it is more important than it likely is at that stage. Or worse, you might be scared away by the sheer size of the work you envision. So start small, and think about the details. Don’t think about some big picture and fancy design. If it doesn’t solve some fairly immediate need, it’s almost certainly over-designed.

  • Strip and gzip your executables if you’re going to distribute them. Due to the fact that I’m statically linking in HXT, which is a sizeable library, the compiled, non-stripped version of dissociated-blogosphere is a whopping 12 megabytes. This isn’t due to inefficiencies in my own code, but due to the sheer size of the HXT library. Running the Unix command line utility strip (which only removes internal debugging information) cuts it down to about 5 MB, and then gzipping the binaries takes it down to a little over a megabyte.
  • Split things into libraries where it’s appropriate. Part of the problem with using HXT is that it makes recompilation slow; if I could do it all over again, I might have used HaXml, but HXT has the advantage of having nontrivial amounts of documentation written about it (on the Haskell wiki). If I had instead split the RSS parsing code into its own library, I could have only recompiled those parts whenever I touched them, which wasn’t nearly as often as I touched the code frontend. Plus, it’s just good programming practice.

So what do I have planned for dissociated-blogosphere? First off, I plan to make it faster by caching RSS lookups; by storing a map from page URLs to RSS feed, I can cut the number of network requests in half. Second, I plan to implement actual error handling; right now if you give it a bad URL it fails and doesn’t produce any useful output, regardless of whether other URLs are good. Third, I’m going to split out the RSS part into its own library, which I might make its own package on hackage. Fourth, I intend to eventually write a web interface (either in Haskell or in Python) so that you don’t have to download and install it. I originally intended to have the web interface up by last week, but that obviously didn’t happen. Taking into account the fact that it’ll take longer than I think it does, I’m guessing I’ll have it up by two weeks from now (so, a month). And finally, when/if I do the web interface, I’ll have it color the text according to which blog it’s from, or maybe even output xterm color codes if I don’t write the web interface.

Leave a Reply