Creating Splunk Alerts (aka 'Saved Searches') from the command line

Splunk Alerts (also called saved searches) are a great way to have Splunk send you data on a scheduled basis, or when certain conditions are met (e.g. a metric crosses a threshold). While these alerts can easily be created in the web UI (by clicking “Save As/Alert” in a search), in many cases it would be nice to do it programmatically. This makes it easy to set up alerts for many individual searches, while keeping everything under source control.

First of all, let’s specify our search. We will put each search in its own text file. As an example, here is a really simple search that counts the number of 404 errors by sourcetype:

index=* | stats count(eval(status="404")) AS count_status BY sourcetype

This is the same search definition as you would enter in the Splunk search app UI. It’s a good idea to test and refine the search interactively until you’re happy with it, then save it in a file called my_search.txt.

The Splunk API

Next we’re going to write some Python to drive the Splunk API. The following code assumes you have a valid auth token in the environment variable SPLUNK_AUTH_TOKEN. Here is a little helper script you can use to set this variable (assuming you have xmllint installed):

#!/bin/sh
#
# login to splunk and set SPLUNK_AUTH_TOKEN
#
# Usage: eval $( ./splunk-login.sh )
#

SPLUNK_HOST="https://splunk.int.corp:8089"

read -p "Username: " USERNAME
read -s -p "Password: " PASSWORD
echo >&2

response=$( curl -s -d "username=${USERNAME}&password=${PASSWORD}" -k ${SPLUNK_HOST}/services/auth/login )

SPLUNK_AUTH_TOKEN=$( echo $response | xmllint --nowarning --xpath '//response/sessionKey/text()' - 2>/dev/null )
if [[ $? -eq 0 ]] ; then
    echo "export SPLUNK_AUTH_TOKEN=${SPLUNK_AUTH_TOKEN}"
else
    echo $response | xmllint --xpath '//response/messages/msg/text()' - >&2
    echo >&2
fi

You’ll also need to install the Splunk SDK for Python. This should be as simple as typing pip install splunk-sdk, depending on how your environment is configured.

Ok, on to the code!

#!/usr/bin/env python

import os
from splunklib import client

splunk_host = os.getenv("SPLUNK_HOSTNAME", "splunk.int.corp")
splunk_token = os.getenv("SPLUNK_AUTH_TOKEN")
splunk_app = 'my_app'

service = client.Service(host = splunk_host, token = splunk_token, app = splunk_app)

Great, we now have a reference to the Splunk API service. So how do we use the SDK to create a saved search? The Splunk API documentation is slightly terrifying. Luckily, we don’t need to worry about the vast majority of the available parameters. Let’s create an alert that runs on a schedule and sends its results to a webhook:

filename = 'my_search.txt'
search_name = os.path.splitext(os.path.basename(filename))[0]
search = open(filename, 'r').read()

params = {
    'actions': 'webhook',
    'action.webhook.param.url': 'http://splunk-webhook-service/',
    'alert_comparator': 'greater than',
    'alert_threshold': '0',
    'alert_type': 'number of events',
    'alert.digest_mode': '0',
    'alert.suppress': '0',
    'cron_schedule': '0 1 * * *',
    'dispatch.earliest_time': '-30d@d',
    'dispatch.latest_time': 'now',
    'display.general.type': 'statistics',
    'display.page.search.mode': 'fast',
    'display.page.search.tab': 'statistics',
    'is_scheduled': '1',
    'request.ui_dispatch_app': splunk_app,
    'request.ui_dispatch_view': 'search'
}

service.saved_searches.create(search_name, search, **params)

And that’s it! Easy.

There are a few things in that dictionary of parameters to the API call that are worth calling attention to.

  • The actions and action.webhook.param.url parameters specify the action to trigger for our alert: in this example, a webhook and the URL that Splunk will POST results to. You could also specify 'actions': 'email' and 'action.email.to' to send the alert as an email instead (*).

  • The cron_schedule parameter specifies when to run the saved search, as a crontab entry. This example is “daily at 1:00 am” (in the Splunk server’s time zone).

  • The dispatch.earliest_time and dispatch.latest_time parameters define the time range over which to run the search (in this case “last 30 days”).

  • The alert.digest_mode and alert_threshold settings cause this alert to send results every time it is invoked, rather than depending on a condition being met. Creating a conditional alert is left as an exercise for the reader…

(*) Note that I have not yet tested this, so I’m not sure which other parameters are required.
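If you want to sanity-check the result, the SDK exposes saved searches as a collection you can query. A quick sketch (untested, reusing the service and search_name objects from above) might look like:

# Sketch only: confirm the alert now exists and is scheduled.
if search_name in service.saved_searches:
    created = service.saved_searches[search_name]
    print("Alert '%s' exists (cron schedule: %s)" % (created.name, created['cron_schedule']))
else:
    print("Alert '%s' was not created" % search_name)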

Fully worked example

The above code is all you actually need, but here’s a slightly expanded example that accepts command line arguments and multiple search files. It also deletes any existing search of the same name (i.e. your new search replaces the old one), and randomises the crontab spec slightly to spread out load on the Splunk server.

#!/usr/bin/env python 
 
import os, argparse, random 
from splunklib import client 
 
splunk_host = os.getenv("SPLUNK_HOSTNAME", "splunk.int.corp") 
splunk_token = os.getenv("SPLUNK_AUTH_TOKEN") 
 
def create_alert(search_name, search, splunk_app, webhook_host): 
    crontab = "%s 1 * * *" % random.randint(1, 59) # pick a random minute during 01:01 - 01:59 
    params = { 
        'actions': 'webhook', 
        'action.webhook.param.url': webhook_host, 
        'alert_comparator': 'greater than', 
        'alert_threshold': '0', 
        'alert_type': 'number of events', 
        'alert.digest_mode': '0', 
        'alert.suppress': '0', 
        'cron_schedule': crontab, 
        'dispatch.earliest_time': '-30d@d', 
        'dispatch.latest_time': 'now', 
        'display.general.type': 'statistics', 
        'display.page.search.mode': 'fast', 
        'display.page.search.tab': 'statistics', 
        'is_scheduled': '1', 
        'request.ui_dispatch_app': splunk_app, 
        'request.ui_dispatch_view': 'search' 
    } 
 
    service = client.Service(host = splunk_host, token = splunk_token, app = splunk_app) 
    savedsearches = service.saved_searches 
    try: 
        savedsearches.delete(search_name) 
        print("Deleted old version of %s" % search_name) 
    except Exception: 
        pass 
 
    savedsearches.create(search_name, search, **params) 
    print("Created alert '%s' to be triggered at %s" % (search_name, crontab)) 
 
 
if __name__ == '__main__': 
    parser = argparse.ArgumentParser(description=""" 
        Create Splunk alerts. Assumes there is a valid auth token in SPLUNK_AUTH_TOKEN 
        (e.g. `eval $(./splunk-login.sh )`). Default Splunk API hostname (splunk.int.corp)  
        can be overridden by setting SPLUNK_HOSTNAME. 
    """) 
    parser.add_argument('files', metavar='file', type=str, nargs='+', help='a file that contains a search definition') 
    parser.add_argument('--webhook_url', nargs='?', dest='webhook_host',  
                        default='https://splunk-webhook-service.int.corp/',
                        help='the URL of the webhook the alert will be configured to POST to') 
    parser.add_argument('--splunk_app', nargs='?', dest='splunk_app', 
                        default='search',  
                        help='the name of the Splunk app that will own the search (defaults to "search")') 
 
    args = parser.parse_args() 
 
    for f in args.files: 
        name = os.path.splitext(os.path.basename(f))[0] 
        search = open(f, 'r').read() 
        create_alert(name, search, args.splunk_app, args.webhook_host) 
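The post doesn’t give this script a name, but assuming you save it as create-alerts.py, usage looks something like this (the searches/ directory is just an example):

$ eval $( ./splunk-login.sh )
$ ./create-alerts.py --splunk_app my_app searches/*.txt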

Conclusion

So despite the somewhat lacking official documentation, creating Splunk saved searches is actually pretty straightforward. Thanks to Alexander Leonov for this post that got me headed in the right direction: https://avleonov.com/2019/01/17/creating-splunk-alerts-using-api/

Managing a Splunk Dashboard in Git

Splunk dashboards are great! There’s all sorts of useful insight you can gain from your system logs, and graphing them in convenient information “radiators” makes these insights more visible to your team. They do involve a lot of pointing-and-clicking though, and it makes me pretty nervous that all that fiddly work is sitting unmanaged in a web UI.

Luckily, you don’t have to live that way. Splunk provides a comprehensive REST API for many search- and data-related actions. It’s kind of buried in the documentation under “Knowledge”, but the API contains an endpoint that allows you to GET and POST the XML definition of a dashboard:

https://<host>:<mPort>/servicesNS/{user}/{app_name}/data/ui/views/{dashboard_name}

(Note that mPort may be different to the port you use to access the Splunk UI in a web browser. By default the Splunk API uses port 8089.)

GETting the XML source for a dashboard works great, but I have found it to be a little picky about the XML you try to POST back to Splunk. In particular, you’re going to be sending XML as “form-encoded” data, so you need to be careful about escaping certain characters. In many cases, you won’t be able to just upload the XML exactly as you downloaded it. More on this later.

Authentication

Splunk uses token-based authentication, which is preferable to hard coding a username and password in your script, but it does mean there’s an extra step involved.

To keep things as secure as possible, I recommend prompting the user for a username/password interactively and never storing them on disk (or in your shell history). For example, this script will output a token that you can eval in your shell to set an ENV variable subsequent scripts can use:

#!/bin/sh
#
# login to splunk and set SPLUNK_AUTH_TOKEN
#
# Usage: eval $( ./splunk-login.sh )

SPLUNK_HOST="https://splunk:8089"

read -p "Username: " USERNAME
read -s -p "Password: " PASSWORD
echo >&2

response=$( curl -s -d "username=${USERNAME}&password=${PASSWORD}" -k ${SPLUNK_HOST}/services/auth/login )

SPLUNK_AUTH_TOKEN=$( echo $response | xmllint --nowarning --xpath '//response/sessionKey/text()' - 2>/dev/null )
if [[ $? -eq 0 ]] ; then
    echo "export SPLUNK_AUTH_TOKEN=${SPLUNK_AUTH_TOKEN}"
else
    echo $response | xmllint --xpath '//response/messages/msg/text()' - >&2
    echo >&2
fi

We can then pass that to further API requests as an HTTP header:

curl -H "Authorization: Splunk ${SPLUNK_AUTH_TOKEN}" \
  https://splunk:8089/servicesNS/username/app_name/data/ui/views?output_mode=json

Getting dashboard XML

Ok, great, we can now interact with the REST API. You probably don’t want to be building your dashboard from scratch in XML, so I’ll assume you have set up something in the Splunk UI. The first thing we will do is get its source XML from the API. This script will do that, and send the result to stdout:

#!/bin/sh
#
# Get the XML source of a Splunk dashboard
#

if [[ -z "$1" ]] ; then
    echo "Usage: $0 <dashboard_name>"
    exit 1
fi

SPLUNK_HOST="https://splunk:8089"

DASHBOARD_AUTHOR="username"
DASHBOARD_APP="app_name"
DASHBOARD_NAME="$1"

if [[ -z $SPLUNK_AUTH_TOKEN ]] ; then
    eval $( `dirname $0`/splunk-login.sh )
fi

curl -s \
     -H "Authorization: Splunk ${SPLUNK_AUTH_TOKEN}" \
     -k \
     "${SPLUNK_HOST}/servicesNS/${DASHBOARD_AUTHOR}/${DASHBOARD_APP}/data/ui/views/${DASHBOARD_NAME}?output_mode=json" \
    | jq --raw-output '.entry[].content | .["eai:data"]' \
    | xmllint --format -

We call the script like this:

$ ./get-dashboard-source.sh my-dashboard > my-dashboard.xml

(Make sure you set the variables for Splunk host, dashboard author, and app name appropriately.)

Now we have our dashboard definition in a file that we can commit to git! I feel safer already.

Uploading dashboard XML to Splunk

What if we want to make changes to the dashboard? We can just go point and click around in the web UI as usual, then download and commit another version. That’s better than nothing. But for many things (changing search strings, tweaking dashboard panel settings, even duplicating and rearranging panels) it’s far easier to edit the XML directly.

So go ahead and open the XML file in your favourite editor and tweak away.

When you’re happy, this script will upload the XML to Splunk and update the dashboard definition:

#!/bin/sh
#
# Update a Splunk dashboard with new source XML
#

if [ -z "$1" -o -z "$2" ] ; then
    echo "Usage: $0 <dashboard_name> <dashboard_source.xml>"
    exit 1
fi

SPLUNK_HOST="https://splunk:8089"

DASHBOARD_AUTHOR="username"
DASHBOARD_APP="app_name"
DASHBOARD_NAME="$1"
DASHBOARD_SOURCE="$2"

if [[ -z $SPLUNK_AUTH_TOKEN ]] ; then
    eval $( `dirname $0`/splunk-login.sh )
fi

data_file=$(mktemp)
cat <<EOF > $data_file
eai:data=$( sed -e 's/%/%25/g' -e 's/&gt;/>/g' -e 's/&lt;/</g' < ${DASHBOARD_SOURCE} | tr -d '\n' )
EOF

curl  \
     -H "Authorization: Splunk ${SPLUNK_AUTH_TOKEN}" \
     -X POST \
     --data @${data_file} \
     -k "${SPLUNK_HOST}/servicesNS/${DASHBOARD_AUTHOR}/${DASHBOARD_APP}/data/ui/views/${DASHBOARD_NAME}"

This is all pretty straightforward, but you might want to take a close look at the eai:data= line inside the here-document (the one with the sed pipeline), which has a lot going on. Reading from right to left, we are:

  • removing newline characters
  • from the source XML file
  • expanding “greater than” and “less than” entities to > and < characters
  • URL encoding literal % characters
  • and putting the lot in a field called eai:data

This is all to deal with the fact that Splunk expects us to send XML as form-encoded data in the POST request. There may be other characters I haven’t noticed yet that trip up this encoding/decoding process, but those three (>, <, and %) are particularly common in Splunk dashboard XML and will definitely cause problems.

To use it:

$ ./set-dashboard-source.sh my-dashboard my-dashboard.xml

Et voilà!
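
If the shell quoting and entity munging gets too fiddly, another option is to let an HTTP library do the form-encoding for you. Here is a sketch (which I haven’t tested against a real Splunk instance) using Python’s requests module instead of curl; the hostname, user, app, and dashboard names are placeholders:

#!/usr/bin/env python
# Sketch: update a dashboard by POSTing its XML source, letting `requests`
# form-encode the eai:data field. Assumes SPLUNK_AUTH_TOKEN is set as above.

import os
import requests

splunk_host = "https://splunk:8089"
url = "%s/servicesNS/username/app_name/data/ui/views/my-dashboard" % splunk_host

with open("my-dashboard.xml") as f:
    xml = f.read()

response = requests.post(
    url,
    headers={"Authorization": "Splunk %s" % os.environ["SPLUNK_AUTH_TOKEN"]},
    data={"eai:data": xml},  # URL-encoded automatically, including % characters
    verify=False)            # equivalent to curl -k
response.raise_for_status()

Because the library URL-encodes the body properly, the literal % characters stop being a problem; whether you still expand the &gt;/&lt; entities first is up to you.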

Next steps

These are the basic building blocks, and enough to implement a manual version control workflow for Splunk dashboards. But it sure would be nice if it was all a bit more seamless.

A couple of obvious bits of further automation would be:

  • a git hook to upload dashboard XML on commit/push
  • a script to ensure the live dashboard matches what’s in git before making local changes
  • some sort of validation of the XML before uploading (but the API will just refuse to accept invalid XML, so this isn’t really necessary)

Getting app store ratings without authentication

Both the Apple iTunes and Google Play app stores provide extensive APIs for accessing all sorts of metrics and analytics about your apps. This is great, but they require authentication–which is a bit of a pain when you’re trying to automate.

So I went looking for a quick and dirty way to get current app ratings from public endpoints.

iTunes Connect

As it turns out, Apple provide a “real” API for this:

$ curl -s 'https://itunes.apple.com/au/lookup?id=<my-app-id>' | jq '.results[].averageUserRating'

Thanks to user chaoscoder on Stack Overflow for this answer, and also to alberto-m for the “country” tip in this reply, for setting me in the right direction.

BIG FAT CAVEAT: Specifying a country implies that you are only getting the ratings for that country. As far as I can tell, there is no unauthenticated way to get the aggregate rating for all countries (other than enumerating all countries in which your app is for sale, and getting them one by one).
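
If you do need more than one country, the brute-force enumeration might look something like this sketch (the country codes and app id are placeholders, and it uses the same Python 2-style urllib as the script below):

# Sketch: fetch the rating from each storefront, one country at a time.
import json, urllib

app_id = "my-app-id"
for country in ["au", "us", "gb", "nz"]:
    body = urllib.urlopen("https://itunes.apple.com/%s/lookup?id=%s" % (country, app_id)).read()
    results = json.loads(body).get("results", [])
    if results:
        print("%s: %s" % (country, results[0].get("averageUserRating")))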

Google Play

Sadly, Google do not seem to provide a similar public API. They do make the rating available on the app’s store page, however, and it appears to be wrapped in a div tag with a distinct aria-label attribute value:

<div class="BHMmbe" aria-label="Rated 3.5 stars out of five stars">3.5</div>

We can use this fact, and the fact that the page is well-formed HTML, to apply an XPath query to extract the rating:

$ curl -s 'https://play.google.com/store/apps/details?id=<my-app-id>&hl=en' | \
      xmllint --nowarning --html --xpath '//div[starts-with(@aria-label, "Rated")]/text()' - 2>/dev/null

EVEN BIGGER FATTER CAVEAT: this is super fragile and makes all sorts of assumptions about how the Play store renders its web pages, is completely unauthorised by Google, and could break at any time. YMMV etc.

Wrapping it up in a script

We can put all that together in a bit of Python that can be run as a scheduled job, and maybe ship its logs to Splunk for later analysis:

#!/usr/bin/env python

import sys, urllib, json, logging
from lxml import etree

apple_app_id = "my-app-id"
google_app_id = "my-app-id"

def setup_custom_logger(name):
    formatter = logging.Formatter(fmt="%(asctime)s [%(levelname)s] %(pathname)s %(message)s", datefmt="%Y-%m-%d %H:%M:%S")
    handler = logging.StreamHandler(stream=sys.stdout)
    handler.setFormatter(formatter)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger

def get_itunes_rating(app_id):
    # read the body up front; json.loads needs a string, not the file-like
    # object that urlopen returns
    body = urllib.urlopen("https://itunes.apple.com/au/lookup?id=%s" % app_id).read()
    try:
        data = json.loads(body)
        return data["results"][0]["averageUserRating"]
    except (TypeError, ValueError, KeyError, IndexError):
        logger.warn("Response did not contain expected JSON: %s" % body)
        return ""

def get_google_play_rating(app_id):
    # read the body up front; etree.HTML expects markup text, not a file object
    body = urllib.urlopen("https://play.google.com/store/apps/details?id=%s&hl=en" % app_id).read()
    try:
        data = etree.HTML(body)
        return data.xpath("//div[starts-with(@aria-label, \"Rated\")]/text()")[0]
    except (ValueError, IndexError):
        logger.warn("Response did not contain expected XML: %s" % body)
        return ""

if __name__ == '__main__':
    logger = setup_custom_logger("get-app-ratings.py")
    logger.info("apple_store_rating=%s" % get_itunes_rating(apple_app_id))
    logger.info("google_store_rating=%s" % get_google_play_rating(google_app_id))

which produces output like:

$ ./get-app-ratings.py
2019-04-08 17:19:57 [INFO] ./get-app-ratings.py apple_store_rating=3.5
2019-04-08 17:19:58 [INFO] ./get-app-ratings.py google_store_rating=3.5

Getting that stdout output into Splunk is left as an exercise for the reader…

DevOps Talks Conference 2019 Melbourne

I recently started in a new role in a large financial services organisation, with a mission to establish a Site Reliability Engineering practice. Due to some fortunate timing, one of the first things I did was attend the DevOps Talks Conference, a two-day conference that “brings together leaders, engineers and architects who are implementing DevOps in start ups and in enterprise companies”. While this conference was obviously primarily about the principles and practices of DevOps, there is a good deal of overlap with the ideas of Site Reliability Engineering.

Below are a few notable highlights from the two days, as seen through the lens of someone who is trying to implement SRE in a traditional IT organisation.

Jennifer Petoff - Google SRE Program Manager

Jennifer described SRE at Google and its key motivators and principles, then talked about what to focus on to implement SRE in your organisation. Key messages:

  • Start with SLOs and consequences (aka error budgets)
  • Push back on the dev teams when necessary to drive improvements in reliability

Matty Stratton - PagerDuty

A good post mortem raises more questions than it answers. With today’s complex distributed systems, there is unlikely to be a singular “root cause” for any given incident. A post mortem/PIR should tell a story, and lead you to learn more about your systems and how they interact.

Nathan Harvey - Google

Some metrics to think about when trying to measure application stability:

  • Mean Time to Recovery (MTTR)
  • Change failure rate

Change failure rate is the percentage of “changes” (i.e. releases/deployments) that “fail”—lead to production incidents or have to be rolled back. The idea is that by working to keep this low you can be more confident in your releases and move faster.

“You have to be safe to move fast… and you have to move fast to be safe.”

Meaning you can move faster when you’re confident in your testing, etc., but by the same token being able to quickly move a change through your pipeline to production is “safer”—it makes it far easier to roll out fixes.

John Willis

Organisational change is doomed to fail if you impose it from above. The people who have to implement and are most affected by the change need to be involved in designing the New Way.

There is a yawning chasm between legacy IT management (CABs, work queues, service tickets, etc.) and agile/cloud/devops teams. Breaking down tribal knowledge and creating institutional knowledge is critical to succeeding with a devops/SRE culture.

You can get rid of your CAB! — Change management processes implement “subjective attestation”, but we can achieve the same ends (confidence in the integrity of our production systems) by using “objective attestation”: automated pipelines, cryptographically authenticated control of changes, etc. See also “DevSecOps” and Mark Angrish’s talk on governance at Kubecon last year.

Mark Angrish - ANZ

Ground swell of support in the technology community is important, but change (to devops, etc.) starts with senior leadership, including the CEO.

Funding models are also critical. Project-based funding where teams are disbanded when they are “finished” makes it really hard to adopt a true devops/SRE culture.

Lindsay Holmwood - Envato

This was a really interesting talk, and contained basically no technical content. Lindsay talked about organisational design—how to understand it, and how it relates to technology innovation and architecture—value chain mapping, and complexity theory. A few take-aways:

  • There’s lots of existing research about “organisation culture” (sociology, anthropology). We should pay attention to it.
  • DevOps is great at introspection, but we need to look outside our own domain. There’s lots to learn from other disciplines.
  • You need to rearchitect your org structure to adopt new technology. Your existing org structure mirrors your product/tech architecture. If you want to innovate/adopt new technologies and architectures, you need to change your organisation to enable this. (See also: Conway’s Law.)

And if you want to learn more, Lindsay shared plenty of interesting pointers for further reading and research.

Phew!

Summary

Overall it was a worthwhile couple of days, and gave me a few things to think about and leads to follow… and a bit of reassurance that SRE is possible in a large “enterprise” organisation. People have done this before!

The organisers promised to post videos of the talks in the coming days. Check back at https://devopstalks.com/2019_Melbourne/index.html.

Static Analysis—what's it good for?

Let’s face it, writing software is hard. And frankly we humans suck at it. We need all the help we can get. Our industry has developed many tools and techniques over the years to provide “safety rails”, from the invention of the macro assembler through to sophisticated integration and automated testing frameworks. But somewhere along the way the idea of static analysis went out of favour.

I’m here to convince you that static analysis tools still have a place in modern software engineering.

Note that I am avoiding the word “metrics” here. That is how a lot of people think of these tools, and the whole idea of “measuring” a developer’s work rightly has a terrible reputation (“what does that number even mean?”). Static analysis is simply about providing information that a human can use to learn more about a code base.

What is static analysis?

Simply, it’s a tool that analyses your code “at rest”. That is, it inspects the source code (or in some cases object code) statically, not in a running environment. (Dynamic analysis of running systems, such as with memory profilers like yourkit and valgrind, is a whole other topic.)

Why is it so unpopular?

One of our developers recently made the following comment in an internal chat channel:

After seeing some refactoring that people have done to satisfy static code quality analysis tools, I question their value.

This is a common response to the use of such tools, and perfectly reasonable. But it misses the point. Of course static analysis can be misused, but that doesn’t mean it has to be.

Another common complaint is “but these tools can’t replace the eye of an experienced developer!”. No, they can’t. But they can help focus that experienced eye where it is most needed.

So what is it good for?

  • Early warning of problems

    By scanning the daily report for issues in recently introduced code, tech leads and senior developers can talk with the developers involved. Together they can work out better approaches before the problematic code becomes ossified in the code base and inevitably replicated by copy/paste.

  • Identifying “hot spots” in the code that warrant further attention

  • Overall sense of the “health” of a codebase

  • See trends over time

    Time series graphs can give you a good overview of how your code base is growing and changing. Steadily increasing LoC in a mature system might be a sign that it’s time to factor out a submodule. Or maybe increasing complexity indicates too much pressure to rush out features without enough consideration for design.

The static analysis tool itself is not going to tell you any of these things, but it might suggest places to look for potential trouble.

What is it NOT good for?

  • Gating check-ins/failing builds

    This usually just leads to “gaming the system” or poor refactoring to “get around” the rules, which helps no one. Static analyses are information that needs to be interpreted by people, not an automatic way to prevent bad code being committed.

  • Measuring developers’ performance

    Hopefully I don’t need to explain why this is a terrible idea. The output of these tools is the start of a conversation, and should certainly never be used against people or teams.

How should I use my analysis tools?

  • Daily report to tech lead - new issues

    Tech leads can review a daily report as a starting point for conversations with developers. E.g. “I see you committed this method with a lot of nested ifs… have you considered doing it this way instead?”

  • High level graphs over time - complexity, etc.

    Dashboard of health. Are we getting worse? Should we be putting more effort into refactoring and clean up?

  • Predicting “cost of change”

    If a code base has a high complexity–relative to others in your org, or its own past state–the cost of change is likely to be higher. This can be useful information when estimating/predicting future effort.

  • Enforcing style guides

    This is a bit more controversial, and not really the kind of analysis I am talking about… but there is an argument to be made for using tools like checkstyle and rubocop to enforce your local style conventions. If nothing else, it makes arguments about brace position, white space, etc. moot.

Quality is a people problem

“It is impossible to do a true control test in software development, but I feel the success that we have had with code analysis has been clear enough that I will say plainly it is irresponsible to not use it.” – John Carmack, In-Depth: Static Code Analysis

No tool is going to be a silver bullet. Software quality is and always has been primarily a “people problem”. Tools can help, but they cannot automatically fix all your problems and enforce all your “rules”. They simply provide information that can help people focus on the areas most needing attention, and highlight potential problems that might otherwise have been missed.

Static analysis tools (aka “quality metrics”) can be a useful way to gain more insight into your code and identify areas that need more attention.

(This article was originally published on the REA Tech Blog.)

A Clojure library to authenticate with LDAP

My employer has released a small Clojure library I wrote that allows you to easily authenticate users against an LDAP server:

https://github.com/realestate-com-au/clj-ldap-auth.

It uses the UnboundID LDAP SDK for Java to look up a user name in an LDAP server and attempt to bind with specified credentials.

The simplest usage looks like:

(require '[clj-ldap-auth.ldap :as ldap])

(if (ldap/bind? username password)
  (do something-great)
  (unauthorised))

That works, but isn’t very helpful when authentication fails. So you can also pass a function that will be called with a diagnostic message in the event that authentication fails:

(let [reason (atom nil)]
  (if (ldap/bind? username password #(reset! reason %1))
    (do something-great)
    (unauthorised @reason)))

The provided function should take a single argument, which will be a string.

Configuration of the library (i.e. the ldap server to connect to, etc.) is via system properties. See the README for details.

Implementation

The library first establishes a connection to the server, optionally using SSL. If a bind-dn is configured (i.e. credentials with which to connect to the LDAP server), it is used to bind to the server. If that’s successful, we then look up the provided username (in the attribute uid). If found, the entry’s distinguished name (DN) is extracted and this DN and the provided password are used to bind a new connection.

If any of these steps fail (e.g. the bind-dn is unauthorised, the username can’t be found, or the looked-up DN and password can’t bind) the function returns false (and calls the provided sink function to say why). If everything works and the connection can be bound with the target DN and password, it returns true (and the sink function is not called).

Limitations

It would probably be useful to be able to specify what attribute(s) to use for looking up the username, but for now it is hard coded to uid. Also, current test coverage (using midje) is minimal. UnboundID provide an in-memory LDAP server implementation, which could probably be used to build some fast-running integration tests.

Do you have anything to declare, sir?

One of the cornerstones of modern software engineering is the dependency management system. Think Bundler, Leiningen, or (forgive me) Maven. We stand on the shoulders of giants when we write our apps, and we need a way of specifying which giants. Modern systems like RubyGems are pretty good at this. But not perfect.

I have a dream

I have a simple dream. All I want is this: I want to be able to check out your project on any computer, install the appropriate language runtime and dependency manager, type make (or rake, or lein, or ./build.sh, or …), and have a running system.

None of this is new. Joel said it, Twelve Factor App says it. But surprisingly few people seem to actually do it.

Undeclared dependencies are the root of all evil

It’s very easy as a developer to introduce dependencies into your project without even realising. Our workstations get all sorts of cruft installed on them over time, and the chances are something lying around fulfills an undeclared transitive dependency for the library you just installed. But the next developer may not be so lucky.

I don’t want to have to find out by trial and error what native libraries your code depends on. I don’t want to belatedly discover some assumptions you made about what else would be running on my computer. I just want to type make.

So what’s the point?

Just this:

Your project’s build system has the responsibility to install any library that is required to support your code.

Hopefully most of these will be taken care of by your dependency management system. But for those that aren’t (e.g. native libraries that are required by Ruby Gems) your build system needs to make sure they are installed.

A simple polling function in Clojure

One of my projects at work is to build an internal web service around AWS to support our internal tooling. (This led to the development of my clj-aws-ec2 library.)

The web service needs “integration” tests that exercise its RESTful API to manipulate AWS resources (i.e. create instances, add tags, etc.). This sort of testing is fraught for many reasons and should be kept to a minimum, but it does provide a bit of an assurance that the service will actually respond to its published interface when deployed.

One of the reasons this sort of testing is fraught is that it depends on an external service that is beyond our control (i.e. AWS). Many things can go wrong when talking to AWS, and everything takes time. So my test needs to invoke the service to perform an action, then wait until the expected state is achieved (or a timer elapses causing the test to fail). What I’d like to be able to write is something like:

(deftest ^:integration instance-lifecycle
  (testing "create instance"
 
    (def result (POST "/instances" (with-principal {:name "rea-ec2-tests/int-test-micro", :instance-type "t1.micro"})))
    (has-status result 200)
 
    (let [id (first (:body result))]
      (prn (str "Created instance " id))
 
      (testing "get instance"
        (has-status (GET (str "/instances/" id)) 200)
        (is (wait-for-instance-state id "running")))
 
      (testing "stop instance"
        (has-status (PUT (str "/instances/" id "/stop")) 200)
        (is (wait-for-instance-state id "stopped")))
 
      (testing "start instance"
        (has-status (PUT (str "/instances/" id "/start")) 200)
        (is (wait-for-instance-state id "running")))
 
      (testing "delete instance"
        (has-status (DELETE (str "/instances/" id)) 200)
        (is (wait-for-instance-state id "terminated"))))))

But how do you write a polling loop in Clojure? A bit of clicking around on Google led me to a function written by Chas Emerick for his bandalore library:

;; https://github.com/cemerick/bandalore/blob/master/src/main/clojure/cemerick/bandalore.clj#L124
(defn polling-receive
  [client queue-url & {:keys [period max-wait]
                       :or {period 500
                            max-wait 5000}
                       :as receive-opts}]
  (let [waiting (atom 0)
        receive-opts (mapcat identity receive-opts)
        message-seq (fn message-seq []
                      (lazy-seq
                        (if-let [msgs (seq (apply receive client queue-url receive-opts))]
                          (do
                            (reset! waiting 0)
                            (concat msgs (message-seq)))
                          (do
                            (when (<= (swap! waiting + period) max-wait)
                              (Thread/sleep period)
                              (message-seq))))))]
    (message-seq)))

That seems pretty close! I generalised it a bit to remove dependencies on Chas’s messaging routines and just take a predicate function:
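
Something along these lines does the trick. This is a sketch of that generalisation rather than the exact gist; like Chas’s original it uses an atom and an inner polling function, both of which the updates below later simplify away:

(defn wait-for
  "Poll predicate every period ms until it returns truthy, giving up after
   roughly timeout ms. Returns the predicate's result, or nil on timeout."
  [predicate & {:keys [period timeout] :or {period 500 timeout 60000}}] ; defaults are arbitrary
  (let [waiting (atom 0)
        poller (fn poller []
                 (if-let [result (predicate)]
                   result
                   (when (<= (swap! waiting + period) timeout)
                     (Thread/sleep period)
                     (poller))))]
    (poller)))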

Finally, a couple of helper functions to tie it all together and enable the tests to be written as above:

(defn get-instance-state [id] (:state (:body (GET (str "/instances/" id)))))
(defn wait-for-instance-state [id state] (wait-for #(= (get-instance-state id) state)))

There’s a couple of improvements that could be made to wait-for, the most obvious being to use a “wall clock” for the timeout. The current implementation will actually wait for timeout + (time-to-evaluate-predicate * number-of-invocations) which is probably not what you want, especially when the predicate could take a non-trivial amount of time to evaluate because it is invoking an external service.

Comments and improvements welcome!

UPDATE: My colleague Eric Entzel pointed out that there is no need to use an atom to store and update the “waiting” counter, its state can just be passed around with function invocations (and recursion). The above gist has been simplified to reflect this observation.

UPDATE: Even better, when I went to implement the “wall clock” timeout, I realised there is no need to maintain any state at all, since the absolute timeout time can be calculated up front and compared to the system clock on each evaluation. (I also flipped the timeout test and the sleep, to more accurately reflect the intent of a timeout.) Gist updated again.

UPDATE: And finally, Adam Fitzpatrick noticed that there’s no longer any need to let bind the poller function to a symbol, we can just put its contents in the main function body. Gist updated again.

New release of clj-aws-ec2

I have released version 0.2.0 of clj-aws-ec2. This version contains no changes from 0.1.11. I’m just trying to adhere more closely to semantic versioning, having been fairly slack about it so far.

This version does however contain many changes since I last mentioned it here. It can now describe, create and delete tags on resources, and create and deregister images (AMIs).

I consider this more or less “feature complete” for my current purposes. Of course, it only covers a very small fraction of the available EC2 SDK but hopefully it is on the right side of the 80/20 rule. :-) I am open to feature requests—or even better pull requests—for further elements of the API that you would like to see supported.

Introducing clj-aws-ec2

We use Amazon’s AWS quite heavily at work, and part of my job involves building internal tools that wrap the public AWS API to provide customised internal services.

I am building some of these tools in Clojure, and I needed a way to call the Amazon API. Amazon provide a Java SDK so it’s a fairly simple matter to wrap this in Clojure. In fact James Reeves had already done so for the S3 API. So I took his good work and adapted it to work with the EC2 components of the API:

https://github.com/mrowe/clj-aws-ec2

The library tries to stay true to Amazon’s official Java SDK, but with an idiomatic Clojure flavour. In particular, it accepts and returns pure Clojure data structures (seqs of maps mostly). For example:

user=> (require '[aws.sdk.ec2 :as ec2])
user=> (def cred {:access-key "..." :secret-key "..."})
user=> (ec2/describe-instances cred (ec2/instance-id-filter "i-b3385c89"))

({:instances
    ({:id "i-b3385c89",
      :state {:name "running",
              :code 272},
      :type "t1.micro",
      :placement {:availability-zone "ap-southeast-2a",
                  :group-name "",
                  :tenancy "default"}, 
      :tags {:node-name "tockle",
             :name "mrowe/tockle",
             :environment "mrowe"},
      :image "ami-df8611e5",
      :launch-time #<Date Tue Nov 13 08:23:09 EST 2012>}),
  :group-names (),
  :groups ({:id "sg-338f1909", :name "quicklaunch-1"})})

This is still a work in progress. So far, you can describe instances and images, and stop and start EBS-backed instances. I plan to work on adding create/terminate instances next.

UPDATE: I just released v0.1.6 which includes run_instance and terminate_instance support.