Site scans: a RESTful case study

I’ve been thinking a lot recently about REST, resource design and addressability, so I was interested to read an article by Troy Hunt on the particular challenges of creating URLs that refer to other URLs.

Scenario

The scenario Troy describes is for a site that will perform a security scan of your website, and then show a report of the result. Addressability is an important concern, as once you have the results of the scan, you may want to forward them to other people.

Troy discusses two methods for including the URL to scan:

  1. Include it in the hierarchical part of the URL (i.e., the part after http: and before ?, e.g. http://asafaweb.com/scan/troyhunt.com/Search)
  2. Include it in the query string, e.g. http://asafaweb.com/scan?url=troyhunt.com/Search.

He favours the second approach for practical reasons.

A RESTful approach

Troy’s article approaches this question from the point of view of addressability, rather than resource design, and makes a sensible recommendation given its basic premises. However, the scenario he outlines presents a good opportunity for a bit of resource design analysis, which can lead us to an even better answer.

First then, let’s think about the type of resource we’re dealing with.

I think it is fair to make a few assumptions about the scan resource and the application that generates it:

  • It will take some time to perform the scan;
  • It will take a fair amount of computing resources to perform the scan;
  • The scan will be accurate for the point in time at which it was created; a subsequent scan of the same URL may generate a different result;
  • It may be interesting to compare scans over time.

From these assumptions we can draw a few conclusions:

  • A scan is an expensive resource to create, and will have a series of different statuses over its lifetime; this means we are looking at a transactional resource model here;
  • As a URL can be scanned more than once, it is not on its own a sufficient identifier of any scan.

If we follow these conclusions, we can make a sketch of the process of performing a scan:

1. Trigger the scan

We make a POST request with the details of the scan we want:

POST http://asafaweb.com/scans
{ "url": "http://troyhunt.com/Search", "options": {…} }

Note that we are POSTing to /scans; this request will create a resource subordinate to that collection. Also, as we are making a POST request, we can include further information about the scan we want, perhaps indicating which criteria we are interested in and what level of detail we require; I have indicated this possibility by including an options parameter.

The server responds not by showing us the results of the scan (they haven’t been produced yet), but by telling us where to look for them:

201 Created
Location: http://asafaweb.com/scans/{xyz}
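As a sketch, the client side of this step might look like the following. This is Python using only the standard library; the /scans path and the url and options fields are the hypothetical ones from the example above, not anything from a real API:

```python
import json

def build_scan_request(target_url, options=None):
    """Build the method, path and JSON body for triggering a scan."""
    body = {"url": target_url, "options": options or {}}
    return ("POST", "/scans", json.dumps(body))

def scan_location(status_code, headers):
    """After a 201 Created, the new scan's address is in the Location header."""
    if status_code != 201:
        raise RuntimeError("scan was not created: HTTP %d" % status_code)
    return headers["Location"]

method, path, body = build_scan_request("http://troyhunt.com/Search")
location = scan_location(201, {"Location": "http://asafaweb.com/scans/f72bw8"})
```

The point of the sketch is the shape of the exchange: the request body carries the scan details, and the address of the newly created resource comes back in the Location header rather than in the body.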

2. Check the scan URL

We can go and check this URL straight away by performing a GET:

GET http://asafaweb.com/scans/{xyz}

As the scan is still running, we don’t see the results, but rather a representation of a scan that is still in progress:

200 OK 
{ "scan": {
  "status": "in progress", 
  "url": "http://troyhunt.com/Search", 
  "created": "2011-09-18 11:57:00.000", 
  "options": {…} 
} }

Indeed, if there is a large volume of requests for scans, Troy may have to implement a queueing system, and our scan may have a status of "queued" until it can be processed; we could even cancel a scan by PUTting a representation with a status of "cancelled" to its URL, or perhaps simply by issuing a DELETE request.
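That gives the scan resource a small lifecycle, which can be sketched as a transition table. The status names are the ones used in this article; the allowed transitions are my assumption, not anything from a real implementation:

```python
# Allowed status transitions for a scan resource (hypothetical).
TRANSITIONS = {
    "queued":      {"in progress", "cancelled"},
    "in progress": {"complete", "cancelled"},
    "complete":    set(),   # terminal
    "cancelled":   set(),   # terminal
}

def can_transition(current, new):
    """True if a request changing the status from current to new should be accepted."""
    return new in TRANSITIONS.get(current, set())
```

A server following this model would reject, say, a PUT that tries to move a completed scan back to "in progress".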

3. Retrieve the scan

A little while later, the scan has completed. Perhaps we keep performing a GET on its URL until it’s done; perhaps we have submitted an email address in the initial POST, and have now received email notification that it is ready.

We perform another GET to see the results:

GET http://asafaweb.com/scans/{xyz}

And the server responds:

200 OK 
{ "scan": {
  "status": "complete", 
  "url": "http://troyhunt.com/Search", 
  "created": "2011-09-18 11:57:00.000", 
  "options": {…}, 
  "results": {…} 
} }

We can now send the URL http://asafaweb.com/scans/{xyz} to other people, and they will see the same results. The server doesn’t have to rescan the site, so retrieving these results can be a quick, inexpensive operation.
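The poll-until-done pattern from this step can be sketched like this; fetch stands in for whatever performs the GET and parses the JSON, and the status values are those from the representations above:

```python
import time

def wait_for_scan(fetch, scan_url, interval=5.0, max_polls=60):
    """Repeatedly GET the scan until it reaches a terminal status."""
    for _ in range(max_polls):
        scan = fetch(scan_url)["scan"]
        if scan["status"] in ("complete", "cancelled"):
            return scan
        time.sleep(interval)
    raise TimeoutError("scan did not finish in time")

# Simulated server: still running on the first two polls, then done.
_responses = iter([
    {"scan": {"status": "in progress"}},
    {"scan": {"status": "in progress"}},
    {"scan": {"status": "complete", "results": {}}},
])
result = wait_for_scan(lambda url: next(_responses),
                       "http://asafaweb.com/scans/xyz", interval=0)
```

An email notification, as mentioned above, would simply replace the polling loop; the final GET is the same either way.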

4. Search for scans

Throughout this example, I have used {xyz} to indicate the unique identifier of the scan. I have deliberately not given details of what this identifier might be. However, as I said earlier, the URL to scan is not a sufficient identifier, as we want to allow the possibility of scanning the same URL more than once. This identifier could include the URL, but this may not be the ideal solution, both for the technical reasons that Troy indicates in his article, and because this would produce very long identifiers, when we could probably make do with short strings of characters.

The result of this is that we have a system that is eminently addressable, but which uses identifiers that bear an opaque relationship to the scanned URL, and fails the findability criterion. I can easily send a URL like http://asafaweb.com/scans/f72bw8 to a friend, but if they do not have that address, they have no way of guessing that this is the address of a scan of my site.
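One common way to mint such opaque identifiers is a short random string; here is a minimal sketch, where the alphabet and length are arbitrary choices of mine, picked to match identifiers like f72bw8:

```python
import secrets
import string

# Lowercase letters and digits keep the identifier URL-safe and easy to read aloud.
ALPHABET = string.ascii_lowercase + string.digits

def new_scan_id(length=6):
    """Generate a short, opaque identifier such as 'f72bw8'."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

scan_id = new_scan_id()
```

A real system would also need to check for (rare) collisions against existing scans, or use a longer identifier to make them negligible.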

To remedy this, we can implement a search interface. We already have an address for the entire collection of scans, viz. /scans, so now we can simply refine the response to a request for this URL with a query parameter:

GET http://asafaweb.com/scans?url=http://troyhunt.com/Search

The server can then respond with a listing of all the scans that meet these criteria:

200 OK
{ "scans": {
  "url": "http://troyhunt.com/Search",
  "results": [
    {
      "status": "complete",
      "created": "2011-09-18 11:57:00.000",
      "link": {
        "href": "http://asafaweb.com/scans/f72bw8"
      }
    },
    {
      "status": "complete",
      "created": "2011-08-28 17:36:24.023",
      "link": {
        "href": "http://asafaweb.com/scans/89ew2p"
      }
    }
  ]
} }

Each result has a link to its location, so all results can be retrieved at any time.
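On the server side, this search is just a filter over the collection. A sketch, with an in-memory list standing in for whatever store the real application uses (the field names mirror the example listing above):

```python
def search_scans(scans, url):
    """Return the listing representation for all scans of the given URL."""
    matches = [
        {"status": s["status"], "created": s["created"],
         "link": {"href": s["href"]}}
        for s in scans if s["url"] == url
    ]
    return {"scans": {"url": url, "results": matches}}

_store = [
    {"url": "http://troyhunt.com/Search", "status": "complete",
     "created": "2011-09-18 11:57:00.000",
     "href": "http://asafaweb.com/scans/f72bw8"},
    {"url": "http://example.com/", "status": "complete",
     "created": "2011-09-01 09:00:00.000",
     "href": "http://asafaweb.com/scans/aa11bb"},
]
listing = search_scans(_store, "http://troyhunt.com/Search")
```

Note that in a real request the value of the url query parameter would need to be percent-encoded; in Python, urllib.parse.urlencode takes care of this.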

By recognising that our scans are resources that can be searched, and by providing a means to perform that search, we have restored findability to our interface. Also, the fact that our search criterion is a refinement of a more general request, and thus appears in the query string, means that we arrive at a very similar conclusion to the one Troy reaches in his article, but this time for reasons based on resource design, rather than practicality.

Applicability to web sites

I have given all my examples in JSON, as it offers a concise and easy-to-read format for technical discussions. However, all of this discussion would work equally well with HTML web pages:

  1. The user fills in and submits a form to request a scan, and is redirected to the scan page;
  2. The scan page shows a message telling the user the scan is in progress;
  3. After a while a page refresh reveals the results of the scan, and the user can send the page URL to their contacts;
  4. The /scans page offers a search form into which the user can enter their URL and retrieve a list of all the dated scans for that location.

A couple of the implementation details will differ, because of the limited number of HTTP verbs and status codes that browsers can deal with, but the principles are exactly the same.