Matthew Butt

Site scans: a RESTful case study

Posted in programming by bnathyuw on 18 September 2011

I’ve been thinking a lot recently about REST, resource design and addressability, so I was interested to read an article by Troy Hunt on the particular challenges of creating URLs that refer to other URLs.

Scenario

The scenario Troy describes is for a site that will perform a security scan of your website, and then show a report of the result. Addressability is an important concern, as once you have the results of the scan, you may want to forward them to other people.

Troy discusses two methods for including the URL to scan:

  1. Include it in the hierarchical part of the URL (ie, the part after http: and before ?, eg http://asafaweb.com/scan/troyhunt.com/Search)
  2. Include it in the query string, eg http://asafaweb.com/scan?url=troyhunt.com/Search.

He favours the second approach for practical reasons.

A RESTful approach

Troy’s article approaches this question from the point of view of addressability, rather than resource design, and makes a sensible recommendation given the basic premisses; however, the scenario he outlines presents a good opportunity to do a bit of resource design analysis, which can lead us to an even better answer.

First then, let’s think about the type of resource we’re dealing with.

I think it is fair to make a few assumptions about the scan resource and the application that generates it:

  • It will take some time to perform the scan;
  • It will take a fair amount of computing resources to perform the scan;
  • The scan will be accurate for the point in time at which it was created; a subsequent scan of the same URL may generate a different result;
  • It may be interesting to compare scans over time.

From these assumptions we can draw a few conclusions:

  • A scan is an expensive resource to create, and will have a series of different statuses over its lifetime; this means we are looking at a transactional resource model here;
  • As a URL can be scanned more than once, it is not on its own a sufficient identifier of any scan.

If we follow these conclusions, we can make a sketch of the process of performing a scan:

1. Trigger the scan

We make a POST request with the details of the scan we want:

POST http://asafaweb.com/scans
{ "url": "http://troyhunt.com/Search", "options": {…} }

Note that we are POSTing to /scans; this request will create a resource subordinate to that collection. Also, as we are making a POST request, we can include further information about the scan we want, perhaps indicating which criteria we are interested in and what level of detail we require; I have indicated this possibility by including an options parameter.

The server responds not by showing us the results of the scan (they haven’t been produced yet), but by telling us where to look for the results:

201 Created
Location: http://asafaweb.com/scans/{xyz}
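As a sketch of this first step, the POST handler might look like the following. This assumes a hypothetical in-memory store; the ScanStore class and its method names are illustrative, not anything the real ASafaWeb implements:

```python
import uuid

class ScanStore:
    """Illustrative in-memory store for the transactional scan model."""

    def __init__(self):
        self._scans = {}

    def create(self, url, options=None):
        """POST /scans: register a new scan and return 201 + its location."""
        scan_id = uuid.uuid4().hex[:6]  # opaque identifier, e.g. '3f2a9c'
        self._scans[scan_id] = {
            "status": "queued",
            "url": url,
            "options": options or {},
        }
        return 201, f"/scans/{scan_id}"  # status code + Location

store = ScanStore()
status, location = store.create("http://troyhunt.com/Search")
print(status, location)  # e.g. 201 /scans/3f2a9c
```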

2. Check the scan URL

We can go and check this URL straight away by performing a GET:

GET http://asafaweb.com/scans/{xyz}

As the scan is still running, we don’t see the results, but rather a representation of a scan that is still in progress:

200 OK 
{ "scan": {
  "status": "in progress", 
  "url": "http://troyhunt.com/Search", 
  "created": "2011-09-18 11:57:00.000", 
  "options": {…} 
} }

Indeed, if there is a large volume of requests for scans, Troy may have to implement a queueing system, and our scan may have a status of "queued" until it can be processed; we could even cancel a scan by PUTting a representation with a status of "cancelled" to its URL, or perhaps simply by issuing a DELETE request.
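The client’s side of this can be sketched as a simple polling loop. The fetch function below is a stub standing in for a real HTTP GET, fed with canned responses; the status values are the ones used in this post:

```python
import time

# Canned representations a real server might return on successive GETs.
responses = iter([
    {"status": "queued"},
    {"status": "in progress"},
    {"status": "complete", "results": {}},
])

def fetch(url):
    """Stub for an HTTP GET of the scan URL."""
    return next(responses)

def poll(url, interval=0.01, attempts=10):
    """GET the scan URL until it reaches a terminal status."""
    for _ in range(attempts):
        scan = fetch(url)
        if scan["status"] in ("complete", "cancelled"):
            return scan
        time.sleep(interval)
    raise TimeoutError("scan did not finish in time")

result = poll("http://asafaweb.com/scans/xyz")
print(result["status"])  # complete
```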

3. Retrieve the scan

A little while later, the scan has completed. Perhaps we keep performing a GET on its URL until it’s done; perhaps we have submitted an email address in the initial POST, and have now received email notification that it is ready.

We perform another GET to see the results:

GET http://asafaweb.com/scans/{xyz}

And the server responds:

200 OK 
{ "scan": {
  "status": "complete", 
  "url": "http://troyhunt.com/Search", 
  "created": "2011-09-18 11:57:00.000", 
  "options": {…}, 
  "results": {…} 
} }

We can now send the URL http://asafaweb.com/scans/{xyz} to other people, and they will see the same results. The server doesn’t have to rescan the site, so retrieving these results can be a quick, inexpensive operation.

4. Search for scans

Throughout this example, I have used {xyz} to indicate the unique identifier of the scan. I have deliberately not given details of what this identifier might be. However, as I said earlier, the URL to scan is not a sufficient identifier, as we want to allow the possibility of scanning the same URL more than once. This identifier could include the URL, but this may not be the ideal solution, both for the technical reasons that Troy indicates in his article, and because this will produce very long identifiers, where we could probably make do with short strings of characters.

The result of this is that we have a system that is eminently addressable, but which uses identifiers that bear an opaque relationship to the scanned URL, and fails the findability criterion. I can easily send a URL like http://asafaweb.com/scans/f72bw8 to a friend, but if they do not have that address, they have no way of guessing that this is the address of a scan of my site.
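One plausible way to mint short opaque identifiers like f72bw8 is a random string over a small alphabet. The length and alphabet here are illustrative choices, not anything ASafaWeb actually uses:

```python
import secrets
import string

# Lowercase letters and digits give 36**6 ≈ 2.2 billion six-character ids.
ALPHABET = string.ascii_lowercase + string.digits

def new_scan_id(length=6):
    """Generate a short opaque identifier such as 'f72bw8'."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(new_scan_id())
```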

To remedy this, we can implement a search interface. We already have an address for the entire collection of scans, viz /scans, so now we can just refine the response to a request to this URL with a query parameter:

GET http://asafaweb.com/scans?url=http://troyhunt.com/Search

The server can then respond with a listing of all the scans that meet these criteria:

200 OK
{ "scans": {
  "url": "http://troyhunt.com/Search",
  "results": [
    {
      "status": "complete",
      "created": "2011-09-18 11:57:00.000",
      "link": {
        "href": "http://asafaweb.com/scans/f72bw8"
      }
    },
    {
      "status": "complete",
      "created": "2011-08-28 17:36:24.023",
      "link": {
        "href": "http://asafaweb.com/scans/89ew2p"
      }
    }
  ]
} }

Each result has a link to its location, so all results can be retrieved at any time.
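The search endpoint can be sketched as a simple filter over the stored scans. The records and helper below are hypothetical, and the urlencode call shows how the url parameter would be percent-encoded when placed in a real query string:

```python
from urllib.parse import urlencode

# Hypothetical stored scan records, keyed by their opaque ids.
scans = {
    "f72bw8": {"url": "http://troyhunt.com/Search",
               "status": "complete", "created": "2011-09-18 11:57:00.000"},
    "89ew2p": {"url": "http://troyhunt.com/Search",
               "status": "complete", "created": "2011-08-28 17:36:24.023"},
    "aa11bb": {"url": "http://example.com/",
               "status": "complete", "created": "2011-09-01 09:00:00.000"},
}

def search_scans(url):
    """GET /scans?url=… : list every scan of the given URL, with links."""
    return [
        {"status": s["status"], "created": s["created"],
         "link": {"href": f"http://asafaweb.com/scans/{scan_id}"}}
        for scan_id, s in scans.items() if s["url"] == url
    ]

# How a client would build the query string, percent-encoding the URL:
query = urlencode({"url": "http://troyhunt.com/Search"})
print(query)  # url=http%3A%2F%2Ftroyhunt.com%2FSearch

results = search_scans("http://troyhunt.com/Search")
print(len(results))  # 2
```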

By recognising that our scans are resources that can be searched, and by providing a means to perform that search, we have restored findability to our interface. Also, the fact that our search criterion is a refinement of a more general request, and thus appears in the query string, means that we arrive at a very similar conclusion to the one Troy reaches in his article, but this time for reasons based on resource design, rather than practicality.

Applicability to web sites

I have given all my examples in JSON, as it offers a concise and easy-to-read format for technical discussions. However, all of this discussion would work equally well with HTML web pages:

  1. The user fills in and submits a form to request a scan, and is redirected to the scan page;
  2. The scan page shows a message telling the user the scan is in progress;
  3. After a while a page refresh reveals the results of the scan, and the user can send the page URL to their contacts;
  4. The /scans page offers a search form into which the user can enter their URL and retrieve a list of all the dated scans for that location.

A couple of the implementation details will differ, because of the limited number of HTTP verbs and status codes that browsers can deal with, but the principles are exactly the same.

2 Responses


  1. Troy Hunt (@troyhunt) said, on 18 September 2011 at 22:24

    Firstly, big thanks for taking the time to articulate these thoughts so clearly. I literally felt like I’d just gone down the rabbit hole when I was making this decision so it’s great to have some external clarity.

There are a few non-URL related constraints I’ve consciously decided to work within which change the discussion a little. Firstly, scans must be fast; typically they should complete within several seconds. This means being very sparse with HTTP requests from ASafaWeb to the site being scanned. When they’re running this fast, the user can wait for a response rather than needing to queue them and return a bit later. At present the median scan duration is 2.5 seconds across 4 HTTP requests – and that’s issuing requests synchronously and not using HTTP compression so it should come way down yet.

Secondly, I don’t want to store any identifying information about either the requestor or the site being scanned which means I don’t want to keep the URL anywhere. This is primarily for privacy – people shouldn’t feel that I’m building a list with the mother lode of vulnerable sites! But of course it also significantly mitigates my responsibility; the entire site and DB can be compromised and I’ll just redeploy it without suffering any disclosure fallout.

    Of course the performance issue remains but without having yet profiled this, I think I’ll find the bottleneck is not in computing resources but rather in the number of simultaneous HTTP requests that are running. One alternative to persistent storage as a means of overcoming this is caching. I’m using AppHarbor and I’m very keen to try out the Memcacher service that they offer. This could greatly mitigate the scenario where the URL of a scan is sent around, say, via Twitter and gets a heap of hits in a short time frame.

    All that said, it’s very, very early days (still in private beta) and I know I’ll inevitably chop and change things as I go, particularly once it has public exposure. I may well change course on the above principles, but I think it’s a cautionary way to begin and I’d always rather start simple and scale it up after that.

    Your post is fantastic and great food for thought. I do actually have an item on the backlog related to JSON based services and would love to get your input so will get in touch with via Twitter.

    • bnathyuw said, on 20 September 2011 at 11:28

      Thank you, Troy, for reading my post and for responding in such detail.

It’s clear from the explanation you give that all four of my assumptions are some way off the mark: I made the assumption that we’re dealing with expensive, persistent resources, and you have explained that your goal is to produce inexpensive, ephemeral ones.

      In terms of my initial post, this does not trouble me: I was interested in working through a problem, rather than telling you how to build your system(!), and the fact that the problem I set out to solve is not the same as the scenario that inspired it needn’t undermine the value of the exercise.

      But the scenario you describe is possibly more interesting to model, and I can see various ways this might be done:

1. Treat the results of the scan as normal resources, but destroy them on read. This would follow a POST → 201 Created → GET pattern.

      This model is a poor match, as we would need to create an explicit mechanism for storing the data over the 201 Created redirect; also, addressability and findability would fail, as a) the GET resource is ephemeral and cannot be persistently linked to, and b) it’s not possible to link with a POST verb.

      2. Treat the results of the scan as a latent resource, which only comes into being when you request it. This would follow a simple GET pattern, and should include all the information about the resource in the hierarchical part of the URL.

      There are various reasons you might create a latent resource:

      a) There are too many resources for it to be practical (or possible) to store them. If you write a service that calculates square roots, you won’t store every square root, you’ll just calculate it on the fly.
      b) The specific resource is a refinement of more generic resources, which can be stored. If you’re showing a weather forecast map for a specific postcode and time of day, you might actually store hourly maps for each region, and then produce the customised map based on the stored data.
      c) You do not own the resources in question. A social media hashtag aggregator might be an example here: given any input hashtag, the system can fetch the results from elsewhere, but it would be pointless to pre-fetch the results.
      d) You have specific security reasons not to store the resources. This is the case with your scenario.

A model like this maintains addressability, as the URL format is clear, and also emphasises findability, as the user can feed any valid data into the URL and expect a response. If I’m providing a square root service for numbers >= 0, then the user can expect that http://mathsservice/square-roots/n will deliver a valid resource for any n∈ℕ.
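A latent-resource handler along these lines can be sketched as follows; the path format and response shape are the hypothetical ones from the square-root example above, not any real service:

```python
import math

def get_square_root(path):
    """Handle GET /square-roots/{n}: compute the representation on demand.

    Nothing is stored; the resource is latent, fully determined by the
    identifier embedded in the URL path.
    """
    prefix = "/square-roots/"
    if not path.startswith(prefix):
        return 404, None          # not a recognised resource
    try:
        n = float(path[len(prefix):])
    except ValueError:
        return 400, None          # identifier is not a number
    if n < 0:
        return 400, None          # outside the service's stated domain
    return 200, {"n": n, "square-root": math.sqrt(n)}

print(get_square_root("/square-roots/9"))  # (200, {'n': 9.0, 'square-root': 3.0})
```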

      I can see two small problems:

      a) In the specific case of running a scan against a URL, there are the technical problems you have outlined in embedding one URL in the hierarchical part of another URL.
      b) With some of the examples I have given, especially the weather example, it’s disputable whether we are referring to a single, highly specified resource, or whether in fact we are offering a customised view of a more general resource. Perhaps in this example, the resource is simply ‘the weather forecast’, and we want a view on it that focuses on such-and-such a location and such-and-such a point in time.

      This leads us to a third possibility:

      3. You treat the URL of the scan as the refinement of an existing resource. This would allow the user to make a simple GET request, with further information (viz the URL to scan) in the query string.

      This is the standard model for searches, and is a reasonable fit for the weather example above. It maintains addressability and findability, and as a bonus avoids the technical issues of embedding a URL in another.

The risk is that this approach fails to identify enough resources, and can lead to an RPC style of programming, where a small number of endpoints are overloaded with query refinements. For the weather example, it may be reasonable to see ‘The weather at 16:30 on Saturday for WC1A 1AA’ as a refinement of ‘the UK weather forecast for the next 5 days’, but with your original example, I’m not sure ‘a security scan of http://foo.bar’ can be seen as a refinement of anything else; certainly not ‘a security scan of the entire internet’!

      I’m not sure if any of these models is absolutely correct for your scenario. I think the notion of latent resources in model 2. is rather interesting, and certainly merits further thought, and I’m also interested in the questions model 3. raises about what counts as a refinement of a resource, and what should be considered another resource entirely.

What I think can be said for certain is that the POST → 201 Created → GET model, which I describe in my original post, while absolutely suitable for persistent resources, is completely unsuited to ephemeral resources, and therefore doesn’t fit your scenario.

