RDF, TypeScript and Deno - Part 2: sample data

In part 1, I laid out some reasons for looking at using RDF in TypeScript with Deno. In this installment, I’ll put together some quick sample data before I start looking at RDF specifically.

Sample data is always a challenge. Real-world data is often large and messy, or has inconvenient license terms. On the other hand, synthetic data tends to hide the kinds of issues we have to deal with day-to-day in data processing. Who needs another todo list app, really?

For this exercise, I’m using recent sightings of cetaceans around the coast of the UK by the Sea Watch Foundation. There is - as far as I can see - no explicit license on this data, but as it’s reported by anyone and concerns wild animals in their natural habitat I’m going to assume it’s OK to use this data for tutorial purposes. Obviously if I find out otherwise, I’ll use a different dataset.

There’s a tool on the Sea Watch web site to list recent sightings. It looks like this:

Bottlenose dolphin (x6) - Portland Bill, Dorset at 08:00 on 1 Jul 2021 by Des and Shirley Peadon
Grey Seal (x11) - The Chick Island, Cornwall at 15:17 on 30 Jun 2021 by Newquay Sea Safaris Newquay Sea Safaris
Sunfish (x1) - Towan Headland, Cornwall at 14:53 on 30 Jun 2021 by Newquay Sea Safaris Newquay Sea Safaris

This is reasonably well-structured data to human eyes, but it still needs a bit of processing to get it into a form we can conveniently work with. By inspection, each sighting contains a species, a count in brackets, a location, a time, a date and the name of the spotter.

There’s a bit of variability in this structure, so it would be convenient to have it as a more structured format, such as JSON:

    {
      "species": "Bottlenose dolphin",
      "quantity": 12,
      "location": "Portland Bill, Dorset",
      "spotter": "Alan Hold",
      "date": "2021-06-30T23:00:00.000Z",
      "time": "09:15"
    }

Mostly this is a case of splitting each line of data up in a robust way, but also performing some basic data transformation. The quantity field is parsed as an integer, not a string, and the date is parsed as a JavaScript Date object. I decided to keep the time field separate, as only around half of the sightings record the time. In theory we could do some data reconciliation to try to recognise the location and the species, but the data is likely to be too noisy to be able to do this robustly. So for now, just robustly parsing the strings is enough.
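To make the transformation step concrete, here is a minimal sketch of how an all-strings record might be turned into the shape shown above. It is not the code from the repository: the names RawSighting, CleanSighting, parseSightingDate and toCleanSighting are made up for illustration, and the date is parsed by hand as midnight UTC rather than with date_fns, so the resulting timestamp will differ from one parsed in local time. Leaving quantity undefined when no count was recorded is also an assumption.

```typescript
// Interim, all-strings shape of one parsed line (field names assumed for this sketch).
interface RawSighting {
  species: string
  quant?: string
  location: string
  date: string // e.g. "1 Jul 2021"
  time?: string // e.g. "08:00"
  spotter: string
}

// Final shape: quantity is a number and date is a real Date object.
interface CleanSighting {
  species: string
  quantity?: number
  location: string
  spotter: string
  date: Date
  time?: string
}

const MONTH_NAMES = [
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec",
]

// Parse "1 Jul 2021" as midnight UTC; returns undefined if the text is not recognised.
function parseSightingDate(text: string): Date | undefined {
  const parts = text.trim().split(/\s+/)
  if (parts.length !== 3) return undefined
  const [day, month, year] = parts as [string, string, string]
  const monthIndex = MONTH_NAMES.indexOf(month)
  if (monthIndex < 0) return undefined
  const date = new Date(Date.UTC(Number(year), monthIndex, Number(day)))
  return isNaN(date.getTime()) ? undefined : date
}

// Convert the interim record, or give up (undefined) if the date cannot be parsed.
function toCleanSighting(raw: RawSighting): CleanSighting | undefined {
  const date = parseSightingDate(raw.date)
  if (date === undefined) return undefined
  return {
    species: raw.species,
    quantity: raw.quant === undefined ? undefined : parseInt(raw.quant, 10),
    location: raw.location,
    spotter: raw.spotter,
    date,
    time: raw.time,
  }
}

const exampleSighting = toCleanSighting({
  species: "Bottlenose dolphin",
  quant: "6",
  location: "Portland Bill, Dorset",
  date: "1 Jul 2021",
  time: "08:00",
  spotter: "Des and Shirley Peadon",
})

console.log(JSON.stringify(exampleSighting, null, 2))
```

Keeping time as a separate optional string, rather than folding it into the Date, mirrors the decision described above: a missing time then stays visibly missing instead of silently becoming midnight.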

The code for this data conversion step is available on GitHub. It’s mostly fairly straightforward; perhaps the main point of note (other than TypeScript and Deno, see below) is that the work of the recogniser is done by a rather large regex. In JavaScript, regular expressions can capture segments of the input in groups, and these capture groups can be given a name using the construct (?<name>...). Conveniently, these named groups are returned as a JavaScript object looking like the JSON structure above, as long as the names of the fields and capture groups line up.
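To illustrate named capture groups, here is a small sketch. The pattern SKETCH_MATCHER is my own guess at a regex that fits the sample lines shown earlier; it is not the regex from the repository, and real data would likely need more cases than this handles.

```typescript
// Hypothetical matcher for one sighting line; the real regex may well differ.
// The count "(x6)" and the time "at 08:00" are both treated as optional.
const SKETCH_MATCHER =
  /^(?<species>.+?)(?: \(x(?<quant>\d+)\))? - (?<location>.+?)(?: at (?<time>\d{1,2}:\d{2}))? on (?<date>\d{1,2} \w{3} \d{4}) by (?<spotter>.+)$/

const sampleLine =
  "Bottlenose dolphin (x6) - Portland Bill, Dorset at 08:00 on 1 Jul 2021 by Des and Shirley Peadon"

// match() returns null if the line does not fit; otherwise .groups holds
// one string property per named group.
const sampleGroups = sampleLine.match(SKETCH_MATCHER)?.groups

console.log(sampleGroups?.species) // "Bottlenose dolphin"
console.log(sampleGroups?.quant) // "6" (still a string at this stage)
console.log(sampleGroups?.location) // "Portland Bill, Dorset"
console.log(sampleGroups?.date) // "1 Jul 2021"
```

Every captured value is a string at this point, which is why the count and the date need a separate conversion step afterwards.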

Of TypeScript

TypeScript uses type inference to determine, and then check, the types of variables and parameters automatically where it can. In this simple program, type inference was mostly sufficient. The main thing I needed to add to make the type checking work was to constrain the output of the regular expression matcher. As mentioned above, I wanted the output of the regex match to be an object closely resembling my eventual result. The groups field of the match result is an object, but TypeScript needs more information before it can check that usages of that object are legal. So I defined an interface type SightingsData, which is the interim, not final, form of a line from the sightings data file:

interface SightingsData {
  species: string
  quant?: string
  location: string
  date: string
  time?: string
  spotter: string
}

Then we can tell the compiler that this will be the result type from parsing, and use the as operator to coerce the result (note that LINE_MATCHER is just the large regex):

function parseLine(line: string): SightingsData | undefined {
  return line.match(LINE_MATCHER)?.groups as (SightingsData | undefined)
}

Since the match can fail, in which case match() returns null, we need the optional chaining operator ?.. If the left-hand expression evaluates to null or undefined, the whole expression short-circuits and evaluates to undefined instead of throwing an error. And since the overall result may therefore be undefined, the return type needs to be the type expression SightingsData | undefined.
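Here is a small sketch of what that means for callers. TINY_MATCHER, TinySighting and parseTinyLine are cut-down stand-ins I have invented for this example, not the real regex, interface or function, but the ?. and as usage follows the same pattern as parseLine above.

```typescript
// A deliberately tiny, hypothetical pattern standing in for the larger regex.
const TINY_MATCHER = /^(?<species>.+?) \(x(?<quant>\d+)\)$/

interface TinySighting {
  species: string
  quant?: string
}

function parseTinyLine(line: string): TinySighting | undefined {
  // match() returns null on failure; ?. short-circuits on null and yields undefined.
  return line.match(TINY_MATCHER)?.groups as (TinySighting | undefined)
}

const parsedOk = parseTinyLine("Sunfish (x1)")
const parsedBad = parseTinyLine("no sighting here")

console.log(parsedOk?.species) // "Sunfish"
console.log(parsedBad) // undefined

// Because the return type includes undefined, the compiler makes callers
// narrow the result before touching its fields. A type-guard filter does
// that for a whole list at once.
const tinyLines = ["Sunfish (x1)", "???", "Grey Seal (x11)"]
const tinySightings = tinyLines
  .map((line) => parseTinyLine(line))
  .filter((s): s is TinySighting => s !== undefined)

console.log(tinySightings.length) // 2
```

One thing to keep in mind is that as is an unchecked assertion: the compiler takes my word that the group names match the interface, so a typo in a group name would only show up at runtime.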

Using VS Code as my editor, TypeScript can work out the rest of the types, and show them on mouse hover. Here, for example, is the result of hovering on function asData:

Screenshot of VS Code showing TypeScript type annotation

Of Deno

This simple script doesn’t get to exercise much of Deno, but does show a couple of interesting aspects. First, promises are used as the basis for every (potentially) asynchronous operation, like writing or reading files. That means lots of async / await statements, or .then() calls. No callback functions.

Second, imports can be loaded directly from a URL:

import parse from 'https://deno.land/x/date_fns@v2.22.1/parse/index.js'

In a Node.js script, this would have meant adding date-fns to the dependencies in the package.json file, then running yarn install or npm install to get the dependency cached into node_modules. Deno can work this way (spoiler alert for part 3 of this blog series), but by default doesn’t need to.

Not diving for yarn add from the command line did feel a bit weird, but I expect I will get used to it. More of an issue, I think, will be keeping the versions consistent. If I use date_fns from more than one file, and then I need to upgrade to version 2.22.2, it seems I’ll have to grep for every use of date_fns@v2.22.1. That doesn’t seem very DRY, but maybe there are working practices I’ve not got used to yet.

The third thing about Deno is that scripts don’t have permission to perform risky operations by default. “Risky” in this context means things like: reading a file, writing a file, writing to the network, etc. Running without the permission causes an error:

$ deno run data-generation.ts
Check file:///home/ian/projects/personal/deno-experiments/data-generation.ts
error: Uncaught (in promise) PermissionDenied: Requires read access to "./sightings-data.txt", run again with the --allow-read flag
  const sourceData = await Deno.readTextFile(source)
    at deno:core/core.js:86:46
    at unwrapOpResult (deno:core/core.js:106:13)
    at async open (deno:runtime/js/40_files.js:46:17)
    at async Object.readTextFile (deno:runtime/js/40_read_file.js:40:18)
    at async readLines (file:///home/ian/projects/personal/deno-experiments/data-generation.ts:18:22)

It’s a nice clear error. But I found it quite easy to forget to add the appropriate flags. The correct version:

$ deno run --allow-read --allow-write data-generation.ts
Check file:///home/ian/projects/personal/deno-experiments/data-generation.ts

It would be quite easy to create a bash alias with those permissions turned on, but that rather defeats the goal. Security or convenience: pick one!
