
C# method for cropping text without breaking words


I often need to crop a longer text down to a shorter version. That in itself is easy. However, it's often nice to make it less obvious that the text has been shortened by ensuring that it consists of whole words. Below is a method for doing so with .NET/C#.

It's not the most readable method, but compared to, for instance, splitting the string by whitespace and building it up again using a StringBuilder, it's much, much faster.

using System;
using System.Collections.Generic;

public static class StringExtensions
{
  public static bool IsNullOrEmpty(this string value)
  {
    return string.IsNullOrEmpty(value);
  }

  private static readonly HashSet<char> DefaultNonWordCharacters 
    = new HashSet<char> { ',', '.', ':', ';' };

  /// <summary>
  /// Returns a substring from the start of <paramref name="value"/> no 
  /// longer than <paramref name="length"/>.
  /// Returning only whole words is favored over returning a string that 
  /// is exactly <paramref name="length"/> long. 
  /// </summary>
  /// <param name="value">The original string from which the substring 
  /// will be returned.</param>
  /// <param name="length">The maximum length of the substring.</param>
  /// <param name="nonWordCharacters">Characters that, while not whitespace, 
  /// are not considered part of words and therefore can be removed from a 
  /// word at the end of the returned value. 
  /// Defaults to ",", ".", ":" and ";" if null.</param>
  /// <exception cref="System.ArgumentException">
  /// Thrown when <paramref name="length"/> is negative
  /// </exception>
  /// <exception cref="System.ArgumentNullException">
  /// Thrown when <paramref name="value"/> is null
  /// </exception>
  public static string CropWholeWords(
    this string value, 
    int length, 
    HashSet<char> nonWordCharacters = null)
  {
    if (value == null)
    {
      throw new ArgumentNullException("value");
    }

    if (length < 0)
    {
      throw new ArgumentException("Negative values not allowed.", "length");
    }
    if (nonWordCharacters == null)
    {
      nonWordCharacters = DefaultNonWordCharacters;
    }

    if (length >= value.Length)
    {
      return value;
    }
    int end = length;

    for (int i = end; i > 0; i--)
    {
      if (value[i].IsWhitespace())
      {
        break;
      }

      if (nonWordCharacters.Contains(value[i]) 
          && (value.Length == i + 1 || value[i + 1] == ' '))
      {
        //Removing a character that isn't whitespace but not part 
        //of the word either (ie ".") given that the character is 
        //followed by whitespace or the end of the string makes it
        //possible to include the word, so we do that.
        break;
      }
      end--;
    }

    if (end == 0)
    {
      //If the first word is longer than the length we favor 
      //returning it as cropped over returning nothing at all.
      end = length;
    }

    return value.Substring(0, end);
  }

  private static bool IsWhitespace(this char character)
  {
    return character == ' ' || character == '\n' || character == '\t';
  }
}
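For reference, here's a minimal usage sketch. The sample strings and lengths are made up for illustration, and the StringExtensions class above is assumed to be in scope.

using System;

public static class CropWholeWordsExample
{
  public static void Main()
  {
    var text = "The quick brown fox jumps over the lazy dog.";

    //Cropping to 12 characters backs up to the last whole word
    //instead of returning "The quick br".
    Console.WriteLine(text.CropWholeWords(12)); //The quick

    //A trailing non-word character (the comma) is dropped so that
    //the word before it can still be included.
    Console.WriteLine("The fox, the dog".CropWholeWords(7)); //The fox
  }
}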

ElasticSearch 101

ElasticSearch is a highly scalable open source search engine with a REST API that is hard not to love. In this tutorial we'll look at some of the key concepts when getting started with ElasticSearch.

Downloading and running ElasticSearch

ElasticSearch can be downloaded packaged in various formats such as ZIP and TAR.GZ from elasticsearch.org. After downloading and extracting a package, running it couldn't be much easier, at least if you already have a Java runtime installed.

Running ElasticSearch on Windows

To run ElasticSearch on Windows we run elasticsearch.bat located in the bin folder from a command window. This will start ElasticSearch running in the foreground in the console, meaning we'll see errors in the console and can shut it down using CTRL+C.
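Assuming we've extracted ElasticSearch into a folder named elasticsearch (the folder name is just an example), that might look like this:

cd elasticsearch
bin\elasticsearch.bat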

If we don't have a Java runtime installed, or it isn't correctly configured, we won't see ElasticSearch start up but instead a message saying "JAVA_HOME environment variable must be set!". To fix that, first download and install Java if you don't already have it. Second, ensure that you have a JAVA_HOME environment variable configured correctly (Google it if unsure of how).
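For instance, on Windows the variable can be set from a command prompt using setx. The path below is only an example and should point to wherever Java is actually installed:

setx JAVA_HOME "C:\Program Files\Java\jre7"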

Running ElasticSearch on OS X 

To run ElasticSearch on OS X we run the shell script elasticsearch in the bin folder. This starts ElasticSearch in the background, meaning that if we want to see output from it in the console and be able to shut it down we should add a -f flag.
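Again assuming ElasticSearch was extracted into a folder named elasticsearch, the commands might look like this:

cd elasticsearch
./bin/elasticsearch -f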

If the script is unable to find a suitable Java runtime it will help you download it (nice!).

Using the REST API with Sense

Once you have an instance of ElasticSearch up and running you can talk to it using its JSON-based REST API residing at localhost, port 9200. You can use any HTTP client to talk to it. In ElasticSearch's own documentation all examples use curl, which makes for concise examples. However, when playing with the API you may find a graphical client such as Fiddler or RESTClient more convenient.

Even more convenient is the Chrome plug-in Sense. Sense provides a simple user interface specifically for using ElasticSearch's REST API. It also has a number of convenient features such as autocomplete for ElasticSearch's query syntax and copying and pasting requests in curl format, making it easy to run examples from the documentation.

We'll be looking at a combination of curl requests and screenshots from Sense throughout this tutorial and I recommend installing Sense and using it to follow along.

Once you have installed it you'll find Sense's icon in the upper right corner in Chrome. The first time you click it and run Sense a very simple sample request is prepared for you.

The above request will perform the simplest of search queries, matching all documents in all indexes on the server. Running it against a vanilla installation of ElasticSearch produces an error in the response as there aren't any indexes.

Our next step is to index some data, fixing this issue.

CRUD

While we may want to use ElasticSearch primarily for searching, the first step is to populate an index with some data, meaning the "Create" of CRUD, or rather, "indexing". While we're at it we'll also look at how to update, read and delete individual documents.

Indexing

In ElasticSearch indexing corresponds to both "Create" and "Update" in CRUD - if we index a document with a given type and ID that doesn't already exist, it's inserted. If a document with the same type and ID already exists, it's overwritten.

In order to index a first JSON object we make a PUT request to the REST API to a URL made up of the index name, type name and ID. That is: http://localhost:9200/<index>/<type>/[<id>].

Index and type are required, while the ID part is optional. If we don't specify an ID, ElasticSearch will generate one for us. However, if we don't specify an ID we should use POST instead of PUT.

The index name is arbitrary. If there isn't an index with that name on the server already one will be created using default configuration.

As for the type name it too is arbitrary. It serves several purposes, including:

  • Each type has its own ID space.
  • Different types can have different mappings ("schema" that defines how properties/fields should be indexed).
  • Although it's possible, and common, to search over multiple types, it's easy to search only for one or more specific type(s).

Let's index something! We can put just about anything into our index as long as it can be represented as a single JSON object. In this tutorial we'll be indexing and searching for movies. Here's a classic one:

{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972
}

To index that we decide on an index name ("movies"), a type name ("movie") and an id ("1") and make a request following the pattern described above with the JSON object in the body.

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972
}'

You can either run that using curl or use Sense. With Sense you can either populate the URL, method and body yourself or you can copy the above curl example, place the cursor in the body field in Sense and press Ctrl/Command + Shift + V and all of the fields will be populated for you.

After executing the request we receive a response from ElasticSearch in the form of a JSON object.

The response object contains information about the indexing operation, such as whether it was successful ("ok") and the document's ID, which can be of interest if we didn't specify it ourselves.

If we now run the default search request that Sense provides (accessible using the "History" button in Sense, given that you indeed executed it) that failed before, we'll see a different result.

Instead of an error we're seeing a search result. We'll get to searching later, but for now let's rejoice in the fact that we've indexed something!

Now that we've got a movie in our index let's look at how we can update it, adding a list of genres to it. In order to do that we simply index it again using the same ID. In other words, we make the exact same indexing request as before but with an extended JSON object containing genres.

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

The response from ElasticSearch is the same as before with one difference: the _version property in the result object now has the value two instead of one.

The version number can be used to track how many times a document has been indexed. Its primary purpose, however, is to allow for optimistic concurrency control, as we can supply a version in indexing requests as well and ElasticSearch will then only overwrite the document if the supplied version is higher than what's in the index.
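As a sketch of what that can look like (versioning isn't covered further in this tutorial), the version is supplied as a query string parameter on the indexing request. If the supplied version conflicts with what's in the index, the request is rejected with a version conflict error.

curl -XPUT "http://localhost:9200/movies/movie/1?version=2" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'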

Getting by ID

We've so far covered indexing new documents as well as updating existing ones. We've also seen an example of a simple search request and that our indexed movie appeared in that.

While it's possible to search for documents in the index, that's overkill if we only want to retrieve a single one with a known ID. A simpler and faster approach is to retrieve it by ID, using GET.

In order to do that we make a GET request to the same URL as when we indexed it, only this time the ID part of the URL is mandatory. In other words, in order to retrieve a document by ID from ElasticSearch we make a GET request to http://localhost:9200/<index>/<type>/<id>.

Let's try it with our movie using the following request:

curl -XGET "http://localhost:9200/movies/movie/1" -d''

As you can see the result object contains similar metadata to what we saw when indexing, such as index, type and version information. Last but not least it has a property named "_source" which contains the actual document.

There's not much more to say about GET as it's pretty straightforward. Let's move on to the final CRUD operation.

Deleting documents

In order to remove a single document from the index by ID we again use the same URL as for indexing and getting it, only this time we change the HTTP method to DELETE.

curl -XDELETE "http://localhost:9200/movies/movie/1" -d''

The response object contains some of the usual suspects in terms of metadata, along with a property named "_found" indicating that the document was indeed found and that the operation was successful.

If we, after executing the DELETE call, switch back to GET we can verify that the document has indeed been deleted.

Searching

So, we've covered the basics of working with data in an ElasticSearch index and it's time to move on to more exciting things - searching. However, considering the last thing we did was to delete the only document we had from our index, we'll first need some sample data. Below are the indexing requests that we'll use.

curl -XPUT "http://localhost:9200/movies/movie/1" -d'
{
    "title": "The Godfather",
    "director": "Francis Ford Coppola",
    "year": 1972,
    "genres": ["Crime", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/2" -d'
{
    "title": "Lawrence of Arabia",
    "director": "David Lean",
    "year": 1962,
    "genres": ["Adventure", "Biography", "Drama"]
}'

curl -XPUT "http://localhost:9200/movies/movie/3" -d'
{
    "title": "To Kill a Mockingbird",
    "director": "Robert Mulligan",
    "year": 1962,
    "genres": ["Crime", "Drama", "Mystery"]
}'

curl -XPUT "http://localhost:9200/movies/movie/4" -d'
{
    "title": "Apocalypse Now",
    "director": "Francis Ford Coppola",
    "year": 1979,
    "genres": ["Drama", "War"]
}'

curl -XPUT "http://localhost:9200/movies/movie/5" -d'
{
    "title": "Kill Bill: Vol. 1",
    "director": "Quentin Tarantino",
    "year": 2003,
    "genres": ["Action", "Crime", "Thriller"]
}'

curl -XPUT "http://localhost:9200/movies/movie/6" -d'
{
    "title": "The Assassination of Jesse James by the Coward Robert Ford",
    "director": "Andrew Dominik",
    "year": 2007,
    "genres": ["Biography", "Crime", "Drama"]
}'

It's worth pointing out that ElasticSearch has an endpoint (_bulk) for indexing multiple documents with a single request. However, that's out of scope for this tutorial, so we're keeping it simple and using six separate requests. A sketch of what a bulk request can look like is shown below.
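For the curious, here's roughly what the first two of those requests could look like combined into a single _bulk request. Note that the body is newline delimited (one action line followed by one document line) and must end with a newline:

curl -XPOST "http://localhost:9200/movies/movie/_bulk" -d'
{ "index": { "_id": "1" } }
{ "title": "The Godfather", "director": "Francis Ford Coppola", "year": 1972, "genres": ["Crime", "Drama"] }
{ "index": { "_id": "2" } }
{ "title": "Lawrence of Arabia", "director": "David Lean", "year": 1962, "genres": ["Adventure", "Biography", "Drama"] }
'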

The _search endpoint

Now that we have put some movies into our index, let's see if we can find them again by searching. In order to search with ElasticSearch we use the _search endpoint, optionally with an index and type. That is, we make requests to a URL following this pattern: <index>/<type>/_search where index and type are both optional.

In other words, in order to search for our movies we can make POST requests to either of the following URLs:

  • http://localhost:9200/_search - Search across all indexes and all types.
  • http://localhost:9200/movies/_search - Search across all types in the movies index.
  • http://localhost:9200/movies/movie/_search - Search explicitly for documents of type movie within the movies index.

As we only have a single index and a single type, which one we use doesn't matter. We'll use the first URL for the sake of brevity.

Search request body and ElasticSearch's query DSL

If we simply send a request to one of the above URLs we'll get all of our movies back. In order to make a more useful search request we also need to supply a request body with a query. The request body should be a JSON object which, among other things, can contain a property named "query" in which we can use ElasticSearch's query DSL.

{
    "query": {
        //Query DSL here
    }
}

One may wonder what the query DSL is. It's ElasticSearch's own domain specific language based on JSON in which queries and filters can be expressed. Think of it like ElasticSearch's equivalent of SQL for a relational database. Here's part of how ElasticSearch's own documentation explains it:

Think of the Query DSL as an AST of queries. Certain queries can contain other queries (like the bool query), other can contain filters (like the constant_score), and some can contain both a query and a filter (like the filtered). Each of those can contain any query of the list of queries or any filter from the list of filters, resulting in the ability to build quite complex (and interesting) queries.

Basic free text search

The query DSL features a long list of different types of queries that we can use. For "ordinary" free text search we'll most likely want to use one called "query string query".

A query string query is an advanced query with a lot of different options that ElasticSearch will parse and transform into a tree of simpler queries. Still, it can be very easy to use if we ignore all of its optional parameters and simply feed it a string to search for.

Let's try a search for the word "kill" which is present in the title of two of our movies:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "kill"
        }
    }
}'

Let's execute the request and take a look at the result.

As expected we're getting two hits, one for each of the movies with the word "kill" in the title. Let's look at another scenario, searching in specific fields.

Specifying fields to search in

In the previous example we used a very simple query, a query string query with only a single property, "query". As mentioned before the query string query has a number of settings that we can specify and if we don't it will use sensible default values.

One such setting is called "fields" and can be used to specify a list of fields to search in. If we don't use that the query will default to searching in a special field called "_all" that ElasticSearch automatically generates based on all of the individual fields in a document.

Let's try to search for movies only by title. That is, if we search for "ford" we want to get a hit for "The Assassination of Jesse James by the Coward Robert Ford" but not for either of the movies directed by Francis Ford Coppola.

In order to do that we modify the previous search request body so that the query string query has a fields property with an array of fields we want to search in:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "ford",
            "fields": ["title"]
        }
    }
}'

Let's execute that and see what happens:

As expected we get a single hit, the movie with the word "ford" in its title. Compare that to a request where we've removed the fields property from the query:
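curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "ford"
        }
    }
}'

This time, as the query searches the _all field, the two movies directed by Francis Ford Coppola show up in the results as well.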

Filtering

We've covered a couple of simple free text search queries above. Let's look at another one where we search for "drama" without explicitly specifying fields:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "query_string": {
            "query": "drama"
        }
    }
}'

As we have five movies in our index containing the word "drama" in the _all field (from the genres field), we get five hits for the above query. Now, imagine that we want to limit these hits to movies released in 1962. In order to do that we need to apply a filter requiring the "year" field to equal 1962.

To add such a filter we modify our search request body so that our current top level query, the query string query, is wrapped in a filtered query:

{
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "drama"
                }
            },
            "filter": {
                //Filter to apply to the query
            }
        }
    }
}

A filtered query is a query that has two properties, query and filter. When executed it filters the result of the query using the filter. To finalize the query we'll need to add a filter requiring the year field to have value 1962.

ElasticSearch's query DSL has a wide range of filters to choose from. For this simple case where a certain field should match a specific value a term filter will work well.

"filter": {
    "term": { "year": 1962 }
}

The complete search request now looks like this:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "filtered": {
            "query": {
                "query_string": {
                    "query": "drama"
                }
            },
            "filter": {
                "term": { "year": 1962 }
            }
        }
    }
}'

When we execute it we, as expected, only get two hits, both with year == 1962.

Filtering without a query

In the above example we limit the results of a query string query using a filter. What if all we want to do is apply a filter? That is, we want all movies matching a certain criteria.

In such cases we still use the "query" property in the search request body, which expects a query. In other words, we can't just add a filter, we need to wrap it in some sort of query.

One solution for doing this is to modify our current search request, replacing the query string query in the filtered query with a match_all query which is a query that simply matches everything. Like this:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "filtered": {
            "query": {
                "match_all": {
                }
            },
            "filter": {
                "term": { "year": 1962 }
            }
        }
    }
}'

Another, simpler option is to use a constant score query:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "year": 1962 }
            }
        }
    }
}'

Mapping

Let's look at a search request similar to the last one, only this time we filter by director instead of year.

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "director": "Francis Ford Coppola" }
            }
        }
    }
}'

As we have two movies directed by Francis Ford Coppola in our index it doesn't seem too far-fetched that this request should result in two hits, right? That's not the case, however.

What's going on here? We've obviously indexed two movies with "Francis Ford Coppola" as director, and that's what we see in search results as well. However, while ElasticSearch has a JSON object with that data, which it returns to us in search results in the form of the _source property, that's not what it has in its index.

When we index a document with ElasticSearch it (simplified) does two things: it stores the original data untouched for later retrieval in the form of _source and it indexes each JSON property into one or more fields in a Lucene index. During the indexing it processes each field according to how the field is mapped. If it isn't mapped, default mappings depending on the field's type (string, number etc.) are used.

As we haven't supplied any mappings for our index, ElasticSearch uses the default mappings for strings for the director field. This means that in the index the director field's value isn't "Francis Ford Coppola". Instead it's something more like ["francis", "ford", "coppola"].

We can verify that by modifying our filter to instead match "francis" (or "ford" or "coppola"):
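curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "director": "francis" }
            }
        }
    }
}'

This request does return both of the movies directed by Francis Ford Coppola.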

So, what to do if we want to filter by the exact name of the director? We modify how it's mapped. There are a number of ways to add mappings to ElasticSearch: through a configuration file, as part of an HTTP request that creates an index, and by calling the _mapping endpoint.

Using the last approach we could in theory fix the above issue by adding a mapping for the "director" field instructing ElasticSearch not to analyze (tokenize etc.) the field at all when indexing it, like this:

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
   "movie": {
      "properties": {
         "director": {
            "type": "string",
            "index": "not_analyzed"
        }
      }
   }
}'

There are however a couple of issues if we do this. First of all, it won't work, as there already is a mapping for the field that conflicts with the new one.

In many cases it's not possible to modify existing mappings. Often the easiest workaround for that is to create a new index with the desired mappings and re-index all of the data into the new index.
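As a sketch (the index name movies-v2 is made up for illustration), creating such an index with the desired mapping in place could look like this, after which all movies would have to be indexed again into the new index:

curl -XPUT "http://localhost:9200/movies-v2" -d'
{
   "mappings": {
      "movie": {
         "properties": {
            "director": {
               "type": "string",
               "index": "not_analyzed"
            }
         }
      }
   }
}'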

The second problem with adding the above mapping is that, even if we could add it, we would have limited our ability to search in the director field. That is, while a search for the exact value in the field would match we wouldn't be able to search for single words in the field.

Luckily, there's a simple solution to our problem. We add a mapping that upgrades the field to a multi field. What that means is that we'll map the field multiple times for indexing. Given that one of the ways we map it matches the existing mapping, both by name and settings, this will work fine and we won't have to create a new index.

Here's a request that does that:

curl -XPUT "http://localhost:9200/movies/movie/_mapping" -d'
{
   "movie": {
      "properties": {
         "director": {
            "type": "multi_field",
            "fields": {
                "director": {"type": "string"},
                "original": {"type" : "string", "index" : "not_analyzed"}
            }
         }
      }
   }
}'

This time when we try to add the mappings ElasticSearch is happy to do so.

So, what did we just do? We told ElasticSearch that whenever it sees a property named "director" in a movie document that is about to be indexed in the movies index it should index it multiple times: once into a field with the same name (director) and once into a field named "director.original", where the latter field is not analyzed, maintaining the original value and allowing us to filter by the exact director name.

With our new shiny mapping in place we can re-index one or both of the movies directed by Francis Ford Coppola (copy from the list of initial indexing requests above) and try the search request that filtered by director again. Only, this time we don't filter on the "director" field (which is indexed the same way as before) but instead on the "director.original" field:

curl -XPOST "http://localhost:9200/_search" -d'
{
    "query": {
        "constant_score": {
            "filter": {
                "term": { "director.original": "Francis Ford Coppola" }
            }
        }
    }
}'

Executing it shows that it indeed works:

Where to go from here

We've covered quite a lot of things in this article. Still, we've barely scratched the surface of ElasticSearch's goodness.

For instance, there's a lot more to searching with ElasticSearch than we've seen here. We can create search requests where we specify how many hits we want, use highlighting, get spelling suggestions and much more. Also, the query DSL contains many interesting queries and filters that we can use. Then there's of course also a whole range of facets that we can use to extract statistics from our data or build navigations.

As if that wasn't enough, we can go far, far beyond the simple mapping example we've seen here to accomplish wonderful and interesting things. And then there are of course plenty of performance optimizations and considerations. And functionality to find similar content. And, and, and...

But for now, thanks for reading! I hope you found this tutorial useful on your way to discovering the great open source project ElasticSearch.

A book about EPiServer development

After years of blogging about EPiServer CMS I decided it was time to get real.

The cover of the new EPiServer book displayed on an iPhone and a Kindle.

I'm happy to announce a little project that I've been working on lately: a book about how to develop EPiServer CMS sites. I'm publishing it on Leanpub and embracing their motto of publishing early and publishing often. So, while the book is far from done I have published a first version of it, and early buyers will be getting it at a discounted price.

Content - current and planned

This is a project that I've been thinking about doing for several years but haven't quite gotten around to, until now. Despite having contemplated writing the book for years the exact table of contents isn't entirely decided yet, and I hope to get reader feedback to help guide it. However, the plan is roughly to cover:

  • General CMS concepts - I actually started writing this chapter about a year ago, but I'm not yet finished with it :)
  • Installation - "done"
  • Using the CMS - "done"
  • Core classes and concepts - "done"
  • Tools - "done"
  • A hands-on tutorial for building a first, simple site - in progress and a first version is featured in the current version of the book
  • Deep dives into content types, properties and templates
  • Caching and performance
  • How to build large scale sites, such as media sites
  • Different ways to build integrations
  • Extending the CMS in various ways

Whether I'll cover all of the above topics and/or other topics will probably depend on whether there is an interest for the book and on reader feedback.

The book is, and will be, written as a mix of theory and practical tutorials. The examples will be using MVC, primarily as I felt MVC made it easier to focus on EPiServer's API. However, there will be asides/notes describing how to achieve the same thing with Web Forms.

Pricing

Prior to publishing the book interested readers could sign up to be notified when it was published and also suggest a price. The average suggested price was $38, which is actually pretty close to what I initially thought I'd charge for the book. However, I feel that early readers should be rewarded for taking the "risk" of buying an unfinished product and, even more, be rewarded for providing feedback. Therefore I've set the minimum price (buyers can choose their own actual price depending on how much they think it's worth) at $20 and will probably increase that in the future as the book progresses.

An e-book

Books on Leanpub are only available as e-books. If there's a lot of interest in a print version I may look into that in the future, but for now I'm embracing the fact that it's possible to continuously publish an e-book.

Leanpub books are published and made available to readers in PDF, EPUB (for iPad) and MOBI (for Kindle).

Workaround: On Page Editing broken after rendering a block


In patch 3 for EPiServer 7 CMS there's a nasty little bug that manifests itself on sites built using ASP.NET MVC. After rendering a block that doesn't have a controller but instead is rendered using only a partial view (which, as we know, is good for performance), other properties that are rendered using the PropertyFor method may not be editable in OPE mode.

Luckily there's an easy workaround that we can use until the issue is resolved by EPiServer: create custom display templates for the types ContentData and ContentArea. Implement each display template like the original ones shipped by EPiServer, using RenderContentData and RenderContentArea, but wrap the method calls in code that stores the value of ViewContext.RouteData.Values["currentContent"] in a variable prior to invoking the method and restores it afterwards.

/Views/Shared/DisplayTemplates/ContentData.cshtml:

@using EPiServer.Web.Mvc.Html
@model EPiServer.Core.IContentData
@{
    var original = ViewContext.RouteData.Values["currentContent"];
}
@{Html.RenderContentData(Model, false);}
@{
    ViewContext.RouteData.Values["currentContent"] = original;
}

/Views/Shared/DisplayTemplates/ContentArea.cshtml:

@using EPiServer.Web.Mvc.Html
@model EPiServer.Core.ContentArea
@{
    var original = ViewContext.RouteData.Values["currentContent"];
}
@{Html.RenderContentArea(Model);}
@{
    ViewContext.RouteData.Values["currentContent"] = original;
}

Update on my EPiServer book

Two new chapters. One of them is the first!

It has been almost a month since I announced that I'm writing a book about EPiServer 7 development. Since then I've received a lot of helpful feedback and comments. Not unexpectedly the EPiServer development community has again proven its awesomeness to me!

While I've made two minor updates to the book since its initial release, I'm today happy to announce a more major update. In the recently published version (#4) you'll find two brand new chapters. Somewhat ironically, one of those is actually the first chapter, which up until now hadn't been written. And actually, it probably still isn't entirely done.

You see, the first chapter is something that I've been thinking about writing for years. It's not about EPiServer. Instead it's an attempt to take a step back and see how a CMS, such as EPiServer, could come to be. In the chapter we follow a site from its infancy with a single static HTML page and see how it evolves into a fairly respectable CMS-based site.

My hope is that this chapter will provide a gentle, understandable introduction for those new to CMS development in general and to EPiServer development in particular. I also hope that it may provide for some interesting reflection for senior EPiServer developers. However, with such ambitions it's hard to get it right the first time and I hope to improve it based on your feedback.

Anyway, if you've already bought the book, please grab the latest version and give me feedback. If you haven't yet bought the book, now may be a good time ;-)

Grouping in ElasticSearch using child documents

Answering questions such as "average number of orders per customer during November?", which would be easy using GROUP BY in a relational database, isn't always obvious when using ElasticSearch. Here's one solution, using child documents.

This week I ran into a problem with a seemingly simple use case with ElasticSearch (v. 0.90.5). I had data such as this:

Customer ID    Date
john           2013-11-15T12:00:00
jane           2013-11-20T12:00:00
john           2013-12-01T12:00:00

What this data represents isn't very relevant. Depending on the naming of the columns shown above and other data that I've omitted here it could be comments, Facebook likes or just about any other type of event. However, for the sake of argument let's say that it represent orders.

I wanted to index the data, and a lot more of it, into an ElasticSearch index to enable me to answer questions such as:

  • How many orders have been placed?
  • How many orders have been placed during November?
  • How many orders have been placed during the current month compared to the previous month?
  • What is the average number of orders per customer?
  • What was the average number of orders per customer during November?

My initial instinct was to simply index each order as a document. Once I had done so I found that it was easy to answer the first three questions. However, the last two questions, about the average number of orders per customer, weren't so easy.

In fact, it was next to impossible. The only solution that I found (thanks to Henrik) was to search for orders and retrieve a terms facet for Customer ID. Using that I could divide the total number of search hits by the number of terms returned in the facet. However, ElasticSearch doesn't return the total number of unique terms with a terms facet.

So, after having retrieved the search result I would have to look at how many terms were returned in the array myself. While that would work in a scenario where there was only a small number of customers, ElasticSearch wouldn't be very happy returning a huge number of terms in the facet. Not to mention that the response body would be huge.

My next idea was to make two indexing requests for each order, one indexing the order and one indexing, or rather upserting, a document representing the customer. By keeping track of the number of orders each customer had placed I would be able to answer the fourth question, the average number of orders per customer. However, that wouldn't enable me to answer the fifth question, the average number of orders per customer during a specific timeframe.

Child documents to the rescue

After having banged my head against the wall for a while, Googled, and read up on the state of field collapsing in ElasticSearch, I finally found a workable solution: indexing orders and customers separately and mapping orders as child documents to customers. While this does require me to index multiple times it does allow me to answer questions about the average number of orders per customer with a single search request.

To start with I created an index with a parent mapping for orders:

curl -XPUT "http://localhost:9200/orders" -d'
{
    "mappings": {
        "customer": {},
        "order" : {
            "_parent" : {
                "type" : "customer"
            }
        }
    }
}'

Inserting the three orders from the table above can be achieved like this:

curl -XPOST "http://localhost:9200/orders/_bulk" -d'
{ "index" : { "_type" : "customer", "_id" : "john" } }
{ "name" : "John Doe" }
{ "index" : { "_type" : "order", "_parent" : "john" } }
{ "date" : "2013-11-15T12:00:00" }
'

curl -XPOST "http://localhost:9200/orders/_bulk" -d'
{ "index" : { "_type" : "customer", "_id" : "jane" } }
{ "name" : "Jane Doe" }
{ "index" : { "_type" : "order", "_parent" : "jane" } }
{ "date" : "2013-11-20T12:00:00" }
'

curl -XPOST "http://localhost:9200/orders/_bulk" -d'
{ "index" : { "_type" : "customer", "_id" : "john" } }
{ "name" : "John Doe" }
{ "index" : { "_type" : "order", "_parent" : "john" } }
{ "date" : "2013-12-01T12:00:00" }
'

There are two things to note in the requests above. First of all note that each order is indexed with a _parent. Second, I index the customer several times. That could of course be avoided if I could be sure that the customer already existed.

With the mappings done and the data indexed answering the question "What is the average number of orders per customer?" can be done by:

  • Searching across all types.
  • Adding a filter excluding all but orders. Note that this should be done using the filter part of the search request rather than in the query part.
  • Adding a filter facet to the request body. The filter facet filters out all documents except customers and requires each customer to have a child document of type order.

Here's how such a request looks:

curl -XPOST "http://localhost:9200/orders/_search" -d'
{
    "filter": {
        "type": {
            "value": "order"
        }
    },
    "facets": {
        "customer_count": {
            "filter": {
                "and": [{
                  "type": {
                      "value": "customer"
                  }  
                },
                {
                    "has_child" : {
                        "type" : "order",
                        "query" : {
                            "match_all" : {}
                        }
                    }
                }]
            }
        }
    },
    "size": 0
}'

The response from ElasticSearch looks like this:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1,
      "hits": []
   },
   "facets": {
      "customer_count": {
         "_type": "filter",
         "count": 2
      }
   }
}

By dividing the total number of hits by the count returned in the filter facet we get the average number of orders per customer (in this case 3 / 2 = 1.5), the answer to the fourth question. To answer the fifth question, the average number of orders per customer during a specific time frame, we can modify the filter part of the request body, filtering not only by type but also by time frame. By doing so we get the total number of orders during the time frame. However, we also need to limit the number of customers in the same way, so we also modify the filter facet to use a filtered query with the same range filter. Like this:

curl -XPOST "http://localhost:9200/orders/_search" -d'
{
    "filter": {
        "and": [{
            "type": {
                "value": "order"
            }
        },
        {
            "range": {
                "date": {
                    "gte": "2013-11-01T00:00:00",
                    "lt": "2013-12-01T00:00:00"
                }
            }
        }]
    },
    "facets": {
        "customer_count": {
            "filter": {
                "and": [{
                  "type": {
                      "value": "customer"
                  }  
                },
                {
                    "has_child" : {
                        "type" : "order",
                        "query" : {
                            "filtered": {
                                "query": {
                                    "match_all": {}
                                },
                                "filter": {
                                    "range": {
                                        "date": {
                                            "gte": "2013-11-01T00:00:00",
                                            "lt": "2013-12-01T00:00:00"
                                        }
                                    }
                                }
                            }
                        }
                    }
                }]
            }
        }
    },
    "size": 0 }'

Conclusion

Using child documents and indexing pretty much the same data twice as described above solves my use cases. There may be other ways of accomplishing the same result but I've yet to find one and this seems to work well so far. However, if you know of a better/more efficient way of accomplishing the same goals, please let me know!

Error: connect ECONNREFUSED when integration testing a Node.js app


In an express.js app that I was integration testing using Mocha, the tests suddenly started acting weird with complaints such as:

"before all" hook: Error: connect ECONNREFUSED

It turned out the root cause was that a test method that should have been asynchronous wasn't; it was declared as function() instead of function(done). Pretty obvious in hindsight, but with an error message like that it took a while to locate.
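In other words, the fix is to accept and invoke the done callback in the hook. A minimal sketch (app.start is a made-up placeholder for whatever asynchronous setup the suite actually performs):

//Broken: Mocha treats the hook as synchronous and doesn't wait
//for the server to start before running the tests.
before(function() {
    app.start();
});

//Fixed: declaring the done parameter makes the hook asynchronous,
//and calling it signals that the setup has completed.
before(function(done) {
    app.start(done);
});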

A new challenge

Ranging from improving one of the most well visited sites in Sweden to challenges in the "Big Data" space, and from working with iOS and Android apps to architecting a channel independent platform. Why I gave up freelancing and joined Expressen.

Last year I was contacted by Peter Frey. Peter had recently quit his job as CIO at Sweden's largest news site to instead become the new CIO at Sweden's second largest news site, Expressen. His rather inspiring motivation: what could be more challenging than to compete with what he himself had been a part of creating?

As Expressen uses EPiServer CMS, Peter was interested in talking to me regarding consulting. I was working on an amazingly fun project at the time. So, even though Expressen is the world's largest EPiServer site and clearly interesting to work with, I wasn't very interested.

However, Peter and I decided that I would help him a few hours per week with interviewing other candidates, and with technical advice. During this time we got to know each other a little better and I also got to know Jakob Wagner, Expressen's head PO and general pantomath.

Gradually I also came to realize how many interesting things could be done at Expressen and that there was a wealth of interesting challenges for a developer, ranging from improving one of the most well visited sites in Sweden to challenges in the "Big Data" space, and from working with iOS and Android apps to architecting a channel independent platform.

So, as we worked together we got to talking about me joining Expressen as technical head of development/lead developer/CTO. Although I thoroughly enjoyed freelancing, the challenge of being part of a new development organization and having a more long-term perspective was appealing. And so we decided that I would come and work for Expressen as a consultant for a few months and, if both parties felt that it worked well, I would later join the company.

It's been really, really fun and interesting so far. So, this Tuesday I signed an employment contract and I'm thoroughly enjoying this new challenge.

While we've gotten started we're still in the early stages of building a new, kick ass, development organization and we're looking for talented developers/coding architects. So, if you too are interested in a new challenge be sure to drop me a note at joel.abrahamsson@expressen.se.


ElasticSearch - nested mappings and filters

There's one situation where we need to help ElasticSearch to understand the structure of our data in order to be able to query it fully - when dealing with arrays of complex objects.

Arguably one of the best features of ElasticSearch is that it allows us to index and search amongst complex JSON objects. We're not limited to a flat list of fields but can work with object graphs, like we're used to when programming with object oriented languages.

However, there's one situation where we need to help ElasticSearch to understand the structure of our data in order to be able to query it fully - when dealing with arrays of complex objects.

As an example, look at the below indexing request where we index a movie, including a list of the cast in the form of complex objects consisting of actors' first and last names:

curl -XPOST "http://localhost:9200/index-1/movie/" -d'
{
   "title": "The Matrix",
   "cast": [
      {
         "firstName": "Keanu",
         "lastName": "Reeves"
      },
      {
         "firstName": "Laurence",
         "lastName": "Fishburne"
      }
   ]
}'

Given many such movies in our index we can find all movies with an actor named "Keanu" using a search request such as:

curl -XPOST "http://localhost:9200/index-1/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "term": {
               "cast.firstName": "keanu"
            }
         }
      }
   }
}'

Running the above query indeed returns The Matrix. The same is true if we try to find movies that have an actor with the first name "Keanu" and last name "Reeves":

curl -XPOST "http://localhost:9200/index-1/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "term": {
                        "cast.firstName": "keanu"
                     }
                  },
                  {
                     "term": {
                        "cast.lastName": "reeves"
                     }
                  }
               ]
            }
         }
      }
   }
}'

Or at least so it seems. However, let's see what happens if we search for movies with an actor with "Keanu" as first name and "Fishburne" as last name.

curl -XPOST "http://localhost:9200/index-1/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "term": {
                        "cast.firstName": "keanu"
                     }
                  },
                  {
                     "term": {
                        "cast.lastName": "fishburne"
                     }
                  }
               ]
            }
         }
      }
   }
}'

Clearly this should, at first glance, not match The Matrix as there's no such actor amongst its cast. However, ElasticSearch will return The Matrix for the above query. After all, the movie does contain an actor with "Keanu" as first name and an (albeit different) actor with "Fishburne" as last name. Based on the above query it has no way of knowing that we want the two term filters to match the same unique object in the list of actors. And even if it did, the way the data is indexed it wouldn't be able to handle that requirement.

Nested mapping and filter to the rescue

Luckily ElasticSearch provides a way for us to be able to filter on multiple fields within the same objects in arrays: mapping such fields as nested. To try this out, let's create ourselves a new index with the "cast" field mapped as nested.

curl -XPUT "http://localhost:9200/index-2" -d'
{
   "mappings": {
      "movie": {
         "properties": {
            "cast": {
               "type": "nested"
            }
         }
      }
   }
}'

After indexing the same movie document into the new index we can now find movies based on multiple properties of each actor by using a nested filter. Here's how we would search for movies starring an actor named "Keanu Fishburne":

curl -XPOST "http://localhost:9200/index-2/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "nested": {
               "path": "cast",
               "filter": {
                  "bool": {
                     "must": [
                        {
                           "term": {
                              "firstName": "keanu"
                           }
                        },
                        {
                           "term": {
                              "lastName": "fishburne"
                           }
                        }
                     ]
                  }
               }
            }
         }
      }
   }
}'

As you can see we've wrapped our initial bool filter in a nested filter. The nested filter contains a path property where we specify that the filter applies to the cast property of the searched document. It also contains a filter (or a query) which will be applied to each value within the nested property.

As intended, running the above query doesn't return The Matrix, while modifying it to instead match "Reeves" as last name will make it match The Matrix. However, there's one caveat.

Including nested values in parent documents

If we go back to our very first query, filtering only on actors first names without using a nested filter, like the request below, we won't get any hits.

curl -XPOST "http://localhost:9200/index-2/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "term": {
               "cast.firstName": "keanu"
            }
         }
      }
   }
}'

This happens because movie documents no longer have cast.firstName fields. Instead each element in the cast array is, internally in ElasticSearch, indexed as a separate document.

Obviously we can still search for movies based only on first names amongst the cast by using nested filters, though. Like this:

curl -XPOST "http://localhost:9200/index-2/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "nested": {
               "path": "cast",
               "filter": {
                  "term": {
                     "firstName": "keanu"
                  }
               }
            }
         }
      }
   }
}'

The above request returns The Matrix. However, sometimes having to use nested filters or queries when all we want to do is filter on a single property is a bit tedious. To be able to utilize the power of nested filters for complex criteria while still being able to filter on values in arrays the same way as if we hadn't mapped such properties as nested, we can modify our mappings so that the nested values will also be included in the parent document. This is done using the include_in_parent property, like this:

curl -XPUT "http://localhost:9200/index-3" -d'
{
   "mappings": {
      "movie": {
         "properties": {
            "cast": {
               "type": "nested",
               "include_in_parent": true
            }
         }
      }
   }
}'

In an index such as the one created with the above request we'll both be able to filter on combinations of values within the same complex objects in the cast array using nested filters and still be able to filter on single fields without using nested filters. However, we now need to carefully consider where to use, and where not to use, nested filters in our queries, as a query for "Keanu Fishburne" will match The Matrix using a regular bool filter while it won't when wrapped in a nested filter. In other words, when using include_in_parent we may get unexpected results, with queries matching documents that they shouldn't, if we forget to use nested filters.
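To make that caveat concrete, here's the bool filter from before, run against the new index (assuming The Matrix has been indexed into index-3). It matches The Matrix even though no single actor is named "Keanu Fishburne":

curl -XPOST "http://localhost:9200/index-3/movie/_search" -d'
{
   "query": {
      "filtered": {
         "query": {
            "match_all": {}
         },
         "filter": {
            "bool": {
               "must": [
                  {
                     "term": {
                        "cast.firstName": "keanu"
                     }
                  },
                  {
                     "term": {
                        "cast.lastName": "fishburne"
                     }
                  }
               ]
            }
         }
      }
   }
}'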

Dynamic mappings and dates in ElasticSearch

JSON doesn't have a date type. Yet ElasticSearch can automatically map date fields for us. While this "just works" most of the time, it can be a good idea to help ElasticSearch help us by instead using naming conventions for dates. Here's why, and how.

ElasticSearch has a feature called dynamic mapping which is turned on by default. Using this we don't have to explicitly tell ElasticSearch how to index and store specific fields. Instead ElasticSearch figures it out itself by inspecting the content of our JSON properties.

Let's look at an example.

curl -XPOST "http://localhost:9200/myindex/tweet/" -d'
{
    "content": "Hello World!",
    "postDate": "2009-11-15T14:12:12"
}'

Given that there isn't already an index named "myindex", the above request will cause a number of things to happen in our ElasticSearch cluster.

  1. An index named "myindex" will be created.
  2. Mappings for a type named tweet will be created for the index. The mappings will contain two properties, content and postDate.
  3. The JSON object in the request body will be indexed.

After having made the above request we can inspect the mappings that will have been automatically created with the below request.

curl -XGET "http://localhost:9200/myindex/_mapping"

The response looks like this:

{
   "myindex": {
      "mappings": {
         "tweet": {
            "properties": {
               "content": {
                  "type": "string"
               },
               "postDate": {
                  "type": "date",
                  "format": "dateOptionalTime"
               }
            }
         }
      }
   }
}

As we can see in the above response, ElasticSearch has mapped the content property as a string and the postDate property as a date.  All is well.

However, let's look at what happens if we delete the index and modify our indexing request to instead look like this:

curl -XPOST "http://localhost:9200/myindex/tweet/" -d'
{
    "content": "1985-12-24",
    "postDate": "2009-11-15T14:12:12"
}'

In the above request the content property is still a string, but the only content of the string is a date. Retrieving the mappings now gives us a different result.

{
   "myindex": {
      "mappings": {
         "tweet": {
            "properties": {
               "content": {
                  "type": "date",
                  "format": "dateOptionalTime"
               },
               "postDate": {
                  "type": "date",
                  "format": "dateOptionalTime"
               }
            }
         }
      }
   }
}

ElasticSearch has now inferred that the content property also is a date. If we now try to index our original JSON object we'll get an exception in our faces.

{
   "error": "MapperParsingException[failed to parse [content]]; nested: MapperParsingException[failed to parse date field [Hello World!], tried both date format [dateOptionalTime], and timestamp number with locale []]; nested: IllegalArgumentException[Invalid format: \"Hello World!\"]; ",
   "status": 400
}

We're trying to insert a string value into a field which is mapped as a date. Naturally ElasticSearch won't allow us to do that.

While this scenario isn't very likely to happen, when it does it can be quite annoying and cause problems that can only be fixed by re-indexing everything into a new index. Luckily there are a number of possible solutions.

Disabling date detection

As a first step we can disable date detection for dynamic mapping. Here's how we would do that explicitly for documents of type tweet when creating the index:

curl -XPUT "http://localhost:9200/myindex" -d'
{
   "mappings": {
      "tweet": {
         "date_detection": false
      }
   }
}'

We then index our "problematic" tweet again:

curl -XPOST "http://localhost:9200/myindex/tweet/" -d'
{
    "content": "1985-12-24",
    "postDate": "2009-11-15T14:12:12"
}'

When we now inspect the mappings that have been dynamically created for us we see a different result compared to before:

{
   "myindex": {
      "mappings": {
         "tweet": {
            "date_detection": false,
            "properties": {
               "content": {
                  "type": "string"
               },
               "postDate": {
                  "type": "string"
               }
            }
         }
      }
   }
}

Now both fields have been mapped as strings, which they indeed are, even though they contain values that can be parsed as dates. However, this isn't good either, as we'd like the postDate field to be mapped as a date so that we can use range filters and the like on it.

Explicitly mapping date fields

We can explicitly map the postDate field as a date by re-creating the index and include a property mapping, like this:

curl -XPUT "http://localhost:9200/myindex" -d'
{
   "mappings": {
      "tweet": {
         "date_detection": false,
         "properties": {
             "postDate": {
                 "type": "date"
             }
         }
      }
   }
}'

If we now index our "problematic" tweet with a date in the content field we'll get the desired mappings; the content field mapped as a string and the postDate field mapped as a date. That's nice. However, this approach can be cumbersome when dealing with many types or types that we don't know about prior to documents of those types being indexed.

Mapping date fields using naming conventions

An alternative approach to disabling date detection and explicitly mapping specific fields as dates is to instruct ElasticSearch's dynamic mapping functionality to adhere to naming conventions for dates. Take a look at the below request that (again) creates an index.

curl -XPUT "http://localhost:9200/myindex" -d'
{
   "mappings": {
      "_default_": {
         "date_detection": false,
         "dynamic_templates": [
            {
               "dates": {
                  "match": ".*Date|date",
                  "match_pattern": "regex",
                  "mapping": {
                     "type": "date"
                  }
               }
            }
         ]
      }
   }
}'

Compared to our previous requests used to create an index with mappings this is quite different. First of all we no longer provide mappings for the tweet type. Instead we provide mappings for a type named _default_. This is a special type whose mappings will be used as the default "template" for all other types.

As before we start by disabling date detection in the mappings. However, after that we no longer provide mappings for properties but instead provide a dynamic template named dates.

Within the dates template we provide a pattern and specify that the pattern should be interpreted as a regular expression. Using this the template will be applied to all fields whose names either end with "Date" or whose names are exactly "date". For such fields the template instructs the dynamic mapping functionality to map them as dates.

Using this approach all string fields, no matter if their values can be parsed as dates or not, will be mapped as strings unless the field name is something like "postDate", "updateDate" or simply "date". Fields with such names will be mapped as dates instead.

While this is nice, there's one caveat. Indexing a JSON object with a property matching the naming convention for date fields but whose value can't be parsed as a date will cause an exception. Still, adhering to naming conventions for dates may be a small price to pay compared to the headaches of seemingly randomly having string fields mapped as dates simply because the first document to be indexed of a specific type happened to contain a string value that could be parsed as a date.
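
To illustrate the caveat, given an index created with the dynamic template above, a request such as the following, where the postDate value isn't a valid date, would be rejected with a MapperParsingException much like the one we saw earlier. The values here are made up purely for illustration:

curl -XPOST "http://localhost:9200/myindex/tweet/" -d'
{
    "content": "Hello World!",
    "postDate": "not a date"
}'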

Notes from learning Go - the basics

I recently decided to learn Go. As in Go the programming language, also known as golang. These are my notes from doing so. In the form of code.

Hello World

package main

//Here's a comment

import "fmt"

func main() {
    fmt.Println("Hello world")
}

Try it.

Variables and types

package main

import "fmt"

func main() {
    // Variables are declared using "var [name] [type]"
    var myVariable string
    myVariable = "I'm a string"
    myVariable += " and I was declared the long way"
    fmt.Println(myVariable)
    
    var length int = len(myVariable)
    fmt.Println("I'm this long:", length)

    // Using short variable declaration, omitting the var keyword
    mySecondVariable := "I'm also a string but I was declared the short way"
    fmt.Println(mySecondVariable)

    const constant = "I'm a constant with inferred type"
    fmt.Println(constant)

    var (
        var1 = "This is"
        var2 = "multiple"
        var3 = "variables"
    )

    fmt.Println(var1, var2, var3)
}

Try it.

If statements and loops

package main

import "fmt"

func main() {
    i := 1
    for i <= 5 {
        fmt.Println(i)
        i += 1
    }

    for i := 6; i <= 10; i++ {
        if i % 2 == 0 {
            fmt.Println(i, "even")
        } else if i % 3 == 0 {
            fmt.Println(i, "divisible by 3")
        } else {
            fmt.Println(i, "odd and not divisible by 3")
        }
    }
}

Try it.

Switch

package main

import "fmt"

func main() {
    i := 3

    switch i {
        case 1: fmt.Println("One")
        case 2: fmt.Println("Two")
        case 3: fmt.Println("Three")
        default: fmt.Println("Uhmn, not sure")
    }

    switch i%2 {
        case 0: fmt.Println("Even")
        default: fmt.Println("Odd")
    }
}

Try it.

Arrays

package main

import "fmt"

func main() {
    var myArray [5]string
    myArray[2] = "I'm the third item"
    fmt.Println(myArray)
    myArray[0] = "Woho, I'm the first item!"
    fmt.Println(myArray)
    fmt.Println("Array length:", len(myArray))

    var intArray [3]int
    fmt.Println("\"Empty\" int array:", intArray)

    intArray = [3]int{1, 2}
    fmt.Println("Integer array with the two first values set:", intArray)

    var messageParts = [3]string{
        "Hello ",
        "there",
        "!",
    }

    var message string
    for i := 0; i < len(messageParts); i++ {
        message += messageParts[i]
    }
    fmt.Println(message)

    // Using the range keyword. The underscore tells the compiler that we don't need the value (index in this case)
    var secondMessage string
    for _, value := range messageParts {
        secondMessage += value
    }
    fmt.Println(secondMessage)

    for idx, _ := range messageParts {
        fmt.Println("Index:", idx)
    }
}

Try it.

Slices

package main

import "fmt"

func main() {
    var myEmptySlice []string
    fmt.Println(myEmptySlice)

    sliceAssociatedWithArray := make([]int, 5)
    fmt.Println(sliceAssociatedWithArray)

    array := [5]int{1, 2, 3, 4, 5}
    middleSlice := array[1:4]
    fmt.Println(middleSlice)
    
    array[2] = 42
    fmt.Println(middleSlice)

    middleSlice[0] = 41
    fmt.Println(array)

    messageSlice := []string{"hello", "there"}
    messageSlice = append(messageSlice, "how", "are", "you")
    fmt.Println(messageSlice)

    helloOnlySlice := make([]string, 2)
    copy(helloOnlySlice, messageSlice)
    fmt.Println(helloOnlySlice)
}

Try it.

Maps

package main

import "fmt"

func main() {
    var myMap map[string]string
    myMap = make(map[string]string)

    myMap["hello"] = "hej"
    fmt.Println(myMap)
    fmt.Println(len(myMap))
    fmt.Println(myMap["hello"])

    delete(myMap, "hello")
    fmt.Println(myMap)
    fmt.Println(len(myMap))
    fmt.Println(myMap["hello"])

    // Maps, no matter if they have been initialized or not, return the
    // zero value for the type if they key doesn't have a value
    var ints map[string]int
    fmt.Println(ints["nothing"]) // Outputs 0
    // Accessing an element can return two values where the second is
    // the result of the lookup
    value, exists := ints["nothing"]
    fmt.Println(value) // Still 0
    fmt.Println(exists) // false

    // Short way of creating maps
    lengths := map[string]int {
        "hello": 5,
        "there": 5,
        "!": 1,
    }
    fmt.Println(lengths)

    // Iterating over elements
    for key, value := range lengths {
        fmt.Println(key)
        fmt.Println(value)
    }
}

Try it.

Functions

package main

import "fmt"

func main() {
    secondFunction()
    thirdFunction("Hello yourself!")
    fourthFunction("How", "are", "you", "today?")
    fmt.Println(fifthFunction())
    fmt.Println(sixthFunction())
    part1, part2 := seventhFunction()
    fmt.Println(part1, part2)
    fmt.Println(eightFunction()())
}

func secondFunction() {
    fmt.Println("Hello world!")
}

func thirdFunction(message string) {
    fmt.Println(message)
}

func fourthFunction(messages ...string) {
    for index, message := range messages {
        fmt.Print(message)
        if index < len(messages)-1 {
            fmt.Print(" ")
        } else {
            fmt.Println()
        }
    }
}

func fifthFunction() string {
    return "Hello world as return value"
}

func sixthFunction() (message string) {
    message = "Hello world as named return value"
    return message
}

func seventhFunction() (string, string) {
    return "Hello world", "from multiple return values"
}

func eightFunction() func() string {
    message := "Hello world from closure"
    return func() string {
        return message
    }
}

Try it.

Pointers

package main

import "fmt"

func main() {
    message := "I'm the original, and best"

    modifier1(message)
    fmt.Println(message) // "I'm the original, and best"

    // The & operator finds the address for a variable, meaning that &message returns a *string
    modifier2(&message)
    fmt.Println(message) // "modifier2 was here"

    secondMessage := new(string) // the new operator can be used to create pointers
    modifier2(secondMessage)
    fmt.Println(*secondMessage) // "modifier2 was here"
}

func modifier1(message string) {
    message = "modifier1 was here"
}

func modifier2(message *string) {
    *message = "modifier2 was here"
}

Try it.

Structs

package main

import "fmt"

func main() {
    rect := Rectangle{width: 10, height: 5}
    fmt.Println("Height:", rect.height, "Width:", rect.width)
    fmt.Println("Area:", area(rect))
}

type Rectangle struct {
    width, height int
}

func area(rect Rectangle) int {
    return rect.width*rect.height
}

Try it.

Methods

package main

import "fmt"

func main() {
    rect := Rectangle{width: 10, height: 5}
    fmt.Println("Area:", rect.area())
}

type Rectangle struct {
    width, height int
}

func (rect Rectangle) area() int {
    return rect.width*rect.height
}

Try it.

Embedding

package main

import "fmt"

func main() {
    bird := Bird{}
    bird.Animal.Eat()
    bird.Eat()
    bird.Fly()
}

type Animal struct {
}

func (animal *Animal) Eat() {
    fmt.Println("I'm eating")
}

type Bird struct {
    Animal
}

func (bird *Bird) Fly() {
    fmt.Println("I'm flying")
}

Try it.

Defer

package main

import "fmt"

func main() {
    defer first()
    second()
}

func first() {
    fmt.Println("First function here")
}

func second() {
    fmt.Println("Second function here")
}

Try it.

Panic and Recover

package main

import "fmt"

func main() {
    defer func() {
        message := recover()
        fmt.Println(message)
    }()

    panic("Something is seriously wrong")
}

Try it.

Book announcement: ElasticSearch Quick Start

I'm happy to announce a book that I've been meaning to write for quite some time - An introduction to ElasticSearch for developers in tutorial form.

About 18 months ago I was working on a customer project in which ElasticSearch was a major component. While I had quite a lot of prior experience with ElasticSearch something at this time got me thinking about the lack of good, simple, tutorials for helping developers new to ElasticSearch get started. So, one night I sat down and I wrote such a tutorial, a blog post titled "ElasticSearch 101". Since it was first published it has accounted for 50-70% of the total traffic to my site and it has received a lot of positive feedback in the comments.

While the positive reception of ElasticSearch 101 is great I've always felt that there's more I'd like to say on the topic. While a number of extensive books about ElasticSearch have been published over the last couple of years, and I've written more blog posts about it myself, I've always wanted to write a longer version of ElasticSearch 101 in the form of a book. A book that would be shorter than the typical technical book but more detailed and covering more topics than the original blog post.

For 18 months I made excuses for myself for not writing this book. However, over this Christmas holiday I finally started writing. Wiser from the experience of my previous book (about EPiServer CMS) I decided that I a) wouldn't strive to make each chapter perfect prior to writing the next and b) wouldn't publish it before all chapters were done (in their first versions). This weekend I finished the (probably) last chapter and hit the publish button on Leanpub.

The book is titled ElasticSearch Quick Start. It's not a hundred percent done yet, but it is pretty much finished in terms of the topics that it will cover. What remains is making it *perfect* based on reader feedback, possibly rewriting some parts and by adding more examples.

If you're interested in ElasticSearch check it out. And, if you decide to buy it please give me feedback! I'd really like the book to be the ultimate way to learn how to be productive with ElasticSearch quickly and in order to do that reader feedback is essential.

Responsibly Responsive Web Design at Expressen

When building a new version of the news site Expressen.se the team hesitated to use responsive web design. Here's why and what we ended up doing.

Expressen.se is the second largest news site in the Nordic region with millions of page views per day. During the fall of 2017 we set out to rebuild the site from scratch with a focus on building as fast a site as possible. You can read more about how and why we did this in Simon Hjälmefjord's blog post (in Swedish).

Another objective that we had when building the new version of the site was to end up with a single code base for the site. Prior to the rebuild we had different versions of the site for different channels and these were distributed over multiple applications. More specifically we had:

  • A fixed width version for computers.
  • A partially responsive version for tablets.
  • A fluid version for smart phones.

The first two of the above versions were served by one application, Ariel. Ariel naturally had functionality to handle the fact that it was serving two versions, including channel specific templates. The third version above, the mobile site, was served by a different application.

This division of labor was not without benefits. We could optimize for each channel's unique pre-requisites as well as perform experiments on subsets of the total traffic and code base. However, the fact that we had three different versions scattered across two code bases also brought some obvious drawbacks. The two code bases diverged heavily from day one and most developers tended to only know one of them well. It was also difficult to maintain a coherent user experience and design across the different channels.

Another problem was that we quite often implemented new functionality for only one of the channels. That wasn't necessarily always a bad thing as the functionality may only make sense for one type of device. In other cases it would make sense for all channels but we decided to test it on a single channel first. However, sometimes we really meant to, and/or wanted to, have the new functionality in all channels but after having built it for a single channel and having spent some time measuring its impact we had moved on to other features and it was never implemented for the remaining channels.

So, when we were about to rebuild the site the entire team was very much in agreement that we wanted a single code base for the site. However, while we had no doubts about wanting a single code base for all channels, we were a lot less convinced about how to accomplish that. Should we build an application that could handle three different channels, with separate views etc., or should we build a responsive site?

It's the 21st century; of course we should build a responsive site! Shouldn't we?

These days it seems like a non-question whether to use responsive web design or not. Responsive web design has become the de facto standard and all of our competitors had already built responsive sites. However, we wanted to build the best and fastest site possible and asked ourselves the question: is responsive web design really the best for us? 

Responsive web design, implemented using media queries etc., brings a number of significant advantages:

  • Coherent design across all device types and screen sizes is the default and lapses from that require conscious decisions.
  • A single code base for all channels is obvious and more or less a requirement.
  • Compared to designing, developing and maintaining separate versions for different channels building a responsive site is cost efficient.

While the above advantages are significant it's interesting to note that none of them have any direct positive impact on the end user's experience. For the end user there are few benefits associated with visiting a responsive site.

On the other hand there are (often) drawbacks in the form of additional HTML markup, CSS and JavaScript that the visitor's browser has to download, parse and interpret. Sometimes there are entire blocks of HTML and CSS which are never even shown to the user. There may also be custom JavaScript code and modules that are only relevant in a specific channel or which are there specifically to manage the responsive nature of the site.

In other words: to build a responsive site means taking the risk of serving a slower site to the end user compared to a site which is optimized for the user's device type. It's quite obvious really; when we build a responsive site we force the user's browser to deal with the problem of adapting the content depending on device and viewport width as well as using different layouts in different contexts.

There are of course various creative ways of minimizing the performance impact but the end result tends to be less than perfect. At the same time some of the cost efficiency benefits of building a responsive site tend to get lost when we spend time on optimizing the performance of the site.

Having the cake while eating it too

So, what should we do? We wanted the development related benefits of building a responsive site without the performance penalties for our visitors. Therefore we decided to try to have the cake while eating it too. We decided to try an approach that we came to call "responsibly responsive".

When I as a developer run the site on my local machine or when a UX designer views the site on a test environment what we see is a responsive site. When we design and develop we work with a responsive site.

However, when you as a public visitor load the site you only get the HTML, CSS and JavaScript that is required for your device type. Every trace of the site's responsiveness, except for code required to give you a good experience within the context of your device type, is gone.

This way we achieve exactly what we want; the many development related benefits of building a responsive site at the same time as our visitors get a performance optimized experience tailored for the type of device that they are using.

The solution

In order to build a responsive site where all traces of its responsive nature are cleaned out when it faces public visitors we had to tackle a number of problems:

  • Channel detection
  • Filtering of HTML
  • Filtering of CSS
  • Filtering of JavaScript 

Channel detection

In order to detect what channel a visitor is in we're using device detection in our CDN (Akamai). When a visitor navigates to www.expressen.se the request is routed to Akamai, which inspects the request's user agent and determines whether the user is using a mobile phone, a tablet or a computer. If Akamai doesn't already have the appropriate version of the content in its cache it proceeds to make a request to our servers. This request contains a header telling our servers what type of device the original request was made from.

If our application finds that header in an incoming request it knows that it should serve a version of the site optimized for the device type specified in the header. If on the other hand a request doesn't contain the header our application knows that the request isn't external and will proceed to serve the responsive version of the site.

Using a CDN for device detection is convenient for us but by no means a requirement. If we hadn't been using a CDN or other form of caching layer outside our application we could have implemented the same functionality in our app by inspecting each request's  user agent.
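
As a rough sketch, the application side of this could look something like the Express-style middleware below. The header name (x-device-type) and the exact shape of the code are assumptions made for illustration, not a description of the actual implementation:

module.exports = function channelDetection(req, res, next) {
  // Header set by the CDN (or manually by a developer) for external requests
  const channel = req.headers["x-device-type"];

  // No header means the request isn't external, so serve the responsive version
  req.channel = channel || "responsive";
  next();
};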

HTML filtering

When building a responsive site one can choose between two different approaches for handling how elements should be displayed depending on view port size in HTML and CSS. With the first approach, which is often preferable, the HTML markup doesn't know that it's responsive and instead lets the CSS handle how elements should be positioned and look depending on screen size etc. In the other approach, which Twitter's Bootstrap is an example of, one uses helper CSS classes in the HTML markup to decide whether an element should be displayed or not (or how it should be displayed) depending on browser size.

If we had built a site that would be responsive when it faced public visitors we would probably have used the first method. However in our case we wanted to make it easy to clean up unnecessary HTML elements and for that the second approach, using helper classes, was better. This means that when I look at the site's HTML on my local computer, without channel filtering, a small sample of it can look like this:

<aside class="site-body__column-3 hidden-mobile lp_right">
...
</aside>

When I instead browse the public version of the site using a computer or tablet the same HTML block looks like the below snippet. Note the absence of the hidden-mobile CSS class. 

<aside class="site-body__column-3 lp_right">
...
</aside>

If I make the same request using a mobile phone the entire HTML block would instead be missing.

In order to accomplish this we first let our application render the responsive HTML with support for all channels and helper classes included. After that, just before the app is about to respond to the incoming HTTP request, a middleware (we use Node.JS) kicks in. The middleware inspects the incoming request and looks for a header (set by Akamai or manually by a developer) containing information about what channel is requested.

If the middleware finds such a header it proceeds to pass the generated markup through a parser. The parser removes all elements that, according to helper classes, shouldn't be shown for the requested channel. The parser also removes all helper classes.
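
To make the helper class convention concrete, here is a small, simplified sketch of the class handling for a single element. It only covers the hidden-* classes shown in the markup examples above and is not the actual parser, which is built on top of htmlparser2 and handles more cases:

// Returns null if the element should be dropped for the given channel,
// otherwise the class attribute with all helper classes stripped.
function filterClasses(classAttribute, channel) {
  const classes = classAttribute.split(/\s+/).filter(Boolean);

  if (classes.includes("hidden-" + channel)) {
    return null;
  }

  return classes.filter((cls) => !/^(hidden|visible)-/.test(cls)).join(" ");
}

// filterClasses("site-body__column-3 hidden-mobile lp_right", "mobile") -> null
// filterClasses("site-body__column-3 hidden-mobile lp_right", "tablet") -> "site-body__column-3 lp_right"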

Generating markup only to then parse and rebuild it may sound like an expensive operation. However, our parser, which is built on top of htmlparser2, is simple and pretty fast and only adds a few milliseconds to the total response time. Every millisecond counts, but in practice most of our requests are served directly from our CDN's cache so only a small fraction of all requests are affected by the minor overhead added by the parser.

CSS filtering

We write our CSS (actually Stylus) code just as we would if we were working on a regular responsive site, with one small exception; in cases where we need to use media queries to adapt for different device types and view port sizes we always do that using helper functions. A fictitious example may look like this:

.myElement {
  display: block;
  +mqMinWidth(960px) {
    color: red;
  }
}

The above code example says that elements with the myElement class should always have display: block in all channels. It also says that if the view port is 960 pixels or wider, text inside such elements should be red.

When the CSS is built (from Stylus) we create four different versions of it; one for each channel (computers, tablets, phones) and one responsive version. In the responsive version the result of the above Stylus code is:

.myElement {
  display:block
}
@media (min-width:960px) {
  .myElement { color:red }
}

In the CSS built for mobile phones the result is instead:

.myElement {
  display:block
}

In the CSS for tablets, where the view port may or may not be wider than 960 pixels, the result is the same as in the responsive version:

.myElement {
  display:block
}
@media (min-width:960px) {
  .myElement { color:red }
}

For computers we have a minimum width for the site that is above 960 pixels. Therefore the CSS for computers looks like this:

.myElement {
  display:block;
  color:red
}

To summarize; after building our Stylus code we get four different CSS files. One of these makes the site responsive while the other three are heavily optimized for a specific type of device. The decision of what CSS file should be used is handled by our HTML filtering functionality. In practice the markup that includes CSS files looks like the below example prior to HTML filtering.

<link class="hidden-tablet hidden-desktop hidden-responsive" 
  href="/styles/main.mobile.css">
<link class="hidden-mobile hidden-desktop hidden-responsive"
  href="/styles/main.tablet.css">
<link class="hidden-mobile hidden-tablet hidden-responsive"
  href="/styles/main.desktop.css">
<link class="visible-responsive"
  href="/styles/main.responsive.css">

After HTML filtering, in this case for tablets, the above markup is reduced to:

<link href="/styles/main.tablet.css"> 

JavaScript filtering

Last but not least we need to build channel optimized JavaScript files. The principle is the same as for CSS. We build four different JavaScript bundles, one for the responsive mode and one for each channel. Then we let the HTML filtering decide which one should be used.

Unlike the CSS it's pretty rare that we do anything channel specific in our JavaScript code. In those rare cases though we can do so by inspecting the value of a number of global variables which tell us what channel the script is executing in.

One example of channel filtering in JavaScript is the call to the function that displays a button for opening Expressen's iOS or Android app. This button should only be visible on devices where it's possible to run the app, i.e. tablets and phones. This is how the code for handling the call to the function looks in the responsive bundle:

if (!CHANNEL_DESKTOP) { 
  openInApp();
}

In the responsive scenario our JavaScript bundle contains functionality that populates, among others, CHANNEL_DESKTOP based on the user's current view port. In the event that the user resizes the browser window the variable's value is changed and events that other code can listen to are triggered.

For a specific channel the JavaScript code is processed, with the help of Vanilla shake, and if-statements that check which channel the user is in are removed. The code inside such if-statements is either removed or left in place depending on the condition in the if-statement and what channel bundle is being built.
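
Conceptually, for the openInApp example above, the per-channel processing boils down to something like this (an illustration of the idea rather than the literal bundle output):

// Responsive bundle: CHANNEL_DESKTOP is populated at runtime from the view port
if (!CHANNEL_DESKTOP) {
  openInApp();
}

// Mobile and tablet bundles: the condition is known to be true at build time,
// so only the call remains
openInApp();

// Desktop bundle: the condition is known to be false at build time,
// so both the if statement and the call are removed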

The result

When we first started discussing the idea of creating a responsive site that was performance optimized when facing public visitors many of us in the team were skeptical. Would this really work in practice? However, the potential benefits if it did work were very attractive so we decided to give it a try.

Initially there was some effort required to create the components and functionality that I've briefly described in this post. After that initial investment each individual component and the solution in its entirety has worked well.

It has also brought the benefits that we initially hoped for in terms of ways of working and thinking, although we've sometimes caught ourselves thinking in channel specific ways. That may be due to us coming from having worked with channel specific versions previously though.

In terms of performance the solution has been a hit. Visitors to expressen.se only receive the HTML, CSS and JavaScript that is actually needed for the type of device that they are using. Below is an example of a Lighthouse audit of the responsive version of the site, without channel filtering:

Given that the start page of Expressen that is measured here is very long and contains a lot of content the above result isn't exactly bad. However, let's take a look at the result of the same audit performed against the same page but with channel filtering active. Especially note the difference in KB under the "Unused CSS rules" metric.

Quickly creating and mapping an array in JavaScript


When, for instance, creating test data one might do something like this:

const data = [];
for (let i = 0; i < 100; i++) {
  data.push({ num: i });
}

This may be written slightly shorter and more wrist friendly using Array.fill:

const data = Array(100).fill(null).map((val, i) => {
  return { num: i };
});
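
Another option worth mentioning is Array.from, which takes a mapping function directly and avoids the fill step:

const data = Array.from({ length: 100 }, (val, i) => {
  return { num: i };
});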

 

Quickly mapping an array of URLs to responses with JavaScript async/await and Promise.all


While perhaps not the most readable, a compact version (using window.fetch) can look like this:

const urls = [
  "https://jsonplaceholder.typicode.com/comments/1",
  "https://jsonplaceholder.typicode.com/comments/2",
  "https://jsonplaceholder.typicode.com/comments/3"
];

async function fetchAll() {
  const results = await Promise.all(urls.map((url) => fetch(url).then((r) => r.json())));
  console.log(JSON.stringify(results, null, 2));
}

fetchAll();

Flatten array of arrays with JavaScript

const arrays = [[1], ["2"], [3]];

const merged = [].concat(...arrays);

console.log(merged); // [ 1, '2', 3 ]

Pre ES6:

var merged = [].concat.apply([], arrays);
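
In environments that support ES2019 there is also Array.prototype.flat, which flattens one level by default:

const flattened = arrays.flat();

console.log(flattened); // [ 1, '2', 3 ]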

Exception order when awaiting multiple async tasks in C#

A C# has-been returns to C# and experiments with this new hip thing called async/await and how that relates to execution order and exceptions.

I recently returned to .NET and C# development after a six year hiatus in Node.JS land. A lot has changed since I last wrote C#, although Jon Skeet still writes the best books.

One feature that I never got to use during my previous C# development days was async/await. However, I'm very familiar with the concept, having written a lot of JavaScript with promises and async/await. One thing that I didn't intuitively know in C#, though, was in what order exceptions are thrown, or rather caught, when awaiting multiple Tasks. Therefore I created this little experiment:

using System;
using System.Threading;
using System.Threading.Tasks;

namespace AsyncExceptions
{
  class Program
  {
    async static Task Main(string[] args)
    {
      try
      {
        var one = Do("One", 500);
        var two = Do("Two", 500);

        await one;
        await two;
      }
      catch (Exception ex)
      {
        Console.WriteLine(ex.Message);
      }
    }

    public static Task Do(string name, int time)
    {
      return Task.Run(() =>
      {
        Console.WriteLine($"Task {name} starting");
        Thread.Sleep(time);
        Console.WriteLine($"Task {name} pre exception");
        throw new Exception($"Exception {name}");
      });
    }
  }
}

The above program will output five lines in total. Two lines will be printed for when each task starts to run. Two additional lines will be outputted when each task has waited for 500 milliseconds and is about to throw. Finally there will be a line telling us which exception was caught. Can you guess what the output will be?

As it turned out in my experiments the first four lines are quite random. Sometimes the first task starts and completes first, sometimes it's the other way around. Here are three sample outputs (the final line omitted for now):

Sample 1:

Task One starting
Task Two starting
Task Two pre exception
Task One pre exception

Sample 2:

Task One starting
Task Two starting
Task One pre exception

Sample 3:

Task Two starting
Task One starting
Task Two pre exception
Task One pre exception

What about the final line? That always reads `Exception One`. Meaning that although the second exception may sometimes be thrown first in the async context, back in our main thread we'll always catch the exception from the task that we await first. That holds true even if we make the second task throw much sooner by modifying the Main method to look like this:

var one = Do("One", 500);
var two = Do("Two", 1);

Of course, in hindsight of this experiment the result makes sense. While the second async operation may complete and be ready to return to the initiating thread first, in that thread we are waiting for the first task to complete before caring about the result of the second one.
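
As an aside, if we want to observe both exceptions rather than just the one from the task we await first, Task.WhenAll can be used. A minimal, hypothetical sketch of how the body of Main could look:

var one = Do("One", 500);
var two = Do("Two", 1);
var both = Task.WhenAll(one, two);

try
{
  await both;
}
catch (Exception ex)
{
  // The await surfaces a single exception...
  Console.WriteLine(ex.Message);
  // ...but both are available on the combined task's AggregateException.
  Console.WriteLine(both.Exception.InnerExceptions.Count); // 2 in this case
}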

JavaScript

So what about the equivalent code in JavaScript? It could look something like this:

"use strict";

async function main() {
    try {
        const one = Do("One", 500);
        const two = Do("Two", 500);

        await one;
        await two; 
    } catch(ex) {
        console.log(ex.message)
    }
}

main();

function Do(name, time) {
    return new Promise((resolve, reject) => {
        console.log(`Task ${name} starting`)
        setTimeout(() => {
            console.log(`Task ${name} pre exception`);
            reject(new Error(`Exception ${name}`))
        }, time);
    })
}

Here the output is more consistent. In all my attempts the output read like this:

Task One starting
Task Two starting
Task One pre exception
Task Two pre exception
Exception One
(node:6564) UnhandledPromiseRejectionWarning: Error: Exception Two

The first task is put in queue for execution on the event loop first and therefore is executed and throws first. We catch the first exception and then get an angry error message due to us not catching the second exception. Clearly Promise.all would be a good solution for that.
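
For completeness, a minimal sketch of that approach inside main, reusing the same Do function:

try {
    await Promise.all([Do("One", 500), Do("Two", 500)]);
} catch (ex) {
    // Promise.all rejects with the first rejection to occur, and since it
    // attaches handlers to both promises there is no unhandled rejection warning.
    console.log(ex.message);
}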

But what if we change the second task to complete much faster than the first one?

const one = Do("One", 500);
const two = Do("Two", 1);

Then the output is this:

Task One starting
Task Two starting
Task Two pre exception
(node:6577) UnhandledPromiseRejectionWarning: Error: Exception Two
Task One pre exception
Exception One

The second task throws much earlier and that results in an UnhandledPromiseRejectionWarning. Then the execution continues and the exception thrown by the first, and first awaited, task is caught. While I find this fascinating I think I'll leave the "why!?" to a different blog post.

Why is the async keyword needed in JavaScript?

Last week a colleague asked me what the purpose of the async keyword was in JavaScript. Not because he didn't know how to use async/await. He was wondering why the async keyword was needed to use await. This set me off on a quest to find the answer. To avoid beating around the bush: its purpose is backwards compatibility.

Any experienced JavaScript developer is probably well versed in the usage of async/await these days. The await keyword unwraps a Promise. Until the Promise is either fulfilled or rejected the execution of the code after it is put on hold and the event loop is released to do other work. But why is the async keyword needed in order for us to use await in a function?

A common misconception seems to be that it "is used to communicate that the function is async" to anyone who uses it. However, that's not true. Consider the following code:

export async function fetchTextLength() {
    const resp = await fetch("https://www.wikipedia.org");
    const text = await resp.text();
    return text.length;
}

The function above is marked async, but as a consumer of it that's hardly relevant. What is relevant is that it returns a Promise. Indeed, if we hover over it in a file that uses it in VS Code that's what we see:

Further, I think we can all agree that to a consumer of the above module it shouldn't matter if we decide to change the module to instead have the following implementation:

export function fetchTextLength() {
    return inner();
}

async function inner() {
    const resp = await fetch("https://www.wikipedia.org");
    const text = await resp.text();
    return text.length;
}

The important thing is that our function returns a Promise, not whether it's marked as async or not. So, if the purpose of the async keyword isn't to communicate anything to other developers then what's its purpose? One could think that it's being used by the JavaScript interpreter (or JIT compiler) to identify that the function is using await and to do its thing, implicitly having the function return a Promise. While that's true, the interpreter could easily do that simply by looking at the code and identifying usage of the await keyword.

The real motivation for the async keyword can be found by digging through the original proposal for async/await in ECMAScript. There, in a comment by Brian Terlson in issue 88, we can read the following:

"Long ago we considered whether it was possible to await everywhere but we could not find a good way to do it without breaking existing usage of await as an identifier"

In other words, with the introduction of await in ECMAScript a new reserved keyword needed to be added to the language. However, you can't just go around adding reserved keywords to programming languages if you want to be compatible with existing code. This is especially true for interpreted ones run in browsers. So the async keyword was introduced as a way to ensure backwards compatibility.

To illustrate, the below code is perfectly valid and outputs "test":

function test() {
    const await = "test";
    console.log(await);
}

test();

However, if we add the async keyword, which would not have been valid in older versions of JS, to the function like below we'll get an error when we try to run it:

async function test() {
    const await = "test";
    console.log(await);
}

test();

The above code produces:

const await = "test";
        ^^^^^

SyntaxError: Unexpected reserved word