CouchDB primer part 2 - design documents and views

October 2nd, 2015    by sigman    3541
  Primers   CouchDB, databases, nosql

This is the second part of the series of primers about CouchDB. In this article I'm covering the design documents, views including map-reduce and rereduce, lookup views, different way of querying, using compound key and debugging.


Link to the first part of the primer: CouchDB primer part 1 - the basics


5. Design documents


5.1 Purpose and structure of design documents

Every database created in CouchDB can have none, one or more design documents. By default, when creating a new database, there isn't any. Design documents are JavaScript objects represented as JSON containing methods that are this database's logic responsible for transformations and outputting of documents this database contains. They are stored just like other documents in the database, they can be retrieved in the same way too through the HTTP API, but they are part of a dedicated _design space and their methods are designed to be executed in order to process the documents.

Design documents can consist of the following functionalities:

  • Views - they go through each document from the database to return only these that meet criteria defined by us (map function) and they can group data as well (reduce function). Functionally they are what SELECT, WHERE, GROUP, COUNT, SUM are in SQL.
  • Shows - they convert the single_document's structure (JSON) into another format, like HTML, XML
  • Lists - they convert the whole view into another format, in example a table or XML collection
  • Document validations - are validations function that can check and reformat incoming documents to ensure consistency before they are saved.
  • Update handlers - they are invoked during document create or update action. They can be implemented to auto-increment some values or update timestamps.
  • Filters - they are used during replication to filter out documents that should not be exchanged, after inspecting properties and values of each document they return it or not (null).

Design document doesn't have to contain all of the properties listed above. One design document can contain many different views, each with a different key.

By convention design document name should represent the type of the document that is generated by it.

Example design document may look as below:

{
  "_id": "_design/articles",
  "language": "javascript",
  "views": {
    "all": {
      "map": "function(doc) { emit(doc.title, doc) }"
    },
    "by_title": {
      "map": "function(doc) { if (doc.title != null) emit(doc.title, doc) }",
      "reduce": "function(keys, values, rereduce){}"
    },
    "by_tag": {
      "map": "function(doc) { 
        for(i=0;i<doc.tag.lenghth();i++) {    
          emit(doc.keywords[i], doc); 
        } 
      }",
    },
  },
  "shows": {
    "article": "function(doc, req) { return '<h1>' + doc.title + '</h1>' }"
}

5.2 Creating design document

What makes a document the design document is the document ID _id value which indicates that the blogarticles document is in the special dedicated _design space:

"_id": "_design/articles"

New design documents need to be created with a PUT request, the _id is required and won't be automatically generated by CouchDB as in the case of standard documents.

Since design documents are usually large, it becomes cumbersome to pass them as parameter with cURL, a more effective way is to save the document's JSON structure to the file and then upload it using the --data argument, as in the example below:

$ curl -X PUT -d @blogarticles_design.js 'http://localhost:5984/blogarticles/_design/articles'

5.3 Retrieving and updating design documents

Retrieval of the design documents structure is done in the same way standard documents are retrieved, remembering that the document's ID is made of the "_design/" part and the document's name:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles'

{"_id":"_design/articles","_rev":"1-a572b72927492f44aa976d9f5f9b0c0d","language":"javascript","views":{"by_published_date":{"map":"function(doc){if(doc.published_date){emit(doc.published_date, doc);}}"}}}

Executing any of the design document functions is done by sending a GET request to the design document URL appended with the type of the function (_show/_view or other) and it's name:

Here we are looking up the results of the show function named article:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_show/articles/'

And here we access the results produced by the by_title view:

http://127.0.0.1:5984/blogarticles/_design/articles/_view/by_title

Similarly, updating of design documents requires a PUT request with document ID and revision number:

curl -X PUT -H 'Content-type: application/json' -d '{"_rev":"1-b41a673ce62351a2c629734d4dc220f9", "view/by_title": "..."}' 'http://localhost:5984/blogarticles/_design/articles'

5.4 View functions overview

From purely technical perspective, views are just JavaScript functions that you can write to emit individual documents based on some criteria. In some cases a view may emit more data than the actual number of documents, for example you may want to iterate through comments in each article and emit them. The most common uses of views are querying data from documents, rendering lists or tables for extracted data, summarising, filtering, calculating, counting or reducing numbers of documents.

When a new view is created and first time accessed, CouchDB processes all documents in the database and creates an index made up of information generated by the view function. Depending on amount of data to process, this may make take a while. Every time a view is changed, all data (in the database that uses that particular view) needs to be reindexed. Every time a document in a database is created or changed, CouchDB needs to update the view's index. Even though you access one view only after adding a new document, all views in the same design documents are updated. Under the hood, the indexes are in the form of B-Trees, this ensures efficient retrieval of data.

Views are made of two functions: map and reduce. MapReduce is a programming model that has become very popular due to it's efficiently processing large amounts of data and transforming it into useful information.

During the map phase CouchDB takes each document from the database and sends it to the map function, which may or may not emit the document in the same or different form to generate the view. The pieces of data that are emitted are document's ID, the key that can be used later for querying and the value. Both the key and the value may take the form of a boolean, string, number, lists, objects. Emitting is done with the built-in emit() function that takes the key and value as two arguments.

By convention, the view names are prefixed with "_by" to indicate by which criteria the view returns the data.

In the example below the view name is "by_title" and the map function checks if the document (article) has a value for the title and emits that title as the key and no value.

{
    "views": {
        "by_title": {
            "map": "function(doc) { if (doc.title != null) emit(doc.title, null) }"
        },
}

5.5 Querying views

CouchDB sorts view results by key values represented as UTF-8 encoded strings.

The following map function returns the list of documents with the published_date value as the key and the object containing article's title and author as the value:

function(doc){
  if(doc.published_date){
    emit(doc.published_date, {title: doc.title, author: doc.author
  });
}

If we call the view without any extra parameters we are going to receive all records (note the key always contains the "id", even if it wasn't emitted by us):

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_published_date'

{"total_rows":7,"offset":0,"rows":[
{"id":"3c9187bdabb2408e3fbfe6a76f004958","key":"2014-01-20T00:00:00.000Z","value":{"title":"Make your C++ Effective","author":"tom"}},
{"id":"3c9187bdabb2408e3fbfe6a76f005603","key":"2014-10-10T00:00:00.000Z","value":{"title":"Effective Problem Solving","author":"kim"}},
{"id":"3c9187bdabb2408e3fbfe6a76f001bbc","key":"2014-10-19T00:00:00.000Z","value":{"title":"Secrets of Effective Negotiations","author":"kim"}},
{"id":"3c9187bdabb2408e3fbfe6a76f0032f4","key":"2014-11-18T00:00:00.000Z","value":{"title":"Grandma's Cook Book Samples","author":"mark"}},
{"id":"24c8ea5e0552db4cfdede39584004f86","key":"2015-03-19T00:00:00.000Z","value":{"title":"10 Project Management Tips","author":"mark"}},
{"id":"3c9187bdabb2408e3fbfe6a76f000743","key":"2015-05-29T00:00:00.000Z","value":{"title":"Programming Game AI","author":"tom"}},
{"id":"3c9187bdabb2408e3fbfe6a76f0013a3","key":"2015-07-03T00:00:00.000Z","value":{"title":"Marketing and Public Relations in Startups","author":"kim"}}
]}

The blogarticles published dates are ranging from 2014-01-20 to 2015-07-03. If we wanted to request a document with a specific published date using the by_published_date view, we can provide the date as the value of the key query string parameter:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_published_date?key="2015-03-19T00:00:00.000Z"'

{"total_rows":7,"offset":4,"rows":[
{"id":"24c8ea5e0552db4cfdede39584004f86","key":"2015-03-19T00:00:00.000Z","value":{"title":"10 Project Management Tips","author":"mark"}}
]}

To get articles with published_date starting from and including a specific date we can use the startkey parameter. To get articles with date including and no older that a specific date we can use the endkey parameter. For getting the range between two dates we could use both parameters together:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_published_date?startkey="2014-11-18T00:00:00.000Z"&endkey="2015-03-19T00:00:00.000Z"'

{
    "total_rows": 7,
    "offset": 3,
    "rows": [{
        "id": "3c9187bdabb2408e3fbfe6a76f0032f4",
        "key": "2014-11-18T00:00:00.000Z",
        "value": {
            "title": "Grandma's Cook Book Samples",
            "author": "mark"
        }
    }, {
        "id": "24c8ea5e0552db4cfdede39584004f86",
        "key": "2015-03-19T00:00:00.000Z",
        "value": {
            "title": "10 Project Management Tips",
            "author": "mark"
        }
    }]
}

To learn more about views visit: http://guide.couchdb.org/draft/views.html


5.6 Querying compound keys

It is also possible to search by more complex keys then just a single dimensional value. If we wanted to find all comments having article_rate equal or greater than "3" for articles from the "management" category, we would first need to create the following by_category_comment_rate view map function emitting the compound key made of two values:

function(doc) {
  if (doc.category && doc.comments) {
    for (var i = 0, len = doc.comments.length; i < len; i++) {
      var commRate = doc.comments[i].article_rate;
      if (commRate !== null) {
        emit([doc.category, commRate], doc.title);
      }
    }
  }
}

This will give the following output when querying without any parameters (document ids are shortened for brevity):

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_category_comment_rate'

{"total_rows":19,"offset":0,"rows":[
{"id":"32f4","key":["cooking",2],"value":"Grandma's Cook Book Samples"},
{"id":"32f4","key":["cooking",3],"value":"Grandma's Cook Book Samples"},
{"id":"32f4","key":["cooking",4],"value":"Grandma's Cook Book Samples"},
{"id":"32f4","key":["cooking",5],"value":"Grandma's Cook Book Samples"},
{"id":"5603","key":["management",null],"value":"Effective Problem Solving"}
// ... and so on ...
]}

Now we can send a GET request using the startkey and endkey in a form of array, where the first item would be the category and the second the start and end of the comment rate range. If our function emitted more values in the key array, the startkey and endkey parameters could contain more values as well. Note that we use the --globoff or -g parameter, as otherwise the square brackets would cause the following error:

curl: (3) [globbing] bad range specification in column 89

Here are the results:

$ curl -X GET --globoff 'http://localhost:5984/blogarticles/_design/articles/_view/by_category_comment_rate?startkey=["management",3]&endkey=["management",5]'

{"total_rows":19,"offset":9,"rows":[
{"id":"3c9187bdabb2408e3fbfe6a76f001bbc","key":["management",3],"value":"Secrets of Effective Negotiations"},
{"id":"24c8ea5e0552db4cfdede39584004f86","key":["management",4],"value":"10 Project Management Tips"},
{"id":"24c8ea5e0552db4cfdede39584004f86","key":["management",5],"value":"10 Project Management Tips"},
{"id":"3c9187bdabb2408e3fbfe6a76f001bbc","key":["management",5],"value":"Secrets of Effective Negotiations"}
]}

Despite the fact that the view queried using the filter returned only 4 records, the total_rows property shows total number of records that the view returns (with no query string filter applied).


5.7 Lookup views


5.7.1 View Collation

Unlike in RDBMS (Relational Database Management Systems), it is not possible to join results from two different tables (or databases in CouchDB world) using one request. Instead two separate requests would need to be sent to each of the databases and data would need to be joined on the client side.

One common approach to overcome this limitation, described in the official documentation, is to store two different type of documents in the same database.

To explain that better, lets discuss the blogarticles database we've been using so far. The fact that the article and associated comments are part of the same document allows us to request them together using one request. But the problem is that if a new comment needs to be added or an existing comment needs to be edited, the whole document needs to be updated because comments are just article's properties. In the case when multiple blog viewers wanted to comment the same article at the same time, some of them would get document update conflict error since as soon as the first user adds her comment, the revision number is going to be updated.

To overcome this issue we can split articles and comments but still keep them in the same database allowing the view to access both. Lets take this document:

{
   "_id": "3c9187bdabb2408e3fbfe6a76f0013a3",
   "title": "Marketing and Public Relations in Startups",
   "category": "startups",
   "published_date": "2015-07-03T00:00:00.000Z",
   "author": "kim",
   "tags": ["marketing", "public relations"],
   "visits": 4300,
   "comments": [
       {
           "nick": "alex",
           "date": "2015-07-03T00:00:00.000Z",
           "comment": "Sample comment",
           "article_rate": 3
       },
       {
           "nick": "nick",
           "date": "2015-08-09T00:00:00.000Z",
           "comment": "Sample comment",
           "article_rate": 2
       }
   ]
}

And split it into three, first the article and then the two comments:

{
   "_id": "marketing-and-public-relations-in-startups",
   "type": "article",
   "title": "Marketing and Public Relations in Startups",
   "category": "startups",
   "published_date": "2015-07-03T00:00:00.000Z",
   "author": "kim",
   "tags": ["marketing","public relations"],
   "visits": 4300
}, {
   "_id": "3c9187bdabb2408e3fbfe6a76f00a488",
   "type": "comment",
   "article": "marketing-and-public-relations-in-startups",
   "nick": "alex",
   "date": "2015-07-03T00:00:00.000Z",
   "comment": "Sample comment",
   "article_rate": 3
}, {
   "_id": "3c9187bdabb2408e3fbfe6a76f00aa87",
   "type": "comment",
   "article": "marketing-and-public-relations-in-startups",
   "nick": "nick",
   "date": "2015-08-09T00:00:00.000Z",
   "comment": "Sample comment",
   "article_rate": 2
}

What we have done here is added a new property called type that describes the type of the document, which could be article or comment. For the sake of brevity, instead of using UUID for article IDs, we use the slug, which is just a user friendly version of the article title in the form of URL. By doing this we also agree that article titles are required to be unique or alternatively we could add some logic to make sure that if there is a title conflict, a slug would get some random prefix like a digit to ensure uniqueness (we could use the update handlers in the design document for this purpose). We linked comment documents with their parent articles using the article property populated with the article's ID.

Now, the next step is to create a view with the map function called 'by_article_id' that would enable us to get both the articles and their comments together:

function(doc){
  if (doc.type === "article") {
     emit([doc._id, 0],doc);
  }
  else if (doc.type === "comment") {
     emit([doc.article, 1],doc);
  }
}

If we send the following query:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_article_id?key=["marketing-and-public-relations-in-startups",0]'

we will get just the article itself. If we provide 1 as the second value to the key we will get comments only:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_article_id?key=["marketing-and-public-relations-in-startups",1]

But if we provide a range we will get both type of documents ordered in the way that comments are positioned after the article - thanks to the second value in the emitted key, called sorting token:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_article_id?startkey=["marketing-and-public-relations-in-startups",0]&endkey=["marketing-and-public-relations-in-startups",1]' --globoff

{
    "total_rows": 6,
    "offset": 3,
    "rows": [{
        "id": "marketing-and-public-relations-in-startups",
        "key": ["marketing-and-public-relations-in-startups", 0],
        "value": {
            "_id": "marketing-and-public-relations-in-startups",
            "type": "article",
            "title": "Marketing and Public Relations in Startups",
            "category": "startups",
            "published_date": "2015-07-03T00:00:00.000Z",
            "author": "kim",
            "tags": ["marketing", "public relations", "startup", "tips"],
            "visits": 4300
        }
    }, {
        "id": "3c9187bdabb2408e3fbfe6a76f00a488",
        "key": ["marketing-and-public-relations-in-startups", 1],
        "value": {
            "_id": "3c9187bdabb2408e3fbfe6a76f00a488",
            "type": "comment",
            "article": "marketing-and-public-relations-in-startups",
            "nick": "alex",
            "date": "2015-07-03T00:00:00.000Z",
            "comment": "Sample comment",
            "article_rate": 3
        }
    }, {
        "id": "3c9187bdabb2408e3fbfe6a76f00aa87",
        "key": ["marketing-and-public-relations-in-startups", 1],
        "value": {
            "_id": "3c9187bdabb2408e3fbfe6a76f00aa87",
            "type": "comment",
            "article": "marketing-and-public-relations-in-startups",
            "nick": "nick",
            "date": "2015-08-09T00:00:00.000Z",
            "comment": "Sample comment",
            "article_rate": 2
        }
    }]
}

More on collation including sorting by keys can be found here: http://docs.couchdb.org/en/1.6.1/couchapp/views/collation.html#views-collation.


5.7.2 Linked Documents

Now what solution could we use if we wanted to request comments by their ID and also see the corresponding article using one request? We could create a view as below, where the map function is called "by_comments":

function(doc){
  if (doc.type === "comment") {
     emit(doc._id, {"_id":doc.article});
  }
}

What we are doing here is we emit the comment's doc_id in order to be able to query by comment's ID and as the value we emit an object with the "_id" as a key and the corresponding article's ID as a value.

Querying by key gives us the following response:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_comment?key="3c9187bdabb2408e3fbfe6a76f008f03"'

"rows": [{
    "id": "3c9187bdabb2408e3fbfe6a76f008f03",
    "key": "3c9187bdabb2408e3fbfe6a76f008f03",
    "value": {
        "_id": "make-your-c++-effective"
    }
}]

But if we add the include_docs=true parameter we are going to receive something completely different:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_comment?key="3c9187bdabb2408e3fbfe6a76f008f03"&include_docs=true'

{
    "total_rows": 4,
    "offset": 0,
    "rows": [{
        "id": "3c9187bdabb2408e3fbfe6a76f008f03",
        "key": "3c9187bdabb2408e3fbfe6a76f008f03",
        "value": {
            "_id": "make-your-c++-effective"
        },
        "doc": {
            "_id": "make-your-c++-effective",
            "_rev": "5-19b57826876e9bd14b526f5d354c02a1",
            "type": "article",
            "title": "Make your C++ Effective",
            "category": "programming",
            "published_date": "2014-01-20T00:00:00.000Z",
            "author": "tom",
            "tags": ["programming", "c++", "software"],
            "visits": 1234
        }
    }]
}

As you can see the include_docs=true option made CouchDB add the doc property to the returned results containing the article document referenced by the _id value emitted from the view.

If we wanted to receive the full comment document as well, we could modify the view so that it could have another emit or add another property to the returned object:

function(doc){
  if (doc.type === "comment") {
     emit(doc._id, null);
     emit(doc._id, {"_id":doc.article});
  }
}

// or

function(doc){
  if (doc.type === "comment") {
     emit(doc._id, {"_id":doc.article, "doc":doc});
  }
}

Why are we emitting null in the first emit? If we didn't, the include_docs=true switch would make the same document returned twice:

{
        "id": "3c9187bdabb2408e3fbfe6a76f008f03",
        "key": "3c9187bdabb2408e3fbfe6a76f008f03",
        "value": {
            "_id": "3c9187bdabb2408e3fbfe6a76f008f03",
            "_rev": "2-da6c6825e4f305fe8034bbc57d5c2610",
            "type": "comment",
            "article": "make-your-c++-effective",
            "nick": "steve",
            "date": "2014-01-20T00:00:00.000Z",
            "comment": "Sample comment",
            "article_rate": 4
        },
        "doc": {
            "_id": "3c9187bdabb2408e3fbfe6a76f008f03",
            "_rev": "2-da6c6825e4f305fe8034bbc57d5c2610",
            "type": "comment",
            "article": "make-your-c++-effective",
            "nick": "steve",
            "date": "2014-01-20T00:00:00.000Z",
            "comment": "Sample comment",
            "article_rate": 4
        }
    }

If we emit null, the document is returned in the doc property:

{
        "id": "3c9187bdabb2408e3fbfe6a76f008f03",
        "key": "3c9187bdabb2408e3fbfe6a76f008f03",
        "value": null,
        "doc": {
            "_id": "3c9187bdabb2408e3fbfe6a76f008f03",
            "_rev": "2-da6c6825e4f305fe8034bbc57d5c2610",
            "type": "comment",
            "article": "make-your-c++-effective",
            "nick": "steve",
            "date": "2014-01-20T00:00:00.000Z",
            "comment": "Sample comment",
            "article_rate": 4
        }
    }

The include_docs switch lets the CouchDB engine query the database "on the fly" when a request is sent and inject resulting document to the response as the doc value. If we emit documents as values in our view (doc as the second parameter in the emit function), these documents are actually going to be cached/indexed on disk (as discussed earlier). The official documentation in the chapter for joins advises that it is not recommended to emit the documents in the view and instead, it is suggested to include the bodies of the documents by requesting the view with ?include_docs=true. This way we save disk space for a price of incurring overhead of making the engine querying the docs from the database requested by us. More details in this thread.


5.8 Reduce functions


5.8.1 Purpose

Role of the reduce functions is to collate the output of the map function call by counting, summing up, grouping items of a particular type. Output of the reduce function is always one new record containing the result of the operation on all documents returned from the map step. Results of the reduce step are indexed in the same way as the results of the map step, this means that the response will be much faster as the data comes from the "cache" and isn't calculated "on the fly" every time a query is sent. Two main things to know about the reduction functionality is the available built-in functions and what are the group levels.


5.8.2 Using built in functions


5.8.2.1 Count

If we wanted to count how many comments we have in total in all articles, we could come up with the following view, where the map function returns each record (the fact that value is 1 isn't important, it could be null as well) and the _count function as the reduce:

"all_comments": {
      "map": "function(doc) { if (doc.type == 'comment')  emit(doc.id, 1) }",
      "reduce" : "_count"
}

If we query this without reduction using the reduce=false parameter we are going to get this result:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/all_comments?reduce=false'

{"total_rows":4,"offset":0,"rows":[
{"id":"3c9187bdabb2408e3fbfe6a76f008f03","key":null,"value":1},
{"id":"3c9187bdabb2408e3fbfe6a76f00948e","key":null,"value":1},
{"id":"3c9187bdabb2408e3fbfe6a76f00a488","key":null,"value":1},
{"id":"3c9187bdabb2408e3fbfe6a76f00aa87","key":null,"value":1}
]}

If we do not provide any parameter which is equivalent to providing group_level=0 we are going to get the value that represents the result of taking each of the records returned from the map function and adding it to the counter:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/all_comments'

{"rows":[
{"key":null,"value":4}
]}

5.8.2.2 Sum

As an exercise lets sum up the visits in all articles, in other words lets get a number representing total visits to our blog. For this we would create a view with the map function returning the visits value for each article and use the _sum as the reduction function:

"all_articles_visits": {
   "map": "function(doc) { if (doc.type === 'article')  emit(doc.id, doc.visits) }",
   "reduce": "_sum"
}

It is important to note that since the sum is going to add numbers, the number needs to be returned from the emit function. If the value isn't actually a number but a string, it should be converted into number using likes of parseInt or parseFloat.

Without reduce and with reduce:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/all_articles_visits?reduce=false'

{"total_rows":2,"offset":0,"rows":[
{"id":"make-your-c++-effective","key":null,"value":1234},
{"id":"marketing-and-public-relations-in-startups","key":null,"value":4300}
]}

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/all_articles_visits'

{"rows":[
{"key":null,"value":5534}
]}

5.8.2.3 Stats

For the following view, where we use _stats for the reduce:

"all_articles_stats": {
   "map": "function(doc) { if (doc.type === 'article')  emit(doc.id, doc.visits); }",
   "reduce": "_stats"
}

we will get the following response data:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/all_articles_stats?reduce=false'

{"total_rows":2,"offset":0,"rows":[
{"id":"make-your-c++-effective","key":null,"value":1234},
{"id":"marketing-and-public-relations-in-startups","key":null,"value":4300}
]}

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/all_articles_stats'

{"rows":[
{"key":null,"value":{"sum":5534,"count":2,"min":1234,"max":4300,"sumsqr":20012756}}
]}

Since the stats method similarly to the sum takes numbers, the emit is required to return numbers.


5.8.3 Group levels

To better explain the group levels, lets use the example database presented at the beginning of this article and the following view that will provide us information of which author and which article category became most successful in terms of number of visits (our key is is the collation key since it is made up of two elements):

"by_name_category": {
   "map": "function(doc) { emit([doc.author, doc.category], doc.visits) }",
   "reduce": "_sum"
}

**Without reduction99 we will get the following results (ids truncated for brevity):

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_name_category?reduce=false'

{"total_rows":12,"offset":0,"rows":[
{"id":"1bbc","key":["kim","management"],"value":8756},
{"id":"5603","key":["kim","management"],"value":12853},
{"id":"13a3","key":["kim","startups"],"value":4300},
{"id":"32f4","key":["mark","cooking"],"value":7845},
{"id":"4f86","key":["mark","management"],"value":1000},
{"id":"0743","key":["tom","programming"],"value":600},
{"id":"4958","key":["tom","programming"],"value":1234}
]}

Notice that the compound key is made of two values. By using the array as the key we can group to multiple levels (as many as values in the key).

Reduction with level 0: group_level=0, equivalent to group=false or no parameters in the query string, is the default and means that no key (sent from the emit) is used for grouping - in other words each line is taken into calculation separately. This is called reducing to a single value. The key in the reduce results is null, this is because we took all different records and calculated them together, so the key is irrelevant in this case.

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_name_category?group_level=0'

{"rows":[
{"key":null,"value":36588}
]}

Reduction with level 1: group_level=1 means that the first key is going to be taken into account in the calculations. Since the first item in the key is the author name, all values (visits) for each name are are grouped and added to the sum, hence we are getting summary per each user.

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_name_category?group_level=1'

{"rows":[
{"key":["kim"],"value":47518},
{"key":["mark"],"value":17690},
{"key":["tom"],"value":2434}
]}

Reduction with level 2: group_level=2 means that both values from the compound key are going to be used to group values. This means that CouchDB will go through the records, if it finds ones with the same key (both values), it will summarise their values. To explain this better, if we look at the results returned from the view without reduction turned on, we will notice that some records have the same keys, in example these (ids truncated for brevity):

{"id":"1bbc","key":["kim","management"],"value":8756},
{"id":"5603","key":["kim","management"],"value":12853},

{"id":"0743","key":["tom","programming"],"value":600},
{"id":"4958","key":["tom","programming"],"value":1234},

The pairs above will be sum up, as we can observe in the results below:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_name_category?group_level=2'

{"rows":[
{"key":["kim","management"],"value":21609},
{"key":["kim","startups"],"value":4300},
{"key":["mark","cooking"],"value":7845},
{"key":["mark","management"],"value":1000},
{"key":["tom","programming"],"value":1834}
]}

Another way of turning reduce on is using the group=true (which is default in Futon - CouchDB's default client). Using this parameter would have the same effect as group_level=2 in our example database and view as CouchDB would just take all key components returned for each record from the map to group documents and reduce them, into corresponding values.

If we repeat request with group_level=3, we will get the same results as for level 2 because there are only two keys returned from the emit.

More info on reduce and lookup views can be found https://wiki.apache.org/couchdb/Introduction_to_CouchDB_views#Lookup_Views.


5.9 Custom reduce functions - rereduce

The reduce functions allow us creating custom calculations beyond the built-in functions. The reduce function takes three arguments:

  • document keys in the form of array consisting of the key and document's ID received from the map function (only in non-rereduce mode - more on that soon)
  • document values as the array of values emitted from the map function (in rereduce mode values will be product of the previous reduce)
  • boolean value, if true, then the function is called by itself or in other words run in the rereduce mode.

Lets create a reduce function that will return us an object containing count of records and sum of the visits:

function (keys, values, rereduce) {
   var count = 0, sum = 0, i;
   count = values.length;
   sum = sum(values);

   return {count: count, sum: sum};
}

In the example reduce function above we are simply taking the values emitted from the map function (each item in the value array comes from separate emits for each individual record). We are going through each of the value incrementing the count and adding the number of visits to the sum. For our sample database with few records the expected output would be this:

$ curl -X GET 'http://localhost:5984/blogarticles/_design/articles/_view/by_name_category_custom?group_level=1'

{
   "rows":[{
      "key":["kim"],
      "value":{"count":3,"sum":25909}
    },{
      "key":["mark"],
      "value":{"count":2,"sum":8845}
    },{
       "key":["tom"],
       "value":{"count":2,"sum":1834}
    }]
}

If we run the code above on a very large amount of records, instead of the expected results as before, we could see these:

{
    "rows":[{
        "key":"null",
        "value": {
            count: 3,
            sum: "0[object Object][object Object][object Object]"
        }
    }]
}

This can happen if there is a lot of data to process (many rows produced by emits), in such case the reduction process will be repeated - the reduce function is going to be called by itself. CouchDB uses divide and conquer strategy to calculate reduce results more efficiently by breaking up the key/value pairs into smaller chunks. The results are combined into arrays and they are passed to reduce again - which can happen more than once!

This is what the custom reduce function receives from the map function that emits three documents:

reduce([ [key1,id1], [key2,id2], [key3,id3] ], [value1,value2,value3], false)

If we put a breakpoint and inspected the argument values we would see something along these lines (ids and records truncated for the sake of brevity) - refer to the sample records presented in 5.8.3:

reduce([ [["kim","management"], "1bbc"], [["kim", "management"], "5603"], [["kim", "startups"], "13a3"] ...], [8756, 12853, 4300], false)

If the same reduce function was called in the rereduce mode, it wouldn't receive the keys, the values would be calculations returned from the previous reduce call and the boolean will be true:

reduce(null, [value1, value2], true)

If we inspected the argument values in this case we would see something along these lines:

reduce(null, [{count:123, sum:123}, {count:213, sum:231}], true)

In order to make sure our custom reduce function works properly with larger amount of data, we will need to add a condition to handle the rereduce scenario and correctly use the values passed:

function (keys, values, rereduce) {
   var count = 0, sum = 0, i;

   if (rereduce) {
      for (i in values) {
         count += values[i].count;
         sum += values[i].sum;
      }
   } else {
      count = values.length;
      sum = sum(values);
   }

   return {count: count, sum: sum};
}

5.10 Temporary views

Temporary views are the views that aren't stored permanently in the design document and therefore don't have any index data stored for performance efficiency (hence they are also called one-off queries). Instead, when making request, we can provide our map or reduce function to a special _temp_view endpoint and let CouchDB query the database "on the fly" and return results. This isn't a solution we would use for production, but is very handy to quickly check things.

In example with this view we could request tags for each article (ids and results truncated for the sake of brevity):

$ curl -X POST http://localhost:5984/blogarticles/_temp_view -H "Content-Type: application/json" -d '{"map":"function(doc){emit(null, doc.tags);}"}'

{"total_rows":18,"offset":0,"rows":[
{"id":"4f86","key":null,"value":["pmp","project","management","tips"]},
{"id":"0743","key":null,"value":["programming","game","software"]},
{"id":"13a3","key":null,"value":["marketing","public relations","startup","tips"]},
...

5.11 View cleanup and compaction

Both cleanup and compaction will help with reclaiming some disk space, especially with large databases.

Cleanup process purges old view indexes remaining on a disk for a particular database (after a new view has been created). To trigger it we need to send a POST request to the database _view_cleanup endpoint:

$ curl -X POST -H "Content-Type: application/json" http://localhost:5984/blogarticles/_view_cleanup

{"ok":true}

The design document can be queried to get statistics information about it:

$ curl -X GET http://localhost:5984/databasename/_design/test/_info

There are two types of compaction, one for databases and one for views. Compaction removes unused sections created during updates and also old revisions of documents (but will still leave some metadata that is needed for handling conflicts during replication). The view compaction works on the view index from the current version of the design document. We kick off compacting process by sending a POST request to URL made of database name, _compact and the name of the design document:

$ curl -X POST -H "Content-Type: application/json" http://localhost:5984/blogarticles/_compact/articles

{"ok":true}

More on compaction of databases can be found here:https://wiki.apache.org/couchdb/Compaction.


6. Debugging

The database log is available from this endpoint:

http://127.0.0.1:5984/_log
[Fri, 18 Mar 2015 12:40:54 GMT] [info] [<0.116.0>] 127.0.0.1 - - GET /_session 200
[Fri, 18 Mar 2015 12:40:54 GMT] [info] [<0.19980.1>] 127.0.0.1 - - GET / 200
[Fri, 18 Mar 2015 12:40:54 GMT] [info] [<0.5765.1>] 127.0.0.1 - - GET /blogarticles/3c9187bdabb2408e3fbfe6a76f000743?rev=7-fb5654d85ff16f57bf079aee05f35de7 200

In order to be able to see more lines in the log, the number of bytes may be changed:

http://127.0.0.1:5984/_log?bytes=10000

Outputting to the log from the view function:

{
  "map": "function(doc) { log(doc); }"
}