Thursday, November 24, 2016

Logstash+Elasticsearch: Best way to handle JSON arrays



Before I get to the solution, let's review the problem we're trying to solve. Suppose these two JSON documents are pushed to ES:
{
    "test": {
        "steps": [{
            "response_time": "100"
        }, {
            "response_time": "101"
        }]
    }
}
{
    "test": {
        "steps": [{
            "response_time": "101"
        }, {
            "response_time": "100"
        }]
    }
}
And you write a Kibana query like:
test.steps.response_time:101
# Full ES query in the background
{
    "query": {
        "query_string": {
           "query": "test.steps.response_time:101"
        }
    }
}
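It'll match both documents. Why? Because Elasticsearch flattens arrays of objects internally: in each document, test.steps.response_time is indexed as a single multi-valued field holding both 100 and 101, so the position of a value within the array cannot be searched.
More details: https://www.elastic.co/guide/en/elasticsearch/guide/current/complex-core-fields.html#object-arrays and https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html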
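To make the flattening concrete, here is a rough Ruby sketch that mimics it outside Elasticsearch. The helper name flatten_fields is made up for this illustration, and the real indexing process is more involved; the sketch only shows that both documents end up with the same set of values under the same dotted path.

```ruby
# Hypothetical helper (not part of Elasticsearch or Logstash): collect every
# leaf value under its dotted field path, dropping array positions.
def flatten_fields(node, prefix = nil, out = Hash.new { |h, k| h[k] = [] })
  case node
  when Hash
    node.each { |key, child| flatten_fields(child, [prefix, key].compact.join('.'), out) }
  when Array
    # Every element contributes to the same path, so the index is lost.
    node.each { |child| flatten_fields(child, prefix, out) }
  else
    out[prefix] << node
  end
  out
end

doc_a = { 'test' => { 'steps' => [{ 'response_time' => '100' }, { 'response_time' => '101' }] } }
doc_b = { 'test' => { 'steps' => [{ 'response_time' => '101' }, { 'response_time' => '100' }] } }

flat_a = flatten_fields(doc_a)
flat_b = flatten_fields(doc_b)

# Both documents expose '101' under test.steps.response_time,
# so a query on that field matches either one.
p flat_a
p flat_b
```

Since both flattened documents contain 101 under test.steps.response_time, a query on that field has no way to tell them apart.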

Not only that: if I wanted to search for all documents with response_time=101 in the second element of the array, logically test.steps[1].response_time:101, that is not possible.

To fix this, we can simply create a filter in Logstash that converts these arrays to hashes recursively, i.e., all arrays are converted to hashes, even the nested ones. In other words, we want a filter that converts arrays like this:

Before:-
{
  "foo": "bar",
  "test": {
    "steps": [
      {
        "response_time": "100"
      },
      {
        "response_time": "101",
        "more_nested": [
          {
            "hello": "world"
          },
          {
            "hello2": "world2"
          }
        ]
      }
    ]
  }
}
After:-
{
  "foo": "bar",
  "test": {
    "steps": {
      "0": {
        "response_time": "100"
      },
      "1": {
        "response_time": "101",
        "more_nested": {
          "0": {
            "hello": "world"
          },
          "1": {
            "hello2": "world2"
          }
        }
      }
    }
  }
}
The filter that can do this is shared below:-
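ruby {
    init => "
        # Recursively replace every array with a hash keyed by element index.
        # Note: arrays nested directly inside other arrays are not handled.
        def arrays_to_hash(h)
          h.each do |k, v|
            # If v is nil, an array is being iterated and the value is k.
            # If v is not nil, a hash is being iterated and the value is v.
            value = v || k
            if value.is_a?(Array)
                # Build the index-keyed hash that replaces the array in h.
                # (Only single quotes are used in these comments, because an
                # unescaped double quote would end the init string.)
                value_hash = {}
                value.each_with_index do |item, i|
                    value_hash[i.to_s] = item
                end
                h[k] = value_hash
            end

            # Recurse into the original value; its elements are shared
            # with value_hash, so nested arrays are converted too.
            if value.is_a?(Hash) || value.is_a?(Array)
              arrays_to_hash(value)
            end
          end
        end
      "
      code => "arrays_to_hash(event.to_hash)"
}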
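If you want to try the conversion outside Logstash, here is a minimal standalone sketch. It is not the filter above: arrays_to_hash_copy is a hypothetical non-mutating variant that returns a new structure instead of editing the event in place, and it also happens to handle arrays nested directly inside arrays.

```ruby
require 'json'

# Hypothetical variant of the filter logic: returns a new structure in which
# every array is replaced by a hash keyed by the element index (as a string).
def arrays_to_hash_copy(node)
  case node
  when Hash
    node.each_with_object({}) { |(key, child), out| out[key] = arrays_to_hash_copy(child) }
  when Array
    node.each_with_index.each_with_object({}) do |(child, i), out|
      out[i.to_s] = arrays_to_hash_copy(child)
    end
  else
    node
  end
end

# The 'Before' document from this post.
before = JSON.parse(<<~DOC)
  {
    "foo": "bar",
    "test": {
      "steps": [
        { "response_time": "100" },
        { "response_time": "101",
          "more_nested": [ { "hello": "world" }, { "hello2": "world2" } ] }
      ]
    }
  }
DOC

after = arrays_to_hash_copy(before)
puts JSON.pretty_generate(after)
```

One caveat, stated as an assumption to verify on your own setup: on Logstash versions that use the event.get / event.set API, event.to_hash may hand back a copy of the event data, in which case the converted value would presumably have to be written back with event.set for each top-level field.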
Now, to search for the documents that contain response_time=101 in the second element of the array, the query is simple:
test.steps.1.response_time:101
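To see why this query now distinguishes the two documents from the beginning of the post, here is a small Ruby sketch that resolves the dotted path against their converted forms. The field_at helper is hypothetical and only imitates a field lookup; it is not how Elasticsearch evaluates queries.

```ruby
# Converted forms of the two example documents (arrays turned into hashes).
docs = {
  'first'  => { 'test' => { 'steps' => { '0' => { 'response_time' => '100' },
                                         '1' => { 'response_time' => '101' } } } },
  'second' => { 'test' => { 'steps' => { '0' => { 'response_time' => '101' },
                                         '1' => { 'response_time' => '100' } } } }
}

# Hypothetical helper: resolve a dotted field path against a nested hash.
def field_at(doc, path)
  doc.dig(*path.split('.'))
end

# Keep only the documents whose second step has response_time 101.
matches = docs.select { |_name, doc| field_at(doc, 'test.steps.1.response_time') == '101' }.keys
p matches
```

Only the first document has 101 at index 1, so it is the only one left in matches.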
Happy ELKing!
