Data-driven Open Policy Agent Architecture

Amit Kanfer March 11, 2021

OPA was designed to let you make context-aware authorization and policy decisions by injecting external data from nearly any source and writing policy against that data.

Those data sources, the policy, and the input from a given request are the three critical components needed to build industrial-strength, fine-grained access control. OPA maintains a cache of the data set, just as it has a cache of the policy. However, OPA is not designed to be the single source of truth for either.

Making policy decisions based on a request and its context is important. Making them based on attributes from external data sources, however, is a difficult task that presents a new set of challenges.

When designing an authorization framework that relies on an external data source for decision-making, there are a few factors that need to be considered, such as:

  1. The total size of the external data source you are integrating with
  2. How “dynamic” the data is, or how frequently it is updated
  3. How far the data is from the OPA server, which determines the potential latency when the data is fetched

As outlined in the docs, OPA offers five different ways to bring external data into its cache, each suited to different use cases:

  1. JWT Tokens
  2. Overloading / Piggybacking
  3. Bundle API
  4. Push Mode
  5. Pull Mode

We won’t cover the differences between all of these methods, as they are described in great detail in the docs. In this post, we will focus on the fifth option, “Pull Mode”, and the advantages we often see for large and dynamic environments.

Our approach is based on the assumption that authorization is usually done on behalf of an end user or service. In short, a single identity performs a certain action on multiple resources, and then goes away.

The Pareto distribution shows up often in computer science, and it applies here as well: roughly 20% of the users generate 80% of the traffic. In real-life scenarios it isn’t uncommon to see even more extreme distributions.

You might ask yourselves, ‘Why should we cache data attributes in OPA’s cache when that data is likely never going to be queried again? Why not just “lazy-load” the information in real-time, according to live traffic and then just cache it for future use?’

So we wondered: what is the most cost-effective way to leverage Pull Mode without sacrificing performance or latency?

Ultimately, there might be a “cold start” on the first requests, but that’s only going to happen ONCE in the entire lifetime of our OPA-based agents, and there are ways to mitigate that risk if need be.
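
To make the lazy-loading argument concrete, here is a minimal Python sketch (not OPA code) of a read-through cache under a skewed workload. The names `LazyCache` and `fetch_role`, the 10,000-user directory, and the 1,000-request trace are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

DIRECTORY_SIZE = 10_000  # hypothetical number of users in the external data source


class LazyCache:
    """Read-through cache: fetch a value on first use, then serve it from memory."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._store: Dict[str, str] = {}
        self.misses = 0  # requests that had to reach the data source
        self.hits = 0    # requests served from memory

    def __len__(self) -> int:
        return len(self._store)

    def get(self, key: str) -> str:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1  # the "cold start" for this key
        value = self._fetch(key)
        self._store[key] = value
        return value


def fetch_role(user: str) -> str:
    """Stand-in for a slow external lookup (hypothetical)."""
    return "admin" if user == "user-0" else "viewer"


def skewed_trace() -> List[str]:
    """1,000 requests from 100 active users: 20 users send 800, 80 users send 200."""
    hot = [f"user-{i}" for i in range(20)]
    cold = [f"user-{i}" for i in range(20, 100)]
    trace = [hot[i % 20] for i in range(800)]
    trace += [cold[i % 80] for i in range(200)]
    return trace


cache = LazyCache(fetch_role)
for user in skewed_trace():
    cache.get(user)

# prints: external calls=100, cache hits=900, cached 100 of 10000 users
print(f"external calls={cache.misses}, cache hits={cache.hits}, "
      f"cached {len(cache)} of {DIRECTORY_SIZE} users")
```

In this toy trace, each active identity pays the external lookup once, and the 9,900 users who never show up are never loaded at all.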

Pull mode
OPA can make HTTP requests to external API servers in real time as part of policy evaluation. This lets you bring in information from external data sources in order to make better policy decisions.

In order to mitigate latency and performance issues you have access to a list of controls around those requests, specifically around caching (cache, force_cache and force_cache_duration_seconds).
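
As we read the OPA docs, `force_cache` together with `force_cache_duration_seconds` caches a response for a fixed duration regardless of the caching headers the server returns. The Python sketch below illustrates that fixed-TTL behavior only; it is not OPA's implementation, and `TTLCache`, `FakeClock`, and `fetch_role` are hypothetical names.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Caches each fetched value for a fixed number of seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[str, float]] = {}  # key -> (value, expiry)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: no external call
        value = self._fetch(key)  # missing or expired: go to the source
        self._store[key] = (value, now + self._ttl)
        return value


class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


calls: List[str] = []


def fetch_role(user: str) -> str:
    """Stand-in for the HTTP call to an external API (hypothetical)."""
    calls.append(user)
    return "viewer"


clock = FakeClock()
cache = TTLCache(fetch_role, ttl_seconds=300, clock=clock)

cache.get("alice")   # t=0: first request, external call
clock.now = 299.0
cache.get("alice")   # t=299: served from cache
clock.now = 300.0
cache.get("alice")   # t=300: entry expired, external call again

print(len(calls))  # prints: 2
```

The duration is the knob that trades freshness for latency: a longer TTL means fewer external calls but staler attributes.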

While this HTTP functionality is great, it can present a slight problem. When the data resides in a non-RESTful server (e.g., databases, Active Directory), it’s up to the developer to build a service (or services) that will receive those HTTP requests and translate them to the necessary protocol to get the required information from those data sources. The developers also need to build a smart caching mechanism on that extra service in order to prevent inundating the data source with repeated requests.
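
For a sense of what such a bridge service involves, here is a minimal Python sketch using only the standard library. It translates `GET /roles/<user>` into a SQL query against an in-memory SQLite database. The URL scheme, table layout, and function names are assumptions made for this example, and the caching layer described above is deliberately left out.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional, Type
from urllib.parse import unquote


def open_demo_db() -> sqlite3.Connection:
    """Create a throwaway database with one user (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (username TEXT PRIMARY KEY, role TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 'admin')")
    return conn


def lookup_role(conn: sqlite3.Connection, user: str) -> Optional[str]:
    """Translate the lookup into a parameterized SQL query."""
    row = conn.execute(
        "SELECT role FROM users WHERE username = ?", (user,)
    ).fetchone()
    return row[0] if row else None


def make_handler(conn: sqlite3.Connection) -> Type[BaseHTTPRequestHandler]:
    """Build a request handler bound to the given database connection."""

    class RoleHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            prefix = "/roles/"  # expected path: /roles/<user>
            if not self.path.startswith(prefix):
                self.send_error(404)
                return
            role = lookup_role(conn, unquote(self.path[len(prefix):]))
            body = json.dumps({"role": role}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return RoleHandler


def serve(port: int = 8080) -> None:
    """Run the bridge until interrupted; call this explicitly to start it."""
    conn = open_demo_db()
    HTTPServer(("127.0.0.1", port), make_handler(conn)).serve_forever()
```

Even this toy version needs routing, serialization, and connection handling, and a production bridge would add caching, authentication, and error handling on top.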

To make it easy to start using our OPA-based agents (Policy Decision Points, or PDPs), we decided to implement a few built-in functions that we find useful. These built-in functions include:

LDAP

ldap.is_member_of_group(input.user, "Administrators")

Explanation: Checks whether a given user is part of a group in AD

Databases

build.query_raw(<connection_string>, "SELECT role FROM users WHERE user = $1", [input.user])

Explanation: Uses a specific database connector to run an SQL / NoSQL / PartiQL query

Cache

cache.set("key", 123)
res := cache.get("key") # res is now 123

Explanation: Gives explicit control of the cache, to pass state from one policy evaluation to another

A few of the benefits of having such functions:

  1. It becomes significantly easier to author and maintain the policies.
  2. The response is cached in OPA for a configurable TTL, so similar future queries are not I/O-bound.
  3. If different PDPs receive requests from different applications, the data is automatically split between the different PDPs according to the context.
  4. The built-in function keeps the results “fresh” in the background using goroutines, so that future requests avoid making the external call.
  5. There is no need to implement and maintain additional middleware in order to create a bridge between the generic http.send functionality and the data source.
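
The background refresh described in the list above can be sketched as follows. This is an illustration in Python with a thread standing in for a goroutine, not the actual built-in implementation; `RefreshingCache` and the `directory` dictionary are hypothetical.

```python
import threading
from typing import Callable, Dict


class RefreshingCache:
    """Caches lookups and re-fetches known keys off the request path."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._store: Dict[str, str] = {}
        self._lock = threading.Lock()

    def get(self, key: str) -> str:
        with self._lock:
            if key in self._store:
                return self._store[key]
        value = self._fetch(key)  # only the first request for a key waits on I/O
        with self._lock:
            self._store[key] = value
        return value

    def refresh_once(self) -> None:
        """Re-fetch every cached key; this is one tick of the background loop."""
        with self._lock:
            keys = list(self._store)
        for key in keys:
            value = self._fetch(key)
            with self._lock:
                self._store[key] = value

    def start(self, interval_seconds: float) -> threading.Event:
        """Refresh periodically in a daemon thread; set the event to stop."""
        stop = threading.Event()

        def loop() -> None:
            while not stop.wait(interval_seconds):
                self.refresh_once()

        threading.Thread(target=loop, daemon=True).start()
        return stop


# Stand-in for the external data source (hypothetical).
directory = {"alice": "developers"}
cache = RefreshingCache(lambda user: directory[user])

first = cache.get("alice")            # cold start: goes to the source
directory["alice"] = "administrators"  # the source changes
cache.refresh_once()                   # what the background thread does each tick
second = cache.get("alice")            # served from memory, already up to date

print(first, second)  # prints: developers administrators
```

The request path never waits on the data source after the first lookup, at the cost of re-fetching keys that may not be requested again.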

You can connect multiple OPAs together with a distributed caching mechanism: once a single OPA server “sees” a certain request and the corresponding response, the cached <key, value> pair is automatically shared with the other instances in the same cluster.
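
The effect of sharing cached pairs between instances can be sketched as follows. The post does not describe the actual distribution mechanism, so the `SharedStore` below is only an in-process stand-in for whatever cluster-wide cache is used; `PDP` and `fetch_role` are hypothetical names as well.

```python
from typing import Callable, Dict, Optional


class SharedStore:
    """In-memory stand-in for a cluster-wide cache (hypothetical)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class PDP:
    """A policy decision point that consults the shared store before the source."""

    def __init__(self, store: SharedStore, fetch: Callable[[str], str]) -> None:
        self._store = store
        self._fetch = fetch
        self.external_calls = 0

    def lookup(self, key: str) -> str:
        cached = self._store.get(key)
        if cached is not None:
            return cached  # another instance (or this one) already fetched it
        self.external_calls += 1
        value = self._fetch(key)
        self._store.set(key, value)
        return value


def fetch_role(user: str) -> str:
    """Stand-in for the external data source (hypothetical)."""
    return "admin"


store = SharedStore()
pdp_a = PDP(store, fetch_role)
pdp_b = PDP(store, fetch_role)

pdp_a.lookup("alice")  # first instance pays for the external call
pdp_b.lookup("alice")  # second instance finds the shared pair

print(pdp_a.external_calls, pdp_b.external_calls)  # prints: 1 0
```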

We’re working on open-sourcing these functions, and would love to hear your thoughts on them.
