Building a product platform at NerdWallet – Part 2

July 31, 2018

In Part One we spoke about the charter of the Product Platform team, how that led to some interesting technical challenges, and one way that we’ve solved these challenges. Concretely, we discussed how we settled on a Search API powered by Elasticsearch, and a persistent data store powered by Postgres. In this post, we’ll continue to tie up the loose ends, and speak a little bit about the environment around the platform we’ve been discussing.

Glue

So we’ve identified the two main parts of our system: a persistent data store and a search index. In an ideal world, we’d draw a nice arrow between them, dust our hands and call it a day. However, this is software engineering and that’s not just any arrow, that’s the magic glue between the two parts to keep them in sync.

For example, we spoke about the read and write flows and about Elasticsearch being an intelligent cache on top of our source-of-truth in Postgres. A logical next question: how does data get into the cache? How does that relate to the data sitting in the database at any given point in time? Specifically, let’s consider these two characteristics of our system:

  • Consistency: In our mental model of an intelligent cache we presume that Postgres and Elasticsearch are always consistent: they agree on what the state of data is. In practice, however, there will be minor deviations since these are indeed two different data stores. How can we version our data to make it easy to reason about our state? What happens when reindexing fails or, worse still, partially succeeds?
  • Availability: When something does go wrong with writes (and it will), how do we maintain the service (i.e. keep serving reads) in the meantime?

Consistency

Why is consistency important? The meta-point is that the arguments in favor of consistency typically go along the same lines as the motivations for a database that we covered earlier; these motivations are just expanded from the scope of the data store to the scope of the system itself.

Concretely, let’s consider a scenario where reindexing fails (bad data? intermittent network issues?) and leaves the search index in an inconsistent state. It would be confusing for both our partners and our data admins to see partial updates. How does a data admin know whether she’s done her job right and can sign off, or whether she has to raise the issue with engineering teams?

Data writes typically go out in batches. For example, it may make no sense to update the interest rate on one of an institution’s credit cards without updating all of them. Consider these two cards:

[
 {"id": 1234, "name": "Platinum Card", "apr": 14.99},
 {"id": 1357, "name": "Gold Card", "apr": 19.99}
]

Let’s say the Fed changes the interest rate by 0.5%, and our partners in turn send us updated interest rates on their products:

[
 {"id": 1234, "name": "Platinum Card", "apr": 15.49},
 {"id": 1357, "name": "Gold Card", "apr": 20.49}
]

The data admin in charge reviews the automatic feed that’s come in from our partner, audits the data after the ETL (Extract, Transform, Load) step on our side, and signs off saying “go ahead and make this change live”. Subsequently, there’s some error in the reindexing that results in a partial update:

[
 {"id": 1234, "name": "Platinum Card", "apr": 15.49},
 {"id": 1357, "name": "Gold Card", "apr": 19.99},
]

If this is shown to our clients, we put our data quality at risk: partners would be puzzled as to why some cards showed the updated interest rate while others did not. Reliability is a cornerstone of our workflow; without it, people have to check more pages manually, which slows us down as we work to give our users the best, most accurate information possible. In essence, the data-admin workflow is batch-edit, then batch-review, then batch-write. End-to-end consistency is therefore key to providing a clean mental model.
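On the database side, this batch-write atomicity is straightforward to sketch with a transaction. The example below is a minimal, hypothetical illustration using SQLite in place of Postgres; the `cards` table, column names, and `apply_batch` helper are illustrative, not our actual schema or code:

```python
import sqlite3

def apply_batch(conn, updates):
    """Apply a batch of APR updates atomically: all rows change, or none do."""
    try:
        # The connection context manager opens a transaction, commits on
        # success, and rolls back if anything inside raises.
        with conn:
            for record in updates:
                cur = conn.execute(
                    "UPDATE cards SET apr = ? WHERE id = ?",
                    (record["apr"], record["id"]),
                )
                if cur.rowcount != 1:
                    raise ValueError(f"unknown card id {record['id']}")
        return True
    except Exception:
        # Nothing was applied; the data admin can fix the feed and retry.
        return False
```

Because a failed batch leaves the table exactly as it was, a data admin who signs off on a review knows the live data will match what she reviewed, or won’t change at all.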

Availability

Sometimes our batch writes can get pretty large, with lots of marketing information changing in reaction to an event, as in the prime-rate example mentioned above. In these cases, we need a way to keep older data available so that search proceeds normally while reindexing runs in the background.

Fun fact: taking Consistency and Availability together reads almost like Atomicity: the result of a reindex is either completely available when it succeeds, or a no-op when it fails.

Coordinating Asynchronously

Reindexing can be a long-running operation, so we run it as a single-threaded background task. The task is parameterized by product type (or vertical), which gives our clients process isolation from each other.

The task first asks the database for the active index version, and also retrieves the current loose-schema specification for the product-type. Using the version number and the mapping, a new index is created. Data is then retrieved from different tables in the database, flattened into documents that describe each product, and put into this index.

If everything goes well, the product-type’s active index version is switched to point to the new index that has been created.

If anything goes wrong at any point in this process, the index is deleted and the active index version is not modified. The coordination with the Search API happens via this one number: during search, we read the active version off the database and use that to construct our query to the search index.
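Put together, the flow above (create a versioned index, load it, flip the pointer on success, delete it on failure) can be sketched as a small coordination routine. The store and index operations here are injected as plain callables with hypothetical names; in reality they would talk to Postgres and Elasticsearch:

```python
def reindex(vertical, get_active_version, load_documents, create_index,
            bulk_index, set_active_version, delete_index):
    """Build a new versioned index for one vertical; flip the pointer only on success."""
    new_version = get_active_version(vertical) + 1
    index_name = f"{vertical}_v{new_version}"
    create_index(index_name)
    try:
        # Flatten rows into documents and load them into the new index.
        bulk_index(index_name, load_documents(vertical))
    except Exception:
        # A failed reindex is a no-op: drop the half-built index and
        # leave the active version pointing at the old, good data.
        delete_index(index_name)
        raise
    # Single pointer flip: readers switch to the new index all at once.
    set_active_version(vertical, new_version)
    return index_name
```

The only shared state between the reindex task and the Search API is that one version number, which is what makes the failure behavior easy to reason about.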

Storing the active version in the database gives us the ACID properties we desire. Since the index version is flipped once at the end of the process, the data only goes live once it has been fully refreshed. In the meantime (or if something goes wrong) we continue to serve existing data to our clients.
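The read side of this coordination is tiny: resolve the active version from the database and target the matching index. A sketch, with a hypothetical version-lookup callable standing in for the Postgres read:

```python
def search_index_for(vertical, get_active_version):
    """Resolve which versioned index a search query should hit right now.

    Because the pointer is flipped in a single database write, a reader
    always sees either the old index or the new one, never a mix.
    """
    return f"{vertical}_v{get_active_version(vertical)}"
```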

It is important to call out that we do not stream diffs to the search indexes (even though Elasticsearch supports this) and instead opt for the cleaner mental model of versioned indexes. We can do this because we have an upper bound on the total number of records in our system; this approach may not work for other use-cases.

Another important gotcha is index-pruning: since Elasticsearch keeps indexes around until they are explicitly deleted, we’d eventually run out of space on our cluster if we didn’t prune. To that end, index pruning periodically cleans up any index versions more than, say, 5 versions behind the active one. Having older (and tested) versions of data around is a valuable asset in case we need to quickly switch back to a previous version for site-reliability reasons.
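Deciding what to prune then reduces to keeping a window of recent versions behind the active one. A minimal sketch (the function name and default window are illustrative):

```python
def versions_to_prune(all_versions, active_version, keep=5):
    """Return index versions safe to delete: anything more than `keep`
    versions behind the active one. The active version and the window
    of recent versions behind it are always retained for fast rollback."""
    cutoff = active_version - keep
    return sorted(v for v in all_versions if v < cutoff)
```

A periodic job would call this with the versions currently present in the cluster and delete whatever comes back.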

An Alternate Universe

It’s worth briefly exploring a very different route we could have taken. Elasticsearch does allow us to stream small diffs to an existing index, and we’re typically modifying at most 5% of all records in one batch. So why did we not use the more efficient, better tech? The answer in our case is simple – simplicity. In this approach, where we’d mutate the existing index, we’d need to do the following:

  • Find a way to store the before/after state of each record that’s changed
  • Manually roll back the applied changes when something goes wrong during reindexing
  • Expose ourselves to intermittent search result inconsistencies, since we’d be writing incremental changes to our active search index

By keeping our search index immutable, we side-step all these complications at the cost of having a couple of large indexes floating around. As we mentioned earlier, this tradeoff makes sense for our data’s scale.

Key Learnings: Glue

  • It’s possible to integrate the best parts of two technologies in a manner that makes the overall system seem incredibly well suited. In our case we were able to simulate ACID behavior on Elasticsearch by storing the index-version in Postgres.
  • There is great value to simplicity, even though it’s a more wasteful or less clever solution. When used appropriately, immutable data structures introduce clean mental models and encourage simple solutions.

Admin UI

In Part 1 we stated that one of the goals of the product platform was to allow product managers to iterate quickly. At an abstract level, this informed some of our technical decisions on how to model the data. More concretely, we also built an Admin UI to allow PMs to easily change data and ship it with the click of a button.

The Admin tool is built on top of the Admin on Rest framework and provides a single way to manage data across various verticals. The framework integrates seamlessly with the RESTful CRUD APIs that we build on top of our database. A key reason for choosing this framework was that, beyond its near-seamless plug-and-play setup, its components are well designed for easy customization.

Our choice of a traditional database served us well here because:

  • We needed transactional capabilities for the Admin interface. Eventual consistency and asynchronous edits are extremely confusing for data administrators. Since we were replacing a system they had almost complete control over, eventual consistency seemed a sub-optimal choice.
  • A normalized data model is great for data administrators. It removes the risk of applying an update in only 6 out of 7 places, for example, in a denormalized (flat) store.

The Admin UI Platform

Another interesting facet to the admin interface comes, once again, from the loose-schema we mentioned earlier. Since the schema of a vertical is meant to change frequently, we required our UI to be robust to these changes.

One way we could have achieved this is by representing the loose-fields for what they were: one big JSON blob. This had two key downsides:

  • Consistency: Our data administrators thought of the loose-schema fields like any other fields; having the top-level fields individually and the loose-fields grouped together in an alien data format (JSON) was not user-friendly to them.
  • Validation: While we did have server side validation, client side validation is more responsive, and has error messages better tailored to the end-user. However the client can only perform useful validation on loose-schema fields if it understands what those fields are.

We eventually opted for an approach where the Admin UI would ask each vertical to describe its schema, and then lay out the UI based on the response. Since we already used JSON Schema to describe and validate our loose-schema on the back-end, it was a logical response choice for our shape-description API. Concretely, we key off the “type” and “enum” JSON Schema properties to decide which components to render, and use the various limiting properties (like “minimum”, “format”, “pattern”, etc.) for client-side validation.
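As a rough illustration of that component-selection logic, the mapping from a single field’s JSON Schema fragment to a UI component and its validators might look like the sketch below. The component names and the exact mapping are hypothetical, not our production code:

```python
def component_for(field_schema):
    """Pick an admin-UI input component and client-side validators
    from one field's JSON Schema fragment."""
    if "enum" in field_schema:
        component = "SelectInput"          # fixed set of choices
    elif field_schema.get("type") == "number":
        component = "NumberInput"
    elif field_schema.get("type") == "boolean":
        component = "BooleanInput"
    else:
        component = "TextInput"            # default for strings and unknowns
    # Limiting keywords become client-side validators that mirror the
    # server-side JSON Schema checks.
    validators = {
        key: field_schema[key]
        for key in ("minimum", "maximum", "pattern", "format")
        if key in field_schema
    }
    return component, validators
```

Because the UI is driven entirely by the schema response, adding a field to a vertical’s loose-schema makes it show up in the Admin UI with no front-end change.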

The Complete System

We’ve covered the data storage, the search index, and the glue that powers our Search API. We’ve also touched upon the importance of the Admin UI in our platform, and how it got a platform flavor. We’ll leave you now with a diagram to summarize our tech stack, and give an idea of some related ongoing work.

A brief note on a piece we haven’t touched upon – Third Party Ingestion pipelines. These represent another way to get updated product data into the system. The key question here: how do we make ingestion pipelines rule-driven and platformized in the same way that we platformized search?

Another exciting direction of future work is around personalized fields. When our client teams leverage our Search API to filter and sort through the inventory, the question they’re really asking is “What’s the best product for this person?” Could we remove the complexity of maintaining these queries, and instead rely on a Machine Learning model’s personalized score for each product instead?

Summary

Building a platform is an interesting experience, especially at a startup. People like autonomy and control, and it takes enormous trust for feature teams to hand over a core part of their daily life to a platform team. We continually remind ourselves that we are not in the business of building platforms; we are in the business of helping client teams solve user problems faster. How we do this is with a platform. This is a subtle difference that we keep repeating because it helps us resist the temptation to (over)engineer a “beautiful” system. The platform, at its core, is purely a way to leverage economies of scale so a small team can do more things. We’re lucky that a bunch of interesting engineering challenges come along for the ride!

Another question we get a lot is around tech-debt, so it’s worth a brief mention here. When the return on investment seems sound, we consider it our responsibility to take on some technical debt to unblock value for a client team. Apart from keeping the business rolling, this also gives us valuable usage-feedback earlier in our design cycle, which lets us iterate to a better solution. This is where hard API interfaces are wonderful and hide the subsequent tinkering that goes on behind the scenes.

Key Learnings

  • Most data-oriented platforms have distinct write and read workflows with very different usage characteristics. Instead of trying to build one bespoke system that does everything, it’s worthwhile considering how a couple of off-the-shelf solutions can work together.
  • We have found benefit in being hyper-aware of where our interesting engineering problems lived; being radically honest with ourselves about the expected scale of our system helped us avoid lofty engineering goals (which often come with more complexity and doubt).
  • The Admin UI is an often underrated part of a system. So much of the platform is available as self-service via the Admin UI, and it is critical for us to speak with our internal clients, add features or guardrails, and ensure it helps them get their work done accurately and quickly.

We hope you enjoyed this two-part deep dive into the Product Platform! There is still a lot to be done, so if you enjoy solving backend problems and would like to be part of the team, check out some of the open roles we have on the engineering team here at NerdWallet.