Building a product platform at NerdWallet – Part 1

July 17, 2018

Introduction

NerdWallet’s mission is to provide clarity for all of life’s financial decisions. This means we have reams of content and tens of thousands of financial product recommendations across verticals as diverse as auto loans, credit cards, mortgages and savings accounts. Keeping this information consistent, compliant and up-to-date across our entire product offering is a core business need.

Previously, each of our vertical teams managed their product data themselves – each vertical’s data had its own shape, complexity, scale, and data management practices. For example:

  • Our Personal Loans team managed data via a spreadsheet administered by their product manager which was subsequently ingested into JSON files. Some more complex (and long-lived) data was checked into their codebase along with these JSON files.
  • Our Credit Cards team required rigorous auditing and change tracking workflows for compliance reasons. They had a rich admin interface, highly normalized data, and some pretty funky reward calculations.
  • Our Banking team had a legacy system that hadn’t aged well. They had 300,000 rows of frequently changing offers that flowed into a ad-hoc database without integrity checks and change tracking.

While this approach let each team scale independently in the beginning, some downsides soon became apparent:

  • Different engineers were spending time solving the same table-stakes problems and were not helping our users solve theirs. This resulted in significant inefficiencies.
  • Different systems had different workflows and data APIs; this made it difficult for product managers and engineers to move across teams to pinch-hit and collaborate.
  • As our product evolved, we realized that our users didn’t interact with each vertical individually. For them, each financial product was one thread in an incredibly complex tapestry that connected them to their financial goals.

Determined that there was a better way, a small group of Nerds were banded together in July 2017 and tasked to with building the first* Product Platform at Nerdwallet. This is the story of how we did that and our learnings along the way.

*Except, we weren’t really the first. There were at least two attempts to do this before, and maybe a future blog post will try and figure out why this time it worked.

Product Platform @ NerdWallet

The Product Platform team exists to help the vertical engineering teams focus on solving our users’ problems delightfully and quickly. We do this by working with and across different vertical teams to identify both commonalities and differences in how they model product data; we then provide a platform that does three things in a reliable manner:

  • Ingests and store vertical data with strong guarantees
  • Help Nerds admin data with access control, permissions, and change tracking
  • Provide a consistent and powerful Search API that lets them access inventory data fast

Objectives

Making the right architecture and design decisions is often more a function of context than absolute truth. To that end, we’ll re-state what we were solving for more formally with some objectives and key results for our platform:

  • One Search API across all financial products
    • Engineers learn one API and use it across multiple verticals with predictable results
  • Quick and fearless data management
    • Product managers easily modify inventory data and push it to production without any help from engineers
    • Legal and compliance teams can easily track changes and audit data on the platform
  • Automatic ETL (extract, transform, load) to ingest inventory data across verticals

While we will try and focus on the Search API in this post, it’s important to note these other objectives since they play into how we look at the entire system.

Designing the Search API

We set out with the above goals and came up with the following architectural requirements:

  • We wanted to support loose schema defined by the verticals themselves; this would help them iterate fast without our team being a bottleneck
  • We wanted to support simple querying by any field out of the box; this again would help verticals iterate faster without having to define indexes before they could search by a field
  • We wanted to maintain some notion of hierarchical relationships between classes of records. However, we wanted the API to expose properties of a parent as first-class properties of the child. That is, consider the following toy inventory with just two records:
    [
      {"id": 1234, "name": "Platinum Card", "institution": {"id": 1111, "name": "Foo Bank"}},
      {"id": 1357, "name": "Gold Card", "institution": {"id": 2222, "name": "Bar Credit Union"}},
    ]
    

    While it is obvious that we should be able to query based on the record name or record id, what is maybe less obvious is that we should also be able to query on all records related to Foo Bank.

After some investigation we found that Elasticsearch (the default search index choice) fit our needs. The features that seemed particularly well suited were:

  • Powerful auto-indexing capabilities across all fields. This means we could support queries across any attribute of a financial product automagically.
  • Fast response times even for complex search terms when keeping the entire index in memory. Our inventory data had an upper limit of a million records so we were unlikely to run into major memory problems.
  • Built-in support for full-text search and fuzzy searching. This was a bonus: the feature had been on some product teams’ wishlists and we were thrilled that it came out-of-the-box.

We decided on a set up where each product type (which had their own unique loose-schema) would get an individual index. This isolation mapped to the isolated workflow we provided the respective teams and helped avoid sneaky issues like different product types defining different types for the same field name. Since Elasticsearch allowed searches across multiple indexes, we don’t sacrifice much in terms of functionality.

What about Dynamo?

We also evaluated Dynamo. We liked it because it was a no-frills, well tested, and super fast document store. Further, we could define secondary indexes on the data to solve the same query requirements. Eventually we decided to go with Elasticsearch because:

    • Dynamo had a limit of the number of secondary indexes it could support; working around this limit introduced complexity into our architecture. (Even having to define an index on the data was more complexity than Elasticsearch’s auto-indexing)
    • Dynamo was built as an extremely fast and always-available data store – not a search engine. It did not have some of the fancier fuzzy-search and custom-scoring features that Elasticsearch boasted

Calculated Fields

While we’ve spoken about searching inventory based on existing static fields, frequently the more useful feature is to search based on some calculated fields that call out parts of a product specific to the user. For example, users have different spending habits and “the best rewards credit card” is different for a user who loves traveling, and different for one who spends a lot on groceries. This is further complicated in a platform because, like the loose schema, these calculations are dependent on the domain specific knowledge of the vertical teams and a good platform should enable teams to iterate successfully on the feature.

We initially envisioned leveraging Elasticsearch’s support for script-fields; these are fields that calculate their value based on a script that could reference other static fields for the document (or record) in question. However, we came up against two challenges:

  • Credit card calculations are quite complex; figuring out how much money a user could save or the amount of rewards they could accrue involved math that was beyond the scope of what the scripting language seemed built for.
  • We were underwhelmed by the experience of working with the scripted fields. They proved to have some obscure performance characteristics, and our feedback-loop was sufficiently slow that we were concerned about code maintainability.

To solve this, we ended up running all calculations (and subsequent filters and sorts based on calculated fields) in our application code. We still run a subset of the filters in Elasticsearch which does help us reduce the number of records we fetch into memory. We’ve implemented a Domain Specific Language (DSL) for filters that abstracts away the particular implementation being used. By first loading the input payload into this DSL, we can conditionally run some pre-filtering in Elasticsearch and run some additional filtering in memory, after the calculations have been run.

Key Learnings: Search API

  • Social proof is a very useful signal when exploring technologies: we relied on it when narrowing the field of contenders to Elasticsearch + Dynamo.
  • It’s equally valuable to think about what our platform won’t do that other people are doing: in our case, it meant acknowledging that the data was fairly static and the size was bounded. This helped auto-indexing tip the balance in Elasticsearch’s favor.
  • As a platform team, we have found value in appreciating the workflow that we’re imposing on our clients. We could have attempted to define one data model that all our verticals contributed to; this tickled our engineering fancy until we asked “how much friction is between our clients and this platform’s value?” In our use-case, an ideal, shared, monolithic data model would have imposed too much friction on the business.

Database

While Elasticsearch sounded great for our Search API, there was another aspect to this: how do we get data into our search indexes in a reliable manner? To illustrate what this would look like, let’s channel some old-fashioned reductio ad absurdum.

Let’s assume we had a single data store that is Elasticsearch. The data store accepts a stream of edits to the data and applies changes to existing indexes immediately. Now, let’s punch some holes in this architecture:

  • What happens if an indexing fails, for whatever reason, on our hosted Elasticsearch: how do we get back to a good place?
  • Let’s tackle just the first part of getting back to a good place: where can I go to see “this is what the data should be like”?
  • How do we allow this data to be administered in an easy, deterministic manner? How does a data-admin know she’s done her job right, and how do we minimize the effort required to right mistakes in data entry?
  • How do we deal with hierarchical relationships between data? If one parent data record changed, that might mean multiple denormalized (flat) records changing as a result. How do we ensure correctness, i.e that all of these records change or none of them do?
  • How is data audited, change-tracked and all the other fun things we do with data everyday?

We quickly realized that all of these were addressed with any standard ACID database. This would support an intuitive and reliable data administration flow, and also greatly simplify our mental models when reasoning about data state. The database would always be our source of truth, with Elasticsearch being an intelligent cache for the Search API.

Modeling Loose Schema

There was one (slight) dissonance between one of our stated goals and an ACID database, that is:

Product managers can easily modify inventory data and push it to production without any help from engineers

The complexity, though, was what this meant from a platform perspective: we would need to allow each vertical to own, modify, and iterate their individual schema. Working with a relational database’s notion of table schema, this would traditionally require us to model each vertical’s data in their own tables, and run database migrations on these when they change their relevant schema definitions. This was concerning as we expected many such modifications to happen, and performing DDL operations with high frequency seemed undesirable.

Luckily, we realized that modern relational databases have moved with the NoSQL times. As Uber’s Schemaless technology shows, it is possible to store loose-form data in a relational database via native JSON columns. We could validate and interpret data going into the free-form column using the product-type specific schema definition and avoid running any database schema migrations. We ultimately settled on a data model that was a hybrid of strict columns (id, name, updated_at, etc) and a free form “details” column. The strict columns let us impose some order and consistency across the platform, which potentially lets us run cross-product-type searches in the future.

MySQL vs Postgres

The next (and final) step was to choose a database technology. The two main competitors here are MySQL and Postgres. Without going into the weeds, here is one major caveat: either would probably work reliably at our scale. There just isn’t enough (static) product data for financial services in the United States to exercise the limits of a modern database. With that out of the way, a couple of interesting features tilted the balance towards Postgres.

Note: We were comparing MySQL v5.6 with Postgres v9.3. Subsequently, MySQL v8.0 was released which has much better support for the SQL standard.

Atomic DDL meant that we would not be left in an inconsistent state if one of our (rare) schema migrations went awry. We had experience with this happening in MySQL frequently during local development and occasionally in production, and were looking forward to working with a cleaner mental model

Actual Type and Constraint Checking is puzzlingly missing from MySQL. This is less of an issue because our architecture ensures that access to the database is controlled by our microservice; this meant we could have more expressive checks implemented in application logic. (For the eagle-eyed, the loose-schema we spoke about earlier demands application level validation). If we ever had to break out of the ORM level though (e.g for batch writes) having the option of database-level checks was reassuring.

JSONB and GIN Indexes were great icing on the cake. These gave us the option to do some limited indexing into our loose-schema columns. While range-queries are not supported out of the box, even having just equality and key-membership support was reassuring, especially when building out admin interfaces for this data.

Key Learnings: Databases

  • Relational databases with ACID properties are great as sources of truth. While not the trendiest things, they’ve been around long enough that most things just work
  • It’s easy to go from relational (normalized) to flat (denormalized) data and difficult the other way round
  • MySQL vs NoSQL is a false dichotomy: both technologies have progressed to the mean so choose wisely. While you’re at it, generally beware of false dichotomies between any two technologies, i.e questions posed as “X vs Y”

Interested in more? Read: Part II, putting it all together.

Have strong opinions about appropriate API and data model design? Check out some of the open roles we have on the engineering team here at NerdWallet.