Marquez: The Metadata Discovery Solution at WeWork

Created in 2010, WeWork is a global office and workspace leasing company. Their objective is to provide space for teams of any size including startups, SMEs, and major corporations, to collaborate. To achieve this, what WeWork provides can be broken down into three different categories:

Space: To provide companies with optimal space, WeWork must supply the appropriate infrastructure, which ranges from bookable rooms for interviews or one-on-ones to entire buildings for large corporations. These spaces must also be equipped with the appropriate facilities, such as kitchens for lunch and coffee breaks, bathrooms, etc.
Community: Via WeWork’s internal application, the firm enables WeWork members to connect with one another, whether locally within their own WeWork space or globally. For example, if a company needs feedback on a project from people in specific roles (such as a developer or UX designer), it can ask for feedback and suggestions directly via the application from any member, regardless of their location.
Services: WeWork also provides its members with full IT support if any problems arise, as well as other services such as payroll and utilities.

In 2020, WeWork represents:

More than 600,000 memberships.
Locations in 127 cities in 33 different countries.
850 offices worldwide.
$1.82 billion in revenue.

It is clear that WeWork works with all sorts of data from its staff and customers, whether individuals or companies. The firm therefore needed a platform where its data experts could view, collect, aggregate, and visualize the metadata of its data ecosystem. This need was met by the creation of Marquez.

This article focuses on WeWork’s implementation of Marquez, drawing mainly on freely accessible documentation from various websites, to illustrate the importance of an enterprise-wide metadata platform for becoming truly data-driven.

Why Manage and Utilize Metadata?

In his talk “A Metadata Service for Data Abstraction, Data Lineage & Event-Based Triggers” at Data Council in 2018, Willy Lulciuc, Software Engineer for the Marquez project at WeWork, explained that metadata is crucial for three reasons:

Ensuring Data Quality: When data has no context, it is hard for data citizens to trust their data assets: are there fields missing? Is the documentation up to date? Who is the data owner and are they still the owner? These questions are answered through the use of metadata.
Understanding Data Lineage: Knowing your data’s origins and transformations is key to understanding which stages your data went through over time.
Democratization of Datasets: According to Willy Lulciuc, democratizing data in the enterprise is critical! Having a central portal or UI available for users to be able to search for and explore their datasets is one of the most important ways companies can truly create a self-service data culture.

To sum up, the goal is to create a healthy data ecosystem. Willy explains that being able to manage and utilize metadata creates a sustainable data culture where individuals no longer need to ask for help to find and work with the data they need. On his slide, he goes through three characteristics that make up a healthy data ecosystem:

Being a self-service ecosystem, where data and business users can discover the data and metadata they need, and explore the enterprise’s data assets even when they don’t know exactly what they are searching for. Providing data with context gives all users and data citizens the ability to work effectively on their data use cases.
Being self-sufficient, by giving data users the freedom to experiment with their datasets as well as the flexibility to work on every aspect of them, whether they are input or output datasets.
And finally, instead of relying on certain individuals or groups, a healthy data ecosystem makes all employees accountable for their own data. Each user has the responsibility to know their data and its cost (is this data producing enough value?), and to keep its documentation up to date in order to build trust around their datasets.

Room Booking Pipeline Before

As mentioned above, utilizing metadata is crucial for data users to be able to find the data they need. In his presentation, Willy shared a real situation to prove metadata is essential: WeWork’s data pipeline for booking a room.

For a “WeWorker”, the steps are as follows:

Find a location (the example was a building complex in San Francisco).
Choose the appropriate room size (usually split by number of attendees; in this case they chose a room that could accommodate 1 to 4 people).
Choose the date for when the booking will take place.
Decide on the time slot the room is booked for as well as the duration of the meeting.
Confirm the booking.

Now that we have an example of how their booking pipeline works, Willy proceeds to demonstrate how a typical data team would operate when wanting to pull out data on WeWork’s bookings. In this case, the example exercise was to find the building that held the most room bookings, and extract that data to send over to management. The steps he stated were the following:

Read the room bookings from a data source (usually unknown).
Sum up all of the room bookings and return the top locations.
Once the top location is calculated, the next step is to write it into some output data source.
Run the job once an hour.
Process the data through .csv files and store it somewhere.
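
The core of such a job can be sketched in a few lines of Python. This is only an illustration of the steps above, not WeWork's actual pipeline: the booking records, the field names (booking_id, location), and the function names are assumptions made for the example, and the hourly scheduling is left out.

```python
import csv
import io
from collections import Counter

# Hypothetical booking records; a real job would read these from the input dataset.
bookings = [
    {"booking_id": 1, "location": "San Francisco"},
    {"booking_id": 2, "location": "New York"},
    {"booking_id": 3, "location": "San Francisco"},
    {"booking_id": 4, "location": "London"},
    {"booking_id": 5, "location": "New York"},
    {"booking_id": 6, "location": "San Francisco"},
]


def top_locations(records, n=1):
    """Count bookings per location and return the n most-booked locations."""
    counts = Counter(record["location"] for record in records)
    return counts.most_common(n)


def write_report(rows, out):
    """Write (location, bookings) rows as CSV to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["location", "bookings"])
    writer.writerows(rows)


# An in-memory buffer stands in for the output data source.
report = io.StringIO()
write_report(top_locations(bookings), report)
print(report.getvalue())
```

Note that nothing in this code says where the bookings come from, who owns them, or how fresh they are, which is exactly the gap described next.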

However, Willy stated that even though these steps seem good enough, problems usually occur. He goes over three types of issues that come up during the job process:

Where can I find the job input’s dataset?
Does the dataset have an owner? Who is it?
How often is the dataset updated?

Most of these questions are difficult to answer, and jobs end up failing. Without being able to trust this information, it is hard to present numbers to management. These problems are what led WeWork to develop Marquez.

What is Marquez?

Willy defines the platform as an “open-sourced solution for the aggregation, collection, and visualization of metadata of [WeWork’s] data ecosystem”. Marquez is a modular system, designed as a highly scalable, highly extensible, platform-agnostic solution for metadata management. It consists of the following components:

Metadata Repository: Stores all job and dataset metadata, including a complete history of job runs and job-level statistics (e.g. total runs, average runtimes, successes/failures).
Metadata API: RESTful API enabling a diverse set of clients to begin collecting metadata around dataset production and consumption.
Metadata UI: Used for dataset discovery, connecting multiple datasets and exploring their dependency graph.
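
To make the job-level statistics mentioned for the Metadata Repository concrete, here is a minimal sketch of how they could be derived from a run history. The JobRun class, its fields, and the job_stats function are hypothetical names chosen for this illustration; they are not Marquez's actual data model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRun:
    """One recorded execution of a job (hypothetical shape)."""
    job_name: str
    runtime_seconds: float
    succeeded: bool


def job_stats(runs):
    """Summarize a run history: total runs, average runtime, successes, failures."""
    if not runs:
        return {"total_runs": 0, "average_runtime": None, "successes": 0, "failures": 0}
    successes = sum(1 for run in runs if run.succeeded)
    return {
        "total_runs": len(runs),
        "average_runtime": mean(run.runtime_seconds for run in runs),
        "successes": successes,
        "failures": len(runs) - successes,
    }


history = [
    JobRun("top_locations", 30.0, True),
    JobRun("top_locations", 50.0, False),
    JobRun("top_locations", 40.0, True),
]
stats = job_stats(history)
print(stats)
```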

Marquez’s Design

Marquez provides language-specific clients that implement the Metadata API. This enables a diverse set of data processing applications to collect metadata. The initial release supported both Java and Python.

The Metadata API extracts information around the production and consumption of datasets. It’s a stateless layer responsible for specifying both metadata persistence and aggregation. The API allows clients to write dataset information to, and read it from, the Metadata Repository.
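
The idea of a stateless layer in front of a repository can be sketched as follows. This is a simplified, in-memory illustration and not the Marquez API: DatasetRecord, InMemoryRepository, put_dataset, and get_dataset are all hypothetical names, and the record fields are assumptions modeled on the three questions raised earlier (where the dataset lives, who owns it, how often it is updated).

```python
from dataclasses import asdict, dataclass


@dataclass
class DatasetRecord:
    """Hypothetical dataset metadata record."""
    name: str
    location: str
    owner: str
    update_frequency: str


class InMemoryRepository:
    """Stand-in for the Metadata Repository; all state lives here."""

    def __init__(self):
        self._datasets = {}

    def save(self, record):
        self._datasets[record.name] = record

    def find(self, name):
        return self._datasets.get(name)


# The API functions keep no state of their own: everything is passed in
# or read from the repository, so any instance can serve any request.
def put_dataset(repo, payload):
    """Validate a client payload and persist it as a dataset record."""
    record = DatasetRecord(**payload)
    repo.save(record)
    return asdict(record)


def get_dataset(repo, name):
    """Return a dataset's metadata as a plain dict, or None if unknown."""
    record = repo.find(name)
    return asdict(record) if record else None


repo = InMemoryRepository()
put_dataset(repo, {
    "name": "room_bookings",
    "location": "warehouse.public.room_bookings",
    "owner": "data-platform",
    "update_frequency": "hourly",
})
print(get_dataset(repo, "room_bookings"))
```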

Metadata needs to be collected, organized, and stored in a way that allows for rich exploratory queries via the Metadata UI. The Metadata Repository serves as a catalog of dataset information, encapsulated and cleanly abstracted away by the Metadata API.

According to Willy, what makes a very strong data ecosystem is the ability to search for information and datasets. Datasets in Marquez are indexed and ranked by a keyword- or phrase-based search engine that also takes a dataset’s documentation into account: the more context a dataset has, the more likely it is to appear first in the search results. Examples of a dataset’s documentation include its description, owner, schema, and tags.
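
One way to picture this kind of ranking is the sketch below. It is an assumption-laden illustration rather than Marquez's actual search implementation: the scoring rule (keyword matches first, then number of documented fields), the field names, and the sample datasets are all invented for the example.

```python
# Fields counted as "documentation" in this sketch (an assumption).
DOCUMENTATION_FIELDS = ("description", "owner", "schema", "tags")


def documentation_score(dataset):
    """Count how many documentation fields are filled in."""
    return sum(1 for field in DOCUMENTATION_FIELDS if dataset.get(field))


def search(datasets, query):
    """Return names of matching datasets, best documented matches first."""
    terms = query.lower().split()
    results = []
    for dataset in datasets:
        text = f"{dataset['name']} {dataset.get('description') or ''}".lower()
        matches = sum(1 for term in terms if term in text)
        if matches:
            results.append((matches, documentation_score(dataset), dataset["name"]))
    # More keyword matches first, then richer documentation, then name for stability.
    results.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return [name for _, _, name in results]


datasets = [
    {"name": "room_bookings_raw"},
    {
        "name": "room_bookings",
        "description": "Confirmed room bookings per location",
        "owner": "data-platform",
        "schema": ["booking_id", "location"],
        "tags": ["bookings"],
    },
    {"name": "member_profiles", "description": "Member directory", "owner": "community"},
]
print(search(datasets, "room bookings"))
```

Both room-booking datasets match the query equally well, so the documented one is ranked ahead of the undocumented one.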

You can see more detail of Marquez’s data model in the presentation itself here: https://www.youtube.com/watch?v=dRaRKob-lRQ&ab_channel=DataCouncil

The Future of Data Management at WeWork

Two years after the project’s launch, Marquez has proven to be a big help for the giant leasing firm. Their long-term roadmap focuses on the solution’s UI, adding more visualizations and graphical representations to give users simpler and more enjoyable ways to interact with their data.

They also provide various online communities via their GitHub page, as well as groups on LinkedIn, where those interested in Marquez can ask questions, get advice, or report issues with the current Marquez version.