WhereHows: A Data Discovery and Lineage Portal for LinkedIn

Actian Corporation

April 20, 2020

Metadata is becoming increasingly important for modern data-driven enterprises. As the data landscape expands rapidly and information systems grow more complex, organizations in all sectors have understood the importance of being able to discover, understand, and trust their data assets.

Whether your business is in streaming, like Spotify or Netflix, ride sharing, like Uber or Lyft, or rentals, like Airbnb, it is essential for data teams to be equipped with the right tools and solutions that allow them to innovate and produce value with their data.

In this article, we focus on WhereHows, an open-source project led by the LinkedIn data team that creates a central repository and portal for the people, processes, and knowledge around data. With more than 50 thousand datasets, 14 thousand comments, and 35 million job executions and related lineage information, it is clear that LinkedIn’s data discovery portal is a success.

LinkedIn Key Statistics

Founded in 2003 in California by Reid Hoffman, Allen Blue, Konstantin Guericke, Eric Ly, and Jean-Luc Vaillant, the firm started out very slowly. It finally became profitable in 2007, and by 2011 it had more than 100 million members worldwide.

By 2020, LinkedIn had grown significantly:

  • More than 660 million LinkedIn members worldwide, with 206 million active users in Europe.
  • More than 80 million users on LinkedIn Slideshare.
  • More than 9 billion content impressions.
  • 30 million companies registered worldwide.

LinkedIn is a must-have professional networking application for recruiters, marketers, and sales professionals. So, how does the web giant keep up with all of this data?

How it All Started

Like most companies with a mature BI ecosystem, LinkedIn started out with a data warehouse team, responsible for integrating various information sources into consolidated golden datasets. As the number of datasets, producers and consumers grew, the team increasingly felt overwhelmed by the colossal amount of data being generated each day. Some of their questions were:

  • Who is the owner of this data flow?
  • How did this data get here?
  • Where is the data?
  • What data is being used?

In response, LinkedIn decided to build a central metadata repository to capture metadata across all of its systems and surface it through a single platform to simplify data discovery: WhereHows.

What is WhereHows?

WhereHows integrates with all data processing environments and extracts metadata from them.

Then, it surfaces this information via two different interfaces:

  1. A web application that enables navigation, searching, lineage visualization, discussions, and collaboration.
  2. An API endpoint that enables the automation of other data processes and applications.

This repository enables LinkedIn to solve problems around data lineage, data ownership, schema discovery, operational metadata mashup, data profiling, and cross-cluster comparison. In addition, they implemented machine-based pattern detection and association between the business glossary and their datasets, and created a community based on participation and collaboration that enables them to maintain metadata documentation by encouraging conversations and pride in ownership.

There are three major components of WhereHows:

  1. A data repository that stores all metadata.
  2. A web server that surfaces data through an API and a UI.
  3. A backend server that fetches metadata from other information sources.

How Does WhereHows Work?

The power of WhereHows comes from the metadata it collects from LinkedIn’s data ecosystem. It collects the following metadata:

  • Operational metadata, such as jobs, flows, and their executions.
  • Lineage information, which connects jobs and datasets together.
  • Catalog information, such as a dataset’s location, schema structure, ownership, creation date, and so on.
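
To make the three kinds of metadata concrete, here is a minimal sketch of how they could be modeled. The class names, fields, and sample values are hypothetical illustrations, not the actual WhereHows schema.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Dataset:
    """Catalog information for one dataset (hypothetical fields)."""
    urn: str                 # made-up identifier format
    location: str
    schema: tuple[str, ...]  # column names only, for brevity
    owner: str
    created: date


@dataclass(frozen=True)
class JobExecution:
    """Operational metadata: one run of a job inside a flow."""
    flow: str
    job: str
    execution_id: int


@dataclass(frozen=True)
class LineageEdge:
    """Lineage: ties a job execution to a dataset it read or wrote."""
    execution: JobExecution
    dataset_urn: str
    direction: str  # "read" or "write"


# Sample records; every value here is invented for illustration.
page_views = Dataset(
    urn="hdfs:///data/tracking/page_views",
    location="/data/tracking/page_views",
    schema=("member_id", "url", "viewed_at"),
    owner="tracking-team",
    created=date(2016, 3, 1),
)
run = JobExecution(flow="daily_metrics", job="aggregate_views", execution_id=42)
edge = LineageEdge(execution=run, dataset_urn=page_views.urn, direction="read")
```

The lineage record holds only references to the other two, which is what lets it act as the link between catalog and operational metadata.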

How They Use Metadata

WhereHows uses a universal model that helps data teams get more value from the metadata, for example by searching across platforms on different aspects of datasets.

Dataset metadata and job operational metadata are the two endpoints. Lineage information connects them and enables data teams to trace from a dataset or job to its upstream or downstream jobs and datasets. If the entire data ecosystem is collected into WhereHows, they can trace a data flow from start to finish.
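
Tracing upstream or downstream amounts to walking a graph whose nodes are jobs and datasets. The following is a minimal sketch of that idea, assuming lineage is available as simple (source, target) pairs; the node names and the `trace` helper are hypothetical and do not reflect how WhereHows stores or queries lineage.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: a dataset feeds a job, a job produces a dataset.
EDGES = [
    ("dataset:raw_events", "job:sessionize"),
    ("job:sessionize", "dataset:sessions"),
    ("dataset:sessions", "job:daily_report"),
    ("dataset:members", "job:daily_report"),
    ("job:daily_report", "dataset:report"),
]


def build_graph(edges):
    """Index the edges in both directions."""
    upstream = defaultdict(set)
    downstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return upstream, downstream


def trace(start, neighbors):
    """Breadth-first walk: every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


upstream, downstream = build_graph(EDGES)
report_sources = trace("dataset:report", upstream)        # everything the report depends on
raw_event_impact = trace("dataset:raw_events", downstream)  # everything fed by raw events
```

Walking the `upstream` index answers "how did this data get here?", while the `downstream` index answers "what is affected if this dataset changes?".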

How They Collect Metadata

The method used to collect metadata depends on the source. For example, Hadoop datasets have scraper jobs that scan through HDFS folders and files, read the metadata, and then store it back in the repository.
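
The scraper pattern can be sketched as follows. This is an illustration only: it walks a local temporary directory with `os.walk` as a stand-in for HDFS, and the record fields and the in-memory `repository` dictionary are assumptions, not the WhereHows implementation.

```python
import os
import tempfile


def scrape(root):
    """Walk a directory tree and return one metadata record per file."""
    records = []
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            info = os.stat(path)
            records.append({
                "path": os.path.relpath(path, root),
                "size_bytes": info.st_size,
                "modified": info.st_mtime,
            })
    return records


# Build a tiny sample tree, scrape it, and store the result.
with tempfile.TemporaryDirectory() as root:
    dataset_dir = os.path.join(root, "tracking", "page_views")
    os.makedirs(dataset_dir)
    with open(os.path.join(dataset_dir, "part-00000.csv"), "wb") as f:
        f.write(b"member_id,url\n1,/feed\n")

    # The "repository" here is just a dictionary keyed by relative path.
    repository = {record["path"]: record for record in scrape(root)}
```

A real scraper would also read schema information from the files themselves; this sketch stops at file-level attributes to stay short.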

For schedulers such as Azkaban, they connect to the scheduler’s backend repository to get the metadata, aggregate it and transform it into the format they need, then load it into WhereHows. For lineage information, they parse the log of a MapReduce job and the scheduler’s execution log, then combine the two to derive the lineage.
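
Combining the two logs is essentially a join on a shared job identifier. The sketch below assumes two invented, simplified log formats; real Azkaban and MapReduce logs look different, and all names here are hypothetical.

```python
import re

# Invented scheduler log: which scheduler job launched which Hadoop job.
SCHEDULER_LOG = """\
exec=101 flow=daily job=sessionize hadoop_job=job_0001
exec=102 flow=daily job=daily_report hadoop_job=job_0002
"""

# Invented MapReduce log: which paths each Hadoop job read and wrote.
MAPREDUCE_LOG = """\
job_0001 INPUT /data/raw_events
job_0001 OUTPUT /data/sessions
job_0002 INPUT /data/sessions
job_0002 OUTPUT /data/report
"""

SCHEDULER_LINE = re.compile(r"exec=(\d+) flow=(\S+) job=(\S+) hadoop_job=(\S+)")


def parse_scheduler(log):
    """Map each Hadoop job id to the scheduler execution that launched it."""
    launched = {}
    for line in log.splitlines():
        match = SCHEDULER_LINE.fullmatch(line.strip())
        if match:
            exec_id, flow, job, hadoop_job = match.groups()
            launched[hadoop_job] = (int(exec_id), flow, job)
    return launched


def parse_mapreduce(log):
    """Return (hadoop_job, direction, path) triples."""
    triples = []
    for line in log.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] in ("INPUT", "OUTPUT"):
            triples.append((parts[0], parts[1], parts[2]))
    return triples


def combine(scheduler_log, mapreduce_log):
    """Join both logs on the Hadoop job id to produce lineage records."""
    launched = parse_scheduler(scheduler_log)
    lineage = []
    for hadoop_job, direction, path in parse_mapreduce(mapreduce_log):
        if hadoop_job in launched:
            exec_id, flow, job = launched[hadoop_job]
            lineage.append({
                "flow": flow, "job": job, "exec": exec_id,
                "direction": direction, "dataset": path,
            })
    return lineage


lineage = combine(SCHEDULER_LOG, MAPREDUCE_LOG)
```

Neither log alone is enough: the scheduler log knows the flow and job names, and the MapReduce log knows the input and output paths.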

What’s Next for WhereHows?

Today, WhereHows is actively used at LinkedIn not only as a metadata repository, but also to automate other data projects such as automated data purging for compliance. In 2016, it was integrated with the systems shown below:

In the future, LinkedIn’s data teams hope to broaden their metadata coverage by integrating more systems such as Kafka or Samza. They also plan on integrating with data lifecycle management and provisioning systems like Nuage or Gobblin to enrich the metadata. WhereHows has not had its last word yet.

Sources:

  • 50 of the Most Important LinkedIn Stats for 2020
  • Open Sourcing WhereHows: A Data Discovery and Lineage Portal

About Actian Corporation

Actian makes data easy. Our data platform simplifies how people connect, manage, and analyze data across cloud, hybrid, and on-premises environments. With decades of experience in data management and analytics, Actian delivers high-performance solutions that empower businesses to make data-driven decisions. Actian is recognized by leading analysts and has received industry awards for performance and innovation. Our teams share proven use cases at conferences (e.g., Strata Data) and contribute to open-source projects. On the Actian blog, we cover topics ranging from real-time data ingestion to AI-driven analytics. Get to know the leadership team https://www.actian.com/company/leadership-team/