Open Source Project For Data Transport And Storage

FlightStats Open Source Project: What is the Hub?

No matter which industry you work in, whether it is technology, travel, or something else, you’ll always be looking for ways to improve your business by cutting costs or creating efficiencies. As a data company, we need smooth, effective processes for managing each part of our business. That is why, a few years ago, we decided to transform the way we handled data within our company. Our goal was to make sharing data among our product teams as simple as possible. Thus, we designed and developed the Hub.

The Hub project was started in 2013 and is under active development. FlightStats uses this technology as part of our big data platform. Every piece of data we touch as a company flows through the Hub, typically a number of times, and it’s a critical part of how we do business with real-time data and information services.

Defining the Hub

So, what is the Hub and what makes it unique in managing data?

The dictionary defines a hub as “the effective center of an activity, region, or network,” which is exactly the role the Hub plays for our data. We use the Hub to gather, process, and deliver data to customers all around the world in near real time every day.

Our technical definition of the Hub is a fault-tolerant, highly available REST API for distribution and storage of data; in layman’s terms, it is a distributed linked list.

The Hub has several key features:

  • Functions as a messaging system
  • Behaves like a key-value database and long-term data store
  • Provides built-in, time-based queries
  • Keeps data payloads immutable and guarantees their ordering
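
To make the "linked list" idea and the features above more concrete, here is a minimal, single-process Python sketch. It is only an illustration and not the Hub's actual API: the names ToyChannel, Item, insert, get, next, and previous are hypothetical, and a real deployment would be distributed and reached over REST.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Item:
    """An immutable payload stored under a key (hypothetical structure)."""
    key: int
    payload: bytes


class ToyChannel:
    """Append-only, ordered store; an in-memory stand-in for the idea."""

    def __init__(self) -> None:
        self._items: List[Item] = []

    def insert(self, payload: bytes) -> int:
        # Keys are assigned in arrival order, so ordering is guaranteed.
        key = len(self._items)
        self._items.append(Item(key, payload))
        return key

    def get(self, key: int) -> Item:
        # Key-value style lookup of a single item.
        return self._items[key]

    def next(self, key: int) -> Optional[Item]:
        # Follow the "link" to the item written after this one, if any.
        return self._items[key + 1] if key + 1 < len(self._items) else None

    def previous(self, key: int) -> Optional[Item]:
        # Follow the "link" to the item written before this one, if any.
        return self._items[key - 1] if key > 0 else None


channel = ToyChannel()
first = channel.insert(b"gate B12")
second = channel.insert(b"gate C3")
```

Because Item is a frozen dataclass, attempting to change a stored payload raises an error, which mirrors the immutability described above.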

What problem does the Hub solve?

Data companies have traditionally relied upon “credentialed datastores” for sharing data among teams and business units. A credentialed datastore is a system, like a database, that requires a username and password or some other key for access. While this approach is common, it is burdensome for both the team that maintains the datastore and the team that needs to consume the data.

The challenges associated with credentialed datastores include:

  • Credentials must be issued and monitored
  • Data plans must be learned and accommodated
  • Too much communication has to occur between the teams in order to facilitate the data transfer

These may seem like minor issues when dealing with small or similar datasets; however, this model can put an undue burden on teams as datasets increase in size, volume, and complexity. Ultimately, these challenges will slow down the development process and can introduce errors. That is why FlightStats created an alternative solution to streamline the data transport and storage process.

Since its creation, the Hub has significantly reduced the communication needed to transport and store our data, which has increased development speed across all our teams.

How does the Hub work?

1. Turning data into objects

At FlightStats, the journey begins with Data Acquisition, which includes collecting various pieces of data from our Direct Feeds and other sources. We take data of different shapes and topics from these sources and normalize it into common chunks called objects to enable processing. These objects describe things like:

  • Where a plane is
  • What gate a plane is at
  • What weather is like at a given time and place

The objects we use are flexible to a degree and can describe all manner of data about any kind of topic.
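
As a rough illustration of what normalizing differently shaped inputs into a common object can look like, here is a small Python sketch. Everything in it is an assumption made for the example: the FlightObject fields, the two feed formats, and the sample values are hypothetical and do not describe our real feeds or schemas.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FlightObject:
    """A common shape for incoming records (field names are hypothetical)."""
    topic: str             # e.g. "position", "gate", "weather"
    subject: str           # what the record is about, e.g. a flight
    observed_at: datetime  # when the fact was observed
    data: dict             # topic-specific details


def from_position_feed(raw: dict) -> FlightObject:
    # Hypothetical feed that reports epoch seconds and latitude/longitude.
    return FlightObject(
        topic="position",
        subject=raw["flight"],
        observed_at=datetime.fromtimestamp(raw["ts"], tz=timezone.utc),
        data={"lat": raw["lat"], "lon": raw["lon"]},
    )


def from_gate_feed(raw: dict) -> FlightObject:
    # Hypothetical feed that reports an ISO 8601 timestamp and a gate name.
    return FlightObject(
        topic="gate",
        subject=raw["flightId"],
        observed_at=datetime.fromisoformat(raw["time"]),
        data={"gate": raw["gate"]},
    )


position = from_position_feed(
    {"flight": "XX100", "ts": 0, "lat": 45.6, "lon": -122.6}
)
gate = from_gate_feed(
    {"flightId": "XX100", "time": "1970-01-01T00:05:00+00:00", "gate": "B12"}
)
```

Once both records share one shape, downstream code can compare or sort them without knowing which feed they came from.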

2. Organizing objects into channels

We take these objects and organize them into groups of similar data, which we call channels in the Hub. This helps us sort the huge amount of incoming data into more manageable categories. Channels are configurable and contain information like where the data came from and how long we should keep it before tidying up.

For example, we store the data each airline provides in a separate channel just for that airline, and we store all our data in different channels as it is processed, so we can see any piece of data at any stage in the process.

After data has made it into our company and into the Hub, our Data Processing software is waiting to consume the new data and refine it into better information.

To make this easy, services can subscribe to a channel. Any number of services are then automatically notified when new data shows up in that channel, so they can get to work.
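
The idea of a configurable channel that remembers its source and retention period can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Hub's configuration format: ChannelConfig, keep_for, tidy_up, and the channel name are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


@dataclass
class ChannelConfig:
    name: str
    source: str          # where the data came from
    keep_for: timedelta  # how long to keep items before tidying up


@dataclass
class Channel:
    config: ChannelConfig
    items: List[Tuple[datetime, bytes]] = field(default_factory=list)

    def insert(self, payload: bytes, at: datetime) -> None:
        self.items.append((at, payload))

    def tidy_up(self, now: datetime) -> int:
        """Drop items older than the retention window; return the count dropped."""
        cutoff = now - self.config.keep_for
        kept = [(ts, p) for ts, p in self.items if ts >= cutoff]
        dropped = len(self.items) - len(kept)
        self.items = kept
        return dropped


now = datetime(2016, 1, 10, tzinfo=timezone.utc)
raw = Channel(
    ChannelConfig("airline_xx_raw", source="Airline XX feed", keep_for=timedelta(days=7))
)
raw.insert(b"old", at=now - timedelta(days=8))
raw.insert(b"recent", at=now - timedelta(days=1))
dropped = raw.tidy_up(now)
```

In this example the eight-day-old item falls outside the seven-day window and is removed, while the recent one stays.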
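
The subscribe-and-notify pattern described above can be modeled with plain callbacks. The Python sketch below is an in-memory illustration only: ToyHub, subscribe, and insert are hypothetical names, and how the real Hub delivers notifications over the network is not shown here.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str, bytes], None]


class ToyHub:
    def __init__(self) -> None:
        self._channels: DefaultDict[str, List[bytes]] = defaultdict(list)
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, channel: str, callback: Callback) -> None:
        self._subscribers[channel].append(callback)

    def insert(self, channel: str, payload: bytes) -> None:
        self._channels[channel].append(payload)
        # Notify every subscriber of this channel, in subscription order.
        for callback in self._subscribers[channel]:
            callback(channel, payload)


hub = ToyHub()
seen_by_processor: List[bytes] = []
seen_by_archiver: List[bytes] = []
hub.subscribe("positions", lambda ch, p: seen_by_processor.append(p))
hub.subscribe("positions", lambda ch, p: seen_by_archiver.append(p))

hub.insert("positions", b"XX100 at 45.6,-122.6")
hub.insert("weather", b"PDX: rain")  # no subscribers, so nobody is notified
```

The producer never needs to know who is listening, which is the property that lets any number of services react to the same channel.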

3. Ordering new objects by time

It’s important that these objects and the messages they contain show up in an order that makes sense, and we decided the most useful ordering for aviation data was time. For example, planes should taxi, then take off, fly a route, then land, and taxi again. No other order makes sense, so we organize all the data we have about a flight by time.
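
Here is a minimal Python sketch of time ordering and a time-range query, using the standard bisect module. It is a simplified model rather than the Hub's implementation: TimeOrderedChannel and between are hypothetical names, and sorting by a caller-supplied timestamp (rather than, say, arrival time) is an assumption made for the example.

```python
import bisect
from datetime import datetime, timedelta, timezone
from typing import List


class TimeOrderedChannel:
    def __init__(self) -> None:
        self._times: List[datetime] = []
        self._payloads: List[str] = []

    def insert(self, at: datetime, payload: str) -> None:
        # bisect_right keeps items sorted by time; ties keep arrival order.
        i = bisect.bisect_right(self._times, at)
        self._times.insert(i, at)
        self._payloads.insert(i, payload)

    def between(self, start: datetime, end: datetime) -> List[str]:
        """Return payloads with start <= time < end, oldest first."""
        lo = bisect.bisect_left(self._times, start)
        hi = bisect.bisect_left(self._times, end)
        return self._payloads[lo:hi]


t0 = datetime(2016, 1, 1, 12, 0, tzinfo=timezone.utc)
flight = TimeOrderedChannel()
# Events are inserted out of order but stored by their timestamps.
flight.insert(t0 + timedelta(minutes=15), "take off")
flight.insert(t0, "taxi out")
flight.insert(t0 + timedelta(hours=3), "land")
flight.insert(t0 + timedelta(hours=3, minutes=10), "taxi in")

first_hour = flight.between(t0, t0 + timedelta(hours=1))
whole_flight = flight.between(t0, t0 + timedelta(hours=4))
```

Reading the whole range back yields taxi out, take off, land, taxi in, regardless of the order in which the events were inserted.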

4. Duplicating data and delivering it to customers

The final stage, after the data has been acquired and processed through the Hub, is to deliver it to customers via APIs. As you might know, APIs are mechanisms for one program to get information from another program. We allow customers and their software to access our data through our Flex and Trip APIs. To do this, we have to replicate and ship our data all around the world.

The Hub allows us to effortlessly copy data from one channel to another channel, even in another Hub, wherever the data is needed around the world.
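
Copying a channel into another Hub can be pictured as repeatedly shipping whatever the target has not seen yet. The Python sketch below is a deliberately simplified, in-memory model, not the Hub's replication mechanism: ToyHub, replicate, and the hub and channel names are hypothetical, and it assumes the target channel is written only by the replication step.

```python
from typing import Dict, List


class ToyHub:
    def __init__(self, name: str) -> None:
        self.name = name
        self.channels: Dict[str, List[bytes]] = {}

    def insert(self, channel: str, payload: bytes) -> None:
        self.channels.setdefault(channel, []).append(payload)


def replicate(source: ToyHub, source_channel: str,
              target: ToyHub, target_channel: str) -> int:
    """Copy items the target does not have yet, preserving order.

    Assumes the target channel is only ever written by this function,
    so its length tells us how far replication has progressed.
    """
    items = source.channels.get(source_channel, [])
    already_copied = len(target.channels.get(target_channel, []))
    for payload in items[already_copied:]:
        target.insert(target_channel, payload)
    return len(items) - already_copied


portland = ToyHub("portland")
london = ToyHub("london")
portland.insert("flight_status", b"XX100 departed")
portland.insert("flight_status", b"XX100 landed")

copied_first = replicate(portland, "flight_status", london, "flight_status")
portland.insert("flight_status", b"XX100 at gate")
copied_second = replicate(portland, "flight_status", london, "flight_status")
```

The first call copies two items and the second copies only the one new item, so the target channel ends up with the same items in the same order as the source.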

Since all our data is neatly distributed, our engineers can freely access it. Additionally, our software services can access any combination of data from any channel needed to power a data product for external customers.

Like making a good recipe, we can choose the ingredients we need and combine them into data products. Sourcing our data ingredients from the Hub is an amazingly fast and flexible way to get our job done.

Making the Hub Open Source

The best part is, you can use the Hub to produce any type of data product. This isn’t something exclusive to FlightStats, or the types of data we work with.

The Hub is an open source software project licensed under the MIT License, which effectively makes it free to use as-is as long as credit is given for use of the project.

If you have any questions, please direct them to our GitHub project, or reach out to FlightStats. We’d be happy to talk more about the Hub and how it’s helped us!