Getting Started

Getting Started with Hop

  • Download a recent Hop build.

  • Unzip Hop to a local directory.

  • Change to the Hop directory.

Introducing Hop

Hop is a metadata-driven environment in which you manage your data processing workflows.

Before anything else, we need to explain Hop’s two main concepts:

  • A workflow is a sequential (by default) process that has one starting point and one or more endpoints. Between the start and the endpoints, a variety of 'actions' can be performed. These actions range from executing other workflows or pipelines and archiving processed files to sending error messages or success notifications, and much more.

  • Pipelines are more granular items of work. A pipeline is where the actual work is done. Pipelines consist of a chain of transforms that read, process or write data. Depending on the execution engine your pipelines run on, this can happen in batch, streaming or hybrid mode.

The actions in a workflow and the transforms in a pipeline are connected by 'hops'. Hops are visual links between actions (in workflows) and transforms (in pipelines).

As you’ll discover soon, the process of creating workflows and pipelines is very similar.

However, there are a number of conceptual differences between workflows and pipelines that you have to keep in mind:

  • The pipeline engine executes all transforms in a pipeline simultaneously, in parallel. The workflow engine executes the actions in a workflow sequentially by default: when an action finishes, the workflow engine checks which action needs to be executed next.

  • Hops in a pipeline pass data between transforms. In a workflow, hops determine which action the workflow executes next, depending on the result of the previous action (on success, on failure, or unconditionally).

  • Because of their sequential nature, workflows have one action to start from and one or more end actions. Pipelines can start with input from multiple transforms simultaneously.

The following tools are at your disposal to work with Hop workflows and pipelines:

  • Hop Gui is your visual IDE to build, preview, run, test and deploy workflows and pipelines, among other tasks.

  • Hop Server is a lightweight web server that provides a REST API to run workflows and pipelines remotely.

  • Hop Run is a command line utility to run workflows and pipelines.

The Hop GUI

The Hop Gui is your local development environment to build, run, preview and debug workflows and pipelines.

Check out this short video to learn how to download, unzip and start the Hop Gui (on Windows).

video::RMIOTmZK-YE[youtube, width=75%, height=400px]

Start the Hop GUI

On Linux:

 ./hop-gui.sh

On Windows:

hop-gui.bat

Hop GUI Walkthrough

After starting the Hop Gui, you’ll be presented with a window like the one below.

Hop Gui

After clicking the 'New' icon in the upper left corner, you’ll be presented with the window below. Choose either 'New Workflow' or 'New Pipeline'.

Hop - New Dialog

Pipeline Editor Overview

Your new pipeline is created, and you’ll see the dialog below.

Hop - New Pipeline

Let’s walk through the top toolbar:

Hop - Pipeline Toolbar

Action                   Description

Run                      Start the execution of the pipeline.
Pause                    Pause the execution of the pipeline.
Stop                     Stop the execution of the pipeline.
Preview                  Preview the pipeline.
Debug                    Debug the pipeline.
Print                    Print the pipeline.
Undo                     Undo an operation.
Redo                     Redo an operation.
Snap To Grid             Align the selected transforms to the specified grid size.
Align Left               Align the selected transforms with the left-most transform in the selection.
Align Right              Align the selected transforms with the right-most transform in the selection.
Align Top                Align the selected transforms with the top-most transform in the selection.
Align Bottom             Align the selected transforms with the bottom-most transform in the selection.
Distribute Horizontally  Distribute the selected transforms evenly between the left-most and right-most transform in your selection.
Distribute Vertically    Distribute the selected transforms evenly between the top-most and bottom-most transform in your selection.

Build Your First Pipeline

Concepts

Pipelines consist of two main work items:

  • Transforms are the basic operations in your pipeline. A pipeline typically consists of many transforms that are chained together by hops. Transforms are granular: each transform is designed and optimized to perform one and only one task. Although a single transform may not offer spectacular functionality, the combination of all transforms in a pipeline is what makes your pipelines powerful.

  • Hops link transforms together. When a transform finishes processing the data set it received, that data set is passed to the next transform through a hop. Hops are unidirectional (data can’t flow backwards). Hops only buffer and pass data around; the hop itself is transform-agnostic and knows nothing about the transforms it passes data from or to. Some transforms can conditionally read from or write to a number of other transforms, but this is a transform-specific configuration that the hop is unaware of. Hops can be disabled by clicking on them, or through right-click → disable.
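The relationship between transforms and hops can be illustrated with a small Python model. This is only a conceptual sketch, not Hop code: each transform runs in its own thread, and each hop is a one-way queue that buffers rows without knowing anything about the transforms on either end. The two stand-in transforms are loosely modeled on the Generate Rows and Add Sequence transforms used later in this guide; the `END` sentinel, the function names and the `valuename` field are assumptions made for the illustration.

```python
import queue
import threading

# Sentinel marking the end of the row stream (an assumption of this sketch).
END = object()


def generate_rows(out_hop, limit):
    """Source transform: writes `limit` empty rows to its output hop."""
    for _ in range(limit):
        out_hop.put({})
    out_hop.put(END)


def add_sequence(in_hop, out_hop, field="valuename", start=1):
    """Reads rows from one hop, adds a counter field, writes to the next hop."""
    counter = start
    while True:
        row = in_hop.get()
        if row is END:
            out_hop.put(END)
            return
        out_hop.put({**row, field: counter})
        counter += 1


def run_pipeline(limit=10):
    """Starts all transforms at once, one thread each, and collects the output."""
    hop1 = queue.Queue(maxsize=4)  # a hop is only a small one-way buffer
    hop2 = queue.Queue()           # unbounded, so the last transform never blocks
    threads = [
        threading.Thread(target=generate_rows, args=(hop1, limit)),
        threading.Thread(target=add_sequence, args=(hop1, hop2)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    rows = []
    while True:
        row = hop2.get()
        if row is END:
            return rows
        rows.append(row)
```

Calling `run_pipeline(10)` returns ten rows whose `valuename` field counts up from 1. Both threads are active at the same time, and the bounded first queue makes the source wait whenever the downstream transform falls behind.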

Add Transforms

Click anywhere in the pipeline canvas, the area where you’ll see the image below.

Hop - Click Anywhere

Upon clicking, you’ll be presented with the dialog shown below. The search box at the top of this dialog searches on transform name, tags (TODO), etc. Once you’ve found the transform you’re looking for, click on it to add it to your pipeline. As an alternative to clicking, navigate with the arrow keys and press Enter. Repeat this step now or whenever you want to add more transforms to your pipeline. Once you’ve added a transform to your pipeline, you can drag it to reposition it.

TODO: link to transform documentation.

Hop - Add Transform

Add a 'Generate Rows' and an 'Add Sequence' transform, and your pipeline should look like the one below.

Hop - Add two transforms

Add a Hop

There are a number of ways to create a hop:

  • shift-drag: hold down the shift key, click on a transform and, while holding down your primary mouse button, drag to the second transform. Release the primary mouse button and the shift key.

  • scroll-drag: scroll-click on a transform and, while holding down your mouse’s scroll button, drag to the second transform. Release the scroll button.

  • click on a transform in your pipeline to open the 'click anywhere' dialog. Click the 'Create hop' button and select the transform you want to create the hop to.

Hop - Create Hop

Run your first Pipeline

Click the 'run' button in your pipeline toolbar.

Hop - Create Hop

Let’s walk through the options in this dialog

  • Pipeline run configuration: select, edit, create or manage your run configurations. A run configuration specifies a name, a description and the engine used to run your pipeline.

  • Log level: choose the log level for your pipeline. The available options are:

    • Nothing

    • Error

    • Minimal

    • Basic (default)

    • Detailed

    • Debugging

    • Row Level (very detailed)

  • Clear log before running (enabled by default): logging information from previous runs will be cleared from the logging tab.

  • Parameters: this table shows the parameter name, default value and description. Enter your runtime parameters in the 'value' field.

  • Variables: add the variable names and values you want to set in this tab.

  • Always show dialog on run (enabled by default): you’ll be presented with this dialog every time you run a pipeline. When disabled, the pipeline will run with the default options.

Click the 'New' button right next to the 'Pipeline run configuration'. Give your run configuration a name and (optionally) a description. Choose the 'local pipeline engine'. As the name implies, the 'local single threaded pipeline engine' runs the pipeline in a single CPU thread. The default 'local pipeline engine' will create a separate CPU thread for each transform in your pipeline to evenly spread the load of your pipeline over your CPU cores.

Hop - Run Configuration Dialog

Click 'Ok' to create your configuration and select it from the dropdown list. For this getting started guide, we’ll leave all other options to the defaults. Click 'Launch'.

Since we haven’t saved our pipeline yet, you’ll be prompted to do so by the dialog below.

Hop - Save Pipeline

Your pipeline will finish in a matter of milliseconds, and the 'Execution Result' view will show up at the bottom of your IDE. This view has 5 tabs:

  • transform metrics: transformName, read, written, input, output, update, rejected, errors, buffers input, buffers output, speed, status (TODO: elaborate)

  • logging: the logging output for your pipeline

  • preview data: a preview of the data for the selected transform. This grid shows the data as it passed through this transform.

  • metrics: TODO

  • performance graph: TODO

Hop - Execution Results Metrics

Preview your first Pipeline

While developing your pipeline, you’ll often want to check your data as it enters or exits a transform. Previews are an easy way to take a glance at the state of your data stream as it exits a transform.

To preview the data that is processed by a transform, click on a transform and select 'Preview output'. The same result can be achieved by selecting a transform in your pipeline (rectangle select) and clicking the preview (eye) icon in the pipeline toolbar.

Hop - Preview Transform

You’ll be presented with the dialog below. You can change the number of rows to preview (1000 by default), but in most cases, you’ll just want to hit the green 'Quick Launch' button.

Hop - Preview Dialog

Once your pipeline has finished processing the selected number of rows for the selected transform, a new popup dialog will show your preview results.

Hop - Preview Results

Keep in mind that your entire pipeline is executed for a preview; you’re just taking a peek into the processing at the selected transform. If your pipeline modifies data (writes, updates, deletes) further down the stream, those actions will be performed, even if you’re previewing an earlier transform.

Let’s take a quick look at the buttons at the bottom of this dialog:

  • Close: closes the preview dialog. The pipeline will remain paused, and will therefore still be active.

  • Stop: stop the preview and the pipeline execution.

  • Get more rows: fetch the next 1000 (or any other selected amount of) rows for preview.
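The behavior of these buttons can be pictured with a short Python sketch. It is a conceptual model only, not how Hop implements previews: the running pipeline is represented by a generator, and the hypothetical `Preview` class hands out rows in batches while the stream stays paused in between.

```python
from itertools import islice


def pipeline_rows():
    """Stand-in for a running pipeline: yields rows with a sequence field."""
    n = 0
    while True:
        n += 1
        yield {"valuename": n}


class Preview:
    """Holds a paused row stream and hands out rows in batches."""

    def __init__(self, rows, batch_size=1000):
        self._rows = iter(rows)
        self.batch_size = batch_size

    def get_more_rows(self):
        """Resumes the stream, collects the next batch, then pauses again."""
        return list(islice(self._rows, self.batch_size))

    def stop(self):
        """Ends the preview; no further rows are produced."""
        self._rows = iter(())
```

With `Preview(pipeline_rows(), batch_size=3)`, the first call to `get_more_rows()` returns rows 1 to 3 and the second call returns rows 4 to 6. After `stop()`, further calls return an empty list.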

Debug your first Pipeline

Debugging a pipeline’s transform is very similar to previewing. Instead of pausing the pipeline execution after a given number of rows, the pipeline is paused when a given condition is met. The process to start a debug session is similar to starting the preview: click on a transform and select 'Debug output' from the pop-up dialog, or select a transform and hit the bug-icon in the pipeline toolbar.

Hop - Preview Transform

You’ll be presented with the dialog below. You’ll recognize this dialog from the preview we just did, but this time the 'Retrieve first rows (preview)' option is unchecked and 'Pause pipeline on condition' is checked.

In the 'Break-point / pause condition' section below that option, you can specify the condition on which you want the pipeline to pause. The condition editor is the same as the one in the Filter Rows transform.

In our very basic example, we’ve set a breakpoint on 'valuename > 5'.

Hop - Preview Dialog

With the 'valuename > 5' breakpoint, our pipeline is paused as soon as this condition is met (valuename = 6). The rows preceding that moment are also shown, so you can investigate how your data was processed before the breakpoint condition was true.

Similar to the preview options, you can close, stop or continue the debugging ('Get more rows'). When you tell your pipeline to 'Get more rows', the pipeline execution will be resumed until the breakpoint condition is met once more, instead of just fetching the next 1000 (default) rows.
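The pause-on-condition behavior described above can be sketched in a few lines of Python. This is a conceptual model under stated assumptions, not Hop's debugger: `run_until_break` and its `history` argument are hypothetical names, and the row stream is a plain iterator.

```python
from collections import deque


def run_until_break(rows, condition, history=5):
    """Consumes rows until `condition(row)` is true.

    Returns (recent_rows, hit): the last `history` rows seen, ending with the
    row that triggered the pause, and whether the condition was met at all.
    """
    recent = deque(maxlen=history)
    for row in rows:
        recent.append(row)
        if condition(row):
            return list(recent), True
    return list(recent), False


def breakpoint_condition(row):
    """The example breakpoint from this guide: valuename > 5."""
    return row["valuename"] > 5
```

Given a stream of rows with `valuename` running from 1 to 10 and `history=10`, the first call returns rows 1 to 6 and pauses on `valuename = 6`. Calling the function again on the same iterator models 'Get more rows': execution resumes and pauses at the next matching row, which is row 7.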

Hop - Preview Results

Create your first Workflow

The design and execution of workflows is very similar to that of pipelines. However, keep in mind that there are significant differences between how Hop handles workflows and pipelines under the hood.

To create a workflow, hit the 'new' icon or 'CTRL-N'. From the pop-up dialog, select 'New workflow'.

Hop - New Workflow

Add the following actions to your workflow and create the hops to connect them:

  • Start

  • Pipeline

  • Success

Hop - New Workflow with actions

Double-click the Pipeline action, or single-click it and choose 'Edit action', to configure it for the pipeline you just created.

In the pipeline dialog, use the 'Browse' button to select your pipeline and give the action an appropriate name, for example 'First Pipeline'.

Click 'OK'.

Hop - New Workflow pipeline action

Notice how the hops in your workflow are a little different from what you’ve seen in pipeline hops.

Add a fourth action 'Abort' and create a hop from your pipeline action.

Hop - New Workflow abort

You now have the three types of hops that are available in workflows:

  • unconditional (lock icon, black hop): 'unconditional' hops are followed no matter what the exit code (true/false) of the previous action is.

  • success (green hop, check mark): 'success' hops are followed when the previous action executed successfully.

  • failure (red hop, error mark): 'failure' or 'error' hops are followed when the previous action failed.

The hop type can be changed by clicking on the hop’s icon.
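To make the three hop types concrete, here is a minimal Python model of how a sequential engine could pick the next action. It is a sketch under stated assumptions rather than the Hop workflow engine: the `Hop` class, `run_workflow` and `demo_workflow` are hypothetical names, and the hop from Start to Pipeline is assumed to be unconditional.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    target: str
    kind: str  # "unconditional", "success" or "failure"


def run_workflow(actions, hops, start="Start"):
    """Runs actions one at a time, following the hops that match each result.

    `actions` maps an action name to a function returning True (success) or
    False (failure); `hops` maps an action name to its outgoing Hop list.
    Returns the names of the executed actions in order.
    """
    executed = []
    pending = [start]
    while pending:
        name = pending.pop(0)
        ok = actions[name]()
        executed.append(name)
        for hop in hops.get(name, []):
            follow = {"unconditional": True, "success": ok, "failure": not ok}
            if follow[hop.kind]:
                pending.append(hop.target)
    return executed


def demo_workflow(pipeline_ok):
    """The four-action workflow from this guide, with a simulated pipeline result."""
    actions = {
        "Start": lambda: True,
        "Pipeline": lambda: pipeline_ok,
        "Success": lambda: True,
        "Abort": lambda: False,
    }
    hops = {
        "Start": [Hop("Pipeline", "unconditional")],
        "Pipeline": [Hop("Success", "success"), Hop("Abort", "failure")],
    }
    return run_workflow(actions, hops)
```

`demo_workflow(True)` visits Start, Pipeline and Success, while `demo_workflow(False)` ends in Abort instead. The sketch does not guard against loops in the hop graph.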

With these three hop types and the actions at your disposal, you’re ready to create powerful data orchestration workflows.

Run your first Workflow

As with designing workflows, the steps to run a workflow are very similar to running a pipeline.

Click the 'run' button in your workflow toolbar.

In the workflow run dialog, hit the 'New' button in the upper right corner to create a new 'Workflow run configuration'.

Hop - New Workflow Config

In the dialog that pops up, add 'Local Workflow' as the workflow configuration name and choose the 'Local workflow engine'.

Hop - New Workflow Config Dialog

Click 'OK' to return to the workflow run dialog, make sure your configuration is selected and hit 'Launch'.

Hop - New Workflow With Config Dialog

This workflow with our very basic pipeline should execute in less than one second. You’ll now have the execution results pane which again looks very similar to the pipeline execution results.

The first tab in your workflow execution is 'Logging'. This tab shows the logging information for your entire workflow. Any errors that occurred in your workflow will be highlighted in red.

Hop - New Workflow Logging

The second tab shows your workflow metrics. This tab is less verbose, but gives you an action-by-action overview of the execution of your workflow. The black, green and red color coding indicates information, success and failure. In larger workflows, the metrics tab gives you a quick overview of what happened in your workflow, how much time each action required, etc.

You’ll use the logging tab to find more detailed information about what happened in your workflow or in a particular action.

Hop - New Workflow Metrics

Hop Server

After you’ve designed and tested your workflow or pipeline locally, you may want to run it on a headless machine.

The Hop Server is a lightweight web server that you can use to run your workflows and pipelines remotely.

First, we’ll have to start the server. Head over to your Hop directory, and locate the 'hop-server' scripts (sh for Mac/Linux, bat for Windows).

Running the script without any arguments will print its usage:

Usage: hop-server <Interface address> <Port> [-h] [-p <arg>] [-s] [-u <arg>]
or
Usage: hop-server <Configuration File>
Starts or stops the hopServer server.
     -h,--help               This help text
     -p,--password <arg>     The administrator password.  Required only if
                             stopping the Hop Server server.
     -s,--stop               Stop the running hopServer server.  This is only
                             allowed when using the hostname/port form of the
                             command.
     -u,--userName <arg>     The administrator user name.  Required only if
                             stopping the Hop Server server.
Example: hop-server.sh 127.0.0.1 8080
Example: hop-server.sh 192.168.1.221 8081
Example: hop-server.sh /foo/bar/hop-server-config.xml
Example: hop-server.sh http://www.example.com/hop-server-config.xml
Example: hop-server.sh 127.0.0.1 8080 -s -u cluster -p cluster

As an example, let’s run our server on our local machine on port 8085:

On Linux:

 ./hop-server.sh localhost 8085

On Windows:

hop-server.bat localhost 8085

The startup process shouldn’t take more than 1 or 2 seconds, and should show 2 lines of logging information:

2020/04/30 16:22:55 - HopServer - Installing timer to purge stale objects after 1440 minutes.
2020/04/30 16:22:55 - HopServer - Created listener for webserver @ address : localhost:8085
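If you start the server from a script, you may want to wait until the listener accepts connections before going further. The following Python sketch polls the TCP port; it only checks that something is listening on the given address and says nothing about whether that process is actually a Hop Server. The function name and its defaults are assumptions for this example.

```python
import socket
import time


def wait_for_port(host, port, timeout=10.0, interval=0.2):
    """Returns True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call for the server started above:
# wait_for_port("localhost", 8085)
```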

In your favorite browser, go to http://localhost:8085 and sign in with the default user 'cluster' and password 'cluster'.

Click the 'show status' link to get to the page shown in the second screenshot below.

Hop - Server Index
Hop - Server Status

We now have verified our server is up and running. Let’s return to Hop Gui to configure a run configuration for it. Click the 'New' icon or 'CTRL-N' and select 'Slave Server'.

Hop - New Slave

In the slave server dialog, enter the details for the local server we just created.

Hop - New Slave Config

With our slave server in place, all that’s left to do is to create a run configuration for this server. Head back to your pipeline (again, the process is similar for workflows), and hit 'run'. Before running your pipeline, create a new 'Pipeline run configuration'.

Name this configuration 'Remote Pipeline', select 'Remote pipeline engine' as the engine type, select the 'local' run configuration we created earlier, and select 'localhost' for the slave server we just created.

Select this run configuration and run your pipeline. Your execution results will be almost identical to those of the local execution you did earlier. However, the logs will show that you executed the pipeline remotely:

2020/04/30 17:01:33 - first_pipeline - Executing this pipeline using the Remote Pipeline Engine with run configuration 'Remote Pipeline'
...
...
...
2020/04/30 17:01:34 - first_pipeline - Execution finished on a remote pipeline engine with run configuration 'Remote Pipeline'

The execution results for this pipeline will now be available in our server’s status page as well:

Hop - Server Status

Select the pipeline or workflow line that you want to investigate, and choose one of the options in the upper left corner of the pipeline or workflow overview table. Click the eye icon to open the details for that specific execution:

Hop - Server Status Details

Hop Run

Hop Run is the last tool we’ll discuss in this getting started overview. In many cases, you’ll want to run your workflows and pipelines on a headless server, but don’t necessarily want to run them through REST services or from Hop Gui.

Hop Run is a command line tool that can be used to run workflows or pipelines, e.g. over SSH or from a cron job.

The command to run is 'hop-run' (sh on Mac/Linux, bat on Windows). Without any arguments, hop-run shows its usage syntax:

A filename is needed to run a workflow or pipeline
Usage: <main class> [-hotw] [-e=<environment>] [-f=<filename>] [-l=<level>]
                    [-r=<runConfigurationName>] [-p=<parameters>[,
                    <parameters>...]]... [-s=<systemProperties>[,
                    <systemProperties>...]]...
  -e, --environment=<environment>
                          The name of the environment to use
  -f, --file=<filename>   The filename of the workflow or pipeline to run
  -h, --help              Displays this help message and quits.
  -l, --level=<level>     The debug level, one of NONE, MINIMAL, BASIC, DETAILED,
                            DEBUG, ROWLEVEL
  -o, --printoptions      Print the used options
  -p, --parameters=<parameters>[,<parameters>...]
                          A comma separated list of PARAMETER=VALUE pairs
  -r, --runconfig=<runConfigurationName>
                          The name of the Run Configuration to use
  -s, --system-properties=<systemProperties>[,<systemProperties>...]
                          A comma separated list of KEY=VALUE pairs
  -t, --pipeline          Force execution of a pipeline
  -w, --workflow          Force execution of a workflow
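When hop-run is called from a scheduler or another program, the options listed in the usage text can be assembled programmatically. The Python sketch below builds an argument list from the `-f`, `-r`, `-l` and `-p` options shown above; the helper names and the default script path are assumptions, and the sketch does not validate the log level or the parameter names.

```python
import subprocess


def build_hop_run_args(filename, run_config, level=None, parameters=None,
                       script="./hop-run.sh"):
    """Builds the argument list for a hop-run call from the documented options."""
    args = [script, "-f", filename, "-r", run_config]
    if level:
        args += ["-l", level]
    if parameters:
        # -p takes a comma separated list of PARAMETER=VALUE pairs.
        pairs = ",".join(f"{key}={value}" for key, value in parameters.items())
        args += ["-p", pairs]
    return args


def run_hop(args):
    """Runs hop-run and returns its captured output; raises if the exit code is non-zero."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `build_hop_run_args("/tmp/first_pipeline.hpl", "local", level="DEBUG")` yields a list that can be passed to `run_hop` from the Hop directory. Passing a list instead of a single string avoids shell quoting problems with file names and parameter values.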

Since we’ve been working with a very basic pipeline, running it from hop-run is as easy as specifying:

  • the pipeline filename to run

  • the run configuration to use

 ./hop-run.sh -f /tmp/first_pipeline.hpl -r local

You’ll get output that will be very similar to the one below:

2020/04/30 17:16:48 - first_pipeline - Executing this pipeline using the Local Pipeline Engine with run configuration 'local'
2020/04/30 17:16:48 - first_pipeline - Execution started for pipeline [first_pipeline]
2020/04/30 17:16:48 - Generate rows.0 - Finished processing (I=0, O=0, R=0, W=10, U=0, E=0)
2020/04/30 17:16:48 - Add sequence.0 - Finished processing (I=0, O=0, R=10, W=10, U=0, E=0)
2020/04/30 17:16:48 - first_pipeline - Pipeline duration : 0.079 seconds [  0.079 ]
2020/04/30 17:16:48 - first_pipeline - Execution finished on a local pipeline engine with run configuration 'local'
./hop-run.sh -f /tmp/first_pipeline.hpl -r local  5.62s user 0.34s system 258% cpu 2.309 total
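The 'Finished processing' lines in this output carry per-transform counters that are easy to pick up in a script, for example to check that the expected number of rows was written. The Python sketch below parses lines in the format shown above. The regular expression is an assumption based only on this sample output, and the letters presumably correspond to the input, output, read, written, updated and errors metrics.

```python
import re

# Matches e.g. "2020/04/30 17:16:48 - Add sequence.0 - Finished processing (I=0, ...)"
LINE = re.compile(
    r"^(?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) - (?P<name>.+?) - "
    r"Finished processing \((?P<metrics>[^)]*)\)$"
)


def parse_finished_line(line):
    """Returns (transform_name, metrics) for a 'Finished processing' line, else None."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    metrics = {}
    for pair in match.group("metrics").split(","):
        key, value = pair.strip().split("=")
        metrics[key] = int(value)
    return match.group("name"), metrics
```

Applied to the 'Add sequence.0' line above, the function returns the transform name together with a dictionary in which `R` and `W` are both 10. Lines that do not match the pattern return `None`.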

Where to go from here?

We’ll be adding more documentation as we go, so keep an eye on the Project Hop documentation section.

A good place to start exploring is the detailed documentation for:

Project Hop considers high-quality documentation a very important part of the project. Help us to improve by creating tickets for any documentation errors, suggestions or feature requests in our JIRA system.