Understanding Apache Drill Logical Plan

What is a Logical Plan?

Apache Drill creates two intermediate plans for its execution – The Logical plan & the Physical plan.

The incoming query to Drill can be a SQL 2003 query/DrQL or MongoQL. The query is converted to a Logical Plan that is a Drill’s internal representation of the query (language-agnostic). Drill then uses its optimization rules over the Logical Plan to optimize it for best performance and crafts out a Physical Plan. The Physical Plan is the actual plan which Drill then executes for the final data processing.

Below is a diagram to illustrate the flow:
Drill query flow

The Logical Plan describes the flow of data in a language independent manner i.e. it would be a representation of the input query which would not be dependent on the actual input query language. It generally tries to work with primitive operations without focus on optimization. This makes it more verbose than traditional query languages. This is to allow a substantial level of flexibility in defining higher-level query language features. It would be forwarded to the optimizer to get a physical plan. The logical plan is a DAG of data stream operators.

Sample Logical Plan

Here is a sample logical plan that I created to test my Math Functions floor() & ceil(). Here is a post on creation of the Math Functions. Contribute to Apache Drill: Implementing Math Functions.
Lets dig the contents of the logical plan:

{
       head: {
         type: "APACHE_DRILL_LOGICAL",
         version: "1",
         generator: {
            type: "manual",
            info: "na"
         }
       },

      storage: {
         console: {type:"console"},
         fs1: {type:"fs", root:"file:///"},
         cp: {type:"classpath"}
       },

      query: [
         {
          @id: 1,
          op: "scan",
          memo: "initial_scan",
          ref: "employees",
          storageengine: "cp",
          selection: {
              path: "/employees.json",
              type: "JSON"
              }
          },

         {
          op : "project",
          @id : 2,
          input : 1,
          projections : [ 
                 {
                  ref : "output.ceil",
                  expr : "ceil(1.7)"
                 }, 
                 {
                  ref : "output.floor",
                  expr : "floor(1.7)"
                 }
             ]
          },

         {
          input: 2,
          op: "store",
          memo: "output sink",
          storageengine: "console",
          target: {pipe: "STD_OUT"}
          }
     ]
 }

We have three nodes in our logical plan here:

  • head
  • storage
  • query

The head node of logical plan is pretty straight forward. It tells that this is a APACHE_DRILL_LOGICAL Plan, and it is generated manually.

The storage node defines 3 storage engines, console, fs1 & cp. We would be using the console storage engine in our query.

The query node is the actual query that we want to execute on Drill. The query itself is a collection of operations on the data. Lets look how we are using the operations in our sample logical plan:

  • scan: The first operation is a scan operation that reads the data from the data file employees.json. Its a read operation for Drill.
  • project: The project operation is used to apply transformation over the data. Now the input to the project operation is the output of the previous scan operation. Hence you can see the input:1 in the project query. 1 here is the id of the previous operation i.e. scan operation’s id.
    Another thing to notice is the projection feild. It is again collection of transformations that you may need to apply over the data. The ref tag gets the name of the operation which would be used as a name of output of the operation. The expr tag holds the actual transformation to be applied over data.
  • store: The final component of the query is the store operation. It uses the console storage for dumping the output of the query. As seen earlier, the input of the store operation is the output of the project operation, hence input:2.

Note: The ref tag in the project operation must always start with a output prefix. Eg. output.ceil, output.floor.

These were few components of the Drill Logical Plan. A complete detailed document on drill plans can be found here on drill wiki link. Apache Drill detailed plan document.

Hope the post is helpful. Cheers \m/

4 thoughts on “Understanding Apache Drill Logical Plan”

  1. Pingback: Lifetime of a Query in Drill Alpha Release – Timothy Chen

  2. But one doubt, in general logical plan should be constructed by Drill based on given query right?

    Why the user has to create logical plan? Is it temporary?

  3. Yes Anil your understanding is correct. Drill would generate the logical plan internally.
    This post is to understand the components of the logical plan. It is helpful for scenarios where you can quickly write a logical plan and test its execution without having to hit a query from sqlline prompt and go through entire flow.

Leave a Reply

Your email address will not be published. Required fields are marked *