This is a common requirement that data engineers face when working on Databricks. When a cluster connects to an external system, the traffic can originate from any of the cluster nodes.

If the cluster is provisioned in the Databricks-managed network with public IPs on all nodes, a request can leave from any of the nodes. The source IP address also changes every time you provision the cluster, because dynamic public IP addresses are attached to the cluster nodes.

The possible options to address this issue are listed below.

  • Use a NAT gateway in the host subnet of the Databricks cluster. The NAT gateway acts as the single gateway for all outbound network traffic from the cluster. This option is possible only if you have the access and the authority to modify the network configuration.
  • Provision a VM and configure a proxy using software such as HAProxy. You can attach a static public IP address to the VM, and requests from the Databricks cluster are then routed through this proxy server. Installing HAProxy is simple, and it can proxy databases, APIs, and so on.

An example of using HAProxy to proxy Snowflake is available in this repository.
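To give a rough idea of what the proxy option involves, here is a minimal Python sketch that renders a bare-bones HAProxy configuration for plain TCP forwarding to a single upstream. It is not the configuration from the repository mentioned above. The function name, the section names (`proxy_in`, `proxy_out`), the timeouts, and the example hostname are all assumptions chosen for illustration. How the cluster is pointed at the proxy is outside this sketch.

```python
def render_haproxy_tcp_config(listen_port: int, target_host: str, target_port: int) -> str:
    """Return a minimal HAProxy config that forwards raw TCP to one upstream.

    Hypothetical helper for illustration; tune timeouts and add logging,
    health checks, and access rules before using anything like this for real.
    """
    for port in (listen_port, target_port):
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")

    lines = [
        "defaults",
        "    mode tcp",                 # forward bytes as-is, no HTTP parsing
        "    timeout connect 10s",      # assumed values, not recommendations
        "    timeout client 1m",
        "    timeout server 1m",
        "",
        "frontend proxy_in",
        f"    bind *:{listen_port}",    # port the Databricks cluster connects to
        "    default_backend proxy_out",
        "",
        "backend proxy_out",
        f"    server upstream1 {target_host}:{target_port}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder hostname; replace with the system you are proxying.
    print(render_haproxy_tcp_config(443, "example-account.snowflakecomputing.com", 443))
```

The external system then sees the static public IP of the VM as the source address, so that is the address to add to its allow list.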

I have used both options in several use cases, and both work well. Pick one based on your circumstances.
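For the NAT gateway option, one way to check the result is to ask a public IP echo service which address it sees and compare that with the static IP you expect. The sketch below assumes `https://api.ipify.org`, which returns the caller's IP as plain text; any equivalent service would do. The function names and the example address are hypothetical. Run from a notebook, it only reports the egress IP of the node that executes it, typically the driver.

```python
import ipaddress
import urllib.request
from typing import Callable

# Assumption: a service that returns the caller's public IP as plain text.
IP_ECHO_URL = "https://api.ipify.org"


def fetch_public_ip(url: str = IP_ECHO_URL, timeout: float = 5.0) -> str:
    """Return the public IP address the echo service sees for this node."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def egress_matches(expected_ip: str, fetch: Callable[[], str] = fetch_public_ip) -> bool:
    """Compare the observed egress IP with the expected static IP.

    Both values are parsed with ipaddress, so malformed input raises ValueError
    instead of silently comparing unequal strings.
    """
    observed = ipaddress.ip_address(fetch())
    return observed == ipaddress.ip_address(expected_ip)


# Example call (placeholder address, replace with your NAT gateway's public IP):
# egress_matches("203.0.113.10")
```

Passing `fetch` as a parameter keeps the comparison testable without network access, and lets you swap in a different lookup if the echo service is not reachable from your network.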