Dataiku Outbound IP Address Control - Proxy or Otherwise?

My team works with a data supplier that requires IP addresses to be whitelisted. When we use Dataiku to fetch the data, the IP address associated with the call occasionally changes.
We have access to a proxy, and requests routed through it to the data supplier succeed when we run the code on our local PCs. When we run the same code through the proxy on Dataiku, the request is not sent.
What is the preferred way to either ensure that the request always comes from the same IP address, or get Dataiku to accept a request routed through a proxy?
Answers
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,577 Neuron
It really depends on how you are making these requests in Dataiku. You can configure an HTTP proxy in Dataiku Administration ⇒ Settings ⇒ Other ⇒ Misc but it will only affect certain recipes/datasets so you will need to explain clearly how you are connecting to your data supplier.
-
@Turribeach , I am part of the user group rather than the admin group. Ideally, I wouldn't have to make a request to the admins for an update on this proxy issue. I only want to use the proxy address on a small number of projects.
I am making my call to the supplier's server using the "requests" package within a Python recipe. I have a proxy IP address that we use for other activity on local PC, but have not gotten it to work within Dataiku.
What works on local PC is Python code in the format:
proxies = {
    'http': 'http://your_proxy_address:your_proxy_port',
    'https': 'http://your_proxy_address:your_proxy_port'
}
response = requests.get('http://example.com', proxies=proxies)
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,577 Neuron
So this should work. But you are not giving us much additional information:
"When we run the same code through the proxy on Dataiku it won't send the request."
"have not gotten it to work within Dataiku."
Can you post the full code snippet? Can you run it in a Dataiku Jupyter Notebook and show the outputs? Do you see any errors? Does it hang? Is your proxy authenticated?
-
@Turribeach, this is the code that runs fine on local:
import requests
from datetime import datetime, timezone
# Proxy configuration
HTTP_PROXY = "my_proxy_url.com:80"
proxies = {"http": HTTP_PROXY, "https": HTTP_PROXY}
#URL of the file on the server we are fetching
url = "my_supplier_url.com/their_file.txt"
# Make the GET request through the proxy
try:
    response = requests.get(url2D, proxies=proxies, verify=False, timeout=100)
    # Print the response status code and content
    print("Response Status Code:", response.status_code)
    if response.status_code == 200:
        print("Response Content:", response.text[0:200])
    else:
        print("Failed to retrieve the file. HTTP Status Code:", response.status_code)
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
On local, I receive back " Response Status Code: 200" along with the response.text snippet.
When I try to run the same thing through Dataiku, I get this response:
An error occurred: HTTPSConnectionPool(host='my_supplier_url.com', port=443): Max retries exceeded with url: their_file.txt (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at ,<some code>: Failed to establish a new connection: [Errno -2] Name or service not known')))
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,577 Neuron
That's a DNS error. Your DSS box doesn't seem to be using the proxy because it tries to connect to it directly.
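The `[Errno -2] Name or service not known` wrapped inside the `ProxyError` usually means the machine running the code could not resolve the proxy's hostname. A minimal sketch for checking that from a Dataiku notebook, using only the standard library. `resolve_host` is a hypothetical helper name, and the `resolver` parameter exists only so the lookup can be swapped out:

```python
import socket

def resolve_host(host, port, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses host resolves to, or [] if the DNS lookup fails."""
    try:
        infos = resolver(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # Same failure class as "[Errno -2] Name or service not known"
        print(f"DNS lookup failed for {host!r}: {exc}")
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP
    return sorted({info[4][0] for info in infos})
```

Calling `resolve_host("my_proxy_url.com", 80)` (the placeholder name from this thread) on the DSS machine and getting an empty list would suggest the proxy name is not resolvable from there, even though it resolves on a local PC.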
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,577 Neuron
Try this approach:
import os
os.environ['http_proxy'] = 'http://my_proxy_url.com:80'
os.environ['https_proxy'] = 'http://my_proxy_url.com:80'
Then use the requests package normally; there is no need to pass any proxies, as it will pick them up from the environment variables. I used this in the past to connect to external sites via a proxy in Dataiku, so I know it works.
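A small sketch for confirming what the current process actually advertises before any request is made. It uses the standard library's `urllib.request.getproxies()`, which reads the same `*_proxy` environment variables; `set_proxy_env` is a hypothetical helper name and the proxy address is the placeholder from this thread:

```python
import os
import urllib.request

def set_proxy_env(proxy_url):
    """Point both proxy variables at proxy_url and return what the process now sees."""
    os.environ["http_proxy"] = proxy_url
    os.environ["https_proxy"] = proxy_url
    # getproxies() reads the *_proxy environment variables of this process
    return urllib.request.getproxies()

proxies = set_proxy_env("http://my_proxy_url.com:80")
print(proxies.get("http"), proxies.get("https"))
```

If both values print as expected and the request still fails, the variables are being set correctly and the problem is more likely on the network side.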
-
@Turribeach thanks for your help, but that suggestion resulted in the same error.
I ran this:
import requests
from datetime import datetime, timezone
import os
# Proxy configuration
os.environ['http_proxy'] = 'http://my_proxy_url.com:80'
os.environ['https_proxy'] = 'http://my_proxy_url.com:80'
proxies = {"http": os.environ['http_proxy'], "https": os.environ['https_proxy']}
#URL of the file on the server we are fetching
url = "my_supplier_url.com/their_file.txt"
# Make the GET request through the proxy
try:
    response = requests.get(url2D, proxies=proxies, verify=False, timeout=100)
    # Print the response status code and content
    print("Response Status Code:", response.status_code)
    if response.status_code == 200:
        print("Response Content:", response.text[0:200])
    else:
        print("Failed to retrieve the file. HTTP Status Code:", response.status_code)
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
The response is:
An error occurred: HTTPSConnectionPool(host='my_supplier_url.com', port=443): Max retries exceeded with url: their_file.txt (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at <some_string>>: Failed to establish a new connection: [Errno -2] Name or service not known')))
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,577 Neuron
You are still passing your proxies to the requests call. You don't need that:
response = requests.get(url, verify=False, timeout=100)
-
Thanks for still trying to help.
I just tried this:
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import requests
from datetime import datetime, timezone
import os
# Proxy configuration
os.environ['http_proxy'] = 'http://my_proxy_url.com:80'
os.environ['https_proxy'] = 'http://my_proxy_url.com:80'
#URL of the file on the server we are fetching
url = "my_supplier_url.com/their_file.txt"
# Make the GET request through the proxy
try:
response = requests.get(url2D, verify=False, timeout=100)
    # Print the response status code and content
    print("Response Status Code:", response.status_code)
    if response.status_code == 200:
        print("Response Content:", response.text[0:200])
    else:
        print("Failed to retrieve the file. HTTP Status Code:", response.status_code)
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
This returns:
An error occurred: HTTPSConnectionPool(host='my_supplier_url.com', port=443): Max retries exceeded with url: their_file.txt (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at <some_string>>: Failed to establish a new connection: [Errno -2] Name or service not known')))
-
Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 2,577 Neuron
I am not convinced you are running the code exactly as per your code snippet since your code snippet has two bugs:
The response line inside the try statement is not correctly indented, which raises an error.
The code says "url2D", not "url", so it fails because url2D is undefined.
In any case this is not a Dataiku issue. So what I suggest you do is to fix your code bugs and give the code to your Dataiku administrator and ask them to run it under a Python prompt in the command line. It should fail with the same error proving it's not a Dataiku issue. Then have your administrator figure out why you get this DNS error.
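A sketch of a standalone check an administrator could run from a plain Python prompt on the DSS server, with neither Dataiku nor requests involved. It separates a DNS failure from a refused or timed-out connection. The function name `check_proxy` and its return labels are illustrative only:

```python
import socket

def check_proxy(host, port, timeout=5.0):
    """Classify connectivity to host:port as 'ok', 'dns_failure', or 'connect_failure'."""
    try:
        # create_connection resolves the name, then opens a TCP connection
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror:
        # Name resolution failed, as in "[Errno -2] Name or service not known"
        return "dns_failure"
    except OSError:
        # The name resolved, but the connection was refused, unreachable, or timed out
        return "connect_failure"
```

Running `check_proxy("my_proxy_url.com", 80)` with the real proxy host in place of the thread's placeholder and getting `"dns_failure"` would point at name resolution on the server rather than at Dataiku.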
-
Thank you for your continued help.
I was having issues with the code block copy and paste working correctly on your site, which is why the indentation is incorrect in my previous code. I also have been changing some of the variable names to keep my posts anonymous and ambiguous. Those are the only differences you are seeing between what I posted and what I ran previously.
With those updates, this code produces the same response. I have added a screenshot of my Notebook from Dataiku as well as the updated code in the code block.
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd
import requests
from datetime import datetime, timezone
import os
# Proxy configuration
os.environ['http_proxy'] = 'http://my_proxy_url.com:80'
os.environ['https_proxy'] = 'http://my_proxy_url.com:80'
#URL of the file on the server we are fetching
url = "my_supplier_url.com/their_file.txt"
# Make the GET request through the proxy
try:
    response = requests.get(url, verify=False, timeout=100)
    # Print the response status code and content
    print("Response Status Code:", response.status_code)
    if response.status_code == 200:
        print("Response Content:", response.text[0:200])
    else:
        print("Failed to retrieve the file. HTTP Status Code:", response.status_code)
except requests.exceptions.RequestException as e:
    print("An error occurred:", e)
-
I ended up setting outbound rules on the security group and routing all traffic through a NAT gateway with a fixed IP. That let me control egress without using a proxy.
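One way to confirm that egress really is pinned to a single address is to ask an external IP echo service several times and compare the answers. A minimal sketch under that assumption: `echo_url` must point at a service that returns the caller's IP as plain text, which is not named in this thread, and `fetch_text` and `observed_egress_ips` are hypothetical helpers:

```python
import urllib.request

def fetch_text(url, timeout=10):
    """Fetch url and return the response body as stripped text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()

def observed_egress_ips(echo_url, attempts=5, fetch=fetch_text):
    """Call the echo service `attempts` times and return the set of distinct answers."""
    return {fetch(echo_url) for _ in range(attempts)}
```

A single-element set across many attempts suggests a fixed egress IP, which is what the supplier's whitelist needs; more than one element means the outbound address is still changing.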
-
I had a similar situation where I needed fixed outgoing IPs without reworking the whole network setup. What worked for me was routing the traffic from Dataiku through a proxy with static IPs. If you're hitting limits with traditional options, residential proxies are another possibility: they can give you more stable IPs and are less likely to get blocked by external services.