Dataiku on AWS: Can I use an AWS EFS mount as my DATA_DIR ?

Solved!
jojin
Level 2
I'm evaluating Dataiku for our AWS environment. We want an architecture that avoids data loss if our Dataiku instance goes down due to an availability zone failure. I saw that the Dataiku documentation explicitly states that AWS EFS is not supported as an installation target.

Does this include the DATA_DIR for the installation as well? Can I have an EC2 instance on which Dataiku is installed on EBS but the DATA_DIR is a mount of an EFS volume? The objective is to be able to spin up another EC2 instance with the same version of Dataiku, use the EFS volume as the DATA_DIR, and thus recover previous data.

Link to documentation which calls out the limitation: https://doc.dataiku.com/dss/latest/installation/requirements.html#filesystem

jojin
Level 2
Author

@CoreyS @tgb417 Can you please let me know why the documentation says what it does? I tried installing Dataiku with the data directory on an EFS mount and it seems to have worked. I have not evaluated any functionality of the application, though. I just wanted to be sure of what I'm missing here.

tgb417

@jojin ,

An interesting idea on your part.  I don’t know why this is a limitation.  Maybe one of the Dataiku folks can jump on this thread with some insight.

cc: @NedM 

--Tom
fchataigner2
Dataiker

Hi,

EFS has been seen to perform clearly sub-optimally as the data dir of a DSS instance, with lags and hangs, among other (less frequent) issues. The recommendation is to use EBS and set up regular EBS snapshots as a backup strategy. Snapshots are stored at the region level, so they can be restored into a volume in a different AZ.
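
As an illustration of the snapshot-and-restore flow described above, here is a minimal sketch in Python. It assumes a boto3 EC2 client is passed in as `ec2`; the function names, the `dss-instance` tag key, and the `gp3` default volume type are hypothetical choices for the example, not something prescribed by Dataiku.

```python
from datetime import datetime, timezone


def snapshot_request(volume_id, instance_name, now=None):
    """Build keyword arguments for the EC2 CreateSnapshot call."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%MZ")
    return {
        "VolumeId": volume_id,
        "Description": f"DSS data dir backup for {instance_name} at {stamp}",
        "TagSpecifications": [
            {
                "ResourceType": "snapshot",
                # Hypothetical tag key, used only to find these snapshots later.
                "Tags": [{"Key": "dss-instance", "Value": instance_name}],
            }
        ],
    }


def restore_request(snapshot_id, availability_zone, volume_type="gp3"):
    """Build keyword arguments for the EC2 CreateVolume call."""
    return {
        "SnapshotId": snapshot_id,
        "AvailabilityZone": availability_zone,
        "VolumeType": volume_type,
    }


def backup_data_dir_volume(ec2, volume_id, instance_name):
    """Snapshot the volume holding the data dir and wait until it completes."""
    snapshot = ec2.create_snapshot(**snapshot_request(volume_id, instance_name))
    snapshot_id = snapshot["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])
    return snapshot_id


def restore_into_zone(ec2, snapshot_id, availability_zone):
    """Create a new volume from the snapshot in the given availability zone."""
    volume = ec2.create_volume(**restore_request(snapshot_id, availability_zone))
    return volume["VolumeId"]
```

With boto3 installed and credentials configured, `ec2` would typically be `boto3.client("ec2")`. The new volume then has to be attached to a replacement instance in the target AZ and mounted at the data dir path before DSS is started.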

jojin
Level 2
Author

@fchataigner2 Thank you for that explanation. Exactly what I was looking for. Can this be solved by using a Max I/O or provisioned throughput EFS?

When using EBS snapshots as a backup strategy, how can I ensure consistent backups? What is the risk of not being able to restore from a snapshot of the data dir that was taken while the application was running? I understand that this is application specific.

fchataigner2
Dataiker

We don't recommend the EFS route, as we never got it to work consistently with DSS.

EBS snapshots are atomic (since they're point-in-time snapshots of the volume), so there is no possible issue with consistency. After restoring, DSS just needs to be restarted; it needs no specific interaction with backups. Snapshots are our recommendation for production instances, with the only caveat that the entire data dir must be on the same EBS volume (i.e. a single volume), since each snapshot captures one volume.

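
To check the single-volume caveat on a running instance, something like the following sketch could help. It compares the device ID of the data dir with that of each top-level entry inside it. The function names are hypothetical, and it only inspects the first level of the directory, so a mount point or symlink deeper in the tree would not be detected.

```python
import os


def devices_under(data_dir):
    """Map data_dir and each of its top-level entries to a device id."""
    devices = {data_dir: os.stat(data_dir).st_dev}
    for entry in os.scandir(data_dir):
        try:
            # os.stat follows symlinks, so a symlink that points at another
            # volume is reported with that volume's device id.
            devices[entry.path] = os.stat(entry.path).st_dev
        except FileNotFoundError:
            # Dangling symlink: nothing to check.
            continue
    return devices


def entries_off_volume(data_dir):
    """Return the top-level entries that live on a different device."""
    devices = devices_under(data_dir)
    root_device = devices[data_dir]
    return sorted(path for path, dev in devices.items() if dev != root_device)
```

An empty list from `entries_off_volume` on the data dir path suggests that, at least at the top level, everything sits on one filesystem. Any path it returns would be missing from a snapshot of the main volume.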