Dataiku on AWS: Can I use an AWS EFS mount as my DATA_DIR?

Solved!
jojin
Level 2
I'm evaluating Dataiku for our AWS environment. We want an architecture that avoids data loss if our Dataiku instance goes down because of an availability zone failure. I saw that the Dataiku documentation explicitly states that AWS EFS is not supported as an installation target.

Does this include the DATA_DIR for the installation as well? Can I have an EC2 instance on which Dataiku is installed on EBS but the DATA_DIR is a mounted EFS volume? The objective is to be able to spin up another EC2 instance with the same version of Dataiku, use the EFS volume as its DATA_DIR, and thus recover the previous data.

Link to documentation which calls out the limitation: https://doc.dataiku.com/dss/latest/installation/requirements.html#filesystem

jojin
Level 2
Author

@CoreyS @tgb417 Can you please let me know why the documentation says what it does? I tried installing Dataiku with the data directory on an EFS mount and it seems to have worked. I have not evaluated any functionality of the application, though. I just want to be sure of what I'm missing here.

tgb417

@jojin ,

An interesting idea on your part.  I don’t know why this is a limitation.  Maybe one of the Dataiku folks can jump on this thread with some insight.

cc: @NedM 

--Tom
fchataigner2
Dataiker

Hi,

EFS used for the data dir of a DSS instance has been seen to be clearly sub-optimal performance-wise, with lags and hangs, among other (more infrequent) issues. The recommendation is to use EBS and set up regular EBS snapshots (which are cross-AZ) as a backup strategy.
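
For readers who want to script the snapshot side of this, here is a minimal sketch, not an official Dataiku procedure. It assumes boto3 is installed and AWS credentials are configured; the function name `snapshot_data_dir_volume`, the `instance_name` label, and the volume ID in the usage comment are hypothetical. The EC2 client is passed in so the function can be exercised without touching AWS.

```python
from datetime import datetime, timezone


def snapshot_data_dir_volume(ec2_client, volume_id, instance_name="dss"):
    """Start a snapshot of the EBS volume that holds the DSS data dir.

    ec2_client is expected to behave like boto3.client("ec2").
    Returns the ID of the snapshot that was started.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    response = ec2_client.create_snapshot(
        VolumeId=volume_id,
        Description=f"{instance_name} data dir backup {stamp}",
        TagSpecifications=[
            {
                "ResourceType": "snapshot",
                # Hypothetical tag, only there to make snapshots easy to find.
                "Tags": [{"Key": "Name", "Value": f"{instance_name}-datadir"}],
            }
        ],
    )
    return response["SnapshotId"]


# Usage (needs real credentials and a real volume ID, shown here as a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   print(snapshot_data_dir_volume(ec2, "vol-0123456789abcdef0"))
```

Scheduling is left out on purpose; a cron job or a managed snapshot policy could call something like this at whatever interval fits your recovery objectives.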

jojin
Level 2
Author

@fchataigner2 Thank you for that explanation. Exactly what I was looking for. Can this be solved by using a Max I/O or provisioned throughput EFS?

When using EBS snapshots as a backup strategy, how can I ensure consistent backups? What is the risk of not being able to restore from the data dir of a snapshot that was taken while the application was running? I understand that this is application-specific.

fchataigner2
Dataiker

We don't recommend the EFS route, as we never got it to work consistently with DSS.

EBS snapshots are atomic (since they're snapshots), so there is no possible issue with consistency. After a restore, DSS just needs to be restarted; it needs no specific interaction with the backups. They're our recommendation for production instances, with the only caveat that the entire datadir must be on the same EBS volume (i.e. a single volume).
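
One rough way to check the single-volume caveat is to compare the device ID of every directory under the datadir. This is only a sketch with hypothetical helper names (`devices_under`, `is_single_volume`), and it makes an assumption: a single device ID means a single mounted filesystem, which is a proxy for a single EBS volume and would not notice several volumes combined under LVM or RAID.

```python
import os


def devices_under(data_dir):
    """Map each device ID found under data_dir to one example path on it."""
    seen = {os.stat(data_dir).st_dev: data_dir}
    for root, dirs, _files in os.walk(data_dir, followlinks=False):
        for name in dirs:
            path = os.path.join(root, name)
            try:
                # os.stat follows symlinks, so a symlinked directory that
                # points to another filesystem is reported as well.
                dev = os.stat(path).st_dev
            except OSError:
                continue  # unreadable or vanished entry, skip it
            seen.setdefault(dev, path)
    return seen


def is_single_volume(data_dir):
    """True when every directory under data_dir is on the same filesystem."""
    return len(devices_under(data_dir)) == 1
```

If `devices_under` returns more than one entry, the example paths show which parts of the datadir live elsewhere and would be missed by a snapshot of the main volume.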

yashpuranik

@fchataigner2: I am curious if Dataiku is looking into this: https://aws.amazon.com/blogs/aws/mountpoint-for-amazon-s3-generally-available-and-ready-for-producti...

If I am interpreting this correctly, Amazon has made available a way to mount an S3 bucket and work with it through a mount point. This may be especially useful when writing to managed folders in S3. The current way is to write to a file stream, but if the mount option works as advertised, we could use paths directly.

fchataigner2
Dataiker

There should be nothing preventing you from creating a filesystem connection to such a mount point and putting managed folders on it. In the end, DSS doesn't need to know that a portion of the filesystem is actually a mounted S3 bucket.

That being said, the point about EFS still stands: DSS will not work correctly if its datadir (and chiefly the "working" folders like config/ or jobs/) is actually a network share or mounted cloud storage. The latency will be far too high for smooth operation.
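
To get a feel for that latency on your own mounts, a crude probe like the one below can be run against an EBS-backed directory and a network-mounted one. It is an illustration only, not a DSS benchmark: the function name `small_write_latency_ms` and the default round count and payload size are arbitrary assumptions, and it only times small create, write, fsync, delete cycles.

```python
import os
import statistics
import tempfile
import time


def small_write_latency_ms(directory, rounds=50, size=4096):
    """Time create + write + fsync + delete of small files in directory."""
    payload = b"x" * size
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, payload)
            os.fsync(fd)  # force the data out to the backing storage
        finally:
            os.close(fd)
            os.remove(path)
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


# Usage (paths are placeholders for your own mount points):
#   print(small_write_latency_ms("/data/ebs-dir"))
#   print(small_write_latency_ms("/mnt/efs-dir"))
```

Comparing the median and the worst case between the two directories gives a first indication of the lags and hangs described above, although real DSS workloads will stress the filesystem differently.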


I'm not really sure you are going to gain anything with this. In fact, FUSE has been able to mount S3 buckets as file systems since 2014, so this functionality is not new; AWS is merely adding support for it. If you read the blog post, you can also see that it still doesn't make the solution POSIX compliant, so you are not dealing with a proper file system even if you can mount it as one. Finally, we did some testing and got far better I/O performance in Dataiku with EBS local disks than with S3 buckets. You mention using paths directly instead of a file stream; I'm not sure why the file stream is a problem. So my advice would be to leave S3 buckets as S3 buckets. If you need performance or a proper file system, use EBS local disks.
