Dataiku on AWS: Can I use an AWS EFS mount as my DATA_DIR?

Solved!
jojin
Level 2
Dataiku on AWS: Can I use an AWS EFS mount as my DATA_DIR?

I'm evaluating Dataiku for our AWS environment. We want an architecture that avoids data loss if our Dataiku instance goes down due to an availability zone failure. I saw that the Dataiku documentation explicitly states that AWS EFS is not supported as an installation target.

Does this include the DATA_DIR for the installation as well? Can I have an EC2 instance where Dataiku is installed on EBS but the DATA_DIR is a mount of an EFS volume? The objective is to be able to spin up another EC2 instance with the same version of Dataiku, use the EFS volume as the DATA_DIR, and thus recover the previous data.

Link to documentation which calls out the limitation: https://doc.dataiku.com/dss/latest/installation/requirements.html#filesystem

8 Replies
jojin
Level 2
Author

@CoreyS @tgb417 Can you please let me know why the documentation says what it does? I tried installing Dataiku with the data directory on an EFS mount and it seems to have worked. I have not evaluated any functionality of the application yet, though. Just wanted to be sure of what I'm missing here.

tgb417

@jojin ,

An interesting idea on your part. I don't know why this is a limitation. Maybe one of the Dataiku folks can jump on this thread with some insight.

cc: @NedM 

--Tom
fchataigner2
Dataiker

Hi,

EFS used for the data dir of a DSS instance has been seen to be clearly sub-optimal performance-wise, with lags and hangs, among other (more infrequent) issues. The recommendation is to use EBS and set up regular EBS snapshots (which can be restored in any AZ) as a backup strategy.
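For reference, here is a minimal sketch of that snapshot approach using boto3; the region, volume ID, and tags below are hypothetical placeholders, not values from this thread.

```python
# Sketch: take a point-in-time EBS snapshot of the volume backing the DSS data dir.
# Assumes boto3 is installed and AWS credentials are configured; identifiers are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",  # EBS volume holding the DSS DATA_DIR
    Description="DSS data dir backup",
    TagSpecifications=[{
        "ResourceType": "snapshot",
        "Tags": [{"Key": "app", "Value": "dss"}],
    }],
)

# Optionally block until the snapshot is complete before relying on it.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
print("Snapshot ready:", snapshot["SnapshotId"])
```

Since EBS snapshots are stored regionally, a replacement instance in another AZ can restore from them.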

jojin
Level 2
Author

@fchataigner2 Thank you for that explanation. Exactly what I was looking for. Can this be solved by using a Max I/O or provisioned throughput EFS?

When using EBS snapshots as a backup strategy, how can I ensure consistent backups? What is the risk of not being able to restore from the data dir in a snapshot taken while the application is running? I understand that this is application-specific.

fchataigner2
Dataiker

We don't recommend the EFS route, as we never got it to work consistently with DSS.

EBS snapshots are atomic, point-in-time copies, so there is no consistency issue. After restoring, DSS just needs to be restarted; no application-specific interaction with the backups is required. They're our recommendation for production instances, with the only caveat that the entire data dir must be on the same EBS volume (i.e. a single volume).
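A hedged sketch of the restore side, under the same assumptions (the snapshot ID, AZ, instance ID, and device name below are hypothetical):

```python
# Sketch: recreate the data dir volume from a snapshot in another AZ and attach it
# to a replacement EC2 instance. All identifiers are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Create a new volume from the latest snapshot, in the AZ of the replacement instance.
volume = ec2.create_volume(
    AvailabilityZone="eu-west-1b",
    SnapshotId="snap-0123456789abcdef0",
    VolumeType="gp3",
)
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Attach it to the replacement instance; then mount it at the DATA_DIR path
# and restart DSS (e.g. DATA_DIR/bin/dss start).
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```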

yashpuranik

@fchataigner2: I am curious if Dataiku is looking into this: https://aws.amazon.com/blogs/aws/mountpoint-for-amazon-s3-generally-available-and-ready-for-producti...

 

If I am interpreting this correctly, Amazon has made available a way to mount S3 buckets and work with them through the local file system. This may be especially useful when trying to write to managed folders in S3. The current way is to write to a file stream, but if the mount option works as advertised, we could use paths directly.
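To illustrate the difference, a sketch using the Dataiku Python API; the folder name and file path are hypothetical, and the exact calls should be treated as illustrative rather than authoritative.

```python
# Sketch of the two access styles for a DSS managed folder (folder name is hypothetical).
import dataiku

folder = dataiku.Folder("my_s3_folder")

# Stream-based access: works for remote folders, e.g. S3-backed ones.
with folder.get_writer("results/output.csv") as writer:
    writer.write(b"col_a,col_b\n1,2\n")

# Path-based access: only possible when the folder sits on a filesystem connection
# (local disk or a mounted bucket, as discussed here); it raises otherwise.
local_path = folder.get_path()
print("Folder is locally reachable at:", local_path)
```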

yashpuranik
fchataigner2
Dataiker

There should be nothing preventing you from creating a filesystem connection to such a mount point and putting managed folders on it. In the end, DSS doesn't need to know that a portion of the filesystem is actually a mounted S3 bucket.

That being said, the point about EFS still stands: DSS will not work correctly if its data dir (and chiefly the "working" folders like config/ or jobs/) is actually a network share or mounted cloud storage. The latency will be way too high for smooth operation.

Turribeach

Not really sure you are going to gain anything with this. In fact, FUSE has been able to mount S3 buckets as file systems since 2014, so this functionality is not new; AWS is merely adding official support for it. And if you read the blog post, you can see it still doesn't make the solution POSIX compliant, so you are not dealing with a proper file system even if you can mount it as one. Finally, we did some testing and saw much better I/O performance using local EBS disks than S3 buckets in Dataiku. You mention using paths directly instead of a file stream; I'm not sure why that is a problem. So my advice would be to leave S3 buckets as S3 buckets. If you need performance or a proper file system, use local EBS disks.
