Dataiku on AWS: Can I use an AWS EFS mount as my DATA_DIR?

jojin
jojin Registered Posts: 3 ✭✭✭✭

I'm evaluating Dataiku for our AWS environment. We want an architecture that avoids data loss if our Dataiku instance goes down due to an availability zone failure. I saw that the Dataiku documentation explicitly states that AWS EFS is not supported as an installation target.

Does this include the DATA_DIR for the installation as well? Can I have an EC2 instance in which Dataiku is installed on EBS but the DATA_DIR is a mounted EFS volume? The objective is to be able to spin up another EC2 instance with the same version of Dataiku, use the EFS volume as the DATA_DIR, and thus recover the previous data.

Link to documentation which calls out the limitation: https://doc.dataiku.com/dss/latest/installation/requirements.html#filesystem

Best Answer

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker
    Answer ✓

    Hi,

    EFS used for the data dir of a DSS instance has been seen to be clearly sub-optimal performance-wise, with lags and hangs, among other (less frequent) issues. The recommendation is to use EBS and set up regular EBS snapshots (which are cross-AZ) as a backup strategy.
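As a rough illustration of the snapshot recommendation above, here is a minimal sketch using boto3's `create_snapshot` call. The volume ID, instance name, region and the `dss-instance` tag key are hypothetical placeholders, not anything Dataiku prescribes, and scheduling (cron, or a managed option such as Amazon Data Lifecycle Manager) is left out.

```python
from datetime import datetime, timezone


def build_snapshot_request(volume_id, instance_name, now=None):
    """Build keyword arguments for the EC2 CreateSnapshot call.

    The tag key "dss-instance" is an illustrative choice, not a convention.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "VolumeId": volume_id,
        "Description": f"DSS data dir backup for {instance_name} at {stamp}",
        "TagSpecifications": [
            {
                "ResourceType": "snapshot",
                "Tags": [{"Key": "dss-instance", "Value": instance_name}],
            }
        ],
    }


def snapshot_data_dir_volume(volume_id, instance_name, region_name):
    """Request a snapshot of the EBS volume and return the snapshot ID."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    response = ec2.create_snapshot(
        **build_snapshot_request(volume_id, instance_name)
    )
    return response["SnapshotId"]
```

A call would look something like `snapshot_data_dir_volume("vol-0123456789abcdef0", "dss-design", "us-east-1")`, run with credentials that are allowed to create snapshots.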

Answers

  • jojin
    jojin Registered Posts: 3 ✭✭✭✭

    @CoreyS
    @tgb417
    Can you please let me know why the documentation says what it does? I tried installing Dataiku with the data directory on an EFS mount and it seems to have worked. I have not evaluated any functionality of the application, though. I just wanted to be sure of what I'm missing here.

  • tgb417
    tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

    @jojin,

    An interesting idea on your part. I don’t know why this is a limitation. Maybe one of the Dataiku folks can jump on this thread with some insight.

    cc: @NedM

  • jojin
    jojin Registered Posts: 3 ✭✭✭✭

    @fchataigner2
    Thank you for that explanation. Exactly what I was looking for. Can this be solved by using a Max I/O or provisioned throughput EFS?

    When using EBS snapshots as a backup strategy, how can I ensure consistent backups? What is the risk of not being able to restore from the data dir of a snapshot which was taken while the application is running? I understand that this is application specific.

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    We don't recommend the EFS route, as we never got it to work consistently with DSS.

    EBS snapshots are atomic (since they're snapshots), so there is no possible issue with consistency. DSS just needs to be restarted; it requires no specific interaction with the backups. They're our recommendation for production instances, with the only caveat that the entire datadir must be on the same EBS volume (i.e. a single volume).
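One way to sanity-check the single-volume caveat before relying on snapshots is to compare device IDs under the data directory. This is only a sketch: the function names are made up for illustration, it inspects directories only (not individual files), and it assumes a Linux host where separate volumes show up with different `st_dev` values.

```python
import os


def devices_under(path):
    """Return the set of device IDs (st_dev) for path and its subdirectories.

    Symlinked directories are resolved with os.stat, so a link pointing at
    another volume is reported, but the walk does not descend into it.
    """
    devices = {os.stat(path).st_dev}
    for root, dirs, _files in os.walk(path, followlinks=False):
        for name in dirs:
            try:
                devices.add(os.stat(os.path.join(root, name)).st_dev)
            except FileNotFoundError:
                # Broken symlink: nothing to resolve, so skip it.
                continue
    return devices


def is_single_volume(path):
    """True if every directory under path lives on the same device."""
    return len(devices_under(path)) == 1
```

If `is_single_volume` returns `False` for the data directory, some part of it lives on a different device and would not be captured by a snapshot of one volume.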

  • yashpuranik
    yashpuranik Partner, Dataiku DSS Core Designer, Dataiku DSS ML Practitioner, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2022, Neuron 2023 Posts: 69 Neuron

    @fchataigner2: I am curious whether Dataiku is looking into this: https://aws.amazon.com/blogs/aws/mountpoint-for-amazon-s3-generally-available-and-ready-for-production-workloads/

    If I am interpreting this correctly, Amazon has made available a way to mount an S3 bucket and work with it as a file system. This may be especially useful when trying to write to managed folders in S3. The current way is to write to a file stream, but if the mount option works as advertised, we could use paths directly.

  • fchataigner2
    fchataigner2 Dataiker Posts: 355 Dataiker

    There should be nothing preventing you from creating a filesystem connection to such a mount point and putting managed folders on it. In the end, DSS doesn't need to know that a portion of the filesystem is actually a mounted S3 bucket.

    That being said, the point about EFS still stands: DSS will not work correctly if its datadir (and chiefly the "working" folders such as config/ or jobs/) is actually a network share or mounted cloud storage. The latency will be far too high for smooth operation.
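To check whether a data directory has ended up on a network share or a FUSE mount, one option on Linux is to look up its filesystem type in `/proc/mounts`. The sketch below assumes that file's usual whitespace-separated layout; the prefix list and function names are illustrative choices, mount points containing spaces are not handled, and symlinks in the path are not resolved. EFS is mounted over NFS, so it would typically show up with an `nfs4` type.

```python
import os

# Filesystem type prefixes treated as "not local" here (an illustrative list).
NETWORK_FS_PREFIXES = ("nfs", "cifs", "smb", "fuse")


def parse_mounts(mounts_text):
    """Parse /proc/mounts content into (mount_point, fs_type) pairs."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, mounts_text):
    """Return the fs type of the longest mount point that contains path."""
    path = os.path.abspath(path)
    best_point, best_type = "", None
    for mount_point, fs_type in parse_mounts(mounts_text):
        prefix = mount_point.rstrip("/") + "/"
        # ">=" lets a later mount on the same point shadow an earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_point):
            best_point, best_type = mount_point, fs_type
    return best_type


def looks_like_network_fs(fs_type):
    """True if the fs type starts with one of the prefixes listed above."""
    return fs_type is not None and fs_type.startswith(NETWORK_FS_PREFIXES)
```

On a real host, the text would come from `open("/proc/mounts").read()`, and the path would be the DSS data directory.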

  • Turribeach
    Turribeach Dataiku DSS Core Designer, Neuron, Dataiku DSS Adv Designer, Registered, Neuron 2023 Posts: 1,980 Neuron

    I'm not sure you are going to gain anything with this. FUSE has been able to mount S3 buckets as file systems since 2014, so this functionality is not new; AWS is merely adding official support for it. Also, if you read the blog post, you can see that it still doesn't make the solution POSIX compliant, so you are not dealing with a proper file system, even if you can mount it as one. Finally, we did some testing and got far better I/O performance with EBS local disks than with S3 buckets in Dataiku. You mention using paths directly instead of a file stream; I'm not sure why that is a problem. So my advice is to leave S3 buckets as S3 buckets. If you need performance or a proper file system, use EBS local disks.
