Basic AWS Cloud Setup Journey

tgb417
tgb417 Dataiku DSS Core Designer, Dataiku DSS & SQL, Dataiku DSS ML Practitioner, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 1,598 Neuron

All,

So, I've been using DSS on macOS and local hardware for a while now. I enjoy using the latest version, DSS 8.0.2 (at this time).

I now have a new project I'm working on and want to bring up a cloud instance, because I have a bunch of remote team members and we want to keep a consistent shared data flow.

This is where I ended up from the Getting Started pages:

https://www.dataiku.com/product/get-started/aws/

It is suggested that I use an AMI to get things set up quickly. This looks like a simple way to go. However, the setup pulldowns seem to suggest that v7.0.3 is the latest version that can be installed (my local instance is already on 8.0.2), and there are definitely some V8 features that I want to use. Am I missing a more recent AMI on AWS somewhere?

[Image: Dataiku DSS AWS.jpg]

Also, the image actually seems not to work. Do you need a VPC to make this instance work? Is there any recommended documentation about best practices for setting up a VPC for DSS use? Note that the Launch button is not available.

[Image: DSS V7.0.3 image.jpg]

I guess what I'm looking for is some detailed recent walkthrough documentation about the installation of 8.0.2 in a cloud environment.

With @AurelienVetil's help, we found this documentation.

https://doc.dataiku.com/dss/latest/installation/other/aws.html

However, it seems to assume a lot of knowledge about EC2, and maybe VPC, that is not covered in detail. Can anyone point me to best practices for getting a cloud configuration done quickly and in a way that can grow over time?

Thanks in advance for any help you can share.

--Tom

Best Answer

  • Ignacio_Toledo
    Ignacio_Toledo Dataiku DSS Core Designer, Dataiku DSS Core Concepts, Neuron 2020, Neuron, Registered, Dataiku Frontrunner Awards 2021 Finalist, Neuron 2021, Neuron 2022, Frontrunner 2022 Finalist, Frontrunner 2022 Winner, Dataiku Frontrunner Awards 2021 Participant, Frontrunner 2022 Participant, Neuron 2023 Posts: 415 Neuron
    edited July 17 Answer ✓

    Hi @tgb417!

    Thanks for the tips about SSL, I will share them with our IT team (as they will start now supporting the DSS deployment).

    About the EC2 instance, I used this AMI as a base: https://aws.amazon.com/marketplace/pp/Centosorg-CentOS-7-x8664-with-Updates-HVM/B00O7WM7QW

    It is a Free Tier CentOS 7 distro, and I had no problems installing DSS 8.0.1. I did the installation to work with Glue, Athena, S3 and EKS, following the guide at https://doc.dataiku.com/dss/latest/cloud/aws/reference-architectures/eks-glue-athena.html, but to keep it simple, these are more or less the steps that I followed after I created the EC2 instance with the previous AMI (you could change them to DSS 8.0.2 where appropriate):

    ## First steps, as 'centos' user (default account, sudoer)
    > sudo yum update
    > sudo yum install -y yum-utils
    ## now would be a good moment to reboot, and then ssh in again
    ## OPTIONAL (but recommended) If you want to create S3 connections, or to use other AWS services
    > curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    > sudo yum install unzip
    > unzip awscliv2.zip
    > sudo ./aws/install
    ## Add 'dataiku' user
    > sudo adduser dataiku
    > sudo passwd dataiku
    > sudo usermod -a -G wheel dataiku
    ## Now, work as dataiku user:
    > su - dataiku
    # enter password
    > sudo yum install wget
    > wget https://downloads.dataiku.com/public/studio/8.0.1/dataiku-dss-8.0.1.tar.gz
    > sudo yum install python3 python3-devel
    > aws configure ## Configure with your aws credentials
    > tar zxvf dataiku-dss-8.0.1.tar.gz
    # disable SElinux
    > sudo vi /etc/selinux/config
    # Change to SELINUX=disabled
    > sudo reboot
    # ssh as centos user, then `su - dataiku`
    ## install dss dependencies
    > sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-deps.sh"
    # before the next step, upload your license into a file called `license.json`
    > dataiku-dss-8.0.1/installer.sh -d /home/dataiku/dss_data -l license.json -p 10000 -P /usr/bin/python3.6
    # run the next step to have dataiku starting automatically after each reboot
    > sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-boot.sh" "/home/dataiku/dss_data" dataiku
    # to install the graphics-export
    > sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-deps.sh" -without-java -without-python -with-chrome
    > dss_data/bin/dssadmin install-graphics-export
    > sudo sysctl user.max_user_namespaces=1000
    # to install R and gcc tools (not sure if the R integration was fully successful)
    > sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-deps.sh" -without-java -without-python -with-r
    > sudo yum group install "Development Tools"
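    > sudo yum install centos-release-scl
    # the devtoolset-8 collection has to be installed before it can be enabled
    > sudo yum install devtoolset-8
    > scl enable devtoolset-8 bash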
    > dss_data/bin/dssadmin install-R-integration
    # Now, install the jdbc libraries you might want to use
    # finally, start dss
    > sudo systemctl start dataiku

    And that was more or less it. At least, that is what I have in my notes.
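A minimal sketch of two steps from the listing above that are easy to script: building the download URL for a given DSS version, and switching the SELinux mode without opening `vi`. The helper names `dss_tarball_url` and `set_selinux_mode` are made up for illustration, and the idea that other versions (such as 8.0.2) follow the same URL pattern as the 8.0.1 link is an assumption that has not been checked here.

```shell
# Hypothetical helper: print the download URL for a DSS version.
# Assumption: every version follows the pattern of the 8.0.1 link in the listing.
dss_tarball_url() {
    version="$1"
    echo "https://downloads.dataiku.com/public/studio/${version}/dataiku-dss-${version}.tar.gz"
}

# Hypothetical helper: set the SELINUX= line in an SELinux config file.
# $1 = mode (for example: disabled), $2 = config file (defaults to /etc/selinux/config).
set_selinux_mode() {
    mode="$1"
    config="${2:-/etc/selinux/config}"
    tmp="$(mktemp)" || return 1
    # Only the line starting with SELINUX= is rewritten;
    # SELINUXTYPE= and commented lines are left untouched.
    if sed "s/^SELINUX=.*/SELINUX=${mode}/" "$config" > "$tmp"; then
        cat "$tmp" > "$config"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

On the instance this would be used roughly as `wget "$(dss_tarball_url 8.0.2)"`, and `set_selinux_mode disabled` run as root, followed by the reboot. Trying `set_selinux_mode` on a copy of the config file first is a cheap way to confirm it does what you expect.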

    Cheers,

    I.

Answers

  • Ignacio_Toledo

    Hi @tgb417! I've an instance configured in EC2 with DSS 8.0.2 running, but I didn't use the AMI image by Dataiku.

    I believe you are right, the tutorials and documentation assume you have a minimal familiarity with AWS. I didn't become an expert, but I learned my way through the IAM roles, VPC, S3 and EC2, enough to have a working design node to be used only by myself (the road to make the instance secure and only available to the dataiku users is something else).

    I would love to help create a guide or tutorial for people who are also new to the cloud, but I'm not sure I'll have much time to commit to this effort, so I'll look through my notes to see whether I went through a tutorial that I could share with you.

    Have a nice weekend!

    Ignacio

  • tgb417

    @Ignacio_Toledo thanks. Any bits you can share will help.

    As I work through this I'll post notes that I find useful.

  • tgb417

    We were working with the AMI. We got things up and running; however, we found that the AMI was old:

    • Centos V7 rather than V8
      • using yum rather than dnf
    • Python V2 rather than V3
    • older gcc compiler
    • An older EC2 instance generation

    When I tried to apply patches to bring things up to date, I ended up with problems.

    The Dataiku installer and update scripts think that we are still using CentOS 7 with yum, and they are not working as expected. I'm having problems with the gcc compiler and therefore with the Python 3 installation. I have not even looked at what is going on with R.

    I'm just about to throw out this AWS virtual machine and start over, maybe from a clean machine without the AMI.

    @Ignacio_Toledo what Linux OS and version did you end up using for your AWS install?

  • tgb417

    @Ignacio_Toledo

    Here is a thread that runs through the issues I worked through to get SSL set up on my in-house instance at LSC.

    https://community.dataiku.com/t5/Setup-Configuration/Going-from-Prototype-to-Production-SSL-amp-HTTPS/m-p/4599

    Glad to do what I can to help.

  • Ignacio_Toledo

    Aah, by the way, I'm using an m5a.2xlarge instance. It is not under the Free Tier, but it was not awfully expensive either... provided you remember to stop the instance when you are not working.

  • tgb417

    @Ignacio_Toledo

    We are currently working with an m3.large, as recommended in the AMI setup.

    However, that looks to be a previous generation of instance.

    It looks like the m5a.2xlarge you are using has the following specs, based on an AMD processor:

    Instance Size | vCPU | Memory (GiB) | Instance Storage (GiB) | Network Bandwidth (Gbps) | EBS Bandwidth (Mbps)
    m5a.2xlarge   | 8    | 32           | EBS-Only               | Up to 10                 | Up to 2,880

    And the price looks about the same as the legacy instance, for a lot more resources.

  • tgb417

    @Ignacio_Toledo

    Thanks for enumerating your steps. That is very helpful.

    I see that you chose to use CentOS 7. CentOS 8 is now out. Is there a reason you chose this version of CentOS? (According to the DSS documentation, both versions are supported.) However, it appears that CentOS 7 uses the yum package manager and CentOS 8 uses dnf.
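A minimal sketch of how a setup script could pick the package manager from the OS version instead of hard-coding yum. It reads `VERSION_ID` from an os-release style file and maps CentOS 7 to yum and CentOS 8 to dnf, as described above. The function names `os_major_version` and `pkg_manager_for_major` are hypothetical, and other distributions or versions are deliberately reported as `unknown`.

```shell
# Hypothetical helper: print the major version found in an os-release style file.
# $1 = file to read (defaults to /etc/os-release).
os_major_version() {
    file="${1:-/etc/os-release}"
    # Handles VERSION_ID="7" as well as VERSION_ID="8.2" or an unquoted value.
    sed -n 's/^VERSION_ID="\{0,1\}\([0-9][0-9]*\).*/\1/p' "$file"
}

# Hypothetical helper: map a CentOS major version to its package manager.
pkg_manager_for_major() {
    case "$1" in
        7) echo yum ;;
        8) echo dnf ;;
        *) echo unknown ;;
    esac
}
```

A script could then call `pkg_manager_for_major "$(os_major_version)"` and stop early when the answer is `unknown`, rather than running yum commands on a system that expects dnf.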

    The m3.large has 32 GB of disk attached to the box, but only 2 processors and 16 GB of RAM.

    It looks like the m5a.2xlarge has an AMD CPU and only uses Amazon Elastic Block Store (EBS). How much space have you provisioned, and how does it run and what does it cost?

  • Ignacio_Toledo

    @tgb417

    I chose CentOS 7 because at the time I couldn't find a Free Tier CentOS 8 AMI in the AWS Marketplace. I didn't go with a community version because I didn't have the time to check their tweaks and characteristics; I wanted the most barebones distro AMI I could find. But now there is a nice selection of CentOS 8 images in the Marketplace, as you say. I think that would mostly affect the python3 installation steps (I don't think you'll need them) and the Development Tools installation (not sure if some changes are needed).

    Regarding storage, I'm using a 64 GB EBS volume, but you only pay for what you use (it is one dollar a month per 10 GB). Why such a small filesystem? Because using S3 for the data storage is much more affordable.

    I don't use the instance too often, as it is mainly for testing AWS integration. But in one month I used the instance for 70 hours and around 43 GB of EBS space, and it cost $25 USD (without taxes).
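A rough sketch of the arithmetic behind a bill like this, for anyone who wants to plug in their own numbers. The EBS rate of 0.10 USD per GB-month is the "one dollar a month per 10 GB" figure from the post. The hourly instance rate used in the example call is only a placeholder, not a quoted AWS price, so check current pricing for your region. `estimate_monthly_cost` is a hypothetical helper name.

```shell
# Hypothetical helper: estimate a monthly bill from usage.
# $1 = instance hours used, $2 = instance price per hour (USD),
# $3 = EBS storage in GB,   $4 = EBS price per GB-month (USD).
estimate_monthly_cost() {
    hours="$1"
    hourly_rate="$2"
    ebs_gb="$3"
    ebs_rate="$4"
    awk -v h="$hours" -v r="$hourly_rate" -v g="$ebs_gb" -v e="$ebs_rate" \
        'BEGIN { printf "%.2f\n", h * r + g * e }'
}

# 70 hours and 43 GB as in the post; 0.30 USD/hour is a placeholder rate.
estimate_monthly_cost 70 0.30 43 0.10
```

Data transfer, S3 storage and taxes are left out, so this only covers the two line items mentioned in the post.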

  • JaredP
    JaredP Registered Posts: 2 ✭✭✭

    Just a quick note. I had terrible luck using CentOS for almost everything related to Data Science work. I've had no issues getting Dataiku working on Ubuntu (version 20.x)

    Thanks for everyone replying in this thread. It's helpful!

  • tgb417

    @JaredP,

    Thanks for the feedback; that is helpful. I will take this under advisement. I was trying to take a shortcut by using the AWS EC2 AMI, and CentOS 7 is what came with that image.

    The last DSS instance I installed was on Ubuntu 18.04 LTS, and I was able to get things working.

    Did have a bit of a struggle with reverse proxy. However, I think that was more of an nginx <-> DSS set of challenges.

  • JaredP

    On MS Azure I use a trigger to automate turning off my VM at 10pm every night, just in case my own neural net forgets.

    I suspect AWS would have something similar.
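On AWS, one simple way to get a similar nightly shutdown is a cron job that calls the AWS CLI's `aws ec2 stop-instances`. The sketch below assumes the AWS CLI is installed and configured with credentials that are allowed to stop the instance. The instance ID, region and script path are placeholders, and `stop_dss_instance` is a hypothetical helper name. Setting `DRY_RUN=1` only prints the command, which is handy for checking the arguments first.

```shell
# Hypothetical helper: stop one EC2 instance, or just print the command.
# $1 = instance ID (placeholder in the examples), $2 = AWS region.
stop_dss_instance() {
    instance_id="$1"
    region="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Dry run: show what would be executed without calling AWS.
        echo "aws ec2 stop-instances --instance-ids $instance_id --region $region"
    else
        aws ec2 stop-instances --instance-ids "$instance_id" --region "$region"
    fi
}

# Example crontab entry (22:00 every day, in the machine's time zone),
# assuming the function and a call to it are saved in /home/dataiku/stop_dss.sh:
# 0 22 * * * /home/dataiku/stop_dss.sh
```

Running the job from a machine other than the DSS instance avoids depending on the instance's own clock and credentials, but either placement should work as long as the credentials allow stopping it.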

  • tgb417

    @JaredP

    That's cool. Nice way to save a few bucks. For folks on MS Azure, do you want to create a new Setup & Configuration post describing what you are doing? I have a feeling that others might find this useful.

    --Tom

  • Ignacio_Toledo

    Indeed: that functionality can be achieved by means of "Launch templates" at EC2 console.

    Cheers!

  • tgb417

    So far, we have ended up going with a clean install of Red Hat on an Amazon AWS m5a.xlarge. With the help of @NedM and @AurelienVetil, we have made a bunch of forward progress with the DSS 8.0.2 install. We have a reverse proxy set up with a certificate. We have a PostgreSQL server running. I have created my first few models in the new environment, and that seems to be working OK: faster than my very old laptop, but not screaming fast. We still have to work out the R installation and have not gotten the image downloads set up. So things are moving forward. Thanks, everyone, for your support.

    cc: @Ignacio_Toledo

  • Ignacio_Toledo

    Great @tgb417! Are you taking some notes of the process to share later?
