So, I've been using DSS on macOS on local hardware for a while now, and I'm on the latest version, DSS 8.0.2 (as of this writing).
I now have a new project I'm working on and want to bring up a cloud instance, because I have a bunch of remote team members and we want to keep a consistent shared data flow.
It is suggested that I use an AMI to get things set up quickly, and this looks like a simple way to go. However, the setup pulldowns seem to suggest that v7.0.3 is the latest version that can be installed (local installs are up to at least version 8.0.2), and there are definitely some features I want to use that are V8 features. Am I missing a more recent AMI on AWS somewhere?
Also, the image actually seems not to work; the Launch button is not available. Do you need a VPC to make this instance work? Is there any recommended documentation about best practices for setting up a VPC for DSS use?
I guess what I'm looking for is some detailed recent walkthrough documentation about the installation of 8.0.2 in a cloud environment.
With @AurelienVetil's help, we found this documentation.
However, there seem to be a lot of assumptions about my knowledge of EC2 and maybe VPC details that are not covered in depth. Can anyone point me to best practices for getting a cloud configuration done quickly, and in a way that can grow over time?
Thanks in advance for any help you can share.
Hi @tgb417! I have an instance configured in EC2 with DSS 8.0.2 running, but I didn't use the AMI image provided by Dataiku.
I believe you are right: the tutorials and documentation assume a minimal familiarity with AWS. I didn't become an expert, but I learned my way through IAM roles, VPC, S3, and EC2, enough to have a working design node used only by myself (making the instance secure and available only to the Dataiku users is another road entirely).
I would love to help create a guide or tutorial for people who are also new to the cloud, but I'm not sure I'll have much time to commit to the effort. I'll try to find in my notes whether I went through a tutorial that I could share with you.
Have a nice weekend!
We were working with the AMI and got things up and running; however, we found that the AMI was old.
When I tried to apply patches to bring things up to date, I ended up with problems.
The Dataiku installer and update scripts think that we are still using CentOS 7 with yum, and they are not working as expected. I'm having problems with the gcc compiler and therefore with the Python 3 installation. I have not even looked at what is going on with R.
I'm just about to throw out this AWS virtual machine and start over, maybe from a clean machine, without the AMI.
@Ignacio_Toledo what Linux OS and version did you end up using for your AWS install?
Here is a thread that runs through the issues I worked through to get SSL set up on my in-house instance at LSC.
Glad to do what I can to help.
Thanks for the tips about SSL; I will share them with our IT team (as they will now be supporting the DSS deployment).
About the EC2 instance, I used this AMI as a base: https://aws.amazon.com/marketplace/pp/Centosorg-CentOS-7-x8664-with-Updates-HVM/B00O7WM7QW
It is a Free Tier CentOS 7 distro, and I had no problems installing DSS 8.0.1. I did the installation to work with Glue, Athena, S3, and EKS, following the guide at https://doc.dataiku.com/dss/latest/cloud/aws/reference-architectures/eks-glue-athena.html. To keep it simple, these are more or less the steps I followed after creating the EC2 instance from that AMI (change the version where appropriate for DSS 8.0.2):
## First steps, as 'centos' user (default account, sudoer)
> sudo yum update
> sudo yum install -y yum-utils
## now is a good moment to reboot, and then ssh in again
## OPTIONAL (but recommended) If you want to create S3 connections, or to use other AWS services
> curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
> sudo yum install unzip
> unzip awscliv2.zip
> sudo ./aws/install
## Add 'dataiku' user
> sudo adduser dataiku
> sudo passwd dataiku
> sudo usermod -a -G wheel dataiku
## Now, work as dataiku user:
> su - dataiku
# enter password
> sudo yum install wget
> wget https://downloads.dataiku.com/public/studio/8.0.1/dataiku-dss-8.0.1.tar.gz
> sudo yum install python3 python3-devel
> aws configure ## Configure with your aws credentials
> tar zxvf dataiku-dss-8.0.1.tar.gz
# disable SELinux
> sudo vi /etc/selinux/config
# Change to SELINUX=disabled, then reboot for the change to take effect
# ssh back in as the centos user, then `su - dataiku`
## install dss dependencies
> sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-deps.sh"
# before the next step, upload your license into a file called `license.json`
> dataiku-dss-8.0.1/installer.sh -d /home/dataiku/dss_data -l license.json -p 10000 -P /usr/bin/python3.6
# run the next step to have dataiku starting automatically after each reboot
> sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-boot.sh" "/home/dataiku/dss_data" dataiku
# to install the graphics-export
> sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-deps.sh" -without-java -without-python -with-chrome
> dss_data/bin/dssadmin install-graphics-export
> sudo sysctl user.max_user_namespaces=1000
# to install R and the gcc tools (not sure if the R integration was fully successful)
> sudo -i "/home/dataiku/dataiku-dss-8.0.1/scripts/install/install-deps.sh" -without-java -without-python -with-r
> sudo yum group install "Development Tools"
> sudo yum install centos-release-scl
> scl enable devtoolset-8 bash
> dss_data/bin/dssadmin install-R-integration
# Now, install the jdbc libraries you might want to use
# finally, start dss
> sudo systemctl start dataiku
And that was more or less it. At least, that's what I have in my notes 🙂
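As a small aside on the SELinux step above: instead of editing the file by hand in vi, the change can be scripted. This is just a sketch, demonstrated on a temporary copy of the file so nothing system-wide is touched; on the real instance the target would be /etc/selinux/config (edited with sudo, followed by a reboot):

```shell
# Flip SELINUX=... to disabled non-interactively instead of using vi.
# Shown on a temp copy; on the instance the target is /etc/selinux/config
# (run the sed with sudo, then reboot for it to take effect).
cfg=$(mktemp)
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > "$cfg"
sed -i 's/^SELINUX=.*/SELINUX=disabled/' "$cfg"
grep '^SELINUX=' "$cfg"   # -> SELINUX=disabled
```

Scripting it this way also makes the step repeatable if you rebuild the instance from scratch.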
Thanks for enumerating your steps. That is very helpful.
I see that you chose to use CentOS 7. CentOS 8 is now out. Is there a reason you chose this version of CentOS? (According to the DSS documentation, both versions are supported.) Note that CentOS 7 uses the yum package manager while CentOS 8 uses dnf.
The m3.large has 32 GB of instance storage connected to the box, but only 2 vCPUs and 7.5 GB of RAM.
It looks like the m5a.2xlarge uses an AMD CPU and only Amazon Elastic Block Store (EBS) storage. How much space have you provisioned, and how did it seem to run and cost?
I chose CentOS 7 because, at the time, I couldn't find a Free Tier CentOS 8 AMI in the AWS Marketplace. I didn't go with a community version because I didn't have the time to check their tweaks and characteristics; I wanted the most barebones distro AMI I could find. But now there is a nice selection of CentOS 8 in the Marketplace, as you say. I think that would mostly affect the python3 installation steps (I don't think you'll need them) and the Development Tools installation (not sure if some changes are needed).
Regarding storage, I'm using a 64 GB EBS volume, but you only pay for what you use (about one dollar for a month of 10 GB). Why such a small filesystem? Because using S3 for data storage is much more affordable.
I don't use the instance too often, as it is mainly for testing AWS integration. But in one month I used the instance for 70 hours and around 43 GB of EBS space, and it cost $25 USD (before taxes).
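Those numbers line up with a back-of-the-envelope estimate. A sketch, assuming roughly $0.34/hour on-demand for m5a.2xlarge and $0.10/GB-month for gp2 EBS (both rates are region- and time-dependent placeholders, not quoted prices):

```shell
# Rough monthly cost: instance-hours plus provisioned EBS space.
# Both rates below are assumptions (region- and time-dependent).
hours=70; rate_hr=0.34          # m5a.2xlarge on-demand, approx.
ebs_gb=43; rate_gb_month=0.10   # gp2 EBS, approx.
awk -v h="$hours" -v r="$rate_hr" -v g="$ebs_gb" -v e="$rate_gb_month" \
    'BEGIN { printf "~$%.2f/month\n", h*r + g*e }'
# -> ~$28.10/month, in the same ballpark as the $25 figure
```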
Just a quick note: I had terrible luck using CentOS for almost everything related to data science work. I've had no issues getting Dataiku working on Ubuntu (version 20.x).
Thanks for everyone replying in this thread. It's helpful!
Thanks for the feedback; that is helpful, and I will take it under advisement. I was trying to take a shortcut by using the AWS EC2 AMI, and CentOS 7 is what came with that image.
The last DSS instance I installed was on Ubuntu 18.04 LTS, and I was able to get things working.
I did have a bit of a struggle with the reverse proxy; however, I think that was more of an nginx <-> DSS set of challenges.
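For what it's worth, the usual nginx <-> DSS sticking point is WebSocket support. A minimal location block along these lines is typically what's needed (a sketch only, assuming DSS on port 10000 as in the install steps earlier in this thread; server name and TLS directives omitted, and Dataiku's reverse-proxy documentation has the exact recommended settings):

```nginx
location / {
    proxy_pass http://127.0.0.1:10000;
    proxy_http_version 1.1;
    # DSS uses WebSockets; without these headers parts of the UI break.
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}
```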
Ah, by the way, I'm using an m5a.2xlarge instance. It's not under the Free Tier, but it was not awfully expensive either... provided you remember to stop the instance when you are not working.
We are currently working with an m3.large, as recommended in the AMI setup.
However, that looks to be a previous generation of compute.
It looks like the m5a.2xlarge you are using, based on an AMD processor, has the following specs:
|Instance Size|vCPU|Memory (GiB)|Instance Storage|Network Bandwidth (Gbps)|EBS Bandwidth (Mbps)|
|---|---|---|---|---|---|
|m5a.2xlarge|8|32|EBS-Only|Up to 10|Up to 2,880|
And the price looks about the same as the legacy machine, for a lot more resources.
On MS Azure I use a trigger to automate turning off my VM at 10pm every night, just in case my own neural net forgets.
I suspect AWS would have something similar.
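AWS does: one simple option is a nightly cron job calling the AWS CLI from any machine with credentials (EventBridge scheduled rules can do the same without a cron host). A sketch, where the instance ID is a placeholder:

```shell
# crontab entry (add via `crontab -e`): stop the instance at 22:00 nightly.
# i-0123456789abcdef0 is a placeholder; the AWS CLI needs credentials
# with ec2:StopInstances permission for this instance.
0 22 * * * aws ec2 stop-instances --instance-ids i-0123456789abcdef0
```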
So far, we have ended up going with a clean install of Red Hat on an Amazon AWS m5a.xlarge. With the help of @NedM and @AurelienVetil, we have made a bunch of forward progress with the DSS 8.0.2 install. We have a reverse proxy set up with a certificate. We have a PostgreSQL server running. I have created my first few models in the new environment, and that seems to be working OK. Faster than my very old laptop, but not screaming fast. We still have to work out the R installation, and we have not gotten the image exports set up. So things are moving forward. Thanks, everyone, for your support.