class: center, middle, inverse, title-slide

# Using R in the Cloud
## A light fluffy talk
### Colin Gillespie

---
layout: true
<div class="jr-header-inverse">
<span class="social"><table><tr><td>Colin Gillespie</td></tr></table></span>
</div>
<div class="jr-footer-inverse"><span>Using R in the cloud</span></div>

---
layout: true

---
class: inverse
background-image: url(graphics/itsalive.jpg)

# Warning

This talk contains __(a)live__ demos

--

This is obviously a silly thing to do...

---
layout: true
<div class="jr-header">
<span class="social"><table><tr><td>Colin Gillespie</td></tr></table></span>
</div>
<div class="jr-footer"><span>Using R in the cloud</span></div>

---

# What is cloud computing?

> Cloud computing is a model for enabling ubiquitous, convenient, on-demand network
> access to a shared pool of configurable computing resources
> that can be rapidly provisioned and released
> with minimal management effort or service provider interaction.

NIST definition of cloud computing

---

# Cloud computing is a model for...

* Ubiquitous
* Convenient
* On-demand
* Configurable computing resources

--

that can be used with _minimal effort_

---

# Potential use cases

* Run parallel MCMC chains
  - Current project uses 200 cores

--

* Teaching
  - Run RStudio in the cloud; avoid package issues

--

* Big data
  - Keep your data in one place

---
background-image: url(assets/Rlogo_fade.png)

# Why use R in the cloud

* R is constrained by the available RAM
  - The cloud gives us access to more
* Process the data in one place
* Portable & accessible
* Scalable
  - Number of cores, RAM, & storage

--

* Potentially save money

---
background-image: url(assets/Rlogo_fade.png)

# Why __not__ use R in the cloud

* Time: set-up cost
* Privacy: where is the data stored?
* Need (or lack of it)
* Overall, it can cost more

---
background-image: url(graphics/clouds.jpg)

# Cloud providers

* Microsoft Azure
* Google Cloud
* Amazon AWS
  - Accounts for 8% of Amazon's revenue!
* Other bespoke providers, e.g.
  location, security

--

* To make this talk concrete, we'll use [AWS](https://aws.amazon.com/)
  - Other providers are _similar_ in terms of cost, speed, etc.

<img src="graphics/amazon_web_svcs.png" width="40%" style="display: block; margin: auto;" />

---

# On-demand cost (AWS)

Cost (per hour) depends on

* Number of CPUs (up to 128 cores)
* Amount of RAM (from 0.5GB to 1,952GB)
* Location: US, Europe, Asia
* Storage (amount & type)
* GPUs
* OS (Linux & Windows)

---

# On-demand cost (AWS)

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> Instance Name </th>
   <th style="text-align:right;"> #CPUs </th>
   <th style="text-align:right;"> RAM (GB) </th>
   <th style="text-align:right;"> Cost (£ per hour) </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> t2.micro </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 1 </td>
   <td style="text-align:right;"> 0.009 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> t2.2xlarge </td>
   <td style="text-align:right;"> 8 </td>
   <td style="text-align:right;"> 32 </td>
   <td style="text-align:right;"> 0.290 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> c4.8xlarge </td>
   <td style="text-align:right;"> 36 </td>
   <td style="text-align:right;"> 60 </td>
   <td style="text-align:right;"> 1.200 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> r3.8xlarge </td>
   <td style="text-align:right;"> 32 </td>
   <td style="text-align:right;"> 244 </td>
   <td style="text-align:right;"> 2.000 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> x1.32xlarge </td>
   <td style="text-align:right;"> 128 </td>
   <td style="text-align:right;"> 1952 </td>
   <td style="text-align:right;"> 10.000 </td>
  </tr>
</tbody>
</table>

> Renting a machine equivalent to a £1000 laptop costs around 30p per hour

--

> Renting a machine equivalent to our desktops costs £600 per year

---

# Other pricing options

* Up-front purchase
  - Pay for the instance for one to three years

--

* Spot instances
  - Bid for the instance
  - Not guaranteed to get one
  - Six hours only
  - Typically save around 80%
  - 
    Interesting challenge developing statistical algorithms for this set-up

---

# Launching an instance (demo)

1. [Log in](https://aws.amazon.com/) to Amazon Web Services (AWS)
1. Click on _EC2_ & _Launch Instance_
1. Choose OS
   - Default user: `ec2-user` on Amazon Linux, `ubuntu` on Ubuntu
1. Choose Instance Type
1. Create a security key
   - Use it to log on
1. Click on Launch Instance
1. Log into instance

```
# Example: connect to an Ubuntu instance
ssh ubuntu@ec2-XXX-XXX-XXX-XXX.eu-west-2.compute.amazonaws.com
# Then, on the instance, install R
sudo apt-get update
sudo apt-get install r-base-core
```

---

# Preconfigured instance types

RStudio Server Amazon Machine Image ([AMI](http://www.louisaslett.com/RStudio_AMI/))

* R, Julia & Python
* LaTeX
* GSL
* git, svn & Dropbox
* RStudio & Shiny Server
* ...

--

[Demo](http://www.louisaslett.com/RStudio_AMI/)

```
ssh ubuntu@ec2-XXX-XXX-XXX-XXX.eu-west-2.compute.amazonaws.com
```

---

# Launching from the command line

You can launch `ec2` instances from the R command line

- The `aws.ec2` package is part of the [cloudyr](https://github.com/cloudyr) project

--

```r
library("aws.ec2")
# Launch a single instance in the first available subnet and security group
run_instances(image = "ami-b1b0c3c2", # image (AMI) id
              type = "t2.micro",
              subnet = describe_subnets()[[1]],
              sgroup = describe_sgroups()[[1]])
# List your instances
l = describe_instances()
```

---

# Storage on the Cloud

* You can store things on your ec2 instance, but this isn't backed up

--

* Simple Storage Service (S3) (__demo__)
  - ~50GB for £1 per month
  - Put data there and access it as needed

--

* Glacier storage
  - ~200GB for £1 per month
  - Time delay in retrieving your data

---

# Access from the command line

```r
library("aws.s3")
# List the contents of a bucket
(bucket_contents = get_bucket(bucket = "bayesianexpdesign"))
# Bucket: bayesianexpdesign
#
# $Contents
# Key:            ip-172-31-1-134.RDS
# LastModified:   2017-06-19T15:37:36.000Z
# ETag:           "dd64ae9954a26a3e0e514ca2499b0dfb"
# Size (B):       141693
# Owner:          colin_gill
# Storage class:  STANDARD
#
# $Contents
# Key:            ip-172-31-14-254.RDS
# LastModified:   2017-06-20T15:10:31.000Z
# ETag:           "10d6cc6ad86e12610555d7810ea6929f"
# Size (B):       127903
# Owner:          colin_gill
# Storage class:
#                 STANDARD
```

---

# Notifications

* We want to know when our simulation has finished
* Also useful for desktop jobs

---

# `beepr`

* Plays a noise when your job has finished

```r
beepr::beep()
```

* Handy for short desktop jobs, not great in the cloud

---

# Twitter: the rtweet package

__Twitter__: the communication tool of the US president

--

> Thanks - many are saying I'm the best 140 character writer in the world. It's easy when it's fun.
>
> _Donald Trump_

--

> It works because you only know tiny words.
>
> _Some random chap on Twitter_

--

* Back to business
* Set up a dummy Twitter account and post

```r
library("rtweet")
post_tweet("Simulation finished")
```

---

# Pushbullet

Provides an easy interface for pushing notifications to your phone & browser

```r
library("RPushbullet")
pbPost("note", "Talk", body = "The end is near")
```

---

# Summary

* The cloud allows statisticians (or data scientists) to access resources on demand
* At a low cost
* Involves a new way of working

> Data Science is statistics on a Mac.
>
> @bigdataborat
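
---

# Back-of-the-envelope costs

As a rough sketch of the "low cost" claim, the hourly prices from the on-demand table can be turned into yearly figures. The prices are the ones quoted earlier in the talk; the usage pattern (8 hours a day for 230 working days) and the names `hourly_cost` and `yearly_cost` are assumptions made purely for illustration.

```r
# Hourly on-demand prices (£) taken from the cost table earlier in the talk
hourly_cost = c(t2.micro = 0.009, t2.2xlarge = 0.29, c4.8xlarge = 1.2,
                r3.8xlarge = 2, x1.32xlarge = 10)

# Hypothetical helper: yearly cost for a given number of hours in use
yearly_cost = function(cost_per_hour, hours_per_year) {
  cost_per_hour * hours_per_year
}

# Instance never switched off
always_on = yearly_cost(hourly_cost, 24 * 365)
# Assumed usage: 8 hours a day, 230 working days a year
office_hours = yearly_cost(hourly_cost, 8 * 230)

# Side-by-side comparison, rounded to the nearest pound
round(cbind(always_on, office_hours))
```

Under these assumptions, switching the instance off outside working hours cuts the yearly bill by roughly a factor of five, which is where most of the potential saving comes from.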