Machine Learning and Data Science in R on Microsoft ML and SQL Servers with Rafal Lukawiecki
A very intensive, hands-on, 5-day course designed for those who want to learn more in-depth machine learning and data science using R. As you study the free, open source R, you will also learn how to easily make R incredibly fast, scalable, and enterprise-ready with Microsoft ML Server and SQL Server ML Services. In addition to open source R, they are the key environments, together with RStudio, that we will focus on during this course.
You will learn:
- Building and deploying machine learning models using the open source R programming language, including data preparation, visualisation, and stringent model validation.
- High-performance ML using the newest version of Microsoft ML Server and SQL Server 2019 with R and RStudio.
- Deployment to production with nanosecond-scale performance.
- Successful data science project formulation and delivery.
Target audience and prerequisites
This course is intended for analysts, budding and current data scientists, BI developers, programmers, power users, predictive modellers, forecasters, consultants, data engineers, AI engineers, and anyone interested in using ML for AI.
You need a general ability to work with data in any form: using spreadsheets, tables, or databases. Prior knowledge of any programming language is helpful. However, if you are prepared to work harder by asking Rafal questions and doing a little additional homework during the week, you can use this course to learn R as your very first programming language.
This course will teach you machine learning and data science using R and Microsoft technologies: you do not need to know them before attending.
About Rafal Lukawiecki
As Data Scientist at Project Botticelli Ltd, Rafal focuses on making advanced analytics and artificial intelligence easy and useful for his clients.
He can help you find valuable, meaningful patterns and statistically valid correlations using data mining and machine learning in data sets both big and small. Rafal is also known for his work in business intelligence, data protection, enterprise architecture, and solution delivery. While the majority of his clients come from consumer and corporate finance, entertainment, healthcare, IT, retail, and the public sector, Rafal has worked in almost all industries.
He has been a popular speaker at major IT conferences since 1998, and he had the honour of sharing keynote platforms with Bill Gates and Neil Armstrong. A natural educator, he explains complex concepts in simple terms in his enjoyably energetic style.
Above all, this course will teach you modern R: currently, the most powerful language explicitly designed for advanced analytics, statistical learning, data science, and, of course, cutting-edge general-purpose machine learning. While Python is more popular as a universal programming language, and also widely used for image and text analysis using deep learning, R is a clear leader in data science. Specifically, you will learn how to do machine learning in R because R is very well suited to advanced machine learning, especially on the classical data sets that you often encounter in common business use. Even though such data might come from a data lake, typically you will find plenty of it in a data warehouse or a relational database, or you can acquire it from files generated by transactional business applications, or from devices such as healthcare equipment, point-of-sale devices, or manufacturing and transportation machinery. R is also great for exploratory analysis of data, and it can help you draw meaningful conclusions from real-world experiments, such as A/B marketing tests or product trials. This course will teach you the foundations of hypothesis testing so that you can draw such conclusions with a high degree of confidence.
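As an illustration of the kind of hypothesis test mentioned above, here is a minimal sketch of analysing an A/B marketing test in base R. The conversion counts are hypothetical and chosen only for illustration, and the choice of `prop.test` is an assumption rather than a statement of what the course uses.

```r
# Hypothetical A/B test results (illustrative numbers, not real data).
conversions <- c(A = 120, B = 150)    # visitors who converted
visitors    <- c(A = 1000, B = 1000)  # visitors shown each variant

# Two-sample test of equal proportions from the base R stats package.
# Null hypothesis: both variants have the same conversion rate.
result <- prop.test(x = conversions, n = visitors)

# p-value: how likely a difference at least this large would be
# if the null hypothesis were true.
print(result$p.value)

# 95% confidence interval for the difference in conversion rates (A minus B).
print(result$conf.int)
```

A small p-value would suggest that the difference between the variants is unlikely to be due to chance alone, while the confidence interval indicates how large that difference plausibly is.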
Microsoft Machine Learning Server and Microsoft SQL Server 2019/2017 Machine Learning Services support both R and Python in a number of proprietary, high-performance, scalable, enterprise-ready, easy-to-use packages and libraries, notably RevoScale and MicrosoftML. You will learn how to use them during this course. You will also learn how to do almost everything using the most popular algorithms provided by open source R packages and functions, such as rpart, kmeansruns, fpc, cluster, clusplot, ts, xts, e1071, caret, glm, and for extra help rattle, qdapTools, MLmetrics, and miscTools.
You will learn how to prepare and visualise data both by using open source packages, mainly dplyr and ggplot2, and other parts of the tidyverse meta-package, like readr, readxl, and lubridate, and how to do it more directly in SQL Server, benefiting from its performance and scalability. We will even combine the power of R with Power BI to create informative visualisations that are otherwise impossible to do in Power BI alone.
[Figure: Grouped notched boxplot in R]
While learning about the data science process and hypothesis testing, you will discover that some complex business questions can be answered using simpler, statistical techniques, such as tests of significant differences between sets of data, or visualisations like notched box plots. We will refresh your knowledge of rudimentary statistical concepts that are necessary for machine learning and data science, like knowing the difference between ordinal, interval, and ratio data, and thus why it does not make sense to calculate a mean star rating, while a median is possible. A little time has been allocated for the discussion of p-values, confidence intervals, and the differences between Bayesian and frequentist interpretations of your results. Bear in mind that this is not a course about statistics, but a little working knowledge is a must in our industry, and it will make the rest of the course easier to follow.
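The point about star ratings can be sketched in a few lines of base R. The ratings below are hypothetical, and the variable names are illustrative only.

```r
# Hypothetical star ratings (1 to 5) from a product survey; illustrative data only.
ratings <- c(5, 5, 4, 1, 1, 5, 3, 5, 2)

# Stars are ordinal: their order is meaningful, but the gap between 1 and 2 stars
# need not equal the gap between 4 and 5 stars.
stars <- factor(ratings, levels = 1:5, ordered = TRUE)

# The median relies only on ordering, so it is a defensible summary.
median_star <- median(ratings)
print(median_star)

# mean(ratings) would run without error, but it silently treats the stars
# as interval data, which is the mistake described above.

# A frequency table is often the most honest summary of ordinal data.
print(table(stars))
```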
Early in the course, you will learn all the fundamentals of machine learning—no prior knowledge is necessary. You will study: data preparation and relevant structures, algorithm classes and their applications, model evaluation and validation, including all the common performance metrics such as precision and recall. At the heart of this course, however, you will gain an intimate understanding of how some of the most important algorithms work and how to prepare data to make the algorithms give you the most they can.
[Figure: Visualising clustering quality with clusplot]
Starting with clustering, you will learn about k-means, k-medians, spherical k-means, and expectation-maximisation. You will find out how to prepare non-numerical and even some numerical data for these algorithms using popular R functions such as mtabulate. Other than using clustering for segmentation, we will also study its use for anomaly detection. We will expand on that subject using other, specialised techniques, such as One-Class SVM and PCA-Based Anomaly Detection, permitting you to predict anomalies, such as fraud.
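To give a flavour of k-means, here is a minimal sketch using only base R and the built-in `iris` data set. The choice of data set, the number of clusters, and the `nstart` value are assumptions made for illustration, not the course's own lab.

```r
# Fix the random seed so the random starting centres are reproducible.
set.seed(42)

# k-means is distance-based, so standardise the numeric columns first.
features <- scale(iris[, 1:4])

# Fit 3 clusters; nstart = 25 tries 25 random starts and keeps the best one.
fit <- kmeans(features, centers = 3, nstart = 25)

# Compare the discovered clusters with the known species labels.
print(table(cluster = fit$cluster, species = iris$Species))

# Total within-cluster sum of squares: lower means tighter clusters.
print(fit$tot.withinss)
```

In a real segmentation task there are no known labels to compare against, so measures such as the within-cluster sum of squares become the main guide to clustering quality.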
[Figure: Decision tree in R on ML Server]
We dedicate a full day to building classifiers. You will understand the differences between the most important decision tree algorithms: plain trees, forests, and boosted trees, and you will study both simpler and more complex neural networks, and how they relate to regressions. We will also cover the widely used logistic regression algorithm, which, despite its name, is a classifier. Later in the course you will meet the large family of regression techniques, starting with classic linear regression, through GLM, the generalised linear model, to non-linear ML regressions. We will also have some time to cover the remaining big applications of machine and statistical learning, notably forecasting with time series, and, briefly, recommendation engines.
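The remark that logistic regression is a classifier can be illustrated with a minimal sketch in base R. It uses the built-in `mtcars` data set and a 0.5 probability threshold, both chosen here purely for illustration.

```r
# mtcars ships with R; am is the transmission type (0 = automatic, 1 = manual).
# Fit a logistic regression (a binomial GLM) predicting am from car weight.
model <- glm(am ~ wt, data = mtcars, family = binomial)

# The model outputs probabilities of class 1 (manual)...
prob <- predict(model, type = "response")

# ...which become class predictions once a threshold is applied.
predicted <- ifelse(prob > 0.5, 1, 0)

# Share of cars classified correctly. This is measured on the training data,
# so it overstates how well the model would do on unseen cars.
print(mean(predicted == mtcars$am))
```

The same `glm` function fits other generalised linear models when a different `family` is supplied.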
When deploying models to production, the benefits of using ML Server and SQL ML Services will impress. After seeing how to do it using open source R, we will culminate with an extremely fast in-database deployment using the T-SQL PREDICT statement, and the related real-time sp_rxPredict, which returns predictions on a nanosecond scale! You will also see how to deploy your models using web services, interacting via Azure if needed. Please note, however, that this course does not focus on Azure ML, even though we will briefly discuss how to combine those technologies (please also see our other course by Rafal that focuses on Azure ML).
Every day we will work using RStudio, the most popular, and free, R IDE, which is recommended by Microsoft for building R applications on top of SQL and ML Servers. All of our work will follow the modern principles of reproducible research: you will learn how to set up notebooks, manage packages and their dependencies, including versioning using snapshots, how to save your work, how to manage change using Git, and how to collaborate. At the end of the course you will keep your own R notebook containing almost 1000 lines of code and results! You are also welcome to keep all data sets that you use during the course labs and tutorials. You will notice that throughout the week you understand and write better and more advanced R, whilst experiencing, first-hand, many of its real-world applications.
Model validity is the most important aspect of any machine learning project. A lot of time has been dedicated to explaining it in detail, covering many validity metrics, such as precision, recall, AUC, F1 score, and accuracy (which is rarely a good metric), and the many charts we use to analyse models, especially: confusion matrix, lift/gain charts, ROC curve, precision-recall curve, profit and cost chart, calibration charts, scatter plots, and others used for regression evaluation, like histograms of residuals, QQ-Norm plot of residuals, scale-location, Cook’s distance, and many others. You will learn how to create those plots using R, and with the help of other tools. At the end of the course you will know when you can trust your models, and you will be able to explain your work to others, especially your project sponsors, who rarely are machine learning experts.
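Several of the metrics named above follow directly from a confusion matrix. The following is a minimal sketch in base R using hypothetical labels for ten cases; the data and variable names are illustrative only.

```r
# Hypothetical true and predicted labels for ten cases (1 = positive, 0 = negative).
actual    <- factor(c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0), levels = c(0, 1))
predicted <- factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0), levels = c(0, 1))

# Confusion matrix: rows are predictions, columns are the truth.
cm <- table(predicted = predicted, actual = actual)
print(cm)

tp <- cm["1", "1"]  # true positives
fp <- cm["1", "0"]  # false positives
fn <- cm["0", "1"]  # false negatives
tn <- cm["0", "0"]  # true negatives

precision <- tp / (tp + fp)  # of the predicted positives, how many were right
recall    <- tp / (tp + fn)  # of the actual positives, how many were found
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / sum(cm)

print(c(precision = precision, recall = recall, f1 = f1, accuracy = accuracy))
```

With these labels, precision, recall, and F1 are all 0.75 while accuracy is 0.8. On a heavily imbalanced data set, accuracy can look high even when few positives are found, which is one reason it is rarely a good metric on its own.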
Above all, this course will not only teach you the technology and how to use it, but, much more importantly, you will understand how ML works, how to avoid common mistakes, such as overfitting/overtraining, how to balance model accuracy against its reliability (the bias-variance trade-off), and how to relate key ML performance metrics to your business goals, making your bosses and clients happy with your progress and results. You will gain clarity on how to start your data science projects and how to finish them. You will know how to express the business need in terms of testable hypotheses, which will guide model building and selection. You will understand what types of work are suited to ML, and which are unlikely to deliver results. You will discover what makes good first projects in your own area of specialisation. These are the key benefits of studying machine learning with Rafal Lukawiecki: an industry veteran who has been practising ML, data mining, statistical learning, and data science with his customers for well over a decade, and who studied artificial intelligence at Imperial College in the 1990s under the guidance of the leaders and inventors of this area of industry and science.