Programming toolkit

This section gives an overview of the programming toolkit you will need for our course. Please read the instructions and complete the tasks listed in the yellow To do boxes.

Python

Python is an object-oriented language (an object is an entity that contains data along with associated metadata and/or functionality). One thing that distinguishes Python from many other programming languages is that it is interpreted rather than compiled. This means that code is executed line by line, which is particularly useful for data analysis and for creating interactive, executable documents like Jupyter Notebooks.

On top of this, there is a broad ecosystem of third-party tools and libraries that offer more specialized data science functionality.
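As a small illustration of what "object" means here, the following sketch inspects an ordinary Python string: it holds data (the characters), metadata (its type) and functionality (methods such as upper). The variable names are made up for this example.

```python
# A string is an object: it bundles data, metadata, and functionality.
course = "data science"

data = course                   # the data itself
kind = type(course).__name__    # metadata: the name of the object's type
shouted = course.upper()        # functionality: a method attached to the object

# Because Python is interpreted, each statement runs as soon as it is reached,
# so intermediate results can be inspected right away.
print(data, kind, shouted)
```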

Jupyter Notebook

Note

Jupyter Notebook is a web-based interactive computational environment for creating documents that contain code and text.

Jupyter Notebook is an open-source web application that allows you to create and share documents that contain code, equations, visualizations and narrative text:

  • A notebook is basically a list of cells

  • Cells contain either

    • explanatory text or

    • executable code and its output
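The "list of cells" idea can be made concrete with a simplified sketch of how a notebook file (.ipynb) is stored as JSON. It assumes version 4 of the notebook format; the cell contents are invented for illustration, and real notebooks usually carry more metadata.

```python
import json

# Minimal notebook in the version 4 layout: a list of cells plus metadata.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # A text cell, written in Markdown
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Analysis\n", "Some *explanatory* text."],
        },
        {
            # A code cell together with the output it produced
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["1 + 1"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "data": {"text/plain": ["2"]},
                    "metadata": {},
                }
            ],
        },
    ],
}

cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)

# The same structure serialized as text, as it would appear in an .ipynb file
as_text = json.dumps(notebook, indent=1)
```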

Colab

Note

Colab is a free Jupyter notebook environment that requires no setup, and runs entirely on the Cloud.

Colaboratory, or “Colab” for short, is a free-to-use product from Google Research. Colab allows anybody to write and execute Python code through the browser and is especially well suited to data analysis and machine learning.

Watch this video to get a first impression of Colab:

Let's start your first Colab notebook to get an overview of some basic features:

Markdown

Note

Markdown is a simple way to format text that looks great on any device.

Markdown is one of the world’s most popular markup languages and is widely used in data science. Jupyter Notebooks use Markdown to provide a unified authoring framework for data science, combining code, its results, and commentary in Markdown.

According to Wickham and Grolemund [2016], Markdown files are designed to be used in three ways:

  1. For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.

  2. For collaborating with other data scientists, who are interested in both your conclusions, and how you reached them (i.e. the code).

  3. As an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.

Review these sites to learn more about Markdown:

Libraries

Note

Python libraries are sets of useful functions that eliminate the need to write code from scratch.

A Python library is a reusable chunk of code that you can import into your own projects so you don’t have to write all the code yourself. There are more than 140,000 Python projects available, and one way to discover and install them is to use the Python Package Index (PyPI). Another way to install Python libraries is to use the open-source data science platform Anaconda, which is covered below.
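A minimal sketch of what importing a library looks like in practice. It uses the statistics module from Python's standard library so that it runs without installing anything; the sample values are made up.

```python
# Import a library once, then reuse its functions instead of writing them yourself.
import statistics

scores = [2.0, 3.5, 4.0, 4.5]   # made-up sample data

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

print(mean_score, median_score)
```

Third-party libraries such as the ones listed in this section are imported in the same way once they have been installed.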

Here is a list of some of the libraries we will use frequently:

  • pandas is a fast, powerful, flexible and easy-to-use open-source data analysis and manipulation tool. We will use pandas regularly in our course and you will find all relevant content in this introduction to pandas.

  • NumPy offers tools for scientific computing like mathematical functions and random number generators.

  • SciPy contains algorithms for scientific computing.

  • matplotlib is a library for creating data visualizations.

  • Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.

  • plotly is a graphing library to make interactive, publication-quality graphs.

  • statsmodels includes statistical models, hypothesis tests, and data exploration.

  • scikit-learn provides a toolkit for applying common machine learning algorithms to data.

  • TensorFlow is an end-to-end open source platform for machine learning.

Here are two curated lists of resources for practicing data science with Python. They include not only libraries but also links to tutorials, code snippets, blog posts and talks:

Anaconda

Note

Anaconda is a data science toolkit which already includes most of the libraries we need.

The open-source Anaconda Individual Edition (Distribution) is one of the easiest ways to do data science and machine learning with Python and R, since it already includes Python and the most important packages and libraries we need. In particular, it already contains Jupyter Notebook and other important data science modules.

Furthermore, Anaconda’s package manager conda makes it easy to manage multiple environments that can be maintained and run separately without interfering with each other (so-called virtual environments). conda analyses the current environment, including everything currently installed, and, together with any version constraints specified (e.g. the user may wish to have TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies. It shows a warning if this cannot be done.
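To make the idea of a version constraint such as "2.0 or higher" concrete, here is a deliberately simplified sketch. It is not how conda's dependency solver works internally; the helper names parse_version and satisfies_minimum are hypothetical and only handle plain dotted numbers.

```python
def parse_version(text):
    """Turn '2.4.1' into (2, 4, 1); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def satisfies_minimum(installed, minimum):
    """Check a constraint of the form 'version >= minimum'."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad with zeros so that '2' and '2.0' compare as equal
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(satisfies_minimum("2.4.1", "2.0"))
print(satisfies_minimum("1.15", "2.0"))
```

A real package manager has to check many such constraints at once, across all packages in the environment, which is why it can sometimes report that no compatible set exists.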

To do

Install Anaconda Individual Edition

Here is an example of how to install the Python package seaborn using conda:

  • On Windows open the Start menu and open an Anaconda Command Prompt.

  • On macOS or Linux open a terminal window.

  • Activate the conda environment of your choice (e.g. the base environment)

conda activate base
conda install -c anaconda seaborn
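One possible way to check from Python whether a package can be found in the active environment is sketched below. The helper name is_installed is hypothetical; importlib.util.find_spec is part of the standard library.

```python
import importlib.util


def is_installed(module_name):
    """Return True if Python can locate the module in the active environment."""
    return importlib.util.find_spec(module_name) is not None


# The result depends on the environment in which this is run.
print("seaborn installed:", is_installed("seaborn"))
```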

Visual Studio Code

Note

Visual Studio Code is a code editor that can be used with a variety of programming languages including Python.

Visual Studio Code (also called Code) is a powerful source code editor which runs on your desktop and is available for Windows, macOS and Linux. It comes with a rich ecosystem of extensions for Python and we use them to write our Python code.

To do

Install VS Code:

Get familiar with Code

Install Extensions:

Learn how to use Jupyter Notebooks:

If you have trouble using Anaconda in Visual Studio Code, follow these instructions:

Additional VS Code options:

How to configure native bracket pair colorization:

  • Remove any existing Bracket Pair Colorizer extensions.

  • Update VS Code

  • Open your user settings: CMD (CTRL for non-Mac users) + Shift + P and type settings.

  • Select Open settings (JSON).

  • Add the following entry inside the outer curly braces of the file (separate it from neighboring entries with a comma):

"editor.bracketPairColorization.enabled": true

Command-line interface

Note

A command-line interface (CLI) processes commands to a computer program in the form of lines of text.

Operating systems like Windows and macOS implement a command-line interface (other names for the command line are: cmd, CLI, prompt, console or terminal) in a shell for interactive access to operating system functions or services.

We sometimes use the command-line interface to perform simple tasks, so you should be familiar with basic commands. If you aren’t familiar with the terminal, read this short introduction to the command-line interface:

Here is a summary of some useful commands:

| Command (Windows) | Command (macOS / Linux) | Description | Example |
|---|---|---|---|
| exit | exit | close the window | exit |
| cd | cd | change directory | cd test, cd.. (Windows) or cd .. (macOS / Linux) |
| cd | pwd | show the current directory | cd (Windows) or pwd (macOS / Linux) |
| dir | ls | list directories/files | dir (Windows) or ls (macOS / Linux) |
| copy | cp | copy a file | copy c:\test\test.txt c:\windows\test.txt |
| move | mv | move a file | move c:\test\test.txt c:\windows\test.txt |
| mkdir | mkdir | create a new directory | mkdir testdirectory |
| del | rm | delete a file | del c:\test\test.txt |
| rmdir /S | rm -r | delete a directory and its contents | rmdir /S testdirectory (Windows) or rm -r testdirectory (macOS / Linux) |
| [CMD] /? | man [CMD] | get help for a command | cd /? (Windows) or man cd (macOS / Linux) |
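For comparison, most of these file operations can also be done from Python in a way that works on every operating system. The sketch below uses only the standard library and runs inside a temporary directory, so it does not touch your own files; the file and directory names are made up.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workspace:
    root = Path(workspace)

    test_dir = root / "testdirectory"
    test_dir.mkdir()                                   # like: mkdir testdirectory

    source = root / "test.txt"
    source.write_text("hello\n")                       # create a small file to work with

    copied = shutil.copy(source, test_dir / "copy.txt")             # like: copy / cp
    moved = shutil.move(str(source), str(test_dir / "moved.txt"))   # like: move / mv

    listing = sorted(p.name for p in test_dir.iterdir())            # like: dir / ls
    print(listing)

    Path(copied).unlink()                              # like: del / rm
    shutil.rmtree(test_dir)                            # like: rmdir /S / rm -r

    remaining = [p.name for p in root.iterdir()]
```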

Git and GitHub

Note

Git is a version control system, similar to the “Track Changes” feature in Microsoft Word but with many more capabilities.

GitHub is a provider of internet hosting for software development and version control using Git. We will use GitHub as a platform for web hosting and collaboration and as our course management system.

  • Git can be used to store content

  • Code can be changed and other developers can add code in parallel.

  • Git has a remote repository, which is stored on a server, and a local repository, which is stored on the computer of each developer.
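The relationship between a local and a remote repository can be tried out without a server. The following sketch assumes that the git command is installed; it creates a stand-in "remote" in a temporary folder, clones it, commits one file locally and pushes the commit back. The helper names git and demo_local_and_remote, the file name and the demo identity are all hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in the given directory and return its output."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_local_and_remote():
    """Return the number of commits on the stand-in remote, or None without git."""
    if shutil.which("git") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        remote = tmp / "remote.git"   # plays the role of the server
        local = tmp / "local"         # the developer's working copy

        git("init", "--bare", str(remote), cwd=tmp)
        git("clone", str(remote), str(local), cwd=tmp)

        (local / "notes.txt").write_text("first version\n")
        git("add", "notes.txt", cwd=local)
        git(
            "-c", "user.name=Demo",
            "-c", "user.email=demo@example.com",
            "-c", "commit.gpgsign=false",
            "commit", "-m", "Add notes",
            cwd=local,
        )

        # Until the push, the commit exists only in the local repository.
        git("push", "origin", "HEAD", cwd=local)
        return int(git("rev-list", "--count", "--all", cwd=remote))


commits_on_remote = demo_local_and_remote()
print(commits_on_remote)
```

In the course, GitHub takes the place of the temporary remote.git folder used here.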

You need a free GitHub account for our course. Please follow the instructions below (if you already have a GitHub account, please add your HdM email address to it):

To do