Programming toolkit¶
This section gives an overview of the programming toolkit you will need for our course. Please read the instructions and complete the tasks listed in the yellow To do boxes.
Python¶
Python is an object-oriented language (an object is an entity that contains data along with associated metadata and/or functionality). One thing that distinguishes Python from many other programming languages is that it is interpreted rather than compiled. This means that code can be executed statement by statement, which is particularly useful for data analysis as well as for creating interactive, executable documents like Jupyter Notebooks.
On top of this, there is a broad ecosystem of third-party tools and libraries that offer more specialized data science functionality.
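To make the notion of an object more concrete, here is a minimal sketch that uses only built-in types. The variable name and the values are made up for illustration.

```python
# A list is an object: it bundles data with metadata and functionality.
grades = [2.3, 1.7, 3.0]  # made-up example data

# Metadata: every object knows its own type.
type_name = type(grades).__name__
print(type_name)

# Functionality: methods are functions attached to the object.
grades.append(1.0)  # add one more value to the list
grades.sort()       # sort the list in place, smallest value first
print(grades)

# The statements above ran in order, one after another,
# so the list now contains four sorted values.
print(len(grades))
```

If you run these lines one at a time in an interactive session or a notebook cell, you can inspect `grades` after each step.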
Jupyter Notebook¶
Note
Jupyter Notebook is a web-based interactive computational environment for creating documents that contain code and text.
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain code, equations, visualizations and narrative text:
A notebook is basically a list of cells.
Cells contain either explanatory text or executable code and its output.
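The "list of cells" idea can be sketched as a plain Python dictionary. The layout below is a simplified version of the notebook file format (nbformat version 4) as far as we understand it; the cell contents are hypothetical, and real notebook files usually carry more metadata.

```python
import json

# Hypothetical two-cell notebook, loosely following the nbformat 4 layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # A text cell written in Markdown.
            "cell_type": "markdown",
            "metadata": {},
            "source": "# My first analysis\nSome explanatory text.",
        },
        {
            # A code cell; its output list is empty until the cell is run.
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print(1 + 1)",
        },
    ],
}

# A notebook is literally a list of cells, each with a type.
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
print(cell_types)

# On disk, a notebook (.ipynb file) is stored as JSON text.
notebook_json = json.dumps(notebook, indent=1)
```

This is why notebooks can be shared easily: the file is ordinary text that any Jupyter-compatible tool can open.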
Colab¶
Note
Colab is a free Jupyter notebook environment that requires no setup, and runs entirely on the Cloud.
Colaboratory, or “Colab” for short, is a free-to-use product from Google Research. Colab allows anybody to write and execute Python code through the browser, and is especially well suited to data analysis and machine learning.
Watch this video to get a first impression of Colab:
Let's start your first Colab notebook to get an overview of some basic features:
Markdown¶
Note
Markdown is a simple way to format text that looks great on any device.
Markdown is one of the world’s most popular markup languages used in data science. Jupyter Notebooks use Markdown to provide a unified authoring framework for data science, combining code, its results, and commentary in Markdown.
According to Wickham and Grolemund [2016], Markdown files are designed to be used in three ways:
For communicating to decision makers, who want to focus on the conclusions, not the code behind the analysis.
For collaborating with other data scientists, who are interested in both your conclusions, and how you reached them (i.e. the code).
As an environment in which to do data science, as a modern day lab notebook where you can capture not only what you did, but also what you were thinking.
Review these sites to learn more about Markdown:
Libraries¶
Note
Python libraries are collections of useful functions that eliminate the need to write code from scratch.
A Python library is a reusable chunk of code that you can import into your own projects so you don’t have to write all the code yourself. There are more than 140,000 Python projects available, and one way to discover and install them is to use the Python Package Index (PyPI). Another way to install Python libraries is to use the open source data science platform Anaconda, which is covered below.
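Importing a library looks the same whether it ships with Python or was installed from PyPI or Anaconda. The following sketch uses two modules from Python's standard library so that it runs without installing anything; the data values are made up.

```python
# Import a whole library (here: a module from Python's standard library).
import statistics

# Import a single function from a library.
from math import sqrt

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up example data

# Reuse existing functions instead of writing them from scratch.
mean_value = statistics.mean(values)   # arithmetic mean
std_dev = statistics.pstdev(values)    # population standard deviation
root = sqrt(16)                        # square root

print(mean_value, std_dev, root)
```

Third-party libraries such as pandas are imported in the same way once they are installed in your environment.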
Here is a list of some of the libraries we will use frequently:
pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool. We will use pandas regularly in our course and you will find all relevant content in this introduction to pandas.
NumPy offers tools for scientific computing like mathematical functions and random number generators.
SciPy contains algorithms for scientific computing.
matplotlib is a library for creating data visualizations.
Seaborn provides a high-level interface for drawing attractive and informative statistical graphics.
plotly is a graphing library to make interactive, publication-quality graphs.
statsmodels includes statistical models, hypothesis tests, and data exploration.
scikit-learn provides a toolkit for applying common machine learning algorithms to data.
TensorFlow is an end-to-end open source platform for machine learning.
Here are two curated lists of resources for practicing data science with Python. They cover not only libraries, but also tutorials, code snippets, blog posts and talks:
Anaconda¶
Note
Anaconda is a data science toolkit which already includes most of the libraries we need.
The open-source Anaconda Individual Edition (Distribution) is one of the easiest ways to perform Python and R data science and machine learning, since it already includes Python and the most important packages and libraries we need. In particular, it already contains Jupyter Notebook and other important data science modules.
Furthermore, Anaconda’s package manager conda makes it easy to manage multiple data environments that can be maintained and run separately without interference from each other (in so-called virtual environments). conda analyses the current environment, including everything currently installed, and, together with any version limitations specified (e.g. the user may wish to have TensorFlow version 2.0 or higher), works out how to install a compatible set of dependencies, and shows a warning if this cannot be done.
Here is an example of how to install the Python package seaborn using conda:
On Windows open the Start menu and open an Anaconda Command Prompt.
On macOS or Linux open a terminal window.
Activate the conda environment of your choice (e.g. the base environment)
conda activate base
Install seaborn according to the documentation
conda install -c anaconda seaborn
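After the installation you may want to confirm from Python that the package is visible in the active environment. The following is a minimal sketch using the standard library module `importlib.metadata` (Python 3.8 or newer); the helper name `installed_version` is our own, and we assume that the package ships standard metadata, which is usually the case.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Check the package installed in the steps above.
seaborn_version = installed_version("seaborn")
if seaborn_version is None:
    print("seaborn is not installed in this environment")
else:
    print(f"seaborn {seaborn_version} is installed")
```

If the message says the package is missing, check that you activated the same conda environment in which you ran the install command.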
Visual Studio Code¶
Note
Visual Studio Code is a code editor that can be used with a variety of programming languages including Python.
Visual Studio Code (also called Code) is a powerful source code editor which runs on your desktop and is available for Windows, macOS and Linux. It comes with a rich ecosystem of extensions for Python and we use them to write our Python code.
To do
Install VS Code:
Get familiar with Code
Install Extensions:
Learn how to use Jupyter Notebooks:
If you have trouble using Anaconda in Visual Studio Code, follow these instructions:
Additional VS Code options:
How to configure native bracket pair colorization:
Remove any existing Bracket Pair Colorizer extensions.
Update VS Code
Open your user settings: press CMD (CTRL for non-Mac users) + Shift + P, type settings, and select Open settings (JSON).
Add the following code:
"editor.bracketPairColorization.enabled": true
Command-line interface¶
Note
A command-line interface (CLI) processes commands to a computer program in the form of lines of text.
Operating systems like Windows and macOS implement a command-line interface (other names for the command line are: cmd, CLI, prompt, console or terminal) in a shell for interactive access to operating system functions or services.
We sometimes use the command-line interface to perform simple tasks, so you should be familiar with basic commands. If you aren’t familiar with the terminal, read this short introduction to the command-line interface:
Here is a summary of some useful commands:
Command (Windows) | Command (macOS / Linux) | Description | Example |
---|---|---|---|
exit | exit | close the window | exit |
cd | cd | change directory | cd test, cd.. (Windows) or cd .. (macOS / Linux) |
cd | pwd | show the current directory | cd (Windows) or pwd (macOS / Linux) |
dir | ls | list directories/files | dir (Windows) or ls (macOS / Linux) |
copy | cp | copy file | copy c:\test\test.txt c:\windows\test.txt |
move | mv | move file | move c:\test\test.txt c:\windows\test.txt |
mkdir | mkdir | create a new directory | mkdir testdirectory |
del | rm | delete a file | del c:\test\test.txt |
rmdir /S | rm -r | delete a directory | rmdir /S testdirectory (Windows) or rm -r testdirectory (macOS / Linux) |
[CMD] /? | man [CMD] | get help for a command | cd /? (Windows) or man cd (macOS / Linux) |
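Several of the operations in the table also have platform-independent counterparts in Python's standard library, which can be handy inside a notebook. The sketch below works in a temporary directory so that nothing on your system is touched; the file and directory names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

# Work inside a throwaway directory that is deleted automatically afterwards.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    print(Path.cwd())                    # like cd (Windows) or pwd

    test_dir = base / "testdirectory"
    test_dir.mkdir()                     # like mkdir testdirectory

    source = base / "test.txt"
    source.write_text("hello")           # create a small file to work with

    # like copy (Windows) or cp
    copied = Path(shutil.copy(source, test_dir / "copy.txt"))

    # like move (Windows) or mv
    shutil.move(str(source), str(test_dir / "moved.txt"))

    # like dir (Windows) or ls
    names = sorted(p.name for p in test_dir.iterdir())
    print(names)

    copied.unlink()                      # like del (Windows) or rm
    shutil.rmtree(test_dir)              # like rmdir /S (Windows) or rm -r
    directory_removed = not test_dir.exists()
```

Because `pathlib` and `shutil` hide the differences between operating systems, the same code runs unchanged on Windows, macOS and Linux.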
Git and GitHub¶
Note
Git is a version control system, similar to the “Track Changes” feature in Microsoft Word but with many additional capabilities.
GitHub is a provider of internet hosting for software development and version control using Git. We will use GitHub as a platform for web hosting and collaboration and as our course management system.
Git can be used to store content.
Code can be changed, and other developers can add code in parallel.
Git has a remote repository, which is stored on a server, and a local repository, which is stored on the computer of each developer.
You need a free GitHub account for our course. Please follow the instructions below (in case you already have a GitHub account: please add your HdM email address to your account):
To do
Verify your GitHub email
Adjust your GitHub settings
Settings > Emails > Uncheck “Keep my email address private”
Settings > Emails > Update name and photo