Python overview and installation

What is Python?

Python is a programming language, but when people say Python, they often mean the whole Python ecosystem: all the available packages (sometimes called modules) and the supporting development environments (e.g. PyCharm, Jupyter). It gained popularity because it is easy to read and understand, it's most often interpreted (see here) and it is open source. These features led to lots of contributions to its code base (all the pre-written packages), making it very robust for general programming purposes. There are now packages to do most computational tasks you could imagine and it has the second most pull requests on GitHub (after JavaScript, which is the workhorse language of the internet).

That said, in my understanding the only thing Python is the "best" language for is learning general programming. Its idioms seem to correspond to lots of people's procedural logic and its enforced indentation scheme helps us to recognize the logical structure of a program. However, it (still) isn't very good for statistics, it wasn't created with parallel programming in mind, and it actually isn't very fast compared to either older options (C/C++) or newer ones (Julia). But its strengths across the board have made it a standard language in both industry and academia.

For scientific work, alternatives include R, Stata, SAS, Matlab, Maple, and Wolfram Mathematica. Mathematica was supposed to be the future of scientific computing as of 10 years ago, but like Stata, SAS, Matlab, and Maple, it requires expensive individual or institutional licenses. While Python lacks many of those languages' most powerful features (or knowledge base in the case of Wolfram), contributors have added some of the best features to the Python ecosystem. A similar thing has happened with R and my expectation is that R will have overtaken Python as the language of data science in the next five years. We'll talk a bit about that in lectures and why you should still learn Python now.

For many areas of industry, the ability to understand and code in Python is standard, but other languages and tools are always going to be necessary. In principle, it would be possible for data science teams to work exclusively in Python, but in practice teams need to work with legacy code and a variety of other languages (e.g. JavaScript/HTML/Ruby/C++) and tools (e.g. Amazon Web Services, SQL databases). That means getting good with Python can get your foot in the door, but working in an academic setting will not set you up to walk onto an industry data science team and contribute code right away. But rest assured that learning your first programming language is the biggest step and that any additional ones come much easier. Furthermore, even experienced working programmers often find themselves relying on documentation to code successfully because it's impractical to remember several languages as you cycle through different projects or stages of projects.

Should you use Python 2.7 or 3.X?

There was a period of consternation for Python users when you really could debate whether you should be using version 2.7 or the backwards incompatible 3.X versions, where X is a number and denotes the most recent stable version. The reason was that tools you depended on might not have a stable 3.X version yet. That is no longer the case and you should always be using 3.X as your default. As of June 2018, the stable version is 3.6.5. Python 2.7 is the final release of the version 2 code base. It works fine but is technically deprecated. If you come across something written in 2.7 or earlier, you can use the 2to3 tool to convert it to 3.X or use 2.7 in a virtual environment.
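If you are not sure which version you have, you can ask the interpreter itself. The lines below are only a minimal sketch; the helper name describe_python is made up for this illustration. The sketch also shows one well-known behavior that changed between 2.7 and 3.X: dividing two integers with / gives a true (fractional) result in 3.X.

```python
import sys


def describe_python():
    """Return a 'major.minor.micro' string for the running interpreter."""
    info = sys.version_info
    return "{}.{}.{}".format(info.major, info.minor, info.micro)


print("Running Python", describe_python())

# Guard for scripts that only work on 3.X.
if sys.version_info.major < 3:
    raise SystemExit("This code needs Python 3.X")

# In 3.X, / is true division and // is floor division.
# In 2.7, 3 / 2 on two integers would give 1 instead.
print(3 / 2)   # 1.5
print(3 // 2)  # 1
```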

Getting Python 3.X

There are several ways to get Python, listed below. For CESS, I'm recommending Method 1, but I personally prefer Method 2. Why? Read on!

Method 1: Distribution Packages

As mentioned above, Python is more of a whole ecosystem than just a language. That means it is often helpful to manage your idiosyncratic ecosystem (niche?) with package management software. The two major options are Anaconda and Enthought's Canopy. These platforms provide point-and-click interfaces, come with essential packages installed, and offer means to get other packages. These distributions lower the bar for getting the latest version of Python installed and are a great boon for educators because they help make sure students have the same versions and that everything runs successfully on the seemingly infinite number of computer configurations people can have.

I don't love them personally because I learned to work with Python before they existed and find command line management easier, but they excel at keeping your packages up-to-date. The basic installs are free, but you can find yourself looking at a paywall if you need more specialized things. This brings my inner crank out because they are usually asking you to pay for open-source code. They are providing a service, but this means of asking us to pay for it doesn't feel right to me. Nonetheless, if you're just getting started, I would recommend installing via Anaconda. You can find NU's Research Computing Services' exact recommendation here.

Method 2: Directly from Python.org

You can download the latest stable Python, as well as development versions, from Python.org. These installers include only the Python core. Essential packages like Pandas and matplotlib are missing. The core installation does include pip, a package that helps you get other packages with simple command line commands like pip install pandas. This is my preferred means of managing Python, but, still, for this class I'd recommend the first method.
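To see whether a package is already present before reaching for pip, you can ask Python directly. This is a minimal sketch that assumes Python 3.8 or later (where the standard library includes importlib.metadata); the helper name installed_version is made up for this illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ["pandas", "matplotlib"]:
    version = installed_version(name)
    if version is None:
        print(f"{name} is not installed; try: pip install {name}")
    else:
        print(f"{name} {version} is installed")
```

On a fresh Python.org install, both packages would be reported as missing until you install them with pip.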

Method 3: System-wide package manager

Just as Anaconda and Canopy manage your Python packages, there are system-wide package managers to support the creation of specialized computing environments. You can get a package manager for your whole operating system and let it manage your Python (or Anaconda, Conda, or Canopy installation). This might be considered the professional route, but in practice using a general command-line installer looks the same as Method 2 above after the first two minutes because you're likely to be mostly in the Python ecosystem.

A note to Mac users

All Macs come with Python 2.7 installed because OSX and many of its apps use it. You can use it for some basic things, but it is a customized version that lacks the tools one often needs for scientific computing. Instead, everyone installs another standalone version. The installer packages know not to touch the system's version and set things up so that you can forget about it. Whatever you do, don't touch the system version! It doesn't live where other Python installs do and you'd have to look hard to find it, but if you do naive text (regex) searches from the command line, you might end up in the wrong spot and break things. Just be careful if you're deleting things or trying to modify a package.
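If you are ever unsure which Python you are actually running, the interpreter can tell you where it lives. The sketch below is an illustration only: the folder prefixes are my assumption about where a Mac's system Python usually sits, and the helper looks_like_system_python is a made-up name, so adjust both for your machine.

```python
import sys

# ASSUMPTION: typical locations of the system Python on a Mac.
# On Linux, /usr/bin/ is a normal place for Python, so this check is Mac-specific.
SYSTEM_PREFIXES = ("/usr/bin/", "/System/Library/")


def looks_like_system_python(path):
    """Return True if an interpreter path starts with an assumed system prefix."""
    return path.startswith(SYSTEM_PREFIXES)


print("Interpreter:", sys.executable)
print("Install prefix:", sys.prefix)

if looks_like_system_python(sys.executable):
    print("This looks like the system Python; avoid modifying it.")
else:
    print("This looks like a separately installed Python.")
```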

Links for starting

Windows

Method 1: I recommend installing Python via Anaconda. Just follow the directions here on the Anaconda website or here on the Research Computing Services website.

Method 2: You can find the Python binary for direct installation here: https://www.python.org/downloads/release/python-365/

Method 3: If you are interested in trying out a system-wide package manager some day, you might try Scoop. I haven't used Windows for programming tasks in ages and can't vouch for Scoop, but it is designed to be like the Mac manager I use. You can then proceed to install Python following this YouTube tutorial.

Mac OSX or Linux

Method 1: I recommend installing Python via Anaconda. Just follow the directions here on the Anaconda website or here on the Research Computing Services website.

Method 2: You can find the Python binary for direct installation here.

Method 3: You might also consider trying out the system-wide package manager Homebrew. To install it, simply execute the following line in the terminal:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

then run the following lines one at a time:

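# Install wget, then download and run the Miniconda installer
brew install wget
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
bash Miniconda3-latest-MacOSX-x86_64.sh
# Remove the installer from the folder it was downloaded to
rm Miniconda3-latest-MacOSX-x86_64.sh
# Put Miniconda first on the PATH for future terminal sessions
echo 'export PATH="$HOME/miniconda3/bin:$PATH"' >> ~/.profile
source ~/.profile
# Add the full Anaconda package set and bring everything up to date
conda install anaconda
conda update --all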

Getting yourself more familiar with Python and its ecosystem

What is happening when you install packages?

A code package or module is usually just a collection of scripts and a "read me" file with some documentation. You can in fact just put the scripts in a subfolder in your project folder, name the subfolder in accordance with your import statement, and get going. The reason we do installs instead of just downloading to the project folder is that the installer finds the right spot to make the package available (or unavailable) to other projects and updates system paths, a list of where to look for things when you call them from anywhere (e.g. a Jupyter notebook).

The business about system paths is important and can be infuriating when things go wrong. When you remember that packages are just Python code sitting somewhere on your computer, you can appreciate how calling an import statement like import pandas from a Python interpreter (something that will execute code for you) looking at your Desktop might go wrong. If you just downloaded the code onto your Desktop, Python would find Pandas, but Pandas might not find Numpy, on which it depends. Or if you accidentally let the Pandas folder go into your Downloads folder, Python won't find it. The registry of system paths lets the interpreter know where to find Python (or other things), which should be in a well-organized place so that Python itself knows where to look for packages. Installers make sure packages get to that place.
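You can inspect this search process from inside Python. The sketch below prints the folders the interpreter searches and reports where a given module would be loaded from; the helper name locate is made up for this illustration, and whether pandas is found depends entirely on your own installation.

```python
import importlib.util
import sys

# Folders the interpreter searches, in order, when it sees an import statement.
for folder in sys.path:
    print(folder)


def locate(module_name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


print(locate("json"))    # part of the standard library, so it is found
print(locate("pandas"))  # None unless pandas sits somewhere on the search path
```

If an import fails unexpectedly, comparing the folder where the package actually sits against the folders printed from sys.path is often the quickest way to see what went wrong.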

There are times when you might need a different version of a package you already have (e.g. you get someone's older code in order to replicate their findings). Installing multiple versions of a package with the same name directly to your Python framework is likely going to cause problems sooner or later. That's where learning to use virtual environments becomes important. Virtual environments take away the pain of having to make distinctions between pandas-0.14.1 and pandas-0.22.1 and let you just import pandas. A well-documented and reproducible project should provide the full specification of the environment and make it easy for you to set up a virtual environment with the right specifications. We'll cover virtual environments in a later workshop.
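As a small preview, the standard library's venv module is one way (not the only one) to create a virtual environment. The sketch below builds a throwaway environment in a temporary folder; the folder name replication-env is made up for this illustration.

```python
import tempfile
import venv
from pathlib import Path

# A throwaway location for the demo; a real project would use a fixed folder.
env_dir = Path(tempfile.mkdtemp()) / "replication-env"

# with_pip=False keeps the demo fast; pass with_pip=True to get pip inside.
venv.create(env_dir, with_pip=False)

# pyvenv.cfg is the small config file that marks the folder as an environment.
print((env_dir / "pyvenv.cfg").read_text())
```

From the command line, python -m venv replication-env does the same job. Packages installed while that environment is active stay inside its folder and do not collide with the versions in your main installation.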

Interpreted and/or compiled languages

People often split programming languages into either interpreted or compiled, and Python is generally placed in the former group. This distinction refers to the process of translation by which our code gets converted into lower-level machine code (what computers understand how to execute), not the language itself. In principle, both compilers and interpreters can be made for any language, but in practice languages are associated with one or the other process and there can be several different compilers or interpreters for a single language.

When compiling code, there is a noticeable translation step between writing the human-readable code and the machine running the program. Compiling looks at the whole set of things a program or script can do and produces optimized and well-validated machine code. In interpreting a language, the translation work gets split up in some good and bad ways. Some of the components, from packages for example, can be converted just once when they are first run, but so that the human-readable code doesn't need to be reinterpreted, the converted components preserve the full flexibility of the original code. This can slow the machine code down significantly and doesn't protect against type errors. Compiled code doesn't attempt to preserve flexibility because the goal is for the compiled code to stand on its own. In practice, breaking the translation process into chunks, as an interpreted language does, makes it quicker to develop little bits and run them right away. That makes Python very fun and easy to work with, but can slow it down in the grand scheme of things.
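To make the flexibility and type-error points concrete, here is a minimal sketch, assuming the standard CPython interpreter from Python.org. It uses the standard library's dis module to show the intermediate instructions a function is translated into, and then shows that a type mismatch is only caught when the offending line actually runs. The function add is made up for this illustration.

```python
import dis


def add(a, b):
    return a + b


# CPython first translates the function into bytecode, an intermediate form
# that the interpreter then executes one instruction at a time.
dis.dis(add)

# The same translated code stays flexible: it works for numbers and strings.
print(add(1, 2))      # 3
print(add("a", "b"))  # ab

# Nothing checked the argument types ahead of time, so a mismatch only
# surfaces when the call is actually executed.
try:
    add(1, "2")
except TypeError as error:
    print("TypeError at run time:", error)
```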

If you are done developing some Python code via the standard interpreted route and want to speed the code up, you can make use of just-in-time compiling tools like PyPy. Alternatively, you can use Cython to compile your Python module into faster and more efficient C-language binaries.

What you should take away from the interpreted versus compiled distinction is that Python is interpreted and therefore slower than compiled code. Basic Python is probably fast enough for most of your needs, but if you run into real performance trouble, there are workarounds to speed up the code.
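If you want a feel for the size of the gap, you can time things yourself with the standard library's timeit module. The sketch below assumes CPython, where the built-in sum runs as compiled C code, and compares it against a hand-written loop that the interpreter has to step through; the function name python_loop_sum is made up for this illustration. Exact timings vary by machine, so none are shown here.

```python
import timeit

numbers = list(range(100_000))


def python_loop_sum(values):
    """Add values one at a time in interpreted Python code."""
    total = 0
    for value in values:
        total += value
    return total


# Run each version 20 times and report the total time taken.
loop_time = timeit.timeit(lambda: python_loop_sum(numbers), number=20)
builtin_time = timeit.timeit(lambda: sum(numbers), number=20)

print(f"Python loop:  {loop_time:.4f} s")
print(f"Built-in sum: {builtin_time:.4f} s")
```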

Paradigms: Object-oriented, functional, imperative, declarative etc.

This is some high-level stuff that might help you win your bar's trivia night if you live in a college town, but really doesn't matter for getting on with your programming life, especially because Python supports multiple paradigms, as do many other languages.

Where knowing a bit about paradigms can be helpful is in thinking about how to organize your code. If you are thinking about simulating an interactive process with an agent-based model, it's best to construct the agents as objects with attributes and actions. The object-oriented paradigm focuses on this and defines things like classes and superclasses that one can use to efficiently define heterogeneous agents. Conversely, if you are writing a script to download and clean texts, a functional paradigm maps better to the workflow of dealing with instances one at a time: get the raw data, apply various functions to clean it up, and save to a database or storage.
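The two styles described above can be sketched side by side. Everything here (the Agent and Saver classes, the wealth attribute, and the cleaning functions) is invented for illustration and is far simpler than a real model or pipeline.

```python
# Object-oriented style: agents are objects with attributes and actions.
class Agent:
    def __init__(self, name, wealth):
        self.name = name
        self.wealth = wealth

    def act(self):
        self.wealth += 1


class Saver(Agent):
    """A subclass that reuses Agent's attributes but acts differently."""

    def act(self):
        self.wealth += 2


agents = [Agent("a", 10), Saver("b", 10)]
for agent in agents:
    agent.act()
print([(agent.name, agent.wealth) for agent in agents])  # [('a', 11), ('b', 12)]


# Functional style: small functions applied to one text at a time.
def strip_whitespace(text):
    return text.strip()


def lowercase(text):
    return text.lower()


def clean(text):
    return lowercase(strip_whitespace(text))


raw_texts = ["  Hello World ", "PYTHON  "]
cleaned = list(map(clean, raw_texts))
print(cleaned)  # ['hello world', 'python']
```

In the first half, the heterogeneity lives in the classes; in the second, each function takes a value and returns a new one without changing anything else, which makes the steps easy to reorder or test on their own.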

Data Camp R

Data Camp Python

RCS workshops