Tuesday, May 24, 2016

Fundamental Python libraries for ML

For those who have been studying ML, this post will be stating the obvious.  However, some of you might be very new to ML, and might benefit from links to better information sources.  I appreciated and benefited from pointers like these on other sites, so I am paying the favor forward.

To apply ML techniques, it is imperative that we learn Python.  We could survive on other languages of course, but Python is considered the foremost ML language out there.  Many code examples on the internet are written in Python.  The other competitor is R.  There is a very healthy debate around these two within ML circles, a debate we will not address.  It is okay to learn both. :)  As I come from a programming background, I tend to use Python more.

Within Python, we need to have some familiarity with libraries such as NumPy.  We introduced NumPy in Part 1 of the Perceptron series as one of several libraries useful to Python ML programming (the other four are matplotlib, SciPy, pandas, and scikit-learn).

Except for scikit-learn, these related libraries are part of the SciPy stack.  A stack is just a fancy way of identifying a group of closely related libraries that accomplish different, but complementary, functions.  (To add to the confusion, the SciPy stack is also called the NumPy stack.)

  • NumPy stands for Numerical Python, and is useful in scientific computing.  It is an extension of Python to handle large, multi-dimensional arrays and matrices, along with libraries of fast mathematical operations to apply on these arrays, such as sorting, basic linear algebra, random number generation and so on.  Its basic data type is an ndarray, which is NumPy's version of an array.  (I was pleased to see that searching for "what is numpy" in Google landed me on an official site that showed sample NumPy code highlighting the speed of NumPy operations over standard Python for loops, including references to broadcasting and vectorization.  The efficiency and readability of vectorized and broadcast NumPy code are exactly what I presented in PLA Part 1!)

  • SciPy is more generally a Python ecosystem (stack) of open-source software for mathematics, science, and engineering.  More specifically, SciPy provides efficient numerical routines, e.g., numerical integration and optimization, that are typically needed in scientific calculations.  SciPy builds on NumPy.  Think of it as a more specialized version of NumPy.

  • matplotlib is a plotting library for Python and works as a natural extension of NumPy (plotting ndarray variables without further modifications).  The animated GIFs on this blog are based on matplotlib output.

  • pandas is a data manipulation library.  The name is based on panel data, a common term for multidimensional data sets in statistics and econometrics.  Its base data type is a DataFrame, which should be very familiar to those versed in statistical packages such as R.  Think of a DataFrame as a table in Excel (a spreadsheet table).  At its most basic, pandas is used to read/write data from/to files, and for quick data processing and analysis.  It is common to see pandas DataFrames converted to NumPy ndarray objects for input to other ML libraries.  Going back and forth between pandas and NumPy is not difficult.

  • scikit-learn is the primary machine learning library for Python.  It has classification, regression, clustering, dimensionality reduction, and other standard ML operations.  It is built on NumPy, SciPy, and matplotlib; thus, it is critical to know these other libraries to make full use of scikit-learn.  As it is not a utility type of library, scikit-learn is not part of the SciPy stack.
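
To make a few of these descriptions concrete, here is a minimal sketch of NumPy vectorization and broadcasting, followed by a round trip between a pandas DataFrame and a NumPy ndarray.  The numbers and the column names (feature_a, feature_b) are made up for illustration; they do not come from any data set used on this blog.

```python
import numpy as np
import pandas as pd

# Vectorization: one array expression replaces an explicit Python loop.
x = np.arange(5, dtype=float)                 # ndarray holding 0.0 .. 4.0
squares_loop = np.array([v * v for v in x])   # element-by-element loop
squares_vec = x * x                           # same result, computed in one step

# Broadcasting: the row of column means (shape (2,)) is stretched
# across every row of the (3, 2) array during the subtraction.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])
centered = data - data.mean(axis=0)

# pandas <-> NumPy: a DataFrame converts to an ndarray and back.
# The column names are hypothetical, chosen only for this example.
df = pd.DataFrame(data, columns=["feature_a", "feature_b"])
as_array = df.to_numpy()                      # DataFrame -> ndarray
round_trip = pd.DataFrame(as_array, columns=df.columns)  # ndarray -> DataFrame
```

The loop and the vectorized expression produce the same squares, but the vectorized form is shorter and is usually much faster on large arrays.  The last three lines show the kind of hand-off that typically happens just before data is passed to an ML library.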

Finally, for an environment to run ML experiments, I recommend IPython (also part of the SciPy stack).  IPython is not strictly needed to run ML experiments, but it is convenient as a researcher's logbook.  If you have seen a website that combines normal text and Python code, sometimes with output, you might have seen an example of IPython working behind the scenes.

IPython is a programming shell/environment whose layout lends itself easily to scientific calculations and experiments.  Within IPython, we can create notebooks that combine normal text, mathematical equations, and actual Python code (and other languages also, such as R), and see the output of programs (e.g., matplotlib graphs) within the text, thus allowing a log of different runs.  It is a scientist's notebook companion.

IPython uses a markup language called Markdown, a highly intuitive way to write HTML but without using HTML brackets.  It is a natural way of writing without thinking about HTML tags, e.g., instead of "<strong>bold font</strong>" and "<em>italics</em>", we would type "**bold font**" and "*italics*".  These are then rendered as bold font and italics.  Mathematical equations take some getting used to, but the notebook can handle LaTeX, another staple of researchers for writing equations.  I use MathJax directly within the text to format equations.

Learning these is easy to moderately hard for people with programming experience.  I learned enough of IPython and Markdown in one evening to write the first Perceptron post, and learned MathJax the following evening for the math equations.  In fact, while I run Python code outside IPython to animate GIFs, I use IPython Notebook to write posts that have inline Python code, transferring over to Blogger only for finishing touches.  There is some manual HTML work needed to make the layout work on Blogger, and some layout effects are indeed lost, but the transition is a simple copy of HTML code.
