Why Python?

There are plenty of substantial open source software projects out there for data scientists, so why choose Python? After all, there is R. R is a robust and well-supported language, written initially by statisticians for statisticians.

The aim is not to promote one solution over the other. The goal is to illustrate how adding Python to a SAS user’s skill set can broaden one’s range of capabilities. And besides, Bob Muenchen has already written R for SAS and SPSS Users.

Python has its heritage in scientific and technical computing domains, and it has a compact syntax. The latter makes for a relatively easy language to learn, while the former means it scales to offer good performance with massive data volumes. This is one reason Google uses it so extensively and developed an outstanding tutorial for programmers. See Google's Python Class.

A Quick Start

Another aspect both languages have in common is the wealth of valuable information available on the web.

You would think having a plethora of content makes learning a new language a straightforward proposition. However, at times I experienced information overload. As I worked through examples, I could not tell, until I had invested a good deal of time, whether what I learned was applicable to my overall objectives. If the focus is on data analysis, should I start with Python's numpy library, or pandas, or some other library?

Sure, there is learning for learning’s sake. Many of the tutorials and texts I read were fruitful, but not all of them. It was only after a bit of time that I realized I needed a specific context for ingesting new information.

Like most people, I want fast results. And like most SAS users, I have developed a mental model for data analysis focused on a series of iterative steps.

What I was lacking was someone to identify both the content to use and the order in which it should be consumed. I wanted to initially invest time in just those topics I needed before getting on with the task of data analysis.

The Python for SAS Approach

First, a philosophical word (or two) about the merits of Python and SAS as languages. From my perspective, it is simply a question of finding the right tool for the job. Both languages have advantages and disadvantages. And since they are programming languages, their designers had to make certain trade-offs, which manifest themselves as features or quirks, depending on one’s perspective.

The goal is to provide a quick start for users already familiar with the SAS language and enable them to become familiar with Python. The choice of which tool to use typically comes down to a combination of what you as a user are familiar with and the context of the problem being solved.

The approach taken is to introduce one or more concepts in Python with a description of how the program works, followed by a code cell for the Python program. This is often followed by an example program in SAS to allow a compare-and-contrast approach. Not every Python example has an analogous SAS example.

The SAS language programs were written and verified with Base SAS Version 9.4M1. The Python examples were written with Anaconda's distribution of Python 3.6.

This approach is illustrated by the cells below.

Code Blocks for SAS and Python

The integers contained inside the square brackets [ ] are the elements of a Python list. In Python, a list is a data structure holding an arbitrary collection of items. i is the loop variable; on each pass through the for loop it refers to the next item in the list. product holds the integer result of the assignment product = product * i. Finally, the print() function produces the output.

numbers = [2, 4, 6, 8, 11]
product = 1
for i in numbers:
   product = product * i 
print('The product is:', product)
The product is: 4224
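
For comparison, the same product can be computed without an explicit loop. The sketch below is one alternative, not part of the original example; it assumes only functools.reduce and operator.mul from the standard library, both of which are available in Python 3.6.

```python
from functools import reduce
from operator import mul

numbers = [2, 4, 6, 8, 11]

# reduce() applies mul cumulatively, left to right:
# ((((1 * 2) * 4) * 6) * 8) * 11, where 1 is the starting value
product = reduce(mul, numbers, 1)
print('The product is:', product)
```

The explicit for loop remains the closer match to the SAS DO loop, which is why the examples here favor it.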


The analogous SAS code:

data _null_;

retain product 1;
   do i = 2 to 8 by 2, 11;
      product = product*i;
   end;

put 'The product is: ' product;
run;
The product is: 4224


Legibility, Indentation, and Spelling Matter

To quote Eric Raymond, "A language that makes it hard to write elegant code makes it hard to write good code." This is from his essay entitled "Why Python", located here.

The Python program in the cell below is the same as the one above, with one exception: the line after the for statement is not indented. As a result, the interpreter raises the error:

IndentationError: expected an indented block

numbers = [2, 4, 6, 8, 11]
product = 1
for i in numbers:
product = product * i
print('The product is:', product)
File "<ipython-input-4-354146ef50b8>", line 6
    product = product * i
          ^
IndentationError: expected an indented block

Once you get over the shock of how Python imposes its indentation requirements, you will come to see that this is an important feature for creating legible and easy-to-understand code.
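
Indentation does more than satisfy the interpreter: it decides which statements belong to the loop body. The sketch below reuses the earlier product example to show this. The name running is hypothetical, introduced only for this illustration.

```python
numbers = [2, 4, 6, 8, 11]
product = 1
running = []  # hypothetical name: collects the product after each pass

for i in numbers:
    product = product * i
    running.append(product)  # indented: part of the loop, runs once per item

# Not indented: outside the loop, runs once after the loop finishes
print('Running products:', running)
print('The product is:', product)
```

Moving a statement in or out by one indentation level changes how many times it executes, with no other change to the code.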

Notice also that there appear to be no symbols used to end a program statement. The end-of-line character ends a Python statement. This also helps to enforce legibility by keeping each statement on a separate physical line.

Coincidentally, like SAS, Python honors a semicolon as an end-of-statement terminator. However, you rarely see this, because placing multiple statements on the same physical line is considered an affront to program legibility.
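
A minimal sketch of the semicolon behavior described above. Both forms produce the same result; the values are arbitrary and chosen only for illustration.

```python
# Two statements on one physical line, separated by a semicolon.
# This is legal Python, but style guides such as PEP 8 discourage it.
x = 201; y = 202
print(x, y)

# The conventional form: one statement per physical line.
x = 201
y = 202
print(x, y)
```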

Python Types

Python permits an object-oriented programming model. SAS is a procedural programming language. These examples use a procedural programming model for Python, given that the goal is to map SAS programming constructs into Python.

This object-oriented programming model provides a number of classes, with objects being instances of a class. The Python program in the cell below illustrates the int class (integers). x is an instance (object) of the int class. You can execute help(int) to read more.

My early experience was that the types of the objects I created were not always obvious from the code context. I needed to know what type of object was being created. The type() function returns the object's type, as illustrated in the cell below.

Python has a number of built-in functions and types that are always available, which are documented here. Throughout the examples there are a number of calls to the type() function to aid in understanding the objects being manipulated.

x = 201
print(x)
print(type(x))
201
<class 'int'>
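
The same check works for any object. As a further sketch, the loop below calls type() on a handful of literals to show how similar-looking values map to different built-in classes. The names samples and names are hypothetical, used only for this illustration.

```python
# Values that look alike in code can belong to different classes
samples = [201, 201.0, '201', [2, 0, 1], (2, 0, 1), {'x': 201}, True, None]

names = []
for obj in samples:
    # type(obj) returns the class; __name__ is the class name as a string
    names.append(type(obj).__name__)
    print(repr(obj), '->', type(obj).__name__)
```

Note that 201, 201.0, and '201' are an int, a float, and a str respectively, a distinction worth keeping in mind when coming from SAS, which has only numeric and character variables.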


Line Continuation Symbol

Should you find you have a line of code that needs to extend past the physical line (i.e. wrap), use the backslash (\). This causes the Python interpreter to ignore the physical end-of-line terminator on the current line and continue scanning for the next end-of-line terminator.

x = 6 + \
    8 + \
    21
print('Sum of X:', x)
Sum of X: 35
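
As an alternative sketch, Python also joins lines implicitly inside parentheses, square brackets, and curly braces, so the same sum can be written without backslashes. The example reuses the values from the cell above.

```python
# Inside parentheses the expression may span several physical lines;
# no backslash is needed
x = (6 +
     8 +
     21)
print('Sum of X:', x)
```

Many Python programmers prefer this form because a stray space after a backslash is an error, while parentheses have no such pitfall.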


Spelling

Of course, the incorrect spelling of keywords is a source of error. Unlike SAS variable names, Python object names are case sensitive.

Y = 201
print(y)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-6-6a0796636415> in <module>()
      1 
      2 Y = 201
----> 3 print(y)

NameError: name 'y' is not defined

SAS keywords and variable names are case insensitive.

data _null_;

   X = 201;
   put x ;
run;
201

Finally, a word about name choices. Names should be descriptive because, more than likely, you will be the one who has to re-read and understand tomorrow the code you write today. As with any language, it is good practice to avoid language keywords for object names.
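
In Python, assigning to a keyword is not merely bad practice; it is a syntax error. One way to check a candidate name ahead of time is the standard library's keyword module, sketched below. The list of candidate names is hypothetical, chosen only for illustration.

```python
import keyword

# Hypothetical candidate names to check before using them in a program
candidates = ['product', 'class', 'for', 'numbers', 'lambda']

# keyword.iskeyword() returns True for names reserved by the language
reserved = [name for name in candidates if keyword.iskeyword(name)]

for name in candidates:
    status = 'reserved keyword' if name in reserved else 'ok to use'
    print(name, '->', status)
```

Names of built-in functions and types such as list or print are not keywords, so Python allows them to be reassigned, but doing so hides the built-in and is best avoided for the same legibility reasons.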