
Encountering The R Programming Language In The Wild

This paper was originally written for my course CIS*6650(01) - Programming Language Evaluation (GR), taken at the University of Guelph.

If you’ve made it this far, you may have read the previous papers in this 6650(01) saga, Why the C Programming Language Deserves an F and A Scripture Reading - ‘The Humble Programmer’, and probably noticed a trend in my writing style: even in technical writing, I’m a sucker for anecdotes, idioms and good old-fashioned stories. I’ve found that critical thinking is amplified when the material is applicable and relatable, and I’ve always appreciated personable, accessible writing in formal and academic documents. Given that I haven’t yet been reprimanded for my potentially excessive use of personal anecdotes (thank you!), I thought I would take this opportunity to explore that style a bit further and tell the story of my first true reflection on a programming language: R.

This paper explores my first impressions of R, introduced within a course called Econometrics, and how my work with the simple statistical analytics language had me flagged for potential academic misconduct. I describe my gripes with the language, such as its lack of a defined programming paradigm (functional vs. imperative), poor scoping, loose and weak typing, and its inconsistencies and bloat in comparison to general-purpose languages. I also explore some of the strengths I’ve grown to appreciate about R since that experience, namely that it’s more accessible than I first thought, and that the community, packages, and graphical and data-handling facilities were clearly designed with statistics and data in mind. In general, this paper avoids making sweeping statements about the overall quality of R, and instead makes observations on its relative strengths and shortcomings.

Using R as a CompSci Student

It was September 2015 here at the University of Guelph and I was just starting the fourth and final year of my undergraduate degree in Computer Science, meaning I had recently developed both a mild proficiency in programming and a mild confidence (probably more aptly described as arrogance). Up until this point I had only used general-purpose languages such as C, Python, and Java, and application-specific languages like MySQL, Postgres, CSS, HTML, and JavaScript. On top of my compsci courses, I had also registered for a single course to fulfill the requirements for my minor in Economics. I picked Econometrics with Dr. Prescott for a multitude of reasons, but the most prominent was surely a single sentence within the course description that hinted at the inclusion of programming: “This course will introduce students to economic analysis and modelling through applying statistical programming tools”. While I’m generally quite fond of economics courses, I was particularly excited for this one, as it presented an opportunity both to apply my programming knowledge to my minor and, of course, to show off to some of my peers outside my major.

Unfortunately, much of my excitement was replaced by frustration as I began working on the first of three assignments analyzing economic data. We were tasked with performing some fairly trivial statistical analyses, such as nonparametric and linear regression, multiple regression modelling, semi-log modelling, and hypothesis testing, on a table of real financial data from a subset of Malawians. My troubles, however, weren’t based on the problem I was solving, but rather on what we were using to solve it: the R programming language and environment for statistical computing and graphing.

We were provided some lab tutorials on interfacing with the R environment, a .csv database (I use the term ‘database’ here quite lightly) containing 9578 entries, and a reasonably large set of documentation on how to interact with the data. Through single-line commands, we were to apply provided equations, perform particular analyses and chart some graphs, while using pen and paper or a spreadsheet to log temporary values. For most of my peers this was a reasonably straightforward task, and they spent the majority of their time writing their supplementary 2000-word essays. I, however, spent most of my time reading R documentation, writing new functions, fiddling with the language and generally pounding sand, an effort which ended up teaching me a lesson or two and landing me a meeting to discuss my potential academic misconduct.

Functional or Imperative?

Initially I thought R was a functional programming language, as it boasts functions as first-class objects and at first seemed to treat variables as immutable. Functional languages treat statements as the evaluation of mathematical functions and avoid mutable data or changing state, unlike imperative (or procedural) languages, which generally denote everything as a command, treat data as mutable and treat functions as subroutines. Many other packages, including MATLAB, SAS, SPSS and Stata, are considered procedural, with procedures that typically do data analyses down columns and functions that do calculations across rows. R, however, has only functions, and those functions can do both. A function may have a preference to go down rows or across columns, but for many functions you can use the “apply” family of functions to force them to go in either direction. It’s an unusual design compared to other statistical languages, but it is what allows R to treat functions as first-class objects.
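
To make that concrete, here is a small sketch of my own (not from the course material) using the apply family to run the same function across rows or down columns of a matrix:

m <- matrix(1:6, nrow = 2)     # 2 rows, 3 columns

apply(m, 1, sum)               # MARGIN = 1 applies sum across each row, giving 9 12
apply(m, 2, sum)               # MARGIN = 2 applies sum down each column, giving 3 7 11
sapply(1:3, function(i) i^2)   # functions passed around as first-class objects: 1 4 9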

R continued to act functional while I used the console for simple commands like mean(bin$Age), xs[nb]*(1-r)+xs[na]*r or var(bin$wage_hrs)^0.5. This changed, however, when we needed to produce histograms using a subset of our data, meaning that for the first time we used an assignment statement; in this case I believe it was set4 <- cbind(set4,F,Urban,Educ). I then checked a conditional statement without an else, such as if (x == 1) { x <- 2 }; x, and sure enough nothing was wrong. The functional interpretation of if is that it is an expression that evaluates to some value. To evaluate if c then e1 else e2 you evaluate the condition c and then evaluate either e1 or e2, depending on the condition. If you have just if c then e, then you don’t know what the result of the evaluation should be when c is false, because there is no else branch. I wasn’t writing expressions, I was writing commands. Printing the contents of a variable after changing it (x <- 2; x; x <- 3; x) confirmed that data was indeed mutable as well.
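
For contrast, here is a small sketch (mine, not from the assignment) of how the same construct can be read either way in R:

x <- 1
if (x == 1) {
  x <- 2       # imperative reading: a command with a side effect, no else required
}
x              # 2; the variable was mutated in place

label <- if (x > 1) "big" else "small"   # functional reading: if as an expression
label          # "big"

y <- if (x > 10) "huge"   # with no else branch, a false condition evaluates to NULL
y              # NULL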

After reading some documentation, I found that R is technically neither purely functional nor purely imperative: it treats functions as first-class objects, yet data is mutable and statements are commands. This is awkward for a number of reasons, but the most significant is that it permits inconsistent and unfamiliar coding styles and structures. I ended up moving as much of the code as possible into a procedural format, moving the collection of copy-pasted commands and long equations into functions, and the temporary pen-and-paper data into assigned, mutable variables.
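
As a hypothetical illustration of that refactoring (the function and column names here are mine, not the assignment’s), one of the long copy-pasted equations might become something like:

# fit a semi-log model of a response on a predictor and return its summary
semi_log_fit <- function(data, response, predictor) {
  model <- lm(log(data[[response]]) ~ data[[predictor]])
  summary(model)
}

# e.g. semi_log_fit(survey_data, "wage_hrs", "Educ")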

Scoping and Typing

One of my next grievances with R was its loose scoping and lack of namespaces for user code. A namespace is a declarative region that provides a scope to the identifiers (functions, variables, etc.) inside it; namespaces and scopes are used to organize code into logical groups and to prevent the name collisions that can occur when including multiple libraries. I spent many evenings debugging my code before realizing that I had used a variable i that was being reinitialized by function calls (see Figure 1). Without namespaces of your own, you live in constant fear of using the same variable name as one in your library or imports.

x <- 10
x
foo <- function(){
	x <<- 50   # <<- rebinds the global x rather than creating a local copy
	x
}
foo()
x

[$] 10
[$] 50
[$] 50

Figure 1: An example of R’s loose scoping; x is reassigned globally from within a function via the superassignment operator

R also has dynamic and weak typing, meaning anything can be assigned to a variable and implicit conversions are performed liberally, even during computation (see Figure 2). typeof simply reports the internal storage type of an object, such as “double”, “character”, “closure” or “environment”. This can get frustrating when functions are indistinguishable from variables and a character string “1234” looks like the number 1234. While not a particularly large flaw, weakly typed systems can be difficult to manage and generally aren’t well suited to large programming tasks. In light of these problems, I rewrote all my functions to use unique variable names and cleared variables before use.

a <- TRUE
a
[$] TRUE 
a[3] <- 2 
a
[$] 1 NA 2 
a[2] = 1.23
a
[$] 1.00 1.23 2.00
a = "hello" 
a
[$] "hello"

Figure 2: An example of the dynamic and weak typing in R
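
One way to defend against that kind of silent coercion, sketched here rather than taken from my actual assignment, is to check and convert types explicitly before computing:

x <- "1234"            # looks like a number, but is a character string
typeof(x)              # "character"
is.numeric(x)          # FALSE

x <- as.numeric(x)     # convert deliberately instead of relying on implicit coercion
stopifnot(is.numeric(x))
mean(c(x, 10))         # 622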

Idiosyncrasies, Inconsistencies and Bloat

R contains a number of peculiar idiosyncrasies, inconsistencies and general language bloat, much of it introduced in an attempt to win back some of the structure lost to lazy dynamic typing and loose scoping. For starters, R has five different assignment operators (<-, <<-, =, ->, ->>): the plain arrows <- and -> (and, at statement level, =) assign in the current environment, the superassignment operators <<- and ->> walk up the enclosing environments, and = also doubles as the named-argument syntax inside function calls. There are also a number of ways of indexing a ‘variable’ in R (see Figure 3): for vectors and matrices, [ ] and [[ ]] have slight semantic differences (e.g. the latter drops any name or dimname attributes, and partial matching can be used for character indices), $ applies to recursive objects like lists and pairlists (it allows only a literal character string or a symbol as the index), and @ extracts slots from S4 objects. Confusingly, the dot that denotes member access in most other languages has no special meaning in R at all; it is simply a legal character in identifiers, as in data.frame.
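
To keep the assignment operators straight, here is a short sketch of all five in action (my own illustration):

a <- 1      # leftward assignment in the current environment
b = 2       # = behaves the same way at statement level
3 -> d      # rightward assignment

f <- function() {
  a <- 10   # creates a local a; the global a is untouched
  b <<- 20  # superassignment: rebinds b in an enclosing (here, global) environment
  30 ->> d  # rightward superassignment, same idea as <<-
}
f()
a; b; d     # 1 20 30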

The language lacks vector or matrix literals (you have to use c() to build a vector, x = c(1,2,3,4)), has too many commands, treats symbols as literals, contains lots of unmaintained projects (a.k.a. packages), and has poor readability. It’s safe to say that the majority of my time completing the assignment was spent gritting my teeth, attempting to understand how and why someone would have built a language like R, and staring longingly at my C, Python and Java programming assignments.

x[i]
x[i, j]
x[[i]]
x[[i, j]]
x$a
x@a

Figure 3: All the valid ways of indexing a data structure in R
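
To see why the distinction matters, here is a small list indexed each way (again my own illustration, not from the assignment):

person <- list(age = 30, name = "Nic")

person["age"]      # single bracket: returns a one-element list named "age"
person[["age"]]    # double bracket: returns the element itself, 30
person$age         # same as [["age"]], but the index must be a literal name

field <- "age"
person[[field]]    # [[ ]] accepts a computed index; $ would look for an element named "field"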

Using R as an Economist or Statistician

After a month and a half of fumbling with R, then pulling together my supplementary essay far later than planned, I printed out my paper, complete with my R.history file (see Appendix 1), and submitted it to the box on the 7th floor of MacKinnon. About a week later I received an email asking to meet with Dr. Prescott about potential plagiarism in Assignment 1. I never did ask him whether it was simply due to my restructured code or due to the disparity between the quality of my code and my essay, but either way my R.history file had caught his attention. Against all my expectations, I ended up staying for a long and enjoyable visit with Dr. Prescott. I explained my background in Computer Science and we walked through the new implicit, commented and procedural functions I had used to build my simple graphs and figures. He was happy to know that I wasn’t paying someone online, and even asked if he could use a function or two of mine in some future classes, to which I agreed, although I’m quite sure that never ended up happening.

More importantly, however, Dr. Prescott explained why I had only received a 70% on the assignment: I had spent all my time on the lightly weighted R-programming section, and while my code was more modular, my R.history was only a quarter the size of the average student’s. Even with my robust functions, indexed data, and comments, my output was the same as the other students’ and my analyses lacked depth and application of economic models. I’d like to say I learned my lesson in that office, but unfortunately that was a difficult semester for me, and I continued to excel at the programming, fall behind on the essay analyses and tests, and ultimately landed a 66% in the course, one of the three worst marks of my undergraduate degree.

Thankfully, over the past few years I’ve had the opportunity to use R a bit more, throughout the completion of my minor and a teaching and research assistantship with Dr. Gillis. While my previous criticisms still hold true, I’ve learned to act as my own devil’s advocate and to appreciate some of the strengths of R. It doesn’t take a statistician or data analyst to tell me that if Dr. Prescott and Dr. Gillis have happily used the language for almost a decade, it’s probably not as bad as I think it is.

Community, Package Library and Graphics

Polls and surveys suggest that R’s popularity has increased substantially in recent years, particularly among statisticians developing statistical analyses and graphical figures. Which makes sense, as R was released in 1995 as a more powerful and robust implementation of the statistical S programming language. R is able to perform a wide variety of functions, such as data manipulation, statistical modelling, and graphics. One of R’s greatest advantages, however, is its extensibility. Developers can easily write their own software and distribute it in the form of add-on packages, and because of the relative ease of creating these packages, literally thousands of them exist. In fact, R boasts one of the largest and most active communities in the statistical programming world, arguably rivalled by Perl’s. On top of producing a supportive environment for newcomers, the community has been prominent in the development (and, generally, the upkeep) of language packages. R has over 4800 packages available from multiple repositories, specializing in topics like econometrics, data mining, spatial analysis, and bioinformatics, and many of these packages have since been incorporated into the base installation of R.
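
Trying one of those packages is a one-liner; for instance (assuming a working internet connection and the default CRAN mirror), the AER package bundles data sets and helpers for exactly the kind of econometrics covered in the course above:

install.packages("AER")   # fetch the package and its dependencies from CRAN
library(AER)              # attach it to the current session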

One of the most influential packages in the R ecosystem is ggplot2, released by Hadley Wickham in 2005 and maintained by the open-source community ever since. ggplot2 is a data visualization package licensed under the GNU GPL v2, which breaks graphs up into semantic components such as scales and layers and allows the user to add, remove or alter components of a plot at a high level of abstraction. Since 2005, ggplot2 has grown to become one of the most widely praised graphing libraries and is among the first packages installed in most R environments. In general, thanks to the community, the diverse statistical packages and ggplot2 integration, R is a worthy rival to MATLAB, SAS, Python or Perl in the domain of statistical computing and visualization.
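
A minimal sketch of that layered style, using made-up data but real ggplot2 calls:

library(ggplot2)

df <- data.frame(educ = c(8, 10, 12, 14, 16),
                 wage = c(3.1, 3.8, 5.2, 6.9, 9.4))

ggplot(df, aes(x = educ, y = wage)) +       # data and aesthetic mappings
  geom_point() +                            # one layer: raw points
  geom_smooth(method = "lm", se = FALSE) +  # another layer: a fitted line
  scale_y_log10() +                         # change the scale without touching the layers
  labs(x = "Years of education", y = "Hourly wage (log scale)")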

Open & Accessible

Unlike the majority of its competitors in statistical computing, R is both open and accessible. R is available under an open-source license without restrictions, which means anyone is able to download, modify, and ultimately improve upon its source code, all free of charge. Partly for this reason, R is both stable and reliable. The R Core Development Team has also put a lot of effort into making R available for a variety of hardware and software platforms; R runs smoothly on Windows, Unix and Mac systems. Several packages also exist to connect R to a plethora of other systems, such as the RODBC package, which reads from common relational databases like Oracle using the Open Database Connectivity protocol. R is also built on C and Fortran, meaning calling code written in those languages from within R is relatively easy, and as the community grew, C++, Java, Python and other languages became more seamlessly connected with R.
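
As a sketch of the kind of glue code that makes this possible (the data source name and query here are placeholders, not a real setup):

library(RODBC)

conn <- odbcConnect("survey_dsn")                                      # a hypothetical ODBC data source
wages <- sqlQuery(conn, "SELECT age, educ, wage_hrs FROM household")   # results arrive as a data frame
odbcClose(conn)

head(wages)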

While R is a powerful statistical tool, it was also built with usability and accessibility in mind, in unison with its environments. There are multiple ways to interface with R; some common graphical environments are R GUI, R Commander, and RStudio. These editors/environments are designed to make scripting easy, using dropdown menus to show available commands, variables and data frames. They are also quite convenient when interacting with objects stored in your environment: an Environment window shows all of the objects you have stored, including data (scalars, vectors, and matrices), model outputs, et cetera, along with a summary of the information stored in those objects. They make setting your working directory easier, and generally make data analytics and graphic formulation more accessible for the casual user.

Conclusion

And so, other than getting a peek into my past experiences with R and exploring some of its strange and beautiful details, what can we take away from this long-winded anecdote? Mainly that R is a Domain Specific Language (DSL). Bo Cowgill described it well when he said,

“The best thing about R is that it was written by statisticians… the worst thing about R is that it was written by statisticians”

Comparing R to general-purpose languages is akin to comparing apples to oranges, or PHP to MySQL. Tasks should be carefully selected to be sure they are most appropriately handled with R, and larger problems should be broken into pieces that can be handed off to separate languages; for example, use R to manipulate vectors and create graphics, and Python or Perl for other data mining and analysis purposes. One thing we can say about the R Project is that it is a functional, imperative, dynamic, weakly typed, loosely scoped, bloated, open-source, accessible programming language that was built with statistics, data and graphing in mind, and has a strong community and extensive package library.


Nic Durish

Guelph-based computer scientist.
Web and Automation tinkering addict.
Crappy musician
Mango connoisseur
