Information about statistics and statisticians

This review was motivated by encountering a comment on a competing statistical consultant’s website that characterized SAS and R as the only serious “world class” statistical software systems, relegating SPSS and other systems to the status of mere toys in the statistical playground. In my blog on this website I addressed the motives of the not uncommon practice of deprecating any systems that have succeeded in eliminating inscrutable syntax as an obstacle to their use. In this article my purpose is not to try to rebut the views of people who seek to put down more accessible statistical packages. Instead, my purpose is to promote a more accurate understanding of the relative merits of the most widely used statistical packages. *

The fact of the matter is that no statistical packages are “world class” in regard to all of the criteria by which such packages can be judged, and practically all of the packages are “world class” in some respects. Let’s consider what these criteria are in relation to widely-used, all-purpose statistical software packages. Here is my list:

  1. Ease of use
  2. Learning curve
  3. Depth of menued procedures
  4. Range of statistical procedures offered
  5. Modifiability of analytical output specifications
  6. Ease of transforming table output to formatting conventions (e.g., APA)
  7. Range, quality, and ease of use of graphical output offered
  8. Speed of handling large data sets
  9. Ease and flexibility of data importation
  10. Ease of results exportation
  11. Thoroughness and interpretability of results output
  12. Ease and flexibility of data set manipulation
  13. Pricing for individuals
  14. Thoroughness and informativeness of documentation

I’ll now give you my opinion on how the top 5 statistical packages (SPSS, SAS, R, Minitab, and Stata) stack up on these criteria.

1. Ease of use: This criterion relates to how much program-specific syntax must be recalled to execute an analysis command, and how quickly commands can be entered and executed. In recent years the packages that began as exclusively command-line oriented (SAS, R, and Stata) have grafted menu-driven interfaces onto their systems. In the case of R, there are several alternative menued interfaces developed by individual contributors, foremost among which is Rcmdr. As good as it is, though, it is not smoothly integrated into R, requiring the reloading of desired plug-ins each time R is run. Even the menued interfaces created by teams of people at SAS and Stata do not approach the ease with which the menued interfaces of SPSS and Minitab provide access to the vast majority of features of those packages. Indeed, in its latest release SAS seems to have dropped its menued interface (viz., Analyst) in favor of a “simplified” command line interface (viz., IML). Thus, on this criterion, SPSS and Minitab emerge as the only world class packages.

2. Learning Curve: This criterion refers to how long it takes to acquire the knowledge and skill necessary to conduct an average variety of analyses with a statistical package. Let’s say this average variety consists of descriptive statistics, t-tests, ANOVA, multiple regression, and nonparametric analyses. Contrary to common use of the term, a “steep” learning curve means that something can be learned very fast. Learning that occurs slowly because of the difficulty, complexity, and volume of information that must be mastered has a “flatter” learning curve, one that inclines very gradually. The flatter the learning curve associated with a statistical package, the more time and effort must be invested in mastering it, and the lower the proportion of people who are likely to make such an investment. From my knowledge of these packages I would classify Minitab and SPSS as having the steepest learning curves, that is, the least time needed to acquire the capability to perform our average set of analyses. SAS and R fall at the other extreme, which is no doubt part of the reason why they’ve sought to devise menued interfaces for their systems. Stata falls about in the middle between these two groups, though it certainly could be much easier to master than it is. Thus, again SPSS and Minitab are the “class” of this set of packages in how quickly they can be learned, while SAS and R present much more of a challenge to their mastery than they should.

3. Depth of menued procedures: This refers to the range of analytic and output options that are offered in the menu for each procedure. SPSS is the clear leader on this criterion. For example, its menu-available options for its GLM ANOVA procedure are very extensive, allowing detailed specification of the model to be analyzed and the type of sums of squares to use. For some procedures, and ANOVA is a good example here, any menuing system reaches its limit and further modifications are only available through the command syntax (e.g., this is true for simple effects analysis of interactions and nested designs in SPSS GLM ANOVA). However, with the “paste” option in all SPSS menus, you can get a long way toward composing the desired command through the menu system. This is a great time-saver, and no other program seems to offer this capability. Next best, by a considerable margin, is Minitab. The other three systems (SAS, R using Rcmdr, and Stata) offer very little depth in their menus and, as far as I can determine, no ability to paste to the command area the part of the desired command selected in the menu.

4. Range of statistical procedures offered: Here we see the order of the rankings reversed. R, SAS, and Stata, in that order, seem to offer the widest ranges of statistical procedures among the five packages. SPSS comes in closely behind Stata, and Minitab brings up the rear. Before there is any rejoicing among the R and SAS snobs, though, it is necessary to ask how much of a difference the ability to perform analyses in the farthest ranges of arcanity actually makes to 95% of the users and consumers of statistics. I would argue: vanishingly little. I would rate the ability of all of these packages to meet the requirements of the vast majority of statistical analysis challenges as “world class”. Moreover, each of them seems to offer some analytical capabilities that the others don’t. For example, arguably the simplest of these packages, Minitab, offers what I consider to be the best time series analysis capability of any of these packages. Where very unusual analyses are needed, the three leaders in this category will almost certainly be able to conduct them, but there is no evidence that they can produce more accurate or complete results for the analyses sought by 95% of users and consumers of statistics.

5. Modifiability of analytical output specifications: This criterion refers to the range of output options a system offers for each of its analytical procedures. For example, SPSS offers a very wide range of options for the output of its descriptive statistics, GLM ANOVA, Regression, and Explore commands. Minitab offers far fewer options. Stata will produce a considerable range of options, but they mostly exist as follow-up commands on the initial analysis. SAS seems to produce a lot of output whether you need it or not. R seems to narrowly focus its output on the primary results of a procedure; if you want more, it takes another command. There seem to be distinctly different philosophies at work here: decide up front on all the output you are and are not interested in vs. look at the analysis results and decide what else you want to know from them. I prefer the former approach because it saves time, but that’s not to say it’s the best approach for everybody. The point is that each of these approaches is best for some users, but neither is better than the other in any absolute sense.

6. Ease of transforming table output to formatting conventions (e.g., APA): I am always trying to find ways to save clients money by maximizing my efficiency. One of the more tedious tasks in preparing results sections for dissertations and journal article manuscripts is preparing the tables. APA format standards provide very specific guidelines for table borders, centering, significant digits, spacing, etc. The closer a stat package’s output tables come to these standards, the less time it takes me to bring any selected table into conformance with the standards. SPSS has formatting options which can produce tables that are quite close to APA standards. Although they still require some work, they require much less work than the output of any of the other packages. The tables output by the other packages can only be described as primitive in comparison. This is one criterion on which SPSS holds a clear edge over the other packages.

7. Range, quality, and ease of use of graphical output offered: I have seen some examples of graphical output produced by R that pretty much blow away anything that seems feasible in the other packages. However, the difficulty of achieving such high quality graphics seems quite high, so one has to weigh the quality and variety of graphics that can be achieved with R against the skill levels required to take advantage of this capability. SAS offers more limited but still quite extensive, high quality graphics, but it too imposes high demands on the skill needed to fully exploit its graphics capability. Stata’s graphical output is very high quality and quite easy to generate, but less extensive in its range of chart types than R and SAS. SPSS offers a range of charts that is close to that of SAS and its charts are MUCH easier to generate than those of SAS. However, the graphical quality of SPSS chart output is quite low, the lowest of any of the five packages. The graphical quality of Minitab’s charts is quite high, matching that of Stata, and its charts are very easy to generate. However, its range of charting options is the most limited of any of the packages under consideration here. If you’re getting the feeling that there are trade-offs to be made in choosing between these packages on the basis of their graphic capabilities, you’re right. There is no package that leads the others in range of charting capabilities, quality of graphics, and ease of use. Another consideration that needs to be raised here is that there are a number of third-party charting packages that produce much better graphics than any of these statistical packages and are quite easy to use. It’s not much of a chore to copy the requisite output from a statistical package and paste it into one of these independent graphical packages. Examples of such charting software include SigmaPlot, Grapher, MagicPlot, and ThreeDify Excel Grapher, to mention just a few.

8. Speed of handling large data sets: It is no secret that SAS has invested heavily in developing its database management capability to be able to handle databases as large as any that exist. This capability has put SAS at the forefront of software technology for data mining. Thus, there is no disputing SAS’s leadership in the ability to handle large data sets. R originally was designed to hold all data being processed in memory, which set the amount of free addressable memory in a computer as the limit to the number of cases x variables that could be processed. However, there have been successful efforts to modify this behavior to allow disk paging during processing, so the practical limit on data set size seems to have been removed. Despite the ceiling on its processing capacity having been lifted, R has not been optimized to deal with large data sets like SAS has. The other three packages seem to use disk paging when available memory is exhausted, and they are very slow. I have run analyses with SPSS on 240K cases that have taken hours to complete.

In addition to processing data, another important function for which these packages are used is to extract complexly defined subsets of cases from large data sets. SAS is very good at this, but imposes a very flat learning curve. SPSS requires more work to extract the same data subset, but the skill required to do so can be acquired much more rapidly. For this purpose, though, SAS and SPSS far surpass the capabilities of the other packages under consideration.

9. Ease and flexibility of data importation: This is quite an important criterion by which to assess the usefulness of statistical packages. Four of the systems under consideration here (SAS, SPSS, R, and Stata) are capable of importing the data sets produced by SAS, SPSS, and Stata, along with Excel, Access, Dbase and a few other database formats. None of the four commercial systems are able to import R data, which would have to be exported to a text file first. R is the only system that can import Minitab data sets. Minitab is at the bottom of the pile on this criterion, apparently having the capability to import only blank and tab delimited data files. However, it is possible to cut and paste data from Excel and SPSS worksheets directly into Minitab. Note also that my comments about R’s data import ability are relevant only within the framework of the Rcmdr menued interface. It may be possible to achieve this same ability from the command line, but I cannot confirm this. I would conclude that R has the best data importation ability (via Rcmdr), followed by SAS, SPSS, and Stata, with Minitab having by far the least capability in this respect.
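On the question of command-line import in R, here is a minimal sketch, assuming the `foreign` package (normally installed alongside R) is available. The data frame and file names are hypothetical; the example writes a small Stata file first so that it is self-contained, then reads it back.

```r
library(foreign)  # assumption: the recommended 'foreign' package is installed

# Hypothetical data, written out as a Stata file so the example needs no external file
scores <- data.frame(id = 1:5, score = c(2.5, 3.1, 4.0, 3.7, 2.9))
stata_file <- tempfile(fileext = ".dta")
write.dta(scores, stata_file)

# Import the Stata file from the command line
imported <- read.dta(stata_file)
print(imported)

# Other readers in the same package follow the same pattern
# (the file names below are placeholders, not real files):
# read.spss("survey.sav", to.data.frame = TRUE)  # SPSS
# read.mtp("worksheet.mtp")                      # Minitab portable worksheet
# read.xport("study.xpt")                        # SAS transport file
```

This only shows that the import functions can be called without a menu; it says nothing about how convenient they are compared with the Rcmdr dialogs.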

10. Ease of results exportation: One of the things I like very much about SPSS is its ability to export its results to Microsoft Word, Excel, and Powerpoint and to Adobe .pdf files. This ability has steadily improved since version 15, and currently is quite good in version 20, although there is still plenty of room for further improvements. R and Minitab have no such capability. SAS and Stata (version 12 only) can export results to Excel, but not to any other format. All four of these latter systems offer no other options except cut and paste and screen shots to export output to other documents. The bottom line on this criterion is that SPSS is head and shoulders above the other four systems in its ability to export its results in a range of useful formats. The capabilities of the other four systems are at best primitive.

11. Thoroughness and interpretability of results output: While we’re on the subject of the results output of these systems, it is convenient to address this next criterion relating to the quality of the information reported. This is not a question of accuracy, because all five systems produce results that pretty much agree with each other. The issue instead is the degree to which these systems produce all of the results most users would need from their various analyses. The results should not require a user to do additional computations manually in order to come up with commonly needed indexes. A prime example of this shortcoming is the results of the SPSS logistic regression analysis. It does not produce a good R2 analog index, and worse yet, it does not report the -2 log likelihood for the null model to permit the computation of such an index. This can be computed manually, but only with a hard-to-find formula. However, its output is well organized, and it does allow many options for the results that are output. In contrast to SPSS, SAS does produce such an R2 analog index, but its output is poorly formatted and often requires the user to search out and correctly identify the needed elements of its output. Stata reports the -2 log likelihood, which permits the easy computation of an R2 analog index. However, Stata often requires follow-up commands to get the same output that is automatically produced by other systems. R is more granular and requires explicit specification of the output desired, which can be a major pain. Minitab’s output is quite sparse with few options to enhance its standard outputs in some cases, but is very thorough in others (e.g., time series). So once again, we’re in the situation with these packages where none of them even approach perfection in all aspects of the thoroughness and interpretability of their output. I would rate them a toss-up on this criterion.
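For readers who want the manual computation mentioned above, one widely used R2 analog is McFadden's pseudo-R2. The text above does not say which index is meant, so treating it as McFadden's is an assumption, and the symbols are introduced here only for illustration. With LL_null the log likelihood of the intercept-only (null) model and LL_model that of the fitted model:

```latex
\[
R^2_{\text{McF}}
  = 1 - \frac{-2\,LL_{\text{model}}}{-2\,LL_{\text{null}}}
  = 1 - \frac{LL_{\text{model}}}{LL_{\text{null}}}
\]
```

With made-up values of -2LL_null = 200 and -2LL_model = 150, the index would be 1 - 150/200 = 0.25.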

12. Ease and flexibility of data set manipulation: Despite its position near the bottom of the list of criteria, I view this as one of the most important. A big part of conducting statistical analysis is selecting subsets of data, creating new variables, labeling, changing data types, and using intermediate results as the raw data for further analyses (e.g., computing the mean rental costs for each of 50 cities and then conducting analyses on those means). In the right hands, SPSS is absolutely superb in this respect. Its Output Management System is unmatched in being able to quickly produce intermediate data sets. Stata is a close second to SPSS on this criterion. SAS does not do these things easily; it can get these operations done, but mainly in command line mode rather than through its Analyst menu system. Its new IML language may offer some capabilities in this regard, but it is also basically just a new command line mode. R does these data set manipulations even less easily: its data manipulation capability is limited to command line operations and is very tedious. Minitab actually does these operations better than SAS or R, but it keeps generating separate worksheets each time until one becomes buried in worksheets. Thus, the clear winners here are SPSS and Stata, with the other three systems lagging way behind.
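To make the idea of an intermediate data set concrete, here is a minimal sketch of the rental-cost example written for R's command line. The data are simulated, and the names `rentals`, `city_means`, and the reference value of 1200 are hypothetical, chosen only for illustration.

```r
# Simulated data: names and values are hypothetical, for illustration only
set.seed(1)
rentals <- data.frame(
  city = rep(paste0("city", 1:50), each = 20),  # 50 cities, 20 listings each
  rent = round(rnorm(1000, mean = 1200, sd = 300))
)

# Step 1: collapse to one row per city holding that city's mean rent
city_means <- aggregate(rent ~ city, data = rentals, FUN = mean)

# Step 2: treat the 50 city means as the raw data for a further analysis
summary(city_means$rent)
t.test(city_means$rent, mu = 1200)  # e.g., compare the means with a reference value
```

Each of the other packages has its own route to the same intermediate table; the sketch shows only what the operation produces, not which package makes it easiest.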

13. Pricing for individuals: When considering the accessibility of statistical software for individuals, for those of us in the bottom 99%, price is the most important consideration. Four of the five packages under consideration here have pretty much managed to set their prices at levels beyond the reach of most of us. Here are the prices for individual (non-educational) purchases of the basic packages:

  • SAS Analytics Pro: $8,500

  • SPSS Professional: $10,300
    • Includes:
      • IBM SPSS Statistics Base
      • IBM SPSS Advanced Statistics
      • IBM SPSS Categories
      • IBM SPSS Custom Tables
      • IBM SPSS Data Preparation
      • IBM SPSS Decision Trees
      • IBM SPSS Forecasting
      • IBM SPSS Missing Values
      • IBM SPSS Regression
  • SPSS Premium: $15,300
    • Includes (in addition to the components of SPSS Professional):
      • IBM SPSS Bootstrapping
      • IBM SPSS Complex Samples
      • IBM SPSS Conjoint
      • IBM SPSS Direct Marketing
      • IBM SPSS Exact Tests
      • IBM SPSS Amos
      • IBM SPSS SamplePower
      • IBM SPSS Visualization Designer
  • Minitab: $1,395

  • Stata SE: $1,695 (with pdf documentation)

SPSS is the most exorbitantly priced package. However, it should be noted that every version of SPSS from 12 up to 21, with all components activated, has become available for free download through file sharing networks within a month or so of its public release. I for one do not suffer under the delusion that this was due to a combination of lax security and extraordinary hacker skills. It is patently obvious that SPSS has used this strategy to create a huge worldwide loyal user base. SPSS knew that when these people went to work for companies, they would demand continued access to their software in their workplaces, thereby ensuring continued demand for their product in the business sector. They matched this with low pricing for the educational and non-profit sectors. This has been an enormously successful strategy. However, I have seen clues that the corporate barons of IBM may intend to end this backdoor access to SPSS with the release of version 22. Each copy of this next version may only be able to be run on one computer and its license will have to be renewed annually. SAS has used the latter strategy for about 10 years and all the vaunted hackers in the world have been completely stymied in their efforts to crack it.

The stupidity of such a decision, if indeed it has been made, is amplified by its timing, which coincides with the rapid rise in popularity of a free alternative in the form of R. If IBM does intend to extract an annual pound of flesh from individual users in future releases, this will further motivate the open source community to redouble its efforts to advance the development of an easy-to-use menued interface for R that equals or exceeds that of SPSS. This will guarantee that SPSS will join SAS on the path to oblivion.

Thus, it is my conclusion that from this point onward, the only “world class” statistical software from the standpoint of affordability for individuals (and for many small businesses) is R.

14. Thoroughness and informativeness of documentation: There is definitely some variability between the different statistical packages on this criterion. Probably the worst of the lot is the documentation for R, primarily in the form of An Introduction to R. The writing in this document is largely impenetrable and is of virtually no help in learning to use R. Even the books on R which supposedly offer guidance for the complete beginner seem to rapidly descend into the murk. Witness the following excerpt from one of the more popular beginner’s books (Crawley’s The R Book):

We consider the density shown in the 2D three-modal density, and calculate first a piecewise constant function object representing this function, and then calculate the level set tree.

N<-c(35,35) # size of the grid
pcf<-sim.data(N=N,type="mulmod") # piecewise constant function
lst.big<-leafsfirst(pcf) # level set tree

We may make the volume plot with the command ''plotvolu(lst)''. However, it is faster first to prune the level set tree, and then plot the reduced level set tree. Function ''treedisc'' takes as the first argument a level set tree, as the second argument the original piecewise constant function, and the 3rd argument ''ngrid'' gives the number of levels in the pruned level set tree. We try the number of levels ngrid=100.

lst<-treedisc(lst.big,pcf,ngrid=100)


It says something about the need for a menued interface when even the books attempting to elucidate the impenetrable syntax are themselves impenetrable.

The documentation of SPSS is barely adequate. It fails to give any justification for its choice of procedures and indexes where multiple alternatives exist. For example, its normality tests are limited to the Kolmogorov-Smirnov and the Shapiro-Wilk. Other well-established tests exist, including the Anderson-Darling and Jarque-Bera tests, yet no mention is made of them. Formulae are not provided, the examples are skimpy, and no explanations are offered to help in the choice between alternative procedures (e.g., 18 different procedures for correcting for familywise type I error in post hoc comparisons are offered without a word of guidance as to the relative merits or appropriateness of each in different situations).

Minitab doesn’t provide a manual, but rather only an embedded help system. It does have a “Meet Minitab” guide, but this doesn’t document the various analyses available. I’d have to characterize the documentation for this system as barely adequate. The software itself is quite good and it deserves better documentation.

SAS is undoubtedly the most heavily documented of any statistical package, being the subject not only of its own internal user manuals but also of a large number of independent books. My experience with the user manuals has been very positive. They are detailed and well-written, and they use examples very well. On the downside, they are so voluminous that it could take years to absorb all of the nuances of the commands even for just the more common procedures. This is not a criticism of the documentation per se but rather of the degree of complexity of the system itself that requires such extensive documentation.

Finally we come to Stata. This system has just the right amount of detail in its explanations, offers informative justifications for its choices, makes effective use of examples, and achieves a high degree of clarity in its instructions. Its instructions are also self-contained, meaning that I seldom have to go off hunting somewhere else in the manual or help system to figure out what is meant by something in the instructions for the procedure I want to carry out. For me, Stata’s documentation is the most effective of any in conveying everything I need to know about running its analyses.
_______________________

* What qualifies me to perform this role? I started using statistical software 37 years ago with BMDP, probably the first example of statistical software. I then proceeded to master a wonderful system called MIDAS, which had been developed for the University of Michigan’s Computing Center. If it were still available, I’d be using it today. I learned SAS and SPSS in the early 1980s during my work for the District of Columbia government. As the PC versions of these programs began to emerge in the late 1980s, I was one of the earliest adopters. In the late 1990s continuing improvements in the SPSS user interface caused me to shift most of my work from SAS to SPSS. I still occasionally use SAS, but only when forced to by a student who is required to use it for his/her coursework. I have also used Minitab and Stata extensively. I have years of experience in using AMOS and EQS for structural equation modeling (LISREL is still too much of a pain). I have used Matlab extensively for mathematical analysis. I have also used R, JMP, Systat, StatsDirect, MedCalc, NCSS, StatXact, Unistat, and Megastat. Finally, I have written many programs for specialized statistical procedures, including one commercial product (Monte Carlo/PC). I’ll let you judge whether you think this experience qualifies me to offer you guidance on statistical software. If you don’t think so, I guess you can stop reading further.