Friday, March 11, 2016

Hack 2.8 Sorting Very Large SAS Data Sets Using Views


SAS Programming Professionals,

Did you know that you can sort very large SAS data sets with the judicious use of SAS views?

Once in a while, we all get a SAS data set so large that our computer does not have enough space to sort it.  One way around this problem is to use SAS views to break the original data set into segments that can be sorted separately and then reassembled into a final sorted data set.  An overview of this methodology is:

  •  Create several views which subset the data set into segments aligned with a subgroup values of the sort variables.
  • Input each of the views into its own PROC SORT and write the output to individual data sets.
  • Use the APPEND procedure to append the sorted data sets into one large sorted data set.
  • Use the DATASETS procedure to assert the sort for the final SAS data set.
     Here is an example which uses the SASHELP.CLASS data set.  Yes, I do realize SASHELP.CLASS only has 19 observations, but it serves our overall purposes for illustrating the methodology.

              /*Create a data set that is not sorted by first name */
      proc sort data=sashelp.class out=work.class;
           by weight;
      run;

/* Create Views based on first letter of first name */

data class1 / view=class1;
set work.class; 
    where "A" <= substr(name,1,1) <= "G";
run;

data class2 / view=class2;
set work.class;
    where "H" <= substr(name,1,1) <= "R"; 
run;

data class3 / view=class3;
set work.class; 
    where "S" <= substr(name,1,1) <= "Z"; 
run;

/* Sort views out to SAS data sets*/

proc sort data=class1 out=class1_sorted;
     by name;
run;

proc sort data=class2 out=class2_sorted;
     by name;
run;

proc sort data=class3 out=class3_sorted;
     by name;
run;

/* Apppend sorted data sets and clean up after yourself*/
proc append base=class_sorted_by_name data=class1_sorted;
run;

proc append base=class_sorted_by_name data=class2_sorted;
run; 

proc append base =class_sorted_by_name data=class3_sorted;
run;

proc datasets library=work nolist; 

     modify class_sorted_by_name(sortedby= name);
     run;


     delete class1_sorted class2_sorted class3_sorted;

     delete class1(memtype=view) class2(memtype=view)
       class3(memtype=view);

quit;

The first thing we do is create a data set to use for this example.  We sort the CLASS data set by age to make sure it is not sorted by NAME.  Now, we are ready to go.

The first real step in this methodology is to create three DATA step views.  Each one of the views constructs a subset of the original SAS data set  by using a logical sequence of the key variable value we will be using to sort the data set.  (The number of observations returned by all three DATA step views will be equal to the number of observations in the original data set).  Once the views are created, each one is input to a sort.  The SORT procedure inputs the view; which in turn only feeds the related portion of the original data set into the sort.  That portion of the original data set is sorted by NAME and output to its own sorted data set.

We continue by using PROC APPEND to concatenate the sorted segments of our SAS data set together into a new SAS data set.  Finally, we use the DATASETS procedure to specify that our final SAS data set is sorted by NAME.  And, we are careful to clean up those messy, large, intermediate SAS data sets by deleting them.  For good measure, we delete the views too.

This is obviously a simplistic example.  But, a clever person such as yourself can doubtlessly extend it to sort the impossibly large real-world SAS data sets that drift into your own work-life!

Best of luck in all of your SAS endeavors!

----MMMMIIIIKKKKEEEE

(aka Michael A. Raithel)

Excerpt from the book:  Did You Know That?  Essential Hacks for Clever SAS Programmers


I plan to post each and every one of the hacks in the book to social media on a weekly basis.  Please pass them along to colleagues who you know would benefit.