Monday, July 11, 2016

Hack 3.12 Identifying Duplicate Variable Values with PROC MEANS

SAS Programming Professionals,

Did you know that you can use the MEANS procedure to identify duplicate variable values in a SAS data set?

This can come in very handy when you need to QC a SAS data set where you know that a particular variable is supposed to have unique values, or where the combined values of a set of variables is supposed to be unique.

Here is an example:

%MACRO IDENTIFY_DUPES(LIBREF=, DSNAME=, VARLIST=);

proc means data=&LIBREF..&DSNAME nway noprint missing;
      class &VARLIST;
      output out=duplicates(drop=_type_ rename=(_freq_ = 
                 duplicate_count)
                  where=(duplicate_count > 1) ) sum=;
run;

proc print data=duplicates noobs label;
      var &VARLIST duplicate_count;

title1 "Duplicate Values in the &DSNAME SAS Data Set";

title2 "Duplicate Count for Variables: &VARLIST";

label duplicate_count = "Duplicate Count";

run;

%MEND IDENTIFY_DUPES;

%IDENTIFY_DUPES(LIBREF=SASHELP, DSNAME=prdsal2,
    VARLIST=country county prodtype year);

This macro accepts three parameters:

·        LIBREF – The libref of the SAS data library that contains our target data set
·        DSNAME – The name of our target SAS data set
·        VARLIST – The list of variables whose combination we want to check for duplicate values.

The MEANS procedure specifies our target SAS data set by using the LIBREF and DSNAME macro variables.  The NWAY option states that we only want statistics for the unique combinations of CLASS variables.  NOPRINT suppresses that pesky, unneeded list output.  The MISSING option directs PROC MEANS to consider missing values when computing statistics.  We use the list of variables in the &VARLIST macro in the CLASS statement, so the procedure only calculates statistics for those specific variables.

In the OUTPUT statement, we rename PROC MEAN’s _FREQ_ variable to DUPLICATE_COUNT, and only keep the summarized observations whose DUPLICATE_COUNT is greater than 1.  The SUM= option directs PROC MEANS to summarize the number of observations it finds for each combination of variables and store that number in the _FREQ_ variable.

The PRINT procedure creates a report of duplicate variable values surfaced by the MEANS procedure.

Here is part of the resulting list output from the PROC PRINT:



Duplicate Values in the prdsal2 SAS Data Set
Duplicate Count for Variables: country county prodtype year
Country
County
Product Type
Year
Duplicate
Count

Canada

FURNITURE
1995
576

Canada

FURNITURE
1996
576

Canada

FURNITURE
1997
576

Canada

FURNITURE
1998
576

Canada

OFFICE
1995
576

Canada

OFFICE
1996
576

Canada

OFFICE
1997
576

Canada

OFFICE
1998
576



If you are concerned about duplicate variable values in your SAS data sets, this is the macro for you!

Best of luck in all your SAS endeavors!
----MMMMIIIIKKKKEEEE
(aka Michael A. Raithel)

Excerpt from the book:  Did You Know That?  Essential Hacks for Clever SAS Programmers

I plan to post each and every one of the hacks in the book to social media on a weekly basis.  Please pass them along to colleagues who you know would benefit.

No comments:

Post a Comment