SAS
Programming Professionals,
Did you know
that you can use the MEANS procedure to identify duplicate variable values in a
SAS data set?
This can come
in very handy when you need to QC a SAS data set where you know that a
particular variable is supposed to have unique values, or where the combined
values of a set of variables is supposed to be unique.
Here is an
example:
%MACRO
IDENTIFY_DUPES(LIBREF=, DSNAME=, VARLIST=);
proc means data=&LIBREF..&DSNAME
nway noprint missing;
class
&VARLIST;
output
out=duplicates(drop=_type_ rename=(_freq_ =
duplicate_count)
where=(duplicate_count
> 1) ) sum=;
run;
proc print data=duplicates noobs label;
var
&VARLIST duplicate_count;
title1 "Duplicate Values
in the &DSNAME SAS Data Set";
title2 "Duplicate Count
for Variables: &VARLIST";
label duplicate_count = "Duplicate
Count";
run;
%MEND IDENTIFY_DUPES;
%IDENTIFY_DUPES(LIBREF=SASHELP,
DSNAME=prdsal2,
VARLIST=country county prodtype year);
This macro
accepts three parameters:
·
LIBREF – The
libref of the SAS data library that contains our target data set
·
DSNAME – The
name of our target SAS data set
·
VARLIST –
The list of variables whose combination we want to check for duplicate values.
The MEANS
procedure specifies our target SAS data set by using the LIBREF and DSNAME
macro variables. The NWAY option states
that we only want statistics for the unique combinations of CLASS
variables. NOPRINT suppresses that
pesky, unneeded list output. The MISSING
option directs PROC MEANS to consider missing values when computing
statistics. We use the list of variables
in the &VARLIST macro in the CLASS statement, so the procedure only
calculates statistics for those specific variables.
In the
OUTPUT statement, we rename PROC MEAN’s _FREQ_ variable to DUPLICATE_COUNT, and
only keep the summarized observations whose DUPLICATE_COUNT is greater than
1. The SUM= option directs PROC MEANS to
summarize the number of observations it finds for each combination of variables
and store that number in the _FREQ_ variable.
The PRINT
procedure creates a report of duplicate variable values surfaced by the MEANS
procedure.
Here is part
of the resulting list output from the PROC PRINT:
Duplicate Values in the prdsal2 SAS Data Set
|
|||||
Duplicate Count for Variables: country county prodtype year
|
|||||
Country
|
County
|
Product
Type
|
Year
|
Duplicate
Count |
|
Canada
|
FURNITURE
|
1995
|
576
|
||
Canada
|
FURNITURE
|
1996
|
576
|
||
Canada
|
FURNITURE
|
1997
|
576
|
||
Canada
|
FURNITURE
|
1998
|
576
|
||
Canada
|
OFFICE
|
1995
|
576
|
||
Canada
|
OFFICE
|
1996
|
576
|
||
Canada
|
OFFICE
|
1997
|
576
|
||
Canada
|
OFFICE
|
1998
|
576
|
If you are
concerned about duplicate variable values in your SAS data sets, this is the
macro for you!
Best of luck
in all your SAS endeavors!
----MMMMIIIIKKKKEEEE
(aka Michael
A. Raithel)
Excerpt from
the book: Did You Know That? Essential
Hacks for Clever SAS Programmers
I plan to post each and every one of the hacks in the book to social media on a weekly basis. Please pass them along to colleagues who you know would benefit.
No comments:
Post a Comment