Michael A. Raithel's Blog: Hack 3.12 Identifying Duplicate Variable Values with PROC MEANS

SAS Programming Professionals,

Did you know that you can use the MEANS procedure to identify duplicate variable values in a SAS data set?

This can come in very handy when you need to QC a SAS data set where you know that a particular variable is supposed to have unique values, or where the combined values of a set of variables is supposed to be unique.

Here is an example:

%MACRO IDENTIFY_DUPES(LIBREF=, DSNAME=, VARLIST=);

proc means data=&LIBREF..&DSNAME nway noprint missing;

class &VARLIST;

output out=duplicates(drop=_type_ rename=(_freq_ =

duplicate_count)

where=(duplicate_count > 1) ) sum=;

run;

proc print data=duplicates noobs label;

var &VARLIST duplicate_count;

title1 "Duplicate Values in the &DSNAME SAS Data Set";

title2 "Duplicate Count for Variables: &VARLIST";

label duplicate_count = "Duplicate Count";

run;

%MEND IDENTIFY_DUPES;

%IDENTIFY_DUPES(LIBREF=SASHELP, DSNAME=prdsal2,

VARLIST=country county prodtype year);

This macro accepts three parameters:

· LIBREF – The libref of the SAS data library that contains our target data set

· DSNAME – The name of our target SAS data set

· VARLIST – The list of variables whose combination we want to check for duplicate values.

The MEANS procedure specifies our target SAS data set by using the LIBREF and DSNAME macro variables. The NWAY option states that we only want statistics for the unique combinations of CLASS variables. NOPRINT suppresses that pesky, unneeded list output. The MISSING option directs PROC MEANS to consider missing values when computing statistics. We use the list of variables in the &VARLIST macro in the CLASS statement, so the procedure only calculates statistics for those specific variables.

In the OUTPUT statement, we rename PROC MEAN’s _FREQ_ variable to DUPLICATE_COUNT, and only keep the summarized observations whose DUPLICATE_COUNT is greater than 1. The SUM= option directs PROC MEANS to summarize the number of observations it finds for each combination of variables and store that number in the _FREQ_ variable.

The PRINT procedure creates a report of duplicate variable values surfaced by the MEANS procedure.

Here is part of the resulting list output from the PROC PRINT:

Duplicate Values in the prdsal2 SAS Data Set
Duplicate Count for Variables: country county prodtype year
Country	County	Product Type	Year	Duplicate Count
Canada		FURNITURE	1995	576
Canada		FURNITURE	1996	576
Canada		FURNITURE	1997	576
Canada		FURNITURE	1998	576
Canada		OFFICE	1995	576
Canada		OFFICE	1996	576
Canada		OFFICE	1997	576
Canada		OFFICE	1998	576

If you are concerned about duplicate variable values in your SAS data sets, this is the macro for you!

Best of luck in all your SAS endeavors!

----MMMMIIIIKKKKEEEE

(aka Michael A. Raithel)

Excerpt from the book: Did You Know That? Essential Hacks for Clever SAS Programmers

http://www.amazon.com/Michael-A.-Raithel/e/B001K8GG90/ref=ntt_dp_epwbk_0

I plan to post each and every one of the hacks in the book to social media on a weekly basis. Please pass them along to colleagues who you know would benefit.

Michael A. Raithel's Blog

Monday, July 11, 2016

Hack 3.12 Identifying Duplicate Variable Values with PROC MEANS

No comments:

Post a Comment