Just wanted to note that for traditional SQL implementations (e.g., MySQL, MS-SQL), the Group By statement used to aggregate a variable by certain variable(s) returns 1 row for each group. When a column that is not unique within a group is also selected, then the row that’s returned is determined somehow by the DB software.
In contrast, SAS’s Proc SQL will return multiple rows for each group (the number of original rows), with the aggregated variable repeated for each row within a group. Here’s an example:
data foo ;
infile datalines dlm=" " ;
input name $ week $ sales ;
datalines ;
bob 1 20000
bob 2 30000
jane 1 40000
jane 2 50000
mike 1 60000
mike 2 70000
kevin 1 80000
kevin 2 90000
;
run ;
proc sql ;
create table foo_agg as
select
a.name
, a.week
, sum(a.sales) as total_sales
from foo as a
group by name
;
quit ;
run ;
proc export data=foo_agg
outfile="foo_agg.csv"
DBMS=csv
REPLACE
;
run ;
The content of foo_agg.csv looks like
name,week,total_sales
bob,2,50000
bob,1,50000
jane,1,90000
jane,2,90000
kevin,1,170000
kevin,2,170000
mike,2,130000
mike,1,130000
An analogous return from the SQL code in MySQL or MS-SQL might look something like
name,week,total_sales
bob,2,50000
jane,1,90000
kevin,1,170000
mike,2,130000
In SAS’s Proc SQL, one would need to use the Select Distinct statement in order to remove the duplicate rows.
Note that when combining the Group By statement with a Join, these multiple records per group still hold.
SAS’s implementation is not necessarily bad as it gives the user’s more flexibility in returning an aggregated variable with every row without re-joining the aggregated table with the original table. The user just has to remember this behavior ;).