Just wanted to note that for traditional SQL implementations (e.g., MySQL, MS-SQL), the
Group By statement used to aggregate a variable by certain variable(s) returns 1 row for each group. When a column that is not unique within a group is also selected, then the row that’s returned is determined somehow by the DB software.
In contrast, SAS’s
Proc SQL will return multiple rows for each group (the number of original rows), with the aggregated variable repeated for each row within a group. Here’s an example:
data foo ; infile datalines dlm=" " ; input name $ week $ sales ; datalines ; bob 1 20000 bob 2 30000 jane 1 40000 jane 2 50000 mike 1 60000 mike 2 70000 kevin 1 80000 kevin 2 90000 ; run ; proc sql ; create table foo_agg as select a.name , a.week , sum(a.sales) as total_sales from foo as a group by name ; quit ; run ; proc export data=foo_agg outfile="foo_agg.csv" DBMS=csv REPLACE ; run ;
The content of
foo_agg.csv looks like
name,week,total_sales bob,2,50000 bob,1,50000 jane,1,90000 jane,2,90000 kevin,1,170000 kevin,2,170000 mike,2,130000 mike,1,130000
An analogous return from the SQL code in MySQL or MS-SQL might look something like
name,week,total_sales bob,2,50000 jane,1,90000 kevin,1,170000 mike,2,130000
Proc SQL, one would need to use the
Select Distinct statement in order to remove the duplicate rows.
Note that when combining the
Group By statement with a
Join, these multiple records per group still hold.
SAS’s implementation is not necessarily bad as it gives the user’s more flexibility in returning an aggregated variable with every row without re-joining the aggregated table with the original table. The user just has to remember this behavior ;).