Data disclosure control for descriptive statistics

1. Create artificial data

Create artificial data for eleven IDs (1-11), ten years (2001-2010), four countries (BE, CH, US, GB), and the continuous variable x.

clear
set obs 301
generate hvcountry = int((_n-1)/100)+1
generate hv     = _n-100*(hvcountry-1)
generate id     = int((hv-1)/10)+1
generate year   = 2000+hv-10*(id-1)
drop hv

generate str2 country = "BE" if hvcountry==1
replace       country = "CH" if hvcountry==2
replace       country = "US" if hvcountry==3
replace       country = "GB" if hvcountry==4
drop hvcountry

set seed 4869382
generate x = runiform()
label variable x "Default right hand side variable"

Creating missing values

Yearly dummies are often used as catch-all dummies. However, a dummy for the year 2002 would identify IDs 1 and 2. The coefficient for the 2002 dummy thus causes a disclosure problem.

replace x = . if year==2002 & id > 2      
replace x = . if _n==301

Creating problematic identifiers

Create an ID that must be ignored (since x is missing, see section “Creating missing values”).

replace id = 11 if _n==301      

Dummy if id==1. Publishing information on specific firms is not permitted.

generate id1 = id==1         

Creating problematic dummies

Dummy id/year/country: 1/2005/BE, 6/2007/BE. This dummy is 1 for just two distinct IDs. Therefore, it may not be published.

generate dum_2ids   = _n==5|_n==57 
label variable dum_2ids "Dummy for two distinct IDs"   

Dummy id/year/country: 1/2005/BE, 6/2007/BE, 7/2003/BE, 8/2008/CH, 10/2009/BE. This dummy is 1 for five distinct IDs. It may be published.

generate dum_5ids = _n==5|_n==57|_n==63|_n==99|_n==178 
label variable dum_5ids "Dummy for five distinct IDs"

Creating data for percentiles option (xpc={1,…,99})

set seed 9385542
generate u = runiform()
sort u
drop u
generate int  xpc   = _n if _n<=99

Problematic range

generate byte pcdum = (xpc>50 & xpc<=60) 

Only four distinct identifiers in interval (p50,p60]: 4, 5, 8, 10

replace id =  4 if pcdum & (id==1|id==3|id==6|id==7)
replace id =  3 if xpc==67|xpc==74|xpc==82|xpc==87
replace id =  2 if xpc==17|xpc==33|xpc==93|xpc==94|xpc==99
replace id =  6 if xpc==77|xpc==78
replace id = 10 if xpc== 1|xpc== 7
replace id =  7 if id==5 & xpc!=8 & xpc!=57 & xpc!=.
replace id =  1 if id==8 & xpc!=58 & xpc!=.

Trigger 1: If set, less than 15 percent outside (p50,p60]

replace id =  9 if xpc==40  

Trigger 2: If set, only four distinct identifiers within (p90,max]

replace id =  3 if xpc==92 
label variable xpc "Variable for non deterministic categories"

Creating variables for dominance check

generate byte xfordom = 5 if id==1
replace       xfordom = 4 if id==2
replace       xfordom =-4 if id==2 & country=="US"
replace       xfordom = 1 if id==3
replace       xfordom =10 if id==3 & country=="CH" & year>=2009
label variable xfordom "Variable for dominance"

Creating variables for the treatment of zeros

generate xwithzeros = runiform()
replace  xwithzeros = 0 if id==3 & year<2009

Other

cls

2. Applying nobsdes5

nobsdes5 without options

Only on rare occasions will you be asked for, or be interested in, the number of distinct IDs unrelated to any other variable. Normally, confidentiality has to be assessed with respect to a continuous or discrete variable. The artificial data contain eleven distinct identifiers, but because x is missing for id==11, only ten distinct identifiers have non-missing values of x. Since version 2.0.0, a list of continuous variables may be specified; a list of identifiers is not possible. For dichotomous or categorical variables, use the by option, which is explained in the next step.

. nobsdes5 id  
Warning Is id an identifier? 
Number of distinct values for identifier  id :    11 
. nobsdes5 id x

Output control
No problem with dominance and sufficient number of distinct IDs (id) (=nobs) of 
variable:
                                                             nobs
x             -            Default right hand side variable    10
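
The count reported above can be cross-checked with built-in commands. The following is a minimal sketch and not the implementation of nobsdes5; the helper variable chk_tag is an assumption introduced for illustration. For the artificial data, the result should agree with the nobs value shown above.

```stata
* Cross-check with built-in commands (sketch, not the nobsdes5 implementation).
* chk_tag is a hypothetical helper variable.
egen byte chk_tag = tag(id) if !missing(x)   // tags one observation per ID with non-missing x
count if chk_tag == 1                        // number of distinct IDs with non-missing x
drop chk_tag
```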

Example for a varlist

. nobsdes5 id x xpc

Output control
No problem with dominance and sufficient number of distinct IDs (id) (=nobs) of 
variable:
                                                             nobs
x             -            Default right hand side variable    10
xpc           - Variable for non deterministic categorie ..    10

How to use the BY option

In 2002, x is non-missing only for id==1 and id==2.

To see whether there is a problem at all, for large tables you may use the notab option in the first step.

. nobsdes5 id x, by(year) notab 

Output control

D I S C L O S U R E problem: 
x             -            Default right hand side variable     Share of largest two IDs > 85% and number of distinct IDs (id)    too small

Now you know there is a problem, but you do not know the cause. Thus, in the second step, do not use notab.

. nobsdes5 id x, by(year)

Output control for variable x [Default right hand side variable]
D I S C L O S U R E problem: 
Share of largest two IDs > 85% 
Number of distinct IDs (id) of variable x for each year :

 ---------------------
  year    nobs     CR2
 ---------------------
  2001      10      33
  2002       2     100
  2003      10      35
  2004      10      39
  2005      10      30
  2006      10      30
  2007      10      32
  2008      10      33
  2009      10      30
  2010      10      31
 ----------------------

Now you know that the problem is the year 2002. In this case, there seems to be a simple way to obtain releasable output: omit the year 2002.

. table year if year!=2002, statistic(total x)

------------------
        |    Total
--------+---------
year    |         
  2001  |  14.7175
  2003  |  14.2114
  2004  |  13.5246
  2005  |  12.1655
  2006  |  16.4135
  2007  |  13.7453
  2008  |  14.2557
  2009  |  14.2887
  2010  |  13.0675
  Total |   126.39
------------------

However, if you drop just one row, the hidden content may be inferred from the disclosed rows and the total. You can investigate this using the generate option.

. nobsdes5 id x, by(year) notab generate replace

Output control

D I S C L O S U R E problem: 
x             -            Default right hand side variable     Share of largest two IDs > 85% and number of distinct IDs (id)    too small
. egen hvnobs = sum(tg_id_x),   by(year)
. egen hvcr2  = mean(cr2_id_x), by(year)
. nobsdes5 id x if (hvnobs<=4 | hvcr2>85)

Output control

D I S C L O S U R E problem: 
x             -            Default right hand side variable     Share of largest two IDs > 85% and number of distinct IDs (id)    too small
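
Why hiding a single row is not enough can be illustrated with built-in commands. This is a minimal sketch that assumes a grand total covering all years were published next to the yearly rows; the scalar names grand and shown are hypothetical.

```stata
* Sketch: recover a suppressed row from a published grand total.
* grand and shown are hypothetical scalar names.
quietly summarize x
scalar grand = r(sum)                 // total over all years, including 2002
quietly summarize x if year != 2002
scalar shown = r(sum)                 // sum of the rows that would be published
display "Implied total for 2002: " grand - shown
```

Anyone who sees both figures can perform the same subtraction, which is why a second row has to be hidden or the categories have to be aggregated.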

Thus, you must hide at least two rows, or, better yet, aggregate.

. quietly generate int bisyear = 2002 if (year==2001|year==2002)
. quietly replace  bisyear = 2004 if (year==2003|year==2004)
. quietly replace  bisyear = 2006 if (year==2005|year==2006)
. quietly replace  bisyear = 2008 if (year==2007|year==2008)
. quietly replace  bisyear = 2010 if (year==2009|year==2010)
. nobsdes5 id x, by(bisyear)

No problem with dominance 
Number of distinct IDs (id) of variable x for each bisyear :

----------------------
  bisyear |  sum(nobs)
----------+-----------
     2002 |         10
     2004 |         10
     2006 |         10
     2008 |         10
     2010 |         10
----------------------
. table bisyear, statistic(total x) nototals

----------------------
  bisyear |     sum(x)
----------+-----------
     2002 |    17.7986
     2004 |   27.73601
     2006 |   28.57901
     2008 |   28.00096
     2010 |   27.35627
----------------------
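
The two-year pooling above can also be written as a single expression. This is a sketch only; bisyear2 is a hypothetical variable name, and the assert merely checks that it coincides with the bisyear variable created above.

```stata
* Sketch: pool years into two-year periods labelled by the later year.
* bisyear2 is a hypothetical name used for illustration.
generate int bisyear2 = 2 * ceil(year / 2)   // 2001, 2002 -> 2002; 2003, 2004 -> 2004; ...
assert bisyear2 == bisyear                   // stops with an error if the two codings differ
drop bisyear2
```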

The by option is mostly used with categorical data, but it may also be used with “sparse” continuous data, for example external tax data. A few tax rates may apply to only one or two firms, which creates a confidentiality problem. You can check this using nobsdes5 id, by(taxrate). If you include a variable denoting region, you have to use the byregio option. Also use the by option with dummy variables: since the coding of 0 and 1 can be swapped, both categories have to be checked.

. nobsdes5 id, by(dum_2ids)
Warning Is id an identifier?
Output control for variable dum_2ids [Dummy for two distinct ID ..]
D I S C L O S U R E problem: 
Number of distinct values for identifier id for eachdum_2ids :

  +-----------------+
  | dum_2ids   nobs |
  |-----------------|
  |        0     11 |
  |        1      2 |
  +-----------------+

Zeros

In the case of continuous variables, zeros can occur for a certain entity always, sometimes, or never. If zeros always occur, then that particular entity never conducts the respective type of business. Some readers may be aware of this. These entities are not counted. If zeros instead occur infrequently, this could mean that the entity generally conducts the type of business captured by this variable, but sometimes does not. It is presumed that competitors are unaware of this. These entities could be counted. Since it is difficult to define a threshold for the fraction of zeros above which an entity is not counted, we always treat zeros as missing values for pragmatic reasons when output control is carried out for descriptive statistics. Use the miss option. Researchers who have strong arguments in favour of keeping the zeros have to explain them.

In the case of dummy variables and categorical variables, a zero has a different meaning. Here, zeros are valid observations.

. nobsdes5 id xwithzeros, by(year) 
Warning Zero values for variable xwithzeros found -> Should you use option miss(0) ?

Output control for variable xwithzeros []
No problem with dominance 
Number of distinct IDs (id) of variable xwithzeros for each year :

  +-------------+
  | year   nobs |
  |-------------|
  | 2001     11 |
  | 2002     10 |
  | 2003     10 |
  | 2004     10 |
  | 2005     10 |
  | 2006     10 |
  | 2007     10 |
  | 2008     10 |
  | 2009     10 |
  | 2010     10 |
  +-------------+
. nobsdes5 id xwithzeros, by(year) miss(0)

Output control for variable xwithzeros []
No problem with dominance 
Number of distinct IDs (id) of variable xwithzeros for each year :

  +-------------+
  | year   nobs |
  |-------------|
  | 2001     10 |
  | 2002      9 |
  | 2003      9 |
  | 2004      9 |
  | 2005      9 |
  | 2006      9 |
  | 2007      9 |
  | 2008      9 |
  | 2009     10 |
  | 2010     10 |
  +-------------+
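
To see which identifiers account for the difference between the two tables, the share of zeros per identifier can be inspected with built-in commands. This is a minimal sketch; zshare and idtag are hypothetical helper variables, and no threshold is implied.

```stata
* Sketch: share of zero values per identifier (1 = always zero, 0 = never zero).
* zshare and idtag are hypothetical helper variables.
egen double zshare = mean(xwithzeros == 0) if !missing(xwithzeros), by(id)
egen byte idtag = tag(id) if !missing(zshare)      // one row per identifier
list id zshare if idtag == 1 & zshare > 0, noobs   // identifiers with at least one zero
drop zshare idtag
```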

Dominance

There are three different identifiers with values for xfordom in each country. Of course, you should have at least five.

. table id country if xfordom!=., statistic(total xfordom)

----------------------------------
        |          country        
        |   BE    CH    US   Total
--------+-------------------------
id      |                         
  1     |   55    50    70     175
  2     |   44    48   -48      44
  3     |   13    30     9      52
  Total |  112   128    31     271
----------------------------------

Assume that id==2 knows that there are two other IDs and that id==1 is much larger than id==3. If you publish the sum, id==2 can estimate the value of xfordom for id==1 by subtracting its own value from the total.
For country==“BE” it is near 112-44 = 68
For country==“CH” it is near 128-48 = 80
In the case of “US”, id==3 is the second largest firm and can make the same calculation.
For country==“US” it is near 31- 9 = 22

. nobsdes5 id xfordom, by(country)

Output control for variable xfordom [Variable for dominance]
D I S C L O S U R E problem: 
Share of largest two IDs > 85% and ≤ 100%
Number of distinct IDs (id) of variable xfordom for each country :

  +----------------------+
  | country   nobs   CR2 |
  |----------------------|
  |      BE      3    88 |
  |      CH      3    77 |
  |      US      3   255 |
  +----------------------+
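
The CR2 column can be reproduced by hand. The sketch below assumes that CR2 is the sum of the two largest ID totals divided by the group total, in percent. The source does not state this definition, but it is consistent with the figures above: (55+44)/112 = 88, (50+48)/128 = 77 and (70+9)/31 = 255. The value above 100 for US arises because the negative total of id==2 shrinks the denominator. The names top2, grp_total, nobs and cr2 are hypothetical.

```stata
* Sketch of a manual dominance check (assumed CR2 definition, not the nobsdes5 code).
preserve
keep if !missing(xfordom)
collapse (sum) xfordom, by(country id)        // one row per ID within country
bysort country (xfordom): generate double top2 = xfordom[_N] + xfordom[_N-1]
by country: egen double grp_total = total(xfordom)
by country: generate int nobs = _N            // number of distinct IDs per country
generate cr2 = round(100 * top2 / grp_total)  // share of the two largest IDs in percent
by country: keep if _n == 1
list country nobs cr2, noobs
restore
```

Groups with a single ID yield a missing top2, which is harmless here because such groups fail the minimum-count criterion anyway.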

The data is now sorted.

. describe

Contains data
  obs:           301                          
 vars:            18                          
-------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------------------
id              float   %9.0g                 
year            float   %9.0g                 
country         str2    %9s                   
x               float   %9.0g                 
id1             float   %9.0g                 
dum_2ids        float   %9.0g                 
dum_5ids        float   %9.0g                 
xpc             int     %8.0g                 
pcdum           byte    %8.0g                 
xfordom         byte    %8.0g                 
xwithzeros      float   %9.0g                 
tg_id_x         byte    %8.0g                 
cr2_id_x        float   %5.1f                 
hvnobs          float   %9.0g                 
hvcr2           float   %9.0g                 
bisyear         int     %8.0g                 
cutp            float   %9.0g                 percentiles of xpc
xpcbins         byte    %8.0g                 xpc categorized by cutp
-------------------------------------------------------------------------------
Sorted by: year
     Note: Dataset has changed since last saved.

Maximum and minimum

Maximum and minimum are single observations. Therefore, they must not be shown. If necessary, the RDSC allows the publication of approximate values calculated as averages of a sufficient number of observations (top coding). For maximum and minimum, at least five distinct entities must be covered and the share of the two largest entities must not exceed 85 percent of the total (dominance criterion).

Please use the .ado file maxrdsc to address this problem. If a table or cross-tabulation is desired, use nobsdes5 after completing some preparatory steps.

. maxrdsc id x, min(12) max(12)     

No problems for maximum of x. Average maximum based on  5  distinct ids:  .9832249780495962
No problems for minimum of x. Average minimum based on  5  distinct ids:  .0167608045856468
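
The idea behind top coding can be sketched with built-in commands. The sketch assumes that max(12) means that the 12 largest observations are averaged, which the source does not state, so it need not reproduce the figures printed by maxrdsc. The helper variable toptag is hypothetical.

```stata
* Sketch of top coding under the stated assumption (not the maxrdsc implementation).
preserve
keep if !missing(x)
gsort -x                          // largest values first
keep in 1/12                      // assumed meaning of max(12)
egen byte toptag = tag(id)        // one tagged row per distinct ID
quietly count if toptag == 1
display "Distinct IDs among the 12 largest values: " r(N)
summarize x, meanonly
display "Average of the 12 largest values: " r(mean)
restore
```

The same steps with gsort x instead of gsort -x give the corresponding approximation for the minimum.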

Quantiles

Quantiles must be specified in ascending order. The RDSC treats quantiles as defining categories. The number of distinct identifiers between the minimum and the lowest percentile, between the highest percentile and the maximum, and between all other consecutive quantiles has to be larger than four, and the dominance criterion has to be met. Most often, quantiles are used to describe the shape of a distribution, and tied values are then not an issue. If quantiles serve to create groups instead, ties may be an issue.

. nobsdes5 id xpc if year==2010, pctile(5 50 90)

Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem: 
Share of largest two IDs >85% and <=100%
Number of distinct IDs (id) for each percentile (min<=p5<=p50<=p90<=max) of variable xpc :

  +---------------------+
  | pctile   nobs   CR2 |
  |---------------------|
  |    min      1   100 |
  |    p50      4    59 |
  |    max      3    75 |
  +---------------------+
. nobsdes5 id xpc, pctile(5 50 90) by(year) notab

Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem: 
Share of largest two IDs > 85% 
Number of distinct IDs (id) for the smallest percentile (min<=p5<=p50<=p90<=max) of variable xpc for year :  too small
. nobsdes5 id xpc, pctile(10 20 30 40 50 60 70 80 90) 

Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem: 
No problem with dominance 
Number of distinct IDs (id) for each percentile (min<=p10<=p20<=p30<=p40<=p50<=p60<=p70<=p80<=p90<=max) of variable xpc :

  +---------------------+
  | pctile   nobs   CR2 |
  |---------------------|
  |    p10      6    58 |
  |    p20      5    58 |
  |    p30      6    50 |
  |    p40      4    79 |
  |    p50      5    61 |
  |---------------------|
  |    p60      4    79 |
  |    p70      5    52 |
  |    p80      5    59 |
  |    p90      6    50 |
  |    max      4    66 |
  +---------------------+

Here, there are fewer than five distinct identifiers in three ranges.
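
The counts per decile range can be approximated with built-in commands. This is a minimal sketch, not the implementation of nobsdes5; it assumes that the ranges are closed on the right, as in the interval (p50,p60] used when the data were created. The helper variables decile and dtag are hypothetical, and the result need not match the table above exactly if nobsdes5 treats boundaries differently.

```stata
* Sketch: distinct IDs per decile range of xpc (not the nobsdes5 implementation).
* decile and dtag are hypothetical helper variables.
xtile decile = xpc, nquantiles(10)                    // 1 = up to p10, ..., 10 = above p90
egen byte dtag = tag(id decile) if !missing(decile)   // one tagged row per ID and range
tabulate decile if dtag == 1                          // frequency = distinct IDs per range
drop decile dtag
```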

Creating graphs

If you want to create a graph, you must document the fact that the graph is based on sufficient observations and that no two identifiers dominate. You must use nobsdes5 before you create the graph. We recommend saving the graph as a .png file.

. nobsdes5 id x, by(year country) notab

Output control

D I S C L O S U R E problem: 
x             -            Default right hand side variable     Share of largest two IDs > 85% and number of distinct IDs (id)    too small
. nobsdes5 id x, by(year country) 

Output control for variable x [Default right hand side variable]
D I S C L O S U R E problem: 
Share of largest two IDs > 85% 
Number of distinct IDs (id) of variable x for each year country :

  +-----------------------------+
  | year   country   nobs   CR2 |
  |-----------------------------|
  | 2001        BE      8    48 |
  | 2001        CH      9    41 |
  | 2001        US      8    37 |
  | 2002        BE      2   100 |
  | 2002        CH      2   100 |
  | 2002        US      2   100 |
  | 2003        BE      8    51 |
  | 2003        CH      9    33 |
  | 2003        US      9    39 |
  | 2004        BE      9    44 |
  | 2004        CH      9    43 |
  | 2004        US      9    56 |
  | 2005        BE      9    46 |
  | 2005        CH      9    41 |
  | 2005        US      9    42 |
  | 2006        BE      9    36 |
  | 2006        CH     10    34 |
  | 2006        US      8    39 |
  | 2007        BE      9    30 |
  | 2007        CH      9    42 |
  | 2007        US     10    47 |
  | 2008        BE      8    30 |
  | 2008        CH     10    37 |
  | 2008        US      8    63 |
  | 2009        BE      9    39 |
  | 2009        CH     10    32 |
  | 2009        US      7    43 |
  | 2010        BE      9    43 |
  | 2010        CH      9    46 |
  | 2010        US      9    37 |
  +-----------------------------+
. egen meanx  = mean(x) if year!=2002, by(year country)
. egen tgcountry = tag(year country)
. gsort -tgcountry year country
. scatter meanx year if country=="BE" & tgcountry
. * graph export graphname.png

Creating histograms

If you want to create a histogram, you must document the fact that the graph is based on sufficient observations and that no two identifiers dominate. You must use nobsdes5 before you create the histogram. We recommend saving the histogram as a .png file.

. nobsdes5 id xpc, histogram(10)              /* e.g. 10 bins */

Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem: 
No problem with dominance 
Number of distinct IDs (id) for the smallest or largest of 10 bins of variable xpc : too small

Note that the number of distinct IDs per bin differs from the height of the bins, which shows the number of observations.

. quietly histogram xpc, freq bin(10) addlabels

A varlist is possible since version 2.0.0.

. nobsdes5 id x xpc, histogram(25 10) 

This histogram example and the last percentile example are almost the same: in both, nine ranges contain ten observations and one range contains nine. In the percentile example, the last range is the one with only nine observations (the last digit of xpc runs from 1 to 0 in each range except for the last one). In the histogram example, the last digit runs from 1 to 0 in the first four ranges and then from 0 to 9, so the range corresponding to the one between p40 and p50 has only nine observations.

Example for the use of a varlist

. nobsdes5 id x xpc, histogram(25 10)

Output control for variable x [Default right hand side variable]
No problem with dominance 
Number of distinct IDs (id) for the smallest or largest of 25 bins of variable x : 6


Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem:
No problem with dominance 
Number of distinct IDs (id) for the smallest or largest of 10 bins of variable xpc : too small

A good alternative to a histogram is a kernel density. You have to show that the kernel density is based on at least five distinct IDs.

. kdensity xpc
. nobsdes5 id xpc

Output control
No problem with dominance and sufficient number of distinct IDs (id) (=nobs) of 
variable:
                                                             nobs
xpc           - Variable for non deterministic categorie ..    10

Aggregating data

Sometimes, researchers have data characterized by two or more identifiers, but their interest is in one identifier only. Thus, they integrate the second identifier out. Since the researchers have access to both identifiers, they have to check whether the marginalization violates statistical disclosure rules. If the RDSC calculates and provides the marginal data, the RDSC has to check. If the reporting entities report marginal data (for example, an MFI reports total loans to enterprises from the machinery industry, but not loans to individual enterprises), the reporting entity is responsible for this check.