Create artificial data for eleven IDs (1-11), ten years (2001-2010), four countries (BE, CH, US, GB), and the continuous variable x.
clear
set obs 301
generate hvcountry = int((_n-1)/100)+1
generate hv = _n-100*(hvcountry-1)
generate id = int((hv-1)/10)+1
generate year = 2000+hv-10*(id-1)
drop hv
generate str2 country = "BE" if hvcountry==1
replace country = "CH" if hvcountry==2
replace country = "US" if hvcountry==3
replace country = "GB" if hvcountry==4
drop hvcountry
set seed 4869382
generate x = runiform()
label variable x "Default right hand side variable"
Yearly dummies are often used as catch-all dummies. However, a dummy for the year 2002 would identify IDs 1 and 2. The coefficient for the 2002 dummy thus causes a disclosure problem.
replace x = . if year==2002 & id > 2
replace x = . if _n==301
Create an ID that must be ignored (since x is missing, see section “Creating missing values”).
replace id = 11 if _n==301
Dummy if id==1. Publishing information on specific firms is not permitted.
generate id1 = id==1
Dummy id/year/country: 1/2005/BE, 6/2007/BE. This dummy is 1 for just two distinct IDs. Therefore, it may not be published.
generate dum_2ids = _n==5|_n==57
label variable dum_2ids "Dummy for two distinct IDs"
Dummy id/year/country: 1/2005/BE, 6/2007/BE, 7/2003/BE, 8/2008/CH, 10/2009/BE. This dummy is 1 for five distinct IDs. It may be published.
generate dum_5ids = _n==5|_n==57|_n==63|_n==99|_n==178
label variable dum_5ids "Dummy for five distinct IDs"
set seed 9385542
generate u = runiform()
sort u
drop u
generate int xpc = _n if _n<=99
Problematic range
generate byte pcdum = (xpc>50 & xpc<=60)
Only four distinct identifiers in interval (p50,p60]: 4, 5, 8, 10
replace id = 4 if pcdum & (id==1|id==3|id==6|id==7)
replace id = 3 if xpc==67|xpc==74|xpc==82|xpc==87
replace id = 2 if xpc==17|xpc==33|xpc==93|xpc==94|xpc==99
replace id = 6 if xpc==77|xpc==78
replace id = 10 if xpc== 1|xpc== 7
replace id = 7 if id==5 & xpc!=8 & xpc!=57 & xpc!=.
replace id = 1 if id==8 & xpc!=58 & xpc!=.
Trigger 1: If set, less than 15 percent outside (p50,p60]
replace id = 9 if xpc==40
Trigger 2: If set, only four distinct identifiers within (p90,max]
replace id = 3 if xpc==92
label variable xpc "Variable for non deterministic categories"
generate byte xfordom = 5 if id==1
replace xfordom = 4 if id==2
replace xfordom =-4 if id==2 & country=="US"
replace xfordom = 1 if id==3
replace xfordom =10 if id==3 & country=="CH" & year>=2009
label variable xfordom "Variable for dominance"
generate xwithzeros = runiform()
replace xwithzeros = 0 if id==3 & year<2009
cls
Only on rare occasions will you be asked for, or be interested in, the number of distinct IDs unrelated to any other variable. Normally, confidentiality has to be assessed with respect to a continuous or discrete variable. The artificial data contain eleven distinct identifiers. Because x is missing for id==11, only ten distinct identifiers have non-missing values of x. Since version 2.0.0, a list of continuous variables may be specified, but not a list of IDs. For dichotomous or categorical variables, use the by option, which is explained in the next step.
. nobsdes5 id
Warning Is id an identifier?
Number of distinct values for identifier id : 11
. nobsdes5 id x
Output control
No problem with dominance and sufficient number of distinct IDs (id) (=nobs) of
variable:
nobs
x - Default right hand side variable 10
Example for a varlist
. nobsdes5 id x xpc
Output control
No problem with dominance and sufficient number of distinct IDs (id) (=nobs) of
variable:
nobs
x - Default right hand side variable 10
xpc - Variable for non deterministic categorie .. 10
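The reported count can be cross-checked by hand. The following is a minimal sketch that uses only built-in Stata commands on a small hypothetical dataset (the values are made up for illustration and are not the artificial data created above). It is not the implementation of nobsdes5, only one way to count distinct IDs among the non-missing values of a variable.

```stata
* Hypothetical toy data: three IDs, x is missing for id 3
clear
input byte id double x
1 0.2
1 0.5
2 0.7
3 .
3 .
end

* Tag one observation per ID among the rows where x is not missing
egen tg = tag(id) if !missing(x)

* The number of tagged rows is the number of distinct IDs behind x
count if tg == 1
display "Distinct IDs with non-missing x: " r(N)
```

In this toy example, id 3 does not contribute, just as id==11 does not contribute to x in the artificial data.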
In 2002, x is non-missing only for id==1 and id==2.
To see whether there is a problem at all, for large tables you may use the notab option in the first step.
. nobsdes5 id x, by(year) notab
Output control
D I S C L O S U R E problem:
x - Default right hand side variable Share of largest two IDs > 85% and number of distinct IDs (id) too small
Now you know there is a problem, but you do not know the cause. Thus, in the second step, do not use notab.
. nobsdes5 id x, by(year)
Output control for variable x [Default right hand side variable]
D I S C L O S U R E problem:
Share of largest two IDs > 85%
Number of distinct IDs (id) of variable x for each year :
---------------------
year nobs CR2
---------------------
2001 10 33
2002 2 100
2003 10 35
2004 10 39
2005 10 30
2006 10 30
2007 10 32
2008 10 33
2009 10 30
2010 10 31
----------------------
Now you know that the problem is the year 2002. In this case, there seems to be a simple way to obtain the output: exclude that year.
. table year if year!=2002, statistic(total x)
------------------
| Total
--------+---------
year |
2001 | 14.7175
2003 | 14.2114
2004 | 13.5246
2005 | 12.1655
2006 | 16.4135
2007 | 13.7453
2008 | 14.2557
2009 | 14.2887
2010 | 13.0675
Total | 126.39
------------------
However, if you drop just one row, the hidden content may be inferred from the disclosed rows and the total. You can investigate this using the generate option.
. nobsdes5 id x, by(year) notab generate replace
Output control
D I S C L O S U R E problem:
x - Default right hand side variable Share of largest two IDs > 85% and number of distinct IDs (id) too small
. egen hvnobs = sum(tg_id_x), by(year)
. egen hvcr2 = mean(cr2_id_x), by(year)
. nobsdes5 id x if (hvnobs<=4 | hvcr2>85)
Output control
D I S C L O S U R E problem:
x - Default right hand side variable Share of largest two IDs > 85% and number of distinct IDs (id) too small
Thus, you must hide at least two rows, or, better yet, aggregate.
. quietly generate int bisyear = 2002 if (year==2001|year==2002)
. quietly replace bisyear = 2004 if (year==2003|year==2004)
. quietly replace bisyear = 2006 if (year==2005|year==2006)
. quietly replace bisyear = 2008 if (year==2007|year==2008)
. quietly replace bisyear = 2010 if (year==2009|year==2010)
. nobsdes5 id x, by(bisyear)
No problem with dominance
Number of distinct IDs (id) of variable x for each bisyear :
----------------------
bisyear | sum(nobs)
----------+-----------
2002 | 10
2004 | 10
2006 | 10
2008 | 10
2010 | 10
----------------------
. table bisyear, statistic(total x) nototals
----------------------
bisyear | sum(x)
----------+-----------
2002 | 17.7986
2004 | 27.73601
2006 | 28.57901
2008 | 28.00096
2010 | 27.35627
----------------------
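The inference risk mentioned earlier can be made concrete with the figures printed in the two tables. The sketch below assumes that a reader knows the overall total of x, reconstructed here as the sum of the aggregated rows, as well as the nine yearly totals published without 2002. The scalar names are hypothetical.

```stata
* Totals as printed in the two tables above (rounded)
scalar total_all    = 17.7986 + 27.73601 + 28.57901 + 28.00096 + 27.35627
scalar total_no2002 = 14.7175 + 14.2114 + 13.5246 + 12.1655 + 16.4135 ///
                    + 13.7453 + 14.2557 + 14.2887 + 13.0675

* Overall total minus the nine published years leaves the hidden year
display "Implied total of x for 2002: " ///
    %6.2f (scalar(total_all) - scalar(total_no2002))
```

The difference is about 3.08. The same figure follows from 17.7986 - 14.7175, the first aggregated row minus the published 2001 row, so releasing the yearly table without 2002 together with the aggregated table would still reveal the hidden year.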
The by option is mostly used with categorical data, but it may also be used with “sparse” continuous data, for example external tax data. A few tax rates may apply to only one or two firms, which creates a confidentiality problem. You can check this using nobsdes5 id, by(taxrate). If you include a variable denoting region, you have to use the byregio option. Also use the by option with dummy variables: because the coding of 0 and 1 can be swapped, both categories have to be checked.
. nobsdes5 id, by(dum_2ids)
Warning Is id an identifier?
Output control for variable dum_2ids [Dummy for two distinct ID ..]
D I S C L O S U R E problem:
Number of distinct values for identifier id for each dum_2ids :
+-----------------+
| dum_2ids nobs |
|-----------------|
| 0 11 |
| 1 2 |
+-----------------+
In the case of continuous variables, zeros can occur for a given entity always, sometimes, or never. If zeros always occur, that entity never conducts the respective type of business. Some readers may be aware of this, so these entities are not counted. If zeros occur only infrequently, the entity generally conducts the type of business captured by the variable but sometimes does not. It is presumed that competitors are unaware of this, so these entities could be counted. Since it is difficult to define a threshold for the fraction of zeros above which an entity is not counted, we pragmatically treat zeros as missing values whenever output control is carried out for descriptive statistics. Use the miss option for this. Researchers who have strong arguments in favour of keeping the zeros have to explain them.
In the case of dummy variables and categorical variables, a zero has a different meaning. Here, zeros are valid observations.
. nobsdes5 id xwithzeros, by(year)
Warning Zero values for variable xwithzeros found -> Should you use option miss(0) ?
Output control for variable xwithzeros []
No problem with dominance
Number of distinct IDs (id) of variable xwithzeros for each year :
+-------------+
| year nobs |
|-------------|
| 2001 11 |
| 2002 10 |
| 2003 10 |
| 2004 10 |
| 2005 10 |
| 2006 10 |
| 2007 10 |
| 2008 10 |
| 2009 10 |
| 2010 10 |
+-------------+
. nobsdes5 id xwithzeros, by(year) miss(0)
Output control for variable xwithzeros []
No problem with dominance
Number of distinct IDs (id) of variable xwithzeros for each year :
+-------------+
| year nobs |
|-------------|
| 2001 10 |
| 2002 9 |
| 2003 9 |
| 2004 9 |
| 2005 9 |
| 2006 9 |
| 2007 9 |
| 2008 9 |
| 2009 10 |
| 2010 10 |
+-------------+
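The effect of treating zeros as missing can be reproduced by hand. The following is a minimal sketch on hypothetical toy data using built-in commands only. How nobsdes5 implements miss(0) internally is not shown in this document, so the sketch only illustrates the counting rule described above.

```stata
* Hypothetical toy data: id 3 reports a zero in 2001 and a positive value in 2002
clear
input byte id int year double x
1 2001 0.4
2 2001 0.9
3 2001 0
1 2002 0.3
2 2002 0.6
3 2002 0.8
end

* Tag distinct IDs per year, once keeping zeros and once treating them as missing
egen tg_all    = tag(id year) if !missing(x)
egen tg_nozero = tag(id year) if !missing(x) & x != 0

* Number of distinct IDs per year under both rules
egen nobs_all    = total(tg_all), by(year)
egen nobs_nozero = total(tg_nozero), by(year)

* Show one row per year
egen oneyear = tag(year)
list year nobs_all nobs_nozero if oneyear, noobs
```

In 2001 the count drops from three to two once the zero is excluded. In 2002 both counts are three.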
There are three different identifiers with values for xfordom in each year. Of course, you should have at least five.
. table id country if xfordom!=., statistic(total xfordom)
----------------------------------
| country
| BE CH US Total
--------+-------------------------
id |
1 | 55 50 70 175
2 | 44 48 -48 44
3 | 13 30 9 52
Total | 112 128 31 271
----------------------------------
Assume that id==2 knows that there are two other IDs and that id==1 is much larger than id==3. If you publish the sum, id==2 can estimate the value of xfordom for id==1.
For country==“BE” it is near 112-44 = 68
For country==“CH” it is near 128-48 = 80
In the case of “US”, id==3 is the second largest firm.
For country==“US” it is near 31- 9 = 22
. nobsdes5 id xfordom, by(country)
Output control for variable xfordom [Variable for dominance]
D I S C L O S U R E problem:
Share of largest two IDs > 85% and ≤ 100%
Number of distinct IDs (id) of variable xfordom for each country :
+----------------------+
| country nobs CR2 |
|----------------------|
| BE 3 88 |
| CH 3 77 |
| US 3 255 |
+----------------------+
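The CR2 column can be reproduced from the totals in the id-by-country table above. The sketch below takes CR2 to be the sum of the two largest entity totals divided by the group total, in percent. That definition is an assumption inferred from the printed values, not taken from the nobsdes5 source, and the variable names are hypothetical.

```stata
* Totals of xfordom by id and country, copied from the table above
clear
input byte id str2 country double sumx
1 "BE"  55
2 "BE"  44
3 "BE"  13
1 "CH"  50
2 "CH"  48
3 "CH"  30
1 "US"  70
2 "US" -48
3 "US"   9
end

* Rank entities within country; rank 1 is the largest total
bysort country (sumx): generate rank = _N - _n + 1

* Share of the two largest entities in the country total, in percent
egen top2 = total(sumx * (rank <= 2)), by(country)
egen tot  = total(sumx), by(country)
generate cr2 = round(100 * top2 / tot)

* Show one row per country
egen onecountry = tag(country)
list country cr2 if onecountry, noobs
```

Under this definition the results are 88 for BE, 77 for CH, and 255 for US. The US value exceeds 100 because the negative total of id==2 shrinks the denominator.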
The data is now sorted.
. describe
Contains data
obs: 301
vars: 18
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
storage display value
variable name type format label variable label
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
id float %9.0g
year float %9.0g
country str2 %9s
x float %9.0g
id1 float %9.0g
dum_2ids float %9.0g
dum_5ids float %9.0g
xpc int %8.0g
pcdum byte %8.0g
xfordom byte %8.0g
xwithzeros float %9.0g
tg_id_x byte %8.0g
cr2_id_x float %5.1f
hvnobs float %9.0g
hvcr2 float %9.0g
bisyear int %8.0g
cutp float %9.0g percentiles of xpc
xpcbins byte %8.0g xpc categorized by cutp
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by: year
Note: Dataset has changed since last saved.
Maximum and minimum are single observations. Therefore, they must not be shown. If necessary, the RDSC allows the publication of approximate values calculated as averages of a sufficient number of observations (top coding). For maximum and minimum, at least five distinct entities must be covered and the share of the two largest entities must not exceed 85 percent of the total (dominance criterion).
Please use the .ado file maxrdsc to address this problem. If a table or cross-tabulation is desired, use nobsdes5 after completing some preparatory steps.
. maxrdsc id x, min(12) max(12)
No problems for maximum of x. Average maximum based on 5 distinct ids: .9832249780495962
No problems for minimum of x. Average minimum based on 5 distinct ids: .0167608045856468
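The idea of an approximate maximum can be sketched with built-in commands. The example below is not the implementation of maxrdsc, and the exact meaning of its min() and max() options is not documented here. It assumes, for illustration only, that the k largest observations are averaged and that the number of distinct IDs behind that average is then checked. The toy data and the choice of k are hypothetical.

```stata
* Hypothetical toy data: 50 observations, 10 IDs, x increasing in _n
clear
set obs 50
generate id = mod(_n, 10) + 1
generate x  = _n / 50

* Assumption: average the k largest observations
local k = 12
gsort -x
generate byte top = _n <= `k'
summarize x if top, meanonly
display "Approximate maximum: " r(mean)

* Number of distinct IDs behind that average
egen tg = tag(id) if top
count if tg == 1
display "Distinct IDs: " r(N)
```

A complete check would also verify the dominance criterion for the averaged observations. The sketch omits that step.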
Quantiles must be specified in ascending order. The RDSC treats quantiles as defining categories. Each category, that is, the range between the minimum and the lowest quantile, between each pair of consecutive quantiles, and between the highest quantile and the maximum, must contain more than four distinct identifiers and must satisfy the dominance criterion. Most often, quantiles are used to describe the shape of a distribution, and tied values are then not an issue. If quantiles serve to create groups instead, ties may be an issue.
. nobsdes5 id xpc if year==2010, pctile(5 50 90)
Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem:
Share of largest two IDs >85% and <=100%
Number of distinct IDs (id) for each percentile (min<=p5<=p50<=p90<=max) of variable xpc :
+---------------------+
| pctile nobs CR2 |
|---------------------|
| min 1 100 |
| p50 4 59 |
| max 3 75 |
+---------------------+
. nobsdes5 id xpc, pctile(5 50 90) by(year) notab
Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem:
Share of largest two IDs > 85%
Number of distinct IDs (id) for the smallest percentile (min<=p5<=p50<=p90<=max) of variable xpc for year : too small
. nobsdes5 id xpc, pctile(10 20 30 40 50 60 70 80 90)
Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem:
No problem with dominance
Number of distinct IDs (id) for each percentile (min<=p10<=p20<=p30<=p40<=p50<=p60<=p70<=p80<=p90<=max) of variable xpc :
+---------------------+
| pctile nobs CR2 |
|---------------------|
| p10 6 58 |
| p20 5 58 |
| p30 6 50 |
| p40 4 79 |
| p50 5 61 |
|---------------------|
| p60 4 79 |
| p70 5 52 |
| p80 5 59 |
| p90 6 50 |
| max 4 66 |
+---------------------+
Here, there are fewer than five distinct identifiers in three ranges.
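The counting rule behind this table can be sketched with built-in commands. The example uses hypothetical toy data and xtile to form decile categories. It illustrates the idea of counting distinct IDs per quantile range and is not the code used by nobsdes5.

```stata
* Hypothetical toy data: 100 observations, 7 IDs
clear
set obs 100
generate id = mod(_n, 7) + 1
generate x  = _n

* Assign each observation to one of ten categories bounded by the deciles
xtile cat = x, nquantiles(10)

* Number of distinct IDs per category
egen tg   = tag(id cat)
egen nobs = total(tg), by(cat)

* Show one row per category
egen onecat = tag(cat)
list cat nobs if onecat, noobs
```

Every category in this toy example contains all seven IDs, so none would fall below the threshold of five.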
If you want to create a graph, you must document the fact that the graph is based on sufficient observations and that no two identifiers dominate. You must use nobsdes5 before you create the graph. We recommend saving the graph as a .png file.
. nobsdes5 id x, by(year country) notab
Output control
D I S C L O S U R E problem:
x - Default right hand side variable Share of largest two IDs > 85% and number of distinct IDs (id) too small
. nobsdes5 id x, by(year country)
Output control for variable x [Default right hand side variable]
D I S C L O S U R E problem:
Share of largest two IDs > 85%
Number of distinct IDs (id) of variable x for each year country :
+-----------------------------+
| year country nobs CR2 |
|-----------------------------|
| 2001 BE 8 48 |
| 2001 CH 9 41 |
| 2001 US 8 37 |
| 2002 BE 2 100 |
| 2002 CH 2 100 |
| 2002 US 2 100 |
| 2003 BE 8 51 |
| 2003 CH 9 33 |
| 2003 US 9 39 |
| 2004 BE 9 44 |
| 2004 CH 9 43 |
| 2004 US 9 56 |
| 2005 BE 9 46 |
| 2005 CH 9 41 |
| 2005 US 9 42 |
| 2006 BE 9 36 |
| 2006 CH 10 34 |
| 2006 US 8 39 |
| 2007 BE 9 30 |
| 2007 CH 9 42 |
| 2007 US 10 47 |
| 2008 BE 8 30 |
| 2008 CH 10 37 |
| 2008 US 8 63 |
| 2009 BE 9 39 |
| 2009 CH 10 32 |
| 2009 US 7 43 |
| 2010 BE 9 43 |
| 2010 CH 9 46 |
| 2010 US 9 37 |
+-----------------------------+
. egen meanx = mean(x) if year!=2002, by(year country)
. egen tgcountry = tag(year country)
. gsort -tgcountry year country
. scatter meanx year if country=="BE" & tgcountry
. * graph export graphname.png
If you want to create a histogram, you must document the fact that the graph is based on sufficient observations and that no two identifiers dominate. You must use nobsdes5 before you create the histogram. We recommend saving the histogram as a .png file.
. nobsdes5 id xpc, histogram(10) /* e.g. 10 bins */
Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem:
No problem with dominance
Number of distinct IDs (id) for the smallest or largest of 10 bins of variable xpc : too small
This differs from the height of the bins.
. quietly histogram xpc, freq bin(10) addlabels
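The difference between the bar height and the count relevant for output control can be shown on a small example. The sketch below uses hypothetical toy data and a hand-made bin variable. It only illustrates that many observations in a bin may come from few distinct IDs.

```stata
* Hypothetical toy data: 20 observations, 4 IDs with 5 observations each
clear
set obs 20
generate id  = ceil(_n / 5)
generate x   = _n

* Two bins of width 10
generate bin = ceil(x / 10)

* Bar height: number of observations per bin
egen height = count(x), by(bin)

* Output-control count: number of distinct IDs per bin
egen tg   = tag(id bin)
egen nobs = total(tg), by(bin)

* Show one row per bin
egen onebin = tag(bin)
list bin height nobs if onebin, noobs
```

Each bin has a height of ten observations but is based on only two distinct IDs.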
A varlist is possible since version 2.0.0.
. nobsdes5 id x xpc, histogram(25 10)
This histogram example and the last percentile example are almost the same. In both, there are nine ranges with ten observations and one range with nine observations. In the percentile example, the last range has only nine observations (the last digit runs from 1 to 0 in each range except for the last one). In the histogram example, the last digit runs from 1 to 0 in the first four ranges and then from 0 to 9, so the fifth range, which corresponds to the range between p40 and p50, has only nine observations.
Example for the use of a varlist
. nobsdes5 id x xpc, histogram(25 10)
Output control for variable x [Default right hand side variable]
No problem with dominance
Number of distinct IDs (id) for the smallest or largest of 25 bins of variable x : 6
Output control for variable xpc [Variable for non deterministic categories]
D I S C L O S U R E problem:
No problem with dominance
Number of distinct IDs (id) for the smallest or largest of 10 bins of variable xpc : too small
A good alternative to a histogram is a kernel density estimate. You have to show that the kernel density is based on at least five distinct IDs.
. kdensity xpc
. nobsdes5 id xpc
Output control
No problem with dominance and sufficient number of distinct IDs (id) (=nobs) of
variable:
nobs
xpc - Variable for non deterministic categorie .. 10
Sometimes, researchers have data characterized by two or more identifiers, but their interest is in one identifier only. They therefore integrate the second identifier out. Since the researchers have access to both identifiers, they have to check whether the marginalization violates statistical disclosure rules. If the RDSC calculates and provides the marginal data, the RDSC has to perform this check. If the reporting entities themselves report marginal data, for example, an MFI reports total loans to enterprises from the machinery industry but not loans to individual enterprises, the reporting entity is responsible for the check.