Create artificial data for eleven ids (1-11), ten years (2001-2010), three countries (17, 39, 400), and the continuous variables x and y.
. clear
. qui set obs 601
. gen id = int((_n-1)/60)+1
. gen hv = _n-60*(id-1)
. gen year = 2000+mod(hv,10)+10*(mod(hv,10)==0)
. gen land = int((2000-year+hv)/10)+1
. qui replace land = 17 if land==1|land==4
. qui replace land = 39 if land==2|land==5
. qui replace land = 400 if land==3|land==6
. drop hv
. set seed 86937263
. gen x = runiform()
Create an id that has to be ignored since x is missing.
. qui replace id = 11 if _n==301
. qui replace x = . if id==11
Set x to missing in the year 2002 for all ids but two so that there are too few distinct ids in this year.
. qui replace x = . if year==2002 & id > 2
Create a second variable y that also takes negative values.
. set seed 69326378
. gen y = rnormal()
. cls
MAXIMUM and MINIMUM are single observations and thus must not be shown. If necessary, the RDSC allows publication of approximate values calculated as averages of a sufficient number of observations (top coding). The following requirements apply: at least five distinct entities have to be covered, and the share of the two largest ones must not exceed 85 percent of the total (dominance criterion).
. maxrdsc id x, min(12)
No problems for minimum of x.
Average minimum based on 5 distinct ids: .0074648052769979
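The two requirements can be illustrated with built-in Stata commands. The following is only a minimal sketch on hypothetical toy data (the variable name value and all numbers are invented for illustration, and the values are assumed to be non-negative); it is not the code used inside maxrdsc.

```stata
* Hypothetical toy data: six values contributed by five entities
clear
input id value
1 50
2 30
3 10
4 5
5 3
5 2
end

* The criterion refers to entities, not rows, so aggregate by id first
collapse (sum) value, by(id)

* Requirement 1: number of distinct entities (one row per id after collapse)
local n_ids = _N

* Requirement 2: share of the two largest entities in the total
gsort -value
quietly summarize value
local total = r(sum)
local share2 = (value[1] + value[2]) / `total'

display "distinct entities:        `n_ids'"
display "share of the two largest: " %5.3f `share2'
display "both requirements met:    " (`n_ids' >= 5 & `share2' <= 0.85)
```

In this toy example the two largest entities hold 80 of 100, so both requirements are met. If the largest entity held 60 instead of 50, the share would exceed 85 percent and the average could not be published.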
The mean of the five smallest observations is smaller than the result given by maxrdsc.
. qui sum x in 1/5
. di " mean " r(mean)
 mean .00493174
If we list the ten smallest observations, we see that the five smallest values belong to only three distinct ids. To cover at least five distinct ids, we have to use the seven smallest observations.
. sort x
. list id x in 1/10, noobs

     ┌───────────────┐
     │ id          x │
     ├───────────────┤
     │  8   .0007817 │
     │  8   .0009793 │
     │ 10   .0048133 │
     │  9   .0061389 │
     │  8   .0119455 │
     ├───────────────┤
     │  7   .0132593 │
     │  1   .0143357 │
     │  4   .0150426 │
     │  1   .0150839 │
     │  7    .015683 │
     └───────────────┘

. qui sum x in 1/7
. di " mean " r(mean)
 mean .00746481
Please do not include the last four commands in your own output! They are shown here only for clarification.
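The search for the number of observations needed can be sketched with built-in commands. The sketch below re-enters the ten smallest observations from the listing above as toy data and looks for the smallest k whose k smallest observations cover five distinct ids. The helper variables first, n_ids and k are hypothetical names; the sketch covers only the distinct-id requirement and is not meant to reproduce the internals of maxrdsc.

```stata
* Toy data: the ten smallest observations from the listing above
clear
input id x
8  .0007817
8  .0009793
10 .0048133
9  .0061389
8  .0119455
7  .0132593
1  .0143357
4  .0150426
1  .0150839
7  .015683
end

* Flag the smallest observation of every id
bysort id (x): gen byte first = (_n == 1)

* Running count of distinct ids among the k smallest observations
sort x
gen n_ids = sum(first)
gen long k = _n

* Smallest k that covers at least five distinct ids
quietly summarize k if n_ids >= 5
local k = r(min)

* Average of the k smallest observations
quietly summarize x in 1/`k'
display "observations needed: `k'"
display "average minimum:     " r(mean)
```

With these data the fifth distinct id is reached at the seventh observation, and the average of the seven smallest values agrees with the average minimum reported by maxrdsc.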
In the next example at most 12 observations are accepted to approximate the maximum.
. maxrdsc id x, max(12)
No problems for maximum of x.
Average maximum based on 5 distinct ids: .9986101269721985
. list id year land x in 543/552, noobs

     ┌─────────────────────────────┐
     │ id   year   land          x │
     ├─────────────────────────────┤
     │  9   2005     39   .9878153 │
     │  4   2003    400   .9896339 │
     │  6   2003     39   .9912419 │
     │  6   2004     39   .9931296 │
     │  4   2008    400   .9978754 │
     ├─────────────────────────────┤
     │  8   2010     17    .997907 │
     │  1   2007     17   .9987306 │
     │  6   2006     39   .9992318 │
     │  2   2009     17   .9993058 │
     │  3   2002    400          . │
     └─────────────────────────────┘

. qui sum x in 547/551
. di " mean " r(mean)
 mean .99861013
It is possible to specify minimum and maximum at the same time.
. maxrdsc id x, min(9) max(12)
No problems for maximum of x.
Average maximum based on 5 distinct ids: .9986101269721985
No problems for minimum of x.
Average minimum based on 5 distinct ids: .0074648052769979
. maxrdsc id x, min(12) max(9)
No problems for maximum of x.
Average maximum based on 5 distinct ids: .9986101269721985
No problems for minimum of x.
Average minimum based on 5 distinct ids: .0074648052769979
In the next example, 12 observations are not enough to determine the maximum of variable x for Belgium (country code 17) because the 15 largest observations belong to only four distinct ids.
. foreach i in 17 400 {
  2.     display
  3.     display "country: `i'"
  4.     maxrdsc id x if land==`i', min(12) max(12)
  5. }

country: 17
D I S C L O S U R E problem:
For variable x 12 observations are not sufficient to determine maximum.

country: 400
No problems for maximum of x.
Average maximum based on 5 distinct ids: .9866548180580139
No problems for minimum of x.
Average minimum based on 5 distinct ids: .021380212690149
The researcher can require a specific number of observations without iteration. This option may be needed for comparisons with other studies.
. maxrdsc id x, min(10) max(10) noiterate
No problems for maximum of x.
Average maximum based on 10 observations: .9941961109638214
No problems for minimum of x.
Average minimum based on 10 observations: .0098063051176723
. qui sum x in 542/551
. di " mean " r(mean)
 mean .99419611
. qui sum x in 1/10
. di " mean " r(mean)
 mean .00980631
In some cases, the researcher may prefer the maximum or minimum of the absolute values.
. maxrdsc id y, min(12) max(12) absolute
No problems for maximum of y.
Average maximum based on 5 distinct ids: 2.683389610714383
No problems for minimum of y.
Average minimum based on 5 distinct ids: .0103370569746143
For some purposes, e.g. graphing an empirical cumulative distribution function, the extreme values in the data themselves have to be replaced. The researcher can either provide a copy of the variable and use the option update, or use the option name() and specify only the name of a new variable.
. gen y_user = y
. maxrdsc id y_user, min(12) max(9) update
No problems for maximum of y_user.
Average maximum based on 5 distinct ids: 2.460660934448242
No problems for minimum of y_user.
Average minimum based on 5 distinct ids: -2.637131384440831
or
. maxrdsc id y, min(12) max(9) name(y_tc) /* tc for topcode */
No problems for maximum of y.
Average maximum based on 5 distinct ids: 2.460660934448242
No problems for minimum of y.
Average minimum based on 5 distinct ids: -2.637131384440831
. cumul y_tc, gen(y_tc_cum) equal
. line y_tc_cum y_tc, sort
. graph export cumul.png, replace
(file cumul.png written in PNG format)
Figure: Example cumulative distribution function (cumul.png)
Top-coding produces ties so you should specify 'equal' with the cumul command.
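The effect of the ties can be seen in a small sketch with built-in commands. The data are simulated, and the tail sizes (7 smallest, 5 largest) are arbitrary choices for illustration. The replacement by hand only mimics what top-coding does to the data and ignores the distinct-id requirement checked by maxrdsc.

```stata
* Hypothetical toy data; tail sizes 7 and 5 are arbitrary
clear
set seed 12345
set obs 100
gen y = rnormal()
sort y

* Replace the 7 smallest and the 5 largest values by their averages
gen y_tc = y
quietly summarize y in 1/7
local low = r(mean)
quietly summarize y in 96/100
local high = r(mean)
replace y_tc = `low' in 1/7
replace y_tc = `high' in 96/100

* Compare cumul with and without the option equal
cumul y_tc, gen(cum_plain)
cumul y_tc, gen(cum_equal) equal
list y_tc cum_plain cum_equal in 1/8, noobs
```

With equal, the seven tied observations at the lower end share one cumulative value, so the plotted step function does not show several different heights at the same top-coded value.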
For complete tables (a total plus a breakdown by group), the researcher has to use two variables, one for the total and another for the breakdown. Otherwise, some of the largest values of the total may be overwritten by the averages computed for the subsamples that follow.
. maxrdsc id x, min(20) max(20) name(x_tc)
No problems for maximum of x.
Average maximum based on 5 distinct ids: .9986101269721985
No problems for minimum of x.
Average minimum based on 5 distinct ids: .0074648052769979
. qui gen x_user = x
. foreach i in 17 400 {
  2.     maxrdsc id x_user if land==`i', min(20) max(20) update
  3. }
No problems for maximum of x_user.
Average maximum based on 5 distinct ids: .945697195827961
No problems for minimum of x_user.
Average minimum based on 5 distinct ids: .0207207249113708
No problems for maximum of x_user.
Average maximum based on 5 distinct ids: .9866548180580139
No problems for minimum of x_user.
Average minimum based on 5 distinct ids: .021380212690149
. sort x
. list id land year x x_tc x_user if x_tc!=x & x!=., noobs

     ┌───────────────────────────────────────────────────┐
     │ id   land   year          x       x_tc     x_user │
     ├───────────────────────────────────────────────────┤
     │  8     17   2007   .0007817   .0074648   .0207207 │
     │  8     39   2008   .0009793   .0074648   .0009793 │
     │ 10     39   2004   .0048133   .0074648   .0048133 │
     │  9     17   2006   .0061389   .0074648   .0207207 │
     │  8    400   2007   .0119455   .0074648   .0213802 │
     ├───────────────────────────────────────────────────┤
     │  7     17   2008   .0132593   .0074648   .0207207 │
     │  1    400   2004   .0143357   .0074648   .0213802 │
     │  4    400   2008   .9978754   .9986101   .9866548 │
     │  8     17   2010    .997907   .9986101   .9456972 │
     │  1     17   2007   .9987306   .9986101   .9456972 │
     ├───────────────────────────────────────────────────┤
     │  6     39   2006   .9992318   .9986101   .9992318 │
     │  2     17   2009   .9993058   .9986101   .9456972 │
     └───────────────────────────────────────────────────┘

. sort land x
. list id land year x x_tc x_user if x_user!=x & x!=., sepby(land) noobs

     ┌───────────────────────────────────────────────────┐
     │ id   land   year          x       x_tc     x_user │
     ├───────────────────────────────────────────────────┤
     │  8     17   2007   .0007817   .0074648   .0207207 │
     │  9     17   2006   .0061389   .0074648   .0207207 │
     │  7     17   2008   .0132593   .0074648   .0207207 │
     │  9     17   2003   .0179612   .0179612   .0207207 │
     │  7     17   2001   .0229569   .0229569   .0207207 │
     │  9     17   2005   .0311553   .0311553   .0207207 │
     │  4     17   2001   .0324234   .0324234   .0207207 │
     │  3     17   2007    .041089    .041089   .0207207 │
     │  5     17   2007   .8989528   .8989528   .9456972 │
     │  2     17   2002   .9068126   .9068126   .9456972 │
     │  2     17   2001   .9111468   .9111468   .9456972 │
     │  4     17   2003   .9113399   .9113399   .9456972 │
     │  2     17   2006   .9246309   .9246309   .9456972 │
     │  1     17   2008   .9270924   .9270924   .9456972 │
     │  8     17   2004   .9375268   .9375268   .9456972 │
     │  1     17   2008    .938831    .938831   .9456972 │
     │  2     17   2009   .9455019   .9455019   .9456972 │
     │  2     17   2010   .9546198   .9546198   .9456972 │
     │  2     17   2004   .9552367   .9552367   .9456972 │
     │  4     17   2001   .9602001   .9602001   .9456972 │
     │  8     17   2006     .96332     .96332   .9456972 │
     │  8     17   2010    .997907   .9986101   .9456972 │
     │  1     17   2007   .9987306   .9986101   .9456972 │
     │  2     17   2009   .9993058   .9986101   .9456972 │
     ├───────────────────────────────────────────────────┤
     │  8    400   2007   .0119455   .0074648   .0213802 │
     │  1    400   2004   .0143357   .0074648   .0213802 │
     │  3    400   2006   .0180618   .0180618   .0213802 │
     │  8    400   2004   .0208572   .0208572   .0213802 │
     │  6    400   2004    .021805    .021805   .0213802 │
     │  3    400   2008   .0254984   .0254984   .0213802 │
     │  9    400   2004   .0371579   .0371579   .0213802 │
     │  5    400   2009   .9780802   .9780802   .9866548 │
     │  2    400   2002   .9813354   .9813354   .9866548 │
     │  8    400   2008   .9859142   .9859142   .9866548 │
     │ 10    400   2004   .9870899   .9870899   .9866548 │
     │  4    400   2003   .9896339   .9896339   .9866548 │
     │  4    400   2008   .9978754   .9986101   .9866548 │
     └───────────────────────────────────────────────────┘
If you want to show results outside the RDSC, you have to create a table! The following shows one way to do this using the stored results r(minval) and r(maxval).
. foreach i in 17 400 {
  2.     di ""
  3.     display "country: `i'"
  4.     maxrdsc id x if land==`i', min(20) max(20)
  5.     local min_`i' = r(minval)
  6.     local max_`i' = r(maxval)
  7.     qui sum x if land==`i'
  8.     local n_`i' = r(N)
  9.     local mean_`i' = r(mean)
 10.     local sd_`i' = r(sd)
 11. }

country: 17
No problems for maximum of x.
Average maximum based on 5 distinct ids: .945697195827961
No problems for minimum of x.
Average minimum based on 5 distinct ids: .0207207249113708

country: 400
No problems for maximum of x.
Average maximum based on 5 distinct ids: .9866548180580139
No problems for minimum of x.
Average minimum based on 5 distinct ids: .021380212690149
. foreach i in 17 400 {
  2.     di %30s "`i'" %10.0f `n_`i'' %18.4f `mean_`i'' %18.4f `sd_`i'' %18.4f `min_`i'' %18.4f `max_`i''
  3. }
                            17       183            0.5070            0.2845            0.0207            0.9457
                           400       184            0.5106            0.3013            0.0214            0.9867
If the researcher's interest lies in the probability density function, kernel density estimates may be used.
. kdensity y
. graph export kernel.png, replace
(file kernel.png written in PNG format)
Figure: Example kernel density estimate (kernel.png)