04/07/2014

Goodness of fit test for log-normal and exponential distributions by bootstrapping


Stackexchange users be welcome to this page :), if you find the program useful or have suggestions, please feel free to leave a comment.


The poweRlaw R library provides the bootstrap_p function which allows to test the goodness of fit of a power law to the data using bootstrapping.  Unfortunately the library does not provides such methods for other distributions. We here propose such functions for log-normal and exponential models.
The code is based on the principle of the paper by Clauset et al. 2009 to test goodness of fit by bootstrapping for the log-normal and exponential distributions and reuse some of the powerLaw functions.

I wrote it after realizing that the code provided on the page http://users.dimi.uniud.it/~massimo.franceschet/R/fit.html meant for the same purpose was incorrect: the way the goodness of fit is computed is not correct as it does not strictly follow the procedure described in the paper, and implemented in the poweRlaw package.
Specifically: the synthetic data generation procedure is half implemented, it does not search for the best xmin as provided by the extimate_xmin function of the poweRlaw package, and finally the ks.test discards all the ties, while powerLaw doesn't with its built-in KS test.


Usage: a poweRlaw discrete distribution object has to be given as the first parameter. Prints the p-value on the output. Note that following the paper, about 2500 simulations should be generated to get a 2 points precision.

Interpretation of the p-value: a p-value < 0.1 rejects the hypothesis that the tested law generated the data, while a value > 0.1 does not reject it, but does not disprove the model either.

Note that model comparison based on the likelihood ratio (like the Vuong test, which is implemented in the compare_distribution function of poweRlaw) is better able to tell which of two models better fits the data and if there are enough data to confidently claim so.

Code for log-normal distributions:

bootstrap_ln = function(m,nbsims=5) {
 xmin = m$getXmin()
 ntot = length(m$dat)
 n1 = sum(m$dat<xmin)
 n2 = ntot - n1
 print(n1)
 print(n2)
 print(ntot)

 start_time = Sys.time()

 m_est = estimate_xmin(m)

 end_time = Sys.time()
 total_time = difftime(end_time, start_time, units="secs")

 mean = m_est$pars[1]
 sd = m_est$pars[2]

 print(m_est$KS)
 print(unlist(m_est))

 print(total_time)

 count=0
 for(i in 1:nbsims) {
  # generate synthetic data set
  if(xmin>1) {
   syn1 = sample(m$dat[m$dat<xmin],n1,replace=TRUE)
  } else {
   syn1 = c()
  }
  # sample from distribution
  syn2 = c()
  while(length(syn2)<n2) {
   n2bis = n2-length(syn2)
   syn2bis = round(rlnorm(n2bis,meanlog=mean,sdlog=sd))
   syn2bis = syn2bis[syn2bis>0]
   syn2 = c(syn2,syn2bis)
  }  
  syn = c(syn1,syn2)
  print(length(syn))
  # estimate gof of syn data  
  msyn = dislnorm$new(syn)
  syn_est = estimate_xmin(msyn)
  print(syn_est$KS)
  print(unlist(syn_est))
  if(syn_est$KS >= m_est$KS) {
   count = count + 1
  }
 }
 print(count/nbsims)
}

example:

dsyn = round(rlnorm(1000,2.4,0.98))
dsyn = dsyn[dsyn>0]
length(dsyn)
e_lnsyn = dislnorm(dsyn)
e_lnsyn$setXmin(estimate_xmin(e_lnsyn))
bootstrap_ln(e_lnsyn)


Code for exponential distributions:

bootstrap_exp = function(m,nbsims=5) {
 xmin = m$getXmin()
 ntot = length(m$dat)
xmin
 n1 = sum(m$dat<xmin)
 n2 = ntot - n1
 print(n1)
 print(n2)
 print(ntot)

 start_time = Sys.time()

 m_est = estimate_xmin(m)

 end_time = Sys.time()
 total_time = difftime(end_time, start_time, units="secs")

 rate = m_est$pars[1]

 print(m_est$KS)
 print(unlist(m_est))

 print(total_time)

 count=0
 for(i in 1:nbsims) {
  # generate synthetic data set
  if(xmin>1) {
   syn1 = sample(m$dat[m$dat < xmin],n1,replace=TRUE)
  } else {
   syn1 = c()
  }
  # sample from distribution
  syn2 = c()
  while(length(syn2)<n2) {
   n2bis = n2-length(syn2)
   syn2bis = round(rexp(n2bis,rate=rate))
   syn2bis = syn2bis[syn2bis>0]
   syn2 = c(syn2,syn2bis)
  }  
  syn = c(syn1,syn2)
  # estimate gof of syn data  
  msyn = disexp$new(syn)
  syn_est = estimate_xmin(msyn)
  print(syn_est$KS)
  print(unlist(syn_est))
  if(syn_est$KS >= m_est$KS) {
   count = count + 1
  }
 }
 print(count/nbsims)
}

1 comment:

  1. Hello Nicolas.. How the code from the link is not correct?
    What is wrong?
    http://users.dimi.uniud.it/~massimo.franceschet/R/fit.html
    I want to implement for Weibull

    ReplyDelete