starting stage 1 -prepare data ...

35075 :html files detected in data directory
files dataset is: 35075