Hi, I can help you collect the data, clean it using the filters you mentioned, and analyze it using Bayesian statistics (probabilistic classifiers such as Naive Bayes, NBC). Please let me know where you would prefer to store the processed data. Would you like a Hadoop-based option for this, such as HBase (a NoSQL store) or Impala (a SQL query engine)? I think Impala would fit, but I can't be sure until I know more about your requirements.
I look forward to your response.
Thank you
Mahesh.