Naive Bayes Classifier - technewscircle

Introduction

Naive Bayes classification is a very simple classification algorithm, and the "naive" in its name reflects how simple the underlying idea is. For a given item to be classified, compute the probability of each category given the item's observed features, then assign the item to the category with the largest probability. For example, if a fruit is red, round, and about 3 inches in diameter, it can be judged to be an apple. Although these features may depend on one another, or some may be determined by others, the Naive Bayes classifier treats them as contributing independently to the probability that the fruit is an apple. For certain types of probability models, very good classification results can be obtained in a supervised learning setting. In many practical applications, the parameters of a Naive Bayes model are estimated by maximum likelihood; in other words, a Naive Bayes model can be used without accepting Bayesian probability or using any Bayesian methods.
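The decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the training data, the feature encoding (color, shape, size), and the function names `train`, `posterior`, and `classify` are all made up for this example. Probabilities are plain relative frequencies (maximum likelihood estimates) with no smoothing.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    for features, label in examples:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def posterior(features, class_counts, feature_counts):
    """Return P(class | features) under the feature-independence assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(class)
        for i, value in enumerate(features):
            # relative frequency estimate of P(feature_i = value | class)
            score *= feature_counts[(label, i)][value] / n
        scores[label] = score
    norm = sum(scores.values())
    # If every class scores zero there is nothing to normalize; return raw scores.
    return {c: s / norm for c, s in scores.items()} if norm > 0 else scores

def classify(features, class_counts, feature_counts):
    """Pick the class with the largest posterior probability."""
    post = posterior(features, class_counts, feature_counts)
    return max(post, key=post.get)

# Made-up training data: (color, shape, size) -> fruit
DATA = [
    (("red", "round", "medium"), "apple"),
    (("red", "round", "medium"), "apple"),
    (("green", "round", "medium"), "apple"),
    (("yellow", "long", "medium"), "banana"),
    (("yellow", "long", "medium"), "banana"),
    (("red", "round", "small"), "cherry"),
    (("red", "round", "small"), "cherry"),
]

class_counts, feature_counts = train(DATA)
print(classify(("red", "round", "medium"), class_counts, feature_counts))  # apple
```

Because no smoothing is applied, a feature value never seen with a class drives that class's score to zero; practical implementations usually add a small pseudo-count to avoid this.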

Despite this naive design and its oversimplified assumptions, the Naive Bayes classifier still achieves quite good results in many complex real-world situations. In 2004, an analysis of the Bayesian classification problem identified several theoretical reasons for the classifier's seemingly implausible effectiveness. Nevertheless, a detailed comparison of classification methods published in 2006 found that newer methods (such as boosted trees and random forests) outperform Bayesian classifiers. One advantage of the Naive Bayes classifier is that it needs only a small amount of training data to estimate the necessary parameters (the means and variances of the variables). Because the variables are assumed to be independent, only the variance of each variable has to be estimated for each class, not the entire covariance matrix.
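To make the point about means and variances concrete, here is a minimal sketch of a Gaussian Naive Bayes model for continuous features. It assumes each feature is normally distributed within each class; the function names and the toy data are hypothetical. Note that the model stores one mean and one variance per feature per class and never touches a covariance matrix.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Maximum likelihood variance (divide by n); the floor avoids division by zero.
        variances = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def predict(model, row):
    """Return the class with the highest log prior plus summed log likelihoods."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(x, m, v) for x, m, v in zip(row, means, variances))
    return max(model, key=score)

# Made-up data: two features, two classes
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [3.0, 4.0], [3.2, 3.8], [2.8, 4.2]]
y = ["a", "a", "a", "b", "b", "b"]

model = fit_gaussian_nb(X, y)
print(predict(model, [1.1, 2.1]))  # a
print(predict(model, [3.1, 3.9]))  # b
```

With two classes and two features the model holds eight feature parameters (a mean and a variance for each class-feature pair) plus the class priors.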

Development

Naive Bayes has been studied extensively since the 1950s. In the early 1960s it was introduced, under a different name, into the text retrieval community, and it remains a popular baseline method for text classification, that is, deciding which category a document belongs to (legitimate or not, sports or politics, etc.). With proper preprocessing, it can compete with more advanced methods in this field, including support vector machines. It also has applications in automatic medical diagnosis.

Naive Bayes Classifier

The Naive Bayes classifier is highly scalable: the number of parameters it requires is linear in the number of variables (features/predictors) in the learning problem. Maximum likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers. In the statistics and computer science literature, the Naive Bayes model goes by various names, including simple Bayes and independent Bayes. All of these names refer to the use of Bayes' theorem in the classifier's decision rule, but Naive Bayes is not (necessarily) a Bayesian method. Russell and Norvig note that Naive Bayes is sometimes called a Bayesian classifier, a careless usage that has prompted true Bayesians to call it the "idiot Bayes" model.
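The linear growth in parameters is easy to check with a little arithmetic. The sketch below counts free parameters for discrete features and compares Naive Bayes with a model that stores a full joint table of the features for each class. Both function names are hypothetical, and the comparison assumes every feature takes the same number of values.

```python
def naive_bayes_param_count(n_features, n_classes, n_values=2):
    """Free parameters when features are independent given the class."""
    # Per class and feature: (n_values - 1) free probabilities.
    # Plus (n_classes - 1) free class priors.
    return n_classes * n_features * (n_values - 1) + (n_classes - 1)

def full_joint_param_count(n_features, n_classes, n_values=2):
    """Free parameters when each class keeps a full joint table over all features."""
    # Per class: one probability per feature combination, minus one for normalization.
    return n_classes * (n_values ** n_features - 1) + (n_classes - 1)

for n in (10, 20):
    print(n, naive_bayes_param_count(n, 2), full_joint_param_count(n, 2))
# 10 21 2047
# 20 41 2097151
```

Doubling the number of binary features roughly doubles the Naive Bayes parameter count, while the full joint table grows exponentially.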

Bayesian method

There are many ways to construct a classifier. Common ones include Bayesian methods, decision trees, case-based learning, artificial neural networks, support vector machines, and methods based on genetic algorithms, rough sets, and fuzzy sets. Among them, the Bayesian method is attracting particular attention for its distinctive way of expressing uncertain knowledge, its rich probabilistic expressiveness, and its ability to learn incrementally while incorporating prior knowledge.

Classification is a two-step process. The first step is to build a classifier from a set of known examples; this is the training phase, also called the learning phase. The set of known instances used to construct the classifier is called the training set, and each instance in it is called a training instance. Because the class labels of the training instances are known, constructing the classifier is a supervised learning process. In unsupervised learning, by contrast, the class labels of the training instances are unknown, and sometimes even the number of categories to be learned is unknown, as in clustering.

The second step is to use the trained classifier to classify unknown instances; this is the testing phase, also called the working phase. The unknown instances to be classified are called test instances. Generally, a classifier's accuracy needs to be evaluated before it is used for prediction, and only a classifier that meets the required accuracy should be used to classify test instances.
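The two-step process and the accuracy check can be sketched as a simple hold-out evaluation. Everything here is an assumption made for illustration: the data is synthetic, `train_majority` is a deliberately trivial stand-in for a real learner, and the required accuracy of 0.9 is a hypothetical threshold.

```python
import random
from collections import Counter

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def train_majority(train_set):
    """Stand-in learner: always predict the most common training label."""
    most_common = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda features: most_common

def accuracy(classifier, test_set):
    """Fraction of test instances whose predicted label matches the true label."""
    hits = sum(1 for features, label in test_set if classifier(features) == label)
    return hits / len(test_set)

# Synthetic labelled instances: (features, label)
EXAMPLES = [((i,), "even" if i % 2 == 0 else "odd") for i in range(20)]
REQUIRED_ACCURACY = 0.9  # hypothetical threshold

train_set, test_set = train_test_split(EXAMPLES)   # step 1 uses train_set only
classifier = train_majority(train_set)
score = accuracy(classifier, test_set)             # evaluated on held-out data
print(f"held-out accuracy: {score:.2f}; accept: {score >= REQUIRED_ACCURACY}")
```

The important design point is that accuracy is measured on instances the classifier never saw during training; measuring it on the training set would overstate how well the classifier generalizes.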

The Bayesian method provides a probabilistic means of reasoning. It assumes that the variables of interest follow some probability distribution, and that reasoning about these probabilities together with the observed data leads to optimal decisions. The Bayesian method can not only compute explicit probabilities for hypotheses but also provides an effective framework for understanding many other methods. Its main characteristics are: it learns incrementally; prior knowledge combines with the observed examples to determine the final probability of a hypothesis; hypotheses are allowed to make probabilistic (uncertain) predictions; and a new example can be classified by combining the predictions of multiple hypotheses, weighted by their probabilities.
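Two of these characteristics, incremental learning and probability-weighted prediction, can be illustrated with a short sketch. The hypothesis names, probabilities, and function names below are invented for the example. `update` applies Bayes' rule for one new observation, and `weighted_prediction` lets every hypothesis vote with a weight equal to its posterior probability.

```python
def update(prior, likelihoods):
    """One step of Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def weighted_prediction(posterior, predictions):
    """Sum the posterior weight behind each predicted label and return the winner."""
    votes = {}
    for h, p in posterior.items():
        votes[predictions[h]] = votes.get(predictions[h], 0.0) + p
    return max(votes, key=votes.get), votes

# Incremental learning: feed observations one at a time,
# using each posterior as the prior for the next step.
belief = {"h1": 0.5, "h2": 0.5}
for likelihoods in ({"h1": 0.9, "h2": 0.3}, {"h1": 0.6, "h2": 0.2}):
    belief = update(belief, likelihoods)

# Weighted prediction: the single most probable hypothesis (h1) predicts "+",
# but the hypotheses predicting "-" carry more total probability.
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {"h1": "+", "h2": "-", "h3": "-"}
label, votes = weighted_prediction(posterior, predictions)
print(label)  # -
```

The second example shows why combining hypotheses can differ from simply trusting the most probable one.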

Maximum Likelihood Estimation

Maximum likelihood estimation is a statistical method for finding the parameters of the probability density function that best explain a sample. It was developed and popularized by the geneticist and statistician Sir Ronald Fisher between 1912 and 1922.
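As a small worked example, consider estimating the success probability of a Bernoulli variable. The closed-form maximum likelihood estimate is the fraction of successes, and the sketch below checks that it agrees with a brute-force search over a grid of candidate values. The data and function names are made up for illustration.

```python
import math

def bernoulli_log_likelihood(p, data):
    """Log probability of the observed 0/1 data if each outcome is 1 with probability p."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in data)

def bernoulli_mle(data):
    """Closed-form maximum likelihood estimate: the fraction of ones."""
    return sum(data) / len(data)

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # made-up sample: 7 successes in 10 trials

p_hat = bernoulli_mle(data)

# Brute-force check: evaluate the log-likelihood on a grid and keep the best value.
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))

print(p_hat, best)  # 0.7 0.7
```

The grid search is only there to confirm the closed form; it is the kind of iterative search that a closed-form estimate makes unnecessary.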

The standard Chinese term for "likelihood" is a translation that reads closer to classical Chinese; in modern Chinese it simply means "possibility". The method is therefore easier to understand if it is thought of as "maximum possibility estimation".

In phylogenetics, the maximum likelihood method explicitly uses a probability model, and its goal is to find the phylogenetic tree that produces the observed data with the highest probability. It is representative of a class of tree reconstruction methods based entirely on statistics. The method considers the probability of each nucleotide substitution at each site of a set of aligned sequences.

For example, a transition is approximately three times as likely to occur as a transversion. In a three-sequence alignment, if one column contains a C, a T, and a G, we have reason to believe that the sequences carrying C and T are more closely related, because a change between C and T is a transition. Since the common ancestral sequence of the sequences under study is unknown, the probability calculation becomes complicated; and because multiple substitutions may occur at one or more sites, and not all sites are independent of each other, the complexity increases further. Nevertheless, objective criteria can be used to calculate the probability of each site and the probability of each possible tree representing the relationships among the sequences. Then, by definition, the tree with the largest total probability is the one most likely to reflect the true phylogeny.