Introduction
Necessity is the mother of invention. In recent years, data mining has attracted great attention in the information industry. The main reason is that large amounts of data are now widely available, and there is an urgent need to convert these data into useful information and knowledge. The acquired information and knowledge can be applied in many areas, including business management, production control, market analysis, engineering design, and scientific exploration.
Data mining is a hot topic in artificial intelligence and database research. Data mining refers to the non-trivial process of revealing hidden, previously unknown, and potentially valuable information from large amounts of data in a database. It is a decision support process that draws mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases, and visualization technology to analyze enterprise data in a highly automated way, reason inductively, and dig out potential patterns, helping decision-makers adjust market strategies, reduce risks, and make correct decisions. The knowledge discovery process consists of three stages: ① data preparation; ② data mining; ③ result expression and interpretation. Data mining can interact with users or knowledge bases.
Data mining is a technique for finding regularities in a large amount of data by analyzing each item of data. There are three main steps: data preparation, rule search, and rule representation. Data preparation selects the required data from the relevant data sources and integrates them into a data set for mining; rule search uses a chosen method to find the regularities contained in the data set; rule representation expresses the discovered rules in a way that users can understand as far as possible (for example, through visualization). Data mining tasks include association analysis, cluster analysis, classification analysis, anomaly analysis, specific group analysis, and evolution analysis.
Data mining draws on ideas from the following fields: ① sampling, estimation, and hypothesis testing from statistics; ② search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. Data mining has also quickly adopted ideas from other fields, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval. Some other areas play an important supporting role. In particular, database systems are needed to provide efficient storage, indexing, and query processing. Techniques from high-performance (parallel) computing are often important for processing massive data sets. Distributed techniques can also help process massive amounts of data, and they become even more important when the data cannot be gathered in one place for processing.
Background
In the 1990s, with the widespread application of database systems and the rapid development of network technology, database technology entered a new stage: it moved from managing only simple data to managing many types of complex data generated by computers, such as graphics, images, audio, video, electronic files, and Web pages, and the volume of data kept growing. While databases provide a wealth of information, they also show the obvious characteristics of information overload. In the era of information explosion, massive amounts of information have brought many negative effects, the most important being that useful information is difficult to extract. Too much useless information inevitably produces information distance (information state transition distance, a measure of the obstacles encountered when the information state of a thing is transferred, abbreviated DIST or DIT) and the loss of useful knowledge. This is what John Naisbitt called the "information-rich but knowledge-poor" dilemma. People are therefore eager to analyze massive data in depth and to discover and extract the hidden information so as to make better use of the data. However, the input, query, and statistics functions of a database system alone cannot find the relationships and rules in the data, cannot predict future trends from existing data, and offer no means of mining the knowledge hidden behind the data. It was under these conditions that data mining technology came into being.
Data mining objects
The data can be structured, semi-structured, or even heterogeneous. The method of discovering knowledge can be mathematical, non-mathematical, or inductive. The knowledge finally discovered can be used for information management, query optimization, decision support, and maintenance of the data itself.
The object of data mining can be any type of data source. It can be a relational database, which contains structured data; it can also be a data warehouse, text, multimedia data, spatial data, time series data, or Web data, which contain semi-structured or even heterogeneous data.
Data mining steps
Before carrying out data mining, first decide which steps to take, what to do at each step, and what goals must be met to achieve good results. Only a plan can ensure that data mining proceeds in an orderly way and succeeds. Many software vendors and data mining consulting companies provide data mining process models to guide their users through the work step by step, for example SPSS's 5A and SAS's SEMMA.
The steps of the data mining process model mainly include defining the problem, building a data mining library, analyzing data, preparing data, building models, evaluating models, and implementation. Each step is described in detail below.
(1) Define the problem. The first and most important requirement before starting knowledge discovery is to understand the data and the business problem. The goal must be clearly defined, that is, you must decide what you want to do. For example, when you want to increase the utilization of e-mail, the goal may be "increase user utilization" or "increase the value of a single use by each user". The models built to solve these two problems are almost completely different, so a decision must be made.
(2) Build a data mining library. Building a data mining library includes the following steps: data collection, data description, selection, data quality evaluation and data cleaning, merging and integration, building metadata, loading the data mining library, and maintaining the data mining library.
(3) Analyze data. The purpose of the analysis is to find the data fields that have the greatest impact on the predicted output and to decide whether derived fields need to be defined. If the data set contains hundreds or thousands of fields, browsing and analyzing the data is very time-consuming and tiring; at this point you need tool software with a good interface and powerful features to help.
(4) Prepare data. This is the last data preparation step before building the model. It can be divided into four parts: selecting variables, selecting records, creating new variables, and transforming variables.
(5) Build a model. Model building is an iterative process. You need to examine different models carefully to determine which is most useful for the business problem you face. First use part of the data to build a model, and then use the remaining data to test and verify it. Sometimes there is a third data set, called the validation set, because the test set may be affected by the characteristics of the model, so an independent data set is needed to verify the model's accuracy. Training and testing a data mining model requires dividing the data into at least two parts, one for training and the other for testing.
(6) Evaluate the model. After the model has been built, the results obtained must be evaluated and the value of the model explained. The accuracy obtained from the test set is only meaningful for the data used to build the model. In practical applications, it is necessary to further understand the types of errors and their associated costs. Experience has shown that an effective model is not necessarily a correct model; the direct reason is the various assumptions implicit in building it. It is therefore important to test the model directly in the real world: apply it on a small scale first, collect test data, and roll it out more widely once the results are satisfactory.
(7) Implementation. After the model has been built and verified, there are two main ways to use it. The first is to give it to analysts as a reference; the other is to apply the model to different data sets.
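Steps (5) and (6) can be illustrated with a minimal sketch. The split fractions, the function names, the sample labels, and the error costs below are all assumptions chosen for illustration; they do not come from any particular process model such as 5A or SEMMA. The sketch divides the records into training, validation, and test parts, and then counts the types of errors so that each type can be given its own cost.

```python
import random


def split_dataset(records, train_frac=0.6, validation_frac=0.2, seed=0):
    """Shuffle the records and split them into train, validation and test parts."""
    if train_frac <= 0 or validation_frac < 0 or train_frac + validation_frac >= 1:
        raise ValueError("fractions must leave room for a non-empty test set")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = int(len(shuffled) * train_frac)
    n_validation = int(len(shuffled) * validation_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]
    return train, validation, test


def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


def error_cost(counts, cost_false_positive, cost_false_negative):
    """Total cost when the two kinds of error are not equally expensive."""
    return counts["fp"] * cost_false_positive + counts["fn"] * cost_false_negative


train, validation, test = split_dataset(range(10))
print(len(train), len(validation), len(test))  # 6 2 2

# Hypothetical labels: what really happened vs. what a model predicted.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(actual, predicted)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)
print(counts, accuracy)  # 3 tp, 1 fp, 3 tn, 1 fn; accuracy 0.75
print(error_cost(counts, cost_false_positive=1, cost_false_negative=5))  # 6
```

Two models with the same accuracy can have very different costs once false positives and false negatives are priced separately, which is why the accuracy figure alone is not enough for evaluation.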
Data mining analysis methods
Data mining is divided into guided (supervised) data mining and unguided (unsupervised) data mining. Guided data mining uses the available data to build a model that describes a specific attribute. Unguided data mining looks for relationships among all attributes. Specifically, classification, estimation, and prediction are guided data mining; association rules and clustering are unguided data mining.
1. Classification. A training set of already classified records is first selected from the data, data mining techniques are applied to the training set to build a classification model, and the model is then used to classify unclassified data.
2. Estimation. Estimation is similar to classification, but its final output is a continuous value, and the set of possible values is not predetermined. Estimation can be used as preparatory work for classification.
3. Prediction. Prediction is carried out through classification or estimation: a model is obtained by classification or estimation training, and if the model achieves high accuracy on the test samples, it can be used to predict the unknown variables of new samples.
4. Affinity grouping or association rules. The purpose is to discover which things tend to occur together.
5. Clustering. Clustering automatically finds and establishes grouping rules. It groups similar samples into a cluster by judging the similarity between samples.
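As an illustration of clustering (item 5), the following is a minimal sketch of k-means on one-dimensional data. The text does not name a specific clustering algorithm, so k-means is an assumption made here, as are the function name `kmeans_1d`, the sample values, and the starting centers. Similarity is judged by absolute distance to the nearest cluster center.

```python
def kmeans_1d(values, initial_centers, max_iterations=100):
    """Group numbers around centers by repeatedly reassigning and re-averaging."""
    centers = list(initial_centers)
    clusters = [[] for _ in centers]
    for _ in range(max_iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for value in values:
            nearest = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Update step: move each center to the mean of its members.
        new_centers = [
            sum(members) / len(members) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # nothing moved, so the grouping is stable
            break
        centers = new_centers
    return centers, clusters


centers, clusters = kmeans_1d([1.0, 1.5, 2.0, 9.0, 10.0, 11.0], [0.0, 5.0])
print(centers)   # [1.5, 10.0]
print(clusters)  # [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

No class labels are supplied; the two groups emerge only from the similarity between samples, which is what makes this an unguided method.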
Success stories
1. Data mining helps Credilogros Cía Financiera SA improve customer credit scoring
Credilogros Cía Financiera SA is the fifth-largest credit company in Argentina, with an estimated asset value of 95.7 million U.S. dollars. For Credilogros, it is important to identify the risks associated with potential prepayment customers in order to minimize the risk it takes on.
The company's first goal was to create a decision engine that interacts with the company's core system and the systems of two credit reporting companies to process credit applications. At the same time, Credilogros was looking for custom risk scoring tools for the low-income customer groups it serves. Other requirements included a solution that could operate in real time at any of its 35 branch offices and more than 200 related points of sale, including retail home appliance chains and mobile phone sales companies.
In the end, Credilogros chose SPSS Inc.'s data mining software PASW Modeler because it could be integrated flexibly and easily into Credilogros' core information system. By implementing PASW Modeler, Credilogros reduced the time needed to process credit data and deliver the final credit score to less than 8 seconds. This allows the organization to approve or reject credit requests quickly. The decision engine also enables Credilogros to minimize the identification documents that each customer must provide; in some special cases, only one identity document is required to approve credit. The system also provides monitoring functions. Credilogros currently uses PASW Modeler to process an average of 35,000 applications per month. Only 3 months after it went live, it helped Credilogros reduce loan payment defaults by 20%.
2. Data mining helps DHL track the temperature of cargo containers in real time
DHL is a global market leader in the international express and logistics industry. It provides express delivery, land, sea, and air transportation, contract logistics solutions, and international mail services. DHL's international network connects more than 220 countries and regions, and the company has more than 285,000 employees. Under pressure from the US FDA to ensure that the temperature of drug shipments meets requirements during transportation, DHL's pharmaceutical customers strongly demanded more reliable and more affordable options. This required DHL to track container temperature in real time at every stage of delivery.
Although the information generated by the logger method is accurate, the data cannot be transmitted in real time, so neither the customer nor DHL can take preventive or corrective measures when temperature deviations occur. Therefore, DHL's parent company Deutsche Post World Net (DPWN), through its Technology and Innovation Management (TIM) group, drew up a plan to use RFID technology to track shipment temperature at different points in time, and worked with the IBM Global Business Services consulting department to draw up a process framework for determining the key functional parameters of the service. DHL gained two kinds of benefit. For end customers, the service enables pharmaceutical customers to respond in advance to shipping problems during delivery and improves delivery reliability comprehensively and effectively at a compellingly low cost. For DHL, it has improved customer satisfaction and loyalty, laid a solid foundation for maintaining competitive differentiation, and become an important new source of revenue growth.
3. Applications in the telecommunications industry
Price competition is unprecedentedly fierce, voice business growth has slowed, and the fast-growing Chinese mobile communications market is facing unprecedented survival pressure. The accelerated reform of China's telecommunications industry has created a new competitive landscape, and the breadth and intensity of competition in the mobile operator market will increase further, especially for group (enterprise) customers. Mobile informatization and group customers have become a new engine for operators to cope with competition and achieve sustained growth.
With three full-service operators competing domestically and 3G licenses being issued, providing group customers with integrated informatization solutions will become the general trend for operators, and mobile informatization will become a leading force in a comprehensive entry into the field of informatization services. Traditional mobile operators therefore face the challenge of shifting from the traditional personal business to expanding the group customer informatization business at the same time. How to deal with internal and external challenges, quickly use mobile informatization services as one of the competitive tools of integrated services to expand the group customer market, and remain competitive in emerging markets is an urgent problem that traditional mobile operators need to solve.
Classical algorithms
Currently, data mining algorithms mainly include the neural network method, decision tree method, genetic algorithm, rough set method, fuzzy set method, and association rule method.
Neural network method
The neural network method simulates the structure and function of the biological nervous system. It is a non-linear predictive model learned through training. It treats every connection as a processing unit and tries to simulate the functions of neurons in the human brain, and it can complete a variety of data mining tasks such as classification, clustering, and feature mining. Learning in a neural network consists mainly of modifying the weights. Its advantages are robustness to interference, non-linear learning, associative memory, and accurate predictions in complex situations. Its disadvantages are, first, that it is not well suited to high-dimensional variables, its learning process cannot be observed, it behaves as a "black box", and its outputs are difficult to explain; and second, that it needs a long training time. The neural network method is mainly used in the clustering techniques of data mining.
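The statement that learning consists mainly of modifying the weights can be shown with the simplest possible case, a single perceptron. This is only a sketch under assumptions: the function names, the learning rate, the number of epochs, and the AND-gate training data are all chosen for illustration, and a practical neural network has many units and a different training rule.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Learn weights for one threshold unit by correcting it after each mistake."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            activation = sum(w * x for w, x in zip(weights, inputs)) + bias
            output = 1 if activation > 0 else 0
            error = target - output  # 0 when correct, +1 or -1 when wrong
            # Learning step: nudge each weight in the direction that reduces the error.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, inputs):
    """Apply the trained unit to one input vector."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


# Hypothetical training data: the logical AND of two binary inputs.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_gate)
print([predict(weights, bias, inputs) for inputs, _ in and_gate])  # [0, 0, 0, 1]
```

The trained weights are just numbers with no direct interpretation, which hints at the "black box" character described above even in this tiny case.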
Decision tree method
A decision tree is a process of constructing classification rules according to their utility for the target variable: data are classified through a series of rules, and the result looks like a tree-structured flowchart. The most typical algorithm is ID3, proposed by J. R. Quinlan in 1986, followed by the extremely popular C4.5 algorithm, which is based on ID3. The advantages of the decision tree method are that the decision-making process is visible, the tree does not take long to construct, the description is simple and easy to understand, and classification is fast; the disadvantage is that it is difficult to find rules based on a combination of multiple variables. The decision tree method is good at processing non-numerical data and is especially suitable for large-scale data processing. Decision trees provide a way to show rules of the form "what value will be obtained under what conditions". For example, in a loan application, the degree of risk of the application has to be judged.
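ID3 chooses the attribute to split on by information gain, that is, by how much a split reduces the entropy of the target variable. The sketch below shows only that selection step, not a full tree builder. The loan records, the attribute names `income` and `has_collateral`, and the function names are hypothetical and serve only to illustrate the loan-risk example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """The attribute ID3 would place at the root: the one with the highest gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical loan applications.
applications = [
    {"income": "high", "has_collateral": "yes", "risk": "low"},
    {"income": "high", "has_collateral": "no", "risk": "low"},
    {"income": "low", "has_collateral": "yes", "risk": "high"},
    {"income": "low", "has_collateral": "no", "risk": "high"},
]
root = best_attribute(applications, ["income", "has_collateral"], "risk")
print(root)  # income
```

In this toy data set, income separates the two risk classes completely (a gain of 1 bit), while collateral tells nothing about risk (a gain of 0), so the rule "if income is high, then risk is low" becomes the first branch of the tree.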
Genetic algorithm
A genetic algorithm simulates the reproduction, mating, and gene mutation that occur in natural selection and heredity. It is a machine learning method based on evolutionary theory that generates rules using operations such as genetic combination, crossover and mutation, and natural selection. Its basic principle is "survival of the fittest", and it has the properties of implicit parallelism and easy integration with other models. The main advantages are that it can handle many data types and can process various data in parallel; the disadvantages are that it requires many parameters, encoding is difficult, and the amount of computation is generally large. Genetic algorithms are often used to optimize neural networks and can solve problems that are difficult to solve with other techniques.
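The operations named above (selection, crossover, and mutation) can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the bit-string encoding, the fitness function (the number of 1 bits), and the population size, mutation rate, and number of generations. It also shows how many parameters even a small genetic algorithm needs.

```python
import random


def fitness(bits):
    """Toy fitness: the more 1 bits, the fitter the individual."""
    return sum(bits)


def evolve(length=20, population_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Evolve bit strings and return the best one plus the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_generation = [population[0][:]]         # keep the current best unchanged
        parents = population[:population_size // 2]  # selection: the fitter half survives
        while len(next_generation) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]      # single-point crossover
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            next_generation.append(child)            # child after random mutation
        population = next_generation
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best individual is always carried over unchanged, the best fitness recorded in `history` never decreases from one generation to the next.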
Rough set method
The rough set method, also known as rough set theory, was proposed by the Polish mathematician Z. Pawlak in the early 1980s. It is a mathematical tool for dealing with vague, imprecise, and incomplete problems, and it can handle data reduction, discovery of data correlations, and evaluation of the significance of data. Its advantages are that the algorithm is simple, no prior knowledge about the data is needed during processing, and the inherent regularities of the problem can be found automatically; its disadvantage is that it is difficult to process continuous attributes directly, so attributes must be discretized first. The discretization of continuous attributes is therefore a difficulty that restricts the practical application of rough set theory. Rough set theory is mainly applied to problems such as approximate reasoning, digital logic analysis and simplification, and the building of predictive models.
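The central construction of rough set theory is the pair of lower and upper approximations of a target set: objects that certainly belong to it and objects that possibly belong to it, given that objects with identical attribute values cannot be told apart. The following sketch uses a hypothetical table of patients; the attribute names, values, and function names are assumptions made for illustration. Note that the attributes are already discrete, as the method requires.

```python
from collections import defaultdict


def equivalence_classes(objects, attributes):
    """Group objects that are indistinguishable on the given attributes."""
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(name)
    return list(classes.values())


def approximations(objects, attributes, target):
    """Return the (lower, upper) approximations of the target set."""
    lower, upper = set(), set()
    for block in equivalence_classes(objects, attributes):
        if block <= target:  # every member is in the target: certainly inside
            lower |= block
        if block & target:   # at least one member is in the target: possibly inside
            upper |= block
    return lower, upper


# Hypothetical patient records with discrete attribute values.
patients = {
    "p1": {"headache": "yes", "temperature": "high"},
    "p2": {"headache": "yes", "temperature": "high"},
    "p3": {"headache": "no", "temperature": "high"},
    "p4": {"headache": "no", "temperature": "normal"},
}
has_flu = {"p1", "p3"}
lower, upper = approximations(patients, ["headache", "temperature"], has_flu)
print(sorted(lower))  # ['p3']
print(sorted(upper))  # ['p1', 'p2', 'p3']
```

Patients p1 and p2 have identical symptoms but different diagnoses, so neither can be classified with certainty; the difference between the upper and lower approximations is the boundary region that makes the set "rough".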
Fuzzy set method
The fuzzy set method uses fuzzy set theory to perform fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on problems. Fuzzy set theory uses the degree of membership to describe the attributes of fuzzy things. The higher the complexity of a system, the stronger its fuzziness.
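A degree of membership is a number between 0 and 1 that says how strongly an object belongs to a fuzzy set. The sketch below uses a triangular membership function, which is one common choice and an assumption here, as are the fuzzy sets "warm" and "hot" and their temperature ranges. Intersection and union are taken as minimum and maximum, the standard operators of fuzzy set theory.

```python
def triangular(x, left, peak, right):
    """Membership rises linearly from left to peak, then falls to right.

    Assumes left < peak < right.
    """
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def warm(temperature):
    """Hypothetical fuzzy set: fully warm at 25 degrees, not warm below 15 or above 35."""
    return triangular(temperature, 15, 25, 35)


def hot(temperature):
    """Hypothetical fuzzy set: fully hot at 35 degrees, not hot below 25 or above 45."""
    return triangular(temperature, 25, 35, 45)


t = 28
print(warm(t), hot(t))      # 0.7 0.3
print(min(warm(t), hot(t)))  # "warm AND hot": 0.3
print(max(warm(t), hot(t)))  # "warm OR hot": 0.7
```

A temperature of 28 degrees belongs to both sets at once, to different degrees, which a crisp set with a single cut-off value cannot express.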
Association rule method
Association rules reflect the interdependence or correlation between things. The most famous algorithm is the Apriori algorithm proposed by R. Agrawal et al. The idea of the algorithm is to first find all frequent itemsets whose frequency is at least the predefined minimum support, and then generate strong association rules from the frequent itemsets. Minimum support and minimum confidence are the two thresholds given for discovering meaningful association rules. In this sense, the purpose of data mining is to mine from the source database the association rules that satisfy the minimum support and the minimum confidence.
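The two phases described above, finding the frequent itemsets and then generating strong rules, can be sketched in a compact form. This is a simplified illustration of the Apriori idea rather than an optimized implementation; the function names, the shopping-basket transactions, and the threshold values are assumptions.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: frequent k-itemsets are built from frequent (k-1)-itemsets."""
    current = [frozenset([item]) for item in {i for t in transactions for i in t}]
    frequent, size = {}, 1
    while current:
        survivors = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                survivors[candidate] = s
        frequent.update(survivors)
        size += 1
        # Join step: merge surviving itemsets into candidates one item larger.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == size}
        # Prune step: a candidate is viable only if all its subsets were frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in survivors for sub in combinations(c, size - 1))
        ]
    return frequent


def strong_rules(frequent, min_confidence):
    """Rules antecedent -> consequent whose confidence reaches the threshold."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                confidence = s / frequent[antecedent]
                if confidence >= min_confidence:
                    found.append((set(antecedent), set(itemset - antecedent), s, confidence))
    return found


# Hypothetical shopping baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = frequent_itemsets(baskets, min_support=0.5)
rules = strong_rules(frequent, min_confidence=0.9)
print(rules)  # [({'butter'}, {'bread'}, 0.5, 1.0)]
```

With these thresholds, the only strong rule is "butter implies bread": the pair appears in half of the baskets (support 0.5), and every basket with butter also contains bread (confidence 1.0).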
Existing problems
Data mining also involves privacy issues. For example, an employer could access medical records to screen out people with diabetes or severe heart disease, with the intention of reducing insurance expenditure. However, this practice can lead to ethical and legal problems.
The mining of government and commercial data may involve issues such as national security or trade secrets, which is also a challenge for confidentiality.
Data mining has many legitimate uses. For example, it can find the relationship between a drug and its side effects in a database of a patient group. Such a relationship may not show up even once in 1,000 people, but pharmacology-related projects can use this method to reduce the number of patients who have adverse reactions to drugs and may save lives. There is still the problem, however, that databases may be abused.
Data mining makes it possible to discover information that cannot be found by other methods, but it must be regulated and should be used with appropriate guidance.
If the data are collected from specific individuals, confidentiality, legal, and ethical issues will arise.