Data mining



Introduction

Necessity is the mother of invention. In recent years, data mining has attracted great attention in the information industry. The main reason is that large amounts of widely usable data now exist, and there is an urgent need to convert these data into useful information and knowledge. The resulting information and knowledge can be applied in many areas, including business management, production control, market analysis, engineering design, and scientific exploration.

Data mining is an active topic in artificial intelligence and database research. The term refers to the non-trivial process of revealing hidden, previously unknown, and potentially valuable information from large amounts of data in a database. Data mining is a decision support process. Drawing mainly on artificial intelligence, machine learning, pattern recognition, statistics, databases, and visualization technology, it analyzes enterprise data in a highly automated way, reasons inductively, and digs out potential patterns to help decision-makers adjust market strategies, reduce risks, and make sound decisions. The knowledge discovery process consists of three stages: ① data preparation; ② data mining; ③ expression and interpretation of results. Data mining can interact with users or knowledge bases.

Data mining is a technique for finding rules in a large amount of data by analyzing each item of data. It has three main steps: data preparation, rule search, and rule representation. Data preparation selects the required data from the relevant data sources and integrates them into a data set for mining; rule search uses a particular method to find the rules contained in the data set; rule representation expresses the rules found in a way that users can understand as easily as possible (for example, visualization). Data mining tasks include association analysis, cluster analysis, classification analysis, anomaly analysis, specific group analysis, and evolution analysis.

Data mining draws on ideas from the following fields: ① sampling, estimation, and hypothesis testing from statistics; ② search algorithms, modeling techniques, and learning theories from artificial intelligence, pattern recognition, and machine learning. It has also quickly adopted ideas from other fields, including optimization, evolutionary computing, information theory, signal processing, visualization, and information retrieval. Several other areas play an important supporting role. In particular, database systems are needed to provide efficient storage, indexing, and query processing. Techniques from high-performance (parallel) computing are often important for processing massive data sets. Distributed techniques can also help process massive amounts of data, and they become even more important when the data cannot be gathered in one place for processing.

Background

In the 1990s, with the widespread adoption of database systems and the rapid development of network technology, database technology entered a new stage: it moved from managing only simple data to managing many types of complex data generated by computers, such as graphics, images, audio, video, electronic files, and Web pages, and the volume of data kept growing. While databases provide a wealth of information, they also show the obvious characteristics of information overload. In the era of the information explosion, massive amounts of information have brought many negative effects, the most important being that useful information is hard to extract. Too much useless information inevitably produces information distance (information state transition distance, a measure of the obstacles that the transition of a thing's information state encounters, abbreviated DIST or DIT) and the loss of useful knowledge. This is what John Naisbitt called the "information-rich but knowledge-poor" dilemma. People are therefore eager to analyze massive data in depth and to discover and extract the information hidden in it so that the data can be put to better use. However, with only the entry, query, and statistics functions of a database system, the relationships and rules in the data cannot be found, future trends cannot be predicted from existing data, and there is no means of mining the knowledge hidden behind the data. It was under these conditions that data mining technology came into being.

Data mining objects

The data can be structured, semi-structured, or even heterogeneous. The methods for discovering knowledge can be mathematical, non-mathematical, or inductive. The knowledge finally discovered can be used for information management, query optimization, decision support, and maintenance of the data itself.

The object of data mining can be any type of data source. It can be a relational database, a data source that contains structured data; it can also be a data warehouse, text, multimedia data, spatial data, time series data, or Web data, data sources that contain semi-structured or even heterogeneous data.

Data mining steps

Before carrying out data mining, first decide which steps to take, what to do at each step, and what goals must be met to achieve good results. Only a plan can ensure that data mining proceeds in an orderly way and succeeds. Many software vendors and data mining consulting companies provide data mining process models that guide their users through the work step by step, for example SPSS's 5A and SAS's SEMMA.

The steps of a data mining process model mainly include defining the problem, building a data mining library, analyzing the data, preparing the data, building the model, evaluating the model, and implementing it. The content of each step is as follows.

(1) Define the problem. The first and most important requirement before starting knowledge discovery is to understand the data and the business problem. The goal must be clearly defined; that is, you must decide what you want to do. For example, when you want to increase the utilization of e-mail, the goal may be "increase the user usage rate" or "increase the value of each individual use". The models built to solve these two problems are almost completely different, so a decision must be made.

(2) Build a data mining library. Building a data mining library includes the following steps: data collection, data description, selection, data quality assessment and data cleaning, merging and integration, building metadata, loading the data mining library, and maintaining the data mining library.

(3) Analyze the data. The purpose of the analysis is to find the data fields that have the greatest impact on the predicted output and to decide whether derived fields need to be defined. If the data set contains hundreds or thousands of fields, browsing and analyzing them is very time-consuming and tiring, so you need tool software with a good interface and powerful features to help with the work.

(4) Prepare the data. This is the last data preparation step before building the model. It can be divided into four parts: selecting variables, selecting records, creating new variables, and transforming variables.

(5) Build the model. Model building is an iterative process. You need to examine different models carefully to determine which is most useful for the business problem at hand. First use part of the data to build the model, and then use the remaining data to test and verify it. Training and testing a data mining model therefore requires dividing the data into at least two parts, one for training and the other for testing. Sometimes a third data set, called the validation set, is also used: because the test set may be influenced by the characteristics of the model, an independent data set is needed to verify the model's accuracy.
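
The split described in step (5) can be sketched in a few lines of Python. This is only an illustration: the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions made for the example and do not come from any particular process model.

```python
import random


def split_dataset(records, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle the records and split them into train, validation, and test lists.

    The fractions are illustrative defaults; whatever is left after the
    training and validation parts becomes the test set.
    """
    if train_frac <= 0 or valid_frac < 0 or train_frac + valid_frac >= 1:
        raise ValueError("fractions must leave room for a test set")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test


# Hypothetical data set: 100 record identifiers.
train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))
```

Shuffling before slicing matters when the records are stored in a meaningful order (for example by date), because otherwise the three parts would not be comparable samples.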

(6) Evaluate the model. After the model is built, evaluate the results and explain the value of the model. The accuracy obtained on the test set is meaningful only for the data used to build the model. In practice, you also need to understand the types of errors and the costs associated with them. Experience shows that a model that performs well is not necessarily a correct model; the direct cause is the various assumptions implicit in building it. It is therefore important to test the model directly in the real world: apply it first on a small scale, collect test data, and roll it out widely only when the results are satisfactory.
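
To make the point about error types and their costs concrete, the following sketch counts the two kinds of error separately and weights them differently. The labels, the predictions, and the cost figures are all hypothetical values chosen for the example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            counts["tp"] += 1
        elif p == positive:
            counts["fp"] += 1  # predicted positive, actually negative
        elif a == positive:
            counts["fn"] += 1  # predicted negative, actually positive
        else:
            counts["tn"] += 1
    return counts


def accuracy(counts):
    """Share of all cases that were predicted correctly."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def total_cost(counts, cost_fp, cost_fn):
    """Weight each error type by its (assumed) business cost."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn


# Hypothetical test-set labels (1 = customer defaults) and model predictions.
actual = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(counts))
# Assumed costs: a missed default (false negative) is five times as
# expensive as wrongly rejecting a good customer (false positive).
print(total_cost(counts, cost_fp=1, cost_fn=5))
```

Two models with the same accuracy can have very different total costs, which is why accuracy alone is not enough to judge a model.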

(7) Implement. After the model has been built and verified, there are two main ways to use it. The first is to give it to analysts as a reference; the second is to apply the model to different data sets.

Data mining analysis methods

Data mining is divided into guided data mining and unguided data mining. Guided data mining uses the available data to build a model that describes one specific attribute. Unguided data mining looks for relationships among all attributes. Specifically, classification, estimation, and prediction are guided data mining; association rules and clustering are unguided data mining.

1. Classification. A labeled training set is first selected from the data, data mining techniques are applied to the training set to build a classification model, and the model is then used to classify unlabeled data.

2. Estimation. Estimation is similar to classification, but its output is a continuous value, and the number of possible output values is not predetermined. Estimation can be used as a preparatory step for classification.

3. Prediction. Prediction is carried out through classification or estimation: a model is obtained by training a classifier or estimator, and if the model is sufficiently accurate on the test samples, it can be used to predict the unknown variables of new samples.

4. Affinity grouping or association rules. The purpose is to discover which things tend to occur together.

5. Clustering. Clustering automatically finds and establishes grouping rules. It places similar samples in the same cluster by judging the similarity between samples.
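
As an illustration of the clustering item in the list above, here is a minimal k-means sketch. The text does not name a specific clustering algorithm, so k-means is an assumption made for the example, as are the sample points, the squared Euclidean distance used as the similarity measure, and the naive choice of the first k points as starting centers.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Group points into k clusters by repeatedly reassigning and re-centering."""
    centroids = [points[i] for i in range(k)]  # naive start: first k points
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's center
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return centroids, assignment


# Hypothetical samples: two visibly separate groups in the plane.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```

No labels are supplied anywhere, which is what makes this an unguided method: the grouping comes only from the similarity between samples.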

Success stories

1. Data mining helps Credilogros Cía Financiera S.A. improve customer credit scoring

Credilogros Cía Financiera S.A. is the fifth-largest credit company in Argentina, with estimated assets of 95.7 million U.S. dollars. For Credilogros, it is important to identify the potential risks associated with prospective advance-payment customers in order to minimize the risk it takes on.

The company's first goal was to create a decision engine that interacts with the company's core system and with the systems of two credit reporting companies to process credit applications. At the same time, Credilogros was looking for custom risk scoring tools for the low-income customer groups it serves. Other requirements included a solution that could operate in real time at any of its 35 branch offices and more than 200 related points of sale, including retail home appliance chains and mobile phone retailers.

In the end, Credilogros chose PASW Modeler, the data mining software from SPSS Inc., because it could be integrated flexibly and easily into Credilogros' core information system. By implementing PASW Modeler, Credilogros reduced the time needed to process credit data and produce a final credit score to less than 8 seconds, which allows the organization to approve or reject credit requests quickly. The decision engine also lets Credilogros minimize the identification documents each customer must provide; in some special cases, a single identity document is enough to approve credit. The system also provides monitoring functions. Credilogros currently uses PASW Modeler to process an average of 35,000 applications per month. Only three months after it was implemented, it had helped Credilogros reduce loan defaults by 20%.

2. Data mining helps DHL track the temperature of cargo containers in real time

DHL is a global market leader in the international express and logistics industry. It provides express delivery, transportation by land, sea, and air, contract logistics solutions, and international mail services. DHL's international network connects more than 220 countries and regions, and the company has more than 285,000 employees. Under pressure from the US FDA to ensure that the temperature of drug shipments meets requirements throughout transportation, DHL's pharmaceutical customers strongly demanded more reliable and more affordable options. This required DHL to track container temperature in real time at every stage of delivery.

Although the information produced by data loggers is accurate, it cannot be transmitted in real time, so neither the customer nor DHL can take preventive or corrective measures when a temperature deviation occurs. DHL's parent company, Deutsche Post World Net (DPWN), therefore drew up a plan through its Technology and Innovation Management (TIM) group to use RFID technology to track shipment temperature at different points in time. The process framework that determines the key functional parameters of the service was drawn up with the IBM Global Business Consulting Services department. DHL gained two kinds of benefit. For end customers, the service enables pharmaceutical customers to respond in advance to shipping problems during delivery, and it improves delivery reliability comprehensively and effectively at a compellingly low cost. For DHL, it has improved customer satisfaction and loyalty, laid a solid foundation for maintaining competitive differentiation, and become an important new source of revenue growth.

3. Applications in the telecommunications industry

Price competition is fiercer than ever and voice business growth has slowed, so China's fast-growing mobile communications market faces unprecedented pressure to survive. The accelerated reform of China's telecommunications industry has created a new competitive landscape, and competition in the mobile operator market will become broader and more intense, especially for group customers. Mobile informatization and group customers have become a new engine for operators to cope with competition and sustain growth.

With three-way full-service competition among domestic operators and the issuance of 3G licenses, providing group customers with integrated informatization solutions will become the general trend for operators, and mobile informatization will become the leading force for entering the field of informatization services across the board. Traditional mobile operators therefore face the challenge of shifting from their traditional personal business to expanding group customer informatization business at the same time. How to deal with internal and external challenges, and how to quickly turn mobile informatization services into a competitive tool within integrated services in order to expand the group customer market and stay unbeaten in emerging markets, is an urgent problem for traditional mobile operators.

Classical algorithms

Currently, data mining algorithms mainly include the neural network method, the decision tree method, genetic algorithms, the rough set method, the fuzzy set method, and the association rule method.

Neural network method

The neural network method simulates the structure and function of the biological nervous system. It is a non-linear predictive model learned through training. It treats each connection as a processing unit in an attempt to simulate the function of neurons in the human brain, and it can perform a variety of data mining tasks such as classification, clustering, and feature mining. Learning in a neural network takes place mainly through the modification of weights. Its advantages are resistance to interference, non-linear learning, associative memory, and accurate predictions in complex situations. Its disadvantages are, first, that it is not suited to processing high-dimensional variables and that the intermediate learning process cannot be observed: it behaves like a "black box", and its output is also difficult to explain; second, it needs a long training time. In data mining, the neural network method is mainly used for clustering.
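
The idea that learning shows up as the modification of weights can be illustrated with the simplest possible case, a single perceptron. This is a deliberately reduced sketch and not a full multi-layer network; the function names, the learning rate, and the training data (the logical AND function) are assumptions made for the example.

```python
def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs is positive."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Adjust the weights a little after every wrong answer."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Perceptron rule: move each weight in proportion to the
            # error and to the input that contributed to it.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_samples]
print(predictions)
```

The learned numbers in `weights` and `bias` carry the whole model, which also hints at the "black box" problem: the values work, but they do not read as an explanation.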

Decision tree method

A decision tree is built by constructing classification rules according to their usefulness for the target variable. It classifies data through a series of rules, and it is presented as a tree structure similar to a flowchart. The most typical algorithm is ID3, proposed by J. R. Quinlan in 1986, followed by the extremely popular C4.5 algorithm, which builds on ID3. The advantages of the decision tree method are that the decision process is visible, construction does not take long, the description is simple and easy to understand, and classification is fast; the disadvantage is that it is difficult to find rules based on combinations of several variables. The decision tree method is good at processing non-numerical data and is especially suitable for large-scale data processing. Decision trees provide a way to present rules of the form "under what conditions, what value is obtained". For example, a loan application requires a judgment about how risky the application is.
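
ID3 chooses the attribute to split on by information gain, that is, by how much knowing the attribute reduces the entropy of the target variable. The sketch below computes that measure for the loan example; the six applicant records and the attribute names `income`, `employed`, and `risk` are invented for illustration, and only the choice of the first split is shown, not the full recursive tree construction.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after the split."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def best_attribute(rows, attributes, target):
    """The attribute ID3 would test at this node of the tree."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical loan applications.
applications = [
    {"income": "high", "employed": "yes", "risk": "low"},
    {"income": "high", "employed": "yes", "risk": "low"},
    {"income": "high", "employed": "no", "risk": "high"},
    {"income": "low", "employed": "yes", "risk": "low"},
    {"income": "low", "employed": "no", "risk": "high"},
    {"income": "low", "employed": "no", "risk": "high"},
]

gain_income = information_gain(applications, "income", "risk")
gain_employed = information_gain(applications, "employed", "risk")
root = best_attribute(applications, ["income", "employed"], "risk")
print(root, gain_income, gain_employed)
```

In this made-up data the employment attribute separates the two risk classes completely, so it would become the root test, and the resulting rule reads directly as "if employed, then low risk".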

Genetic algorithm

Genetic algorithms simulate the reproduction, mating, and gene mutation that occur in natural selection and heredity. They are machine learning methods based on evolutionary theory that generate rules using operations such as genetic recombination, crossover and mutation, and natural selection. The basic idea is the principle of "survival of the fittest". Genetic algorithms have implicit parallelism and are easy to combine with other models. Their main advantage is that they can handle many data types and can process different kinds of data in parallel; the disadvantages are that they require many parameters, encoding is difficult, and the amount of computation is generally large. Genetic algorithms are often used to optimize neural networks and can solve problems that are difficult to solve with other techniques.
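
The selection, crossover, and mutation operations can be sketched on a toy problem. The example below evolves bit strings toward all ones (often called OneMax); the problem, the tournament selection scheme, the elitism step, and every parameter value are assumptions chosen to keep the sketch short, and they also show how many parameters even a small genetic algorithm needs.

```python
import random


def fitness(bits):
    """Toy fitness: the number of 1 bits in the individual."""
    return sum(bits)


def evolve(length=20, pop_size=30, generations=40, mutation_rate=0.02, seed=1):
    """Evolve bit strings and return the best one plus the best fitness per step."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_generation = [population[0][:]]  # elitism: the fittest survives unchanged
        while len(next_generation) < pop_size:
            # Selection: the fitter of three random individuals becomes a parent.
            parent_a = max(rng.sample(population, 3), key=fitness)
            parent_b = max(rng.sample(population, 3), key=fitness)
            # Crossover: the child takes a prefix from one parent, the rest from the other.
            cut = rng.randint(1, length - 1)
            child = parent_a[:cut] + parent_b[cut:]
            # Mutation: each bit flips with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_generation.append(child)
        population = next_generation
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the fittest individual is always copied into the next generation, the best fitness recorded in `history` can never decrease from one generation to the next.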

Rough set method

The rough set method, also known as rough set theory, was proposed by the Polish mathematician Z. Pawlak in the early 1980s. It is a mathematical tool for dealing with vague, imprecise, and incomplete problems, and it can handle data reduction, discovery of data correlations, and evaluation of data significance. Its advantages are that the algorithm is simple, no prior knowledge about the data is needed during processing, and the inherent rules of the problem can be found automatically; the disadvantage is that continuous attributes are difficult to process directly and must be discretized first. The discretization of continuous attributes is therefore a difficulty that limits the practical application of rough set theory. Rough set theory is mainly applied to problems such as approximate reasoning, digital logic analysis and simplification, and the construction of predictive models.
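
The core construction of rough set theory is the pair of lower and upper approximations of a set: objects that certainly belong to it and objects that possibly belong to it, given only the available attributes. A minimal sketch follows; the patient table, the attribute names, and the function name `approximations` are hypothetical and serve only to show the construction.

```python
from collections import defaultdict


def approximations(objects, condition_attrs, target_set):
    """Return the lower and upper approximations of target_set.

    Objects with identical values on condition_attrs are indiscernible
    and fall into the same equivalence class.
    """
    classes = defaultdict(set)
    for name, attrs in objects.items():
        key = tuple(attrs[a] for a in condition_attrs)
        classes[key].add(name)
    lower, upper = set(), set()
    for block in classes.values():
        if block <= target_set:
            lower |= block  # every member is in the target: certainly in
        if block & target_set:
            upper |= block  # at least one member is in the target: possibly in
    return lower, upper


# Hypothetical patient records with discrete attributes.
patients = {
    "p1": {"headache": "yes", "temperature": "high"},
    "p2": {"headache": "yes", "temperature": "high"},
    "p3": {"headache": "yes", "temperature": "normal"},
    "p4": {"headache": "no", "temperature": "high"},
    "p5": {"headache": "no", "temperature": "high"},
    "p6": {"headache": "no", "temperature": "normal"},
}
has_flu = {"p1", "p2", "p4"}  # assumed diagnosis for the example

lower, upper = approximations(patients, ["headache", "temperature"], has_flu)
boundary = upper - lower
print(sorted(lower), sorted(upper), sorted(boundary))
```

Patients p4 and p5 have identical symptoms but different diagnoses, so they end up in the boundary region. The sketch also shows why continuous attributes must be discretized first: equivalence classes are formed by exact equality of attribute values.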

Fuzzy set method

The fuzzy set method uses fuzzy set theory to perform fuzzy evaluation, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis on problems. Fuzzy set theory uses degrees of membership to describe the attributes of fuzzy things. The more complex the system, the greater its fuzziness.
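
A degree of membership is a number between 0 and 1 that says how strongly an object belongs to a fuzzy set. The sketch below uses a triangular membership function, one common but by no means the only choice; the temperature sets "warm" and "hot" and their breakpoints are invented for the example, and intersection and union are taken as minimum and maximum, following the standard definitions.

```python
def triangular(x, left, peak, right):
    """Membership that rises from left to peak and falls from peak to right.

    Assumes left < peak < right.
    """
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def warm(t):
    """Hypothetical fuzzy set 'warm' over temperatures in degrees Celsius."""
    return triangular(t, 15, 22, 30)


def hot(t):
    """Hypothetical fuzzy set 'hot' over temperatures in degrees Celsius."""
    return triangular(t, 25, 35, 45)


def fuzzy_and(a, b):
    """Standard fuzzy intersection: the smaller membership."""
    return min(a, b)


def fuzzy_or(a, b):
    """Standard fuzzy union: the larger membership."""
    return max(a, b)


t = 27
print(warm(t), hot(t))
print(fuzzy_and(warm(t), hot(t)), fuzzy_or(warm(t), hot(t)))
```

At 27 degrees the temperature is partly warm and partly hot at the same time, which a crisp set with a single cut-off could not express.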

Association rule method

Association rules reflect the interdependence or correlation between things. The best-known algorithm is the Apriori algorithm proposed by R. Agrawal et al. Its idea is first to find all frequent itemsets, that is, itemsets whose support is at least the predefined minimum support, and then to generate strong association rules from those frequent itemsets. Minimum support and minimum confidence are the two thresholds used to discover meaningful association rules. In this sense, the purpose of data mining is to mine from the source database the association rules that meet the minimum support and minimum confidence.
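
The two phases described above, finding frequent itemsets and then deriving strong rules, can be sketched as follows. This is a compact teaching version rather than an optimized implementation; the shopping baskets, the thresholds of 0.6 and 0.8, and the function names are assumptions made for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: build size-k candidates from frequent (k-1)-itemsets."""
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        level = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def association_rules(frequent, min_confidence):
    """Split each frequent itemset into antecedent => consequent rules."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, confidence))
    return rules


# Hypothetical shopping baskets.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]

frequent = frequent_itemsets(baskets, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), sup, conf)
```

The prune step relies on the fact that every subset of a frequent itemset must itself be frequent, which is what keeps the number of candidates manageable as itemsets grow.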

Existing problems

Data mining also raises privacy issues. For example, an employer could access medical records to screen out people with diabetes or severe heart disease in order to reduce insurance expenditure. Such a practice can lead to ethical and legal problems.

Mining government and commercial data may touch on national security or trade secrets, which is also a challenge for confidentiality.

Data mining has many legitimate uses. For example, it can reveal the relationship between a drug and its side effects in a database of patients. Such a relationship might not show up even once in 1,000 people, but pharmacology-related projects can use this method to reduce the number of patients who have adverse reactions to drugs and may save lives. Even so, the problem remains that such databases may be abused.

Data mining makes it possible to discover information that cannot be found by other methods, but it must be regulated and should be used under appropriate guidance.

If the data are collected from specific individuals, confidentiality, legal, and ethical issues arise.
