
The Effectiveness of Automatically Structured Queries in Digital Libraries


Marcos André Gonçalves∗, Edward A. Fox∗, Aaron Krowne†, Pável Calado‡, Alberto H. F. Laender‡, Altigran S. da Silva§, Berthier Ribeiro-Neto‡¶

∗ Dept. of Computer Science, Virginia Tech, Blacksburg, VA, USA. {mgoncalv,fox}@vt.edu
† Library Systems, Emory University General Libraries, Atlanta, GA, USA. akrowne@emory.edu
‡ Dept. of Computer Science, Federal University of Minas Gerais, Belo Horizonte, MG, Brazil. {pavel,laender}@dcc.ufmg.br
§ Dept. of Computer Science, Federal University of Amazonas, Manaus, AM, Brazil. alti@dcc.fua.br
¶ Akwan Information Technologies, Belo Horizonte, MG, Brazil. www.akwan.com.br, berthier@akwan.com.br

ABSTRACT

Structured or fielded metadata is the basis for many digital library services, including searching and browsing. Yet, little is known about the impact of using structure on the effectiveness of such services. In this paper, we investigate a key research question: do structured queries improve effectiveness in DL searching? To answer this question, we empirically compared the use of unstructured queries to the use of structured queries. We then tested the capability of a simple Bayesian network system, built on top of a DL retrieval engine, to infer the best structured queries from the keywords entered by the user. Experiments performed with 20 subjects working with a DL containing a large collection of computer science literature clearly indicate that structured queries, either manually constructed or automatically generated, perform better than their unstructured counterparts, in the majority of cases. Also, automatic structuring of queries appears to be an effective and viable alternative to manual structuring that may significantly reduce the burden on users.

Categories and Subject Descriptors

H.3.7 [Information Systems]: Information Storage and Retrieval: Digital Libraries

General Terms

Experimentation, Human Factors

Keywords

Digital Libraries, Structured Queries, Bayesian Networks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

JCDL'04, June 7–11, 2004, Tucson, Arizona, USA. Copyright 2004 ACM 1-58113-832-6/04/0006...$5.00.

1. INTRODUCTION

Ensuring the high quality of Digital Library (DL) services is key to guaranteeing DL usefulness and patrons' satisfaction. Largely because of this concern for quality, metadata, and more specifically, structured or fielded metadata, has historically been the basis for many digital library services, including basic ones such as searching and browsing. Yet, regarding the effectiveness of such services, little is known about the impact of using structure. Moreover, while a few DL services try to utilize this information through the use of advanced interfaces (footnote 1), experience has shown that users rarely make use of these features, most probably due to the complexity of user interfaces and lack of knowledge of internal DL structures.

In this paper, we investigate a key research question: do structured queries improve search effectiveness in DLs? To answer this question, we empirically compared the use of unstructured queries to the use of structured queries. Since users are often unwilling, or unable, to manually structure their queries, we also provide a simple system that tries to close the gap between the user's information need and the DL content. This experimental engine, built on top of a Bayesian network model and a retrieval system optimized for DLs, tries to infer the best structured queries from the keywords entered by the user, based on knowledge of DL structures and collection statistics. A very simple text box user interface guarantees the simplicity of the process. To ensure proper treatment of their information need, users simply have to choose from an automatically produced ranked list of structured queries.

Footnote 1: See, for example, http://www.acm.org/dl or http://www.informatik.uni-trier.de/~ley/db/indices/query.html.

To test our hypotheses and methods, we performed a series of experiments with 20 subjects (graduate students and researchers) using CITIDEL (Computing and Information Technology Interactive Digital Educational Library) (footnote 2), a DL containing a large collection of computer science literature, including metadata from the ACM Digital Library, the DBLP collection, NDLTD-Computing (the computing subset of the Networked Digital Library of Theses and Dissertations) (footnote 3), and other sources. Results, using three different information retrieval (IR) measures, indicate that structured queries, either manually constructed or automatically generated, perform better than their unstructured counterparts in the majority of cases. Also, automatic structuring of queries appears to be a viable alternative to manual structuring, since it reduces work for users while boosting effectiveness.

This paper is organized as follows. Section 2 explains the underlying models and context of the work. Section 3 describes ESSEX, a retrieval system optimized for DLs, which provides the basic retrieval capabilities and supports the structuring process. Section 4 details the query structuring process, including the Bayesian network model and the query ranking schemes. Section 5 discusses experimental setup and results. Section 6 presents related work and Section 7 concludes the paper, also including plans for future work.

2. CONTEXT AND DEFINITIONS

In this work, we adopt a simplified view of the structured metadata that describes the contents of a DL. According to this view, each document or digital object $do_i$ stored in the DL is described by at least one metadata specification. The j-th metadata specification for object $do_i$ is defined as a set of pairs:

$$ms_{ji} = \{A_1 : v_{1ji}, \ldots, A_{n_{ji}} : v_{n_{ji}ji}\}, \quad n_{ji} \geq 1$$

where each $A_k$ is an attribute or metadata field and each $v_{kji}$ is a value belonging to the domain of $A_k$. We note that the attributes need not be the same for all metadata specifications.

For some attributes, instead of a single value, we may have a set or list of values. For instance, in a metadata specification describing a paper, the attribute author might be a list of names. To represent this using our notation, we allow the same attribute to appear several times; we call this a value list. Thus, if attribute $A_p$, in the metadata specification $ms_{ji}$, has $n$ different values, we can represent metadata specification $ms_{ji}$ as:

$$ms_{ji} = \{\ldots, A_p : v_{p1}, A_p : v_{p2}, \ldots, A_p : v_{pn}, \ldots\}$$

We define the metadata schema of a DL as the set of all attributes that compose any of the metadata specifications of that DL. Thus, the metadata schema of a DL $D$ is defined as:

$$S_D = \{A \mid A \text{ is an attribute of some metadata specification in } D\} \qquad (1)$$

We define an unstructured query $U$ as a set of keywords (or terms):

$$U = \{t_1, t_2, \ldots, t_k\}$$

As for a metadata specification, a structured query $Q$ is defined as a set of pairs:

$$Q = \{A_1 : v_{1q}, \ldots, A_{n_q} : v_{n_q q}\}, \quad n_q \geq 1,$$

where each $A_k$ is an attribute or metadata field and each $v_{kq}$ is a value belonging to the domain of $A_k$.

This simplified set of definitions allows us to ignore the details of how metadata is actually represented in the DL, since it can be mapped from any actual representation format.

Footnote 2: See http://www.citidel.org/.
Footnote 3: See http://www.ndltd.org/.
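The definitions above map directly onto simple data structures. The following is a minimal sketch, assuming that a metadata specification and a structured query are both stored as lists of (attribute, value) pairs, so that a repeated attribute encodes a value list. The function name `metadata_schema` and all record contents are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple

# An (attribute, value) pair; repeating an attribute encodes a value list.
Pair = Tuple[str, str]


def metadata_schema(specs: List[List[Pair]]) -> Set[str]:
    """Eq. (1): the set of all attributes used by any metadata specification."""
    return {attr for spec in specs for attr, _ in spec}


# Hypothetical metadata specifications (invented records, not CITIDEL data).
specs = [
    [("title", "query structuring"), ("author", "jones"), ("author", "smith")],
    [("title", "vector space retrieval"), ("publication", "example journal")],
]

# An unstructured query U is a set of terms; a structured query Q is a set of pairs.
unstructured_query = {"jones", "structuring"}
structured_query = [("author", "jones"), ("title", "structuring")]

schema = metadata_schema(specs)
```

Every attribute used in a structured query is expected to belong to the schema, which is the only property the later steps rely on.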

3. THE RETRIEVAL SYSTEM: ESSEX

ESSEX is a vector-space IR system optimized for the digital library setting. It is designed to be light and fast, and to make few demands on the architecture of the rest of the DL system. It achieves these objectives by an optimized C++ implementation, an entirely in-memory index, and a background daemon model using socket communication with the DL application.

In addition to these architectural provisions, ESSEX has a number of query language features that make it well suited to digital libraries. Besides basic features such as force/forbid ("+" and "-") term operators, ESSEX supports field filters and adjustable field weightings (footnote 4).

Field filters have the syntax "field:term", where "field" is an indexed metadata field, and "term" is the query term. A field filter modifies the behavior of the search such that matches will only be made with term occurrences within the specified field.

ESSEX was developed primarily for CITIDEL and currently serves as the search engine for CITIDEL and PlanetMath (footnote 5). Our familiarity with the code made it a natural choice as a test-bed for the experimental query structuring system discussed in this paper. In addition, ESSEX's field filtering capability served as the core of the query structuring engine. We also utilized ESSEX's support for the "+" operator, and may use its field weighting support in the future. Details on how some of these features were used are explained in the following sections.
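To make the field filter syntax concrete, the sketch below serializes a structured query into a query string. The "field:term" form and the "+" operator come from the description above; writing them together as "+field:term" is an assumption about how the two features compose, and `to_filter_query` is a hypothetical helper, not part of ESSEX.

```python
def to_filter_query(pairs, require_all=True):
    """Serialize (field, term) pairs as space-separated field filters.

    With require_all=True each filter is prefixed with "+", assuming the
    force operator can be applied to a field-filtered term.
    """
    prefix = "+" if require_all else ""
    return " ".join(f"{prefix}{field}:{term}" for field, term in pairs)


example = to_filter_query([("author", "jones"), ("title", "algorithm")])
```

Calling the helper with `require_all=False` yields plain field filters, which corresponds to relaxing the "+" constraint.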

4. RANKING QUERIES: THE BAYESIAN NETWORK MODEL

This section presents an overview of the automatic query structuring approach. We start by describing the general querying process and explain how user queries are structured automatically and ranked according to the likelihood that they will satisfy the users' needs.

Footnote 4: Field weightings allow the DL provider and user to change the contribution of the various metadata fields to the final results ranking. For more information on this and other features of ESSEX and how they make it a good choice for a digital library search engine, see http://br.endernet.org/~akrowne/elaine/essex/index.html.
Footnote 5: See http://planetmath.org/.

4.1 The Query Structuring Process

In our ESSEX query structure inference system, query structuring consists of: (1) collecting the unstructured user query, (2) building a set of candidate structured queries, and (3) ranking the candidate queries according to the probability of best representing the user's needs, as proposed in [8, 15].

To explain these steps in detail, assume that the objects in our digital library have fields author and title. Let $U = \{t_1, t_2, t_3\}$ be the initial, unstructured query entered by the user, where $t_1$, $t_2$, and $t_3$ are three distinct terms. To create the candidate queries, ESSEX simply builds all possible combinations of field-term pairs, using the fields in the metadata schema of the DL and the terms entered by the user.

To illustrate, if term $t_1$ occurs both in the title and in the author fields of the objects in the DL, pairs title:$t_1$ and author:$t_1$ would be created. Similarly, if terms $t_2$ and $t_3$ occur only in the title of the objects in the DL, pairs title:$t_2$ and title:$t_3$ would be created. Given these field-term assignments, the candidate structured queries would be $Q_1$ = (author:$t_1$, title:$t_2$, title:$t_3$) and $Q_2$ = (title:$t_1$, title:$t_2$, title:$t_3$). Notice that these would be the only two possible queries, since we assume that one term cannot occur in two different fields of the same query. In this case, term $t_1$ cannot occur in the title field and author field of the same query. Although this assumption can exclude some possibly relevant candidates, it avoids the extra computational complexity that would be added by their inclusion. A study of how this affects the quality of results should be performed in future work.

The creation of the field-term pairs can be further restricted by considering a minimum frequency of occurrence of a term in the field values of the digital objects in the DL. Thus, if, say, term $t_1$ occurs fewer than $N$ times in the author field, the pair author:$t_1$ would not be created. The value of $N$ can be used both to increase efficiency, by reducing the number of candidate queries, and to filter out spurious terms that may occur in a field due to errors in the data. In our experiments the value of $N$ was set to 1, since this filtering process proved unnecessary.
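Candidate generation as described above can be sketched as a Cartesian product over the fields each term may be assigned to. This is an illustrative sketch, not the ESSEX implementation: the layout of `field_term_counts` (field name to per-term occurrence counts) is assumed, and dropping terms that occur in no field is an assumption, since the text does not say how such terms are handled.

```python
from itertools import product


def candidate_queries(terms, field_term_counts, min_freq=1):
    """Build every assignment of one field to each term.

    field_term_counts maps field -> {term: occurrences in that field}.
    A pair field:term is created only if the term occurs at least
    min_freq times in the field (the threshold N in the text).
    """
    options = []
    for term in terms:
        fields = [f for f, counts in field_term_counts.items()
                  if counts.get(term, 0) >= min_freq]
        if fields:  # assumption: terms found in no field are dropped
            options.append([(f, term) for f in fields])
    if not options:
        return []
    # One pair per term, so a term never appears in two fields of one query.
    return [list(combo) for combo in product(*options)]


# The example from the text: t1 occurs in author and title, t2 and t3 only in title.
counts = {"author": {"t1": 3}, "title": {"t1": 2, "t2": 5, "t3": 1}}
candidates = candidate_queries(["t1", "t2", "t3"], counts)
```

With these hypothetical counts the function produces exactly the two candidates Q1 and Q2 from the example; raising `min_freq` shrinks the candidate set.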

Once the set of candidate queries is created, each query is evaluated and ranked according to the probability of fitting the data in the DL. This is accomplished through the use of the Bayesian network model first proposed by Calado et al. in [8], as explained in the following section.

Figure 1 shows the architecture for the query structuring process in ESSEX. Evaluation of a structured query takes place in two phases. The first is the evaluation of the individual query terms (with field filters). For this phase, each term is sent to the search engine and a results set is received. The ranks of the results set documents are combined into a score for the query term. Because many structured query terms occur numerous times over the entire set of candidate structured queries, they are cached in a hash table which maps them to their corresponding fused scores. In our experiments, we found that this caching sped up the entire structuring process by more than a factor of 3.

In the second phase, the scores of the structured query terms are combined into a final score for the entire structured query. This score is then used to generate the ranks for the set of all potential structured queries.
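The two-phase evaluation with caching can be sketched as follows. The functions `score_term` and `combine` are placeholders supplied by the caller: the text does not specify how result ranks are fused into a term score, so the lookup table and the use of `sum` below are assumptions made only to keep the example runnable.

```python
def rank_candidates(candidates, score_term, combine):
    """Phase 1: score each distinct field:term pair once (hash-table cache).
    Phase 2: combine pair scores into one score per candidate, then sort."""
    cache = {}

    def cached_score(pair):
        if pair not in cache:
            cache[pair] = score_term(pair)  # would query the search engine
        return cache[pair]

    scored = [(combine([cached_score(p) for p in query]), query)
              for query in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored, cache


# Hypothetical term scores standing in for the search-engine results.
fake_scores = {("author", "t1"): 9, ("title", "t1"): 2,
               ("title", "t2"): 5, ("title", "t3"): 4}
calls = []


def score_term(pair):
    calls.append(pair)
    return fake_scores[pair]


two_candidates = [
    [("title", "t1"), ("title", "t2"), ("title", "t3")],
    [("author", "t1"), ("title", "t2"), ("title", "t3")],
]
ranking, cache = rank_candidates(two_candidates, score_term, combine=sum)
```

Here six pair evaluations are requested but only four reach `score_term`, which is the saving the hash table provides when candidates share field-term pairs.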

[Figure 1: Architecture of the query structuring system. Components shown: raw query, Structure_Main, Generate_Structured_Queries, domain DBs A_1 ... A_k, Score_Structured_Queries, Cached_Score_Query_Terms, term hash (structuring subsystem); inverted search index (search subsystem); output {structured queries}.]

In cases where the full digital library content is not accessible, or in order to improve efficiency, not all of the DL objects are used in the query structuring process. Instead, only a subset of the DL is considered for building the candidate structured queries and for deriving the necessary statistics for the Bayesian network model. This subset is called the sample database and is generally built by taking a sample of objects from the DL that are representative of the whole DL content. A more detailed discussion on how a sample database is built can be found in [15].

We now present a brief explanation of the Bayesian network model used as a basis for this implementation, and emphasize the changes needed to adapt it to our experimental collection.

4.2 Finding the Best Structured Queries

The ranking of candidate structured queries is accomplished through the use of the Bayesian network model proposed in [8]. For clarity, the model is presented in Figure 2. We note that although the network can be easily expanded to model any metadata schema, for simplicity, here we show only two fields, $A_1$ and $A_2$.

[Figure 2: Bayesian network model for ranking structured queries.]

The network in Figure 2 consists of a set of nodes, each representing a piece of information. With each node in the network is associated a binary random variable, which takes the value 1 to indicate that the corresponding information will be accounted for in the ranking computation. In this case, we say that the information was observed. In the network, the DL is represented by node $O$, each node $A_i$ represents a field, each node $A_{ij}$ represents the j-th value of field $A_i$, each node $a_{ij}$ represents a term in the value of field $A_i$, each node $Q_i$ represents a structured query to be ranked, and each node $Q_{ij}$ represents the portion of the structured query $Q_i$ that corresponds to the field $A_j$. Vectors $\vec{a}_1$ and $\vec{a}_2$ each represent a possible state of the variables associated with nodes $a_{1i}$ and $a_{2i}$, respectively.

As reflected by the edges in the network, field values $A_{ij}$ are composed of terms $a_{ij}$, a field $A_i$ is composed of all its possible values, and the DL $O$ is composed of all its fields. Thus, the observation of a set of terms will influence the observation of a value, the observation of a set of values will influence the observation of a field, and the observation of a field will influence the observation of the DL. Similarly, a query $Q_i$ is composed of field values $Q_{ij}$ that are also composed of terms $a_{ij}$. The likelihood of a candidate structured query $Q_i$ fitting the DL $O$ can be seen as the probability of observing $Q_i$, given that $O$ was observed, i.e., $P(Q_i|O)$. By appropriately defining the conditional probabilities described by the network in Figure 2, we obtain the following equation:

$$P(Q_i|O) = \eta \times \frac{1}{2}\left[\left(1 - \prod_{j=1}^{n_1}\bigl(1 - \cos(A_{1j}, \vec{a}_1)\bigr)\right) + \left(1 - \prod_{j=1}^{n_2}\bigl(1 - \cos(A_{2j}, \vec{a}_2)\bigr)\right)\right] \qquad (2)$$

where $\vec{a}_1$ and $\vec{a}_2$ are the states in which only the query terms referring to fields $A_1$ and $A_2$, respectively, are observed; $n_1$ and $n_2$ are the total number of values for fields $A_1$ and $A_2$ in the DL; and $\eta$ accounts for the constants $1/P(O)$, $P(\vec{a}_1)$, and $P(\vec{a}_2)$. The function $\cos(A_{ij}, \vec{a}_i)$ represents the similarity between the field value $A_{ij}$ and the terms in the candidate query being ranked. It is defined as the traditional vector space cosine similarity [31] between the vector of terms representing the field value $A_{ij}$ and vector $\vec{a}_i$, which represents the terms in the query.

We can interpret Eq. (2) as stating that the probability of the candidate query fitting the DL depends on the probability of each of its attribute values fitting the corresponding attribute values in the DL: the more all query attribute values are similar to the values in the DL, the higher the probability. The similarity of an individual query attribute is represented by a disjunction, meaning that it is enough for the attribute to be considered, if one value in the DL equals the value in the query. This equality between values in the query and values in the DL is determined through the cosine similarity function.

It is important to note that, although in [8] disjunctive and conjunctive operators were suggested for the final combination function, empirical tests with the collection used in our experiments indicated that using disjunctive operators for probability $P(A_i|A_{ij})$ and a mean for probability $P(O|A_i)$ yielded the best results. For further details on the derivation of Eq. (2), refer to [8, 28].
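The combination in Eq. (2) can be sketched in a few lines: a disjunction over the values of each field, followed by a mean over fields. This is a minimal sketch under two assumptions: the constant $\eta$ is omitted because the text treats it as a constant, and the factor 1/2 is generalized to a mean over however many fields are supplied. The function names and the similarity values are hypothetical.

```python
import math


def field_belief(similarities):
    """Disjunction over a field's values: 1 - prod_j (1 - cos(A_ij, a_i))."""
    return 1.0 - math.prod(1.0 - s for s in similarities)


def query_score(sims_by_field):
    """Mean of the per-field disjunctions (Eq. (2) without the constant eta)."""
    beliefs = [field_belief(sims) for sims in sims_by_field.values()]
    return sum(beliefs) / len(beliefs)


# Hypothetical cosine similarities between the query and each stored field value.
score = query_score({"A1": [0.5, 0.5], "A2": [0.0]})
```

A single field value with similarity 1 drives that field's belief to 1 regardless of the other values, which is the behavior the disjunctive interpretation describes.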

To compute the cosine similarity, the value $A_{ij}$ is seen as a vector of $k_i$ terms. To each term $t$ in $A_{ij}$, we assign a weight $w_{it}$ that reflects the importance of the term for field $A_i$:

$$w_{it} = tf_j(t) \cdot ftf_i(t) \cdot fidf(t) \qquad (3)$$

where $tf_j(t)$ is the term frequency of term $t$ in document $j$, i.e., the number of times term $t$ appears in document $j$; $ftf_i$ is the field term frequency, i.e., the number of times term $t$ occurs in field $i$; and $fidf$ is the inverse field document frequency, i.e., the inverse of the number of fields term $t$ appears in.

The first factor in Eq. (3), $tf_j(t)$, is very common in vector-space IR. It indicates that the more times term $t$ appears in document $j$, the more representative $t$ is of document $j$. Although it is also common in IR to have an idf, or "inverse-document frequency" factor, we leave this out for reasons discussed below.

The last two factors are novel in our work. On the one hand, $ftf_i(t)$ indicates that the more times term $t$ appears in a field $i$, the more representative $t$ is of field $i$. On the other hand, if term $t$ appears in many fields, the factor $fidf(t)$ indicates that it is probably too generic to be useful. We call these "field tf" and "field idf" respectively, as they are analogous to the standard tf and idf described previously. The difference is that they reflect term distributions relative to fields rather than documents. These term weighting functions differ from those used in [8] because CITIDEL, like many other digital libraries, contains collections of scientific papers and, therefore, many textual metadata fields. Such collections are very different from those used in [8], which contained mostly information on commercial products extracted from Web databases. In the context of query structuring, one of the main differences in the CITIDEL case is that there is a large overlap between the vocabulary in the object's fields such as titles, abstracts, publications, and authors.

The net effect of these weightings is to value terms: (1) strongly to the extent that they occur many times in the specified metadata field, (2) strongly to the extent that they occur in the most common field for the term, and (3) weakly to the extent that they are "diluted" by appearing in many metadata fields.

Let us consider an example of how this is useful. Assume that the unstructured query is given as "jones algorithm", and that the word "algorithm" appears evenly in the title and abstract fields, and a handful of times in the publication field. Also, assume the word "jones" appears a few times in the field abstract, but much more frequently in the field author. With the weightings described above, occurrences of "algorithm" will have similar value in either the title or abstract fields. However, occurrences of "jones" in author will be worth much more than occurrences in abstract. Finally, occurrences of "algorithm" will be worth less than occurrences of "jones", because "algorithm" appears in three fields, while "jones" appears in only two.
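The "jones algorithm" example can be checked numerically with a small sketch of Eq. (3). The per-field occurrence counts below are invented for illustration, and the helper names `fidf` and `weight` are hypothetical; only the formula itself comes from the text.

```python
def fidf(term, field_counts):
    """Inverse of the number of fields in which the term appears."""
    n_fields = sum(1 for counts in field_counts.values() if counts.get(term, 0) > 0)
    return 1.0 / n_fields if n_fields else 0.0


def weight(term, field, tf, field_counts):
    """Eq. (3): w_it = tf_j(t) * ftf_i(t) * fidf(t)."""
    ftf = field_counts[field].get(term, 0)  # occurrences of the term in the field
    return tf * ftf * fidf(term, field_counts)


# Hypothetical occurrence counts per field.
field_counts = {
    "title": {"algorithm": 40},
    "abstract": {"algorithm": 40, "jones": 2},
    "publication": {"algorithm": 5},
    "author": {"jones": 60},
}

w_alg_title = weight("algorithm", "title", 1, field_counts)
w_alg_abstract = weight("algorithm", "abstract", 1, field_counts)
w_jones_author = weight("jones", "author", 1, field_counts)
w_jones_abstract = weight("jones", "abstract", 1, field_counts)
```

With these counts "algorithm" receives the same weight in title and abstract, "jones" is weighted far higher in author than in abstract, and the field idf of "jones" (two fields) exceeds that of "algorithm" (three fields).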

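Given this weighting scheme, the cosine of the angle between vector $A_{ij}$ and vector $\vec{a}_i$ is defined as:

$$\cos(A_{ij}, \vec{a}_i) = \frac{\sum_{\forall t \in T_i} w_{it}\, g_t(\vec{a}_i)}{\sqrt{\sum_{\forall t \in T_i} w_{it}^2}} \qquad (4)$$

where $g_t(\vec{a}_i)$ gives the value of the t-th variable of the vector $\vec{a}_i$, and $T_i$ is the set of all terms in the values of field $A_i$.

We now can rank all the structured queries by computing $P(Q_i|O)$ for each of them. The user then can select one query for processing from among the top ranked ones, or the system can simply process the first query.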
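Eq. (4) can be sketched as below, assuming that $g_t(\vec{a}_i)$ is 1 when term $t$ is one of the query terms assigned to the field and 0 otherwise, consistent with the binary variables of the network. Following the equation as written, only the field-value vector is normalized. The function name `field_cosine` and the weights are hypothetical.

```python
import math


def field_cosine(weights, query_terms):
    """Eq. (4): sum of weights of observed terms over the norm of the value vector.

    weights maps each term of a field value to its weight w_it;
    query_terms is the set of terms the candidate query assigns to this field.
    """
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0.0:
        return 0.0
    observed = sum(w for term, w in weights.items() if term in query_terms)
    return observed / norm


# Hypothetical weights for the two terms of one field value.
similarity = field_cosine({"query": 3.0, "structuring": 4.0}, {"query"})
```

The resulting similarities would then feed the per-field disjunctions of Eq. (2).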

5. EXPERIMENTS

To test our research questions, namely, (1) whether structured queries are better than unstructured ones and (2) whether automatically structured queries can perform as well as (or better than) their manually constructed counterparts, we conducted a series of experiments with real users and the structuring Bayesian network, as described in Section 4, implemented on top of the ESSEX search engine.

5.1 Experimental Setup and Design

Experiments were performed on the CITIDEL collection, which contains metadata from the ACM Digital Library, the DBLP collection, NDLTD-Computing, and other sources, totaling more than 440,000 metadata records. Only a subset of the ACM Digital Library, with approximately 98,000 metadata records, was used as a sample database for query structuring. This means that all information used by the Bayesian network model to rank the structured queries was taken only from this subset of CITIDEL. The ACM DL subset was chosen as the sample database since it contained the greatest breadth and depth of metadata, hence providing a comprehensive amount of metadata and content widely representative of the metadata and content of the whole collection. The set of metadata fields considered in the experiment was $S_{CITIDEL}$ = {title, abstract, author, publication}, where publication means the name of the conference or journal where a paper was published.

Our experiments involved 20 subjects among researchers in the Virginia Tech Digital Library Research Lab and graduate students from a Digital Library graduate course. The process is illustrated in Figure 3. Each subject was instructed to issue five searches for items of their own interest in the CITIDEL collection and provide relevance judgments (as relevant or non-relevant) for the items returned. Subjects were divided into two groups: G1 and G2. Subjects in group G1 were not aware of the possibility of structuring queries with field information. They issued unstructured queries, which were then automatically structured using the Bayesian network model. Subjects in group G2 were required to issue manually structured queries. For comparison, an unstructured version of the manually structured query was created by removing all field structure information from these queries, which were again re-structured using the Bayesian network model.

[Figure 3: Experimental process and evaluation.]

All queries, i.e., the unstructured query (Q0), the best of the top 5 structured queries (Q1–Q5), and the manually structured query (QM), were sent to the ESSEX search engine and the top 25 items returned by each were merged (with removal of duplicates). The resulting union set ($L_\cup$) was completely shuffled ($L_{\cup R}$) and presented to the users for relevance judgments.
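The pooling step can be sketched as a depth-limited union followed by a shuffle. This is an illustrative sketch only: the function name `build_pool`, the use of item identifiers, and the optional `seed` parameter (added here so the shuffle is reproducible) are assumptions, while the depth of 25 and the removal of duplicates come from the text.

```python
import random


def build_pool(result_lists, depth=25, seed=None):
    """Merge the top `depth` items of each result list, drop duplicates,
    and shuffle so judges cannot tell which query returned an item."""
    pool, seen = [], set()
    for results in result_lists:
        for item in results[:depth]:
            if item not in seen:
                seen.add(item)
                pool.append(item)
    random.Random(seed).shuffle(pool)
    return pool


# Hypothetical result lists for Q0, one automatically structured query, and QM.
pool = build_pool([["d1", "d2", "d3"], ["d2", "d4"], ["d1", "d5"]], seed=0)
```

Shuffling the union hides both the originating query and the original rank from the subject who provides the relevance judgments.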

All relevant and non-relevant items returned for each query were used to compute precision, recall, and F1 values. Precision (P) is the percentage of retrieved items that are relevant. It is useful as an indication of how accurate the system is when retrieving the answers to the user's question. Recall (R) is the percentage of all the relevant items that were retrieved. It indicates if the system is able to retrieve all of the relevant items. High recall is especially useful when the user needs to be certain that all relevant information will be found. Specifically in our case, we used relative recall regarding the pooled set of relevant documents from all of the queries. Finally, F1 combines precision and recall with equal weights and is defined as F1 = 2PR/(P + R). The F1 measure combines precision and recall into a single value, providing a simple way of evaluating the system's overall performance.
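The three measures can be computed as in the sketch below, which assumes that recall is taken relative to the pooled set of items judged relevant, as described above. The function name and the item identifiers are hypothetical, and values are returned as fractions rather than percentages.

```python
def precision_recall_f1(retrieved, relevant_pool):
    """Return (P, R, F1) for one query.

    retrieved: items returned by the query.
    relevant_pool: all items judged relevant across the pooled queries,
    so R is relative recall.
    """
    retrieved = list(retrieved)
    hits = sum(1 for item in retrieved if item in relevant_pool)
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant_pool) if relevant_pool else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1


# Hypothetical judgments: two of four retrieved items are among four relevant ones.
p, r, f1 = precision_recall_f1(["d1", "d2", "d3", "d4"], {"d2", "d4", "d5", "d6"})
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low.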

To present the results, we also consider two forms of precision: 10-precision and R-precision. The 10-precision measure indicates the precision for the first 10 items retrieved by the system. This measure is important in practice since it is known that users tend to only look at the top results in a ranked answer set. The R-precision measure indicates the precision when all relevant documents were retrieved. It is a measure of how many spurious results the user has to look at before all relevant results are seen. Both measures are useful in determining not only if the system is able to show relevant results at the top of the list of retrieved items, but also if it can discover all relevant information while still keeping the noise level to a minimum.
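Both rank-based measures can be sketched as follows. The second function follows the description given above (precision at the point where the last relevant item of the ranking has been seen); R-precision is conventionally computed at rank R, the number of relevant items, so this reading is an assumption. Dividing by the number of items actually returned when fewer than k are available is also an assumption, and the function names are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Precision over the first k ranked items (10-precision for k=10)."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for item in top if item in relevant) / len(top)


def precision_at_full_recall(ranked, relevant):
    """Precision at the rank where the last relevant item in `ranked` appears.

    Assumes `ranked` contains no duplicate items.
    """
    positions = [i for i, item in enumerate(ranked) if item in relevant]
    if not positions:
        return 0.0
    cutoff = positions[-1] + 1  # number of items seen up to that rank
    return len(positions) / cutoff


# Hypothetical ranking with relevant items at ranks 1, 3, and 4.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "d"}
p_at_2 = precision_at_k(ranked, relevant, k=2)
p_full = precision_at_full_recall(ranked, relevant)
```

In the example one spurious item ("b") must be inspected before all three relevant items are seen.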

5.2 Results

Before the experiments, all test subjects answered a short questionnaire regarding their background and knowledge in their area of interest in computer science. Among other questions, users were asked to cite five researchers and three publications that they would consider of importance in their selected research area. As expected, queries were generally very short, averaging 2.59 terms per query, independently of being manually structured or not.

[Figure 4: Distribution of fields and combination of fields in the manually structured queries.]

The average number of items indicated as relevant by subjects in group G1 was slightly higher than for group G2 (18.79 vs. 14.26 records), with a higher median (12 vs. 8) and standard deviation (19.46 vs. 14.33). This may be explained by the fact that when users are forced to use field structure information, queries tend to be more focused and so naturally tend to retrieve a smaller number of relevant items. This supports the assumption that structured queries are more precision-oriented.

Figure 4 shows the distribution of fields and combination of fields used by subjects in G2 in their manually structured queries. It is worth noticing that 63% of the queries contain only one field, with no queries using only publication, only one using three fields, and none using all four fields. This distribution may again be explained by the lack of user knowledge about publications and the difficulty of creating structured queries manually. We now examine the impact of query structuring, manually and automatically, on the quality of the retrieved results.

5.2.1 Unstructured vs. Structured Queries

Tables 1, 2, 3, and 4 show a comparison between the unstructured query (Q0), the top ranked automatically structured query (Q1), the best of the top 5 structured queries (Q1–Q5), and the manually structured query (QM).

Q1 vs. Q0    F1        10-precision    R-precision
G1           .5%       83.3%           81.2%
G2           73.4%     .3%             85.7%

Table 1: Percentage of times query Q1 is better or equal to query Q0 considering the F1, 10-precision, and R-precision measures, in groups G1 and G2.

Average          F1      10-precision    R-precision
Q0 (G1+G2)       28.9    31.1            29.4
Q1 (G1+G2)       36.4    51.4            49.7

Table 2: Average F1, 10-precision, and R-precision values for all the Q0 and Q1 queries, in groups G1 and G2 together.

Best(Q1–Q5, QM) vs. Q0    F1       10-precision    R-precision
G1                        81.2%    100%            100%
G2                        85.1%    97.9%           97.9%

Table 3: Percentage of times the Best (manual or automatically) structured query is better or equal to query Q0 considering the F1, 10-precision, and R-precision measures, in groups G1 and G2.

Average                      F1      10-precision    R-precision
Q0 (G1+G2)                   28.9    31.1            29.4
Best(Q1–Q5, QM) (G1+G2)      62.2    84.5            84.7

Table 4: Average F1, 10-precision and R-precision values for the Best (manual or automatically) structured query and query Q0, in groups G1 and G2 together.

For the 97 queries (footnote 6), the top ranked structured query Q1 performed better than the unstructured query Q0 in most of the cases. Considering the F1 measure, results for Q1 were equal to or better than results for Q0 in an average of 69% of the searches, in both groups. In terms of 10-precision, Q1 was better or equal to Q0 in an average 86.3% of the searches. In terms of R-precision, Q1 was better or equal to Q0 in 83.4% of the searches. We can conclude that, without the need of user intervention (except for entering the query keywords), the system is able to automatically find a structured query in the top of the ranked list that outperforms a simple keyword-based search in the majority of cases. In fact, as shown in Table 2, the average F1, 10-precision, and R-precision values for Q0 were 28.9, 31.1, and 29.4, while for Q1 the corresponding values were 36.4, 51.4, and 49.7.

Footnote 6: Raw data for three queries was lost.

When comparing the best of the Q1 through Q5 and QM queries with Q0, it is clear that using structured queries is better than a simple keyword-based search. The best structured query showed results better or equal to Q0 in 83.7% of the searches, considering the F1 measure and in 98.9% of searches considering both 10-precision and R-precision. The best query average values were 62.2 for F1, 84.5 for 10-precision, and 84.7 for R-precision.

These results clearly indicate that structured queries, either manually constructed or automatically generated, perform better than their unstructured counterparts in the majority of cases. The situations in which the unstructured query Q0 performed better can be classified into four major cases:

1. Insufficient or outdated sampling data

The most common reason for Q0 to outperform the structured queries was the absence of support in the sample database for very specific queries. This was especially evident in queries concerning new trends in computer science research (e.g., "peer-to-peer computing", "cognitive affordance", "multi-modal presentation", or "discourse processing"). One obvious solution would be to use the whole collection as the sample database, although this could have a negative impact on performance. Another possibility is to use better sampling strategies, which can guarantee high-quality coverage using the minimum possible data, and good policies for updates.

2. Very specific queries and strict separation between title and abstract

It was noticed that, in almost all cases, subjects did not care in which field the relevant concepts in the query appeared. For instance, in very specific or short queries like "kerberos" users do not care if the word "kerberos" appears in the title or the abstract. Since the query structuring process must choose one field to insert the word "kerberos", say, the title field, the structured query becomes too specific, thus not retrieving all relevant items, despite having good precision. This suggests that some kind of combination (for instance, a Boolean OR) of fields with a large overlap in vocabulary, such as titles and abstracts, in the automatically structured query, may be beneficial.

3. The "+" constraint is too restrictive

Another assumption in this work was that structured queries are, by nature, more focused and therefore more precision-oriented. This led to the design choice of forcing the words in the structured queries to appear in the respective fields by using the "+" operator. In a few cases, this assumption proved too restrictive, mainly in long queries (e.g., "parallel all pairs shortest path") or in queries with two or more embedded concepts that never occur together in the collection (e.g., "peer-to-peer comparison systems"). In these cases the "+" constraint returned no results or only a few. Subjects preferred to mark at least a few records as relevant, even if all relevant concepts in the query did not appear together in the document, rather than mark no result at all. One obvious solution is to relax the "+" constraint, but initial tests showed that performance would be degraded. A better choice could be to identify these extreme cases and only then relax the constraint or apply techniques of query splitting [22].

4. Failure of the model

In a very few cases the network model was unable to correctly rank the structured queries, even when there was support from the sample database. The main reasons for this problem are ties in the ranking and skewed keyword distributions. Ties occur when several of the structured queries get the same score and, thus, their ranking order becomes arbitrary. Keyword distributions in the collection are skewed because some fields tend to contain more words than others. For instance, the abstract field is larger than most others and, thus, contains the majority of words in the collection. For this reason, the Bayesian network model tends to assign higher probabilities to queries that contain abstracts. One possible solution to both problems is to assign different weights to each field, according to their relative importance in the collection. Preliminary experiments have shown that this strategy may be beneficial, but the proper choice of weights for all fields is hard to obtain and will require further experimentation.

5.2.2 Manually Structured vs. Automatically Structured Queries

We next compare the performance of the best automatically structured query to the manually structured query. As shown in Table 5, the best automatically generated query tied or outperformed query QM in 91.8% of the searches, considering the F1 measure. Considering 10-precision and R-precision, the best structured query equaled or outperformed QM in 97.9% of the searches. The average F1, 10-precision, and R-precision values for query QM are 55.6, 72.5, and 70.2, respectively, while for the best structured query the values are 59.8, 83.4, and 84.7, respectively, as shown in Table 6.

We note that, in most cases, one of the top five automatically structured queries precisely matched the query manually structured by the user. Also, even when not completely correct semantically, automatically structured queries generally outperformed manually structured queries.

Best(Q1–Q5) vs. QM    F1       10-precision    R-precision
G2                    91.8%    97.9%           97.9%

Table 5: Percentage of times the best automatically structured query is better or equal to query QM considering the F1, 10-precision, and R-precision measures, in group G2.

Average         F1      10-precision    R-precision
Manual          55.6    72.5            70.2
Best(Q1–Q5)     59.8    83.4            84.7

Table 6: Average F1, 10-precision and R-precision values for the best automatically structured query and query QM, in group G2.

These results can be explained. It was observed during the experiments that, when judging the returned items' relevance, test subjects tended to focus mostly on the title field. Thus, in cases where the automatically structured query contains the relevant concepts in the title field, users almost always considered the returned items as relevant. On the other hand, items that contained relevant concepts in other fields, such as in the abstract or publication, were very often ignored. Partially for this reason, the automatic structuring model was able to outperform the results of the manually structured queries, in which the abstract field was often the user's choice.

The few cases where QM performed better than the best of the five structured queries were due to one of the four cases described in the previous section, in particular Case 1, in which there was insufficient or outdated information in the sampling data. These results show that automatic structuring of queries is a viable alternative to manual structuring, and one that significantly reduces the burden on the users while still yielding good performance.

6.RELATEDWORK

Fewworkshaveexploreduserinterfacesthatfacilitatethesearchprocessindigitallibraries.However,therearenotableexceptions:theDLITEproject[13],SenseMaker[5],andthequerysynthesizersdescribedin[4].Inmostcases,DLsearchservicesarelimitedtosimplekeyword-basedqueryformulation,arathercommonresourceinalltypesofinformationretrievalsystems[3].Morerecently,keyword-basedqueriesalsohavebeenintroducedtostruc-tureddatabases[2,16,20].Furthermore,thereisalonghistoryofworkintheinformationretrievalcommunityon(semi)automaticgenerationofqueries[6,21,25,30,38]butitgenerallydidnotfocusonstructuringopportunities.

In this work, keyword-based queries formulated by the user are given structure by the use of a Bayesian network model. This is somewhat similar to the work of Croft et al. [14], where Boolean queries are derived from a user-given natural language query, and then improved with automatically inferred phrases. Bayesian network models were first used in IR problems by Turtle and Croft [36] and later by Ribeiro-Neto and Muntz [29] (upon whose work our model is based). More recently, Acid et al. [1] further refined such models so that exact propagation algorithms can be used to efficiently compute probabilities. Bayesian networks also have been applied to other IR problems besides ranking as, for example, relevance feedback [24], automatic construction of hypertexts [33], query expansion [17], information filtering [9], ranking fusion [37], and document clustering and classification [7, 18]. Nevertheless, no other work has yet applied Bayesian networks to the problem of structuring user queries.

Recent research has proposed several models and languages for retrieval of structured documents [11, 23, 26, 27, 32, 35]. In contrast, our work focuses on structured metadata and mainly on the inference of the “best” structured queries based on a belief network model and the collection’s structures and content.

Pooling methods similar to the one employed in this work have been used before to assess retrieval effectiveness in large and dynamic collections such as the Web. Pooling has been used extensively in the annual TREC conferences [39]. The effect of the pool depth (i.e., number of documents taken from each returned set) has been studied in works such as [40] and [12]. Silva et al. [34] used a pool of the 10 top ranked Web pages returned by 6 different types of belief networks and concluded that the combination of hub, authorities, and content-based evidential information provided substantial gains in precision in a search engine for the Brazilian Web. Can et al. [10] used the top 200 pages returned by 8 search engines to build a test set for an automatic search engine evaluation process and compared it with human judgments.
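As an illustration of the pooling idea and of pool depth, the following is a minimal sketch, assuming each system's result is a ranked list of document identifiers. The function names `build_pool` and `precision_at_k`, the run labels, and the sample data are hypothetical and are not taken from the experiments reported here.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int) -> Set[str]:
    """Return the union of the top `depth` documents of every ranked list."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


def precision_at_k(ranked_docs: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the first k positions that hold a judged-relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked_docs[:k] if doc in relevant)
    return hits / k


# Hypothetical ranked lists returned for the same information need.
runs = {
    "unstructured": ["d1", "d2", "d3", "d4"],
    "structured": ["d3", "d5", "d1", "d6"],
}

# Only pooled documents are shown to the judges.
pool = build_pool(runs, depth=2)

# Hypothetical judgments made over the pool.
relevant = {"d3", "d5"}

scores = {name: precision_at_k(docs, relevant, 2) for name, docs in runs.items()}
```

A larger `depth` enlarges the pool and so increases the judging effort, which is the trade-off examined in the pool-depth studies cited above.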

Finally, the work presented here is based on the model first proposed in [8]. It differs, however, in two main points. First, it provides a user-based evaluation of the query structuring framework, which confirms the results obtained in [8] with an artificial query log. Second, it empirically demonstrates the usefulness of structured queries in the context of digital libraries and shows that these can be obtained with minimum user effort.

7. CONCLUSIONS

In this paper, through a number of user experiments, we have shown that: (1) structured queries perform better than pure keyword-based queries in DL searching services based on fielded metadata; and (2) a system can be used to automatically add structure to the users’ queries, thus providing a viable alternative to manual structuring that significantly reduces the burden on the users while still yielding good performance. The experiments performed confirm that, in the majority of cases, better results are achieved by structured queries than by unstructured queries. Also, using the Bayesian network model proposed in [8] and an appropriate term weighting scheme, automatically structured queries outperform not only the unstructured queries but also the query manually structured by the user.

We may conclude that a system such as the one described in this work can be effectively used to improve DL search services. We envision a search system that is able to suggest a few alternative structured queries to the user. According to a semi-automatic scenario, these structured queries can be presented together with the results of the initial unstructured query. By clicking on one of these candidates, the user could get corresponding structured search results, a refinement on the initial list of items retrieved. Alternatively, according to a fully automatic scenario, the system can simply submit the highest ranked structured query, and provide corresponding results without user intervention.

Further improvements on the models used in this work are possible. For instance, if the list of candidate structured queries is too long, too much time would be spent by the user in selecting the most appropriate candidate. Thus, it would be important to guarantee that, in the majority of cases, the best structured query is one of the top two candidates. We believe that such a level of performance is ultimately attainable with minor adjustments to our model and implementation. Combination of other sources of evidence such as past queries also could be applied to alleviate the problem.
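The semi-automatic and fully automatic scenarios described above can be sketched as follows. This is only an illustrative sketch: the `Candidate` type, the `field:term` query strings, and the numeric scores are assumptions made for the example, with the scores standing in for whatever ranking value the structuring model assigns to each candidate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    query: str    # hypothetical structured query string
    score: float  # stand-in for the model's ranking value


def suggest(candidates: List[Candidate], max_suggestions: int = 2) -> List[Candidate]:
    """Semi-automatic scenario: show only the best few candidates to the user."""
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    return ranked[:max_suggestions]


def auto_select(candidates: List[Candidate]) -> Optional[Candidate]:
    """Fully automatic scenario: pick the highest ranked candidate, if any."""
    top = suggest(candidates, max_suggestions=1)
    return top[0] if top else None


# Hypothetical candidates for the keyword query "bayesian networks".
candidates = [
    Candidate("+title:bayesian +title:networks", 0.41),
    Candidate("+author:bayesian +title:networks", 0.07),
    Candidate("+title:bayesian +abstract:networks", 0.33),
]

shown_to_user = suggest(candidates)         # short list offered next to the results
submitted = auto_select(candidates)         # query run without user intervention
```

Capping the list at two suggestions mirrors the goal stated above that the best structured query should be among the top two candidates.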

Besides testing the system in production mode in CITIDEL and other DLs hosted in the Digital Library Research Laboratory, future work will continue in a number of directions. First, we want to investigate different and more effective sampling strategies that minimize the discovered problems of incompleteness and outdated information, including good policies for updates. Second, we will investigate automatic field weighting based on the relative or perceived importance of the fields, in order to increase the accuracy of our model. Third, we intend to investigate new models that can combine fields with large vocabulary overlap (e.g., titles and abstracts) in the query, and to study possible ways to relax the “+” constraint without reducing effectiveness. Further, we plan to investigate the effect of the “+” constraint by itself in keyword-based queries. Fourth, we plan to incorporate relevance feedback and personalized ranking strategies into our belief network models [19]. Finally, we intend to work on new approaches to improve system performance that go beyond our simple caching strategy.

Acknowledgments

This research work was funded in part by NSF, grants DUE 0136690, DUE 0121679, IIS 0086227, and ITR 0325579, by the I3DL project, grant 680154/01-9, by the GERINDO project, grant MCT/CNPq/CT-INFO 552.087/02-5, by individual grants MCT/FCT SFRH/BD/4662/2001 (Pável Calado) and CNPq 3040/02-5 (Alberto H. F. Laender), by the SiteFix project, grant MCT-CNPQ-CT-INFO 55.2197/02-5, by a PHILIPS MDS Manaus R&D sponsorship (Altigran S. da Silva), and by a fellowship from AOL (Marcos A. Gonçalves).

REFERENCES

[1] S. Acid, L. M. de Campos, J. M. Fernández-Luna, and J. F. Huete. An information retrieval model based on simple Bayesian networks. International Journal of Intelligent Systems, 18(2):251–265, January 2003.

[2] S. Agrawal, S. Chaudhuri, and G. Das. DBXplorer: A system for keyword-based search over relational databases. In Proceedings of the 18th International Conference on Data Engineering, pages 5–16, San Jose, CA, USA, February 2002.

[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, New York, NY, USA, 1999.

[4] M. Baldonado, S. Katz, A. Paepcke, C.-C. K. Chang, H. Garcia-Molina, and T. Winograd. An extensible constructor tool for the rapid, interactive design of query synthesizers. In DL’98: Proceedings of the 3rd ACM International Conference on Digital Libraries, pages 19–28, Pittsburgh, PA, USA, June 1998.

[5] M. Baldonado and T. Winograd. Sensemaker: An information-exploration interface supporting the contextual evolution of a user’s interests. In Proceedings of ACM CHI 97 Conference on Human Factors in Computing Systems, pages 11–18, Atlanta, GA, USA, March 1997.

[6] D. Cai, C. J. Van Rijsbergen, and J. M. Jose. Automatic query expansion based on divergence. In Proceedings of the 10th International Conference on Information and Knowledge Management CIKM’01, pages 419–426, New York, November 2001.

[7] P. Calado, M. Cristo, E. Moura, N. Ziviani, B. Ribeiro-Neto, and M. A. Gonçalves. Combining link-based and content-based methods for web document classification. In Proceedings of the 12th International Conference on Information and Knowledge Management, pages 394–401, New Orleans, LA, USA, 2003.

[8] P. Calado, A. S. da Silva, R. C. Vieira, A. H. F. Laender, and B. A. Ribeiro-Neto. Searching web databases by structuring keyword-based queries. In Proceedings of the 11th International Conference on Information and Knowledge Management, pages 26–33, McLean, VA, USA, 2002. ACM Press.

[9] J. P. Callan. Document filtering with inference networks. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 262–269, Zurich, Switzerland, August 1996.

[10] F. Can, R. Nuray, and A. B. Sevdik. Automatic performance evaluation of Web search engines. Information Processing and Management, 2004. In press.

[11] T. T. Chinenyanga and N. Kushmerick. Expressive retrieval from XML documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 163–171, New Orleans, Louisiana, USA, September 2001.

[12] G. V. Cormack, C. R. Palmer, and C. L. A. Clarke. Efficient construction of large test collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 282–2, Melbourne, Australia, August 1998.

[13] S. B. Cousins, A. Paepcke, T. Winograd, E. A. Bier, and K. Pier. The digital library integrated task environment (DLITE). In DL’97: Proceedings of the 2nd ACM International Conference on Digital Libraries, pages 142–151, Philadelphia, PA, USA, July 1997.

[14] W. B. Croft, H. R. Turtle, and D. D. Lewis. The use of phrases and structured queries in information retrieval. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 32–45, Chicago, IL, USA, October 1991.

[15] A. S. da Silva, P. Calado, R. C. Vieira, A. H. F. Laender, and B. A. Ribeiro-Neto. Effective Databases for Text & Document Management, chapter Keyword-based Queries over Web Databases, pages 74–92. Idea Group Publishing, Hershey, PA, USA, 2003.

[16] S. Dar, G. Entin, S. Geva, and E. Palmon. DTL’s DataSpot: Database exploration using plain language. In Proceedings of 24th International Conference on Very Large Data Bases VLDB’98, pages 5–9, New York, NY, USA, August 1998.

[17] L. M. de Campos, J. M. Fernández-Luna, and J. F. Huete. Query Expansion in Information Retrieval Systems Using a Bayesian Network-Based Thesaurus. In Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence (UAI–98), pages 53–60, San Francisco, CA, July 1998.

[18] S. T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management CIKM’98, pages 148–155, Bethesda, Maryland, USA, November 1998.

[19] W. Fan, M. D. Gordon, and P. Pathak. Discovery of context-specific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Knowledge and Data Engineering, 16(4):523–527, 2003.

[20] D. Florescu, D. Kossmann, and I. Manolescu. Integrating keyword search into XML query processing. WWW9/Computer Networks, 33(1–6):119–135, 2000.

[21] E. A. Fox. Relational Models of the Lexicon: Representing Knowledge in Semantic Networks, chapter Improved Retrieval Using a Relational Thesaurus for Automatic Expansion of Boolean Logic Queries, pages 199–210. Cambridge University Press, 1988.

[22] E. A. Fox and F. D. Neves. Extending retrieval with stepping stones and pathways - NSF proposal (funded), 2003.

[23] N. Fuhr and K. Gross. XIRQL: a query language for information retrieval in XML documents. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 172–180, New Orleans, Louisiana, USA, September 2001.

[24] D. Haines and W. B. Croft. Relevance feedback and inference networks. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2–11, Pittsburgh, PA, USA, June 1993.

[25] M. Mitra, A. Singhal, and C. Buckley. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 206–214, Melbourne, Australia, August 1998.

[26] S. H. Myaeng, D.-H. Jang, M.-S. Kim, and Z.-C. Zhoo. A flexible model for retrieval of SGML documents. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 138–145, Melbourne, Australia, August 1998.

[27] G. Navarro and R. Baeza-Yates. Proximal nodes: A model to query document databases by content and structure. ACM Transactions on Information Systems, 15(4):400–435, Oct. 1997.

[28] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, California, 2nd edition, 1988.

[29] B. Ribeiro-Neto and R. Muntz. A belief network model for IR. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 253–260, Zurich, Switzerland, August 1996.

[30] G. Salton, C. Buckley, and E. A. Fox. Automatic query formulations in information retrieval. Journal of the American Society for Information Science, 34(4):262–280, July 1983.

[31] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, Tokio, 1983.

[32] T. Schlieder and H. Meuss. Querying and ranking XML documents. JASIST, 53(6):4–503, 2002.

[33] D. Shin, S. Nam, and M. Kim. Hypertext construction using statistical and semantic similarity. In DL’97: Proceedings of the 2nd ACM International Conference on Digital Libraries, pages 57–63, Philadelphia, PA, USA, July 1997.

[34] I. Silva, B. Ribeiro-Neto, P. Calado, E. Moura, and N. Ziviani. Link-based and content-based evidential information in a belief network model. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Theory and Practice in Information Retrieval, pages 96–103, Athens, Greece, July 2000.

[35] A. Theobald and G. Weikum. Adding Relevance to XML. In Int’l Workshop on the Web and Databases (WebDB), Dallas, TX, May 2000.

[36] H. R. Turtle and W. B. Croft. Inference networks for document retrieval. In Proceedings of the 13th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1–24, Brussels, Belgium, September 1990.

[37] R. F. Valle, B. A. Ribeiro-Neto, L. R. S. de Lima, A. H. F. Laender, and H. R. Freitas-Junior. Improving text retrieval in medical collections through automatic categorization. In Proceedings of the 10th International Symposium on String Processing and Information Retrieval SPIRE 2003, pages 197–210, Manaus, Brazil, October 2003.

[38] E. M. Voorhees. Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 61–69, Dublin, Ireland, July 1994.

[39] E. M. Voorhees and D. Harman. Overview of the sixth text REtrieval conference (TREC-6). Nov. 1997.

[40] J. Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 307–314, Melbourne, Australia, August 1998.
