NANO SCIENTIFIC RESEARCH CENTRE PVT.LTD., AMEERPET, HYD
WWW.NSRCNANO.COM, 09640648777, 09652926926
DOT NET PROJECTS LIST--2013
DOT NET 2013 IEEE PAPERS
Efficient and Effective Duplicate Detection in Hierarchical Data
Abstract:
Although there is a long line of
work on identifying duplicates in relational data, only a few solutions focus
on duplicate detection in more complex hierarchical structures, like XML data.
In this paper, we present a novel method for XML duplicate detection, called
XMLDup. XMLDup uses a Bayesian network to determine the probability of two XML
elements being duplicates, considering not only the information within the
elements, but also the way that information is structured. In addition, to
improve the efficiency of the network evaluation, a novel pruning strategy,
capable of significant gains over the unoptimized version of the algorithm, is
presented. Through experiments, we show that our algorithm is able to achieve
high precision and recall scores in several datasets. XMLDup is also able to
outperform another state-of-the-art duplicate detection solution, both in terms
of efficiency and of effectiveness. Finally, we also study how important the
structure of elements is in the duplicate detection process. We observe not only
that structure can clearly influence the outcome, but also that, by
ensuring a structure that is adequate to the characteristics of the data, we
can actually improve the quality of the results.
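The idea in the abstract can be illustrated with a minimal sketch. It is not the XMLDup algorithm or its actual Bayesian network, only a simplified stand-in that assumes a duplicate score is built by multiplying per-child similarity factors and that evaluation is abandoned once the running product can no longer reach a threshold. Python is used here for brevity, although the project lists Visual C#. The names `text_similarity`, `duplicate_score`, and `MISSING_PRIOR` are hypothetical.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Assumed factor used when a child tag exists in only one of the two elements.
MISSING_PRIOR = 0.5


def text_similarity(a, b):
    """Return a string similarity in [0, 1]; None is treated as empty text."""
    left = (a or "").strip().lower()
    right = (b or "").strip().lower()
    return SequenceMatcher(None, left, right).ratio()


def duplicate_score(e1, e2, threshold=0.0):
    """Score in [0, 1] for how likely two XML elements describe the same object.

    Leaf elements are compared by text. Inner elements multiply one factor per
    child tag, so both the content and the structure influence the result.
    """
    children1, children2 = list(e1), list(e2)
    if not children1 and not children2:
        return text_similarity(e1.text, e2.text)

    score = 1.0
    tags = sorted({c.tag for c in children1} | {c.tag for c in children2})
    for tag in tags:
        group1 = [c for c in children1 if c.tag == tag]
        group2 = [c for c in children2 if c.tag == tag]
        if not group1 or not group2:
            factor = MISSING_PRIOR
        else:
            # Best-matching pair among children that share this tag.
            factor = max(duplicate_score(a, b) for a in group1 for b in group2)
        score *= factor
        # Pruning: every factor is at most 1, so the running product is an
        # upper bound on the final score. Stop once it falls below threshold.
        if score < threshold:
            return 0.0
    return score


if __name__ == "__main__":
    a = ET.fromstring("<movie><title>The Matrix</title><year>1999</year></movie>")
    b = ET.fromstring("<movie><title>The Matrx</title><year>1999</year></movie>")
    c = ET.fromstring("<movie><title>Casablanca</title><year>1942</year></movie>")
    print(duplicate_score(a, b, threshold=0.5))  # near-duplicate pair
    print(duplicate_score(a, c, threshold=0.5))  # unrelated pair, pruned
```

With a threshold above zero, unrelated pairs are rejected after only a few factors have been evaluated, which is the kind of saving the pruning strategy aims for. How the children are grouped determines which factors exist, which is one way the structure of the data can change the outcome.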
Software and hardware requirements
Hardware Required:
System : Pentium IV
Hard Disk : 80 GB
RAM : 512 MB
Software Required:
O/S : Windows XP
Language : Visual C#