Evaluating String-Based Similarity Algorithms for Duplicate Record Identification in Databases
Abstract
Duplicate records reduce the accuracy, reliability, and efficiency of database systems. This paper presents a comparative study of two widely used string-based similarity algorithms: Simil and Jaro–Winkler. Both algorithms measure the textual similarity between records in order to identify entries that refer to the same real-world entity. The findings show that Simil is more effective for multi-word fields such as names and addresses, while Jaro–Winkler performs better on short words and typographical errors. The study highlights the strengths and limitations of both algorithms and provides practical guidance for their use in database duplicate detection systems. It should be useful to researchers and system developers looking for ways to handle redundant data and improve overall data quality.
Keywords
Database, Duplicate Records, Jaro–Winkler, Simil algorithm, Data cleansing.
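
To make the two measures named in the abstract concrete, the following is a minimal sketch rather than the implementation used in the study. It assumes that Simil follows the Ratcliff/Obershelp scheme with which it is commonly associated (twice the number of characters in recursively matched common substrings, divided by the combined length), and that Jaro–Winkler uses the conventional prefix scale of 0.1 with a prefix cap of 4. The function names `simil`, `jaro`, and `jaro_winkler` are illustrative only.

```python
def _common_chars(s1: str, s2: str) -> int:
    """Count characters in common substrings, found recursively
    (Ratcliff/Obershelp style; assumed here to be what Simil does)."""
    if not s1 or not s2:
        return 0
    # Find the longest common substring with a rolling DP row.
    best_len = best_i = best_j = 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_i, best_j = cur[j], i, j
        prev = cur
    if best_len == 0:
        return 0
    # Recurse on the pieces to the left and right of the match.
    left = _common_chars(s1[:best_i - best_len], s2[:best_j - best_len])
    right = _common_chars(s1[best_i:], s2[best_j:])
    return best_len + left + right


def simil(s1: str, s2: str) -> float:
    """Similarity in [0, 1]: 2 * common characters / total length."""
    if not s1 and not s2:
        return 1.0
    return 2.0 * _common_chars(s1, s2) / (len(s1) + len(s2))


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for a shared prefix (assumed defaults)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)
```

In a duplicate-detection pass, either score would typically be compared against a threshold chosen for the field in question. The abstract does not state the thresholds used in the study, so any specific cutoff would be an assumption.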







