Identifying and Removing Duplicate Records From Systematic Review Searches

  • Research
  • Open Access
  • Published:

Better duplicate detection for systematic reviewers: evaluation of the Systematic Review Assistant-Deduplication Module

  • 18k Accesses

  • 65 Citations

  • 36 Altmetric


Abstract

Background

A major problem arising from searching across bibliographic databases is the retrieval of duplicate citations. Removing such duplicates is an essential task to ensure systematic reviewers do not waste time screening the same citation multiple times. Although reference management software uses algorithms to remove duplicate records, this is only partially successful and necessitates removing the remaining duplicates manually. This time-consuming task leads to wasted resources. We sought to evaluate the effectiveness of a newly developed deduplication program against EndNote.

Methods

A literature search of 1,988 citations was manually inspected and duplicate citations identified and coded to create a benchmark dataset. The Systematic Review Assistant-Deduplication Module (SRA-DM) was iteratively developed and tested using the benchmark dataset and compared with EndNote's default one-step auto-deduplication process matching on 'author', 'year' and 'title'. The accuracy of deduplication was reported by calculating the sensitivity and specificity. Further validation tests, with three additional benchmarked literature searches comprising a total of 4,563 citations, were performed to determine the reliability of the SRA-DM algorithm.

Results

The sensitivity (84%) and specificity (100%) of the SRA-DM was superior to EndNote (sensitivity 51%, specificity 99.83%). Validation testing on three additional biomedical literature searches demonstrated that SRA-DM consistently achieved higher sensitivity than EndNote (90% vs 63%), (84% vs 73%) and (84% vs 64%). Furthermore, the specificity of SRA-DM was 100%, whereas the specificity of EndNote was imperfect (average 99.75%), with some unique records wrongly assigned as duplicates. Overall, there was a 42.86% increase in the number of duplicate records detected with SRA-DM compared with EndNote auto-deduplication.

Conclusions

The Systematic Review Assistant-Deduplication Module offers users a reliable program to remove duplicate records with greater sensitivity and specificity than EndNote. This application will save researchers and information specialists time and avoid research waste. The deduplication program is freely available online.


Background

Identifying trials for systematic reviews is time consuming: the average retrieval from a PubMed search produces 17,284 citations [1]. The biomedical databases MEDLINE [2] and EMBASE [3] contain over 41 million records, and almost one million records are added annually to EMBASE [3] (which now also includes MEDLINE records) and 700,000 to MEDLINE [2]. However, the methodological details of trials are often inadequately described by authors in the titles or abstracts, and not all records contain an abstract [4]. Due to these limitations, a wider (that is, more sensitive) search strategy is necessary to ensure articles are not missed, which leads to an imprecise dataset retrieved from electronic bibliographic databases. Typically, of the thousands of citations retrieved for a systematic review search, over 90% are excluded on the basis of title and abstract screening [5].

Searching multiple databases is essential because different databases contain different records, and therefore, the coverage is widened. Likewise, searching multiple databases utilises differences in indexing to increase the likelihood of retrieving relevant items that are listed in several databases [6], but inevitably, this practice also retrieves overlapping content [7]. The degree of journal overlap estimated by Smith [8] over a decade ago indicated that about 35% of journals were listed in both MEDLINE and EMBASE. Journal overlap can vary from 10% to 75% [9, 10, 8, 11, 12] depending on medical speciality. More recently, the overlap in MEDLINE and EMBASE was found to be 79% [13] based on trials that had been included in 66 Cochrane systematic reviews.

The problem of overlapping content and subsequent retrieval of duplicate records is partially managed with commercial reference management software programs such as EndNote [14], Reference Manager [15], Mendeley [16] and RefWorks [17]. They contain algorithms designed to identify and remove duplicate records using an auto-deduplication function. However, the detection of duplicate records can be thwarted by inconsistent citation details, missing information or errors in the records. Typically, auto-deduplication is only partially successful [18], and the onerous task of manually sifting and removing the remaining duplicates rests with reviewers or information specialists. Removing such duplicates is an essential task to ensure systematic reviewers do not waste time screening the same citation multiple times. This study aimed to iteratively develop and test the performance of a new deduplication program against EndNote X6.

Methods

Systematic Review Assistant-Deduplication Module process of development

The Systematic Review Assistant-Deduplication Module (SRA-DM) project was developed in 2013 at the Bond University Centre for Research in Evidence-Based Practice (CREBP). The project aimed to reduce the amount of time taken to produce systematic reviews by maximising the efficiency of the various review stages, such as optimising search strategies and screening, finding full text articles and removing duplicate citations.

The deduplication algorithm was developed using a heuristic-based approach with the aim of increasing the retrieval of duplicate records and minimising unique records being erroneously designated as duplicates. The algorithm was developed iteratively, with each version tested against a benchmark dataset of 1,988 citations. Modifications were made to the algorithm to overcome errors in duplicate detection (Table 1). For example, errors frequently occurred due to variations in author names (e.g. first-name/surname sequence, use/absence of initialisation, missing author names and typographical errors), page numbers (e.g. full/truncated, or missing), text emphasis marks (e.g. French/German/Spanish) and journal names (e.g. abbreviated/complete, and 'the' used intermittently). The performance of the SRA-DM algorithm was compared with EndNote's default one-step auto-deduplication procedure. To determine the reliability of SRA-DM, we conducted a series of validation tests with results of different literature searches (cytology screening tests, stroke and haematology) which were retrieved from searching multiple biomedical databases (Table 2).
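Several of the field variations listed above (accent marks, punctuation, and an intermittent leading 'the' in journal names) can be neutralised by normalising each field before comparison. The sketch below illustrates the kind of normalisation Table 1 describes; it is an illustrative Python example, not the actual SRA-DM code, which is available from the project repository [26].

```python
import unicodedata

def normalise_field(text):
    """Normalise a citation field for comparison: strip accent marks,
    lowercase, replace punctuation with spaces and drop a leading 'the'.
    Illustrative only -- the real SRA-DM rules may differ."""
    # Decompose accented characters, then drop the combining marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and treat punctuation as whitespace
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    words = text.split()
    if words and words[0] == "the":
        words = words[1:]  # so 'The Lancet' and 'Lancet' compare equal
    return " ".join(words)

print(normalise_field("The Lancet"))    # lancet
print(normalise_field("Présentation"))  # presentation
```

With fields normalised this way, two records whose journal names differ only in accents, punctuation or a leading article produce identical comparison keys.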

Table 1 SRA-DM algorithm changes


Table 2 Databases searched for retrieval of citations for validation testing


Definitions

A duplicate record was defined as being the same bibliographic record (irrespective of how the citation details were reported, e.g. variations in page numbers, author details, accents used or abridged titles). Where further reports from a single study were published, these were not classed as duplicates as they are multiple reports which can appear across or within journals. Similarly, where the same study was reported in both journal and conference proceedings, these were treated as separate bibliographic records.

Testing against the benchmark

A total of 1,988 citations, derived from a search conducted on 29 July 2013 for surgical and non-surgical management of pleural empyema, were used to test SRA-DM and EndNote X6. Six databases were searched (MEDLINE-Ovid, EMBASE-Elsevier, CENTRAL-Cochrane Library, CINAHL-EBSCO, LILACS-Bireme, PubMed-NLM). To create the benchmark, citations were imported into an EndNote database, sorted by author, inspected for duplicate records and manually coded as a unique or duplicate record; the database was then reordered by article title and reinspected for further duplicates. Once the benchmark was finalised, duplicates were sought in EndNote using the default one-step auto-deduplication procedure, which used the matching criteria of 'author', 'year' and 'title' (with the 'ignore spacing and punctuation' box ticked). A few additional duplicates were identified in EndNote and SRA-DM whilst cross-checking against the benchmark decisions, and the benchmark and results were updated to take account of these.

Data analysis

The accuracy of the results was coded against the benchmark according to whether a record was a true positive (true duplicate, i.e. correctly identified duplicate), false positive (false duplicate, i.e. incorrectly identified as duplicate), true negative (unique record) or false negative (true duplicate, i.e. incorrectly identified as unique record). This process was repeated for results received after using the SRA-DM. Sensitivity is defined as the ability to correctly classify a record as duplicate and is the proportion of true positive records over the total number of records identified as true positive and false negative. Specificity is defined as the ability to correctly classify a record as being unique or non-duplicate and is the proportion of true negative records over the total number of records identified as true negative and false positive.
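In code, these definitions reduce to two small functions. A minimal sketch in Python (not part of the published analysis), using the fourth-iteration SRA-DM counts reported in the Results (TP = 674, FN = 125, TN = 1,189, FP = 0):

```python
def sensitivity(tp, fn):
    """Proportion of true duplicates correctly classified as duplicates."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Proportion of unique records correctly classified as unique."""
    return tn / (tn + fp)

# Worked example with the fourth-iteration SRA-DM benchmark counts:
print(round(100 * sensitivity(674, 125), 1))  # 84.4
print(round(100 * specificity(1189, 0), 1))   # 100.0
```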

Results

Training and development of SRA-DM

First and second iteration

The first iteration of the deduplication algorithm achieved 75.0% sensitivity and 99.9% specificity (Table 3). The matching criteria were based on field comparison (ignoring punctuation) with checks made against the year field. This field was chosen because the year field has a lower probability of errors, since it is restricted to the digits 0–9 and is therefore the least mistakable field. Eighty-four per cent of undetected duplicates arose due to variations in page numbers (e.g. 221–226, 221–6). To address this, short format page numbers were converted to full format, and the algorithm was further modified to increase the sensitivity by incorporating matching criteria on authors OR title. This increased the sensitivity of the second iteration to 95.7% with more duplicates detected, but as a consequence the number of false positives also increased (specificity 99.8%).
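The page-number fix can be sketched as a small normalisation step that expands a truncated end page to full form before records are compared. This illustrates the idea and is not the actual SRA-DM code:

```python
import re

def expand_pages(pages):
    """Expand a truncated page range such as '221-6' to '221-226'.
    Anything that is not a simple numeric range is returned unchanged."""
    m = re.fullmatch(r"(\d+)[-–](\d+)", pages)
    if not m:
        return pages
    start, end = m.group(1), m.group(2)
    if len(end) < len(start):
        # Borrow the missing leading digits from the start page
        end = start[: len(start) - len(end)] + end
    return f"{start}-{end}"

print(expand_pages("221-6"))    # 221-226
print(expand_pages("221-226"))  # 221-226
```

After this normalisation, '221–226' and '221–6' yield the same comparison value, removing the largest single source of undetected duplicates in the first iteration.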

Table 3 Sensitivity† and specificity‡ of SRA-DM prototype algorithms and EndNote auto-deduplication (in a dataset of 1,988 citations, including 799 duplicates)


Third iteration

The third iteration was modified to match author AND title, with the extension of the non-reference fields from only 'year' to year OR volume OR edition. This distinguished references that were similar (e.g. same author and title combination) but contained different source publications, and this improved the specificity to 100%, but the sensitivity was reduced (68.0%).
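Expressed as a predicate, the third-iteration rule combines an exact author/title comparison with an any-of check on the source fields. A sketch under assumed field names (plain dicts; the real SRA-DM record format will differ):

```python
def _norm(s):
    """Lowercase and strip everything except letters and digits."""
    return "".join(ch for ch in (s or "").lower() if ch.isalnum())

def is_duplicate(a, b):
    """Third-iteration rule as described in the text: author AND title
    must match, and at least one of year, volume or edition must also
    match. Field names and normalisation are illustrative."""
    if _norm(a.get("author")) != _norm(b.get("author")):
        return False
    if _norm(a.get("title")) != _norm(b.get("title")):
        return False
    return any(a.get(f) and a.get(f) == b.get(f)
               for f in ("year", "volume", "edition"))
```

Two records with identical author and title but no agreeing year, volume or edition are kept separate, which is what restores the 100% specificity for same-title works in different source publications.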

Fourth iteration

The fourth iteration was modified to accommodate author name variations using fuzzy logic, so that differences in names spelt in full or initialised, differences in the ordering of names and different punctuation could be accommodated (Table 1); this increased the sensitivity to 84.4% by correctly identifying 674 citations as duplicates (TP) and 1,189 citations as unique records (TN); no false positives occurred (100% specificity) and only 125 duplicate records were undetected (FN). This fourth iteration of SRA-DM was then compared against EndNote. EndNote identified 412 of the 1,988 citations as duplicates. Of these, 410 were correctly identified as duplicates (TP) and 2 were incorrectly designated as duplicates (FP); 1,185 citations were correctly identified as unique records (TN) and 391 duplicate citations were undetected (FN). The sensitivity of EndNote was 51.2% and specificity 99.8%. Compared with EndNote, SRA-DM produced a 64% increase in sensitivity with no loss of specificity.
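One way to implement that kind of fuzzy author matching is to reduce each name to a canonical surname-plus-initial key, so that 'Smith, John', 'J. Smith' and 'Smith J' all compare equal. The heuristic below (the longest token taken as the surname) is an assumption for illustration only and is not the exact SRA-DM logic:

```python
def normalise_author(name):
    """Reduce an author name to a surname-plus-initials key so that
    'Smith, John', 'J. Smith' and 'Smith J' compare equal.
    Simplified sketch; the real SRA-DM fuzzy rules may differ."""
    parts = [p.strip(".,") for p in name.replace(",", " ").split() if p.strip(".,")]
    if not parts:
        return ""
    # Assume the longest token is the surname; the others supply initials
    surname = max(parts, key=len).lower()
    initials = sorted(p[0].lower() for p in parts if p.lower() != surname)
    return surname + ":" + "".join(initials)

print(normalise_author("Smith, John"))  # smith:j
print(normalise_author("J. Smith"))     # smith:j
print(normalise_author("Smith J"))      # smith:j
```

The key deliberately ignores name order, initialisation and punctuation, the three variation types listed in Table 1; real names with compound surnames would need a more careful rule.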

Validation results

The fourth iteration of SRA-DM was further tested with three additional datasets using search topics from cytology screening tests (n = 1,856), stroke (n = 1,292) and haematology (n = 1,415) (Table 2). These were obtained from existing searches performed by information specialists to widen the scope of the validation tests. The SRA-DM algorithm was consistently more sensitive (Table 4) at detecting duplicates than EndNote [cytology screening: 90% vs 63%; stroke: 84% vs 73%; and haematology: 84% vs 64%], and the specificity of SRA-DM was 100%, i.e. no false positives occurred. In contrast, the average specificity of EndNote was lower (99.7%). These false positives occurred in EndNote due to citations with the same authors and title being published in other journals or as conference proceedings. Compared with EndNote, the average percentage increase in duplicates detected by SRA-DM across all four bibliographic searches was 42.8%.

Tabular array 4 Sensitivity† and specificity‡ of SRA-DM and EndNote auto-deduplication (validation testing)


Discussion

Our findings demonstrated that SRA-DM identifies substantially more duplicate citations than EndNote and has greater sensitivity [(84% vs 51%), (90% vs 63%), (84% vs 73%), (84% vs 64%)]. The specificity of SRA-DM was 100% with no false positives, whereas the specificity of EndNote was imperfect.

Waste in research occurs for several methodological, legislative and reporting reasons [19–22]. Another form of waste is inefficient labouring, in part as a consequence of non-standardised citation details across bibliographic databases, perfunctory error checking and the absence of a unique trial identification number for a trial and its associated further multiple reports. If these issues were solved at source, manual duplicate checking would be unnecessary. Until these problems are resolved, deploying the SRA-DM will save information specialists and reviewers valuable time by identifying on average a further 42.86% of duplicate records.

Several citations were wrongly designated as duplicates by EndNote auto-deduplication due to different citations sharing the same authors and title but published in other journals or as conference proceedings. In a recent report by Jiang [23], the authors also found that EndNote, for the same reason, had erroneously assigned unique records as duplicates. It is probable that in most scenarios no important loss of information would occur, although sometimes additional methodological or outcome data are reported, and ideally these need to be retained for inspection. A recent study by Qi [18] examined the content of undetected duplicate records in EndNote and found that errors often occurred due to missing or incorrect data in the fields, especially for records retrieved from the EMBASE database. This also affected the sensitivity of SRA-DM, with duplicates undetected due to missing, incorrect or extraneous data in the fields.

During the training and development stage, the four iterations of SRA-DM achieved sensitivities of 68%, 75%, 84% and 96%, with the most sensitive (96%) achieved with a trade-off in specificity (99.75%) with three false positives. For systematic reviews and Health Technology Assessment reports, the aim is to conduct comprehensive searches to ensure all relevant trials are identified [24]; thus, losing even three citations is undesirable. Therefore, the final algorithm (fourth iteration) with the lower sensitivity (84%) but perfect (100%) specificity was preferred. Future developments with SRA-DM may incorporate two algorithms, first using the 100% specific algorithm to automatically remove duplicates and then another algorithm with higher sensitivity (albeit with lower specificity) to identify the remaining duplicates for manual verification. If this strategy were implemented on the respiratory dataset using the fourth and second algorithms (Table 3), only 91 out of 1,988 citations would have to be manually checked and only 34 duplicates would remain undetected.
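The proposed two-algorithm strategy amounts to a two-pass pipeline: a strict matcher removes confident duplicates automatically, and a looser matcher flags the remainder for manual verification. A hypothetical sketch, with both matcher functions supplied by the caller (neither is a published SRA-DM component):

```python
def two_stage_dedupe(records, strict_match, loose_match):
    """Drop records that strict_match (100% specific) identifies as
    duplicates; flag records that only loose_match (more sensitive)
    catches, so a human can verify them. Both matchers are hypothetical
    parameters, not published SRA-DM components."""
    kept, flagged = [], []
    for rec in records:
        if any(strict_match(rec, k) for k in kept):
            continue  # confident duplicate: remove automatically
        if any(loose_match(rec, k) for k in kept):
            flagged.append(rec)  # possible duplicate: manual check
        kept.append(rec)
    return kept, flagged
```

On the benchmark dataset described above, this would leave only the loose-only matches (91 citations in the text's example) for manual checking.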

In spite of this major improvement with the SRA-DM, no software can currently detect all duplicate records, and the perfect uncluttered dataset remains elusive. Undetected duplicates in SRA-DM occurred due to discrepancies such as missing page numbers or too much variance in author names. Duplicates were also missed because the Ovid MEDLINE platform inserts additional inapplicable information into the title field (e.g. [Review] [72 refs]), whereas the same article retrieved from EMBASE or other non-Ovid MEDLINE platforms (i.e. PubMed, Web of Knowledge) reports only the title. Some of these problems could be overcome in the future with record linkage and citation enrichment techniques to populate blank fields with metadata to increase the detection rate.
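Stripping such platform-inserted bracketed annotations before titles are compared is straightforward; a sketch of this clean-up (illustrative, not the SRA-DM implementation):

```python
import re

def clean_title(title):
    """Remove bracketed annotations such as '[Review] [72 refs]' that
    the Ovid MEDLINE platform appends to the title field."""
    return re.sub(r"\s*\[[^\]]*\]", "", title).strip()

print(clean_title("Management of pleural empyema. [Review] [72 refs]"))
# Management of pleural empyema.
```

A production implementation would need to be more conservative, since some genuine titles contain brackets (e.g. translated titles in MEDLINE records).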

Strengths and weaknesses

The deduplication program was developed to identify duplicate citations from biomedical databases and has not been tested on other bibliographic records such as books and governmental reports, and therefore may not perform as well with other bibliographies. Nonetheless, the deduplication program was developed iteratively to remove issues of false positives and was tested on four different datasets which included comprehensive searches using 14 different databases that are used by information specialists; therefore, similar efficiencies should occur in other medical specialities. Also, the accuracy of SRA-DM was consistently higher than that of EndNote, and these findings are probably generalizable to other biomedical database searches due to the same record types and fields used. It is possible that some duplicates were not detected during the manual benchmarking process, although the database was screened twice, first by author and then by title, and additional cross-checking was performed by manually comparing the benchmark against EndNote auto-deduplication and SRA-DM decisions, thus minimising the possibility of undetected duplicates.

Whilst we compared SRA-DM against the typical default EndNote deduplication setting, we recognise that some information specialists adopt additional steps whilst performing deduplication in EndNote. For example, they may employ multi-phase screening or attempt to repair incomplete citations by updating citation fields with the 'Find Reference Updates' feature in EndNote. Nevertheless, many researchers and information specialists do not employ such techniques, and our aim was to address deduplication with an automated algorithm and compare it against the default deduplication process in EndNote. Qi [18] recommended employing a two-step strategy to address the problem of undetected duplicates by first performing auto-deduplication in EndNote followed by manual hand screening to identify remaining duplicates. This basic strategy is used by some information specialists and systematic reviewers but is inefficient due to the large proportion of unidentified duplicates. Other more complex multi-stage screening strategies have been suggested [25] but are EndNote-specific and not feasible for other reference management software.

Conclusions

The deduplication algorithm has greater sensitivity and specificity than EndNote. Reviewers and information specialists incorporating SRA-DM into their research procedures will save valuable time and reduce resource waste. The algorithm is open source [26] and the SRA-DM program is freely available to users online [27]. It allows similar file manipulation to EndNote and currently accepts XML, RIS and CSV file formats, enabling citations to be exported directly to RevMan software. It has the option of automatic duplicate removal or manual pair-wise duplicate screening performed individually or with a co-reviewer.

References

  1. Islamaj Dogan R, Murray GC, Névéol A, Lu Z: Understanding PubMed user search behavior through log analysis. Database J Biol Databases Curation 2009, 2009:1.


  2. MEDLINE - fact sheet. [http://www.nlm.nih.gov/pubs/factsheets/medline.html]

  3. Embase. [http://www.ovid.com/webapp/wcs/stores/servlet/ProductDisplay?storeId=13051&catalogId=13151&langId=-1&partNumber=Prod-903]

  4. Lefebvre C, Eisinga A, McDonald S, Paul N: Enhancing access to reports of randomized trials published world-wide – the contribution of EMBASE records to the Cochrane Central Register of Controlled Trials (CENTRAL) in the Cochrane Library. Emerg Themes Epidemiol 2008, 5:13. 10.1186/1742-7622-5-13


  5. Wallace BC, Trikalinos TA, Lau J, Brodley C, Schmid CH: Semi-automated screening of biomedical citations for systematic reviews. BMC Bioinformatics 2010, 11:1–11. 10.1186/1471-2105-11-1


  6. Sampson M, McGowan J, Cogo E, Horsley T: Managing database overlap in systematic reviews using batch citation matcher: case studies using Scopus. J Med Libr Assoc 2006, 94:461–463.


  7. Sievert MC, Andrews MJ: Indexing consistency in information science abstracts. J Am Soc Inf Sci 1991, 42:1–6. 10.1002/(SICI)1097-4571(199101)42:1<1::AID-ASI1>3.0.CO;2-9


  8. Smith B, Darzins P, Quinn M, Heller R: Modern methods of searching the medical literature. Med J Aust 1992, 2:603–611.


  9. Kleijnen J, Knipschild P: The comprehensiveness of MEDLINE and Embase computer searches. Searches for controlled trials of homoeopathy, ascorbic acid for common cold and ginkgo biloba for cerebral insufficiency and intermittent claudication. Pharm Weekbl Sci 1992, 14:316–320. 10.1007/BF01977620


  10. Odaka T, Nakayama A, Akazawa K, Sakamoto M, Kinukawa N, Kamakura T, Nishioka Y, Itasaka H, Watanabe Y, Nose Y: The effect of a multiple literature database search – a numerical evaluation in the domain of Japanese life science. J Med Syst 1992, 16:177–181. 10.1007/BF00999380


  11. Rovers JP, Janosik JE, Souney PF: Crossover comparison of drug information online database vendors: Dialog and MEDLARS. Ann Pharmacother 1993, 27:634–639.


  12. Ramos-Remus C, Suarez-Almazor M, Dorgan M, Gomez-Vargas A, Russell AS: Performance of online biomedical databases in rheumatology. J Rheumatol 1994, 21:1912–1921.


  13. Royle P, Milne R: Literature searching for randomized controlled trials used in Cochrane reviews: rapid versus exhaustive searches. Int J Technol Assess Health Care 2003, 19:591–603.


  14. EndNote. [http://endnote.com/]

  15. Reference Manager. [http://www.refman.com/]

  16. Mendeley. [http://www.mendeley.com/]

  17. RefWorks. [www.refworks.com]

  18. Qi X, Yang M, Ren W, Jia J, Wang J, Han G, Fan D: Find duplicates among the PubMed, EMBASE, and Cochrane Library databases in systematic review. PLoS One 2013, 8:e71838. 10.1371/journal.pone.0071838


  19. Glasziou P, Altman DG, Bossuyt P, Boutron I, Clarke M, Julious S, Michie S, Moher D, Wager E: Reducing waste from incomplete or unusable reports of biomedical research. Lancet 2014, 383:267–276. 10.1016/S0140-6736(13)62228-X


  20. Chan AW, Song F, Vickers A, Jefferson T, Dickersin K, Gøtzsche PC, Krumholz HM, Ghersi D, van der Worp HB: Increasing value and reducing waste: addressing inaccessible research. Lancet 2014, 383:257–266. 10.1016/S0140-6736(13)62296-5


  21. Chalmers I, Bracken MB, Djulbegovic B, Garattini S, Grant J, Gülmezoglu AM, Howells DW, Ioannidis JP, Oliver S: How to increase value and reduce waste when research priorities are set. Lancet 2014, 383:156–165. 10.1016/S0140-6736(13)62229-1


  22. Ioannidis JP, Greenland S, Hlatky MA, Khoury MJ, Macleod MR, Moher D, Schulz KF, Tibshirani R: Increasing value and reducing waste in research design, conduct, and analysis. Lancet 2014, 383:166–175. 10.1016/S0140-6736(13)62227-8


  23. Jiang Y, Lin C, Meng W, Yu C, Cohen AM, Smalheiser NR: Rule-based deduplication of article records from bibliographic databases. Database (Oxford) 2014, 2014:1–7.


  24. Cochrane handbook for systematic reviews of interventions. [http://www.cochrane.org/handbook]

  25. Removing duplicates in retrieval sets from electronic databases: comparing the efficiency and accuracy of the Bramer-method with other methods and software packages. [http://www.iss.it/binary/eahi/cont/57_Wichor_M._Bramer.pdf]

  26. Source code. [https://github.com/CREBP/SRA]

  27. Systematic review assistant - deduplication module. [http://crebp-sra.com]


Sources of funding

NHMRC Australia Fellowship: GNT0527500.

Author information

Affiliations

Corresponding author

Correspondence to John Rathbone.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed to the study concept and design. JR devised the testing and analysis of the algorithms. MC wrote and revised the algorithm codes. JR drafted the initial manuscript. TH, PG and MC contributed to the manuscript and all the revisions. All authors read and approved the final manuscript.

Rights and permissions

This article is published under licence to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article


Cite this article

Rathbone, J., Carter, M., Hoffmann, T. et al. Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module. Syst Rev 4, 6 (2015). https://doi.org/10.1186/2046-4053-4-6


Keywords

  • Systematic Review Assistant-Deduplication Module
  • SRA-DM
  • Deduplication
  • EndNote
  • Reference manager
  • Citation-screening
  • Systematic review
  • Bibliographic database


Source: https://systematicreviewsjournal.biomedcentral.com/articles/10.1186/2046-4053-4-6
