Systematic Review  |  Open Access  |  5 Jan 2025

Artificial intelligence versus human researcher performance for systematic literature searches: a study focusing on the surgical management of base of thumb arthritis

Plast Aesthet Res. 2025;12:1.
10.20517/2347-9264.2024.99 |  © The Author(s) 2025.

Abstract

Aim: In the digital age, artificial intelligence (AI) platforms have gradually replaced traditional manual techniques for information retrieval. However, their effectiveness in conducting academic literature searches remains unclear, necessitating a comparative assessment. This study examined the efficacy of AI search engines (Elicit, Consensus, ChatGPT) vs. manual search for literature retrieval, focusing on the surgical management of trapeziometacarpal osteoarthritis.

Methods: The study was conducted in accordance with the Cochrane Handbook for Systematic Reviews and PRISMA guidelines. AI platforms were given relevant keywords and prompts, while manual searches used the PubMed, Cochrane CENTRAL, Web of Science, and Scopus databases from January 1901 to April 2024. The study focused on English-language randomized controlled trials (RCTs) comparing surgical management of trapeziometacarpal osteoarthritis (TMCJ OA). Two independent evaluators screened the studies and extracted data from them. Primary outcomes were the quality and relevance of the studies retrieved by each search method, evaluated by false positive rates and by the number of studies reporting the outcomes of interest.

Results: The manual search yielded the most results (6,018), followed by Elicit (4,980), Consensus (3,436), and ChatGPT (6). Elicit identified the highest number of RCTs (205) but also had the greatest false positive rate (94%). Ultimately, the manual search identified 23 suitable studies, Elicit found 10, Consensus found 9, and ChatGPT identified only 1. No additional studies were found by AI search engines that were not discovered in the manual search.

Conclusion: The findings highlight the potential advantages and drawbacks of AI search engines for literature searches. While Elicit was prone to error, Consensus and ChatGPT were less comprehensive. Significant enhancements in the precision and thoroughness of AI search engines are required before they can be effectively utilized in academia.

Keywords

Artificial intelligence, human, researcher, systematic review, searches

INTRODUCTION

In an era of digital transformation, traditional literature search methods are being supplemented and replaced by artificial intelligence (AI)-based platforms[1,2]. These include software such as Elicit, Consensus, and ChatGPT, which have been proposed as valuable tools for expediting information retrieval and facilitating the dissemination of medical information. In this domain, ChatGPT has received considerable commentary on its potential in academia across a range of topics, from osteoarthritis to cosmetic surgery. Its responses have been described as surprisingly accurate, but major concerns remain about its ability to correctly identify the sources of its knowledge. The comparative efficiency and accuracy of different chatbots in locating and sourcing information, relative to traditional human-initiated searches, have not yet been explored[3-5].

Trapeziometacarpal osteoarthritis (TMCJ OA) is a common condition among the elderly that significantly limits the thumb movement and function needed for everyday tasks[6]. Management of TMCJ OA begins medically in mild cases, progressing to operative intervention only when anti-inflammatory medication and pain relief prove insufficient. Multiple surgical and non-surgical treatment modalities are available, but their comparative effectiveness is unclear, especially for the surgical options[7]. This gap in the literature leaves healthcare professionals and patients in a predicament during decision-making, and a systematic review and meta-analysis is likely an effective means to summarize the evidence and facilitate a consensus in the plastic and orthopedic surgery communities.

With this in mind, we carried out a comparative study of the performance of Elicit, Consensus, and ChatGPT against manual human literature search methods for the management of TMCJ OA. The primary outcomes were the ability to identify publications with higher-level evidence, as well as the number of publications and their relevance. The outcomes of interest specific to TMCJ OA were also investigated to inform the potential role and value of AI for conducting systematic reviews.

METHODS

The current study adhered to the Cochrane Handbook for Systematic Reviews of Interventions and the preferred reporting items for systematic reviews and meta-analyses (PRISMA) statement guidelines throughout all stages[8]. The study was registered on PROSPERO, the International Prospective Register of Systematic Reviews (CRD420431089). The primary objective of this study was to evaluate the performance of AI-based platforms (Elicit, Consensus, and ChatGPT) against human experts in conducting a literature search for a systematic review of treatments for base of thumb arthritis. Institutional ethical approval was not required since this study did not involve human subjects. These three AI platforms were selected for their prominence and widespread adoption in the research community at the time of the study. Elicit, developed by Ought, was chosen for its specialized focus on scientific literature search and summarization. Consensus, created by Consensus.app, was selected for its ability to aggregate and analyze scientific papers. ChatGPT, an advanced language model by OpenAI, was included due to its versatility in understanding and generating human-like text across various domains, including scientific literature. The standard ChatGPT version was used to minimize any potential bias and maintain methodological consistency with other studies.

Literature search strategy

To ensure consistency and comparability between AI and human-based searches, a uniform search strategy was employed. AI-based platforms were given different arrays of keywords and prompted ten times, and all result pages were screened for potential studies [Supplementary Figures 1-3]. The authors (IS and GB) validated the suitability and relevance of the studies initially sourced by the AI tools, and the total search results are shown in Table 2. This entailed identifying randomized controlled trials (RCTs), false positive RCTs, and prospective studies, and deciding whether a study was included or excluded without any assistance from the AI tools. The manual search strategy combined pertinent keywords and MeSH terms associated with TMCJ OA: (thumb OR trapezio-metacarpal OR trapeziometacarpal OR trapezial-metacarpal OR trapezialmetacarpal OR trapezium OR carpal* OR metacarp* OR carpo-metacarpal OR “metacarpophalangeal joint” OR “carpometacarpal joint” OR trapezium) AND (osteoarthritis OR osteoarth* OR “joint disease” OR arthropathy) AND (“basal joint arthroplasty” OR “Arthroscopic Resection Arthroplasty” OR “resection arthroplasty” OR trapeziectomy OR “trapezio-metacarpal arthrodesis”). The manual literature search was conducted using Medline (via PubMed), Cochrane Library, Web of Science, and Scopus, covering the period from January 1901 to April 2024. Additionally, the reference lists of relevant articles were manually reviewed. The Supplementary Materials include a comprehensive overview of the search strategies employed.
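The Boolean structure of the manual strategy (synonyms joined by OR within a concept, concepts joined by AND) can be sketched in a few lines of Python. This is only an illustration: the helper names `or_block` and `build_query` are hypothetical, the term lists are abbreviated from the strategy described above, and the exact syntax accepted by each database may differ.

```python
# Hypothetical helpers for assembling a Boolean search string.
def or_block(terms):
    """Join the terms of one concept with OR, quoting multi-word phrases."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*groups):
    """Combine one parenthesized OR-block per concept group with AND."""
    return " AND ".join(or_block(g) for g in groups)


# Abbreviated term lists (assumption: a subset of the reported strategy).
anatomy = ["thumb", "trapeziometacarpal", "trapezium", "carpometacarpal joint"]
condition = ["osteoarthritis", "osteoarth*", "joint disease", "arthropathy"]
procedure = ["trapeziectomy", "resection arthroplasty", "basal joint arthroplasty"]

query = build_query(anatomy, condition, procedure)
print(query)
```

Building the string programmatically keeps parentheses and quotation marks balanced, which is easy to get wrong when a long strategy is typed by hand.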

Table 1

Summary of included studies

Study ID | Study arms, N | Age, mean (SD) | Male, N (%) | Surgical intervention | Follow up | Level of evidence | Inclusion criteria | Primary outcomes | Conclusion
Belcher 2000 | Trapeziectomy, 19; Trapeziectomy and LRTI, 23 | 63 (2); 58 (1) | 1 (5.26%); 4 (17.39%) | Trapeziectomy by posterior approach vs. T + LRTI (APL-FCR-APL) | 14 months | I | 1. Adults undergoing trapeziectomy for osteoarthrosis of the thumb TMJ were entered into this study between March 1996 and July 1998 | 1. Pain; 2. Physical function; 3. Adverse events | Both groups expressed equal satisfaction with the operation and there were no significant differences between the two treatment groups. Simple trapeziectomy is an effective operation for osteoarthrosis at the base of the thumb and the addition of a ligament reconstruction was not shown to confer any additional benefit
Belcher 2001 | Trapeziectomy, 13; Trapeziectomy and Permacol porcine xenograft, 13 | 59 (8); 59 (9) | 7 (53.8%); 7 (53.8%) | Trapeziectomy by posterior approach vs. Trapeziectomy + Permacol porcine xenograft | 6 months | I | 1. Patients undergoing trapeziectomy for osteoarthrosis of the thumb trapeziometacarpal joint were entered into the study between April and December 1999 | 1. Pain; 2. Physical function; 3. Satisfaction; 4. Adverse events | Permacol patients reported greater pain and were less satisfied with their operations than control patients. We conclude that interposition of Permacol is detrimental to the results of trapeziectomy
Brennan 2020 | Trapeziectomy, 14; Trapeziectomy and LRTI, 20 | 75 (6); 75 (6) | 3 (21.43%); 5 (25%) | Trapeziectomy by posterior approach vs. Trapeziectomy + LRTI (½FCR-MT) | 17 years | I | 1. Patients with osteoarthritis of the CMCJ of the thumb were recruited | 1. Pain; 2. Physical function; 3. Satisfaction | Even at 17 years, there is no significant benefit of LRTI over trapeziectomy alone for thumb carpometacarpal joint osteoarthritis
Corain 2016 | Trapeziectomy and HDA, 64; Trapeziectomy and LR (APL-MT-FCR), 56 | 63 (12); - | - | Trapeziectomy + HDA vs. Trapeziectomy + LR (APL-MT-FCR) | 6.6 years | I | 1. No previous surgeries affecting the same arm; 2. No diabetes or connective tissue disorders; symptomatic stage 3 or 4 osteoarthritis according to the Eaton classification | 1. Pain; 2. Physical function; 3. Adverse events | We demonstrate that the trapezium excision and bone space distraction technique require a smaller incision, a shorter surgical time, an easier surgical technique, and a less painful recovery, maintaining overlapping levels of functional restore
De Smet 2004 | Trapeziectomy; Trapeziectomy and LRTI | 61.5 (10.2); 58 (6.3) | 0; 0 | Trapeziectomy vs. Trapeziectomy + LRTI (FCR-MT) | 26 months | I | 1. Patients suffered from painful primary osteoarthritis of the carpometacarpal joint of the thumb not responding to conservative treatment | 1. Pain; 2. Physical function | Simple trapeziectomy is a good procedure, especially for elderly patients requiring not much force
Field 2007 | Trapeziectomy, 32; Trapeziectomy and LRTI, 33 | -; - | 4 (12.5%); 5 (15.15%) | Trapeziectomy by posterior approach vs. Trapeziectomy + LRTI (½FCR-MT) | 1 year | I | 1. Patients with osteoarthritis of the carpometacarpal joint of the thumb of Eaton and Glickel Grade III or IV; 2. Who had not responded to conservative treatment were recruited into the study between 2001 and 2003 | 1. Pain; 2. Physical function; 3. Adverse events | In conclusion, this study suggests that there is no benefit to suspension with an FCR sling after trapeziectomy
Gangopadhyay 2012 | Trapeziectomy, 53; Trapeziectomy with palmaris longus interposition, 46; Trapeziectomy with LRTI, 54 | 57 (6); 57 (6); 57 (6) | 0; 0; 0 | Trapeziectomy by posterior approach vs. Trapeziectomy + tendon interposition (PL) | 6 years | I | 1. Women with painful trapeziometacarpal osteoarthritis who had failed to respond to the nonoperative treatment were recruited between 1992 and 2001 | 1. Pain; 2. Adverse events | The outcomes of these 3 variations of trapeziectomy were similar after a minimum follow-up of 5 years. There appears to be no benefit to tendon interposition or ligament reconstruction in the longer term
Gerwin 1997 | Trapeziectomy with Ligament Reconstruction, 11; Trapeziectomy with LRTI, 9 | -; - | -; - | Trapeziectomy + LR (½FCR-MT-Minimitek) vs. Trapeziectomy + LRTI (½FCR-MT-Minimitek) | 23 months | II | 1. Patients undergoing trapeziectomy for osteoarthrosis of the thumb trapeziometacarpal joint | 1. Physical function; 2. Satisfaction | Tendon interposition after ligament reconstruction basal joint arthroplasty does not improve the function of the thumb and necessitates a longer surgical incision and a technically more difficult operation
Hansen 2013 | DLC all-poly cup, 14; Elektra screw cup, 10 | 56 (11); 60 (12) | 2 (14.29%); 1 (7.69%) | Elektra uncemented cup vs. Elektra cemented cup | 2 years | I | 1. Eaton-Glickel stage-2 or -3 TM joint OA in patients over 18 years of age where nonoperative treatment had failed; 2. OA staging was based on a combination of conventional radiographs and CT scans evaluated by one observer | 1. Adverse events | Early implant fixation and clinical outcome were equally good with both cup designs. This is the first clinical RSA study on trapezium cups, and the method appears to be clinically useful for the detection of loose implants
Hart 2006 | Trapeziometacarpal arthrodesis; Trapeziectomy and LRTI | 59 (8); 59 (8) | 13 (35.14%) | Arthrodesis (K-wire) vs. T + LRTI (½FCR-MT-K-wire) | - | I | 1. Patients with primary osteoarthritis of stage 4 according to Eaton and Littler of the first carpometacarpal joint | 1. Adverse events | The after-treatment in patients undergoing arthroplasty lasted longer than in patients after the arthrodesis. It is caused by more complex surgery during Epping’s procedure. But the outcomes become similar over a longer period. At the final follow-up control after arthroplasty, only older patients subjectively appreciated better functional performance. After this experience, we reserve the arthrodesis for younger active and arthroplasty for older patients
Kriegs-Au 2005 | Trapeziectomy with LR, 26; Trapeziectomy with LRTI, 26 | -; - | -; - | Trapeziectomy + LR (½FCR-MT) vs. Trapeziectomy + LRTI (½FCR-MT) | 4 years | II | 1. Patients undergoing trapeziectomy for osteoarthrosis of the thumb trapeziometacarpal joint | 1. Pain; 2. Physical function; 3. Satisfaction; 4. Adverse events | Tendon interposition does not affect the outcome after the ligament reconstruction for the treatment of osteoarthritis of the thumb carpometacarpal joint. Furthermore, proximal migration of the thumb metacarpal does not appear to influence the functional outcome
Marks 2017 | Trapeziectomy with LRTI, 29; Trapeziectomy with Graft Jacket allograft, 31 | 64 (8); 65 (8) | 3 (10%); 6 (19%) | Trapeziectomy + LRTI (½FCR-APL-½FCR) vs. Trapeziectomy + Graft Jacket allograft | 1 year | I | 1. If they were diagnosed with CMC I OA and met indications for trapeziectomy with suspension-interposition arthroplasty | 1. Pain; 2. Quality of life; 3. Adverse events | The use of the FCR tendon or allograft for trapeziectomy with suspension interposition arthroplasty in patients with CMC I OA leads to similar outcomes with more complications, mainly tendon irritations, associated with the latter. Therefore, we only use the allograft in cases of severe instability requiring a larger amount of suspension-interposition material or for revision procedures after failed suspension interposition with the FCR tendon
Morais 2021 | Trapeziectomy with suture-button suspensionplasty, 37; Ligament reconstruction and tendon interposition, 39 | 61.8 (7.8); 61.1 (7.4) | 4 (10.8%); 2 (5.2%) | Trapeziectomy with suture-button suspensionplasty vs. ligament reconstruction and tendon interposition | 40 months | I | 1. Patients with TMC arthritis | 1. Pain; 2. Physical function; 3. Range of movement; 4. Quality of life; 5. Adverse events | The results are related to the hypothesis suggested by biomechanical studies that revealed better initial load-bearing profile and maintenance of trapezial space following serial loading in cadaver models
Ritchie 2008 | Trapeziectomy by anterior approach, 20; Trapeziectomy by posterior approach, 20 | 59 (7); 64 (9) | 6 (30%); 5 (25%) | Trapeziectomy by anterior approach vs. Trapeziectomy by posterior approach | 33 months | I | 1. Adults undergoing trapeziectomy for osteoarthrosis of the TMJ were entered into this study between January 2001 and October 2002 | 1. Pain; 2. Physical function; 3. Satisfaction; 4. Adverse events | Trapeziectomy is a good method of treating osteoarthritis of the thumb base, but outcomes for the anterior approach are equally good or better than with the posterior
Salem 2012 | Trapeziectomy, 59; Trapeziectomy and LRTI, 55 | -; - | 8 (13.56%); 9 (16.36%) | Trapeziectomy + LR (½FCR-MT) vs. Trapeziectomy + LRTI (½FCR-MT) | 6 years | I | 1. Patients with painful trapeziometacarpal joint osteoarthritis who had not responded to nonoperative treatment were recruited during 2002-2005 | 1. Pain; 2. Physical function; 3. Adverse events | This study does not provide evidence to support the use of LRTI and temporary K-wire stabilization after trapeziectomy
Salibi 2019 | Trapeziectomy, 10; Carpometacarpal denervation, 35 | 61 (9); 58 (13) | 5 (50%); 6 (17.14%) | Trapeziectomy vs. carpometacarpal denervation | 5 years | II | 1. A diagnosis of CMC arthritis as well as the failure of nonsurgical management with antiinflammatories, bracing, or corticosteroid injections | 1. Pain; 2. Quality of life; 3. Satisfaction; 4. Physical function | There was no difference between the two treatments. First CMCJ denervation does not appear to be superior to trapeziectomy. However, the advantage of rapid rehabilitation makes it more favoured by patients but at the expense of a 30% reoperation rate
Sanchez-Flo 2020 | Partial Trapeziectomy, 17; Total Trapeziectomy, 17 | 60.5 (9.8); 61 (8.9) | 4 (23.5%); 2 (11.8%) | Partial vs. Total trapeziectomy with interposition arthroplasty | 1 year | III | 1. Patients with isolated TMOA grade II to III (Eaton-Littler) with articular pain and loss of hand function | 1. Physical function; 2. Pain; 3. Quality of life; 4. Adverse events | We cannot conclude that partial trapeziectomy provides an advantage over total trapeziectomy at 1 year after surgery. Although trapeziometacarpal space was substantially preserved in the partial trapeziectomy group at 12 months, this difference was not statistically or clinically significant
Spekreijse 2015 | Burton-Pellegrini technique, 36; Weilby technique, 36 | 65 (9); 64 (9) | -; - | Trapeziectomy + LRTI (½FCR-MT) vs. Trapeziectomy + LRTI (½FCR-APL-½FCR) | 5 years | I | 1. If they had symptoms of stage IV OA of both TMC and STT joints with functional impairment of daily activities after the failure of conservative therapy | 1. Pain; 2. Physical function; 3. Satisfaction; 4. Adverse events | This study showed that improved function, strength, and satisfaction obtained at 1 year after trapeziectomy with LRTI with or without the use of a bone tunnel for stage IV TMC thumb osteoarthritis was maintained after 5 years
Spekreijse 2016 | Trapeziectomy and LRTI, 21; Arthrodesis, 17 | 59.5 (6.3); 59.7 (6) | 0; 0 | Arthrodesis (plate/screws) vs. T + LRTI (½FCR-APL-½FCR) | 5 years | IV | 1. Women older than 40 years with primary, symptomatic OA of the thumb TMC joint, stage II or III by the Eaton and Glickel classification | 1. Pain; 2. Physical function; 3. Satisfaction; 4. Adverse events | Trapeziectomy with LRTI leads to better pain reduction and functional outcome after between 1 and 5 years compared with trapeziometacarpal arthrodesis in women over 40 years old with OA stages II to III
Tagil 2002 | Trapeziectomy with LRTI, 13; Trapeziectomy with Swanson silastic implant, 13 | 62 (13.5); 62 (13) | -; - | Trapeziectomy + LRTI (APL-FCR-APL) vs. Trapeziectomy + Swanson silastic implant | 4 years | I | 1. Patients with radiographic osteoarthritis and disabling pain agreed to participate in the study and were operated on between 1991 and 1995; 2. All had undergone failed conservative treatment including an orthosis | 1. Pain; 2. Satisfaction; 3. Adverse events | Both methods gave good, but not complete, pain relief and neither produced better results than the other in the short term
Thorkildsen 2019 | Uncemented joint replacement (Elektra), 20; Trapeziectomy with LRTI, 20 | 64 (5); 61 (6) | 6 (30%); 6 (30%) | Uncemented joint replacement (Elektra) vs. trapeziectomy (with ligament reconstruction and tendon interposition, LRTI) | 2 years | I | 1. Symptomatic idiopathic osteoarthritis of the CMC1 joint; 2. Patients over 18 years of age with general good health | 1. Physical function; 2. Quality of life; 3. Adverse events; 4. Time to revision | The place for joint replacements in the treatment of symptomatic CMC1 osteoarthritis is still not clear, whereas trapeziectomy with LRTI was a reliable procedure in this trial. Further comparative studies using implants with documented good long-term function and longer follow-up will be required to finally ascertain whether, or which, joint replacement is superior
Vermeulen 2014 | Trapeziectomy and LRTI, 21; Arthrodesis, 17 | 59 (6.3); 59 (6) | -; - | Arthrodesis (plate/screws) vs. T + LRTI (½FCR-APL-½FCR) | 1 year | I | 1. Patients with impaired function who failed to improve after nonsurgical treatment; 2. Who had stage-II or III primary osteoarthritis of the trapeziometacarpal joint according to the classification system of Eaton and Glickel | 1. Satisfaction; 2. Adverse events | Women who are forty years or older with trapeziometacarpal osteoarthritis have fewer moderate and severe complications after trapeziectomy with ligament reconstruction and tendon interposition and are more likely to consider the surgery again under the same circumstances than are those who undergo arthrodesis. Twelve months after surgery, the PRWHE and DASH scores were similar in both groups. We do not recommend routine use of arthrodesis with plate and screws in the treatment of women who are forty years or older with stage-II or III trapeziometacarpal osteoarthritis
Vermeulen 2014 (1) | Burton-Pellegrini technique, 36; Weilby technique, 36 | 64.7 (9.1); 63.5 (8.5) | -; - | Trapeziectomy + LRTI (½FCR-MT) vs. Trapeziectomy + LRTI (½FCR-APL-½FCR) | 1 year | I | 1. Women aged 40 years or older; 2. With stage IV osteoarthritis | 1. Pain; 2. Satisfaction; 3. Range of motion; 4. Physical function; 5. Adverse events | After the bone tunnel technique, patients have better function and less pain 3 months after surgery than do those in the non-bone tunnel group, which indicates faster recovery. However, 12 months after surgery, the functional outcome was similar. Because of faster recovery, we prefer the bone tunnel technique in the treatment of stage IV osteoarthritis

Eligibility criteria

Studies were included if they met the following criteria: (1) they were RCTs comparing surgical management of TMCJ OA; (2) they were conducted on human subjects; (3) they were written in the English language. There were no restrictions on the minimum number of cases or duration of follow-up. Studies were excluded if they were noncomparative, included other joints, or did not report the outcomes of interest. Animal studies, review articles, case reports, conference abstracts, non-English language studies, and duplicate references were excluded from the analysis.

Study selection

Titles and abstracts of studies identified during the search were imported into EndNote X20 for preliminary screening. Full texts of potentially relevant papers were then screened against the eligibility criteria. Two independent reviewers (IS and GB) performed the screening, and any disagreement between them in selecting eligible studies or assessing findings was resolved through consultation with the rest of the authors.
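The dual-reviewer step can be illustrated with a small sketch that flags the records on which two reviewers disagree, so that only those go to the wider author group for discussion. The record IDs and decisions below are made up for illustration, and `find_disagreements` is a hypothetical helper, not a tool used in the study.

```python
def find_disagreements(reviewer_a, reviewer_b):
    """Return sorted record IDs where two reviewers' decisions differ.

    Only records screened by both reviewers are compared.
    """
    shared = reviewer_a.keys() & reviewer_b.keys()
    return sorted(rid for rid in shared if reviewer_a[rid] != reviewer_b[rid])


# Made-up screening decisions, for illustration only.
reviewer_is = {"rec1": "include", "rec2": "exclude", "rec3": "include"}
reviewer_gb = {"rec1": "include", "rec2": "include", "rec3": "include"}

to_discuss = find_disagreements(reviewer_is, reviewer_gb)
print(to_discuss)
```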

Data extraction

Two independent authors (IS and GB) extracted data into an Excel spreadsheet with the following parameters: treatment modalities, age, gender, follow-up, level of evidence, inclusion criteria of studies, primary outcomes, and conclusion. A false positive analysis considered cases where AI included RCTs outside the scope of surgical management.

Risk of bias assessment

The methodological quality of each study was assessed using the Cochrane risk-of-bias (RoB) tool for randomized trials [Figure 1]. The RoB tool addresses the following biases: random sequence generation, bias due to deviations from intended interventions, bias due to incomplete outcome data, bias in the measurement of the outcome, and selective reporting. The items were assessed as “low risk”, “high risk”, or “some concerns”. We used the original RoB tool rather than the updated RoB 2 tool because our research team had extensive experience with the original tool, which supports consistent and accurate assessments, and because we wanted to maintain comparability with other systematic reviews in our area of research, which predominantly used the original tool. We acknowledge that the RoB 2 tool offers a more nuanced approach, particularly for assessing bias in subjective outcomes and open-label studies. However, our use of the original RoB tool may have resulted in slightly more conservative bias assessments. This conservative approach strengthens the reliability of our findings, as it is less likely to underestimate potential biases in the included studies.

Figure 1. Risk of bias of all included studies.

RESULTS

The manual search executed by human authors yielded 6,018 initial results, followed by 4,980 results from Elicit, 3,436 from Consensus, and only 6 from ChatGPT [Table 2]. Elicit found 205 RCTs, while the manual search found 63, Consensus returned 42, and ChatGPT identified one [Figures 2-4]. For prospective studies, the manual search yielded 1,852 results, followed by 1,123 from Elicit, 963 from Consensus, and one from ChatGPT. Elicit’s broader selection of RCTs stems from its indiscriminate inclusion of all studies discussing base of thumb arthritis, regardless of whether they compared surgical management strategies; its search focused largely on non-surgical management. Lastly, Elicit had the highest false positive rate at 94%, followed by Consensus at 76%, human researchers at 43%, and ChatGPT at 0% [Table 2].

Figure 2. PRISMA figure of Consensus platform search. PRISMA: Preferred reporting items for systematic reviews and meta-analyses.

Figure 3. PRISMA figure of Elicit platform search. PRISMA: Preferred reporting items for systematic reviews and meta-analyses.

Figure 4. PRISMA figure of manual search. PRISMA: Preferred reporting items for systematic reviews and meta-analyses.

Characteristics of included studies

A total of 23 RCTs[9-31] from all searches were eligible for inclusion in this study, as shown in Table 1 and Figure 4. The manual search identified all 23 studies, whereas Elicit found 10, Consensus 9, and ChatGPT only one. Every study found by the AI search engines was also found by the manual search, so the AI searches contributed no additional studies. By the end of the screening process, the manual search had excluded 5,995 papers, Elicit 4,970, Consensus 3,427, and ChatGPT 5.
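Treating the 23 studies from the manual search as the reference set, the share of eligible studies each method recovered can be worked out directly from the counts above. The sketch below is a simple illustration of that arithmetic; calling it "recall" and using the manual search as the reference are assumptions made here, not measures reported by the study.

```python
MANUAL_INCLUDED = 23  # reference set: eligible RCTs found by the manual search

# Included-study counts as reported in the text.
included = {"Manual": 23, "Elicit": 10, "Consensus": 9, "ChatGPT": 1}


def recall_pct(found, reference=MANUAL_INCLUDED):
    """Percentage of the reference set recovered, to one decimal place."""
    return round(100 * found / reference, 1)


recall = {name: recall_pct(n) for name, n in included.items()}
for name, pct in recall.items():
    print(f"{name}: {pct}% of eligible studies")
```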

Table 2

Comparison of artificial intelligence and human in literature search

Measure | Consensus | Elicit | ChatGPT | Human manually
Total search results | 3,436 | 4,980 | 6 | 6,018
Randomized controlled trials in search | 42 | 205 | 1 | 63
False positive randomized controlled trials, N (%) | 32 (76%) | 193 (94%) | 0 | 27 (43%)
Prospective studies in search | 963 | 1,123 | 1 | 1,852
Included studies | 9 (1-9) | 10 (1-3, 5, 8-13) | 1 (17) | 23 (1-23)
Excluded studies | 3,427 | 4,970 | 5 | 5,995
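The percentages and exclusion counts in this table follow from its other rows, and a short check makes the arithmetic explicit. The sketch assumes the false positive rate is the number of false positive RCTs divided by the number of RCTs in the search, and that excluded studies equal total results minus included studies; the function names are illustrative.

```python
# engine: (total results, RCTs in search, false positive RCTs, included studies)
rows = {
    "Consensus": (3436, 42, 32, 9),
    "Elicit": (4980, 205, 193, 10),
    "ChatGPT": (6, 1, 0, 1),
    "Manual": (6018, 63, 27, 23),
}


def false_positive_pct(rcts, false_positives):
    """False positive RCTs as a whole-number percentage of RCTs retrieved."""
    return round(100 * false_positives / rcts)


def excluded(total, included_count):
    """Records excluded during screening: everything not included."""
    return total - included_count


summary = {
    name: (false_positive_pct(rcts, fp), excluded(total, inc))
    for name, (total, rcts, fp, inc) in rows.items()
}
for name, (fp_pct, n_excluded) in summary.items():
    print(f"{name}: {fp_pct}% false positive RCTs, {n_excluded} excluded")
```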

Table 1 summarizes the characteristics of the included studies, including the application of intraoperative adjuvants, the specific muscles implicated, and the surgical approach adopted. In total, 1,335 procedures were performed across the 23 studies, 489 of which were trapeziectomies with ligament reconstruction tendon interposition (LRTI) [Table 1]. Participants were, on average, 49.83 years old and were followed up for an average duration of 3.31 years.

Comparison of AI search engines

Compared with manual search, Consensus and Elicit overlooked studies by Marks, Morais, Sanchez-Flo, and Thorkildsen, which exhibited evidence levels I, I, III, and I, respectively. Elicit displayed similar levels of omission, failing to include studies by Gerwin, Ritchie, and Salem at evidence levels II, I, and I, respectively. While ChatGPT struggled to locate most of the literature, it succeeded in identifying the study by Field et al., a level I evidence study that was neglected by the other AI search engines[14].

Number of studies included by each AI search engine

While Elicit and Consensus demonstrated analogous capacities for identifying studies with comparable levels of evidence, Elicit displayed superior capability for identifying a greater number of included studies (totaling 522 patients) compared to Consensus (totaling 438 patients).

Outcomes

Pain

Of the eighteen studies identified by manual searches evaluating pain as an outcome, seven were included by Consensus, and eight by Elicit. Consensus and Elicit found five of the same studies, while ChatGPT found none.

Physical function

Of the seventeen studies identified by manual searches that explored physical function as an outcome, eight each were found by Consensus and Elicit, although just five were common to both. ChatGPT identified none.

Adverse Events

Nineteen studies identified by the manual search reported adverse events as an outcome. Seven of these were found by Consensus, and nine by Elicit. Once more, five studies were common between Elicit and Consensus, while ChatGPT found none.

Quality of Life

Five studies identified by manual searches reported the quality of life as an outcome. None of these were included by Consensus, while Elicit identified four. ChatGPT identified none.

Satisfaction

Eleven studies identified on manual searches reported satisfaction as an outcome. Seven of these were found by Consensus and five were identified by Elicit. Five studies were common between Elicit and Consensus.

Range of movement

Two studies identified by manual searches reported range of motion as an outcome. Consensus and Elicit each identified one, but not the same one, and ChatGPT found one.
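The per-outcome counts above can be put side by side as coverage percentages relative to the manual search. The sketch below uses only the Consensus and Elicit counts reported in this section; expressing them as percentages of the manual total is an illustrative choice rather than an analysis performed in the study.

```python
# outcome: (found by manual search, found by Consensus, found by Elicit)
outcomes = {
    "Pain": (18, 7, 8),
    "Physical function": (17, 8, 8),
    "Adverse events": (19, 7, 9),
    "Quality of life": (5, 0, 4),
    "Satisfaction": (11, 7, 5),
    "Range of movement": (2, 1, 1),
}

# Whole-number percentage of manually identified studies each engine found.
coverage = {
    name: (round(100 * consensus / manual), round(100 * elicit / manual))
    for name, (manual, consensus, elicit) in outcomes.items()
}

for name, (consensus_pct, elicit_pct) in coverage.items():
    print(f"{name}: Consensus {consensus_pct}%, Elicit {elicit_pct}%")
```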

DISCUSSION

This case study is the first to explore the comparative performance between human-initiated and AI-initiated literature searches. These findings demonstrate AI platforms currently have poor proficiency for use in academia, especially ChatGPT, which performed poorly across all domains and outcomes. Although Elicit came the closest to mimicking human precision of the initial search, manual searches were far superior to all AI literature search engines in terms of the number of studies identified and their specificity to the subject of TMCJ OA. AI engines also overlooked studies extracted from the manual search and lacked precision in the subject of the search, evidenced by high false positive identification rates.

Interestingly, the average age of participants across the 23 included studies was 49.83 years, notably younger than the typical patient population seen in most CMC1 (first carpometacarpal joint) osteoarthritis publications. This relatively young cohort raises essential questions about the generalisability of the study results to the broader TMCJ OA population, which typically presents in older adults. Including younger patients may reflect a trend toward earlier surgical intervention, possibly due to increased awareness or changes in treatment paradigms. Further investigation is warranted to understand the implications of this age discrepancy for treatment outcomes and long-term prognosis for TMCJ OA patients.

Upon inspecting the number of relevant studies produced, Elicit was the most comprehensive AI search engine, albeit only surpassing Consensus by a single article. However, most RCTs identified by Elicit were tangential and addressed various topics beyond surgical management strategies. As this methodology has not previously been applied since the inception of large language models (LLMs), these findings cannot be contextualized against other studies. Despite the promise of AI to replicate laborious manual tasks, the results herein are disappointing and suggest that LLMs currently have no applicability in relieving the burdensome process of literature searching and screening. This study shows that LLMs could do a disservice to the scientific community by excluding publications typically deemed important and including irrelevant ones in initial searches. For Elicit, this misalignment with the topic led to a 94% false positive rate, compared to a human false positive rate of 43%. While Consensus returned roughly 1,500 fewer results than Elicit, it included nine studies to Elicit’s ten and had a lower false positive rate of 76%. Although Elicit identified the most publications overall among the AI tools, its search was the least precise and most inefficient of all AI search engines. No AI-driven engine identified any study that was not included in the human search, indicating that the human search was the most precise and had a very low false negative rate.

Concerning primary outcomes, Elicit was the only AI search engine to identify RCTs addressing all relevant primary outcomes. Consensus, by contrast, failed to uncover any studies focused on quality of life, although the AI search engines did identify more than one study discussing range of motion. ChatGPT performed worst, locating only one study, which addressed two of the six primary outcomes, and retrieving far fewer studies than either the authors’ manual searches or the other AI search engines[32,33]. Overall, AI search engines were inferior to manual searching, highlighting a shortcoming in their algorithms for sourcing comprehensive, high-quality literature relevant to the research topic[2]. The indiscriminate data retrieval by AI search engines in this study points to their potential to produce erroneous outputs, owing to a lack of precision and hierarchical structure during information gathering and organization[2]. These results suggest that such deficits could account for the erroneous or outdated responses sometimes reported in previous studies. Therefore, for AI to be a viable tool in academic literature searches, substantial improvements are needed in categorization, publication filtering, bias detection, database integration, and ethical data handling.

This study explores the use of AI for literature searches and highlights the significant improvements required before AI tools can feasibly be incorporated into literature searches for academic content. These improvements fall into a few broad categories. First and paramount is ensuring “reproducibility”, the cornerstone of academic research and literature searches, as exemplified by the dual-reviewer approach outlined in the PRISMA guidelines. Current AI tools fall short in accuracy and comprehensiveness. Additionally, users may ask AI search engines the same question multiple times and receive different answers informed by different sources[34,35,36]. To meet the reproducibility standard, a future AI system must be able to recognize and understand context, academic language, and abbreviations. Moreover, AI must develop a nuanced understanding of academic context, hierarchy, and the goals of a literature search to match or surpass human researchers in precision and thoroughness[37].
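One way to put a number on the reproducibility concern raised above is to repeat the same prompt several times and measure how much the returned result sets overlap. The sketch below uses the Jaccard index for that purpose. This is an illustrative choice and not a method used in this study, and the record identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two result sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_overlap(runs: list[set[str]]) -> float:
    """Average Jaccard overlap across every pair of repeated runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical record identifiers returned by three repeats of one prompt.
runs = [
    {"rec1", "rec2", "rec3"},
    {"rec1", "rec2", "rec4"},
    {"rec1", "rec2", "rec3"},
]
print(f"mean pairwise overlap: {mean_pairwise_overlap(runs):.2f}")
```

A fully reproducible search would score 1.0 on every pair of runs; lower values indicate that repeated queries return different sources.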

Secondly, AI should transcend simple keyword identification to gain a deeper semantic understanding of academic papers. This includes comprehending study objectives, methodologies employed, and resultant conclusions. Such advancements would enhance AI’s ability to categorize, filter, and rank search results based on criteria such as relevance, currency, citation frequency, and the publishing journal’s impact factor. Thirdly, improvements are needed to discern and neutralize potential biases, including those related to geographic location, authorship, or publication prestige, to ensure fair data representation. Given the dynamic nature of academia, with its continuous generation of novel knowledge and methodologies, AI platforms must be equipped for accessible, ongoing learning and enhancement.

In conclusion, this study found that AI tools such as Elicit, Consensus, and ChatGPT were less accurate and less comprehensive than human-initiated literature searches. These tools need to evolve beyond simple keyword identification toward a nuanced understanding of academic hierarchy and context. Integrating AI into academic literature searches therefore demands substantial improvements that fulfill the crucial reproducibility criterion and bring these tools in line with the rigorous standards of human-conducted research.

DECLARATIONS

Authors’ contributions

Conceptualization: Seth I, Lim B, Xie Y, Rozen WM

Methodology: Seth I, Ross R, Rozen WM

Data analysis: Seth I, Lim B, Xie Y

Manuscript writing: Seth I, Lim B, Xie Y, Ross RJ, Cuomo R, Rozen WM

Manuscript editing: Seth I, Lim B, Xie Y, Ross RJ, Cuomo R, Rozen WM

All authors edited and approved the final manuscript.

Availability of data and materials

Data supporting the findings of this manuscript are available from the corresponding author upon reasonable request.

Financial support and sponsorship

None.

Conflicts of interest

All authors declared that there are no conflicts of interest.

Ethical approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Copyright

© The Author(s) 2025.

Supplementary Materials

REFERENCES

1. Yuniarthe Y. Application of artificial intelligence (AI) in search engine optimization (SEO). IEEE. 2017:96-101.

2. Wagner G, Lukyanenko R, Paré G. Artificial intelligence and the conduct of literature reviews. J Inf Technol. 2022;37:209-26.

3. Kung J. Elicit (product review). J Can Health Libr Assoc. 2023;44:15.

4. Schiermeier Q. Pirate research-paper sites play hide-and-seek with publishers. Nature. 2015. Available from: https://www.nature.com/articles/nature.2015.18876. [Last accessed on 6 Jan 2025].

5. Ma J, Wu X, Huang L. The use of artificial intelligence in literature search and selection of the PubMed database. Sci Program. 2022. Available from: https://onlinelibrary.wiley.com/doi/10.1155/2022/8855307. [Last accessed on 6 Jan 2025].

6. Cinquini M, Rocco N, Catanuto G, et al. Should acellular dermal matrices be used for implant-based breast reconstruction after mastectomy? Plast Reconstr Surg Glob Open. 2023;11:e4821.

7. Bowers MR, Pulos N, Pulos BP, Shin AY. Opioid-sparing pain management in upper extremity surgery: part 2: surgeon as prescriber. J Hand Surg Am. 2019;44:878-82.

8. Higgins JPT, Thomas J, Chandler J, et al. Cochrane handbook for systematic reviews of interventions. John Wiley & Sons; 2019.

9. Belcher HJ, Nicholl JE. A comparison of trapeziectomy with and without ligament reconstruction and tendon interposition. J Hand Surg Br. 2000;25:350-6.

10. Belcher HJ, Zic R. Adverse effect of porcine collagen interposition after trapeziectomy: a comparative study. J Hand Surg Br. 2001;26:159-64.

11. Brennan A, Blackburn J, Thomson J, Field J. Simple trapeziectomy versus trapeziectomy with flexor carpi radialis suspension: a 17-year follow-up of a randomized blind trial. J Hand Surg Eur Vol. 2021;46:120-4.

12. Corain M, Zampieri N, Mugnai R, Adani R. Interposition arthroplasty versus hematoma and distraction for the treatment of osteoarthritis of the trapeziometacarpal joint. J Hand Surg Asian Pac Vol. 2016;21:85-91.

13. De Smet L, Sioen W, Spaepen D, van Ransbeeck H. Treatment of basal joint arthritis of the thumb: trapeziectomy with or without tendon interposition/ligament reconstruction. Hand Surg. 2004;9:5-9.

14. Field J, Buchanan D. To suspend or not to suspend: a randomised single blind trial of simple trapeziectomy versus trapeziectomy and flexor carpi radialis suspension. J Hand Surg Eur Vol. 2007;32:462-6.

15. Gangopadhyay S, McKenna H, Burke FD, Davis TR. Five- to 18-year follow-up for treatment of trapeziometacarpal osteoarthritis: a prospective comparison of excision, tendon interposition, and ligament reconstruction and tendon interposition. J Hand Surg Am. 2012;37:411-7.

16. Black BE, Griffin PP. The cerebral palsied hip. Clin Orthop Relat Res. 1997;338:42-51.

17. Hansen TB, Stilling M. Equally good fixation of cemented and uncemented cups in total trapeziometacarpal joint prostheses. A randomized clinical RSA study with 2-year follow-up. Acta Orthop. 2013;84:98-105.

18. Hart R, Janeček M, Šiška V, Kučera B, Štipčák V. Interposition suspension arthroplasty according to Epping versus arthrodesis for trapeziometacarpal osteoarthritis. Eur Surg. 2006;38:433-8.

19. Kriegs-Au G, Petje G, Fojtl E, Ganger R, Zachs I. Ligament reconstruction with or without tendon interposition to treat primary thumb carpometacarpal osteoarthritis. Surgical technique. J Bone Joint Surg Am. 2005;87 Suppl 1:78-85.

20. Marks M, Hensler S, Wehrli M, Scheibler AG, Schindele S, Herren DB. Trapeziectomy with suspension-interposition arthroplasty for thumb carpometacarpal osteoarthritis: a randomized controlled trial comparing the use of allograft versus flexor carpi radialis tendon. J Hand Surg Am. 2017;42:978-86.

21. Morais B, Botelho T, Marques N, et al. Trapeziectomy with suture-button suspensionplasty versus ligament reconstruction and tendon interposition: a randomized controlled trial. Hand Surg Rehabil. 2022;41:59-64.

22. Ritchie JF, Belcher HJ. A comparison of trapeziectomy via anterior and posterior approaches. J Hand Surg Eur Vol. 2008;33:137-43.

23. Salem H, Davis TR. Six year outcome excision of the trapezium for trapeziometacarpal joint osteoarthritis: is it improved by ligament reconstruction and temporary Kirschner wire insertion? J Hand Surg Eur Vol. 2012;37:211-9.

24. Salibi A, Hilliam R, Burke FD, Heras-Palou C. Prospective clinical trial comparing trapezial denervation with trapeziectomy for the surgical treatment of arthritis at the base of the thumb. J Surg Res. 2019;238:144-51.

25. Sánchez-Flò R, Fillat-Gomà F, Marcano-Fernández FA, Berenguer-Sánchez A, Balcells-Nolla P, Torner P. Partial versus total trapeziectomy with interposition arthroplasty for trapeziometacarpal osteoarthritis grade II to III Eaton-Littler: a clinical trial. J Hand Surg Glob Online. 2020;2:133-7.

26. Spekreijse KR, Selles RW, Kedilioglu MA, et al. Trapeziometacarpal arthrodesis or trapeziectomy with ligament reconstruction in primary trapeziometacarpal osteoarthritis: a 5-year follow-up. J Hand Surg Am. 2016;41:910-6.

27. Spekreijse KR, Vermeulen GM, Kedilioglu MA, et al. The effect of a bone tunnel during ligament reconstruction for trapeziometacarpal osteoarthritis: a 5-year follow-up. J Hand Surg Am. 2015;40:2214-22.

28. Tägil M, Kopylov P. Swanson versus APL arthroplasty in the treatment of osteoarthritis of the trapeziometacarpal joint: a prospective and randomized study in 26 patients. J Hand Surg Br. 2002;27:452-6.

29. Thorkildsen RD, Røkkum M. Trapeziectomy with LRTI or joint replacement for CMC1 arthritis, a randomised controlled trial. J Plast Surg Hand Surg. 2019;53:361-9.

30. Vermeulen GM, Brink SM, Slijper H, et al. Trapeziometacarpal arthrodesis or trapeziectomy with ligament reconstruction in primary trapeziometacarpal osteoarthritis: a randomized controlled trial. J Bone Joint Surg Am. 2014;96:726-33.

31. Vermeulen GM, Spekreijse KR, Slijper H, Feitz R, Hovius SE, Selles RW. Comparison of arthroplasties with or without bone tunnel creation for thumb basal joint arthritis: a randomized controlled trial. J Hand Surg Am. 2014;39:1692-8.

32. Xie Y, Seth I, Hunter-Smith DJ, Rozen WM, Ross R, Lee M. Aesthetic surgery advice and counseling from artificial intelligence: a rhinoplasty consultation with ChatGPT. Aesthetic Plast Surg. 2023;47:1985-93.

33. Seth I, Cox A, Xie Y, et al. Evaluating Chatbot efficacy for answering frequently asked questions in plastic surgery: a ChatGPT case study focused on breast augmentation. Aesthet Surg J. 2023;43:1126-35.

34. Journal CME Questions. J Hand Surg. 2023;48:699. Available from: https://www.sciencedirect.com/science/article/pii/S0363502323002617. [Last accessed on 26 Dec 2024].

35. Seth I, Lim B, Xie Y, Hunter-Smith DJ, Rozen WM. Exploring the role of artificial intelligence chatbot on the management of scaphoid fractures. J Hand Surg Eur Vol. 2023;48:814-8.

36. Seth I, Lim B, Xie Y, et al. Comparing the efficacy of large language models ChatGPT, BARD, and Bing AI in providing information on rhinoplasty: an observational study. Aesthet Surg J Open Forum. 2023;5:ojad084.

37. Shridharani SM, Dayan S, Biesman B, et al. Efficacy and safety of tapencarium (RZL-012) in submental fat reduction. Aesthet Surg J. 2023;43:NP797-806.

How to Cite

Seth, I.; Lim, B.; Xie, Y.; Ross, R. J.; Cuomo, R.; Rozen, W. M. Artificial intelligence versus human researcher performance for systematic literature searches: a study focusing on the surgical management of base of thumb arthritis. Plast. Aesthet. Res. 2025, 12, 1. http://dx.doi.org/10.20517/2347-9264.2024.99

About This Article

Special Issue

This article belongs to the Special Issue Artificial Intelligence in Plastic Surgery
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Plastic and Aesthetic Research
ISSN 2349-6150 (Online)   2347-9264 (Print)

Portico

All published articles are preserved here permanently:

https://www.portico.org/publishers/oae/
