A Semi-automated Peer-review System

Bradly Alicea (bradly.alicea@ieee.org)
Orthogonal Research

Abstract

A semi-supervised model of peer review is introduced that is intended to overcome the bias and incompleteness of traditional peer review. Traditional approaches are subject to human biases, while consensus decision-making is constrained by sparse information. Here, the architecture for one potential improvement to the traditional approach (a semi-supervised, human-assisted classifier) will be introduced and evaluated. To evaluate the potential advantages of such a system, hypothetical receiver operating characteristic (ROC) curves for both approaches will be assessed. This will provide more specific indications of how automation would be beneficial in the manuscript evaluation process. In conclusion, the implications of such a system for measurements of scientific impact and for improving the quality of open submission repositories will be discussed.

Introduction

What makes for a good academic paper? The answer is often more subjective than we like to think. One common answer involves a peer-reviewed consensus. But peer review can often be subjective, and yields results that do not lend themselves easily to consensus [1]. Perhaps a high degree of selectivity can provide a benchmark for quality. Yet even highly-selective publication venues can suffer from retractions and uneven quality. Part of the problem with the standard model of peer review (a small number of blinded reviews coordinated by an editor) is that the entire process represents a sparse sampling of the subject space. To compensate for this, manuscripts with multiple novel features [2] will often be viewed in an over-critical manner, perhaps even being discarded as fraudulent.

How can the vagaries of opinion, aversion to novelty, and bias for selectivity be overcome? This issue has become more pressing with the rise of open-access publication [3]. One way is to make rear-guard improvements to the existing peer-review system. Another option is to build an automated system that filters out fraudulent papers and papers of low quality. Superficially, this system would function like a human-assisted fraud detection system [4], and would provide a uniform set of decisions given a set of assumptions.

Model Assumptions

There are several assumptions that go into the semi-automation model. The first assumption involves a natural fraud rate. This is the frequency at which intentionally fraudulent papers are encountered. The second assumption involves an acceptance rate (or rejection rate) that may overlap with the natural fraud rate. The third assumption is that the true number of non-fraudulent manuscripts can be better estimated by establishing a degree of novelty. These assumptions can be combined to estimate false positive and false negative rates for specific datasets (see Table 1).
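As a rough illustration of how these assumptions interact, the short sketch below derives the cell values of Table 1 from the natural fraud rate and the acceptance rate. The 5% share of fraudulent papers that slip through review is taken from the Table 1 example and is otherwise a free parameter; the function name is hypothetical and does not come from the paper's repository.

```python
# Minimal sketch (not from the paper's repository): deriving the Table 1 cell
# values from the model assumptions. accepted_fraud is the share of all
# submissions that are both fraudulent and accepted (5% in Table 1).

def review_rates(fraud_rate, acceptance_rate, accepted_fraud):
    """Return TP, FN, FP, TN proportions for a pool of submissions."""
    legitimate = 1.0 - fraud_rate
    fp = accepted_fraud                # fraudulent papers that are accepted
    tn = fraud_rate - fp               # fraudulent papers that are rejected
    tp = acceptance_rate - fp          # legitimate papers that are accepted
    fn = legitimate - tp               # legitimate papers that are rejected
    return {k: round(v, 4) for k, v in
            {"TP": tp, "FN": fn, "FP": fp, "TN": tn}.items()}

print(review_rates(fraud_rate=0.10, acceptance_rate=0.20, accepted_fraud=0.05))
# {'TP': 0.15, 'FN': 0.75, 'FP': 0.05, 'TN': 0.05} -- the cells of Table 1
```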

Table 1. Example of false positives (FPs) and false negatives (FNs) given a natural fraud rate of 10% and an acceptance rate of 20%.

                     Accept (20%)    Reject (80%)
Legitimate (90%)     TP (15%)        FN (75%)
Fraud (10%)          FP (5%)         TN (5%)

Design Features

The design of the automated peer-review system utilizes an adaptive filtering architecture (see Figure 1). This consists of a classifier algorithm, a discriminability measure, and ROC curves that characterize the performance of both existing and automated systems with respect to how different strategies eliminate fraud or (in some cases) hamper innovation in the literature. Thus, while the goal is not to build a fraud-proof system for evaluating manuscripts, an automated peer-review system can help us understand how to eliminate human biases and more efficiently evaluate submissions to the scientific literature.

Figure 1. Process of semi-automated decision-making that utilizes an adaptive filter.

System Design Features

There are three design features included in this automated system. The first involves excluding all fraudulent work by discovering the natural rate of fraudulent submissions. This includes plagiarized works and works that contain errors in analysis or basic logic. For example, papers might be flagged for graphical and analytical errors (e.g. statistical artifacts).

The second is to ensure the inclusion of novelty, particularly in cases where it may be confused with low quality or fraudulent work. The third feature is to approach the paper being reviewed from multiple points of view. This consists of rule systems which are intended to transcend specific disciplinary (or experiential) biases, but which also overlap in a way that allows a consensus to be reached.

In Figure 1, the d′ (discriminability) and β (bias) parameters serve as a filter and an adaptive calibration mechanism, respectively. The d′ parameter is the ability of a classifier (a set of minimal quality and anti-fraud criteria) to minimize false positives and false negatives with respect to random expectation. The β parameter is a shift in the decision-making criterion, which results in either a more permissive or a more selective strategy. A given set of decisions, together with adjustment of the evaluation criterion, can be used to adjust the discriminability value during the next round of evaluations.

ROC Curves

To characterize the effects of selectivity and evaluation criteria (the d′ parameter and its associated feedback loop), we can use ROC curves. Figure 2 demonstrates the outcomes for both a traditional peer-review system and a semi-automated review system. The ROC curve characterizes general tendencies for different strategies, in this case the relative degree of selectivity for a prestigious journal, a field-specific journal, and an open-submission repository. Figure 2 provides a schematic view of performance for traditional vs. semi-automated peer review using the same criteria. In terms of discriminability, the semi-automated approach is expected to outperform traditional peer review. This should especially be true of papers with a high degree of novelty. In turn, this should allow for a higher acceptance rate at a lower cost in terms of quality.

Set of Minimal Criteria

The parameter value for d′ is determined by the performance of a classifier. The classifier is a set of minimal criteria used to classify a given manuscript as fraudulent, of below-threshold quality, or acceptable but needing work. This can be done in two ways. The first is a rule-based sort, which sorts papers by their location in an n-level contingency tree. The other is to generate an n-dimensional score using the same criteria, and then apply a discriminant classifier to the manuscript scores. All manuscripts classified as accept or reject are then double-checked by human experts using the same criteria. This human expert supervision step provides the final classifications of true positive, false positive, true negative, and false negative.

Discussion

One of the things an ROC framework reveals is the cost of rooting out fraud using naive (non-automated) techniques. One such naive strategy is to impose greater selectivity. While this has a heuristic benefit, it also serves as a de facto conservative bias against all manuscripts that do not meet a rigid criterion. This is because selectivity favors well-established people and techniques. The costs of this homogenization process are both social and intellectual. By contrast, automation allows us to distinguish between novelty and a lack of quality (e.g. fraud). See the Methods section for more details.
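To make the classifier step concrete, the sketch below scores a manuscript against a small, hypothetical set of minimal criteria, turns the result into a binary string and a composite score, and queues accept/reject decisions for human double-checking. The criterion names and thresholds are illustrative assumptions, not values taken from this paper.

```python
# Hypothetical sketch of the "set of minimal criteria" classifier: each rule
# returns True/False, the ruleset yields a binary string and a composite
# score, and accept/reject decisions are queued for human double-checking.

CRITERIA = [
    "methods_sound",
    "reasoning_valid",
    "passes_plagiarism_check",
    "citations_appropriate",
    "analysis_free_of_artifacts",
]

def score_manuscript(ruling):
    """ruling maps each criterion name to True/False."""
    bits = "".join("1" if ruling[c] else "0" for c in CRITERIA)
    score = sum(ruling[c] for c in CRITERIA) / len(CRITERIA)
    return bits, score

def classify(ruling, accept_at=0.8, reject_at=0.4):
    bits, score = score_manuscript(ruling)
    if score >= accept_at:
        decision = "accept"
    elif score <= reject_at:
        decision = "reject"            # candidate fraud or below-threshold quality
    else:
        decision = "acceptable but needing work"
    return {"bits": bits, "score": score, "decision": decision,
            "needs_human_check": decision in ("accept", "reject")}

example = {c: True for c in CRITERIA}
example["analysis_free_of_artifacts"] = False   # e.g. a statistical artifact
print(classify(example))
```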

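The feedback loop of Figure 1 can likewise be sketched as a round-by-round criterion adjustment. Everything below (the score distributions, the target false positive rate, and the proportional update rule) is a stand-in assumption for the β adjustment described above, not the paper's implementation.

```python
# Illustrative sketch of the adaptive calibration loop in Figure 1: score a
# batch of manuscripts, accept those above the criterion, treat the human
# check as ground truth, then shift the criterion toward a target false
# positive rate. Score distributions and the update rule are assumptions.

import random

def run_round(n, fraud_rate, criterion, rng):
    """One round of review over n manuscripts; returns (TP, FP, FN, TN)."""
    tp = fp = fn = tn = 0
    for _ in range(n):
        legitimate = rng.random() > fraud_rate
        # Pseudo-data: legitimate papers tend to score higher than fraudulent ones.
        score = rng.gauss(0.70 if legitimate else 0.35, 0.12)
        if score > criterion:
            tp, fp = (tp + 1, fp) if legitimate else (tp, fp + 1)
        else:
            fn, tn = (fn + 1, tn) if legitimate else (fn, tn + 1)
    return tp, fp, fn, tn

rng = random.Random(42)
criterion, target_fp_rate = 0.50, 0.02
for round_no in range(1, 6):
    tp, fp, fn, tn = run_round(500, fraud_rate=0.10, criterion=criterion, rng=rng)
    # Too many accepted frauds -> raise the criterion (conservative shift);
    # too few -> lower it (permissive shift). Stands in for the beta adjustment.
    criterion += 0.5 * (fp / 500 - target_fp_rate)
    print(f"round {round_no}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"next criterion={criterion:.3f}")
```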
Figure 2. Example of ROC curves for three datasets. Curves are generated using pseudo-data, and represent general expected tendencies for traditional (red) and semi-automated (blue) manuscript evaluation.

One existing alternative to traditional peer review is the Altmetric approach [5]. With Altmetrics, articles can be assessed using the power of crowdsourcing. However, this still necessitates both good faith and the ability to detect fraud among the people evaluating the manuscript. Altmetrics also introduces the concept of an article's potential for influence (e.g. relative impact) [6], which is an opportunity for future development of the proposed automated peer-review system. Indeed, an Altmetric approach would be quite suitable as the human training component of the proposed system.

Even in the case of open submission repositories, there are criteria for suitability. The arXiv repository [7] uses a volunteer-run committee of subject experts to manually evaluate manuscripts. The criteria for minimal selectivity and bias involve a general assessment of an article's scientific interest, relevance, and value. Fraud detection and quality control include the identification of duplicate work, an excessive personal publication rate, and copyright violations.

Open submission is another area in which an automated system could help keep repositories free of low-quality and/or fraudulent papers. Yet for all applications, this semi-automated system provides an opportunity to review manuscripts in a transparent, informed, and unbiased manner.

Methods

Discriminability measure

Discriminability is measured by modeling the signal (classifier performance) and noise (random expectation) as overlapping Gaussian probability distributions. Discriminability (d′) is calculated from the separation (overlap) of the two probability distributions

d′ = [μ(f(s)) − μ(f(n))] / σ     [1]

where f(s) is the distribution that describes the signal, f(n) is the distribution that describes noise, and σ is their common standard deviation. Discriminability determines the overall performance of the specific evaluation. In an adaptive context, d′ serves to select the overall acceptance rate. This measurement can also be used to compare performance against conventional peer-review systems.

ROC curves

ROC curves characterize all possible true positive (hit) and false positive values for a specific d′ value. Bias (β) is characterized by [8] as a shift in the acceptance criterion

β = f(s = B) / f(n = B)     [2]

β = exp{d′ × C}, where C > 0 is conservative and C < 0 is permissive     [3]

where B is the threshold criterion (e.g. an a priori acceptance rate) for an acceptance, and C characterizes changes in strategy (e.g. highly-selective vs. open-submission). β can also serve as a likelihood ratio defined in terms of the parameters d′ and B [8]. Over time, the value of β will converge upon one that maximizes the number of acceptances while minimizing the number of false positives (unacceptable manuscripts).

Degree of Novelty

The degree of novelty is the information content of the ruleset states (for the n-level contingency tree) or scores (for the discriminant classifier) covering the methods and conventionality-of-approach portions of the rulesets

N = −Σ_j Σ_i p_i log2 p_i     [4]

where p_i is the probability of a given score value or state (i) for each rule (j_1, j_2, ..., j_n). To make the final decision, the degree of novelty will be used to weight the final classifications for each manuscript. Cases of moderate to high degrees of novelty will serve as a flag for the human supervisor to further investigate the paper's merits.
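A minimal sketch of these quantities, assuming Gaussian signal and noise distributions and using hypothetical helper names (this is not the repository code): d′ following Equation 1, β following Equation 3, and the degree of novelty as the summed Shannon information of Equation 4.

```python
# Sketch (hypothetical helper names, not the repository code) of the Methods
# quantities: d-prime from the signal/noise Gaussians (Eq. 1), bias beta from
# d-prime and the criterion C (Eq. 3), and the degree of novelty as summed
# Shannon information over the per-rule state probabilities (Eq. 4).

import math
from collections import Counter

def d_prime(mu_signal, mu_noise, sigma):
    """Separation of the signal and noise distributions, in sigma units."""
    return (mu_signal - mu_noise) / sigma

def beta(dprime, c):
    """Bias: C > 0 is a conservative strategy, C < 0 a permissive one."""
    return math.exp(dprime * c)

def novelty(rule_states):
    """rule_states: one list of observed states/scores per rule (j_1 ... j_n)."""
    total = 0.0
    for states in rule_states:
        counts = Counter(states)
        n = len(states)
        total -= sum((k / n) * math.log2(k / n) for k in counts.values())
    return total

print(round(d_prime(0.70, 0.35, 0.12), 2))                          # ~2.92
print(round(beta(2.92, 0.5), 2))                                    # conservative criterion
print(round(novelty([["a", "a", "b"], ["x", "y", "y", "y"]]), 2))   # bits
```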

Classifiers

The rulesets consist of rules that cover the following components of a manuscript: methods, reasoning, plagiarism checks, references and self-citation, conventionality of approach, and graphical consistency/analytical artifacts. The rulesets can either result in a binary string or a composite score, and can include as many rules in each category as deemed appropriate for the use case. All code and pseudo-code are located on GitHub (http://www.github.com/balicea/semi-auto-peerreview).

References

[1] Alicea, B. The Novelty-Consensus Dampening. Synthetic Daisies blog, October 22 (2013). http://syntheticdaisies.blogspot.com/2013/10/fireside-science-consensus-novelty.html

[2] Uzzi, B., Mukherjee, S., Stringer, M., and Jones, B. Atypical Combinations and Scientific Impact. Science, 342, 468-472 (2013).

[3] Bohannon, J. Who's Afraid of Peer Review? Science, 342, 60-65 (2013).

[4] Bolton, R.J. and Hand, D.J. Statistical Fraud Detection: a review. Statistical Science, 17(3), 235-255 (2002).

[5] Priem, J., Groth, P., and Taraborelli, D. The Altmetrics Collection. PLoS One, 7(11), e48753 (2012).

[6] Bollen, J., Van de Sompel, H., Hagberg, A., and Chute, R. A principal component analysis of 39 scientific impact measures. PLoS One, 4(6), e6022 (2009).

[7] arXiv.org help: the arXiv moderation system. Cornell University Library. http://arxiv.org/help/moderation

[8] Abdi, H. Signal detection theory (SDT). In: International Encyclopedia of Education, 3rd Edition. P.L. Peterson, E. Baker, and B. McGaw, eds. Elsevier, New York (2010).