Optimal Demand Response Using Device Based Reinforcement Learning

Zheng Wen, joint work with Hamid Reza Maei and Dan O'Neill
Department of Electrical Engineering, Stanford University
zhengwen@stanford.edu
June 12, 2012

Outline
1. Motivation
2. Device Based MDP Model
   - RL-EMS and Consumer Requests
   - Dis-utility Function and Performance Metric
   - Device Based MDP Model
3. RL-EMS Algorithm
4. Simulation Result
5. Conclusion

Motivation
- Demand response (DR) systems dynamically adjust electrical demand in response to changing electricity prices (or other grid signals).
- By suitably adjusting electricity prices, load can be shifted from peak periods to other periods.
- DR is a key component of the smart grid. It can potentially:
  - improve operational efficiency and capital efficiency,
  - reduce harmful emissions and the risk of outages,
  - better match energy demand to unforecasted changes in electrical energy generation.

Motivation (cont.)
- DR has been extensively investigated for larger energy users.
- Residential and small-building DR offers similar potential benefits.
- However, decision fatigue prevents residential consumers from responding to real-time electricity prices (see O'Neill et al. 2010): we can't stay at home every day, watching the real-time electricity price and deciding when to use each device.
- Fully automated Energy Management Systems (EMS) are therefore a necessary prerequisite to DR in residential and small-building settings.

Energy Management Systems
- An EMS makes the DR decisions for the consumer.
- Critical to a successful DR EMS approach is learning the consequences of shifting energy consumption on consumer satisfaction, cost, and future energy behavior.
- O'Neill et al. (2010) proposed a residential EMS algorithm based on reinforcement learning (RL), called the CAES algorithm.
- In this project, we extend the work of O'Neill et al. (2010) and propose a new RL-based energy management algorithm, called the RL-EMS algorithm.

RL-EMS System
(Figure: diagram of the RL-EMS system.)

What is New?
Compared with CAES, RL-EMS is new in the following aspects:
1. RL-EMS is allowed to perform speculative jobs.
2. A consumer request's target time can differ from its request time.
3. An unsatisfied consumer request can be cancelled by the consumer.
4. RL-EMS learns the consumer's dissatisfaction with completed jobs and with cancelled, delayed requests.
New insight: a device-centric view of the problem.

RL-EMS and Consumer Requests
Our proposed RL-EMS performs the following functions:
1. It receives requests from the consumer, and then schedules when to fulfill them (requested jobs).
2. If a device is idle, RL-EMS may speculatively power on that device (speculative job).
- Each consumer request is a four-tuple J = (requested device n, requested time τ_r, target time τ_g, priority g); see the sketch after this list.
- Jobs are standardized and can be completed in one time step.
- We require τ_r ≤ τ_g ≤ τ_r + W(n), where W(n) is a known device-dependent time window.
- At each time, there is at most one unsatisfied request per device.
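A minimal sketch of the request tuple as a data structure, assuming integer time steps and a per-device window W(n) supplied by the caller (the names here are ours, for illustration only):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConsumerRequest:
    """A consumer request J = (requested device n, requested time, target time, priority)."""
    device: int          # requested device n
    requested_time: int  # tau_r: when the request is made
    target_time: int     # tau_g: when the consumer wants the job done
    priority: int        # g: priority of the request

    def in_window(self, W_n: int) -> bool:
        """Check the constraint tau_r <= tau_g <= tau_r + W(n)."""
        return self.requested_time <= self.target_time <= self.requested_time + W_n

# Example: a request for device 2, made at t = 10, targeted at t = 13, priority 1.
req = ConsumerRequest(device=2, requested_time=10, target_time=13, priority=1)
assert req.in_window(W_n=4)  # 10 <= 13 <= 14 holds
```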

Consumer Preference and Dis-utility Function
- In economics, a utility function is used to model the preferences of consumers.
- In this project, we work with the negative of the utility function, called the dis-utility function, which models the dissatisfaction of the consumer.
- Assumption 1: the dis-utility of the consumer at time t equals the sum of their evaluations of the jobs completed or cancelled at time t, i.e., dis-utility is additive over jobs.
- At each time, at most one job is completed or cancelled at a device; hence dis-utility is also additive over devices (see the formula below).
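In symbols (notation ours), Assumption 1 plus the one-job-per-device observation gives

\[
D_t \;=\; \sum_{J \text{ completed or cancelled at } t} d(J) \;=\; \sum_{n=1}^{N} d_n(t),
\]

where d(J) is the consumer's evaluation of job J and d_n(t) is the dis-utility contributed by device n at time t.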

Performance Metric
- We assume that the instantaneous cost at time t is the electricity bill paid at time t plus the dis-utility at time t.
- RL-EMS aims to minimize the expected infinite-horizon discounted cost (written out below).
- The instantaneous cost is additive over devices.
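Written out (notation ours, with discount factor α, electricity bill B_t, and dis-utility D_t), the objective is

\[
\min_{\mu} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \alpha^{t} \big( B_t + D_t \big) \right],
\]

and since B_t and D_t are both sums over devices, so is the instantaneous cost Φ_t = B_t + D_t.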

Device Based MDP Model
- Assumption 2: both the electricity price and the consumer requests to the RL-EMS follow exogenous Markov chains. Furthermore, we assume that:
  - the electricity-price process is independent of the consumer-requests process, and
  - consumer requests to different devices are independent.
- Under Assumptions 1 and 2, RL-EMS faces an infinite-horizon discounted MDP, and this MDP decomposes over devices (see below).
- Hence, we can derive an optimal scheduling policy for a single device by solving the associated device-based MDP.
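One way to state the decomposition (notation ours): with additive costs and independent per-device request processes, the optimal Q-function separates as

\[
Q^{*}(s, a) \;=\; \sum_{n=1}^{N} Q_n^{*}(s_n, a_n),
\]

where s_n stacks the electricity price with device n's local state, so each Q_n^* can be obtained from its own device-based MDP.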

Probability Transition Model
- The electricity price follows an exogenous Markov chain.
- The timeline for a device can be divided into episodes: whenever a device completes a job, or an unsatisfied request is cancelled, the current episode terminates.
- At the next time step, the device regenerates its state according to a fixed distribution.

Probability Transition Model (Cont...) ] S = P max [(2W + 1)g max + Ŵ + 1 Wen, Maei and O Neill (Stanford) Device Based Demand Response June 12, 2012 14 / 23

DP Solution and Motivation for RL
- If RL-EMS knew (1) the transition model of the device-based MDP and (2) the dis-utility function of the consumer, then the optimal scheduling strategy could be derived by finite-horizon dynamic programming (DP); in fact, within each episode it is an optimal stopping problem (one way to write the recursion is shown below).
- In practice, however, RL-EMS needs to learn the transition model and the dis-utility function while interacting with the consumer and the electricity price.
- RL is the natural approach in this case.
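One way to write the per-episode stopping recursion (notation ours; c_run and c_wait denote the immediate costs of running the job now versus waiting one step):

\[
V_t(s) \;=\; \min\Big\{ c_{\mathrm{run}}(s),\;\; c_{\mathrm{wait}}(s) + \alpha\, \mathbb{E}\big[ V_{t+1}(s') \,\big|\, s \big] \Big\}.
\]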

RL-EMS Algorithm
- In the classical RL literature, the environment consists of an unknown MDP, and an agent learns how to make decisions while interacting with the environment.
- Here, the agent is the RL-EMS, and the environment includes both the electricity price and the consumer.
- In this project, we use the Q(λ) algorithm, which combines classical Q-learning with (1) eligibility traces and (2) importance sampling.

Q(λ) Algorithm
Initialize Q_0 arbitrarily; set the eligibility parameter λ ∈ [0, 1].
Repeat for each episode:
  Choose a small constant step size β > 0 for the episode, and initialize the eligibility-trace vector e_{t−1} = 0.
  For each time step t of the episode:
    Take a(t) at x(t) according to the behavior policy µ_b (e.g. an ε-softmin policy), arriving at x(t + 1).
    Observe the sample (x(t), a(t), x(t + 1), Φ_t), where Φ_t is the instantaneous cost.
    δ_t := Φ(x(t), a(t), x(t + 1)) + α min_{a′} Q_t(x(t + 1), a′) − Q_t(x(t), a(t)).
    If a(t) ∈ argmin_a Q_t(x(t), a), then ρ_t ← 1 / µ_b(a(t) | x(t)); otherwise ρ_t ← 0.
    e_t = ψ_t + ρ_t α λ e_{t−1}, where ψ_t is a binary vector whose only nonzero element is at (x(t), a(t)).
    Q_{t+1} ← Q_t + β δ_t e_t.
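A minimal tabular sketch of this update, assuming a generic episodic environment with env.reset() -> x and env.step(a) -> (x_next, cost, done); the softmin temperature and the hyperparameter defaults are illustrative stand-ins, not values from the talk:

```python
import numpy as np

def softmin_probs(q_row, temp=1.0):
    """Behavior policy mu_b: softmin over Q-values (lower cost -> more likely)."""
    z = np.exp(-(q_row - q_row.min()) / temp)
    return z / z.sum()

def q_lambda(env, n_states, n_actions, n_episodes,
             alpha=0.95, lam=0.8, beta=0.05, temp=1.0, rng=None):
    """Tabular Q(lambda) for cost minimization, following the slide's update rules."""
    rng = np.random.default_rng() if rng is None else rng
    Q = np.zeros((n_states, n_actions))            # Q_0 initialized arbitrarily (here: zeros)
    for _ in range(n_episodes):
        e = np.zeros_like(Q)                       # eligibility trace e_{t-1} = 0
        x, done = env.reset(), False
        while not done:
            probs = softmin_probs(Q[x], temp)      # behavior policy mu_b
            a = rng.choice(n_actions, p=probs)
            x_next, cost, done = env.step(a)       # observe (x(t), a(t), x(t+1), Phi_t)
            # TD error: delta_t = Phi + alpha * min_a' Q(x', a') - Q(x, a)
            target = cost if done else cost + alpha * Q[x_next].min()
            delta = target - Q[x, a]
            # Importance-sampling factor: 1/mu_b if a(t) was greedy, else cut the trace
            rho = 1.0 / probs[a] if Q[x, a] == Q[x].min() else 0.0
            e *= rho * alpha * lam                 # e_t = psi_t + rho_t * alpha * lam * e_{t-1}
            e[x, a] += 1.0                         # psi_t: indicator of (x(t), a(t))
            Q += beta * delta * e                  # Q_{t+1} = Q_t + beta * delta_t * e_t
            x = x_next
    return Q
```

Note that the trace is cut (ρ_t = 0) whenever the behavior policy takes a non-greedy action, which is what keeps this off-policy update sound.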

Simulation Result
- A toy example: P_max = 4, g_max = 2, W = 4, Ŵ = 5, so |S| = 96.
- Assume that at the beginning of each episode, the device portion of the regenerated state is fixed (i.e., (0, 0)).
- For each scheduling policy µ, its performance is
  V(µ) = E_{P ∼ π_P} [ min_{a ∈ A} Q^µ([P, 0, 0]^T, a) ],
  and we normalize the performance with respect to the optimal performance.
- We run the proposed RL-EMS algorithm for 8,000 episodes, and repeat the simulation 100 times.
- Baseline: the default policy without DR.

Simulation Result (Cont...) 2.4 2.2 Normalized Performance of Q t 2 1.8 1.6 1.4 1.2 Performance of Q t Optimal Performance Baseline Performance 1 0.8 0.6 0 1000 2000 3000 4000 5000 6000 7000 8000 Episode Reduce the consumer s cost by 56% Wen, Maei and O Neill (Stanford) Device Based Demand Response June 12, 2012 21 / 23

Conclusion
- We present an RL approach (the RL-EMS algorithm) to DR for residential and small commercial buildings.
- RL-EMS does not require prior knowledge of models of the electricity price, consumer behavior, or consumer dis-utility.
- RL-EMS learns to make optimal decisions by rescheduling requested jobs, anticipating future uses of devices, and performing speculative jobs.
- Future work: RL algorithms with better sample efficiency.