Tesla Virtual Power Plant

Published: 2020-03-29 16:00:00

Transcript

The electric grid is the largest and most complex machine ever built. It's an amazing feat of engineering providing reliable, safe and on-demand power.

This grid is built on 20th century technology with large centralized generation, mostly fossil fuel based and only a few points of control. We now face an urgent challenge to transition off of fossil fuels in order to prevent the worst effects of climate change.

Fortunately, we also now have new tools: clean power generation like wind, solar and hydro is cheap and getting cheaper. But this hardware is not on its own enough to replace fossil fuels while maintaining our current standard of on-demand and reliable power.

Software is really the key to enabling these diverse components to act in concert. And one of the things we can do is bring together thousands of small batteries in people's homes to create virtual power plants, providing value to both the electrical grid as well as to the home or business owner.

This marries some of the most interesting and challenging problems in distributed computing with some of the most important and challenging problems in distributed renewable energy. And this is why we work at Tesla. We get to work on these exciting software challenges while also accelerating the world's transition to renewable energy.

We're going to take you through the evolution of the Tesla virtual power plant and share architectures, patterns and practices for distributed computing and IoT that have helped us tackle these complex and exciting challenges. I'm Percy. I'm a software engineer and technical lead on the team that builds Tesla's energy optimization and market participation platform.

And I'm Colin. I'm a software engineer and I lead the teams that build and operate the cloud IoT platforms for Tesla energy products. And just a disclaimer, before we start, we do not speak on behalf of Tesla. We are just representing our personal experiences.

So before we dig into the software, let's cover some background on how the grid works and on the role that batteries play in it so that you're set up to appreciate the software problems. The tricky thing about the power grid is that supply and demand have to match in real time or else frequency and voltage can deviate and this can damage devices and lead to blackouts.

The grid itself has no ability to store power, so the incoming power supply and outgoing power consumption need to be controlled in a way that maintains the balance. With old-style centralized fossil fuel generation, supply could be turned up and down according to demand, and there was a relatively small number of plants to control, which made it relatively straightforward to maintain the balance.

As more renewable generation comes onto the grid, a few things happen. First, reduced control: renewable generation can't be as easily turned up to follow demand, and we don't want to turn it down or else we're losing some of our clean energy. Second, uncertainty and rapid change.

Generation can't be forecast precisely and it can change quickly. And third, distribution: there are many small generators behaving independently. So in a grid with large amounts of wind and solar generation, the supply might look something like this, with variability in supply and with times of high supply not aligned with times of high demand. This can result in power surpluses and deficits that are larger than before and that can change fairly rapidly.

Batteries can charge during the surpluses and discharge during the deficits and they can respond very quickly to offset any rapid swings in imbalance. And this rapid response is actually even an innovation, an opportunity to be better than the old grid. It's not just a compromise.
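
To make the charge-on-surplus, discharge-on-deficit idea concrete, here is a minimal Python sketch. It is not anything from Tesla's platform: the function name `dispatch` and its parameters are assumptions for illustration, and it ignores losses, forecasting, and market prices.

```python
def dispatch(surplus_mw, capacity_mwh, max_power_mw, hours=1.0):
    """Return battery power per interval: positive = charging, negative = discharging.

    surplus_mw holds supply minus demand for each interval (hypothetical input).
    The battery starts empty and is limited by its power rating and stored energy.
    """
    energy_mwh = 0.0
    schedule = []
    for surplus in surplus_mw:
        if surplus >= 0:
            # Absorb the surplus, limited by power rating and remaining headroom.
            power = min(surplus, max_power_mw, (capacity_mwh - energy_mwh) / hours)
        else:
            # Cover the deficit, limited by power rating and stored energy.
            power = -min(-surplus, max_power_mw, energy_mwh / hours)
        energy_mwh += power * hours
        schedule.append(power)
    return schedule


if __name__ == "__main__":
    print(dispatch([5, 5, -3, -10], capacity_mwh=8, max_power_mw=4))
```

In this toy run the battery cannot absorb all of the surplus or cover all of the final deficit, which is the gap that aggregating many batteries is meant to narrow.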

And so to fulfill this role, we could just install giant batteries, batteries the size of a typical coal or natural gas power plant, and they can and do play an important part in the equation. But we can also take advantage of smaller batteries already installed in individual homes that are already providing local value like backup power or helping the owner consume more of their own solar generation. We can aggregate homes and businesses with these smaller batteries and solar into virtual power plants.

So in this presentation, we'll walk you through the evolution of the Tesla energy platform for virtual power plants. It's broken into four sections, with each stage laying the foundation for the next. We'll start with the development of the Tesla energy platform. Then we'll describe how we learned to participate in energy markets and how we learned to build software to do this algorithmically using a single battery, the largest battery in the world.

Then we'll talk about our first virtual power plant, where we learned to aggregate and directly control thousands of batteries in people's homes in near real time. And finally, we'll talk about how we combine all of these platforms and experiences to aggregate, optimize, and control thousands of batteries for energy market participation.

So let's begin with the architecture of the Tesla energy platform. This platform was built for both residential and industrial customers. For residential customers, the platform supports products like the Powerwall home battery, which can provide backup power for a house for hours or days in the event of a power outage; Solar Roof, which produces power from beautiful roofing tiles; and retrofit solar. The solar products can be paired with Powerwall not only to provide backup power but also to maximize solar energy production. And we use software to deliver an integrated product experience across solar generation, energy storage, backup power, transportation, and vehicle charging, as well as to create unique products like Storm Watch, where we will charge your Powerwall to full when alerted to an approaching storm so that you have full backup power if the power goes out. Part of the customer experience is viewing the real-time performance of the system in the mobile app, and customers can control some behaviors, such as prioritizing charging during low-cost times.

For industrial customers, the software platform supports products like Powerpack and Megapack for large-scale energy storage, as well as industrial-scale solar. Software platforms like Powerhub allow customers to monitor the performance of their systems in real time, or inspect historical performance over days, weeks, or even years.

Now these products for solar generation, energy storage, transportation, and charging all have an edge computing platform. Zooming in on that edge computing platform for energy, it's used to interface with a diverse set of sensors and controllers, things like inverters, bus controllers, and power stages. It runs a full Linux operating system and provides local data storage, computation, and control, while also maintaining bidirectional streaming communication with the cloud over WebSocket, so that it can regularly send measurements to the cloud, for some applications as frequently as once a second. And it can also be commanded on demand from the cloud.

Now we'll mention a few things throughout the presentation about this edge computing platform, but our main focus is going to be on that cloud IoT platform. And the foundation of this platform is this linearly scalable WebSocket front end that handles connectivity as well as security. It has a Kafka cluster behind it for ingesting large volumes of telemetry for millions of IoT devices. And this provides messaging durability, decouples publishers of data from consumers of data, and it allows for sharing this telemetry across many downstream services.

The platform also has a service for publish-subscribe messaging, enabling bidirectional command and control of IoT devices. And these three services together are offered as a shared infrastructure throughout Tesla on which we build higher-order services. On the other side of the equation are these customer-facing applications supporting the products that I just highlighted.

The APIs for energy products are organized broadly into three domains. The first are APIs for querying telemetry, alerts, and events from devices, or streaming these as they happen. Second are APIs for describing energy assets and the relationships among these assets. And lastly, APIs for commanding and controlling energy devices like batteries.

Now the backing services for these APIs are composed of approximately 150 polyglot microservices, far too many to detail in this presentation. I'll just provide a high-level understanding of the microservices in each domain. And we're going to dive a bit deeper into a few of them later when we look at the virtual power plant. And a theme you'll see throughout is the challenge of handling real-time data at IoT scale.

Imagine a battery installed in everybody's home. To support efficient queries and rollups of telemetry, queries like "what's the power output over the past day or week?", we use InfluxDB, which is an open-source, purpose-built time series database. It depends on the data stream and the application, but generally our goal is to make historical data available to the customer for the lifetime of the product. We maintain a large number of low-latency streaming services for data ingestion and transformation. And for some of these Kafka topics, the very first thing we do is create a canonical topic where data are already filtered and refined into very strict data types. This is more efficient because it removes this burden from every downstream service, and it also provides consistency across the downstream consumers.
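
The canonical-topic step can be pictured as a filter that turns loosely typed raw messages into strictly typed records once, so downstream consumers do not repeat the work. The Python sketch below only illustrates that idea: the `Reading` type and the field names `device_id`, `ts`, and `power_w` are assumptions, not the platform's actual schema, and a plain generator stands in for a Kafka consumer and producer.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Reading:
    """Strictly typed canonical record (hypothetical schema)."""
    device_id: str
    timestamp_s: int
    power_w: float


def to_canonical(raw: dict) -> Optional[Reading]:
    """Parse one raw message; return None if it is malformed."""
    try:
        device_id = str(raw["device_id"])
        timestamp_s = int(raw["ts"])
        power_w = float(raw["power_w"])
    except (KeyError, TypeError, ValueError):
        return None
    # Reject empty ids, negative timestamps, and NaN power values.
    if not device_id or timestamp_s < 0 or power_w != power_w:
        return None
    return Reading(device_id, timestamp_s, power_w)


def canonical_stream(raw_messages: Iterable[dict]) -> Iterator[Reading]:
    """Yield only the messages that survive filtering and typing."""
    for raw in raw_messages:
        reading = to_canonical(raw)
        if reading is not None:
            yield reading
```

Downstream code can then rely on every `Reading` having a string id, an integer timestamp, and a finite-or-infinite but non-NaN float power value.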

A unique challenge in this domain is the streaming, real-time aggregation of telemetry from thousands of batteries. This is a service that we'll look at in much more detail because it forms one of the foundations of the virtual power plant.

Now like any large company, product and customer information comes from many, many different business systems, and it's really unworkable to have every microservice connect to every business system, many of which are not designed to be internet facing or IoT scale. So the asset management services serve four purposes. One is to abstract and unify these many different business systems into one consistent API. Two is to provide a consistent source of truth, especially when there are conflicting data. Three is to provide a kind of type system where applications can rely on the same attributes of the same type of device, like a battery. And four is to describe unique relationships among these energy assets, like which devices can talk to each other and who can control them. And it relies heavily on a Postgres database to describe these relationships.

Now we use Kafka to integrate changes as they happen from many of these different business systems, or we stream the changes directly from IoT devices. And at scale actually this is a lot more reliable. Devices are often the most reliable source of truth, self-reporting their configuration, state and relationships.

Now a digital twin is the representation of a physical IoT device, a battery, an inverter, a charger, in software, modeled virtually. And we do a lot of digital twin modeling to represent the current state and relationships of various assets. Finally, there are services for commanding and controlling IoT devices, like telling a battery to discharge at a given power set point for a specific duration. And similar to both the telemetry and asset domains, we need a streaming, stateful and real-time representation of IoT devices at scale, including modeling this inherent uncertainty that comes with controlling IoT devices over the internet.
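
One way to picture a digital twin that models the uncertainty of commanding a device over the internet is to track what was commanded separately from what the device last reported. The sketch below is an illustration only; the class name `BatteryTwin`, its fields, and the staleness and tolerance thresholds are all assumptions rather than the platform's model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BatteryTwin:
    """In-memory stand-in for one battery (hypothetical model)."""
    device_id: str
    reported_power_w: Optional[float] = None
    reported_at_s: Optional[float] = None
    commanded_power_w: Optional[float] = None
    commanded_at_s: Optional[float] = None

    def on_command(self, power_w: float, now_s: float) -> None:
        # Record intent only; the device may not have received it yet.
        self.commanded_power_w = power_w
        self.commanded_at_s = now_s

    def on_telemetry(self, power_w: float, ts_s: float) -> None:
        # Ignore out-of-order reports older than the latest one we hold.
        if self.reported_at_s is not None and ts_s < self.reported_at_s:
            return
        self.reported_power_w = power_w
        self.reported_at_s = ts_s

    def status(self, now_s: float, stale_after_s: float = 60.0,
               tolerance_w: float = 100.0) -> str:
        if self.reported_at_s is None or now_s - self.reported_at_s > stale_after_s:
            return "unknown"      # no recent evidence about the device
        if self.commanded_power_w is None:
            return "idle"         # nothing has been commanded
        confirmed = (self.reported_at_s >= self.commanded_at_s
                     and abs(self.reported_power_w - self.commanded_power_w) <= tolerance_w)
        return "confirmed" if confirmed else "pending"
```

The point of the explicit "unknown" and "pending" states is that the cloud never assumes a command took effect until telemetry says so.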

Now, Akka has been an essential tool for us for building these microservices. Akka is a toolkit for distributed computing, and it also supports actor model programming, which is great for modeling the state of individual entities like a battery, while also providing a model for concurrency and distribution based on asynchronous and immutable message passing. It's a really, really great model for IoT, and I'll provide some specific examples later in the presentation. Another part of the Akka toolkit that we use extensively is the reactive streams component called Akka Streams. Akka Streams provides sophisticated primitives for flow control, concurrency, and data management, all with back pressure under the hood, ensuring that the services have bounded resource constraints. And generally, all the developer writes are functions, and then Akka Streams handles the system dynamics, allowing processes to bend and stretch as the load on the system changes and the messaging volume changes.
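
The two ideas in this passage, an actor that owns the state of one entity and processes immutable messages one at a time, and back pressure that keeps resource use bounded, can be sketched without the toolkit itself. The following uses Python's `asyncio` purely as a stand-in; it is not the toolkit's API, and `BatteryActor` and its message shapes are assumptions for illustration.

```python
import asyncio


class BatteryActor:
    """Owns the state of one battery; only its own loop mutates that state."""

    def __init__(self, mailbox_size: int = 2):
        # A bounded mailbox: senders wait when it is full (back pressure).
        self.mailbox = asyncio.Queue(maxsize=mailbox_size)
        self.energy_wh = 0.0

    async def tell(self, message) -> None:
        await self.mailbox.put(message)

    async def run(self) -> None:
        while True:
            message = await self.mailbox.get()
            if message is None:          # sentinel: stop the actor
                break
            kind, value = message        # messages are immutable tuples
            if kind == "charged":
                self.energy_wh += value
            elif kind == "discharged":
                self.energy_wh = max(0.0, self.energy_wh - value)


async def demo() -> float:
    actor = BatteryActor()
    worker = asyncio.create_task(actor.run())
    for message in [("charged", 500.0), ("discharged", 200.0), ("charged", 100.0), None]:
        await actor.tell(message)        # suspends if the mailbox is full
    await worker
    return actor.energy_wh


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the sender awaits on a full mailbox, a slow consumer slows the producer down instead of letting memory grow without bound.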

The Alpakka project has a large number of these reactive streams interfaces to services like Kafka or AWS S3, and Alpakka is what we use extensively for interfacing with Kafka. We don't actually use Kafka Streams, because we find the interface there is too simplistic for our use case and it's also ecosystem specific, whereas Akka Streams provides this much more general-purpose streaming tool.

Now like any large platform, there's a mix of languages, but our primary programming language is Scala, and the reason we came to Scala was through Akka, because it's really the first-class way to use Akka. And then we really kind of fell in love with Scala's rich type system, and we've become big fans of functional programming for building large, complex, distributed systems. So we like things like the compile-time safety, immutability, pure functions, composition, and doing things like modeling errors as data rather than throwing exceptions. And for a small team, having a primary programming language where you invest in a deep understanding and first-class tooling is a huge boost to productivity, job satisfaction, and the overall quality of a complex system.

The majority of our microservices run in Kubernetes, and the pairing of Akka and Kubernetes is really, really fantastic. Kubernetes can handle coarse-grained failures and scaling, so that would be things like scaling pods up or down, running liveness probes, or restarting a failed pod with an exponential back-off. And then we use Akka for handling fine-grained failures, like circuit-breaking or retrying an individual request, and modeling the state of individual entities, like the fact that a battery is charging or discharging. And then we use Akka Streams for handling the system dynamics in these message-based, real-time streaming systems.
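
As an illustration of the fine-grained failure handling mentioned here, the sketch below shows the general circuit-breaker pattern in plain Python. It is a generic textbook version, not the toolkit's implementation; the class names, thresholds, and injectable `clock` parameter are assumptions made so the example is easy to test.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of hammering a struggling dependency.
                raise CircuitOpenError("breaker open, call rejected")
            # Reset window elapsed: let one trial call through (half-open).
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The split described in the talk is that a pattern like this protects individual requests inside a service, while the container orchestrator deals with whole-process failures.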

The initial platform was built with traditional HTTP APIs and JSON, which allowed rapid development of the initial platform. But over the past year, we've invested much more in gRPC. It's been a big win. It's now our preference for new services, or if we extend older services. And it brought three distinct advantages. Strict contracts make these systems much more reliable. Code generation of clients means we're no longer writing clients, which is great. And third, and somewhat unexpected, we saw much improved cross-team collaboration around these contracts. And we're not just seeing this with gRPC, because we also prefer protobuf for our streaming messages, including the ones that are going through Kafka. And we maintain a single repository where we share these contracts and then collaborate across projects.

I've mentioned this theme of strict typing a few times, rich types in Scala, strict schema with protobuf, and then these strict asset models for systems integration. And constraints ultimately provide freedom, and they allow decoupling of microservices and decoupling of teams. And constraints are really a foundation for reliability in large-scale distributed systems.

So, takeaways from building the Tesla energy platform. We were lucky to embrace the principles of reactive systems from day one. And this produced incredibly robust, reliable, and effective systems. Reactive streams is a really important component for handling the system dynamics and providing resource constraints while also providing this rich, general-purpose API for streaming. Now what's needed to build these complex services, especially in IoT, is a toolkit for distributed computing. For us, that's been Akka. For others, that might be Erlang/OTP. And I think now we're also seeing the evolution of stateful serverless platforms to support the same building blocks. And I kind of imagine that's how we're all going to be programming these systems in the future. So that's things like managing state, modeling individual entities at scale, workflow management, streaming interfaces, and then allowing the runtime to handle concurrency, distribution, and failure.

Strict contracts make systems more reliable and allow services and teams to work more decoupled while also improving collaboration. And don't develop every microservice differently just because you can. Compound your investments in your knowledge and in your tooling by creating a deep understanding and also this paved path in your primary tool set.

So on top of the Tesla energy platform that Colin described, we built our first power plant type application. In this phase, we were learning how to productize real-time forecast and optimization of batteries. And in this case, we started with a single, albeit very large battery, which was the Hornsdale battery. The Hornsdale battery was built on a tight timeline because of this famous tweet. And it's the largest battery in the world at 100 megawatts, 129 megawatt hours, which is about the size of a gas turbine. And Hornsdale helps keep the grid stable even as more renewables are coming online. And not only is it keeping the grid stable, it's actually reduced the cost to customers of doing so. And it helps the grid by providing multiple kinds of services.

So during extreme events, like when a generator trips offline, Hornsdale responds nearly instantaneously to balance big frequency excursions that could otherwise cause a blackout. But even during normal times, whereas a conventional generator's response lags the grid operator's signal on the order of minutes, the battery can follow the grid operator's frequency regulation commands nearly instantaneously. And this helps maintain a safe grid frequency. So this big battery provides these services to the grid by way of the energy market.

And why do we have to participate in an energy market? So recall the requirement that supply and demand have to be balanced in real time. Markets are an economically efficient way to make them balance. Participants bid in what they're willing to produce or consume at different price levels and at different time scales. And the operator activates the participants who can provide the necessary services at the lowest price. And if we want the battery to continually provide services to the grid and help stabilize it as more renewables come online, we need to be participating in energy markets.

And in order to participate in energy markets, we need software. So to do this, we built Autobidder to operate Hornsdale, and now we offer it as a product. This is the UI for Autobidder. It's a pro tool that's intended for control room operators who look at it day in and day out. This is a lot of information on the screen, I know. But it's running workflows that fetch data, forecast prices and renewable generation, decide on an optimal bid, and then submit it. And these workflows run every five minutes. That's the cadence of the Australian market.

And at a high level, the optimization problem is trade-offs across different market products, which are different kinds of services that the battery can provide to the grid, and trade-offs in time, since the battery has a finite amount of energy. Autobidder is built in the applications layer of the Tesla energy platform that Colin described. And it consists of several microservices and it interacts with both the platform and with third-party APIs.
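
The "trade-offs in time" half of that problem can be shown with a toy example: given a price forecast and a battery with finite energy, decide in each interval whether to charge, discharge, or wait. The sketch below is a small dynamic program over whole energy units. It is an assumption-laden illustration (one market product, perfect forecast, no losses, one unit per interval), and the function name `best_schedule` is hypothetical; it says nothing about how the real optimizer works.

```python
def best_schedule(prices, capacity, start_energy=0):
    """Maximize revenue from buying (+1) and selling (-1) one energy unit per interval.

    Returns (revenue, actions), where each action is +1 (charge), 0 (idle), or -1 (discharge).
    """
    # value[e] = best revenue obtainable from the intervals not yet processed,
    # given e units stored. We sweep backwards in time.
    value = [0.0] * (capacity + 1)
    choices = []
    for price in reversed(prices):
        new_value = [0.0] * (capacity + 1)
        step_choice = [0] * (capacity + 1)
        for e in range(capacity + 1):
            best, action = value[e], 0                      # idle
            if e < capacity and -price + value[e + 1] > best:
                best, action = -price + value[e + 1], 1     # buy one unit
            if e > 0 and price + value[e - 1] > best:
                best, action = price + value[e - 1], -1     # sell one unit
            new_value[e], step_choice[e] = best, action
        value = new_value
        choices.append(step_choice)
    choices.reverse()

    # Walk forward from the starting energy to read off the chosen actions.
    energy, actions = start_energy, []
    for step_choice in choices:
        action = step_choice[energy]
        actions.append(action)
        energy += action
    return value[start_energy], actions


if __name__ == "__main__":
    print(best_schedule([30, 10, 50, 20, 60], capacity=1))
```

Even in this toy form, the finite capacity forces a choice: energy sold now is not available for a higher price later.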

And Autobidder is fundamentally a workflow orchestration platform. And so you might ask why we built our own rather than using an open source tool. The key thing is this is operational technology. These aren't batch or offline jobs, and it's critical for financial and physical reasons that these workflows run. We also leveraged our primary toolset, which allowed us to avoid introducing a new language and new infrastructure into our stack.

The center of the system is the orchestrator microservice, and this runs the automated workflows. And a principle we hold to is we keep this core as simple as possible and contain complexity in the peripheral services. So the market data service abstracts the ETL of complex input data. This data has diverse kinds of timing when it arrives relative to the market cycle. And this service handles that timing, and it handles fallbacks in case of late-arriving data or missing data.

There's a forecast service and an optimization service that execute algorithm code, and a bid service to interact with the market submission interface. The orchestrator, market data service, and bid service are written in Scala. And again, this common toolkit gives us great concurrency semantics, functional programming, type safety, and compounding best practices across our teams. However, the forecast and optimization services are in Python, and this is because it's very important to us to enable rapid algorithm improvement and development.

And Python gives us a couple of things there. There are key numerical and solver libraries available in Python. And also the algorithm engineers on our team are more fluent in Python, and having these services in Python empowers them to own the core logic there and iterate on it. The communication between the market data and bidding services and the orchestrator happens over gRPC, with the benefits Colin described: strict contracts, code generation, and collaboration. But the communication between the orchestrator and the forecasting and optimization services uses Amazon SQS message queues.

And these queues give us durable delivery, retries in cases of consumer failures, and easily support long-running tasks without a long-lived network connection between services. We use an immutable input-output messaging model, and the messages have strict schemas. This allows us to persist the immutable inputs and outputs and have them available for backtesting, which is an important part of our overall team's mission. Also, SQS allows us to build worker pools.
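
A rough picture of an immutable message with a strict schema, one that can be serialized onto a queue and later replayed for backtesting, is sketched below in Python. The message type `ForecastRequest` and its fields are hypothetical, and JSON is used here only to keep the example dependency-free; it is not a claim about the wire format used on these queues.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ForecastRequest:
    """Immutable input message (hypothetical schema)."""
    workflow_id: str
    interval_start: str      # ISO 8601 timestamp as text
    horizon_intervals: int

    def __post_init__(self):
        # Reject messages that do not match the declared types.
        if not isinstance(self.workflow_id, str) or not self.workflow_id:
            raise ValueError("workflow_id must be a non-empty string")
        if not isinstance(self.interval_start, str):
            raise ValueError("interval_start must be a string")
        if not isinstance(self.horizon_intervals, int) or self.horizon_intervals <= 0:
            raise ValueError("horizon_intervals must be a positive integer")


def encode(message: ForecastRequest) -> str:
    """Serialize deterministically so stored inputs can be compared and replayed."""
    return json.dumps(asdict(message), sort_keys=True)


def decode_request(body: str) -> ForecastRequest:
    """Missing or unexpected fields raise TypeError; bad values raise ValueError."""
    return ForecastRequest(**json.loads(body))
```

Because the decoded message equals the original and cannot be mutated, a stored queue body is enough to rerun a forecast against exactly the inputs it saw in production.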

So like I said, forecast and optimization are in Python, which has somewhat cumbersome concurrency semantics. And the message queue allows us to implement concurrency across workers instead of within a worker. It's notable that these services are effectively functions: they take inputs and produce outputs without other effects. And this keeps them more testable, makes these important algorithm changes and improvements safer, relieves algorithm engineers of the burden of writing IO code, and lets us use Scala concurrency for IO.

So stepping back and looking at these workflows as a whole, these workflows are stateful. And the state is a set of immutable facts that are generated by business logic stages. These stages happen sequentially in time. And the workflow state includes things like knowing what the current stage is and accumulating the results of tasks within the current stage and across stages. And some stages, like the forecast stage, have multiple tasks that need to be accumulated before deciding to proceed. And some stages might need outputs of multiple previous stages, not just the immediate predecessor.
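
The description above, state as an accumulating set of immutable facts with stages that advance only when all their tasks have reported, can be sketched as a pure transition function. The stage and task names below are assumptions chosen to mirror the talk, not the real workflow definition.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# Hypothetical stage order and the tasks each stage waits for.
STAGES = ("fetch", "forecast", "optimize", "bid", "done")
REQUIRED_TASKS = {
    "fetch": {"market_data"},
    "forecast": {"price", "renewables"},
    "optimize": {"schedule"},
    "bid": {"submission"},
    "done": set(),
}


@dataclass(frozen=True)
class WorkflowState:
    stage: str = "fetch"
    # Each fact is (stage, task, value); facts are only ever appended.
    results: Tuple[Tuple[str, str, object], ...] = ()


def record(state: WorkflowState, task: str, value: object) -> WorkflowState:
    """Return a new state with one more fact, advancing when the stage is complete."""
    results = state.results + ((state.stage, task, value),)
    finished = {t for s, t, _ in results if s == state.stage}
    stage = state.stage
    if REQUIRED_TASKS[stage] <= finished and stage != STAGES[-1]:
        stage = STAGES[STAGES.index(stage) + 1]
    return replace(state, stage=stage, results=results)
```

Since every earlier fact stays in `results`, a later stage can read the outputs of any previous stage, not just its immediate predecessor.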

In case of a failure, like the orchestrator pod restarting, we don't want to forget that a workflow was in progress, and we'd prefer not to completely restart it. So we can instead take snapshots of the state at checkpoints. And if the workflow fails, it can be resumed from the last checkpoint. And we keep the state in an Akka actor representing the workflow state machine. And Akka Persistence gives us transparent resumption of the state through checkpointing and an event journal.
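
The combination of an event journal and periodic snapshots can be illustrated with a small in-memory model: state is rebuilt by replaying events, and a snapshot lets recovery skip the events it already covers. This sketch shows the general event-sourcing idea only; the `Journal` class, the event names, and the in-memory storage are assumptions and do not reflect the persistence library's API.

```python
INITIAL = {"stage": None, "completed": ()}


def apply_event(state: dict, event: tuple) -> dict:
    """Pure function: fold one event into the state, returning a new dict."""
    kind, payload = event
    if kind == "stage_started":
        return {**state, "stage": payload}
    if kind == "task_completed":
        return {**state, "completed": state["completed"] + (payload,)}
    raise ValueError(f"unknown event kind: {kind}")


class Journal:
    """Append-only event log with periodic snapshots (in-memory stand-in)."""

    def __init__(self, snapshot_every: int = 2):
        self.events = []
        self.snapshot = (0, INITIAL)   # (number of events covered, state)
        self.snapshot_every = snapshot_every

    def persist(self, state: dict, event: tuple) -> dict:
        self.events.append(event)
        new_state = apply_event(state, event)
        if len(self.events) % self.snapshot_every == 0:
            self.snapshot = (len(self.events), new_state)
        return new_state

    def recover(self) -> dict:
        # Start from the last snapshot and replay only the later events.
        offset, state = self.snapshot
        for event in self.events[offset:]:
            state = apply_event(state, event)
        return state
```

After a simulated restart, calling `recover()` yields the same state the workflow had before the failure, without re-running completed tasks.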

But an important lesson we've learned is to keep the business logic of stage execution in pure functions, separate from the actor as much as possible. This makes testing and composition of that business logic so much easier. And the new Akka Typed API naturally helps with that decomposition.

我们学到的重要一课是，尽可能将阶段执行的业务逻辑放在纯函数中，与actor分开。这样做可以使得对该业务逻辑的测试和组合变得更加容易。而新的Akka Typed API自然地帮助了这种解耦。

On our team, it's very important to enable rapid development, improvement, and iteration of algorithms. And so we have Python in specific places in our system. But we also really need to minimize the risk that iteration on the algorithms breaks workflows. A couple of things that work really well for us to minimize that risk are an input-output model for the algorithmic services, which keeps that code simpler and more easily testable, and strict contracts, which again give freedom to change an algorithm's internal logic independently of the rest of the system.

在我们的团队中，快速开发算法和迭代改进非常重要。因此，在我们的系统中有一些特定的地方是使用Python的。但我们也非常需要将算法的迭代风险降到最低。为此，有一些事情对我们来说非常有效，以减小迭代算法对工作流程的影响。其中一个是使算法服务具备输入输出模型。它使得代码更简洁、易于测试。另一个是严格的合同，它使得算法的内部逻辑可以独立于系统的其余部分进行更改。
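One way to picture a strict contract is a check that sits between the workflow and any algorithm implementation. The sketch below is an assumption for illustration only: `OptimizeInput`, `OptimizeOutput`, `check_contract`, and `threshold_algorithm` are invented names, and the two rules enforced are examples rather than the actual contract used on the platform.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class OptimizeInput:
    prices: List[float]   # one price per interval
    max_power_mw: float   # power limit of the asset


@dataclass(frozen=True)
class OptimizeOutput:
    dispatch_mw: List[float]


def check_contract(inp: OptimizeInput, out: OptimizeOutput) -> None:
    """Reject any output that breaks the agreed shape or limits."""
    if len(out.dispatch_mw) != len(inp.prices):
        raise ValueError("one dispatch value is required per price interval")
    if any(abs(p) > inp.max_power_mw for p in out.dispatch_mw):
        raise ValueError("dispatch exceeds the power limit")


def threshold_algorithm(inp: OptimizeInput) -> OptimizeOutput:
    """Toy algorithm: discharge at full power when the price is above average."""
    mean = sum(inp.prices) / len(inp.prices)
    return OptimizeOutput([inp.max_power_mw if p > mean else 0.0 for p in inp.prices])


def run_checked(algorithm: Callable[[OptimizeInput], OptimizeOutput],
                inp: OptimizeInput) -> OptimizeOutput:
    out = algorithm(inp)
    check_contract(inp, out)  # the workflow only ever sees contract-valid output
    return out
```

With this shape, an algorithm engineer can replace `threshold_algorithm` with something far more sophisticated, and the rest of the system only depends on the input and output types staying the same.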

It's been important for us to abstract the messy details of external data sources and services from the core system. And this is a fundamental tenet of the whole platform, actually. These workflows are inevitably stateful, but entangling state with the business logic stages can lead to spaghetti code. Instead, keep the business logic stages functional, testable, and composable.

对于我们来说，将外部数据源和服务的混乱细节与核心系统抽象出来是非常重要的。事实上，这是整个平台的基本原则。而且，这些工作流程不可避免地具有状态。但是，将状态与业务逻辑阶段交织在一起会导致代码混乱。相反，我们应该保持业务逻辑阶段的功能性、可测试性和可组合性。

Okay. In the next part, we're going to describe our first virtual power plant application. So Percy just described how we leverage the platform to participate in the energy markets algorithmically with one large battery. Now we'll focus on how we extend that and use what we learned to measure, model, and control a fleet of thousands of Powerwalls that are installed in people's homes to do peak shaving for an electrical utility.

好的。在接下来的部分，我们将描述我们的第一个虚拟电厂应用。所以珀西刚刚描述了我们如何利用平台以算法方式参与能源市场，使用一块大型电池。现在我们将专注于如何扩展这个应用，并利用我们所学的知识来测量、建模和控制成千上万个安装在人们家中的Powerwall电池组，以进行电力公司的负荷削峰。

Now, before I detail the software architecture, I'll describe the problem that we're trying to solve. So this is a graph of aggregate grid load in megawatts. Now grid load varies with weather and with time of year. This is a typical load profile for a warm summer day. The left-hand side is midnight, the minimum load is around 4 a.m. when most people are sleeping, and then peak load is around 6 p.m. when a lot of people are running air conditioning or cooking dinner.

现在，在我详细描述我们的架构之前，我将描述一下我们正在解决的问题。这是一个以兆瓦为单位的总体电网负荷图。电网负荷随着天气和年份变化而不同。这是一个典型的暖夏日负荷曲线。左侧是午夜，最低负荷约在凌晨4点，那时大多数人都在睡觉，然后高峰负荷在下午6点左右，那时很多人在使用空调或烹饪晚餐。

Now, peak loads are very, very expensive. The grid only needs to meet the peak load a few hours in a year. One option for satisfying the peak load is to build more capacity, which incurs significant capital costs, and this capacity is then largely underused outside of those peaks. The other option is to import power from another jurisdiction that has excess, and this is often at a significant premium. So power can be cheaper if we can offset demand and make this load curve more uniform. And that's our objective. We want to discharge Powerwall batteries during the peak grid load, and at other times the homeowner will use this battery for clean backup power.

现在，高峰负荷非常昂贵。电网只需要在一年中的几个小时内达到高峰负荷。满足高峰负荷的选项有两个：建立更多的产能，这会产生巨大的资本成本。而且这些产能在高峰期以外很大程度上被闲置。另一个选项是从拥有多余电力的其他司法管辖区进口电力，这往往需要支付显著溢价。所以，如果我们能平衡需求，使负荷曲线更加均匀，电力的成本可能会更低。这是我们的目标。我们希望在电网高峰负荷时释放储能壁挂电池，而在其他时间，房主可以将这些电池用于清洁备用电力。

A lesson we quickly learned as our virtual power plants grew to thousands of Powerwalls and tens of megawatts of power was that charging the batteries back up right after the peak would lead to our own peak, defeating the purpose. And of course, the solution is to not only control when the batteries discharge, but also when they charge, and to spread out the charging over a longer period of time.

随着我们的虚拟电力站扩大到数千个电池组和数十兆瓦的功率，我们很快就学到了一个教训，那就是在高峰后立即充电会导致我们自己的高峰出现，从而违背了原有的目的。当然，解决办法不仅是控制电池的放电时间，还包括充电时间，并将充电过程分摊到更长的时间段中。
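The arithmetic behind this lesson is simple: recharging an energy amount E over h hours adds roughly E / h of load during that window. The sketch below uses made-up numbers (a 20 MWh recharge and an invented hourly load profile) and a hypothetical helper `add_recharge` to compare recharging immediately with spreading it overnight.

```python
from typing import List


def add_recharge(load_mw: List[float], start_hour: int, hours: int,
                 energy_mwh: float) -> List[float]:
    """Return the hourly load after spreading energy_mwh evenly over a window."""
    extra_mw = energy_mwh / hours
    window = {(start_hour + i) % 24 for i in range(hours)}
    return [mw + extra_mw if hour in window else mw
            for hour, mw in enumerate(load_mw)]


# Invented profile: 60 MW base, a shaved 90 MW peak from 17:00 to 20:00,
# and 85 MW in the hour right after the peak.
load = [60.0] * 24
load[17:20] = [90.0] * 3
load[20] = 85.0

immediate = add_recharge(load, start_hour=20, hours=1, energy_mwh=20.0)
overnight = add_recharge(load, start_hour=22, hours=8, energy_mwh=20.0)
```

In this toy case, `max(immediate)` is 105 MW, which is a new peak above the 90 MW we just shaved, while `max(overnight)` stays at 90 MW because the recharge adds only 2.5 MW per hour during low-load hours.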

Now this is what we're trying to accomplish, this picture, but in reality we don't have the complete picture. There's uncertainty. It's noon, and we're trying to predict whether or not there's going to be a peak. And we only want to discharge batteries if there's a high likelihood of a peak. Once we've decided to discharge batteries to avoid the peak, how do we control them?

现在这就是我们试图实现的目标，这张图片，但实际上我们没有完整的图片。存在着不确定性。现在是中午，我们试图预测是否会有高峰。只有在很有可能出现高峰时，我们才想要放电电池。一旦我们决定放电电池以避免高峰，我们如何控制它们呢？

And I want to be very clear that we only control Powerwalls that are enrolled in specific virtual power plant programs. We don't arbitrarily control Powerwalls that aren't enrolled in these programs, so not every customer has this feature.

我想要非常明确地说明，我们只对参与特定虚拟电力站计划的储能墙进行控制。我们不会随意控制未参与这些计划的储能墙，因此并非每个客户都拥有这个功能。

As Percy mentioned, the grid's not designed to interact with a whole bunch of small players. So we need to aggregate these Powerwalls to look more like a traditional grid asset, something like a large steam turbine. And typically we do this by having hierarchical aggregations that are a virtual representation in cloud software.

正如Percy所提到的，这些电网并不是设计用来与许多小型玩家互动的。因此，我们需要将这些动力墙集合起来，使其看起来更像是传统电网的组成部分，例如大型蒸汽涡轮。通常情况下，我们通过在云软件中创建层次化的集合来实现这一点，这些集合是虚拟的表示。

The first level is a digital twin representing an individual site, so that would be a house with a Powerwall. And the next level might be organized by electrical topology, something like a substation, or it could be by geography, something like a county.

第一层是代表一个单独场所的数字孪生，比如一个装有能量储存器的房屋。而下一个级别可能是按照电力拓扑结构进行组织，比如一个变电站，或者按照地理位置进行组织，比如一个县。

The next level can again be a physical grouping, like an electrical interconnection, or it might be logical, like sites with a battery and sites with a battery plus solar that we want to control or optimize differently. And all of these sites come together to form the top level of the virtual power plant, meaning we can query the aggregate of thousands of Powerwalls as fast as we can query a single Powerwall and use this aggregate to inform our global optimization.

下一个层级可以再次是物理分组，比如电气互连，也可以是逻辑分组，比如具有电池的站点和具有电池加太阳能的站点，我们希望以不同的方式对其进行控制或优化。所有这些站点汇聚在一起，形成虚拟电站的顶层，这意味着我们可以像查询单个电池一样快速查询成千上万个电池，并利用这些总数来为全球优化提供信息。

It's easy to think of the virtual power plant as uniform, but the reality is more like this. There's a diversity of installations and home loads. Some homes have one battery, some have two or three. The batteries are not all fully charged. Some might be half full or close to empty, depending on home loads, time of day, solar production on that day, and the mode of operation.

很容易把虚拟电厂看作是统一的，但实际情况更像这样。安装和家庭负荷是多样化的。有些家庭有一个电池，有些有两个或三个电池。电池并不都是充满电的。有些电池可能只有一半电量或接近耗尽，这取决于家庭负荷、时间、当天的太阳能产量以及运行模式。

There's also uncertainty in communication with these sites over the internet, as some of them may be temporarily offline. Finally, there's the asset management problem of new sites coming online regularly, firmware being non-uniform in terms of its capabilities across the whole fleet, and hardware being upgraded and replaced over time.

在与这些网站通过互联网进行通信时，也存在着不确定性，因为其中一些可能暂时离线。最后，还存在着资产管理问题，即新网站定期上线、整个设备组的固件能力不统一以及硬件随时间升级和更换。

So it's really critical to represent this uncertainty in the data model and in the business logic. So we want to say things like there's 10 megawatt hours of energy available, but only 95% of the sites we expect to be reporting have reported. And it's really only the consumer of the data that can decide how to interpret this uncertainty based on the local context of that service.

因此，在数据模型和业务逻辑中表达这种不确定性非常关键。所以我们希望能够表述出有10兆瓦小时的能量可用，但只有预计报告的站点的95％已经报告了。而实际上，只有数据的使用者可以根据该服务的地方背景来决定如何解读这种不确定性。
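A minimal way to carry this uncertainty in the data model is to return the reporting coverage alongside the aggregate value, rather than a bare number. In the Python sketch below, `AggregateEnergy` and `aggregate` are hypothetical names, and a reading of `None` is an assumed convention for "expected to report but has not".

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AggregateEnergy:
    energy_mwh: float      # sum over the sites that actually reported
    sites_expected: int
    sites_reporting: int

    @property
    def reporting_fraction(self) -> float:
        if self.sites_expected == 0:
            return 0.0
        return self.sites_reporting / self.sites_expected


def aggregate(readings_kwh: Dict[str, Optional[float]]) -> AggregateEnergy:
    """Sum available energy and keep track of how many sites are missing."""
    reported = [value for value in readings_kwh.values() if value is not None]
    return AggregateEnergy(
        energy_mwh=sum(reported) / 1000.0,
        sites_expected=len(readings_kwh),
        sites_reporting=len(reported),
    )
```

A dashboard might show the raw total, while a bidding service might scale it down or refuse to act below some coverage threshold. That decision stays with the consumer, as described above.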

So one way we manage this uncertainty is through a site-level abstraction. So even if the sites are heterogeneous, this edge computing platform provides site-level telemetry for things like power, frequency, and voltage that gives us a consistent abstraction in software. And then another way is to aggregate the telemetry across the virtual power plant, because people don't want to worry about controlling individual Powerwall batteries. They want to worry about discharging 10 megawatts from 5 p.m. to 6 p.m. in order to shave the peak.

所以我们管理这种不确定性的一种方法是通过站点级抽象。因此，即使站点是异构的，这种边缘计算平台也可以提供诸如电源、频率和电压等站点级遥测数据，为我们在软件中提供一种统一的抽象。另一种方法是对虚拟电力厂的遥测数据进行聚合，因为人们不想担心控制各个功率墙电池，他们只关心在下午5点到6点期间释放1千万瓦的电力以节约峰值。

And this is a really difficult engineering challenge, which is a combination of streaming telemetry and asset modeling. For modeling each site in software, the so-called digital twin, we represent each site with an actor. And the actor manages state, like the latest reported telemetry from that battery, and executes a state machine, changing its behavior if the site is offline and telemetry is delayed. But it also provides a convenient model for concurrency and computation.

这是一个非常困难的工程挑战，它是流媒体遥测和资产建模的结合。为了在软件中对每个站点进行建模，也称为数字双胞胎，我们用一个"角色"来代表每个站点。这个"角色"管理状态，比如从电池中汇报的最新遥测，并执行状态机，如果站点离线并且遥测延迟，它会改变自己的行为。但它也提供了一个方便的模型，用于并发和计算。

So the programmer worries about modeling an individual site in an actor, and then the Akka runtime handles scaling this to thousands or millions of sites. And you don't have to worry about that. It's a very, very powerful abstraction for IoT in particular. And we generally never worry about threads or locks or concurrency bugs.

所以程序员担心将一个单独的站点建模成一个活动者，然后Oka运行时将其扩展到成千上万个站点。而你不必担心这个。这对于物联网来说是一个非常强大的抽象概念。而且我们通常不需要担心线程、锁或并发错误。

The higher-level aggregations are also represented by individual actors. And then actors maintain their relationships with other actors describing this physical or logical aggregation. And then the telemetry is aggregated by messaging up the hierarchy in memory, in near real time. And how real time the aggregate is at any level is really just a trade-off between messaging volume and latency.

更高级别的聚合也由单独的参与者表示。然后参与者通过描述物理或逻辑聚合维护与其他参与者的关系。然后遥测通过消息传递在内存中到上层进行聚合，实现近实时处理。在任何级别上，实时聚合的真实性只是消息数量和延迟之间的权衡。

We can query at any node in this hierarchy to know the aggregate value at that location, or query the latest telemetry from an individual site. And we can also navigate up and down the hierarchy from any point.

我们可以在此层次结构的任何节点进行查询，以了解该位置的聚合值，或查询个别站点的最新遥测数据。并且我们还可以从任意点向上或向下导航。
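The aggregation by messaging up the hierarchy can be sketched without any actor framework. This is plain Python with direct method calls standing in for actor messages. `Node` and `Site` are hypothetical names, and each site forwards only the change in its reading so parents can keep a running total.

```python
from typing import Optional


class Node:
    """An aggregation level (for example a substation or the whole fleet)."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.total_kw = 0.0

    def receive(self, delta_kw: float) -> None:
        # Update the local aggregate, then pass the same delta to the parent.
        self.total_kw += delta_kw
        if self.parent is not None:
            self.parent.receive(delta_kw)


class Site(Node):
    """Digital twin of one home; keeps the latest reported telemetry."""

    def __init__(self, name: str, parent: Node) -> None:
        super().__init__(name, parent)
        self.latest_kw = 0.0

    def report(self, power_kw: float) -> None:
        delta = power_kw - self.latest_kw
        self.latest_kw = power_kw
        self.receive(delta)


fleet = Node("fleet")
substation_a = Node("substation-a", fleet)
substation_b = Node("substation-b", fleet)
a1, a2 = Site("a1", substation_a), Site("a2", substation_a)
b1 = Site("b1", substation_b)
for site, kw in [(a1, 5.0), (a2, 3.0), (b1, 4.0), (a1, 2.0)]:
    site.report(kw)
```

Reading `fleet.total_kw` is then a single lookup regardless of how many sites sit underneath. A real system would likely batch or throttle the upward messages, which is the volume versus latency trade-off mentioned above.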

Now, the services that perform this real-time hierarchical aggregation run in an Akka cluster. Akka Cluster allows a set of pods with different roles to communicate with each other transparently. So the first role is a set of linearly scalable pods that stream data off Kafka. They use Akka Streams for back pressure, bounded resource constraints, and low-latency stream processing. And then they message with a set of pods running all the actors in this virtual representation that I just described.

现在，执行这种实时分级汇总的服务在一个Oka集群中运行。Oka集群允许一组具有不同角色的Pod之间透明地进行通信。因此，第一个角色是一组可线性扩展的Pod，它们从Kafka中流式传输数据。它们使用Oka流进行反压力控制，有限的资源约束，并进行低延迟的流处理。然后，它们与一组运行在这个虚拟表示中的所有角色进行消息传递。

When the stream processors read a message off Kafka for a particular site, they message the actor representing that site simply using the site identifier. And it doesn't matter where in the cluster that actor is running; the Akka runtime will transparently handle the delivery of that message. This is called location transparency. And site actors message their parents in a similar manner all the way up that hierarchy. There's also a set of API pods that can serve client requests for site-level or aggregate telemetry because they can query into the cluster in this same location-transparent way. And it's this collection of services that provides the in-memory, near-real-time aggregation of telemetry for thousands of Powerwalls. It's an architecture that provides great flexibility, especially when paired with Kubernetes to manage the pods, because the actors are just kind of running on this substrate of compute. They're kind of running on the heap, if you will. An individual pod can fail or be restarted, and the actors that were on that pod will simply migrate to another until it recovers. The runtime handles doing this. The programmer doesn't have to worry about it. Or the cluster can also be scaled up or down, and the actors will rebalance across the cluster. Actors can recover their state automatically using Akka Persistence. But in this case, we don't actually need to use Akka Persistence, because the actor can just rediscover its relationships as well as the latest state when the next message from the battery arrives within a few seconds.

当流处理器从特定站点的Kafka读取消息时，它们只需使用站点标识符向代表该站点的actor发送消息。无论该actor在集群中的运行位置如何，Oka运行时都会透明地处理该消息的传递。这就是所谓的位置透明性。站点actor也以类似的方式与其父节点进行通信，一直延伸到层次结构的顶部。还有一组API Pod可以以同样透明的方式查询集群，以提供站点级别或聚合遥测的客户请求。正是这组服务为成千上万的电池聚合了近实时的遥测数据。这个架构提供了很大的灵活性，特别是与用于管理Pod的Kubernetes配对使用时。因为actor只是在这个计算基础上运行。如果一个单独的Pod失败或重新启动，那些在该Pod上的actor将会迁移到另一个Pod直到其恢复。运行时会处理这个过程，程序员不需要担心。或者集群也可以进行扩展或缩减，actor将会在集群上重新平衡。使用OCHA持久化，actor可以自动恢复其状态。但在这种情况下，我们实际上不需要使用OCHA持久化，因为当下一条来自电池的消息到达时，actor只需在几秒钟内重新发现其关系以及最新状态。
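To make location transparency concrete, here is a toy router in Python. It is not how Akka's cluster sharding is implemented. The `Cluster` class, the shard count, and the CRC-based hashing are all assumptions chosen for a short, deterministic example. What it shows is that the sender only ever names the site, and the placement can change underneath it.

```python
import zlib
from typing import Dict, List, Tuple


class Cluster:
    """Toy router: callers address a site by id, never by pod."""

    def __init__(self, pods: List[str], num_shards: int = 16) -> None:
        self.pods = list(pods)
        self.num_shards = num_shards
        self.delivered: Dict[str, List[Tuple[str, str]]] = {}

    def pod_of(self, site_id: str) -> str:
        # Stable hash of the site id picks a shard; the shard maps to a pod.
        shard = zlib.crc32(site_id.encode("utf-8")) % self.num_shards
        return self.pods[shard % len(self.pods)]

    def tell(self, site_id: str, message: str) -> str:
        """Deliver a message to the site's actor, wherever it currently lives."""
        pod = self.pod_of(site_id)
        self.delivered.setdefault(site_id, []).append((pod, message))
        return pod

    def remove_pod(self, pod: str) -> None:
        # Simulates a pod failure; later messages are routed to a surviving pod.
        self.pods.remove(pod)


cluster = Cluster(["pod-0", "pod-1", "pod-2"])
first_pod = cluster.tell("site-42", "telemetry: 3.1 kW")
cluster.remove_pod(first_pod)
second_pod = cluster.tell("site-42", "telemetry: 3.0 kW")
```

The calling code is identical before and after the pod disappears. Only the returned pod name differs, which mirrors the idea that actors migrate and the sender does not need to know.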

So to conclude this section, after aggregating telemetry to know the capacity that's available in the virtual power plant, let's look at how the batteries are actually controlled. So the first step is taking past measurements, forecasting, and deciding how many megawatts to discharge if we are going to hit a peak. And at a high level, this loop of measure, forecast, optimize, and control is basically running continuously. And the control part of this loop is true closed loop control. Once an aggregate control set point has been determined, we continuously monitor the disaggregate telemetry from every single site to see how it responds. And we adjust the set point for the individual sites to minimize error. We can take a look at how this works.

因此，总结本节内容，在了解虚拟电厂中可用电力容量的基础上，让我们来看看电池是如何实际控制的。首先，我们需要根据过去的测量数据和预测，决定在即将达到峰值时需要释放多少兆瓦的电能。从宏观角度来看，测量、预测、优化和控制的这个循环基本上是持续运行的。而这个循环的控制部分则是真正的闭环控制。一旦确定了汇总控制设定点，我们会持续监控来自每个站点的个别遥测数据，以观察其响应情况。并且我们会调整各个站点的设定点，以最小化误差。我们可以看一下这个过程是如何运作的。
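One iteration of such a closed loop might look like the sketch below. The redistribution rule is an assumption for illustration, not the controller described in the talk: sites that are tracking their set point share the aggregate shortfall in proportion to their remaining headroom. `adjust_setpoints` and its parameters are hypothetical.

```python
from typing import Dict


def adjust_setpoints(setpoints_kw: Dict[str, float],
                     measured_kw: Dict[str, float],
                     target_kw: float,
                     max_kw: Dict[str, float],
                     tolerance_kw: float = 0.1) -> Dict[str, float]:
    """One control step: spread the aggregate shortfall over responsive sites."""
    error = target_kw - sum(measured_kw.values())
    # A site counts as responsive if it delivers what it was asked to deliver.
    headroom = {
        site: max_kw[site] - setpoint
        for site, setpoint in setpoints_kw.items()
        if abs(measured_kw[site] - setpoint) <= tolerance_kw
    }
    total_headroom = sum(headroom.values())
    updated = dict(setpoints_kw)
    if error <= 0 or total_headroom <= 0:
        return updated  # this sketch only corrects shortfalls
    for site, room in headroom.items():
        updated[site] += min(room, error * room / total_headroom)
    return updated


setpoints = {"a": 3.0, "b": 3.0, "c": 3.0}
measured = {"a": 3.0, "b": 3.0, "c": 0.0}   # site c is not responding
limits = {"a": 5.0, "b": 5.0, "c": 5.0}
new_setpoints = adjust_setpoints(setpoints, measured, target_kw=9.0, max_kw=limits)
```

Here the fleet is 3 kW short of the 9 kW target, so sites `a` and `b` are each raised to 4.5 kW while the unresponsive site `c` is left alone. Running the function again on the next telemetry sample closes the loop.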

The Autobidder platform that Percy described may decide to control the whole fleet. So to give a sense of scale, this might be enough megawatts to offset the need to build a new natural gas peaker plant. Or we might just decide to control a subset of the hierarchy depending on the objective. Now the control service that I mentioned earlier dynamically resolves the individual sites under this target by querying the asset service. And this is because the sites can change over time. New sites are installed. The virtual hierarchy might be modified. Or the properties of an individual site might change. Maybe you add a second battery. The control service queries the battery telemetry at every site, potentially thousands of sites, using the in-memory aggregation that I just discussed, to decide how to discharge the battery at each site. There's no point discharging a battery that's almost empty. And you can think of this as somewhat similar to a database query planner, basically trying to plan the execution. The control service then sends a message to each site with a discharge set point and a time frame. And it will keep retrying until it gets an acknowledgment from the site or the time frame has elapsed. Because these logical aggregations of batteries are so large, we stream over these huge collections using Akka Streams to provide bounded resource constraints in all of the steps that I've just described. So that's resolving the sites, reading all of the telemetry, and then sending all the control set points.

Percy所描述的自动化平台可能决定控制整个车队。因此为了给出一个规模的概念，这可能足够用来抵消建造新的天然气备用电厂的需求。或者我们可能只决定根据目标来控制某个等级的子集。现在，我之前提到的控制服务通过查询资产服务动态地解决了这个目标下的各个站点。这是因为站点可能随着时间的推移而发生变化。新的站点被安装。虚拟等级结构可能被修改。或者某个站点的属性可能发生变化。也许你增加了第二块电池。控制服务使用我刚刚讨论过的内存聚合来查询每个站点的电池遥测数据，潜在地有几千个站点，以决定如何在每个站点放电电池。毫无意义地放电几乎没有电的电池。你可以把这个想象成类似于数据库查询规划器，基本上是试图规划执行。然后，控制服务向每个站点发送一个放电设定点和时间框架的消息。它将不断重试，直到从站点获得确认或时间框架已过。因为这些电池的逻辑聚合非常庞大，我们使用OCHA流来在我刚刚描述的所有步骤中提供有限的资源约束，包括解决站点、读取所有遥测和发送所有控制设定点。
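The planning and delivery steps can be sketched as two small functions. Both are illustrative assumptions: `plan_discharge` splits the target in proportion to available energy and skips nearly empty batteries (the 1 kWh cutoff is invented), and `send_until_acked` retries a caller-supplied `send` callable until it reports an acknowledgment or the deadline passes.

```python
import time
from typing import Callable, Dict


def plan_discharge(available_kwh: Dict[str, float], target_kw: float,
                   min_kwh: float = 1.0) -> Dict[str, float]:
    """Split target_kw across sites in proportion to their available energy."""
    eligible = {site: kwh for site, kwh in available_kwh.items() if kwh >= min_kwh}
    total = sum(eligible.values())
    if total == 0:
        return {}
    return {site: target_kw * kwh / total for site, kwh in eligible.items()}


def send_until_acked(send: Callable[[], bool], deadline: float,
                     retry_delay_s: float = 1.0,
                     now: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Retry send() until it returns True (acknowledged) or the deadline passes."""
    while now() < deadline:
        if send():
            return True
        sleep(retry_delay_s)
    return False
```

For example, `plan_discharge({"a": 10.0, "b": 0.2, "c": 30.0}, target_kw=8.0)` assigns 2 kW to `a` and 6 kW to `c` and leaves the nearly empty battery `b` out of the plan. The `now` and `sleep` parameters exist so the retry loop can be tested with a fake clock.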

So huge aggregations demand different APIs and data processing patterns. You can't just go build typical CRUD microservices. Not going to work. You need streaming semantics for processing large collections with low latency and bounded resource constraints. And what we really need is a runtime for modeling stateful entities that supports location transparency, concurrency, scaling, and resilience. Uncertainty is inherent in distributed IoT systems. So we need to just embrace this in the data model, in the business logic, and even in the customer experience rather than trying to escape it. And representing physical and virtual relationships among IoT devices, especially as they change over time, is the hardest problem in IoT. Trust me. But it's essential for creating a great product.

巨大的集合需要不同的API和数据处理模式。你不能只是构建典型的CRUD微服务。这样做行不通。你需要流式语义来处理具有低延迟和有界资源限制的大型集合。而我们真正需要的是一个运行时，用于建模状态化实体，支持位置透明、并发、扩展和弹性。不确定性是分布式物联网系统固有的特性。因此，我们需要在数据模型、业务逻辑甚至客户体验中接受它，而不是试图逃避它。在物联网中，代表物理和虚拟设备之间的关系，特别是随着时间的变化，是最困难的问题。相信我。但这对于创建一个伟大的产品是至关重要的。

Now direct control based on a central objective doesn't account for local needs. And this creates a kind of tension. So imagine a storm is approaching, close to a peak. The global objective wants to discharge these batteries to avoid the peak. But of course the homeowner wants a full battery in case the power goes out. And this leads to the final part of our presentation, the co-optimized virtual power plant.

现在基于中心目标的直接控制没有考虑到本地需求。这种情况会产生一种紧张感。所以想象一下，当一场接近峰值的暴风雨来临时，全局目标希望放电以避免电力峰值。但当然，住户希望电池充满，以防断电。这就引入了我们演示的最后一部分，即共同优化虚拟电力厂。

So just to review where we are. So far we've built on the fundamental platform to first of all optimize a single big battery to participate in an electricity market. And then second, aggregate, optimize, and control thousands of batteries to meet a central goal. And so in this last section, like Colin said, we're again going to aggregate, optimize, and control thousands of batteries. But this time not just for a global goal, we're going to co-optimize local and global objectives.

让我们回顾一下目前的情况。到目前为止，我们已经在基本平台上进行了建设，首先是优化了一个大型电池以参与电力市场。其次，我们聚合、优化和控制了数千个电池，以实现一个中心目标。所以在这最后一部分中，正如Colin所说，我们将再次聚合、优化和控制数千个电池。但这次不仅仅是为了一个全局目标，我们将协调优化本地和全球目标。

So whereas the peak shaving virtual power plant that Colin just described optimized a central objective and passed the control decisions downward to the sites, the market virtual power plant distributes the optimization itself across the sites and the cloud. And the sites actually in this case participate in the control decisions. This distributed optimization is only possible because Tesla builds its own hardware and has full control over firmware and software. This enables quick iteration across the local and central intelligence and how they relate to each other. And this collaboration is cross-team rather than cross-company.

因此，尽管Colin刚刚描述的峰值削峰虚拟电厂已经针对必要的目标进行了优化，并将控制决策下放到了各个站点，但市场虚拟电厂则将优化分散在站点和云端之间。在这种情况下，站点实际上参与了控制决策。这种分布式优化之所以可能，是因为特斯拉自家建造自己的硬件，并完全掌控固件和软件。这使得本地和中央智能能够快速迭代，并使其彼此之间产生联系。而这种合作是跨团队而不是跨公司进行的。

So when we say that this virtual power plant co-optimizes local and global objectives, what do we mean? So let's take a non-virtual power plant home. A home with this solar generation and this electricity consumption would have a net load like this. And this is the load that the utility sees. The Powerwall home battery can charge during excess solar generation and discharge during high load. And this is thanks to the local intelligence on the device. The goal of this would be either to minimize the customer's bill or to maximize how much of their own solar production they're using. This is local optimization.

所以，当我们说这个虚拟电厂同时优化了本地和全球目标时，我们是指什么呢？那么让我们以一个非虚拟电厂的家庭为例。在家中，通过太阳能发电和用电的情况会产生净负荷如此。这就是供电公司所看到的负荷。Power Wall家庭电池可以在太阳能过剩时充电，在负荷高时放电。这得益于设备上的本地智能功能。而这样做的目标要么是最小化客户的账单，要么是最大化他们自己的太阳能发电的使用。这就是本地优化。

What does it look like to co-optimize local and global objectives? One way to do it is that the local optimization can consider information about the aggregate goal, like market prices indicating the real-time balancing needs of the grid. So in this example, negative prices in the night, perhaps caused by wind overgeneration, might cause the battery to charge earlier. And a high price in the afternoon, caused maybe by unexpectedly high demand, prompts the battery to discharge rather than waiting to fully offset the evening load like it would have. And just to note, this all happens while following local regulations around discharging to the grid.

如何实现本地和全局目标的共同优化？其中一种方法是本地优化可以考虑关于整体目标的信息，例如市场价格可以指示电网实时平衡需求。例如，在夜间出现的负价格可能是由于风力发电过剩，这可能会导致电池提前充电。而下午出现的高价格可能是由于意外的高需求，这会促使电池在不等待完全抵消晚间负荷的情况下进行放电。需要注意的是，这一切都要遵守关于向电网放电的本地法规。
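A very rough sketch of a price-aware local plan follows. The real on-device optimization is not described in detail in the talk, so this uses a simple threshold rule as a stand-in. `plan_battery`, the `low` and `high` price thresholds, and all numbers are assumptions. The rule charges when the forecast price is at or below `low`, discharges when it is at or above `high`, and always respects the battery's state of charge.

```python
from typing import List


def plan_battery(prices: List[float], soc_kwh: float, capacity_kwh: float,
                 power_kw: float, low: float, high: float,
                 step_h: float = 1.0) -> List[float]:
    """Return energy per step in kWh: positive charges, negative discharges."""
    plan = []
    for price in prices:
        if price <= low:
            # Cheap or negative price: charge, limited by power and free capacity.
            energy = min(power_kw * step_h, capacity_kwh - soc_kwh)
        elif price >= high:
            # Expensive price: discharge, limited by power and stored energy.
            energy = -min(power_kw * step_h, soc_kwh)
        else:
            energy = 0.0
        soc_kwh += energy
        plan.append(energy)
    return plan


# Invented forecast: a negative night price and a high afternoon price.
plan = plan_battery([-5.0, 20.0, 20.0, 90.0, 30.0], soc_kwh=2.0,
                    capacity_kwh=10.0, power_kw=5.0, low=0.0, high=80.0)
```

With these inputs the battery charges 5 kWh in the negative-price step and discharges 5 kWh in the high-price step. A real optimizer would also weigh the home's own load, solar forecast, backup reserve, and export regulations.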

In our co-optimized virtual power plant, Autobidder generates a time series of price forecasts every 15 minutes. And the Tesla Energy Platform's control component distributes those forecasts to the sites. The local optimization then runs and makes a plan for the battery, given both the local and global objectives. And then the sites communicate that plan back to the Tesla Energy Platform, which ingests and aggregates it using the same framework that ingests and aggregates telemetry. And Autobidder then uses the aggregate plans to decide what to bid.

在我们协同优化的虚拟电厂中，Autobidder每15分钟生成一个价格预测时间序列。特斯拉能源平台的控制组件将这些预测分发到各个站点。然后本地优化运行，在同时考虑本地和全局目标的情况下为电池制定计划。站点再将这些计划回传给特斯拉能源平台，平台使用与摄取和聚合遥测数据相同的框架对其进行摄取和汇总。Autobidder随后使用汇总后的计划来决定如何投标。

This distributed algorithm has a couple of big advantages. One is scalability. We're taking advantage of edge computing power here, and we're not solving one huge optimization problem over all sites. As more sites join the aggregation, we don't have to worry about our central optimization falling over. Another big advantage is resilience to the inevitable intermittency of communication. When sites go offline for short or moderate amounts of time, they still have the last version they received of this global time series of prices, and they can continue to co-optimize using the best estimate of the global objective. And then if the sites are offline for longer than the length of that price time series, they just revert to purely local optimization. And this is really reasonable behavior. In the case of degraded connectivity, it's still creating local value for the local site.

这个分布式算法有几个很大的优势。一种是可扩展性。我们在这里利用了边缘计算的能力。而且我们不是在整个站点上解决一个巨大的优化问题。随着越来越多的站点加入聚合，我们不必担心我们的中央优化会出现问题。我们的最大优势是对通信中必然的间歇性的弹性。当站点短暂或适度的时间离线时，它们会拥有此全局价格时间序列的最后一个版本。它们可以继续使用全局目标的最佳估计进行协同优化。如果站点离线时间超过了该价格时间序列的长度，它们将转向纯粹的本地优化。这是一个非常合理的行为。即使在连接质量下降的情况下，它仍然为本地站点创造本地价值。
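The fallback behavior can be expressed as a small decision function on the device. This is a sketch under assumptions: `choose_mode` is a hypothetical name, timestamps are plain seconds, and the forecast is a list with one price per fixed-length step (900 seconds in the example, matching a 15-minute interval).

```python
from typing import List, Tuple


def choose_mode(price_forecast: List[float], received_at_s: float,
                step_s: float, now_s: float) -> Tuple[str, List[float]]:
    """Use the last received prices while they cover the present, else go local."""
    elapsed_steps = max(0, int((now_s - received_at_s) // step_s))
    remaining = price_forecast[elapsed_steps:]
    if remaining:
        return "co-optimize", remaining
    return "local", []


forecast_prices = [10.0, 20.0, 30.0, 40.0]   # four 15-minute steps
short_outage = choose_mode(forecast_prices, received_at_s=0.0, step_s=900.0, now_s=1000.0)
long_outage = choose_mode(forecast_prices, received_at_s=0.0, step_s=900.0, now_s=3600.0)
```

After a short outage the site keeps co-optimizing against the three prices that are still in the future. Once the outage outlasts the forecast horizon, `choose_mode` returns `"local"` and the site optimizes purely for the home.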

And then, from the other perspective, that of the server, the telemetry aggregation accounts for offline sites out of the box. If sites haven't reported signals in a certain amount of time, they're excluded from the aggregate.

然后，从服务器的另一个角度来看，遥测汇总默认将离线站点考虑在内。如果站点在一定的时间内没有报告信号，则会从汇总中排除。
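On the server side, excluding stale sites can be as simple as filtering on the time of the last report before summing. The five-minute cutoff, the dictionary field names, and `biddable_capacity_kw` are all assumptions made for this sketch.

```python
from typing import Dict, List, Tuple


def biddable_capacity_kw(sites: List[Dict[str, float]], now_s: float,
                         max_age_s: float = 300.0) -> Tuple[float, int]:
    """Sum planned power over sites that reported recently; skip stale ones."""
    fresh = [site for site in sites if now_s - site["last_seen_s"] <= max_age_s]
    return sum(site["planned_kw"] for site in fresh), len(fresh)


sites = [
    {"last_seen_s": 990.0, "planned_kw": 3.0},   # reported 10 s ago
    {"last_seen_s": 100.0, "planned_kw": 5.0},   # stale, excluded
    {"last_seen_s": 800.0, "planned_kw": 2.0},   # reported 200 s ago
]
capacity_kw, site_count = biddable_capacity_kw(sites, now_s=1000.0)
```

Only the two recently seen sites contribute, giving 5 kW from 2 sites. This errs on the side of underestimating what the fleet can deliver.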

And so then Autobidder is able to bid conservatively and assume that offline sites are not available to participate in market bids. Tesla's unique vertical integration of hardware, firmware, and software enables this distributed algorithm. And the vertical integration lets us build a better overall solution.

因此，Auto-bitter能够保守地投标，并假设离线站点无法参与市场投标。特斯拉独特的垂直硬件固件软件整合使得这个分布式算法成为可能。而垂直整合则让我们能够构建出更好的整体解决方案。

This distributed algorithm makes the virtual power plant more resilient. Devices are able to behave in a reasonable way during the inevitable communications failures of a distributed system. And this algorithm is only possible because of the high-quality and extensible Tesla Energy Platform that embraces uncertainty and models reality.

这种分布式算法使虚拟电厂更具弹性。在分布式系统不可避免的通信故障期间，设备能够以合理的方式行为。而且，这种算法之所以可能，是因为特斯拉高质量、可扩展的能源平台能够处理不确定性并建模现实。

And at the same time, the algorithms help the software platform. The algorithms enhance the overall value of the product. So in our journey building the Tesla Energy virtual power plant, we've found it very true that while the algorithms are obviously important to the virtual power plant's success, the architecture and reliability of the overall system are the key to the solution.

同时，算法也帮助软件平台。算法增强了产品的整体价值。因此，在我们建设特斯拉能源虚拟电厂的过程中，我们发现算法显然对虚拟电厂的成功至关重要，然而整个系统的架构和可靠性才是解决方案的关键。

It's the system that allows us to provide reliable power to people who have never had it before, balance renewables on the grid, provide flexible energy solutions for disaster relief, and build highly integrated products and services that deliver a superior customer experience.

这就是我们可以为那些以前从未享受到可靠电力的人提供电力的系统。通过在电网上平衡可再生能源，或为灾难救援提供灵活的能源解决方案。并构建高度整合的产品和服务，为我们带来纯粹的客户体验。

So we're working on a mix of the most interesting and challenging problems in distributed computing as well as some of the most challenging and interesting problems in distributed renewable energy. We're hiring if you want to work on these challenging important problems with us, of course. But equally importantly.

我们正在研究分布式计算中最有趣和具有挑战性的问题以及分布式可再生能源中最具挑战性和有趣的问题的混合。当然，如果你想与我们一起解决这些具有挑战性和重要性的问题，我们正在招聘人员。但同样重要的是。

Software has the potential to address many of the most pressing problems in the world, from renewable energy and climate change to food and agriculture to cancer and infectious disease research. So let's take our talents in software engineering and work on the most important and lasting problems that we can find.

这个领域有潜力解决世界上许多最紧迫的问题，从可再生能源和气候变化到食品和农业，再到癌症和传染性疾病研究。因此，让我们发挥我们在软件工程方面的才能，致力于解决我们能找到的最重要和持久的问题。