The electric grid is the largest and most complex machine ever built. It's an amazing feat of engineering providing reliable, safe and on-demand power.
This grid is built on 20th century technology with large centralized generation, mostly fossil fuel based and only a few points of control. We now face an urgent challenge to transition off of fossil fuels in order to prevent the worst effects of climate change.
Fortunately, we also now have new tools: clean power generation like wind, solar, and hydro is cheap and getting cheaper. But this hardware on its own is not enough to replace fossil fuels while maintaining our current standard of on-demand and reliable power.
Software is really the key to enabling these diverse components to act in concert. And one of the things we can do is bring together thousands of small batteries in people's homes to create virtual power plants, providing value to both the electrical grid and the home or business owner.
This marries some of the most interesting and challenging problems in distributed computing with some of the most important and challenging problems in distributed renewable energy. And this is why we work at Tesla. We get to work on these exciting software challenges while also accelerating the world's transition to renewable energy.
We're going to take you through the evolution of the Tesla virtual power plant and share architectures, patterns and practices for distributed computing and IoT that have helped us tackle these complex and exciting challenges. I'm Percy. I'm a software engineer and technical lead on the team that builds Tesla's energy optimization and market participation platform.
And I'm Colin. I'm a software engineer and I lead the teams that build and operate the cloud IoT platforms for Tesla energy products. And just a disclaimer, before we start, we do not speak on behalf of Tesla. We are just representing our personal experiences.
So before we dig into the software, let's cover some background on how the grid works and on the role that batteries play in it so that you're set up to appreciate the software problems. The tricky thing about the power grid is that supply and demand have to match in real time or else frequency and voltage can deviate and this can damage devices and lead to blackouts.
The grid itself has no ability to store power, so the incoming power supply and outgoing power consumption need to be controlled in a way that maintains the balance. With old-style centralized fossil fuel generation, supply could be turned up and down according to demand, and there were a relatively small number of plants to control, which made it relatively straightforward to maintain the balance.
As more renewable generation comes onto the grid, a few things happen. First, reduced control. Renewable generation can't be as easily turned up to follow demand, and we don't want to turn generation down or else we're losing some of our clean energy. Second, uncertainty and rapid change.
Generation can't be forecast precisely and it can change quickly. And third, distribution: there are many small generators behaving independently. So in a grid with large amounts of wind and solar generation, the supply might look something like this, with variability in supply and with times of high supply not aligned with times of high demand. And this can result in power surpluses and deficits that are larger than before and that can change fairly rapidly.
Batteries can charge during the surpluses and discharge during the deficits and they can respond very quickly to offset any rapid swings in imbalance. And this rapid response is actually even an innovation, an opportunity to be better than the old grid. It's not just a compromise.
And so to fulfill this role, we could just install giant batteries. Batteries the size of a typical coal or natural gas power plant and they can and do play an important part of the equation. But we can also take advantage of smaller batteries already installed in individual homes that are already providing local value like backup power or helping the owner consume more of their own solar generation. We can aggregate homes and businesses with these smaller batteries and solar into virtual power plants.
So in this presentation, we'll walk you through the evolution of the Tesla energy platform for virtual power plants. And it's broken into four sections with each stage laying the foundation for the next. We'll start with the development of the Tesla energy platform. Then we'll describe how we learn to participate in energy markets and how we learn to build software to do this algorithmically using a single battery, the largest battery in the world.
Then we'll talk about our first virtual power plant where we learn to aggregate and directly control thousands of batteries in near real time in people's homes. And finally, we'll talk about how we combine all of these platforms and experiences to aggregate, optimize, and control thousands of batteries for energy market participation.
So let's begin with the architecture of the Tesla energy platform. This platform was built for both residential and industrial customers. So for residential customers, the platform supports products like the Powerwall home battery, which can provide backup power for a house for hours or days in the event of a power outage, solar roof, which produces power from beautiful roofing tiles, and retrofit solar. And the solar products can be paired with Powerwall to provide not only backup power but also maximize solar energy production. And we use software to deliver an integrated product experience across solar generation, energy storage, backup power, transportation, and vehicle charging, as well as create unique products like Stormwatch, where we will charge your Powerwall to full when alerted to an approaching storm so that you have full backup power if the power goes out. Part of the customer experience is viewing the real time performance of the system in the mobile app, and customers can control some behaviors, such as prioritizing charging during low-cost times.
For industrial customers, the software platform supports products like Powerpack and Megapack for large-scale energy storage. As well as industrial-scale solar. Software platforms like Powerhub allow customers to monitor the performance of their systems in real time, or inspect historical performance over days, weeks, or even years.
Now these products for solar generation, energy storage, transportation, and charging all have an edge computing platform. And zooming in on that edge computing platform for energy, it's used to interface with a diverse set of sensors and controllers, things like inverters, bus controllers, and power stages. It runs a full Linux operating system and provides local data storage, computation, and control, while also maintaining bidirectional streaming communication with the cloud over WebSocket, so that it can regularly send measurements to the cloud, for some applications as frequently as once a second. And it can also be commanded on demand from the cloud.
Now we'll mention a few things throughout the presentation about this edge computing platform, but our main focus is going to be on that cloud IoT platform. And the foundation of this platform is this linearly scalable WebSocket front end that handles connectivity as well as security. It has a Kafka cluster behind it for ingesting large volumes of telemetry for millions of IoT devices. And this provides messaging durability, decouples publishers of data from consumers of data, and it allows for sharing this telemetry across many downstream services.
The platform also has a service for publish-subscribe messaging, enabling bidirectional command and control of IoT devices. And these three services together are offered as shared infrastructure throughout Tesla, on which we build higher-order services. On the other side of the equation are the customer-facing applications supporting the products that I just highlighted.
The APIs for energy products are organized broadly into three domains. The first are APIs for querying telemetry, alerts, and events from devices, or streaming these as they happen. Second are APIs for describing energy assets and the relationships among these assets. And lastly, APIs for commanding and controlling energy devices like batteries.
Now the backing services for these APIs are composed of approximately 150 polyglot microservices, far too many to detail in this presentation. I'll just provide a high-level understanding of the microservices in each domain, and we're going to dive a bit deeper into a few of them later when we look at the virtual power plant. And a theme you'll see throughout is the challenge of handling real-time data at IoT scale.
Imagine a battery installed in everybody's home. So to support efficient queries and rollups of telemetry, these would be queries like what's the power output over the past day or week, we use InfluxDB, which is an open-source, purpose-built time series database. It depends on the data stream and the application, but generally our goal is to make historical data available to the customer for the lifetime of the product. We maintain a large number of low-latency streaming services for data ingestion and transformation. And for some of these Kafka topics, the very first thing we do is create a canonical topic where data are already filtered and refined into very strict data types. This is more efficient because it removes this burden from every downstream service, and it also provides consistency across the downstream consumers.
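As a rough illustration of that canonicalization step, here's a minimal Alpakka Kafka sketch, with hypothetical topic names, a toy CSV parser, and a placeholder serialization standing in for the strict protobuf types the real services use:

```scala
import akka.actor.ActorSystem
import akka.kafka.scaladsl.{Consumer, Producer}
import akka.kafka.{ConsumerSettings, ProducerSettings, Subscriptions}
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, ByteArraySerializer, StringDeserializer, StringSerializer}

object CanonicalTelemetryStream extends App {
  implicit val system: ActorSystem = ActorSystem("canonical-telemetry")

  // Hypothetical strictly-typed canonical record; the real services use protobuf messages.
  final case class Telemetry(siteId: String, powerW: Double, timestampMs: Long)

  // Toy parser for a "siteId,powerW,timestampMs" payload: malformed records are dropped
  // here once, instead of in every downstream consumer.
  def parse(raw: Array[Byte]): Option[Telemetry] =
    new String(raw, "UTF-8").split(',') match {
      case Array(site, power, ts) =>
        for { p <- power.toDoubleOption; t <- ts.toLongOption } yield Telemetry(site, p, t)
      case _ => None
    }

  // Placeholder encoding; in practice this would be protobuf serialization.
  def toBytes(t: Telemetry): Array[Byte] =
    s"${t.siteId},${t.powerW},${t.timestampMs}".getBytes("UTF-8")

  val consumerSettings = ConsumerSettings(system, new StringDeserializer, new ByteArrayDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("canonicalizer")

  val producerSettings = ProducerSettings(system, new StringSerializer, new ByteArraySerializer)
    .withBootstrapServers("localhost:9092")

  // Backpressured, low-latency stream: raw topic in, canonical topic out.
  Consumer
    .plainSource(consumerSettings, Subscriptions.topics("telemetry-raw"))
    .mapConcat(record => parse(record.value()).toList)
    .map(t => new ProducerRecord("telemetry-canonical", t.siteId, toBytes(t)))
    .runWith(Producer.plainSink(producerSettings))
}
```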
A very unique challenge in this domain is the streaming real-time aggregation of telemetry from thousands of batteries. This is a service that we'll look at in much more detail because it forms one of the foundations of the virtual power plant.
Now, like any large company, product and customer information comes from many, many different business systems, and it's really unworkable to have every microservice connect to every business system, many of which are not designed to be internet-facing or IoT scale. So the asset management services serve four purposes. One is to abstract and unify these many different business systems into one consistent API. Two is to provide a consistent source of truth, especially when there are conflicting data. Three, it provides a kind of type system, where applications can rely on the same attributes for the same type of device, like a battery. And fourth, it describes unique relationships among these energy assets, like which devices can talk to each other and who can control them. And it relies heavily on a Postgres database to describe these relationships.
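To make the idea of a type system for assets concrete, here's a simplified sketch, with hypothetical types and fields, of the kind of strict model applications can rely on:

```scala
// Hypothetical, simplified asset model: a strict type system for devices and the
// relationships among them, unified from many upstream business systems.
sealed trait Asset { def id: String }

final case class Battery(id: String, capacityWh: Double, maxPowerW: Double) extends Asset
final case class Inverter(id: String, ratedPowerW: Double) extends Asset
final case class SolarArray(id: String, ratedPowerW: Double) extends Asset

// Relationships: which devices belong to a site, and who is allowed to control them.
final case class Site(id: String, assets: List[Asset], controlledBy: Option[String])
```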
Now we use Kafka to integrate changes as they happen from many of these different business systems, or we stream the changes directly from IoT devices. And at scale actually this is a lot more reliable. Devices are often the most reliable source of truth, self-reporting their configuration, state and relationships.
Now a digital twin is the representation of a physical IoT device, a battery, an inverter, a charger, in software, modeled virtually. And we do a lot of digital twin modeling to represent the current state and relationships of various assets. Finally, there are services for commanding and controlling IoT devices, like telling a battery to discharge at a given power set point for a specific duration. And similar to both the telemetry and asset domains, we need a streaming, stateful and real-time representation of IoT devices at scale, including modeling this inherent uncertainty that comes with controlling IoT devices over the internet.
Now, Akka has been an essential tool for us for building these microservices. Akka is a toolkit for distributed computing, and it also supports actor model programming, which is great for modeling the state of individual entities like a battery, while also providing a model for concurrency and distribution based on asynchronous and immutable message passing. It's a really, really great model for IoT, and I'll provide some specific examples later in the presentation. Another part of the Akka toolkit that we use extensively is the Reactive Streams component called Akka Streams. Akka Streams provides sophisticated primitives for flow control, concurrency, and data management, all with back pressure under the hood, ensuring that the services have bounded resource constraints. And generally, all the developer writes are functions, and then Akka Streams handles the system dynamics, allowing processes to bend and stretch as the load of the system changes and the messaging volume changes.
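As a small, hedged example of the actor model for IoT (not Tesla's actual code), here's an Akka Typed actor that owns the latest state of a single battery; the runtime handles concurrency, so no threads or locks appear in the application code:

```scala
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

// A minimal sketch of modeling a single battery as an actor: the actor owns the
// latest reported telemetry and is updated and queried only via messages.
object BatteryTwin {
  sealed trait Command
  final case class TelemetryReported(powerW: Double, stateOfCharge: Double) extends Command
  final case class GetState(replyTo: ActorRef[State]) extends Command

  final case class State(powerW: Double, stateOfCharge: Double)

  def apply(state: State = State(0.0, 0.0)): Behavior[Command] =
    Behaviors.receiveMessage {
      case TelemetryReported(p, soc) => apply(State(p, soc)) // immutable state update
      case GetState(replyTo)         => replyTo ! state; Behaviors.same
    }
}
```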
The Alpakka project has a large number of these Reactive Streams interfaces to services like Kafka or AWS S3, and Alpakka is what we use for interfacing with Kafka extensively. We don't actually use Kafka Streams, because we find the interface there is too simplistic for our use cases and specific to that ecosystem, whereas Akka Streams provides a much more general-purpose streaming tool.
Now, like any large platform, there's a mix of languages, but our primary programming language is Scala, and the reason we came to Scala was through Akka, because it's really the first-class way to use Akka. And then we really kind of fell in love with Scala's rich type system, and we've become big fans of functional programming for building large, complex, distributed systems. So we like things like the compile-time safety, immutability, pure functions, composition, and doing things like modeling errors as data rather than throwing exceptions. And for a small team, having a primary programming language where you invest in a deep understanding and first-class tooling is a huge boost to productivity, job satisfaction, and the overall quality of a complex system.
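For instance, a tiny sketch of modeling errors as data rather than throwing exceptions, using hypothetical error types:

```scala
// Illustration only: failures become values that compose, instead of exceptions.
sealed trait CommandError
case object DeviceOffline extends CommandError
final case class InvalidSetpoint(watts: Double) extends CommandError

def validateSetpoint(watts: Double, maxPowerW: Double): Either[CommandError, Double] =
  if (watts.abs <= maxPowerW) Right(watts) else Left(InvalidSetpoint(watts))

// Composition with a for-comprehension: the first error short-circuits.
def plan(requested: Double, maxPowerW: Double): Either[CommandError, String] =
  for {
    w <- validateSetpoint(requested, maxPowerW)
  } yield s"discharge at $w W"
```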
The majority of our microservices run in Kubernetes, and the pairing of Akka and Kubernetes is really, really fantastic. Kubernetes can handle coarse-grained failures and scaling, so that would be things like scaling pods up or down, running liveness probes, or restarting a failed pod with an exponential back-off. And then we use Akka for handling fine-grained failures, like circuit breaking or retrying an individual request, and for modeling the state of individual entities, like the fact that a battery is charging or discharging. And then we use Akka Streams for handling the system dynamics in these message-based, real-time streaming systems.
The initial platform was built with traditional HTTP APIs and JSON, which allowed rapid development of the initial platform. But over the past year, we've invested much more in gRPC. It's been a big win. It's now our preference for new services, or if we extend older services. And it brought three distinct advantages. Strict contracts make these systems much more reliable. Code generation of clients means we're no longer writing clients, which is great. And third, and somewhat unexpected, we saw much improved cross-team collaboration around these contracts. And we're not just seeing this with gRPC, because we also prefer protobuf for our streaming messages, including the ones that are going through Kafka. And we maintain a single repository where we share these contracts and then collaborate across projects.
I've mentioned this theme of strict typing a few times, rich types in Scala, strict schema with protobuf, and then these strict asset models for systems integration. And constraints ultimately provide freedom, and they allow decoupling of microservices and decoupling of teams. And constraints are really a foundation for reliability in large-scale distributed systems.
So, takeaways from building the Tesla energy platform. We were lucky to embrace the principles of reactive systems from day one. And this produced incredibly robust, reliable, and effective systems. Reactive Streams is a really important component for handling the system dynamics and providing resource constraints, while also providing this rich, general-purpose API for streaming. Now, what's needed to build these complex services, especially in IoT, is a toolkit for distributed computing. For us, that's been Akka. For others, that might be Erlang/OTP. And I think now we're also seeing the evolution of stateful serverless platforms to support the same building blocks. And I kind of imagine that's how we're all going to be programming these systems in the future. So that's things like managing state, modeling individual entities at scale, workflow management, streaming interfaces, and then allowing the runtime to handle concurrency, distribution, and failure.
Strict contracts make systems more reliable and allow services and teams to be more decoupled while also improving collaboration. And don't develop every microservice differently just because you can. Compound your investments in your knowledge and in your tooling by creating a deep understanding and a paved path in your primary tool set.
So on top of the Tesla energy platform that Colin described, we built our first power-plant-type application. In this phase, we were learning how to productize real-time forecasting and optimization of batteries. And in this case, we started with a single, albeit very large, battery, which was the Hornsdale battery. The Hornsdale battery was built on a tight timeline because of this famous tweet. And it's the largest battery in the world at 100 megawatts and 129 megawatt-hours, which is about the size of a gas turbine. And Hornsdale helps keep the grid stable even as more renewables are coming online. And not only is it keeping the grid stable, it's actually reduced the cost to customers of doing so. And it helps the grid by providing multiple kinds of services.
So during extreme events, like when a generator trips offline, Hornsdale responds nearly instantaneously to balance big frequency excursions that could otherwise cause a blackout. But even during normal times, whereas a conventional generator's response lags the grid operator's signal by the order of minutes, the battery can follow the grid operator frequency regulation commands nearly instantaneously. And this helps maintain a safe grid frequency. So this big battery provides these services to the grid by way of the energy market.
And why do we have to participate in an energy market? So recall the requirement that supply and demand have to be balanced in real time. Markets are an economically efficient way to make them balance. Participants bid in what they're willing to produce or consume at different price levels and at different time scales. And the operator activates the participants who can provide the necessary services at the lowest price. And if we want the battery to continually provide services to the grid and help stabilize it as more renewables come online, we need to be participating in energy markets.
And in order to participate in energy markets, we need software. So to do this, we built Autobidder to operate Hornsdale, and now we offer it as a product. This is the UI for Autobidder. It's a pro tool that's intended for control room operators who look at it day in and day out. There's a lot of information on the screen, I know. But it's running workflows that fetch data, forecast prices and renewable generation, decide on an optimal bid, and then submit it. And these workflows run every five minutes. That's the cadence of the Australian market.
And at a high level, the optimization problem involves trade-offs across different market products, which are different kinds of services that the battery can provide to the grid, and trade-offs in time, since the battery has a finite amount of energy. Autobidder is built in the applications layer of the Tesla energy platform that Colin described. And it consists of several microservices, and it interacts with both the platform and with third-party APIs.
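As a rough, illustrative formulation (not the actual Autobidder optimization), the trade-off across market products and time can be sketched as choosing an energy dispatch $p_t$ and a reserve capacity offer $r_t$ for each market interval $t$, given price forecasts $\hat{\pi}^E_t$ and $\hat{\pi}^R_t$, subject to the battery's energy and power limits:

$$
\max_{p_t,\; r_t}\; \sum_t \left( \hat{\pi}^E_t\, p_t\, \Delta t + \hat{\pi}^R_t\, r_t \right)
\quad \text{subject to} \quad
e_{t+1} = e_t - p_t\, \Delta t,\qquad
0 \le e_t \le E_{\max},\qquad
\lvert p_t \rvert + r_t \le P_{\max}
$$

where $e_t$ is the stored energy and $\Delta t$ is the five-minute market interval.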
And Autobidder is fundamentally a workflow orchestration platform. And so you might ask why we built our own rather than using an open source tool. The key thing is that this is operational technology. These aren't batch or offline jobs, and it's critical for financial and physical reasons that these workflows run. We also wanted to leverage our primary tool set, which allowed us to avoid introducing a new language and new infrastructure into our stack.
The center of the system is the orchestrator microservice, and this runs the bidding workflows. And a principle we hold to is that we keep this core as simple as possible and contain complexity in the peripheral services. So the market data service abstracts the ETL of complex input data. This data has diverse kinds of timing when it arrives relative to the market cycle. And this service handles that timing, and it handles fallbacks in case of late-arriving or missing data.
There's a forecast service and an optimization service that execute algorithm code, and a bid service to interact with the market submission interface. The orchestrator, market data service, and bid service are written in Scala. And again, this common toolkit gives us great concurrency semantics, functional programming, type safety, and compounding best practices across our teams. However, the forecast and optimization services are in Python, and this is because it's very important to us to enable rapid algorithm improvement and development.
And Python gives us a couple of things there. There are key numerical and solver libraries available in Python. And also the algorithm engineers on our team are more fluent in Python, and having these services in Python empowers them to own the core logic there and iterate on it. The communication between the market data and bidding services and the orchestrator happens over gRPC, with the benefits Colin described: strict contracts, code generation, and collaboration. But the communication between the orchestrator and the forecasting and optimization services uses Amazon SQS message queues.
And these queues give us durable delivery, retries in case of consumer failures, and they easily support long-running tasks without a long-lived network connection between services. We use an immutable input/output messaging model, and the messages have strict schemas. This allows us to persist the immutable inputs and outputs and have them available for back testing, which is an important part of our overall team's mission. Also, SQS allows us to build worker pools.
So like I said, forecast and optimization are in Python, which has somewhat cumbersome concurrency semantics. And the message queue allows us to implement concurrency across workers instead of within a worker. And it's notable that these services are effectively functions. They take inputs and produce outputs without other effects. And this keeps them more testable, makes these important algorithm changes and improvements safer, and also relieves algorithm engineers of the burden of writing IO code and lets us use Scala concurrency for IO.
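As a hedged sketch of that immutable input/output model (queue URLs, field names, and the JSON encoding here are hypothetical; the real messages use strict schemas), this is roughly what submitting a forecast request from the Scala orchestrator to an SQS input queue could look like with the AWS SDK:

```scala
import software.amazon.awssdk.services.sqs.SqsClient
import software.amazon.awssdk.services.sqs.model.SendMessageRequest

// Hypothetical immutable request message consumed by a pool of Python forecast workers.
final case class ForecastRequest(workflowId: String, marketIntervalStart: Long, horizonIntervals: Int)

// Placeholder encoding; in practice a strict, shared schema would be used.
def toJson(req: ForecastRequest): String =
  s"""{"workflowId":"${req.workflowId}","marketIntervalStart":${req.marketIntervalStart},"horizonIntervals":${req.horizonIntervals}}"""

def submitForecastRequest(sqs: SqsClient, inputQueueUrl: String, req: ForecastRequest): Unit = {
  // The message is an immutable value: persisting it alongside the worker's output
  // makes each run reproducible and available for back-testing.
  val message = SendMessageRequest.builder()
    .queueUrl(inputQueueUrl)
    .messageBody(toJson(req))
    .build()
  sqs.sendMessage(message)
}

// Usage sketch:
// val sqs = SqsClient.create() // default credentials/region providers
// submitForecastRequest(sqs, inputQueueUrl = "https://sqs/.../forecast-requests",
//   ForecastRequest("wf-123", marketIntervalStart = 0L, horizonIntervals = 12))
```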
So stepping back and looking at these workflows as a whole, these workflows are stateful. And the state is a set of immutable facts that are generated by business logic stages. These stages happen sequentially in time. And the workflow's state includes things like knowing what the current stage is and accumulating the results of tasks within the current stage and across stages. And some stages, like the forecast stage, have multiple tasks that need to be accumulated before deciding to proceed. And some stages might need the outputs of multiple previous stages, not just the immediate predecessor.
In case of a failure, like the orchestrator pod restarting, we don't want to forget that a workflow was in progress, and we'd prefer not to completely restart it. So we can instead take snapshots of the state at checkpoints. And if the workflow fails, it can be resumed from the last checkpoint. And we keep the state in an Akka actor representing the workflow state machine. And Akka Persistence gives us transparent resumption of the state through checkpointing and an event journal.
But an important lesson we've learned is to keep the business logic of stage execution in pure functions, separate from the actor, as much as possible. This makes testing and composition of that business logic so much easier. And the new Akka Typed API naturally helps with that decomposition.
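Here's a minimal sketch, not the actual orchestrator, of that separation using Akka Persistence: the business logic is two pure functions, `decide` and `evolve`, and the actor only persists events and takes periodic snapshots so a restarted pod resumes from the last checkpoint. The stage and message names are hypothetical:

```scala
import akka.actor.typed.Behavior
import akka.persistence.typed.PersistenceId
import akka.persistence.typed.scaladsl.{Effect, EventSourcedBehavior, RetentionCriteria}

object BidWorkflow {
  sealed trait Command
  final case class ForecastCompleted(prices: Vector[Double]) extends Command
  final case class BidSubmitted(bidId: String) extends Command

  sealed trait Event
  final case class ForecastRecorded(prices: Vector[Double]) extends Event
  final case class BidRecorded(bidId: String) extends Event

  final case class State(prices: Option[Vector[Double]] = None, bidId: Option[String] = None)

  // Pure business logic: command + state in, events out. Trivial to unit test.
  def decide(state: State, cmd: Command): List[Event] = cmd match {
    case ForecastCompleted(p) if state.prices.isEmpty => List(ForecastRecorded(p))
    case BidSubmitted(id) if state.prices.nonEmpty    => List(BidRecorded(id))
    case _                                            => Nil
  }

  // Pure state transition: fold an event into the workflow state.
  def evolve(state: State, event: Event): State = event match {
    case ForecastRecorded(p) => state.copy(prices = Some(p))
    case BidRecorded(id)     => state.copy(bidId = Some(id))
  }

  def apply(workflowId: String): Behavior[Command] =
    EventSourcedBehavior[Command, Event, State](
      persistenceId = PersistenceId.ofUniqueId(s"bid-workflow-$workflowId"),
      emptyState = State(),
      commandHandler = (state, cmd) => Effect.persist(decide(state, cmd)),
      eventHandler = evolve
    ).withRetention(RetentionCriteria.snapshotEvery(numberOfEvents = 10, keepNSnapshots = 2))
}
```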
On our team, it's very important to enable rapid development, improvement, and iteration of algorithms. And so we have Python in specific places in our system. But we also really need to minimize the risk that iteration on the algorithms breaks workflows. And a couple of things that work really well for us to minimize that risk are an input/output model for the algorithmic services, which keeps that code simpler and more easily testable, and strict contracts, which again give freedom to change algorithm-internal logic independently of the rest of the system.
It's been important for us to abstract the messy details of external data sources and services from the core system. And this is a fundamental tenet of the whole platform, actually. And these workflows are inevitably stateful. But entangling state with the business logic stages can lead to spaghetti code. Instead, keep the business logic stages functional, testable, and composable.
Okay. In the next part, we're going to describe our first virtual power plant application. So Percy just described how we leverage the platform to participate in the energy markets algorithmically with one large battery. Now we'll focus on how we extend that, and use what we learned to measure, model, and control a fleet of thousands of Powerwalls that are installed in people's homes to do peak shaving for an electrical utility.
Now, before I detail the software architecture, I'll describe the problem that we're trying to solve. So this is a graph of aggregate grid load in megawatts. Now, grid load varies with weather and with time of year. This is a typical load profile for a warm summer day. The left-hand side is midnight, the minimum load is around 4 a.m. when most people are sleeping, and then peak load is around 6 p.m. when a lot of people are running air conditioning or cooking dinner.
Now, peak loads are very, very expensive. The grid only needs to meet the peak load a few hours in a year. And the options for satisfying the peak load are to build more capacity, which incurs significant capital costs, and then this capacity is largely underused outside of those peaks. And the other option is to import power from another jurisdiction that has excess, and this is often at a significant premium. So power can be cheaper if we can offset demand and make this load curve more uniform. And that's our objective. We want to discharge Powerwall batteries during the peak grid load, and at other times the homeowner will use this battery for clean backup power.
A lesson we quickly learned as our virtual power plants grew to thousands of Powerwalls and tens of megawatts of power was that charging the batteries back up right after the peak would lead to our own peak, defeating the purpose. And of course, the solution is to control not only when the batteries discharge, but also when they charge, and to spread out the charging over a longer period of time.
Now this is what we're trying to accomplish, this picture, but in reality we don't have the complete picture. There's uncertainty. It's noon, and we're trying to predict whether or not there's going to be a peak. And we only want to discharge batteries if there's a high likelihood of a peak. Once we've decided to discharge batteries to avoid the peak, how do we control them?
And I want to be very clear that we only control Powerwalls that are enrolled in specific virtual power plant programs. We don't arbitrarily control Powerwalls that aren't enrolled in these programs, so not every customer has this feature.
As Percy mentioned, the grid's not designed to interact with a whole bunch of small players. So we need to aggregate these Powerwalls to look more like a traditional grid asset, something like a large steam turbine. And typically we do this by having hierarchical aggregations that are a virtual representation in cloud software.
The first level is a digital twin representing an individual site, so that would be a house with a Powerwall. And the next level might be organized by electrical topology, something like a substation, or it could be by geography, something like a county.
The next level can again be a physical grouping, like an electrical interconnection, or it might be logical, like sites with a battery and sites with a battery plus solar that we want to control or optimize differently. And all of these sites come together to form the top level of the virtual power plant, meaning we can query the aggregate of thousands of Powerwalls as fast as we can query a single Powerwall, and use this aggregate to inform our global optimization.
It's easy to think of the virtual power plant as uniform, but the reality is more like this. There's a diversity of installations and home loads. Some homes have one battery, some have two or three. The batteries are not all fully charged. Some might be half full or close to empty, depending on home loads, time of day, solar production on that day, and the mode of operation.
There's also uncertainty in communication with these sites over the internet, as some of them may be temporarily offline. Finally, there's the asset management problem of new sites coming online regularly, firmware being non-uniform in terms of its capabilities across the whole fleet, and hardware being upgraded and replaced over time.
So it's really critical to represent this uncertainty in the data model and in the business logic. So we want to say things like there's 10 megawatt hours of energy available, but only 95% of the sites we expect to be reporting have reported. And it's really only the consumer of the data that can decide how to interpret this uncertainty based on the local context of that service.
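A simplified sketch of what carrying that uncertainty in the data model might look like, with hypothetical field names, so that each consumer can apply its own local policy:

```scala
// Illustrative only: an aggregate value paired with how complete the reporting was.
final case class AggregateEnergy(
  availableWh: Double, // sum over the sites that have reported
  sitesReporting: Int,
  sitesExpected: Int
) {
  def reportingFraction: Double = sitesReporting.toDouble / sitesExpected
}

// Example: a conservative consumer might only act when reporting coverage is high.
def usableForBidding(agg: AggregateEnergy, minCoverage: Double = 0.95): Double =
  if (agg.reportingFraction >= minCoverage) agg.availableWh else 0.0
```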
So one way we manage this uncertainty is through a site-level abstraction. So even if the sites are heterogeneous, this edge computing platform provides site-level telemetry for things like power, frequency, and voltage that gives us a consistent abstraction in software. And then another way is to aggregate the telemetry across the virtual power plant, because people don't want to worry about controlling individual Powerwall batteries. They want to worry about discharging 10 megawatts from 5 p.m. to 6 p.m. in order to shave the peak.
And this is a really difficult engineering challenge, which is a combination of streaming telemetry and asset modeling. For modeling each site in software, the so-called digital twin, we represent each site with an actor. And the actor manages state, like the latest reported telemetry from that battery, and executes a state machine, changing its behavior if the site is offline and telemetry is delayed. But it also provides a convenient model for concurrency and computation.
So the programmer worries about modeling an individual site in an actor, and then the Akka runtime handles scaling this to thousands or millions of sites. And you don't have to worry about that. It's a very, very powerful abstraction for IoT in particular. And we generally never worry about threads or locks or concurrency bugs.
The higher-level aggregations are also represented by individual actors. And the actors maintain their relationships with other actors, describing this physical or logical aggregation. And then the telemetry is aggregated by messaging up the hierarchy, in memory, in near real time. And how real-time the aggregate is at any level is really just a trade-off between messaging volume and latency.
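Here's a minimal, illustrative sketch (not the production protocol) of that in-memory aggregation: a site actor forwards its latest reading to a parent aggregator actor, which keeps the latest value per child and can answer for the aggregate at its level:

```scala
import akka.actor.typed.{ActorRef, Behavior}
import akka.actor.typed.scaladsl.Behaviors

object Aggregator {
  sealed trait Command
  final case class SitePower(siteId: String, watts: Double) extends Command
  final case class GetTotal(replyTo: ActorRef[Double]) extends Command

  def apply(latestBySite: Map[String, Double] = Map.empty): Behavior[Command] =
    Behaviors.receiveMessage {
      case SitePower(id, w)  => apply(latestBySite.updated(id, w)) // keep the latest reading per child
      case GetTotal(replyTo) => replyTo ! latestBySite.values.sum; Behaviors.same
    }
}

object SiteTwin {
  final case class Telemetry(watts: Double)

  def apply(siteId: String, parent: ActorRef[Aggregator.Command]): Behavior[Telemetry] =
    Behaviors.receiveMessage { t =>
      parent ! Aggregator.SitePower(siteId, t.watts) // message up the hierarchy
      Behaviors.same
    }
}
```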
We can query at any node in this hierarchy to know the aggregate value at that location, or query the latest telemetry from an individual site. And we can also navigate up and down the hierarchy from any point.
Now, the services that perform this real-time hierarchical aggregation run in an Akka Cluster. Akka Cluster allows a set of pods with different roles to communicate with each other transparently. So the first role is a set of linearly scalable pods that stream data off Kafka. And they use Akka Streams for back pressure, bounded resource constraints, and low-latency stream processing. And then they message with a set of pods running all the actors in this virtual representation that I just described.
When the stream processors read a message off Kafka for a particular site, they message the actor representing that site simply using the site identifier. And it doesn't matter where in the cluster that actor is running, the Akka runtime will transparently handle the delivery of that message. This is called location transparency. And site actors message with their parents in a similar manner, all the way up the hierarchy. There's also a set of API pods that can serve client requests for site-level or aggregate telemetry, because they can query into the cluster in this same location-transparent way. And it's this collection of services that provides the in-memory, near real-time aggregation of telemetry for thousands of Powerwalls.

It's an architecture that provides great flexibility, especially when paired with Kubernetes to manage the pods, because the actors are just kind of running on this substrate of compute. They're kind of running on the heap, if you will. An individual pod can fail or be restarted, and the actors that were on that pod will simply migrate to another until it recovers. And the runtime handles doing this. The programmer doesn't have to worry about it. Or the cluster can also be scaled up or down, and the actors will rebalance across the cluster. Actors can recover their state automatically using Akka Persistence. But in this case, we don't actually need to use Akka Persistence, because the actor can just rediscover its relationships as well as the latest state when the next message from the battery arrives within a few seconds.
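As a hedged sketch of that location transparency using Akka Cluster Sharding (entity names and message shapes are illustrative), the stream processors address a site actor purely by its identifier and let the runtime route the message to whichever pod hosts it:

```scala
import akka.actor.typed.scaladsl.Behaviors
import akka.actor.typed.{ActorSystem, Behavior}
import akka.cluster.sharding.typed.scaladsl.{ClusterSharding, Entity, EntityTypeKey}

object ShardedSite {
  final case class Telemetry(watts: Double)

  val TypeKey: EntityTypeKey[Telemetry] = EntityTypeKey[Telemetry]("site")

  def apply(siteId: String): Behavior[Telemetry] =
    Behaviors.receiveMessage { _ =>
      // Here the digital twin would update its state and message its parent aggregation.
      Behaviors.same
    }

  // Register the sharded entity; the cluster decides which pod hosts each site actor.
  def init(system: ActorSystem[_]): Unit =
    ClusterSharding(system).init(Entity(TypeKey)(ctx => ShardedSite(ctx.entityId)))

  // Called by the Kafka stream processors: no knowledge of where the actor runs.
  def deliver(system: ActorSystem[_], siteId: String, watts: Double): Unit =
    ClusterSharding(system).entityRefFor(TypeKey, siteId) ! Telemetry(watts)
}
```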
So to conclude this section, after aggregating telemetry to know the capacity that's available in the virtual power plant, let's look at how the batteries are actually controlled. So the first step is taking past measurements, forecasting, and deciding how many megawatts to discharge if we are going to hit a peak. And at a high level, this loop of measure, forecast, optimize, and control is basically running continuously. And the control part of this loop is true closed loop control. Once an aggregate control set point has been determined, we continuously monitor the disaggregate telemetry from every single site to see how it responds. And we adjust the set point for the individual sites to minimize error. We can take a look at how this works.
The Autobidder platform that Percy described may decide to control the whole fleet. So to give a sense of scale, this might be enough megawatts to offset the need to build a new natural gas peaker plant. Or we might just decide to control a subset of the hierarchy, depending on the objective. Now, the control service that I mentioned earlier dynamically resolves the individual sites under this target by querying the asset service. And this is because the sites can change over time. New sites are installed. The virtual hierarchy might be modified. Or the properties of an individual site might change, maybe you add a second battery.

The control service queries the battery telemetry at every site, potentially thousands of sites, using the in-memory aggregation that I just discussed, to decide how to discharge the battery at each site. There's no point discharging a battery that's almost empty. And you can think of this as somewhat similar to a database query planner, basically trying to plan the execution. The control service then sends a message to each site with a discharge set point and a time frame. And it will keep retrying until it gets an acknowledgment from the site or the time frame has elapsed. Because these logical aggregations of batteries are so large, we stream over these huge collections using Akka Streams to provide bounded resource constraints in all of the steps that I've just described. So that's resolving the sites, reading all of the telemetry, and then sending all the control set points.
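A rough sketch of that bounded, streaming control pass, with hypothetical service interfaces for reading telemetry and sending set points: Akka Streams' `mapAsync` keeps only a fixed number of requests in flight at each step, however large the fleet is:

```scala
import akka.actor.typed.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

final case class SiteStatus(siteId: String, availableWh: Double)

// Illustrative only: the telemetry reader and command sender are assumed async clients.
def dischargeFleet(
  siteIds: List[String],
  readTelemetry: String => Future[SiteStatus],
  sendSetpoint: (String, Double) => Future[Unit],
  targetWattsPerSite: Double
)(implicit system: ActorSystem[_]): Future[akka.Done] =
  Source(siteIds)
    .mapAsync(parallelism = 32)(readTelemetry)  // bounded concurrent reads
    .filter(_.availableWh > 0.0)                // no point discharging an empty battery
    .mapAsync(parallelism = 32)(s => sendSetpoint(s.siteId, targetWattsPerSite))
    .runWith(Sink.ignore)
```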
So huge aggregations demand different APIs and data processing patterns. You can't just go build typical CRUD microservices. Not going to work. You need streaming semantics for processing large collections with low latency and bounded resource constraints. And what we really need is a runtime for modeling stateful entities that support location transparency, concurrency, scaling, and resilience. Uncertainty is inherent in distributed IoT systems. So we need to just embrace this in the data model, in the business logic, and even in the customer experience rather than trying to escape it. And representing physical and virtual relationships among IoT devices, especially as they change over time is the hardest problem in IoT. Trust me. But essential for creating a great product.
Now direct control based on a central objective doesn't account for local needs. And this creates a kind of tension. So imagine a storm is approaching, close to a peak. The global objective wants to discharge these batteries to avoid the peak. But of course the homeowner wants a full battery in case the power goes out. And this leads to the final part of our presentation, the co-optimized virtual power plant.
So just to review where we are. So far we've built on the fundamental platform to first of all optimize a single big battery to participate in an electricity market. And then second, aggregate, optimize, and control thousands of batteries to meet a central goal. And so in this last section, like Colin said, we're again going to aggregate, optimize, and control thousands of batteries. But this time not just for a global goal, we're going to co-optimize local and global objectives.
So whereas the peak-shaving virtual power plant that Colin just described optimized a central objective and passed the control decisions downward to the sites, the market virtual power plant distributes the optimization itself across the sites and the cloud. And the sites in this case actually participate in the control decisions. This distributed optimization is only possible because Tesla builds its own hardware and has full control over firmware and software. This enables quick iteration across the local and central intelligence and how they relate to each other. And this collaboration is cross-team rather than cross-company.
So when we say that this virtual power plant co-optimizes local and global objectives, what do we mean? So let's take a non-virtual-power-plant home. A home with this solar generation and this electricity consumption would have a net load like this. And this is the load that the utility sees. The Powerwall home battery can charge during excess solar generation and discharge during high load, thanks to the local intelligence on the device. And the goal of this would be either to minimize the customer's bill or to maximize how much of their own solar production they're using. This is local optimization.
What does it look like to co-optimize local and global objectives? One way to do it is that the local optimization can consider information about the aggregate goal, like market prices indicating the real-time balancing needs of the grid. So in this example, negative prices in the night, perhaps caused by wind over-generation, might cause the battery to charge earlier. And a high price in the afternoon, caused maybe by unexpectedly high demand, prompts the battery to discharge rather than waiting to fully offset the evening load like it would have. And just to note, this is all while following local regulations around discharging to the grid.
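As a toy illustration only (not Tesla's algorithm), a local plan that also considers the global price signal might look something like this: charge in the cheapest forecast intervals, discharge in the most expensive ones, and do nothing otherwise:

```scala
// Toy local co-optimization: the price forecast is the global signal distributed to the
// site; the power limit and the number of intervals selected are hypothetical.
final case class IntervalPlan(startMinute: Int, powerW: Double) // positive = discharge, negative = charge

def planDay(prices: Vector[Double], maxPowerW: Double, intervalMinutes: Int = 15): Vector[IntervalPlan] = {
  val cheapest = prices.zipWithIndex.sortBy { case (price, _) => price }.take(4).map(_._2).toSet
  val priciest = prices.zipWithIndex.sortBy { case (price, _) => -price }.take(4).map(_._2).toSet
  prices.indices.toVector.map { i =>
    val power =
      if (cheapest(i)) -maxPowerW      // charge when prices are low (or negative)
      else if (priciest(i)) maxPowerW  // discharge into the expensive intervals
      else 0.0
    IntervalPlan(startMinute = i * intervalMinutes, powerW = power)
  }
}
```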
In our co-optimized virtual power plant, Autobidder generates a time series of price forecasts every 15 minutes. And the Tesla energy platform's control component distributes those forecasts to the sites. The local optimization then runs and makes a plan for the battery, given both the local and global objectives. And then the sites communicate that plan back to the Tesla energy platform, which ingests and aggregates it using the same framework that ingests and aggregates telemetry. And Autobidder then uses the aggregate plans to decide what to bid.
This distributed algorithm has a couple of big advantages. One is scalability. We're taking advantage of edge computing power here, and we're not solving one huge optimization problem over all sites. As more sites join the aggregation, we don't have to worry about our central optimization falling over. The other big advantage is resilience to the inevitable intermittency of communication. When sites go offline for short or moderate amounts of time, they have the last received version of this global time series of prices, and they can continue to co-optimize using the best estimate of the global objective. And then if the sites are offline for longer than the length of that price time series, they just revert to purely local optimization. And this is a really reasonable behavior. In the case of degraded connectivity, it's still creating local value for the local site.
And then, from the perspective of the server, the telemetry aggregation accounts for offline sites out of the box. If sites haven't reported signals in a certain amount of time, they're excluded from the aggregate.
And so Autobidder is able to bid conservatively and assume that offline sites are not available to participate in market bids. Tesla's unique vertical hardware-firmware-software integration enables this distributed algorithm. And the vertical integration lets us build a better overall solution.
This distributed algorithm makes the virtual power plant more resilient. Devices are able to behave in a reasonable way during the inevitable communications failures of a distributed system. And this algorithm is only possible because of the high-quality and extensible Tesla energy platform that embraces uncertainty and models reality.
And at the same time, the algorithms help the software platform. The algorithms enhance the overall value of the product. So in our journey building the Tesla energy virtual power plant, we've found it very true that while the algorithms are obviously important to the virtual power plant's success, the architecture and reliability of the overall system are the key to the solution.
It's the system that allows us to provide reliable power to people who have never had it before, balance renewables on the grid, provide flexible energy solutions for disaster relief, and build highly integrated products and services that deliver a superior customer experience.
So we're working on a mix of the most interesting and challenging problems in distributed computing, as well as some of the most challenging and interesting problems in distributed renewable energy. We're hiring if you want to work on these challenging and important problems with us, of course. But equally importantly, software has the potential to address many of the most pressing problems in the world, from renewable energy and climate change, to food and agriculture, to cancer and infectious disease research. So let's take our talents in software engineering and work on the most important and lasting problems that we can find.