2021-02-14 16:29:35 - wk057
This writeup and its accompanying images are Copyright 2021 - Jason Hughes, All rights reserved. No portion of this content may be copied or distributed outside of this site without my express written permission. If you wish to disseminate this information, you may do so in the form of a direct link to this writeup.
I'm posting this on my own (outdated) site for personal reasons (some noted below).
I will preface all of this with a bit of a disclaimer
First, 99.9% of the information here is derived from first hand reverse engineering of Tesla’s BMS, BMS firmware, and other aspects of the Tesla vehicles. Reverse engineering is not an exact science, as in, you don’t get the exact code used to create the software out of the process, and obviously the developer’s commentary/comments are not available either. Much of what is derived relies on some logical interpretation to determine the intended functionality of particular code. While there is a lot of human-readable text in some parts of the code, this isn’t always the case.
I’ve been working on hardware and software design, including reverse engineering of the same, for the better part of 20 years now. While I have extensive knowledge and experience, I don’t personally consider myself to be an expert in the field. Many others definitely disagree, and do consider me to be an expert.
All of that said, what I’m mainly trying to convey is that I’m definitely not perfect. It’s quite possible my interpretation of how all of this works has flaws, and my speculation as to the intention of the developers may also be flawed. I do believe this information is as correct as possible given the information available, however, and is likely the most complete picture that will be available.
I’ll also note that this is a pretty technical explanation of the situation, and some background knowledge of the general topic and technology will be needed to fully understand. I’ll go over some of the background info that is situation-specific, but you’ll likely have to do your own research to fill in any gaps in your knowledge.
If I jump around a lot, my apologies. I'm also not the best writer, overall. I have a lot of information I'm trying to condense into a single writeup here, with lots of connections between the various content.
Finally, I’ll make a blanket note that nothing here has anything to do with any battery fires or safety issues. In a worst case scenario, unmitigated Condition X or Z results in the vehicle being unusable. In no instance is either situation responsible for fires or other safety concerns.
Quick background. Over a year ago I discovered some interesting mechanics in the Tesla BMS code. At the time, it was in my best interest to not publicly disclose those findings in their entirety, so I dubbed the issues the code was attempting to mitigate “Condition X” and “Condition Z”. You can read more about that in a megathread on TMC, if you like, but suffice it to say:
Condition X: A stuck balancing MOSFET (or other similar drain on a single cell group)
Condition Z: Measurement accuracy failure on cell group (BMB sense failure, corrosion, mechanical PCB failure, etc)
Some required background knowledge and definitions
Image: Opened 85 kWh Battery Pack (C) 057 Technology, LLC, Jason Hughes
Tesla Model S and X battery “packs”, as of the time of this writing, consist of 14 or 16 “modules”. Each module is made of 6 “bricks” of 74 “cells”. (There are some exceptions to this, but not important here.) The cells in each brick are connected in parallel. The bricks are connected to other bricks within a module in series. The modules themselves are connected in series to form a pack. This results in most packs having 96 cell groups in series. (For more basic definitions, consult your favorite search engine.)
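As a quick sanity check on those numbers, the layout can be sketched in a few lines of Python. This only uses the counts given above and ignores the exceptions mentioned. The function and constant names are made up for illustration.

```python
# Counts taken from the description above; names are illustrative only.
CELLS_PER_BRICK = 74      # cells wired in parallel inside one brick
BRICKS_PER_MODULE = 6     # bricks wired in series inside one module


def pack_layout(modules):
    """Return the series cell-group count and total cell count for a pack."""
    series_groups = modules * BRICKS_PER_MODULE
    return {
        "series_groups": series_groups,
        "total_cells": series_groups * CELLS_PER_BRICK,
    }


for modules in (14, 16):
    print(modules, "modules:", pack_layout(modules))
```

A 16-module pack works out to the 96 series cell groups mentioned above, and a 14-module pack to 84.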
Each pack module has a BMB (Battery Management Board) attached, which monitors the voltage of each brick on that module, reads some temperature probes, and can bleed energy at a slow rate from each brick individually when needed for balancing. The BMBs are connected to the main BMS (Battery Management System) at the rear of the pack, which controls the bulk of the functionality of the battery’s monitoring and safety systems.
The BMBs can only discharge cell groups; they cannot move energy from one cell group to another or otherwise charge any cell group. This is active balancing, but not charge shuffling. The cells with the highest voltage get their balancer circuits enabled in order to bring their voltage down and in line with the rest of the cell groups within a battery pack.
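The discharge-only idea can be sketched roughly as follows. This is not Tesla's algorithm, just a minimal illustration of "bleed whatever sits above the lowest group". The function name and the 5 mV tolerance are assumptions invented for the example.

```python
def select_bleeders(group_mv, tolerance_mv=5):
    """Pick which cell groups should have their bleed circuit enabled.

    Balancing can only remove energy, so the lowest group is the target
    and every group sitting more than `tolerance_mv` above it is bled.
    `tolerance_mv` is a hypothetical value, not a figure from the BMS.
    """
    target = min(group_mv)
    return [v - target > tolerance_mv for v in group_mv]


readings = [4100, 4102, 4120, 4099, 4101, 4115]  # made-up voltages in mV
print(select_bleeders(readings))
```

With these made-up readings, only the third and sixth groups are far enough above the lowest group to get their balancers enabled.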
Keeping a large lithium ion battery with many cells in series in balance is critical for the safety, reliability, and longevity of the battery pack. I won’t get into all of the details on why this is the case here, but information on this is widely available.
Onward to explanations
Tesla started adding a bunch of problem detection code to the BMS sometime in early 2019. To me, the bulk of this seemed very preemptive, as-in nothing they were looking for had presented itself in the wild to the best of my knowledge. There are tons of things the BMS checks for that I personally never would have even thought of. Kudos to the engineers on this. It’s a remarkable feat for sure.
One of those problem detectors was to detect “Condition X”, a stuck MOSFET on a BMB resulting in a perpetual imbalance of a cell group. This seems like a sensible thing to look for. With hundreds of millions of these tiny MOSFETs in the wild, it’s only a matter of time before one fails closed and causes a problem.
Back when I initially started to see the code and algorithms for Condition X detection and mitigation (through ongoing reverse engineering), I was alarmed. It seemed to me that Tesla was trying to hide a BMS failure/flaw at the expense of customer usability. Since this came about the same time as the now infamous range loss update, and since Condition Z was not yet known, it also seemed like this was a much more widespread problem than it really was.
Further, my analysis of the code, along with influence from the media amplifying a handful of vehicle fires, led me to some incorrect conclusions about the function of some of the code (see my preface). At the time, I misinterpreted a part of the algorithm and thought that they may be mitigating some kind of unexplained voltage _rise_ in a cell group, which could be catastrophic. There’s no evidence this is the case. This line of thought was purely human error on my part when missing the sign bit on a section of reverse engineered code (if you’re not a programmer I suggest you skip understanding this explanation, accept it was a trivial error on my part, and move on).
Based on this, I pretty hastily posted publicly that there may be an issue, and it may be a safety issue. This definitely got some attention on the range loss issue.
I began examining and piecing together the code more closely, and eventually I realized that the code was trying to detect an unexplained voltage _drop_ in a cell group, not a rise (from a stuck MOSFET). This was bad, but definitely way better than what I had originally interpreted. An unexplained drop could result in the loss of use of the battery pack, perpetual imbalance, etc.
After seeing a slew of updates with new BMS code coming down the line, I took a step back and tried to see the bigger picture. Finally the distinction between Condition X and Z became apparent.
Conditions X and Z
I eventually made some followup tweets and TMC posts about the issue, but after getting some mostly unsolicited but solid advice on the matter, along with some professional courtesy and general CYA thinking, I decided not to detail exactly what I believed the conditions were. It was obvious to me that the engineers were working hard to figure this out and come up with a solution. It didn’t make sense to push people in the wrong direction on this and cause more problems all around. At the time there was no mitigation in place for Condition Z, so it wouldn’t do any good to know about it anyway. My suggestion to the community was to just update, make sure you don’t get stranded, and wait it out.
(Some trivial info: I decided on Condition X and Condition Z based on the way engineers were doing various things in the BMS code related to the different algorithms. My hope was that should a Tesla BMS engineer see my notes, they’d make the association. Never confirmed either way.)
If a battery pack has “Condition X”, a stuck MOSFET on a balancer circuit, this is very bad. If the balancer circuit is stuck on, the BMB is always going to be draining power from that cell group. The end result is that balancing the battery pack is now completely impossible.
Why? Because now no cell group can ever be brought to a lower voltage than the group with the stuck balancer circuit. The BMS now has to enable ALL of the remaining balancers in order to bring the rest of the battery pack as close as possible to the group that is too low. This results in a loss of range for multiple reasons: parasitic/vampire drain (roughly 1% SoC per day), and a raised discharge floor for the whole pack, since the lowest cell group is always lower than the rest of the pack. It will eventually result in complete failure of the pack as the imbalance grows beyond acceptable limits, especially if the vehicle is left sitting idle for extended periods. To my knowledge, there have been no non-contrived examples of this in the wild.
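A toy simulation makes the "everything chases the stuck group downward" effect easier to see. This is a sketch under invented assumptions: the per-day bleed amount, the tolerance, and the linear voltage drop are all hypothetical, and none of the names come from the actual BMS code.

```python
def simulate_days(group_mv, stuck_index, days,
                  bleed_mv_per_day=10.0, tolerance_mv=5.0):
    """Toy model of a pack sitting idle with one balancer stuck on.

    Each day the stuck group is bled no matter what, and every other
    group is bled only if it sits more than `tolerance_mv` above the
    lowest group. Both numeric defaults are hypothetical.
    """
    volts = [float(v) for v in group_mv]
    for _ in range(days):
        lowest = min(volts)  # balancing target for this day
        for i, v in enumerate(volts):
            stuck = i == stuck_index
            wants_bleed = v - lowest > tolerance_mv
            if stuck or wants_bleed:
                volts[i] -= bleed_mv_per_day
    return volts


start = [4000] * 6  # a perfectly balanced module, values in mV
print(simulate_days(start, stuck_index=5, days=3))
```

Even starting from a perfectly balanced module, every group ends up lower than it started, and the stuck group stays the lowest the whole time.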
Given the number of vehicles quickly “detecting” Condition X, it was obvious a stuck MOSFET couldn’t explain the triggering. Additionally, the bulk of those detections (in my testing and research) were triggering for the 6th cell group, and not randomly distributed. It also was not being detected at all in newer packs, nor in the oldest of packs, both of which also had the same MOSFETs. Clearly something was amiss, since if it were a MOSFET failure, the failures should be evenly distributed among the fleet.
Image: BMB v1.5 with Cell 6 balancer circuit highlighted - (C) Jason Hughes
So, originally, “Condition Z” was being detected as if it were “Condition X” due to the tendency of this particular issue to read a lower than expected voltage on a BMB. This triggered mitigations for “Condition X”, which is basically preparing the pack for remanufacturing (physically replacing the bad BMB with the bad MOSFET), and this should have been very rare and an acceptable reason for a warranty replacement. When many (relatively speaking) cases presented, it was obvious Condition X was not the actual culprit. I even had several of these in my mini-fleet of customer vehicles.
Under existing BMS code, Condition Z has a similar effect to Condition X, since it almost always presents to the BMS as if a cell group is lower (or in some cases higher) than others when it is in fact not. Combined with some other testing, this is how Condition X, a slow but consistent draw on a cell group, is detected. This explains why Condition Z was triggering mitigations for Condition X. It had no way to know that the readings from the BMB were not the result of a stuck MOSFET.
A disparity happens, however, because Condition Z isn’t persistent or consistent. Generally the group will read normally most of the time. The BMS takes multiple readings of each cell group per second, internally. People with CAN tools won’t see it in their cell voltage readings, since the CAN data is only reported every few seconds for a group.
Different sections of code in the BMS use this information differently. For example, in most cases the range/capacity algorithm will ignore a few erratic readings while the safety systems take them very seriously.
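To illustrate how two consumers of the same readings can disagree, here is a minimal sketch. The median filter is only a stand-in for whatever smoothing the range/capacity algorithm really uses (that detail is an assumption), and the sample values are made up.

```python
from statistics import median


def estimator_view(samples_mv):
    """Smoothed view: a median shrugs off a few erratic samples."""
    return median(samples_mv)


def safety_view(samples_mv):
    """Conservative view: the worst low and high readings both count."""
    return min(samples_mv), max(samples_mv)


# One cell group sampled several times, with a single erratic reading.
samples = [3950, 3951, 3949, 3850, 3950, 3952, 3950]
print("estimator sees:", estimator_view(samples))
print("safety sees:   ", safety_view(samples))
```

The smoothed view reports 3950 mV as if nothing happened, while the conservative view still has to account for the 3850 mV sample.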
2019.16 - All hell breaks loose on the interwebs
In comes 2019.16. People started seeing range loss because the engineers were closing the disparity gap between how the safety systems have always interpreted the data and how the capacity estimator handled the data. When people started seeing range loss, this was the result of the estimator now using the safety system’s interpretation of readings for most things vs the more smoothed data. People with Condition Z immediately saw huge drops, since the capacity estimator was now aware of potentially huge imbalances that resulted from erroneous cell sense readings.
In a case before about 2019.16.x, if your car had Condition Z, the range estimator may have been off of the safety system’s data by a substantial margin. The real world example would be that if you were to attempt to use the vehicle down below what the safety algorithms allowed, your car might still say 30 miles available, but still shut down. This is why I advised people to update at that time. It didn’t matter if you had the update or not, the range was unusable if you had Condition Z either way. I’m reasonably certain most people would want to know the actual available range vs an algorithm that was out of sync with the safety cutoff system. Either way you were affected, but at least with the update you would have more useful information.
Internally, these erratic readings that result from Condition Z affect how the BMS calculates cell group capacity, and safe charge/discharge levels. For example, if it sees a cell group read, say, 100mV high or low at some point, for safety reasons it can’t just ignore this. What if that is the correct reading, and during charging the cell is pushed 100mV above the normal limit? That can never be allowed to happen. Even if the reading is eventually dismissed and things return to normal, temporarily the BMS will tighten things down internally in the interest of caution and safety. Since the range estimator algorithm didn’t previously use all of this data, there could be a disparity in shown range vs actually usable range pre 2019.16.
Condition Z is basically a different type of failure of a BMB that results in erroneous readings on a single cell group. This usually presents on the 6th/last brick of the module. My physical examination of some of these affected modules varies. I found a couple of BMBs with poor conformal coating in that area and some resulting corrosion. I found several with weak solder joints that seemed to be the result of thermal stress (likely heating/cooling from enabling/disabling the nearby balancers), and others that I didn’t see an obvious physical explanation for. There were several with the cell sense leads at the module starting to have their ultrasonic welds (copper wire to aluminum bus plate) start to fail on cell group 6, which could explain an erratic reading.
BMB v1.0 (what most people would call the “A” packs from 2012/2013) don’t appear to be affected. BMB v2.0 (mid 2015 and onward) also don’t appear to be affected, at least not in any significant quantity. This appears to be the result of design improvements, specifically the better placement of the bleeder circuits on the BMB.
Only batteries with BMB v1.5 appear to be affected by Condition Z. My guess is that this is because of the way these are made. The cell sense connections on these go through a single 15-pin connector (7 used, 8 blank for HV spacing), with the cell sense wires ultrasonic welded to their respective bus plates on the module. These ultrasonic welds are a weak point, and seem to be the cause of almost all battery remanufacturing. Generally, any moisture can cause these to fail. The battery packs are not 100% sealed, so in various climates enough moisture may be present to cause a problem. This is a bit of speculation, based on my own data.
Image: BMB v1.5 (C) 057 Technology, LLC
However, the ultrasonic welds don’t appear to be the cause of Condition Z. This is an odd issue, as it’s not related to the somewhat more common ultrasonic weld failure. It’s actually an issue with the voltage sense IC on the BMB not properly reading a cell group. In short, Condition Z is a failure of the BMB to sense the correct cell group voltage, but not a failure of the wiring for that cell group. While this is a problem with the device, it’s not a safety concern. Tesla, however, needed to do something to correct this problem. Without finding a sensible solution deployable OTA, they would need to remanufacture these battery packs to replace the misbehaving BMB.
And Condition Z seems to be pretty prevalent. Extrapolating out my personal examinations would put something like 1 in 30 of the vehicles with 60 or 85-packs affected by Condition Z. There was likely a LOT of pressure on the BMS engineering team to come up with a solution.
I originally thought this issue was limited to 85 packs, but original 60 packs (not software locked) could be affected as well.
Eventually Tesla started experimenting with mitigations for Condition Z that didn’t require physical labor. Detecting Condition Z is pretty simple when you know what you’re looking for. Look for erratic behavior of the voltage readings on the cell group that aren’t explained by balancers, loads, charging, etc.
Image: Graph of erratic voltage reading on module affected by Condition Z - (C) Jason Hughes
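A very rough sketch of that kind of check: compare how the suspect group moved between two samples against how its neighbors on the same module moved. Loads and charging move every group together, so a jump that only one group makes is suspicious. This is an illustration only. It ignores balancer state entirely, and the function name and 20 mV threshold are assumptions, not values from the BMS.

```python
from statistics import median


def erratic_steps(history, suspect, threshold_mv=20):
    """Return sample indices where the suspect group moved unlike its peers.

    `history` is a list of samples, each a list of cell group voltages
    in mV for one module. `threshold_mv` is a hypothetical value.
    """
    flagged = []
    for t in range(1, len(history)):
        deltas = [cur - prev for prev, cur in zip(history[t - 1], history[t])]
        peers = [d for i, d in enumerate(deltas) if i != suspect]
        # A load or charge moves all groups; only an unexplained move counts.
        if abs(deltas[suspect] - median(peers)) > threshold_mv:
            flagged.append(t)
    return flagged


history = [
    [3950, 3950, 3950, 3950, 3950, 3950],
    [3940, 3941, 3940, 3939, 3940, 3940],  # load applied: everything sags
    [3940, 3941, 3940, 3939, 3940, 3860],  # group 6 alone drops 80 mV
    [3941, 3941, 3940, 3940, 3940, 3941],  # and snaps back on its own
]
print(erratic_steps(history, suspect=5))
```

The sag under load is not flagged, because all six groups moved together. The lone drop and recovery on the last group are.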
Tesla’s engineers went above and beyond on this, ensuring with 100% certainty that vehicles with Condition Z were in fact Z and not X or other issues.
Once detected, Condition Z is able to be mitigated by using all other available data to reconstruct and validate the abnormal cell sense value. In the simplest case, this is basic algebra:
m = a + b + c + d + e + f
If “m” is the module’s total voltage, and “a” through “f” are the individual cell groups, then we have enough data to calculate a single missing value. Let’s say “f” is erratic and inconclusive. We can validate that “f” is valid by checking:
f = m - a - b - c - d - e
If this matches, then the reading for “f” must be correct right now.
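The same check, written out as a small Python sketch. The voltages and the 25 mV tolerance are made-up example values, and the function names are hypothetical.

```python
def reconstruct_group(module_mv, known_groups_mv):
    """f = m - a - b - c - d - e, with everything in millivolts."""
    return module_mv - sum(known_groups_mv)


def reading_is_plausible(measured_mv, module_mv, known_groups_mv, tolerance_mv):
    """True if the direct reading agrees with the reconstructed value."""
    expected = reconstruct_group(module_mv, known_groups_mv)
    return abs(measured_mv - expected) <= tolerance_mv


m = 23700                                  # module total, example value
a_to_e = [3950, 3951, 3949, 3950, 3950]    # the five trusted groups

print(reconstruct_group(m, a_to_e))                            # what "f" should be
print(reading_is_plausible(3948, m, a_to_e, tolerance_mv=25))  # sane reading
print(reading_is_plausible(3850, m, a_to_e, tolerance_mv=25))  # erratic reading
```

In this example the reconstructed value is 3950 mV, so a direct reading of 3948 mV passes and a reading of 3850 mV is rejected.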
There are also even larger measurements available. We have the total pack voltage (p), the sum of all module voltages (sigma m), and the sum of all cell voltages (sigma c).
The problem is that the margin of error on these aggregated measurements is higher. For example, at the cell group level for “a” through “f” the expected margin of error might be +/- 3mV. However, for “m”, the margin of error is going to be higher, roughly +/- 20mV.
This means if we’re correcting for Condition Z this way, we now have a much higher margin of error on any corrected reading than we did before. Since we know this, we know the BMS safety systems need to take this info into account (they do) and leave a little bit of capacity on the table to account for that margin.
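To put rough numbers on that, here is a sketch of how the error stacks up when "f" is reconstructed from "m" and the other five groups, using the example figures above (+/- 20mV for the module, +/- 3mV per group). Two common ways of combining errors are shown: a simple worst-case sum, and a root-sum-square that assumes the errors are independent. Which (if either) the BMS actually uses is not known here, so treat this purely as an illustration.

```python
import math


def reconstructed_error_mv(module_err_mv, group_err_mv, known_groups=5):
    """Error on f = m - a - b - c - d - e, given per-measurement errors.

    Returns (worst_case, rss):
      worst_case: every error lines up in the same direction
      rss: root-sum-square, valid only if the errors are independent
    """
    worst_case = module_err_mv + known_groups * group_err_mv
    rss = math.sqrt(module_err_mv ** 2 + known_groups * group_err_mv ** 2)
    return worst_case, rss


worst, rss = reconstructed_error_mv(module_err_mv=20, group_err_mv=3)
print(f"worst case: +/- {worst} mV, root-sum-square: +/- {rss:.1f} mV")
```

Either way, the module-level measurement dominates, and the reconstructed value is several times less certain than a healthy direct reading.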
This is just the basics of how Condition Z is being mitigated. It’s not the entire picture, but you get the idea. Since Condition Z almost always presents in cell 6, we can use that information better, also. Cell 6 of one module is physically connected to cell 0 of the next module, so with some clever use of balancing circuits within the pack, it’s possible to get yet another alternative voltage calculation for the erratic cell 6. (I won’t get into those details, but suffice it to say it’s pretty genius on Tesla’s part).
Using additional methods, Tesla seems to have gotten the mitigated margin of error down to about 15mV, which is pretty dang good for a software-only fix. The end result is that a vehicle on the latest firmware, after all detections and mitigations have been calibrated, will have minimal range loss as a direct result of Condition Z. In practice, on an 85 pack, this is about 1-4 miles of range, proportional to normal degradation since new.
A lot of this reminds me of Apollo 13. “I don't care about what anything was DESIGNED to do, I care about what it CAN do.” -Gene Kranz
There’s a problem with the people who have been holding out on updating, since some interim updates have been compiling data for the calibrations needed to implement this particular software-based correction. Simply put, if you’ve been holding out on updating because of the FUD spread (or whatever reason), your BMS doesn’t have the months of information needed to mitigate Condition Z properly. You’ll have to go through that whole process once you do update (it happens in the background, obviously). This can take anywhere from 1 to 6 months from what I can tell, and generally 20-30 battery cycles (4-8k miles). After this, you’ll end up with normal degradation, minus a couple of miles for Condition Z mitigation. There also appears to be a minimum time of about 30 days for this to work, even if you get the required cycles in before then.
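The two requirements above (enough cycles AND enough calendar time) can be written as a tiny rule of thumb. This is only a sketch of the observation in the text, not something pulled from the firmware. The defaults use the low end of the 20-30 cycle estimate and the roughly 30 day minimum, and the function name is made up.

```python
def mitigation_likely_ready(cycles_since_update, days_since_update,
                            min_cycles=20, min_days=30):
    """Rough rule of thumb: both thresholds must be met, not just one.

    The defaults are the approximate figures observed above, not
    confirmed firmware constants.
    """
    return cycles_since_update >= min_cycles and days_since_update >= min_days


print(mitigation_likely_ready(cycles_since_update=25, days_since_update=14))
print(mitigation_likely_ready(cycles_since_update=25, days_since_update=45))
```

A heavy driver who racks up 25 cycles in two weeks still has to wait out the calendar minimum.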
I honestly think this is a pretty good fix, overall. A couple miles of range is within the margin of error for any degradation anyway. Given this fix, I don’t think Tesla needs to recall packs or anything else to correct this. They’ve provided a software fix for it that does appear to work in 99.9% of cases. (They should, of course, take care of the other 0.1% as needed).
If you still are affected, most likely your vehicle hasn’t gathered the required data to implement the mitigation yet. It started gathering around version 2020.30 from what I can tell.
So in summary, yes this is a battery issue. If you suddenly lost a bunch of range around a 2019.16 install (range which most likely wasn’t accessible anyway), then you probably have Condition Z. This is a component issue, and Tesla should be required to fix it. I believe they have offered a fix for it with their software mitigations in 2020.30 and up. Whether or not this satisfies their obligations to the customer as far as warranty goes… I’m not a lawyer.
I do think Tesla should be more transparent on this stuff. You really shouldn’t have to get this information from me. It should be coming from them directly. But I can see how this is a sensitive topic, and… I can’t blame them on how they’ve handled it, especially when the vast majority of people affected don’t even seem to notice. Out of the vehicles I personally dealt with and owners I’ve spoken with on the matter… only about 1 in 6 of them even seemed to notice any changes in range, or otherwise were not concerned.
In the end, just update. I know the autopilot related changes suck, and “autopilot jail” is likely to result in road rage, for example… but it’s definitely better to have the fix than not.
Q: “Is my car safe to drive, park in my garage, etc?!”
It’s no more or less safe than it was when it was new, and no more or less safe than a new Tesla, etc. Nothing about these issues makes the car less safe. I personally keep my Tesla vehicles in my garage with zero worry.
Q: “How do I know if I’m affected by Condition X?”
If you’re on 2019.16 or above, your car will warn you that your battery needs service. Tesla will need to replace the battery. If you ignore the warnings, you’ll likely end up stranded when the battery locks out. You probably do not have Condition X.
Q: “How do I know if I’m affected by Condition Z?”
You’ll see a drop in available range when updating from a version older than 2019.16.
Q: “Do you have a list of pack part numbers that may be affected?”
Here is a list of packs that should contain BMB v1.5 and may be affected:
This is completely unofficial and may be inaccurate or incomplete, but is the best guess based on available information.
Your pack part number can be seen in the passenger side front wheel well on a Tesla Model S or X.
Q: “How do I fix it?”
Install the latest update (at least 2020.30) and use your car normally. Eventually the mitigation will take hold and projected range will return to normal.
Q: “I don't want to update because of XYZ. What can happen?”
You can run out of range prematurely when closer to a low SoC due to the safety system seeing the erratic voltage from Condition Z and the range estimator not seeing this and taking it into account. If you have one of the packs above, you should update to ensure you don't get stranded just in case you have Condition Z.
Q: “I got some range back, but it doesn’t seem like I got it all back.”
You may still be down a few miles as a result of the margin of error on the mitigation. In some cases this will continue to tighten (as in, get a few miles back), but most likely your range is the sum of normal degradation and a minor loss from this mitigation.
Q: “I haven’t gotten any range back!”
Install updates, drive more, and don’t sweat it. If your car has Condition Z, it will correct itself. There are other reasons for a battery to have issues, of course, and you should likely schedule service to investigate if you continue to have issues.
Q: “Can you help me replace/upgrade my battery pack?”
The short answer: maybe.
The long answer: You probably don't want to do this. It's likely more cost effective to sell your vehicle and buy the one you want.
If that doesn't deter you, and you're not against shipping your vehicle to NC, feel free to contact 057 Tech.
Q: “I want to sue Tesla over this. Will you help?”
No. I’ve actually tried to help several people/attorneys on this issue over the past year or so, and this is just too much of a headache for me. These people seem to just want blood and money, not any actual resolution. No class action suit is going to actually benefit owners in the end, just the lawyers and the few case initiators. I want nothing to do with it.
Feel free to present or reference my notes as desired for legal purposes (with full credit given and a non-tracking link directly to the original source), but I will not contribute directly beyond that. This is, at best, an expert analysis of readily available information. My expert witness fee for this is $1,000,000/hour with a 10 hour minimum, plus travel expenses. (That should be ridiculous enough to keep that door solidly shut.)
If you want to sue Tesla over something, sue them for remotely accessing vehicle configurations without the owner’s permission to disable features like supercharging on salvage title cars.
Q: “Why the heck didn't you just post about this before?!”
I had a lot of reasons to hold back public posting of this information. I won't get into all of the details, but some has to do with issues above (see the above QA), general CYA, that it wasn't a safety problem, and the fact that it appeared Tesla was working in the right direction on this. This was my decision to hold off on posting. No one threatened or otherwise pressured me to withhold this info.
Q: “This doesn't fix theory XYZ presented by (insert troll name here). What gives?”
You should probably spend less time believing random posts on the internet. Even take my own with a grain of salt if you don't know anything about me.
Q: “This is great work! Can I copy, publish, etc anything that’s here?”
I retain full copyright to everything written and displayed here. Nothing here may be copied, posted elsewhere, etc without my express written permission. Links to this writeup are acceptable, but no portion may be copied without permission.
Q: “Why are you being such a **** about copying this info?”
Honestly? In short because I don’t want to see Tesla Motors Club benefit from my work any further. If I find this information copied on TMC, I’ll likely be submitting DMCA takedowns and pursuing those as necessary. If you want this information on TMC, you may link to this page, but you may not post my content there.
I’ll grant permission to almost any other entity to publish this information, at my discretion of course. Email me or DM me on Twitter with your request.
But, I feel quite betrayed by the staff at TMC over their recent handling of misinformation peddlers and generic forum trolls. Their solution has been to essentially banish me, a long time and well respected contributor… instead of actually dealing with a few bad actors. I won’t get into all that here, but figured I’d be completely honest about my reasoning. Had they corrected this situation, I probably wouldn’t have bothered with any additional wording on the copyright matter. However, they’ve gotten enough benefit from my efforts over the better part of a decade. I see no reason to continue to benefit the owners of that forum with additional content and traffic while they won’t even allow me to post.
Q: “How much TSLA do you own/control or have controlled in the past 2 years or intend to buy? ARE YOU SUPER BIASED?!”
0 shares... disappointingly given the meteoric rise of the stock recently.
Q: “I have so many questions!”
Hit me up on Twitter, Email, IRC (##teslamotors on Freenode), etc and I'll do my best to respond.
“My pumps are always running! / My pumps run way more than they used to!”
Tesla did tighten the allowed delta for module temperatures, adjusted for SoC. If the temperatures differ by more than this calculated delta (which is tighter at the high and low ends of the SoC range), then the battery pump runs to equalize the temperatures.
Unfortunately, similar to Condition Z, the readings for the temperature probes may be inaccurate. Unlike Condition Z, there’s no mitigation available for this so the BMS has to work with the data it has and act accordingly.
Again, not a safety issue.
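For the curious, the pump behavior described above can be sketched like this. Every number here is invented for illustration (the real allowed deltas and the shape of the curve are not given in this writeup), and the function names are hypothetical. The only things taken from the text are that the allowed delta depends on SoC, is tighter at both ends of the SoC range, and that the pump runs when module temperatures differ by more than it.

```python
def allowed_delta_c(soc_pct, mid_delta_c=6.0, edge_delta_c=3.0):
    """Hypothetical curve: widest allowance at 50% SoC, tightening
    linearly toward 0% and 100%. Both delta values are made up."""
    distance_from_mid = abs(soc_pct - 50.0) / 50.0  # 0 at mid, 1 at the ends
    return mid_delta_c - (mid_delta_c - edge_delta_c) * distance_from_mid


def pump_should_run(module_temps_c, soc_pct):
    """Run the pump when the module temperature spread exceeds the allowance."""
    spread = max(module_temps_c) - min(module_temps_c)
    return spread > allowed_delta_c(soc_pct)


temps = [25.0, 26.0, 29.5, 27.0]  # made-up module temperatures in C
print(pump_should_run(temps, soc_pct=50))
print(pump_should_run(temps, soc_pct=95))
```

With these made-up numbers, the same 4.5 C spread is tolerated at mid SoC but triggers the pump near full charge. A single probe reading a few degrees off would be enough to do the same.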
“I can’t supercharge fast anymore!”
This is a whole different issue. While an SoC imbalance presented by Condition Z can affect charge rates, it’s not the cause of the throttling many vehicles have seen. That’s a totally different topic that I won’t get into here today.
“But Tesla said these updates were to prevent fires!”
No they didn’t. As best I can tell, some low rank employee made a statement to a customer about the “abundance of caution” note that they then interpreted as meaning the above. The “abundance of caution” statements, to the best I can tell, relate to detecting “Condition X” and other potential issues with the battery pack in order to find such packs that need refurbishing, combined with tightening the delta-temperature allowed within the pack, which in theory would improve longevity significantly anyway if the sensor data were correct.
There’s been a lot of misinformation on this whole topic. Lots of wild theories that have no basis in reality. There’s a handful of forum trolls that have been pushing conspiracy theory after conspiracy theory on this incessantly, too, and their nonsense has spread FUD to many.
“My pack voltage is capped!”
See my notes about how the safety systems handle voltage readings. If at any point the BMS thinks a cell may be higher than it really is, it has to act on that data. This can look like a cell voltage cap.
If that cap is the result of Condition Z, it will be corrected. However, erroneous voltage readings that present higher than expected are much more rare than reporting a lower voltage. Much more care and caution is needed in this case, since overcharging a battery even a little bit is dangerous, while over discharging it a few mV is not really a problem.
Eventually the mitigation for Condition Z will gather enough data to be confident enough to allow charging higher than an erroneous voltage cap, but it can take far longer. If you don’t drive a lot, it can take ages. Hopefully Tesla looks into these specific cases in more detail and comes up with a quicker solution.
“Coolant leaks! Lithium plating! Fires! Boom!”
FUD. Stop listening to internet trolls and exaggerated media stories.