Most consumer SSDs tend to experience data corruption or startup failures after repeated power failures. It is therefore critical that an SSD continue to function normally after losing power. This article covers the anomalies that can occur after a power loss and the solutions for countering them, along with the pros and cons of each.
SSDs may power off in "normal" or "abnormal" modes. In a normal power-off, the host sends a command to the SSD controller announcing that power is about to be shut down. The controller then saves the contents of its buffer to NAND flash and replies to the host with a confirmation message, after which the host may cut power to the SSD. This is the best way for an SSD to power off without any issues.
There are two types of abnormal power failure: one that occurs while the SSD is "idling" and one that occurs while the SSD is "writing in." The former happens when the host is not accessing SSD data, so it is relatively safe. The latter happens while the SSD is writing data into NAND flash and its internal registers and parameters have not yet been backed up to NAND flash. In cases like this, data corruption is likely to occur.
As shown in the figure above, it is common practice for SSDs to keep data written by the host in a RAM buffer first and to write it to NAND flash only after a certain volume of data has accumulated in RAM. If power fails at this moment, the data the SSD is holding in RAM will be lost. There are two options for tackling this issue:
1. Disable the RAM buffer: This is the most straightforward solution. The SSD writes data from the host into NAND immediately and replies to the host only once the write has completed. This ensures no data is lost in an abnormal power failure, but at the cost of poorer SSD performance.
2. Add an extra protection circuit: This method adds capacitors to the SSD circuit so that the SSD can finish its required operations even after external power is lost. When a power failure occurs, a voltage detector senses the loss of external power and signals the controller to write the data and tables in the buffer to NAND flash immediately, completing the required operations before the stored energy runs out. This solution prevents data corruption and keeps performance intact. On the downside, it increases the cost of the SSD.
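The trade-off between the two options can be illustrated with a toy model. The sketch below is not a real controller design: the class SimpleSSD, its parameters, and the flush threshold of four writes are all hypothetical names and values chosen for illustration. It only shows which writes survive a sudden power failure under each option.

```python
class SimpleSSD:
    """Toy model of an SSD write path (illustrative only)."""

    def __init__(self, use_ram_buffer, flush_threshold=4, has_holdup_caps=False):
        self.use_ram_buffer = use_ram_buffer
        self.flush_threshold = flush_threshold   # hypothetical value
        self.has_holdup_caps = has_holdup_caps
        self.ram = {}    # volatile: contents vanish when power is lost
        self.nand = {}   # non-volatile: contents survive power loss

    def write(self, lba, data):
        if not self.use_ram_buffer:
            # Option 1: no RAM buffer, data goes to NAND before the host is answered.
            self.nand[lba] = data
            return
        # Buffered path: collect writes in RAM, flush once enough have piled up.
        self.ram[lba] = data
        if len(self.ram) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.nand.update(self.ram)
        self.ram.clear()

    def normal_power_off(self):
        # Normal mode: the host announces the shutdown, the controller flushes, then confirms.
        self.flush()
        return "confirmed"

    def power_failure(self):
        if self.has_holdup_caps:
            # Option 2: the capacitors keep the drive alive long enough to flush.
            self.flush()
        self.ram.clear()


def surviving_lbas(ssd, n_writes=3):
    """Issue n_writes host writes, cut power abruptly, and list what reached NAND."""
    for lba in range(n_writes):
        ssd.write(lba, f"data{lba}")
    ssd.power_failure()
    return sorted(ssd.nand)
```

With three writes and a threshold of four, the buffered drive without capacitors loses everything still sitting in RAM, while the unbuffered drive and the capacitor-backed drive both keep all three writes. The model ignores the performance and cost differences, which are the price each option pays.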
Power failures not only wipe out data in the RAM buffer; they may also corrupt data already saved in NAND, extend startup times, or even lock up or completely damage the SSD. Because MLC/TLC NAND stores more than one bit per cell, a power failure during programming means not only that the page being programmed will not be written, but also that previously programmed data may be lost. This is known as the "pair page effect".
As shown in the figure below, page 0 and page 4 are so-called pair pages because they share the same word line. If a power failure occurs while page 4 is being programmed, the data in page 0 may be damaged as well. One solution is to have the SSD controller write new host data to an SLC cache first. Since SLC keeps only one bit per cell, there are no pair pages and the problem cannot occur. This comes at a cost, however: higher write amplification and poorer long-term average write performance. The other method employs so-called "one pass program" NAND, as shown in the figure below. With this kind of NAND, all pages on the same word line are programmed in one shot, which eliminates the impact on pages that have already been programmed. The price of this solution is that it requires a controller with a larger RAM buffer, as several pages may have to be programmed in one shot.
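The pair page effect and both countermeasures can be mimicked with a small model. This is only a sketch under stated assumptions: the pairing table (page 4 with page 0, page 5 with page 1, and so on) is hypothetical and chosen to match the page 0 / page 4 example, real NAND page layouts differ between parts, and the class and method names are invented for illustration.

```python
ERASED = None
CORRUPT = "CORRUPT"


class MlcBlock:
    """Toy MLC block in which two pages share each word line."""

    # Hypothetical pairing: upper page -> lower page on the same word line.
    PAIR = {4: 0, 5: 1, 6: 2, 7: 3}

    def __init__(self):
        self.pages = {p: ERASED for p in range(8)}

    def program(self, page, data, power_fails=False):
        """Program a single page; a power cut also damages its paired lower page."""
        if power_fails:
            self.pages[page] = CORRUPT
            lower = self.PAIR.get(page)
            if lower is not None and self.pages[lower] is not ERASED:
                self.pages[lower] = CORRUPT   # previously good data is lost too
            return False
        self.pages[page] = data
        return True

    def program_one_pass(self, upper, lower_data, upper_data, power_fails=False):
        """One-pass style: both pages of a word line are programmed together."""
        lower = self.PAIR[upper]
        if power_fails:
            # Only the data in flight is affected; other word lines are untouched.
            self.pages[lower] = self.pages[upper] = CORRUPT
            return False
        self.pages[lower], self.pages[upper] = lower_data, upper_data
        return True


class SlcBlock:
    """Toy SLC block: one bit per cell, so no page shares cells with another."""

    def __init__(self):
        self.pages = {p: ERASED for p in range(4)}

    def program(self, page, data, power_fails=False):
        self.pages[page] = CORRUPT if power_fails else data
        return not power_fails
```

In this model, programming page 0 and then losing power while programming page 4 leaves page 0 corrupted in the MLC block, whereas an interrupted write to the SLC block never touches an earlier page. The one-pass method needs both pages' data in hand before it starts, which reflects the larger RAM buffer requirement described above.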
After a power failure, the SSD controller must locate the header page, i.e. the page last programmed before the power failure. In most cases this is done by scanning for the first empty page; the page before it is taken to be the last one programmed before the power failure. In certain cases this may lead to misjudgment. For example, a NAND may lose power a couple of microseconds after programming begins; when scanned after the power failure, the page looks empty even though it has already been partially programmed, and programming it again would result in errors. In addition, an SSD that has suffered multiple power failures in a short period of time may end up with several pages containing a large number of bit errors in one block, which makes the block very unstable and lengthens its read time. To tackle this problem, some SSD controllers migrate the valid data in such a block to another block and then erase the original.
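The recovery steps described above can be sketched roughly as follows. Everything here is an illustrative assumption rather than an actual controller algorithm: the Page fields, the function names, and the thresholds (error_limit, max_bad_pages) are hypothetical, and the migration step assumes the data can be read back correctly (for example through ECC) while it is copied.

```python
from dataclasses import dataclass


@dataclass
class Page:
    programmed: bool        # what the scan observes; may be misleading after a power cut
    data: str = ""
    bit_errors: int = 0
    valid: bool = True      # still holds live host data


def find_header_page(block):
    """Return the index of the page before the first empty page, or -1 if the block is empty."""
    for i, page in enumerate(block):
        if not page.programmed:
            return i - 1
    return len(block) - 1   # block is completely full


def should_migrate(block, error_limit=40, max_bad_pages=2):
    """Flag a block as unstable when several programmed pages show many bit errors."""
    bad = sum(1 for p in block if p.programmed and p.bit_errors > error_limit)
    return bad >= max_bad_pages


def migrate_valid_pages(block):
    """Copy valid pages to a fresh block and return (new_block, erased_old_block)."""
    survivors = [Page(True, p.data, 0) for p in block if p.programmed and p.valid]
    erased = [Page(False) for _ in block]
    return survivors, erased
```

For a block holding three programmed pages followed by empty ones, find_header_page returns 2. Note that this simple scan has exactly the weakness described above: a page interrupted microseconds into programming is reported as empty, so a controller relying on it alone could try to program that page again.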
Remaining reliable after multiple power failures is a critical capability for an SSD. The different solutions to this issue come with their own pros and cons. Which ones are adopted depends on the requirements of the application domain and on the tradeoffs among price, performance, and post-power-failure reliability.