Is There A New Way to Correct Errors

Is There A New Way to Correct Errors
Anxiao (Andrew) Jiang
Jehoshua Bruck
Computer Science and Eng. Dept.
Texas A&M University
[email protected]
Electrical Engineering Dept.
California Institute of Technology
[email protected]
Abstract—The classic approach for error correction is to add
controlled external redundancy to data. This approach, called
error-correcting codes, has been studied extensively. And the
rates of ECCs are approaching theoretical limits.
We explore a second approach for error correction in this
work, which is to use the redundancy inside data, even if it is just
the residual redundancy after data compression. We focus on text
data, and show that this approach based on language processing
can significantly improve the error correction performance.
In the recent work [4], the approach of using the internal
redundancy of data for error correction has been explored.
It shows that for text data, which have rich features, a
language-based error correction algorithm can obtain useful
soft information on the noisy bits, which can be provided
to error-correcting codes (ECCs) for significantly improved
performance. For example, for binary symmetric channel
(BSC) with bit error rates in the range [0.2%, 1.3%] (a useful
range of BER for data storage), a reduction of 70% paritycheck bits was shown to be achievable by combining languagebased decoding with ECC-based decoding [4]. The study was
extended to its application in flash memories, where experiments show that the new approach substantially improves data
retention [5]. The algorithm used in [4], [5] for languagebased error correction focuses on a basic method named word
recognition, namely, to recognize probable words in texts. It
is expected that as more techniques based on natural language
processing are used, the error correction performance will be
further, and substantially, improved.
Note that text data form an important part of the big data
in storage systems [1]. They are concise representations of
knowledge, and have low tolerance of errors. Text data are
mainly natural languages, in addition to characters for format
control, etc. In data storage systems, text data are either uncompressed, or compressed using Huffman coding, arithmetic
coding, etc. [3] The compression is often at the character
level or byte level, because the complexity of the compression scheme grows exponentially in the codeword length of
the code book [1]. The algorithms typically view texts as
Markov processes of characters (sometimes combinations of
characters), and do not utilize the many features of languages
(e.g, syntax, POS tagging, etc.) [6]. Consequently, in storage
systems, there exists plenty of redundancy in stored data,
which forms the basis for language-based error correction.
In addition to natural languages, many text files have extra
structural redundancy that can be used for error correction. An
important class of files is webpages, which are mostly in the
HTML language and are highly structured.
Example 1. Here is a simple example of a webpage (written in
one line): <!DOCTYPE html> <html> <body> <h1> The
First Heading </h1> <p> The first paragraph. </p> · · · · · ·
</body> </html>
The tags (such as “<html>” and “</html>”) usually
come in pairs and together form a tree structure [7]. They
appear frequently in webpages, and their regular structures
enable effective error correction using efficient algorithms,
e.g., dynamic programming. Another important class of files
is databases, including both relational databases and semistructured databases [2]. Relational databases use highlystructured tables with well-defined schemas to store large
amounts of data. Semi-structured databases use high-structured
languages including XML, etc. to store big data with flexible
forms. The structural redundancy is suitable for languagebased decoding, and the techniques can be combined with
those for natural languages.
The language-based decoding has some fundamental differences from error-correcting codes. First of all, languages
are open systems. Unlike ECCs, where valid codewords are
predefined and form a closed set, languages are well known
to be open [6]. New words, expressions and grammatical
rules continue to appear, and rare/new cases are abundant.
Second, unlike ECCs, where codewords form a regular (often
linear) space, the codewords for language tokens (e.g., words,
punctuation marks) are much more chaotic, but not totally
random. So new decoding algorithms with efficient searching
processes need to be developed.
[1] EMC Education Services, Information Storage and Management, 2nd
edition, Wiley, 2012.
[2] H. G ARCIA -M OLINA , J. D. U LLMAN AND J. W IDOM, Database Systems: The Complete Book, 2nd edition, 2008.
[3] Hitachi Data Systems Academy, Storage Concepts, Amazon Digital
Services, 2014.
[4] A. Jiang, Y. Li and J. Bruck, “Error correction through language processing,” to appear in Proc. IEEE Information Theory Workshop (ITW), April
[5] A. Jiang, Y. Li and J. Bruck, “Enhanced error correction via language processing,” to appear in Proc. Non-Volatile Memories Workshop (NMVW), March 2015. Full version available at http :
//f .
[6] C. Manning and H. Schutze, Foundations of Statistical Natural Language
Processing, 1st edition, the MIT Press, 1999.
[7] J. N. Robbins, Learning Web Design, 4th edition, O’Reilly Media, 2012.