Choosing the Right Storage for Application Data
We have roughly classified the data you may be working with into the five categories below. This classification is not comprehensive, but identifying where your data fits will help clarify the available options and approaches, and lead to a sound decision for your storage needs.
- Homogeneous data arrays containing elements of the same type
- Multimedia - audio, video and graphics files
- Interim data for internal use (logs of various types, caches)
- Streams of calculated data of various types (e.g. recorded video stream or massive computation results)
- Documents (simple or compound)
Such data can be stored in the following ways:
- Files in a filesystem
- Relational databases
- Structured storages
- Archives (as a specific form of structured storage)
- Remote (distributed, cloud) storages
The sections below discuss which storage mechanism is best suited to each of the data types mentioned above:
- Homogeneous data arrays
- Audio, video, and graphic files
- Temporary data
- Data streams
- Documents
- Suggested solutions
Homogeneous data arrays
Homogeneous data arrays contain elements of the same type. Examples of homogeneous data arrays include things such as a simple table, temperature data over time, or last year's stock values.
- For homogeneous data arrays, regular files do not provide support for convenient and fast searches. You have to create, maintain, and constantly update special indexing files. Modification of the data structure is almost impossible. Meta-information is limited and there is no built-in run-time compression or encryption of data.
- Relational databases are well suited for homogeneous data. They comprise a set of predefined records with a rigid internal format. The main advantages of relational databases are the ability to locate data quickly according to a specified criterion and transactional support for data integrity. Their significant shortcoming is that they do not work well for large data of variable length (BLOB fields are usually stored separately from the rest of the record). Moreover, keeping data in relational databases requires: a) use of a specific DBMS, which severely limits the portability of the data and of the application itself, b) pre-planning of the database structure, including inter-relational links and indexing policy, c) researching details of peak loads for efficient database development, which may involve serious overhead.
- Structured storages are somewhat analogous to filesystems: a storage is a set of named streams (files) enclosed in a single container. Such a storage can be kept in any location, for example in a single file on a disk, in a database record, or even in RAM. The main advantage of this approach is that it allows efficient addition or deletion of data in an existing storage, and handles data of various sizes (from small to huge) effectively. The storages are separate units (files), and therefore can be easily relocated, copied, duplicated, and backed up. There is no need to track all of the files generated by an application. Moreover, journaling makes it possible to restore content completely or partially after accidents or failures. The disadvantage may be relatively slow searches within huge data arrays.
- ZIP archives, as a specific form of structured storage, can be used for storing homogeneous data arrays, but only when most access is read-only. The standardized nature of the ZIP format makes it easy to use, especially in cross-platform applications. However, this format is not suitable for modifying the data after packing, so adding and deleting data can be quite time-consuming.
- Remote and distributed storages are the next level of storage, in which the actual data location and data access are handled by dedicated layers that encapsulate the access mechanics. In such storages, data can be stored in a database or be distributed among different filesystems, but the actual storage organization does not matter to the end user. The user sees only a set of objects accessed through an API or, alternatively, through filesystem calls. Cloud storages are a good example. These types of data storage are intended for use in large software systems. Their advantages include unified data access without the need to think about how the data is actually stored. The disadvantages are that they cannot be efficiently managed and controlled, and that backup or migration of data is complicated.
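The trade-off described above for relational databases (fast lookup by criterion and transactional integrity, at the cost of a predefined schema) can be illustrated with a minimal sketch. It assumes Python's built-in `sqlite3` module and an in-memory database; the table name `readings`, its columns, and both helper functions are hypothetical names chosen for this example only.

```python
import sqlite3

def build_readings_db(readings):
    """Store (timestamp, temperature) rows in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    # The record layout must be planned up front: this is the "rigid format".
    conn.execute("CREATE TABLE readings (ts INTEGER NOT NULL, temp_c REAL NOT NULL)")
    # An index lets the engine locate rows by criterion without a full scan.
    conn.execute("CREATE INDEX idx_readings_temp ON readings (temp_c)")
    # "with conn" runs the inserts as one transaction: all rows commit or none do.
    with conn:
        conn.executemany("INSERT INTO readings (ts, temp_c) VALUES (?, ?)", readings)
    return conn

def timestamps_above(conn, threshold_c):
    """Return the timestamps of readings warmer than threshold_c, oldest first."""
    rows = conn.execute(
        "SELECT ts FROM readings WHERE temp_c > ? ORDER BY ts", (threshold_c,)
    )
    return [ts for (ts,) in rows]

# Hypothetical sample data: (timestamp, temperature in degrees Celsius).
sample = [(1, 18.5), (2, 21.0), (3, 25.5), (4, 19.0), (5, 26.1)]
conn = build_readings_db(sample)
print(timestamps_above(conn, 20.0))
```

With a flat file, the same query would need either a full scan or a hand-maintained index file, which is the maintenance burden noted in the first item of the list.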
Audio, video, and graphic files
Storing multimedia files is typically simple. Complexities appear when you need to maintain a large number of files and want to perform a search across the whole multimedia collection.
- Only small, simple multimedia collections can reasonably be stored as regular files. Even for an average home collection, simple file-based multimedia storage becomes unmanageable very quickly. This is mostly due to the size of the files, the inability to handle annotations, tags, or metadata, and the low speed of copying or relocation.
- Relational databases are a dubious way of storing audio, video, or similar types of data. An RDBMS is not well suited for keeping large BLOBs, especially large video files. Additionally, each type of data requires its own table (because a different set of metadata needs to be stored for each). On the other hand, an RDBMS can be handy in certain scenarios, as it offers powerful search capabilities that are very suitable for read-only collections.
- Structured storages work perfectly for storing multimedia files when the storage supports metadata and fast searching. If this search is not supported, structured storage becomes a variant of the filesystem.
- Remote and distributed storages are among the best solutions for storing video, music, or similar data. The storage is a single unit where all elements of a multimedia project or video game can be safely stored, so there is little risk of losing a single important file. Searches are fast and efficient if the storage supports tags and metadata.
Temporary data
Temporary data is generated by software on the fly, and usually has a limited period of validity. Updates are typically very frequent. In addition, such intermediate information must remain easily accessible and intact and, in many cases, encrypted and secured. It is possible to use regular files for these purposes, but this approach results in high resource consumption, and there is no reliable built-in way to control and enforce the integrity of the data. Additionally, encryption functions must be implemented by your software.
- Files have long been used to store interim data. They are quite suitable for storing low-priority, unsecured, temporary data of insignificant size. However, legislation in several countries now requires more careful and responsible treatment of interim data. As a result, a regular filesystem is less suitable when data security, vulnerability, and protection from tampering become paramount.
- Relational databases are not usually used for interim data storage because such data, as a rule, lacks a clearly defined structure and clearly defined relations between elements. Low update speed and issues with compression and security add to this unsuitability. At the same time, a relational database can contain interim data related to the database itself and its operation. A database can also be used as a kind of data cache, or for storing activity logs (journal files). An RDBMS is not well suited if the data must be stored for a long period of time and must be signed or encrypted.
- Structured storages may be considered an optimal solution when a large volume of interim data needs to be stored, accessed, indexed, searched, compressed, and encrypted on the fly. Structured storages may be built with anti-tampering functions or, where the requirements call for it, provide an easy way to remove or replace data. As always, such storages can be easily copied or moved without the need to worry about preserving data integrity.
- ZIP archives are rarely used for interim data storage. The typically fast turnaround of interim data makes them impractical in most situations. An encrypted archive may be suitable for this type of data only when snapshots are to be stored for a long time and need to be protected from loss or tampering.
- Remote and distributed storages are used for interim data streams mainly because of space considerations. They do not offer the speed, easy management, or easy backup that interim data often requires.
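As noted above, regular files give no built-in way to enforce the integrity of temporary data, so the application has to add it. The following is a minimal sketch of one way to do that, assuming Python's standard `tempfile`, `hmac`, and `hashlib` modules. It only detects modification and does not encrypt anything; the function names and the file layout (a 32-byte tag followed by the payload) are assumptions made for this example.

```python
import hashlib
import hmac
import os
import tempfile

TAG_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def write_protected_temp(data: bytes, key: bytes) -> str:
    """Write data to a temp file, prefixed with an HMAC-SHA256 tag; return the path."""
    tag = hmac.new(key, data, hashlib.sha256).digest()
    # delete=False keeps the file on disk after closing so it can be re-read later.
    with tempfile.NamedTemporaryFile(delete=False, suffix=".tmp") as f:
        f.write(tag + data)
        return f.name

def read_protected_temp(path: str, key: bytes) -> bytes:
    """Read the payload back, raising ValueError if the file was altered."""
    with open(path, "rb") as f:
        blob = f.read()
    tag, data = blob[:TAG_SIZE], blob[TAG_SIZE:]
    expected = hmac.new(key, data, hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("temporary file failed its integrity check")
    return data

key = os.urandom(32)  # in a real application the key must be kept secret
path = write_protected_temp(b"cached intermediate result", key)
print(read_protected_temp(path, key))
os.remove(path)  # the application is also responsible for cleanup
```

Everything in this sketch (key handling, tagging, cleanup) is work the application takes on when it uses plain files, which is the overhead the paragraph above refers to.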
Data streams
Large volumes of quickly generated data, such as output data feeds, need to be stored efficiently. Regular filesystems significantly limit file sizes, requiring the design of specific handlers for data overflow at the expense of integrity and reliability. Since data of this type often contains privileged or sensitive material, fast on-the-fly encryption is a must. The same applies to efficient data compression, since these data feeds are typically very large.
- Regular files are not well suited for this type of data. Quickly increasing file sizes require the creation of many intermediate caching files that need to be copied back. Even with careful design, the amount of memory or media consumed tends to grow geometrically. Handling, indexing, searching, and encrypting data streams stored in regular files becomes a nightmare.
- Relational databases pose almost exactly the same problems as regular files. Add to this the inefficiency of database updates and the rigid structure, and it is clear that relational databases are among the least suitable storage solutions for streams of data.
- Repositories (archives) may be used for data stream storage when security and low vulnerability are required, at the expense of easy searches and fast retrieval. Data can be compressed, but fast and efficient searches become almost impossible.
- Structured storages have the advantages of security, integrity, and efficient searches. Data storages are autonomous single-file units which can be easily transferred or copied. Access is easy and efficient. Data streams kept within them can be encrypted and protected from tampering. The presence of thin partitioning provides another convenience for storage users: the storage will automatically grow in parallel to the increase of data size.
- Remote and distributed storages are well suited for streaming data, and are commonly used in projects generating a vast amount of data. Since such data are frequently analyzed by distributed systems or clusters, the use of remote storages is the best fit. This type of storage provides easy but well controlled data access and protection from illegal tampering or removal.
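The requirement for on-the-fly compression of large data feeds can be sketched as follows. This is an illustration only, assuming Python's standard `zlib` module; the generator names `compress_stream` and `decompress_stream` are hypothetical. The point is that each chunk is compressed as it arrives, so the whole feed never has to be held in memory or written uncompressed first.

```python
import zlib

def compress_stream(chunks):
    """Yield compressed pieces as chunks arrive, without buffering the whole feed."""
    compressor = zlib.compressobj(6)  # 6 is a middle-of-the-road compression level
    for chunk in chunks:
        piece = compressor.compress(chunk)
        if piece:  # the compressor may hold data back until it has enough input
            yield piece
    # flush() emits whatever is still buffered and finishes the stream.
    yield compressor.flush()

def decompress_stream(pieces):
    """Yield the original bytes back from an iterable of compressed pieces."""
    decompressor = zlib.decompressobj()
    for piece in pieces:
        out = decompressor.decompress(piece)
        if out:
            yield out
    tail = decompressor.flush()
    if tail:
        yield tail

# Hypothetical feed: 100 chunks of repetitive sensor output.
feed = (b"sensor=42;" * 1000 for _ in range(100))
compressed = b"".join(compress_stream(feed))
restored = b"".join(decompress_stream([compressed]))
print(len(restored), len(compressed))
```

In practice the compressed pieces would be written to the chosen storage as they are produced; joining them in memory here only keeps the example short. Encryption, which the text also calls for, is not covered by this sketch.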
Documents
Documents are rigidly structured data specifically designed to store human-readable textual or graphical information. Documents are one of the most common forms of information, produced and used in business and personal activities.
- Files are the most common storage method for documents. However, when concurrent access to documents is required, the use of regular files becomes complicated. Since all of the compound document structure is stored sequentially in a flat file, any document modifications require the creation of a set of temporary files that contain subsets of document elements to be edited. In addition, deletion of any elements from the document will not reduce file size automatically. To optimize the size, an additional document copy must be created and saved into yet another file. After the edit operation is completed, the original file must be deleted. If this is to be done automatically by the editing software, the developer of this software will have more on their plate.
- Relational databases work well for some types of documents, and can provide fast and efficient indexing, search, and retrieval, provided that on-the-fly conversion to plain text is available. Databases suffer from the same shortcomings described for the storage of homogeneous data arrays. Keeping data in relational databases requires a) use of a specific DBMS, b) pre-planning of the database structure, including inter-relational links and indexing policy, c) researching details of peak loads for efficient database development, which may involve serious overhead.
- Structured, customizable storages are among the best choices when it comes to corporate use of documents. The main advantage of structured storages is that they allow efficient addition or deletion of documents or their parts in an existing storage, and they provide effective document access restrictions. Complex documents that contain embedded images or other multimedia are best handled by separating the text from the multimedia (doing this reduces load and save time, makes text search easier, and so on). Moreover, journaling makes it possible to restore content (completely or partially) following accidents or failures. One more benefit is the ability to store multiple editions or multiple alternative views of the data within one document. The disadvantage may be slower search, which should be implemented using on-the-fly conversion to plain text.
- ZIP files are used in some document formats, such as OpenDocument Format, to store document data. Most of the advantages described above for structured storage also apply to ZIP file storage. However, modifying and deleting information is time-consuming and sometimes requires a complete rewrite of the file. Additionally, the ZIP file format doesn't allow you to attach metadata to the entries inside, and ZIP encryption capabilities are limited (strong AES encryption is a recent addition to the standard, and it is not supported by many ZIP compression and decompression tools and libraries).
- Remote and distributed storages are becoming more widespread and popular. They allow easy collaboration during document creation and use, and provide remote but tightly controlled and secured access to documents. Unlike homogeneous data arrays, a document usually constitutes one object that is accessed and modified in its entirety, and this makes document retrieval and management quite simple. The drawbacks are the same as those described for remote storages in the previous sections.
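The point above about ZIP-based documents, namely that deleting a part can require rewriting the whole container, can be sketched with Python's standard `zipfile` module. This is a minimal illustration under stated assumptions: the part names are made up for the example, the helper functions are hypothetical, and the container is kept in memory for brevity.

```python
import io
import zipfile

def build_document(parts: dict) -> bytes:
    """Pack named parts (name -> bytes) into an in-memory ZIP container."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, payload in parts.items():
            zf.writestr(name, payload)
    return buf.getvalue()

def remove_part(document: bytes, name_to_drop: str) -> bytes:
    """Return a new container without one part by copying every other entry."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(document)) as src, \
         zipfile.ZipFile(out, "w", compression=zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename != name_to_drop:
                # Every surviving entry is read, recompressed, and written again.
                dst.writestr(info.filename, src.read(info.filename))
    return out.getvalue()

# Illustrative part names; a real document format defines its own layout.
doc = build_document({"content.xml": b"<doc>text</doc>", "images/logo.png": b"\x89PNG..."})
smaller = remove_part(doc, "images/logo.png")
with zipfile.ZipFile(io.BytesIO(smaller)) as zf:
    print(zf.namelist())
```

The cost of `remove_part` grows with the size of the whole document rather than with the size of the removed part, which is why frequent edits to large ZIP-based documents are expensive.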
Suggested solutions
Using the right tool for the job is extremely important in software design. Incorrect or poorly thought-out planning of data and information storage can lead to disastrous results.
- If you use files, you face a choice of filesystems.
- There is a wide range of commercial database systems to choose from (Oracle, Microsoft SQL Server, etc.), as well as several open source solutions.
- Repositories can be created with commercial and public archiving solutions, such as ZIP.
- Examples of structured storages include OLE Structured Storage by Microsoft (which offers only basic storage capabilities, i.e. no encryption, compression, or search), the ZIP format, and CBFS Storage.
- Remote storages can be designed with CBFS Storage or CBFS Connect, with FUSE on Unix-based systems, etc.
In any case, only the project developer knows the exact requirements, technologies, features, and restrictions of their project, and therefore has the understanding to make an adequate choice of tools.