Why decentralized storage can also ensure that data is not lost
Recently, after I published a number of articles, many readers told me that their biggest worry about decentralized storage is data loss: even though it is cheaper, faster, and more private, the data is stored on miners' machines, and because miners are unstable, files may be lost. It is like Uber drivers: most of the time they are reliable, but the moment one is not, the ride experience suffers. So these readers are unwilling to use such an unstable product and would rather use services from large companies, such as AWS S3, Azure, or Google Cloud, because those companies have professional data centers, professional machines, and professional hard drives to ensure that data is not lost.
This is not actually the case. I thought about this problem carefully when designing PP.io, so I wrote this article to explain it.
First, let me correct a few common misconceptions:
- First, can AWS S3 and similar services guarantee that 100% of files are never lost? Actually, no: they only guarantee that 99.999999999% of files are not lost (eleven 9s). The storage industry calls this quality-of-service (QoS) parameter durability.
- Miners may indeed be unstable. However, the core of P2P technology is precisely to build a stable service on top of many unstable nodes. PPTV, the P2P live-streaming system I designed earlier, is exactly that: a stable service built on many unstable nodes.
Source: https://aws.amazon.com/s3/reduced-redundancy/
Let me explain in detail how PP.io can make this durability very high.
The two redundancy modes of PP.io
When designing PP.io, I designed two redundancy modes:
- Full copy mode
The full copy mode stores complete copies of the file; each new copy is identical to the original. It does not save space, but it lets P2P download data from multiple sources more quickly, which improves the user's download experience.
- Erasure mode
The erasure mode provides redundancy through erasure coding. Simply put, the data is split into fragments and encoded into a configurable number of redundant erasure shares, and different erasure shares are stored on different miners. This is less friendly to multi-point P2P transmission, but it saves a great deal of space.
PP.io combines these two redundancy modes, and different scenarios emphasize different ones.
Below I briefly describe the mathematical characteristics of erasure coding:
We use a (k,n) erasure code to encode the data: there are n erasure shares in total, and any k of them are enough to completely recover the original data. If the data size is s bytes, each erasure share is approximately s/k bytes. If k = 1, each share is equivalent to a full copy. For example, for 1 MB of data encoded with a (10,16) erasure code, each erasure share is 0.1 MB and the total stored data is 1.6 MB, i.e. 1.6 times the size of the original data.
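To make the space arithmetic concrete, here is a minimal sketch in Python (it only reproduces the 1 MB / (10,16) example above; the function name is just for illustration):

```python
# Space arithmetic for a (k, n) erasure code: each share is about s/k bytes
# and n shares are stored in total, so the overhead factor is n/k.
def erasure_overhead(data_mb, k, n):
    share_mb = data_mb / k       # size of one erasure share
    total_mb = data_mb * n / k   # total space stored across all miners
    return share_mb, total_mb, n / k

print(erasure_overhead(1.0, 10, 16))   # (0.1, 1.6, 1.6): 0.1 MB per share, 1.6 MB total, 1.6x
```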
PP.io assumptions and calculations
Make the following assumptions:
- Let t be the unit time; here we assume t = 24 hours.
- Pt is the daily drop rate of miners; here we assume Pt = 5%. In other words, 5% of copies may go offline every day.
- τ is the repair time after a copy is lost, that is, how long it takes to repair a lost copy. We assume τ = 2 hours.
Taking repair into account, we substitute the values above into the following formula to obtain p, the probability that a copy is lost within one repair window:
p = 1-(1-Pt)^(τ/t) = 0.4265318778%
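The same number can be reproduced with a few lines of Python (a minimal sketch using the assumed Pt, t, and τ above):

```python
Pt  = 0.05    # daily drop rate of a miner copy
t   = 24.0    # unit time, in hours
tau = 2.0     # repair time after a copy is lost, in hours

# Probability that a given copy is lost within one repair window of length tau.
p = 1 - (1 - Pt) ** (tau / t)
print(f"p = {p:.10%}")   # ~0.4265318778%
```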
By default, PP.io uses 2x redundancy for the full copy mode and 1.5x redundancy for the erasure mode.
Let's first look at the full copy mode:
Since the full copy mode stores complete duplicates, 2x redundancy means there are 2 copies; we call this N = 2.
The data is lost only if both copies are lost within the same repair window, so the durability during the repair time is: Pa = 1 - p² = 99.9981807056%
Annual durability is Pya = Pa^(365t/τ) = 92.3406415922%
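A quick check of the full copy numbers (a sketch under the same assumptions, with N = 2 copies):

```python
Pt, t, tau, N = 0.05, 24.0, 2.0, 2   # assumptions from the text; N full copies

p   = 1 - (1 - Pt) ** (tau / t)      # per-window loss probability of one copy
Pa  = 1 - p ** N                     # lost only if all N copies vanish in one window
Pya = Pa ** (365 * t / tau)          # 365*24/2 = 4380 repair windows per year

print(f"Pa  = {Pa:.10%}")            # ~99.9981807056%
print(f"Pya = {Pya:.10%}")           # ~92.3406415922%
```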
Let us look at the erasure mode:
Suppose we use a (k,n) = (6,9) erasure code. For 6 MB of data, each erasure share is 1 MB, 9 erasure shares are stored in total, and any 6 of them can recover the complete data. The shares are stored on 9 miners, and the actual space used is 9 MB. Since the erasure code is (k,n), the redundancy is F = n/k = 1.5.
The expected number of erasure shares lost during one repair window is m = np = 0.038387869; this is the average number of occurrences used below.
Here I need a classic formula from probability theory, the Poisson distribution (used here as an approximation, or "fit"):

P(x) = (m^x / x!) · e^(-m)

Simply put, if an event occurs m times on average, the Poisson formula gives the probability that it occurs exactly x times. The derivation details are in the appendix at the end of this article.
We apply the Poisson formula. The data survives a repair window as long as at most n - k = 3 erasure shares are lost, so the durability during the repair time is:

Pb = P(0) + P(1) + P(2) + P(3) = Σ(x=0..3) (m^x / x!) · e^(-m)

That is, Pb = 99.9999912252%
Then the annual durability: Pyb = Pb^(365t/τ) = 99.9615738663%
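The erasure-mode figures can be checked the same way (a sketch; the survival condition is that at most n - k shares are lost within one repair window):

```python
import math

Pt, t, tau = 0.05, 24.0, 2.0
k, n = 6, 9

p = 1 - (1 - Pt) ** (tau / t)
m = n * p                            # expected shares lost per window, ~0.038387869

# Poisson fit: the data survives as long as at most n - k = 3 shares are lost.
Pb  = sum(m ** x / math.factorial(x) for x in range(n - k + 1)) * math.exp(-m)
Pyb = Pb ** (365 * t / tau)

print(f"Pb  = {Pb:.10%}")            # ~99.9999912252%
print(f"Pyb = {Pyb:.10%}")           # ~99.9615738663%
```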
It can be seen that although its redundancy is lower, the erasure mode achieves much higher durability than the full copy mode.
Calculation summary:
We combine the two redundancy modes to get the overall durability:
Durability during repair time: P = 1 - (1-Pa)(1-Pb) = 99.9999999998%
Annual durability: Py = P^(365t/τ) = 99.9999993008%
Look: the number of 9s is already 8. In other words, if you store 100 million files, you would lose only about one file per year. Isn't that reliable?
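Putting the two layers together (a sketch; the combination formula is the one used above):

```python
import math

Pt, t, tau = 0.05, 24.0, 2.0
N, k, n = 2, 6, 9

p  = 1 - (1 - Pt) ** (tau / t)
Pa = 1 - p ** N                                   # full copy layer
m  = n * p
Pb = sum(m ** x / math.factorial(x) for x in range(n - k + 1)) * math.exp(-m)

P  = 1 - (1 - Pa) * (1 - Pb)                      # both layers must fail in the same window
Py = P ** (365 * t / tau)

print(f"P  = {P:.10%}")                           # ~99.9999999998%
print(f"Py = {Py:.10%}")                          # ~99.9999993008%
```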
Raising durability further
The calculation above assumed the erasure code (k, n) = (6, 9), i.e. redundancy F = 1.5. If the redundancy F is increased appropriately, or k is increased appropriately, the annual durability Py can be improved significantly. The effect of adjusting k and F on the annual durability can be tabulated as in the sketch below. We make a new assumption here: there are no full copies at all, only erasure shares. This does not pursue speed, only the lowest possible price, so in this case Py equals Pyb.
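Here is a small sketch that tabulates Pyb for a few (k, n) combinations; the specific pairs are illustrative choices of mine, while the model and the assumptions (Pt = 5%, t = 24 h, τ = 2 h) are the ones from the text:

```python
import math

Pt, t, tau = 0.05, 24.0, 2.0

def annual_durability(k, n):
    """Annual durability Pyb when only (k, n) erasure shares are stored."""
    p = 1 - (1 - Pt) ** (tau / t)
    m = n * p
    Pb = sum(m ** x / math.factorial(x) for x in range(n - k + 1)) * math.exp(-m)
    return Pb ** (365 * t / tau)

print(" k   n   F=n/k   annual durability Pyb")
for k, n in [(4, 6), (6, 9), (8, 12), (6, 12), (8, 16), (6, 15)]:
    print(f"{k:>2}  {n:>2}   {n / k:>5.2f}   {annual_durability(k, n):.10%}")
```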
It can be seen that the higher the redundancy F, the higher the durability. Likewise, the larger the number of erasure shares n, the higher the durability; durability is more sensitive to n, but a larger n also requires more miners.
It can also be seen that if you want to reach 99.9999999999% (twelve 9s) using the erasure mode alone, you can achieve it with redundancy 2 by splitting the data into 16 erasure shares, or with redundancy 2.5 by splitting it into 12 erasure shares. That exceeds the annual durability of the AWS S3 enterprise storage service (eleven 9s).
Raising durability even further
Besides adjusting the parameters N, (k, n), and F, durability can also be improved significantly through engineering optimization. As mentioned earlier, the calculation above rests on two assumptions, and both of them can still be improved considerably.
The first is the daily drop rate of a single copy, Pt, which I assumed to be 5%. This value can be reduced through the design of the token economic system: a more rational incentive system improves the stability and online rate of miners. The more stable the miners, the lower Pt; and the lower Pt, the higher the annual durability. This value is expected to drop to 1% or even lower.
The second is the repair time after a copy is lost, τ, which I assumed to be 2 hours. This can also be optimized by the PP.io algorithm: if a lost copy is detected and repaired more quickly, so that the number of copies stays sufficient, τ goes down; and the lower τ is, the higher the annual durability. With the algorithm at its best, this value is expected to drop to 15 minutes.
Assume the limit values Pt = 1%, τ = 0.25 hours, (k,n) = (6,9), F = 1.5.
Then: Pyb = 99.9999998852%
If you add 2x full copy redundancy on top of that, the annual durability is Py = 99.9999999999% (twelve 9s).
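The erasure-only figure for these limit values can be reproduced as follows (a sketch; same model as above):

```python
import math

Pt, t, tau = 0.01, 24.0, 0.25        # "limit" assumptions from the text
k, n = 6, 9

p   = 1 - (1 - Pt) ** (tau / t)
m   = n * p
Pb  = sum(m ** x / math.factorial(x) for x in range(n - k + 1)) * math.exp(-m)
Pyb = Pb ** (365 * t / tau)          # 365*24/0.25 = 35040 repair windows per year

print(f"Pyb = {Pyb:.10%}")           # ~99.9999998852%
```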
PP.io will give developers the flexibility to set parameters
When designing the PP.io architecture, I gave developers enough flexibility to set different parameters according to their own situation, including the number of full copies N and the erasure code parameters (k, n).
Developers can configure these parameters based on their own needs, such as transmission speed, price (higher redundancy means a higher price), and acceptable durability, to meet their product requirements.
PP.io is a decentralized data storage and delivery platform for developers that values affordability, speed, and privacy.
Appendix: Derivation of Poisson Distribution Fitting Formula
Assume p is the failure rate of a single device per unit time. Then the probability that exactly k of n devices fail in a unit of time follows the binomial distribution:

P(k) = C(n,k) · p^k · (1-p)^(n-k)

Expanding the combination:

P(k) = [n(n-1)...(n-k+1) / k!] · p^k · (1-p)^(n-k)

Suppose p is small and n is large (in general n > 20 and p < 0.05), so that:

n(n-1)...(n-k+1) ≈ n^k and (1-p)^(n-k) ≈ e^(-np)

Let λ = np. Then:

P(k) = (λ^k / k!) · e^(-λ)

This is the Poisson distribution formula: if there are λ device failures per unit time on average, it gives the probability of exactly k device failures in a unit of time.
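For the (6, 9) example used in the main text, n is only 9, so it is worth checking the approximation numerically; here is a quick sketch comparing the exact binomial value with the Poisson fit:

```python
import math

Pt, t, tau = 0.05, 24.0, 2.0
k, n = 6, 9

p = 1 - (1 - Pt) ** (tau / t)
m = n * p

# Exact binomial probability of losing at most n - k shares in one window.
binom = sum(math.comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n - k + 1))
# Poisson fit used in the main text.
poisson = sum(m ** x / math.factorial(x) for x in range(n - k + 1)) * math.exp(-m)

print(f"binomial: {binom:.10%}")
print(f"poisson : {poisson:.10%}")
```

Under these parameters the Poisson fit is slightly conservative (it overestimates the loss probability a little), so the true durability is somewhat higher than the figures in the main text.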
Article author: Wayne Wong
If you want to reprint this article, please indicate the source.
If you would like to discuss blockchain learning, you can contact me at [email protected]