Vinum (was: disk mirroring?)

From: Greg Lehey <grog(at)lemis.com>
Date: Mon, 29 Mar 1999 09:32:10 +0930

On Sunday, 28 March 1999 at 3:55:00 +0200, Stefan Huerter wrote:
> Hi Greg
>
>> developed considerably further. It's worth getting the newest version
>> from ftp://ftp.lemis.com/pub/vinum. More details at
>> http://www.lemis.com/vinum.html.
>
> Hm, definitely, or is that "relatively" identical to what's in
> 3.1-STABLE (mid-February)?

I don't quite understand the question. In any case, today's 4.0
version contains many improvements over the 3.1-STABLE version, above
all in the area of configuration. The basic functionality has been
stable for about 6 months.

> (At first glance it looked really appealing to me, above all the
> advertised speed-up for mirroring; unfortunately my second Barracuda
> has just somehow gone over the Jordan :-(

Jordan should be ashamed of himself. He's already got enough disks :-)

Here are a few comparative figures from an earlier mail. Please
excuse the foreign language. The disks I'm using here are ones Mike
Smith picked up at an auction at some point. They must be about 10
years old and are correspondingly slow, so pay attention only to the
relative comparisons:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU  Writes   Reads  Mblock Mstripe
ufs       100   582 13.8   479  3.0   559  5.2  1121 24.6  1124  5.2  45.4  2.6
s1k       100   156 15.4   150 12.3   108  3.7   230  6.7   230  2.7  36.3  3.4  311848  328587  619009  138783
s8k       100  1492 44.8  1478 18.4   723  8.1  1466 34.0  1467  8.2 115.4  8.0   38913   41065   56152    9337
s64k      100  1723 48.6  1581 18.6  1021 11.8  1792 39.5  1827 11.1 115.3  8.8   17238    8231    1294     333
s256k     100  1717 47.2  1629 19.0   937 11.2  1469 32.2  1467  8.7  95.9  7.8   16982    9272    2001     494
s512k     100  1772 48.8  1621 18.0   732  8.3  1256 27.4  1254  7.4 115.4  8.8   16157    7564     155      37
r512k     100   379 14.9   385  8.9   360  4.5  1122 24.7  1258  7.4  80.9  6.7   38339   46453     521     793
s4m/16    100  1572 52.8  1336 18.6   612  6.1  1139 25.2  1142  5.6  97.9  7.1   20434    8028      17       7
s4m/17    100  1431 44.8  1234 16.9   613  6.1  1145 25.4  1147  5.6  97.3  7.0   19922    8101     113      31

Sorry for the format; I'll probably remove some of the bonnie columns
when I'm done.

The "Machine" indicates the type and stripe size of the plex (r:
RAID-5, s: striped, ufs: straight ufs partition for comparison
purposes). The additional columns at the end are the writes and reads
at plex level, the number of multiblock transfers (combined read and
write), and the number of multistripe transfers (combined read and
write). A multiblock transfer is one which requires two separate I/Os
to satisfy, and a multistripe transfer is one which requires accessing
two different stripes. They're the main cause of degraded performance
with small stripes.
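
To make the geometry a bit more concrete, here is a little standalone
sketch (my own illustration with made-up names, not vinum code) of how
many separate I/Os a single 8 kB bonnie transfer turns into at various
stripe sizes; the boundary rules are simplified.

/*
 * Rough sketch (not vinum code): how many stripe-sized pieces, and
 * hence separate I/Os, a transfer of 'len' bytes starting at byte
 * 'offset' needs on a plex with 'stripesize'-byte stripes.
 */
#include <stdio.h>

static long long
pieces(long long offset, long long len, long long stripesize)
{
    long long first = offset / stripesize;
    long long last = (offset + len - 1) / stripesize;

    return last - first + 1;                /* 1 means no split needed */
}

int
main(void)
{
    long long xfer = 8 * 1024;              /* bonnie uses 8 kB transfers */
    long long stripes[] = { 1024, 8 * 1024, 64 * 1024, 512 * 1024 };
    int i;

    for (i = 0; i < 4; i++)
        printf("stripe %6lld bytes: 8 kB transfer at offset 3 kB -> %lld I/O(s)\n",
            stripes[i], pieces(3 * 1024, xfer, stripes[i]));
    return 0;
}

With 1 kB stripes every 8 kB transfer splinters into eight I/Os, which
fits the enormous Mblock and Mstripe counts in the s1k line above.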

I tried two different approaches with the 4 MB stripes: with a default
newfs, I got 16 cylinders per cylinder group and cylinder groups of 32
MB, which placed all the superblocks on the first disk. The second
time I tried with 17 cylinders per cylinder group, which put successive
superblocks on different disks.
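
A back-of-the-envelope sketch of why that happens (my own arithmetic,
assuming 4 subdisks as implied by the 4x figure below, and treating
each superblock as sitting at the start of its cylinder group, which
glosses over FFS's own offsetting of the superblock within the group):

/*
 * Which subdisk the start of each cylinder group lands on, for 32 MB
 * groups (16 cylinders) and 34 MB groups (17 cylinders), with 4 MB
 * stripes over an assumed 4 subdisks.  32 MB is an exact multiple of
 * the 16 MB stripe rotation, so every group starts on the same
 * subdisk; 34 MB is not, so the groups rotate.
 */
#include <stdio.h>

int
main(void)
{
    long long mb = 1024 * 1024;
    long long stripesize = 4 * mb;
    int ndisks = 4;                                 /* assumption */
    long long cgsizes[] = { 32 * mb, 34 * mb };     /* 16 vs. 17 cylinders */
    int i, cg;

    for (i = 0; i < 2; i++) {
        printf("cg size %lld MB: group starts on subdisks", cgsizes[i] / mb);
        for (cg = 0; cg < 6; cg++) {
            long long offset = cg * cgsizes[i];     /* start of group cg */
            printf(" %lld", (offset / stripesize) % ndisks);
        }
        printf("\n");
    }
    return 0;
}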

Some of the things that seem to come out of these results are:

- Performance with 1 kB stripes is terrible. Performance with 8 kB
  stripes is much better, but a further increase in stripe size helps.
  
- Block read and random seek performance increases dramatically up to
  a stripe size of about 64 kB, after which it drops off again.
  
- Block write performance increases up to a stripe size of 512 kB,
  after which it drops off again.
  
- Peak write performance is about 3.5 times that of a straight UFS
  file system. This is due to buffer cache: the writes are
  asynchronous to the process, and can thus overlap. I'm quite happy
  with this particular figure, since it's relatively close to the
  theoretical maximum of a 4x performance improvement.
  
- Peak read performance is about 1.6 times that of a straight UFS file
  system.
  
- RAID-5 read performance is comparable to striped read performance.
  Write performance is about 24% of striped write performance. Note
  that there is a *decrease* in CPU time for RAID-5 writes: the reason
  for the performance decrease is that there are many more I/O
  operations (compare the Reads and Writes columns); the sketch below
  shows where they come from.
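
Here's the toy calculation behind that last point (mine, not anything
measured inside vinum): a small write to a striped plex costs one disk
write, while a RAID-5 write that updates less than a full stripe needs
the classic read-modify-write cycle on both data and parity.

/*
 * Toy arithmetic for the RAID-5 small-write penalty: read old data
 * and old parity, write new data and new parity, versus one plain
 * write on a striped plex.
 */
#include <stdio.h>

int
main(void)
{
    int striped_ios = 1;        /* just write the data block */
    int raid5_ios = 2 + 2;      /* read data + parity, write data + parity */

    printf("striped small write: %d I/O\n", striped_ios);
    printf("RAID-5 small write:  %d I/Os\n", raid5_ios);
    printf("rough throughput ratio: %.0f%%\n",
        100.0 * striped_ios / raid5_ios);
    return 0;
}

The resulting 25% is in the same ballpark as the 24% above, though the
real figure also depends on how often writes can be combined into full
stripes.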
  
The trouble with these results is that they don't make sense.
Although we can see some clear trends, there are also obvious
anomalies:

- On a striped volume, the mapping of reads and writes is identical.
  Why should reads peak at 64 kB and writes at 512 kB?
  
- The number of multiblock and multistripe transfers for s4m/17 is 8
  times that for s4m/16, yet the number of writes for s4m/17 is lower
  than for s4m/16. The number of writes should be the number of raw
  writes to the device (volume) plus the number of multiblock and
  multistripe transfers; in other words, s4m/17 should have *more*
  transfers, not fewer (the arithmetic is spelt out after this list).
  There's obviously something else going on here, and I suspect the
  cache.
  
- The random seek performance is pretty constant for s8k, s64k and
  s512k. Since bonnie performs 8k transfers, this seems reasonable.
  On the other hand, the performance was much worse for s256k, which I
  did last. Again, I suspect that there are other issues here which
  are clouding the results.
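
To spell out that arithmetic (my own check against the table values,
nothing more): if the raw writes to the volume were the same in both
4 MB runs, s4m/17 ought to show more plex-level writes than s4m/16.

/*
 * Expected versus observed plex-level writes for s4m/17, using the
 * Writes, Mblock and Mstripe columns from the table above.
 */
#include <stdio.h>

int
main(void)
{
    int w16 = 20434, mb16 = 17, ms16 = 7;       /* s4m/16 */
    int w17 = 19922, mb17 = 113, ms17 = 31;     /* s4m/17 */
    int raw = w16 - mb16 - ms16;                /* implied raw writes */

    printf("implied raw writes (from s4m/16): %d\n", raw);
    printf("expected s4m/17 writes:           %d\n", raw + mb17 + ms17);
    printf("observed s4m/17 writes:           %d\n", w17);
    return 0;
}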
  
In addition, bonnie does not simulate true file system performance
well. The character I/O benchmarks are not relevant to what we're
looking at, and the block I/O benchmarks all use 8 kB transfers.
True file system activity involves transfers of between 1 and 120
sectors of 512 bytes, with an average apparently on the order of 8 kB.
In real life, the performance benefits of large stripes will be
greater. I'm currently thinking of writing a program which will be
able to simulate this behaviour and get more plausible measurements.
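
A first cut at such a program might look something like the sketch
below: random-length reads of 1 to 120 sectors at random offsets on a
raw device. It's only an outline; the device size is hard-coded here
and the transfer mix is cruder than real file system traffic.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR      512
#define MAXSECTORS  120
#define NTRANSFERS  10000

int
main(int argc, char *argv[])
{
    char buf[MAXSECTORS * SECTOR];
    long long devsize = 100LL * 1024 * 1024;    /* adjust to the test volume */
    long long nsectors, offset;
    int fd, i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s raw-device\n", argv[0]);
        return 1;
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror(argv[1]);
        return 1;
    }
    srandom(getpid());
    for (i = 0; i < NTRANSFERS; i++) {
        /* 1 to 120 sectors, sector-aligned offset within the device */
        nsectors = random() % MAXSECTORS + 1;
        offset = (random() % (devsize / SECTOR)) * SECTOR;
        if (offset + nsectors * SECTOR > devsize)
            offset = devsize - nsectors * SECTOR;
        if (pread(fd, buf, nsectors * SECTOR, offset) < 0)
            perror("pread");
    }
    close(fd);
    return 0;
}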

To add to this theory, I've just re-run the 64 kB test under what look
to me like identical conditions. Here are the results. The first
line is a copy of the one I did yesterday (above); the second line
shows the new results:

              -------Sequential Output-------- ---Sequential Input-- --Random--
              -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine    MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU  Writes   Reads  Mblock Mstripe
s64k      100  1723 48.6  1581 18.6  1021 11.8  1792 39.5  1827 11.1 115.3  8.8   17238    8231    1294     333
s64k      100  1711 48.4  1633 18.9   983 11.5  1778 39.6  1815 11.3  95.8  7.8   16495    8029    7952    1986

In other words, there are significant differences in the way vinum was
accessed in each case, and in particular we can assume that the
differences in random seek performance are, well, random.

Getting back to your results which started this thread, however, there
are some significant differences:

     -------Sequential Output-------- ---Sequential Input-- --Random--
     -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
  MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU  /sec %CPU
 256   541  7.5   491  1.7   458  2.3  4293 59.1  4335 16.2 193.6  6.8
 100   379 14.9   385  8.9   360  4.5  1122 24.7  1258  7.4  80.9  6.7

Comparing the block write and block read performances, vinum gets
about 30% of the read performance on writes. Your DPT write results
show only 11% of the read performance, and are in fact only slightly
faster than vinum with the ancient disks, so I can't see that this
could be due to the faster disks. So yes, I suspect there is
something wrong here. It's possible that DPT doesn't DTRT with large
stripes: Vinum only accesses the part of a stripe which is necessary
for the transfer. It's possible that DPT accesses the complete 512
kB block on each transfer, in which case, of course, it would be
detrimental to use a stripe size in excess of about 64 kB, and you
might even get better performance with 32 kB. If this is the case,
however, it's a bug with DPT's implementation, not a general
principle.
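
For reference, those percentages come straight from the block-write
and block-read columns of the two lines above (the first line is your
DPT run, the second is the vinum r512k run):

#include <stdio.h>

int
main(void)
{
    double dpt_write = 491, dpt_read = 4335;        /* 256 MB DPT run */
    double vinum_write = 385, vinum_read = 1258;    /* 100 MB vinum r512k run */

    printf("DPT:   block write/read = %.1f%%\n", 100 * dpt_write / dpt_read);
    printf("vinum: block write/read = %.1f%%\n", 100 * vinum_write / vinum_read);
    return 0;
}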

Greg

--
See complete headers for address, home page and phone numbers
finger grog(at)lemis.com for PGP public key