User:Tuankiet65/sandbox

A cabinet in the PetaBox system

PetaBox is a storage system designed by Capricorn Technologies[1], staff of the Internet Archive[2], and CR Saikley[3]. The system can safely store and process 1 petabyte (one million gigabytes) of data[4].

History

For many years, the Internet Archive has preserved valuable historical material for future generations. Over time, the volume of stored data kept growing: the Wayback Machine service alone (which archives copies of web pages for posterity) already takes up 1 petabyte of storage, and the number of servers grew along with it. The Internet Archive therefore wanted a system that could hold as much data as possible on inexpensive servers while consuming as little power as possible.

Info

PetaBox is a system designed by Internet Archive staff and CR Saikley to store and process 1 petabyte of data.

Goals

The project goal is to make a low-cost, low-maintenance, high-density storage cluster, and open-source the hardware and software designs for anyone to use. The short-term goal is to get the first cabinet of a petabyte-sized computer cluster shipped to Amsterdam, where it will act as a mirror site for the Internet Archive's data collection. We shipped on May 28th.


The PetaBox(tm), custom-designed by Internet Archive staff, was originally created to safely store and process one petabyte (a million gigabytes) of information. The goals and design points were:

  • Low power: 6kW per rack, 60kW for the entire storage cluster
  • High density: 100+ TB/rack
  • Local computing to process the data (800 low-end PCs)
  • Multi-OS possible, Linux standard
  • Co-location friendly
  • Shipping container friendly: Able to be run in a 20' by 8' by 8' shipping container.
  • Easy Maintenance: One system administrator per petabyte
  • Software to automate full mirroring
  • Easy to scale
  • Inexpensive design
  • Inexpensive storage
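
The power and density targets in the list above fit together arithmetically. The following is a minimal sketch of that check, not part of the original design documents; the variable names are made up for illustration, "100+ TB/rack" is taken at its lower bound, and decimal units (1 PB = 1000 TB) are assumed.

```python
# Design targets quoted in the list above.
RACK_POWER_KW = 6          # "6kW per rack"
CLUSTER_POWER_KW = 60      # "60kW for the entire storage cluster"
RACK_CAPACITY_TB = 100     # "100+ TB/rack", taken at its lower bound
CLUSTER_NODES = 800        # "800 low-end PC's"

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000

# Number of racks the cluster power budget allows.
racks = CLUSTER_POWER_KW // RACK_POWER_KW

# Capacity of that many racks, in petabytes.
cluster_capacity_pb = racks * RACK_CAPACITY_TB / TB_PER_PB

# Nodes per rack if the 800 machines are spread evenly.
nodes_per_rack = CLUSTER_NODES // racks

print(racks, cluster_capacity_pb, nodes_per_rack)
```

Under these assumptions the power budget implies ten racks, which at 100 TB each reach the one-petabyte target with 80 machines per rack.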

The Internet Archive data center now houses ~3PB of PetaBox storage technology and is expanding steadily. Over its 10-year history, the Internet Archive's storage infrastructure has continually evolved. "We're probably on the fourth generation of systems," says John Berry, the Archive's vice president of operations.

The Archive's current storage architecture is a distributed system. As Berry explains, "you couldn't fit all this on one machine. The Wayback Machine alone is about a petabyte of compressed data. So you're kind of stuck using many machines. You also get some nice robustness by having a large number of machines."

The Wayback Machine is a service that allows people to visit archived versions of Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web.

"When you have a lot of computers like we do and a lot of disks like we do, there's always something that's breaking," says Berry. "So you want to have a system that's resilient and allows services to operate in the face of degraded hardware. So we really didn't have a choice about having many machines. Our approach is to use fairly low cost commodity-type hardware, so that we can scale very large at low cost."

Throw in a PetaBox...

The Archive also makes use of a relatively new storage technology called a PetaBox, built by Capricorn Technologies (www.capricorn-tech.com).

"We wanted to have very large amounts of storage in the smallest space and using the least energy possible," explains Berry. So Capricorn developed a high density, low cost, low power, scalable, mass storage solution called a PetaBox (www.petabox.com), actually a family of products, specifically for — and with — the Archive.

"The PetaBox is a software system as much as it's a physical entity," explains Berry. "It will scale out to thousands of machines. And roughly, with the kinds of storage machines we use, you can fit a petabyte in 500 machines, give or take, depending on which disks you put into them. So anywhere in the 500 to 1,000 machine range you can get a petabyte in a PetaBox. Right now we have between 2,000 and 3,000 machines, organized into clusters. [A cluster includes a computer farm, catalog, monitor and storage/PetaBox.] They're all managed as one entity. And that's really the essence of what the PetaBox is: It allows us to manage 2,000 machines as pretty much one entity."

The PetaBox system has dramatically reduced the Archive's disk failure rates, and it is helping the Archive to keep power and administrative costs low. Each rack, which contains between 80 and 100 terabytes of data housed on machines with approximately four disks each, uses only 6kW. And each petabyte in the system only requires one system administrator.

Fourth generation

  • Density: 650 TB per server rack
  • Power consumption: 6 kW per PB
  • No air conditioning; excess heat is used instead to help heat the building.
  • Raw Numbers as of December 2010:
  • 4 data centers, 1,300 nodes, 11,000 spinning disks
  • Wayback Machine: 2.4 PetaBytes
  • Books/Music/Video Collections: 1.7 PetaBytes
  • Total used storage: 5.8 PetaBytes
  • October 2012 update: Total used storage: 10 Petabytes

History

June 2004

  • The first 100TB rack is operational in Amsterdam as of June 2004.
  • The second 80TB rack is operational in San Francisco.
  • The Internet Archive spins off PetaBox production to the newly formed Capricorn Technologies.

2004 - 2007

  • Capricorn replicates the Internet Archive's successful deployment of the PetaBox for major academic institutions, digital preservationists, government agencies, HPC and major research sites, medical imaging providers, digital image repositories, storage outsourcing sites, and other enterprises around the globe.

Design

The cluster consists of a number of "redbox" nodes, which are 1U-sized Mini-ITX systems, optimized for low power consumption and heat dissipation, high storage density, and low cost per unit of disk storage. They each have a 1000MHz VIA processor (used to be underclocked to 800MHz, but we stopped doing that) and four Hitachi 400GB IDE hard drives. Their network is 100BASE-T Ethernet, and their operating system is Debian. The design philosophy is "configuration, not code", meaning that they run as little custom code as possible (and what custom code they do run is open-sourced). There are a few "special" nodes: "homeserver" (and its backup node) has more disk and gigabit Ethernet, and "router" (and its backup node) has more CPU and memory, and gigabit Ethernet, but only one disk (due to space constraints).
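
As a rough illustration of what the node description above implies, this sketch computes the raw capacity of one standard node and the number of such nodes needed for a petabyte. It assumes decimal units (1 PB = 1,000,000 GB) and ignores filesystem overhead, replication, and the "special" nodes; the variable names are hypothetical.

```python
import math

# Figures from the node description above.
DRIVES_PER_NODE = 4
DRIVE_CAPACITY_GB = 400

# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000

# Raw (unformatted, unreplicated) capacity of one standard node.
node_capacity_gb = DRIVES_PER_NODE * DRIVE_CAPACITY_GB

# Standard nodes needed to reach one raw petabyte, rounded up.
nodes_per_petabyte = math.ceil(GB_PER_PB / node_capacity_gb)

print(node_capacity_gb, nodes_per_petabyte)
```

Each node therefore holds 1.6 TB raw, and about 625 nodes add up to one petabyte, which is in line with the 500 to 1,000 machine range quoted earlier.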


The PetaBox products, made by Capricorn Technologies, are based on Via mini-ITX motherboards running Debian or Fedora Linux. The IA's PetaBox installation consists of about 16 racks housing 600 systems with 2,500 spinning drives, for a total capacity of roughly 1.5 petabytes, according to the article.
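
The installation figures above can be cross-checked against each other. The sketch below derives averages only, assuming decimal units (1 PB = 1,000,000 GB) and treating the quoted numbers as exact; the names are illustrative, not taken from any PetaBox software.

```python
# Figures quoted for the Internet Archive installation above.
RACKS = 16
SYSTEMS = 600
DRIVES = 2500
CAPACITY_PB = 1.5

# Assumption: decimal units, 1 PB = 1,000,000 GB.
GB_PER_PB = 1_000_000

# Average number of drives per system.
drives_per_system = DRIVES / SYSTEMS

# Average capacity per drive, in gigabytes.
gb_per_drive = CAPACITY_PB * GB_PER_PB / DRIVES

# Average number of systems per rack.
systems_per_rack = SYSTEMS / RACKS

print(round(drives_per_system, 2), gb_per_drive, systems_per_rack)
```

The result is roughly four drives per system, matching the node design, at an average of 600 GB per drive, which suggests a mix of drive sizes larger than the original 400GB units.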

Sources

http://archive.org/web/petabox.php

http://hardware.slashdot.org/story/05/06/22/0418253/petabox-big-storage-in-small-boxes

http://www.ciar.org/ttk/images/petabox/

http://www.enterprisestorageforum.com/technology/features/article.php/3633256/The-Wayback-Machine-From-Petabytes-to-PetaBoxes.htm

  1. http://hardware.slashdot.org/story/05/06/22/0418253/petabox-big-storage-in-small-boxes
  2. http://archive.org/web/petabox.php
  3. http://www.ciar.org/ttk/images/petabox/
  4. http://archive.org/web/petabox.php