What is WEKA?
The WEKA Data Platform makes data orders of magnitude faster and easier to access, no matter where it’s stored. It eliminates the need for specialized hardware, allowing easy integration of technological advancements without disruptive upgrades. WEKA addresses common data challenges by removing performance bottlenecks, making it suitable for environments requiring low latency, high performance, and cloud scalability.
What are some common use cases for the WEKA Data Platform?
Use cases span various sectors, including AI/ML, Life Sciences, Financial Trading, Engineering DevOps, EDA, Media Rendering, HPC, and GPU pipeline acceleration. Combining existing technologies and engineering innovations, WEKA delivers a powerful, unified solution that outperforms traditional storage systems, efficiently supporting various workloads.
How is WEKA architected?
At the heart of the WEKA Data Platform is WekaFS, a fully distributed parallel filesystem that uses NVMe flash for file services. Integrated tiering seamlessly expands the namespace to and from object storage, simplifying data management as capacity grows. The intuitive GUI allows easy administration of exabytes of data in a single namespace without specialized storage training.
How does WEKA compare to other file and data solutions?
WEKA stands out with its unique architecture, which overcomes the scaling and file-sharing limitations of legacy systems. It supports POSIX, NFS, SMB, S3, and GPUDirect Storage, and offers a rich enterprise feature set, including snapshots, clones, tiering, cloud-bursting, and more. Benefits include high performance across all IO profiles whether on-prem or in the cloud, highly scalable capacity, robust security, hybrid cloud support, private/public cloud backup, and a cost-effective combination of flash and disk. WEKA provides a cloud-like experience and transitions seamlessly between on-premises and cloud environments.
How can WEKA benefit Data Sciences?
- Unstructured data access can be hard to manage when many ‘silos’ of data are created. Data management for data science can be simplified by placing data in a single large namespace: A WEKA single namespace can be expanded non-disruptively. The maximum supported namespace is over 500PB of flash and over 14EB of flash plus object storage, all in a single namespace.
- Data access patterns are very random, with millions of files created and deleted: WEKA is completely parallelized, allowing for simultaneous access to data at hundreds to thousands of GB/s, latencies averaging 150 μs, and multiple millions of IOPS on a single cluster. WEKA breaks all files down into 4KB blocks and builds throughput and IOPS from these blocks. IOPS-intensive, high-throughput, and low-latency workloads can all run natively on the same cluster without custom tuning WEKA for each IO load. Additionally, WEKA stripes all metadata across all cluster servers to ensure the highest-performance metadata operations.
- Millions of files located in a single directory: WEKA can manage up to 6.4 trillion files and directories in a single filesystem, including 6.4 billion in a single directory. A single WEKA cluster can manage up to 1,024 separate filesystems, each with its own tiering policy, access policy, and file structure.
- Applications are moving to the cloud: the WEKA software is the same on-prem as in the cloud (GCP, AWS, Oracle, Azure). Local WEKA filesystem snapshots can be exported to the cloud, and a new WEKA cluster can be created from this point-in-time snapshot (cloud to on-prem is also supported).
- Data is doubling every 7-12 months: a WEKA filesystem can be expanded non-disruptively at any time and supports transparent data tiering to object storage. As project data reaches end of life, WEKA can also transparently contract the cluster and filesystem, saving resources in the cloud and freeing up resources on-prem for the next project.
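The 4KB block breakdown mentioned in the list above can be pictured with a short sketch. This is only an illustration under stated assumptions: the round-robin placement, the server names, and the helper functions `split_into_blocks` and `place_blocks` are hypothetical and do not describe WEKA's actual placement logic, which the text does not specify.

```python
BLOCK_SIZE = 4 * 1024  # 4 KB, the block size named in the text


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks: list[bytes], servers: list[str]) -> dict[str, list[int]]:
    """Assign block indices to servers round-robin (assumed, illustrative policy)."""
    placement: dict[str, list[int]] = {name: [] for name in servers}
    for index in range(len(blocks)):
        placement[servers[index % len(servers)]].append(index)
    return placement


# A file of ten full blocks plus 100 trailing bytes.
file_data = bytes(10 * BLOCK_SIZE + 100)
blocks = split_into_blocks(file_data)

# Hypothetical four-server cluster.
servers = [f"server-{n}" for n in range(4)]
placement = place_blocks(blocks, servers)

for name, indices in placement.items():
    print(name, indices)
```

Because consecutive blocks land on different servers, a large sequential read and many small random reads both fan out across the cluster, which is the general idea behind building throughput and IOPS from small blocks.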
Why is WEKA better than a NAS appliance for Life Sciences workloads?
NAS appliances use the NFS or SMB protocols, which limit the performance (throughput and IOPS) they can provide to a single server. A Life Sciences pipeline (genomics, cryo-EM) is composed of multiple steps in which file access and IO vary from millions of files per directory to a mix of small and large IO sizes. WEKA’s ability to handle mixed workloads that require low latency, high IOPS, metadata operations, and throughput concurrently allows it to serve these workloads and helps ensure that the CPUs/GPUs/FPGAs are never bottlenecked by the storage solution.
How can WEKA maintain backups and disaster recovery of these massive datasets?
Any WEKA Data Platform filesystem can be expanded to one or more S3-compatible object tiers with a simple CLI command, API call, or GUI action; this capability is an integrated part of the platform. WEKA can create instant local data snapshots, managed per filesystem, and send the snapshot data to an S3-compatible object tier as often as required, which lets you create a backup policy managed and maintained by WEKA. Additionally, if the snapshot data on the object tier is sent to another datacenter, the snapshot can be used as an immediate local disaster recovery solution.
How can WEKA provide a high-capacity PB-scale environment, while also providing small block high performance?
The capability to expand the namespace across a flash tier and an object tier allows WEKA to combine the high performance of flash with the cost of an object tier. Additionally, the object tier and the flash tier can grow independently of each other, so capacity can be added based on performance and space needs as your datasets grow. Hot/warm data is retained on flash for high performance, while cool/cold data is placed on the object tier. Due to WEKA’s highly parallel architecture, data can be retrieved from the object tier at high speeds.
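As a rough illustration of the hot/warm versus cool/cold split described above, the sketch below assigns files to a tier by how recently they were accessed. The `FileRecord` type, the `choose_tier` function, and the 30-day threshold are all assumptions made for the example; the text does not state how WEKA decides which data moves to the object tier.

```python
from dataclasses import dataclass


@dataclass
class FileRecord:
    path: str
    days_since_access: int


def choose_tier(record: FileRecord, cold_after_days: int = 30) -> str:
    """Return "flash" for recently used data and "object" for older data.

    The age threshold is a hypothetical policy knob, not a WEKA setting.
    """
    return "object" if record.days_since_access >= cold_after_days else "flash"


# Hypothetical files from an active and a finished project.
files = [
    FileRecord("/projects/active/run_042.dat", days_since_access=1),
    FileRecord("/projects/active/model.ckpt", days_since_access=12),
    FileRecord("/projects/archive/2019_results.tar", days_since_access=400),
]

tiers = {f.path: choose_tier(f) for f in files}
for path, tier in tiers.items():
    print(f"{tier:6s} {path}")
```

All three paths stay visible in one namespace in this picture; only the backing tier differs, which is why the two tiers can be sized independently.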
Can the WEKA cluster be expanded and upgraded without downtime?
WEKA cluster expansions, contractions, and version upgrades do not require downtime and can be performed while all file services are online and servicing IOs.
How is data resiliency protected on the WEKA Data Platform and does it require any special hardware?
WEKA does not require any special hardware to protect data (no RAID, UPSs, NVRAM, SCM, NVDIMMs, or DRAM are required). The data is split into multiple 4KB block segments and spread across all NVMe drives that are part of the WekaFS cluster, using a mechanism very similar to erasure coding. The cluster can sustain multiple component failures (drives, servers, network ports, etc.) while still providing data services, and it self-heals (by rebuilding, for example) back to a fully resilient and protected state.
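To make the idea of erasure-coding-style protection concrete, here is a minimal sketch using single XOR parity over 4KB blocks, the simplest scheme of this family. It is not WEKA's actual protection mechanism, which the text describes only as "very similar to erasure coding" and which tolerates multiple failures; the `xor_blocks` helper and the three-data-plus-one-parity layout are assumptions chosen for brevity.

```python
from functools import reduce

BLOCK_SIZE = 4 * 1024  # 4 KB blocks, as in the text


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, each imagined to live on a different drive.
data_blocks = [bytes([value]) * BLOCK_SIZE for value in (1, 2, 4)]

# One parity block, imagined to live on a fourth drive.
parity = xor_blocks(data_blocks)

# Simulate losing the drive that held block 1, then rebuild it
# from the surviving data blocks plus the parity block.
lost_index = 1
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])

print("rebuilt block matches original:", rebuilt == data_blocks[lost_index])
```

Single parity survives only one lost block per stripe. Schemes that survive several simultaneous failures, as the text describes, use more parity blocks and more involved arithmetic, but the rebuild-from-survivors principle is the same.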