InfinitiesSoft AI-Stack

GIGABYTE has collaborated with InfinitiesSoft to create an integrated private / hybrid cloud platform to streamline data, tools and workflows in AI training & Big Data analysis.
Introduction
GIGABYTE has collaborated with InfinitiesSoft to create an integrated private / hybrid cloud platform that streamlines data, tools and workflows for AI training and Big Data analysis. This cloud platform allows you to virtualize and share the GPU and CPU resources of your bare-metal hardware deployment, maximizing time and cost efficiency when running GPU-based AI / DNN training or CPU-based analysis workloads. The AI-Stack is a complete hardware and software package, available in several versions, that is ready to deploy into your data center.
Why InfinitiesSoft AI-Stack?
AI Training Platform for University
Guided by government policy, AI has become an essential building block of education today. This solution helps colleges and universities comprehensively cultivate cross-disciplinary AI talent with the practical skills and application capabilities needed to support industrial development.
Competitive Advantage for Business
With InfinitiesSoft AI-Stack, you can start using AI services without heavy up-front investment. The platform provides modules for containers, cloud storage, computing and networking, giving users a ready-made environment in which to run or develop AI algorithms. A preloaded Jupyter development tool lets users write and revise their algorithms interactively.
AI Training Platform for Business
With InfinitiesSoft AI-Stack, you can provide a stable AI training and development platform: deploy an AI education, training and development environment, cultivate talent, and shorten the time required to create and enhance value.
An On-Premises Cloud Solution for Machine Learning Workloads
AI-enabled applications and services require data collection, analysis and machine learning to deliver their "smart" capabilities, which in turn require back-end compute, GPU compute and storage infrastructure. Building this infrastructure on-premises is expensive. Using public cloud services could be an option, but not when you need to keep your sensitive data on-premises. What is the solution?

InfinitiesSoft AI-Stack allows you to virtualize and share your bare-metal CPU and GPU resources for maximized efficiency while keeping your sensitive data on-premises. CloudFusion also connects with public cloud services, allowing you to utilize extra capacity when required and deploy your applications outward once your AI training or analytics is completed.

InfinitiesSoft AI-Stack can be deployed on as little as a single 1U GPU server as an AI training platform for your organization or university, or scaled up to a full on-premises hybrid cloud to support all of your compute, storage and GPU computing needs.

InfinitiesSoft AI-Stack Turnkey Packages
InfinitiesSoft AI-Stack Benefits
Performance & Efficiency
Virtualization of GPU and CPU Resources
Allows you to virtualize and share your bare-metal resources to maximize utilization rates and minimize hardware investment costs.
Security & Data Protection
Private or Hybrid Cloud
Features the security and flexibility of an on-premises solution with the option to connect with public cloud services for added capacity when needed.
User Friendly: Ease of Use & Lower Maintenance Requirements
Reduces Complexity to Set Up AI / ML Workloads
Users can focus on AI/ML workloads rather than on system maintenance, adjustment and deployment scheduling.
Reliability & Consistency
Kubernetes Integration
Containers make it easier to develop AI applications, and Kubernetes allows multiple user connections, making it ideal for interactive training jobs.
InfinitiesSoft AI-Stack: Features
The AI-Stack enables data science teams, developers and IT teams to simplify and streamline workloads through a single system, saving them from the time consuming and troublesome tasks of resource allocation, environment setup, container preparation duties or other integration and security woes, and giving them more time instead to focus on the actual work of training machine learning algorithms.
InfinitiesSoft AI-Stack: How It Is Built
Infinities AI-Stack features the following layers within the "stack":

Management Layer
InfinitiesSoft CloudFusion is used as a cloud management platform to dynamically allocate virtualized resources and schedule workloads. CloudFusion also can pool on-premises physical resources with those from public cloud services (AWS, Azure, Google Cloud, Ali-Cloud etc.) to create cloud bursting functionality (a hybrid cloud).

Virtualization Layer
Docker and Kubernetes are used to virtualize GPU resources (containers), OpenStack is used to virtualize CPU resources (virtual machines), and Bigtera VirtualStor Converger or Scaler is used for the software-defined storage cluster.
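As an illustration of the GPU side of this layer, Kubernetes exposes GPUs to containers as an extended resource (nvidia.com/gpu via the NVIDIA device plugin), so a training container simply declares how many GPUs it needs and the scheduler places it on a GPU-equipped node. A minimal sketch of such a pod manifest, with hypothetical image and job names:

```python
# Minimal sketch of a Kubernetes pod manifest requesting GPUs for a training
# container. The pod name and image are hypothetical examples; a platform like
# AI-Stack would generate manifests like this on the user's behalf.

def gpu_training_pod(name: str, image: str, num_gpus: int) -> dict:
    """Build a pod manifest that requests `num_gpus` via the NVIDIA device plugin."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "trainer",
                "image": image,
                # GPUs are requested as an extended resource; the scheduler
                # only places the pod on nodes advertising nvidia.com/gpu.
                "resources": {"limits": {"nvidia.com/gpu": str(num_gpus)}},
            }],
        },
    }

pod = gpu_training_pod("dnn-train-01", "tensorflow/tensorflow:latest-gpu", 2)
print(pod["spec"]["containers"][0]["resources"]["limits"])
```

Because the GPU request is an ordinary resource limit, quota and scheduling policies apply to GPUs the same way they do to CPU and memory.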

Hardware Layer
GIGABYTE server hardware is used for the underlying on-premises private cloud infrastructure.
How It Is Built: Management Layer
InfinitiesSoft CloudFusion Cloud Management Platform
The front-end management platform layer of the AI-Stack is provided by InfinitiesSoft CloudFusion, which can support and integrate over 30 different private and public clouds. This gives users the option to build a hybrid cloud wherein they can join their private cloud to one or more public clouds and reap the benefits of all those cloud options.

Users can easily add, drop or change any of their clouds. They can also use the easy-to-understand visualizations on the dashboard to:
・Allocate resources and manage access
・Evaluate and manage cloud data center CPU and memory usage
・Manage storage resource utilization

Furthermore, a highly elastic open API enables developers to connect and integrate new cloud options as they appear, keeping your options open for future developments.
A CloudFusion deployment for AI-Stack is designed with both users (i.e. AI and data scientists) and administrators in mind, with comprehensive functionality packaged into two portals designated for these distinct roles:

User Portal
When AI and data scientists (as users) log in to the User Portal, they can instantly view resource usage through the dashboard. The User Portal allows users to self-service the allocation of virtual machine (CPU) and container (GPU) resources; select, mount and load their required CPU, GPU, memory and AI frameworks (e.g. TensorFlow, NVCaffe, Caffe2, PyTorch, MXNet, CNTK, etc.); and access any other resource information relating to their work.

For interactive sessions, the system can automatically allocate data buckets so that users can upload source training data for machine learning algorithms and store post-training results (ML models). An object storage service is also provided, allowing users to access bucket resources through an access key ID and secret access key in any S3-compatible tool.
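To make the access-key / secret-key pair concrete: S3-compatible object stores authenticate each request with a signature derived from the secret key, which is why the secret itself never travels over the wire. A minimal sketch of the standard AWS Signature Version 4 key derivation, using only the Python standard library (the credentials below are fabricated examples):

```python
import hashlib
import hmac

# Sketch of how an S3-compatible store uses the secret access key:
# Signature Version 4 derives a per-day signing key from the secret via a
# chain of HMAC-SHA256 operations over date, region and service, and that
# derived key (never the raw secret) signs each request.

def sigv4_signing_key(secret_key: str, date_stamp: str,
                      region: str, service: str) -> bytes:
    """Derive the SigV4 signing key for the given day / region / service."""
    def _hmac(key: bytes, msg: str) -> bytes:
        return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()

    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date_stamp)
    k_region = _hmac(k_date, region)
    k_service = _hmac(k_region, service)
    return _hmac(k_service, "aws4_request")

key = sigv4_signing_key("EXAMPLE-SECRET-KEY", "20240101", "us-east-1", "s3")
```

In practice an S3 client library performs this derivation automatically; the user only supplies the access key ID and secret shown in the portal.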

A batch job mode is also supported, allowing more advanced users to dispatch multiple model-training jobs without further human supervision. When the computing resources needed for model training are temporarily insufficient, a scheduling mechanism automatically places the jobs in a queue, so that multiple jobs can be executed in parallel or started as soon as computing resources become available. This optimizes utilization and avoids leaving computing resources idle, without requiring minute-by-minute human intervention.
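The queueing behavior described above can be sketched as a toy model: jobs that cannot get the GPUs they need immediately are queued, and queued jobs start automatically as GPUs free up. Job names and GPU counts here are purely illustrative:

```python
from collections import deque

# Toy model of batch-job scheduling over a fixed GPU pool: a job either runs
# immediately or waits in a FIFO queue until enough GPUs are released.

class GpuScheduler:
    def __init__(self, total_gpus: int):
        self.free_gpus = total_gpus
        self.queue = deque()   # jobs waiting for GPUs: (name, gpus)
        self.running = {}      # job name -> GPUs currently held

    def submit(self, name: str, gpus: int) -> str:
        if gpus <= self.free_gpus:
            self.free_gpus -= gpus
            self.running[name] = gpus
            return "running"
        self.queue.append((name, gpus))
        return "queued"

    def finish(self, name: str) -> None:
        """Release a job's GPUs, then start queued jobs that now fit."""
        self.free_gpus += self.running.pop(name)
        while self.queue and self.queue[0][1] <= self.free_gpus:
            nxt, gpus = self.queue.popleft()
            self.free_gpus -= gpus
            self.running[nxt] = gpus

sched = GpuScheduler(total_gpus=8)
print(sched.submit("job-a", 6))   # -> running (6 of 8 GPUs taken)
print(sched.submit("job-b", 4))   # -> queued (only 2 GPUs free)
sched.finish("job-a")             # job-b starts automatically
print(sched.running)              # -> {'job-b': 4}
```

A production scheduler adds priorities, preemption and fairness policies on top of this basic queue-and-release loop, but the idle-GPU avoidance works the same way.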
Administrator Portal
CloudFusion supports multi-tenancy. The administrator can define resource limits for each tenant and set user-accessible resource specifications, such as AI frameworks, OpenStack Flavor configurations, and customizable pricing policies. Besides the private cloud platform incorporated within the AI-Stack, additional cloud platform resources can be integrated and managed under the hybrid / multi-cloud management capabilities of CloudFusion, including, but not limited to, resources from public clouds (e.g. AWS, Alicloud) and private clouds (e.g. OpenStack, VMware, Kubernetes).
CloudFusion Administrator Portal Demonstration
How It Is Built: Virtualization Layer
Customized OpenStack Distribution for Machine Learning
The virtualization layer controls the nodes and resource pools of the private cloud's on-premises hardware infrastructure. It is delivered by a customized OpenStack distribution that adds machine learning capabilities, featuring integration with Kubernetes for automatic deployment of machine learning containers onto GPU servers for AI training.

This customized OpenStack distribution is administered and managed by the InfinitiesSoft CloudFusion cloud management platform for resource allocation and scheduling, and is integrated with a software-defined storage cluster using Bigtera VirtualStor™ (Scaler / Converger / Extreme).
On top of standard OpenStack features, this customized distribution also includes the following additions:

・OpenStack and Kubernetes integration: for automatic policy-based deployment of VMs and containers onto any compute node or Kubernetes worker node, through the user-friendly CloudFusion User Portal.

・Tenant-based isolation: OpenStack has an inherent architectural concept of the tenant, which is completely missing from Kubernetes.

・Kubernetes master node clustering: to provide full HA and load balancing capability for Kubernetes.

・NFS – Object Storage Gateway: to ease the migration of legacy software based on the NFS semantics/syntax towards the adoption of an object based storage system.

・Automatic deployment of DNN development environment: specifically, the customized OpenStack distribution automates the deployment of a DNN development environment (TensorFlow, Tensorboard, Caffe, Jupyter, DIGITS etc.) in the form of containers onto GPU enabled servers.

・User authentication and authorization for DNN IDEs (Integrated Development Environments): unlike traditional HPC, DNN development and training is interactive by nature. For the security and integrity of the system, the customized OpenStack distribution provides mandatory authentication and authorization for users.

・Integration with HPC job schedulers: unlike VMs and generic containers, containers for DNN training occupy GPU-enabled servers for acceleration and quicker iterations. Even so, each submitted job can still take days or weeks to complete one iteration, and the use of GPU servers comes at exorbitant cost. Traditional job schedulers such as SLURM or Univa Grid Engine are therefore used to manage and schedule resources (storage and GPUs) and contain that expense.
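When a scheduler such as SLURM is integrated, each training container is typically wrapped in a batch job, and GPUs are reserved through SLURM's generic-resource (gres) directive. A minimal sketch of generating such a batch script (the partition defaults, time limit and training command are illustrative):

```python
# Sketch of generating a SLURM batch script for a containerized DNN training
# job. The job name, time limit and command are illustrative; the
# "--gres=gpu:N" directive is how SLURM reserves GPUs for a job.

def slurm_training_script(job_name: str, gpus: int, hours: int, command: str) -> str:
    """Build an sbatch-compatible script that reserves `gpus` GPUs."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",      # reserve N GPUs on the node
        f"#SBATCH --time={hours}:00:00",   # wall-clock limit (HH:MM:SS)
        "#SBATCH --output=%x-%j.out",      # log named after job name and job ID
        command,
    ]
    return "\n".join(lines) + "\n"

script = slurm_training_script("dnn-train", 4, 48, "python train.py")
print(script)
```

A wrapper like this lets the platform submit user containers through `sbatch` while the scheduler enforces GPU quotas and queue order.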
Integrating AI and Big Data analysis on a virtualized platform
How It Is Built: Hardware Layer
The AI-Stack is designed and optimized to be used with GIGABYTE's 2nd Generation Intel® Xeon® Scalable Family server systems for the underlying hardware layer.

GIGABYTE has a rich product family of server systems designed for Intel's 2nd Generation Xeon® Scalable platform. These systems are engineered to support the full range of workload-optimized Xeon® Scalable SKUs, making your GIGABYTE server ideal for a myriad of use cases, from enterprise IT and databases, cloud and storage, to the most demanding HPC workloads.
・4 x GPU Node – G191-H44 (rev. 100/200): AI-Stack GPU node
・8 x GPU Node – G291-280 (rev. 100): AI-Stack GPU node
・Hyper-Converged Management / Compute / Storage Node – H261-H61 (rev. 100): K8S master node, OpenStack admin & control node, compute node
・Storage Node – S451-3R0 (rev. 100): VirtualStor Scaler storage node
Related Technologies
Artificial Intelligence
Artificial Intelligence (AI) is a broad branch of computer science. The goal of AI is to create machines that can function intelligently and independently, and that can work and react the same way humans do. To build these abilities, machines and the software and applications that enable them need to derive their intelligence the same way humans do – by retaining information and becoming smarter over time. AI is not a new concept – the idea has been discussed since the 1950s – but it has only recently become technically feasible to develop and deploy in the real world, thanks to advances in technology: our ability to collect and store the huge amounts of data required for machine learning, and rapid increases in processing speed and computing capability that make it possible to process the collected data to train a machine or application and make it "smarter".
Big Data
Big Data describes the large volume of data – structured, semi-structured and unstructured – that is collected by a business on a daily basis. This data can be generated by humans (such as a customer's financial transactions) as well as by machines and processes (such as sensor readings and event logs). By its nature, Big Data is often massive – ranging from terabytes to petabytes and even exabytes of data captured over time.
Hybrid Cloud
A hybrid cloud is a computing environment that combines on-premises virtualized compute / storage / networking resources (a "private cloud") together with "public cloud" resources (compute / storage / networking resources provided by third parties such as Amazon Web Services, Microsoft Azure or Google Cloud), and allows data and applications to be shared between the two.