
Metadata

  • Author: Ilay Chen
  • Full Title: Leveraging Spark 3 and NVIDIA’s GPUs to Reduce Cloud Cost by Up to 70% for Big Data Pipelines
  • Category: articles
  • Document Note: Spark 3 RAPIDS can use GPU to accelerate data processing
  • Summary: By Ilay Chen and Tomer Akirav. At PayPal, hundreds of thousands of Apache Spark jobs run on an hourly basis, processing petabytes of data and requiring a high volume of resources. To handle the growth of machine learning solutions, PayPal requires scalable environments, cost awareness and constant innovation. This blog explains how Apache Spark 3 and GPUs can help enterprises potentially reduce Apache Spark jobs' cloud costs by up to 70% for big data processing and AI applications. Our journey begins with a brief introduction of Spark RAPIDS — Apache Spark's accelerator that leverages GPUs to accelerate processing via the RAPIDS libraries. We then review PayPal's CPU-based Spark 2 application and our upgrade to Spark 3 and its new capabilities, explore the migration of our Apache Spark application to a GPU cluster and how we tuned Spark RAPIDS parameters, and finally discuss some challenges we encountered and the benefits of the updates.
  • URL: https://medium.com/paypal-tech/leveraging-spark-3-and-nvidias-gpus-to-reduce-cloud-cost-by-up-to-70-for-big-data-pipelines-e0bc02ec4f88?source=rss----6423323524ba---4

Highlights

  • Spark RAPIDS is a project that enables the use of GPUs in a Spark application.
  • It is beneficial for large joins, group-bys, sorts, and similar functions.
  • GPUs have their own environment and programming languages, so we can't easily run Python/Scala/Java/SQL code on them. The code must be translated to a GPU programming language, and Spark RAPIDS performs this translation transparently. Another notable design change in Spark RAPIDS is how tasks are handled in each stage of the job's Spark plan. In pure Spark, every task of a stage is sent to a single CPU core in the cluster, so parallelism exists only at the task level. In Spark RAPIDS, parallelism is also intra-task: tasks run in parallel, and the data within each task is processed in parallel as well.
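
The behavior described in the highlights — transparent GPU translation and intra-task parallelism — is enabled and tuned entirely through Spark configuration; application code is unchanged. As a minimal sketch, a `spark-defaults.conf` fragment for turning on the RAPIDS accelerator might look like the following. The keys are real Spark and Spark RAPIDS configuration names, but the values are illustrative only and are not the tuned settings from the article:

```properties
# Load the Spark RAPIDS plugin so Catalyst can swap supported
# operators (joins, group-bys, sorts, ...) for GPU versions.
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.sql.enabled=true

# Resource scheduling: one GPU per executor, shared by several tasks
# (0.25 means up to 4 tasks may be scheduled per GPU). Illustrative values.
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=0.25

# How many tasks may execute on the GPU concurrently — this is one of the
# knobs behind the intra-task parallelism described above. Illustrative value.
spark.rapids.sql.concurrentGpuTasks=2
```

With a configuration like this in place, existing DataFrame/SQL code runs as-is, and supported operators appear as GPU nodes (e.g. `GpuHashAggregate`) in the physical plan shown by `df.explain()`.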