Support better integration between Ray and Spark in passing ObjectRef without actually moving data

## Overview

As a Codeflare user, I want to use Ray and Spark alternately to execute my end-to-end ML jobs.  Some steps might be executed more efficiently using Ray, while others using Spark.  The plasma store in Ray seems to provide an efficient way to share ObjectRef between Ray and Spark.  Currently, RayDP project supports from Spark to Ray in some limited way, by running Spark as a Ray actor.  However, ObjectRef cannot be shared easily in both directions, Spark-2-Ray and Ray-2-Spark.  



### Acceptance Criteria

- `Pandas dataframe` created by remote tasks in local Ray plasma stores can be passed with `ObjectRef `to the Spark driver to create a `Spark dataframe `containing list of `ObjectRef`.  
- Once that is done, on the Spark side, the executors of Spark can then access to the original Pandas dataframe locally.
- From Spark to Ray: Spark preserves `groupby()` partition semantics and writes these partitions to plasma store, instead of using `hashPartition()`.  



### Questions

- In RayDP, only the driver node knows about and can access Ray.  The executors of PySpark doesn't have access to Ray.  This will prevent the PySpark executors from accessing the Ray plasma store.  As a result, it is not possible to seamlessly pass `ObjectRef `between Ray workers and Spark executors.



### Assumptions

- Ray and Spark can share data seamlessly by exchanging ObjectRef among Ray workers and Spark executors. 



### Reference

[Reference] I have opened an issue on the RayDP repo: https://github.com/oap-project/raydp/issues/164

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support better integration between Ray and Spark in passing ObjectRef without actually moving data #32

Overview

Acceptance Criteria

Questions

Assumptions

Reference

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Support better integration between Ray and Spark in passing ObjectRef without actually moving data #32

Description

Overview

Acceptance Criteria

Questions

Assumptions

Reference

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions