
spark-matrix-multiplication

public · 3 stars · 4 forks · 0 issues

Commits

List of commits on branch master.
48b6ecf1b8c5544341e71568c3980ad29e1371a3
Extended the tests with more accurate timing (committed 7 years ago)

2f05d4be68cecd73360b3a6aa253c13357802b20
Put some sensible text onto the README file (committed 7 years ago)

d75dfd3af9e559d6588211c20c38daaf8a5192df
Added tests and implemented the matrix multiplication in SQL (committed 7 years ago)

cd892b42b19d8a15d050238e99b7c0ec501645ef
First commit (committed 7 years ago)

d6abe22007e841750feaabcb37187a1ead9a334d
Initial commit (Fokko committed 7 years ago)

README

The README file for this repository.

Spark Matrix Multiplication in SQL

Doing sparse matrix multiplication with Spark SQL and DataFrames might sound obscure, but right now there is no truly efficient way of doing it inside Spark. The problem is that the multiplication always converts the operands to dense matrices:

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.scala#L519

As a result, huge amounts of memory can be allocated unnecessarily. More sophisticated sparse matrix formats avoid this; for more information, please refer to: https://en.wikipedia.org/wiki/Sparse_matrix

Sparse primitives are being integrated into Spark's ML codebase, but they have not made it into the distributed matrices yet:

https://issues.apache.org/jira/browse/SPARK-17471

The aim of this project is to benchmark sparse matrix multiplication using a COO-like (coordinate) format and to collect performance figures for different sparsity levels.
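
As a minimal sketch of the idea (names are illustrative; this is not necessarily the exact code in multiply.py): each matrix is stored as a DataFrame of (i, j, value) triples holding only the non-zero entries, so the product C[i, k] = sum over j of A[i, j] * B[j, k] reduces to a join on the shared index j followed by a grouped sum:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coo-matmul").getOrCreate()

# COO representation: one row per non-zero entry.
a = spark.createDataFrame([(0, 0, 1.0), (0, 1, 2.0), (1, 1, 3.0)], ["i", "j", "val"])
b = spark.createDataFrame([(0, 0, 4.0), (1, 0, 5.0), (1, 1, 6.0)], ["j", "k", "val"])

a.createOrReplaceTempView("A")
b.createOrReplaceTempView("B")

# C[i, k] = SUM_j A[i, j] * B[j, k]; the join only ever produces rows where
# both operands are non-zero, so the sparsity is preserved throughout.
c = spark.sql("""
    SELECT A.i, B.k, SUM(A.val * B.val) AS val
    FROM A JOIN B ON A.j = B.j
    GROUP BY A.i, B.k
""")
c.show()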

In practice, one table is often huge while the second is reasonably small. If the smaller one fits into memory, broadcasting that matrix to all the executors can be worthwhile.
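
A sketch of the broadcast variant, reusing the DataFrames from the example above. The broadcast() hint is standard pyspark.sql.functions API; the surrounding code is illustrative rather than the repository's implementation:

from pyspark.sql.functions import broadcast, col, sum as spark_sum

# Rename to avoid a column clash with a.val after the join.
b_small = b.withColumnRenamed("val", "b_val")

# The broadcast() hint ships the small matrix to every executor,
# turning the shuffle join into a map-side join.
c = (a.join(broadcast(b_small), "j")
      .groupBy("i", "k")
      .agg(spark_sum(col("val") * col("b_val")).alias("val")))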

When using broadcasting, make sure that the driver has sufficient memory:

spark-submit --conf spark.driver.memory=4g multiply.py

Tests

The code is executed using Spark 2.1:

./spark-submit --conf spark.driver.memory=20g ../../spark-matrix-multiplication/multiply.py 

To tune the sizes of the matrices being multiplied, we use a single parameter N: a matrix of shape [N, N/2] is multiplied by one of shape [N/10, N/20]. By default the tests use a sparsity of 5% and no broadcasting.
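
For illustration, a hypothetical helper for generating such a random COO matrix could look like this (random_coo is our own name, not part of the repository):

import random

def random_coo(rows, cols, sparsity, seed=42):
    # Roughly `sparsity` of the entries are non-zero. Duplicate coordinates
    # may occur; they behave like summed entries, which is harmless for a
    # timing benchmark.
    rng = random.Random(seed)
    nnz = int(rows * cols * sparsity)
    return [(rng.randrange(rows), rng.randrange(cols), rng.random())
            for _ in range(nnz)]

n = 1000  # keep N small for a local smoke test
a = spark.createDataFrame(random_coo(n, n // 2, 0.05), ["i", "j", "val"])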

For the benchmarks below we used a machine with 48 cores and 1 TB of RAM.

Benchmarks were run for N = 1 000 000, N = 10 000 000 and N = 100 000 000 (the result plots are not included here).