Tpetra parallel linear algebra Version of the Day
Loading...
Searching...
No Matches
Tpetra_DistributionLowerTriangularBlock.hpp
1// @HEADER
2// *****************************************************************************
3// Tpetra: Templated Linear Algebra Services Package
4//
5// Copyright 2008 NTESS and the Tpetra contributors.
6// SPDX-License-Identifier: BSD-3-Clause
7// *****************************************************************************
8// @HEADER
9
10#ifndef __TPETRA_DISTRIBUTORLOWERTRIANGULARBLOCK_HPP
11#define __TPETRA_DISTRIBUTORLOWERTRIANGULARBLOCK_HPP
12
13// Needed by DistributionLowerTriangularBlock
14#include "Tpetra_Distributor.hpp"
15
16// Needed by LowerTriangularBlock operator
17#include "Tpetra_Core.hpp"
18#include "Tpetra_Map.hpp"
19#include "Tpetra_Operator.hpp"
20#include "Tpetra_Vector.hpp"
21#include "Tpetra_CrsMatrix.hpp"
22
23namespace Tpetra {
24
26template <typename gno_t, typename scalar_t, typename node_t>
27class DistributionLowerTriangularBlock : public Distribution<gno_t, scalar_t> {
28 // Seher Acer's lower-triangular block decomposition for triangle counting
29 // See also: LowerTriangularBlockOperator below that allows this distribution
30 // to be used in Tpetra SpMV.
31 //
32 // Requirements:
33 // Matrix must be square (undirected graph)
34 // Number of processors np = q(q+1)/2 for some q.
35 //
36 // Only the lower triangular portion of the matrix is stored.
37 // Processors are arranged logically as follows:
38 // 0
39 // 1 2
40 // 3 4 5
41 // ...
42 //
43 // The lower triangular part of the matrix is stored in a 2D distribution.
44 // For example, the dense 7x7 lower triangular matrix below would be assigned
45 // to processors according to numbers shown as follows:
46 // 0 | |
47 // 00| |
48 // ---------
49 // 11|2 |
50 // 11|22 |
51 // 11|222|
52 // ---------
53 // 33|444|5
54 // 33|444|55
55 // ...
56 // (Note that we expect the matrix to be sparse. For dense matrices,
57 // CrsMatrix is the wrong tool.)
58 //
59 // Matrix rows are assigned to processor rows greedily to roughly balance
60 // (# nonzeros in processor row / # processors in processor row)
61 // across processor rows.
62 // The same cuts are used to divide rows and columns among processors
63 // (that is, all processors have a square block).
64 //
65 // The lower triangular algorithm:
66 // 1. distribute all matrix entries via 1D linear distribution
67 // (this initial distribution is needed to avoid storing the entire
68 // matrix on one processor, while providing info about the nonzeros per row
69 // needed in step 2.
70 // 2. (optional) sort rows in decreasing order wrt the number of nonzeros
71 // per row
72 // 3. find "chunk cuts": divisions in row assignments such that
73 // (# nonzeros in processor row / # processors in processor row) is
74 // roughly equal for all processor rows
75 // 4. send nonzeros to their new processor assignment
76 //
77 // Known issues: (TODO)
78 // - The sorting in Step 2 and computation of chunk cuts in step 3 are
79 // currently done in serial and requires O(number of rows) storage each
80 // processor. More effort could parallelize this computation, but parallel
81 // load balancing algorithms are more appropriate in Zoltan2 than Tpetra.
82 // - The sorting in Step 2 renumbers the rows (assigns new Global Ordinals to
83 // the rows) to make them contiguous, as needed in Acer's triangle counting
84 // algorithm.
85 // (Acer's algorithm relies on local indexing from the chunk boundaries to
86 // find neighbors needed for communication.)
87 // The class currently provides a permutation matrix P describing the
88 // reordering. Thus, the matrix stored in the lower triangular block
89 // distribution is actually P A P -- the row and column permutation of
90 // matrix A in the Matrix Market file.
91 // The fact that a permuted matrix is stored complicates use of the matrix
92 // in algorithms other than Acer's triangle counting. For SpMV with the
93 // vector numbered according to the MatrixMarket numbering, for example,
94 // P^T must be applied to the vector before SpMV, and P^T must be applied to
95 // the result of SpMV. See LowerTriangularBlockOperator to see how this
96 // permutation matrix is used.
97 //
98 // Before addressing these issues, we will decide (TODO)
99 // - Is this Distribution general enough to be in Tpetra?
100 // - Should we, instead, have a separate package for distributions (that could
101 // use Zoltan2 and Tpetra without circular dependence)?
102 // - Or should we allow users (such as the triangle counting algorithm) to
103 // provide their own distributions (e.g., LowerTriangularBlock) that
104 // inherit from Tpetra's Distribution class?
105 // For now, we will push this Distribution into Tpetra, but we will revisit
106 // this decision.
107
108 public:
109 using Distribution<gno_t, scalar_t>::me;
110 using Distribution<gno_t, scalar_t>::np;
111 using Distribution<gno_t, scalar_t>::comm;
112 using Distribution<gno_t, scalar_t>::nrows;
113 using typename Distribution<gno_t, scalar_t>::NZindex_t;
114 using typename Distribution<gno_t, scalar_t>::LocalNZmap_t;
115
119
120 DistributionLowerTriangularBlock(size_t nrows_,
121 const Teuchos::RCP<const Teuchos::Comm<int> > &comm_,
122 const Teuchos::ParameterList &params)
123 : Distribution<gno_t, scalar_t>(nrows_, comm_, params)
124 , initialDist(nrows_, comm_, params)
125 , sortByDegree(false)
126 , permMatrix(Teuchos::null)
127 , redistributed(false)
128 , chunksComputed(false)
129 , nChunks(0) {
130 int npnp = 2 * np;
131 nChunks = int(std::sqrt(float(npnp)));
132 while (nChunks * (nChunks + 1) < npnp) nChunks++;
133
134 TEUCHOS_TEST_FOR_EXCEPTION(nChunks * (nChunks + 1) != npnp, std::logic_error,
135 "Number of processors np = " << np << " must satisfy np = q(q+1)/2 for some q"
136 << " for LowerTriangularBlock distribution");
137 nChunksPerRow = double(nChunks) / double(nrows);
138
139 const Teuchos::ParameterEntry *pe = params.getEntryPtr("sortByDegree");
140 if (pe != NULL) sortByDegree = pe->getValue<bool>(&sortByDegree);
141
142 pe = params.getEntryPtr("readPerProcess");
143 if (pe != NULL) redistributed = pe->getValue<bool>(&redistributed);
144
145 if (me == 0) std::cout << "\n LowerTriangularBlock Distribution: "
146 << "\n np = " << np
147 << "\n nChunks = " << nChunks
148 << std::endl;
149 }
150
151 enum DistributionType DistType() { return LowerTriangularBlock; }
152
153 bool areChunksComputed() { return chunksComputed; }
154
155 Teuchos::Array<gno_t> getChunkCuts() {
156 if (chunksComputed)
157 return chunkCuts;
158 else {
159 throw std::runtime_error("Error: Requested chunk cuts have not been computed yet.");
160 }
161 }
162
163 // Return whether this rank owns vector entry i.
164 // TODO: for now, use same vector dist as 1DLinear;
165 // TODO: think about best distribution of Vectors
166 inline bool VecMine(gno_t i) { return initialDist.VecMine(i); }
167
168 // Return whether this rank owns nonzero (i,j)
169 // Vector map and row map are the same in 1D distribution.
170 // But keep only the lower Triangular entries
171 bool Mine(gno_t i, gno_t j) {
172 if (redistributed) {
173 if (j > i)
174 return false; // Don't keep any upper triangular entries
175 else
176 return (procFromChunks(i, j) == me);
177 } else
178 return initialDist.Mine(i, j);
179 }
180
181 inline bool Mine(gno_t i, gno_t j, int p) { return Mine(i, j); }
182
183 // How to redistribute according to chunk-based row distribution
184 void Redistribute(LocalNZmap_t &localNZ) {
185 // Going to do chunking and sorting serially for now;
186 // need to gather per-row information from each processor
187 // TODO: think about a parallel implementation
188
189 gno_t myFirstRow = initialDist.getFirstRow(me);
190 gno_t nMyRows = initialDist.getNumRow(me);
191 Teuchos::Array<gno_t> nnzPerRow(nMyRows, 0);
192
193 Teuchos::Array<int> rcvcnt(np);
194 Teuchos::Array<int> disp(np);
195 for (int sum = 0, p = 0; p < np; p++) {
196 int prows = initialDist.getNumRow(p);
197 rcvcnt[p] = prows;
198 disp[p] = sum;
199 sum += prows;
200 }
201
202 // If desire sortByDegree, first need to sort with respect to ALL entries
203 // in matrix (not lower-triangular entries);
204 // decreasing sort by number of entries per row in global matrix.
205 // Generate permuteIndex for the sorted rows
206
207 Teuchos::Array<gno_t> permuteIndex; // This is the inverse permutation
208 Teuchos::Array<gno_t> sortedOrder; // This is the original permutation
209
210 Teuchos::Array<gno_t> globalRowBuf;
211 // TODO Dunno why there isn't a Teuchos::gatherAllv;
212 // TODO for now, compute and broadcast
213 if (me == 0) {
214 globalRowBuf.resize(nrows, 0); // TODO: Ick! Need parallel
215 }
216
217 if (sortByDegree) {
218 // Compute nnzPerRow; distribution is currently 1D and includes all nz
219 for (auto it = localNZ.begin(); it != localNZ.end(); it++) {
220 gno_t I = it->first.first;
221 nnzPerRow[I - myFirstRow]++;
222 }
223
224 Teuchos::gatherv<int, gno_t>(nnzPerRow.getRawPtr(), nMyRows,
225 globalRowBuf.getRawPtr(),
226 rcvcnt.getRawPtr(), disp.getRawPtr(),
227 0, *comm);
228
229 permuteIndex.resize(nrows); // TODO: Ick! Need parallel
230 sortedOrder.resize(nrows); // TODO: Ick! Need parallel
231
232 if (me == 0) { // TODO: do on all procs once have allgatherv
233
234 for (size_t i = 0; i != nrows; i++) sortedOrder[i] = i;
235
236 std::sort(sortedOrder.begin(), sortedOrder.end(),
237 [&](const size_t &a, const size_t &b) {
238 return (globalRowBuf[a] > globalRowBuf[b]);
239 });
240
241 // Compute inverse permutation; it is more useful for our needs
242 for (size_t i = 0; i < nrows; i++) {
243 permuteIndex[sortedOrder[i]] = i;
244 }
245 }
246
247 Teuchos::broadcast<int, gno_t>(*comm, 0, permuteIndex(0, nrows));
248 // Ick! Use a directory TODO
249
250 // Sorting is changing the global IDs associated
251 // with rows/columns. To make this distribution applicable beyond
252 // triangle counting (e.g., in a Tpetra operator), we need a way
253 // to map from the original global IDs and back again.
254 // Create a permutation matrix for use in the operator; use
255 // default Tpetra layout.
256 Teuchos::Array<gno_t> myRows;
257 for (size_t i = 0; i < nrows; i++) {
258 if (VecMine(i)) myRows.push_back(i);
259 }
260
262 Teuchos::OrdinalTraits<Tpetra::global_size_t>::invalid();
263 Teuchos::RCP<const map_t> permMap =
264 rcp(new map_t(dummy, myRows(), 0, comm));
265
266 permMatrix = rcp(new matrix_t(permMap, 1)); // one nz / row in permMatrix
267
268 Teuchos::Array<gno_t> cols(1);
269 Teuchos::Array<scalar_t> vals(1);
270 vals[0] = 1.;
271
272 for (size_t i = 0; i < permMap->getLocalNumElements(); i++) {
273 gno_t gid = permMap->getGlobalElement(i);
274 cols[0] = permuteIndex[gid];
275 permMatrix->insertGlobalValues(gid, cols(), vals());
276 }
277
278 permMatrix->fillComplete(permMap, permMap);
279 }
280
281 // Now, to determine the chunks, we care only about the number of
282 // nonzeros in the lower triangular matrix.
283 // Compute nnzPerRow; distribution is currently 1D
284 nnzPerRow.assign(nMyRows, 0);
285 size_t nnz = 0;
286 for (auto it = localNZ.begin(); it != localNZ.end(); it++) {
287 gno_t I = (sortByDegree ? permuteIndex[it->first.first]
288 : it->first.first);
289 gno_t J = (sortByDegree ? permuteIndex[it->first.second]
290 : it->first.second);
291 if (J <= I) { // Lower-triangular part
292 nnzPerRow[it->first.first - myFirstRow]++;
293 nnz++;
294 }
295 }
296
297 // TODO Dunno why there isn't a Teuchos::gatherAllv;
298 // TODO for now, compute and broadcast
299
300 Teuchos::gatherv<int, gno_t>(nnzPerRow.getRawPtr(), nMyRows,
301 globalRowBuf.getRawPtr(),
302 rcvcnt.getRawPtr(), disp.getRawPtr(),
303 0, *comm);
304
305 Teuchos::Array<int>().swap(rcvcnt); // no longer needed
306 Teuchos::Array<int>().swap(disp); // no longer needed
307
308 size_t gNnz;
309 Teuchos::reduceAll(*comm, Teuchos::REDUCE_SUM, 1, &nnz, &gNnz);
310
311 chunkCuts.resize(nChunks + 1, 0);
312
313 if (me == 0) { // TODO: when have allgatherv, can do on all procs
314 // TODO: or better, implement parallel version
315
316 // Determine chunk cuts
317 size_t target = gNnz / np; // target nnz per processor
318 size_t targetRunningTotal = 0;
319 size_t currentRunningTotal = 0;
320 gno_t I = gno_t(0);
321 for (int chunkCnt = 0; chunkCnt < nChunks; chunkCnt++) {
322 targetRunningTotal = (target * (chunkCnt + 1));
323 currentRunningTotal = 0;
324 while (I < static_cast<gno_t>(nrows)) {
325 size_t nextNnz = (sortByDegree ? globalRowBuf[sortedOrder[I]]
326 : globalRowBuf[I]);
327 if (currentRunningTotal + nextNnz <= targetRunningTotal) {
328 currentRunningTotal += nextNnz;
329 I++;
330 } else
331 break;
332 }
333 chunkCuts[chunkCnt + 1] = I;
334 }
335 chunkCuts[nChunks] = static_cast<gno_t>(nrows);
336 }
337
338 // Free memory associated with globalRowBuf
339 Teuchos::Array<gno_t>().swap(globalRowBuf);
340
341 Teuchos::broadcast<int, gno_t>(*comm, 0, chunkCuts(0, nChunks + 1));
342 chunksComputed = true;
343
344 // Determine new owner of each nonzero; buffer for sending
345 Kokkos::View<gno_t *, Kokkos::HostSpace> iOut("iOut", localNZ.size());
346 Kokkos::View<gno_t *, Kokkos::HostSpace> jOut("jOut", localNZ.size());
347 Kokkos::View<scalar_t *, Kokkos::HostSpace> vOut("vOut", localNZ.size());
348 Teuchos::Array<int> pOut(localNZ.size());
349
350 size_t sendCnt = 0;
351 for (auto it = localNZ.begin(); it != localNZ.end(); it++) {
352 iOut[sendCnt] = (sortByDegree ? permuteIndex[it->first.first]
353 : it->first.first);
354 jOut[sendCnt] = (sortByDegree ? permuteIndex[it->first.second]
355 : it->first.second);
356 if (jOut[sendCnt] <= iOut[sendCnt]) { // keep only lower diagonal entries
357 vOut[sendCnt] = it->second;
358 pOut[sendCnt] = procFromChunks(iOut[sendCnt], jOut[sendCnt]);
359
360 sendCnt++;
361 }
362 }
363
364 // Free memory associated with localNZ and permuteIndex
365 LocalNZmap_t().swap(localNZ);
366 if (sortByDegree) Teuchos::Array<gno_t>().swap(permuteIndex);
367
368 // Use a Distributor to send nonzeros to new processors.
369 Tpetra::Distributor plan(comm);
370 size_t nrecvs = plan.createFromSends(pOut(0, sendCnt));
371 Kokkos::View<gno_t *, Kokkos::HostSpace> iIn("iIn", nrecvs);
372 Kokkos::View<gno_t *, Kokkos::HostSpace> jIn("jIn", nrecvs);
373 Kokkos::View<scalar_t *, Kokkos::HostSpace> vIn("vIn", nrecvs);
374
375 // TODO: With more clever packing, could do only one round of communication
376 auto sendIndices = std::make_pair(static_cast<size_t>(0), sendCnt);
377 plan.doPostsAndWaits(Kokkos::subview(iOut, sendIndices), 1, iIn);
378 plan.doPostsAndWaits(Kokkos::subview(jOut, sendIndices), 1, jIn);
379 plan.doPostsAndWaits(Kokkos::subview(vOut, sendIndices), 1, vIn);
380
381 // Put received nonzeros in map
382 for (size_t n = 0; n < nrecvs; n++) {
383 NZindex_t nz(iIn[n], jIn[n]);
384 localNZ[nz] = vIn[n];
385 }
386
387 redistributed = true;
388 }
389
390 Teuchos::RCP<matrix_t> getPermutationMatrix() const { return permMatrix; }
391
392 private:
393 // Initially distribute nonzeros with a 1D linear distribution
394 Distribution1DLinear<gno_t, scalar_t, node_t> initialDist;
395
396 // Flag indicating whether matrix should be reordered and renumbered
397 // in decreasing sort order of number of nonzeros per row in full matrix
398 bool sortByDegree;
399
400 // Column permutation matrix built only when sortByDegree = true;
401 Teuchos::RCP<matrix_t> permMatrix;
402
403 // Flag whether redistribution has occurred yet
404 // This is true
405 // i) after Tpetra performs the redistribution or
406 // ii) when Tpetra reads already-distributed nonzeros by readPerProcess function
407 bool redistributed;
408
409 // If we read the already-distributed nonzeros from per-process files,
410 // this will remain false until a triangle counting code actually computes
411 // the chunks when the need arises.
412 bool chunksComputed;
413
414 int nChunks; // in np = q(q+1)/2 layout, nChunks = q
415 double nChunksPerRow;
416 Teuchos::Array<gno_t> chunkCuts;
417
418 int findIdxInChunks(gno_t I) {
419 int m = I * nChunksPerRow;
420 while (I < chunkCuts[m]) m--;
421 while (I >= chunkCuts[m + 1]) m++;
422 return m;
423 }
424
425 int procFromChunks(gno_t I, gno_t J) {
426 int m = findIdxInChunks(I);
427 int n = findIdxInChunks(J);
428 int p = m * (m + 1) / 2 + n;
429 return p;
430 }
431};
432
434// Tpetra::Operator that works with the DistributionLowerTriangularBlock
435
436template <typename scalar_t,
437 class Node = ::Tpetra::Details::DefaultTypes::node_type>
438class LowerTriangularBlockOperator : public Tpetra::Operator<scalar_t, Tpetra::Map<>::local_ordinal_type,
439 Tpetra::Map<>::global_ordinal_type,
440 Node> {
441 public:
444 using map_t = Tpetra::Map<>;
445 using import_t = Tpetra::Import<>;
446 using export_t = Tpetra::Export<>;
447 using vector_t = Tpetra::Vector<scalar_t>;
448 using mvector_t = Tpetra::MultiVector<scalar_t>;
451
452 LowerTriangularBlockOperator(
453 const Teuchos::RCP<const matrix_t> &lowerTriangularMatrix_,
454 const dist_t &dist)
455 : lowerTriangularMatrix(lowerTriangularMatrix_)
456 , permMatrix(dist.getPermutationMatrix()) {
457 // LowerTriangularBlockOperator requires the range map and domain map
458 // to be the same. Check it here.
459 TEUCHOS_TEST_FOR_EXCEPTION(
460 !lowerTriangularMatrix->getRangeMap()->isSameAs(
461 *lowerTriangularMatrix->getDomainMap()),
462 std::logic_error,
463 "The Domain and Range maps of the LowerTriangularBlock matrix "
464 "must be the same");
465
466 // Extract diagonals
467
468 vector_t diagByRowMap(lowerTriangularMatrix->getRowMap());
469 lowerTriangularMatrix->getLocalDiagCopy(diagByRowMap);
470 diag = Teuchos::rcp(new vector_t(lowerTriangularMatrix->getRangeMap()));
471 Tpetra::Export<> exporter(lowerTriangularMatrix->getRowMap(),
472 lowerTriangularMatrix->getRangeMap());
473 diag->doExport(diagByRowMap, exporter, Tpetra::ADD);
474 }
475
476 void apply(const mvector_t &x, mvector_t &y, Teuchos::ETransp mode,
477 scalar_t alpha, scalar_t beta) const {
478 scalar_t ZERO = Teuchos::ScalarTraits<scalar_t>::zero();
479 scalar_t ONE = Teuchos::ScalarTraits<scalar_t>::one();
480 if (alpha == ZERO) {
481 if (beta == ZERO)
482 y.putScalar(ZERO);
483 else
484 y.scale(beta);
485 return;
486 }
487
488 if (permMatrix == Teuchos::null) {
489 // Multiply lower triangular
490 lowerTriangularMatrix->apply(x, y, Teuchos::NO_TRANS, alpha, beta);
491
492 // Multiply upper triangular
493 lowerTriangularMatrix->apply(x, y, Teuchos::TRANS, alpha, ONE);
494
495 // Subtract out duplicate diagonal terms
496 y.elementWiseMultiply(-alpha, *diag, x, ONE);
497 } else {
498 // With sorting, the LowerTriangularBlock distribution stores (P^T A P)
499 // in the CrsMatrix, for permutation matrix P.
500 // Thus, apply must compute
501 // y = P (beta (P^T y) + alpha (P^T A P) (P^T x))
502
503 vector_t xtmp(x.getMap(), x.getNumVectors());
504 vector_t ytmp(y.getMap(), y.getNumVectors());
505
506 permMatrix->apply(x, xtmp, Teuchos::TRANS);
507 if (beta != ZERO) permMatrix->apply(y, ytmp, Teuchos::TRANS);
508
509 // Multiply lower triangular
510 lowerTriangularMatrix->apply(xtmp, ytmp, Teuchos::NO_TRANS, alpha, beta);
511
512 // Multiply upper triangular
513 lowerTriangularMatrix->apply(xtmp, ytmp, Teuchos::TRANS, alpha, ONE);
514
515 // Subtract out duplicate diagonal terms
516 ytmp.elementWiseMultiply(-alpha, *diag, xtmp, ONE);
517
518 permMatrix->apply(ytmp, y, Teuchos::NO_TRANS);
519 }
520 }
521
522 Teuchos::RCP<const map_t> getDomainMap() const {
523 return lowerTriangularMatrix->getDomainMap();
524 }
525
526 Teuchos::RCP<const map_t> getRangeMap() const {
527 return lowerTriangularMatrix->getRangeMap();
528 }
529
530 bool hasTransposeApply() const { return true; } // Symmetric matrix
531
532 private:
533 const Teuchos::RCP<const matrix_t> lowerTriangularMatrix;
534 const Teuchos::RCP<const matrix_t> permMatrix;
535 Teuchos::RCP<vector_t> diag;
536};
537
538} // namespace Tpetra
539#endif
Functions for initializing and finalizing Tpetra.
Struct that holds views of the contents of a CrsMatrix.
Sets up and executes a communication plan for a Tpetra DistObject.
Abstract interface for operators (e.g., matrices and preconditioners).
virtual void apply(const MultiVector< Scalar, LocalOrdinal, GlobalOrdinal, Node > &X, MultiVector< Scalar, LocalOrdinal, GlobalOrdinal, Node > &Y, Teuchos::ETransp mode=Teuchos::NO_TRANS, Scalar alpha=Teuchos::ScalarTraits< Scalar >::one(), Scalar beta=Teuchos::ScalarTraits< Scalar >::zero()) const =0
Computes the operator-multivector application.
int local_ordinal_type
Default value of Scalar template parameter.
Namespace Tpetra contains the class and methods constituting the Tpetra library.
@ ADD
Sum new values.
@ ZERO
Replace old values with zero.