Ridgeway Kite - GPU Technology Conference

Ridgeway Kite
Innovative Technology for Reservoir Engineers
Large Scale Reservoir Simulation
utilizing multiple GPUs
Garf Bowen
25th March 2014
Summary
• Introduce
  – RKS
  – Reservoir Simulation
• HPC goals
• Implementation
• Large scale simulations
• Results & future
RKS
• Start-up (April 2013)
– Long history in Reservoir Simulation
– Sister company, NITEC – consulting
• Differentiators
– Massively Parallel Code
– Multiple Realizations
– “Unconventional”
– Coupled surface network
Reservoir Simulation
• Finite Volume
• Unstructured (features)
• Implicit

$R = \Delta M - \Delta t\, F = 0$
Driving from London to Manchester…
Check the Ferrari, or the traffic jam?
• A lot of code that all needs to go fast
• The challenge is often “not to go slow”
• Can’t focus only on “hot spots”
HPC goals
• “not to go slow”
• Portability CPU/GPU (+clusters)
– Want to be future proof
• Simplification
– (massive) parallelization is an opportunity
– Developer efficiency
– Same result on any platform
Shuffle Calculate Pattern

Scatter: I/O from node zero
Calculate: “one-to-one”
Shuffle
Gather: output

• All data is on the GPU
• Calculations are embarrassingly parallel
• No indirect addressing
• Ability to time separately

Example – calculate flows
• One flow, two cells
• Different flow, same cell
• One cell involved in multiple flows
• More flows than cells
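A minimal sketch of the flow example: each flow touches two cells, and several flows can touch the same cell, so the calculate phase writes each flow’s contribution to its own pre-assigned (cell, slot) pair (one-to-one, no indirect addressing), and a gather phase sums the slots per cell. All names here (Flow, calculateFlows, gatherCells) are illustrative, not the XPL API.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Flow { int cellA, cellB; double rate; }; // signed rate: A -> B

// Calculate phase: embarrassingly parallel over flows; every flow writes to
// its own (cell, slot) pair, so there are no conflicting writes.
void calculateFlows(const std::vector<Flow>& flows,
                    const std::vector<std::array<int,2>>& slot, // slot[f] = {slotA, slotB}
                    int nSlots,
                    std::vector<double>& slotted)               // size nCells * nSlots
{
    for (std::size_t f = 0; f < flows.size(); ++f) {
        slotted[flows[f].cellA * nSlots + slot[f][0]] -= flows[f].rate; // outflow
        slotted[flows[f].cellB * nSlots + slot[f][1]] += flows[f].rate; // inflow
    }
}

// Gather phase: sum each cell's slots into a single net flow per cell.
std::vector<double> gatherCells(const std::vector<double>& slotted,
                                int nCells, int nSlots)
{
    std::vector<double> net(nCells, 0.0);
    for (int c = 0; c < nCells; ++c)
        for (int s = 0; s < nSlots; ++s)
            net[c] += slotted[c * nSlots + s];
    return net;
}
```

The slot assignment is precomputed once from the grid topology; the per-step work then needs no atomics and no indirection.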
Multiple copies – “slots”

Simplicity Returns?
• “one code” kernel
• many (independent) calls
• Split to run MPI distributed

Underlying system – XPL
• Takes care of running
• Different modes
• Different architectures
• Code looks serial again
Maps & MPI

Src   Dest   Slot
i1    j1     0
i2    j2     1
i3    j3     0
i4    j4     1
…     …      …

Maps are defined in “serial” space

Not recommended

test.exe –cpu
test.exe –gpu
mpirun –np 16 test.exe
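The map idea can be sketched as plain data: a list of (src, dest, slot) triples defined in serial index space, applied as one-to-one copies; under MPI the same map would simply be split by destination rank. MapEntry and applyMap are hypothetical names for illustration, not the XPL interface.

```cpp
#include <vector>

// A map entry in "serial" (global) index space.
struct MapEntry { int src, dest, slot; };

// Shuffle: move each source value into its (dest, slot) position.
// Each destination slot is written by at most one entry, so the loop
// parallelizes with no write conflicts.
void applyMap(const std::vector<MapEntry>& map,
              const std::vector<double>& src,
              std::vector<double>& dest, int nSlots)
{
    for (const MapEntry& e : map)
        dest[e.dest * nSlots + e.slot] = src[e.src]; // one-to-one write
}
```

Because the map is ordinary data, the same code runs on CPU, GPU, or split across MPI ranks once the triples have been partitioned.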
Simple Example

$x_i = A_i^{-1} r_i \quad \forall i$

• A – n×n small dense matrix
• ~millions of i’s
• LU factorization (partial pivoting)

template<typename KP>
struct Testinv
{
    __host__ __device__
    Testinv(Args* inArgs, int index, int N)
    {
        int ia=0;
        mat<double,KP> a(inArgs,ia++,index);
        vec<double,KP> r(inArgs,ia++,index);
        vec<double,KP> x(inArgs,ia++,index);
        mat<double,KP> w(inArgs,ia++,index);
        w = a;
        w.inv();
        x.zero();
        w.mult(r,x);
    }
};

case rks::TestKernels::TEST_INV:
    calc(inArgs, gpu<Testinv<kp> >, cpu<Testinv<kp> >);
    break;

[Figure: Scaling – log time (secs) vs log n for CPU and GPU; trend lines y = 2.35x + 2.31 and y = 2.23x + 1.20]
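The Testinv pattern – one functor body, many independent calls – can be sketched in plain C++, with a 2×2 analytic inverse standing in for the LU-based mat::inv() and a serial loop standing in for the gpu<>/cpu<> dispatcher. All names below are illustrative.

```cpp
#include <cstddef>
#include <vector>

// One "kernel" body: x_i = A_i^{-1} r_i for a single index i.
struct Inv2x2 {
    void operator()(const double* A, const double* r, double* x) const {
        double det = A[0]*A[3] - A[1]*A[2];
        double inv[4] = { A[3]/det, -A[1]/det, -A[2]/det, A[0]/det };
        x[0] = inv[0]*r[0] + inv[1]*r[1];
        x[1] = inv[2]*r[0] + inv[3]*r[1];
    }
};

// Stand-in for calc(inArgs, gpu<...>, cpu<...>): invoke the same functor
// once per independent index; each call touches only its own data.
void calcAll(const std::vector<double>& A,  // n matrices, 4 doubles each
             const std::vector<double>& r,  // n right-hand sides, 2 doubles each
             std::vector<double>& x)        // n solutions, 2 doubles each
{
    Inv2x2 kernel;
    for (std::size_t i = 0; i < r.size() / 2; ++i)
        kernel(&A[4*i], &r[2*i], &x[2*i]);
}
```

The per-index body never indexes outside its own matrix and vector, which is what makes the millions of calls embarrassingly parallel on either architecture.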
Now add complexity

Comparison between --cpu (1243.630) and --gpu (147.960)

[Instrumentation output: per-section timing breakdowns for the CPU and GPU runs; instrumented sections include well, jac, mass, flow, flow_, norm, lin, ling, lins, orth-it, precon and pressure]
Linear Solver Strategy

• Linear solver is important
  – Communication mechanism
  – Challenge in parallel environments
  – Like getting “the same” results
• If we can implement a solver in XPL, then we get this for free
  – …but we’re only a small company, and don’t really want to be linear solver experts
• Home grown – may not be competitive
• Using Nvidia’s AmgX – lose the “same” algorithm, but performing
Linear Solver
• Home Grown
– Massively helpful for development
• Same results for all configurations
– Challenged algorithmically on difficult problems
• AmgX
– Many options (pre-coded)
– Single GPU working well
– Focussed our effort here
• MPI programming becomes important
Strategy as problem size increases
• Tesla C2070
  – 6 GB memory
  – Black Oil model: ~1 million cells (SPE10: 1.2e6 cells)
• Little incentive to utilize >1 GPU
  – noting people will often run multiple realizations
• Larger model -> cluster
  – Memory constrained
Scaling Test
• Based on SPE10 benchmark
– Refined model
– 5 wells
– ~1 million cells
• We can fit:
  – Base case on one GPU
  – 4 (connected) copies on 4 GPUs (actually require 8 GPUs – extra memory)
  – 16 copies on 16/32 GPUs
• Less challenging scaling than refinement
Memory & Performance

[Figures: “Memory” (Memory, MB) and “Example Performance” (wall-clock time, secs, vs processors 1–8) for the cases "1E6-1GPU", "1E6-2GPU" and "4E6-8GPU"]

Lessons:
• Very variable timings
• Instrumentation vital

Future:
• Still working on the 32-way case
• Classical MPI optimization step
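A per-section breakdown like the one on the “Now add complexity” slide needs only a small scoped timer. A minimal sketch (not the actual RKS instrumentation; all names are illustrative):

```cpp
#include <chrono>
#include <map>
#include <string>
#include <utility>

// Accumulated wall-clock seconds per named section.
std::map<std::string, double>& timings()
{
    static std::map<std::string, double> t;
    return t;
}

// RAII timer: construct at the top of a scope, the destructor adds the
// elapsed time to that section's running total.
struct ScopedTimer {
    std::string name;
    std::chrono::steady_clock::time_point start;
    explicit ScopedTimer(std::string n)
        : name(std::move(n)), start(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        timings()[name] += std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start).count();
    }
};
```

Usage is one line per section, e.g. `{ ScopedTimer t("lin"); /* linear solve */ }`, and dumping the map after a run gives the section/time table.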
Summary & Conclusions
• Shuffle-Calculate pattern
  – Works for us, so far
  – Portable
  – Allowing us to exploit the GPU
  – Using AmgX we’re able to tackle realistic cases requiring multiple GPUs
• Full system
  – Commercial offering early next year
Acknowledgements
• Co-authors: Bachar Zineddin & Tommy Miller
• Jeremy Appleyard, Nvidia
• “The authors would like to acknowledge that the work presented here made use of the IRIDIS*/EMERALD* HPC facility provided by the Centre for Innovation.”
• Nvidia for AmgX beta access
Questions?
Backup#1 – LU code example
//
// Main elimination loop
//
for (int j=0; j<m_xdim; j++)
{
    //
    // Sum: update rows above the diagonal
    //
    for (int i=0; i<j; i++)
    {
        double sum = (*this)(i,j);
        for (int k=0; k<i; k++)
        {
            sum = sum - (*this)(i,k)*(*this)(k,j);
        }
        (*this)(i,j) = sum;
    }
    //
    // Max: find the (scaled) pivot row
    //
    aamax = 0.0;
    for (int i=j; i<m_xdim; i++)
    {
        double sum = (*this)(i,j);
        for (int k=0; k<j; k++)
        {
            sum = sum - (*this)(i,k)*(*this)(k,j);
        }
        (*this)(i,j) = sum;
        if ( std::fabs(vv[i]*sum)>=aamax )
        {
            imax = i;
            aamax = std::fabs(vv[i]*sum);
        }
    }
    //
    // Swap rows j and imax
    //
    if (j!=imax)
    {
        for (int k=0; k<m_xdim; k++)
        {
            double dum = (*this)(imax,k);   // (imax,k), not (imax,j)
            (*this)(imax,k) = (*this)(j,k);
            (*this)(j,k) = dum;
        }
        vv[imax] = vv[j];
    }
    //
    // Store the pivot; guard against a zero diagonal
    //
    piv[j] = imax;
    if ( (*this)(j,j)==0.0 )
    {
        (*this)(j,j) = 1e-20;
    }
    //
    // Set: scale the sub-diagonal column by the pivot
    //
    if (j!=m_xdim-1)
    {
        double dum = 1.0/(*this)(j,j);
        for (int i=j+1; i<m_xdim; i++)
        {
            (*this)(i,j) = (*this)(i,j)*dum;
        }
    }
}
//------ End lu step ----
Backup#2 – Home Grown Solver

$\begin{pmatrix} A_{pp} & A_{ps} \\ A_{sp} & A_{ss} \end{pmatrix}
 \begin{pmatrix} x_p \\ x_s \end{pmatrix}
 = \begin{pmatrix} r_p \\ r_s \end{pmatrix}$

$\begin{pmatrix} A_{pp} & A_{ps} \\ A_{sp} & A_{ss} \end{pmatrix}
 = \begin{pmatrix} A_{pp} & 0 \\ A_{sp} & A_{ss}^{*} \end{pmatrix}
   \begin{pmatrix} I & A_{pp}^{-1} A_{ps} \\ 0 & I \end{pmatrix}$

$A_{ss}^{*} = A_{ss} - A_{sp} A_{pp}^{-1} A_{ps}$

Note:
$(1-\varepsilon)^{-1} = 1 + \varepsilon + \varepsilon^2 + \varepsilon^3 + \dots$

With:
$\varepsilon = A_{sp} A_{pp}^{-1} A_{ps} A_{ss}^{-1}$