Seismic Imaging on accelerators


1 Seismic Imaging on accelerators
José M. Cela, Director, CASE Department, BSC-CNS
Kaleidoscope Project

2 Talk Outline
- Seismic Imaging introduction
- Why RTM?
- How to implement RTM on scalar processors?
- How to implement RTM on accelerators?

3 Wavefield time evolution

4 Seismic imaging on the sea

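5 Wave equation model
Isotropic acoustic wave equation:
$$\frac{\partial^2 p}{\partial t^2} = c^2 \nabla^2 p + f$$
Isotropic acoustic wave equation with variable density:
$$\frac{\partial^2 p}{\partial t^2} = k \, \nabla \cdot \left( \frac{1}{\rho} \nabla p \right) + f$$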
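As a hedged illustration of how the isotropic acoustic equation is usually advanced in time, the following uses a standard second-order leapfrog scheme. The scheme is assumed here, not taken from the slide, and $\Delta t$, the time index $n$ and the discrete Laplacian $\nabla_h^2$ are notation introduced only for this sketch:

$$\frac{p^{n+1} - 2p^{n} + p^{n-1}}{\Delta t^2} = c^2 \nabla_h^2 p^{n} + f^{n} \quad\Rightarrow\quad p^{n+1} = c^2 \Delta t^2 \, \nabla_h^2 p^{n} + 2p^{n} - p^{n-1} + \Delta t^2 f^{n}$$

The new wavefield depends only on the two previous ones, which is consistent with an RTM kernel that keeps three wavefield arrays and rotates their pointers every time step.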

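6 Wave equation model
Model VTI: Vertical symmetric Transversal Isotropic media
[Equations: coupled system for the wavefields p and q, involving the x and y derivatives of p, the z derivative of q, the velocity c and the source f; the coefficients are not recoverable from the extracted text]
Anisotropy parameters: [not recoverable]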

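7 Wave equation model
Model TTI: Tilted axis symmetric Transversal Isotropic media
[Equations: coupled system for p and q written with differential operators H in x and z, whose coefficients are products of sines and cosines; not recoverable from the extracted text]
3 anisotropy parameters: [not recoverable]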

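8 Wave equation model
Elastic wave equation
Requires p-s wave recording => nodes technology
$$\rho \frac{\partial^2 \mathbf{u}}{\partial t^2} = \nabla \cdot \left( \mathbf{C} : \tfrac{1}{2} \left( \nabla \mathbf{u} + \nabla \mathbf{u}^{T} \right) \right) + \mathbf{f}$$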

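9 Basic seismic imaging flow
Flow diagram labels: Data Acquisition; Velocity model; Build velocity model (TOMOGRAPHY); Generate an Image (MIGRATION); Image Gathers
[Equation relating p, c and f, as in the isotropic acoustic wave equation]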

10 Reverse Time Migration (RTM)
[Figure: Source Wave, Receiver Wave and Image panels at three time snapshots; the one fully legible label is time = 0.7 s]

11 Talk Outline
- Seismic Imaging introduction
- Why RTM?
- How to implement RTM on scalar processors?
- How to implement RTM on accelerators?

12 WEM vs. RTM

13 RTM as a velocity building tool

14 RTM without model

15 RTM with incomplete model

16 RTM and model mis-pick

17 Image Algorithms Compute Requirements

18 Talk Outline
- Seismic Imaging introduction
- Why RTM?
- How to implement RTM on scalar processors?
- How to implement RTM on accelerators?

19 RTM complete algorithm
Input
- A database in SEGY format with the measurements obtained in the field (receiver data)
- A velocity model file that covers all the measured field
Output
- A database with the 3D image of the complete field
Algorithm
Do all the shots
    Define the volume around the shot
    Define the time and spatial discretization for the shot
    Extract from the database the proper velocity piece and the proper receiver traces
    Process these files in order to have them in the proper way (interpolation, transposition)
    Run RTM kernel for this shot
    Add the generated image to the global image
Enddo
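The shot loop above can be sketched in a few lines. This is a minimal illustration, not the project's code: `migrate_all_shots`, `toy_kernel` and the dictionary fields are hypothetical names, the image is a flat 1D list instead of a 3D volume, and the per-shot kernel is a stand-in that returns a constant partial image.

```python
def migrate_all_shots(shots, image_size, rtm_kernel):
    """Stack the partial image of every shot into one global image."""
    global_image = [0.0] * image_size
    for shot in shots:
        # Each shot only covers a window of the complete field.
        start, partial = rtm_kernel(shot)
        for offset, value in enumerate(partial):
            global_image[start + offset] += value
    return global_image


def toy_kernel(shot):
    # Hypothetical stand-in for the real RTM kernel: returns the window
    # origin and a constant partial image for that shot.
    return shot["start"], [shot["amplitude"]] * shot["width"]


shots = [
    {"start": 0, "width": 3, "amplitude": 1.0},
    {"start": 2, "width": 3, "amplitude": 2.0},
]
print(migrate_all_shots(shots, 6, toy_kernel))
```

Because each shot only adds into the global image, the shots are independent of each other, which is what makes the outer loop trivially parallel.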


20 RTM complete algorithm
RTM kernel algorithm with FD discretization
Do 2 iterations (forward and backward)
    Do all the time steps
        do i
        do j
        do k
            P3(i,j,k) = stencil(P2)
            P3(i,j,k) = v^2 dt^2 P3(i,j,k) + 2 P2(i,j,k) - P1(i,j,k)
        Introduce input wave
        Apply boundary conditions
        if (forward)
            Write P3
        else
            Read P4
            Im = Im + correlation(P3, P4)
        endif
        Interchange MPI domain boundary
        Rotate pointers P3, P2, P1
    enddo
enddo
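A minimal 1D sketch of this kernel may help to see the forward/backward structure. It is an illustration under stated assumptions, not the talk's implementation: it uses a second-order 3-point stencil, fixed zero boundaries, keeps all forward snapshots in memory instead of writing them to disk, and takes the "receiver trace" from the forward run itself because no field data is available here. The names `step` and `rtm_1d` and the parameter `c2dt2_dx2` (the constant v^2 dt^2 / dx^2) are hypothetical.

```python
def step(prev, curr, c2dt2_dx2):
    """One leapfrog step: P3 = v^2 dt^2 * stencil(P2) + 2*P2 - P1."""
    n = len(curr)
    nxt = [0.0] * n  # end points stay at zero (fixed boundaries)
    for i in range(1, n - 1):
        lap = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
        nxt[i] = c2dt2_dx2 * lap + 2.0 * curr[i] - prev[i]
    return nxt


def rtm_1d(n, nsteps, src_pos, rec_pos, source, c2dt2_dx2=0.25):
    # Forward pass: propagate the source and keep every snapshot ("Write P3").
    p1, p2 = [0.0] * n, [0.0] * n
    snapshots, record = [], []
    for t in range(nsteps):
        p3 = step(p1, p2, c2dt2_dx2)
        p3[src_pos] += source[t]        # introduce input wave
        snapshots.append(p3)
        record.append(p3[rec_pos])      # synthetic receiver trace
        p1, p2 = p2, p3                 # rotate pointers
    # Backward pass: inject the trace in reverse time and correlate.
    image = [0.0] * n
    q1, q2 = [0.0] * n, [0.0] * n
    for t in reversed(range(nsteps)):
        q3 = step(q1, q2, c2dt2_dx2)
        q3[rec_pos] += record[t]
        for i in range(n):
            image[i] += snapshots[t][i] * q3[i]  # Im = Im + correlation
        q1, q2 = q2, q3
    return image


image = rtm_1d(21, 5, src_pos=10, rec_pos=6, source=[1.0, 0.0, 0.0, 0.0, 0.0])
print([i for i, v in enumerate(image) if v != 0.0])
```

Storing every snapshot is only viable for a toy problem; the slide's "Write P3 / Read P4" steps exist because real 3D snapshots do not fit in memory.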


21 RTM complete algorithm
- Grid parallelism: embarrassing parallelism for the different shots
- Process parallelism: if there are not enough resources in 1 node, Domain Decomposition (MPI) to process one shot between several nodes
- Thread parallelism (OpenMP / Cell / GPU threads) to execute 1 MPI process per node
- SIMD capabilities: VMX code / Cell SPU code
Figure labels: Shot areas; Complete field; 1 shot volume divided in 4 domains = 4 MPI processes; 1 domain executed with 4 threads; Scalar vs VMX stencil

22 RTM Scalability
Process Parallelism (Domain Decomposition)
- Scalability limited by the finite difference stencil
- In practical problems the number of domains is lower than 10
- Linear scalability can be obtained
- Good speed-up => n >> s
[Figure labels: n, s, processors]
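One way to read the condition n >> s is as a surface-to-volume argument. The sketch below assumes that n is the edge length of a cubic subdomain in grid points and s is the stencil half-length; the slide's figure is not recoverable, so both meanings are assumptions, and `halo_fraction` is a hypothetical helper.

```python
def halo_fraction(n, s, faces=6):
    """Ratio of exchanged halo points to computed points for an n^3 subdomain."""
    computed = n ** 3              # points updated per time step
    exchanged = faces * s * n ** 2  # s layers on each exchanged face
    return exchanged / computed


for n in (32, 128, 512):
    print(n, halo_fraction(n, s=4))
```

Under these assumptions the ratio is 6s/n, so communication becomes negligible relative to computation only when the subdomain edge is much larger than the stencil.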

23 RTM Scalability
Thread parallelism on scalar processors (OpenMP)
- Blocking using Rivera scheme
- Linear scalability

24 Talk Outline
- Seismic Imaging introduction
- Why RTM?
- How to implement RTM on scalar processors?
- How to implement RTM on accelerators?

25 QS22 memory configuration
NUMA system
[Figure: Node 0 (Cell/B.E. 0) and Node 1 (Cell/B.E. 1), each with its own memory bank, one PPE and eight SPEs (SPE 0 to SPE 7); the bandwidth labels in GBps are not recoverable]
- Each chip has its own Access Concentrator (arbitrator), although it is only enabled in Cell/BE 0
- Cell/BE 1 must send all memory requests to AC 0 in Cell/BE 0, to avoid coherence problems and minimize memory accesses
- Each SPE has its own 256-entry TLB. Page sizes range from 4 KB to 16 MB. SPE TLB misses interrupt the PPE


26 RTM Scalability on Cell
Thread parallelism on the Cell processor (QS22)
SPU threads scalability depends on:
- Data alignment: data must be aligned to the 128-byte cache line, so usually the data size must also be a multiple of the cache line
- NUMA management
- Computational scheme (semi-stencil)
- DMA lists
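The alignment rule amounts to rounding every row length up to a whole number of cache lines. A minimal sketch, assuming 4-byte single-precision elements and an element size that divides the line size; `padded_row_length` is a hypothetical helper, not part of any Cell SDK.

```python
CACHE_LINE_BYTES = 128  # cache line size stated on the slide


def padded_row_length(n_elems, elem_bytes=4, line_bytes=CACHE_LINE_BYTES):
    """Smallest element count >= n_elems whose byte size is a multiple of line_bytes."""
    elems_per_line = line_bytes // elem_bytes
    lines = -(-n_elems // elems_per_line)  # ceiling division
    return lines * elems_per_line


print(padded_row_length(100))
```

If the first row starts on an aligned address, padding each row this way keeps the start of every following row aligned as well.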

27 RTM Performance in Cell

28 Kaleidoscope Project
Platform   Gflops      Speed-up   Power (W)   Gflops/W
JS21       8.3         1          267         0.03
QS22       116.[...]   [...]      [...]       [...].3
The work of 3 months is now done in 1 week (speed-up 14)

29 GPU architecture
- Inherently parallel: up to 240 cores/processor in current NVIDIA GPUs
- Simple cores: in-order, no vector units
- Powerful single-precision floating point unit; double-precision also supported
- Memory capacity: up to 4GB per card
- Slow connection with the host memory (PCI Express)

30 GPU local memory
- Global memory: up to 4GB per card. Very slow ([...] cycles)
- Texture memory: 64KB per card, <cache>, read-only. Useful for some kinds of access patterns
- Constant memory: 64KB per card, <cache>, read-only. [...] cycles when all threads in a warp read the same address
- Shared memory: 16KB per SM, 8 banks, 4 bytes stride. [...] cycles if no bank conflict (consecutive accesses)
- Register memory: [...] registers/SM; 16 per thread if 1024 threads, 32 if 512 threads

31 RTM Porting Guidelines to GPUs
- Communication through PCIe is slow: perform all computations on the GPU to avoid memory transfers
- Real-sized problems require > 16GB: domain decomposition (> 1 node)
- Inter-node communication is slow: overlap computation/communication
- Intermediate results have to be written/read to/from external storage: overlap transfers with next time steps

32 RTM Porting the stencil kernel
1. Use GMAC to automatically handle memory transfers
   - No need for different allocations (host and GPU)
   - Memory objects share the same address on the host and on the device
   - More intelligent handling of transfers
2. Use shared memory to store the values of the previous time step
   - Bad useful/total loads ratio
   - Divergent branches to load the ghost area
3. 2D sliding window proposed by P. Micikevicius (NVIDIA)
   - Store the Y (geophysical) stencil dimension in registers, with the ZX plane stored in shared memory. Then slide the plane to the end of the cube
   - Better useful/total loads ratio

33 3D-Stencil
Tiling + shared memory
[Figure axes: x, y]

34 3D-Stencil
Register spilling
- Extend the 3rd dimension by using a stream of registers
- Read 4 planes further in advance
- Shift register contents every iteration
[Figure: register queue Z-4, Z-3, Z-2, Z-1, Z (current value), Z+1, Z+2, Z+3, Z+4]
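The register queue can be mimicked on the CPU with a fixed-length window that is shifted once per iteration, so every input along the streamed axis is read exactly once. This is a plain Python sketch of the idea, not GPU code: the `deque` stands in for the nine per-thread registers, an unweighted sum stands in for the real stencil coefficients, and `stream_axis_sum` is a hypothetical name.

```python
from collections import deque


def stream_axis_sum(column, radius=4):
    """Sum of the 2*radius+1 neighbours along one axis, reading each input once."""
    n = len(column)
    out = [0.0] * n
    size = 2 * radius + 1
    # Pre-load the window for the first valid centre (values Z-4 ... Z+4).
    window = deque(column[:size], maxlen=size)
    for z in range(radius, n - radius):
        out[z] = sum(window)  # stand-in for the weighted stencil
        if z + radius + 1 < n:
            # Shift the queue and read one new value, `radius` planes ahead.
            window.append(column[z + radius + 1])
    return out


column = [float(v) for v in range(12)]
print(stream_axis_sum(column))
```

Appending to a full `deque` drops the oldest value automatically, which corresponds to the "shift register contents every iteration" step.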

35 3D-Stencil
Aligned data structures (warp boundaries)
- Key to get fully coalesced memory accesses
[Figure labels: x, y, padding]

36 Domain Decomposition
- Data exchange every time-step: MPI inter-node, shared memory for intra-node communication
- Domains decomposed along the Y dimension
[Figure: Node 1 and Node 2, each with GPU1, GPU2, GPU3 and GPU4, connected through MPI]
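A decomposition along Y can be sketched as a split of the row range into contiguous slabs, each extended by a halo on the sides that touch a neighbour. This is a hedged illustration only: `split_y` is a hypothetical helper, and the halo width would normally equal the stencil half-length.

```python
def split_y(ny, parts, halo):
    """Return (start, stop) row ranges, halos included, for each slab along Y."""
    base, extra = divmod(ny, parts)
    ranges, start = [], 0
    for p in range(parts):
        # Spread the remainder rows over the first `extra` slabs.
        stop = start + base + (1 if p < extra else 0)
        # Extend by the halo, clipped at the edges of the global domain.
        ranges.append((max(0, start - halo), min(ny, stop + halo)))
        start = stop
    return ranges


print(split_y(100, 4, halo=4))
```

Only the halo rows need to be exchanged at each time step, over shared memory inside a node and over MPI between nodes.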

37 Domain Decomposition (CUDA RT)
One device per thread
[Figure label: CPU address space]

38 Domain Decomposition (CUDA RT)
Separate address spaces
[Figure labels: GPU address spaces, CPU address space]

39-40 Domain Decomposition (CUDA RT)
Threads cannot access memory from other GPUs
[Figure label: CPU address space]

41-42 Domain Decomposition (CUDA RT)
Intermediate memory buffers must be used
[Figure label: CPU address space]

43 Domain decomposition - GMAC
Single address space
[Figure label: Global address space]

44 Domain decomposition - GMAC
Boundaries exchange performed using a simple memcpy!
[Figure label: Global address space]

45 Overlapping computation/communication
Two-stage execution, Stage 1: compute the points of the boundaries to be exchanged
[Figure axes: y, z]

46 Overlapping computation/communication
Two-stage execution, Stage 2: compute the remaining points while exchanging boundaries
[Figure axes: y, z]
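The two stages can be imitated on the CPU to show the ordering. This is only a sketch under simplifying assumptions: a 1D array, a halo of one cell, a 3-point average in place of the real stencil, and a worker thread that copies the boundary values into a "send buffer" in place of a real exchange. `stencil_point` and `two_stage_step` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def stencil_point(u, i):
    # 3-point average as a stand-in for the real FD stencil.
    return (u[i - 1] + u[i] + u[i + 1]) / 3.0


def two_stage_step(u):
    """Update the interior of u: boundary cells first, the rest during the exchange."""
    n = len(u)
    new = list(u)
    first, last = 1, n - 2  # cells that a neighbouring domain would need
    # Stage 1: compute only the points to be exchanged.
    for i in (first, last):
        new[i] = stencil_point(u, i)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the "exchange" (here just a copy of the boundary values) ...
        send = pool.submit(lambda: (new[first], new[last]))
        # Stage 2: ... and compute the remaining points in the meantime.
        for i in range(first + 1, last):
            new[i] = stencil_point(u, i)
        halos = send.result()
    return new, halos


print(two_stage_step([0.0, 3.0, 6.0, 9.0, 12.0, 0.0]))
```

The result is the same as a single pass over all points; the only change is the order of the work, which lets the communication hide behind the interior computation.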

47 Overlapping computation/communication

GMAC (7600 Mp/s)
    gmacMalloc(&input, W_SIZE);
    gmacMalloc(&output, W_SIZE);
    for all time steps do
        // stage 1: boundary points
        launch_stencil(output, input);
        // stage 2: remaining points, overlapped with the exchange
        gmacThreadSynchronize();
        launch_stencil(output, input);
        memcpy(neighbor, output);
        gmacThreadSynchronize();
        barrier();
        // ... exchange pointers
    end for

CUDA Run-time (7700 Mp/s)
    cudaMalloc(&d_input, W_SIZE);
    cudaMalloc(&d_output, W_SIZE);
    cudaHostAlloc(&i_halos, H_SIZE);
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    for all time steps do
        // stage 1: boundary points
        launch_stencil(d_output, d_input, s1);
        // stage 2: remaining points, overlapped with the exchange
        launch_stencil(d_output, d_input, s2);
        cudaMemcpyAsync(i_halos, d_output, s1);
        cudaStreamSynchronize(s1);
        barrier();
        cudaMemcpyAsync(d_output, i_halos, s1);
        cudaThreadSynchronize();
        barrier();
        // ... exchange pointers
    end for

Both listings are slide pseudocode: argument lists are abbreviated and the time loop is not real C.

48 Performance Results
Intra-node scalability

49 Performance Results
Inter-node scalability (1 GPU per node)

50 Performance Results
CPUs (OMP, 8 cores) vs accelerators (single)

51 Performance Results
We are too fast! -> Disk I/O problems

52 Kaleidoscope Project
Platform   Gflops      Speed-up   Power (W)   Gflops/W
JS21       8.3         1          267         0.03
QS22       116.[...]   [...]      [...]       [...].3
TESLA      [...]       [...]      [...].8     0.76
The work of 3 months is now done in:
- 1 week (speed-up 14)
- [...] days (speed-up 4), but this depends on local I/O BW

53 Conclusions
- GPUs are an excellent architecture for codes based on stencil computations. Speed-ups of x40 can be achieved
- Special attention should be paid to overlapping MPI and host memory transfers with computation
- Exploiting the internal memory hierarchy is mandatory
- For RTM a x4 speed-up can be obtained if the proper I/O device is used

54 Thank you for your attention!