10/5/2011. TÁMOP –4.1.2-08/2/A/KMR-2009-0006

Development of Complex Curricula for Molecular Bionics and Infobionics Programs within a consortial framework

Consortium leader: PETER PAZMANY CATHOLIC UNIVERSITY
Consortium members: SEMMELWEIS UNIVERSITY, DIALOG CAMPUS PUBLISHER

The Project has been realised with the support of the European Union and has been co-financed by the European Social Fund.

Peter Pazmany Catholic University, Faculty of Information Technology
www.itk.ppke.hu

Digital- and Neural-Based Signal Processing & Kiloprocessor Arrays
Virtual machines: signal processing with multicore systems
J. Levendovszky, A. Oláh, K. Tornai

Signal processing on digital, neural, and kiloprocessor based architectures: Multicore systems

Contents
• Implementation of neural networks
• Motivation of multicore systems
• Multicore systems
  • Definitions
  • Survey of multicore systems, architectures
• Applications
  • Cellular automaton demonstration
  • FFT implementation

NN Implementation examples

Applications                                              Examples
High energy physics                                       Digital neurochip
Pattern recognition                                       FPGA, Digital
Image/object recognition                                  RAM based, Optical
Image segmentation                                        FPGA, Digital
Generic image/video processing                            RAM based, Analog
Intelligent video analytics                               Optical, FPGA
Fingerprint feature extraction, Direct feedback control   Analog
Autonomous robotics                                       Digital, FPGA, DSP
Sensorless control                                        FPGA
Optical character/handwriting recognition                 Digital
Acoustic sound recognition                                DSP
Real-time embedded control                                Digital
Audio synthesis                                           Analog

NN Implementation – Digital neuron
• In a digital neuron, synaptic weights are stored in shift registers, latches, or memories.
• Adders, subtracters, and multipliers are available as standard circuits, and non-linear activation functions (AFs) can be constructed using look-up tables or using adders, multipliers, …
• Advantages: simplicity, high signal-to-noise ratio, easily achievable cascadability and flexibility, and cheap fabrication
• Drawbacks: slower operation; conversion of the digital representations to and from analog form may be required
  • Usually the input patterns are available in analog form, and the control outputs are often required in analog form as well

NN Implementation – Analog neuron
• In an analog neuron, weights are usually stored using one of the following: resistors, CCDs, capacitors, and floating-gate (FG) EEPROMs. In VLSI, a variable resistor used as a weight can be implemented as a circuit involving two MOSFETs.
• The signals are typically represented by currents and/or voltages.
• The scalar product and the subsequent non-linear mapping are performed by a summing amplifier with saturation
• Advantages: analog elements are generally smaller and simpler
• Drawbacks: obtaining consistently precise analog circuits, especially compensating for variations in temperature and control voltages, requires sophisticated design and fabrication
• The main challenges for analog designs are a synapse multiplier that works over a useful range and the storage of the synapse weights

NN Implementation – Neurochips
• FPGA based implementation
  • Reconfigurable FPGAs provide an effective programmable resource for implementing NNs, allowing different design choices to be evaluated in a very short time
  • Partial and online reconfiguration capabilities in the latest generation of FPGAs offer additional advantages
  • The circuit density of FPGAs is still comparatively low, which is a limiting factor in the implementation of large models with thousands of neurons
• Associative neural memories
• RAM based implementations

CNN Implementation
• CNN implementations can achieve speeds of up to several teraflops and are ideal for applications that require low power consumption, high processing speed, and emergent computation, e.g., real-time image processing.
• ACE16k
  • Mixed-signal SIMD-CNN ACE (Analogic Cellular Engine) chips serve as a vision system on chip realizing the CNN Universal Machine (CNN-UM)
  • Designed using 0.35 µm CMOS technology with 85% analog elements.
  • Consists of an array of 128x128 locally connected mixed-signal processing units operating in SIMD mode
  • ACE16k chips have been used in the commercial Bi-i speed vision system developed by AnaLogic Computers Ltd and MTA-SZTAKI
• An FPGA based emulated-digital CNN-UM implementation using GAPU (Global Analogic Programming Unit)
• Falcon was proposed earlier as a reconfigurable multi-layer FPGA based CNN-UM implementation employing a systolic array architecture

NN Implementation
• Neuromorphic refers to a circuit that emulates the biological neural design
  • The processing is mostly analog, although outputs can be digital
• Optical neural networks
  • Designed on the principles of optical computing
  • Optical technology utilizes the effect of light beam processing, which is inherently massively parallel, very fast, and free of the side effects of mutual interference
  • Optical transmission signals can be multiplexed in the time, space, and wavelength domains, and optical technologies may overcome the problems inherent in electronics
  • The results range from the development of special-purpose associative memory systems through various optical devices (e.g., holographic elements for implementing weighted interconnections) to optical neurochips.
  • Optical techniques ideally match the needs of realizing a dense network of weighted interconnections.
  • Optical technology has a number of advantages for making interconnections, specifically with regard to density, capacity, and 2D programmability

NN Implementation table
Columns: Digital, Analog, Hybrid, Neuromorphic, FPGA, Optical
MLP (Perceptron)   x x
RBF                x x x
SOFM               x x
FFNN               x x x
Spiking NN         x x x
Pulse coded NN     x x x
CNN                x x x x x
AM                 x x x x
Recurrent NN       x x
Stochastic NN      x

Introduction – Motivation
• Classical architecture: one processing unit
• The improvements predicted by Moore's law face physical constraints
  • Size
  • Frequency
  • Power consumption / heat dissipation
• Consequently, the number of processing units on one chip must be increased
  • This changes the approach to architecture design
  • This changes the approach to programming

Trend
• Improving the computational capacity
  • Instead of using higher frequencies, the number of cores on a single chip is increased
• Decreasing the power consumption
• More and more chips with many cores are available on the market
  • New programs have to be adapted to the architecture

Trend of development
• Figure: computational capacity of GPU and CPU over time

Classifications of architectures
• From the perspective of applications
  • General purpose
    – Coding, decoding, software radio
  • Specific purpose
    – Well defined application
  • ASIC: Application Specific Integrated Circuit
  • RISC: Reduced Instruction Set Computer

Classifications of architectures
• From the perspective of applications
  • Data processing dominated
    • Processing large data flows
      – Image or video
      – Voice or music
      – Processing radio signals
    • Repeating the same instruction on multiple data
      – It can be parallelized well

Classifications of architectures
• From the perspective of applications
  • Control processing dominated
    • Packing or unpacking files
    • Processing network signals
    • Conditional instructions, huge state space, enormous amount of re-used data
      – Hard to parallelize
    • Fewer GP cores

Classifications of architectures
• Computational power versus power consumption
  • In most cases the parameters are restricted
    • Mobile phone capable of playing videos
  • The goal is to increase the computational power
    • Furthermore, the power consumption is also an issue
      – Previous example: mobile phone
      – This issue must be considered during design

ISA
• ISA: Instruction Set Architecture
  • Defines the microarchitecture
  • Defines the hardware-software interface
• Each core with a traditional ISA is a modified classical processor core
  • Contains atomic instructions for synchronization
  • Supported by compilers and software
  • Not necessarily efficient; power consumption can be high

ISA
• RISC – Reduced Instruction Set Computer
  • Simple microarchitecture and compiler
  • The final code is more complicated
• CISC – Complex Instruction Set Computer
  • More, and more complex, instructions
  • Complex compiler and microarchitecture
  • More optimized final code

ISA
• Instruction set extensions
  • MMX: MultiMedia eXtension (64-bit SIMD integer instructions that reuse the FP registers)
  • Streaming SIMD Extensions (SSE)
    – 70 new special instructions (FP-I, AL-I)
  • SSE2, SSE3, SSE4, SSE5 extensions
  • 3DNow! (FP-I, DSP instructions)
  • Advanced Vector Extensions (AVX) (SIMD)
  • x86-64, AMD64, EM64T
  • More optimized, harder to compile

Intel IPP
• Add-on
  • Different instruction sets
  • To use them efficiently, a special library should be used
  • Intel C++ library: IPP (Integrated Performance Primitives)

Microarchitecture
• In-order processing element
  • Instructions are executed in the order given in the program code
  • With pipelines (increasing their size and number) the computational power can be extended further
    • Superscalar
    • Needs control logic
    • The length of the instruction is critical
  • Small size, low complexity, and low power consumption
    • Many of them can be placed on a single chip
    • The performance is lower

Microarchitecture
• Out-of-order processing element
  • To fully utilize the pipelines, the order of the instructions is changed
  • This can be extremely fast due to scheduling
  • The complexity is huge and the area required on the chip is large
• SIMD (Single Instruction Multiple Data)
  • Solves data oriented problems
    • Very efficient if the data can be arranged as vectors
    • Not usable when the application is control dominated

Microarchitecture
• Very Long Instruction Word (VLIW)
  • Processes multiple data at the same time
  • Uses pipelines
  • Unlike a superscalar architecture, it uses less hardware logic for scheduling
    • The compiler optimizes the scheduling
    • Reduced size, less complexity
  • Needs a special compiler
  • In special cases the generated code may be worse

Memory
• Consistency model
  • Defines how the memory instructions may be rearranged
  • Defines the style of the programming method
  • Influences the computational power
• Strong consistency model
  • The order of the instructions cannot be changed
  • Easy to code, simple behavior model
• Weak consistency model
  • Simpler memory controllers
  • The order of memory accesses may vary between runs

Cache
• On multicore systems the importance of cache memory is higher
  • Decreases the load on the slow central memory
  • Decreases the access time of data
• Automatically managed cache memory
  • The processes do not know about each other.
    Virtually, each process has the resources to itself
  • Therefore the performance varies due to the overhead
• Software managed cache
  • Whether it is practical to use depends on the application

Size of cache
• Application dependent
  • A larger cache results in higher computational performance
    • Only when data is reused
  • There are chips with two kinds of cache mode
    • Streaming mode
    • Normal mode
• The size is determined by manufacturing cost and chip size
• Cache levels
  • Speed up access to the farther memory (farther in the sense of access time)

Intrachip connections
• Bus
  • Easy to implement
  • Each core can access the common resources with the same latency
  • Very slow above a certain number of cores
• Ring
  • Needs traffic control logic
  • Different latencies
  • Supports more cores/processors

Intrachip connections
• Network on Chip (NoC)
  • Similar to the ring, but latencies are smaller
  • More complex logic circuits
• Crossbar
  • Equal latency for every core
  • A high number of cores can be connected
  • Complex logic
  • Occupies a large area on the chip
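
The "different latencies" remark for the ring can be illustrated with a small hop-count model. This is only a sketch under stated assumptions: the cores are numbered 0..n-1 on a bidirectional ring, latency is taken to be proportional to the number of hops, and contention is ignored. The function names `ring_hops` and `ring_hop_spread` are hypothetical and do not describe any particular chip.

```python
def ring_hops(src, dst, n_cores):
    """Hops between two cores on a bidirectional ring (shortest direction)."""
    d = abs(src - dst) % n_cores
    return min(d, n_cores - d)


def ring_hop_spread(n_cores):
    """Return (min, max, mean) hop count from core 0 to every other core."""
    hops = [ring_hops(0, dst, n_cores) for dst in range(1, n_cores)]
    return min(hops), max(hops), sum(hops) / len(hops)


if __name__ == "__main__":
    # In this toy model a bus or a crossbar would be a constant one hop
    # for every pair of cores; on the ring the distance depends on the pair.
    for n in (4, 8, 16, 64):
        lo, hi, mean = ring_hop_spread(n)
        print(f"{n:3d} cores: min={lo} max={hi} mean={mean:.2f}")
```

In this model the worst-case distance grows as n/2, which is one way to read the slide's contrast between the equal latencies of the bus and crossbar and the pair-dependent latencies of the ring.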

Intrachip connections
• Tasks
  • Communication
  • Maintaining cache coherency
    • It is common not to deal with coherency at all
    • Broadcast based
      – The changes are broadcast
    • Directory based
      – The memory is divided into blocks
      – Each block has a responsible manager (home directory)
      – Synchronization is performed from and into the home directory

Further elements
• Special integrated devices and special purpose circuits
  • Memory controller
  • Encoder and decoder support
  • Speeding up image processing
    • Special instructions implemented in hardware
    • Rasterization logic

Comparison of different architectures (figures)

Tilera TILE64 – DSP
• Up to 64 VLIW cores
• NoC
• Shared memory
• GP
• High power consumption
• Directory based coherency

Element CXI ECA-64 – DSP
• Data driven applications
• Software controlled memory
• Low power consumption
• Cluster
  • 15 ALUs, 1 CPU
  • 32 kB local memory
• Hierarchical connections

Silicon Hive HiveFlex – DSP
• Low power consumption
• Heterogeneous construction
• Simple memory architecture
• It is hard to write efficient software

ARM Cortex-A9 (GP, Mobile)
• Capable of running operating systems and traditional applications
• Broadcast coherency
• Poor performance on data driven applications
• Variable number of cores

TI OMAP 4430 – GP SoC
• Smartphones
• ARM for general purpose applications
• C64x for data driven multimedia applications
• Shared common memory

Nvidia G200 – GPU
• In total: 240 SP cores
• 30 processors
  • 8 SIMD cores in each group
• Capable of branching, but all of 24 cores must do the same
• Relatively small amount of memory per core
• MIMD design, but for GPU purposes

Nvidia Fermi
• 32 cores in one SM processor
  • 4 times more than in the previous generation
• More local memory and a larger cache
• Each CUDA core (SP) has a double precision FPU and an integer ALU
• It is important to design the algorithm properly for GPGPU and to fully utilize the available memory
Figure: NVIDIA GeForce GF100 Fermi block diagram

Intel Core i7 – GP
• High power consumption
• 8 cores maximum
• Each core
  • can be multithreaded
  • has a 128-bit SIMD unit
• Memory coherence
• Huge cache
  • Broadcast based

Cell Broadband Engine
• 1 PPE
  • Two-threaded RISC
    – L1 and L2 cache
  • Runs the operating system and supervises the SPEs
• Each SPE
  • 128-bit SIMD, RISC
  • 256 kB local memory
• Circular data bus
• Relatively low power consumption

FPGA
• Configurable Logic Block (CLB)
  – Look-up table (LUT)
  – Register
  – Logic circuit
    • Adder
    • Multiplier
    • Memory
    • Microprocessor
• Input/Output Block (IOB)
• Programmable interconnect

Interconnection types (figure)

Cellular automaton
• Cells are arranged in a regular grid
• Every cell is in one of a finite set of states
• The grid may have any number of dimensions
• Time is discrete
• Every cell operates according to the same rules
• Generation
  • After every cell has performed the state transition governed by the rule, a new generation of cells (a new state) is obtained

Cellular automaton – Rules
• One dimensional automaton:
  • Each cell has two neighbors
  • The output of the cell is based on the current output of the neighbors and the cell's own current output
  • For example, in the table below the current pattern 001 gives output 1
• These local rules can be written as tables:

current pattern            111 110 101 100 011 010 001 000
new state for center cell   0   1   1   0   1   1   1   0

Cellular automaton – Rules
• Rule table
  • The first row is constant
  • The second row can be read as the binary representation of a number; the sequence 01101110 is 110 in decimal
  • Thus the name of the rule is Rule 110

current pattern            111 110 101 100 011 010 001 000
new state for center cell   0   1   1   0   1   1   1   0

Cellular automaton – Rules
• Rule 124

current pattern            111 110 101 100 011 010 001 000
new state for center cell   0   1   1   1   1   1   0   0

• Rule 30

current pattern            111 110 101 100 011 010 001 000
new state for center cell   0   0   0   1   1   1   1   0
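
The rule numbering above can be turned into a short simulation. The following is a minimal sketch, not the implementation used in the lecture: the function names are hypothetical, and the wrap-around (periodic) boundary is an assumption, since the slides do not say how the edge cells are handled.

```python
def rule_table(rule_number):
    """Build the lookup table of an elementary rule (0..255).

    Bit k of the rule number is the new center state for the pattern
    (left, center, right) whose binary value is k.
    """
    return {
        ((k >> 2) & 1, (k >> 1) & 1, k & 1): (rule_number >> k) & 1
        for k in range(8)
    }


def step(cells, table):
    """Compute the next generation; the boundary is assumed periodic."""
    n = len(cells)
    return [
        table[(cells[(i - 1) % n], cells[i], cells[(i + 1) % n])]
        for i in range(n)
    ]


def run(rule_number, cells, generations):
    """Return the list of generations, starting with the initial one."""
    table = rule_table(rule_number)
    history = [list(cells)]
    for _ in range(generations):
        history.append(step(history[-1], table))
    return history


if __name__ == "__main__":
    start = [0] * 15 + [1] + [0] * 15   # a single live cell in the middle
    for row in run(30, start, 12):
        print("".join("#" if c else "." for c in row))
```

Every cell of a generation depends only on the previous generation, so the list comprehension in `step` could be evaluated for all cells at once; this is the property that a parallel hardware implementation exploits.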

Cellular automaton
Figure (source: http://www.crystalinks.com/cellularautomaton.gif). Labels: previous state of cell; previous state of neighbor cells; new state of cell. Axes: time; space (1 row = 1D array of cells). Example marked: output of the 3rd cell at time instant 1.

Cellular automaton – Rule 90 (figure)

Cellular automaton – Rule 30 (figure)

Cellular automaton – Parallelization
• This automaton can be easily parallelized on an FPGA
• The speedup obtained is very large
  • Simulation
    • The cells are evaluated one after another
  • Hardware implementation
    • 640 cells are evaluated simultaneously on a chip in a 2D array when the size of the array is 640x480
As the previous example indicates, the parallelized version outperforms the sequential simulation

FFT (Fast Fourier Transform)
• Recall
  • The FFT is used to compute a signal's DFT (Discrete Fourier Transform) quickly
  • The FFT algorithm eliminates a large number of the calculations of the direct DFT algorithm
  • "Butterfly" (figure)

FFT (Fast Fourier Transform) – Parallelization
• Recall
  • More butterflies
  • Complex, bigger FFT
  • This can be easily parallelized (figure)
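
The butterfly structure recalled above can be sketched in a few lines. This is an illustrative radix-2 decimation-in-time FFT in plain Python, assuming the input length is a power of two; the function names are hypothetical, and it is not the circuit or code discussed in the lecture.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def butterfly(a, b, w):
    """One radix-2 butterfly: combine a and b with twiddle factor w."""
    t = w * b
    return a + t, a - t


def fft(x):
    """Iterative radix-2 FFT; len(x) is assumed to be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    data = [0j] * n
    for i, v in enumerate(x):
        data[bit_reverse(i, bits)] = complex(v)
    size = 2
    while size <= n:                      # log2(n) stages
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                i, j = start + k, start + k + half
                # The n/2 butterflies of one stage touch disjoint pairs
                # (i, j), so they could all run in parallel.
                data[i], data[j] = butterfly(data[i], data[j], w)
                w *= w_step
        size *= 2
    return data


if __name__ == "__main__":
    print(fft([1, 2, 3, 4]))
```

Within a stage the butterflies are independent of each other, while each stage needs the results of the previous one; this is why the stages map naturally onto a pipeline and the butterflies of a stage onto parallel units.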

FFT – Parallelization, pipeline
• Original (figure)
• Pipeline stages (figure)

FFT – Parallelization, pipeline
• Parallel – Pipeline FFTs (figure)

FFT – Parallelization, pipeline
• Parallel – Pipeline FFTs
  • The pipeline FFT is very common in communication systems (OFDM, DMT)
  • It implements an entire "slice" of the FFT and reuses the hardware to perform the other slices
  • Advantages: particularly good for systems in which x(n) arrives serially (i.e. no block assembly is required), very fast, more area efficient than the parallel version, can be pipelined
  • Disadvantages: the controller can become complicated, large intermediate memories may be required between stages, latency of N cycles (more if pipelining is introduced)

FFT – Parallelization, Circuits
• Radix-2 Multi-path Delay Commutator: R2MDC
• Radix-2 Single-path Delay Feedback: R2SDF
...

FFT – Parallelization, Circuits
Hardware Resource Requirements
The following table gives a short overview of the hardware needs of the previously specified hardware elements

                         Complex       Complex    Memory     Control   Comp. efficiency
                         multipliers   adders     size       logic     add/sub   multiplier
R2SDF                    log2 N - 2    2 log2 N   N - 1      Simple    50%       50%
R22SDF (radix-2² SDF)    log4 N - 1    4 log4 N   N - 1      Simple    75%       75%
R2MDC                    log2 N - 2    2 log2 N   3N/2 - 2   Simple    50%       50%
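
To get a feeling for the formulas in the table, they can be evaluated for a concrete transform length. The sketch below simply codes the three rows (R2SDF, R22SDF, R2MDC) as given; the function name and the dictionary keys are hypothetical, and N is assumed to be a power of 4 so that both log2 N and log4 N are integers.

```python
def pipeline_fft_resources(n):
    """Evaluate the table's resource formulas for an n-point pipeline FFT."""
    log2n = n.bit_length() - 1
    if n < 4 or (1 << log2n) != n or log2n % 2:
        raise ValueError("n is assumed to be a power of 4")
    log4n = log2n // 2
    return {
        "R2SDF": {
            "complex_multipliers": log2n - 2,
            "complex_adders": 2 * log2n,
            "memory_words": n - 1,
        },
        "R22SDF": {
            "complex_multipliers": log4n - 1,
            "complex_adders": 4 * log4n,
            "memory_words": n - 1,
        },
        "R2MDC": {
            "complex_multipliers": log2n - 2,
            "complex_adders": 2 * log2n,
            "memory_words": 3 * n // 2 - 2,
        },
    }


if __name__ == "__main__":
    for name, need in pipeline_fft_resources(1024).items():
        print(name, need)
```

For N = 1024 this gives 8 complex multipliers for R2SDF and R2MDC but 4 for R22SDF, with 20 complex adders in every case; the memory is 1023 words for the two SDF variants and 1534 words for R2MDC.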

FFT
• Depending on its use, the FFT can be implemented in various ways
• The parallelized method can improve the performance of the traditional Fast Fourier Transform
  • Hardware implementation
  • Furthermore, software implementation on
    • the Nvidia CUDA platform
    • DSP chips
• The advantage is obvious in a real-time system

Conclusions
• Motivation
• Multicore systems
  • Definitions
  • Survey of multicore systems, architectures
• Applications
  • Cellular automaton demonstration
  • FFT implementation