```python
from numba import cuda
import numpy as np

@cuda.jit
def hello(data):
    # Each thread writes its block's ID into its own slot of the 2-D array.
    data[cuda.blockIdx.x, cuda.threadIdx.x] = cuda.blockIdx.x

numBlocks = 1
threadsPerBlock = 5

data = np.ones((numBlocks, threadsPerBlock), dtype=np.uint8)

# Launch the kernel: numBlocks blocks, each with threadsPerBlock threads.
hello[numBlocks, threadsPerBlock](data)

print(data)
```

* `@cuda.jit` - this decorator tells Numba that the function should be compiled for the GPU. When no signature is supplied, the compiler infers the types of the arguments at call time; you can also pass an explicit signature to the `jit` decorator yourself.
* `cuda.blockIdx.x` - a read-only variable that is defined for you. Inside a GPU kernel it gives the ID of the block that is currently executing the code. Since many blocks run in parallel, this ID determines which chunk of data a particular block works on.
* `cuda.threadIdx.x` - a read-only variable that is defined for you. Inside a GPU kernel it gives the ID of the thread that is currently executing the code within the active block.
* `myKernel[number_of_blocks, threads_per_block](...)` - the syntax used to launch a kernel on the GPU. Inside the square brackets, the first number is the total number of blocks to run on the GPU and the second is the number of threads per block. It is possible, and in fact recommended, to schedule more blocks than the GPU can actively run in parallel; the system simply keeps executing blocks until they have all completed.

The following video addresses grids, blocks, and threads in more detail.
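As a small follow-up sketch (not part of the original example), here is the usual way these built-in variables are combined when more than one block is launched: each thread computes a global index from its block ID, the block size (`cuda.blockDim.x`), and its thread ID. The kernel name, array size, and launch configuration below are illustrative choices, not taken from the tutorial.

```python
from numba import cuda
import numpy as np

@cuda.jit
def fill_global_index(out):
    # Global thread index: which element of the flat array this thread owns.
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    if i < out.size:      # guard in case more threads were launched than elements
        out[i] = i

numBlocks = 4             # illustrative values
threadsPerBlock = 8
out = np.zeros(numBlocks * threadsPerBlock, dtype=np.int32)

fill_global_index[numBlocks, threadsPerBlock](out)
print(out)                # expected: 0, 1, 2, ... up to 31
```

With this pattern, the grid can be made as large as the data requires; each block handles its own contiguous slice of the array, which is exactly the block/thread decomposition the video below walks through.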