File decimate.cu¶
Functions
 inline __device__ cuComplex __shfl_down_sync (cuComplex var, unsigned int srcLane, int width=32)
Overloads __shfl_down to handle cuComplex.
__shfl can only shuffle 4 bytes at time. This overload utilizes a trick similar to the below link in order to shuffle 8 byte values. https://devblogs.nvidia.com/parallelforall/fasterparallelreductionskepler/ http://docs.nvidia.com/cuda/cudacprogrammingguide/#warpshufflefunctions
 Parameters
var – [in] cuComplex value to shuffle.
srcLane – [in] Relative lane from within the warp that should shuffle its variable down.
width – [in] Section of the warp to shuffle. Defaults to full warp size.
 Returns
Shuffled cuComplex variable.
 __device__ cuComplex parallel_reduce (cuComplex *data, uint32_t tap_offset)
Performs a parallel reduction to sum a series of values together.
NVIDIA supplies many versions of optimized parallel reduction. This is a slightly modified version of reduction #5 from NVIDIA examples. /usr/local/cuda/samples/6_Advanced/reduction
 Parameters
data – A pointer to a set of cuComplex data to reduce.
tap_offset – [in] The offset into the data from which to pull values.
 Returns
Final sum after reduction.
 __device__ __forceinline__ cuComplex _exp (cuComplex z)
cuComplex version of exponential function.
 Parameters
z – [in] Complex number.
 Returns
Complex exponential of input.
 __global__ void bandpass_decimate (cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, double F_s, double *freqs, uint32_t stride)
Performs decimation using bandpass filters on a set of input RF samples.
This function performs a parallel version of filtering+downsampling on the GPU to be able process data in realtime. This algorithm will use 1 GPU thread per filter tap if there are less than or equal to 1024 taps for all filters combined. Only works with power of two length filters, or a filter that is zero padded to a power of two in length. This algorithm takes a single set of wide band samples from the USRP driver, and produces an output data set for each RX frequency. The phase of each output sample is corrected after decimating via modified Frerking method.
gridDim.x  Total number of output samples there will be after decimation. gridDim.y  Total number of antennas.
blockIdx.x  Decimated output sample index. blockIdx.y  Antenna index.
blockDim.x  Number of filter taps divided by stride length. blockDim.y  Total number of filters. Corresponds to total receive frequencies.
threadIdx.x  Filter tap index. threadIdx.y  Filter index.
 Parameters
original_samples – [in] A pointer to original input samples from each antenna to decimate.
decimated_samples – [in] A pointer to a buffer to place output samples for each frequency after decimation.
filter_taps – [in] A pointer to one or more filters needed for each frequency.
dm_rate – [in] Decimation rate.
samples_per_antenna – [in] The number of samples per antenna in the original set of samples.
F_s – [in] The sampling frequency in hertz.
freqs – [in] A pointer to the frequencies used in mixing.
stride – [in] Number of filter products to calculate per thread.

void bandpass_decimate_wrapper(cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, uint32_t num_taps_per_filter, uint32_t num_freqs, uint32_t num_antennas, double F_s, double *freqs, cudaStream_t stream)¶
This function wraps the bandpass_decimate kernel so that it can be called from another file.
 Parameters
original_samples – [in] A pointer to original input samples from each antenna to decimate.
decimated_samples – [in] A pointer to a buffer to place output samples for each frequency after decimation.
filter_taps – [in] A pointer to one or more filters needed for each frequency.
dm_rate – [in] Decimation rate.
samples_per_antenna – [in] The number of samples per antenna in the original set of samples.
num_taps_per_filter – [in] Number of taps per filter.
num_freqs – [in] Number of receive frequencies.
num_antennas – [in] Number of antennas for which there are samples.
F_s – [in] The original sampling frequency.
freqs – A pointer to the frequencies being filtered.
stream – [in] CUDA stream with which to associate the invocation of the kernel.
 __global__ void lowpass_decimate (cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, uint32_t stride)
Performs decimation using a lowpass filter on one or more sets of baseband samples corresponding to each RX frequency.
This function performs a parallel version of filtering+downsampling on the GPU to be able process data in realtime. This algorithm will use 1 GPU thread per filter tap if there are less than 1024 taps for all filters combined. Only works with power of two length filters, or a filter that is zero padded to a power of two in length. This algorithm takes one or more baseband datasets corresponding to each RX frequency and filters each one using a single lowpass filter before downsampling.
gridDim.x  The number of decimated output samples for one antenna in one frequency data set. gridDim.y  Total number of antennas. gridDim.z  Total number of frequency data sets.
blockIdx.x  Decimated output sample index. blockIdx.y  Antenna index. blockIdx.z  Frequency dataset index.
blockDim.x  Number of filter taps in the lowpass filter.
threadIdx.x  Filter tap indices.
 Parameters
original_samples – [in] A pointer to input samples for one or more baseband datasets.
decimated_samples – [in] A pointer to a buffer to place output samples for each frequency dataset after decimation.
filter_taps – [in] A pointer to a lowpass filter used for further decimation.
dm_rate – [in] Decimation rate.
samples_per_antenna – [in] The number of samples per antenna in the original set of samples.
stride – [in] The number of samples to calculate per thread.

void lowpass_decimate_wrapper(cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, uint32_t num_taps_per_filter, uint32_t num_freqs, uint32_t num_antennas, cudaStream_t stream)¶
This function wraps the lowpass_decimate kernel so that it can be called from another file.
 Parameters
original_samples – [in] A pointer to one or more baseband frequency datasets.
decimated_samples – [in] A pointer to a buffer to place output samples for each frequency after decimation.
filter_taps – [in] A pointer to one lowpass filter.
dm_rate – [in] Decimation rate.
samples_per_antenna – [in] The number of samples per antenna in each data set.
num_taps_per_filter – [in] Number of taps per filter.
num_freqs – [in] Number of receive frequency datasets.
num_antennas – [in] Number of antennas for which there are samples.
stream – [in] CUDA stream with which to associate the invocation of the kernel.