File decimate.cu

Functions

inline __device__ cuComplex __shfl_down_sync (cuComplex var, unsigned int srcLane, int width=32)

Overloads __shfl_down to handle cuComplex.

__shfl can only shuffle 4 bytes at time. This overload utilizes a trick similar to the below link in order to shuffle 8 byte values. https://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-shuffle-functions

Parameters
  • var[in] cuComplex value to shuffle.

  • srcLane[in] Relative lane from within the warp that should shuffle its variable down.

  • width[in] Section of the warp to shuffle. Defaults to full warp size.

Returns

Shuffled cuComplex variable.

__device__ cuComplex parallel_reduce (cuComplex *data, uint32_t tap_offset)

Performs a parallel reduction to sum a series of values together.

NVIDIA supplies many versions of optimized parallel reduction. This is a slightly modified version of reduction #5 from NVIDIA examples. /usr/local/cuda/samples/6_Advanced/reduction

Parameters
  • data – A pointer to a set of cuComplex data to reduce.

  • tap_offset[in] The offset into the data from which to pull values.

Returns

Final sum after reduction.

__device__ __forceinline__ cuComplex _exp (cuComplex z)

cuComplex version of exponential function.

Parameters

z[in] Complex number.

Returns

Complex exponential of input.

__global__ void bandpass_decimate1024 (cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, double F_s, double *freqs)

Performs decimation using bandpass filters on a set of input RF samples if the total number of filter taps for all filters is less than 1024.

This function performs a parallel version of filtering+downsampling on the GPU to be able process data in realtime. This algorithm will use 1 GPU thread per filter tap if there are less than 1024 taps for all filters combined. Only works with power of two length filters, or a filter that is zero padded to a power of two in length. This algorithm takes a single set of wide band samples from the USRP driver, and produces an output data set for each RX frequency. The phase of each output sample is corrected to after decimating via modified Frerking method.

gridDim.x - Total number of output samples there will be after decimation. gridDim.y - Total number of antennas.

blockIdx.x - Decimated output sample index. blockIdx.y - Antenna index.

blockDim.x - Number of filter taps in the lowpass filter. blockDim.y - Total number of filters. Corresponds to total receive frequencies.

threadIdx.x - Filter tap index. threadIdx.y - Filter index.

Parameters
  • original_samples[in] A pointer to original input samples from each antenna to decimate.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency after decimation.

  • filter_taps[in] A pointer to one or more filters needed for each frequency.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in the original set of samples.

  • F_s[in] The sampling frequency in hertz.

  • freqs[in] A pointer to the frequencies used in mixing.

__global__ void bandpass_decimate2048 (cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, double F_s, double *freqs)

Performs decimation using bandpass filters on a set of input RF samples if the total number of filter taps for all filters is less than 2048.

This function performs a parallel version of filtering+downsampling on the GPU to be able process data in realtime. This algorithm will use 1 GPU thread to process two filter taps if there are less than 2048 taps for all filters combined. Intended to be used if there are more than 1024 total threads, as that is the max block size possible for CUDA. Only works with power of two length filters, or a filter that is zero padded to a power of two in length. This algorithm takes a single set of wide band samples from the USRP driver, and produces a output data set for each RX frequency.

gridDim.x - Total number of output samples there will be after decimation. gridDim.y - Total number of antennas.

blockIdx.x - Decimated output sample index. blockIdx.y - Antenna index.

blockDim.x - Number of filter taps in each filter / 2. blockDim.y - Total number of filters. Corresponds to total receive frequencies.

threadIdx.x - Every second filter tap index. threadIdx.y - Filter index.

Parameters
  • original_samples[in] A pointer to original input samples from each antenna to decimate.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency after decimation.

  • filter_taps[in] A pointer to one or more filters needed for each frequency.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in the original set of samples.

  • F_s[in] The sampling frequency in hertz.

  • freqs[in] A pointer to the frequencies used in mixing.

void bandpass_decimate1024_wrapper(cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, uint32_t num_taps_per_filter, uint32_t num_freqs, uint32_t num_antennas, double F_s, double *freqs, cudaStream_t stream)

This function wraps the bandpass_decimate1024 kernel so that it can be called from another file.

Parameters
  • original_samples[in] A pointer to original input samples from each antenna to decimate.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency after decimation.

  • filter_taps[in] A pointer to one or more filters needed for each frequency.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in the original set of samples.

  • num_taps_per_filter[in] Number of taps per filter.

  • num_freqs[in] Number of receive frequencies.

  • num_antennas[in] Number of antennas for which there are samples.

  • F_s[in] The original sampling frequency.

  • freqs – A pointer to the frequencies being filtered.

  • stream[in] CUDA stream with which to associate the invocation of the kernel.

void bandpass_decimate2048_wrapper(cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, uint32_t num_taps_per_filter, uint32_t num_freqs, uint32_t num_antennas, double F_s, double *freqs, cudaStream_t stream)

This function wraps the bandpass_decimate2048 kernel so that it can be called from another file.

Parameters
  • original_samples[in] A pointer to original input samples from each antenna to decimate.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency after decimation.

  • filter_taps[in] A pointer to one or more filters needed for each frequency.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in the original set of samples.

  • num_taps_per_filter[in] Number of taps per filter.

  • num_freqs[in] Number of receive frequencies.

  • num_antennas[in] Number of antennas for which there are samples.

  • F_s[in] The original sampling frequency.

  • freqs – A pointer to the frequencies being filtered.

  • stream[in] CUDA stream with which to associate the invocation of the kernel.

__global__ void lowpass_decimate1024 (cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna)

Performs decimation using a lowpass filter on one or more sets of baseband samples corresponding to each RX frequency. This algorithm works on filters with less that 1024 taps.

This function performs a parallel version of filtering+downsampling on the GPU to be able process data in realtime. This algorithm will use 1 GPU thread per filter tap if there are less than 1024 taps for all filters combined. Only works with power of two length filters, or a filter that is zero padded to a power of two in length. This algorithm takes one or more baseband datasets corresponding to each RX frequency and filters each one using a single lowpass filter before downsampling.

gridDim.x - The number of decimated output samples for one antenna in one frequency data set. gridDim.y - Total number of antennas. gridDim.z - Total number of frequency data sets.

blockIdx.x - Decimated output sample index. blockIdx.y - Antenna index. blockIdx.z - Frequency dataset index.

blockDim.x - Number of filter taps in the lowpass filter.

threadIdx.x - Filter tap indices.

Parameters
  • original_samples[in] A pointer to input samples for one or more baseband datasets.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency dataset after decimation.

  • filter_taps[in] A pointer to a lowpass filter used for further decimation.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in the original set of samples.

__global__ void lowpass_decimate2048 (cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna)

Performs decimation using a lowpass filter on one or more sets of baseband samples corresponding to each RX frequency. This algorithm works on filters with less that 2048 taps.

This function performs a parallel version of filtering+downsampling on the GPU to be able process data in realtime. This algorithm will use 1 GPU thread to process two filter taps if there are less than 2048 taps for all filters combined. Intended to be used if there are more than 1024 total threads, as that is the max block size possible for CUDA. Only works with power of two length filters, or a filter that is zero padded to a power of two in length. This algorithm takes one or more baseband datasets corresponding to each RX frequency and filters each one using a single lowpass filter before downsampling.

gridDim.x - The number of decimated output samples for one antenna in one frequency data set. gridDim.y - Total number of antennas. gridDim.z - Total number of frequency data sets.

blockIdx.x - Decimated output sample index. blockIdx.y - Antenna index. blockIdx.z - Frequency dataset index.

blockDim.x - Number of filter taps in the lowpass filter / 2.

threadIdx.x - Every second filter tap index.

Parameters
  • original_samples[in] A pointer to input samples for one or more baseband datasets.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency dataset after decimation.

  • filter_taps[in] A pointer to a lowpass filter used for further decimation.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in the original set of samples.

void lowpass_decimate1024_wrapper(cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, uint32_t num_taps_per_filter, uint32_t num_freqs, uint32_t num_antennas, cudaStream_t stream)

This function wraps the lowpass_decimate1024 kernel so that it can be called from another file.

Parameters
  • original_samples[in] A pointer to one or more baseband frequency datasets.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency after decimation.

  • filter_taps[in] A pointer to one lowpass filter.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in each data set.

  • num_taps_per_filter[in] Number of taps per filter.

  • num_freqs[in] Number of receive frequency datasets.

  • num_antennas[in] Number of antennas for which there are samples.

  • stream[in] CUDA stream with which to associate the invocation of the kernel.

void lowpass_decimate2048_wrapper(cuComplex *original_samples, cuComplex *decimated_samples, cuComplex *filter_taps, uint32_t dm_rate, uint32_t samples_per_antenna, uint32_t num_taps_per_filter, uint32_t num_freqs, uint32_t num_antennas, cudaStream_t stream)

This function wraps the lowpass_decimate2048 kernel so that it can be called from another file.

Parameters
  • original_samples[in] A pointer to one or more baseband frequency datasets.

  • decimated_samples[in] A pointer to a buffer to place output samples for each frequency after decimation.

  • filter_taps[in] A pointer to one lowpass filter.

  • dm_rate[in] Decimation rate.

  • samples_per_antenna[in] The number of samples per antenna in each data set.

  • num_taps_per_filter[in] Number of taps per filter.

  • num_freqs[in] Number of receive frequency datasets.

  • num_antennas[in] Number of antennas for which there are samples.

  • stream[in] CUDA stream with which to associate the invocation of the kernel.