Automatically processing an image to obtain a result is useful in many fields today. Manufacturers in many industries want to avoid producing defective products, and image processing is an efficient way to detect defects in images so that defective products can be eliminated. Because image processing operates pixel by pixel, it carries a heavy computational workload. Where processing speed matters, parallel image processing can be a solution: on current multi-core computers, parallelizing the processing with additional hardware and software can boost performance. The performance of parallel image processing depends on how well the algorithm lends itself to parallelism and on how accurately the work is distributed across the processors. Shared use of resources and excessive data exchange affect performance directly. In this study, the COLMSTD algorithm, developed to detect defects on rail and profile surfaces during rolling at the Kardemir Inc. rolling plant, was parallelized in two different ways. In the first method, the number of CUDA cores used in the GPU (Graphics Processing Unit) was selected in software; in the second, a single CUDA core was used. The performance obtained on the GPU through the CUDA (Compute Unified Device Architecture) interface was compared with that obtained on the CPU.
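
As a rough illustration of why pixel-wise processing lends itself to parallel execution, and of how the number of workers can be chosen in software, consider the minimal sketch below. It is not the COLMSTD algorithm and it does not use CUDA: it applies a simple, hypothetical brightness threshold to each pixel and splits the image into independent row bands handled by CPU threads. The names `threshold_rows`, `detect`, `limit`, and `workers` are assumptions introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def threshold_rows(image, start, stop, limit):
    """Flag pixels darker than `limit` as 1 (else 0) for rows start..stop-1."""
    return [[1 if px < limit else 0 for px in row] for row in image[start:stop]]


def detect(image, limit, workers=1):
    """Split the image into row bands and process each band independently.

    `workers` is chosen by the caller; workers=1 processes the image serially.
    """
    height = len(image)
    if height == 0:
        return []
    workers = max(1, min(workers, height))
    band = -(-height // workers)  # ceiling division: rows per band
    ranges = [(r, min(r + band, height)) for r in range(0, height, band)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each band reads only its own rows, so no data is exchanged
        # between workers while they run.
        parts = list(
            pool.map(lambda rng: threshold_rows(image, rng[0], rng[1], limit), ranges)
        )
    # Reassemble the bands in their original order.
    return [row for part in parts for row in part]


if __name__ == "__main__":
    sample = [[10, 200, 30], [250, 5, 90], [100, 100, 100], [0, 0, 255]]
    # The result must not depend on how many workers share the work.
    print(detect(sample, limit=50, workers=1) == detect(sample, limit=50, workers=3))
```

Because each band touches only its own rows, the workers share no intermediate data, which is the property that makes such a decomposition suitable for parallel hardware. The sketch shows the decomposition only; Python threads on a CPU will not reproduce GPU timings.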