Urban perception is an active topic in urban studies and supports urban planning and design. Two approaches are currently used to compute urban perception: 1) training a model to learn image features directly and automatically; and 2) coupling machine learning with feature extraction based on expert knowledge (e.g., object proportions). Taking two typical streets in Wuhan as the study area, we recorded video data and used them as model input. We select one representative method for each approach: 1) an end-to-end convolutional neural network (CNN-based model); and 2) a fully convolutional network combined with a random forest (FCN+RF-based model). By comparing the accuracy of the two models, we analyze their adaptability to different urban scenes. We also analyze the relationship between the CNN-based model's perception results and urban function using point-of-interest (POI) and OpenStreetMap (OSM) data, and verify the model's interpretability. The results show that the CNN-based model is more accurate than the FCN+RF-based model. Because the CNN-based model considers the topological characteristics of ground objects, its perception results have a stronger nonlinear correlation with urban functions. We also find that the CNN-based model is better suited to scenes with weak spatial heterogeneity (such as small and medium-sized urban environments), whereas the FCN+RF-based model is better suited to scenes with strong spatial heterogeneity (such as the downtown areas of China’s megacities). These results can serve as a reference for selecting urban perception models in urban planning.
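
To make the expert-knowledge feature of the second approach concrete, the following is a minimal sketch, not the study's implementation, of how object proportions could be computed from a semantic segmentation mask (such as one an FCN might output) before being passed to a model such as a random forest. The function name `object_proportions`, the class labels, and the toy mask are illustrative assumptions.

```python
from collections import Counter


def object_proportions(mask, num_classes):
    """Return the fraction of pixels assigned to each class in a label mask.

    mask is a 2D list of integer class labels, one per pixel (assumed format).
    """
    counts = Counter(label for row in mask for label in row)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("mask is empty")
    # One proportion per class, in class-index order; absent classes get 0.0.
    return [counts.get(c, 0) / total for c in range(num_classes)]


# Hypothetical 2x4 mask: 0 = sky, 1 = building, 2 = vegetation
mask = [
    [0, 0, 1, 1],
    [2, 2, 1, 1],
]
features = object_proportions(mask, num_classes=3)
print(features)  # [0.25, 0.5, 0.25]
```

Under these assumptions, one such vector per video frame would form a row of the feature matrix on which the random forest is trained.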