Mixed land use has been widely used as a planning tool to improve the functionality of cities. However, depicting mixed land use is difficult because of its complexity. Previous studies have decomposed urban land areas using either remote sensing images or geospatial big data, but few have combined the two data sources because suitable methodologies are lacking. This article proposes an end-to-end two-stream convolutional neural network (CNN) for combining features (CF-CNN) that estimates the proportions of mixed land use by integrating high spatial resolution (HSR) images with geospatial big data, namely real-time Tencent user density (RTUD) data. The two branches of CF-CNN are built from two deep learning networks, one for extracting image information and the other for extracting human activity-related information. Mixed land use is described by calculating the proportion of each land use type at the street block level. Compared with methods that use single-source data, CF-CNN obtained the highest classification accuracy. We further applied the Shannon diversity index (SHDI) to quantify agglomerated urban mixed land use. The Spearman correlation coefficients among the SHDI, community distance, and neighborhood vibrancy were calculated to verify the effectiveness of the mixed land use composition. Our framework provides an alternative way of identifying mixed land use structures by integrating multisource data.
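As a minimal illustration of how the SHDI summarizes a block's land use proportions, the sketch below computes the standard Shannon diversity index, SHDI = -Σ p_i ln p_i, for one street block. The function name `shannon_diversity`, the use of the natural logarithm, the renormalization of the input, and the example proportions are all assumptions made for illustration; they are not taken from the article.

```python
import math


def shannon_diversity(proportions):
    """Return the Shannon diversity index of one street block.

    `proportions` holds the estimated share of each land use type.
    Hypothetical helper: inputs are renormalized to sum to 1, and the
    natural logarithm is assumed.
    """
    if any(p < 0 for p in proportions):
        raise ValueError("proportions must be non-negative")
    total = sum(proportions)
    if total <= 0:
        raise ValueError("at least one proportion must be positive")

    shdi = 0.0
    for p in proportions:
        q = p / total
        # A land use type with zero share contributes nothing to the index.
        if q > 0:
            shdi -= q * math.log(q)
    return shdi


# Illustrative values only: an evenly mixed block and a single-use block.
evenly_mixed = [0.25, 0.25, 0.25, 0.25]
single_use = [1.0, 0.0, 0.0, 0.0]

print(shannon_diversity(evenly_mixed))
print(shannon_diversity(single_use))
```

Under these assumptions, an evenly mixed block reaches the maximum value ln(k) for k land use types, while a block with a single land use type scores zero, so higher values indicate a more mixed block.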