Traffic congestion is one of the most pressing issues for smart cities. Information on traffic flow can be used to reduce congestion by predicting vehicle counts at unmonitored locations, so that countermeasures can be applied before congestion appears. To do so, costly sensors must be distributed sparsely across the city and along important roads in the city center to collect road and vehicle information in real time. Machine Learning models can then be applied to predict vehicle counts at unmonitored locations. To be fault-tolerant and to extend the coverage of the traffic predictions to suburbs, rural regions, or even neighboring villages, these Machine Learning models should not operate in a central traffic control room but should rather be distributed across the city. Gaussian Processes (GPs) work well for traffic count prediction, but cannot capitalize on the vast amount of data available in an entire city. Furthermore, a Gaussian Process is a global, centralized model that requires all measurements to be available at a central computation node. Product of Experts (PoE) models have been proposed as a scalable alternative to Gaussian Processes. A PoE model trains multiple independent GPs on different subsets of the data and weights the individual predictions based on each expert's uncertainty. These methods work well, but they assume that the experts are independent even though they may share data points. Furthermore, PoE models require extensive communication between the individual experts to form the final prediction. In this paper, we propose a hierarchical Product of Experts model, which consists of multiple layers of small, independent, and local GP experts. We view Gaussian Process induction as a regularized optimization procedure and use this view to derive an efficient algorithm that selects independent regions of the data. We then train local expert models on these regions, so that each expert is responsible for a given region.
The resulting algorithm scales well to large amounts of data and outperforms flat PoE models in terms of communication cost, model size, and predictive performance. Finally, we discuss how to deploy these local expert models onto small devices.
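
To make the uncertainty-based weighting of a flat PoE model concrete, the following minimal sketch combines the Gaussian predictions of several experts at one location using the standard product-of-Gaussians rule: precisions (inverse variances) add up, and the combined mean is the precision-weighted average of the expert means. The function name `poe_combine` and the example numbers are hypothetical and chosen for illustration only; this is not the hierarchical model proposed here, and it assumes the experts are independent.

```python
from typing import Sequence, Tuple


def poe_combine(means: Sequence[float],
                variances: Sequence[float]) -> Tuple[float, float]:
    """Combine Gaussian expert predictions with a product of experts.

    Each expert k predicts N(means[k], variances[k]) for the same location.
    Hypothetical helper for illustration; assumes independent experts.
    """
    if len(means) == 0 or len(means) != len(variances):
        raise ValueError("means and variances must be non-empty and equally long")
    if any(v <= 0.0 for v in variances):
        raise ValueError("variances must be strictly positive")

    # Precision (inverse variance) acts as the weight of each expert:
    # a confident expert (small variance) contributes more.
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)

    # Product of Gaussians: precisions add, means are precision-weighted.
    combined_variance = 1.0 / total_precision
    combined_mean = combined_variance * sum(
        p * m for p, m in zip(precisions, means)
    )
    return combined_mean, combined_variance


if __name__ == "__main__":
    # Two experts predict vehicle counts for the same unmonitored road;
    # the first one is more certain, so the result lies closer to it.
    mean, variance = poe_combine([10.0, 20.0], [1.0, 4.0])
    print(mean, variance)
```

Because every expert's mean and variance must be gathered at one place before the final prediction can be formed, this flat combination also illustrates where the communication cost mentioned above comes from.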