Calista: A deep learning-based system for understanding and evaluating website aesthetics
Introduction
The Internet and the Software as a Service industry have become pervasive in nearly all human activity: finding information, marketing and advertising, entertainment and social networking, among others. Studies (Lindgaard et al., 2006, Tractinsky et al., 2006, Tuch et al., 2012) have shown that users form a lasting impression of a website’s appeal within a split second. This first impression is based on how visually appealing the user finds what they see, and it has a great impact on the user’s perception of the website’s quality and usability (Hartmann et al., 2008). A good first impression can greatly increase trust in the website and its content (Michailidou et al., 2008, Lindgaard et al., 2011, Reinecke et al., 2013) and decrease bounce rates (Robins and Holmes, 2008). For an e-commerce website, this could mean increased conversion rates, while for a news portal it could mean an increase in daily visits. Thus, the aesthetics of a website is a very important trait that should be taken seriously during the design process (Lindgaard et al., 2011, Lindgaard and Dudek, 2003).
Aesthetics is a topic of major interest within the field of human–computer interaction, especially in the context of webpage design (Tractinsky, 2013). When designing a website, a good understanding of what users perceive as visually appealing is of utmost significance. Web designers follow established guidelines (Galitz, 2007), but the final outcome is subjective in terms of aesthetics and may differ from the preferences of the target user group (Park et al., 2004). This happens because the designers’ work is influenced by their personal experience and taste. Companies deal with this designer bias by gathering feedback through user groups and questionnaires. However, this procedure is costly and time-consuming, since enough data must be collected and analyzed. As a consequence, there is a growing need for tools that can assess website aesthetics automatically.
A number of previous works have tried to quantify website aesthetics by proposing various aesthetics-relevant hand-crafted features, inspired by the photography or psychology literature. For instance, Wu et al. (2011) generate manual features by considering some webpage aspects such as the layout’s block size and color characteristics (e.g. hue, brightness). However, such approaches exhibit some known limitations (Lu et al., 2014). First, they are manually defined, thus a possibility exists that some features are not taken into account. Second, these hand-crafted features comprise mere approximations of psychological rules and empirical observations due to the complexity of computationally implementing them.
With the evolution of deep learning techniques, research shows that convolutional neural networks (CNNs) can achieve state-of-the-art performance in photo aesthetics assessment (Talebi and Milanfar, 2018, Lu et al., 2015a, Kong et al., 2016, Kao et al., 2015). Additionally, it has become evident that replacing hand-crafted features with an end-to-end automatic learning system for aesthetics-related tasks leads to more efficient solutions (Lu et al., 2015b, Lu et al., 2015a). Although a webpage is a special kind of image with differences in its composition, Dou et al. (2019) show that a CNN-based method significantly outperforms previous works that utilize hand-crafted features. However, research on automated website aesthetics evaluation is still limited and immature, since most previous works focus on providing qualitative design guidelines rather than quantitative methods for prediction.
Another point that should be stressed is that the majority of research on website aesthetics assessment exploits the dataset introduced in Reinecke and Gajos (2014). In that work, the authors conducted a survey in which webpages were displayed to users, who were asked to score each one on a scale of 1 to 9. Asking users to provide a numerical rating on a scale is a commonly used method of collecting data, e.g. assigning a book 1 to 5 stars or an art painting a score from 1 to 10. Although more straightforward, this approach has a number of limitations, which are discussed in the next paragraph. In contrast, research indicates that collecting data through pairwise comparisons is a more reliable alternative to numerical ratings and overcomes these limitations (Ammar and Shah, 2011, Perez-Ortiz and Mantiuk, 2017). In the pairwise comparison case, two items selected from a set are displayed at a time, and users are asked to choose the one they prefer.
Given that website aesthetics evaluation is a demanding process for non-expert survey participants, comparison-based data collection is a more appropriate direction for this task. First, users may perceive rating scales differently, which leads to calibration issues (Tsukida and Gupta, 2011, Ammar and Shah, 2011). For instance, users A and B might have derived exactly the same enjoyment from using a product, but user A’s score “3” might in practice correspond to user B’s score “4”. This does not apply to comparison-based data collection, since users who share the same impression cannot interpret the given task differently. Second, rating-based surveys demand more concentration from users, since they must remember their previous answers and use them as a reference. Users may forget their prior answers and mistakenly assign a higher or lower score to an item (Stewart et al., 2005), which may contradict their previous ratings. In contrast, pairwise comparisons are independent of each other and users do not have to remember their previous answers. During each comparison, the items that serve as reference are displayed, guaranteeing that all users work from the same reference and avoiding calibration issues. Third, pairwise comparisons are recognized as easier and faster for humans to make, as well as a more “natural” task for the human brain (Stewart et al., 2005, Barnett, 2003), which renders them better suited to non-expert survey participants (Perez-Ortiz and Mantiuk, 2017). Finally, pairwise comparisons provide higher sensitivity and lower measurement error than ratings (Shah et al., 2016).
In this work, we propose a novel method aiming to further bridge the gap between human–computer interaction research, web design practices and advancements in machine learning. We introduce “Calista”, a deep learning-based system capable of automatically evaluating the visual aesthetics of a website from the perspective of the average user. We address the limitations of previous works in two ways. First, we exploit the strength of convolutional neural networks, which process visual information and automatically extract semantically meaningful features. Second, we introduce a novel strategy to crowdsource pairwise comparison data on user preferences and incorporate them in the training pipeline. Our vision is to take the first steps towards providing designers and developers with a reliable tool to quantify the appeal of their web designs to users and customers, as well as to make web design processes more efficient in terms of time and cost. Our main contributions are:
- The development of efficient deep learning models for assessing website aesthetics that present high correlation with human perception. We explore different approaches and transfer learning settings using models pre-trained on tasks relevant to image aesthetics. We also propose a methodology to extend the results of models trained on rating-based data by utilizing comparison-based data.
- The development of a novel dataset that contains pairwise comparisons between webpages on their aesthetics. We have also analyzed the data and provide a full ranking of the webpages it contains, as well as a corresponding aesthetics score for each webpage.
- The development of an open-source web application which can be used to collect pairwise image comparisons through crowdsourcing. Although we utilize this application for website aesthetics assessment in our work, it can be easily configured for image comparison tasks in general. It can also be used as a cost-free alternative to paid services for data collection, such as Amazon Mechanical Turk (MTurk).
- Extended experimentation to evaluate our models and compare the results with previous works. We also test the efficiency of our models on real webpage examples and provide a comparison between rating-based and comparison-based approaches.
- The development of a web application that can be used by web designers and developers to assess the aesthetics of their web designs. The only input required for assessment is the webpage’s URL. In this way, we make our deployed, ready-to-use models accessible for public use.
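To illustrate how a set of pairwise comparisons can be turned into a full ranking with a score per webpage, as the dataset contribution above describes, the following is a minimal sketch that fits a Bradley-Terry model with minorization-maximization (MM) updates. It is an assumption that this model applies here: the reference list mentions Bradley-Terry models, but this excerpt does not state the exact procedure the authors used. The function name `bradley_terry`, the page identifiers and the demo data are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Illustrative sketch only; not confirmed to be the paper's method.
    An item that never wins receives a strength of 0.
    """
    wins = defaultdict(float)         # total number of wins per item
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    items = set()
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        items.update((winner, loser))
    items = sorted(items)

    # Start from a uniform strength vector that sums to 1.
    strength = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for j in items:
                if j == i:
                    continue
                n_ij = pair_counts.get(frozenset((i, j)), 0.0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        # Renormalize so the strengths remain comparable across iterations.
        total = sum(updated.values())
        updated = {i: v / total for i, v in updated.items()}
        delta = max(abs(updated[i] - strength[i]) for i in items)
        strength = updated
        if delta < tol:
            break
    return strength


# Hypothetical votes: each tuple is (preferred page, other page).
demo_comparisons = (
    [("page_a", "page_b")] * 8 + [("page_b", "page_a")] * 2
    + [("page_b", "page_c")] * 7 + [("page_c", "page_b")] * 3
    + [("page_a", "page_c")] * 9 + [("page_c", "page_a")] * 1
)
scores = bradley_terry(demo_comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Sorting the estimated strengths gives the ranking, and the strengths themselves can serve as relative aesthetics scores. Because each comparison contributes only a winner and a loser, no per-user rating scale needs to be calibrated.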
The rest of the paper is organized as follows. In Section 2, we present related work on computational website and photo aesthetics. In Section 3, we discuss the datasets and the methods that we implemented, while in Section 4, we present the results from the experiments. Finally, in Section 5, we conclude the paper and propose suggestions for future work. All the datasets, the source code for training the models and for the applications discussed in this paper are publicly available at https://github.com/calista-ai.
Website aesthetics evaluation with hand-crafted features
Early research works on website aesthetics focus on identifying which aspects of a website mostly influence its aesthetics. Based on general guidelines on user interface and web page design, these works define heuristic metrics in an attempt to quantify aesthetic characteristics.
Michailidou et al. (2008) investigate the relationship between the users’ perception of visual complexity, the number of structural elements of the webpage (links, images, words and sections) and some aesthetic
Methodology
Our methodology comprises three main steps. In the first step, we utilize a pre-existing rating-based dataset that contains webpages accompanied by numerical user ratings. Based on this dataset, we provide a detailed exploration of three different approaches to train convolutional neural networks that successfully present high correlation with users’ perception of website aesthetics. We also explore various transfer learning settings to improve the models’ performance. In the second step,
Evaluation metrics
We use the Pearson Correlation Coefficient (PCC) to evaluate the performance of our models. Our primary goal is to succeed in providing predictions that present a strong and positive correlation with user ratings. We also calculate the p-values and the 95% confidence intervals, following the process described in Bonett and Wright (2000). A p-value of less than 0.05 is considered to be statistically significant (Swinscow and Campbell, 2002). We note that all the PCC values reported in this paper
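As a rough illustration of this metric, the sketch below computes the PCC between model predictions and user ratings together with a 95% confidence interval. It assumes the standard Fisher z-transformation interval with standard error 1/sqrt(n - 3); the exact process of Bonett and Wright (2000) is not reproduced in this excerpt, so this choice is an assumption. The function name `pearson_with_ci` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean


def pearson_with_ci(x, y, confidence=0.95):
    """Return (r, lower, upper): Pearson's r and a Fisher-z confidence interval.

    Illustrative sketch only; assumes the usual Fisher z-transformation.
    """
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need two equal-length samples with at least 4 points")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    r = max(-1.0, min(1.0, sxy / math.sqrt(sxx * syy)))
    if abs(r) == 1.0:
        return r, r, r  # atanh is undefined at +/-1; the interval degenerates
    z = math.atanh(r)                  # Fisher z-transformation
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return r, math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical mean user ratings and model predictions for six webpages.
user_ratings = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predictions = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
r, lower, upper = pearson_with_ci(user_ratings, predictions)
print(f"PCC = {r:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```

A p-value is not computed here; if SciPy is available, `scipy.stats.pearsonr` returns both the coefficient and a p-value.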
Conclusions and future work
In this work, we have presented a series of deep learning methods which can automatically assess website aesthetics from the perspective of the average user. Our experiments suggest that our models succeed in predicting aesthetics scores which are highly correlated with the human preferences on webpages. We also focused on the significance of the data collection method since the evaluation of website aesthetics, as well as image aesthetics in general, from non-expert users is a difficult task
CRediT authorship contribution statement
Alexandros Delitzas: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. Kyriakos C. Chatzidimitriou: Conceptualization, Resources, Software, Supervision, Validation, Writing – review & editing. Andreas L. Symeonidis: Conceptualization, Project administration, Resources, Supervision, Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (54)
- et al. Colour appeal in website design within and across cultures: A multi-method evaluation. Int. J. Hum.-Comput. Stud. (2010)
- et al. Webthetics: Quantifying webpage aesthetics with deep learning. Int. J. Hum.-Comput. Stud. (2019)
- et al. What is this evasive beast we call user satisfaction? Interact. Comput. (2003)
- et al. Automated prediction of visual complexity of web pages: Tools and evaluations. Int. J. Hum.-Comput. Stud. (2021)
- et al. Homepage aesthetics: The search for preference factors and the challenges of subjectivity. Interact. Comput. (2006)
- et al. Critical factors for the aesthetic fidelity of web pages: empirical studies with professional web designers and users. Interact. Comput. (2004)
- et al. Aesthetics and credibility in web site design. Inf. Process. Manage. (2008)
- et al. Evaluating the consistency of immediate aesthetic perceptions of web pages. Int. J. Hum.-Comput. Stud. (2006)
- et al. Visual complexity of websites: Effects on users’ experience, physiology, performance, and memory. Int. J. Hum.-Comput. Stud. (2009)
- et al. The role of visual complexity and prototypicality regarding first impression of websites: Working towards understanding aesthetic judgments. Int. J. Hum.-Comput. Stud. (2012)
- Ranking: Compare, don’t score
- The modern theory of consumer behavior: Ordinal or cardinal? Q. J. Austrian Econ. (2003)
- Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika (2000)
- Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika (1952)
- Studying aesthetics in photographic images using a computational approach
- calista-ai/website-aesthetics-datasets: First Release (2021)
- DeCAF: A deep convolutional activation feature for generic visual recognition (2013)
- The Essential Guide To User Interface Design: An Introduction To GUI Design Principles and Techniques (2007)
- Towards a theory of user judgment of aesthetics and user interface quality. ACM Trans. Comput.-Hum. Interact. (2008)
- Squared earth mover’s distance-based loss for training deep neural networks (2017)
- MobileNets: Efficient convolutional neural networks for mobile vision applications (2017)
- MM algorithms for generalized Bradley-Terry models. Ann. Statist. (2004)
- Empirically validated web page design metrics
- Visual aesthetic quality assessment with a regression model
- Recognizing image style (2014)
- The design of high-level features for photo quality assessment
- A novel approach for website aesthetic evaluation based on convolutional neural networks
© 2023 Elsevier Ltd. All rights reserved.
