Calista: A deep learning-based system for understanding and evaluating website aesthetics

doi:10.1016/j.ijhcs.2023.103019

International Journal of Human-Computer Studies

Volume 175, July 2023, 103019

Highlights

• A deep learning-based tool for automatic evaluation of website aesthetics is proposed.
• A pairwise comparison data collection mechanism via crowdsourcing is presented.
• Deep learning models that display high correlation to human perception are developed.
• A novel pairwise comparison dataset for performing website aesthetics analytics is introduced.
• Comparison-based and rating-based data are employed to improve the models’ efficiency.

Abstract

Website aesthetics play an important role in attracting users and customers, as well as in enhancing user experience. In this work, we propose a tool that performs automatic evaluation of website aesthetics using deep learning models that display high correlation to human perception. These models were developed using two different datasets. The first dataset was created by employing a rating-based ranking approach and contains user judgments on websites in the form of an explicit numerical value on a scale. Using the first dataset, we developed models following three different approaches and managed to outperform previous works. In addition, we created a new dataset by employing a comparison-based ranking approach, which is a more reliable dataset in the sense that it follows a more “natural” data collection method. In this case, users were asked to compare two websites at a time and choose which is more attractive. Data collection was performed via a web application especially designed and developed for this purpose. In the experiments conducted, we evaluated each model and compared the two data collection methods. This work aims to illustrate the effectiveness of deep learning as a solution to the problem as well as to highlight the importance of comparison-based ranking in order to achieve reliable results. In order to further promote our work, we also developed a tool that scores the aesthetics of a website, simply by providing the website URL. We argue that such a tool will serve as a reliable guide in the hands of designers and developers during the design process.

Introduction

The Internet and the Software as a Service industry have become pervasive in all human activity: finding information, marketing and advertising, entertainment and social networking, among others. Studies (Lindgaard et al., 2006, Tractinsky et al., 2006, Tuch et al., 2012) have shown that users form a lasting impression of a website’s appeal within a split second. This first impression is based on how visually appealing the user finds what is displayed, and it has a great impact on the user’s perception of the website’s quality and usability (Hartmann et al., 2008). A good first impression can greatly increase trust in the website and its content (Michailidou et al., 2008, Lindgaard et al., 2011, Reinecke et al., 2013) and decrease bounce rates (Robins and Holmes, 2008). For an e-commerce website, this could mean increased conversion rates, while for a news portal it could mean an increase in daily visits. Thus, the aesthetics of a website is a very important trait that should be taken seriously during the design process (Lindgaard et al., 2011, Lindgaard and Dudek, 2003).

Aesthetics is a topic of major interest within the field of human–computer interaction and especially in the context of webpage design (Tractinsky, 2013). When designing a website, having a good understanding of what users perceive as visually appealing is of utmost significance. Web designers follow some guidelines that have prevailed (Galitz, 2007), but the final outcome is subjective in terms of aesthetics and may differ from the preferences of the user target group (Park et al., 2004). This happens because the designers’ work is influenced by their personal experience and taste. Companies deal with the designer bias problem by getting feedback through user groups and questionnaires. However, this procedure is costly and time-consuming, since enough data must be collected and analyzed. As a consequence, a growing need emerges for tools which can assess website aesthetics automatically.

A number of previous works have tried to quantify website aesthetics by proposing various aesthetics-relevant hand-crafted features, inspired by the photography or psychology literature. For instance, Wu et al. (2011) generate manual features by considering some webpage aspects such as the layout’s block size and color characteristics (e.g. hue, brightness). However, such approaches exhibit some known limitations (Lu et al., 2014). First, they are manually defined, thus a possibility exists that some features are not taken into account. Second, these hand-crafted features comprise mere approximations of psychological rules and empirical observations due to the complexity of computationally implementing them.

With the evolution of deep learning techniques, research shows that convolutional neural networks (CNNs) can achieve state-of-the-art performance in photo aesthetics assessment (Talebi and Milanfar, 2018, Lu et al., 2015a, Kong et al., 2016, Kao et al., 2015). Additionally, it has become evident that replacing hand-crafted features with an end-to-end automatic learning system for aesthetics-related tasks leads to more efficient solutions (Lu et al., 2015b, Lu et al., 2015a). Although a webpage is a special kind of image with differences in its composition, Dou et al. (2019) show that a CNN-based method significantly outperforms previous works that utilize hand-crafted features. However, research on automated website aesthetics evaluation is still limited and immature, since most previous works focus on providing qualitative design guidelines rather than quantitative methods for prediction.

Another point that should be stressed is that the majority of research on website aesthetics assessment exploits the dataset introduced in Reinecke and Gajos (2014). In that work, the authors conducted a survey in which a webpage was displayed to users, who were asked to score it on a scale of 1 to 9. Asking users to provide a numerical rating on a scale is a commonly used method to collect data, e.g. assigning 1 to 5 stars to a book or a score from 1 to 10 to a painting. Although more straightforward, this approach has a number of limitations, which are discussed in the next paragraph. In contrast, research indicates that collecting data through pairwise comparisons is a more reliable alternative to numerical ratings and overcomes these limitations (Ammar and Shah, 2011, Perez-Ortiz and Mantiuk, 2017). In the pairwise comparison case, two items selected from a set of items are displayed at a time, and the users are asked to choose the one they prefer.

Given that website aesthetics evaluation is a demanding process for non-expert survey participants, comparison-based data collection is a more appropriate direction for this task. First, users may perceive rating scores differently, which leads to calibration issues (Tsukida and Gupta, 2011, Ammar and Shah, 2011). For instance, users A and B might have had exactly the same enjoyment out of using a product, but user A’s score “3” might practically correspond to user B’s score “4”. This does not apply to the comparison-based data collection method, since users who share the same impression cannot interpret the given task differently. Second, rating-based surveys demand more concentration from users, since they must remember their previous answers and use them as reference. Users may forget their prior answers and mistakenly assign a higher or lower score to an item (Stewart et al., 2005), which may contradict their previous ratings. In contrast, pairwise comparisons are independent of each other and users do not have to remember their previous answers. During each comparison, the items that serve as reference are displayed, which guarantees that users use the same reference and avoids calibration issues. Third, pairwise comparisons are recognized as easier and faster for humans to make, as well as a more “natural” task for the human brain (Stewart et al., 2005, Barnett, 2003), which renders them better suited to non-expert survey participants (Perez-Ortiz and Mantiuk, 2017). Finally, pairwise comparisons provide higher sensitivity and lower measurement error in comparison with ratings (Shah et al., 2016).
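The calibration argument above can be illustrated with a toy simulation. The following is a minimal sketch under stated assumptions, not part of the original study: the latent appeal values, the page names and the per-user rating offset are all hypothetical, and the offset is modeled as a simple additive shift. Two users who share the same impression produce different ratings, yet identical pairwise choices, because the shift cancels out within each comparison.

```python
from itertools import combinations

# Hypothetical latent appeal of three webpages (illustrative values only).
latent = {"site_a": 0.2, "site_b": 0.5, "site_c": 0.9}


def rate(latent_value, offset, scale=9):
    """Map a latent appeal in [0, 1] to a 1..scale rating, shifted by a per-user offset."""
    raw = 1 + latent_value * (scale - 1) + offset
    return int(min(scale, max(1, round(raw))))


def compare(left, right, offset):
    """Return the preferred page; the same offset applies to both sides and cancels."""
    return left if latent[left] + offset > latent[right] + offset else right


# Two users with identical impressions but different personal rating offsets.
ratings_user_a = {page: rate(value, offset=-1) for page, value in latent.items()}
ratings_user_b = {page: rate(value, offset=+1) for page, value in latent.items()}

pairs = list(combinations(latent, 2))
choices_user_a = [compare(left, right, offset=-1) for left, right in pairs]
choices_user_b = [compare(left, right, offset=+1) for left, right in pairs]

print(ratings_user_a, ratings_user_b)   # ratings disagree between the two users
print(choices_user_a == choices_user_b)  # pairwise choices agree
```

Real rating behavior is likely more complex than an additive shift (for example, users may also differ in how much of the scale they use), so this sketch only captures the simplest form of the calibration problem.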

In this work, we propose a novel method aiming to further bridge the gap between human–computer interaction research, web design practices and advancements in machine learning. We introduce “Calista”, a deep learning-based system capable of automatically evaluating the visual aesthetics of a website from the perspective of the average user. We address the limitations of previous works in two ways. First, we exploit the strength of convolutional neural networks, which process visual information and automatically extract semantically meaningful features. Second, we introduce a novel strategy to crowdsource pairwise comparison data on user preferences and incorporate them in the training pipeline. Our vision through this work is to take the first steps towards providing designers and developers with a reliable tool to quantify the appeal of their web designs to users and customers, as well as to make web design processes more time- and cost-efficient. Our main contributions are:

• The development of efficient deep learning models for assessing website aesthetics that present high correlation to human perception. We explore different approaches and transfer learning settings using models pre-trained on tasks relevant to image aesthetics. We also propose a methodology to extend the results of models trained on rating-based data by utilizing comparison-based data.
• The development of a novel dataset that contains pairwise comparisons between webpages on their aesthetics. We have also analyzed the data and provide a full ranking of the webpages it contains, as well as a corresponding aesthetics score for each webpage.
• The development of an open-source web application¹ which can be used to collect pairwise image comparisons through crowdsourcing. Although we utilize this application for website aesthetics assessment in our work, it can be easily configured for image comparison tasks in general. It can also be used as a cost-free alternative to paid services for data collection, such as the Amazon Mechanical Turk (MTurk).
• Extended experimentation to evaluate our models and compare the results with previous works. We also test the efficiency of our models on real webpage examples and provide a comparison between rating-based and comparison-based approaches.
• The development of a web application² that can be used by web designers and developers to assess their web designs’ aesthetics. The only input required for assessment is the webpage’s URL. In this way, we render our deployed ready-to-use models accessible for public use.
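The second contribution mentions deriving a full ranking and a per-webpage score from pairwise comparisons. The text shown here does not state which procedure was used; since the reference list includes Bradley and Terry (1952) and Hunter (2004), the following is a minimal sketch that assumes a Bradley-Terry model fitted with the standard minorization-maximization (MM) update. The function name `bradley_terry`, its parameters and the toy comparison data are hypothetical.

```python
def bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs using MM updates."""
    items = sorted({page for pair in comparisons for page in pair})
    if not items:
        raise ValueError("at least one comparison is required")

    wins = {page: 0 for page in items}   # W_i: total wins of item i
    n_pair = {}                          # n_ij: number of comparisons between i and j
    for winner, loser in comparisons:
        wins[winner] += 1
        key = frozenset((winner, loser))
        n_pair[key] = n_pair.get(key, 0) + 1

    strength = {page: 1.0 / len(items) for page in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in items:
                count = n_pair.get(frozenset((i, j)), 0)
                if j != i and count > 0:
                    denom += count / (strength[i] + strength[j])
            updated[i] = wins[i] / denom if denom > 0 else strength[i]
        total = sum(updated.values())
        updated = {page: value / total for page, value in updated.items()}
        converged = max(abs(updated[p] - strength[p]) for p in items) < tol
        strength = updated
        if converged:
            break
    return strength


# Toy data (hypothetical): each pair is compared four times.
comparisons = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 3 + [("c", "b")]
    + [("a", "c")] * 3 + [("c", "a")]
)
scores = bradley_terry(comparisons)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking, scores)
```

The normalized strengths can serve as aesthetics scores and their order as the full ranking. An item that never wins a comparison is assigned a strength of zero in this sketch; a practical implementation would likely add regularization for such cases.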

The rest of the paper is organized as follows. In Section 2, we present related work on computational website and photo aesthetics. In Section 3, we discuss the datasets and the methods that we implemented, while in Section 4, we present the results from the experiments. Finally, in Section 5, we conclude the paper and propose suggestions for future work. All the datasets, the source code for training the models and for the applications discussed in this paper are publicly available at https://github.com/calista-ai.

Section snippets

Website aesthetics evaluation with hand-crafted features

Early research works on website aesthetics focus on identifying which aspects of a website mostly influence its aesthetics. Based on general guidelines on user interface and web page design, these works define heuristic metrics in an attempt to quantify aesthetic characteristics.

Michailidou et al. (2008) investigate the relationship between the users’ perception of visual complexity, the number of structural elements of the webpage (links, images, words and sections) and some aesthetic

Methodology

Our methodology comprises three main steps. In the first step, we utilize a pre-existing rating-based dataset that contains webpages accompanied by numerical user ratings. Based on this dataset, we provide a detailed exploration of three different approaches to train convolutional neural networks that successfully present high correlation with users’ perception of website aesthetics. We also explore various transfer learning settings to improve the models’ performance. In the second step,

Evaluation metrics

We use the Pearson Correlation Coefficient (PCC) to evaluate the performance of our models. Our primary goal is to succeed in providing predictions that present a strong and positive correlation with user ratings. We also calculate the p-values and the 95% confidence intervals, following the process described in Bonett and Wright (2000). A p-value of less than 0.05 is considered to be statistically significant (Swinscow and Campbell, 2002). We note that all the PCC values reported in this paper
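The evaluation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it computes the PCC, a confidence interval via the Fisher z-transformation with standard error 1/sqrt(n - 3), and, as an assumption, a two-sided p-value from the normal approximation of the transformed coefficient (the snippet does not state how the p-values were obtained). The function name `pearson_with_ci` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_with_ci(x, y, confidence=0.95):
    """Return (r, (low, high), p_value) for two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need two sequences of equal length with at least 4 values")

    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    r = cov / math.sqrt(var_x * var_y)

    # Fisher z-transformation; clamp r so that atanh stays finite for |r| == 1.
    z = math.atanh(max(-0.999999, min(0.999999, r)))
    se = 1.0 / math.sqrt(n - 3)
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    low, high = math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

    # Assumption: two-sided p-value for H0 (rho = 0) via the normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z) / se))
    return r, (low, high), p_value


# Hypothetical model predictions and mean user ratings for eight webpages.
predicted = [1, 2, 3, 4, 5, 6, 7, 8]
human = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
r, (low, high), p_value = pearson_with_ci(predicted, human)
print(r, (low, high), p_value)
```

For small samples, an exact test based on the t-distribution would usually be preferred over the normal approximation used for the p-value here.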

Conclusions and future work

In this work, we have presented a series of deep learning methods which can automatically assess website aesthetics from the perspective of the average user. Our experiments suggest that our models succeed in predicting aesthetics scores which are highly correlated with the human preferences on webpages. We also focused on the significance of the data collection method since the evaluation of website aesthetics, as well as image aesthetics in general, from non-expert users is a difficult task

CRediT authorship contribution statement

Alexandros Delitzas: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization, Writing – original draft, Writing – review & editing. Kyriakos C. Chatzidimitriou: Conceptualization, Resources, Software, Supervision, Validation, Writing – review & editing. Andreas L. Symeonidis: Conceptualization, Project administration, Resources, Supervision, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (54)

Cyr, D., et al. (2010). Colour appeal in website design within and across cultures: A multi-method evaluation. Int. J. Hum.-Comput. Stud.
Dou, Q., et al. (2019). Webthetics: Quantifying webpage aesthetics with deep learning. Int. J. Hum.-Comput. Stud.
Lindgaard, G., et al. (2003). What is this evasive beast we call user satisfaction? Interact. Comput.
Michailidou, E., et al. (2021). Automated prediction of visual complexity of web pages: Tools and evaluations. Int. J. Hum.-Comput. Stud.
Pandir, M., et al. (2006). Homepage aesthetics: The search for preference factors and the challenges of subjectivity. Interact. Comput.
Park, S.-e., et al. (2004). Critical factors for the aesthetic fidelity of web pages: empirical studies with professional web designers and users. Interact. Comput.
Robins, D., et al. (2008). Aesthetics and credibility in web site design. Inf. Process. Manage.
Tractinsky, N., et al. (2006). Evaluating the consistency of immediate aesthetic perceptions of web pages. Int. J. Hum.-Comput. Stud.
Tuch, A.N., et al. (2009). Visual complexity of websites: Effects on users’ experience, physiology, performance, and memory. Int. J. Hum.-Comput. Stud.
Tuch, A.N., et al. (2012). The role of visual complexity and prototypicality regarding first impression of websites: Working towards understanding aesthetic judgments. Int. J. Hum.-Comput. Stud.
Ammar, A., et al. Ranking: Compare, don’t score.
Barnett, W. (2003). The modern theory of consumer behavior: Ordinal or cardinal? Q. J. Austrian Econ.
Bonett, D., et al. (2000). Sample size requirements for estimating Pearson, Kendall and Spearman correlations. Psychometrika.
Bradley, R.A., et al. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika.
Datta, R., et al. Studying aesthetics in photographic images using a computational approach.
Delitzas, A. (2021). calista-ai/website-aesthetics-datasets: First Release.
Donahue, J., et al. (2013). DeCAF: A deep convolutional activation feature for generic visual recognition.
Galitz, W.O. (2007). The Essential Guide To User Interface Design: An Introduction To GUI Design Principles and Techniques.
Hartmann, J., et al. (2008). Towards a theory of user judgment of aesthetics and user interface quality. ACM Trans. Comput.-Hum. Interact.
Hou, L., et al. (2017). Squared earth mover’s distance-based loss for training deep neural networks.
Howard, A.G., et al. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications.
Hunter, D.R. (2004). MM algorithms for generalized Bradley-Terry models. Ann. Statist.
Ivory, M.Y., et al. Empirically validated web page design metrics.
Kao, Y., et al. Visual aesthetic quality assessment with a regression model.
Karayev, S., et al. (2014). Recognizing image style.
Ke, Y., et al. The design of high-level features for photo quality assessment.
Khani, M.G., et al. A novel approach for website aesthetic evaluation based on convolutional neural networks.

