HTML: Hierarchical Transformer-based Multi-task Learning for Volatility Prediction

The volatility forecasting task refers to predicting the amount of variability in the price of a financial asset over a certain period. It is an important mechanism for evaluating the risk associated with an asset and, as such, is of significant theoretical and practical importance in financial analysis. While classical approaches have framed this task as a time-series prediction problem, using historical pricing as a guide to future risk forecasting, recent advances in natural language processing have seen researchers turn to complementary sources of data, such as analyst reports, social media, and even the audio data from earnings calls. This paper proposes a novel hierarchical, transformer-based, multi-task architecture designed to harness the text and audio data from quarterly earnings conference calls to predict future price volatility in the short and long term. We include a comprehensive comparison to a variety of baselines, which demonstrates improvements in prediction accuracy in the range of 17% to 49% compared to the current state-of-the-art. In addition, we describe the results of an ablation study to evaluate the relative contributions of each component of our approach and the relative contributions of text and audio data with respect to prediction accuracy.


INTRODUCTION
Predicting the degree of variability in the price of a financial asset over a certain period (the so-called volatility of the asset) is an important financial analysis task. Price volatility is generally considered to be a useful proxy for the level of risk associated with an asset and thus it plays an important role in assessing financial market risk and the pricing of financial derivatives. As a result, developing effective techniques for predicting price volatility has become increasingly important among academics and practitioners.
To a large extent, past research efforts have focused on the use of time-series modeling and prediction techniques using historical pricing data [33,41,66]. However, with recent advances in natural language processing (NLP) it has become possible to harness novel sources of data during the prediction process, from unstructured textual data in the form of financial news [16,63,65] and financial reports [23,32,47] to real-time social media [7,43,60,62]. Of particular relevance to this work is the information contained in earnings call transcripts [30,46,56], which typically accompany the earnings reports of publicly traded companies. Generally speaking, these are conference calls in which company executives discuss the latest results, offer guidance on their expectations for the coming year, and provide investors and analysts with an opportunity to ask questions. The information conveyed during a conference call, and particularly during the subsequent question-and-answer session with investors and analysts (which executives cannot fully prepare in advance), can provide new insight into the current state of the company and its future prospects, which, in turn, changes investor perception of firm risk (price volatility). Indeed, recent work has shown that not only the text of the call can be useful [5,25,34], but also the vocal content and features contained within the call audio [24,42,46].
This work seeks to build on this recent research to further explore the utility of including textual and audio data from earnings calls for volatility forecasting. An overview of the proposed method is given in Figure 1. The primary technical contributions include a description and evaluation of a novel, deep-learning architecture for this task: Figure 2 presents our Hierarchical, Transformer-based, Multi-Task (HTML) model, which combines a hierarchical transformer [55] with multi-task learning [39]. Hierarchical transformer models [14,37] have proven useful in many sequence-to-sequence learning tasks, including machine translation and text summarisation, and the approach is used here to extract the text features from call transcripts that will be used as inputs to the multi-task learner. Following [46], we extract 27 vocal features including pitch, intensity, jitter, and the harmonic-to-noise ratio using Praat [6]. The audio and text features are combined in the information fusion layer to provide input features for the multi-task learner. Multi-task learning, which exploits similarities and differences between related, simultaneous learning tasks, is used because it has proven to be successful when it comes to controlling for overfitting and improving generalisation [10]. Here we simultaneously learn models to predict: (1) average n-day volatility (that is, the volatility of the following n days); and (2) single-day volatility (that is, the volatility on a single day, n days in the future).

Figure 1: An overview of the proposed framework. For a given earnings call, the text and audio features are extracted from transcripts and audio records, respectively, and the resulting features are used as input features for a multi-task learner.
The remainder of this paper is organized as follows. In the next section we summarise relevant related work, focusing in particular on the volatility prediction task, hierarchical models, multi-task learning, and multimedia information fusion. Section 3 presents a problem formulation before describing our proposed approach in detail in Section 4. Before concluding, in Sections 5 and 6 we present the results of a detailed evaluation using a benchmark dataset and in comparison to a number of state-of-the-art baseline techniques. The results of this evaluation demonstrate clear and significant prediction accuracy benefits accruing to our proposed approach, with accuracy improvements in the range of 17% to 49% compared to the current state-of-the-art. Moreover, a detailed ablation study further clarifies the relative contributions of each model component and data source to overall prediction accuracy. We believe that these results establish this as a new performance benchmark for volatility forecasting.

RELATED WORK
This paper brings together a number of different ideas (volatility prediction, hierarchical models, multi-task learning, and multimedia information fusion), and in what follows we briefly summarise the relevant state-of-the-art in each of these areas, as it relates to the present work.

Volatility Prediction
Volatility modeling and prediction is of interest to researchers because of its theoretical importance and its practical applications. Conventional approaches [33,41,66] rely on historical pricing data and typically use continuous time-series models (local and stochastic volatility [12,22,26,28,49,50]) and discrete time-series models (e.g. GARCH models [8,17]).
Recently, research attention has focused on additional sources of volatility information. Significant improvements in NLP methods, many applied to sources of financial information [7,15,16,23,32,43,47,60,62,65], demonstrate how mining financial news, analyst reports, earnings reports, and social media has the potential to improve many financial prediction tasks by harnessing powerful new features that are absent from traditional time-series data. Moreover, features derived from the audio of earnings calls have also proven to be useful for volatility prediction. For example, [46] incorporates the CEO's vocal features, such as emotions and voice tones in earnings conference calls, when predicting the volatility of a stock using a multimodal deep regression model. Modeling the textual and vocal information contained in a conference call resulted in a substantial improvement in volatility prediction accuracy compared to classical methods.
The work in [46] is especially relevant here, as it provides both a starting point for this work and the best available baseline against which to evaluate our progress. We argue that the model proposed by [46] does not sufficiently investigate the power of both verbal and vocal information and that it fails to fully exploit the interaction between the text and audio information. The improvements in the present work derive from three aspects. First, we show how enriched textual and audio data can be extracted from call data using co-evolutionary methods. Second, we demonstrate how the use of a pre-trained language model and hierarchical features can greatly improve the representations used for learning and prediction. Finally, a key novelty of the present work stems from the way in which textual and audio features are integrated for multi-task learning, which, as we shall see later, leads to significant prediction benefits.

Hierarchical, Multi-Task Learning
Hierarchical learning techniques have recently proved to be successful across a variety of NLP tasks. Hierarchical attention networks were first proposed by [64] as a way to generate richer and more powerful natural language representations, and since then they have been applied to good effect in tasks such as document classification, relation extraction, and machine translation. More recently, further technical improvements in hierarchical architectures based on transformers have been developed for tasks such as automatic text summarization [18,37]. This suggests that similar techniques might prove to be useful when it comes to extracting textual features from earnings call transcripts.
Multi-task learners solve multiple learning tasks at the same time by exploiting commonalities and differences between the tasks. This provides an effective set of learning constraints that have been shown to reduce the risk of overfitting, improve generalizability, and improve the overall effectiveness of the learned models compared to the models produced by single-task learners using the same training data. Multi-task learning has shown particular promise in NLP [51,58,61] and speech recognition [13,52] tasks. The idea of combining hierarchical and multi-task learning, by using a hierarchical framework consisting of several relevant tasks as a joint multi-task learning model, was first proposed by [21]; see also the work of [48] on the use of a hierarchical architecture for learning word embeddings from semantic NLP tasks.
In this paper we propose a hierarchical, multi-task learning approach consisting of two financial forecasting tasks. The primary task involves predicting asset volatility over a given time period (number of days), while our secondary task involves predicting asset volatility for a single day. Our intuition is that this multi-task learning framework will improve prediction performance by reducing the representation bias of our model, and, to the best of our knowledge, this is the first time that a hierarchical, multi-task transformer has been used for volatility prediction.

Multi-modal Information Fusion
In this work we focus on learning from different types of data, namely text and audio, which has often proven difficult in the past because of the challenges associated with combining fundamentally different features. However, recent progress in deep learning research has led to significant improvement in similar multi-modal learning tasks, whereby high-level embeddings from different types of data are integrated via a deep neural network [40]. For instance, the Vision-and-Language BERT (ViLBERT) [38] learns task-agnostic joint representations of image and natural language content. Elsewhere, related ideas have been used to combine text and image information for multi-modal review generation [53]. The interaction between text and audio data in a multi-modal learning framework has also been the subject of recent studies in speech communication, in which acoustic features have been shown to be highly correlated with emotion [2], trustworthiness [4], and confidence [27].
To date the use of audio data sources has been all but absent from financial applications, with the exception of [46]. Given the effectiveness of recent multi-modal approaches, and the availability of task-relevant text and audio data for volatility forecasting, it is clear that these techniques warrant further consideration, hence the approach taken in the present work.

MEASURING ASSET VOLATILITY
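We formulate the volatility forecasting problem as a multivariate regression task, with textual and audio data as raw inputs, and with an n-day volatility prediction (that is, the predicted average volatility over the following n days) and a single-day volatility prediction for day n as the dual prediction outputs. The average log volatility over n days is defined as

$$v_{[0,n]} = \ln\left(\sqrt{\frac{\sum_{i=1}^{n}(r_i - \bar{r})^2}{n}}\right) \tag{1}$$

In Equation 1, $r_i$ is the stock return on day $i$ and $\bar{r}$ is the average stock return in a window of $n$ days. The return is defined as $r_i = P_i / P_{i-1} - 1$, where $P_i$ is the dividend-adjusted closing price on day $i$. The single-day log volatility is estimated by the daily log absolute return,

$$v_n = \ln\left(|r_n|\right) \tag{2}$$

as in Equation 2, where $v_n$ can also be considered a noisy proxy of log volatility [9].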
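The two prediction targets can be illustrated with a short sketch. It assumes simple daily returns computed from closing prices and a population standard deviation (divisor n) for Equation 1; the function names and the price series are hypothetical and serve only as an illustration.

```python
import math

def simple_returns(prices):
    """Daily simple returns r_i = P_i / P_{i-1} - 1 from a list of closing prices."""
    return [p1 / p0 - 1.0 for p0, p1 in zip(prices, prices[1:])]

def average_log_volatility(returns):
    """Log of the standard deviation of the returns in the window (cf. Equation 1)."""
    n = len(returns)
    mean = sum(returns) / n
    variance = sum((r - mean) ** 2 for r in returns) / n
    return math.log(math.sqrt(variance))  # undefined if all returns are identical

def single_day_log_volatility(r):
    """Log absolute return on a single day (cf. Equation 2); undefined for r == 0."""
    return math.log(abs(r))

# Hypothetical closing prices: the call-day close followed by n = 3 trading days.
prices = [100.0, 102.0, 99.0, 101.0]
returns = simple_returns(prices)
v_avg = average_log_volatility(returns)              # main-task target
v_single = single_day_log_volatility(returns[-1])    # auxiliary-task target
```

Both targets are logarithms of small positive numbers, so they are typically negative; days with a zero return would need special handling in practice.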
Our multi-task learning objective is to simultaneously predict the two quantities $v_{[0,n]}$ and $v_n$ using our input data; predicting $v_{[0,n]}$ is our main task, while predicting $v_n$ is our auxiliary task.
Figure 2 summarizes the proposed HTML model, which contains four components: (1) token-level transformer encoder; (2) multimedia information fusion; (3) sentence-level transformer encoder; and (4) multi-task prediction. Briefly, text and audio features are first extracted from the raw text/audio call content: text tokens are extracted from the text data and encoded into vectors using a pre-trained language model, while 27 different audio features are extracted from the sentence-level audio clips using Praat [6], in line with the approach described by [46]. The resulting text and audio features are combined by the information fusion layer and used as input for the sentence-level transformer encoder, which generates an intermediate, multimodal representation to act as the input representation for the multi-task learner. The multi-task prediction layer generates the average and single-day volatility predictions based on the output of the sentence-level transformer encoder. A more detailed description of each of these components is presented in the following.

Token-level Transformer Encoder
The token-level transformer encoder consists of a Multi-Head Self-Attention Mechanism, a Residual Connections and Layer Normalization Layer, a Feed Forward Layer, and a Residual Connections and Layer Normalization Layer [55]. The training of the encoder involves two phases, namely pre-training and fine-tuning. The pre-training phase can be considered as a self-supervised step and is performed using the Whole Word Masking BERT (WWM-BERT) [14], in which WordPiece tokens belonging to the same word are masked jointly. WWM-BERT mitigates a drawback of the original implementation of BERT by explicitly forcing the model to predict a whole word instead of individual WordPiece tokens in the training task. The fine-tuning phase gently adjusts the pre-trained model using the output for our multi-task regression task. Since the pre-trained model already encodes much information about the language, the fine-tuning phase takes substantially less time compared to training the entire model from scratch. The steps in the two-phase training are illustrated in Figure 3.
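To make the idea of whole word masking concrete, the following is a minimal sketch rather than the WWM-BERT implementation. It assumes the usual WordPiece convention that a continuation piece starts with "##"; the function name, the masking probability, and the example tokens are hypothetical, and the refinements of BERT's masking scheme (such as occasionally keeping or randomly replacing a selected token) are omitted.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Mask WordPiece tokens so that all pieces of a selected word are masked together."""
    # Group token indices into words: a piece starting with "##" continues the previous word.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    rng = random.Random(seed)
    masked = list(tokens)
    for word in words:
        # One random draw per word, so a word is either fully masked or left intact.
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# Hypothetical tokenisation of "volatility rose".
example = whole_word_mask(["vol", "##ati", "##lity", "rose"], mask_prob=0.5, seed=1)
```

Because the random draw is made per word rather than per piece, the three pieces of "volatility" are always masked or kept as a unit, which is the property the pre-training objective relies on.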
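To describe the token-level transformer encoder in more detail, let $W_i = (w_i^1, w_i^2, \ldots, w_i^{|W_i|})$ denote a sentence, where $w_i^{|W_i|}$ is an artificial EOS (end of sentence) token. The word embedding matrix associated with sentence $W_i$ is initialized as

$$E_i = (e_i^1, e_i^2, \ldots, e_i^{|W_i|}), \qquad e_i^j = e(w_i^j) + p_j \tag{3}$$

Here $e(\cdot)$ maps each token to a $d$-dimensional vector using the WWM-BERT, and $p_j$ is the position embedding of the token $w_i^j$ with the same dimension $d$. Consequently, $e_i^j \in \mathbb{R}^d$ for all $j$. The calculation of the position embeddings is performed in the same manner as in [55]:

$$p_{j,2m} = \sin\left(j / 10000^{2m/d}\right) \tag{4}$$

$$p_{j,2m+1} = \cos\left(j / 10000^{2m/d}\right) \tag{5}$$

where $j$ is the position of the token and $m$ indexes the dimensions of the embedding.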
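The sinusoidal position embeddings of [55] can be sketched in a few lines. The sketch assumes an even embedding dimension d and positions counted from zero; the function names are hypothetical, and the token vectors stand in for the output of e(·).

```python
import math

def position_embedding(j, d):
    """Sinusoidal embedding of position j with even dimension d, following [55]."""
    p = [0.0] * d
    for m in range(d // 2):
        angle = j / (10000 ** (2 * m / d))
        p[2 * m] = math.sin(angle)      # even dimensions use the sine
        p[2 * m + 1] = math.cos(angle)  # odd dimensions use the cosine
    return p

def embed_sentence(token_vectors):
    """Add the position embedding p_j to each d-dimensional token vector."""
    d = len(token_vectors[0])
    return [
        [x + p for x, p in zip(vec, position_embedding(j, d))]
        for j, vec in enumerate(token_vectors)
    ]

# Two hypothetical 4-dimensional token vectors.
embedded = embed_sentence([[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]])
```

Each pair of dimensions oscillates at a different wavelength, so nearby positions receive similar embeddings while distant positions remain distinguishable.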
A sentence representation $T_i \in \mathbb{R}^{d_t}$ of the sentence $W_i$ is calculated by average pooling over the second-to-last layer of the network, a choice made on the basis of our experiments, where $d_t$ represents the default dimension of the word embeddings.

Multimedia Information Fusion
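The sentence representations and the corresponding audio features are then combined. An earnings call document is represented as

$$D^{(k)} = \left(T_1^k \oplus A_1^k + P_1,\; \ldots,\; T_M^k \oplus A_M^k + P_M\right) \tag{6}$$

Here $T_i^k$ and $A_i^k$ represent the sentence and audio features of sentence $i$ in document $D^{(k)} \in \mathbb{R}^{M \times d_s}$, $\oplus$ denotes concatenation, $P_i$ denotes the $i$-th row of the trainable sentence-level position embedding $P \in \mathbb{R}^{M \times d_s}$, and $M$ is the maximum number of sentences in any document.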
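A minimal sketch of the fusion step follows. It assumes that the text and audio features of each sentence are concatenated, that the position embedding is added element-wise, and that documents shorter than M sentences are zero-padded; all three are assumptions made for illustration, and the function name and toy vectors are hypothetical.

```python
def fuse_document(sentence_vecs, audio_vecs, position_vecs, max_sentences):
    """Build an M x d_s document matrix from per-sentence text and audio features."""
    d_s = len(sentence_vecs[0]) + len(audio_vecs[0])
    rows = []
    for t, a, p in zip(sentence_vecs, audio_vecs, position_vecs):
        fused = list(t) + list(a)                       # concatenate text and audio features
        rows.append([x + y for x, y in zip(fused, p)])  # add the sentence-level position embedding
    rows = rows[:max_sentences]                         # truncate overly long documents
    rows += [[0.0] * d_s for _ in range(max_sentences - len(rows))]  # zero-pad short ones
    return rows

# Two sentences with 2 text features and 1 audio feature each, padded to M = 3.
doc = fuse_document(
    sentence_vecs=[[1.0, 2.0], [3.0, 4.0]],
    audio_vecs=[[0.5], [0.25]],
    position_vecs=[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0]],
    max_sentences=3,
)
```

Under these assumptions the fused dimension is the sum of the text and audio dimensions, so with 27 audio features each row would be 27 entries longer than the sentence representation.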

Sentence-level Transformer Encoder
The sentence-level transformer encoder extracts sentence-level features for prediction. The architecture of this encoder is shown in Figure 4. In particular, the architecture consists of two layer normalization steps [1]:

$$\tilde{D} = \mathrm{LayerNorm}\left(D + \mathrm{MultiHead}(D)\right) \tag{7}$$

$$D^{\mathrm{out}} = \mathrm{LayerNorm}\left(\tilde{D} + \mathrm{MLP}(\tilde{D})\right) \tag{8}$$

where LayerNorm is the layer normalization introduced in [1], MLP denotes a two-layer feed-forward network with ReLU activation function, and MultiHead denotes the multi-head attention mechanism proposed in [55].
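The two normalization steps can be sketched for a single sentence vector as follows. This is a simplified illustration under several assumptions: layer normalization is applied without learnable gain and bias, the feed-forward network has no bias terms, and `attend` is a hypothetical stand-in for multi-head attention (which in the real model mixes information across all sentences of a document).

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def relu_mlp(x, w1, w2):
    """Two-layer feed-forward network with ReLU; w1 and w2 are lists of weight rows."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def encoder_block(x, attend, w1, w2):
    """Residual connection plus layer normalisation around attention, then around the MLP."""
    x_tilde = layer_norm([a + b for a, b in zip(x, attend(x))])
    return layer_norm([a + b for a, b in zip(x_tilde, relu_mlp(x_tilde, w1, w2))])

# Toy example: attention and MLP both contribute nothing, so only the normalisation acts.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [[0.0] * 3 for _ in range(3)]
out = encoder_block([1.0, 2.0, 3.0], lambda v: [0.0] * len(v), identity, zeros)
```

The residual connections mean that each sub-layer only has to learn a correction to its input, which is what makes stacking several such layers practical.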
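The multi-head attention applied to a document $D^{(k)}$ is calculated as follows:

$$\mathrm{MultiHead}(D) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^o \tag{9}$$

where

$$\mathrm{head}_i = \mathrm{Attention}\left(D W_i^Q,\; D W_i^K,\; D W_i^V\right) \tag{10}$$

where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_s \times d_s}$ are weight matrices, and the attention is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_s}}\right) V \tag{11}$$

for some input query, key and value matrices $Q, K, V \in \mathbb{R}^{M \times d_s}$. The $h$ outputs from the attention calculations are concatenated and transformed using an output weight matrix $W^o \in \mathbb{R}^{d_s h \times d_s}$.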
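The attention computation for a single head can be sketched in plain Python. The sketch assumes the standard scaled dot-product form of [55], with scaling by the square root of the key dimension; the per-head projections and the concatenation across heads are omitted for brevity, and the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K and V are lists of row vectors."""
    d = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))])
    return output

# A zero query scores every key equally, so the output is the mean of the values.
result = attention(Q=[[0.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 2.0], [3.0, 4.0]])
```

In the sentence-level encoder each row would correspond to one sentence, so the attention weights express how much every other sentence contributes to the representation of a given sentence.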

Multi-task Prediction Layer
The multi-task prediction layer consists of two separate single-layer feed-forward networks. Average pooling is first applied to the output of the sentence-level transformer encoder, and the resulting output is then fed into these two feed-forward networks. The objective function is a weighted average of the losses of the two prediction tasks, computed from the predicted values $\hat{y}_i$ and $\hat{y}_j$ for the main and auxiliary tasks, respectively, and the corresponding true volatilities $y_i$ and $y_j$. The weight $\alpha \in [0, 1]$ controls the importance of the auxiliary task and is tuned using the validation set. We use Adam [31] as the optimizer and decay the learning rate as the number of training steps increases, training the model until convergence.
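A weighted two-task objective of this kind might look as follows. This is only a sketch under explicit assumptions: each task uses a mean squared error, and α multiplies the auxiliary term while 1 − α multiplies the main term. The text does not settle which term α weights, and the opposite convention simply swaps the two weights. The function name and the numbers are hypothetical.

```python
def multitask_loss(main_pred, main_true, aux_pred, aux_true, alpha):
    """Weighted average of the squared-error losses of the main and auxiliary tasks."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    main = sum((p - y) ** 2 for p, y in zip(main_pred, main_true)) / len(main_true)
    aux = sum((p - y) ** 2 for p, y in zip(aux_pred, aux_true)) / len(aux_true)
    # Assumption: alpha weights the auxiliary (single-day) task.
    return (1.0 - alpha) * main + alpha * aux

loss = multitask_loss(
    main_pred=[1.0, 2.0], main_true=[0.0, 2.0],  # main-task loss is 0.5
    aux_pred=[3.0], aux_true=[1.0],              # auxiliary-task loss is 4.0
    alpha=0.5,
)
```

With α = 0 the model reduces to a single-task learner on the main task, which is the comparison the single-task versus multi-task ablation relies on.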

EVALUATION
We describe the dataset for our application and several baselines for the task of stock volatility prediction. A metric to assess and compare the performance of each method is also introduced.

Dataset
The dataset used in this paper is the public S&P 500 Earnings Conference Calls dataset used by [46]. It contains the audio records and the corresponding text transcripts from earnings calls for 500 large public companies traded on American stock exchanges (S&P 500) during 2017. The raw dataset contains 2,243 earnings conference calls from 2017. However, a large proportion of the raw data was discarded because the audio-text alignment is very noisy and prone to errors. This leaves 576 unique instances (earnings calls) in which the audio records are sufficiently closely aligned with the corresponding text transcripts. These 576 earnings calls correspond to 88,829 aligned sentences (text and audio). In addition to this call data we downloaded the dividend-adjusted closing prices needed for volatility prediction from Yahoo Finance. Also, the pre-trained WWM-BERT model is used to form a text representation for each input token, from which a sentence representation is obtained.

Baselines
We compare our approach to volatility prediction to a number of important baselines, chosen to reflect the range of approaches that have been applied to the volatility forecasting task, and also including the current state-of-the-art. These baselines can be grouped depending on whether they use historical pricing data [20,29,39,57] (classical approaches), more recent uses of textual data [54,64], or even more recent uses of multi-modal data [45,46]. In each case we outline several different baselines, which, to the best of our knowledge, collectively offer the best available volatility prediction methods at the time of writing.

Price-based baselines:
The following approaches all rely on historical pricing data only, as the basis of volatility prediction:
(1) Classical Methods: These include the GARCH model (a classical auto-regressive volatility prediction model) [19] and its variants [29]. These are among the most common approaches for volatility prediction. They are designed for short-term volatility prediction, and tend to be less effective when it comes to average (n-day) volatility prediction. Therefore, the more effective prediction results, corresponding to the ARCH, are reported here.
(2) LSTM [20]: Long short-term memory networks (LSTMs) are widely used in financial time series prediction. For volatility prediction, we choose a simple LSTM as a benchmark using the preceding, optimal, n-day historical volatility.

Figure 4: The sentence-level transformer mechanism. It contains a self-attention mechanism and multi-head attention.

(3) LSTM+ATT [57]: By incorporating an attention mechanism with an LSTM we can build a prediction model that can focus on specific periods of volatility in the training data, rather than treating the historical data uniformly.
(4) MT-LSTM+ATT [39]: This multi-task variation combines average n-day volatility (the primary prediction task) with single-day volatility prediction using attention-based LSTMs as the underlying learners.

Text-based baselines:
The following text-based approaches all rely on earnings call transcripts for volatility prediction. The baselines themselves reflect recent significant progress in this task and include the current state-of-the-art in volatility prediction tasks.
(1) SVR+RBF(TF-IDF) [54]: Following previous studies [54], Support Vector Regression (SVR) with a Radial Basis Function (RBF) kernel is adapted for stock volatility prediction, representing each instance as a vector of TF-IDF scores (Term Frequency-Inverse Document Frequency [59]) for each term in an earnings call transcript. The TF-IDF of a given term $t$ is computed from $tc_{d_i}(t)$, the number of occurrences of term $t$ in transcript $i$; $\lVert d_i \rVert$, the Euclidean norm of the term weights of the transcript; and $|d_i|$, the number of terms in the transcript.
(2) SVR+RBF(Glove): This baseline again uses SVR+RBF but with each transcript term mapped to a pre-trained Glove 300-dimensional embedding [44] so that the transcript is represented as a weighted average of the embeddings. The intuition is that this provides a richer transcript representation than using the raw terms.
(3) HAN (Glove) [64]: For this baseline, we use a Hierarchical Attention Network with two levels of attention mechanisms, which are applied at the word and sentence levels. Each word in a sentence is first converted to a word embedding using the pre-trained Glove 300-dimensional embeddings. Then each sentence, with its embedded words, is input into a Bi-GRU encoder [3,11], while another Bi-GRU encoder is used to represent each document as a sequence of sentences. The document representation is then passed to the final regression layer for predictions.
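As an illustration of the TF-IDF representation used by the first text-based baseline, the following sketch implements one common variant. It assumes a length-normalised term frequency, a logarithmic inverse document frequency, and a final Euclidean normalisation of each transcript vector; the exact weighting used in [54] may differ, and the function name and toy transcripts are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dictionary per tokenised transcript."""
    n_docs = len(docs)
    # Document frequency: in how many transcripts does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        weights = {
            t: (c / len(doc)) * math.log(n_docs / df[t])  # term frequency times IDF
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Scale to unit Euclidean length; an all-zero vector is left as zeros.
        vectors.append({t: (w / norm if norm > 0 else 0.0) for t, w in weights.items()})
    return vectors

# Two toy transcripts; "growth" occurs in both and therefore carries no weight.
vectors = tfidf_vectors([["growth", "risk"], ["growth", "margin"]])
```

Terms that appear in every transcript receive zero weight, so the representation emphasises vocabulary that distinguishes one call from another.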

Multimodal baselines:
These baselines all combine transcript text and audio data, and the MDRM version represents the current state-of-the-art in volatility prediction.
(1) SVR (Glove+Audio) [46]: Both text and audio features are used as input features for an SVR in which both types of input are fused using a simple shallow model.
(2) bc-LSTM (Glove+Audio) [45]: We use a bi-directional contextual LSTM, proposed by [45], to extract context-dependent multi-modal utterance features, including text features, audio features, and video features.
(3) MDRM [46]: This recent multi-modal deep regression model is the current state-of-the-art in volatility prediction.
The overfitting problem of the audio-only model was reported in previous work [46]. For this reason, and in contrast to that work, we discuss our model's performance in two scenarios: text-only and text+audio.

Methodology
To facilitate a direct comparison with the current state-of-the-art (the MDRM based on [46]) we follow the evaluation carried out by [46] by splitting our dataset into mutually exclusive training/validation/testing sets in the ratio 7:1:2, where the split is applied at the level of earnings calls. We sort the earnings calls in chronological order before splitting, because future data cannot be used for prediction. Each baseline, as well as our HTML approach, is trained using the training set.
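A chronological 7:1:2 split of this kind can be sketched as below. The sketch assumes that each earnings call carries a sortable date and that split sizes are rounded down, with the remainder going to the test set; the function name and the example records are hypothetical.

```python
def chronological_split(calls, ratios=(7, 1, 2)):
    """Sort (date, item) pairs by date and split them into train/validation/test."""
    ordered = sorted(calls, key=lambda call: call[0])
    total = sum(ratios)
    n_train = len(ordered) * ratios[0] // total
    n_val = len(ordered) * ratios[1] // total
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # the most recent calls are held out for testing
    return train, val, test

# Ten hypothetical calls, one per month, supplied in reverse order.
calls = [(f"2017-{month:02d}-01", f"call-{month}") for month in range(10, 0, -1)]
train, val, test = chronological_split(calls)
```

Sorting before splitting guarantees that every validation and test call is later than every training call, so no future information leaks into training.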
In line with best practice, model hyper-parameters are tuned using the validation set. In particular, the maximum sequence length is set to 520 following [46], and for the token-level model, we use the default settings for the hyper-parameters of WWM-BERT to encode each token. On top of this, we develop a lightweight transformer at the sentence level to reduce the training and prediction time. We use a grid search to determine the optimal parameters and select the learning rate λ for Adam among {1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4}, the depth of transformer layers h ∈ {1, 2, 3, 4}, the number of attention heads m ∈ {1, 2, 3, 4}, and the batch size b ∈ {4, 8, 16, 32}. The optimal hyper-parameters are shared among all settings except the trade-off parameter α between the two tasks, which is tuned separately on the validation set for each n-day volatility prediction. Each of these tuned models is then evaluated based on its ability to predict average n-day volatility using the test set, for n = 3, 7, 15, 30. The resulting optimal hyper-parameter values are reported in Table 1.
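The grid search over these candidate values can be sketched as follows. The `validation_mse` callback is a hypothetical placeholder for training a model with a given configuration and returning its validation error; the toy scoring function used here exists only to make the example runnable and does not reflect any reported result.

```python
from itertools import product

# Candidate values taken from the text above.
GRID = {
    "learning_rate": [1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4],
    "num_layers": [1, 2, 3, 4],
    "num_heads": [1, 2, 3, 4],
    "batch_size": [4, 8, 16, 32],
}

def grid_search(grid, validation_mse):
    """Evaluate every combination and return the one with the lowest validation MSE."""
    names = list(grid)
    best_config, best_score = None, float("inf")
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = validation_mse(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_validation_mse(config):
    """Stand-in for model training; it arbitrarily favours one configuration."""
    return (abs(config["learning_rate"] - 1e-4) + abs(config["num_layers"] - 2)
            + abs(config["num_heads"] - 3) + abs(config["batch_size"] - 8))

best_config, best_score = grid_search(GRID, toy_validation_mse)
```

The grid contains 6 × 4 × 4 × 4 = 384 combinations, which is why keeping the sentence-level transformer small matters for tuning time.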
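The resulting predictions are compared to the actual volatility values to compute a mean squared error (MSE), the average of the squared differences $(\hat{y}_i - y_i)^2$ over the test instances; see Equation 14, where $\hat{y}_i$ is the predicted value and $y_i$ denotes the actual volatility.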
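For completeness, the evaluation metric amounts to the following computation. The function name and the sample values are hypothetical.

```python
def mean_squared_error(predicted, actual):
    """Average squared difference between predicted and actual volatility values."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(predicted, actual)) / len(actual)

mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the targets are log volatilities, the MSE is measured in squared log units, and lower values indicate better predictions.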

RESULTS AND DISCUSSION
The results of this evaluation are presented in Table 2, for each of the baselines and a number of variations of our HTML model, for 3, 7, 15, and 30-day time-periods. It should be clear that significant prediction benefits accrue to the HTML model. The HTML model achieves the highest prediction performance (lowest MSE values) for each of the target time-periods. In particular, the text-only and text+audio versions of HTML generate predictions with substantially lower errors compared to the corresponding versions of the current state-of-the-art MDRM alternative. These error improvements relative to MDRM are substantial, varying with the time-period as follows: 3 days (+38.4%), 7 days (+16.9%), 15 days (+49.0%), and 30 days (+38.7%). Improvements of this scale, relative to the state-of-the-art, are likely to translate into substantial practical benefits and suggest that this new HTML approach stands as a new performance benchmark for volatility forecasting.
In addition to such overall measures of performance, however, we are also interested in better understanding the relative contributions, if any, that the design decisions of the HTML model make when it comes to prediction performance. Thus, in the following subsections we consider a number of related evaluation questions to better understand the relative contributions of data sources and model components.

Comparing Price-based Methods with Alternative Methods
Table 2 shows how both text-based and multimodal approaches consistently outperform methods that are purely based on historical pricing, for both short-term (n = 3) and long-term (n = 30) volatility prediction. Excluding the HTML model, price-based methods and the other methods offer comparable performance for medium-term (n = 7, 15) volatility prediction. In the case of HTML, however, prediction performance always exceeds that offered by price-based methods, regardless of n. This provides strong evidence in support of the idea that text and audio features can improve volatility prediction.

On the Utility of Audio Features
Previous research [46] has demonstrated the benefits of combining text with audio data, compared to text-only features, in volatility prediction; [46] reported significant differences, based on a one-tailed t-test (p ≤ 0.001 for n=3 and n=7, and p ≤ 0.01 for n=15). For HTML, however, the benefits of using multimodal learning are statistically significant for n=3 only (p ≤ 0.01). HTML delivers its most accurate short-term predictions using text+audio, but its most accurate long-term predictions come from the text-only version. This may hint that short-term volatility is more strongly influenced by the vocal cues contained within audio features, although further research is required, as short-term volatility may also be affected by opportunistic effects such as so-called post-earnings-announcement drift (PEAD) [5].

On the Benefits of the Hierarchical Transformer Architecture
We explore the benefits of attention mechanisms for price-based and text-based models separately. For price-based models (technical analysis), adding an attention mechanism to the LSTM achieves some minor improvement in almost all of the settings, excluding n=7. For text-based methods, adopting a hierarchical attention network (HAN) with a bi-directional GRU model yields a distinct improvement. It is noteworthy that HAN outperforms the state-of-the-art multimodal results for n=30. This finding provides further evidence in support of the idea that audio features are unlikely to contribute significantly to longer-term volatility predictions.
We also compare the results obtained from the attention model used in HAN and our Hierarchical Transformer, which contains self-attention and multi-head attention, with text-only data. The performance of our model is stronger on all tasks, suggesting improvements due to the progressive architecture of the Hierarchical Transformer and the use of pre-trained word embeddings.
Regarding the embeddings, the results of an ablation study on the different embeddings used by the HTSL and HTML approaches are presented in Table 3. As might be expected, WWM-BERT has a beneficial effect on each prediction task compared to Glove, although adding audio features to the Glove embeddings offers similar performance benefits.

Single-Task vs Multi-Task Approaches
Also in Table 3 we can see how the multi-task approach tends to offer improved performance compared to the single-task approach. On a like-for-like basis, most of the multi-task variations in Table 3 show superior prediction performance when compared to the corresponding single-task variation, especially for long-term prediction tasks.
We further explore how the auxiliary (single-day prediction) task affects prediction performance. The influence of the auxiliary weight α is important, because it determines the relative weight of each task during learning. The validation MSE results, obtained by varying α, are presented in Figure 5. Each individual graph shows the n-day (main task) and single-day (auxiliary task) MSE for a different value of n and a range of values for α, and for text-only and multimodal variations. Using text-only data the optimal value for α (minimum MSE on the main task) varies in the range 0.5 to 0.8 for different values of n, whereas for multi-modal data it tends to be lower, in the range 0.4 to 0.6, for varying n. By tuning α during the validation stage we are effectively trading off n-day prediction performance against single-day prediction performance, and overall we can see that n-day performance can be optimised by tuning in this way.

CONCLUSIONS
Predicting the historical volatility of publicly traded companies is an important financial analysis task and considerable research effort in the past has been devoted to producing models that are capable of predicting pricing volatility for different time horizons. Recent advances in machine learning mean that researcher attention has moved from conventional time-series prediction approaches, based on historical pricing data, to more sophisticated methods that incorporate alternative sources of (often unstructured) data such as text reports or social media. In this paper we have proposed a novel hierarchical, multi-task, transformer learning model for volatility prediction, based on the text and/or audio of earnings calls. The model builds on very recent work by [46] and delivers substantial performance improvements, for short and long-term volatility prediction, providing a new performance benchmark for this task. Moreover, our evaluation includes a detailed study of a variety of experimental conditions, to better understand the relative contributions of different aspects of the proposed model to prediction performance.

Figure 5: Validation MSE as a result of varying α (panel (b): multimedia data as input). The left and right y-axes represent the validation MSE of the primary task and the auxiliary task, respectively.

The utility of audio data, and vocal features, in this important financial prediction task, suggests there exists a significant opportunity to explore the use of audio features in a range of related or complementary tasks (e.g. fraud detection, asset pricing, stock recommendation etc.), where such data is readily available alongside more traditional forms of financial data.