Open-set Web Genre Identification

Dept. of Information and Communication Systems Eng.
Doctor of Philosophy
by Dimitrios PRITSOS

The World Wide Web is constantly growing and people use information in web-pages for everyday activities. There is an emerging need to facilitate access to this huge repository in a seamless way that is in accordance with users' understanding. Genre is an important factor in characterizing the properties of web-pages. Web genres (e.g., blogs, e-shops, FAQs, etc.) refer to the form, structure, and communicative purpose of web-pages rather than their topic. Web Genre Identification (WGI) provides a means to improve the effectiveness of information retrieval systems by allowing sophisticated queries that combine topic and genre information and by ranking/grouping search results according to genre. Specialized document collections can be compiled by adopting genre-aware focused crawling. The credibility assessment of web-pages can be significantly enhanced when information about their genre is available. Cyber-security applications like anti-phishing can also be enhanced by incorporating the genre of web-pages. When natural language technology tools have to be applied to the textual part of web-pages, knowing their genre allows the selection of appropriate tools that have been trained to handle similar documents.

Existing work in WGI largely follows the closed-set classification scenario where, given a genre palette and training examples for each known genre, the task is to assign every new web-page to one of the known genres. However, this does not fit most applications related to WGI. There is no consensus about the definition of a large genre palette covering most of the Web. It should be expected that large volumes of web-pages will not belong to any of the pre-defined genre labels. This could be viewed as noise in WGI. In addition, genres evolve over time: new genres emerge and existing genres are modified (e.g., blogs and micro-blogs). It therefore seems reasonable to adopt the open-set scenario to better deal with WGI tasks. The very few existing studies focusing on open-set WGI lack an objective evaluation that would reveal their true potential.

Declaration
I declare that I am the sole author of this thesis and that I have not used any source other than those listed in the bibliography and identified as references. I further declare that I have not submitted this thesis at any other institution in order to obtain a degree.

Dedication
To my family.

Acknowledgements
I would like to thank my supervisor, Professor Stamatatos Efstathios, for being my teacher. A teacher with excellent academic skills who has been closely guiding me into innovative and novel research paths. A teacher, eager for knowledge, who has been inspiring me with his patience and dedication to our research goals.

Timeline Acknowledgements
I would like to thank a few people who I believe had an influence on my academic choices. My postgraduate supervisor Dr. Nehmzow Ulrich for his lectures and book on (robotics) research methodology. My undergraduate professors Dr. Skourlas Christos and Dr. Vassilas Nikos for helping me pursue my academic goals with their personal guidance. My teachers, Mr. Vargiadakis (elementary school), Mr. Dimitropoulos (high school) and Mr. Mavrelis (senior high school), who taught me the values of analytical and critical thinking.
I would like to thank my three close friends Vasilis Paraskevopoulos, Eleftherios Garifalidis and Vasilis Dimitriadis for their support and discussion time, which was always useful for my academic decisions. I would also like to thank my fiancée Vasiliki Charalampopoulou who has been with me since the early days of my PhD studies.
Last but not least, I would like to thank my parents who taught me, through example, to work tirelessly for my goals. Antonis, my father, who is my very first teacher in analytical thinking, and Antonia, my mother, who inspired in me the love of knowledge.
• Information Extraction: The goal is to extract specific information from documents, e.g. the names of people/places/organizations and dates of events in news stories.
• Text Classification: The goal is to assign labels from a predefined set to documents. Such labels could correspond to thematic area (e.g., 'politics', 'sport'), or the sentiment of texts (opinion mining) or the author of documents.
• Text Clustering: The goal is to group documents according to their similarity. This is used when there is no predefined list of categories and can also create structured taxonomies that organize and facilitate access to a document collection.
• Text Visualization: This aims at graphically depicting the main information found in a collection of documents to facilitate the exploration of similarities/differences among them and provide understandable information.
• Document Summarization: The goal is to provide a brief summary of a long document or a collection of documents by removing trivial details and including all crucial information. This facilitates access to collections of documents that are constantly updating.

Classifying Documents by Genre
Genre Identification is the natural progression of the almost ancient process of categorizing human intellectual creations on such an abstract taxonomy as their Genus. Artifacts such as paintings, music pieces and written texts have always been a subject of research interest, to be classified based on their form, style and communicative purpose rather than their content. For example, novels or poems for documents, impressionism or expressionism for paintings, and blues or funk for music are genres that depend on structural information. Especially for documents, the defining factors for distinguishing between genres are their form, style, and communicative purpose.
There is a great debate about defining the notion of genre in linguistic studies. Additionally, the notion of genre is easily confused with other abstract categorizations of texts such as text types or registers. Despite the methodological differences, the linguistic community has concluded that the idiosyncrasy of the genre taxonomy is that it is mutable and diverse (Coutinho and Miranda, 2009). This idiosyncrasy stems from the spontaneous genesis of genre classes: genre classes emerge or mutate while a communication process is taking place.
Definition 1 Genre is the genus of some arbitrary texts, which comprehensively describes their form, style and communicative purpose other than their content, where it emerges as a sociocentric interaction for accelerating the social communication when it comes to the description of the texts.
Automated Genre Identification (AGI): Identification of a text's genre, sometimes treated as equivalent to the text's register. That is, the automated identification of the form, style and communicative purpose of texts. News indicates a different kind of text than Blogs with respect to genre. Editorial is different from Article with respect to register, while both can be considered opinion articles written in an argumentative style.
A subset of AGI is Web Genre Identification (WGI), focusing on the World Wide Web, where enriched documents (hypertexts) are classified on a given genre taxonomy/palette (e.g., blogs, home pages, e-shops, discussion forums, etc.). The ability to automatically recognize the genre of web documents can enhance performance in several applications including the following:
• IR systems can enable genre-based grouping/filtering of search results (Braslavski, 2007;Rosso, 2008). A search engine can provide its users the option to define sophisticated queries combining genre labels and topics (e.g., blogs about machine learning or e-shops about sports equipment).
• Specialized collections and intuitive hierarchies of web page collections can be built by combining topic and genre information (De Assis et al., 2009). Genre-aware focused crawling, unlike general web-crawling, explores and downloads only relevant web-pages belonging to certain genres (Siqueira et al., 2017). As a result, valuable time and resources are saved and more specialized indices can be produced. The main challenge in this task is to be able to guess the genre of web-pages in advance, i.e., before the page is actually downloaded (Priyatam et al., 2013).
• Knowing the genre of web-pages can be very helpful information in order to assess their credibility in spam detection (Agrawal, Mohan, and Reddy, 2018).
• In cyber-security, genre of web-pages can be exploited to enhance anti-phishing attempts (Abbasi et al., 2015).
• The recognition of web genre can also enhance the effectiveness of processing the content of web pages in information extraction applications. For example, given that a set of web pages has to be part-of-speech tagged, appropriate models can be applied to each web page according to their genre (Nooralahzadeh, Brun, and Roux, 2014).
Despite such interesting application areas, research in WGI is relatively limited due to fundamental difficulties emerging from the genre notion itself. The most significant difficulties in the WGI domain are the following:
• There is no consensus on the exact definition of genre (Crowston, Kwaśnik, and Rubleske, 2011).
• There is no common genre palette that comprises all available genres and sub-genres (Santini, 2011;Mehler, Sharoff, and Santini, 2010;Mason, Shepherd, and Duffy, 2009b;Sharoff, Wu, and Markert, 2010a). Moreover, genres evolve over time, since new genres are born and existing genres are modified (Boese and Howe, 2005).
• It is not clear whether a whole web page should belong to a genre or sections of the same web page can belong to different genres (Jebari, 2015;Madjarov et al., 2015).
• Style of documents is affected by both genre-related choices and author-related choices (Petrenz and Webber, 2011;Sharoff, Wu, and Markert, 2010b). As a result, it is hard to accurately distinguish between personal style characteristics and genre properties when style is quantified.

Closed-set vs. Open-set Classification
In a typical text classification task, we are given a collection of documents $D = \{d_1, \ldots, d_{|D|}\}$ and a set of labels $C = \{c_1, \ldots, c_{|C|}\}$, and the task is to assign each document to some of the labels. That is, for each pair $\langle d_j, c_i \rangle \in D \times C$ a binary answer is produced indicating whether document $d_j$ is assigned to class $c_i$. Usually, text classification tasks are successfully handled by applying supervised machine learning methods (Sebastiani, 2002). This assumes the availability of a labeled training corpus $T = \{d_1, \ldots, d_{|T|}\} \subset D$ where every pair $\langle d_j, c_i \rangle$ is either a positive or a negative instance of $c_i$. Then, a classifier learns a function $\phi : T \times C \to \{\mathrm{True}, \mathrm{False}\}$ that approximates the target function $\breve{\phi} : D \times C \to \{\mathrm{True}, \mathrm{False}\}$. The effectiveness of the classifier is estimated using another labeled dataset (test/evaluation set) $E = \{d_1, \ldots, d_{|E|}\} \subset D$ that does not overlap with the training set.
Most previous studies in WGI consider the simple case where all web pages should belong to a predefined taxonomy of genres (Lim, 2005;Santini, 2007;Kanaris and Stamatatos, 2009;Jebari, 2014). This is known as closed-set classification.
Definition 2 Closed-set Classification assumes that the training and test sets are drawn from the same distribution and all their instances necessarily belong to at least one of the predefined labels.
There are several variations of that scenario, for example single-label classification (where each web-page belongs to exactly one label), multi-label classification (where multiple labels may be assigned to a web-page), and soft classification (where an algorithm returns a probability score for every class in the trained label space) (Geng, Huang, and Chen, 2018).
The naive assumption of closed-set classification is not appropriate for most applications related to WGI. As already mentioned, it is not feasible to define a complete set of web genres. The scale of the Web makes any attempt to map existing web-pages to a specific genre label intractable. In addition, web genres evolve over time: some are modified or cease to exist and new ones emerge (e.g., some years ago, blogs and tweets were unknown). The vast majority of previous work in WGI does not consider such concerns and, as a result, the effectiveness reported under closed-set classification conditions is over-estimated.
It is therefore realistic to assume that, despite best efforts to define a long genre label list, there will always be a great number of web-pages that do not belong to any of these labels. Previous work in WGI defines such web-pages as noise (this term can also refer to the case where multiple genres co-exist and there is no dominant genre label) (Santini, 2011;Levering, Cutler, and Yu, 2008). To handle noise in WGI there are two main options. The first is to adopt the closed-set classification setup with one predefined category devoted to noise, for which positive training examples are given. Since this category would comprise all web pages not belonging to the known genre labels, it would not be homogeneous and it is not clear how to sample it. Moreover, this noise class would be much larger than the other genres, causing class imbalance problems.
The second option is to adopt the open-set classification setting, where it is possible for some web pages not to be classified into any of the predefined genre categories (Stubbe, Ringlstetter, and Schulz, 2007;Pritsos and Stamatatos, 2013). This setup avoids the problem of class imbalance caused by numerous noisy pages and also avoids the problem of handling a diverse and highly heterogeneous class. On the other hand, open-set classification requires stronger generalization than the closed-set setup (Scheirer et al., 2013).
Definition 3 Open-set Classification assumes that samples of classes unseen during the training phase are likely to appear in the test phase. An open-set classifier should be able to accurately recognize test instances belonging to the known classes (seen during training) and avoid being confused by instances belonging to unknown classes (not seen during training) (Geng, Huang, and Chen, 2018).
Open-set classification is closely related to Novelty Detection and One-class Classification, where it is assumed that only positive examples of a particular class are available to the supervised learning methods. Several learning methods have been adapted to this problem, such as One-Class SVM, One-Class Neural Networks, etc. Although it might sound similar, training these algorithms is not a binary classification setup, because negative examples are lacking. One-class classification requires very strong generalization and is suitable when the negative class is either not available or so huge and heterogeneous that it cannot be adequately sampled.
It is possible to transform a (soft) closed-set classifier into an open-set one by introducing a reject option that is used to leave a test instance unclassified. For example, a reject option may examine how far a test instance is from the class centroids, or what the difference in decision probabilities between the most likely classes is; in case some predefined criteria are not met, the test instance is left unclassified (Onan, 2018). Closed-set classification methods with a reject option are not essentially open-set, since they do not estimate the open-space risk.
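The reject option described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited work: the function name `classify_with_reject` and the two thresholds `min_prob` and `min_margin` are hypothetical, and the class probabilities are assumed to come from some soft closed-set classifier.

```python
from typing import Dict, Optional


def classify_with_reject(probs: Dict[str, float],
                         min_prob: float = 0.5,
                         min_margin: float = 0.1) -> Optional[str]:
    """Return the most likely genre, or None when the instance is rejected.

    probs maps each known genre to the probability assigned by a soft
    closed-set classifier. The thresholds are illustrative assumptions.
    """
    # Rank genres from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    second_p = ranked[1][1] if len(ranked) > 1 else 0.0
    # Reject when the top class is weak or too close to the runner-up.
    if best_p < min_prob or (best_p - second_p) < min_margin:
        return None
    return best_label
```

A page with probabilities `{"blog": 0.8, "eshop": 0.15, "faq": 0.05}` would be labeled as a blog, whereas `{"blog": 0.45, "eshop": 0.40, "faq": 0.15}` would be left unclassified. Note that the decision still depends only on the boundaries between known classes, which is why such a scheme does not account for open-space risk.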
Each classifier attempts to draw boundaries between the known classes (i.e., those seen during the training phase). A closed-set classifier (no matter whether it uses a reject option) partitions the whole instance space by such decision boundaries. However, the samples of known classes may be gathered in specific parts of the instance space. The space far away from known class instances is known as the open space. The open-space risk refers to the act of labeling a test instance in the open space (Geng, Huang, and Chen, 2018).
A more formal definition of open-set classification is one where the open-space risk is considered. Let $T$ be the training data, $R_O$ the open-space risk, and $R_\varepsilon$ the empirical risk. Then the objective of open-set classification is to find a function $f \in L$ which minimizes the following open-set risk:

$$\arg\min_{f \in L} \left\{ R_O(f) + \lambda R_\varepsilon(f(T)) \right\}$$

where $f(x) > 0$ implies correct recognition and $\lambda$ is a regularization constant. Thus, open-set risk balances the empirical risk and the open-space risk (Geng, Huang, and Chen, 2018). In practice, the empirical risk is the loss function of the open-set classification model on the training set, while the open-space risk is the ratio of the open space to the full vector space.

Representation of Web-pages
In order to apply supervised learning technology to WGI, it is required to transform the information in raw web documents into a quantitative representation. This means that each web-page should be represented as a numerical vector where each dimension (feature) properly captures relevant information. In addition, ideally the vectors should be dense and compact to enable ML algorithms to deal with the classification task efficiently. Web documents can be considered a super-set of the document format types, because they extend Postscript by introducing functionality and versatility based on HTML and virtually infinite inter-connectivity through hyperlinks.
In the relevant literature there is a great variety of ideas for document representation in WGI. The main features that can be extracted from web-pages are related to the following information:
1. The Uniform Resource Locator (URL) and hyperlinks of web-pages (and the graph formed by these connections).
2. The HTML tags and Document Object Model (DOM) structure of the web-page.
3. The textual content of the web-page.
In some cases, it has been reported that the web-page's URL alone is sufficient for predicting its genre (Abramson and Aha, 2012;Jebari, 2014;Priyatam et al., 2013;Zhu, Zhou, and Fung, 2011). Concerning the hyperlinks available in web-pages, there are two parts that can provide useful information: the URL of the hyperlink itself, handled as a string of characters, and its anchor text. Alternatively, the structure of the graph formed by the hyperlinks and the information found in neighboring pages can also be used. Usually, the neighboring pages can contribute by amplifying the signals for the correct genre classification, using either information extracted from their text or the assumption that pages of the same genre tend to be inter-linked (Abramson and Aha, 2012;Asheghi, Markert, and Sharoff, 2014;Jebari, 2014;Priyatam et al., 2013;Zhu, Zhou, and Fung, 2011).
The HTML tags can provide useful information about the structure of web-pages. In the simplest approach, HTML tags can be treated as raw text and the frequency of specific tags is measured, possibly with some heuristics. However, the web-page composition paradigm suggested by the W3C keeps changing and is constantly violated. As a result, heuristics contribute only in a few practical cases. A more sophisticated and sensible approach is the analysis of the DOM structure, where the format of the text can be captured. As an example, e-shop web-pages are different from academic web-pages. This resembles the difference in typographic format between a printed magazine and a printed newspaper. However, several heuristics are most likely needed for identifying these structures, because of the violations of the HTML composition paradigm (Mehler and Waltinger, 2011).
The bulk of research work in WGI has focused mostly on the features which can be extracted from the textual part of web-pages (i.e., after the removal of HTML tags) (Mason, Shepherd, and Duffy, 2009c;Sharoff, Wu, and Markert, 2010a;Sharoff, Wu, and Markert, 2010b;Nooralahzadeh, Brun, and Roux, 2014;Onan, 2018). The following are the main categories of textual features:
1. Lexical features: Each web-page is seen as a series of tokens, and frequencies of specific words (e.g., function words) or sequences of tokens (e.g., word n-grams) can be measured. In addition, information about the length of words and sentences can be useful.
2. Character features: Each web-page is handled as an alphanumeric string, and usually frequencies of character n-grams can provide a very detailed and high-dimensional representation.
3. Syntactic features: This requires some kind of sophisticated analysis by NLP tools that can provide information about the syntactic patterns found in the web-pages. One popular and relatively simple approach is the use of part-of-speech (POS) n-grams. Syntactic features are language-dependent and their reliability depends on the error rate of the NLP tools used.
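As a rough illustration of the lexical and character features listed above, the following sketch counts word and character n-grams for the textual part of a web-page. The function names and the simple regular-expression tokenizer are assumptions made for this example; POS n-grams are omitted because they require an external tagger.

```python
import re
from collections import Counter


def word_ngrams(text: str, n: int) -> Counter:
    """Count word n-grams after a simple lowercase tokenization."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(" ".join(tokens[i:i + n])
                   for i in range(len(tokens) - n + 1))


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams over the raw string (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


page_text = "Add to cart. Add to wishlist."
word_counts = word_ngrams(page_text, 2)   # e.g. the bigram "add to"
char_counts = char_ngrams(page_text, 4)   # e.g. the 4-gram "Add "
```

Even for this short string the character 4-gram vocabulary is larger than the word bigram vocabulary, which hints at why character n-grams yield the high-dimensional representations mentioned above.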
Typical term weighting schemes, like binary, Term Frequency (TF) and Term Frequency-Inverse Document Frequency (TF-IDF), are popular in WGI. In addition, there are some schemes specifically designed for WGI tasks, like Term Frequency-Inverse Genre Frequency (TF-IGF). This is an extension of TF-IDF that is based on the frequencies of a term in the documents of a particular genre rather than in the whole corpus (Sugiyanto et al., 2014).
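For concreteness, a minimal sketch of one common TF-IDF variant is given below. The length-normalized term frequency and the unsmoothed logarithmic IDF are choices made for this example, not necessarily those used in the cited studies; the exact TF-IGF formula of Sugiyanto et al. (2014) is not reproduced here.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term of each tokenized document by tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({term: (count / len(doc)) * math.log(n_docs / df[term])
                         for term, count in tf.items()})
    return weighted


docs = [["cart", "price", "cart"], ["post", "comment", "price"]]
weights = tf_idf(docs)
```

In this toy corpus the term "price" occurs in both documents and therefore receives zero weight, while "cart" is weighted highly in the first document. A genre-oriented scheme would replace the corpus-wide document frequency with counts computed per genre.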
Recently, distributed representations have provided an alternative way to represent documents using neural network language models (Mikolov et al., 2013a;Le and Mikolov, 2014). In contrast to the popular n-gram features that produce sparse vectors, distributed representations produce dense vectors of relatively low dimensionality. This approach has obtained state-of-the-art effectiveness in several text classification tasks, but it has not been thoroughly tested in WGI so far.

Motivation
As already mentioned, the vast majority of previous work in WGI adopts the closed-set classification scenario, which is not realistic and leads to an over-estimation of performance. Since it is not feasible to define a complete list of genre labels and genres constantly evolve over time, the open-set classification scenario better suits WGI.
Among the few attempts to follow open-set classification in WGI, very few use pure open-set classifiers (Stubbe, Ringlstetter, and Schulz, 2007;Asheghi, 2015). An additional issue is how to handle the test web-pages belonging to unknown genres. One option is to consider these as unstructured noise where the true genre of noisy pages is not available and another is to examine structured noise where the true genre of noisy pages is available (yet unknown during the training phase).
So far, it is not clear which specific open-set classification methods can better handle these cases. In addition, there is a lack of an evaluation framework that can appropriately measure the effectiveness of open-set WGI methods in the presence of either unstructured or structured noise. This requires the use of appropriately defined evaluation measures and a suitable design of the experimental setup. In addition, we need a clear way to compare different methods in application-dependent conditions where, for example, precision may be considered more important than recall.
Most previous studies attempt to combine heterogeneous information coming from the hyperlinks between web-pages, the HTML code and the textual content of web-pages. Despite the usefulness of all this information, the main question is whether it is possible to accurately predict the genre of a web-page focusing on its textual content alone, since this is not affected by technology changes, the habits of web developers, or arbitrary changes in neighboring web-pages.
There is a great variety of text representation measures applied to WGI, and most of them attempt to capture the stylistic properties of web genres. It is not yet clear whether specific approaches, like word and character n-grams, known to be very effective in closed-set WGI (Sharoff, Wu, and Markert, 2010a), remain effective in open-set WGI, where the dimensionality of the representation may severely affect the ability of the open-set classifier to generalize. Finally, the recent success of distributed representations acquired by neural network language models in other text classification tasks is a strong motivation to examine their effectiveness in open-set WGI as well. One main advantage of such approaches is that they produce a space of relatively low dimensionality, and in theory this may benefit specific open-set classifiers that suffer when irrelevant and redundant features are present.

Contribution
This thesis focuses on open-set WGI and examines specific algorithms and experimental setups that allow their evaluation in realistic conditions. More specifically, the main contributions are listed below:
• An approach based on one-class classification, where only positive training examples of a target class are considered, is introduced to WGI. The proposed method is based on one-class support vector machines (OCSVM) and is modified to handle multi-class open-set classification. This algorithm is presented in detail in section 3.4.1.
• The Random Feature Subspacing Ensemble (RFSE) is introduced to WGI. This open-set classifier is based on an existing approach originally proposed for authorship attribution, and it is adapted to better handle the WGI task (Koppel, Schler, and Argamon, 2011). This algorithm has been implemented in Python and in its general form can handle any kind of text representation. This algorithm is presented in detail in section 3.4.2.
• Another open-set classifier, the Nearest Neighbors Distance Ratio (NNDR), is introduced to WGI. This is a modification of the well-known k-Nearest Neighbor classifier (Mendes Júnior et al., 2016) and it is extended to better suit the WGI requirements. This algorithm has been implemented in Python and is presented in detail in section 3.5.
• The noise (i.e., web-pages not belonging to any of the known genres) in WGI is distinguished into unstructured and structured noise, and each case is thoroughly studied. The former considers all unknown genres as a common heterogeneous class. The latter admits that there is structure in the unknown web-pages, namely the existence of genre labels not seen during the training phase. This thesis introduces openness as an indication of how the number of known classes compares to the number of unknown classes. This concept is borrowed from relevant work in visual object recognition (Scheirer et al., 2013) and it perfectly suits the WGI task.
• An experimental framework suitable for evaluating open-set WGI algorithms is introduced, including the ability to study different kinds of noise (unstructured or structured). The use of openness enables the study of open-set WGI where the difficulty of the task is explicitly controlled (i.e., few known classes vs. many unknown classes, or many known classes vs. few unknown classes). In addition, appropriate evaluation measures provide a detailed view of the obtained performance. This is especially important because evaluation measures usually involved in closed-set classification can be misleading, since they handle all classes equally. In open-set WGI, however, the class of unknown web-pages (including all web-pages that do not belong to known genres) is usually much larger than the known classes and should be treated in a special way, as explained in Chapter 4.
• The proposed open-set WGI algorithms are extensively evaluated using the aforementioned experimentation framework. The particular hyper-parameters and settings that allow these algorithms to achieve as good results as possible are examined. In addition, the use of different kinds of text representation is considered and their effect on the performance of each algorithm is studied. The most popular textual features in WGI covering lexical, character, and syntactic features are considered.
• The application of distributed representations acquired from neural network language models in WGI is explored. The effect of such low-dimensional and dense representations on the effectiveness of the NNDR open-set WGI algorithm is studied. It is demonstrated that the precision of this approach in particular can be considerably enhanced, making it more suitable for specific WGI applications.
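To give an intuition for the distance-ratio idea behind NNDR, a minimal sketch follows. It reflects only the basic rule of comparing the distance to the nearest training sample with the distance to the nearest sample of a different class; it is not the extended algorithm presented in this thesis. The function names, the Euclidean distance and the threshold value are assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nndr_predict(train: List[Tuple[Vector, str]], x: Vector,
                 threshold: float = 0.8) -> Optional[str]:
    """Return a known genre label, or None when x is treated as unknown."""
    # Nearest training sample overall and its label.
    nearest_vec, nearest_label = min(train, key=lambda s: euclidean(s[0], x))
    d1 = euclidean(nearest_vec, x)
    # Distances to samples of every other class.
    others = [euclidean(v, x) for v, label in train if label != nearest_label]
    if not others:
        raise ValueError("at least two known classes are required")
    d2 = min(others)
    if d2 == 0.0:
        return None  # x coincides with samples of two classes: ambiguous
    # Accept only if x is clearly closer to one class than to any other.
    return nearest_label if d1 / d2 <= threshold else None


train = [((0.0, 0.0), "blog"), ((0.0, 1.0), "blog"),
         ((5.0, 5.0), "eshop"), ((5.0, 6.0), "eshop")]
```

With this toy training set, a point close to the blog samples such as `(0.2, 0.1)` is accepted as a blog, while a point equidistant from both genres such as `(2.5, 3.0)` is left unclassified. Because the decision relies on distances, the dimensionality and density of the representation directly affect the ratio, which is one reason low-dimensional dense vectors are of interest for this classifier.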

Publications
Parts of the work described in this thesis have already been published in scientific journals and conference proceedings. The related publications are the following:

Thesis Outline
The rest of this thesis is outlined below. Chapter 2 discusses relevant work on AGI and WGI tasks. Definitions and uses of genre from the fields of linguistics and computational linguistics are presented. The state of the art in the representation of web-pages and the ML methodologies for genre identification are discussed. The few open-set WGI approaches are described.
Finally, the available corpora for evaluating WGI methods and their properties are discussed.
Chapter 3 focuses on open-set WGI and analytically presents the three algorithms examined in this thesis (i.e., OCSVM, RFSE, and NNDR). The characteristics of these methods and their differences from existing approaches are discussed.
Chapter 4 introduces the experimental framework proposed in this thesis for evaluating open-set WGI approaches. The use of openness as a means to control the difficulty of WGI tasks is discussed. Appropriate evaluation measures are defined for both unstructured and structured noise.
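As a small illustration of how openness can quantify task difficulty, the sketch below computes the openness measure as defined by Scheirer et al. (2013), from which the concept is borrowed. Whether Chapter 4 uses exactly this formula is an assumption of this example, and the function and parameter names are hypothetical.

```python
import math


def openness(n_training: int, n_target: int, n_testing: int) -> float:
    """Openness in the sense of Scheirer et al. (2013).

    n_training: number of classes seen during training
    n_target:   number of classes to be recognized
    n_testing:  number of classes appearing in the test phase
    """
    return 1.0 - math.sqrt(2.0 * n_training / (n_testing + n_target))


closed_case = openness(5, 5, 5)   # all test classes are known
open_case = openness(2, 2, 8)     # 2 known genres, 6 unknown genres at test time
```

Under this definition, a fully closed-set problem has openness 0, and the value grows towards 1 as the number of unknown genres in the test phase increases relative to the known ones.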
Chapter 5 deals with the experimental analysis of OCSVM and RFSE algorithms. The evaluation corpora used in this study and their properties are discussed. Experiments when structured and unstructured noise is considered are presented. The effect of text representation on the effectiveness of the examined methods is studied.
In Chapter 6, the usefulness of distributed representations in open-set WGI is examined. The NNDR algorithm is evaluated using traditional n-gram-based features and distributed features. Experimental results show how the performance of this algorithm is affected and how it compares with OCSVM and RFSE.
Finally, Chapter 7 summarizes the main conclusions drawn from this study and discusses future work directions.

Introduction
This chapter describes previous work in genre recognition. First, the notion of genre is discussed using approaches from different disciplines and background. Important aspects of genre are noted and a general definition that is adopted in this study is provided.
In general, genre recognition is viewed as a text classification task. Thus, the main issues that are studied are the following:
• Represent documents in a feature space.
• Learn a model that can distinguish between classes.
Genre-related information can be extracted from various sources. Since genre is mainly associated with the form, structure, and communicative purpose of documents, features can relate to textual content, visual appearance, the URL and the graph of interlinked web-pages, etc. In addition, as far as textual features are concerned, information about style is far more important than the topic of documents. The existing approaches to define suitable document representations are analytically described. We include in this discussion both AGI and WGI tasks.
There is also a great variety of classification algorithms applied to genre recognition tasks. These include general-purpose ML methods and approaches specifically built for these tasks. Special emphasis is given to the type of classification setup adopted by existing approaches, mainly whether a closed-set or an open-set scenario is followed.
Finally, we present an overview of existing resources to evaluate WGI approaches. A list of corpora used in previous studies and their main characteristics are described.

The Notion of Genre
In general, genre is related to form and communicative purpose of texts rather than their theme. It is closely related to style and Genus 1 (Sugiyanto et al., 2014). Approaches to define text genre start mainly from two directions: linguistics and computational analysis of language (e.g. computational linguistics, natural language processing, text mining).
In linguistics there is a great debate about defining the notion of genre as an abstract categorization scheme of texts and the relations between them. Despite the methodological differences, the linguistic community has concluded that the genre taxonomy is inherently mutable and diverse (Coutinho and Miranda, 2009). This character of the taxonomy stems from the spontaneous genesis of genre classes. A genre class is born out of a socio-centric interaction, emerging from the need to describe texts in order to accelerate social communication. Thus, genre classes emerge spontaneously while communication is taking place.
Humans can efficiently recognize genre types by processing texts intuitively. However, there is a lack of consensus on defining genres, particularly when specific names (labels) have to be assigned to them. Several user studies have attempted to elicit the mechanics of the process of genre identification and tagging. The results on user agreement were very discouraging. Also, when humans attempt to describe specifically the terms and/or attributes they use to identify different genres, there is great confusion and disagreement (Rosso, 2005;Asheghi, 2015). A convincing explanation for this is the plethora of textual, stylistic and conceptual description terms that humans use, which depend on their background (e.g., teachers, scientists or engineers use different vocabularies to describe texts belonging to a common genre) (Roussinov et al., 2001;Crowston, Kwaśnik, and Rubleske, 2011).
Researchers in cognitive science found that humans recognize the genre of a document (or web-page) using cognitive processes related mostly to the form of the text (Clark et al., 2014). In particular, they used eye-tracking apparatus to record eye movements while subjects attempted to recognize the genre of documents. The process resembles navigation, where the eyes are constantly moving while focusing for small fragments of time on landmarks of interest. The pausing of the eyes on the text "landmarks" is called fixation, while the "jumping" movement of the eyes is called a saccade. The whole process aims to locate information of interest, such as specific text forms, names, verbs, or phrases related to the abstract concept, in order to decide whether the text matches the reader's interest and is worth further reading. They systematically found that the process of finding the genre of a text is the same as that of finding out whether a text is worth further reading. Thus, knowledge of a genre taxonomy accelerates communication and helps readers find the information of interest faster.
1 Genus in Greek means type or class.
The discipline of English for Academic Purposes (EAP) has vividly discussed the divergence of genre taxonomies between academic disciplines and has argued for the utility of a genre taxonomy in enabling teachers and students to improve their rhetorical and written language skills, with the purpose of improving the teaching procedure. What is important to note for this study is the conclusion that a given genre conveys information about the communicative purpose of the document, i.e., it acts as a carrier of text identity, but different genres can also share the same style and other language properties when their purpose is similar.
For example, a newspaper article and a magazine article can be claimed to belong to different genres although they are mainly governed by the same linguistic properties. Therefore, it is very important for the writer of a text to be aware (and thus to be taught) of the different genres and the taxonomy of genres, so that the texts (s)he produces are recognizable by the reader (Hardy and Friginal, 2016;Melissourgou and Frantzi, 2017;Al-Khasawneh, 2017). However, recognizing genre requires different levels of reading ability, and even readers with these skills may disagree (McCarthy et al., 2009).
The utility of text genre identification has been realized by the journalism professionals. There are well-defined structures and guidelines given by newspaper editors about how to present, e.g. news articles. The structure consists of abstract elements and they follow specific paradigms, like the inverted pyramid (i.e., contents are structured from the most important to the least important information), Martini Glass (i.e., it first presents a summary of the story, then an inverted pyramid and finally a chronological elaboration), Kabob (i.e., it starts with an anecdote, continues with the main story and closes with a general discussion) and Narrative (i.e., it presents a chronological sequence of events) (Dai, Taneja, and Huang, 2018).
Some terms used in the relevant literature, like register and text type, seem very relevant to genre. Actually, they are used interchangeably, complementarily, and even contradictorily (Melissourgou and Frantzi, 2017). Although the exact definitions of these terms vary according to the scholars and their background, text type is generally associated with linguistic properties of documents. Register usually refers to non-linguistic properties like the purpose of communication, the relation between speaker and hearer, etc. Genre can be viewed as more general than both text type and register since it combines linguistic and non-linguistic information.
From a computational analysis point of view, genre (and genre taxonomy) is important as a classification factor to distinguish between documents. Genre labels are defined according to their association with practical applications rather than based on a rigid theoretical background (Kanaris and Stamatatos, 2009;Eissen and Stein, 2004;Santini, 2007). Genre identification is a style-based text categorization task. Another similar task is authorship attribution where the focus is on identifying the personal style of the author (Stamatatos, 2009;Koppel, Schler, and Argamon, 2011;Koppel and Winter, 2014). On the other hand, genre is mainly regarded as a group style. For example, scientists use a common form of language to write research papers, journalists describe news events and their opinion using similar patterns, bloggers express their beliefs and interests based on similar structures, etc.
As concerns web genres (and their respective taxonomy), the utilities and opportunities they can provide, as well as the difficulties they impose, have been eloquently analyzed. It has been pointed out that the genre taxonomy summarizes the type and style of texts in a single term as a communicative act (De Assis et al., 2009). In the domain of WGI, a web genre palette is usually defined following a top-down approach, where a group of domain experts designs the taxonomy based on specific objectives of the task (Crowston, Kwaśnik, and Rubleske, 2011). Moreover, the genre palette may be flat or structured (Wu, Markert, and Sharoff, 2010). The former assumes that genre labels are independent while the latter defines a hierarchy of genres and sub-genres. Another important issue is whether a web-page should belong to exactly one genre label or page segmentation should be applied first and then each segment should be assigned to a genre label (Madjarov et al., 2015;Jebari, 2015).
As described so far, there is agreement on the criteria that define genres (and web genres) in a given domain. These are the style, form, and communicative purpose of documents. In theory, topic is considered orthogonal to genre. However, thematic information can also be useful in automated genre identification. For example, the genre of academic home web-pages is distinguished by the use of a specific vocabulary. The genre of research papers also uses specific science-related terms. Certainly, some of these terms may be too specific (e.g. about biology, mathematics, or computer science). However, content-specific information can be used to differentiate scientific documents from non-scientific documents (Coutinho and Miranda, 2009;Crowston, Kwaśnik, and Rubleske, 2011;Kanaris and Stamatatos, 2009;Jebari, 2015;Gollapalli et al., 2011).
Considering the above discussion, it is clear that the notion of web genre depends on how this information is to be used. In this thesis, our approach is influenced by the use of web genres as a classification factor in order to enhance the potential of information processing and management systems. In particular, we adopt the following definition:
Definition 4 A web genre is a class of web documents that share form, structure, and communicative purpose properties. Every web-page is always derived from a unique class distribution and the class distributions do not overlap.

Textual Features
The textual content of a document is the most analyzed source of information. Similarly, the textual part of a web-page is considered very important in WGI studies (Mason, Shepherd, and Duffy, 2009a;Sharoff, Wu, and Markert, 2010b). As it has already been explained, style rather than topic is crucial in genre recognition. However, it is not clear how style properties of documents can be captured adequately. In addition, style is affected by both genre and the personal style of the author. Ideally, the extracted measures should only depend on the former.
One simple way to represent documents is based on n-grams of either words or characters. This is a language-independent approach and has been demonstrated to be quite effective in WGI studies (Kanaris and Stamatatos, 2009;Sharoff, Wu, and Markert, 2010a;Kumari, Reddy, and Fatima, 2014). In addition, surface features are considered important for quantifying stylistic properties of documents. These include statistics (i.e., count, mean, max, etc.) of word length (in characters), sentence length (in words), paragraph length (in words), capitalized words, lowercase words, punctuation marks, type/token ratio, etc. (Feldman et al., 2009;Santini, 2005;Onan, 2018). All these features attempt to represent information at the lexical or character level. As a result, they do not require the use of complicated NLP tools and they can practically be extracted from raw text.
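To make the raw-text nature of these features concrete, the following is a minimal sketch of character n-gram counting and a handful of surface statistics. The function names, the regular expressions used for word and sentence splitting, and the particular statistics chosen are illustrative assumptions, not the feature definitions of any of the cited studies.

```python
import re
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in raw text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def surface_features(text: str) -> dict:
    """A few surface statistics that need no NLP tools (assumed tokenization)."""
    words = re.findall(r"[A-Za-z']+", text)
    # Sentences are approximated by splitting on ., ! and ? (an assumption).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sentences = max(len(sentences), 1)
    return {
        "mean_word_length": sum(len(w) for w in words) / n_words,
        "mean_sentence_length": len(words) / n_sentences,
        "capitalized_words": sum(1 for w in words if w[0].isupper()),
        "punctuation_marks": len(re.findall(r"[.,;:!?]", text)),
    }
```

Both functions operate directly on the extracted text of a web-page, which is why such features are cheap to compute and largely language-independent.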
Another idea is to attempt to quantify the difficulty of understanding the information included in documents by using readability assessment features. The main purpose of developing such features is to help in estimating the quality of texts with respect to the degree of comprehension by the reader. Examples of readability assessment features are the word variation index (OVIX), the nominal ratio (NR) and Läsbarhetsindex (LIX) (Falkenjack, Mühlenbock, and Jönsson, 2013). LIX, for instance, is defined as follows:

$$\mathrm{LIX} = \frac{A}{B} + \frac{100 \cdot C}{A}$$

where A is the number of words, B is the number of special characters (i.e., colon, period, capital first letter), and C is the number of long words (more than 6 letters for the English language).
A more sophisticated type of features concerns the syntactic properties of documents since the grammar of sentences is considered important for stylistic purposes (Sharoff, Wu, and Markert, 2010a;Petrenz and Webber, 2011). Moreover, this information is less likely to depend on the topic of documents in comparison to lexical and character features. The simplest form of capturing syntactic information is the use of POS n-grams, where the texts are analyzed by a POS tagger that assigns a tag to each word and then sequences of POS tags are counted. Other syntactic features are based on a more elaborate analysis of documents by NLP tools, like full syntactic parsers. Examples of such syntactic features include average dependency distance, ratio of dependencies, sentence depth (in dependency terms), unigram dependency type (based on token terms), average verbal arity, unigram verbal arity, tokens per clause, number of prepositional components, etc. (Falkenjack, Mühlenbock, and Jönsson, 2013;Falkenjack, Santini, and Jönsson, 2016). A major weakness of such features is that their usefulness depends on the accuracy of the NLP tools used to extract them from documents (Stamatatos, 2009). This is especially crucial in case the documents that have been used for training the NLP tools significantly differ from the documents we want to analyze.
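As an illustration of a readability feature, the sketch below computes LIX in its commonly used form, A/B + 100·C/A, with A the number of words, B the number of sentence delimiters and C the number of long words. The function name `lix` is hypothetical, and counting only periods and colons for B (ignoring capitalized first letters) is a simplifying assumption.

```python
import re


def lix(text: str, long_word_len: int = 6) -> float:
    """LIX readability score: A/B + 100*C/A (simplified delimiter counting)."""
    words = re.findall(r"[^\W\d_]+", text)          # A: alphabetic words
    a = len(words)
    # B: periods and colons only; capital first letters are not counted here.
    b = max(len(re.findall(r"[.:]", text)), 1)
    c = sum(1 for w in words if len(w) > long_word_len)  # C: long words
    if a == 0:
        return 0.0
    return a / b + 100.0 * c / a
```

Higher values indicate longer sentences and a larger share of long words, i.e. text that is presumably harder to read.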
A text is usually viewed as a sequence of words or characters. However, an alternative idea is to construct a graph from a document and then use graph metrics to represent the properties of documents. Such graph-based features are discussed in (Nabhan and Shaalan, 2016) aiming to enhance effectiveness in genre recognition. An unweighted graph is built from each document based on word bigrams found within sentence boundaries. Each word is a node of the graph and if a bigram is found in the text an edge connects the respective words. The frequency of bigram was not taken into account.
Then, graph-based measures are extracted to represent documents, including node degree, clustering coefficient, average shortest path length, network diameter, number of connected components, average neighborhood connectivity, network centralization and network heterogeneity. The average node degree, i.e. the number of neighbor connections, has been shown to be an important criterion for discriminating, for example, scientific from humorous web-pages. A higher average node degree may indicate a preference for using an established vocabulary.
A high value of the clustering coefficient means that there is a tendency for a set of nodes to cohere or stay connected in a sub-network. The Religion, Fiction, and Adventure classes seem to have a relatively high clustering coefficient compared to News, Editorial and Hobbies. A high number of connected components indicates topic diversity within a genre. News and Hobbies have been shown to have a higher score, i.e. higher diversity, than Religion and Fiction. In addition, a relatively high score in network centralization seems to be a good indicator for the Fiction and Adventure genres. The network heterogeneity was found to be higher in News and Hobbies, and this reflects the tendency of the graph to have links between high-degree and low-degree nodes. This can indicate a tendency to use function words in text. Genre-specific graph characteristics were also found in that study (Nabhan and Shaalan, 2016), including a high global clustering coefficient for the Learned and Religious text genres. Moreover, an average local clustering that strongly correlates with the node degree was shown to be a good indicator for genres showing concentration on specific concepts.
Finally, the graph-based measures can also be used for discovering the existence of sub-genre within a genre such as in News. It has been shown that there are some areas within the News genre where the bigram graph has high node connection concentration (or high edge concentration).
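The graph construction and two of the measures discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is treated as undirected, sentences are split on ., ! and ?, and the function names are hypothetical rather than taken from (Nabhan and Shaalan, 2016).

```python
import re
from itertools import combinations


def bigram_graph(text: str) -> dict:
    """Unweighted graph: one node per word, one edge per within-sentence bigram."""
    graph = {}
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z']+", sentence.lower())
        for w in words:
            graph.setdefault(w, set())
        for a, b in zip(words, words[1:]):
            if a != b:                       # ignore self-loops
                graph[a].add(b)
                graph[b].add(a)              # bigram frequency is not recorded
    return graph


def average_degree(graph: dict) -> float:
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def average_clustering(graph: dict) -> float:
    """Mean local clustering coefficient (0 for nodes with fewer than 2 neighbours)."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += links / (k * (k - 1) / 2)
    return total / len(graph)
```

Each document yields one such graph, and the resulting scalar measures form its feature vector.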
In (Kim and Ross, 2010) the Harmonic Descriptor Representation (HDR) of web-pages is proposed. This is inspired by the musical analogy of a string of a musical instrument. Each document is considered to be a temporal sequence of symbols (i.e. characters or words). Particularly, instead of counting the overall frequency of terms, the intervals between the occurrences of terms within the document are measured. This shows how the occurrences of a term are distributed within a document. This approach defines Range as the interval between the first and the last occurrence of the term in a document and Period as the time duration (i.e. the count of characters) between two consecutive occurrences of the term. The HDR encoding of a word is then a tuple of three explicit measurements, defined as follows:
1. FP is the time duration before the first occurrence of the symbol in a document (i.e., the period before the first occurrence divided by the total number of characters in the document).
2. LP is the time duration after the last occurrence of the symbol (i.e., the period after the last occurrence divided by the total number of characters).
3. AP is the average period ratio, calculated from the frequency of symbol s in document d, the set P_s of periods between all consecutive occurrences of s in d, and the length |d| of d in characters.
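A minimal sketch of the three HDR measurements is given below. FP and LP follow the descriptions above directly. For AP we assume, purely for illustration, that it is the mean period between consecutive occurrences divided by |d|, and that a period is measured between the start offsets of two occurrences; the exact normalization used in (Kim and Ross, 2010) may differ. The function name `hdr` is hypothetical.

```python
def hdr(symbol: str, doc: str):
    """Return (FP, LP, AP) for a symbol in a document, or None if it never occurs."""
    positions = []
    start = doc.find(symbol)
    while start != -1:
        positions.append(start)
        start = doc.find(symbol, start + 1)
    if not positions:
        return None
    length = len(doc)
    fp = positions[0] / length                              # before first occurrence
    lp = (length - (positions[-1] + len(symbol))) / length  # after last occurrence
    periods = [b - a for a, b in zip(positions, positions[1:])]
    # ASSUMPTION: AP = mean period / |d|; 0.0 when the symbol occurs only once.
    ap = (sum(periods) / len(periods)) / length if periods else 0.0
    return fp, lp, ap
```

Unlike a plain frequency count, this tuple distinguishes a term concentrated at the start of a page from one spread evenly across it.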

Structural Features
As already discussed, genre is mainly associated with the form of the presented information. However, it is quite unclear how this information can be quantified appropriately. The easiest way is to focus on HTML tags by counting the frequency of HTML tags in the hypertext (Kanaris and Stamatatos, 2009). Special focus in some cases is given to the image tags and the hyperlink tags (Lim, 2005;Levering, Cutler, and Yu, 2008). These sources of information are useful and usually their combination with textual features enhances the performance of WGI models. In addition, there are very few cases where the DOM object structure is analyzed for extracting information, and then usually as part of the whole set of selected features and not as a stand-alone choice (Mehler and Waltinger, 2011). Another interesting approach is to view a web-page as an image and attempt to extract visual features that describe what components are found and in what position (Levering, Cutler, and Yu, 2008). An approach that is based on structural features is presented in (Mehler and Waltinger, 2011). They focus on the web genre of homepages and its sub-genres (i.e., personal, conference, project). The web-pages are first automatically segmented into their constituent parts (e.g., for the personal academic homepage the segments are: contact information, personal information, publications, research, and teaching).
Then, each page is represented according to the detected segments that were found in it (i.e., bag-of-segments). The reported results show a significant increase in performance when this structure-based method is compared with traditional approaches based only on (bag-of-words) textual features.
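The simplest structural representation mentioned in this section, HTML tag frequencies, can be sketched with the Python standard library. The class and function names are illustrative, and normalizing counts by the total number of tags is an assumption rather than the scheme of any cited study.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_frequencies(html: str) -> dict:
    """Relative frequency of each HTML tag in a page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.counts.values())
    return {tag: n / total for tag, n in parser.counts.items()} if total else {}
```

Restricting the counter to `a` and `img` tags would give the hyperlink- and image-focused variants mentioned above.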

Image-related Features
In (Chen et al., 2012) there is a very interesting approach where image processing features have been used in an AGI task applied to office documents. In their experiments, the image-based features were found to be significantly better than the regular textual features used in previous work. The combination of both kinds of features increased the performance even more.
The image-based features were extracted by splitting the image of the document into 25 tiles (5 horizontally and 5 vertically) plus a full-page tile. The features used were: (a) Image density, (b) Horizontal projection, (c) Vertical projection, (d) Color correlogram, (e) Lines, (f) Image size. In all cases the document images were converted to black and white for these features to be extracted. The exception is the correlogram, which analyzed the full color spectrum of the document in its image format. These image-based features are similar to the ones used in (Clark et al., 2014).
• The image density was used for determining where the images and the text were located. In addition, the titles could also be separated from the rest of the text. To capture this feature, the ratio of black to total pixels was calculated for each tile of the document.
• The horizontal projection was used to distinguish slides, where the text is larger and sparser than in non-slide documents. After locating the text boxes (similarly to OCR software), a five-bin histogram was used to identify the dominant text font sizes.
• The vertical projection was used to differentiate papers from tables by capturing the number of text columns and the distribution of their width. Similarly to the horizontal projection, a five-bin histogram of column widths was used.
• The color correlogram represents the spatial correlation of colors. The process starts by quantizing the colors to a 96-level scale in the range from 0 to 1. In addition, 3 pixels are used, thus every tile of the document has 288 dimensions. To reduce the dimensionality even further, the optimal features were selected using the Maximally Relevant Minimally Redundant (mRMR) method, resulting in 50 features per tile. Preserving the location of the spatial color correlation coefficients is important, thus an implicit strategy was followed. Particularly, after mRMR the selected features were kept at their position in the tile vector and then all tile vectors were concatenated into one vector. Finally, the features not selected by mRMR were discarded and the "compressed" form of the concatenated vector was the final outcome of the correlogram preprocessing.
• The lines were used particularly for locating tables. The process operated on the full-page tile and measured the continuous sequences of black pixels in the black and white form of the picture. Then a line-length histogram was used for discriminating the table lines from other lines present in a text, such as the header or footer lines often met in textbooks.
• The image size was computed only on the full-page tile, to find the page size of the document and differentiate papers from slides or pictures, which usually have different sizes, whereas papers are usually delivered in a specific page size.
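The image density feature from the list above can be sketched as follows. The sketch assumes the page has already been binarized into a list of rows with 1 for black and 0 for white pixels; the function name and the way tile borders are rounded are illustrative choices, not details from (Chen et al., 2012).

```python
def tile_densities(image, grid: int = 5):
    """Black-to-total pixel ratio for each of grid*grid tiles plus the full page."""
    height, width = len(image), len(image[0])
    densities = []
    for r in range(grid):
        r0, r1 = r * height // grid, (r + 1) * height // grid
        for c in range(grid):
            c0, c1 = c * width // grid, (c + 1) * width // grid
            black = sum(image[y][x] for y in range(r0, r1) for x in range(c0, c1))
            area = (r1 - r0) * (c1 - c0)
            densities.append(black / area if area else 0.0)
    full_page = sum(sum(row) for row in image) / (height * width)
    return densities + [full_page]   # 25 tile values followed by the full-page value
```

With the default 5 by 5 grid this yields the 26 density values (25 tiles plus the full page) described above.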
The experiments reported in that study concern a very special case of AGI research and a very specialized taxonomy of office documents. The corpus included papers in PDF format, photos in JPG format, PowerPoint slides, and tables in documents. This corpus was collected and annotated manually. Fleiss' Kappa was used to measure the agreement between annotators and thus evaluate the quality of the corpus (the Kappa score ranged from 0.88 to 0.92).

Hyperlinks and URL-based Representation
The web is structured as a directed graph where each web-page is linked with other pages through hyperlinks. Information about incoming and outgoing hyperlinks is important for WGI. In addition, information found in web-pages that are linked with the one in question could also be used.
In addition, each web-page has a unique address, the Uniform Resource Locator (URL), that is used to identify it. Usually, important information is encoded in URLs and sometimes this may refer to genre. For example, the string "blog" is quite likely to appear in the URL of blogs. Several previous studies attempt to exploit this kind of information.
To begin with, one study is based on the web graph and the implicit genre relation among web-pages, assuming that neighbouring web-pages are more likely to belong to the same genre, a property called homophily. The content of neighbouring pages is then used to enhance the representation of a given web-page in a semi-supervised learning framework (Asheghi, Markert, and Sharoff, 2014).
GenreSim is a link-based graph model which exploits link structure to select relevant neighbouring pages in order to amplify the information required for a page to be classified within a genre taxonomy. This algorithm improves the performance of WGI significantly in cases where the textual information in a web-page is very limited, such as movie homepages, photography websites, etc. On the other hand, the reported experimental results indicate that in regular web-pages, where the textual part consists of at least a couple of paragraphs, the advantage of using hyperlink-based graph information is not remarkable (Zhu, Zhou, and Fung, 2011;Zhu et al., 2016). GenreSim is a ranking algorithm based on the PageSim algorithm, extended to fit the problem of WGI. Like all algorithms of this kind, it is based on the assumption that the more web-pages refer to a particular page, the more this page is related to them with respect to topic and/or genre. As concerns the genre class, GenreSim focuses on forward F(p) and backward B(p) hyperlinks. Moreover, utilizing the entire graph structure, web-pages are characterized as Hubs H(p) or Authorities A(p). The underlying hypothesis of the algorithm is that web-pages of the same genre are inter-connected through their hyperlinks. Consequently, a few pages backwards and forwards from a specific web-page compose a small network of the same genre. Using this "genre-network", the textual (and partially the structural) information of neighbouring web-pages can be used to amplify the signals required to classify a new web-page to that genre.
In more detail, hubs are pages with many outgoing hyperlinks, whereas pages with many incoming hyperlinks are called authorities. The numbers of incoming and outgoing hyperlinks increase the respective scores, which are computed over V, the set of vertices (web-pages) of the graph. Web-pages with a high score but few backward hyperlinks are quite likely to be spam pages. To regulate this, the ω(p) factor reduces the score of web-pages with few backward hyperlinks. This is also useful for normalizing the few-links issue, that is, the number of backward links is correlated with the number of links the page itself contains. The factor is defined in terms of N, the number of backward links of the original page, and N(p), the number of backward links of neighbouring page p. Based on these quantities, a score is calculated for each web-page in a given graph, and the recommendation score of page u is propagated to page v across the path P(u, v). The score of a recommended web-page decreases gradually as this page lies further away (in hops) from the web-page to be classified. The d factor is set to 0.5, i.e., the page score is halved for every hop away from the page under examination. Finally, a similarity score between two pages is defined and used to select the top K most similar neighbouring pages with respect to a given page. All these pages are analyzed to provide features.
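The sketch below illustrates only the hop-based decay described above, not the full GenreSim scoring (hub and authority scores, the ω(p) factor and the similarity measure are omitted). It assumes a hypothetical `links` dictionary mapping each page to its forward hyperlinks, follows both forward and backward links, and assigns each neighbouring page the weight d raised to its hop distance.

```python
from collections import deque


def hop_decay_weights(links: dict, start: str, d: float = 0.5, max_hops: int = 2) -> dict:
    """Weight d**hops for every page within max_hops of `start` (breadth-first)."""
    # Build an undirected neighbourhood so backward links are followed too.
    neighbours = {}
    for src, targets in links.items():
        for dst in targets:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
    weights = {start: 1.0}
    queue = deque([(start, 0)])
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue
        for nxt in neighbours.get(page, ()):
            if nxt not in weights:          # first visit is the shortest path
                weights[nxt] = d ** (hops + 1)
                queue.append((nxt, hops + 1))
    return weights
```

With d = 0.5, a page one hop away contributes half as much as the page under examination and a page two hops away a quarter, which is the intuition behind keeping the "genre-network" small.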
It should be underlined that hyperlinks themselves can be exploited by extracting information from the URL string and not from the hyperlink graph. Particularly, a URL can be segmented into its components, i.e. the domain name, the path after the domain and the anchor text. Special characters such as { , ., ?, $, %}, top-level domains {.gr, .uk, .com, etc.}, and file suffixes such as ".html" and ".pdf" are usually discarded and then character n-grams are extracted from the URL components. WGI experiments using only the hyperlink information, combined (or not) with other web-page information, seem to be a promising research path, especially for performance-oriented WGI applications such as genre-based focused crawling where only the URLs are available (Jebari, 2014;Jebari, 2015;Abramson and Aha, 2012;Priyatam et al., 2013).
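A minimal sketch of URL-based character n-grams follows. It uses only the host and path components, drops the last host label as the top-level domain (so multi-part suffixes such as ".co.uk" are not handled), and strips a small, assumed list of file suffixes; the function name and these choices are illustrative and not taken from the cited studies.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def url_char_ngrams(url: str, n: int = 3) -> Counter:
    """Character n-grams of the cleaned host and path components of a URL."""
    parts = urlsplit(url)
    host_labels = (parts.hostname or "").split(".")
    host = "".join(host_labels[:-1])                     # discard the top-level domain
    path = re.sub(r"\.(html?|pdf|php)$", "", parts.path)  # discard file suffix
    counts = Counter()
    for component in (host, path):
        token = re.sub(r"[^a-z0-9]", "", component.lower())  # drop special characters
        counts.update(token[i:i + n] for i in range(len(token) - n + 1))
    return counts
```

Because it needs nothing but the URL string, such a representation can be computed before a page is downloaded, which is what makes it attractive for focused crawling.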

Combination of Features
Instead of using only one type of features, studies in genre recognition tend to combine several sources of information (Lim, 2005). Textual features are usually considered more important and are combined with alternative kinds of features. Such combinations usually increase the effectiveness of the method (Kanaris and Stamatatos, 2009).
An example of a combination of textual features from different levels of analysis is reported in (Onan, 2018). The following features are used:
• Most frequent words (function words).
• Character n-grams.
• POS n-grams.
• Capitalized and lowercase words.
• Punctuation marks.
• Semantic feature (time and money entities).
• Genre-specific features (n-grams occurring many times within a genre).
In a similar fashion, (Waltinger and Mehler, 2009) (Ströbel et al., 2018;Virik, Simko, and Bielikova, 2017). Interestingly, for each feature, the required NLP analysis to extract such measures from documents is also shown. It has to be noted that elaborate types of NLP analysis (e.g. syntactic parsing) introduce a cost concerning the efficiency of the model. In addition, such features are language-dependent.

Domain-specific Genre Representation
Beyond general characteristics that can be extracted from web-pages and be useful in any WGI task, there are domain-specific features related to certain genres and domains that provide a rich representation of their properties.
Blog is a genre of special interest for several research domains and, as might be expected, it has its own particular characteristics. These features require lexical analysis, morphological analysis, lightweight syntactic analysis, and structural analysis of documents before they become available. Table 2.2 presents in detail a rich set of such linguistic properties used for classifying Blog sub-genres. In (Virik, Simko, and Bielikova, 2017) there is a detailed analysis of the correlation between the linguistic features and the Blog sub-genres. Examples of these sub-genres are the following: informative, affecting, reflective, narrative, emotional and rational.
In (Dai, Taneja, and Huang, 2018) the focus is on the News genre. They use a combination of features to recognize the main paradigms of presenting events in news. These features include word unigrams and bigrams, syntactic features like the frequency of syntactic production rules as well as primitive semantic information provided by a pre-defined dictionary (Linguistic Inquiry and Word Count (LIWC)). The latter indicates terms associated with time, motion, and space, important information for quantifying the narrative scheme of the news story. In addition, key events placement features are introduced that attempt to quantify information about specific persons, time, and location of the news story and the point of the document at which they occur. In practice, these features calculate the overlap of title with the paragraphs of the document.
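The title-overlap idea behind the key event placement features can be illustrated with a short sketch. The overlap measure used here (the fraction of distinct title words that also appear in each paragraph) and the function name are assumptions made for illustration; (Dai, Taneja, and Huang, 2018) may define the overlap differently.

```python
import re


def title_overlap_by_paragraph(title: str, paragraphs: list) -> list:
    """Fraction of distinct title words found in each paragraph, in document order."""
    def tokenize(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    title_words = tokenize(title)
    if not title_words:
        return [0.0] * len(paragraphs)
    return [len(title_words & tokenize(p)) / len(title_words) for p in paragraphs]
```

A profile that peaks in the first paragraph and then decays is what one would expect from an inverted-pyramid story, whereas a later peak hints at a narrative or Kabob structure.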
Automated genre identification is a subject of interest in the domain of intellectual products (e.g. paintings, music, movies, etc.). Taxonomies of movies are also of special interest for the technology and entertainment industries. The part of this research related to the current thesis is the one where movie genre is induced from textual features, such as subtitles and the text description of video content. Features that are specifically defined for this domain are summarized in Table 2.3. Particularly, BOW, surface and syntactic features are combined. Surface features include content-free and content-specific (the ones related to specific topics) information (Lee, 2017). It has been found that not all of these features are equally important for this task. The most important of them are: type/token ratio, words per minute, characters per minute, hapax legomena, dis legomena, short words ratio, and ratios of (10, 4, 3, 1)-letter words.
Wikipedia (and Wiki sites in general) is considered a special genre due to the richness of textual content per page and, secondarily, its informative linguistic register. There are several sub-genres of wiki pages which are also characterized as popular science websites and web documents (e.g. Wikipedia, Nature, New Scientist, Wikinews, etc.). There are some domain-specific features that seem to work well for classifying wiki-pages into a sub-genre taxonomy. Table 2.4 shows the set of features used for representing sub-genres of popular science and grouping web-pages with similar properties (Lieungnapar, Todd, and Trakulkasemsuk, 2017).
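Several of the surface features listed above for subtitle text can be computed in a few lines. The sketch below is illustrative only: the function name is hypothetical, the tokenization is a simple regular expression, and treating words of at most three letters as "short" is an assumption, since the threshold is not specified here.

```python
import re
from collections import Counter


def subtitle_surface_features(text: str, duration_minutes: float) -> dict:
    """Vocabulary-richness and rate features for a subtitle transcript."""
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(words)
    n = len(words)
    return {
        "type_token_ratio": len(freq) / n,
        "words_per_minute": n / duration_minutes,
        "hapax_legomena": sum(1 for c in freq.values() if c == 1),  # words seen once
        "dis_legomena": sum(1 for c in freq.values() if c == 2),    # words seen twice
        "short_words_ratio": sum(1 for w in words if len(w) <= 3) / n,  # assumed threshold
    }
```

The rate features (words per minute) require the running time of the video, which is why they are specific to this domain.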
On the other hand, it is also crucial to study what features used in genre recognition studies remain unaffected by domain variations. This is especially important in genres like News as well as Online reviews. In such cases, it is very important to avoid topic-related information. Ideally, a WGI approach trained with samples of a specific topic (e.g., sports) could be applied to other topics (e.g., politics) without a significant drop in its performance. This is called domain transfer learning (Finn and Kushmerick, 2006). Table 2.5 comprises a topic-neutral set of features (mainly composed of function words and punctuation marks) to achieve this.

Feature Weighting and Selection
Term weighting is an essential issue in text mining applications. The features extracted from web-pages can be represented using a variety of traditional weighting schemes such as Binary representation, Term Frequency (TF), and Term Frequency-Inverse Document Frequency (TF-IDF) (Sharoff, Wu, and Markert, 2010a; Santini, 2007).
The binary scheme is the simplest one: each term is represented by a binary value indicating its occurrence or absence in the document. Despite its simplicity, very good results were obtained using this scheme in WGI studies (Kanaris and Stamatatos, 2009; Sharoff, Wu, and Markert, 2010a).
TF weighs each term according to its frequency in the document. Several variations of this approach can be found in the literature. For example, the raw frequency of terms can be used. This certainly depends on the length of documents. Another idea is to normalize the raw frequency of a term over text length:
$$tf(t, d) = \frac{f(t, d)}{|d|}$$
where $f(t, d)$ is the raw frequency of term t in document d and $|d|$ is the length of d. Yet another modification is to divide the raw frequency by the maximum frequency of any term in document d.
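The binary and TF variants described above can be compared side by side with a short Python sketch. The function name `term_weights` and the scheme labels are hypothetical, and document length is assumed to be the number of tokens.

```python
from collections import Counter


def term_weights(tokens, scheme="tf_length"):
    """Weight the terms of one tokenized document.

    Schemes: "binary", "tf_raw", "tf_length" (raw frequency divided by the
    document length in tokens), "tf_max" (raw frequency divided by the
    maximum term frequency in the document).
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    if scheme == "binary":
        return {t: 1 for t in counts}
    if scheme == "tf_raw":
        return dict(counts)
    if scheme == "tf_length":
        total = len(tokens)
        return {t: c / total for t, c in counts.items()}
    if scheme == "tf_max":
        top = max(counts.values())
        return {t: c / top for t, c in counts.items()}
    raise ValueError(f"unknown scheme: {scheme}")
```

For example, `term_weights(["a", "b", "a", "c"], "tf_max")` gives the most frequent term a weight of 1 and scales the others relative to it.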
TF-IDF is a weighting scheme that balances the weights of document terms (e.g., word n-grams, character n-grams, POS n-grams, etc.) given a collection of documents. It regulates the significance of the very low and very high frequency terms of the collection. That is, it decreases the value of the very high frequency terms (i.e., function words), and increases the importance of very low frequency terms when they occur in only a few documents. The calculation of a term's IDF in a document collection is shown below:
$$idf(t) = \log \frac{N}{df(t)}$$
where N is the number of documents in the collection and $df(t)$ is the document frequency of t, that is, the number of distinct documents in which it occurs.
Although TF-IDF is a popular choice in many text mining studies, the study of (Sugiyanto et al., 2014) demonstrates that it is not the best choice for WGI tasks. Instead, they propose a genre-specific weighting scheme, called TF-IGF. The main idea is that instead of considering a collection of documents, they consider a collection of genres (i.e., each genre is a collection of documents). Then, the terms are weighted by using $f(t, g)$, the frequency of term t in genre g, and $gf(t)$, the genre frequency of t (i.e., the number of different genres in which it occurs). Since TF-IGF depends on genre, the average value over all genres in a given palette is finally used. The TF-IGF score can be used to select the most informative features that highlight genre-related information, and reported results show that it is a better criterion for feature selection in comparison to regular TF-IDF (Sugiyanto et al., 2014).
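To make the contrast between the two schemes concrete, the sketch below computes IDF over a document collection and a genre-level analogue. The IDF part follows the standard logarithmic definition. The TF-IGF part is an assumption: the exact formula of (Sugiyanto et al., 2014) is not reproduced here, so the code simply mirrors IDF at the genre level, as $f(t, g) \cdot \log(|G| / gf(t))$ averaged over the genres. Function names are hypothetical.

```python
import math
from collections import Counter


def idf(documents):
    """documents: list of token lists. Returns idf(t) = log(N / df(t))."""
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {t: math.log(n_docs / d) for t, d in df.items()}


def tf_igf(genres):
    """genres: dict mapping a genre label to a list of token lists.

    Assumed form (by analogy with TF-IDF, not taken from the cited paper):
    f(t, g) * log(|G| / gf(t)), averaged over all genres of the palette.
    """
    # f(t, g): frequency of term t over all documents of genre g.
    freq = {g: Counter(tok for doc in docs for tok in doc)
            for g, docs in genres.items()}
    # gf(t): number of genres in which term t occurs.
    gf = Counter(t for counts in freq.values() for t in counts)
    n_genres = len(genres)
    return {
        t: sum(freq[g][t] * math.log(n_genres / gf[t]) for g in genres) / n_genres
        for t in gf
    }
```

Under this assumed form, a term that occurs in every genre receives a score of zero, so sorting terms by score gives a simple genre-oriented feature ranking.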
In (Kanaris and Stamatatos, 2009) a frequency-based method to select the most promising features is described. Initially, the feature set comprises character n-grams of variable length (n = {3, 4, 5}). Then the LocalMaxs algorithm is used to find the most prominent n-grams, taking into account the frequencies of constituent n-grams of lower order and using a glue function. The reported results show that this simple approach is quite effective in WGI tasks.
Another WGI-specific term weighting scheme has been suggested to deal with features obtained from URLs of web-pages (Jebari, 2014). In particular, an approach called Structure-oriented Weighting Technique (SWT) first extracts character n-grams from URLs and then weights each n-gram based on f(t, s, d), the raw frequency of n-gram t in section s of document (i.e., URL) d. Namely, this approach assumes that the URL is segmented into fields and that each field has its own importance, expressed by the weights {α, β, γ}. These weights should be defined empirically using a training corpus (Jebari, 2014).
Table 2.5 (fragment). Function words: it, its, itself, large, little, many, may, me, might, mighten, mine, mostly, much, musn't, must, my, nearly, our, perfectly, probably, several, shall, she, should, shouldn't, since, some, strongly, that, their, them, themselves, therefore, these, they, this, thoroughly, those, tonight, totally, us, utterly, very, was, wasn't, we, were, weren't, what, whatever, when, whenever, where, wherever, whether, which, whichever, while, who, whoever, whom, whomever, whose, why, will, won't, would, wouldn't, you, your. Punctuation marks: ! " $ % ' ( ) * + - . : ; = ?
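Returning to the URL-based weighting discussed above, the idea of section-dependent n-gram weights can be sketched as follows. Several things here are assumptions rather than details taken from (Jebari, 2014): the URL is split into domain, directory path and document name, the section weights are combined as a weighted sum of raw frequencies, and the default values of `alpha`, `beta` and `gamma` are arbitrary placeholders that would have to be tuned on a training corpus.

```python
from collections import Counter
from urllib.parse import urlsplit


def char_ngrams(text, sizes=(3, 4, 5)):
    """All character n-grams of the given sizes."""
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]


def swt_weights(url, alpha=3.0, beta=2.0, gamma=1.0):
    """Weight URL n-grams by the section in which they occur (assumed form)."""
    parts = urlsplit(url)
    # Split the path into its directory part and the final document name.
    directory, _, name = parts.path.rpartition("/")
    sections = [(parts.netloc, alpha), (directory, beta), (name, gamma)]
    weights = Counter()
    for text, section_weight in sections:
        # f(t, s, d): raw frequency of n-gram t in this section of the URL.
        for gram, freq in Counter(char_ngrams(text.lower())).items():
            weights[gram] += section_weight * freq
    return dict(weights)
```

With these placeholder weights, an n-gram found in the domain name counts three times as much as the same n-gram found in the document name.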

Machine Learning Approaches to Genre Identification
Genre identification of documents is generally viewed as a text categorization task.
After defining a feature space to represent documents, a classification algorithm can be applied to a training set in order to learn to distinguish between genres. As already pointed out, the majority of previous studies consider this to be a closed-set classification task. In addition, most of the existing studies consider a flat genre palette where each genre is independent of the other genres. In the remainder of this section, the machine learning algorithms that have been used to learn the properties of genres are discussed according to the adopted setup of the task.

Closed-set Genre Recognition
The bulk of research in this area adopts a closed-set classification framework. Several well-known machine learning algorithms have been used for this task, including SVM, Naive Bayes, Random Forest, Decision Trees, and ensemble-based models (Lim, 2005; Santini, 2007; Kanaris and Stamatatos, 2009; Jebari, 2015; Sharoff, Wu, and Markert, 2010a). The SVM classifier has been tested in both binary and multi-class WGI tasks (Dai, Taneja, and Huang, 2018). It is an algorithm that can easily handle high-dimensional and sparse feature spaces (Joachims, 1997). In (Sharoff, Wu, and Markert, 2010a), detailed experiments using a variety of datasets demonstrated that SVM-based WGI models combined with character n-gram features could surpass the best reported results in most cases. In addition, (Virik, Simko, and Bielikova, 2017) compare SVM models with Naive Bayes and k-Nearest Neighbours models on the recognition of Blog sub-genres. The reported results show that SVM obtained higher accuracy. Recently, an SVM-based approach was tested on the very challenging case of cross-lingual genre classification (i.e., when the training documents are in one language and the test documents in another language) and obtained very promising results (Nguyen and Rohrbaugh, 2019).
Distance-based approaches in the WGI task mainly include variations of nearest-neighbor classifiers. One particular case is based on distances between ranked feature distributions (Waltinger and Mehler, 2009). The features of the samples of a class are ranked in descending order according to their TF or TF-IDF values. In order to measure the distance of a new web-page from the classes, the features of the new web-page are also ranked, and the differences in rankings then indicate the most similar class. That is, the actual TF or TF-IDF values of the features are no longer important, since only the ranking of features is considered. Moreover, when a feature is missing from either the new web-page or a class, a predefined Max value is assigned. The total ranking distance between a web-page d and a class g (Equation 2.14) accumulates these rank differences and Max values over the features. The new web-page is then classified to the nearest class. The accuracy of this method has been reported to surpass that of SVM using the same features (Waltinger and Mehler, 2009).
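The ranking-based distance described above can be sketched in a few lines. The details are assumptions, since the exact formula of (Waltinger and Mehler, 2009) is not reproduced here: the per-feature term is taken to be the absolute difference of ranks, the sum runs over the union of the features of the page and the class, and `max_penalty` plays the role of the predefined Max value.

```python
def rank_of(weights):
    """Map each feature to its rank (0 = highest weight).

    Ties are broken alphabetically so that the result is deterministic.
    """
    ordered = sorted(weights, key=lambda t: (-weights[t], t))
    return {t: r for r, t in enumerate(ordered)}


def ranking_distance(doc_weights, class_weights, max_penalty):
    """Sum of rank differences; missing features cost `max_penalty`."""
    doc_rank, class_rank = rank_of(doc_weights), rank_of(class_weights)
    total = 0
    for t in set(doc_rank) | set(class_rank):
        if t in doc_rank and t in class_rank:
            total += abs(doc_rank[t] - class_rank[t])
        else:
            total += max_penalty
    return total


def classify(doc_weights, classes, max_penalty=1000):
    """Assign the page to the class with the smallest ranking distance."""
    return min(classes,
               key=lambda g: ranking_distance(doc_weights, classes[g], max_penalty))
```

Here `doc_weights` and each entry of `classes` map features to TF or TF-IDF values; only their order matters once the ranks are computed.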
Following the impressive performance obtained in classification tasks involving natural language texts, deep learning algorithms have also been tested in WGI tasks (Ströbel et al., 2018). A recurrent neural network comprising 200 gated recurrent unit cells in the hidden layer has been suggested. On top of that, a fully-connected layer assigns documents to classes using a Softmax decision function. Very promising results are reported for this deep learning model in closed-set WGI tasks.
In another recent study, a variety of deep learning algorithms are compared with traditional methods and the latter seem to be more accurate in genre identification tasks (Worsham and Kalita, 2018). In more detail, a convolutional neural network, a long short-term memory network and a hierarchical attention network have been applied to recognition of literary genres. However, they were outperformed by relatively simple models based on traditional machine learning algorithms. In addition, deep learning methods considerably increase the training time cost and require special hardware infrastructure to handle long texts.
Instead of learning a single model, ensemble methods attempt to build several base models and then combine them. One main direction is to use well-known ensemble learning methods such as AdaBoost, Bagging and Random Forests (Sugiyanto et al., 2014; Onan, 2018; Worsham and Kalita, 2018). This approach can easily handle high-dimensional representations and heterogeneous features.
Another idea is to build a separate model for each web-page modality. For example, an ensemble algorithm called Multiple Classifier Combination (MCC) is presented in (Zhu et al., 2016). Particularly, the main idea is to use information from the web-page to be classified into a given genre palette as well as information from a set of neighbouring web-pages (i.e., pages that are near the specific web-page in the graph formed by hyperlinks between pages). The MCC algorithm builds a set of SVM classifiers, each trained using a particular set of features. Then a decision matrix is formed including all predictions of the base SVM classifiers:
$$D = \begin{pmatrix} d_{11} & d_{12} & \cdots & d_{1|G|} \\ d_{21} & d_{22} & \cdots & d_{2|G|} \\ \vdots & \vdots & \ddots & \vdots \\ d_{N1} & d_{N2} & \cdots & d_{N|G|} \end{pmatrix}$$
where $d_{ij}$ is the membership degree given by classifier i to genre j, N is the number of base classifiers, and $|G|$ is the number of genres. Then, the final decision is taken by applying simple methods to combine these predictions column-wise, such as the min, max or average rules.
Another late fusion ensemble is proposed in (Finn and Kushmerick, 2006). Again, the idea is to build homogeneous base models, each trained only on a specific feature subset. In the testing phase, majority voting is a common strategy. Particularly, in their study they learn C4.5 decision trees for different web-page modalities (i.e., BOW, POS, text statistics features) and then build a Multi View Ensemble that combines the predictions of the modality-specific models. It is important to note that in the training phase active learning is used. This is a sample selection strategy where an evaluation process indicates which sample is best used for the specific C4.5 learner, for a given feature set. The late fusion ensemble with the active learning strategy obtained promising results, including in the domain transfer scenario.
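The column-wise combination of a decision matrix can be illustrated with a small sketch. It assumes that the matrix is given as a list of rows (one row of membership degrees per base classifier) and that the winning genre is the one with the highest combined score; the function name `combine` is hypothetical.

```python
def combine(decision_matrix, rule="average"):
    """Combine base-classifier outputs column-wise.

    decision_matrix[i][j] is the membership degree given by classifier i
    to genre j. Returns (index of the winning genre, combined scores).
    """
    columns = list(zip(*decision_matrix))  # one column per genre
    if rule == "min":
        scores = [min(col) for col in columns]
    elif rule == "max":
        scores = [max(col) for col in columns]
    elif rule == "average":
        scores = [sum(col) / len(col) for col in columns]
    else:
        raise ValueError(f"unknown rule: {rule}")
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores
```

The rules can disagree: for the matrix `[[0.9, 0.5], [0.1, 0.5], [0.2, 0.5]]` the max rule selects the first genre, whereas the min and average rules select the second.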

Open-set Classification
Most previous studies in WGI consider the case where all web pages should belong to a predefined taxonomy of genres (Lim, 2005; Santini, 2007; Kanaris and Stamatatos, 2009; Jebari, 2014). This corresponds to the closed-world assumption. However, this naïve assumption is not appropriate for most applications related to WGI, since it is not possible to construct a universal genre palette that covers at least a great extent of the Web. Web-pages that do not belong to any of the predefined genres are considered noise; noise also includes web-pages where multiple genres co-exist (Santini, 2011; Levering, Cutler, and Yu, 2008).
Noise in WGI can be categorized into unstructured noise and structured noise. The former assumes that there is no information about the composition of noise (i.e., a random collection of web-pages not belonging to the known genres) (Santini, 2011). The latter assumes that noise is composed of specific unknown genres (i.e., genres for which there are no training examples). However, it is highly unlikely that such a collection represents the real distribution of pages on the web.
The effect of noise in WGI was first studied in (Shepherd, Watters, and Kennedy, 2004;Kennedy and Shepherd, 2005) where predefined genres were personal, organizational, and corporate home pages while noise consisted of non-home pages. However, the distribution of pages into these four categories was practically balanced, hence it was not realistic. In another study, a clustering framework is used where one cluster is built for each predefined class and another cluster is built for the noise (Kennedy and Shepherd, 2005).
To handle noise in WGI there are two options. The first is to adopt the closed-set classification setup with one predefined category devoted to noise. That is, training examples for the known classes as well as for the noise class are provided. Since this category would comprise all web pages not belonging to the known genre labels, it would not be homogeneous. Moreover, this noise class would be much larger than the other genres, causing class imbalance problems.
A few studies follow this direction. In these cases samples of noise are available in the training phase of the prediction model. In (Vidulin, Luštrek, and Gams, 2007) structured noise samples constitute the negative class of a binary classifier. The most common approach to handle noise is to build binary classifiers where the positive class is based on a certain predefined category and the negative class is based on the concatenation of all other predefined categories plus the noise (Kennedy and Shepherd, 2005). In (Dong et al., 2006) noise is used as the majority class in an experiment where 190 instances from personal homepage, FAQ, and e-shop categories were used in combination with 600 noise pages. Similarly, (Levering, Cutler, and Yu, 2008) use about 200 instances for the predefined genres of store homepages, product lists, and product descriptions in combination with about 800 other pages (noise). Such a combination of binary classifiers can also be seen as a multi-label and open-set classification model where a web page can belong to different genres and it is possible for a page not to belong to any of the predefined genres. From another point of view, (Jebari, 2015) considers outlier samples of known genre labels as noise and excludes them from computing the centroids that represent genres in multi-label WGI. The centroid of a genre is then adjusted each time a new web page is classified to that genre.
Another, more realistic option is to adopt the pure open-set classification setting where training examples belong only to known classes and it is possible for some web pages not to be classified into any of the predefined genre categories. This setup avoids the problem of class imbalance caused by numerous noisy pages and also avoids the problem of handling a diverse and highly heterogeneous class. On the other hand, open-set classification requires stronger generalization than the closed-set setup.
A more concrete open-set classification model for WGI is presented in (Stubbe, Ringlstetter, and Schulz, 2007). They use the idea of mono-classification for the WGI task, building a genre-specific classifier that follows a cascading sequence of genre-specific rules. Initially, a feature set is considered (including surface, lexical, HTML-based and POS features) and rules based on these features are constructed. The genre-specific rules are evaluated based on the F1 measure. In the rule construction phase, the training data were used for determining feature selection and combination criteria. Then, the classification criteria were formed into a decision tree. The ones that led to performance improvement were kept and the rest were discarded. Examples of such rules are shown in Table 2.6.
The mono-classification cascading sequences were finally formed using F1-ordering or Confusion-matrix-ordering.
In its original version (Stubbe, Ringlstetter, and Schulz, 2007), the algorithm was tested on a noise-free corpus, but in (Asheghi, 2015) it was applied to a corpus with unstructured noise and its performance was drastically lower.
Table 2.6: Examples of genre-specific rules (Stubbe, Ringlstetter, and Schulz, 2007).

| Rule | Criteria |
| --- | --- |
| Continuous text | (NV > 18) ∧ (NC > 2) |
| Literary or casual language | (NSBA > 17) ∧ (SA/A > 0.5) ∧ (SA/A < 4) ∧ (C < 2.5) ∧ (CL < 3) |
| FAQ, Interview, Filter commentaries | (AL < 1.3) ∧ (GL < 3.8) ∧ (Q < 3) |

NV = number of verbs, NC = number of conjunctions, NSBA = number of sentiment bearing adjectives, SA = sentiment adjectives, A = adjectives, C = contractions, CL = casual language, AL = arguing language, GL = generalizing language, Q = question marks.

Another open-set classifier is presented in (Chen et al., 2012). It is an ensemble approach composed of two base SVM classifiers, each trained using a different, mutually exclusive training subset. The assumption of this approach is that part of the support vectors will be optimized for every SVM, preserving the generalization of the two independent models, and that the combined classification will manage to fit well over the whole corpus. The combination of the output of the base models is a pairwise genre operation that examines, for each pair of genre classes {k, m}, whether the SVM classifiers {g_1, g_2} agree on their decision about a new web-page. For any new web-page, the truth table of this binary rule for all genre pairs might end up full of zero values. In that case, the page remains unclassified. In all other cases at least one genre will be returned as true.
The above ensemble is an early fusion approach where the different features and document representations are all combined in a concatenated vector for each document. Then the concatenated vectors are the input for the learners of the ensemble.
As already discussed, the web is a dynamic environment and web genres tend to evolve through time. New genres emerge and existing genres are modified over time. Genres adapt either because of a transition between media, such as from News on paper to News on the Web, or because of novelties emerging within the medium itself, such as Blogs, which have evolved into micro-Blogs and Social Media. A characteristic study of the temporal behaviour of web genres is presented in (Caple and Knox, 2017), which analyzes how News (as a web genre) has changed over time and how News sub-genres appeared.
Most WGI approaches assume a static genre palette and a set of training examples for each genre that are representative of its characteristics. An approach that attempts to take into account the evolution of genres is presented in (Jebari, 2015). The Enhanced Centroid-based Classification (ECC) algorithm builds an incremental centroid-based ensemble where genres are represented by their centroids and these centroids are adjusted after the classification of each new web-page.
In more detail, the ECC algorithm calculates an initial centroid for every genre g given a set of training examples $T_g$; the calculation normalizes vectors using $\|x\|$, the 2-norm of vector x. Then, each training sample is re-examined. If its similarity to the centroid is lower than a threshold, it is considered to be an outlier and it is excluded from the new calculation of the centroid for that class. The threshold is defined as the average of the similarities of the training examples to the centroid. In the testing phase, each new web-page is examined separately. If the similarity of a new web-page with the centroid of a genre surpasses the threshold, then the web-page is used to re-calculate the centroid and the threshold of the genre. That way ECC can adapt to the evolution of genres in time. Note that this is a multi-label classification method since it assumes that a web-page can belong to more than one genre. In addition, this is an open-set classification method because when the similarity of a new web-page is not higher than any genre threshold, this page is left unclassified. On the other hand, this algorithm is sensitive to noise, since web-pages not actually belonging to any of the known genres but misclassified to a known genre can be used to alter its properties (centroid and threshold).
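The incremental behaviour of ECC described above can be sketched as follows. This is an illustrative approximation, not the implementation of (Jebari, 2015): cosine similarity, the centroid as the normalized sum of unit vectors, and keeping the initial threshold after outlier removal are all assumptions, and the names `GenreModel` and `classify` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit 2-norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def cosine(a, b):
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def centroid(vectors):
    """Normalized sum of the unit-normalized vectors (assumed form)."""
    return unit([sum(col) for col in zip(*(unit(v) for v in vectors))])


class GenreModel:
    def __init__(self, training_vectors):
        vectors = list(training_vectors)
        first = centroid(vectors)
        sims = [cosine(v, first) for v in vectors]
        # Threshold: average similarity of training examples to the centroid.
        self.threshold = sum(sims) / len(sims)
        # Outliers (similarity below the threshold) are excluded before the
        # centroid is recomputed; fall back to all vectors if none remain.
        self.members = [v for v, s in zip(vectors, sims)
                        if s >= self.threshold] or vectors
        self.centroid = centroid(self.members)

    def absorb(self, page):
        """Re-calculate centroid and threshold after accepting a new page."""
        self.members.append(page)
        self.centroid = centroid(self.members)
        sims = [cosine(v, self.centroid) for v in self.members]
        self.threshold = sum(sims) / len(sims)


def classify(page, models):
    """Return every genre whose threshold the page surpasses.

    An empty list means the page is left unclassified (open-set rejection).
    """
    labels = []
    for genre, model in models.items():
        if cosine(page, model.centroid) > model.threshold:
            labels.append(genre)
            model.absorb(page)
    return labels
```

The sketch also makes the noise sensitivity visible: every accepted page, whether correctly classified or not, moves the centroid and the threshold of the genre.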

Semi-supervised and Unsupervised Learning
In WGI, there is a lack of large labeled corpora including multiple genres and sufficient training samples per genre. On the other hand, there are plenty of unlabeled examples that can easily be used. One suggested direction in WGI research is the use of semi-supervised learning approaches in order to exploit the virtually infinite amount of unlabeled data on the Web.
Particularly, in (Chetry, 2011) the Co-Training algorithm is used. Their model is based on an SVM and a Naive Bayes classifier trained on 1,232 labeled web-pages as well as a set of 20,000 unlabeled samples. Co-Training is based on an iterative process where the unlabeled data are classified by the classifiers initially trained using the labeled data. In every iteration the highest ranked unlabeled samples, in terms of classification certainty of the classifiers, are labeled accordingly and the training of the classifiers restarts using these new samples together with the previously labeled samples. The process continues until all unlabeled samples have been used or a specific number of iterations is reached. The reported results show an improvement in performance over a regular SVM classifier. The reported experiments adopt a closed-set framework with a corpus including the genres of Spam, Discussion, Educational Research, News Editorial, Commercial, Personal Leisure.
Another semi-supervised learning approach is described in (Asheghi, Markert, and Sharoff, 2014). Given the graph of web-pages formed using the hyperlinks, the min-cut algorithm is used to estimate the minimum cost of labeling based on the homophily assumption (i.e., neighboring pages tend to belong to the same genre). The reported results indicate that this assumption is valid for particular genres such as blogs and news.
Domain transfer is the ability to learn a model using training examples from one domain that is robust enough to be applied to another domain. As an example, for the genre News there might be several topic domains such as Sports, Technology, Science, Health, Politics. An ML model that has been trained for News only on the Sports topic and can still perform equally well on Technology, etc., is considered robust. This is very important particularly for genre identification tasks, where the available positive samples for a genre usually do not cover a wide range of topics. An extreme case of domain transfer is cross-lingual genre classification, where the task is to train a model using documents in one language and then apply the model to documents in another language, particularly one with different linguistic properties, such as English to Chinese and vice versa.
One proposed solution to achieve robustness across domains is presented in (Petrenz and Webber, 2011). Their approach focuses on the use of language-independent features, such as character n-grams and surface text characteristics (e.g., type/token ratio), together with an iterative strategy for training an ML model. This method is called Iterative Target Language Adaption (ITLA).
ITLA is a special case of cross-lingual AGI method where pair-wise inter-language training is possible. That is, one can train a model on one language and then optimize it for another. This method makes it possible to train a model on one language and adapt it to another, given a small set of labeled samples for the required genre taxonomy and a rich set of unlabeled samples. In (Petrenz and Webber, 2011) SVM was the model of choice. The process includes the following steps:
1. Initially, an SVM classifier is trained on the labeled source-language set $L^{L}_{S}$. Then, with the help of the unlabeled set $L^{U}_{T}$ of the target language, the model is evaluated for its prediction confidence on the genre taxonomy.
2. Using a labeled subset of the target language set $L^{L}_{T}$, another SVM model is trained where the prediction confidence of the initial training is used for selecting only the samples of the subset returning the highest confidence score.
3. The samples with very low score in $L^{L}_{T}$ are filtered out and a new subset is re-sampled.
4. The process alternates between steps 2 and 3 until no change in the prediction confidence occurs or the number of iterations has reached its pre-defined limit.
The results in this study were very promising, given that a generic language-independent approach manages to exceed the results of the common solution based on machine translation technology (i.e., where the texts of the source language, used to train the model, are automatically translated to the target language and they are used to train the classification model).
Finally, genre recognition could be performed using unsupervised learning, namely without any positive training examples. An interesting work that induces a genre taxonomy from unlabeled documents is presented in (Lieungnapar, Todd, and Trakulkasemsuk, 2017). They use a k-means clustering method to induce sub-genres of popular science. A set of linguistic features (presented in Table 2.4) is used to represent web-pages and, by taking into account the correlation of z-scores of these features to the four detected clusters, it is possible to estimate their significance for each cluster. In addition, the detected clusters are associated with sub-genres of popular science based on a functional relations analysis (e.g., impersonal, narrative, persuasive, informative, elaborated). That way a high-level description of the detected sub-genres of popular science is provided. The results of this analysis, providing specific examples of popular science sub-genres, the key linguistic features and the functional relations for each detected cluster, can be seen in Table 2.7. Finally, they have shown that the proposed semi-automated process was as effective as the experts' agreement on the same task.
Table 2.7 (fragment). Key linguistic features: average sentence length, discipline-specific word density, compound noun density, adjective density, coordinating conjunction density, content word density. Functional relations: Informative, Elaborated, Impersonal.

Hierarchical Classification
So far, all discussed methods assume that the genre palette consists of independent genres in a flat structure. Another approach is to consider a hierarchical structure of genres where super-genres and sub-genres are defined. Such a hierarchy can be obtained by human-experts (Stubbe, Ringlstetter, and Schulz, 2007).
An automated approach to build a hierarchy of genres is presented in (Madjarov et al., 2015). First, they use two clustering methods attempting to develop a hierarchy where a flat multi-class taxonomy could potentially be organized in a hierarchical structure. That is, given a set of leaf class-tags, an agglomerative or a balanced k-means algorithm is used to create a class hierarchy.
The reported results show that balanced k-means works better for this task on their data set and experimental set-up. The success of balanced k-means can be explained by the fact that it needs the size of the clusters to be provided. Thus, this method has to optimize two (contradictory) objectives: first, to find the most dense and well separated clusters and second, to keep the sizes of the clusters equal. To do so, the Hungarian algorithm is used for the optimization process, a combinatorial optimization algorithm that solves the assignment problem in polynomial time.
The automatically obtained hierarchy has been compared with the one built by a human-expert. The former has been found to work equally or even better than the latter. It has also been demonstrated that the obtained hierarchical structure can be used for a multi-class classification scenario (Malinen and Fränti, 2014).

Corpora for WGI Evaluation
In order to evaluate WGI approaches, there is a need for corpora including multiple web genres. Given that most WGI methods need to be trained, an adequate number of positive examples per genre should be provided, so that a large part of them can be used for training and hyper-parameter tuning and the rest for test purposes. Unfortunately, the available corpora used in previous studies are relatively small. Table 2.8 shows a list of corpora developed by researchers to evaluate their methods (focusing on corpora that include HTML documents). Most of these corpora have also been used by other researchers to provide comparative evaluation results when new WGI methods are proposed. Due to the small size of the existing corpora, the most popular approach is to apply 10-fold cross-validation.
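For reference, the splitting step of 10-fold cross-validation can be sketched without any external library. This is a plain (non-stratified) variant with a hypothetical function name; with small and unbalanced genre corpora a stratified split that preserves the genre proportions in every fold would usually be preferable, and the number of samples should be at least k so that no fold is empty.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so every web-page of the corpus contributes once to the reported performance.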
• KI-04: This is a corpus of web genres selected according to their usefulness for web search purposes (Eissen and Stein, 2004). Documents were downloaded in 2004.
• I-EN: This is a small collection of web-pages randomly selected from a large corpus representing English Web in 2005 (Sharoff, 2010).
• 7-Genre: This is a corpus of web genres downloaded in 2005 (Santini, 2007). All pages were manually collected.
• SANTINIS: This is an augmented version of 7-Genres. It includes four additional genres from BBC pages as well as 1,000 unlabeled pages from the SPIRIT collection (Joho and Sanderson, 2004). The latter can be viewed as noise.
• HGC: This corpus provides a hierarchically structured genre palette (Stubbe, Ringlstetter, and Schulz, 2007). For example, poem is a sub-genre of literature and reportage is a sub-genre of journalism.
• MGC: This is the only multi-labelled corpus of the list. Each web-page can belong to several genres (Vidulin, Luštrek, and Gams, 2007).
• LWGC-B: This is a corpus constructed using crowd-sourcing to ensure the reliability of genre labels (Asheghi, 2015). It is a balanced corpus, meaning that the samples are evenly distributed over the selected genres.
• LWGC-R: This is based on the same methodology as LWGC-B. However, a random collection of web-pages was assigned to genres, meaning that it is highly unbalanced. In addition, a large part (almost half) of the web-pages do not belong to any of the given genres (i.e., noise) (Asheghi, 2015).
As can be seen, the characteristics of these collections differ. Each one enables the evaluation of WGI approaches in specific setups. In addition, there is a variety of genre labels included in these collections. The most common labels refer to personal home pages and e-shop. In general, genre labels of one collection cannot be easily mapped to labels of another collection. It has to be noted that the time of building the corpus constrains the genres that it includes. For example, KI-04, which is relatively old, excludes blogs and FAQs. This also clearly indicates the emergence of new web genres (Dash and Arulmozi, 2018).
Another weakness of some corpora is the lack of representativeness for specific genres. For example, the FAQ label of 7-Genres consists of web-pages mainly discussing hurricanes. Thus, this genre becomes too specific and could be identified thematically.
As concerns open-set evaluation of WGI approaches, only two corpora include (unstructured) noise: SANTINIS and LWGC-R. Unfortunately, the latter is not publicly available.

Conclusions
The bulk of research in genre recognition studies focuses on the extraction of relevant information from documents. Various sources have been explored, including the textual part of web-pages, the visual appearance and structure, the URL and the graph of interlinked web-pages. Usually, a combination of features from several such categories helps to enhance performance in WGI. However, the textual features are considered the most important ones and provide the starting point of almost every study. It is remarkable that among the most effective ways to represent web-pages in the framework of WGI tasks are word and character n-grams. Such features are easy to extract, require minimal resources, are language-independent, and are able to capture nuances of stylistic properties of texts. On the other hand, such features build high-dimensional and sparse representations.
It has to be noted that, in contrast to thematic text classification approaches, minimum text pre-processing should be applied in WGI tasks. That is, stopword removal is not a good idea since the most frequent words are associated with certain syntactic structures that provide useful stylistic information. Punctuation marks, capitalization, etc. should not be removed since they are important style markers. Stemming or lemmatization is not advisable since significant morphological information will be lost.
The vast majority of WGI approaches adopt the closed-set classification scenario. This is far from realistic for most WGI applications, where it is not possible to build an adequate genre palette. In an experiment presented in (Asheghi, 2015), users (in a crowd-sourcing environment) were asked to assign about 1,000 randomly selected pages to 15 pre-defined general genre labels. More than 45% of the web-pages were left unclassified. This clearly shows that a great amount of web-pages will not meet the criteria of any pre-defined genre palette. In addition, it demonstrates that noise is more frequent than any pre-defined genre in WGI tasks.
The most successful classification algorithms applied to genre recognition tasks so far, are SVMs, distance-based methods, and ensemble methods. It seems that it is especially important to adequately handle irrelevant and redundant features. The application of deep learning methods in WGI tasks has not provided remarkable results yet, in contrast to several other text classification tasks. One reason for this could be the lack of large volumes of training examples.
There is only a few open-set WGI approaches. Some of them are not adequately evaluated since experiments using noise-free corpora are used (Stubbe, Ringlstetter, and Schulz, 2007;Jebari, 2015). In addition, even when noise is included in the evaluation corpus, the evaluation measures do not take into account the open-set conditions (Asheghi, 2015).
The evaluation corpora used for the estimation of effectiveness of WGI methods are small in size and greatly differ in characteristics. Most of them are fairly old and do not cover modern genres. Unfortunately, two recently developed ones (LWGC-B and LWGC-R), specifically designed to provide reliability of genre labels and thematic neutrality of the samples belonging to each genre, are not currently publicly available.

Introduction
WGI is a task that can be approached either as a closed-set or an open-set classification problem. The former case assumes that there is a well-defined genre palette that covers all possible genres that can be found in our domain. In addition, for each such genre there are representative instances of web-pages to be used as training data. These assumptions are far from realistic in most WGI applications. As already explained in previous chapters, it is not feasible to define a universal genre palette for the Web since there is no consensus over genre labels and new genres are emerging or existing genres evolve through time. On the other hand, it is possible to determine certain web genres where there is general agreement about their characteristics (e.g., blogs, e-shops). For such web genres it is relatively easy to find representative training data.
Open-set classification is, therefore, a more realistic option to model the WGI task. In this setup, a genre palette covering very specific web genres is given and all other genres are considered as noise (i.e., instances of noise should not be assigned to any of the known genres). An effective open-set WGI approach can suit any type of relevant application since it provides the ability to recognize the known web genres without being confused by the presence of noise. It should be underlined that noise is expected to outnumber the training instances of the known genres. The Web is chaotic and of huge scale, and known genres only cover a small part of it.
Open-set classifiers have to deal with an important difficulty: the Open Space Risk (OSR). This corresponds to the instance space that lies away from the instances of known genres and can be occupied by samples of an unknown genre. An open-set classifier should be able to set the boundaries of known genres so as to avoid the risk of including an area where an unknown genre is found. This is especially challenging when the dimensionality of the representation is high. This is exactly the case with most of the popular text representation schemes that are composed of hundreds or thousands of features (e.g., character n-grams, word n-grams). It is therefore crucial to develop open-set classifiers for WGI that are robust to high-dimensional representations or combine open-set algorithms with appropriate compact representations.
In this chapter, three open-set WGI methods are described in detail. The first method is based on one-class classification where only positive examples are considered for each known genre. This does not mean that it is not possible to find negative examples. However, the negative class is so huge and heterogeneous that it is quite challenging to extract representative negative samples. The second approach considers training samples for all available known genres and attempts to reduce the effect of high dimensionality of representation by performing repetitive subspacing. The main idea is to build an ensemble of classifiers, each one using a subset of the initial features. The third approach is an extension of the nearest-neighbor classification algorithm and attempts to directly regularize the effect of OSR.
The rest of this chapter first describes the main properties of open-set classification and discusses the main existing paradigms. Then, each one of the three proposed methods for WGI tasks is analytically presented.

Open-set Classification
An open-set classification task is a tuple (C, K, U), where C is a set of predefined known classes, K is a set of training samples for the known classes (i.e., for each c ∈ C there is a set of training samples K_c ⊂ K), and U is a set of unknown samples to be assigned to classes. Each u ∈ U may belong to either one c ∈ C or none of them. Furthermore, the subset of U not belonging to any of the known classes is called noise N.

Noise in Open-set Recognition
The previous definition of open-set classification task only considers two kinds of classes: known and unknown. A more detailed analysis is provided in (Geng, Huang, and Chen, 2018):
• Known-known classes are the classes for which positive samples are available. This is directly comparable to C.
• Known-unknown classes consist of negative samples that can be merged into one big artificial class, like background classes (Dhamija, Günther, and Boult, 2018).
• Unknown-known classes are classes that can be described using some kind of side-information (e.g., a semantic description). However, there is lack of positive training examples for these classes. The recognition of such classes can be performed by zero-shot learning (Palatucci et al., 2009).
• Unknown-unknown classes are classes without any positive training examples and without any side-information. This directly corresponds to N .
In this thesis we distinguish noise into unstructured and structured forms:
• Unstructured Noise corresponds to the case where there is no distinction between the unknown classes. In other words, all unknown classes are merged into a single super-class. This is very realistic in WGI applications where it is quite unclear how to define the genre of a large number of web-pages.
• Structured Noise is composed of distinct unknown classes, that is, we consider that each n ∈ N belongs to a class c ∉ C. Certainly, this information is not given to the open-set classifier but it is only used to estimate its performance. This is also realistic in certain WGI applications where we are interested in the recognition of specific genres and it is also known that several other genres exist.

The Open-Space Risk
One possibility to build classifiers that can leave some (test) instances unclassified is to introduce a reject option to closed-set classification algorithms. First, a regular closed-set classifier is trained using K . Then, a reject criterion is determined, usually associated with the confidence of the predictions, and each test instance that does not satisfy this criterion is not classified to any of the classes in C (Onan, 2018). For example, the reject criterion could relate to the difference of probabilities assigned to the two most likely classes in C . If this difference is large, then it is an indication that the instance in question really belongs to the most likely class (i.e., the confidence of prediction is high). If, on the other hand, the difference is small (i.e., the confidence of the prediction is low), then this means that the instance most probably does not belong to these classes.
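The reject criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_with_reject`, the parameter `min_gap`, and the probability values are hypothetical, and the class probabilities are assumed to come from some already trained closed-set classifier.

```python
from typing import Dict, Optional


def classify_with_reject(probs: Dict[str, float], min_gap: float) -> Optional[str]:
    """Return the most likely class, or None when the two top classes are too close.

    `probs` maps each known class to the probability assigned by a closed-set
    classifier; `min_gap` is a hypothetical confidence threshold.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]  # nothing to compare against
    (best_class, p_best), (_, p_second) = ranked[0], ranked[1]
    # Small difference = low confidence = leave the instance unclassified.
    return best_class if (p_best - p_second) >= min_gap else None


# Made-up probabilities for two test pages.
confident = classify_with_reject({"blog": 0.85, "eshop": 0.10, "faq": 0.05}, min_gap=0.3)
ambiguous = classify_with_reject({"blog": 0.40, "eshop": 0.38, "faq": 0.22}, min_gap=0.3)
```

Here `confident` is assigned to `blog`, while `ambiguous` is rejected because the two most likely genres are almost tied.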
One big problem of this approach is that it provides strong predictions for the entire instance space. Actually, closed-set classifiers segment the instance space so that instances belonging to the known classes are well separated. However, this also means that if an unknown class lies in the space that is far away from the known classes, it cannot be easily distinguished anymore. Figure 3.1(a) depicts the case where a closed-set classifier is trained to recognize two known classes. Note that the decision boundary affects the entire instance space. There is also an unknown class that lies away from the known classes, almost equally away from both of them, and also near the decision boundary. This scenario can be handled by a rejection option since all members of the unknown class will be equally likely to belong to either of the known classes and, therefore, can be rejected. Figure 3.2(b) shows a similar case with two known classes and one unknown class. However, this time the unknown class lies deep in the space that seems to belong to one of the known classes. The members of the unknown class are still far away from both known classes but now the rejection option will not work since one of the known classes seems far more likely than the other.
Note that the most important issue about an open-set classifier seems to be the appropriate definition of the known class boundaries. If the classifier is too conservative, then the space allocated to the known class will be too small and it is possible to exclude some of its members. On the other hand, if the classifier is optimistic, then the area allocated to the known class will be large, including neighboring areas of the known class training instances and increasing the risk of including samples of unknown classes. This is demonstrated in Figure 3.3. The more optimistic an open-set classifier is, the more likely it is to suffer from the open space risk.
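Let f_y be a recognition function for a known class y, where f_y(x) = 1 corresponds to the case where x is assigned to class y, while f_y(x) = 0 means that x is not recognized as belonging to y. Then, the open space risk is formally defined as follows (Scheirer et al., 2013):

$$R_O(f) = \frac{\int_{O} f(x)\,dx}{\int_{S_o} f(x)\,dx}$$

where O is the open space, S_o is a ball that contains both the positive training samples and the open space O, and f(x) > 0 implies recognition. Open-set recognition then amounts to finding the recognition function that minimizes the open space risk together with the empirical risk R_ε on the training data:

$$\underset{f \in H}{\arg\min}\;\left\{ R_O(f) + \lambda\, R_\varepsilon(f) \right\}$$

where λ is a regularization constant.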

Paradigms in Open-set Classification
In the relevant literature, a variety of approaches to open-set recognition can be found. A thorough recent review is provided in (Geng, Huang, and Chen, 2018). In general, the following main paradigms are usually followed:
• One-class classification methods
• Modification of traditional ML methods
• Deep learning methods
• Generative models
One way to approach open-set classification is to apply One-Class Classification (OCC) methods. An OCC method is based on only positive samples of a given class. It is assumed that negative samples are either difficult to obtain or the negative class is so heterogeneous that it is not easy to sample it. There are several approaches towards the solution of this problem. A compact survey on OCC is provided in (Khan and Madden, 2010).
Rocchio's algorithm is the simplest one-class classification algorithm. It has been used for information retrieval tasks because of its simplicity and consistency (Joachims, 1997). The learning process is just the summation of all the sample vectors of a given class, which yields the prototype vector. Then, a new sample is classified as positive or negative using the angular distance from the prototype vector and a threshold value.
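A minimal sketch of this prototype-based scheme is given below. The helper names (`build_prototype`, `angle_between`, `is_positive`), the toy vectors, and the value of `max_angle` are all hypothetical; the sketch only assumes that the angular distance is the angle between the sample and the prototype.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def build_prototype(samples: Sequence[Vector]) -> List[float]:
    """Sum the positive sample vectors component-wise to get the class prototype."""
    return [sum(column) for column in zip(*samples)]


def angle_between(a: Vector, b: Vector) -> float:
    """Angle in radians between two vectors (pi/2 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return math.pi / 2
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def is_positive(sample: Vector, proto: Vector, max_angle: float) -> bool:
    """Accept the sample when it lies within `max_angle` radians of the prototype."""
    return angle_between(sample, proto) <= max_angle


# Toy term-frequency vectors of one genre (illustrative values only).
positives = [[2.0, 1.0, 0.0], [3.0, 1.0, 0.0], [2.0, 2.0, 0.0]]
proto = build_prototype(positives)
inside = is_positive([4.0, 2.0, 0.0], proto, max_angle=0.5)
outside = is_positive([0.0, 0.0, 5.0], proto, max_angle=0.5)
```

The threshold `max_angle` plays the role of the decision threshold mentioned above and would have to be tuned on held-out data.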
Datta (cited in (Manevitz and Yousef, 2002)) proposed a modification of the Naive Bayes classifier for OCC problems that uses only positive samples in the learning process. A probability density function of a class E is induced as the prediction model. Classifying a document d involves calculating the probability p(d|E) which, under the naive assumption, is equal to the product of the probabilities p(w_i|E) of its features w_1, ..., w_n, where n is the size of the feature vector. To decide whether the document is classified as positive, a threshold has to be defined.
Perhaps the most popular OCC approach is described in (Scholkopf et al., 1999). It is actually a modification of the well-known SVM algorithm to the problem of overlapping sample distributions, known as ν-SVM (Bishop, 2006). The nature of ν-SVM allows its use in binary classification problems as well as in OCC problems. The parameter ν controls both the fraction of support vectors and the margin errors, i.e., positive samples considered as outliers. The optimization process begins with considering the origin as the only negative example. More details on this approach are given in Section 3.4.1.
Outlier-SVM is another SVM-based algorithm discussed in (Manevitz and Yousef, 2002; Khan and Madden, 2010). The performance of this model was competitive, but it was not the top performer when compared with methods such as One Class Neural Networks, One Class Naive Bayes Classifier, One Class Nearest Neighbor, and Rocchio Prototype. In addition, this algorithm is sensitive to the term weighting scheme (i.e., Binary, TF, TF-IDF, etc.) and to vector dimensionality.
There are also OCC methods exploiting the availability of unlabeled data. (Yu, 2005) proposed two OCC algorithms that use positive and unlabeled data for building a classification model that describes the single class boundary. The Mapping Convergence (MC) algorithm incrementally labels negative data from the unlabeled data set using the margin maximization property of SVM. The Support Vector Mapping Convergence (SVMC) optimizes the MC algorithm for fast training. Both algorithms were evaluated on real-world classification tasks, letter recognition, and diagnosis of breast cancer, achieving higher performance than Spy Expectation Maximization (S-EM), SVM-NN (i.e., C-SVM using unlabeled data points as negative ones) and Naive Bayes Classifier with noise samples (Liu et al., 2002; Li and Liu, 2003).
In contrast to OCC, the majority of the approaches to open-set recognition are able to handle both positive and negative samples of a given class. Several variations of well-known classification algorithms have been proposed so far. The 1-vs-Set SVM algorithm introduced in (Scheirer et al., 2013) was the first attempt to regulate the open space based on formula 3.3 using a second hyperplane parallel to the separating hyperplane. However, the space corresponding to each known class remains unbounded. This means that the open space risk still remains. Another SVM-based approach (W-SVM) consists of two models, a one-class SVM and a binary SVM using a Weibull cumulative distribution function (Scheirer, Jain, and Boult, 2014). Yet another idea, used in the POS-SVM method (Scherreik and Rigling, 2016), models open space risk and empirical risk probabilistically.
Distance-based algorithms can be adopted in the open-set framework by bounding the true positive samples by the outliers. The Nearest Non-Outlier (NNO) algorithm is a center-based method that uses OSR regularization for keeping the outliers bounded. There are several center-based algorithms; one of them is the RFSE algorithm developed for this thesis and described in Section 3.4.2. NNDR, described in Section 3.5, is also a distance-based method.
Deep Neural Networks are usually developed with a SoftMax function forcing the whole modeling setup to follow a closed-set assumption. However, there have been several efforts to modify deep learning models for open-set classification, notably using OpenMax (Bendale and Boult, 2016;Cardoso, Gama, and França, 2017). First, a normal SoftMax model is trained. Then, the layers of the network are modified to be able to recognize (pseudo) unknown classes. Another approach is to follow the adversarial learning setup where it is attempted to generate the unknown classes. One such method, the Generative OpenMax algorithm (Ge et al., 2017) estimates the decision boundary between known classes and the generated unknown ones.
Another generative approach is based on the Dirichlet Process, a distribution over distributions. This model is not overly dependent on the training samples and can adapt to changes in data distribution. The collective decision-based OSR (CD-OSR) method applies co-clustering to model each known class. Each known class can be represented by several of the obtained clusters while some clusters are not associated with any of the known classes. In the testing phase, each instance that falls into these unassociated clusters is assigned to the unknown classes. The main advantage of this generative approach over discriminative ones is that it does not need any threshold definition.

One-Class SVM
The first open-set WGI method introduced in this thesis follows the OCC paradigm. Basically, the main idea is to build a one-class SVM classifier for each class c ∈ C using only the positive instances of that class. Ideally, the members of the other known classes as well as members of the unknown classes will not be recognized by any of these one-class classifiers.
One-class SVM attempts to find the contour including the positive samples of the target class, as depicted in Figure 3.4. Following the logic of the traditional SVM algorithm, a one-class modification, called ν-SVM, was introduced in (Scholkopf et al., 1999). Let x_1, x_2, ..., x_l be a set of positive samples of the target class and φ a feature map. ν-SVM considers the origin (in the feature space of φ) as the only negative sample and attempts to separate the positive samples from the origin while maximizing the distance of the decision hyperplane from the origin. The latter is called margin (ρ). More formally, the algorithm solves the following optimization problem:

$$\min_{w,\,\xi,\,\rho}\;\; \frac{1}{2}\lVert w \rVert^{2} + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i - \rho$$

subject to:

$$(w \cdot \varphi(x_i)) \ge \rho - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, l$$

where ξ_i correspond to slack variables allowing the model to handle outliers and ν is a hyper-parameter in (0, 1). Similar to the traditional SVM, the solution involves the construction of a dual problem where a Lagrange multiplier (α_i) is associated with every constraint of the primal problem. Thus, the following optimization problem is solved:

$$\min_{\alpha}\;\; \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i \alpha_j K(x_i, x_j)$$

subject to:

$$0 \le \alpha_i \le \frac{1}{\nu l}, \qquad \sum_{i=1}^{l}\alpha_i = 1$$

where K(x, y) is a kernel function. The samples with non-zero α_i are the support vectors (see Figure 3.4) and only they contribute to the decision function:

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_i K(x_i, x) - \rho\right)$$

Note that the offset ρ can be derived from any support vector whose α_i is not at the upper or lower bound. The hyper-parameter ν has the following properties:
• ν is an upper bound on the fraction of outliers.
• ν is a lower bound on the fraction of support vectors.
In (Scholkopf et al., 1999) it is reported that, in their experiments, when using ν = 0.05, 1.4% of the training set was classified as outliers, while when using ν = 0.5, 47.4% was classified as outliers and 51.2% was kept as support vectors.
In WGI we usually have multi-class classification problems. For each known class, a separate OCSVM model is extracted. Then, in the application phase, for each unknown sample, each OCSVM model decides whether the sample belongs to its class. In addition to a crisp decision, we also take into account the distance of the sample from the hyperplane as an indication of the confidence of this prediction. Finally, the unknown sample is assigned to the class with maximal confidence or left unclassified in case all OCSVM models reject it.
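The combination step described above (one model per genre, maximal confidence wins, rejection when every model says no) can be sketched as follows. The one-class models are replaced here by a hypothetical stand-in, `toy_scorer`, that returns a signed score which is positive inside a ball around a center; in practice this score would be the signed distance from the hyperplane of a trained one-class SVM (for example, the output of `decision_function` of scikit-learn's `OneClassSVM`). All names and values below are illustrative assumptions.

```python
import math
from typing import Callable, Dict, Optional, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def toy_scorer(center: Vector, radius: float) -> Scorer:
    """Hypothetical stand-in for a trained one-class model.

    Returns a signed score: positive inside a ball of the given radius,
    negative outside, mimicking a signed distance from the decision boundary.
    """
    def score(x: Vector) -> float:
        distance = math.sqrt(sum((c - v) ** 2 for c, v in zip(center, x)))
        return radius - distance
    return score


def assign_genre(models: Dict[str, Scorer], page: Vector) -> Optional[str]:
    """Assign the page to the accepting genre with the highest score, else None."""
    scores = {genre: model(page) for genre, model in models.items()}
    accepted = {genre: s for genre, s in scores.items() if s > 0}
    if not accepted:
        return None  # every one-class model rejected the page
    return max(accepted, key=accepted.get)


models = {
    "blog": toy_scorer(center=(0.0, 0.0), radius=1.0),
    "eshop": toy_scorer(center=(3.0, 0.0), radius=1.5),
}
near_blog = assign_genre(models, (0.2, 0.0))
near_eshop = assign_genre(models, (2.0, 0.0))
noise = assign_genre(models, (10.0, 10.0))
```

Because each model is consulted independently, a page can be rejected by all of them, which is exactly what makes the scheme open-set.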
This OCSVM approach to WGI was first introduced in (Pritsos and Stamatatos, 2013) and it is analytically described in Algorithm 3.1.
Note that the same hyper-parameter ν value is used for all known genres. This value should be determined empirically. OCSVM is affected by the curse of dimensionality which causes the generalization error to increase with the number of irrelevant and redundant features (Erfani et al., 2016). The following open-set classification method attempts to avoid this problem.
Data: G a genre palette and W_g a set of known web-pages for each g ∈ G, w an unknown web-page of the set W_a of arbitrary web-pages, F the feature set, ν the nu hyper-parameter of OCSVM
Result: r ∈ {G, ∅}
score[:, :] = 0, the 2D score matrix where rows correspond to genre labels and columns to the web-pages under evaluation;
for each g ∈ G do
    Model(g) = ocsvmTrain(W_g, F, ν), train an OCSVM model in vector space F with hyper-parameter ν for genre g;
end
for each g ∈ G do
    for each w ∈ W_a do
        score[g, w] = ocsvmApply(Model(g), F, w), the distance of the unknown page w from the hyperplane;

Random Feature Subspacing Ensemble
WGI tasks are usually associated with high-dimensional data. In addition, many of the features involved in text representation schemes are redundant or irrelevant. It is therefore crucial for an open-set classification method to handle the curse of dimensionality appropriately.
A distance-based open-set classification method has been introduced in (Koppel, Schler, and Argamon, 2011) aiming to handle the task of Author Identification where similar types of problems exist with respect to WGI. In the original approach, there is only one training example for each known class and a number of simple classifiers is repetitively learned based on random feature subspacing (i.e., a randomly-selected number of features is used). Each classifier uses a similarity measure to estimate the most likely class for a given new sample. The main idea is that it is more likely for the true class to be selected by the majority of the classifiers since the used subset of features will still be able to reveal the high similarity. If, on the other hand, there is no prevailing class, then the new sample is not assigned to any of the known classes. This method is depicted in Figure 3.5.
Note that in author identification we are mainly interested in stylistic similarities. The style of the author (or genre) can be captured by many different features so a subset of them will also contain enough stylistic information (redundant features). Since WGI is also a style-based text categorization task, this idea should also work for it. In this thesis, we adapt this method to open-set WGI tasks (Pritsos and Stamatatos, 2013). In WGI there are multiple training samples for each known genre. To maintain the simplicity of the classifiers, we use a centroid vector for each genre. In the training phase, a centroid vector is formed for every known class by averaging all the representation vectors of the training web-pages belonging to the same genre.
The class centroids are all formed for a given feature type. Then, an evaluation sample is compared against every centroid and this process is repeated I times. Every time a different randomly-selected feature subset is used. Then, the scores are ranked from highest to lowest and we measure the number of times the sample is top-matched with every class. The sample is assigned to the genre with the maximum number of matches given that this score exceeds a predefined threshold σ. Otherwise, the sample remains unclassified, i.e., RFSE responds "I Don't Know". The RFSE method is analytically described in Algorithm 3.2.
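The procedure above can be illustrated with a compact Python sketch. It is a simplified reading of the description rather than the exact implementation used in this thesis: cosine similarity is used as the similarity function, and the function names, the toy vectors, and the parameter values (`feature_fraction`, `iterations`, `sigma`, `seed`) are hypothetical.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def centroid(vectors: Sequence[Vector]) -> List[float]:
    """Average the training vectors of one genre."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rfse_predict(centroids: Dict[str, Vector], page: Vector, feature_fraction: float,
                 iterations: int, sigma: float, seed: int = 0) -> Optional[str]:
    """Vote over random feature subsets; return None ("I don't know") below sigma."""
    rng = random.Random(seed)
    n_features = len(page)
    subset_size = max(1, int(feature_fraction * n_features))
    votes: Counter = Counter()
    for _ in range(iterations):
        idx = rng.sample(range(n_features), subset_size)  # random feature subset
        sub_page = [page[i] for i in idx]
        best = max(centroids,
                   key=lambda g: cosine([centroids[g][i] for i in idx], sub_page))
        votes[best] += 1  # count the top match of this iteration
    genre, top = votes.most_common(1)[0]
    return genre if top / iterations > sigma else None


# Toy 6-feature vectors (illustrative values only).
centroids = {
    "blog": centroid([[5, 4, 6, 1, 0, 1], [6, 5, 5, 0, 1, 1]]),
    "eshop": centroid([[1, 0, 1, 6, 5, 4], [0, 1, 1, 5, 6, 5]]),
}
known = rfse_predict(centroids, [6, 4, 5, 1, 1, 0], 0.5, 100, 0.8)
noisy = rfse_predict(centroids, [3, 3, 3, 3, 3, 3], 0.5, 100, 0.8)
```

In this toy setting the first page is top-matched with `blog` in almost every feature subset, whereas the second page splits its votes roughly evenly between the two genres and is therefore left unclassified.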
The number of iterations and the decision threshold should be derived empirically. With respect to the similarity function used by the algorithm, there are several choices. In this thesis, we examine three options. The first is the cosine similarity, a typical selection in text mining tasks since it can easily handle high-dimensional and sparse vectors. The second is the MinMax similarity, inspired by the excellent results reported by (Koppel and Winter, 2014) in another style-based text categorization task. These two similarity measures for vectors x and y of dimensionality n are defined as follows:

$$\mathrm{cosine}(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}}\;\sqrt{\sum_{i=1}^{n} y_i^{2}}}$$

$$\mathrm{MinMax}(x, y) = \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)}$$

Finally, we introduce an approach that combines these two similarity functions. The idea is that the most confident measure can be used in each iteration. More specifically, since cosine and MinMax may have different mean and standard deviation for the set of all evaluation samples and all iterations per sample, their values should first be normalized. Then, for each evaluation sample and each iteration we select the one with maximum normalized value. We call this the Combo similarity measure.

Algorithm 3.2: The RFSE algorithm.
Data: G a genre palette and W_g a set of known web-pages for each g ∈ G, w an arbitrary web-page of the set W_a of arbitrary web-pages, F the feature set, f_s a fraction of the feature set size, I a number of iterations, σ the decision threshold
Result: r ∈ {G, ∅}
for each g ∈ G do
    centroid[g] = average(W_g, F), average all known web-pages W_g of genre g to build a centroid vector;
...
    score(maxg) = score(maxg) + 1, increase the score of the top-match genre;
until I times;
if max(score(g))/I > σ then
    r = argmax_{g ∈ G}(score(g)), assign the unknown page to the genre with maximum top matches;
else
    r = ∅, none of the known genres or "I don't know";
end
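The MinMax and Combo measures can be sketched as follows. Two assumptions are made that the text does not spell out: MinMax is taken as the sum of element-wise minima over the sum of element-wise maxima (the usual form for non-negative vectors), and the normalization used by Combo is taken to be a z-score over the available scores. The function names and the score values are hypothetical.

```python
import statistics
from typing import List, Sequence


def minmax_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """MinMax similarity for non-negative vectors (e.g., term frequencies)."""
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return sum(min(x, y) for x, y in zip(a, b)) / denominator if denominator else 0.0


def z_scores(values: Sequence[float]) -> List[float]:
    """Standardize scores; this z-score normalization is an assumption."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def combo(cosine_scores: Sequence[float], minmax_scores: Sequence[float]) -> List[float]:
    """For every comparison keep the larger of the two normalized scores."""
    return [max(c, m) for c, m in zip(z_scores(cosine_scores), z_scores(minmax_scores))]


example = minmax_similarity([1, 2, 0], [2, 1, 1])
# Made-up raw scores of three comparisons under each measure.
combined = combo([0.2, 0.4, 0.6], [0.1, 0.1, 0.4])
```

Normalizing before taking the maximum matters because the two measures live on different scales; without it, the measure with systematically larger raw values would always win.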

Nearest Neighbors Distance Ratio
The approaches considered so far use the positive training instances for the available known classes and do not attempt to estimate the open space risk. However, the distribution of known classes could be used as an indication of the existence of other unknown classes. The next algorithm attempts to follow this direction.

The Nearest Neighbors Distance Ratio (NNDR) algorithm is an open-set classification algorithm introduced in (Mendes Júnior et al., 2016), which, in turn, is an extension of the k-Nearest Neighbors (NN) algorithm. The main idea is that if the new sample lies close to the training samples of a known class and far away from the closest samples of other known classes, then it most likely belongs to that class. If, on the other hand, the new sample is more or less equally distanced from the closest classes, then it should not be assigned to any of them. This is depicted in the examples of Figure 3.6. More formally, let d(x, y) be the distance between two samples x and y. NNDR calculates the distance of a new sample s to its nearest neighbor t and to the closest training sample u belonging to a different class with respect to t. Then, if the ratio:

$$R = \frac{d(s, t)}{d(s, u)}$$

does not exceed a predefined threshold, the new sample is assigned to the class of t. Otherwise, it is left unclassified. An analytical description of this approach is presented in Algorithm 3.3. The original approach uses the Euclidean distance to find the closest neighbors. In this thesis, we use the cosine distance (i.e., 1 - cosine similarity) to better suit the properties of high-dimensional and sparse data usually found in WGI tasks.

Algorithm 3.3: The NNDR algorithm
Data: G a genre palette and W_g a set of known web-pages for each g ∈ G, w an arbitrary web-page of the set W_a of arbitrary web-pages, DRT the distance ratio threshold
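The decision rule can be sketched as below. The sketch assumes that the ratio is taken as d(s, t)/d(s, u), as in Mendes Júnior et al. (2016), so that a small ratio means the nearest neighbor is much closer than the nearest sample of any other class and the sample is accepted; cosine distance is used as described above. The function names, the toy training set, and the threshold value are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def nndr_predict(train: List[Tuple[Vector, str]], sample: Vector,
                 threshold: float) -> Optional[str]:
    """Return the class of the nearest neighbor t, or None when the ratio is too high."""
    ranked = sorted(((cosine_distance(vec, sample), label) for vec, label in train),
                    key=lambda pair: pair[0])
    d_t, label_t = ranked[0]  # nearest neighbor t
    # Closest training sample u that belongs to a different class than t.
    d_u = next((d for d, label in ranked if label != label_t), None)
    if d_u is None or d_u == 0.0:
        return None  # ratio undefined: no other class, or the sample coincides with it
    return label_t if d_t / d_u <= threshold else None


train = [([1.0, 0.0], "blog"), ([0.9, 0.1], "blog"),
         ([0.0, 1.0], "eshop"), ([0.1, 0.9], "eshop")]
clear_case = nndr_predict(train, [1.0, 0.05], threshold=0.5)
between_classes = nndr_predict(train, [1.0, 1.0], threshold=0.5)
```

The second sample lies at about the same distance from both genres, so its ratio is close to 1 and it is left unclassified.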
NNDR needs a way to estimate the threshold that is appropriate for a given dataset. While traditional NN approaches are practically idle in the training phase, NNDR uses this phase to determine a good threshold. It is remarkable that, in contrast to other open-set classifiers, training of NNDR requires both known samples (belonging to classes known during training) and unknown samples (belonging to other/unknown classes) of interest. In more detail, the Distance Ratio Threshold (DRT) used to classify new samples is adjusted by maximizing the Normalized Accuracy (NA):

$$NA = \lambda A_{KS} + (1 - \lambda) A_{US}$$

where A_KS is the accuracy on known samples and A_US is the accuracy on unknown samples. The parameter λ regulates the trade-off between mistakes on known samples and mistakes on unknown samples. Since usually only samples of known classes are available in the training phase, Mendes Júnior et al. (2016) proposed an approach that repeatedly splits the available training classes into two sets (i.e., known and "simulated" unknown).
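The threshold selection step can be illustrated with the short sketch below. It assumes the normalized accuracy of Mendes Júnior et al. (2016), NA = λ·A_KS + (1 − λ)·A_US, and it assumes that the two accuracies have already been measured on a validation set for each candidate threshold. The function names and the accuracy values are hypothetical.

```python
from typing import Dict, Tuple


def normalized_accuracy(acc_known: float, acc_unknown: float, lam: float = 0.5) -> float:
    """NA = lam * A_KS + (1 - lam) * A_US."""
    return lam * acc_known + (1.0 - lam) * acc_unknown


def select_drt(results: Dict[float, Tuple[float, float]], lam: float = 0.5) -> float:
    """Pick the candidate threshold whose (A_KS, A_US) pair maximizes NA."""
    return max(results, key=lambda drt: normalized_accuracy(*results[drt], lam))


# Made-up validation results: threshold -> (accuracy on known, accuracy on unknown).
validation = {0.4: (0.60, 0.95), 0.6: (0.80, 0.85), 0.8: (0.90, 0.55)}
balanced_choice = select_drt(validation)           # equal weight on both accuracies
known_first_choice = select_drt(validation, 0.9)   # favors accuracy on known samples
```

Raising λ shifts the choice towards thresholds that accept more samples, at the cost of letting more noise through.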
In this thesis we adapt the threshold estimation process to work as follows. During the training phase the known classes are split into two sets C_K and C_U according to a predefined ratio p_1. The latter is used as the simulated unknown classes. In addition, the samples K_c of each class c ∈ C_K are split into two parts K_c^F and K_c^V according to another predefined ratio p_2. The former is used as the fitting set and the latter is used as the validation set of known classes. Thus, the original training set is split into two parts: the fitting set (containing a fraction p_2 of the positive instances of each c ∈ C_K) and the validation set (including the remaining fraction (1 − p_2) of the positive instances of each c ∈ C_K and all positive instances of each c ∈ C_U). Then, a given range of DRT values is examined. The NNDR algorithm is called for each DRT value and the fitting set to estimate the class of each member of the validation set. That way, it is possible to calculate A_KS and A_US in formula 3.13, and the DRT value that maximizes normalized accuracy can be estimated. This process is repeated for all possible splits of the set of known classes. In particular, given that n = |C| is the number of known classes, the number of splits is given by the binomial coefficient:

$$\binom{n}{|C_U|} = \frac{n!}{|C_U|!\,(n - |C_U|)!}$$

For example, in case we have 8 known genres and a splitting ratio p_1 = 0.25, the number of possible splits is 56. Finally, the DRT value that optimizes the normalized accuracy over all splits is extracted. Note that by considering a subset of known classes as noise, the NNDR algorithm attempts to directly model the open space risk. This comes with a considerable increase in the training time of the algorithm. In addition, the process of estimating DRT assumes that a big enough set of known classes is available so that a subset of them can be used as (simulated) unknown. This makes the application of this algorithm difficult in cases where there are only a few known classes.

Conclusions
In this chapter, we described three open-set classification algorithms that can be used for WGI tasks. The first method (OCSVM) follows the OCC paradigm and constructs a separate model for each known class by only considering positive instances of that class. This is a general-purpose approach that can also be used in any type of open-set classification task. However, this approach is expected to suffer from the curse of dimensionality, a common problem of the representation schemes usually adopted in WGI. Our goal is to use this general-purpose approach as a baseline for other more sophisticated methods that better suit the WGI properties.
Another proposed method (RFSE) attempts to turn high dimensionality into an advantage by focusing on random subsets of features and constructing an ensemble of classifiers. Given that in style-based text categorization tasks the representation vectors are composed of large amounts of redundant and irrelevant features, it is likely that a random subset of features will still contain enough distinguishing stylistic characteristics. The consistency of indicating a certain known genre as the most likely in the majority of such feature subsets is a strong indication of class membership. This method seems very suitable for WGI tasks.
The last proposed method (NNDR) attempts to directly model the open space risk by examining the distribution of known classes and defining simulated unknown classes. This also decreases the efficiency of the training phase of the method. However, given that the original NN method has zero training phase requirements, the introduced cost is not unbearable in comparison to the training time cost of alternative classifiers. The main issue with the direct modeling of open space risk is that it makes the application of NNDR difficult or even infeasible when the set of known classes is small (e.g., when only one known class exists). As a descendant of NN, this method also inherits its well-known problems, most crucially the difficulty of handling high-dimensional representation schemes with irrelevant features. It seems that NNDR can be effective for WGI given that an appropriate feature set is provided.

Introduction
This chapter describes a framework suitable for the open-set WGI task. Particularly, the properties of evaluation measures usually adopted in closed-set classification tasks are demonstrated. The sometimes misleading conclusions that can be drawn in case they are also used in open-set conditions are highlighted. To avoid this problem, specific evaluation measures are adopted in this thesis, specialized for the open-set WGI task.
The main difference between open-set WGI and closed-set WGI is the presence of noise. As already explained, noise can be unstructured (when the labels of web-pages not belonging to any of the known genres are not given) or structured (when the labels of web-pages not belonging to any of the known genres are given). Traditional evaluation measures do not make any distinction between known genres and the unknown class (noise). Moreover, in the case of structured noise, we need a way to indicate the difficulty of the task that takes into account the number of known and unknown genres. For example, the case where we have 10 known genres and 3 unknown genres is very different from the case where 3 known genres and 10 unknown genres are available. In this thesis we adopt an openness measure that specifically quantifies this relation and can be used to thoroughly study the performance of WGI methods in varying conditions.
In the remainder of this chapter, we first describe the properties of well-known evaluation measures usually adopted in supervised learning tasks and discuss their suitability for open-set classification tasks. Then, we focus on appropriate evaluation measures that can depict the performance of open-set classifiers in varying conditions. Finally, the proposed evaluation framework is summarized.
In machine learning, and specifically in supervised learning, a confusion matrix is a table that depicts the performance of an algorithm. It is a special case of a contingency table with two dimensions (i.e., actual and predicted). In the binary classification case, as depicted in Table 4.1, there are two classes (i.e., A and ¬A) and four types of results: True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). TP and TN correspond to correct predictions while FP and FN are the two types of errors (also called Type I and Type II errors, respectively).
In order to compare the performance of binary classification algorithms, the Accuracy measure can be used. This is the ratio of correct predictions over all predictions (whose number equals the number of samples in the whole evaluation dataset). Formally, it is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4.1}$$
Accuracy is heavily influenced by uneven class distribution. Moreover, it gives equal weight to the two types of errors and cannot handle cases where one of them is more important than the other. In such cases, this evaluation measure can lead to misleading conclusions.
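As a small illustration of how accuracy can mislead under an uneven class distribution, the following sketch computes it from the four cells of a binary confusion matrix. The counts are invented for illustration and are not taken from Table 4.1.

```python
def accuracy(tp, tn, fp, fn):
    """Ratio of correct predictions over all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical, heavily imbalanced evaluation set:
# 10 samples of class A and 990 samples of class not-A.

# A trivial classifier that never predicts A misses every A sample.
trivial = accuracy(tp=0, tn=990, fp=0, fn=10)

# A classifier that finds 8 of the 10 A samples at the cost of
# 30 false alarms scores lower on accuracy.
useful = accuracy(tp=8, tn=960, fp=30, fn=2)

print(f"trivial: {trivial:.3f}, useful: {useful:.3f}")
```

In this invented case the classifier that never recognizes class A obtains the higher accuracy, which is the kind of conclusion the measures discussed next are meant to avoid.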
Alternative evaluation measures that can compensate for these weaknesses are Precision and Recall. Precision, also known as Positive Predictive Value, indicates the fraction of correct predictions for class A over all predictions of class A, while Recall, also known as Sensitivity, Hit Rate, and True Positive Rate, indicates the fraction of correct predictions for class A over all available instances of this class. These evaluation measures are defined as follows:
$$P = \frac{TP}{TP + FP} \tag{4.2}$$
$$R = \frac{TP}{TP + FN} \tag{4.3}$$
There is a well-known trade-off between precision and recall (Weiss et al., 2010). Usually, when one attempts to optimize one of them, the other drops significantly. A popular metric that combines these two measures is the F-score, the (weighted) harmonic mean of precision and recall, which increases when both precision and recall take high values and decreases when at least one of them takes low values. It is defined in the following equation:
$$F_\beta = \frac{(1 + \beta^2)\, P\, R}{\beta^2 P + R} \tag{4.4}$$
where β can be used to regulate the weighting bias towards precision or recall. Usually β = 1 (i.e., F1) is used to give equal weight to precision and recall. If β > 1 then recall is more significant than precision, and if β < 1 then precision is more important. This can be useful in specific applications where more emphasis is put on one of these two measures. Note that precision is influenced by FPs while recall is affected by FNs. For example, in email spam detection, precision is usually regarded as more important than recall: it is far more important to avoid misclassifying legitimate messages as spam (FPs) than to prevent some spam messages from appearing in the inbox (FNs).
It is also important to note that precision and recall, as well as the F-score, are calculated for a particular class. So far, taking into account Table 4.1, we considered A as the reference class. In general, especially when we have to deal with multi-class classification tasks, precision and recall can be calculated for each class separately.
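The effect of β can be checked numerically. The sketch below implements precision, recall, and the F-score as defined above; the spam-filter counts are hypothetical, and returning 0 when a denominator is empty is a convention assumed here rather than something stated in the text.

```python
def precision(tp, fp):
    # Assumed convention: 0.0 when nothing was predicted as positive.
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

def recall(tp, fn):
    # Assumed convention: 0.0 when the class has no instances.
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0

def f_beta(p, r, beta=1.0):
    """Weighted harmonic mean of precision p and recall r."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

# Hypothetical spam filter: 90 spam messages caught, 5 legitimate
# messages wrongly flagged (FPs), 30 spam messages missed (FNs).
p = precision(tp=90, fp=5)
r = recall(tp=90, fn=30)

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(p, r, beta):.3f}")
```

Because precision is higher than recall in this invented example, the score rises as β drops below 1 and falls as β grows above 1.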
Then, we can combine these measures by taking their arithmetic mean. This provides the macro-averaged precision and recall. Let C be the set of classes in a multi-class classification task (e.g., in WGI this corresponds to the known genre palette) while P_c and R_c are the precision and recall scores of class c ∈ C, respectively. Then macro-averaged precision and recall are defined as follows:
$$P_{macro} = \frac{1}{|C|} \sum_{c \in C} P_c \tag{4.5}$$
$$R_{macro} = \frac{1}{|C|} \sum_{c \in C} R_c \tag{4.6}$$
where |C| is the number of known classes. Accordingly, the macro-averaged F-score can be calculated:
$$F_{macro} = \frac{1}{|C|} \sum_{c \in C} F_c \tag{4.7}$$
where F_c is the F-score for the class c ∈ C.
Alternatively, one can also calculate micro-averaged precision, recall, and F-score. In that case, all data samples are taken together and a single precision and recall value is calculated for all classes cumulatively. TPs correspond to correct predictions, i.e., the diagonal values in the confusion matrix. All other cells of the confusion matrix are considered as both FPs and FNs (i.e., when a sample of class X is misclassified to class Y this is a FN for X and a FP for Y). Thus, micro-precision will be equal to micro-recall and their harmonic mean (F1) will also be the same. In fact, micro-averaged F1 is also equal to the accuracy measure. Consequently, F_micro is strongly dependent on the distribution of samples over the classes. On the other hand, F_macro gives equal weight to all classes.
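The relation between the two averaging schemes can be verified on a small example. The following sketch computes macro- and micro-averaged scores from a closed-set confusion matrix; the matrix and the convention that `cm[i][j]` counts samples of actual class i predicted as class j are assumptions made for illustration.

```python
def safe_div(a, b):
    return a / b if b else 0.0

def f1(p, r):
    return safe_div(2 * p * r, p + r)

def macro_micro(cm):
    """Return ((P, R, F1) macro, (P, R, F1) micro) for a square matrix.

    cm[i][j] = number of samples of actual class i predicted as class j.
    """
    n = len(cm)
    tp = [cm[c][c] for c in range(n)]
    fp = [sum(cm[i][c] for i in range(n)) - cm[c][c] for c in range(n)]
    fn = [sum(cm[c]) - cm[c][c] for c in range(n)]
    p = [safe_div(tp[c], tp[c] + fp[c]) for c in range(n)]
    r = [safe_div(tp[c], tp[c] + fn[c]) for c in range(n)]
    macro = (sum(p) / n, sum(r) / n,
             sum(f1(p[c], r[c]) for c in range(n)) / n)
    micro_p = safe_div(sum(tp), sum(tp) + sum(fp))
    micro_r = safe_div(sum(tp), sum(tp) + sum(fn))
    return macro, (micro_p, micro_r, f1(micro_p, micro_r))

# Hypothetical 3-class matrix with one large and two small classes.
cm = [
    [90, 5, 5],
    [4, 5, 1],
    [2, 2, 6],
]
macro, micro = macro_micro(cm)
acc = sum(cm[c][c] for c in range(len(cm))) / sum(map(sum, cm))
print("macro:", [round(v, 3) for v in macro])
print("micro:", [round(v, 3) for v in micro], "accuracy:", round(acc, 3))
```

In this invented matrix the large class dominates the micro-averaged scores, while the macro-averaged scores are pulled down by the two small, poorly recognized classes.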

Open-set Variants of Evaluation Measures
In an open-set classification task, we are given a set of known classes C and training samples for each c ∈ C. However, in the evaluation phase the dataset may consist of both samples belonging to members of C and samples of classes excluded from C, that is, noise N. The latter can be composed of several classes. Especially in WGI, it is expected that the number of web-pages not belonging to any of the known genres would be very high (Asheghi, 2015). If one adopts the evaluation measures described in the previous section for an open-set classification task, then all samples belonging to any c ∉ C could be considered as a single unknown class (i.e., a super-class). Then, precision, recall, and F1 values can be obtained for this unknown class that would be considered equally important to the corresponding ones of the known classes (members of C) when calculating either macro-averaged or micro-averaged precision, recall, and F1. However, this implies that TPs for the unknown class are as important as TPs for a known class. Since there are no training samples for the unknown class and it is actually a merging of several classes, it does not make sense for the evaluation measures to treat this super-class as a regular class (Mendes Júnior et al., 2016). Rather, the open-set evaluation measures should focus on the correct recognition of known classes only. It should be noted that open-set classifiers attempt to recognize the known classes and leave unclassified any new samples that are not assigned to any of those classes, rather than actually recognizing the unknown classes. This should be reflected in the evaluation measures used to estimate their performance.
An open-set variant of macro-averaged precision, recall, and F1 can be obtained by ignoring the unknown class and calculating the arithmetic mean over the known classes only (Mendes Júnior et al., 2016). Formulas 4.5, 4.6, and 4.7 can still be used. However, it should be underlined that in the open-set scenario the confusion matrix has |C| + 1 rows and columns. This means that one class (the unknown class) is ignored when calculating the macro-averaged scores. On the other hand, the samples of the unknown class that are misclassified to the known classes (false knowns) and the samples of the known classes misclassified as unknown (false unknowns) still affect the precision and recall of known classes, respectively. It is important to include these errors in the evaluation measures since they determine the effect of noise in open-set recognition.
Similar to open-set macro-averaged scores, open-set micro-averaged precision, recall, and F1 can be obtained. Note that in this case, micro-precision is not necessarily equal to micro-recall since the former is affected by the presence of false knowns and the latter is affected by false unknowns. Again, the TPs of the unknown class are ignored.
We provide an illustrative example that demonstrates the difference between traditional evaluation measures and their open-set variants. Table 4.2 shows an example of a confusion matrix for an open-set classification task with four known classes (C = {A, B, C, D}). In WGI, this could correspond to a genre palette of four known genres (e.g., blogs, e-shop, home pages, discussion) for which training samples are available. As can be seen, there are 20 evaluation samples for each known class and 200 samples of the unknown class, or noise (∅). This is realistic since in practice noise is expected to outnumber any given known genre. Correct predictions (TPs) are in boldface. The 180 samples of noise correctly left unclassified are the TPs for the unknown class (∅). False knowns (see column of ∅) consist of 20 samples of noise misclassified to the known classes (i.e., 10 to B and 10 to D) while false unknowns (see row of ∅) comprise 24 samples of the known classes that are wrongly left unclassified (i.e., 6 from A, 8 from B, and 10 from C). These errors affect the precision and recall of known classes. They correspond to the effect of noise in open-set classification.
Table 4.3 shows the precision, recall, and F1 scores for all classes, both known and unknown. In addition, it demonstrates the traditional macro-averaged and micro-averaged precision, recall, and F1 when all classes (C ∪ ∅) are taken into account, as well as their corresponding open-set variants based exclusively on C (∅ is excluded). As can be seen, the unknown class has a particularly high F1 score since most samples not belonging to the known classes were left unclassified. The regular macro-averaged F1 score (i.e., when all classes are included) is positively affected by this. On the other hand, the open-set F1 variant (i.e., when only the known classes are considered) is more realistic since it focuses on the recognition of classes for which there are training examples.
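The difference between regular measures and their open-set variants can be reproduced with a short sketch. The matrix below uses only the counts stated in the text (20 samples per known class, 200 noise samples, 20 false knowns, 24 false unknowns). Confusions among the known classes are not enumerated in the text and are assumed here to be zero, so the resulting numbers are illustrative and will not necessarily match Table 4.3. The indexing `cm[actual][predicted]` is also a convention chosen for this sketch.

```python
def safe_div(a, b):
    return a / b if b else 0.0

def f1(p, r):
    return safe_div(2 * p * r, p + r)

def averaged_scores(cm, classes):
    """Macro and micro (P, R, F1) averaged over the indices in `classes`.

    cm[i][j] counts samples of actual class i predicted as class j.
    Classes left out of `classes` are not averaged, but their rows and
    columns still contribute FPs and FNs to the remaining classes.
    """
    n = len(cm)
    tp = [cm[c][c] for c in classes]
    fp = [sum(cm[i][c] for i in range(n)) - cm[c][c] for c in classes]
    fn = [sum(cm[c]) - cm[c][c] for c in classes]
    p = [safe_div(t, t + x) for t, x in zip(tp, fp)]
    r = [safe_div(t, t + x) for t, x in zip(tp, fn)]
    k = len(classes)
    macro = (sum(p) / k, sum(r) / k,
             sum(f1(a, b) for a, b in zip(p, r)) / k)
    mp = safe_div(sum(tp), sum(tp) + sum(fp))
    mr = safe_div(sum(tp), sum(tp) + sum(fn))
    return {"macro": macro, "micro": (mp, mr, f1(mp, mr))}

# Indices 0..3 = known classes A, B, C, D; index 4 = unknown class.
cm = [
    [14, 0, 0, 0, 6],     # A: 6 false unknowns
    [0, 12, 0, 0, 8],     # B: 8 false unknowns
    [0, 0, 10, 0, 10],    # C: 10 false unknowns
    [0, 0, 0, 20, 0],     # D: all recognized (assumed)
    [0, 10, 0, 10, 180],  # noise: 10 false knowns to B, 10 to D
]
regular = averaged_scores(cm, classes=[0, 1, 2, 3, 4])
open_set = averaged_scores(cm, classes=[0, 1, 2, 3])
for name, res in (("regular", regular), ("open-set", open_set)):
    print(name,
          "macro", [round(v, 3) for v in res["macro"]],
          "micro", [round(v, 3) for v in res["micro"]])
```

Under these assumptions the regular scores come out higher than the open-set ones, and open-set micro-precision differs from open-set micro-recall because false knowns and false unknowns are counted separately.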
By using the regular macro-averaged F1 score in open-set classification, performance can be over-estimated. This is far more obvious when micro-averaged F1 scores are used. In that case, the difference between regular micro F1 and its open-set variant is huge due to the class imbalance problem. As already mentioned, the noise usually outnumbers any known class in WGI tasks and this considerably affects the credibility of micro-averaged measures. It is also noticeable that while regular micro-averaged scores are by definition equal for precision, recall, and F1 (and also for accuracy), this is not the case for their open-set variants.
Clearly, there is a significant difference between P_macro and P_micro, as well as between R_macro and R_micro. However, note that the change is in the opposite direction depending on whether regular measures or their open-set variants are used. In particular, open-set P_macro is higher than open-set P_micro while regular P_macro is lower than regular P_micro. The same pattern applies to recall and F1 scores. This clearly indicates that by adopting regular evaluation measures that are suitable for closed-set classification, it is possible to draw unreliable conclusions. Note also that, in this example, the difference between open-set macro-averaged scores and open-set micro-averaged scores does not seem so important. Both approaches seem robust and roughly indicate similar conclusions. However, recall that the samples are evenly distributed over the known classes. If this is not the case, then macro-averaged scores are more reliable since they give equal weight to all known classes. In this thesis, open-set macro-averaged evaluation scores are used.

Precision-Recall Curves
So far, the evaluation measures assume classification algorithms that provide crisp predictions (i.e., hard classifiers). The discussed evaluation measures can only show particular aspects of the performance of classifiers. To obtain a deeper look we need richer evaluation methods that can depict the performance of classifiers in a variety of conditions. One such method is the Precision-Recall Curve (PRC), a standard method for evaluating information retrieval systems and ranking systems. This approach can only be applied to soft classifiers that are able to explicitly estimate class conditional probabilities. Fortunately, the vast majority of hard classifiers can be adapted to also provide some form of score that can be regarded as a class conditional probability. The calculation of a PRC requires the ranking of estimated probabilities in descending order. In each step, the next prediction is considered and a new precision and recall point is calculated. Both macro-averaged and micro-averaged PRCs can be calculated. Table 4.4 shows an example of this procedure. As can be seen, several samples may have the same certainty score that is used to order predictions. Wrong predictions result in lower precision and recall values.
In order to facilitate the comparison of PRCs corresponding to the performance of different algorithms on the same evaluation dataset, the 11-standard recall level normalization is typically used. The initial points of the PRC are reduced to 11 points that correspond to the standard recall levels [0, 0.1, ..., 1.0]. For example, in case Recall = 0.1, we measure precision when 10% of the samples belonging to the known classes have been correctly recognized. Precision values are interpolated based on the following formula:
$$P(r_j) = \max_{r_j \le r \le r_{j+1}} P(r) \tag{4.8}$$
where P(r_j) is the precision at the standard recall level r_j (r_j ∈ {0, 0.1, 0.2, ..., 1.0}).
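The ranking procedure and the interpolation of formula 4.8 can be sketched as follows. The ranked list of correct and wrong predictions is invented for illustration. Two choices are assumptions of this sketch rather than statements from the text: recall levels that the classifier never reaches receive precision 0, and the last level (1.0) uses only points with recall equal to 1.0.

```python
def pr_points(ranked_correct, n_relevant):
    """(recall, precision) after each prediction, most confident first.

    ranked_correct: booleans, True when the prediction at that rank is
        correct (predictions already sorted by descending score).
    n_relevant: number of evaluation samples of the known classes.
    """
    points, hits = [], 0
    for k, correct in enumerate(ranked_correct, start=1):
        hits += bool(correct)
        points.append((hits / n_relevant, hits / k))
    return points

def interpolate_11(points, eps=1e-9):
    """Precision at the 11 standard recall levels, following formula 4.8."""
    levels = [j / 10 for j in range(11)]
    curve = []
    for j, low in enumerate(levels):
        high = levels[j + 1] if j < 10 else 1.0
        window = [p for r, p in points if low - eps <= r <= high + eps]
        # Assumption: precision is 0 for recall levels never reached.
        curve.append(max(window, default=0.0))
    return curve

# Hypothetical ranking of 10 predictions; 10 known-class samples exist,
# so the remaining ones were left unclassified and recall stops at 0.6.
ranked = [True, True, False, True, False, True, True, False, False, True]
curve = interpolate_11(pr_points(ranked, n_relevant=10))
print([round(p, 3) for p in curve])
```

Since only six of the ten known-class samples are recognized in this invented ranking, the curve drops to zero above the 0.6 recall level, which corresponds to the case where some pages of known classes are left unclassified.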
The Area Under the Curve (AUC) is a scalar measure that can be extracted from a PRC and can be used to facilitate the comparison of different approaches. Certainly, it lacks the details of a PRC, but it is useful especially when we want to compare the performance obtained when different parameter settings are applied to the same algorithm. In those cases, both AUC and F1 can be used as optimization criteria. An example of calculating these measures for two systems is provided in Table 4.5. In this thesis, we will adopt both of these measures. In Figure 4.2 the PRCs of two different systems are depicted when regular evaluation measures are used (i.e., the unknown class is also considered). Similarly, Figure 4.1 shows the corresponding performance of the same systems when the open-set variants of the evaluation measures are used (i.e., the unknown class is excluded). As can be seen, according to macro-averaged scores, regular measures and open-set variants lead to opposite conclusions. The former favour the grey system while the latter indicate that the red system is better. Clearly, the red system makes fewer mistakes in the recognition of known genres. In addition, the estimation of performance of both systems with micro-averaged PRCs seems very optimistic.
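One simple way to reduce an 11-point PRC to a single AUC value is the trapezoidal rule over the standard recall levels. The text does not state which integration rule is used, so the rule below is an assumption, and the two curves are invented for illustration.

```python
def auc_11pt(precisions):
    """Area under an 11-point PRC using the trapezoidal rule.

    precisions: interpolated precision at recall 0.0, 0.1, ..., 1.0.
    """
    if len(precisions) != 11:
        raise ValueError("expected precision at 11 standard recall levels")
    step = 0.1
    return sum((precisions[j] + precisions[j + 1]) / 2 * step
               for j in range(10))

# Hypothetical curves of two systems on the same evaluation dataset.
system_a = [1.0, 1.0, 0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.0, 0.0, 0.0]
system_b = [1.0, 0.9, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.4, 0.0]

print(f"AUC A: {auc_11pt(system_a):.3f}, AUC B: {auc_11pt(system_b):.3f}")
```

The scalar hides where the curves differ: in this invented case system A is more precise at low recall levels while system B reaches higher recall, a distinction that only the full curves show.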

The Openness Test
The open-set evaluation measures defined in this chapter can be used in both unstructured and structured noise. However, in the structured noise case, we need a more detailed analysis of the performance to evaluate the ability of the open-set classifier to handle low/high number of training/unknown classes. It is especially important to study the relation of the number of training classes with respect to the number of unknown classes.
In (Scheirer et al., 2013), the openness measure is introduced for measuring this relation. The openness measure indicates the difficulty of an open-set classification task by taking into account the number of training classes (i.e., the known classes used in the training phase) and the number of testing classes (i.e., both known and unknown classes used in the testing phase) (Mendes Júnior et al., 2016):
$$\mathrm{openness} = 1 - \sqrt{\frac{|\text{training classes}|}{|\text{testing classes}|}} \tag{4.9}$$
When openness is 0.0, it is essentially a closed-set task, that is, the training and testing classes are exactly the same. This means that there is no noise. At the other extreme, when openness approaches 1.0, the known classes are far fewer than the unknown classes, that is, the amount of noise is especially high and heterogeneous. Therefore, by varying the openness level we can study the performance of WGI models in different conditions.
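The openness levels can be computed directly from the class counts. The sketch below assumes the form of the measure given by Mendes Júnior et al. (2016), i.e., one minus the square root of the ratio of training classes to testing classes; this form reproduces the range 0.065 to 0.646 reported in the next chapter for 7 down to 1 training classes out of 8 testing classes.

```python
import math

def openness(n_training_classes, n_testing_classes):
    """Openness of an open-set task (form assumed from Mendes Junior et al.)."""
    if not 0 < n_training_classes <= n_testing_classes:
        raise ValueError("need 0 < training classes <= testing classes")
    return 1.0 - math.sqrt(n_training_classes / n_testing_classes)

# Closed-set task: training and testing classes coincide.
print(openness(8, 8))

# 8 testing classes, with 7 down to 1 of them used for training.
for n_train in range(7, 0, -1):
    print(n_train, round(openness(n_train, 8), 3))
```

Note that the openness level depends only on the number of classes, not on how many samples each class contains.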
Note that the openness measure can only be applied to datasets where all available samples have been labeled with class information. In the case of WGI, we have to know the genre labels of the pages that form the noise (i.e. structured noise). This information is only used to quantify the homogeneity of the noise.
The study of open-set classifiers can be significantly extended by measuring their performance (e.g., macro-averaged F1) for varying values of the openness score. Given a dataset with a set of classes C, it is possible to segment it into two parts: one K ⊂ C to serve as the set of known classes and another U ⊂ C (K ∪ U = C) to serve as structured noise. Thus, it is possible to vary the training classes from 1 to |C| − 1 while the testing classes are always |C|. Figure 4.3 depicts the macro-averaged F1 scores at different openness levels for two open-set algorithms. A dataset with 5 classes was used to obtain these results. The lowest openness score in that figure corresponds to the case of 4 training classes and 5 testing classes (i.e., noise is composed of a single class). At the highest openness score, there is only one known class and 5 testing classes (i.e., noise is composed of 4 classes). As can be seen, it is possible to conclude that algorithm B is generally better than A. However, algorithm A is better able to cope with the extreme cases of openness.
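The segmentation of a labeled dataset into known classes K and structured noise U can be sketched as a generator of class splits, one group per openness level. The genre labels, the number of repetitions, and the use of random selection are assumptions made for illustration.

```python
import math
import random

def openness_splits(classes, repetitions, seed=0):
    """Yield (openness, known, unknown) class splits for the openness test.

    The number of known classes goes from |C| - 1 down to 1, while all
    |C| classes are always present in the testing phase.
    """
    rng = random.Random(seed)
    classes = list(classes)
    for n_known in range(len(classes) - 1, 0, -1):
        level = 1.0 - math.sqrt(n_known / len(classes))
        for _ in range(repetitions):
            known = sorted(rng.sample(classes, n_known))
            unknown = sorted(set(classes) - set(known))
            yield level, known, unknown

# Hypothetical 5-genre palette.
genres = ["blog", "eshop", "faq", "news", "forum"]
for level, known, unknown in openness_splits(genres, repetitions=2):
    print(round(level, 3), known, unknown)
```

Each split would then be used to train a model on samples of the known classes only and to evaluate it on samples of all classes, with the open-set measures averaged per openness level.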

Conclusions
In this chapter we discussed evaluation measures that can be used for open-set classifiers. We demonstrated that traditional precision, recall, and F1 measures are misleading since they treat the unknown class (noise) as a single regular class. However, since this is a heterogeneous class without training examples, it should not be treated equally with the known classes. For that reason, modifications of these evaluation measures, namely open-set precision, recall, and F1, are more appropriate since the TPs of the unknown class are ignored. For open-set WGI, where the noise is usually highly heterogeneous and significantly outnumbers the known classes, the use of the open-set variants of the measures is considered very important. Another main direction is the use of graphical evaluation methods that better depict the performance of open-set classifiers in various conditions. We suggest the use of two such methods, the precision-recall curves on 11 standard recall levels and the openness test. The former provides a detailed view of the performance of an open-set WGI system that suits any given application (e.g., in ranking applications, precision at low recall levels is of paramount importance). The latter provides direct control of the difficulty of the task under structured noise and can demonstrate the performance of the classifiers in varying conditions, from cases very similar to the closed-set scenario where noise is homogeneous to cases very similar to binary classification where the structured noise is highly heterogeneous.
In the next chapters we will adopt the evaluation principles described here to evaluate the open-set WGI algorithms introduced in this thesis.
Experimental Analysis of Open-set WGI Methods

Introduction
Based on the evaluation framework described in the previous chapter, it is now possible to evaluate the open-set WGI algorithms presented in chapter 3. Certainly, any kind of empirical evaluation depends on the dataset that is used for estimating the performance of the examined models. Each dataset has its weaknesses and this might lead to an over-estimation or under-estimation of performance of the examined methods in more realistic conditions. However, what we want to study here is the comparison of performance of different approaches on exactly the same datasets and experimental setup to extract conclusions about the relative improvement in performance of one method with respect to another.
In this thesis, we focus on the effect of noise in open-set WGI approaches. In particular, we want to examine the performance of WGI methods when either unstructured or structured noise is available. The former is a realistic scenario in most WGI applications where it is difficult, if not impossible, to define the genre of a large part of the web. In such cases, it is better to assume that the unknown class comprises any kind of web-pages that do not belong to the known genres. On the other hand, this makes the definition of noise chaotic and extremely heterogeneous.
The case of structured noise offers the opportunity to study the performance of open-set WGI methods when the heterogeneity of noise can be controlled. The assumption that information about the unknown classes is available (although the classifier has no training examples for these classes) is not unrealistic. For example, an open-set WGI system that aims to recognize news articles should not be distracted by blogs. That is, we know that blogs exist and perhaps comprise the majority of noise in that system but we do not provide training examples for that class.
Among the three open-set WGI methods proposed in this thesis, OCSVM and RFSE are examined in this chapter while NNDR is thoroughly tested in Chapter 6. The reason for this is that NNDR is especially difficult to test in conditions of structured noise, especially when few known classes are available, since a part of the known classes has to be used for estimating the open space risk. Therefore, NNDR is only tested with unstructured noise.
The estimation of performance of WGI methods also depends on the applications in which they are going to be used. Some applications require high precision (e.g., ranking genre-based search results). On the other hand, it is rather unusual to aim for high recall at the cost of reducing precision in WGI-related applications. The experimental analysis should also reflect these facts.
Another crucial issue is the representation of web-pages. As already explained in chapter 3, the dimensionality of the representation and the existence of irrelevant and redundant features can severely harm the performance of certain open-set WGI methods. Therefore, it is important to study how different text representation schemes, especially the ones that were found to be the most reliable in previous WGI studies, affect the performance of the examined methods.
In the remainder of this chapter, we first describe the datasets used in this study and the experimental setup. Then, we present the experimental results in open-set WGI when either unstructured noise or structured noise is available. Finally, we summarize the drawn conclusions.

Corpora
In this chapter we study the performance of the open-set classification models on the WGI task. In particular, the two open-set algorithms described above are tested in detail on benchmark corpora. Our experiments are based on the following corpora already used in previous work in WGI (Eissen and Stein, 2004; Santini, 2007; Kanaris and Stamatatos, 2009):
1. SANTINIS (Mehler, Sharoff, and Santini, 2010): This is a corpus comprising 1,400 English web pages evenly distributed into 7 genres as well as 80 BBC web pages evenly categorized into 4 additional genres. In addition, it comprises a random selection of 1,000 English web pages taken from the SPIRIT corpus (Joho and Sanderson, 2004). The latter can be viewed as noise in this corpus. Details are given in

Experimental Setup
The text representation features used in this thesis are based exclusively on textual information from web pages excluding any structural information, URLs, etc. This does not mean that we consider other kinds of information (e.g., HTML-based features, URL-based features etc.) as less important in WGI. However, information coming from the text itself is less likely to be affected by technology-related choices that can be easily altered through time. By focusing on the text of the web pages we ensure that the drawn conclusions are more reliable and long lasting.
Based on the good results reported in (Sharoff, Wu, and Markert, 2010a; Kanaris and Stamatatos, 2009; Asheghi, 2015) as well as some preliminary experiments, the following document representation schemes are examined:
• Character 4-grams (C4G)
• Word unigrams (W1G)
• Word 3-grams (W3G)
• Part-of-speech 3-grams (POS3G)
The Stanford POS tagger has been used for POS3G creation. We use the Term-Frequency (TF) weighting scheme and the feature space is defined by a vocabulary which is extracted based on the terms appearing in the training set only. There is no pre-processing of textual data (e.g., stop word removal, stemming, etc.) since in style-based text categorization tasks these processes remove significant stylistic information (Stamatatos, 2009).
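A minimal sketch of the TF representation with a training-only vocabulary is given below. Whitespace tokenization, raw counts as term frequencies, and keeping the most frequent training terms up to a maximum vocabulary size are assumptions of this sketch; the toy texts are invented. POS3G is omitted because it requires an external tagger.

```python
from collections import Counter

def char_ngrams(text, n=4):
    """Overlapping character n-grams (C4G for n=4), no pre-processing."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n=1):
    """Word n-grams (W1G for n=1, W3G for n=3), whitespace tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(train_texts, extract, max_size):
    """Most frequent terms of the training texts only (assumed criterion)."""
    counts = Counter()
    for text in train_texts:
        counts.update(extract(text))
    return [term for term, _ in counts.most_common(max_size)]

def tf_vector(text, vocabulary, extract):
    """Raw term frequencies; terms outside the vocabulary are ignored."""
    counts = Counter(extract(text))
    return [counts[term] for term in vocabulary]

train = ["buy now buy cheap", "my blog post"]
vocab = build_vocabulary(train, word_ngrams, max_size=10)
vector = tf_vector("buy my new blog", vocab, word_ngrams)
print(dict(zip(vocab, vector)))
print(char_ngrams("blogs"))
```

The word "new" does not occur in the training texts, so it has no dimension in the feature space and is silently dropped from the test vector.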
Each open-set WGI method has some hyper-parameters to be tuned. In order to extract the best possible parameter settings for each classification method we apply grid-search over the space of all parameter value combinations. This is not the most sophisticated approach but ensures that the extracted parameter values will fine-tune the model for the specific dataset.
With respect to RFSE, four parameters should be set: the vocabulary size F, the number of features used in each iteration fs, the number of iterations I, and the threshold σ. We examined F = {5k, 10k, 50k, 100k}, fs = {1k, 5k, 10k, 50k, 90k}, I = {10, 50, 100} (following the suggestion in (Koppel, Schler, and Argamon, 2011) that more than 100 iterations do not significantly improve the results), and σ = {0.5, 0.7, 0.9} (based on some preliminary tests). Additionally, in this thesis we test three document similarity measures used in the RFSE approach: cosine similarity, MinMax similarity, and Combo (as defined in Section 3.4.2).
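The grid of RFSE settings can be enumerated as in the sketch below. The parameter values are the ones listed above. Discarding combinations in which the feature subset is larger than the vocabulary is an assumption of this sketch, and `evaluate` is a hypothetical callback standing in for training and scoring a model (e.g., by AUC or F1).

```python
from itertools import product

grid = {
    "F": [5_000, 10_000, 50_000, 100_000],         # vocabulary size
    "fs": [1_000, 5_000, 10_000, 50_000, 90_000],  # features per iteration
    "I": [10, 50, 100],                            # iterations
    "sigma": [0.5, 0.7, 0.9],                      # threshold
    "similarity": ["cosine", "minmax", "combo"],
}

def rfse_settings(grid):
    """Yield every parameter combination as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        setting = dict(zip(keys, values))
        # Assumption: a feature subset cannot exceed the vocabulary.
        if setting["fs"] <= setting["F"]:
            yield setting

def grid_search(settings, evaluate):
    """Return the setting with the highest score.

    evaluate: hypothetical callback mapping a setting to a score.
    """
    return max(settings, key=evaluate)

n_settings = sum(1 for _ in rfse_settings(grid))
print(n_settings, "settings to evaluate")
```

Changing the `evaluate` callback is all that is needed to select settings under a different optimization criterion.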

WGI with Unstructured Noise
The two open-set algorithms, RFSE and OCSVM, described in sections 3.1 and 3.2, are initially tested on the SANTINIS corpus which, as explained above, is a corpus with unstructured noise. In the training phase, only the 11 known genres are considered. In the testing phase, the noise pages coming from the SPIRIT corpus are also used. It should be noted that information about the true genre of these pages is not available. A 10-fold cross-validation is performed where in each fold the full set of 1,000 pages of noise is included. This strategy provides a more realistic evaluation framework since the size of the noise is much greater than the size of any genre included in the given palette.
Figures 5.1 and 5.2 depict the Precision-Recall curves (PRC) of the OCSVM and RFSE models, respectively. For each model and each one of the three document representations, the parameters that maximize performance with respect to the F1 measure are used. Recall from section 4.2.3 that whenever recall does not reach 1.0, some pages belonging to known classes were classified as unknown.
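The cross-validation scheme with the full noise set in every test fold can be sketched as follows. The page identifiers are placeholders, and the simple shuffled split (without stratification by genre) is an assumption of this sketch, since the text does not say how the folds are formed.

```python
import random

def folds_with_noise(known_ids, noise_ids, k=10, seed=0):
    """Yield (train, test) splits for open-set cross-validation.

    Training folds contain known-genre pages only; every test fold
    contains its share of known pages plus the full set of noise pages.
    """
    rng = random.Random(seed)
    shuffled = list(known_ids)
    rng.shuffle(shuffled)
    parts = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        test = parts[i] + list(noise_ids)
        yield train, test

# Placeholder identifiers: 1,400 + 80 known-genre pages, 1,000 noise pages.
known = [("known", i) for i in range(1480)]
noise = [("noise", i) for i in range(1000)]

for train, test in folds_with_noise(known, noise):
    n_noise = sum(1 for kind, _ in test if kind == "noise")
    print(len(train), len(test) - n_noise, n_noise)
```

With these sizes, each test fold contains 148 known-genre pages against 1,000 noise pages, which reflects the imbalance described above.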
In all cases, RFSE outperforms OCSVM. Moreover, for both methods, W3G seems to be the best feature type for this corpus, followed by C4G. OCSVM performance is only comparable with RFSE when W3G is used.
The performance of the open-set WGI methods is further explored by selecting parameter settings with different optimization criteria. Tables 5.2 and 5.3 show the combinations of parameters that optimize the performance of OCSVM and RFSE based on AUC, F1, and F0.5. In each row of these tables one of the three performance measures is maximized, and the values of all three are reported. It is clear that performance is maximized in all cases when the W3G document representation is used. In previous studies based on a closed-set framework, C4G was the feature type that maximized performance (Sharoff, Wu, and Markert, 2010b). This indicates that contextual and content information is important for this corpus (Asheghi, 2015).
In addition, in almost all cases, RFSE models are far more effective than OCSVM. Another important conclusion is that the optimization criterion plays a crucial role for the properties of the model, especially for RFSE. When AUC is maximized, recall is favored. On the other hand, when F1 is maximized, precision is substantially increased. Fig. 5.3 shows the performance of OCSVM and RFSE models when the AUC and F1 criteria are used to select parameter settings. As can be seen, the RFSE model based on F1 maximization avoids making wrong decisions and leaves a large number of web pages unclassified. On the other hand, the model optimized by AUC accepts more errors in order to recognize more web pages of known genres. OCSVM models do not seem to be significantly affected. Note that choosing between WGI models that prefer precision over recall and vice versa is an application-specific task.

WGI with Structured Noise
In this section we describe experiments with the RFSE and OCSVM algorithms using a corpus with structured noise. The KI-04 corpus has been used for this set of experiments.
The experiments extensively test the noise tolerance of the algorithms in the open-set classification task for different openness levels, as explained in section 4.3. In more detail, the number of training classes varies from 7 to 1 while the number of testing classes is always kept at its maximum, 8. As a result, the openness measure varies from 0.065 to 0.646.
One extreme refers to the case where only one genre class is unknown while in the other extreme only one genre class is known. In the extreme case of the maximum openness level, the problem is actually reduced to a binary problem of 1-vs-rest. On the contrary, in the extreme case of minimum openness level, the problem is a multiclass classification with only one unknown class which is virtually complete, i.e. contains single genre pages and no other pages that could be considered as noise.
The known classes are randomly selected for each openness level and the experiment is repeated 8 times, each time performing 10-fold cross-validation. Moreover, to avoid any biased selection of parameter values, the parameter settings found to be optimal for the SANTINIS corpus in section 5.4 are used. Figures 5.4 and 5.5 show the performance (F1) of the OCSVM and RFSE models using different text representation features for varying openness levels. Standard error bars are also depicted to show the variance of performance for each model. RFSE models based on C4G and W1G gradually get worse as openness increases, while W3G models seem to be relatively stable. Surprisingly, the performance of OCSVM seems to improve with increasing openness; this pattern is consistent across all three feature types, with C4G being the most effective. Note, however, that at the maximum openness level the problem is equivalent to a closed-set binary (i.e., 1-vs-rest) classification problem.
As highlighted in the previous section, depending on the properties of the application in which WGI is involved, precision may be more important than recall or vice versa. In figure 5.6 the macro-precision of RFSE is depicted for W3G, W1G, and C4G features. MinMax similarity is used since it significantly increases the performance of RFSE with respect to precision. As concerns text representation, W1G is the best choice when precision is more important than recall. On the other hand, W3G features seem to be more stable because the standard error is lower than that of the other features and the W3G model is not affected much when openness surpasses 0.5 (it actually improves).
In the case of C4G and W1G at the openness level of 0.646, the standard error is high in both cases. Since this problem only occurs when the task has been reduced to a binary one, it is interesting to see whether it is caused by the choice of the document representation or by the choice of the similarity measure. Despite OCSVM's improvement when structured noise is used, it can only be competitive with RFSE at a high openness level, where all genre labels but one are considered unknown. This can be better viewed in figure 5.7 where OCSVM is compared with RFSE models based on the MinMax and Combo similarity measures for a varying openness level. These curves correspond to W1G features, so they are not the optimal models. However, they provide a fair comparison between the examined methods. As the standard error bars indicate, the performance of RFSE models with respect to the F1 measure is significantly better than that of OCSVM while openness is less than 0.5. Beyond that level, OCSVM is significantly better than RFSE models. It should also be noted that the Combo measure helps RFSE while openness is relatively low and MinMax seems to be a better choice when openness increases.

Conclusions
In this chapter an experimental study on WGI has been presented, focusing on open-set evaluation for this task. In contrast to the vast majority of previous work in this area, the open-set scenario is adopted, which is more realistic for WGI since it is not feasible to construct a genre palette with all available genres and appropriate samples for each one of them. Moreover, we examined two open-set classification methods and several feature types and similarity measures. The presented evaluation of open-set WGI covers two basic scenarios. The first is when noise is unstructured, i.e., information about the true genre of pages not belonging to the known genre palette is not available. The second scenario applies when noise is structured, i.e., we actually know the true genre of pages not included in the training classes. In both cases, the evaluation methodologies for open-set classification proposed in Chapter 4 have been used.
In almost all examined cases, RFSE models outperformed the corresponding OCSVM models. This verifies previous findings about the appropriateness of RFSE for WGI (Pritsos and Stamatatos, 2013). RFSE is able to provide effective models and, additionally, it is possible to manage the preference for recall or precision, an application-dependent choice, by focusing on optimizing AUC or $F_1$ respectively. On the other hand, OCSVM proved to be the best-performing method in extreme cases when openness is high. The restrictions of the available corpora did not allow us to examine cases where openness approaches 1.0. However, it seems that when openness is more than 0.5 OCSVM outperforms RFSE.
As concerns the feature types, in most of the cases W3G and C4G provided the best results. However, the selection of text representation features is a crucial choice that affects performance and it seems to be corpus-dependent. Another crucial parameter of RFSE is the similarity measure. Among the examined measures, MinMax and its combination with cosine similarity provide the most robust results. The choice of similarity measure correlates with feature types. It seems that the combo measure is more effective than MinMax in low openness conditions.
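To make the two underlying similarity measures concrete, the following sketch computes MinMax and cosine similarity for two non-negative term-frequency vectors. It assumes the usual definition of MinMax (sum of element-wise minima divided by sum of element-wise maxima); the exact formulation used by RFSE, and the way the Combo measure combines the two, are defined elsewhere in the thesis and are not reproduced here. The vectors `doc` and `centroid` are toy values chosen for illustration only.

```python
import math


def minmax_similarity(x, y):
    """Sum of element-wise minima over sum of element-wise maxima."""
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    return numerator / denominator if denominator else 0.0


def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


# Toy term-frequency vectors (illustrative values, not thesis data).
doc = [2, 0, 1, 3]
centroid = [1, 1, 1, 3]
mm = minmax_similarity(doc, centroid)
cos = cosine_similarity(doc, centroid)
print(f"MinMax = {mm:.3f}, cosine = {cos:.3f}")
```

MinMax is sensitive to differences in term magnitudes, whereas cosine only depends on the direction of the vectors, which is one reason the two measures can rank the same pair of documents differently.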
To enhance the evaluation of WGI models in open-set conditions, we need larger corpora including multiple genre labels. New enhanced open-set WGI methods are needed and they should be evaluated using the proposed paradigm. Otherwise, using an evaluation paradigm more appropriate for closed-set tasks, the performance may be over-estimated.
The Usefulness of Distributed Representations in WGI

Introduction
The most traditional text representation scheme in text mining tasks is the Bag-of-Words (BOW) model which is based on individual tokens as features. It is a simplistic approach to quantify textual information assuming independence of the occurrence of individual tokens in documents. The result is a document vector of high dimensionality (i.e., in the order of thousands of features) and sparseness (i.e., only a few non-zero values per document). The BOW model is not able to capture information about the grammar of documents and completely ignores word order. In addition, it is confused by synonym terms since it assumes they are independent. Nevertheless, it provides an easy and quite competitive approach to represent documents (the W1G scheme used in Chapter 5 is actually based on BOW).
A more elaborate text representation scheme is to consider n-grams of words (e.g., the W3G model used in Chapter 5). This captures information about word sequences, like phrases, and can improve the ability of the model to represent syntactic information since the context of words is partially taken into account. Nevertheless, the dimensionality of the representation is considerably increased when the order of the model (n) is high. In addition, the sparseness of the vectors is increased. It is also possible to apply the n-gram approach on the character level or on the POS-tag level, as shown in the experiments of Chapter 5 (i.e., C4G, POS3G). The main assumption that each feature (n-gram) is independent of the other features is still doubtful in such models.
An alternative approach is to use distributed representations that attempt to introduce some kind of dependence of each word (or n-gram) on the other words (or n-grams). For example, the words usually encountered in the context of a specific word are more dependent on that word. In addition, different words found in similar contexts get a higher share of dependence. Distributed representations can be obtained by applying language modeling methods, especially neural network language models and the popular word and document embeddings introduced in (Mikolov et al., 2013a).
One main advantage of distributed representations is that they provide compact (i.e., low-dimensional) and dense vectors to quantify syntactic and semantic information in documents. In comparison to regular BOW or n-gram models, distributed features are much less redundant and irrelevant since each such feature captures a combination of information that cannot be specifically determined. Therefore, it seems that open-set WGI methods that are not able to easily handle high-dimensional, sparse vectors with many irrelevant and redundant features would be highly improved by using distributed representations. As already explained in Chapter 3, NNDR is an algorithm that, in theory, is vulnerable when it is not combined with appropriate feature sets. The main goal of this Chapter is to examine how the performance of NNDR in WGI tasks is affected when combined with either traditional BOW-like features or distributed features.
The rest of this chapter is organized as follows. First, the main ideas of distributed representation are presented. Then, the specific distributed features used in this thesis are described. Next, we compare the performance of NNDR using traditional sparse representation schemes with the case where dense vectors are used. We also compare these versions of NNDR with the OCSVM and RFSE methods and discuss the main conclusions of this study.

Obtaining Distributed Representations
One way to obtain a low-dimensional and dense representation of documents is the use of topic modeling. Topic modeling methods attempt to group terms according to their co-occurrence in documents. They provide a new feature space (composed of latent topics) of pre-defined dimensionality. One popular topic modeling approach is Latent Semantic Analysis, a linear algebraic method that transforms a high-dimensional and sparse representation into a low-dimensional and dense one by applying singular value decomposition (Kontostathis and Pottenger, 2006). Another popular approach is Latent Dirichlet Allocation, a generative probabilistic model where each document is represented as a mixture over a set of latent topics. Each topic is in turn defined as a distribution over words (Blei, Ng, and Jordan, 2003).
Another main direction that gained huge popularity during the last years is the use of neural probabilistic language models (Bengio et al., 2003). We first describe how words can be represented in a continuous space and then we focus on documents.

Word Embeddings
The main idea is that words can be represented by real vectors (word embeddings) that are learned by a neural network (Mikolov et al., 2013b). This is unsupervised learning since documents need not be labeled. The neural network is trained to recognize words that occur in similar context. Then, each word is represented in continuous vector space and similar words tend to cluster in the same area. In addition, the distance between related words is affected by semantic similarity (e.g., the difference between terms "king" and "man" is close to the difference between "queen" and "woman") (Mikolov et al., 2013b).
In practice, the distributed features are a mapping of the vocabulary words $V = \{w_i, i \in [1, |V|]\}$ to real vectors $t_i \in \mathbb{R}^m$. One basic architecture is the Continuous Bag-of-Words (CBOW) model, which attempts to predict a word given its context. This is a feedforward neural network with an input layer, a projection layer, and an output layer, as shown in Figure 6.1 (Mitra and Craswell, 2018). The input layer is composed of the context of a word (i.e., the few words immediately to its right and left). Every word in the vocabulary is assigned to a one-hot vector $t_i$ (i.e., a vector of size $|V|$ with all but one value equal to zero). The context word vectors are added to form the input vector $t_i^*$. Since the order of words is not important in this setting, the model bears similarities to Bag-of-Words (Mitra and Craswell, 2018).
The weight matrix $W_{in}$ is of size $|V| \times m$ while $W_{out}$ is of size $m \times |V|$, where $m$ is the size of the hidden layer ($m \ll |V|$) and also corresponds to the dimensionality of the extracted distributed representation. The size of the output vector is equal to the vocabulary size.
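The shapes involved in the CBOW architecture can be illustrated with a small forward pass. The sketch below is a toy illustration under stated assumptions, not the training procedure: the weights are random, the vocabulary has only six words, and the names `one_hot` and `cbow_forward` are hypothetical. It sums the one-hot context vectors, projects them through a $|V| \times m$ input matrix, and applies a softmax over the $m \times |V|$ output layer.

```python
import math
import random


def one_hot(index, size):
    """One-hot vector of length `size` with a 1 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def cbow_forward(context_ids, w_in, w_out):
    """Softmax distribution over the vocabulary given the context word ids."""
    vocab_size, m = len(w_in), len(w_in[0])
    # Input: sum of the one-hot context vectors (word order is ignored).
    x = [0.0] * vocab_size
    for idx in context_ids:
        for i, value in enumerate(one_hot(idx, vocab_size)):
            x[i] += value
    # Projection layer: h = W_in^T x, a dense vector of size m.
    h = [sum(x[i] * w_in[i][d] for i in range(vocab_size)) for d in range(m)]
    # Output scores: y = W_out^T h, one score per vocabulary word.
    y = [sum(h[d] * w_out[d][j] for d in range(m)) for j in range(vocab_size)]
    # Numerically stable softmax.
    top = max(y)
    exps = [math.exp(score - top) for score in y]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
V, m = 6, 3  # toy vocabulary size and embedding dimensionality
w_in = [[rng.uniform(-0.5, 0.5) for _ in range(m)] for _ in range(V)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(m)]

probs = cbow_forward([0, 1, 3, 4], w_in, w_out)  # context of word 2
embedding_of_word_2 = w_in[2]  # a row of W_in serves as the word embedding
print(len(probs), round(sum(probs), 6), len(embedding_of_word_2))
```

Because the context vectors are summed, permuting the context words yields the same output distribution, which is the Bag-of-Words property mentioned above.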
During training, CBOW attempts to learn the weight matrices $W_{in}$ and $W_{out}$. The loss function of CBOW is the following conditional log probability:

$$\mathcal{L}_{CBOW} = -\frac{1}{S} \sum_{i=1}^{S} \log p(t_i \mid t_{i-k}, \ldots, t_{i-1}, t_{i+1}, \ldots, t_{i+k}) \tag{6.1}$$

where $k$ is the size of context words and $S$ is the number of possible context windows in the training texts. Stochastic Gradient Descent and Backpropagation are used to train the network. CBOW is actually an encoder-decoder model and applies a SoftMax function to its output:

$$p(t_i \mid t_{i-k}, \ldots, t_{i-1}, t_{i+1}, \ldots, t_{i+k}) = \frac{\exp(y_{t_i})}{\sum_{j=1}^{|V|} \exp(y_{t_j})} \tag{6.2}$$

where $y_{t_i}$ is the output vector for term $t_i$. Another architecture is the skip-gram model, which attempts to predict the context of a word. This is depicted in Figure 6.2 (Mitra and Craswell, 2018). Again, input and output are one-hot vectors while the hidden layer is of dimensionality $m$ ($m \ll |V|$). The objective is to learn the weight matrices $W_{in}$ and $W_{out}$, and the loss function is as follows:

$$\mathcal{L}_{skip\text{-}gram} = -\frac{1}{S} \sum_{i=1}^{S} \sum_{\substack{-k \le j \le k \\ j \ne 0}} \log p(t_{i+j} \mid t_i) \tag{6.3}$$

where $k$ is the number of context words to be predicted, $S$ is the number of all windows in the training set, and $p(t_{i+j} \mid t_i)$ is obtained as follows:

$$p(t_{i+j} \mid t_i) = \frac{\exp(y_{t_{i+j}})}{\sum_{l=1}^{|V|} \exp(y_{t_l})} \tag{6.4}$$

FIGURE 6.1: Architecture of the CBOW model (Mitra and Craswell, 2018). The network attempts to predict a word given its context words. The order of input words is ignored. The hidden layer has much lower dimensionality in comparison to the one-hot representation of input and output words. The learned weights in $W_{in}$ (and $W_{out}$) can be used as word embeddings.
Finally, since the above neural models (both CBOW and skip-gram) approximate a probability distribution of words over the vocabulary $V$, they also satisfy the following constraint:

$$\sum_{j=1}^{|V|} p(t_j \mid \cdot) = 1 \tag{6.5}$$

where the conditioning is on the context words for CBOW and on the input word for skip-gram.

FIGURE 6.2: Architecture of the skip-gram model (Mitra and Craswell, 2018). Given a word, the network tries to predict its context words. The dimensionality of the hidden layer is much lower than the one-hot representation of input and output words. The learned weights in $W_{in}$ (and $W_{out}$) can be used as word embeddings.

Note that in both CBOW and skip-gram models the two weight matrices $W_{in}$ and $W_{out}$ can be used to provide the word embeddings. Usually $W_{in}$ plays this role and $W_{out}$ is discarded. To summarize, the above models are very effective language modeling approaches with the ability to quantify simultaneously syntactic and semantic information of words. They provide a distributed representation for words (i.e., each word is represented by a dense vector, which is a point in a space of relatively low dimensionality). However, it is not easy to understand the actual meaning of each dimension in this space. The sequence of words in texts is now taken into account, and the same approach can also be applied when input texts are composed of sequences of characters or POS tags.
Finally, the training of the CBOW and skip-gram models can be expensive despite the limited number of hidden layers. However, there are several engineering solutions that accelerate training, such as Huffman binary tree encoding of words and hierarchical softmax. The latter is a solution that enables us to use multi-processing power and update the weight parameters concurrently. The parallel asynchronous updating of the parameter matrices does not conform to the mathematical constraints; in practice, however, the negative effect is minor. The Huffman binary tree is a method for compressing the encoding of terms so that the ones with higher frequency are accessed faster. In addition, negative sampling, subsampling, or random sampling are also used, where only a few of the surrounding words within the window of size k are selected during training, with a minor effect on performance and a significant acceleration of training (Mikolov et al., 2013b; Mitra and Craswell, 2018).

Document Embeddings
There are several approaches to transform word embeddings into document embeddings (Mitra and Craswell, 2018; Mikolov et al., 2013a). The simplest method produces a vector for a given document by averaging the word embeddings of the words in the document. It is also possible to modify the network architecture and work on the sentence level. For example, word embeddings per sentence are averaged and the goal is to predict a sentence given its context sentences (Kenter, Borisov, and Rijke, 2016). Another idea, the Sent2Vec method, is to compose sentence embeddings by extending CBOW to include word vectors and word n-gram embeddings (Pagliardini, Gupta, and Jaggi, 2018).
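The averaging approach mentioned above can be sketched in a few lines. The embeddings in `toy_vectors` are made-up two-dimensional values used purely for illustration, and the function name `average_embedding` is hypothetical; words without an embedding are simply skipped, which is one possible convention among several.

```python
def average_embedding(tokens, word_vectors):
    """Average the embeddings of the tokens that have a known vector."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None  # no known word: the document cannot be represented
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Toy embeddings (illustrative values only).
toy_vectors = {
    "cheap": [1.0, 0.0],
    "shoes": [0.0, 1.0],
    "buy": [1.0, 1.0],
}
doc_vec = average_embedding(["buy", "cheap", "shoes", "online"], toy_vectors)
print(doc_vec)
```

Note that averaging discards word order entirely, which is one motivation for approaches that learn document vectors directly.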
In this thesis, we use the Doc2Vec approach, introduced in (Le and Mikolov, 2014), which attempts to generalize the word embedding methods to work with sequences of words. The main idea is to train a neural network to learn embeddings for entire documents (or passages). There are two versions of this approach that are analogous to the CBOW and skip-gram models.
The Paragraph Vector - Distributed Memory (PV-DM) model is based on CBOW. The task is to train a network to predict the next word in a text window given the paragraph vector and the word vectors of its context (actually the preceding words). The paragraph vector (the paragraph could be an entire document) is considered as a memory of the word distribution and aims to capture general information like the topic of the document.
Another approach, following the skip-gram paradigm, is to ignore the context words in the input and train a model to predict a context word given its paragraph vector. This method, called Paragraph Vector - Distributed Bag-of-Words (PV-DBOW), is depicted in Figure 6.3. In practice, at each iteration of stochastic gradient descent, a text window of size k is sampled. Then, a random word is sampled from the text window to form a classification task given the paragraph vector. This model requires less data to be stored, because only the SoftMax weights are stored, as opposed to both SoftMax weights and word vectors in PV-DM.
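The sampling step of PV-DBOW described above can be sketched as follows. This is only an illustration of how (document, word) training pairs could be generated: the uniform choice of a document at each step, the seed, and the name `sample_pv_dbow_pairs` are assumptions made here, and the actual gradient updates are omitted. Documents are assumed to be non-empty token lists.

```python
import random


def sample_pv_dbow_pairs(documents, window_size, n_steps, seed=0):
    """Generate (document id, target word) pairs for PV-DBOW-style training."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_steps):
        doc_id = rng.randrange(len(documents))
        tokens = documents[doc_id]
        # Sample a text window of at most `window_size` consecutive tokens.
        start = rng.randrange(max(1, len(tokens) - window_size + 1))
        window = tokens[start:start + window_size]
        # Sample one word from the window; the only input is the document id,
        # i.e., the context words are ignored.
        pairs.append((doc_id, rng.choice(window)))
    return pairs


docs = [
    ["welcome", "to", "my", "personal", "blog"],
    ["add", "to", "cart", "and", "checkout", "now"],
]
pairs = sample_pv_dbow_pairs(docs, window_size=3, n_steps=10)
print(pairs[:3])
```

Each pair defines one classification example: the network receives the paragraph vector of `doc_id` and is trained to assign high probability to the sampled word.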
The loss function of PV-DBOW (a modification of the corresponding skip-gram loss function shown in formula 6.3) is as follows:

$$\mathcal{L}_{PV\text{-}DBOW} = -\frac{1}{S} \sum_{i=1}^{S} \sum_{\substack{-k \le j \le k \\ j \ne 0}} \log p(t_{i+j} \mid D_i) \tag{6.6}$$

where $D_i$ is the document vector of the i-th document, $S$ is the number of windows over the training texts, and $k$ is the number of words to be predicted surrounding the input word. Consequently, the SoftMax function for the output of the model is modified as follows:

$$p(t_{i+j} \mid D_i) = \frac{\exp(y_{t_{i+j}})}{\sum_{l=1}^{|V|} \exp(y_{t_l})} \tag{6.7}$$

There are several modifications of the PV-DBOW method aiming to increase its efficiency, including document-frequency-based negative sampling and document length regularization (Le and Mikolov, 2014; Posadas-Durán et al., 2017). It should be noted that the paragraph vectors could be used to represent sentences, paragraphs, or entire documents. In this study, the whole web-page is considered. In addition, the input texts could be sequences of characters, POS tags, character n-grams, word n-grams, etc.
This method of producing document embeddings has successfully been used in several text classification tasks (Le and Mikolov, 2014). Its main advantage over traditional BOW and n-gram representation schemes is that it provides compact and dense vectors that include a rich combination of syntactic, semantic, and stylistic information of documents.

Experimental Setup
In this chapter, the usefulness of the previously described distributed representation of documents is examined in the framework of open-set WGI. As already explained, NNDR is vulnerable when combined with a text representation scheme of irrelevant and redundant features. In this thesis, NNDR is used in combination with Distributed Features (DF), obtained by the PV-DBOW approach. We compare this new method with NNDR using traditional BOW and n-gram features as well as with other open-set methods (OCSVM and RFSE). The experiments of this chapter are based on SANTINIS, a benchmark corpus, as described in Chapter 5. Briefly, this dataset comprises 1,400 English web-pages evenly distributed into seven genres (blog, eshop, FAQ, frontpage, listing, personal home page, search page) as well as 80 BBC web-pages evenly categorized into four additional genres (DIY mini-guide, editorial, features, short-bio). In addition, the dataset comprises a random selection of 1,000 English web-pages taken from the SPIRIT corpus (Joho and Sanderson, 2004). The latter can be viewed as unstructured noise since genre labels are missing.
The PV-DBOW models have been trained using the whole corpus. Note that the training of this approach is unsupervised (i.e., the genre labels are not taken into account). The corpus is initially split into a set of paragraphs, as required by PV-DBOW. To be more specific, the paragraphs are the sentences obtained by splitting all the documents of the whole corpus. We examine three different variations, using either sequences of word unigrams (W1G), word trigrams (W3G) or character 4-grams (C4G) as input texts (W1G corresponds to texts in their original form). Each type of n-gram is used separately, as suggested in (Posadas-Durán et al., 2017). The dimensionality of document embeddings is selected from $DF_{dim} = \{50, 100, 250, 500, 1000\}$.
In addition, the terms with very low-frequency in the training set are discarded. In this study, we examine f min = {3, 10} as frequency cutoff threshold. The text window size is selected from W size = {3, 8, 20}. The remaining parameters of PV-DBOW are set as follows: α = 0.025, epochs = {1, 3, 10} and decay = {0.002, 0.02}.
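To give a sense of the size of this search space, the sketch below enumerates the candidate PV-DBOW settings listed above (with α fixed at 0.025). It assumes, purely for illustration, that every combination is a candidate; the text does not state whether an exhaustive grid search was performed. The dictionary keys are hypothetical names, not parameters of any particular library.

```python
from itertools import product

# Candidate values as listed in the text; key names are illustrative.
grid = {
    "dimensions": [50, 100, 250, 500, 1000],
    "min_frequency": [3, 10],
    "window_size": [3, 8, 20],
    "epochs": [1, 3, 10],
    "decay": [0.002, 0.02],
}
feature_types = ["C4G", "W1G", "W3G"]

# Cartesian product of all candidate values.
settings = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(settings), "settings per feature type")
print(len(settings) * len(feature_types), "settings over all feature types")
```

Under this assumption there are 180 candidate settings per feature type, or 540 in total across C4G, W1G, and W3G.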
In practice, a library for HTML removal and vector representation of web-pages, named Html2Vec, has been created for this work. It includes a special module for PV-DBOW modeling that has been built on the implementation of the algorithm found in the Gensim package.
We also represent documents with traditional representation schemes to conduct comparative experiments. Similar to PV-DBOW, we extract regular C4G, W1G, and W3G. For each of these schemes, we use Term-Frequency weights (we use TF to refer to this kind of traditional feature as opposed to DF for distributed features). The feature space for TF is defined by a vocabulary $V_{TF}$, which is extracted based on the most frequent terms of the training set. We consider $V_{TF} = \{5k, 10k, 50k, 100k\}$.
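A minimal sketch of the traditional TF representation is given below: it extracts character or word n-grams, builds a vocabulary from the most frequent terms of a training set, and maps a document to a raw term-frequency vector. Whitespace tokenization, raw (unnormalized) counts, and all function names are assumptions made for illustration; the actual preprocessing is the one implemented in Html2Vec.

```python
from collections import Counter


def char_ngrams(text, n=4):
    """All overlapping character n-grams of a text (C4G for n=4)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n=1):
    """All overlapping word n-grams (W1G for n=1, W3G for n=3)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def build_vocabulary(term_lists, size):
    """Keep the `size` most frequent terms of the training set."""
    counts = Counter(term for terms in term_lists for term in terms)
    return [term for term, _ in counts.most_common(size)]


def tf_vector(terms, vocabulary):
    """Raw term frequencies of a document over a fixed vocabulary."""
    counts = Counter(terms)
    return [counts[term] for term in vocabulary]


train = ["add to cart", "add to wishlist", "read my blog"]
w1g_train = [word_ngrams(text, 1) for text in train]
vocab = build_vocabulary(w1g_train, size=3)
vec = tf_vector(word_ngrams("add to cart to go", 1), vocab)
print(vocab, vec)
```

Terms outside the vocabulary (here "go") are ignored, which is how the vocabulary size controls the dimensionality of the TF space.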
The parameter tuning for the OCSVM and RFSE methods has been performed as described in Chapter 5 for the SANTINIS corpus. The reported evaluation results are obtained by performing 10-fold cross-validation and, in each fold, the full set of 1,000 noise pages is included. This strategy gives a more realistic evaluation, since the noise size is greater than the size of any known genre.
To compensate for the unbalanced distribution of web-pages over the genres caused by the noise part, the open-set macro-averaged precision, recall, and $F_1$ measures are used (Mendes Júnior et al., 2016). Note again that this variant of the evaluation measures ignores the unknown class.
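The open-set variants of these measures can be sketched as follows. Precision and recall are averaged over the known classes only: a noise page assigned to a known genre counts as a false positive for that genre, a known page left unclassified counts as a false negative, and correctly rejected noise pages are ignored. Computing $F_1$ from the macro-averaged precision and recall is an assumption made here; the labels and the function name are illustrative.

```python
def open_set_macro_scores(y_true, y_pred, known_classes):
    """Macro precision, recall and F1 over known classes only."""
    precisions, recalls = [], []
    for c in known_classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # False positives include unknown pages assigned to class c.
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        # False negatives include pages of class c predicted as unknown.
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy labels; "unknown" marks noise pages and unclassified predictions.
y_true = ["blog", "blog", "eshop", "eshop", "unknown", "unknown"]
y_pred = ["blog", "unknown", "eshop", "blog", "eshop", "unknown"]
p, r, f1 = open_set_macro_scores(y_true, y_pred, ["blog", "eshop"])
print(p, r, f1)
```

In this toy case the last page (noise correctly left unclassified) does not contribute to any score, whereas the noise page labeled as "eshop" lowers the precision of that genre.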
Finally, for selecting the parameter settings that obtain optimal evaluation performance, two scalar measures are used: the Area Under the macro Precision-Recall Curve (AUC) at the 11 standard recall levels and the macro-averaged $F_1$ ($F_1^{macro}$) score.
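One common way to obtain a curve at the 11 standard recall levels and its area is sketched below. It assumes interpolated precision (the maximum precision observed at any recall greater than or equal to the level, and zero beyond the last reached recall) and the trapezoidal rule; these are assumptions for illustration, not necessarily the exact procedure used in the thesis. The input points are toy values.

```python
def eleven_point_curve(points):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0.

    `points` is a list of (recall, precision) pairs.
    """
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


def area_under_curve(precisions, step=0.1):
    """Trapezoidal area under equally spaced precision values."""
    return sum((a + b) / 2 * step for a, b in zip(precisions, precisions[1:]))


# Toy curve that stops at recall 0.5 (the rest is left unclassified).
points = [(0.1, 1.0), (0.3, 0.8), (0.5, 0.6)]
curve = eleven_point_curve(points)
auc = area_under_curve(curve)
print(curve, round(auc, 2))
```

A curve that stops early, as happens when a large part of the corpus is left unclassified, is penalized because precision is taken as zero at the recall levels it never reaches.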

The Effect of Distributed Representation on NNDR
Initially, NNDR is evaluated using the traditional TF scheme, as shown in Table 6.1. The overall performance is poor. NNDR seems to work better with W3G features. Note that the dimensionality of this representation is quite high. The performance of the algorithm is slightly affected by parameter tuning for the splitting ratios $p_1$ and $p_2$, while DRT is 0.8 in all cases. The method seems to be robust to the examined values of the λ regularization parameter. It should also be noted that both $F_1$ and AUC are maximized for the same parameter settings and document representation.
The evaluation of NNDR combined with PV-DBOW features is shown in Table 6.2. As can be seen, in two out of three types of features (C4G and W1G) the performance of the algorithm is significantly improved in terms of both macro $F_1$ and AUC. The best overall performance is still acquired by W3G features and it is slightly improved in comparison to the respective results when the TF representation is used (the improvement is considerably higher for the AUC measure). DF seems to particularly enhance precision results for C4G features and recall results for W1G features.
These results are obtained using a much lower dimensionality of representation (i.e., an order of magnitude lower than TF scheme). This demonstrates that NNDR is better able to cope with the compactness and density of DF vectors. It should also be noted that the robustness of the model is increased since the best results are acquired for exactly the same parameter settings and most NNDR parameters do not affect the obtained performance.
A more detailed view of the performance of NNDR when combined with either traditional or distributed W3G features is depicted in the precision-recall curves (PRCs) of Figure 6; in both cases the same parameter settings are used for the NNDR classifier. As can be seen, the precision of the model based on DF remains high for more standard recall levels in comparison to TF, which is significantly affected by the presence of noise. This means that DF is particularly useful in WGI applications where precision is considered more important than recall. The two approaches have comparable performance when recall reaches 0.5, although DF still outperforms TF. The points where the curves stop indicate the percentage of the corpus that has been classified as unknown, which is similar in both cases (i.e., about 40% of the corpus).

Comparison of Open-set WGI Methods
In this section, the performance of NNDR on the SANTINIS corpus is compared to that of OCSVM and RFSE obtained as described in Chapter 5. The experimental setup for NNDR with either TF or DF schemes is exactly the same; therefore, the evaluation results for these models are directly comparable. In the framework of this experiment, OCSVM and RFSE serve as baseline models to help us see how competitive the NNDR approach can be when it is assisted by the DF representation in unstructured noise conditions. First, NNDR with TF features is compared with the baselines. In this case, NNDR outperforms OCSVM. On the other hand, RFSE outperformed NNDR in both macro-averaged $F_1$ and AUC. This is consistent for any kind of features (C4G, W1G, or W3G). The RFSE model is the top overall performer while both OCSVM and NNDR are significantly lower with respect to AUC, $F_1$ and precision. Only NNDR with the TF scheme for W3G is competitive.
There is, however, a notable difference in the dimensionality of the representation used by the examined approaches. RFSE relies upon a 50k-dimensional feature space while NNDR and OCSVM are based on much lower-dimensional spaces. This demonstrates the ability of RFSE to exploit the existence of redundant feature sets. It has to be noted that RFSE builds an ensemble by iteratively and randomly selecting a subset of the available features. That way, it internally reduces the dimensionality for each constituent base classifier (RFSE uses 1,000 randomly selected features from the 50,000 most frequent features in each repetition).
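The random feature subspacing step described above can be sketched as follows. Only the sampling of feature indices is shown (1,000 out of 50,000 per repetition, as stated in the text); the number of repetitions, the seed, and the function name are hypothetical, and the base classifiers and the voting scheme of RFSE are omitted.

```python
import random


def sample_feature_subsets(n_features, subset_size, n_iterations, seed=0):
    """Draw one random subset of feature indices per ensemble repetition."""
    rng = random.Random(seed)
    return [
        sorted(rng.sample(range(n_features), subset_size))
        for _ in range(n_iterations)
    ]


# 1,000 features out of the 50,000 most frequent ones, per repetition.
subsets = sample_feature_subsets(
    n_features=50_000, subset_size=1_000, n_iterations=5
)
print(len(subsets), len(subsets[0]))
```

Each base classifier then only sees the 1,000 selected dimensions of every document vector, so the effective dimensionality per classifier is far lower than that of the full feature space.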
Next, NNDR with DF is compared with the same baselines. Although there is a notable improvement for NNDR using DF, it is still outperformed by RFSE in terms of both $F_1$ and AUC. On the other hand, NNDR achieves notably higher precision than RFSE for C4G and W3G features. This indicates that NNDR using DF could be more useful than RFSE in WGI applications where precision is more important than recall.
A closer look at the comparison of the examined methods is provided in Figure 6.5, where macro-averaged precision-recall curves are depicted. The NNDR-DF model maintains very high precision scores for low levels of recall. In particular, for W3G features the difference between NNDR-DF and RFSE at that point is clearer. NNDR-TF is clearly worse than both NNDR-DF and RFSE. In addition, OCSVM is competitive in terms of precision only when W3G features are used, but its performance drops abruptly in comparison to that of NNDR-DF. RFSE with W1G performs significantly better in terms of precision than NNDR (with DF). It also manages to correctly recognize a larger part of the corpus, more than 70% for either W3G or W1G, as compared to NNDR-DF, which reaches 60% in both cases.

FIGURE 6.5: Precision curves in 11-standard recall levels of the examined open-set classifiers using either W3G features (left) or W1G features (right).

Conclusions
In this chapter, we presented an experimental study focused on WGI and the use of distributed features in combination with an open-set classifier that obtained promising results in other domains (Mendes Júnior et al., 2016). Our experiments are based on a benchmark corpus with unstructured noise already used in previous work and a strong baseline. It seems that distributed features provide a significant enhancement to the performance of NNDR in WGI tasks. The low dimensionality and density of DF are crucial for enhancing the performance of NNDR, which suffers from the presence of irrelevant and redundant features (as any nearest-neighbor method does). Yet, RFSE proves to be a hard-to-beat baseline, at the expense of relying upon a much higher-dimensional representation space (usually in the thousands of features). However, with respect to precision, NNDR with PV-DBOW features is much more conservative and prefers to leave web-pages unclassified rather than predict an inaccurate genre label. Depending on the application of WGI, precision can be considered much more important than recall, and this is where the proposed approach seems more suitable (e.g., web-page ranking applications).
Further research could focus on more appropriate distance measures within NNDR, especially with recent data-driven features obtained with powerful NLP convolutional and recurrent deep networks. Moreover, alternative types of distributed features could be used (e.g., topic modeling or pre-trained language models). Finally, a combination of NNDR with RFSE models could be studied, as they seem to exploit complementary views of the same problem.

Introduction
WGI is a text mining task that can improve information retrieval systems by allowing a richer description of search queries and results. Genre is orthogonal to the topic of documents, and the combination of information about these two factors better describes the properties of documents in comparison to the case where only topic is used. WGI can be useful to build specialized collections of documents via focused crawling where both topic and genre are controlled. Genre is important information for credibility assessment systems that decide whether certain web-pages can be trusted, as well as for anti-phishing systems. In addition, the genre of documents is important information for applying suitable models when NLP technology is going to be used.
Despite these interesting uses of WGI, research in this area is limited. In addition, the vast majority of existing works adopt the unrealistic scenario of closed-set classification. Clearly, it is not possible to define a universal genre palette including all possible web genres. Despite best efforts, it is expected that a large portion of web-pages could not match a given genre palette no matter how long it is. In addition, web genres evolve through time and the genre palette should be re-defined periodically.
The adoption of the closed-set classification scenario provides an over-estimation of the obtained results and does not permit the highlighting of weaknesses and strengths in real world conditions. Moreover, the few open-set WGI models that have been proposed so far have not been adequately tested. In some cases the evaluation is based on unrealistic noise-free corpora. In other cases, regular evaluation measures that are not suitable for open-set classification tasks are adopted.
Another crucial issue is the type of noise considered in WGI tasks. So far, only unstructured noise has been tested. However, structured noise is equally important since it provides the opportunity to examine the behavior of an open-set model controlling the homogeneity of noise and the difficulty of the problem.
As a result, there are significant doubts about the applicability of WGI technology in challenging environments, outside controlled lab experiments. It is also important to note that not all WGI applications are the same. Some applications (e.g., ranking of search results by genre, focused crawling) require higher precision than recall. It is crucial, therefore, to demonstrate what methods are better able to deal with such cases.
In the remainder of this chapter, we present the main findings of this thesis and the suggested directions for future work.

Main Findings
The focus of this thesis is on open-set WGI. A series of methods have been described and experiments demonstrating their strengths and weaknesses have been reported using a suggested evaluation framework for open-set classification tasks. The main findings of this study are the following:
• An evaluation framework for open-set WGI is proposed. First, as clarified in Chapter 4, the evaluation measures used for closed-set classification tasks are not appropriate for open-set conditions. The focus of the evaluation measures should be on the recognition of known genres. The recognition of the unknown class should not affect the results since there are no training examples for that class. In addition, in case micro-averaged scores are used, the performance estimation is skewed when the unknown class is regularly taken into account, since it is the majority class. Open-set variants of precision, recall, and $F_1$ (i.e., excluding the unknown class) should be used. The effect of noise is thus based only on false unknown and false known errors (i.e., true positives of the unknown class are ignored). In addition, graphical representations, like the precision-recall graph in 11-standard recall levels, provide a detailed view of the properties of the examined methods that match the requirements of a given WGI application.
• The effect of structured noise in WGI is studied. We demonstrate in Chapter 4 that the openness measure provides a means to control the homogeneity of structured noise. This also affects the difficulty of the open-set task. By using the openness test it is possible to see how open-set WGI methods behave in extreme conditions (e.g. when very few known classes are available or when noise is quite homogeneous) and better understand their pros and cons.
• An open-set WGI method following the one-class classification paradigm has been developed. The OCSVM approach learns only from positive samples of a target class, and a hyper-parameter (ν) adjusts whether the model will be conservative or optimistic. It is a general-purpose approach that can be applied to any open-set classification task, as has already been demonstrated in (Mendes Júnior et al., 2016). In the framework of this thesis, it serves as a baseline for other, more suitable approaches. The results described in Chapter 5 demonstrate that OCSVM is a conservative model that prefers to leave samples unclassified rather than misclassify them. The experiments also showed that this method outperformed the other examined approaches when the openness level is quite high. This means that when very few known classes are available and the noise is heterogeneous (i.e., a high open-space risk), this relatively conservative method obtains the best results.
• An open-set WGI method following the ensemble paradigm (RFSE) has been developed. This approach takes advantage of a high-dimensional feature space with many redundant and irrelevant features, such as the ones provided by character and word n-grams. This method achieves the best overall results in terms of F1, as demonstrated in Chapter 5. It provides the most balanced handling of both unstructured and structured noise. It is also possible to favour recall or precision by using different parameter optimization criteria (i.e., AUC or F1). Thus, it can be adapted to the requirements of WGI applications. However, for very high openness values this algorithm does not seem to be robust. This means that it cannot easily handle cases with very few known genres. Its performance increases as the known genre set is enlarged.
• An open-set WGI method following the k-nearest neighbor approach (NNDR) has been developed (Mendes Júnior et al., 2016). In contrast to kNN, this method spends time during training to estimate the open-space risk. To do that, part of the training examples are used as simulated noise. Consequently, this method is very difficult (or even infeasible) to apply when limited training data (especially few known genres) are available. This is the main reason that it has been evaluated only under conditions of unstructured noise in this thesis. Another important property of the algorithm is that it handles compact and dense representations better than high-dimensional and sparse feature spaces. Thus, it combines very well with distributed representations acquired from neural language modeling, as demonstrated in Chapter 6. Although the best results of this method are inferior to the ones obtained by RFSE, NNDR seems to provide better precision scores at low recall levels. The latter is very important in applications related to the ranking of web-pages.
• The experiments conducted in this thesis are based on textual features only. These have been found in previous studies to be the most useful source of genre-related information. Features like word and character n-grams have been examined. They provide a language-independent and easy-to-measure set of features. More elaborate features requiring analysis of documents by NLP tools, like POS n-grams, have also been tested. As the results of Chapter 5 indicate, word n-grams provide the best results. More specifically, word trigrams seem to be better than word unigrams. This also indicates that both content and stylistic information are useful for the specific corpora used in the experiments. Recall that these corpora are far from ideal for evaluating WGI methods since they are small and some genres are represented by thematically correlated text samples. The conclusion that open-set WGI methods obtain their best results with word trigrams is therefore probably corpus-dependent. The type of features used in combination with the proposed methods, especially RFSE, is a hyper-parameter that should be optimized for any given corpus.
• Distributed representations obtained by the PV-DBOW neural network language model have been introduced to WGI. As demonstrated in Chapter 6, these features, when combined with NNDR, can provide a very competitive approach, especially when precision is the most important aspect of performance. Distributed representations are compact and dense, so they are unlikely to confuse NNDR with irrelevant and redundant features. In addition, it has been demonstrated that NNDR can be a very robust model when matched with distributed features, requiring minimal tuning of hyper-parameters.
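The open-set variants of precision, recall, and F1 mentioned in the first finding can be illustrated with a minimal sketch. It is not the evaluation code used in the thesis: the function name `open_set_scores`, the label string "unknown", the macro-averaging over known genres, and the toy labels are all assumptions made for illustration. The sketch only shows the principle that the unknown class contributes errors (false knowns and false unknowns) but never true positives.

```python
def open_set_scores(y_true, y_pred, known_labels):
    """Macro-averaged precision, recall and F1 over the known genres only.

    The unknown class is never scored as a class of its own (assumption:
    macro-averaging; the source does not fix the averaging scheme here).
    """
    per_class = {}
    pairs = list(zip(y_true, y_pred))
    for label in known_labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        # False positives include noise pages accepted as this genre
        # ("false knowns").
        fp = sum(1 for t, p in pairs if t != label and p == label)
        # False negatives include pages of this genre rejected as unknown
        # ("false unknowns").
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = (precision, recall, f1)

    n = len(per_class)
    return {
        "precision": sum(v[0] for v in per_class.values()) / n,
        "recall": sum(v[1] for v in per_class.values()) / n,
        "f1": sum(v[2] for v in per_class.values()) / n,
    }


# Toy example with two known genres and four noise pages (hypothetical data).
y_true = ["blog", "blog", "eshop", "eshop",
          "unknown", "unknown", "unknown", "unknown"]
y_pred = ["blog", "unknown", "eshop", "eshop",
          "eshop", "unknown", "unknown", "unknown"]

scores = open_set_scores(y_true, y_pred, known_labels=["blog", "eshop"])
print(scores)
```

In this sketch, adding any number of correctly rejected noise pages leaves the scores unchanged, which is the property that keeps the majority unknown class from dominating the evaluation.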

Future Work Directions
The current work attempted to highlight weaknesses of existing approaches and provide a framework for the appropriate evaluation of open-set WGI approaches. In addition, specific open-set methods have been tested using this framework.
There are several open questions that are outside the scope of this work and could be examined in the future. Promising directions for research include the following:
• The open-set classifiers examined in this thesis can be combined in the framework of a heterogeneous ensemble. This could provide a more robust approach with enhanced performance, both overall and at low recall levels.
• Open-set methods proposed in other domains, like the one described in (Fei and Liu, 2016), could be tested in WGI as well. Distance-based approaches seem to fit WGI tasks.
• Alternative neural network language modeling approaches could be used to provide distributed representations. These include pre-trained language models, like BERT, ULMFiT, ELMo, and GPT-2 that have been developed recently and obtained excellent results in several text classification tasks (Devlin et al., 2019;Howard and Ruder, 2018;Peters et al., 2018;Radford et al., 2019).
• Combinations of textual features with alternative sources of information, like structural or graph-based and URL-based features could be examined. It would be interesting to see how such enlarged feature spaces could affect methods like RFSE and NNDR.
• Additional experiments can be performed on new web corpora including dozens of genres and thousands of training examples per genre (Egbert, Biber, and Davies, 2015). This would allow a more objective evaluation of open-set WGI approaches especially focusing on the effect of structured noise using the openness test. On the other hand, this possibility is limited given the copyright issues that complicate access to recently-developed collections (Asheghi, 2015).
• Case studies examining the usefulness of open-set WGI in increasing the effectiveness of specific applications, like genre-aware crawling and credibility assessment systems, should be performed (Siqueira et al., 2017;Agrawal, Mohan, and Reddy, 2018). This would both test the robustness of open-set WGI in real-world conditions and stimulate the interest of the research community in further work.