Article 'Hybrid categorical expert system for use in content aggregation' - journal 'Software systems and computational methods' - NotaBene.ru
Software systems and computational methods
Hybrid categorical expert system for use in content aggregation

Kiryanov Denis Aleksandrovich

ORCID: 0000-0001-8502-8333

Master's Degree, Department of Information Systems and Software Engineering, Baltic State Technical University "Voenmeh" named after D. F. Ustinov

190005, Russia, Saint Petersburg, 1st Krasnoarmeyskaya str., 1

dennis.kiryanov@gmail.com
DOI:

10.7256/2454-0714.2021.4.37019

Date of submission to the editorial office:

02-12-2021


Date of publication:

21-12-2021


Abstract: The subject of this research is the development of the architecture of an expert system for a distributed content aggregation system, the main purpose of which is the categorization of aggregated data. The author examines the advantages and disadvantages of expert systems, the toolset for developing expert systems, the classification of expert systems, and the application of expert systems to data categorization. Special attention is given to the architecture of the proposed expert system, which consists of a spam filter, a component that determines the main category for each type of processed content, and two components that determine subcategories: one is based on domain rules, and the other uses machine learning methods and complements the first. The main conclusion is that expert systems can be effectively applied to data categorization problems in content aggregation systems. The author establishes that hybrid solutions, which combine an approach based on a knowledge base and rules with the use of neural networks, help reduce the cost of the expert system. The novelty of this research lies in the proposed architecture, which is easily extensible and adaptable to workloads by scaling existing modules or adding new ones. The proposed spam detection module is based on an adaptation of a behavioral algorithm for detecting spam in e-mails. The proposed module for determining the main content categories uses two algorithms based on fuzzy fingerprints: Fuzzy Fingerprints and Twitter Topic Fuzzy Fingerprints, the latter of which was originally used to categorize messages on the Twitter social network. The module that determines subcategories from keywords works together with a dictionary database (Thesaurus). The last classifier uses the support vector machine algorithm for the final determination of subcategories.


Keywords:

Expert system, Fuzzy fingerprints algorithm, Content aggregation, Neural network, Content categorization, Knowledge acquisition, Support Vector Machine, TF-IDF, CLIPS, Spam identification

Introduction

Modern science and industry are inconceivable without the use of computer technology. Over the past 50 years, the level of information and intellectual support of various technologies has increased tremendously [1]. The amount of obtained information is so great that it is very difficult for a person, even a specialist, to deal with it. To perceive and process it, special intellectual support is required.

Therefore, expert systems and decision support systems find their application in various fields of economics, medicine, and science [2]. An expert system can be defined as a computer system designed to solve complex problems by emulating the decision-making process of human experts [3].

Expert systems emerged as a significant practical result in the application and development of artificial intelligence, i.e., a set of scientific disciplines that study methods for solving problems of an intellectual (creative) nature using computers [4]. The first expert systems were developed in the late 1960s and were intended to create an artificial “super mind” in some subject area [5].

The first expert systems were implemented using specialized programming languages such as Lisp and Prolog [6]. Some of those systems are still in active use today. An example of such a system is DENDRAL [7], the purpose of which is to create organic molecular graphs of non-cyclic isomers (written in Lisp). Another good example is PROSPECTOR II [8] which was successfully used in the search for mineral deposits.

There are many types and implementations of expert systems. For example, paper [9] surveys and classifies expert systems using two categories: rule-based systems and knowledge-based systems with their applications for different research and problem domains.

The purpose of this article is to propose an expert system that is part of a distributed content aggregation system and helps categorize aggregated content. Categorization, which should increase the relevance of search results, is a very complex process because of the sheer volume of content. This task also required research into the advantages and disadvantages of expert systems, as well as the tools for their development, in order to arrive at the most appropriate architectural solution.

This paper is structured as follows. Section 1 provides an overview of the benefits of expert systems. Section 2 contains the main disadvantages of expert systems. Section 3 describes the general architecture of an expert system. The classification of expert systems is presented in Section 4. Section 5 provides an overview of the tools for creating expert systems. Examples of expert systems that perform categorization tasks are listed in Section 6. Section 7 explains the architecture of the proposed system. Finally, the conclusions are given in Section 8.

1 The advantages of expert systems

In the modern sense, an expert system is a kind of artificial intelligence (AI), i.e., a set of programs that perform the functions of a human expert in solving problems from a specific subject area [10, p. 203]. One of the most important differences between expert systems and other AI systems is that an expert system models the mechanism of human thinking as applied to solving problems in its problem area, rather than business logic.

The expert system, in addition to performing computational operations, forms certain considerations and conclusions based on the knowledge it has (this component is usually called the knowledge base). In addition, expert systems differ from other AIs in that they use heuristic and approximate methods to solve problems [1].

One of the main advantages of expert systems is performance. In general, an expert system deals with real-world objects, and such operations usually require significant human experience, i.e., expertise. Well-designed expert systems find a solution within a reasonable time, and that solution is at least no worse than one a specialist in the subject area would produce for the same task. This means that expert systems are productive, and their power lies in their high-quality knowledge of the task area [11, p. 74].

Expert systems can easily analyze all aspects of a problem, which often leads to the selection of the best alternative. Such systems turn out to be extremely effective when the knowledge bases are huge, because once entered into the machine, knowledge is stored forever. A human expert, on the other hand, has a limited knowledge base, and there is always a risk of losing expert knowledge when an expert leaves the company.

The study [12] notes a very positive effect of expert systems on improving the general control audits in electronic accounting systems: they enhance the separation between jobs and duties inside the management of information systems. Also, the study reveals that there is an impact of the expert systems on enhancing the access controls by increasing the chance for controlling and authenticating the inputs. Finally, the expert systems enhance the security and protection of the files and there is an effect on enhancing the controls of system documentation, development, and maintenance.

Summing up the benefits of using expert systems, the following can be highlighted [11, pp. 80-81]:

1. Increased availability and reliability: Expertise can be accessed on any computer hardware and the system always completes responses on time.

2. Multiple expertise: Several expert systems can be run simultaneously to solve a problem and gain a higher level of expertise than a human expert.

3. Explanation: Expert systems always describe how the problem was solved.

4. Fast response: The expert systems are fast and able to solve a problem in real-time.

5. Reduced cost: The cost of expertise for each user is significantly reduced.

2 The disadvantages of expert systems

Even the best existing expert systems have certain limitations compared to a person who is an expert in their subject area. For example, most expert systems are not quite suitable for use by the end-user and high qualification is needed to work with them.

Another problem when using expert systems is the presentation of expert knowledge in a form that the system can understand. It is also known that knowledge acquisition, when done correctly, can be very expensive and time-consuming [13, p. 79].

Typically, the time it takes to acquire knowledge varies from case to case but can easily range from 50 to 100 man-weeks. It is also worth noting that the preparatory phase, which includes initial orientation, a feasibility study, and a selection of a programming shell, can take an additional 15 to 25 man-weeks [14, pp. 165-166].

Also, expert systems do not have a self-learning mechanism and are inapplicable in large subject areas. Their use is limited to subject areas in which an expert can reach a decision within a few minutes to several hours. In addition, in areas where there are no experts, the use of expert systems turns out to be impossible [6].

It is also known that knowledge-based systems are ineffective for rigorous analysis, when the number of solutions depends on thousands of different possibilities and on many variables that change over time. In contrast to traditional application software, the expert system knows the algorithm for processing knowledge but not the algorithm for solving the problem. This means that the knowledge processing algorithm can lead to an unintended result [6].

Another disadvantage is the fact that a portion of the knowledge in expert systems (typically less than 10 percent) escapes standard representation schemes and requires special fixes. Special fixes pose an audit risk and a security risk because they offer the opportunity to hide knowledge that can result in unusual or dysfunctional behavior [15, p. 9].

The study [16] considers the ethical characteristics of an expert system such as lack of human intelligence, lack of emotions, accidental bias, and lack of values. It was also shown that the selected characteristics of the expert system negatively affect the degree of ethics in the organizational environment.

The evidence of effectiveness of expert systems in medicine is mixed. Although some reviews reported that expert systems improved the performance of health care providers and patient outcomes, other reviews were less optimistic about their effects, requiring additional evidence to demonstrate the cost-effectiveness of these systems [13, p. 92]. For example, in the case of routine medical consultations, expert systems are irrelevant and considered to be designed to support only routine consultations so that doctors have all the patient data they need to make a diagnosis [17].

Finally, the following disadvantages of using expert systems can be summarized [11, p. 81]:

1. Expert systems have superficial knowledge, and a simple task can potentially become computationally expensive.

2. Expert systems require knowledge engineers to input the data, and data acquisition is very hard.

3. The expert system may choose the most inappropriate method for solving a particular problem.

4. Problems of ethics in the use of any form of AI are very relevant at present.

5. It is a closed world with specific knowledge, in which there is no deep perception of concepts and their interrelationships until an expert provides them.

3 The general architecture of an expert system

In general, an expert system includes the following components: a knowledge base, an inference engine, an explanation facility, a knowledge acquisition facility, and a user interface. The general architecture of an expert system is shown in Figure 1 [11, p. 75].

Figure 1 – Architecture of an expert system

The high-level architecture of an expert system which is shown in Figure 1 can be explained as follows [11, pp. 75-76]:

1. The knowledge base stores the facts for processing. It is domain information entered by the experts.

2. An inference engine is an interpreter of the rules. It works together with the agenda component, which contains a list of queries to execute.

3. An explanation facility is a subsystem that explains the reasoning of the expert system to a user.

4. A knowledge acquisition facility is used to obtain information from the user in an automatic mode. It uses different techniques such as process analysis, interviews, and observation.

5. A user interface module translates the rules from their internal representation into a form the user can comprehend.
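The components listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the architecture from [11]: the names `Rule`, `KnowledgeBase`, `InferenceEngine`, and `explain`, as well as the sample facts, are hypothetical, and the knowledge acquisition facility and the user interface are omitted.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rule:
    """IF every item in `conditions` is a known fact THEN add `conclusion`."""
    name: str
    conditions: frozenset
    conclusion: str


@dataclass
class KnowledgeBase:
    """Facts and rules entered by domain experts (hypothetical structure)."""
    facts: set = field(default_factory=set)
    rules: list = field(default_factory=list)


class InferenceEngine:
    """Forward-chaining rule interpreter that records every rule it fires."""

    def __init__(self, kb: KnowledgeBase):
        self.kb = kb
        self.trace = []  # fired rules, read later by the explanation facility

    def run(self) -> set:
        while True:
            # The agenda holds rules whose conditions are satisfied
            # and whose conclusion is not yet a known fact.
            agenda = [r for r in self.kb.rules
                      if r.conditions <= self.kb.facts
                      and r.conclusion not in self.kb.facts]
            if not agenda:
                return self.kb.facts
            for rule in agenda:
                if rule.conclusion not in self.kb.facts:
                    self.kb.facts.add(rule.conclusion)
                    self.trace.append(rule)


def explain(engine: InferenceEngine) -> list:
    """Explanation facility: one readable line per fired rule."""
    return [f"{r.name}: IF {' AND '.join(sorted(r.conditions))} "
            f"THEN {r.conclusion}" for r in engine.trace]
```

Keeping the trace inside the engine is one simple way to let the explanation facility reconstruct the reasoning without re-running the inference.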

4 Classification of expert systems

Expert systems are usually divided into four classes according to their operating principles, including rule-based, frame-based, fuzzy logic and neural network-based expert systems [10, p. 203].

4.1 Rule-based expert systems

Rule-based systems encode the knowledge of a human expert for use in an automated system using a set of statements, that is, facts, and a set of rules that embody that knowledge [18, 19]. These rules are set in the IF-THEN form. Such expert systems are very popular in medicine [20 – 27]. In [28], a unified framework for building rule-based systems is presented, which consists of the operations of rule generation, rule simplification, and rule representation.

4.2 Frame-based expert systems

The frame-based expert systems [29 – 32] have a frame that is a developed data structure containing the concept-related information: the concept name, the possible values of each attribute, and the procedural information of the target problems. Frame-based systems can deal with more complex problems, compared to the rule-based system [10, pp. 203-204], and are often combined with a rule-based method, thus making a powerful system for solving complex problems [33].

4.3 Fuzzy logic-based expert systems

Fuzzy logic-based expert systems [34 – 40] integrate fuzzy theory, using it as the basis of their reasoning. Such systems are highly reliable and can perform preliminary and heuristic reasoning [10, p. 204].

The purpose of fuzzy logic-based expert systems is to provide an easy way to work with systems full of uncertainty. In such systems or environments, fuzzy logic is considered very effective when inferences do not need to be precise, but acceptable to a certain degree of certainty [41].
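To illustrate reasoning with degrees of certainty, the following minimal sketch evaluates a single fuzzy rule. It assumes triangular membership functions and the common min operator for fuzzy AND; the rule `rule_overheating`, its variables, and all numeric breakpoints are hypothetical and are not taken from the cited systems.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function: 0 at a and c, 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzy_and(*degrees: float) -> float:
    """Fuzzy conjunction using the minimum operator."""
    return min(degrees)


def rule_overheating(temperature_c: float, load_percent: float) -> float:
    """IF temperature is hot AND load is high THEN overheating.

    Returns the degree (0..1) to which the conclusion holds.
    """
    hot = triangular(temperature_c, 60.0, 90.0, 120.0)
    high_load = triangular(load_percent, 50.0, 100.0, 150.0)
    return fuzzy_and(hot, high_load)
```

The rule does not return a yes/no answer but a degree of truth, which a caller can compare against whatever level of certainty is acceptable for the task.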

4.4 Neural network-based expert systems

Neural network-based expert systems, as the name suggests, use neural networks to build the rule base from examples given by a human expert. A neural network-based expert system increases the knowledge represented in its connections over time by learning from examples [42].

The neural network-based approach can be used when it is difficult to determine whether the knowledge base is correct, consistent, and complete. It also applies in situations where it is difficult to get an adequate set of rules from human experts [43].
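As a rough illustration of learning a rule from expert-labelled examples instead of writing it by hand, the sketch below trains a single perceptron. This is a deliberately tiny stand-in for the neural networks discussed here; the function names and the example data (a rule equivalent to a logical AND of two features) are assumptions made for this example.

```python
def predict(weights, bias, features):
    """Fire (return 1) when the weighted sum of the features is positive."""
    total = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20):
    """Classic perceptron update rule with a learning rate of 1."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            if error:
                weights = [w + error * x for w, x in zip(weights, features)]
                bias += error
    return weights, bias


# Expert-labelled examples: the label is 1 only when both features are present.
EXPERT_EXAMPLES = [
    ((0, 0), 0),
    ((0, 1), 0),
    ((1, 0), 0),
    ((1, 1), 1),
]
```

After training, the learned weights play the role of the rule that the expert never had to state explicitly; a single perceptron only works for linearly separable data, so real systems use larger networks.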

Even though the neural networks were not originally designed to make expert systems [42], this approach is actively used nowadays due to the rapid development of machine learning algorithms [10, p. 204].

When building expert systems using the neural network-based approach, various algorithms and neural network types can be used. For instance, paper [44] shows how the feed-forward backpropagation algorithm [45] can be used for predicting the temperature of the kiln shell. Paper [46] describes an expert video surveillance system based on a recurrent neural network (RNN) [47] and a long short-term memory (LSTM) network [48]. The study [49] proposed an expert system based on the generalized regression neural network (GRNN) [50] for diagnosing hepatitis B virus disease.

5 Toolkit for creating expert systems

The development of expert systems is a very complex task requiring knowledge engineers who translate expert knowledge into the language of the expert system. To speed up the development process, specialized software is often used. This section provides a brief overview of some of the shells and programming languages that are used to create expert systems.

5.1 Exsys Corvid

Exsys Corvid [51] has been one of the most popular commercial shells for many years and is still actively used today. It includes tools for debugging and testing the program and for editing knowledge and data. The Java-based Corvid Interface Engine allows solving complex problems using IF-THEN rules.

Knowledge automation expert systems with Exsys Corvid software and services have been developed worldwide in a wide variety of fields such as medicine, maintenance, human resources, government, energy, and many others [52]. The use of Exsys Corvid as a development tool for building an expert system is shown in articles [53 – 56].

5.2 CLIPS

CLIPS [57] is a well-known rule-based software tool for building expert systems. It is written in the C programming language and uses forward chaining. Currently, CLIPS is actively used in numerous modern projects, such as the development of an expert system for the selection of a tunnel boring machine [58], the prototyping of rule-based expert systems [59], and a digital fitness coach [60].

5.3 Java Expert System Shell (JESS)

The Java Expert System Shell (JESS) is another popular shell for building expert systems. This shell is an interpreter for the Jess programming language and can be used in console and GUI applications. From an architectural point of view, JESS is a production system that executes a rule-based program [61].

JESS has been used successfully in many projects, including an interactive voice system [62], semantic web service discovery [63], security risk analysis [64], the building of a virtual laboratory platform [65], and others.

5.4 Kappa PC

Kappa PC [66, 67] is a shell that brings together the critical technologies needed to rapidly develop low-cost and high-performance expert systems. It allows writing applications using GUI and generates standard ANSI C code. Domain components are represented as objects and can represent real things like cars or intangible concepts like property, and these objects can be extended using methods [66].

Applications of the Kappa PC software can be found in many projects such as an expert system for the design of commercial buses [68] or an advisory system that helps to improve the efficiency of the transport system [69].

5.5 Prolog

Prolog [70 – 72] is a logic programming language that is very popular in artificial intelligence programming and is often used for expert systems. The main features of Prolog are pattern matching mechanism, automatic backtracking, and tree-based data structuring.

5.6 Flex

Flex is a Prolog-based expert system toolkit. It supports frame-based reasoning with inheritance, rule-based programming, and data-driven procedures fully integrated into a logic programming environment [73, p. 9]. There are many expert systems built using this shell, for example, an expert system for site selection for thermal power plants [74] and an expert system for interpreting the results of an allergen microarray [75].

5.7 Gensym G2

G2 is a powerful expert system for real-time operations provided by Gensym Corporation. G2 can process tens of thousands of rules per second, supports reasoning within a deadline and default reasoning, natural language rule definition, and task priority scheduling [76].

G2 was used in such projects as a dynamic simulation of an opencast coal mine [77], and implementation of a conceptual framework for modeling a biopharmaceutical manufacturing plant [78], where high performance and reliability were needed.

5.8 Lisp

In addition to Prolog, Lisp is another popular programming language for creating expert systems, which is actively used today in projects such as an expert system for diagnosing and treating diabetes [79] and others.

5.9 VisiRule

VisiRule [81] is a popular visual modeling tool designed for building reliable decision models. VisiRule requires no programming skills and generates Flex and Prolog code from visual models. An example of working with VisiRule can be found in the study [81], which describes the creation of a rule-based decision-making expert system.

As shown above, there are many shells and programming languages that can be used to build expert systems. Unfortunately, many tools are not currently supported. The technical report [82] provides a detailed overview of many of these.

6 Categorization and classification tasks using expert systems

Expert systems can be used to solve a categorization problem, i.e., they can determine some objects or consequences of uncertain knowledge through hierarchical categorization. The knowledge base of such categorical systems consists of a taxonomic set of verbal categories, and their purpose is to determine the category of the input object based on the available facts [83].

Since categorial knowledge consists only of logical relationships between facts and does not contain an element of doubt, it can be expressed as IF-THEN rules. Categorical expert systems also require an inference engine to solve a particular problem. The inference engine can use backward and forward chaining methods and include explanation and conflict resolution modules [84, pp. 25-30].
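The backward chaining mentioned above can be sketched as a goal-driven search over IF-THEN rules that also returns an explanation. This is a minimal sketch under assumptions: the rule table `RULES`, the fact names, and the function `prove` are hypothetical, and conflict resolution is reduced to taking the first rule that succeeds.

```python
# Hypothetical categorical knowledge: conclusion -> alternative condition sets.
RULES = {
    "category:news": [{"has_publication_date", "has_headline"}],
    "subcategory:sports_news": [{"category:news", "mentions_match"}],
}


def prove(goal, facts, rules, _seen=frozenset()):
    """Backward chaining.

    Returns the list of rules used to prove `goal` (an empty list when the
    goal is already a known fact), or None when the goal cannot be proven.
    """
    if goal in facts:
        return []
    if goal in _seen:  # guard against circular rules
        return None
    for conditions in rules.get(goal, []):
        steps = []
        for condition in sorted(conditions):
            sub_steps = prove(condition, facts, rules, _seen | {goal})
            if sub_steps is None:
                break  # this rule fails, try the next alternative
            steps.extend(sub_steps)
        else:
            steps.append(
                f"IF {' AND '.join(sorted(conditions))} THEN {goal}")
            return steps
    return None
```

Starting from the category to be confirmed and working back to the known facts means that only the rules relevant to that category are examined, and the returned steps double as the explanation module's output.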

Current research shows that, in addition to the rule-based approach, the neural network approach is currently very popular in creating a classification module for such expert systems. There are many applications of expert systems in data classification and categorization problems and this section contains a description of some of them.

6.1 Categorial expert system Jurassic

The Jurassic expert system [85] is a well-known example of a categorical expert system. The system's knowledge base consists of 423 rules, which are presented as a directed acyclic graph of a depth of five.

Jurassic uses the approach [86] of representing objects not in the form of feature sets, but in the form of lists, which makes it possible to include copies of the same object in a single object representation, differing in their position in the list. The system performs categorization using a neural deductive system. The similarity is calculated in the case of uncertain knowledge based on common features.

6.2 Expert system for categorizing multiple intelligences of students

The paper [87] presents an expert system that classifies students' abilities in one of three areas: engineering, management, and science. The system architecture includes a user interface, an inference engine, a knowledge base, a student database, and a database containing student answers to questions that are used to determine the most appropriate course for each student.

The knowledge base of the system contains predefined rules that must be corrected in the process. The system determines the preferred course for the student based on weights calculated using special functions defined for each type of intelligence for each grade.

6.3 Expert system for classification of pavement cracking

The study [88] considers a multiagent expert system for automatic distress detection. The proposed system uses an expert system as a component performing the classification task, which is performed using a neural network. The system is considered hybrid [89] and has a complex architecture consisting of three agents and, in addition to the expert system, uses various technologies such as fuzzy logic [90], image processing, soft computing methods, etc.

6.4 Expert system for voltage dip classification

The paper [91] presents an expert system for classifying events of voltage dips in the power system. There are four event classes considered by the expert system: fault-induced events, transformer events, induction motor events, and step-change events. The classification task is based on their characteristics, which are related to the temporary decrease in the voltage. The system’s knowledge base contains the features uniquely characterizing the events in a set of rules.

6.5 Expert system for tweets classification

Expert systems are often used in the content classification task. For example, study [92] presents MISNIS, an expert system that automatically classifies tweets into a set of topics of interest. The system uses the Twitter Topic Fuzzy Fingerprints method [93] and compares the fuzzy fingerprint of an individual text to that of a potential author. To determine if a tweet is related to a specific topic, the system creates a topic fingerprint and a fingerprint of trending topics.
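The idea of matching a text against topic fingerprints can be illustrated with a simplified sketch. It is not the algorithm from [93]: the linear rank-based membership function, the normalization in `similarity`, and the absence of any inverse-topic-frequency weighting are assumptions made to keep the example short, and all function names and sample texts are hypothetical.

```python
from collections import Counter


def build_fingerprint(texts, k=5):
    """Fingerprint of a topic: its k most frequent words, each with a
    membership value that decreases linearly with the word's rank."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically to keep ranks stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:k]
    size = len(ranked)
    return {word: (size - rank) / size
            for rank, (word, _count) in enumerate(ranked)}


def similarity(text, fingerprint):
    """Share of fingerprint membership covered by the words of the text,
    normalized by the best score a text of this length could reach."""
    words = set(text.lower().split())
    if not words or not fingerprint:
        return 0.0
    matched = sum(mu for word, mu in fingerprint.items() if word in words)
    best_count = min(len(words), len(fingerprint))
    best = sum(sorted(fingerprint.values(), reverse=True)[:best_count])
    return matched / best


def classify(text, fingerprints):
    """Return the topic whose fingerprint is most similar to the text."""
    return max(fingerprints,
               key=lambda topic: similarity(text, fingerprints[topic]))
```

Because short texts contain few words, normalizing by the best achievable score keeps a one-word message from being penalized merely for its length.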

6.6 Expert system for multi-language documents categorization

The GENIE project described in the paper [94] is a multi-language rule-based text categorization expert system that is based on five stages: preprocessing, attribute-based classification, statistical classification, geographical classification, and ontological classification.

The categorization process begins with the preprocessing stage which includes lemmatization [95], named entities recognition [96], and keywords extraction [97]. Then it applies the attribute-based classification based on the thesaurus, i.e., a list of words and a set of their relations. The next stage is statistical classification, where the machine learning techniques are used to find patterns that correspond to the statistical information and to get labels that match the general topics of the document.

The system then applies a geographic classifier to identify possible geographic references included in the text. The geographic classifier uses a gazetteer component [98] which represents a systematized knowledge and details about named places.

Finally, the ontological classification is performed, using a lexical database with sets of synonyms and semantic relations among them.

A similar approach for classification module architecture is used in the Hypatia project [99] which is an expert system for documentation departments that provides categorization, semantic search, summarization, knowledge extraction, aggregation, and many other functions in the field of document analysis.

7 Proposed system

7.1 System’s architecture

The proposed categorization expert system is part of a high-load distributed content aggregation system that aggregates text data of multiple types: news, blogs, job ads, company information (including feedback on work), and social events (meetups, conferences, exhibitions, etc.), and displays it in a user-friendly format and design.

Since the main purpose of this system is to provide a relevant response to a query, the categorization of the aggregated content is very important. The task is compounded by the sheer volume of data, which means the entire system must be productive and scalable.

Each of the aggregated documents has a set of properties, such as title, creation date, URL, type, short description, etc. These properties are used by the rule-based mechanism to categorize the data when the neural network approach is not sufficient to decide.
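A document with such properties, and a rule-based fallback that is consulted only when a learned classifier is not confident enough, might look like the sketch below. Everything here is an assumption made for illustration: the field names, the confidence threshold of 0.7, the category labels, and the two sample rules are hypothetical and are not the rules of the proposed system.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AggregatedDocument:
    """Properties of one aggregated document (hypothetical field names)."""
    title: str
    created: date
    url: str
    doc_type: str
    short_description: str = ""


def categorize(doc: AggregatedDocument,
               model_label: Optional[str],
               model_confidence: float,
               threshold: float = 0.7) -> str:
    """Prefer the learned label; fall back to property-based rules."""
    if model_label is not None and model_confidence >= threshold:
        return model_label
    # Rule-based fallback that relies on document properties only.
    if doc.doc_type == "job_ad" or "/jobs/" in doc.url:
        return "careers"
    title = doc.title.lower()
    if "conference" in title or "meetup" in title:
        return "events"
    return "uncategorized"
```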

The high-level architecture of the proposed system is shown in Figure 2.

Figure 2 – Architecture of the expert system for aggregated content categorization

The system described in Figure 2 consists of a cluster of content Downloaders [100], i.e., web crawler agents, Content parser module, Classification application, Pre-processor, Spam classifier, Fuzzy fingerprint classifier, Attribute-based classifier, and SVM classifier.

The system also has a Thesaurus – a database with a list of words for different languages to categorize the data. At each step, the system tries to get labels that correspond to the categories of processed content.

The entire presented system can be divided into two parts: the first part is the information retrieval, and the second is its subsequent processing and categorization. These parts will be described below, with more emphasis on the categorization part, since content aggregation technology is not the main topic of this study.

7.2 Information retrieval

The Downloaders are responsible for information retrieval: they send hundreds of requests to the sources on the Internet and save web pages to the content repository database.

The Content parser module is a distributed set of parser applications that receive aggregated content and parse it according to business rules. The parsed content is stored in the aggregated content database. Both the content repository database and the aggregated content database are relational databases (PostgreSQL [101]), following the master-slave concept, which is used to stabilize the system.

The Classification application module retrieves the parsed data from the aggregated content database and adds it to the Classification message queue (RabbitMQ [102]). The amount of information processed is very large, and the message queue helps to scale the load.
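The way a queue spreads the load over several consumers can be shown with an in-process sketch. It uses Python's standard `queue` and `threading` modules purely as a stand-in for a message broker; it does not use any broker client API, and the function `run_pipeline`, the document shape, and the worker count are assumptions.

```python
import queue
import threading


def run_pipeline(documents, worker_count=3):
    """Publish documents to a queue and let several workers consume them."""
    messages = queue.Queue()
    processed = []
    lock = threading.Lock()

    def worker():
        while True:
            doc = messages.get()
            if doc is None:  # sentinel: no more work for this worker
                messages.task_done()
                break
            with lock:
                processed.append(doc["id"])  # placeholder for classification
            messages.task_done()

    workers = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in workers:
        thread.start()
    for doc in documents:
        messages.put(doc)
    for _ in workers:
        messages.put(None)  # one sentinel per worker
    messages.join()
    for thread in workers:
        thread.join()
    return sorted(processed)
```

The producer never waits for a particular consumer, so throughput can be raised by starting more workers without changing the publishing side.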

7.3 Pre-processor

The Pre-processor module automatically retrieves HTML data from the Classification message queue and processes it to simplify the subsequent work of the categorization mechanism.

The Pre-processor’s architecture is shown in Figure 3.

Figure 3 – Architecture of the Pre-processor module

As Figure 3 shows, the Pre-processor module consists of separate applications that perform HTML markup removal, stop word removal, stemming [103], lemmatization, lowercasing, punctuation removal, and keyword extraction using the term frequency–inverse document frequency (TF-IDF) algorithm [104].
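Several of these steps can be sketched with the Python standard library alone. The example below is a simplified illustration, not the module's actual implementation: it covers HTML markup removal, lowercasing, punctuation removal, stop word removal, and TF-IDF keyword ranking, and it omits stemming and lemmatization, which would normally rely on an external NLP library. The stop word list and all function names are assumptions.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Illustrative subset; a real stop word list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "on"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(html):
    """HTML removal, lowercasing, punctuation removal, stop word removal."""
    text = strip_html(html).lower()
    words = re.findall(r"[a-z0-9]+", text)  # keeps only alphanumeric runs
    return [w for w in words if w not in STOP_WORDS]


def tfidf_keywords(docs_tokens, doc_index, top_n=3):
    """Rank the terms of one document by TF-IDF against the collection."""
    n_docs = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))  # document frequency counts each term once
    tokens = docs_tokens[doc_index]
    tf = Counter(tokens)
    scores = {t: (c / len(tokens)) * math.log(n_docs / df[t])
              for t, c in tf.items()}
    # Ties are broken alphabetically to keep the output deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


pages = [
    "<p>The hockey team won the hockey cup.</p>",
    "<p>The election is on Sunday.</p>",
    "<div>The team announced a new coach.</div>",
]
tokens = [tokenize(p) for p in pages]
keywords = tfidf_keywords(tokens, 0, top_n=2)
```

In this toy collection, "team" scores low for the first page because it also appears in the third one, which is the effect the inverse document frequency term is meant to produce.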

7.4 Spam classifier

The content aggregation system under consideration needs an effective mechanism for detecting spam and inappropriate content. The crux of the problem is that spam takes many forms, from unwanted advertising to illegal content in articles, aggregated comments, or reviews. This is a pressing issue, and there are many approaches to solving it, including rule-based expert systems and systems that use machine learning algorithms.

For example, study [105] presents a cost-based heterogeneous learning framework for detecting spam in Twitter messages, which is a combination of the work of experts and a machine learning algorithm that filters spam messages.

In paper [106], spam emails were identified using machine learning and deep learning approaches. The researchers deployed six learning models and found that XGBoost [107] performed best among the machine learning models on the spam classification task.

The proposed Spam classifier component is based on the behavioral method [108], which uses a combination of a rule-based approach and a neural network to detect spam in e-mails. Its architecture is illustrated in Figure 4.

Figure 4 – Architecture of the Spam classifier

When a new batch of aggregated content arrives at the Spam classifier, it first checks the incoming data for blacklisted external links. If any are found, the data is considered spam and is saved to a separate spam database.

The next step is rule-based processing, which uses the domain knowledge from the knowledge base. If the rules identify the data as spam, it is likewise stored in the spam database.

To identify spamming behaviors, news, comments, blogs, and other aggregated content are to be described by their keywords, tags, creation date, information about the author, external links, image descriptions, etc., and represented in vector form as input to a backpropagation neural network, as described in paper [108].
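The three stages described above can be sketched as a simple cascade. This is an illustration under stated assumptions, not the method of [108]: the blacklist, the rule phrases, the three-element feature vector, and `stub_model` (a placeholder for the trained neural network) are all hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class Document:
    text: str
    links: list = field(default_factory=list)
    tags: list = field(default_factory=list)


BLACKLISTED_HOSTS = {"spam.example"}        # hypothetical blacklist
SPAM_PHRASES = ("buy now", "free money")    # hypothetical knowledge-base rules


def has_blacklisted_link(doc):
    return any(urlparse(url).hostname in BLACKLISTED_HOSTS for url in doc.links)


def matches_rule(doc):
    text = doc.text.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)


def to_vector(doc):
    """Toy feature vector: link count, tag count, text length in words."""
    return [float(len(doc.links)), float(len(doc.tags)),
            float(len(doc.text.split()))]


def stub_model(vector):
    """Placeholder for the trained network: flags link-heavy documents."""
    return 0.9 if vector[0] > 3 else 0.1


def classify_spam(doc, model, threshold=0.5):
    """Return (is_spam, stage), where stage names the step that decided."""
    if has_blacklisted_link(doc):
        return True, "blacklist"
    if matches_rule(doc):
        return True, "rules"
    score = model(to_vector(doc))
    return score >= threshold, "model"


blacklisted = classify_spam(
    Document("hello", links=["http://spam.example/offer"]), stub_model)
rule_hit = classify_spam(Document("Buy now!!"), stub_model)
clean = classify_spam(
    Document("Match report", links=["https://news.example/a"]), stub_model)
```

Ordering the stages from cheapest to most expensive means that the model is only evaluated for documents that the blacklist and the rules could not decide.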

7.5 Fuzzy fingerprints classifier

All aggregated content should be assigned main categories that correspond to its general meaning. In addition, there are more specific subcategories. For example, for the Sports category, possible subcategories are Hockey or Football.

The Fuzzy fingerprints classifier is used for this purpose: it determines the main categories for each type of aggregated content. For content types such as articles and blogs, which contain a lot of text, this module uses the Fuzzy fingerprints method [109]. For comments and reviews, which are less wordy, it uses the Twitter Topic Fuzzy Fingerprints method [93].

To detect the main category of the content, a fingerprint of each category is created from a set of training datasets containing entities that are known to be associated with that category. The fingerprints are stored in the PostgreSQL database.

If the classifier produces ambiguous results, a rule-based approach comes into play, which uses domain logic associated with the properties of the document being analyzed.
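The general idea of fingerprint-based classification can be sketched as follows. This is a simplified illustration in the spirit of fuzzy fingerprints, not the exact formulas of [109] or [93]: the linear membership function, the score normalization, the `margin` used to detect ambiguity, and the training texts are all assumptions.

```python
from collections import Counter


def build_fingerprint(training_texts, k=5):
    """Top-k tokens of a category with a linearly decreasing membership."""
    counts = Counter(tok for text in training_texts
                     for tok in text.lower().split())
    # Most frequent first; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))[:k]
    size = len(ranked)
    return {tok: (size - rank) / size for rank, tok in enumerate(ranked)}


def similarity(fingerprint, text):
    """Share of the fingerprint's total membership matched by the text."""
    tokens = set(text.lower().split())
    matched = sum(mu for tok, mu in fingerprint.items() if tok in tokens)
    total = sum(fingerprint.values())
    return matched / total if total else 0.0


def classify(fingerprints, text, margin=0.1):
    """Return the best category, or None when the top scores are too close."""
    scores = sorted(((similarity(fp, text), name)
                     for name, fp in fingerprints.items()), reverse=True)
    if len(scores) > 1 and scores[0][0] - scores[1][0] < margin:
        return None  # ambiguous: hand the document to the rule-based step
    return scores[0][1]


fingerprints = {
    "Sports": build_fingerprint(
        ["hockey team wins cup", "football team loses match"]),
    "Politics": build_fingerprint(
        ["parliament passes vote", "election vote counted"]),
}
label = classify(fingerprints, "the hockey team scored")
unknown = classify(fingerprints, "weather tomorrow")
```

Returning `None` for close scores gives the rule-based step an explicit signal that the fingerprint comparison alone was not decisive.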

7.6 Attribute-based and SVM classifiers

The idea of using the Attribute-based classifier was adopted from the design of the GENIE expert system [94]. It is a rule-based process that finds the subcategories of processed documents according to their properties and based on the main category found in the previous step by the Fuzzy fingerprints classifier.

In the last step, the SVM classifier is used, which is built on the Support Vector Machine method [110, 111]. The SVM classifier looks for matching patterns to retrieve subcategories that may not have been found by the Attribute-based classifier.
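How the two steps complement each other can be sketched as a rule-first lookup whose result is extended by a statistical classifier. The sketch below does not implement an SVM: `stub_statistical_classifier` is a placeholder for the trained model, and the rules, document fields, and URLs are hypothetical.

```python
# Hypothetical attribute rules, keyed by the main category found earlier.
SUBCATEGORY_RULES = {
    "Sports": [
        ("Hockey", lambda doc: "/hockey/" in doc["url"]
            or "hockey" in doc["title"].lower()),
        ("Football", lambda doc: "/football/" in doc["url"]
            or "football" in doc["title"].lower()),
    ],
}


def attribute_subcategories(doc, main_category):
    """Rule-based step: test document properties against the category rules."""
    rules = SUBCATEGORY_RULES.get(main_category, [])
    return [name for name, rule in rules if rule(doc)]


def stub_statistical_classifier(doc, main_category):
    """Placeholder for the trained SVM: a single keyword check on the body."""
    if main_category == "Sports" and "derby" in doc["text"].lower():
        return ["Football"]
    return []


def find_subcategories(doc, main_category, statistical_classifier):
    """Start with the rule-based labels, then add any the classifier finds."""
    found = attribute_subcategories(doc, main_category)
    for label in statistical_classifier(doc, main_category):
        if label not in found:
            found.append(label)
    return found


digest = {
    "url": "https://news.example/sports/hockey/123",
    "title": "Weekend digest",
    "text": "The cup final went to overtime, and the city derby ended in a draw.",
}
labels = find_subcategories(digest, "Sports", stub_statistical_classifier)
```

Here the URL rule yields Hockey, and the placeholder classifier adds Football, which the attribute rules missed because neither the URL nor the title mentions it.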

8 Conclusions

The problem of content categorization is very relevant for a content aggregator that collects huge amounts of data. For this reason, an expert system architecture has been presented that classifies aggregated content using a combination of a rule-based approach and neural networks.

To find an architectural solution that is suitable for the current subject area, the advantages and disadvantages of expert systems were studied. The research has shown that expert systems can be very expensive, and that data acquisition is often resource-intensive and time-consuming.

On the other hand, expert systems are fast and can solve problems in real time. The proposed architecture is a hybrid solution that uses a rule-based approach to overcome the errors made by the neural network. It is expected to be more efficient than a purely rule-based approach and to recognize more patterns.

The proposed system has a Spam classifier module that uses a combination of a rule-based approach and a neural network to detect spam in aggregated content. There is also a Fuzzy fingerprints classifier that determines the main categories for each type of aggregated content, with a rule-based approach used to correct its results. The Attribute-based classifier determines the subcategories of the processed content, and the SVM classifier is used to improve the results in the final step. The proposed system is flexible, and additional components can easily be added to it.

This paper also provides an overview of development tools for expert systems. It has been shown that there are many frameworks and programming languages available for developing expert systems. The author's choice is CLIPS, which is portable, extensible, and well-documented public domain software.

References
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.
29.
30.
31.
32.
33.
34.
35.
36.
37.
38.
39.
40.
41.
42.
43.
44.
45.
46.
47.
48.
49.
50.
51.
52.
53.
54.
55.
56.
57.
58.
59.
60.
61.
62.
63.
64.
65.
66.
67.
68.
69.
70.
71.
72.
73.
74.
75.
76.
77.
78.
79.
80.
81.
82.
83.
84.
85.
86.
87.
88.
89.
90.
91.
92.
93.
94.
95.
96.
97.
98.
99.
100.
101.
102.
103.
104.
105.
106.
107.
108.
109.
110.
111.
