[Paper Translation] Experience: Improving Opinion Spam Detection by Cumulative Relative Frequency Distribution


下载PDF:https://arxiv.org/pdf/2012.13905v1.pdf


Experience: Improving Opinion Spam Detection by Cumulative Relative Frequency Distribution

Abstract

Over the last years, online reviews have become very important, since they can influence the purchase decisions of consumers and the reputation of businesses; therefore, the practice of writing fake reviews can have severe consequences for customers and service providers. Various approaches have been proposed for detecting opinion spam in online reviews, especially based on supervised classifiers. In this contribution, we start from a set of effective features used for classifying opinion spam and we re-engineer them by considering the Cumulative Relative Frequency Distribution of each feature. Through an experimental evaluation carried out on real data from Yelp.com, we show that the use of the distributional features improves the performance of the classifiers.

Introduction

As illustrated in a recent survey on the history of Digital Spam, the Social Web has led not only to a participatory, interactive nature of the Web experience, but also to the proliferation of new and widespread forms of spam, among which the most notorious ones are fake news and spam reviews, i.e., opinion spam. This results in the diffusion of different kinds of disinformation and misinformation, where misinformation refers to inaccuracies that may even originate from acting in good faith, while disinformation is false information deliberately spread to deceive. Over the last years, online reviews have become very important, since they reflect the customers' experience with a product or service and, nowadays, they constitute the basis on which the reputation of an organization is built. Unfortunately, the confidence in such reviews is often misplaced, due to the fact that spammers are tempted to write fake information in exchange for some reward, or to mislead consumers in order to obtain business advantages. The practice of writing false reviews is not only morally deplorable, as it is misleading for customers and harmful for service providers, but it is also punishable by law. Considering both the longevity and the spread of the phenomenon, scholars have for years investigated various approaches to opinion spam detection, mainly based on supervised or unsupervised learning algorithms. Further approaches are based on Multi-Criteria Decision Making. Machine learning approaches rely on input data to build a mathematical model in order to make predictions or decisions. To this aim, data are usually represented by a set of features, which are structured and ideally fully representative of the phenomenon being modeled. An effective feature engineering process, i.e., the process through which an analyst uses domain knowledge of the data under investigation to prepare appropriate features, is a critical and time-consuming task.
However, if done correctly, feature engineering increases the predictive power of algorithms by facilitating the machine learning process.

In this paper, we do not aim to contribute by defining novel features suitable for fake review detection; rather, starting from features that have been proven to be very effective by Academia, we re-engineer them by considering the distribution of the occurrences of the feature values in the dataset under analysis. In particular, we focus on the Cumulative Relative Frequency Distribution of a set of basic features already employed for the task of fake review detection. We compute this distribution for each feature and substitute each feature value with the corresponding value of the distribution. To demonstrate the effectiveness of the proposed approach, both the basic features and the re-engineered ones have been exploited to train several supervised machine-learning classifiers, and the obtained results have been compared. To the best of the authors' knowledge, this is the first time that the Cumulative Relative Frequency Distribution of a set of features has been considered for the unveiling of fake reviews. The experimental results show that the distributional features improve the performance of the classifiers, at the mere cost of a small computational surplus in the feature engineering phase. The rest of the paper is organized as follows. The next section revises related work in the area. The Feature Engineering section describes the process of feature engineering. We then present the experimental setup, followed by the results of the comparison among the classification algorithms; there, we also assess the importance of the distributional features and discuss the benefits brought by their adoption. The final section concludes the paper.

Related Work

Social Media represent the perfect means for everyone to spread contents in the form of User-Generated Content (UGC), almost without any traditional form of trusted control. For years, Academia, Industry, and Platform Administrators have been working on automatic solutions to raise the users' awareness about the credibility of the news they read online. One of the contexts in which the problem of credibility assessment is receiving the most interest is spam (or fake) review detection. The existence of spam reviews has been known since the early 2000s, when e-commerce and e-advice sites began to be popular. In his seminal work, Liu lists three approaches to automatically identify opinion spam: the supervised, unsupervised, and group approaches. In a standard supervised approach, a ground truth of a priori known genuine and fake reviews is needed. Then, features about the labeled reviews, the reviewers, and the reviewed products are engineered. The first models built on such features achieved good results with common algorithms such as Naive Bayes and Support Vector Machines. As usual, a supervised approach is particularly challenging since it requires the existence of labeled data, that is, in our scenario, a set of reviews with prior knowledge about their (un)trustworthiness. To overcome the frequent issue of lack of labeled data, in the very first phases of investigation in this field, the work by Jindal et al. exploited the fact that a common practice of fraudulent reviewers was to post almost duplicate reviews: reviews with similar texts were collected as fake instances. Linguistic features have also been proven to be valid for fake review detection, particularly in the early advent of this phenomenon. Indeed, pioneer fake reviewers exhibited precise stylistic features in their texts, such as a marked use of short terms and expressions of positive feelings.
Anomaly detection has also been widely employed in this field: analyses of anomalous practices with respect to the average behavior of a genuine reviewer led to good results. Anomalous behavior of a reviewer may be related to general and early rating deviation, as highlighted by Liu, or to temporal dynamics (see Xie et al.). Going further with the useful methodologies, human annotators, possibly recruited from crowd-sourcing services like Amazon Mechanical Turk, have also been employed, both 1) to manually label review sets to separate fake from non-fake reviews (e.g., see the very recent survey by Crawford et al.) and 2) to write intentionally false reviews, in order to test the accuracy of existing predictive models on such sets of ad hoc crafted reviews, as nicely reproduced by Ott et al. Recently, an interesting point of view has been offered by Cocarascu and Toni: deception is analysed based on contextual information derivable from review texts, but not in a standard way (e.g., considering linguistic features); rather, the influence and interactions that one text has on the others are evaluated. The new feature, based on bipolar argumentation on the same review, has been shown to outperform more traditional features when used in standard supervised classifiers, even on small datasets.

Supervised learning algorithms usually need diverse examples - and the values of diverse features derived from such examples - for an accurate training phase. Wang et al. investigated the 'cold-start' problem: the identification of a fake review when a new reviewer posts their first review. Without enough data about the stylistic features of the review and the behavioral characteristics of the reviewer, the authors first find similarities between the review text under investigation and other review texts. Then, they consider similar behavior between the reviewer under investigation and the reviewers who posted the identified reviews. A model based on neural networks proves to be effective in approaching the problem of lack of data in cold-start scenarios. Although many years have passed and, as we will see shortly, the problem has been addressed in many research works with different techniques, automatically detecting a false review is an issue not completely solved yet, as stated in the recent survey by Wu et al. This inspiring work examines the phenomenon not only by giving an overview of the various detection techniques used over time, but also by proposing twenty future research questions. Notably, to help scholars find suitable datasets for a supervised classification task, this survey lists the currently available review datasets and their characteristics. A similar work by Hussain et al., aimed at a comparison of different approaches, focuses on the performances obtained by different classification frameworks. Also, the authors carried out a relevance analysis of six different behavioral features of reviewers. Weighting the features with respect to their relevance, a classification over a baseline dataset obtains an 84.5% accuracy. A quite novel work considers the unveiling of malicious reviewers by exploiting the notion of 'neighborhood of suspiciousness'. Kaghazgaran et al. proposed a system called TwoFace that, starting from identifiable reviewers paid by crowd-sourcing platforms to write fake reviews on well-known e-commerce platforms such as Amazon, studies the similarity between these and other reviewers, based, e.g., on the reviewed products, and shows how it is possible to spot organized fake review campaigns even when the reviewers alternate genuine and malicious behaviors. Serra et al. developed a supervised approach where the task is to differentiate amongst different kinds of reviewers, from fraudulent, to uninformative, to reliable. Leveraging a supervised classification approach based on a deep recurrent neural network, the system achieves notable performances over a real dataset where there is a priori knowledge of the fraudulent reviewers. The research work recalled so far lies in supervised learning. However, unsupervised techniques have been employed too, since they are very useful when no tagged data is available.

Fake reviewers' coordination can emerge by mining frequent behavioral patterns and ranking the most suspicious ones. A pioneering work first identifies groups of reviewers that reviewed the same set of products; then, the authors compute and aggregate an ensemble of anomaly scores (e.g., based on the similarity amongst reviews and the times at which the reviews have been posted): the scores are ultimately used to tag the reviewers as colluding or not. Another interesting approach for the analysis of colluding users checks whether a given group of accounts (e.g., reviewers) contains a subset of malicious accounts. The intuition behind this methodology is that the statistical distribution of reputation scores (e.g., number of friends and followers) of the accounts participating in a tampered computation significantly diverges from that of untampered ones. We close this section by referring back to the division made by Liu about supervised, unsupervised, and group approaches to spot fake reviewers and/or reviews. These are classification methods, mostly aiming at classifying information items in a binary or multiple way (i.e., credible vs non-credible), with the evaluation of a series of credibility features extracted from the data. Notably, approaches that are based on some prior domain knowledge are promising in providing a ranking of the information item (i.e., in our scenario, of the review) with respect to credibility. This is the case of recent work by Pasi et al., which exploits a Multi-Criteria Decision Making approach to assess the credibility of a review. In this context, a given review, seen as an alternative amongst others, is evaluated with respect to some credibility criteria. An overall credibility estimate of the review is then obtained by means of a suitable model-driven approach based on aggregation operators.
This approach also has the advantage of assessing the contribution that single or interacting criteria/features have in the final ranking. The techniques presented above have their pros and cons, and, depending on the particular context, one approach can be preferred over another. The most relevant contribution of our proposal with respect to the state of the art is to improve the effectiveness of the solution based on supervised classifiers, which, as seen above, is a well-known and widely-used approach in this context.

Feature Engineering

In this section, we introduce a subset of features that have been adopted in past work to detect opinion spam and we propose how to modify them in order to improve the performances of classifiers. We emphasize that the listed features have been used effectively for this task by past researchers. We give below the rationale for their use in the context of unveiling fake reviews. Finally, it is worth noting that the list of selected features is not intended to be exhaustive.

Basic Features

Following a supervised classification approach, the selection of the most appropriate features plays a crucial role, since they may considerably affect the performance of the machine learning models constructed starting from them. Features can be review-centric or reviewer-centric. The former refer to the review, while the latter refer to the reviewer. In the literature, several reviewer-centric features have been investigated, such as the maximum number of reviews, the percentage of positive reviews, the average review length, and the reviewer rating deviation. According to the outcomes of several works proposed in the context of opinion spam detection, we focused on reviewer-centric features, which have been demonstrated to be more effective for the identification of fake reviews. Thus, we relied on a set of reviewer-centric features that have already been used proficiently in the literature for the detection of opinion spam in reviews. Specifically, we focused on the following:

  • {Photo Count}: This metric measures the number of pictures uploaded by a reviewer and is directly retrieved from the reviewer profile. Prior work demonstrated the effectiveness of using the photo count, together with other non-verbal features, for detecting fake reviews.
  • {Review Count}: It measures how many reviews have been posted by a reviewer on the platform. Past studies showed that spammers and non-spammers behave differently regarding the number of reviews they post. In particular, spammers usually post more reviews, since they may get paid. This feature has also been investigated in further studies.
  • {Useful Votes}: The most popular online review platforms allow users to rank reviews as useful or not. This information can be retrieved from the reviewer profile, or computed by summing the total amount of useful votes received by a reviewer. This feature has already been exploited in the literature and demonstrated to be effective for opinion spam detection.
  • {Reviewer Expertise}: Past research highlights that reviewers with acquired expertise on the platform are less prone to cheat.

Particularly, Mukherjee et al. report that opinion spammers are usually not longtime members of a site. Genuine reviewers, instead, use their accounts from time to time to post reviews. Although this experimental evidence does not mean that no spammer can be a member of a review platform for a long time, the literature has considered it useful to exploit the activity freshness of an account for cheating detection. The Reviewer Expertise has been defined by Zhang et al. as the number of days a reviewer has been a member of the platform (the original name was Membership Length).

  • {Average Gap}: The review gap is the time elapsed between two consecutive reviews. This feature was previously introduced in the seminal work by Mukherjee et al., under the name {Activity Window}, and successfully re-adopted for detecting both colluders (i.e., spammers acting with a coordinated strategy) and singleton reviewers (i.e., reviewers with just isolated posting behavior). In the cited work, the Activity Window feature has been proved highly discriminant for demarcating spammers and non-spammers: fake reviewers are likely to review in short bursts and are usually not longtime active members.

On a Yelp dataset where the benign or malicious nature of the reviewers was a priori known, prior work proved that, by computing the difference between the timestamps of the last and first reviews of each reviewer, a majority (80%) of spammers were bounded by 2 months of activity, whereas the same percentage of non-spammers remained active for at least 10 months. We define the Average Gap feature as the average time, in days, elapsed between two consecutive reviews by the same reviewer:

$$ AG_i = \frac{1}{N_i - 1} \sum_{j=2}^{N_i} (T_{i,j} - T_{i,j-1}) $$

where $ AG_i $ is the Average Gap for the $ i $ -th user, $ N_i $ is the number of reviews written by the user, and $ T_{i,j} $ is the timestamp of the $ j $ -th review of the $ i $ -th user.
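As a concrete illustration, the Average Gap above can be computed in a few lines of Python. This is a minimal sketch: the `average_gap` helper and the day-based timestamps are our own illustrative assumptions, not part of the paper's implementation.

```python
# Minimal sketch of the Average Gap feature. Assumes each reviewer's review
# timestamps are expressed in days and sorted chronologically.
def average_gap(timestamps):
    """Mean time elapsed between consecutive reviews of one reviewer."""
    if len(timestamps) < 2:
        return 0.0  # convention for reviewers with a single review
    gaps = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)

# Example: reviews posted on days 0, 10 and 40 -> gaps of 10 and 30 days.
print(average_gap([0, 10, 40]))  # 20.0
```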

  • {Average Rating Deviation}: The rating deviation measures how far a reviewer's rating is from the average rating of a business. Prior work observed that spammers are more prone to deviate from the average rating than genuine reviewers. However, a bad experience may induce a genuine reviewer to deviate from the mean rating. The Average Rating Deviation is defined as follows:

$$ ARD_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \left| R_{i,j} - \overline{R}_{B(j)} \right| $$

where $ ARD_i $ is the Average Rating Deviation of the $ i $ -th user, $ N_i $ is the number of reviews written by the user, $ R_{i,j} $ is the rating given by the $ i $ -th user in her/his $ j $ -th review, corresponding to the business $ B(j) $ , and $ \overline{R}_{B(j)} $ is the average rating obtained by the business $ B(j) $ .
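The same formula translates directly into code. In this sketch, `reviews` is a hypothetical list of (rating, business_id) pairs for one reviewer and `avg_rating` maps each business to its average rating; both names are our own illustrative assumptions.

```python
# Minimal sketch of the Average Rating Deviation for one reviewer.
def average_rating_deviation(reviews, avg_rating):
    deviations = [abs(rating - avg_rating[biz]) for rating, biz in reviews]
    return sum(deviations) / len(deviations)

avg_rating = {"biz1": 4.0, "biz2": 3.0}
# |5 - 4.0| = 1.0 and |1 - 3.0| = 2.0, so the ARD is 1.5.
print(average_rating_deviation([(5, "biz1"), (1, "biz2")], avg_rating))  # 1.5
```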

  • {First Review}: Spammers are usually paid to write reviews when a new product is placed on the market. This is due to the fact that early reviews have a great impact on consumers' opinions and, in turn, on sales, as pointed out in previous studies. We compute the time elapsed between each review of a reviewer and the first review posted for the same business. Then, we average the results over all the reviews. Specifically, the First Review value for reviewer $ i $ is given by:

$$ FR_i = \frac{1}{N_i} \sum_{j=1}^{N_i} (T_{i,j} - F_{B(j)}) $$

where $ FR_i $ is the First Review value of the $ i $ -th user, $ N_i $ is the number of reviews written by the user, $ T_{i,j} $ is the time at which the $ i $ -th user wrote the $ j $ -th review, and $ F_{B(j)} $ is the time at which the first review of the same business $ B(j) $ , corresponding to the $ j $ -th review, was posted.

  • {Reviewer Activity}: Several works pointed out that the more active a user is on the online platform, the more likely the user is genuine, in terms of contributing to knowledge sharing in a useful way. The usefulness of this feature was demonstrated several years ago. Since the early 2000s, surveys have been conducted on large communities of individuals, trying to understand what drives them to be active and useful on an online social platform, in terms of sharing content. Results showed that people contribute their knowledge when they perceive that it enhances their reputation, when they have experience to share, and when they are structurally embedded in the network. The Activity feature expresses the number of days a user has been active and it is computed as:
    $$ A_i = T_{i,L} - T_{i,0} $$
    where $ A_i $ is the activity (expressed in days) of the $ i $ -th user, $ T_{i,L} $ is the time of the last review of the $ i $ -th user and $ T_{i,0} $ is the time of the first review of the $ i $ -th user.

From Basic to Cumulative Features

The features described so far have been used to train a machine learning algorithm to construct a classifier, in a supervised-learning fashion. In this work, we propose to build on the basic features, with a proper feature engineering process, in order to assess a possible improvement of the classification performances. The proposed feature engineering process is based on the concept of Cumulative Relative Frequency Distribution. The Relative Frequency is a quantity that expresses how often an event occurs, divided by all outcomes. It can be easily represented by a Relative Frequency table, which is constructed directly from the data by simply dividing the frequency of a value by the total number of values in the dataset. The Cumulative Relative Frequency is then calculated by adding each frequency from the Relative Frequency table to the sum of its predecessors. In practice, the Cumulative Relative Frequency indicates the percentage of elements in the dataset that lie below the current value. In this work, we modify each feature by using its Cumulative Relative Frequency Distribution. In the following, we show an example of how to compute it. Let us consider the {Photo Count} feature and assume that, for each photo count value, the corresponding number of occurrences is the one reported in the second column of the table below. Thus, the second column reports the number of reviews associated with a reviewer who uploaded a given number of photos: in our example, there are 7,944 reviews whose reviewers have no photo associated. The third column measures the Relative Frequency, which is computed by dividing the number of occurrences by the total number of reviews. Finally, the fourth column reports the Cumulative Relative Frequency values, which have been obtained by adding each Relative Frequency value to the sum of its predecessors.
In our proposal, the process described so far is carried out for each basic feature, and the Cumulative Relative Frequency values are used to train the classifier instead of the raw values. In practice, this amounts to substituting each value in the first column of the table below with the corresponding value in the fourth column.

Photo Count   Frequency (#Reviews)   Relative Freq. (%Reviews)   Cumulative Rel. Freq.
0             7944                   0.44                        0.44
1             2301                   0.13                        0.57
2             1756                   0.10                        0.67
3             1401                   0.08                        0.75
4             822                    0.04                        0.79
5             1382                   0.08                        0.87
6             550                    0.03                        0.90
7             347                    0.02                        0.92
8             780                    0.04                        0.96
9             342                    0.02                        0.98
10            460                    0.02                        1.00
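The transformation described above can be sketched in a few lines of Python. The function name and the toy data are our own illustrative assumptions; the toy column simply mimics the structure of the photo-count example.

```python
from collections import Counter

# Sketch of the proposed re-engineering step: replace each raw feature value
# with its Cumulative Relative Frequency over the whole dataset.
def cumulative_relative_frequency(values):
    counts = Counter(values)
    total = len(values)
    crf, cumulative = {}, 0
    for v in sorted(counts):         # ascending feature values
        cumulative += counts[v]      # running count of values <= v
        crf[v] = cumulative / total  # cumulative relative frequency
    return crf

# Toy photo-count column: 3 reviewers with 0 photos, 2 with 1 photo, etc.
values = [0, 0, 0, 1, 1, 2, 3, 5, 5, 10]
crf = cumulative_relative_frequency(values)
transformed = [crf[v] for v in values]  # values actually fed to the classifier
print(crf[0], crf[10])  # 0.3 1.0
```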

Experimental Setup

In this section, we describe the setting of the experiments conducted to evaluate the effectiveness of the proposed features. This is done by comparing the results obtained when using the basic features with those obtained when using the cumulative ones, with the most widespread supervised machine learning algorithms.

Dataset Construction and Characteristics

The dataset used in this study is composed of 56,317 business reviews, 42,673 businesses (both restaurants and hotels), and 1,429 reviewers. This dataset has been obtained by repopulating the YelpCHI dataset, which included 67,395 reviews of 201 hotels and restaurants written by 38,063 reviewers, each review being tagged with a fake/non-fake label. Repopulating the original data allowed us both to tag the reviewers and to obtain fresher data to work with. In the table below, we report a summary of the statistics of the basic features for the repopulated dataset, whereas the figure below presents the correlation matrix, which shows the correlation coefficients among the variables.

           photo   review   useful    reviewer    avg     avg rating   first    reviewer
           count   count    votes     expertise   gap     deviation    review   activity
mean       170.9   201.9    502.7     3664.7      55.2    0.01         13.9     2637.6
std. dev.  911     298.2    2089.45   579.8       110.6   0.06         50.2     991.3

[Figure: correlation matrix of the basic features]

Data Labeling

One limitation of supervised classification approaches is the possible lack of labeled data. To overcome this problem, some researchers relied on Amazon Mechanical Turk workers to generate fake and genuine reviews. Nevertheless, later work highlighted the limits of this approach, since workers may not always be effective in imitating the behavior of fake reviewers, and proposed instead to use the results of the Yelp classification algorithm for obtaining labeled data. The same approach has been adopted in further studies, and the literature has recognized its effectiveness in detecting fake reviews. According to this result, we use the same approach to build our dataset of labeled examples (as described in the previous section).

Experimental Framework

We experiment with several supervised machine-learning algorithms, namely Logistic Regression (LogReg), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB), and k-Nearest Neighbors (k-NN). The performances of the learning algorithms are evaluated by means of commonly used metrics. Specifically, we compute the Accuracy on the training (TraAcc) and test (TstAcc) sets, and the Precision (Prec), Recall (Rec) and F1-score (F1) for both classes. In addition, we compute the Matthews Correlation Coefficient (MCC) and the Area Under the Curve (AUC), two evaluation metrics used to measure the quality of binary classifications that are useful for comparing the performances of classification models. To ensure higher reliability of the results, we apply a Stratified k-Fold Cross-Validation approach, with $ k=5 $ . The cross-validation involves partitioning the dataset into $ k $ sets; a model is then trained using $ k-1 $ folds (the training set) and validated on the remaining part of the data (the validation set). To reduce variability, this process is repeated $ k $ times, using each partition exactly once for validation. The performance measures are obtained by averaging the validation results over all runs. The stratified approach ensures the preservation of the class frequencies in each training and validation fold. The experimental framework described so far is depicted in Figure framework. All the experiments have been developed in Python, with the support of the Scikit-learn library.
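The stratified partitioning scheme can be sketched in a few lines of pure Python (a simplified illustration of what Scikit-learn's `StratifiedKFold` does internally; the function name and the toy labels below are ours, not the library's):

```python
from collections import defaultdict

def stratified_k_fold(labels, k=5):
    """Partition sample indices into k folds, preserving class frequencies.

    Indices of each class are dealt round-robin into the k folds, so every
    fold keeps roughly the same fake/genuine ratio as the full dataset.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for i, idx in enumerate(indices):
            folds[i % k].append(idx)
    return folds

# Imbalanced toy labels: ~13% fakes (class 0), as in the dataset.
labels = [0] * 13 + [1] * 87
folds = stratified_k_fold(labels, k=5)
for fold in folds:
    fakes = sum(1 for idx in fold if labels[idx] == 0)
    print(len(fold), fakes)  # every fold keeps 2-3 fakes out of ~20 samples
```

In each of the 5 runs, one fold serves as the validation set and the remaining four form the training set.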

[Figure framework: overview of the experimental framework]

Evaluation and Discussion

In this section, we report the results obtained by applying the aforementioned algorithms to the detection of fake and genuine reviews, when using either the basic or the cumulative features. Due to the imbalanced nature of the dataset, we report the performance metrics for both the positive (fake-0) and the negative (non fake-1) class. Table reviews_balanced_013_rating_simplef and Table reviews_balanced_013_rating report the results of the learning algorithms, obtained by using the basic and the cumulative features, respectively. The performance values are computed by averaging the results over 5 runs of the algorithms, according to the employed cross-validation scheme.

Table reviews_balanced_013_rating_simplef (basic features):

Algorithm    TraAcc TstAcc Prec-0 Rec-0 F1-0 Prec-1 Rec-1 F1-1 MCC  AUC
DT(Dep=10)   0.99   0.97   0.77   1.00  0.87 1.00   0.96  0.98 0.86 0.98
DT(Dep=5)    0.97   0.94   0.67   1.00  0.80 1.00   0.94  0.97 0.79 0.97
k-NN(k=5)    0.93   0.92   0.71   0.79  0.75 0.96   0.94  0.95 0.70 0.95
k-NN(k=10)   0.90   0.89   0.70   0.64  0.67 0.92   0.94  0.93 0.61 0.93
SVM(rbf)     0.90   0.88   0.82   0.47  0.60 0.88   0.97  0.93 0.56 0.92
LogReg       0.87   0.85   0.43   0.89  0.58 0.99   0.85  0.91 0.55 0.87
LDA          0.84   0.81   0.36   0.87  0.51 0.98   0.80  0.88 0.48 0.84
Naive Bayes  0.85   0.78   0.34   0.96  0.50 0.99   0.75  0.86 0.49 0.85

Table reviews_balanced_013_rating (cumulative features):

Algorithm    TraAcc TstAcc Prec-0 Rec-0 F1-0 Prec-1 Rec-1 F1-1 MCC  AUC
DT(Dep=10)   0.99   0.98   0.83   1.00  0.91 1.00   0.97  0.99 0.90 0.99
DT(Dep=5)    0.96   0.93   0.64   1.00  0.78 1.00   0.92  0.96 0.77 0.96
k-NN(k=5)    0.97   0.94   0.64   1.00  0.78 1.00   0.93  0.96 0.77 0.96
k-NN(k=10)   0.95   0.91   0.56   1.00  0.72 1.00   0.90  0.95 0.71 0.95
SVM(rbf)     0.92   0.90   0.53   0.94  0.68 0.99   0.89  0.94 0.66 0.92
LogReg       0.91   0.89   0.45   0.93  0.66 0.99   0.89  0.93 0.64 0.91
LDA          0.90   0.88   0.50   0.93  0.65 0.99   0.88  0.93 0.62 0.90
Naive Bayes  0.86   0.86   0.44   0.86  0.58 0.98   0.86  0.91 0.55 0.86

Table reviews_balanced_013_rating_simplef highlights that all the algorithms obtain very good precision on the genuine class (tagged with 1), while the performances are worse on the fake class (tagged with 0). This is due to the fact that the dataset is imbalanced, with a ratio between classes of about 0.13. Among the experimented methods, the one that obtains the best precision on the minority class is the Support Vector Machine classifier (precision = 0.97). However, the best global performances are obtained by the Decision Tree classifier (max depth=10), since in this case the MCC and the AUC reach the highest values (0.86 and 0.98, respectively). This occurs because the recall values obtained by the Decision Tree classifier are almost equal (majority class) or better (minority class) with respect to those obtained by the Support Vector Machine. We also experiment with a Decision Tree classifier with max depth=5, to obtain a less complex model, but we notice that the performances on the minority class drop dramatically. Still regarding the precision on the minority class, all the remaining algorithms but the SVM are outperformed by the Decision Tree (max depth=10).

However, all of them, except the SVM, perform quite well with respect to the recall. This implies that the algorithms work well in finding the fake instances, but there are also many false positives, i.e., genuine reviews erroneously classified as fake. The results described so far and presented in Table reviews_balanced_013_rating_simplef are generally worse than those shown in Table reviews_balanced_013_rating, where we report the results obtained by using the Cumulative Relative Frequency distributions. Also when using the cumulative features, the best approach is the Decision Tree (max depth=10), which outperforms the other algorithms. By comparing the two tables, we notice that the results obtained by the k-NN classifiers with the basic features reach higher values in the precision of the minority class, but this improvement in precision is balanced by a lower recall. In practice, the fake reviews detected by these classifiers are indeed fake, but many actual fakes are missed. To summarize the performance differences, we report in Table comparison_simple_cum a comparison between the MCC and the AUC values. In particular, for each metric, we report the corresponding value obtained by using the basic features, the value obtained by using the cumulative features, and the improvement percentage. Table comparison_simple_cum highlights that the MCC and the AUC values are always higher or equal when using cumulative features, for all the algorithms except for the Decision Tree (max depth=5), since in this case the basic features lead to better MCC and AUC values.

Algorithm    MCC Basic  MCC Cumul.  MCC Improv.(%)  AUC Basic  AUC Cumul.  AUC Improv.(%)
DT(Dep=10)   0.86       0.90        4.7             0.98       0.99        1.0
DT(Dep=5)    0.79       0.77        -2.5            0.97       0.96        -1.0
k-NN(k=5)    0.70       0.77        10.0            0.95       0.96        1.1
k-NN(k=10)   0.61       0.71        16.4            0.93       0.95        2.2
SVM(rbf)     0.56       0.66        17.9            0.92       0.92        0.0
LogReg       0.55       0.64        16.4            0.87       0.91        4.6
LDA          0.48       0.62        29.2            0.84       0.90        7.1
Naive Bayes  0.49       0.55        12.2            0.85       0.86        1.2
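The improvement percentages in the table are plain relative gains of the cumulative-feature score over the basic one; for instance, for LDA (values taken from the comparison above):

```python
def improvement(basic, cumulative):
    """Relative improvement (%) of the cumulative-feature score over the basic one."""
    return round((cumulative - basic) / basic * 100, 1)

print(improvement(0.48, 0.62))  # LDA, MCC: 29.2
print(improvement(0.84, 0.90))  # LDA, AUC: 7.1
```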

We finally remark that the process described in Section engineering represents a pre-processing step and needs to be performed only once, before applying the several machine learning algorithms. We computed the time required to build the Cumulative Relative Frequency for all the features in the pre-processing phase of the cross-validation process and obtained an average value of 155 ms, pointing out that the impact of the computation time for the proposed features is negligible.

Features Importance

The importance of features plays a significant role in a predictive model, since it provides insights into the data and into the model itself. Moreover, it forms the basis for dimensionality reduction and feature selection, which can sometimes improve the efficiency and the effectiveness of a predictive model. From the experiments carried out, the best algorithm turned out to be a Decision Tree model. Decision Tree algorithms offer importance scores based on the reduction of the criterion used to select split points in the tree, such as Gini or entropy. In this case, the importance of a feature has been computed following the Gini importance, which counts the times a feature is used to split a node, weighted by the number of samples it splits. We computed the Gini importance in each fold of the cross-validation and averaged the obtained results. In Figure significance, we report a graph showing the importance of each cumulative feature, from the most important to the least important. The first three features, namely the number of reviews posted on the platform, the number of useful votes received, and the membership length of a reviewer, are probably the most useful for this predictive model.
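The Gini-importance contribution of one split can be illustrated as follows: it is the impurity decrease the split produces, weighted by the fraction of samples reaching that node (a didactic sketch with made-up helper names, not Scikit-learn's internal code):

```python
def gini(labels):
    """Gini impurity of a set of binary class labels (0 = fake, 1 = genuine)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p_fake = labels.count(0) / n
    return 1.0 - p_fake ** 2 - (1.0 - p_fake) ** 2

def split_importance(parent, left, right, n_total):
    """Weighted impurity decrease of one split: its share of the Gini importance."""
    n = len(parent)
    decrease = (gini(parent)
                - (len(left) / n) * gini(left)
                - (len(right) / n) * gini(right))
    return (n / n_total) * decrease

# Toy node: 4 fake and 4 genuine reviews; the split separates them fairly well.
parent = [0, 0, 0, 0, 1, 1, 1, 1]
left, right = [0, 0, 0], [0, 1, 1, 1, 1]
print(split_importance(parent, left, right, n_total=8))
```

A feature's total importance is the sum of these contributions over all the nodes where the tree splits on it, normalized over all features.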

[Figure significance: Gini importance of the cumulative features, sorted from most to least important]

To confirm this intuition, we perform a feature selection process, by selecting the three most important features. A new Decision Tree model has been trained and tested, and the results (averaged over the 5 folds) are reported in Table feat_selection_results. The selected features are the same for both models, although they appear in a different order of importance. For the sake of clarity, we also report in Table feat_selection_results the results already presented in Table reviews_balanced_013_rating, related to the Decision Tree models without feature selection. From their comparison, we notice that the feature selection process actually contributes to slightly improving the performances. In particular, for the Decision Tree model with max depth=5, this improvement is more evident, since all the metrics reach higher values when the feature selection is applied. On the other hand, for the Decision Tree model with max depth=10, there is a small improvement in the precision and the F1-score of the positive (fake-0) class, whereas the recall for the negative (non fake-1) class is a bit lower with the feature selection applied. Nevertheless, the global performances are better, since the MCC value is higher.

Algorithm Selected Features TraAcc TstAcc Prec-0 Rec-0 F1-0 Prec-1 Rec-1 F1-1 MCC AUC
DT(Dep=10) c.rev.count, c.useful votes, c.rev.exp. 0.99 0.98 0.85 1.00 0.92 1.00 0.98 0.99 0.92 0.99
DT(Dep=5) c.rev.count, c.rev.exp., c.useful votes 0.97 0.95 0.70 1.00 0.82 1.00 0.94 0.97 0.81 0.97
DT(Dep=10) all 0.99 0.98 0.83 1.00 0.91 1.00 0.97 0.99 0.90 0.99
DT(Dep=5) all 0.96 0.93 0.64 1.00 0.78 1.00 0.92 0.96 0.77 0.96

Discussion on Feature Enhancement

In this section, we explain at a conceptual level why using the cumulative relative frequencies works better than the simple values. In data pre-processing, the normalization task brings the feature values to a common scale, to avoid that differences in the data ranges distort the results. Min-max normalization is one of the most used methods and consists in re-scaling the values of a feature into the range [0, 1]. However, min-max normalization suffers from outliers, because a single large outlier is enough to flatten all the other input values. Another normalization method, the Z-score, transforms the feature values in such a way that they have zero mean. This method avoids the outlier issue, but does not produce data on the same scale. Several probability distributions are affected by the presence of outliers. One of them is the Zipfian distribution, a discrete power-law probability distribution in which the frequency of an input is inversely proportional to its rank in the frequency table. That is, the most frequent input occurs about twice as often as the second most frequent one, three times as often as the third most frequent one, and so on. For example, consider the number of friends in a social network: normally, most users have very few friends (usually, 0 is the most frequent value), whereas very few users have a very high number of friends. In these cases, the performance of min-max normalization worsens due to outliers, whereas the Z-score does not properly scale the input data.
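The outlier sensitivity of min-max scaling is easy to demonstrate (a toy illustration with a hypothetical friend-count feature, not data from our experiments):

```python
def min_max(values):
    """Min-max normalization into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Zipf-like friend counts: mostly small values plus one huge outlier.
friends = [0, 0, 0, 1, 1, 2, 3, 5, 8, 10000]
scaled = min_max(friends)
print(scaled[:9])  # every ordinary value is squashed below 0.001
```

A single user with 10,000 friends pushes all the other normalized values into a tiny interval near zero, so the feature loses almost all its discriminative power.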

Our approach allows us to scale the value of a feature into the desired range [0, 1] using its rank in the frequency table, in a way similar to the Zipfian distribution. This transformation does not suffer from outliers: indeed, the normalized value of a feature does not depend on the magnitude of an outlier. Moreover, the normalized values are better distributed. Consider again the example of the number of friends in a social network presented above: the normalized feature $ v $ of a user with no friends is equal to zero using any standard normalization method. In our approach, $ v=\frac{n_0}{n} $ , which is the fraction of users with no friends. By examining the features used in our experiments, we found that the most important ones follow a Zipfian distribution. This result confirms the findings of other research works that observed a power-law distribution along many social network dimensions. We expect our strategy to perform well in scenarios in which input data are distributed according to a power-law probability distribution, because our feature normalization is directly derived from the power-law distribution. Thus, as a practical recommendation for a classification task, we suggest identifying the probability distribution of each feature: in case such a distribution follows the power law, re-engineering the feature by considering the Cumulative Relative Frequency Distribution should improve the overall accuracy of the classifier. Considering that, in many fields, such as the physical and social sciences, data follow a Zipfian distribution, we conclude that our normalization strategy can be effective in many contexts.
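A minimal sketch of this transformation: each raw value x is replaced by the fraction of samples whose value is at most x, i.e., the empirical cumulative relative frequency (helper names are ours; this assumes the transform is the empirical CDF described in the text):

```python
from bisect import bisect_right

def cumulative_relative_frequency(values):
    """Map each value to the fraction of samples <= that value (empirical CDF)."""
    ordered = sorted(values)
    n = len(values)
    return [bisect_right(ordered, v) / n for v in values]

# Same toy friend counts as in the min-max example above.
friends = [0, 0, 0, 1, 1, 2, 3, 5, 8, 10000]
crf = cumulative_relative_frequency(friends)
print(crf)  # zeros map to 0.3 (the share of users with no friends); the outlier maps to 1.0
```

The outlier contributes only its rank, not its magnitude, and the transformed values spread evenly across [0, 1].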

Conclusions

User opinions are an important information source, which can help a customer and a vendor to evaluate the pros and cons of a purchase when they interact. Given the importance of this role, unfair opinions may be written to promote one's own products or to disparage those of competitors. The challenge of detecting unfair opinions has long attracted the scientific community, and one of the most promising approaches to this problem is based on the use of supervised classifiers, which have proven to be highly effective. In this paper, we tried to further improve their effectiveness, not by proposing changes to the well-tested state-of-the-art algorithms, but only by modifying the input used in the training phase of the supervised classifiers. Specifically, we considered eight features widely used to detect opinion spam and pre-processed them by considering the cumulative relative frequency distribution. To demonstrate the effectiveness of our proposal, we extracted a dataset from Yelp.com and measured the performances of the six most used classifiers in detecting opinion spam, both in their standard use and when our proposal is adopted. The results of this comparison show that the use of the cumulative relative frequency distribution improves the performance of the state-of-the-art classifiers. As future work, we intend to extend our proposal to detect not only individual spammers, but also groups of users who, acting in a coordinated and synchronized way, aim to promote or discredit a product (or a service). The idea is that, once an ensemble of malicious reviewers is detected, the overlap among the products that these reviewers have evaluated can be searched for. Groups of users with a large overlap (i.e., who reviewed the same products) could be colluders.
