Experience: Improving Opinion Spam Detection by Cumulative Relative Frequency Distribution



Over the last few years, online reviews have become very important, since they can influence the purchase decisions of consumers and the reputation of businesses; therefore, the practice of writing fake reviews can have severe consequences for customers and service providers. Various approaches have been proposed for detecting opinion spam in online reviews, especially ones based on supervised classifiers. In this contribution, we start from a set of effective features used for classifying opinion spam and we re-engineer them by considering the Cumulative Relative Frequency Distribution of each feature. Through an experimental evaluation carried out on real data from Yelp.com, we show that the use of the distributional features improves the performance of the classifiers.




As illustrated in a recent survey on the history of Digital Spam, the Social Web has led not only to a participatory, interactive nature of the Web experience, but also to the proliferation of new and widespread forms of spam, among which the most notorious are fake news and spam reviews, also known as opinion spam. This results in the diffusion of different kinds of disinformation and misinformation, where misinformation refers to inaccuracies that may even originate from acting in good faith, while disinformation is false information deliberately spread to deceive. Over the last few years, online reviews have become very important, since they reflect the customers' experience with a product or service and, nowadays, they constitute the basis on which the reputation of an organization is built. Unfortunately, the confidence in such reviews is often misplaced, since spammers are tempted to write fake information in exchange for some reward, or to mislead consumers in order to obtain business advantages. The practice of writing false reviews is not only morally deplorable, as it is misleading for customers and damaging for service providers, but it is also punishable by law. Considering both the longevity and the spread of the phenomenon, scholars have for years investigated various approaches to opinion spam detection, mainly based on supervised or unsupervised learning algorithms. Further approaches are based on Multi-Criteria Decision Making. Machine-learning approaches rely on input data to build a mathematical model in order to make predictions or decisions. To this aim, data are usually represented by a set of features, which are structured and ideally fully representative of the phenomenon being modeled. An effective feature engineering process, i.e., the process through which an analyst uses domain knowledge of the data under investigation to prepare appropriate features, is a critical and time-consuming task.
However, if done correctly, feature engineering increases the predictive power of algorithms by facilitating the machine learning process.



In this paper, we do not aim to contribute by defining novel features suitable for fake review detection; rather, starting from features that have been proven very effective by Academia, we re-engineer them by considering the distribution of the occurrences of the feature values in the dataset under analysis. In particular, we focus on the Cumulative Relative Frequency Distribution of a set of basic features already employed for the task of fake review detection. We compute this distribution for each feature and substitute each feature value with the corresponding value of the distribution. To demonstrate the effectiveness of the proposed approach, both the basic features and the distributional ones have been exploited to train several supervised machine-learning classifiers, and the obtained results have been compared. To the best of the authors' knowledge, this is the first time that the Cumulative Relative Frequency Distribution of a set of features has been considered for unveiling fake reviews. The experimental results show that the distributional features improve the performance of the classifiers, at the mere cost of a small computational surplus in the feature engineering phase. The rest of the paper is organized as follows. The next section revises related work in the area. The section on Feature Engineering describes the process of feature engineering. We then present the experimental setup, while the section on Evaluation and Discussion reports the results of the comparison among the classification algorithms; moreover, in that section, we assess the importance of the distributional features and discuss the benefits brought by their adoption. Finally, the last section concludes the paper.


Social Media represent the perfect means for everyone to spread content in the form of User-Generated Content (UGC), almost without any traditional form of trusted control. For years, Academia, Industry, and platform administrators have been striving to develop automatic solutions to raise users' awareness about the credibility of the news they read online. One of the contexts in which the problem of credibility assessment is receiving the most interest is spam - or fake - review detection. The existence of spam reviews has been known since the early 2000s, when e-commerce and e-advice sites began to be popular. In his seminal work, Liu lists three approaches to automatically identify opinion spam: the supervised, unsupervised, and group approaches. In a standard supervised approach, a ground truth of a priori known genuine and fake reviews is needed. Then, features about the labeled reviews, the reviewers, and the reviewed products are engineered. The first models built on such features achieved good results with common algorithms such as Naive Bayes and Support Vector Machines. As usual, a supervised approach is particularly challenging since it requires the existence of labeled data, that is, in our scenario, a set of reviews with prior knowledge about their (un)trustworthiness. To overcome the frequent issue of lack of labeled data, in the very first phases of investigation in this field, the work by Jindal et al. exploited the fact that a common practice of fraudulent reviewers was to post almost duplicate reviews: reviews with similar texts were collected as fake instances. As shown in the literature, linguistic features have been proven valid for fake review detection, particularly in the early advent of this phenomenon. Indeed, pioneer fake reviewers exhibited precise stylistic features in their texts, such as a marked use of short terms and expressions of positive feelings.
Anomaly detection has also been widely employed in this field: an analysis of anomalous practices with respect to the average behavior of a genuine reviewer has led to good results. Anomalous behavior of a reviewer may be related to general and early rating deviation, as highlighted by Liu, or to temporal dynamics (see Xie et al.). Going further with the useful methodologies, human annotators, possibly recruited from crowd-sourcing services like Amazon Mechanical Turk, have also been employed, both 1) to manually label review sets in order to separate fake from non-fake reviews (e.g., see the very recent survey by Crawford et al.) and 2) to write intentionally false reviews, in order to test the accuracy of existing predictive models on such a set of ad hoc crafted reviews, as nicely reproduced by Ott et al. Recently, an interesting point of view has been offered by Cocarascu and Toni: deception is analysed based on contextual information derivable from review texts, not in a standard way, e.g., by considering linguistic features, but by evaluating the influence and interactions that one text has on the others. The new feature, based on bipolar argumentation on the same review, has been shown to outperform more traditional features when used in standard supervised classifiers, even on small datasets.



Supervised learning algorithms usually need diverse examples - and the values of diverse features derived from such examples - for an accurate training phase. Wang et al. investigated the 'cold-start' problem: the identification of a fake review when a new reviewer posts their first review. Lacking enough data about the stylistic features of the review and the behavioral characteristics of the reviewer, the authors first find similarities between the review text under investigation and other review texts. Then, they consider similar behavior between the reviewer under investigation and the reviewers who posted the identified reviews. A model based on neural networks proves to be effective in approaching the problem of lack of data in cold-start scenarios. Although many years have passed and, as we will briefly see later, the problem has been addressed in many research works with different techniques, automatically detecting a false review is an issue not completely solved yet, as stated in the recent survey by Wu et al. This inspiring work examines the phenomenon not only by giving an overview of the various detection techniques used over time, but also by proposing twenty future research questions. Notably, to help scholars find suitable datasets for a supervised classification task, this survey lists the currently available review datasets and their characteristics. A similar work by Hussain et al., aimed at a comparison of different approaches, focuses on the performances obtained by different classification frameworks. The authors also carried out a relevance analysis of six different behavioral features of reviewers. Weighting the features with respect to their relevance, a classification over a baseline dataset obtains an 84.5% accuracy. A quite novel work considers the unveiling of malicious reviewers by exploiting the notion of `neighborhood of suspiciousness'. Kaghazgaran et al.
proposed a system called TwoFace that, starting from identifiable reviewers paid by crowd-sourcing platforms to write fake reviews on well-known e-commerce platforms such as Amazon, studies the similarity between these and other reviewers, based, e.g., on the reviewed products, and shows how it is possible to spot organized fake review campaigns even when the reviewers alternate genuine and malicious behaviors. Serra et al. developed a supervised approach whose task is to differentiate amongst different kinds of reviewers, from fraudulent, to uninformative, to reliable. Leveraging a supervised classification approach based on a deep recurrent neural network, the system achieves notable performances over a real dataset with a priori knowledge of the fraudulent reviewers. The research work recalled so far relies on supervised learning. However, unsupervised techniques have been employed too, since they are very useful when no tagged data is available.


Fake reviewers' coordination can emerge by mining frequent behavioral patterns and ranking the most suspicious ones. A pioneering work first identifies groups of reviewers that reviewed the same set of products; then, the authors compute and aggregate an ensemble of anomaly scores (e.g., based on similarity amongst reviews and on the times at which the reviews have been posted): the scores are ultimately used to tag the reviewers as colluding or not. Another interesting approach to the analysis of colluding users checks whether a given group of accounts (e.g., reviewers on ) contains a subset of malicious accounts. The intuition behind this methodology is that the statistical distribution of reputation scores (e.g., number of friends and followers) of the accounts participating in a tampered computation significantly diverges from that of untampered ones. We close this section by referring back to the division made by Liu about supervised, unsupervised, and group approaches to spot fake reviewers and/or reviews. As noted in the literature, these are classification methods, mostly aiming at classifying information items in a binary or multiple way (i.e., credible vs non-credible) through the evaluation of a series of credibility features extracted from the data. Notably, approaches based on some prior domain knowledge are promising in providing a ranking of the information item (i.e., in our scenario, of the review) with respect to credibility. This is the case of recent work by Pasi et al., which exploits a Multi-Criteria Decision Making approach to assess the credibility of a review. In this context, a given review, seen as an alternative amongst others, is evaluated with respect to some credibility criteria. An overall credibility estimate of the review is then obtained by means of a suitable model-driven approach based on aggregation operators.
This approach also has the advantage of assessing the contribution that single or interacting criteria/features have in the final ranking. The techniques presented above have their pros and cons and, depending on the particular context, one approach may be preferred over another. The most relevant contribution of our proposal with respect to the state of the art is to improve the effectiveness of the solutions based on supervised classifiers, which, as seen above, are a well-known and widely-used approach in this context.


Feature Engineering

In this section, we introduce a subset of features that have been adopted in past work to detect opinion spam, and we propose how to modify them in order to improve the performance of classifiers. We emphasize that the listed features have been used effectively for this task by past researchers, and we give below the rationale for their use in the context of unveiling fake reviews. Finally, it is worth noting that the list of selected features is not intended to be exhaustive.



Basic Features

Following a supervised classification approach, the selection of the most appropriate features plays a crucial role, since they may considerably affect the performance of the machine learning models constructed from them. Features can be review-centric or reviewer-centric: the former refer to the review, while the latter refer to the reviewer. In the literature, several reviewer-centric features have been investigated, such as the maximum number of reviews, the percentage of positive reviews, the average review length, and the reviewer rating deviation. According to the outcomes of several works proposed in the context of opinion spam detection, we focused on reviewer-centric features, which have been demonstrated to be more effective for the identification of fake reviews. Thus, we relied on a set of reviewer-centric features that have already been used proficiently in the literature for the detection of opinion spam in reviews. Specifically, we focused on the following:

  • {Photo Count}: This metric measures the number of pictures uploaded by a reviewer and is directly retrieved from the reviewer profile. Past work demonstrated the effectiveness of using photo count, together with other non-verbal features, for detecting fake reviews.
  • {Review Count}: It measures how many reviews have been posted by a reviewer on the platform. It has been shown that spammers and non-spammers exhibit different behavior regarding the number of reviews they post. In particular, spammers usually post more reviews, since they may get paid. This feature has also been investigated in further studies.
  • {Useful Votes}: The most popular online review platforms allow users to rank reviews as useful or not. This information can be retrieved from the reviewer profile, or computed by summing the total amount of useful votes received by a reviewer. This feature has already been exploited in the literature and has been demonstrated to be effective for opinion spam detection.
  • {Reviewer Expertise}: Past research highlights that reviewers with acquired expertise on the platform are less prone to cheat.




Particularly, Mukherjee et al. report that opinion spammers are usually not longtime members of a site. Genuine reviewers, instead, use their accounts from time to time to post reviews. Although this experimental evidence does not mean that no spammer can be a member of a review platform for a long time, the literature has considered it useful to exploit the activity freshness of an account for cheating detection. Reviewer Expertise has been defined by Zhang et al. as the number of days a reviewer has been a member of the platform (the original name was Membership Length).

  • {Average Gap}: The review gap is the time elapsed between two consecutive reviews. This feature was previously introduced in the seminal work by Mukherjee et al., under the name {\it Activity Window}, and successfully re-adopted for detecting both colluders (i.e., spammers acting with a coordinated strategy) and singleton reviewers (i.e., reviewers with just isolated posting behavior). In the cited work, the Activity Window feature has been proven highly discriminant for demarcating spammers and non-spammers: fake reviewers are likely to review in short bursts and are usually not longtime active members.



On a Yelp dataset where the benign or malicious nature of reviewers was known a priori, it has been proven that, by computing the difference between the timestamps of the last and first reviews of each reviewer, a majority (80%) of spammers were bounded by 2 months of activity, whereas the same percentage of non-spammers remained active for at least 10 months. We define the Average Gap feature as the average time, in days, elapsed between two consecutive reviews of the same reviewer:

$$ AG_i = \frac{1}{N_i - 1} \sum_{j=2}^{N_i} (T_{i,j} - T_{i,j-1}) $$

where $ AG_i $ is the Average Gap for the $ i $ -th user, $ N_i $ is the number of reviews written by the user, and $ T_{i,j} $ is the timestamp of the $ j $ -th review of the $ i $ -th user.

  • {Average Rating Deviation}: The rating deviation measures how far a reviewer's rating is from the average rating of a business. It has been observed that spammers are more prone to deviate from the average rating than genuine reviewers. However, a bad experience may also induce a genuine reviewer to deviate from the mean rating. The Average Rating Deviation is defined as follows:

$$ ARD_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \left| R_{i,j} - R_{B(j)} \right| $$

where $ ARD_i $ is the Average Rating Deviation of the $ i $ -th user, $ N_i $ is the number of reviews written by the user, $ R_{i,j} $ is the rating given by the $ i $ -th user in her/his $ j $ -th review, corresponding to the business $ B(j) $ , and $ R_{B(j)} $ is the average rating obtained by the business $ B(j) $ .

  • {First Review}: Spammers are usually paid to write reviews when a new product is placed on the market. This is due to the fact that early reviews have a great impact on consumers' opinions and, in turn, on sales, as pointed out in the literature. We compute the time elapsed between each review of a reviewer and the first review posted for the same business; then, we average the results over all the reviews. Specifically, the First Review value for reviewer $ i $ is given by:

$$ FRT_i = \frac{1}{N_i} \sum_{j=1}^{N_i} (T_{i,j} - F_{B(j)}) $$

where $ FRT_i $ is the First Review value of the $ i $ -th user, $ N_i $ is the number of reviews written by the user, $ T_{i,j} $ is the time at which the $ i $ -th user wrote the $ j $ -th review, and $ F_{B(j)} $ is the time at which the first review of the same business $ B(j) $ , corresponding to the $ j $ -th review, was posted.

  • {Reviewer Activity}: Several works pointed out that the more active a user is on the online platform, the more likely the user is genuine, in terms of contributing to knowledge sharing in a useful way. The usefulness of this feature was demonstrated several years ago. Since the early 2000s, surveys have been conducted on large communities of individuals, trying to understand what drives them to be active and useful on an online social platform in terms of sharing content. Results showed that people contribute their knowledge when they perceive that it enhances their reputation, when they have experience to share, and when they are structurally embedded in the network. The Activity feature expresses the number of days a user has been active and is computed as:
    $$ A_i = T_{i,L} - T_{i,0} $$
    where $ A_i $ is the activity (expressed in days) of the $ i $ -th user, $ T_{i,L} $ is the time of the last review of the $ i $ -th user and $ T_{i,0} $ is the time of the first review of the $ i $ -th user.
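The four behavioral features above (Average Gap, Average Rating Deviation, First Review, and Activity) can all be derived from timestamped review records. The following is a minimal sketch of such a computation; the record layout and function name are illustrative assumptions, not part of the original study:

```python
from collections import defaultdict

def behavioral_features(reviews, business_avg):
    """Compute Average Gap (AG), Average Rating Deviation (ARD),
    First Review (FRT) and Activity (A) per reviewer.

    reviews: list of dicts with keys 'reviewer', 'business',
             'rating', 'ts' (timestamp expressed in days).
    business_avg: dict mapping business -> average rating R_B.
    """
    by_reviewer = defaultdict(list)
    first_review_ts = {}  # F_B: earliest review timestamp per business
    for r in reviews:
        by_reviewer[r['reviewer']].append(r)
        b, t = r['business'], r['ts']
        first_review_ts[b] = min(first_review_ts.get(b, t), t)

    feats = {}
    for user, revs in by_reviewer.items():
        revs.sort(key=lambda r: r['ts'])
        ts = [r['ts'] for r in revs]
        n = len(revs)
        # AG: mean gap between consecutive reviews (0 for a single review)
        ag = (sum(ts[j] - ts[j - 1] for j in range(1, n)) / (n - 1)
              if n > 1 else 0.0)
        # ARD: mean absolute deviation from the business average rating
        ard = sum(abs(r['rating'] - business_avg[r['business']])
                  for r in revs) / n
        # FRT: mean delay after the first review of each reviewed business
        frt = sum(r['ts'] - first_review_ts[r['business']] for r in revs) / n
        # A: days between the reviewer's first and last review
        activity = ts[-1] - ts[0]
        feats[user] = {'AG': ag, 'ARD': ard, 'FRT': frt, 'A': activity}
    return feats
```

Each returned value corresponds term-by-term to the formulas above, with timestamps already expressed in days.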


From Basic to Cumulative Features

The features described so far have been used to train a machine learning algorithm to construct a classifier, in a supervised-learning fashion. In this work, we propose to build on the basic features, with a proper feature engineering process, to possibly assess an improvement of the classification performances. The proposed feature engineering process is based on the concept of Cumulative Relative Frequency Distribution. The Relative Frequency is a quantity that expresses how often an event occurs, divided by the number of all outcomes. It can be easily represented by a Relative Frequency table, constructed directly from the data by simply dividing the frequency of a value by the total number of values in the dataset. The Cumulative Relative Frequency is then calculated by adding each frequency from the Relative Frequency table to the sum of its predecessors. In practice, the Cumulative Relative Frequency indicates the percentage of elements in the dataset that lie at or below the current value. In this work, we modify each feature by using its Cumulative Relative Frequency Distribution.

In the following, we show an example of how to compute the Cumulative Relative Frequency Distribution. Let us consider the {Photo Count} feature and assume that, for each photo count value, the corresponding number of occurrences is the one reported in the second column of the table below. Thus, the second column reports the number of reviews associated with a reviewer who uploaded a given number of photos: in our example, there are 7,944 reviews whose reviewers have no photo associated. The third column reports the Relative Frequency, which is computed by dividing the number of occurrences by the total number of reviews. Finally, the fourth column reports the Cumulative Relative Frequency values, obtained by adding each Relative Frequency value to the sum of its predecessors.
In our proposal, the process described so far is carried out for each basic feature, and the Cumulative Relative Frequency values are used to train the classifier instead of the raw values. In practice, this means substituting each value in the first column of the table with the corresponding value in its fourth column.

Photo Count   Frequency (#Reviews)   Relative Freq. (%Reviews)   Cumulative Rel. Freq.
0             7944                   0.44                        0.44
1             2301                   0.13                        0.57
2             1756                   0.10                        0.67
3             1401                   0.08                        0.75
4              822                   0.04                        0.79
5             1382                   0.08                        0.87
6              550                   0.03                        0.90
7              347                   0.02                        0.92
8              780                   0.04                        0.96
9              342                   0.02                        0.98
10             460                   0.02                        1.00
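The transformation is straightforward to implement. The sketch below (illustrative, pure Python; the function name is our own) builds the cumulative distribution from a frequency table and substitutes each raw feature value with its distributional counterpart:

```python
def cumulative_relative_frequency(freq_table):
    """Map each feature value to its Cumulative Relative Frequency.

    freq_table: dict mapping feature value -> number of occurrences.
    Returns: dict mapping feature value -> cumulative relative frequency.
    """
    total = sum(freq_table.values())
    cumulative, running = {}, 0.0
    for value in sorted(freq_table):          # ascending feature values
        running += freq_table[value] / total  # add the relative frequency
        cumulative[value] = running
    return cumulative

# Photo-count frequencies from the example table.
photo_counts = {0: 7944, 1: 2301, 2: 1756, 3: 1401, 4: 822,
                5: 1382, 6: 550, 7: 347, 8: 780, 9: 342, 10: 460}
cdf = cumulative_relative_frequency(photo_counts)

# Substitute each raw feature value with the distributional one.
raw_values = [0, 3, 10]
distributional = [cdf[v] for v in raw_values]
```

The same mapping is computed once per basic feature and then applied to every record, which is the small feature-engineering surplus mentioned above.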



Experimental Setup

In this section, we describe the setting of the experiments conducted to evaluate the effectiveness of the proposed features. This is done by comparing the results obtained when using the basic features and the cumulative ones with the most widespread supervised machine learning algorithms.



Dataset Construction and Characteristics

The dataset used in this study is composed of 56,317 business reviews, 42,673 businesses (both restaurants and hotels), and 1,429 reviewers. It has been obtained by repopulating the YelpCHI dataset, which included 67,395 reviews of 201 hotels and restaurants written by 38,063 reviewers, each review tagged with a fake/non-fake label; the repopulation allowed us both to tag the reviewers and to obtain fresher data to work with. In the table below, we report summary statistics of the basic features for the repopulated dataset, whereas a dedicated figure presents the correlation matrix, which shows the correlation coefficients among the variables.

Feature                 Mean     Std. Dev.
photo count             170.9    911
review count            201.9    298.2
useful votes            502.7    2089.45
reviewer expertise      3664.7   579.8
avg gap                 55.2     110.6
avg rating deviation    0.01     0.06
first review            13.9     50.2
reviewer activity       2637.6   991.3




Data Labeling

One limitation of supervised classification approaches is the possible lack of labeled data. To overcome this problem, past work relied on Amazon Mechanical Turk workers to generate fake and genuine reviews. Nevertheless, later studies highlighted the limits of this approach, since workers may not always be effective in imitating the behavior of fake reviewers, and proposed instead to use the results of the Yelp classification algorithm for obtaining labeled data. The same approach has also been used in subsequent works, and the literature has recognized its effectiveness in detecting fake reviews. According to this result, we use the same approach to build a dataset of labeled examples (as described in the section on Dataset Construction and Characteristics).



Experimental Framework

We experiment with several supervised machine-learning algorithms, namely Logistic Regression (LogReg), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Decision Tree (DT), Naive Bayes (NB), and k-Nearest Neighbors (k-NN). The performances of the learning algorithms are evaluated by means of commonly used metrics. Specifically, we compute the Accuracy on the train (TraAcc) and test (TstAcc) sets, and the Precision (Pre), Recall (Rec), and F1-score (F1) for both classes. In addition, we compute the Matthews Correlation Coefficient (MCC) and the Area Under Curve (AUC), two evaluation metrics used to measure the quality of binary classifications that are useful to compare the performances of classification models. To ensure higher reliability of the results, we apply a Stratified k-Fold Cross-Validation approach, with $ k=5 $ . Cross-validation involves partitioning the dataset into $ k $ sets; a model is then trained using $ k-1 $ folds (the training set) and validated on the remaining part of the data (the validation set). To reduce variability, this process is repeated $ k $ times, using each partition only once for validation. The performance measures are obtained by averaging the validation results over all runs. The stratified approach ensures the preservation of the class frequencies in each training and validation fold. The experimental framework described so far is depicted in the corresponding figure. All the experiments have been developed in Python, with the support of the Scikit-learn library.
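The fold-assignment step of this scheme can be sketched as follows; this is an illustrative pure-Python version of stratification (in practice the setup corresponds to Scikit-learn's StratifiedKFold), and the function name is our own:

```python
from collections import defaultdict

def stratified_kfold_indices(labels, k=5):
    """Assign each sample index to one of k folds, preserving the
    relative class frequencies in every fold (stratification)."""
    folds = defaultdict(list)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    for y, idxs in by_class.items():
        # deal the samples of each class round-robin across the folds
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return [sorted(folds[i]) for i in range(k)]

# Each of the k runs trains on k-1 folds and validates on the held-out
# one; final metrics are averaged over the k validation results.
labels = [0] * 10 + [1] * 40   # imbalanced toy labels (fake vs genuine)
folds = stratified_kfold_indices(labels, k=5)
```

With an imbalanced dataset such as ours, stratification matters: a plain random split could leave a fold with almost no minority-class samples, distorting the per-class metrics.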




Evaluation and Discussion

In this section, we report the results obtained by applying the aforementioned algorithms to the detection of fake and genuine reviews, when using either the basic or the cumulative features. Due to the imbalanced nature of the dataset, we report the performance metrics both for the positive (fake, labeled 0) and the negative (non-fake, labeled 1) class. The two tables below report the results of the learning algorithms, obtained by using the basic and the cumulative features, respectively. The performance values are computed by averaging the results over 5 runs of the algorithms, according to the employed cross-validation scheme.

Algorithm TraAcc TstAcc Prec-0 Rec-0 F1-0 Prec-1 Rec-1 F1-1 MCC AUC
DT(Dep=10) 0.99 0.97 0.77 1.00 0.87 1.00 0.96 0.98 0.86 0.98
DT(Dep=5) 0.97 0.94 0.67 1.00 0.80 1.00 0.94 0.97 0.79 0.97
k-NN(k=5) 0.93 0.92 0.71 0.79 0.75 0.96 0.94 0.95 0.70 0.95
k-NN(k=10) 0.90 0.89 0.70 0.64 0.67 0.92 0.94 0.93 0.61 0.93
SVM(rbf) 0.90 0.88 0.82 0.47 0.60 0.88 0.97 0.93 0.56 0.92
LogReg 0.87 0.85 0.43 0.89 0.58 0.99 0.85 0.91 0.55 0.87
LDA 0.84 0.81 0.36 0.87 0.51 0.98 0.80 0.88 0.48 0.84
Naive Bayes 0.85 0.78 0.34 0.96 0.50 0.99 0.75 0.86 0.49 0.85
Algorithm TraAcc TstAcc Prec-0 Rec-0 F1-0 Prec-1 Rec-1 F1-1 MCC AUC
DT(Dep=10) 0.99 0.98 0.83 1.00 0.91 1.00 0.97 0.99 0.90 0.99
DT(Dep=5) 0.96 0.93 0.64 1.00 0.78 1.00 0.92 0.96 0.77 0.96
k-NN(k=5) 0.97 0.94 0.64 1.00 0.78 1.00 0.93 0.96 0.77 0.96
k-NN(k=10) 0.95 0.91 0.56 1.00 0.72 1.00 0.90 0.95 0.71 0.95
SVM(rbf) 0.92 0.90 0.53 0.94 0.68 0.99 0.89 0.94 0.66 0.92
LogReg 0.91 0.89 0.45 0.93 0.66 0.99 0.89 0.93 0.64 0.91
LDA 0.90 0.88 0.50 0.93 0.65 0.99 0.88 0.93 0.62 0.90
Naive Bayes 0.86 0.86 0.44 0.86 0.58 0.98 0.86 0.91 0.55 0.86

Table reviews_balanced_013_rating_simplef highlights that all the algorithms obtain very good precision on the negative class (tagged with 1), while the performances are worse when dealing with the positive (fake) class. This is due to the fact that the dataset is imbalanced, with a ratio between the classes of about 0.13. Among the experimented methods, the one that obtains the best precision on the minority class is the Support Vector Machine classifier (precision = 0.97). However, the best global performances are obtained by the Decision Tree classifier (max depth = 10), since in this case the MCC and the AUC reach the highest values (0.86 and 0.98, respectively). This occurs because the recall values obtained by the Decision Tree classifier are almost equal (majority class) or better (minority class) with respect to the ones obtained by the Support Vector Machine. We also experiment with a Decision Tree classifier with max depth = 5, to obtain a less complex model, but we notice that the performances on the minority class drop dramatically. Still regarding the precision on the minority class, all the remaining algorithms except the SVM are outperformed by the Decision Tree (max depth = 10).




However, all of them, except the SVM, perform quite well with respect to the recall. This implies that the algorithms work well in finding the fake instances, but there are also many false positives, i.e., genuine reviews erroneously classified as fake. The results described so far and presented in Table reviews_balanced_013_rating_simplef are generally worse than those shown in Table reviews_balanced_013_rating, where we report the results obtained by using the Cumulative Relative Frequency distributions. When using the cumulative features as well, the best approach is the Decision Tree (max depth = 10), which outperforms the other algorithms. By comparing the two tables, we notice that the results obtained by the k-NN classifiers with the basic features reach higher values in the precision of the minority class, but this improvement in precision is balanced by a lower recall. In practice, the fake reviews detected by these classifiers are indeed fake, but the classifiers also miss many actual fakes. To summarize the performance differences, we report in Table comparison_simple_cum a comparison between the MCC and the AUC values. In particular, for each metric, we report the corresponding value obtained by using the basic features, the one obtained by using the cumulative features, and the improvement percentage. Table comparison_simple_cum highlights that the MCC and the AUC values are always higher or equal when using the cumulative features, for all the algorithms except the Decision Tree (max depth = 5), since in this case the basic features lead to better MCC and AUC values.

Algorithm MCC AUC
Basic Cumulative Improv.(%) Basic Cumulative Improv.(%)
DT(Dep=10) 0.86 0.90 4.7 0.98 0.99 1.0
DT(Dep=5) 0.79 0.77 -2.5 0.97 0.96 -1.0
k-NN(k=5) 0.70 0.77 10.0 0.95 0.96 1.1
k-NN(k=10) 0.61 0.71 16.4 0.93 0.95 2.2
SVM(rbf) 0.56 0.66 17.9 0.92 0.92 0.0
LogReg 0.55 0.64 16.4 0.87 0.91 4.6
LDA 0.48 0.62 29.2 0.84 0.90 7.1
Naive Bayes 0.49 0.55 12.2 0.85 0.86 1.2
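The Improv.(%) column in the table above can be derived as the relative gain of the cumulative features over the basic ones. A minimal sketch, using a few of the MCC values from the table:

```python
# Relative improvement of the cumulative features over the basic ones,
# reproducing a few entries of the Improv.(%) column for the MCC metric.
mcc_basic = {"DT(Dep=10)": 0.86, "k-NN(k=5)": 0.70, "LDA": 0.48}
mcc_cumulative = {"DT(Dep=10)": 0.90, "k-NN(k=5)": 0.77, "LDA": 0.62}

def improvement_pct(basic: float, cumulative: float) -> float:
    """Relative improvement of `cumulative` over `basic`, in percent."""
    return round((cumulative - basic) / basic * 100, 1)

for algo in mcc_basic:
    print(algo, improvement_pct(mcc_basic[algo], mcc_cumulative[algo]))
```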

We finally remark that the process described in Section engineering represents a pre-processing step and can be performed only once, before applying several machine learning algorithms. We computed the time required to build the Cumulative Relative Frequency distribution for all the features in the pre-processing phase of the cross-validation process and obtained an average value of 155 ms, showing that the computational overhead of the proposed features is negligible.



Features Importance

The importance of features plays a significant role in a predictive model, since it provides insights into the data and into the model itself. Moreover, it forms the basis for dimensionality reduction and feature selection, which can sometimes improve the efficiency and the effectiveness of a predictive model. From the experiments carried out, the best algorithm turned out to be a Decision Tree model. Decision Tree algorithms offer importance scores based on the reduction of the criterion used to select the split points in the tree, such as Gini or entropy. In this case, the importance of a feature has been computed following the Gini importance, which counts the times a feature is used to split a node, weighted by the number of samples it splits. We computed the Gini importance in each fold of the cross-validation and averaged the results. In Figure significance, we report a graph showing the importance of each cumulative feature, from the most important to the least important. The first three features, namely the number of reviews posted on the platform, the number of useful votes received, and the membership length of a reviewer, are probably the most useful for this predictive model.
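The per-fold Gini-importance computation described above can be sketched as follows. The data and feature names here are illustrative placeholders, not the actual cumulative Yelp features.

```python
# Fit a Decision Tree in each cross-validation fold and average the
# Gini importances (feature_importances_) over the folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=4, random_state=0)
feature_names = [f"feat_{i}" for i in range(X.shape[1])]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
importances = []
for train_idx, _ in cv.split(X, y):
    tree = DecisionTreeClassifier(max_depth=10, random_state=0)
    tree.fit(X[train_idx], y[train_idx])
    importances.append(tree.feature_importances_)  # Gini importance per feature

mean_importance = np.mean(importances, axis=0)
# Rank features from the most to the least important.
ranking = sorted(zip(feature_names, mean_importance), key=lambda t: -t[1])
```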


To confirm this intuition, we perform a feature selection process, by selecting the three most important features. A new Decision Tree model has been trained and tested, and the results (averaged over 5 folds) are reported in Table feat_selection_results. The selected features are the same for both models, although they appear in a different order of importance. For the sake of clarity, we also report in Table feat_selection_results the results already presented in Table reviews_balanced_013_rating, related to the Decision Tree models without feature selection. From their comparison, we notice that the feature selection process actually contributes to slightly improving the performances. In particular, for the Decision Tree model with max depth = 5, this improvement is more evident, since all the metrics reach higher values when feature selection is applied. On the other hand, for the Decision Tree model with max depth = 10, there is a small improvement in the precision and the F1-score of the positive (fake, 0) class, whereas the recall for the negative (non-fake, 1) class is a bit lower when feature selection is applied. Nevertheless, the global performances are better, since the MCC value is higher.

Algorithm Selected Features TraAcc TstAcc Prec-0 Rec-0 F1-0 Prec-1 Rec-1 F1-1 MCC AUC
DT(Dep=10) c.rev.count, c.useful votes, c.rev.exp. 0.99 0.98 0.85 1.00 0.92 1.00 0.98 0.99 0.92 0.99
DT(Dep=5) c.rev.count, c.rev.exp., c.useful votes 0.97 0.95 0.70 1.00 0.82 1.00 0.94 0.97 0.81 0.97
DT(Dep=10) all 0.99 0.98 0.83 1.00 0.91 1.00 0.97 0.99 0.90 0.99
DT(Dep=5) all 0.96 0.93 0.64 1.00 0.78 1.00 0.92 0.96 0.77 0.96
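The top-3 feature-selection step described above can be sketched as follows; synthetic data again stands in for the cumulative features, and the ranking is taken from the fitted tree's Gini importances.

```python
# Rank features by Gini importance, keep the three best, and retrain
# the Decision Tree on the reduced feature set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=3, random_state=0)

# Fit on all features to obtain the importance ranking.
full_tree = DecisionTreeClassifier(max_depth=10, random_state=0).fit(X, y)
top3 = np.argsort(full_tree.feature_importances_)[-3:]  # 3 most important

# Retrain using only the selected features.
reduced_tree = DecisionTreeClassifier(max_depth=10, random_state=0)
reduced_tree.fit(X[:, top3], y)
```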




Discussion on Feature Enhancement

In this section, we explain at a conceptual level why using the cumulative relative frequencies works better than the simple values. In data pre-processing, the normalization task changes the feature values to a common scale, to avoid that differences in the data ranges distort the results. Min-max normalization is one of the most used methods and consists in re-scaling the feature values to the range [0, 1]. However, min-max normalization suffers from outliers, because a single large outlier is sufficient to flatten all the other input values. Another normalization method, the Z-score, transforms the feature values in such a way that they have zero mean. This method avoids the outlier issue, but does not produce data on the same scale. Several probability distributions are affected by the presence of outliers. One of them is the Zipfian distribution, a discrete power-law probability distribution in which the frequency of an input is inversely proportional to its rank in the frequency table. That is, the most frequent input occurs twice as often as the second most frequent one, three times as often as the third most frequent one, and so on. For example, consider the number of friends in a social network: normally, most users have very few friends (usually, 0 is the most frequent value), whereas very few users have a very high number of friends. In these cases, the performance of min-max normalization degrades due to outliers, whereas the Z-score does not properly scale the input data.
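The outlier problem discussed above can be illustrated numerically. The sample (friend counts with one extreme value) is made up for the example:

```python
# A single large value flattens min-max-normalized data, while z-scores
# are centered but not confined to a common [0, 1] scale.
import numpy as np

friends = np.array([0, 0, 0, 1, 2, 3, 5, 10_000], dtype=float)  # one outlier

# Min-max: the outlier pushes every ordinary value close to 0.
min_max = (friends - friends.min()) / (friends.max() - friends.min())

# Z-score: zero mean, but the values are not bounded to [0, 1].
z_score = (friends - friends.mean()) / friends.std()

print(min_max[:7])    # all ordinary users collapse near 0
print(z_score.min(), z_score.max())
```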



Our approach allows us to scale the value of a feature into the desired range [0, 1] using its rank in the frequency table, in a way similar to the Zipfian distribution. This transformation does not suffer from outliers: indeed, the normalized value of a feature does not depend on the magnitude of an outlier. Moreover, the normalized values are better distributed. Consider again the example of the number of friends in a social network presented above: the normalized feature $ v $ of a user with no friends is equal to zero with any standard normalization method. In our approach, $ v=\frac{n_0}{n} $ , which is the fraction of users with no friends. By examining the features used in our experiments, we found that the most important ones follow a Zipfian distribution. This result confirms the findings of other research works, such as , which observed a power-law distribution along many social network dimensions. We expect our strategy to perform well in scenarios in which the input data are distributed according to a power-law probability distribution, because our feature normalization is directly derived from the power-law distribution. Thus, as a practical recommendation for a classification task, we suggest identifying the probability distribution of each feature: if such a distribution follows a power law, then re-engineering the feature by considering the Cumulative Relative Frequency Distribution should improve the overall accuracy of the classifier. Considering that, in many fields, such as the physical and social sciences, data follow a Zipfian distribution, we conclude that our normalization strategy can be effective in many contexts.
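A minimal sketch of the transformation described above: each raw value $v$ is replaced by the fraction of samples whose value is less than or equal to $v$, so the result lies in [0, 1] and is insensitive to the magnitude of outliers. The function name is illustrative.

```python
# Cumulative Relative Frequency Distribution (CRFD) transform: map each
# value to the fraction of samples with a value <= it.
import numpy as np

def crfd_transform(values):
    values = np.asarray(values, dtype=float)
    n = len(values)
    sorted_vals = np.sort(values)
    # For each v, count how many samples have a value <= v.
    ranks = np.searchsorted(sorted_vals, values, side="right")
    return ranks / n

friends = [0, 0, 0, 1, 2, 3, 5, 10_000]
normalized = crfd_transform(friends)
# A user with 0 friends maps to n_0 / n = 3/8, not to 0 as with min-max,
# and the outlier's magnitude does not affect the other values.
```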



Conclusions

User opinions are an important information source that can help customers and vendors evaluate the pros and cons of a transaction when they interact. Given the importance of opinions, there is the possibility that unfair opinions are used to promote one's own products or to disparage the products of competitors. The challenge of detecting unfair opinions has long attracted the scientific community, and one of the most promising approaches to this problem is based on supervised classifiers, which have proven to be highly effective. In this paper, we tried to further improve their effectiveness, not by proposing changes to the well-tested state-of-the-art algorithms, but only by modifying the input used in the training phase to construct the supervised classifiers. Specifically, we considered eight features widely used to detect opinion spam and pre-processed them by considering the cumulative relative frequency distribution. To demonstrate the effectiveness of our proposal, we extracted a dataset from Yelp.com and measured the performances of the six most used classifiers in detecting opinion spam, both in their standard use and when our proposal is adopted. The results of this comparison show that the use of the cumulative relative frequency distribution improves the performance of the state-of-the-art classifiers. As future work, we intend to extend our proposal to detect not only individual spammers, but also groups of users who, acting in a coordinated and synchronized way, aim to give credit to or discredit a product (or a service). The idea is that, once an ensemble of malicious reviewers is detected, the overlap between the products that the malicious reviewers have evaluated is computed. Groups of users with a large overlap (i.e., who reviewed the same products) could be colluders.