

What is Data Mining?
Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery in Databases”, or KDD. Others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps:

· data cleaning: to remove noise or irrelevant data,
· data integration: where multiple data sources may be combined,
· data selection: where data relevant to the analysis task are retrieved from the database,
· data transformation: where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations,
· data mining: an essential process where intelligent methods are applied in order to extract data patterns,
· pattern evaluation: to identify the truly interesting patterns representing knowledge, based on some interestingness measures, and
· knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.

The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation.

We agree that data mining is a step in the knowledge discovery process. However, in industry, in the media, and in the database research milieu, the term “data mining” is becoming more popular than the longer term “knowledge discovery in databases”. Therefore, in this book, we choose to use the term “data mining”. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
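The sequence of steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions, not a method from the text: the record layout (`region`, `items`), the choice of co-occurring item pairs as the "pattern", and the use of support as the interestingness measure are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def clean(records):
    """Data cleaning: drop records whose item list is missing or empty."""
    return [r for r in records if r.get("items")]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, region):
    """Data selection: keep only the records relevant to the task."""
    return [r for r in records if r["region"] == region]


def transform(records):
    """Data transformation: reduce each record to a set of items."""
    return [frozenset(r["items"]) for r in records]


def mine(transactions):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(t), 2))
    return counts


def evaluate(counts, n_transactions, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: c / n_transactions
            for pair, c in counts.items()
            if c / n_transactions >= min_support}


def present(patterns):
    """Knowledge presentation: one line per pattern, most frequent first."""
    ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} & {b}: support {s:.2f}" for (a, b), s in ranked]


def kdd(sources, region, min_support):
    """Run the steps in the order listed in the text."""
    records = select(integrate(*(clean(s) for s in sources)), region)
    transactions = transform(records)
    if not transactions:
        return []
    counts = mine(transactions)
    return present(evaluate(counts, len(transactions), min_support))


# Hypothetical data from two sources.
store_a = [
    {"region": "north", "items": ["milk", "bread"]},
    {"region": "north", "items": []},          # noise, removed by clean()
    {"region": "south", "items": ["milk", "tea"]},
]
store_b = [
    {"region": "north", "items": ["milk", "bread", "eggs"]},
    {"region": "north", "items": ["bread", "eggs"]},
]

report = kdd([store_a, store_b], region="north", min_support=0.5)
```

In a real system each step would be far more involved, and the process is iterative: the user may inspect `report`, adjust the selection or the threshold, and run the steps again.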
Based on this view, the architecture of a typical data mining system may have the following major components:

1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user’s data mining request.

3. Knowledge base. This is the domain knowledge that is used to guide the search, or to evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern’s interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, and evolution and deviation analysis.

5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process, so as to confine the search to only the interesting patterns.

6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding.

While there may be many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases, should be more appropriately categorized as a database system, an information retrieval system, or a deductive database system.

Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient and scalable data mining techniques for large databases.

By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Therefore, data mining is considered one of the most important frontiers in database systems and one of the most promising new database applications in the information industry.

A classification of data mining systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing.
Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.

1) Classification according to the kinds of databases mined. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.

2) Classification according to the kinds of knowledge mined. Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, and similarity analysis. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities. Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.

3) Classification according to the kinds of techniques utilized. Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.

Data mining systems can also be classified as mining data regularities (commonly occurring patterns) or data irregularities (such as exceptions or outliers). In general, concept description, association analysis, classification, prediction, and clustering mine data regularities and reject outliers as noise, although these methods can also help to detect outliers.

Scalability is meant here in the following sense: an algorithm is scalable if, given the available system resources such as main memory and disk space, its running time grows linearly with the size of the database.

Data Mining and Data Publishing
Data mining is the extraction of interesting patterns or knowledge from huge amounts of data. The initial idea of privacy-preserving data mining (PPDM) was to extend traditional data mining techniques to work with data that have been modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. Privacy-preserving data mining considers the problem of running data mining algorithms on confidential data that is not supposed to be revealed even to the party running the algorithm. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task may be unknown at the time of data publishing. PPDP studies how to transform raw data into a version that is immunized against privacy attacks but that still supports effective data mining tasks.

Privacy preservation for both data mining (PPDM) and data publishing (PPDP) has become increasingly popular because it allows privacy-sensitive data to be shared for analysis purposes. One well-studied approach is the k-anonymity model [1], which in turn led to other models such as confidence bounding, l-diversity, t-closeness, and (α,k)-anonymity. In particular, all known mechanisms try to minimize information loss, and such an attempt provides a loophole for attacks. The aim of this paper is to present a survey of the most common attack techniques for anonymization-based PPDM and PPDP and to explain their effects on data privacy.

Although data mining is potentially useful, many data holders are reluctant to provide their data for data mining for fear of violating individual privacy. In recent years, studies have been made to ensure that the sensitive information of individuals cannot be identified easily.

Anonymity Models

k-anonymization techniques have been the focus of intense research in the last few years.
In order to ensure anonymization of data while at the same time minimizing the information loss resulting from data modifications, several extending models have been proposed, which are discussed as follows.

1. k-Anonymity

k-anonymity is one of the most classic models. It is a technique that prevents joining attacks by generalizing and/or suppressing portions of the released microdata so that no individual can be uniquely distinguished from a group of size k. A data set is k-anonymous (k ≥ 1) if each record in the data set is indistinguishable from at least (k − 1) other records within the same data set. The larger the value of k, the better the privacy is protected. k-anonymity can ensure that individuals cannot be uniquely identified by linking attacks.

2. Extending Models

k-anonymity does not provide sufficient protection against attribute disclosure. The notion of l-diversity attempts to solve this problem by requiring that each equivalence class has at least l well-represented values for each sensitive attribute. l-diversity has some advantages over k-anonymity, because a k-anonymous data set permits strong attacks when the sensitive attributes lack diversity. In this model, an equivalence class is said to have l-diversity if there are at least l well-represented values for the sensitive attribute. Other models take into account that there are semantic relationships among the attribute values and that different values have very different levels of sensitivity. In (α,k)-anonymity, after anonymization, the frequency (as a fraction) of a sensitive value in any equivalence class is no more than α.

3. Related Research Areas

Several polls show that the public has an increased sense of privacy loss. Since data mining is often a key component of information systems, homeland security systems, and monitoring and surveillance systems, it gives a wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to the benefit of the technology. For example, the potentially beneficial data mining research project, Terrorism Information Awareness (TIA), was terminated by the US Congress due to its controversial procedures of collecting, sharing, and analyzing the trails left by individuals. Motivated by the privacy concerns on data mining tools, a research area called privacy-preserving data mining (PPDM) emerged in 2000. The initial idea of PPDM was to extend traditional data mining techniques to work with data that have been modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data.
The solutions were often tightly coupled with the data mining algorithms under consideration. In contrast, privacy-preserving data publishing (PPDP) may not necessarily be tied to a specific data mining task, and the data mining task is sometimes unknown at the time of data publishing. Furthermore, some PPDP solutions emphasize preserving the data truthfulness at the record level, but PPDM solutions often do not preserve such a property.

PPDP differs from PPDM in several major ways, as follows:

1) PPDP focuses on techniques for publishing data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied to the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.

2) Neither randomization nor encryption preserves the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data.

3) PPDP primarily “anonymizes” the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data.

Excellent surveys and books on randomization and cryptographic techniques for PPDM can be found in the existing literature.

A family of research work called privacy-preserving distributed data mining (PPDDM) aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC), and prohibits any data sharing other than the final data mining result. Clifton et al. present a suite of SMC operations, like secure sum, secure set union, secure size of set intersection, and scalar product, that are useful for many data mining tasks. In contrast, PPDP does not perform the actual data mining task, but is concerned with how to publish the data so that the anonymous data are useful for data mining. We can say that PPDP protects privacy at the data level while PPDDM protects privacy at the process level. They address different privacy models and data mining scenarios.

In the field of statistical disclosure control (SDC), the research works focus on privacy-preserving publishing methods for statistical tables. SDC focuses on three types of disclosures, namely identity disclosure, attribute disclosure, and inferential disclosure.
Identity disclosure occurs if an adversary can identify a respondent from the published data. Revealing that an individual is a respondent of a data collection may or may not violate confidentiality requirements. Attribute disclosure occurs when confidential information about a respondent is revealed and can be attributed to the respondent. Attribute disclosure is the primary concern of most statistical agencies in deciding whether to publish tabular data. Inferential disclosure occurs when individual information can be inferred with high confidence from statistical information of the published data.

Some other works of SDC focus on the study of the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there is a series of studies on the interactive query model, in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results. The database server is responsible for keeping track of all queries of each user and determining whether or not the currently received query has violated the privacy requirement with respect to all previous queries.

One limitation of any interactive privacy-preserving query system is that it can only answer a sublinear number of queries in total; otherwise, an adversary (or a group of corrupted data recipients) will be able to reconstruct all but 1 − o(1) fraction of the original data, which is a very strong violation of privacy. When the maximum number of queries is reached, the query service must be closed to avoid privacy leaks. In the case of the non-interactive query model, the adversary can issue only one query and, therefore, the non-interactive query model cannot achieve the same degree of privacy defined by the interactive model. One may consider that privacy-preserving data publishing is a special case of the non-interactive query model.

This paper presents a survey of the most common attack techniques for anonymization-based PPDM and PPDP and explains their effects on data privacy. k-anonymity is used to protect respondents’ identities and reduces linking attacks. In the case of a homogeneity attack, however, a simple k-anonymity model fails, and a concept that prevents this attack is needed; l-diversity is such a solution. All tuples are arranged in a well-represented form, so that the adversary is diverted to l places, that is, to l values of the sensitive attribute. l-diversity is limited in the case of a background knowledge attack, because no one can predict the knowledge level of an adversary.
It is observed that generalization and suppression are also applied to attributes that do not need this extent of privacy, which reduces the precision of the published table. e-NSTAM (extended Sensitive Tuples Anonymity Method) is applied to sensitive tuples only and reduces information loss, but this method also fails in the case of multiple sensitive tuples. Generalization with suppression is also a cause of data loss, because suppression withholds values that do not satisfy the k factor. Future work in this area can include defining a new privacy measure, along with l-diversity, for multiple sensitive attributes. We will also focus on generalizing attributes without suppression, using other techniques that achieve k-anonymity, because suppression reduces the precision of the published table.
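The k-anonymity condition and the homogeneity attack discussed above can be checked mechanically on a small table. The sketch below is illustrative only and rests on assumptions that are not in the text: the column names (`zip`, `age`, `disease`) and the sample rows are hypothetical, and “well-represented” is read in its simplest sense, as at least l distinct sensitive values per equivalence class.

```python
from collections import Counter, defaultdict


def equivalence_classes(table, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for row in table:
        key = tuple(row[q] for q in quasi_identifiers)
        classes[key].append(row)
    return classes


def is_k_anonymous(table, quasi_identifiers, k):
    """Each record must be indistinguishable from at least k - 1 others."""
    classes = equivalence_classes(table, quasi_identifiers)
    return all(len(rows) >= k for rows in classes.values())


def is_distinct_l_diverse(table, quasi_identifiers, sensitive, l):
    """Assumed reading: every class holds at least l distinct sensitive values."""
    classes = equivalence_classes(table, quasi_identifiers)
    return all(len({r[sensitive] for r in rows}) >= l
               for rows in classes.values())


def max_sensitive_frequency(table, quasi_identifiers, sensitive):
    """Largest fraction of one sensitive value in any class (compare with α)."""
    classes = equivalence_classes(table, quasi_identifiers)
    return max(
        (max(Counter(r[sensitive] for r in rows).values()) / len(rows)
         for rows in classes.values()),
        default=0.0,
    )


# Hypothetical generalized table: zip and age are the quasi-identifiers.
table = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "148**", "age": ">=40", "disease": "cancer"},
    {"zip": "148**", "age": ">=40", "disease": "flu"},
]
qi = ["zip", "age"]

two_anonymous = is_k_anonymous(table, qi, k=2)
two_diverse = is_distinct_l_diverse(table, qi, "disease", l=2)
worst_fraction = max_sensitive_frequency(table, qi, "disease")
```

Here the table is 2-anonymous, yet the first equivalence class contains only one disease value, so an adversary who can place a person in that class learns the disease without identifying the record. This is the homogeneity attack; the table fails the distinct 2-diversity check, and `worst_fraction` is 1.0.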





