Theory of Data Mining 

The structure of data mining system

A practical data mining system should have the following characteristics:
(1) the ability to respond quickly: the system can return information to users even when an operation takes a long time;
(2) the ability to handle large amounts of data, that is, realization methods with good time complexity;
(3) a friendly, interactive interface, with data output available in many forms;
(4) the ability to adapt automatically and to suggest better parameters and models, because users are usually not experts and do not know which model suits the data at hand.
(1) Database, data warehouse, or other information repository: this is one database or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration can be performed on these data.
(2) Database or data warehouse server: the database or data warehouse server is responsible for fetching the relevant data according to the user's data mining request.
(3) Knowledge base: the knowledge base stores the domain knowledge needed for data mining. This knowledge is used to guide the search process of data mining, or to help evaluate the mining results. The threshold values defined by users for the mining algorithm are the simplest form of domain knowledge.
(4) Data mining engine: this is the essential part of the data mining system. It consists of a set of functional modules for the analysis of characterization, association, classification, clustering, evolution, and deviation.
(5) Pattern evaluation module: this component typically employs interestingness measures and interacts with the data mining modules so as to focus the search on interesting patterns. It may use interestingness thresholds to filter the discovered patterns. The pattern evaluation module may also be integrated with the mining module, depending on how the data mining methods used are implemented.
(6) Graphical user interface: this module mediates between users and the data mining system. It allows users to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory mining based on intermediate mining results.
Besides, this component also allows users to browse databases and data warehouses or data structures, to evaluate mined patterns, and to visualize the patterns in different forms.

Functions of data mining

Data mining supports proactive, knowledge-based decisions by forecasting future trends. The aim of data mining is to discover hidden, meaningful knowledge in databases. It mainly offers the following five kinds of functions:
forecasting trends and behaviors automatically
Data mining automatically searches large databases for predictive information. Problems that formerly required extensive manual analysis can now be answered quickly and directly from the data themselves. A typical example is market forecasting: data mining uses data about past promotions to identify the customers likely to yield the largest return on future investment. Other problems that can be forecasted include predicting bankruptcy and identifying the groups most likely to respond to given events.
association analysis
Association analysis discovers an important kind of knowledge in databases. If there is some regularity between the values of two or more variables, it is called an association. Associations can be divided into simple associations, time-sequence associations, and cause-effect associations. The aim of association analysis is to uncover the association network hidden in the database. The association function of the data in the database is often unknown, and even when known it is uncertain, so the rules found by association analysis carry a degree of credibility.
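As a concrete illustration of how an association's credibility is measured, here is a minimal Python sketch computing the standard support and confidence measures over a small set of hypothetical transactions (the item names and figures are invented for illustration, not taken from the thesis):

```python
# Hypothetical toy transactions; support and confidence are the standard
# measures used to judge how credible an association rule is.
transactions = [
    {"beer", "diaper", "milk"},
    {"beer", "diaper"},
    {"milk", "bread"},
    {"beer", "diaper", "bread"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # Estimate of P(consequent | antecedent) over the transactions.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"beer", "diaper"}, transactions))       # 0.75
print(confidence({"beer"}, {"diaper"}, transactions))  # 1.0
```

A rule such as beer => diaper is then reported together with its support and confidence rather than as an absolute fact.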
clustering
The records in a database can be partitioned into a series of meaningful subsets; this is clustering. Clustering strengthens people's understanding of objective reality and is a precondition for concept description and deviation analysis. Early clustering technology consisted mainly of traditional pattern recognition methods and numerical taxonomy. In the early 1980s, Michalski put forward the concept of conceptual clustering, whose key point is that when partitioning objects one must consider not only the distances among them but also whether the resulting classes carry some kind of conceptual connotation, so that the one-sidedness of the traditional techniques can be avoided.
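The distance-based partitioning that conceptual clustering refines can be illustrated with the classic k-means procedure. The following is a minimal sketch over one-dimensional points; the data and parameter choices are illustrative only:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's k-means on 1-D points (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Move each center to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
print(kmeans(points, 2))  # two centers, near 1.0 and 10.0
```

Conceptual clustering would additionally ask whether each resulting cluster admits a meaningful description, which plain distance-based methods do not.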
concept description
Concept description describes the connotation of some class of objects and summarizes its relevant features. Concept description is divided into characteristic description and discriminant description. The former describes the common features of a class of objects, while the latter describes the distinctions among different classes of objects. A characteristic description of a class covers only the features common to all objects in that class. There are many methods of generating discriminant descriptions, such as decision tree methods, genetic methods, and so on.
deviation detection
Databases often contain abnormal data records, and detecting these deviations is meaningful. Deviations carry much potential knowledge, such as unusual instances, special cases that do not obey the rules, discrepancies between observed results and model-predicted values, changes of values over time, and so on. The basic method of deviation detection is to look for meaningful differences between observed results and a reference value.
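One simple instance of the "meaningful difference from a reference value" idea is flagging records whose z-score exceeds a threshold. A minimal sketch, with invented readings:

```python
import statistics

def deviations(values, threshold=2.0):
    """Flag values whose z-score exceeds the threshold (illustrative sketch)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 45]  # 45 is the anomalous record
print(deviations(readings))  # [45]
```

Real deviation detection may instead compare observations against a model's predictions, but the reference-value idea is the same.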

Methods used in data mining

rule induction
Rule induction extracts valuable if-then rules from data by statistical methods; association rule mining is one example.
the decision tree method
This method uses a tree structure to represent decision sets; these decision sets generate rules through the classification of the data collection. The decision tree method first uses an information measure to find the field in the database with the greatest information gain and establishes it as a node of the decision tree; then, according to the different values of that field, branches of the tree are established; within the subset of each branch, the lower-level nodes and branches are built recursively, thereby constructing the decision tree. The most influential international decision tree method is ID3 (Iterative Dichotomiser 3), developed by Quinlan, whose typical application is classification in data mining.
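The information measure that ID3 uses to pick the splitting field is information gain, the reduction in entropy achieved by a split. A minimal sketch over a hypothetical four-record table (field names and labels are invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label list, in bits.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, field):
    """Entropy reduction from splitting on one field (ID3's criterion)."""
    gain = entropy(labels)
    for value in set(r[field] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[field] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Hypothetical records: does 'outlook' predict 'play' better than 'windy'?
rows = [{"outlook": "sunny", "windy": True},
        {"outlook": "sunny", "windy": False},
        {"outlook": "rain",  "windy": True},
        {"outlook": "rain",  "windy": False}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, "outlook"))  # 1.0 (perfect split)
print(information_gain(rows, labels, "windy"))    # 0.0 (useless split)
```

ID3 would pick "outlook" as the root node here, then recurse into each branch.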
artificial neural networks
This method imitates the neuron structure of the human brain; it is a non-linear predictive model that learns through training. It can carry out many kinds of data mining tasks, such as classification, clustering, and feature rules. It builds on the MP (McCulloch-Pitts) model and the Hebbian learning rule to establish three kinds of network models: feed-forward networks, feedback networks, and self-organizing networks.
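The MP (McCulloch-Pitts) unit mentioned above is simple enough to state directly: it fires when the weighted sum of its inputs reaches a threshold. With weights (1, 1) and threshold 2, for instance, it computes logical AND (a minimal sketch, not a trainable network):

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fires when the weighted input sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

# With weights (1, 1) and threshold 2 this unit computes logical AND.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mp_neuron((a, b), (1, 1), 2))
```

Modern networks replace the hard threshold with smooth activations and learn the weights by training, but the weighted-sum-and-fire structure is the same.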
genetic algorithm
This is a kind of algorithm imitating the process of biological evolution, first put forward by Holland in the 1970s. It is a population-based iterative process with the features of random yet directed search. The process involves typical operators: gene recombination (crossover), mutation, and natural selection. A genetic algorithm acts on a population composed of many potential solutions to the problem, each individual in the population being represented by a code. Each individual is assigned a fitness value according to the objective function of the problem, which drives the genetic algorithm's search toward advantageous solutions.
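These operators can be sketched on the textbook OneMax problem (maximize the number of 1-bits in a string). The population size, generation count, and mutation rate below are arbitrary illustrative choices, not values from the thesis:

```python
import random

def genetic_onemax(length=20, pop_size=30, generations=60, seed=1):
    """Tiny genetic algorithm maximizing the number of 1-bits (OneMax).
    Shows selection, one-point crossover, bit-flip mutation, replacement."""
    rng = random.Random(seed)
    fitness = sum  # fitness of a bit list = number of 1-bits
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        for _ in range(pop_size):
            # Tournament selection of two parents (size-3 tournaments).
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, length)       # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(length):              # bit-flip mutation
                if rng.random() < 1 / length:
                    child[i] = 1 - child[i]
            nxt.append(child)
        pop = nxt                                # generational replacement
    return max(pop, key=fitness)

best = genetic_onemax()
print(sum(best))  # close to the optimum of 20
```

The encoding (a bit string), the fitness function, and the operator rates are exactly the design choices the paragraph above describes.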
fuzzy technology
Fuzzy technology makes use of fuzzy set theory to carry out fuzzy assessment, fuzzy decision-making, fuzzy pattern recognition, and fuzzy cluster analysis. This fuzziness exists objectively, and the higher the complexity of a system, the stronger its fuzziness. Building on traditional fuzzy theory and probability statistics, the cloud model, obtained by putting forward a qualitative-quantitative uncertainty transformation model, combines the fuzziness and randomness of concepts, providing data mining with a new method for concept and knowledge representation, qualitative-quantitative transformation, and synthesis and decomposition.
rough set method
This is a completely new data analysis method put forward by the Polish logician Pawlak. In recent years it has attracted great attention and been widely applied in fields such as machine learning and KDD. The rough set method is an effective approach to uncertain and imprecise problems in information systems. Its basic principle rests on the idea of equivalence classes, whose elements are regarded as indiscernible.
The basic procedure is: first, discretize the attribute values of the information system using rough approximation; then partition the objects into equivalence classes by each attribute and use these equivalences to simplify the information system; finally, obtain a minimal decision relation from which rules can conveniently be derived.
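The equivalence-class idea can be sketched directly: rows with identical values on the chosen attributes are indiscernible, and a target set is approximated from below and above by whole classes. The attribute names and rows below are invented for illustration:

```python
from collections import defaultdict

def equivalence_classes(rows, attributes):
    """Partition row indices into indiscernibility (equivalence) classes
    by the values of the chosen attributes."""
    classes = defaultdict(list)
    for i, row in enumerate(rows):
        classes[tuple(row[a] for a in attributes)].append(i)
    return list(classes.values())

def lower_upper(rows, attributes, target):
    """Rough-set lower and upper approximations of a target set of rows."""
    lower, upper = set(), set()
    for cls in equivalence_classes(rows, attributes):
        if set(cls) <= target:   # class entirely inside the target
            lower |= set(cls)
        if set(cls) & target:    # class overlapping the target
            upper |= set(cls)
    return lower, upper

rows = [{"temp": "high", "wind": "no"},
        {"temp": "high", "wind": "no"},
        {"temp": "low",  "wind": "yes"},
        {"temp": "low",  "wind": "no"}]
target = {0, 1, 2}  # rows we want to approximate using 'temp' alone
low, up = lower_upper(rows, ["temp"], target)
print(sorted(low), sorted(up))  # [0, 1] [0, 1, 2, 3]
```

The gap between the lower and upper approximations is exactly the imprecision the rough set method quantifies.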
visualization technology
This uses visual graphic forms to display information patterns, data associations, or trends to decision-makers, so that the decision-makers can analyze the data associations interactively. Visualization technology mainly covers three aspects: visualization of data, of models, and of process. Data visualization mainly uses histograms, box plots, and scatter plots; the concrete method of model visualization depends on the data mining algorithm (for example, a decision tree algorithm is displayed as a tree), while process visualization uses data-flow diagrams to describe the knowledge discovery process.
Although the data mining technologies mentioned above have their own features and applicable scope, the types of knowledge they discover are not the same. Among them, rule induction is often suitable for mining association rules, characteristic rules, sequence patterns, and discrete data; decision tree methods, genetic algorithms, and rough set methods are often suitable for classification patterns and structures; neural network methods can realize many kinds of data mining, such as classification, clustering, and feature rules; fuzzy technology is often used for mining fuzzy associations, fuzzy classification, and fuzzy clustering rules.

the necessity of the distributed data mining

Present data mining algorithms and models are mainly centralized. Even when data are stored in a distributed fashion, they require these data to be re-collected in a central place (such as a data warehouse), which demands a high-speed data communication network; moreover, response times become longer and privacy and security suffer, especially when the distributed data are heterogeneous. Although network bandwidth keeps increasing, it cannot keep pace with the growth of data, so large volumes of data would have to be moved over a network of limited bandwidth. What is more, centralized data mining algorithms are not suitable for future analysis applications over large, distributed data. Therefore, because of the privacy and secrecy of data and the incompatibility of systems, it is not realistic to put all the data onto one centralized platform.

the key technological problems in distributed data mining

In distributed data mining, the following four aspects of key technology need attention:
(1) Data Consistency
The first stage of data mining is to collect data from logically or physically distributed data sources. The traditional method is to extract data tables from relational databases and then load them into a centralized data warehouse or data mart. For a distributed data mining system, therefore, it is very important to provide a consistent storage structure for all data mining processes. Besides, it is also critical to minimize data movement over the whole data mining period in the distributed environment. Another important topic is to develop an SQL-compatible query interface that allows data mining algorithms to access the information in the distributed databases directly.
(2) Parallel Data Mining
On the server side, running data mining over large data sets is very time-consuming, because data mining algorithms have high complexity. A better way is parallel data mining algorithms. Many data mining algorithms have been developed, covering association rules, neural networks, genetic algorithms, decision trees, and so on, but the traditional algorithms considered only a single database and are mostly serial. With the development of parallel and distributed techniques, more and more parallel and distributed mining algorithms have emerged. For example, recent parallel association mining algorithms include CD (Count Distribution), CaD (Candidate Distribution), and DD (Data Distribution) by Agrawal, and PDM by Park; distributed association mining algorithms include DMA and FDM by Cheung; parallel classification algorithms include SPRINT by Shafer. Usually, parallel mining algorithms and distributed data mining algorithms can be adapted to one another.
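The idea behind CD (Count Distribution) is that each site counts the candidate itemsets over its own data partition, and only the small count tables, not the data, are exchanged and summed. A minimal single-process sketch with two hypothetical sites:

```python
from collections import Counter
from itertools import combinations

def local_counts(partition, k):
    """Count k-itemset candidates over one data partition (one 'site')."""
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), k):
            counts[itemset] += 1
    return counts

# Count Distribution sketch: each site counts locally, then the local
# count tables are summed (the 'count exchange' step).
partitions = [
    [{"a", "b", "c"}, {"a", "b"}],        # site 1's data
    [{"a", "b"}, {"b", "c"}, {"a", "c"}]  # site 2's data
]
global_counts = Counter()
for part in partitions:
    global_counts += local_counts(part, 2)
print(global_counts[("a", "b")])  # 3
```

In a real deployment the loop over partitions would run in parallel at the sites, and only the `Counter` objects would travel over the network.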
(3) Knowledge Assimilation
In an environment where both data and functions are distributed, knowledge assimilation is very important. Its basic idea is to apply data mining algorithms to several (generally non-disjoint) data sets, and then compose the knowledge fragments produced in the mining process into complete knowledge.
(4) Distributed Software Engineering
In recent years, the Internet has become the worldwide superstructure of the client/server computing model. In this new environment, application development mainly means developing software components and then combining them. Software components are encapsulated; their compatibility with the outside world is achieved through application programming interfaces (APIs) defined beforehand. The biggest advantage of software components is that they support software reuse, so system designers can make use of existing components. The most popular distributed component models nowadays are CORBA, ActiveX/DCOM, and JavaBeans.

The variety of association rules

We classify association rules according to different criteria:
(1) Based on the types of the variables handled in the rules, association rules can be divided into Boolean and quantitative.
The values in Boolean rules are all discrete and categorical; such rules show the associations among these variables. Quantitative association rules can be combined with multi-dimensional or multi-level association rules to handle numeric fields, either discretizing them dynamically or handling the original data directly. Of course, categorical variables can also be included in quantitative association rules.
For example, sex = "female" => career = "secretary" is a Boolean association rule; in sex = "female" => avg income = 1300, the income involved is numeric, so it is a quantitative association rule.
(2) Based on the level of abstraction of the data in the rule, association rules can be divided into single-level and multi-level.
In single-level association rules, the variables do not take into account the fact that real data have different levels of abstraction; in multi-level association rules, the multiple levels of the data are fully considered.
For example, IBM desktop => Sony printer is a single-level association rule over detailed data, while desktop => Sony printer is a multi-level association rule at a relatively higher level of abstraction.
(3) Based on the dimensions of data involved in the rule, association rules can be divided into single-dimensional and multi-dimensional.
A single-dimensional association rule deals with only one dimension of the data, such as the goods a user buys; a multi-dimensional association rule involves several dimensions. In other words, a single-dimensional rule handles associations within a single attribute, while a multi-dimensional rule handles associations among several attributes.
For example, beer => diaper involves only the goods users purchase; gender = "female" => career = "secretary" involves the information of two fields and is a two-dimensional association rule.
Given this classification of association rules, in actual applications we can consider which concrete method is suited to mining which kind of rule, and which different methods can be used to handle a given kind of rule.

The distributed data mining system (DDMS)

Specifically speaking, data mining can be regarded as obtaining a predictive model or rule set from one or more (distributed) data collections by applying a suitable data mining algorithm. Different strategies can be adopted, depending mainly on the data themselves, the distribution of the data, the software and hardware resources available, and the required precision. Accordingly, distributed data mining systems differ in the following strategies:
(1) Data Strategy
A distributed data mining system can choose to move the data, to move intermediate results, to provide a final predictive model, or to move the data mining algorithm. With Local Learning, models are established at each distributed site and then carried to a central region; with Centralized Learning, the data are carried to the central region and the models are established there. Besides, some data mining systems use Hybrid Learning, a strategy combining local learning and centralized learning.
(2) Task Strategy
The distributed data mining system can choose to apply one data mining algorithm in a coordinated way across several data sites, or to apply different data mining algorithms independently at each site. In Independent Learning, a data mining algorithm is applied separately at each distributed site; in Coordinated Learning, one (or more) data sites use a data mining algorithm to coordinate the mining task across several sites.
(3) Model Strategy
There are many methods of combining the predictive models established at different sites. The simplest and most frequently used is voting: the outputs of the individual models are combined according to majority vote. The method of Knowledge Probing, by contrast, establishes a comprehensive model from the inputs and outputs of the various models together with the expected output.
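A majority-voting combiner is short enough to sketch directly; the model outputs below are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-site model outputs for one record by majority voting."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three locally trained models for one record:
site_predictions = ["buy", "buy", "skip"]
print(majority_vote(site_predictions))  # buy
```

Knowledge Probing would instead train a new model on such per-site outputs, using them as features against the expected output.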
A distributed data mining system should perform well in scalability, efficiency, portability, adaptivity, and extensibility.
Scalability is the system's ability to show no substantial, obvious decline in performance as the number of data sites increases. Efficiency means making effective use of the system's collective resources while obtaining correct mining results.
Portability means that a distributed data mining system should operate normally across environments with varied software and hardware, and should be able to combine multiple models with different representations.
The environment of almost every data mining system will change. The adaptivity of a distributed data mining system refers to its ability to evolve and adjust in response to the changed environment.
Not only do data and patterns change with time; algorithms and tools also change with progress in machine learning and data mining. The extensibility of a distributed data mining system means that it must be flexible enough to accommodate present and future data mining technology; otherwise it will quickly become inapplicable and obsolete.

 


Table of Contents

CHAPTER 1 INTRODUCTION 
1.1 topic background
1.2 The research contents
1.3 Organization form of the thesis
CHAPTER 2 Theory of Data Mining 
2.1 Data Mining
2.1.1 The structure of data mining system
2.1.2 Functions of data mining
2.1.3 Methods used in data mining
2.1.4 The main applications and the developing trend
2.2 Distributed Data Mining
2.2.1 the basic principle of Distributed Data Mining (DDM)[28]
2.2.2 the necessity of the distributed data mining[7]
2.2.3 the key technological problems in distributed data mining
2.2.4 The Research Result of Distributed Data Mining
2.2.4.1 The existing distributed mining algorithms
2.2.4.2 the architecture of the existing distributed data mining
CHAPTER 3 General introduction of Association Rules 
3.1 Basic conception and problem describing
3.2 The variety of association rules
3.3 The steps and classic algorithm of mining association rule
3.3.1 The steps of mining association rule
3.3.2 Classical Association Rules Algorithm Apriori Algorithm[56]
3.3.3 The existing improvements of the Apriori algorithm[56]
3.4 The Generalization of Association Rule
3.5 The analysis of some existing association rule mining algorithms
CHAPTER 4 The distributed data mining system based on multi-agent 
4.1 the distributed data mining system (DDMS)
4.2 Agent introduction
4.2.1 the characteristics and definition of Agent
4.2.2 Multi-agent system
4.3 the distributed data mining system based on multi-agent
4.3.1 Structure
4.3.2 Module function
4.3.2.1 Users’ interface agent
4.3.2.2 Task management agent
4.3.2.3 Correspond agent
4.3.2.4 data mining agent ( DMA)
4.3.2.5 Knowledge management agent
4.3.2.6 Users’ information base
4.3.2.7 The overall knowledge base
4.3.3 The Work Process of System
4.4 Summary
CHAPTER 5 The distributed association rule mining algorithm based on multi-agent: the RK-tree algorithm
5.1 the basic concept and theory
5.2 the basic principle of RK -tree
5.3 the description of RK -tree algorithm
5.4 The example of RK-tree algorithm
5.5 comparison between RK -tree algorithm and other algorithms
5.6 The comparison of optimized experimental results
5.7 Summary
CHAPTER 6 CONCLUSION
