Data Mining with Rattle and R



    For the keen data miner, Chapter 2 provides a quick-start guide to data mining with Rattle, working through a sample process of loading a dataset and building a model.

    Chapter 8 introduces the concepts of modelling, covering descriptive and predictive data mining. Specific descriptive data mining approaches are then covered in Chapters 9 (clusters) and 10 (association rules). Predictive data mining approaches are covered in Chapters 11 (decision trees), 12 (random forests), 13 (boosting), and 14 (support vector machines). Not all predictive data mining approaches are included, leaving some of the well-covered topics (including linear regression and neural networks) to other books.

    Having built a model, we need to consider how to evaluate its performance, and then how to deploy it. Each of these tasks is covered in its own chapter. Both R and Rattle are open source software and both are freely available on multiple platforms. Appendix B describes in detail how the datasets used throughout the book were obtained from their sources and how they were transformed into the datasets made available through rattle.

    All R code segments included in the book are run at the time of typesetting the book, and the results displayed are directly and automatically obtained from R itself. The Rattle screen shots are also automatically generated as the book is typeset. Because all R code and screen shots are automatically generated, the output we see in the book should be reproducible by the reader.

    Running the same code on other systems (particularly on 32-bit systems) may result in slight variations in the results of the numeric calculations performed by R. Other minor differences will occur with regard to the widths of lines and rounding of numbers.
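As a small illustration of why numeric results can differ slightly between systems, the sketch below (not taken from the book) compares two floating-point values exactly and then with a tolerance. It uses only base R; the variable names are chosen here for illustration.

```r
# Floating-point arithmetic is approximate, so exact comparison can fail
# even when two values are equal for all practical purposes.
total <- 0.1 + 0.2

exactly_equal <- (total == 0.3)                  # exact comparison
nearly_equal <- isTRUE(all.equal(total, 0.3))    # comparison with a tolerance

print(c(exact = exactly_equal, tolerant = nearly_equal))
```

When checking reproduced output against the book, a tolerant comparison such as `all.equal()` is usually the more appropriate test.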

    The following options are set when typesetting the book. The continuation prompt is used by R when a single command extends over multiple lines to indicate that R is still waiting for input from the user. For our purposes, including the continuation prompt makes it more difficult to cut-and-paste from the examples in the electronic version of the book.

    The options example above includes this change to the continuation prompt. R code examples will appear as code blocks like the following example (though the continuation prompt, which is shown in the following example, will not be included in the code blocks in the book).
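The book's own options listing did not survive in this copy. As a minimal sketch of what such a setting looks like, the base R function `options()` can change the continuation prompt along with display settings. The specific values below are illustrative assumptions, not the settings actually used for the book.

```r
# Illustrative values only; the book's actual settings are not reproduced here.
# options() returns the previous values, so they can be restored later.
old_opts <- options(width = 60, digits = 4, continue = "  ")

# With continue = "  ", a command that spans several lines is echoed with
# two spaces in place of the default "+ " prefix, which makes examples
# easier to cut and paste.
current_prompt <- getOption("continue")

# Restore the previous settings when done.
options(old_opts)
```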

    A free graphical interface for data mining with R.
    Version 2.
    Type 'rattle()' to shake, rattle, and roll your data.

    In providing example output from commands, at times we will truncate the listing and indicate missing components with [...]. While most examples will illustrate the output exactly as it appears in R, there will be times where the format will be modified slightly to fit publication limitations. This might involve silently removing or adding blank lines.

    In describing the functionality of Rattle, we will use a sans serif font to identify a Rattle widget (a graphical user interface component that we interact with, such as a button or menu).

    The kinds of widgets that are used in Rattle include the check box for turning options on and off, the radio button for selecting an option from a list of alternatives, file selectors for identifying files to load data from or to save data to, combo boxes for making selections, buttons to click for further plots or information, spin buttons for setting numeric options, and the text view, where the output from R commands will be displayed.

    R provides very many packages that together deliver an extensive toolkit for data mining. When we discuss the functions or commands that we can type at the R prompt, we will include parentheses with the function name so that it is clearly a reference to an R function. The command rattle(), for example, will start the user interface for Rattle. Many functions and commands can also take arguments, which we indicate by trailing the argument name with an equals sign.
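The notation can be seen in a short sketch using a base R function. Here `seq()` is the function and `from=`, `to=`, and `by=` are three of its arguments; the example itself is ours rather than the book's.

```r
# seq() is a base R function; from=, to=, and by= are named arguments.
evens <- seq(from = 2, to = 10, by = 2)
print(evens)

# Arguments can also be matched by position rather than by name.
same_evens <- seq(2, 10, 2)

# Starting Rattle follows the same pattern, assuming the rattle package
# has been installed (left commented out so this sketch runs without it):
# library(rattle)
# rattle()
```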

    Gnome is independent of any programming language, and the GUI side of Rattle started out using the Python programming language. Moving to R allowed us to avoid the idiosyncrasies of interfacing multiple languages. The Glade graphical interface builder is used to generate an XML file that describes the interface independent of the programming language. That file can be loaded into any supported programming language to display the GUI.

    Through the use of Glade, we have the freedom to quickly change languages if the need arises. R itself is written in the procedural programming language C. Where computation requirements are significant, R code is often translated into C code, which will generally execute faster. The details are not important for us here, but this allows R to be surprisingly fast when it needs to be, without the users of R actually needing to be aware of how the function they are using is implemented.
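A rough way to see the effect of compiled code is to compare a sum written as an explicit R loop with the built-in `sum()`, which is implemented as a primitive. This is a sketch for illustration only; `loop_sum` is a name invented here, and timings depend on the machine, so none are quoted.

```r
x <- as.numeric(1:100000)

# A sum written as an explicit loop in R code.
loop_sum <- function(v) {
  total <- 0
  for (value in v) {
    total <- total + value
  }
  total
}

# sum() is a primitive function: its work is done in compiled code.
builtin_is_primitive <- is.primitive(sum)

slow_result <- loop_sum(x)
fast_result <- sum(x)

# Compare elapsed times; results vary from system to system.
loop_time <- system.time(loop_sum(x))
builtin_time <- system.time(sum(x))
print(rbind(loop = loop_time[1:3], builtin = builtin_time[1:3]))
```

Both calls return the same value; the user of `sum()` does not need to know how it is implemented.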

    Currency

    New versions of R are released twice a year, in April and October. R is free, so a sensible approach is to upgrade whenever we can. The examples included in this book are from version 2.
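To see which version of R is currently installed, and so whether an upgrade is due, the following base R calls can be used. The minimum version compared against is a hypothetical value chosen only for this sketch.

```r
# Report the running version of R.
installed_version <- getRversion()
print(R.version.string)

# Compare against a hypothetical minimum version (an assumption for this sketch).
meets_minimum <- installed_version >= "2.0.0"

# Package versions can be queried in the same way.
stats_version <- packageVersion("stats")
print(stats_version)
```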

    Rattle is an ever-evolving package and, over time, whilst the concepts remain, the details will change. For example, the advent of ggplot2 (Wickham) provides an opportunity to significantly develop its graphics capabilities. Similarly, caret (Kuhn et al.) may offer further opportunities. New data mining algorithms continue to emerge and may be incorporated over time. Similarly, the screen shots included in this book are current only for the version of Rattle available at the time the book was typeset.

    Expect some minor changes in various windows and text views, and the occasional major change with the addition of new functionality. Appendix A includes links to guides for installing Rattle. We also list there the versions of the primary packages used by Rattle, at least as of the date of typesetting this book.

    Acknowledgements

    This book has grown from a desire to share experiences in using and deploying data mining tools and techniques. A considerable proportion of the material draws on over 20 years of teaching data mining to undergraduate and graduate students and running industry-based courses. The aim is to provide recipe-type material that can be easily understood and deployed, as well as reference material covering the concepts and terminology a data miner is likely to come across.

    Many thanks are due to students from the Australian National University, the University of Canberra, and elsewhere who over the years have been the reason for me to collect my thoughts and experiences with data mining and to bring them together into this book. I have benefited from their insights into how they learn best.

    They have also contributed in a number of ways with suggestions and example applications. I am also in debt to my colleagues over the years, particularly Peter Milne, Joshua Huang, Warwick Graco, John Maindonald, and Stuart Hamilton, for their support and contributions to the development of data mining in Australia.

    Colleagues in various organisations deploying or developing skills in data mining have also provided significant feedback, as well as the motivation, for this book. Anthony Nolan deserves special mention for his enthusiasm and ongoing contribution of ideas that have helped fine-tune the material in the book.

    Many others have also provided insights and comments. Illustrative examples of using R have also come from the R mailing lists, and I have used many of these to guide the kinds of examples that are included in the book. The many contributors to those lists need to be thanked. Thanks also go to the reviewers, who have added greatly to the readability and usability of the book.

    Thanks also to John Garden for his encouragement and insights in choosing a title for the volume.

    My very special thanks to my wife, Catharina, and children, Sean and Anita, who have endured my indulgence in bringing this book together.

    Canberra
    Graham J. Williams

    Data mining is the art and science of intelligent data analysis. The aim is to discover meaningful insights and knowledge from data. Discoveries are often expressed as models, and we often describe data mining as the process of building models.

    A model captures, in some formulation, the essence of the discovered knowledge. A model can be used to assist in our understanding of the world. Models can also be used to make predictions. For the data miner, the discovery of new knowledge and the building of models that nicely predict the future can be quite rewarding.

    Indeed, data mining should be exciting and fun as we watch new insights and knowledge emerge from our data. With growing enthusiasm, we meander through our data analyses, following our intuitions and making new discoveries all the time, discoveries that will continue to help change our world for the better.

    Data mining has been applied in most areas of endeavour.


    There are data mining teams working in business, government, financial services, biology, medicine, risk and intelligence, science, and engineering. Anywhere we collect data, data mining is being applied and feeding new knowledge into human endeavour.

    We are living in a time where data is collected and stored in unprecedented volumes. Large and small government agencies, commercial enterprises, and noncommercial organisations collect data about their businesses, customers, human resources, products, and manufacturing processes.

    G. Williams, Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery

    Data is the fuel that we inject into the data mining engine. Amongst data there can be hidden clues of the fraudulent activity of criminals. Data provides the basis for understanding the scientific processes that we observe in our world. Turning data into information is the basis for identifying new opportunities that lead to the discovery of new knowledge, which is the linchpin of our society!

    Data mining is about building models from data. We build models to gain insights into the world and how the world works so we can predict how things behave. A data miner, in building models, deploys many different data analysis and model building techniques. Our choices depend on the business problems to be solved. Today's data environment is characterised by the volume of data available, commonly in the gigabytes and terabytes and fast approaching the petabytes.

    It is also characterised by the complexity of that data, both in terms of the relationships that are awaiting discovery in the data and the data types available today, including text, image, audio, and video. Modelling is what people often think of when they think of data mining. Modelling is the process of turning data into some structured form or model that reflects the supplied data in some useful way. Overall, the aim is to explore our data, often to address a specific problem, by modelling the world.

    From the models, we gain new insights and develop a better understanding of the world. Data mining, in reality, is so much more than simply modelling. It is also about understanding the business context within which we deploy it. It is about understanding and collecting data from across an enterprise and from external sources. It is then about building models and evaluating them. And, most importantly, it is about deploying those models to deliver benefits.

    There is a bewildering array of tools and techniques at the disposal of the data miner for gaining insights into data and for building models. Relational database theory had been developed and successfully deployed, and thus began the era of collecting large amounts of data. How do we add value to our massive stores of data?

    The first few data mining workshops in the early 1990s attracted the database community researchers. Before long, other computer science, and particularly artificial intelligence, researchers began to get interested. Machine learning is about collecting observational data through interacting with the world and building models of the world from such data. That is pretty much what data mining was also setting about to do.

    So, naturally, the machine learning and data mining communities started to come together. However, statistics is one of the fundamental tools for data analysis, and has been so for over a hundred years. Statistics brings to the table essential ideas about uncertainty and how to make allowances for it in the models that we build. Discoveries need to be statistically sound and statistically significant, and any uncertainty associated with the modelling needs to be understood.

    Today, data mining is a discipline that draws on sophisticated skills in computer science, machine learning, and statistics. However, a data miner will work in a team together with data and domain experts.

    An initiation meeting of a data mining project will often involve data miners, domain experts, and data experts. The data miners bring the statistical and algorithmic understanding, programming skills, and key investigative ability that underlies any analysis.

    The domain experts know about the actual problem being tackled, and are often the business experts who have been working in the area for many years. The data experts know about the data, how it has been collected, where it has been stored, how to access and combine the data required for the analysis, and any idiosyncrasies and data traps that await the data miner.

    Generally, neither the domain expert nor the data expert understands the needs of the data miner. In particular, as a data miner we will often find ourselves encouraging the data experts to provide (or to provide access to) all of the data, and not just the data the data expert thinks might be useful.

    It is critical that all three experts come together to deliver a data mining project.


    Their different understandings of the problem to be tackled all need to meld to deliver a common pathway for the data mining project. In particular, the data miner needs to understand the problem domain perspective and understand what data is available that relates to the problem and how to get that data, and identify what data processing is required prior to modelling.

    There are many other important tasks that we will find ourselves involved in. These include ensuring our data mining activities are tackling the right problem; understanding the data that is available; turning noisy data into data from which we can build robust models; evaluating and demonstrating the performance of our models; and ensuring the effective deployment of our models. Whilst we can easily describe these steps, it is important to be aware that data mining is an agile activity.

    An allied aspect is the concept of pair programming, where two data miners work together on the same data in a friendly, competitive, and collaborative approach to building models. The agile approach also emphasises the importance of face-to-face communication, above and beyond all of the effort that is otherwise often expended, and often wasted, on written documentation.

    This is not to remove the need to write documents but to identify what is really required to be documented. We now identify the common steps in a data mining project and note that the following chapters of this book then walk us through these steps one step at a time!

    1. Problem Understanding
    2. Data Understanding
    3. Data Preparation
    4. Modeling
    5. Evaluation
    6. Deployment

    The chapters in this book essentially follow this step-by-step process of a data mining project, and Rattle is very much based around these same steps. Using a tab-based interface, each tab represents one of the steps, and we proceed through the tabs as we work our way through a data mining project.

    One noticeable exception to this is the first step, problem understanding. That is something that needs study, discussion, thought, and brain power. Practical tools to help in this process are not common.

    Within the organisation, data mining projects can be initiated by the business or by the analytics team. Often, for best business engagement, a business-initiated project works best, though business is not always equipped to understand where data mining can be applied.

    It is often a mutual journey. Data miners, by themselves, rarely have the deeper knowledge of business that a professional from the business itself has. Yet the business owner will often have very little knowledge of what data mining is about, and indeed, given the hype, may well have the wrong idea.

    It is not until they start getting to see some actual data mining models for their business that they start to understand the project, the possibilities, and a glimpse of the potential outcomes. We will relate an actual experience over six months with six significant meetings of the business team and the analytics team.

    The picture we paint here is a little simplified and idealised but is not too far from reality.

    Meeting One: The data miners sit in the corner to listen and learn. The business team understands little about what the data miners might be able to deliver. They discuss their current business issues and steps being taken to improve processes. The data miners have little to offer just yet but are on the lookout for the availability of data from which they can learn.

    Meeting Two: The data miners will now often present some observations of the data from their initial analyses. Whilst the analyses might be well presented graphically, and are perhaps interesting, they are yet to deliver any new insights into the business. At least the data miners are starting to get the idea of the business, as far as the business team is concerned.

    Meeting Three: The data miners start to demonstrate some initial modelling outcomes. The results begin to look interesting to the business team. They are becoming engaged, asking questions, and understanding that the data mining team has uncovered some interesting insights.

    Meeting Four: The data miners are the main agenda item. Their analyses are starting to ring true. They have made some quite interesting discoveries from the data that the business team (the domain and data experts) supplied. The discoveries are nonobvious, and sometimes intriguing. Sometimes they are also rather obvious.

    Meeting Five: The data mining team has presented its evaluation of how well the models perform and explained the context for the deployment of the models. The business team is now keen to evaluate the model on real cases and monitor its performance over a period of time.

    Meeting Six: The models have been deployed into business and are being run daily to match customers and products for marketing, to identify insurance claims or credit card transactions that may be fraudulent, or taxpayers whose tax returns may require refinement.

    Procedures are in place to monitor the performance of the model over time and to sound alarm bells once the model begins to deviate from expectations. The key to much of the data mining work described here, in addition to the significance of communication, is the reliance and focus on data.


    This leads us to identify some key principles for data mining. We need to have good data that relates to a process that we wish to understand and improve. Without data we are simply guessing. Considerable time and effort spent getting our data into shape is a key factor in the success of a data mining project. In many circumstances, once we have the right data for mining, the rest is straightforward. As many others note, this effort in data collection and data preparation can in fact be the most substantial component of a data mining project.

    My list of insights for data mining, in no particular order, includes:

    - Focus on the data and understand the business.
    - Build multiple models.
    - Stress repeatability and efficiency, using scripts for everything.
    - Let the data talk to you but not mislead you.
    - Communicate discoveries effectively and visually.

    We need to be vigilant to record all that is done. This is often best done through the code we write to perform the analysis rather than having to document the process separately.

    Having a separate process to document the data mining will often mean that it is rarely completed. An implication of this is that we often capture the process as transparent, executable code rather than as a list of instructions for using a GUI. There are many important advantages to ensuring we document a project through our coding of the data analyses. There will be times when we need to hand a project to another data miner.

    Or we may cease work on a project for a period of time and return to it at a later stage. For whatever reason, when we return to a project, we find the documentation, through the coding, essential in being efficient and effective data miners.

    Various things should be documented, and most can be documented through a combination of code and comments. We need to document our access to the source data, how the data was transformed and cleaned, what new variables were constructed, and what summaries were generated to understand the data. Then we also need to record how we built models and what models were chosen and considered. Finally, we record the evaluation and how we collect the data to support the benefit that we propose to obtain from the model.
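A minimal sketch of documentation through code might look like the following. It is an illustration rather than an example from the book: the built-in `mtcars` dataset stands in for a real data extract, and the derived variable names (`wt_kg`, `high_power`) are invented here.

```r
# Source data: the mtcars dataset that ships with R (a stand-in for a real extract).
cars <- mtcars

# Transformation: weight is recorded in units of 1000 lb; convert to kilograms.
cars$wt_kg <- cars$wt * 453.592

# New variable: flag cars with above-median horsepower.
cars$high_power <- cars$hp > median(cars$hp)

# Summary generated to understand the data.
power_summary <- table(cars$high_power)
print(power_summary)

# Model built: linear regression of fuel economy on weight and the power flag.
fit <- lm(mpg ~ wt_kg + high_power, data = cars)

# Evaluation recorded alongside the model.
fit_r_squared <- summary(fit)$r.squared

# Record the software versions used, to support repeatability.
session_details <- sessionInfo()
```

Each comment records a decision, so the script doubles as the project record.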

    Through documentation, and ideally by developing documented code that tells the story of the data mining project and the actual process as well, we will be communicating to others how we can mine data.

    Our processes can be easily reviewed, improved, and automated. We can transparently stand behind the results of the data mining by having openly available the process and the data that have led to the results.

    R

    R is used throughout this book to illustrate data mining procedures. It is the programming language used to implement the Rattle graphical user interface for data mining. [...] then you will find Muenchen a great resource. R provides all of the common, most of the less common, and all of the new approaches to data mining. The basic modus operandi in using R is to write scripts using the R language.

    After a while you will want to do more than issue single simple commands and rather write programs and systems for common tasks that suit your own data mining. Thus, saving our commands to an R script file (often with the .R filename extension) is important. We can then rerun our scripts to transform our source data, at will and automatically, into information and knowledge. As we progress through the book, we will become familiar with the common R functions and commands that we might combine into a script.
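The idea of saving commands to a script and rerunning them can be sketched with base R alone. The file name `summarise.R` and its contents are hypothetical; the script is written to a temporary directory so that the sketch leaves no files behind in the working directory.

```r
# Write a tiny script to a temporary location (hypothetical file name).
script_path <- file.path(tempdir(), "summarise.R")
writeLines(c(
  "# summarise.R: a small script that can be rerun at will",
  "values <- c(2, 4, 6, 8)",
  "values_mean <- mean(values)"
), script_path)

# Rerun the script with source(), evaluating it in its own environment
# so that it does not overwrite objects in the current session.
run_env <- new.env()
source(script_path, local = run_env)

result <- run_env$values_mean
print(result)
```

In everyday use the script would live alongside the project data, and `source("summarise.R")` would be enough to rerun it.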

    Whilst for data mining purposes we will focus on the use of the Rattle GUI, more advanced users might prefer the powerful Emacs editor, augmented with the ESS package, to develop R code directly.

    We also note that direct interaction with R has a steeper learning curve than using GUI-based systems, but once over the hurdle, performing operations over the same or similar datasets becomes very easy using its programming language interface.

    A paradigm that is encouraged throughout this book is that of learning by example or programming by example (Cypher). The intention is that anyone will be able to easily replicate the examples from the book and then fine-tune them to suit their own needs. This is one of the underlying principles of Rattle, where all of the R commands that are used under the graphical user interface are also exposed to the user.

    This makes it a useful teaching tool in learning R for the specific task of data mining, and also a good memory aid!

    Rattle

    Rattle is built on the statistical language R, but an understanding of R is not required in order to use it. Rattle is simple to use, quick to deploy, and allows us to rapidly work through the data processing, modelling, and evaluation phases of a data mining project.

    When we need to fine-tune and further develop our data mining projects, we can migrate from Rattle to R.

    Rattle can save the current state of a data mining task as a Rattle project. A Rattle project can then be loaded at a later time or shared with other users. Projects can be loaded, modified, and saved, allowing checkpointing and parallel explorations.

    Projects also retain all of the R code for transparency and repeatability.


    The R code can be loaded into R (outside of Rattle) to repeat any data mining task. Rattle also provides a stepping stone to more sophisticated processing and modelling in R itself.

    It is worth emphasising that the user is not limited to how Rattle does things. For sophisticated and unconstrained data mining, the experienced user will progress to interacting directly with R.

    The typical workflow for a data mining project was introduced above. In the context of Rattle, it can be summarised as:

    1. Load a Dataset.
    2. Select variables and entities for exploring and mining.
    3. Explore the data to understand how it is distributed or spread.
    4. Transform the data to suit our data mining purposes.
    5. Build our Models.
    6. Evaluate the models on other datasets.
    7. Export the models for deployment.

    It is important to note that at any stage the next step could well be a step to a previous stage.
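The same sequence of steps can be sketched directly in base R. This is an illustration only and not the code that Rattle itself generates: the built-in `iris` dataset, the 70/30 split, the logistic regression, and the 0.5 cut-off are all assumptions made for the sake of a short, runnable example.

```r
set.seed(42)  # fix the random split so the sketch is repeatable

# 1. Load a dataset (iris ships with R).
ds <- iris

# 2. Select variables and entities: keep two species to give a binary target.
ds <- droplevels(ds[ds$Species != "setosa", ])
ds$target <- as.integer(ds$Species == "virginica")
inputs <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# 3. Explore the data to see how it is distributed.
print(summary(ds[inputs]))

# 4. Transform: rescale the inputs (done before splitting only for brevity).
ds[inputs] <- scale(ds[inputs])

# 5. Build a model on a 70% training sample.
train_rows <- sample(nrow(ds), size = round(0.7 * nrow(ds)))
train <- ds[train_rows, ]
test <- ds[-train_rows, ]
model <- glm(target ~ Petal.Length + Petal.Width, data = train,
             family = binomial)

# 6. Evaluate the model on the held-out observations.
predicted <- as.integer(predict(model, newdata = test, type = "response") > 0.5)
confusion <- table(actual = test$target, predicted = predicted)
print(confusion)

# 7. Export the model for deployment (here, to a temporary file).
model_path <- file.path(tempdir(), "model.rds")
saveRDS(model, model_path)
```

If the evaluation in step 6 is disappointing, we would step back to an earlier stage, for example to select different variables or try another transformation.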

    We illustrate a typical workflow that is embodied in the Rattle interface in Figure 1.

    [Figure 1. The typical workflow of a data mining project as supported by Rattle. Stages: Identify Data, Select Variables, Clean and Transform, Build and Tune Models, Evaluate Models, Deploy Model, Monitor Performance. Annotations: start by getting as much data as we can and then cull; we may loop around here many times as we clean, transform, and then build and tune our models; evaluate performance, structure, complexity, and deployability; is the model run manually on demand or on an automatic schedule?]

    R and Rattle are free software in terms of allowing anyone the freedom to do as they wish with them.

    This is also referred to as open source software to distinguish it from closed source software, which does not provide the source code. Closed source software usually has quite restrictive licenses associated with it, aimed at limiting our freedom using it. R and Rattle can be obtained for free.

    On 7 January, the New York Times carried a front page technology article on R where a vendor representative was quoted:

    "I think it addresses a niche market for high-end data analysts that want free, readily available code. We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet."

    This is a common misunderstanding about the concept of free and open source software. R, being free and open source software, is in fact a peer-reviewed software product that a number of the world's top statisticians have developed and others have reviewed. In fact, anyone is permitted to review the R source code. Over the years, many bugs and issues have been identified and rectified by a large community of developers and users. On the other hand, a closed source software product cannot be so readily and independently verified or viewed by others at will.

    Bugs and enhancement requests need to be reported back to the vendor. Customers then need to rely on a very select group of vendor-chosen people to assure the software, rectify any bugs in it, and enhance it with new algorithms. Bug fixes and enhancements can take months or years, and generally customers need to download the new versions of the software.

    Both scenarios (open source and closed source) see a lot of effort put into the quality of their software. With open source, though, we all share it, whereas we can share and learn very little about the algorithms we use from closed source software. It is worthwhile to highlight another reason for using R in the context of free and commercial software. In obtaining any software, due diligence is required in assessing what is available. However, what is finally delivered may be quite different from what was promised or even possible with the software, whether it is open source or closed source, free or commercial.

    With free open source software, we are free to use it without restriction. If we find that it does not serve our purposes, we can move on with minimal cost. With closed source commercial purchases, once the commitment is made to buy the software and it turns out not to meet our requirements, we are generally stuck with it, having made the financial commitment, and have to make do.

    We list some of the advantages with using R:

    - It incorporates all of the standard statistical tests, models, and analyses, as well as providing a comprehensive language for managing and manipulating data.
    - New technology and ideas often appear first in R.
    - It reflects well on a very competent community of computational statisticians.
    - Because R is open source, unlike closed source software, it has been reviewed by many internationally renowned statisticians and computational scientists.
    - R runs on many operating systems and different hardware.
    - Have you ever tried getting support from the core developers of a commercial vendor?

    Whilst the advantages might flow from the pen with a great deal of enthusiasm, it is useful to note some of the disadvantages or weaknesses of R, even if they are perhaps transitory! There are several simple-to-use graphical user interfaces (GUIs) for R that encompass point-and-click interactions, but they generally do not have the polish of the commercial offerings. However, some very high-standard books are increasingly plugging the documentation gaps. R is a software application that many people freely devote their own time to developing.

    Problems are usually dealt with quickly on the open mailing lists, and bugs disappear with lightning speed. Users who do require support can purchase it from a number of vendors internationally.

    Available memory can be a restriction when doing data mining. There are various solutions, including using 64-bit operating systems that can access much more memory than 32-bit ones.

    Laws in many countries can directly affect data mining, and it is very worthwhile to be aware of them and their penalties, which can often be severe.

    There are basic principles relating to the protection of privacy that we should adhere to. Please take that responsibility seriously.

    Think often and carefully about what you are doing.

    Some basic familiarity with R will be gained through our travels in data mining using the Rattle interface and some excursions into R. In this respect, most of what we need to know about R is contained within the book. But there is much more to learn about R and its associated packages. The book covers the basic data structures, reading and writing data, subscripting, manipulating, aggregating, and reshaping data. Introductory Statistics with R (Dalgaard), as mentioned earlier, is a good introduction to statistics using R.

    Moving more towards areas related to data mining, Data Analysis and Graphics Using R (Maindonald and Braun) provides excellent practical coverage of many aspects of exploring and modelling data using R. See also The Elements of Statistical Learning (Hastie et al.). Quite a few specialist books using R are now available, including one on Lattice. A newer graphics framework is detailed in ggplot2: Elegant Graphics for Data Analysis (Wickham). Bivand et al.

    Moving on from R itself and into data mining, there are very many general introductions available. One that is commonly used for teaching in computer science is Han and Kamber. It provides a comprehensive generic introduction to most of the algorithms used by a data miner. It is presented at a level suitable for information technology and database graduates.

    Chapter 2 Getting Started

    New ideas are often most effectively understood and appreciated by actually doing something with them.

    So it is with data mining. Fundamentally, data mining is about practical application: the application of the algorithms developed by researchers in artificial intelligence, machine learning, computer science, and statistics. This chapter is about getting started with data mining.

    Our aim throughout this book is to provide hands-on practice in data mining, and to do so we need some computer software. There is a choice of software packages available for data mining. These include commercial closed source software (which is also often quite expensive) as well as free open source software. Open source software (whether freely available or commercially available) is always the best option, as it offers us the freedom to do whatever we like with it, as discussed in Chapter 1.

    This includes extending it, verifying it, tuning it to suit our needs, and even selling it. Such software is often of higher quality than commercial closed source software because of its open nature. For our purposes, we need some good tools that are freely available to everyone and can be freely modified and extended by anyone.

    Therefore we use the open source and free data mining tool Rattle, which is built on the open source and free statistical software environment R. See Appendix A for instructions on obtaining the software. Now is a good time to install R. Much of what follows for the rest of the book, and specifically this chapter, relies on interacting with R and Rattle.

    The aim is to build a model that captures the essence of the knowledge discovered from our data. Be careful though: once we have quality data, Rattle can build a model with just four mouse clicks, but the effort is in preparing the data and understanding and then fine-tuning the models.

    In this chapter, we use Rattle to build our first data mining model—a simple decision tree model, which is one of the most common models in data mining. We cover starting up and quitting from R, an overview of how we interact with Rattle, and then how to load a dataset and build a model.
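
    Outside the Rattle interface, a decision tree of the kind described above can also be sketched directly in R. The following is a minimal sketch under stated assumptions, not the book's own worked example: it assumes the rpart package (shipped with standard R distributions) and the built-in iris data, and it holds out 30% of the rows for evaluation. None of these choices come from the text.

```r
# Minimal sketch: a classification tree fitted directly in R.
# Assumptions (not from the text): the rpart package, the iris data,
# and a 70/30 split into training and held-out rows.
library(rpart)

# Hold out 30% of the rows so the model is checked on unseen data.
set.seed(42)
n <- nrow(iris)
train_rows <- sample(n, size = round(0.7 * n))
train <- iris[train_rows, ]
test <- iris[-train_rows, ]

# Fit a classification tree predicting Species from all other columns.
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict the class of each held-out row and measure the accuracy.
predicted <- predict(fit, newdata = test, type = "class")
accuracy <- mean(predicted == test$Species)

print(fit)
cat("Held-out accuracy:", round(accuracy, 3), "\n")
```

    Printing the fitted object lists the splits chosen by the tree, which is a quick way to start understanding the model before evaluating it more carefully.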

    Once the enthusiasm for building a model is satisfied, we then review the larger tasks of understanding the data and evaluating the model. This assumes that we have already installed R, as detailed in Appendix A. One way or another, we should see a window Figure 2. We will generally refer to this as the R Console.

    These include options for working with script files, managing packages, and obtaining help.

    We start Rattle by loading rattle into the R library using library(). We supply the name of the package to load as the argument to the command. The rattle() command is then entered with an empty argument list. The prompt indicates that R is awaiting user commands. The initial Rattle window displays a welcome message and a little introduction to Rattle and R.
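
    The start-up step described above can be sketched as follows. This is a minimal sketch, assuming the rattle package is already installed. The helper start_rattle() is hypothetical, introduced here only so that the example does nothing when run outside an interactive session; at the R prompt, the two commands that matter are library(rattle) and rattle().

```r
# Minimal sketch of starting Rattle from the R prompt.
# start_rattle() is a hypothetical helper, not part of rattle itself.
start_rattle <- function(launch = interactive()) {
  # Check that the rattle package is installed before trying to load it.
  if (!requireNamespace("rattle", quietly = TRUE)) {
    message("rattle is not installed; try install.packages(\"rattle\")")
    return(invisible(FALSE))
  }
  # Do not open a window unless asked to (e.g. in a batch script).
  if (!launch) {
    return(invisible(FALSE))
  }
  library(rattle)  # attach the rattle package
  rattle()         # empty argument list: open the Rattle window
  invisible(TRUE)
}

start_rattle()
```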

    The key to using Rattle, as hinted at in the status bar on starting up Rattle, is to supply the appropriate information for a particular tab and to then click the Execute button to perform the action. Always make sure you have clicked the Execute button before proceeding to the next step. To exit from Rattle, we simply click the Quit button.

    A great book by all means. Some people take a lot of interest in the fine demarcation between statistics and machine learning; however, for me, there is too much overlap between the topics.

    I have given up on the distinction as it makes no difference from the applications perspective. The book introduces the RWeka package; Weka is another open source software package used extensively in academic research. I like this book because of the interesting topics it covers, including text mining, social network analysis, and time series modeling. Having said this, the author could have put some effort into the formatting of this book, which is plain ugly. At times you will feel you are reading a master's-level project report while skimming through the book.

    However, once you get over this aspect, the content is really good for learning R.

    However, trust me, apart from a few minor issues Rattle is not at all bad. I really hope they keep working on Rattle to make it better, as it has a lot of potential.

    It is much better than the base graphics that come pre-installed with R, so I would recommend you start directly with ggplot2 without wasting your time on base graphics.

    However, if you want to get to further depths of ggplot2 then this is the book for you. Though I prefer ggplot2, Lattice is another package on par with it. The author of this book has extensive experience in R coding, and that is evident when you read this book. I must warn you that at times while reading this book one wonders about the utility of some of the things Mr.

    Copyright © 2019