
*DISCLAIMER: THIS WORK IS THE PROPERTY OF CARA M. MARSHALL.

COPYING THIS IS UNLAWFUL WITHOUT PERMISSION FROM AUTHOR

 

Data Mining

TABLE OF CONTENTS

 

INTRODUCTION

CONTENTS OF A DATABASE

HOW DATA IS GATHERED

HISTORICAL UTILIZATION OF CUSTOMER INFORMATION

STATISTICAL SOFTWARE PACKAGES

DATA MINING TECHNOLOGY

HOW DATA MINING WORKS

Association Rule

Apriori Algorithms

Distributed/Parallel Algorithms

Sequential Rule

AprioriAll and AprioriSome Algorithms

Classification or Clustering Rule

ID3 Algorithm

C4.5 Algorithm

SLIQ Algorithm

Other Types of Algorithms

REAL-LIFE EXAMPLES

CONCLUSION


INTRODUCTION

 

It does not make sense to spend your marketing dollars on the entire population when, through segmentation, you can target only the prime customer candidates and market the product to them.  By reducing the money wasted on chasing people who are unlikely to purchase your product, you improve the return on your marketing investment.

The key to segmentation is gathering and utilizing information about your customers and/or prospects effectively.  This information is collected and stored in a database, several databases with the ability to interact, or a data warehouse.  A “data warehouse is a central repository for all or significant parts of the data that an enterprise's various business systems collect” (www.whatis.com).  “Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point-of-view of the end user or knowledge worker who may need access to specialized, sometimes local databases” (www.whatis.com).  Many companies do not have the database capability to maximize segmentation and are reluctant to place the appropriate emphasis on, and investment in, database technology. 

 

CONTENTS OF A DATABASE

 

Simple databases include information like: first name, last name, salutation, department, title, company, address line 1, address line 2, city, state, country, and zip code.  In a more sophisticated database, other fields stored would be: gender, income, age, interests, products purchased, purchase dates, marketing strategy or medium that initially brought about the response.  For a very thorough analysis, customer attributes can be collected, such as: demographics (address, income, etc.), psychographic information (personality types), technographics (if a technical interface is involved – type of system used), product characteristics, buyer or visitor statistics (purchase history, click stream information), and permissions (mediums or options that the person opted-in for). 

 

HOW DATA IS GATHERED

 


            Companies gather data through online or offline forms, surveys, focus groups, and other marketing research techniques.  Questions should be presented so that the answers can be represented quantitatively.  Results are entered, directly or indirectly, into the database, databases, or data warehouse.  The following illustrations provide an example of a form that users must complete when signing up for a free Yahoo email account.


 


HISTORICAL UTILIZATION OF CUSTOMER INFORMATION

 

Generally, marketers would use the information stored in the database (or databases) to run queries based on assumptions about purchase behavior.  For example, if you are marketing Jaguars, you may have a suspicion that your customers earn upwards of $100,000 per year.  This assumption may be based on a hunch or a logical prediction of the target market.  It may even be based on an examination of the customer records in your database. 

Oftentimes, entire marketing plans were (and in many cases still are) built on human hunches and initial predictions.  Wouldn’t it be more accurate to draw relationships first through data analysis and add the human interpretation afterward?  What if the information is too complex for humans to discover the patterns?

 

STATISTICAL SOFTWARE PACKAGES

 

Some tools used to assist in examining or reporting are online analytical processing (OLAP) systems or statistical packages. This technology “gives users access to analytical content such as time series and trend analysis views and summary-level information as well as insight into data organized into multiple dimensions” (Carickhoff).  Many applications of OLAP and statistical packages are basically extensions of database or data warehouse capability.  OLAP and statistical software generally provide users with a GUI (graphical user interface) that simplifies the complicated input users must supply in order to run the analysis.  These systems “rely on you to discover patterns and decide what to do with them” (Greening).  While this technology has been at the forefront of the decision support industry, it is only a tool in the knowledge discovery process and cannot actually perform the knowledge discovery.

OLAP has been useful not only in the analysis of relational databases, but particularly with regard to multidimensional databases.  Another advantage of OLAP is that many vendors have been able to provide web browser access to their OLAP engines, often referred to as Web OLAP or WOLAP.

 

DATA MINING TECHNOLOGY

 

For true knowledge discovery, a data-mining tool should be utilized.  Data mining tools discover relationships or patterns among data and can report or act on those findings.  “Data mining is data-driven, not user-driven or verification-driven” (Gilman).  Through this technology, the marketer actually looks at data-driven relationships, whereas before, she would develop theories based on hunches about potential customers (as in the earlier example of marketing Jaguars).  So if the marketer for Jaguar had been using data-mining technology, the system might have supported her theory that customers earn $100,000 or more.  But the technology might also have discovered that customers are generally under the age of 45.  This bit of information would be very useful in determining whom to target and how best to relay the message (i.e., which mediums to use and how the message should look, sound, and feel).  The marketer may not have been looking for this relationship among customers, so the pattern might have been overlooked had the data not been analyzed through a data-driven system.

 

HOW DATA MINING WORKS

 

Data mining works by utilizing algorithms to search the database for hidden patterns.  The technology was first developed to help scientists make sense of experimental data, but was quickly applied to business applications.  “Data-mining tools can sift through immense collections of customer, marketing, production, financial data, and statistical and artificial intelligence techniques, identify what’s worth noting and what’s not” (Verity). 

Data mining has three components: (1) associations (one event can be correlated with another), (2) sequences (one event leads to another), and (3) classification (pattern recognition that leads to data reorganization) or clustering (finding and visualizing groups of facts previously unknown).  We can use the results from these components as they are, or use them to forecast and uncover patterns in the data that lead to educated predictions about the future.

Technical designers have tried various methods of programming to best attain the desired results.  Machine learning algorithms have had the widest use and have thus far been the most successful.  I’ll discuss machine learning algorithms in depth, following the brief explanation of these other techniques:

·        Statistical packages such as SAS and SPSS have been widely used to detect unusual patterns and to explain patterns using linear models.

·        Genetic algorithms are “optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of natural evolution” (Joshi).

·        The nearest neighbor method classifies each record based on the classes of the records most similar to it in a historical data set.

·        Rule induction extracts sets of if/then rules from the data based on statistical significance.

·        Data visualization provides a “visual interpretation of complex relationships in multidimensional data” (Joshi).

·        Neural networks, a relatively new data mining tool, are a form of artificial intelligence, “modeled on the logical associations made by the human brain” (DiCarlo).  The network is trained to recognize parameters set by administrators, which are based on mathematical models that accumulate data.  Once the network recognizes these parameters, it makes an evaluation, reaches a conclusion, and takes action (predetermined and set by administrators).  Neural networks have been particularly successful in applications that involve classification.

 

The following figure (Figure 2) provides an overview of the data mining process.


 


In the next portion of this paper, I will discuss the types of machine learning algorithms that are successful for each of the three components: association, sequential, and classification or clustering.

 

Association Rule

 

            Association rules scour the database to find associations between items that satisfy user-specified minimum support and minimum confidence constraints.  An association rule takes the form ‘A’ implies ‘B’ (A → B), where ‘A’ and ‘B’ are sets of items.  The rule means that transactions in the database that contain ‘A’ tend to also contain ‘B’.  For example, 60% of customers at a fast food chain who order hamburgers also purchase soda, and 25% of all transactions in the database contain both of these items.  In this case, 60% is the confidence level and 25% is the support level of the rule.
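To make the two measures concrete, here is a minimal Python sketch of how support and confidence could be computed for the hamburger and soda rule.  The twelve orders and the function names are hypothetical, invented only so that the numbers match the 25% and 60% in the example; this is an illustration, not the code of any particular data mining product.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent plus consequent) divided by support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical data: 12 orders at a fast food counter.
orders = (
    [{"hamburger", "soda"}] * 3      # hamburger and soda together
    + [{"hamburger", "fries"}] * 2   # hamburger without soda
    + [{"salad", "water"}] * 7       # no hamburger at all
)

rule_support = support(orders, {"hamburger", "soda"})
rule_confidence = confidence(orders, {"hamburger"}, {"soda"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Three of the twelve orders contain both items (support of 25%), and three of the five hamburger orders also contain soda (confidence of 60%).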

Apriori Algorithms

Apriori algorithms are a type of association rule algorithm that was developed by IBM’s Quest project team for use on large transaction databases.  They begin by finding all combinations of items (itemsets) that meet the minimum support requirement and then determine whether a rule holds by computing a ratio.  For a candidate rule in which ‘AB’ implies ‘CD’, for example:

ratio r = support (ABCD)/support (AB).

The rule is kept only if r meets the minimum confidence requirement.  Apriori algorithms reach the final result by passing over the database multiple times: each pass first gathers frequency information and then joins the itemsets that qualified to form the candidates for the next pass.  They use a tree data structure (a hash tree in the original formulation) to hold the counts of potential candidates.  Decision trees are considered among the best models for displaying results for a number of reasons: they are inexpensive to construct, easy to interpret, easy to integrate, and return comparable or better accuracy.  An illustration of a decision tree follows in Figure 3.
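The passes described above can be sketched in a few lines of Python.  This is a simplified illustration under my own assumptions, not IBM's implementation: it keeps candidate counts in a plain dictionary rather than a tree, and the function names, the five market baskets, and the thresholds are all hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset that meets min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # One pass over the database to count each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


def rules(frequent_sets, min_confidence):
    """Yield (antecedent, consequent, r) where r = support(all) / support(antecedent)."""
    for itemset, sup in frequent_sets.items():
        for size in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, size)):
                r = sup / frequent_sets[ante]
                if r >= min_confidence:
                    yield ante, itemset - ante, r


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

found = apriori(baskets, min_support=0.5)
strong = list(rules(found, min_confidence=0.8))
for ante, cons, r in strong:
    print(sorted(ante), "->", sorted(cons), f"r = {r:.2f}")
```

With these baskets, the only rule that clears both thresholds is that beer implies diapers: every basket containing beer also contains diapers.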


 


Distributed/Parallel Algorithms

            Distributed/parallel algorithms were developed in order to mine data using less time and less processing power at any one site.  These algorithms distribute the processing across multiple sites so as to generate results quickly.  Results are achieved in a manner similar to Apriori; however, the number of messages passed between sites is reduced by exploring relationships between large sets of data and by using pruning techniques to remove useless data at the individual processing sites.
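The basic idea of counting locally and exchanging only summaries can be sketched as follows.  This is a deliberately simplified illustration: the two "sites" are just Python lists processed one after the other, only item pairs are counted, and the local pruning step is omitted.  The function names and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_pair_counts(site_transactions):
    """Count item pairs at one site; only this small summary leaves the site."""
    counts = Counter()
    for t in site_transactions:
        counts.update(combinations(sorted(t), 2))
    return counts


def merge_counts(per_site_counts, total_transactions, min_support):
    """Sum the per-site counts and keep pairs that meet min_support globally."""
    total = Counter()
    for counts in per_site_counts:
        total.update(counts)
    return {
        pair: n / total_transactions
        for pair, n in total.items()
        if n / total_transactions >= min_support
    }


# Hypothetical transactions held at two separate sites.
site_a = [{"bread", "milk"}, {"bread", "milk", "eggs"}]
site_b = [{"bread", "milk"}, {"milk", "eggs"}]

summaries = [local_pair_counts(site_a), local_pair_counts(site_b)]
merged = merge_counts(summaries, len(site_a) + len(site_b), min_support=0.7)
print(merged)
```

Each site sends one small table of counts rather than its raw transactions, which is what keeps the message traffic low.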

 

Sequential Rule

 

            The discovery of sequential rules was motivated by efforts to improve customer satisfaction and by opportunities to cross-sell products, but the results of sequential analysis can be applied to many fields both inside and outside the scope of business.  In order to analyze sequential patterns, it must be possible to organize the input data into sequences.  The sequences are ordered lists of transactions or items and may have a transaction time associated with each item.  Sequential rules scour the database to find all of the sequential patterns that comply with the minimum level of support (the percentage of data sequences that contain the pattern) specified by the user.  For example, 45% of a training company’s customers took its introductory course on word processing and later enrolled in its more advanced course on Microsoft Windows.  In this scenario, 45% is the support level.
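A minimal Python sketch of sequential support follows, using the definition given above (the percentage of data sequences that contain the pattern).  For simplicity it assumes each customer's sequence is a list of single courses in the order taken; the course labels, the twenty customer histories, and the function names are hypothetical.

```python
def contains_in_order(sequence, pattern):
    """True if the items of pattern appear in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # Each membership test consumes the iterator up to the match,
    # so later items must be found after earlier ones.
    return all(item in remaining for item in pattern)


def sequence_support(sequences, pattern):
    """Fraction of data sequences that contain the pattern."""
    hits = sum(1 for s in sequences if contains_in_order(s, pattern))
    return hits / len(sequences)


# Hypothetical enrollment histories for 20 customers.
enrollments = (
    [["word_processing_intro", "windows_advanced"]] * 9   # intro, then advanced
    + [["windows_advanced", "word_processing_intro"]] * 3  # wrong order
    + [["word_processing_intro"]] * 8                       # intro only
)

pattern = ["word_processing_intro", "windows_advanced"]
pattern_support = sequence_support(enrollments, pattern)
print(f"support = {pattern_support:.2f}")
```

Only the nine customers who took the courses in the stated order count toward the pattern, which is what distinguishes a sequential rule from a plain association.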

 

AprioriAll and AprioriSome Algorithms

Sequential analysis begins with a sorting phase, in which items or transactions are concatenated to form sequences; sequential algorithms can then be run to do the analysis and discover the underlying patterns.  The next step involves grouping all like itemsets and large sequences.  These large sequences are then tested against those in the customer database.  Records that do not fit the pattern are dropped from the newly transformed database, but are still counted in the total number of records.  The final phase in sequential analysis involves determining the maximal sequences.  This is accomplished by combing the data multiple times using one of two types of algorithms: count-all and count-some.  The familiar Apriori algorithm can be adapted to sequential rules in either form (AprioriAll and AprioriSome, respectively).

The count-all approach counts all large sequences, so its results must later be pruned to remove the non-maximal sequences.  The count-some approach starts by counting the longer sequences first; since many shorter sequences are contained in the longer ones, this limits the count to only the maximal sequences (no need for pruning).
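The pruning step that follows a count-all pass can be sketched briefly in Python.  This shows only the final filtering of non-maximal sequences, not the counting itself, and the sequences and function names are hypothetical.

```python
def is_subsequence(short, long):
    """True if short's items appear in long in the same order (gaps allowed)."""
    remaining = iter(long)
    return all(item in remaining for item in short)


def maximal_sequences(large_sequences):
    """Keep only the sequences that are not contained in another large sequence."""
    return [
        s for s in large_sequences
        if not any(s != other and is_subsequence(s, other) for other in large_sequences)
    ]


# Hypothetical output of a count-all pass: every sequence that met minimum support.
large = [
    ("a",), ("b",), ("c",), ("d",),
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("a", "b", "c"),
]

maximal = maximal_sequences(large)
print(maximal)
```

Everything built from a, b, and c is absorbed by the longest sequence, while the single-item sequence for d survives because no longer sequence contains it.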

 

Classification or Clustering Rule

 

            Classification or clustering rules attempt to create decision trees that label and shelve data into categories.  A company might “build a classification model to predict who is likely to purchase identified products or services” (Liu and Yap).  Or the company may “build a classification model to predict the likelihood of buying a product based on those customers that have been identified from association rules only” (Liu and Yap).  Classification analysis has been of particular interest to direct marketers for its ability to determine who would be best to target and then to actually gather the necessary individuals for the campaign.  Classification can also determine customer attrition (churn) and can be utilized in predicting a customer’s loyalty and/or likelihood of switching to a competitor. 

Classification has also been highly useful in detecting fraud in banks and credit card companies.  By labeling each transaction ‘honest’ or ‘fraud’ and analyzing purchases and payment history, the classification algorithm can detect fraud by monitoring transactions on the account.

Classification, or clustering, works by finding groups in which the data points are similar to one another and differ from the data points in other groups.  One of the earliest and most widely used classification algorithms is Hunt’s method, and many later algorithms were written based on its principles.  Hunt’s method constructs decision trees using binary tests to determine class distribution, choosing each test by a calculation based on either information theory (used in the ID3 and C4.5 algorithms) or the Gini index (used in the SLIQ algorithm).  Information theory (INFO) tends to result in many smaller clusters, whereas Gini tends to lump more data together in fewer large groups.
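The two calculations can be written down in a few lines of Python.  This sketch uses the standard definitions of entropy (the information theory measure) and the Gini index for a group of class labels; the transaction labels and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Information-theory impurity: sum of p * log2(1/p) over the classes."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


# Hypothetical groups of labeled transactions.
mixed = ["honest"] * 5 + ["fraud"] * 5    # evenly mixed group
skewed = ["honest"] * 9 + ["fraud"] * 1   # mostly one class
pure = ["honest"] * 10                    # homogeneous group

for name, group in [("mixed", mixed), ("skewed", skewed), ("pure", pure)]:
    print(f"{name}: entropy = {entropy(group):.3f}, gini = {gini(group):.3f}")
```

Both measures are zero for a homogeneous group and largest for an evenly mixed one; a tree-building algorithm prefers the test whose resulting groups score lowest.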

 

ID3 Algorithm

            The ID3 algorithm builds decision trees by testing the values of the properties of objects in the database.  The tree is built top-down: a property is tested at each node, and the results of the test partition the data until each leaf node contains homogeneous data.
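A compact Python sketch of this top-down construction is shown below.  It handles only categorical properties and does no pruning, so it should be read as an illustration of the idea rather than a full ID3 implementation.  The six customer records (loosely modeled on the Jaguar example) and all names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    """Entropy of the target labels in rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return sum((c / n) * log2(n / c) for c in counts.values())


def information_gain(rows, attr, target):
    """Drop in entropy obtained by partitioning rows on attr."""
    n = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(rows, target) - remainder


def id3(rows, attrs, target):
    """Return a nested dict {attr: {value: subtree}} or a class label at a leaf."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:        # homogeneous data: stop with a leaf
        return labels[0]
    if not attrs:                    # nothing left to test: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {
        best: {
            value: id3([r for r in rows if r[best] == value], rest, target)
            for value in {r[best] for r in rows}
        }
    }


# Hypothetical customer records.
customers = [
    {"income": "high", "age": "under45", "buys": "yes"},
    {"income": "high", "age": "under45", "buys": "yes"},
    {"income": "high", "age": "over45", "buys": "no"},
    {"income": "low", "age": "under45", "buys": "no"},
    {"income": "low", "age": "over45", "buys": "no"},
    {"income": "low", "age": "under45", "buys": "no"},
]

tree = id3(customers, ["income", "age"], "buys")
print(tree)
```

Income is tested at the root because it separates buyers from non-buyers better than age does; age is then tested only within the high-income branch.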

 

C4.5 Algorithm

            The C4.5 algorithm is also based on the principles of Hunt’s method.  It attempts to accomplish the same things as ID3 and follows a depth-first strategy: it generates a decision tree by first considering all possible tests and beginning with the one that will provide the most information gain.

 

SLIQ Algorithm

            Supervised Learning in Quest (SLIQ) generates a decision tree in a breadth-first fashion.  Data is pre-sorted, and a class list is maintained instead of splitting attribute lists.  This method, though cost-effective, consumes an excessive amount of memory.  To remove this limitation, IBM developed a new version of the algorithm with no memory restrictions, called ‘SPRINT’ (Scalable Parallelizable Induction of Decision Trees).  SPRINT works similarly to SLIQ but, much like the distributed/parallel association algorithms, it distributes the processing across multiple sites so as to generate results quickly.

 

Other Types of Algorithms

 

            Other types of machine learning algorithms include: Nearest-Neighbor, Naïve-Bayes, OODG (Oblivious Read-Once Decision Graph), and Lazy Decision Trees.  Also, hybrids are often created and/or used in practice so they can be tailored to serve the needs of the users.  As I mentioned earlier, though machine learning algorithms have had the widest use thus far, many other methods are used as well. 

 

REAL-LIFE EXAMPLES

 

            There are many real-life data mining success stories.  The phone company U S West Inc. wanted to pinpoint customers who would install second phone lines and keep them long enough for the carrier to make a profit.  The company designed a data mining program called “PALMS”, which runs on a powerful NCR parallel-processing computer and was ‘told’ to provide a statistical model of the ideal prospect.  Using this information, marketers launched a campaign targeting the clusters of prospects that fit the profile, which were also identified in the database through the use of PALMS.  The marketers chose to relay their message through several direct mail campaigns, “which ran from November 4 to early January.  U S West has enjoyed a response rate equal to that of a broadcast campaign costing ‘several million dollars’ more” (Verity).

Wal-Mart has been collecting transaction data through its cash registers since the early 1980s.  The company was “faced with a mind-boggling 700 million potential forecasts to calculate – one for each item in 2,700 stores” (Verity) but was unable to use all of the data, until the introduction of data mining.  Recently, Wal-Mart has taken advantage of knowledge discovery software to predict demand for individual items in specific stores, and to work on improving accuracy on their market-basket analysis (examining the combinations of items that customers purchase together). 

 

CONCLUSION

 

I’m sure we all remember the 1997 chess match between human champion Garry Kasparov and IBM’s Deep Blue supercomputer.  The implications of that chess match ignited serious controversy over artificial intelligence.  Data mining is a form of artificial intelligence that has changed the face of all facets of business, particularly marketing.  The chess match may have left many of us sympathizing with Garry and forming negative opinions about AI, but as we’ve seen throughout this paper, the positive outcomes cannot be ignored.  Of course, I wouldn’t suggest ignoring human predictions and intuition altogether.  “It’s a good idea to combine analytical results with business intuition” (Liu and Yap).  There should always be a balance between data-driven and user-driven analysis.

Relationships in data that would otherwise be overlooked can be exploited through data mining technology, saving a great deal of time and money.  Today, many companies have a significant edge over the competition simply because of the investment they make in knowledge discovery.  As the saying goes, ‘knowledge is power’.

 


WORKS CITED

Carickhoff, Rich.  “A New Face for OLAP”.  DBMS Magazine, January 1997. 

 

DiCarlo, Lisa.  “The Rebirth of Artificial Intelligence”.  Forbes Magazine, May 16, 2000. 

 

Gilman, Michael, PhD.  “Data Mining Overview”.  The Direct Marketing Association White Papers, 2000.

 

Greening, Dan R.  “Data Mining on the Web – There’s Gold in that Mountain of Data”.  DBMS Magazine, 2000.

 

Joshi, Karuna Pande.  “Analysis of Data Mining Algorithms”.  1997.

 

Liu, Shiping and Jeremy Yap.  “Beyond Intuition”.  DB2 Magazine, Quarter 4, 2001. 

 

Verity, John W.  “Coaxing Meaning Out of Raw Data – Software can now Find Patterns Never Seen Before”.  Business Week, May 1997.

 

HTTP://WWW.WHATIS.COM

HTTP://WWW.YAHOO.COM