
Probably the widest applications of data-mining techniques are in marketing for tasks such as targeted marketing, online advertising, and recommendations for cross-selling. Data mining is used for general customer relationship management to analyze customer behavior in order to manage attrition and maximize expected customer value. The finance industry uses data mining for credit scoring and trading, and in operations via fraud detection and workforce management. Major retailers from Walmart to Amazon apply data mining throughout their businesses, from marketing to supply-chain management.

Many firms have differentiated themselves strategically with data science, sometimes to the point of evolving into data mining companies. The primary goals of this book are to help you view business problems from a data perspective and understand principles of extracting useful knowledge from data. There is a fundamental structure to data-analytic thinking, and basic principles that should be understood.

There are also particular areas where intuition, creativity, common sense, and domain knowledge must be brought to bear. A data perspective will provide you with structure and principles, and this will give you a framework to systematically analyze such problems. As you get better at data-analytic thinking you will develop intuition as to how and where to apply creativity and domain knowledge.

Throughout the first two chapters of this book, we will discuss in detail various topics and techniques related to data science and data mining. At a high level, data science is a set of fundamental principles that guide the extraction of knowledge from data. Data mining is the extraction of knowledge from data, via technologies that incorporate these principles. It is important to understand data science even if you never intend to apply it yourself.

Data-analytic thinking enables you to evaluate proposals for data mining projects. For example, if an employee, a consultant, or a potential investment target proposes to improve a particular business application by extracting knowledge from data, you should be able to assess the proposal systematically and decide whether it is sound or flawed. This does not mean that you will be able to tell whether it will actually succeed—for data mining projects, that often requires trying—but you should be able to spot obvious flaws, unrealistic assumptions, and missing pieces.

Throughout the book we will describe a number of fundamental data science principles, and will illustrate each with at least one data mining technique that embodies the principle. For each principle there are usually many specific techniques that embody it, so in this book we have chosen to emphasize the basic principles in preference to specific techniques. That said, we will not make a big deal about the difference between data science and data mining, except where it has a substantial effect on understanding the actual concepts. Consider an example from a New York Times story from 2004: Hurricane Frances was on its way toward Florida. Residents made for higher ground, but far away, in Bentonville, Ark., executives at Wal-Mart Stores decided that the situation offered a great opportunity for one of their data-driven weapons: predictive technology.

Consider why data-driven prediction might be useful in this scenario. It might be useful to predict that people in the path of the hurricane would buy more bottled water. Maybe, but this point seems a bit obvious, and why would we need data science to discover it?

1. Introduction: Data-Analytic Thinking - Data Science for Business [Book]

It might be useful to project the amount of increase in sales due to the hurricane, to ensure that local Wal-Marts are properly stocked. The prediction could be somewhat useful, but it is probably more general than what Ms. Dillman had in mind. It would be more valuable to discover patterns due to the hurricane that were not obvious. To do this, analysts might examine the huge volume of Wal-Mart data from prior, similar situations (such as Hurricane Charley) to identify unusual local demand for products.

Indeed, that is what happened. The New York Times (Hays, 2004) reported on the findings, which Wal-Mart's chief information officer, Linda Dillman, described in an interview. How are such data analyses performed? Consider a second, more typical business scenario and how it might be treated from a data perspective. This problem will serve as a running example that will illuminate many of the issues raised in this book and provide a common frame of reference. Assume you just landed a great analytical job with MegaTelCo, one of the largest telecommunication firms in the United States.

They are having a major problem with customer retention in their wireless business. Since the cell phone market is now saturated, the huge growth in the wireless market has tapered off. Customers switching from one company to another is called churn, and it is expensive all around: attracting new customers is much more expensive than retaining existing ones, so a good deal of marketing budget is allocated to preventing churn. You have been called in to help understand the problem and to devise a solution. Marketing has already designed a special retention offer.

Think carefully about what data you might use and how they would be used. Specifically, how should MegaTelCo choose a set of customers to receive their offer in order to best reduce churn for a particular incentive budget? Answering this question is much more complicated than it may seem initially. We will return to this problem repeatedly through the book, adding sophistication to our solution as we develop an understanding of the fundamental data science concepts.
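To see why the question is not trivial, consider a naive first cut. The sketch below is a hypothetical illustration, not MegaTelCo data: it assumes a model has already estimated each customer's churn probability and value, and it simply ranks customers by expected loss and spends the incentive budget greedily.

```python
# A minimal sketch of budget-constrained churn targeting.
# All names and numbers are invented for illustration.

def choose_offer_targets(customers, budget, offer_cost):
    """Rank customers by expected value at risk and select as many as the budget allows."""
    # Expected loss if we do nothing: P(churn) * customer value.
    ranked = sorted(customers,
                    key=lambda c: c["p_churn"] * c["value"],
                    reverse=True)
    n_offers = int(budget // offer_cost)
    return [c["id"] for c in ranked[:n_offers]]

customers = [
    {"id": "A", "p_churn": 0.9, "value": 100},   # expected loss 90
    {"id": "B", "p_churn": 0.2, "value": 1000},  # expected loss 200
    {"id": "C", "p_churn": 0.5, "value": 50},    # expected loss 25
]
print(choose_offer_targets(customers, budget=20, offer_cost=10))  # → ['B', 'A']
```

Note what this greedy ranking ignores: whether the offer actually changes a customer's behavior, and what the offer costs relative to the value saved. Those complications are exactly the sophistication the book adds to this running example later.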

In reality, customer retention has been a major use of data mining technologies, especially in the telecommunications and finance businesses. These industries more generally were some of the earliest and widest adopters of data mining technologies, for reasons discussed later. Data science involves principles, processes, and techniques for understanding phenomena via the automated analysis of data. In this book, we will view the ultimate goal of data science as improving decision making, as this generally is of direct interest to business.

This focus on decision making distinguishes data science from other aspects of data processing that are gaining increasing attention in business. Data-driven decision-making (DDD) refers to the practice of basing decisions on the analysis of data, rather than purely on intuition. For example, a marketer could select advertisements based purely on her long experience in the field and her eye for what will work.

Or, she could base her selection on the analysis of data regarding how consumers react to different ads. She could also use a combination of these approaches.
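The data-driven half of that combination can be as simple as comparing observed click-through rates. This is a toy sketch with invented ad names and counts, not a method from the book:

```python
# Hypothetical illustration of data-driven ad selection:
# compare each ad's observed click-through rate (CTR) instead of guessing.

def click_through_rates(ad_stats):
    """Compute clicks / impressions for each ad."""
    return {ad: clicks / impressions
            for ad, (impressions, clicks) in ad_stats.items()}

ad_stats = {
    "ad_alpha": (10_000, 120),  # (impressions, clicks) -- invented numbers
    "ad_beta":  (10_000, 310),
    "ad_gamma": (10_000, 95),
}
ctr = click_through_rates(ad_stats)
best_ad = max(ctr, key=ctr.get)
print(best_ad)  # → ad_beta
```

In the combined approach the marketer describes, data like this would narrow the candidates, and experience would inform the final choice.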



DDD is not an all-or-nothing practice, and different firms engage in DDD to greater or lesser degrees. The benefits of data-driven decision-making have been demonstrated conclusively.


Economist Erik Brynjolfsson and his colleagues developed a measure of DDD that rates firms as to how strongly they use data to make decisions across the company. They show that, statistically, the more data-driven a firm is, the more productive it is, even controlling for a wide range of possible confounding factors. And the differences are not small. DDD also is correlated with higher return on assets, return on equity, asset utilization, and market value, and the relationship seems to be causal.

The sort of decisions we will be interested in in this book mainly fall into two types: (1) decisions for which "discoveries" need to be made within data, and (2) decisions that repeat, especially at massive scale, and so may benefit from even small increases in decision-making accuracy. The Walmart example above illustrates a type 1 problem. So does a well-known example from the retailer Target. Consumers tend to have inertia in their habits, and getting them to change is very difficult. Decision makers at Target knew, however, that the arrival of a new baby in a family is one point where people do change their shopping habits significantly.

Since most birth records are public, retailers obtain information on births and send out special offers to the new parents. However, Target wanted to get a jump on their competition. They were interested in whether they could predict that people were expecting a baby. If they could, they would gain an advantage by making offers before their competitors. Using techniques of data science, Target analyzed historical data on customers who later were revealed to have been pregnant, and was able to extract information that could predict which consumers were pregnant.

For example, pregnant women often change their diets, their wardrobes, their vitamin regimens, and so on. These indicators could be extracted from historical data, assembled into predictive models, and then deployed in marketing campaigns. We will discuss predictive models in much detail as we go through the book. For the time being, it is sufficient to understand that a predictive model abstracts away most of the complexity of the world, focusing in on a particular set of indicators that correlate in some way with a quantity of interest (who will churn, who will purchase, who is pregnant, etc.).
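The idea of a model that focuses on a handful of correlated indicators can be sketched in a few lines. This is a toy illustration: the indicator names and weights are invented, not Target's actual model, and in practice the weights would be learned from historical data rather than set by hand.

```python
import math

# Toy predictive model: combine a few binary indicators into a probability
# with a logistic function. Indicators and weights are invented for illustration.
WEIGHTS = {
    "bought_prenatal_vitamins": 2.0,
    "bought_unscented_lotion": 1.0,
    "bought_maternity_wear": 2.5,
}
BIAS = -3.0  # base-rate term: most customers are not expecting

def predict_probability(indicators):
    """Map a dict of 0/1 indicators to a probability between 0 and 1."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in indicators.items())
    return 1 / (1 + math.exp(-score))

p = predict_probability({"bought_prenatal_vitamins": 1,
                         "bought_unscented_lotion": 1,
                         "bought_maternity_wear": 0})
print(round(p, 2))  # → 0.5
```

Note how much of the world the model ignores: everything except three indicators. That abstraction is precisely what makes such models practical at scale.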

Importantly, in both the Walmart and the Target examples, the data analysis was not testing a simple hypothesis. Instead, the data were explored with the hope that something useful would be discovered. Our churn example illustrates a type 2 DDD problem. MegaTelCo has hundreds of millions of customers, each a candidate for defection. Tens of millions of customers have contracts expiring each month, so each one of them has an increased likelihood of defection in the near future. If we can improve our ability to estimate, for a given customer, how profitable it would be for us to focus on her, we can potentially reap large benefits by applying this ability to the millions of customers in the population.

This same logic applies to many of the areas where we have seen the most intense application of data science and data mining, such as targeted marketing, online advertising, credit scoring, and fraud detection. It highlights the often overlooked fact that, increasingly, business decisions are being made automatically by computer systems. Different industries have adopted automatic decision-making at different rates.

The finance and telecommunications industries were early adopters, largely because of their precocious development of data networks and implementation of massive-scale computing, which allowed the aggregation and modeling of data at a large scale, as well as the application of the resultant models to decision-making. In the 1990s, automated decision-making changed the banking and consumer credit industries dramatically. In the 1990s, banks and telecommunications companies also implemented massive-scale systems for managing data-driven fraud control decisions. As retail systems were increasingly computerized, merchandising decisions were automated.



Currently we are seeing a revolution in advertising, due in large part to a huge increase in the amount of time consumers are spending online, and the ability online to make literally split-second advertising decisions. It is important to digress here to address another point. There is a lot to data processing that is not data science—despite the impression one might get from the media.


Data engineering and processing are critical to support data science, but they are more general: these days, many data processing skills, systems, and technologies are mistakenly cast as data science. To understand data science and data-driven businesses it is important to understand the differences.

Data science needs access to data and it often benefits from sophisticated data engineering that data processing technologies may facilitate, but these technologies are not data science technologies per se. Data processing technologies are very important for many data-oriented business tasks that do not involve extracting knowledge or data-driven decision-making, such as efficient transaction processing, modern web system processing, and online advertising campaign management.

Big data essentially means datasets that are too large for traditional data processing systems, and that therefore require new processing technologies. As with the traditional technologies, big data technologies are used for many tasks, including data engineering. Occasionally, big data technologies are actually used for implementing data mining techniques. One recent study finds that, after controlling for various possible confounding factors, using big data technologies is associated with significant additional productivity growth. This leads to potentially very large productivity differences between the firms at the extremes.

One way to think about the state of big data technologies is to draw an analogy with the business adoption of Internet technologies. We can think of ourselves as being in the era of Big Data 1.0. Firms are busying themselves with building the capabilities to process large data, largely in support of their current operations—for example, to improve efficiency.

Once firms had incorporated Web 1.0 technologies thoroughly, they started to look further. We should expect a Big Data 2.0 phase to follow Big Data 1.0.


Once firms have become capable of processing massive data in a flexible fashion, they should begin asking what they can now do that they couldn't do before, or do better than before. The principles and techniques we introduce in this book will be applied far more broadly and deeply than they are today. It is important to note that in the Web 1.0 era, some precocious companies were already applying Web 2.0 ideas. Similarly, we see some companies already applying Big Data 2.0 ideas. Amazon again is a company at the forefront, providing data-driven recommendations from massive data. There are other examples as well. Online advertisers must process extremely large volumes of data (billions of ad impressions per day is not unusual) and maintain a very high throughput (real-time bidding systems make decisions in tens of milliseconds).

We should look to these and similar industries for hints at advances in big data and data science that subsequently will be adopted by other industries. The prior sections suggest one of the fundamental principles of data science: data, and the capability to extract useful knowledge from data, should be regarded as key strategic assets. Too many businesses regard data analytics as pertaining mainly to realizing value from some existing data, often without careful regard to whether the business has the appropriate analytical talent. Viewing data and data science capability as assets allows us to think explicitly about the extent to which one should invest in them.

Further, thinking of these as assets should lead us to the realization that they are complementary. The best data science team can yield little value without the appropriate data; the right data often cannot substantially improve decisions without suitable data science talent.



As with all assets, it is often necessary to make investments. Building a top-notch data science team is a nontrivial undertaking, but can make a huge difference for decision-making. Our next case study will introduce the idea that thinking explicitly about how to invest in data assets very often pays off handsomely. The classic story of little Signet Bank from the 1990s provides a case in point. Previously, in the 1980s, data science had transformed the business of consumer credit. Modeling the probability of default had changed the industry from personal assessment of the likelihood of default to strategies of massive scale and market share, which brought along concomitant economies of scale.

It may seem strange now, but at the time, credit cards essentially had uniform pricing, for two reasons: companies did not have adequate information systems to deal with differentiated pricing at massive scale, and bank management believed customers would not stand for price discrimination. Two strategic visionaries, Richard Fairbanks and Nigel Morris, realized that information technology was powerful enough that they could do more sophisticated predictive modeling—using the sort of techniques that we discuss throughout this book—and offer different terms to different customers. These two men had no success persuading the big banks to take them on as consultants and let them try.

Finally, after running out of big banks, they succeeded in garnering the interest of a small regional Virginia bank: Signet Bank. You can read more about the specifics elsewhere.

Besides, we can use the clues in the tweets themselves to reliably extract retweet information with a simple regular expression. By convention, Twitter usernames begin with an @ symbol and can only include letters, numbers, and underscores. Thus, given the conventions for retweeting, we only have to search for the following patterns: RT followed by a username, or via followed by a username. Since neither of the example tweets contains both of the groups enclosed in the parenthetical expressions, one string is empty in each of the tuples.
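A sketch of that regular expression approach follows. The exact pattern in the book's code may differ; this one captures "RT" or "via" followed by one or more @usernames, and the example tweets are invented:

```python
import re

# Match "RT" or "via" followed by one or more @usernames.
# The second group is the run of mentions; usernames are pulled out afterward.
rt_pattern = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)

def extract_retweet_sources(tweet_text):
    """Return the usernames a tweet credits via the RT/via conventions."""
    sources = []
    for _verb, mention_group in rt_pattern.findall(tweet_text):
        sources.extend(re.findall(r"@(\w+)", mention_group))
    return sources

print(extract_retweet_sources("RT @user1: data science is fun"))   # → ['user1']
print(extract_retweet_sources("great point (via @user2 @user3)"))  # → ['user2', 'user3']
```

When a tweet matches only one of the two conventions, the other captured group comes back empty, which is the behavior described above.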

Regular expressions are a basic programming concept whose explanation is outside the scope of this book. We can use one to build a graph of retweet relationships, in which each edge carries a payload of the tweet ID and the tweet text. The basic steps involved are generalizing a routine for extracting usernames in retweets, flattening out the pages of tweets into a flat list for easier processing in a loop, and finally, iterating over the tweets and adding edges to a graph.

Building and analyzing a graph describing who retweeted whom. For example, the number of nodes in the graph tells us how many users were involved in retweet relationships with one another, and the number of edges tells us how many such relationships exist.
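The book's code uses a graph library for this step; as a dependency-free sketch of the same idea, a directed graph of who retweeted whom can be kept in plain dictionaries, with the tweet carried as the edge payload. The tweets below are invented examples, not harvested data:

```python
# Dependency-free sketch of building a "who retweeted whom" graph.
# Edges are (retweeter, original_author) pairs; values are the payload.

def build_retweet_graph(tweets):
    """Each tweet is a dict with 'id', 'author', 'text', and 'sources'
    (the users credited via RT/via, extracted beforehand)."""
    edges = {}
    nodes = set()
    for tweet in tweets:
        for source in tweet["sources"]:
            nodes.update([tweet["author"], source])
            # Edge payload: the tweet ID and the tweet text.
            edges[(tweet["author"], source)] = {"id": tweet["id"],
                                                "text": tweet["text"]}
    return nodes, edges

tweets = [
    {"id": 1, "author": "alice", "text": "RT @bob: hello", "sources": ["bob"]},
    {"id": 2, "author": "carol", "text": "nice (via @bob)", "sources": ["bob"]},
]
nodes, edges = build_retweet_graph(tweets)
print(sorted(nodes))  # → ['alice', 'bob', 'carol']
print(len(edges))     # → 2
```

Counting how many edges touch each node gives the degree distribution discussed next.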


In this case, most of the values are 1, meaning all of those nodes have a degree of 1 and are connected to only one other node in the graph. A few values are between 2 and 9, indicating that those nodes are connected to anywhere between 2 and 9 other nodes. The extreme outlier is a single node with a far higher degree.

Graphviz is a staple in the visualization community. This section introduces one possible approach for visualizing graphs of tweet data: exporting them to Graphviz's DOT language. Graphviz binaries for all platforms can be downloaded from its official website, and the installation is straightforward regardless of platform.

Generating DOT language output is easy regardless of platform. With DOT language output on hand, the next step is to convert it into an image. You can read more about the various Graphviz options in the online documentation. Visual inspection of the entire graphic file confirms that the characteristics of the graph align with our previous analysis, and we can visually confirm that the node with the highest degree is @justinbieber, the subject of so much discussion and, in case you missed that episode of SNL, the guest host of the evening.
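As an illustration of the DOT-generation step, here is a minimal sketch that serializes retweet edges to DOT; the edges, graph name, and filenames are arbitrary choices for the example:

```python
# Minimal sketch: serialize retweet edges to the Graphviz DOT language.
# The edges here are invented examples.

def to_dot(edges, graph_name="retweets"):
    """Emit a DOT digraph with one edge per (retweeter, original_author) pair."""
    lines = ["digraph %s {" % graph_name]
    for retweeter, source in edges:
        # DOT edge statement: retweeter -> original author
        lines.append('  "%s" -> "%s";' % (retweeter, source))
    lines.append("}")
    return "\n".join(lines)

dot = to_dot([("alice", "bob"), ("carol", "bob")])
print(dot)
```

Written to a file such as `retweets.dot`, the output could then be rendered with one of Graphviz's standard layout commands, e.g. `dot -Tpng retweets.dot -o retweets.png`.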


Keep in mind that if we had harvested a lot more tweets, it is very likely that we would have seen many more interconnected subgraphs than are evidenced in the sampling of tweets that we have been analyzing. Further analysis of the graph is left as a voluntary exercise for the reader, as the primary objective of this chapter was to get your development environment squared away and whet your appetite for more interesting topics.

In addition to spitting some useful information out to the console, it accepts a search term as a command-line parameter; fetches and parses the data; and pops up your web browser to visualize it as an interactive HTML5-based graph. It is available through the official code repository for this book. You are highly encouraged to try it out. The boilerplate in the sample script is just the beginning—much more can be done!

Windows users can use GVedit instead of interacting with Graphviz at the command prompt. The ability to work around this issue fairly easily by generating DOT language output may be partly responsible for why it has remained unresolved for so long.
