Visualize This - Nathan Yau - E-Book

Visualize This E-Book

Nathan Yau

Description

Practical data design tips from a data visualization expert of the modern age.

Data doesn't decrease; it is ever-increasing and can be overwhelming to organize in a way that makes sense to its intended audience. Wouldn't it be wonderful if we could actually visualize data in such a way that we could maximize its potential and tell a story in a clear, concise manner? Thanks to the creative genius of Nathan Yau, we can. With this full-color book, data visualization guru and author Nathan Yau uses step-by-step tutorials to show you how to visualize and tell stories with data. He explains how to gather, parse, and format data and then design high quality graphics that help you explore and present patterns, outliers, and relationships.

* Presents a unique approach to visualizing and telling stories with data, from a data visualization expert and the creator of flowingdata.com, Nathan Yau
* Offers step-by-step tutorials and practical design tips for creating statistical graphics, geographical maps, and information design to find meaning in the numbers
* Details tools that can be used to visualize data-native graphics for the Web, such as ActionScript, Flash libraries, PHP, and JavaScript and tools to design graphics for print, such as R and Illustrator
* Contains numerous examples and descriptions of patterns and outliers and explains how to show them

Visualize This demonstrates how to explain data visually so that you can present your information in a way that is easy to understand and appealing.


Pages: 356

Year of publication: 2011




Table of Contents

Cover

Chapter 1: Telling Stories with Data

More Than Numbers

What to Look For

Design

Wrapping Up

Chapter 2: Handling Data

Gather Data

Formatting Data

Wrapping Up

Chapter 3: Choosing Tools to Visualize Data

Out-of-the-Box Visualization

Programming

Illustration

Mapping

Survey Your Options

Wrapping Up

Chapter 4: Visualizing Patterns over Time

What to Look for over Time

Discrete Points in Time

Continuous Data

Wrapping Up

Chapter 5: Visualizing Proportions

What to Look for in Proportions

Parts of a Whole

Proportions over Time

Wrapping Up

Chapter 6: Visualizing Relationships

What Relationships to Look For

Correlation

Distribution

Comparison

Wrapping Up

Chapter 7: Spotting Differences

What to Look For

Comparing across Multiple Variables

Reducing Dimensions

Searching for Outliers

Wrapping Up

Chapter 8: Visualizing Spatial Relationships

What to Look For

Specific Locations

Regions

Over Space and Time

Wrapping Up

Chapter 9: Designing with a Purpose

Prepare Yourself

Prepare Your Readers

Visual Cues

Good Visualization

Wrapping Up

Introduction

Learning Data

Chapter 1

Telling Stories with Data

Think of all the popular data visualization works out there: the ones that you always hear about in lectures or read about in blogs, and the ones that popped into your head as you were reading this sentence. What do they all have in common? They all tell an interesting story. Maybe the story was to convince you of something. Maybe it was to compel you to action, enlighten you with new information, or force you to question your own preconceived notions of reality. Whatever it is, the best data visualization, big or small, for art or a slide presentation, helps you see what the data have to say.

More Than Numbers

Face it. Data can be boring if you don’t know what you’re looking for or don’t know that there’s something to look for in the first place. It’s just a mix of numbers and words that mean nothing other than their raw values. The great thing about statistics and visualization is that they help you look beyond that. Remember, data is a representation of real life. It’s not just a bucket of numbers. There are stories in that bucket. There’s meaning, truth, and beauty. And just like real life, sometimes the stories are simple and straightforward; and other times they’re complex and roundabout. Some stories belong in a textbook. Others come in novel form. It’s up to you, the statistician, programmer, designer, or data scientist to decide how to tell the story.

This was one of the first things I learned as a statistics graduate student. I have to admit that before entering the program, I thought of statistics as pure analysis, and I thought of data as the output of a mechanical process. This is actually the case a lot of the time. I mean, I did major in electrical engineering, so it’s not all that surprising I saw data in that light.

Don’t get me wrong. That’s not necessarily a bad thing, but what I’ve learned over the years is that data, while objective, often has a human dimension to it.

For example, look at unemployment again. It’s easy to spout state averages, but as you’ve seen, it can vary a lot within the state. It can vary a lot by neighborhood. Probably someone you know lost a job over the past few years, and as the saying goes, they’re not just another statistic, right? The numbers represent individuals, so you should approach the data in that way. You don’t have to tell every individual’s story. However, there’s a subtle yet important difference between the unemployment rate increasing by 5 percentage points and several hundred thousand people left jobless. The former reads as a number without much context, whereas the latter is more relatable.

Journalism

A graphics internship at The New York Times drove the point home for me. It was only for 3 months during the summer after my second year of graduate school, but it’s had a lasting impact on how I approach data. I didn’t just learn how to create graphics for the news. I learned how to report data as the news, and with that came a lot of design, organization, fact checking, sleuthing, and research.

There was one day when my only goal was to verify three numbers in a dataset, because when The New York Times graphics desk creates a graphic, it makes sure what it reports is accurate. Only after we knew the data was reliable did we move on to the presentation. It’s this attention to detail that makes its graphics so good.

Take a look at any New York Times graphic. It presents the data clearly, concisely, and ever so nicely. What does that mean though? When you look at a graphic, you get the chance to understand the data. Important points or areas are annotated; symbols and colors are carefully explained in a legend or with points; and the Times makes it easy for readers to see the story in the data. It’s not just a graph. It’s a graphic.

The graphic in Figure 1-1 is similar to what you will find in The New York Times. It shows the increasing probability that you will die within one year given your age.

Figure 1-1: Probability of death given your age

Check out some of the best New York Times graphics at http://datafl.ws/nytimes.

The base of the graphic is simply a line chart. However, design elements help tell the story better. Labeling and pointers provide context and help you see why the data is interesting; and line width and color direct your eyes to what’s important.

Chart and graph design isn’t just about making statistical visualization but also explaining what the visualization shows.

Note

See Geoff McGhee’s video documentary “Journalism in the Age of Data” for more on how journalists use data to report current events. This includes great interviews with some of the best in the business.

Art

The New York Times is objective. It presents the data and gives you the facts. It does a great job at that. On the opposite side of the spectrum, visualization is less about analytics and more about tapping into your emotions. Jonathan Harris and Sep Kamvar did this quite literally in We Feel Fine (Figure 1-2).

Figure 1-2: We Feel Fine by Jonathan Harris and Sep Kamvar

The interactive piece scrapes sentences and phrases from personal public blogs and then visualizes them as a box of floating bubbles. Each bubble represents an emotion and is color-coded accordingly. As a whole, it is like individuals floating through space, but watch a little longer and you see bubbles start to cluster. Apply sorts and categorization through the interface to see how these seemingly random vignettes connect. Click an individual bubble to see a single story. It’s poetic and revealing at the same time.

Interact and explore people’s emotions in Jonathan Harris and Sep Kamvar’s live and online piece at http://wefeelfine.org.

There are lots of other examples such as Golan Levin’s The Dumpster, which explores blog entries that mention breaking up with a significant other; Kim Asendorf’s Sumedicina, which tells a fictional story of a man running from a corrupt organization, not with words but with graphs and charts; or Andreas Nicolas Fischer’s physical sculptures that show economic downturn in the United States.

See FlowingData for many more examples of art and data at http://datafl.ws/art.

The main point is that data and visualization don’t always have to be just about the cold, hard facts. Sometimes you’re not looking for analytical insight. Rather, sometimes you can tell the story from an emotional point of view that encourages viewers to reflect on the data. Think of it like this. Not all movies have to be documentaries, and not all visualization has to be traditional charts and graphs.

Entertainment

Somewhere in between journalism and art, visualization has also found its way into entertainment. If you think of data in the more abstract sense, outside of spreadsheets and comma-delimited text files, where photos and status updates also qualify, this is easy to see.

Facebook used status updates to gauge the happiest day of the year, and online dating site OkCupid used online information to estimate the lies people tell to make their digital selves look better, as shown in Figure 1-3. These analyses had little to do with improving a business, increasing revenues, or finding glitches in a system. They circulated the web like wildfire because of their entertainment value. The data revealed a little bit about ourselves and society.

Facebook found the happiest day to be Thanksgiving, and OkCupid found that people tend to exaggerate their height by about 2 inches.

Figure 1-3: Male Height Distribution on OkCupid

Check out the OkTrends blog for more revelations from online dating such as what white people really like and how not to be ugly by accident: http://blog.okcupid.com.

Compelling

Of course, stories aren’t always to keep people informed or entertained. Sometimes they’re meant to provide urgency or compel people to action. Who can forget that point in An Inconvenient Truth when Al Gore stands on that scissor lift to show rising levels of carbon dioxide?

For my money though, no one has done this better than Hans Rosling, professor of International Health and director of the Gapminder Foundation. Using a tool called Trendalyzer, as shown in Figure 1-4, Rosling runs an animation that shows changes in poverty by country. He does this during a talk that first draws you in deep to the data and by the end, everyone is on their feet applauding. It’s an amazing talk, so if you haven’t seen it yet, I highly recommend it.

The visualization itself is fairly basic. It’s a motion chart. Bubbles represent countries and move based on the corresponding country’s poverty during a given year. Why is the talk so popular then? Because Rosling speaks with conviction and excitement. He tells a story. How often have you seen a presentation with charts and graphs that put everyone to sleep? Instead Rosling gets the meaning of the data and uses that to his advantage. Plus, the sword-swallowing at the end of his talk drives the point home. After I saw Rosling’s talk, I wanted to get my hands on that data and take a look myself. It was a story I wanted to explore, too.

Figure 1-4: Trendalyzer by the Gapminder Foundation

Watch Hans Rosling wow the audience with data and an amazing demonstration at http://datafl.ws/hans.

I later saw a Gapminder talk on the same topic with the same visualizations but with a different speaker. It wasn’t nearly as exciting. To be honest, it was kind of a snoozer. There wasn’t any emotion. I didn’t feel any conviction or excitement about the data. So it’s not just the data that makes for interesting chatter. It’s how you present it and design it that can help people remember.

When it’s all said and done, here’s what you need to know. Approach visualization as if you were telling a story. What kind of story are you trying to tell? Is it a report, or is it a novel? Do you want to convince people that action is necessary?

Think character development. Every data point has a story behind it in the same way that every character in a book has a past, present, and future. There are interactions and relationships between those data points. It’s up to you to find them. Of course, before expert storytellers write novels, they must first learn to construct sentences.

What to Look For

Okay, stories. Check. Now what kind of stories do you tell with data? Well, the specifics vary by dataset, but generally speaking, you should always be on the lookout for these two things whatever your graphic is for: patterns and relationships.

Patterns

Stuff changes as time goes by. You get older, your hair grays, and your sight starts to get kind of fuzzy (Figure 1-5). Prices change. Logos change. Businesses are born. Businesses die. Sometimes these changes are sudden and without warning. Other times the change happens so slowly you don’t even notice.

Figure 1-5: A comic look at aging

Whatever it is you’re looking at, the change itself can be interesting as can the changing process. It is here you can explore patterns over time. For example, say you looked at stock prices over time. They of course increase and decrease, but by how much do they change per day? Per week? Per month? Are there periods when the stock went up more than usual? If so, why did it go up? Were there any specific events that triggered the change?

As you can see, a single question can lead you to additional questions. This applies not just to time series data but to all types of data. Try to approach your data in a more exploratory fashion, and you’ll most likely end up with more interesting answers.

You can split your time series data in different ways. In some cases it makes sense to show hourly or daily values. Other times, it could be better to see that data on a monthly or annual basis. When you go with the former, your time series plot could show more noise, whereas the latter is more of an aggregate view.

Those with websites and some analytics software in place can identify with this quickly. When you look at traffic to your site on a daily basis, as shown in Figure 1-6, the graph is bumpier. There are a lot more fluctuations.

Figure 1-6: Daily unique visitors to FlowingData

When you look at it on a monthly basis, as shown in Figure 1-7, fewer data points are on the same graph, covering the same time span, so it looks much smoother.

I’m not saying one graph is better than the other. In fact, they can complement each other. How you split your data depends on how much detail you need (or don’t need).
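The difference between the daily and monthly views can be sketched in a few lines of code. This is a minimal illustration in Python rather than the tools used for the figures, and the visitor counts are made up for the example; `monthly_totals` is a hypothetical helper, not part of any analytics package.

```python
from collections import defaultdict
from datetime import date, timedelta


def monthly_totals(daily_counts):
    """Sum daily values into (year, month) buckets."""
    totals = defaultdict(int)
    for day, count in daily_counts.items():
        totals[(day.year, day.month)] += count
    return dict(sorted(totals.items()))


# Hypothetical daily visitor counts: 90 days starting 2010-01-01,
# alternating between a quiet day (100) and a busy day (150).
start = date(2010, 1, 1)
daily = {start + timedelta(days=i): 100 + (i % 2) * 50 for i in range(90)}

monthly = monthly_totals(daily)

# The same time span is now described by far fewer points.
print(len(daily), "daily points ->", len(monthly), "monthly points")
for (year, month), total in monthly.items():
    print(year, month, total)
```

The day-to-day bumps are still in the data; the monthly view simply adds them up, which is why the two views work well side by side.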

Of course, patterns over time are not the only ones to look for. You can also find patterns in aggregates that can help you compare groups, people, and things. What do you tend to eat or drink each week? What does the President usually talk about during the State of the Union address? What states usually vote Republican? Looking at patterns over geographic regions would be useful in this case. While the questions and data types are different, your approach is similar, as you’ll see in the following chapters.

Figure 1-7: Monthly unique visitors to FlowingData

Relationships

Have you ever seen a graphic with a whole bunch of charts on it that seem to have been randomly placed? I’m talking about the graphics that seem to be missing that special something, as if the designer gave only a little bit of thought to the data itself and then belted out a graphic to meet a deadline. Often, that special something is relationships.

In statistics, this usually means correlation and causation. Multiple variables might be related in some way. Chapter 6, “Visualizing Relationships,” covers these concepts and how to visualize them.

At a more abstract level though, where you’re not thinking about equations and hypothesis tests, you can design your graphics to compare and contrast values and distributions visually. For a simple example, look at this excerpt on technology from the World Progress Report in Figure 1-8.

The World Progress Report was a graphical report that compared progress around the world using data from UNdata. See the full version at http://datafl.ws/12i.

These are histograms that show the number of users of the Internet, Internet subscriptions, and broadband per 100 inhabitants. Notice that the range for Internet users (0 to 95 per 100 inhabitants) is much wider than that of the other two datasets.

Figure 1-8: Technology adoption worldwide

The quick-and-easy thing to do would have been to let your software decide what range to use for each histogram. However, each histogram was made on the same range even though there were no countries who had 95 Internet subscribers or broadband users per 100 inhabitants. This enables you to easily compare the distributions between the groups.
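To make the shared-range idea concrete, here is a minimal sketch in Python. The values are hypothetical, not the UNdata figures, and `histogram` is a small helper written for this example. The point is only that every dataset is binned over the same range, set by the widest one.

```python
def histogram(values, lo, hi, bins):
    """Count values into `bins` equal-width bins spanning [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that a value equal to `hi` lands in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical rates per 100 inhabitants for a handful of countries.
internet_users = [5, 12, 30, 48, 77, 95]
broadband = [1, 3, 8, 15, 22, 35]

# Use one range for both datasets instead of letting each pick its own.
lo, hi, bins = 0, 100, 5
users_counts = histogram(internet_users, lo, hi, bins)
broadband_counts = histogram(broadband, lo, hi, bins)

print("Internet users:", users_counts)
print("Broadband:     ", broadband_counts)
```

Because both lists of counts refer to the same bins, the empty bins on the right side of the broadband histogram are themselves informative.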

So when you end up with a lot of different datasets, try to think of them as several groups instead of separate compartments that do not interact with each other. It can make for more interesting results.

Questionable Data

While you’re looking for the stories in your data, you should always question what you see. Remember, just because it’s numbers doesn’t mean it’s true.

I have to admit. Data checking is definitely my least favorite part of graph-making. I mean, when someone, a group, or a service provides you with a bunch of data, it should be up to them to make sure all their data is legit. But this is what good graph designers do. After all, reliable builders don’t use shoddy cement for a house’s foundation, so don’t use shoddy data to build your data graphic.

Data checking and verification is one of the most important parts of graph design, if not the most important.

Basically, what you’re looking for is stuff that makes no sense. Maybe there was an error at data entry and someone added an extra zero or missed one. Maybe there were connectivity issues during a data scrape, and some bits got mucked up in random spots. Whatever it is, you need to verify with the source if anything looks funky.

The person who supplied the data usually has a sense of what to expect. If you were the one who collected the data, just ask yourself if it makes sense: That state is 90 percent of whatever and all other states are only in the 10 to 20 percent range. What’s going on there?

Often, an anomaly is simply a typo, and other times it’s actually an interesting point in your dataset that could form the whole drive for your story. Just make sure you know which one it is.
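A first pass at spotting values that make no sense can be automated. The sketch below is in Python, with hypothetical percentages and an arbitrary threshold, so treat it as one simple rule of thumb rather than a standard test. It flags any value that is more than three times the median or less than a third of it.

```python
from statistics import median


def flag_suspects(values, factor=3.0):
    """Return labels whose value is far above or below the median.

    `factor` is an arbitrary cutoff chosen for this example.
    Assumes all values are positive.
    """
    mid = median(values.values())
    return [
        label
        for label, v in values.items()
        if v > factor * mid or v < mid / factor
    ]


# Hypothetical percentages by state; one looks out of place.
rates = {"A": 12, "B": 15, "C": 90, "D": 18, "E": 11}

suspects = flag_suspects(rates)
print("Check with the source:", suspects)
```

A flagged value is only a prompt to ask questions. Whether it is a typo or the most interesting point in the dataset still has to be verified with the source.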

Design

When you have all your data in order, you’re ready to visualize. Whatever you’re making, whether it is for a report, an infographic online, or a piece of data art, you should follow a few basic rules. There’s wiggle room with all of them, and you should think of what follows as more of a framework than a hard set of rules, but this is a good place to start if you are just getting into data graphics.

Explain Encodings

The design of every graph follows a familiar flow. You get the data; you encode the data with circles, bars, and colors; and then you let others read it. The readers have to decode your encodings at this point. What do these circles, bars, and colors represent?

William Cleveland and Robert McGill have written about encodings in detail. Some encodings work better than others. But it won’t matter what you choose if readers don’t know what the encodings represent in the first place. If they can’t decode, the time you spend designing your graphic is a waste.

Note

See Cleveland and McGill’s paper on Graphical Perception and Graphical Methods for Analyzing Data for more on how people encode shapes and colors.

You sometimes see this lack of context with graphics that are somewhere in between data art and infographic. You definitely see it a lot with data art. A label or legend can completely mess up the vibe of a piece of work, but at the least, you can include some information in a short description paragraph. It helps others appreciate your efforts.

Other times you see this in actual data graphics, and that can frustrate readers, which is the last thing you want. Sometimes you might forget because you’re actually working with the data, so you know what everything means. Readers, though, come to a graphic blind, without the context that you gain from analyses.

So how can you make sure readers can decode your encodings? Explain what they mean with labels, legends, and keys. Which one you choose can vary depending on the situation. For example, take a look at the world map in Figure 1-9 that shows usage of Firefox by country.

Figure 1-9: Firefox usage worldwide by country

You can see different shades of blue for different countries, but what do they mean? Does dark blue mean more or less usage? If dark blue means high usage, what qualifies as high usage? As-is, this map is pretty useless to us. But if you provide the legend in Figure 1-10, it clears things up. The color legend also serves double time as a histogram showing the distribution of usage by number of users.

Figure 1-10: Legend for Firefox usage map

You can also directly label shapes and objects in your graphic if you have enough space and not too many categories, as shown in Figure 1-11. This is a graph that shows the number of nominations an actor had before winning an Oscar for best actor.

Figure 1-11: Directly labeled objects

A theory floated around the web that actors who had the most nominations among their cohorts in a given year generally won the statue. As labeled, dark orange shows actors who did have the most nominations, whereas light orange shows actors who did not.

As you can see, plenty of options are available to you. They’re easy to use, but these small details can make a huge difference on how your graphic reads.

Label Axes

Along the same lines as explaining your encodings, you should always label your axes. Without labels or an explanation, your axes are just there for decoration. Label your axes so that readers know what scale points are plotted on. Is it logarithmic, incremental, exponential, or per 100 flushing toilets? Personally, I always assume it’s that last one when I don’t see labels.

To demonstrate my point, rewind to a contest I held on FlowingData a couple of years ago. I posted the image in Figure 1-12 and asked readers to label the axes for maximum amusement.

Figure 1-12: Add your caption here.

There were about 60 different captions for the same graph; Figure 1-13 shows a few.

As you can see, even though everyone looked at the same graph, a simple change in axis labels told a completely different story. Of course, this was just for play. Now just imagine if your graph were meant to be taken seriously. Without labels, your graph is meaningless.

Figure 1-13: Some of the results from a caption contest on FlowingData

Keep Your Geometry in Check

When you design a graph, you use geometric shapes. A bar graph uses rectangles, and you use the length of the rectangles to represent values. In a dot plot, the position indicates value, and the same goes for a standard time series chart. Pie charts use angles to indicate value, and the sum of the values always equals 100 percent (see Figure 1-14). This is easy stuff, so be careful because it’s also easy to mess up. You’re going to make a mistake if you don’t pay attention, and when you do mess up, people, especially on the web, won’t be afraid to call you out on it.

Figure 1-14: The right and wrong way to make a pie chart

Another common mistake is when designers start to use two-dimensional shapes to represent values, but size them as if they were using only a single dimension. The rectangles in a bar chart are two-dimensional, but you only use one length as an indicator. The width doesn’t mean anything. However, when you create a bubble chart, you use an area to represent values. Beginners often use radius or diameter instead, and the scale is totally off.

Figure 1-15 shows a pair of circles that have been sized by area. This is the right way to do it.

Figure 1-15: The right way to size bubbles

Figure 1-16 shows a pair of circles sized by diameter. The first circle has twice the diameter of the second but four times the area.

It’s the same deal with rectangles, like in a treemap. You use the area of the rectangles to indicate values instead of the length or width.

Figure 1-16: The wrong way to size bubbles
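The arithmetic behind sizing by area is short. Because a circle’s area grows with the square of its radius, the radius has to scale with the square root of the value. The following Python sketch shows both the right and the wrong calculation; the function names and the numbers are illustrative only.

```python
import math


def radius_by_area(value, max_value, max_radius):
    """Radius such that the circle's AREA is proportional to value."""
    return max_radius * math.sqrt(value / max_value)


def radius_by_length(value, max_value, max_radius):
    """The common mistake: radius proportional to value."""
    return max_radius * value / max_value


# Hypothetical case: one value is a quarter of the largest value,
# and the largest bubble is drawn with a radius of 10 units.
right = radius_by_area(25, 100, 10)
wrong = radius_by_length(25, 100, 10)

# Compare each bubble's area with the area of the largest bubble.
right_area_ratio = (right / 10) ** 2
wrong_area_ratio = (wrong / 10) ** 2

print("sized by area:  ", right, right_area_ratio)
print("sized by radius:", wrong, wrong_area_ratio)
```

Sized by area, the smaller bubble covers a quarter of the larger one, matching the data. Sized by radius, it covers only a sixteenth, which understates the value.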

Include Your Sources

This should go without saying, but so many people miss this one. Where did the data come from? If you look at the graphics printed in the newspaper, you always see the source somewhere, usually in small print along the bottom. You should do the same. Otherwise readers have no idea how accurate your graphic is.

There’s no way for them to know that the data wasn’t just made up. Of course, you would never do that, but not everyone will know that. Other than making your graphics more reputable, including your source also lets others fact check or analyze the data.

Inclusion of your data source also provides more context to the numbers. Obviously a poll taken at a state fair is going to have a different interpretation than one conducted door-to-door by the U.S. Census.

Consider Your Audience

Finally, always consider your audience and the purpose of your graphics. For example, a chart designed for a slide presentation should be simple. You can include a bunch of details, but only the people sitting up front will see them. On the other hand, if you design a poster that’s meant to be studied and examined, you can include a lot more details.

Are you working on a business report? Then don’t try to create the most beautiful piece of data art the world has ever seen. Instead, create a clear and straight-to-the-point graphic. Are you using graphics in analyses? Then the graphic is just for you, and you probably don’t need to spend a lot of time on aesthetics and annotation. Is your graphic meant for publication to a mass audience? Don’t get too complicated, and explain any challenging concepts.

Wrapping Up

In short, start with a question, investigate your data with a critical eye, and figure out the purpose of your graphics and who they’re for. This will help you design a clear graphic that’s worth people’s time—no matter what kind of graphic it is.

You learn how to do this in the following chapters. You learn how to handle and visualize data. You learn how to design graphics from start to finish. You then apply what you learn to your own data. Figure out what story you want to tell and design accordingly.

Chapter 2

Handling Data

Chapter 3

Chapter 4

Visualizing Patterns over Time

Chapter 5

Visualizing Proportions

Chapter 6

Visualizing Relationships

Statistics is about finding relationships in data. What are the similarities between groups? Within groups? Within subgroups? The relationship that most people are familiar with for statistics is correlation. For example, as average height goes up in a population, most likely average weight will go up, too. This is a simple positive correlation. The relationships in your data, just like in real life, can get more complicated though as you consider more factors or find patterns that aren’t so linear. This chapter discusses how to use visualization to find such relationships and highlight them for storytelling.

As you get into more complex statistical graphics, you can make heavy use of R in this chapter and the next. This is where the open-source software shines. Like in previous chapters, R does the grunt work, and then you can use Illustrator to make the graphic more readable for an audience.

What Relationships to Look For

So far you’ve looked at basic relationships with patterns in time and proportions. You learned about temporal trends, and compared proportions and percentages to see what’s the least and greatest and everything in between. The next step is to look for relationships between different variables. As something goes up, does another thing go down, and is it a causal or correlative relationship? The former is usually quite hard to prove quantitatively, which makes it even less likely you can prove it with a graphic. You can, however, easily show correlation, which can lead to a deeper, more exploratory analysis.

You can also take a step back to look at the big picture, or the distribution of your data. Is it actually spaced out or is it clustered in between? Such comparisons can lead to stories about citizens of a country or how you compare to those around you. You can see how different countries compare to one another or look at general developmental progress around the world, which can aid in decisions about where to provide aid.

You can also compare multiple distributions for an even wider view of your data. How has the makeup of a population changed over time? How has it stayed the same?

Most important, in the end, when you have all your graphics in front of you, ask what the results mean. Are they what you expected? Does anything surprise you?

This might seem abstract and hand-wavy, so now jump right into some concrete examples on how to look at relationships in your data.

Correlation

Correlation is probably the first thing you think of when you hear about relationships in data. The second thing is probably causation. Now maybe you’re thinking about the mantra that correlation doesn’t equal causation. The first, correlation, means one thing tends to change a certain way as another thing changes. For example, the price of milk per gallon and the price of gasoline per gallon are positively correlated. Both have been increasing over the years.

Now here’s the difference between correlation and causation. If you increase the price of gas, will the price of milk go up by default? More important, if the price of milk did go up, was it because of the increase in the gas price or was it an outside factor, such as a dairy strike?

It’s difficult to account for every outside, or confounding, factor, which makes it difficult to prove causation. Researchers spend years figuring stuff like that out. You can, however, easily find and see correlation, which can still be useful, as you see in the following sections.

Correlation can help you predict one metric by knowing another. To see this relationship, return to the scatterplot and multiple scatterplots.

More with Points

In Chapter 4, “Visualizing Patterns over Time,” you used a scatterplot to graph measurements over time, where time was on the horizontal axis and a metric of interest was on the vertical axis. This helped spot temporal changes (or nonchanges). The relationship was between time and another factor, or a variable. As shown in Figure 6-1, however, you can use the scatterplot for variables other than time; you can use a scatterplot to look for relationships between two variables.

If two metrics are positively correlated (Figure 6-2, left), dots move higher up as you read the graph from left to right. Conversely, if a negative correlation exists, the dots appear lower, moving from left to right, as shown in the middle of Figure 6-2.
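If you want to see these two patterns for yourself, the following sketch draws them side by side with simulated data. The slopes, the noise level, and the names x, y_pos, and y_neg are assumptions chosen for illustration. cor() reports the correlation coefficient, which is positive for the first panel and negative for the second.

```r
# Simulated points; slopes and noise are arbitrary illustrative choices
set.seed(42)
x <- runif(100, 0, 10)
y_pos <- 2 * x + rnorm(100, sd = 2)     # tends to rise as x rises
y_neg <- -2 * x + rnorm(100, sd = 2)    # tends to fall as x rises

# Two plots side by side
par(mfrow = c(1, 2))
plot(x, y_pos, main = "Positive correlation")
plot(x, y_neg, main = "Negative correlation")
par(mfrow = c(1, 1))

# Correlation coefficients: the sign matches the direction of the dots
cor_pos <- cor(x, y_pos)
cor_neg <- cor(x, y_neg)
print(c(positive = cor_pos, negative = cor_neg))
```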

Sometimes the relationship is straightforward, such as the correlation between people’s height and weight. Usually, as height increases, weight increases. Other times the correlation is not as obvious, such as that between health and body mass index (BMI). A high BMI typically indicates that someone is overweight; however, muscular people, who can be athletically fit, could also have a high BMI. What if the sample population were body builders or football players? What would relationships between health and BMI look like?

Figure 6-1: Scatterplot framework, comparing two variables

Figure 6-2: Correlations shown in scatterplots

Remember the graph is only part of the story. It’s still up to you to interpret the results. This is particularly important with relationships. You might be tempted to assume a cause-and-effect relationship, but most of the time that’s not the case at all. Just because the price of a gallon of gas and world population have both increased over the years doesn’t mean the price of gas should be decreased to slow population growth.

Create a Scatterplot

In this example, look at United States crime rates at the state level, in 2005, with rates per 100,000 population for crime types such as murder, robbery, and aggravated assault, as reported by the Census Bureau. There are seven crime types in total. Look at two of them to start: burglary and murder. How do these relate? Do states with relatively high murder rates also have high burglary rates? You can turn to R to investigate.

As always, the first thing you do is load the data into R using read.csv(). You can download the CSV file at http://datasets.flowingdata.com/crimeRatesByState2005.csv, but now load it directly into R via the URL.

# Load the data
crime <-
    read.csv("http://datasets.flowingdata.com/crimeRatesByState2005.csv",
    sep=",", header=TRUE)

Check out the first few lines of the data by typing the variable, crime, followed by the rows you want to see.

crime[1:3,]

Following is what the first three rows look like.

          state murder forcible_rape robbery aggravated_assault burglary
1 United States    5.6          31.7   140.7              291.1    726.7
2       Alabama    8.2          34.3   141.4              247.8    953.8
3        Alaska    4.8          81.1    80.9              465.1    622.5
  larceny_theft motor_vehicle_theft population
1        2286.3               416.7  295753151
2        2650.0               288.3    4545049
3        2599.1               391.0     669488
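If you are offline, you can still try the same inspection steps on a small stand-in. The sketch below rebuilds only the three rows shown above with read.csv(text = ...); the name crime_sample is an assumption used here so that it does not clash with the full crime data frame. head() and str() are other common ways to peek at a data frame.

```r
# Three rows copied from the output above, so the example runs without a download
sample_csv <- "state,murder,forcible_rape,robbery,aggravated_assault,burglary,larceny_theft,motor_vehicle_theft,population
United States,5.6,31.7,140.7,291.1,726.7,2286.3,416.7,295753151
Alabama,8.2,34.3,141.4,247.8,953.8,2650.0,288.3,4545049
Alaska,4.8,81.1,80.9,465.1,622.5,2599.1,391.0,669488"
crime_sample <- read.csv(text = sample_csv, header = TRUE)

# Column names and types
str(crime_sample)

# First two rows, the same rows you get from crime_sample[1:2, ]
first_two <- head(crime_sample, 2)
print(first_two)
```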

The first column shows the state name, and the rest are rates for the different types of crime. For example, the average robbery rate for the United States in 2005 was 140.7 per 100,000 population. Use plot() to create the default scatterplot of murder against burglary, as shown in Figure 6-3.

plot(crime$murder, crime$burglary)

Figure 6-3: Default scatterplot of murder against burglary
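The first row of the data is the national figure rather than a state, so you might want to leave it out when comparing states. That is a judgment call, not something the default plot does for you. The sketch below shows one way to do it on a three-row stand-in built from the rows printed earlier; the names crime_sample and states_only and the axis label text are assumptions for illustration.

```r
# Stand-in with three rows from the printed output; the real data has more rows
crime_sample <- data.frame(
  state    = c("United States", "Alabama", "Alaska"),
  murder   = c(5.6, 8.2, 4.8),
  burglary = c(726.7, 953.8, 622.5),
  stringsAsFactors = FALSE
)

# Keep only rows that are individual states
states_only <- crime_sample[crime_sample$state != "United States", ]

# Same scatterplot, with axis labels that say what the numbers are
plot(states_only$murder, states_only$burglary,
     xlab = "Murders per 100,000 population",
     ylab = "Burglaries per 100,000 population")
```

The same filter should work on the full crime data frame, assuming the national row is labeled "United States" there too, as it is in the first row shown above.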

Chapter 7

Spotting Differences

Sports commentators like to classify a select few athletes as superstars or as part of an elite group, while the rest are designated average or role players. These classifications usually aren’t so much from sports statistics as they are from watching a lot of games. It’s the know-it-when-I-see-it mentality. There’s nothing wrong with this. The commentators (usually) know what they’re talking about and are always considering the context of the numbers. It always makes me happy when a group of sports analysts look at performance metrics, and almost without fail someone will say, “You can’t just look at the numbers. It’s the intangibles that make so and so great.” That’s statistics right there.

Obviously this doesn’t apply to just sports. Maybe you want to find the best restaurants in an area or identify loyal customers. Rather than categorizing units, you could look for someone or something that stands out from the rest. This chapter looks at how to spot groups within a population and across multiple criteria, and spot the outliers using common sense.

What to Look For

It’s easy to compare across a single variable. One house has more square feet than another house, or one cat weighs more than another cat. Across two variables, it is a little more difficult, but it’s still doable. The first house has more square feet, but the second house has more bathrooms. The first cat weighs more and has short hair, whereas the second cat weighs less and has long hair.

What if you have one hundred houses or one hundred cats to classify? What if you have more variables for each house, such as number of bedrooms, backyard size, and housing association fees? You end up with the number of units times the number of variables. Okay, now it is trickier, and this is what we focus on.

Perhaps your data has a number of variables, but you want to classify or group units (for example, people or places) into categories and find the outliers or standouts. You want to look at each variable for differences, but you also want to see differences across all variables. Two basketball players could have completely different scoring averages, but they could be almost identical in rebounds, steals, and minutes played per game. You need to find differences but not forget the similarities and relationships, just like, oh yes, the sports commentators.

Comparing across Multiple Variables

One of the main challenges when dealing with multiple variables is to determine where to begin. You can look at so many variations and subsets that it can be overwhelming if you don’t stop to think about what data you have. Sometimes, it’s best to look at all the data at once and let the points that stand out direct you to what to explore next.

Getting Warmer

One of the most straightforward ways to visualize a table of data is to show it all at once. Instead of the numbers though, you can use colors to indicate values, as shown in Figure 7-1.

Figure 7-1: Heatmap framework

You end up with a grid the same size as the original data table, but you can easily find relatively high and low values based on color. Typically, dark colors mean greater values, and lighter colors represent lower values, but that can easily change based on your application.

You also read the heatmap (or heat matrix) the same way you would a table. You can read a row left to right to see the values of all variables for a single unit, or you can see how all the units compare across a single variable.

This layout can still confuse you, especially if you have a large table of data, but with the right color scheme and some sorting, you can make a useful graphic.
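Here is a minimal sketch of the idea on a tiny table, using R’s heatmap() function. The values, the unit names, and the variable names are made up for illustration, and the color palette is just one possible choice.

```r
# Made-up values for three hypothetical units and three variables
toy <- matrix(c(10, 200, 0.5,
                20, 150, 0.9,
                30, 100, 0.1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("Unit A", "Unit B", "Unit C"),
                              c("var1", "var2", "var3")))

# Rowv = NA and Colv = NA turn off the dendrograms and clustering-based reordering.
# scale = "column" rescales each variable so colors compare units within a column.
# rev(heat.colors(16)) puts the lighter colors on the lower values.
heatmap(toy, Rowv = NA, Colv = NA, scale = "column",
        col = rev(heat.colors(16)), margins = c(5, 8))
```

Scaling by column matters here because the three variables are on very different scales. Without it, the largest variable would dominate the color range.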

Create a Heatmap

It’s easy to make heatmaps in R. There’s a heatmap() function that does all the math work, which leaves you with picking colors best suited for your data and organizing labels so that they’re still readable, even if you have a lot of rows and columns. In other words, R sets up the framework, and you handle the design. That should sound familiar by now.

In this example, take a look at NBA basketball statistics for 2008. You can download the data as a CSV file at http://datasets.flowingdata.com/ppg2008.csv. There are 22 columns, the first for player names, and the rest for stats such as points per game and field goal percentage. You can use read.csv() to load the data into R. Now look at the first five rows to get a sense of the data’s structure (Figure 7-2).

bball <-
    read.csv("http://datasets.flowingdata.com/ppg2008.csv",
    header=TRUE)
bball[1:5,]

Figure 7-2: Structure of the first five rows of data

Players are currently sorted by points per game, greatest to least, but you could order players by any column, such as rebounds per game or field goal percentage, with order().

bball_byfgp <- bball[order(bball$FGP, decreasing=TRUE),]

Now if you look at the first five rows of bball_byfgp, you see the list is led by Shaquille O’Neal, Dwight Howard, and Pau Gasol instead of Dwyane Wade, LeBron James, and Kobe Bryant. For this example, reverse the order on points per game.

bball <- bball[order(bball$PTS, decreasing=FALSE),]

Tip

The decreasing argument in order() specifies whether you want the data to be sorted in ascending or descending order.
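To see the two sort directions next to each other, try order() on a tiny data frame. The players and point values below are hypothetical stand-ins, not rows from the NBA data.

```r
# Hypothetical three-row table; names and values are made up
players <- data.frame(Name = c("Player A", "Player B", "Player C"),
                      PTS  = c(20, 30, 10),
                      stringsAsFactors = FALSE)

# order() returns row positions; ascending is the default
asc  <- players[order(players$PTS), ]

# decreasing = TRUE puts the largest value first
desc <- players[order(players$PTS, decreasing = TRUE), ]

print(asc$Name)
print(desc$Name)
```

Note that order() returns row positions rather than sorted values, which is why it goes inside the square brackets to rearrange the whole data frame.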

As is, the column names match the CSV file’s header. That’s what you want. But you also want to name the rows by player name instead of row number, so shift the first column over to row names.

row.names(bball) <- bball$Name
bball <- bball[,2:20]

The first line changes row names to the first column in the data frame. The second line selects columns 2 through 20 and sets the subset of data back to bball.
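The same two steps can be tried on a small hypothetical table to see what they do. The name stats and the values are assumptions for illustration. This sketch drops the name column with a negative index instead of listing column numbers, which keeps every remaining column; that is not the same as selecting 2 through 20 if the table has more than 20 columns.

```r
# Hypothetical two-row table; names and values are made up
stats <- data.frame(Name = c("Player A", "Player B"),
                    PTS  = c(30, 20),
                    FGP  = c(0.49, 0.52),
                    stringsAsFactors = FALSE)

# Use the first column as row names
row.names(stats) <- stats$Name

# Drop the first column by position, keeping all the others
stats <- stats[, -1]

# Rows can now be looked up by name
print(stats["Player B", "FGP"])
```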


Visualize This: The FlowingData Guide to Design, Visualization, and Statistics

Published by Wiley Publishing, Inc. 10475 Crosspoint Boulevard Indianapolis, IN 46256 www.wiley.com

Copyright © 2011 by Nathan Yau

Published by Wiley Publishing, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-0-470-94488-2

ISBN: 978-1-118-14024-6 (ebk)

ISBN: 978-1-118-14026-0 (ebk)

ISBN: 978-1-118-14025-3 (ebk)

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or website may provide or recommendations it may make. Further, readers should be aware that Internet websites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Not all content that is available in standard print versions of this book may appear or be packaged in all book formats. If you have purchased a version of this book that did not include media that is referenced by or accompanies a standard print version, you may request this media by visiting http://booksupport.wiley.com. For more information about Wiley products, visit us at www.wiley.com.

Library of Congress Control Number: 2011928441

Trademarks: