Steam Dev Days – Data to Drive Decision-Making

A talk by Mike Ambinder, Valve

How and Why Valve uses data to drive the choices they make

Mike is an experimental psychologist – he takes what he knows about human behavior and applies it to game design.

Data to Drive Decision-Making

  • Decision-Making at Valve
  • Introduction to experimental design
  • Data collection/analysis infrastructure
  • Examples: L4D, DOTA 2, CS:GO

Decision-Making at Valve

  • No formal management structure
  • Decision-making is a meritocracy
  • All data is available for every employee
  • We just want to make the best decisions possible
  • We don’t want it to rely on ‘instinct’ -> it is fallible
    No centralized command hierarchy – as such, decision-making is a meritocracy. [Huh? Who, what, how? There’s no linkage. What about regression to the mean? How is “merit” determined? The more I hear this, the more shockingly political it seems. “Spending lots of time making good decisions” implies there is some rubric for evaluating them. How is that not a political process?] All data is made available, with the exception of employee compensation. By “instinct” he really means letting our biases run amok.

Decision-Making

  • Explicit
  • Data-driven
  • Theory-driven
  • Measurable Outcomes
  • Iterative

Explicit

  • What problem are you trying to solve?
  • Define terminology/constructs/problem space
  • Ask the ‘second’ question
  • Force yourself to be specific
  • Force yourself to be precise
    ‘Second’ question -> “What do you mean by that?” It’s a technique for digging into something: making sure comprehension happens, that you’re being specific and precise, and that there’s consistent logic and supporting data.

Data-Driven

  • What do we know about the problem?
  • What do we need to know before we decide?
  • What do we still not know after we decide?
    Need to know what you know – and what you don’t. Being honest with yourself about that is important.

Theory-Driven

  • What does the data mean?
    ** Is it consistent with expectations?
    ** Is it reliable?
  • Model derived from prior experience/analysis
  • Coherent narrative
  • Prove a hypothesis right (or wrong)
  • Want result AND explanation
    Behavior during a Steam Sale is different from behavior at other times, so make sure you account for that. Have sufficient confidence in your data, using statistical analysis (see the sketch below). You want some “intuition” about why something happened. A narrative. [Odd choice of words, that…] Even if you don’t know for sure, have at least a hypothesis for what’s going on, and then set out to prove it correct or wrong. The goal is to make smarter decisions in the future.
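The statistical-confidence point is concrete enough to sketch. A minimal, hypothetical Python example (the numbers, and the use of scipy, are mine rather than from the talk): put an interval around an estimate before building a narrative on it.

```python
# Illustrative only: a 95% confidence interval around a metric before
# acting on it. The sales figures are invented.
import math
import statistics
from scipy import stats

daily_sales = [1180, 1230, 990, 1410, 1075, 1320, 1150]  # one ordinary week

mean = statistics.mean(daily_sales)
sem = statistics.stdev(daily_sales) / math.sqrt(len(daily_sales))
t_crit = stats.t.ppf(0.975, df=len(daily_sales) - 1)  # two-sided 95%

low, high = mean - t_crit * sem, mean + t_crit * sem
print(f"mean={mean:.0f}, 95% CI = [{low:.0f}, {high:.0f}]")
# A Steam Sale week comes from a different distribution entirely,
# so compare like with like before drawing conclusions.
```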

Measurable Outcomes

  • Define ‘Success’
  • How will we know we made the right choice?
  • Know the ‘outcome’ of your decision
    Know what success is for every decision you make. If your decision is only loosely tied to customer actions, how do you know it was a good one? Measure the outcome of your choices (see the sketch below).
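One way to make “define success” concrete is to write the criterion down before the change ships. A hypothetical sketch (the metric name and thresholds are invented, not Valve’s):

```python
# Hypothetical success criterion, recorded *before* the decision ships.
SUCCESS_CRITERION = {
    "metric": "day7_retention",
    "baseline": 0.31,   # measured before the change
    "min_lift": 0.02,   # the decision counts as 'right' at or above this lift
}

def decision_was_right(observed: float, criterion=SUCCESS_CRITERION) -> bool:
    """Compare the measured outcome against the pre-committed bar."""
    return observed - criterion["baseline"] >= criterion["min_lift"]

print(decision_was_right(0.34))  # True: a lift of 0.03 clears the bar
print(decision_was_right(0.32))  # False: a lift of 0.01 does not
```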

Iterative

  • Gather Data
  • Analyze Data
  • Formulate Hypothesis
    Data from one game informs decisions in other games. “TF2 is a test-bed for DOTA2, and vice-versa.”

Introduction to Experimental Design

  • “If it can be destroyed by the truth, it deserves to be destroyed by the truth.” – Carl Sagan
    We all want to be right all the time. Valve would rather be accurate than right. They want their estimation of reality to match what reality actually is.

The Scientific Method Cycle [YAY!]

  • Theory – use the theory to make a prediction
  • Prediction – design an experiment to test the prediction
  • Experiment – perform the experiment
  • Observation – create or modify the theory

Experimental Design

  • Observational
    ** Retrospective vs. Prospective
    ** Correlational not causal
  • Experiment
    ** Control Condition and Experimental Condition
    ** Account for confounding variables
    ** Measure variables of interest
    Try to eliminate external influences. (A minimal sketch of such a controlled comparison follows.)
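A minimal sketch of a control/experimental comparison like the one described above. Everything here is invented for illustration (the assignment scheme, the metric, and the effect size are assumptions, not details from the talk): randomize players into conditions so confounds average out, then test whether the measured difference beats chance.

```python
# Illustrative controlled experiment; none of this is Valve's actual tooling.
import random
import statistics
from scipy import stats

random.seed(42)

# Random assignment spreads confounding variables across both conditions.
conditions = [random.choice(["control", "experimental"]) for _ in range(2000)]

def simulated_session_minutes(condition: str) -> float:
    """Stand-in for real telemetry; pretend the change adds ~3 minutes."""
    base = 42.0 if condition == "control" else 45.0
    return random.gauss(base, 12.0)

control = [simulated_session_minutes(c) for c in conditions if c == "control"]
experimental = [simulated_session_minutes(c) for c in conditions
                if c == "experimental"]

# Two-sample t-test on the variable of interest: is the difference
# larger than random variation alone would produce?
t_stat, p_value = stats.ttest_ind(experimental, control)
print(f"control mean={statistics.mean(control):.1f}, "
      f"experimental mean={statistics.mean(experimental):.1f}, "
      f"p={p_value:.4f}")
```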

Experimental Design (Part 2)

  • What have we learned?
  • What biases are present?
  • How are future experiments informed?
  • What other hypotheses need to be ruled out?
  • What should we do next?

Data Collection/Analysis Infrastructure – Valve Data Collection

  • Record lots and lots (and lots) of user behavior
  • If we’re not recording it, we’ll start recording it
  • Define questions first, then schema
  • Collection -> Analysis -> Communication
    They’re always willing to spend engineering time to get the data to answer the questions they have, and they never regret it. It doesn’t mean they’re always right – but they’re always smarter. Once you have the data, you need an idea of how you’re going to share it. (A hypothetical question-first schema sketch follows.)
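“Define questions first, then schema” lends itself to a sketch. A hypothetical example (the event name and fields are mine, not Valve’s OGS): start from a question and record exactly the fields needed to answer it.

```python
# Hypothetical telemetry schema, driven by the question we want answered.
# Question: "Do players who wait longer in matchmaking quit sooner?"
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class MatchmakingEvent:
    account_id: int        # who
    queue_seconds: float   # how long they waited (the variable of interest)
    match_id: str          # joins this row to match-outcome events
    abandoned_queue: bool  # did they give up before a match was found?
    timestamp: float       # when it happened

def emit(event: MatchmakingEvent) -> None:
    # Stand-in for a real pipeline: serialize and ship one row per event.
    print(json.dumps(asdict(event)))

emit(MatchmakingEvent(account_id=123, queue_seconds=87.5, match_id="m-0001",
                      abandoned_queue=False, timestamp=time.time()))
```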

Data Collection – Games

  • OGS – Operational Game Stats (?)
  • Platform for recording gameplay metrics
  • Kills, deaths, hero selection, in-game purchases, matchmaking wait times, bullet trajectories, friends in party, low-priority penalties, etc.
    They record “everything”.

Data Collection – Games (2)

  • Organizational schemas defined for each game
  • Data sent at relevant intervals
  • Daily, Monthly, Lifetime Rollups, Views, Aggregations (rollup sketch below)
    [These data collection examples are Valve games only. There’s no Steam provisioning for this sort of metrics collection. I wager there are partners who’d want that.]
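The rollup bullet is easy to make concrete. A tiny pandas sketch of aggregating raw event rows into a daily rollup (column names invented for illustration):

```python
# Illustrative rollup: aggregate raw per-event rows into daily summaries.
import pandas as pd

raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-01-01 10:00", "2014-01-01 14:30",
                                 "2014-01-02 09:15", "2014-01-02 21:40"]),
    "account_id": [1, 2, 1, 3],
    "kills": [7, 12, 3, 9],
    "purchases_usd": [0.00, 2.49, 0.99, 0.00],
})

daily = (raw
         .assign(day=raw["timestamp"].dt.date)
         .groupby("day")
         .agg(active_accounts=("account_id", "nunique"),
              total_kills=("kills", "sum"),
              revenue_usd=("purchases_usd", "sum")))
print(daily)
# Monthly and lifetime rollups are the same idea over coarser keys.
```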

ValveStats

  • Disseminate the data using Tableau
  • Examples:
    ** Account First Purchase
    ** Chinese Users Performance
    ** DOTA Heroes
    ** DOTA Item Balance
    ** DOTA Matches
    ** DOTA Geographic Purchases
    ** DOTA Item Purchases / Drops
    ** DOTA Sales by Currency
    ** DOTA Weekly
    ** DOTA Performance
    [Charts are really hard to read, so no scale or value data is legible. Probably available elsewhere if required.] They have 200 separate workbooks and about 800 pieces of analysis.

Data Collection – Steam

  • Steam Database – Raw data
  • SteamStats Database – Analysis/Summary of raw data
  • Record all relevant data about Steam user behavior
    [Screenshot of SteamWorks Product Data screen at 24:19.] He made an interesting comment about whether ARPU or ARPPU are good metrics to use. [He seemed to downplay their significance. Not surprising, given the Trade System examples and the free-user monetization strategies they use.]

Valve’s Game Design Process

  • Goal is a game that makes customers happy =>
  • Game designs are hypotheses =>
  • Playtests are experiments =>
  • Evaluate designs based off play test results =>
  • Repeat from start =>
    They are very poor proxies for their customers – they don’t know if something actually works until they put it in front of people who are not them.

Playtest Methodologies

  • Traditional:
    ** Direct Observation
    ** Verbal Reports
    ** Q&A’s
  • Technical:
    ** Stat Collection/Data Analysis
    ** Design Experiments
    ** Surveys
    ** Physiological Measurements (Heart Rate, etc.)

Example – Left 4 Dead – Enabling Cooperation

  • Coop Game where competing gets you killed
  • Initial playtests were not as enjoyable as hoped
  • Initial playtests were not as cooperative as hoped
    ** Players letting their teammates die
    ** Ignoring cries for help

Enabling Cooperation

  • Explicit: Players letting teammates die
  • Data-Driven: Surveys, Q&As, high death rates
  • Theory-Driven: Lack awareness of teammate
  • Measurements: Survey, Q&As, death rates
  • Hypothesis: Give better visual cues to teammate location
    Improving the visual cues reduced deaths by ~40% (simple arithmetic; sketch below). [Duh. The previous version was clearly inadequate.]
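The ~40% figure is simple before/after arithmetic once deaths are being logged. A sketch with made-up playtest counts:

```python
# Before/after death-rate comparison for a playtest change (numbers invented).
before_deaths, before_sessions = 120, 40   # old visual cues
after_deaths, after_sessions = 72, 40      # improved visual cues

before_rate = before_deaths / before_sessions
after_rate = after_deaths / after_sessions
change = (after_rate - before_rate) / before_rate

print(f"deaths/session: {before_rate:.1f} -> {after_rate:.1f} ({change:+.0%})")
# deaths/session: 3.0 -> 1.8 (-40%)
```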

Results

  • Survey rating of enjoyment/cooperation increased
  • Anecdotal responses decreased
  • Deaths decreased
  • Iterative: Where else can visual cues aid gameplay?

Example – DOTA 2 – Improve Player Communication

  • Explicit: Reduce negative communication
  • Data-Driven: Chat, reports, forums, emails, quitting
  • Theory-Driven: No feedback loop to punish negativity
  • Measurements: Chat, reports, ban rates, recidivism
  • Iterative: Will this work in TF2? Do these systems scale?
  • Hypothesis: Automating communication bans will reduce negativity in-game.
    They had data suggesting they had a problem: early on, the only significant predictor of why a person would quit DOTA was being in a game where a player had been reported for abusive behavior. Rewarding positive behavior is a different axis. The way it works (38:09) is that the player gets a report-player dialog which categorizes the report (e.g. Communication Abuse), with a free-text box for more information. They also get a Thank You dialog which specifically tells the player that Valve has taken action against another player and that they have another (note: singular) report to use. Players have a weekly quota of reports. [Both of those are really interesting feedback loops. I’m not coming up with any other games which do this. Every game I can think of does exactly the opposite.] They take away the offending player’s ability to chat, scaling from a day to a week depending on the severity and frequency of bans (a hypothetical sketch of such escalation logic follows).
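The escalation he describes (“a day to a week depending on severity and frequency”) suggests a simple policy. A hypothetical sketch of such logic, not Valve’s actual implementation (the thresholds and severity scale are invented):

```python
# Hypothetical chat-ban escalation. All thresholds are invented; the only
# constraint taken from the talk is the one-day-to-one-week range.
def chat_ban_days(prior_bans: int, severity: int) -> int:
    """severity: 1 (mild) .. 3 (egregious), from categorized reports."""
    base = 2 ** prior_bans          # 1, 2, 4, ... days: repeat offenders escalate
    return min(base * severity, 7)  # capped at a week

for prior, sev in [(0, 1), (1, 1), (0, 3), (2, 2)]:
    print(f"prior_bans={prior} severity={sev} -> {chat_ban_days(prior, sev)} day(s)")
```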

Results

  • 35% fewer negative words used in chat
  • 32% fewer communication reports
  • 1% of active player base is currently banned
  • 61% of banned players only receive one ban
    [Missing is what this has done to quit rates.] They balanced the word list to stay around the 1% mark, to avoid overdoing the banning. [Not stated is how many reports of a particular player are required to trigger an automatic ban.]

Example: CS:GO – Weapon Balance

  • Explicit: M4A4 usage is high; few choices in late-game
  • Data-driven: Purchase rates
  • Theory-driven: Greater tactical choice => Player retention
  • Measurements: Purchase rates, playtime, efficacy
  • Iterative: Inform future design choices
  • Hypothesis: Creating a balanced alternative weapon will increase player choice and playtime
    The M4A4 was too popular – used by ~80% of players. That could be fine, but they weren’t sure. They introduced the silenced M4A1, which split purchases roughly evenly with the M4A4. (A sketch of checking such a split follows.)
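A hedged sketch of how one might check whether the new weapon really split purchases ~50/50 (the counts are invented): a two-sided binomial test against an even split.

```python
# Did purchases really split ~50/50? Illustrative counts, not Valve's data.
from scipy import stats

m4a4_buys, m4a1_buys = 51_200, 48_800
total = m4a4_buys + m4a1_buys

result = stats.binomtest(m4a4_buys, n=total, p=0.5)  # two-sided by default
print(f"M4A4 share={m4a4_buys / total:.3f}, p={result.pvalue:.3g}")
# With counts this large, even a tiny deviation from 0.5 is 'significant';
# the practical question is whether the split is close enough for balance.
```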

Results

  • ~50/50 split between new and old favorites
  • Increase in playtime
    ** Conflated with other updates
    ** Difficult to isolate
  • Open question as to whether or not increased weapon variability increases player retention

Where Can You Begin?

  • Start asking questions
  • Gather data – any data
    ** Playtests
    ** Gameplay metrics
    ** SteamStats
    ** Forum posts/emails/Reddit
  • Tell Valve what data you’d like them to provide

Contact Info

  • Mike Ambinder
  • mikea AT valvesoftware.com

Question: How often do you get to isolate a single change?
We playtest as much as we can, as often as we can – twice a week, twenty people, for longer than a year for L4D. It’s going to be messy sometimes. You need to be aware that the data you have isn’t representative of the population at large.

Question: How do you use a data-driven approach to avoid missteps?
We make mistakes all the time. The way the company is designed makes that OK. They did not realize the customers had an expectation. Now they have more informed policies about holiday events in the future.
