This paper describes the formulation of a new model for unintrusive targeted advertising on the web, extending the linear programming approach taken by Langheinrich et al.[10] A feature of our model is that it avoids unrealistic solutions of the type which may show ads to only a too-narrow group of users. This is accomplished by using a statistically derived entropy maximization model, which incorporates a form of randomization in associating advertisements with targetable groups of users, as well as considering click-through probability. It is then shown that this nonlinear entropy model can be embedded in larger models for the purpose of optimal management of web advertisement portfolios by agencies or brokerages.
This paper outlines the formulation of some entropy models for directed advertising on the web ([9]), and in particular for unintrusive targeted advertising. The initial aim is to build an optimization model which can be used to compute frequencies for showing advertisements to groups of users to stably optimize click-throughs or revenue. A feature of the model is that it is formulated to avoid unrealistic, "over-targeted'' solutions. To illustrate what this means, and to motivate the rest of this paper, let us consider a very simple example.
Suppose we have 100 identical banner ads to be presented to two distinguishable types, or groups, of users, who view the page on which the ad may be displayed in equal numbers, and who have estimated click-through probabilities of 51% and 49%. How many of the ads should we show to each type of user to maximize the expected number of click-throughs?
If we let x1, x2 be the number of ads shown to users of type 1 and 2, this problem can be expressed as a tiny linear program (LP):
0The obvious "optimal" solution is x1=100, x2=0, in other words to show all the ads to the first group of users, to achieve an expected 51 click-throughs. The second group are shown no ads at all.
Is this solution realistic or desirable? If the click-through probabilities and ratio of users were known with perfect accuracy, perhaps. But in the real world such data are uncertain. Suppose the uncertainty in the click-through probabilities is a modest 5%. Then in the worst case, the actual probabilities might be 46% and 54%, the coefficients in the function to be maximized would be 0.46 and the 0.54, rather than 0.51 and 0.49, and the "optimal" solution would be completely different - to show all of the ads to the second group of users (x1=0, x2=100). This would result in 54 expected click-throughs, whereas our previous solution with x1=100 would result in only 46. These drastic differences in solution are clearly unsatisfactory, and we shall refer to "all-or-nothing" solutions of this type as "over-targeted".
It might be thought that one way out of this dilemma is to specify a minimum allocation of ads to each group - to demand that (say) each group is shown at least 10 ads (i.e. xi
10). However the effect of this is that once each group has been assigned its minimum 10 ads, the remaining 80 will again be assigned on an all-or-nothing basis as before, and our solution will be (x1=90, x2=10) - only slightly more stable under changes in the probabilities than before. We might improve the situation by increasing the lower bound further, but we have no guidance on how much.
Faced with the above situation, the commonsense approach is to "more or less evenly distribute the ads between the groups of users, but with a bias toward the group(s) with the higher click-through probability". We must then ask how this is to be expressed mathematically. The results of section 3 of this paper suggest that instead of solving the tiny linear program (LP)above, we should solve the following tiny non-linear program (NLP):
0The solution (to 4 figures) to this NLP is x1=51.00, x2=49.00, yielding an expected number of 50.00 click-throughs - one less than in the LP optimum, but much more robust under changes in the data. Under the worst case 5% variation in probabilities considered above, the extreme cases - when the probabilities are (0.56, 0.44) and (0.46,0.54) - yield solutions of (55.97, 44.03) and (46.01, 53.99) with expected click-throughs of 50.72 and 50.32 - a far more stable outcome. The nonlinear terms in the NLP objective have accomplished the goal of "more or less evenly distributing the ads between the groups of users" , which we might more formally describe as randomization, but the linear terms still give us a "bias toward the group(s) with the higher click-through probability". While the concept of randomizing may at first appear odd in the advertising context, any web surfer will have observed that portals do in fact serve up mixtures of ads.
The first priority of this paper is then to formulate optimization models which avoid "over-targeted'' solutions by a technique which allows for randomization, as well as considering click-through probability, and which is suitable for much larger (and less contrived) problems than our example, involving many types of users and ads. We use a statistical argument in deriving this model, following the lead given in other modeling areas - notably transportation planning. These models also have the advantage that efficient algorithms are available for their solution, making them attractive in practice.
A further advantage of this formulation is that it provides a fundamental building block for more far-reaching objectives in web-based advertising. In particular we will describe how this linear-logarithmic model can be embedded in a larger model for the purpose of optimal management of web advertisement portfolios by agencies or brokerages.
In the next section we move beyond the highly simplified LP model described above, and outline the quite general linear programming model formulated by Langheinrich et al.[10] for their ADWIZ system. This is followed by a derivation of the entropy approach and a description of the form of the solutions. We then describe how this model may be embedded in a prototype multi-time-period planning model for an advertising campaign. Finally we summarize some of the many interesting areas of further exploration associated with the entropy modeling approach.
Langheinrich et al.[10] have described a system for "unintrusive customization" of web advertising - unintrusive in the sense that only impersonal information, such as use of a search keyword, or accessing of a page, is required. This system consists of a number of elements. Those of most interest to us here are a Selection Engine and a Learning System. The first of these components uses a set of weights in a Relevancy Computation to compute the probability of displaying a particular ad to a user, the user being associated with certain keywords or pages.
The component of principle interest here is a "learning" system, which computes display weights based on current estimates of click-through probabilities, the keyword list and the ad inventory. The components and formulation of the model (but with the indices reversed here) are as follows:
| i = 1,...,m | The keywords |
| j = 1,...,n | The ads available |
| hj | The desired display rate for ad j (normalized over all ads) |
| ki | The keyword input probabilities (normalized over all keywords) |
| cij | Click-through probability for ad j and keyword i |
| dij | The display probability of ad j for keyword i. |
n
j = 1 |
dij = 1 |
(i = 1,...,m) |
m
i = 1 |
ki dij = hj | (j = 1,...,n) |
(1) |
dij![]() |
0 | (i = 1,..,m ; j = 1,...,n) |
m
i = 1 |
n
j = 1 |
cij ki dij |
Note that the first equation in (1) above can be multiplied through by ki, followed by a change of variable:
ij = ki dij |
to produce the equivalent constraints:
n
j = 1 |
ij = ki |
i = 1,...m) |
|
m
i = 1 |
ij = hj |
(j = 1,...n) |
(2) |
ij 0 |
(i = 1,..,m; j = 1,...,n) |
and objective:
Maximize |
m
i = 1 |
n
j = 1 |
cij ij |
(3) |
This is a classic "transportation problem" (see e.g. [3]), and we shall work with this and subsequent models in this form. The model may be complicated in practice by contractual requirements setting a minimum on certain ad display rates. Such constraints, of the form
![]() |
ij | ![]() |
L | ij |
Despite the virtues of this model - simplicity and fast solution time - it suffers from much the same weakness as the LP model in the introduction. Linear programming theory (see [3]) tells us that a model like this, with (m+n) linear constraints on mn variables, has optimal solutions in which at most (m+n) of the variables are away from their lower bound (usually zero). Thus for realistically large values of m and n we expect only a small fraction of the variables to be nonzero. In other words we expect the same phenomenon of over-targeting; on an even larger (if less extreme) scale. Once again we must regard such solutions as unrealistic and unsatisfactory, with many of the ads not being shown at all for most keywords. Langheinrich et al.[10] recognized this, and resorted to the technique of applying rather arbitrary lower bounds on all the variables. As we saw in the small example, this can remove the immediate symptom - where many or most ad/keyword combinations never appear - but it is far from clear in what sense such solutions can be described as "optimal".
In this section we shall employ a methodology used in Traffic Theory for the distribution of vehicles from origins (sources) to destinations (sinks). Following the description (but not the notation) of Wilson [13] in the traffic context, let us define a total number of trips Ai originating at origin node i, and require Bj trips to end at destination j. Let xij be the number of trips for the origin-destination pair (i,j), further let cij be some generalized cost or impedance of travelling from i to j. Then clearly we require:
n
j = 1 |
xij = Ai |
(i = 1,...,m) |
|
m
i = 1 |
xij = Bj |
(j = 1,...n) |
(4) |
xij 0 |
(i = 1,..,m; j = 1,...,n) |
and for feasibility we must have:
X = |
m
i = 1 |
Ai = |
n
j = 1 |
Bj |
(5) |
where X is the total number of trips.
We shall also assume for the time being that another constraint is satisfied:
m
i = 1 |
n
j = 1 |
cij xij = C |
(6) |
The basic assumption is that the probability of the distribution {xij} occurring is proportional to the number of states w(xij) of the system which give rise to this distribution. The number of distinct arrangements of individuals which give rise to a distribution {xij} is:
| w(xij) = (X!) / ( |
ij |
xij!) | (7) |
Now finding the maximum of this function is equivalent to finding the maximum of the log of w, which after applying Stirling's formula, and neglecting constant terms, requires us to:
Maximize - |
m
i = 1 |
n
j = 1 |
xij ln(xij) |
(8) |
Without going into detail, the solutions of this entropy maximization problem can be shown to be of the form:
xij = ai Ai bj Bj exp(- ![]() |
cij) | (9) |
is the Lagrange multiplier for equation (6). In practice it is not necessary to know the value of
. An efficient iterative (scaling) procedure is available (see [11] and [5]) which enables us to solve the problem without having to resort to more expensive general nonlinear programming methods. It can also be shown that for practical values of the Ai, Bj that this maximum is sharp ([13]).
The maximum entropy solution discussed above is one form of equilibrium solution.
The applicability of this model to the targeted advertising problem should now be clear. If we take trips originating at origins as analogous to "buckets" of users, with Ai users in each bucket, and the destinations as groups of ads to be shown, with Bj ads of each type, and consider the cij as click-through probabilities with C as the total number of clicks required, we obtain an unnormalized extension of the model discussed in section 2. Once again, contractual lower bounds may be imposed on particular ad/user pairings (as long as feasibility is not lost). However, there is no "all-or-nothing" nature to the solution, and over-targeting may be avoided without any requirement for arbitrary lower bounds. Indeed, since the model involves the logarithm of the xij's, they must necessarily be positive.
Another statistical model of interest here has also arisen in traffic theory, and is perhaps even more appropriate (see [12]). We proceed by analogy with the Helmholtz free-energy function, which is at a minimum for a system in equilibrium in conditions of constant volume and temperature. This function is of the form:
| F = E - K lnw | (10) |
| _
cij |
= [ |
max pq |
cpq] - cij |
(11) |
F = constant + |
m
i = 1 |
n
j = 1 |
xij( |
_ c |
ij |
+ ln(xij)) |
(12) |
Here
is a constant, replacing K, whose value is yet to be determined. We now assert that the equilibrium distribution is that which minimizes F subject to (4). The constraint (6) is no longer needed, and the parameter
accommodates a range of cases, from the extreme of
= 0, which gives us the unnormalized version of the LP model in section 2, to a completely proportional model, g iving the solution
| xij = Ai Bj / X | (13) |
is taken to be arbitrarily large. The general form of the solution to this model can be shown to be of formx ij = |
^
a i |
^
b j |
exp(- |
_
c ij |
/ ) |
(14) |
. It is shown in [12] that under certain assumptions the weighted mean of the cost values (as defined in (11)) provides a good fit for the analogous traffic problem, and we shall initially assume so here. This allows us to use an iterative procedure (in
) to solve the problem. A good initial value for
has proved to be simply the mean of these cost values for some models, and sometimes this is even a good enough estimate to obtain good agreement between the model and real data. (We used this value in the tiny NLP example in the introduction). Once again the solution to (14) for any
can be obtained by an efficient iterative (scaling) procedure.
Thus far we have stated this form of the statistical model as minimization problem. Once
has been chosen this is of course equivalent to the maximization problem:
Maximize |
m
i = 1 |
n
j = 1 |
xij( cij - ln(xij)) |
(15) |
This form of the statistical model offers significant advantages over that stated in (4)-(8). The constraints are those of the classical transportation problem, and the rather arbitrary constraint (6) has been replaced by a parameter in the linear-logarithmic objective function for which we have some rationale for assigning a value. For either case, we have a self-contained, easily solvable, constrained optimization model which can be embedded in more complex models which we may now consider building for the management of web advertising campaigns.
Note that we have made no assumptions on how the "buckets" of users are defined. They may correspond to search keywords, states or histories. Similarly, the assigning of the ads to groups may be by individual or classes of ad. The key pieces of data are the number of users or ads in each bucket or group and the click-through probabilities. The question of maximizing revenue then naturally arises, and can be answered by applying revenue weights to the cij term in the objective. This will be seen in the more general model of the next section.
We will consider the environment in which our models are to be used - the ad supply chain - to be made up of three segments:
For concreteness, let us formulate a model which considers only the first two of these specifically - an agency and a number of advertisers who wish to present ads to users in (at least some of) the same buckets. We also broaden the model to multiple time periods. The aim of the agency is to obtain ads from the advertisers which will maximize their net revenue, given the expected number of users in each bucket per time period, and the click-through probabilities for ads in each time period. The components of this model are:
| i = 1,...,m | The buckets of users |
| j = 1,...,n | The ad types available |
| k = 1,...,K | The advertisers |
| t = 1,...,T | Time periods |
| Ait | The number of expected users in bucket i in period t. |
| cij | Click-through probability for ad j by user i in period t. |
| Rijt | Revenue from click-through for ad j by user i in period t. |
| Dijt | Revenue (or Cost) for displaying ad j by user i in period t. |
| Pjt+ | Penalty for shortfall of shown ads type j at end of period t |
| Pjt- | Penalty for excess of shown ads type j at end of period t |
| Mjkt | Agency's cost of obtaining ads of type j from advertiser k to be shown in period t, |
| Ujkt | Upper limit on ads of type j from advertiser k in period t. |
| Ljkt | Lower limit on ads of type j from advertiser k in period t. |
t |
Entropy weight for period t. |
| xijt | The displays of ad j for user type i in period t. |
| yjkt | The number of ads j bought by advertiser k for display in period t. |
| zjt | The number of ads j shown to all users in period t. |
| sjt+ | Inventory of unshown ads of type j at end of period t |
| sjt- | Excess of shown ads of type j at end of period t |
Material Balance:
| sjt+ - sjt- = sj,t-1+ - sj,t-1- + | K
k = 1 |
yjkt - zjt | for all j,t |
|||
| (16) | ||||||
n
j = 1 |
zjt = |
m
i = 1 |
Ait |
for all t |
||
Supply and Demand:
n
j = 1 |
xijt = Ait |
for all i,t |
|
| (17) | m
i = 1 |
xijt = zjt |
for all j,t |
Bounds:
sjt+,sjt-, xijt, zjt 0 |
for all i,j,t | |
| (18) | ||
Lijt yijt Uijt |
for all i,j,t | |
Maximize
i,j,t |
xijt(Dijt+Rijtcijt - t ln(xijt)) - |
jt |
(Pjt+ sjt+ + Pjt- sjt-) - |
jkt |
Mjktyjkt | (19) |
Although the broader questions of costing of web advertising (see [1] and [7]) are beyond the scope of this paper, note that by allowing revenues (or costs) to be associated with ads that are clicked on or otherwise (via the Dijt and Rijt coefficients), as well as marketing costs Mjkt, considerable flexibility is provided.
This prototype model needs some further specifications to complete it. For example dummy ads may need to be added to make sure that the second material balance constraint can be satisfied, but the above description gives us a sufficiently detailed model to examine its structure. Clearly it can grow quite large quite quickly if T becomes large, or if m or n are taken to be large. However the model has a definite structure. If we let H represent an m by n transportation problem coefficient matrix, the structure of the entire problem coefficient matrix is of the form:
|
| | | | | | | | | |
A(0)
A(1) A(2) · · A(T) |
H |
H |
· · · |
H |
| | | | | | | | | |
It is important to note that the Generalized Benders approach can be applied in a wider context. It is not limited in applicability to the maximum entropy submodels we have considered. Any procedure for displaying groups of ads to buckets of users which can notionally be expressed as an optimization problem in our xijt variables, subject to constraints only on those variables, is a candidate for this treatment. Thus some form of Markovian model (see [2]) may be just as amenable as the entropy model(s) we have discussed.
There are many possible extensions to the embedded targeting model described above, to encompass more of the ad supply chain. It is relatively straightforward to extend it to consider properties on multiple portals by stratifying the buckets of users (say by adding an index p) and considering not only variables xpijt etc., but stratified revenues Rpijk etc. Such a model, incorporating agencies, advertisers and properties/portals will still have the basic matrix structure shown above, and be amenable to treatment by decomposition.
There are other statistical techniques, again grounded in transportation studies, which might well be considered for application in targeted advertising applications. One of these is the "intervening opportunities" model (see [13]) which ranks, in this context, groups of ads in decreasing attractiveness for each bucket of users, and using a probability that opportunities of a certain rank will be passed up, constructs an exponential decay model for associating users with groups of ads.
It should be considered that the model(s) formulated here deliberately assume (since they are "unintrusive") that very little is known about individual users - only the bucket to which they belong and the click-through probabilities for that bucket of users. If we relax the uninstrusivity requirement it may well be that we can stratify users by information level - some with the information level we have used above, some with limited information available through cookies, and others for which a detailed click-trail is available. Once again, it is possible to extend the model to accommodate this stratification, modifying it to vary the weight on the entropy terms for the different strata, without losing the matrix structure which promises efficient solution.
Other extensions of this model involve examining the structure of the costs which we have so far considered constant. Especially when multiple advertisers and multiple portals are considered, there is an opportunity to use the model to evaluate some forms of nonlinear pricing for yield management (see [4] and [14]).
We have shown that entropy models, and potentially other modeling techniques from other areas of Operations Research, can provide useful tools in the context of web advertising campaigns, either in a single time-slice application, or more generally, at a strategic level in the management over time of entire campaigns and portfolios.
As in any large scale optimization application, data availability is critical. This work is at an early stage. The next step is the acquisition of suitable data to test and calibrate the models we have described.
The author is indebted to Prabhakar Raghavan and Sridhar Rajagopalan for valuable discussions in the course of this work.