Histogram Matching for Camera Pose Neighbor Selection
Kevin L. Steele, Parris K. Egbert, Bryan S. Morse
{ steele, egbert, morse} @ cs. byu. edu
Department of Computer Science, Brigham Young University
3361 TMCB, Brigham Young University, Provo, Utah 84602
Abstract
A prerequisite to calibrated camera pose estimation
is the construction of a camera neighborhood adjacency
graph, a connected graph defining the pose neighbors of the
camera set. Pose neighbors to a camera C are images con-taining
sufficient overlap in image content with the image
from C that they can be used to correctly estimate the pose
of C using structure- from- motion techniques. In a video
stream, the camera neighborhood adjacency graph is often
a simple connected path; frame poses are only estimated
relative to their immediate neighbors.
We propose a novel method to build more complex cam-era
adjacency graphs that are suitable for estimating the
pose of large numbers of wide- and narrow- baseline im-ages.
We employ Content- Based Image Retrieval techniques
to identify similar images likely to be graph neighbors. We
also develop an optimization to improve graph accuracy
that is based on an observation of common camera motions
taken when photographing with the intent of structure- from-motion.
Our results substantiate the use of our method for
determining neighbors for pose estimation.
1. Introduction
Camera pose estimation, the recovery of a camera’s ex-trinsic
parameters, is a long- studied problem in computer
vision, and researchers have generated significant results
and some robust algorithms [ 8]. In many pose- estimation
algorithms, image features such as points and lines are
matched between image pairs, triplets, or sequences and the
matches are used to compute the camera pose. A prereq-uisite
to this feature matching is the identification of im-age
pairs or an image neighborhood defining similar images
with which to match features. These image neighbors can
be expressed as an undirected adjacency graph [ 18], where
nodes of the graph are cameras and their respective images,
and edges infer proximity between cameras whose view
frusta overlap to include common scene structure. Edges
Figure 1. A camera neighborhood adjacency
graph. These five images contain vary-ing
amounts of overlap of a museum taxi-dermy
display. Images containing signifi-cant
overlap are bi- directionally connected in
the graph. Thus, the center image has four
neighbors, and its camera pose should be
estimated relative to these four neighboring
cameras.
essentially connect cameras that have a high likelihood of
successful pose estimation due to the overlapping image
content ( see Figure 1 for an example).
Many pose- estimation algorithms depend on the transi-tive
correctness of successive camera poses:
( A
p !
B) ^ ( B
p !
C) ) ( A
p !
C), ( 1)
where A
p !
B is a binary relation between camera pair
{ A, B} 2 Scams ( the set of all input cameras) indicating
that the pose of camera B is correctly estimated relative to
that of camera A. Since there is often an explicit ordinality
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006
to the pose estimation of a camera sequence, a poor esti-mate
early in the chain can propagate large pose errors. It
is therefore important that the camera neighborhood adja-cency
graph be as connected as possible ( Figure 1). Our ob-jective
in this paper is to propose a novel process of camera
neighbor selection for construction of the adjacency graph.
The emergence of inexpensive and high- quality digital
cameras and camcorders, together with the improved abil-ity
to quickly move image and video content into computer
memory has enabled many applications of 3D reconstruc-tion
and visualization. It is increasingly desirable to con-struct
dense camera networks of hundreds of cameras for
visualization and reconstruction purposes. However, given
the limitations of current pose estimation algorithms, espe-cially
with wide- baseline cameras, creating the prerequisite
adjacency graph is difficult. Current algorithmic deficien-cies
include the inability to involve all close camera neigh-bors
in an initial pose estimate while excluding images that
do not overlap at all. Another deficiency is the lack of a
coherent method to include all types of footage, e. g., video
streams and still images, simultaneously in the pose esti-mates.
In this paper we propose a novel and efficient way to
create the camera neighborhood adjacency graph in the
presence of hundreds of input images by using methods
from content- based image retrieval systems. We use color
histograms to identify similar images as candidate cam-era
neighbors, and partial histogram functions ( defined in
Section 4) to more accurately determine neighbors given
constraints unique to the purpose of pose estimation. Our
method has the ability to find accurate camera neighbors
for large numbers of input images without imposing a spe-cific
pose order ( at the expense of an O( n2) algorithm), and
makes no distinction between image input formats ( still im-ages
or video streams).
2. Background
Often the adjacency graph creation for a pose solution
is done by hand in order to exploit known matching char-acteristics
and optimize the quality of the pose estimation.
When few images are to be matched, neighbor selection is
trivial. In this section we describe past methods in determin-ing
pose- neighbor selection. Often the selection process is
implicit to a pose- estimation algorithm, and an adjacency
graph is not explicitly constructed.
Teller et al. [ 18] create omni- directional images of out-door
urban environments for pose estimation. They deter-mine
camera neighbors by taking the k- nearest neighbors of
an initial adjacency graph constructed from GPS sensor data
acquired at the physical camera location. Most other meth-ods
( including ours) attempt to build the adjacency graph
exclusively from image content rather than utilizing exter-
( a)
( b)
X
X
( Degenerate adjacency graph)
( Better adjacency graph)
Figure 2. Adjacency graphs created from an
implicit ordering of the input cameras. The
pyramids in ( a) and ( b) show the camera loca-tions
of two hypothetical input video streams
pointing toward a central point ( X). The in-put
for ( a) is a one- dimensional tracked se-quence
about X. Most current algorithms
construct the ( degenerate) adjacency graph
using immediate frame neighbors, as shown
with the arrows. The input for ( b) is a viewing
sphere [ 9] from a zigzag about X. Looking be-yond
the immediate frame sequence [ 9, 10],
one can create better graph configurations
with more pose neighbors, and consequently
more accurate pose estimates.
nal sensors.
Much recent work has focused on using video streams in
the matching process [ 5, 9, 10, 11, 12, 16], in which fea-tures
( points, lines, etc.) are identified in an initial frame
and then tracked through subsequent frames until the fea-tures
are no longer identifiable. New features are typically
added throughout the tracking process so that at any given
frame of the stream many features exist between the current
frame and its immediate predecessor and successor. Thus
an image’s match ( and pose) neighbors are the frames im-mediately
preceding or following it in the video stream, and
the corresponding adjacency graph degenerates to a path
graph, i. e. a path containing all the nodes of the graph ( see
( a) in Figure 2). All algorithms that operate on sequences
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006
of images produce similar degeneracies, regardless of in-put
format. For instance, Lhuillier and Quan [ 11] use still
images rather than a video stream, but they nonetheless en-force
an ordering on the input images to define match part-ners.
Sainz et al. [ 16] designed their calibration solution to
include video sequences, manually tracked sequences, and
still images. However, they still require an imposed order-ing
on all input images. In contrast, our method makes
no assumption on the input order and produces adjacency
graphs in which most or all correct world- space neighbors
are detected and included, not simply those nearby in a lin-ear
sequence.
Algorithmic variants include matching to images several
frames separated ( 2, 3, or more frames ahead or behind the
current frame) to improve matching characteristics such as
sharpness or depth variance [ 13]. The first attempts to de-part
from the conventional sequential ordering requirement
are those in [ 9, 10]. Koch et al. [ 9] proposed the method of
sweeping a camcorder over a viewpoint surface in a zigzag
pattern to construct a viewpoint mesh, a 3D polygonal mesh
whose vertices are the viewpoints of the reconstructed cam-eras.
Rather than restricting their pose neighbors to frames
before and after a current frame, they exploit the zigzag na-ture
of the sweeping pattern to find additional neighbors.
Their method backtracks at each frame to examine the 3D
pose locations of the previous cameras— any prior cameras
within a distance threshold are included as neighbors to
the current frame. Figure 2 ( b) shows an example of their
method. In [ 10] the authors predict a very coarse estimate
of an unknown camera pose by first computing the funda-mental
matrix F of an image pair from feature matches.
They use the epipole extracted from F to predict the new
pose direction, and the residual correspondence error of the
rectified image pair1 to predict the distance from the pose-partner.
Given this coarse pose estimate, they can use a
world- space proximity measure to determine camera neigh-bors
from previously estimated cameras. Thus the authors
are able to implicitly build a more complete, non- degenerate
camera neighborhood adjacency graph. This method is sim-ilar
to ours in that it attempts to build a non- degenerate adja-cency
graph. However, their video stream must be centered
on one central object of the scene, since the coarse pose es-timate
cannot account for camera rotation. Our method is
able to deal with arbitrary camera motion around any part
of the scene, provided there is sufficient overlap of scene
content in the input set.
Several recent contributions associate overlapping im-ages
using local image features. Brown and Lowe [ 1, 2]
extract SIFT features from each image and build a k- d tree
containing all features from all the images of the input
set. Images of the same objects or scenes are identified by
1For efficiency the authors use a linear affine mapping as an approxi-mation
to the projective rectification.
searching the k- d tree for the nearest neighbors of a given
query feature found on the object of interest. This allows the
implicit creation of an adjacency graph, which the authors
use to determine the metric pose of each image through bun-dle
adjustment.
Schaffalitzky and Zisserman [ 17] find geometrically and
photometrically invariant feature descriptors in all images
and store them in a BSP tree. Feature vectors within a
threshold distance are considered matches, and an adja-cency
graph in the form of an explicit spanning tree is con-structed.
In both [ 1, 2] and [ 17] local image features are
used to identify adjacent images. Their algorithmic com-plexity
is linear in the number of images, plus a non– linear
( though small) tree searching component.
In this paper we propose the use of content- based image
retrieval ( CBIR) techniques for the task of image neighbor
determination for camera pose estimation. By using CBIR
to build the camera neighborhood adjacency graph, rather
than relying on implicit input order, we remove the require-ment
of a sequentially- ordered input set. The input set can
thus be seen as an image database from which to draw pose
neighbors for a given image. Queries made on the database
have the constraint that all returned images are neighbors in
the adjacency graph.
3. Content- Based Image Retrieval
Content- based image retrieval is a set of techniques for
retrieving images from a database using features automat-ically
derived from the images themselves. The features
used to query CBIR databases often include color histogram
content, texture, color location, shape and image composi-tion.
CBIR has received widespread research attention, and
a number of general- purpose CBIR search engines exist—
IBM QBIC [ 6], MIT Photobook [ 14], and Berkeley Blob-world
[ 3] to name a few.
Image retrieval in the larger context is concerned with
finding the images in a database that are semantically rele-vant
or similar to a query image. Often the relevant images
are of the same class or category, such as “ all brown dogs”
or “ all persons on a beach.” In the context of pose estima-tion,
however, we are concerned with finding locationally
relevant images, or images of the same part of a scene as the
query image. This puts a large constraint on typical CBIR
usage, and restricts the useful feature types.
We use color histograms as the feature type to match im-ages.
The color histogram is a widely- used feature repre-sentation
for image retrieval [ 15]. Its advantages include
invariance to rotation and translation of the image content.
One major drawback of using color histograms for image
retrieval is that they discard any spatial information in the
image content. For example, a histogram of an outdoor
scene loses all information that the blue sky is at the “ top” of
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006
x
z
y
Figure 3. Pyramids representing camera view
frusta. Each of the darker blue cameras in
the left column has been rotated to the right
about the origin in 12 increments of 3 ; each
row of the pictured matrix of viewpoints com-prises
a rotational set Src. Each set has also
been rotated 5 up from the previous set to
represent varying grazing angles of the ge-ometry
with the cameras. The view frusta for
all cameras intersect the ground plane y = 0
forming individual quadrilaterals. The quads
for two such cameras ( circled) are outlined in
bold red and black lines. Note that the over-lap
between the two quads is exactly the geo-metric
scene content shared between the two
images of the circled cameras. The image
of the overlap in this figure is shown as the
shaded area of Figure 4.
the image, complicating queries for more blue sky images.
However, as will be discussed in Section 4, this weakness
is not a disadvantage when using histogram matching for
camera neighbor selection.
We transform RGB image color to theHSV color space
and make comparisons using the hue component exclu-sively.
A simple L1- distance comparison function between
two hue histograms H and I works well in practice:
d( H, I) =
n X k= 1
| Hk - Ik| ( 2)
where n is the number of color bins. Given a query im-age
histogram H, and a database image histogram I, the
histogram distance d( H, I) represents the dissimilarity be-tween
the two images. For the L1- distance metric, d( H, I)
is precisely the number of pixels that differ in hue in both
images. For each query image, we can sort the remaining
images in our input set ( the database) on increasing values
of d, then threshold either the number of neighbors or the
90°
85°
80°
75°
70°
65°
60°
55°
50°
45°
40°
35°
3° 6° 9° 12° 15° 18° 21° 24° 27° 30° 33° 36°
Y- Axis Rotation
X- Axis Rotation
A B
C
Figure 4. Grid of quadrilateral re- projections.
Each square of the grid shows the re-projection
of an initial camera’s quad to a
rotated camera of the same set. The white
area in each square is the overlap of the re-projected
quad with the quad of the corre-sponding
rotated camera. It is easily seen
that the coverage pattern shifts to the right
( the square labeled B) as the cameras in-crease
in their y- axis rotation, and the pat-tern
rotates clockwise as the camera sets de-crease
in their x- axis rotation ( the square la-beled
C). This suggests the search patterns
illustrated in Figure 5. The green shaded area
in the figure corresponds to the quadrilateral
overlap from Figure 3.
value of d to determine the actual neighbors of the query
image.
By performing histogram matching using all images in
the input set as independent query images, we can build a
directed adjacency graph. If the degree of each node is con-stant,
as would be the case in a typical implementation, then
the graph is directed due to the lack of a symmetric binary
relation in the set of all neighbors, i. e. I1 having neighbor I2
does not imply I2 has neighbor I1. For many pose estima-tion
algorithms, a directed adjacency graph is sufficient. If
an undirected graph is necessary, it can be assumed from the
directed graph with the allowance of a non- uniform thresh-old
( for example, each node may have differing numbers of
neighbors).
4. Optimization
We now propose a novel optimization that improves the
accuracy of the adjacency graph as constructed in Section 3.
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006
The optimization is based on an observation of typical cam-era
motions made when photographing scenes for eventual
structure- from- motion applications:
Observation. When photographing for the purpose of later
3D reconstruction or visualization, we have a tendency to
follow specific camera- motion patterns— we tend either to
rotate about a subject’s up- axis, or to track ( translate) hor-izontally
or vertically past a subject.
We also tend to avoid extreme or arbitrary camera mo-tions
such as cyclo- rotation, panning ( except in the case of
creating panoramas), diagonal tracks, large rotations and
large tracks. While we haven’t performed user studies on
the validity of this observation, we believe it to be a fair
characterization based on our own camera motion patterns
and those observed in the computational stereo literature.
This observation will guide the development of our algo-rithm.
Let the ground plane y = 0 be a coarse approxima-tion
of the geometry of interest in a structure- from- motion
application. If we map the typical camera motions taken
from the observation, we get a set of translated and rotated
camera positions centered on a point, e. g., the Euclidean
origin. Figure 3 illustrates candidate sets of rotated cam-eras.
The set of rotations can be expressed by
where Src is the set of homogeneous camera projection ma-trices
representing the rotated cameras, P is the homoge-neous
camera matrix of an initial camera, M( , t) is the pa-rameterized
matrix that rotates the initial camera by about
a point on the principal axis t units from the camera center,
and and are the limits of rotation. M( , t) is expressed
in Equation ( 4) as the composition of the individual affine
transformations in ( 5). These matrices translate the cam-era’s
rotational center to the origin, rotate the camera about
the positive y- axis, and then translate it back again.
0 0 0 1
3 7 7 5
2 6 6 4
1 0 0 0
0 1 0 0
0 0 1 - t
0 0 0 1
3 7 7 5
.
( 5)
The intersections of the view frusta with the ground
plane form quadrilaterals, as shown in Figure 3. Let qi be
the image of the quadrilateral formed by an initial camera
Ci, and qr be the image of the quadrilateral formed by a
rotated camera Cr, where Ci and Cr are cameras whose
defining matrices Mi, Mr 2 Src. The image qi can be re-projected
to Cr by a homographyH. This re- projection ˆ qi is
the original quadrilateral as seen from the vantage point of
Cr. The overlap of the re- projected quad ˆ qi with qr is pre-cisely
that geometry that is seen from both cameras. Fig-ure
4 illustrates the overlap of ˆ qi with qr for the array of
cameras in Figure 3.
In the context of pose estimation, the overlap represents
the image content from which to derive useful features for
feature matching. The quality of the matches directly deter-mines
the success of the pose estimation. Since larger image
overlaps imply larger quantities of matches, we would like
to quantify the amount of overlap between a query image
and all other images, and assign neighbors to the query im-age
from among those with the most overlap. Herein lies
the optimization: we can improve the accuracy of the adja-cency
graph by detecting this overlap, then computing color
histogram comparisons only on the overlapping portions of
images rather than on entire images.
Examining the family of overlap patterns created from
typical camera motions suggests an efficient method to de-termine
the correct overlap. Consider the example of the
two cameras labeled A and B in the top corners of Figure 4.
Camera B is horizontally rotated 36 about a point visible
to both cameras. The overlap pattern seen at label B in Fig-ure
4 is approximately a right- shift of the image contents.
If the scene contains any depth variation, there will be par-allax
in the shift, but for moderate depth variation the shift
still maintains most of the color content. If we compute
a color histogram on only the shifted portion of image B,
and its equal but opposite shift ( we denote as the conjugate
shift) in image A, we exclude from the histograms those
pixels that are not likely to be shared between the images.
Since we do not know in advance the amount of rotation
( if any) between two cameras, we can parameterize the shift
and perform an iterated search for the smallest histogram
distance. The histogram distance function then becomes
where H is the partial histogram of the query image given
a shift pattern, is the partial histogram of the test image
given the conjugate shift pattern, and | | is the size in num-ber
of pixels of the partial histogram. The function returns
the set of pixels of a specific shift parameter, and the set
of pixels of the conjugate shift. The distance is weighted
by the size of the partial histogram to give distance per unit
pixel. The histogram distance of the best overlap is given
by
dmin = min
where l is the shift parameter. In the above example, l iter-
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006
( b)
( d)
( f)
( a)
( c)
( e)
Figure 5. Histogram search coverage
patterns— each pair consists of a search
pattern and its conjugate. We propose six
pairs of search patterns ( a- f) that follow
the coverage patterns in Figure 4. Pattern
( a) proceeds left- to- right in one image and
right- to- left in the other image. Patterns ( b- d)
proceed similarly according to the arrows.
Patterns ( e) and ( f) proceed from the corners
as illustrated to detect low- grazing angle
rotation, for example the pattern labeled C
from Figure 4. Note in this case that we do
not need to proceed from the bottom corners
for two reasons: the effect of perspective
decreases the overlap much more at the
top corners, and camera angles from the
underside of surfaces are not included in the
list of common camera motions.
ates left- to- right across the columns of image B, H ( l) com-putes
the histogram of the pixels to the right of column l in
image B, and I ( l) computes the histogram of the pixels to
the left of column ( ImageWidth - l) in image A.
This process can be repeated using other shift patterns
that represent the common camera motions. We use six dif-ferent
shift patterns, given in Figure 5, and take the mini-mum
dmin of the six patterns to be the final partial histogram
distance between two images. Figure 6 shows the outlines
of an optimal partial histogram region from an image pair.
The algorithmic complexity of partial histogram compar-ison
is similar to that of using global histograms. Com-puting
the global histogram is O( n) in the number of pix-els,
and computing partial histograms is also O( n) for
each shift pattern since, for each value of the shift param-eter
l, we simply add a line of pixel values to the his-togram
data computed for the previous parameter value.
However, while comparing histograms is O( n) in the
number of hues chosen for both global and partial his-tograms,
we make m more comparisons for partial his-tograms,
m = ImageWidth, m = ImageHeight, or
m = ImageWidth + ImageHeight depending on the
Best overlapping region, computed
from partial histogram distance
Figure 6. Image pair showing optimal partial
histogram regions. The partial histogram dis-tance
function found the minimum at column
219 ( 640x480) in the left image using cover-age
pattern ( b) from Figure 5.
shift pattern. We have observed this commensurately in-creased
running time in practice.
5. Results
We tested our optimized and non- optimized adjacency
graph building algorithms using four sets of wide- baseline
images. Sample images from each of the four sets are
shown in Figure 7. We use precision vs. recall comparisons
to measure the accuracy of the set Si of computed neighbors
for each image. To use this measure, we also constructed the
set Ti of “ true” neighbors for each image. Precision mea-sures
the percentage of Si in Ti, or | Si T Ti|/| Si|. Recall
measures the percentage of Ti in Si, or | Si T Ti|/| Ti|.
To automatically construct the set Ti we utilized fea-ture
correspondences that are often- times used to estimate
internal and external camera parameters. We computed an
accurate set of point matches and their connecting vectors
between all pairs of input images. The proportion of im-age
overlap for each pair was determined from its average
matching vector. Since low variances of point match vec-tors
correlate well with close image neighbors, we rejected
outlying images with no overlap by thresholding on 2. Fi-nally,
we constructed Ti for each image from the images
exhibiting the most overlap.
We use Equation ( 7) to build the candidate neighbors for
each image in the test sets. For a given query image, the
minimum histogram distances are computed to all other im-ages
in the set. The distances are sorted, and the test images
corresponding to the smallest n distances are kept as neigh-bors
to the query image. We measure the precision/ recall
accuracy for values of n = [ 1.. k - 1], k being the size of
the test set.
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006
Museum ( from 55 images)
Office ( from 39 images)
Outdoor1 ( from 30 images)
Outdoor2 ( from 16 images)
Figure 7. Representative images from four
test sets. The images from both the Mu-seum
and Office sets have camera motions
containing significant horizontal and verti-cal
rotation, while Outdoor1 has horizontal
and some vertical rotation, and Outdoor2 has
only horizontal rotation.
Results of the four test cases are given in Figure 8. Each
data point on the four graphs represents the average pre-cision
and recall of the computed neighbor sets for a spe-cific
n number of neighbors. The graphs show a typical
inverse relationship between precision and recall [ 4], and in
all four cases using partial histogram distances improves the
precision accuracy from simply using the global histogram
function in Equation ( 2). Partial histogram distances im-prove
precision on average by 26.9% for 5 neighbors, and
by 24.3% for ten neighbors. Table 1 lists percentage im-provements
for each of the four test cases.
Museum
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Recall
Partial
Global
1- x
Office
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Recall
Partial
Global
1- x
Outdoor 1
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Recall
Partial
Global
1- x
Outdoor 2
0
0.2
0.4
0.6
0.8
1
0 0.2 0.4 0.6 0.8 1
Recall
Partial
Global
1- x
Figure 8. Precision/ recall graphs from the
four test suites ( museum, office, outdoor1
and outdoor2). The lighter pink plot in each
graph shows the precision/ recall accuracy
for histograms taken from the whole images
( global histogram). The darker blue plots
show the accuracy for partial histograms. In
all test cases the partial histogram compar-isons
improved the accuracy significantly.
Percentage Improvement
5 Neighbors 10 Neighbors
Museum 35.8% 36.2%
office 24.0% 24.1%
outdoor1 24.7% 23.7%
outdoor2 23.0% 13.3%
Table 1. Improvement of precision accuracy
for using partial histogram distances rather
than global histogram distances.
6. Summary and Conclusions
We have shown a novel and efficient method to create
the camera neighborhood adjacency graph for use in pose
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006
estimation without having to first find feature matches be-tween
two images— a potentially time- consuming process
for wide- baseline images [ 7]. In addition, our method can
process sequential and non- sequential image collections si-multaneously,
such as images from still cameras and video
streams from camcorders. An ordering such as a least-cost
Hamiltonian path could later be imposed on the ad-jacency
graph for use in existing sequential pose- estimation
algorithms, although we foresee hierarchical or other non-sequential
posing schemes to ultimately be more efficient.
We have also proposed an optimization for improving
the accuracy of the adjacency graph by computing partial
histograms, i. e., histograms on just the overlapping portions
of image pairs. The optimal overlapping portion is found as
the minimum of a distance function that computes partial
histograms on a pre- determined pattern of image overlaps.
The pre- determined pattern comes from an observation of
common camera motions used when photographing content
for visualization or 3D reconstruction.
Our method works well for the image sets we have
shown in this paper and for many other image sets we have
tested. Since our algorithm requires each image in a set to
act as the query image, and each query image is compared
to all other images in the set, it is an O( n2) algorithm. In
practice one can perform the algorithm on sets containing
hundreds of images without undue computation time on a
modern processor. However, for sets of extremely large size
( thousands of images) it may be beneficial to hierarchically
cluster the input images as they are matched. Traditional
clustering methods could be used for this using coarse his-togram
content as a cluster feature.
References
[ 1] M. Brown and D. G. Lowe. Recognising panoramas. In
Proc. International Conference on Computer Vision ( ICCV
’ 03), pages 1218– 1225, 2003.
[ 2] M. Brown and D. G. Lowe. Unsupervised 3d object recogni-tion
and reconstruction in unordered datasets. In Proc. Fifth
International Conference on 3- D Digital Imaging and Mod-eling
( 3DIM 2005), pages 56– 63, 2005.
[ 3] C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and
J. Malik. Blobworld: A system for region- based image in-dexing
and retrieval. In Proc. 3rd Int. Conference on Visual
Information and Information Systems ( VISUAL ’ 99), pages
509– 516, London, UK, 1999. Springer- Verlag.
[ 4] G. G. Chowdhury. Introduction to Modern Information Re-trieval,
2nd Ed. Facet Publishing, London, 2004.
[ 5] A. W. Fitzgibbon and A. Zisserman. Automatic camera re-covery
for closed or open image sequences. In Proc. 5th
European Conference on Computer Vision- Volume I ( ECCV
’ 98), pages 311– 326, London, UK, 1998. Springer- Verlag.
[ 6] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang,
B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic,
D. Steele, and P. Yanker. Query by image and video con-tent:
The QBIC system. Computer, 28( 9): 23– 32, 1995.
[ 7] T. Goedem ´ e, T. Tuytelaars, and L. Van Gool. Fast wide base-line
matching for visual navigation. In Proc. Conference
on Computer Vision and Pattern Recognition ( CVPR ’ 04),
pages 24– 29, 2004.
[ 8] R. Hartley and A. Zisserman. Multiple View Geometry in
Computer Vision, Second Edition. Cambridge University
Press, ISBN: 0521540518, 2004.
[ 9] R. Koch, M. Pollefeys, B. Heigl, L. Van Gool, and H. Nie-mann.
Calibration of hand- held camera sequences for
plenoptic modeling. In Proc. International Conference on
Computer Vision ( ICCV ’ 99), pages 585– 591, 1999.
[ 10] R. Koch, M. Pollefeys, and L. Van Gool. Robust calibration
and 3d geometric modeling from large collections of uncal-ibrated
images. In DAGM, 1999.
[ 11] M. Lhuillier and L. Quan. Quasi- dense reconstruction from
image sequence. In Proc. 7th European Conference on Com-puter
Vision- Part II ( ECCV ’ 02), pages 125– 139, London,
UK, 2002. Springer- Verlag.
[ 12] D. Nist ´ er. Reconstruction from uncalibrated sequences with
a hierarchy of trifocal tensors. In Proc. 6th European Con-ference
on Computer Vision- Part I ( ECCV ’ 00), pages 649–
663, London, UK, 2000. Springer- Verlag.
[ 13] D. Nist ´ er. Frame decimation for structure and motion. In
Second European Workshop on 3D Structure from Multi-ple
Images of Large- Scale Environments ( SMILE ’ 00), pages
17– 34, London, UK, 2001. Springer- Verlag.
[ 14] A. Pentland, R. Picard, and S. Sclaroff. Photobook:
content- based manipulation of image databases. Proc. SPIE,
2185: 34– 47, 1994.
[ 15] Y. Rui, T. S. Huang, and S.- F. Chang. Image retrieval: past,
present, and future. In International Symposium on Multi-media
Information Processing, 1997.
[ 16] M. Sainz, A. Susin, and N. Bagherzadch. Camera calibration
of long image sequences with the presence of occlusions. In
Proc. IEEE International Conference on Image Processing
( ICIP ’ 03), pages I: 317– 320, 2003.
[ 17] F. Schaffalitzky and A. Zisserman. Multi- view matching for
unordered image sets, or “ How do I organize my holiday
snaps?” . In Proc. 7th European Conference on Computer
Vision- Part I ( ECCV ’ 02), pages 414– 431, 2002.
[ 18] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg,
M. Jethwa, and N. Master. Calibrated, registered images of
an extended urban area. International Journal of Computer
Vision, 53( 1): 93– 107, 2003.
Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission ( 3DPVT' 06)
0- 7695- 2825- 2/ 06 $ 20.00 © 2006