Minimum Spanning Tree Pose Estimation
Kevin L. Steele, Parris K. Egbert
{steele, egbert}@cs.byu.edu
Department of Computer Science, Brigham Young University
3361 TMCB, Brigham Young University, Provo, Utah 84602
Abstract
The extrinsic camera parameters of video stream images can be
accurately estimated by tracking features through the image sequence
and using these features to compute parameter estimates. The poses for
long video sequences have been estimated in this manner. However, the
poses of large sets of still images cannot be estimated using the same
strategy because wide-baseline correspondences are not as robust as
narrow-baseline feature tracks. Moreover, video pose estimation
requires a linear or hierarchically-linear ordering on the images to be
calibrated, reducing the image matches to the neighboring video frames.
We propose a novel generalization of the linear ordering requirement
of video pose estimation by computing the Minimum Spanning Tree of the
camera adjacency graph and using the tree hierarchy to determine the
calibration order for a set of input images. We validate the pose
accuracy using an error metric that is functionally independent of the
estimation process. Because we do not rely on feature tracking for
generating feature correspondences, our method can use internally
calibrated wide- or narrow-baseline images as input, and can estimate
the camera poses from multiple video streams without special
pre-processing to concatenate the streams.
1 Introduction
External camera calibration consists of determining the external or
extrinsic parameters of a camera matrix P, which are the parameters
defining the camera location and orientation relative to a world
coordinate frame. Much effort in computer vision has gone into
developing stable methods of estimating the external camera parameters
in projective and metric spaces; see [5] for a rigorous treatment and a
compilation of references. A related body of work describes the process
of auto-calibration, the automated estimation of a camera’s internal
parameters from a collection of uncalibrated images [1, 3, 12, 21].
Auto-calibration is often performed simultaneously with external
calibration (pose estimation).
More recently, research has focused on the use of video streams as
input to pose estimation problems [4, 6, 7, 8, 11, 15]. Large numbers
of camera poses can be successively estimated by tracking features
through a sequence of video frames and using those features to estimate
camera poses relative to their predecessors in the sequence. Dense pose
estimation of this sort is an important precursor to reconstruction and
visualization applications. The advantage of using video streams as an
input to pose estimation, rather than taking still images of the same
structure, is that the correspondence problem is simpler to resolve in
a narrow-baseline setting. Hence, feature matches between image pairs
and image triplets are more accurate, improving the pose estimation
accuracy.
However, using video streams has several disadvantages. Video pose
estimation uses an implicit ordering to determine the estimation order:
camera poses are estimated relative to immediate or close predecessors
in the image sequence. Most algorithms are unable to exploit
out-of-sequence image matches that would otherwise improve an estimate.
It is also problematic to combine multiple video streams of a scene,
since features are not propagated from one stream to another. We would
also like the ability to estimate the pose for large numbers of
wide-baseline images that cannot be matched using robust feature
tracking.
In this paper we propose a generalization of the sequential ordering
scheme required by video pose estimation. Rather than calibrating
cameras in a linear ordering, we utilize the camera adjacency graph
[19] to determine the best images from which to extract match features
for pose estimation. We compute the Minimum Spanning Tree (MST) of the
adjacency graph to determine the pose estimation order for the set of
input images, and validate the pose accuracy using a novel error metric
that is functionally independent of the estimation process.
The contributions of our proposed method are that it can utilize both
narrow- and wide-baseline images as input, it can include multiple
video streams, and it can determine an optimal set of calibrated images
to use as match generators in computing the next estimate. Our method
has produced reliable pose estimates in scenes of over one hundred
wide-baseline images without using feature tracking as a
correspondence solution.

Figure 1. Six images of a park bench. The pyramids show the position
and orientation of the cameras after their pose was estimated using the
MST algorithm. The inset shows a small amount of the 3-D
reconstruction.

Proceedings of the Third International Symposium on
3D Data Processing, Visualization, and Transmission (3DPVT'06)
0-7695-2825-2/06 $20.00 © 2006
2 Related Work
The goal of dense pose estimation (the estimation of many related
camera poses) has been addressed almost exclusively in the context of
narrow-baseline imagery from video stream input. This is due largely to
the existence of highly robust solutions to the correspondence problem
in the narrow-baseline setting, where feature tracking can take a
predominant role [20]. In this section we review the work on dense pose
estimation from video streams (video pose estimation).
In [4] the authors present a method to track features through an open
or closed sequence of video frames, and use successive frame triplets
to estimate trifocal tensors. They then hierarchically combine the
tensors to build a reconstruction within a common world frame. While
the tensor hierarchy promotes reliable 3D structure throughout the
sequence to aid in matching, the matching order is still essentially
linear in that images are matched to preceding or succeeding video
frames. A generalization of [4] is presented in [11], where
registration of leaf-level trifocal tensors is delayed until the tensor
hierarchy is complete. A set of spanning tensors (wide tensors) is
chosen from the hierarchy to represent the entire sequence, from which
intermediate views are registered and the structure is triangulated. In
this way unnecessary video frames can be discarded.
Still images have also been used for closed loops [8], but a linear
ordering on the input images is still enforced, and a “quasi-dense”
feature set is used to compute the fundamental matrices and trifocal
tensors, thus partially avoiding the difficulty of estimating
wide-baseline camera poses. In [18] tracked features are used for video
pose estimation, then the calibrated sequence is partitioned into rigid
body clusters to simplify the bundle adjustment phase.
There have been several attempts to utilize feature matches outside
the conventional video frame order [6, 7]. In [6] the authors sweep a
camcorder over the object of interest in a zigzag fashion, and
construct a 3D polygonal mesh whose vertices are the viewpoints of the
reconstructed cameras. Rather than restricting their fundamental matrix
computations to the preceding frames, they exploit the zigzag nature of
the sweeping pattern to find additional images with which to match.
Their method backtracks at each frame to examine the 3D locations of
previously estimated poses; any prior cameras within a distance
threshold are used to update the current pose estimate.
In [7] the authors sequentially predict a coarse pose estimate for the
next video frame by using the epipole extracted from the fundamental
matrix F to predict the new pose direction, and the residual
correspondence error of the rectified image pair to predict the
distance from the previous pose. Given this coarse pose estimate, they
use a world-space distance threshold to find additional images to
update the pose estimate. A deficiency of this method is that the video
stream must be centered on one central object in the scene, since the
coarse pose estimate cannot account for camera rotation.
While these methods can successfully estimate the pose of many frames
in their video sequences, they are still restricted to single streams,
or streams that have been modified to permit concatenation. We propose
a unifying solution to both still image and multiple video stream pose
estimation that does not require the robustness of inter-frame feature
tracking.
Some related work has been done to create 2D topologies (connected
graphs) for image mosaics [10, 16]. In [16] the authors present a
method to find the connected graph relating all the images of a mosaic
or panorama. Their choice of graph edges depends on two competing
goals: to connect images with the best overlap, and to connect images
that will best improve the global registration accuracy. In [10] the
authors build upon [16] by creating a spanning tree of the 2D topology
whose edges are determined as the minimum of the normalized distance
between image centroids projected onto the mosaic being constructed.

Figure 2. Camera adjacency graph and MST of Figure 1. The images in
this figure are spatially oriented as shown by the pyramids in Figure
1. The edges of the adjacency graph are small dotted lines; in this
simple example the graph is completely connected. The minimum spanning
tree was constructed with image (1) as the root node, and the edges are
marked with thick solid lines.
Both [10] and [16] illustrate the application of graph construction to
optimize image/camera placement. However, their application is
specifically developed for the purpose of mosaic creation, where their
cost function is related to image registration and their camera
placement is about a common optical center. Our contribution, in
contrast, is to use graph construction to determine the optimal
ordering for camera placement in general position, requiring a novel
cost function. Specifically, our cost function uses validation on the
3D reconstruction to accurately determine edge inclusion, as developed
in Section 3.1.
3 Minimum Spanning Tree
The basic data structure we use is the camera adjacency graph [19], an
undirected graph whose nodes are cameras and their respective images,
and whose edges imply geometric proximity between the cameras. An edge
shared by two nodes in the adjacency graph implies that the view frusta
of the corresponding cameras overlap to include common scene structure,
and thus the images share some amount of content. In [19] the edges at
each node of the camera adjacency graph are constructed from the
k-nearest neighbors taken from GPS sensor data acquired at the physical
camera location. For traditionally-acquired camcorder or still camera
imagery, the adjacency graph could be constructed by determining the
quantity or quality of feature matches between image pairs; edges in
the graph would then indicate large numbers of accurate feature
correspondences. Instead, we construct our graph based on the amount of
image overlap between image pairs, determined from color histogram
comparisons [14]. From each node in the graph (each image of the input
set) we add edges to the n nodes closest in histogram distance.
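The graph construction just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance between normalized histograms is an assumed stand-in for the histogram comparison of [14], and the names `histogram_distance` and `build_adjacency_graph` are hypothetical.

```python
from math import sqrt

def histogram_distance(h1, h2):
    """Euclidean distance between two normalized histograms (assumed metric)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def build_adjacency_graph(histograms, n):
    """Return {(i, j): weight} with i < j, linking each image to its n
    nearest images in histogram distance."""
    edges = {}
    count = len(histograms)
    for i in range(count):
        # Rank every other image by its histogram distance to image i.
        ranked = sorted(
            (histogram_distance(histograms[i], histograms[j]), j)
            for j in range(count) if j != i
        )
        for dist, j in ranked[:n]:
            # Store each undirected edge once, keyed by the ordered pair.
            edges[(min(i, j), max(i, j))] = dist
    return edges

# Four toy 3-bin histograms: images 0 and 1 look alike, as do 2 and 3.
histograms = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
]
graph = build_adjacency_graph(histograms, n=1)
```

With n = 1 the toy graph splits into two components, which shows why n should be large enough to keep the adjacency graph connected before a spanning tree is built.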
Our primary contribution is to remove the linear calibration order
requirement by using the Minimum Spanning Tree of the adjacency graph
to determine the calibration order of the input cameras. We start by
specifying the root node, either manually or heuristically (a node
attached to the smallest-weighted edge, for instance). We then proceed
by using Prim’s algorithm to construct the MST: at each step, the
lowest-weight edge joining a tree node to a node outside the tree is
selected and that node is added to the tree [2]. The edge weights are
the histogram distances between nodes, and the camera calibration order
is the order in which nodes are added. Figure 1 illustrates a simple
input set and its pose estimates, and Figure 2 shows the corresponding
adjacency graph and MST.
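As an illustration of how the calibration order falls out of Prim's algorithm, the sketch below grows the tree from a chosen root and records each node with the parent it was attached to. The function name `prim_calibration_order` and the edge dictionary layout are assumptions made for this example.

```python
import heapq

def prim_calibration_order(num_nodes, edges, root=0):
    """Return [(node, parent), ...] in the order nodes join the MST.

    edges maps an undirected pair (u, v) to its weight (histogram distance).
    """
    neighbors = {v: [] for v in range(num_nodes)}
    for (u, v), w in edges.items():
        neighbors[u].append((w, v))
        neighbors[v].append((w, u))

    order = [(root, None)]
    in_tree = {root}
    # Frontier of candidate edges (weight, tree node, outside node).
    frontier = [(w, root, v) for w, v in neighbors[root]]
    heapq.heapify(frontier)
    while frontier and len(in_tree) < num_nodes:
        _, parent, node = heapq.heappop(frontier)
        if node in in_tree:
            continue  # stale edge: both endpoints are already in the tree
        in_tree.add(node)
        order.append((node, parent))
        for w, nxt in neighbors[node]:
            if nxt not in in_tree:
                heapq.heappush(frontier, (w, node, nxt))
    return order

edges = {(0, 1): 0.2, (0, 2): 0.9, (1, 2): 0.3, (1, 3): 0.8, (2, 3): 0.1}
order = prim_calibration_order(4, edges, root=0)
# order == [(0, None), (1, 0), (2, 1), (3, 2)]
```

Each `(node, parent)` pair names the camera to calibrate next and the already calibrated camera it is matched against.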
The final spanning tree is guaranteed to be optimal in minimizing the
total edge cost. We wish to transfer this optimality to the process of
pose neighbor selection so that finding the MST means finding the
optimal ordering. We define this optimality to be the following: At
each step of the tree creation, the node added is precisely that camera
which
(a) contains the most image overlap (the minimum histogram distance)
with some node in the tree, and
(b) maintains a scene reconstruction consistent with that offered by
the current tree (see Section 3.1).
By choosing the camera order based on maximum overlap we pre-condition
incoming nodes to have a high likelihood of correct pose estimation.
However, it is not sufficient to simply estimate the camera poses in
the order of maximum image overlap. In practice, the histogram distance
estimator for image overlap is not perfect and will result in
occasional outliers. Additionally, there will be noise and occasional
outliers in the correspondence set as well, which propagate errors to
the pose estimate computed from it. Therefore we need to add a
validation of the pose estimate that is independent of both the
adjacency graph edge weights (histogram distance) and correspondence
set noise.
3.1 Pose Validation
We validate a new pose estimate by comparing the scene structure
contributed by the new pose to the scene structure provided by its
parent in the MST (we use internally calibrated cameras in our MST pose
estimation algorithm, so there is no need to upgrade the reconstruction
from projective to metric). Rather than compare points triangulated
from the correspondence set, we compare dense stereo correspondence
between images. This results in a richer pose comparison and thus a
more accurate error estimate. A good survey of dense correspondence
methods is found in [17].
Given a pose candidate $C_{new}$, its parent $C'$, and its grandparent
$C''$, we compute depth maps $D_1$ and $D_2$ by triangulating dense
stereo between $C'$ and $C''$, and between $C'$ and $C_{new}$,
respectively. Note that $D_1$ and $D_2$ are both computed from the
viewpoint of $C'$ to make depth comparisons meaningful. We define the
reconstruction similarity S between two depth maps as the sum of the
Gaussian of depth differences:
$$S(D_1, D_2) = \sum_{p \in P} e^{-(D_{1p} - D_{2p})^2 / 2\sigma_p^2} \qquad (1)$$
where P is the set of all pixel locations in the depth maps $D_1$ and
$D_2$. We choose $\sigma_p$ separately for each pixel to be the camera
space inter-pixel distance at the specified depth:
$$\sigma_p = \frac{D_p}{f_{C'} \sqrt{\lVert P_{C'} - p \rVert^2 + f_{C'}^2}}, \quad p \in P \qquad (2)$$
where $f_{C'}$ is the focal length of $C'$ and $P_{C'}$ is the
principal point of the image from $C'$. Choosing $\sigma_p$ separately
for each pixel provides a more uniform depth comparison by factoring
out projective scaling. As each new pose $C_{new}$ is added to the MST,
its similarity S to the current reconstruction is retained as an
attribute of $C_{new}$. The similarity attribute is undefined for the
first two nodes of the MST since they do not have grandparents, so when
the third node is added to the MST, its similarity is propagated up as
a special case.
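The reconstruction similarity can be sketched in a few lines. This is an illustrative reading of the definition above, assuming the exponent uses the squared depth difference (as "Gaussian" implies), that depth maps are given as flat lists, and that pixels without a depth in either map are marked `None` and skipped; the per-pixel `sigma` values are taken as input rather than computed.

```python
from math import exp

def reconstruction_similarity(d1, d2, sigma):
    """Sum over pixels of exp(-(d1 - d2)^2 / (2 * sigma^2)).

    d1, d2: flat depth maps seen from the parent camera; None marks a pixel
            with no depth (assumption: such pixels contribute nothing).
    sigma:  per-pixel Gaussian widths.
    """
    total = 0.0
    for a, b, s in zip(d1, d2, sigma):
        if a is None or b is None:
            continue
        total += exp(-((a - b) ** 2) / (2.0 * s * s))
    return total

d1 = [2.0, 3.0, 4.0, None]
d2 = [2.0, 3.0, 4.5, 1.0]
sigma = [0.5, 0.5, 0.5, 0.5]
s = reconstruction_similarity(d1, d2, sigma)
# Two exact matches contribute 1 each; the third pixel is off by one sigma
# and contributes exp(-0.5); the fourth pixel is skipped.
```

Perfectly agreeing pixels each add 1, so for identical depth maps the value equals the number of pixels with a depth in both maps.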
In the ideal case where both the pose estimate and the dense stereo
correspondence are perfect, the similarity measure is equal to the
number of pixels that overlap between the three input images. This can
be seen by considering three perfectly matched points from the input
images of $C''$, $C'$, and $C_{new}$. The two resulting triangulated
3-D points will coincide, and thus the difference in depth distances
$D_{1p} - D_{2p}$ will be zero for those points. The summation
$\sum e^0$ from Equation 1 will then be equal to the number of common
pixels from the three images.

Figure 3. An image pair of a cluttered desk (upper left pair). The
upper right pair of images illustrates a portion of the point
correspondences used to generate the pose estimate. The images were
taken by hand attempting to restrict the camera motion to horizontal
translation only, and the right image pose was estimated relative to
the left. The left wireframe pyramid shows the left camera pose defined
to be at the world-space origin, and the cluster of wireframe pyramids
on the right shows the population of pose estimates from which the most
probable pose was selected using Equation 5. Both the left camera and
the final selected pose have their images texture mapped into the
respective pyramids. Note that due to the large amount of detail in the
images and hence the large number of accurate point correspondences,
the cluster distribution is small relative to the baseline.
Realistically, two error classes commonly arise: dense correspondence
error and pose error. Correspondence error typically arises in regions
of low texture frequency, and hence good matches generally occur on
edges, corners, and areas of high texture frequency [13]. Given a
correct pose estimate, the correspondence error will be lower in areas
of high detail, and the similarity measure will correspond to the
number of shared pixels with accurate point matches. This will be a
reasonably high value given sufficient image detail (see Table 1 for
typical values). Pose error arises in the absence of accurate point
matches in the pose estimation process, and a large pose error yields a
very low similarity measure. Thus the reconstruction similarity S is a
valid discriminator of correct or near-correct pose.
Figure 4. An image pair of a hallway in the same configuration as
Figure 3. Both the left camera and the final selected pose have their
images texture mapped into the respective pyramids. Note that these
images have less high-frequency detail, so the point correspondences
are noisier, resulting in a larger cluster distribution than that of
Figure 3.
We consider a new pose to be valid if its similarity S is at least
half that of its parent node. This constraint invalidates poses that
are structurally inconsistent with valid MST nodes, and we have found
this threshold to work well in practice. An invalid pose can arise when
a node’s correspondence set has too much noise or too many outlying
matches, or if the node has an incorrect neighbor in the adjacency
graph.
When a node is invalidated by failing the similarity comparison, it is
not added to the MST, and the algorithm proceeds with the next node.
The failed node can be added at a later stage of the algorithm, but it
must be added to a different parent. In this way the pose order will be
optimal with respect to both optimality properties (a) and (b).
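One way to combine the validity test with the tree construction is sketched below. It is an assumption-laden illustration rather than the authors' code: `is_valid` is a hypothetical callback standing in for pose estimation plus the similarity comparison, and a rejected edge is simply discarded so that the node can only rejoin through a different parent.

```python
import heapq

def passes_similarity(s_new, s_parent, ratio=0.5):
    """Validity rule from the text: S is at least half that of the parent."""
    return s_new >= ratio * s_parent

def validated_mst_order(num_nodes, edges, root, is_valid):
    """Prim-style growth that skips candidate edges failing validation."""
    neighbors = {v: [] for v in range(num_nodes)}
    for (u, v), w in edges.items():
        neighbors[u].append((w, v))
        neighbors[v].append((w, u))

    order = [(root, None)]
    in_tree = {root}
    frontier = [(w, root, v) for w, v in neighbors[root]]
    heapq.heapify(frontier)
    while frontier:
        _, parent, node = heapq.heappop(frontier)
        if node in in_tree:
            continue
        if not is_valid(parent, node):
            continue  # rejected: the node may still join via another parent
        in_tree.add(node)
        order.append((node, parent))
        for w, nxt in neighbors[node]:
            if nxt not in in_tree:
                heapq.heappush(frontier, (w, node, nxt))
    return order

edges = {(0, 1): 0.2, (0, 2): 0.9, (1, 2): 0.3, (1, 3): 0.8, (2, 3): 0.1}
rejected = {(1, 2)}  # pretend node 2 fails validation against parent 1
order = validated_mst_order(
    4, edges, root=0,
    is_valid=lambda parent, node: (parent, node) not in rejected,
)
# order == [(0, None), (1, 0), (3, 1), (2, 3)]
```

In this toy run node 2 is refused as a child of node 1 and is attached later through node 3, which mirrors the "different parent" rule described above.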
3.2 Noise and Outlier Resolution
To estimate camera pose P relative to the parent node, we use the
eight-point algorithm to first estimate the essential matrix E from
point matches, then we recover rotation and translation from E [9].
Since no perfect point correspondence algorithm exists, there is always
some amount of noise in the point matches, which transfers to the pose
estimate. Worse, there may sometimes be severe outliers in the point
matches that will lead to a completely erroneous pose. While the
validation step of Section 3.1 will detect most such pose errors, we
can improve the chances of finding an initially correct estimate by
first observing the effect of noise propagation in the eight-point
algorithm.

Figure 5. An example of a multi-modal density function. This image
pair is the same as in Figure 4, but the camera poses are estimated
using a smaller number of point correspondences as input to the
eight-point algorithm [9]. With a higher probability of choosing
outlying matches, the pyramid cluster distribution is large and the
density function is multi-modal. An accurate pose is selected because
in the correspondence set there is a larger proportion of correct point
matches than outlying matches.
The effect of correspondence noise in pose estimation can be
illustrated by treating the pose location as a continuous random
variable X.¹ If we estimate pose from random subsets of the point
correspondence set, we generate a population of pose locations
$X_p = \{x_1, x_2, \ldots, x_n\}$. By using a density estimator such as
Parzen windows, we define a likelihood function for a 3D location x:
$$L(x) = \frac{1}{n} \sum_{i=1}^{n} W(\lVert x - x_i \rVert) \qquad (3)$$
and a probability density function for the random variable X:
$$\rho(x) = \frac{L(x)}{\int_{\omega} L(x)\,dx}, \quad \omega = \mathbb{R}^3. \qquad (4)$$
For the kernel W we use a Gaussian function with $\sigma$ a constant
factor of the desired pose baseline (for instance, if the baseline is
set to 1, $\sigma = 0.1$).
¹ Camera rotation could also be used as a random variable, but since we
lack a Euclidean distance measure between any two rotations for use in a
smoothing function, i.e., $(R_1 - R_2) \in SO(3)$ cannot be mapped to
$\mathbb{R}^1$, it is more difficult to estimate density from a
population of rotations.
Note that this formulation is not a RANSAC procedure, in which subsets
of sampled data are iteratively and independently chosen to robustly
fit a model to the sampled data. While both RANSAC and Equation 4
employ randomly chosen subsets of sampled data, we are simply using the
random subsets of 2D point correspondences as input to the eight-point
algorithm to establish hypothetical camera poses, and then using these
discrete pose locations to define a continuous function having higher
probabilities near clusters of hypotheses.
The shape of the pdf indicates the stability of the given
correspondence set. A narrow, single-modal function is desirable and
will generally yield a correct pose estimate. A multi-modal function is
indicative of extreme outliers in the correspondence set. We take the
pose to be the element of the population with the maximum value in
$\rho$:
$$P = \arg\max_{x \in X_p} \rho(x). \qquad (5)$$
This effectively chooses the most probable pose in the population as
the correct one. In practice we do not need to evaluate the denominator
in Equation 4 since we only need to find a maximum. Figures 3 and 4
illustrate two examples of noisy poses and our method of selecting the
most probable pose from the pdfs.
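The selection rule can be sketched as follows, assuming an unnormalized Gaussian kernel (constant factors and the denominator of Equation 4 do not change which element attains the maximum). The function name `most_probable_pose` and the toy population are illustrative only.

```python
from math import dist, exp

def most_probable_pose(population, sigma):
    """Return the pose location in the population with the highest
    Parzen-window likelihood under a Gaussian kernel of width sigma."""
    def likelihood(x):
        # Mean kernel response of x against every hypothesis (Equation 3).
        return sum(
            exp(-dist(x, xi) ** 2 / (2.0 * sigma ** 2)) for xi in population
        ) / len(population)
    return max(population, key=likelihood)

# Three hypotheses cluster near (1, 0, 0); one outlier lies far away.
population = [
    (1.0, 0.0, 0.0),
    (1.02, 0.01, 0.0),
    (0.98, -0.01, 0.01),
    (0.3, 0.9, 0.2),
]
best = most_probable_pose(population, sigma=0.1)
# best == (1.0, 0.0, 0.0)
```

The outlier receives almost no support from the other hypotheses, so the element nearest the center of the cluster is returned. The evaluation is quadratic in the population size, which is acceptable for small populations of hypotheses.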
If the density function is multi-modal, resulting from match outliers
for instance, then the maximum argument associates P with the mode of
maximum density; see Figure 5 for an example. This will be correct only
if the majority of the point correspondences are inliers. While this
will be the case most of the time when using a robust point
correspondence algorithm, it will occasionally fail. To further reduce
the effect of outlier-propagated error, we augment the node selection
portion of Prim’s algorithm from Section 3 with the pose density
estimation of Equations 4 and 5. Rather than estimating the pose of a
new node only with its intended parent in the MST, we additionally
estimate it with the k-nearest neighbors of the intended parent,
constrained to similar gaze directions, creating a population of
candidate poses. We then use density estimation again to determine the
most probable pose, setting $\sigma$ to a factor of the Euclidean
distance between the intended parent and the intended grandparent. In
practice this avoids nearly all pose errors resulting from outlying
point correspondences. Any remaining erroneous poses are culled by
validation, yielding very stable pose estimation.
4 Results
We verified the stability of MST pose estimation using several sets of
wide-baseline images, each taken by hand with a still digital camera.
Figures 6, 7, and 8 show the pose estimates of three sets of input
images. The examples show ten sample images from the input set (each
image is 640x480 pixels) and pyramids representing the position and
orientation of the final pose estimates. One image in each set was
defined to be located at the world-space origin, and all remaining
image poses in each set were estimated in the same space using the MST
pose estimation algorithm. We have listed the average values of the
similarity measures for each example in Table 1.

Figure 6. Images of a fire hydrant from an input size of 69 images.
The pyramids illustrate the pose estimates for each image. The inset
shows the pose estimates from a viewpoint close to ground level. The
3-D point clouds in the background of this figure and of Figures 7 and
8 are shown only to illustrate the general position of the estimated
poses relative to the scene structure, and are not attempts to
accurately reconstruct the 3-D geometry.

              Hydrant      Skull      Lions
$\bar{S}$     15402.7     9769.8    15791.8
$\sigma$       4219.9     4580.3     3586.9

Table 1. Mean and standard deviation of the similarity measure S for
the examples in Figures 6 (fire hydrant), 7 (fossilized skull), and 8
(lion display). The values of $\bar{S}$ roughly correspond to the
average number of consistently reconstructed 3-D points in image
triplets as nodes are added to the MST, and thus can be comparatively
used to indicate good pose.

Figure 7. Images of a fossilized skull specimen from an input size of
160 images. The pyramids illustrate the pose estimates for each image,
and the inset shows the pose estimates from a higher viewpoint. The
pyramids in this figure are necessarily small in order to show each
image from the complete set.
The camera poses in the examples are estimated fairly accurately, as
shown in the figures by the positions of the pyramids and the sparse
3-D reconstruction. It is difficult to determine the exact accuracy of
the pose estimates without measuring the extrinsic camera calibration
using an external verification setup such as a gantry or robotic arm
while photographing a scene. However, if the eventual goal of camera
calibration is to perform 3-D reconstruction or visualization where the
success of the application is measured by the accuracy of the 3-D
content, then the similarity measure of Section 3.1 is applicable and
by definition is directly related to the accuracy of the pose
estimation.
Figure 8. Images and pose estimates of a taxidermy display from an
input size of 67 images. The lion shape evident in the reconstructed
point cloud corresponds to the lion displayed in the top row of images.
5 Summary and Conclusions
We have proposed a novel method, MST pose estimation, to estimate the
extrinsic camera parameters, or pose parameters, for large collections
of images. Our method generalizes current methods, which use
narrow-baseline feature tracking to robustly estimate point
correspondences and camera pose in a linear or hierarchically-linear
order (imposed by the linear nature of the video stream). MST pose
estimation finds the minimum spanning tree of the camera adjacency
graph and uses the tree node hierarchy to determine pose order. This
enables pose candidates to be matched against a much larger number of
images than just the immediate predecessors in linear video streams. We
lose the robustness of narrow-baseline matching algorithms, but gain
the ability to pose generalized input: still images together with
multiple video streams.
To compensate for the reduced robustness of point correspondences, we
proposed a validation method based on reconstruction similarity to
quantify the pose correctness. We additionally outlined a novel noise
error compensation technique that reduces pose error propagated from
correspondence noise. This technique is based on interpreting a
population of pose estimates as a probability density function and
using density estimation to retrieve the most probable pose from the
population. Together with pose validation, these techniques enable
robust pose estimation for large collections of wide-baseline images.
References
[1] M. Armstrong, A. Zisserman, and P. Beardsley. Euclidean
reconstruction from uncalibrated images. In Proc. British Machine
Vision Conference (BMVC ’94), pages 509–518, 1994.
[2] D. Cheriton and R. Tarjan. Finding minimum spanning trees. SIAM
Journal on Computing, 5:724–742, 1976.
[3] O. Faugeras, Q. Luong, and S. Maybank. Camera self-calibration:
theory and experiments. In Proc. European Conference on Computer
Vision (ECCV ’92), pages 321–334. Springer-Verlag, 1992.
[4] A. W. Fitzgibbon and A. Zisserman. Automatic camera recovery for
closed or open image sequences. In Proc. 5th European Conference on
Computer Vision-Volume I (ECCV ’98), pages 311–326, London, UK, 1998.
Springer-Verlag.
[5] R. Hartley and A. Zisserman. Multiple View Geometry in Computer
Vision, Second Edition. Cambridge University Press, ISBN: 0521540518,
2004.
[6] R. Koch, M. Pollefeys, B. Heigl, L. Van Gool, and H. Niemann.
Calibration of hand-held camera sequences for plenoptic modeling. In
Proc. International Conference on Computer Vision (ICCV ’99), pages
585–591, 1999.
[7] R. Koch, M. Pollefeys, and L. Van Gool. Robust calibration and 3d
geometric modeling from large collections of uncalibrated images. In
DAGM, 1999.
[8] M. Lhuillier and L. Quan. Quasi-dense reconstruction from image
sequence. In Proc. 7th European Conference on Computer Vision-Part II
(ECCV ’02), pages 125–139, London, UK, 2002. Springer-Verlag.
[9] Y. Ma, S. Soatto, J. Kosecka, and S. S. Sastry. An Invitation to
3-D Vision. Springer, 2004.
[10] R. Marzotto, A. Fusiello, and V. Murino. High resolution video
mosaicing with global alignment. In Proc. Conference on Computer Vision
and Pattern Recognition (CVPR ’04), pages I:692–698, 2004.
[11] D. Nistér. Reconstruction from uncalibrated sequences with a
hierarchy of trifocal tensors. In Proc. 6th European Conference on
Computer Vision-Part I (ECCV ’00), pages 649–663, London, UK, 2000.
Springer-Verlag.
[12] M. Pollefeys, R. Koch, and L. Van Gool. Self calibration and
metric reconstruction in spite of varying and unknown internal camera
parameters. In Proc. 6th International Conference on Computer Vision
(ICCV ’98), pages 90–96, 1998.
[13] P. Pritchett and A. Zisserman. Wide baseline stereo matching. In
Proc. 6th International Conference on Computer Vision (ICCV ’98),
pages 754–760, 1998.
[14] Y. Rui, T. S. Huang, and S.-F. Chang. Image retrieval: past,
present, and future. In International Symposium on Multimedia
Information Processing, 1997.
[15] M. Sainz, A. Susin, and N. Bagherzadeh. Camera calibration of long
image sequences with the presence of occlusions. In Proc. IEEE
International Conference on Image Processing (ICIP ’03), pages
I:317–320, 2003.
[16] H. S. Sawhney, S. Hsu, and R. Kumar. Robust video mosaicing
through topology inference and local to global alignment. In Proc. 5th
European Conference on Computer Vision-Volume II (ECCV ’98), pages
103–119, 1998.
[17] D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense
two-frame stereo correspondence algorithms. International Journal of
Computer Vision, 47(1-3):7–42, 2002.
[18] D. Steedly, I. Essa, and F. Dellaert. Spectral partitioning for
structure from motion. In Proc. International Conference on Computer
Vision (ICCV ’03), pages 996–1003, 2003.
[19] S. Teller, M. Antone, Z. Bodnar, M. Bosse, S. Coorg, M. Jethwa,
and N. Master. Calibrated, registered images of an extended urban area.
International Journal of Computer Vision, 53(1):93–107, 2003.
[20] C. Tomasi and T. Kanade. Shape and motion from image streams under
orthography: a factorization method. International Journal of Computer
Vision, 9(2):137–154, 1992.
[21] W. Triggs. Auto-calibration and the absolute quadric. In Proc.
Conference on Computer Vision and Pattern Recognition (CVPR ’97),
pages 609–614, 1997.