Author Barnard, Kobus.
Title Computational methods for integrating vision and language / Kobus Barnard.
Publication Info San Rafael, California (1537 Fourth Street, San Rafael, CA 94901 USA) : Morgan & Claypool, 2016.
Descript 1 PDF (xvi, 211 pages) : illustrations.
Note Part of: Synthesis digital library of engineering and computer science.
Contents 1. Introduction -- 1.1 Redundant, complementary, and orthogonal multimodal data -- 1.1.1 Multimodal mutual information -- 1.1.2 Complementary multimodal information -- 1.2 Computational tasks -- 1.2.1 Multimodal translation -- 1.2.2 Integrating complementary multimodal data and cross-modal disambiguation -- 1.2.3 Grounding language with sensory data -- 1.3 Multimodal modeling -- 1.3.1 Discriminative methods -- 1.4 Multimodal inference, applications to computational tasks -- 1.4.1 Region labeling with a concept model -- 1.4.2 Cross-modal disambiguation, region labeling with image keywords -- 1.4.3 Cross-modal disambiguation, word sense disambiguation with images -- 1.5 Learning from redundant representations in loosely labeled multimodal data -- 1.5.1 Resolving region-label correspondence ambiguity -- 1.5.2 Data variation and semantic grouping -- 1.5.3 Simultaneously learning models and reducing correspondence ambiguity --
2. The semantics of images and associated text -- 2.1 Lessons from image search -- 2.1.1 Content-based image retrieval (CBIR) -- 2.2 Images and text as evidence about the world -- 2.3 Affective attributes of images and video -- 2.3.1 Emotion induction from images and video -- 2.3.2 Inferring emotion in people depicted in images and videos --
3. Sources of data for linking visual and linguistic information -- 3.1 WordNet for building semantic visual-linguistic data sets -- 3.2 Visual data with a single objective label -- 3.3 Visual data with a single subjective label -- 3.4 Visual data with keywords or object labels -- 3.4.1 Localized labels -- 3.4.2 Semantic segmentations with labels -- 3.5 Visual data with descriptions -- 3.6 Image data with questions and answers --
4. Extracting and representing visual information -- 4.1 Low-level features -- 4.1.1 Color -- 4.1.2 Edges -- 4.1.3 Texture -- 4.1.4 Characterizing neighborhoods using histograms of oriented gradients -- 4.2 Segmentation for low-level spatial grouping -- 4.3 Representation of regions and patches -- 4.3.1 Visual word representations -- 4.4 Mid-level representations for images -- 4.4.1 Artificial neural network representations -- 4.5 Object category recognition and detection --
5. Text and speech processing -- 5.1 Text associated with audiovisual data -- 5.2 Text embedded within visual data -- 5.3 Basic natural language processing -- 5.4 Word sense disambiguation -- 5.5 Online lexical resource for vision and language integration -- 5.5.1 WordNet -- 5.5.2 Representing words by vectors --
6. Modeling images and keywords -- 6.1 Scene semantics: keywords for entire images -- 6.2 Localized semantics: keywords for regions -- 6.3 Generative models with independent multi-modal concepts -- 6.3.1 Notational preliminaries -- 6.3.2 Semantic concepts with multi-modal evidence -- 6.3.3 Joint modeling of images and keywords (PWRM and IRCM) -- 6.3.4 Inferring image keywords and region labels -- 6.3.5 Learning multi-modal concept models from loosely labeled data -- 6.3.6 Evaluation of region labeling and image annotation -- 6.4 Translation models -- 6.4.1 Notational preliminaries (continuing 6.3.1) -- 6.4.2 A simple region translation model (RTM) -- 6.4.3 Visual translation models for broadcast video -- 6.4.4 A word translation model (WTM) -- 6.4.5 Supervised multiclass labeling (SML) -- 6.4.6 Discriminative models for translation -- 6.5 Image clustering and interdependencies among concepts -- 6.5.1 Region concepts with image categories (CIRCM) -- 6.5.2 Latent Dirichlet allocation (LDA) -- 6.5.3 Multiclass supervised LDA (sLDA) with annotations -- 6.6 Segmentation, region grouping, and spatial context -- 6.6.1 Notational preliminaries (continuing 6.3.1 and 6.4.1) -- 6.6.2 Random fields for representing image semantics -- 6.6.3 Joint learning of translation and spatial relationships -- 6.6.4 Multistage learning and inference -- 6.6.5 Dense CRFs for general context -- 6.6.6 Dense CRFs for multiple pairwise relationships -- 6.6.7 Multiscale CRF (mCRF) -- 6.6.8 Relative location prior with CRFs -- 6.6.9 Encoding spatial patterns into the unary potentials with texture-layout features -- 6.6.10 Discriminative region labeling with spatial and scene information -- 6.6.11 Holistic integration of appearance, object detection, and scene type -- 6.7 Image annotation without localization -- 6.7.1 Nonparametric generative models -- 6.7.2 Label propagation --
7. Beyond simple nouns -- 7.1 Reasoning with proper nouns -- 7.1.1 Names and faces in the news -- 7.1.2 Linking action verbs to pose: who is doing what? -- 7.1.3 Learning structured appearance for named objects -- 7.2 Learning and using adjectives and attributes -- 7.2.1 Learning visual attributes for color names -- 7.2.2 Learning complex visual attributes for specific domains -- 7.2.3 Inferring emotional attributes for images -- 7.2.4 Inferring emotional attributes for video clips -- 7.2.5 Sentiment analysis in consumer photographs and videos -- 7.2.6 Extracting aesthetic attributes for images -- 7.2.7 Addressing subjectivity -- 7.3 Noun-noun relationships: spatial prepositions and comparative adjectives -- 7.3.1 Learning about preposition use in natural language -- 7.4 Linking visual data to verbs -- 7.5 Vision helping language understanding -- 7.5.1 Using vision to improve word sense disambiguation -- 7.5.2 Using vision to improve coreference resolution -- 7.5.3 Discovering visual-semantic senses -- 7.6 Using associated text to improve visual understanding -- 7.6.1 Using captions to improve semantic image parsing (cardinality and prepositions) -- 7.7 Using world knowledge from text sources for visual understanding -- 7.7.1 Seeing what cannot be seen? -- 7.7.2 World knowledge for training large-scale fine-grained visual models --
8. Sequential structure -- 8.1 Automated image and video captioning -- 8.1.1 Captioning by reusing existing sentences and fragments -- 8.1.2 Captioning using templates, schemas, or simple grammars -- 8.1.3 Captioning video using storyline models -- 8.1.4 Captioning with learned sentence generators -- 8.2 Aligning sentences with images and video -- 8.3 Automatic illustration of text documents -- 8.4 Visual question answering --
A. Additional definitions and derivations -- Basic definitions from probability and information theory -- Additional considerations for multimodal evidence for a concept -- Loosely labeled vs. strongly labeled data -- Pedantic derivation of equation (6.13) -- Derivation of the EM equations for the image region concept model (IRCM) -- Bibliography -- Author's biography.
Note Abstract freely available; full-text restricted to subscribers or individual document purchasers.
Mode of access: World Wide Web.
System requirements: Adobe Acrobat Reader.
ISBN 9781608451135 ebook
9781608451128 print
Standard # 10.2200/S00705ED1V01Y201602COV007 doi
Series Synthesis lectures on computer vision, # 7
Synthesis digital library of engineering and computer science.
Synthesis lectures on computer vision ; # 7. 2153-1064
Subject Computer vision -- Mathematical models.
Information visualization.
Closed captioning -- Technological innovations.
Keyword searching -- Technological innovations.
Natural language processing (Computer science)
Multimodal user interfaces (Computer systems)