
Ferrante, M., Boccato, T., Ozcelik, F., Vanrullen, R., Toschi, N. (2023). Multimodal decoding of human brain activity into images and text. In PROCEEDINGS OF UNIREPS: THE FIRST WORKSHOP ON UNIFYING REPRESENTATIONS IN NEURAL MODELS (pp.11-26). ML Research Press.

Multimodal decoding of human brain activity into images and text

Ferrante, M.; Boccato, T.; Ozcelik, F.; Vanrullen, R.; Toschi, N.
2023-01-01

Abstract

Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent breakthroughs in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from human brain activity patterns. In this study, we present an innovative method for decoding brain activity into meaningful images and captions, with a specific focus on brain captioning due to its greater flexibility compared to decoding brain activity into images. Our approach takes advantage of cutting-edge image captioning models and incorporates a novel image reconstruction pipeline that utilizes latent diffusion models and depth estimation. We use the Natural Scenes Dataset, a comprehensive fMRI dataset from eight subjects who viewed images from the COCO dataset. We employ the Generative Image-to-Text Transformer (GIT) as our captioning backbone and propose a new image reconstruction pipeline based on latent diffusion models. The method involves training regularized linear regression models between brain activity and extracted features. Additionally, we incorporate depth maps from the ControlNet model to further guide the reconstruction process. We propose a multimodal approach that leverages similarities between neural and deep-learning representations; by learning an alignment between these spaces, we produce textual descriptions and image reconstructions from brain activity. We evaluate our methods using quantitative metrics for both generated captions and images. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding, showcasing the enormous potential of integrating vision and language to better understand human cognition.
Our approach provides a flexible platform for future research, with potential applications combining high-level semantic information from text with low-level shape information from depth maps and initial-guess images.
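The core mapping described in the abstract (regularized linear regression from voxel activity to a model's feature space) can be sketched as follows. This is a minimal illustrative example on synthetic data: the dimensions, ridge penalty, and random "feature embeddings" are assumptions for the sketch, not the paper's actual configuration, which maps real fMRI voxels to features extracted by models such as GIT.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_trials, n_voxels, n_features = 300, 100, 64  # toy sizes; real fMRI data has far more voxels

# Synthetic stand-ins: brain activity X and target feature embeddings Y
# (in the paper, Y would be features extracted from the viewed images)
W_true = rng.normal(size=(n_voxels, n_features))
X = rng.normal(size=(n_trials, n_voxels))                        # voxel activity per trial
Y = X @ W_true + 0.1 * rng.normal(size=(n_trials, n_features))   # noisy feature targets

# Regularized linear map from voxel space to feature space
model = Ridge(alpha=10.0)
model.fit(X[:250], Y[:250])          # train on 250 trials
Y_pred = model.predict(X[250:])      # predict features for held-out trials
```

The predicted feature vectors would then be passed downstream, e.g. as conditioning for a captioning model or a latent diffusion pipeline.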
1st Workshop on Unifying Representations in Neural Models (UniReps)
New Orleans, LA (USA)
2023
International relevance
Contribution
2023
Sector PHYS-06/A - Physics for life sciences, the environment, and cultural heritage
English
Conference paper
Files in this item:
File: 6_Multimodal_decoding_of_human.pdf
Type: Publisher's version (PDF)
License: Publisher copyright
Size: 12.51 MB
Format: Adobe PDF
Access: authorized users only (copy available on request)

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/2108/406124
Citations
  • PubMed Central: n/a
  • Scopus: 0
  • Web of Science: 4