Investigating Speech-Driven Image Synthesis

In the spring of 2021, I took the course 6.864 Advanced Natural Language Processing. For our final project, my teammates and I set out to build on this paper to synthesize (domain-constrained) images from improvised spoken descriptions, using the Places Audio Captions dataset. This is a very challenging task, and while we were not able to consistently achieve strong semantic consistency between the spoken input and the generated image, we learned a lot in the process. You can read about our experiments in the report or look at our code.