Publication Library

Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

Description: Sora is a text-to-video generative AI model, released by OpenAI in February 2024. The model is trained to generate videos of realistic or imaginative scenes from text instructions and shows potential in simulating the physical world. Based on public technical reports and reverse engineering, this paper presents a comprehensive review of the model’s background, related technologies, applications, remaining challenges, and future directions of text-to-video AI models. We first trace Sora’s development and investigate the underlying technologies used to build this “world simulator”. Then, we describe in detail the applications and potential impact of Sora in multiple industries ranging from film-making and education to marketing. We discuss the main challenges and limitations that need to be addressed before Sora can be widely deployed, such as ensuring safe and unbiased video generation. Lastly, we discuss the future development of Sora and video generation models in general, and how advancements in the field could enable new ways of human-AI interaction, boosting the productivity and creativity of video generation.

Created At: 13 December 2024

Updated At: 13 December 2024

From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

Description: Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large-scale synthetic and object-centric 3D datasets have been shown to be effective in training models that have 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source of real-world 3D data, but finding diverse yet corresponding views of the same content has proven difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture. This restricts the ability to access scenes from more diverse and potentially useful perspectives. We argue that large-scale 360 videos can address these limitations by providing scalable corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world, multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.

Created At: 11 December 2024

Updated At: 11 December 2024

Using Drone Swarm to Stop Wildfire: A Predict-then-optimize Approach

Description: Drone swarms coupled with data intelligence can be the future of wildfire fighting. However, drone swarm firefighting faces enormous challenges, such as the highly complex environmental conditions in wildfire scenes, the highly dynamic nature of wildfire spread, and the significant computational complexity of drone swarm operations. We develop a predict-then-optimize approach that addresses these challenges to enable effective drone swarm firefighting. First, we construct wildfire spread prediction convex neural network (Convex-NN) models based on real wildfire data. Then, we propose a mixed-integer programming (MIP) model coupled with dynamic programming (DP) to enable efficient drone swarm task planning. We further use chance-constrained robust optimization (CCRO) to ensure robust firefighting performance under varying situations. The formulated model is solved efficiently using Benders Decomposition and Branch-and-Cut algorithms. After training on 75 simulated wildfire environments, the MIP+CCRO approach shows the best performance across several test sets, reducing movements by 37.3% compared to the plain MIP. It also significantly outperforms the GA baseline, which often fails to fully extinguish the fire. In the next stage, we will conduct real-world fire spread and quenching experiments for further validation.

Created At: 11 December 2024

Updated At: 11 December 2024

Sensing-Aided 6G Drone Communications: Real-World Datasets and Demonstration

Description: With the advent of next-generation wireless communication, millimeter-wave (mmWave) and terahertz (THz) technologies are pivotal for their high data rate capabilities. However, their reliance on large antenna arrays and narrow directive beams to ensure adequate receive signal power introduces significant beam training overhead. This becomes particularly challenging when supporting highly mobile applications such as drone communication, where the dynamic nature of drones demands frequent beam alignment to maintain connectivity. Addressing this critical bottleneck, our paper introduces a novel machine learning-based framework that leverages multi-modal sensory data, including visual and positional information, to expedite and refine mmWave/THz beam prediction. Unlike conventional approaches that depend solely on exhaustive beam training, our solution incorporates additional layers of contextual data to accurately predict beam directions, significantly reducing the training overhead. Additionally, our framework is capable of predicting future beam alignments ahead of time. This feature enhances the system's responsiveness and reliability by addressing the challenges posed by the drones' mobility and the computational delays encountered in real-time processing. This capability for advanced beam tracking represents a critical advancement in maintaining seamless connectivity for highly mobile drones. We validate our approach through comprehensive evaluations on a unique, real-world mmWave drone communication dataset, which integrates concurrent camera visuals, practical GPS coordinates, and mmWave beam training data...

Created At: 11 December 2024

Updated At: 11 December 2024

Deconstructing Human-AI Collaboration: Agency, Interaction, and Adaptation

Description: As full AI-based automation remains out of reach in most real-world applications, the focus has instead shifted to leveraging the strengths of both human and AI agents, creating effective collaborative systems. The rapid advances in this area have yielded increasingly complex systems and frameworks, while their characterization has become less precise. Similarly, the existing conceptual models no longer capture the elaborate processes of these systems, nor do they describe the entire scope of their collaboration paradigms. In this paper, we propose a new unified set of dimensions through which to analyze and describe human-AI systems. Our conceptual model is centered around three high-level aspects (agency, interaction, and adaptation) and is developed through a multi-step process. Firstly, an initial design space is proposed by surveying the literature and consolidating existing definitions and conceptual frameworks. Secondly, this model is iteratively refined and validated by conducting semi-structured interviews with nine researchers in this field. Lastly, to illustrate the applicability of our design space, we use it to provide a structured description of selected human-AI systems.

Created At: 11 December 2024

Updated At: 11 December 2024
