Why is multimodal modularity an illusion for Web3 AI?

Odaily Planet Daily

Original author: @BlazingKevin_, Researcher at Movemaker

The evolution of multimodal models has not brought chaos; it has deepened the technical moat of Web2 AI. From semantic alignment to visual understanding, from high-dimensional embedding to feature fusion, complex models are integrating the expressions of different modalities at unprecedented speed, building an increasingly closed AI highland. The U.S. stock market has voted with its feet: both crypto-adjacent stocks and AI stocks have ridden a bull run, and that heat wave has had nothing to do with Crypto itself.

The Web3 AI attempts we have seen, especially the recent evolution of the Agent direction, are almost entirely misguided: the wishful idea of assembling a Web2-style multimodal modular system out of decentralized components is a double misalignment of technology and thinking. With today's tightly coupled modules, highly unstable feature distributions, and ever more concentrated computing power demand, multimodal modularization simply cannot stand in Web3. To be blunt: the future of Web3 AI lies not in imitation but in strategic detours. From semantic alignment in high-dimensional spaces, to the information bottleneck in attention mechanisms, to feature alignment under heterogeneous computing power, I will expand on these one by one to explain why Web3 AI should adopt "encircling the cities from the countryside" as its tactical program.

Web3 AI Based on Flattened Multimodal Models, Semantic Misalignment Leads to Poor Performance

In modern Web2 AI multimodal systems, “semantic alignment” refers to the mapping of information from different modalities (such as images, text, audio, video, etc.) into the same or mutually convertible semantic space, enabling the model to understand and compare the intrinsic meanings behind these originally disparate signals. For example, a photo of a cat and the phrase “a cute cat” need to be projected to nearby positions in a high-dimensional embedding space, allowing for “image-to-text” and “audio-to-visual” associations during retrieval, generation, or inference.

Only once a high-dimensional embedding space has been realized does it make sense to split the workflow into modules for cost reduction and efficiency. In Web3 Agent protocols, however, a shared high-dimensional embedding cannot be achieved, which is why modularization is an illusion for Web3 AI.

How to understand high-dimensional embedding space? At the most intuitive level, think of “high-dimensional embedding space” as a coordinate system—just like the x–y coordinates on a plane, you can use a pair of numbers to locate a point. The only difference is that in our common two-dimensional plane, a point is completely determined by two numbers (x, y); whereas in “high-dimensional” space, each point needs to be described by more numbers, possibly 128, 512, or even thousands of numbers.

Start from the basics and understand in three steps:

  1. Two-dimensional example:

Think about the coordinates of several cities you marked on the map, such as Beijing (116.4, 39.9), Shanghai (121.5, 31.2), and Guangzhou (113.3, 23.1). Each city here corresponds to a “two-dimensional embedding vector”: the two-dimensional coordinates encode geographical location information into numbers.

If you want to measure the "similarity" between cities (cities that are close together on the map often belong to the same economic zone or climate zone), you can directly compare the Euclidean distance between their coordinates.

  2. Expand to multiple dimensions:

Now suppose you not only want to describe the location on the “geographic space,” but also add some “climatic characteristics” (average temperature, rainfall), “demographic characteristics” (population density, GDP), etc. You can assign a vector to each city that includes these 5, 10, or even more dimensions.

For example, the 5-dimensional vector for Guangzhou might be [113.3, 23.1, 24.5, 1700, 14.5], representing longitude, latitude, average temperature, annual rainfall (mm), and an economic index. This "multidimensional space" lets you compare cities across geography, climate, and economy at once: if two cities' vectors are very close, they are very similar in these attributes.

  3. Switching to semantics: why "embedding":

In natural language processing (NLP) or computer vision, we likewise want to map words, sentences, or images into such a multidimensional vector space, so that words or images with similar meanings land close together. This mapping process is called "embedding". For example, we train a model to map "cat" to a 300-dimensional vector v₁, "dog" to another vector v₂, and an unrelated word like "economy" to v₃. In this 300-dimensional space, the distance between v₁ and v₂ is small (both are animals and appear in similar linguistic contexts), while the distance between v₁ and v₃ is large.

As the model trains on massive amounts of text or image-text pairs, the dimensions it learns do not correspond directly to interpretable properties like "longitude" or "latitude", but to latent semantic features. One dimension may capture a coarse distinction between "animal vs. non-animal", another may separate "domesticated vs. wild", and another may track "cute vs. fierce". In short, hundreds or thousands of dimensions work together to encode complex, intertwined layers of meaning.
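As a toy illustration of the idea above, cosine similarity in an embedding space puts "cat" near "dog" and far from "economy". The 4-dimensional vectors below are invented for illustration, not taken from any trained model (real embeddings have hundreds of dimensions):

```python
import numpy as np

# Invented toy "embeddings"; a trained model would produce these automatically.
vectors = {
    "cat":     np.array([0.9, 0.8, 0.1, 0.2]),
    "dog":     np.array([0.8, 0.9, 0.2, 0.1]),
    "economy": np.array([0.1, 0.0, 0.9, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: near 1 = similar direction/meaning."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["cat"], vectors["dog"]))      # high: related concepts
print(cosine_similarity(vectors["cat"], vectors["economy"]))  # low: unrelated concepts
```

The same distance comparison works identically whether the vectors encode city coordinates or latent semantics; only the number of dimensions and how they are learned differ.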

What is the difference between high and low dimensions? Only a sufficient number of dimensions can accommodate many intertwined semantic features, and only in high dimensions can each feature occupy a clear position along its own semantic dimension. When semantics cannot be distinguished, that is, cannot be aligned, different signals "squeeze" each other in the low-dimensional space, the model frequently confuses them during retrieval or classification, and accuracy drops sharply. Subtle differences also become hard to capture at the strategy-generation stage: key trading signals are missed or risk thresholds misjudged, which directly drags down returns. Cross-module collaboration becomes impossible, each agent works in isolation, information silos proliferate, overall response latency rises, and robustness deteriorates. Finally, facing complex market scenarios, a low-dimensional structure has almost no capacity to carry multi-source data; the system's stability and scalability cannot be guaranteed, long-term operation is bound to hit performance bottlenecks and maintenance difficulties, and the shipped product ends up far below initial expectations.
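The "squeezing" effect can be seen statistically: random directions in 2 dimensions unavoidably look similar to one another, while in 512 dimensions they are nearly orthogonal, leaving room to keep distinct meanings apart. A minimal sketch, sampling random unit vectors rather than trained embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(dim: int, n: int = 500) -> float:
    """Average |cosine similarity| between n random unit vectors in `dim` dimensions."""
    v = rng.normal(size=(n, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    mask = ~np.eye(n, dtype=bool)  # drop each vector's similarity with itself
    return float(np.abs(sims[mask]).mean())

low, high = mean_abs_cosine(2), mean_abs_cosine(512)
print(f"2-dim:   {low:.3f}")   # random directions still overlap heavily
print(f"512-dim: {high:.3f}")  # near-orthogonal: distinct semantics stay separable
```

This is why "more dimensions" is not an aesthetic preference: it is the geometric precondition for thousands of semantic features to coexist without colliding.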

Can Web3 AI or Agent protocols achieve high-dimensional embedding spaces? First, consider how high-dimensional spaces are realized. In the traditional sense, "high-dimensional" requires that the various subsystems (market intelligence, strategy generation, execution, risk control) be aligned and interoperable in their data representations and decision processes. However, most Web3 Agents simply wrap existing APIs (such as CoinGecko or DEX interfaces) into independent "Agents", without a unified central embedding space or cross-module attention mechanisms. Information cannot interact between modules from multiple angles and levels, so the workflow stays linear and single-function and never forms a closed optimization loop.

Many agents call external interfaces directly without even adequate fine-tuning or feature engineering on the returned data. The market analysis agent simply takes price and volume, the execution agent places orders according to interface parameters, and the risk control agent fires alarms on a few thresholds. Each performs its own duty, but they lack multimodal fusion and deep semantic understanding of the same risk event or market signal, so the system cannot quickly generate comprehensive, multi-angle strategies when faced with extreme markets or cross-asset opportunities.

Requiring Web3 AI to achieve a high-dimensional space is therefore equivalent to requiring the Agent protocol to develop every API interface involved itself, which contradicts the very point of modularization; the modular multimodal systems described by small and medium-sized teams in Web3 AI do not withstand scrutiny. A high-dimensional architecture demands end-to-end unified training or collaborative optimization: from signal capture to strategy computation to execution and risk control, all links share the same representations and loss functions. The "module-as-plug-in" idea of Web3 Agents only exacerbates fragmentation: every agent is upgraded, deployed, and tuned in its own silo, synchronized iteration is difficult, and there is no effective centralized monitoring or feedback mechanism, so maintenance costs soar and overall performance stays limited.

Realizing a full-link agent with genuine industry barriers would require breakthroughs in end-to-end joint modeling, unified cross-module embedding, and the systematic engineering of collaborative training and deployment. But no such pain point exists in the current market, and so there is no market demand.

In low-dimensional spaces, attention mechanisms cannot be precisely designed.

Advanced multimodal models need carefully designed attention mechanisms. An "attention mechanism" is essentially a way of dynamically allocating computational resources, letting the model selectively "focus" on the most relevant parts of a modality's input. The most common forms are the self-attention and cross-attention of the Transformer: self-attention lets the model measure dependencies between elements of a sequence, such as how important each word in a text is to the others; cross-attention lets information from one modality (e.g., text) decide which features of another modality (e.g., an image's feature sequence) to "look at" when decoding or generating. With multi-head attention, the model learns multiple alignments simultaneously in different subspaces, capturing more complex and fine-grained associations.

The premise of the attention mechanism is that the modalities live in high dimensions: only in a high-dimensional space can a well-designed attention mechanism pick out the core part of a massive input in the shortest time. Before explaining why attention must operate in a high-dimensional space to work, let us first review how Web2 AI, exemplified by the Transformer decoder, designs its attention mechanism. The core idea: when processing sequences (text, image patches, audio frames), the model dynamically assigns an "attention weight" to each element, focusing on the most relevant information instead of treating everything equally.

To put it simply, if the attention mechanism is a car, designing Query-Key-Value is designing the engine. Q-K-V is the mechanism that identifies the key information: Query is the question ("what am I looking for"), Key is the index ("what label do I carry"), and Value is the content ("what is here"). For a multimodal model, the input may be a sentence, an image, or an audio clip. To retrieve the needed content in the embedding space, these inputs are cut into minimal units (a character, a small pixel patch, an audio frame), and the model generates a Query, Key, and Value for each unit. When processing a given position, the model compares that position's Query against the Keys of all positions to determine which labels best match the current need, then extracts the Values of the matching positions and combines them, weighted by importance, into a new representation that contains both the position's own information and relevant content from the whole context. Each output is thus dynamically "questioned, retrieved, and integrated" according to context, achieving efficient and precise focus.
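The questioned-retrieved-integrated loop described above can be sketched as a single scaled dot-product attention pass. A minimal NumPy version with toy shapes and random inputs (real models add learned projections, multiple heads, and masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query scores all keys; softmax turns scores into weights;
    the output is the importance-weighted blend of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # "what am I looking for" vs "what label do I carry"
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over positions
    return weights @ V, weights       # blend "what is here" by importance

# Toy sequence: 3 positions, each represented by a 4-dimensional vector.
rng = np.random.default_rng(42)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```

Note that every position attends over all positions in one pass: this global, parallel scoring is exactly what a chain of sequential API calls cannot reproduce.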

On top of this engine, various components are added, cleverly combining global interaction with controllable complexity: scaled dot products ensure numerical stability, multi-head parallelism enriches expressiveness, positional encoding preserves sequence order, sparse variants balance efficiency, residual connections and normalization stabilize training, and cross-attention connects the modalities. These modular, progressively layered designs give Web2 AI strong learning capability while keeping it efficient within a manageable computational budget across sequence and multimodal tasks.

Why can't modular Web3 AI achieve unified attention scheduling? First, attention relies on a unified Query-Key-Value space: all input features must be mapped into the same high-dimensional vector space so that dynamic weights can be computed via dot products. Independent APIs, however, return data in different formats and with different distributions (prices, order status, threshold alarms), and without a unified embedding layer these cannot form an interoperable set of Q/K/V. Second, multi-head attention attends to different information sources in parallel at the same level and then aggregates the results; independent APIs typically "call A, then B, then C", each step's output serving merely as the next module's input. There is no parallel, multi-channel dynamic weighting, so the fine-grained scheduling of scoring and synthesizing all positions or modalities at once cannot be simulated. Finally, a true attention mechanism dynamically weights each element based on the overall context; in API mode, each module sees only its own isolated context when called, with no shared real-time central context, so global correlation and cross-module focus are impossible.
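The missing first step, mapping heterogeneous API outputs into one shared space, can be sketched as follows. Everything here is hypothetical: the field layouts and the random projection matrices merely stand in for the jointly trained embedding layer that Web3 Agent protocols lack:

```python
import numpy as np

# Hypothetical outputs of three independent "agent" APIs, each in its own format.
price_feed = np.array([67000.0, 1.2])   # price, volume (illustrative numbers)
sentiment  = np.array([0.8, 0.1, 0.1])  # bullish / neutral / bearish scores
risk_alert = np.array([1.0])            # binary threshold alarm

rng = np.random.default_rng(0)
d = 8  # width of the shared embedding space (toy size)

# A unified embedding layer: one projection per source into the SAME d-dimensional
# space. In Web2 systems these matrices are trained jointly end to end; here they
# are random placeholders that only demonstrate the shape of the idea.
sources = (price_feed, sentiment, risk_alert)
projections = [rng.normal(size=(d, s.shape[0])) for s in sources]
tokens = np.stack([P @ s for P, s in zip(projections, sources)])

print(tokens.shape)  # three heterogeneous sources, now comparable in one space
```

Only after such a projection could a single attention pass score prices, sentiment, and alarms against one another; without it, each API's output stays in its own incomparable format.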

Therefore, merely encapsulating functions into discrete APIs, without a common vector representation and without parallel weighting and aggregation, cannot build a Transformer-style "unified attention scheduling" capability, just as no amount of modification can raise the performance ceiling of a car with a weak engine.

Discrete modular assembly leaves feature fusion stuck at superficial static stitching.

"Feature fusion" further combines the feature vectors produced by the different modalities, on the basis of alignment and attention, so that downstream tasks (classification, retrieval, generation, etc.) can use them directly. Fusion methods range from the simple (concatenation, weighted summation) to the complex (bilinear pooling, tensor decomposition, even dynamic routing). A higher-order approach alternates alignment, attention, and fusion across a multi-layer network, or uses graph neural networks (GNNs) to establish more flexible message-passing paths between cross-modal features for deep information interaction.
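The simpler fusion levels named above can be sketched in a few lines. The feature vectors are toy placeholders; real systems use hundreds of dimensions and learned projections:

```python
import numpy as np

rng = np.random.default_rng(1)
img_feat  = rng.normal(size=4)  # hypothetical image features
text_feat = rng.normal(size=4)  # hypothetical text features

# 1. Concatenation: the simplest "stitching", no interaction between modalities.
concat = np.concatenate([img_feat, text_feat])    # shape (8,)

# 2. Weighted sum: features must share a space; weights are fixed or learned.
weighted = 0.6 * img_feat + 0.4 * text_feat       # shape (4,)

# 3. Bilinear pooling: the outer product captures every pairwise interaction
#    between the two modalities' dimensions.
bilinear = np.outer(img_feat, text_feat).ravel()  # shape (16,)

print(concat.shape, weighted.shape, bilinear.shape)
```

Note how only the bilinear form lets individual image dimensions interact with individual text dimensions, which is why deeper fusion needs richer operations than stitching.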

Needless to say, Web3 AI is still at the simplest stage, concatenation, because dynamic feature fusion presupposes a high-dimensional space and a precise attention mechanism. When those prerequisites are unmet, the final stage of feature fusion cannot deliver outstanding performance either.

Web2 AI favors end-to-end joint training: all modal features (images, text, audio) are processed simultaneously in the same high-dimensional space, and through the attention and fusion layers the model, co-optimized with the downstream task layer, automatically learns the optimal fusion weights and interaction patterns during forward and backward propagation. Web3 AI, by contrast, relies on discrete module stitching: image recognition, market capture, risk assessment, and other functions are wrapped as independent agents whose output labels, values, or threshold alarms are then simply pieced together, with mainline logic or humans making the final call. There is no unified training objective and no gradient flow across modules.

In Web2 AI, the system relies on attention mechanisms to calculate the importance scores of various features in real-time based on context, dynamically adjusting fusion strategies; multi-head attention can also capture various different feature interaction patterns in parallel at the same level, balancing local details with global semantics. In contrast, Web3 AI often fixes weights such as “image × 0.5 + text × 0.3 + price × 0.2” in advance, or uses simple if/else rules to determine whether to fuse, or may not fuse at all, simply presenting the outputs of each module together, lacking flexibility.
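The contrast can be sketched directly: a fixed weighting never changes, while a softmax over context-dependent relevance scores (a stand-in for attention scoring; all numbers here are invented) reallocates weight to whatever matters for the current input:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-modality scalar summaries: image, text, price.
features = np.array([0.2, 0.9, 0.4])

# Static fusion, as in many Web3 agents: weights fixed in advance.
static_w = np.array([0.5, 0.3, 0.2])
static_out = static_w @ features

# Dynamic fusion, attention-style: weights recomputed from a context-dependent
# relevance score per modality (invented values for this particular input).
context = np.array([0.1, 2.0, 0.3])
dynamic_w = softmax(context)
dynamic_out = dynamic_w @ features

print(static_w, "-> always the same")
print(dynamic_w.round(3), "-> shifts toward whichever modality matters now")
```

With different inputs the `context` scores change and the dynamic weights follow, whereas `static_w` would keep paying most attention to the image even when the text signal dominates.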

Web2 AI maps all modal features into a high-dimensional space with thousands of dimensions. The fusion process involves not only vector concatenation but also various higher-order interaction operations such as addition and bilinear pooling—each dimension may correspond to some latent semantics, enabling the model to capture deep and complex cross-modal associations. In contrast, the outputs of Web3 AI agents often contain only a few key fields or indicators, with extremely low feature dimensions, making it nearly impossible to express subtle information such as “why the image content matches the text meaning” or “the delicate correlation between price fluctuations and sentiment trends.”

In Web2 AI, the loss of downstream tasks is continuously fed back to various parts of the model through attention layers and fusion layers, automatically adjusting which features should be enhanced or suppressed, forming a closed-loop optimization. In contrast, Web3 AI relies heavily on manual or external processes to evaluate and tune parameters after reporting the results of API calls, lacking automated end-to-end feedback, which makes it difficult for fusion strategies to iterate and optimize online.
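A minimal sketch of such a closed loop, assuming a toy linear fusion layer trained by gradient descent (not any particular production system): the downstream loss flows back to the fusion weights, which learn on their own to suppress an irrelevant modality:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data: 3 modal features per sample, one downstream target.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -0.5, 0.0])  # modality 3 is actually irrelevant
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(3)  # fusion weights, learned end to end
lr = 0.1
for _ in range(300):
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(y)  # MSE gradient flows back to the weights
    w -= lr * grad

print(w.round(2))  # approaches [1.5, -0.5, 0.0]: the loop suppressed the useless modality
```

In the API-stitching model there is no `grad` step at all: a human reads the reports and retunes thresholds by hand, so the fusion strategy cannot iterate online.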

Barriers in the AI industry are deepening, but pain points have not yet emerged.

Because end-to-end training must handle cross-modal alignment, sophisticated attention computation, and high-dimensional feature fusion all at once, a Web2 AI multimodal system is invariably an enormous engineering project. It requires massive, diverse, well-annotated cross-modal datasets and weeks or even months of training on thousands of GPUs. Architecturally, it integrates the latest network designs and optimization techniques; in implementation, it needs scalable distributed training platforms, monitoring systems, model version management, and deployment pipelines; in algorithm R&D, it must keep pursuing more efficient attention variants, more robust alignment losses, and lighter fusion strategies. Such full-link, full-stack systematic work places extremely high demands on capital, data, computing power, talent, and even organizational coordination, which is why it constitutes a strong industry barrier and, so far, the core competitiveness of a handful of leading teams.

When I reviewed Chinese AI applications in April and compared them with Web3 AI, I made a point: Crypto may achieve breakthroughs in industries with strong barriers, meaning industries that are already very mature in traditional markets yet carry huge pain points. High maturity means there are enough users familiar with similar business models; large pain points mean users are willing to try new solutions, i.e., a strong willingness to adopt Crypto. Both are indispensable. Conversely, if an industry is not already mature in the traditional market, then even with huge pain points Crypto cannot take root in it: there is no room to survive, users' willingness to understand it fully is very low, and they do not grasp its potential ceiling.

Web3 AI, or any crypto product that claims PMF, should be developed with the "encircling the cities from the countryside" tactic: test the waters at small scale in marginal positions, make sure the foundation is solid, and then wait for the core scenario, the target city, to emerge. The core of Web3 AI lies in decentralization, and its evolutionary path shows up as compatibility with high parallelism, low coupling, and heterogeneous computing power. This gives Web3 AI an advantage in scenarios such as edge computing, and suits tasks that are structurally lightweight, easy to parallelize, and incentivizable: LoRA fine-tuning, behavior-aligned post-training tasks, crowdsourced data training and annotation, small foundation-model training, and collaborative training on edge devices. The product architectures in these scenarios are lightweight, and their roadmaps can iterate flexibly.

But this is not to say the opportunity is now. The barriers of Web2 AI have only just begun to form; the emergence of DeepSeek has accelerated progress on complex multimodal tasks, which is a competition among leading enterprises, and we are in the early stage of the Web2 AI dividend. I believe that only when the Web2 AI dividend fades will the pain points it leaves behind become the opening for Web3 AI to cut in, just as DeFi was originally born. Until then, we need to carefully identify protocols that follow "encircling the cities from the countryside": those that cut in from the edge, first gain a foothold in the "countryside" (small markets, small scenarios) where incumbents are weak and rooted scenes are few, and gradually accumulate resources and experience. If a project cannot do this, it will be hard-pressed to reach a $1 billion market cap on the back of PMF, and such projects do not belong on the watchlist. We also need to watch whether a Web3 AI protocol retains full flexibility: able to adapt to different scenarios, move quickly between the "countryside" areas, and advance on the target city at top speed.

About Movemaker

Movemaker is the first official community organization authorized by the Aptos Foundation and jointly initiated by Ankaa and BlockBooster, focusing on promoting the construction and development of the Aptos ecosystem in the Chinese-speaking regions. As the official representative of Aptos in the Chinese-speaking areas, Movemaker is dedicated to creating a diverse, open, and prosperous Aptos ecosystem by connecting developers, users, capital, and numerous ecological partners.

Disclaimer:

This article/blog is for informational purposes only and represents the personal views of the author and does not necessarily represent the position of Movemaker. This article is not intended to provide: (i) investment advice or investment recommendations; (ii) an offer or solicitation to buy, sell, or hold digital assets; or (iii) financial, accounting, legal or tax advice. Holding digital assets, including stablecoins and NFTs, is extremely risky, highly volatile in price, and can even become worthless. You should carefully consider whether trading or holding Digital Assets is suitable for you in light of your own financial situation. Please consult your legal, tax or investment advisor if you have questions about your specific circumstances. The information provided in this article, including market data and statistics, if any, is for general information purposes only. Reasonable care has been taken in the preparation of these figures and graphs, but no liability is accepted for any factual errors or omissions expressed in them.
