
The 10,000-word Wall Street report behind the plunge in Bitcoin and Nvidia


Reprinted from jinse

02/02/2025

A professional investor with a background as both an analyst and a software engineer wrote an article laying out the short case against Nvidia. It was reposted en masse by big accounts on Twitter and became a major "culprit" behind the plunge in Nvidia's stock. Nvidia's market value evaporated by nearly $600 billion, by far the largest single-day decline ever for an individual listed company.

The main point made by the investor, Jeffrey Emanuel, is essentially that DeepSeek has punctured the hype blown up by Wall Street, the big tech companies, and Nvidia itself, and that Nvidia is overvalued. "Every investment bank is recommending buying Nvidia, like the blind leading the blind; they don't know what they're talking about."

Jeffrey Emanuel argues that the road Nvidia must travel to maintain its current growth trajectory and profit margins is far rockier than its valuation implies. There are five distinct lines of attack on Nvidia — architectural innovation, vertical integration by customers, software abstraction, efficiency breakthroughs, and the democratization of manufacturing — and at least one of them is likely to have a significant impact on Nvidia's lofty margins or growth rate. At the current valuation, the market is pricing in none of these risks.

According to some industry investors, this report suddenly made Emanuel a celebrity on Wall Street. Many hedge funds are paying him $1,000 an hour to hear his views on Nvidia and AI. He has been talking himself hoarse, but the money has been well worth it.

The following is the full text of the report, for reference.

As someone who spent about 10 years as an investment analyst at various long/short hedge funds (including stints at Millennium and Balyasny), and who is also a math and computer nerd who has been studying deep learning since 2010 (back when Geoff Hinton was still talking about Restricted Boltzmann Machines, everything was still programmed in MATLAB, and researchers were still trying to show they could beat support vector machines on classification tasks), I think I have a rather unusual perspective on how AI technology is developing and how it relates to equity valuations in the stock market.

In the past few years, I have been working more as a developer, and I have several popular open-source projects for working with various kinds of AI models/services (for example, see LLM Aided OCR, Swiss Army Llama, Fast Vector Similarity, Source to Prompt, Pastel Inference Layer, etc.). Basically, I use these frontier models intensively every single day. I have three Claude accounts so I never run out of requests, and I signed up for ChatGPT Pro within minutes of its launch.

I also strive to stay on top of the latest research, and I carefully read every important technical report paper released by the major AI labs. So I think I have a good grasp of the field and how things are developing. At the same time, I have shorted a great many stocks in my life and have twice won the best idea award on the Value Investors Club (for the TMS long and the PDH short, if you're keeping track).

I say this not to show off, but to establish that I can weigh in on this question without coming across as hopelessly naive to either technologists or professional investors. There are certainly many people who know the math/science better than I do, and many who are better at long/short investing in the stock market than I am, but I don't think there are many who sit in the middle of that Venn diagram the way I do.

Nevertheless, whenever I meet with friends and former colleagues from the hedge fund industry, the conversation quickly turns to Nvidia. It's not every day that a company goes from obscurity to a market value larger than the stock markets of Britain, France, or Germany! Naturally, these friends want to know my views. Because I am a firm believer in the long-term transformative power of this technology — I genuinely believe it will radically change nearly every aspect of our economy and society over the next 5-10 years, in a way that is basically unprecedented — I have found it hard to argue that its momentum will slow or stop any time soon.

But even though I have felt for the past year or so that the valuation was simply too rich for my taste, a recent series of developments has nudged me back toward my usual instinct: to be more cautious about the outlook and to question the consensus when it seems to be more than fully priced in. As the saying goes, "what the wise man believes in the beginning, the fool believes in the end" — and that line is famous for a reason.

The bull case

Before we discuss the developments that give me pause, let's briefly review the bull case for NVDA stock, which by now basically everyone knows. Deep learning and AI are the most transformative technologies since the internet, poised to fundamentally change everything in our society. And in terms of the share of total industry capital expenditure going into training and inference infrastructure, Nvidia is somehow in a near-monopoly position.

Some of the largest and most profitable companies in the world — Microsoft, Apple, Amazon, Meta, Google, Oracle, and so on — have decided that they must remain competitive in this field at any cost. Capital expenditure, electricity consumption, new data center square footage, and of course the number of GPUs have all grown explosively, with no sign of slowing down. And Nvidia earns astonishing gross margins of up to 90% on its high-end data-center products.

And that only scratches the surface of the bull case. There are further aspects that can make even the already very optimistic more optimistic still. Beyond the rise of humanoid robots (which I suspect will surprise most people when they quickly become able to perform a huge range of tasks that currently require an unskilled — or even skilled — worker, such as doing laundry, cleaning, tidying, and cooking; performing construction work like renovating a bathroom or building a house as part of a crew; managing warehouses and driving forklifts; and so on), there are other factors most people have not even considered.

One major theme the smart money is talking about is the rise of a "new scaling law", which provides a new paradigm for thinking about how compute demand will grow. Since AlexNet appeared in 2012 and the Transformer architecture was invented in 2017, the original scaling law driving progress in AI has been the pre-training scaling law: the more high-quality tokens we use as training data (now in the trillions), the more parameters in the models we train, and the more compute (FLOPS) we spend training those models on those tokens, the better the final model performs across a wide variety of very useful downstream tasks.

Not only that, this improvement is predictable to the point that leading AI labs like OpenAI and Anthropic have a pretty good idea of how good their latest model will be before they even begin the actual training run — in some cases they can predict the final model's benchmark scores to within a few percentage points. This "original scaling law" has been very important, but it always left those who used it to predict the future with some room for doubt.

For starters, we seem to have exhausted the world's accumulated supply of high-quality training data. Of course, that's not strictly true — there are still many old books and journals that have not yet been properly digitized, or that have been digitized but not properly licensed for use as training data. The problem is that even if you count everything — say, the sum total of "professionally" produced written English content from 1500 to 2000 — it is not a huge amount in percentage terms once you are talking about a training corpus of nearly 15 trillion tokens, which is the scale used for current frontier models.

A quick sanity check on these numbers: Google Books has digitized about 40 million books so far; if a typical book contains 50,000 to 100,000 words, or 65,000 to 130,000 tokens, then books alone account for roughly 2.6T to 5.2T tokens. Of course, a large portion of that is already included in the training corpora used by the big labs, whether or not it is strictly legal. There are also plenty of academic papers, with more than 2 million on the arXiv site alone, and the Library of Congress has more than 3 billion pages of digitized newspapers. Add it all up and the total might reach something like 7T tokens, but since most of it is in fact already included in training corpora, the remaining "incremental" training data probably doesn't matter much in the grand scheme of things.
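A back-of-envelope version of that sanity check, using only the rough figures quoted above (the book count and tokens-per-book ranges are the article's assumptions, not measured values):

```python
# Rough sanity check of the "remaining text data" estimate quoted above.
# All inputs are the article's own rough assumptions, not measured values.

books_digitized = 40_000_000          # Google Books, approximate
tokens_per_book_low, tokens_per_book_high = 65_000, 130_000

books_tokens_low = books_digitized * tokens_per_book_low    # ~2.6e12
books_tokens_high = books_digitized * tokens_per_book_high  # ~5.2e12

frontier_corpus_tokens = 15e12        # ~15T tokens for current frontier models

print(f"Books alone: {books_tokens_low/1e12:.1f}T to {books_tokens_high/1e12:.1f}T tokens")
print(f"Share of a 15T-token corpus: "
      f"{books_tokens_low/frontier_corpus_tokens:.0%} to "
      f"{books_tokens_high/frontier_corpus_tokens:.0%}")
```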

Of course, there are other ways to gather more training data. You could, for example, automatically transcribe every YouTube video and use the resulting text. While that might help, it is certainly of far lower quality than, say, a well-regarded organic chemistry textbook as a source of useful knowledge about the world. So under the original scaling law we have always faced the looming threat of a "data wall": we know we can keep pouring capital expenditure into GPUs and building ever more data centers, but mass-producing useful new human knowledge that correctly complements the existing knowledge is much harder. Now, one interesting response to this is the rise of "synthetic data" — text that is itself the output of an LLM. Although that sounds almost absurd, "improving a model by feeding it its own output" really does work well in practice, at least in the domains of mathematics, logic, and computer programming.

The reason, of course, is that in those domains we can mechanically check and prove that things are correct. So we can sample from the enormous universe of possible math theorems or Python scripts, check whether they are actually correct, and include only the correct ones in our corpus. In this way we can greatly expand the pool of high-quality training data, at least in these fields.
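To make that filtering loop concrete, here is a minimal sketch for the programming case: generate candidate Python snippets, mechanically execute them against a known check, and keep only the ones that pass. The `generate_candidate_solutions` function is a stand-in for whatever LLM you would actually sample from, not a real API.

```python
# Minimal sketch of synthetic-data filtering: keep only generated programs
# that can be mechanically verified. `generate_candidate_solutions` is a
# placeholder for sampling from an LLM; here it is faked with fixed strings.

def generate_candidate_solutions(task: str) -> list[str]:
    # Placeholder for an LLM call; one candidate is wrong on purpose.
    return [
        "def add(a, b):\n    return a - b\n",   # incorrect
        "def add(a, b):\n    return a + b\n",   # correct
    ]

def passes_tests(source: str) -> bool:
    namespace: dict = {}
    try:
        exec(source, namespace)                 # run the candidate code
        return namespace["add"](2, 3) == 5      # mechanical correctness check
    except Exception:
        return False

verified_corpus = [
    src for src in generate_candidate_solutions("write add(a, b)")
    if passes_tests(src)
]
print(f"kept {len(verified_corpus)} of 2 candidates")   # kept 1 of 2
```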

Beyond text, we can also train AI on all kinds of other data. For example, what happens if we train on the complete genome sequencing data of 100 million people (the uncompressed data for a single person is roughly 200GB to 300GB)? That is obviously an enormous amount of data, even though the vast majority of it is nearly identical between any two people. Of course, for various reasons, comparisons with text data from books and the internet can be misleading:

The size of a raw genome cannot be directly compared with a token count

The information content of genomic data is very different from that of text

The training value of highly redundant data is unclear

The computational requirements for processing genomic data are also different

But it is still another huge source of information we could train on in the future, which is why I include it.

So although we can expect to obtain more and more additional training data, if you look at how fast training corpora have grown in recent years, it becomes clear that we will soon hit a bottleneck in the availability of "generally useful" knowledge data — the kind of knowledge that gets us closer to the ultimate goal: an artificial superintelligence ten times smarter than John von Neumann, a world-class expert in every professional field known to humanity.

Beyond the limited supply of data, backers of the pre-training scaling law have always harbored another nagging worry: what do you do with all that computing infrastructure once you've finished training the model? Train the next one? Sure, you can do that, but given how quickly GPU speed and capacity improve, and how much electricity and other operating costs matter in the economics of compute, does it really make sense to use a two-year-old data center to train a new model? Surely you'd rather use the brand-new data center you just built, which cost 10 times as much and, thanks to more advanced technology, delivers 20 times the performance of the old one. The problem is that at some point you actually need to amortize the upfront cost of these investments and recoup them through a stream of (hopefully positive) operating profit, right?

The market has been so excited about AI that it has ignored this, allowing companies like OpenAI to rack up operating losses from day one while commanding ever-higher valuations in successive funding rounds (to be fair, they have also shown very rapid revenue growth). But ultimately, if this is to be sustained through a full market cycle, the cost of all those data centers eventually has to be recouped — ideally with a profit that, over time, compares favorably with other investment opportunities on a risk-adjusted basis.

New paradigm

OK, so that's the pre-training scaling law. What is this "new" scaling law, then? It's something people have only really started paying attention to in the past year: inference-time compute. Before this, the vast majority of the compute spent in the process was the upfront training compute used to create the model in the first place. Once you had a trained model, you only needed a certain amount of compute to run inference on it (that is, to ask a question or have the LLM perform some task for you).

Crucially, the total amount of inference compute (measured in various ways, such as FLOPS, GPU memory footprint, and so on) was far lower than what was required for the pre-training phase. Of course, inference compute does rise as you increase the model's context window and the amount of output it generates in one go (though researchers have made astonishing algorithmic improvements here relative to the quadratic scaling originally expected). But essentially, until recently, inference compute tended to be far less intensive than training compute, and it scaled roughly linearly with the number of requests being handled — for instance, the more demand for ChatGPT text completions, the more inference compute was consumed.

With the arrival of the revolutionary chain-of-thought (COT) models launched last year — most notably OpenAI's flagship O1 model, though DeepSeek's new R1 model uses the same technique, as we will discuss in detail later — all of that changed. Instead of the inference compute being directly proportional to the length of the output text the model generates (scaling up with larger context windows, model sizes, and so on), these new COT models also generate intermediate "logic tokens"; think of them as a kind of scratchpad or "internal monologue" the model maintains while it tries to solve your problem or complete its assigned task.

This represents a genuine sea change in how inference compute works: now, the more tokens you use in this internal thinking process, the better the quality of the final output you deliver to the user. In effect, it is like giving a human worker more time and resources to complete a task, so they can double-check their work repeatedly, solve the same basic problem in several different ways and verify the answers agree, or plug the result back into the formula to check that it really does solve the equation.

It turns out this approach works almost astonishingly well; it harnesses the long-anticipated power of reinforcement learning together with the strengths of the Transformer architecture. And it directly addresses one of the Transformer model's biggest weaknesses: its tendency to "hallucinate".

Basically, because of the way Transformers predict the next token at each step, if they start down a wrong "path" in their initial response, they become almost like a prevaricating child spinning an ever-taller tale to explain why they were actually right all along, even when common sense should have told them partway through that what they were saying could not possibly be correct.

Because the model is always trying to stay internally consistent and make each successive token follow naturally from the preceding tokens and context, it is very hard for it to course-correct and backtrack. By breaking the reasoning process into many intermediate stages, the model can try many different approaches, see which ones work, and keep revising and trying other methods until it reaches reasonably high confidence that it isn't talking nonsense.

What is most remarkable about this approach is that, besides actually working, the more logic/COT tokens you use, the better it works. Suddenly you have an extra dial to turn: as the number of COT reasoning tokens increases (requiring more inference compute, in both floating-point operations and memory), the higher the probability that the answer you give is correct — code that runs without errors on the first try, or a solution to a logic problem with no obviously flawed reasoning step.

I can tell you from extensive first-hand experience that, as good as Anthropic's Claude 3.5 Sonnet is at Python programming (and it really is excellent), whenever you ask it to generate anything long and complicated, it invariably makes one or more silly mistakes. These errors are usually easy to fix. In fact, you can usually just paste the error produced by the Python interpreter as a follow-up prompt (or, more usefully, paste in the full set of "problems" that a so-called linter finds in the code in your editor), without any further explanation, and the model will fix them. Only when the code gets very long or very complicated does it sometimes take more work to fix, possibly including some manual debugging.
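That workflow is easy to automate. The sketch below assumes a hypothetical `ask_llm(prompt)` helper wrapping whatever model API you use (a placeholder, not a real client); everything else is standard library. It simply runs the generated code and, if the interpreter throws, pastes the traceback back to the model as the next prompt.

```python
import subprocess
import sys

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to Claude/GPT/etc.; returns Python source."""
    raise NotImplementedError("wire this up to your model API of choice")

def generate_until_it_runs(task: str, max_rounds: int = 3) -> str:
    prompt = task
    code = ask_llm(prompt)
    for _ in range(max_rounds):
        result = subprocess.run([sys.executable, "-c", code],
                                capture_output=True, text=True, timeout=60)
        if result.returncode == 0:
            return code                      # ran cleanly on this attempt
        # Paste the interpreter's error back as the follow-up prompt,
        # exactly as described above -- usually no explanation is needed.
        prompt = (f"The code you wrote failed with this error:\n"
                  f"{result.stderr}\nPlease fix it and return the full file.")
        code = ask_llm(prompt)
    return code
```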

The first time I tried OpenAI's O1 model, it was a revelation: I was amazed at how often the code was perfect on the very first try. That is because the COT process automatically discovers and fixes problems before the model gives its final answer.

In fact, the O1 model used in OpenAI's ChatGPT Plus subscription ($20 per month) is basically the same model as the O1-Pro model used in the new ChatGPT Pro subscription (at 10 times the price, $200 per month, which caused quite a stir in the developer community); the main difference is that O1-Pro thinks for much longer before responding, generating far more COT logic tokens and consuming vastly more inference compute per response.

This is striking because, even for Claude 3.5 Sonnet or GPT-4o, even with roughly 400kb or more of context, a very long and complex prompt usually takes less than 10 seconds to begin responding, and often less than 5. The same prompt to O1-Pro can easily take more than 5 minutes to produce a response (although OpenAI does show you some of the "reasoning steps" generated along the way while you wait; importantly, OpenAI has decided, for trade-secret-related reasons, to hide the exact reasoning tokens from you and show only a highly abridged summary instead).

As you can imagine, there are plenty of situations where accuracy is paramount — where you would rather give up and tell the user you simply can't do it than give an answer that could easily be proven wrong, or one involving hallucinated facts or other specious, non-rigorous reasoning. Anything involving money/transactions, medical care, or the law, to name just a few.

Basically, wherever the cost of inference is trivial compared with the fully loaded hourly compensation of the human knowledge worker interacting with the AI system, invoking COT compute becomes a complete no-brainer (the main drawback is that it greatly increases response latency, so in some cases you may prefer to iterate faster with lower-latency, lower-accuracy responses).

A few weeks ago there was some exciting news in the AI world concerning OpenAI's as-yet-unreleased O3 model, which can solve a range of problems previously thought to be out of reach of existing AI approaches in the short term. OpenAI was able to crack the hardest of these problems (including extremely difficult "foundational" math problems that even highly skilled professional mathematicians struggle with) because it threw enormous computing resources at them — in some cases spending more than $3,000 worth of compute to solve a single task (by comparison, traditional inference for a single task using a conventional Transformer model without chain of thought is unlikely to cost more than a few dollars).

It doesn't take an AI genius to realize that this development creates a new scaling law, completely distinct from the original pre-training scaling law. You still want to train the best model you can by cleverly using as much compute and as many trillions of high-quality training tokens as possible, but in this new world that is just the beginning of the story; now you can easily burn staggering amounts of compute just running inference on these models at a very high level of confidence, or trying to solve the extremely hard problems that require "genius-level" reasoning to avoid all the potential pitfalls that would lead an ordinary LLM astray.

But why should Nvidia capture all the upside?

Even if you believe, as I do, that the future prospects of AI are almost unimaginably bright, the question remains: "Why should one company capture the majority of the profits from this technology?" History is full of important new technologies that changed the world, yet the main winners were not the companies that looked the most promising in the early stages. The Wright brothers' airplane company, despite inventing and perfecting the technology, is worth less than $10 billion today even counting all the companies it evolved into. And while Ford has a respectable market value of $40 billion today, that is only 1.1% of Nvidia's current market value.

To understand this, you have to understand why Nvidia has been able to capture such a large share of the market. After all, it is not the only company making GPUs. AMD makes perfectly serviceable GPUs that, on paper, have comparable transistor counts and process nodes to Nvidia's. Sure, AMD GPUs are slower than Nvidia GPUs, but not 10x slower or anything like that. In fact, in terms of raw cost per FLOP, AMD GPUs cost roughly half as much as Nvidia's.

Looking at other semiconductor markets, such as DRAM: even though the market is highly concentrated, with only three global players that matter (Samsung, Micron, SK Hynix), DRAM gross margins range from negative at the bottom of the cycle to roughly 60% at the top, averaging around 20%. By contrast, Nvidia's overall gross margin in recent quarters has been around 75% — and that figure is dragged down by the lower-margin, heavily commoditized consumer 3D graphics segment.

So how is this possible? Well, the main reasons come down to software — drivers that "just work" on Linux and are rigorously tested and highly reliable (unlike AMD, whose Linux drivers are notorious for their poor quality and instability), plus highly optimized open-source code in libraries like PyTorch that has been tuned to run extremely well on Nvidia GPUs.

It goes further: the programming framework coders use to write low-level code optimized for GPUs, CUDA, is owned outright by Nvidia and has become a de facto standard. If you want to hire a bunch of extremely talented programmers who know how to accelerate workloads on GPUs, and you're willing to pay them $650,000 a year, or whatever the going rate is for people with that particular skill set, then odds are they "think" and work in CUDA.

Besides the software advantage, Nvidia's other big edge is what is known as interconnect — essentially, the bandwidth needed to link thousands of GPUs together efficiently so they can be jointly harnessed to train today's most cutting-edge foundation models. In short, the key to efficient training is keeping every GPU fully utilized at all times, rather than sitting idle waiting for the next batch of data it needs for the next step of training.

The bandwidth requirements are extreme — far higher than the typical bandwidth needed for traditional data center applications. This kind of interconnect cannot be built with traditional networking gear or fiber, because those introduce too much latency and cannot deliver the terabytes per second of bandwidth required to keep all the GPUs constantly busy.

Nvidia's 2019 acquisition of the Israeli company Mellanox for $6.9 billion was an extremely shrewd decision, and it is this acquisition that gave them their industry-leading interconnect technology. Note that interconnect speed matters far more for training (where the output of thousands of GPUs must be harnessed simultaneously) than for inference (including COT inference), where all you need is enough VRAM to hold the (quantized, i.e., compressed) weights of the already-trained model.

These are arguably the main components of Nvidia's "moat" and the reason it has been able to sustain such high margins for so long (there is also a "flywheel effect", whereby they aggressively plow their super-normal profits back into massive R&D, which in turn helps them improve their technology faster than the competition, so they always stay ahead on raw performance).

But as noted earlier, all else being equal, what customers really care about is usually performance per dollar (including both the upfront capital cost of the equipment and its energy usage, i.e., performance per watt), and while Nvidia's GPUs are certainly the fastest, they are not the best value when measured purely on FLOPS.

The problem is that the other factors are not equal: AMD's drivers are terrible, the popular AI software libraries don't run well on AMD GPUs, outside the gaming world you can't find GPU experts who truly specialize in AMD GPUs (why would they bother, when the market demand for CUDA experts is so much greater?), and AMD's inferior interconnect technology means you can't effectively chain thousands of its GPUs together — all of which leaves AMD basically uncompetitive in the high-end data center space, with little prospect of that changing in the near term.

OK, so it all sounds great for Nvidia, right? Now you can see why the stock trades at such a rich valuation! But what are the other clouds on the horizon? Well, I think a few deserve serious attention. Some have been lurking in the background for the past few years but have had little impact given the sheer pace of growth — and they are now poised to ramp up. Others have emerged only very recently (as in, the past two weeks) and could significantly alter the trajectory of near-term GPU demand growth.

The major threats

At a macro level, you can think of it like this: Nvidia operated in a fairly niche field for a very long time; it had very few competitors, and those competitors were neither profitable enough nor growing fast enough to pose a real threat, because they never had the capital to put real pressure on a market leader like Nvidia. The gaming market was large and still growing, but it did not produce jaw-dropping margins or particularly spectacular annual growth rates.

Around 2016-2017, a few of the big tech companies started ramping up hiring and spending on machine learning and AI, but in the grand scheme of things it was never a truly mission-critical project for any of them — more like "moonshot" R&D spending. But once ChatGPT was released in 2022, the AI arms race began in earnest. It has only been about two years, but it already feels like an eternity.

Suddenly, big companies were ready to invest billions of dollars at astonishing speed. The number of researchers showing up at major research conferences like NeurIPS and ICML surged. Smart students who might once have gone into financial derivatives switched to studying Transformers, and compensation packages of $1 million or more for non-managerial engineering roles (i.e., individual contributors not managing a team) became the norm at the leading AI labs.

It takes a while to change the course of a giant cruise ship; even moving very fast, it takes a year or more to build a new billion-dollar data center, order all the equipment (with lead times stretching out), and get everything set up and debugged. Even the smartest programmers need considerable time to truly hit their stride and become familiar with an existing codebase and infrastructure.

But you can imagine that the money, talent, and energy now being poured into this space are absolutely astronomical. And Nvidia is the biggest target of every one of these players, because it is the one capturing the lion's share of the profits today — not in some hypothetical future where AI runs our entire lives.

So the most important takeaway is that "the market always finds a way." Competitors will find alternative, radically innovative approaches to building hardware, using new concepts to route around the barriers that make up Nvidia's moat.

Hardware-level threats

For example, Cerebras's so-called wafer-scale AI training chips devote an entire 300mm silicon wafer to a single, absolutely enormous chip, packing vastly more transistors and cores onto one die (see their recent blog posts explaining how they solved the yield problem that had historically made this approach economically impractical).

To put this in perspective, compare Cerebras's latest WSE-3 chip with Nvidia's flagship data-center GPU, the H100: the Cerebras chip has a total die area of 46,225 square millimeters, versus just 814 for the H100 (which is itself considered an enormous chip by industry standards) — a 57x multiple! And instead of 132 "streaming multiprocessor" cores like the H100, the Cerebras chip has roughly 900,000 cores (each core is admittedly smaller and less capable, but it is still an almost unimaginably large number by comparison). In concrete AI terms, the Cerebras chip delivers roughly 32 times the FLOPS of a single H100. Since an H100 sells for close to $40,000, you can imagine that the WSE-3 is not cheap either.
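Putting the article's own figures side by side (the die areas, core counts, ~32x FLOPS claim, and ~$40,000 H100 price are taken from the text above, not independently verified):

```python
# Ratios implied by the figures quoted above.
wse3_die_mm2, h100_die_mm2 = 46_225, 814
print(f"Die area ratio: {wse3_die_mm2 / h100_die_mm2:.0f}x")    # ~57x

wse3_cores, h100_sms = 900_000, 132
print(f"Core count ratio: {wse3_cores / h100_sms:,.0f}x")        # ~6,818x

flops_ratio = 32          # the article's "about 32x a single H100" claim
h100_price = 40_000
print(f"Price of {flops_ratio} H100s' worth of compute: ${flops_ratio * h100_price:,}")
```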

So what does this mean? Rather than trying to take on Nvidia head-to-head with a similar approach, or trying to match Mellanox's interconnect technology, Cerebras has used a radically different approach that sidesteps the interconnect problem: when everything runs on one giant chip, the bandwidth between processors becomes far less of an issue. You don't even need the same class of interconnect, because one mega-chip stands in for racks full of H100s.

And the Cerebras chips perform extremely well on AI inference tasks. In fact, you can try it for free today with Meta's well-known Llama-3.3-70B model. It responds essentially instantly, at around 1,500 tokens per second. For perspective, anything above 30 tokens per second feels relatively fast to users of ChatGPT and Claude, and even 10 tokens per second is fast enough to read the response as it is being generated.

Cerebras is not alone; there are others, such as Groq (not to be confused with Grok, the model family trained by Elon Musk's X AI). Groq has taken another innovative approach to the same fundamental problem. Rather than trying to compete head-on with Nvidia's CUDA software stack, they developed what they call a "tensor processing unit" (TPU), purpose-built for the exact mathematical operations that deep learning models need. Their chips are designed around the concept of "deterministic compute", meaning that, unlike traditional GPUs, they execute operations in a completely predictable way every single time.

That may sound like a minor technical detail, but it has enormous implications for both chip design and software development. Because timing is fully deterministic, Groq can optimize its chips in ways that traditional GPU architectures simply cannot. As a result, over the past six months they have been demonstrating inference speeds of more than 500 tokens per second on the Llama family and other open-source models, far beyond what traditional GPU setups can achieve. Like Cerebras, this is available today, and you can try it for free.

Using a Llama3 model with "speculative decoding", Groq can generate 1,320 tokens per second, on par with Cerebras and far beyond what conventional GPUs can do. You might ask what the point of going past 1,000 tokens per second is when users already seem quite happy with ChatGPT at well under that speed. The point is that it does matter. Instant feedback makes iteration far faster and keeps human knowledge workers from losing focus. And if you drive the model programmatically through an API, it enables whole new classes of applications that need multi-stage inference (where the output of one stage feeds the prompt/inference of the next) or low-latency responses — content moderation, fraud detection, dynamic pricing, and so on.
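For readers unfamiliar with "speculative decoding", here is a heavily simplified greedy sketch of the idea: a small, fast draft model proposes several tokens ahead, and the large target model verifies them, keeping the longest prefix it agrees with. The two `*_next_token` functions are toy stand-ins, not a real model API, and real systems verify the draft with one batched forward pass plus rejection sampling.

```python
# Toy greedy speculative decoding: accept the prefix of draft tokens that
# the target model agrees with, then fall back to the target's own token.

def draft_next_token(context: list[str]) -> str:
    # Stand-in for a small, fast draft model.
    return "the" if len(context) % 2 == 0 else "cat"

def target_next_token(context: list[str]) -> str:
    # Stand-in for the large, slow target model.
    return "the" if len(context) % 2 == 0 else "sat"

def speculative_step(context: list[str], k: int = 4) -> list[str]:
    # 1) The draft model cheaply proposes k tokens ahead.
    draft = []
    for _ in range(k):
        draft.append(draft_next_token(context + draft))
    # 2) The target model checks them; keep the longest agreeing prefix,
    #    then append the target's own token at the first disagreement.
    accepted = []
    for tok in draft:
        if target_next_token(context + accepted) == tok:
            accepted.append(tok)
        else:
            accepted.append(target_next_token(context + accepted))
            break
    return context + accepted

print(speculative_step(["once", "upon"]))   # several tokens emitted per "slow" step
```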

Even more fundamentally, the faster requests are served, the faster things cycle, and the busier the hardware stays. Groq's hardware is very expensive, with a single server costing as much as $2 to 3 million, but if demand is high enough to keep that hardware constantly busy, the cost per request drops dramatically.

Much like Nvidia's CUDA, a large part of Groq's advantage comes from its proprietary software stack. They take the same open-source models that companies like Meta, DeepSeek, and Mistral develop and release for free, and decompose them in special ways that let them run dramatically faster on Groq's specific hardware.

Like Cerebras, they have made different technical decisions to optimize particular aspects of the process, which lets them do things in a fundamentally different way. In Groq's case, the focus is entirely on inference-time compute, not training: all of their special hardware and software delivers its huge speed and efficiency advantages only when running inference on a model that has already been trained.

But if the next big scaling law people are expecting is inference-time compute, and the biggest drawback of COT models is the high latency caused by having to generate all those intermediate logic tokens before responding, then even a company that only does inference compute — as long as it is dramatically faster and more efficient than Nvidia — will pose a serious competitive threat over the next few years. At the very least, Cerebras and Groq can chip away at the lofty expectations for Nvidia's revenue growth over the next 2-3 years that are baked into its current valuation.

Beyond these highly innovative but relatively obscure startup competitors, serious competition also comes from some of Nvidia's biggest customers themselves, who have been building custom chips aimed specifically at AI training and inference workloads. The most famous is Google, which has been developing its own proprietary TPUs since 2016. Interestingly, although Google briefly sold TPUs to external customers, it has used all of its TPUs internally for the past several years, and it is already on its sixth generation of TPU hardware.

Amazon is also developing its own custom chips, called Trainium2 and Inferentia2. While Amazon is building data centers stocked with billions of dollars of Nvidia GPUs, it is simultaneously investing billions of dollars in other data centers that use these in-house chips. They have one cluster coming online for Anthropic with more than 400,000 chips.

Amazon has been widely criticized for completely botching its internal AI model development, squandering vast amounts of internal compute on models that ended up uncompetitive — but custom silicon is another matter. Again, they don't necessarily need their chips to be better or faster than Nvidia's. All they need is chips that are good enough, manufactured at a break-even gross margin, rather than paying the roughly 90%+ gross margin Nvidia earns on its H100 business.

OpenAI has also announced plans to build custom chips, and they (together with Microsoft) are apparently the biggest users of Nvidia's data center hardware. As if that weren't enough, Microsoft has announced custom chips of its own!

And Apple, the world's most valuable technology company, has spent years defying expectations with its highly innovative and disruptive custom silicon business, which now thoroughly beats Intel and AMD CPUs on performance per watt — the most important factor in mobile (phone/tablet/laptop) applications. For years they have been making their own internally designed GPUs and "neural processors", although they have yet to really demonstrate the usefulness of these chips outside their own custom applications, such as the advanced software-based image processing used in the iPhone camera.

While Apple's focus seems different from these other players — mobile-first, consumer-oriented, "edge compute" — if Apple ends up spending enough money on its new contract with OpenAI to provide AI services to iPhone users, you have to imagine they have teams looking into building their own custom silicon for inference/training (although, given their secrecy, you may never hear about it directly!).

It's no secret that Nvidia's hyperscaler customer base follows a strong power-law distribution, with a handful of top customers accounting for the vast majority of its high-margin revenue. How should we think about the future of this business when every single one of those VIP customers is building its own custom chips specifically for AI training and inference?

When thinking about all this, keep one very important fact in mind: Nvidia is largely an IP-based company. They do not manufacture their own chips. The truly special secret sauce behind these incredible devices arguably comes more from TSMC and from ASML, which makes the special EUV lithography machines used to fabricate chips at these leading-edge process nodes. This matters enormously, because TSMC will sell its most advanced chips to any customer willing to put up enough upfront investment and guarantee a certain volume. They don't care whether the chips are Bitcoin mining ASICs, GPUs, TPUs, smartphone SoCs, or anything else.

However much Nvidia's senior chip designers earn per year, these tech giants can certainly come up with enough cash and stock to lure some of the best of them away. Once they have the team and the resources, they can design innovative chips within 2 to 3 years (perhaps not even 50% as advanced as the H100, but with Nvidia's gross margins, there is plenty of room to work with) — and thanks to TSMC, they can turn those designs into actual silicon using exactly the same process node technology as Nvidia.

Software threats

As if these looming hardware threats weren't bad enough, the past few years have also seen developments on the software side that started slowly but are now gathering real momentum, and that could pose a serious threat to the dominance of Nvidia's CUDA software. The first concerns those terrible Linux drivers for AMD GPUs. Remember how we discussed AMD inexplicably letting these drivers stay awful for years, leaving enormous amounts of money on the table?

Interestingly, the infamous hacker George Hotz (famous for jailbreaking the original iPhone as a teenager, and currently CEO of the self-driving startup Comma.ai and the AI computer company Tiny Corp, which also develops the open-source tinygrad AI software framework) recently announced that he was fed up with dealing with AMD's terrible drivers and desperately wanted to be able to use lower-cost AMD GPUs in their TinyBox AI computers (which come in multiple flavors, some using Nvidia GPUs and some using AMD GPUs).

In fact, he built his own custom drivers and software stack for AMD GPUs without any help from AMD; on January 15, 2025, he tweeted via his company's X account: "We are one piece away from a completely sovereign stack on AMD — the RDNA3 assembler. We have our own driver, runtime, libraries, and emulator. (All in about 12,000 lines!)" Given his track record and skills, they will likely have it all working within the next few months, which would open up many exciting possibilities for using AMD GPUs for all kinds of applications that companies currently have to pay Nvidia for.

OK, but that's just one driver for AMD, and it isn't even finished. What else is there? Well, there are other developments on the software side with even bigger implications. First, many large tech companies and the open-source software community are now working together to develop more general AI software frameworks, for which CUDA is just one of many "compile targets".

That is, you write software using higher-level abstractions, and the system itself automatically translates those high-level constructs into super-optimized low-level code that runs extremely well on CUDA. But because this happens at a higher abstraction layer, the same code can just as easily be compiled down to low-level code that runs well on many other GPUs and TPUs from a variety of vendors — such as the flood of custom chips that every major tech company is working on.

The most famous examples of these frameworks are MLX (sponsored primarily by Apple), Triton (sponsored primarily by OpenAI), and JAX (developed by Google). MLX is particularly interesting because it provides a PyTorch-like API that runs efficiently on Apple Silicon, showing how these abstraction layers let AI workloads run on completely different architectures. Triton, meanwhile, has become increasingly popular because it lets developers write high-performance code that can be compiled to run on a variety of hardware targets without needing to understand the low-level details of each platform.

These frameworks let developers write code once using powerful abstractions and then compile it automatically for a huge range of platforms — doesn't that sound like a more productive way to work? It also gives you much more flexibility in how the code is actually run.
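A trivial illustration of what "write once, target whatever accelerator is present" looks like in practice, using PyTorch's own backend selection (the snippet simply picks CUDA, Apple's Metal backend, or the CPU, whichever is available; it is an illustration of the abstraction idea, not of any one framework named above):

```python
import torch

# The same high-level code runs unchanged on an Nvidia GPU (CUDA backend),
# Apple Silicon (MPS backend), or a plain CPU; the framework dispatches to
# the appropriate low-level, hardware-specific kernels underneath.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

x = torch.randn(1024, 1024, device=device)
w = torch.randn(1024, 1024, device=device)
y = torch.relu(x @ w)          # dispatched to whichever backend was selected
print(device, y.shape)
```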

In the 1980s, all the most popular, best-selling software was written in hand-tuned assembly language. The PKZIP compression utility, for example, was handcrafted for maximum speed, to the point where a version written in standard C and compiled with the best optimizing compilers of the day might run at only half the speed of the hand-tuned assembly. The same was true of other popular packages like WordStar, VisiCalc, and so on.

Over time, compilers became more and more powerful, and every time the CPU architecture changed (say, from the Intel 486 to the Pentium, and so on), the hand-written assembly often had to be thrown away and rewritten — something only the smartest programmers could do (much as CUDA experts command a premium in the job market over "ordinary" software developers). Eventually things converged: the speed advantage of hand-tuned assembly was decisively outweighed by the flexibility of writing code in high-level languages like C or C++ and relying on the compiler to make it run optimally on any given CPU.

Today, very few people write new code in assembly. I believe a similar shift will eventually happen in AI training and inference code, for largely the same reasons: computers are good at optimization, and flexibility and development speed are increasingly what matter — especially if it also lets you save enormously on hardware, because you no longer have to keep paying the "CUDA tax" that delivers Nvidia its 90%+ margins.

Another area where things could change dramatically is that CUDA itself may end up becoming a high-level abstraction — a "specification language" akin to Verilog (the industry standard for describing chip layouts) that skilled developers use to describe high-level, massively parallel algorithms (because they already know it, it is well structured, it is the lingua franca, and so on). But instead of being compiled for Nvidia GPUs as it is today, that code would be fed as source into an LLM, which would port it into whatever low-level code the new Cerebras chip, the new Amazon Trainium2, or the new Google TPUv6 understands. This is not as far off as you might think; with OpenAI's latest O3 model, it may already be within reach, and it will certainly be broadly feasible within a year or two.

The theoretical threat

Perhaps the most shocking development came just a couple of weeks ago. The news completely rocked the AI world, and although the mainstream media barely mentioned it, it became the talk of the intelligentsia on Twitter: a Chinese startup called DeepSeek released two new models whose performance is essentially on par with the best models from OpenAI and Anthropic (surpassing Meta's Llama3 models and other smaller open-source models such as Mistral). The models are called DeepSeek-V3 (essentially their answer to GPT-4o and Claude 3.5 Sonnet) and DeepSeek-R1 (essentially their answer to OpenAI's O1 model).

Why is all this so shocking? First, DeepSeek is a small company that reportedly has fewer than 200 employees. The story is that they started out as a quantitative trading hedge fund similar to TwoSigma or RenTec, but after China tightened regulation of that sector, they used their math and engineering expertise to pivot into AI research. What we do know is that they released two extremely detailed technical reports, for DeepSeek-V3 and DeepSeek-R1.

These are highly technical reports, and if you don't know any linear algebra they may be hard to follow. But what you should really try is downloading the free DeepSeek app from the App Store, installing it and signing in with a Google account, and giving it a spin (you can also install it on Android), or simply trying it in your desktop browser. Make sure to select the "DeepThink" option to enable chain of thought (the R1 model), and ask it to explain parts of the technical reports in plain language.

This will show you several important things at once:

First, this model is absolutely legit. There is a lot of fakery in AI benchmarks, which are routinely gamed so that models look great on the benchmarks but perform poorly in real-world tests. Google is surely the biggest offender here, constantly crowing about how amazing its LLMs are, when in reality they perform so badly in real-world tests that they cannot reliably complete even the simplest tasks, let alone challenging coding tasks. The DeepSeek models are different: their responses are coherent and compelling, absolutely on the same level as the models from OpenAI and Anthropic.

Second, DeepSeek has made major advances not only in model quality but, more importantly, in training and inference efficiency. By working extremely close to the hardware, and by stacking together a handful of unique and very clever optimizations, DeepSeek was able to train these incredible models on GPUs in a dramatically more efficient way. By some measurements, DeepSeek is roughly 45x more efficient than other frontier models.

DeepSeek claims that the full cost of training DeepSeek-V3 was just over $5 million. That is nothing by the standards of OpenAI, Anthropic, and the like, which had already crossed the threshold of $100+ million training costs for a single model by 2024.

How is this possible? How could this little Chinese company completely out-do all the smartest people at our leading AI labs, which have 100 times more resources, headcount, payroll, capital, GPUs, and so on? Wasn't China supposed to be crippled by the Biden administration's restrictions on GPU exports? Well, the details get fairly technical, but we can at least describe them in broad strokes. Perhaps DeepSeek's relatively meager GPU firepower turned out to be precisely the key factor spurring its creativity and ingenuity — necessity being the mother of invention.

One major innovation is their advanced mixed-precision training framework, which lets them use 8-bit floating point numbers (FP8) throughout the entire training process. Most Western AI labs train using "full precision" 32-bit numbers (this basically specifies the number of possible gradations available when describing an artificial neuron's output; the 8 bits of FP8 can store a far wider range of numbers than you might think — it is not limited to 256 equally spaced magnitudes like a regular integer, but uses clever math tricks to represent both very small and very large numbers, albeit with naturally less precision than 32 bits). The main tradeoff is that while FP32 can store numbers across an enormous range with astonishing precision, FP8 sacrifices some of that precision to save memory and improve performance, while retaining enough precision for many AI workloads.

DeepSeek cracked this problem by developing a clever system that breaks numbers into small blocks for activations and tiles for weights, and strategically uses high-precision computation at key points in the network. Unlike other labs that train at high precision and then compress afterwards (losing some quality in the process), DeepSeek's FP8-native approach means they capture the huge memory savings without compromising performance. When you are training across thousands of GPUs, that dramatic reduction in per-GPU memory requirements translates into needing far fewer GPUs overall.
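The following sketch simulates the block-wise low-precision idea using plain int8 absmax quantization; real FP8 training uses e4m3/e5m2 floating-point formats and fused kernels, so treat this only as an illustration of why per-block scaling preserves accuracy compared with a single scale for the whole tensor.

```python
import torch

def quantize_blockwise(x: torch.Tensor, block: int = 128):
    """Quantize a 1-D tensor to int8 with one scale per block (absmax)."""
    x = x.reshape(-1, block)
    scale = x.abs().amax(dim=1, keepdim=True) / 127.0    # per-block scale
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).reshape(-1)

x = torch.randn(1024) * torch.logspace(-3, 3, 1024)      # wildly varying magnitudes
q, s = quantize_blockwise(x)
err_blockwise = (dequantize_blockwise(q, s) - x).abs().mean()

# Compare against a single scale for the whole tensor: the large values
# dominate the scale and the small values lose almost all their precision.
q1, s1 = quantize_blockwise(x, block=1024)
err_single = (dequantize_blockwise(q1, s1) - x).abs().mean()
print(f"block-wise error {err_blockwise:.4f} vs single-scale error {err_single:.4f}")
```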

Another major breakthrough is their multi-token prediction system. Most Transformer-based LLMs do inference by predicting the next token — one token at a time.

DeepSeek figured out how to predict multiple tokens while maintaining the quality of single-token prediction. Their approach achieves roughly 85-90% accuracy on these extra token predictions, effectively doubling inference speed without sacrificing much quality. The clever part is that they maintain the full causal chain of predictions, so the model isn't just guessing — it is making structured, context-aware predictions.
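Here is a bare-bones illustration of the extra-head idea behind multi-token prediction: a shared trunk feeds one head that predicts token t+1 and a second head that predicts token t+2, so each forward pass proposes two tokens instead of one. This is a toy module for intuition only, not DeepSeek's actual MTP design (which chains additional Transformer blocks and preserves the full causal structure).

```python
import torch
import torch.nn as nn

class ToyMultiTokenPredictor(nn.Module):
    """Shared trunk with two output heads: next token and the one after."""
    def __init__(self, vocab_size: int = 1000, d_model: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.trunk = nn.GRU(d_model, d_model, batch_first=True)
        self.head_next = nn.Linear(d_model, vocab_size)      # predicts t+1
        self.head_next2 = nn.Linear(d_model, vocab_size)     # predicts t+2

    def forward(self, tokens: torch.Tensor):
        h, _ = self.trunk(self.embed(tokens))
        last = h[:, -1]                       # hidden state at the final position
        return self.head_next(last), self.head_next2(last)

model = ToyMultiTokenPredictor()
ctx = torch.randint(0, 1000, (1, 16))         # a batch with 16 context tokens
logits_t1, logits_t2 = model(ctx)
# Two tokens proposed per forward pass; a real system would verify the
# second one before committing, keeping only high-confidence extra tokens.
print(logits_t1.argmax(-1).item(), logits_t2.argmax(-1).item())
```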

One of their most innovative developments is what they call Multi-head Latent Attention (MLA). This is their breakthrough in handling so-called key-value indices — basically, how individual tokens are represented in the attention mechanism of the Transformer architecture. Without getting too technical, suffice it to say that these KV indices are one of the main consumers of VRAM during training and inference, and part of the reason you need thousands of GPUs at once to train these models: each GPU has at most 96GB of VRAM, and these indices eat that memory for breakfast.

Their MLA system finds a way to store a compressed version of these indices that captures the essential information while using far less memory. The best part is that this compression is built directly into how the model learns — it is not a separate step they need to perform, but is baked into the end-to-end training pipeline. That means the whole mechanism is "differentiable" and can be trained directly with standard optimizers. It works because the underlying data representations these models eventually find live far below the so-called "ambient dimension". So storing the full KV indices is wasteful, even though that is essentially what everyone else does.
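A heavily simplified sketch of the low-rank idea: instead of caching full per-head keys and values, cache a small latent vector per token and reconstruct K and V from it with learned projections (which is why the whole thing stays differentiable). The dimensions below are arbitrary toy values; this is not DeepSeek's actual MLA implementation.

```python
import torch
import torch.nn as nn

d_model, n_heads, d_head, d_latent = 512, 8, 64, 64    # toy sizes

down = nn.Linear(d_model, d_latent, bias=False)            # compress per token
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)   # reconstruct keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)   # reconstruct values

seq = torch.randn(1, 2048, d_model)                     # 2048 cached tokens

latent_cache = down(seq)                                # what gets stored
k = up_k(latent_cache).view(1, 2048, n_heads, d_head)   # rebuilt on the fly
v = up_v(latent_cache).view(1, 2048, n_heads, d_head)

full_cache_floats = 2 * 2048 * n_heads * d_head         # naive K + V cache
latent_cache_floats = 2048 * d_latent
print(f"cache size reduced by ~{full_cache_floats / latent_cache_floats:.0f}x")
```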

Storing all that excess data is not just a waste of space that balloons the training memory footprint and hurts efficiency (again: far fewer GPUs needed to train a world-class model if you avoid it); compressing it can actually improve model quality, because it acts as a "regularizer" that forces the model to focus on what really matters instead of spending wasted capacity fitting noise in the training data. So not only do you save an enormous amount of memory, the model may even perform better. At the very least, you don't take a heavy performance hit in exchange for the huge memory savings, which is usually the tradeoff you face in AI training.

They also made major advances in GPU communication efficiency through their DualPipe algorithm and custom communication kernels. The system intelligently overlaps computation and communication, carefully balancing GPU resources between the two. They only need about 20 of the GPU's streaming multiprocessors (SMs) for communication, leaving the rest free for computation. The result is far higher GPU utilization than in typical training setups.

Another very clever thing they did was to use a Mixture-of-Experts (MOE) Transformer architecture, but with key innovations around load balancing. As you may know, the size or capacity of an AI model is usually measured by the number of parameters it contains. A parameter is just a number that stores some attribute of the model: the "weight" or importance of a particular artificial neuron relative to another, or the importance of a particular token depending on its context (in the "attention mechanism"), and so on.

Meta's latest Llama3 models come in several sizes: for example, a 1-billion-parameter version (the smallest), a 70B-parameter model (the most commonly used), and even a massive 405B-parameter model. For most users this largest model has limited practicality, because your computer would need tens of thousands of dollars' worth of GPUs just to run inference at an acceptable speed, at least for the raw full-precision version. So most of the real-world usage of, and excitement around, these open-source models happens at the 8B-parameter or heavily quantized 70B-parameter level, because that is what a consumer-grade Nvidia 4090 GPU can hold — and you can now buy one for under $1,000.

So what is the point of all this? In a sense, the number and precision of the parameters tell you how much raw information or data is stored inside the model. Note that I am not talking about reasoning ability, or the model's "IQ": it turns out that even models with surprisingly small parameter counts can demonstrate remarkable cognitive abilities when it comes to solving complex logic problems, proving theorems in plane geometry, SAT math problems, and so on.

But those small models cannot necessarily tell you every aspect of every plot twist in every Stendhal novel, whereas a truly large model potentially can. The "cost" of that extreme level of knowledge is that the model becomes very unwieldy to train and to run, because just to perform inference you always need to keep every one of those 405B parameters (or whatever the parameter count is) in the GPU's VRAM simultaneously.

The beauty of the MOE approach is that you can break the big model up into a collection of smaller models, each of which knows different, non-overlapping (or at least not fully overlapping) things. DeepSeek's innovation here was developing what they call an "auxiliary-loss-free" load balancing strategy, which keeps the experts efficiently utilized without the performance degradation that load balancing usually introduces. Then, depending on the nature of the inference request, you can intelligently route the inference to the "expert" among those smaller models best able to answer that question or perform that task.

You can think of it loosely as a committee of experts, each with their own specialty: one might be a legal expert, another a computer science expert, another a business strategy expert. So if a question about linear algebra comes in, you don't hand it to the legal expert. This is of course a very loose analogy; it doesn't actually work like that in practice.
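A minimal sketch of top-k expert routing to make the "committee" analogy concrete: a learned gate scores the experts for each token, only the top-k experts actually run, and their outputs are combined using the gate weights. The load-balancing part (DeepSeek's "auxiliary-loss-free" strategy) is omitted entirely; this only shows the routing mechanics, with arbitrary toy sizes.

```python
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts)            # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts))

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.gate(x).softmax(dim=-1)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                            # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += topk_scores[mask, slot, None] * expert(x[mask])
        return out

layer = ToyMoELayer()
tokens = torch.randn(16, 64)
print(layer(tokens).shape)        # torch.Size([16, 64])
```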

The real advantage of this approach is that it lets the model contain an enormous amount of knowledge without becoming extremely unwieldy: even though the total number of parameters across all the experts is high, only a small fraction of them are "active" at any given time, so you only need to keep that small subset of weights in VRAM to do inference. Take DeepSeek-V3: it is an absolutely massive MOE model with 671B parameters, far larger than even the biggest Llama3 model, but only 37B of those parameters are active at any given time — few enough to fit into the VRAM of two consumer-grade Nvidia 4090 GPUs (total cost under $2,000), rather than requiring one or more H100 GPUs at roughly $40,000 apiece.
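A rough back-of-envelope check of that claim, assuming the active weights are held at 8-bit precision (one byte per parameter); the exact quantization needed to make this work in practice is my assumption, not something specified in the text:

```python
# Only the ~37B "active" parameters need to sit in VRAM for a given token.
active_params = 37e9
bytes_per_param = 1          # assume 8-bit weights; FP16 would double this
vram_needed_gb = active_params * bytes_per_param / 1e9

rtx_4090_vram_gb = 24
print(f"~{vram_needed_gb:.0f} GB needed vs "
      f"{2 * rtx_4090_vram_gb} GB across two RTX 4090s")    # 37 GB vs 48 GB
```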

Rumor has it that both ChatGPT and Claude use MoE architectures, and some leaks suggest GPT-4 has 1.8 trillion parameters in total, spread across 8 models of 220 billion parameters each. Even though that is far more manageable than fitting all 1.8 trillion parameters in VRAM, the sheer amount of memory involved still means multiple H100-class GPUs are needed just to run the model.

Beyond the above, the technical papers mention several other key optimizations. These include their extremely memory-efficient training framework, which avoids tensor parallelism, recomputes certain operations during the backward pass rather than storing them, and shares parameters between the main model and the auxiliary prediction modules. The sum of all these innovations, layered on top of each other, adds up to the roughly 45x efficiency improvement figure circulating online, and I am entirely prepared to believe those numbers are in the right ballpark.
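The "recompute instead of store" trick mentioned above is the general activation-checkpointing technique; in PyTorch it is roughly a one-line change, sketched below (this illustrates the generic technique, not DeepSeek's specific training framework):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096))
x = torch.randn(32, 4096, requires_grad=True)

# Normal forward: intermediate activations are kept in memory for backward.
y_stored = block(x).sum()

# Checkpointed forward: activations inside `block` are discarded and then
# recomputed during the backward pass, trading extra FLOPs for less memory.
y_recomputed = checkpoint(block, x, use_reentrant=False).sum()
y_recomputed.backward()
print(x.grad.shape)     # gradients come out the same either way
```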

DeepSeek's API pricing is strong corroborating evidence: despite nearly best-in-class model performance, inference requests through their API cost about 95% less than comparable models from OpenAI and Anthropic. In a sense, this is a bit like comparing Nvidia's GPUs with the new custom chips from its challengers: even if they aren't quite as good, the value for money is vastly better, so it can still be the obvious choice — as long as you can establish the performance level, confirm it is good enough for your requirements, and the API availability and latency are acceptable (so far, people have been amazed at how well DeepSeek's infrastructure has held up, despite an incredible surge in demand driven by the performance of these new models).

But unlike Nvidia, where the cost differential comes from the 90%+ monopoly gross margins they earn on their data-center products, the cost differential of the DeepSeek API relative to the OpenAI and Anthropic APIs may simply reflect the fact that their compute efficiency is nearly 50x better (and possibly far more than that on the inference side — the ~45x figure is for training). In fact, it isn't even clear that OpenAI and Anthropic earn healthy margins on their API services — they may be more focused on revenue growth and on gathering more data by analyzing all the API requests they receive.

Before moving on, I have to note that many people speculate DeepSeek is lying about the number of GPUs and the amount of GPU time it took to train these models — that they actually have far more H100s than they admit, because those cards are subject to export restrictions and they don't want to get themselves in trouble or hurt their chances of obtaining more. While that is certainly possible, I think it is more likely they are telling the truth, and that they achieved these incredible results simply by being extremely clever and creative in their training and inference approaches. They have explained how they did it, and I suspect it is only a matter of time before their results are widely replicated and confirmed by researchers at other labs.

A model that really thinks

The newer R1 model and technical report may be even more shocking, because they beat Anthropic to chain of thought, and they are now essentially the only ones besides OpenAI who have made this technology work at scale. Note, though, that OpenAI only released its O1-preview model in mid-September 2024 — barely four months ago! And one thing you absolutely must keep in mind is that OpenAI is extremely secretive about how these models actually work at a low level, and will not share the actual model weights with anyone except partners like Microsoft that have signed strict NDAs. DeepSeek's models, by contrast, are completely open source and permissively licensed. They released extremely detailed technical reports explaining how the models work, along with code that anyone can inspect and attempt to replicate.

With R1, DeepSeek essentially cracked one of the holy grails of AI: getting a model to reason step by step without relying on massive supervised datasets. Their DeepSeek-R1-Zero experiment demonstrated this: using pure reinforcement learning with carefully designed reward functions, they managed to get the model to develop sophisticated reasoning capabilities entirely on its own. It wasn't just solving problems — the model organically learned to generate long chains of thought, to verify its own work, and to allocate more compute time to harder problems.

The technical breakthrough here is their novel approach to reward modeling. Rather than using a complex neural reward model, which can lead to "reward hacking" (where the model boosts its reward in spurious ways that don't actually improve its real-world performance), they developed a clever rule-based system that combines accuracy rewards (verifying the final answer) with format rewards (encouraging structured thinking). It turns out this simpler approach is more robust and more scalable than the process-based reward models others have tried.
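A minimal sketch of what such a rule-based reward might look like for a math problem: one component checks whether the extracted final answer matches the ground truth, the other checks whether the response follows the expected think/answer format. The tag names and weights here are illustrative assumptions, not the values from the R1 paper.

```python
import re

def format_reward(response: str) -> float:
    """Reward responses that keep reasoning inside <think> tags and give a
    single final answer inside <answer> tags (illustrative tag names)."""
    ok = bool(re.search(r"<think>.*?</think>\s*<answer>.*?</answer>",
                        response, re.DOTALL))
    return 1.0 if ok else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Reward responses whose extracted final answer matches the ground truth."""
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return 1.0 if m and m.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Simple weighted sum of fixed rules; no learned reward model anywhere,
    # so there is nothing for the policy to "hack" other than being right.
    return accuracy_reward(response, ground_truth) + 0.2 * format_reward(response)

r = "<think>2+2 is 4 because ...</think> <answer>4</answer>"
print(total_reward(r, "4"))    # 1.2
```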

What is particularly fascinating is that during training they observed what they call an "aha moment" — a point where the model spontaneously learns to revise its reasoning mid-stream when it runs into uncertainty. This emergent behavior was not explicitly programmed; it arose naturally from the interaction between the model and the reinforcement learning environment. The model would literally pause, flag potential problems in its reasoning, and start over with a different approach, without ever being explicitly trained to do so.

The full R1 model builds on these insights by introducing what they call "cold start" data — a small set of high-quality examples — before applying their reinforcement learning techniques. They also tackled one of the big unsolved problems for reasoning models: language consistency. Previous attempts at chain-of-thought reasoning often resulted in models mixing languages or producing incoherent output. DeepSeek solved this by cleverly rewarding language consistency during RL training, trading a small performance penalty for much more readable and consistent output.

The results are incredible: on AIME 2024 (one of the most challenging high school math competitions), R1 achieved 79.8% accuracy, on par with OpenAI's O1 model. On MATH-500 it hit 97.3%, and it scored 96.3% in Codeforces programming competitions. But perhaps most impressive of all, they managed to distill these capabilities into much smaller models: their 14B-parameter version outperforms many models several times its size, showing that reasoning ability is not just a matter of raw parameter count but also of how you train the model to process information.

The aftermath

The latest scuttlebutt circulating on Twitter and Blind (a corporate gossip site) is that these models caught Meta completely off guard, and that they outperform the new Llama4 models still in training. Apparently the Llama project inside Meta has drawn the attention of senior technical leadership, so they have roughly 13 people working on Llama, each of whom individually earns more per year than the entire cost of training the DeepSeek-V3 model — a model that performs better than Llama. How do you explain that to Zuckerberg with a straight face? How does he keep smiling while shoveling billions of dollars at Nvidia for 100,000 H100s, when a better model was trained with just 2,000 H100s for less than $5 million?

But you had better believe that Meta and every other big AI lab is taking these DeepSeek models apart, studying every word of the technical reports and every line of the open-source code they released, desperately trying to fold these same tricks and optimizations into their own training and inference pipelines. So what is the impact of all this? Well, naively, the aggregate demand for training and inference compute should be divided by some large number. Maybe not 45, but perhaps 25 or even 30? Because whatever you thought you needed before, you now need a lot less.

An optimist might say: "You're just talking about a simple constant of proportionality, a single multiple. When you're dealing with an exponential growth curve, that kind of thing washes out quickly and ends up not mattering much." And there is some truth to that: if AI really is as transformative as I expect, if the real-world utility of this technology is measured in the trillions, if inference-time compute is the new scaling law, and if we are going to have armies of humanoid robots constantly running large amounts of inference, then perhaps the growth curve is still so steep and extreme that Nvidia remains far enough ahead that it all still works out.

But Nvidia needs a lot of good news over the coming years to sustain its valuation, and when you put all these factors together, I at least start to feel deeply uneasy about paying 20 times forecast 2025 sales for the stock. What happens if sales growth slows a bit? What if growth turns out to be 85% instead of 100%+? What if gross margins come in at 70% instead of 75% — still remarkably high for a semiconductor company?

Summary

At the macro level, Nvidia faces an unprecedented array of competitive threats, which make its 20x forward sales and 75% gross margins increasingly hard to justify at its lofty valuation. Worrying cracks are appearing in every one of the company's advantages — hardware, software, and efficiency. The whole world — thousands of the smartest people on the planet, backed by untold billions of dollars of capital — is trying to attack them from every angle.

On the hardware front, the innovative architectures from Cerebras and Groq show that Nvidia's interconnect advantage — the cornerstone of its data center dominance —
