
The Report That Shook Wall Street: Behind the Plunge in Bitcoin and Nvidia


Reprinted from jinse


A professional investor who has worked as both an analyst and a software engineer wrote an article making the short case for Nvidia. It was reposted widely by high-profile Twitter accounts and became a major "culprit" behind the plunge in Nvidia's stock. Nvidia's market value fell by nearly $600 billion, by far the largest single-day decline ever recorded for a listed company.

The investor, Jeffrey Emanuel, argues in essence that DeepSeek has punctured the hype built up by Wall Street, the big technology companies, and Nvidia, and that Nvidia is overvalued. "Every investment bank recommends buying Nvidia, like the blind leading the blind, and they have no idea what they are talking about."

Emanuel said that for Nvidia to maintain its current growth trajectory and profit margins, the road ahead is much more rugged than its valuation implies. There are five distinct lines of attack on Nvidia: architectural innovation, customer vertical integration, software abstraction, efficiency breakthroughs, and manufacturing democratization. The odds seem high that at least one of them will significantly affect Nvidia's profit margins or growth rate. Judging from the current valuation, the market has not priced in these risks.

According to some industry investors, this report turned Emanuel into an overnight celebrity on Wall Street. Many hedge funds paid him $1,000 per hour to hear his views on Nvidia and AI. He was so busy that his throat was parched, but the money was worth it.

The full text of the report follows, for reference.

As someone who spent about 10 years as an investment analyst at various long/short hedge funds (including stints at Millennium and Balyasny), and who is also a math and computer nerd who has been studying deep learning since 2010 (back when Geoff Hinton was still talking about Restricted Boltzmann Machines, everything was still programmed in MATLAB, and researchers were still trying to show that they could beat support vector machines at classification), I think I have a fairly unusual perspective on how AI technology is developing and how that relates to equity valuations in the stock market.

In the past few years, I have worked more as a developer and have several popular open-source projects for working with various forms of AI models and services (for example, see LLM Aided OCR, Swiss Army Llama, Fast Vector Similarity, Source to Prompt, and Pastel Inference Layer). Basically, I use these frontier models intensively every day. I have 3 Claude accounts so that I don't run out of requests, and I signed up for ChatGPT Pro within minutes of its launch.

I also try to keep up with the latest research and carefully read all the major technical reports released by the leading AI labs. So I think I have a good read on the field and how things are developing. At the same time, I have shorted a lot of stocks in my life and twice won the best idea prize at the Value Investors Club (for TMS long and PDH short, if you are keeping track).

I say this not to brag, but to establish that I can speak on this subject without coming across as naive to either technologists or professional investors. Of course, there are plenty of people who know the math and science better than I do, and plenty who are better at long/short investing in the stock market, but I doubt there are many who sit in the middle of the Venn diagram the way I do.

Nevertheless, whenever I meet with friends and former colleagues from the hedge fund world, the conversation quickly turns to Nvidia. It is not every day that a company goes from relative obscurity to being worth more than the entire stock market of Britain, France, or Germany! Naturally, these friends want to know what I think. Because I am such a firm believer in the long-term transformative impact of this technology (I truly believe it will radically change nearly every aspect of our economy and society within the next 5-10 years, in a way that is basically unprecedented), it has been hard for me to argue that Nvidia's momentum will slow or stop any time soon.

But even though I have thought for the past year or so that the valuation was too rich for my taste, a recent series of developments has pushed me back toward my usual instinct, which is to be more cautious about the outlook and to question the consensus when it seems more than priced in. The saying "what the wise man believes in the beginning, the fool believes in the end" became famous for a reason.

The bull case

Before we discuss the developments that give me pause, let's briefly review the bull case for NVDA shares, which basically everyone knows by now. Deep learning and AI are the most transformative technologies since the internet, and are poised to change basically everything in our society. And Nvidia has somehow ended up with something close to a monopoly on the share of the industry's total capital expenditure that goes to training and inference infrastructure.

Some of the largest and most profitable companies in the world, such as Microsoft, Apple, Amazon, Meta, Google, and Oracle, have decided that they must stay competitive in this space at all costs. Capital expenditures, electricity consumption, the square footage of new data centers, and of course the number of GPUs have all grown explosively, with no sign of slowing down. And Nvidia is able to earn astonishing gross margins of up to 90% on its high-end, data-center-oriented products.

We have only scratched the surface of the bull case. There are many more aspects to it now that would make even those who are already very optimistic more optimistic still. Besides the rise of humanoid robots (which I suspect will take most people by surprise when they are rapidly able to perform a huge number of tasks that currently require unskilled, or even skilled, human workers, such as doing laundry, cleaning, organizing, and cooking; doing construction work like renovating a bathroom or building a house as part of a team of workers; running a warehouse and driving forklifts; and so on), there are other factors that most people have not even considered.

A major topic that smart people talk about is the rise of "new scaling laws," which offer a new paradigm for thinking about how compute needs will grow over time. The original scaling law, which has driven progress in AI since AlexNet appeared in 2012 and the Transformer architecture was invented in 2017, is the pre-training scaling law: the more tokens we use as training data (now in the trillions), the larger the parameter count of the models we train, and the more compute (FLOPS) we expend training those models on those tokens, the better the resulting models perform on a wide variety of very useful downstream tasks.

Not only that, but this improvement is predictable to a certain extent, to the point where leading AI labs like OpenAI and Anthropic have a pretty good idea of how good their latest models will be before they even start the actual training runs. In some cases, they can predict the benchmark scores of the final model to within a few percentage points. This "original scaling law" has been vitally important, but it has always left some doubts in the minds of people who use it to project the future.
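To make the idea of a predictable scaling law concrete, here is a minimal sketch of a power-law loss curve of the kind used in published compute-optimal scaling studies. The functional form is an assumption for illustration, and every constant below is a hypothetical placeholder rather than a figure from this report or from any lab.

```python
# Illustrative only: a power-law loss curve in parameters and tokens.
# All constants are hypothetical placeholders chosen for readability.
def predicted_loss(n_params: float, n_tokens: float,
                   irreducible: float = 1.7,
                   a: float = 400.0, alpha: float = 0.34,
                   b: float = 400.0, beta: float = 0.28) -> float:
    """Irreducible loss plus two terms that shrink as scale grows."""
    return irreducible + a / n_params**alpha + b / n_tokens**beta


# A smaller and a larger hypothetical training run.
small_run = predicted_loss(n_params=1e9, n_tokens=2e10)
large_run = predicted_loss(n_params=7e10, n_tokens=1.4e12)

print(f"small run: {small_run:.3f}")
print(f"large run: {large_run:.3f}")
```

Once such a curve has been fitted on small runs, it can be evaluated at a larger scale before any training happens, which is the sense in which the final quality can be estimated in advance.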

First of all, we seem to have already exhausted the world's accumulated supply of high-quality training data. Of course, this is not literally true: there are still many old books and periodicals that have not been properly digitized, and even those that have been digitized are not properly licensed for use as training data. The problem is that even if you give credit for all of that (say, the sum total of "professionally" produced English-language written content from 1500 to 2000), it is not a huge amount in percentage terms when you are talking about a training corpus of nearly 15 trillion tokens, which is the scale of current frontier models.

For a quick reality check on those numbers: Google Books has digitized around 40 million books so far; if a typical book has 50,000 to 100,000 words, or 65,000 to 130,000 tokens, then books alone account for between 2.6T and 5.2T tokens. Of course, a large part of that is already included in the training corpora used by the big labs, whether or not that is strictly legal. There are also many academic papers, with more than 2 million on the arXiv website alone. The Library of Congress has over 3 billion digitized newspaper pages. Taken together, the total could be as high as 7T tokens, but since most of it is in fact already included in training corpora, the remaining "incremental" training data probably is not all that significant in the grand scheme of things.
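The book estimate above can be reproduced in a few lines. This sketch uses only the figures quoted in the text; the ratio of 1.3 tokens per word is the one implied by 65,000 tokens per 50,000 words, and the 15T corpus size is the frontier-model figure mentioned earlier.

```python
# Reproduce the back-of-the-envelope token count for digitized books.
BOOKS_DIGITIZED = 40_000_000
WORDS_PER_BOOK = (50_000, 100_000)   # low and high estimates from the text
TOKENS_PER_WORD = 1.3                # implied by 65,000 tokens / 50,000 words
FRONTIER_CORPUS_TOKENS = 15e12       # "nearly 15 trillion tokens"


def total_tokens(books: int, words_per_book: int, tokens_per_word: float) -> float:
    """Total tokens if every book had the given length."""
    return books * words_per_book * tokens_per_word


low = total_tokens(BOOKS_DIGITIZED, WORDS_PER_BOOK[0], TOKENS_PER_WORD)
high = total_tokens(BOOKS_DIGITIZED, WORDS_PER_BOOK[1], TOKENS_PER_WORD)

print(f"Books alone: {low / 1e12:.1f}T to {high / 1e12:.1f}T tokens")
print(f"Share of a 15T corpus: {low / FRONTIER_CORPUS_TOKENS:.0%} "
      f"to {high / FRONTIER_CORPUS_TOKENS:.0%}")
```

Even at the high end, every digitized book combined would cover only about a third of a single frontier training corpus, which is the point of the comparison.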

Of course, there are other ways to gather more training data. You could automatically transcribe every single YouTube video, for example, and use that text. While that might help at the margin, it is certainly of much lower quality than, say, a highly respected organic chemistry textbook as a source of useful knowledge about the world. So we have always faced a looming "data wall" when it comes to the original scaling law; although we know we can keep pouring more capital expenditure into GPUs and building more data centers, it is much harder to mass-produce useful new human knowledge that correctly complements what already exists. Now, one intriguing response to this has been the rise of "synthetic data," which is text that is itself the output of an LLM. While this seems a bit absurd, "improving model quality by feeding it its own output" does work very well in practice, at least in the domains of math, logic, and computer programming.

The reason, of course, is that these are areas where we can mechanically check and prove the correctness of things. So we can sample from the vast universe of possible math theorems or Python scripts, check whether they are actually correct, and include only the correct ones in our dataset. In this way, we can greatly expand our collection of high-quality training data, at least in these kinds of fields.
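A minimal sketch of that "sample, check, keep only what is correct" idea follows. The three candidate scripts are hand-written stand-ins for LLM samples, and the function name `square` and the test cases are hypothetical; a real pipeline would run untrusted code in a sandbox rather than calling `exec` directly.

```python
from typing import Dict, List, Tuple


def passes_checks(source: str, cases: List[Tuple[int, int]]) -> bool:
    """Run a candidate script and verify its `square` function on known cases."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)  # assumption: trusted toy input, no sandbox
        fn = namespace["square"]
        return all(fn(x) == expected for x, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Stand-ins for sampled model outputs: one correct, one wrong, one malformed.
candidates = [
    "def square(x):\n    return x * x\n",
    "def square(x):\n    return x + x\n",
    "def square(x)\n    return x * x\n",
]
cases = [(0, 0), (2, 4), (3, 9), (-4, 16)]

# Only mechanically verified samples make it into the training set.
verified = [src for src in candidates if passes_checks(src, cases)]
print(f"kept {len(verified)} of {len(candidates)} candidates")
```

The filter needs no human judgment, which is why this approach works in math and programming but not for open-ended prose.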

Besides text, we can also train AI on various other kinds of data. For example, what if we took whole-genome sequencing data for 100 million people (around 200 GB to 300 GB uncompressed for a single person) and trained an AI on it? That is obviously a lot of data, although the vast majority of it would be nearly identical between any two people. Of course, comparing this to text data from books and the internet can be misleading for various reasons:

Raw genome size cannot be compared directly with token counts

The information content of genome data is very different from that of text

The training value of highly redundant data is unclear

The computational requirements for processing genome data are also different

But it is still another huge source of information that we could train on in the future, which is why I included it.

So while there is some hope of capturing more and more additional training data, if you look at the rate at which training corpora have grown in recent years, it becomes clear that we will soon hit a bottleneck in the availability of "generally useful" knowledge. That is the kind of knowledge that can get us closer to the ultimate goal: artificial superintelligence that is 10 times smarter than John von Neumann and a world-class expert in every specialty known to man.

Besides the limited amount of data, proponents of the pre-training scaling law have always had a couple of other worries lurking in the back of their minds. One of them is: after you finish training the model, what are you supposed to do with all that compute infrastructure? Train the next model? Sure, you can do that, but given how quickly GPU speed and capacity improve, and how important electricity and other operating costs are to the economics of compute, does it really make sense to train your new model on a 2-year-old cluster? Surely you would rather use the brand-new data center you just built, which costs 10 times as much as the old one and, because of better technology, is 20 times as powerful. The problem is that at some point you do need to amortize the up-front cost of these investments and recoup it with a stream of (hopefully positive) operating profit, right?

The market is so excited about AI that it has ignored this, allowing companies like OpenAI to accumulate operating losses from the very beginning while garnering ever-higher valuations in follow-on investment rounds (although, to their credit, they have also demonstrated very fast-growing revenues). But eventually, for this situation to be sustainable over a full market cycle, the cost of these data centers needs to be recouped, ideally with a profit that over time is competitive with other investment opportunities on a risk-adjusted basis.

The new paradigm

OK, so that is the pre-training scaling law. What is this "new" scaling law? Well, it is something that people have only started paying attention to in the past year: inference-time compute scaling. Before this, the vast majority of the compute you expended went into the up-front pre-training needed to create the model. Once you had a trained model, running inference on it (that is, asking a question or having the LLM perform some task for you) used only a certain, limited amount of compute.

Critically, the total amount of inference compute (measured in various ways, such as FLOPS, GPU memory footprint, and so on) was far lower than what was needed for the pre-training phase. Of course, the amount of inference compute does go up when you increase the size of the model's context window and the amount of output generated in one go (although researchers have made breathtaking algorithmic improvements on this front relative to the quadratic scaling that was initially expected). But essentially, until recently, inference compute was generally far less intensive than training compute, and it scaled basically linearly with the number of requests being handled: the more demand for text completions from ChatGPT, for example, the more inference compute you used.

With the arrival of the revolutionary chain-of-thought (COT) models introduced in the past year, most notably OpenAI's flagship O1 model (but very recently also DeepSeek's new R1 model, which is discussed in detail later), all of that changed. Instead of the amount of inference compute being directly proportional to the length of the output text generated by the model (scaling up for larger context windows, model size, and so on), these new COT models also generate intermediate "logic tokens." Think of these as a sort of "scratchpad" or "internal monologue" that the model uses while it tries to solve your problem or complete its assigned task.

This represents a true sea change in how inference compute works: now, the more tokens you use for this internal thought process, the better the quality of the final output you can provide to the user. In effect, it is like giving a human worker more time and resources to accomplish a task, so that they can double- and triple-check their work, do the same basic task in several different ways and verify that the results come out the same, or take the result they came up with and "plug it in" to the formula to check that it really does solve the equation.

It turns out that this approach works almost amazingly well; it harnesses the long-anticipated power of "reinforcement learning" together with the power of the Transformer architecture. It directly addresses one of the biggest weaknesses of the Transformer model, which is its propensity to "hallucinate."

Basically, because of the way Transformers predict the next token at each step, if they start out down a bad "path" in their initial response, they become almost like a prevaricating child who tries to spin a yarn about why they are actually correct, even when they should have realized midway, using common sense, that what they are saying could not possibly be right.

Because the models are always trying to stay internally consistent and have each successive generated token flow naturally from the preceding tokens and context, it is very hard for them to course-correct and backtrack. By breaking the reasoning process into many intermediate stages, they can try lots of different approaches, see which ones work, and keep course-correcting and trying other approaches until they reach a fairly high level of confidence that they are not talking nonsense.

Perhaps the most extraordinary thing about this approach, beyond the fact that it works at all, is that the more logic/COT tokens you use, the better it works. Suddenly, you have an additional dial you can turn: as you increase the number of COT reasoning tokens (which requires much more inference compute, in terms of both FLOPS and memory), the higher the probability that you will get the correct answer, whether that means code that runs the first time without errors or a solution to a logic problem without an obviously wrong deductive step.

I can tell you from a lot of firsthand experience that, as good as Anthropic's Claude 3.5 Sonnet model is at Python programming (and it is indeed very good), whenever you need to generate any long and complicated code, it invariably makes one or more stupid mistakes. Now, these mistakes are usually pretty easy to fix. In fact, you can usually fix them by simply feeding the errors generated by the Python interpreter back in as a follow-up prompt, without any further explanation (or, more usefully, by pasting in the complete set of "problems" that a so-called linter finds in the code in your code editor). When the code gets really long or really complicated, it can sometimes take longer to fix, and might even require some manual debugging.
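The error-feedback workflow described above can be sketched as a small loop. This is an illustration under stated assumptions, not any vendor's API: `ask_model` is a hypothetical callable that takes a prompt and returns revised code, and `fake_model` is a stub standing in for a real LLM. A real setup would also sandbox the execution step.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_and_capture(source: str) -> Optional[str]:
    """Execute source; return None on success or the traceback text on failure."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})
        return None
    except Exception:
        return traceback.format_exc()


def repair_loop(source: str, ask_model: Callable[[str], str],
                max_rounds: int = 3) -> Tuple[str, int]:
    """Feed interpreter errors back to the model until the code runs."""
    for rounds_used in range(max_rounds + 1):
        error = run_and_capture(source)
        if error is None:
            return source, rounds_used
        if rounds_used < max_rounds:
            # The follow-up prompt is just the code plus the raw error text.
            source = ask_model(f"{source}\n\n{error}")
    raise RuntimeError("code still failing after max_rounds repair attempts")


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: always returns the corrected script."""
    return "total = sum([1, 2, 3])\n"


buggy = "total = sum([1, 2, 3]\n"  # missing closing parenthesis
fixed, rounds = repair_loop(buggy, fake_model)
print(f"repaired after {rounds} round(s)")
```

The loop passes the raw traceback along without any added explanation, which mirrors the point that the error text alone is usually enough of a prompt.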

When I first tried OpenAI's O1 model, it was like a revelation: I was amazed at how often the code was perfect the very first time. That is because the COT process automatically finds and fixes problems before they make it into the final answer the model gives you.

In fact, the O1 model used in OpenAI's ChatGPT Plus subscription ($20 per month) is basically the same as the model behind O1-Pro in the new ChatGPT Pro subscription (10 times the price, at $200 per month, which caused a stir in the developer community); the main difference is that O1-Pro thinks for much longer before responding, generating many more COT logic tokens and consuming far more inference compute for every response.

This is quite striking, because even a very long and complicated prompt with around 400 KB or more of context, given to Claude 3.5 Sonnet or GPT-4o, usually takes less than 10 seconds to begin responding, and often less than 5 seconds. The same prompt given to O1-Pro could take more than 5 minutes to produce a response (although OpenAI does show you some of the "reasoning steps" generated while you wait; critically, OpenAI has decided, for trade-secret reasons, to hide the exact reasoning tokens it generates and to show you a highly abbreviated summary instead).

As you can probably imagine, there are many contexts where accuracy is paramount: where you would rather give up and tell the user you cannot do it at all than give an answer that could easily be proven wrong, or one that involves hallucinated facts or otherwise specious reasoning. Anything involving money and transactions, medical care, or the law, to name just a few.

Basically, wherever the cost of inference is trivial relative to the fully loaded hourly compensation of the human knowledge worker who is interacting with the AI system, dialing up the COT compute becomes a complete no-brainer (the main drawback is that it greatly increases response latency, so there are still some contexts where you might prefer to iterate faster by accepting lower-latency responses that are less accurate or correct).

A few weeks ago, some exciting news came out in the AI world concerning OpenAI's as-yet-unreleased O3 model, which was able to solve a range of problems previously deemed out of reach of existing AI approaches in the near term. The way it was able to solve these hardest problems (including extremely difficult "foundational" math problems that even highly skilled professional mathematicians would struggle with) is that OpenAI threw enormous amounts of compute at them, in some cases spending more than $3,000 worth of compute to solve a single task (by contrast, the traditional inference cost for a single task using a regular Transformer model without chain-of-thought would be unlikely to exceed a few dollars).

It does not take an AI genius to realize that this development creates a new scaling law that is totally independent of the original pre-training scaling law. Now, you still want to train the best model you can by using as much compute and as many trillions of tokens of high-quality training data as possible, but that is just the beginning of the story in this new world; now, you could easily use incredibly large amounts of compute just to do inference from these models at a very high level of confidence, or when trying to solve extremely tough problems that require "genius-level" reasoning to avoid all the potential pitfalls that would lead a regular LLM astray.

But why should Nvidia capture all the upside?

Even if you believe that the future prospects for AI are almost unimaginably bright, the question remains: "Why should one company extract the majority of the profit from this technology?" There are certainly many cases in history where a very important new technology changed the world, but the main winners were not the companies that seemed most promising in the initial stages. The Wright brothers' airplane company invented and perfected the technology, yet in all its current incarnations across multiple firms it is worth less than $10 billion today. And while Ford has a respectable market cap of $40 billion today, that is just 1.1% of Nvidia's current market cap.

To understand this, it is important to really understand why Nvidia currently captures so much of the market. After all, they are not the only company that makes GPUs. AMD makes respectable GPUs that, on paper, have comparable numbers of transistors and are made on comparable process nodes. Sure, AMD's GPUs are not as fast as Nvidia's, but it is not as if Nvidia's GPUs are 10 times faster or anything like that. In fact, in terms of raw cost per FLOP, AMD GPUs are only about half the price of Nvidia GPUs.

Consider other semiconductor markets, such as DRAM: although that market is highly consolidated, with only three meaningful global players (Samsung, Micron, and SK Hynix), gross margins in DRAM range from negative at the bottom of the cycle to around 60% at the top, with an average of around 20%. By comparison, Nvidia's overall gross margin in recent quarters has been around 75%, a figure that is dragged down by the lower-margin, more commoditized consumer 3D graphics category.

So how is this possible? Well, the main reasons have to do with software: well-tested, highly reliable drivers that "just work" on Linux (unlike AMD, which is notorious for the low quality and instability of its Linux drivers), and highly optimized open-source code, such as PyTorch, that has been tuned to run really well on Nvidia GPUs.

It goes beyond that, though: CUDA, the programming framework that coders use to write low-level code optimized for GPUs, is fully proprietary to Nvidia and has become a de facto standard. If you want to hire a group of extremely talented programmers who know how to make things run fast on GPUs, and you are willing to pay them $650,000 a year or whatever the going rate is for people with that particular expertise, chances are that they will "think" and work in CUDA.

Besides its software advantage, the other major thing Nvidia has going for it is what is known as interconnect: essentially, the bandwidth that links thousands of GPUs together efficiently so that they can be jointly harnessed to train today's leading-edge foundation models. In short, the key to efficient training is to keep all the GPUs as fully utilized as possible at all times, rather than sitting idle while they wait for the next chunk of data needed for the next training step.

The bandwidth requirements are extremely high, far higher than the typical bandwidth needed in traditional data center applications. This kind of interconnect cannot use traditional networking gear or fiber optics, because those would introduce too much latency and could not deliver the terabytes per second of bandwidth needed to keep all the GPUs constantly busy.
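A toy model shows why bandwidth dominates here. It assumes, as a simplification, that each training step is a fixed amount of compute followed by a data exchange that cannot overlap with it; the step time, payload size, and both bandwidth figures are hypothetical values chosen only to show the shape of the effect.

```python
def gpu_utilization(compute_s: float, bytes_exchanged: float,
                    bandwidth_bytes_per_s: float) -> float:
    """Fraction of each step spent computing, assuming no compute/communication overlap."""
    comm_s = bytes_exchanged / bandwidth_bytes_per_s
    return compute_s / (compute_s + comm_s)


# Hypothetical step: 0.1 s of compute, then 10 GB exchanged between GPUs.
COMPUTE_S = 0.1
PAYLOAD_BYTES = 10e9

slow_link = gpu_utilization(COMPUTE_S, PAYLOAD_BYTES, 10e9)   # 10 GB/s link
fast_link = gpu_utilization(COMPUTE_S, PAYLOAD_BYTES, 1e12)   # 1 TB/s link

print(f"10 GB/s link: {slow_link:.0%} busy")
print(f"1 TB/s link:  {fast_link:.0%} busy")
```

Under these made-up numbers, the same GPU spends most of its time idle on the slow link and most of its time computing on the fast one, which is why interconnect is treated as part of the product rather than an accessory.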

Nvidia made a very smart decision when it acquired the Israeli company Mellanox for $6.9 billion in 2019, and it is this acquisition that gave them their industry-leading interconnect technology. Note that interconnect speed matters much more for the training process (where you have to harness the output of thousands of GPUs at the same time) than for the inference process (including COT inference), where all you need is enough VRAM to store the (compressed) weights of the already-trained model.

These are arguably the major components of Nvidia's "moat" and the reason it has been able to sustain such high margins for so long (there is also a "flywheel effect," in which they aggressively reinvest their outsized profits, which helps them improve their technology faster than the competition, so that they are always in the lead on raw performance).

But as pointed out earlier, what customers really tend to care about, all other things being equal, is performance per dollar (counting both the up-front capital cost of the equipment and its energy usage, that is, performance per watt). And although Nvidia's GPUs are certainly the fastest, they are not the most cost-effective when measured naively in terms of FLOPS.

But the thing is, all other things are not equal. AMD's drivers are bad, popular AI software libraries do not run as well on AMD GPUs, you cannot find really good GPU experts who specialize in AMD GPUs outside the gaming world (why would they bother, when there is more demand in the market for CUDA experts?), and you cannot wire thousands of AMD GPUs together as effectively because of AMD's inferior interconnect technology. All of this means that AMD is basically not competitive in the high-end data center world, and does not seem to have very good prospects of getting there in the near term.

Well, that all sounds very bullish for Nvidia, right? Now you can see why the stock trades at such a rich valuation! But what are the other clouds on the horizon? Well, there are a few that I think merit significant attention. Some have been lurking in the background for the last few years, but had little impact given how quickly the pie was growing; they are now getting ready to potentially inflect upward. Others are very recent developments (as in, the last two weeks) that might dramatically change the near-term trajectory of incremental GPU demand.

The major threats

At a very high level, you can think of it like this: Nvidia operated in a fairly niche area for a long time; they had very limited competition, and those competitors were not particularly profitable or growing fast enough to pose a real threat, since they did not have the capital to apply real pressure to a market leader like Nvidia. The gaming market was large and growing, but it did not offer astounding margins or particularly spectacular year-over-year growth rates.

Around 2016-2017, a few big technology companies started ramping up hiring and spending on machine learning and AI, but in aggregate it was never a truly significant item for any of them; it was more of a "moonshot" R&D expenditure. After the release of ChatGPT in 2022, however, the AI race began in earnest. That was only about two years ago, although it feels like a lifetime ago.

Suddenly, big companies were ready to spend many billions of dollars at an incredible pace. The number of researchers attending the big research conferences such as NeurIPS and ICML surged. Smart students who previously might have studied financial derivatives were studying Transformers instead, and compensation packages of $1 million or more for non-executive engineering roles (that is, independent contributors who do not manage a team) became the norm at the leading AI labs.

It takes a while to change the direction of a massive cruise ship; even if you move very fast and spend billions, it takes a year or more to build brand-new data centers, order all the equipment (with ballooning lead times), and get everything set up and debugged. Even the smartest programmers need a long time before they really hit their stride and become familiar with the existing codebases and infrastructure.

But as you can imagine, the amount of money, manpower, and effort being expended in this area is absolutely astronomical. And Nvidia has the biggest target on its back of any player, because they are the ones making the lion's share of the profits today, not in some hypothetical future where AI runs our whole lives.

So the most important high-level takeaway is that "markets find a way": they find alternative, radically innovative new approaches to building hardware that use completely new ideas to sidestep the barriers that prop up Nvidia's moat.

Hardware-level threats

For example, Cerebras's so-called "wafer-scale" AI training chips dedicate an entire 300mm silicon wafer to one absolutely gargantuan chip that contains far more transistors and cores on a single die (see their recent blog post to understand how they solved the yield problems that had kept this approach from being economically practical in the past).

To put this in perspective, if you compare Cerebras's newest WSE-3 chip with Nvidia's flagship data center GPU, the H100, the Cerebras chip has a total die area of 46,225 square millimeters versus just 814 for the H100 (and the H100 is itself considered an enormous chip by industry standards); that is a multiple of about 57x! And instead of the 132 "streaming multiprocessor" cores enabled on the H100, the Cerebras chip has about 900,000 cores (granted, each of these cores is smaller and does less, but it is still a very large number by comparison). More concretely, in AI terms, the Cerebras chip delivers about 32 times the FLOPS of a single H100. Since an H100 sells for close to $40,000, you can imagine that the WSE-3 is not cheap.
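The ratios in this comparison are easy to verify, and working them through gives two derived figures that are not in the text: FLOPS per unit of die area, and the cost of buying the same FLOPS as H100s. This sketch uses only the numbers quoted above; the last figure is an H100-equivalent cost under those numbers, not a price for the WSE-3.

```python
# Figures quoted in the text.
WSE3_AREA_MM2 = 46_225
H100_AREA_MM2 = 814
WSE3_CORES = 900_000
H100_STREAMING_MULTIPROCESSORS = 132
FLOPS_MULTIPLE = 32          # WSE-3 FLOPS relative to one H100
H100_PRICE_USD = 40_000      # "close to $40,000"

area_ratio = WSE3_AREA_MM2 / H100_AREA_MM2
core_ratio = WSE3_CORES / H100_STREAMING_MULTIPROCESSORS

# Derived: how much compute each square millimeter delivers, relative to the H100.
flops_per_area_vs_h100 = FLOPS_MULTIPLE / area_ratio

# Derived: what 32 H100s would cost at the quoted price.
h100_equivalent_cost = FLOPS_MULTIPLE * H100_PRICE_USD

print(f"Die area ratio:        {area_ratio:.1f}x")
print(f"Core count ratio:      {core_ratio:,.0f}x")
print(f"FLOPS per mm^2 vs H100: {flops_per_area_vs_h100:.2f}x")
print(f"Cost of 32 H100s:      ${h100_equivalent_cost:,}")
```

On these figures the wafer-scale chip delivers less compute per unit of silicon than the H100, which is consistent with the argument that its appeal lies in keeping everything on one chip rather than in raw density.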

So why does this matter? Rather than battling Nvidia head-on with a similar approach and trying to match Mellanox's interconnect technology, Cerebras has taken a radically different route that does an end run around the interconnect problem: when everything runs on the same super-sized chip, bandwidth between processors becomes much less of an issue. You don't even need the same level of interconnect, because one giant chip replaces a great many H100s.

The Cerebras chips also perform extremely well on AI inference tasks. In fact, you can try one for free today with Meta's very respectable Llama-3.3-70B model. It responds essentially instantly, at about 1,500 tokens per second. To put that in perspective, judging by ChatGPT and Claude, anything above 30 tokens per second feels relatively snappy to users, and even 10 tokens per second is fast enough that you can read the response as it is being generated.

Cerebras is not alone; there are other companies as well, such as Groq (not to be confused with the Grok model series trained by Elon Musk's xAI). Groq has taken yet another innovative approach to the same basic problem. Instead of trying to compete directly with Nvidia's CUDA software stack, they developed what they call a "tensor processing unit" (TPU), built specifically for the exact mathematical operations that deep learning models require. Their chips are designed around the concept of "deterministic compute," which means that, unlike traditional GPUs, they execute operations in a fully predictable way every single time.

This may sound like a minor technical detail, but it has a huge impact on both chip design and software development. Because timing is completely deterministic, Groq can optimize its chips in ways that are impossible with traditional GPU architectures. As a result, for the past six months they have been demonstrating inference speeds of more than 500 tokens per second on the Llama series and other open source models, far beyond what traditional GPU setups can achieve. As with Cerebras, this product is available today and you can try it for free.

Using a Llama3 model with "speculative decoding," Groq can generate 1,320 tokens per second, on par with Cerebras and far beyond what regular GPUs can do. Now, you might ask what the point is of exceeding 1,000 tokens per second when users seem quite satisfied with the speed of ChatGPT, which runs at well under 1,000 tokens per second. It does in fact matter. When you get instant feedback, you can iterate much faster and, as a human knowledge worker, you don't lose focus. And if you use a model programmatically through an API, that speed can enable whole new categories of applications: ones that require multi-stage inference (where the output of earlier stages is used as input to the prompts and inference of later stages), or ones that need low-latency responses, such as content moderation, fraud detection, and dynamic pricing.
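To make the multi-stage point concrete, here is a minimal Python sketch of how generation speed compounds across chained stages. The two rates (30 and 1,320 tokens per second) come from the text; the four-stage pipeline and the 500 output tokens per stage are hypothetical values chosen only for illustration, and the sketch ignores prompt processing time and network latency.

```python
def pipeline_seconds(stage_output_tokens, tokens_per_second):
    """Total generation time when each stage must wait for the previous one."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return sum(stage_output_tokens) / tokens_per_second

# Hypothetical chain: four stages, 500 output tokens each.
STAGES = [500, 500, 500, 500]

for label, rate in [("30 tokens/s", 30), ("1,320 tokens/s", 1320)]:
    print(f"{label}: {pipeline_seconds(STAGES, rate):.1f} s end to end")
```

Under these assumptions the chain takes over a minute at the slower rate and under two seconds at the faster one, which is the difference between a batch job and an interactive feature.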

But even more fundamentally, the faster you can serve requests, the faster you can cycle through them and the busier you can keep the hardware. Groq's hardware is extremely expensive, at $2 million to $3 million for a single server, but if demand is high enough to keep the hardware busy all the time, the cost per request served ends up far lower.
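The utilization argument can be sketched as simple amortization. In the Python example below, the server price is the midpoint of the range quoted in the text; the three-year lifetime and the peak throughput of 50 requests per second are assumptions made up for illustration, and power, hosting, and staffing costs are ignored.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def hardware_cost_per_request(server_cost_usd, lifetime_years,
                              peak_requests_per_second, utilization):
    """Server purchase price spread over every request served in its lifetime."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    served = (peak_requests_per_second * utilization
              * lifetime_years * SECONDS_PER_YEAR)
    return server_cost_usd / served

SERVER_COST_USD = 2_500_000  # midpoint of the $2M-$3M range quoted in the text
LIFETIME_YEARS = 3           # assumption: depreciation period
PEAK_RPS = 50                # assumption: requests per second at full load

for utilization in (0.1, 0.5, 0.9):
    cost = hardware_cost_per_request(
        SERVER_COST_USD, LIFETIME_YEARS, PEAK_RPS, utilization)
    print(f"utilization {utilization:.0%}: ${cost:.5f} per request")
```

Whatever the real throughput is, the cost per request scales inversely with utilization: a server that is busy 90% of the time is nine times cheaper per request than one that is busy 10% of the time.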

Just as with Nvidia and CUDA, a large part of Groq's advantage comes from its proprietary software stack. They can take the same open source models that companies such as Meta, DeepSeek, and Mistral develop and release for free, and decompose them in special ways that make them run dramatically faster on Groq's specific hardware.

Like Cerebras, they have made different technical decisions to optimize particular aspects of the process, which lets them do things in a fundamentally different way. In Groq's case, that is because they focus entirely on inference-level compute, not on training: all of their special hardware and software deliver these huge speed and efficiency advantages only when running inference on a model that has already been trained.












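As if these looming hardware threats weren't bad enough, there have also been developments in software over the past few years that, while slow to start, are now gaining real momentum and could pose a serious threat to the dominance of Nvidia's CUDA software. The first of these concerns the terrible Linux drivers for AMD GPUs. Remember how we discussed that AMD has, for years, unwisely allowed these drivers to be so bad while leaving huge amounts of money on the table?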

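Interestingly, the infamous hacker George Hotz (famous for jailbreaking the original iPhone as a teenager, and currently CEO of the self-driving startup Comma.ai and of the AI computer company Tiny Corp, which also develops the open source tinygrad AI software framework) recently announced that he was sick of dealing with AMD's bad drivers and badly wanted to be able to use lower-cost AMD GPUs in his TinyBox AI computers (which come in several models, some using Nvidia GPUs and others using AMD GPUs).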

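In fact, he has built his own custom drivers and software stack for AMD GPUs without any help from AMD. On January 15, 2025, he tweeted through his company's X account: "We are one piece away from a completely sovereign stack on AMD, the RDNA3 assembler. We have our own driver, runtime, libraries, and emulator. (all in ~12,000 lines!)" Given his track record and skills, they are likely to have all of this working within the next few months, which would open up many exciting possibilities for using AMD GPUs in all sorts of applications where companies currently feel compelled to pay up for Nvidia GPUs.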



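The best-known examples of these frameworks are MLX (sponsored primarily by Apple), Triton (sponsored primarily by OpenAI), and JAX (developed by Google). MLX is particularly interesting because it provides a PyTorch-like API that runs efficiently on Apple Silicon, showing how these abstraction layers can let AI workloads run on completely different architectures. Triton, meanwhile, is becoming increasingly popular because it lets developers write high-performance code that can be compiled to run on a variety of hardware targets without having to understand the low-level details of each platform.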





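However, another area where things could change dramatically is that CUDA itself might end up becoming more of a high-level abstraction: a "specification language," similar to Verilog (the industry standard for describing chip layouts), that skilled developers use to describe high-level algorithms involving massive parallelism (since they already know it, it is well constructed, it is the lingua franca, and so on). But instead of compiling that code for Nvidia GPUs as you normally would, you feed it as source code into an LLM, which ports it into whatever low-level code is understood by the new Cerebras chip, the new Amazon Trainium2, the new Google TPUv6, and so on. This is not as far off as you might think; it is probably already within reach using OpenAI's latest O3 model, and it will surely be generally possible within a year or two.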


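Perhaps the most shocking development is one that happened in the last couple of weeks. The news has totally rocked the AI world, and although the mainstream media has not said a word about it, it has dominated the discussion among knowledgeable people on Twitter: a Chinese startup called DeepSeek released two new models whose performance is essentially on par with the best models from OpenAI and Anthropic (surpassing the Meta Llama3 models and other smaller open source models such as Mistral). The models are called DeepSeek-V3 (basically their answer to GPT-4o and Claude3.5 Sonnet) and DeepSeek-R1 (basically their answer to OpenAI's O1 model).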







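How is this possible? How could this small Chinese company completely upstage all the smartest minds at our leading AI labs, which have more than 100 times the resources, headcount, payroll, capital, GPUs, and so on? Wasn't China supposed to be crippled by Biden's restrictions on GPU exports? Well, the details are fairly technical, but we can at least describe them at a high level. It may simply have turned out that DeepSeek's relative poverty in GPU processing power was the critical ingredient that made them more creative and clever; necessity is the mother of invention, after all.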










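Meta's latest Llama3 models come in several sizes, for example a 1 billion parameter version (the smallest), a 70B parameter model (the most commonly used), and even a massive 405B parameter model. For most users this largest model is of limited practical use, because your computer would need tens of thousands of dollars' worth of GPUs to run inference at a tolerable speed, at least if you deploy the naive full-precision version. As a result, most of the real-world usage of and excitement around these open source models is at the 8B parameter or highly quantized 70B parameter level, since that is what fits in a consumer-grade Nvidia 4090 GPU, which you can now buy for under $1,000.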





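The real advantage of this approach is that it lets the model contain a huge amount of knowledge without being very unwieldy, because even though the total number of parameters across all the experts is high, only a small subset of them is "active" at any given time, which means you only need to keep that small subset of the weights in VRAM to do inference. Take DeepSeek-V3: it is an absolutely massive MOE model with 671B parameters, far larger than even the largest Llama3 model, but only 37B of those parameters are active at any given time, enough to fit in the VRAM of two consumer-grade Nvidia 4090 GPUs (under $2,000 total cost) rather than requiring one or more H100 GPUs at roughly $40,000 each.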
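The memory claim can be checked with back-of-the-envelope arithmetic. The Python sketch below uses the 671B total and 37B active parameter counts from the text and the 24 GB of VRAM on each Nvidia 4090; the precisions listed are assumptions, since the text does not say which one it has in mind. It counts weight bytes only and ignores the KV cache, activations, and the cost of swapping experts in and out as the active set changes from token to token.

```python
GB = 1e9  # decimal gigabytes, for a rough estimate

# Assumed storage precisions; the text does not specify one.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_billion, precision):
    """Memory needed for the weights alone, in GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / GB

TOTAL_PARAMS_B = 671   # DeepSeek-V3 total parameters (from the text)
ACTIVE_PARAMS_B = 37   # parameters active at any given time (from the text)
VRAM_GB = 2 * 24       # two Nvidia 4090 cards at 24 GB each

for precision in BYTES_PER_PARAM:
    total = weight_gb(TOTAL_PARAMS_B, precision)
    active = weight_gb(ACTIVE_PARAMS_B, precision)
    verdict = "fits" if active <= VRAM_GB else "does not fit"
    print(f"{precision}: all experts {total:,.1f} GB, "
          f"active subset {active:.1f} GB ({verdict} in {VRAM_GB} GB)")
```

Under these assumptions the active subset fits in 48 GB only at 8-bit precision or below, and the full set of experts does not fit at any of the three precisions, so the inactive experts would have to be held in system memory or on disk.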




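But unlike the case of Nvidia, where the cost difference results from monopoly gross margins of more than 90% on data center products, the cost difference between the DeepSeek API and the OpenAI and Anthropic APIs may simply be that DeepSeek is nearly 50 times more compute efficient (and possibly far more than that on the inference side; the roughly 45x efficiency gain is on the training side). Indeed, it is not even clear that OpenAI and Anthropic earn great margins on their API services; they may be more focused on revenue growth and on gathering more data by analyzing all the API requests they receive.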








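The results are incredible: on AIME 2024 (one of the most challenging high school math competitions), R1 reached 79.8% accuracy, on par with OpenAI's O1 model. On MATH-500 it hit 97.3%, and on Codeforces programming competitions it reached the 96.3rd percentile. But perhaps most impressively, they managed to distill these capabilities into much smaller models: their 14B parameter version outperforms many models several times its size, which suggests that reasoning ability is not just a matter of raw parameter count but of how you train the model to process information.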








