LLaDA: The Diffusion Model That Could Redefine Language Generation


Introduction

What if we could make language models think more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them?

This is exactly what Large Language Diffusion Models (LLaDA) introduce: a different approach to the text generation used in current Large Language Models (LLMs). Unlike traditional autoregressive models (ARMs), which predict text sequentially, left to right, LLaDA leverages a diffusion-like process to generate text. Instead of generating tokens one after another, it progressively refines masked text until it forms a coherent response.

In this article, we will dive into how LLaDA works, why it matters, and how it could shape the next generation of LLMs.

I hope you enjoy the article!

The current state of LLMs

To appreciate the innovation that LLaDA represents, we first need to understand how current large language models (LLMs) operate. Modern LLMs follow a two-step training process that has become an industry standard:

  1. Pre-training: The model learns general language patterns and knowledge by predicting the next token in massive text datasets through self-supervised learning.
  2. Supervised Fine-Tuning (SFT): The model is refined on carefully curated data to improve its ability to follow instructions and generate useful outputs.

Note that current LLMs often use RLHF as well to further refine the weights of the model, but this is not used by LLaDA so we will skip this step here.

These models, primarily based on the Transformer architecture, generate text one token at a time using next-token prediction.

Simplified Transformer architecture for text generation (Image by the author)

Here is a simplified illustration of how data passes through such a model. Each token is embedded into a vector and is transformed through successive transformer layers. In current LLMs (LLaMA, ChatGPT, DeepSeek, etc), a classification head is used only on the last token embedding to predict the next token in the sequence.
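To make this concrete, here is a minimal PyTorch-style sketch (my own simplification, ignoring sampling, batching, and KV caching) of the greedy next-token decoding loop this corresponds to:

```python
import torch

def greedy_decode(model, token_ids, n_new_tokens):
    # token_ids: (seq_len,) prompt tokens; generate one token at a time, left to right.
    for _ in range(n_new_tokens):
        logits = model(token_ids.unsqueeze(0))       # (1, seq_len, vocab)
        next_token = logits[0, -1].argmax()          # classification head on the LAST position only
        token_ids = torch.cat([token_ids, next_token.unsqueeze(0)])
    return token_ids
```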

This works thanks to masked (causal) self-attention: each token can only attend to the tokens that come before it. We will see later how LLaDA gets rid of this mask in its attention layers.

Attention process: input embeddings are multiplied by Query, Key, and Value matrices to generate new embeddings (Image by the author, inspired by [3])
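To make the contrast concrete, here is a minimal PyTorch-style sketch (my own illustration, not code from the paper) of single-head attention scores with and without the causal mask; LLaDA runs in the `causal=False` regime:

```python
import torch

def attention_scores(q, k, causal: bool):
    # q, k: (seq_len, d) tensors for a single attention head
    scores = q @ k.T / k.shape[-1] ** 0.5           # raw attention logits
    if causal:
        seq_len = q.shape[0]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))  # ARM: token i ignores tokens after i
    return torch.softmax(scores, dim=-1)            # causal=False: every token sees every other

q = k = torch.randn(5, 16)
print(attention_scores(q, k, causal=True)[0])   # position 0 attends only to itself
print(attention_scores(q, k, causal=False)[0])  # position 0 attends to all 5 positions
```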

If you want to learn more about Transformers, check out my article here.

While this approach has led to impressive results, it also comes with significant limitations, some of which have motivated the development of LLaDA.

Current limitations of LLMs

Current LLMs face several critical challenges:

Computational Inefficiency

Imagine having to write a novel where you can only think about one word at a time, and for each word, you need to reread everything you’ve written so far. This is essentially how current LLMs operate: they predict one token at a time, and each new token requires attending over the entire sequence generated so far. Even with optimization techniques like KV caching, this process remains computationally expensive and time-consuming.

Limited Bidirectional Reasoning

Traditional autoregressive models (ARMs) are like writers who can never look ahead or revise what they have written so far. They can only predict future tokens based on past ones, which limits their ability to reason about relationships between different parts of the text. As humans, we often have a general idea of what we want to say before writing it down; current LLMs lack this capability in some sense.

Amount of data

Existing models require enormous amounts of training data to achieve good performance, making them resource-intensive to develop and potentially limiting their applicability in specialized domains with limited data availability.

What is LLaDA

LLaDA introduces a fundamentally different approach to language generation by replacing traditional autoregression with a “diffusion-based” process (we will dive later into why this is called “diffusion”).

Let’s understand how this works, step by step, starting with pre-training.

LLaDA pre-training

Remember that we don’t need any “labeled” data during the pre-training phase. The objective is to feed a very large amount of raw text data into the model. For each text sequence, we do the following:

  1. We fix a maximum length (similar to ARMs). Typically, this could be 4096 tokens. 1% of the time, the lengths of sequences are randomly sampled between 1 and 4096 and padded so that the model is also exposed to shorter sequences.
  2. We randomly choose a “masking rate”. For example, one could pick 40%.
  3. We mask each token with a probability of 0.4. What does “masking” mean exactly? Well, we simply replace the token with a special mask token. As with any other token, this mask token is associated with a particular index and embedding vector that the model can process and interpret during training.
  4. We then feed our entire sequence into our transformer-based model. This process transforms all the input embedding vectors into new embeddings. We apply the classification head to each of the masked tokens to get a prediction for each. Mathematically, our loss function averages cross-entropy losses over all the masked tokens in the sequence, as below:
Loss function used for LLaDA (Image by the author)

5. And… we repeat this procedure for billions or trillions of text sequences.
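To make the procedure above concrete, here is a minimal PyTorch-style sketch of one pre-training step. The `model`, the `MASK_ID` constant, and the exact loss reduction are my own placeholders and simplifications, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical index of the special mask token

def pretraining_step(model, token_ids):
    # token_ids: (batch, seq_len) tensor of raw text tokens
    t = torch.rand(token_ids.shape[0], 1, device=token_ids.device)    # per-sequence masking rate
    is_masked = torch.rand(token_ids.shape, device=token_ids.device) < t
    corrupted = torch.where(is_masked, torch.full_like(token_ids, MASK_ID), token_ids)

    logits = model(corrupted)                        # (batch, seq_len, vocab), no causal mask
    loss = F.cross_entropy(logits[is_masked], token_ids[is_masked])
    # Cross-entropy is averaged over the masked positions only; [1] additionally
    # reweights sequences (roughly by 1/t), which is omitted here for simplicity.
    return loss
```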

Note that, unlike ARMs, LLaDA can fully utilize bidirectional dependencies in the text: it no longer requires any masking in its attention layers. However, this can come at an increased computational cost.

Hopefully, you can see how the training phase itself (the flow of data through the model) is very similar to that of any other LLM. We simply predict randomly masked tokens instead of predicting what comes next.

LLaDA SFT

For auto-regressive models, SFT is very similar to pre-training, except that we have pairs of (prompt, response) and want to generate the response when giving the prompt as input.

This is exactly the same concept for LLaDA! Mimicking the pre-training process, we simply pass the prompt and the response, mask random tokens from the response only, and feed the full sequence into the model, which predicts the missing tokens of the response.
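A minimal sketch of how this SFT corruption could differ from the pre-training one (again with hypothetical names; `MASK_ID` is the same placeholder as before): only positions belonging to the response are eligible for masking.

```python
import torch

MASK_ID = 0  # same hypothetical mask-token index as in the pre-training sketch

def sft_corrupt(token_ids, prompt_len):
    # token_ids: (seq_len,) concatenation of prompt tokens and response tokens
    t = torch.rand(())                                        # masking rate for this example
    is_response = torch.arange(token_ids.shape[0]) >= prompt_len
    is_masked = (torch.rand(token_ids.shape[0]) < t) & is_response
    corrupted = torch.where(is_masked, torch.full_like(token_ids, MASK_ID), token_ids)
    return corrupted, is_masked          # the loss is computed on the is_masked positions only
```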

The innovation in inference

Inference is where LLaDA gets more interesting, and truly utilizes the “diffusion” paradigm.

Until now, we always randomly masked some text as input and asked the model to predict these tokens. But during inference, we only have access to the prompt and we need to generate the entire response. You might think (and it’s not wrong) that the model has seen examples where the masking rate was very high (potentially 1) during SFT, and that it somehow had to learn how to generate a full response from a prompt.

However, generating the full response at once during inference will likely produce very poor results because the model lacks information. Instead, we need a method to progressively refine predictions, and that’s where the key idea of ‘remasking’ comes in.

Here is how it works, at each step of the text generation process:

  • Feed the current input to the model (this is the prompt, followed by mask tokens for the whole response).
  • The model generates one embedding for each input token. We get predictions for the masked tokens only. And here is the important step: we remask a portion of them. In particular, we keep only the “best” tokens, i.e. the ones predicted with the highest confidence.
  • We can use this partially unmasked sequence as input in the next generation step and repeat until all tokens are unmasked.
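Here is a minimal sketch of that loop in PyTorch-style code, keeping the k most confident predictions at each step (the names and the fixed k are my own choices; the actual sampler in [1] is more sophisticated):

```python
import torch

def generate(model, prompt_ids, response_len, mask_id, k=2):
    # Start from the prompt followed by a fully masked response.
    x = torch.cat([prompt_ids, torch.full((response_len,), mask_id, dtype=prompt_ids.dtype)])
    while (x == mask_id).any():
        masked = x == mask_id
        logits = model(x.unsqueeze(0)).squeeze(0)        # (seq_len, vocab), bidirectional attention
        conf, pred = logits.softmax(-1).max(-1)          # per-position confidence and best token
        conf = conf.masked_fill(~masked, float("-inf"))  # only still-masked positions compete
        keep = conf.topk(min(k, int(masked.sum()))).indices
        x[keep] = pred[keep]                             # commit the most confident predictions
    return x[len(prompt_ids):]                           # the generated response
```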

You can see that, interestingly, we have much more control over the generation process compared to ARMs: we could choose to remask 0 tokens (only one generation step), or we could decide to keep only the best token every time (as many steps as tokens in the response). Obviously, there is a trade-off here between the quality of the predictions and inference time.

Let’s illustrate that with a simple example (in this case, I chose to keep the 2 best tokens at every step).

LLaDA generation process example (Image by the author)

Note that, in practice, the remasking step works as follows. Instead of remasking a fixed number of tokens, we remask a proportion s/t of them over time, as t goes from 1 down to 0, with s in [0, t]. In particular, this means we remask fewer and fewer tokens as generation progresses.

Example: if we want N sampling steps (so N discrete steps from t=1 down to t=1/N with steps of 1/N), taking s = (t-1/N) is a good choice, and ensures that s=0 at the end of the process.
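As a quick sanity check of that schedule (with toy numbers of my own), here is how many of 8 response tokens get remasked at each of N = 4 steps when s = t - 1/N:

```python
N, response_len = 4, 8
for step in range(1, N + 1):
    t = 1 - (step - 1) / N              # current noise level: 1, 0.75, 0.5, 0.25
    s = t - 1 / N                       # next noise level, reaches 0 at the final step
    print(f"step {step}: remask {round(response_len * s)} of {response_len} response tokens")
# step 1: remask 6 of 8 response tokens
# step 2: remask 4 of 8 response tokens
# step 3: remask 2 of 8 response tokens
# step 4: remask 0 of 8 response tokens
```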

The image below summarizes the three steps described above. “Mask predictor” simply denotes the LLM (LLaDA) predicting masked tokens.

Pre-training (a.), SFT (b.) and inference (c.) using LLaDA. (source: [1])

Can autoregression and diffusion be combined?

Another clever idea developed in LLaDA is to combine diffusion with traditional autoregressive generation to use the best of both worlds! This is called semi-autoregressive diffusion.

  • Divide the generation process into blocks (for instance, 32 tokens in each block).
  • The objective is to generate one block at a time (like we would generate one token at a time in ARMs).
  • For each block, we apply the diffusion logic by progressively unmasking tokens to reveal the entire block. Then move on to predicting the next block.
Semi-autoregressive process (source: [1])
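A sketch of how this block-wise variant could wrap the generation loop from the inference section (the 32-token block size is just the example above; `generate` is the hypothetical function sketched earlier):

```python
import torch

def semi_autoregressive_generate(model, prompt_ids, response_len, mask_id, block_size=32):
    x = prompt_ids
    for start in range(0, response_len, block_size):
        length = min(block_size, response_len - start)
        # Diffusion-style unmasking within the current block only, conditioned on the
        # prompt and all previously generated blocks (reuses `generate` from above).
        block = generate(model, x, length, mask_id)
        x = torch.cat([x, block])
    return x[len(prompt_ids):]
```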

This is a hybrid approach: we probably lose some of the “backward” generation and parallelization capabilities of the model, but we better “guide” the model towards the final output.

I think this is a very interesting idea because it introduces a hyperparameter (the number of blocks) that can be tuned. I imagine different tasks might benefit more from the backward generation process, while others might benefit more from the more “guided” left-to-right generation (more on that below).

Why “Diffusion”?

I think it’s important to briefly explain where this term actually comes from. It reflects a similarity with image diffusion models (like DALL·E 2 or Stable Diffusion), which have been very popular for image generation tasks.

In image diffusion, a model first adds noise to an image until it’s unrecognizable, then learns to reconstruct it step by step. LLaDA applies this idea to text by masking tokens instead of adding noise, and then progressively unmasking them to generate coherent language. In the image setting, the forward corruption process follows a “noise schedule” and the reverse process is the “denoising” step; in LLaDA, masking plays the role of adding noise, and progressive unmasking plays the role of denoising.

How do Diffusion Models work? (source: [2])

You can also see LLaDA as some type of discrete (non-continuous) diffusion model: we don’t add noise to tokens, but we “deactivate” some tokens by masking them, and the model learns how to unmask a portion of them.

Results

Let’s go through a few of the interesting results of LLaDA.

You can find all the results in the paper. I chose to focus on what I find the most interesting here.

  • Training efficiency: LLaDA shows similar performance to ARMs with the same number of parameters, but uses far fewer tokens during training (and no RLHF)! For example, the 8B version uses around 2.3T tokens, compared to 15T for LLaMA 3.
  • Using different block and answer lengths for different tasks: for example, the block length is particularly large for the Math dataset, and the model demonstrates strong performance in this domain. This could suggest that mathematical reasoning may benefit more from the diffusion-based, bidirectional process.
Source: [1]
  • Interestingly, LLaDA does better on the “Reversal poem completion task”. This task requires the model to complete a poem in reverse order, starting from the last lines and working backward. As expected, ARMs struggle due to their strict left-to-right generation process.
Source: [1]

LLaDA is not just an experimental alternative to ARMs: it shows real advantages in efficiency, structured reasoning, and bidirectional text generation.

Conclusion

I think LLaDA is a promising approach to language generation. Its ability to generate multiple tokens in parallel while maintaining global coherence could lead to more efficient training, better reasoning, and improved context understanding with fewer computational resources.

Beyond efficiency, I think LLaDA also brings a lot of flexibility. By adjusting parameters like the number of blocks and the number of generation steps, it can adapt to different tasks and constraints, making it a versatile tool for various language modeling needs and allowing more human control. Diffusion models could also play an important role in proactive AI and agentic systems by being able to reason more holistically.

As research into diffusion-based language models advances, LLaDA could become a useful step toward more natural and efficient language models. While it’s still early, I believe this shift from sequential to parallel generation is an interesting direction for AI development.

Thanks for reading!


Check out my previous articles:



References:

[1] Liu, C., Wu, J., Xu, Y., Zhang, Y., Zhu, X., & Song, D. (2024). Large Language Diffusion Models. arXiv preprint arXiv:2502.09992. https://arxiv.org/pdf/2502.09992
[2] Yang, L., et al. (2023). Diffusion Models: A Comprehensive Survey of Methods and Applications. ACM Computing Surveys, 56(4), 1–39.
[3] Alammar, J. (2018). The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
