Explore

The Paper News

Dive into a curated collection of insights, analysis, and updates from the world of AI, Bitcoin, and Energy. Whether you’re catching up on missed stories or exploring industry trends, our archive is your gateway to understanding the forces shaping tomorrow. Browse through expertly crafted articles and newsletters designed to keep professionals like you ahead of the curve.

The Download: how people fall for pig butchering schemes, and saving glaciers

This is today's edition of The Download, our weekday newsletter that provides a daily dose of what's going on in the world of technology.

Inside a romance scam compound—and how people get tricked into being there

Gavesh's journey had started, seemingly innocently, with a job ad on Facebook promising work he desperately needed. Instead, he found himself trafficked into a business commonly known as "pig butchering"—a form of fraud in which scammers form romantic or other close relationships with targets online and extract money from them. The Chinese crime syndicates behind the scams have netted billions of dollars, and they have used violence and coercion to force their workers, many of them people trafficked like Gavesh, to carry out the frauds from large compounds, several of which operate openly in the quasi-lawless borderlands of Myanmar.

We spoke to Gavesh and five other workers from inside the scam industry, as well as anti-trafficking experts and technology specialists. Their testimony reveals how global companies, including American social media and dating apps and international cryptocurrency and messaging platforms, have given the fraud business the means to become industrialized. By the same token, it is Big Tech that may hold the key to breaking up the scam syndicates—if only these companies can be persuaded or compelled to act. Read the full story.

—Peter Guest & Emily Fishbein
How to save a glacier

There's a lot we don't understand about how glaciers move and how soon some of the most significant ones could collapse into the sea. That could be a problem, since melting glaciers could lead to multiple feet of sea-level rise this century, potentially displacing millions of people who live and work along the coasts.

A new group is aiming not only to further our understanding of glaciers but also to look into options to save them if things move toward a worst-case scenario, as my colleague James Temple outlined in his latest story. One idea: refreezing glaciers in place. The whole thing can sound like science fiction. But once you consider how huge the stakes are, I think it gets easier to understand why some scientists say we should at least be exploring these radical interventions. Read the full story.

—Casey Crownhart

This article is from The Spark, MIT Technology Review's weekly climate newsletter. To receive it in your inbox every Wednesday, sign up here.

MIT Technology Review Narrated: How tracking animal movement may save the planet

Researchers have long dreamed of creating an Internet of Animals. And they're getting closer to monitoring 100,000 creatures—and revealing hidden facets of our shared world. This is our latest story to be turned into a MIT Technology Review Narrated podcast, which we're publishing each week on Spotify and Apple Podcasts. Just navigate to MIT Technology Review Narrated on either platform, and follow us to get all our new content as it's released.

The must-reads

I've combed the internet to find you today's most fun/important/scary/fascinating stories about technology.

1 Donald Trump has announced 25% tariffs on imported cars and parts
The measures are likely to make new cars significantly more expensive for Americans. (NYT $)
+ Moving car manufacturing operations to the US won't be easy. (WP $)
+ It's not just big businesses that will suffer, either. (The Atlantic $)
+ How Trump's tariffs could drive up the cost of batteries, EVs, and more. (MIT Technology Review)

2 China is developing an AI system to increase its online censorship
A leaked dataset demonstrates how LLMs could rapidly filter undesirable material. (TechCrunch)

3 Trump may reduce tariffs on China to encourage a TikTok deal
The Chinese-owned company has until April 5 to find a new US owner. (Insider $)
+ The national security concerns surrounding it haven't gone away, though. (NYT $)

4 OpenAI's new image generator can ape Studio Ghibli's distinctive style
Which raises the question of whether the model was trained on Ghibli's images. (TechCrunch)
+ The tool's popularity means its rollout to non-paying users has been delayed. (The Verge)
+ The AI lab waging a guerrilla war over exploitative AI. (MIT Technology Review)

5 DOGE planned to dismantle USAID from the beginning
New court filings reveal the department's ambitions to infiltrate the system. (Wired $)
+ Can AI help DOGE slash government budgets? It's complex. (MIT Technology Review)

6 Wildfires are getting worse in the southwest of the US
While federal fire spending is concentrated mainly in the west, the risk is rising in South Carolina and Texas too. (WP $)
+ North and South Carolina were recovering from Hurricane Helene when the fires struck. (The Guardian)
+ How AI can help spot wildfires. (MIT Technology Review)

7 A quantum computer has generated—and verified—truly random numbers
Which is good news for cryptographers. (Bloomberg $)
+ Cybersecurity analysts are increasingly worried about the so-called Q-Day. (Wired $)
+ Amazon's first quantum computing chip makes its debut. (MIT Technology Review)

8 What's next for weight-loss drugs 💉
Competition is heating up, but will patients be the ones to benefit? (New Scientist $)
+ Drugs like Ozempic now make up 5% of prescriptions in the US. (MIT Technology Review)

9 At least we've still got memes
Poking fun at the Trump administration's decisions is a form of online resistance. (New Yorker $)

10 Can you truly be friends with a chatbot?
People are starting to find out. (Vox)
+ The AI relationship revolution is already here. (MIT Technology Review)
Quote of the day

"I can't imagine any professional I know committing this egregious a lapse in judgement."

—A government technology leader tells Fast Company why top Trump officials' decision to use unclassified messaging app Signal to discuss war plans is so surprising.
The big story

Why one developer won't quit fighting to connect the US's grids

September 2024

Michael Skelly hasn't learned to take no for an answer. For much of the last 15 years, the energy entrepreneur has worked to develop long-haul transmission lines to carry wind power across the Great Plains, Midwest, and Southwest. But so far, he has little to show for the effort.

Skelly has long argued that building such lines and linking together the nation's grids would accelerate the shift from coal- and natural-gas-fueled power plants to the renewables needed to cut the pollution driving climate change. But his previous business shut down in 2019, after halting two of its projects and selling off interests in three more.

Skelly contends he was early, not wrong, and that the market and policymakers are increasingly coming around to his perspective. After all, the US Department of Energy just blessed his latest company's proposed line with hundreds of millions in grants. Read the full story.

—James Temple

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or skeet 'em at me.)

+ Severance's Adam Scott sure has interesting taste in music.
+ While we're not 100% sure if Millie is definitely the world's oldest cat, one thing we know for sure is that she lives a life of luxury.
+ Hiking trails are covered in beautiful wildflowers right now; just make sure you tread carefully.
+ This is a really charming look at how girls live in America right now.

Read More

Scaling Hashrate Simplified: The Mining Model That Delivered for BitMine

Introduction

BitMine Immersion Technologies (OTCQX: BMNR), a growing player in the Bitcoin mining industry, faced a very common industry opportunity and challenge: how to bring hashrate online in the best way possible. The complexities of sourcing energy, power infrastructure, site development, running operations, ASIC procurement, software optimization, and hashrate management require a holistic approach and entail many operational risks.

An experienced capital allocator, BitMine was familiar with these risks owing to deep experience in similar markets prior to founding the company. However, they were open to support. Enter Soluna and Luxor, two industry leaders partnering to provide a complementary solution. Soluna provided power, infrastructure, and operational expertise. Luxor delivered financing, hedging, procurement, software optimization via LuxOS, and monetization of hashrate via Luxor Pool. Together, they formed a game-changing partnership that addressed BitMine's needs, setting a new standard for turnkey mining solutions.

This case study explores how the collaboration between BitMine, Soluna, and Luxor streamlined deployment, mitigated risk, and unlocked new growth opportunities.

BitMine's Opportunity: Bringing Hashrate Online With Low Operational Risk

BitMine had a clear vision: to scale its mining operations efficiently while minimizing risk. However, they knew the pitfalls associated with deployments, which create vulnerabilities, especially when deals are structured poorly. For example:

Power Pricing Pitfalls: Many miners enter long-term hosting agreements with fluctuating rates, or worse, hidden pass-through costs that explode when energy prices spike. Some hosting providers lock clients into contracts that shift all the risk onto the miner.

Overpaying for Equipment: Without direct industry relationships, miners may buy hardware at retail prices or from intermediaries with significant markups. This happened during the 2021 bull run, when desperate new entrants paid $10,000–$15,000 per ASIC, only to watch prices crash to $3,000 during the next bear market.

Inefficient Machine Deployment: Delays, customs, DOAs – there are a lot of things that can go wrong in the procurement phase. After the machines have arrived, firmware adjustments, cooling, and heat can affect downtime, all resulting in significantly less hashrate and associated declines in returns. If uptime and efficiency are poor, larger sites can underperform smaller, well-optimized sites.

Cash Flow Mismatches: Mining revenue is volatile, fluctuating with network difficulty and Bitcoin price action. Some miners finance their operations with loans assuming steady returns, only to get caught in a bear market where mining rewards drop, electricity bills stay fixed, and debt payments become unmanageable.

This is why partnerships with experienced service providers who understand the nuances of power markets, hardware procurement, optimization, and financial hedging are critical. Those who fail to manage these risks effectively often end up selling distressed assets at the bottom of the cycle, exiting the industry with heavy losses, while more sophisticated players continue to scale.

The Solution: A Turnkey Approach with Soluna & Luxor

Recognizing BitMine's needs, Soluna and Luxor combined their strengths to offer a comprehensive and predictable end-to-end solution.

Soluna: Reliable Infrastructure & Stable Power

BitMine expanded its relationship with Soluna from ~3 MW at the Project Sophie data center by adding an additional ~10 MW at the new Project Dorothy facility. With Soluna currently providing 13 MW of hosting capacity, this eliminated uncertainty related to fluctuating energy prices and power interruptions, ensuring BitMine had a dedicated, stable source of power.

Luxor: Financial, Operational, and Strategic Expertise

Luxor played a critical role in enabling BitMine's expansion by leveraging all aspects of its business:

Hashrate Forward Contract: Luxor structured a hedging strategy that secured BitMine's profitability by locking in a fixed hashprice for a 12-month term.

Capital & Equipment Financing: Luxor facilitated financing for ASIC machine procurement through a forward hashrate sale, ensuring BitMine could scale without facing capital constraints.

Logistics Support: Luxor managed the entire shipping and logistics process to minimize downtime.

Fleet Optimization & Management: Luxor firmware was deployed across BitMine's fleet, unlocking dynamic mining strategies through LuxOS to maximize revenue and efficiency.

Why This Model Stands Out

This partnership redefined the traditional mining setup by integrating infrastructure, software and financial services, and operations management into a turnkey solution. By reducing risk across deployment, price volatility, and operational uncertainty, BitMine was able to scale confidently and predictably while focusing on its core business activities.

Results: Unlocking More Hashrate, More ASICs, and More Efficiency

The collaboration between BitMine, Soluna, and Luxor delivered tangible results:

Tripled BitMine's deployed ASIC capacity, significantly boosting its hashrate.

Secured long-term power stability, mitigating energy price fluctuations.

Locked in hashprice terms, reducing financial exposure to market volatility.

Streamlined the deployment process, cutting down hardware lead times and ensuring rapid scaling.

Enhanced operational efficiency, leveraging LuxOS firmware and running around 10% more efficiently than other miners, leading to improved profitability and lower downtime.

This approach provided BitMine with greater financial stability, operational certainty, and a faster growth trajectory, proving the effectiveness of a fully integrated mining solution.

Conclusion: The Future of Integrated Mining Solutions

This partnership between BitMine, Soluna, and Luxor showcases the value of turnkey mining solutions. Each party benefited:

BitMine: Gained a complete, risk-mitigated mining solution with price certainty, reliable power, and operational efficiency.

Soluna: Secured a long-term customer for its power capacity, reinforcing its role as a leader in sustainable Bitcoin mining.

Luxor: Demonstrated the power of its full-service model, proving that its comprehensive approach can drive long-term success for mining companies.

As mining economics continue to evolve, integrated win-win-win solutions like this will become increasingly essential. Soluna and Luxor plan to replicate and scale this model, bringing more miners into a stable, profitable framework.

For mining companies looking for a reliable, end-to-end solution, this case study validates the effectiveness of strategic partnerships in an industry where efficiency and risk management are critical.

Can We Help You?

Given the success of this collaboration, Soluna and Luxor are exploring ways to expand this model. If you're a miner looking for a scalable, turnkey solution, get in touch to learn how this approach can work for you.

About BitMine Immersion Technologies, Inc.

BitMine is a technology company focused on Bitcoin mining using immersion technology, an advanced cooling technique where computers are submerged in specialized oil circulated to keep units operating at optimal ambient temperature. Immersion technology is more environmentally friendly than conventional mining methodologies while lowering operating expenses and increasing yield. BitMine's operations are located in low-cost energy regions in Trinidad; Pecos, Texas; and Murray, Kentucky.

About Soluna Holdings, Inc. (SLNH)

Soluna is on a mission to make renewable energy a global superpower, using computing as a catalyst. The company designs, develops, and operates digital infrastructure that transforms surplus renewable energy into global computing resources. Soluna's pioneering data centers are strategically co-located with wind, solar, or hydroelectric power plants to support high-performance computing applications, including Bitcoin Mining, Generative AI, and other compute-intensive applications. Soluna's proprietary software MaestroOS(™) helps energize a greener grid while delivering cost-effective and sustainable computing solutions and superior returns. To learn more, visit solunacomputing.com. Follow us on X (formerly Twitter) at @SolunaHoldings.

About Luxor Technology Corporation

Luxor Technology Corporation is a Bitcoin mining software and services company that offers a suite of products catered toward the mining and compute power industry. Luxor's suite of software and services includes an open-auction ASIC marketplace, a Bitcoin mining pool, a hashrate derivatives desk, Antminer ASIC firmware, and a Bitcoin mining data platform. If you are interested in contacting the Luxor Derivatives Desk, please email [email protected].

Disclaimer

This content is for informational purposes only; you should not construe any such information or other material as legal, investment, financial, or other advice.

Read More

CoinDesk 20 Performance Update: SUI Gains 7.1% as Index Inches Higher

CoinDesk Indices presents its daily market update, highlighting the performance of leaders and laggards in the CoinDesk 20 Index.
The CoinDesk 20 is currently trading at 2731.35, up 0.4% (+11.44) since 4 p.m. ET on Wednesday.


Twelve of 20 assets are trading higher.

Leaders: SUI (+7.1%) and AAVE (+3.6%).
Laggards: DOT (-1.6%) and XRP (-1.4%).
The CoinDesk 20 is a broad-based index traded on multiple platforms in several regions globally.

Read More

Canacol Posts $25MM Loss on Deferred Tax Payment

Canacol Energy Ltd. has reported a net loss of $25.4 million, or $0.75 per share, for the fourth quarter (Q4) of 2024, compared to a net profit of $29.9 million for the same three-month period of 2023.

The natural gas exploration and production company, based in Canada but operating in Colombia, attributed the gap to a deferred income tax expense of $28.9 million for Q4 2024 and a deferred income tax recovery of $31.7 million for Q4 2023.

However, Canacol's adjusted earnings before interest, taxes, depreciation, amortization and exploration for Q4 2024 rose 43 percent year-on-year to $76.1 million, according to results published online by the company, thanks to a higher operating netback partly offset by lower realized contractual volumes.

Operating netback grew 39 percent year-over-year to $6.12 per thousand cubic feet in Q4 2024. “The increase is due to an increase in average sales prices, net of transportation expenses, offset by an increase in royalties”, Canacol said.

Revenues, net of royalties and transport expenses, increased 23 percent to $98.3 million. Canacol attributed the increase to higher sales prices, offset by lower sales volumes.

Realized contractual gas sales volumes fell 4 percent year-on-year to 158 million cubic feet a day.

Net capital expenditures dropped to $28.6 million, from $72.2 million for Q4 2023. “The decrease is due to reduced spending on land and seismic, workovers, and drilling and completion”, Canacol said.

It ended the year with $79.2 million in cash and cash equivalents and $45.5 million in working capital surplus.

“The Corporation expects that commodity pricing will remain strong for the remainder of 2025, and for this reason, in 2025, the Corporation lowered its take-or-pay volumes to maximize exposure to the spot sales market”, Canacol said.

"In line with maintaining and growing Canacol's reserves and production in its core assets in the LMV [Lower Magdalena Valley], the Corporation plans to optimize its production and increase reserves by drilling up to 11 exploration and three development wells, installing new compression and processing facilities as required, and completing workovers of producing wells in its key gas fields".

Canacol had 105.1 million barrels of oil equivalent (MMboe) in proven and probable reserves at the end of 2024. Proven developed producing reserves stood at 11.9 MMboe. Proven developed but not producing reserves totaled 26.4 MMboe. Proven undeveloped reserves were 6.3 MMboe, according to a separate report by Canacol.

“During the year ended December 31, 2024, the Corporation recorded increases in certain reserve categories due to both new gas discoveries and positive technical revisions of existing producing gas fields”, Canacol said.

To contact the author, email [email protected]

Read More

Halving Hits Bitcoin Fees Too?

Transaction fees drop to lowest point in three years

Developed in collaboration with OEMs, HashHouse's solutions offer streamlined design-to-deployment in as little as 90 days. Visit www.hashhouse.tech to learn more.

As March 2025 draws to a close, Bitcoin's transaction fees have experienced another notable decline, now making up only 1.25% of the total block rewards, according to TheMinerMag's analysis of Bitcoin blocks this month so far. This marks the lowest percentage of transaction fees in three years, since April 2022, signaling a significant shift in the network's dynamics. In 2025, Bitcoin fees have consistently accounted for less than 2% of the monthly block rewards. For context, transaction fees in March 2025 have totaled 155 BTC so far, which is not yet half of the 361 BTC from three years ago. Has the halving come for transaction fees, too?

Meanwhile, Bitcoin's seven-day moving average hashpower has quietly recovered, climbing back to 840 EH/s from just below 800 EH/s a week ago. The quiet and steady rise in hashrate indicates that confidence remains strong among large, efficient players, despite Bitcoin's hashprice—the revenue miners earn per terahash per second—remaining stagnant below $50/PH/s.

Thanks for reading Miner Weekly! Subscribe for free and support our work.

With the hashrate recovery, the network is expected to undergo a difficulty adjustment in approximately 10 days, which is projected to rise by 5%. With more miners competing for the same block subsidies and fewer fees to go around, this increase in difficulty could further strain profitability for miners with higher operational costs. Without a significant uptick in Bitcoin's market price or a revival in transaction fees, these miners may soon face an unmanageable situation: they may no longer be able to compete.

According to TheMinerMag's analysis of earnings reports, the median hashcost—the direct cost miners incur per terahash per second—among publicly listed mining companies was around $34/PH/s in the latest quarter. This leaves only about $15/PH/s as gross margin, even for the largest institutional mining operations.

This environment is setting the stage for further consolidation within the Bitcoin mining industry. Larger players, with more efficient equipment and better access to cheaper energy, are positioned to absorb market share as smaller miners struggle to stay afloat.

Regulation News
Proof-of-Work Crypto Mining Doesn't Trigger Securities Laws, SEC Says – CoinDesk
Pakistan Considers Bitcoin Mining to Help Offset Surplus Power Supply – TheMinerMag
Arkansas auto salesman's plan for new Vilonia cryptomine must win over wary locals – KATV

Hardware and Infrastructure News
TVP and Demand Pool announce upcoming launch of first Stratum V2 Bitcoin mining pool – Link
Bitcoin Hashprice Remains Flat After Moderate Difficulty Uptick – TheMinerMag
Auradine Launches Hydro Bitcoin Miner with 14.5 J/TH Efficiency – TheMinerMag
Canaan Signs US Hosting Deals to Boost Bitcoin Hashrate by 4.7 EH/s – TheMinerMag

Corporate News
Argo Names New CEO Amid Hosting Search for Bitcoin Miners – TheMinerMag
NYDIG Set to Expand Hashrate after Acquiring Crusoe's Bitcoin Mining Assets – TheMinerMag
Bitfarms Appoints HPC Executive as Bitcoin Miners Diversify Amid Industry Headwinds – TheMinerMag
Argo Seeks $27M Stock Deal to Acquire GEM Mining – TheMinerMag
Cango Extends Deadline on 18 EH/s Bitcoin Miner Deal Amid Takeover Talks – TheMinerMag

Financial News
Bernstein cuts price targets for Bitcoin miners amid underperformance relative to BTC in 2025 – The Block
Riot Seeks to Acquire Rhodium's Rockdale Bitcoin Mining Assets in $185M Deal to Settle Disputes – TheMinerMag
Argo Seeks $27M Stock Deal to Acquire GEM Mining – TheMinerMag

Feature
Bitcoin Miners Feel Squeeze as Hashprice Erases Post-Election Gains – Coindesk
Auradine's AH3880 Server Rack ASIC Miner w/ Sanjay Gupta – The Mining Pod
Bitcoin in the bush – the crypto mine in remote Zambia – BBC

Read More

Tether Boosts Stake in $1.12B Agricultural Firm Adecoagro to 70%

Tether, issuer of the $144 billion stablecoin USDT, has boosted its stake in Latin American agricultural firm Adecoagro (AGRO).
The $12.41 per share offer, which is subject to certain closing conditions, would take Tether’s stake in Adecoagro from 51% to 70%, according to an announcement on Thursday.


AGRO shares jumped over 7% to $11.95 in pre-market trading following the announcement.
Adecoagro's business is focused on sugar, ethanol, dairy and crop production in Argentina, Brazil, and Uruguay. It owns 210,400 hectares of farmland and several industrial facilities across these countries.
The company has a market cap of just under $1.12 billion.
Tether views its Adecoagro investment as one in the safe haven of land, complementing its holdings in bitcoin (BTC) and gold.

“Our investment aligns with Tether’s broader strategy to back infrastructure, technology, and businesses that advance economic freedom and resilience,” Tether CEO Paolo Ardoino said in Thursday’s announcement.
Tether is also increasing its exposure to the entertainment industry, acquiring a 30.4% stake in Italian media company Be Water for 10 million euros ($10.8 million).
This investment follows Tether’s announcement last month of taking a minority stake in Ardoino’s favorite team Juventus FC, arguably the largest soccer club in Italy.

Read More


Anatomy of a Parquet File

In recent years, Parquet has become a standard format for data storage in Big Data ecosystems. Its column-oriented format offers several advantages:

Faster query execution when only a subset of columns is being processed

Quick calculation of statistics across all data

Reduced storage volume thanks to efficient compression

When combined with storage frameworks like Delta Lake or Apache Iceberg, it seamlessly integrates with query engines (e.g., Trino) and data warehouse compute clusters (e.g., Snowflake, BigQuery). In this article, the content of a Parquet file is dissected using mainly standard Python tools to better understand its structure and how it contributes to such performances.

Writing Parquet file(s)

To produce Parquet files, we use PyArrow, a Python binding for Apache Arrow that stores dataframes in memory in columnar format. PyArrow allows fine-grained parameter tuning when writing the file. This makes PyArrow ideal for Parquet manipulation (one can also simply use Pandas).
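As an aside, when PyArrow's fine-grained control isn't needed, a minimal Pandas round trip looks like the sketch below (a simple illustration, not part of the generator script; the file name is arbitrary, and pandas needs a Parquet engine such as pyarrow installed):

import pandas as pd

# Write a tiny dataframe to Parquet (pandas delegates to the pyarrow engine by default).
df = pd.DataFrame({"name": ["Ada", "Grace"], "birth_year": [1815, 1906]})
df.to_parquet("people.parquet", index=False)

# Read back only one column; thanks to the columnar layout the other columns are never touched.
names = pd.read_parquet("people.parquet", columns=["name"])
print(names)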

# generator.py

import pyarrow as pa
import pyarrow.parquet as pq
from faker import Faker

fake = Faker()
Faker.seed(12345)
num_records = 100

# Generate fake data
names = [fake.name() for _ in range(num_records)]
addresses = [fake.address().replace("\n", ", ") for _ in range(num_records)]
birth_dates = [
fake.date_of_birth(minimum_age=67, maximum_age=75) for _ in range(num_records)
]
cities = [addr.split(", ")[1] for addr in addresses]
birth_years = [date.year for date in birth_dates]

# Cast the data to the Arrow format
name_array = pa.array(names, type=pa.string())
address_array = pa.array(addresses, type=pa.string())
birth_date_array = pa.array(birth_dates, type=pa.date32())
city_array = pa.array(cities, type=pa.string())
birth_year_array = pa.array(birth_years, type=pa.int32())

# Create schema with non-nullable fields
schema = pa.schema(
[
pa.field("name", pa.string(), nullable=False),
pa.field("address", pa.string(), nullable=False),
pa.field("date_of_birth", pa.date32(), nullable=False),
pa.field("city", pa.string(), nullable=False),
pa.field("birth_year", pa.int32(), nullable=False),
]
)

table = pa.Table.from_arrays(
[name_array, address_array, birth_date_array, city_array, birth_year_array],
schema=schema,
)

print(table)

pyarrow.Table
name: string not null
address: string not null
date_of_birth: date32[day] not null
city: string not null
birth_year: int32 not null
----
name: [["Adam Bryan","Jacob Lee","Candice Martinez","Justin Thompson","Heather Rubio"]]
address: [["822 Jennifer Field Suite 507, Anthonyhaven, UT 98088","292 Garcia Mall, Lake Belindafurt, IN 69129","31738 Jonathan Mews Apt. 024, East Tammiestad, ND 45323","00716 Kristina Trail Suite 381, Howelltown, SC 64961","351 Christopher Expressway Suite 332, West Edward, CO 68607"]]
date_of_birth: [[1955-06-03,1950-06-24,1955-01-29,1957-02-18,1956-09-04]]
city: [["Anthonyhaven","Lake Belindafurt","East Tammiestad","Howelltown","West Edward"]]
birth_year: [[1955,1950,1955,1957,1956]]

The output clearly reflects column-oriented storage, unlike Pandas, which usually displays a traditional "row-wise" table.

How is a Parquet file stored?

Parquet files are generally stored in cheap object storage databases like S3 (AWS) or GCS (GCP) to be easily accessible by data processing pipelines. These files are usually organized with a partitioning strategy by leveraging directory structures:

# generator.py

num_records = 100

# …

# Writing the parquet files to disk
pq.write_to_dataset(
table,
root_path='dataset',
partition_cols=['birth_year', 'city']
)

If birth_year and city columns are defined as partitioning keys, PyArrow creates such a tree structure in the directory dataset:

dataset/
├─ birth_year=1949/
├─ birth_year=1950/
│ ├─ city=Aaronbury/
│ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│ │ ├─ 828d313a915a43559f3111ee8d8e6c1a-0.parquet
│ │ ├─ …
│ ├─ city=Alicialand/
│ ├─ …
├─ birth_year=1951/
├─ …

The strategy enables partition pruning: when a query filters on these columns, the engine can use folder names to read only the necessary files. This is why the partitioning strategy is crucial for limiting delay, I/O, and compute resources when handling large volumes of data (as has been the case for decades with traditional relational databases).

The pruning effect can be easily verified by counting the files opened by a Python script that filters the birth year:

# query.py
import duckdb

duckdb.sql(
"""
SELECT *
FROM read_parquet('dataset/*/*/*.parquet', hive_partitioning = true)
where birth_year = 1949
"""
).show()

> strace -e trace=open,openat,read -f python query.py 2>&1 | grep "dataset/.*.parquet"

[pid 37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 37] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%201306/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Box%203487/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Clarkemouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=DPO%20AP%2020198/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=East%20Morgan/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=FPO%20AA%2006122/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=New%20Michelleport/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=North%20Danielchester/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Port%20Chase/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Richardmouth/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 4
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 5
[pid 39] openat(AT_FDCWD, "dataset/birth_year=1949/city=Robbinsshire/e1ad1666a2144fbc94892d4ac1234c64-0.parquet", O_RDONLY) = 3

Only 23 files are read out of 100.

Reading a raw Parquet file

Let’s decode a raw Parquet file without specialized libraries. For simplicity, the dataset is dumped into a single file without compression or encoding.

# generator.py

# …

pq.write_table(
table,
"dataset.parquet",
use_dictionary=False,
compression="NONE",
write_statistics=True,
column_encoding=None,
)

The first thing to know is that the binary file is framed by 4 bytes whose ASCII representation is “PAR1”. The file is corrupted if this is not the case.

# reader.py

with open("dataset.parquet", "rb") as file:
    parquet_data = file.read()

assert parquet_data[:4] == b"PAR1", "Not a valid parquet file"
assert parquet_data[-4:] == b"PAR1", "File footer is corrupted"

As indicated in the documentation, the file is divided into two parts: the “row groups” containing actual data, and the footer containing metadata (schema below).

The footer

The size of the footer is indicated in the 4 bytes preceding the end marker, stored as an unsigned integer in little-endian format. The footer itself is a Thrift-serialized FileMetaData structure describing the schema and the row groups, including per-column statistics such as minimum and maximum values. These statistics let an engine skip row groups entirely: if a query filters on birth_year > 1955 and a row group's maximum birth year is 1954, the engine can efficiently skip that entire data section. This optimisation is called "predicate pushdown". Parquet also stores other useful statistics like distinct value counts and null counts.
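To make the layout concrete, here is a minimal sketch continuing the reader.py above (variable names such as footer_start are illustrative, not from the original article) that locates the footer using that 4-byte length field:

# reader.py (continued)
import struct

# The last 8 bytes of the file are: a 4-byte footer length (little-endian) followed by b"PAR1".
footer_length = struct.unpack("<I", parquet_data[-8:-4])[0]

# The Thrift-serialized FileMetaData (the footer) sits immediately before those 8 bytes.
footer_start = len(parquet_data) - 8 - footer_length
footer_bytes = parquet_data[footer_start:-8]

print(f"Footer length: {footer_length} bytes")
# Deserializing footer_bytes with a Thrift compact-protocol reader yields the
# file_metadata_thrift object used in the following snippets.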

# reader.py
# …

first_row_group = file_metadata_thrift.row_groups[0]
birth_year_column = first_row_group.columns[4]

min_stat_bytes = birth_year_column.meta_data.statistics.min
max_stat_bytes = birth_year_column.meta_data.statistics.max

min_year = struct.unpack("<i", min_stat_bytes)[0]  # int32 statistics are stored as 4 little-endian bytes
max_year = struct.unpack("<i", max_stat_bytes)[0]

print(f"Birth year range in this row group: {min_year}-{max_year}")  # illustrative output

Read More

Solana Inflation Reform Effort Fails on Dramatic Final Voting Day

Solana’s high staking rewards will live to inflate SOL another day.
A contentious effort to reform the blockchain network’s generous inflation regime flopped on Thursday after supporters of SIMD-0228 failed to garner the supermajority they needed to implement the major economic change.


The surprise result delivered a blow to the Solana power brokers who rallied to replace Solana’s static inflation mechanics with a market-based system. Their proposal likely would have cut the network’s 4.7% annual staking rewards down to 1% or less.
In a contest that pitted Solana’s influential leaders and investors – who claim the network’s high staking rewards are bad for SOL’s price – against small-time operators who feared the effects of a big cut to their revenue, the opposition rallied hardest on Thursday, as late-voting validators’ ballots broke heavily in favor of “no.”

That was enough to scuttle the first major attempt at lowering Solana's uncommonly high staking emissions rate. Among the most valuable programmable blockchains by market cap, Solana issues comparatively large sums of new tokens to its validators, the computer operators that power proof-of-stake blockchains.
Much like election night in the U.S., SIMD-0228’s weeklong political circus featured betting, ranting, data threads, chart-reading wonkery, endless social media debates and more than a bit of heated name-calling. One validator put their votes up for sale. Many others split their tickets.
It crescendoed with a dramatic rush of ballots cast by many of Solana’s 1300 validators. In the end, the opposition won an exceptionally high turnout election that laid bare the divide between big and small validators.
In the end, SIMD-0228 became the network’s first economic reform to fail at the polls.
Little stakers
Solana validators are only called upon to vote when the network is grappling with a major economic change, said Jonny, the operator of the Solana Compass validator.
SIMD-0228 is the third ever such vote to appear in records by StakingFacilities.com (the current proposal went up for consideration with an unrelated SIMD that passed). Its controversies sparked the highest turnout vote in the network’s history.
Over 66% of validators cast votes, according to a dashboard from Flipside Crypto. Together they wielded 75% of the network’s voting power, a remarkable share given voting in this decentralized system is voluntary.

Of participating validators with 500,000 SOL or less, over 60% voted against SIMD-0228, per a Dune dashboard. Larger validators saw the exact opposite: of validators with more than 500,000 SOL, 60% voted in favor.
The lopsided results suggest opponents’ warnings of economic ruin struck a nerve with small-time validators.
Big Stakes
Proponents of SIMD-0228 believe it would have solved Solana’s inflation problem, which they claim drags down SOL’s price. Their thinking goes like this: fewer tokens means fewer sellers, and fewer in the hands of tax collectors, too.
In place of the network's static 4.7% SOL emissions that validators receive annually, they called for a dynamic system that adjusts to nudge staking trends up or down.
Opponents, meanwhile, called the proposal reckless and rushed. Some told CoinDesk they suspected its co-author, the influential investment company Multicoin Capital, had written it to favor its own interests. Others publicly warned SIMD-0228 would disrupt elements of Solana’s DeFi economy, or turn off institutional investors who they claimed were attracted to SOL’s native yield.
Some doomsayers even claimed SIMD-0228 would chip away at Solana’s decentralization by forcing hundreds of validators with small SOL stakes offline, though others dispute the size of the blow.
Solana validators make money based on how much SOL they've staked, either from their own coffers or from tokens delegated to them by others. Those with smaller stakes are more acutely exposed to changes in emissions than bigger operators.

“Many people feel like SIMD-0228 is not the best proposal to address inflation on Solana,” said SolBlaze, a validator operator.
“SIMD-0228 is a significant economic change, and changes on this scale deserve more time to discuss, analyze data, and iterate with feedback from different sectors of the ecosystem.”
Reformists aren’t going to give up the fight, said Max Resnick, one of the proposal’s co-authors and an economic researcher at Anza Labs.
“We are gonna chat with the no’s and come to a compromise,” he said.

Read More

Sir Ian Wood honoured at 38th Offshore Achievement Awards (OAA)

Billionaire industrialist and philanthropist Sir Ian Wood was celebrated at the 2025 Offshore Achievement Awards in Aberdeen.

Sir Ian, who led the Wood Group for 45 years and founded his family's venture philanthropy organisation, the Wood Foundation, was recognised with the annual event's significant contribution judges' award.

Over 400 guests celebrated the achievements and performance of companies and individuals in the offshore energy industry at the black-tie ceremony, hosted by BBC presenter, civil engineer and STEM ambassador Ayo Sokale.
The 2025 award winners of the 38th Offshore Achievement Awards (OAAs) are:
Emerging Technology Award: Puls8, with Cavitas Energy receiving a highly commended
Field Proven Technology Award, sponsored by TWMA: TechnipFMC
Industry Expert Award: Professor Jon Gluyas, the National Geothermal Centre
Inclusivity Champion Award, sponsored by SLB: Stork, with Weatherford highly commended in the category.
Sustainability Project Award: Asco
Offshore Workplace of Choice, supported by RigRun: Serica – Bruce Platform
Skills Development Award, sponsored by CNOOC: Stats Group with a highly commended certificate awarded to Aberdeenshire Council Foundation Apprenticeships
Collaboration Award, sponsored by Fugro: Wood
Industry Transferer / Returner: Laura Beaton, Wood
Young Professional Award, sponsored by Harbour: Stuart Hamilton from Fugro, with a highly commended certificate for Nandini Nagra from BP.
Graham Dallas, chairman of the Offshore Achievement Awards committee, said: “Congratulations to our 2025 award winners and finalists. Each winner has demonstrated exceptional leadership and innovation that will undoubtedly inspire others.
“The OAAs serve as a powerful reminder of what we can achieve through collaboration and commitment to excellence. Your success today will help shape our industry’s future, setting new benchmarks for achievement in the years ahead.
“On behalf of the Offshore Achievement Awards committee, I would also like to thank our new principal sponsor, Bilfinger UK, all other supporting sponsors and our judging panel for their time and commitment to the OAAs.”
George Rennie, vice president offshore E&M UK at Bilfinger said: “I would like to extend my congratulations to the finalists and winners of the 2025 Offshore Achievement Awards.
“These remarkable individuals have demonstrated exceptional innovation, dedication, and excellence in the energy industry. Their achievements are a testament to the hard work and commitment of those who strive to push the boundaries and drive progress in offshore operations.
“As a member of the judging panel, I have had the opportunity to observe the remarkable efforts and accomplishments of these distinguished professionals. Kudos to you all on your outstanding contributions and for setting a high standard in our industry.”
The awards are backed by SPE Aberdeen. The Society of Petroleum Engineers (SPE) is a not-for-profit professional association whose more than 127,000 members in 145 countries are engaged in oil and gas exploration and production. The Aberdeen Section is one of the largest of all the SPE sections across the world with over 2,000 members.


Read More

Fourier Transform Applications in Literary Analysis

Poetry is often seen as a pure art form, ranging from the rigid structure of a haiku to the fluid, unconstrained nature of free-verse poetry. In analysing these works, though, to what extent can mathematics and Data Analysis be used to glean meaning from this free-flowing literature? Of course, rhetoric can be analysed, references can be found, and word choice can be questioned, but can the underlying– even subconscious– thought process of an author be found using analytic tactics on literature? As an initial exploration into compute-assisted literature analysis, we’ll attempt to use a Fourier transforming program to search for periodicity in a poem. To test our code, we’ll use two case studies: “Do Not Go Gentle into That Good Night” by Dylan Thomas, followed by Lewis Carroll’s “Jabberwocky.” 

1. Data acquisition

a. Line splitting and word count

Before doing any calculations, all necessary data must be collected. For our purposes, we'll want a data set of the number of letters, words, syllables, and visual length of each line. First, we need to parse the poem itself (which is inputted as a plain text file) into substrings of each line. This is quite easily done in Python with the .split() method; passing the delimiter "\n" into the method will split the file by line, returning a list of strings for each line. (The full method is poem.split("\n").) Counting the number of words is as simple as splitting the lines, and follows nicely from it: first, iterating across all lines, apply the .split() method again – this time with no delimiter – so that it defaults to splitting on whitespace, turning each line string into a list of word strings. Then, to count the number of words on any given line, simply call the built-in len() function on each line; since each line has been broken into a list of words, len() will return the number of items in the line list, which is the word count.
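As a minimal sketch of this step (the two-line poem string here is just an illustration):

# Split a poem into lines, then count the words on each line.
poem = "Do not go gentle into that good night,\nOld age should burn and rave at close of day;"

lines = poem.split("\n")  # one string per line
word_counts = [len(line.split()) for line in lines]  # .split() with no argument splits on whitespace

print(word_counts)  # [8, 10]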

b. Letter count

To calculate the number of letters in each line, all we need to do is take the sum of the letter count of each word, so for a given line we iterate over each word, calling len() to get the character count of that word. After iterating over all words in a line, the character counts are summed for the total number of characters on the line; the code to perform this is sum(len(word) for word in words).

c. Visual length

Calculating the visual length of each line is simple; assuming a monospace font, the visual length of each line is simply the total number of characters (including spaces!) present on the line. Therefore, the visual length is simply len(line). However, most fonts are not monospace, especially common literary fonts like Caslon, Garamond, and Georgia — this presents an issue because without knowing the exact font that an author was writing with, we can’t calculate the precise line length. While this assumption does leave room for error, considering the visual length in some capacity is important, so the monospace assumption will have to be used. 

d. Syllable count

Getting the syllable count without manually reading each line is the most challenging part of data collection. To identify a syllable, we'll use vowel clusters. Note that in my program I defined a function, count_syllables(word), to count the syllables in each word. To preformat the word, we set it to all lowercase using word = word.lower() and remove any punctuation that may be contained in the word using word = re.sub(r'[^a-z]', '', word). Next, find all vowels or vowel clusters – each should be a syllable, as a single syllable is expressly defined as a unit of pronunciation containing one continuous vowel sound surrounded by consonants. To find each vowel cluster, we can use the regex of all vowels, including y: syllables = re.findall(r'[aeiouy]+', word). After defining syllables, it will be a list of all vowel clusters in a given word. Finally, there must be at least one syllable per word, so even if you input a vowelless word (Cwm, for example), the function will return one syllable. The function is:

import re

def count_syllables(word):
    """Estimate syllable count in a word using a simple vowel-grouping method."""
    word = word.lower()
    word = re.sub(r'[^a-z]', '', word)  # Remove punctuation
    syllables = re.findall(r'[aeiouy]+', word)  # Find vowel clusters
    return max(1, len(syllables))  # At least one syllable per word

That function will return the count of syllables for any inputted word, so to find the syllable count for a full line of text, return to the previous loop (used for data collection in 1.a–1.c) and call the function on each word in the words list. Summing the syllable counts gives the count for the full line: num_syllables = sum(count_syllables(word) for word in words).

e. Data collection summary

The data collection algorithm is compiled into a single function, which splits the inputted poem into its lines, iterates over each line performing all of the previously described operations, appends each data point to a designated list for that data set, and finally generates a dictionary storing all data points for a single line and appends it to a master data set. While the time complexity is effectively irrelevant for the small amounts of input data being used, the function runs in linear time, which is helpful in the case that it is used to analyze large amounts of data. The data collection function in its entirety is:

# Module-level lists that accumulate each metric across lines (the "designated lists" mentioned above).
word_counts, letters, length, sylls = [], [], [], []

def analyze_poem(poem):
    """Analyzes the poem line by line."""
    data = []
    lines = poem.split("\n")

    for line in lines:
        words = line.split()
        num_words = len(words)
        num_letters = sum(len(word) for word in words)
        visual_length = len(line)  # Approximate visual length (monospace)
        num_syllables = sum(count_syllables(word) for word in words)

        word_counts.append(num_words)
        letters.append(num_letters)
        length.append(visual_length)
        sylls.append(num_syllables)

        data.append({
            "line": line,
            "words": num_words,
            "letters": num_letters,
            "visual_length": visual_length,
            "syllables": num_syllables
        })

    return data

2. Discrete Fourier transform 

Preface: This section assumes an understanding of the (discrete) Fourier Transform; for a relatively brief and manageable introduction, try this article by Sho Nakagome.

a. Specific DFT algorithm

To address with some specificity the particular DFT algorithm I’ve used, we need to touch on the NumPy fast Fourier transform method. Suppose N is the number of discrete values being transformed: If N is a power of 2, NumPy uses the radix-2 Cooley-Tukey Algorithm, which recursively splits the input into even and odd indices. If N is not a power of 2, NumPy applies a mixed-radix approach, where the input is factorized into smaller prime factors, and FFTs are computed using efficient base cases. 

b. Applying the DFT

To apply the DFT to the previously collected data, I’ve created a function fourier_analysis, which takes only the master data set (a list of dictionaries with all data points for each line) as an argument. Luckily, since NumPy is so adept at mathematics, the code is simple. First, find N, being the number of data points to be transformed; this is simply N = len(data). Next, apply NumPy’s FFT algorithm to the data using the method np.fft.fft(data), which returns an array of the complex coefficients representing the amplitude and phase of the Fourier series. Finally, the np.abs(fft_result) method extracts the magnitudes of each coefficient, representing its strength in the original data. The function returns the Fourier magnitude spectrum as a list of frequency-magnitude pairs.

import numpy as np

def fourier_analysis(data):
    """Performs Fourier Transform and returns frequency data."""
    N = len(data)
    fft_result = np.fft.fft(data)  # Compute Fourier Transform
    frequencies = np.fft.fftfreq(N)  # Get frequency bins
    magnitudes = np.abs(fft_result)  # Get magnitude of FFT coefficients

    return list(zip(frequencies, magnitudes))  # Return (freq, magnitude) pairs
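As a quick sanity check (a sketch not present in the original article), we can feed fourier_analysis a synthetic word-count series with a known period and confirm that the spectrum peaks at the matching frequency:

# Synthetic "words per line" data that repeats every 4 lines.
fake_counts = [8, 6, 9, 4] * 8

spectrum = fourier_analysis(fake_counts)

# Keep positive frequencies only (dropping the zero-frequency mean component and the mirrored negatives).
positive = [(freq, mag) for freq, mag in spectrum if freq > 0]
peak_freq, peak_mag = max(positive, key=lambda pair: pair[1])

print(f"Dominant period: {1 / peak_freq:.1f} lines")  # Dominant period: 4.0 lines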

The full code can be found here, on GitHub.

3. Case studies

a. Introduction

We've made it through all of the code and tongue-twister algorithms; it's finally time to put the program to the test. For the sake of time, the literary analysis done here will be minimal, putting the stress on the data analysis. Note that while this Fourier transform algorithm returns a frequency spectrum, we want a period spectrum, so the relationship \( T = \frac{1}{f} \) will be used to obtain a period spectrum. For the purpose of comparing different spectrums' noise levels, we'll be using the metric of signal-to-noise ratio (SNR). The average signal noise is calculated as an arithmetic mean, given by \( P_{noise} = \frac{1}{N-1} \sum_{k=0}^{N-1} |X_k| \), where \( X_k \) is the coefficient for any index \( k \), and the sum excludes \( X_{peak} \), the coefficient of the signal peak. To find the SNR, simply take \( \frac{X_{peak}}{P_{noise}} \); a higher SNR means a higher signal strength relative to background noise. SNR is a strong choice for detecting poetic periodicity because it quantifies how much of the signal (i.e., structured rhythmic patterns) stands out against background noise (random variations in word length or syllable count). Unlike variance, which measures overall dispersion, or autocorrelation, which captures repetition at specific lags, SNR directly highlights how dominant a periodic pattern is relative to irregular fluctuations, making it ideal for identifying metrical structures in poetry.
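A direct translation of that definition into code could look like the sketch below (a hypothetical helper, not part of the original program; here the peak is taken over positive frequencies only, which is one reasonable reading of the definition above), operating on the (freq, magnitude) pairs returned by fourier_analysis:

def signal_to_noise(spectrum):
    """SNR of the dominant positive-frequency peak in a (freq, magnitude) spectrum."""
    magnitudes = [mag for freq, mag in spectrum if freq > 0]

    peak = max(magnitudes)
    # Arithmetic mean of all other coefficients, matching the P_noise definition above.
    noise = (sum(magnitudes) - peak) / (len(magnitudes) - 1)
    return peak / noise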

b. “Do Not Go Gentle into That Good Night” – Dylan Thomas

This work has a definite and visible periodic structure, so it is great testing data. Unfortunately, the syllable data won't find anything interesting here (Thomas's poem is written in iambic pentameter); the word count data, on the other hand, has the highest SNR value out of any of the four metrics, 6.086.

Figure 1. Note that this figure and all that follow were generated using Google Sheets.

The spectrum above shows a dominant signal at a 4 line period, and relatively little noise in the other period ranges. Furthermore, considering its highest SNR value compared to letter-count, syllable-count, and visual length gives an interesting observation: the poem follows a rhyme scheme of ABA(blank); this means the word count of each line repeats perfectly in tandem with the rhyme scheme. The SNRs of the other two relevant spectrums are not far behind the word-count SNR, with the letter-count at 5.724 and the visual length at 5.905. Those two spectrums also have their peaks at a period of 4 lines, indicating that they also match the poem’s rhyme scheme. 

c. “Jabberwocky” – Lewis Carroll

Carroll's writing is also mostly periodic in structure, but has some irregularities; in the word period spectrum there is a distinct peak at ~5 lines, but the considerably low noise (SNR = 3.55) is broken by three distinct sub-peaks at 3.11 lines, 2.54 lines, and 2.15 lines. This secondary peak structure is shown in figure 2, implying that there is a significant secondary repeating pattern in the words Carroll used. Furthermore, due to the increasing nature of the peaks as they approach a period of 2 lines, one conclusion is that Carroll has a structure of alternating word counts in his writing.

Figure 2.

This alternating pattern is reflected in the period spectrums of visual length and letter count, both having secondary peaks at 2.15 lines. However, the syllable spectrum shown in figure 3 shows a low magnitude at the 2.15 line period, indicating that the word count, letter count, and visual length of each line are correlated, but not the syllable count. 

Figure 3.

Interestingly, the poem follows an ABAB rhyme scheme, so this roughly two-line period suggests a connection between the visual length of each line and the rhyming pattern itself. One possible conclusion is that Carroll found it more visually appealing when writing for the rhyming ends of words to line up vertically on the page. This conclusion, that the visual aesthetic of each line altered Carroll's writing style, can be drawn before ever reading the text.

4. Conclusion

Applying Fourier analysis to poetry reveals that mathematical tools can uncover hidden structures in literary works—patterns that may reflect an author's stylistic tendencies or even subconscious choices. In both case studies, a quantifiable relationship was found between the structure of the poem and metrics (word count, letter count, and so on) that are often overlooked in literary analysis. While this approach does not replace traditional literary analysis, it provides a new way to explore the formal qualities of writing. The intersection of mathematics, computer science, data analytics and literature is a promising frontier, and this is just one way that technology can lead to new discoveries, holding potential in broader data science fields like stylometry, sentiment and emotion analysis, and topic modeling.

Read More

Shell Delivered Record Amount of Marine LNG to Ships in 2024

Shell Plc said it delivered record volumes of liquefied natural gas to power ships last year, boosting the use of a fuel that’s become crucial to the energy transition.

The company’s deliveries reached 1.1 million tons, according to the supermajor, which is one of the largest LNG shipping operators.

The shipping industry spews hundreds of millions of tons of greenhouse gases into the atmosphere each year and is under mounting pressure to decarbonize. LNG, emitting less carbon than oil-derived ship propellant, has been touted as a key bridge fuel during the switch to cleaner energy. Yet it still releases pollutants, including large amounts of methane.

“Demand for LNG-fueled vessels is picking up pace,” Tom Summers, senior vice president for Shell LNG Marketing & Trading, said in an email on Thursday. “LNG helps ship owners to reduce greenhouse gas emissions.”

The company last month raised its long-term forecast for global LNG demand, saying consumption will surge by about 60% into 2040. It expects the number of LNG-powered vessels to almost double in the next five years.

While the European Union has introduced rules that target ships’ emissions, the global marine fuel market is still dominated by oil. LNG only accounted for about 6% of consumption in 2023, according to figures from the International Maritime Organization, the industry’s regulator.

The amount of methane escaping from LNG-fueled ships is higher than assumed by the IMO, according to a study last year by environmental researchers. One critic of the use of LNG to power vessels, Fortescue Ltd.’s billionaire Chairman Andrew Forrest, has plans to tap green ammonia instead.

Methane is the second-largest contributor to global warming, after carbon dioxide. Shell aims to keep the methane intensity of its operated assets below 0.2% this year and achieve near-zero methane emissions by the end of the decade.

Read More

Nvidia’s GTC keynote will emphasize AI over gaming

Nvidia’s GPU Technology Conference (GTC) takes place in San Jose next week, not terribly far from San Francisco, which is concurrently hosting the Game Developers Conference in the heart of the city. Despite the geographic proximity, the subject matter of the two conferences will likely be worlds apart, as Nvidia CEO Jensen Huang seems to be aiming for less of a talk about what Nvidia will do for gaming and more about what it will do for AI.

In Nvidia’s GTC session catalog, the keynote is described succinctly: “Don’t miss this keynote from NVIDIA founder and CEO Jensen Huang. He’ll share how NVIDIA’s accelerated computing platform is driving the next wave in AI, digital twins, cloud technologies, and sustainable computing.”

With the recent launch of the 5090, it makes sense that Nvidia would be somewhat mum about new graphics capabilities for the latest games. At this point, improvements are likely to err more toward an incrementalist game of inches than a massive revolution every few years. But it does indicate that Nvidia is aware which audience is buttering their bread right now and is aiming their focus toward that.

Moreover, Nvidia and Huang likely feel they desperately need to win back the faith of the AI market. The picture of AI has changed dramatically since DeepSeek was revealed a few months ago, making Nvidia’s argument that the best results require its expensive hardware feel a bit shaky, a feeling that was borne out by the stock market.

Nvidia is going into GTC looking for a win, and it is more likely to get one from AI enthusiasts and investors than from gamers. This is not to say the company will have no new information about how to get the best graphics out of the shiny new titles of 2025 and beyond, but for the keynote at least, the core focus lies elsewhere.

Read More

Oil Drops Below $67 as Trade War Fears Weigh on Demand

Oil fell as signs that US President Donald Trump’s escalating trade war may hamper economic growth contributed to a bearish outlook for global demand.

West Texas Intermediate slid 1.7% to settle below $67 a barrel, following a 2.2% jump on Wednesday that was its biggest gain in almost two weeks. Global oil supply is likely to exceed demand by about 600,000 barrels a day this year as tariffs weaken macroeconomic conditions, the International Energy Agency said. US equities also dropped on uncertainty about the effects of the trade war.

Crude has tumbled from its mid-January highs as the Trump administration’s trade policies threaten a wider economic slowdown and reduce the appeal of riskier assets. On the supply side, an OPEC+ plan to boost production and the prospect of Russian barrels returning to the market also are weighing on prices. Oil briefly swooned to intraday lows after Russian President Vladimir Putin said he’s ready to agree to a ceasefire with Ukraine if it leads to long-lasting peace.

Bearish economic projections like the Federal Reserve Bank of Atlanta’s expectation that the US economy will decline at a 1.5% annualized rate this quarter are threatening prices, according to John Kilduff, a partner at Again Capital.

“A negative US economic outlook is problematic for this market,” he said. “That’s really why we’re down near the lower end of the range here at $66. If we break that, we are going to go back down into the $50s.”

US wholesale inflation was unchanged in February amid declining trade margins, though details were less favorable for the Federal Reserve’s preferred inflation gauge.

Top traders echoed expectations of supply outstripping demand at S&P Global’s CERAWeek conference in Houston, warning that prices could slide lower as more barrels come onto the market.

Oil Prices:

WTI for April delivery fell 1.7% to settle at $66.55 a barrel in New York.
Brent for May settlement slid 1.5% to settle at $69.88 a barrel.

Read More

Mastering Hadoop, Part 2: Getting Hands-On — Setting Up and Scaling Hadoop

Now that we’ve explored Hadoop’s role and relevance, it’s time to show you how it works under the hood and how you can start working with it. To start, we are breaking down Hadoop’s core components — HDFS for storage, MapReduce for processing, YARN for resource management, and more. Then, we’ll guide you through installing Hadoop (both locally and in the cloud) and introduce some essential commands to help you navigate and operate your first Hadoop environment.

Which components are part of the Hadoop architecture?

Hadoop’s architecture is designed to be resilient and fault-tolerant, relying on several core components that work together. These components divide large datasets into smaller blocks, making them easier to process and distribute across a cluster of servers. This distributed approach enables efficient data processing—far more scalable than a centralized ‘supercomputer.’

Hadoop Components | Source: Author

The basic components of Hadoop are:

Hadoop Common comprises basic libraries and functionalities that are required by the other modules.

The Hadoop Distributed File System (HDFS) distributes data across different servers and provides high aggregate read and write bandwidth.

Hadoop YARN takes care of resource distribution within the system and redistributes the load when individual computers reach their limits.

MapReduce is a programming model designed to make the processing of large amounts of data particularly efficient.

In 2020, Hadoop Ozone, which is used as an alternative to HDFS, was added to this basic architecture. It comprises a distributed object storage system that was specially designed for Big Data workloads to better handle modern data requirements, especially in the cloud environment.

HDFS (Hadoop Distributed File System)

Let’s dive into HDFS, Hadoop’s core storage system, designed specifically to meet the demands of big data processing. The basic principle is that files are not stored as a whole on a central server, but are divided into blocks of 128 MB or 256 MB and then distributed across different nodes in a computer cluster.

To ensure fault tolerance, each block is replicated three times by default across different servers. If one server fails, the system can still recover from the remaining copies, simply falling back on another node that holds the block.

According to its documentation, Hadoop pursues the following goals with the use of HDFS:

Fast recovery from hardware failures by falling back on working components.

Provision of streaming data access for high-throughput processing.

Support for very large data sets.

Portability across heterogeneous hardware and software platforms, making it easy to migrate to new hardware or software.

Apache Hadoop works according to the so-called master-slave principle. In the cluster, one node takes on the role of the master. It distributes the blocks of the data set to the various slave nodes and remembers which blocks it has stored on which machines. Only the references to the blocks, i.e. the metadata, are stored on the master node. If the active master fails, a standby NameNode can take over its role (the so-called Secondary NameNode, by contrast, only checkpoints metadata and is not a failover node).

The master within the Hadoop Distributed File System is called the NameNode, and the slave nodes are the so-called DataNodes. The DataNodes store the actual data blocks and regularly send heartbeats to the NameNode to report that they are still alive. If a DataNode fails, its blocks are re-replicated to other nodes to maintain sufficient fault tolerance.

The client writes files, which are then stored across the various DataNodes; in our example, these sit on racks 1 and 2. As a rule, there is only one DataNode per machine in a rack, and its primary task is to manage the data blocks held on that machine’s local storage.

The NameNode, in turn, is responsible for remembering which data blocks are stored in which DataNode so that it can retrieve them on request. It also manages the files and can open, close, and, if necessary, rename them.

Finally, the DataNodes carry out the actual read and write processes of the client. The client receives the required information from the DataNodes when a query is made. They also ensure the replication of data so that the system can be operated in a fault-tolerant manner.
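To make this division of labor concrete, here is a deliberately simplified toy model of the NameNode’s bookkeeping in Python. It is not Hadoop code; it only illustrates that the master holds block-to-node metadata while the DataNodes hold the blocks themselves (real HDFS block placement is additionally rack-aware).

# Toy illustration of the HDFS master-slave split (not real Hadoop code).
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the default block size
REPLICATION = 3                  # default replication factor

class ToyNameNode:
    """Keeps only metadata: which blocks belong to a file and which DataNodes hold them."""
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.block_map = {}      # file name -> list of (block_id, [datanode names])

    def store_file(self, name, size_bytes):
        n_blocks = -(-size_bytes // BLOCK_SIZE)   # ceiling division
        placements = []
        for i in range(n_blocks):
            # Round-robin placement of replicas; real HDFS also considers racks.
            targets = [self.datanodes[(i + r) % len(self.datanodes)] for r in range(REPLICATION)]
            placements.append((f"{name}_blk_{i}", targets))
        self.block_map[name] = placements

nn = ToyNameNode(datanodes=["dn1", "dn2", "dn3", "dn4"])
nn.store_file("logs.csv", size_bytes=300 * 1024 * 1024)   # a 300 MB file becomes 3 blocks
print(nn.block_map["logs.csv"])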

MapReduce

MapReduce is a programming model that supports the parallel processing of large amounts of data. It was originally developed by Google and can be divided into two phases:

Map: In the map phase, a process is defined that can transform the input data into key-value pairs. Several mappers can then be set up to process a large amount of data simultaneously to enable faster processing.

Reduce: The Reduce phase starts after all mappers have finished and aggregates all values that have the same key. The aggregation can involve various functions, such as the sum or the determination of the maximum value. Between the end of the Map phase and the start of the Reduce phase, the data is shuffled and sorted according to the keys.

A classic application for the MapReduce mechanism is word counting in documents, such as the seven Harry Potter volumes in our example. The task is to count how often the words “Harry” and “Potter” occur. To do this, in the map phase, each word is emitted as a key-value pair, with the word as the key and the number one as the value, because each occurrence counts as one.

The advantage is that these tasks can run in parallel and independently of one another, so that, for example, a separate mapper can run for each volume or even for each page. The work is therefore parallelized and completes much faster, and the scaling depends only on the available computing resources: it can be increased as required if the appropriate hardware is available. The output of the map phase could look like this, for example:

[("Harry", 1), ("Potter", 1), ("Potter", 1), ("Harry", 1), ("Harry", 1)]

MapReduce using the example of word counts in Harry Potter books | Source: Author

Once all mappers have finished their work, the reduce phase can begin. For the word count example, all key-value pairs with the keys “Harry” and “Potter” should be grouped and counted. 

The grouping produces the following result:

[("Harry", [1, 1, 1]), ("Potter", [1, 1])]

The grouped result is then aggregated. As the words are to be counted in our example, the grouped values are added together:

[("Harry", 3), ("Potter", 2)]

The advantage of this processing is that the task can be parallelized and at the same time only minimal file movement takes place. This means that even large volumes can be processed efficiently.
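The same map, shuffle, and reduce flow can be sketched in a few lines of plain Python. This is only a single-process illustration of the logic described above, not the distributed Hadoop implementation; in a real job, the mappers and reducers would run in separate tasks across the cluster.

# Single-process sketch of the MapReduce word-count logic (illustrative only).
from collections import defaultdict

documents = ["Harry asked Potter", "Potter answered Harry", "Harry laughed"]

# Map phase: emit (word, 1) for every occurrence.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle/sort phase: group the emitted values by key.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the grouped values, here by summing them.
reduced = {key: sum(values) for key, values in grouped.items()}
print(reduced)   # e.g. {'Harry': 3, 'Potter': 2, ...}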

Although many systems continue to use the MapReduce model from the original Hadoop design, more efficient frameworks, such as Apache Spark, have since been developed. We will go into these in more detail in the next part of this series.

YARN (Yet Another Resource Negotiator)

YARN (Yet Another Resource Negotiator) manages the hardware resources within the cluster. It separates resource management from data processing, which allows multiple applications (such as MapReduce, Spark, and Flink) to run efficiently on the same cluster. It focuses on key functions such as:

Management of compute and memory resources, such as CPU cores, RAM, or disk space.

Distribution of free resources to running processes, for example, MapReduce, Spark, or Flink.

Optimization and parallelization of job execution.

Similar to HDFS, YARN also follows a master-slave principle. The Resource Manager acts as the master and centrally monitors all resources in the entire cluster. It also allocates the available resources to the individual applications. The various node managers serve as slaves and are installed on each machine. They are responsible for the containers in which the applications run and monitor their resource consumption, such as memory space or CPU performance. These figures are fed back to the Resource Manager at regular intervals so that it can maintain an overview.

At a high level, a request to YARN looks like this: the client calls the Resource Manager and requests the execution of an application. This then searches for available resources in the cluster and, if possible, starts a new instance of the so-called Application Master, which initiates and monitors the execution of the application. This in turn requests the available resources from the node manager and starts the corresponding containers. The calculation can now run in parallel in the containers and is monitored by the Application Master. After successful processing, YARN releases the resources used for new jobs.

Hadoop Common

Hadoop Common can be thought of as the foundation of the complete Hadoop ecosystem on which the main components can be built. It contains basic libraries, tools, and configuration files that can be used by all Hadoop components. The main components include:

Common libraries and utilities: Hadoop Common provides a set of Java libraries, APIs, and utilities needed to run the cluster. This includes, for example, mechanisms for communication between the nodes in the cluster or support for different serialization formats, such as Avro. Interfaces required for file management in HDFS or other file systems are also included.

Configuration management: Hadoop is based on a large number of XML-based configuration files, which define the main system parameters that are essential for operation. One central aspect is the network parameters required to control the machines in the cluster. In addition, the permitted storage locations for HDFS are defined here, as are maximum resource sizes such as the usable storage space.

Platform independence: Hadoop was originally developed specifically for Linux environments. However, it can also be extended to other operating systems with the help of Hadoop Common. This includes native code support for additional environments, such as macOS or Windows.

Tools for I/O (input/output): A big data framework processes huge volumes of data that need to be stored and processed efficiently. The necessary building blocks for various file formats, such as plain text files or Parquet, are therefore included in Hadoop Common. It also contains the functionality for the supported compression codecs, which save storage space and reduce processing time.

Thanks to this uniform and central code base, Hadoop Common provides improved modularity within the framework and ensures that all components can work together seamlessly.

Hadoop Ozone

Hadoop Ozone is a distributed object storage system that was introduced as an alternative to HDFS and was developed specifically for big data workloads. HDFS was originally designed for large files with many gigabytes or even terabytes. However, it quickly reaches its limits when a large number of small files need to be stored. The main problem is the limitation of the NameNode, which stores metadata in RAM and, therefore, encounters memory problems when billions of small files are kept.

In addition, HDFS is designed for classic Hadoop use within a computing cluster. However, current architectures often use a hybrid approach with storage solutions in the cloud. Hadoop Ozone solves these problems by providing a scalable and flexible storage architecture that is optimized for Kubernetes and hybrid cloud environments.

Unlike HDFS, where a NameNode handles all file metadata, Hadoop Ozone introduces a more flexible architecture that doesn’t rely on a single centralized NameNode, improving scalability. Instead, it uses the following components: 

The Ozone Manager corresponds most closely to the HDFS NameNode, but only manages the bucket and volume metadata. It ensures efficient management of the objects and is also scalable, as not all file metadata has to be kept in RAM.

The Storage Container Manager (SCM) can best be imagined as the DataNode in HDFS and it has the task of managing and replicating the data in so-called containers. Various replication strategies are supported, such as triple copying or erasure coding to save space.

The Ozone S3 Gateway exposes an S3-compatible API, so Ozone can be used as a replacement for Amazon S3. This means applications developed for AWS S3 can be connected to Ozone and interact with it without code changes; a short sketch of this follows after the comparison table below.

This structure gives Hadoop Ozone various advantages over HDFS, which we have briefly summarized in the following table:

Attribute | Hadoop Ozone | HDFS
Storage structure | Object-based (buckets & keys) | Block-based (files & blocks)
Scalability | Millions to billions of small files | Problems with many small files
NameNode dependency | No central NameNode, so scaling is possible | NameNode is a bottleneck
Cloud integration | Supports S3 API, Kubernetes, multi-cloud | Strongly tied to the Hadoop cluster
Replication strategy | Classic 3-fold replication or erasure coding | Only 3-fold replication
Applications | Big data, Kubernetes, hybrid cloud, S3 replacement | Traditional Hadoop workloads

Hadoop Ozone is a powerful extension of the ecosystem and enables the implementation of hybrid cloud architectures that would not have been possible with HDFS. It is also easy to scale as it is no longer dependent on a central name node. This means that big data applications with many, but small, files, such as those used for sensor measurements, can also be implemented without any problems.
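Because the S3 Gateway speaks the S3 protocol, existing S3 tooling can usually be pointed at an Ozone cluster. The snippet below is a hedged sketch using boto3; the endpoint URL, bucket name, and credentials are placeholders that depend entirely on how your Ozone deployment is configured.

# Sketch: talking to the Ozone S3 Gateway with boto3 (endpoint and credentials are placeholders).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://ozone-s3-gateway.example.com:9878",  # assumed gateway address
    aws_access_key_id="OZONE_ACCESS_KEY",                     # placeholder credentials
    aws_secret_access_key="OZONE_SECRET_KEY",
)

s3.upload_file("local_file.txt", "my-bucket", "local_file.txt")   # bucket name is an assumption
for obj in s3.list_objects_v2(Bucket="my-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])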

How to start with Hadoop?

Hadoop is a robust and scalable big data framework that powers some of the world’s largest data-driven applications. While it can seem overwhelming for beginners due to its many components, this guide will walk you through the first steps to get started with Hadoop in simple, easy-to-follow stages.

Installation of Hadoop

Before we can start working with Hadoop, we must first install it in our respective environment. In this chapter, we differentiate between several scenarios, depending on whether the framework is installed locally or in the cloud. At the same time, it is generally advisable to work on systems that use Linux or macOS as the operating system, as additional adaptations are required for Windows. In addition, Java should already be available, at least Java 8 or 11, and internal communication via SSH should be possible.

Local Installation of Hadoop

To try out Hadoop on a local computer and familiarize yourself with it, you can perform a single-node installation so that all the necessary components run on the same computer. Before starting the installation, you can check the latest version you want to install at https://hadoop.apache.org/releases.html, in our case this is version 3.4.1. If a different version is required, the following commands can simply be changed so that the version number in the code is adjusted.

We then open a new terminal and execute the following code, which downloads the specified version from the Internet, unpacks the directory, and then changes to the unpacked directory.

wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
tar -xvzf hadoop-3.4.1.tar.gz
cd hadoop-3.4.1

If there are errors in the first line, this is most likely due to a faulty link and the version mentioned may no longer be accessible. A more up-to-date version should be used and the code executed again. The installation directory has a size of about one gigabyte.

The environment variables can then be set, telling the system in which directory Hadoop is stored on the computer. Extending the PATH variable allows Hadoop commands to be executed from anywhere in the terminal without having to type the full path to the Hadoop installation.

export HADOOP_HOME=~/hadoop-3.4.1
export PATH=$PATH:$HADOOP_HOME/bin

Before we start the system, we can change the basic configuration of Hadoop, for example, to define specific directories for HDFS or specify the replication factor. There are a total of three important configuration files that we can adjust before starting:

core-site.xml configures basic Hadoop settings, such as the connection information for multiple nodes.

hdfs-site.xml contains special parameters for the HDFS setup, such as the typical directories for data storage or the replication factor, which determines how many replicas of the data are stored.

yarn-site.xml configures the YARN component, which is responsible for resource management and job scheduling.

For our local test, we can adjust the HDFS configuration so that the replication factor is set to 1, as we are only working on one server, and replication of the data is, therefore, not useful. To do this, we use a text editor, in our case nano, and open the configuration file for HDFS:

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

The file then opens in the terminal and probably does not yet have any entries. A new property can then be added inside the <configuration> element:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Various properties can then be set according to this format. The different keys that can be specified in the configuration files, including their permitted values, are documented at https://hadoop.apache.org/docs/current/hadoop-project-dist/; for HDFS, the relevant overview is the hdfs-default.xml reference.
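If you would rather inspect these XML files programmatically than open them in an editor, the standard library is enough. The short sketch below assumes HADOOP_HOME is set as in the export commands above; it simply prints every name/value pair defined in hdfs-site.xml.

# Sketch: listing the properties set in a Hadoop XML configuration file.
import os
import xml.etree.ElementTree as ET

path = os.path.expandvars("$HADOOP_HOME/etc/hadoop/hdfs-site.xml")   # requires HADOOP_HOME to be set
for prop in ET.parse(path).getroot().findall("property"):
    print(prop.findtext("name"), "=", prop.findtext("value"))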

Now that the configuration has been completed, Hadoop can be started. To do this, HDFS is initialized, which is the first important step after a new installation, and the directory that is to be used as the NameNode is formatted. The next two commands then start HDFS on all nodes that are configured in the cluster and the resource management YARN is started.

hdfs namenode -format
start-dfs.sh
start-yarn.sh

Problems may occur in this step if Java has not yet been installed. However, this can easily be done with the corresponding installation. In addition, when I tried this on macOS, the NameNode and DataNode of HDFS had to be started explicitly:

~/hadoop-3.4.1/bin/hdfs --daemon start namenode
~/hadoop-3.4.1/bin/hdfs --daemon start datanode

For YARN, the same procedure works for the Resource and NodeManager:

~/hadoop-3.4.1/bin/yarn --daemon start resourcemanager
~/hadoop-3.4.1/bin/yarn --daemon start nodemanager

Finally, the running processes can be checked with the jps command to see whether all components have been started correctly.

Hadoop installation in a distributed system

For resilient and productive processes, Hadoop is used in a distributed environment with multiple servers, known as nodes. This ensures greater scalability and availability. A distinction is typically made between the following cluster roles:

NameNode: This role stores the metadata and manages the file system (HDFS).

DataNode: This is where the actual data is stored and the calculations take place.

ResourceManager & NodeManagers: These manage the cluster resources for YARN.

The same commands that were explained in more detail in the last section can then be used on the individual servers. However, communication must also be established between them so that they can coordinate with each other. In general, the following sequence can be followed during installation:

Set up several Linux-based servers to be used for the cluster.

Set up SSH access between the servers so that they can communicate with each other and send data.

Install Hadoop on each server and make the desired configurations.

Assign roles and define the NameNodes and DataNodes in the cluster.

Format NameNodes and then start the cluster.

The specific steps and the code to be executed then depend more on the actual implementation.

Hadoop installation in the cloud

Many companies use Hadoop in the cloud to avoid having to operate their own cluster, potentially save costs, and also be able to use modern hardware. The various providers already have predefined programs with which Hadoop can be used in their environments. The most common Hadoop cloud services are:

AWS EMR (Elastic MapReduce): This service is based on Hadoop and, as the name suggests, also uses MapReduce, allowing users to write Java programs that process and store large amounts of data in a distributed manner. The cluster runs on virtual servers in the Amazon Elastic Compute Cloud (EC2) and stores the data in the Amazon Simple Storage Service (S3). The keyword “Elastic” refers to the fact that the system can scale dynamically to match the required computing power. Finally, AWS EMR also offers the option of using other Hadoop-ecosystem tools such as Apache Spark or Presto.

Google Dataproc: Google’s alternative is called Dataproc and provides a fully managed, scalable Hadoop cluster in the Google Cloud. It integrates with BigQuery and uses Google Cloud Storage for data storage. Companies such as Vodafone and Twitter are already using this system.

Azure HDInsight: The Microsoft Azure Cloud offers HDInsight for complete Hadoop use in the cloud and also provides support for a wide range of other open-source programs.

The overall advantage of using the cloud is that no manual installation and maintenance work is required. Several nodes are used automatically and more are added depending on the computing requirements. For the customer, the advantage of automatic scaling is that costs can be controlled and only what is used is paid for.

With an on-premises cluster, on the other hand, the hardware is usually sized to remain functional even at peak loads, which means much of it sits idle most of the time. A further advantage of the cloud is that it is easier to integrate other services that run with the same provider.

Basic Hadoop commands for beginners

Regardless of the architecture selected, the following commands can be used to perform very general and frequently recurring actions in Hadoop. This covers all areas that are required in an ETL process in Hadoop.

Upload File to HDFS: To be able to execute an HDFS command, the beginning hdfs dfs is always required. You use put to define that you want to upload a file from the local directory to HDFS. The local_file.txt describes the file to be uploaded. To do this, the command is either executed in the directory of the file or the complete path to the file is added instead of the file name. Finally, use /user/hadoop/ to define the directory in HDFS in which the file is to be stored.

hdfs dfs -put local_file.txt /user/hadoop/

List files in HDFS: You can use -ls to list all files and folders in the HDFS directory /user/hadoop/ and have them displayed as a list in the terminal.

hdfs dfs -ls /user/hadoop/

Download file from HDFS: The -get parameter downloads the file /user/hadoop/file.txt from the HDFS directory to the local directory. The dot . indicates that the file is stored in the current local directory in which the command is being executed. If this is not desired, you can define a corresponding local directory instead.

hdfs dfs -get /user/hadoop/file.txt .

Delete files in HDFS: Use -rm to delete the file /user/hadoop/file.txt from the HDFS directory. This command also automatically deletes all replications that are distributed across the cluster.

hdfs dfs -rm /user/hadoop/file.txt

Start a MapReduce job (process data): MapReduce is the distributed computing model in Hadoop used to process large amounts of data. Using hadoop jar indicates that a Hadoop job packaged in a “.jar” file is to be executed. The jar containing various example MapReduce programs is located at /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar. From these examples, the wordcount job is to be executed, which counts the words occurring in a text file. The data to be analyzed is located in the HDFS directory input/ and the results are then stored in the directory output/.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount input/ output/

Monitor the progress of a job: Despite the distributed computing power, many MapReduce jobs take a certain amount of time to run, depending on the amount of data. Their status can therefore be monitored in the terminal. The resources and running applications can be displayed using YARN. To be able to execute a command in this system, we start with the command yarn, and with the help of application-list we get a list of all active applications. Various information can be read from this list, such as the unique ID of the applications, the user who started them, and the progress in %.

yarn application -list

Display logs of a running job: To be able to delve deeper into a running process and identify potential problems at an early stage, we can read out the logs. The logs command is used for this, with which the logs of a specific application can be called up. The unique application ID is utilized to define this application. To do this, the APP_ID must be replaced by the actual ID in the following command, and the greater than and less than signs must be removed.

yarn logs -applicationId <APP_ID>

With the help of these commands, data can already be saved in HDFS, and MapReduce jobs can also be created. These are the central actions for filling the cluster with data and processing it.
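If you prefer to script these steps rather than type them one by one, the same commands can be driven from Python. The sketch below simply shells out to the hdfs and hadoop binaries shown above; the paths and directory names mirror the examples in this section and would need to match your own setup.

# Sketch: scripting the HDFS and MapReduce commands above via subprocess.
import subprocess

def run(cmd):
    """Run a shell command and fail loudly if it returns a non-zero exit code."""
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

run("hdfs dfs -mkdir -p /user/hadoop/input")
run("hdfs dfs -put local_file.txt /user/hadoop/input/")
run("hdfs dfs -ls /user/hadoop/input/")
run("hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar "
    "wordcount /user/hadoop/input /user/hadoop/output")
run("hdfs dfs -cat /user/hadoop/output/part-r-00000")   # the reducer's output file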

Debugging & logging in Hadoop

For the cluster to be sustainable in the long term and for errors to be traced, it is important to master basic debugging and logging commands. As Hadoop is a distributed system, errors can occur in a wide variety of components and nodes. It is therefore essential to be familiar with the corresponding commands so that errors can be found and fixed quickly.

Detailed log files for the various components are stored in the $HADOOP_HOME/logs directory. The log files for the various servers and components can then be found in their subdirectories. The most important ones are:

NameNode-Logs contains information about the HDFS metadata and possible connection problems:

cat $HADOOP_HOME/logs/hadoop-hadoop-namenode-<hostname>.log

DataNode logs show problems with the storage of data blocks:

cat $HADOOP_HOME/logs/hadoop-hadoop-datanode-<hostname>.log

YARN ResourceManager logs reveal possible resource problems or errors in job scheduling:

cat $HADOOP_HOME/logs/yarn-hadoop-resourcemanager-<hostname>.log

NodeManager logs help with debugging executed jobs and their logic:

cat $HADOOP_HOME/logs/yarn-hadoop-nodemanager-<hostname>.log

With the help of these logs, specific problems in the processes can be identified and possible solutions can be derived from them. However, if there are problems in the entire cluster and you want to check the overall status across individual servers, it makes sense to carry out a detailed cluster analysis with the following command:

hdfs dfsadmin -report

This includes the number of active and failed DataNodes, as well as the available and occupied storage capacities. The replication status of the HDFS files is also displayed here and additional runtime information about the cluster is provided. An example output could then look something like this:

Configured Capacity: 10 TB
DFS Used: 2 TB
Remaining: 8 TB
Number of DataNodes: 5
DataNodes Available: 4
DataNodes Dead: 1

With these first steps, we have learned how to set up Hadoop in different environments, store and manage data in HDFS, execute MapReduce jobs, and read the logs to detect and fix errors. This will enable you to start your first project in Hadoop and gain experience with big data frameworks.

In this part, we covered the core components of Hadoop, including HDFS, YARN, and MapReduce. We also walked through the installation process, from setting up Hadoop in a local or distributed environment to configuring key files such as core-site.xml and hdfs-site.xml. Understanding these components is crucial for efficiently storing and processing large datasets across clusters.

If this basic setup is not enough for your use case and you want to learn how you can extend your Hadoop cluster to make it more adaptable and scalable, then our next part is just right for you. We will dive deeper into the large Hadoop ecosystem including tools like Apache Spark, HBase, Hive, and many more that can make your cluster more scalable and adaptable. Stay tuned!

Read More

Are You Still Using LoRA to Fine-Tune Your LLM?

LoRA (Low Rank Adaptation – arxiv.org/abs/2106.09685) is a popular technique for fine-tuning Large Language Models (LLMs) on the cheap. But 2024 has seen an explosion of new parameter-efficient fine-tuning techniques, an alphabet soup of LoRA alternatives: SVF, SVFT, MiLoRA, PiSSA, LoRA-XS 🤯… And most are based on a matrix technique I like a lot: the SVD (Singular Value Decomposition). Let’s dive in.

LoRA

The original LoRA insight is that fine-tuning all the weights of a model is overkill. Instead, LoRA freezes the model and only trains a small pair of low-rank “adapter” matrices. See the illustrations below (where W is any matrix of weights in a transformer LLM).

This saves memory and compute cycles since far fewer gradients have to be computed and stored. For example, here is a Gemma 8B model fine-tuned to speak like a pirate using LoRA: only 22M parameters are trainable, 8.5B parameters remain frozen.
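As a quick illustration of the idea (a sketch of the concept, not any particular library’s implementation), the adapter can be written as two small matrices A and B whose product is added to the frozen weight; the shapes and scales below are arbitrary choices for the example.

# Minimal numpy sketch of a LoRA-style adapter: W stays frozen, only A and B would be trained.
import numpy as np

d_out, d_in, rank = 512, 512, 8
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))                 # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d_out, rank))     # trainable low-rank factor
B = np.zeros((rank, d_in))                         # starts at zero so W is unchanged initially

def forward(x):
    return x @ (W + A @ B).T                       # effective weight = W + A·B

x = rng.normal(size=(1, d_in))
print(forward(x).shape)   # (1, 512); only A and B (512*8 + 8*512 = 8,192 values) are trainable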

LoRA is very popular. It has even made it as a single-line API into mainstream ML frameworks like Keras:

gemma.backbone.enable_lora(rank=8)

But is LoRA the best? Researchers have been trying hard to improve on the formula. Indeed, there are many ways of selecting smaller “adapter” matrices. And since most of them make clever use of the singular value decomposition (SVD) of a matrix, let’s pause for a bit of Math.

SVD: the simple math

The SVD is a great tool for understanding the structure of matrices. The technique splits a matrix into three: W = USVᵀ, where U and V are orthogonal (i.e., basis changes) and S is the diagonal matrix of sorted singular values. This decomposition always exists.

In “textbook” SVD, U and V are square, while S is a rectangle with singular values on the diagonal and a tail of zeros. In practice, you can work with a square S and a rectangular U or V – see the picture – the chopped-off pieces are just multiplications by zero. This “economy-sized” SVD is what is used in common libraries, for example, numpy.linalg.svd.
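In numpy, this economy-sized decomposition is one call, and reconstructing W from the three factors is a useful sanity check (a small sketch, nothing framework-specific):

# Economy-sized SVD in numpy and a reconstruction check.
import numpy as np

W = np.random.default_rng(0).normal(size=(300, 100))
U, S, Vt = np.linalg.svd(W, full_matrices=False)   # U: (300, 100), S: (100,), Vt: (100, 100)

print(U.shape, S.shape, Vt.shape)
print(np.allclose(W, U @ np.diag(S) @ Vt))         # True: W = U·S·Vᵀ
print(np.all(S[:-1] >= S[1:]))                     # singular values arrive sorted, largest first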

So how can we use this to more efficiently select the weights to train? Let’s quickly go through five recent SVD-based low-rank fine-tuning techniques, with commented illustrations.

SVF

The simplest alternative to LoRA is to use the SVD on the model’s weight matrices and then fine-tune the singular values directly. Oddly, this is the most recent technique, called SVF, published in the Transformers² paper (arxiv.org/abs/2501.06252v2).

SVF is much more economical in parameters than LoRA. And as a bonus, it makes tuned models composable. For more info on that, see my Transformers² explainer here, but composing two SVF fine-tuned models is just an addition:
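As a hedged sketch of what tuning only the singular values means in code (this mirrors the idea, not the Transformers² codebase): decompose W once, keep U and V frozen, and learn a small vector z that rescales S. Composing two such fine-tunes then amounts to combining their scale vectors; the simple average below is just one illustrative choice.

# Sketch of the SVF idea: only a vector of per-singular-value scales is trainable.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(256, 256))
U, S, Vt = np.linalg.svd(W, full_matrices=False)   # computed once, then frozen

z_task_a = np.full_like(S, 1.05)                   # learned per-task scales (placeholders here)
z_task_b = np.full_like(S, 0.95)

def adapted_weight(z):
    return U @ np.diag(S * z) @ Vt                 # only z would receive gradients

W_combined = adapted_weight((z_task_a + z_task_b) / 2)   # one simple way to compose two fine-tunes
print(W_combined.shape)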

SVFT

Should you need more trainable parameters, the SVFT paper (arxiv.org/abs/2405.19597) explores multiple ways of doing that, starting by adding more trainable weights on the diagonal.

It also evaluates multiple alternatives like spreading them randomly through the “M” matrix.

More importantly, the SVFT paper confirms that having more trainable values than just the diagonal is useful. See their fine-tuning results below.

Next come several techniques that split singular values in two sets, “large” and “small”. But before we proceed, let’s pause for a bit more SVD math.

More SVD math

The SVD is usually seen as a decomposition into three matrices, W = USVᵀ, but it can also be thought of as a weighted sum of rank-1 matrices, weighted by the singular values: W = Σᵢ sᵢuᵢvᵢᵀ.

Should you want to prove it, express the individual matrix elements Wⱼₖ using the USVᵀ form and the formula for matrix multiplication on one hand, and the Σᵢ sᵢuᵢvᵢᵀ form on the other; simplify using the fact that S is diagonal and notice that it’s the same thing.

In this representation, it’s easy to see that you can split the sum in two. And as you can always sort the singular values, you can make this a split between “large” and “small” singular values.
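Numerically, the split is just a truncation of the sorted singular values; the two halves add back up to the original matrix. A small numpy sketch of the math (not any paper’s code):

# Splitting W into a "large singular values" part and a "small singular values" part.
import numpy as np

W = np.random.default_rng(0).normal(size=(200, 120))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

r = 16                                             # how many "large" singular values to keep
W_large = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]    # principal ("large") components
W_small = U[:, r:] @ np.diag(S[r:]) @ Vt[r:, :]    # minor ("small") components

print(np.allclose(W, W_large + W_small))           # True: the two parts sum back to W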

Going back to the three-matrix form W = USVᵀ, this is what the split looks like:

Based on this formula, two papers have explored what happens if you tune only the large singular values or only the small ones, PiSSA and MiLoRA.

PiSSA

PiSSA (Principal Singular values and Singular Vectors Adaptation, arxiv.org/abs/2404.02948) claims that you should only tune the large principal values. The mechanism is illustrated below:

From the paper: “PiSSA is designed to approximate full finetuning by adapting the principal singular components, which are believed to capture the essence of the weight matrices. In contrast, MiLoRA aims to adapt to new tasks while maximally retaining the base model’s knowledge.”

The PiSSA paper also has an interesting finding: full fine-tuning is prone to over-fitting. You might get better results in the absolute with a low-rank fine-tuning technique.

MiLoRA

MiLoRA (Minor singular component LoRA arxiv.org/abs/2406.09044), on the other hand, claims that you should only tune the small principal values. It uses a similar mechanism to PiSSA:

Surprisingly, MiLoRA seems to have the upper hand, at least when tuning on math datasets which are probably fairly aligned with the original pre-training. Arguably, PiSSA should be better for bending the behavior of the LLM further from its pre-training.

LoRA-XS

Finally, I’d like to mention LoRA-XS (arxiv.org/abs/2405.17604). It is very similar to PiSSA but uses a slightly different mechanism. It also shows good results with significantly fewer parameters than LoRA.

The paper offers a mathematical explanation of why this setup is “ideal” under two conditions:

that truncating the bottom principal values from the SVD still offers a good approximation of the weights matrices

that the fine-tuning data distribution is close to the pre-training one

Both are questionable IMHO, so I won’t detail the math. Some results:

The underlying assumption seems to be that singular values come in “large” and “small” varieties but is it true? I made a quick Colab to check this on Gemma2 9B. Bottom line: 99% of the singular values are in the 0.1 – 1.1 range.  I’m not sure partitioning them into “large” and “small” makes that much sense.
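For anyone who wants to run that kind of check on a matrix they can load, a hedged sketch is below. The random matrix is only a stand-in; the 99% figure above comes from the author’s Colab on the actual Gemma2 9B weights.

# Sketch: how concentrated are a weight matrix's singular values? (stand-in matrix, not Gemma.)
import numpy as np

W = np.random.default_rng(0).normal(size=(1024, 1024)) / np.sqrt(1024)   # placeholder weight matrix
S = np.linalg.svd(W, compute_uv=False)

lo, hi = 0.1, 1.1
frac = np.mean((S >= lo) & (S <= hi))
print(f"{frac:.1%} of singular values lie in [{lo}, {hi}]")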

Conclusion

There are many more parameter-efficient fine-tuning techniques. Worth mentioning:

My conclusion: to go beyond the LoRA standard with 10x fewer params, I like the simplicity of Transformers²’s SVF. And if you need more trainable weights, SVFT is an easy extension. Both use all singular values (full rank, no singular value pruning) and are still cheap 😁. Happy tuning!

Note: All illustrations are either created by the author or extracted from arxiv.org papers for comment and discussion purposes.

Read More

BlackRock’s BUIDL Fund Tops $1B with Ethena’s $200M Allocation

Global asset manager BlackRock’s BUIDL token, issued in partnership with Securitize and backed by U.S. Treasuries, crossed the $1 billion milestone in assets on Thursday, Securitize said.
Pushing the fund’s size above the threshold was a $200 million allocation this afternoon by crypto protocol Ethena, a Securitize spokesperson told CoinDesk. Ethereum blockchain data from Arkham Intelligence shows an entity minting $200 million worth of BUIDL tokens on Thursday at 18:47 UTC.

Crypto tokens backed by U.S. Treasuries are at the forefront of tokenization efforts, as digital asset firms and global financial heavyweights race to put traditional instruments such as bonds, private credit and funds on blockchain rails, aiming to achieve faster settlements and operational efficiencies.

BUIDL serves as a building block for multiple yield-generating offerings, and it’s increasingly used as collateral on trading platforms. It’s a key reserve asset for Ethena’s yield-generating USDtb token, which now has a $540 million supply. USDtb’s value is backed by USDC and USDT stablecoins and some $320 million worth of BUIDL tokens.

“Ethena’s decision to scale USDtb’s investment in BUIDL reflects our deep conviction in the value of tokenized assets and the significant role they will continue to play in modern financial infrastructure,” said Guy Young, founder of Ethena.

Read more: Tokenized Treasuries Hit Record $4.2B Market Cap as Crypto Correction Fuels Growth

Read More