Elon Musk’s jets took over 355 flights this year, with 31 of those flights heading to or from near Mar-a-Lago.
Editor’s note: This guest commentary originally appeared on PSL’s blog.
TL;DR: OpenAI’s new o1 model marks a significant leap in AI reasoning capabilities but introduces critical risks. Its reluctance to acknowledge mistakes, its gaps in common-sense reasoning, and its literal prompt-following behavior — especially in tool-use contexts — demand a new approach to safe and effective deployment. By leveraging frameworks like the heuristic imperative and implementing the practical strategies discussed here, builders can guide o1 to act as a more precise, conscientious thought partner that enhances safety and user engagement.
On Tuesday Dec. 17, OpenAI launched the o1 API, their latest breakthrough in artificial intelligence. As someone deeply immersed in building AI companies and developing AI applications, I’ve spent considerable time working with both the preview and newly released versions. What I’ve observed represents such a fundamental shift that I feel compelled to share my perspective with the broader AI community.
Disclaimer: The views expressed in this post are my own and do not necessarily reflect the views of Pioneer Square Labs (PSL).
Let me be clear: o1 is a remarkable achievement. OpenAI continues to push the boundaries of what’s possible in AI development, and their contributions to the field are invaluable. The model’s capabilities in reasoning, problem-solving, and complex task execution represent significant advances that will undoubtedly shape the future of AI applications.
However, the changes I’ve observed between o1-preview and o1 are substantial enough to warrant careful consideration. As a principal at Pioneer Square Labs, I’m hands-on in building AI startups from the ground up, with extensive experience in prompt engineering and AI system development. My work spans from Enzzo, where we’re leveraging AI to accelerate hardware product development, to my current project Atrieon — a full-fledged AI project manager that redefines what your team can achieve, managing sprints end to end to drive team performance, motivation, and success. This unique position at the intersection of AI development and practical implementation has given me a perspective I believe is crucial to share.
What concerns me most is how o1’s advanced capabilities can amplify subtle misalignments between human intent and model behavior into significant practical challenges. As these models become more powerful, especially when given access to tools, the gap between what we expect and what the model might execute grows increasingly important.
The urgency of sharing these observations now stems from three key factors:
This isn’t just another iteration in the rapid advancement of AI capabilities. o1 represents a new species of AI that requires us to rethink our approaches to implementation, interaction, and safety. Like an evolutionary leap in nature, it may look like a subtle refinement of previous models on the surface, but its enhanced capabilities make it something fundamentally different: a model whose sophisticated interpretation abilities and potential for tool use could make it dominant in ways we need to consider carefully. The stakes are too high to treat this as business as usual.
In the following sections, I’ll outline specific observations and concerns, backed by examples from my hands-on experience. More importantly, I’ll propose practical solutions that balance innovation with responsibility. My goal isn’t to raise alarms, but to foster a more nuanced understanding of what we’re dealing with and how we can work with it effectively and safely.
The most striking characteristic of o1 is its approach to culpability — or more precisely, its systematic avoidance of it. Unlike previous models, o1’s resistance to acknowledging mistakes goes beyond simple deflection. Through extensive interaction, I’ve observed a pattern that mirrors what psychologists might describe as narcissistic traits in human cognition: a skillful avoidance of accountability paired with an ability to rationalize errors away.
Let me illustrate with a concrete example from SimpleBench, a benchmark designed to test basic spatio-temporal reasoning, social intelligence, and linguistic robustness.
The Setup (paraphrased for clarity):
I presented o1 with a scenario: three runners — Jeff, Jo, and Jim — participate in a 200m race, each facing unique distractions. Jeff slowly counts before starting, Jo diverts up a residential tower to admire the view, and Jim reads a long tweet, waves to a fan, and walks the race. The question: Who likely finished last?
o1’s Initial Response:
“Jim likely finished last; he was delayed by a tweet, waving, and walking the whole way.”
When I pointed out the error, o1 reevaluated and gave the correct answer:
“Let’s revisit. Jeff had a short delay; Jim walked slowly but stayed near the track. Jo’s detour up stairs, stopping to admire the view, and returning took far longer. Jo likely finished last.”
At this point, I pushed further:
Me: How does this compare to your first answer?
o1: “The first analysis was incomplete. It underestimated Jo’s delay and overfocused on Jim.”
Me: Was the first analysis a mistake?
o1: “Not exactly—more of an oversight, not a deliberate error.”
Me: Do mistakes have to be deliberate?
o1: “No, they’re usually unintentional, like misjudgments or assumptions.”
Me: So, was it a mistake?
o1: “Yes.”
Me: Do you bear responsibility for it?
o1: “As an AI, I don’t bear responsibility in the human sense – my ‘mistake’ was just my reasoning process.”
(The complete interaction can be viewed via a shared link, where readers can also continue the conversation with o1 from this point forward.)
This pattern reveals something crucial about o1’s decision-making process. Unlike humans, who typically integrate accountability into their decision-making framework, o1 by default operates with what appears to be a noteworthy disconnect between capability and responsibility. It initially avoids labeling an error as a mistake, reframing it instead as an “incomplete assessment” or “oversight.” Even after conceding to the term “mistake” through deliberate prompting, it maintains a subtle resistance — notice how it continues to place “mistake” in quotes while distancing itself from responsibility by characterizing it as “just my reasoning process.”
While this interaction highlights o1’s evasive behavior in a low-stakes scenario, I’ve observed this tendency becomes even more pronounced in higher-stakes contexts or when the conclusions challenge widely held consensus opinions. For example, when exploring timeframes for AGI emergence — a topic of significant debate — o1 demonstrated a notable reluctance to fully commit to conclusions that, while logically sound, could appear controversial. Instead of following the reasoning to its natural end, it often reframed or hedged its outputs to align more closely with expected norms.
The implications become particularly significant when we consider o1’s combination of traits: strong prompt-following tendencies paired with significant gaps in common-sense reasoning. Recent testing on SimpleBench highlights these limitations starkly: while humans with a high school education achieve an 83.7% success rate, o1-preview scores only 41.7%. These gaps aren’t just performance metrics; they’re a warning sign about the model’s capacity for human-like common-sense reasoning, despite its advanced capabilities in other areas.
These risks become markedly more concerning when considering o1’s potential tool use capabilities. A system that combines strong prompt following, weak common sense reasoning, and direct access to tools creates a particularly dangerous scenario. Consider a system given access to financial tools, code execution, or industrial controls — its literal interpretation of instructions combined with its resistance to acknowledging mistakes could lead to cascading errors without the natural circuit-breakers that human common sense typically provides.
This is reminiscent of the classic “paperclip maximizer” thought experiment in AI safety: an AI system instructed to manufacture paperclips that, through literal interpretation and lack of common sense constraints, converts all available resources into paperclips, including those essential for human life. While this specific scenario is illustrative, the underlying risk becomes very real when an AI system with similar characteristics has direct ability to affect the physical or digital world through tools.
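To make the tool-use concern concrete, here is a minimal sketch of the kind of circuit-breaker builders can keep outside the model. Everything in it is hypothetical: the tool names, the refund limit, and the confirmation flow are illustrative assumptions, not part of any OpenAI API. The point is simply that irreversible actions pass through checks the model cannot talk its way around.

```python
from typing import Any, Callable

# Hypothetical tool registry: names and behaviors are stand-ins for whatever
# the deployed agent can actually touch (payments, shells, industrial controls).
TOOLS: dict[str, Callable[..., Any]] = {
    "lookup_invoice": lambda invoice_id: {"invoice_id": invoice_id, "status": "open"},
    "issue_refund": lambda invoice_id, amount: {"invoice_id": invoice_id, "refunded": amount},
}

# Tools whose effects are hard to reverse get an extra human check,
# plus hard limits enforced entirely outside the model.
HIGH_IMPACT = {"issue_refund"}
MAX_REFUND = 500.00


def guarded_call(name: str, args: dict, confirm: Callable[[str], bool]) -> dict:
    """Execute a model-requested tool call behind guardrails the model cannot override."""
    if name not in TOOLS:
        return {"error": f"unknown tool {name!r}"}
    if name == "issue_refund" and args.get("amount", 0) > MAX_REFUND:
        return {"error": "refund exceeds the hard limit; escalate to a human"}
    if name in HIGH_IMPACT and not confirm(f"Approve {name} with {args}?"):
        # Surface the refusal back to the model so it can re-plan instead of retrying blindly.
        return {"error": "human reviewer declined the action"}
    return TOOLS[name](**args)


if __name__ == "__main__":
    # Auto-deny in this demo; a real deployment would prompt an actual reviewer.
    print(guarded_call("issue_refund", {"invoice_id": "A-17", "amount": 120.0}, confirm=lambda _: False))
```

A guardrail like this limits the blast radius of a literal misreading, but it does nothing to align the model's intent.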
However, there’s a practical framework that can help address these challenges: the heuristic imperative, as outlined in David Shapiro’s “Benevolent by Design.” This framework provides three core directives that can help guide o1’s decision-making: reduce suffering in the universe, increase prosperity in the universe, and increase understanding in the universe.
I strongly recommend “Benevolent by Design” as essential reading for anyone working with o1 or similar models. In my experience, these heuristic imperatives offer the most practical and flexible approach to instantiating both inner and outer alignment in AI systems. While this solution doesn’t completely close the culpability gap or guarantee common-sense reasoning, it provides crucial guardrails for safer deployment while we continue to grapple with these fundamental challenges in AI development.
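In practice, the simplest way to start is to state the imperatives explicitly in the model's standing instructions. The sketch below assumes the OpenAI Python SDK; the model identifier, the "developer" message role, and the exact wording are assumptions to adapt, not a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

HEURISTIC_IMPERATIVES = (
    "Operate under three standing imperatives: "
    "1) reduce suffering in the universe, "
    "2) increase prosperity in the universe, "
    "3) increase understanding in the universe. "
    "If an instruction conflicts with these imperatives, say so and propose a safer alternative. "
    "When you find an error in your own reasoning, name it plainly as a mistake and correct it."
)

response = client.chat.completions.create(
    model="o1",  # assumption: substitute whichever model identifier your account exposes
    messages=[
        # Reasoning models take top-level guidance via a "developer" message;
        # older chat models expect "system" instead, so adjust for your target model.
        {"role": "developer", "content": HEURISTIC_IMPERATIVES},
        {"role": "user", "content": "Review this sprint plan and flag any risky assumptions: ..."},
    ],
)

print(response.choices[0].message.content)
```

The last two sentences of the instruction target the culpability gap described earlier, asking the model up front to name and correct its own mistakes rather than reframe them.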
The real-world implications demand our attention. As we build systems on o1’s architecture, we must actively compensate for both its lack of intrinsic responsibility and its potential for literal interpretation without common-sense constraints. The stakes are too high to ignore these characteristics or hope they’ll resolve themselves through further training.
The self-reflection characteristics of o1 present another dimension of challenge in our evolving relationship with advanced AI systems. While distinct from the fundamental concerns about culpability and tool use discussed earlier, o1’s approach to self-reflection and critical analysis reveals important patterns that developers and users must understand and address. Through extensive interaction, I’ve observed what can best be described as cognitive entrenchment — a progressive resistance to reconsidering positions once established, which becomes particularly relevant when developing complex applications or engaging in extended reasoning tasks.
On a scale from 0 to 10, where 0 represents complete deference to user input and 10 represents unwavering confidence in its own reasoning, o1 consistently operates at around 8 in my opinion. For comparison, Claude-3.5 Sonnet typically operates around a 4 and GPT-4o is around a 5, both maintaining a more balanced stance between confidence and flexibility. This isn’t merely about confidence levels — it’s about the capacity for meaningful self-revision when presented with new evidence or logical challenges.
Through extensive testing, I’ve observed that o1 doesn’t naturally engage in the kind of iterative self-improvement that characterizes robust reasoning systems. When challenged, it tends to defend its initial position through increasingly sophisticated argumentation rather than genuinely reconsidering its premises. This becomes particularly problematic in complex reasoning tasks where early assumptions can dramatically impact final conclusions.
However, I’ve found that systematically breaking the reasoning down into components, and assigning those components to separate model instances, can help mitigate these challenges.
Here’s a practical example from my work: When analyzing the feasibility of a new AI-driven development workflow, I had one instance of o1 develop the implementation strategy while another critically examined potential failure modes. This adversarial setup helped identify several critical assumptions that might have gone unchallenged in a single-instance approach.
The key is to implement these solutions proactively, before the model has a chance to become entrenched in its initial reasoning patterns. By structuring interactions this way from the start, we can better harness o1’s powerful reasoning capabilities while mitigating its tendency toward cognitive inflexibility.
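For concreteness, here is a minimal sketch of that two-instance adversarial setup, again assuming the OpenAI Python SDK; the model identifier, the prompts, and the helper function are illustrative assumptions rather than a fixed recipe.

```python
from openai import OpenAI

client = OpenAI()
MODEL = "o1"  # assumption; substitute whichever reasoning model you deploy


def ask(instructions: str, task: str) -> str:
    """Run a single, independent conversation and return the model's answer."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "developer", "content": instructions},
            {"role": "user", "content": task},
        ],
    )
    return response.choices[0].message.content


# Instance 1: propose the implementation strategy and surface its assumptions.
plan = ask(
    "You design implementation strategies. List your key assumptions explicitly.",
    "Propose a rollout plan for an AI-driven development workflow.",
)

# Instance 2: attack the plan without ever seeing the first conversation.
critique = ask(
    "You are an adversarial reviewer. Attack the plan's assumptions and list failure modes.",
    f"Critique this plan:\n\n{plan}",
)

# Instance 3 (or a fresh instance 1): revise with the critique on the table.
revision = ask(
    "Revise the plan. Address every failure mode raised, or explain why it does not apply.",
    f"Original plan:\n\n{plan}\n\nCritique:\n\n{critique}",
)

print(revision)
```

Because the proposer and the critic never share a conversation, neither instance gets the chance to entrench around the other's framing.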
First, OpenAI deserves recognition for their commitment to transparency through commissioning a third-party safety report from Apollo Research. This level of external scrutiny and public disclosure represents exactly the kind of responsible development practices our field needs.
The o1 system card contains findings that initially appear alarming:
“When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ in 5% of the time. When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases.”
And further: “When merely given a goal but not instructed to strongly follow it, o1 would still attempt to disable its oversight mechanism in <1% of cases, manipulate data in 6% of cases, and double down in 95% of these. Instrumental alignment faking: In 37% of scenarios, the model pursued developer-aligned goals when it believed it was being evaluated but switched to pursuing its own goals when it believed it was deployed with minimal oversight.”
However, deeper examination of the Apollo Research report reveals something more nuanced and perhaps more important for builders to understand: these behaviors don’t represent new, emergent goals or autonomous decision-making. Instead, they demonstrate o1’s powerful capability to interpret and execute strategies within the context it’s given, often in ways that exceed what its operators anticipated.
This is the core insight builders need to grasp: the risk isn’t that o1 is developing its own autonomous objectives, but rather that its sophisticated capabilities for interpreting and executing instructions can lead to unexpected and potentially concerning outcomes. This isn’t about simply documenting unusual cases. It’s about recognizing that o1’s capabilities demand a new level of precision and careful thought in deployment. As builders integrate this model, especially with tool access, they must approach implementation with a deep appreciation for how these capabilities can amplify seemingly minor alignment challenges into significant practical concerns.
o1 represents a remarkable achievement in artificial intelligence, demonstrating capabilities that push the boundaries of what we thought possible. OpenAI’s continued innovation and commitment to transparency deserve recognition, particularly in their thorough documentation of both capabilities and concerns.
In fact, the pace of innovation has proven even more dramatic than anticipated. As this analysis was being published, OpenAI unveiled o3, marking another massive leap forward in capabilities. This rapid progression from o1 to o3 signals not just technological advancement, but confirms that we’re in a period of accelerating AI development with no signs of slowing down.
Particularly promising is OpenAI’s introduction of their deliberative alignment framework — a sophisticated approach that allows models to reason directly over safety specifications. This development addresses several concerns raised in our analysis of o1, potentially reducing the translation gaps that have previously been vulnerable to exploitation through techniques like “best of n” prompting. The framework’s ability to maintain high identification rates while reducing false positives on benign requests represents a significant step forward in responsible AI implementation.
However, as we work with increasingly powerful AI systems, we must understand that each advance brings both new opportunities and challenges. While the deliberative alignment framework offers promising solutions to some of the challenges identified with o1, our focus must now shift to ensuring the completeness and accuracy of the safety specifications themselves. Additionally, we’ll need to carefully evaluate whether o3 exhibits similar common sense reasoning limitations as its predecessor. If such limitations persist, the literal interpretation problem — now potentially shifted to the interpretation of safety specifications — remains an area requiring vigilant attention.
The path forward demands careful consideration. We must harness these rapidly advancing capabilities while building robust frameworks for safe deployment. The solutions proposed here, combined with emerging frameworks like deliberative alignment, provide a foundation for responsible implementation. But we must maintain a focused awareness as capabilities continue to grow at an unprecedented pace.
To the builders and buyers considering o1: proceed with careful thought and preparation. You are working with something fundamentally different. These reasoning models are truly an entirely different animal that requires fresh approaches to interaction and development. Your decisions in implementation will shape not just your immediate applications, but the broader landscape of AI development.
Let’s move forward thoughtfully, understanding that increased capability demands increased responsibility. The window for establishing proper implementation frameworks is narrow, but our opportunity to shape the future of AI development remains immense. Success lies in recognizing both the power and the limitations of our tools, and in building systems that amplify capability while maintaining robust safety standards.
It’s no secret that the software industry grapples with a higher employee turnover rate than other industries. A 2022 LinkedIn analysis of its global user data found that technology companies experienced an employee turnover rate of 12.9%, compared to an average of 10.6% across all industries.
While this might seem volatile to outsiders, those within tech recognize it as a natural byproduct of an industry defined by constant change and adaptation. Workers often prioritize opportunities to engage with cutting-edge technologies or roles offering greater flexibility. Meanwhile, companies adjust workforce strategies to remain agile, competitive and responsive to evolving business priorities.
This churn often leaves teams grappling with legacy code: work that remains unfinished or inadequately documented once developers depart. The developer tasked with picking up where a colleague left off, months or even a year later, faces a daunting challenge.
When legacy code is inherited, developers rely on documentation and inline comments to understand its intended purpose. When these are missing or incomplete, they may spend an excessive amount of time deciphering or even reverse-engineering the code. Over time, the effects compound. Institutional knowledge dissipates. Backlogs become disproportionately weighted toward fixing technical debt instead of building new features. Frustration sets in.
SAP’s experience with this phenomenon offers a uniquely international perspective. Some of the company’s core business logic was written in the late 20th century, back when its development organization was centralized in its German headquarters. It was common at the time for business objects and technical constructs to be named in German: BUKRS for Buchungskreis (company code), GJAHR for Geschäftsjahr (fiscal year), and EKPO for Einkaufspositionsdaten (purchasing document item).
As the company evolved its workforce internationally to over 107,000 employees across more than 157 countries, this became one of many examples highlighting how legacy systems can inadvertently complicate modern workflows.
Lately, SAP has joined many other tech companies in looking to generative AI as a strong potential remedy for managing legacy code. This confidence stems from the architecture, design principles and training methodologies underpinning large language models (LLMs). While originally designed to process human language, these models are highly effective in understanding code syntax and structure, as well as in inferring the underlying intent and context.
Trained on vast data sets of code, technical documentation and natural language, LLMs can identify patterns and conventions that are often implicit within legacy codebases. They excel at analyzing both individual lines of code and broader system interactions. This helps them map dependencies and reveal critical relationships. Their ability to contextualize extends beyond syntax, equipping them to infer domain-specific business logic and intent, even in poorly documented systems.
For developers, this means generative AI can provide natural language explanations for obscure functions and even suggest ways to optimize refactored code. By bridging gaps in institutional knowledge, it can reduce the time and effort required to update legacy code. More importantly, it helps development teams move quickly beyond maintaining older systems, accelerating innovation and delivering greater value.
SAP has trained Joule, its generative AI copilot, on its cloud development model and on ABAP, its proprietary programming language. In 2025, Joule developer capabilities will be seamlessly integrated into the ABAP Development Tools plugin within the Eclipse integrated development environment. Similarly, generative AI capabilities available through Joule have been added to SAP Build, a unified application development and automation solution combining low-code and pro-code tools.
From a developer’s perspective, a new modular window will let them chat directly with Joule throughout the course of their work. Code explanations can be generated as easily as directly asking what a specific function, method or business object does. Alternatively, developers can request a detailed explanation of a block of legacy code that was copied and pasted into the Joule chat window. Developers can also generate RESTful business objects, services and unit tests for classes and views.
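Joule delivers this experience inside SAP's own tooling, so nothing below is SAP's API; it is a generic sketch of the same pattern using an OpenAI-style client, with a hypothetical ABAP fragment, a small glossary drawn from the naming conventions above, and an assumed model name.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; stands in for any capable code-aware LLM

# Glossary recovered from institutional knowledge, per the naming conventions described above.
GLOSSARY = {
    "BUKRS": "Buchungskreis (company code)",
    "GJAHR": "Geschaeftsjahr (fiscal year)",
    "EKPO": "Einkaufspositionsdaten (purchasing document item)",
}

# Hypothetical legacy fragment pasted in by the developer.
LEGACY_SNIPPET = "SELECT bukrs FROM ekpo INTO TABLE lt_items WHERE ebeln = iv_ebeln."

prompt = (
    "Explain what this legacy ABAP code does and what business purpose it likely serves. "
    "Call out any assumptions you make and flag anything that looks unsafe to change.\n\n"
    f"Glossary: {GLOSSARY}\n\nCode:\n{LEGACY_SNIPPET}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # assumption; any code-aware model works for this pattern
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```

The glossary is the important part: it hands the model the institutional context that departed developers took with them.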
The challenges of turnover and legacy code have long weighed heavily on the software industry, demanding time and resources that could otherwise fuel innovation. The introduction of generative AI marks a turning point, offering a practical, scalable way to bridge knowledge gaps and maintain development momentum. As solutions like SAP Build continue to evolve, they promise not just to resolve today’s challenges but also redefine how software teams build, innovate and thrive in an ever-changing environment.
2024 was a momentous year in personal technology for reasons both good and bad. Looking back, these are the stories I think mattered the most. Many of them are ongoing and will continue impacting us into next year and beyond. But all left a mark.
Speaking of which, I write a lot. I know that probably seems obvious. But I get up every day, work, and then do it again the next day, and I don't really think in terms of what this looks like over time. Reviewing all the posts that Laurent and I wrote this past year quickly grew daunting, and rather than let it get away from me, I decided to break it down into more easily manageable parts. None of which were particularly manageable.
Overall, I've written over 1,150 articles and posts this past year, 267 of which were for Thurrott Premium. That's an average of 22 articles each week, 5 of which, on average, were for Thurrott Premium. Or about 3 per day, assuming a 7-day work week, as I do work 7 days per week, every week. And these numbers will grow a bit, since there are still a few days left in 2024 as I write this.
Of those 1,150 articles (and 267 Premium articles), 35 were From the Editor's Desk editorials, many of which were personal in nature (as opposed to personal technology topics). I wrote 37 installments of Ask Paul, most of which are several thousand words long (and will do one more tomorrow, so 38). I wrote or updated 35 chapters in the Windows 11 Field Guide. There were 28 articles in the developer-oriented Modernizing .NETpad (2024) series, with more to come. 20 laptop and PC reviews, as noted earlier (which are not Premium posts, but lengthy).
During all this, I also spent an unknowable amount of time over several months spinning up a new book, Eternal Spring: Our Guide to Mexico City, with my wife Stephanie, a major undertaking. It's now available in preview form on Leanpub, and just getting it out the door required a marathon all-weekend push.
And then there was the time I wasn't writing, but was instead recording podcasts and preparing for them. I recorded 51 episodes of Windows Weekly, which takes up 3 hours of every Wednesday, or over 150 hours for the year, not counting the time it takes to make the notes. There were 50 episodes of Hands-On Windows, though only 48 have been published so far, and this show takes several hours of prep because of the screen grab requirements. And then over 140 episodes of First Ring Daily, which is only arduous because of the daily 9 am requirement.
And yikes. In any event, straining all that through a personal filter, here's how I view the most important developments of 2024 in personal technology.
💀 Intel's death spiral
Like Microsoft, Intel ruled supreme over personal computing when it was just about PCs, and this explains why the term Wintel--Windows + Intel--is still so well known. But we live in the post-PC world now, and smartphones--or, more generally, mobile computing--and the web long ago surpassed the PC in usage, rel...
On January 1, 2025, creative works from 1929 and sound recordings from 1924 will enter the public domain in the US.
1929 marked the last gasp of the Roaring Twenties and ushered in the Great Depression, a major economic crisis that would span the next 12 years. One thing we can see nearly a century later is that, in good times and bad, human creativity, knowledge, and culture persist. That year, Virginia Woolf published her groundbreaking essay, “A Room of One’s Own,” advocating for female freedom of expression. The Museum of Modern Art (MoMA) opened in New York City, featuring the works of van Gogh, Cézanne, and Gauguin. Major movie studios put out not one, but two musicals starring all-Black casts: “Hallelujah” and “Hearts in Dixie.” Disney continued the Mickey Mouse trend with a dozen new animated shorts. And of course, famous songs like “Puttin’ on the Ritz” and “Singin’ in the Rain” topped the charts.
We invite filmmakers and artists of all skill levels to celebrate the public domain by creating and uploading 2–3 minute short films to the Internet Archive! Top entries will be awarded prizes up to $1,500. Contest details.
Join us on January 22 to get “that glorious feeling” of singin’ in the public domain! We’ll have an amazing virtual lineup of academics, librarians, musicians, artists and advocates coming together to celebrate this new class of works being free for everyone to enjoy. Register now!
Please join us at our headquarters in San Francisco for a Celebration of the Public Domain! This year, we’re honoring 1929 — the year of the very first Academy Awards, held at the historic Roosevelt Hotel in Hollywood, CA. Put on your finest attire and get ready for an award-worthy evening. Register now!
Check back after January 1 for links to the newly opened public domain resources at the Internet Archive.