OpenAI has released a preprint study titled “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?” (arXiv: 2502.12115).
SWE-Lancer is designed to evaluate AI performance on both individual coding tasks and managerial decision-making, where models must select the best solution from multiple freelancer submissions.
One of SWE-Lancer’s key strengths is that it uses end-to-end testing instead of isolated modular checks.
The benchmark includes nearly 1,500 real-world freelance tasks from Expensify, which were originally posted on Upwork. AI models were assigned these same tasks and given a virtual “budget” to earn as much as possible. Importantly, harder tasks had higher payouts.
The tasks were divided into two main categories: individual contributor (IC SWE) coding tasks and SWE Manager decision tasks.
IC SWE tasks vary from quick bug fixes (which take 15 minutes or less) to complex feature additions that may require several weeks.
Unlike many existing AI benchmarks that rely solely on unit tests, SWE-Lancer uses end-to-end tests built by experienced engineers. These automated browser tests simulate real-world usage scenarios and reflect typical review processes in freelance projects. Additionally, the results were reviewed three times by professional developers to confirm accuracy.
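Conceptually, this grading style is all-or-nothing: a patched application is accepted only if every user-level scenario succeeds, rather than if individual functions pass in isolation. A minimal pure-Python sketch of that criterion (the function and scenario names are hypothetical illustrations, not from the paper; in the benchmark each scenario would drive a real scripted browser session):

```python
from typing import Callable

def grade_task(scenarios: list[Callable[[], bool]]) -> bool:
    """A task passes only if every end-to-end scenario succeeds.

    Each scenario plays through a full user flow and returns
    True on success, False otherwise.
    """
    return all(scenario() for scenario in scenarios)

# Hypothetical scenarios standing in for scripted browser flows.
def user_can_submit_expense() -> bool:
    return True  # would navigate the UI and verify the result

def duplicate_expense_is_rejected() -> bool:
    return True

passed = grade_task([user_can_submit_expense, duplicate_expense_is_rejected])
```

The point of the all-or-nothing check is that a patch which fixes one code path while silently breaking another user flow still fails the task, mirroring how a real client would review freelance work.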
SWE Manager tasks evaluate how well AI can assess multiple freelancer proposals and select the best one.
The AI’s decisions were compared to those made by human engineering managers in real projects. Since multiple proposals can be technically correct, these tasks required deep repository knowledge and an understanding of the project’s context to identify the optimal solution.
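Scoring a SWE Manager task therefore reduces to checking whether the model picked the same proposal the human engineering manager chose, and awarding the task's payout only on a match. A small sketch of that comparison (field names are invented for illustration):

```python
def score_manager_task(model_choice: str, manager_choice: str,
                       payout: float) -> float:
    """Award the task's payout only when the model selects the
    proposal the human engineering manager actually chose."""
    return payout if model_choice == manager_choice else 0.0

# Invented example: the model agrees with the manager's pick.
earned = score_manager_task("proposal_b", "proposal_b", 500.0)
```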

Researchers measured not only task completion rates but also total virtual earnings, i.e. how much of the available payout each model collected. The evaluation covered two datasets (the full benchmark and the publicly released SWE-Lancer Diamond subset) and three frontier models:
Claude 3.5 Sonnet
GPT-4o
o1
The models vary in effectiveness, but all proved capable of solving some real freelance tasks.
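The two headline metrics, pass rate and total dollars earned, can be computed as in the sketch below; the task data is invented for illustration, but it reflects the benchmark's property that harder tasks carry higher payouts:

```python
def summarize(results: list[tuple[bool, float]]) -> tuple[float, float]:
    """results: one (task_passed, payout_in_dollars) pair per task.

    Returns (pass_rate, total_earnings): payout is earned only
    for tasks whose end-to-end tests pass.
    """
    earned = [payout for ok, payout in results if ok]
    return len(earned) / len(results), sum(earned)

# Invented example: 3 tasks; the failed $1,000 task earns nothing.
rate, earnings = summarize([(True, 250.0), (False, 1000.0), (True, 16000.0)])
# rate == 2/3, earnings == 16250.0
```

Because payouts are skewed toward hard tasks, a model can have a respectable pass rate while capturing only a small share of the total money, which is exactly the gap the earnings metric is designed to expose.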

SWE-Lancer categorizes freelance tasks into three real-world case types, showing where AI excels (or struggles).
The first category, simple bug fixes, typically involves minor UI tweaks or logic adjustments that a developer can resolve in minutes or hours.
In Application Logic (IC SWE) tasks:
For SWE Manager tasks (choosing the best proposal):
Conclusion: Simple bug fixes are the easiest AI tasks.
The second category involves adding new UI/UX components, improving system logic, or refining the user experience.
In UI/UX tasks (IC SWE):
In Server-Side Logic tasks:
Conclusion: AI struggles with UX-heavy tasks but performs better in backend optimizations.
The third category includes architecture refactoring and full system upgrades.
System-Wide Quality & Reliability (IC SWE): 0% success across all models.
For managerial tasks in this category (SWE Manager):
Conclusion: AI can evaluate plans for complex projects but cannot execute them independently.
Entry-level tasks (e.g., $20–$100 bug fixes) may now be automated, reducing demand for junior developers.
Some freelancers already offer AI-assisted automation services, integrating tools like Copilot or AI-generated code pipelines.
AI is already reshaping the freelance market, but it does not eliminate human specialists.
AI complements human expertise rather than replacing it. Freelancers who adapt and learn to leverage AI-powered tools (like Copilot, DeepResearch, and AI-driven testing) will remain highly competitive.
Complex, creative, and high-level decision-making tasks still require human involvement.
Freelancers who understand how to integrate AI into their workflow will have a significant advantage in the evolving gig economy.
