How AI is Transforming Freelancing: Facts and Real-World Cases

For years, Expensify has been outsourcing a significant portion of its frontend and backend tasks via Upwork, offering monetary rewards to anyone who successfully completes an assignment. Freelancers are given access to the code when needed—via a fully open repository. These real-world tasks have now become the foundation of the SWE-Lancer benchmark.

The company posts specific tasks (ranging from minor UI fixes to significant mobile app upgrades) along with a set payout. Budgets vary from $20 for simple bug fixes to over $30,000 for complex projects.

Together, these tasks are worth more than $1 million in payouts, and at least $500,000 worth of them have been published openly.

SWE-Lancer: A New Benchmark for AI in Freelancing

OpenAI has released a preprint study titled “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?” (arXiv: 2502.12115).

SWE-Lancer is designed to evaluate AI performance on both individual coding tasks and managerial decision-making, where models must select the best solution from multiple freelancer submissions.

One of SWE-Lancer’s key strengths is that it uses end-to-end testing instead of isolated modular checks.

The benchmark includes nearly 1,500 real-world freelance tasks from Expensify, which were originally posted on Upwork. AI models were assigned these same tasks and scored by how much they could “earn”: completing a task credits the model with that task's payout. Importantly, harder tasks had higher payouts.

Task Categories

The tasks were divided into two main categories:

1. Individual Engineering Tasks (IC SWE tasks)

IC SWE tasks vary from quick bug fixes (which take 15 minutes or less) to complex feature additions that may require several weeks.

Unlike many existing AI benchmarks that rely solely on unit tests, SWE-Lancer uses end-to-end tests built by experienced engineers. These automated browser tests simulate real-world usage scenarios and reflect typical review processes in freelance projects. Additionally, the tests were verified three times by professional developers to confirm accuracy.

2. Managerial Decision-Making Tasks (SWE Manager tasks)

SWE Manager tasks evaluate how well AI can assess multiple freelancer proposals and select the best one.

The AI’s decisions were compared to those made by human engineering managers in real projects. Since multiple proposals can be technically correct, these tasks required deep repository knowledge and an understanding of the project’s context to identify the optimal solution.
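The comparison described above can be sketched in a few lines. This is only an illustration, not the benchmark's actual grader: the `ManagerTask` fields, the `grade_manager_task` helper, and the all-or-nothing payout rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ManagerTask:
    """One SWE Manager task. All field names here are hypothetical."""
    task_id: str
    payout_usd: float
    proposal_ids: list[str]   # competing freelancer proposals
    manager_choice: str       # proposal the human manager actually picked


def grade_manager_task(task: ManagerTask, model_choice: str) -> float:
    """Credit the payout only when the model picks the same proposal
    as the human manager (assumed all-or-nothing rule)."""
    if model_choice not in task.proposal_ids:
        raise ValueError(f"unknown proposal: {model_choice!r}")
    return task.payout_usd if model_choice == task.manager_choice else 0.0


# Made-up example task, not benchmark data.
task = ManagerTask("demo-1", 1000.0, ["A", "B", "C"], manager_choice="B")
print(grade_manager_task(task, "B"))
print(grade_manager_task(task, "A"))
```

Note that a proposal can be technically sound and still earn nothing here, which mirrors the point that the goal is to match the manager's pick rather than to find any working fix.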

Open Data & AI Benchmarks

Researchers assessed not just task completion rates but also total earnings, measuring both:

Effectiveness (how often the model successfully completed a task on the first attempt).
Economic impact (how much money the model could “earn” from the full set of tasks).
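To see why the two measures are reported separately, here is a minimal sketch of how both could be computed from per-task results. The `summarize` function and the demo tasks are hypothetical and are not taken from the benchmark.

```python
def summarize(results):
    """results: list of (payout_usd, passed_first_attempt) tuples."""
    total_value = sum(payout for payout, _ in results)
    earned = sum(payout for payout, passed in results if passed)
    pass_rate = sum(1 for _, passed in results if passed) / len(results)
    return {
        "pass_rate": pass_rate,      # effectiveness
        "earned_usd": earned,        # economic impact
        "total_usd": total_value,    # value of all tasks attempted
    }


# Made-up illustrative tasks, not benchmark data.
demo = [(20, True), (250, False), (1000, True), (8000, False)]
summary = summarize(demo)
print(summary)
```

In this toy example the model passes half of the tasks but collects only about 11% of the available money, because the task it fails is worth far more than the ones it solves.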

The evaluation covered two main datasets:

Diamond Set (IC SWE tasks) – Valued at approximately $236,000
Full Task Set – Exceeding $1 million

AI Model Performance on SWE-Lancer

Claude 3.5 Sonnet

Best overall performer: Earned $58,000 from $236,000 in the Diamond set and $403,000 from $1 million in the full set.
Solved 26.2% of IC SWE tasks (Diamond) and 47.0% of SWE Manager tasks (Full).

GPT-4o

Earned $303,500 from the full task set—less than o1 and Claude 3.5 Sonnet.
Had a low success rate of 8.0% for IC SWE (Diamond) but did considerably better on SWE Manager tasks (38.7%, Full).

o1

Earned $380,000 from the full task set, outperforming GPT-4o.
Task completion rates: 16.5% (IC SWE, Diamond) and 46.3% (SWE Manager, Full)—a middle ground between GPT-4o and Claude 3.5 Sonnet.

AI models vary in effectiveness, but all are capable of solving some real freelance tasks.

Claude 3.5 Sonnet performs best, particularly in managerial decision-making (47% success rate).
GPT-4o struggles the most in IC SWE tasks (only 8%) but fares much better in management-related tasks (38.7%).
o1 is a balanced performer, outperforming GPT-4o but still trailing Claude 3.5 Sonnet in most metrics.
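The earnings quoted above are easier to compare as a share of the money on offer. The sketch below uses the figures from this article and assumes a full-set value of exactly $1,000,000 (the article says only that it exceeds $1 million), so the percentages are approximate.

```python
# Assumption: the full task set is treated as exactly $1,000,000.
FULL_SET_VALUE = 1_000_000

# Full-set earnings as quoted in this article.
earnings = {
    "Claude 3.5 Sonnet": 403_000,
    "o1": 380_000,
    "GPT-4o": 303_500,
}

# Share of the available payouts that each model collected.
shares = {model: earned / FULL_SET_VALUE for model, earned in earnings.items()}

for model, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: ${earnings[model]:,} ({share:.1%} of the full set)")
```

Even the best performer leaves well over half of the total payout unclaimed under this assumption.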

Real-World AI Use Cases in Freelancing

The SWE-Lancer results can be grouped into three real-world case types, showing where AI excels (or struggles).

Small Bug Fixes

Typically involve minor UI tweaks or logic adjustments, fixable in minutes or hours.
In Application Logic (IC SWE) tasks:
- GPT-4o: 8% success
- o1: 15.9% success
- Claude 3.5 Sonnet: 23.9% success
For SWE Manager tasks (choosing the best proposal):
- GPT-4o: 36.3%
- o1: 42.3%
- Sonnet: 45.8%

Conclusion: Simple bug fixes are among the more approachable tasks for AI, yet even the best model solves fewer than one in four of them on its own.

Mid-Level Feature Development

Involves adding new UI/UX components, improving system logic, or refining user experience.
In UI/UX tasks (IC SWE):
- GPT-4o: 2.4% success
- o1: 17.1% success
- Sonnet: 31.7% success
In Server-Side Logic tasks:
- GPT-4o & o1: 23.5%
- Sonnet: 41.2%

Conclusion: AI struggles with UX-heavy tasks but performs better in backend optimizations.

Large-Scale Projects (System-Wide Changes)

Includes architecture refactoring and full system upgrades.
System-Wide Quality & Reliability (IC SWE): 0% success across all models.
For managerial tasks in this category (SWE Manager):
- GPT-4o & Sonnet: 100% (small dataset)
- o1: 50%

Conclusion: AI can evaluate plans for complex projects but cannot execute them independently.
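For a side-by-side view, the IC SWE success rates quoted in the three case types above can be put into one small table and queried. This is just a convenience sketch: the `best_per_category` helper is hypothetical, and the numbers are copied from this article.

```python
# IC SWE success rates (%) as quoted in this article.
ic_swe = {
    "Application Logic": {"GPT-4o": 8.0, "o1": 15.9, "Claude 3.5 Sonnet": 23.9},
    "UI/UX": {"GPT-4o": 2.4, "o1": 17.1, "Claude 3.5 Sonnet": 31.7},
    "Server-Side Logic": {"GPT-4o": 23.5, "o1": 23.5, "Claude 3.5 Sonnet": 41.2},
    "System-Wide Quality & Reliability": {"GPT-4o": 0.0, "o1": 0.0, "Claude 3.5 Sonnet": 0.0},
}


def best_per_category(table):
    """Map each category to (best model, best rate).
    On ties, the first model listed is returned."""
    return {
        category: max(rates.items(), key=lambda kv: kv[1])
        for category, rates in table.items()
    }


best = best_per_category(ic_swe)
for category, (model, rate) in best.items():
    print(f"{category}: {model} at {rate}%")
```

For the system-wide category every model scores 0%, so the "best" entry there is only a tie-break artifact and not a meaningful winner.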

What This Means for Freelancers on Upwork

1. Increased Competition—but Not Entirely

Entry-level tasks (e.g., $20–$100 bug fixes) may now be automated, reducing demand for junior developers.

2. A Growing Market for “AI-Powered Freelancing”

Some freelancers already offer AI-assisted automation services, integrating tools like Copilot or AI-generated code pipelines.

Final Takeaways

AI is already reshaping the freelance market, but it does not eliminate human specialists.

AI complements human expertise rather than replacing it. Freelancers who adapt and learn to leverage AI-powered tools (like Copilot, DeepResearch, and AI-driven testing) will remain highly competitive.

Complex, creative, and high-level decision-making tasks still require human involvement.

Freelancers who understand how to integrate AI into their workflow will have a significant advantage in the evolving gig economy.
