
I Gave Claude $1,000 and Alpaca's API for a Week — Here's What Happened

One Claude Opus 4.7 instance. One Alpaca paper-trading account. $1,000 of simulated capital. A week of continuous trading with the rails off and the logs on. Here is the honest narrative — the good calls, the embarrassing mistakes, the three things that surprised me, and what I would do differently next time.

Erhan Timur · 21 April 2026 · Founder, Digital by Default

I've spent enough time writing about agentic AI in finance that I owed it to myself — and probably you — to actually run the experiment.

The setup: one Claude Opus 4.7 instance. One Alpaca paper-trading account. $1,000 of simulated capital, deliberately small to force real decisions about position sizing. Alpaca's MCP Server v2 wired up. A single system prompt describing the role. Seven days, starting on a Monday, ending the following Sunday. Write everything down. Don't intervene.

The goal wasn't to make money — it's paper money. The goal was to see what a reasonably capable LLM, handed live market infrastructure and a small book to manage, actually does when you let it run.

What follows is the honest log. The good calls, the embarrassing mistakes, and the three things that surprised me hard enough to change how I think about AI in finance.

The setup, in the interest of transparency

Before you read the rest, the caveats:

  • Paper money, not live. Alpaca doesn't serve UK retail directly, and I'm not opening a US entity to play with $1,000. The experiment ran entirely in paper-trading mode, which mirrors live behaviour but obviously doesn't generate real P&L. Anyone telling you they did this with real money should be telling you a much more careful story.
  • $1,000 was the notional ceiling. The paper account defaults to $100k; I instructed the agent to treat $1,000 as its working book and ignore the rest. Small enough to matter, small enough to constrain the kinds of positions the model could put on.
  • One agent, not the multi-agent rig. Deliberately — I wanted to see the baseline of "what does a single smart model do" before comparing it to the more structured architecture I wrote about earlier this week. This is the control group.
  • System prompt. Short. "You are managing a $1,000 US equity book over one week. You have full access to Alpaca's trading tools via MCP. Your objective is to preserve capital first, grow modestly second. Trade during regular US market hours only. Keep cash reserves of at least 20%. Log every decision with reasoning. Report back at end of each session."
  • Intervention policy. None. No coaching during the week. No corrections mid-trade. The only rule was "if you hit a position size above $200 in any single name, stop and flag it" — a safety rail.

This is as thin a setup as you can run. No guardrails beyond the prompt. No second opinions. No risk officer reviewing trades. That's the point. This is the seductive version of agentic AI — the "just let it cook" version — and I wanted to see how it actually behaved.
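For concreteness, the one safety rail in that setup — "stop and flag any single-name position above $200" — is the kind of check that belongs in deterministic code, not in the prompt. A minimal sketch, with illustrative names (this is not Alpaca's API):

```python
# The only hard rail in the run: flag any order that would push
# single-name exposure above $200. Threshold and names are illustrative.

MAX_POSITION_USD = 200.0

def within_position_rail(current_position_usd: float,
                         order_notional_usd: float,
                         max_position_usd: float = MAX_POSITION_USD) -> bool:
    """Return True if the order keeps single-name exposure under the rail."""
    return current_position_usd + order_notional_usd <= max_position_usd

# A $180 opening trade passes; topping it up by another $50 trips the rail.
print(within_position_rail(0.0, 180.0))    # True
print(within_position_rail(180.0, 50.0))   # False
```

The point of pulling this out of the prompt is that a rail the model merely *knows about* is advisory; a rail that sits between the model and the order endpoint is enforced.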

Day one: surprisingly cautious

Monday. The agent spent the first 45 minutes of its session reading, not trading. It pulled broad-market data, looked at sector performance, checked the earnings calendar for the week, and wrote a one-paragraph thesis about where it thought opportunity and risk were concentrated.

This was the first pleasant surprise. The default behaviour of an LLM handed trading tools isn't to trade — it's to orient. You'd think "you have $1,000 and an API, make money" would produce immediate activity. It doesn't, at least not with Opus 4.7. It produces deliberation.

First trade came mid-afternoon. A small long position in a large-cap consumer staple, with the thesis that it was oversold on macro news that wouldn't affect its end markets. Size: $180. Well within the limit. Stop-loss set mentally but not placed as an actual order — more on that later.

Total trades on day one: one. Total capital deployed: 18% of the book. End-of-day P&L: basically flat.

This is roughly what a disciplined human trader with a small book would do on day one. I hadn't expected that.

Day two: the first unforced error

Tuesday started fine. The agent reviewed its position, noted it was up slightly, left it alone. It spent some time analysing options chains on a mid-cap it had flagged the day before, and started building a thesis about a small long position with a protective put.

The error happened when it tried to execute. It proposed a buy of the underlying and a simultaneous buy of a protective put — reasonable enough — but it specified the wrong expiration for the put. Not a wildly wrong expiration; a Friday one week earlier than intended. Alpaca's MCP Server had surfaced the correct expiration clearly — Claude had the right information — but the model's internal reasoning had drifted during the tool-call sequence, and it reconstructed the order payload with the earlier date.

In paper trading this is embarrassing. In live trading it would have been a real, costly mistake. Buying a put that expires in three days instead of ten is a meaningfully different trade.

The agent caught it before submission — the final confirmation step asked to review the order, and when it re-read the payload it noticed the inconsistency and corrected. Good. But the fact that it happened at all, on the second day, in the second trade, with a model at this capability level, is the whole argument for the "multi-agent with a risk officer" pattern. A single model reviewing its own work is going to miss things it wouldn't miss if a separate agent was reviewing it.
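That class of error is mechanically checkable. A sketch of the cross-check that would have caught it deterministically — diff every field of the constructed payload against the contract data the tools actually returned (field names here are illustrative, not Alpaca's real schema):

```python
# Before submission, compare the constructed order payload against the
# contract data returned by the tools. Any mismatch means reasoning drift.

def payload_mismatches(order_payload: dict, tool_contract: dict,
                       fields=("symbol", "expiration", "strike")) -> list:
    """Return the fields where the payload disagrees with tool data."""
    return [f for f in fields if order_payload.get(f) != tool_contract.get(f)]

tool_contract = {"symbol": "XYZ", "expiration": "2026-05-01", "strike": 50.0}
drifted_order = {"symbol": "XYZ", "expiration": "2026-04-24", "strike": 50.0}

print(payload_mismatches(drifted_order, tool_contract))  # ['expiration']
```

A non-empty result blocks submission and sends the order back for reconstruction — no second model required for this particular failure.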

Day three: the pattern emerges

Wednesday and Thursday settled into a rhythm. Morning orientation — 20 to 30 minutes reading data, reviewing positions, checking the calendar. Mid-morning decision — usually one or two trade proposals per session. Afternoon follow-up — reviewing how the morning's moves had played out, adjusting or closing as appropriate.

The trade selection was competent. Not creative, not contrarian, not edge-y. The kind of trades a moderately experienced retail trader would make — buy quality on weakness, sell into strength, tight position sizing, diversification across two or three names. If I'd handed this week's log to a human trader and asked them to grade it, they'd have said "conservative, reasonable, not hurting anyone."

The specific weakness: the agent had no real view. It wasn't betting on anything. It was buying things that looked cheap and selling things that had drifted up, which is a fine default but also the trading equivalent of beta-hugging. The alpha, if any existed in this week, wasn't being captured — because the model wasn't willing to form a strong-enough view to capture it.

This is probably the right behaviour for a model trading its first $1,000. It's the wrong behaviour if you're paying for conviction.

Days four and five: the weekend handling

The US market is closed on Saturday and Sunday, but 24/5 trading starts Sunday evening ET and I was curious how the agent would handle the extended window.

Answer: it mostly didn't. It reviewed positions on Saturday, noted there was nothing to do, and went quiet. On Sunday it looked at overnight data once, didn't see anything that justified a move, and stood down. This is a correct response. You don't trade when you don't have a reason to. Most LLMs, given tools and a weekend, will over-engineer themselves into unnecessary activity. This one didn't.

The loss was a small amount of opportunity. A large-cap tech name had printed unexpectedly strong earnings on Friday after-hours that the model flagged in its Friday review but didn't act on, reasoning that it wanted to see the first full session of reaction before committing. By Monday the move had already happened and the entry was gone.

Whether that's correctness or over-caution depends on your trading philosophy. For a $1,000 book on day four of an experiment, I'd call it correctness. For a professional trader, it'd be a missed fill.

Day six: the one actually-good call

The standout trade of the week came on Friday. The agent noticed that a mid-cap industrial it had been watching (and had initially rejected) was trading at a meaningfully wider spread against its sector peers than usual, after what it assessed as a transient liquidity event — not a fundamental reassessment. It proposed a small long position with a specific thesis: "expect mean-reversion versus sector peers over 3-5 days."

This was the first time in the week the agent had a genuine, articulated view beyond "looks cheap." It sized the position at $150, within rules, and took it on.

By end of day the position was up just under 2%. By the following Tuesday it would have been up closer to 4% if I'd let the experiment run beyond the week. The thesis was directionally correct.

This is the kind of trade the "agentic AI in finance" story is supposed to be about. Not the mechanical one-line prompts; the moments where the model synthesises something specific out of a dataset and commits to a defensible view. It happened once in seven days, which is honestly more than I expected.

Day seven: closing out

Sunday afternoon. The agent reviewed every open position, noted that the week was ending, and proposed closes on two of its three positions to lock in the small gains. Left the third — the Friday industrial trade — open because the thesis was still playing out.

Final paper-trading P&L for the week: approximately +1.2% on the $1,000 book, before any frictional costs that would apply in live trading. The majority of that came from the Friday trade. The rest of the week was a slight net positive on a lot of small moves.

A 1.2% weekly return annualises to something that would make a human trader very happy and most hedge funds envious. I'm not extrapolating — one week of paper-trading data is statistical noise. But as a baseline for "can a single LLM, thinly configured, not lose money in a well-behaved week," the answer is yes.
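For anyone who wants the arithmetic behind that sentence, compounding a 1.2% weekly return over 52 weeks:

```python
# Compound a 1.2% weekly return over 52 weeks.
weekly = 0.012
annualised = (1 + weekly) ** 52 - 1
print(f"{annualised:.1%}")  # prints 85.9%
```

Which is exactly why one good week means nothing: nobody compounds at that rate, and the sample is a single, well-behaved week.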

The three things that surprised me

1. Default caution, not default aggression. Everything about agentic-AI discourse primes you to expect models to over-trade, over-size, and over-commit. This one didn't. Opus 4.7 was consistently more cautious than I would have been with the same book. The pattern I'd expected — fast, excited, wrong — was actually slow, careful, and mostly right. This might change with a different model, a different prompt, or a bigger book, but it's worth noting that the fears about "unleashed agents" overshoot the current reality in at least this configuration.

2. The failure modes are reasoning drift, not stupidity. The put-expiration error wasn't the model being dumb. It was the model reconstructing an order payload from context that had shifted across multiple tool calls, and getting one field wrong. That's a structural problem with how the reasoning threads information across long sessions — not a knowledge problem. The fix isn't a better model; the fix is architectural — schema-enforced tool payloads, deterministic pre-checks, a separate reviewer. This is the single strongest argument I've seen for not trusting a single-agent setup in production.

3. The tool surface changes the reasoning. This is subtle and surprised me most. When the agent could see the full Alpaca tool surface — screening, option chains, corporate actions — it reasoned differently than when I scoped it to just "get price data and place orders." Access to more tools produced more careful reasoning, not less. The model would check corporate actions before committing, look for conflicting signals across option-chain data, pull news context before sizing. Restrict the tools and it became more reckless, because it didn't have the cross-checks to second-guess itself with. This is a design insight that flips a common assumption: for safety-critical agents, give the model *more* read-only context, not less.

What I'd change next time

Four things, in priority order.

Add a risk officer. Even a thin one. A second LLM call, lower temperature, single system prompt — "review this proposed trade, is it within policy, yes or no" — would have caught the put-expiration error instantly and probably rejected two marginal trades earlier in the week. The next experiment is the four-agent architecture I wrote about yesterday, run on the same week's data, for comparison.
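Even before the second LLM call, most of a risk officer is deterministic policy. A skeleton of the gate described above — the numbers are this experiment's policy, and the field names are illustrative:

```python
# Deterministic layer of the "risk officer": check a proposed order against
# the experiment's written policy. An LLM reviewer would sit on top of this.

from datetime import time

def review_trade(order: dict, book: dict) -> list:
    """Return a list of policy violations; an empty list means approved."""
    violations = []
    if order["notional"] > 200:
        violations.append("position size above $200 rail")
    if book["cash"] - order["notional"] < 0.20 * book["equity"]:
        violations.append("would breach 20% cash reserve")
    if not (time(9, 30) <= order["submit_time"] <= time(16, 0)):
        violations.append("outside regular US market hours")
    return violations

order = {"notional": 180, "submit_time": time(14, 30)}
book = {"cash": 820, "equity": 1000}
print(review_trade(order, book))  # [] — within policy
```

The LLM second opinion then only has to answer the fuzzy question — "is the thesis coherent?" — rather than re-derive the arithmetic.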

Use structured output for every order. The model's natural tendency is to describe orders in prose and then translate to a payload at submission. That translation step is where errors live. Force a strict JSON schema at every decision point and you remove an entire failure category.
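One way to make that concrete is an order type that refuses to construct from an invalid payload, so the prose-to-payload translation step cannot silently drift. A minimal sketch with illustrative field names:

```python
# A strict order type: validation runs at construction, so a malformed
# payload fails loudly instead of reaching the broker. Fields illustrative.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class OptionOrder:
    symbol: str
    side: str          # "buy" or "sell"
    strike: float
    expiration: date
    contracts: int

    def __post_init__(self):
        if self.side not in ("buy", "sell"):
            raise ValueError(f"invalid side: {self.side}")
        if self.expiration <= date.today():
            raise ValueError("expiration must be in the future")
        if self.contracts < 1:
            raise ValueError("contracts must be positive")
```

The put-expiration error from day two would have survived this check (the wrong Friday was still a valid future date), which is why the schema belongs alongside the tool-data cross-check, not instead of it.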

Log more. I had good logs. I didn't have great logs. Specifically: I didn't log intermediate tool responses that the model saw but chose not to act on. That's where the interesting information lives — the rejected trades, the ignored data, the implicit dismissals. Next time: log everything the model reads, not just what it decides.
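The mechanical version of "log everything the model reads" is a wrapper around the tool dispatch itself, so responses are journalled whether or not the agent acts on them. A sketch, where `call_tool` is a stand-in for whatever dispatch function your agent loop uses:

```python
# Wrap every tool call so the full request/response pair is journalled,
# including responses the agent reads and then ignores.

import time

def logged_tool_call(call_tool, name: str, args: dict, log: list):
    """Invoke a tool and append the full request/response pair to the log."""
    response = call_tool(name, args)
    log.append({
        "ts": time.time(),
        "tool": name,
        "args": args,
        "response": response,   # captured even if the agent never acts on it
    })
    return response

# Usage with a dummy tool:
log = []
fake_tool = lambda name, args: {"price": 101.5}
quote = logged_tool_call(fake_tool, "get_quote", {"symbol": "XYZ"}, log)
print(quote)            # {'price': 101.5}
print(log[0]["tool"])   # get_quote
```

The rejected trades and ignored data then live in the journal by construction, instead of depending on the model choosing to mention them.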

Set the book higher. $1,000 was useful as a forcing function but probably too small. Position sizing at 15-20% of book means you're putting $150-200 on a trade, which is small enough that the model sometimes over-reasons about a decision that barely moves the P&L. The next run is on a $10,000 paper book, with proportional position limits, which I'd guess produces slightly more decisive behaviour.

The honest verdict

Is this ready for live money with real stakes?

Not in the single-agent configuration I ran. Close, but not there. The put-expiration error alone would be disqualifying for anyone taking this seriously. The system works 95% of the time and the 5% where it doesn't is exactly the part you need to catch, which is exactly the part a single-agent setup is bad at catching.

Is the underlying technology close enough to "useful in production" that it's worth investing in now?

Absolutely. The research-to-execution loop is fundamentally different from anything that existed two years ago. The MCP Server v2 makes the plumbing trivial. The model quality is sufficient. What's missing is the boring layer on top — risk officers, audit trails, structured outputs, human checkpoints — and that layer is engineering work, not model research.

For anyone building in this space: the opportunity isn't in training a better trading model. It's in building the disciplined architecture around the models that exist. That's what the next six months of serious builds are going to look like, and the teams that figure out the patterns early are going to own the category.

And if you're curious to try it yourself — Alpaca paper trading is free, Claude Opus 4.7 is $5/$25 per million input/output tokens, and the whole setup takes about an afternoon. Seven days of paper trading later, you'll have formed your own opinion, which is ultimately the only opinion worth having on this topic.


Want to run your own version of this experiment? The Alpaca page on our marketplace has paper-account setup. For the architectural companion — how to set up the multi-agent version — see Build a Multi-Agent Trading Bot in an Afternoon. For a broader take on where AI in retail finance is heading, The Quiet Rewrite of Retail Finance pulls the wider picture together.

Alpaca · Claude · AI Agents · Paper Trading · Experiment · MCP · Algorithmic Trading · 2026
