Having trouble setting up evaluation tools for my locally deployed application - any solutions?

I’m running into several roadblocks while trying to set up proper testing and monitoring for my self-hosted setup. First, my Railway deployment won’t connect to LangSmith no matter what I try. I’ve checked the docs and tried different configuration settings but nothing seems to work. Second, I can’t find any way to integrate with Arize for monitoring and observability. Their documentation doesn’t mention self-hosted environments at all. Third, the n8n workflow evaluation tool has this annoying limitation where you can only run one evaluation per workflow, which makes it pretty much useless for any serious testing. Has anyone found good alternatives or workarounds for these issues? I need reliable evaluation tools that actually work with self-hosted applications.

Been there with my self-hosted setup. Railway to LangSmith issues? Usually network problems - check if your Railway app can hit external APIs. I wasted hours on configs when it was just a firewall rule.
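If you want something a bit more structured than a one-off curl, here's a minimal connectivity check you can run inside the Railway container. The LangSmith API host below is my assumption of the endpoint you'd be hitting; swap in whatever URL your config points at.

```python
# Quick outbound-connectivity check -- run this inside the Railway container
# to see whether the LangSmith API is reachable at all before touching configs.
# The URL below is an assumption; use whatever endpoint your setup targets.
import urllib.request
import urllib.error

def can_reach(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP request to `url` gets any response at all."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # Got an HTTP status back (e.g. 401/403) -- the network path works,
        # so your problem is auth/config, not the firewall.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, timeout, or blocked egress.
        return False

# Example: can_reach("https://api.smith.langchain.com")
```

If this returns False, stop fiddling with env vars and fix the network path first.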

Skip Arize if they don’t support self-hosted. I switched to Prometheus + Grafana and never looked back. More flexible, you own the data, and takes about 2 hours to get decent dashboards running.
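To give a feel for what Prometheus actually scrapes: here's a dependency-free sketch of a `/metrics` endpoint in Prometheus's plain-text exposition format. In a real app you'd almost certainly use the official `prometheus_client` package instead; the counter names here are made up.

```python
# Minimal Prometheus-style metrics endpoint using only the stdlib.
# Real apps should use the official `prometheus_client` package; this
# hand-rolled version just shows the text format Prometheus scrapes.
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"app_requests_total": 0, "app_errors_total": 0}  # example counters

def render_metrics(counters: dict) -> str:
    """Render counters in Prometheus's plain-text exposition format."""
    lines = []
    for name, value in counters.items():
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics(METRICS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# To serve: HTTPServer(("", 9100), MetricsHandler).serve_forever()
# then point a Prometheus scrape job at http://<host>:9100/metrics
```

Once Prometheus is scraping that, Grafana dashboards on top are straightforward.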

That n8n thing drove me nuts too. Built a simple Python script that hits workflows with different test cases. Not pretty but works. You can duplicate workflows for test scenarios if you don’t mind doing it manually.
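For anyone who wants to build the same kind of script: here's a rough sketch of a test-case runner that POSTs payloads to an n8n Webhook-trigger URL and diffs the JSON response against expectations. The webhook path, payload keys, and expected fields are all placeholders for whatever your workflow actually takes and returns.

```python
# Rough sketch of a test-case runner for an n8n workflow exposed via a
# Webhook trigger node. The webhook URL and the expected fields are
# placeholders -- adapt them to your workflow's actual input/output shape.
import json
import urllib.request

WEBHOOK_URL = "http://localhost:5678/webhook/my-workflow"  # hypothetical path

def check_result(actual: dict, expected: dict) -> list:
    """Return a list of mismatch descriptions (empty list == pass)."""
    problems = []
    for key, want in expected.items():
        got = actual.get(key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, got {got!r}")
    return problems

def run_case(payload: dict, expected: dict) -> list:
    """POST one test payload to the workflow and check its JSON response."""
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return check_result(json.loads(resp.read()), expected)

CASES = [
    ({"input": "hello"}, {"status": "ok"}),
    ({"input": ""}, {"status": "error"}),
]

# for payload, expected in CASES:
#     print(payload, "->", run_case(payload, expected) or "PASS")
```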

Most SaaS evaluation tools expect cloud deployments anyway. For self-hosted, stick with open source stuff you can actually control.

Railway deployments mess up LangSmith connections because of environment variable issues. Check that your API keys are set at the service level - Railway won’t pick up your local env file.
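A quick way to verify this is a sanity check inside the running process itself. The variable names below (`LANGCHAIN_API_KEY`, `LANGCHAIN_TRACING_V2`) are the ones LangSmith's docs typically use; adjust if your setup names them differently.

```python
# Sanity check: confirm the LangSmith env vars actually reached the running
# process (i.e. Railway service-level variables, not your local .env file).
# Variable names are the ones LangSmith's docs typically use -- adjust as needed.
import os

def missing_env(required: list) -> list:
    """Return the names from `required` that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]

REQUIRED = ["LANGCHAIN_API_KEY", "LANGCHAIN_TRACING_V2"]

# missing = missing_env(REQUIRED)
# if missing:
#     raise SystemExit(f"Missing env vars: {missing}")
```

Run that at startup and you'll know immediately whether Railway is injecting the keys or silently dropping them.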

If you can’t use Arize, try OpenTelemetry with Jaeger. Takes about half a day to set up but you’ll get proper tracing for self-hosted stuff. Learning curve’s worth it since you control everything.
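For the Jaeger side, a docker-compose sketch like this gets you a local all-in-one instance with OTLP ingest that your OpenTelemetry exporter can point at. Ports are Jaeger's standard defaults; the `COLLECTOR_OTLP_ENABLED` flag is only needed on some older Jaeger versions, so check the docs for yours.

```yaml
# docker-compose sketch: local Jaeger all-in-one with OTLP ingest.
# Port numbers are Jaeger's standard defaults; verify against your version's docs.
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    environment:
      - COLLECTOR_OTLP_ENABLED=true   # needed on some older Jaeger versions
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP gRPC (point your OpenTelemetry exporter here)
      - "4318:4318"    # OTLP HTTP
```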

For n8n’s limitations, run multiple workflow instances with different configs. Make template workflows and trigger them programmatically with different parameters. Not perfect but works for testing.
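Triggering those template copies programmatically can look something like this: each duplicated workflow sits behind its own Webhook path, and you fire them concurrently with different parameter sets. The base URL, paths, and payload keys are all hypothetical.

```python
# Sketch: fire several duplicated n8n template workflows in parallel, each
# behind its own Webhook trigger path, with a different parameter set.
# The base URL, paths, and payload keys are placeholders.
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "http://localhost:5678/webhook"  # hypothetical n8n instance

CONFIGS = [
    ("variant-a", {"model": "small", "threshold": 0.5}),
    ("variant-b", {"model": "large", "threshold": 0.8}),
]

def trigger(path: str, params: dict) -> int:
    """POST params to one workflow's webhook; return the HTTP status code."""
    req = urllib.request.Request(
        f"{BASE}/{path}",
        data=json.dumps(params).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.status

def run_all(configs, fn=trigger):
    """Run every (path, params) pair concurrently; return statuses by path."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {path: pool.submit(fn, path, params) for path, params in configs}
    return {path: f.result() for path, f in futures.items()}

# print(run_all(CONFIGS))
```

Keeping `fn` injectable also lets you dry-run the harness without a live n8n instance.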

Had the same headaches when I switched to self-hosted last year.

For LangSmith - check Railway’s outbound network settings first. Some regions block API endpoints by default. Just run a quick curl test to see if it connects.

For monitoring, ditch Arize and try Weights & Biases instead. Their self-hosted support is way better and the local deployment actually works. Setup docs don’t suck either.

Evaluation-wise, I’d build something custom with pytest and basic logging. Yeah, it’s reinventing the wheel, but you get full control over test scenarios and can run stuff in parallel. Spent a weekend on it and it crushes n8n’s limitations.
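To make the "custom harness" idea concrete: here's one possible shape using only the stdlib (parameterized cases, parallel execution, basic logging) that you could later wrap in pytest tests. `call_app` is a stand-in for however you actually invoke your self-hosted app, and the cases are placeholders.

```python
# One possible shape of a home-grown eval harness: parameterized cases run
# in parallel with basic logging. `call_app` is a stand-in for however you
# invoke your self-hosted app (HTTP call, CLI, direct import, ...).
import logging
from concurrent.futures import ThreadPoolExecutor

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("evals")

def call_app(prompt: str) -> str:
    """Stand-in for your real app invocation."""
    return prompt.upper()  # placeholder behavior

CASES = [
    ("hello", "HELLO"),
    ("world", "WORLD"),
]

def run_case(case):
    """Run one (input, expected) pair and log the outcome."""
    prompt, expected = case
    actual = call_app(prompt)
    passed = actual == expected
    log.info("case=%r passed=%s", prompt, passed)
    return passed

def run_suite(cases, workers=4) -> dict:
    """Run all cases in parallel; return pass/fail counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_case, cases))
    return {"passed": sum(results), "failed": len(results) - sum(results)}

# run_suite(CASES)
```

Swap `call_app` for a real HTTP call and you have roughly the weekend project described above.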