Best Practices

This guide provides a set of best practices to help you transition from prototype to production. Whether you're an experienced machine learning engineer or a newcomer to the field, it will equip you with what you need to run the platform in a production environment: from securing access to our API to designing robust architectures that can handle high traffic. Use this guide to plan a deployment that goes as smoothly and effectively as possible.

Factors Affecting Latency and Mitigation Suggestions

Here are the main factors affecting latency, along with possible mitigation strategies:

Model Selection

We offer models of varying capability and versatility. The most capable models, such as gpt-4, generate output more slowly but produce richer, more nuanced results. Simpler models such as gpt-4o-mini respond faster but may be slightly less accurate or relevant. Choose a model appropriate to your use case.
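
If you are unsure which model fits your latency budget, a quick way to compare candidates is to time the same request against each one. The sketch below uses the official openai Python client; the base_url and API key are placeholders for your own JuheNext credentials:

```python
import time
from openai import OpenAI

# Placeholder endpoint and key; substitute your actual JuheNext values.
client = OpenAI(base_url="https://api.juhenext.example/v1", api_key="sk-...")

def time_model(model: str, prompt: str) -> float:
    """Return wall-clock seconds for one completion with the given model."""
    start = time.perf_counter()
    client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return time.perf_counter() - start

prompt = "Summarize the benefits of request batching in one paragraph."
for model in ("gpt-4", "gpt-4o-mini"):
    print(f"{model}: {time_model(model, prompt):.2f}s")
```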

Number of Completion Tokens

Generating a large number of tokens increases latency. Recommendations (a concrete sketch follows the list):

  • Lower the max_tokens parameter to reduce latency.
  • Use stop sequences to prevent generating unnecessary tokens.
  • Reduce the values of n and best_of: n is the number of completions generated per prompt, and best_of is the number of server-side candidates from which the best completion is selected.
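
Here is a minimal sketch applying the recommendations above. It uses the legacy Completions endpoint, since best_of only applies there; the model name and endpoint are assumptions, not platform guarantees:

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your actual JuheNext values.
client = OpenAI(base_url="https://api.juhenext.example/v1", api_key="sk-...")

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # assumed completions-capable model
    prompt="List three uses of a hash table:",
    max_tokens=100,   # cap output length to bound latency
    stop=["\n\n"],    # stop early once a blank line is generated
    n=1,              # one completion per prompt
    best_of=1,        # no extra server-side candidates to rank
)
print(response.choices[0].text)
```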

Streaming

Setting stream: true lets the model return tokens as soon as they are generated, reducing the time to first token. It doesn't change the total time needed to generate all tokens, but for applications that want to show partial progress it is a good way to improve the user experience.
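
A minimal streaming sketch with the openai Python client (endpoint and key are placeholders); each chunk carries the newly generated tokens, so output can be rendered as it arrives:

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your actual JuheNext values.
client = OpenAI(base_url="https://api.juhenext.example/v1", api_key="sk-...")

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Explain streaming in one paragraph."}],
    stream=True,  # tokens are returned as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # the final chunk carries no content
        print(delta, end="", flush=True)
print()
```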

Infrastructure

Our servers are located in Hong Kong. To reduce communication latency with the JuheNext servers, we recommend placing the relevant parts of your infrastructure in mainland China, Hong Kong, or the surrounding region.

Batching

Depending on your use case, batching may reduce the number of requests you need to make. The prompt parameter accepts up to 20 unique prompts per request. Test whether this approach is effective for you: in some cases the total number of tokens generated increases, which lengthens response times.
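
As a sketch of what batching might look like, the example below sends two prompts in a single request via the legacy Completions endpoint, which accepts a list in its prompt parameter; the model name and endpoint are assumptions:

```python
from openai import OpenAI

# Placeholder endpoint and key; substitute your actual JuheNext values.
client = OpenAI(base_url="https://api.juhenext.example/v1", api_key="sk-...")

prompts = [
    "Translate 'hello' into French:",
    "Translate 'goodbye' into French:",
]

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # assumed completions-capable model
    prompt=prompts,                  # one request, multiple prompts
    max_tokens=20,
)

# Completions are not guaranteed to be in order; match them by index.
for choice in sorted(response.choices, key=lambda c: c.index):
    print(prompts[choice.index], "->", choice.text.strip())
```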

Cost Management

When putting a prototype into production, you should budget for the cost of running your application. For billing purposes, 1,000 tokens correspond to roughly 750 words. To estimate costs, you need to predict token usage, taking into account traffic levels, how frequently users interact with your application, and the amount of data processed.
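
One way to predict token usage is to count tokens locally before sending a request. The sketch below assumes a recent version of the tiktoken library; the per-1,000-token rate is a placeholder, and remember that completion tokens are billed as well:

```python
import tiktoken

def estimate_prompt_cost(text: str, model: str, price_per_1k: float) -> float:
    """Rough cost estimate from the prompt's token count alone."""
    encoding = tiktoken.encoding_for_model(model)
    num_tokens = len(encoding.encode(text))
    return num_tokens / 1000 * price_per_1k

prompt = "Summarize the quarterly report in three bullet points."
# 0.0005 is a placeholder rate; substitute your plan's actual pricing.
print(f"Estimated cost: ${estimate_prompt_cost(prompt, 'gpt-4o-mini', 0.0005):.6f}")
```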

You can reduce costs by lowering token usage or choosing smaller models. Token usage can be reduced by shortening prompts, fine-tuning a model, or caching common user queries.
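
Caching is the simplest of these to sketch. The example below memoizes answers to identical questions with functools.lru_cache, so repeated queries never reach the API; a production cache would likely also need query normalization and expiry:

```python
from functools import lru_cache
from openai import OpenAI

# Placeholder endpoint and key; substitute your actual JuheNext values.
client = OpenAI(base_url="https://api.juhenext.example/v1", api_key="sk-...")

@lru_cache(maxsize=1024)
def cached_answer(question: str) -> str:
    """Only the first occurrence of a question costs tokens."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

cached_answer("What are your support hours?")  # calls the API
cached_answer("What are your support hours?")  # served from the cache
```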

Use OpenAI's interactive tokenizer tool to estimate token counts. After testing with the most capable model, try other models to see whether they can produce the same results with lower latency and cost.

Security and Compliance

When going into production, you need to assess and address any security and compliance requirements that apply to you. This includes securing the data you handle, understanding how the API processes data, and knowing which regulations you must follow. See the security and compliance documentation for details.

Common considerations include how data is stored, transmitted, and retained. You may need to implement data privacy protections (such as encryption or anonymization) and follow secure-coding best practices.
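
As one example of anonymization, you might replace user identifiers with a keyed hash before logging them or attaching them to API requests. This is a sketch under stated assumptions, not a complete privacy solution; key management is up to you:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-me"  # placeholder; load from a secrets manager

def pseudonymize(user_id: str) -> str:
    """Map a user identifier to a stable pseudonym via a keyed hash.

    Without SECRET_KEY, the original identifier cannot be recovered.
    """
    return hmac.new(SECRET_KEY, user_id.encode(), hashlib.sha256).hexdigest()[:16]

print(pseudonymize("alice@example.com"))  # stable pseudonym per user
```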

Security Best Practices

When developing your application, follow security best practices to keep it both safe and successful. Test your product extensively, address potential issues proactively, and limit opportunities for misuse.