GPT

Precursors

Proximal Policy Optimization (PPO) - a reinforcement learning (RL) algorithm that performs comparably to or better than state-of-the-art approaches while being much simpler to implement and tune. It became the default reinforcement learning algorithm at OpenAI.
Learning from human preferences (human in the loop) - a method that infers what humans want by asking them which of two proposed behaviors is better.
InstructGPT - a family of models that are arguably better at following user intentions than GPT-3, while also being more truthful and less toxic. They are trained with the human-in-the-loop approach above.
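
The first two ideas above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's implementation: the function names, the default `clip_eps=0.2`, and the use of a logistic (Bradley-Terry style) model for pairwise comparisons are assumptions chosen for clarity. The first function computes PPO's clipped surrogate objective from probability ratios and advantage estimates; the second computes the loss a reward model would incur on one human comparison.

```python
import math


def ppo_clipped_objective(ratios, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for the same sample
    clip_eps      = clipping range (0.2 is an assumed, commonly used default)
    """
    total = 0.0
    for ratio, adv in zip(ratios, advantages):
        # Keep the ratio inside [1 - eps, 1 + eps] so one update
        # cannot move the policy too far from the old one.
        clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(ratios)


def preference_loss(reward_preferred, reward_rejected):
    """Cross-entropy loss for one pairwise human comparison.

    Assumes P(preferred beats rejected) = sigmoid(r_preferred - r_rejected).
    """
    p = 1.0 / (1.0 + math.exp(-(reward_preferred - reward_rejected)))
    return -math.log(p)
```

In this sketch, a reward model trained by minimizing `preference_loss` over many comparisons would supply the signal from which the advantages passed to `ppo_clipped_objective` are estimated, which is roughly how the two techniques are combined in human-in-the-loop fine-tuning.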
