In lifelong learning, a model learns different tasks sequentially throughout its lifetime. State-of-the-art deep learning models, however, struggle to generalize in this setting and suffer from catastrophic forgetting of old tasks when learning new ones. While a number of approaches have been developed in an attempt to ameliorate this problem, there are no established, unified or generalized frameworks for rigorous evaluations of proposed solutions; a problem which is particularly pronounced in the domain of NLP. The few existing benchmarks are typically limited to a specific flavor of lifelong learning – continual open-set classification – where new classes, as opposed to tasks, are learned incrementally. Moreover, the only general lifelong learning benchmark combines a multi-label classification setup with a multi-class classification setup resulting in misleading gradients during training. We empirically demonstrate that the catastrophic forgetting observed here can be attributed to the experimental design rather than to any inherent modeling limitations. To address these issues, we propose an experimental framework for true, general lifelong learning in NLP. Using this framework, we develop a comprehensive suite of benchmarks that target different properties of lifelong learning (e.g., forgetting or intransigence); experiment with diverse facets of language learning: multi-domain, multilingual and different levels of linguistic hierarchy; and present a continuous evaluation scheme under a new metric: Area Under the Lifelong Test Curve. Our framework reveals shortcomings of prevalent memory-based solutions, demonstrating they are unable to outperform a simple experience replay baseline under the realistic lifelong learning setup.