Natural Language Processing: A Pragmatic Approach

Anuj Gupta, Bodhisattwa Majumder, Harshit Surana, Sowmya Vajjala

From its inception until very recently, Natural Language Processing (NLP) was primarily the domain of academia and research labs, requiring long formal education and training. The past decade’s breakthroughs have led to NLP being used increasingly in domains as diverse as retail, healthcare, finance, law, marketing and human resources. With this exploding usage, a growing proportion of the workforce building these NLP systems has to grapple with limited experience and theoretical knowledge. This book bridges the divide between formal training and an applied, industrial perspective.

The book covers both building models for various applications from a practical viewpoint and the theoretical foundations behind them. It is divided into four sections: it starts with an introduction, goes deep into the fundamentals, then covers a gamut of domains, and wraps up with chapters focused on production-ready systems and processes. For the curious reader, the book is peppered with references to foundational and cutting-edge works. We also cover the more abstract cases, pragmatic tips and war stories that may be valuable to a wider audience, including technical leaders and managers building products around NLP.

The authors hail from Carnegie Mellon, UC San Diego (current NLP Ph.D. student), U of Tübingen, and the Indian Institutes of Technology. They have built and deployed NLP and ML systems in both academia and industry, including Silicon Valley startups, Fortune 100 companies, the MIT Media Lab and Microsoft Research. Their experience also includes teaching NLP courses at US universities as an Assistant Professor and publishing dozens of research papers in the field, with hundreds of citations. The book distills the authors’ collective wisdom on building and scaling NLP systems. The book is also being advised by researchers and scientists from some of the top universities and technology giants in the world.

Table of Contents

More about the Book

The book aims to give the reader a quick overview, followed by in-depth knowledge and theoretical background.

To make the material easier to absorb, we have structured each chapter into several sections. In ‘Essentials’ (Section 2), each chapter begins with background, history and applications. This is followed by a basic algorithm as part of a first code walkthrough. We then delve into the theoretical foundations behind it and move on to more sophisticated algorithms and models. We wrap up with a glimpse of cutting-edge techniques and results. For instance, in the chapter on Text Classification, we begin with Naive Bayes as the first baseline. We then continue improving the solution with algorithms like SVM and FastText. The 360-degree view is completed with practical tips and a glimpse of state-of-the-art methods like word- and character-level CNNs and RNNs.
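To give a flavor of what such a first baseline might look like, here is a minimal sketch of a multinomial Naive Bayes text classifier in plain Python. It is not taken from the book's code repository: the class name `NaiveBayesBaseline`, the whitespace `tokenize` helper, and the tiny review dataset are all hypothetical and exist only for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Illustrative assumption: lowercase and split on whitespace only.
    return text.lower().split()


class NaiveBayesBaseline:
    """Hypothetical multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        # Tokens never seen in training are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for this sketch.
train_texts = [
    "great phone with a fantastic camera",
    "love this laptop, fast and reliable",
    "terrible battery, broke after a week",
    "awful screen and poor build quality",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesBaseline().fit(train_texts, train_labels)
prediction = model.predict("fantastic camera and fast")
print(prediction)
```

A baseline like this is quick to build and gives a reference point against which stronger models such as SVM or FastText can be compared.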

Rather than going deep and vertical as in Section 2, in ‘Applied’ (Section 3) we traverse a range of topics horizontally to build a comprehensive understanding of how to leverage the knowledge obtained in earlier sections. For instance, in the chapter on E-commerce and Retail, we cover a range of problems including attribute extraction, aspect-level review analysis, duplicate product detection, faceted search and ranking, aspect identification in product descriptions, and finding substitute and complementary items. We also compare and contrast generic techniques with domain-specific ones. For example, we highlight the features and implementation of an e-commerce search engine as opposed to a generic search engine.

The topics covered in this book were chosen based on surveys conducted at various technical workshops and panel discussions that the authors have been a part of. All the algorithms, techniques, datasets and technologies are supported with extensive references so that the reader can dig deep into the details. Throughout the book, we cover practical tips and best practices for building and deploying these models. Last but not least, every chapter ends with a ‘cutting edge’ section where the state of the art is discussed.

The book will be around 350 pages. It will be accompanied by a code repository containing several Jupyter notebooks for all the chapters, which walk through and explain the code in detail. The code base is written in Python and uses various machine learning and natural language processing libraries. The book assumes that readers have a good grasp of programming but no theoretical or practical knowledge of NLP.

Commonly Asked Questions