Get a solid grounding in Apache Oozie, the workflow scheduler system for managing Hadoop jobs. With this hands-on guide, two experienced Hadoop practitioners walk you through the intricacies of this powerful and flexible platform, with numerous examples and real-world use cases.
Once you set up your Oozie server, you'll dive into techniques for writing and coordinating workflows, and learn how to write complex data pipelines. Advanced topics show you how to handle shared libraries in Oozie, as well as how to implement and manage Oozie's security capabilities.
||Learning Apache Mahout|
In the past few years, the generation of data and our capability to store and process it have grown exponentially. There is a need for scalable analytics frameworks and people with the right skills to get the information needed from this Big Data. Apache Mahout is one of the first and most prominent Big Data machine learning platforms. It implements machine learning algorithms on top of distributed processing platforms such as Hadoop and Spark.
Starting with the basics of Mahout and machine learning, you will explore prominent algorithms and their implementation in Mahout. You will learn about Mahout's building blocks, address feature extraction, feature reduction, and the curse of dimensionality, and delve into classification use cases with the random forest and Naïve Bayes classifiers as well as item-based and user-based recommendation. You will then work with clustering in Mahout using the K-means algorithm and implement Mahout without MapReduce. Finish with a flourish by exploring end-to-end use cases on customer analytics and test analytics to gain practical know-how of real-life analytics projects.
||Apache Solr Search Patterns|
Apache Solr is an open source search platform built on a Java library called Lucene. It serves as a search platform for many websites, as it has the capability of indexing and searching multiple websites to fetch desired results.
We begin with a brief introduction to analyzers and tokenizers to understand the challenges associated with implementing large-scale indexing and multilingual search functionality. We then move on to working with custom queries and understanding how filters work internally. While doing so, we also create our own query language, implemented as a Solr plugin, that does proximity searches. Furthermore, we discuss how Solr can be used for real-time analytics and tackle problems faced during its implementation in e-commerce search. We then dive deep into Solr's spatial features, such as indexing strategies and search/filtering strategies for a spatial search. We also do an in-depth analysis of problems faced in an ad serving platform and how Solr can be used to solve these problems.
||Apache Solr Essentials|
Search is everywhere. Users always expect a search facility in mobile or web applications that allows them to find things in a fast and friendly manner.
Apache Solr Essentials is a fast-paced guide to help you quickly learn the process of creating a scalable, efficient, and powerful search application. The book starts off by explaining the fundamentals of Solr and then goes on to cover various topics such as data indexing, ways of extending Solr, client APIs and their indexing and data searching capabilities, an introduction to the administration, monitoring, and tuning of a Solr instance, as well as the concepts of sharding and replication. Next, you'll learn about various Solr extensions and how to contribute to the Solr community. By the end of this book, you will be able to create excellent search applications with the help of Solr.
||Apache Hive Essentials|
In this book, we prepare you for your journey into big data by first introducing the background of the big data domain, along with the process of setting up and getting familiar with your Hive working environment. Next, the book guides you through discovering and transforming the value of big data with the help of examples. It also hones your skill in using the Hive language efficiently. Towards the end, the book focuses on advanced topics such as performance, security, and extensions in Hive.
By the end of the book, you will be familiar with Hive and able to work efficiently to find solutions to big data problems.
||Learning Apache Mahout Classification|
This book is a practical guide that explains the classification algorithms provided in Apache Mahout with the help of actual examples. Starting with the introduction of classification and model evaluation techniques, we will explore Apache Mahout and learn why it is a good choice for classification.
Next, you will learn about different classification algorithms and models such as the Naïve Bayes algorithm, the Hidden Markov Model, and so on.
Finally, along with the examples that assist you in the creation of models, this book helps you to build a mail classification system that can be put into production as soon as it is developed. After reading this book, you will understand the concept of classification and the various algorithms, along with the art of building your own classifiers.
||Learning Apache Kafka, 2nd Edition|
Kafka is one of those systems that is very simple to describe at a high level but has an incredible depth of technical detail when you dig deeper.
Learning Apache Kafka, Second Edition provides you with step-by-step, practical examples that help you take advantage of the real power of Kafka and handle hundreds of megabytes of messages per second from multiple clients. This book teaches you everything you need to know, from setting up Kafka clusters to understanding basic building blocks such as producers, brokers, and consumers. Once you are all set up, you will explore additional settings and configuration changes to achieve ever more complex goals. You will also learn how Kafka is designed internally and what configurations make it more effective. Finally, you will learn how Kafka works with other tools such as Hadoop and Storm.
||Learning Karaf Cellar|
Apache Karaf is a popular OSGi container that provides rich and broad features. Together with Cellar, which provides synchronization between Karaf instances, you can easily manage farms of containers. In a real production system, users require a farm of containers to implement failover and scalability, as well as the tools required to provision the different members of a cluster. This book will help you understand the architecture, installation, and configuration of a cluster and teach you about different components and features to get the best out of a clustering solution using Apache Karaf Cellar.
Learning Karaf Cellar starts with an introduction to some of the key features of Karaf. After a quick but detailed understanding of OSGi and Karaf, this book takes you through the concept of provisioning clusters and then covers what Cellar is and how to use it.
||Mastering Apache Maven 3|
Maven has been the number one build tool used by developers for more than a decade. Maven stands out among other build tools due to its extremely extensible architecture, which is built on top of the concept of "convention over configuration". This has made Maven the de facto tool used to manage and build Java projects.
This book is a technical guide to the difficult and complex concepts in Maven and build automation. It starts with the core Maven concepts and its architecture, and then explains how to build extensions such as plugins, archetypes, and lifecycles in depth.
This book is a step-by-step guide that shows you how to use Apache Maven in an optimal way to address your enterprise build requirements.
||Beginning Apache Cassandra Development|
Beginning Apache Cassandra Development introduces you to one of the most robust and best-performing NoSQL database platforms on the planet. Apache Cassandra is a distributed wide-column store (a partitioned row store), not a JSON document database. It is specifically designed to manage large amounts of data across many commodity servers without any single point of failure. This design approach makes Apache Cassandra a robust and easy-to-implement platform when high availability is needed.
||Cassandra High Availability|
Apache Cassandra is a massively scalable, peer-to-peer database designed for 100 percent uptime, with deployments in the tens of thousands of nodes supporting petabytes of data.
This book offers readers a practical insight into building highly available, real-world applications using Apache Cassandra. The book starts with the fundamentals, helping you to understand how the architecture of Apache Cassandra allows it to achieve 100 percent uptime when other systems struggle to do so. You'll have an excellent understanding of data distribution, replication, and Cassandra's highly tunable consistency model. This is followed by an in-depth look at Cassandra's robust support for multiple data centers, and how to scale out a cluster. Next, the book explores the domain of application design, with chapters discussing the native driver and data modeling. Lastly, you'll find out how to steer clear of common antipatterns and take advantage of Cassandra's ability to fail gracefully.