Apache Essentials, 2nd Edition
Take a friendly, non-technical approach to installing, configuring, and maintaining a web server for development and testing on Mac OS, Linux, and Windows. This new edition uses straightforward language to demystify the mechanics of the web, leading the reader through a complex topic via simple, iterative steps. The book reflects current, relevant Apache configurations and web application frameworks, and prepares the reader for working in professional web development environments.
You'll begin by reviewing a brief history of Apache that focuses on the necessary foundation to understand what these web servers are and why they are used. You'll then install a basic Apache setup on your operating system that loads a static HTML file for display in a browser. Next you'll work through progressively more detailed configurations towards a modern, secure application development environment for the two most commonly implemented development frameworks: a PHP-based framewor ...
The Azure Data Lakehouse Toolkit
Design and implement a modern data lakehouse on the Azure Data Platform using Delta Lake, Apache Spark, Azure Databricks, Azure Synapse Analytics, and Snowflake. This book teaches you the intricate details of the Data Lakehouse Paradigm and how to efficiently design a cloud-based data lakehouse using highly performant and cutting-edge Apache Spark capabilities using Azure Databricks, Azure Synapse Analytics, and Snowflake. You will learn to write efficient PySpark code for batch and streaming ELT jobs on Azure. And you will follow along with practical, scenario-based examples showing how to apply the capabilities of Delta Lake and Apache Spark to optimize performance, and secure, share, and manage a high volume, high velocity, and high variety of data in your lakehouse with ease.
The patterns of success that you acquire from reading this book will help you hone your skills to build high-performing and scalable ACID-compliant lakehouses using flexible and cost-ef ...
In-Memory Analytics with Apache Arrow
Apache Arrow is designed to accelerate analytics and allow the exchange of data across big data systems easily.
In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format, before moving on to helping you to understand Arrow's versatility and benefits as you walk through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, as well as working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and become well-versed with the relationships between Arrow, Parquet, Feather, Protobuf, Flatbuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn about ...
Cloud-Native Microservices with Apache Pulsar
Apply different enterprise integration and processing strategies available with Pulsar, Apache's multi-tenant, high-performance, cloud-native messaging and streaming platform. This book is a comprehensive guide that examines using Pulsar Java libraries to build distributed applications with message-driven architecture.
You'll begin with an introduction to Apache Pulsar architecture. The first few chapters build a foundation of message-driven architecture. Next, you'll perform a setup of all the required Pulsar components. The book also covers work with Apache Pulsar client library to build producers and consumers for the discussed patterns.
You'll then explore the transformation, filter, resiliency, and tracing capabilities available with Pulsar. Moving forward, the book will discuss best practices when building message schemas and demonstrate integration patterns using microservices. Security is an important aspect of any application; the book will cover authentic ...
Modern Data Engineering with Apache Spark
Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, Minio (S3), and Apache Airflow.
Apache Spark applications solve a wide range of data problems from traditional data loading and processing to rich SQL-based analysis as well as complex machine learning workloads and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write an ...
Apache Pulsar in Action
Apache Pulsar in Action is a comprehensive and practical guide to building high-traffic applications with Pulsar. You'll learn to use this mature and battle-tested platform to deliver extreme levels of speed and durability to your messaging. Apache Pulsar committer David Kjerrumgaard teaches you to apply Pulsar's seamless scalability through hands-on case studies, including IOT analytics applications and a microservices app based on Pulsar functions.
Reliable server-to-server messaging is the heart of a distributed application. Apache Pulsar is a flexible real-time messaging platform built to run on Kubernetes and deliver the scalability and resilience required for cloud-based systems. Pulsar supports both streaming and message queuing, and unlike other solutions, it can communicate over multiple protocols including MQTT, AMQP, and Kafka's binary protocol.
Apache Pulsar in Action teaches you to build scalable streaming messaging systems using Pulsar. Y ...
Mastering Apache Pulsar
Every enterprise application creates data, including log messages, metrics, user activity, and outgoing messages. Learning how to move these items is almost as important as the data itself. If you're an application architect, developer, or production engineer new to Apache Pulsar, this practical guide shows you how to use this open source event streaming platform to handle real-time data feeds.
Jowanza Joseph, staff software engineer at Finicity, explains how to deploy production Pulsar clusters, write reliable event streaming applications, and build scalable real-time data pipelines with this platform. Through detailed examples, you'll learn Pulsar's design principles, reliability guarantees, key APIs, and architecture details, including the replication protocol, the load manager, and the storage layer.
This book helps you: Understand how event streaming fits in the big data ecosystem; Explore Pulsar producers, consumers, and readers for writing and reading events; Build ...
Introducing .NET for Apache Spark
Get started using Apache Spark via C# or F# and the .NET for Apache Spark bindings. This book is an introduction to both Apache Spark and the .NET bindings. Readers new to Apache Spark will get up to speed quickly using Spark for data processing tasks performed against large and very large datasets. You will learn how to combine your knowledge of .NET with Apache Spark to bring massive computing power to bear by distributed processing of extremely large datasets across multiple servers.
This book covers how to get a local instance of Apache Spark running on your developer machine and shows you how to create your first .NET program that uses the Microsoft .NET bindings for Apache Spark. Techniques shown in the book allow you to use Apache Spark to distribute your data processing tasks over multiple compute nodes. You will learn to process data using both batch mode and streaming mode so you can make the right choice depending on ...
Data Pipelines with Apache Airflow
A successful pipeline moves data efficiently, minimizing pauses and blockages between tasks, keeping every process along the way operational. Apache Airflow provides a single customizable environment for building and managing data pipelines, eliminating the need for a hodgepodge collection of tools, snowflake code, and homegrown processes. Using real-world scenarios and examples, Data Pipelines with Apache Airflow teaches you how to simplify and automate data pipelines, reduce operational overhead, and smoothly integrate all the technologies in your stack.
Data pipelines manage the flow of data from initial collection through consolidation, cleaning, analysis, visualization, and more. Apache Airflow provides a single platform you can use to design, implement, monitor, and maintain your pipelines. Its easy-to-use UI, plug-and-play options, and flexible Python scripting make Airflow perfect for any data management task.
Data Pipelines with Apache Airflow ...
Practical Apache Lucene 8
Gain a thorough knowledge of Lucene's capabilities and use it to develop your own search applications. This book explores the Java-based, high-performance text search engine library used to build search capabilities in your applications.
Starting with the basics of Lucene and searching, you will learn about the types of queries used in it and also take a look at scoring models. Applying this basic knowledge, you will develop a hello world app using basic Lucene queries and explore functions like scoring and document level boosting.
Along the way you will also uncover the concepts of partial searching and matching in Lucene and then learn how to integrate geographical information (geospatial data) in Lucene using spatial queries and n-dimensional indexing. This will prepare you to build a location-aware search engine with a representative data set that allows location constraints to be specified during a search. You'll also develop a text classifier using Lucene and Apache M ...
Beginning Apache Spark Using Azure Databricks
Analyze vast amounts of data in record time using Apache Spark with Databricks in the Cloud. Learn the fundamentals, and more, of running analytics on large clusters in Azure and AWS, using Apache Spark with Databricks on top. Discover how to squeeze the most value out of your data at a mere fraction of what classical analytics solutions cost, while at the same time getting the results you need, incrementally faster.
This book explains how the confluence of these pivotal technologies gives you enormous power, and cheaply, when it comes to huge datasets. You will begin by learning how cloud infrastructure makes it possible to scale your code to large amounts of processing units, without having to pay for the machinery in advance. From there you will learn how Apache Spark, an open source framework, can enable all those CPUs for data analytics use. Finally, you will see how services such as Databricks provide the power of Apache Spark, without you having to ...