Spark for Python DevelopersLooking for a cluster computing system that provides high-level APIs? Apache Spark is your answer—an open source, fast, and general purpose cluster computing system. Spark's multi-stage memory primitives provide performance up to 100 times faster than Hadoop, and it is also well-suited for machine learning algorithms.
Are you a Python developer inclined to work with Spark engine? If so, this book will be your companion as you create data-intensive app using Spark as a processing engine, Python visualization libraries, and web frameworks such as Flask.
To begin with, you will learn the most effective way to install the Python development environment powered by Spark, Blaze, and Bookeh. You will then find out how to connect with data stores such as MySQL, MongoDB, Cassandra, and Hadoop.
You'll expand your skills throughout, getting familiarized with the various data sources (Github, Twitter, Meetup, and Blogs), their data structures, and solutions to effectively tackle complex ...
Pro DockerIn this fast-paced book on the Docker open standards platform for developing, packaging and running portable distributed applications, author Deepak Vohra discusses how to build, ship and run applications on any platform such as a PC, the cloud, data center or a virtual machine. He describes how to install Docker images and create Docker containers, and the advantages of Docker containers.
The remainder of the book is devoted to discussing using Docker with important software solutions. He begins by discussing using Docker with a traditional RDBMS using Oracle and MySQL. Next he moves on to NoSQL with chapter on MongoDB Cassandra, and Couchbase. Then he addresses the use of Docker in the Hadoop ecosystem with complete chapters on utilizing not only Hadoop, but Hive, HBase, Sqoop, Kafka, Solr and Spark. ...
Scalable Big Data ArchitectureThis book highlights the different types of data architecture and illustrates the many possibilities hidden behind the term "Big Data", from the usage of No-SQL databases to the deployment of stream analytics architecture, machine learning, and governance.
Scalable Big Data Architecture covers real-world, concrete industry use cases that leverage complex distributed applications , which involve web applications, RESTful API, and high throughput of large amount of data stored in highly scalable No-SQL data stores such as Couchbase and Elasticsearch. This book demonstrates how data processing can be done at scale from the usage of NoSQL datastores to the combination of Big Data distribution.
When the data processing is too complex and involves different processing topology like long running jobs, stream processing, multiple data sources correlation, and machine learning, it's often necessary to delegate the load to Hadoop or Spark and use the No-SQL to serve processed dat ...
Fast Data Processing with SparkSpark is a framework for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and inbuilt tools for interactive query analysis (Shark), large-scale graph processing and analysis (Bagel), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big data sets.
Fast Data Processing with Spark covers how to write distributed map reduce style programs with Spark. The book will guide you through every step required to write effective distributed programs from setting up your cluster and interactively exploring the API, to deploying your job to the cluster, and tuning it for your purposes. ...
Storm Real-time Processing CookbookStorm is a free and open source distributed real-time computation system. Storm makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Storm Real Time Processing Cookbook will have basic to advanced recipes on Storm for real-time computation.
The book begins with setting up the development environment and then teaches log stream processing. This will be followed by real-time payments workflow, distributed RPC, integrating it with other software such as Hadoop and Apache Camel, and more. ...
Kubernetes Microservices with DockerThis book on Kubernetes, a container cluster manager, discusses all aspects of using Kubernetes in today's complex big data and enterprise applications, with Docker containers.
Starting with installing Kubernetes on a single node, Kubernetes Microservices with Docker introduces Kubernetes with a simple Hello example and discusses using environment variables in Kubernetes.
Next, the book discusses using Kubernetes with all major groups of technologies such as relational databases, NoSQL databases, and in the Apache Hadoop ecosystem.
The book concludes with using multi container Pods and installing Kubernetes on a multi node cluster. No other book on using Kubernetes - beyond simple introduction - is available in the market. ...
Data Science with JavaData Science is booming thanks to R and Python, but Java brings the robustness, convenience, and ability to scale critical to today's data science applications. With this practical book, Java software engineers looking to add data science skills will take a logical journey through the data science pipeline. Author Michael Brzustowicz explains the basic math theory behind each step of the data science process, as well as how to apply these concepts with Java.
You'll learn the critical roles that data IO, linear algebra, statistics, data operations, learning and prediction, and Hadoop MapReduce play in the process. Throughout this book, you'll find code examples you can use in your applications. Examine methods for obtaining, cleaning, and arranging data into its purest form;Understand the matrix structure that your data should take;Learn basic concepts for testing the origin and validity of data;Transform your data into stable and usable numerical val ...
Learning Apache DrillApache Drill enables interactive analysis of massively large datasets, allowing you to execute SQL queries against data in many different data sources - including Hadoop and MongoDB clusters, HBase, or even your local file system - and get results quickly. With this practical guide, analysts and data scientists focused on business or research applications will learn how to incorporate Drill capabilities into complex programs, including how to use Drill queries to replace some MapReduce operations in a large-scale program.
Drill committers Charles Givre and Paul Rogers provide an introduction to Drill and its ability to handle large files containing data in flexible formats with nested data structures and tables. You'll discover how this capability fills a gap in the Hadoop ecosystem.
Additional topics show you how to:Prepare and organize data to maximize Drill performance;Set expectations for Drill performance on different data types and volumes;Reconcil ...
MongoDB Cookbook, 2nd EditionMongoDB is a high-performance and feature-rich NoSQL database that forms the backbone of the systems that power many different organizations – it's easy to see why it's the most popular NoSQL database on the market. Packed with many features that have become essential for many different types of software professionals and incredibly easy to use, this cookbook contains many solutions to the everyday challenges of MongoDB, as well as guidance on effective techniques to extend your skills and capabilities.
This book starts with how to initialize the server in three different modes with various configurations. You will then be introduced to programming language drivers in both Java and Python. A new feature in MongoDB 3 is that you can connect to a single node using Python, set to make MongoDB even more popular with anyone working with Python. You will then learn a range of further topics including advanced query operations, monitoring and backup using MMS, as well as some very useful ...
Amazon S3 CookbookAmazon S3 is one of the most famous and trailblazing cloud object storage services, which is highly scalable, low-latency, and economical. Users only pay for what they use and can store and retrieve any amount of data at any time over the Internet, which attracts Hadoop users who run clusters on EC2.
The book starts by showing you how to install several AWS SDKs such as iOS, Java, Node.js, PHP, Python, and Ruby and shows you how to manage objects. Then, you'll be taught how to use the installed AWS SDKs to develop applications with Amazon S3. Furthermore, you will explore the Amazon S3 pricing model and will learn how to annotate S3 billing with cost allocation tagging. In addition to this, the book covers several practical recipes about how to distribute your content with CloudFront, secure your content with IAM, optimize Amazon S3 performance, and notify S3 events with Lambada. ...