Cassandra: The Definitive Guide, 2nd EditionImagine what you could do if scalability wasn't a problem. With this hands-on guide, you'll learn how the Cassandra database management system handles hundreds of terabytes of data while remaining highly available across multiple data centers. This expanded second edition - updated for Cassandra 3.0 - provides the technical details and practical examples you need to put this database to work in a production environment.
Authors Jeff Carpenter and Eben Hewitt demonstrate the advantages of Cassandra's non-relational design, with special attention to data modeling. If you're a developer, DBA, or application architect looking to solve a database scaling issue or future-proof your application, this guide helps you harness Cassandra's speed and flexibility.
Understand Cassandra's distributed and decentralized structure; Use the Cassandra Query Language (CQL) and cqlsh - the CQL shell; Create a working data model and compare it with an equivalent relational model; Develop sample applica ...
Fast Data Processing with Spark, 2nd EditionSpark is a framework used for writing fast, distributed programs. Spark solves similar problems as Hadoop MapReduce does, but with a fast in-memory approach and a clean functional style API. With its ability to integrate with Hadoop and built-in tools for interactive query analysis (Spark SQL), large-scale graph processing and analysis (GraphX), and real-time analysis (Spark Streaming), it can be interactively used to quickly process and query big datasets.
Fast Data Processing with Spark - Second Edition covers how to write distributed programs with Spark. The book will guide you through every step required to write effective distributed programs from setting up your cluster and interactively exploring the API to developing analytics applications and tuning them for your purposes. ...
PySpark RecipesQuickly find solutions to common programming problems encountered while processing big data. Content is presented in the popular problem-solution format. Look up the programming problem that you want to solve. Read the solution. Apply the solution directly in your own code. Problem solved!
PySpark Recipes covers Hadoop and its shortcomings. The architecture of Spark, PySpark, and RDD are presented. You will learn to apply RDD to solve day-to-day big data problems. Python and NumPy are included and make it easy for new learners of PySpark to understand and adopt the model.
Understand the advanced features of PySpark2 and SparkSQL; Optimize your code; Program SparkSQL with Python; Use Spark Streaming and Spark MLlib with Python; Perform graph analysis with GraphFrames. ...
Advanced Data Analytics Using PythonGain a broad foundation of advanced data analytics concepts and discover the recent revolution in databases such as Neo4j, Elasticsearch, and MongoDB. This book discusses how to implement ETL techniques including topical crawling, which is applied in domains such as high-frequency algorithmic trading and goal-oriented dialog systems. You'll also see examples of machine learning concepts such as semi-supervised learning, deep learning, and NLP. Advanced Data Analytics Using Python also covers important traditional data analysis techniques such as time series and principal component analysis.
After reading this book you will have experience of every technical aspect of an analytics project. You'll get to know the concepts using Python code, giving you samples to use in your own projects.
Work with data analysis techniques such as classification, clustering, regression, and forecasting; Handle structured and unstructured data, ETL techniques, and different kinds of databases such a ...
Deep LearningAlthough interest in machine learning has reached a high point, lofty expectations often scuttle projects before they get very far. How can machine learning - especially deep neural networks - make a real difference in your organization? This hands-on guide not only provides the most practical information available on the subject, but also helps you get started building efficient deep learning networks.
Authors Adam Gibson and Josh Patterson provide theory on deep learning before introducing their open-source Deeplearning4j (DL4J) library for developing production-class workflows. Through real-world examples, you'll learn methods and strategies for training deep network architectures and running deep learning workflows on Spark and Hadoop with DL4J.
Dive into machine learning concepts in general, as well as deep learning in particular; Understand how deep networks evolved from neural network fundamentals; Explore the major deep network architectures, including Convolutional and R ...
Learning StormStarting with the very basics of Storm, you will learn how to set up Storm on a single machine and move on to deploying Storm on your cluster. You will understand how Kafka can be integrated with Storm using the Kafka spout.
You will then proceed to explore the Trident abstraction tool with Storm to perform stateful stream processing, guaranteeing single message processing in every topology. You will move ahead to learn how to integrate Hadoop with Storm. Next, you will learn how to integrate Storm with other well-known Big Data technologies such as HBase, Redis, and Kafka to realize the full potential of Storm. ...
Learning SparkData in all domains is getting bigger. How can you work with it efficiently? Recently updated for Spark 1.3, this book introduces Apache Spark, the open source cluster computing system that makes data analytics fast to write and fast to run. With Spark, you can tackle big datasets quickly through simple APIs in Python, Java, and Scala. This edition includes new information on Spark SQL, Spark Streaming, setup, and Maven coordinates.
Written by the developers of Spark, this book will have data scientists and engineers up and running in no time. You'll learn how to express parallel jobs with just a few lines of code, and cover applications from simple batch jobs to stream processing and machine learning.
Quickly dive into Spark capabilities such as distributed datasets, in-memory caching, and the interactive shell; Leverage Spark's powerful built-in libraries, including Spark SQL, Spark Streaming, and MLlib; Use one programming paradigm instead of mixing and matching tools like Hiv ...
Big Data Architect's HandbookThe big data architects are the “masters” of data, and hold high value in today's market. Handling big data, be it of good or bad quality, is not an easy task. The prime job for any big data architect is to build an end-to-end big data solution that integrates data from different sources and analyzes it to find useful, hidden insights.
Big Data Architect's Handbook takes you through developing a complete, end-to-end big data pipeline, which will lay the foundation for you and provide the necessary knowledge required to be an architect in big data. Right from understanding the design considerations to implementing a solid, efficient, and scalable data pipeline, this book walks you through all the essential aspects of big data. It also gives you an overview of how you can leverage the power of various big data tools such as Apache Hadoop and ElasticSearch in order to bring them together and build an efficient big data solution.
By the end of this book, you will be able to build ...
Practical Enterprise Data Lake InsightsUse this practical guide to successfully handle the challenges encountered when designing an enterprise data lake and learn industry best practices to resolve issues.
When designing an enterprise data lake you often hit a roadblock when you must leave the comfort of the relational world and learn the nuances of handling non-relational data. Starting from sourcing data into the Hadoop ecosystem, you will go through stages that can bring up tough questions such as data processing, data querying, and security. Concepts such as change data capture and data streaming are covered. The book takes an end-to-end solution approach in a data lake environment that includes data security, high availability, data processing, data streaming, and more.
Each chapter includes application of a concept, code snippets, and use case demonstrations to provide you with a practical approach. You will learn the concept, scope, application, and starting point.
Get to know data lake architecture and desi ...
Beginning Apache Spark 2Develop applications for the big data landscape with Spark and Hadoop. This book also explains the role of Spark in developing scalable machine learning and analytics applications with Cloud technologies. Beginning Apache Spark 2 gives you an introduction to Apache Spark and shows you how to work with it.
Along the way, you'll discover resilient distributed datasets (RDDs); use Spark SQL for structured data; and learn stream processing and build real-time applications with Spark Structured Streaming. Furthermore, you'll learn the fundamentals of Spark ML for machine learning and much more.
After you read this book, you will have the fundamentals to become proficient in using Apache Spark and know when and how to apply it to your big data applications.
Understand Spark unified data processing platform; How to run Spark in Spark Shell or Databricks; Use and manipulate RDDs; Deal with structured data using Spark SQL through its operations and advanced functions; Build real-tim ...