Modern Data Engineering with Apache Spark
Leverage Apache Spark within a modern data engineering ecosystem. This hands-on guide will teach you how to write fully functional applications, follow industry best practices, and learn the rationale behind these decisions. With Apache Spark as the foundation, you will follow a step-by-step journey beginning with the basics of data ingestion, processing, and transformation, and ending up with an entire local data platform running Apache Spark, Apache Zeppelin, Apache Kafka, Redis, MySQL, Minio (S3), and Apache Airflow.
Apache Spark applications solve a wide range of data problems from traditional data loading and processing to rich SQL-based analysis as well as complex machine learning workloads and even near real-time processing of streaming data. Spark fits well as a central foundation for any data engineering workload. This book will teach you to write interactive Spark applications using Apache Zeppelin notebooks, write and compile reusable applications and modules, and fully t ...
Mastering Excel Through Projects
Master Excel in less than two weeks with this unique project-based book! Let's face it, we all master skills in our own way, but building a soup-to-nuts project is one of the best ways to make learning stick and get up to speed quickly. Whether you are just getting started with Excel or are an experienced user, this book will elevate your knowledge and skills. For a beginner, the micro examples in each chapter will warm you up before you dive into the projects. For experienced users, the projects, especially those with table setup considerations, will help you become more creative in your interactions with Excel.
Readers will benefit from building eight unique projects, each covering a different topic, including a word game, a food nutrition ranking, a payroll (tax withholding) calculation, an encryption, a two-way table, a Kaplan-Meier analysis, a data analysis via a pivot table, and the K-means clustering data mining method. Through these projects, you will experience firsthand how ...
Adaptive Machine Learning Algorithms with Python
Learn to use adaptive algorithms to solve real-world streaming data problems. This book covers a multitude of data processing challenges, ranging from the simple to the complex. At each step, you will gain insight into real-world use cases, find solutions, explore code used to solve these problems, and create new algorithms for your own use.
Authors Chanchal Chatterjee and Vwani P. Roychowdhury begin by introducing a common framework for creating adaptive algorithms, and demonstrating how to use it to address various streaming data issues. Examples range from using matrix functions to solve machine learning and data analysis problems to more critical edge computation problems. They handle time-varying, non-stationary data with minimal compute, memory, latency, and bandwidth.
Upon finishing this book, you will have a solid understanding of how to solve adaptive machine learning and data analytics problems and be able to derive new algorithms for your own use cases. You will also c ...
Practical Simulations for Machine Learning
Simulation and synthesis are core parts of the future of AI and machine learning. Consider: programmers, data scientists, and machine learning engineers can create the brain of a self-driving car without the car. Rather than use information from the real world, you can synthesize artificial data using simulations to train traditional machine learning models. That's just the beginning.
With this practical book, you'll explore the possibilities of simulation- and synthesis-based machine learning and AI, concentrating on deep reinforcement learning and imitation learning techniques. AI and ML are increasingly data driven, and simulations are a powerful, engaging way to unlock their full potential.
You'll learn how to: Design an approach for solving ML and AI problems using simulations with the Unity engine; Use a game engine to synthesize images for use as training data; Create simulation environments designed for training deep reinforcement learning and imitation learning models; Us ...
How to Lead in Data Science
How to Lead in Data Science is full of techniques for leading data science at every seniority level - from heading up a single project to overseeing a whole company's data strategy. Authors Jike Chong and Yue Cathy Chang share hard-won advice that they've developed building data teams for LinkedIn, Acorns, Yiren Digital, large asset-management firms, Fortune 50 companies, and more. You'll find advice on plotting your long-term career advancement, as well as quick wins you can put into practice right away. Carefully crafted assessments and interview scenarios encourage introspection, reveal personal blind spots, and highlight development areas.
Lead your data science teams and projects to success! To make a consistent, meaningful impact as a data science leader, you must articulate technology roadmaps, plan effective project strategies, support diversity, and create a positive environment for professional growth. This book delivers the wisdom and practical skills you need to thrive a ...
Data Privacy
Data Privacy teaches you to design, develop, and measure the effectiveness of privacy programs. You'll learn from author Nishant Bhajaria, an industry-renowned expert who has overseen privacy at Google, Netflix, and Uber. The terminology and legal requirements of privacy are all explained in clear, jargon-free language. The book's constant awareness of business requirements will help you balance trade-offs, and ensure your users' privacy can be improved without spiraling time and resource costs.
Data privacy is essential for any business. Data breaches, vague policies, and poor communication all erode a user's trust in your applications. You may also face substantial legal consequences for failing to protect user data. Fortunately, there are clear practices and guidelines to keep your data secure and your users happy.
Data Privacy: A runbook for engineers teaches you how to navigate the trade-offs between strict data security and real-world business needs. In this practical book ...
R in Action, 3rd Edition
R in Action, 3rd Edition makes learning R quick and easy. That's why thousands of data scientists have chosen this guide to help them master the powerful language. Far from being a dry academic tome, every example you'll encounter in this book is relevant to scientific and business developers, and helps you solve common data challenges. R expert Rob Kabacoff takes you on a crash course in statistics, from dealing with messy and incomplete data to creating stunning visualizations. This revised and expanded third edition contains fresh coverage of the new tidyverse approach to data analysis and R's state-of-the-art graphing capabilities with the ggplot2 package.
Used daily by data scientists, researchers, and quants of all types, R is the gold standard for statistical data analysis. This free and open source language includes packages for everything from advanced data visualization to deep learning. Instantly comfortable for mathematically minded users, R easily handles practical prob ...
Advanced Analytics with PySpark
The amount of data being generated today is staggering and growing. Apache Spark has emerged as the de facto tool to analyze big data and is now a critical part of the data science toolbox. Updated for Spark 3.0, this practical guide brings together Spark, statistical methods, and real-world datasets to teach you how to approach analytics problems using PySpark, Spark's Python API, and other best practices in Spark programming.
Data scientists Akash Tandon, Sandy Ryza, Uri Laserson, Sean Owen, and Josh Wills offer an introduction to the Spark ecosystem, then dive into patterns that apply common techniques, including classification, clustering, collaborative filtering, and anomaly detection, to fields such as genomics, security, and finance. This updated edition also covers NLP and image processing.
If you have a basic understanding of machine learning and statistics and you program in Python, this book will get you started with large-scale data analysis.
Familiarize yourself wi ...
In-Memory Analytics with Apache Arrow
Apache Arrow is designed to accelerate analytics and allow the exchange of data across big data systems easily.
In-Memory Analytics with Apache Arrow begins with a quick overview of the Apache Arrow format, before moving on to helping you to understand Arrow's versatility and benefits as you walk through a variety of real-world use cases. You'll cover key tasks such as enhancing data science workflows with Arrow, using Arrow and Apache Parquet with Apache Spark and Jupyter for better performance and hassle-free data translation, as well as working with Perspective, an open source interactive graphical and tabular analysis tool for browsers. As you advance, you'll explore the different data interchange and storage formats and become well-versed with the relationships between Arrow, Parquet, Feather, Protobuf, Flatbuffers, JSON, and CSV. In addition to understanding the basic structure of the Arrow Flight and Flight SQL protocols, you'll learn about Dremio's usage of Apache Arrow to e ...
Time Series Analysis with Python Cookbook
Time series data is everywhere, available at a high frequency and volume. It is complex and can contain noise, irregularities, and multiple patterns, making it crucial to be well-versed with the techniques covered in this book for data preparation, analysis, and forecasting.
This book covers practical techniques for working with time series data, starting with ingesting time series data from various sources and formats, whether in private cloud storage, relational databases, non-relational databases, or specialized time series databases such as InfluxDB. Next, you'll learn strategies for handling missing data, dealing with time zones and custom business days, and detecting anomalies using intuitive statistical methods, followed by more advanced unsupervised ML models. The book will also explore forecasting using classical statistical models such as Holt-Winters, SARIMA, and VAR. The recipes will present practical techniques for handling non-stationary data, using power transforms, A ...