{"product_id":"distributed-machine-learning-with-pyspark-migrating-effortlessly-from-pandas-and-scikitlearn-9781484297506","title":"Distributed Machine Learning with PySpark: Migrating Effortlessly from Pandas and Scikit-Learn","description":"\u003cp\u003e\u003c\/p\u003e\u003cblockquote\u003e\n\u003cbr\u003eThis book provides a roadmap for data scientists transitioning from pandas and scikit-learn to PySpark for handling vast amounts of data and achieving faster data processing times. It covers translating Python code, preprocessing large volumes of data, building and training machine learning models, and evaluating algorithms using PySpark. It is designed for data scientists, data engineers, and machine learning practitioners who have some familiarity with Python but are new to distributed machine learning and the PySpark framework. \u003c\/blockquote\u003e\u003cp\u003e\u003cstrong\u003eFormat\u003c\/strong\u003e: Paperback \/ softback\u003cbr\u003e\u003cstrong\u003eLength\u003c\/strong\u003e: 490 pages\u003cbr\u003e\u003cstrong\u003ePublication date\u003c\/strong\u003e: 24 November 2023\u003cbr\u003e\u003cstrong\u003ePublisher\u003c\/strong\u003e: APress\u003cbr\u003e\u003c\/p\u003e \u003cp\u003e\u003cbr\u003eDistributed Machine Learning with PySpark is a comprehensive guide for data scientists looking to migrate from small data libraries like pandas and scikit-learn to big data processing and machine learning with PySpark. This book provides a roadmap to facilitate this transition, leveraging the similarities in syntax, functionality, and interoperability between these tools.\u003cbr\u003e\u003cbr\u003eIn Chapter 1, the book introduces the foundational concepts of distributed machine learning and PySpark. It covers topics such as Spark clusters, RDDs, and Spark SQL, which are essential for handling large amounts of data. 
The chapter also highlights the advantages of using PySpark for data processing, including its scalability, fault tolerance, and performance.\u003cbr\u003e\u003cbr\u003eChapter 2 delves into the differences between PySpark, scikit-learn, and pandas. It explains how PySpark differs from traditional data processing frameworks and highlights its strengths in handling large-scale data processing and machine learning tasks. The chapter also provides an overview of the key features and functionalities of PySpark, such as its resilient distributed dataset (RDD), functional programming API, and machine learning libraries.\u003cbr\u003e\u003cbr\u003eChapter 3 focuses on translating Python code from pandas and scikit-learn to PySpark. It provides step-by-step instructions on how to preprocess large volumes of data using PySpark, including data cleaning, feature extraction, and transformation. The chapter also covers building, training, testing, and evaluating popular machine learning algorithms such as linear and logistic regression, decision trees, random forests, support vector machines, Naïve Bayes, and neural networks.\u003cbr\u003e\u003cbr\u003eChapter 4 discusses the pipelines of PySpark and scikit-learn. It explains how these tools differ in their approach to data processing and machine learning tasks. The chapter also provides examples of how to combine PySpark and scikit-learn pipelines to build scalable ML data pipelines.\u003cbr\u003e\u003cbr\u003eChapter 5 covers advanced topics in distributed machine learning, such as distributed training, distributed data processing, and streaming data processing. It provides insights into how to optimize the performance of PySpark applications and handle real-time data processing.\u003cbr\u003e\u003cbr\u003eChapter 6 concludes the book by discussing the future of distributed machine learning and PySpark. 
It highlights the ongoing development and advancements in these tools and provides recommendations for future practitioners.\u003cbr\u003e\u003cbr\u003eWho This Book Is For:\u003cbr\u003e\u003cbr\u003eDistributed Machine Learning with PySpark is designed for data scientists, data engineers, and machine learning practitioners who have some familiarity with Python but are new to distributed machine learning and the PySpark framework. The book assumes a basic understanding of Python programming and mathematics, but it provides comprehensive explanations and examples to help readers grasp the concepts and apply them effectively.\u003cbr\u003e\u003cbr\u003eIn conclusion, Distributed Machine Learning with PySpark is a valuable resource for data scientists looking to migrate from small data libraries to big data processing and machine learning with PySpark. The book provides a comprehensive roadmap to facilitate this transition, leveraging the similarities in syntax, functionality, and interoperability between these tools. 
By mastering the fundamentals of supervised learning, unsupervised learning, NLP, and recommender systems, understanding the differences between PySpark, scikit-learn, and pandas, and performing linear regression, logistic regression, and decision tree regression with pandas, scikit-learn, and PySpark, readers will gain the skills necessary to apply these methods using PySpark, the industry standard for building scalable ML data pipelines.\u003c\/p\u003e\u003cp\u003e\u003cstrong\u003eWeight\u003c\/strong\u003e: 964g\u003cbr\u003e\u003cstrong\u003eDimension\u003c\/strong\u003e: 254 x 178 (mm)\u003cbr\u003e\u003cstrong\u003eISBN-13\u003c\/strong\u003e: 9781484297506\u003cbr\u003e \u003cstrong\u003eEdition number\u003c\/strong\u003e: 1st ed.\u003c\/p\u003e","brand":"Abdelaziz Testas","offers":[{"title":"Paperback \/ softback","offer_id":44899585261818,"sku":"9781484297506","price":37.47,"currency_code":"GBP","in_stock":true}],"thumbnail_url":"\/\/cdn.shopify.com\/s\/files\/1\/0522\/4297\/2845\/products\/1702664202178_book.jpg?v=1702815958","url":"https:\/\/shulphink.com\/products\/distributed-machine-learning-with-pyspark-migrating-effortlessly-from-pandas-and-scikitlearn-9781484297506","provider":"Shulph Ink","version":"1.0","type":"link"}