Cloudera, the global provider of the fastest, easiest, and most secure data management and analytics platform built on Apache Hadoop and the latest open source technologies, today announced new advancements to further Hadoop as a mainstream platform for data science. Building on recent announcements around Apache Spark and Python that better enable data engineering and data science workloads across big data, Cloudera and Continuum Analytics are making it easier to work with the Python ecosystem through seamless integration of the Anaconda platform with Hadoop. In addition, Cloudera, together with the open source community, announced Apache Arrow, a new open source in-memory columnar data format, to support interoperability and improved performance of Python in the Hadoop ecosystem. These efforts will help data scientists to better take advantage of Hadoop using their preferred skills and tools, and lay the foundation for native data interchange and efficient performance for data engineering and machine learning workloads.
Improving the Python Experience for Data Scientists on Hadoop
Python is the language of choice for data scientists and data engineers due to its power, elegance, and robust libraries and third-party integrations for expressing complex workflows. With frameworks like Apache Spark supporting Python, and emerging tools like Ibis that better support Python natively for big data, Python has become an increasingly popular choice for data engineering and advanced analytics on Hadoop.
To make it easier for data scientists to get started with Python, Cloudera has partnered with Continuum Analytics, the creator and driving force behind Anaconda, a leading open source Python platform. The jointly developed Anaconda for Cloudera packaging provides a simple, fast experience for customers installing Python, including popular packages such as NumPy, pandas, and scikit-learn, on a Hadoop cluster. Users can deploy Anaconda through Cloudera Manager and build and run Python-based solutions across Cloudera Enterprise, including under Spark.
"We are grateful to have worked with Cloudera to bring Anaconda to the Cloudera ecosystem," states Peter Wang, chief technology officer and co-founder of Continuum Analytics. "The integration of Anaconda and Cloudera’s platform allows enterprises to realize the full potential of their data by making it easier to get started and distribute Anaconda across Hadoop clusters to support critical data science workloads."
Additionally, Cloudera announced its community involvement with the new Apache Arrow project. Together with developers from Amazon, Databricks, Dremio, MapR, Trifacta, and Twitter, Cloudera is developing Arrow as a new in-memory columnar data structure to standardize in-memory processing and interchange across the ecosystem. Its efficient design will also accelerate analytic workloads across Hadoop frameworks (including Impala and Spark), and enable native interoperability for languages like Python and R for better data access and high-performance analytics.
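To illustrate what "in-memory columnar" means in general terms, the sketch below pivots a handful of row records into one contiguous typed buffer per column, using only the Python standard library. It is an illustration of the layout idea under stated assumptions, not Arrow's actual memory format or API: the field names (user_id, score), the sample values, and the helper to_columns are hypothetical.

```python
from array import array

# Row-oriented layout: the fields of each record are stored together.
# The records and field names here are made up for illustration.
rows = [
    {"user_id": 1, "score": 0.5},
    {"user_id": 2, "score": 1.5},
    {"user_id": 3, "score": 2.0},
]


def to_columns(records):
    """Pivot row records into one contiguous typed buffer per column.

    Hypothetical helper; a real columnar format also tracks nulls,
    variable-length values, and nested types.
    """
    return {
        "user_id": array("q", (r["user_id"] for r in records)),  # 64-bit signed ints
        "score": array("d", (r["score"] for r in records)),      # 64-bit floats
    }


columns = to_columns(rows)

# An aggregate over one field now scans a single contiguous buffer
# instead of touching every record.
total_score = sum(columns["score"])
```

Because each column is a flat buffer of fixed-width values, a buffer like columns["score"] could in principle be handed to another process or language runtime without converting it record by record, which is the kind of interchange the announcement describes.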
“Cloudera has been paving the way for data scientists and engineers to become more deeply immersed in the Hadoop ecosystem,” said Wes McKinney, software engineer at Cloudera and the creator of Python pandas. “As the technology continues to mature, the vision of Python programmers leveraging the full-scale Hadoop ecosystem for complex data analysis becomes more tangible. We will continue to improve and expand data science capabilities across the platform, including ongoing development to make languages such as Python first-class citizens for the platform.”
These new advancements in making Hadoop more accessible and usable to the data science community are complemented by Cloudera’s recent development and leadership in this area, including:
● Spark MLlib in Cloudera 5.5: In the latest Cloudera Enterprise 5.5 release, Cloudera added Spark MLlib, extending Spark’s ease of use and performance gains to machine learning applications within Hadoop. Cloudera also included Spark SQL, which extends the capabilities of Spark for developers and data scientists by allowing SQL to be embedded within Spark applications.
● Ibis in Cloudera Labs: As a new open source project incubating in Cloudera Labs, Ibis is aimed at enabling advanced data analysis on a 100 percent Python stack and bringing a native Python experience to Hadoop at scale.
● SparkOnHBase in Cloudera Labs: Originating in Cloudera Labs and now committed to the Apache HBase 2.0 branch, SparkOnHBase provides more flexibility for building analytic applications that rely on Spark Streaming.
● Spark Runner for Apache Beam (incubating) in Cloudera Labs: Originating in Cloudera Labs and now part of the Beam SDK (formerly Google Dataflow), this project helps data scientists more easily build practical, massive-scale data processing pipelines for execution on Spark.
● Apache Spark Training: With unprecedented expertise and experience with Hadoop and its ecosystem, Cloudera brings a real-world approach to training and certifications for data scientists and developers to take full advantage of Spark as part of a complete Hadoop platform.
Enabling data scientists to leverage the full power of the Hadoop ecosystem opens up new possibilities for enterprises looking to build faster, more intelligent data applications and predictive models that improve customer experiences and drive new revenue streams. Through this ongoing evolution, Cloudera is committed to offering accessibility, productivity, and ease of use to the data science community.