Media releases are provided as is by companies and have not been edited or checked for accuracy. Any queries should be directed to the company itself.
  • 19 February 2016 10:30

Cloudera Accelerates Data Science Workloads for Apache Hadoop

Simplified Deployment and Improved Experience for Python Programming Language to Bring Better Usability to Data Engineers and Data Scientists

Cloudera, the global provider of the fastest, easiest, and most secure data management and analytics platform built on Apache Hadoop and the latest open source technologies, today announced new advancements to further Hadoop as a mainstream platform for data science. Building on recent announcements around Apache Spark and Python that better enable data engineering and data science workloads across big data, Cloudera and Continuum Analytics are making it easier to work with the Python ecosystem through seamless integration of the Anaconda platform with Hadoop. In addition, Cloudera, together with the open source community, announced Apache Arrow, a new open source in-memory columnar data format, to support interoperability and improved performance of Python in the Hadoop ecosystem. These efforts will help data scientists to better take advantage of Hadoop using their preferred skills and tools, and lay the foundation for native data interchange and efficient performance for data engineering and machine learning workloads.

Improving the Python Experience for Data Scientists on Hadoop

Python is the language of choice for data scientists and data engineers due to its power, elegance, and robust libraries and third-party integrations for expressing complex workflows. With frameworks like Apache Spark supporting Python, and new emerging tools like Ibis that better support Python natively for big data, Python has become an increasingly popular choice for data engineering and advanced analytics on Hadoop.

To make it easier for data scientists to get started with Python, Cloudera has partnered with Continuum Analytics, the creator and driving force behind Anaconda, a leading open source Python platform. The jointly developed Anaconda for Cloudera packaging provides a simple, fast experience for customers installing Python, including popular packages such as NumPy, Pandas, and Scikit-Learn, on a Hadoop cluster. Users can deploy Anaconda seamlessly through Cloudera Manager and easily build and run Python-based solutions across Cloudera Enterprise, including under Spark.

"We are grateful to have worked with Cloudera to bring Anaconda to the Cloudera ecosystem," states Peter Wang, chief technology officer and co-founder of Continuum Analytics. "The integration of Anaconda and Cloudera’s platform allows enterprises to realize the full potential of their data by making it easier to get started and distribute Anaconda across Hadoop clusters to support critical data science workloads."

Additionally, Cloudera announced its community involvement with the new Apache Arrow project. Together with developers from Amazon, Databricks, Dremio, MapR, Trifacta, and Twitter, Cloudera is developing Arrow as a new in-memory columnar data structure to standardize in-memory processing and interchange across the ecosystem. Its efficient design will also accelerate analytic workloads across Hadoop frameworks (including Impala and Spark), and enable native interoperability for languages like Python and R for better data access and high-performance analytics.
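As a rough illustration of what "in-memory columnar" means, the sketch below pivots a few row-oriented records into one contiguous typed buffer per column, so that an analytic scan reads only the column it needs. This is not the Arrow API or its actual memory format; it uses only the Python standard library, and the names rows, to_columns, id, and score are hypothetical, chosen for illustration.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
# The (id, score) records here are made-up illustration data.
rows = [
    (1, 9.5),
    (2, 7.25),
    (3, 8.0),
]


def to_columns(records):
    """Pivot (id, score) records into one contiguous typed buffer per column."""
    ids = array("q", (r[0] for r in records))     # 64-bit signed integers
    scores = array("d", (r[1] for r in records))  # 64-bit floats
    return {"id": ids, "score": scores}


columns = to_columns(rows)

# An analytic scan touches only the "score" buffer and never reads the ids.
mean_score = sum(columns["score"]) / len(columns["score"])
print(mean_score)
```

Keeping the values of one column adjacent in memory is the property that lets engines scan and exchange data efficiently; a shared format additionally avoids converting between each tool's own representation.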

“Cloudera has been paving the way for data scientists and engineers to become more deeply immersed in the Hadoop ecosystem,” said Wes McKinney, software engineer at Cloudera and the creator of Python pandas. “As the technology continues to mature, the vision of Python programmers leveraging the full-scale Hadoop ecosystem for complex data analysis becomes more tangible. We will continue to improve and expand data science capabilities across the platform, including ongoing development to make languages such as Python first-class citizens for the platform.”

These new advancements in making Hadoop more accessible and usable to the data science community are complemented by Cloudera’s recent development and leadership in this area, including:

● Spark MLlib in Cloudera 5.5: In the latest Cloudera Enterprise 5.5 release, Cloudera added Spark MLlib, extending Spark’s ease of use and performance gains to machine learning applications within Hadoop. Cloudera also included Spark SQL, extending the capabilities of Spark for developers and data scientists by allowing SQL to be embedded seamlessly within Spark applications.

● Ibis in Cloudera Labs: As a new open source project incubating in Cloudera Labs, Ibis is aimed at enabling advanced data analysis on a 100 percent Python stack and bringing a native Python experience to Hadoop at scale.

● SparkOnHBase in Cloudera Labs: Originating in Cloudera Labs and now committed to the Apache HBase 2.0 branch, SparkOnHBase provides more flexibility for building analytic applications that rely on Spark Streaming.

● Spark Runner for Apache Beam (incubating) in Cloudera Labs: Originating in Cloudera Labs and now part of the Beam SDK (formerly Google Dataflow), this project helps data scientists more easily build practical, massive-scale data processing pipelines for execution on Spark.

● Apache Spark Training: With unprecedented expertise and experience with Hadoop and its ecosystem, Cloudera brings a real-world approach to training and certifications for data scientists and developers to take full advantage of Spark as part of a complete Hadoop platform.

Enabling data scientists to leverage the full power of the Hadoop ecosystem means opening up new possibilities for enterprises looking to build faster, more intelligent data applications and predictive models that improve customer experiences and drive new revenue streams. Through this ongoing evolution, Cloudera is committed to offering seamless accessibility, productivity, and ease-of-use to the data science community.
