Join our 14,484 members on the New York Open Statistical Programming Meetup

Click here to RSVP

March 30: Validating R dataframes with Pandera via reticulate: A case study in R-Python

We are still virtual this month. Niels Bantilan will speak about using the pandera Python package from R via reticulate.

Questions are encouraged in the monthly-meetup-chat channel in the nyhackr Slack. Likewise, people should list jobs in the job-postings channel.

After the talk, we will give away free tickets to the D4 Conference in Tampa, August 23-25, to attendees.

Those who don't win free tickets can use code nyhackr for 20% off D4 Conference.

Thank you to Texas McCombs for sponsoring this meetup. More information about their program is below.

About the Talk:
Pandera is a data validation and testing tool for dataframes in the Python ecosystem… but can it be used to validate R data.frames? This is the question we will attempt to answer in this talk. Some of the interesting issues that come up in this integration concern how R and Python (in particular pandas) differ in the way they represent data types, and how pandera handles the discrepancies between R data.frames and pandas dataframes as data is passed back and forth between the two runtimes. However, as the title suggests, why is this exercise highly unnecessary? Because R already has a lot of great data validation packages, like `validate` and `pointblank`, and in fact pandera drew a lot of inspiration from these packages at its inception. Despite this redundancy, this talk will give you an understanding of how pandera helps you reason about the schema of your data.frames, and how reticulate can help you leverage some of its capabilities that may not be available in the native R data validation packages, such as data synthesis for unit testing your data processing code.

About Niels:
Niels is the Chief Machine Learning Engineer at, and core maintainer of Flyte, an open source workflow orchestration tool, author of UnionML, an MLOps framework for machine learning microservices, and creator of Pandera, a statistical typing and data testing tool for scientific data containers. His mission is to help data science and machine learning practitioners be more productive.

He has a Master's in Public Health with a specialization in sociomedical science and public health informatics, and prior to that a background in developmental biology and immunology. His research interests include reinforcement learning, AutoML, creative machine learning, and fairness, accountability, and transparency in automated systems.

The talk will begin at 7 PM Eastern Time (America/New_York), and we will start admitting people to the event shortly before. Since this is completely remote, there will be no pizza, but everyone is encouraged to have pizza individually.

About Texas McCombs:
Texas McCombs offers top-ranked MS Programs in Business Analytics, Marketing, Information Technology & Management, and Finance. Each program runs 10 months and is designed to help students accelerate their careers via rigorous technical and quantitative preparation.

New for 2023 is the MS in Business Analytics Working Professionals, a 23-month program in which students develop and deepen skills in cutting-edge analytics, including business intelligence, predictive and prescriptive analytics, machine learning, and statistical analysis. The MSBA-WP offers flexibility for working professionals. The program is a blend of self-paced online learning and live online classes, as well as five on-campus "immersives" held at The McCombs School of Business in Austin, Texas.