[RFC]: add support for bootstrap and jackknife resampling #133

notacountry commented Apr 2, 2025

Full name

Zach Land

University status

Yes

University name

Imperial College London

University program

Joint Mathematics and Computer Science

Expected graduation

2027

Short biography

Hi! I'm Zach, and I am a first-year Math & CS undergraduate at Imperial College London with a 90.9% grade average.
I am relatively new to open source development, although I did spend 3 years as an active Wiki contributor and administrator (including development of backend templates & data structures).
My primary interest lies in finance and data science, and I want to pursue a career in the field. I am well-versed in statistical methods, both from my university courses and recreationally in Python and C++, where I've recently been reading, implementing, and optimizing various algorithms.
My best programming language is most likely C#, but I have significant experience programming in C++, Haskell, Java, and Kotlin, and in web development with CSS, HTML, and JavaScript with React.

Stdlib interests me in particular because I don't need to turn to another language such as Python or R for data analysis. I feel I can contribute significantly to its mathematics and statistics modules.

A full profile of my work experience and projects can be found on my LinkedIn.

Timezone

Greenwich Mean Time (GMT, UTC+00:00)

Contact details

email:[email protected]

Platform

Windows

Editor

My IDE tends to depend on the language. I use Visual Studio for C# and Sublime Text for HTML, but in general I gravitate towards VSCode since it was easy to adapt to and use. I am looking toward transitioning to NeoVim in the near future though.

Note that my laptop (from which I will mainly be contributing) runs Windows; however, my personal PC runs Fedora Linux, so I am comfortable programming on and using both.

Programming experience

Key Project

One very recent project that springs to mind is one that my team and I are developing for Imperial College London's month-long AI Agent Hackathon. Our work-in-progress is an accessibility extension to the UK's National Health Service app that allows users to scan medical documents, which are processed using various NLP techniques such as keyword extraction and sentiment analysis in order to summarize, translate, and signpost. We are focusing on correctness and responsibility, as we are dealing with important information where errors could be dangerous, even fatal.

We are using React Native for our frontend (and I worked on porting several NHS frontend assets into the framework), with a Python backend.

Additional Projects

  • Blog Website (JS):
    • Programmed, hosted, and maintained a blog website with search, tagging, and comment functionality.
  • Chatbot & Finance Simulation (C#):
    • Designed and implemented a chatbot which used the Wikipedia API to quiz users and measure their typing speed.
    • Developed a stock market simulator which increased user engagement. Users could buy and sell virtual stocks.
  • Free School Uniform Website (JS):
    • Implemented a website allowing users to search, sort, and order stock from a database that updated in real-time.
  • Black-Scholes research (Python):
    • Analytically and numerically quantified the effect of interest rate on options pricing.
  • British Physics Computational Challenge (Python):
    • Won a gold medal due to my unique (2D and 3D) visualizations and simulations of Kepler's Laws and Newton's Universal Law of Gravitation.

JavaScript experience

I have had a decent amount of experience with JavaScript since I develop websites quite frequently (see programming experience above).

I am very interested in Haskell and the functional programming paradigm in general. To this end, my favorite feature of JavaScript is currying and the ability to manipulate functions as variables. This was actually used in my recent stdlib pull request, wherein the function incrnanmminmaxabs( out, window ) returns an accumulator function which I modified to skip NaN values.
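
As a rough illustration of that accumulator pattern (I am using @stdlib/stats/incr/mminmaxabs as I understand its documented interface; the NaN-skipping wrapper below is a simplified sketch of the idea rather than the actual code from my pull request):

```js
var incrmminmaxabs = require( '@stdlib/stats/incr/mminmaxabs' );

// `incrmminmaxabs` is curried: calling it with a window size returns an
// accumulator function which is then fed one value at a time...
var acc = incrmminmaxabs( 3 );

acc( 2.0 );  // => [ 2.0, 2.0 ]
acc( -5.0 ); // => [ 2.0, 5.0 ]
acc( 3.0 );  // => [ 2.0, 5.0 ]
acc();       // no argument: returns the current [ min, max ] of absolute values

// A NaN-skipping wrapper simply declines to forward NaN values:
function nanSkipping( accumulator ) {
    return function update( x ) {
        if ( arguments.length === 0 || x !== x ) { // NaN !== NaN
            return accumulator();
        }
        return accumulator( x );
    };
}

var nanAcc = nanSkipping( incrmminmaxabs( 3 ) );
nanAcc( 4.0 ); // => [ 4.0, 4.0 ]
nanAcc( NaN ); // skipped; => [ 4.0, 4.0 ]
```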

My least favorite feature is JavaScript's dynamic typing. Coming from C# and Haskell, I am used to declaring parameter types and function return types, and it feels odd that this is not required in JavaScript. The worst part is that JavaScript variables can change types. I feel that readability suffers (as someone who likes to precisely document and specify all of my code).

Node.js experience

I learnt Node.js for the first time in preparation for our AI Agents Hackathon (which I mentioned in my programming experience above).

C/Fortran experience

None.

Interest in stdlib

I mentioned this previously, but stdlib provides you with the ability to perform complex calculations within your browser without having to spend time using other languages for data processing.

I don't have too much experience with using stdlib, but one thing I really enjoyed was its huge list of incremental statistics methods. I am interested in stream processing and I find these methods particularly useful. My pull request was related to this, and my showcase utilizes this stdlib feature to perform real-time analysis of stocks.

Version control

Yes

Contributions to stdlib

This pull request adds the package @stdlib/stats/incr/nanmminmaxabs. The package is a thin wrapper around @stdlib/stats/incr/mminmaxabs. It wraps a NaN propagating accumulator and skips NaN values.

stdlib showcase

My showcase uses @stdlib/stats/incr/mmean and @stdlib/plot to take an input stream of financial data, employ a simple moving average strategy, and plot the profit.
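
A minimal sketch of the core of that strategy (the actual showcase is more involved; only @stdlib/stats/incr/mmean is assumed here, and the window sizes and crossover logic are illustrative):

```js
var incrmmean = require( '@stdlib/stats/incr/mmean' );

var FAST = 5;
var SLOW = 20;

// Two moving averages with different window sizes:
var fast = incrmmean( FAST );
var slow = incrmmean( SLOW );

var count = 0;
var position = 0;   // 0 = flat, 1 = long
var entry = 0.0;
var profit = 0.0;

function onPrice( price ) {
    var f = fast( price );
    var s = slow( price );
    count += 1;
    if ( count < SLOW ) {
        return; // wait until the slow window is full
    }
    if ( position === 0 && f > s ) {
        // Fast average crosses above the slow average: buy.
        position = 1;
        entry = price;
    } else if ( position === 1 && f < s ) {
        // Fast average crosses below the slow average: sell.
        position = 0;
        profit += price - entry;
    }
}
```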

I am planning on implementing more (e.g. running an optimization algorithm to improve the sliding window sizes, or implementing more strategies that use stdlib), but unfortunately this application process is happening during a busy time of my life (easter vacation work experience, AI Agents Hackathon, studying for finals next month) so I may not be able to get around to it.

Goals

Stdlib offers some ways to calculate point estimates (such as maximum likelihood estimates) given some sample data (e.g. see @stdlib/stats/base/mean and @stdlib/stats/base/variance); however, from my research I could find no real way in stdlib to estimate the distribution of those estimates (i.e. the sampling distribution of an estimator).

I propose to extend this functionality by implementing a resampling package which exposes various resampling methods such as Bootstrapping and Jackknife.

Please note that my ideas are an initial draft. The listed project idea does not mention any mentors to discuss the project with, so any feedback at this stage is invaluable.

I will introduce a package, @stdlib/stats/base/resampling, which contains various sub-packages such as the simple Monte Carlo case resampling algorithm or the more complex Bayesian bootstrap, and then expose these via a class implementation (similar to how @stdlib/stats/base/dists contains distribution packages with sub-packages calculating specific quantities).

This structure was mentioned by kgryte in 2022 on Gitter in reference to statistical distribution implementation, but I feel I can apply that methodology to this project.

Alternatively, we could do what SciPy's bootstrap implementation did and add the various possible algorithms as optional parameters to a bootstrap function, although I dislike this approach.

Inputs and Outputs
Each resampling function should

  • take as inputs: data in the form of a TypedArray, a function which outputs an estimate from a given data set (such as sample mean), and optionally a random number generator for non-deterministic algorithms; and
  • output a TypedArray containing the various estimates made from the bootstrap. This differs from the SciPy implementation, which outputs a confidence interval. (A hypothetical signature is sketched below this list.)
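
To make the interface concrete, below is a minimal sketch of what a Monte Carlo case-resampling function could look like. The name bootstrap, its options, and the fallback to Math.random are hypothetical placeholders of my own, not an existing stdlib API:

```js
// Hypothetical signature: bootstrap( data, estimator[, options] )
// - data: TypedArray of observations
// - estimator: function mapping a TypedArray to a single number (e.g. the sample mean)
// - options.numResamples: number of bootstrap resamples (default: data.length)
// - options.rand: RNG returning values on [0,1) (default: Math.random)
function bootstrap( data, estimator, options ) {
    var opts = options || {};
    var B = opts.numResamples || data.length;
    var rand = opts.rand || Math.random;
    var out = new Float64Array( B );
    var resample = new Float64Array( data.length );
    var i;
    var j;
    for ( i = 0; i < B; i++ ) {
        // Draw `data.length` observations with replacement:
        for ( j = 0; j < data.length; j++ ) {
            resample[ j ] = data[ Math.floor( rand() * data.length ) ];
        }
        out[ i ] = estimator( resample );
    }
    return out; // TypedArray of estimates, one per resample
}
```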

Below I outline some important goals for the project. I understand that implementing many different types of bootstrapping may be overkill for the scope of stdlib. Since the project is 12 weeks, I doubt I will be able to implement all of these different types. I will discuss with my mentors which of the mid-to-low-importance goals I should bring into the project.

High Importance

Mid Importance

  • Poisson bootstrapping is very useful for stream processing and so can be added as a package to @stdlib/stats/incr. Using a Poisson model, we don't need to know the size of the input dataset in advance (see this article from Google's data scientists); a rough sketch is included after this list.
  • Bayesian bootstrapping provides smoother and more reliable estimators, particularly when only a small sample is available. It may be adapted to include a prior representing prior information we have about the data.
  • Bag of Little Bootstraps (BLB) reduces computational complexity for very large data sets by partitioning the sample data into buckets, each of which is separately bootstrapped.
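
To illustrate why the Poisson bootstrap suits streaming, here is a rough, self-contained sketch: each incoming observation is included in each resample a Poisson(1)-distributed number of times, so the resamples can be updated one observation at a time without knowing the data set size. The accumulator below is hypothetical and not part of stdlib; it tracks only the mean and uses Knuth's method with Math.random for the Poisson draws:

```js
// Knuth's method for drawing a Poisson(1) count (self-contained, for illustration):
function poisson1() {
    var L = Math.exp( -1 );
    var k = 0;
    var p = 1.0;
    do {
        k += 1;
        p *= Math.random();
    } while ( p > L );
    return k - 1;
}

// Hypothetical incremental Poisson bootstrap of the mean over B resamples:
function incrPoissonBootstrapMean( B ) {
    var sums = new Float64Array( B );
    var counts = new Float64Array( B );
    return function update( x ) {
        var means = new Float64Array( B );
        var w;
        var b;
        for ( b = 0; b < B; b++ ) {
            w = poisson1(); // how many times this observation enters resample b
            sums[ b ] += w * x;
            counts[ b ] += w;
            means[ b ] = ( counts[ b ] > 0 ) ? sums[ b ] / counts[ b ] : NaN;
        }
        return means; // current bootstrap distribution of the mean
    };
}
```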

Low Importance

  • Standard bootstrapping assumes the input data is not correlated. One might consider implementing the Block Bootstrap, which somewhat preserves the dependence between data points by splitting the input into blocks before resampling (a rough sketch is included after this list). This is similar to BLB, but the blocks are partitioned differently. Gaussian process bootstrapping is also a consideration since it can also deal with correlated data; however, we (funnily enough) require [Idea]: add support for the multivariate normal distribution #11 to be implemented first.
  • Smooth bootstrapping adds random noise in an attempt to smooth the sample data. Interestingly, smooth bootstrapping can use Kernel Density Estimation as the underlying distribution from which to resample, so we can utilize the pre-defined @stdlib/stats/kde2d package. That being said, it is more computationally efficient to add random noise directly rather than drawing samples from a smooth, continuous approximation with KDE.
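
As a rough sketch of the block idea mentioned above (the moving-block variant; the helper, the block-length parameter, and the use of Math.random are illustrative, not an existing stdlib API):

```js
// Hypothetical helper: draw one block-bootstrap resample by concatenating
// randomly chosen contiguous blocks of length `blockLen` (assumes blockLen <= data.length).
function blockResample( data, blockLen ) {
    var n = data.length;
    var out = new Float64Array( n );
    var numStarts = n - blockLen + 1; // possible (overlapping) block start positions
    var filled = 0;
    var start;
    var j;
    while ( filled < n ) {
        start = Math.floor( Math.random() * numStarts );
        for ( j = 0; j < blockLen && filled < n; j++ ) {
            out[ filled ] = data[ start + j ];
            filled += 1;
        }
    }
    return out;
}
```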

Why this project?

What excites me about the proposal is the unique opportunity to contribute to open source on a widely used library for a popular programming language. I feel that stdlib has a lot of room for growth in terms of its mathematical capabilities, and I would love to contribute to the project in a structured program such as GSoC. My skills in implementing statistical algorithms and bootstrapping lend themselves to this project very well, so much so that I felt lucky to come across a proposal so tailored to what I enjoy doing.

Qualifications

  • I am quite knowledgeable on bootstrapping and statistics generally, and have experience fiddling around with statistics algorithms in C#, C++ and Python.
  • As an A-Level student, I was interested in learning more about statistics. I took MIT's online statistics course in my spare time. It covered maximum likelihood estimators in great detail and thus I am very comfortable working with them. I also read through the A-Level Further Statistics 2 textbook, which was a module that our school didn't allow us to take.
  • During my undergraduate study, we learnt about bootstrapping in the statistics module. Our lecturer is very interested in bootstrapping and mentioned that it was his favorite part of the course. I have a good understanding of bootstrapping from the course, and I can also get help with specific implementation details and optimizations from my lecturer throughout the program.

Prior art

Popular libraries with bootstrapping implementations:

Written Materials

Commitment

Please be aware that I will be doing a full-time 9-week internship which overlaps with most of GSoC, so I expect to be developing part-time. Also, between June 2-20, I have university programming coursework. Despite this, I spend much of my free time programming and, as such, I still believe I can manage my time effectively. This does mean I will be contributing most over the weekends.

Specifically, I will work on the project for around 1 hour every weekday, and as long as possible on weekends (say, more than 6 hours per day).

I will contribute and play around with stdlib before the project starts to get used to using Git on a Windows machine and iron out any annoying development environment errors. Note that between finals and my internship start date, I will be available to focus mainly on the GSoC project and thoroughly flesh out my ideas.

Schedule

It is difficult to structure a specific schedule for this project, since it depends on how many resampling methods I will implement during the project. My goals can easily be separated into several mini-projects so I can re-scale the scope of the project easily. The below schedule is highly flexible and includes two weeks of leeway to ensure I will be able to submit the deliverables on time. If the weeks of leeway are not used, I can use the accumulated extra time at the end to implement more mid-to-low importance goals.

Preparation
I will submit more basic pull requests, get used to the stdlib development process, and iron out development environment errors.

Community Bonding Period (May 8 - June 1):
I will attempt to arrange a voice chat with my mentors to discuss the project. We will discuss the specific implementation details, edge cases to worry about, how and when to structure and submit my pull requests, etc.
Speaking directly should be a good way to build strong relationships that will last throughout the program.

Week 1 (June 2-8):

  • I will develop the code for the Monte Carlo case sampling algorithm.
  • I will attempt to arrange an informal code review with my mentors.

Week 2 (June 9-15):

  • I will write robust tests and extensive documentation of the Monte Carlo case sampling algorithm.
  • I will submit a pull request and have a formal code review.

Week 3 (June 16-22):

  • I will develop the code for the Jackknife resampling algorithm.
  • I will attempt to arrange an informal code review with my mentors.

Week 4 (June 23-29):

  • My internship begins on June 23. I expect the tasks from now onward to be completed more slowly.
  • I will write robust tests and extensive documentation for the Jackknife resampling algorithm.
  • I will submit a pull request and have a formal code review.

Week 5 (June 30-July 6):

  • One week of leeway to account for unexpected delays.

Week 6 (Midterm) (July 7-13):

  • I will implement a confidence interval function using basic bootstrap.
  • I will attempt to arrange an informal code review with my mentors.

Week 7 (July 14-21):

  • I will write robust tests and extensive documentation for the basic bootstrap.
  • I will submit a pull request and have a more formal code review.

Week 8-9 (July 21-August 3):

  • Begin implementing one of my suggested Mid-Importance goals.
  • I will attempt to arrange an informal code review with my mentors.

Week 10 (August 4-10):

  • I will write robust tests and extensive documentation for the suggested Mid-Importance goal.
  • I will submit a pull request and have a more formal code review.

Week 11 (August 11-17):

  • One week of leeway to account for unexpected delays.
  • Code freeze.

Week 12 (August 17-24):

  • Reviewing and updating documentation and tests.
  • Final code review.

Final Week (August 25-September 1):

  • Project write-up (I will informally write down my experience throughout the project and bring it all together in a formal post near the end).
  • Project submission.

Related issues

Checklist

  • I have read and understood the Code of Conduct.
  • I have read and understood the application materials found in this repository.
  • I understand that plagiarism will not be tolerated, and I have authored this application in my own words.
  • I have read and understood the patch requirement which is necessary for my application to be considered for acceptance.
  • I have read and understood the stdlib showcase requirement which is necessary for my application to be considered for acceptance.
  • The issue name begins with [RFC]: and succinctly describes your proposal.
  • I understand that, in order to apply to be a GSoC contributor, I must submit my final application to https://summerofcode.withgoogle.com/ before the submission deadline.
Planeshifter (Member) commented

Thanks Zach for your proposal! Nice showcase demo, too.

There definitely will be some time commitment concerns given your full-time internship and university coursework. If your commitments end up taking more time than expected, what contingency plans do you have to ensure successful completion of this project?

A few questions:

  • How would you ensure your implementations are correct? How are you planning to validate the correctness of your bootstrap and jackknife implementations?
  • This project lends itself well to a tutorial and post for our blog at https://blog.stdlib.io. You already have something set aside for the final week for the write-up, but you could explore starting that earlier so a cycle or two of feedback can be accounted for.
  • Do you expect any performance concerns when using these methods for larger datasets?


notacountry commented Apr 5, 2025

Hi! Thank you for your response. I will respond to your concerns below:

Commitment

I can see that it would be difficult to juggle my coursework and internship throughout the project. Here's a bit more information on exactly how they are structured:

Coursework

  • The coursework is a group project between May 23 and June 20. I will do a lot of work between May 23 and June 1, then hold back a bit from June 2 onwards when GSoC officially starts.
  • I contributed a lot to our group's previous project and we got a good grade, so I can ask my group to do less work on this project and they will be okay with that.
  • The coursework is for both Computing students and Joint Mathematics and Computing (JMC) students, but the JMC students (including me) only do part I of the coursework and not part II. Part II is said to be significantly more difficult than part I, so I don't expect my share of the work to take too much of my time.
  • We have no lectures during this period, so we will have a lot of time to work.

Internship

  • The internship runs from 9:00 AM to 5:00 PM on weekdays (with one hour for lunch). I will not have to do any work outside of those hours.
  • I would like to be able to contribute to GSoC and also produce excellent work throughout my internship. I do not want to compromise my effort in either program.

If time gets tight, I can allocate more time to implementing the bootstrapping and jackknife resampling algorithms. In the schedule, all high-importance goals would be completed by week 7, so there are five weeks of leeway.

Another contingency plan I have (if I only complete the high-importance goals) is to not implement the jackknife at all and only focus on the Monte Carlo case resampling algorithm. This article by Stephanie Glen highlights how the jackknife doesn't perform as well as the bootstrap, and this article by Amit Yadav suggests that bootstrapping is better for larger data sets. With this in mind, I consider bootstrapping to be 'more important' than the jackknife. Bootstrapping requires roughly 10 times as much computation as the jackknife, which I don't consider too severe a drawback (and one can adjust the parameters of the bootstrap so that it is more computationally efficient anyway).

Correctness of Implementations

Thank you for mentioning this. I didn't put as much thought into ensuring correctness as I perhaps should have.

  • Deterministic resampling methods (such as the jackknife) should be easy to validate for correctness. I can input a large data set into an existing calculator and check whether my algorithm produces the same answer. I will also check edge cases, etc.
  • Bootstrapping will be more difficult since it's non-deterministic:
    • I will use random seeds to get consistent results in benchmarks and examples (a seeded test is sketched after this list).
    • I will run the example code on samples generated from a probability distribution, and see whether the probability density function of the original distribution matches that of the histogram generated from the bootstrap. I think I will have to manually check this. Any programmatic checks may run into the issue of a test unexpectedly failing if you get 'unlucky'.
    • I would like some more guidance on how to ensure the implementation is correct.
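
As a concrete example of the seeding approach, here is a sketch of the property the unit tests could assert. It assumes the hypothetical bootstrap function sketched under "Inputs and Outputs" and uses @stdlib/random/base/randu, which (as I understand it) supports seeded factories:

```js
var randu = require( '@stdlib/random/base/randu' );

// Simple estimator used for the test:
function mean( x ) {
    var s = 0.0;
    var i;
    for ( i = 0; i < x.length; i++ ) {
        s += x[ i ];
    }
    return s / x.length;
}

var data = new Float64Array( [ 1.0, 2.0, 3.0, 4.0, 5.0 ] );

// Two runs seeded identically must yield identical estimate arrays:
var a = bootstrap( data, mean, {
    'numResamples': 200,
    'rand': randu.factory({ 'seed': 12345 })
});
var b = bootstrap( data, mean, {
    'numResamples': 200,
    'rand': randu.factory({ 'seed': 12345 })
});
// The test asserts that `a` and `b` are element-wise equal.
```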

Tutorial and blog post

I should be able to fit this in quite well. I spent a year writing weekly posts for my blog website, so I am quite well versed in this kind of writing. I think it would be good to draft a write-up as I go along, and during the weeks of leeway (and the final week) I will focus on editing the draft and turning it into something more formal and professional.

Larger data sets

Yes, I do have performance concerns for larger data sets:

  • Ariel Kleiner et al. suggest using the Bag of Little Bootstraps for handling massive data sets.
  • David Clarance mentions using the Poisson Bootstrap, which is usually used for handling stream data, for 'significant computational gains while creating resamples'.

However, I most likely won't be implementing either of these in a 90-hour project. As for the algorithms I will definitely be implementing:

  • In the Monte Carlo bootstrap, I can add parameters for resample size and number of resamples made. I will likely add default implementations for these parameters (e.g. both equal to the size of the sample data set), so programmers can optionally tweak the parameters to their performance needs.
  • Jackknife resampling is inherently limited, so it will be difficult to fully address performance concerns. It is only meant to be used for relatively small data sets anyway (we could include a note in the documentation warning users about performance and directing them to bootstrapping instead).
