[RFC]: add support for bootstrap and jackknife resampling #133
Comments
Thanks Zach for your proposal! Nice showcase demo, too. There definitely will be some time commitment concerns given your full-time internship and university coursework. Assuming all your commitments take more time than expected, what contingency plans would you have to ensure successful completion of this project? A few questions:
Hi! Thank you for your response. I will respond to your concerns below:
Commitment
I can see that it would be difficult to juggle my coursework and internship throughout the project. Here's a bit more information on exactly how they are structured:
Coursework
Internship
If time gets tight, I can allocate more time to implementing the bootstrapping and jackknife resampling algorithms. In the schedule, all high-importance goals would be completed by week 7, so there are 5 weeks of leeway.
Another contingency plan I have (if I only complete the high-importance goals) is to not implement jackknife at all and only focus on the Monte Carlo case resampling algorithm. This article by Stephanie Glen highlights how jackknife doesn't perform as well as bootstrap, and this article by Amit Yadav suggests that bootstrapping is better for larger data sets. With this in mind, I consider bootstrapping to be 'more important' than jackknife. Bootstrapping requires about 10 times as much computation as jackknife, which I don't consider to be too much worse (and one can adjust the parameters of the bootstrap such that it is more computationally efficient anyway).
Correctness of Implementations
Thank you for mentioning this. I didn't put as much thought into ensuring correctness as I perhaps should have.
Tutorial and blog post
I should be able to fit this in quite well. I spent a year writing weekly blog posts for my blog website, so I am quite well-versed in this sort of stuff. I think it would be good to draft a write-up as I go along, and during the weeks of leeway (and the final week) I will focus on editing the draft and turning it into something more formal/professional.
Larger data sets
Yes, I do have performance concerns for larger data sets:
However, I most likely won't be implementing either of these in a 90-hour project. As for the algorithms I will definitely be implementing:
Full name
Zach Land
University status
Yes
University name
Imperial College London
University program
Joint Mathematics and Computer Science
Expected graduation
2027
Short biography
Hi! I'm Zach, and I am a first-year Math & CS undergraduate at Imperial College London with a 90.9% grade average.
I am relatively new to open source development, although I did spend 3 years as an active Wiki contributor and administrator (including development of backend templates & data structures).
My primary interest lies in finance and data science, and I want to pursue a career in the field. I am well-versed in statistical methods, both from my university courses and recreationally in Python and C++, where I've recently been reading, implementing, and optimizing various algorithms.
My best programming language is most likely C#, but I have significant experience programming in C++, Haskell, Java, and Kotlin, and in web development with CSS, HTML, and JavaScript with React.
Stdlib interests me in particular because I don't need to turn to another language such as Python or R for data analysis. I feel I can contribute significantly to its mathematics and statistics modules.
A full profile of my work experience and projects can be found on my LinkedIn.
Timezone
Greenwich Mean Time (GMT, UTC+00:00)
Contact details
email:[email protected]
Platform
Windows
Editor
My IDE tends to depend on the language. I use Visual Studio for C# and Sublime Text for HTML, but in general I gravitate towards VSCode since it was easy to adapt to and use. I am looking toward transitioning to NeoVim in the near future though.
Note that my laptop (from which I will be mainly contributing) uses Windows; however, my personal PC uses Fedora Linux, so I am comfortable programming on both.
Programming experience
Key Project
One very recent project that springs to mind is one that my team and I are developing for Imperial College London's month-long AI Agent Hackathon. Our WIP is an accessibility extension to the UK's National Health Service app that allows users to scan medical documents, which are processed using various NLP algorithms such as keyword and sentiment analysis in order to summarize, translate, and signpost (with a focus on correctness and responsibility, as we are dealing with important information which may at times be fatally incorrect).
We are using React Native for our frontend (and I worked on porting several NHS frontend assets into the framework), with a Python backend.
Additional Projects
JavaScript experience
I have had a decent amount of experience with JavaScript since I develop websites quite frequently (see programming experience above).
I am very interested in Haskell and the functional programming paradigm in general. To this end, my favorite feature of JavaScript is currying and the ability to manipulate functions as variables. This was actually used in my recent stdlib pull request, wherein the function `incrnanmminmaxabs( out, window )` returns an accumulator function which I modified to skip `NaN` values.
My least favorite feature is JavaScript's dynamic typing. Coming from my C# and Haskell experience, I am very used to having to declare (e.g.) parameter types and function return types, and it feels odd how this is not required in JavaScript. The worst part of this is how JavaScript variables can change types. I feel that readability suffers (as someone who likes to precisely document and specify all of my code).
Node.js experience
I learnt Node.js for the first time in preparation for our AI Agents Hackathon (which I mentioned in my programming experience above).
C/Fortran experience
None.
Interest in stdlib
I mentioned this previously, but stdlib provides you with the ability to perform complex calculations within your browser without having to spend time using other languages for data processing.
I don't have too much experience with using stdlib, but one thing I really enjoyed was its huge list of incremental statistics methods. I am interested in stream processing and I find these methods particularly useful. My pull request was related to this, and my showcase utilizes this stdlib feature to perform real-time analysis of stocks.
Version control
Yes
Contributions to stdlib
stdlib showcase
My showcase uses `@stdlib/stats/incr/mmean` and `@stdlib/plot` to take an input stream of financial data, employ a simple moving average strategy, and plot the profit.
I am planning on implementing more (e.g. running an optimization algorithm to improve the sliding window sizes, or implementing more strategies that use stdlib), but unfortunately this application process is happening during a busy time of my life (Easter vacation work experience, AI Agents Hackathon, studying for finals next month) so I may not be able to get around to it.
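To illustrate the accumulator pattern behind a moving-average strategy, here is a minimal sketch in plain JavaScript. It does not reproduce the showcase or the stdlib implementation: `movingMean` is a hypothetical helper, and the trading rule (hold while the price is above its moving mean) is an assumption chosen only to make the example concrete.

```javascript
'use strict';

/**
* Returns an accumulator which computes the mean of the last `W` values.
* Hypothetical helper for illustration; not the stdlib implementation.
*/
function movingMean( W ) {
	var buf = new Float64Array( W ); // circular buffer of the last W values
	var sum = 0.0;
	var N = 0; // number of values seen so far, capped at W
	var i = 0; // next write position in the buffer
	return function accumulator( v ) {
		if ( N === W ) {
			sum -= buf[ i ]; // drop the value leaving the window
		} else {
			N += 1;
		}
		buf[ i ] = v;
		sum += v;
		i = ( i + 1 ) % W;
		return sum / N;
	};
}

// Assumed rule: hold the asset for the next step whenever the
// current price is above its moving mean.
var prices = [ 10.0, 11.0, 12.0, 11.0, 10.0, 9.0, 10.0, 12.0 ];
var acc = movingMean( 3 );
var profit = 0.0;
var holding = false;
var mu;
var t;
for ( t = 0; t < prices.length; t++ ) {
	if ( holding ) {
		profit += prices[ t ] - prices[ t-1 ];
	}
	mu = acc( prices[ t ] );
	holding = ( prices[ t ] > mu );
}
console.log( 'profit: %d', profit );
```

Keeping the window state inside a closure means the caller only ever passes the newest value, which is what makes the pattern suitable for streams of unknown length.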
Goals
Stdlib offers some ways to calculate maximum likelihood estimates given some sample data (e.g. see `@stdlib/stats/base/mean` and `@stdlib/stats/base/variance`); however, from my research I could find no real way in stdlib to calculate the distribution of estimates (i.e. estimators).
I propose to extend this functionality by implementing a resampling package which exposes various resampling methods such as Bootstrapping and Jackknife.
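For reference, jackknife resampling recomputes an estimator on each leave-one-out subsample of the data, giving `n` estimates for `n` observations. The following is a minimal sketch under that definition; the names `jackknife` and `mean` are hypothetical and are not existing stdlib APIs.

```javascript
'use strict';

/**
* Computes the leave-one-out (jackknife) estimates of `estimator` on `x`.
* Hypothetical function for illustration; not an existing stdlib API.
*/
function jackknife( x, estimator ) {
	var n = x.length;
	var out;
	var sub;
	var i;
	var j;
	var k;
	if ( n < 2 ) {
		throw new RangeError( 'jackknife requires at least two observations' );
	}
	out = new Float64Array( n );
	sub = new Float64Array( n - 1 ); // reused buffer for each subsample
	for ( i = 0; i < n; i++ ) {
		k = 0;
		for ( j = 0; j < n; j++ ) {
			if ( j !== i ) {
				sub[ k ] = x[ j ]; // copy everything except observation i
				k += 1;
			}
		}
		out[ i ] = estimator( sub );
	}
	return out;
}

// Example estimator: the sample mean.
function mean( x ) {
	var s = 0.0;
	var i;
	for ( i = 0; i < x.length; i++ ) {
		s += x[ i ];
	}
	return s / x.length;
}

var data = new Float64Array( [ 1.0, 2.0, 3.0, 4.0 ] );
var jk = jackknife( data, mean );
// jk holds the four leave-one-out means: 3, 8/3, 7/3, 2
```

This sketch calls the estimator `n` times on `n - 1` values each, whereas a bootstrap calls it once per resample, so the relative cost depends on how many resamples are drawn.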
Please note that my ideas are an initial draft. The listed project idea does not mention any mentors to discuss the project with, so any feedback at this stage is invaluable.
I will introduce a package, `@stdlib/stats/base/resampling`, which contains various sub-packages such as the simple Monte Carlo case resampling algorithm or the more complex Bayesian bootstrap, and then expose these via a class implementation (similar to how `@stdlib/stats/base/dists` contains distribution packages with sub-packages calculating specific quantities).
Alternatively, we could do what SciPy's bootstrap implementation did and add the various possible algorithms as optional parameters to a `bootstrap` function, although I dislike this approach.
Inputs and Outputs
Each resampling function should accept a `TypedArray`, a function which outputs an estimate from a given data set (such as sample mean), and optionally a random number generator for non-deterministic algorithms; and return a `TypedArray` containing the various estimates made from the bootstrap. This differs from the SciPy implementation, which outputs a confidence interval.
Below I outline some important goals for the project. I understand that implementing many different types of bootstrapping may be overkill for the scope of stdlib. Since the project is 12 weeks, I doubt I will be able to implement all of these different types. I will discuss with my mentors which of the mid-to-low-importance goals I should bring into the project.
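The interface described under Inputs and Outputs can be sketched as follows. This is an illustration under stated assumptions, not a proposed final API: the names `bootstrap`, `basicInterval`, and `lcg`, the argument order, and the nearest-order-statistic quantile rule are all hypothetical. The interval helper uses the standard basic (reverse percentile) form, `[2*theta - q(1-alpha/2), 2*theta - q(alpha/2)]`, where `theta` is the estimate on the original data and `q` is a quantile of the bootstrap estimates.

```javascript
'use strict';

/**
* Monte Carlo case resampling: draws `B` resamples of size n with
* replacement and applies `estimator` to each. Hypothetical sketch.
*
* @param {Float64Array} x - input data
* @param {Function} estimator - maps a data set to a single estimate
* @param {number} B - number of resamples
* @param {Function} [rand] - returns a uniform number on [0,1); defaults to Math.random
* @returns {Float64Array} one estimate per resample
*/
function bootstrap( x, estimator, B, rand ) {
	var rng = rand || Math.random;
	var out = new Float64Array( B );
	var buf = new Float64Array( x.length ); // reused resample buffer
	var b;
	var i;
	for ( b = 0; b < B; b++ ) {
		for ( i = 0; i < x.length; i++ ) {
			buf[ i ] = x[ Math.floor( rng() * x.length ) ];
		}
		out[ b ] = estimator( buf );
	}
	return out;
}

/**
* Basic (reverse percentile) interval from bootstrap estimates, using the
* nearest order statistics as quantiles (a deliberately crude choice).
*/
function basicInterval( estimates, theta, alpha ) {
	var sorted = Float64Array.from( estimates ).sort(); // numeric sort
	var last = sorted.length - 1;
	var qlo = sorted[ Math.floor( ( alpha/2.0 ) * last ) ];
	var qhi = sorted[ Math.ceil( ( 1.0 - ( alpha/2.0 ) ) * last ) ];
	return [ ( 2.0*theta ) - qhi, ( 2.0*theta ) - qlo ];
}

// Example estimator: the sample mean.
function sampleMean( x ) {
	var s = 0.0;
	var i;
	for ( i = 0; i < x.length; i++ ) {
		s += x[ i ];
	}
	return s / x.length;
}

// Small linear congruential generator so that runs are reproducible
// (illustrative only; not a high-quality PRNG).
function lcg( seed ) {
	var state = seed >>> 0;
	return function next() {
		state = ( ( 1664525 * state ) + 1013904223 ) >>> 0;
		return state / 4294967296;
	};
}

var obs = new Float64Array( [ 2.0, 4.0, 4.0, 5.0, 7.0, 9.0 ] );
var theta = sampleMean( obs );
var estimates = bootstrap( obs, sampleMean, 1000, lcg( 12345 ) );
var ci = basicInterval( estimates, theta, 0.05 );
console.log( 'estimate: %d, interval: [%d, %d]', theta, ci[ 0 ], ci[ 1 ] );
```

Returning the raw estimates and deriving the interval in a separate step keeps the resampling routine independent of any particular interval method, which is the main difference from an interface that returns only a confidence interval.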
High Importance
`[lowerBound, upperBound]` given an input `alpha`, to be calculated using a basic (reverse percentile) bootstrap.
Mid Importance
`@stdlib/stats/incr`. Using a Poisson model, we don't need to know the size of the input dataset in advance (see this article from Google's data scientists).
Low Importance
`@stdlib/stats/kde2d` package. That being said, it is more computationally efficient to add random noise directly rather than drawing samples from a smooth, continuous approximation with KDE.
Why this project?
What excites me about the proposal is the unique opportunity I have to contribute to open source on a widely-used library for a popular programming language. I feel that stdlib has a lot of room for growth in terms of its mathematical capabilities and I would love to contribute to the project in a structured program such as GSoC. My skills in implementing statistical algorithms and bootstrapping lend themselves to this project very well, so much so that I felt lucky to come across a proposal that felt so tailored to what I enjoy doing.
Qualifications
Prior art
Popular libraries with bootstrapping implementations:
Written Materials
Commitment
Please be aware that I will be doing a full-time 9-week internship which overlaps with most of GSoC, so I expect to be developing part-time. Also, between June 2-20, I have university programming coursework. Despite this, I spend much of my free time programming and, as such, I still believe I can manage my time effectively. This does mean I will be contributing most over the weekends.
Specifically, I will work on the project for around 1 hour every weekday, and as long as possible on weekends (say >6 hours per day).
I will contribute and play around with stdlib before the project starts to get used to using Git on a Windows machine and iron out any annoying development environment errors. Note that between finals and my internship start date, I will be available to focus mainly on the GSoC project and thoroughly flesh out my ideas.
Schedule
It is difficult to structure a specific schedule for this project, since it depends on how many resampling methods I will implement during the project. My goals can easily be separated into several mini-projects so I can re-scale the scope of the project easily. The below schedule is highly flexible and includes two weeks of leeway to ensure I will be able to submit the deliverables on time. If the weeks of leeway are not used, I can use the accumulated extra time at the end to implement more mid-to-low importance goals.
Preparation
I will submit more basic pull requests, get used to the stdlib development process, and iron out development environment errors.
Community Bonding Period (May 8 - June 1):
I will attempt to arrange a voice chat with my mentors to discuss the project. We will discuss the specific implementation details, edge cases to worry about, how and when to structure and submit my pull requests, etc.
Speaking directly should be a good way to build strong relationships that will last throughout the program.
Week 1 (June 2-8):
Week 2 (June 9-15):
Week 3 (June 16-22):
Week 4 (June 23-29):
Week 5 (June 30-July 6):
Week 6 (Midterm) (July 7-13):
Week 7 (July 14-20):
Week 8-9 (July 21-August 3):
Week 10 (August 4-10):
Week 11 (August 11-17):
Week 12 (August 18-24):
Final Week (August 25-September 1):
Related issues
Checklist
`[RFC]:` and succinctly describes your proposal.