Data Version Control Tutorial – Best Practices for Machine Learning Projects Reproducibility
Today the data science community still lacks good practices for organizing its projects and collaborating effectively. ML algorithms and methods are no longer simple “tribal knowledge”, but they are still difficult to implement, manage and reuse.
One of the biggest challenges in reusing, and hence managing, ML projects is reproducibility.
To address reproducibility, we have built Data Version Control, or DVC.
This example shows you how to solve a text classification problem using the DVC tool.
Git branches should beautifully reflect the non-linear structure common to the ML process, where each hypothesis can be presented as a Git branch. However, the inability to store data in a repository and the resulting discrepancy between code and data make it extremely difficult to manage a data science project with Git alone.
DVC brings large data files and binary models into a single Git workflow, and it does so without requiring you to store binary files in your Git repository.
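To make the idea of keeping binary files out of Git more concrete, here is a minimal Python sketch of the general pointer-file pattern: the large file goes into a content-addressed cache, and only a tiny text file describing it would be committed to Git. This is an illustration under our own assumptions, not DVC's actual implementation; the function names `track_file` and `restore_file`, the `.pointer.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track_file(data_path: Path, cache_dir: Path) -> Path:
    """Copy a data file into a content-addressed cache and write a pointer file.

    Hypothetical helper: the pointer file is small and text-based, so it is
    the only thing that would need to be committed to Git.
    """
    # The SHA-256 digest of the contents identifies the file inside the cache.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():
        shutil.copy2(data_path, cached)
    # Record the original file name and its digest next to the data file.
    pointer = data_path.with_name(data_path.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_path.name, "sha256": digest}))
    return pointer


def restore_file(pointer: Path, cache_dir: Path) -> Path:
    """Recreate the data file described by a pointer file from the cache."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy2(cache_dir / meta["sha256"], target)
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        cache = Path(tmp) / "cache"
        workspace.mkdir()

        payload = b"pretend this is a large binary model"
        data = workspace / "model.bin"
        data.write_bytes(payload)

        pointer = track_file(data, cache)
        data.unlink()  # simulate switching to a branch without the data
        restored = restore_file(pointer, cache)
        print(restored.read_bytes() == payload)
```

Because the pointer file changes whenever the data changes, checking out a different Git branch also selects a different version of the data, which is the property the tutorial relies on.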
Full article: Data Version Control Tutorial
Preparation
1.1. What we are going to do?
1.2. Getting the sample code
1.3. Install DVC
1.4. Initialize
Define ML pipeline
2.1. Get data file
2.2. Data file internals
2.3. Running commands
2.4. Running in a bulk
Reproducibility
3.1. How reproducibility works?
3.2. Adding bigrams
3.3. Checkout code and data files
3.4. Tune the model
3.5. Merge the model to master
Sharing data
4.1. Pushing data to cloud
4.2. Pulling data from cloud
DVC commands
Summary:
Git branches beautifully reflect the non-linear structure of ML processes, where each hypothesis can be presented as a Git branch. DVC makes it possible to navigate through Git branches with both code and data, which makes the ML process more manageable and reproducible.