ML Configuration Management
This page lists the material that is relevant for the in-class exercises of the ML Configuration Management lecture. In this tutorial we are going to try `mllint` to test best practices, DVC to manage the ML pipeline and version-control artefacts, and pytest to test a common ML scenario.
Intro Outline
- Initial Project: https://github.com/luiscruz/sms1
  - still needs a ton of refactoring
- Teams of 2
- One team does a demo at the end
1. Setup
We will use the basic ML project from the previous classes. If you have already set it up, you can skip this section.
Fork the GitHub repo: https://github.com/luiscruz/sms1
2. mllint
In this tutorial we will use `mllint` to help improve our `sms1` project.
Run `mllint`:
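If you have not installed it yet, `mllint` is available on PyPI; run it from the root of the project:

```
pip install mllint
mllint
```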
Analyse its output and verify what it says about data version control.
3. Data pipeline
1- Initiate DVC to set up our automated pipeline.
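Run the following command; it builds the stage exactly as described below:

```
dvc run -n get_data -d src/get_data.py -o dataset python src/get_data.py
```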
Check the output. Make sure you understand why we are getting the error. Fix it based on the suggestion proposed by `dvc`.
In the above command, we have added the first stage of the pipeline: `get_data`. It depends on the script `src/get_data.py` (flag `-d`) and yields the output `dataset` (flag `-o`). Finally, we specify that the stage `get_data` is executed by running `python src/get_data.py`.
Tip: skim the scripts in `src/` to see what each one reads and writes; these are the dependencies and outputs of your stages.
Check the `dvc run` documentation (https://dvc.org/doc/command-reference/run) to understand how it works.
Note: if you are experiencing any issues with a library called `fsspec`, force-install version 2022.7.1:
pip uninstall fsspec
pip install fsspec==2022.7.1
2- Create a new stage, `preprocess`, that will run the script `src/text_preprocessing.py`. Use `dvc run` the same way we did in the previous stage. Think about the dependencies (`-d`), the outputs (`-o`), and the Python script that it needs to run.
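A sketch of what this could look like. The dependency `dataset` and the output name `preprocessed_data` are assumptions; check `src/text_preprocessing.py` to see which files it actually reads and writes:

```
dvc run -n preprocess \
    -d src/text_preprocessing.py \
    -d dataset \
    -o preprocessed_data \
    python src/text_preprocessing.py
```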
Now check what the pipeline looks like with the following command:
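```
dvc dag
```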
You should get a directed acyclic graph (DAG) of all the stages in the pipeline. You can also generate a DAG of the dependencies of all the artefacts:
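```
dvc dag --outs
```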
Also check what the `dvc.yaml` file looks like.
3- Add the `train` stage that runs the script `src/text_classification.py`:
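Again a sketch: the dependency on the preprocessing output and the model output name are assumptions; adapt them to what `src/text_classification.py` actually consumes and produces:

```
dvc run -n train \
    -d src/text_classification.py \
    -d preprocessed_data \
    -o model \
    python src/text_classification.py
```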
Demo
Someone in the class will be asked to present this part of the tutorial to the whole class. Show what the pipeline looks like and explain how you have specified it.
4. Remotes
Previously, we have seen how to automate the execution of your data pipeline while avoiding unnecessary re-runs.
Now, imagine that you changed something and executed the pipeline. If more people are working on your project, everyone will have to re-run the pipeline to retrieve the same artefacts in their local repo.
To avoid such a waste of time and resources, DVC features the concept of a remote: a storage location that you can use to store your artefacts.
Remotes can live in your local storage (local remotes 🤷‍♀️) or in a cloud storage (e.g., Amazon S3, GDrive, etc.).
Local Remotes
Set up a local "remote" called `mylocalremote`.
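For example (the storage path `/tmp/dvcstore` is an arbitrary choice, and `-d` makes it the default remote):

```
dvc remote add -d mylocalremote /tmp/dvcstore
```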
Check out how the project's config file stores this info:
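DVC keeps the remote configuration in `.dvc/config`:

```
cat .dvc/config
```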
To push all our artefacts to the remote, run the push command:
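```
dvc push
```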
Now let's change something and see how dvc takes care of our artefacts: let's update the dataset. In `src/get_data.py`, change the URL to the following:
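This is the v2 dataset (2000 datapoints) from the Datasets list at the bottom of this page:

```
https://surfdrive.surf.nl/files/index.php/s/OZRd9BcxhGkxTuy/download
```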
Now follow the next 3 steps (commands sketched below):
1- reproduce the pipeline
2- commit your changes to git
3- push the changes to the local dvc remote
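A sketch of the corresponding commands (the commit message is just an example):

```
dvc repro
git commit -am "Update dataset to v2"
dvc push
```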
Voilà! Your new version of the data pipeline is now backed up.
Revert the project to the old commit with the old dataset and reproduce the pipeline. Explain what happened.
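For example, assuming the dataset update was the latest commit:

```
git checkout HEAD^
dvc repro
```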
Confirm that the dataset is in fact the old version by checking its size (should be 1000):
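Assuming one datapoint per line, for example:

```
wc -l dataset
```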
Revert to the head of the branch and check again that dvc took care of retrieving the latest artefacts (dataset should have 2000 data points).
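For example (your default branch may be `main` or `master`):

```
git checkout main
dvc checkout
wc -l dataset
```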
Cloud Remotes
Instead of having a remote stored locally, we can store it using cloud storage. In this example, we will use Google Drive. Only one teammate needs to set this up while the others follow along.
To make things clear, start by removing the local remote:
dvc remote remove mylocalremote
git commit -am "delete dvc remote 'mylocalremote'"
1. Create a folder in your Google Drive.
   1.1 Share it with your teammate.
   1.2 Copy the folder id from its page URL (e.g., `1zUHN-qHKDKvQE8igLPVZ2JMQvrAjyxa`).
2. Follow the instructions in the following documentation page to add a GDrive remote: https://dvc.org/doc/user-guide/setup-google-drive-remote#url-format
3. Change something and see how dvc takes care of our artefacts. For example, update the dataset to version v3 by updating `URL` in `src/get_data.py`:
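   This is the v3 dataset (3000 datapoints) from the Datasets list at the bottom of this page:

   ```
   https://surfdrive.surf.nl/files/index.php/s/H4e35DvjaX18pTI/download
   ```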
4. Rerun the pipeline. Commit and push your changes to GitHub; push them to the dvc remote (commands sketched after this list).
5. Ask your colleagues to pull the pipeline (git and dvc) and notice that all the artefacts are there.
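A sketch of the commands for steps 4 and 5 (the commit message is an example; the last two commands run on your teammate's machine):

```
dvc repro
git commit -am "Update dataset to v3"
git push
dvc push

# on your teammate's machine
git pull
dvc pull
```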
5. Experiment Management
ML is an experimental process. Each new experiment needs to be analysed and compared against certain metrics. Here's how to manage experiments using dvc.
- You will have to change the training script (`src/text_classification.py`) to output its metrics to a JSON file (could also be YAML).
- Do a `json.dump` to store the accuracy of the model.
- It should output a JSON file like the following: `{"accuracy": 0.9566666666666667}`
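A minimal sketch of how the training script could write that file; the helper name and the `metrics.json` path are assumptions, not part of the project:

```python
import json

def save_metrics(accuracy: float, path: str = "metrics.json") -> None:
    """Write the model's accuracy to a JSON file so DVC can track it as a metric."""
    with open(path, "w") as outfile:
        json.dump({"accuracy": accuracy}, outfile)

# e.g., at the end of main(), after evaluating the model:
# save_metrics(accuracy)
```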
- Change the `dvc.yaml` file to include the metrics of the stage `train`. Hint: check the docs to see an example of a `dvc.yaml` with metrics: https://dvc.org/doc/command-reference/metrics
- Run the experiment: `dvc exp run`
- See the difference by running `dvc metrics diff`
- Change something in your project (e.g., change the random state) and run a new experiment.
- Check the experiment log:
$ dvc exp show
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ Experiment              ┃ Created  ┃ accuracy ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ workspace               │ -        │ 0.94167  │
│ dvc-tutorial            │ 09:46 AM │ 0.945    │
│ ├── 3f12aef [exp-7f800] │ 09:51 AM │ 0.94167  │
│ └── 8f30720 [exp-44136] │ 09:48 AM │ 0.945    │
└─────────────────────────┴──────────┴──────────┘
6. Testing with Pytest
Create a test that checks the difference between running the model with two different random seeds. The test should fail if the difference is above 0.1.
You should have a new folder named `tests`:
.
├── src
│   ├── __init__.py
│   ├── get_data.py
│   ├── read_data.py
│   ├── serve_model.py
│   ├── text_classification.py
│   └── text_preprocessing.py
└── tests
    ├── __init__.py
    └── test_simple.py
Hint: You will probably have to import the method `text_classification.main` and change it to accept a parameter with the random seed.
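A sketch of what `tests/test_simple.py` could look like; it assumes `main` has been refactored to accept a random seed and to return the model's accuracy:

```python
from src.text_classification import main

def test_seed_robustness():
    # Train and evaluate with two different (arbitrary) random seeds.
    accuracy_a = main(random_state=0)
    accuracy_b = main(random_state=42)
    # Fail if the accuracy difference between the two runs is above 0.1.
    assert abs(accuracy_a - accuracy_b) <= 0.1
```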
Execute the test by running `pytest`:
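```
pytest
```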
7. Final Discussion
- What were your main challenges in this tutorial?
- What are the main drawbacks of using DVC for ML projects?
Datasets
- v1 (1000 datapoints): https://surfdrive.surf.nl/files/index.php/s/WCPP8WJPrtCbUO5/download
- v2 (2000 datapoints): https://surfdrive.surf.nl/files/index.php/s/OZRd9BcxhGkxTuy/download
- v3 (3000 datapoints): https://surfdrive.surf.nl/files/index.php/s/H4e35DvjaX18pTI/download
- v4 (4000 datapoints): https://surfdrive.surf.nl/files/index.php/s/HU5mY29RzxRlHCU/download