Numlabs Data Science Blog - Self-taught in the world of Machine Learning

After all, it is just another recommendation system or image recognition model. A smart programmer will build it in a week based on tutorials from the Internet. But is that really so? In this article we present four reasons why you need an experienced ML engineer to succeed without burning through your budget.

1. Data

One topic that is often overlooked, or explained only superficially, is the preliminary analysis and preparation of data before it is fit to feed into a neural network or any other algorithm. Textbooks and articles reuse the same thoroughly cleaned and analysed datasets for the thousandth time so that the authors can focus on the ML algorithm, but repeating their preparation procedure usually makes no sense for the data you have collected yourself. The same is true of scientific papers: collecting and preparing the data so that the presented models work well often takes a huge amount of work, yet this stage is rarely emphasised. Remember: the fact that something works well on one dataset does not mean it will work on yours. Doing this correctly, without falling into one of the many traps, requires a lot of knowledge.
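As a rough illustration of what such a preliminary analysis can look like, here is a minimal sketch of a data audit in plain Python. The function name `audit_rows`, the field names and the sample records are all invented for this example; a real audit would cover far more (outliers, leakage, inconsistent units and so on).

```python
from collections import Counter


def audit_rows(rows, label_key):
    """Summarise basic data-quality problems in a list of dict records."""
    # Count missing values (None or empty string) per column.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                missing[key] += 1

    # Exact duplicates: rows with identical key/value pairs.
    # Assumes all values are hashable (numbers, strings, None).
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(count - 1 for count in seen.values())

    # Class balance, ignoring rows that have no label.
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) not in (None, "")
    )
    return {
        "missing": dict(missing),
        "duplicates": duplicates,
        "labels": dict(labels),
    }


# Hypothetical records, made up for this example.
rows = [
    {"age": 34, "city": "Krakow", "churned": "no"},
    {"age": None, "city": "Warsaw", "churned": "no"},
    {"age": 34, "city": "Krakow", "churned": "no"},
    {"age": 51, "city": "", "churned": "yes"},
]
report = audit_rows(rows, label_key="churned")
print(report)
```

Even on four records the report surfaces a missing value in two columns, one duplicated row and a three-to-one class imbalance, none of which a tutorial dataset would have prepared you for.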

2. Choosing a model

Choosing the right AI model is like buying your first car or building a house. Most of us are not experts in these areas, but we hope that everything can be found on the Internet and that, as reasonable people, we will be able to compare the options and choose the right one. During the search, however, we come across many different ideas and sometimes contradictory advice. At some point we may even conclude that we do not really know what we are looking for or what we want to do. AutoML systems can help to some extent, but developing dedicated software that works reasonably well in production still requires specialist knowledge that cannot be acquired overnight.

3. Industrialisation

When creating a commercial solution you have to take many factors into account. Will the neural network run fast enough on millions of records? Will the number of samples I have be sufficient to train the model? What happens when the number of users of my SaaS triples? What about cloud fees during training and, later, deployment? Data drift? Never heard of it. Where will I get and store the data I need? Unless you are doing a hobby project, you are bound to run into these questions, and they are only a selection of the problems you need to deal with. As is often the case in engineering, there are no simple answers; specific knowledge and extensive analysis are required.
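To make one of these questions concrete, here is a hedged sketch of a data drift check using the Population Stability Index (PSI), one of several common drift measures. The function name, the equal-width binning, the small floor for empty bins and the synthetic samples are assumptions made for this example, not a recommendation for any particular system.

```python
import math


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, using equal-width bins on the reference range."""
    lo, hi = min(reference), max(reference)
    # Fall back to a width of 1.0 when all reference values are equal.
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Small floor so that empty bins do not cause log(0) or division by zero.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic feature values: training-time data versus shifted production data.
reference = list(range(100))
shifted = [v + 30 for v in reference]

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(psi_same, psi_shifted)
```

Identical samples give a PSI of zero, while the shifted sample scores far higher. An often-quoted rule of thumb treats values above roughly 0.25 as a significant shift, but sensible thresholds depend on the feature and the business context, which is exactly where experience comes in.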

4. Testing and production

When we have gone through most of the software development process and it seems that our struggle is over, the moment of truth arrives: deployment, or the clash with reality. It is very often painful if we have not taken care of proper testing and verification, which, due to the specifics of AI projects, cover more than the well-known issues of classical software. Of course, testing will never give us certainty that there are no mistakes, but without it, it may turn out to be cheaper to start the project from scratch than to work out at which stage the omissions were made and how their combination caused the project to fail.
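One example of a check that has no direct counterpart in classical software testing is verifying that a model actually beats a trivial baseline on held-out data. The sketch below is only an illustration: the helper names, the 0.05 margin and the toy "models" are hypothetical.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return accuracy([most_common] * len(test_labels), test_labels)


def check_beats_baseline(model_predict, train_labels, test_inputs, test_labels, margin=0.05):
    """Return True only if the model beats the trivial baseline by at least `margin`."""
    model_acc = accuracy([model_predict(x) for x in test_inputs], test_labels)
    baseline_acc = majority_baseline_accuracy(train_labels, test_labels)
    return model_acc >= baseline_acc + margin


# Toy data and toy "models", invented for this example.
train_labels = ["ham"] * 8 + ["spam"] * 2
test_inputs = [1, 2, 7, 8, 3]
test_labels = ["ham", "ham", "spam", "spam", "ham"]


def threshold_model(x):
    return "spam" if x > 5 else "ham"


def constant_model(x):
    return "ham"


useful = check_beats_baseline(threshold_model, train_labels, test_inputs, test_labels)
useless = check_beats_baseline(constant_model, train_labels, test_inputs, test_labels)
print(useful, useless)
```

A model that merely repeats the majority class can show a respectable accuracy figure and still fail this check, which is the kind of omission that is cheap to catch before deployment and expensive to trace afterwards.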

Summary

In this short post we have given you a glimpse of the problems that arise at each stage of ML model development. Of course, the list is not exhaustive; it covers only the most common ones. So, if you are not an AI expert and you are taking the plunge into building a solution yourself, keep in mind the challenges listed above, which you will face when developing an ML model and moving it into production.
