The Relevance of Data for ML Algorithms

Artificial intelligence (AI) and its associated technologies have arrived at a time when we face some of the most difficult and complex problems, many of which we cannot solve on our own. Put simply, artificial intelligence is intelligence demonstrated by machines that mimic the cognitive functions of humans. Machine learning (ML) is a subset of AI that uses algorithms and statistical models to let computers ‘learn’ from data, provide deep insights and improve based on past experience, much as we humans do.

Even though the concepts of AI and ML have existed since the 1950s, it is only in recent times that they have gained prominence. 

There are many reasons for this but one of the most important ones is data. 

Data is the lifeblood of AI and ML, and only in recent times, owing to the popularity of the internet, have the necessary volumes and types of data existed. Growing computational power, data storage capabilities and advanced algorithms have helped in that regard too.

There are several industries today that gain a competitive advantage by integrating machine learning algorithms into their operations and that is possible only because of data. 

ML algorithms are increasingly used in agriculture, banking, marketing, search engines, healthcare, speech recognition and many other fields. The analysis and predictions these systems provide are invaluable, and the whole process starts with data.

It cannot be stressed enough how important the quantity and quality of data are for the proper implementation of machine learning systems.

Relevant Data – Building A Strong Foundation

Since data is so important, it deserves proper attention right from the very first stage of data collection. This calls for a well-designed data infrastructure, one that allows the right data to be collected in the right format and in the appropriate volume. An ML algorithm is only as good as the data we feed into it, so it is very important to ensure that the data being collected is of good quality. Bad data can lead to insights that are not actionable and results that are misleading, and it wastes valuable time and resources.
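
As a rough illustration of what checking data quality at collection time might look like, here is a minimal sketch that counts missing values and exact duplicates in a batch of records. The function name quality_report, the field names and the sample records are all hypothetical and invented for this example; a real pipeline would check much more (ranges, units, timestamps and so on).

```python
from typing import Any


def quality_report(records: list[dict[str, Any]], required: list[str]) -> dict[str, Any]:
    """Count missing required fields and exact duplicate records."""
    missing = {field: 0 for field in required}
    seen = set()
    duplicates = 0
    for rec in records:
        for field in required:
            # Treat absent keys, None and empty strings as missing.
            if rec.get(field) in (None, ""):
                missing[field] += 1
        # A sorted tuple of items is a hashable fingerprint of the record.
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": len(records), "missing": missing, "duplicates": duplicates}


# Hypothetical field measurements, made up for illustration only.
records = [
    {"field_id": "A1", "soil_ph": 6.5, "rainfall_mm": 12.0},
    {"field_id": "A2", "soil_ph": None, "rainfall_mm": 3.5},
    {"field_id": "A1", "soil_ph": 6.5, "rainfall_mm": 12.0},  # exact duplicate
    {"field_id": "A3", "soil_ph": 7.1, "rainfall_mm": ""},
]
report = quality_report(records, ["field_id", "soil_ph", "rainfall_mm"])
print(report)
```

Running a check like this before training makes gaps and repeated rows visible early, when they are still cheap to fix.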

To understand what makes good data, take the agriculture sector as an example. The goal of tech in agriculture is to maximise productivity while ensuring minimal waste. To achieve this, data needs to be collected on as many relevant variables as possible. This can include soil type, condition and fertility; weather data such as temperature, humidity, wind speed and rainfall; seed quality and variety; yield; types of crop protection chemicals; types of diseases; and a whole host of others.

For example, disease detection is really important to minimise crop loss. Here, hundreds of thousands of photos of diseased plants serve as the right data: they train the ML model to recognise the type of disease and its severity through pattern recognition. Pesticides can then be applied in a timely and targeted way, which further reduces the resources required. Choosing the right data is what produces the desired result.

Another common example is Google Maps, which has a tremendous amount of data about real-world addresses and uses it to make suggestions as soon as we start typing in the app. Addresses that are properly written can be easily suggested and located by the ML algorithm. Addresses that are not written correctly require a lot of pre-processing in the background, because the ML system has to refine the address data first. That makes it harder to locate those addresses accurately, which is not the desired result.
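
To give a feel for the kind of pre-processing that messy address text needs, here is a minimal sketch of address normalisation. It is not how Google Maps works; the normalize_address function and the small abbreviation table are assumptions made up for this example.

```python
import re

# Hypothetical abbreviation table; a real system would need a far larger,
# locale-aware one ("st" can also mean "saint", for instance).
ABBREVIATIONS = {"st": "street", "rd": "road", "ave": "avenue"}


def normalize_address(raw: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace, expand abbreviations."""
    text = raw.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()                 # split() also collapses repeated spaces
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


messy = normalize_address("  221B  Baker St. ")
clean = normalize_address("221b Baker Street")
print(messy)
print(messy == clean)
```

Even this toy version shows why well-written input is cheaper: every extra rule is another place where the clean-up can guess wrong.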

A recent HR fiasco at Amazon provides one more example of the importance of the right data. In late 2018, it was reported that Amazon had been using machine learning to screen the resumes of job applicants. The ML system shortlisted mostly male candidates because it was trained on the company's historical data for technical positions, which had mostly been held by men. The system replicated that pattern in its results, leading to a bias against female candidates.
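
One simple way to spot this kind of skew before training is to compare how often each group appears, and how often it was selected, in the historical data. The sketch below does that on toy data. The group labels, the counts and the selection_rates function are invented for illustration and do not represent Amazon's data or system.

```python
from collections import defaultdict


def selection_rates(history: list[tuple[str, bool]]) -> dict[str, float]:
    """Return the fraction of selected examples per group."""
    totals: dict[str, int] = defaultdict(int)
    selected: dict[str, int] = defaultdict(int)
    for group, was_selected in history:
        totals[group] += 1
        if was_selected:
            selected[group] += 1
    return {group: selected[group] / totals[group] for group in totals}


# Toy historical outcomes: (group, was_selected). All numbers are made up.
history = (
    [("group_a", True)] * 8 + [("group_a", False)] * 2
    + [("group_b", True)] * 1 + [("group_b", False)] * 4
)
rates = selection_rates(history)
print(rates)
```

A large gap between groups, as in this toy data, is a warning that a model trained on the history is likely to reproduce the same pattern.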

Although quality data is of immense importance, it can also be really tough to collect the appropriate data.

There are also problems with erroneous signals: even a small bias in a sensor can affect the overall results and make the whole process flawed. To counter this to some extent, complex normalisation and feature selection methods are often applied.
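
As a small illustration of normalisation, the sketch below standardises each sensor's readings separately (a per-sensor z-score), which removes a constant offset between sensors. The sensor names, the readings and the standardize_per_sensor function are hypothetical. The approach also assumes the sensors observe comparable conditions, because it removes genuine differences in level along with the bias.

```python
from statistics import mean, pstdev


def standardize_per_sensor(readings: dict[str, list[float]]) -> dict[str, list[float]]:
    """Subtract each sensor's mean and divide by its standard deviation."""
    result = {}
    for sensor, values in readings.items():
        mu = mean(values)
        sigma = pstdev(values)
        # A constant signal has zero spread; map it to zeros to avoid dividing by 0.
        result[sensor] = [(v - mu) / sigma if sigma > 0 else 0.0 for v in values]
    return result


# Made-up temperature readings: sensor_b reads a constant 2 degrees too high.
readings = {
    "sensor_a": [20.0, 21.0, 22.0],
    "sensor_b": [22.0, 23.0, 24.0],
}
standardized = standardize_per_sensor(readings)
print(standardized)
```

After standardising, both sensors produce the same sequence, so the constant offset no longer leaks into whatever model consumes the readings.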

Also, data collected by humans may not be reliable as it may be riddled with inconsistencies and biases. 

It is therefore important to come up with a proper data strategy that ensures smooth, efficient and accurate data collection. This, in turn, helps machine learning algorithms do what they do best: analyse data, learn from it and provide ever better results that help solve complex problems.
