
Application inner workings - Data

The goal of almost any application is to handle data.

This is the second article in a series on aspects of web development that are not directly linked to coding, but that are crucial to the workings of any application. The first one, regarding application communication, can be found here. As stated in the intro to that article, this series aims to be a helpful guide for business people who would like to know a little more about our world, and who want to make the effort to speak our language. If they know all the points covered in this series, I hope they will feel more at ease in discussions about applications and in the definition of stories. They will surely earn the respect of the developers they work with right away, and both parties will be happy that very patient translators are no longer required.

This article will cover how applications receive data, how they can fetch more, and which data is worth storing in their own databases.

What is data?

Right, first off, we need to handle the basic question - why does an application require data in the first place? What is considered data, how is it classified, and how should it be handled?

Data is anything that may be of interest to the user, to the application itself, or to their surroundings. This can be direct user data (username, password), the user’s preferences or settings (like dark vs light mode), measured values (average temperature in a geographical region), etc. Pretty much anything can be considered data.

It is important to distinguish between the different types of data, as well as their lifecycles. For instance, some data becomes useless after a short period of time (e.g. the chlorine content of a pool two hours ago), whereas other data stays valid until it is updated (e.g. your password). Some data has no value on its own, but can become a very valuable tool for analysis once a lot of similar data has been gathered. This is where the buzzword Big Data comes into play: compare the temperature at one particular timestamp with the weather data of the last 100 years. Some data just never has any value, except maybe for regulatory reasons (such as the numbers that came up at the roulette table).

Almost every application handles data. Facebook contains the data of people you are close to, Telegram stores messages, and Netflix has movies. They all have very different use cases, but every single application is built around its core data and how it handles that data.

So, if data is that important, why don’t applications simply store everything they know? After all, storage is cheap. Well, the short answer is - they try to. If you live in the European Union, for example, you can request all the data Facebook has about you. If you do though, be prepared to find out a lot of things about yourself that you may have forgotten a long time ago. It’s pretty creepy, to be honest.

However, not every application is like Facebook or Google. In a money management application, for instance, your sexual preference is not important, so it should not be stored (nor asked for, for that matter). It’s not just sensitive data like this that is irrelevant to such an application - the number of goldfish in your bowl also won’t ever change anything.

So let’s see how applications receive the data to be the most functional for you, and therefore keep you on their platform. There are three ways data can be served - through user input, from its own databases, and by fetching it from third parties. In designing your applications, it is very important to always be aware of how you can get a hold of the data you require for the functions you’re trying to perform.

Incoming data

The first source of data is you - the user. The user often has to input some basic data to even connect to an application. I’m not just talking about login credentials, but also some minimal information to make the app useful. For instance, that money management application can never guess the amount of money in your bank accounts, so you need to either enter the balance yourself, or allow the application to connect to your bank and authorize it to view the account’s balance.

There can be other types of incoming data as well. For a weather station, incoming data comes from sensors that push information such as humidity, temperature, etc. to the servers in real time. A broker will have a live feed from stock analysis platforms with the prices of the latest trades. These services are rarely free - usually they are quite expensive, especially if the delay is very short. In these cases, the backend applications are connected to those services through event streaming, message queues (MQs) or web sockets, as regular polling would not be efficient. More on this can be found here.

Stored data

After the application has received input data, be it from the user or from some other platform, it needs to decide whether to store the data, or simply perform some other processes with it and then discard it. In order to not exhaust disk space (or other storage forms), it may also remove some older historical data, once it is no longer needed.
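The idea of removing data once it is no longer needed can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Reading` type, the `RETENTION` rules and the `prune` function are hypothetical names, and the retention periods simply echo the pool and weather examples from earlier in the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Reading:
    kind: str            # e.g. "pool_chlorine" or "temperature"
    value: float
    taken_at: datetime


# Hypothetical retention rules: how long each kind of reading stays useful.
RETENTION = {
    "pool_chlorine": timedelta(hours=2),
    "temperature": timedelta(days=365 * 100),
}


def prune(readings, now):
    """Keep only the readings that are still inside their retention window."""
    kept = []
    for reading in readings:
        max_age = RETENTION.get(reading.kind)
        # Kinds without a rule are kept rather than silently dropped.
        if max_age is None or now - reading.taken_at <= max_age:
            kept.append(reading)
    return kept


now = datetime(2024, 1, 1, 12, 0)
readings = [
    Reading("pool_chlorine", 1.5, now - timedelta(hours=3)),     # too old
    Reading("pool_chlorine", 1.4, now - timedelta(minutes=30)),  # still fresh
    Reading("temperature", 4.0, now - timedelta(days=3650)),     # long-lived
]
fresh = prune(readings, now)
```

In a real application this kind of cleanup would typically run as a scheduled job against the database rather than on a list in memory.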

From the application’s point of view, it can never have too much data. Still, in many cases it is unwise to keep certain data for regulatory reasons, as it may be illegal to keep it after an explicit request to delete it. If you have too much unorganized data about a person, honouring such a request becomes cumbersome. Therefore you should strive to have all the data your application needs, and not more.

Once it has been decided that data should be stored, you should determine how to store it. Many options are available, but it usually boils down to a few:

  • SQL DBs
  • NoSQL DBs
  • File storage
  • Blockchain for the fancy people

Which option to choose is not in the scope of this article - the main point is that once you have stored the data, you can easily retrieve it again. When a user enters his weight every morning in Google Fit, the main reason is not just to feed the data to Google, but to be able to see it again at some later time.

It is also your concern to protect the data! Data hacks are very frequent, and the less data you have, the smaller the repercussions of such unfortunate events are.

In a microservice architecture, where many services operate side by side, it is very important to have a clear-cut boundary of what is stored where. For instance, Netflix may have one backend containing movies and their metadata, and another with the user statistics. There needs to be a logical connection between the two, but it is most likely ID based, so the statistics backend would know that User 12345 has seen movies 55 and 420, and not that those movies are called The Lion King and Scarface. If that information is required, it should be retrieved from the backend that owns it.
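As a minimal sketch of such an ID-based connection, the two backends can be imitated with two dictionaries. Everything here is an assumption for illustration: the names `WATCH_HISTORY`, `MOVIES` and the three functions are hypothetical, and in a real system `movie_title` would be a network call to the other service, not a dictionary lookup.

```python
# Hypothetical "statistics" backend: it only knows IDs.
WATCH_HISTORY = {12345: [55, 420]}

# Hypothetical "catalogue" backend: the owner of movie metadata.
MOVIES = {55: "The Lion King", 420: "Scarface"}


def watched_ids(user_id):
    """What the statistics backend can answer on its own."""
    return WATCH_HISTORY.get(user_id, [])


def movie_title(movie_id):
    """Stand-in for a request to the catalogue backend."""
    return MOVIES[movie_id]


def watched_titles(user_id):
    """Combine both: look up the IDs locally, then retrieve the titles."""
    return [movie_title(movie_id) for movie_id in watched_ids(user_id)]


titles = watched_titles(12345)
```

The statistics backend never stores a title, so when a movie is renamed in the catalogue there is only one place to update.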

Fetched data

Finally, the third type of data is the information that neither the user nor you currently have. For instance, imagine that, in the money management tool, the user previously entered that he has one account with 1000 USD, which you’ve now stored. Next week, he comes back to the application and would like to know what his net worth is in Vietnamese Dong. Since you do not have the real-time conversion rate of USD to VND, you make a request to a third-party application to get that rate, which you then use to make the calculation, and discard again after serving your response.
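In code, this flow could look roughly like the sketch below. The function names are hypothetical, and `fake_rate_service` is a stand-in for the third-party request; the rate it returns is a made-up number, not a real quote.

```python
from decimal import Decimal


def net_worth_in(currency, balance_usd, fetch_rate):
    """Convert a stored USD balance using a rate fetched at request time."""
    rate = fetch_rate("USD", currency)  # fetched, used once, never stored
    return balance_usd * rate


def fake_rate_service(base, quote):
    """Stand-in for the third party; the rate below is invented."""
    rates = {("USD", "VND"): Decimal("25000")}
    return rates[(base, quote)]


stored_balance = Decimal("1000")  # the value the user entered last week
result = net_worth_in("VND", stored_balance, fake_rate_service)
```

Passing the rate service in as a parameter keeps the calculation independent of any particular provider, which also makes it easy to try out with a fake one as done here.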

Fetched data can also be all kinds of data. In a monolith, you will store much more data than you would in a single application in a microservice architecture. In the former, you would for example have the information about a banking client, his accounts, his relationships, etc. In a microservice, you would basically only have the ID of the client, which you would then use to fetch the accounts through one service, the relationships through another, translate the IDs of the relationships through yet another partner service, and so on. Enriching data in this way can result in quite a few service calls. However, your small application should not store this data, as it is already kept somewhere else, and duplication can result in some messy scenarios. The master of any piece of data should always be clearly defined. The master is the source where the true value can be looked up.

Mix and match

Very well, now that the different options for getting data have been covered, how can we make the best use of data? After all, this is the reason you are reading this article. :)

In essence, you will usually need all three types of data to serve the client best. Let’s have a closer look at how the money management app could make use of its data to perform a valuable service to a user.

Inputs

First, the user enters that he has 1 BTC. (We start with a smart user who has invested in an inflation hedge, and who is just a little untrusting towards the future of fiat currencies.) The user also enters that he’d like to buy a chalet in the Swiss mountains, currently valued at CHF 600k.

The second type of input the application could receive is a daily feed of BTC/USD conversion rate.

Own data

Let’s assume that the application has stored all the previous inputs. This way, we have a historical price evaluation for that 1 BTC compared to USD. We can oversimplify the situation and say that its appreciation will be consistent (either exponentially, or linearly, doesn’t really matter for explanatory purposes).

Fetched data

Now, since the conversion rate changes, every day the application goes to the European Central Bank and requests the latest Forex prices. Every day, it would then be able to calculate the value of that Swiss chalet in USD.

After combining these three points, upon every login, the application could tell the user by when his 1 BTC will have appreciated enough to buy his chalet. One day, it could say that the happy day will be 3 years into the future, and upon the next login, it could be 5 years away. However, the user will not need to make any calculation himself, as the application has continuously gathered all the required data, so that the user needn’t worry about making any mistakes in his own estimations.
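To make the estimate concrete, here is a small worked sketch that assumes the exponential variant of the oversimplification, i.e. the value grows by a fixed percentage every year. The function name and all the numbers except the CHF 600k chalet price are made up for illustration; they are not real market data.

```python
import math


def years_until_target(current_value, target_value, annual_growth):
    """Years until current_value reaches target_value at a fixed yearly growth rate."""
    if current_value >= target_value:
        return 0.0
    if annual_growth <= 0:
        raise ValueError("the target is never reached without positive growth")
    # Solve current * (1 + growth) ** years == target for years.
    return math.log(target_value / current_value) / math.log(1 + annual_growth)


chalet_chf = 600_000     # user input
usd_per_chf = 1.10       # fetched today (made-up rate)
btc_usd = 60_000         # from the daily feed (made-up price)
annual_growth = 0.50     # derived from the stored history (made-up figure)

chalet_usd = chalet_chf * usd_per_chf
years = years_until_target(btc_usd, chalet_usd, annual_growth)
print(f"Estimated wait: {years:.1f} years")
```

With these invented inputs the estimate comes out at a little under six years. Because the feed and the fetched rate change every day, the same calculation can give a different answer at each login, which is exactly the effect described above.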

I do realize that this example is fairly simple, but even in complex situations, the questions always remain the same - do I have the data already, and if not, can I fetch it from somewhere, or do I need to ask the user to provide it?
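Those questions can be written down as a simple lookup order. The sketch below is hypothetical throughout: the `resolve` function, the field names and the fetched rate are invented to mirror the chalet example, not taken from a real application.

```python
def resolve(field, stored, fetchers):
    """Return (value, source) for a field; the source says where it came from."""
    if field in stored:
        return stored[field], "stored"       # we already have it
    if field in fetchers:
        return fetchers[field](), "fetched"  # someone else has it
    return None, "ask_user"                  # only the user can provide it


stored = {"btc_amount": 1}
fetchers = {"usd_per_chf": lambda: 1.10}     # made-up rate

fields = ["btc_amount", "usd_per_chf", "chalet_price_chf"]
plan = {field: resolve(field, stored, fetchers)[1] for field in fields}
```

Going through every required field this way gives a list of what is already stored, what has to be fetched, and what has to be asked of the user.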

So now whenever the developer tells you he does not have the data that’s required to perform that operation of dynamically generating a document with 200 inputs, hopefully you understand what he means, and you can impress him by already having analysed all the sources for those inputs. You will amaze him by showing a new API where the user can input 20 fields, state the tables where he can find 80 more values, and then finally show him the endpoints and series of requests required to get those final 100 fields. If you can provide the developer with all that information without him telling you explicitly what he needs from you, be prepared to see an extremely astonished face looking back at you, and for that enormous pride you will feel due to this!