Understanding data security and access control

There’s more to data security and access control than granting teams within an organization different access levels and issuing user passwords. As data scientists, our job is not to run the whole security operation in our organizations. However, because we work so closely with data, it is essential that we understand the importance of robust mechanisms that prevent sensitive and personally identifiable information from getting into the wrong hands.

A strong password alone might not cut it in today’s world. Some of the world’s biggest banks, which employ armies of highly skilled security professionals, have suffered increasingly sophisticated cyberattacks. Today, users also log into work systems and databases with biometrics, such as the fingerprint scanners on smartphones, laptops, and other devices.

Two-factor authentication is also a popular mechanism that goes beyond identifying and authenticating a user by password alone. Users now log into systems with a one-time password, sent to their work email (which itself requires a separate login), in combination with their fingerprint or password. Generating a random number or token string each time a user logs in reduces the risk that comes from a single static password being cracked or obtained some other way.
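As a rough illustration of the one-time code idea, here is a minimal Python sketch. The function names, the 6-digit format, and the 5-minute validity window are assumptions made for this example, not something a particular product prescribes. It uses Python's standard `secrets` module for randomness and a constant-time comparison when checking the code.

```python
import hmac
import secrets
import time

OTP_TTL_SECONDS = 300  # assumed validity window; set this per security policy


def issue_otp(now=None):
    """Return a fresh 6-digit code and the time at which it expires."""
    now = time.time() if now is None else now
    # secrets draws from a cryptographically strong random source
    code = f"{secrets.randbelow(10**6):06d}"
    return code, now + OTP_TTL_SECONDS


def verify_otp(submitted, expected, expires_at, now=None):
    """Reject expired codes, then compare in constant time."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    return hmac.compare_digest(submitted.encode(), expected.encode())
```

In practice the code would be delivered over a second channel (such as the work email mentioned above) and discarded after one successful use.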

User identity and authentication is only half of the equation, however. The other half is using anomaly detection algorithms or machine learning to pick up on unusual user activity and behavior once a user has logged on. This is something we as data scientists can bring to the table in helping our organizations better secure our customer or business data. Some of the key features of anomaly detection models include time of access, location of access, type of activity or use of the data, device type, and how frequently a user accesses the database. The model collects these data points every time a user logs into the database and continuously monitors and calculates a risk score based on these data points and how much they deviate from the user’s past logins. If the user reaches a high enough score, an automated mobile alert can be sent to the security team to further investigate or to take action.
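One simple way to turn "how much a login deviates from the user's past logins" into a number is a weighted sum of z-scores. The sketch below is only an illustration under stated assumptions: the feature names (`login_hour`, `rows_read`), the weights, and the alert threshold are hypothetical, and a production model would likely be more sophisticated (for example, hour of day is circular, which this sketch ignores).

```python
from statistics import mean, pstdev

ALERT_THRESHOLD = 3.0  # hypothetical cut-off for alerting the security team


def feature_deviation(value, history):
    """Absolute z-score of a value against the user's past values."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # No variation in the past: any change counts as a fixed, assumed deviation
        return 0.0 if value == mu else ALERT_THRESHOLD
    return abs(value - mu) / sigma


def risk_score(login, history, weights):
    """Weighted sum of per-feature deviations for one login event."""
    return sum(
        weight * feature_deviation(login[name], [past[name] for past in history])
        for name, weight in weights.items()
    )


def should_alert(login, history, weights):
    """True when the login deviates enough to warrant investigation."""
    return risk_score(login, history, weights) >= ALERT_THRESHOLD
```

Here `history` is a list of dictionaries describing the user's earlier logins, and `login` is a dictionary with the same keys for the current one.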

Some obvious examples: a user who lives in Boston logged out of the database 10 minutes ago but is now accessing it from Berlin, or a user who usually logs in during work hours is now logging in at 3 a.m.

Other examples are subtler. An executive assistant who rarely logs into the database is now logging in every 10 minutes. A data scientist who usually aggregates thousands of rows of data is now retrieving a single row. A marketer who usually searches the database for contact numbers is now attempting to access credit card information, even though they already know they do not have access to it.
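Some of these checks do not need a trained model at all. The Boston-to-Berlin case, for instance, can be caught with a plain "impossible travel" rule: compute the great-circle distance between two consecutive access locations and flag the pair if the implied speed is implausible. The sketch below assumes each event carries a latitude, longitude, and Unix timestamp, and the speed limit is an assumed value (roughly airliner cruising speed).

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_SPEED_KMH = 900.0  # assumed: roughly airliner cruising speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def is_impossible_travel(prev, curr):
    """prev and curr are (latitude, longitude, unix_seconds) tuples."""
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    hours = (curr[2] - prev[2]) / 3600
    if hours <= 0:
        # Same or earlier timestamp: any distance at all is suspicious
        return distance > 0
    return distance / hours > MAX_PLAUSIBLE_SPEED_KMH
```

Location derived from an IP address is approximate and can be distorted by VPNs, so a rule like this is better used as one input to a risk score than as a hard block.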

Another way data scientists can safeguard customer or business data is to keep the data inside the database rather than exporting a subset or local copy onto their computer or device. Nowadays, there are many tools that connect database providers to R or Python, such as the odbcConnect() function in the RODBC package for R, which lets a script query the database directly using an ID and password instead of reading a file stored on a local computer. The ID and password can be removed from the R or Python file once the user has finished working with the data, so an attacker cannot run the script to get the data without a login. Also, if an attacker were to break into a user’s personal laptop, they would not find a local copy of the data on that device.
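A variation on this idea is to never write the ID and password into the script in the first place, and to let the database do the aggregation so that raw rows are not pulled onto the laptop. The Python sketch below is a minimal illustration under assumptions: the environment variable names `DB_USER` and `DB_PASSWORD`, the `orders` table, and its `amount` column are all hypothetical, and an in-memory SQLite database stands in for a real server so the example can run anywhere.

```python
import os
import sqlite3


def load_credentials(env=os.environ):
    """Read the database ID and password from the environment, not the script."""
    try:
        return env["DB_USER"], env["DB_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(
            f"Set {missing.args[0]} before running this script"
        ) from None


def average_order_value(conn):
    """Aggregate inside the database so only the summary leaves it."""
    (avg,) = conn.execute("SELECT AVG(amount) FROM orders").fetchone()
    return avg


def demo_connection():
    """Stand-in database with made-up rows, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?)", [(10.0,), (20.0,), (30.0,)])
    return conn
```

SQLite itself takes no credentials; with a server database, the values returned by `load_credentials()` would be passed to the driver's connect call instead of being typed into the file.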

Row and column access control is another way to safeguard data through fine-grained permissions. This mechanism masks certain columns or rows from different users. The masked columns or rows in tabular data usually contain sensitive or personally identifiable information. For example, columns that contain financial information might be masked from the data science team but not from the finance or payments processing team.
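To make the idea concrete, here is a small Python sketch of a per-role policy that hides some columns and filters some rows. The role names, column names, and the `vip` flag are hypothetical. In a real system this kind of policy is usually enforced inside the database itself (for example, through views or the database's own row-level security features) rather than in client code, so treat this purely as an illustration of the logic.

```python
MASK = "***"

# Hypothetical policy: columns hidden from each team
MASKED_COLUMNS = {
    "data_science": {"card_number", "account_balance"},
    "finance": set(),
}

# Hypothetical policy: which rows each team may see at all
ROW_FILTERS = {
    "data_science": lambda row: not row.get("vip", False),
    "finance": lambda row: True,
}


def apply_access_policy(rows, role):
    """Return only the rows a role may see, with restricted columns masked."""
    if role not in MASKED_COLUMNS or role not in ROW_FILTERS:
        return []  # default-deny: unknown roles see nothing
    hidden = MASKED_COLUMNS[role]
    allowed = ROW_FILTERS[role]
    return [
        {col: (MASK if col in hidden else value) for col, value in row.items()}
        for row in rows
        if allowed(row)
    ]
```

With this policy, the finance role sees every row unmasked, while the data science role sees only non-VIP rows with the financial columns replaced by the mask.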

Other ways to safely deal with sensitive and personally identifiable information include differential privacy and k-anonymity. To learn about these techniques, please read Dealing with data privacy – anonymization techniques.

Rebecca Merrett

Writes technical blogs and other content for Wargaming Sydney/BigWorld Technology.

Raja Iqbal
Raja is the CEO and Chief Data Scientist at Data Science Dojo. He has worked at Microsoft Bing and Bing Ads in various research and development roles in data science and machine learning.