Author: Ofir Chakon

I analyzed Facebook data to decide when to stream on Facebook Live. Here’s how.


Streaming on Facebook Live can be a powerful marketing strategy for startups and businesses to share knowledge, provide value, get exposure and collect high-quality leads. If you prepare your Facebook Live session upfront by researching your target audience and building a detailed agenda, the session can boost your business dramatically.

As chief of product and technology at my previous startup, which dealt with fraud detection, I decided to try Facebook Live as a new marketing strategy while it was still fairly new. Back then, once a new Facebook Live session went up, relevant people got Facebook notifications to join the session, which increased the exposure even more.

There are many posts talking about how to better build your Facebook live session in terms of topics to cover, building an agenda, camera angles, session duration, etc. But there is one piece of the puzzle that business owners and marketers often tend to forget or not pay attention to – when is the best time to stream your Facebook Live session?

Facebook live

You could answer this question with an educated guess based on your familiarity with the target audience. For example:

  • Pregnant moms are ready to consume your Live session on Monday afternoon.
  • Teenagers at the ages of 18-22 are in the right mindset on Saturday morning.

But nowadays, when so much data is available to us within a few clicks, you're actually falling behind if you don't make proper use of it.

Almost every marketing platform or social network exposes API services that you, as a technological entrepreneur, can easily consume. This data, when analyzed properly, can yield valuable conclusions that drive your business objectives way beyond your competitors'.

This approach is often called Data-driven decisions.

Once you start justifying any (or at least most) of your business decisions using data you own or data you can collect from different resources, you actually stop guessing and start making data-driven decisions.

I like to think of data-driven decisions as crowd-sourcing. If you had a chance to watch this TED talk by Lior Zoref, where Lior invited an ox to the stage and asked the audience to guess its weight, you were probably amazed by how close the crowd's average guess was to the ox's real weight: 1,792 lbs vs. 1,795 lbs!

Ox weight

When you make guesses about your business objectives as an individual, you're no different from any individual sitting in that crowd trying to estimate the ox's weight. You may even be the one who guessed 300 lbs or 8,000 lbs, which would probably cost your business a lot of unnecessary expenses.

But, if you’re using the wisdom of the crowd to make data-driven decisions, which you can do in almost any decision you make online, you’ll most likely be ahead of every other individual, or in business terms, ahead of your competitors.

Although I'm not a pure marketer, basic data analysis skills let me push my business forward dramatically in all aspects, including marketing.

In this post, I'm going to walk you through a practical, step-by-step guide on how to access Facebook data, analyze it, and derive conclusions about the optimal time to broadcast on Facebook Live.

In order to follow this guide you need:

  • A Facebook account
  • A Facebook group you would like to analyze (if it’s a private one, you need to be a group member)
  • Python 2.7 installed
  • Jupyter notebook installed
  • Facebook graph API Python library installed

A Jupyter notebook is a highly recommended tool for data analysis in Python. It has many strengths, but above all, it lets you run snippets of code and keep the results in memory, instead of re-running all of your scripts every time you make a minor change. This is crucial when doing data analysis because some tasks take a long time to execute.

Although it’s not essential, I always recommend working inside a Python virtual environment. Here is a post I wrote about the advantages of a virtual environment when using Python.

Finally, I highly recommend working on an Ubuntu environment when doing data-analysis using Jupyter notebooks.

Step 1: Getting the Facebook group ID

In order to get data from Facebook API, we need to specify the ID of the entity we want to get data from, in our case, a Facebook group.

Lookup-id.com is a nice tool you can use to easily find the ID of a group based on its URL. Copy the URL of your group and paste it in the search bar.

lookup-id

In this post, we will use the group: Web Design and Development.

ID: 319479604815804

Step 2: Getting to know the Graph API Explorer

In order to get the most out of Facebook API, besides documentation, Facebook has developed a playground for developers called the Graph API Explorer.

The Graph API Explorer enables us to easily get a temporary access token and start examining the capabilities that Facebook API has to offer.

Click on Get Token, and then on Get User Access Token. Don’t select any permission, just click Get Access Token.

Get access token

The Facebook API has many endpoints you can use, but in this guide we are going to use two main ones: the group feed endpoint (for pulling the group's posts) and the post reactions endpoint (for pulling each post's reaction counts).

In order to figure out the structure of the response you’re expecting to get from a specific endpoint, you just need to specify the endpoint URL and click Submit.

Let’s examine the URL endpoint for grabbing the last posts from the group’s feed. Type this URL in the Graph API Explorer:

319479604815804/feed

and hit Submit.

feed_endpoint

You should now see the group's latest posts in a JSON structure, each containing the post's content, its id and its updated time. By copying one of these ids and appending the reactions endpoint to it:

319479604815804_1468216989942054/reactions?summary=total_count

You should see a list of the reactions for the specific post, and a summary of the total count of reactions.

This way you can play around with all of the features the Facebook API has to offer.

Another wonderful tool for examining endpoints of APIs that don't offer a playground is Postman. You can read more about this tool, as well as other essential tools for web developers, here.

Step 3: Our plan and assumptions

Our goal is to find the optimal time interval to hold a Facebook Live session in the chosen group, which contains our target audience. In order to do that, we assume that the more activity there is in the group at a specific time, the more traction our Facebook Live session is likely to gain.

So our goal now is to figure out when the group's activity peaks, and by when, I mean a specific weekday and hour.

In order to do that, we are going to grab the last 5000 posts from the group’s feed and plot the distribution of the times they were updated on.

We assume that longer posts indicate more activity in the group, because members spend more time writing them. Therefore, our next step will be to take the length of each post into consideration in the distribution.

Reactions on Facebook are probably a great indication of people engaging with a specific post. Therefore, our last step will be to collect the total number of reactions for each post and take it into account in the distribution of activity over weekdays and hours.

Because reactions typically accumulate some time after the post is published, we should be cautious with this part of the analysis.

Step 4: Let’s analyze some data!

In order to start a Jupyter notebook, execute:

ipython notebook

and then choose New → Python 2.

new_notebook

In order to analyze and plot the data, we are going to use numpy and matplotlib libraries. These are very popular Python libraries you should use in order to better analyze your data.

Let’s import all the libraries we need:

import matplotlib.pyplot as plt
import numpy as np
import facebook
import urlparse
import datetime
import requests

and specify our access token and group id:

ACCESS_TOKEN = 'INSERT_ACCESS_TOKEN_HERE'
GROUP_ID = '319479604815804' # Web Design and Development group

Then, let’s initialize the API object with our access token:

graph = facebook.GraphAPI(ACCESS_TOKEN)

Now we want to grab the posts from the group's feed. In order to avoid errors during the API calls, we will limit each API call to 50 posts and iterate over 100 API calls (50 × 100 = 5,000 posts in total):

posts = []
base_url = "{}/feed?limit=50".format(GROUP_ID)
until = None
for i in xrange(100):
    # Rebuild the URL on every iteration so we don't append multiple "until" parameters
    url = base_url if until is None else "{}&until={}".format(base_url, until)
    response = graph.request(url)
    data = response.get('data')
    if not data:
        break
    posts = posts + data
    # Extract the "until" cursor from the paging URL so the next call fetches older posts
    paging = response.get("paging")
    if not paging or not paging.get("next"):
        break
    parsed_url = urlparse.urlparse(paging["next"])
    until = urlparse.parse_qs(parsed_url.query)["until"][0]

In each API call, we specify the until parameter to get older posts.

Now, let’s organize the posts into weekdays and hours of the day:

weekdays = {i: 0 for i in xrange(7)}

hours_of_day = {i: 0 for i in xrange(24)}

hours_of_week = np.zeros((7,24), dtype=np.int)
for post in posts:
    updated = datetime.datetime.strptime(post.get("updated_time"), "%Y-%m-%dT%H:%M:%S+0000")
    weekday = updated.weekday()
    hour_of_day = updated.hour
    weekdays[weekday] += 1
    hours_of_day[hour_of_day] += 1
    hours_of_week[weekday][hour_of_day] += 1

and then, plot the results using matplotlib bar charts:

plt.bar(weekdays.keys(), weekdays.values(), color='g')
plt.show()

weekdays_1

(0 represents Monday)

plt.bar(hours_of_day.keys(), hours_of_day.values(), color='r')
plt.show()

hours_1

All times specified in IST.

Even with this very basic analysis, we can already learn a lot about better and worse time slots for broadcasting to this group. But it's still not informative enough, perhaps because the data is split across two charts and loses the interaction between weekday and hour.

Let's present a heat map of the data instead, which lets us see all three dimensions (weekday, hour, and activity level) at once:

plt.imshow(hours_of_week, cmap='hot')
plt.show()

heatmap_1

Well, this is much better! We can clearly see that the group is very active on Mondays to Fridays between 6 am and 10 am.

Now let's take the post length into consideration and see how it affects the results:

weekdays_content = {i: 0 for i in xrange(7)}
hours_of_day_content = {i: 0 for i in xrange(24)}
hours_of_week_content = np.zeros((7,24), dtype=np.int)
for post in posts:
    updated = datetime.datetime.strptime(post.get("updated_time"), "%Y-%m-%dT%H:%M:%S+0000")
    weekday = updated.weekday()
    hour_of_day = updated.hour
    content_length = len(post["message"]) if "message" in post else 1
    weekdays_content[weekday] += content_length
    hours_of_day_content[hour_of_day] += content_length
    hours_of_week_content[weekday][hour_of_day] += content_length
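To render this weighted matrix, here is a minimal plotting sketch, assuming the same matplotlib setup as before and the hours_of_week_content matrix computed above:

plt.imshow(hours_of_week_content, cmap='hot')
plt.show()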

The heatmap we get:

heatmap2

This is nice but should be treated with caution. On one hand, we can clearly see a very specific point in time that looks like the optimal slot for our Facebook Live session. On the other hand, it might just be an outlier caused by one super long post.

I'll leave that for you to figure out in your next data analysis project. What I suggest is to take a larger number of posts, or to grab an older batch of 5,000 posts from the group's feed and compare.

In order to take reactions into account when analyzing the data, we need to make another API call for each post, because it’s a different API endpoint:

weekdays_reactions = {i: 0 for i in xrange(7)}
hours_of_day_reactions = {i: 0 for i in xrange(24)}
hours_of_week_reactions = np.zeros((7,24), dtype=np.int)
for i, post in enumerate(posts):
    url = "https://graph.facebook.com/v2.10/{id}/reactions?access_token={token}&summary=total_count".format(
        id=post["id"],
        token=ACCESS_TOKEN
    )

    headers = {
        "Host": "graph.facebook.com"
    }

    response = requests.get(url, headers=headers)

    # Add 1 so that posts with no reactions still carry a minimal weight
    try:
        total_reactions = 1 + response.json().get("summary").get("total_count")
    except Exception:
        total_reactions = 1

    updated = datetime.datetime.strptime(post.get("updated_time"), "%Y-%m-%dT%H:%M:%S+0000")
    weekday = updated.weekday()
    hour_of_day = updated.hour
    weekdays_reactions[weekday] += total_reactions
    hours_of_day_reactions[hour_of_day] += total_reactions
    hours_of_week_reactions[weekday][hour_of_day] += total_reactions

The reason we used a low-level approach here, building the HTTP request ourselves instead of using the Facebook Python library, is that the library doesn't support the latest version of the Facebook API, which is required for querying the Reactions endpoint.
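To render this reactions-weighted matrix, the same minimal plotting sketch applies, assuming the hours_of_week_reactions matrix computed above:

plt.imshow(hours_of_week_reactions, cmap='hot')
plt.show()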

The heat map generated from this data:

heatmap_3

We can conclude that the three approaches we used agree on Monday and Wednesday, 6-7 am, as the best time slots.

Conclusions

Data analysis can be challenging and often requires creativity. But it is also exciting and very rewarding.

After choosing our time to broadcast on Facebook Live based on the analysis presented here, we had a huge success and a lot of traction during our Live session.

I encourage you to try and use data analysis to make data-driven decisions in your next business move. And on top of that, start thinking in terms of data-driven decisions.

You can find the Github repository here.

 

Machine Learning for hackers: model comparison and selection


As technological entrepreneurs, we are surrounded by machine learning. The science of training machines to learn and produce models for future predictions is widely used, and not for nothing. Simpler code, more advanced learning algorithms, and better statistical methods keep appearing to solve our problems more effectively.

As broadly discussed in my post about machine learning 101 and linear regression, the problems that we try to solve using machine learning can be divided into two main types: supervised machine learning vs. unsupervised machine learning. Supervised learners learn from labeled data: for example, data about house characteristics that also contains the house price, used for house price predictions. In other words, supervised machine learning learns from labeled data-points and predicts labels for future ones.

On the other hand, unsupervised learning learns from unlabeled data and cannot predict labels for future data-points. It is commonly used for dimension reduction of the data, clustering the data, and more.

In this post, we will discuss supervised learning related problems, models and methods. I assume that you’re already familiar with some machine learning methods like linear regression, Ridge regression and Lasso by knowing how to train models using some of these methods.

This post is called machine learning for hackers in order to emphasize that developers can train models and use machine learning and make the most out of it without being professional data scientists. Although there are tons of tools and libraries out there to train machine learning models with under 10 lines of code, as a data hacker you need to be familiar with more than just training models. You need to know how to evaluate, compare and choose the best one that fits your specific dataset.

Usually, when working on a machine learning problem with a given dataset, we try different models and techniques to solve an optimization problem and fit the most accurate model, that will neither overfit nor underfit.

When dealing with real world problems, we usually have dozens of features in our dataset. Some of them might be very descriptive, some may overlap and some might even add more noise than signal to our data.

Using prior knowledge about the industry we work in for choosing the features is great, but sometimes we need a hand from analytical tools to better choose our features and compare between the models trained using different algorithms.

My goal here is to introduce you to the most common techniques and criteria for comparing between the models you trained in order to choose the most accurate one for your problem.

In particular, we are going to see how to choose between different models that were trained with the same algorithm. Assume we have a dataset with one feature per data-point that we would like to fit using linear regression. Our goal is to choose the best polynomial degree for the model out of 8 different alternatives.

The Problem (and the dataset)

We have been asked to predict house prices based on their size only. The dataset provided to us contains the sizes and prices of 1,200 houses in NYC. We would like to use linear regression to fit a model for predicting future house prices, given prior knowledge about a few model alternatives:

Ŷ1 = β0 + β1X
Ŷ2 = β0 + β1X + β2X^2
Ŷ3 = β0 + β1X + β2X^2 + β3X^3
Ŷ4 = β0 + β1X + β2X^2 + β3X^3 + β4X^4
Ŷ5 = β0 + β1X + β2X^2 + β3X^3 + β4X^4 + β5X^5
Ŷ6 = β0 + β1X + β2X^2 + β3X^3 + β4X^4 + β5X^5 + β6X^6
Ŷ7 = β0 + β1X + β2X^2 + β3X^3 + β4X^4 + β5X^5 + β6X^6 + β7X^7
Ŷ8 = β0 + β1X + β2X^2 + β3X^3 + β4X^4 + β5X^5 + β6X^6 + β7X^7 + β8X^8

Where X represents the house size.

Given the 8 model alternatives, we have been asked to compare between the models using some criteria and choose the polynomial degree that best suits our dataset to predict future house prices.
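As a rough illustration of what fitting these alternatives might look like in Python (this is not code from the original post; the X and y arrays below are synthetic stand-ins for the NYC dataset, which isn't included):

import numpy as np

# Synthetic stand-in for the house-size/price dataset (1,200 points)
rng = np.random.RandomState(0)
X = rng.uniform(500, 4000, 1200)                       # house sizes
y = 100000 + 250 * X + rng.normal(0, 50000, X.shape)   # noisy, roughly linear prices

# Fit the 8 polynomial alternatives: degree d corresponds to model number d
models = {d: np.polyfit(X, y, deg=d) for d in range(1, 9)}

# Predictions from, e.g., model #3
y_hat_3 = np.polyval(models[3], X)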

As described in my previous post, complex models tend to overfit. Therefore, we need to be careful when choosing our model so it’ll provide us with good predictions not only for the current dataset but also for future data-points.

What is a train/test split and why we need it

When dealing with real world machine learning problems, our dataset is limited in size. With a relatively small dataset, we want to both train our model and evaluate its accuracy. This is where the train/test split comes in handy.

A train/test split is a method for splitting our dataset into two groups: a training group of data-points that will be used to train the model, and a testing group that will be used to test it. We usually split them unequally, because training the model usually requires as many data-points as possible.

The common splits are 70/30 or 80/20 for train/test.
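A minimal sketch of such a split using scikit-learn (a library not used in the original post; X and y are assumed to be the feature and label arrays, e.g. the synthetic ones from the previous sketch):

from sklearn.model_selection import train_test_split

# 80/20 train/test split; fixing random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)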

How NOT to compare the models

The most basic metric for evaluating trained machine learning models is the MSE. MSE stands for mean squared error and is given by the average of the squares of the errors. In other words, the MSE measures the average squared difference between the predicted values and the real values, and we would like to minimize it when training models:

MSE = Σ(Ŷi - Yi)^2 / n

where n is the number of data-points.

The MSE should be used with caution. The reason is that the MSE can be calculated either on the training data-points or on the testing data-points. If you haven't guessed by now, the correct way to use the MSE to evaluate your model is to train the model on the training dataset and calculate the MSE on the testing dataset.
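Continuing the hypothetical sketch above (same X_train/X_test arrays and numpy polynomial fitting), computing the two MSEs might look like this:

# Train on the training split only, then measure the error on both splits
coeffs = np.polyfit(X_train, y_train, deg=3)
mse_train = np.mean((np.polyval(coeffs, X_train) - y_train) ** 2)
mse_test = np.mean((np.polyval(coeffs, X_test) - y_test) ** 2)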

Without a train/test split of our data, we will be forced to calculate the MSE on the same dataset we trained the model with. This scenario rewards overfitting. Why?

Suppose we haven't split the dataset into train and test, we trained the 8 models described above, and we calculated MSEtrain for each of them. Which model will give us the lowest MSEtrain? Most likely model #8, because it's the most complex one and it overfits the data rather than learning from it.

Because we train and test the model with the exact same dataset, the MSEtrain will be lower as we use more complex models that will fit the training data better (Don’t forget that the optimization problem we try to solve is to minimize the errors between the predictions and the ground truth).

MSE

So we learned that we had better calculate the MSE on the test set after splitting our data. But there are more advanced criteria for evaluating our models (which are based on the MSE) that we usually use instead of the MSE alone.

Which criteria can we use

After realizing why we need to split our data into train and test, and what the MSE means, we will cover 3 main criteria for comparing our 8 different models. These criteria account for overfitting and help us choose the best model for our dataset.

#1: Mallows’s Cp

Cp is a statistical method suggested by Mallows in 1973 to compute the expectation of the bias. Assuming we have a very small dataset such that splitting it into train and test does not make sense, we can use Cp in order to estimate the MSEtest using the MSEtrain calculated on the training dataset.

The Cp criterion, or the estimator for MSEtest, is given by:

Cp = MSEtrain + 2σ²P/n

where σ² is the error variance based on the full model (model #8), and P is the number of predictors.

In order to use Mallows’s Cp to compare between our models, we need to train each model on the full dataset, calculate Mallows’s Cp estimator for each of the trained models, and choose the model with the lowest Cp result.
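A rough sketch of that procedure, following the formula exactly as given above and reusing the synthetic X and y from the earlier sketch (taking P to be the polynomial degree is my own assumption):

n = len(X)

# Error variance estimated from the residuals of the full model (#8)
residuals_full = y - np.polyval(np.polyfit(X, y, deg=8), X)
sigma2 = np.var(residuals_full)

cp = {}
for d in range(1, 9):
    mse_train_d = np.mean((np.polyval(np.polyfit(X, y, deg=d), X) - y) ** 2)
    cp[d] = mse_train_d + 2.0 * sigma2 * d / n   # Cp = MSEtrain + 2*sigma^2*P/n

best_degree = min(cp, key=cp.get)   # the model with the lowest Cp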

Cp

As we can see, MSEtrain simply decreases as the polynomial degree increases (a more complex model), so it cannot tell us which model to choose, while both MSEtest and Mallows's Cp choose model #3 as the best model for our dataset.

Note: Mallows’s Cp wasn’t developed to evaluate models that are not trained using linear regression. Therefore, you must not use it with models trained using other machine learning algorithms.

#2: BIC

We've already mentioned that when fitting models, adding parameters and making the model more complex can result in overfitting. BIC is a statistical criterion that attempts to resolve this problem by introducing a penalty term for the number of parameters in the model.

BIC, which stands for Bayesian Information Criterion, assumes that there is a correct model among the suggested models and its goal is to choose it.

The mathematical form is very similar to Mallows’s Cp and is given by:

BIC = MSEtrain + log(n)σ²P/n

The model with the lowest BIC is preferred.
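Under the same assumptions as the Cp sketch (reusing sigma2, n, X and y from it), the BIC comparison might look like this:

bic = {}
for d in range(1, 9):
    mse_train_d = np.mean((np.polyval(np.polyfit(X, y, deg=d), X) - y) ** 2)
    bic[d] = mse_train_d + np.log(n) * sigma2 * d / n   # BIC = MSEtrain + log(n)*sigma^2*P/n

best_degree_bic = min(bic, key=bic.get)   # the model with the lowest BIC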

BIC

Leaving the MSEtrain values aside, all other criteria unanimously choose model #3 as the best fit for our data.

Note: when we are not sure whether one of the suggested models is correct, BIC can behave in an unexpected way. Therefore, in real world problems, we should use it with caution.

#3: Cross Validation – probably the most popular criteria

Dealing with machine learning problems requires a good understanding of cross validation (or CV). Cross validation is used in many different ways in machine learning, all of which are related to the comparison and selection of parameters and models.

Cross validation is basically an extension of the train/test split methodology. Its advantage, though, is that it randomly divides the dataset multiple times, and each time it trains and tests the model on a slightly different split of the data.

By doing that, we make sure that we don't evaluate the model's error based on outliers or data that doesn't represent the signal properly. We then average the MSEtest of each split to evaluate the model across multiple train/test splits:

CV(n) = Σ MSEi,test / n

The preferred model will be the one with the lowest CV(n). There is a critical point to understand here – there is a nested iteration when comparing models using cross validation. For each model, we split the dataset randomly multiple times, calculate MSEi,test for each split, and then average them into a CV indicator. We therefore end up with one CV indicator per model, based on which we choose the preferred model.

There are two main implementations for cross validation splits:

  1. Leave one out cross validation
  2. K-fold cross validation (the most popular)

Leave one out CV iterates over the dataset and, in each iteration, takes out one data-point that is not included in the training set but is instead used to test the model's performance.

LOOCV

K-fold CV gets a parameter K as an input, splits the dataset into K parts, iterates over the parts, and in each iteration leaves the kth part out of training and uses it as the testing set.

K-fold CV

Choosing the K parameter, which is the number of folds, can sometimes be tricky because it affects the bias-variance tradeoff of our error estimate. A rule of thumb is to choose either 5 or 10, depending on the size of the dataset.
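Here is a minimal K-fold sketch, again with scikit-learn and the synthetic X and y from earlier (5 folds, polynomial models fitted with numpy):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = {}
for d in range(1, 9):
    fold_mses = []
    for train_idx, test_idx in kf.split(X):
        coeffs = np.polyfit(X[train_idx], y[train_idx], deg=d)
        preds = np.polyval(coeffs, X[test_idx])
        fold_mses.append(np.mean((preds - y[test_idx]) ** 2))
    cv_scores[d] = np.mean(fold_mses)   # average test MSE across the folds

best_degree_cv = min(cv_scores, key=cv_scores.get)   # the model with the lowest CV score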

Cross validation is a wonderful tool used all over in machine learning and statistics. In the following chart, we can see how the different indicators estimate the models’ errors.

CV

Cross validation, unlike Cp or BIC, works well for most of the machine learning algorithms and doesn’t assume that there is one correct model.


I encourage you to try and plot each of these estimators in your next machine learning project when you’re asked to compare between models to figure out what works best for your specific situation.

In real world problems, we are usually asked to compare between models with more than one feature and the comparison can be done exactly the same way we covered in this post.

I also encourage you to check out the Lasso method. Lasso has a built-in regularization mechanism that removes unnecessary features from our model, which is often what we want to achieve when comparing different models.

8 top must-use tools to boost your web development workflow


As developers, before we deploy our applications or even before we choose our cloud provider, we should consider which tools we use for our day-to-day internal workflow. The tools in our toolbox can either boost our productivity dramatically or make our web development project extremely complex and difficult to maintain, or to scale up by recruiting more team members.

A major part of growing from junior developer into senior developer involves adopting tools that simplify our task management process, make communication with other team members seamless, and integrate with each other so they work together in harmony, creating a stack that works best for you and your team.

As technical startup co-founders, we have the responsibility of creating workflows that work well at scale and are easy to use, adopt, and maintain by most of the developers who will join our team. In order to implement the most productive workflows for our team, we first need to master them ourselves.

In this post, I will introduce you to the set of tools that most junior web developers use on a daily basis to manage, analyze and maintain their products. You might already be well familiar with some of them; therefore, my goal is not only to introduce you to these tools but also to provide you with best practices for using them and for integrating them with each other to create a harmony that works for you.

Before I start listing the tools and diving deep into each of them, I want to mention the most important consideration of all which is the operating system you use. I’m not going to get into further details about operating system considerations here because I’ve already discussed it in depth in my previous post lessons learned moving from Windows to Ubuntu.

Slack

slack

What it is used for

Slack is a communication platform for teams. Despite its initial goal of completely replacing the need for emails, which in my opinion hasn't been achieved, Slack has many additional advantages. Even if you're still working alone, keep reading – Slack can be an amazing tool for individuals too.

Slack introduces a new and seamless way to communicate internally with team members, stay on top of milestones, goals, and issues, schedule meetings, and even order lunch.

Rather than one chat in which the whole team communicates, Slack introduces channels. Channels are rooms in which you can discuss different aspects of your company, venture or project: development, sales, PPC campaigns, UI/UX and much more.

Slack provides you with everything you need to manage a rich conversation with your team members: emojis, image sharing, YouTube videos embedding, and of course, integrations.

Integrations provide you with the ability to connect 3rd party tools into your Slack group. You can either install public tools from Slack’s marketplace or develop your own using Slack API and use it inside your Slack group. Slack integrations allow you to schedule meetings with team members by sending a message, set a recurring reminder, notify when a new user signs up or subscribes, order lunch, entertain the team by reacting to specific messages and so much more.

Slack’s search system is robust. Every message is indexed and therefore it is extremely easy to recover anything said in any channel.

Who should use it

Naturally, Slack is built for teams. But, as a developer who works alone on a side project, I encourage you to open yourself a Slack group and play around with everything Slack has to offer. You can increase productivity by sending messages to yourself for setting reminders and scheduling meetings, instead of juggling many apps in the browser.

slack 2

Best practices

  • Investigate the top integrations which Slack’s marketplace has to offer by integrating them into your group.
  • Develop your own integrations by using open source libraries that utilize the Slack API (see the sketch after this list). You can push notifications of newly subscribed users to keep your team on top of your company's milestones. A perfect culture can be built using Slack.
  • Learn Slack’s keyboard shortcuts for increased productivity.
  • Check out BitBucket integration for Slack to notify a specific channel for each push to production.
  • Check out All-in-one messenger tool further in this post to better use Slack on your desktop computer.
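As a small taste of the integrations point above, here is a hedged sketch of pushing a message to a channel through a Slack incoming webhook (the webhook URL is a placeholder you would generate in your own workspace first):

import requests

# Placeholder URL -- create an incoming webhook in your Slack workspace first
WEBHOOK_URL = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"

def notify(message):
    # Incoming webhooks accept a simple JSON payload with a "text" field
    requests.post(WEBHOOK_URL, json={"text": message})

notify("A new user just signed up!")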

Pricing model

Slack's pricing model offers a free plan that can serve small teams perfectly, with the ability to search and access only the most recent 10K messages (once you subscribe, you can access all of your messages). With the Standard and Plus plans, you pay per team member and get more integrations, prioritized support and more.

Tip for advanced users

Slack is not only used by internal teams but also by public communities. There are thousands of Slack communities you can join (most of them for free) to discuss product, design, development and much more with people from all over the world. One of the directories that lists Slack communities is Slack List.

Link to Slack

Trello

trello

What it is used for

Trello is a simple yet wonderful task management (or project management) tool. Trello can be used to manage development workflow and tasks, as well as marketing projects, blogs, online businesses and more.

Trello's user interface is simple and minimalistic, but it has everything you need in order to manage a project with up to 10 team members, including task labeling, attachments, task assignments and task scheduling.

Who should use it

As a solo developer who runs his own side project, Trello can be a perfect match for managing your tasks and workflow. Once you add new team members (up to 10), Trello contains everything you need to keep managing the project efficiently. Notice that Trello might not be enough for projects that grow to more than 10 team members or have many moving parts.

trello 2

Best practices

  • Use boards for different projects on the same team. You can open a board for marketing, back-end development, front-end development etc.
  • Use different background colors for each board for easier recognition.
  • Keep the left menu open for faster navigation.
  • Assign tasks to team members or watch tasks yourself by dragging and dropping profile pictures from the right menu to a specific task.
  • When starting a project, define your labels by opening a task and clicking on labels. There you can assign labels with titles so you can label your tasks afterward.
  • Use different columns in a board for either listing tasks of different components in your system, or for listing To do, doing and done tasks.

Pricing model

All of the essential features Trello provides can be found in the free plan. For integrations, better security and support check out the Business and Enterprise plans, although in my opinion when scaling up your project you might want to look into different task management solutions.

Tip for advanced users

To see examples of Trello boards and get inspired by them, browse here.

Link to Trello

Redash

redash

What it is used for

Redash is a great open-source tool for visualizing your data in a dedicated dashboard. It provides you with everything you need to give your team the ability to query data, visualize it, and share it.

It integrates with all of the most popular data sources, including MySQL, PostgreSQL, MongoDB, Elasticsearch and many more.

With Redash you can generate visualizations to track milestones and keep yourself and your team engaged with what is going on with your project.

You can also create alerts that will pro-actively warn you about important changes.

Who should use it

Once you deploy your application to production and start collecting data by pushing it to your databases, you should consider using Redash. It can help you monitor potential issues, track your progress to achieve your milestones, get insights from your data and more.

redash 2

Best practices

  • Integrate Redash daily metrics with Slack to push them automatically on a daily basis. This way, you don't need to open your dashboard every day; you can stay in your Slack group and keep your team members engaged as well.

Pricing model

Redash is open-source and therefore completely free of charge. If you'd like a hosted and managed instance of Redash, you can pick one of the paid plans.

Tip for advanced users

Once you feel something is missing, implement it and contribute to the Github repository.

Link to Redash

Zapier

zapier

What it is used for

How many times have you told yourself: if only we could push the data from Facebook Ads to a Google spreadsheet! And then, a few minutes later, you find yourself struggling with APIs to get the integration done?

Zapier is a great tool that is worth investigating for exactly this reason. It teaches us, as developers, that we don't have to rush to implement every integration our company needs. Moreover, the less code and in-house development we have in our system, the better.

Zapier moves info between web apps automatically by integrating more than 750 apps. It allows you to create automated processes and workflows with a few clicks of a button that will last forever.

With Zapier you can, for example, push every issue from BitBucket to Slack in a 2 minutes setup or create Trello cards from Google Form responses.

Who should use it

As developers, we are used to dealing with APIs on a daily basis. I encourage you to check out what Zapier has to offer next time, before you get into coding your own integration. It might save you A LOT of time.

If you're running your own company, consider using Zapier as soon as possible in order to avoid redundant development projects, bugs, and maintenance overhead.

zapier 2

Best practices

  • Sign up with Zapier today.
  • Check out Zapier examples to get inspired about how broad automation can reach.

Pricing model

Zapier offers a free forever plan with limited 2-step zaps and integrations. The free forever plan is definitely enough for playing around with the tool. Once you’re getting real value from Zapier you can consider one of the paid plans without limitations on the zaps you can automate.

Tip for advanced users

Try to work with Google Sheets as much as possible. It will simplify things for you.

Link to Zapier

Draw.io

draw io

What it is used for

Draw.io is a great tool for prototyping, mock-ups and architecture design. It can be used in a wide variety of ways thanks to its template collections, but the main focus of Draw.io is designing processes, systems, and views before implementing them with code (or with Photoshop).

Draw.io is an add-on for Google Drive; therefore, it exposes all the sharing and collaboration capabilities that Google Drive has. You can seamlessly collaborate with other team members on designing a server architecture, for example.

Draw.io offers many components for easy insertion into a sketch. You can go from flow charts all the way to Android, Bootstrap or iOS screens.

draw io 2

Who should use it

Draw.io is one of the best sketch tools I know, and it’s completely free. I encourage you to try and use it for your next project while in the designing stage.

Pricing model

Draw.io is offered free of charge.

Link to Draw.io

All-in-one messenger

all in one

What it is used for

Most of us have more than one channel of communication with our co-workers, friends, and family. Usually, each communication channel, like WhatsApp, Slack or Facebook Messenger, has its own web application which makes it relatively difficult to stay on top of everything.

All-in-one Messenger is an awesome Chrome application for collecting all your communication channels in one place. It lets you open a new tab for each communication channel and supports all of the most popular ones. They all act and feel the same way, so they're easy to operate.

Who should use it

From individual developers to companies, All-in-one messenger is applicable for everyone who deals with more than one communication channel on a daily basis.

all in one 2

Best practices

  • Although it’s not so clear, you can add more than one tab for the same communication channel. For example, if you’re a member of more than one Slack group, you can add all of them as different tabs and name them accordingly.

Pricing model

All-in-one messenger is free of charge.

Tip for advanced users

If you want to be more productive at work (which I guess you do, otherwise you wouldn't be reading this post), do yourself a favor and turn off the notifications in the settings tab.

Link to All-in-One Messenger

BitBucket

bitbucket

What it is used for

BitBucket is a hosting service for distributed version control that makes it easy for you to collaborate with your team. BitBucket is owned by Atlassian, which also owns Jira, HipChat, and Trello, all great products for developers as well.

BitBucket, unlike Github, offers private repositories for up to five users for free. BitBucket's user interface is welcoming and easy to use, and the integrations that BitBucket offers are extremely helpful.

Who should use it

For teams of developers, the need for a version control system is obvious (hopefully). As a solo developer, I encourage you to use BitBucket as your version control host to manage your code versions, deploy your app to production, integrate with 3rd party tools for code inspection, and more.

bitbucket 2

Best practices

  • Use BitBucket & Slack integration to push notifications from BitBucket directly to your development channel inside your Slack group.

Pricing model

As aforesaid, BitBucket offers unlimited private code repositories for up to five collaborators. Once you want to scale up your team you’ll need to upgrade your subscription by paying per user per month.

Link to BitBucket

Postman

postman

What it is used for

As web developers, we often deal with creating APIs to expose our backend code to different clients like front end apps, mobile apps, and 3rd party integrations. When building APIs, or when using APIs owned by other entities, it is sometimes difficult to test, document and monitor them.

Postman is a Chrome application that allows you to easily send HTTP requests to either a local or a remote server, with any parameters, headers and authentication settings you need.

Postman, unlike other tools out there, has a wonderful GUI (graphical user interface) for defining your HTTP request and analyzing the response.

Who should use it

From individual developers who develop or test their own API to companies that require team collaboration and sharing.

postman 2

Best practices

  • Keep Postman open when developing web applications, you’ll most likely find it useful.

Pricing model

Postman’s free forever plan offers everything you need as a solo developer working on his own side project. For team collaboration and advanced features see the paid plans.

Link to Postman

Adopting productive habits for your web development workflow is a must. For your own productivity, and for the team you'll be in charge of soon, play around with different tools to figure out your perfect match.

Deploy Django app: Nginx, Gunicorn, PostgreSQL & Supervisor


Django has been the most popular Python-based web framework for a while now. Django is powerful, robust, full of capabilities and surrounded by a supportive community. Django is based on models, views and templates, similarly to other MVC frameworks out there.

Django provides you with a development server out of the box once you start a new project using the commands:

$ django-admin startproject my_project 
$ python ./manage.py runserver 8000

With two lines in the terminal, you can have a working development server on your local machine so you can start coding. One of the tricky parts when it comes to Django is deploying the project so it will be available from different devices around the globe. As technological entrepreneurs, we need to not only develop apps with backend and frontend but also deploy them to a production environment which has to be modular, maintainable and of course secure.

django dev server

Deploying a Django app requires several different mechanisms, which are listed below. Before we begin, let's align on the tools we are going to use throughout this post:

  1. Python version 2.7.6
  2. Django version 1.11
  3. Linux Ubuntu server hosted on DigitalOcean cloud provider
  4. Linux Ubuntu local machine
  5. Git repository containing your codebase

I assume you are already using 1, 2, 4 and 5. About the Linux server, we are about to create it together during the first step of the deployment tutorial. Please note that this post discusses deployment on a single Ubuntu server. This configuration is great for small projects, but in order to scale your resources up to support larger amounts of traffic, you should consider a high-availability server infrastructure, using load balancers, floating IP addresses, redundancy and more.

Linux is much more popular for serving web apps than Windows. Additionally, Python and Django work together very well with Linux, and not so well with Windows.

There are many reasons for choosing DigitalOcean as a cloud provider, especially for small projects that will be deployed on a single droplet (a virtual server in DigitalOcean terminology). DigitalOcean is a great solution for software projects and startups which start small and scale up step by step. Read more about my comparison between DigitalOcean and Amazon Web Services in terms of an early-stage startup software project.

There are some best practices for setting up your Django project I highly recommend you to follow before starting the deployment process. The best practices include working with a virtual environment, exporting requirements.txt file and configuring the settings.py file for working with multiple environments.
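For illustration only, one common convention for such a multi-environment settings layout looks roughly like this (a sketch of my own, not necessarily identical to the layout described in that post):

# app/settings/production.py -- hypothetical environment-specific settings file
from .base import *  # shared settings (INSTALLED_APPS, MIDDLEWARE, etc.) live in base.py

DEBUG = False
ALLOWED_HOSTS = ['your-domain.com']  # replace with your real domain or droplet IP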

django best practices

This post will cover the deployment process of a Django project from A to Z on a brand-new Linux Ubuntu server. Feel free to choose your favorite cloud provider other than DigitalOcean for deployment.

As aforesaid, the built-in development server of Django is weak and is not built for scale. You can use it for developing your Django project yourself or share it with your co-workers, but not more than that. In order to serve your app in a production environment, we need to use several components that will talk to each other and make the magic happen. Hosting a web application usually requires the orchestration of three actors:

  1. Web server
  2. Gateway
  3. Application

The web server

The web server receives an HTTP request from the client (the browser) and is usually responsible for load balancing, proxying requests to other processes, serving static files, caching and more. The web server usually interprets the request and sends it to the gateway. Common web servers are Apache and Nginx. In this tutorial, we will use Nginx (which is also my favorite).

The Gateway

The gateway translates the request received from the web server so the application can handle it. The gateway is often responsible for logging and reporting as well. We will use Gunicorn as our Gateway for this tutorial.

The Application

As you may have already guessed, the application refers to your Django app. The app takes the interpreted request, processes it using the logic you implemented as a developer, and returns a response.


Assuming you have an existing ready-for-deployment Django project, we are going to deploy your project by following these steps:

  1. Creating a new DigitalOcean droplet
  2. Installing prerequisites: pip, virtual environment, and git
  3. Pulling the Django app from Git
  4. Setting up PostgreSQL
  5. Configuring Gunicorn with Supervisor
  6. Configuring Nginx for listening to requests
  7. Securing your deployed app: setting up firewall

Creating a droplet

A droplet in DigitalOcean refers to a virtual Linux server with CPU, RAM and disk space. The first step in this tutorial is creating a new droplet and connecting to it via SSH. Assuming your local machine is running Ubuntu, we are going to create a new SSH key pair in order to easily and securely connect to our droplet once it is created. Connecting with SSH keys (rather than a password) is both simpler and more secure. If you already have an SSH key pair, you can skip the creation step. On your local machine, enter in the terminal:

$ ssh-keygen -t rsa

You will be asked two more questions: where to store the keys (the default is fine) and whether you want to set a passphrase (not essential).

Now the key pair is located in:

/home/user/.ssh/

where id_rsa.pub is your public key and id_rsa is your private key. In order to use the key pair to connect to a remote server, the public key should be located on the remote server and the private key should be located on your local machine.

Notice that the public key can be located on every remote server you wish to connect to. But, the private key must be kept only on your local machine! Sharing the private key will enable other users to connect to your server.

After signing up with DigitalOcean, open the SSH page and click on the Add SSH Key button. In your terminal copy the newly-created public key:

$ cat /home/user/.ssh/id_rsa.pub

Enter the new public key you generated and name it as you wish.

SSH key

Now that the key is stored in your account, you can assign it to every droplet you create. The droplet will contain the key so you can connect to it from your local machine, while password authentication will be disabled by default, which is highly recommended.

Now we are ready to create our droplet. Click on “Create Droplet” at the top bar of your DigitalOcean dashboard.

create droplet

Choose Ubuntu 16.04 64-bit as your image, a droplet size of either 512MB or 1GB RAM, and whatever region makes sense to you.

 

image distro

droplet size

droplet region

You can select the private networking feature (which is not essential for this tutorial). Make sure to select the SSH key you’ve just added to your account. Name your new droplet and click “Create”.

private networking

select ssh keys

create droplet

Once your new droplet has been created, you should be able to connect to it easily using the SSH key you created. In order to do that, copy the IP address of your droplet from your droplets page inside your dashboard, go to your local terminal and type:

$ ssh root@IP_ADDRESS_COPIED

Make sure to replace IP_ADDRESS_COPIED with your droplet's IP address. You should be connected by now.

Tip for advanced users: in case you want to configure an even simpler way to connect, add an alias to your droplet by editing the file:

$ nano /home/user/.ssh/config

and adding:

Host remote-server-name 
    Hostname DROPLET_IP_ADDRESS 
    User root

Make sure to replace remote-server-name with a name of your choice, and DROPLET_IP_ADDRESS with the IP address of the server.

Save the file by hitting Ctrl+O and then close it with Ctrl+X. Now all you need to do in order to connect to your droplet is typing:

$ ssh remote-server-name

That simple.

Installing prerequisites

Once connected to your droplet, we are going to install some software in order to start our deployment process. Start by updating your repositories and installing pip and virtualenv.

$ sudo apt-get update
$ sudo apt-get install python-pip python-dev build-essential libpq-dev postgresql postgresql-contrib nginx git virtualenv virtualenvwrapper
$ export LC_ALL="en_US.UTF-8"
$ pip install --upgrade pip
$ pip install --upgrade virtualenv

Hopefully, you already work with a virtual environment on your local machine. In case you don't, I highly recommend reading my best practices post for setting up a Django project in order to understand why working with virtual environments is an essential part of your Django development process.

Let’s get to configuring the virtual environment. Create a new folder with:

$ mkdir ~/.virtualenvs 
$ export WORKON_HOME=~/.virtualenvs

Configure the virtual environment wrapper for easier access by running:

$ nano ~/.bashrc

and adding this line to the end of the file:

. /usr/local/bin/virtualenvwrapper.sh

Tip: use Ctrl+V to scroll down faster, and Ctrl+Y to scroll up faster inside the nano editor.

Hit Ctrl+O to save the file and Ctrl+X to close it. In your terminal type:

$ . .bashrc

Now you should be able to create your new virtual environment for your Django project:

$ mkvirtualenv virtual-env-name

From within your virtual environment install:

(virtual-env-name) $ pip install django gunicorn psycopg2

Tip: useful commands for working with your virtual environment:

$ workon virtual-env-name # activate the virtual environment 
$ deactivate # deactivate the virtual environment

Pulling application from Git

Start by creating a new user that will hold your Django application:

$ adduser django 
$ cd /home/django 
$ git clone REPOSITORY_URL

Assuming your codebase is already located in a Git repository, just type your password and your repository will be cloned onto your remote server. You might need to add execute permissions to manage.py by navigating into your project folder (the one you've just cloned) and typing:

$ chmod 755 ./manage.py

In order to take the virtual environment one step further in terms of simplicity, copy the path of your project’s main folder to the virtual environment settings by typing:

$ pwd > /root/.virtualenvs/virtual-env-name/.project

Make sure to replace virtual-env-name with the real name of your virtual environment. Now, once you use the workon command to activate your virtual environment, you’ll be navigated automatically to your project’s main path.

In order to set up the environment variable properly, type:

$ nano /root/.virtualenvs/virtual-env-name/bin/postactivate # replace virtual-env-name with the real name

and add this line to the file:

export DJANGO_SETTINGS_MODULE=app.settings

Make sure to replace app.settings with the location of your settings module inside your Django app. Save and close the file.

Assuming you’ve set up your requirements.txt file as described in the Django best practices post, you’re now able to install all your requirements at once by navigating to the path of the requirements.txt file and run from within your virtual environment:

(virtual-env-name) $ pip install -r requirements.txt

Setting up PostgreSQL

Assuming you've set up your settings module as described in the Django best practices post, you should by now have a separation between the development and production settings files. Your production.py settings file should contain the PostgreSQL connection settings as well. If it doesn't, add to the file:

DATABASES = { 
    'default': { 
        'ENGINE': 'django.db.backends.postgresql', 
        'NAME': 'app_db', 
        'USER': 'app_user', 
        'PASSWORD': 'password', 
        'HOST': 'localhost', 
        'PORT': '5432', 
    } 
}

I highly recommend updating and pushing the file on your local machine and pulling it from the remote server using the repository we cloned.

Let’s get to creating the production database. Inside the terminal, type:

$ sudo -u postgres psql

Now you should be inside PostgreSQL terminal. Create your DB and user with:

> CREATE DATABASE app_db; 
> CREATE USER app_user WITH PASSWORD 'password'; 
> ALTER ROLE app_user SET client_encoding TO 'utf8'; 
> ALTER ROLE app_user SET default_transaction_isolation TO 'read committed'; 
> ALTER ROLE app_user SET timezone TO 'UTC'; 
> ALTER USER app_user CREATEDB; 
> GRANT ALL PRIVILEGES ON DATABASE app_db TO app_user;

Make sure your details here match the production.py settings file DB configuration as described above. Exit the PostgreSQL shell by typing \q.

Now you should be ready to run migrations command on the new DB. Assuming all of your migrations folders are in the .gitignore file, meaning they are not pushed into the repository, your migrations folders should be empty. Therefore, you can set up the DB by navigating to your main project path with:

(virtual-env-name) $ cdproject

and then run:

(virtual-env-name) $ python ./manage.py migrate
(virtual-env-name) $ python ./manage.py makemigrations
(virtual-env-name) $ python ./manage.py migrate

Don’t forget to create yourself a superuser by typing:

(virtual-env-name) $ python ./manage.py createsuperuser

Configuring Gunicorn with Supervisor

Now that the application is set up properly, it's time to configure our gateway for sending requests to our Django application. We will use Gunicorn as our gateway, which is a common choice.

Start by navigating to your project’s main path by typing:

(virtual-env-name) $ cdproject

First, we will test gunicorn by typing:

(virtual-env-name) $ gunicorn --bind 0.0.0.0:8000 app.wsgi:application

Make sure to replace app with your app’s name. Once gunicorn is running your application, you should be able to access http://IP_ADDRESS:8000 and see your application in action.

When you’re finished testing, hit Ctrl+C to stop gunicorn from running.

Now it's time to run gunicorn as a service to make sure it runs continuously. Rather than setting up a systemd service, we will use a more robust approach with Supervisor. Supervisor, as the name suggests, is a great tool for monitoring and controlling processes, and it helps you understand better how your processes behave.

To install supervisor, type outside of your virtual environment:

$ sudo apt-get install supervisor

Once supervisor is running, every .conf file that is included in the path:

/etc/supervisor/conf.d

represents a monitored process. Let’s add a new .conf file to monitor gunicorn:

$ nano /etc/supervisor/conf.d/gunicorn.conf

and add into the file:

[program:gunicorn]
directory=/home/django/app-django/app
command=/root/.virtualenvs/virtual-env-name/bin/gunicorn --workers 3 --bind unix:/home/django/app-django/app/app.sock app.wsgi:application
autostart=true
autorestart=true
stdout_logfile=/var/log/gunicorn/gunicorn.out.log
stderr_logfile=/var/log/gunicorn/gunicorn.err.log
user=root
group=www-data
environment=LANG=en_US.UTF-8,LC_ALL=en_US.UTF-8

[group:guni]
programs=gunicorn

Make sure that all the paths and references match your own setup, and that the log directory (/var/log/gunicorn in this example) exists. Save and close the file.

Now let’s update supervisor to monitor the gunicorn process we’ve just created by running:

$ supervisorctl reread 
$ supervisorctl update

In order to validate the process integrity, use this command:

$ supervisorctl status

By now, gunicorn operates as an internal process rather than a process that can be accessed by users outside the machine. In order to start sending traffic to gunicorn, and then to your Django application, we will set up Nginx to serve as our web server.

Configuring Nginx

Nginx is one of the most popular web servers out there. The integration between Nginx and Gunicorn is seamless. In this section, we’re going to set up Nginx to send traffic to Gunicorn. In order to do that, we will create a new configuration file (make sure to replace app with your own app name):

$ nano /etc/nginx/sites-available/app

then edit the file by adding:

server { 
    listen 80; 
    server_name SERVER_DOMAIN_OR_IP; 
    location = /favicon.ico { access_log off; log_not_found off; } 
    location /static/ { 
        root /home/django/app-django/app; 
    } 
    location / { 
        include proxy_params; 
        proxy_pass http://unix:/home/django/app-django/app/app.sock; 
    } 
}

This configuration will proxy requests to the appropriate route on your server. Make sure to set all the references properly according to your Gunicorn and app configurations.
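
One thing worth noting: the /static/ location above only serves files that already exist under that root. Assuming your production settings point STATIC_ROOT at that directory (the path below simply mirrors the Nginx configuration above), collect the static files with:

STATIC_ROOT = '/home/django/app-django/app/static/'

and then:

(virtual-env-name) $ python ./manage.py collectstatic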

Enable the site by creating a symbolic link:

$ ln -s /etc/nginx/sites-available/app /etc/nginx/sites-enabled

Check Nginx configuration by running:

$ nginx -t

Assuming all good, restart Nginx by running:

$ systemctl restart nginx

By now you should be able to access your server just by typing your IP address in the browser, because Nginx listens on port 80, which is the default port browsers use.

Security

Well done! You should have a deployed Django app by now! Now it's time to secure the app to make it much more difficult to hack. In order to do that, we will use ufw, the built-in Linux firewall.

ufw works by configuring rules. Rules tell the firewall which kind of traffic it should accept or decline. At this point, there are two kinds of traffic we want to accept, or in other words, two ports we want to open:

  1. port 80 for listening to incoming traffic via browsers
  2. port 22 to be able to connect to the server via SSH.

Open the ports by typing:

$ ufw allow 80 
$ ufw allow 22

then enable ufw by typing:

$ ufw enable

Tip: before closing the terminal, make sure you are able to connect via SSH from another terminal, so you're not locked out of your droplet due to bad firewall configuration.
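
You can also review the active rules at any time with:

$ ufw status verbose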

What to do next?

This post is the ultimate guide to deploying a Django app on a single server. In case you're developing an app that should serve larger amounts of traffic, I suggest you look into highly scalable server architecture. You can start with my post about how to design a high-availability server architecture.

I moved from Windows to Linux. Here are some lessons learned

As individuals who spend most of our time next to a computer, we sometimes need to ask ourselves questions about our most basic habits. As you may have already guessed, I'm talking about the operating system each of us uses on a daily basis. Windows' market share, in terms of desktop computers, is above 90%! Everyone uses Windows, for different reasons:

  1. Windows OS (operating system) comes with almost every PC (personal computer) as the default OS.
  2. Since we were young, we grew up on different versions of Windows, so it's difficult to make a move.
  3. Leaving macOS aside, you barely see non-Windows users, so you're not exposed to additional alternatives. Therefore, most people think Windows is the only option for running their desktop PC.

I must admit that Windows is well-designed, convenient, allows you to perform many tasks fairly easily and gets updated every once in a while. But, as a Windows user who hasn't experienced any other operating system, you sometimes tend not to even think of the possibilities that you don't have.

linux or windows

The Windows alternative I'm about to present here is Linux. Linux is an open-source operating system developed by the community. Linux is Unix-like, which means it is based on the same principles as other Unix-based systems. Linux is completely free and has different distributions, like Ubuntu, CentOS, Debian, etc. Every distribution has its own pros and cons and is commonly used for different applications. Linux is lightweight in terms of disk usage, and therefore it's used in embedded systems, smart home devices, IoT (Internet-of-things) and much more. Android OS is also based on Linux.

If you're not yet a Linux user, I hope that you have an idea by now of what Linux is all about. As a technological entrepreneur with more than 7 years of experience in software development, data science and entrepreneurship, I have to say that moving from Windows to Ubuntu was one of the biggest productivity boosts I've experienced.

It all started when I noticed that the basic tools I was working with, like the Android Studio IDE and an Android emulator running on a Windows machine, barely let me make progress because of latency. I thought to myself that it was probably about the hardware, so I decided to upgrade to a Lenovo Y50-70 PC with 16GB of RAM and a 512GB SSD.

lenovo ubuntu

After installing the necessary software to keep developing my project, I realized that I was facing similar latency issues on my brand-new PC. I wasn't using too many RAM-consuming applications at once, and I expected my new PC to work like a spaceship. But it didn't happen. At this point, I realized that I had to make a more radical pivot (a shift in strategy).

Once I realized that the hardware was probably not the problem, I started to investigate the software approach. After interacting with Linux for short periods of time during college, I decided to do thorough research. The Ubuntu distribution of Linux is the most popular distribution for PC users. Ubuntu is available both as a desktop edition for PC users and as a server edition for installing and operating servers. One of the huge advantages I found in moving to Linux is being able to work with Ubuntu both on my PC and on the servers I operate for my applications.

After reading A LOT of resources online discussing Linux vs. Windows and Windows vs. Ubuntu, I realized that an OS that fits your needs and adapts itself to you is what can make you far more productive in the long run.

So I waited for a sign, and the sign arrived – a virus attack that forced me to back up all my files and format my PC. But this time – with an Ubuntu operating system on board. I had some thoughts about installing Windows and Ubuntu side-by-side for a soft landing, but I'm now happy that I didn't do it. The reason for leaving Windows entirely is that I wanted to be fully committed to Ubuntu without the Windows fallback alternative.

Here are some lessons learned from the process of moving from Windows to Linux. The lessons apply to any general user but are mostly aimed at developers, coders, programmers, and every person who codes or creates products.

Performance

Linux runs faster than both Windows 8.1 and Windows 10 thanks to its lightweight architecture. After moving to Linux I noticed a dramatic improvement in the speed and performance of my workflow, with the exact same tools I used on Windows. Linux supports many efficient tools for developers and enables you to operate them seamlessly.

Security

Linux is open-source. That is, theoretically, everyone can contribute code in order to enhance the experience, add features, fix bugs, reduce security risks and more. Naturally, every large-scale open-source project enjoys many pairs of eyes examining every aspect of it. Therefore, in terms of security, Linux is naturally more secure than Windows. Rather than installing anti-virus software and 3rd party tools for cleaning malware, you just need to stick to the recommended repositories and you are good to go.

Software development

The terminal in Linux is a wild card. You can do almost anything with it, including software installation, application and server configurations, file system management and much more. As developers, the terminal is our sweet spot. There is nothing more convenient than running servers, training machine learning models, accessing remote machines, compiling and running scripts all from the same terminal window. It’s a huge productivity booster. Automation is also a game-changer using the terminal.

ubuntu terminal

Modularity

Linux provides you with a lot of modularity as a developer. You can easily configure and access any corner of your computer, monitor processes and manage virtual environments for different projects. Because your server will probably be Linux-based as well, it will be easier for you to mimic behaviors, use similar software and packages and automate workflows for your deployment processes.

Working with remote Linux servers

Most of the servers that hold the entire internet are Linux-based, for many reasons that will not be listed here. Linux provides all the tools you need as a developer to operate scalable, secure servers. Therefore, mastering Linux for configuring and maintaining servers is a must-have skill for any technological entrepreneur who operates end-to-end applications. While working with Windows on your local machine, you need to use 3rd party tools like PuTTY in order to connect and interact with Linux-based remote servers, which is not so convenient. For copying files, for example, you need to download yet another tool when you use Windows. A huge advantage of working with a Linux-based local machine is the ability to connect to any remote server with a single line executed via the terminal. Hosts can be stored in a file, as well as SSH keys and usernames, so all you have to do in order to connect via SSH is:

ssh ofir-server

and you’re in! No passwords required. This is a simple demonstration of many capabilities of configuring and maintaining Linux-based servers with a Linux-based local machine. The ability to work via the terminal for both machines is a no-brainer. Most of the popular cloud providers also have CLIs (command-line-interface) for easy integrations.
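
As a minimal sketch, a host entry in ~/.ssh/config might look like this (the host alias, IP address and key path below are made up for illustration):

Host ofir-server
    HostName 203.0.113.10
    User ofir
    IdentityFile ~/.ssh/id_rsa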

Familiarity with low-level OS principles

Windows' implementation is very high-level. In other words, you're barely exposed to the internals of the operating system itself. Linux is just the opposite. When using Linux, you often face configurations that have to be done via the terminal, editing OS files, adding scheduled tasks, updating software, installing drivers and more. When running Ubuntu, AskUbuntu.com is your friend. Not only do you gain more capabilities as a developer, you also learn (sometimes the hard way) how to solve issues, monitor your machine for potential problems, configure different components and more.

ubuntu

Not everything is perfect, though

  1. Becoming an Ubuntu user involves a learning curve. Some things you didn't need to consider back when using Windows might now need to be configured with help from AskUbuntu.com. Expect issues if you have special hardware installed on your computer, like GPUs.
  2. I believe that every technological entrepreneur must be a little bit of a designer, with some minimal graphic design skills. Unfortunately, Adobe hasn't released any of its products for Linux, so it's impossible to run them directly. The Ubuntu alternative is called GIMP, which is free software that covers all the basic requirements of a developer-designer (and beyond).

Despite the disadvantages, since I decided to move, I have no regrets. I’m all Ubuntu now and wish I had moved a few years ago.


Linux is not for everyone. As aforesaid, you should check whether it fits your daily tasks. I think that if you consider yourself a technological entrepreneur/developer/data scientist/programmer – a person who codes or interacts with technical stuff related to coding in one way or another – you should definitely check out Ubuntu.

Practical machine learning: Ridge regression vs. Lasso

For many years, programmers have tried to solve extremely complex computer science problems using traditional algorithms which are based on the most basic condition statement: if this then that. For example, if the email contains the word “free!” it should be classified as spam.

In recent years, with the rise of exceptional cloud computing technologies, the machine learning approach for solving complex problems has been magnificently accelerated. Machine learning is the science of giving computers the ability to learn and solve problems without being explicitly programmed. Sounds like black magic? Maybe. In this post, I will introduce you to problems which can be solved using machine learning, as well as practical machine learning solutions for solving them.

Exactly like humans learn on a daily basis, in order to let a machine learn, you need to provide it with enough data. Once it has processed the data, it can make predictions about the future. Suppose you want to classify emails by whether they are spam or not. In order to solve this problem using machine learning, you need to provide the machine with many labeled emails – emails that are already classified into the correct classes of spam vs. not spam. The classifier will iterate over the samples and learn which features define a spam email. Assuming you trained the machine learning model right, it will be able to predict whether a future email should be classified as spam or not with high accuracy. In many cases, you will not be able to completely understand how the model makes its predictions.

Machine learning hierarchy

The world of machine learning can be divided into two types of problems: supervised learning and unsupervised learning. In this post, we will focus only on supervised learning, which is a subset of problems which contain labeled data (That is, every email is labeled as spam or not spam). For cases where you have unlabeled data, unsupervised learning might be a proper solution.

Underneath the supervised learning problems, there is another division of regression problems vs. classification problems. In regression problems, the value you wish to predict is continuous. For example, house price. In classification problems, on the other hand, the value you are about to predict is discrete, like spam vs. not spam.

The data you need to provide in order to train your model depends on the problem and the value you wish to predict. Let’s assume you want to predict a house price based on different properties. So in this case, each row in your dataset should (for example) consist of:

  1. features: house size, the number of rooms, floor, whether elevator exists, etc.
  2. label: house price.

Choosing and collecting the features that best describe a house for predicting its price can be challenging. It requires market knowledge as well as access to big data sources. The features are the keys in which the prediction of the house price will be based upon.

Machine learning as an optimization problem

Every machine learning problem is basically an optimization problem. That is, you wish to find either a maximum or a minimum of a specific function. The function that you want to optimize is usually called the loss function (or cost function). The loss function is defined for each machine learning algorithm you use, and this is the main metric for evaluating the accuracy of your trained model.

For the house price prediction example, after the model is trained, we are able to predict new house prices based on their features. For each predicted house price, denoted as Ŷᵢ, and the actual house price Yᵢ, we can calculate the loss by:

lᵢ = (Ŷᵢ − Yᵢ)²

This is the most basic form of a loss for a specific data point, and it is used mostly for linear regression algorithms. The loss function as a whole can be denoted as:

L = Σᵢ (Ŷᵢ − Yᵢ)²

Which simply says that our model's loss is the sum of squared distances between the house prices we've predicted and the ground truth. This loss function, in particular, is called quadratic loss or least squares. We wish to minimize the loss function (L) as much as possible so the predictions will be as close as possible to the ground truth.

If you followed me up until now, you are familiar with the basic concept of every practical machine learning problem. Remember, every machine learning algorithm defines its own loss function according to its goal in life.

Linear regression

Linear regression is a basic yet super powerful machine learning algorithm. As you gain more and more experience with machine learning, you'll notice how simple is better than complex most of the time. Linear regression is widely used in different supervised machine learning problems and, as you may have guessed already, it focuses on regression problems (the value we wish to predict is continuous). It is extremely important to have a good understanding of linear regression before studying more complex learning methods. Many extensions have been developed for linear regression, which I will introduce later in this post.

The most basic form of linear regression deals with a dataset of a single feature per data point (think of it as the house size). Because we are dealing with supervised learning, each row (house) in the dataset should include the price of the house (which is the value we wish to predict).

An example of our dataset:

House size (X)    House price (Y)
50                102
70                127
32                65
68                131
93                190
44                82
56                120

In a visual representation:

In linear regression we wish to fit a function (model) in this form:

Ŷ = β₀ + β₁X

Where X is the vector of features (the first column in the table above), and β₀, β₁ are the coefficients we wish to learn.

By learning the parameters, I mean executing an iterative process that updates β at every step by reducing the loss function as much as possible. Once we reach the minimum point of the loss function, we can say that we have completed the iterative process and learned the parameters.

Just to make it even clearer: the combination of the β coefficients is our trained model – which means that we have a solution to the problem!

After executing the iterative process, we can visualize the solution on the same graph:

 

Where the trained model is:

Ŷ = -0.5243 + 1.987X

Now let's assume we want to predict, based on our trained model, what the price of a house of size 85 will be. In order to predict the price, we substitute the house size and the β values we found into the model function and get the predicted house price:

Ŷ(X=85) = -0.5243 + 1.987 × 85 ≈ 168.37

To recap what we’ve covered so far:

  1. Every machine learning problem is basically an optimization problem. That is, we want to minimize (or maximize) some function.
  2. Our dataset consists of features (X) and a label (Y). In our case – house size is the single feature, house price is the label.
  3. In linear regression problems, we want to minimize the quadratic loss which is the sum of distances between the predictions and the actual value (ground truth).
  4. In order to minimize the loss function and find the optimal β coefficients, we will execute an iterative process.
  5. To predict the label (house price) of a new house based on its size, we will use the trained model.

The iterative process for minimizing the loss function (a.k.a. learning the coefficients β) will be discussed in another post. Although it can be done with one line of code, I highly recommend reading more about iterative algorithms for minimizing loss functions, like Gradient Descent.
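
To give a sense of how little code this takes in practice, here is a minimal sketch using scikit-learn (an assumption on my part – any linear regression implementation would do) on the single-feature dataset from the table above:

import numpy as np
from sklearn.linear_model import LinearRegression

# house sizes (the single feature) and prices from the table above
X = np.array([[50], [70], [32], [68], [93], [44], [56]])
y = np.array([102, 127, 65, 131, 190, 82, 120])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)    # roughly -0.52 and 1.99 – the beta0 and beta1 above
print(model.predict(np.array([[85]])))  # predicted price for a house of size 85, about 168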


Linear regression with multiple features

In real world problems, you usually have more than one feature per row (house). Let’s see how linear regression can help us with multi-feature problems.

Considering this dataset:

House size (X1)    Rooms (X2)    Floor (X3)    House price (Y)
50                 2             5             123
70                 2             3             118
32                 1             3             62
68                 3             7             148
93                 4             10            250
44                 2             6             100
56                 3             1             110

So currently we have 3 features:

  1. house size
  2. number of rooms
  3. floor

Therefore, we need to adapt our basic linear model to an extended one that can take into account the additional features for each house:

Ŷ = β₀ + β₁X₁ + β₂X₂ + β₃X₃

In order to solve the multi-feature linear regression problem, we will use the same iterative algorithm and minimize the loss function. The main difference is that we will end up with four β coefficients instead of only two.
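
The exact same approach handles the multi-feature case – X simply becomes a matrix with one column per feature. A minimal sketch, again assuming scikit-learn, on the dataset from the table above:

import numpy as np
from sklearn.linear_model import LinearRegression

# the three features (size, rooms, floor) and the label (price) from the table above
X = np.array([[50, 2, 5], [70, 2, 3], [32, 1, 3], [68, 3, 7],
              [93, 4, 10], [44, 2, 6], [56, 3, 1]])
y = np.array([123, 118, 62, 148, 250, 100, 110])

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # beta0 plus the three coefficients beta1, beta2, beta3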

Overfit in machine learning algorithms

Having more features may seem like a perfect way to improve the accuracy of our trained model (reducing the loss) – the trained model will be more flexible and will take more parameters into account. On the other hand, we need to be extremely careful about overfitting the data. As we know, every dataset has noisy samples. For example, the house size wasn't measured accurately or the price is not up to date. The inaccuracies can lead to a low-quality model if the training is not done carefully. The model might end up memorizing the noise instead of learning the trend of the data.

A visual example of a nonlinear overfitted model:

Overfit can happen in linear models as well when dealing with multiple features. If not filtered and explored up front, some features can be more destructive than helpful: they repeat information that is already expressed by other features and add noise to the dataset.

Overcoming overfit using regularization

Because overfit is an extremely common issue in many machine learning problems, there are different approaches to solving it. The main concept behind avoiding overfit is simplifying the models as much as possible. Simple models do not (usually) overfit. On the other hand, we need to pay attention to the gentle trade-off between overfitting and underfitting a model.

One of the most common mechanisms for avoiding overfit is called regularization. A regularized machine learning model is a model whose loss function contains an additional element that should be minimized as well. Let's see an example:

L = Σᵢ (Ŷᵢ − Yᵢ)² + λ Σⱼ βⱼ²

This loss function includes two elements. The first one is the one you've seen before – the sum of squared distances between each prediction and its ground truth. The second element, a.k.a. the regularization term, might seem a bit bizarre: it sums the squared β values and multiplies the sum by another parameter, λ. The reason for doing that is to "punish" the loss function for high values of the coefficients β. As aforesaid, simple models are better than complex models and usually do not overfit. Therefore, we need to try and simplify the model as much as possible. Remember that our goal in the iterative process is to minimize the loss function. By punishing high β values we add a constraint that keeps them as small as possible.

There is a gentle trade-off between fitting the model, but not overfitting it. This approach is called Ridge regression.

Ridge regression

Ridge regression is an extension for linear regression. It’s basically a regularized linear regression model. The λ parameter is a scalar that should be learned as well, using a method called cross validation that will be discussed in another post.

A super important fact we need to notice about Ridge regression is that it pushes the β coefficients toward lower values, but it does not force them to be zero. That is, it will not get rid of irrelevant features, but rather minimize their impact on the trained model.
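
A minimal sketch of that behavior, assuming scikit-learn and reusing the toy multi-feature dataset from above (the alpha parameter plays the role of λ):

import numpy as np
from sklearn.linear_model import Ridge

X = np.array([[50, 2, 5], [70, 2, 3], [32, 1, 3], [68, 3, 7],
              [93, 4, 10], [44, 2, 6], [56, 3, 1]])
y = np.array([123, 118, 62, 148, 250, 100, 110])

ridge = Ridge(alpha=10.0).fit(X, y)
print(ridge.coef_)  # coefficients shrink toward zero as alpha grows, but typically none is exactly zero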

Lasso method

Lasso is another extension built on regularized linear regression, but with a small twist. The loss function of Lasso is in the form:

L = Σᵢ (Ŷᵢ − Yᵢ)² + λ Σⱼ |βⱼ|

The only difference from Ridge regression is that the regularization term uses absolute values instead of squares. But this difference has a huge impact on the trade-off we've discussed before. The Lasso method overcomes the disadvantage of Ridge regression by not only punishing high values of the coefficients β but actually setting them to zero if they are not relevant. Therefore, you might end up with fewer features included in the model than you started with, which is a huge advantage.
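
And the Lasso counterpart, again a sketch under the same assumptions – with a large enough alpha, some coefficients should become exactly zero (how many depends on the data and on alpha):

import numpy as np
from sklearn.linear_model import Lasso

X = np.array([[50, 2, 5], [70, 2, 3], [32, 1, 3], [68, 3, 7],
              [93, 4, 10], [44, 2, 6], [56, 3, 1]])
y = np.array([123, 118, 62, 148, 250, 100, 110])

lasso = Lasso(alpha=5.0).fit(X, y)
print(lasso.coef_)  # irrelevant features can be driven to exactly 0, effectively removing them from the model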

Conclusions

Machine learning is getting more and more practical and powerful. With very little programming knowledge, you can train a model to predict house prices in no time.

We’ve covered the basics of machine learning, loss function, linear regression, ridge and lasso extensions.

There is more math involved than what I've covered in this post; I tried to keep it as practical as possible while, on the other hand, staying high-level (someone said trade-off?).

I encourage you to take a deep dive into this amazing world.

Scaling Up Your SaaS Product Infrastructure: High-Availability

Many SaaS (software-as-a-service) products store massive amounts of data while being required to provide a seamless user experience when the data is being accessed. When serving low volumes of traffic, this mission is considered relatively simple: operate one main server for HTTP responses, static assets, databases and background tasks. Things become more complex when higher traffic volumes get into the picture.

We've recently reached this point of evolution in ClickFrauds, which made us restructure our server infrastructure to become robust as well as highly scalable. At ClickFrauds, we help Google AdWords advertisers monitor their campaigns against fraudulent clicks, and automatically prevent fraudsters from interacting with the ads.

A super sensitive application component which is exposed to the world wide web is the tracker, which gets hit thousands of times per minute and is responsible for:

  1. Storing data from each request (click)
  2. Redirecting the request to its final destination (the advertiser’s website)

By the nature of the tracker, it goes without saying that a malfunction in this component is not something our system can afford.

When we started, we compared the main cloud providers out there to make a “smart enough” temporary decision for the next few generations of the product. Choosing DigitalOcean to be our cloud provider was an extremely wise decision. DigitalOcean has a seamless easy-to-use dashboard with a convenient pricing model. Droplets (server units) are highly-scalable, robust and contain most of the features used by an early-stage startup. The support team is very helpful, and the greatest thing about DigitalOcean is the endless collection of tutorials that are also ranked at the top of many Google search results.

Infrastructure 2.0

Starting with the requirements, we've listed three main concepts that must be taken into account when redeveloping the new server architecture of ClickFrauds:

  1. High-availability and redundancy.
  2. Security layers against different potential cyber attacks.
  3. Monitoring mechanisms to ensure a healthy infrastructure over time.

Now we will dive into the steps that were taken to fulfill each of the above requirements.

High-Availability

High-Availability illustration

High-availability means avoiding a single point of failure in each and every component of the system. When considering many different moving parts with different objectives as well as critical hardware issues that might happen but cannot be controlled or expected, this is not a trivial task. However, there are great open-source tools and mechanisms that if used properly, can achieve a highly-available infrastructure in an elegant way.

Load Balancing

To eliminate a single point of failure on machines that are exposed to the web and respond to HTTP requests, we use redundancy. Redundancy means duplicating machines that fulfill the exact same set of tasks. We control and distribute the load over the redundant machines using a load balancer. Now, the load balancer becomes the only machine that is exposed to external requests, while the duplicated nodes are hidden inside the internal network.

To implement a basic yet reliable load balancer we can use Nginx with a round-robin algorithm for distributing the requests between the nodes. Nginx will take care of everything for you, from periodic health checks for each node, to managing the nodes and bringing them back into rotation after failure recovery. Additional configuration should be implemented in order to ensure proper sticky cookies, consistent and reliable headers (like the client's IP address) and proper caching.
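
A minimal sketch of such a configuration (the private IP addresses and upstream name below are made up for illustration):

upstream tracker_backend {
    server 10.0.1.2;    # tracker node 1 (private network IP)
    server 10.0.1.3;    # tracker node 2 – Nginx alternates between nodes round-robin by default
}

server {
    listen 80;
    server_name tracker.clickfrauds.com;

    location / {
        proxy_pass http://tracker_backend;
        proxy_set_header X-Real-IP $remote_addr;   # forward the client's real IP to the node
    }
}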

At this point, you may ask yourself, what happens if the load balancer goes offline? This is a great question that has to be answered to achieve a full highly-available system.

Floating IP addresses

Another great feature offered by DigitalOcean (for free!) is the floating IP address. As you may already know, DNS A records might take some time to update from one server to another. In case of a load balancer failure, we would need to change the main A record, for example of the subdomain tracker.clickfrauds.com, to point from the unavailable load balancer to an available backup one. A floating IP address enables us to achieve this goal seamlessly. It is an external IP that can be attached to different machines immediately. This way, in case of a load balancer failure, we leave the DNS A record of tracker.clickfrauds.com pointing to the floating IP as is, and simply reattach the floating IP address to the backup machine.

 

Floating IP address illustration

Databases

Like any other component in a multi-layered architecture, databases might also fail for many different reasons, either hardware or software related. Databases are highly sensitive components that are required at any time during the application life cycle. Therefore, it is extremely important to maintain sustainable database access using a method called replication. Replication, in its most simple form, means holding your data in several different accessible places, rather than on a single machine. Replication eliminates the scenario of a single point of failure in your data system, and provides your application with multiple access points to your data. Multiple access points are helpful for distributing I/O (input/output) operations across different machines within your infrastructure. In other words, different requests from users may reach different nodes (machines) to provide them with the data they need, in order to balance the load of requests.

In order to accomplish proper replication of a database, your data is either distributed among different machines without duplication, or duplicated between the machines and therefore identical from one machine to another. In the latter case, synchronization is a consideration that has to be taken into account – once something changes on one machine, it has to be updated on all the rest. Fortunately, the popular databases contain robust replication mechanisms, both SQL- and NoSQL-based. SQL (relational) databases will mostly use the duplication-based replication method, while NoSQL databases, which were built for scale and for holding huge amounts of data, will use distributed replication, where on each node you will find different parts of your data.

It is common in mature applications to use both types of databases, SQL- and NoSQL-based, for different purposes. At ClickFrauds, we made a decision to use PostgreSQL, which works seamlessly with the Django web framework, as our relational database for storing user-related data, and Redis, which is a super-fast in-memory storage system, for storing cache data and holding the queue for all the background processes in our system. Both databases are replicated on several machines, which prevents a single point of failure and dramatically extends the I/O capabilities of each of the databases.

Celery Workers

Analysis of click data is a computationally heavy operation, especially when the analysis we perform on each click is based on a deep investigation of every aspect of the click, using technologies like supervised machine learning models, as well as a graph database for pairing clicks that have a high likelihood of originating from the same source. Therefore, this analysis must not be done at the expense of user experience; rather, it has to be performed in a background process. On the other hand, the analysis is most effective when it is completed as close as possible to the click itself, in order to decide whether a specific IP address should be blocked immediately. Therefore, in order to accomplish our task at ClickFrauds, we need a component that manages the queue of thousands of background tasks per minute (a.k.a. the broker) and endpoints that consume the click-analysis tasks and execute them successfully (the workers). To accomplish that, we've decided to work with Celery, which has strong capabilities for managing and consuming background processes at large scale. Celery cannot operate without a broker to store its task queue, and for that we chose Redis. To scale up the task-consuming capabilities, all we have to do is duplicate the Celery worker machine, and the consumption velocity increases immediately.
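
A minimal sketch of that setup (the module name, Redis address and task body below are hypothetical):

from celery import Celery

# Redis acts as the broker that holds the task queue
app = Celery('clicks', broker='redis://10.0.1.5:6379/0')

@app.task
def analyze_click(click_id):
    # the heavy click analysis runs here, outside the request/response cycle
    ...

Each worker machine then consumes tasks from the queue with something like:

celery -A clicks worker --concurrency=4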

Once you reach the point of scaling up your SaaS infrastructure, the main concept that you need to bear in mind is being able to spin up new servers in seconds that integrate into the infrastructure seamlessly and do their job. You can do it easily by taking snapshots of your machines just before you launch your new infrastructure.

The next posts in the series will discuss how to secure your highly-available infrastructure and how to monitor it for potential issues. All of the topics mentioned here can easily be discussed further in detail and will be covered in upcoming posts, so please feel free to comment about the specific topics that interest you the most.

3 best practices for better setting up your Django project

Django is a robust open source Python-based framework for building web applications. Its popularity has increased during the last couple of years, and it is already mature and widely used, with a large community behind it. Among other Python-based frameworks for creating web applications (like Flask and Pyramid), Django is by far the most popular. It supports both Python 2.7 and Python 3.6, but as of the time this article is being written, Python 2.7 is still the more accessible version in terms of community, 3rd party packages, and online documentation. Django is secure when used properly and provides a high degree of flexibility, and it is therefore the way to go when developing server-side applications using Python.

In this article, I will share with you best practices of a Django setup I’ve learned and collected over the recent years. Whether you have a few Django projects under your belt, or you’re just about to start your first Django project from scratch, the collection described here might help you create better applications down the road. The article has been written from a very practical mindset so you can add some tools to your development toolbox immediately, or even create yourself an advanced custom Django boilerplate for your next projects.

* In this article I assume you’re using a Linux Ubuntu machine.

Virtual Environment

While developing Python-based applications, using 3rd party packages is an ongoing thing. Typically, these packages are updated often, so keeping them organized is essential. When developing more and more projects on the same local machine, it's challenging to keep track of the current version of each package, and impossible to use different versions of the same package for different projects. Moreover, updating a package for one project might break functionality in another, and vice versa. That's where Python virtual environments come in handy. To install the virtual environment tooling, use:

$ apt-get update
$ apt-get install python-pip python-dev build-essential

$ export LC_ALL="en_US.UTF-8" # might be necessary in case you get an error from the next line

$ pip install --upgrade pip
$ pip install --upgrade virtualenv
$ mkdir ~/.virtualenvs
$ pip install virtualenvwrapper
$ export WORKON_HOME=~/.virtualenvs
$ nano ~/.bashrc

add this line to the end of the file:

. /usr/local/bin/virtualenvwrapper.sh

then execute:

$ . .bashrc

After installing, create a new virtual environment for your project by typing:

$ mkvirtualenv project_name

While you’re in the context of your virtual environment you’ll notice a prefix that is being added to the terminal, like:

(project_name) ofir@playground:~$

In order to deactivate (exit) the virtual environment and getting back to the main Python context of your local machine, use:

$ deactivate

In order to activate (start) the virtual environment context, use:

$ workon project_name

To list the virtual environments exist in your local machine, use:

$ lsvirtualenv

Holding your project dependencies (packages) in a virtual environment on your machine allows you to keep them in an isolated environment and only use them for a single (or multiple) projects. When creating a new virtual environment you’re starting a fresh environment with no packages installed in it. Then you can use, for example:

(project_name) $ pip install Django

for installing Django in your virtual environment, or:

(project_name) $ pip install Django==1.11

for installing version 1.11 of Django accessible only from within the environment.

Neither your main Python interpreter nor the other virtual environments on your machine will be able to access the new Django package you’ve just installed.

In order to use the runserver command using your virtual environment, while in the context of the virtual environment, use:

(project_name) $ cd /path/to/django/project
(project_name) $ ./manage.py runserver

Likewise, when entering the Python interpreter from within the virtual environment by typing:

(project_name) $ python

it will have access to packages you’ve already installed inside the environment.

Requirements

Requirements are the list of Python packages (dependencies) your project uses while running, including the version of each package. Here's an example of a requirements.txt file:

dicttoxml==1.7.4
Django==1.11.2
h5py==2.7.0
matplotlib==2.0.2
numpy==1.13.0
Pillow==4.1.1
psycopg2==2.7.1
pyparsing==2.2.0
python-dateutil==2.6.0
pytz==2017.2
six==1.10.0
xmltodict==0.11.0

Keeping your requirements.txt file up to date is essential for collaborating properly with other developers, as well as keeping your production environment properly configured. This file, when included in your code repository, enables you to update all the packages installed in your virtual environment by executing a single line in the terminal, and thereby get new developers up and running in no time. In order to generate a new requirements.txt or to update an existing one, use from within your virtual environment:

(project_name) $ pip freeze > requirements.txt

For your convenience, make sure to execute this command in a folder that is being tracked by your Git repository so other instances of the code will have access to the requirements.txt file as well.

Once a new developer is joining the team, or you want to configure a new environment using the same packages listed in the requirements.txt file, execute in the virtual environment context:

(project_name) $ cd /path/to/requirements/file
(project_name) $ pip install -r requirements.txt

All requirements listed in the file will immediately be installed in your virtual environment. Older versions will be updated and newer versions will be downgraded to fit the exact list of requirements.txt. Be careful though, because there might be differences sometimes between different environments that you still want to respect.

I highly recommend integrating these commands into your workflow: update the requirements.txt file before pushing code to the repository, and install from the requirements.txt file after pulling code from the repository.

Better settings.py Configuration

Django comes out of the box with a very basic yet useful settings.py file, which defines the main and most useful configurations for your project. The settings.py file is very straightforward, but sometimes, as a developer working on a team, or when setting up a production environment, you often need more than one basic settings.py file.

Multiple settings files allow you to easily define tailor-made configurations for each environment separately like:

ALLOWED_HOSTS # for production environment
DEBUG
DATABASES # for different developers on the same team

Let me introduce you to an extended approach for configuring your settings.py file, which allows you to easily maintain different versions and use the one you need at any given time, in any environment.

First, navigate to your settings.py file path:

(project_name) $ cd /path/to/settings/file

Then create a new module called settings (module is a folder containing an __init__.py file):

(project_name) $ mkdir settings

Now, rename your settings.py file to base.py and place it inside the new module you created:

(project_name) $ mv settings.py settings/base.py

For this example, I assume that you want to configure one settings file for your development environment and one for your production environment. You can use the exact same approach for defining different settings files for different developers in the same team.

For your development environment create:

(project_name) $ nano settings/development.py

Then type:

from .base import *

DEBUG = True

and save the file by hitting Ctrl + O, Enter and then Ctrl + X.

For your production environment create:

(project_name) $ nano settings/production.py

and type:

from .base import *

DEBUG = False
ALLOWED_HOSTS = ['app.project_name.com', ]

Now, whenever you want to add or update the settings of a specific environment, you can easily do it in its own settings file. The last question to ask is: how does Django know which settings file to load in each environment? That's what the __init__.py file is used for. When Django looks for the settings.py it used to load when running the server, for example, it now finds a settings module rather than a settings.py file. But as long as it's a module containing an __init__.py file, as far as Django is concerned, it's the exact same thing. Django will load the __init__.py file and execute whatever is written in it. Therefore, we need to define which settings file we want to load inside the __init__.py file, by editing it:

(project_name) $ nano settings/__init__.py

and then, for a production environment, for example, typing:

from .production import *

This way, Django will load all the base.py and production.py settings every time it starts. Magic?

Now, the only configuration left is to keep the __init__.py in your .gitignore file so it will not be included in pushes and pulls. Once you set up a new environment, don’t forget to create a new __init__.py file inside the settings module and import the settings file required exactly like we did before.
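
To recap, the resulting settings module should look roughly like this:

settings/
    __init__.py      # imports one environment file; kept out of Git
    base.py          # the original settings.py, shared by all environments
    development.py   # DEBUG = True
    production.py    # DEBUG = False, ALLOWED_HOSTS, etc.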

In this article we’ve covered three best practices for better setting up your Django project:

  • Working inside a virtual environment
  • Keeping requirements.txt file up to date and use it continuously in your work flow
  • Setting up a better project settings structure.

This is part 1 in a series about best practices for Django development. Follow me to get an immediate update once the next parts are available.

Have you followed these best practices in your last project? Do you have any insights to share? Comments are highly appreciated.

How to choose a cloud computing technology for your startup

Cloud computing technology has become a standard when talking about developing applications nowadays. A few years ago, companies were forced to have dedicated teams for configuring, running and maintaining server rooms, which made it extremely difficult to scale up easily and offer a sustainable product. For small startups, it was even more difficult due to the lack of human resources as well as funding.

These days, not only are there cloud computing technologies for almost every architecture you might imagine, but the cloud vendors also compete nonstop for our (the developers') attention. Most of the largest tech companies, like Google, Amazon and IBM, launched cloud services in the past few years. They advertise, offer free tiers, present at tech conferences and conduct free-of-charge workshops for experimenting with their cloud solutions. They are aware that once you fall in love with their services, they will most likely be your favorite choice in every project for years to come.

So what is a cloud provider anyway? A cloud provider is an entity that offers cloud services for operating your application. Operating may include running servers, serving your application, hosting static files, providing database solutions, handling networking between servers, managing DNS and much more. Different cloud vendors offer different levels of abstractions in their services, usually defined as IaaS vs. PaaS.

cloud computing technology

IaaS (Infrastructure-as-a-service)

IaaS, or infrastructure-as-a-service, refers to a low-level solution, like providing a Linux Ubuntu server with nothing installed on it. This kind of solution is suitable for more advanced developers who have experience with designing, configuring and securing server infrastructure in all its aspects. IaaS services provide you with flexibility and scalability down the road, and this will most likely be the way to go when designing applications for scale. This approach requires, as already mentioned, at least one developer in your startup who has this skill set; otherwise, your product will turn into a big mess sooner rather than later.

PaaS (Platform-as-a-service)

PaaS, or platform-as-a-service, refers to a fully maintained and managed environment that is hidden under a layer of abstraction you should not even have to care about. The cloud vendor takes care of maintaining the servers needed for the operations, and you get high-level databases for storing your data, services for user authentication, endpoints for client-side applications, etc. This approach is much easier and faster to get up and running with, and typically satisfies most basic applications. You should take into consideration, though, that for more complex architectures it might not be enough.

Generally speaking, both IaaS and PaaS are huge time-savers when dealing with deploying and serving applications. You are able to run a server with the click of a button and usually pay per use. Scaling your servers can be done manually, or even automatically using APIs when a peak in traffic suddenly occurs. You can be sure that you're in good company (as long as you choose wisely), and whatever you can imagine, you can basically create.

cloud providers list

In early-stage startups, using cloud computing technologies became a standard because of the flexibility, the pricing models and the accessibility. Choosing the best cloud service for your startup is an essential task every technological entrepreneur must perform. As the head of development in your company, you should know the differences between the main alternatives, and choose the one that suits your product best.

Technical debts may stack up in a case of a bad decision. In addition, migrating an entire architecture from one cloud provider to another is not considered to be a trivial task at all. Therefore, you should be able to know the differences, experiment with each of the main alternatives and make a wise decision.

After examining and experimenting with the best cloud providers out there, and using them in a wide variety of projects, I'll take my top two – AWS and DigitalOcean – and compare them using a set of parameters.

 

I’ve chosen these two cloud providers to be my best choice after grading each of them using the most important parameters when building a startup from the ground up:

  1. Features (offering) – how wide is the range of available cloud computing technologies, integrations and possibilities for the next generations of your application. In order to build for scale, you need to be sure that a cloud vendor can support your application for years to come.
  2. Pricing – available pricing models, free tiers for startups, and pricing transparency. Early-stage startups (startups that fund themselves) look for the largest possible value at the lowest price.
  3. Ease of use – how fast an intermediate developer can build a basic cloud architecture and deploy their application, how easy it is to iterate over the existing cloud architecture, and what the learning curve looks like for beginners.
  4. Tutorials and support – availability of online resources to help you get up and running with different services, as well as human customer support accessibility.

Three, two, one, fight!

Features

How wide is the range of services offered?

Amazon Web Services: AWS has by far the widest range of services when it comes to offerings. If you don't find a cloud computing technology in the AWS catalog, you'll most likely not find it anywhere else. AWS has many different IaaS and PaaS services dedicated to every task a server needs to perform, divided into organized categories. When using AWS you can be sure that your startup's scalability is potentially endless. On the other hand, the offering might sometimes be confusing for beginners, which makes the getting-started process a little longer. If your application has many custom components that are not trivial, AWS might be the cloud provider you should consider.

Grade: 5/5

 

DigitalOcean: DigitalOcean offers a relatively narrow range of services. As for IaaS, you can find droplets (servers), data storage units, networking and monitoring services. As for PaaS, you can easily deploy apps with zero configuration needed, like Node.js, Redis, Docker, etc. Although the offering is very concise, I find it to be exactly what you need for more than 80% of applications. In addition to the standard droplets, high-CPU and high-memory droplets are available for custom use, as well as backups and snapshots for each droplet. The DigitalOcean team is working nonstop on increasing their offering based on community requests. As a developer who has been using DigitalOcean for quite a long time now, I can say that their desire to satisfy their community is highly appreciated.

Grade: 4/5

Amazon Web Services cloud provider

Pricing

Pricing models available and transparency

Amazon Web Services: AWS is based on a pay-per-use pricing model. Every cloud computing technology has its own unique pricing, and a pricing calculator is available for estimating your costs upfront. You might find this calculator a bit complex if you haven't used AWS before. In order to estimate your costs up front you need to translate your server architecture design into AWS terms, and then try to estimate by choosing the appropriate services from the sidebar. I find that the wide range of offerings sometimes overshadows the cost estimations, so it is often useful to simply start firing up services and track the costs inside the dashboard using pricing alerts. On the other hand, AWS offers a very useful free tier for 12 months that can help early-stage startups get up and running.

Grade: 3.5/5

 

DigitalOcean: DigitalOcean's extremely transparent pricing model comes in two different yet similar flavors: pay per hour and pay per month. When using DigitalOcean you have no surprises. You can calculate the exact amount that will be charged, thanks to fixed prices for each droplet unit. Starting at $5/month for a 512MB droplet, DigitalOcean is suitable even for tiny side projects. Besides droplets and data storage units, which are charged according to the resources allocated to you, networking, monitoring, alerts, DNS management and more are completely free of charge. Bottom line, you pay only for the allocated resources, and you get a lot of useful extra components as a free-of-charge service.

Grade: 4.5/5

 

Ease of use

How easy is it to get up and running as well as to iterate

Amazon Web Services: The AWS dashboard is quite comfortable once you get used to it. Because of the large number of services, you might find it a bit overcrowded in comparison to the other alternatives presented here. You can use the default settings for your services and get up and running relatively quickly, but if you'd like to dive deeper into the details (also for reducing costs), you might find yourself spending quite a lot of time on configurations in the AWS dashboard. On the other hand, in large-scale applications, you may find the additional features available for each service extremely useful and necessary.

Grade: 4/5

 

DigitalOcean: DigitalOcean is branded for a reason as "Cloud computing, designed for developers". As developers, we have so many things to take care of, especially when in charge of the end-to-end technological stack of our startup. Therefore, we need our cloud provider to be as simple as possible to set up. DigitalOcean's user interface is the best I've used. It's intuitive and lets you get up and running in minutes, even when using it for the first time. You don't need to explore and scroll through too many features and options: just choose your Linux distribution, plan and geographic location, and you're up and running in no time.

Grade: 5/5

DigitalOcean droplet creation

Tutorials and support

Available resources and support team

Amazon Web Services: AWS has a very useful tutorials library. There are many tutorials, but some seem to be less detailed and user-friendly than others. You need to be experienced with server infrastructure design before approaching many of the AWS tutorials, so it might take you some time to explore their library before you actually find what you're looking for. On the other hand, their customer support team is extraordinary. AWS support agents are super responsive and attentive, and will answer your questions in a professional way.

Grade: 4/5

 

DigitalOcean: The tutorials library of DigitalOcean is endless. In almost every Google search about a topic related to servers or cloud infrastructure, you'll find results from the DigitalOcean tutorials library. The tutorials are well written and cover important principles alongside the technicalities of how to achieve your goal. In addition to accomplishing your task, you actually learn new things when following DigitalOcean's tutorials. The support team is very responsive and professional, and free-of-charge virtual meetings with cloud specialists are available to help you design your server architecture.

Grade: 5/5

 

Summary – choosing the best cloud computing technology for your startup

Amazon Web Services: AWS is by far the leading cloud provider when it comes to offerings, scalability and features. On the other hand, its learning curve is moderate, so if you haven't used AWS before, it might take you some time to get up and running properly.

Final startup grade: 4.5/5

 

DigitalOcean: I like comparing DigitalOcean to a boutique hotel. When using their cloud computing technologies you feel like you're part of a family and are treated like one. DigitalOcean covers everything you need as an early-stage startup, it is easy to use, and it provides convenient, predictable pricing models.

Final startup grade: 5/5

best cloud solution

The most important thing about a cloud provider is to have one. In our world, it's much better to have your application deployed on a slightly smaller cloud provider than to keep arguing about which cloud provider is better when you have no idea where your application will be 6 months from now.

If you’re familiar with one of the cloud vendors, use it for your main startup unless you’re sure it will not meet your requirements.

When developing side projects, I highly encourage you to try and play with new cloud providers. Who knows, maybe you’ll fall in love with another.

Try DigitalOcean with $10 credit

Try AWS free tier