Official website for Linux User & Developer
Sep
14

Create and save data with a MongoDB database

by Mihalis Tsoukalos

Forget about joins and SQL and try NoSQL databases – specifically MongoDB, the leading example

MongoDB is an open source document-oriented database system written in C++ by Dwight Merriman and Eliot Horowitz. It runs on UNIX machines as well as Windows and supports replication and sharding (aka horizontal partitioning) – the process of separating a single database across a cluster of machines.

Many programming languages – including C, C++, Erlang, Haskell, Perl, PHP, Python, Ruby and Scala – support MongoDB. It is suitable for many things, including archiving, event logging, storing documents, agile development, real-time statistics and analysis, gaming, and mobile and location services.

This article will show you how to store Apache log files in a MongoDB database with the help of a small Python script. We’ll also demonstrate how to implement replication in MongoDB.

The replica set consists of nodes 192.168.2.4 (port 27019), 192.168.1.10 (port 27019) and 192.168.2.3 (port 27018)

Resources

MongoDB
Pymongo

Step by step

Step 01

Connecting to MongoDB for the first time

Your Linux distribution probably includes a MongoDB package, so go ahead and install it. Alternatively, you can download a precompiled binary or get the source code from www.mongodb.org and compile it yourself.

After installation, type mongo --version to find out which MongoDB version you are using, then type mongo to start the MongoDB shell. A successful connection also confirms that the MongoDB server process is running.

Step 02

MongoDB terminology

NoSQL databases are designed for the web and do not support joins, complex transactions and other features of the SQL language. You can update a MongoDB database schema without downtime, but you should design your MongoDB database knowing that joins are not available.

Their terminology is a little different from the terminology of relational databases and you should familiarise yourself with it.

Step 03

The _id field

Every time you insert a BSON document into MongoDB without supplying an _id field, MongoDB automatically generates one. The _id field acts as the primary key, and a generated value is an ObjectId that is always 12 bytes long. To find the creation time of the object with _id '51cb590584919759671e4687', execute the following command from the MongoDB shell:

> ObjectId("51cb590584919759671e4687").getTimestamp()
ISODate("2013-06-26T21:11:33Z")
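
The creation time can be recovered because the first four bytes of an ObjectId hold a Unix timestamp in seconds. The following is a minimal Python 3 sketch of the same decoding using only the standard library; the function name objectid_timestamp is ours and is not part of MongoDB or PyMongo.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Return the creation time encoded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The first 4 bytes (8 hex characters) are seconds since the Unix epoch.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(objectid_timestamp("51cb590584919759671e4687").isoformat())
# prints 2013-06-26T21:11:33+00:00, the same instant the shell reports
```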

Note: Remember that queries are case-sensitive.

Step 04

Inserting an Apache log file into MongoDB

Now that you know some things about MongoDB, it is time to do something interesting and useful. A log file from Apache will be inserted into a MongoDB database using a Python script.

The Python script is executed as follows:

$ zcat www6.ex000704.log.gz | python2.7 storeDB.py

…where www6.ex000704.log.gz is the name of the compressed (for saving disk space) log file.

Step 05

The storeDB.py Python script

The storeDB.py script uses the PyMongo Python module to connect to MongoDB. The MongoDB server is running on localhost and listens on port 27017. For every inserted BSON document, its _id field is printed on screen. Finally, the script prints the total number of documents inserted into the MongoDB database.

The host and its port number are hard-coded inside the script, so change them to match yours.
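
The script itself is not reproduced in this text, so what follows is only a rough Python 3 sketch of how such a script could be structured, not the original storeDB.py. It assumes the log uses Apache's Common Log Format and that PyMongo 3 or later is installed (for MongoClient and insert_one). Every field name except StatusCode, which appears in the queries later in this article, is hypothetical.

```python
import re
import sys

# Apache Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+)'
)
# Only StatusCode comes from the article; the other names are illustrative.
FIELDS = ("Host", "Ident", "User", "Time", "Request", "StatusCode", "Size")


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return dict(zip(FIELDS, match.groups()))


def store_lines(lines, collection):
    """Insert every parsable line; print each new _id and return the count."""
    inserted = 0
    for line in lines:
        doc = parse_line(line)
        if doc is None:
            continue  # skip malformed lines rather than abort the import
        result = collection.insert_one(doc)
        print(result.inserted_id)
        inserted += 1
    return inserted


def main():
    # Imported here so the parsing code can be tried without PyMongo installed.
    import pymongo

    client = pymongo.MongoClient("mongodb://localhost:27017")
    try:
        total = store_lines(sys.stdin, client.LUD.apacheLogs)
        print("Documents inserted:", total)
    finally:
        client.close()
```

With a call to main() added at the end of the file, the sketch reads log lines from standard input in the same way as the zcat pipeline in Step 04. StatusCode is deliberately stored as a string so that it matches the string used in the find() query in Step 17.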

Step 06

Connecting to MongoDB using PyMongo

You first need to connect to MongoDB using:

connMongo = pymongo.Connection('mongodb://localhost:27017')

You then select the database name you want (LUD) using the following line of code:

db = connMongo.LUD

And finally you select the name of the collection (apacheLogs) to store the data:

logs = db.apacheLogs

After finishing your interaction with MongoDB, you should close the connection as follows:

connMongo.close()
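
The connect, select and close steps can be wrapped so that the connection is closed even if an error occurs in between. The sketch below assumes a newer PyMongo release, where MongoClient takes the place of Connection. The names apache_logs and client_factory are ours; client_factory exists only so that the helper can be tried with a stand-in client.

```python
from contextlib import contextmanager


@contextmanager
def apache_logs(uri="mongodb://localhost:27017", client_factory=None):
    """Yield the LUD.apacheLogs collection and always close the connection."""
    if client_factory is None:
        # Imported lazily so the helper can be exercised with a stand-in client.
        from pymongo import MongoClient
        client_factory = MongoClient
    client = client_factory(uri)
    try:
        yield client.LUD.apacheLogs
    finally:
        client.close()


# Typical use, assuming a server is listening on localhost:27017:
#     with apache_logs() as logs:
#         print(logs.name)
```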

Step 07

Displaying BSON documents from the apacheLogs collection

Type the following in order to connect to the MongoDB shell:

$ mongo

Select the desired database as follows:

> use LUD

See the available collections for the LUD database as follows:

> show collections
apacheLogs
system.indexes

Lastly, execute the following command to see all the contents of the apacheLogs collection:

> db.apacheLogs.find()

If the output is long, type ‘it’ to go to the next screen.
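
The same collection can also be read from Python. As a small sketch, the function below tallies the documents by status code. It assumes that collection is a PyMongo collection such as db.apacheLogs and that each document carries a StatusCode field, as the query in Step 17 suggests; the function name is ours.

```python
from collections import Counter


def status_code_counts(collection):
    """Count documents per StatusCode value, most frequent first."""
    # Ask only for the StatusCode field; _id is returned by default as well.
    cursor = collection.find({}, {"StatusCode": 1})
    counts = Counter(doc.get("StatusCode") for doc in cursor)
    return counts.most_common()
```

Documents without a StatusCode field are counted under None rather than raising an error.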

Step 08

A replication example

Imagine that you have your precious data on your MongoDB server and there is a power outage. Can you access your data? Is your data safe?

To avoid such difficult questions, you can use replication to keep your data both safe and available. Replication also allows you to do maintenance tasks without downtime and have MongoDB servers in different geographical areas.

Step 09

Running the three MongoDB servers from the command line

For this example, you need three MongoDB server processes running.

We ran the three MongoDB servers, on their respective machines, as follows:

$ mongod --port 27018 --bind_ip 192.168.1.10 --dbpath ./mongo10 --rest --replSet LUDev

$ mongod --port 27019 --bind_ip 192.168.2.6 --dbpath ./mongo6 --rest --replSet LUDev

$ mongod --port 27018 --bind_ip 192.168.2.5 --dbpath ./mongo5 --rest --replSet LUDev

Note: You are going to see lots of output on your screen.

Step 10

More information about the three MongoDB servers

You should specify the name of the replica set (LUDev) when you start the MongoDB server and have the data directory, specified by the --dbpath parameter, already created. You do not necessarily need three discrete Linux machines. You can use the same machine (IP address) as long as you are using different port numbers and directories.
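
For the single-machine variant, a short Python sketch can create the data directories and assemble the matching command lines. The function name, the mongo<port> directory naming and the 127.0.0.1 default are our assumptions. The --rest option from Step 09 is left out here because, to our knowledge, later MongoDB releases no longer accept it.

```python
import os


def mongod_commands(base_dir, ports, repl_set="LUDev", bind_ip="127.0.0.1"):
    """Create one data directory per port and return the mongod command lines."""
    commands = []
    for port in ports:
        dbpath = os.path.join(base_dir, "mongo%d" % port)
        # The --dbpath directory has to exist before mongod starts.
        os.makedirs(dbpath, exist_ok=True)
        commands.append([
            "mongod",
            "--port", str(port),
            "--bind_ip", bind_ip,
            "--dbpath", dbpath,
            "--replSet", repl_set,
        ])
    return commands
```

Calling mongod_commands("./replica-demo", [27017, 27018, 27019]) returns three argument lists that can be joined with spaces and pasted into separate terminals, or passed to subprocess.Popen.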

Step 11

The rs.initiate() command

Once you have your MongoDB server processes up and running, you should run the rs.initiate() command to actually create and enable the replica set.

If everything is okay, you will see similar output on your screen. If the MongoDB server processes are successfully running, most errors come from misspelled IPs or port numbers. The rs.initiate() command is simple but has a huge impact!
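
rs.initiate() can also be given an explicit configuration document that lists the members of the set. The Python sketch below builds such a document for the three servers started in Step 09; the helper name is ours. The resulting structure can be typed as the argument of rs.initiate() in the MongoDB shell or, as far as we know, sent from PyMongo with client.admin.command("replSetInitiate", config).

```python
def replica_set_config(name, hosts):
    """Build the configuration document for a new replica set."""
    return {
        "_id": name,  # must match the --replSet name given to mongod
        "members": [
            {"_id": index, "host": host} for index, host in enumerate(hosts)
        ],
    }


config = replica_set_config(
    "LUDev",
    ["192.168.1.10:27018", "192.168.2.6:27019", "192.168.2.5:27018"],
)
print(config)
```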

Step 12

Information about replication

Any node can be primary, but only one node can be primary at a given time.

All write operations are executed at the primary node.

Read operations go to primary and optionally to a secondary node.

MongoDB performs automatic failover.

MongoDB performs automatic recovery.

Replication is not a substitute for backup, so you should not forget to take backups.

Step 13

More information about replication

The former primary will rejoin the set as a secondary if it recovers.

Every node contacts the other nodes every few seconds to make sure that everything is okay.

It is advised to read from the primary node as it is the only one that contains the latest information for sure.

All the machines of a replica set must be equally powerful in order to handle the full load of the MongoDB database.

Step 14

The rs.status() command output

The rs.status() command shows you the current status of your replica set. It is the first command to execute to find out what is going on.

Apart from primary and secondary nodes, a third type of node exists. It is called arbiter. An arbiter node does not have a copy of the data and cannot become primary. Arbiter nodes are only used for voting in elections for a primary node.
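
The status reply is a document with one entry per member, so it is easy to summarise in code. The sketch below picks out the current primary from such a reply, for instance one obtained in PyMongo with client.admin.command("replSetGetStatus"). The sample document is hypothetical and heavily trimmed, and the function name is ours.

```python
def summarise_status(status):
    """Return (primary_name, {member_name: state}) from a status reply."""
    states = {m["name"]: m["stateStr"] for m in status.get("members", [])}
    primaries = [name for name, state in states.items() if state == "PRIMARY"]
    primary = primaries[0] if primaries else None
    return primary, states


# Hypothetical, heavily trimmed reply for the three servers of this example.
sample = {
    "set": "LUDev",
    "members": [
        {"name": "192.168.1.10:27018", "stateStr": "PRIMARY"},
        {"name": "192.168.2.6:27019", "stateStr": "SECONDARY"},
        {"name": "192.168.2.5:27018", "stateStr": "SECONDARY"},
    ],
}
print(summarise_status(sample)[0])  # prints 192.168.1.10:27018
```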

Step 15

Selecting a new primary node

If you shut down the primary MongoDB server (by pressing Ctrl+C), the logs of the remaining two MongoDB servers will show the failure of the 192.168.1.10:27018 MongoDB server:

Mon Jul 1 11:21:29.371 [rsHealthPoll] couldn't connect to 192.168.1.10:27018: couldn't connect to server 192.168.1.10:27018

Mon Jul 1 11:21:29.371 [rsHealthPoll] couldn't connect to 192.168.1.10:27018: couldn't connect to server 192.168.1.10:27018

It takes about 30 seconds for the new primary server to come up and the new status can be seen by running the rs.status() command.

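Important note: Once a primary node is down, a new primary can only be elected if a majority (more than 50 per cent) of the voting members of the replica set can still reach each other. The count is taken over the whole set, including the members that are down.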
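
The arithmetic behind an election is simple enough to write down. As we understand MongoDB's rule, the majority is counted against all voting members of the set, including those that are down. The function below is only an illustration of that arithmetic and does not talk to MongoDB.

```python
def can_elect_primary(voting_members, reachable_members):
    """True if the reachable members form a strict majority of the set."""
    return reachable_members > voting_members // 2


print(can_elect_primary(3, 2))  # True: one of three servers is down
print(can_elect_primary(3, 1))  # False: a lone survivor cannot elect itself
```

This is also why a two-member set gains little: losing either member leaves one of two, which is not a majority.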

Step 16

Trying to write data to a non-master node

If you try to write to a non-master node, MongoDB will not allow you and will generate an error message.

Step 17

Useful MongoDB commands

Delete the full apacheLogs collection: db.apacheLogs.drop()

Show available databases: show dbs

Find documents within the apacheLogs collection that have a StatusCode of 404: db.apacheLogs.find({"StatusCode" : "404"})

Connect to the 192.168.1.10 server using port number 27017: mongo 192.168.1.10:27017

Step 18

Hints and tips

It is highly recommended that you first run find() to verify your criteria before actually deleting the data with remove().

Should you need to change the database schema and add another field, MongoDB will not complain and will do it for you without any problems or downtime.

The way to handle very large datasets is through sharding.

MongoDB has its own mechanism for storing large files across documents, called GridFS.
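
The first tip, checking your criteria before deleting, can be built into a small helper. The sketch below assumes PyMongo 3.7 or later, which provides count_documents() and delete_many() on a collection. The function name and the expected parameter are ours.

```python
def delete_checked(collection, criteria, expected):
    """Delete documents matching criteria only if exactly `expected` match."""
    found = collection.count_documents(criteria)
    if found != expected:
        # Refuse to delete when the criteria match more or fewer than planned.
        raise ValueError("expected %d matches, found %d" % (expected, found))
    return collection.delete_many(criteria).deleted_count
```

The idea is to run find() by hand first, note how many documents should go, and pass that number as expected, so that a mistyped criterion fails loudly instead of silently removing the wrong data.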
