Author Archive

As the inSync 3.0 release nears completion and the action moves from development to release engineering, I get some time to reflect on how inSync has shaped up.
(Please let us know if you have any feedback for the beta.)
A few factors proved crucial in the development process -
Usability First
The inSync team always kept usability ahead of everything else. Usability ensures that prospective customers can easily evaluate inSync features. A customer may or may not like a product, but that comes after she can evaluate it easily. No usability means no evaluations, which means no feedback and no sale. We would rather not have a feature than have an unusable one.
Zero Baggage
inSync started as a disk-to-disk backup product. It carries none of the baggage that comes with a tape backup product that has a disk-to-disk backup feature added on. We also tried not to pick up non-core features along the way. If inSync picks up a new feature, be assured that Druvaa is very serious about it. We would rather have a small set of well-developed features than a plethora of half-cooked ones.
Release Early, Release Often (RERO) - The Apple Philosophy
inSync 1.0 had a small feature set, but it was a complete product. (Interestingly, a set of customers found that it meets all their requirements and continue to use it even today.) Note that RERO is not about compromising the quality or usability of the product. It’s only about leaner releases, each with a small set of new features. RERO gave us rich customer feedback that helped us pick the next set of features to focus on.
I’m not saying that we do not miss deadlines, but things are more predictable when dealing with a small set of features. RERO also implies seamless upgrades, which take extra effort, but much less than managing a big release. In essence, we would rather have a small release than a big one that never ships.
Team with Diverse Experience
The Druvaa development team has engineers with extensive experience in Windows, Linux, systems programming, database programming, UI programming, QA development, you name it. Things would be different if all of us were brilliant systems programmers but none knew how to do UI, or the other way round. I’m proud to say that inSync is a no-handicap team and we strive to ensure that the product does not have any weak areas.
Python as The Programming Language
Kudos to Guido for a wonderfully intuitive language and kudos to the Python team for the extensive set of modules. inSync uses C/C++ when it comes to heavy computation or interfacing with the operating system. But in both cases, we managed to confine the C/C++ code to lower-level functions. The high-level logic is coded completely in Python. This allowed us very fast development of good quality, highly maintainable, cross-platform code.
Don’t Reinvent, Reuse The Wheel
inSync uses Qt (PyQt) for the GUI, PostgreSQL as the database server and several other Python modules. We picked these after extensive experimentation and analysis. We also contribute back any bug fixes and enhancements. It’s not just about reusing the wheel but about choosing a good wheel and then making it better.
Please let us know if you have any feedback for the beta. We would be more than glad to take a look.
March 7th, 2009
We receive numerous requests to shed some light on inSync’s roadmap, so here it is. We have tried to include most of the suggestions we received from users. At the same time, we did not go for some features as they do not fit our vision for data protection. We discuss our viewpoint on some of these features towards the end of this blog entry.
Our focus, as always, is a “light-weight, simple, fast and trustable” backup solution.
Version 2.2 (Oct 10th, 2008)
- Admin configured backup folders - The admin can choose “must have” folders for backup for each profile. Can also choose if user can configure more folders.
- Browser Restore - Enable user to restore files and folders using just the browser, when he is not at his desk.
- Linux Port (beta) - Initially support Ubuntu 8+, openSUSE 10+ and RHEL 5+
- Advanced Reporting - 6 different reporting options for flexible and detailed reporting.
- Dump user data locally (on server) while disabling the user.
- Restore user data on server - We plan to allow dumping user data locally (on the server) when a user is in the disabled state. This could be useful for archiving a user’s data before deleting the user.
- Publish configuration API - Publish the server configuration API to enable third party software vendors to integrate inSync backup in their management console.
Version 3.0 (Dec 21st, 2009)
- Full PC Backup - Use de-duplication to effectively backup entire PC (operating system, application executables in addition to the application configuration and data)
- Bare-metal Restore - Use restore points created by full PC backup to restore a machine that does not have a working operating system.
- Performance Improvements - for large (1GB+) file incremental backup.
- Search in Restore - Search files in restore.
Excluded Features
The features which we believe should not be implemented even though some key players offer them -
- Disable inSync client’s desktop visibility - Don’t show the inSync client running on the user’s PC, so that the backup is hidden. In our opinion, this is not a solution. The right approach is to provide a light-weight backup solution that does not hamper PC performance, so that the user does not want to disable it.
- Server initiated backup - It is not useful for the PC backup environment, especially for mobile laptops that are not always connected. We may consider this for the server version of inSync.
- Allow USB backup or tape backup on user’s PC - We believe that media based backup is inherently unusable. With falling disk prices and Druvaa’s data de-duplication technology, the best backup policy is to maintain backups on hard disks.
September 24th, 2008
One of the major goals for the inSync 2.1 release (due this week) is improved performance. With this new release, users should see a speed improvement of almost 30%, especially while syncing smaller files.
While working on inSync 2.1, team Druvaa rediscovered some tips and tricks for performance improvement -
Code Profilers
They can give you very quick insights into bottlenecks. It’s better to start from profiler output than from a hypothesis. Start working out a hypothesis only after the profiler points out a bad function. We used gprof2dot, which plots a nice graph from prof or gprof output. An example is shown below -
The graph shows the top-down hierarchy of functions, the percentage of time each function consumes, the number of calls etc. The percentage of time consumed by a function puts the performance optimization exercise in the right perspective. You don’t want to optimize a function if it contributes just 1% to the whole processing time. The general idea is to concentrate on functions that consume substantial time and are not supposed to. Once a few functions like this are optimized, you can go for another round of profiling.
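As a minimal sketch of the profile-first approach, the snippet below uses Python’s standard cProfile and pstats modules to find where the time goes. The functions slow_checksum and sync_files are hypothetical stand-ins for illustration, not inSync code.

```python
import cProfile
import io
import pstats


def slow_checksum(data):
    """Hypothetical hot spot: a byte-by-byte checksum in pure Python."""
    total = 0
    for byte in data:
        total = (total + byte) % 65521
    return total


def sync_files(files):
    """Hypothetical top-level operation that we want to profile."""
    return [slow_checksum(f) for f in files]


def profile_call(func, *args):
    """Run func under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(5)
    return result, out.getvalue()


files = [bytes(range(256)) * 40 for _ in range(20)]
result, report = profile_call(sync_files, files)
print(report)
```

The report lists each function with its call count and time, which is the same information the graph presents visually. If the statistics are saved to a file with the Stats object’s dump_stats method, gprof2dot should be able to read that file through its pstats input format.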
Network Utilization
It’s not sufficient to just reduce the network bandwidth usage. It’s equally important to completely utilize your share of the network bandwidth.
Especially for non-interactive applications, throughput matters much more than latency. In a system that uses a single-threaded client to issue RPC calls, the throughput is governed by the latency. If one RPC call takes a long time, the throughput is low even though there is no bottleneck per se. Looking at it in a different way, the network is not being utilized while the server is processing the call. A multi-threaded client improves network utilization and also throughput. Sometimes the cause of poor network performance lies outside your code. For example, the default TCP window size performs poorly on high-latency, high-bandwidth networks. Increasing the TCP window size improves performance on such networks, and so does the use of multiple TCP connections.
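The effect of latency on a single-threaded client can be sketched roughly as follows. Here fake_rpc is a hypothetical call that simply sleeps for an assumed round-trip time of 20 ms, and the thread pool comes from Python’s standard concurrent.futures module. This is an illustration of the idea, not how inSync issues its calls.

```python
import time
from concurrent.futures import ThreadPoolExecutor

RPC_LATENCY = 0.02  # assumed round-trip time per call, in seconds


def fake_rpc(request_id):
    """Hypothetical RPC: the client only waits for the server to answer."""
    time.sleep(RPC_LATENCY)
    return request_id * 2


def run_sequential(requests):
    # One call at a time: total time is roughly len(requests) * latency.
    return [fake_rpc(r) for r in requests]


def run_threaded(requests, workers=8):
    # Several calls in flight at once: the waits overlap.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_rpc, requests))


def timed(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


requests = list(range(16))
seq_result, seq_time = timed(run_sequential, requests)
par_result, par_time = timed(run_threaded, requests)
print("sequential: %.3fs, threaded: %.3fs" % (seq_time, par_time))
```

Both runs return the same results, but the threaded client spends far less wall-clock time because the network is kept busy while earlier calls are still being served.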
Caching
Caching frequently used data reduces the database queries or disk reads. Database queries and disk reads may not consume the CPU cycles but they add to the latency in a big way.
Multi-threading can work around latency, but it comes with its own overheads in terms of code complexity and resource consumption. Simple caching avoids frequent trips to the database or disk. Databases and operating systems maintain their own caches, but an application-level cache also avoids the overhead of connecting to the database or issuing a system call.
Beware of stale caches and serialization issues.
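A simple read-through cache, and the stale-cache problem that comes with it, might look like the sketch below. The class name ReadThroughCache, the loader function and the sample data are all hypothetical.

```python
class ReadThroughCache:
    """Keeps loaded values in memory so repeated reads skip the slow loader."""

    def __init__(self, loader):
        self._loader = loader
        self._entries = {}

    def get(self, key):
        if key not in self._entries:
            # Cache miss: pay for the slow lookup once.
            self._entries[key] = self._loader(key)
        return self._entries[key]

    def invalidate(self, key):
        # Must be called whenever the underlying data changes.
        self._entries.pop(key, None)


# Stand-in for a database table, plus a counter of slow lookups.
backing_store = {"alice": "/home/alice"}
stats = {"loads": 0}


def load_home_dir(user):
    stats["loads"] += 1
    return backing_store[user]


cache = ReadThroughCache(load_home_dir)
first = cache.get("alice")             # loads from the backing store
second = cache.get("alice")            # served from the cache
backing_store["alice"] = "/srv/alice"  # the data changes behind the cache
stale = cache.get("alice")             # still the old value
cache.invalidate("alice")
fresh = cache.get("alice")             # reloaded after invalidation
```

The sketch is not thread-safe. In a multi-threaded program, access to the cache would need to be serialized, for example with a lock.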
Delayed Writes
Synchronous writes are slow. Some writes, for example activity logs, can be delayed indefinitely. Other writes that need persistence guarantees can be synced in batches rather than individually.
This holds true for both databases and file systems. It’s cheaper to do multiple inserts in one sqlite transaction than to create one transaction for each insert. On the file system side, you are better off writing a few MBytes to a file followed by a single fsync than issuing many writes of a few KBytes, each followed by its own fsync.
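Both patterns can be sketched with Python’s standard sqlite3 and os modules. The table name activity_log and the file name are made up for the example, and an in-memory database is used only to keep the sketch self-contained, so it shows the pattern rather than the actual saving.

```python
import os
import sqlite3
import tempfile


def insert_batched(conn, rows):
    # "with conn" wraps all the inserts in one transaction and commits once.
    with conn:
        conn.executemany("INSERT INTO activity_log (message) VALUES (?)", rows)


def write_batched(path, chunks):
    # Many small writes, but only one fsync at the end.
    with open(path, "wb") as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()
        os.fsync(f.fileno())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity_log (message TEXT)")
insert_batched(conn, [("event %d" % i,) for i in range(1000)])
row_count = conn.execute("SELECT COUNT(*) FROM activity_log").fetchone()[0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    write_batched(path, [b"x" * 4096] * 256)
    file_size = os.path.getsize(path)
```

The trade-off is that a crash before the commit or the fsync loses the whole batch, so the batch size should match how much recent data you can afford to lose.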
Batch requests
A batch of 10 queries sent to a database works faster than 10 queries issued one after the other. Encoding the 10 queries as a pl/sql function works even better. This is primarily due to the socket communication overheads, specifically the latency involved in it.
For inSync 2.1, we found that the lowest-hanging fruit was in the database and file system interactions. We sure plucked all of it.
July 28th, 2008
Backup is a necessary evil. At Druvaa, our goal is to get rid of the evil part of backup. The first step in that process is to find out the pain points of traditional backup.
- Backup schedules: Traditionally, a backup is a scheduled process that runs at fixed intervals. In case of a failure, the data updates since the last backup are lost. The recovery point objective (RPO) is weaker with traditional backup. Refer to Understanding RPO and RTO for a discussion on RPO.
- Backup slots: The traditional backup process is resource heavy. Also, the server application needs to be quiesced to get a consistent backup image. This implies that regular server activity cannot continue during backup. Hence, backup is scheduled to run during a timeslot when regular application activity is absent or is present at a lower scale. As the amount of data and the time to back it up grow, it becomes harder to find timeslots for scheduling backup. The increase in the number of business hours also puts additional pressure on the backup slots.
- User interface: The traditional backup interface is complex due to the concepts of full/incremental backups and schedules. Due to the complexity of the user interface, it becomes harder to let the end user control the backup process. Typically, the administrator configures the backup for the end user’s desktop/laptop. The configuration remains static and cannot easily adapt to a dynamic data layout. Instead, the end user is asked to arrange his/her data to suit the backup configuration.
- Backup media: Traditional backup is performed on media like magnetic tapes or optical disks. Complete automation of the backup process (using robotic media libraries) is too costly. In the absence of an automated process, administrative attention is required to manage the backup media. Maintaining the backup media also requires administrative effort. The restore operation requires administrative attention too, because the right backup media needs to be loaded.
- Special hardware: Traditional backup is performed using media like magnetic tapes that require special hardware like tape drives. Special hardware means additional procurement and maintenance cost.
- Restore operation: With traditional backup, the end user cannot restore her files by herself. Typically, a service request is sent to the administrator, thus increasing the time taken for restore. The recovery time objective (RTO) is weaker with traditional backup. Refer to Understanding RPO and RTO for a discussion on RTO.
In the next post, I’ll discuss possible approaches to address the pain points of traditional backup.
April 16th, 2008
Python is a powerful language for encoding configuration information for a software program. In particular, the dictionary construct and the ability to nest data structures make it possible to encode complex configuration parameters. Also, storing the configuration as a text file allows for easier debugging and manual editing of the configuration. For example, the unix passwd file can be encoded as a dictionary with the user name as the key. Each entry in the dictionary would be another dictionary with name, uid, etc. as keys, or it could be a tuple with fixed positions for name, uid, etc.
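To make the passwd example concrete, here is a small sketch of both encodings. The user entries are invented for illustration.

```python
# Variant 1: each user maps to a nested dictionary with named fields.
passwd_as_dicts = {
    "root": {"name": "Superuser", "uid": 0, "gid": 0,
             "home": "/root", "shell": "/bin/sh"},
    "alice": {"name": "Alice Example", "uid": 1000, "gid": 1000,
              "home": "/home/alice", "shell": "/bin/bash"},
}

# Variant 2: each user maps to a tuple with fixed positions.
NAME, UID, GID, HOME, SHELL = range(5)
passwd_as_tuples = {
    "root": ("Superuser", 0, 0, "/root", "/bin/sh"),
    "alice": ("Alice Example", 1000, 1000, "/home/alice", "/bin/bash"),
}

alice_uid = passwd_as_dicts["alice"]["uid"]
root_shell = passwd_as_tuples["root"][SHELL]
```

The nested dictionary is easier to read and to extend with new fields, while the tuple is more compact but depends on everyone agreeing on the positions.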
Of course, since Python is not designed to be used for encoding configuration, it does not directly provide routines to load and save configuration files written as Python scripts. Saving the configuration file is fairly simple, as the str method cleanly converts any Python data structure to a string that can be directly written to a file. Note that for a string type configuration parameter, you need to explicitly add quotes while writing to the file. The Python code to save a configuration file myconfig.cfg would look as follows:
f = open("/path/to/myconfig.cfg", "w")
f.write("some_config_param = ")
f.write(str(some_config_param))
f.write("\n")
f.close()
Loading a configuration file written as Python data structures and then accessing the configuration information seamlessly is slightly more complex. Using the imp module is one possible way to load the configuration file. The imp module provides two functions: find_module, which searches for a module using the standard heuristics, and load_module, which loads the file found by find_module and returns a module object. The return values of find_module are to be passed to load_module as parameters. Once the module object is available, one can access the configuration information through its attributes. There are a couple of issues with using the imp module.
- The file needs to have a .py extension, since find_module searches only files with certain extensions and guesses the type of the file from the extension. This can be worked around by opening the config file instead of using find_module and passing the open file to load_module.
- The load_module method compiles the file as a .pyc file before importing it. That leaves an unneeded file behind. The compiled file is used by load_module if it is newer than the config file. In case of a race between two threads, one saving the config file and the other one loading it, the compiled file could get the same timestamp as the config file but with the old contents. Any further load_module calls load from the compiled file and hence load the old config data.
The execfile function is a better way to load a configuration file. Again, the execfile function is not intended for loading Python data structures. Its primary use is to run an independent piece of Python code. But the function allows us to specify the global and local dictionaries as parameters, and the effects of the executed code are reflected in those dictionaries. The Python code to load a configuration file myconfig.cfg would look as follows:
configuration_globals = {}
configuration_locals = {}
execfile("/path/to/myconfig.cfg", configuration_globals, configuration_locals)
some_config_param = configuration_locals["some_config_param"]
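The execfile function exists only in Python 2 and was removed in Python 3. A rough Python 3 equivalent, sketched below, reads the file and passes it to exec with the same two dictionaries. The helper names save_config and load_config and the sample settings are hypothetical. The sketch writes values with the %r format (which calls repr), so string parameters get their quotes automatically.

```python
import os
import tempfile


def save_config(path, settings):
    """Write each setting as a 'name = value' line of Python source."""
    with open(path, "w") as f:
        for key, value in settings.items():
            # %r uses repr(), so strings are written with their quotes.
            f.write("%s = %r\n" % (key, value))


def load_config(path):
    """Execute the config file and return the names it assigned."""
    config_globals = {}
    config_locals = {}
    with open(path) as f:
        source = f.read()
    # compile() attaches the file name so errors point at the config file.
    exec(compile(source, path, "exec"), config_globals, config_locals)
    return config_locals


# Hypothetical settings, for illustration only.
settings = {
    "server_name": "backup01",
    "port": 6061,
    "folders": ["/home", "/etc"],
    "users": {"alice": {"uid": 1000}},
}

with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "myconfig.cfg")
    save_config(cfg_path, settings)
    loaded = load_config(cfg_path)
```

Since the file is executed as code, this approach should only be used for configuration files that come from a trusted source.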
Happy programming!
March 6th, 2008