Cloud DWH for startups

 

I originally posted this at Kueski.

 

Thriving startups reach a point where their teams ask themselves: how can we best manage our data? Perhaps a more important question is: how do we establish a clear path for managing our data, now and in the future?

At my workplace, arriving at this precise point has provided plenty of food for thought. This entry attempts to provide a brief overview of what may be needed, from an infrastructure standpoint, to provide unified, trustworthy and healthy data to the enterprise, now and down the road.

You may hear some say that certain volumes of data simply do not justify the concern, time and effort of thinking about data warehousing, data lakes or big data unless you are already managing PB-sized stores, arguing that the costs and technical requirements typically associated with enterprise data projects are too high and too complex. I believe that organizations that espouse this line of thought are among the many whose late, large enterprise data management projects have failed.

Data-oriented shops, regardless of their stage of growth, should think seriously about an early approach to data management, which is needed for growth as well as for consolidation. In year 10, the maturity and value of data will largely be the product of what was done about it in years 1 through 9.

On the cost front, things are no longer all gloomy: the cloud has brought prices down to manageable levels. This, coupled with the operational efficiencies to be gained in the process, which may offset costs to a degree, weakens the position of cost as a barrier to committing to enterprise data management.

Thus, designing a data management strategy and starting work towards data warehousing should occur *early enough*, perhaps as soon as data-related operations start becoming complex. This entry is therefore focused on data warehousing in the cloud and the road there.

Every major public cloud provider has its own data PaaS, or rather DWaaS, offering. Amazon AWS has Redshift, Google Cloud has BigQuery, Microsoft Azure has SQL Data Warehouse, Oracle Cloud has Database Cloud Service, and IBM Bluemix has dashDB.

Obviously, the road to cloud data warehousing requires much more than simply picking a vendor, launching clusters and feeding them data. Sometimes, it may not even be right to start using one of the offerings mentioned above immediately.

The process is more about looking at current data needs, including users, data growth rate and data formats; thinking hard to understand the likely direction of the business in these same terms; and then planning how to properly address these requirements now, with the conclusions about the future state in mind.

Validating the assumptions made, as well as discussing the implications of the planned implementation with business leaders and key data consumers, is a definite must. Do this not only to gain their accord but also their sponsorship. It is important to remember that we are doing this to support their analysis of historical data and their decision making in the first place.

In this context, the following notions are not shared as hard recommendations. Adaptation will be required because your startup mileage will vary.

Stopping short of a debate about Inmon vs Kimball, the first thing to do is to settle on the general approach for implementing the data warehouse. Once that has been done, the next logical step is the design of the data workflows and schemata, including views, semantic layers and security rules for data.
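As an illustration of the schema-design step, a Kimball-style dimensional model boils down to fact tables surrounded by dimensions. The sketch below uses SQLite and entirely hypothetical table and column names; a real design would come out of the workflow and semantic-layer work described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical star schema: one fact table referencing two dimensions.
cur.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name TEXT,
    segment TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,   -- e.g. 20240131
    year INTEGER,
    month INTEGER
);
CREATE TABLE fact_loan (
    loan_id INTEGER PRIMARY KEY,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
""")

# Analytical queries then join facts to dimensions and aggregate.
cur.execute("INSERT INTO dim_customer VALUES (1, 'Acme', 'SMB')")
cur.execute("INSERT INTO dim_date VALUES (20240131, 2024, 1)")
cur.execute("INSERT INTO fact_loan VALUES (100, 1, 20240131, 2500.0)")
total = cur.execute(
    "SELECT d.year, SUM(f.amount) FROM fact_loan f "
    "JOIN dim_date d ON f.date_key = d.date_key GROUP BY d.year"
).fetchone()
print(total)  # (2024, 2500.0)
```

The same shape translates directly to any of the DWaaS offerings mentioned earlier; only the DDL dialect changes.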

These two steps will likely be an issue because many startups lack a formal enterprise architecture initiative that includes enterprise data architecture. After considering the matter, some may decide they are at too premature a stage to formalize it, foreseeing major evolution still to occur in their business approach. They may decide to continue performing data analysis at earlier stores rather than at a data warehouse. This is entirely compatible with the notion of starting early, and it does not hinder a startup's ability to continue taking additional steps in the same direction.

The next step, one that all startups must perform, is to separate transactional from operational data.

Transactional data originates in source stores. It is commonly real-time, low-complexity data with short lifecycles. Operational data results from the physical aggregation of data from transactional stores; it has a longer lifecycle and is used for reporting and first-level analysis.

The operational data store may itself be composed of multiple stores. Little or no transformation occurs as data moves from source to destination, and transfers happen frequently, anywhere from every few minutes to perhaps once a day.

It is very common to see startups keep accumulating data in their transactional stores, leaving it there for no apparent reason other than that it originated there. Separating the two brings many benefits: it keeps the transactional store lightweight and swift, and it lets internal users run reporting activities against the operational store without concern for impacting customers.
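The transfer described above can be sketched in a few lines. This is a minimal, assumption-laden illustration: two in-memory SQLite databases stand in for the transactional and operational stores, the table and column names are made up, and the incremental key is simply the row id.

```python
import sqlite3

# Two in-memory databases stand in for the transactional and operational stores.
txn = sqlite3.connect(":memory:")
ods = sqlite3.connect(":memory:")

txn.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL, created_at TEXT)")
txn.executemany("INSERT INTO payments VALUES (?, ?, ?)",
                [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-01-02")])

ods.execute("CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL, created_at TEXT)")

# Periodic transfer: copy only rows newer than the last one already loaded,
# with little or no transformation, exactly as described above.
last_loaded = ods.execute("SELECT COALESCE(MAX(id), 0) FROM payments").fetchone()[0]
rows = txn.execute("SELECT id, amount, created_at FROM payments WHERE id > ?",
                   (last_loaded,)).fetchall()
ods.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)

count = ods.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
print(count)  # 2
```

Run on a schedule, a job like this keeps the transactional store lean while reporting queries hit only the operational copy.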

Once this has been done, the next step is to set up staging and warehousing data stores.

The staging area is set up for purely technical purposes, namely preparation for warehousing, which must include at least data scrubbing and deduplication. If moving forward with implementing the warehouse at this stage, the process also includes designing the ETL/CDC, batch and streaming processes that will feed the data warehouse, including aggregate data calculation.
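A minimal sketch of the scrubbing and deduplication that happens in staging. The records and the choice of a normalized email address as the natural key are purely illustrative:

```python
# Hypothetical raw feed arriving in staging: messy casing, stray whitespace,
# an unusable record, and a duplicate of the same customer.
raw = [
    {"email": "a@example.com", "amount": 10.0},
    {"email": "A@Example.com ", "amount": 10.0},   # duplicate after scrubbing
    {"email": "", "amount": 5.0},                  # unusable record, dropped
    {"email": "b@example.com", "amount": 7.5},
]

def scrub(record):
    """Normalize the natural key; return None for records that cannot be used."""
    email = record["email"].strip().lower()
    return {"email": email, "amount": record["amount"]} if email else None

seen = set()
staged = []
for rec in map(scrub, raw):
    if rec and rec["email"] not in seen:   # deduplicate on the scrubbed key
        seen.add(rec["email"])
        staged.append(rec)

print(len(staged))  # 2
```

Real pipelines do this with ETL tooling at scale, but the shape of the work, normalize, reject, deduplicate, is the same.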

Since the data warehouse will store the biggest and most complex data structures available to the business, and become the final store of historical business data, it must be very robust. Typically, data warehouses reside on purpose-built systems rather than general-purpose high-end hardware.

It is time then to pick a cloud provider. Cloud offerings such as Amazon Redshift remain shy of sharing details about their underlying hardware capabilities. Even though part of using a PaaS/DWaaS offering should mean not having to worry about infrastructure details, I find it an advantage to be able to know about them. Two good examples of engineered systems built specifically for data warehousing and analytics, which do advertise their technical specifications, are Oracle Exadata and Exalytics.

Startups often choose the offering of the cloud provider whose other services they already use. That makes sense, and sometimes it may even be hard not to. Nevertheless, the following considerations are relevant when approaching this decision:

  • Elasticity – the ability to scale compute separately from storage will mean a great deal for handling both structured and unstructured data
  • Availability – service level agreement and regional availability
  • Manageability – ease of provisioning, scaling and automation
  • Formats – data ingestion and data formats, including the well-known data dialects the business requires
  • Security – encryption capabilities, in transit and at rest, as well as the ability to audit data access and apply access control rules

Conclusion

Implementing a data management strategy is not an easy task. Starting early facilitates the operational management of data, and provides powerful insight into what business data means now and how to drive it over time to mean what it should.

It takes considerable effort, acumen and resources, but when implemented right it provides a solid base for businesses to advance in their effort to make informed decisions about their future.


@Campus Party GDL …

I will be presenting at Campus Party in Guadalajara, Mexico next week: a short talk on what developers need to keep in mind when preparing their zillion-dollar apps to be deployed more smoothly on IaaS.

Steve Wozniak will be there too; go see him, and hang around for my slot too if you can.


VirtualBox HD commands

 

Since I already have a set of base virtual disks from a number of laboratories I have created for various projects, it makes sense to reuse them, as many of them have the right OS settings for my common requirements. Currently working with OEM 12c and with limited space on my SSD, I decided to enlarge the HDs. I am using VirtualBox in my local labs.

Depending on your requirements, the disk format you are using, etc., the process I followed required detaching the disk, converting it to a format VBox could resize, actually enlarging the disk file, and attaching it back to the VM.

The following commands can be useful in preparing disks as described:

  • vboxmanage storageattach VM --storagectl CONTROLLER --port PORT --type hdd --medium none
  • vboxmanage internalcommands sethduuid /path/to/disk
  • vboxmanage clonehd --format FORMAT /path/to/disk /path/to/newdisk
  • vboxmanage modifyhd /path/to/newdisk --resize SIZE
  • vboxmanage storageattach VM --storagectl CONTROLLER --port PORT --type hdd --medium /path/to/newdisk

Naturally, the original disk has no further use in this situation and can be discarded. The guest OS must also be reconfigured to see the new partition size.

 

Authenticated remote JMX to Flight Recorder in Oracle JDK

 

When doing authenticated remote JMX connections to the Flight Recorder in Oracle JDK, the predicates required for the readwrite permissions in the access file are as follows:

role readwrite \
create com.sun.management.*,com.oracle.jrockit.* \
unregister

Marcus Hirt described those MBeans as being, in his own words, “from the JMX configuration hell”.