In an ever-accelerating information age, the companies most likely to succeed glean the most profitable insights from their data, faster and more nimbly than their competitors. If you run a data-driven enterprise today, you likely have game-changing insights into your business and your customers hidden throughout your vast troves of data. Here is why intelligent virtualization technologies are eliminating data silos forever.
Data must be consumerized.
However, to uncover these insights, your data must be consumerized. Consumerized means that the data must be readily available and readable to all stakeholders across the organization — while ensuring reliability and security.
Are data lakes going the way of the dodo?
Data is only going to continue becoming more diverse, dynamic, and distributed. Many organizations attempt to collect all of their data and make it accessible by throwing it all into a data lake, which can hold raw data in its native format until it is needed for analysis.
Until recently, this practice has more or less been adequate; companies could afford to wait for data scientists to collect, translate, and analyze the myriad of different data types contained in data lakes.
The need for immediate access to data has grown considerably.
Organizations race to collect and analyze as much data as possible to gain even the slightest competitive advantage over their peers. Traditional data lakes can’t handle the ever-growing number of emerging data sources and new local databases being created.
Queries have to match the specific database you’re working with, so the more databases you have, the more query languages you’ll be forced to utilize. On top of all this, integrating disparate data in a data lake to make it accessible and universally legible still requires manual data engineering, which is intensely time-consuming for data engineers and data scientists.
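To make the dialect problem concrete, here is a minimal sketch of one logical request (the top rows of a table by some column) rendered for several SQL dialects. The function name, table, and column are hypothetical and chosen only for illustration; the point is simply that the same question needs different text depending on the engine.

```python
def top_rows_sql(dialect: str, table: str, order_by: str, n: int) -> str:
    """Render 'the n largest rows of `table` by `order_by`' for one SQL dialect.

    Hypothetical helper for illustration; a real virtualization layer would
    handle far more than row limiting.
    """
    if dialect in ("postgresql", "mysql"):
        # PostgreSQL and MySQL use a trailing LIMIT clause.
        return f"SELECT * FROM {table} ORDER BY {order_by} DESC LIMIT {n}"
    if dialect == "sqlserver":
        # SQL Server puts the row limit up front with TOP.
        return f"SELECT TOP ({n}) * FROM {table} ORDER BY {order_by} DESC"
    if dialect == "oracle":
        # Oracle 12c and later support the FETCH FIRST row-limiting clause.
        return (
            f"SELECT * FROM {table} ORDER BY {order_by} DESC "
            f"FETCH FIRST {n} ROWS ONLY"
        )
    raise ValueError(f"unsupported dialect: {dialect}")


if __name__ == "__main__":
    for dialect in ("postgresql", "sqlserver", "oracle"):
        print(dialect, "->", top_rows_sql(dialect, "orders", "revenue", 5))
```

Every additional engine adds another branch like these, which is the maintenance burden the paragraph above describes.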
The lack of agility in data lakes means they will no longer be adequate in a data-driven economy.
Many organizations are therefore turning to data virtualization to optimize their analytics and BI, connecting all of their data and making it readable and accessible from a single place.
Not all data virtualization is created equal.
Data virtualization creates a software virtualization layer that integrates all of your data across the enterprise. No matter what format the data is in, or which silos, servers or clouds the data resides in, it is translated into a common business language and accessible from a single portal.
In theory, this empowers organizations with a shared data intellect where all the different business units and business users gain immediate access to the data they need, enabling businesses to make data-driven decisions for a shared purpose.
However, many data virtualization solutions fall short of this promised Eden of analytics. There are a few critical reasons for this.
Many data virtualization providers consolidate and then translate all of an organization’s data into a proprietary format. While consolidation allows the data to be integrated into a single place for a single view, the vendor’s proprietary format often reduces the data to a lowest-common-denominator state.
This lowest-common-denominator state can result in some data getting skewed, losing specialized functionality, or even getting lost in translation. Some data may also require the context of its original database to be reliable. Thus, your company may be drawing insights from faulty data and making counterproductive business decisions.
BI tool incompatibility.
BI tools are a considerable investment for organizations. Most enterprise-level companies already have several different types of BI tools across various departments. For example, one department might use Tableau, while another uses Microsoft Power BI or Excel.
For big data analytics to work for enterprises, data has to be easily discoverable and universally accessible to all users, no matter what tools they prefer to use.
The proprietary data formats that many vendors use may not be interoperable with the technologies your company has already invested in. Different tools use many different query languages and vary in the ways they display data. When data with incongruent definitions is integrated, costly errors in analysis can occur.
The ability to use the BI tool of choice is crucial to minimizing business disruptions and maximizing user productivity.
The more your data grows and evolves, the more complicated your queries will become, which is not ideal for analytics workloads and working with data at scale. The more disparate data sources you have to manage, the more data engineering will be required to run fast, interactive queries.
Moving large volumes of data at query time for distributed joins does not work for interactive queries. It puts unpredictable and unacceptable stress on enterprise infrastructure, and simplistic data caching is insufficient for a dynamic query environment and today’s data sizes.
When you add BI and AI workloads to the mix, performance degrades quickly, driving end-users to seek other direct paths to the data, which undermines the benefits of data virtualization.
In addition to these scaling pitfalls, traditional virtualization products do a poor job of addressing analytics use cases.
Scaling out big and complex data services requires an intimate understanding of the details: statistics on the data, the databases involved, the load on those shared resources, the use cases and intent of the data consumers, and security constraints.
Virtualization solutions need to offer users a business-contextual view of their data that includes hierarchies, measures, dimensions, attributes, and time series.
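As a rough illustration of what a business-contextual view might look like, the sketch below models measures, dimensions, and hierarchies as plain Python data classes and turns a business-level request into SQL. All class names, fields, and the `sales` table are assumptions made for this example; it is not any vendor's actual model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Measure:
    name: str         # business name, e.g. "Total Revenue"
    column: str       # physical column behind the business name
    aggregation: str  # SQL aggregate, e.g. "SUM"


@dataclass(frozen=True)
class Dimension:
    name: str    # business name, e.g. "Region"
    column: str  # physical column behind the business name


@dataclass
class SemanticModel:
    """Hypothetical business-level description of one physical table."""
    table: str
    measures: Dict[str, Measure] = field(default_factory=dict)
    dimensions: Dict[str, Dimension] = field(default_factory=dict)
    hierarchies: Dict[str, List[str]] = field(default_factory=dict)

    def to_sql(self, measure: str, by: str) -> str:
        """Translate 'measure by dimension' into a GROUP BY query."""
        m = self.measures[measure]
        d = self.dimensions[by]
        return (
            f"SELECT {d.column}, {m.aggregation}({m.column}) AS value "
            f"FROM {self.table} GROUP BY {d.column}"
        )

    def drill_down(self, hierarchy: str, level: str) -> Optional[str]:
        """Return the next finer level of a hierarchy, or None at the bottom."""
        levels = self.hierarchies[hierarchy]
        i = levels.index(level)
        return levels[i + 1] if i + 1 < len(levels) else None


model = SemanticModel(
    table="sales",
    measures={"Total Revenue": Measure("Total Revenue", "amount", "SUM")},
    dimensions={"Region": Dimension("Region", "region")},
    hierarchies={"Calendar": ["Year", "Quarter", "Month"]},
)

if __name__ == "__main__":
    print(model.to_sql("Total Revenue", by="Region"))
    print(model.drill_down("Calendar", "Year"))
```

A business user asks for "Total Revenue by Region" and never sees the physical column names; the model carries that mapping.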
What data virtualization should provide.
Most data virtualization solutions have not evolved at the same pace as today’s datasets and data science practices and still rely on traditional data federation approaches and simple caching techniques. There is, however, a next-generation, more intelligent type of data virtualization designed for today’s complex and time-sensitive BI requirements.
If your data virtualization solution does not provide you with the following capabilities, it simply isn’t intelligent enough.
Autonomous data engineering.
Human beings can never be perfect; luckily, computers can.
A human simply cannot manage the complexity of a modern data architecture—at least not at the speed that business now requires to stay competitive. That’s why your data virtualization solution needs to provide autonomous data engineering.
Autonomous data engineering can automatically deduce optimizations based on countless connections and calculations that a human brain wouldn’t be able to conceive of. Machine learning (ML) is leveraged to dissect all company data and examine how it’s queried and integrated into data models being built by all users across the organization.
Automating as many aspects of data engineering as possible saves a significant amount of money and resources while freeing up data engineers to perform more complex tasks that are more valuable to the organization.
Intelligent data virtualization can also automatically place data into the specific database where it will achieve optimal performance.
There are many types of specialized data and different formats that are optimal for that data.
Intelligent data virtualization can automatically decide on what platform to place data based on where it will generate the best performance. Different data platforms have distinct advantages and strengths. For example, if your data model and query are working with time-series data, intelligent data virtualization will place an acceleration structure in a database that is optimized for time series data.
Automatically knowing which database has which strength and then leveraging it will take a traditional liability—the variability of all your different database types—and turn it into an advantage.
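The placement decision can be pictured with a very small sketch: score each platform by how many of the workload's traits it is strong at, and pick the best match. The platform names, trait labels, and scoring rule here are all hypothetical; a real system would presumably weigh statistics, load, and cost as well.

```python
from typing import Dict, Set

# Hypothetical catalog: which kinds of work each platform is assumed to be good at.
PLATFORM_STRENGTHS: Dict[str, Set[str]] = {
    "timeseries_db": {"time_series"},
    "columnar_warehouse": {"aggregation", "wide_scan"},
    "row_store": {"point_lookup"},
}


def choose_platform(
    workload_traits: Set[str],
    default: str = "columnar_warehouse",
) -> str:
    """Pick the platform whose strengths overlap most with the workload.

    Falls back to `default` when no platform matches any trait.
    """
    best, best_score = default, 0
    for platform, strengths in PLATFORM_STRENGTHS.items():
        score = len(strengths & workload_traits)
        if score > best_score:
            best, best_score = platform, score
    return best


if __name__ == "__main__":
    print(choose_platform({"time_series"}))
    print(choose_platform({"aggregation", "wide_scan", "point_lookup"}))
```

Under these assumptions a time-series workload lands in the time-series store, while a scan-heavy aggregation lands in the columnar warehouse.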
Acceleration structures provide significant savings on cloud operating costs. Depending on the platform you’re using, you may be charged for the storage size of your database, the number of queries you run, the data being moved in a query, the number of rows in a query, the complexity of the query, or several other variables.
With Google BigQuery’s on-demand pricing, for example, you are charged for the data you store and for the amount of data each query processes.
When you automatically use acceleration structures for both performance and cost optimization, you’re only charged for the query data you used in the acceleration aggregate, not the size of the entire database.
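A toy example may help show why an acceleration aggregate touches less data. The sketch below pre-aggregates a handful of made-up rows by region and then answers the same question from the raw rows and from the aggregate. The data, column names, and function names are invented for illustration, and the example counts rows rather than real billing units.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

# Made-up fact rows standing in for a large raw table.
raw_rows = [
    {"region": "EMEA", "revenue": 120.0},
    {"region": "EMEA", "revenue": 80.0},
    {"region": "APAC", "revenue": 50.0},
    {"region": "AMER", "revenue": 200.0},
    {"region": "AMER", "revenue": 25.0},
]


def build_aggregate(rows: Iterable[Mapping], key: str, value: str) -> Dict[str, float]:
    """Precompute SUM(value) grouped by key, one entry per distinct key."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def revenue_from_raw(rows: Iterable[Mapping], region: str) -> float:
    """Answer the question the slow way: scan every raw row."""
    return sum(r["revenue"] for r in rows if r["region"] == region)


# Built once, then reused by every query that groups revenue by region.
revenue_by_region = build_aggregate(raw_rows, "region", "revenue")

if __name__ == "__main__":
    print("from raw rows:", revenue_from_raw(raw_rows, "EMEA"))
    print("from aggregate:", revenue_by_region["EMEA"])
    print("rows in raw table:", len(raw_rows))
    print("rows in aggregate:", len(revenue_by_region))
```

Both paths give the same answer, but the aggregate holds one row per region instead of one row per transaction, so a platform that bills on data processed has far less to bill for.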
Automatic data modeling.
The next generation of data virtualization doesn’t just translate and provide access to data; intelligent data virtualization can automatically understand the capabilities and limitations of each data platform. It automatically discovers what information is available and how it can be combined and integrated with other data when building models.
Intelligent data virtualization can reverse engineer data models and queries used to create legacy reports, so you can continue using the same report without having to rebuild data models or queries. If, for example, you created a TPS report in your old system, you will still be able to retrieve it in your new system.
Past queries may have been run on old data, but they can still be translated and run on the new system without any rewrites.
Many aspects of IT have become “democratized” in recent years—that is, advances in technology (particularly cloud) have made them accessible to laypersons without extensive technological acumen. While analytics and business intelligence have lagged in the democratization trend, BI tools are now increasingly becoming usable for the average worker.
This has resulted in the growth of a new “self-service” analytics culture, where business users can directly access and analyze data with their own preferred BI tools, and not have to rely on data engineers or data analysts.
Self-service analytics is fast becoming a necessity for optimizing big data analytics in an organization.
Let’s say, for example, the sales department has data about the previous year’s spend but wants to augment it with data regarding customer behavior patterns in multiple areas. Or the marketing department needs to initiate an account-based marketing campaign that targets companies deemed most likely to switch vendors.
With self-service analytics, the business users in the sales or marketing department can access this data and use it themselves with their own tools, rather than having to rely on trained data engineers to source the data for BI tools and on data scientists to model and predict outcomes.
The self-service dynamic allows each department in an organization to apply its own experience and expertise to BI, achieving a whole new level of agility.
Intelligent data virtualization provides a business logic layer that virtually translates all of your data into a common business language that is both source- and tool-agnostic. This logic layer means that business users can use any BI tool they prefer, and no users have to bend to a single standard for BI software.
All data will be accessible no matter what or how many tools you use, and all queries will return consistent answers. This consistency empowers your organization with a shared data intellect and the self-service culture that’s growing increasingly necessary in today’s data-driven business landscape.
In your quest to consumerize your data, you cannot sacrifice security and compliance, no matter the agility and cost benefits.
Virtualization layers have been known to pose security risks. However, with next-generation intelligent data virtualization, your data inherits all of the security and governance policies of the database where it resides. This means that your permissions and policies remain unchanged.
All existing security and privacy information is preserved down to individual users by tracking the data’s lineage and user identities.
Even when working with multiple databases with different security policies, the policies are seamlessly merged, and all global security and compliance protocols are automatically applied. There are no additional steps needed to ensure security and compliance after adopting intelligent data virtualization.
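One way to picture merged policies is a default-deny check that asks each source database whether the user may read the columns requested from it. The sketch below assumes a deliberately simple model (each source maps users to readable columns); the source names, user, and columns are hypothetical, and real governance systems are considerably richer.

```python
from typing import Dict, Set

# Hypothetical shape: source name -> user -> columns that source lets the user read.
Policies = Dict[str, Dict[str, Set[str]]]


def denied_columns(
    user: str,
    request: Dict[str, Set[str]],
    policies: Policies,
) -> Dict[str, Set[str]]:
    """Return the requested columns the user may NOT read, grouped by source.

    Each source's own policy is applied to the columns taken from that source.
    Unknown sources or users get no access (default deny). An empty result
    means the whole cross-source request is allowed.
    """
    denied: Dict[str, Set[str]] = {}
    for source, columns in request.items():
        permitted = policies.get(source, {}).get(user, set())
        blocked = set(columns) - permitted
        if blocked:
            denied[source] = blocked
    return denied


policies: Policies = {
    "crm": {"ana": {"account", "region"}},
    "billing": {"ana": {"invoice_total"}},
}

if __name__ == "__main__":
    request = {"crm": {"account"}, "billing": {"invoice_total", "card_number"}}
    print(denied_columns("ana", request, policies))
```

In this sketch the user keeps exactly the access each underlying database already grants, so a query spanning both sources is only as permissive as its most restrictive source.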
Your data virtualization must evolve with the rest of your IT.
As important as it is to have enterprise-wide, consumerized data that is readable, accessible, and reliable, many companies today are simply overwhelmed by the enormous volume of data. Its increasingly distributed nature, with dynamic and diverse formats and use cases, adds to the complexity. When users can’t quickly locate and analyze the data they need and be confident that it’s accurate and up-to-date, BI quality decreases, resulting in suboptimal or, even worse, gut-based decisions.
Data virtualization, therefore, needs to evolve to meet these new challenges and complexities so it can genuinely work for big data analytics.
If your data virtualization solution is not providing autonomous data engineering, acceleration structures, automatic data modeling, self-service analytics enablement, worry-free security and compliance, and a multi-dimensional semantic layer that speaks the language of the business, you have a problem. Without these capabilities, your data virtualization solution simply isn’t intelligent enough.
Dave Mariani is one of the co-founders of AtScale and is the Chief Strategy Officer. Prior to AtScale, he was VP of Engineering at Klout & at Yahoo! where he built the world’s largest multi-dimensional cube for BI on Hadoop. Mariani is a Big Data visionary & serial entrepreneur.