Reading time: 5 min
Business analytics on a performance-optimized BigQuery database? Google’s cloud-based data storage can take on such obstacle.
The following Google BigQuery case study builds on our experience of one of our recently finished analytical projects. We jumped into the exploration of a cloud-based database with both enthusiasm and skepticism. While working on our project, the positive experiences quickly started to predominate. Leveraging BigQuery’s architecture makes it truly a lot easier to generate useful-for-business output from thousands of Gb in the matter of seconds.
BigQuery from a business viewpoint
With regards to the applicability of BigQuery – which is one of the main big data solutions / engines of the Google Cloud Platform (GCP) – we could outline the following main takeaways: (1) extremely speedy queries on hundreds, thousands of Gbs worth of data, (2) high-level integration with software both within and out of Google’s ecosystem and (3) optimal data storing technique which focuses on providing reviewability and minimal redundancy at the same time. The (4) technology and syntax are easy to pick up and finally (5) storing data on GCP’s BQ solution can be quite cost-effective. Going into the details of the above summarized advantages may not be necessary, as all cloud-based technologies promise them and can mostly deliver.
Of course, there are two sides to every coin, however in this case the ‘other side’ does not necessary need to be the weaknesses but rather the fact that BigQuery requires another kind of approach than the traditional database storing techniques. Our experience is that analysts used to the traditional ways need some time to adapt in the beginning of a first BigQuery project. One of these differences is the pricing. It is well documented and scalable, however quite difficult to approximate accurately. As costs are generated dynamically by storing data and running queries, some sort of cost-optimizing pressure forms on participants of a BQ project. In case of an analytical project – which is quite iterative and agile – this may be an obstacle to overcome: everything has a certain price associated to it – from simply exploring the data to generating and storing new tables. Planning project budgets is a small task for small projects, but a big problem for big ones.
Pricing does have another important component to consider – the time spent on development. Writing faultless queries is not enough, neither is to make them run quickly – an important factor is to have them run cheap. Obviously, it makes more sense to pay this special attention to ‘cheapness’ if the query will get implemented in deployment and get used quite often, so when it comes to the PoC phase, it can be handled with less caution, more freely, however later optimization is a must. Making queries cheaper is also part of the new approach, it’s worth keeping it in mind when designing the database and when implementing all lines of code. Fortunately, Google offers some ways to optimize costs and it is recommended to use them for the sake of efficient project management.
In summary learning, adapting to this new approach pays off. With minimal initial investment, a database can be generated, explored, exploited and it’s only a matter of time and experience to make the cost-based thinking part of the whole process. Fortunately, due to conducting work in the cloud, there’s little difference between running analysis in a special environment or doing so on a regular laptop, where usually the ‘runtime’ is the deciding factor.
BigQuery is part of the big data solutions-group of the Google Cloud Platform’s (GCP), which is one of the 6 main services GCP focuses on (Computing & Hosting, Storage, Databases, Networking, Big Data and Machine learning). BQ’s focus is to allow its users to process, query and analyze (big) data stored in the cloud by leveraging Google’s BQ-specific SQL syntax and the technology and capacity implemented by the tech giant. It makes it possible to process and analyze huge amounts of data within seconds. The service offers 3 main user interfaces: (1) the web interface, (2) command line tool and (3) REST API connection from other languages such as Python or Java.
BigQuery’s key element – RECORD field
Besides the usual SQL schema elements, BigQuery implemented and offers a new data field type (RECORD / STRUCT) and mode (REPEATED). BigQuery leans on so called STRUCTs, which are array-like objects that allow users to store different types of data ‘in one’ as one STRUCT may include a date, a string, and an integer type as well. Similar methodologies exist at different market players too, but BigQuery makes it one of its central elements.
It is also one of BigQuery’s main strengths as one table may contain all corresponding information as a whole with the help of these array-like objects. The data this way is in a pre-join format and as soon as users decide to unnest (open) the STRUCTs, all its elements get joined to the table’s other fields.
BigQuery demonstrates its optimal data storing technique by keeping the row number at a minimal level while offering an easily reviewable table. However, upon unnesting these STRUCT fields all array-elements (maybe even multiple ones) and the table’s other fields get joined which can result in a drastic size increase both in number of rows and size of data. With the help of the newly introduced field mode (REPEATED RECORD) BigQuery allows its users to review all related information while maintaining an optimally stored data architecture, as unnesting (joining) is initiated only by the user. When there’s no need to work with those array-like fields, analysis can be carried out on the optimal-sized data table instead of on its tenfold, hundredfold maybe even thousandfold variation.
Integrating BigQuery’s outside and inside Google’s ecosystem
After familiarization with BigQuery’s proposition and differentiating functionality, it’s worth mentioning BigQuery’s integrability with GCP’s other solutions (Data Studio) as well as with software independent from Google, such as Tableau. An important part of data analysis is data visualization which may be just simple interim quick visuals that help in keeping the flow of the analysis going or by well-structured, summarizing plots and charts that support business decision making on interactive dashboards.
For basic visualizations, a great tool is offered by GCP: Data Studio (DS). DS is part of the Google ecosystem and serves as one of its main data visualization / dashboard building tools. DS can directly be accessed from BigQuery in two different ways: (1) connecting to a whole data table or (2) running a SQL query in the background and feeding its result as the subject of visualization.
In the case of the former option, the users just need to select a table on which they want to run visualizations. The latter option, when connecting to a single query, requires the users to submit the SQL code to Data Studio which by leveraging BigQuery’s SQL engine will run the code and return the result table in the background. When the query is changed, upon refreshing the input, the charts and plots also get updated, keeping all of them up to date. What needs to be remembered is that any update to a single chart causes the query to re-run in the background, generating costs associated with the size of the processed data. DS is perfectly capable of quickly and efficiently plotting the data without making the user have to leave GCP’s platform.
We also have the choice of leaving the Google ecosystem and visualizing our results in a more sophisticated environment – such as Tableau, Qlik, MicroStrategy or SAP Analytics. In our project we used Tableau to generate our final dashboard deliverables. When connecting BQ to Tableau, analysts have a very easy task at hand as Tableau, cooperating with Google, set up a user interface specifically for connection to a BigQuery database.
In summary, BigQuery seems to be able to satisfy the needs of IT system admins and business decision makers at once as by leveraging its optimal data storing architecture it’s capable of outputting data tables that are easy to review and interpret. Although when it came to practicality, we did encounter some minor issues that need some tuning, overall, the system performed well during our analytical project and hence passed our test.
Richárd Pertics – Business Analyst
Kristóf Rábay – Data Scientist