
Case study: Teradata to Snowflake migration for a large retailer

Customer situation

The customer is a leading FTSE 100 UK-based retailer operating a large data warehouse (approx. 300 TB, 10,000+ tables, 100,000+ columns, 30,000,000+ new transactions per day) on the Teradata platform. Reports and data from it were used primarily by the finance department, as well as by many other teams, to manage their performance and as input into various analytics tools.

The customer is undergoing a transformation to a strategic platform. Snowflake was selected as the best-fitting, most performant solution. The strategic platform is also coupled with a completely new data modeling approach following the Data Vault 2.0 standard. But as this is a long-term project, it was necessary to find an interim solution that would address the issues of the existing Teradata DWH (low performance, expensive to operate) as soon as possible.

Selected approach

For the interim solution, our UK partner company LEIT DATA selected a migration of the existing Teradata database to Snowflake. We decided to keep the current data model to retain backward compatibility of reports and integrations and to be as quick and efficient as possible. This enabled us to keep the existing reporting tools (e.g., SAP BusinessObjects) with only minimal tweaks. The strategic project also includes a new reporting solution (Power BI) successfully integrated with the new Snowflake DB.

The Teradata ingestion pipeline consisted of many stored procedures run by various triggers. This solution was replaced with a more maintainable set of Python scripts ingesting data from S3 batch files already generated for the Teradata solution.
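For illustration, a minimal sketch of one such ingestion script might look as follows. All names here (bucket, stage, table, credentials) are hypothetical, and the real scripts also handle scheduling, retries, and many more feeds:

# Illustrative sketch only – bucket, stage, table, and credentials are hypothetical.
import boto3
import snowflake.connector

S3_BUCKET = "retailer-dwh-batches"   # hypothetical bucket holding the batch files
S3_PREFIX = "daily/transactions/"

def new_batch_keys() -> list[str]:
    # List the batch files landed in S3 by the upstream export.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=S3_BUCKET, Prefix=S3_PREFIX)
    return [obj["Key"] for obj in resp.get("Contents", [])]

def ingest(keys: list[str]) -> None:
    # Load the listed files into Snowflake through a pre-created external stage
    # (@S3_STAGE) pointing at the same bucket the Teradata feeds used.
    conn = snowflake.connector.connect(
        account="xy12345", user="ingest_svc", password="...",  # hypothetical
        warehouse="INGEST_WH", database="DWH", schema="STAGING",
    )
    cur = conn.cursor()
    try:
        for key in keys:
            cur.execute(
                f"COPY INTO STAGING.TRANSACTIONS FROM @S3_STAGE/{key} "
                "FILE_FORMAT = (FORMAT_NAME = 'CSV_BATCH')"
            )
    finally:
        conn.close()

if __name__ == "__main__":
    ingest(new_batch_keys())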

We also found that the existing Teradata security model was neither manageable nor scalable, as it consisted of more than 1 million individual SQL statements (“GRANT”s). We implemented a new security model leveraging the native Snowflake data classification model. This enabled the customer to efficiently control access to columns and tables containing sensitive PII data.
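As a rough sketch of how such tag-based classification can work in Snowflake (all object names and credentials below are hypothetical): instead of millions of per-object GRANTs, a sensitive column is tagged once, and the masking policy attached to the tag is enforced everywhere the tag appears.

# Illustrative sketch of tag-based masking in Snowflake – names are hypothetical.
import snowflake.connector

DDL = [
    # 1. A tag marking PII columns.
    "CREATE TAG IF NOT EXISTS GOVERNANCE.TAGS.PII_LEVEL",
    # 2. A masking policy: only the PII_READER role sees clear text.
    """CREATE MASKING POLICY IF NOT EXISTS GOVERNANCE.POLICIES.MASK_PII
       AS (val STRING) RETURNS STRING ->
       CASE WHEN IS_ROLE_IN_SESSION('PII_READER') THEN val
            ELSE '***MASKED***' END""",
    # 3. Bind the policy to the tag – every tagged column inherits it.
    "ALTER TAG GOVERNANCE.TAGS.PII_LEVEL SET MASKING POLICY GOVERNANCE.POLICIES.MASK_PII",
    # 4. Classify a column – no further per-column GRANT management is needed.
    "ALTER TABLE DWH.CORE.CUSTOMER MODIFY COLUMN EMAIL SET TAG GOVERNANCE.TAGS.PII_LEVEL = 'high'",
]

conn = snowflake.connector.connect(account="xy12345", user="admin_svc", password="...")  # hypothetical
cur = conn.cursor()
for stmt in DDL:
    cur.execute(stmt)
conn.close()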

The migration took a team of approximately 10 people over 1.5 years. Much effort was spent on extensive testing to ensure that the reports were accurate “to the penny.”


High-level architecture transition

Benefits

The Data Warehouse migration to Snowflake enabled the customer to decommission the legacy Teradata platform, eliminating its support and maintenance costs as well as a dedicated team of 8 Teradata support contractors, which was replaced by a smaller permanent Data Engineering squad focusing on strategic data-value products.

This alone resulted in multi-million £ per year savings. Snowflake made us rethink how the team delivered Data Products and optimized team effectiveness. This led to a significant decrease in time-to-market, from 3+ months to less than four weeks.

Snowflake data federation allows easy sharing of the migrated database (in the legacy format) with the new strategic data warehouse (in the Data Vault format). This accelerated the migration path to the strategic data platform.

It also had these additional benefits:

  • An orders-of-magnitude speed-up in report generation and data processing.
  • An easily manageable, scalable, and auditable security model ensuring full GDPR and PII protection compliance.
  • Reduced complexity for the data visualization, data science, and analytics communities within the organization, increasing their productivity.

Lessons learned

Here are key issues we came across during the project and lessons learned from them:

  • Large-volume data egress from the Teradata platform seems to be throttled at the hardware level. The export ran extremely slowly (300 TB took a month to export), and after we had investigated every other possible cause (network stack, landing zone, etc.), we concluded that the root cause lay in the Teradata platform itself.
  • The Teradata platform has unusual decimal-rounding behavior. This was further amplified by the poor design of the original data model (use of float instead of decimal for storing financial data) and led to different results when reconciling and cross-checking reports from Teradata vs. Snowflake. Each such discrepancy had to be fully investigated, resulting in a lengthy testing period (see the sketch after this list).
  • Some companies provide services for out-of-support Teradata infrastructure (e.g., replacing failed disks). They may be interested in buying out existing systems after migration.
  • As part of any large-scale data migration, we suggest reviewing all existing reports to see which ones are not used at all or only sporadically. This can be done by reviewing report access logs or by replacing reports with unclear ownership or usage with a static notice asking users to contact the migration team. The goal is to eliminate legacy reports and thus reduce the overall testing effort.
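On the float-vs-decimal point, the problem is easy to reproduce in any language; a quick Python sketch of why float-stored financial data won’t reconcile “to the penny”:

# Why float is a poor fit for financial data: a classic rounding demo.
from decimal import Decimal, ROUND_HALF_UP

price = 0.1 + 0.1 + 0.1        # binary floats accumulate representation error
print(price == 0.3)            # False: price is actually 0.30000000000000004

print(round(2.675, 2))         # 2.67, not 2.68 – 2.675 has no exact binary form

# Decimal keeps exact base-10 values and lets you pick the rounding rule:
exact = Decimal("0.1") + Decimal("0.1") + Decimal("0.1")
print(exact == Decimal("0.3"))  # True
print(Decimal("2.675").quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))  # 2.68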

Contact us to get started

Our team participated in critical architecture design and delivery management roles. Contact us for a free assessment session where, together with your data leadership team, we will evaluate the potential for savings and for enhancing your agility in delivering data-value products.


Meet Our New Colleague: Welcoming Pali Jasem to Our Team


We are happy to introduce you to our new colleague, Pali Jasem, an experienced professional with over 20 years in IT and consulting. Pali’s experience spans a wide range of areas, including data processing, artificial intelligence, business analytics, knowledge discovery, UX/CX, solution architecture, and IT product management.

Before joining Grow2FIT, he held the position of CTO at GymBeam, where he helped grow the company and build the IT team. Previously, he worked for companies such as Pelican Travel, Solar Turbines San Diego, Seznam Prague, and other tech start-ups and corporations.

He is currently working as a business architect on a web applications development project for our client Solargis, where he applies his experience in business analysis and architecture. We are happy that Pali will expand our expert team and wish him many personal and professional successes.


What is Data Mesh and why do I care? – Part III.

In the first part of our series on Data Mesh, we introduced the concept and principles of Data Mesh. In the second part of the series, we looked at the technology enablers of introducing the Data Mesh idea to your organization and typical objections to Data Mesh. In this final part of the series, we will introduce the plan of how to start with Data Mesh in your organization.

How do we start with Data Mesh?

1. Assess organizational data maturity, pain points, and plans

Do a quick assessment to measure organizational maturity in data areas such as:

  • Data modeling – what modeling standards are used, how are models reviewed, what tools are used and how they are integrated, what artifacts are generated from models…
  • DataOps – state of CI/CD for data flows, batch jobs monitoring, logging analysis and reporting, data quality monitoring, infrastructure monitoring, and scaling…
  • Data Security – how security policies are defined and enforced, how data are classified and how the classification is retained during transformation processes, data lineage analysis, how user identities and roles are identified and managed…
  • AI/ML Review – how (if at all) is AI/ML used within an organization, what datasets are required for model training, what data are produced…

Part of the assessment is also to capture the current data stack and identify potential risks and pain points (legacy tools, expensive licenses preventing wider tool usage, performance or stability issues, etc.).
The assessment should also gather the long-term strategy and key ongoing or short-term planned business projects that either impact the data area or require critical data inputs. Such an assessment typically takes 4-6 weeks.

2. Plan tactical and strategic data stack and activities

Based on the assessment and gathered inputs, prepare:

  • Data platform strategy – a high-level outline of how the data platform should operate, what capabilities it should have, and what its key interactions with the organization’s other projects are
  • Tactical (next 3 months) and strategic (1-2 years) data stack – what tools should be used, which should be deprecated, and how they should be integrated with each other and with other systems
  • Domain model – prepare the initial data domain model (L1) and break it into sub-domains (L2) where possible. We suggest using the organization structure and IT systems architecture for the initial domain split. In other words – leverage “Conway’s law” rather than trying to fight it.
  • Governance model – the data platform governance structure (incl. mapping of domains onto the org chart, and outlining key roles and processes to define and approve Data Products, set security rules, monitor operations, audit data access, etc.)

This activity should take 2-4 weeks, including review and approval by key stakeholders.

3. Identify pilot and staff pilot team

Select the domain and the pilot Data Products (and reports) to be constructed. Allocate the necessary team – prefer fully dedicated allocation where possible to ensure the team’s full focus. Part of the pilot is typically also a validation of new technologies and tooling. For those, make sure that appropriate support from IT Operations is committed – providing the necessary installation support, network setup (firewalls, etc.), access to source data systems, credentials provisioning, etc. Allow ample time to resolve issues and stabilize each tool before onboarding users.
The goal is to deliver the pilot Data Products and reports within 3 months (ideally 2 months – depending on the lead time for new tooling setup, if any).

4. Evaluate and scale

Evaluate the issues encountered during pilot delivery – especially focus on classifying whether the issues are one-off (due to the new methodology, tooling, and/or team) or have a more fundamental root cause that needs to be addressed.
Decide on the next data domains and outline the initial set of new Data Products to build. Communicate the project and the Data Mesh concept to a wider audience – especially where users can find the new Data Product Catalog and relevant reports – and provide a contact for the expert team.
The critical step is to establish an in-house “black belts factory” – a program to train the trainers, who can then support the Data Mesh rollout organization-wide.

Start now!

We provide our customers with the necessary knowledge, training, assets, and resources to quickly start the Data Mesh journey.
Our services typically consist of:

  • A quick 2-day pre-assessment focused on identifying key focus areas and the areas where Data Mesh can bring the most business value.
  • A 2-3 day “data hackathon” integrating with real systems and the proposed tooling to demonstrate their feasibility and efficiency.
  • Driving the data assessment, presenting outcomes, and proposing plans to C-level stakeholders to gain wide support for the Data Mesh roll-out.
  • Designing the tactical and strategic data architecture and recommended data stack, preparing guidelines (modeling methodology, ingestion patterns, CI/CD pipelines, etc.), and setting up architectural ceremonies (data forum, architecture approval committee, etc.).
  • Providing resources to lead and deliver the Data Mesh pilot where the organization can’t set up internal staff sufficiently quickly.
  • Setting up and running a training program for internal teams to ensure organizational self-sufficiency and keep key know-how “in-house”.

Please contact us to learn more.

Author

Miloš Molnár
Grow2FIT BigData Consultant

Miloš has more than ten years of experience designing and implementing BigData solutions in both cloud and on-premise environments. He focuses on distributed systems, data processing, and data science using the Hadoop tech stack and in the cloud (AWS, Azure). Together with the team, Miloš has delivered many batch and streaming data processing applications.
He is experienced in providing solutions for enterprise clients and start-ups. He follows the principles of transparent architecture, cost-effectiveness, and sustainability within each client’s specific environment, aligned with the enterprise strategy and related business architecture.

The entire Grow2FIT consulting team: Our team


What is Data Mesh and why do I care? – Part II.

In the previous part of our series on Data Mesh, we introduced the concept and principles of Data Mesh. In this part of the series, we will look at the technology enablers of introducing the Data Mesh idea to your organization and typical objections to Data Mesh.

Technology Enablers

Certain technology capabilities must exist to deploy Data Mesh effectively. The following are mandatory for any data-mesh-oriented initiative:

Data Product Catalog

A centralized repository where users can publish their data and Data Products with the details required by other users (dataset description, SLAs, etc.), technical details (schema, access ports, sample datasets, etc.), and the business meaning/usability of the data. It can optionally also enforce additional publishing QA and approval workflows. The Catalog can be as simple as a set of Confluence templates or as sophisticated as a dedicated data governance tool.

Data Pipelines

Typically an ELT product that:

  • Enables users to create, test, debug, and monitor end-to-end flows.
  • Provides a rich library of generic (JDBC, REST, file…) and application-specific (Salesforce, SAP, Google Analytics…) connectors.
  • Enables data transformation (dbt, Python, JavaScript, etc.) capabilities.
  • Publishes and shares flows within the organization.

Data Pipelines should also be integrated with the Data Product Catalog to automate product publishing and to monitor product status, refresh rates, etc. A sketch of the basic pattern follows.
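As a rough illustration of the pattern – the endpoint URL and field names below are invented for the example – a minimal end-to-end flow could look like this:

# A minimal extract-load-transform flow (illustrative; URL and schema invented).
# Real ELT products add connectors, monitoring, publishing, and sharing on top.
import sqlite3
import requests

SOURCE_URL = "https://api.example.com/v1/orders"   # hypothetical REST source

def extract() -> list[dict]:
    # Pull raw records from the source system.
    resp = requests.get(SOURCE_URL, timeout=30)
    resp.raise_for_status()
    return resp.json()

def load_and_transform(records: list[dict]) -> None:
    # Land raw data first, then transform inside the store (the "ELT" order).
    db = sqlite3.connect("data_hub.db")
    db.execute("CREATE TABLE IF NOT EXISTS raw_orders (id TEXT, amount REAL, country TEXT)")
    db.executemany("INSERT INTO raw_orders VALUES (:id, :amount, :country)", records)
    # The transformation step runs in the storage layer as a derived table:
    db.execute("DROP TABLE IF EXISTS orders_by_country")
    db.execute("""CREATE TABLE orders_by_country AS
                  SELECT country, SUM(amount) AS total
                  FROM raw_orders GROUP BY country""")
    db.commit()
    db.close()

if __name__ == "__main__":
    load_and_transform(extract())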

Storage

Data Mesh will also need different patterns for delivery and execution based on the underlying technology and the location of the datasets that need to be combined. These patterns can range from physical dataset copies to data virtualization, views, and many more. Most of them require an additional structured or semi-structured storage location.
Typical – but not exclusive – consumers of such storage are composite Data Products, where datasets from source systems or other Data Products undergo transformation, enrichment, or cleaning, and the results need to be stored somewhere.
Storage is also needed when source systems cannot be accessed directly (for security reasons, or because data are provided as a batch file or stream that can’t be queried directly) or should not be (mainly for performance reasons). In such cases, a data replica needs to be established and kept up-to-date as specified in the Data Product.
There can be one or multiple storage variants provided to the users and the Data Hub – e.g., PostgreSQL, Snowflake, Azure Synapse, MongoDB, etc. What is critical for Data Mesh is that users can provision new Data Products within this storage themselves, very quickly, while all applicable governance & security rules are applied automatically.

Reporting Engine

Data provided by Data Products can be consumed by other services – e.g., machine learning or data science platforms – but the most common use case is visualization in a set of reports and dashboards. To enable this, a flexible Reporting Engine must empower users to define and share data visualizations and dashboards. Such an engine should also enable users to add new data sources as needed.

Security platform

The security platform should act as a centralized place to define, enforce, and verify/audit compliance rules, data access privileges, and the mapping of users onto roles/groups/claims. It must be integrated with the users’ identity provider (Active Directory, OAuth / JWT issuer, etc.), the source systems, and the data hub to ensure end-to-end compliance.
Ideally, it should also be capable of automatically detecting sensitive information (e.g., SSN, card number, IP address, etc.) based on defined patterns and reporting/masking it accordingly.
The solution should support RBAC and/or ABAC security models and be able to provide clear reports of what kind of access each user/role/group has. The masking engine should also provide multiple options for transforming sensitive data – obfuscation, tokenization, randomization, pseudo-randomization, encryption, etc.
From the user’s perspective, the security platform should be fully transparent and not interfere with their work – unless they need additional role(s) assigned for new Data Products.
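To make the pattern-based detection and masking idea concrete, here is a toy sketch. The regexes are simplified and the key is hypothetical; real security platforms ship curated pattern libraries and managed token vaults:

# Toy pattern-based PII detection with deterministic tokenization.
import hashlib
import hmac
import re

PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){15}\d\b"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

SECRET = b"rotate-me"  # hypothetical tokenization key

def tokenize(value: str) -> str:
    # Deterministic token: same input -> same token, but not reversible.
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

def mask_text(text: str) -> str:
    # Replace every detected sensitive value with a typed token.
    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m: f"<{kind}:{tokenize(m.group())}>", text)
    return text

print(mask_text("Customer 123-45-6789 paid with 4111 1111 1111 1111 from 10.0.0.7"))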

Data Mesh sounds great but…

…we already have DWH / ODS

A DWH or ODS is a great dataset source for Data Products and typically continues to be used and evolved, as it doesn’t make financial sense to replace many existing integrations and reports without an additional reason.
Nevertheless, DWH / ODS are not substitutes for Data Mesh, as they are highly centralized systems with carefully curated data. Instead, Data Mesh builds Data Products “on top of and alongside” them.
That said, in some setups where DWH is built using expensive technologies, it may make sense to gradually work on replacing existing DWH to save on license costs.

…we already have Data Lake

Data Lake is an evolutionary step from the DWH concept in that it empowers users to quickly access datasets and process them as needed. It results in higher agility but still raises these concerns:

  • Adding new data sources into a data lake is typically a tedious process managed by a centralized team.
  • The data lake concept doesn’t address publishing, sharing, and reusing produced datasets.
  • Due to semi-structured or unstructured datasets, defining and enforcing a unified security governance model is practically impossible.
  • It does not scale for heterogeneous organizations with large datasets or data-volume-intensive analytics (e.g., machine learning).

Data Mesh is designed to address all these shortcomings.

…we already have a Data Catalog

Establishing a data catalog (glossary, data lineage, etc.) is a great accelerator for Data Product definitions. Existing tools can be re-used to also act as the Data Product Catalog. However, Data Mesh deployment requires additional capabilities beyond just a catalog.

What do real-world customers say?

Use case in a large universal bank

“We have created data journeys from end-to-end, joining many data sources with my team of 3 and with no IT support” – head of the retail lending department.

One of the largest universal banks in Central Europe embraced the Data Mesh concept together with the rollout of a state-of-the-art ELT tool. This enabled business users to create ad-hoc Data Products and reports on their own.
In a typical case, the Data Product Owner and her team connected data from Google Analytics, social media campaigns, and BigQuery with extracts from the Oracle DWH, which were productized together with Hadoop data lake and XLS data. All of this was done directly by business users in the selected tool, with zero support from IT needed.
Reports were configured in Tableau and connected to the dataset storage. The business could then answer questions such as “how many people from a social network campaign come to our physical branches, how do they behave in our online banking, and what effect does promotion X have on their upsell to product Y”.
The use of Data Mesh supported by a modern data stack enabled them to make business decisions and evaluate the impact of marketing campaigns and new product rollouts at an unprecedented speed and data richness.

Other references

  • “Finally we have a great analysis of our retail outlets network. We know how our clients behave at these outlets, know how they operate in real-time and where we can optimize” – Retail banking and major European bank after the first year of Data Mesh deployment
  • “HR data is very sensitive. We wanted to analyze the trends, individual people, and how we can support them in their growth. Due to the high sensitivity of data, we need an environment that is fully functioning and runs without any IT support and that we can audit and make sure no data or access to it leaks outside of the HR department. Data Mesh provides us with just that. We were able to create advanced people analytics within half a year only within the boundaries of our tribe. This already creates huge benefits in how we can support our workers in their daily lives.” – Major European institution
  • “To automate KPIs reporting over diverse systems and branches was previously not possible for us. After one year, we have it across all of our companies, and it runs reliably day after day without the need for intervention” – Large bank
  • “Everybody talks about Customer360, but to put data from all internal and external sources and applications, prepare it in a format for automation tools to reliably work with, and enrich it by models developed in DBX was previously not possible for us. With ELT-tool powered Data Products, it’s been in production within the first year and now serves as bases for automation in SalesForce, marketing automation tools, and our internal applications” – Large retailer
  • “We have used the Data Mesh concept for our self-service platform where we prepare and productize our Risk Scoring product that is used automatically by all other departments in their processes like Customer 360, lending applications, or sales predictions. The combination of AWS Lambda functions and ELT tool’s productization allowed us to put this in production within a couple of weeks, and now over 30 other processes within other tribes reap the benefits” – Top global lending institution
  • “Understanding transactions and their context that people are doing with our bank has been a tremendous boost to our step to more proactively manage our customer experience, predict revenues and risks” – Retail bank

Author

Miloš Molnár
Grow2FIT BigData Consultant

The entire Grow2FIT consulting team: Our team


What is Data Mesh and why do I care?

In recent years, the Data Mesh concept has revolutionized how organizations work with their data. It has enabled businesses to produce data reports with unparalleled speed and agility. The value of almost any company is based on 2 aspects – its data and its people (staff, partners, customers). And people need the right data at the right time to make the right decisions.

The well-known DIKW theory holds that data can create information, information can create knowledge, and knowledge becomes part of wisdom. Only the synthesis of data yields information for decision-making, and making the right decisions based on data means extracting value from the data. Multitudes of solutions have been proposed to address this fundamental organizational hunger for getting data quickly – ranging from the traditional Oracle-powered DWH through Hadoop and Azure Synapse to cloud-native, highly parallel databases such as Snowflake or Google BigQuery.

All of these tools and approaches failed to address the major shortcoming – data centralization. There is always a more or less centralized team governing, or even creating, the data platform. This central element makes data solutions expensive and means it takes too long to deliver the required data to business users. Sometimes the business need has even passed by the time the report arrives, and it is no longer needed. Data Mesh is a set of design and architecture principles created to enable data democratization without such a bottleneck. Data Mesh is not a product that can be bought to magically solve all the problems. Instead, it needs to be taken as a set of guidelines to be tuned to fit a particular organization or project.

Data Mesh principles also affect target data stack selection – some products enable a rapid roll-out of the Data Mesh architecture more easily than others. But there is no one-size-fits-all, so when designing interim and target data stacks, we always consider existing products and tools, the organizational data maturity level, available resources, budgets, current pain points, and fit with the long-term vision. When applied pragmatically and customized to the organization, Data Mesh acts as a “10x” factor for increased speed and agility in data deliveries.

Data Mesh: it’s not rocket science… or is it?

We like to compare the difference between traditional approaches such as DWH or ODS and modern Data Mesh to the difference between the Space Shuttle and SpaceX rockets. The Space Shuttle was an enormously expensive program because every component had to work flawlessly the very first time. And every component had to work flawlessly the first time because the program was so expensive that failure was not economically or politically possible. SpaceX uses a very different approach. Failure is not just accepted as an option; it is sometimes a desired outcome, in order to gather the most data. Their rocket designs and builds iterate very quickly, and there is a constant learning feedback loop.

The catch is that the SpaceX approach was impossible 20 years ago. Many technologies were not mature enough or simply did not exist at all – from computer designs, friction welding, and new metallurgy materials down to 3D printers and many others. These enable SpaceX to iterate extremely quickly with low risk and costs.
The situation with Data Mesh principles is similar. Whilst they are technology-agnostic, their efficient implementation requires certain technical capabilities and maturity – such as a highly flexible reporting/visualization system, a data transfer and transformation engine (ELT) supporting distributed teams, automated CI/CD pipelines, and a security layer enabling quick and trusted user access control, sensitive data masking/randomization, etc. A Data Mesh should cover all these domains.

History rhymes with itself

The Data Mesh movement resembles an evolution that happened in online integration. In the past, the SOA model leveraged an ESB layer that acted as a central place to expose organization-wide APIs, often using a canonical service data model. Many WebService standards were created, covering everything from security and service discoverability to distributed transactions. This approach worked great technically, but the middleware team quickly became a bottleneck, as it had to play a delivery-critical role in practically every project.

Over time, this approach evolved into a microservice, REST-based approach built on conventions rather than on any centralized component. The focus shifted from centrally designed and governed services to solutions favoring speed of delivery and the ability of teams to efficiently expose and consume services as needed in the context of their mission. Such an approach may not look pretty on architecture diagrams, where the central ESB component is replaced with “spaghetti integration”, but which one actually better represents real-world complexity?

A similar shift happened in the security layer – rather than checking and enforcing security at the ESB layer, the JWT/OAuth approach decouples authentication (determining identity and assigned roles) from actual authorization, with security roles always evaluated in the business context of the service provider. The ease of exposing and consuming a microservice led to the creation of a vibrant service marketplace, where a REST API is now expected to be part of any new solution. It also erased the difference between “UI integration” and “system-to-system integration”. Frontends now consume the very same REST API as any other system. This simplified development and maintenance, as developers nowadays don’t have to think about and support two different integration and security models.

We often hear concerns from current ODS/DWH developers that the Data Mesh approach will make their role obsolete. In fact, the opposite is true. Their role will evolve to be more tightly coupled with actual business users and Data Product consumers, and they will be critical in building complex Data Products. It will become a more rewarding role, as they will be able to directly see the results of their work being used and receive feedback firsthand. The Data Mesh methodology will enable them to quickly incorporate requested changes without waiting through months-long, frustrating release cycles.

Data Mesh principles

The core of Data Mesh consists of these 3 principles. They aim to enable anybody to safely create, share, and consume Data Products as they see fit. There is no centralized team that could become a bottleneck; instead, it is replaced by centralized governance enforcing security, compliance, performance, and data privacy rules.

1. Build Data Products

The basic building block of Data Mesh is a Data Product. It is created to provide data for a specific user need. It consists of at least one dataset from one or more data sources and can possibly include other Data Product(s). Data Products can be roughly classified into 2 categories:

  • Source-aligned DPs – these are provided directly by a source system(s). For example Customer, Address, Account, Order, etc. They can consist of a logical join of datasets from multiple source systems.
  • Composite (or consumer-aligned) DPs – these Data Products are built using dataset(s) from other DP(s) and possibly other data sources. They provide additional business value through data enrichment, cleaning, or aggregation. Examples include Customers with active Accounts, Customer LTV, frequently-bought-together Articles, etc.

The most important aspect of Data Product is that anybody in the organization can define and publish a new DP.

2. Publish and share Data Products

The main differentiator of Data Mesh compared to previous approaches is its true de-centralization and democratization of users’ access to data. Access to the dataset of a Data Product can be provided by multiple methods (REST API, SQL, batch file, etc.). The underlying technology data stack must enable users to quickly and independently create a new Data Product and then publish it into a Data Product Catalog to share it with a wider audience. Tools must also support the efficient creation of new, composite Data Products, where a user can “plug in” existing Data Product(s), additional source systems and/or external data, write the necessary transformation logic (in whatever language the user prefers – SQL, Python, R, Java, etc.), and store the resulting datasets for further access (e.g., in a DB, as a batch file, or even as a streaming set of messages/events).

To share a Data Product, the user must also provide additional metadata about it – such as a plain-text description, schema(s), performance limitations, operational metrics (refresh rates), data security classification, and others. A sketch of such a descriptor follows.
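For illustration, such a descriptor could be captured in a structure like the one below; the field set is invented for the example, not a standard schema:

# A toy sketch of the metadata a Data Product might publish to the catalog.
from dataclasses import dataclass, field

@dataclass
class DataProductDescriptor:
    name: str
    description: str              # plain-text business meaning
    owner: str                    # domain team accountable for the SLAs
    output_ports: list[str]       # e.g. ["sql", "rest", "batch_file"]
    schema_ref: str               # link to the schema definition
    refresh_rate: str             # operational metric, e.g. "hourly"
    security_classification: str  # e.g. "internal", "pii"
    upstream_products: list[str] = field(default_factory=list)

customer_ltv = DataProductDescriptor(
    name="customer_ltv",
    description="Lifetime value per active customer, refreshed hourly.",
    owner="retail-analytics",
    output_ports=["sql", "rest"],
    schema_ref="catalog://schemas/customer_ltv/v2",
    refresh_rate="hourly",
    security_classification="pii",
    upstream_products=["customer", "orders"],
)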

3. Manage security access

It is obvious that sharing sensitive data must be done in a controlled and secure manner. This means that there must be a mechanism to define, at the Data Product level, the required access rules at the entity, attribute (column), and record (row) levels. These controls can be more fine-grained than a typical allow/deny rule – they can allow attributes to be masked, randomized, or pseudo-randomized (e.g., preserving age while randomizing the day of birth, as sketched below).
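A minimal sketch of that last example – deterministic, so the same person always maps to the same masked date, with the key and offset scheme purely illustrative:

# Pseudo-randomize a date of birth while (approximately) preserving age.
import hashlib
from datetime import date, timedelta

SECRET = b"per-dataset-key"  # hypothetical masking key

def pseudo_randomize_dob(dob: date) -> date:
    # Derive a stable offset (0-27 days) from the value itself, so the same
    # input always maps to the same output and joins keep working.
    digest = hashlib.sha256(SECRET + dob.isoformat().encode()).digest()
    shifted = dob + timedelta(days=digest[0] % 28)
    # Keep the birth year so the age in years stays (approximately) correct.
    return shifted.replace(year=dob.year)

print(pseudo_randomize_dob(date(1985, 3, 14)))  # deterministic, same year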

Whilst it is possible to configure these controls and rules individually on each source system and also throughout various data products, this approach easily leads to the exponential growth of complex rules (GRANTs, etc.) and is practically unmanageable and unauditable. Instead, it is highly desired to have a centralized tool that will enforce security governance rules end-to-end – from source systems through interim data products to the reports themselves. Only then is it possible to ensure various regulatory compliance requirements for GDPR, PCI-DSS, HIPAA, and others – such as separation of duties, data masking/pseudonymization, PII & PAN data storage controls, encryptions at-rest & at-transit, etc.

These controls are based on the traditional RBAC (role-based) or more modern ABAC (attribute-based) approaches, but they must be centralized, owned, and managed by the data providers. Without strong and trusted security and governance guardrails, and the tooling to enable them, any Data Mesh concept will fail to deliver the expected business benefits.

Link to Article #2 of our Data Mesh series: What is Data Mesh and why do I care? – Part II.

Author

Miloš Molnár
Grow2FIT BigData Consultant

The entire Grow2FIT consulting team: Our team


The second year of Grow2FIT

Even the second year of our existence was accompanied by turbulent developments at home and abroad. It brought events that we thought we would never see again in Europe and that have no place in a civilized society. We have continued with the ideas that led to the foundation of Grow2FIT - to inspire our clients to innovate, to pay attention to the maximum quality of the services provided, and to create a pleasant environment for our colleagues with challenging and exciting projects.

In 2022, we expanded our cooperation with Solargis – a leader in providing meteorological data for evaluating and managing solar energy production. Our 10-member team worked on developing web applications through which Solargis clients access meteorological data and functionalities. We also helped Solargis to find several key internal employees.

We also continued our cooperation with Greyson Consulting and participated in the successful merger of Raiffeisenbank and Equa bank in the Czech Republic. We are also working on exciting projects at Raiffeisenbank International – implementing a new digital bank and, most recently, replacing core banking with a new, modern, cloud-native solution.

Throughout 2022, our experienced DevOps team worked on designing and implementing Deutsche Telekom’s new public cloud service. The new service is fully compatible with the EU Sovereign Cloud initiative and built exclusively on open-source technologies. It will go into production in the 1st quarter of 2023.

Also, this year, we continued delivering entire teams or individual specialists for tech companies such as Unicorn Systems, ADASTRA, Aston ITM, exe, or Millennium.

We are particularly pleased to cooperate with new and inspiring clients such as SentinelOne – a leader in cyber security in the cloud environment from the USA, ČSOB, or the Czech tech company Atlas Group.

None of these projects could have been delivered successfully without our team, which this year grew to 35 specialists with experience in various areas such as application development (JavaScript, Java, .NET), DevOps, data solutions, etc. We continued our philosophy of searching for talent outside the Slovak and Czech labor markets, and we expanded the team with new developers, primarily from the Balkan countries. Profiles of our most experienced coworkers can be found here: https://www.grow2fit.com/our-team/.
We have also expanded the Grow2FIT internal team with a new recruiter so that we can search for IT talent at home and abroad even more effectively.

Shortly, we will introduce you to our new competence in designing, implementing, and integrating solutions built on the Snowflake platform. Snowflake is one of the leaders in cloud data solutions (data mesh, data warehousing, data lakes, data application development…).

In conclusion, we would like to thank you for your trust and fair cooperation, wish you all the best in the new year, and hope your work brings you joy and the right challenges.


How to prepare dev/test Ceph environment

When you are using Ceph in production, it is important to have an environment where you can test upcoming upgrades, configuration changes, integration of new clusters, or any other significant changes without touching the real production clusters. Such an environment can be built with a tool called Vagrant, which can very quickly build a virtualized environment described in one relatively simple config file.

We are using Vagrant on Linux with the libvirt and hostmanager plugins. Libvirt is a toolkit to manage Linux KVM VMs. Vagrant can also create virtualized networks to interconnect those VMs, as well as storage devices, so you can have an almost identical copy of your production cluster if you need it.

Let’s create a 5-node Ceph cluster. The first 3 nodes will be dedicated to control-node daemons, all nodes will also be OSD nodes (2 x 10 GB disks on each node by default), and one node will be a client node. The client node can be used to test access to cluster services: mapping RBD images, mounting CephFS filesystems, accessing RGW buckets, or whatever you like. The host machine where the virtualized environment runs can be any Linux machine (Ubuntu 22.04 in our case) with KVM virtualization enabled.

user@hostmachine:~/$ kvm-ok 
INFO: /dev/kvm exists
KVM acceleration can be used

Install required packages:

sudo apt-get install qemu libvirt-daemon-system libvirt-clients ebtables dnsmasq-base
sudo apt-get install libxslt-dev libxml2-dev libvirt-dev zlib1g-dev ruby-dev
sudo apt-get install libguestfs-tools
sudo apt-get install build-essential

Install Vagrant according to the steps on the official installation page: https://developer.hashicorp.com/vagrant/downloads
Then install the Vagrant plugins:

vagrant plugin install vagrant-libvirt vagrant-hostmanager 

If there is no SSH keypair in ~/.ssh, generate one. This keypair will be injected into the VMs because cephadm, which we will use for the Ceph deployment, needs SSH connectivity between the VMs; the keypair will be used for SSH authentication between nodes.

ssh-keygen

Now we should be prepared to start the virtual environment on the machine.

mkdir ceph-vagrant; cd ceph-vagrant
wget https://gist.githubusercontent.com/kmadac/171a5b84a6b64700f163c716f5028f90/raw/1cd844197c3b765571e77c58c98759db77db7a75/Vagrantfile

vagrant up

When vagrant up finishes without any errors, Ceph will continue installing in the background for a couple more minutes. You can check the deployment progress by accessing the Ceph shell on node0:

vagrant ssh ceph1-node0
vagrant@ceph1-node0:~$ sudo cephadm shell
root@ceph1-node0:/# ceph -W cephadm --watch-debug

At the end, you should get a healthy Ceph cluster with 3 MON daemons and 6 OSD daemons:

root@ceph1-node0:/# ceph -s
  cluster:
    id:     774c4454-7d1e-11ed-91a2-279e3b86d070
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum ceph1-node0,ceph1-node1,ceph1-node2 (age 13m)
    mgr: ceph1-node0.yxrsrj(active, since 21m), standbys: ceph1-node1.oqrkhf
    osd: 6 osds: 6 up (since 12m), 6 in (since 13m)
 
  data:
    pools:   1 pools, 1 pgs
    objects: 0 objects, 0 B
    usage:   33 MiB used, 60 GiB / 60 GiB avail
    pgs:     1 active+clean

Now your cluster is up and running. You can install additional services like CephFS or RGW, play with adding/removing nodes, and upgrade to the next version. By changing the CLUSTER_ID variable in the Vagrantfile and copying the Vagrantfile to another directory, you can deploy a second cluster and try to set up replication (rbd-mirror, cephfs-mirror, RGW multi-zone configuration) between the clusters. You are constrained only by the boundaries of your imagination.

When you are done with your tests, you can simply destroy the environment with:

vagrant destroy -f

Author

Kamil Madáč
Grow2FIT Infrastructure Consultant

Kamil is a Senior Cloud / Infrastructure consultant with 20+ years of experience and strong know-how in designing, implementing, and administering private cloud solutions (primarily built on OpenSource solutions such as OpenStack). He has many years of experience with application development in Python and currently also with development in Go. Kamil has substantial know-how in SDS (Software-defined storage), SDN (Software-defined networking), Data storage (Ceph, NetApp), administration of Linux servers and operation of deployed solutions.
Kamil regularly contributes to OpenSource projects (OpenStack, Kuryr, Requests Lib – Python).

The entire Grow2FIT consulting team: Our Team


Reference: greenconsus – Jira and Confluence migration

Recently, we have noticed a great emphasis by Atlassian on the use of the cloud versions of its products. For most SME companies, it is much more advantageous to use the most modern and flexible cloud products, where the effort associated with maintaining servers and applications is eliminated and the need to ensure application security is minimised.

Atlassian allows you to continue using its applications (Jira, Confluence, and others) even after the support license expires, but without the possibility of updates with new functionalities and improvements, and without support from the security point of view. The applications thus gradually become obsolete and no longer meet the customer’s expectations. They also become a dangerous invitation for potential attackers.

For the reasons mentioned above, the greenconsus company decided not to maintain its solution anymore but to migrate data to the Atlassian Cloud and use cloud versions of the products.


Data analysis was the first step of the migration project, which we implemented in cooperation with the technology company VisionLake. What would need to be migrated? Which users would be affected by the migration? What to do with historical data? How to maintain the consistency of information about employees and users who are no longer active but whose details still need to be available?

In the second step, a short migration feasibility study was carried out when the company was already actively using Atlassian Cloud solutions, and new data was added daily. The migration had to be done so that the customer did not lose new data (generated by using the cloud version) or old data (stored in the on-premise Server version).

Due to the significant differences between the current cloud version and the server version the customer had, it was not technically possible to perform a one-time migration to the most recent cloud version. Therefore, it was necessary to find out which versions of the applications were still compatible, and a multi-step migration had to be performed. Licensing also had to be resolved for the migration since, without licenses, it is impossible to update to compatible versions.


After a successful migration, a data consistency check was performed in both systems, and we used a bit of scripting magic to fine-tune information about users who are not active in the new cloud solution.

And as a bonus, in addition to the migrations, we also checked the settings of individual Jira projects, authorisation settings, and notifications, and reduced the number of Custom Fields. The result is a modern and flexible solution with significantly increased data security. Server maintenance and restoration costs have been reduced, and the company has a better overview of its resources and can plan work better.

Provided services

  • Consulting

Key Technologies

  • Atlassian Jira
  • Atlassian Confluence


Goodbye Summer 2022

It’s necessary to say goodbye to the summer and the holidays appropriately, so we spent a pleasant day at the Aquarea Čierná Voda complex. We completed an introductory freediving course – a sport where you can experience incredible freedom and discover hidden beauty under the water’s surface, and which, even if it doesn’t seem like it, is a collective sport (you should never freedive alone).

Fun with families, a barbecue, and a morning restart with a swim in the lake were not to be missed.
We believe that next year we will try another enjoyable activity, and again in even greater numbers 😊


Ceph Fundamentals Training Course

Ceph is an open-source software-defined storage platform. It provides a flexible foundation for all data storage, uniting object, block and file types in a single unified RADOS cluster. For enterprises with multiple storage type requirements, Ceph provides a simplified, flexible solution.

We offer you a unique Ceph Fundamentals Training Course with hands-on experience for the attendees:

  • Limited number of attendees (we prefer a maximum of 6 people).
  • A Ceph training environment for each attendee.
  • Attendees will practise their theoretical knowledge during the training course.
  • Attendees will practise troubleshooting in their training Ceph environments.
  • Each attendee finishes the course with an online exam and receives a certificate if the test is passed.

At the end of the course, participants will be able to:

  • Understand
    • the architecture of Ceph clusters on both network and storage levels.
    • concepts like Placement Group, RADOS object, Crush Map, Crush rules and their relations.
    • how the communication between clients and cluster works.
    • basic principles of deployment and usage of all 3 storage types (Block, File, Object)
  • Install and maintain a Ceph cluster using the built-in cephadm installer.
  • Create and configure pools.
  • Troubleshoot nodes, disks, daemons, placement groups, etc.

Training Course Agenda

Agenda 1st day

  • Introduction – What is Ceph
  • History
  • Brief overview
  • Features
  • Commodity HW
  • Resiliency and scalability
  • Ceph Architecture
    • RADOS
      • librados
    • RBD
    • RGW
    • CephFS
  • A deeper look at Internals
    • RADOS objects
    • OSDs
    • MONs
    • MGRs
    • Pools
    • Placement Groups
  • Deployment hands-on
    • Cephadm
    • Intro to virtual environments
    • Ceph installation hands-on

Agenda 2nd day

  • Pools
  • RBD
    • Snapshots/Clones
    • RBD data layout
    • RBD-mirror
    • Other RBD features
  • RBD Hands-on
  • CephFS Theory
    • Namespaces and MDS partitioning
    • Snapshots
    • Recursive accounting
    • Other features
    • Access types
  • CephFS Hands-on
  • RGW
    • What is object storage
    • Zones, Zone groups, DR
    • Periods and Epochs
    • Users/Keys
    • Quotas
  • RGW Hands-on

Recapitulation and Final Exam

  • Each course attendee gets access to an online recapitulation and final exam, which remain available for 2 weeks after the course.
  • There is a time limit on the final exam.
  • Before the final exam, attendees can review and practise the course material.

Your Lecturer

Kamil Madáč
Grow2FIT Modern Infrastructure Consultant

Kamil is a Senior Cloud / Infrastructure consultant with 20+ years of experience and strong know-how in designing, implementing, and administering private cloud solutions (primarily built on OpenSource solutions such as OpenStack). He has many years of experience with application development in Python and currently also with development in Go. Kamil has substantial know-how in SDS (Software-defined storage), SDN (Software-defined networking), Data storage (Ceph, NetApp), administration of Linux servers and operation of deployed solutions. Kamil is a regular contributor to OpenSource projects (OpenStack, Kuryr, Requests Lib – Python).

References

If you are interested in the Ceph Fundamentals Training Course, please contact us.