
DevOps & SRE
13 hours ago
Circles is one of the fastest-growing technology companies in Singapore and is transforming the telecommunications industry. Launched in 2014, Circles now operates across 14 markets and serves millions of customers globally.

Through the digital expertise gained from operating our own digital lifestyle telco brand, Circles.Life, we identified building the world’s first cloud-native telco-as-a-service platform as the next phase of our journey. Built for operators by operators, Circles X offers a suite of software solutions to clients aiming to launch a digital telco or expand into new markets. We launched Circles X in 2019 as a cloud-native SaaS technology platform to collaborate with telco partners and scale globally, quickly and efficiently.

For the first launch on the Circles X platform, we partnered with one of Indonesia’s leading mobile network operators (MNOs) to launch a digital telco brand in 2020. Built natively on Google Cloud, a distributed team from India, Sri Lanka, Singapore and Indonesia brought the platform live in less than 60 days — under a third of our six-month target — even amid the complications of the COVID-19 crisis.

To do this, we built the Circles X platform using the DevOps Research and Assessment (DORA) metrics endorsed by Google Cloud. Tracking these metrics underpins the performance of our software development team, and has proved instrumental to the success of our mission.
The metrics include:

- Deployment Frequency: how often an organization successfully releases to production
- Lead Time for Changes: the amount of time it takes a commit to get into production
- Change Failure Rate: the percentage of deployments causing a failure in production
- Time to Restore Service: how long it takes an organization to recover from a failure in production

We decided early on to use DORA standards as a guidepost for building our solution, giving us the methodologies we needed for success, and the Google Cloud team helped us every step of the way in achieving optimal outcomes with the DORA methodology. Our development team was, and continues to be, focused on meeting high or elite performance benchmarks for each of the DORA metrics, resulting in the nimble and timely development of our platform.

Here is a step-by-step account of how we built a world-class DORA culture throughout our organization, building on the site reliability engineering (SRE) practices that we absorbed from Google Cloud training and guidance:

- Working with the Google Cloud team, we set up a full-stack SRE team including a Developer Solution Architect and leaders in Quality Assurance, Security, SRE DevOps and Digital Data Storage.
- We adhered to domain-driven design, an optimal software design approach, which we combined with SRE best practices for DORA excellence. This gave the teams the freedom to independently build services and deliver applications without legacy dependencies.
- Lastly, we extended the structure beyond teams, so that services and infrastructure are also isolated per domain. All of our DORA metrics and dashboards for the internal developer experience are based on domain-driven best practices.

Adhering to DORA benchmarks, the team can roll out products and updates weekly — 75% faster than a traditional telecom operator. This year, we expect even faster deployments as we deepen our DORA expertise.
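As a rough, hedged illustration (not part of the original case study), the four DORA metrics can be computed from a simple log of deployments. The record format below is invented for the example:

```python
from datetime import datetime, timedelta

# Hypothetical deployment log over a one-week window. Each record:
# (commit_time, deploy_time, caused_failure, restored_time)
deploys = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 12), False, None),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 15), True, datetime(2024, 1, 2, 16)),
    (datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 11), False, None),
]
window_days = 7

# Deployment Frequency: successful releases to production per day
deployment_frequency = len(deploys) / window_days

# Lead Time for Changes: mean time from commit to production
lead_times = [deployed - committed for committed, deployed, _, _ in deploys]
lead_time_for_changes = sum(lead_times, timedelta()) / len(lead_times)

# Change Failure Rate: share of deployments causing a production failure
change_failure_rate = sum(1 for _, _, failed, _ in deploys if failed) / len(deploys)

# Time to Restore Service: mean time from a failed deploy to recovery
restores = [restored - deployed for _, deployed, failed, restored in deploys if failed]
time_to_restore = sum(restores, timedelta()) / len(restores) if restores else None

print(deployment_frequency, lead_time_for_changes, change_failure_rate, time_to_restore)
```

In practice, teams derive these from their CI/CD and incident tooling rather than hand-kept records, but the arithmetic is the same.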
Crucially, Google Cloud provided us with exceptional support whenever we faced hurdles in building our DORA-enabled platform strategy, whether technical, organizational, or cultural. We benefited from open conversations with Google Cloud, as the team was always ready to share its feedback, comments and viewpoints. Google Cloud also helped us connect with industry experts for key insights into architecting our solution and creating the right DORA culture.

With Google Cloud’s support and a focus on DORA metrics, we proved that the Circles X platform could build successful, profitable, agile, and scalable digital brands. Building on these foundations, we’re industrializing the Circles X platform to partner with a global pipeline of MNOs and help the telecom industry build brands and experiences that delight digitally savvy customers.

In the two years since it launched, Indonesia’s digital brand has grown into one of the most popular and fastest-growing digital telcos in the country. Our DORA learnings have been essential to our plans for telco empowerment around the world. Moreover, Google Cloud gave us the world-class continuous integration and continuous delivery (CI/CD) we needed to build a reliable software development and testing practice, allowing us to launch within six months and iterate from early versions up to the current digital telco brand.

This was made possible with the infrastructure-as-code approach enabled by Google Kubernetes Engine (GKE), along with other key managed services such as Cloud SQL, Dataproc, Filestore, Cloud CDN, Pub/Sub and Cloud Storage. Building Indonesia’s favorite digital brand on GKE saved significantly on costs, eliminated dependencies, and improved developer efficiency and deployment frequency on the path to market. As a result, we are the first telco in Indonesia to have an on-demand developer environment in production, where we can spin up a GKE environment within just a few minutes.
This has brought us a competitive advantage, increasing developer productivity by 40% and cutting deployment timelines from weeks to days.

Building on the success of the digital brand in Indonesia, we are partnering with Google Cloud to transform our Circles X platform into a robust, flexible solution that enables rapid global launch and expansion for our partner MNOs.
Read more
DevOps & SRE
3 days ago
When making changes to Terraform code that require no alteration to the deployed resources, the Terraform state still needs to be updated. The process involves removing the Terraform state, updating the Terraform code, then re-importing state from the existing resources.

One time you might need to perform this operation is when new Terraform resources become available for deploying and configuring Google Cloud resources, as with the newly released google_cloud_run_v2_service Terraform resource. This blog post describes this new resource and walks through migrating an example Cloud Run service to it.

Terraform resource differences

When Cloud Run was first launched, it was built on Knative and Kubernetes, and the v1 API reflected this model. There is now an additional v2 API that is built in a style similar to other Google Cloud APIs. Both APIs are generally available, and v2 does not obsolete v1. v2 is used by the Cloud Client Libraries, which includes the Google provider in Terraform. Given the different shapes of these two APIs, google_cloud_run_service resources have different attributes from google_cloud_run_v2_service resources.

You may be familiar with having to declare metadata.annotations to configure settings around scaling and Cloud SQL connections. These settings are now defined as direct arguments. Additionally, within the template attribute, containers used to sit underneath spec; in v2, spec is no longer required, and containers is a direct child of template.
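To make the structural difference concrete, here is a hedged before/after sketch. The service name, service account, and image are placeholders, and the attribute set is intentionally minimal; check the provider documentation for the full schema:

```hcl
# v1: settings live under template > spec
resource "google_cloud_run_service" "default" {
  name     = "hello"
  location = "us-central1"

  template {
    spec {
      service_account_name  = "app-sa@my-project.iam.gserviceaccount.com"
      container_concurrency = 80
      timeout_seconds       = 300

      containers {
        image = "us-docker.pkg.dev/cloudrun/container/hello"
      }
    }
  }
}

# v2: spec is gone, attributes are renamed, timeout is a duration string
resource "google_cloud_run_v2_service" "default" {
  name     = "hello"
  location = "us-central1"

  template {
    service_account                  = "app-sa@my-project.iam.gserviceaccount.com"
    max_instance_request_concurrency = 80
    timeout                          = "300s"

    containers {
      image = "us-docker.pkg.dev/cloudrun/container/hello"
    }
  }
}
```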
There are also some attributes that have had slight name changes.

Some of the metadata.annotations differences under template include (where the first is v1, and the second is v2):

- "autoscaling.knative.dev/minScale" is scaling.min_instance_count
- "autoscaling.knative.dev/maxScale" is scaling.max_instance_count
- "run.googleapis.com/cpu-throttling" is containers.resources.cpu_idle
- "run.googleapis.com/cloudsql-instances" is volumes.cloud_sql_instance

Some of the attributes that have different names include:

- spec.timeout_seconds is timeout
- spec.container_concurrency is max_instance_request_concurrency
- spec.service_account_name is service_account
- spec.containers.env.value_from.secret_key_ref.name is containers.env.value_source.secret_key_ref.secret
- spec.containers.env.value_from.secret_key_ref.key is containers.env.value_source.secret_key_ref.version

Some of the attributes that have changed values include:

- containers.resources.limits.cpu: a v1 example value of 1000m becomes 1 in v2
- timeout: a v1 example value of 300 (as timeout_seconds) becomes "300s" in v2

A full list of attributes for both versions is available through the Terraform resource documentation: v1 google_cloud_run_service and v2 google_cloud_run_v2_service.

Updating Terraform code

Given the above differences, here are the changes required to use the new resource for an existing Cloud Run service. In this example, we changed the integer value timeout_seconds to the string value timeout, container_concurrency to max_instance_request_concurrency, and service_account_name to service_account. The spec level was also removed, and the resource itself was changed to google_cloud_run_v2_service.

Updating Terraform state

From a literal standpoint, the changes we’ve made to the Terraform code make no effective change to the deployed Cloud Run service.
However, because from Terraform’s standpoint the service is now declared with a different resource, running terraform apply would destroy the Cloud Run service and recreate it. Since Cloud Run is stateless, this may not be a problem in most cases. However, you may want to keep information about the resource, such as the history and timeline of its revisions.

To update the Terraform code and state without making any changes to the deployed resource:

1. Start with the original code.
2. Confirm the service exists in your Terraform state, getting its resource name.
3. Using the resource name, get the resource identifier.
4. Update the Terraform code to use the new resource.
5. Remove the Terraform state for the resource.
6. Import state for the resource.
7. Confirm there are no pending changes in Terraform.

The rest of this post goes through each of these steps in detail.

In practice

The example service we’ll use is a Cloud Run service running the sample Hello World image. This Terraform has already been applied to my project, and I’m keeping the state files locally (but I could just as easily store the state in a Cloud Storage bucket).

Confirm Terraform state

terraform state list

State is confirmed based on the configurations that are currently in the .tf files, so we’ll need to run this command before making changes. From our configuration, we have two resources and one data source, and those are reflected in the state.

Get resource identifiers

echo RESOURCE.NAME.id | terraform console

For the later import step, the import command needs both Terraform’s resource name and the real-world identifier of the resource. From our example, the hello service’s Terraform resource name is google_cloud_run_service.default.
We want the id argument on that resource, so we can use the console to view the value. Now, this identifier is accurate, but it won’t work as-is for our later step: it is in the form region/project/name, whereas according to the Cloud Run Admin API v2, it needs to be in the form project/region/name. We’ll come back to this later.

Update Terraform files

Now that we have information from our current state, we can make our configuration updates. We’ll change the resource name (including associated references), remove the spec part of the template node (effectively moving containers up a level), and move the annotations to their new names. We could make further changes in this step, but to ensure our terraform plan notices no new changes, we’ll do that later.

Remove old state

terraform state rm RESOURCE.NAME

We’re now ready to remove the old state. This updates the state files so Terraform will ’forget’ about this resource. It has no effect on the running resource; it just removes the resource from the various .tfstate files we have.

Import new state

terraform import RESOURCE.NAME RESOURCE_IDENTIFIER

With the old state removed, we can now pull state from our project. Since we’ve defined the resource in our Terraform configuration, we use that resource name as our resource, and the identifier we adapted earlier as the resource identifier.

If you didn’t get the format of the resource identifier right, you will end up with an error like Error 403: Permission denied on resource. Permission will be denied if the resource being referenced doesn’t exist.
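As a small aside, the reordering from region/project/name to project/region/name mentioned above can be done with a quick shell one-liner instead of by hand. The values below are illustrative:

```shell
# Reorder a v1-style identifier (region/project/name) into the
# project/region/name form expected by the v2 import step.
v1_id="us-central1/my-project/hello"
v2_id=$(echo "$v1_id" | awk -F/ '{print $2 "/" $1 "/" $3}')
echo "$v2_id"
```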
Check the format of the resource identifier you passed, and adjust as needed.

Confirm no pending changes

terraform plan

Now that we’ve updated the state, we can confirm that Terraform correctly matches our new resource definition with our still-unchanged real-world resource. By running terraform plan, we can see what Terraform would do to our service to match the resource definition. If we did everything correctly, there are no required changes. We can then run terraform apply to confirm that Terraform state and real-world state are reconciled.

We have now successfully migrated a Cloud Run service defined in Terraform to a different resource, with zero changes to our deployed resource.

What we’ve learned

In this article, we learned about the v2 Cloud Run API, and how to adapt an existing deployed Cloud Run service to use google_cloud_run_v2_service. We also learned how to remove and import Terraform state to reflect the real-world state of our infrastructure, in a non-destructive manner.

Learn more

- Terraform resource definition for google_cloud_run_service
- Terraform resource definition for google_cloud_run_v2_service
- Terraform resource definition for IAM policies for Cloud Run
Read more
DevOps & SRE
4 days ago
We should chat.

Effective communication is the key to success, especially in today’s rapidly evolving business landscape. Good communication forms the foundation for collaboration, innovation, and high-performing teams. Our DevOps Research and Assessment (DORA) and Project Aristotle research demonstrates how effective communication, defined as the capacity to willingly share information and perspectives in an open and transparent way, plays a vital role in driving organizational excellence. So, let’s dive in!

How do we know what drives effective software teams and organizational cultures?

Unlocking the secrets to success in software delivery and operations is no easy feat, but at Google we’ve made it our mission to find the answers. We not only make a lot of technology, we also study how technology gets made; deconstructing our processes allows us to locate and target areas for optimization.

Google’s DORA program is an academically and statistically rigorous research program that seeks to answer the questions: “How does technology help organizations succeed, and how do we get better at software delivery and operations?”

Internal research projects across hundreds of Google teams, such as Project Aristotle, our multi-year study of our own teams to determine why some teams stumble and others soar, allow us to study the drivers of highly effective teams in real time and explore ways to scale these tools and processes.

Join us as we dive into the drivers of highly effective teams and explore how technology propels organizations to new heights.
In this blog series, we’ve distilled years of Google research into five dimensions that you can apply to drive success within your own organization:

- Cultural Resilience
- Communication (the focus for this blog)
- Collaboration
- Innovation
- Empowerment

Voices: amplifying the power of diversity

Inclusive and transparent communication is key to driving success within your organization; communication is not just about transmitting messages, it’s about embracing the diverse voices within your organization. For the purposes of this article, our definition is the “capacity to willingly share information and perspectives in an open and transparent way.” Both the DORA research and Google’s Project Aristotle emphasize the critical nature of inclusivity and the value of diverse perspectives in driving success. When individuals from different backgrounds, experiences, and roles come together, they bring a wealth of ideas and insights that lead to better problem-solving and decision-making while mitigating groupthink.

Psychological safety, which fosters open communication and trust, is the most important team dynamic, as it creates the conditions to both offer and harness diverse perspectives. When team members feel safe to ask questions and share their ideas without fear of judgment, it allows for a collaborative and open dialogue environment where success is prioritized over shame. Encouraging team members to communicate their perspectives creates an environment that fosters innovation: each voice provides a unique perspective, enabling teams to tackle complex challenges with greater creativity, unlocking innovation within individual team members while amplifying the team’s value.

Information flow: facilitating seamless knowledge exchange

Our Project Aristotle research shows that one key factor differentiating successful teams from the rest is information flow.
When employees have access to the information they need and communication channels are open and transparent, it fosters collaboration, reduces silos, and enhances decision-making processes. Embracing technology and tools that streamline information sharing is essential. Collaborative platforms, such as real-time communication channels, empower teams and enhance the decision-making process through transparency and visibility.

To unlock the full potential of your organization, it’s crucial to establish channels that make the flow of information easy and accessible. Sociologist Dr. Ron Westrum found that information flow is both influential and indicative of how parts or all of an organization will behave when trouble arises. Insights from Westrum’s research, along with DORA and Google’s Project Aristotle, all highlight the significance of creating an environment where information is freely shared and easily accessible. The 2023 Accelerate State of DevOps Report from DORA finds that quality documentation amplifies the effect of technical capabilities on organizational performance.

Building successful teams involves effective communication channels that enable team members to stay informed, align their efforts, and make well-informed decisions. From recurring calendar meetings to anonymous surveys and collaboration platforms, these channels play a crucial role in enabling teams to stay informed, align their efforts, and make data-backed decisions.

Another factor differentiating successful teams from the rest is dependability. According to internal Google research, effective team members readily offer assistance, with 1-3 members willing to raise their hand to help a colleague. Trust and dependability are essential in ensuring a successful team.
Dependability goes beyond just getting the work done; it also means being able to trust that high-quality work will be delivered on time.

Messengers: leading by example

Consistent findings from Project Aristotle underscore the important role of leaders in setting the tone for open and transparent communication. Successfully adopting this form of collaborative communication is a top-down endeavor, with employees looking to their leaders to gauge how much of their ideas and experiences to share. Google’s research shows us that the best leaders empower their teams by creating a safe space for open dialogue and ensuring everyone’s voice is heard. By demonstrating genuine authenticity and actively listening to their teams, leaders not only empower team members to voice their unique perspectives and opinions, but also ensure these insights are valued and leveraged. This inclusive approach strengthens each member’s sense of belonging and promotes a culture of continuous learning and improvement.

In the dynamic landscape of technology-driven organizational excellence, effective communication stands as a pillar of success. Google and DORA research consistently underscores the pivotal role communication plays in encouraging high-performing teams and spearheading innovation. By embracing a culture of willingly sharing information and perspectives in an open and transparent way, we unlock the power of diverse voices, facilitate seamless information flow, and empower leaders at all levels to create an environment where success is not just a goal, but an ongoing journey.
In harnessing the power of communication, enhanced by technological tools, we pave the way for an organizational ecosystem where trust, collaboration, and breakthrough innovation become the norm.

Next steps

We developed our proprietary Cloud Culture of Success assessment based on extensive internal and external research, demonstrating that focusing on five pillars (communication, operations, reliability, security, and organizational performance) contributes to a more effective software delivery pipeline. Each of these pillars plays a critical role in shaping a more agile and culturally resilient organization. If you’re eager to gain deeper insights into how communication can transform your organization’s culture, we invite you to check out dora.dev or contact us. Let’s collaborate to foster a communication-centric culture that not only aligns with your goals but also propels your organization toward sustainable, long-lasting success.
Read more
DevOps & SRE
3 weeks ago
Today, we’re excited to announce the latest update in Google Cloud’s portfolio of DevOps tools. Customers are already using Cloud Deploy for continuous delivery, enabling the definition and manual control of application delivery pipelines, facilitating canary deployments and verification, and more. Now, with the introduction of delivery pipeline automation in preview, Cloud Deploy supports continuous deployment: the end-to-end automation of continuous delivery.

What does this mean for you? You can now automate promotions between targets; for example, when an application release succeeds in the development environment, it can automatically be promoted to the staging environment. You can automate canary advancement too; for example, when a 10% canary rollout succeeds, it can automatically advance to 50%. Pipeline automation builds on the Cloud Deploy vision of helping you deliver faster, more reliably, and more safely.

In addition to pipeline automation, Cloud Deploy now supports Cloud Run jobs. Cloud Run jobs are a great way to run management and automation scripts, batch data processing jobs, and other run-to-completion workloads. With this update, you can now use Cloud Deploy to manage all of your Cloud Run deployments, for services as well as jobs, in a consistent way.

Pipeline automation

The first step toward application continuous delivery involves configuring a delivery pipeline, specifying a process to gradually promote your application from a development environment to a staging environment, and finally to production. The manual orchestration of this movement is typically referred to as continuous delivery. When this process occurs automatically, such as promotion from development to staging and then to production after each successful environment rollout, it is known as continuous deployment. With the introduction of pipeline automation, Cloud Deploy now supports continuous deployment.
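An automation that promotes a successful dev rollout to staging might be configured roughly like this. This is a hedged sketch: the field names are reconstructed from memory of the Cloud Deploy documentation and may differ, and the pipeline, target, and service account names are made up:

```yaml
# Illustrative only: promote a release from dev to staging
# automatically after a successful rollout.
apiVersion: deploy.cloud.google.com/v1
kind: Automation
metadata:
  name: my-app-pipeline/promote-to-staging
description: promote a successful dev rollout to staging
suspended: false
serviceAccount: deploy-automation@my-project.iam.gserviceaccount.com
selector:
  targets:
  - id: dev
rules:
- promoteReleaseRule:
    name: promote-release
    toTargetId: staging
```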
This launch introduces two new automation rules: promote-release and advance-rollout. The promote-release rule promotes a release between targets upon a successful rollout; when a development environment deploy succeeds, you can automatically promote the application to a staging environment, for instance. Similarly, the advance-rollout rule automates the advancement of a canary deployment between canary percentages as the canary rollout progresses successfully.

Every time an automation occurs, an automationRun resource is generated, recording which automation was run, when, and its result.

Cloud Run jobs

With the addition of Cloud Deploy’s support for Cloud Run jobs, you can now manage code delivery to all your Cloud Run services and jobs in a consistent way. You can utilize many of Cloud Deploy’s continuous delivery capabilities, including progressions, approvals, and rollback. Follow the quickstart to get started.

The future

Comprehensive, easy-to-use, and cost-effective DevOps tools are key to building an efficient software delivery capability, and it’s our hope that Cloud Deploy will help you implement complete CI/CD pipelines. Stay tuned as we introduce exciting new capabilities and features to Cloud Deploy in the months to come.

Update your current pipelines with these new features today. Check out the product page, documentation, quickstarts, and tutorials. Finally, if you have feedback on Cloud Deploy, you can join the conversation. We look forward to hearing from you!
Read more
DevOps & SRE
4 weeks ago
Building user trust hinges on a reliable system and a seamless end-user experience. Any downtime or degradation in user experience can erode trust, jeopardizing an organization’s reputation and bottom line. Having the ability to proactively assess the quality of the end-user experience your applications offer is therefore crucial for business success.

In July 2023, we introduced synthetic monitoring, a proactive monitoring capability that uses automated Node.js scripts to simulate user interactions with an application or service. This allows you to periodically test the availability, consistency, and performance of your web applications, APIs, and key business workflows from the perspective of a real user.

Today, we are excited to announce the general availability of synthetic monitoring. With this launch we are also including three new capabilities:

- Terraform support
- Integration with Cloud Trace and Cloud Logging
- A UI template for the Mocha test framework

How does synthetic monitoring work?

Synthetic monitoring works by executing automated test scripts against an application endpoint from Google Cloud data centers located around the world. The test scripts simulate critical user journeys within an application, such as logging into a website, searching for a product, or adding an item to a cart. When you configure a synthetic monitor, you create a Google Cloud function that executes Node.js code using an open-source framework distributed and managed through Cloud Monitoring.

The results of the scripts are then collected and analyzed to determine the application’s availability, consistency, and performance. If any problems are detected, alerts can be configured to notify the appropriate team so that issues can be fixed before they impact your end users.

Creating a new synthetic monitor

Google Cloud’s synthetic monitoring service is easy to use and can be configured in minutes.
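The pass/fail logic at the heart of such a script can be quite small. Below is a hedged sketch of the kind of check a synthetic monitor might run; it is not the actual Cloud Monitoring SDK API, just an illustration of evaluating one step of a simulated user journey, with field names and thresholds invented for the example:

```javascript
// Evaluate the outcome of one simulated user-journey step.
// The result shape and threshold are illustrative, not the SDK's.
function evaluateStep({ status, latencyMs }, { maxLatencyMs = 2000 } = {}) {
  const healthy = status >= 200 && status < 300 && latencyMs <= maxLatencyMs;
  return {
    ok: healthy,
    detail: healthy ? 'healthy' : `status=${status}, latency=${latencyMs}ms`,
  };
}
```

In a real synthetic monitor, failing results like these are what drive the alerting described above.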
You can create synthetic monitors for any URL or API endpoint, and you can choose from a variety of pre-built scripts or create your own. To get started, head to Monitoring > Synthetic Monitoring and select “+ Create Synthetic Monitor” > Custom synthetic monitor. You select the name of the synthetic monitor, its frequency of execution, and other settings.

Next, you create a Cloud Function that will be the target of the synthetic monitor. Clicking “Create Function” launches a fly-out used to configure the function. At this step, you can name your function or manage advanced settings, such as giving the function access to an internal VPC network. You are also presented with sample code for the function to execute.

Google Cloud’s synthetic monitoring is highly scalable: you can monitor thousands of URLs or API endpoints with a single account. And because it’s powered by Google Cloud, you can be confident that your synthetic monitors will be reliable and accurate.

New features in this release

1. Terraform support

In addition to general availability, we are announcing support for Terraform, an open-source infrastructure-as-code (IaC) tool that allows you to define and provision your cloud infrastructure using code. To get started with creating a synthetic monitor using Terraform, visit this link.

2. Integration with Cloud Trace and Cloud Logging

You can now collect trace and log data for any outbound HTTP requests made by your synthetic monitors. Trace and log data can be used to quickly identify performance issues and troubleshoot failures in your web application by pointing out the line of code that failed, the error types, and the error messages. To read more about enabling your synthetic monitors to collect trace and log data, visit this link.

3. UI template for the Mocha test framework

Mocha is a popular JavaScript test framework known for its simplicity, flexibility, and extensibility.
To make it easier to create tests using the Mocha framework, we are including a Mocha-based UI template in the create flow, alongside the existing generic template.

More information

Visit our documentation for additional information and step-by-step instructions for creating your first synthetic monitor. Lastly, if you have questions or feedback about this new feature, head to the Cloud Operations Community page and let us know!
Read more
DevOps & SRE
5 weeks ago
Artifact Registry is a single place to store, manage, and secure your build artifacts, fully integrated with Google Cloud’s tooling and runtimes, so you can connect it to your CI/CD tooling and set up automated pipelines. Today, our remote repository and virtual repository features are generally available, providing enhanced availability and control over your dependencies, optimizing build performance, and making it easier to manage and share artifacts across teams.

Improved dependency management

Customers use a wide array of internal and external dependencies in their software supply chains today. Remote and virtual repositories can help with the complexities of modern dependency management across your organization:

- Remote repositories can insulate your build processes from downtime of an external repository by providing local caching of artifacts. Artifact Registry supports access to a variety of popular external repositories, including Docker Hub, Maven Central, npmjs and PyPI.
- Local caching through remote repositories decreases build latency by allowing dependencies to be served from the regions closest to development teams.
- Virtual repositories help facilitate management and artifact distribution by aggregating internal and external repositories under a single URL. Developers no longer have to configure and keep track of a myriad of endpoints, and can focus on their core responsibilities, such as writing code.

Improved security posture

Remote repositories can help you scan your external dependencies for security vulnerabilities through Google Cloud’s Artifact Analysis, allowing you to fix them before they can be exploited by attackers. Virtual repositories provide control over the priority of your dependency resolution. For example, you can configure a virtual repository so that an internal repository is always checked before going upstream to a third-party source such as Maven Central.
This control allows you to prevent the kinds of dependency confusion attacks that have wreaked havoc on many companies in the software industry.

A simplified experience

Remote and virtual repositories simplify accessing and managing repositories for developers by consolidating multiple repository URLs into a single endpoint. When using remote, standard, and virtual repositories, you can reduce the number of endpoints from three down to one, and improve these efficiencies even further by adding more repositories to the same virtual repository.

Get started

Using remote and virtual repositories is straightforward through the Artifact Registry UI.

Creating a remote repository

Creation of a virtual repository

For more information on remote and virtual repositories, follow the links below to get started:

- Remote Repositories Quickstart
- Create Virtual Repositories
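As a conceptual illustration of that resolution order, a virtual repository can be thought of as consulting its upstream repositories by priority and returning the first match. This is a minimal sketch of the idea only; the repository names, priorities, and data structures below are invented for illustration and are not the Artifact Registry API:

```python
# Conceptual sketch of virtual-repository resolution: upstreams are
# consulted in priority order and the first repository that holds the
# artifact wins. All names and priorities here are illustrative.

def resolve(artifact, upstreams):
    """Return (repo_name, version) from the highest-priority upstream that has it."""
    for repo in sorted(upstreams, key=lambda r: r["priority"], reverse=True):
        if artifact in repo["artifacts"]:
            return repo["name"], repo["artifacts"][artifact]
    raise LookupError(f"{artifact} not found in any upstream")

upstreams = [
    {"name": "internal-pypi", "priority": 100,
     "artifacts": {"acme-utils": "2.1.0"}},
    {"name": "pypi-remote", "priority": 10,
     "artifacts": {"acme-utils": "9.9.9",   # attacker-published imposter
                   "requests": "2.31.0"}},
]

# The internal repository is checked first, so an imposter package on
# the public index never wins for internal package names:
print(resolve("acme-utils", upstreams))  # ('internal-pypi', '2.1.0')
print(resolve("requests", upstreams))    # ('pypi-remote', '2.31.0')
```

Because the internal repository outranks the public one, a dependency-confusion attack that publishes a same-named package upstream cannot shadow the internal artifact.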
DevOps & SRE
Last month
Internal research by Google, and by Google’s DevOps Research and Assessment (DORA) organization, shows that teams that encourage a culture of trust — one that allows for questioning, risk-taking and mistakes — perform better. The way an organization responds to opportunity is a big part of its culture. And for software delivery and overall team effectiveness, equally important is how an organization responds to failure.

By adopting specific behaviors and ways of working that encourage resilience, we can increase our teams’ effectiveness and achieve better organizational performance.

How do we know what drives effective software teams and organizational cultures?

At Google, we not only make a lot of technology, we also study how technology gets made. DORA is an academically and statistically rigorous research program that seeks to answer the questions: “How does technology help organizations succeed, and how do we get better at software delivery and operations?” Internal research projects across hundreds of Google teams, such as Project Aristotle, have also allowed us to study the drivers of highly effective teams.

In this blog series, we’ve taken years of this Google research and distilled the findings into five dimensions that you can apply to drive success within your own organization:

- Resilience (the focus of this blog)
- Communication
- Collaboration
- Innovation
- Empowerment

Let’s jump in, and consider what resilience is, how it improves performance, and how your team can get more of it.

Resilience: congratulations on the failure, now let’s investigate

We define resilience as the ability of teams to take smart risks, share failures openly, and learn from mistakes, and teams that exhibit resilience are demonstrably more successful than teams that don’t. This idea that a culture with resilient characteristics can drive desirable organizational outcomes isn’t new. Sociologist Dr. 
Ron Westrum’s study of how culture influences team behavior when things go wrong typified three distinct organizational cultures, and cultures in which failure led to inquiry, rather than justice or scapegoating, were found to be more performance-oriented. Westrum referred to these as “generative” cultures.

This research has been reinforced by our DORA findings since the first State of DevOps Report was published in 2014. Our 2023 Accelerate State of DevOps Report demonstrates that the presence of a generative culture continues to predict higher software delivery and organizational performance. We believe this is because, at its core, DevOps is fundamentally about people and the ways those people work. And people drive culture.

Source: DORA 2023 Accelerate State of DevOps Report

Take, for example, security development practices. Our research found that organizations with high-trust, resilient cultures are 1.6x more likely to have above-average adoption of emerging security practices than those that do not. We believe these generative traits, including aspects of resilience, may lead to a more desirable security posture due to their influence on teams’ ways of working. For example, generative organizations may be more likely to actively minimize the inconvenience or risk associated with reporting security issues by fostering an atmosphere of “blamelessness,” among other things. The bottom line is: if you want to improve your organization’s security posture (and beyond), consider evaluating your team’s culture first.

We can further break resilience down into two additional mindsets:

- Launching and iterating: getting started, gathering feedback, and continuously improving
- Psychological safety: a shared belief that a team is safe for interpersonal risk-taking

Launching and iterating: perfect is the enemy of good

Would you be comfortable sharing an idea with your leadership if it were only 20% formulated? Part of resilience is gathering input and continuously improving. 
Our research shows that teams that adopt a mindset of continuous improvement perform better. This includes starting quickly, adapting to changing circumstances, and experimenting.

For example, in the context of software delivery, DORA research supports the philosophy of continuous delivery, so that software is always in a releasable state. Maintaining this “golden” state requires creating mechanisms for fast feedback and rapidly recovering from failures. We’ve found that teams that prioritize these feedback mechanisms have better software delivery performance. Our research has also found that working in small batches improves the way teams receive and use such feedback, as well as the ability to recover from failure, among other things.

Launching and iterating is not only about improving the software that you ship. It’s also about a team’s more general ability to self-assess, pivot, and adopt new ways of working when it makes sense based on the data. Inevitably, this experimentation will include both successes and failures. In each case, teams stand to learn valuable lessons.

Psychological safety: celebrating failure as success

Would you be comfortable openly failing on your team?

Extensive research inside Google found that psychological safety provides a critical foundation for highly effective teams. In general, our research demonstrates that who is on a team matters less than how team members interact when it comes to predicting team effectiveness. In order of importance, Google researchers found these five variables mattered most for team effectiveness.

Source: Google re:Work Guide: Understand team effectiveness

Project Aristotle examined hundreds of Google teams to answer the question “what makes a team effective?” Statistical analysis of the resulting data revealed that the most important team dynamic is psychological safety, or creating an environment where taking smart risks is encouraged. 
It is an environment where members trust they will not embarrass or punish each other for ideas, questions, or mistakes. Further DORA analysis found that these practices also benefit teams outside of Google, uncovering that a culture of psychological safety is broadly predictive of better software delivery performance, organizational performance, and productivity.

It’s important to remember that culture flows downstream from leadership. DORA research shows that effective leadership has a measurable, significant impact on software delivery outcomes. If we want to foster a blameless, psychologically safe environment, leaders must provide their teams with the necessary trust, voice, and opportunities to experiment and fail.

How can you practice being resilient?

Adopting a mindset of continuous improvement can help you achieve better organizational performance. Likewise, embracing psychological safety within your organization may help your teams work more effectively. This is what we mean by using resilience to drive success through culture.

So, what does resilience look like when it is applied practically in our behaviors and reinforced through our daily work?

We can continuously improve by launching early, defining success metrics, gathering input (including through crowdsourcing), and taking what we learn to heart, both to improve our products and the way we work. This ability can be underpinned by technical practices such as continuous integration, automated testing, continuous delivery, and monitoring, to name a few. These practices provide the foundation and guardrails that allow for safe, rapid iteration and reliability.

We can also normalize failure by conducting both “premortems” (anticipating the myriad ways an idea may fail) and “blameless postmortems” — candid conversations about times when things haven’t gone according to plan and what could be done to improve, without assigning blame. 
For example, we’ve found that teams that leverage reliability practices, including blameless postmortems, report higher productivity and job satisfaction, and lower levels of burnout, than their counterparts who use more traditional operations approaches. We suspect this is because, among other things, a sustained fear of making mistakes can lead to poor well-being. Blameless postmortems help prevent issues from recurring, help avoid multiplying complexity, and allow you to learn from your own mistakes and those of others.

These ways of working are exemplified by our latest Google Cloud DevOps Award winners. These organizations have demonstrated how they are implementing these and other practices to drive organizational success and elite performance. For example, consider how one company leveraged cross-functional teams to remove bottlenecks, address blockers, and improve communication — the focus of our next blog in this series.

In the meantime, be prepared for failure as you experiment with new ways of working, including new approaches to software delivery, operations, and beyond. And ask yourself: how will you react the next time something goes wrong?

To learn more, take the DevOps Quick Check and read the latest State of DevOps Report, both at dora.dev.
DevOps & SRE
Last month
Today, we hear from NCR Voyix, who co-creates experiences with the world’s leading retailers, restaurants, and financial institutions, about their move to Google Cloud, what it meant for their log processing and analytics, and how Cloud Logging and Log Analytics have helped them take advantage of the cloud paradigm.

Here at NCR Voyix, the SRE team provides the observability solutions necessary for running our Digital Banking platforms, and we rely on observability tools to detect, troubleshoot, and resolve issues as quickly as possible. Analyzing logs is an important component not only of this endeavor, but also of making informed business decisions.

Opportunities in the cloud

We recently completed the transition of a key banking platform called Digital Insight onto Google Cloud. This meant moving our observability tools into the cloud as well, including a third-party log processing and analytics tool that we had relied upon since our on-prem days. But integrating this tool was difficult and plagued with latency issues.

We also understood that working in the cloud brings new opportunities. We are constantly looking to optimize our systems, and we noticed that there were new tools available natively within Google Cloud.

Enter Log Analytics, a new set of features in Cloud Logging that brings the power of BigQuery with a new user experience optimized for analyzing log data. The launch of Log Analytics couldn’t have come at a better time for us. It not only satisfied our use cases, but was also available at no additional cost. Being able to use SQL to query the log data was the icing on the cake, and trimmed our learning curve.

Today, our third-party tool is gone; Log Analytics is our new solution. It’s a great fit for our needs, and we’re excited that new features are being added to it regularly. 
Throughout the deployment, we collaborated with the Google Cloud product management organization to provide feedback, so Log Analytics could meet the complex requirements of our SRE teams. Here are a few things we learned about the tool, about running in the cloud — and about our organizational culture — along the way.

Key learnings

- We didn’t need to bring all our baggage into the cloud. Re-framing some of our needs helped us find solutions that ran natively in the cloud. Doing this early can bring savings. For us, the third-party tool was overpriced and underused; sure, it had some advanced features, but not for our kind of usage.
- Cloud Logging and Log Analytics, being native to the cloud, reduced the need to replicate logs to an external system. This in turn increased the speed with which we can generate reports about issues within the overall environment.
- Using structured logging, such as the JSON format, improves Log Analytics’ performance because there is less overhead from pattern matching.
- Minimize your learning curve. Getting people to accept new ways of doing things can be difficult. But with Log Analytics it’s been pretty straightforward so far, because it just uses SQL, and our teams already know SQL.
- Before, we couldn’t send some types of logs, such as audit logs, to third-party tools for further analysis without incurring costs or causing compliance implications. With Log Analytics, we can now analyze all our logs locally.
- Log Analytics charts, a new feature within Log Analytics, can be a great alternative to charts built with log-based metrics. These charts can also be embedded into Cloud Monitoring dashboards.
- Finally, our use of Log Analytics is paving the way for more cohesive observability, where important tools aren’t external entities but part of the same overall system. Everything is in one place.

In summary

By embracing Log Analytics, our organization is reducing costs while enabling users to go deeper with their logs. 
Being native to the cloud removed the need to replicate logs to an external system, which has improved the latency of reporting on issues within the overall environment. Cloud Logging, as a whole, now handles platform security, scalability and reliability of our system so that our SRE teams can better focus on partnering with our product and application teams.
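The structured-logging learning above is easy to see in a minimal sketch: when a service writes one JSON object per line to stdout, Cloud Logging can ingest the fields into a structured jsonPayload, which Log Analytics then queries directly instead of pattern-matching free text. The field names here are illustrative, not a required schema:

```python
# Minimal sketch of structured (JSON) logging: one JSON object per line
# on stdout. Cloud Logging recognizes fields such as "severity" and
# surfaces the rest as queryable jsonPayload keys.
import json
import sys
from datetime import datetime, timezone

def log(severity, message, **fields):
    """Emit one structured log entry as a single JSON line on stdout."""
    entry = {
        "severity": severity,       # recognized special field
        "message": message,
        "time": datetime.now(timezone.utc).isoformat(),
        **fields,                   # custom, queryable fields
    }
    sys.stdout.write(json.dumps(entry) + "\n")
    return entry

log("ERROR", "payment failed", transaction_id="tx-123", latency_ms=412)
```

A query can then filter on a concrete field (for example, `latency_ms > 400`) rather than parsing message strings.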
DevOps & SRE
Last month
Since 2016, Spotify and Google Cloud have collaborated on solutions for Spotify’s developers, Google Cloud users, and open source communities. One of the most intensive and productive partnerships for both companies has been the work Google Cloud and Spotify have done to develop and optimize Google Cloud technologies in a way that also helps Spotify grow and scale. A recent example of this partnership is an automated fleet management solution that lets Spotify developers deliver secure, scalable, easy-to-manage apps and services faster than they ever have before.

The drivers behind fleet management

Even though Spotify is a large, established company with over 3,000 engineers, it still has a startup mindset. More than 500 ‘squads’ have specific goals and are empowered to achieve them. They own their product and technical strategies, and they’re constantly iterating on both.

The challenge is that Spotify’s squads collectively manage more than 10,000 components, and all these backend services, data pipelines, websites, APIs, and front-end libraries are connected in a complex mesh of interdependencies. For example, the squad that manages playlists provides a full stack of components used by most of Spotify’s engineers. So not only must the playlist squad ensure their components are reliably supporting 550+ million global customers, they must also find the time to manage routine software updates and develop new playlist features.

Unprecedented scale required a new approach

Spotify’s complexity is only increasing. The company’s growth in users, content, and components is exponential. To meet demand and ease the pressure on engineers, several years ago the company abstracted away as much infrastructure as possible by lifting and shifting its platform to Google Cloud. Yes, the managed services automated many manual procurement and provisioning tasks, but engineers were still spending too much time on software maintenance. 
Releases, software updates, and new security threats kept squads heads-down on tedious tasks. And platform migrations required significant effort by hundreds of teams over months, preventing them from doing new development.

Before fleet management, upgrading Spotify’s Java runtime took eight months.

Spotify saw that it had to evolve or it wouldn’t be able to scale and innovate fast enough to meet its requirements, let alone its goals. The company increased its use of automation and adopted a fleet management model. A step beyond infrastructure-as-a-service, fleet management removes many repetitive tasks from developers by providing backend services — like library updates, security patches, and even software observability — as part of the infrastructure.

Niklas Gustavsson, VP and Chief Architect, explains: “We want to abstract away more levels of the technology stack and manage more commodity aspects of our platform so that developers’ work is more productive and fun.”

Making the shift to fleet management

One of Spotify’s biggest challenges in adopting fleet management was earning developers’ trust. The company had to show them they could go a step further and rest easy while automated processes pushed code changes to their components without any human interaction. Engineers had to be able to see for themselves that automation worked.

Spotify’s Backstage portal, which is now a Cloud Native Computing Foundation open source project, provided a single pane of glass into software components and cloud resources, but that wasn’t enough. The company had to give engineers easy fleetwide observability and controls, so they could see every component change and its impact. 
Spotify delivered those advanced insight capabilities using BigQuery. Today, from Backstage, developers can make fleetwide changes, updating code used by ten or 1,000 components, without taking control away from the squads that own the components.

“We tell our developers, ‘Instead of you always staying on top of which libraries your software uses or having to patch security vulnerabilities, we’ll do that for you, and you won’t even notice that we’re doing these things.’”
Niklas Gustavsson, VP and Chief Architect, Spotify

No one wants to go backwards

Today, more than 80% of Spotify’s production components are fleet-managed. As a result, developers are happier, squads iterate and ship new features dramatically faster, and security is better. Instead of squads spending weeks and months updating libraries using inconsistent processes, the internal and external software libraries supporting 2,600 components are automatically updated every day. And updates to the internal service framework used by backend services take less than a week; previously, those updates took several months.

With fleet management, updating Spotify’s service framework takes 7 days rather than 200.

More than 95% of Spotify’s developers say software quality has improved with fleet management. That’s because faster updates translate into healthier components with up-to-date internal and external libraries, frameworks, code improvements, bug fixes, and security patches. When the Log4j vulnerability emerged, developers rolled out the security fix to fleet-managed backend services in just nine hours. 
Manually deploying the fix to the remaining 20% of unmanaged services took eight days. After merging the initial fix for Log4j, 80% of components were patched on the first day of the rollout.

Discovering new uses for existing products

As Spotify continues to expand its fleet management model, the company is looking to take on more complex changes to remove more toil and improve developer experience at Spotify — and at other organizations. Gustavsson explains, “We’re trying to figure out: do we want to externalize some of this stuff?”

Spotify has externalized many tools over the years, with Backstage being the most successful example. Today, more than 2,200 global adopters have built developer portals on the Backstage framework to improve their own developer experience and productivity. In December 2022, Spotify also released a commercial plugin bundle subscription for adopters to enhance open-source Backstage, and Spotify plans to release more plugins to the bundle over time.

“Some of the infrastructure that we built for fleet management certainly doesn’t need to be unique to Spotify, so we want to figure out what parts we can potentially open source or commercialize,” says Gustavsson. “Instead of every company building that portal on their own and building their own plugins, how about we figure out one shared framework in which we can all target our requirements?”

You can read more about fleet management at Spotify on the Spotify Engineering blog:

- Part 1: Spotify’s Shift to a Fleet-First Mindset
- Part 2: The Path to Declarative Infrastructure
- Part 3: Fleet-wide Refactoring

Or listen to NerdOut@Spotify, the official tech podcast from Spotify R&D:

- Episode 12: Fleet First
- Episode 22: Declarative Infra and Beyond
DevOps & SRE
Last month
In the face of rapid digital transformation, a positive organizational culture and user-centric design are the backbone of successful software delivery. And while artificial intelligence (AI) is at the center of so many contemporary technical conversations, the impact of AI development tools on teams is still in its infancy.

These are just some of the findings from the 2023 Accelerate State of DevOps Report, the annual report from Google Cloud’s DevOps Research and Assessment (DORA) team. For nine years, the State of DevOps survey has assembled data from more than 36,000 professionals worldwide, making it the largest and longest-running research of its kind. This year, we took a deep dive into how high performers bake technical, process, and cultural capabilities into their development practices to drive success. Specifically, we explored three key outcomes of having a DevOps practice, and the capabilities that contribute to achieving them:

- Organizational performance: generating value for customers and community
- Team performance: empowering teams to innovate and collaborate
- Employee well-being: reducing burnout and increasing satisfaction and productivity

This year, we were working with a particularly robust data set: the total number of organic respondents increased by 3.6x compared to last year, allowing us to perform a deeper analysis of the relationship between ways of working and outcomes. Thank you to everyone who took the survey this year!

Measuring software delivery performance

Our research shows that an organization’s level of software delivery performance predicts overall performance, team performance, and employee well-being. 
In turn, we use the following measures to understand the throughput and stability of software changes:

- Change lead time: how long it takes a code change to go from committed to deployed
- Deployment frequency: how frequently changes are pushed to production
- Change failure rate: how frequently a software deployment introduces a failure that requires immediate intervention
- Failed deployment recovery time: how long it takes to recover from a failed deployment

Our analysis revealed four performance levels, including the return of the Elite performance level, which we did not detect in last year’s cohort. Elite performers around the world are able to achieve both throughput and stability.

Five key insights

There are several key takeaways for teams that want to understand how to improve their software delivery capabilities. Here are some of the key insights from this year’s report:

1. Establish a healthy culture

Culture is foundational to building technical capabilities, igniting technical performance, reaching organizational performance goals, and helping employees be successful. A healthy culture can help reduce burnout, increase productivity, and increase job satisfaction. Teams with generative cultures, composed of people who felt included and like they belonged on their team, have 30% higher organizational performance than organizations without a generative culture.

The aspects of culture that can improve employee well-being

2. Build with users in mind

Teams can deploy as fast and successfully as they’d like, but without the user in mind, it might be for naught. Our research shows that a user-centric approach to building applications and services is one of the strongest predictors of overall organizational performance. In fact, building with the user in mind appears to inform and drive improvements across all of the technical, process, and cultural capabilities we explore in the DORA research. 
Teams that focus on the user have 40% higher organizational performance than teams that don’t.

3. Amplify technical capabilities with quality documentation

High-quality documentation amplifies the impact that DevOps technical capabilities (for example, continuous integration and trunk-based development) have on organizational performance. This means that quality documentation not only helps establish these technical capabilities, but helps them matter. For example, SRE practices are estimated to have 1.4x more impact on organizational performance when high-quality documentation is in place. Overall, high-quality documentation leads to 25% higher team performance relative to low-quality documentation.

4. Distribute work fairly

People who identify as underrepresented, and women or those who chose to self-describe their gender, have higher levels of burnout. There are likely multiple systemic and environmental factors that cause this. Unsurprisingly, we find that respondents who take on more repetitive work are more likely to experience higher levels of burnout, and members of underrepresented groups are more likely to take on more repetitive work:

- Underrepresented respondents report 24% more burnout than those who are not underrepresented.
- Underrepresented respondents do 29% more repetitive work than those who are not underrepresented.
- Women or those who self-described their gender do 40% more repetitive work than men.

5. Increase infrastructure flexibility with cloud

Teams can get the most value out of the cloud by leveraging its characteristics, like rapid elasticity and on-demand self-service. These characteristics predict a more flexible infrastructure. Using a public cloud, for example, leads to a 22% increase in infrastructure flexibility relative to not using the cloud. 
This flexibility, in turn, leads to teams with 30% higher organizational performance than those with inflexible infrastructures.

AI: we’re just getting started

There is a lot of enthusiasm about the potential of AI development tools. We saw this in this year’s results — in fact, a majority of respondents are incorporating at least some AI into the tasks we included in our survey. But we anticipate that it will take some time for AI-powered tools to come into widespread and coordinated use in the industry. We are very interested in seeing how adoption grows over time, and the impact that growth will have on performance measures and outcomes that are important to organizations. The report shows where we are seeing the adoption of AI tools today.

Applying insights from DORA in your context

The key takeaway from DORA’s research is that high performance requires continuous improvement. Regularly measure outcomes across your organization, teams, and employees. Identify areas for optimization and make incremental changes to dial up performance.

Don’t let these insights sit on a shelf — put them into action. Contextualize the findings based on your team’s current practices and pain points. Have open conversations about your bottlenecks. Comparing your metrics year-over-year is more meaningful than comparing yourself to other companies. Sustainable success comes from repeatedly finding and fixing your weaknesses. DORA’s framework can help you determine which capabilities to focus on next for the biggest performance boost.

We hope the Accelerate State of DevOps Report helps organizations of all sizes, industries, and regions improve their DevOps capabilities, and we look forward to hearing your thoughts and feedback. 
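For teams that want to start measuring, the four delivery measures defined earlier can be approximated directly from a team’s own deployment records. This is a minimal sketch under an invented record schema; the field names and sample data are hypothetical, not anything DORA prescribes:

```python
# Hypothetical sketch: deriving the four delivery measures from a list
# of deployment records. The record format and data are invented for
# illustration; DORA defines the measures, not this schema.
from datetime import datetime, timedelta
from statistics import median

deploys = [
    {"committed": datetime(2023, 9, 1, 9),  "deployed": datetime(2023, 9, 1, 15),
     "failed": False, "restored": None},
    {"committed": datetime(2023, 9, 2, 10), "deployed": datetime(2023, 9, 2, 12),
     "failed": True,  "restored": datetime(2023, 9, 2, 13)},
    {"committed": datetime(2023, 9, 4, 8),  "deployed": datetime(2023, 9, 4, 9),
     "failed": False, "restored": None},
]
window_days = 7

# Deployment frequency: deployments per day over the window
deployment_frequency = len(deploys) / window_days

# Change lead time: median commit-to-deploy duration
lead_time = median(d["deployed"] - d["committed"] for d in deploys)

# Change failure rate: share of deployments needing immediate intervention
failures = [d for d in deploys if d["failed"]]
change_failure_rate = len(failures) / len(deploys)

# Failed deployment recovery time: median deploy-to-restore duration
recovery_time = median(d["restored"] - d["deployed"] for d in failures)

print(lead_time, change_failure_rate, recovery_time)
```

Tracking these numbers over time, rather than against other companies, matches the report’s advice on year-over-year comparison.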
To learn more about the report and implementing DevOps with Google Cloud:

- Download the full report.
- Measure your team’s software delivery performance in less than a minute using DORA’s DevOps Quick Check.
- Model your organization around the DevOps capabilities of elite-performing teams.
- Share your experiences, learn from others, and get inspiration by joining the DORA community.