
Real-Time Events at Paytm

Dec 3, 2025

Note: This post was originally published on the Paytm Labs blog on February 3, 2021. The original site has since been taken down, but the content is preserved via the Internet Archive. I am reproducing it here.


Property + Event = Trigger Action

I am currently the lead engineer on the real-time events platform at Paytm Labs. In this post, I describe the incredibly rewarding work done on highly scalable systems that serve our users and provide them with real-time information and actions that help them on a daily basis. I am also thankful to my colleagues Iraj and Arjun, who have been instrumental in building these systems.

At Paytm, we have 450+ million users performing billions of actions across our apps. With a multitude of business verticals (Travel, Payments, Mall, etc.), we aim to provide relevant information, offers, or reminders to our users based on their actions across each vertical. For example, if a user has added an item to their cart but has not purchased it, we want to nudge them towards buying it. Or, if a user has a new upcoming bill, we want to show them an icon on their app homepage to remind them about it.

Historically, we collected these user actions and attributes through our batch data pipeline, which carried the risk of delays and of not having the most up-to-date information about the user. We would also create campaigns responsible for sending reminder notifications or showing a banner/icon on the app based on this delayed data. Ultimately, we found these were not relevant to the user and did not provide real value.

Once we recognized this problem, we began building the real-time event architecture to give us the capability of serving relevant information to users in real time, with the latest information we have for them. This has allowed us to improve personalization and the overall app user experience.

Challenge

Across our apps, we already had the ability to generate real-time events from users' actions, which were used for a variety of purposes. The challenge was to standardize the collection and usage of these real-time events, as each vertical's product team had its own way of generating and using them. The structure of these events also differed between teams, since you cannot expect events to have the same fields across different domains. For example, Paytm's Travel team would have fields related to flights, hotels, etc., while the Mall team would have fields related to online products, shops, etc.

We set out to build a system where any product team at Paytm could send their existing events without the burden of creating new standardized events, while giving us the ability to tap into their schemas without interfering with each team's business domain.

Approach

Our approach started with building the system in a modular way, where we could essentially plug and play components. This was not a deliberate decision at the start, but we began to see its advantages as we built the foundations.

One key aspect of event-driven architecture is that we treat events as immutable pieces of data that flow through the system and are never modified or altered at a fundamental level.

The closest analogy I can come up with to describe our event architecture is a train starting from an origin station and passing through multiple routes to reach its destination. While the train is traveling, its carriages or compartments can be replaced or changed, and the train itself can be routed to a different destination depending on the use case.

Our event architecture is built like a train going to different places

Another aspect of our system is that it acts as a central consumer of these events from different teams at Paytm. This means we transform all of our events to have standard properties that can be managed in one place, as opposed to each team managing them separately.

These standardized events are then sent to our campaign service which is responsible for running campaigns that deliver real-time information to our users based on these events.

Architecture & Platform

Our architecture involves consuming real-time events with a defined schema at a central location. We define metadata for these events and enrich them with any additional fields our use cases need. After enrichment is complete, we apply standardized transformations so that every event has common properties and can be consumed by the campaign service. This external campaign service simply consumes the transformed events and forwards them to our notification or banner service, which in turn nudges users by sending a notification or showing an icon in the app.

We created one of the largest Kafka clusters at Paytm to power our real-time events architecture, and we built our platform on top of it with services written in Scala and Java. The platform is hosted on AWS and is able to receive events from our hundreds of millions of users.

We decided to use the Avro format for our Kafka messages along with the Confluent schema registry (a service that stores and retrieves Avro schemas for Kafka), which gave us two distinct advantages (a minimal producer sketch follows the list below):

  1. It allows us to serialize and deserialize Kafka messages against a schema, thereby ensuring consistent data downstream.
  2. It helps compress the size of each message (Avro's compact binary encoding, plus only a schema ID travels with the payload instead of the full schema), thereby reducing the load on and the cost of our system.
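
To make this concrete, here is a minimal sketch of a producer that publishes Avro records through the schema registry. It assumes the standard Kafka Java client and Confluent's KafkaAvroSerializer; the broker address, registry URL, topic name, and toy schema are placeholders, not our production values.

```scala
import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object AvroProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092") // placeholder broker
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    // Confluent's Avro serializer registers/looks up the schema in the registry
    // and embeds only a schema ID in each message instead of the full schema.
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "io.confluent.kafka.serializers.KafkaAvroSerializer")
    props.put("schema.registry.url", "http://schema-registry:8081") // placeholder URL

    // A toy schema; real event schemas are owned by each product team.
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"CartEvent","fields":[
        |  {"name":"userId","type":"string"},
        |  {"name":"eventTime","type":"long"}
        |]}""".stripMargin)

    val record: GenericRecord = new GenericData.Record(schema)
    record.put("userId", "u-123")
    record.put("eventTime", System.currentTimeMillis())

    val producer = new KafkaProducer[String, GenericRecord](props)
    producer.send(new ProducerRecord("cart-events-avro", "u-123", record)) // hypothetical topic
    producer.close()
  }
}
```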

There are multiple ways we consume events from different teams. Our main approach for the majority of teams is a service that consumes messages from their Kafka cluster (usually in JSON format) and converts them to Avro using the schema defined in the Confluent schema registry. This service is also responsible for filtering messages and flattening nested structures, which helps us consume events in a standard format.
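
As a rough illustration of the flattening step, the sketch below works on a parsed JSON payload represented as a plain Scala map. The field names and the filter are made up for illustration; our production converter works against team-specific schemas.

```scala
object EventFlattener {
  /** Flattens nested maps so that, for example,
    * Map("order" -> Map("id" -> 1)) becomes Map("order_id" -> 1). */
  def flatten(json: Map[String, Any], prefix: String = ""): Map[String, Any] =
    json.flatMap {
      case (key, nested: Map[_, _]) =>
        flatten(nested.asInstanceOf[Map[String, Any]], prefix + key + "_")
      case (key, value) =>
        Map(prefix + key -> value)
    }

  // Hypothetical filter: only keep events from verticals we have onboarded.
  def keep(event: Map[String, Any]): Boolean =
    event.get("vertical").exists(Set[Any]("travel", "mall"))
}
```

After flattening and filtering, the resulting flat record is serialized with the Avro schema registered for that topic, as described above.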

There are a couple of other ways we consume events from teams, including a Kafka REST API that teams can call to produce Kafka messages, and a service that reads binary logs from a database and sends them to our Kafka cluster.

Once the conversion to Avro format is complete, we send these event messages to new topics in our Kafka cluster.

After these events are onboarded into our cluster, we add the event's metadata (for example, the event name, the Kafka topic name, and the names of the fields that contain the userId or the event timestamp) to our metastore service. This metadata is used to build a model of our events for use in the campaign service.
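
A hypothetical shape for one metastore entry might look like the following; the field names are illustrative, not the actual metastore schema.

```scala
/** Illustrative shape of a metastore entry for one onboarded event. */
final case class EventMetadata(
  eventName: String,      // e.g. "flight_ticket_booked"
  kafkaTopic: String,     // topic where this event currently lives
  userIdField: String,    // which payload field holds the userId
  timestampField: String  // which payload field holds the event timestamp
)
```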

We also enrich our events with different fields from different sources based on our requirements. This involves different services that enrich events from an assortment of sources, such as AWS S3 buckets, an API service owned by another team, or even another event, which gives us the flexibility to enrich events from wherever the data lives. Some of the tools we use for enrichment include a Cassandra database as an event store and Apache Flink for event processing.
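
One way to picture how these enrichment services stay interchangeable is a small common interface like the sketch below. The trait and the bill-lookup example are assumptions for illustration, not our production code.

```scala
import scala.concurrent.Future

/** A common interface lets us swap enrichment sources (S3 lookups, another team's API,
  * a Cassandra event store, another event) without touching the rest of the pipeline. */
trait Enricher {
  def enrich(event: Map[String, Any]): Future[Map[String, Any]]
}

/** Hypothetical example: attach the user's latest bill amount, looked up by userId. */
final class BillLookupEnricher(lookup: String => Future[Option[Long]]) extends Enricher {
  import scala.concurrent.ExecutionContext.Implicits.global

  override def enrich(event: Map[String, Any]): Future[Map[String, Any]] =
    event.get("userId") match {
      case Some(userId: String) =>
        lookup(userId).map {
          case Some(amount) => event + ("billAmount" -> amount)
          case None         => event
        }
      case _ => Future.successful(event)
    }
}
```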

Once the enrichment is complete, we pass these events through an adaptor service that applies our standard transformations and sends them to a new topic, ready to be consumed by an external service. Examples of these transformations include adding standard fields like userId, converting different date formats to a single epoch timestamp format, and adding an identifier for the event.
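
The sketch below illustrates what such an adaptor-style transformation could look like. It assumes the event is already a flattened map; the field names, date pattern, and identifier scheme are placeholders rather than our actual rules.

```scala
import java.time.format.DateTimeFormatter
import java.time.{LocalDateTime, ZoneOffset}
import java.util.UUID

/** Illustrative standard transformations: normalize a date string to epoch millis,
  * lift the userId into a standard top-level field, and attach an event identifier. */
object StandardTransformer {
  private val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss") // assumed format

  def toEpochMillis(raw: String): Long =
    LocalDateTime.parse(raw, dateFormat).toInstant(ZoneOffset.UTC).toEpochMilli

  def transform(event: Map[String, Any],
                userIdField: String,
                timestampField: String): Map[String, Any] =
    event +
      ("userId"    -> event(userIdField)) +
      ("eventTime" -> toEpochMillis(event(timestampField).toString)) +
      ("eventId"   -> UUID.randomUUID().toString)
}
```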

Plug & Play

At this point, you can start to see a pattern: a topic acts as the output of one service and, at the same time, as the input to another. This gives us the flexibility to use these services independently. We can essentially plug and play them to suit our needs, or even remove them when we don't need them. This maps back to the train analogy, where carriages can be changed or replaced mid-journey and the train itself can be rerouted to a different destination.

We came up with a simple way to denote these different stages of an event by using specific naming conventions for our Kafka topic names. For example, we use prefixes to denote whether a topic is ready for consumption by downstream services, or a suffix to denote whether the event has already been enriched. We store the topic name for each stage in our metastore service. An improvement for the future is to automate and standardize these naming conventions.
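
As a toy illustration of the convention, stage-specific topic names could be derived with helpers like these; the actual prefixes and suffixes we use are internal, so the ones below are made up.

```scala
/** Illustrative stage-based topic naming helpers (prefixes/suffixes are invented). */
object TopicNames {
  def rawTopic(event: String): String        = s"raw_$event"
  def enrichedTopic(event: String): String   = s"raw_${event}_enriched"
  def consumableTopic(event: String): String = s"ready_${event}_enriched"
}
```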

Auditing & Backfilling Events

The other crucial aspect of our platform is auditing the flow of these events. Our business wants to know what happened to a particular event from a user: whether it was received by our system and whether it was processed correctly. We use Secor, a service created by Pinterest, to ship Kafka messages to protected Amazon S3 buckets that serve as an audit store for these events. This gives us a lot of flexibility in querying historical event data easily and efficiently. It also serves as an excellent mechanism for backfilling events in case downstream services are unavailable.
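
For a sense of how a backfill might work, here is a rough sketch that lists audited objects from an S3 bucket and replays their contents onto a Kafka topic. It assumes the AWS SDK for Java (v1) and the Kafka Java client; the bucket, prefix, and topic names are placeholders, and a real backfill would also need to handle Secor's output file format, batching, and failure recovery.

```scala
import java.util.Properties
import scala.io.Source
import scala.jdk.CollectionConverters._
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}

object BackfillFromS3 {
  def main(args: Array[String]): Unit = {
    val s3     = AmazonS3ClientBuilder.defaultClient()
    val bucket = "events-audit-bucket"      // placeholder audit bucket
    val prefix = "secor/cart-events-avro/"  // placeholder path written by Secor

    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Read each audited object line by line and replay it onto a replay topic.
    for (summary <- s3.listObjectsV2(bucket, prefix).getObjectSummaries.asScala) {
      val body = Source.fromInputStream(s3.getObject(bucket, summary.getKey).getObjectContent)
      body.getLines().foreach { line =>
        producer.send(new ProducerRecord[String, String]("cart-events-replay", line))
      }
      body.close()
    }
    producer.close()
  }
}
```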

Conclusion

Hopefully, this gives you an overall view of our current event architecture and how we are able to deliver real-time events across Paytm product teams in a consistent, reliable, and efficient manner. While this has worked well for us so far, scaling to a billion rich events per day, we are continuously finding ways to improve upon what we have built.


Follow-up Notes (2025)

I loved this project because it was my first time dealing with real scale, and it was being used by my friends and family! Personally, this is one of the best motivations I get while working in software engineering: seeing your work being used by others. It doesn't have to be someone you know, just anyone who appreciates your work and spends their time with it.

The scale was massive: 450 million users and 40k RPS on the app. I had to think through some gnarly issues, like devising a compaction strategy for Cassandra, setting up binlog consumption into Kafka for a massive database, and onboarding different teams onto our platform with a consistent schema and experience.

I grew a lot from working on this project, and it made me a much more confident engineer in the end.