How to calculate subsequent records count using a deep-to-wide transpose and custom foreign key techniques in CRM Analytics Data Prep / Data Recipe

Ramdosk
14 min read · Sep 21, 2024


This subsequent-records aggregation is a practical, real-world example.

Business Use case:

The client needs to know how many activities the user performs each time they log in to their external sites in order to enhance the user experience and customer journey.

Sample Dataset

Problem Statement:

The client wants to see the data as shown below: each user login record should carry the cumulative count of its subsequent event type (activity) records.

Desired Output

The challenge is that each user may have several login sessions, and each session may contain several activities, whether within a single day or over an extended period of time. To determine the number of subsequent activities for each login session, we have to traverse and compare each login datetime within the user subgroup. After aggregation, we must join the aggregated column back to only the login records. This is difficult because a data recipe gives us no straightforward way to traverse every login datetime and join to specific records.

Solution:

We will divide the solution into two parts:

  1. Transpose the Activity Datetime rows into columns so we can traverse and compare them.
  2. Create a custom foreign key to aggregate the # of Activity field and join it back to only the login records on the main data stream.

Step 1: Get the max login records across users

First, we have to find the maximum number of login records any user has, so that we know how many transposed fields to create.

Open the source dataset and filter by event type equals “Login”.

Once we have filtered on the Login event type, group the records by the “User Id” field.

Explanation: In our sample records, the maximum number of login records a user has is 3. So we have to create three custom calculated fields for transposing the login datetime field.

Consideration:

Since the number of login records may rise over time, it is preferable to add three or four extra computed fields for this kind of problem. That way, we won’t need to restructure the recipe again, at least not right away, if the record count increases.

We are converting rows of data into columns so that each record can be compared and bucketed under the appropriate login record within the user subgroup, so this does require some additional effort.

Step 2: Load the data into a Recipe

Step 3: Computing the Epoch/UNIX timestamp field

If the source data doesn’t have an epoch timestamp field, create it using the function below; we are going to use it for the comparison.

to_unix_timestamp(timestamp,format(optional))
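For example, assuming the source datetime column is named Activity_Datetime (a hypothetical name for illustration), the computed epoch column, referred to as Epoch_time in the formulas later in this article, could be as simple as:

to_unix_timestamp(Activity_Datetime)

The optional format argument only needs to be supplied if the timestamp’s pattern has to be described explicitly.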

Step 4: Computing custom foreign key

As mentioned previously, the business wants to view the total count of all subsequent activities for each login record. So, by combining “User Id” and “Activity Epoch Timestamp”, we generate a foreign key column that will later be used to join the # of Activity column from an aggregated stream.
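As a minimal sketch, assuming the user field is named User_Id and the epoch field from the previous step is named Epoch_time (the same names used in the Step 8 formula), the foreign key could be computed as:

concat(User_Id,'-',string(Epoch_time))

On a login row, Epoch_time is that login’s own timestamp, so the resulting key should line up with the session key that the Step 8 case statement builds for the activities belonging to that login.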

Step 5: Creating Login Event Type streams

As we now have the necessary computed columns, we must convert the datetime/epoch timestamp values of the Login event type rows into columns. We will establish a separate stream from the main stream by adding a Filter node that includes only the records related to login events.

After the Transform node, let’s add the Filter node to keep only Login event type records.

Filter Node
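The filter condition itself is a simple equality check. Assuming the event type column is named Type, as it is referenced in the Step 8 formula, it would look like:

Type = 'Login'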

Step 6: Transposing Login Epoch Timestamp fields rows into columns using Window functions

Window functions are powerful: they allow you to perform calculations across rows within a specified window or group of rows.

Approach for getting 1st Login Epoch Timestamp field

  • First, the window function partitions the data rows by ‘User Id’ and orders ‘Epoch_time’ in ascending order within each partition.
  • The current() function is then applied to the ordered ‘Epoch_time’ column within each partition to return the current row’s value of “Epoch time”.
  • The resulting 1st Login Date field is of Number data type.
1st Login Date Window Function

Formula:

current(Epoch_time)

Output:

The output contains the current value of the Login Epoch Timestamp for each user, based on the ascending order of the Login Epoch Timestamp.

Approach for getting 2nd Login Epoch Timestamp field

  • First, the window function partitions the data rows by ‘User Id’ and orders ‘Epoch_time’ in ascending order within each partition.
  • The lead() function is then applied to the ordered ‘1st Login Date’ column within each partition to return the value from the row after the current one. Null is returned for rows that have no subsequent row.
  • The resulting 2nd Login Date field is of Number data type.
2nd Login Date Window Function

Formula:

lead(First_Login_Date)

Output:

The output contains the next row’s Login Epoch Timestamp for each user, based on the order of the ‘1st Login Date’ field.

Explanation:

Example user: 005B0000008Hht4IAC

The user above has three login records, and we obtained the first login epoch timestamp using the current-row window function. Since we now need the second login epoch timestamp, we can obtain it by applying the lead function to the 1st Login Date field. Processing the first record (1470046545) in the partition returns the value of the next row, which is 1470053445. Likewise, processing the second row (1470053445) in the partition returns 1470491168. Since the user 005B0000008Hht4IAC has no further rows after the third row, processing the third record (1470491168) in the partition returns “Null”.

2nd Login Date Windowfunction output table

Approach for getting 3rd Login Epoch Timestamp field

  • First, the window function partitions the data rows by ‘User Id’ and orders ‘Epoch_time’ in ascending order within each partition.
  • The lead() function is then applied to the ordered ‘2nd Login Date’ column within each partition to return the value from the row after the current one. Null is returned for rows that have no subsequent row.
  • The resulting 3rd Login Date field is of Number data type.
3rd Login Date Window Function

Formula:

lead(Second_Login_Date)

Output:

The output contains the next row’s Login Epoch Timestamp for each user, based on the order of the ‘2nd Login Date’ field.

Explanation:

Example user: 005B0000008Hht4IAC

The user above has three login records, and we have already obtained the first and second login epoch timestamps. Since we now need the third login epoch timestamp, we can obtain it by applying the lead function to the 2nd Login Date field. Processing the first record (1470053445) in the partition returns the value of the next row, which is 1470491168. Processing the second row (1470491168) returns Null, because the value in the next row is Null. Since the user 005B0000008Hht4IAC has no further rows after the third row, processing the third record (Null) in the partition also returns “Null”.

3rd Login Date Window function output table

Approach for getting 4th & 5th Login Epoch Timestamp field

We only have three levels of login records, as stated in the Consideration section above, but we have added two more fields with the future in mind. Because we don’t currently have any data for them, they will only return “Null” values.
To determine the 4th Login Date, use the same formula as in the previous step and apply the lead function to the 3rd Login Date field. In the same way, apply the lead function to the 4th Login Date field to compute the 5th Login Date field.

4th Login Date Window Function

Formula:

lead(Third_Login_Date)
5th Login Date Window function

Formula:

lead(Fourth_Login_Date)

Output:

The final window function output will look like this.

Approach for getting all the Login Epoch Timestamp fields per user

We have now converted the rows of the Login epoch timestamp into columns. If you look at the table above, you will see that only the first record in each partition (user level) has all the values, because we applied next-row/current-row functions for all the fields; the remaining records in the partition do not. Hence, it is necessary to keep only each partition’s initial record. So, using window functions such as current and first_value, we will create a flag field to identify the initial record.

First record identifier flag field

Formula

case
when current(Transactional_Id) = first_value(Transactional_Id)
then 'Yes'
else 'No'
end

If you look at the table above, you will see that the Is First Row identifier flag is also present. So, add a Filter node after the transposing Transform node and filter on Is First Row equals “Yes” to keep only the first row for each partition.
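Assuming the flag column created above is named Is_First_Row (a hypothetical API name for illustration), the filter condition would simply be:

Is_First_Row = 'Yes'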

Result:

We now have the required Login Date columns for each subgroup.

window function stream final output

Step 7: Joining the transposed Login Datetime fields onto the Main Stream

As there is only one entry for each subgroup in the Login event type stream, we can use a Lookup join to bring the transposed data onto the main stream for additional analysis.

The join key will be User Id on both data streams, and select only the transposed columns in the Columns section.

Filtering First Records for each Partition

Step 8: Subsequent records aggregation identifier Transform

Every event record will now contain the related login epoch timestamps. The next step is to compare each event record to the login session timestamps to determine under which login session the activity happened, and to mark each record with the appropriate identifier for aggregation. This can be done using one of the two techniques below.

  1. custom Foreign key technique
  2. Flag based technique

Detailed explanation of the Custom Foreign key technique

Explanation:

Using a case statement, we traverse the different login timestamp ranges to find the range in which each event occurred. We then concatenate the corresponding User Id with that login timestamp. Afterwards, we use this unique foreign key in the Group Rows section of the Aggregate node to obtain the total number of activities for every user login session. Since we created the same foreign key on the main stream in Step 4, we can then easily join this back to the main data stream.
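As a rough sketch, here is the idea with just the first two transposed login columns, ignoring for now the zero/null upper-bound edge case that is discussed below and handled by the full nested formula in Step 8 (the field names match those used later in this article):

case
when Type != 'Login' and Epoch_time > "TransposedDate.First_Login_Date" and Epoch_time < "TransposedDate.Second_Login_Date"
then concat(User_Id,'-',string("TransposedDate.First_Login_Date"))
when Type != 'Login' and Epoch_time > "TransposedDate.Second_Login_Date" and Epoch_time < "TransposedDate.Third_Login_Date"
then concat(User_Id,'-',string("TransposedDate.Second_Login_Date"))
end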

Pros:

  1. Optimized technique, and performance will be better.

Detailed explanation of the Flag based technique

Using a case statement, we traverse the different login timestamp ranges to find the range in which each event occurred. We then flag the records, for example as “1st Login session activities”, “2nd Login session activities”, and so on. Afterwards, we use this flag in the Group Rows section of the Aggregate node to obtain the total number of activities for every user login session. This is how the aggregated data will look after that.
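As a rough sketch, assuming the same transposed column names as in Step 8 and again ignoring the zero/null upper-bound edge case discussed later, the flag formula might look like this:

case
when Type != 'Login' and Epoch_time > "TransposedDate.First_Login_Date" and Epoch_time < "TransposedDate.Second_Login_Date"
then '1st Login session activities'
when Type != 'Login' and Epoch_time > "TransposedDate.Second_Login_Date" and Epoch_time < "TransposedDate.Third_Login_Date"
then '2nd Login session activities'
end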

However, how can we join them to the relevant login records on the main stream?

For that, we would have to create a Join node for each transposed column after the Aggregate node, with the flag as the join key on both the left and right streams. In our example there are five transposed columns, so we would need five separate joins to bring in the login epoch timestamps. We would then concatenate the corresponding User Id with the login timestamp. Since we created the same foreign key on the main stream in Step 4, we could then join this to the main data stream easily.

Con:

  1. Additional Join nodes need to be created, which will have an impact on performance.

The custom foreign key technique is preferable in terms of performance, so we’ll look at its execution approach step by step.

Problem statement:

If you look at the data below, there is no guarantee that each login epoch timestamp column will have a proper timestamp value, because that depends on how many login records the user subgroup has. So we cannot simply pass the login epoch timestamp ranges to a case statement to find out under which login session the subsequent activities happened.

sample data

Explanation: In our sample records, the user ‘005B0000008Hht4IAC’ below has 3 login records and their subsequent activity records.

When we apply the first-to-second login date range “1470046545–1470053445” to the records, all the activity records whose Epoch Timestamp value is greater than the 1st Login Date field value and less than the 2nd Login Date field value are technically 1st Login Session subsequent activities.

1st Login Session Subsequent Activities

Likewise, when we apply the second-to-third login date range “1470053445–1470491168” to the records, all the activity records whose Epoch Timestamp value is greater than the 2nd Login Date field value and less than the 3rd Login Date field value are technically 2nd Login Session subsequent activities.

2nd Login Session Subsequent Activities

Likewise, when we apply the third-to-fourth login date range “1470491168–0” to the records, all the activity records whose Epoch Timestamp value is greater than the 3rd Login Date field value and less than the 4th Login Date field value would be 3rd Login Session subsequent activities. In this case, however, the second condition fails, because the 4th Login Date field value, which is zero, is smaller than the 3rd login epoch timestamp. None of the records would therefore be bucketed under the 3rd login session activities, which is why we cannot use the epoch timestamp ranges directly in the case statement.

3rd Login Session Subsequent Activities

Note: The activities following the last login in each subgroup/partition will not be bucketed into their login session, since they cannot meet the second condition (the upper bound of the range), which is always 0.

If the subgroup contains only one login record, then its activities will not be included in the corresponding login session either.

So how can we solve the above problem?

We can address it with a nested case statement that follows the logic below.

Nested logic

Nested case statement formula:

case
when Type != 'Login' then
  case
  when Epoch_time > "TransposedDate.First_Login_Date" then
    case
    when ("TransposedDate.Second_Login_Date" = 0 OR "TransposedDate.Second_Login_Date" is null OR "TransposedDate.Second_Login_Date" = ' ')
    then concat(User_Id,'-',string("TransposedDate.First_Login_Date"))
    else
      case
      when Epoch_time < "TransposedDate.Second_Login_Date"
      then concat(User_Id,'-',string("TransposedDate.First_Login_Date"))
      else
        case
        when Epoch_time > "TransposedDate.Second_Login_Date" then
          case
          when ("TransposedDate.Third_Login_Date" = 0 OR "TransposedDate.Third_Login_Date" is null OR "TransposedDate.Third_Login_Date" = ' ')
          then
            case
            when "TransposedDate.Second_Login_Date" > "TransposedDate.First_Login_Date"
            then concat(User_Id,'-',string("TransposedDate.Second_Login_Date"))
            end
          else
            case
            when Epoch_time < "TransposedDate.Third_Login_Date"
            then concat(User_Id,'-',string("TransposedDate.Second_Login_Date"))
            else
              case
              when Epoch_time > "TransposedDate.Third_Login_Date" then
                case
                when ("TransposedDate.Fourth_Login_Date" = 0 OR "TransposedDate.Fourth_Login_Date" is null OR "TransposedDate.Fourth_Login_Date" = ' ')
                then
                  case
                  when "TransposedDate.Third_Login_Date" > "TransposedDate.Second_Login_Date"
                  then concat(User_Id,'-',string("TransposedDate.Third_Login_Date"))
                  end
                else
                  case
                  when Epoch_time < "TransposedDate.Fourth_Login_Date"
                  then concat(User_Id,'-',string("TransposedDate.Third_Login_Date"))
                  else
                    case
                    when Epoch_time > "TransposedDate.Fourth_Login_Date" then
                      case
                      when ("TransposedDate.Fifth_Login_Date" = 0 OR "TransposedDate.Fifth_Login_Date" is null OR "TransposedDate.Fifth_Login_Date" = ' ')
                      then
                        case
                        when "TransposedDate.Fourth_Login_Date" > "TransposedDate.Third_Login_Date"
                        then concat(User_Id,'-',string("TransposedDate.Fourth_Login_Date"))
                        end
                      else
                        case
                        when Epoch_time < "TransposedDate.Fifth_Login_Date"
                        then concat(User_Id,'-',string("TransposedDate.Fourth_Login_Date"))
                        end
                      end
                    end
                  end
                end
              end
            end
          end
        end
      end
    end
  end
end

The nested case statement above produces a custom foreign key by concatenating the User Id with the corresponding login epoch timestamp value, which is then used for aggregation and identification. For example, an activity that falls under the first login session of user 005B0000008Hht4IAC would get the key 005B0000008Hht4IAC-1470046545.

Let’s look at the “Custom Foreign Key1” column below.

This is how the nested case statement output will look. You’ll see that we have deliberately chosen to ignore the login records, since we don’t want them counted in the subsequent records count column. If you look at the nested case statement formula above, the first condition is that the Type field is not equal to “Login”.

Custom Foreign Key 1 Output

Step 9: Subsequent Records Aggregate Node

The stream now has a column for the login session identifier, so it’s time to do the aggregation. Let’s create a separate data stream from the “Subsequent records aggregation identifier Transform” node by adding an Aggregate node. In the Aggregate node, under the Aggregates section, select Count of Rows, as we need to count the number of subsequent records for each login session. Select “Custom Foreign Key1” under Group Rows and click Apply.

Add a Transform node and use the edit attributes transform on the aggregated column if you wish to modify its Label and API Name.

Aggregated column Naming convention transform

Step 10: Subsequent Cumulative Count onto Login Records Join

The cumulative count of each login session’s subsequent records is now available in the aggregated data stream. Let’s join this stream to the main stream, using Custom Foreign Key1 on the right stream and Custom Foreign Key on the main stream as the join keys, since we need to bring this aggregated column onto the main data stream.

subsequent cumulative count onto Login Records

Step 11: Registering the Output Node

Phew! All of the necessary data transformations are complete, and the primary data stream contains everything we need. So, after the “subsequent cumulative count onto Login Records” Join node, it’s time to add the Output node and run the recipe.

Output Node

Final Results:

Output

Happy Learning !!!!!

Follow me on LinkedIn: linkedin.com/in/ramdoss-k-5157566a
