Chapter 3 Creating event logs

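The data used in process mining is called an event log. An event log is a dataset that records the sequence of events occurring in a system or process. In fields such as process mining and business process management, event logs are a key source of data about system behaviour. Each event describes a specific action or state change that occurred in the system.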

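An event usually contains the following elements:

  1. Case ID: the identifier of the case (process instance) the event belongs to. In a business process, a case can be an order, a service request, or another business entity.

  2. Timestamp: when the event occurred, recorded as a date and time.

  3. Activity: the activity or operation the event corresponds to. In a business process, an activity can be purchasing an item, submitting a form, and so on.

  4. Resource: the entity that performed the activity, which may be a person, a machine, or something else.

  5. Additional information: any other data related to the event, such as input parameters or output results.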

A typical event log looks like this:

Case ID   |  Timestamp            |  Activity         |  Resource
----------------------------------------------------------------
Order123  |  2023-01-01 08:00:00  |  Login            |  User1
Order123  |  2023-01-01 08:05:00  |  Add to Cart      |  User1
Order123  |  2023-01-01 08:10:00  |  Purchase Item A  |  User1
Service1  |  2023-01-02 10:30:00  |  Process Request  |  System
Service1  |  2023-01-02 11:00:00  |  Receive Payment  |  System
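As a rough sketch, a log with the same four columns can be held in a plain data frame before it is handed to any process mining package. The rows below are hypothetical and entered out of order on purpose; the column names (`case_id`, `timestamp`, `activity`, `resource`) are our own choice, not something a package requires. The point is that timestamps should be parsed into a date-time type and events ordered by time within each case.

```r
# Hypothetical rows with the same columns as the table above,
# deliberately entered out of chronological order.
events <- data.frame(
  case_id   = c("Service1", "Order123", "Order123", "Service1", "Order123"),
  timestamp = c("2023-01-02 11:00:00", "2023-01-01 08:10:00",
                "2023-01-01 08:00:00", "2023-01-02 10:30:00",
                "2023-01-01 08:05:00"),
  activity  = c("Receive Payment", "Purchase Item A", "Login",
                "Process Request", "Add to Cart"),
  resource  = c("System", "User1", "User1", "System", "User1"),
  stringsAsFactors = FALSE
)

# Parse the text timestamps into date-times so they can be compared reliably.
events$timestamp <- as.POSIXct(events$timestamp, tz = "UTC")

# Order by case first, then by time within each case.
events <- events[order(events$case_id, events$timestamp), ]
rownames(events) <- NULL

print(events)
```

After sorting, each case reads top to bottom as the sequence of activities that happened to it.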

3.1 Case and trace

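  1. Trace:

A trace is the sequence of activities performed for one process instance (for example, one order), listed in the order in which they occurred. For example, the trace of order A might be purchase, payment, delivery, and the trace of order B might be purchase, cancel order.

  2. Case:

A case is a single process instance, such as one specific order, identified by its case ID. Every case has exactly one trace, but several cases can share the same trace, so a log never contains more traces than cases. For example, if 100 orders all go through purchase, payment, delivery, the log contains 100 cases but only one trace.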
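The relationship between the two counts can be sketched in base R. The example below uses a hypothetical three-order log (the column names and the `step` ordering column are assumptions made for illustration) and counts cases and distinct traces the same way the log summaries later in this chapter report them ("N traces, M cases").

```r
# Hypothetical log: three orders, two of which follow the same path.
events <- data.frame(
  case_id  = c("order_A", "order_A", "order_A",
               "order_B", "order_B",
               "order_C", "order_C", "order_C"),
  activity = c("Purchase", "Pay", "Deliver",
               "Purchase", "Cancel",
               "Purchase", "Pay", "Deliver"),
  step     = c(1, 2, 3, 1, 2, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Make sure events are in order within each case.
events <- events[order(events$case_id, events$step), ]

# Collapse the ordered activities of every case into one string.
trace_per_case <- tapply(events$activity, events$case_id, paste, collapse = ",")

n_cases  <- length(trace_per_case)          # one entry per case
n_traces <- length(unique(trace_per_case))  # distinct activity sequences

print(trace_per_case)
print(c(cases = n_cases, traces = n_traces))
```

Here orders A and C share the sequence Purchase, Pay, Deliver, so three cases yield only two distinct traces.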

Let's look at an example of creating an event log:

library(bupaverse)
## 
## .______    __    __  .______      ___   ____    ____  _______ .______          _______. _______
## |   _  \  |  |  |  | |   _  \    /   \  \   \  /   / |   ____||   _  \        /       ||   ____|
## |  |_)  | |  |  |  | |  |_)  |  /  ^  \  \   \/   /  |  |__   |  |_)  |      |   (----`|  |__
## |   _  <  |  |  |  | |   ___/  /  /_\  \  \      /   |   __|  |      /        \   \    |   __|
## |  |_)  | |  `--'  | |  |     /  _____  \  \    /    |  |____ |  |\  \----.----)   |   |  |____
## |______/   \______/  | _|    /__/     \__\  \__/     |_______|| _| `._____|_______/    |_______|
##                                                                                                 
## ── Attaching packages ─────────────────────────────────────── bupaverse 0.1.0 ──
## ✔ bupaR         0.5.3     ✔ processcheckR 0.1.4
## ✔ edeaR         0.9.1     ✔ processmapR   0.5.2
## ✔ eventdataR    0.3.1     
## ── Conflicts ────────────────────────────────────────── bupaverse_conflicts() ──
## ✖ bupaR::filter()          masks stats::filter()
## ✖ processmapR::frequency() masks stats::frequency()
## ✖ edeaR::setdiff()         masks base::setdiff()
## ✖ bupaR::timestamp()       masks utils::timestamp()
## ✖ processcheckR::xor()     masks base::xor()
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.2
## ✔ tibble  3.2.1     ✔ dplyr   1.1.3
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::contains() masks tidyr::contains(), processcheckR::contains()
## ✖ dplyr::filter()   masks bupaR::filter(), stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
library(bupaR)
library(lubridate)
## Loading required package: timechange
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
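# Build a toy data frame: one case ("A") with five activities.
data <- data.frame(
  case = rep("A", 5),
  activity_id = c("A", "B", "C", "D", "E"),
  activity_instance_id = 1:5,
  lifecycle_id = rep("complete", 5),
  # now() + ddays(5) is a single value, so it is recycled:
  # all five events get the same timestamp.
  timestamp = now() + ddays(5),
  resource = rep("resource 1", 5)
)

# Tell eventlog() which column plays which role.
Meventlog <- eventlog(data,
  case_id = "case",
  activity_id = "activity_id",
  activity_instance_id = "activity_instance_id",
  lifecycle_id = "lifecycle_id",
  timestamp = "timestamp",
  resource_id = "resource"
)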


Meventlog
## # Log of 5 events consisting of:
## 1 trace 
## 1 case 
## 5 instances of 5 activities 
## 1 resource 
## Events occurred from 2023-12-01 22:48:32 until 2023-12-01 22:48:32 
##  
## # Variables were mapped as follows:
## Case identifier:     case 
## Activity identifier:     activity_id 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle_id 
## 
## # A tibble: 5 × 7
##   case  activity_id activity_instance_id lifecycle_id timestamp          
##   <chr> <chr>                      <int> <chr>        <dttm>             
## 1 A     A                              1 complete     2023-12-01 22:48:32
## 2 A     B                              2 complete     2023-12-01 22:48:32
## 3 A     C                              3 complete     2023-12-01 22:48:32
## 4 A     D                              4 complete     2023-12-01 22:48:32
## 5 A     E                              5 complete     2023-12-01 22:48:32
## # ℹ 2 more variables: resource <chr>, .order <int>
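In the log above every event carries the same timestamp, because a single date-time value was recycled across all five rows. A variant with one distinct timestamp per event might look like the sketch below. The start time and the one-hour spacing are arbitrary assumptions, and the names `hourly_data` and `hourly_log` are ours.

```r
library(bupaR)

# Hypothetical start time; each activity happens one hour after the previous one.
start <- as.POSIXct("2023-12-01 08:00:00", tz = "UTC")

hourly_data <- data.frame(
  case = rep("A", 5),
  activity_id = c("A", "B", "C", "D", "E"),
  activity_instance_id = 1:5,
  lifecycle_id = rep("complete", 5),
  timestamp = start + 3600 * (0:4),  # a vector of five date-times, one per event
  resource = rep("resource 1", 5)
)

# Same column-to-role mapping as in the example above.
hourly_log <- eventlog(hourly_data,
  case_id = "case",
  activity_id = "activity_id",
  activity_instance_id = "activity_instance_id",
  lifecycle_id = "lifecycle_id",
  timestamp = "timestamp",
  resource_id = "resource"
)

n_events(hourly_log)
range(hourly_log$timestamp)
```

With distinct timestamps the order of activities within the case is determined by time rather than by row order alone.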

To change which variable is mapped to the case id, use either the eventlog function or the set_case_id function:

traffic_fines %>%
    eventlog(case_id = "vehicleclass") # set the case id to vehicleclass
## # Log of 34724 events consisting of:
## 4 traces 
## 4 cases 
## 34724 instances of 11 activities 
## 16 resources 
## Events occurred from 2006-06-17 until 2012-03-26 
##  
## # Variables were mapped as follows:
## Case identifier:     vehicleclass 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 34,724 × 18
##    case_id activity        lifecycle resource timestamp           amount article
##    <chr>   <fct>           <fct>     <fct>    <dttm>              <chr>    <dbl>
##  1 A1      Create Fine     complete  561      2006-07-24 00:00:00 35.0       157
##  2 A1      Send Fine       complete  <NA>     2006-12-05 00:00:00 <NA>        NA
##  3 A100    Create Fine     complete  561      2006-08-02 00:00:00 35.0       157
##  4 A100    Send Fine       complete  <NA>     2006-12-12 00:00:00 <NA>        NA
##  5 A100    Insert Fine No… complete  <NA>     2007-01-15 00:00:00 <NA>        NA
##  6 A100    Add penalty     complete  <NA>     2007-03-16 00:00:00 71.5        NA
##  7 A100    Send for Credi… complete  <NA>     2009-03-30 00:00:00 <NA>        NA
##  8 A10000  Create Fine     complete  561      2007-03-09 00:00:00 36.0       157
##  9 A10000  Send Fine       complete  <NA>     2007-07-17 00:00:00 <NA>        NA
## 10 A10000  Insert Fine No… complete  <NA>     2007-08-02 00:00:00 <NA>        NA
## # ℹ 34,714 more rows
## # ℹ 11 more variables: dismissal <chr>, expense <chr>, lastsent <chr>,
## #   matricola <dbl>, notificationtype <chr>, paymentamount <dbl>, points <dbl>,
## #   totalpaymentamount <chr>, vehicleclass <chr>, activity_instance_id <chr>,
## #   .order <int>
#traffic_fines %>%
#    set_case_id("vehicleclass")
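The two routes can be compared directly. The sketch below assumes the bupaR and eventdataR packages are installed, reads the dataset as `eventdataR::traffic_fines` so it does not depend on any object modified in the session, and uses variable names of our own (`by_constructor`, `by_setter`).

```r
library(bupaR)

# Pristine copy of the dataset, taken straight from the package.
fines <- eventdataR::traffic_fines

# Route 1: rebuild the log, overriding only the case identifier.
by_constructor <- eventlog(fines, case_id = "vehicleclass")

# Route 2: change the case identifier on the existing log.
by_setter <- set_case_id(fines, "vehicleclass")

# Both logs should now report vehicleclass as their case identifier.
case_id(by_constructor)
case_id(by_setter)
```

Neither call modifies `fines` itself; each returns a new log, so the result has to be assigned if it is needed later.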

In addition, the mapping function extracts the current mapping of a log:

mapping_fines <- mapping(traffic_fines)
mapping_fines
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle

The set_ functions described above can be chained to adjust the mapping incrementally:

traffic_fines %>%
    set_case_id("vehicleclass") %>%
    set_activity_id("notificationtype") -> traffic_fines

The code above changed the mapping of traffic_fines and overwrote the object. The re_map function restores a previously saved mapping:

traffic_fines %>%
    re_map(mapping_fines)
## # Log of 34724 events consisting of:
## 44 traces 
## 10000 cases 
## 34724 instances of 11 activities 
## 16 resources 
## Events occurred from 2006-06-17 until 2012-03-26 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 34,724 × 18
##    case_id activity        lifecycle resource timestamp           amount article
##    <chr>   <fct>           <fct>     <fct>    <dttm>              <chr>    <dbl>
##  1 A1      Create Fine     complete  561      2006-07-24 00:00:00 35.0       157
##  2 A1      Send Fine       complete  <NA>     2006-12-05 00:00:00 <NA>        NA
##  3 A100    Create Fine     complete  561      2006-08-02 00:00:00 35.0       157
##  4 A100    Send Fine       complete  <NA>     2006-12-12 00:00:00 <NA>        NA
##  5 A100    Insert Fine No… complete  <NA>     2007-01-15 00:00:00 <NA>        NA
##  6 A100    Add penalty     complete  <NA>     2007-03-16 00:00:00 71.5        NA
##  7 A100    Send for Credi… complete  <NA>     2009-03-30 00:00:00 <NA>        NA
##  8 A10000  Create Fine     complete  561      2007-03-09 00:00:00 36.0       157
##  9 A10000  Send Fine       complete  <NA>     2007-07-17 00:00:00 <NA>        NA
## 10 A10000  Insert Fine No… complete  <NA>     2007-08-02 00:00:00 <NA>        NA
## # ℹ 34,714 more rows
## # ℹ 11 more variables: dismissal <chr>, expense <chr>, lastsent <chr>,
## #   matricola <dbl>, notificationtype <chr>, paymentamount <dbl>, points <dbl>,
## #   totalpaymentamount <chr>, vehicleclass <chr>, activity_instance_id <chr>,
## #   .order <int>
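The save, change, restore cycle can be written as one self-contained sketch. It assumes bupaR and eventdataR are installed and works on a copy taken from `eventdataR::traffic_fines`, so it behaves the same whether or not `traffic_fines` has been overwritten in the current session. The variable names are ours.

```r
library(bupaR)

fines <- eventdataR::traffic_fines

# 1. Save the original mapping before touching anything.
original_mapping <- mapping(fines)

# 2. Change the case and activity identifiers.
changed <- set_case_id(fines, "vehicleclass")
changed <- set_activity_id(changed, "notificationtype")

# 3. Restore the saved mapping.
restored <- re_map(changed, original_mapping)

case_id(changed)      # identifier after the change
case_id(restored)     # identifier after restoring
activity_id(restored)
```

Keeping the mapping in its own object is what makes the change reversible; without it, the original column roles would have to be set again by hand.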

3.2 Available datasets

The eventdataR package provides several real-life event log datasets.
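One way to see what an installed version of the package ships is to ask R for its dataset index. This is a small sketch using the base `data()` function; it assumes eventdataR is installed, and the exact list may differ between package versions.

```r
# List the datasets bundled with eventdataR (requires the package to be installed).
available <- data(package = "eventdataR")$results[, "Item"]
print(available)
```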

3.2.1 Sepsis dataset (sepsis)

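This real-life event log contains events of sepsis cases from a hospital. Sepsis is a life-threatening condition typically caused by an infection. One case represents a patient's pathway through the hospital. The events were recorded by the hospital's ERP (enterprise resource planning) system. There are about 1,000 cases with a total of 15,000 events, recorded for 16 different activities. In addition, 39 data attributes are recorded, for example the group responsible for an activity, test results, and information from checklists. Events and attribute values have been anonymized. The timestamps of events have been randomized, but the time between events within a trace has not been altered. Source: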

https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460

3.2.2 Hospital log (hospital)

A real-life event log from a Dutch academic hospital.

Source: https://doi.org/10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54

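3.2.3 Hospital billing data (hospital_billing)

The 100,000 traces in this event log are a random sample of process instances recorded over a period of three years. The log contains several attributes, such as the "state" of the process, the "caseType", the underlying "diagnosis", and so on. Events and attribute values have been anonymized. To this end, the timestamps of events have been randomized, but the time between events within a trace has not been altered.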

https://doi.org/10.4121/uuid:76c46b83-c930-4798-a1c9-4be94dfeb741

3.2.4 Road traffic fine management (traffic_fines)

A real-life event log of an information system that manages road traffic fines.

https://doi.org/10.4121/uuid:270fd440-1057-4fb9-89a9-b699b47990f5

3.2.5 Patients dataset (patients)

An artificial event log of patients in a hospital emergency department.

3.3 Reading XES datasets

library(xesreadR)
data <- read_xes("https://bupar.net/eventdata/exercise1.xes")

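3.4 Inspecting the event log

3.4.1 Viewing metadata

View the metadata of the event log, that is, the mapping: which variable is mapped to the case id, which to the timestamp, and so on.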

patients %>% mapping
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type

Other available functions include:

patients %>% activity_id
## [1] "handling"
patients %>% case_id
## [1] "patient"
patients %>% resource_id
## [1] "employee"
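The remaining parts of the mapping have accessors as well. The sketch below collects all six into one named vector; it assumes bupaR and eventdataR are installed, and it calls `bupaR::timestamp()` with an explicit namespace because `utils::timestamp()` has the same name.

```r
library(bupaR)
library(eventdataR)

# One accessor per element of the mapping; each returns a column name.
roles <- c(
  case              = case_id(patients),
  activity          = activity_id(patients),
  activity_instance = activity_instance_id(patients),
  lifecycle         = lifecycle_id(patients),
  timestamp         = bupaR::timestamp(patients),
  resource          = resource_id(patients)
)
print(roles)
```

The result should agree line by line with what `mapping` prints for the same log.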

3.4.2 Getting basic information

Use the summary function:

patients %>% summary
## Number of events:  5442
## Number of cases:  500
## Number of traces:  7
## Number of distinct activities:  7
## Average trace length:  10.884
## 
## Start eventlog:  2017-01-02 11:41:53
## End eventlog:  2018-05-05 07:16:02
##                   handling      patient          employee  handling_id       
##  Blood test           : 474   Length:5442        r1:1000   Length:5442       
##  Check-out            : 984   Class :character   r2:1000   Class :character  
##  Discuss Results      : 990   Mode  :character   r3: 474   Mode  :character  
##  MRI SCAN             : 472                      r4: 472                     
##  Registration         :1000                      r5: 522                     
##  Triage and Assessment:1000                      r6: 990                     
##  X-Ray                : 522                      r7: 984                     
##  registration_type      time                            .order    
##  complete:2721     Min.   :2017-01-02 11:41:53.00   Min.   :   1  
##  start   :2721     1st Qu.:2017-05-06 17:15:18.00   1st Qu.:1361  
##                    Median :2017-09-08 04:16:50.00   Median :2722  
##                    Mean   :2017-09-02 20:52:34.40   Mean   :2722  
##                    3rd Qu.:2017-12-22 15:44:11.25   3rd Qu.:4082  
##                    Max.   :2018-05-05 07:16:02.00   Max.   :5442  
## 
# Other functions that count individual elements of the log:

# patients %>% n_activities
# patients %>% n_activity_instances
# patients %>% n_cases
# patients %>% n_events
# patients %>% n_traces
# patients %>% n_resources
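Run together, the counting functions give a compact profile of the log. The sketch below assumes bupaR and eventdataR are installed. It also makes one relationship visible: in the summary above, the 5442 events split evenly into 2721 start and 2721 complete events, so the log holds two events for every activity instance.

```r
library(bupaR)
library(eventdataR)

# Collect the individual counts into one named vector.
counts <- c(
  events             = n_events(patients),
  activity_instances = n_activity_instances(patients),
  cases              = n_cases(patients),
  traces             = n_traces(patients),
  activities         = n_activities(patients),
  resources          = n_resources(patients)
)
print(counts)

# Events recorded per activity instance (start and complete are separate events).
counts[["events"]] / counts[["activity_instances"]]
```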

Other analysis functions list the activities, cases, resources, and traces of a log:

patients %>% activities()
## # A tibble: 7 × 3
##   handling              absolute_frequency relative_frequency
##   <fct>                              <int>              <dbl>
## 1 Registration                         500             0.184 
## 2 Triage and Assessment                500             0.184 
## 3 Discuss Results                      495             0.182 
## 4 Check-out                            492             0.181 
## 5 X-Ray                                261             0.0959
## 6 Blood test                           237             0.0871
## 7 MRI SCAN                             236             0.0867
patients %>% cases()
## # A tibble: 500 × 10
##    patient trace_length number_of_activities start_timestamp    
##    <chr>          <int>                <int> <dttm>             
##  1 1                  6                    6 2017-01-02 11:41:53
##  2 10                 5                    5 2017-01-06 05:58:54
##  3 100                5                    5 2017-04-11 16:34:31
##  4 101                5                    5 2017-04-16 06:38:58
##  5 102                5                    5 2017-04-16 06:38:58
##  6 103                6                    6 2017-04-19 20:22:01
##  7 104                6                    6 2017-04-19 20:22:01
##  8 105                6                    6 2017-04-21 02:19:09
##  9 106                6                    6 2017-04-21 02:19:09
## 10 107                5                    5 2017-04-22 18:32:16
## # ℹ 490 more rows
## # ℹ 6 more variables: complete_timestamp <dttm>, trace <chr>, trace_id <dbl>,
## #   duration <drtn>, first_activity <fct>, last_activity <fct>
patients %>% resources()
## # A tibble: 7 × 3
##   employee absolute_frequency relative_frequency
##   <fct>                 <int>              <dbl>
## 1 r1                      500             0.184 
## 2 r2                      500             0.184 
## 3 r6                      495             0.182 
## 4 r7                      492             0.181 
## 5 r5                      261             0.0959
## 6 r3                      237             0.0871
## 7 r4                      236             0.0867
patients %>% traces()
## # A tibble: 7 × 3
##   trace                                    absolute_frequency relative_frequency
##   <chr>                                                 <int>              <dbl>
## 1 Registration,Triage and Assessment,X-Ra…                258              0.516
## 2 Registration,Triage and Assessment,Bloo…                234              0.468
## 3 Registration,Triage and Assessment,Bloo…                  2              0.004
## 4 Registration,Triage and Assessment,X-Ray                  2              0.004
## 5 Registration,Triage and Assessment                        2              0.004
## 6 Registration,Triage and Assessment,X-Ra…                  1              0.002
## 7 Registration,Triage and Assessment,Bloo…                  1              0.002
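Because `traces` returns an ordinary table, it can be extended with derived columns. As a sketch (assuming bupaR and eventdataR are installed; the column name `cumulative_share` is ours), the code below sorts the traces by frequency and adds the share of cases covered by the most frequent traces so far.

```r
library(bupaR)
library(eventdataR)

trace_table <- traces(patients)

# Most frequent traces first.
trace_table <- trace_table[order(-trace_table$absolute_frequency), ]

# Running share of cases covered by the traces listed so far.
trace_table$cumulative_share <- cumsum(trace_table$relative_frequency)

print(trace_table[, c("absolute_frequency", "relative_frequency", "cumulative_share")])
```

In the output above the two most frequent traces account for 0.516 + 0.468 = 0.984 of all cases, so the remaining five traces are rare variants.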

3.5 Data quality

The daqapo package provides functions for checking the data quality of event logs in R.
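To give a feel for the kind of problems such checks look for, here is a hand-rolled base R sketch. It does not use daqapo and is not meant to mirror its API; the data frame is hypothetical and contains three planted issues: a missing activity label, a duplicated row, and a case whose rows are not in chronological order.

```r
# Hypothetical raw log with planted quality problems.
events <- data.frame(
  case_id   = c("c1", "c1", "c2", "c2", "c2"),
  activity  = c("Register", "Check-out", "Register", NA, NA),
  timestamp = as.POSIXct(c("2023-01-01 09:00:00", "2023-01-01 08:30:00",
                           "2023-01-02 10:00:00", "2023-01-02 10:15:00",
                           "2023-01-02 10:15:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Check 1: missing values per column.
missing_per_column <- colSums(is.na(events))

# Check 2: rows that exactly repeat an earlier row.
duplicated_rows <- events[duplicated(events), ]

# Check 3: cases whose rows are not in chronological order.
out_of_order <- sapply(
  split(as.numeric(events$timestamp), events$case_id),
  is.unsorted
)

print(missing_per_column)
print(duplicated_rows)
print(out_of_order)
```

Checks like these are worth running before the data frame is turned into an event log, since missing labels and duplicates otherwise end up as spurious activities and traces.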