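Chapter 3 Creating Event Logs
The data used in process mining is called an event log. An event log is a data set that records the sequence of events occurring in a system or process, and it is the key data source in fields such as process mining and business process management. Each event describes a specific action or state change in the system.
An event usually contains the following elements:
Case ID: the identifier of the case (process instance) the event belongs to. In a business process, a case can be an order, a service request, or another business entity.
Timestamp: the date and time at which the event occurred.
Activity: the activity or operation the event corresponds to, such as purchasing an item or submitting a form.
Resource: the entity that performed the activity, which may be a person, a machine, or something else.
Additional information: other data related to the event, such as input parameters or output results.
A typical event log looks like this: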
Case ID | Timestamp | Activity | Resource
--------------------------------------------------------------
Order123 | 2023-01-01 08:00:00 | Login | User1
Order123 | 2023-01-01 08:05:00 | Add to Cart | User1
Order123 | 2023-01-01 08:10:00 | Purchase Item A | User1
Service1 | 2023-01-02 10:30:00 | Process Request | System
Service1 | 2023-01-02 11:00:00 | Receive Payment | System
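3.1 Case and trace
- Trace:
A trace is the time-ordered sequence of activities performed for one instance (for example, one order). It is the operation history of a concrete entity. For example, the trace of order A might be purchase item, pay, deliver, while the trace of order B might be purchase item, cancel order.
- Case:
A case is a single process instance, such as one order, identified by its case ID. Every event belongs to exactly one case, and every case has exactly one trace. Several cases can share the same trace, so a log never contains more distinct traces than cases. The patients log summarized in section 3.4.2, for instance, has 500 cases but only 7 traces.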
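The relationship between cases and traces can be sketched in base R. The data below is hypothetical (three orders, two of which follow the same path), and the names events, trace_per_case, n_cases, and n_traces are chosen only for this illustration. It is a minimal sketch of the idea, not how bupaR computes traces internally.

```r
# Hypothetical event data: orders A and C follow the same path, order B is cancelled.
events <- data.frame(
  case_id   = c("A", "A", "A", "B", "B", "C", "C", "C"),
  activity  = c("Purchase", "Pay", "Deliver",
                "Purchase", "Cancel",
                "Purchase", "Pay", "Deliver"),
  timestamp = as.POSIXct(c(
    "2023-01-01 08:00:00", "2023-01-01 08:05:00", "2023-01-02 09:00:00",
    "2023-01-01 10:00:00", "2023-01-01 10:30:00",
    "2023-01-03 12:00:00", "2023-01-03 12:10:00", "2023-01-04 15:00:00"
  ), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Sort events by case and time, then collapse each case into its trace.
events <- events[order(events$case_id, events$timestamp), ]
trace_per_case <- tapply(events$activity, events$case_id, paste, collapse = ",")

n_cases  <- length(trace_per_case)          # one entry per order
n_traces <- length(unique(trace_per_case))  # distinct activity sequences

trace_per_case
c(cases = n_cases, traces = n_traces)
```

Three cases yield only two traces here, because orders A and C share the sequence Purchase,Pay,Deliver.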
Let's look at an example of creating an event log:
library(bupaverse)  # attaches bupaR 0.5.3, edeaR 0.9.1, eventdataR 0.3.1, processcheckR 0.1.4, processmapR 0.5.2
library(tidyverse)  # tidyverse 1.3.2; loaded after bupaverse, so dplyr::filter() masks bupaR::filter()
library(lubridate)  # provides now() and ddays()
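# Build a data frame with one row per event: a single case "A" with five activities
data <- data.frame(case = rep("A", 5),
                   activity_id = c("A", "B", "C", "D", "E"),
                   activity_instance_id = 1:5,
                   lifecycle_id = rep("complete", 5),
                   # now() + ddays(5) is a single value, so it is recycled:
                   # all five events receive the same timestamp
                   timestamp = now() + ddays(5),
                   resource = rep("resource 1", 5))
# Tell bupaR which column plays which role in the event log
Meventlog <- eventlog(data,
                      case_id = "case",
                      activity_id = "activity_id",
                      activity_instance_id = "activity_instance_id",
                      lifecycle_id = "lifecycle_id",
                      timestamp = "timestamp",
                      resource_id = "resource")
# Print the event log
Meventlog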
## # Log of 5 events consisting of:
## 1 trace
## 1 case
## 5 instances of 5 activities
## 1 resource
## Events occurred from 2023-12-01 22:48:32 until 2023-12-01 22:48:32
##
## # Variables were mapped as follows:
## Case identifier: case
## Activity identifier: activity_id
## Resource identifier: resource
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle_id
##
## # A tibble: 5 × 7
## case activity_id activity_instance_id lifecycle_id timestamp
## <chr> <chr> <int> <chr> <dttm>
## 1 A A 1 complete 2023-12-01 22:48:32
## 2 A B 2 complete 2023-12-01 22:48:32
## 3 A C 3 complete 2023-12-01 22:48:32
## 4 A D 4 complete 2023-12-01 22:48:32
## 5 A E 5 complete 2023-12-01 22:48:32
## # ℹ 2 more variables: resource <chr>, .order <int>
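In the log printed above, every event carries the same timestamp, because a single time value is recycled across all five rows. As a variation, the following sketch gives each event its own timestamp by adding a vector of durations. The start time and the names start, data_spaced, and log_spaced are assumptions made for this illustration.

```r
library(bupaR)
library(lubridate)

# Hypothetical fixed start time, so the example is reproducible
start <- ymd_hms("2023-12-01 08:00:00", tz = "UTC")

data_spaced <- data.frame(
  case = rep("A", 5),
  activity_id = c("A", "B", "C", "D", "E"),
  activity_instance_id = 1:5,
  lifecycle_id = rep("complete", 5),
  timestamp = start + ddays(0:4),  # one event per day, five distinct timestamps
  resource = rep("resource 1", 5)
)

log_spaced <- eventlog(data_spaced,
                       case_id = "case",
                       activity_id = "activity_id",
                       activity_instance_id = "activity_instance_id",
                       lifecycle_id = "lifecycle_id",
                       timestamp = "timestamp",
                       resource_id = "resource")
log_spaced
```

With distinct timestamps, the order of the activities within the case follows from the data itself.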
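To change which variable serves as the case ID, either call eventlog() again with a different case_id argument or use the set_case_id() function. For example, using vehicleclass as the case identifier of the traffic_fines log gives: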
## # Log of 34724 events consisting of:
## 4 traces
## 4 cases
## 34724 instances of 11 activities
## 16 resources
## Events occurred from 2006-06-17 until 2012-03-26
##
## # Variables were mapped as follows:
## Case identifier: vehicleclass
## Activity identifier: activity
## Resource identifier: resource
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle
##
## # A tibble: 34,724 × 18
## case_id activity lifecycle resource timestamp amount article
## <chr> <fct> <fct> <fct> <dttm> <chr> <dbl>
## 1 A1 Create Fine complete 561 2006-07-24 00:00:00 35.0 157
## 2 A1 Send Fine complete <NA> 2006-12-05 00:00:00 <NA> NA
## 3 A100 Create Fine complete 561 2006-08-02 00:00:00 35.0 157
## 4 A100 Send Fine complete <NA> 2006-12-12 00:00:00 <NA> NA
## 5 A100 Insert Fine No… complete <NA> 2007-01-15 00:00:00 <NA> NA
## 6 A100 Add penalty complete <NA> 2007-03-16 00:00:00 71.5 NA
## 7 A100 Send for Credi… complete <NA> 2009-03-30 00:00:00 <NA> NA
## 8 A10000 Create Fine complete 561 2007-03-09 00:00:00 36.0 157
## 9 A10000 Send Fine complete <NA> 2007-07-17 00:00:00 <NA> NA
## 10 A10000 Insert Fine No… complete <NA> 2007-08-02 00:00:00 <NA> NA
## # ℹ 34,714 more rows
## # ℹ 11 more variables: dismissal <chr>, expense <chr>, lastsent <chr>,
## # matricola <dbl>, notificationtype <chr>, paymentamount <dbl>, points <dbl>,
## # totalpaymentamount <chr>, vehicleclass <chr>, activity_instance_id <chr>,
## # .order <int>
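A call along the following lines should produce a log like the one printed above. This is a minimal sketch that assumes the traffic_fines event log from the eventdataR package. The name fines_by_class is chosen only for this example, and the result is stored in a new object so the original log stays untouched.

```r
library(bupaR)
library(eventdataR)

# Use the vehicleclass column as case identifier on a copy of the log
fines_by_class <- set_case_id(traffic_fines, "vehicleclass")

# The accessor confirms the new mapping; the original log is unchanged
case_id(fines_by_class)
case_id(traffic_fines)

fines_by_class
```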
In addition, the mapping() function extracts the current mapping of a log:
## Case identifier: case_id
## Activity identifier: activity
## Resource identifier: resource
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle
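The setter functions described above can be chained to adjust the mapping incrementally.
# Overwrite traffic_fines with a version that uses a different case and activity identifier
traffic_fines <- traffic_fines %>%
  set_case_id("vehicleclass") %>%
  set_activity_id("notificationtype")
The code above changes the mapping of the traffic_fines data set. A mapping saved earlier with mapping() can then be restored with the re_map() function: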
## # Log of 34724 events consisting of:
## 44 traces
## 10000 cases
## 34724 instances of 11 activities
## 16 resources
## Events occurred from 2006-06-17 until 2012-03-26
##
## # Variables were mapped as follows:
## Case identifier: case_id
## Activity identifier: activity
## Resource identifier: resource
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle
##
## # A tibble: 34,724 × 18
## case_id activity lifecycle resource timestamp amount article
## <chr> <fct> <fct> <fct> <dttm> <chr> <dbl>
## 1 A1 Create Fine complete 561 2006-07-24 00:00:00 35.0 157
## 2 A1 Send Fine complete <NA> 2006-12-05 00:00:00 <NA> NA
## 3 A100 Create Fine complete 561 2006-08-02 00:00:00 35.0 157
## 4 A100 Send Fine complete <NA> 2006-12-12 00:00:00 <NA> NA
## 5 A100 Insert Fine No… complete <NA> 2007-01-15 00:00:00 <NA> NA
## 6 A100 Add penalty complete <NA> 2007-03-16 00:00:00 71.5 NA
## 7 A100 Send for Credi… complete <NA> 2009-03-30 00:00:00 <NA> NA
## 8 A10000 Create Fine complete 561 2007-03-09 00:00:00 36.0 157
## 9 A10000 Send Fine complete <NA> 2007-07-17 00:00:00 <NA> NA
## 10 A10000 Insert Fine No… complete <NA> 2007-08-02 00:00:00 <NA> NA
## # ℹ 34,714 more rows
## # ℹ 11 more variables: dismissal <chr>, expense <chr>, lastsent <chr>,
## # matricola <dbl>, notificationtype <chr>, paymentamount <dbl>, points <dbl>,
## # totalpaymentamount <chr>, vehicleclass <chr>, activity_instance_id <chr>,
## # .order <int>
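The whole round trip (save the mapping, change it, restore it) might look like the following sketch. It assumes the traffic_fines log from eventdataR in its original state. The names original_mapping, fines_changed, and fines_restored are chosen for this illustration, and the changes are made on copies so that the packaged data set is not overwritten.

```r
library(bupaR)
library(eventdataR)

# Save the original mapping before changing anything
original_mapping <- mapping(traffic_fines)

# Change the case and activity identifiers on a copy of the log
fines_changed <- set_case_id(traffic_fines, "vehicleclass")
fines_changed <- set_activity_id(fines_changed, "notificationtype")
mapping(fines_changed)

# Restore the saved mapping
fines_restored <- re_map(fines_changed, original_mapping)
mapping(fines_restored)
```

Saving the mapping first is what makes the restore possible: re_map() needs a mapping object to apply.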
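3.2 Available data sets
The eventdataR package provides several real-life event logs.
3.2.1 Sepsis data set: sepsis
This real-life event log contains events of sepsis cases from a hospital. Sepsis is a life-threatening condition typically caused by an infection. One case represents a patient's pathway through the hospital. The events were recorded by the hospital's ERP (enterprise resource planning) system. There are about 1,000 cases with a total of 15,000 events, recorded for 16 different activities. In addition, 39 data attributes are recorded, for example the group responsible for the activity, test results, and information from checklists. Events and attribute values have been anonymized. The timestamps of the events have been randomized, but the time between events within a trace has not been altered. Source:
https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460
3.2.2 Hospital log: hospital
A real-life event log from a Dutch academic hospital.
Source: https://doi.org/10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54
3.2.3 Hospital billing data: hospital_billing
The 100,000 traces in this event log are a random sample of process instances recorded over three years. The log contains several attributes, such as the "state" of the process, the "caseType", and the underlying "diagnosis". Events and attribute values have been anonymized. The timestamps of the events have been randomized, but the time between events within a trace has not been altered.
https://doi.org/10.4121/uuid:76c46b83-c930-4798-a1c9-4be94dfeb741
3.2.4 Road traffic fine management: traffic_fines
A real-life event log of an information system that manages road traffic fines.
https://doi.org/10.4121/uuid:270fd440-1057-4fb9-89a9-b699b47990f5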
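To get a quick feel for the size of these logs, the counting functions of bupaR can be applied to each of them. The sketch below assumes that eventdataR is installed and that sepsis, traffic_fines, and patients (the log used in section 3.4) are available as event log objects. The names logs and overview are chosen only for this example.

```r
library(bupaR)
library(eventdataR)

# Collect basic size figures for a few of the bundled logs
logs <- list(sepsis = sepsis, traffic_fines = traffic_fines, patients = patients)

overview <- data.frame(
  log        = names(logs),
  cases      = sapply(logs, n_cases),
  events     = sapply(logs, n_events),
  activities = sapply(logs, n_activities)
)
overview
```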
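3.4 Inspecting an event log
3.4.1 Viewing metadata
The metadata of an event log is its mapping, that is, which variable is mapped to the case ID, which variable to the timestamp, and so on. For the patients log: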
## Case identifier: patient
## Activity identifier: handling
## Resource identifier: employee
## Activity instance identifier: handling_id
## Timestamp: time
## Lifecycle transition: registration_type
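Other available functions return individual elements of the mapping, for example activity_id(), case_id(), and resource_id():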
## [1] "handling"
## [1] "patient"
## [1] "employee"
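The outputs in this section can be reproduced with calls like the following. This is a sketch assuming the patients log from eventdataR. The trailing comments repeat the values printed above.

```r
library(bupaR)
library(eventdataR)

# Full mapping of the log
mapping(patients)

# Individual elements of the mapping
activity_id(patients)  # "handling"
case_id(patients)      # "patient"
resource_id(patients)  # "employee"
```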
3.4.2 Getting basic information
Use the summary() function:
## Number of events: 5442
## Number of cases: 500
## Number of traces: 7
## Number of distinct activities: 7
## Average trace length: 10.884
##
## Start eventlog: 2017-01-02 11:41:53
## End eventlog: 2018-05-05 07:16:02
## handling patient employee handling_id
## Blood test : 474 Length:5442 r1:1000 Length:5442
## Check-out : 984 Class :character r2:1000 Class :character
## Discuss Results : 990 Mode :character r3: 474 Mode :character
## MRI SCAN : 472 r4: 472
## Registration :1000 r5: 522
## Triage and Assessment:1000 r6: 990
## X-Ray : 522 r7: 984
## registration_type time .order
## complete:2721 Min. :2017-01-02 11:41:53.00 Min. : 1
## start :2721 1st Qu.:2017-05-06 17:15:18.00 1st Qu.:1361
## Median :2017-09-08 04:16:50.00 Median :2722
## Mean :2017-09-02 20:52:34.40 Mean :2722
## 3rd Qu.:2017-12-22 15:44:11.25 3rd Qu.:4082
## Max. :2018-05-05 07:16:02.00 Max. :5442
##
# Other functions that can be used to inspect the log include
# patients %>% n_activities
# patients %>% n_activity_instances
# patients %>% n_cases
# patients %>% n_events
# patients %>% n_traces
# patients %>% n_resources
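The summary and the counting functions listed above can be combined as in the following sketch, which assumes the patients log from eventdataR. The names counts and avg_trace_length are chosen for this illustration.

```r
library(bupaR)
library(eventdataR)

# Overall summary of the log
summary(patients)

# The individual counts, collected in one named vector
counts <- c(
  events             = n_events(patients),
  activity_instances = n_activity_instances(patients),
  cases              = n_cases(patients),
  traces             = n_traces(patients),
  activities         = n_activities(patients),
  resources          = n_resources(patients)
)
counts

# Events per case, which matches the "Average trace length" line of the summary
avg_trace_length <- n_events(patients) / n_cases(patients)
avg_trace_length
```

Each activity instance in this log has a start and a complete event, which is why the number of events is twice the number of activity instances.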
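Other analysis functions, such as activities(), cases(), resources(), and traces(), return one row per activity, case, resource, and trace respectively: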
## # A tibble: 7 × 3
## handling absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 Registration 500 0.184
## 2 Triage and Assessment 500 0.184
## 3 Discuss Results 495 0.182
## 4 Check-out 492 0.181
## 5 X-Ray 261 0.0959
## 6 Blood test 237 0.0871
## 7 MRI SCAN 236 0.0867
## # A tibble: 500 × 10
## patient trace_length number_of_activities start_timestamp
## <chr> <int> <int> <dttm>
## 1 1 6 6 2017-01-02 11:41:53
## 2 10 5 5 2017-01-06 05:58:54
## 3 100 5 5 2017-04-11 16:34:31
## 4 101 5 5 2017-04-16 06:38:58
## 5 102 5 5 2017-04-16 06:38:58
## 6 103 6 6 2017-04-19 20:22:01
## 7 104 6 6 2017-04-19 20:22:01
## 8 105 6 6 2017-04-21 02:19:09
## 9 106 6 6 2017-04-21 02:19:09
## 10 107 5 5 2017-04-22 18:32:16
## # ℹ 490 more rows
## # ℹ 6 more variables: complete_timestamp <dttm>, trace <chr>, trace_id <dbl>,
## # duration <drtn>, first_activity <fct>, last_activity <fct>
## # A tibble: 7 × 3
## employee absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 r1 500 0.184
## 2 r2 500 0.184
## 3 r6 495 0.182
## 4 r7 492 0.181
## 5 r5 261 0.0959
## 6 r3 237 0.0871
## 7 r4 236 0.0867
## # A tibble: 7 × 3
## trace absolute_frequency relative_frequency
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment,X-Ra… 258 0.516
## 2 Registration,Triage and Assessment,Bloo… 234 0.468
## 3 Registration,Triage and Assessment,Bloo… 2 0.004
## 4 Registration,Triage and Assessment,X-Ray 2 0.004
## 5 Registration,Triage and Assessment 2 0.004
## 6 Registration,Triage and Assessment,X-Ra… 1 0.002
## 7 Registration,Triage and Assessment,Bloo… 1 0.002
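The four tables above correspond to calls like these. This is a sketch assuming the patients log from eventdataR. The names activity_table and trace_table are chosen for this example.

```r
library(bupaR)
library(eventdataR)

# Frequency of each activity
activity_table <- activities(patients)
activity_table

# One row per case, including trace length, start and end time
cases(patients)

# Frequency of each resource
resources(patients)

# Frequency of each distinct trace
trace_table <- traces(patients)
trace_table

# The activity frequencies add up to the number of activity instances
sum(activity_table$absolute_frequency)
```

Because each result is a tibble, it can be filtered or sorted further with the usual dplyr verbs.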