Work Review – Adam Dev Site

Company: Urban Spring
Vision: Provide high-quality filtered water in a sustainable way and enable a Bring-Your-Own-Bottle refill culture
Industry: Environmental Protection, IoT
Employees: ≈ 25
Product: Smart Water Dispenser
Title: Senior Software Engineer
Employment Period: 22 months

Two years ago, a former colleague invited me to join Urban Spring. At the time, I was working as a senior software engineer at a relatively large company. "Relatively large" meant that most of the time I could only work on the same part of the project. Everyone had a narrowly defined job. Your job title was your boundary.

[Image: "Every private in the French army carries a field marshal's wand in his knapsack." Napoleon Bonaparte]

Personally, I had already spent too much time maintaining an existing system at my previous company. It was time to design and build a system using what I had learned.

In this article, I review some of the projects I took part in at the company. I was the lead and main contributor on these projects.

#1 Data Pipeline

Background

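The water dispensers produce a lot of data while running, including but not limited to operational data (filter life, fault data) and user behavior data (water consumption, dispensing time). The goal was to collect this data correctly, promptly, and completely.

Challenges

Persuading my manager to change the DB schema he had designed earlier.

My manager wanted the schema to be more generic so that it could accommodate unknown future requirements. But a more flexible schema means higher development cost, worse query performance, and SQL that is harder for the data team to write when they analyze the data.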

Solution #1

# table_1 metric
id, value_type, metric_code, description
1, bool  , leak_on   , leak sensor on
2, double, chill_temp, chiller temp
3, double, inlet_temp, inlet water temp


# table_2 system_event
id, device_id,event_code, created_at
1, d_1, ec_leak        , timestamp_1
2, d_1, ec_transaction , timestamp_2

# table_3 metric_data
id, event_id, metric_id, value
1, 1, 1, "true"
2, 1, 2, "14.5"
3, 1, 3, "25.5"
4, 2, 1, "false"

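This design is actually quite good: if a new, unknown metric type appears, the business requirement can be met without altering any database table (although in practice we rarely get new kinds of data; once the hardware is built, the data it can upload is fixed). On the other hand, with 10M events and 3 kinds of metric in the metric table, metric_data grows to 10M x 3 = 30M rows. To fetch all the metric data of the latest system event for a given device ID, I have to join tables several times, which hurts performance badly. One of the main users of the database is the data team; after collecting their feedback, I flattened the original 3 tables into 1. See my proposal below.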

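Aside: to be honest, I felt a bit awkward pushing back on my manager, because he is a very handsome guy from Switzerland; if I were a woman, I would surely have fallen for him. No wonder nobody on the data team had ever dared to challenge his design; maybe it is because the Data Team Lead is a woman.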

Solution #2

# table_1 system_event
id, device_id, event_code, created_at, leak_on, chill_temp, inlet_temp
1, d_1, e_1, timestamp_1, true ,  14.5, 25.5
2, d_1, e_2, timestamp_2, false,  14.0, 24.5

Premature optimization is the root of all evil

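To persuade my manager to accept the proposal from the data team and me, I prepared a demo. For the demo, I created two databases, one per design, and imported 1 million rows of historical data into each. I then came up with several scenarios in which the database would be queried in the future, and compared the two designs in terms of query performance and SQL complexity (SQL EXPLAIN is useful here). In some scenarios, the three-table design needed three joins. I showed the comparison to my manager. He frowned, and a few days later he accepted my proposal.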
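The kind of comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses an in-memory SQLite database instead of the databases from the real demo, a handful of rows instead of 1 million, and integer timestamps. The table name `system_event_flat` and the helper names are made up for this example.

```python
import sqlite3

# Query against Solution #1 (three generic tables): needs joins plus a subquery.
GENERIC_SQL = """
SELECT m.metric_code, md.value
FROM system_event AS e
JOIN metric_data AS md ON md.event_id = e.id
JOIN metric AS m ON m.id = md.metric_id
WHERE e.id = (SELECT id FROM system_event
              WHERE device_id = ? ORDER BY created_at DESC LIMIT 1)
ORDER BY m.id
"""

# Query against Solution #2 (one flat table): a single-table lookup.
FLAT_SQL = """
SELECT leak_on, chill_temp, inlet_temp
FROM system_event_flat
WHERE device_id = ?
ORDER BY created_at DESC
LIMIT 1
"""


def build_demo_db():
    """Create both designs side by side in one in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE metric (id INTEGER PRIMARY KEY, value_type TEXT, metric_code TEXT);
        CREATE TABLE system_event (id INTEGER PRIMARY KEY, device_id TEXT,
                                   event_code TEXT, created_at INTEGER);
        CREATE TABLE metric_data (id INTEGER PRIMARY KEY, event_id INTEGER,
                                  metric_id INTEGER, value TEXT);
        CREATE TABLE system_event_flat (id INTEGER PRIMARY KEY, device_id TEXT,
                                        event_code TEXT, created_at INTEGER,
                                        leak_on INTEGER, chill_temp REAL, inlet_temp REAL);

        INSERT INTO metric VALUES (1, 'bool', 'leak_on'), (2, 'double', 'chill_temp'),
                                  (3, 'double', 'inlet_temp');
        INSERT INTO system_event VALUES (1, 'd_1', 'ec_leak', 100),
                                        (2, 'd_1', 'ec_transaction', 200);
        INSERT INTO metric_data (event_id, metric_id, value) VALUES
            (1, 1, 'true'), (1, 2, '14.5'), (1, 3, '25.5'),
            (2, 1, 'false'), (2, 2, '14.0'), (2, 3, '24.5');
        INSERT INTO system_event_flat VALUES
            (1, 'd_1', 'ec_leak', 100, 1, 14.5, 25.5),
            (2, 'd_1', 'ec_transaction', 200, 0, 14.0, 24.5);
    """)
    return conn


def latest_metrics_generic(conn, device_id):
    """Latest event's metrics as (metric_code, value) rows from the generic design."""
    return conn.execute(GENERIC_SQL, (device_id,)).fetchall()


def latest_metrics_flat(conn, device_id):
    """Latest event's metrics as one row from the flat design."""
    return conn.execute(FLAT_SQL, (device_id,)).fetchone()


def query_plan(conn, sql, params):
    """Return the human-readable steps of SQLite's EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)]


if __name__ == "__main__":
    db = build_demo_db()
    for name, sql in (("generic", GENERIC_SQL), ("flat", FLAT_SQL)):
        print(name, query_plan(db, sql, ("d_1",)))
```

Note that the generic design also returns every value as text, so the caller has to cast according to value_type, while the flat design returns typed columns directly.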

Changes

Before: Client(FTP) ==> EC2(Fluentd) ==> MySQL
After: Client(SFTP) ==> EC2(File Sync) ==> S3 (Event Trigger) ==> Lambda ==> RDS(PostgreSQL)

Tech

EC2 File Sync uses MinIO
Lambda is built with serverless and Python 3.x

FAQ

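Q1: Why add S3 and Lambda?
- Storing the files in S3 backs up the data as a side effect, and every upload to S3 directly triggers a Lambda that inserts the file content into PostgreSQL.
Q2: Why run our own SFTP server on EC2 instead of using a cloud-hosted SFTP service?
- Cost. AWS Transfer Family offers SFTP as a service, but it is too expensive ($0.30 per hour in the sg region, not including the data transfer fee).
Q3: Why not deploy NoSQL?
- The data team is not comfortable doing data analysis on NoSQL, so we went with a relational database.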
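The S3-triggered Lambda step from Q1 might look roughly like the sketch below. It is not the production code. The CSV file layout, the column names, and the helpers `read_object` and `insert_rows` are assumptions for illustration; in a real Lambda they would wrap something like boto3's `get_object` and a PostgreSQL driver's `executemany`. They are injected here so the logic stays testable without AWS.

```python
import csv
import io
from urllib.parse import unquote_plus

# Assumed target table, matching the flat system_event design.
INSERT_SQL = (
    "INSERT INTO system_event "
    "(device_id, event_code, created_at, leak_on, chill_temp, inlet_temp) "
    "VALUES (%s, %s, %s, %s, %s, %s)"
)


def s3_objects(event):
    """Yield (bucket, key) for every object in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def parse_rows(csv_text):
    """Convert the CSV body of one uploaded file into typed row tuples."""
    rows = []
    for r in csv.DictReader(io.StringIO(csv_text)):
        rows.append((
            r["device_id"],
            r["event_code"],
            r["created_at"],
            r["leak_on"].strip().lower() == "true",
            float(r["chill_temp"]),
            float(r["inlet_temp"]),
        ))
    return rows


def handle(event, read_object, insert_rows):
    """Process one S3 event: read each new file and insert its rows.

    read_object(bucket, key) -> str and insert_rows(sql, rows) are
    hypothetical adapters supplied by the caller.
    """
    total = 0
    for bucket, key in s3_objects(event):
        rows = parse_rows(read_object(bucket, key))
        insert_rows(INSERT_SQL, rows)
        total += len(rows)
    return {"inserted": total}
```

Because the raw file stays in S3, a failed insert can be retried later by replaying the same object through the handler.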

#2 Admin Panel

Background

Admin Panel is an internal dashboard for managing IoT devices. Feature list:

View or filter the devices' real-time or historical operation data
1. Real-time data is powered by MQTT
Device meta info management
Update the device config remotely
IoT Device Provisioning
View and manage the device alert data
1. Alerts have two levels: Critical and Warning
2. An alert is auto-resolved if the data returns to normal
3. An audit log is kept when an alert is resolved by a user
Import water PDF reports (OCR detects the PDF content)
Support Social Login (Google)
Permission control for users

Changes

Before: Google Spreadsheet + MySQL + Python CLI (for managing the AWS IoT devices)
After: Admin Panel

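Challenges

Persuading the stakeholders and my manager to build an internal Admin Panel

When I joined the company, there were 4 programmers including me (two of whom were about to resign). My first task was to build a Python CLI that reads a CSV file, creates devices in AWS IoT based on the CSV content, and then stores the results (certificates, meta info, etc.) in PostgreSQL.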
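As a rough idea of what the core of such a CLI does, here is a minimal sketch. The CSV column `device_id` and the `save_result` callback are assumptions for illustration. `iot_client` is expected to behave like a boto3 IoT client, of which only `create_thing`, `create_keys_and_certificate`, and `attach_thing_principal` are used.

```python
import csv
import io


def provision_devices(csv_text, iot_client, save_result):
    """Create one AWS IoT thing per CSV row and hand the result to save_result.

    iot_client: an object with the boto3 IoT client methods used below.
    save_result: hypothetical callback that persists the record, e.g. into
    a PostgreSQL table.
    """
    created = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row["device_id"]  # assumed CSV column name
        iot_client.create_thing(thingName=name)
        cert = iot_client.create_keys_and_certificate(setAsActive=True)
        # Bind the new certificate to the thing so the device can authenticate.
        iot_client.attach_thing_principal(
            thingName=name, principal=cert["certificateArn"]
        )
        save_result({
            "device_id": name,
            "certificate_id": cert["certificateId"],
            "certificate_pem": cert["certificatePem"],
        })
        created.append(name)
    return created
```

Passing the client in as a parameter keeps the function easy to exercise with a fake client before pointing it at a real AWS account.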

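Not long after I started, I proposed to the team that we build an Admin Panel to replace the Python CLI. The team did not adopt the idea, because the Python CLI already met the stakeholders' requirements and the team had no prior front-end experience with this kind of web dashboard. My reasoning was that although the Python CLI could do the job, the people who would ultimately manage the devices were the company's maintenance team, not software engineers. Asking non-technical staff to learn a CLI in order to manage IoT devices is far too unfriendly. Besides, a Python CLI is not cool at all. What we build is used by non-technical people, so we should look at the requirements from the user's point of view.

While continuing to work on the Python CLI, I used my spare time to build a prototype of the admin panel with react material ui. To make the prototype more realistic, I also built a few backend APIs that integrate with AWS IoT. I demoed the prototype to the stakeholders and my manager at a Sprint Review. They did not agree right away, but after a while the idea was adopted, partly because the PM helped lobby for it. With approval from above, we stopped developing the Python CLI and switched to the Admin Panel. After a few months of development (around January 2010), we released version 1.0.0.

As more features were added, the Admin Panel became the company's central hub for internal device management. By the time I left, the version had reached v1.17.0. The features added over time are listed in the Background section.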

Tech

FE: React (material-kit-react) + Redux + Saga
- Hosting: AWS CloudFront + S3
BE: Django + Django REST framework
- Hosting: AWS ECS Fargate + AWS API Gateway
CI/CD: GitLab CI/CD + Runner
IoT Design: Designing MQTT Topics for AWS IoT Core

Alerting System

Background

When I joined the company, there was already a system handling alerting, but it was not what the team wanted. So my manager asked me to create a new one that fulfills the following requirements:

Design an alert lifecycle in which an alert is auto-resolved once the data returns to normal.
Persist all alert data to the database so that the data team can analyze it.
Keep a record in the database whenever an alert is updated by an operator.
Send timely notifications to the device maintenance team via Slack messages.
Classify alerts by level (e.g. leak_sensor_on is a critical-level alert; slow_flow_rate is a warning-level alert).
Create a dashboard that gives the team a comprehensive view of the alerting status.
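The lifecycle, audit, and level requirements above can be modelled with a small state machine. This is a minimal sketch, not the actual implementation: the state names, the `Alert` class, and the shape of the audit record are assumptions. Only the two alert codes and their levels come from the requirements.

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    CRITICAL = "critical"
    WARNING = "warning"


class State(Enum):
    OPEN = "open"
    AUTO_RESOLVED = "auto_resolved"
    RESOLVED_BY_USER = "resolved_by_user"


# Levels for the two example alert codes named in the requirements.
ALERT_LEVELS = {
    "leak_sensor_on": Level.CRITICAL,
    "slow_flow_rate": Level.WARNING,
}


@dataclass
class Alert:
    device_id: str
    code: str
    state: State = State.OPEN
    audit_log: list = field(default_factory=list)

    @property
    def level(self):
        return ALERT_LEVELS[self.code]

    def on_data(self, is_abnormal):
        """Auto-resolve an open alert once the data is back to normal."""
        if self.state is State.OPEN and not is_abnormal:
            self.state = State.AUTO_RESOLVED

    def resolve_by(self, operator):
        """Manual resolution by an operator; always leaves an audit record."""
        if self.state is not State.OPEN:
            raise ValueError("only open alerts can be resolved")
        self.audit_log.append({
            "operator": operator,
            "from": self.state.value,
            "to": State.RESOLVED_BY_USER.value,
        })
        self.state = State.RESOLVED_BY_USER
```

Keeping the auto-resolved and user-resolved states separate makes it possible to tell later how often alerts cleared on their own versus how often an operator had to step in.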

Changes

Before: Data Source(FTP) ==> EC2(Fluentd ==> ElasticSearch ==> Kibana Alerting ==> Python API ==> Prometheus Alertmanager) ==> Slack
After: Data Source(SFTP) ==> EC2(File Sync) ==> S3 (Event Trigger) ==> Lambda(serverless) ==> RDS(PostgreSQL) ==> Grafana ==> Serverless API(Lambda) ==> Update DB(PostgreSQL) ==> Slack

Alert State Diagram

Challenges

Utilize Grafana Dashboard to implement the alerting rules

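Initially, the Alerting System used ElasticSearch as its database and Kibana Alerting to define the alert rules, which were written in the Kibana Alerting query syntax. The biggest problem with Kibana Alerting was that the alert data it produced could only be stored in ElasticSearch. The data team needed to analyze the alert data, for example to calculate device failure rates, but they were not familiar with ElasticSearch, so they could not do that analysis. The data team was only familiar with SQL databases.

After switching the database to PostgreSQL, I had to work out how to build the Alerting System. After evaluating the monitoring systems on the market (really, mainly Kibana and Grafana), I chose Grafana for several reasons: 1) Grafana is not picky about databases and integrates with all kinds of data sources. 2) It has an open-source version that can be self-hosted, so it is essentially free. 3) It supports all kinds of plugins, and you can develop your own. 4) It supports Google SSO login. With Grafana chosen, the next question was how to set up the alerting rules. The first step is to create Grafana panels; Grafana renders different graphs based on SQL queries. See the official documentation for details. For example, I have a panel: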

SELECT time_filter, device_id, COUNT(leak_on)
FROM system_event
GROUP BY time_filter, device_id;

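With a panel in place, an alerting rule can be defined on it. For the panel above, the rule fires an alarm when COUNT(leak_on) is greater than 0. Our Serverless API then receives the data sent by Grafana, which includes the device_id and the alarm code that corresponds to this alarm: ERR_101_LEAK_ON.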
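On the receiving side, the Serverless API eventually has to turn such a notification into a Slack message. The sketch below assumes a simplified, hypothetical payload with only `device_id` and `alarm_code` keys; the real Grafana webhook body has a different, richer structure. The level mapping and the message wording are also assumptions. Slack incoming webhooks accept a JSON body with a `text` field.

```python
import json
import urllib.request

# Assumed mapping from alarm code to level; unknown codes default to "warning".
ALARM_LEVELS = {"ERR_101_LEAK_ON": "critical"}


def build_slack_message(notification):
    """Build the JSON body for a Slack incoming webhook."""
    code = notification["alarm_code"]
    level = ALARM_LEVELS.get(code, "warning")
    text = f"[{level.upper()}] {code} on device {notification['device_id']}"
    return {"text": text}


def post_to_slack(webhook_url, message):
    """POST the message to a Slack incoming-webhook URL and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

For example, `build_slack_message({"device_id": "d_1", "alarm_code": "ERR_101_LEAK_ON"})` yields a message whose text starts with `[CRITICAL]`.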

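That leak panel is a relatively simple one, because it checks only one kind of data. If an alerting rule is complex, for example a rule saying that, for a given device, Value1 occurring more often than Value2 indicates a critical alarm, the SQL becomes complicated and involves several subqueries and grouping.

That was the part I found most challenging, technically speaking. Because other teams have to help handle an alert once it fires, we had to keep communicating with them while defining the rules, to make sure we shared the same goal. This is also a test of communication skills: there is a gap between how an alert rule is designed and how it is finally implemented, so you have to understand and translate the needs of non-technical people and then implement them. This skill can be seen as the ability to translate and convert between different backgrounds and contexts.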
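To give a feel for such a rule, here is a sketch of "Value1 occurs more often than Value2 per device". It is written with conditional aggregation in a single grouped query, which is one possible formulation and not necessarily the one used in the real panels. It runs against in-memory SQLite; the `event_code` values 'v1' and 'v2' and the integer timestamps are placeholders.

```python
import sqlite3

# Devices where value1 occurred more often than value2 since a given time.
RULE_SQL = """
SELECT device_id,
       SUM(CASE WHEN event_code = :value1 THEN 1 ELSE 0 END) AS value1_count,
       SUM(CASE WHEN event_code = :value2 THEN 1 ELSE 0 END) AS value2_count
FROM system_event
WHERE created_at >= :since
GROUP BY device_id
HAVING SUM(CASE WHEN event_code = :value1 THEN 1 ELSE 0 END)
     > SUM(CASE WHEN event_code = :value2 THEN 1 ELSE 0 END)
ORDER BY device_id
"""


def devices_in_alarm(conn, value1, value2, since):
    """Return (device_id, value1_count, value2_count) for devices breaking the rule."""
    params = {"value1": value1, "value2": value2, "since": since}
    return conn.execute(RULE_SQL, params).fetchall()


def build_events(rows):
    """Create an in-memory system_event table from (device_id, event_code, created_at)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE system_event (device_id TEXT, event_code TEXT, created_at INTEGER)"
    )
    conn.executemany("INSERT INTO system_event VALUES (?, ?, ?)", rows)
    return conn
```

Even this small rule repeats each condition in both SELECT and HAVING, which hints at how quickly the SQL grows once several values and time windows are involved.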

Software Engineering Improvement

Introduced multiple environments (dev, test, prod)
- Achieved this with AWS Organizations
- AWS Organizations supports consolidated billing (very helpful for Accounting)
Introduced CI/CD and Git Flow
- Tagging the branch triggers a production deployment
- Pushing to a feat/* or fix/* branch triggers a deployment to dev
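The two deployment triggers can be summarized as a small decision function. This is only an illustration of the rules, written in Python rather than as the actual GitLab CI configuration; the function name and its arguments are made up.

```python
from fnmatch import fnmatch

# Branch patterns that deploy to the dev environment.
DEV_BRANCH_PATTERNS = ("feat/*", "fix/*")


def deployment_target(ref_type, ref_name):
    """Map a pushed Git ref to the environment it deploys to, or None.

    ref_type is "tag" or "branch"; ref_name is the tag or branch name.
    """
    if ref_type == "tag":
        return "prod"
    if ref_type == "branch" and any(
        fnmatch(ref_name, pattern) for pattern in DEV_BRANCH_PATTERNS
    ):
        return "dev"
    return None
```

Under these rules, pushing `feat/admin-panel` deploys to dev, pushing a tag such as `v1.17.0` deploys to prod, and any other branch deploys nowhere.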

#6 Device OTA (Android)