
Train Your Time Series Data with Azure Data Explorer Part 1

Published at 11:08 PM


Time series creation

In this section, we'll create a large set of regular time series simply and intuitively using the make-series operator, filling in missing values as needed. The first step in time series analysis is to partition and transform the original telemetry table into a set of time series. The table usually contains a timestamp column, contextual dimensions, and optional metrics. The dimensions are used to partition the data. The goal is to create thousands of time series per partition at regular time intervals.

The input table demo_make_series1 contains 600K records of arbitrary web service traffic. Use the following command to sample 10 records:

demo_make_series1 | take 10

The resulting table contains a timestamp column, three contextual dimension columns, and no metrics:

| TimeStamp | BrowserVer | OsVer | Country/Region |
| --- | --- | --- | --- |
| 2016-08-25 09:12:35.4020000 | Chrome 51.0 | Windows 7 | United Kingdom |
| 2016-08-25 09:12:41.1120000 | Chrome 52.0 | Windows 10 | |
| 2016-08-25 09:12:46.2300000 | Chrome 52.0 | Windows 7 | United Kingdom |
| 2016-08-25 09:12:46.5100000 | Chrome 52.0 | Windows 10 | United Kingdom |
| 2016-08-25 09:12:46.5570000 | Chrome 52.0 | Windows 10 | Republic of Lithuania |
| 2016-08-25 09:12:47.0470000 | Chrome 52.0 | Windows 8.1 | India |
| 2016-08-25 09:12:51.3600000 | Chrome 52.0 | Windows 10 | United Kingdom |
| 2016-08-25 09:12:51.6930000 | Chrome 52.0 | Windows 7 | Netherlands |
| 2016-08-25 09:12:56.4240000 | Chrome 52.0 | Windows 10 | United Kingdom |
| 2016-08-25 09:13:08.7230000 | Chrome 52.0 | Windows 10 | India |

Since there are no metrics, we can only build a set of time series representing the traffic count itself, partitioned by OS using the following query:

let min_t = toscalar(demo_make_series1 | summarize min(TimeStamp));
let max_t = toscalar(demo_make_series1 | summarize max(TimeStamp));
demo_make_series1
| make-series num=count() default=0 on TimeStamp from min_t to max_t step 1h by OsVer
| render timechart

There are three partitions, one per OS version. The query creates a separate time series for each: Windows 10 (red), Windows 7 (blue), and Windows 8.1 (green), as seen in the chart:

https://github.com/MicrosoftDocs/dataexplorer-docs/raw/main/data-explorer/media/time-series-analysis/time-series-partition.png
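
The query above fills empty bins with default=0. When interpolation is preferable to a constant, KQL provides dedicated fill functions. Here's a minimal sketch of the same query, assuming we mark missing bins as null and interpolate them with series_fill_linear():

let min_t = toscalar(demo_make_series1 | summarize min(TimeStamp));
let max_t = toscalar(demo_make_series1 | summarize max(TimeStamp));
demo_make_series1
| make-series num=count() default=real(null) on TimeStamp from min_t to max_t step 1h by OsVer
// series_fill_linear() interpolates null bins from their neighbors;
// series_fill_const(), series_fill_forward(), and series_fill_backward() are alternatives
| extend num=series_fill_linear(num)
| render timechart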

Time series analysis functions

In this section, we'll perform typical series processing. Once a set of time series is created, KQL supports a growing list of functions to process and analyze them; we'll walk through a few representative ones.
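
As a first taste of series processing, series_stats() summarizes a whole series in a single call. A minimal sketch on the series built above (the column names are illustrative; the tuple order follows the function's output):

let min_t = toscalar(demo_make_series1 | summarize min(TimeStamp));
let max_t = toscalar(demo_make_series1 | summarize max(TimeStamp));
demo_make_series1
| make-series num=count() default=0 on TimeStamp from min_t to max_t step 1h by OsVer
// series_stats() returns min, index of min, max, index of max, average, stdev, and variance
| extend (s_min, s_min_idx, s_max, s_max_idx, s_avg, s_stdev, s_var) = series_stats(num)
| project OsVer, s_min, s_max, s_avg, s_stdev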

Filtering

Filtering is a common practice in signal processing and is useful for time series processing tasks (for example, smoothing a noisy signal, or change detection). The following query applies a 5-bin moving average filter:

let min_t = toscalar(demo_make_series1 | summarize min(TimeStamp));
let max_t = toscalar(demo_make_series1 | summarize max(TimeStamp));
demo_make_series1
| make-series num=count() default=0 on TimeStamp from min_t to max_t step 1h by OsVer
| extend ma_num=series_fir(num, repeat(1, 5), true, true) // 5-bin centered moving average filter
| render timechart

https://github.com/MicrosoftDocs/dataexplorer-docs/raw/main/data-explorer/media/time-series-analysis/time-series-filtering.png
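
series_fir() implements finite impulse response filters such as the moving average above. Its counterpart series_iir() implements infinite impulse response filters, such as exponential smoothing. A hedged sketch, assuming a smoothing factor of 0.2 (so y[n] = 0.2*x[n] + 0.8*y[n-1]):

let min_t = toscalar(demo_make_series1 | summarize min(TimeStamp));
let max_t = toscalar(demo_make_series1 | summarize max(TimeStamp));
demo_make_series1
| make-series num=count() default=0 on TimeStamp from min_t to max_t step 1h by OsVer
// exponential smoothing expressed as IIR coefficients: numerators b=[0.2], denominators a=[1, -0.8]
| extend ema_num=series_iir(num, dynamic([0.2]), dynamic([1, -0.8]))
| render timechart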

Regression analysis

Azure Data Explorer supports segmented linear regression analysis to estimate the trend of the time series.

Example of series_fit_line() and series_fit_2lines() functions in a time series query:

demo_series2
| extend series_fit_2lines(y), series_fit_line(y)
| render linechart with(xcolumn=x)

https://github.com/MicrosoftDocs/dataexplorer-docs/raw/main/data-explorer/media/time-series-analysis/time-series-regression.png

Note

The function accurately detected the jump (level change) point.
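
Beyond rendering the fitted lines, series_fit_line() returns a tuple of fit parameters that can be used directly for filtering or ranking. A minimal sketch extracting them from the same demo series:

demo_series2
// the output tuple: goodness of fit (r-square), slope, variance, residual variance,
// interception, and the fitted line itself
| extend (rsquare, slope, variance, rvariance, interception, line_fit) = series_fit_line(y)
| project rsquare, slope, interception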

Seasonality detection

Many metrics follow seasonal (periodic) patterns. User traffic of cloud services usually contains daily and weekly patterns that are highest around the middle of the business day and lowest at night and over the weekend. IoT sensors measure in periodic intervals. Physical measurements such as temperature, pressure, or humidity may also show seasonal behavior.

The following example applies seasonality detection on one month of web service traffic (2-hour bins):

demo_series3
| render timechart

https://github.com/MicrosoftDocs/dataexplorer-docs/raw/main/data-explorer/media/time-series-analysis/time-series-seasonality.png

Note

It's an anomaly if the expected distinct periods don't exist.

demo_series3
| project (periods, scores) = series_periods_detect(num, 0., 14d/2h, 2) //to detect the periods in the time series
| mv-expand periods, scores
| extend days=2h*todouble(periods)/1d

| periods | scores | days |
| --- | --- | --- |
| 84 | 0.820622786055595 | 7 |
| 12 | 0.764601405803502 | 1 |

The function detects daily and weekly seasonality. The daily period scores lower than the weekly one because weekend days differ from weekdays.
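
When the periods of interest are known up front (here, daily = 12 bins of 2 hours and weekly = 84 bins), series_periods_validate() scores those specific periods instead of searching for them. A minimal sketch on the same series:

demo_series3
// validate the known daily (12 bins) and weekly (84 bins) periods
| project (periods, scores) = series_periods_validate(num, 12.0, 84.0)
| mv-expand periods, scores
| extend days=2h*todouble(periods)/1d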

Element-wise functions

Arithmetic and logical operations can be applied element-wise to time series. Using series_subtract(), we can calculate a residual time series, that is, the difference between the original raw metric and the smoothed one, and look for anomalies in the residual signal:

let min_t = toscalar(demo_make_series1 | summarize min(TimeStamp));
let max_t = toscalar(demo_make_series1 | summarize max(TimeStamp));
demo_make_series1
| make-series num=count() default=0 on TimeStamp from min_t to max_t step 1h by OsVer
| extend ma_num=series_fir(num, repeat(1, 5), true, true) // 5-bin centered moving average filter
| extend residual_num=series_subtract(num, ma_num) //to calculate residual time series
| where OsVer == "Windows 10"   // filter on Win 10 to visualize a cleaner chart
| render timechart

https://github.com/MicrosoftDocs/dataexplorer-docs/raw/main/data-explorer/media/time-series-analysis/time-series-operations.png
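
To go from a residual series to anomaly scores, one option is series_outliers(), which scores each point using Tukey's fence test. A hedged sketch extending the residual query above (treating scores beyond ±1.5 as anomalies is a common convention, not something the walkthrough prescribes):

let min_t = toscalar(demo_make_series1 | summarize min(TimeStamp));
let max_t = toscalar(demo_make_series1 | summarize max(TimeStamp));
demo_make_series1
| make-series num=count() default=0 on TimeStamp from min_t to max_t step 1h by OsVer
| extend ma_num=series_fir(num, repeat(1, 5), true, true) // 5-bin centered moving average filter
| extend residual_num=series_subtract(num, ma_num)
// series_outliers() assigns each point an anomaly score based on Tukey's fence test
| extend outlier_scores=series_outliers(residual_num)
| where OsVer == "Windows 10"
| render timechart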

Time series workflow at scale

The example below shows how these functions run at scale, performing anomaly detection on thousands of time series in seconds. To see a few sample telemetry records of a DB service's read count metric over four days, run the following query:

demo_many_series1
| take 4

| TIMESTAMP | Loc | Op | DB | DataRead |
| --- | --- | --- | --- | --- |
| 2016-09-11 21:00:00.0000000 | Loc 9 | 5117853934049630089 | 262 | 0 |
| 2016-09-11 21:00:00.0000000 | Loc 9 | 5117853934049630089 | 241 | 0 |
| 2016-09-11 21:00:00.0000000 | Loc 9 | -865998331941149874 | 262 | 279862 |
| 2016-09-11 21:00:00.0000000 | Loc 9 | 371921734563783410 | 255 | 0 |

And simple statistics:

demo_many_series1
| summarize num=count(), min_t=min(TIMESTAMP), max_t=max(TIMESTAMP)

| num | min_t | max_t |
| --- | --- | --- |
| 2177472 | 2016-09-08 00:00:00.0000000 | 2016-09-11 23:00:00.0000000 |

Building a time series of the read metric in 1-hour bins (four days * 24 hours = 96 points in total) results in a normal-looking fluctuation pattern:

let min_t = toscalar(demo_many_series1 | summarize min(TIMESTAMP));
let max_t = toscalar(demo_many_series1 | summarize max(TIMESTAMP));
demo_many_series1
| make-series reads=avg(DataRead) on TIMESTAMP from min_t to max_t step 1h
| render timechart with(ymin=0)

https://github.com/MicrosoftDocs/dataexplorer-docs/raw/main/data-explorer/media/time-series-analysis/time-series-at-scale.png

The above behavior is misleading: the single, normal-looking time series is aggregated from thousands of different instances, any of which may have abnormal patterns. Therefore, we create a time series per instance. An instance is defined by Loc (location), Op (operation), and DB (specific machine).

How many time series can we create?

demo_many_series1
| summarize by Loc, Op, DB
| count

The query returns a count of 18339, so we'll create a set of 18339 time series of the read count metric. We add the by clause to the make-series statement, apply linear regression, and select the top two time series with the most significant decreasing trend:

let min_t = toscalar(demo_many_series1 | summarize min(TIMESTAMP));
let max_t = toscalar(demo_many_series1 | summarize max(TIMESTAMP));
demo_many_series1
| make-series reads=avg(DataRead) on TIMESTAMP from min_t to max_t step 1h by Loc, Op, DB
| extend (rsquare, slope) = series_fit_line(reads)
| top 2 by slope asc
| render timechart with(title='Service Traffic Outage for 2 instances (out of 18339)')

https://github.com/MicrosoftDocs/dataexplorer-docs/raw/main/data-explorer/media/time-series-analysis/time-series-top-2.png

Display the instances:

let min_t = toscalar(demo_many_series1 | summarize min(TIMESTAMP));
let max_t = toscalar(demo_many_series1 | summarize max(TIMESTAMP));
demo_many_series1
| make-series reads=avg(DataRead) on TIMESTAMP from min_t to max_t step 1h by Loc, Op, DB
| extend (rsquare, slope) = series_fit_line(reads)
| top 2 by slope asc
| project Loc, Op, DB, slope

| Loc | Op | DB | slope |
| --- | --- | --- | --- |
| Loc 15 | 37 | 1151 | -102743.910227889 |
| Loc 13 | 37 | 1249 | -86303.2334644601 |

In less than two minutes, close to 20,000 time series were analyzed and two abnormal time series in which the read count suddenly dropped were detected.
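
Linear regression ranks series by gradual trend; for sudden point anomalies across the same set of series, the decomposition-based detector series_decompose_anomalies() is a natural companion. A hedged sketch (the 1.5 threshold is the function's default, written out here for clarity):

let min_t = toscalar(demo_many_series1 | summarize min(TIMESTAMP));
let max_t = toscalar(demo_many_series1 | summarize max(TIMESTAMP));
demo_many_series1
| make-series reads=avg(DataRead) on TIMESTAMP from min_t to max_t step 1h by Loc, Op, DB
// returns per-point anomaly flags (-1/0/+1), scores, and the expected baseline
| extend (anomalies, score, baseline) = series_decompose_anomalies(reads, 1.5)
| mv-expand anomalies to typeof(int)
| summarize num_anomalies=countif(anomalies != 0) by Loc, Op, DB
| top 2 by num_anomalies desc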

These advanced capabilities combined with fast performance supply a unique and powerful solution for time series analysis.