[PYTHON] Winning with Monitoring

This is the December 8 entry for the NetOpsCoding Advent Calendar 2016.

Introduction

On December 1st, I gave a talk titled "Winning with Monitoring" at the "Juniper Cloud Builder Community", an event hosted by Juniper.

Thankfully, many people at the event asked me for the presentation materials, so I would like to take this opportunity to publish them.

The talk slot was relatively short at 25 minutes, and I had to cut about half of the materials I had prepared. In this entry I will post only a summary.

The talk was about the telemetry function, which has been a frequent topic in the Monitoring category in recent years (unlike SNMP, the network device itself sends the metrics). I verified OpenNTI, one implementation of it (it is not covered by Juniper's official support), to confirm how useful the telemetry function is.

Monitoring architecture

First, a quick review of monitoring architectures.

arch.png

In the Pull type, the collector sends a request to a target, such as a network node, to collect information, and the target returns a response.

In the Push type, by contrast, the target unilaterally sends information to the collector, as with SNMP Trap and SYSLOG.

Because it does not depend on a request / response exchange, I think the Push type is the architecture with the stronger real-time characteristics.

That said, I personally do not want to argue about which is better, Pull or Push; I think you should choose whichever fits the purpose. (For example, the Pull type can measure latency from the request / response and is well suited to liveness monitoring.)
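To make the difference concrete, here is a minimal sketch of both patterns over UDP on the loopback interface. The text protocol (`GET rx_bytes`, `rx_bytes=12345`) and all function names are made up for illustration; they are not SNMP, syslog, or any vendor's telemetry format.

```python
import socket
import threading
import time


def run_target(sock, stop):
    """Pull-type target: answer each request datagram with a metric."""
    sock.settimeout(0.1)
    while not stop.is_set():
        try:
            data, addr = sock.recvfrom(1024)
        except socket.timeout:
            continue
        if data == b"GET rx_bytes":           # hypothetical request format
            sock.sendto(b"rx_bytes=12345", addr)


def pull_once(target_addr, timeout=1.0):
    """Collector side of Pull: send a request and time the round trip."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        start = time.monotonic()
        s.sendto(b"GET rx_bytes", target_addr)
        try:
            reply, _ = s.recvfrom(1024)
        except socket.timeout:
            return None, None                 # no answer within the timeout
        return reply.decode(), time.monotonic() - start


def push_once(collector_addr, payload):
    """Target side of Push: send and forget, no response is expected."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.sendto(payload, collector_addr)


def demo():
    # Pull: the collector asks, the target answers.
    target_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    target_sock.bind(("127.0.0.1", 0))
    stop = threading.Event()
    worker = threading.Thread(target=run_target, args=(target_sock, stop))
    worker.start()
    reply, rtt = pull_once(target_sock.getsockname())
    stop.set()
    worker.join()
    target_sock.close()

    # Push: the collector only listens; the target sends on its own schedule.
    collector_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    collector_sock.bind(("127.0.0.1", 0))
    collector_sock.settimeout(1.0)
    push_once(collector_sock.getsockname(), b"rx_bytes=12345")
    pushed, _ = collector_sock.recvfrom(1024)
    collector_sock.close()
    return reply, rtt, pushed.decode()


if __name__ == "__main__":
    print(demo())
```

In the Pull half the collector gets a round-trip time for free, and a timeout tells it the target did not answer. In the Push half the collector learns nothing when the target goes silent, but the data arrives whenever the target decides to send it.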

Monitoring System classification

The following is a classification of monitoring systems, although they cannot be classified cleanly. Please take it only as an example.

kind.png

In reality, many combinations are possible, and the backend data source can be chosen. As for architecture, SNMP cannot be ignored, so I think most Push-type systems also support the Pull type through plug-ins.

SNMP

I think the SNMP protocol is unavoidable in monitoring. Its strengths and problems are shown below.

snmp.png

I have too much to say about SNMP, so I will leave it out here.

Telemetry implementation status of network vendors

■Cisco
・ Telemetry function implemented in IOS XR 6.X (so far, the shortest interval for sending metrics seems to be 5 seconds)
・ Implementation example
   IOS XR 6.X + Streaming Telemetry Collector Stacks

■Arista
・ Telemetry functions such as LANZ have been implemented since early on
・ Recently, information collected by the various Tracer functions (VM Tracer, Container Tracer) is accumulated in NetDB and can now be analyzed with CloudVision
・ Implementation example
   NetDB + CloudVision
   EOS Splunk Extension + Splunk Apps
   LANZ + CorvilNet

■Juniper
・ Starting with the QFX5100, telemetry is provided by a function called Insight Technology (referred to as Analyticsd in this entry)
・ On the MX and others, it is implemented by a function called JTI (Junos Telemetry Interface)
・ Implementation example
   Analyticsd + Cloud Analytics Engine + Network Director
   Analyticsd + Contrail
   docker-junos-datadog
   junos-telemetry-splunk
   NetReflex
   BizReflex

**OpenNTI** ⇒ see the presentation materials

Supplement to the published materials (added on 2016/12/16)

P.6 OpenNTI has a process called "Data Streaming Collector", which handles data sent by the telemetry function of the network device (Push type). It also has a process called "Data Collection Agent", which fetches data via Netconf (Pull type). As for the last item, "SYSLOG", I do not send every log from the network devices. I send only specific events (such as commit complete) so that they can be used later for analysis.

P.8 This is a sample configuration for Analyticsd, but the value of the depth-threshold setting is quite low. Depending on the environment, you may need to change it.

P.9 What the telemetry function makes possible:
・ Packet buffer and communication delay monitoring
・ Microburst monitoring
・ Broadcast / Multicast PPS monitoring

These advantages are most effective in situations that tend to cause trouble for a service provider: preventing service impact, detecting it immediately, and securing evidence (for example, that a communication delay occurred, that traffic momentarily filled the bandwidth, or that a storm was detected).

P.11 With Jvision, resources other than Interface can be specified in the resource setting, and it seems that [a variety of information can be obtained starting from JUNOS 16.1R3](http://www.juniper.net/techpubs/en_US/junos16.1/information-products/topic-collections/release-notes/16.1/topic-105380.html#jd0e6650) (I have not verified this).

P.17 Once OpenNTI is started, you can access the Web GUI and view the collected data. The "Data Streaming Collector Dashboard" is the dashboard that displays the data collected by the telemetry function.

Please be careful if you use OpenNTI as is, because it contains some small mistakes. For example, the name of this dashboard should be "Data Streaming", but the letter "r" is missing and it reads "Data Steaming". Also, where the interface's Broadcast data should be displayed, Multicast data is displayed instead. There are quite a few mistakes like these.

P.18 The "Data Collection Agent Dashboard" shows the data acquired via Netconf. It can be used, for example, to monitor whether a specific JUNOS process has a memory leak.

P.20 The first graph is from Cacti, polling at 1-minute intervals. When about 9.4 Gbps of traffic is sent for 30 seconds, 20 seconds, and 10 seconds, the graph shows only 4.5 Gbps, 2.5 Gbps, and 1.5 Gbps, respectively. 9.4 Gbps of traffic is actually flowing, but this is the limit of what Cacti can see. (Of course, some of this can be captured more precisely with Cacti's Realtime plugin and the like.) The second graph was acquired by the telemetry function, and it captures the traffic accurately even for the 10-second burst.
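As a rough check of the Cacti numbers, the sketch below computes the average a poller would report if it read counters once per interval and a 9.4 Gbps burst fell entirely inside one otherwise idle interval. The function name and the idle-interval assumption are mine, not from the talk. The measured values (4.5, 2.5, 1.5 Gbps) are somewhat lower than these idealised figures but follow the same pattern.

```python
def polled_average_gbps(burst_gbps, burst_seconds, poll_interval_seconds=60):
    """Average rate seen over one polling interval that fully contains a
    burst and carries no other traffic (an idealised assumption)."""
    if not 0 <= burst_seconds <= poll_interval_seconds:
        raise ValueError("burst must fit inside one polling interval")
    gigabits_sent = burst_gbps * burst_seconds
    return gigabits_sent / poll_interval_seconds


if __name__ == "__main__":
    for seconds in (30, 20, 10):
        # Prints roughly 4.7, 3.13 and 1.57 Gbps for a 60-second interval.
        print(seconds, round(polled_average_gbps(9.4, seconds), 2))
```

The average shrinks in proportion to how little of the interval the burst occupies. Only when the sampling interval is no longer than the burst itself does the reported rate reach the real 9.4 Gbps.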

When a line is saturated by a short-lived DDoS or similar and the service becomes unavailable, whether the line was really saturated by the DDoS often becomes a point of dispute with the user, and I think the first Cacti graph would rarely convince anyone. Normally the traffic graph and xFlow information are combined to resolve the issue, but in many cases the fire cannot be put out in a day or two. I think being able to monitor and store the actual traffic with high accuracy is a great strength.

P.25 The upper graph shows the traffic generated by users to whom QoS is applied. The lower graph shows the traffic dropped by the QoS limit. Traffic is normally monitored per physical / logical interface, but rule-based monitoring like this is also possible.

P.27 Data acquisition via Netconf tends to become a bottleneck. Even if you try to tune it, PyEZ's SSH processing uses Paramiko, so settings such as ControlMaster cannot be applied and SSH sessions cannot be multiplexed. The only option, therefore, is to increase the parallelism of the SSH processing (by registering multiple cron jobs).
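The approach described here is to register several cron jobs. As one alternative way to get the same kind of parallelism inside a single script, the following sketch uses a thread pool. This is a different technique from the cron setup, and `fetch_one` is a hypothetical placeholder: a real version would open its own Netconf session to each host (for example with PyEZ), and the host names are invented.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_one(host):
    """Placeholder for one Netconf-over-SSH collection (hypothetical).

    The sleep stands in for the SSH round trips; each call is assumed to
    open and close its own independent session.
    """
    time.sleep(0.2)
    return host, {"status": "ok"}


def fetch_all(hosts, workers=4):
    """Run fetch_one for every host, with up to `workers` sessions at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_one, hosts))


if __name__ == "__main__":
    started = time.monotonic()
    results = fetch_all(["router1", "router2", "router3", "router4"])
    print(results, round(time.monotonic() - started, 2))
```

Since sessions still cannot be shared, each worker pays the full SSH setup cost. The gain comes only from overlapping the waits, which is the same trade-off as running several cron jobs side by side.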

Bonus

We have verified OpenNTI, but we do not expect to use it as is in the production environment. We plan to build a monitoring system on bare metal servers, borrowing only its mechanisms.

The reasons are vague ones. For now, I am instinctively uncomfortable with keeping a DB in a Docker container. With Docker containers, you also need to monitor the containers themselves. And for now there are many container management and monitoring solutions, so I want to watch how the situation develops for a while.

Also, of course, we will integrate an Alerting System as well as the visualization part. Initially we assumed Kapacitor because it is easy to integrate with InfluxDB. However, Grafana has implemented an Alerting function as of 4.0, although it is still in beta, so that will also be a candidate.
