Telemetry is a framework that collects real-world data from Firefox. Recently, Telemetry experienced issues that resulted in incomplete data being reported on the Telemetry Dashboard and Telemetry Evolution.
Before I get into the details about the data issues, I think a little background is in order for those who don’t have it.
Telemetry sends a daily ping with the information collected by the Telemetry probes. This data is aggregated and presented in the Telemetry Dashboard and Telemetry Evolution (http://mzl.la/telemetrydash). The Telemetry data is stored in three separate data repositories:
1. daily ping (raw) data is stored in HBase
2. aggregated data for use in the Telemetry Dashboard is stored in Elasticsearch
3. trend data for use in Telemetry Evolution is stored in a relational database
A job is run to aggregate the raw data (1) into the data for the Telemetry Dashboard (2). A second job is then run to create trend data from the aggregate data (2) for Telemetry Evolution (3).
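To make the data flow concrete, here is a minimal sketch of the two jobs as plain Python functions. It is not the actual job code: the in-memory lists and dicts stand in for the three data stores, the ping fields and the probe name STARTUP_MS are hypothetical, and each probe is reduced to a single number to keep the example short.

```python
from collections import defaultdict

# Hypothetical raw pings, standing in for repository (1).
RAW_PINGS = [
    {"date": "2012-03-18", "channel": "nightly", "probe": "STARTUP_MS", "value": 900},
    {"date": "2012-03-18", "channel": "nightly", "probe": "STARTUP_MS", "value": 1100},
    {"date": "2012-03-19", "channel": "nightly", "probe": "STARTUP_MS", "value": 1000},
]


def aggregate_pings(pings):
    """First job: raw pings (1) -> per-day aggregates for the dashboard (2)."""
    aggregates = defaultdict(lambda: {"count": 0, "sum": 0})
    for ping in pings:
        key = (ping["date"], ping["channel"], ping["probe"])
        aggregates[key]["count"] += 1
        aggregates[key]["sum"] += ping["value"]
    return dict(aggregates)


def build_trends(aggregates):
    """Second job: aggregates (2) -> per-probe time series for trends (3)."""
    trends = defaultdict(list)
    # Sorting the keys puts each series in date order.
    for (date, channel, probe), stats in sorted(aggregates.items()):
        trends[(channel, probe)].append((date, stats["sum"] / stats["count"]))
    return dict(trends)


aggregates = aggregate_pings(RAW_PINGS)
trends = build_trends(aggregates)
```

Because the second job reads only the output of the first, any problem in the aggregation step carries through to the trend data, while the raw pings remain untouched.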
Several weeks ago, cww discovered that the Telemetry Dashboard was not displaying the volume of data expected from the Nightly channel. The source of the problem was determined to be an issue with the newly landed persistent Telemetry feature, which caused the Telemetry ping to fail. However, once this change was backed out, a second issue was discovered in the data job that processes the raw data from the daily Telemetry ping (in hbase) and aggregates it into a form that can be consumed by the Telemetry Dashboard (elastic search). Because the trend data for Telemetry Evolution is generated from the elastic search data, it was also affected.
The status of these two issues is as follows:
1. The persistent Telemetry issue has been resolved, but it left a gap in Nightly Telemetry data submissions from Mar 1-17, 2012.
2. The data job issue was confined to the job that aggregates the raw data into elastic search, so none of the raw data was lost. The job has been corrected and Telemetry data is now available from Jan 1, 2012, onwards. The metrics team is running the jobs to repopulate the trend data from prior to Jan 1, and expects the full data set to be back online shortly. (Bug 731662.)
Our next steps are to put data validation processes in place to catch data issues early. Discussion is taking place in bug 742883, bug 742897, and bug 742903. These issues also surfaced a need to discuss the Telemetry data retention policy; that discussion is taking place in bug 742496.
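As one illustration of what an early-warning check might look like, the sketch below flags days whose submission count drops well below the median of the preceding days. This is only an assumption about one possible approach, not what the bugs above propose: the function name, the window size, the 50% threshold, and the counts are all made up for the example.

```python
from statistics import median


def flag_low_volume_days(daily_counts, window=3, min_ratio=0.5):
    """Return the days whose count is below min_ratio times the median
    of the previous `window` days. daily_counts is a list of
    (day, count) pairs in chronological order."""
    flagged = []
    for i in range(window, len(daily_counts)):
        day, count = daily_counts[i]
        baseline = median([c for _, c in daily_counts[i - window:i]])
        if baseline > 0 and count < min_ratio * baseline:
            flagged.append(day)
    return flagged


# Invented submission counts, for illustration only.
counts = [
    ("day-1", 1000),
    ("day-2", 1100),
    ("day-3", 900),
    ("day-4", 300),
    ("day-5", 250),
]
low_days = flag_low_volume_days(counts)
```

A check like this compares each day only to recent history, so it would not need a fixed expected volume per channel; the trade-off is that a long outage eventually drags the baseline down and stops being flagged.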