~dricottone/fmg-timesheets: parser/xml.py 6899c67fa4f55e9d7a82f6b2cd780bc43185f3fc

cab4c597 — Dominic Ricottone 2 years ago

Data pipeline error

Some (sub)total rows were being mistaken for actual hour entries.
Previously I cut off all elements ordered after the 'Hours Distribution
by Time Code' marker. This is insufficient because (1) some elements can
sit slightly higher than the marker and (2) some elements can float to
the previous page.

I've fixed (1) by fudging the y-dimension numbers. (2) appears to only
impact a single timesheet (2019-06-15). I've added some more debugging
output to help diagnose and communicate this issue.
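
A minimal sketch of the fix for (1), not the code from this commit. It
assumes each element carries a page number and a y coordinate measured
from the bottom of the page (so larger y is higher up). The Element
class, the drop_summary_rows function, and the FUDGE value are all
hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass

MARKER = "Hours Distribution by Time Code"
FUDGE = 2.0  # hypothetical tolerance, in PDF points


@dataclass
class Element:
    page: int
    y: float  # assumed bottom-left origin: larger y is higher on the page
    text: str


def drop_summary_rows(elements, fudge=FUDGE):
    """Drop every element at or below the marker on the marker's page.

    The cutoff is raised by `fudge` so that (sub)total rows sitting
    slightly higher than the marker are dropped as well.
    """
    # Record the marker's y coordinate for each page that has one.
    marker_y = {e.page: e.y for e in elements if e.text.startswith(MARKER)}

    kept = []
    for e in elements:
        cutoff = marker_y.get(e.page)
        if cutoff is not None and e.y <= cutoff + fudge:
            continue  # marker itself, or a summary row near/below it
        kept.append(e)
    return kept


elements = [
    Element(1, 500.0, "8.00"),    # real hour entry
    Element(1, 300.0, MARKER),
    Element(1, 301.5, "40.00"),   # subtotal, slightly above the marker
    Element(1, 250.0, "16.00"),   # subtotal, below the marker
    Element(2, 100.0, "4.00"),    # page without a marker
]
kept_texts = [e.text for e in drop_summary_rows(elements)]
```

This sketch does nothing about (2): a summary row that floats to the
previous page has no marker on its own page to compare against, so it
is kept.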

8ebab2a4 — Dominic Ricottone 2 years ago

Fully functional timesheet parser.

The timesheet parser is a complete success. Some minor issues were
ironed out in the XML parser as well.

Next steps: writing to a time series database and beginning analysis.

ae939a28 — Dominic Ricottone 2 years ago

Goodbye HTML, hello XML

Replaced HTML exporting/parsing with XML exporting/parsing. Also
replaced the 'high-level' function call with 'low-level' pdfminer
usage.

The XML parser handles validation and suppression of header/footer
content on its own.

From the PDF parser, XML is dumped to a file. From the XML parser, CSV
is dumped to a file. The new timesheet parser should read in that CSV
file.
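
The XML-to-CSV step could be sketched roughly as follows. This is an
illustration under assumptions, not the contents of parser/xml.py: the
element names (page, textline, text) and the bbox attribute follow a
simplified layout resembling pdfminer's XML output, and the function
name xml_to_csv and the CSV columns are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_csv(xml_text):
    """Flatten per-character XML into one CSV row per text line."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["page", "x0", "y0", "text"])
    for page in root.iter("page"):
        for line in page.iter("textline"):
            # Assumed bbox format: "x0,y0,x1,y1"
            x0, y0, _x1, _y1 = line.get("bbox").split(",")
            # Join the individual characters back into a string.
            text = "".join(t.text or "" for t in line.iter("text")).strip()
            if text:
                writer.writerow([page.get("id"), x0, y0, text])
    return out.getvalue()


sample = (
    '<pages><page id="1"><textbox>'
    '<textline bbox="10.0,700.0,50.0,712.0">'
    "<text>8</text><text>.</text><text>0</text><text>\n</text>"
    "</textline></textbox></page></pages>"
)
csv_text = xml_to_csv(sample)
```

Keeping the page number and coordinates in the CSV lets a later stage,
such as the timesheet parser, filter rows by position without reading
the XML again.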