Data pipeline error. Some (sub)total rows were being mistaken for actual hour entries. Previously I discarded every element ordered after the 'Hours Distribution by Time Code' marker. That is insufficient because (1) some summary elements can sit slightly higher on the page than the marker and (2) some can float onto the previous page. I've fixed (1) by adding a small tolerance to the y-coordinate comparison. (2) appears to affect only a single timesheet (2019-06-15). I've added more debugging output to help diagnose and communicate this issue.
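A minimal sketch of the y-tolerance cutoff from the entry above, assuming PDF-style coordinates where y grows upward from the bottom of the page. The `Element` class, `drop_summary_rows`, and the 2.0-point default tolerance are hypothetical names and values chosen for illustration, not the project's actual code. It handles a single page only, so it does not address case (2).

```python
from dataclasses import dataclass

MARKER_TEXT = "Hours Distribution by Time Code"


@dataclass
class Element:
    text: str
    y: float  # assumed: vertical position, larger values are higher on the page


def drop_summary_rows(elements, y_tolerance=2.0):
    """Keep only elements that sit clearly above the summary marker.

    y_tolerance (hypothetical default) widens the cutoff upward so that
    summary rows laid out slightly higher than the marker are dropped too.
    """
    marker = next((e for e in elements if MARKER_TEXT in e.text), None)
    if marker is None:
        # No summary section found on this page: nothing to cut.
        return list(elements)
    cutoff = marker.y + y_tolerance
    return [e for e in elements if e.y > cutoff]
```

A tolerance that is too large would start discarding genuine hour entries, so the value is best chosen from the observed gap between the last entry row and the marker.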
Fully functional timesheet parser. The timesheet parser is a complete success. Some minor issues in the XML parser were ironed out as well. Next steps: write to a time-series database and begin analysis.
Goodbye HTML, hello XML. Replaced HTML exporting/parsing with XML exporting/parsing. Also replaced the 'high-level' function call with 'low-level' pdfminer usage. The XML parser handles validation and suppression of header/footer content on its own. The PDF parser dumps XML to a file, and the XML parser dumps CSV to a file. The new timesheet parser should read in that CSV file.
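A rough sketch of the XML-to-CSV stage, using only the standard library. It assumes the XML resembles pdfminer's XML output (`page` elements containing `textline` elements, each with a `bbox` attribute and per-character `text` children). The function name `textlines_to_csv` and the CSV column layout are hypothetical, and the validation and header/footer suppression mentioned above are left out.

```python
import csv
import io
import xml.etree.ElementTree as ET


def textlines_to_csv(xml_text, out):
    """Write one CSV row per text line: page id, bounding box, text."""
    root = ET.fromstring(xml_text)
    writer = csv.writer(out)
    # Hypothetical column layout for the downstream timesheet parser.
    writer.writerow(["page", "x0", "y0", "x1", "y1", "text"])
    for page in root.iter("page"):
        for line in page.iter("textline"):
            # Each <text> child is assumed to hold a single character.
            text = "".join(t.text or "" for t in line.iter("text")).strip()
            if not text:
                continue
            x0, y0, x1, y1 = line.get("bbox").split(",")
            writer.writerow([page.get("id"), x0, y0, x1, y1, text])
```

Keeping the page id and bounding box in each row lets the timesheet parser apply position-based filtering later without re-reading the PDF.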