~dricottone/fmg-timesheets: dev - dominic-ricottone.com git

b10830e9 — Dominic Ricottone 2 years ago dev

README updates

6899c67f — Dominic Ricottone 2 years ago

With data scraping complete, moving on to analysis

Basic summation of projects, as proof of concept

Basic SAS program for importing CSV data and storing as time series data

cab4c597 — Dominic Ricottone 2 years ago

Data pipeline error

Some (sub)total rows were being mistaken for actual hour entries.
Previously I cut off all elements ordered after the 'Hours Distribution
by Time Code' marker. This is insufficient because (1) some of those
elements can sit slightly higher on the page than the marker and (2)
some can float to the previous page.

I've fixed (1) by fudging the y-dimension numbers. (2) appears to only
impact a single timesheet (2019-06-15). I've added some more debugging
to help diagnose and communicate this issue.
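
A minimal sketch of the cutoff with a fudge factor, not the code from
this commit. The `Element` class, its `top` field (distance from the top
of the page) and the `FUDGE` value are all hypothetical. It only covers
issue (1); elements that float to the previous page are not caught.

```python
from dataclasses import dataclass


@dataclass
class Element:
    text: str
    page: int
    top: float  # assumed: distance from the top of the page


MARKER = "Hours Distribution by Time Code"
FUDGE = 5.0  # hypothetical tolerance, in the same units as `top`


def drop_totals(elements, fudge=FUDGE):
    """Keep only elements that sit clearly above the marker."""
    marker = next((e for e in elements if e.text.startswith(MARKER)), None)
    if marker is None:
        # No marker found: nothing to cut off.
        return list(elements)
    # Raise the cutoff a little so rows that start slightly above the
    # marker are dropped along with everything below it.
    cutoff = marker.top - fudge
    return [
        e
        for e in elements
        if e.page < marker.page or (e.page == marker.page and e.top < cutoff)
    ]
```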

721dfa4e — Dominic Ricottone 2 years ago

Adding exporters

Wrote and tested the long CSV exporter. Stubbed out the JSON exporter.
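
For illustration, a long-format export could look like the sketch
below: one row per time code and day. The column names and the shape of
`entries` (time code mapped to a dict of date to hours) are assumptions,
not the exporter's real interface.

```python
import csv
import io
from datetime import date


def to_long_csv(entries, out):
    """Write one row per (time code, date) pair to the file-like `out`."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["timecode", "date", "hours"])
    for code, hours_by_date in entries.items():
        # Sort by date so each time code's rows come out in order.
        for day, hours in sorted(hours_by_date.items()):
            writer.writerow([code, day.isoformat(), hours])


# Usage with made-up data:
buf = io.StringIO()
to_long_csv({"PROJ-A": {date(2019, 6, 11): 2.5, date(2019, 6, 10): 8.0}}, buf)
```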

8ebab2a4 — Dominic Ricottone 2 years ago

Fully functional timesheet parser.

The timesheet parser is a complete success. Some minor issues were
ironed out in the XML parser as well.

Next steps: writing to a time series database and beginning analysis.

ae939a28 — Dominic Ricottone 2 years ago

Goodbye HTML, hello XML

Replaced HTML exporting/parsing with XML exporting/parsing. Also
replaced the 'high-level' function call with 'low-level' pdfminer
usage.

The XML parser handles validation and suppression of header/footer
content on its own.

From the PDF parser, XML is dumped to a file. From the XML parser, CSV
is dumped to a file. The new timesheet parser should read in that CSV
file.
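
A rough sketch of the XML-to-CSV step using only the standard library.
It assumes the dumped XML nests `page` > `textbox` > `textline` > `text`
with one character per `text` element and a comma-separated `bbox`
attribute on each `textline`; the CSV column names are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def textlines_to_csv(xml_text, out):
    """Flatten each text line into one CSV row with its bounding box."""
    root = ET.fromstring(xml_text)
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["page", "x0", "y0", "x1", "y1", "text"])
    for page in root.iter("page"):
        for line in page.iter("textline"):
            # Characters are assumed to be stored one per <text> element.
            text = "".join(t.text or "" for t in line.iter("text")).strip()
            if not text:
                continue
            x0, y0, x1, y1 = line.get("bbox").split(",")
            writer.writerow([page.get("id"), x0, y0, x1, y1, text])
```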

e4ae39d2 — Dominic Ricottone 2 years ago

Minor debug/toolchain update

9efdae59 — Dominic Ricottone 2 years ago

Implemented time entry extraction; no assert errors!

There is still a major issue standing in the way of 'structured' data:
hours data is leaking between entries. There are entries with no hours
at all. There are almost certainly some entries that have hours out of
order.

It will likely be necessary to re-sort all items ahead of processing
based on top then left style attributes. This is going to have the
consequence of invalidating some of the work I've already put into
parsing the data as-is.
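
A sketch of what such a re-sort might look like, assuming each item is
a `(text, style)` pair and the style string carries pixel values such as
`left:50px; top:10px`. The helper names are hypothetical.

```python
import re


def style_px(style, name):
    """Return the pixel value of one property from an inline style string."""
    match = re.search(rf"(?:^|;)\s*{name}\s*:\s*(-?\d+(?:\.\d+)?)px", style)
    if match is None:
        raise ValueError(f"no {name!r} in style {style!r}")
    return float(match.group(1))


def reading_order(items):
    """Sort (text, style) pairs top to bottom, then left to right."""
    return sorted(
        items,
        key=lambda item: (style_px(item[1], "top"), style_px(item[1], "left")),
    )
```

Sorting on exact `top` values assumes items in the same row share the
same offset; rows whose offsets differ by a pixel or two would need
rounding or bucketing first.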

Good luck, future me.

7326d280 — Dominic Ricottone 2 years ago

Started implementing time entry extraction.

Time entries are now being parsed and validated, though there are
numerous issues still to sort out.

I have a feeling that further development will require passing around
the `top` style attributes in the same way I'm passing around the `left`
style attributes. TBD though.

1c07f9a5 — Dominic Ricottone 2 years ago

Remove some debug stuff

f441822f — Dominic Ricottone 2 years ago

Significant updates

Wrote time sheet parser that ingests and validates all semi-structured
data. Next step is to interpret left styles as dates, so that hours can
be parsed into a time entry object.
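
One way this could work, as a sketch only: match each hours cell to the
date column whose `left` offset is closest. The `columns` mapping
(header offset to date) and the `tolerance` value are assumptions.

```python
from datetime import date


def nearest_date(left, columns, tolerance=3.0):
    """Return the date of the closest column, or None if none is close."""
    best = min(columns, key=lambda offset: abs(offset - left))
    if abs(best - left) > tolerance:
        # Too far from every header to be a trustworthy match.
        return None
    return columns[best]


# Usage with made-up offsets:
columns = {100.0: date(2019, 6, 10), 140.0: date(2019, 6, 11)}
day = nearest_date(141.5, columns)
```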

Updated HTML parser to more completely filter out unhelpful data, and to
internally build the array of pairs (data and left style).

15f788d9 — Dominic Ricottone 2 years ago

Toolchain upgrades

Migrated to `venv` for the Python module dependencies.

Added `README` to begin documenting the process.

c041ec57 — Dominic Ricottone 2 years ago

Initial commit