I implemented, what appears to me, a new way of parsing text. Python code.
The metaphor applied is to treat text as the base melody in a system of tracks, -- annotations to the melody. Compare: Track-based music editors.
So for example, you have a big long stream of bytes.
You make a channel called "lines," which has a track in the channel for each line.
So if you want to annotate a line, just attach data to the track instance for that line.
Then, you want to break the text into sections, so you make a new channel, and have a track for each section of the text. (Perhaps an H1 division, or something.)
Then you break the text into smaller sections.
The primary classes are: Tracker, Channel, Track, and Cursor.
What's great about this approach is that it is very intuitive, user-directed (and thus state-preserving,) and super-capable.
You can make a "write-out" channel. As you're interpreting the text (via a Cursor,) you also write down into the "write-out" channel how you want this part of the text to be written out, when the time comes to serialize out the changes you made. This makes perfect round-tripping easy to do, and you specify the write-behavior at the same time as you specify the read-behavior. You attach the data interpreted to the tracks corresponding to the actual read text, so when you write it back out, it is exactly as the user wrote it -- not just like your writing code things to write it out. (Sadly, there's nothing done about new data that the user didn't write, to make it consistent.)
If this makes enough sense and if you are interested, I'm happy to share code, document, explain, etc.,.
Because it's stateful, it's more powerful than regex parsing, in many ways. The trade is performance for ease of expression for programmers.
Another neat thing you can do is make a spontaneous hierarchy by casting channels one to the other. Say, "give me a projection from H1 -> H2 -> H3," and you get it back as nested lists. (Haven't implemented for more than one cast yet, but intending to.)
I'm not sure I understand the meanings of "track" and "channel".
Let me try on an example. We have a stream of of bytes:
0101010101010101010101010101010101010101010101...Now, we add a channel, which is? Suppose it's another stream, going in parallel:
0101010101010101010101010101011010101010101010...
::::::::::::::::::::::::::::::::::::::::::::::...Now we add a track for every line. What is a track here? A stream within the "channel" stream?
Here's an example of a "Tracker":
http://www.youtube.com/watch?v=XoS8fWO_EOs
Here's a stream of bytes:
Text: H e l l o , W o r l d ! \n H o w a r e y o u ?
Lines:[--------------------------] [------------------------]
WS: [] [] [] [] []
Granted, this is a pedantic example.
For it to really be useful, you have tracks for sections of a document, for particular sequences within text, ...
An example of use is:
tracker = Tracker("""
== Heading ==
Hello, world!
== Second Heading ==
Nice day we're having, eh?
=== Subheading ===
In Seattle, I mean.
""")
tracker.recognize_lines()
c = tracker.cursor("lines") # "lines" is now the reference channel, for c.advance()
while not c.done():
if (c.chars().startswith("==")
and c.chars().endswith("==")):
c.close("section") # closes heading track if there was one open
c.open("section", include_line = False) # opens the track, beginning on the next line
c.data("section")["name"] = c.chars()[2:-2].strip()
c.advance()print(len(tracker.channel("section"))) # How many sections are there?
print(tracker.channel("section")[1].chars()) # Prints the characters associated with track #2
L = tracker.channel("section")[1].children("lines") # Returns all line tracks within the given section
So, the cursor has .open(channel_name) and .close(channel_name), which is used to open or close a track on the given channel. You have easy access to the open (or pre-existing) tracks -- you just say c.data(channel_name) to make notes about it, as you're parsing.
You can say "open a track on the channel, and start on this very line," or "open a track on the channel, and start on the very next line." (default: Open at the beginning of this very line.)
You can say "close the track on this channel, at the very beginning of this line," or "close the track on this channel, terminating at the end of this line." (default: Close at the beginning of this very line.)
Note that the reference for ".advance" does not need to be lines. It can be any channel -- you could advance by single characters, by lines, by sections, by subsections, and so on -- and you can set up a cursor within a cursor (perhaps at a different fidelity of document interpretation.)
So, intuitively, it is like going over the text with highlighter pens, and making annotations in the highlighted regions.
What you get from doing this (that is kind of tricky to do with regex) is clear state. It's clear because our brains are already oriented to reading paragraphs of code that have nested for, while, if blocks.
It also keeps track of a lot of bookkeeping related to "do I want to include this line or not?" and corresponding between multiple views of the data.
You can make 1 pass to get one view, then another to get another view, and then another that uses both the text and the prior two views to assemble the more complex relationships.
But the code is very linear, so its much easier to understand what's going on (at the cost of efficiency in execution.)
And: Tracks don't only need to be about reading and interpretation -- you can have tracks that specify write behavior.
If you are interested, I can upload the code to CommunityWiki, along with some examples of use.
That didn't come out right, I'd need to use a fixed width font, or like you have done with the 1's and zeros.
Hello, World! How are you? <- text channel
:::::::::::::.:::::::::::: <- lines channel (2 tracks)
......:......:...:...:.... <- WS channel (4 tracks)
:::::..:::::..:::.:::.:::. <- words channel
You can attach data to the channels.
The code to make lines, for example, is:
cursor = tracker.cursor() # defaults to char reference
cursor.open("lines")
while not cursor.done()
if cursor.char() == "\n":
cursor.close("lines")
cursor.open("lines")Actually, you can shorten it even further -- there's an option on cursor.open to say "close any existing open tracks before reopening."
This is so trivial, and so common, I just have a single function on tracker that automatically performs this operation for you. (There's a subclass of Track called LineTrack that has some special operations for this common case as well.)
It's silly to do here, but you can say:
tracker.channel("words").chars()
(returns:) ["Hello", "World", "How", "are", "you"]
tracker.channel("lines").children("words")
(returns:) [[<Track>, <Track>], [<Track>, <Track>, <Track>]]
[[x.chars() for x in L] for L in tracker.channel("lines").children(words)]
(returns:) [["Hello", "World"], ["How", "are", "you"]]]
Again, this is all rather silly when you're doing "Hello, world!", but when you have multi-sectioned documents, where the sections themselves contain different types of sub-sections (clearly delineated or otherwise,) these mechanics are lifesavers.
I came to this model after asking myself, "How can I conveniently work with the text content, in a linear way, while simultaneously recognizing all the different levels and types of hierarchy?"
For example, there are headers, footers, sections of the page, blocks of key-value pairs (or YAML or JSON or...), paragraphs, lines, comment regions, tables, rows, cells, spans (like italic and bold,) and all these different channels of information, that you want immediate access to.
You don't want a grammar system, because they're just way too bulky and hard to work with, and you have to get everything perfectly right or it breaks. Further, they always have a "we'll call you, don't call us" attitude, and I find myself fitting little shards into a machine. I much prefer to direct the parse.
Regexes break on this kind of stuff -- they can't see nested structures very easily, and state ("Which one am I in the middle of right now?") -- and furthermore, they can be pretty fragile.
The basic built-ins in Python give you a bit of reach, but leave you hand-hacking if you're making a complex parser, and you get very little re-use.
Mertz (David Mertz, Text Processing in Python) can help to a degree, if you're into functional programming, but I found limits to it.
This way gets you state, reusable systems (you can take your "highlighters" -- your code that creates tracks in channels,) and easily reapply them in further parses. Knowing nothing more than an index in the text, you immediately get all the knowledge in all the associated channels: You don't have to dig in and reconstruct context, or search for the index within nested parse structures.
I just really really like this system.
That's a pretty interesting way of storing the parse tree -- by recording the spans of branches at every level of the tree. How do you construct the tree (decide which tracks begin and end where) -- by multi-pass top-down scanning, starting at the root of the tree, and working down through more and more detailed tracks?
Multi-pass top-down scanning, essentially.
There's nothing saying you can't do single-pass with the assistance of a given grammar, and do it that way.
(I can imagine applying multiple grammars as well; So long as the channels they use don't over-ride one another.)
It's all about, "What's easiest? What makes most sense to me?" My files are almost all short, so I do not worry about performance -- rather, I'm far more concerned with how long it takes me to write a parser.
I'm interested in having both a person (myself, basically) and the computer (my program) being able to interact with the text -- hence my interest in round-tripping. So the data that is read is more like annotations to the text, rather than separate data structures. When the computer makes changes, they are re-applied in the places where they were recorded, and when the computer makes additions, they are generally applied at the end of the track that represents the structure as a whole.
I've written a Table class that reads tables into name & index accessible rows and columns, that is serialized back out exactly as it was serialized in, and that can be modified and have entries extended and so on. (A feature I'd like to add is the ability to have a "delete column," so that when an entry is deleted, it simply puts an "XX" in the delete column. That way, as a person, you can still read the text file and see the deleted column, and finally delete if you like, but so that it is not really required.)
Understand, this is all just play at this point. The code is messy, and some things are inconsistent. But I think it's interesting, and makes good test material for the board. :)
My idea was originally born out of frustration with database software, and a (probably irrational) dislike of database software that doesn't work like I think (Python lists and dictionaries,) that is binary (that I can't understand how to interpret,) and that is at the whims of version numbers (upgrading databases and so on.) I looked at CouchDB, but thought, "what the heck, this is just for fun, to see where it goes."