Tutorial How to tokenize a String effiently

Discussion in 'Resources' started by lycano, Mar 13, 2015.

Thread Status:
Not open for further replies.
  1. Offline

    lycano

    Hi,

    i just stumbled over a very well explained article that describes how you can tokenize a String in a very effient manner using a LinkedList and Regular Expressions.

    This is very useful when you want to parse a configuration file that not only has key=value lines but also blocks that dont fit into the standard key=value assignment. In short you can write your own parser within minutes when you understand the principle behind this.

    For performance: the Regular expression pattern is compiled and then stored for quick access.

    External link: http://cogitolearning.co.uk/?p=525

    Everything you need to know is explained there. I think you can learn a lot from this!

    Before you read the article:

    You eventually will read this line:

    Code:java
    1.  
    2. tokenizer.tokenize(" sin(x) * (1 + var_12) ");
    3.  


    At some point he tells you how to configure the tokenizer and that you can use the first tokenizer.add call to define your "functions" aka the first matching rule i.e. "sin|cos|exp|ln|sqrt" and store it as reference with number 1.

    The next definitions are used to define how the pattern should look like i.e. after a function (sin, cos ..) comes always an open braket (store it as ref 2) then a number (ref 3) and then a closing braket (ref 4).

    The Tokenizer will traverse through the LinkedList and execute each pattern on the line and eventually get a hit. If not it will throw a ParserException which you can simply deactivate in production. It essentially means "No match". The Exception Message is for debugging and you can also catch that to do whatever you want to do when you expected a match.

    Back to Topic: You can use this to create a parser that parses a file in a key value manner and extract all sequences by reference numbers. i.e. get all values for x.

    If you dont know much about regular expression you would probably either have used a loop and define your rules along the line like if you find "word=" then the next one is a number or a string ...

    You would probably even use regular expressions but they would have to be compiled at runtime which makes parsing very slow if you have many rules to apply.

    This tends to go very unstructured over time since if you do it "if then else" you might run into problems where you need to exclude some values you might encounter.

    The result is that you will never be able to maintain your code as it grows. We want maintainable code that can be read by others too.

    To test the Tokenizer he has attached source files which you can checkout. If you want to play around with this a bit note that you need to parse line by line since the mathing is fixated in Tokenizer class to "startswith".

    So go and read file contents and use buffered reader to create a List of Strings where each entry is representing a line of your file.

    Pseudo: fi is your file object
    Code:java
    1.  
    2. List<String> fileLines = new ArrayList<String>();
    3. BufferedReader bufferedReader = new BufferedReader(new FileReader(fi));
    4. for (String s = ""; (s = bufferedReader.readLine()) != null;) {
    5. fileLines.add(s);
    6. }
    7. bufferedReader.close();
    8.  


    Then create another function that takes a List<String> as parameter. Define your tokinzer parameters.

    Code:java
    1.  
    2. Tokenizer tokenizer = new Tokenizer();
    3. tokenizer.add("Name|LastName", 1);
    4. tokenizer.add("=", 2);
    5. tokenizer.add("\\w+", 3);
    6.  


    Foreach through the string array and utilize the tokenizer. Get the results.

    Have fun playing around with this.

    Something i did not mention in the article why i use this.

    Im currently rewriting WormholeXTreme and i have a shape file that has a specific format. Currently the values have to be in the exact order or the shape file is invalid. With this system i simply define my rules and get the result regardless of the location of the strings.

    For example

    Code:java
    1.  
    2. Tokenizer tokenizer = new Tokenizer();
    3. tokenizer.add("Value", 1);
    4. tokenizer.add("=", 2);
    5. tokenizer.add("[0-9]+", 3);
    6.  


    after running through the file i can easily get the value of the key Value regardless of its position.

    See attachment.

    Horizontal.shape (open)

    Code:
    # The name for this shape
    Name=Horizontal
    # Version 2 of shape files allows for many new things.
    # 3D gates, new format for blocks, woosh and light order etc
    Version=2
    
    # GateShape now needs "Layer" lines
    # Each Layer requires a #number= and then a newline.
    # Blocks can only be placed into layers.
    # a 2D gate would have only 1 layer.
    # Acceptable blocks are:
    #  [I] = Ignored
    #  [S] = Stargate Material
    #  [P] = Air blocks that will turn into the portal material when activated.
    
    #  Extra parameters:
    #  --- These parameters are 1 of each per gate ---
    #  :N = Block where the name sign will be created. This is optional.
    #  :EP = Block where players teleport in at. The players feet will be on this block.
    #  :EM = Block where minecarts teleport in at. The minecart wheels will be on this block.
    #  :A = Block where the activation switch is attached to. 1 per gate!
    #  The only restriction is that the block that faces it must be "I" (so nothing is in the way)
    #  The switch will face in the positive layer direction.
    #  In this example the switch will face towards where layer 3 would be (if there was a 3rd layer)
    #  :D = Block the sign dialer hangs from. Only 1 per gate!
    #  The only restriction is that the block that faces it must be "I" (so nothing is in the way)
    #  This block is not required, so shapes with this block can be either type. (sign or dial)
    #  Without this block a gateshape can only be /dial gate.
    #  :IA = Iris Activation Switch - Not required unless you want to be able to place an Iris on the gate.
    #
    #  IA, D, N, and A cannot be the same block, and none of those can contain W
    #
    #  --- There can be many of these per gate -- (Currently no restriction)
    #  :L = Blocks that will light when gate is activated
    #  Optionally you may add a #number after L to indicate the order it lights.
    #  Defaults to 1 if there is no #
    #  :W = Blocks that will woosh when gate is activated
    #  Optionally you may add a #number after W to indicate the order it wooshes in.
    #  Defaults to 1 if there is no # and the wait between numbers is configurable below.
    #  After all #s are active it removes them in reverse order but
    #  if a block is [P:W] the block will stay as portal material until gate is shutdown.
    #
    #  Redstone Blocks:
    #  --- There can only be 1 of each of these per gate, and they can-not occupy the same block as anything else ---
    #  [RD] = Redstone activation block. A redstone charge next to this block will activate the gate.
    #  This block requires a :D block for targetting. This block should be on top of a [S] block.
    #  [RS] = Redstone sign dialer cycle block. A redstone charge next to this block will cycle sign targets.
    #  This block requires a :D block for targetting. This block should be on top of a [S] block.
    #  [RA] = Redstone gate Activated block. This block will provide redstone charge when the gate is activated.
    #  This block should be on top of a [S] block.
    
    GateShape=
    
    Layer#1=
    [I][I][I][I][I][I][I]
    [I][I][I][I][I][I][I]
    [I][I][I][I][I][I][I]
    [I][I][S][S:L#7][S][I][I]
    
    Layer#2=
    [I][I][I][I][I][I][I]
    [I][I][I][I:W#2][I][I][I]
    [I][I][I:W#1][I:W#1][I:W#1][I][I]
    [I][S:L#5][P][P][P][S:L#2][I]
    
    Layer#3=
    [I][I][I][I:W#3][I][I][I]
    [I][I][I:W#2][I:W#2][I:W#2][I][I]
    [I][I:W#1][I:W#1][I:W#1][I:W#1][I:W#1][I]
    [S][P][P][P][P][P][S]
    
    Layer#4=
    [I][I][I][I:W#3][I][I][I]
    [I][I][I:W#2][I:W#2][I:W#2][I][I]
    [I][I:W#1][I:W#1][I:W#1][I:W#1][I:W#1][I]
    [S:L#4][P][P][P:EP][P][P][S:L#6]
    
    Layer#5=
    [I][I][I][I:W#3][I][I][I]
    [I][I][I:W#2][I:W#2][I:W#2][I][I]
    [I][I:W#1][I:W#1][I:W#1][I:W#1][I:W#1][I]
    [S][P][P][P][P][P][S]
    
    Layer#6=
    [I][I][I][I][I][I][I]
    [I][I][I][I:W#2][I][I][I]
    [I][I][I:W#1][I:W#1][I:W#1][I][I]
    [I][S:L#3][P][P][P][S:L#1][I]
    
    Layer#7=
    [I][I][I][I][I][I][I]
    [I][I][I][I][I][I][I]
    [I][I][I][I][I][I][I]
    [I][I][S:A][S:N:L#8][S:IA][I][I]
    
    # Number of ticks to wait before activating each # of the woosh. 1 tick = ~50ms
    WOOSH_TICKS = 3;
    # Number of ticks to wait before activating each # of the lights. 1 tick = ~50ms
    LIGHT_TICKS = 3;
    
    # None of the follow materials are required, they will default if not set.
    # Portal material the material the [P] blocks will be when gate is active.
    # Suggested values are as follows: STATIONARY_WATER, STATIONARY_LAVA, PORTAL, and AIR
    PORTAL_MATERIAL=STATIONARY_WATER
    
    # Iris material is the material the [P] block become when iris is active
    # Suggested values are as follows: STONE, DIAMOND_BLOCK, GLASS, IRON_BLOCK, BEDROCK, OBSIDIAN, GLOWSTONE and LAPIS_BLOCK
    IRIS_MATERIAL=GLASS
    
    # Stargate material is the material the [O] blocks are made of
    # Reasonable values are as follows: STONE, DIAMOND_BLOCK, GLASS, IRON_BLOCK, BEDROCK, OBSIDIAN, GLOWSTONE and LAPIS_BLOCK
    STARGATE_MATERIAL=OBSIDIAN
    
    # Active material is the material that :L blocks become when gate is active
    # Suggested Values are as follows: GLOWSTONE and GLOWING_REDSTONE_ORE
    ACTIVE_MATERIAL=GLOWSTONE
    
    # Redstone activated is the parameter to allow redstone to the DHD lever to activate the gate.
    REDSTONE_ACTIVATED=FALSE
    
    


    When you do test your regexes during runtime the above example should be enclosed inside try catch block catching PatternSyntaxException or you will wonder why certain code will not be executed.

    I.e.

    Code:java
    1.  
    2. try {
    3. Tokenizer tokenizer = new Tokenizer();
    4. tokenizer.add("Name|Version", 1);
    5. tokenizer.add("=", 2);
    6. tokenizer.add("\\w+|[0-9]+", 3);
    7. } catch (PatternSyntaxException e) {
    8. // Caught PatternSyntaxException: e.getMessage();
    9. }
    10.  


    Note that the tokenizer is in fact a tokenizer when use OR in regex note that in this example REDSTONE_ACTIVATED will be matched in group 3 since its testing all patterns and assigning them to numbers. So when using \\w+ as third pattern it would still match and get you something like this:

    Code:
    3 REDSTONE_ACTIVATED
    2 =
    3 FALSE
    
    So be careful to check the index before assuming that all must match. With some effort it would be possible implement logic so that you get a result based on AND. Maybe you will come up with a solution =)

    A working example "how to implement this and make use the matching results" can be found here: https://gist.github.com/lycano/35c46308895377d0278a
     
    Last edited: Mar 13, 2015
  2. Offline

    Phasesaber

    This is... weird.... What does this have to do with Bukkit? It's more like a tutorial on making a lexer... :p
     
  3. Online

    timtower Administrator Administrator Moderator

    Locked as this isn't for Bukkit
     
Thread Status:
Not open for further replies.

Share This Page