Parsing Obsession

by 22. April 2010 07:40

This past weekend I had an interview in Austin, TX with a pretty badass company.
The architects hit me with a question that made me cock my head to the side like
a cocker spaniel that just heard a violin. The question wasn't brain surgery, but
it was complex enough that it required a lot of thought, and I got a little obsessed.
I had to code it today.

Here was the question (not verbatim, but close):

Write a method that takes one line from a comma-delimited file and returns its
fields as a list of strings. Here's the data:

Bill,Brown,"austin, tx", 123, """Jr."""

Output should be:
Bill
Brown
austin, tx
123
"Jr."

I started to pseudocode it using some regex, and they didn't seem to think that would
work. I tried a couple more routes. In the end we talked it over, and one of the
architects said he had solved this by rolling through each character in the line. That
seemed like a perfectly reasonable way, but I was really hung up on a couple of things.

1.) I think that sounds like a lot of CPU cycles for this operation
2.) I wanted to do this without rolling through each character, because that's how I
    roll.
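
For reference, the character-by-character approach the architect described could look something like the sketch below. This is only a guess at its shape, not his actual code: `CharParser` and `ParseLine` are made-up names, and skipping whitespace in front of a field is an assumption added so that ` 123` and ` """Jr."""` come out the way the expected output shows.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharParser
{
    // Hypothetical helper (not from the interview): walks the line once,
    // tracking whether the current character sits inside a quoted field.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var field = new StringBuilder();
        bool inQuotes = false;  // currently between an opening and closing quote
        bool sawQuote = false;  // the current field has had a quoted section

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    field.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    // a doubled quote inside a quoted field is one literal quote
                    field.Append('"');
                    i++;
                }
                else
                {
                    // a lone quote closes the quoted section
                    inQuotes = false;
                }
            }
            else if (c == '"')
            {
                inQuotes = true;
                sawQuote = true;
            }
            else if (c == ',')
            {
                // an unquoted comma ends the current field
                fields.Add(field.ToString());
                field.Clear();
                sawQuote = false;
            }
            else if (char.IsWhiteSpace(c) && field.Length == 0 && !sawQuote)
            {
                // assumption: ignore whitespace before the field starts
            }
            else
            {
                field.Append(c);
            }
        }

        fields.Add(field.ToString()); // the last field has no trailing comma
        return fields;
    }
}
```

Calling `CharParser.ParseLine` on the interview line should give the five fields from the expected output. It touches each character exactly once, so the cost grows linearly with the length of the line.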


So today my OCD got the best of me... and here's what fell out:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Diagnostics;

namespace TestLib
{
    public class ParserThing
    {
        // Test harness: parses a sample line repeatedly and logs the elapsed time.
        public ParserThing()
        {
            DateTime start = DateTime.Now;
            //string dirtyLine = @"Bob,Brown,""""Jr."""",""dayton,oh,N"",123-45-6789";
            string dirtyLine = @"Bill,Black,""""""Sr."""",,\/some/thing\/,"""",,""123 Street Ave.,West. SomeCity,OH 45454"",""000000-0000""";
            int executions = 100;
            List<string> words = new List<string>();
            for (int i = 1; i <= executions; i++)
            {
                words = CleanIt(dirtyLine);
            }
            TimeSpan ts = DateTime.Now - start;
            Debug.WriteLine("Execution Time: " + Math.Floor(ts.TotalMilliseconds).ToString());
            foreach (string word in words)
            {
                Debug.WriteLine("Word: " + word);
            }
        }

        // Splits on every comma first, then stitches the pieces of quoted
        // fields back together.
        public static List<string> CleanIt(string rawLine)
        {
            const string QUO = @"""";
            const string DELIMITER = ",";
            List<string> retVal = new List<string>();
            // matches any run of two or more quote characters
            Regex regx = new Regex(@"""{2,}");
            string[] words = rawLine.Split(new string[] { DELIMITER }, StringSplitOptions.None);
            bool isInQuoteBlock = false;
            StringBuilder finalWord = new StringBuilder();
            string tmpWord = string.Empty;
            string word = string.Empty;
            for (int i = 0, l = words.Length - 1; i <= l; i++)
            {
                word = words[i];
                tmpWord = words[i];
                if (word.Contains(QUO))
                {
                    if (regx.IsMatch(word))
                    {
                        // collapse each run of quotes down to a single quote
                        // (note: an empty quoted field "" also becomes a single quote)
                        tmpWord = regx.Replace(word, @"""");
                    }
                    else
                    {
                        // lone quotes only wrap the field, so drop them
                        tmpWord = word.Replace(@"""", "");
                    }
                    finalWord.Append(tmpWord);
                    if (word.StartsWith(@"""") && !word.EndsWith(@""""))
                    {
                        // this is a partial word
                        finalWord.Append(DELIMITER);
                        isInQuoteBlock = true;
                    }
                    else if (word.EndsWith(@"""") || i == l)
                    {
                        // this is the end of a block
                        isInQuoteBlock = false;
                        retVal.Add(finalWord.ToString());
                        finalWord.Length = 0; // clear sb
                    }
                }
                else if (isInQuoteBlock)
                {
                    // middle piece of a quoted field: put back the comma that Split removed
                    finalWord.Append(word + DELIMITER);
                }
                else
                {
                    // plain unquoted field
                    retVal.Add(word);
                }
            }
            return retVal;
        }
    }
}

Important to note: I wrote this in a pretty short amount of time. There are probably some shortcuts I could take in here, and I don't like the if/else if/else chain. I'm not going to spend any more time on this, but damn it, it works, it works well, and it's reasonably fast. On my box, I can parse 100k iterations in about 1200ms.

I didn't get the job, btw, but that's OK; it wasn't the right time. I got to see Austin, and I got to interview at a really awesome place. I'm hating BlogEngine, btw; I need to move this blog to something else.
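
One caveat on that timing number: the code above measures with `DateTime.Now`, which can have fairly coarse resolution. `System.Diagnostics.Stopwatch` is generally the better tool for timing short runs. A minimal sketch, assuming a hypothetical helper called `Timing.MeasureMilliseconds` (not part of the code above):

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs the action the given number of times and
    // returns the total elapsed time in whole milliseconds.
    public static long MeasureMilliseconds(Action action, int iterations)
    {
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }
}
```

Something like `Timing.MeasureMilliseconds(() => ParserThing.CleanIt(dirtyLine), 100000)` would then time 100k parses of the sample line. The result will vary from machine to machine.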

