This past weekend I had an interview in Austin, TX with a pretty badass company.
The architects hit me with a question that made me cock my head to the side like
a cocker spaniel that just heard a violin. The question wasn't brain surgery, but
it was complex enough that it required a lot of thought, and I got a little obsessed.
I had to code it today.
Here was the question (not verbatim, but close):
Write a method that returns a list of strings from one line of a comma-delimited
file. Here's the data:
Bill,Brown,"austin, tx", 123, """Jr."""
Output should be:
Bill
Brown
austin, tx
123
"Jr."
I started to pseudocode it using some regex, and they didn't seem to think that would
work. I went a couple more routes. In the end we talked it through, and one of the
architects said he had solved this by rolling through each character in the line. That
seemed like a perfectly reasonable way, but I was really hung up on a couple of things.
1.) I think that sounds like a lot of CPU cycles for this operation.
2.) I wanted to do this without rolling through each character, because that's how I
roll.
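For reference, the character-by-character route the architect described might look roughly like this. It's a sketch, not his actual code: CharWalkParser and ParseLine are made-up names, and it assumes a doubled quote inside a quoted field means a literal quote and that whitespace around an unquoted field is just padding.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CharWalkParser
{
    // Hypothetical helper: splits one CSV line by visiting each character once.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;   // currently between an opening and closing quote
        bool wasQuoted = false;  // the current field started with a quote

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c); // commas in here are plain text
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote = one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == ',')
            {
                // End of field: quoted fields are kept verbatim, bare ones are trimmed.
                fields.Add(wasQuoted ? current.ToString() : current.ToString().Trim());
                current.Length = 0;
                wasQuoted = false;
            }
            else if (c == '"' && !wasQuoted && string.IsNullOrWhiteSpace(current.ToString()))
            {
                // Opening quote: drop any padding collected so far.
                current.Length = 0;
                inQuotes = true;
                wasQuoted = true;
            }
            else if (!wasQuoted)
            {
                current.Append(c);
            }
            // Anything between a closing quote and the next comma is ignored.
        }

        fields.Add(wasQuoted ? current.ToString() : current.ToString().Trim());
        return fields;
    }
}
```

Fed the line from the question, ParseLine should give back Bill, Brown, austin, tx (as one field), 123 and "Jr.". It is one pass with a couple of flags, which is probably why it was the architect's go-to.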
So today my OCD got the best of me... and here's what fell out:
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Text;
    using System.Text.RegularExpressions;

    namespace TestLib
    {
        public class ParserThing
        {
            // Two or more consecutive quotes; built once instead of on every call.
            private static readonly Regex QuoteRun = new Regex(@"""{2,}");

            public ParserThing()
            {
                //string dirtyLine = @"Bob,Brown,""""Jr."""",""dayton,oh,N"",123-45-6789";
                string dirtyLine = @"Bill,Black,""""""Sr."""",,\/some/thing\/,"""",,""123 Street Ave.,West. SomeCity,OH 45454"",""000000-0000""";
                int executions = 100; // raise this for a meaningful timing run
                List<string> words = new List<string>();

                // Stopwatch is better suited to timing than subtracting DateTime.Now values.
                Stopwatch timer = Stopwatch.StartNew();
                for (int i = 1; i <= executions; i++)
                {
                    words = CleanIt(dirtyLine);
                }
                timer.Stop();

                Debug.WriteLine("Execution Time: " + timer.ElapsedMilliseconds);
                foreach (string word in words)
                {
                    Debug.WriteLine("Word: " + word);
                }
            }

            public static List<string> CleanIt(string rawLine)
            {
                const string QUO = @"""";
                const string DELIMITER = ",";
                List<string> retVal = new List<string>();
                // Split on every comma first, then stitch quoted fields back together.
                string[] words = rawLine.Split(new string[] { DELIMITER }, StringSplitOptions.None);
                bool isInQuoteBlock = false;
                StringBuilder finalWord = new StringBuilder();

                for (int i = 0, l = words.Length - 1; i <= l; i++)
                {
                    // Outside a quoted block, leading whitespace is just padding after the comma.
                    string word = isInQuoteBlock ? words[i] : words[i].TrimStart();
                    if (word.Contains(QUO))
                    {
                        // Collapse runs of quotes to a single quote; otherwise strip the lone quotes.
                        string tmpWord = QuoteRun.IsMatch(word)
                            ? QuoteRun.Replace(word, QUO)
                            : word.Replace(QUO, "");
                        finalWord.Append(tmpWord);
                        if (word.StartsWith(QUO) && !word.EndsWith(QUO))
                        {
                            // this is a partial word
                            finalWord.Append(DELIMITER);
                            isInQuoteBlock = true;
                        }
                        else if (word.EndsWith(QUO) || i == l)
                        {
                            // this is the end of a block
                            isInQuoteBlock = false;
                            retVal.Add(finalWord.ToString());
                            finalWord.Length = 0; // clear sb
                        }
                    }
                    else if (isInQuoteBlock)
                    {
                        // a middle piece of a quoted field: put the comma back
                        finalWord.Append(word + DELIMITER);
                    }
                    else
                    {
                        // plain unquoted field
                        retVal.Add(word.TrimEnd());
                    }
                }
                return retVal;
            }
        }
    }
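One thing the split-and-stitch approach has to get right is knowing where a quoted field really ends. CleanIt collapses runs of quotes, but a field like "a ""b"" c" comes back with its outer quotes still on, and an empty quoted field ("") comes back as a single quote. Here is a rough sketch of the same split-on-comma idea that decides by counting quotes instead. QuoteParityParser, ParseLine and Unquote are names invented for illustration, and it assumes well-formed input where quotes are balanced and a doubled quote means a literal quote.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QuoteParityParser
{
    // Hypothetical variant: split on commas, then glue pieces back together
    // until the number of quotes seen so far is even.
    public static List<string> ParseLine(string line)
    {
        var fields = new List<string>();
        string pending = null; // partially assembled quoted field, or null

        foreach (string piece in line.Split(','))
        {
            // Put back the comma that Split removed if we are mid-field.
            string candidate = pending == null ? piece : pending + "," + piece;

            // An odd quote count means a quoted field is still open.
            if (candidate.Count(ch => ch == '"') % 2 != 0)
            {
                pending = candidate;
                continue;
            }

            pending = null;
            fields.Add(Unquote(candidate.Trim()));
        }

        // Unbalanced quotes: keep whatever was collected rather than drop it.
        if (pending != null)
        {
            fields.Add(Unquote(pending.Trim()));
        }
        return fields;
    }

    // Strips one enclosing pair of quotes and un-doubles the quotes inside.
    private static string Unquote(string field)
    {
        if (field.Length >= 2 && field[0] == '"' && field[field.Length - 1] == '"')
        {
            return field.Substring(1, field.Length - 2).Replace("\"\"", "\"");
        }
        return field;
    }
}
```

Counting quotes does look at every character, so this isn't really cheaper than walking the line; what it buys is a single "is the field finished?" check instead of the if/else if/else chain. It also leans on balanced quotes, so a line with an odd number of quotes ends up as one long trailing field.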
Important to note: I wrote this in a pretty short amount of time. There are probably some shortcuts I could take in here, and I don't like the if/else if/else use in this. I'm not going to spend any more time on it, but dammit, it works, it works well, and it's reasonably fast. On my box, I can parse 100k iterations in about 1200ms.

I didn't get the job, btw, but that's ok, it wasn't the right time. I got to see Austin, and I got to interview at a really awesome place.

I'm hating blogengine, btw. I need to change this blog to something else.