Delimited file where delimiter clashes with data values

A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contain a comma, then the cell with the commas is typically wrapped with quotes. However, what if a data cell contains a comma and a quotation mark? To avoid such scenarios, it is typically wise to use a delimiter that has a low chance of showing up in your data, such as the pipe (“|”) or caret (“^”) character. However, there are cases when the data is a long string with all sorts of data characters, including the pipe and caret characters. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, using special ASCII characters such as the unit/field separator (hex: 1F) could help as they probably won’t be in your data (no keyboard key that corresponds to it!).

Currently, my rule of thumb is to use pipe as the default delimiter. If the data contains complicated strings, then I’ll default to the field separator character. In Python, one could refer to the field separator as ‘\ x1f’. In R, one could refer to it as ‘\ x1F’. In SAS, it could be specified as ‘1F’x. In bash, the character could be specified on the command line (e.g., using the cut command, csvlook command, etc) by specifying $’1f’ as the delimiter character.

If the file contains the newline character in a data cell (\n), then the record separator character (hex: 1E) could be used for determining new lines.

About Vinh Nguyen

Statistician

3 comments

  1. This is really only an issue if you are writing the CSV yourself. Any good CSV writer will have a mechanism to escape the quote character if it occurs in the data. In R, write.csv() will use two quotes to signify an actual quote in the data. For example, > data = c(‘plain’, ‘has,comma’, ‘comma,and,”quote” and \’singlequote\”) > f = textConnection(‘test’, ‘w’) > write.csv(data, file=f, row.names=FALSE) > cat(test, sep=’\n’) “x” “plain” “has,comma” “comma,and,””quote”” and ‘singlequote'”

    1. I’m dealing with data where conversation notes are stored in a field. They use all sorts of characters in there: commas, quotes, pipes, carets, etc.

      Data are stored in a DB, and I’m using a sql client to get the data out to a text file (pyodbc seems to work best). Because of that, I find it convenient to use the unit separator character as the delimiter.

  2. Kent is correct, this is a problem that has been solved many times and you should leverage existing solutions rather than try and circumvent a problem that does not exist. RFC 4180 is the de-facto standard in delimited file handling (http://tools.ietf.org/html/rfc4180), and addresses all these concerns.

    Worse yet, by using a less standard delimiter and code that doesn’t handle embedded quotes or delimiters, you just make problems more rare, which can actually make things significantly harder to debug, and you reduce portability and support for a wide audience of users who can handle comma delimited but cannot handle custom delimiters.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>