## Repair line breaks within a field of a delimited file

Sometimes some people generate delimited files with line break characters (carriage return and/or line feed) inside a field without quoting. I previously wrote about the case when the problematic fields are quoted. I also wrote about using non-ascii characters as field and new record indicators to avoid clashes.

The following script reads in stdin and writes to stdout repaired lines by ensuring every output line has at least the number of delimiters (|) as the first/header line (call this the target number of delimiters) by continually concatenating lines (remove line breaks) until it reaches the point when concatenating the next line would yield more delimiters than the target number of delimiters. The script appears more complicated than it should be in order to address the case when there are more than one line breaks in a field (so don’t just concatenate one line but keep doing so) and the case when a line has more delimiters than the target number of delimiter (this could lead to an infinite loop if we restrict the number of delimiters to equal the target).

#! /usr/bin/env python

dlm='|'

import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

n_dlm = line.count(dlm)

line0 = line
line_next = 'a'
while line:
if line.count(dlm) > n_dlm or line_next=='':
sys.stdout.write(line0)
line = line_next
if line.count(dlm) > n_dlm: ## line with more delimiters than target?
line0 = line_next
line = line.replace('\r', ' ').replace('\n', ' ') + line_next
else:
line0 = line
line = line.replace('\r', ' ').replace('\n', ' ') + line_next


## Change delimiter in a csv file and remove line breaks in fields

I wrote a script to convert delimiters in CSV files, eg, commas to pipes. I prefer pipe-delimited files because the the pipe-delimiter (|) will not clash data in the different fields 99.999% of the time. I also added the option to convert newline () and carriage return () characters in the data fields to spaces. This comes in handy when I use PROC IMPORT in SAS as line breaks cause it to choke.

Here’s my csvconvert.py script:

#! /usr/bin/env python

#### Command line arguments
import argparse
parser = argparse.ArgumentParser(description="Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.")
parser.add_argument("--dlm-input", action="store", dest="dlm_in", default=",", required=False, help="delimiter of the input file; defaults to comma (,)", nargs='?', metavar="','")
parser.add_argument("--dlm-output", action="store", dest="dlm_out", default="|", required=False, help="delimiter of the output file; defaults to pipe (|)", nargs='?', metavar="'|'")
parser.add_argument("--remove-line-char", action="store_true", dest="remove_line_char", default=False, help="remove \\n and \\r characters in fields and replace with spaces")
parser.add_argument("--quote-char", action="store", dest="quote_char", default='"', required=False, help="quote character; defaults to double quote (\")", nargs='?', metavar="\"")
parser.add_argument("-i", "--input", action="store", dest="input", required=False, help="input file; if not specified, take from standard input.", nargs='?', metavar="file.csv")
parser.add_argument("-o", "--output", action="store", dest="output", required=False, help="output file; if not specified, write to standard output", nargs='?', metavar="file.pipe")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose", default=False, help="increase verbosity")
args  =  parser.parse_args()
# print args

# http://snipplr.com/view/45759/convert-csv-file-to-pipe-delineated-file/
import argparse
import csv
import sys
from signal import signal, SIGPIPE, SIG_DFL # http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

if args.input:
else:

if args.output:
h_outfile = open(args.output, 'wb')
else:
h_outfile = sys.stdout

row = args.dlm_out.join(row)
if args.remove_line_char:
row  =  row.replace('\n', ' ').replace('\r', ' ')
h_outfile.write("%s\n" % (row))
h_outfile.flush()
# print row


Help description:

usage: csvconvert.py [-h] [--dlm-input [',']] [--dlm-output ['|']]
[--remove-line-char] [--quote-char ["]] [-i [file.csv]]
[-o [file.pipe]] [-v]

Convert delimited file from one delimiter to another; defaults to converting
CSV to pipe-delimited.

optional arguments:
-h, --help            show this help message and exit
--dlm-input [',']     delimiter of the input file; defaults to comma (,)
--dlm-output ['|']    delimiter of the output file; defaults to pipe (|)
--remove-line-char    remove \n and \r characters in fields and replace with
spaces
--quote-char ["]      quote character; defaults to double quote (")
-i [file.csv], --input [file.csv]
input file; if not specified, take from standard
input.
-o [file.pipe], --output [file.pipe]
output file; if not specified, write to standard
output
-v, --verbose         increase verbosity


Usage:

cat myfile.csv | csvconvert.py --remove-line-char > myfile.pipe


## Reorder columns in delimited file using awk

Reordering/re-arranging columns in a delimited file could be easily attained using awk:

cat myfile.txt | awk 'BEGIN{FS="," OFS="|";} {print $1,$2,$4,$5,$6;}' > out.txt  ## Delimited file where delimiter clashes with data values A comma-separated values (CSV) file is a typical way to store tabular/rectangular data. If a data cell contain a comma, then the cell with the commas is typically wrapped with quotes. However, what if a data cell contains a comma and a quotation mark? To avoid such scenarios, it is typically wise to use a delimiter that has a low chance of showing up in your data, such as the pipe (“|”) or caret (“^”) character. However, there are cases when the data is a long string with all sorts of data characters, including the pipe and caret characters. What then should the delimiter be in order to avoid a delimiter collision? As the Wikipedia article suggests, using special ASCII characters such as the unit/field separator (hex: 1F) could help as they probably won’t be in your data (no keyboard key that corresponds to it!). Currently, my rule of thumb is to use pipe as the default delimiter. If the data contains complicated strings, then I’ll default to the field separator character. In Python, one could refer to the field separator as ‘\ x1f’. In R, one could refer to it as ‘\ x1F’. In SAS, it could be specified as ‘1F’x. In bash, the character could be specified on the command line (e.g., using the cut command, csvlook command, etc) by specifying$’1f’ as the delimiter character.

If the file contains the newline character in a data cell (\n), then the record separator character (hex: 1E) could be used for determining new lines.

## Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

proc import datafile="/path/to/my_file.txt"
out=work.my_data
dbms=dlm
replace
;
delimiter="|" ;
guessingrows=32000 ;
run ;


PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;


because SAS needs “random access” to the fileref (i.e., to determine the variable type). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element that has a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover, missover (don’t go to next line if line ends early), dsd (allow empty field) and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters long) options in the infile statement to avoid unnecessary errors.

Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
infile my_file dlm="|" dsd truncover missover lrecl=50000 ;
length
x2 $50 x3$25
;
informat
x4 date9.
;
format
x4 date9.
;
input
x1
x2 $x3$
x4
;
run ;


## Avoid data truncation in SAS when importing CSV files

SAS’s Proc Import is great for importing a CSV or other delimited files:things just “work” most of the time. We don’t need to specify variable names, variable type, etc. However, data truncation or mis-matched variable type can happen as the procedure determines the data type and length of the variables based on the first few rows of the delimited file.

As this post suggests, one could use the guessingrows=32767; statement in Proc Import so SAS uses the first 32k rows to determine data type and length.

Alternatively, the safer solution would be to import the delimited file by using the Data step and explicitly use the length statement with a long length option to ensure that no truncation occurs (e.g., length my_var \$100). One would also need to specify the data type with the input statement here as well. Note: Do not specify the variable length using the input statement here because SAS might read in characters from other fields as it starts reading from the last delimiter all the way to the character length.