Change delimiter in a csv file and remove line breaks in fields

I wrote a script to convert delimiters in CSV files, eg, commas to pipes. I prefer pipe-delimited files because the the pipe-delimiter (|) will not clash data in the different fields 99.999% of the time. I also added the option to convert newline () and carriage return () characters in the data fields to spaces. This comes in handy when I use PROC IMPORT in SAS as line breaks cause it to choke.

Here’s my script:

#! /usr/bin/env python

#### Command line arguments
import argparse
parser = argparse.ArgumentParser(description="Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.")
parser.add_argument("--dlm-input", action="store", dest="dlm_in", default=",", required=False, help="delimiter of the input file; defaults to comma (,)", nargs='?', metavar="','")
parser.add_argument("--dlm-output", action="store", dest="dlm_out", default="|", required=False, help="delimiter of the output file; defaults to pipe (|)", nargs='?', metavar="'|'")
parser.add_argument("--remove-line-char", action="store_true", dest="remove_line_char", default=False, help="remove \\n and \\r characters in fields and replace with spaces")
parser.add_argument("--quote-char", action="store", dest="quote_char", default='"', required=False, help="quote character; defaults to double quote (\")", nargs='?', metavar="\"")
parser.add_argument("-i", "--input", action="store", dest="input", required=False, help="input file; if not specified, take from standard input.", nargs='?', metavar="file.csv")
parser.add_argument("-o", "--output", action="store", dest="output", required=False, help="output file; if not specified, write to standard output", nargs='?', metavar="file.pipe")
parser.add_argument("-v", "--verbose", action="store_true", dest="verbose", default=False, help="increase verbosity")
args  =  parser.parse_args()
# print args

import argparse
import csv
import sys
from signal import signal, SIGPIPE, SIG_DFL #
signal(SIGPIPE,SIG_DFL) ## no error when exiting a pipe like less

if args.input:
    csv_reader = csv.reader(open(args.input, 'rb'), delimiter=args.dlm_in, quotechar=args.quote_char)
    csv_reader = csv.reader(sys.stdin, delimiter=args.dlm_in, quotechar=args.quote_char)

if args.output:
    h_outfile = open(args.output, 'wb')
    h_outfile = sys.stdout

for row in csv_reader:
    row = args.dlm_out.join(row)
    if args.remove_line_char:
        row  =  row.replace('\n', ' ').replace('\r', ' ')
    h_outfile.write("%s\n" % (row))
    # print row

Help description:

usage: [-h] [--dlm-input [',']] [--dlm-output ['|']]
                     [--remove-line-char] [--quote-char ["]] [-i [file.csv]]
                     [-o [file.pipe]] [-v]

Convert delimited file from one delimiter to another; defaults to converting CSV to pipe-delimited.

optional arguments: -h, --help show this help message and exit --dlm-input [','] delimiter of the input file; defaults to comma (,) --dlm-output ['|'] delimiter of the output file; defaults to pipe (|) --remove-line-char remove \n and \r characters in fields and replace with spaces --quote-char ["] quote character; defaults to double quote (") -i [file.csv], --input [file.csv] input file; if not specified, take from standard input. -o [file.pipe], --output [file.pipe] output file; if not specified, write to standard output -v, --verbose increase verbosity


cat myfile.csv | --remove-line-char > myfile.pipe

Best practices for importing a delimited file in SAS using the DATA step

The easiest way to import a delimited file (e.g., CSV) in SAS is to use PROC IMPORT:

proc import datafile="/path/to/my_file.txt"
    delimiter="|" ;
    guessingrows=32000 ;
run ;

PROC IMPORT isn’t a viable option when the fileref used in the datafile argument is not of the DISK type. For example, the fileref my_pipe would not work in the following example,

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;

because SAS needs “random access” to the fileref (i.e., to determine the variable type). PROC IMPORT also isn’t suitable when you have a very large data set where one of the columns might contain an element that has a very long length (and this might occur after the number of rows specified by guessingrows). Based on my experience, one should use the truncover, missover (don’t go to next line if line ends early), dsd (allow empty field) and lrecl (make this big for long lines; defaults to 256, which means your lines will be truncated if they are longer than 256 characters long) options in the infile statement to avoid unnecessary errors.

Since the infile is delimited, it is easy to import the fields using the list input method. However, one should use the length statement to declare the maximum length for each character variable, and use the informat statement for numeric variables that have special formats (date, dollar amount, etc.). I usually forget and just declare the informats following the variables in the input statement, which only works when we are inputting using the input pointer method (e.g., @27 my_var date9.). Here is an example:

filename my_pipe pipe "gunzip -c my_file.txt.gz" ;
data my_data ;
    infile my_file dlm="|" dsd truncover missover lrecl=50000 ;
        x2 $50
        x3 $25
        x4 date9.
        x4 date9.
        x2 $
        x3 $
run ;