Sometimes people generate delimited files with line break characters (carriage return and/or line feed) inside a field without quoting the field. I previously wrote about the case where the problematic fields are quoted. I also wrote about using non-ASCII characters as field and record separators to avoid clashes.
The following script reads from stdin and writes repaired lines to stdout. It ensures that every output line has at least as many delimiters (|) as the first (header) line; call this the target number of delimiters. It does so by repeatedly concatenating lines (replacing the line breaks with spaces) until concatenating the next line would yield more delimiters than the target. The script looks more complicated than it should because it has to handle two cases: a field that contains more than one line break (so it cannot just concatenate one line; it has to keep going), and a line that has more delimiters than the target (which could lead to an infinite loop if we required the number of delimiters to equal the target exactly).
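#! /usr/bin/env python
dlm = '|'

import sys
from signal import signal, SIGPIPE, SIG_DFL
# http://stackoverflow.com/questions/14207708/ioerror-errno-32-broken-pipe-python
signal(SIGPIPE, SIG_DFL)  ## no error when exiting a pipe like less

line = sys.stdin.readline()
n_dlm = line.count(dlm)  ## target number of delimiters, taken from the header
line0 = line             ## record accumulated so far, not yet written
line_next = 'a'          ## any non-empty placeholder; '' is reserved for end of input

while line:
    if line.count(dlm) > n_dlm or line_next == '':
        ## adding line_next overshot the target (or input ended): flush the record
        sys.stdout.write(line0)
        line = line_next  # line = sys.stdin.readline()
        if line.count(dlm) > n_dlm:  ## line with more delimiters than target?
            ## read ahead here, otherwise the outer test would stay true forever
            line0 = line_next
            line_next = sys.stdin.readline()
            line = line.replace('\r', ' ').replace('\n', ' ') + line_next
    else:
        ## still within the target: keep the record and try appending the next line
        line0 = line
        line_next = sys.stdin.readline()
        line = line.replace('\r', ' ').replace('\n', ' ') + line_next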