How to Clean and Format Text Online
Messy text is everywhere. You copy a paragraph from a PDF and it has line breaks in the middle of every sentence. You paste data from a spreadsheet and it comes with trailing spaces, invisible tab characters, and inconsistent whitespace. You grab content from a web page and it brings along HTML tags, non-breaking spaces, and zero-width characters. Every time text moves between applications, formatting breaks.
This guide covers what text cleaning is, how to handle the most common formatting problems, how to clean text programmatically in multiple languages, and the specific scenarios that generate the messiest text.
What Is Text Cleaning?
Text cleaning is the process of removing unwanted characters, normalizing whitespace, stripping formatting artifacts, and standardizing text into a consistent, usable format. It includes collapsing multiple spaces into single spaces, removing blank lines, trimming leading and trailing whitespace, stripping HTML tags, normalizing line endings, and converting special characters (smart quotes, em dashes, non-breaking spaces) to their standard ASCII equivalents.
You would use text cleaning after copying from PDFs, word processors, spreadsheets, emails, and web pages. Any time text crosses an application boundary, formatting artifacts accumulate. Cleaning removes these artifacts so the text works correctly in its new context — whether that is a database field, a code file, a plain-text email, or a CMS entry.
How to Clean Text with FlipMyCase
- Open the FlipMyCase Text Cleaner.
- Paste your messy text.
- Toggle the cleaning options you need: remove extra spaces, remove blank lines, remove line breaks, strip HTML tags, trim whitespace.
- Copy the cleaned result.
Everything runs in your browser. The tool handles Unicode correctly, including non-breaking spaces (U+00A0), zero-width characters, and smart quotes. For case-related cleanup, use the Case Converter alongside the text cleaner.
Code Examples for Text Cleaning
JavaScript
function cleanText(text, options = {}) {
let result = text;
// Normalize line endings
result = result.replace(/\r\n/g, '\n').replace(/\r/g, '\n');
// Strip HTML tags
if (options.stripHTML) {
result = result.replace(/<[^>]+>/g, '');
}
// Normalize smart quotes and special chars
result = result
.replace(/[\u2018\u2019]/g, "'") // smart single quotes
.replace(/[\u201C\u201D]/g, '"') // smart double quotes
.replace(/\u2013/g, '-') // en dash
.replace(/\u2014/g, '--') // em dash
.replace(/\u00A0/g, ' ') // non-breaking space
.replace(/[\u200B-\u200D\uFEFF]/g, ''); // zero-width chars
// Collapse multiple spaces
if (options.collapseSpaces) {
result = result.replace(/ {2,}/g, ' ');
}
// Remove blank lines
if (options.removeBlankLines) {
result = result.replace(/\n{3,}/g, '\n\n');
result = result.replace(/^\n+/, '').replace(/\n+$/, '');
}
// Trim each line
if (options.trimLines) {
result = result.split('\n').map(line => line.trim()).join('\n');
}
return result.trim();
}
const messy = ' Hello world \n\n\n Extra spaces \n here ';
console.log(cleanText(messy, {
collapseSpaces: true,
removeBlankLines: true,
trimLines: true
}));
// Hello world\n\nExtra spaces\nhere
Python
import re
import unicodedata
def clean_text(text, strip_html=False, collapse_spaces=True,
remove_blank_lines=True, trim_lines=True):
result = text
# Normalize line endings
result = result.replace('\r\n', '\n').replace('\r', '\n')
# Strip HTML tags
if strip_html:
result = re.sub(r'<[^>]+>', '', result)
# Normalize smart quotes and special characters
replacements = {
'\u2018': "'", '\u2019': "'", # smart quotes
'\u201C': '"', '\u201D': '"',
'\u2013': '-', '\u2014': '--',
'\u00A0': ' ', # non-breaking space
}
for old, new in replacements.items():
result = result.replace(old, new)
# Remove zero-width characters
result = re.sub(r'[\u200B-\u200D\uFEFF]', '', result)
# Collapse multiple spaces
if collapse_spaces:
result = re.sub(r' {2,}', ' ', result)
# Trim each line
if trim_lines:
result = '\n'.join(line.strip() for line in result.split('\n'))
# Remove excess blank lines
if remove_blank_lines:
result = re.sub(r'\n{3,}', '\n\n', result)
return result.strip()
messy = ' Hello world \n\n\n Extra spaces \n here '
print(clean_text(messy))
# Hello world
#
# Extra spaces
# here
Bash
# Remove extra spaces
echo "Hello world test" | tr -s ' '
# Hello world test
# Remove blank lines
sed '/^$/d' input.txt > output.txt
# Strip HTML tags
sed 's/<[^>]*>//g' page.html > plaintext.txt
# Trim trailing whitespace from every line
sed 's/[[:space:]]*$//' input.txt > output.txt
# Normalize smart quotes to ASCII
sed "s/[\x91\x92]/'/g; s/[\x93\x94]/\"/g" input.txt > output.txt
# Full cleanup pipeline
cat input.txt \
| sed 's/<[^>]*>//g' \
| tr -s ' ' \
| sed 's/^[[:space:]]*//;s/[[:space:]]*$//' \
| sed '/^$/d' \
> cleaned.txt
Real-World Use Cases
Text copied from PDFs. PDFs are the single biggest source of messy text. They break lines based on page width, not sentence boundaries, so a single paragraph has line breaks every 60-80 characters. They add extra spaces around justified text, insert hidden characters for hyphenation, and use non-standard quote characters. Paste the text into the Text Cleaner, enable line break removal and space normalization, and get a clean paragraph.
Data from spreadsheets. Copying cells from Excel or Google Sheets introduces trailing spaces, tab characters between columns, and inconsistent whitespace. When you paste this into a database, CMS, or code file, the invisible characters cause matching failures, display issues, and data quality problems. Clean the text before pasting it into its destination.
Web page content extraction. Copying text from a website brings along HTML tags, entities, zero-width characters used for text shaping, and CSS-driven formatting that does not translate to plain text. The HTML tag stripping option in the text cleaner removes all markup, and the smart quote normalization converts curly quotes to straight ones. For HTML-to-Markdown conversion specifically, use the HTML to Markdown tool.
Email chain cleanup. Forwarded email threads accumulate > quote markers, extra blank lines, signature blocks, and messy formatting from each forwarding step. Clean the text to extract just the content, then use the Duplicate Remover to strip repeated sections.
Common Mistakes and Gotchas
The most common mistake is stripping line breaks when you need them. If your text has intentional paragraph breaks, removing all line breaks joins everything into one unreadable wall of text. The Text Cleaner preserves paragraph breaks (double line breaks) while removing mid-sentence line breaks (single line breaks), but test with your specific text first.
Invisible characters are the sneakiest problem. Non-breaking spaces (U+00A0), zero-width spaces (U+200B), and byte-order marks (U+FEFF) look identical to regular spaces or nothing at all, but they cause string comparisons to fail, CSS to break, and data imports to error. The text cleaner strips these automatically, but if you are writing code, explicitly handle Unicode whitespace characters.
Smart quotes cause encoding errors when moving between systems. Microsoft Word converts straight quotes to curly quotes ("" and '') which use different Unicode code points. These display incorrectly in systems expecting ASCII, and they break JSON strings that expect standard double quotes. Always normalize smart quotes to ASCII equivalents before using text in code or data formats.
Stripping HTML tags naively can lose information. If the HTML contains <br> tags for line breaks or <li> tags for list items, removing all tags joins everything into a single line. For better results, convert <br> to newlines and <li> to bullet points before stripping remaining tags.
Frequently Asked Questions
How do I clean text from a specific application? Each application has its own formatting quirks. PDFs add mid-sentence line breaks. Word adds smart quotes and special bullet characters. Excel adds trailing spaces and tabs. The Text Cleaner handles all of these with its toggle options. For persistent issues from a specific source, create a cleaning script using the code examples above.
Can I clean text without removing paragraph breaks? Yes. The "remove blank lines" option targets excess blank lines (three or more in a row) while preserving paragraph breaks (double line breaks). The "remove line breaks" option only removes single line breaks, not double ones. This preserves your paragraph structure while cleaning mid-sentence breaks.
How do I clean text in bulk? For a few blocks of text, paste them into the Text Cleaner one at a time. For automated bulk processing, use the Python or JavaScript code examples above in a script that reads files from a directory, cleans them, and writes the results. The Bash pipeline example processes files of any size efficiently.
What is the difference between the Text Cleaner and the Plain Text Converter? The Text Cleaner removes formatting artifacts while preserving the text structure. The Plain Text Converter strips all formatting and converts rich text to completely unformatted plain text. Use the cleaner for minor cleanup and the converter for complete format stripping.
Conclusion
Text cleaning is one of the most underrated productivity operations. Five seconds of cleaning prevents hours of debugging from invisible characters, broken formatting, and inconsistent whitespace. Whether you are copying from PDFs, spreadsheets, web pages, or email chains, cleaning the text before using it in its destination saves time and prevents errors.
The FlipMyCase Text Cleaner handles all common formatting issues in your browser with no setup. For automated pipelines, the JavaScript, Python, and Bash examples above cover every cleaning operation programmatically. Combine with the Word Counter to verify content length after cleaning, or the Duplicate Remover to deduplicate lines after normalization.