How to Avoid Excel Automatic Formatting¶
When MS Excel opens CSV files (and many other tabular formats), its default behavior will reformat certain values as dates, strip leading zeros, convert long numbers into scientific notation, and more. There are many cases where these kinds of changes actually corrupt your data.
It is possible to control Excel’s formatting behavior using its Text Import Wizard. But as long as other users can open and re-save your CSV files, there may be no good way to guarantee that someone else won’t inadvertently corrupt your data with Excel’s default auto-format behavior. In a situation like this, you can mitigate problems by avoiding values that Excel likes to auto-format.
Using the Predicate
object below, you can check that values
are “Excel safe” and receive a list of differences when values are
vulnerable to inadvertent auto-formatting:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | import re
from datatest import validate, Predicate
# Predicate to check that elements are not subject
# to Excel auto-formatting.
excel_safe = ~Predicate(re.compile(r'''^(
# Date format character combinations.
\d{1,2}-(?:\d{1,2}|\d{4})
| (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[ \-]\d{1,2}
| [01]?[0-9]-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)
# Time conversions.
| [01]?[0-9][ ]?(AM?|PM?) # Twelve-hour clock.
| \d?\d[ ]*: # HH (hours).
| \d?\d[ ]*(:[ ]*\d\d?){1,2} # HH:MM and HH:MM:SS
# Numeric conversions.
| 0\d+\.?\d* # Number with leading zeros.
| \d*\.\d*0 # Decimal point with trailing zeros.
| \d*\. # Trailing decimal point.
| \d.?\d*E[+-]?\d+ # Scientific notation.
| \d{16,} # Numbers of 16+ digits get approximated.
# Whitespace normalization.
| \s.* # Leading whitespace.
| .*\s # Trailing whitespace.
| .*\s\s.* # Irregular whitespace (new in Office 365).
# Other conversions
| =.+ # Spreadsheet formula.
)$''', re.VERBOSE | re.IGNORECASE), name='excel_safe')
data = [
'AOX-18',
'APR-23',
'DBB-01',
'DEC-20',
'DNZ-33',
'DVH-50',
]
validate(data, excel_safe)
|
In the example above, we use excel_safe
as our requirement.
The validation fails because our data contains two codes that
Excel would auto-convert into date types:
ValidationError: does not satisfy excel_safe() (2 differences): [
Invalid('APR-23'),
Invalid('DEC-20'),
]
Fixing the Data¶
To address the failure, we need to change the values in data so they are no longer subject to Excel’s auto-formatting behavior. There are a few ways to do this.
We can prefix the failing values with apostrophes ('APR-23
and 'DEC-20
). This causes Excel to treat them as text instead
of dates or numbers:
34 35 36 37 38 39 40 41 42 43 44 | ...
data = [
"AOX-18",
"'APR-23",
"DBB-01",
"'DEC-20",
"DNZ-33",
"DVH-50",
]
validate(data, excel_safe)
|
Another approach would be to change the formatting for the all of
the values. Below, the hyphens in data have been replaced with
underscores (_
):
34 35 36 37 38 39 40 41 42 43 44 | ...
data = [
'AOX_18',
'APR_23',
'DBB_01',
'DEC_20',
'DNZ_33',
'DVH_50',
]
validate(data, excel_safe)
|
After making the needed changes, the validation will now pass without error.
Caution
The excel_safe
predicate implements a blacklist approach
to detect values that Excel will automatically convert. It is
not guaranteed to catch everything and future versions of Excel
could introduce new behaviors. If you discover auto-formatted
values that are not handled by this helper function (or if you
have an idea regarding a workable whitelist approach), please
file an issue and we will try to improve it.