How to Validate File Names¶
Sometimes you need to make sure that files are well organized and conform
to some specific naming scheme. To validate the files names in a directory,
simply pass the names themselves as the data argument when calling
validate()
. You then control the method of validation with the
requirement you provide.
pathlib Basics
While there are multiple ways to get file names stored on disk, examples
on this page use the Standard Library’s pathlib
module. If you’re
not familiar with pathlib, please review some basic examples before
continuing:
Basic Examples
These examples assume the following directory structure:
├── file1.csv
├── file2.csv
├── file3.xlsx
└── directory1/
├── file4.csv
└── file5.xlsx
Import the Path
class:
>>> from pathlib import Path
Get a list of file and directory names from the current directory:
>>> [str(p) for p in Path('.').iterdir()]
['file1.csv', 'file2.csv', 'file3.xlsx', 'directory1']
Filter the results to just files, no directories, using an if
clause:
>>> [str(p) for p in Path('.').iterdir() if p.is_file()]
['file1.csv', 'file2.csv', 'file3.xlsx']
Get a list of path names ending in “.csv” from the current directory
using glob
-style pattern matching:
>>> [str(p) for p in Path('.').glob('*.csv')]
['file1.csv', 'file2.csv']
Get a list of CSV paths from the current directory and all subdirectories:
>>> [str(p) for p in Path('.').rglob('*.csv')] # <- Using "recursive glob".
['file1.csv', 'file2.csv', 'directory1/file4.csv']
Get a list of CSV names from the current directory and all subdirectories
using p.name
instead of str(p)
(excludes directory name):
>>> [p.name for p in Path('.').rglob('*.csv')] # <- p.name excludes directory
['file1.csv', 'file2.csv', 'file4.csv']
Get a list of file and directory paths from the parent directory
using the special name ..
:
>>> [str(p) for p in Path('..').iterdir()]
[ <parent directory names here> ]
Lowercase¶
Check that file names are lowercase:
1 2 3 4 5 6 7 8 9 10 11 | from pathlib import Path
from datatest import validate, working_directory
with working_directory(__file__):
file_names = (str(p) for p in Path('.').iterdir() if p.is_file())
def islower(x):
return x.islower()
validate(file_names, islower, msg='should be lowercase')
|
Lowercase Without Spaces¶
Check that file names are lowercase and don’t use spaces:
1 2 3 4 5 6 7 8 9 | from pathlib import Path
from datatest import validate, working_directory
with working_directory(__file__):
file_names = (str(p) for p in Path('.').iterdir() if p.is_file())
msg = 'Should be lowercase with no spaces.',
validate.regex(file_names, r'[a-z0-9_.\-]+', msg=msg)
|
Not Too Long¶
Check that the file names aren’t too long (25 characters or less):
1 2 3 4 5 6 7 8 9 10 11 12 | from pathlib import Path
from datatest import validate, working_directory
with working_directory(__file__):
file_names = (str(p) for p in Path('.').iterdir() if p.is_file())
def not_too_long(x):
"""Path names should be 25 characters or less."""
return len(x) <= 25
validate(file_names, not_too_long)
|
Check for CSV Type¶
Check that files are CSVs files:
1 2 3 4 5 6 7 8 9 10 11 | from pathlib import Path
from datatest import validate, working_directory
with working_directory(__file__):
file_names = (str(p) for p in Path('.').iterdir() if p.is_file())
def is_csv(x):
return x.lower().endswith('.csv')
validate(file_names, is_csv, msg='should be CSV file')
|
Multiple Files Types¶
Check that files are CSV, Excel, or DBF file types:
1 2 3 4 5 6 7 8 9 10 11 12 13 | from pathlib import Path
from datatest import validate, working_directory
with working_directory(__file__):
file_names = (str(p) for p in Path('.').iterdir() if p.is_file())
def tabular_formats(x): # <- Helper function.
"""Should be CSV, Excel, or DBF files."""
suffix = Path(x).suffix
return suffix.lower() in {'.csv', '.xlsx', '.xls', '.dbf'}
validate(file_names, tabular_formats)
|
Specific Files Exist¶
Using validate.superset()
, check that the list of file names
includes a given set of required files:
1 2 3 4 5 6 7 8 | from pathlib import Path
from datatest import validate, working_directory
with working_directory(__file__):
file_names = (str(p) for p in Path('.').iterdir() if p.is_file())
validate.superset(file_names, {'readme.txt', 'license.txt', 'config.ini'})
|
Includes Date¶
Check that file names begin with a date in YYYYMMDD format (e.g., 20201103_data.csv):
1 2 3 4 5 6 7 8 9 | from pathlib import Path
from datatest import validate, working_directory
with working_directory(__file__):
file_names = (p.name for p in Path('.').iterdir() if p.is_file())
msg = 'Should have date prefix followed by an underscore (YYYYMMDD_).'
validate.regex(file_names, r'^\d{4}\d{2}\d{2}_.+', msg=msg)
|
You can change the regex pattern to match another naming scheme of your choice. See the following examples for ideas:
description |
regex pattern |
example |
---|---|---|
date prefix |
|
2020-11-03_data.csv |
date prefix (no hyphen) |
|
20201103_data.csv |
date suffix |
|
data_2020-11-03.csv |
date suffix (no hyphen) |
|
data_20201103.csv |