Guess strftime format strings¶
https://github.com/jameysharp/percentagent
This resembles and is inspired by the parser module of dateutil, in that it attempts to parse structured dates and times out of arbitrary strings. However, this implementation has several advantages:
- Infers strftime(3)-style format strings
- Returns all ambiguous parses
- Supports many languages and locales out of the box
- Reports which locales could have produced the input
- Additional locales or parsing hints can be provided by example
You should use dateutil instead if you don’t need any of those features, because this library also has some disadvantages (although these may get fixed over time):
- No test suite yet, while dateutil is well-tested
- This library takes a few milliseconds to parse one input, while dateutil takes a few hundred microseconds
- Timezone offsets and abbreviations are recognized but not yet reported out of the library
Format strings¶
This library returns strftime(3)-style format strings, as well as the corresponding datetime object (or a date or time if that’s all that could be found in the input) for each possible format string.
The format strings are useful if you have several different examples of strings produced by a single, unknown, format string. A single date/time string may be ambiguous (such as “y/m/d” versus “y/d/m”). But odds are good that if you see a few more samples in the same format, only one format string will explain all of them.
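The narrowing-down idea can be sketched with nothing but the standard library. This is not how the library works internally: the candidate list below is hard-coded for illustration (the library would infer it), and the helper names `formats_explaining` and `common_formats` are hypothetical.

```python
from datetime import datetime

# Hypothetical candidate formats; the library would infer these for you.
CANDIDATES = ["%Y-%m-%d", "%Y-%d-%m"]


def formats_explaining(sample, candidates=CANDIDATES):
    """Return the candidate formats that successfully parse `sample`."""
    matching = set()
    for fmt in candidates:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue  # this format cannot explain the sample
        matching.add(fmt)
    return matching


def common_formats(samples, candidates=CANDIDATES):
    """Keep only the formats that explain every sample."""
    remaining = set(candidates)
    for sample in samples:
        remaining &= formats_explaining(sample, candidates)
    return remaining


# One sample is ambiguous; a second sample rules out year-day-month order.
print(sorted(common_formats(["2018-01-09"])))                # ['%Y-%d-%m', '%Y-%m-%d']
print(sorted(common_formats(["2018-01-09", "2018-05-13"])))  # ['%Y-%m-%d']
```

The second sample eliminates `%Y-%d-%m` because 13 is not a valid month number.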
Ambiguous inputs¶
If an input string is ambiguous, such as when it’s unclear whether a date uses day/month or month/day order, this library returns all possibilities. You can implement your own heuristics to decide which one is best for your application.
By contrast, dateutil picks one interpretation, and provides options letting you guide which one it will pick.
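As one example of such a heuristic, you might prefer the interpretation closest to a date you already trust, such as a file’s modification date. The sketch below works on a result list copied from the `parse("2018-01-09")` example in the API section; the function name `pick_closest` is hypothetical, and it assumes every candidate value is a `datetime.date`.

```python
import datetime

# Result list copied from the parse("2018-01-09") example in the API section.
results = [
    ("%Y-%m-%d", datetime.date(2018, 1, 9), None),
    ("%Y-%d-%m", datetime.date(2018, 9, 1), None),
]


def pick_closest(results, reference):
    """Pick the (format, value, locales) tuple whose date is nearest `reference`.

    Assumes every value is a datetime.date; mixed date/time results
    would need extra handling.
    """
    return min(results, key=lambda result: abs(result[1] - reference))


best = pick_closest(results, datetime.date(2018, 1, 15))
print(best[0])  # %Y-%m-%d
```

Other applications may want a different rule entirely, for example rejecting dates in the future.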
Broad locale support¶
This library has a fair shot at handling dates in a wide range of languages, without any configuration.
I’ve extracted comprehensive data about how dates are formatted around the world from the GNU C library locale database. The script which does that is utils/lc_time, if you want to run it on your own system with a POSIX-conforming implementation of locale(1). If your system’s locale database includes locales or format-string examples that glibc doesn’t, we can merge the extracted data to make this library support even more kinds of input.
This library will also tell you which locales could have been used to produce the input you hand it. That gives you an additional data point if you’re comparing different date strings to determine if they were generated using the same format. You could also use the suggested locales as a hint about the language of the surrounding text, or the most likely timezones used in the locale’s primary country.
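A minimal sketch of comparing locale hints across several date strings follows. The hint values are made up for illustration, the helper `common_locales` is hypothetical, and treating `None` as “no constraint” is an assumption based on the API examples, where `None` appears when there are no locale hints.

```python
def common_locales(locale_hints):
    """Intersect locale hints, treating None as 'no constraint' (an assumption).

    Returns None if no input carried any hint at all.
    """
    common = None
    for hint in locale_hints:
        if hint is None:
            continue  # this input said nothing about its locale
        common = set(hint) if common is None else common & set(hint)
    return common


# Made-up hints for three date strings believed to share one format.
hints = [frozenset({"en_US", "en_GB"}), None, frozenset({"en_US"})]
print(common_locales(hints))  # {'en_US'}
```

An empty result would suggest the strings were not produced with the same locale.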
License¶
This repository includes glibc.json, a file produced by extracting selected data from the GNU C Library locale database. The source files for that database include the following text:
This file is part of the GNU C Library and contains locale data. The Free Software Foundation does not claim any copyright interest in the locale data contained in this file. The foregoing does not affect the license of the GNU C Library as a whole. It does not exempt you from the conditions of the license if your use would otherwise be governed by that license.
Therefore, I believe the derived data is not subject to the license of the GNU C Library. To be clear, I also do not claim any copyright interest in the locale facts in the above file.
API¶
class percentagent.DateParser(locale_set=None)
Infer strftime(3)-style format strings that could have produced given date and/or time strings.
If you don’t provide a locale set, then one will be constructed for you with TimeLocaleSet.default(). This is usually what you want, but you can construct special-purpose sets if needed.
This class precomputes some large data structures when constructed, so you should reuse the same instance for multiple parses, if possible.
Instances of this class may safely be used from multiple threads.
Parameters: locale_set (TimeLocaleSet) – locales to consider when parsing timestamps
parse(s)
Infer format strings for a single timestamp.
For example:
>>> parser = DateParser(TimeLocaleSet())
>>> parser.parse("2018-01-09")
[('%Y-%m-%d', datetime.date(2018, 1, 9), None), ('%Y-%d-%m', datetime.date(2018, 9, 1), None)]
That output indicates that the input can be explained by either year-month-day order or year-day-month order, and there are no hints indicating which locale the date was formatted for.
On the other hand, “13” is too large to be a month number, so this example is unambiguous:
>>> parser.parse("2018-05-13")
[('%Y-%m-%d', datetime.date(2018, 5, 13), None)]
Separators are not required between conversions, but they can help with ambiguity:
>>> sorted(fmt for fmt, value, locales in parser.parse("210456"))
['%H%M%S', '%d%m%y']
>>> parser.parse("21-04-56")
[('%d-%m-%y', datetime.date(2056, 4, 21), None)]
>>> parser.parse("21:04:56")
[('%H:%M:%S', datetime.time(21, 4, 56), None)]
Using locale-specific strings can help avoid ambiguity too:
>>> parser = DateParser(TimeLocaleSet(
...     mon={"Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec": ["en_US"]},
... ))
>>> parser.parse("2018Jan9")
[('%Y%b%d', datetime.date(2018, 1, 9), frozenset({'en_US'}))]
Parameters: s (str) – text which contains a date and/or time
Returns: possible format strings, and corresponding locales
Return type: list(tuple(str, set(str) or None))
class percentagent.TimeLocaleSet(formats=None, day=None, mon=None, am_pm=None, alt_digits=None, era=None)
Structured information about how a set of locales express dates and times.
All parameters are dictionaries which map a string to a set of locales in which that string is used.
Except for formats, the dictionary keys are semicolon-separated (;) ordered lists. Their semantics are documented in locale(5).
Parameters:
- formats – Sample strftime(3) format strings to extract prefix and suffix patterns from.
- day – Names of days of the week.
- mon – Names of months.
- am_pm – Strings indicating times before or after noon.
- alt_digits – Numbers from writing systems which do not use Unicode digits.
- era – Definitions of how years are counted and displayed.
classmethod from_json(f)
Load a locale set from a JSON-formatted stream, such as one produced by utils/lc_time.
Returns: the loaded locale set
classmethod default(provider='glibc')
Load a locale set that was distributed with this package. See percentagent/locales/ for the available sets.
Returns: the loaded locale set
keywords
Group conversion specifiers by the non-numeric strings they can produce. This includes these specifiers:
- Weekday names: %a
- Month names: %b
- AM/PM: %p
- Timezone abbreviations: %Z
- Non-decimal numbers: %O prefix (e.g. %Om for months)
>>> glibc = TimeLocaleSet.default('glibc').keywords
Many strings can only be produced by a single conversion specifier in a single locale. For example, according to the glibc locale database, “Agustus” is the id_ID (Indonesian) word for the 8th month, and does not appear in any other locale.
>>> sorted(glibc['agustus'])
[('b', 8, ('id_ID',))]
However, other strings can be ambiguous. For example, “Ahad” is the word for Sunday in ms_MY (the Malay language locale for Malaysia), but the word for Wednesday in kab_DZ (the Kabyle language locale for Algeria). These languages are from entirely different language families but we can’t tell them apart if all we see is this one word. However, in either case we do know that the word refers to a weekday.
>>> sorted(glibc['ahad'])
[('a', 0, ('ms_MY',)), ('a', 3, ('kab_DZ',))]
Sometimes, without context, we can’t even tell which role a word plays. “An” is the word for Tuesday in lt_LT (Lithuanian), but hours before noon are distinguished with “AN” in ak_GH (the Akan locale for Ghana).
>>> sorted(glibc['an'])
[('a', 2, ('lt_LT',)), ('p', 0, ('ak_GH',))]
Similarly, “AWST” is the timezone abbreviation for Australian Western Standard Time, while “Awst” is the cy_GB (Welsh) word for the 8th month.
>>> sorted(glibc['awst'])
[('Z', 'AWST', ()), ('b', 8, ('cy_GB',))]
Finally, in Chinese, Monday through Saturday are abbreviated using the numbers 1-6, and those numbers are written using the same characters in Japanese. So if we see those numbers, they could either be from numeric conversions such as %Od, or from the abbreviated weekday conversion, %a.
>>> sorted(glibc['一'])
[('O', 1, ('ja_JP', 'lzh_TW')), ('a', 1, ('cmn_TW', 'hak_TW', 'lzh_TW', 'nan_TW', 'yue_HK', 'zh_CN', 'zh_HK', 'zh_SG', 'zh_TW'))]
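To make the shape of this mapping concrete, here is a simplified sketch of how an index like this could be built from the `mon` argument alone. It is not the library’s actual implementation: the function name `build_keyword_index` is hypothetical, and lower-casing keys with `str.casefold()` is an assumption suggested by the lower-case lookups above.

```python
from collections import defaultdict


def build_keyword_index(mon):
    """Map each month name to (specifier, month number, locales) entries.

    `mon` uses the TimeLocaleSet convention: keys are semicolon-separated
    ordered lists of names, values are the locales that use them.
    """
    index = defaultdict(set)
    for names, locales in mon.items():
        for number, name in enumerate(names.split(";"), start=1):
            # casefold() so lookups ignore case (an assumption of this sketch)
            index[name.casefold()].add(("b", number, tuple(sorted(locales))))
    return index


# Month names taken from the DateParser.parse example earlier in this page.
mon = {"Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec": ["en_US"]}
index = build_keyword_index(mon)
print(sorted(index["aug"]))  # [('b', 8, ('en_US',))]
```

A fuller version would merge weekday names, AM/PM strings, and alternate digits into the same index, which is where the ambiguities described above come from.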
prefixes
Group conversion specifiers by the strings which may precede them.
>>> glibc = TimeLocaleSet.default('glibc').prefixes
In vi_VN (Vietnamese), “tháng” means “month”, and “năm” means “year”. Within the glibc locale database, we find that these words are used as prefixes to the numeric values in question:
>>> sorted(glibc['tháng'])
[('m', ('vi_VN',))]
>>> sorted(glibc['năm'])
[('y', ('vi_VN',))]
suffixes
Group conversion specifiers by the strings which may follow them.
>>> suffixes = TimeLocaleSet(formats={
...     '%a, %Y.eko %bren %da': {'eu_ES'},
...     '%Y年%m月%d日': {'ja_JP'},
... }).suffixes
In eu_ES (the Basque locale for Spain), year/month/day are followed by “eko”, “ren”, and “a”, respectively. However, in our sample format string, “ren” follows %b, which is the name of a month, not its number. So we don’t extract it as a suffix; we rely on month names being sufficiently distinctive instead.
>>> sorted(suffixes['eko'])
[('y', ('eu_ES',))]
>>> 'ren' in suffixes
False
>>> sorted(suffixes['a'])
[('d', ('eu_ES',))]
In ja_JP (Japanese), year/month/day are followed by “年”, “月”, and “日”, respectively. Since our sample format string uses only numeric conversion specifiers, we extract all three as valid suffixes for their corresponding conversions.
>>> sorted(suffixes['年'])
[('y', ('ja_JP',))]
>>> sorted(suffixes['月'])
[('m', ('ja_JP',))]
>>> sorted(suffixes['日'])
[('d', ('ja_JP',))]
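The extraction rule described above (keep literal text that follows a numeric conversion, skip text that follows a name conversion) can be approximated with a regular expression. This is only a rough sketch under stated assumptions, not the library’s extraction logic: the `NUMERIC` set, the pattern, and the function name `extract_suffixes` are all hypothetical, and unlike the output above it reports `%Y` as `'Y'` rather than normalizing it to `'y'`.

```python
import re
from collections import defaultdict

# Specifiers treated as numeric in this sketch (an assumption, not the library's list).
NUMERIC = set("YymdHMS")

# A conversion specifier, optional punctuation, then a run of letters.
PATTERN = re.compile(r"%([A-Za-z])[^\w%\s]*([^\W\d_]+)")


def extract_suffixes(formats):
    """Map each literal suffix to the numeric specifiers it follows."""
    suffixes = defaultdict(set)
    for fmt, locales in formats.items():
        for spec, text in PATTERN.findall(fmt):
            if spec in NUMERIC:  # skip text after name conversions such as %b
                suffixes[text].add((spec, tuple(sorted(locales))))
    return dict(suffixes)


suffixes = extract_suffixes({
    "%a, %Y.eko %bren %da": {"eu_ES"},
    "%Y年%m月%d日": {"ja_JP"},
})
print(suffixes["eko"])    # {('Y', ('eu_ES',))}
print("ren" in suffixes)  # False
print(suffixes["日"])      # {('d', ('ja_JP',))}
```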