Guess strftime format strings

https://github.com/jameysharp/percentagent

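This resembles and is inspired by the parser module of dateutil, in that it attempts to parse structured dates and times out of arbitrary strings. However, this implementation has several advantages, described in the sections below: it returns format strings, it reports every interpretation of an ambiguous input, and it supports a broad range of locales.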

You should use dateutil instead if you don’t need any of those features, because this library also has some disadvantages (although these may get fixed over time):

Format strings

This library returns strftime(3)-style format strings, as well as the corresponding datetime object (or a date or time if that’s all that could be found in the input) for each possible format string.

The format strings are useful if you have several different examples of strings produced by a single, unknown, format string. A single date/time string may be ambiguous (such as “y/m/d” versus “y/d/m”). But odds are good that if you see a few more samples in the same format, only one format string will explain all of them.
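
As a rough illustration of that narrowing, the sketch below intersects the formats that can explain each sample. It does not use this library: CANDIDATES, formats_explaining, and common_formats are hypothetical names, and the standard library's datetime.strptime stands in for the inference step. With this library you would presumably intersect the format strings returned for each sample instead.

```python
from datetime import datetime

# Hypothetical candidate formats; a real application would take these
# from the parser's output rather than hard-coding them.
CANDIDATES = ["%Y/%m/%d", "%Y/%d/%m"]


def formats_explaining(sample, candidates):
    """Return the candidate formats that parse `sample` without error."""
    matching = set()
    for fmt in candidates:
        try:
            datetime.strptime(sample, fmt)
        except ValueError:
            continue
        matching.add(fmt)
    return matching


def common_formats(samples, candidates=CANDIDATES):
    """Keep only the formats that explain every sample."""
    remaining = set(candidates)
    for sample in samples:
        remaining &= formats_explaining(sample, candidates)
    return remaining


one_sample = common_formats(["2018/01/09"])
two_samples = common_formats(["2018/01/09", "2018/05/13"])
```

Here one_sample still holds both candidates, while two_samples keeps only the year/month/day format, because "13" cannot be a month.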

Ambiguous inputs

If an input string is ambiguous, such as when it’s unclear whether a date uses day/month or month/day order, this library returns all possibilities. You can implement your own heuristics to decide which one is best for your application.

By contrast, dateutil picks one interpretation, and provides options letting you guide which one it will pick.
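
One possible heuristic, sketched under assumptions: given candidates shaped like the (format, value, locales) tuples shown in the parse documentation, prefer the interpretation closest to a date you already expect. The pick_closest helper is hypothetical and assumes every candidate value is a datetime.date.

```python
import datetime


def pick_closest(candidates, reference):
    """Return the candidate whose date is nearest to `reference`.

    Assumes each candidate is (format, datetime.date, locales).
    """
    return min(candidates, key=lambda c: abs((c[1] - reference).days))


# Candidates copied from the shape of the documented parse() output.
candidates = [
    ("%Y-%m-%d", datetime.date(2018, 1, 9), None),
    ("%Y-%d-%m", datetime.date(2018, 9, 1), None),
]

best = pick_closest(candidates, datetime.date(2018, 1, 15))
```

Other applications might prefer a fixed field order or a known locale; the point is only that the choice is left to the caller.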

Broad locale support

This library has a fair shot at handling dates in a wide range of languages, without any configuration.

I’ve extracted comprehensive data about how dates are formatted around the world from the GNU C library locale database. The script that does this is utils/lc_time; you can run it on your own system if it has a POSIX-conforming implementation of locale(1). If your system’s locale database includes locales or format-string examples that glibc doesn’t, we can merge the extracted data to make this library support even more kinds of input.

This library will also tell you which locales could have been used to produce the input you hand it. That gives you an additional data point if you’re comparing different date strings to determine if they were generated using the same format. You could also use the suggested locales as a hint about the language of the surrounding text, or the most likely timezones used in the locale’s primary country.
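
A minimal sketch of comparing locale hints across several strings, assuming each hint is either a set of locale names or None when there is no hint (as in the parse documentation). The common_locales helper and the sample hints are illustrative only, not output from this library.

```python
def common_locales(hints):
    """Intersect locale hints, treating None as "no constraint".

    Returns None if no hint constrained the result at all.
    """
    common = None
    for hint in hints:
        if hint is None:
            continue
        common = set(hint) if common is None else common & set(hint)
    return common


# Made-up hints for three strings believed to share one format.
hints = [frozenset({"ms_MY", "kab_DZ"}), None, frozenset({"ms_MY", "id_ID"})]
shared = common_locales(hints)
```

An empty result would suggest the strings were not produced with the same locale.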

Command-line usage

A simple interactive interface is available:

python -m percentagent

License

BSD-2-Clause

This repository includes glibc.json, a file produced by extracting selected data from the GNU C Library locale database. The source files for that database include the following text:

This file is part of the GNU C Library and contains locale data. The Free Software Foundation does not claim any copyright interest in the locale data contained in this file. The foregoing does not affect the license of the GNU C Library as a whole. It does not exempt you from the conditions of the license if your use would otherwise be governed by that license.

Therefore, I believe the derived data is not subject to the license of the GNU C Library. To be clear, I also do not claim any copyright interest in the locale facts in the above file.

API

class percentagent.DateParser(locale_set=None)

Infer strftime(3)-style format strings that could have produced given date and/or time strings.

If you don’t provide a locale set, then one will be constructed for you with TimeLocaleSet.default(). This is usually what you want, but you can construct special-purpose sets if needed.

This class precomputes some large data structures when constructed, so you should reuse the same instance for multiple parses, if possible.

Instances of this class may safely be used from multiple threads.

Parameters: locale_set (TimeLocaleSet) – locales to consider when parsing timestamps

parse(s)

Infer format strings for a single timestamp.

For example:

>>> parser = DateParser(TimeLocaleSet())
>>> parser.parse("2018-01-09")
[('%Y-%m-%d', datetime.date(2018, 1, 9), None), ('%Y-%d-%m', datetime.date(2018, 9, 1), None)]

That output indicates that the input can be explained by either year-month-day order or year-day-month order, and there are no hints indicating which locale the date was formatted for.

On the other hand, “13” is too large to be a month number, so this example is unambiguous:

>>> parser.parse("2018-05-13")
[('%Y-%m-%d', datetime.date(2018, 5, 13), None)]

Separators are not required between conversions, but they can help with ambiguity:

>>> sorted(fmt for fmt, value, locales in parser.parse("210456"))
['%H%M%S', '%d%m%y']
>>> parser.parse("21-04-56")
[('%d-%m-%y', datetime.date(2056, 4, 21), None)]
>>> parser.parse("21:04:56")
[('%H:%M:%S', datetime.time(21, 4, 56), None)]

Using locale-specific strings can help avoid ambiguity too:

>>> parser = DateParser(TimeLocaleSet(
...     mon={"Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec": ["en_US"]},
... ))
>>> parser.parse("2018Jan9")
[('%Y%b%d', datetime.date(2018, 1, 9), frozenset({'en_US'}))]
Parameters: s (str) – text which contains a date and/or time
Returns: possible format strings, each with the parsed value and the corresponding locales
Return type: list(tuple(str, datetime.datetime or datetime.date or datetime.time, frozenset(str) or None))

class percentagent.TimeLocaleSet(formats=None, day=None, mon=None, am_pm=None, alt_digits=None, era=None)

Structured information about how a set of locales express dates and times.

All parameters are dictionaries which map a string to a set of locales in which that string is used.

Except for formats, the dictionary keys are semicolon-separated (;) ordered lists. Their semantics are documented in locale(5).

Parameters:
  • formats – Sample strftime(3) format strings to extract prefix and suffix patterns from.
  • day – Names of days of the week.
  • mon – Names of months.
  • am_pm – Strings indicating times before or after noon.
  • alt_digits – Numbers from writing systems which do not use Unicode digits.
  • era – Definitions of how years are counted and displayed.
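
To make the semicolon-separated keys concrete, here is a sketch of turning one such dictionary into a per-name lookup. It is not this library's implementation: index_names is a hypothetical helper, and the case-folded keys and the (specifier, position, locales) tuples are assumptions modelled on the keywords examples below.

```python
def index_names(spec, table, first=0):
    """Map each case-folded name to (spec, position, locales) entries.

    `table` maps a semicolon-separated list of names to the locales
    that use that list; `first` is the number given to the first name.
    """
    index = {}
    for names, locales in table.items():
        for position, name in enumerate(names.split(";"), start=first):
            entry = (spec, position, tuple(sorted(locales)))
            index.setdefault(name.casefold(), []).append(entry)
    return index


mon = {"Jan;Feb;Mar;Apr;May;Jun;Jul;Aug;Sep;Oct;Nov;Dec": ["en_US"]}
months = index_names("b", mon, first=1)
```

With this input, months["aug"] holds a single entry for the 8th month in en_US.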
classmethod from_json(f)

Load a locale set from a JSON-formatted stream, such as one produced by utils/lc_time.

Returns: the loaded locale set

classmethod default(provider='glibc')

Load a locale set that was distributed with this package. See percentagent/locales/ for the available sets.

Returns: the loaded locale set

keywords

Group conversion specifiers by the non-numeric strings they can produce. This includes these specifiers:

  • Weekday names: %a
  • Month names: %b
  • AM/PM: %p
  • Timezone abbreviations: %Z
  • Non-decimal numbers: %O prefix (e.g. %Om for months)
>>> glibc = TimeLocaleSet.default('glibc').keywords

Many strings can only be produced by a single conversion specifier in a single locale. For example, according to the glibc locale database, “Agustus” is the id_ID (Indonesian) word for the 8th month, and does not appear in any other locale.

>>> sorted(glibc['agustus'])
[('b', 8, ('id_ID',))]

However, other strings can be ambiguous. For example, “Ahad” is the word for Sunday in ms_MY (the Malay language locale for Malaysia), but the word for Wednesday in kab_DZ (the Kabyle language locale for Algeria). These languages belong to entirely different language families, but we can’t tell them apart if all we see is this one word. In either case, though, we do know that the word refers to a weekday.

>>> sorted(glibc['ahad'])
[('a', 0, ('ms_MY',)), ('a', 3, ('kab_DZ',))]

Sometimes, without context, we can’t even tell which role a word plays. “An” is the word for Tuesday in lt_LT (Lithuanian), but hours before noon are distinguished with “AN” in ak_GH (the Akan locale for Ghana).

>>> sorted(glibc['an'])
[('a', 2, ('lt_LT',)), ('p', 0, ('ak_GH',))]

Similarly, “AWST” is the timezone abbreviation for Australian Western Standard Time, while “Awst” is the cy_GB (Welsh) word for the 8th month.

>>> sorted(glibc['awst'])
[('Z', 'AWST', ()), ('b', 8, ('cy_GB',))]

Finally, in Chinese, Monday through Saturday are abbreviated using the numbers 1-6, and those numbers are written using the same characters in Japanese. So if we see those numbers, they could either be from numeric conversions such as %Od, or from the abbreviated weekday conversion, %a.

>>> sorted(glibc['一'])
[('O', 1, ('ja_JP', 'lzh_TW')), ('a', 1, ('cmn_TW', 'hak_TW', 'lzh_TW', 'nan_TW', 'yue_HK', 'zh_CN', 'zh_HK', 'zh_SG', 'zh_TW'))]
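
When a keyword has several entries, one way to see which roles it can play is to group the entries by conversion specifier. The sketch below assumes entries shaped like the (specifier, value, locales) tuples in the examples above; group_by_specifier is a hypothetical helper, not part of this library.

```python
def group_by_specifier(entries):
    """Group (specifier, value, locales) entries by their specifier."""
    grouped = {}
    for spec, value, locales in entries:
        grouped.setdefault(spec, []).append((value, locales))
    return grouped


# Entries copied from the documented lookups for 'an' and 'ahad'.
an = [("a", 2, ("lt_LT",)), ("p", 0, ("ak_GH",))]
ahad = [("a", 0, ("ms_MY",)), ("a", 3, ("kab_DZ",))]

an_roles = group_by_specifier(an)      # two roles: weekday and AM/PM
ahad_roles = group_by_specifier(ahad)  # one role, two possible weekdays
```

A keyword with a single key in the grouped result has an unambiguous role, even if its value or locale is still uncertain.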
prefixes

Group conversion specifiers by the strings which may precede them.

>>> glibc = TimeLocaleSet.default('glibc').prefixes

In vi_VN (Vietnamese), “tháng” means “month”, and “năm” means “year”. Within the glibc locale database, we find that these words are used as prefixes to the corresponding numeric values:

>>> sorted(glibc['tháng'])
[('m', ('vi_VN',))]
>>> sorted(glibc['năm'])
[('y', ('vi_VN',))]
suffixes

Group conversion specifiers by the strings which may follow them.

>>> suffixes = TimeLocaleSet(formats={
...     '%a, %Y.eko %bren %da': {'eu_ES'},
...     '%Y年%m月%d日': {'ja_JP'},
... }).suffixes

In eu_ES (the Basque locale for Spain), year/month/day are followed by “eko”, “ren”, and “a”, respectively. However, in our sample format string, “ren” follows %b, which is the name of a month, not its number. So we don’t extract it as a suffix; we rely on month names being sufficiently distinctive instead.

>>> sorted(suffixes['eko'])
[('y', ('eu_ES',))]
>>> 'ren' in suffixes
False
>>> sorted(suffixes['a'])
[('d', ('eu_ES',))]

In ja_JP (Japanese), year/month/day are followed by “年”, “月”, and “日”, respectively. Since our sample format string uses only numeric conversion specifiers, we extract all three as valid suffixes for their corresponding conversions.

>>> sorted(suffixes['年'])
[('y', ('ja_JP',))]
>>> sorted(suffixes['月'])
[('m', ('ja_JP',))]
>>> sorted(suffixes['日'])
[('d', ('ja_JP',))]
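
The extraction rule described above can be approximated with a regular expression. This is a sketch under stated assumptions rather than this library's actual algorithm: the NUMERIC table, the choice to skip punctuation between a conversion and its suffix, and the extract_suffixes helper are all hypothetical.

```python
import re
from collections import defaultdict

# Assumed set of numeric conversions, mapped to a canonical letter
# (so that %Y is reported as 'y', as in the examples above).
NUMERIC = {"Y": "y", "y": "y", "m": "m", "d": "d", "H": "H", "M": "M", "S": "S"}

# A conversion, optional punctuation, then a run of letters.
PATTERN = re.compile(r"%([A-Za-z])[^\w\s%]*([^\W\d_]+)")


def extract_suffixes(formats):
    """Collect letter runs that directly follow numeric conversions."""
    found = defaultdict(lambda: defaultdict(set))
    for fmt, locales in formats.items():
        for spec, text in PATTERN.findall(fmt):
            if spec in NUMERIC:  # skip name conversions such as %b
                found[text][NUMERIC[spec]].update(locales)
    return {
        text: sorted((spec, tuple(sorted(locs))) for spec, locs in by_spec.items())
        for text, by_spec in found.items()
    }


sketch = extract_suffixes({
    "%a, %Y.eko %bren %da": {"eu_ES"},
    "%Y年%m月%d日": {"ja_JP"},
})
```

On these two format strings the sketch finds “eko”, “a”, “年”, “月”, and “日”, and skips “ren” because it follows the non-numeric %b.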
