API Reference
This chapter is for developers of demeuk; it documents the API functions.
Demeuk-api
Demeuk - a simple tool to clean up corpora
Usage:
demeuk [options]
Examples:
demeuk -i inputfile.tmp -o outputfile.dict -l logfile.txt
demeuk -i "inputfile*.txt" -o outputfile.dict -l logfile.txt
demeuk -i "inputdir/*" -o outputfile.dict -l logfile.txt
demeuk -i inputfile -o outputfile -j 24
demeuk -i inputfile -o outputfile -c -e
demeuk -i inputfile -o outputfile --threads all
Standard Options:
-i --input <path to file> Specify the input file to be cleaned, or provide a glob pattern
-o --output <path to file> Specify the output file name.
-l --log <path to file> Optional, specify where the log file needs to be written to
-j --threads <threads> Optional, demeuk doesn't use threads by default. Specify the number of threads to
spawn. Specify the string 'all' to make demeuk auto-detect the number of threads
to start based on the number of CPUs.
Note: threading will cost some setup time. Only speeds up for larger files.
--input-encoding <encoding> Forces demeuk to decode the input using this encoding (default: en_US.UTF-8).
--output-encoding <encoding> Forces demeuk to encode the output using this encoding (default: en_US.UTF-8).
-v --verbose When set, the logfile will not only contain lines which caused an error, but
also lines which were modified.
--progress Prints out the progress of the demeuk process.
-n --limit <int> Limit the number of lines per thread.
-s --skip <int> Skip <int> amount of lines per thread.
--punctuation <punctuation> Use to set the punctuation that is used by options. Defaults to:
! "#$%&'()*+,-./:;<=>?@[\]^_`{|}~
--version Prints the version of demeuk.
Separating Options:
-c --cut Specify if demeuk should split (default splits on ':'). Returns everything
after the delimiter.
--cut-before Specify if demeuk should return the string before the delimiter.
When cutting, demeuk by default returns the string after the delimiter.
-f --cut-fields <field> Specifies the field to be returned, this is in the 'cut' language, thus:
N for the N-th field, N- from the N-th field to the end of the line, N-M from the N-th field to
the M-th field, -M from the start to the M-th field.
-d --delimiter <delimiter> Specify which delimiter will be used for cutting. Multiple delimiters can be
specified using ','. If the ',' is required for cutting, escape it with a
backslash. Only one delimiter can be used per line.
Check modules (check if a line matches a specific condition):
--check-min-length <length> Requires that entries have a minimum length of <length> unicode chars
--check-max-length <length> Requires that entries have a maximum length of <length> unicode chars
--check-case Drop lines where the uppercase line is not equal to the lowercase line
--check-controlchar Drop lines containing control chars.
--check-email Drop lines containing e-mail addresses.
--check-hash Drop lines which are hashes.
--check-mac-address Drop lines which are MAC-addresses.
--check-uuid Drop lines which are UUID.
--check-non-ascii If a line contains a non-ASCII char, e.g. ü or ç (or anything outside the ASCII
range), the line is dropped.
--check-replacement-character Drop lines containing replacement characters '�'.
--check-starting-with <string> Drop lines starting with string, can be multiple strings. Specify multiple
strings as a comma-separated list.
--check-ending-with <string> Drop lines ending with string, can be multiple strings. Specify multiple
strings as a comma-separated list.
--check-empty-line Drop lines that are empty or only contain whitespace characters
--check-regex <string> Drop lines that do not match the regex. Regex is a comma-separated list of
regexes. Example: [a-z]{1,8},[0-9]{1,8}
Modify modules (modify a line in place):
--hex Replace lines like: $HEX[41424344] with ABCD.
--html Replace lines like: &#351;ifreyok with şifreyok.
--html-named Replace lines like: &alpha; with α. Those structures are more like passwords, so
be careful when enabling this option.
--lowercase Replace lines like 'This Test String' with 'this test string'
--title-case Replace lines like 'this test string' with 'This Test String'
--umlaut Replace lines like ko"ffie with an o with an umlaut.
--mojibake Fixes mojibakes, which means lines like SmˆrgÂs will be fixed to Smörgås.
--encode Enables guessing of encoding, based on chardet and custom implementation.
--tab Enables replacing tab char with ':', sometimes leaks contain both ':' and '\t'.
--newline Enables removing newline characters (\r\n) from end and beginning of lines.
--non-ascii Replace non-ASCII chars with their ASCII replacement letters. For example ü
becomes u, ç becomes c.
--trim Enables removing newline representations from end and beginning. Newline
representations detected are '\\n', '\\r', '\n', '\r', '<br>', and '<br />'.
Add modules (Modify a line, but keep the original as well):
--add-lower If a line contains a capital letter this will add the lower case variant
--add-latin-ligatures If a line contains a ligature of Latin letters (such as ij), the line
is corrected, but the original line containing the ligature is also added to the output.
--add-split Split on known chars like - and . and add those to the final dictionary.
--add-umlaut In some spelling dicts, umlauts are sometimes written as: o" or i" and not as
one char.
--add-without-punctuation If a line contains punctuation, a variant will be added without the
punctuation
Remove modules (remove specific parts of a line):
--remove-strip-punctuation Remove starting and trailing punctuation
--remove-punctuation Remove all punctuation in a line
--remove-email Enable email filter, this will catch strings like
1238661:test@example.com:password
Macro modules:
-g --googlengram When set, demeuk will strip universal pos tags: like _NOUN_ or _ADJ
--leak When set, demeuk will run the following modules:
mojibake, encode, newline, check-controlchar
This is recommended when working with leaks and was the default behavior in
demeuk version 3.11.0 and below.
--leak-full When set, demeuk will run the following modules:
mojibake, encode, newline, check-controlchar,
hex, html, html-named,
check-hash, check-mac-address, check-uuid, check-email,
check-replacement-character, check-empty-line
- bin.demeuk.add_latin_ligatures(line)
Returns the line cleaned of latin ligatures if there are any.
- Param:
line (unicode)
- Returns:
False if there are not any Latin ligatures
Corrected line otherwise
- bin.demeuk.add_lower(line)
Returns the lowercase variant of the line if it differs from the line itself.
- Param:
line (unicode)
- Returns:
False if they are the same
Lowered string if they are not
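The return convention described for add_lower (False, or the new variant) can be illustrated with a minimal sketch. The helper name lower_variant is hypothetical and this is not demeuk's own code; it only assumes that the variant is produced with str.lower().

```python
def lower_variant(line):
    """Hypothetical helper: return the lowercase variant, or False if nothing changes."""
    lowered = line.lower()
    # Only report a variant when lowercasing actually changed the line.
    return lowered if lowered != line else False


print(lower_variant('This Test String'))
print(lower_variant('already lower'))
```

Because the result is either False or a string, callers should test it with "is False" rather than relying on truthiness.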
- bin.demeuk.add_split(line, punctuation=(' ', '-', '\\.'))
Split the line on the punctuation and return elements longer than 1 char.
- Param:
line (unicode)
- Returns:
split line
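As a rough illustration of the splitting described for add_split, the sketch below joins the punctuation entries into one regular expression. It assumes the entries are regex fragments (the default '\\.' suggests this, but it is an assumption), and split_on_punctuation is a hypothetical name, not the demeuk function.

```python
import re


def split_on_punctuation(line, punctuation=(' ', '-', r'\.')):
    """Hypothetical helper: split on any punctuation fragment, keep parts longer than 1 char."""
    # Assumption: every entry is already a valid regex fragment.
    pattern = '|'.join(punctuation)
    return [part for part in re.split(pattern, line) if len(part) > 1]


print(split_on_punctuation('hello-world a.bc'))
```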
- bin.demeuk.add_without_punctuation(line, punctuation)
Returns the line cleaned of punctuation.
- Param:
line (unicode)
- Returns:
False if there is not any punctuation
Corrected line otherwise
- bin.demeuk.check_case(line, ignored_chars=(' ', "'", '-'))
Checks if an uppercase line is equal to a lowercase line.
- Param:
line (unicode)
ignored_chars list(string)
- Returns:
true if uppercase line is equal to lowercase line
- bin.demeuk.check_character(line, character)
Checks if a line contains a specific character
- Params:
line (unicode)
- Returns:
true if line does contain the specific character
- bin.demeuk.check_controlchar(line)
Detects control chars, returns True when detected
- Params:
line (Unicode)
- Returns:
Status, String
- bin.demeuk.check_email(line)
Check if lines contain e-mail addresses with a simple regex
- Params:
line (unicode)
- Returns:
true if line does not contain email
- bin.demeuk.check_empty_line(line)
Checks if a line is empty or only contains whitespace chars
- Params:
line (unicode)
- Returns:
true if line is empty or only contains whitespace chars
- bin.demeuk.check_ending_with(line, strings)
Checks if a line ends with specific strings
- Params:
line (unicode)
strings[str]
- Returns:
true if line does end with one of the strings
- bin.demeuk.check_hash(line)
Check if a line contains a hash
- Params:
line (unicode)
- Returns:
true if line does not contain hash
- bin.demeuk.check_length(line, min=0, max=0)
Does a length check on the line
- Params:
line (unicode)
min (int)
max (int)
- Returns:
true if length is ok
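A length check of this kind can be sketched as follows. The sketch assumes that a value of 0 means "no limit", which is suggested by the defaults min=0 and max=0 but not confirmed by this reference; length_ok and its parameter names are hypothetical.

```python
def length_ok(line, min_length=0, max_length=0):
    """Hypothetical helper: check the number of unicode chars against optional bounds."""
    # Assumption: 0 disables the corresponding bound.
    if min_length and len(line) < min_length:
        return False
    if max_length and len(line) > max_length:
        return False
    return True


print(length_ok('password', min_length=4, max_length=16))
```

Since len() on a str counts code points, 'ü' has length 1 here, which matches the "unicode chars" wording of --check-min-length and --check-max-length.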
- bin.demeuk.check_mac_address(line)
Check if a line contains a MAC-address
- Params:
line (unicode)
- Returns:
true if line does not contain a MAC-address
- bin.demeuk.check_non_ascii(line)
Checks if a line contains non-ASCII chars
- Params:
line (unicode)
- Returns:
true if line does not contain non ascii chars
- bin.demeuk.check_regex(line, regex)
Checks if a line matches a list of regexes
- Params:
line (unicode)
regex (list)
- Returns:
true if all regexes match
false if line does not match regex
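The "all regexes must match" rule can be sketched with the standard re module. Whether demeuk anchors the patterns (match, fullmatch) or searches anywhere in the line is not stated here, so the use of re.search is an assumption, and matches_all is a hypothetical name.

```python
import re


def matches_all(line, regexes):
    """Hypothetical helper: True only if every regex is found somewhere in the line."""
    # Assumption: unanchored search; swap in re.fullmatch for stricter matching.
    return all(re.search(regex, line) is not None for regex in regexes)


print(matches_all('abc123', ['[a-z]{1,8}', '[0-9]{1,8}']))
```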
- bin.demeuk.check_starting_with(line, strings)
Checks if a line starts with specific strings
- Params:
line (unicode)
strings[str]
- Returns:
true if line does start with one of the strings
- bin.demeuk.check_uuid(line)
Check if a line contains a UUID
- Params:
line (unicode)
- Returns:
true if line does not contain a UUID
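Note the return convention: the check passes (true) when no UUID is present. A minimal sketch of such a check, assuming the canonical 8-4-4-4-12 hexadecimal layout and a search anywhere in the line (both assumptions; the pattern demeuk uses is not shown in this reference):

```python
import re

# Assumption: canonical textual UUID layout, case-insensitive.
UUID_PATTERN = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
    re.IGNORECASE,
)


def has_no_uuid(line):
    """Hypothetical helper: True when the line contains no UUID."""
    return UUID_PATTERN.search(line) is None


print(has_no_uuid('123e4567-e89b-12d3-a456-426614174000'))
print(has_no_uuid('password123'))
```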
- bin.demeuk.chunkify(fname, config, size=1048576)
- bin.demeuk.clean_add_umlaut(line)
Returns the line cleaned of incorrect umlauting
- Param:
line (unicode)
- Returns:
Corrected line
- bin.demeuk.clean_cut(line, delimiters, fields)
Finds the first delimiter and returns the remaining string either after or before the delimiter.
- Params:
line (unicode)
delimiters list(unicode)
fields (unicode)
- Returns:
line (unicode)
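The 'cut'-style field notation (N, N-, N-M, -M) can be illustrated with a small sketch. It handles a single delimiter and assumes 1-based field numbers as in the Unix cut tool; cut_fields is a hypothetical helper and not the clean_cut implementation.

```python
def cut_fields(line, delimiter, fields):
    """Hypothetical helper: select fields using N, N-, N-M or -M notation."""
    parts = line.split(delimiter)
    if '-' in fields:
        start, _, end = fields.partition('-')
        first = int(start) if start else 1          # '-M' starts at field 1
        last = int(end) if end else len(parts)      # 'N-' runs to the last field
    else:
        first = last = int(fields)                  # plain 'N'
    # Field numbers are 1-based, slices are 0-based.
    return delimiter.join(parts[first - 1:last])


print(cut_fields('1238661:test@example.com:password', ':', '2-'))
```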
- bin.demeuk.clean_encode(line, input_encoding)
Detects and tries encoding
- Params:
line (bytes)
- Returns:
Decoded UTF-8 string
- bin.demeuk.clean_googlengram(line)
Removes part-of-speech tags from line, specific to the googlengram module
- Param:
line (unicode)
- Returns:
line (unicode)
- bin.demeuk.clean_hex(line)
Converts strings like ‘$HEX[]’ to proper binary
- Params:
line (bytes)
- Returns:
line (bytes)
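The $HEX[...] notation wraps the hexadecimal encoding of the raw bytes. The sketch below decodes it with the standard library, assuming the whole line consists of one $HEX[...] wrapper; decode_hex_notation is a hypothetical name and not the clean_hex implementation.

```python
import re
from binascii import unhexlify

# Assumption: the wrapper spans the whole line.
HEX_PATTERN = re.compile(rb'\$HEX\[([0-9a-fA-F]*)\]')


def decode_hex_notation(line):
    """Hypothetical helper: turn b'$HEX[41424344]' into b'ABCD'."""
    match = HEX_PATTERN.fullmatch(line)
    # Leave the line alone if it is not wrapped or has an odd number of digits.
    if match is None or len(match.group(1)) % 2:
        return line
    return unhexlify(match.group(1))


print(decode_hex_notation(b'$HEX[41424344]'))
```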
- bin.demeuk.clean_html(line)
Detects html encode chars and decodes them
- Params:
line (Unicode)
- Returns:
line (Unicode)
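Decoding numeric character references such as &#351; can be sketched with a regular expression. This is an illustration under the assumption that only decimal references are handled; decode_numeric_entities is a hypothetical name, not demeuk's code.

```python
import re

# Assumption: decimal references only, e.g. '&#351;'.
NUMERIC_ENTITY = re.compile(r'&#(\d{1,7});')


def decode_numeric_entities(line):
    """Hypothetical helper: replace decimal character references with the character."""
    def replace(match):
        code_point = int(match.group(1))
        # Leave values outside the Unicode range untouched.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)

    return NUMERIC_ENTITY.sub(replace, line)


print(decode_numeric_entities('&#351;ifreyok'))
```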
- bin.demeuk.clean_html_named(line)
Detects named html encode chars and decodes them
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_lowercase(line)
Replace all capitals to lowercase
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_mojibake(line)
Detects mojibake and tries to correct it. Mojibake are strings that were decoded with the wrong encoding and then encoded again. This results in strings like: Ãºnico which should be único.
- Param:
line (str)
- Returns:
Cleaned string
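How such a string arises can be reproduced with the standard library alone. The snippet decodes UTF-8 bytes as Latin-1 and then reverses the mistake. It only demonstrates the mechanism for one known encoding pair; it does not detect mojibake in general and is not how clean_mojibake is implemented.

```python
original = 'único'

# Encode as UTF-8, then wrongly decode as Latin-1: each UTF-8 byte becomes one char.
garbled = original.encode('utf-8').decode('latin-1')

# Reversing the same two steps restores the text when the encoding pair is known.
repaired = garbled.encode('latin-1').decode('utf-8')

print(garbled)
print(repaired)
```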
- bin.demeuk.clean_newline(line)
Delete newline characters at start and end of line
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_non_ascii(line)
Replace non-ASCII chars with their ASCII representation.
- Params:
line (Unicode)
- Returns:
line (Unicode)
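One standard-library way to approximate this replacement is Unicode decomposition: split accented letters into base letter plus combining mark, then drop everything that is not ASCII. This is a swapped-in technique for illustration, not necessarily what clean_non_ascii does, and it silently drops characters that have no decomposition (for example ø). The name to_ascii is hypothetical.

```python
import unicodedata


def to_ascii(line):
    """Hypothetical helper: strip accents via NFKD decomposition."""
    decomposed = unicodedata.normalize('NFKD', line)
    # Combining marks and other non-ASCII code points are discarded here.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('über'))
print(to_ascii('façade'))
```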
- bin.demeuk.clean_tab(line)
Replace tab character with ‘:’ greedy
- Params:
line (bytes)
- Returns:
line (bytes)
- bin.demeuk.clean_title_case(line)
Replace words to title word (uppercasing first letter)
- Params:
line (Unicode)
- Returns:
line (Unicode)
- bin.demeuk.clean_trim(line)
Delete leading and trailing character sequences representing a newline from beginning and end of line.
- Params:
line (Unicode)
- Returns:
line (Unicode)
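Trimming newline representations can be sketched as a loop that keeps stripping known tokens from both ends until none is left. The token list is taken from the --trim description; the function name and the repeat-until-stable behavior are assumptions.

```python
# Literal backslash sequences, real control chars and HTML line breaks.
NEWLINE_REPRESENTATIONS = ('\\n', '\\r', '\n', '\r', '<br>', '<br />')


def trim_newline_representations(line):
    """Hypothetical helper: strip newline representations from both ends."""
    changed = True
    while changed:
        changed = False
        for token in NEWLINE_REPRESENTATIONS:
            if line.startswith(token):
                line = line[len(token):]
                changed = True
            if line.endswith(token):
                line = line[:-len(token)]
                changed = True
    return line


print(trim_newline_representations('<br>password\\n\r\n'))
```

Tokens in the middle of a line are left alone, so 'pass<br>word' is returned unchanged.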
- bin.demeuk.clean_up(filename, chunk_start, chunk_size, config)
Main clean loop, this calls all the other clean functions.
- Parameters:
line (bytes) – Line to be cleaned up
- Returns:
(str(Decoded line), str(Failed line))
- bin.demeuk.main()
- bin.demeuk.remove_email(line)
Removes e-mail addresses from a line.
- Params:
line (unicode)
- Returns:
line (unicode)
- bin.demeuk.remove_punctuation(line, punctuation)
Returns the line without punctuation
- Param:
line (unicode)
punctuation (unicode)
- Returns:
line without punctuation
- bin.demeuk.remove_strip_punctuation(line, punctuation)
Returns the line without start and end punctuation
- Param:
line (unicode)
- Returns:
line without start and end punctuation
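Stripping leading and trailing punctuation maps directly onto str.strip with a character set. The sketch assumes the punctuation argument is a plain string of characters and uses Python's string.punctuation plus a space as a stand-in default; strip_punctuation is a hypothetical name.

```python
import string


def strip_punctuation(line, punctuation=string.punctuation + ' '):
    """Hypothetical helper: remove punctuation chars from both ends only."""
    # str.strip treats its argument as a set of characters, not a substring.
    return line.strip(punctuation)


print(strip_punctuation('--password!!'))
print(strip_punctuation('pass-word'))
```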
- bin.demeuk.try_encoding(line, encoding)
Tries to decode a line using supplied encoding
- Params:
line (Byte): byte variable that will be decoded
encoding (string): the encoding to be tried
- Returns:
False if decoding failed
String if decoding worked
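The "False or decoded string" convention can be sketched with bytes.decode. The helper name try_decode is hypothetical; catching LookupError for unknown encoding names is an addition of this sketch, not something this reference states about try_encoding.

```python
def try_decode(line, encoding):
    """Hypothetical helper: decode bytes, or return False when that fails."""
    try:
        return line.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: invalid bytes; LookupError: unknown encoding name.
        return False


print(try_decode(b'caf\xc3\xa9', 'utf-8'))
print(try_decode(b'caf\xe9', 'utf-8'))
```

An empty byte string decodes to an empty str, which is falsy, so callers should compare the result with "is False" instead of relying on truthiness.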