Text Files

  • the main thing to note here is the encoding
  • a simple text file needs no additional metadata other than knowledge of its character encoding
  • they have low entropy per byte: uncompressed text is redundant, so it occupies more space than its information content requires
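
The redundancy of plain text can be seen by compressing it. The following is a minimal sketch using Python's standard `zlib` module; the sample sentence is a made-up, deliberately repetitive input, so real files will compress less dramatically.

```python
import zlib

# Hypothetical sample: a repetitive English sentence, encoded as UTF-8.
text = "the quick brown fox jumps over the lazy dog\n" * 200
raw = text.encode("utf-8")
packed = zlib.compress(raw)

print(len(raw), "bytes as plain text")
print(len(packed), "bytes after compression")

# Compression is lossless: decompressing returns the original bytes.
restored = zlib.decompress(packed)
print(restored == raw)
```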

text formats

There are many ways in which you can store text.

  • as a normal text file
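    • even a plain text file involves several choices
    • such as
      • the end-of-line sequence: LF or CRLF
      • the character encoding
        • UTF-8 - variable-width encoding, 1 to 4 bytes per character
        • UTF-16 - variable-width encoding, one or two 16-bit code units per character
        • ISO-8859-1 - fixed-width encoding, one byte per character
    • plain text can also follow a lightweight markup specification
      • markdown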
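  • .docx - format of Microsoft Word documents; a zipped XML format standardized as Office Open XML (the older binary .doc format was proprietary)
  • .pdf - Portable Document Format, developed by Adobe and later standardized as ISO 32000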
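
To illustrate why the encoding and the end-of-line sequence matter, here is a small Python sketch. The sample string and the helper `detect_eol` are hypothetical names made up for this example; the check only works for ASCII-compatible encodings such as UTF-8 or ISO-8859-1, not for UTF-16.

```python
text = "café\n"

# The same text becomes different bytes under different encodings.
utf8 = text.encode("utf-8")          # é takes two bytes
utf16 = text.encode("utf-16-le")     # every character here takes two bytes
latin1 = text.encode("iso-8859-1")   # every character takes one byte
print(len(utf8), len(utf16), len(latin1))

# Decoding with the wrong encoding silently produces garbage.
print(utf8.decode("iso-8859-1"))

def detect_eol(data: bytes) -> str:
    """Report 'CRLF' if any CRLF pair is present, else 'LF' or 'none'."""
    if b"\r\n" in data:
        return "CRLF"
    if b"\n" in data:
        return "LF"
    return "none"

print(detect_eol(b"line one\r\nline two\r\n"))
print(detect_eol(utf8))
```

Nothing in the bytes themselves says which encoding was used, which is why the reader has to know it in advance.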

Character Encoding

  • https://en.wikipedia.org/wiki/Character_encoding

  • A character is a minimal unit of text that has semantic value.

  • A character set is a collection of characters that might be used by multiple languages. Example: The Latin character set is used by English and most European languages, though the Greek character set is used only by the Greek language.
  • A coded character set is a character set in which each character corresponds to a unique number.
  • A code point of a coded character set is any allowed value in the character set or code space.
  • A code space is a range of integers whose values are code points.
  • A code unit is the "word size" of the character encoding scheme, such as 7-bit, 8-bit, 16-bit. In some schemes, some characters are encoded using multiple code units, resulting in a variable-length encoding. A code unit is referred to as a code value in some documents.
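
The terms above can be made concrete with a short Python sketch. It prints the Unicode code point of a few characters and counts how many code units each one needs in UTF-8 (8-bit units) and UTF-16 (16-bit units). The helper `code_units` is a hypothetical name introduced only for this illustration.

```python
def code_units(ch: str, encoding: str, unit_bytes: int) -> int:
    """Number of code units needed to encode `ch` in `encoding`."""
    return len(ch.encode(encoding)) // unit_bytes

# "\U0001F600" is the grinning-face emoji, outside the 16-bit range.
for ch in ["A", "é", "€", "\U0001F600"]:
    code_point = ord(ch)
    utf8_units = code_units(ch, "utf-8", 1)
    utf16_units = code_units(ch, "utf-16-le", 2)  # "-le" avoids a BOM
    print(f"U+{code_point:04X}  utf-8: {utf8_units}  utf-16: {utf16_units}")
```

Both encodings are variable-length: UTF-8 uses one to four code units per code point, and UTF-16 uses one or two.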

how to convert between these text formats

  • pandoc can be used to convert between different document formats.
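
A minimal sketch of driving pandoc from Python, assuming the `pandoc` executable is installed and on the PATH. The function names and the file names in the usage note are hypothetical; the only pandoc feature used is the `-o` output option, with formats inferred from the file extensions.

```python
import shutil
import subprocess

def pandoc_command(src: str, dst: str) -> list[str]:
    """Build the argument list for: pandoc SRC -o DST."""
    return ["pandoc", src, "-o", dst]

def convert(src: str, dst: str) -> bool:
    """Run pandoc if it is installed; return False if it is missing."""
    if shutil.which("pandoc") is None:
        return False
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(src, dst), check=True)
    return True
```

For example, `convert("notes.md", "notes.docx")` would ask pandoc to turn a Markdown file into a Word document.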

configuration files of applications

What files do programs use to store their configuration? These are some of the options.

.ini

; last modified 1 April 2001 by John Doe
[owner]
name = John Doe
organization = Acme Widgets Inc.

[database]
; use IP address in case network name resolution is not working
server = 192.0.2.62
port = 143
file = "payroll.dat"
  • not a stable format: there is no formal standard, so details differ between parsers

basics

  • comments start with ;
  • sections are declared as [section 1]
  • key-value pairs are written as name = "shivanshu"
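
As a rough sketch, the sample above can be read with Python's standard `configparser` module, which is one INI dialect among several. The variable names are made up for this example. Note that this parser returns every value as a string and keeps the quotes around `"payroll.dat"`.

```python
import configparser

INI_TEXT = """\
; last modified 1 April 2001 by John Doe
[owner]
name = John Doe
organization = Acme Widgets Inc.

[database]
; use IP address in case network name resolution is not working
server = 192.0.2.62
port = 143
file = "payroll.dat"
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

print(parser.sections())                  # ['owner', 'database']
print(parser["owner"]["name"])            # John Doe
print(parser.getint("database", "port"))  # 143, converted on request
print(parser["database"]["file"])         # "payroll.dat", quotes included
```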

xml

toml

# This is a TOML document.

title = "TOML Example"

[owner]
name = "Tom Preston-Werner"
dob = 1979-05-27T07:32:00-08:00 # First class dates

[database]
server = "192.168.1.1"
ports = [ 8000, 8001, 8002 ]
connection_max = 5000
enabled = true

[servers]

  # Indentation (tabs and/or spaces) is allowed but not required
  [servers.alpha]
  ip = "10.0.0.1"
  dc = "eqdc10"

  [servers.beta]
  ip = "10.0.0.2"
  dc = "eqdc10"

[clients]
data = [ ["gamma", "delta"], [1, 2] ]

# Line breaks are OK when inside arrays
hosts = [
  "alpha",
  "omega"
]

Recfiles

yaml

--- # Indented Block
  name: John Smith
  age: 33
--- # Inline Block
{name: John Smith, age: 33}

json file

{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
      "Catherine",
      "Thomas",
      "Trevor"
  ],
  "spouse": null
}
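
A short sketch of parsing a trimmed version of this document with Python's standard `json` module. It shows how JSON types map to Python types: objects become dicts, arrays become lists, `true` becomes `True`, and `null` becomes `None`.

```python
import json

JSON_TEXT = """
{
  "firstName": "John",
  "isAlive": true,
  "age": 27,
  "phoneNumbers": [
    {"type": "home", "number": "212 555-1234"}
  ],
  "spouse": null
}
"""

person = json.loads(JSON_TEXT)

print(person["firstName"])                  # John
print(person["isAlive"])                    # True
print(person["age"] + 1)                    # 28
print(person["phoneNumbers"][0]["number"])  # 212 555-1234
print(person["spouse"])                     # None

# Serializing and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(person))
print(round_trip == person)                 # True
```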

Resources