Text Files
- the main thing to note here is the encoding
- a simple text file needs no additional metadata other than knowledge of the character set
- plain text is uncompressed and has low entropy, that is, it occupies more space than its information content requires
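A quick way to see the "more space than needed" point is to compress some text and compare sizes. This is only a sketch: the sample string is made up and deliberately repetitive, so real files will usually compress less dramatically.

```python
import zlib

# Hypothetical sample: repetitive English text stored as UTF-8 bytes.
text = "the quick brown fox jumps over the lazy dog\n" * 200
raw = text.encode("utf-8")

# zlib removes redundancy that the plain text file stores verbatim.
packed = zlib.compress(raw)

print("plain bytes:     ", len(raw))
print("compressed bytes:", len(packed))

# Decompressing restores the exact original bytes, so nothing was lost.
restored = zlib.decompress(packed)
print("round trip ok:   ", restored == raw)
```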
text formats
There are many ways in which you can store text.
- as a normal text file
  - even a normal text file involves several choices, like
    - end of line sequence: is it LF or CRLF
    - which encoding it is using
      - UTF-8: variable-width character encoding
      - UTF-16
      - ISO-8859-1
- as a format that includes other document specifications
  - markdown
  - .docx: format of Microsoft Word documents (Office Open XML, developed by Microsoft)
  - .pdf: Portable Document Format, developed by Adobe
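The two choices hidden in a plain text file (line endings and encoding) can be poked at from Python. This is a rough sketch: `detect_newline` is a hypothetical helper written for these notes, and the sample string is arbitrary.

```python
def detect_newline(data: bytes) -> str:
    """Report 'CRLF' if any CRLF pair is present, else 'LF' if any LF is, else 'none'."""
    if b"\r\n" in data:
        return "CRLF"
    if b"\n" in data:
        return "LF"
    return "none"


print(detect_newline(b"first\r\nsecond\r\n"))
print(detect_newline(b"first\nsecond\n"))

# The same five characters take a different number of bytes in each encoding.
sample = "café\n"
sizes = {enc: len(sample.encode(enc)) for enc in ("utf-8", "utf-16-le", "iso-8859-1")}
print(sizes)

# Reading UTF-8 bytes with the wrong encoding garbles non-ASCII characters.
garbled = sample.encode("utf-8").decode("iso-8859-1")
print(repr(garbled))
```

This is why the encoding has to be known from somewhere outside the file: the bytes alone do not say how to read them.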
Character Encoding
- A character is a minimal unit of text that has semantic value.
- A character set is a collection of characters that might be used by multiple languages. Example: The Latin character set is used by English and most European languages, whereas the Greek character set is used only by the Greek language.
- A coded character set is a character set in which each character corresponds to a unique number.
- A code point of a coded character set is any allowed value in the character set or code space.
- A code space is a range of integers whose values are code points.
- A code unit is the "word size" of the character encoding scheme, such as 7-bit, 8-bit, 16-bit. In some schemes, some characters are encoded using multiple code units, resulting in a variable-length encoding. A code unit is referred to as a code value in some documents.
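The difference between a code point and a code unit is easier to see with a few characters. A small sketch, assuming Unicode as the coded character set; `code_units` is a hypothetical helper and the four sample characters are arbitrary picks.

```python
def code_units(ch: str, encoding: str, unit_bytes: int) -> int:
    """Count how many code units of the given size the encoding needs for ch."""
    return len(ch.encode(encoding)) // unit_bytes


for ch in ["A", "é", "€", "😀"]:
    code_point = ord(ch)                    # the number assigned to the character
    utf8 = code_units(ch, "utf-8", 1)       # UTF-8 uses 8-bit code units
    utf16 = code_units(ch, "utf-16-le", 2)  # UTF-16 uses 16-bit code units
    print(f"U+{code_point:04X}  utf-8 units: {utf8}  utf-16 units: {utf16}")
```

Each character has exactly one code point, but the number of code units depends on the encoding scheme, which is what makes UTF-8 and UTF-16 variable-length.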
how to convert between these text formats
- pandoc can be used to convert between different document formats.
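A minimal sketch of driving pandoc from a script, assuming pandoc is installed and on the PATH. `pandoc_command` and `convert` are hypothetical helper names, and the file names in the usage comment are placeholders. When no formats are given explicitly, pandoc guesses them from the file extensions.

```python
import shutil
import subprocess


def pandoc_command(src: str, dst: str) -> list[str]:
    """Build the argument list for `pandoc SRC -o DST`."""
    return ["pandoc", src, "-o", dst]


def convert(src: str, dst: str) -> None:
    """Run pandoc, failing early if it is not installed."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    # check=True raises CalledProcessError if pandoc exits with an error.
    subprocess.run(pandoc_command(src, dst), check=True)


# Usage (placeholder file names):
# convert("notes.md", "notes.docx")
print(pandoc_command("notes.md", "notes.docx"))
```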
configuration files of applications
What kind of file do programs use to store their configuration? These are some options:
- a binary file (you can't edit it properly with a text editor)
- a binary file plus some encryption (hard to edit even with an external program)
- a text file (can be edited with any text editor)
yaml
- this format is used by many applications, like Alacritty (which used YAML before moving to TOML in version 0.13)
- how can you structure configuration files
- https://en.wikipedia.org/wiki/Comparison_of_data-serialization_formats
- https://en.wikipedia.org/wiki/Markup_language
.ini
; last modified 1 April 2001 by John Doe
[owner]
name = John Doe
organization = Acme Widgets Inc.
[database]
; use IP address in case network name resolution is not working
server = 192.0.2.62
port = 143
file = "payroll.dat"
- no formal standard: syntax details vary between implementations
basics
- comments start with ;
- sections are declared as [section 1]
- key value pairs are written as name = "shivanshu"
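To see how these three pieces come back out of a parser, here is a sketch using Python's standard `configparser` on the sample file from above. It shows one dialect only; other INI parsers may treat quotes and comments differently.

```python
import configparser

INI_TEXT = """\
; last modified 1 April 2001 by John Doe
[owner]
name = John Doe
organization = Acme Widgets Inc.

[database]
; use IP address in case network name resolution is not working
server = 192.0.2.62
port = 143
file = "payroll.dat"
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

sections = parser.sections()              # comment lines are skipped
owner_name = parser["owner"]["name"]      # every value is read as a string
port = parser.getint("database", "port")  # conversion to int is explicit
data_file = parser["database"]["file"]    # configparser keeps the quotes as part of the value

print(sections, owner_name, port, data_file)
```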
xml
- https://en.wikipedia.org/wiki/XML
- Extensible Markup Language
- widely used to exchange data over the internet
- also supports schemas and validation
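A small sketch of reading configuration stored as XML with Python's standard `xml.etree.ElementTree`. The element and attribute names (`config`, `owner`, `database`) are made up for illustration. Note that this parser only checks that the document is well formed; validating against a schema needs a separate tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration document.
XML_TEXT = """<config>
  <owner name="John Doe"/>
  <database server="192.0.2.62" port="143"/>
</config>"""

root = ET.fromstring(XML_TEXT)
database = root.find("database")

server = database.get("server")
port = int(database.get("port"))  # attribute values are always strings
print(root.tag, server, port)

# A document with a missing closing tag is rejected by the parser.
try:
    ET.fromstring("<config><owner></config>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print("broken document accepted:", well_formed)
```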
toml
# This is a TOML document.
title = "TOML Example"
[owner]
name = "Tom Preston-Werner"
dob = 1979-05-27T07:32:00-08:00 # First class dates
[database]
server = "192.168.1.1"
ports = [ 8000, 8001, 8002 ]
connection_max = 5000
enabled = true
[servers]
# Indentation (tabs and/or spaces) is allowed but not required
[servers.alpha]
ip = "10.0.0.1"
dc = "eqdc10"
[servers.beta]
ip = "10.0.0.2"
dc = "eqdc10"
[clients]
data = [ ["gamma", "delta"], [1, 2] ]
# Line breaks are OK when inside arrays
hosts = [
"alpha",
"omega"
]
Recfiles
yaml
--- # Indented Block
name: John Smith
age: 33
--- # Inline Block
{name: John Smith, age: 33}
json file
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    }
  ],
  "children": [
    "Catherine",
    "Thomas",
    "Trevor"
  ],
  "spouse": null
}
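A short sketch of loading a JSON document like the one above with Python's standard `json` module. The text is a trimmed copy of the example, kept on a few lines for brevity.

```python
import json

JSON_TEXT = """
{
  "firstName": "John",
  "isAlive": true,
  "age": 27,
  "phoneNumbers": [{"type": "home", "number": "212 555-1234"}],
  "spouse": null
}
"""

person = json.loads(JSON_TEXT)

# JSON types map onto Python types: true -> True, null -> None,
# objects -> dict, arrays -> list.
is_alive = person["isAlive"]
spouse = person["spouse"]
first_phone = person["phoneNumbers"][0]["number"]
print(is_alive, spouse, first_phone)

# Writing it back out with indentation gives an editable text file again.
pretty = json.dumps(person, indent=2)
print(pretty)
```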