File
Collect logs from files
Requirements
The vector process must have the ability to read the files listed in include and execute any of the parent directories for these files. Please see File permissions for more details.
Configuration
Example configurations
{
"sources": {
"my_source_id": {
"type": "file",
"include": [
"/var/log/**/*.log"
]
}
}
}
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]
sources:
my_source_id:
type: file
include:
- /var/log/**/*.log
{
"sources": {
"my_source_id": {
"type": "file",
"data_dir": "/var/local/lib/vector/",
"exclude": [
"/var/log/binary-file.log"
],
"file_key": "file",
"glob_minimum_cooldown_ms": 1000,
"host_key": "hostname",
"ignore_older_secs": 600,
"include": [
"/var/log/**/*.log"
],
"line_delimiter": "\n",
"max_line_bytes": 102400,
"max_read_bytes": 2048,
"offset_key": "offset",
"read_from": "beginning",
"rotate_wait_secs": 9223372036854776000
}
}
}
[sources.my_source_id]
type = "file"
data_dir = "/var/local/lib/vector/"
exclude = [ "/var/log/binary-file.log" ]
file_key = "file"
glob_minimum_cooldown_ms = 1_000
host_key = "hostname"
ignore_older_secs = 600
include = [ "/var/log/**/*.log" ]
line_delimiter = """
"""
max_line_bytes = 102_400
max_read_bytes = 2_048
offset_key = "offset"
read_from = "beginning"
rotate_wait_secs = 9_223_372_036_854_775_807
sources:
my_source_id:
type: file
data_dir: /var/local/lib/vector/
exclude:
- /var/log/binary-file.log
file_key: file
glob_minimum_cooldown_ms: 1000
host_key: hostname
ignore_older_secs: 600
include:
- /var/log/**/*.log
line_delimiter: "\n"
max_line_bytes: 102400
max_read_bytes: 2048
offset_key: offset
read_from: beginning
rotate_wait_secs: 9223372036854775807
acknowledgements
optional object
Controls how acknowledgements are handled by this source.
This setting is deprecated in favor of enabling acknowledgements at the global or sink level.
Enabling or disabling acknowledgements at the source level has no effect on acknowledgement behavior.
See End-to-end Acknowledgements for more information on how event acknowledgement is handled.
acknowledgements.enabled
optional bool
Whether or not end-to-end acknowledgements are enabled.
data_dir
optional string literal
The directory used to persist file checkpoint positions.
By default, the global data_dir option is used.
Make sure the running user has write permissions to this directory.
If this directory is specified, then Vector will attempt to create it.
encoding
optional object
encoding.charset
required string literal
Encoding of the source messages.
Takes one of the encoding label strings defined as part of the Encoding Standard.
When set, the messages are transcoded from the specified encoding to UTF-8, which is the encoding that is assumed internally for string-like data. Enable this transcoding operation if you need your data to be in UTF-8 for further processing. At the time of transcoding, any malformed sequences (those that can’t be mapped to UTF-8) are replaced with the Unicode REPLACEMENT CHARACTER and warnings are logged.
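For example, a minimal sketch, assuming log files encoded as UTF-16LE (the source ID and path are illustrative):
[sources.my_legacy_logs]
type = "file"
# Hypothetical path to files written in a non-UTF-8 encoding.
include = [ "/var/log/legacy/*.log" ]

[sources.my_legacy_logs.encoding]
# "utf-16le" is one of the Encoding Standard labels accepted by charset.
charset = "utf-16le"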
exclude
optional [string]
Array of file patterns to exclude. Globbing is supported.
Takes precedence over the include option. Note: The exclude patterns are applied after the attempt to glob everything in include. This means that all files are first matched by include and then filtered by the exclude patterns. This can be impactful if include contains directories with contents that are not accessible.
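As a sketch of how include and exclude interact (paths are illustrative), the following globs every log file and then filters one out:
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]
# Matched by the include glob above, then removed by exclude.
exclude = [ "/var/log/binary-file.log" ]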
file_key
optional string literal
Overrides the name of the log field used to add the file path to each event.
The value is the full path of the file the event was read from.
Set to "" to suppress this key.
file
fingerprint
optional object
Configuration for how files should be identified.
This is important for checkpointing when file rotation is used.
fingerprint.ignored_header_bytes
optional uint
The number of bytes to skip ahead (or ignore) when reading the data used for generating the checksum.
This can be helpful if all files share a common header that should be skipped.
Relevant when: strategy = "checksum"
fingerprint.lines
optional uint
The number of lines to read for generating the checksum.
If your files share a common header that is not always a fixed size, increase this value so that the checksum covers lines that are unique to each file. If a file has fewer than this number of lines, it is not read at all.
Relevant when: strategy = "checksum"
1
(lines)
fingerprint.strategy
optional string literal enum
The strategy used to uniquely identify files.
This is important for checkpointing when file rotation is used.
Option | Description |
---|---|
checksum | Read lines from the beginning of the file and compute a checksum over them. |
device_and_inode | Use the device and inode as the identifier. |
checksum
glob_minimum_cooldown_ms
optional uint
The delay between file discovery calls.
This controls the interval at which files are searched. A higher value results in greater chances of some short-lived files being missed between searches, but a lower value increases the performance impact of file discovery.
1000
(milliseconds)
host_key
optional string literal
Overrides the name of the log field used to add the current hostname to each event.
By default, the global log_schema.host_key
option is used.
Set to ""
to suppress this key.
ignore_checkpoints
optional bool
Whether or not to ignore existing checkpoints when determining where to start reading a file.
Checkpoints are still written normally.
ignore_not_found
optional bool
Ignore missing files when fingerprinting.
This may be useful when used with source directories containing dangling symlinks.
false
ignore_older_secs
optional uint
Ignore files with a data modification date older than the specified number of seconds.
internal_metrics
optional object
internal_metrics.include_file_tag
optional bool
Whether or not to include the “file” tag on the component’s corresponding internal metrics.
This is useful for distinguishing between different files while monitoring. However, the tag’s cardinality is unbounded.
false
line_delimiter
optional string literal
String sequence used to separate one file line from another.
max_line_bytes
optional uint
The maximum size of a line before it is discarded.
This protects against malformed lines or tailing incorrect files.
102400
(bytes)
max_read_bytes
optional uint
The maximum number of bytes to read from a single file before switching over to the next file.
This allows distributing the reads more or less evenly across the files.
Note: This does not apply when oldest_first is true.
2048
(bytes)
multiline
optional object
Multiline aggregation configuration.
If not specified, multiline aggregation is disabled.
multiline.condition_pattern
required string literal
Regular expression pattern that is used to determine whether or not more lines should be read.
This setting must be configured in conjunction with mode.
multiline.mode
required string literal enum
Aggregation mode.
This setting must be configured in conjunction with condition_pattern.
Option | Description |
---|---|
continue_past | All consecutive lines matching this pattern, plus one additional line, are included in the group. This is useful in cases where a log message ends with a continuation marker, such as a backslash, indicating that the following line is part of the same message. |
continue_through | All consecutive lines matching this pattern are included in the group. The first line (the line that matched the start pattern) does not need to match the condition_pattern. This is useful in cases such as a Java stack trace, where some indicator in the line (such as a leading whitespace) indicates that it is an extension of the preceding line. |
halt_before | All consecutive lines not matching this pattern are included in the group. This is useful where a log line contains a marker indicating that it begins a new message. |
halt_with | All consecutive lines, up to and including the first line matching this pattern, are included in the group. This is useful where a log line ends with a termination marker, such as a semicolon. |
multiline.start_pattern
required string literal
Start regex pattern to look for as a beginning of the message.
multiline.timeout_ms
required uint
The maximum amount of time to wait for the next additional line, in milliseconds.
Once this timeout is reached, the buffered message is guaranteed to be flushed, even if incomplete.
offset_key
optional string literal
Enables adding the file offset to each event and sets the name of the log field used.
The value is the byte offset of the start of the line within the file.
Off by default, the offset is only added to the event if this is set.
oldest_first
optional bool
Instead of balancing read capacity fairly across all watched files, prioritize draining the oldest files before moving on to read data from younger files. See the File Read Order section for more details.
false
read_from
optional string literal enum
File position to use when reading a new file.
Option | Description |
---|---|
beginning | Read from the beginning of the file. |
end | Start reading from the current end of the file. |
beginning
remove_after_secs
optional uint
After reaching EOF, the number of seconds to wait before removing the file, unless new data is written.
If not specified, files are not removed.
rotate_wait_secs
optional uint
How long to keep an open handle to a rotated log file. The default value represents “no limit”.
9223372036854775807
(seconds)
Outputs
<component_id>
Default output stream of the component. Use this component’s ID as an input to downstream transforms and sinks.
Output Data
Logs
Line
An individual line from a file. Lines can be merged using the multiline options.
file (required, string): The absolute path of the file the line was read from. Example: /var/log/apache/access.log
host (required, string): The local hostname, equivalent to the output of the gethostname command. Example: my-host.local
message (required, string): The raw line from the file. Example: 53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308
source_type (required, string): The name of the source type. Example: file
timestamp (required, timestamp): The exact time the event was ingested into Vector. Example: 2020-10-10T17:07:36.452332Z
Telemetry
Metrics
checkpoints_total (counter)
checksum_errors_total (counter)
component_discarded_events_total (counter): The number of events dropped by this component. The intentional tag is true if an event was discarded intentionally, as in a filter transform, or false if due to an error.
component_errors_total (counter)
component_received_bytes_total (counter)
component_received_event_bytes_total (counter)
component_received_events_count (histogram): A histogram of the number of events passed in each internal batch in Vector’s internal topology. Note that this is separate from sink-level batching. It is mostly useful for low-level debugging of performance issues in Vector due to small internal batches.
component_received_events_total (counter)
component_sent_event_bytes_total (counter)
component_sent_events_total (counter)
files_added_total (counter)
files_deleted_total (counter)
files_resumed_total (counter)
files_unwatched_total (counter)
source_lag_time_seconds (histogram)
Examples
Apache Access Log
Given this event:
53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] "GET /disintermediate HTTP/2.0" 401 20308
sources:
my_source_id:
type: file
include:
- /var/log/**/*.log
[sources.my_source_id]
type = "file"
include = [ "/var/log/**/*.log" ]
{
"sources": {
"my_source_id": {
"type": "file",
"include": [
"/var/log/**/*.log"
]
}
}
}
Vector will output the following log event:
{
"file": "/var/log/apache/access.log",
"host": "my-host.local",
"message": "53.126.150.246 - - [01/Oct/2020:11:25:58 -0400] \"GET /disintermediate HTTP/2.0\" 401 20308",
"source_type": "file",
"timestamp": "2020-10-10T17:07:36.452332Z"
}
How it works
Autodiscovery
Vector will continually look for new files matching any of your include patterns. The frequency of this search is controlled via the glob_minimum_cooldown_ms option. If a new file is added that matches any of the supplied patterns, Vector will begin tailing it. Vector maintains a unique list of files and will not tail a file more than once, even if it matches multiple patterns. You can read more about how we identify files in the Fingerprinting section.
Checkpointing
Vector checkpoints the current read position after each successful read. This ensures that Vector resumes where it left off if restarted, preventing data from being read twice. The checkpoint positions are stored in the data directory, which is specified via the global data_dir option, but can be overridden via the data_dir option in the file source directly.
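A minimal sketch, assuming you want this source’s checkpoints stored in a dedicated directory (the path is illustrative):
[sources.my_file_source]
type = "file"
include = [ "/var/log/**/*.log" ]
# Checkpoint positions persist here instead of the global data_dir.
data_dir = "/var/local/lib/vector/"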
Compressed Files
Vector will transparently detect files which have been compressed using Gzip and decompress them for reading. This detection process looks for the unique sequence of bytes in the Gzip header and does not rely on the compressed files adhering to any kind of naming convention.
One caveat with reading compressed files is that Vector is not able to efficiently seek into them. Rather than implement a potentially-expensive full scan as a seek mechanism, Vector currently will not attempt to make further reads from a file for which it has already stored a checkpoint in a previous run. For this reason, users should take care to allow Vector to fully process any compressed files before shutting the process down or moving the files to another location on disk.
File Deletion
When a watched file is deleted, Vector will maintain its open file handle and continue reading until it reaches EOF. When a file is no longer findable in the include option and the reader has reached EOF, that file’s reader is discarded.
File Read Order
By default, Vector attempts to allocate its read bandwidth fairly across all of the files it’s currently watching. This prevents a single very busy file from starving other independent files from being read. In certain situations, however, this can lead to interleaved reads from files that should be read one after the other.
For example, consider a service that logs to a timestamped file, creating a new one at an interval and leaving the old one as-is. Under normal operation, Vector would follow writes as they happen to each file and there would be no interleaving. In an overload situation, however, Vector may pick up and begin tailing newer files before catching up to the latest writes from older files. This would cause writes from a single logical log stream to be interleaved in time and potentially slow down ingestion as a whole, since the fixed total read bandwidth is allocated across an increasing number of files.
To address this type of situation, Vector provides the
oldest_first
option. When set, Vector will not read from any file
younger than the oldest file that it hasn’t yet caught up to. In
other words, Vector will continue reading from older files as long
as there is more data to read. Only once it hits the end will it
then move on to read from younger files.
Whether or not to use the oldest_first flag depends on the
organization of the logs you’re configuring Vector to tail. If your
include
option contains multiple independent logical log streams
(e.g. Nginx’s access.log and error.log, or logs from multiple
services), you are likely better off with the default behavior. If
you’re dealing with a single logical log stream or if you value
per-stream ordering over fairness across streams, consider setting
the oldest_first
option to true.
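As a sketch, assuming a single logical stream of timestamped files (the source ID and path are illustrative):
[sources.my_app_log]
type = "file"
include = [ "/var/log/my-app/*.log" ]
# Drain older files completely before reading from younger ones.
oldest_first = true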
File Rotation
Vector supports tailing across a number of file rotation strategies.
The default behavior of logrotate
is simply to move the old log
file and create a new one. This requires no special configuration of
Vector, as it will maintain its open file handle to the rotated log
until it has finished reading and it will find the newly created
file normally.
A popular alternative strategy is copytruncate
, in which
logrotate
will copy the old log file to a new location before
truncating the original. Vector will also handle this well out of
the box, but there are a couple configuration options that will help
reduce the very small chance of missed data in some edge cases. We
recommend a combination of delaycompress
(if applicable) on the
logrotate
side and including the first rotated file in Vector’s
include
option. This allows Vector to find the file after rotation,
read it uncompressed to identify it, and then ensure it has all of
the data, including any written in a gap between Vector’s last read
and the actual rotation event.
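A sketch of that recommendation, assuming a hypothetical service whose live log is rotated to a .1 suffix:
[sources.my_app_log]
type = "file"
# Tailing the first rotated file alongside the live one lets Vector pick up
# any data written between its last read and the rotation event.
include = [ "/var/log/my-app.log", "/var/log/my-app.log.1" ]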
Fingerprinting
By default, Vector identifies files by running a cyclic redundancy
check (CRC) on the first N lines of the file. This serves as a
fingerprint that uniquely identifies the file. The number of lines read, and the number of leading header bytes to skip, can be set using the fingerprint.lines and fingerprint.ignored_header_bytes options.
This strategy avoids the common pitfalls associated with using device and inode identifiers, since inodes can be reused across files. This enables Vector to properly tail files across various rotation strategies.
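A minimal sketch tuning the checksum strategy; the byte and line counts are illustrative assumptions for files that share a common header:
[sources.my_file_source]
type = "file"
include = [ "/var/log/**/*.log" ]

[sources.my_file_source.fingerprint]
strategy = "checksum"
# Skip a fixed-size shared header before reading the fingerprinted lines.
ignored_header_bytes = 128
# Checksum the first two lines after the skipped bytes.
lines = 2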
Globbing
Globs are resolved when Vector starts and then on an interval controlled by the glob_minimum_cooldown_ms option.
Line Delimiters
Each line is read until a new line delimiter (by default the newline character, i.e. the 0xA byte) or EOF is found. If needed, the default line delimiter can be overridden via the line_delimiter option.
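For instance, a sketch for files whose records end in a carriage-return/line-feed pair (the source ID and path are illustrative assumptions):
[sources.my_crlf_logs]
type = "file"
include = [ "/var/log/crlf-app/*.log" ]
# Split records on CRLF instead of the default LF.
line_delimiter = "\r\n"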
Multiline Messages
Sometimes a single log event appears on multiple, consecutive lines. To handle this, Vector offers the multiline options. These options were carefully thought through and will allow you to solve the simplest and most complex cases. Let’s look at a few examples:
Example 1: Ruby Exceptions
Ruby exceptions, when logged, consist of multiple lines:
foobar.rb:6:in `/': divided by 0 (ZeroDivisionError)
from foobar.rb:6:in `bar'
from foobar.rb:2:in `foo'
from foobar.rb:9:in `<main>'
To consume these lines as a single event, use the following Vector configuration:
[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '^[^\s]'
mode = "continue_through"
condition_pattern = '^[\s]+from'
timeout_ms = 1000
- start_pattern, set to '^[^\s]', tells Vector that new multi-line events should not start with white-space.
- mode, set to continue_through, tells Vector to continue aggregating lines until the condition_pattern is no longer valid (excluding the invalid line).
- condition_pattern, set to '^[\s]+from', tells Vector to continue aggregating lines if they start with white-space followed by from.
Example 2: Line Continuations
Some programming languages use the backslash (\
) character to
signal that a line will continue on the next line:
First line\
second line\
third line
To consume these lines as a single event, use the following Vector configuration:
[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '\\$'
mode = "continue_past"
condition_pattern = '\\$'
timeout_ms = 1000
- start_pattern, set to '\\$', tells Vector that new multi-line events start with lines that end in \.
- mode, set to continue_past, tells Vector to continue aggregating lines, plus one additional line, until condition_pattern is false.
- condition_pattern, set to '\\$', tells Vector to continue aggregating lines if they end with a \ character.
Example 3: Timestamps
Activity logs from services such as Elasticsearch typically begin with a timestamp, followed by information on the specific activity, as in this example:
[2015-08-24 11:49:14,389][ INFO ][env ] [Letha] using [1] data paths, mounts [[/
(/dev/disk1)]], net usable_space [34.5gb], net total_space [118.9gb], types [hfs]
To consume these lines as a single event, use the following Vector configuration:
[sources.my_file_source]
type = "file"
# ...
[sources.my_file_source.multiline]
start_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
mode = "halt_before"
condition_pattern = '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}'
timeout_ms = 1000
- start_pattern, set to '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}', tells Vector that new multi-line events start with a timestamp sequence.
- mode, set to halt_before, tells Vector to continue aggregating lines as long as the condition_pattern does not match.
- condition_pattern, set to '^\[[0-9]{4}-[0-9]{2}-[0-9]{2}', tells Vector to aggregate lines up until a line that starts with a timestamp sequence.
File permissions
To be able to source events from the files, Vector must be able to read the files and execute their parent directories.
If you have deployed Vector using one of our distributed packages, then you will find Vector running as the vector user. You should ensure this user has read access to the desired files listed in include. Strategies for this include:
- Create a new unix group, make it the group owner of the target files with read access, and add the vector user to that group.
- Use POSIX ACLs to grant access to the files to the vector user.
- Grant the CAP_DAC_READ_SEARCH Linux capability. This capability bypasses the file system permissions checks to allow Vector to read any file. This is not recommended as it gives Vector more permissions than it requires, but it is preferable to running Vector as root, which would grant it even broader permissions. The capability can be granted via systemd by creating an override file using systemctl edit vector and adding:
AmbientCapabilities=CAP_DAC_READ_SEARCH
CapabilityBoundingSet=CAP_DAC_READ_SEARCH
On Debian-based distributions, the vector
user is
automatically added to the adm
group, if it exists, which has
permissions to read /var/log
.
Read Position
By default, Vector will read from the beginning of newly discovered
files. You can change this behavior by setting the read_from
option to
"end"
.
Previously discovered files will be checkpointed, and
the read position will resume from the last checkpoint. To disable this
behavior, you can set the ignore_checkpoints
option to true
. This
will cause Vector to disregard existing checkpoints when determining the
starting read position of a file.
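As a sketch, assuming you only want data written after Vector starts and want saved checkpoints disregarded (the source ID and path are illustrative):
[sources.my_file_source]
type = "file"
include = [ "/var/log/**/*.log" ]
# Start at the current end of newly discovered files.
read_from = "end"
# Disregard previously saved read positions; checkpoints are still written.
ignore_checkpoints = true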