Mining Splunk's Internal Logs

Matt Alshab, IT Security Consultant, Technical Cyber Services, Coalfire Federal

Splunk is great about logging its warnings and errors, but it won’t tell you about them – you have to ask!


As the leading machine-data analysis platform, it's not surprising that Splunk excels at creating robust logs of its own. The current version of Splunk Enterprise (v8.0.5) generates 22 different logs (for a complete, current list, see "What Splunk logs about itself" in the Splunk documentation). These logs don't count against your license usage, so other than disk space, there is no downside to all this logging, and the information the logs provide can be eye-opening. The challenge for the Splunk administrator is getting a handle on these logs and using them to troubleshoot issues, find unknown errors, and improve performance.
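To see which internal logs your own deployment is actually producing, you can run a quick inventory against the _internal index. This is just a sketch; the exact sourcetypes and sources you see will depend on your Splunk version and the roles of your servers:

index=_internal earliest=-24h
| stats count by sourcetype, source
| sort - count

Running this on a search head versus an indexer or forwarder will return noticeably different lists, which is a useful reminder that each instance logs about itself.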

The most critical log to master is splunkd.log, which records events for the Splunk daemon. The SPL to query the splunkd logs is a bit more complicated than it probably should be. The easy part is setting the index, since all of Splunk's internal logs are conveniently kept in the _internal index. Sourcetype is more complicated, because while there is a splunkd sourcetype, five other logs (splunkd_access.log, splunkd_stdout.log, etc.) share it. Source is also problematic, because the location (and therefore the source) of splunkd.log varies by product and OS. For example, on a Windows Universal Forwarder, splunkd.log is stored at:

C:\Program Files\SplunkUniversalForwarder\var\log\splunk\splunkd.log

But on a Linux server running Splunk Enterprise, the log is at:

/opt/splunk/var/log/splunk/splunkd.log

Luckily, Splunk comes with a pre-defined event type for splunkd.log (eventtype="splunkd-log") which is defined as:

index=_internal source=*/splunkd.log OR source=*\\splunkd.log

That simplifies the SPL for all your splunkd log events from all servers and forwarders to: eventtype="splunkd-log".

To limit the search to just the important events, you need to specify the desired log levels in your SPL. The splunkd.log has five log levels: DEBUG, INFO, WARN, ERROR, and FATAL. Debug is turned off by default, and Info describes expected events. So, to get the important events, you can use the query:

eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL)

This search will include dispatch manager warnings generated when users exceed their quota of concurrent searches. I don't really consider this a splunkd warning so much as a user warning, so I filter these out of the query. You may want to create a separate search with component=DispatchManager to monitor user quotas. The final search is:
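If you do want that separate view of user quota warnings, a minimal sketch might look like the following. (Counting by host is a safe default; the exact warning text and any extracted fields, such as the user name, vary by Splunk version, so verify what your events contain before grouping on anything more specific.)

eventtype="splunkd-log" log_level=WARN component=DispatchManager
| timechart span=1d count by host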

eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL) component!=DispatchManager

As a matter of course, this search (with the addition of a host clause) should be run whenever a change is made. You may not think this is necessary, but in large Splunk environments, even basic maintenance tasks like rebooting a server or upgrading an app can have unforeseen consequences. (I once had four million errors in a single day from an indexer with bucket replication errors after the normal monthly OS patching.) Running this search will let you know if something went wrong so you can correct it before it gets out of hand.

Splunk errors and warnings chart

Now that you have an SPL query for the splunkd.log file, you can easily create a daily report of errors by host, so that with a quick glance you can assess the health of your environment. Errors often happen in relation to load, so the query shows the tally for each of the last eight days; this way you can compare yesterday's results against the week before. I set up my report as a stacked column chart printed in landscape and include the results table as well as the chart in the report. The query for this report is:

eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR) component!=DispatchManager earliest=-8d@d latest=-0d@d
| timechart count by host span=1d limit=10

Splunk errors and warnings table

So far, you have created two queries for reviewing splunkd errors and warnings: a low-level report of events and a high-level chart by day and host. A third report is a rollup of similar errors. That way, you can see which errors are occurring most often in your environment.

Besides the log level, error message, and count, you might also want your report to show the first and last time each error occurred. That way you will know whether the error is ongoing or time-specific. I'm going to leave the dispatch errors (user quota errors) in this report, since that will let me see if someone is constantly running up against their search quota. Finally, you want the report to include the list of computers that generated each error, so you know whether an error is confined to one server or is deployment-wide. The SPL for this report is:

eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL)
| stats count values(host) earliest(_time) as FirstError latest(_time) as LastError by component, log_level, event_message
| convert timeformat="%H:%M:%S" ctime(FirstError) ctime(LastError)
| sort - count

This report is good, but we can do better. What you really want to see is a tally of errors by type, but Splunk does not have error IDs, so you will need to find some other way to group the log events. You can try to use the punct field for grouping, but that doesn't work very well, because many splunkd.log events contain GUIDs whose punctuation varies from event to event. Since the numbers in events are often different even when the words are the same, let's replace every word containing a digit with a pound sign. Let's also collapse each run of whitespace into a single space. Our new query is:

eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL)
| eval event_summary=replace(event_message,"\w*\d\w*","#")
| eval event_summary=replace(event_summary, "\s+", " ")
| stats count values(host) earliest(_time) as FirstError latest(_time) as LastError by component, log_level, event_summary
| convert timeformat="%H:%M:%S" ctime(FirstError) ctime(LastError)
| sort - count
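To see exactly what the two replace() calls do, you can test them against a made-up message with makeresults. The sample message below is hypothetical, not a real splunkd event:

| makeresults
| eval event_message="Failed to read size=1234 event(s) from rawdata in bucket 42 on idx01"
| eval event_summary=replace(event_message,"\w*\d\w*","#")
| eval event_summary=replace(event_summary,"\s+"," ")
| table event_message event_summary

The first replace turns size=1234 into size=# and idx01 into #, and the second normalizes the spacing, so two events that differ only in their numbers collapse to the same event_summary.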

Hopefully these queries will help you manage your Splunk environment better, and you can of course tweak them to fit your needs. For example, you may want to limit results to only the top errors, or only errors from your production servers. It's totally up to you based on your environment. I suggest that you get at least one report delivered to your mailbox every morning. That way you'll always have an idea of the overall health of your environment.
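For example, to keep only the ten most common issues from your production servers, you could add a host filter and a head to the grouped report. (The prod-* host naming below is just an assumption for illustration; substitute your own naming convention.)

eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL) host="prod-*"
| eval event_summary=replace(event_message,"\w*\d\w*","#")
| eval event_summary=replace(event_summary, "\s+", " ")
| stats count by component, log_level, event_summary
| sort - count
| head 10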

Splunk errors and warnings dashboards in department apps

This is the secret sauce for easy Splunk dev-ops. About half the time, ingest errors in Splunk are due to changes made on the forwarders, not the indexers. Errors can happen when the forwarder admins change permissions, delete the splunk user account, upgrade a monitored application, repurpose a server, and so on. When these changes happen, the forwarder admin knows the root cause of the error, but the Splunk admin does not. Rather than having the Splunk admin initiate an error investigation only to find out the forwarder admin changed a setting, flip the paradigm so the forwarder admin lets the Splunk admin know when something is amiss. This is easily accomplished by adding a dashboard showing applicable errors and warnings to every custom or department app. With visibility into their Splunk errors, departments can take greater ownership of their Splunk app and escalate only the issues they need help resolving. This model doesn't work for all organizations, but in general, having users take greater ownership of their data in Splunk will ultimately increase the usage of Splunk in your organization, which is a good thing for everyone.
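A minimal sketch of such a dashboard panel's search, scoped to one department's own forwarders, might look like this. (The deptA-* host pattern is hypothetical; scope it however your organization names or groups its hosts.)

eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL) host="deptA-*"
| stats count by host, component, log_level
| sort - count

Because the department only sees its own hosts, the panel stays relevant to them without exposing the rest of the deployment.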

SPL queries for splunkd.log events

Check for issues on a host

eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR OR log_level=FATAL) component!=DispatchManager host=myhost

Create an eight-day bar chart of all issues

eventtype="splunkd-log" (log_level=WARN OR log_level=ERROR) component!=DispatchManager earliest=-8d@d latest=-0d@d
| timechart count by host span=1d limit=10

Create a grouped report of all issues on all hosts

eventtype="splunkd-log" (log_level=ERROR OR log_level=WARN OR log_level=FATAL)
| eval event_summary=replace(event_message,"\w*\d\w*","#")
| eval event_summary=replace(event_summary, "\s+", " ")
| stats count values(host) earliest(_time) as FirstError latest(_time) as LastError by component, log_level, event_summary
| convert timeformat="%H:%M:%S" ctime(FirstError) ctime(LastError)
| sort - count
