Understanding SEC/EDGAR: Extracting Data & Structure Explained



The XBRL index is a text file whose values are separated by the | character. It follows this specific structure:
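
A minimal sketch of reading such a file, assuming a five-column layout of CIK|Company Name|Form Type|Date Filed|Filename. The column names, the `IndexEntry` type, the `parse_index` helper and the sample row are all assumptions made for illustration.

```python
from typing import Iterator, NamedTuple


class IndexEntry(NamedTuple):
    # Assumed column layout; adjust if the real index differs.
    cik: str
    company_name: str
    form_type: str
    date_filed: str
    filename: str


def parse_index(text: str) -> Iterator[IndexEntry]:
    """Yield one entry per data row of a pipe-delimited index."""
    for line in text.splitlines():
        parts = line.strip().split("|")
        # Keep only rows with five fields and a numeric first field.
        # This skips any preamble, the header row and the dashed rule.
        if len(parts) == 5 and parts[0].isdigit():
            yield IndexEntry(*parts)


# Hypothetical sample content, not a real filing.
SAMPLE_INDEX = """\
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------
1234567|EXAMPLE CORP|10-Q|2022-02-01|edgar/data/1234567/0001234567-22-000001.txt
"""

entries = list(parse_index(SAMPLE_INDEX))
```

Splitting by hand rather than using a CSV reader keeps the header and separator handling explicit, which helps when the file starts with descriptive lines.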



To perform data mining on SEC/EDGAR filings, analysts first need to access the data and convert it into a format that can be easily analyzed. This can involve using software tools to extract the relevant data and convert it into a spreadsheet or other format. 

Once the data is in a usable format, analysts can start to explore it and look for patterns and trends. For example, they might look for companies that have consistently high profits or that are investing heavily in research and development.

Since 2011, the SEC has required companies to submit their filings in XBRL format to make the data more accessible and easier to analyze. However, before 2011, only a small number of companies voluntarily provided their data in this format. The SEC filings index can be found on this website: https://www.sec.gov/Archives/edgar/full-index/. 

The index is organized by year and quarter, with each folder containing many index files. If you're looking for XBRL filings, you can find them in a ZIP archive called xbrl.zip. 

For instance, if you're searching for the first quarter of 2022, you can find it at: 
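
The address follows from the year and quarter layout described above. The sketch below builds it under the assumption that quarter folders are named QTR1 through QTR4; `xbrl_index_url` is a hypothetical helper name.

```python
BASE_URL = "https://www.sec.gov/Archives/edgar/full-index/"


def xbrl_index_url(year: int, quarter: int) -> str:
    """Build the address of the XBRL index archive for one quarter."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be between 1 and 4")
    # Assumption: folders are laid out as <year>/QTR<quarter>/.
    return f"{BASE_URL}{year}/QTR{quarter}/xbrl.zip"


url = xbrl_index_url(2022, 1)
```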


Filing Structure in the SEC/EDGAR System

When a company submits a filing to the SEC/EDGAR system, it's made up of multiple files that are organized into a virtual folder. This folder is identified using two important pieces of information: the company's CIK code and a unique number called the accession number.

The CIK (Central Index Key) code is an identifier that the SEC assigns to each company. It's used to keep track of all the filings that a company submits, as well as other important information like the company's financial statements.

Each filing in the SEC/EDGAR system is assigned an accession number. This number is unique to the filing and distinguishes it from other filings, even those submitted by the same company. The SEC assigns the accession number to make it easier for investors and analysts to locate and review specific filings when researching a particular company.

Together, the CIK code and accession number make up the address for a filing in the SEC/EDGAR system. Investors, analysts, and other users can use this address to access the filing and review the information that the company has submitted.
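
As a rough illustration of how the two identifiers combine into an address, the sketch below assumes the folder path uses the CIK without leading zeros and the accession number without dashes. The helper name and the sample identifiers are hypothetical.

```python
ARCHIVE_ROOT = "https://www.sec.gov/Archives/edgar/data/"


def filing_folder_url(cik: str, accession_number: str) -> str:
    """Combine a CIK and an accession number into a folder address."""
    # Assumption: leading zeros are dropped from the CIK in the path.
    cik_part = str(int(cik))
    # Assumption: dashes are removed from the accession number in the path.
    accession_part = accession_number.replace("-", "")
    return f"{ARCHIVE_ROOT}{cik_part}/{accession_part}/"


# Hypothetical identifiers, used only to show the shape of the result.
folder = filing_folder_url("0001234567", "0001234567-22-000001")
```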


 This is what the filing folder looks like: 


When a company submits a filing to the SEC/EDGAR system, the filing has to include a tagged text file that acts as a container for all the other files in the filing. The SEC uses this text file as the main source of information when analyzing the filing. It includes the iXBRL version of the filing and the other files that belong to the company's specific category.

 Sometimes we see both an iXBRL file (with ".html" or ".htm" extension) and a "native" XBRL file (with ".xml" extension) for the same filing. In these cases, we always use the iXBRL file because it's considered the most reliable source of data. 
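
The preference rule can be sketched as follows. This assumes the list of file names has already been narrowed down to candidate instance documents, since a real filing folder also holds exhibits and other files with the same extensions. `pick_instance_file` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Optional, Sequence

IXBRL_EXTENSIONS = {".htm", ".html"}
NATIVE_XBRL_EXTENSIONS = {".xml"}


def pick_instance_file(candidates: Sequence[str]) -> Optional[str]:
    """Prefer an iXBRL candidate; fall back to native XBRL; else None."""
    def has_extension(name: str, extensions: set) -> bool:
        return PurePosixPath(name).suffix.lower() in extensions

    ixbrl = [n for n in candidates if has_extension(n, IXBRL_EXTENSIONS)]
    if ixbrl:
        return ixbrl[0]
    native = [n for n in candidates if has_extension(n, NATIVE_XBRL_EXTENSIONS)]
    return native[0] if native else None


chosen = pick_instance_file(["report.xml", "report.htm"])
```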

Extracting Quarterly Information from Yearly Reports

Sometimes companies don't provide quarterly reports for the fourth quarter, so we have to calculate it ourselves by subtracting the QTR3 value from the annual value. To do this, we need to find the restated QTR3 value while processing the annual filing.

The "secminer" class keeps track of correction values under a composite key built from several pieces of information:

correction_key_new_fact = f'{cik}|{fyb_date}|{previous_quarter}|{qname}|{signature}'

The key combines five fields: "cik" is the company's unique identifying code, "fyb_date" is the start date of the fiscal year, "previous_quarter" is usually "3", "qname" is the label that identifies the type of financial information, and "signature" is a unique identifier for each piece of financial data.

These corrections are stored in an index, each under its own key. The name of the reporting concept is a specific way of identifying the type of data being reported, such as revenue from a certain type of customer.

Filings are processed in order based on reporting period, meaning QTR3 will be processed before QTR4. This ensures that the correction value for QTR3 is available in the index before we process the QTR4 (annual) report.
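
The lookup described above can be sketched roughly as follows. This is not the actual "secminer" code: `CorrectionIndex` and its methods are hypothetical names, the values are made up, and the stored QTR3 value is assumed to be directly subtractable from the annual value.

```python
from decimal import Decimal
from typing import Dict, Optional


class CorrectionIndex:
    """Hypothetical in-memory index of restated QTR3 values."""

    def __init__(self) -> None:
        self._values: Dict[str, Decimal] = {}

    @staticmethod
    def make_key(cik: str, fyb_date: str, previous_quarter: int,
                 qname: str, signature: str) -> str:
        # Same field order as the key format shown above.
        return f"{cik}|{fyb_date}|{previous_quarter}|{qname}|{signature}"

    def store(self, key: str, value: Decimal) -> None:
        self._values[key] = value

    def derive_q4(self, key: str, annual_value: Decimal) -> Optional[Decimal]:
        """Return annual minus QTR3, or None if no correction is stored."""
        q3_value = self._values.get(key)
        if q3_value is None:
            return None
        return annual_value - q3_value


index = CorrectionIndex()
key = CorrectionIndex.make_key(
    "0001234567", "2021-01-01", 3, "us-gaap:Revenues", "sig-1"
)
index.store(key, Decimal("750"))             # stored before the annual filing
q4 = index.derive_q4(key, Decimal("1000"))   # used while processing it
```

Returning None when the key is missing makes an out-of-order filing visible to the caller instead of silently producing a wrong fourth-quarter value.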


Calculating the Effective Fiscal Period and Period Frequency

In some cases, a company's fiscal year may not align with the calendar year. To determine the effective fiscal period and period frequency, we use the "Fiscal Year End" value provided in each filing. This value combines a month and a day, such as "1231" for a fiscal year that matches the calendar year. Some companies report a different value, such as AAPL, whose fiscal year ends on "0925".
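
One possible way to turn the "Fiscal Year End" value into a fiscal quarter is sketched below. It is an illustration rather than the method used here: `fiscal_quarter` is a hypothetical helper, it looks only at the month, and it ignores cases where a period ends a few days into the following month.

```python
from datetime import date


def fiscal_quarter(period_end: date, fiscal_year_end: str) -> int:
    """Map a period end date to a fiscal quarter (1-4).

    fiscal_year_end is a month and day string such as "1231" or "0925".
    """
    fye_month = int(fiscal_year_end[:2])
    # Months elapsed since the fiscal year started, counted from 0 to 11.
    months_into_year = (period_end.month - fye_month - 1) % 12
    return months_into_year // 3 + 1


calendar_q1 = fiscal_quarter(date(2022, 3, 31), "1231")
shifted_q1 = fiscal_quarter(date(2021, 12, 25), "0925")
shifted_q4 = fiscal_quarter(date(2021, 9, 25), "0925")
```

With a "0925" year end, a period ending in December falls in the first fiscal quarter, while the same date is a fourth-quarter end for a "1231" company.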



Revenue segmentation information is collected from XBRL facts that have custom dimensions. There is no limit on the number of dimensions or on the number of members in each. Additionally, a fact can be reported in multiple dimensions, which means the <context> element may include more than one <explicitMember> or <typedMember> element.
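
A minimal sketch of reading the dimensions from a <context> element with Python's standard XML parser. The sample context, the member names and the `explicit_members` helper are made up for illustration; the namespace URIs are those of the XBRL instance and XBRL Dimensions specifications.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

NS = {
    "xbrli": "http://www.xbrl.org/2003/instance",
    "xbrldi": "http://xbrl.org/2006/xbrldi",
}

# Hypothetical context carrying two dimensions at once.
SAMPLE_CONTEXT = """
<xbrli:context xmlns:xbrli="http://www.xbrl.org/2003/instance"
               xmlns:xbrldi="http://xbrl.org/2006/xbrldi" id="c1">
  <xbrli:entity>
    <xbrli:identifier scheme="http://www.sec.gov/CIK">0001234567</xbrli:identifier>
    <xbrli:segment>
      <xbrldi:explicitMember dimension="srt:StatementGeographicalAxis">country:US</xbrldi:explicitMember>
      <xbrldi:explicitMember dimension="srt:ProductOrServiceAxis">ex:WidgetsMember</xbrldi:explicitMember>
    </xbrli:segment>
  </xbrli:entity>
  <xbrli:period>
    <xbrli:startDate>2021-01-01</xbrli:startDate>
    <xbrli:endDate>2021-12-31</xbrli:endDate>
  </xbrli:period>
</xbrli:context>
"""


def explicit_members(context: ET.Element) -> List[Tuple[str, str]]:
    """Return (dimension, member) pairs in document order."""
    return [
        (member.get("dimension", ""), (member.text or "").strip())
        for member in context.iterfind(".//xbrldi:explicitMember", NS)
    ]


pairs = explicit_members(ET.fromstring(SAMPLE_CONTEXT.strip()))
```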

In XBRL, a common practice is that the sum of fact values for all members in a specific dimension should be equal to the value of the "total" fact, which is reported without any dimensions. For instance, the total revenue should be the sum of revenues for all geographic regions.
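
This consistency rule is easy to check in code. The sketch below uses made-up figures and a hypothetical `members_sum_to_total` helper, with an optional tolerance for rounding differences.

```python
from decimal import Decimal
from typing import Mapping


def members_sum_to_total(
    member_values: Mapping[str, Decimal],
    total: Decimal,
    tolerance: Decimal = Decimal("0"),
) -> bool:
    """Check that the members of one dimension add up to the total fact."""
    members_sum = sum(member_values.values(), Decimal("0"))
    return abs(members_sum - total) <= tolerance


# Made-up revenue figures by geographic region.
revenue_by_region = {
    "Americas": Decimal("600"),
    "Europe": Decimal("300"),
    "Asia": Decimal("100"),
}
is_consistent = members_sum_to_total(revenue_by_region, Decimal("1000"))
```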

 The diagram below illustrates how the connections between XML elements within the XBRL submission form a network of relationships. 


There are two types of segmentation information: Basic and Simplified.

The Simplified type doesn't use detailed labels. It assumes that the value being segmented is "Revenue" and that it is segmented by either "Geography" or "Product", depending on which table you're looking at.


Collecting Metrics Information

To collect metrics, we use XBRL facts that are reported without any dimensions. Each such fact is the total across all members of a dimension. Instead of the XBRL names of the metrics, we use standardized names to make the data easier to understand.

We get the standard measurements from this website: https://www.calcbench.com/api/availablemetrics

If you want to know how we convert measurements to the standard format, you can find more information in the Mapping manual. We create a map for each company that links the names of the measurements in XBRL format to their standard names.

An example of a mapping appears like this: 


For debugging purposes, we have added four extra details to the mapping:


  1. Year of the filing when the mapping was created.
  2. Matched fact value, which is the same in both the calcbench report and XBRL report.
  3. Name of the calcbench section where the value is found.
  4. Type of match, which can be one of the following:
  • 0 - Direct or value match
  • 1 - Aggregation match
  • 2 - Separation match.
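
The four debugging details listed above could be modeled as in the sketch below. The class and field names are hypothetical and the sample values are made up; only the three match type codes come from the list.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import IntEnum


class MatchType(IntEnum):
    DIRECT = 0        # direct or value match
    AGGREGATION = 1   # aggregation match
    SEPARATION = 2    # separation match


@dataclass(frozen=True)
class MappingDebugInfo:
    filing_year: int          # year of the filing when the mapping was created
    matched_value: Decimal    # value found in both reports
    calcbench_section: str    # section where the value was found
    match_type: MatchType


# Made-up values, used only to show the shape of one record.
info = MappingDebugInfo(
    filing_year=2021,
    matched_value=Decimal("1000"),
    calcbench_section="Income Statement",
    match_type=MatchType.DIRECT,
)
```

Using an integer enum keeps the stored codes 0, 1 and 2 unchanged while giving them readable names in code.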

 We create a company map for each reporting period. This map includes relationships between "QName" and the standardized name of the measurement. It's important to note that a single "QName" may be linked to multiple standardized names for measurements. 


1. Sometimes we need to map not only the "XBRL QName" but also the "QName" combined with custom dimension information. Specifically, we establish a relationship between "QName" and "Signature" for Separation Match.


2. Another situation that may arise is when we link multiple "QName/Signature" combinations to a single standardized name for a measurement. We take care of this internally while processing the filing.
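
Both situations fit a map keyed by the "QName" together with an optional "Signature". The sketch below is one possible shape for such a map, not the internal implementation; `CompanyMap`, its methods and the standardized names used in the example are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Optional, Set, Tuple

# A key is a QName plus an optional signature (None when no dimension applies).
MapKey = Tuple[str, Optional[str]]


class CompanyMap:
    """Hypothetical map from QName/Signature keys to standardized names."""

    def __init__(self) -> None:
        self._names: DefaultDict[MapKey, Set[str]] = defaultdict(set)

    def add(self, qname: str, standard_name: str,
            signature: Optional[str] = None) -> None:
        self._names[(qname, signature)].add(standard_name)

    def standard_names(self, qname: str,
                       signature: Optional[str] = None) -> Set[str]:
        # Return a copy so callers cannot modify the map by accident.
        return set(self._names.get((qname, signature), set()))

    def sources_for(self, standard_name: str) -> Set[MapKey]:
        """All QName/Signature keys linked to one standardized name."""
        return {key for key, names in self._names.items()
                if standard_name in names}


company_map = CompanyMap()
company_map.add("us-gaap:Revenues", "Revenue")
company_map.add("us-gaap:Revenues", "TotalRevenue")    # one QName, two names
company_map.add("ex:SegmentRevenue", "Revenue", signature="sig-1")
```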