Empower your AI apps with clean data from any website. Featuring advanced scraping, crawling, and data extraction capabilities.
This repository is in development, and we’re still integrating custom modules into the mono repo. It’s not fully ready for self-hosted deployment yet, but you can run it locally.
What is Firecrawl?
Firecrawl is an API service that takes a URL, crawls it, and converts it into clean markdown or structured data. We crawl all accessible subpages and give you clean data for each. No sitemap required. Check out our documentation.
Psst, hey, you, join our stargazers :)
How to use it?
We provide an easy-to-use API with our hosted version. You can find the playground and documentation here. You can also self-host the backend if you’d like.
If you are using the SDKs, they will automatically pull the response for you:
{
  "success": true,
  "data": {
    "company_mission": "Firecrawl is the easiest way to extract data from the web. Developers use us to reliably convert URLs into LLM-ready markdown or structured data with a single API call.",
    "supports_sso": false,
    "is_open_source": true,
    "is_in_yc": true
  }
}
LLM Extraction (Beta)
Used to extract structured data from scraped pages.
{
  "success": true,
  "data": {
    "content": "Raw Content",
    "metadata": {
      "title": "Mendable",
      "description": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "robots": "follow, index",
      "ogTitle": "Mendable",
      "ogDescription": "Mendable allows you to easily build AI chat applications. Ingest, customize, then deploy with one line of code anywhere you want. Brought to you by SideGuide",
      "ogUrl": "https://mendable.ai/",
      "ogImage": "https://mendable.ai/mendable_new_og1.png",
      "ogLocaleAlternate": [],
      "ogSiteName": "Mendable",
      "sourceURL": "https://mendable.ai/"
    },
    "json": {
      "company_mission": "Train a secure AI on your technical resources that answers customer and employee questions so your team doesn't have to",
      "supports_sso": true,
      "is_open_source": false,
      "is_in_yc": true
    }
  }
}
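As a rough sketch, a client could read a response shaped like the sample above as follows. The field names (success, data, metadata, json) are taken from that sample; the helper name summarize_scrape_response is hypothetical and is not part of any Firecrawl SDK.

```python
from typing import Any


def summarize_scrape_response(response: dict[str, Any]) -> dict[str, Any]:
    """Pull the page title, source URL, and extracted fields out of a response.

    Hypothetical helper: it only assumes the response shape shown in the
    sample above (success / data / metadata / json).
    """
    if not response.get("success"):
        raise ValueError("scrape did not succeed")

    data = response.get("data", {})
    metadata = data.get("metadata", {})
    return {
        "title": metadata.get("title"),
        "source_url": metadata.get("sourceURL"),
        # The "json" key holds the LLM-extracted fields, if any were requested.
        "extracted": data.get("json", {}),
    }


# Trimmed copy of the sample response, used here instead of a live API call.
sample = {
    "success": True,
    "data": {
        "content": "Raw Content",
        "metadata": {"title": "Mendable", "sourceURL": "https://mendable.ai/"},
        "json": {"supports_sso": True, "is_open_source": False},
    },
}
summary = summarize_scrape_response(sample)
print(summary["title"], summary["extracted"]["supports_sso"])
```

Using .get() with defaults keeps the helper from raising a KeyError when a response omits the optional metadata or json sections.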
Extracting without a schema (New)
You can now extract without a schema by passing just a prompt to the endpoint. The LLM chooses the structure of the data.
curl -X POST https://api.firecrawl.dev/v1/scrape \
    -H 'Content-Type: application/json' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    -d '{
      "url": "https://docs.firecrawl.dev/",
      "formats": ["json"],
      "jsonOptions": {
        "prompt": "Extract the company mission from the page."
      }
    }'
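For readers who prefer Python to curl, the same request could be assembled with the standard library, roughly as below. The endpoint, headers, and body mirror the curl command above; the function name build_scrape_request is hypothetical, and the sketch only builds the request without sending it.

```python
import json
import urllib.request


def build_scrape_request(api_key: str, url: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the POST request from the curl example."""
    payload = {
        "url": url,
        "formats": ["json"],
        "jsonOptions": {"prompt": prompt},
    }
    return urllib.request.Request(
        "https://api.firecrawl.dev/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_scrape_request(
    "YOUR_API_KEY",
    "https://docs.firecrawl.dev/",
    "Extract the company mission from the page.",
)
print(request.get_method(), request.full_url)
# To send it, pass the request to urllib.request.urlopen() and JSON-decode the body.
```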
Interacting with the page with Actions (Cloud-only)
Firecrawl allows you to perform various actions on a web page before scraping its content. This is particularly useful for interacting with dynamic content, navigating through pages, or accessing content that requires user interaction.
Here is an example of how to use actions to navigate to google.com, search for Firecrawl, click on the first result, and take a screenshot.
You can now batch scrape multiple URLs at the same time. It is very similar to how the /crawl endpoint works. It submits a batch scrape job and returns a job ID to check the status of the batch scrape.
The search endpoint combines web search with Firecrawl’s scraping capabilities to return full page content for any query.
Include scrapeOptions with formats: ["markdown"] to get complete markdown content for each search result; otherwise, it defaults to returning SERP results (url, title, description).
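The difference between the two modes can be sketched as a small payload builder. The scrapeOptions and formats names come from the text above; the "query" key and the function name build_search_payload are assumptions made for illustration.

```python
from typing import Any


def build_search_payload(query: str, full_content: bool = False) -> dict[str, Any]:
    """Build a search request body.

    "query" is an assumed key name; scrapeOptions/formats come from the text.
    """
    payload: dict[str, Any] = {"query": query}
    if full_content:
        # Ask for the full markdown of every result instead of SERP fields only.
        payload["scrapeOptions"] = {"formats": ["markdown"]}
    return payload


serp_only = build_search_payload("firecrawl web scraping")
with_markdown = build_search_payload("firecrawl web scraping", full_content=True)
print(serp_only)
print(with_markdown)
```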
With LLM extraction, you can easily extract structured data from any URL. We support pydantic schemas to make it easier for you too. Here is how to use it:
from typing import List

from pydantic import BaseModel, Field
from firecrawl.firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Shape of a single story
class ArticleSchema(BaseModel):
    title: str
    points: int
    by: str
    commentsURL: str

# max_length is the pydantic v2 name for the list-size cap (formerly max_items)
class TopArticlesSchema(BaseModel):
    top: List[ArticleSchema] = Field(..., max_length=5, description="Top 5 stories")

data = app.scrape_url('https://news.ycombinator.com', {
    'formats': ['json'],
    'jsonOptions': {
        'schema': TopArticlesSchema.model_json_schema()
    }
})
print(data["json"])
Using the Node SDK
Installation
To install the Firecrawl Node SDK, you can use npm:
With LLM extraction, you can easily extract structured data from any URL. We support Zod schemas to make it easier for you too. Here is how to use it:
import FirecrawlApp from "@mendable/firecrawl-js";
import { z } from "zod";

const app = new FirecrawlApp({
  apiKey: "fc-YOUR_API_KEY"
});

// Define schema to extract contents into
const schema = z.object({
  top: z
    .array(
      z.object({
        title: z.string(),
        points: z.number(),
        by: z.string(),
        commentsURL: z.string(),
      })
    )
    .length(5)
    .describe("Top 5 stories on Hacker News"),
});

const scrapeResult = await app.scrapeUrl("https://news.ycombinator.com", {
  jsonOptions: { extractionSchema: schema },
});

console.log(scrapeResult.data["json"]);
Open Source vs Cloud Offering
Firecrawl is open source and available under the AGPL-3.0 license.
To deliver the best possible product, we offer a hosted version of Firecrawl alongside our open-source offering. The cloud solution allows us to continuously innovate and maintain a high-quality, sustainable service for all users.
Firecrawl Cloud is available at firecrawl.dev and offers a range of features that are not available in the open source version:
Contributing
We love contributions! Please read our contributing guide before submitting a pull request. If you’d like to self-host, refer to the self-hosting guide.
It is the sole responsibility of the end users to respect websites’ policies when scraping, searching and crawling with Firecrawl. Users are advised to adhere to the applicable privacy policies and terms of use of the websites prior to initiating any scraping activities. By default, Firecrawl respects the directives specified in the websites’ robots.txt files when crawling. By utilizing Firecrawl, you expressly agree to comply with these conditions.
Contributors
License Disclaimer
This project is primarily licensed under the GNU Affero General Public License v3.0 (AGPL-3.0), as specified in the LICENSE file in the root directory of this repository. However, certain components of this project are licensed under the MIT License. Refer to the LICENSE files in these specific directories for details.
Please note:
The AGPL-3.0 license applies to all parts of the project unless otherwise specified.
The SDKs and some UI components are licensed under the MIT License. Refer to the LICENSE files in these specific directories for details.
When using or contributing to this project, ensure you comply with the appropriate license terms for the specific component you are working with.
For more details on the licensing of specific components, please refer to the LICENSE files in the respective directories or contact the project maintainers.
🔥 Firecrawl
Check out the following resources to get started:
To run locally, refer to the guide here.
API Key
To use the API, you need to sign up on Firecrawl and get an API key.
Features
Powerful Capabilities
You can find all of Firecrawl’s capabilities and how to use them in our documentation.
Crawling
Used to crawl a URL and all accessible subpages. This submits a crawl job and returns a job ID to check the status of the crawl.
The response contains the crawl job ID and the URL to check the status of the crawl.
Check Crawl Job
Used to check the status of a crawl job and get its result.
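Checking a job usually means polling until it finishes. The loop below is a minimal sketch of that pattern, not the SDK's implementation: wait_for_job is a hypothetical helper, the status strings ("scraping", "completed", "failed") are assumptions to verify against the API reference, and the HTTP call is replaced by an injected function so the example runs offline.

```python
import time
from typing import Any, Callable


def wait_for_job(
    fetch_status: Callable[[], dict[str, Any]],
    interval_seconds: float = 2.0,
    max_attempts: int = 30,
) -> dict[str, Any]:
    """Call fetch_status until the job reports a terminal state.

    The status strings are assumptions for illustration; check the API
    reference for the values the service actually returns.
    """
    for attempt in range(max_attempts):
        job = fetch_status()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError("crawl job failed")
        # Do not sleep after the final attempt.
        if attempt < max_attempts - 1:
            time.sleep(interval_seconds)
    raise TimeoutError(f"job not finished after {max_attempts} checks")


# Stand-in for a real HTTP call: reports "scraping" twice, then "completed".
fake_responses = iter([
    {"status": "scraping"},
    {"status": "scraping"},
    {"status": "completed", "data": [{"markdown": "# Example"}]},
])
result = wait_for_job(lambda: next(fake_responses), interval_seconds=0)
print(result["status"], len(result["data"]))
```

Passing the fetch function in as a parameter keeps the retry logic independent of any particular HTTP client, and the attempt cap prevents an endless loop if a job never reaches a terminal state.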
Scraping
Used to scrape a URL and get its content in the specified formats.
Response:
Map (Alpha)
Used to map a URL and get the URLs of the website. This returns most links present on the website.
Response:
Map with search
Map with the search param allows you to search for specific URLs inside a website.
The response will be an ordered list from the most relevant to the least relevant.
Extract
Get structured data from entire websites with a prompt and/or a schema.
You can extract structured data from one or multiple URLs, including wildcards:
Single page. Example: https://firecrawl.dev/some-page
Multiple pages or full domain. Example: https://firecrawl.dev/*
When you use /*, Firecrawl will automatically crawl and parse all URLs it can discover in that domain, then extract the requested data.
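The URL forms above could be assembled into a request body along these lines. This is a sketch under assumptions: the key names "urls", "prompt", and "schema" and both function names are hypothetical, and only the trailing /* convention and the "prompt and/or schema" requirement come from the text.

```python
from typing import Any, Optional


def build_extract_payload(
    urls: list[str],
    prompt: Optional[str] = None,
    schema: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a request body for an extract call.

    The "urls", "prompt", and "schema" key names are assumptions; the text
    only says that a prompt and/or a schema is supplied.
    """
    if not urls:
        raise ValueError("at least one URL is required")
    if prompt is None and schema is None:
        raise ValueError("provide a prompt, a schema, or both")

    payload: dict[str, Any] = {"urls": urls}
    if prompt is not None:
        payload["prompt"] = prompt
    if schema is not None:
        payload["schema"] = schema
    return payload


def is_wildcard(url: str) -> bool:
    """True when the URL targets a whole domain via the trailing /* form."""
    return url.endswith("/*")


targets = ["https://firecrawl.dev/some-page", "https://firecrawl.dev/*"]
payload = build_extract_payload(targets, prompt="Extract the company mission.")
print([is_wildcard(u) for u in targets])
```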
Batch Scraping Multiple URLs (New)
You can now batch scrape multiple URLs at the same time. It is very similar to how the /crawl endpoint works. It submits a batch scrape job and returns a job ID to check the status of the batch scrape.
Search
The search endpoint combines web search with Firecrawl’s scraping capabilities to return full page content for any query.
Include scrapeOptions with formats: ["markdown"] to get complete markdown content for each search result; otherwise, it defaults to returning SERP results (url, title, description).
Using Python SDK
Installing Python SDK
Crawl a website
Extracting structured data from a URL
With LLM extraction, you can easily extract structured data from any URL. We support pydantic schemas to make it easier for you too. Here is how to use it:
Using the Node SDK
Installation
To install the Firecrawl Node SDK, you can use npm:
Usage
Set your API key as an environment variable named FIRECRAWL_API_KEY or pass it as a parameter to the FirecrawlApp class.
Extracting structured data from a URL
With LLM extraction, you can easily extract structured data from any URL. We support Zod schemas to make it easier for you too. Here is how to use it: