10 of the Best Open Data Sources You Wish You Knew
Finding the perfect dataset to round out your project or story is often the most challenging and time-consuming part of the process.
Think about your most significant data projects. Where did you spend the majority of your time? Was it collecting, curating, and engineering datasets?
I recently spent a considerable amount of time exploring open data sources for research on US wildfire and forestry data to support an analysis and visualization series we're running to raise awareness of this year's troubling wildfire season.
The search led me down exciting paths, but I couldn't quite find the dataset I needed until my colleague Dr. Michael Flaxman sent me the California Forest Observatory.
It's precisely the dataset I needed for my project, which got me thinking about datasets that I've either shared or received over the years that solved problems the same way Dr. Flaxman solved my wildfire data search.
So I put out a call for the best open data sources that people at HEAVY.AI thought you should know. This list of open data sources includes searchable repositories, individual datasets of note, and emerging data platforms.
Chances are you know a few of these public data sources, but we hope you find something new and compelling to use in an upcoming project. Here are their submissions.
1. California Forest Observatory
Link: https://forestobservatory.com/
About: The California Forest Observatory is a data-driven forest monitoring system that maps wildfire hazard drivers across California, including forest structure, weather, topography, and infrastructure.
You can download canopy cover, canopy height, canopy base height, canopy bulk density, canopy layer count, ladder fuel density, and surface fuels geodata for the state by county, community, or watershed.
Additional Resources: Modeling & Monitoring Powerline Tree Strike Risk at Scale
2. OpenStreetMap
Link: https://www.openstreetmap.org
About: OpenStreetMap provides a broad range of map data maintained by a worldwide community of geographers and cartographers.
You can access roads, trails, points of interest, railways, and much more worldwide.
Geofabrik's OpenStreetMap Data Extracts are one of the easiest ways to download information for your area of interest quickly.
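If you script your downloads, a small helper can build the extract URL for you. This is a minimal sketch that assumes Geofabrik's download server keeps its current layout of `<region path>-latest.osm.pbf`; the function name and the example region path are ours, so check the Geofabrik download page for the exact path of your area before relying on it.

```python
GEOFABRIK_BASE = "https://download.geofabrik.de"  # assumed base URL of the download server


def geofabrik_extract_url(region_path: str) -> str:
    """Build the URL of the latest .osm.pbf extract for a region.

    region_path is the slash-separated path shown on the Geofabrik
    download page, e.g. "north-america/us/california" (illustrative).
    """
    path = region_path.strip().strip("/").lower()
    if not path:
        raise ValueError("region_path must not be empty")
    return f"{GEOFABRIK_BASE}/{path}-latest.osm.pbf"


if __name__ == "__main__":
    print(geofabrik_extract_url("north-america/us/california"))
```

The helper only builds a string, so you can hand the result to whatever download tool you already use.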
Additional Resources:
- Todd Mostak's complete OpenStreetMap extraction and load into OmniSci
- Analyze OpenStreetMap Data with OSMnx and OmniSci
3. Registry of Open Data on AWS
Link: https://registry.opendata.aws/
About: The Registry of Open Data on AWS has empowered laboratories, research institutions, and various other organizations to deliver open datasets to developers, startups, and enterprises worldwide since its launch in 2018.
Anyone can easily access the registry through a web interface and search for datasets with keywords or tags like flood risk, remote sensing, imagery, or human genome.
Users are encouraged to grow the adoption of the registry by contributing datasets of their own, usage examples, tutorials, or applications built on data from the registry.
Additional Resources: We recently highlighted ways users can load data into HEAVY.AI Free directly from the registry using the Immerse UI and omnisql.
4. NASA Earth Observations
Link: https://neo.sci.gsfc.nasa.gov/
About: NASA Earth Observations (NEO) offers climate and environmental data for the globe. You can browse and download satellite data from NASA's constellation of Earth Observing System satellites. Over 50 different global datasets are represented, with daily, weekly, and monthly images available in various formats.
Additional Resources: During last year's #30DayMapChallenge we used NASA Earth Observations' Chlorophyll Concentration product on day 13.
5. Google BigQuery Public Datasets
Link: https://cloud.google.com/bigquery/public-data
About: A Google BigQuery public dataset is any dataset made available to the general public through the Google Cloud Public Dataset Program.
Google hosts the data, covers the costs of storage, and offers public access to data for use in any project.
Of all the free open data sources we highlight in this post, this is the only one with a catch. You must sign up for a Google Cloud Platform account to access the data, and only the first 1 TB of data queried per month is free; after that, you are subject to query pricing.
Just be aware of the volumes you are extracting and you should be fine. There are some awesome sources to choose from including cryptocurrency exchanges, the American Community Survey, international real estate listings, and much more.
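One way to stay aware of those volumes is to keep a running tally of the bytes your queries process each month. The sketch below is only illustrative: it reads the "1 TB" figure above as 10^12 bytes, the class and method names are hypothetical, and the price you pass in is your own assumption, so check Google's current pricing page for the real allowance and rate.

```python
from dataclasses import dataclass

BYTES_PER_TB = 10**12  # assumption: "1 TB" read as a decimal terabyte


@dataclass
class MonthlyQueryBudget:
    """Running tally of bytes processed against a monthly free allowance."""

    free_bytes: int = BYTES_PER_TB
    used_bytes: int = 0

    def record(self, bytes_processed: int) -> None:
        """Add one query's processed bytes to this month's total."""
        if bytes_processed < 0:
            raise ValueError("bytes_processed must be non-negative")
        self.used_bytes += bytes_processed

    @property
    def remaining_free_bytes(self) -> int:
        return max(self.free_bytes - self.used_bytes, 0)

    @property
    def billable_bytes(self) -> int:
        return max(self.used_bytes - self.free_bytes, 0)

    def estimated_cost(self, price_per_tb: float) -> float:
        """Cost of the bytes beyond the allowance at a caller-supplied rate."""
        return self.billable_bytes / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    budget = MonthlyQueryBudget()
    budget.record(400 * 10**9)  # a 400 GB query
    budget.record(900 * 10**9)  # a 900 GB query
    print(budget.remaining_free_bytes, budget.billable_bytes)
```

Record each query's processed bytes as you go, and you'll know how much of the free allowance is left before you run the next big extract.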
Additional Resources:
- Todd Mostak highly recommends the Hacker News dataset!
- The BigQuery subreddit is an easy way to see what data is publicly available
- Other open data sources available through Google include Google Dataset Search and Google Public Data
6. Koordinates
Link: https://koordinates.com/
About: Koordinates is an emerging geospatial data management platform where you can host, manage, share, publish, and access geodata.
While Koordinates' primary product is their geodata management software, they give users the opportunity to share and access open geospatial datasets.
You can browse through thousands of geospatial data layers from around the world, from New Zealand property parcels to United States hazmat routes, using regional and publisher filters or classic search.
7. Natural Earth
Link: https://www.naturalearthdata.com/
About: A collection of public domain map datasets available in vector or raster formats and various scales that I've trusted since graduate school.
Data comes in cultural, physical, and raster categories, and users benefit from solid metadata, attribution, neatness, and overall convenience.
Natural Earth is a collaboration that involves members of the North American Cartographic Information Society (NACIS) and cartographers worldwide.
Additional Resources: During last year's #30DayMapChallenge we used Natural Earth's Global Nested Bathymetry Contour lines created from SRTM Plus on day 30.
8. Kaggle
Link: https://www.kaggle.com/datasets
About: Kaggle is a wicked cool platform for new and experienced data scientists and explorers.
You can search their massive library of open datasets, grab sample code, ask their burgeoning community questions, take part in a data competition, and learn as you go.
They have over 95,000 datasets you can browse and download on just about any topic you can conjure. You may have to sift through data with varying levels of quality, but more than likely, you'll find a gem.
My only suggestion is to check the original source of any data you plan on downloading, its collection date, and its overall fidelity before using it in your project.
9. SafeGraph Open Census Data & Neighborhood Demographics
Link: https://www.safegraph.com/open-census-data
About: SafeGraph has made an impressive name for itself in the data space these past few years. And while the majority of their data comes at a price, they do offer a spread of open census data and neighborhood demographics.
The datasets they offer for free have a clean schema, are joined with Census Block Group geometries, and include 7500+ demographic attributes (income, age, education, etc.).
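Because the data is keyed to Census Block Groups, it helps to know how a block group identifier breaks down. A standard 12-digit block group GEOID is made of a 2-digit state FIPS code, a 3-digit county code, a 6-digit tract code, and a 1-digit block group number. Here is a minimal sketch for splitting one apart; the function and type names are ours, not SafeGraph's, and the zero-padding step assumes the ID may have lost its leading zero after being read as a number.

```python
from typing import NamedTuple, Union


class BlockGroupId(NamedTuple):
    state_fips: str   # 2 digits
    county_fips: str  # 3 digits
    tract: str        # 6 digits
    block_group: str  # 1 digit


def parse_block_group_geoid(geoid: Union[str, int]) -> BlockGroupId:
    """Split a 12-digit Census Block Group GEOID into its components."""
    # Restore a leading zero that is lost when the ID is parsed as an integer.
    text = str(geoid).strip().zfill(12)
    if len(text) != 12 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a 12-digit block group GEOID: {geoid!r}")
    return BlockGroupId(text[:2], text[2:5], text[5:11], text[11])


if __name__ == "__main__":
    # Illustrative GEOID: state 06, county 075, tract 020100, block group 1
    print(parse_block_group_geoid("060750201001"))
```

Splitting the ID this way makes it easy to roll block group attributes up to the tract, county, or state level with a simple group-by.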
Additional Resources: If you are interested in some of their other datasets, check out our Retail Cross Promotion Opportunity demo!
10. Government of Canada's Open Data
Link: https://open.canada.ca/en/open-data
About: HEAVY.AI has a growing contingent of employees and users (perhaps yourself) in Canada who require reliable and accurate open datasets, and the Canadian government has you covered with its Open Data Portal.
You can search or browse through the Canadian government’s open data categorized into the following:
- Agriculture
- Economics and Industry
- Health and Safety
- Labour
- Nature and Environment
- People
- Science and Technology
- Society and Culture
- Transport
- and more
Additional Resources: This last suggestion comes from my colleague Tony Young, HEAVY.AI's Director of Customer Success and a proud resident of the Great White North.
What are your go-to datasets?
Start using HEAVY.AI for free today and share your favorite open sources of data with us on LinkedIn, Twitter, or our Community Forums. We’re always on the hunt for the best open data sources available!