Title: | A Collection of Small Text Corpora of Interesting Data |
---|---|
Description: | A collection of small text corpora of interesting data. It contains all data sets from 'dariusk/corpora'. Some examples: names of animals: birds, dinosaurs, dogs; foods: beer categories, pizza toppings; geography: English towns, rivers, oceans; humans: authors, US presidents, occupations; science: elements, planets; words: adjectives, verbs, proverbs, US president quotes. |
Authors: | Darius Kazemi, Cole Willsea, Serin Delaunay, Karl Swedberg, Matthew Rothenberg, Greg Kennedy, Nathaniel Mitchell, Javier Arce, Mark Sample, Parker Higgins, Allison Parrish, Matthew Hokanson, Aaron Marriner, Casey Kolderup, Michael Paulukonis, Neil Freeman, nathan lachenmyer, Brett O'Connor, Christian Leon Christensen, David Edgar, Greg Borenstein, Jeffery Bennett, Kris Baillargeon, M. Nowak, Peter Organisciak, Rachel White, Tod Robbins, John Wiseman, Alex Fox, Alice Maz, Becca Ricks, Chris Spurgeon, Colin Mitchell, David Whitten, Mary Dickson Diaz, Michael R. Bernstein, Mike Watson, Patrick Rodriguez, Rebecca Sherman, Rebecca Turner, Ross Barclay, Ross Binden, Ryan Freebern, Will Hankinson, Stefan Bohacek, Justin Alford, Brian Detweiler, Ed Lea, John Ohno, Daniel McNally, Sean May, Tariq Ali, shubham kumar, adam malantonio, Alan Hussey, Amanda Visconti, Andreas Fuchs, Andy Craze, Andy Dayton, Ashur Cabrera, Austin Davis-Richardson, Ben Williams, Brian Chitester, Brian Gawalt, Brian Jones, Casey Olson, Chad Nelson, Cliff Rodgers, Cristian Rivas Gómez, Dan Sumption, Edward Loveall, Elijah Cobb, Garrett Miller, Grant Williamson, Ian McCowan, Jacob Fauber, Jay Mahabal, Jeoff Villanueva, Jesse Spielman, Joe Mahoney, Jordan Killpack, Josh Leong, Kay Belardinelli, K Adam White, Kristian Wichmann, Kyle McDonald, Liam Cooke, Marcos Wright-Kuhns, Mark Wunsch, Matt Beiswenger, Matthew McVickar, Matthew Molnar, Max Bittker, Michael Dewberry, Nathan Black, Noah Kantrowitz, Noah Swartz, Ranjit Bhatnagar, Ray Martinez, Rob Huzzey, Ryan Giglio, Sabareesh Iyer, Sam Raker, Tia Esguerra, Utsav Chadha, Vincent Bruijn, Will Thompson, Zac Moody, aarón montoya-moraga, Alex Miller, Delacannon, Scott Lieber, Pace Ricciardelli, Ruta Kruliauskaite, Scott Grant |
Maintainer: | Gábor Csárdi <[email protected]> |
License: | CC0 |
Version: | 2.0.0 |
Built: | 2024-12-26 02:40:21 UTC |
Source: | https://github.com/gaborcsardi/rcorpora |
List data set categories in the corpora package
categories()
Character vector of category names.
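As a minimal sketch, assuming the package is installed under the name rcorpora (taken from the Source URL above), the returned vector can be split into top-level and nested categories. The helper top_level_categories() is hypothetical and not part of the package; it relies only on nested category names containing a "/".

```r
# Hypothetical helper (not part of the package): keep only category names
# that contain no "/", i.e. drop nested categories such as "words/emoji".
top_level_categories <- function(cats) {
  cats[!grepl("/", cats, fixed = TRUE)]
}

# Only touch the package if it is installed; the package name "rcorpora"
# is an assumption based on the Source URL.
if (requireNamespace("rcorpora", quietly = TRUE)) {
  cats <- rcorpora::categories()
  print(top_level_categories(cats))
}
```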
corpora is a collection of small corpora of interesting data for the creation of bots and similar stuff.
corpora(which, category)
which |
The data set to load, a string. If not given, then all data sets in the package are listed. |
category |
If given, |
This project is a collection of static corpora (plural of "corpus") that are potentially useful in the creation of weird internet stuff. I've found that, as a creator, sometimes I am making something that needs access to a lot of adjectives, but not necessarily every adjective in the English language. So for the last year I've been copy/pasting an adjs.json file from project to project. This is kind of awful, so I'm hoping that this project will at least help me keep everything in one place.
I would like this to help with rapid prototyping of projects. For example: you might use nouns.json to start with, just to see if an idea you had was any good. Once you've built the project quickly around the nouns collection, you can then rip it out and replace it with a more complex or exhaustive data source.
I'm also hoping that this can be used as a teaching tool: maybe someone has three hours to teach how to make Twitter bots. That doesn't give the student much time to find/scrape/clean/parse interesting data. My hope is that students can be pointed to this project and they can pick and choose different interesting data sources to meld together for the creation of prototypes.
See https://github.com/dariusk/corpora
A data frame containing the data set (if which is given), or a character vector of data set names.
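The shape of a loaded data set may vary from one corpus to the next, so it is worth inspecting the result before indexing into it. The sketch below assumes the package is installed as rcorpora and uses a hypothetical helper, character_fields(), that is not part of the package; it keeps only the character-valued components of whatever corpora() returns.

```r
# Hypothetical helper (not part of the package): keep the components of a
# list or data frame that hold character data.
character_fields <- function(ds) {
  ds[vapply(ds, is.character, logical(1))]
}

if (requireNamespace("rcorpora", quietly = TRUE)) {
  # List the data sets in one category, then load one by name.
  print(rcorpora::corpora(category = "animals"))
  toppings <- rcorpora::corpora("foods/pizzaToppings")

  # Inspect the structure first rather than assuming field names.
  str(toppings, max.level = 1)

  # Draw one random entry from each non-empty character field.
  for (field in character_fields(toppings)) {
    if (length(field) > 0) print(sample(field, 1))
  }
}
```

Filtering on type rather than on a fixed field name keeps the sketch usable across data sets whose components are named differently.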
animals
archetypes
architecture
art
colors
corporations
divination
film-tv
foods
games
games/bannedGames
games/bannedGames/argentina
games/bannedGames/brazil
games/bannedGames/china
games/bannedGames/denmark
geography
governments
humans
instructions
materials
mathematics
medicine
music
mythology
objects
plants
religion
science
societies_and_groups
societies_and_groups/designated_terrorist_groups
societies_and_groups/fraternities
sports
sports/football
technology
transportation
travel
words
words/emoji
words/literature
words/stopwords
words/word_clues
Birds of Antarctica, grouped by family Source: https://en.wikipedia.org/wiki/List_of_birds_of_Antarctica
Birds of North America, grouped by family Source: http://listing.aba.org/aba-checklist/
Collateral adjectives for animals.
A list of dinosaurs.
1000 popular dog names from the New York City Department of Health's dog licensing data. Names are roughly in order, but that may not be totally reliable.
A list of dog breeds.
Artifact archetypes.
Common character archetypes.
Archetypal events.
Setting and location archetypes.
Ways to enter or exit a place.
Different kinds of rooms
A list of modernist art isms.
List of Crayola crayon standard colors
List of assorted paint colors from various brands.
The top 200 most popular palettes on colourlovers.com
List of named HTML colors
The 954 most common RGB monitor colors, as defined by several hundred thousand participants in the xkcd color name survey.
A list of car manufacturers.
Corporations of the Dow Jones Industrial Average
The 2014 Fortune 500 list
A list of all industries on LinkedIn, as of May 21, 2013 Source: http://robertwdempsey.com/liindustries
Corporations of the NASDAQ 100
A list of newspapers scraped in early 2013.
Tarot card interpretations, from Mark McElroy's _A Guide to Tarot Meanings_ (http://www.madebymark.com/a-guide-to-tarot-card-meanings/)
Zodiac signs and associated information, both Western and Eastern. Source: https://en.wikipedia.org/wiki/Astrological_sign
Game of Thrones Houses
Netflix Movie Categories.
A bunch of movies, mostly Best Picture winners or nominees, scraped from the web.
1000 entries from the list of TV shows at http://en.wikipedia.org/wiki/List_of_television_programs_by_name
The 1000 most popular apple cultivars in the USDA's Pomological Watercolor collection.
Beers with the 100 lowest scores on BeerAdvocate, adapted from https://www.beeradvocate.com/lists/bottom/
A list of beer categories.
A list of beer styles.
A list of classic breads and sweet pastries.
A list of recipe instructions.
A list of condiments
A list of curds, cheeses, and other fermented dairy products
A list of fruits.
A list of herbs and spices, and mixtures of the two.
Capsicum cultivars (hot peppers)
Cocktails recognized by the International Bartenders Association for use in the World Cocktail Competition.
A list of the top 1000 most appearing menu items from the 1850s to today from the New York Public Library's "What's on the menu?" project. Please credit The New York Public Library as source on any applications or publications. http://menus.nypl.org/data
A list of pizza toppings.
A list of sandwiches.
A list of sausages
A list of scotch whiskies
types of tea
Approximate cooking times for various vegetables Source: http://recipes.howstuffworks.com/tools-and-techniques/how-to-cook-vegetables24.htm
A list of vegetables.
A list of words commonly used to describe wine.
A list of video games banned in Argentina
A list of video games banned in Brazil
A list of video games banned in China.
A list of video games banned in Denmark
Characters, rooms and weapons from the board game Cluedo / Clue.
Organized components from the Dark Souls III message system
A sampling of 1000 Jeopardy questions and metadata. For the full dataset, see http://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
Source: https://github.com/UberGames/iPokedex-DB
Tile distribution and points for the English-language edition of Scrabble
Street Fighter II fighting moves
Pie categories and colors from Trivial Pursuit
A list of professional wrestling moves
A list of Canadian provinces and territories.
Top 100 Canadian municipalities by 2011 population Source: https://en.wikipedia.org/wiki/List_of_the_100_largest_municipalities_in_Canada_by_population
A list of countries.
A list of countries and their respective capitals.
Two lists: one for English towns, one for English cities.
Japanese regions and prefectures.
London Underground stations, with their lines and Travelcard zones Source: https://en.wikipedia.org/wiki/List_of_London_Underground_stations
A list of nationalities. Source: https://www.gov.uk/government/publications/nationalities/list-of-nationalities
Top Norwegian Cities by 2017 population Source: Norway Population 2017 (Demographics, Maps, Graphs), http://worldpopulationreview.com/countries/norway-population
Neighborhoods of New York City and their corresponding ZIP codes. Normal ZIP code caveats apply. Source: Compiled by United Health Fund and distributed by the New York State Department of Health: https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm
A list of oceans and seas. Source: http://en.wikipedia.org/wiki/List_of_seas
A list of rivers. Source: http://en.wikipedia.org/wiki/List_of_rivers_by_length
San Francisco neighborhoods and their locations
IATA and ICAO airport codes for the primary commercial airports in each state.
Top 1000 U.S. cities by population (2016 estimates) Source: US Census American Community Survey 2016 5-year Data
U.S. Counties by State Source: https://en.wikipedia.org/wiki/List_of_counties_by_U.S._state
U.S. Metropolitan, Micropolitan and Combined Statistical Areas with 2016 population estimates Source: US Census American Community Survey 2016 5-year Data
U.S. State Capitals Source: Wikipedia: List of U.S. state capitals
Venues organized by category. Source: https://developer.foursquare.com/categorytree
A list of regional and local winds and weather phenomena. Source: https://en.wikipedia.org/wiki/List_of_local_winds, http://www.ggweather.com/windsoftheworld.htm
This is a list of government surveillance projects and related databases throughout the world. Source: Data found here: https://en.wikipedia.org/wiki/List_of_government_mass_surveillance_projects
A list of NSA project code names. Source: All data here is from https://docs.google.com/spreadsheets/d/1Uc1hrGqIweF0rgJ1HCbmT_0w9CYCCwZTWBGOwydscqE/htmlview?sle=true&id=1590301345#
A list of UK political parties. Source: http://www.electoralcommission.org.uk/ export on 8th May 2015
A list of federal agencies. Source: This data was sourced from the GSA's list of .gov domains https://github.com/GSA/data/blob/gh-pages/dotgov-domains/2014-12-01-federal.csv
Code names for US Military Operations Source: All names from the scraped pages of http://www.designation-systems.net/usmilav/codenames.html
All individuals who filed a Statement of Candidacy with the FEC to register as a presidential candidate in the 2016 United States election.
Activity category codes used by the US Bureau of Labor Statistics in its American Time Use Survey. Categories either come with a set of example activities, or are standalone 'miscellaneous' categories denoted 'not elsewhere classified'. Source: https://www.bls.gov/tus/lexicons.htm
A list of common human body parts.
A bunch of British actors.
Celebrities
A list of adjectives for describing people, taken from www.enchantedlearning.com/wordlist/adjectivesforpeople.shtml
English honorifics.
Famous duos
First names of men and women, pulled from the US Census for the 2000s.
Last names of people, pulled from the US Census for the 2000s.
A list of words that naturally complete the phrase 'They were feeling...'.
First names of boys, pulled from Statistics Norway 2015. Sorted from high to low distribution.
First names of girls, pulled from Statistics Norway 2015. Sorted from high to low distribution.
Last names of people, pulled from Statistics Norway 2015. Sorted from high to low distribution.
A list of occupations (jobs that people might have).
Prefixes taken from a form on an airline website.
A bunch of rich people from a Forbes listicle, including the source article, img, and name
List of particularly famous scientists
A list of common Spanish first names of men and women. Source: https://github.com/olea/lemarios
A list of common Spanish last names. Source: https://github.com/olea/lemarios
Deceased drummers from the fictional rock band Spinal Tap, taken from Wikipedia.
Suffixes taken from a form on an airline website.
Third person personal pronouns with case
Character names from Tolkien's Middle Earth, from https://en.wikipedia.org/wiki/List_of_Middle-earth_characters
Copy of JSON retrieved from https://www.govtrack.us/api/v2/role?role_type=president. The ID here matches the one in the corpora/data/words/us_president_quotes.json file
A bunch of WWE wrestlers' nicknames
A list of laundry care instructions
abridged body fluids
building materials
carbon allotropes
decorative stones
fabrics
fibers
A list of the names of materials commonly used as gemstones Source: https://en.wikipedia.org/wiki/List_of_gemstone_species
layperson metals
metals
natural materials
packaging
plastic brands
sculpture materials
technical fabrics
The first 1000 numbers in the Fibonacci Sequence
The first 1000 prime numbers.
The first 1000 prime numbers in binary.
A list of trigonometric functions, formulas, equations, etc.
International Statistical Classification of Diseases and Related Health Problems, 10th revision Source: http://www.cdc.gov/nchs/icd/icd10cm.htm
A list of generic pharmaceutical drug name stems. Hyphens indicate whether a stem appears at the beginning, middle, or end of the name. Source: http://druginfo.nlm.nih.gov/drugportal/jsp/drugportal/DrugNameGenericStems.jsp
A list of pharmaceutical drug names Source: The United States National Library of Medicine, http://druginfo.nlm.nih.gov/drugportal/
A partial list of the hospitals in the United States Source: Wikipedia - List of Hospitals in the United States, https://en.wikipedia.org/wiki/Lists_of_hospitals_in_the_United_States
A list of guitar manufacturers Source: https://en.wikipedia.org/wiki/List_of_guitar_manufacturers
Bands that have opened for Tool. You must be really dedicated to your music if you are willing to play before Tool fans.
a list of women classical guitarists Source: https://en.wikipedia.org/wiki/List_of_women_classical_guitarists
A list of musical genres taken from Wikipedia article titles.
Actors and the named characters played by them in the Original Broadway Cast recording of Hamilton: An American Musical. Actors who played multiple characters are listed multiple times. Source: https://en.wikipedia.org/wiki/Hamilton_(musical)#Principal_roles_and_major_casts
Musical Instruments
Music videos broadcast on MTV's first day Source: https://en.wikipedia.org/wiki/First_music_videos_aired_on_MTV
Artists who have been added to the Rock and Roll Hall of Fame along with their year of induction Source: https://en.wikipedia.org/wiki/List_of_Rock_and_Roll_Hall_of_Fame_inductees
Every rapper that's ever made the XXL Annual Freshman Cover
Gods and goddesses from Greek myth
Monsters from Greek myth
Titans from Greek myth
Hebrew names of God used in the Old Testament Bible
Deities and supernatural creatures from the works of Lovecraft and the Cthulhu mythos.
A list of monsters and other mythic creatures
Gods and goddesses of Norse and Germanic myth
List of clothing types
Winners in the Corpora Brackets, from https://twitter.com/corporabrackets
List of household objects
420 popular strains of cannabis
List of plants by common name Source: https://en.wikipedia.org/wiki/List_of_plants_by_common_name
Analogous objects for various hail sizes, adapted from http://www.spc.noaa.gov/misc/tables/hailsize.htm
List of names of the first 1000 numbered minor planets
Planets (including dwarf planets as recognized by the IAU) that orbit the Sun, with their natural satellites.
A list of phrases describing weather conditions. This list includes all possible phrases that may be provided by the US National Weather Service's feeds of current weather conditions. Source: http://w1.weather.gov/xml/current_obs/weather.php
Current (as of November 2016) teams in the EPL (English Premier League) and where they play
Teams in the Spanish Primera División, La Liga (2017-18) with their details
Teams in the Italian First División, Serie A (2017-18) with their details
Current (as of 2016) Major League Baseball teams and where they play
NBA MVP award winners 1956-2017
Current (as of 2016) teams in the NBA and where they play
Current (as of 2016) teams in the NFL and where they play
Current (as of 2016) teams in the NHL and where they play
Olympic Games with host city, host nation, olympiad number (different for winter and summer), year, start date, end date, countries participating, athletes participating, and number of events. Source: Compiled from information on Olympics.org
A list of home appliances
names of technologies related to computer science
A list (ooh!) of firework effects (aah!)
weapons used in mass shootings in the U.S.A.
A list of knot names.
a list of LISP dialects
new or emerging technologies
Photo sharing websites
Social networking websites
Video hosting websites
A list of English adjectives.
Closed pairs in English, i.e. both words rhyme with each other and only with each other. From https://en.wikipedia.org/wiki/List_of_closed_pairs_of_English_rhyming_words
Common English words.
A partial list of English compound words.
confusing or misleading headlines
Commonly mistaken English phrases most likely caused by hearing them rather than reading them (eggcorns) Source: Most of the examples come from http://eggcorns.lascribe.net/
A general corpus of cute kaomoji.
All the Unicode emoji.
a list of encouraging words to tell someone about something they created
'Ergative' verbs in English can be used both transitively and intransitively. Source: Curated from https://en.wiktionary.org/wiki/Category:English_ergative_verbs
Common expletives and spelling variants used in internet comments.
The Harvard sentences are a collection of sample phrases that are used for standardized testing of Voice over IP, cellular, and other telephone systems. They are phonetically balanced sentences that use specific phonemes at the same frequency they appear in English. (description from https://en.wikipedia.org/wiki/Harvard_sentences). The data represents a version with minor typos removed.
a list of exclamatory words and expressions from http://www.enchantedlearning.com/wordlist/interjections.shtml
List of names from the novel Infinite Jest by David Foster Wallace
H.P. Lovecraft's favorite words, from http://arkhamarchivist.com/wordcount-lovecraft-favorite-words/
Mr Men and Little Miss characters Source: http://www.mrmen.com
Phrases coined by Shakespeare, from http://www.pathguy.com/shakeswo.htm
Shakespeare's sonnets.
Words coined by Shakespeare, from http://www.pathguy.com/shakeswo.htm
A list of English nouns.
Words of wisdom by Oprah Winfrey
List of personal nouns in the 1890 Webster's Unabridged Dictionary. Assembled by Cory Taylor from Project Gutenberg's HTML edition of the dictionary: http://www.gutenberg.org/ebooks/673 Source: https://github.com/coryandrewtaylor/Personal-Nouns
A list of English prepositions, sourced from Wikipedia.
A list of proverbs sourced from http://tww.id.au/proverbs/proverbs.html
Resume action words Source: http://careercenter.umich.edu/article/resume-action-words
English words for which there is no perfect rhyme, taken from https://en.wikipedia.org/wiki/List_of_English_words_without_rhymes
A list of Harry Potter spells and descriptions
A list of states of drunkenness.
Arabic stop words
Arabic stop words
Czech stop words
Danish stop words
German stop words
English stop words
Spanish stop words
Finnish stop words
French stop words
Greek stop words
Italian stop words
Japanese stop words
Latvian stop words
Dutch stop words
Norwegian stop words
Polish stop words
Portuguese stop words
Russian stop words
Slovak stop words
Swedish stop words
Turkish stop words
Do you know the feeling when you repeat some word many times and it starts to sound weird? Below is the list of some of the strangest sounding words that people submitted during my Intro to Computational Media Class at ITP, NYU.
A list of units of time ordered by magnitude, both formal and colloquial.
A list of quotes from US Presidents from http://bit.ly/1hsAYQT. ID matches up with https://govtrack.us API results.
A list of English verbs.
a list of common 5-letter words followed by crossword/thesaurus-style hints for that word
a list of common 4-letter words followed by crossword/thesaurus-style hints for that word
a list of common 6-letter words followed by crossword/thesaurus-style hints for that word
corpora()
corpora(category = "animals")
corpora("foods/pizzaToppings")