Tung “Tommy” Tran

Ph.D. in Computer Science
University of Kentucky Alum
✉ tommy [at] tttran.net

Google Scholar dblp LinkedIn

Patent dataset

These datasets contain full-text US patents (date of publication, title, abstract, description, claims) as well as their CPC codes for the years 2010 and 2011. The documents are partitioned by year, and each year is distributed as a zip-compressed file in JSON lines format (one JSON document per line). The dataset was created by harvesting and parsing patent documents made publicly available at the USPTO website; the code used to do so can be found here.
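Since the files are zip-compressed JSON lines, they can be streamed one record at a time without extracting the archive to disk. The following is a minimal sketch, not the author's own loading code: the function name `iter_patents` is hypothetical, and it assumes each archive holds one or more UTF-8 text members with one JSON object per line. The exact key names inside each record are not documented here, so inspect the first record before relying on any field.

```python
import io
import json
import zipfile


def iter_patents(zip_path):
    """Yield one dict per non-empty JSON line from every file in a zip archive.

    Assumption: members are UTF-8 encoded JSON lines files.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            with archive.open(name) as raw:
                # Wrap the binary stream so lines are decoded lazily.
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    line = line.strip()
                    if line:
                        yield json.loads(line)
```

A typical first step would be `first = next(iter_patents(path))` followed by `print(sorted(first))` to see which fields a record actually contains, where `path` points to one of the downloaded archives.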

Dataset          File Size  Download Links
2010 US Patents  2556 MB    main, mirror
2011 US Patents  2652 MB    main, mirror

Last harvested and compiled: 10/8/2017

If you use this dataset, please consider citing the following paper:

@inproceedings{tran2017supervised,
  title={Supervised Approaches to Assign Cooperative Patent Classification (CPC) Codes to Patents},
  author={Tran, Tung and Kavuluru, Ramakanth},
  booktitle={International Conference on Mining Intelligence and Knowledge Exploration},
  pages={22--34},
  year={2017},
  organization={Springer}
}

Related Stuff

For more information on the CPC system, check out:


© Tung Tran 2021