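It is published as a dataset on Kaggle, the site well known for its data analysis competitions.
Roughly 230,000 jokes are collected in a CSV file with the columns ["id", "phrase"], and just browsing through them is pretty fun (lol).
The dataset would lend itself to more applications if it also included information about the situations in which each phrase is used. Even so, some people have published notebooks analyzing it on the Kaggle project page.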
Kaggle (published February 2017): Generating humor is a complex task in the domain of machine learning, and it requires the models to understand the deep semantic meaning of a joke in order to generate new ones. Such problems, however, are difficult to solve due to a number of reasons, one of which is the lack of a database that gives an elaborate list of jokes. Thus, a large corpus of over 0.2 million jokes has been collected by scraping several websites containing funny and short jokes.
This dataset is in the form of a csv file containing 231,657 jokes. Length of jokes ranges from 10 to 200 characters. Each line in the file contains a unique ID and joke.
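To get a feel for the file, something like the following could be used to load it and check the figures quoted above (number of jokes, shortest and longest length). This is only a minimal sketch using Python's standard `csv` module. The function names `load_jokes` and `summarize` are made up for illustration, and the column name `phrase` is an assumption taken from the description earlier in this post, so adjust it if the header in your download differs.

```python
import csv
from typing import Dict, List


def load_jokes(path: str, text_column: str = "phrase") -> List[str]:
    """Read the joke text from a CSV file that has a header row.

    `text_column` is an assumption: pass the actual header name
    if the downloaded file uses a different one.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        return [row[text_column] for row in reader]


def summarize(jokes: List[str]) -> Dict[str, int]:
    """Return the joke count and the shortest/longest length in characters."""
    if not jokes:
        return {"count": 0, "min_len": 0, "max_len": 0}
    lengths = [len(joke) for joke in jokes]
    return {
        "count": len(jokes),
        "min_len": min(lengths),
        "max_len": max(lengths),
    }
```

Calling `summarize(load_jokes("shortjokes.csv"))` (the file name here is also an assumption) should then give a count and length range to compare with the numbers in the description.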