Sentiment inference model for stock-related comments

This is a project by NUS ISS students Frank Cao, Gerong Zhang, Jiaqi Yao, Sikai Ni, and Yunduo Zhang.

Description

This model is roberta-base fine-tuned on 3,200,000 comments from stocktwits, each tagged 'Bullish' or 'Bearish' by the user who posted it.

Try the inference API with phrases an individual investor might post on an investment forum, for example 'red' and 'green'.

code on github

Training information

Batch size: 32
Learning rate: 2e-5

Epoch	Train loss	Validation loss	Validation accuracy
1	0.3495	0.2956	0.8679
2	0.2717	0.2235	0.9021
3	0.2360	0.1875	0.9210
4	0.2106	0.1603	0.9343

How to use

import re  # used by process_text below

from transformers import RobertaForSequenceClassification, RobertaTokenizer
from transformers import pipeline
import pandas as pd
import emoji

# the model was trained on text preprocessed as below
def process_text(texts):

  # remove URLs
  texts = re.sub(r'https?://\S+', "", texts)
  texts = re.sub(r'www.\S+', "", texts)
  # decode the HTML entity &#39; back to an apostrophe
  texts = texts.replace('&#39;', "'")
  # rewrite #tag as hashtag_tag and $TICKER as cashtag_TICKER
  texts = re.sub(r'(\#)(\S+)', r'hashtag_\2', texts)
  texts = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', texts)
  # rewrite @user as mention_user
  texts = re.sub(r'(\@)(\S+)', r'mention_\2', texts)
  # replace emoji with their text names, each followed by a space
  texts = emoji.demojize(texts, delimiters=("", " "))

  return texts.strip()
  
tokenizer_loaded = RobertaTokenizer.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')
model_loaded = RobertaForSequenceClassification.from_pretrained('zhayunduo/roberta-base-stocktwits-finetuned')

nlp = pipeline("text-classification", model=model_loaded, tokenizer=tokenizer_loaded)

sentences = pd.Series(['just buy','just sell it',
                      'entity rocket to the sky!',
                      'go down','even though it is going up, I still think it will not keep this trend in the near future'])
# sentences = list(sentences.apply(process_text))  # if input text contains https, @ or # or $ symbols, better apply preprocess to get a more accurate result
sentences = list(sentences)
results = nlp(sentences)
print(results) # 2 labels, label 0 is bearish, label 1 is bullish
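
To see what the preprocessing does to a typical post, here is a minimal sketch that reuses only the regex steps of process_text. The helper name process_text_regex_only and the sample post are made up for illustration, and the emoji.demojize step is left out so the snippet runs with the standard library alone.

```python
import re

def process_text_regex_only(text: str) -> str:
    """Regex-only subset of process_text; the emoji step is omitted (illustrative helper)."""
    # remove URLs
    text = re.sub(r'https?://\S+', "", text)
    text = re.sub(r'www.\S+', "", text)
    # decode the HTML entity &#39; back to an apostrophe
    text = text.replace('&#39;', "'")
    # rewrite #tag as hashtag_tag and $TICKER as cashtag_TICKER
    text = re.sub(r'(\#)(\S+)', r'hashtag_\2', text)
    text = re.sub(r'(\$)([A-Za-z]+)', r'cashtag_\2', text)
    # rewrite @user as mention_user
    text = re.sub(r'(\@)(\S+)', r'mention_\2', text)
    return text.strip()

# hypothetical sample post, not taken from the training data
raw = "@trader1 $AAPL is going up, don&#39;t sell! #stocks https://example.com/chart"
cleaned = process_text_regex_only(raw)
print(cleaned)
# mention_trader1 cashtag_AAPL is going up, don't sell! hashtag_stocks
```

The entity replacement has to run before the hashtag rule, otherwise the '#' inside '&#39;' would be picked up as a hashtag.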

Author:

rex zhang

Dataset size:

476.86 MB