Problem Statement: As a Data Scientist for the Marketing Department at Reddit

As a data scientist for the marketing department at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so we can use them to determine which advertisements should populate each page. Because this is a classification problem, I'll use Logistic Regression and Bayes models. Misclassifications in this case would be fairly harmless, so I will use the accuracy score, with a baseline of 63.3%, to rate success. Using TF-IDF vectorization, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the relationship-advice-scrape and dating-advice-scrape notebooks for this part.

After turning all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of this repo.

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new column, all_text
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
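
The cleaning steps above can be sketched in pandas roughly as follows. This is a minimal illustration, not the notebook's actual code: the rows are made-up toy data, and the `subreddit`, `title_word_count`, and `selftext_word_count` column names are assumptions (only `title`, `selftext`, and `all_text` come from the write-up).

```python
import pandas as pd

# Hypothetical toy rows standing in for the scraped posts.
posts = pd.DataFrame(
    {
        "subreddit": [
            "dating_advice",
            "relationship_advice",
            "dating_advice",
            "relationship_advice",
        ],
        "title": [
            "First date tips",
            "Need help with my partner",
            "Text back?",
            "Moving in together",
        ],
        "selftext": [
            "Where should we go for coffee",
            None,  # a post with no body text
            "She has not replied in two days",
            "We have been dating for a year",
        ],
    }
)

# Drop rows whose selftext is null.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into a single all_text column.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Count whitespace-separated words per post in each column.
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

# Summarize the word-count distributions for each subreddit.
summary = posts.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].describe()
print(summary)
```

Grouping by subreddit before calling `describe()` puts the two distributions side by side, which is one simple way to compare the pages.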

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always select the value that occurs most frequently, I will be right 63.3% of the time.
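
As a rough illustration of how such a baseline is computed, the sketch below takes the share of the most frequent class. The label counts are hypothetical, chosen only to reproduce the 0.633 figure; the write-up does not say which subreddit is the majority class or how many posts were scraped.

```python
from collections import Counter

# Hypothetical labels: 633 posts from one subreddit and 367 from the other.
labels = ["subreddit_a"] * 633 + ["subreddit_b"] * 367

# The baseline is the accuracy of always predicting the most frequent class.
counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)

print(majority_class, round(baseline_accuracy, 3))
```

Any model worth keeping has to beat this number on held-out data.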

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test 75 | cross val 74

Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scrapes; pretty bad score with high variance. Train 99%, test 72%

  • tried decreasing max features and the score got a lot worse
  • tried lemmatizer preprocessing instead and the test score went up to 74%

Simply increasing the amount of data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot: a min_df of 3 and an ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.
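
A minimal sketch of this setup with scikit-learn is shown below, assuming a pipeline of CountVectorizer and LogisticRegression. The corpus is a tiny made-up stand-in for the scraped posts, and the label encoding, `test_size`, `random_state`, `max_iter`, and `cv` values are assumptions; only `min_df=3`, `ngram_range=(1, 2)`, and the stratified split come from the write-up.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; each sentence is repeated so min_df=3 keeps terms.
dating = [
    "first date went well should i text her",
    "matched on the app and planning a first date",
    "how long to wait before asking for a second date",
    "nervous about my first date this weekend",
]
relationship = [
    "my boyfriend and i keep arguing about money",
    "my girlfriend wants to move in together",
    "my husband forgot our anniversary again",
    "my wife and i need better communication",
]
texts = dating * 10 + relationship * 10
y = [0] * 40 + [1] * 40  # assumed encoding: 0 = dating, 1 = relationship

# Stratifying on y keeps the class balance equal in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, y, test_size=0.25, stratify=y, random_state=42
)

# Unigrams and bigrams, ignoring terms seen in fewer than 3 documents.
model = make_pipeline(
    CountVectorizer(min_df=3, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
cv_score = cross_val_score(model, X_train, y_train, cv=5).mean()
print(train_score, test_score, cv_score)
```

Putting the vectorizer inside the pipeline means it is refit on each cross-validation fold, so vocabulary from held-out posts does not leak into training.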

I think Tfidf worked the best to decrease my overfitting due to the variance issue because:

  • I customized the stop words to remove only the ones that were actually too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them a bit more to improve all scores.
  • Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, most of my top predictive words ended up being unigrams.
  • My original list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these.
  • Gridsearch also suggested a 90% max df rate, which helped to remove oversaturated words as well.
  • Lastly, setting max features to 5000 cut my columns down to about a quarter of what they were, to focus only on the most commonly used words of what was left.

Conclusions and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely a few words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting but brought the accuracy score down. I think there is probably still room to play around with the parameters of the Tfidf Vectorizer to see if different stop words make a difference.


Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.
