FiR333

Building a Twitter and Facebook Image Crawler

ETC 2019. 1. 25. 00:37

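I built a crawler using Python's Selenium, BeautifulSoup, and Requests.

I became a fan late, so there were far too many photos to save by hand. That's why I made it, but with that many photos it took a long time to run....

My VM environment may be the problem, because errors kept popping up. Oddly, it worked fine after a reboot, then failed again, in an endless loop..;

It's a pity I couldn't grab the GIF files, but I'm short on disk space anyway, so I'll settle for the JPGs.. ^^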

[ Version Info ]

Chromedriver : 2.45

Selenium : 3.9.0

Python : 2.7.15rc1

[Twitter Crawler]

# Written for Python 2.7 and Selenium 3.9.0 (see the version info above).
import time
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

# Run Chrome headless with a desktop-sized window and a regular user agent.
options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=1920x1080')
options.add_argument("disable-gpu")
options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")

browser = webdriver.Chrome('chromedriver path', chrome_options=options)

url = 'https://twitter.com/user_id'
browser.get(url)
time.sleep(1)

body = browser.find_element_by_tag_name('body')

mins = raw_input("minutes: ")
print "* start: " + time.asctime(time.localtime(time.time()))
timeout = time.time() + 60 * int(mins)

# Keep pressing PAGE_DOWN until the requested number of minutes has passed.
while True:
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.5)
    if time.time() > timeout:
        print "* end: " + time.asctime(time.localtime(time.time()))
        break

'''
tweets = browser.find_elements_by_class_name('tweet-text')
for tweet in tweets:
    print tweet.text
'''

html = browser.page_source
soup = BeautifulSoup(html, 'lxml')

# Each photo sits in a div whose data-image-url attribute holds the original image URL.
imgs = soup.find_all('div', class_="AdaptiveMedia-photoContainer")

for img in imgs:
    link = img.get('data-image-url')
    if not link:
        # Skip containers that carry no image URL.
        continue
    pic_res = requests.get(link).content
    time.sleep(0.5)
    # Use the last path segment of the URL as the file name.
    name = link[link.rfind('/')+1:]
    # The ./storage directory must already exist; the with block closes the file.
    with open("./storage/" + name, 'wb') as f:
        f.write(pic_res)
    print name

It works by taking a time in minutes as input and pressing the PAGE_DOWN key for that long.

This is because Twitter loads more tweets the more you press PAGE_DOWN.
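The timed key-press loop can be pulled out into a small helper. This is only a rough Python 3 sketch, not the code I actually ran: press_until and its clock and sleep parameters are names I made up here so the loop can be tried without a browser.

```python
import time


def press_until(duration_seconds, press, pause=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Call press() repeatedly until duration_seconds have elapsed.

    press is any zero-argument callable (hypothetical stand-in for the
    PAGE_DOWN key press). Returns how many times press() was called.
    """
    end = clock() + duration_seconds
    count = 0
    while True:
        press()
        count += 1
        sleep(pause)
        # Same check as the crawler: stop once the deadline has passed.
        if clock() > end:
            return count
```

With the objects from the crawler, the call would probably look like press_until(60 * int(mins), lambda: body.send_keys(Keys.PAGE_DOWN)).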

When the time is up, BeautifulSoup finds every div that holds an image (class "AdaptiveMedia-photoContainer"), the original image URL is extracted from its data-image-url attribute, and the image is saved to a file.
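The extraction step can be tried on its own without a browser. The sketch below is Python 3 and swaps BeautifulSoup for the standard library's html.parser so it runs with no extra packages. PhotoURLParser, extract_photo_urls, file_name_from_url, and the sample markup are all made up for illustration, not what the crawler uses.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urlsplit


class PhotoURLParser(HTMLParser):
    """Collect data-image-url values from photo container divs."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        url = attrs.get("data-image-url")
        if "AdaptiveMedia-photoContainer" in classes and url:
            self.urls.append(url)


def extract_photo_urls(page_source):
    parser = PhotoURLParser()
    parser.feed(page_source)
    return parser.urls


def file_name_from_url(url):
    # Last path segment, ignoring any query string.
    return posixpath.basename(urlsplit(url).path)


# Hypothetical markup, shaped like the divs described above.
SAMPLE = (
    '<div class="AdaptiveMedia-photoContainer js-adaptive-photo" '
    'data-image-url="https://example.com/media/abc123.jpg"></div>'
    '<div class="other" data-image-url="https://example.com/x.png"></div>'
)

for url in extract_photo_urls(SAMPLE):
    print(file_name_from_url(url))
```

Unlike the slicing in the crawler, file_name_from_url also drops a query string if the URL has one.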

I took the Chrome options from a working program I found while searching Google.

Just change the chromedriver path and the Twitter URL, then run it.

The commented-out part prints the tweet text to the console.

[Facebook Crawler]

# Written for Python 2.7 and Selenium 3.9.0 (see the version info above).
import time
import requests
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

# Same browser setup as the Twitter crawler.
options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_argument('window-size=1920x1080')
options.add_argument("disable-gpu")
options.add_argument("user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")

browser = webdriver.Chrome('chromedriver path', chrome_options=options)

url = 'https://www.facebook.com/pg/user_id/photos/'
browser.get(url)
time.sleep(1)

body = browser.find_element_by_tag_name('body')

mins = raw_input("minutes: ")
print "* start: " + time.asctime(time.localtime(time.time()))
timeout = time.time() + 60 * int(mins)

# Keep pressing PAGE_DOWN until the requested number of minutes has passed.
while True:
    body.send_keys(Keys.PAGE_DOWN)
    time.sleep(1)
    if time.time() > timeout:
        print "* end: " + time.asctime(time.localtime(time.time()))
        break

html = browser.page_source
soup = BeautifulSoup(html, 'lxml')

# Links to the individual photo pages.
imgs = soup.find_all('a', rel="theater")
imgs.pop(0)

# Save the links to the "list" file, one per line, so a crashed run can be resumed.
f = open("list", "w")
for img in imgs:
    link = img.get('href')
    f.write(link + "\n")
f.close()

f = open("list", "r")
lines = f.readlines()
f.close()

for link in lines:
    link = link.replace("\n", "")
    req = requests.get(link)
    html = req.content
    # Find the data-ploi attribute by hand and slice out the URL it holds.
    marker = "data-ploi=\"http"
    if html.find(marker) == -1:
        continue
    dp = html.find(marker) + 11  # 11 == len('data-ploi="')
    link = html[dp:dp+html[dp:].find("\"")]
    # Turn the HTML-escaped "&amp;" back into "&".
    link = link.replace("amp;", "")
    print '[*] ' + link
    pic_res = requests.get(link).content
    # File name: the part between the last "/" and the "?".
    name = link[link.rfind('/')+1:link.find('?')]
    # The ./storage directory must already exist; the with block closes the file.
    with open("./storage/" + name, 'wb') as f:
        f.write(pic_res)
    time.sleep(1)

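On Facebook, posts show up more slowly than on Twitter when you press PAGE_DOWN.

So I wrote the code to collect the link to each photo from the account's photo album page and then fetch the image from each of those links.

I first tried running a copy of the Twitter crawler, but it kept dying partway through... heh

First, to get the photo links, I found all the a tags with rel="theater".

Indexes 0 and 1 were the profile and cover photos, I think...? So I popped them.

I saved the links from those a tags to a file named list, so that when the crawler died midway I could restart from where it died .... hehehe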
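Restarting from where it died can also be automated instead of editing the list by hand. The following Python 3 sketch is just one way it might be done, not what my crawler does: it stores the number of finished links in a second file. The progress file and every function name here (save_links, load_links, read_progress, process_with_resume) are made up for illustration.

```python
import os


def save_links(path, links):
    # One link per line, so the file can be read back line by line.
    with open(path, "w") as f:
        for link in links:
            f.write(link + "\n")


def load_links(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def read_progress(path):
    # Number of links already handled; 0 if no progress file exists yet.
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return int(f.read().strip() or 0)


def process_with_resume(list_path, progress_path, handle):
    """Run handle(link) for every link not yet marked as done."""
    links = load_links(list_path)
    start = read_progress(progress_path)
    for index in range(start, len(links)):
        handle(links[index])
        # Record progress only after the link was handled successfully.
        with open(progress_path, "w") as f:
            f.write(str(index + 1))
    return len(links) - start
```

If handle raises in the middle, the progress file still points at the failed link, so the next run picks up from there.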

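The rest of the code reads the list file and saves the photos.

For some reason BeautifulSoup didn't work here, so I located the "data-ploi" attribute, which holds the original photo URL, and sliced it out.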
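The same data-ploi lookup could probably be done with a regular expression instead of find and slicing. This is a Python 3 sketch under that assumption. PLOI_PATTERN, extract_ploi_url, and the sample markup are made up here, and html.unescape is swapped in for the replace("amp;", "") step.

```python
import html
import re

# Capture the http(s) URL inside data-ploi="...".
PLOI_PATTERN = re.compile(r'data-ploi="(https?://[^"]+)"')


def extract_ploi_url(page_source):
    """Return the unescaped data-ploi URL, or None if the page has none."""
    match = PLOI_PATTERN.search(page_source)
    if match is None:
        return None
    # The attribute value is HTML-escaped, so "&amp;" becomes "&" again.
    return html.unescape(match.group(1))


# Hypothetical markup, shaped like the attribute described above.
SAMPLE = ('<img class="spotlight" '
          'data-ploi="https://example.com/photos/12345_n.jpg?oh=abc&amp;oe=def" '
          'src="x">')

print(extract_ploi_url(SAMPLE))
```

Returning None when nothing matches plays the same role as the continue in the crawler loop.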

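I also made one for Instagram, but it kept extracting 600 * 600 photos instead of the original size, so I deleted it. A Google search turned up a good tool, and with that I could save the photos easily. ^^

Strangely, it didn't work on Linux but did on Windows... o.o..? It's probably just me doing it wrong..

Proud of the result *^__________________________^*

Let's keep up the Golcha fandom

Golden Child, fighting!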
